Google Sorts 1 Petabyte In 6 Hours
krewemaynard writes "Google has announced that they were able to sort one petabyte of data in 6 hours and 2 minutes across 4,000 computers. According to the Google Blog, '... to put this amount in perspective, it is 12 times the amount of archived web data in the US Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.' The technology making this possible is MapReduce 'a programming model and an associated implementation for processing and generating large data sets.' We discussed it a few months ago. Google has also posted a video from their Technology RoundTable discussing MapReduce."
for knowing how important the Library of Congress metric is to us nerds!
Yay! We finally have unit conversion from 1 LoC to bytes! So...20 PB = 6LoC, means that 1 LoC = 3,333... PB :)
Consider a data set of two numbers, each .5 petabyte big. It should only take a few minutes to sort them and there's even a 50% chance the data is already sorted.
Sorts have been parallelized and distributed for decades. It would be interesting to benchmark Google's approach against SyncSort. SyncSort is parallel and distributed, and has been heavily optimized for exactly such jobs. Using map/reduce will work, but there are better approaches to sorting.
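For anyone who hasn't seen one, the general shape of a distributed sort fits in a few lines. This is only a single-process sketch of range partitioning (sample the keys, pick splitters, bucket the records, sort each bucket on its own, concatenate). It is not Google's or SyncSort's actual method, and the name `partitioned_sort` and its parameters are made up for the example.

```python
import bisect
import random


def partitioned_sort(records, num_workers=4, sample_size=100, seed=0):
    """Range-partition records, sort each partition, and concatenate.

    Hypothetical sketch: each bucket stands in for one machine's share.
    """
    if not records:
        return []
    rng = random.Random(seed)
    # Sample keys to estimate the distribution and choose splitters.
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    step = len(sample) / num_workers
    splitters = [sample[int(i * step)] for i in range(1, num_workers)]
    # Route every record to the bucket whose key range contains it.
    buckets = [[] for _ in range(num_workers)]
    for rec in records:
        buckets[bisect.bisect_right(splitters, rec)].append(rec)
    # Each bucket could be sorted on a separate machine; buckets are
    # already ordered relative to each other, so concatenation suffices.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result
```

In a real cluster the expensive part is moving records between machines during the bucketing step, which this toy version does not model at all.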
As information continues to pile up behind the corporate firewall, companies and executives are fast recognizing that effective findability is more than a nice-to-have -- it's a must-have for their business. In fact, in a recent survey by AIIM, 62% of respondents saw findability as "imperative or significant" to their overall business goals and success, while only 5% reported that it wasn't a factor. Findability is a complex problem, and our goal is to provide businesses with a simple solution. That's why we've put together 'Enterprise Findability Without the Complexity' - a look into our philosophy and approach to search for businesses. We've noticed that approaches to findability can vary dramatically, which can have a significant impact on subsequent results. For instance, a traditional architecture, as demonstrated in this video, might include a plethora of servers, such as front-end web servers, index servers, query servers, database servers, and SAN storage. Not to mention load balancing servers, identity servers, disaster recovery servers, patch deployment servers, and volume license management servers. What a mouthful! On the other hand, there is the appliance-based model - i.e., one box that does it all. The Google Search Appliance can search 10 million documents with just one box, and pull information together from across a business - whether it lives in a database, intranet, business application or content management system. Not to mention it looks pretty snazzy too. You can read the full document here. We look forward to hearing your thoughts.
I will be able to catalog my pr0n in my lifetime:
Blondes, Brunettes, Red heads, Beastial^H^H^H^H^H "Other"
No sig for you!!
It looks like Google saw Yahoo crowing about winning the 1 TB sort contest using Hadoop and decided to one up them!
Let's see if Yahoo responds!
One quadrillion bytes, or 1 million gigabytes.
How big are the fields being sorted? Is it an exchange sort or a reference sort?
It is probably very impressive, but without a LOT of details, it is hard to know.
It's not enough to sort by blond, black, gay, scat, etc. Some categories are a combination that don't belong in a hierarchy.
That is where tagging comes in. Sorting can be done on-the-fly, with no one category intrinsically more important.
I suggest you read Slashdot
Finally... A system with enough power to run Vista efficiently.
Not a big deal, that's just the data they have on you.
lol: You see no door there!
Here's the definition of "Flamebait" from the Slashdot FAQ:
Now, someone please tell me how the parent post fits these criteria. I dare the moderator to explain himself or herself. You won't, of course, because that would require balls and the ability to admit that you made a mistake. Y'know, things that respectable people have.
Posted anonymously because I fully expect that, instead of understanding that preventing/correcting such blatant incompetence is the best way to avoid rants like this one, the other mods will instead play shoot-the-messenger and take out their impotent frustrations on me either because I pointed out this stupidity or because I wasn't care-bear nice about it (because that's so much more important than truth, right?).
As memory gets cheaper and I can store more locally, what I really need to know is whether something is unique or new to me. I can read Frits P0st a million times and never get tired of it. There was a very good article on Slashdot the other day and it got over 2000 comments, some of which were very insightful and useful. I need a way to know for myself what is new to me. It would be nice if the browser interacted more with Google to help me with that. I just looked, and RTFM is indexed 4.5 million times, which of course includes xkcd#293, and that is really all I need to know.
That's a lot of computing power to use just to get 4,000,000,000,000,000 0s and 4,000,000,000,000,000 1s.
872835240
Can we convert that to number of bad car analogies?
...fancy doing my mp3 collection?
Operation Guillotine is in effect.
First of all, this isn't a straight up "Libraries of Congress" (better known and mentioned in prior posts as a LoC). It's the web archiving arm of the LoC. I call for the coining of a new term, WASoLoC (Web Archival System of Library of Congress), which can be defined as X * Y^Z = 1 WASoLoC, where X is some medium that people can relate to (books, web pages, documents, tacos, water, etc.), Y is a volume (Libraries, Internets, Encyclopedias, end to end from A to B, swimming pools, etc.) and Z is some number that marketing drones come up with because it makes them happy in their pants.
Honestly, how am I supposed to know what "...the amount of archived web data in the US Library of Congress as of May 2008" looks like!? I've been to the Library of Congress, I've seen it, it's a metric shit-ton of books (1 shit-ton = Shit * assloads^fricking lots), but I have no clue what the LoC is archiving, what rate they're going at it, or what the volume of it is.
Is it sad that I am more likely to recognize you and your posts by your sig than your name or UID?
That must have taken a lot of monkeys.
You look forward to hearing our thoughts, do you? Your blog says "Your comment has been saved and will be visible after blog owner approval." I hope you enjoy what's in your approval queue you fucking spammer!
Good.
They clearly have the ability to respond to emergencies. And this puts it out there that they can...
e.g.:
1) Foot-and-mouth outbreak in cattle
2) A supplement to census data
3) Finding information on dissidents/traitors (bloggers)
In post Patriot Act America, the library books scan you.
cia / nsa / Hss homeland security be happy then
and thanks to whomever, I can't post here
doesn't even deserve an email about it.
been longer than 24 hrs
and why is my comment about some jerks replacing Ubuntu with Solaris NOT a WASTE of time
if ya want Ubuntu with Ubuntu apps
makes more sense to get the Ubuntu Linux kernel than some other stupid OS
and your system, due to it being hard to read, means that I'M NOT HUMAN????
With a little bit of Excel: if it takes 4,000 servers 362 minutes to sort 1 PB, it would take 1,440 minutes (24 hours) on 20,111.11 servers to sort 20 PB (if plain sorting were all they were doing). And as a side note, from their numbers, one of their servers can process 741.5 MB per minute!
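The spreadsheet arithmetic is easy to reproduce. A quick sketch, assuming binary units (1 PB = 2**50 bytes, 1 MB = 2**20 bytes), which is what reproduces the 741.5 figure, and assuming the work scales linearly with data size, which sorting does not strictly do:

```python
# Figures from the announcement: 1 PB sorted in 6 h 2 min on 4,000 machines.
MACHINES = 4000
MINUTES = 6 * 60 + 2          # 362 minutes
MB_PER_PB = 2 ** 30           # assumption: binary units (2**50 / 2**20)

machine_minutes_per_pb = MACHINES * MINUTES            # 1,448,000
mb_per_machine_minute = MB_PER_PB / machine_minutes_per_pb

# Machines needed to push 20 PB through in one day (1,440 minutes),
# assuming the job scales linearly.
machines_for_20pb_per_day = 20 * machine_minutes_per_pb / (24 * 60)

print(f"{mb_per_machine_minute:.1f} MB per machine per minute")
print(f"{machines_for_20pb_per_day:.2f} machines for 20 PB per day")
```

With decimal units (1 PB = 10**15 bytes) the per-machine figure comes out lower, at roughly 690 MB per minute.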
I make this about 48 GB/s. My hard drive manages about 20 MB/s, even my mid-range RAM manages only ~6.4 GB/s, and top-end RAM will reach only ~13 GB/s (according to wiki). So even ignoring the ability to process that much data in that time, the ability to simply move that much data is quite impressive (at time of print; may not hold one year down the line).
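For anyone checking the aggregate figure, here is a small sketch. It assumes binary units (1 PB = 2**20 GB, 1 GB = 1024 MB) and counts only a single pass over the data, whereas a real sort reads and writes it more than once:

```python
# Assumption: binary units, i.e. 1 PB = 2**20 GB and 1 GB = 1024 MB.
ELAPSED_SECONDS = (6 * 60 + 2) * 60        # 6 h 2 min = 21,720 s
GB_PER_PB = 2 ** 20

aggregate_gb_per_s = GB_PER_PB / ELAPSED_SECONDS

# How many 20 MB/s drives (the figure quoted above) it would take just to
# stream the data once at that rate.
DRIVE_MB_PER_S = 20
drives_needed = aggregate_gb_per_s * 1024 / DRIVE_MB_PER_S

print(f"aggregate: {aggregate_gb_per_s:.1f} GB/s")
print(f"drives at {DRIVE_MB_PER_S} MB/s: {drives_needed:.0f}")
```

Spread over 4,000 machines, that works out to roughly 12 MB/s each, which is why the aggregate number is less outlandish than it first looks.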
That's a lot of data...
...that Google hasn't implemented the Libraries of Congress metric into their auto-calculator.
C'mon Google, get on the ball(s)!
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs
bubblesort?
Just like I measure my distance to work (452.75 football fields), I measure the data on my computer by Libraries of Congress?
Today from Google, the god of all things and doer of all things good in the universe, many millions of dollars in computer equipment were able to sort lots of things, in about the amount of time you would think it would take for millions of dollars of equipment to sort things.
In other news, a woodchuck was found chucking wood as fast as a woodchuck could chuck wood.
Congrats Google, you have a HUGE data set, and an even bigger wallet.
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
If you feel the urge to play with MapReduce (or read the paper), you don't need a fancy Linux distro to do it. MapReduce is simply the map() and reduce() functions, exactly as implemented in Python. Granted, Google's implementation can work with absurdly large data sets, but for small data sets, Python is all you need.
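A minimal sketch of that idea, counting words with nothing but the built-ins. The sample `documents` list is made up, and note that in Python 3 `reduce()` lives in `functools`. Google's system adds grouping by key, distribution across machines, and fault tolerance on top, none of which is shown here.

```python
from collections import Counter
from functools import reduce  # reduce() moved here in Python 3

# Made-up sample input; each string stands in for one document.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map step: turn each document into its own partial word count.
partial_counts = map(lambda doc: Counter(doc.split()), documents)

# Reduce step: merge the partial counts into a single total.
word_counts = reduce(lambda a, b: a + b, partial_counts, Counter())

print(word_counts["the"], word_counts["fox"])  # prints "3 2"
```

Because merging counts is associative, the partial counts could in principle be combined in any grouping, which is the property that lets this kind of job be split across machines.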
Not that this wasn't entirely predictable.
It really only took Two Hours - the rest of the time was used stuffing in paid ads.
Whoever mods you "-1" is getting an extra small turkey this Thursday!
And the AC troll shall get only turkey bones and lumps of coal!
It's always bugged me that they've been heralding MapReduce, something any functional programmer has known for the past 50 years, as something revolutionary and new. The worst of it is how all the self-styled geeks, who by rights ought to be familiar with the concept, have been lapping it all up.
I cannot believe how something along the trivial lines of "parallelizable problems parallelize" made it onto /. twice. I now consider /. broken, and will remove it from my bookmarks. I guess a massive company like Google has enough manpower to vote crap up here, so thinking people will need to find some backwaters for talking.
Also I was always waiting for the point in time when google would start to be taken over by the usual business s... hats. Negative karma is growing.
It's not that they are saying "OH MY GOD MAPREDUCE IS SO AWSUM WE TOTALLY OWNED YOU N00BS"
They just wondered how long it would take to sort a 1 PB dataset, after they saw Yahoo! perform a sort with the 1 TB set.
All it is showing is improvements in technology.
We can now use these figures as a benchmark in 5 or 10 years to see how much better it has gotten since then.
I, for one, welcome our new PB sorting personal supercomputer overlords.
I developed a search engine in C++ in grad school, and I remember applying the vector space model and the Okapi model in it. Looking at the relatively snail-paced speed I got from the C++ STL data structures (with limited hardware resources), I still can't help but wonder how Google pulls off the "magic" of split-second results.
MapReduce is a powerful concept, and a good starting point is Hadoop. Google uses its own proprietary implementation, running on top of its own file system, GFS.
We once had a session with Mr. Raghavan, who was the Yahoo search chief. He advised that students in general should learn to work on projects that involve a large corpus of data, and that PhDs are especially valued in the field.
That's almost as big as my downloaded MP3 library!
In Soviet Russia, 6 petabytes sort YOU in ONE hour.
Dude, that's barely half my porn stash.
Next time we have to reference amounts of data sorted as n Google Sort.