Google Sorts 1 Petabyte In 6 Hours
krewemaynard writes "Google has announced that they were able to sort one petabyte of data in 6 hours and 2 minutes across 4,000 computers. According to the Google Blog, '... to put this amount in perspective, it is 12 times the amount of archived web data in the US Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.' The technology making this possible is MapReduce, 'a programming model and an associated implementation for processing and generating large data sets.' We discussed it a few months ago. Google has also posted a video from their Technology RoundTable discussing MapReduce."
Consider a data set of two numbers, each .5 petabyte big. It should only take a few minutes to sort them and there's even a 50% chance the data is already sorted.
It looks like Google saw Yahoo crowing about winning the 1 TB sort contest using Hadoop and decided to one-up them!
Let's see if Yahoo responds!
It's not enough to sort by blond, black, gay, scat, etc. Some categories are combinations that don't belong in a hierarchy.
That is where tagging comes in. Sorting can be done on-the-fly, with no one category intrinsically more important.
I suggest you read Slashdot
Assuming it was written in binary in a font that allows for 1 digit per 2mm, the length of the data would be 183251938 m, or 1145324 times the perimeter of an olympic-sized swimming pool.
I make this about 48 GB/s. My hard drive manages about 20 MB/s, my mid-range RAM manages only ~6.4 GB/s, and top-end RAM will reach only ~13 GB/s (according to wiki). So even ignoring the ability to process that much data in that time, the ability to simply move that much data is quite impressive (at time of print; this may not hold one year down the line).
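The arithmetic behind that figure can be sketched in a few lines. This is a back-of-the-envelope estimate only: it assumes a decimal petabyte (10^15 bytes), counts each byte once (a real sort reads and writes the data several times), and the variable names are just for illustration. Using a binary petabyte (2^50 bytes) instead gives a somewhat higher number, so "about 48 GB/s" sits between the two conventions.

```python
# Rough throughput estimate for sorting 1 PB in 6 h 2 min on 4,000 machines.
PETABYTE_BYTES = 10**15            # assumption: decimal petabyte
ELAPSED_SECONDS = 6 * 3600 + 2 * 60
MACHINES = 4000

bytes_per_second = PETABYTE_BYTES / ELAPSED_SECONDS

# Aggregate rate across the whole cluster, in GB/s.
aggregate_gb_per_s = bytes_per_second / 10**9

# Average share of that rate handled by each machine, in MB/s.
per_machine_mb_per_s = bytes_per_second / MACHINES / 10**6

print(f"aggregate:   {aggregate_gb_per_s:.1f} GB/s")
print(f"per machine: {per_machine_mb_per_s:.1f} MB/s")
```

Spread over 4,000 machines, the average per-machine rate is far smaller than the aggregate, which is the point of distributing the work.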
If you feel the urge to play with MapReduce (or read the paper), you don't need a fancy Linux distro to do it. MapReduce is simply the map() and reduce() functions, exactly as implemented in Python. Granted, Google's implementation can work with absurdly large data sets, but for small data sets, Python is all you need.
Not that this wasn't entirely predictable.
Parallel/distributed sorting doesn't eliminate the need for map/reduce, it just helps spread the problem set across machines.
Here's the thing though: it's the distributing of the problem set and the combining of the results that is the hard part, not map/reduce.
Map and reduce are simple functional programming paradigms. With map, you apply a function to each element of a list, where the elements could be either atomic values or other functions. With reduce, you take a single function (like add or multiply, for instance) and use that to condense the list into a single value or object.
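For anyone who hasn't met these two operations, here is a minimal sketch in Python of the functional map and reduce described above. It illustrates only the plain language-level functions, not Google's MapReduce system; the sample list and variable names are made up for the example.

```python
from functools import reduce  # reduce lives in functools in Python 3
import operator

numbers = [1, 2, 3, 4]

# map: apply a function to every element of the list.
squares = list(map(lambda x: x * x, numbers))

# reduce: condense the list into a single value with a two-argument function.
total = reduce(operator.add, squares)
product = reduce(operator.mul, numbers)

print(squares, total, product)
```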
That's my understanding of map/reduce from my functional language classes in school and that's exactly how Google describes it. I don't really see what the big deal is with map/reduce in itself.
Like I said, it's distributing the problem among thousands of machines that is the hard part.
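To make the "distribute, then combine" shape concrete, here is a toy single-process sketch in Python. It is not how Google's sort works; the function names and the round-robin partitioning are assumptions chosen for brevity, and the "workers" are just list slices. Everything that makes the real problem hard (moving data over a network, machine failures, stragglers, data that doesn't fit in memory) is deliberately left out.

```python
import heapq
import random

def partition(records, num_workers):
    """Deal records round-robin into one chunk per hypothetical worker."""
    return [records[i::num_workers] for i in range(num_workers)]

def worker_sort(chunk):
    """The work each worker does independently: sort its own chunk."""
    return sorted(chunk)

def combine(sorted_chunks):
    """Merge the already-sorted chunks into one sorted list."""
    return list(heapq.merge(*sorted_chunks))

def toy_distributed_sort(records, num_workers=4):
    chunks = partition(records, num_workers)
    # In a real cluster each call would run on a different machine.
    sorted_chunks = list(map(worker_sort, chunks))
    return combine(sorted_chunks)

data = [random.randint(0, 1000) for _ in range(50)]
result = toy_distributed_sort(data)
print(result == sorted(data))
```

The per-chunk step is a plain map over independent pieces, which is why it parallelizes easily; the partitioning and the final merge are where the coordination lives.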
No, Thursday's out. How about never - is never good for you?