Google Sorts 1 Petabyte In 6 Hours

Posted by Soulskill on Sunday November 23, 2008 @04:53AM from the sort-of-fast dept.

krewemaynard writes "Google has announced that they were able to sort one petabyte of data in 6 hours and 2 minutes across 4,000 computers. According to the Google Blog, '... to put this amount in perspective, it is 12 times the amount of archived web data in the US Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.' The technology making this possible is MapReduce, 'a programming model and an associated implementation for processing and generating large data sets.' We discussed it a few months ago. Google has also posted a video from their Technology RoundTable discussing MapReduce."

That's Easy by Lord+Byron+II · 2008-11-23 05:04 · Score: 4, Interesting

Consider a data set of two numbers, each .5 petabyte big. It should only take a few minutes to sort them and there's even a 50% chance the data is already sorted.
One ups Yahoo & Hadoop by DaveLatham · 2008-11-23 05:23 · Score: 3, Interesting

It looks like Google saw Yahoo crowing about winning the 1 TB sort contest using Hadoop and decided to one up them!

Let's see if Yahoo responds!
1. Re:One ups Yahoo & Hadoop by jollyplex · 2008-11-23 11:49 · Score: 5, Interesting
  
  Exactly. It's unclear if their better time was a software engineering or algorithmic feat, though. Hadoop was able to finish sorting the 1 TB benchmark dataset in 209 s; TFA states Google pulled the same event off in 68 s. The Yahoo blog post you linked to says their compute nodes each sported 4 SATA HDDs. Note TFA mentions Google's 1 PB dataset sort used 48,000 HDDs split between 4,000 machines, or 12 HDDs to a machine. If Google used the same machines to perform their 1 TB sort, then they had 3 times as many HDDs on each compute node, and could probably pull data from storage 3 times as fast. 209 s / 68 s ~ 3.1 -- coincidence, or not? =)
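The arithmetic in that comparison can be checked with a short sketch. The figures are the ones quoted in the comment and are not independently verified; the idea that sort time scales with the number of disks per node is the commenter's guess, not a measurement.

```python
# Figures as quoted in the comment above (assumed, not verified here).
hadoop_seconds = 209           # Hadoop time for the 1 TB benchmark
google_seconds = 68            # Google time for the same benchmark, per TFA
google_disks_total = 48_000    # disks used in Google's 1 PB run
google_machines = 4_000        # machines used in Google's 1 PB run
yahoo_disks_per_node = 4       # SATA disks per Yahoo compute node

# Disks per machine, assuming the same hardware was used for both Google runs.
google_disks_per_node = google_disks_total // google_machines
disk_ratio = google_disks_per_node / yahoo_disks_per_node
time_ratio = hadoop_seconds / google_seconds

print(google_disks_per_node)   # 12
print(disk_ratio)              # 3.0
print(round(time_ratio, 2))    # 3.07
```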
tagging by Hao+Wu · 2008-11-23 05:34 · Score: 4, Interesting

I will be able to catalog my pr0n in my lifetime:

It's not enough to sort by blond, black, gay, scat, etc. Some categories are a combination that don't belong in a hierarchy.

That is where tagging comes in. Sorting can be done on-the-fly, with no one category intrinsically more important.

--
I suggest you read Slashdot
Re:Unit conversion by Anonymous Coward · 2008-11-23 05:45 · Score: 2, Interesting

Assuming it was written in binary in a font that allows for 1 digit per 2mm, the length of the data would be 183251938 m, or 1145324 times the perimeter of an olympic-sized swimming pool.
just in perspective... by wjh31 · 2008-11-23 06:38 · Score: 2, Interesting

I make this about 48 GB/s. My hard drive manages about 20 MB/s, my mid-range RAM manages only ~6.4 GB/s, and top-end RAM reaches only ~13 GB/s (according to wiki). So even ignoring the ability to process that much data in that time, the ability to simply move that much data is quite impressive (at time of print; may not hold one year down the line).
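A rough way to reproduce that estimate, as a sketch: the result depends on whether "1 petabyte" is read as 10^15 or 2^50 bytes, which is an assumption here because the summary does not say. The figure is the average rate of sorted output, not the total disk or network traffic, which would be higher if the data is read and written more than once.

```python
# Elapsed time from the summary: 6 hours 2 minutes.
elapsed_seconds = 6 * 3600 + 2 * 60   # 21,720 s
machines = 4_000                      # machine count from the summary

def throughput_gb_per_s(total_bytes, seconds):
    """Average aggregate throughput in decimal gigabytes per second."""
    return total_bytes / seconds / 1e9

# Assumption: the two common readings of "petabyte".
decimal_rate = throughput_gb_per_s(10**15, elapsed_seconds)   # roughly 46 GB/s
binary_rate = throughput_gb_per_s(2**50, elapsed_seconds)     # roughly 52 GB/s

# Same average spread evenly over the machines, in MB/s (decimal reading).
per_machine_mb_per_s = decimal_rate * 1e3 / machines          # roughly 11.5 MB/s

print(f"{decimal_rate:.1f} GB/s, {binary_rate:.1f} GB/s, "
      f"{per_machine_mb_per_s:.1f} MB/s per machine")
```

The quoted 48 GB/s falls between the two readings. Spread over 4,000 machines, the average per machine is modest; the hard part is sustaining it everywhere at once.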
MapReduce = map + reduce by Bitmanhome · 2008-11-23 08:45 · Score: 3, Interesting

If you feel the urge to play with MapReduce (or read the paper), you don't need a fancy Linux distro to do it. MapReduce is simply the map() and reduce() functions, exactly as implemented in Python. Granted, Google's implementation can work with absurdly large data sets, but for small data sets, Python is all you need.

--
Not that this wasn't entirely predictable.
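For anyone who wants to try the Python route suggested above, here is a minimal single-process sketch. The word list and the helper name `add_pair` are made up for illustration, and this is a toy, not Google's implementation. In Python 3, `reduce` has to be imported from `functools`.

```python
from functools import reduce  # reduce is not a builtin in Python 3

# Hypothetical input data for the example.
words = ["sort", "map", "reduce", "map", "sort", "map"]

# "Map" step: turn each word into a (word, 1) pair.
pairs = list(map(lambda w: (w, 1), words))

# "Reduce" step: fold the pairs into a dict of per-word counts.
def add_pair(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

counts = reduce(add_pair, pairs, {})
print(counts)  # {'sort': 2, 'map': 3, 'reduce': 1}
```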
Re:Need to benchmark against the best sorts by Pinball+Wizard · 2008-11-23 10:36 · Score: 2, Interesting

Parallel/distributed sorting doesn't eliminate the need for map/reduce, it just helps spread the problem set across machines.
Here's the thing though... it's the distributing of the problem set and the combining of the results that is the hard part - not map/reduce.
Map and reduce are simple functional programming paradigms. With map, you apply a function to a list - which could be either atomic values or other functions. With reduce, you take a single function (like add or multiply, for instance) and use that to condense the list into a single value or object.
That's my understanding of map/reduce from my functional language classes in school and that's exactly how Google describes it. I don't really see what the big deal is with map/reduce in itself.
Like I said, it's distributing the problem among thousands of machines that is the hard part.

--
No, Thursday's out. How about never - is never good for you?
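The "distribute, then combine" shape described in the comment above can be sketched on one machine. This is only an illustration under loud assumptions: worker threads stand in for separate machines, the function names `partition` and `distributed_sort` are hypothetical, and none of the genuinely hard parts (network transfer, stragglers, failed nodes) appear here.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_parts):
    """Split data into n_parts chunks by striding through the list."""
    return [data[i::n_parts] for i in range(n_parts)]

def distributed_sort(data, n_workers=4):
    """Sort each chunk independently, then merge the sorted chunks."""
    chunks = partition(data, n_workers)
    # Each worker thread plays the role of one machine sorting its share.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # Combining step: a k-way merge of the already-sorted chunks.
    return list(heapq.merge(*sorted_chunks))

print(distributed_sort([5, 3, 9, 1, 7, 2, 8]))  # [1, 2, 3, 5, 7, 8, 9]
```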