Google Sorts 1 Petabyte In 6 Hours

Posted by Soulskill on Sunday November 23, 2008 @04:53AM from the sort-of-fast dept.

krewemaynard writes "Google has announced that they were able to sort one petabyte of data in 6 hours and 2 minutes across 4,000 computers. According to the Google Blog, '... to put this amount in perspective, it is 12 times the amount of archived web data in the US Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.' The technology making this possible is MapReduce, 'a programming model and an associated implementation for processing and generating large data sets.' We discussed it a few months ago. Google has also posted a video from their Technology RoundTable discussing MapReduce."

12 of 166 comments

That's Easy by Lord+Byron+II · 2008-11-23 05:04 · Score: 4, Interesting

Consider a data set of two numbers, each .5 petabyte big. It should only take a few minutes to sort them and there's even a 50% chance the data is already sorted.
One ups Yahoo & Hadoop by DaveLatham · 2008-11-23 05:23 · Score: 3, Interesting

It looks like Google saw Yahoo crowing about winning the 1 TB sort contest using Hadoop and decided to one up them!

Let's see if Yahoo responds!
1. Re:One ups Yahoo & Hadoop by jollyplex · 2008-11-23 11:49 · Score: 5, Interesting
  
  Exactly. It's unclear if their better time was a software engineering or algorithmic feat, though. Hadoop was able to finish sorting the 1 TB benchmark dataset in 209 s; TFA states Google pulled the same event off in 68 s. The Yahoo blog post you linked to says their compute nodes each sported 4 SATA HDDs. Note TFA mentions Google's 1 PB dataset sort used 48,000 HDDs split between 4,000 machines, or 12 HDDs to a machine. If Google used the same machines to perform their 1 TB sort, then they had 3 times as many HDDs on each compute node, and could probably pull data from storage 3 times as fast. 209 s / 68 s ~ 3.1 -- coincidence, or not? =)
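The back-of-the-envelope comparison above can be checked in a few lines. This is only a sketch of the arithmetic; it assumes, as the comment does, that Google's 1 TB run used the same 12-disk machines as the 1 PB run, which TFA does not confirm.

```python
# Figures quoted in the comment above.
hadoop_seconds = 209        # Hadoop's 1 TB sort time
google_seconds = 68         # Google's 1 TB sort time
yahoo_disks_per_node = 4    # per the Yahoo blog post
google_disks = 48_000       # from the 1 PB run
google_machines = 4_000

# Assumption: the 1 TB run used the same machines as the 1 PB run.
google_disks_per_node = google_disks / google_machines

disk_ratio = google_disks_per_node / yahoo_disks_per_node
speedup = hadoop_seconds / google_seconds

print(f"disks per Google node: {google_disks_per_node:.0f}")
print(f"disk ratio: {disk_ratio:.1f}x, observed speedup: {speedup:.2f}x")
```

The two ratios landing so close together is suggestive, but it says nothing about CPU, network, or software differences between the two setups.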
tagging by Hao+Wu · 2008-11-23 05:34 · Score: 4, Interesting

I will be able to catalog my pr0n in my lifetime:

It's not enough to sort by blond, black, gay, scat, etc. Some categories are a combination that don't belong in a hierarchy.

That is where tagging comes in. Sorting can be done on-the-fly, with no one category intrinsically more important.

--
I suggest you read Slashdot
Re:Unit conversion by Anonymous Coward · 2008-11-23 05:45 · Score: 2, Interesting

Assuming it was written in binary in a font that allows for 1 digit per 2mm, the length of the data would be 183251938 m, or 1145324 times the perimeter of an olympic-sized swimming pool.
Re:How is this flamebait? by iwein · 2008-11-23 06:14 · Score: 1, Interesting

There is a thing called meta humor, I'll give you an example:
You got baited into a flame in a very elaborate scheme to mock your intelligence (or lack thereof).
There is no category meta-flamebait, so you're proving the mods right I'd say.
I hope this helps.

--
Show a man some news, distract him for an hour. Show a man some mod points, distract him for the rest of his life.
20,111 Servers ?? by johnflan · 2008-11-23 06:17 · Score: 1, Interesting

With a little bit of Excel: if it takes 4,000 servers 362 minutes to sort 1 PB, it would take 20,111.11 servers to sort 20 PB in 1,440 minutes (24 hours), assuming it was just plain sorting they were doing. And just as a side note, from their numbers, one of their servers can process 741.5 MB per minute!
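For anyone without Excel handy, the same estimate can be sketched in Python. It assumes sort throughput scales linearly with server count, and that 1 PB means 2^30 MB; the binary convention is assumed here because it is the one that reproduces the 741.5 figure.

```python
servers = 4_000
minutes_for_1pb = 362            # 6 hours 2 minutes
daily_pb = 20                    # average MapReduce volume per day, January 2008
minutes_per_day = 24 * 60        # 1,440

# Assumption: throughput scales linearly with the number of servers.
server_minutes_per_pb = servers * minutes_for_1pb
servers_needed = daily_pb * server_minutes_per_pb / minutes_per_day

# Assumption: 1 PB = 2**30 MB (binary units).
mb_per_pb = 2**30
mb_per_server_minute = mb_per_pb / server_minutes_per_pb

print(f"servers needed for 20 PB/day: {servers_needed:,.2f}")
print(f"per-server rate: {mb_per_server_minute:.1f} MB/min")
```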
just in perspective... by wjh31 · 2008-11-23 06:38 · Score: 2, Interesting

I make this about 48 GB/s. My hard drive manages about 20 MB/s, my mid-range RAM manages only ~6.4 GB/s, and top-end RAM reaches only ~13 GB/s (according to Wikipedia). So even ignoring the ability to process that much data in that time, the ability to simply move that much data is quite impressive (at time of print; may not hold one year down the line).
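The 48 GB/s figure can be reproduced as below. A sketch only: it assumes binary units (1 PB = 2^20 GB) and counts each byte once, although a sort must at least read the input and write the output, so the real I/O rate would be higher.

```python
elapsed_seconds = (6 * 60 + 2) * 60   # 6 h 2 min = 21,720 s

# Assumption: binary units, so 1 PB = 2**20 GB.
gb_per_pb = 2**20
aggregate_gb_per_s = gb_per_pb / elapsed_seconds

# Spread over the 4,000 machines mentioned in the summary.
machines = 4_000
per_machine_mb_per_s = aggregate_gb_per_s * 1024 / machines

print(f"aggregate: {aggregate_gb_per_s:.1f} GB/s")
print(f"per machine: {per_machine_mb_per_s:.1f} MB/s")
```

Seen per machine, the rate is modest; the impressive part is sustaining it on thousands of machines at once.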
Re:Unit conversion by Anonymous Coward · 2008-11-23 06:50 · Score: 1, Interesting

Yes, but how much is that in football fields?
You silly sod, you can't measure something in football fields! There's internationalization to take into account!
Canadian football fields are 100x59m, American football fields are 109x49m, and the rest of the world doesn't even play the same game on a football field. And THEIR sport has a standard range, anywhere from 90-120m by 45-90m (Thank you wikipedia).
You've now introduced variable-variables! We can't get an absolute number!
MapReduce = map + reduce by Bitmanhome · 2008-11-23 08:45 · Score: 3, Interesting

If you feel the urge to play with MapReduce (or read the paper), you don't need a fancy Linux distro to do it. MapReduce is simply the map() and reduce() functions, exactly as implemented in Python. Granted, Google's implementation can work with absurdly large data sets, but for small data sets, Python is all you need.
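A small illustration of the two functions the comment refers to. Note that in Python 3, reduce() lives in functools rather than being a built-in; the word list is made up for the example.

```python
from functools import reduce  # a built-in on Python 2; in functools on Python 3

words = ["sort", "one", "petabyte", "in", "six", "hours"]  # made-up sample data

# map(): apply a function to every element.
lengths = list(map(len, words))

# reduce(): fold the list into a single value with a two-argument function.
total_chars = reduce(lambda a, b: a + b, lengths)
longest = reduce(lambda a, b: a if len(a) >= len(b) else b, words)

print(lengths)
print(total_chars, longest)
```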

--
Not that this wasn't entirely predictable.
Re:Need to benchmark against the best sorts by Pinball+Wizard · 2008-11-23 10:36 · Score: 2, Interesting

Parallel/distributed sorting doesn't eliminate the need for map/reduce, it just helps spread the problem set across machines.
Here's the thing though... it's the distributing of the problem set and the combining of the results that is the hard part - not map/reduce.
Map and reduce are simple functional programming paradigms. With map, you apply a function to a list - which could be either atomic values or other functions. With reduce, you take a single function (like add or multiply, for instance) and use that to condense the list into a single value or object.
That's my understanding of map/reduce from my functional language classes in school and that's exactly how Google describes it. I don't really see what the big deal is with map/reduce in itself.
Like I said, it's the distributing the problem among thousands of machines that is the hard part.
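To make the distinction concrete, here is a rough single-process sketch of a MapReduce-style pipeline: a map step that emits key/value pairs, a grouping ("shuffle") step, and a reduce step per key. The function names and sample documents are invented for illustration. The partitioning, scheduling, and fault tolerance across thousands of machines, which is the hard part mentioned above, is exactly what this sketch leaves out.

```python
from collections import defaultdict
from functools import reduce

def map_phase(doc):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    """Group values by key; a real system does this across the network."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Condense all values for one key into a single count."""
    return reduce(lambda a, b: a + b, values)

docs = ["Google sorts a petabyte", "Yahoo sorts a terabyte"]  # made-up input

pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = {key: reduce_phase(vals) for key, vals in shuffle(pairs).items()}

# Print in sorted key order.
for word in sorted(counts):
    print(word, counts[word])
```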

--
No, Thursday's out. How about never - is never good for you?
hadoop by voxner · 2008-11-23 15:45 · Score: 1, Interesting

I developed a search engine in C++ in grad school, and I remember applying the vector space model and the Okapi model in it. Looking at the relatively snail's-pace speed I got from the C++ STL data structures (with limited hardware resources), I still can't help but wonder how Google pulls off the "magic" of split-second results.
MapReduce is a powerful concept, and a good starting point is Hadoop. Google uses its own proprietary implementation, which runs on top of its distributed file system, GFS.
We once had a session with Mr. Raghavan, who was the Yahoo search chief. He advised that students in general should learn to work on projects that involve a large corpus of data, and that PhDs are especially valued in the field.