Google Sorts 1 Petabyte In 6 Hours
krewemaynard writes "Google has announced that they were able to sort one petabyte of data in 6 hours and 2 minutes across 4,000 computers. According to the Google Blog, '... to put this amount in perspective, it is 12 times the amount of archived web data in the US Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.' The technology making this possible is MapReduce 'a programming model and an associated implementation for processing and generating large data sets.' We discussed it a few months ago. Google has also posted a video from their Technology RoundTable discussing MapReduce."
Don't you mean 1PB = 12LoC?
I think this is the data set. I could be wrong though. The article (yeah yeah) says that
In our sorting experiments we have followed the rules of a standard terabyte (TB) sort benchmark.
Which lead me to this page that describes the data (and it's available for download).
Your sig(k) has been stolen. There is a puff of smoke!
From TFA: they sorted "10 trillion 100-byte records"
No, 1 PB = 12 LoC, so 1 LoC = 0.0833... PB
Also, I'd like to make some kind of swimming pool reference.
Azural - instrumentals
I realize, slashdot..., but maybe you could glance at the article which states:
10 trillion 100-byte records
Are you sure? It wasn't marked Vista capable.
Oh darn. Clearly I was converting pound-congresses to kilos first.
Agreed, but even if it takes 40,000 servers with losses and extra overhead to calculate their daily workload. It makes you wonder what their other estimated 410,000 servers are doing? (2006 estimate)
It's older than design patterns. Lisp has provided map and reduce functions for literally decades. It's a standard functional programming idiom.
Well, American's don't even play *foot*ball with there feet.
True, but not quite the point. The map and reduce functions as you say are implemented in python (and a great many other languages), but what makes MapReduce special is that you replace the Map function with one which distributes it out to other computers. Because any map function can be implemented in parallel you get a speed boost for however many machines you have (dependant on network speeds etc....).
So yeah, you can do it in Python but you arent going to be breaking any records untill you implement your own infrastructure that lets you span it out to thousands of computers. The nice thing being you dont need to write any new code to take advantage of the speed when you do.
Exactly. There is nothing special to map and reduce.
Here's an example. Map and reduce are functional programming tools that work with lists. So we'll start with a simple list.
1 2 3 4 5
Now we'll take a function - x^2, and map it to the list. The list now becomes:
1 4 9 16 25.
Now, we'll apply a reduce function to our list to combine it to a single value. I'll use "+" to keep it simple. We end up with:
55
And that is pretty much all there is to map and reduce.
No, Thursday's out. How about never - is never good for you?
Almost, but not quite. MapReduce has a slightly different format than just map() and reduce(). Here is the signature of map and reduce from a theoretical functional language:
map(): A* -> B*
reduce(): B* -> C
Whereas in MapReduce:
map: (K, V)* -> (K1, V1)*
reduce: (K1, (V1)*)* -> (K2, V2)*
I think that is mostly accurate. Read more accurate/detailed report in MapReduce revisited[PDF].
The individual functions map and reduce are quite standard. The innovation here is the systems work they've done to make it work on such a large scale. All the programmer needs to worry about is implementing the two functions, they don't have to worry about distributing the work, ensuring fault tolerance, or anything else for that matter. That is the innovation.
They mention in the article that if you try and sort a petabyte you WILL get hard disk and computer failures. Hell, you can only read a terabyte hard disk a few times before you encounter unrecoverable errors. The system for executing those maps and reduces is what is important here. The important parts are in the design details, such as dealing with stragglers. If you have 4000 identical machines, you won't necessarily get equal performance. If a few of those machines have a bit flipped and started without disk cache, they might see a huge decrease in read/write performance. The system needs to recognize this and schedule the work differently. That can make a huge difference in execution time. If you graph the percentile complete of a MR job, you'll often see that it quickly reaches 95% and then plateaus. The last 5% may take 20% of the time, and good scheduling is required to bring this time down.
But like I said, the innovation isn't in the idea of using a Map and Reduce function, it is the system that executes the work.