Open Source Solution Breaks World Sorting Records
allenw writes "In a recent blog post, Yahoo's grid computing team announced that Apache Hadoop was used to break the current world sorting records in the annual GraySort contest. It topped the 'Gray' and 'Minute' sorts in the general purpose (Daytona) category. They sorted 1TB in 62 seconds, and 1PB in 16.25 hours. Apache Hadoop is the only open source software to ever win the competition. It also won the Terasort competition last year."
I for one welcome our new datasorting overlords!
The Long Now Foundation
Just give me a few minutes to patch together a bubblesort from my highschool Pascal class. I'll show them record speed!
So, it appears they have finally sorted out whether open source beats proprietary.
When information is power, privacy is freedom.
If it's winning competitions at 0.20, when will they release it?
... truth be told, a lot of good engineering could happen if many of people's favorite commercial applications had the source distributed with them; a lot of old games, for instance, could be updated and maintained.
I think what holds the progress of open source back is interesting projects that exist that people want to work on but are locked away under corporate lock and key.
But has anyone patented it yet? Patents trump copyright after all.
This doesn't say anything if we don't know what kind of records were supposed to be sorted.
Gonna pass this on to my boss, hopefully now we can move off of our terrible, terrible proprietary sorting software... Good to see open source making inroads in so many areas!!!
...this cluster had nearly 4 times the number of nodes as the previous records. This competition was testing who had more nodes working together the best, but when you have so many more nodes, it would be hard not to top other clusters.
OK, so where are the "Java is slow" comments? o.O
Dang nabbit. I was depending on the World Sorting Records to be my reference for how people sort in other countries than my own. Avoid open source, next time it will break your nose.
Probably why the second sentence in the article is "All of the sort benchmarks measure the time to sort different numbers of 100 byte records. The first 10 bytes of each record is the key and the rest is the value."
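For anyone who wants to poke at that record layout, here is a minimal Python sketch of sorting 100-byte records on their first 10 bytes. It is a single-machine toy, not the benchmark's reference code; the names `RECORD_SIZE`, `KEY_SIZE`, `split_records` and `sort_records` are made up for illustration, and comparing keys as raw bytes is an assumption.

```python
import os

RECORD_SIZE = 100  # bytes per record, per the benchmark description quoted above
KEY_SIZE = 10      # leading bytes of each record that form the sort key


def split_records(data: bytes) -> list[bytes]:
    """Cut a byte string into fixed-size 100-byte records."""
    if len(data) % RECORD_SIZE != 0:
        raise ValueError("input length is not a multiple of the record size")
    return [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]


def sort_records(data: bytes) -> bytes:
    """Sort records by their first 10 bytes, compared as raw bytes (an assumption)."""
    records = split_records(data)
    records.sort(key=lambda rec: rec[:KEY_SIZE])
    return b"".join(records)


if __name__ == "__main__":
    # 1000 random records; check that the keys come out in order.
    sample = os.urandom(RECORD_SIZE * 1000)
    keys = [rec[:KEY_SIZE] for rec in split_records(sort_records(sample))]
    print(keys == sorted(keys))
```

The remaining 90 bytes of each record just ride along with the key, which is why the benchmark is mostly about moving data around rather than about comparisons.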
Belief is the currency of delusion.
You can't patent Apache 2.0 licensed stuff.
Also, you can't patent software*.
*in Europe
NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
Also, you can't patent software in Europe
Not yet, but they are working on it. They tried to sneak it through by hiding it in the amendments of an agricultural bill. Luckily Poland kept watch and raised a stink about it.
It's not over. There is too much money to be gained for that.
Java doesn't fit my environment; does someone know of an open source C++ port of Java Hadoop?
If there is no such port, is anyone interested in starting one? An example of a Java port is "cLucene", a C++ port of the Java Lucene search engine. It usually outperforms its Java sibling by an order of magnitude.
My favourite operating system is ReactOS; binary compatible to WinNT series
The 2005 winner used only 80 cores and finished in 435 seconds. So with 800 cores, what the 2007 winner achieved is 297 seconds?
It's not only the number of cores; what matters is how well the logic uses the parallel nodes for a particular task.
Hadoop won with 1820 cores (910 nodes w/ 2 cores each) at 209 seconds.
I'm all for better sorting algorithms, but eventually the cost of parallelizing something overtakes the profit made. That being said, Hadoop's internal filesystem is made to be redundant, which is an important feature whenever you're dealing with large amounts of data.
Hadoop uses Google's MapReduce, by the way, whereas the competition didn't. It's nice to see MapReduce being used in a more public eye.
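If it helps to see why the MapReduce model suits sorting, here is a toy single-process sketch in Python. It is not Hadoop's code: the function names are hypothetical, and picking split points from the data itself is an assumption made to keep the example short. The map step routes each key to a partition by range, each partition is sorted independently (the part a cluster would do in parallel), and concatenating the partitions in order gives a globally sorted result.

```python
from bisect import bisect_right
from collections import defaultdict


def choose_split_points(sample_keys, num_partitions):
    """Pick num_partitions - 1 range boundaries from a sorted sample of keys."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def map_phase(records, split_points):
    """Route each record to the partition whose key range contains it."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[bisect_right(split_points, rec)].append(rec)
    return partitions


def reduce_phase(partitions, num_partitions):
    """Sort each partition on its own, then concatenate in partition order."""
    out = []
    for p in range(num_partitions):
        out.extend(sorted(partitions.get(p, [])))
    return out


def toy_mapreduce_sort(records, num_partitions=4):
    """Range-partition, sort per partition, concatenate."""
    if not records:
        return []
    split_points = choose_split_points(records, num_partitions)
    return reduce_phase(map_phase(records, split_points), num_partitions)


if __name__ == "__main__":
    data = [(i * 7919) % 1000 for i in range(500)]
    print(toy_mapreduce_sort(data, 8) == sorted(data))
```

Because every key in partition p is smaller than every key in partition p+1, no merge step is needed at the end; the hard part on a real cluster is choosing split points that keep the partitions balanced.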
While better sorting algorithms -do- matter, I have to say that maintenance and running costs also matter.
I'd also like to see how a compatible C version of this software compares with the Java version. However, as I see it, the Java overhead seems fairly limited; sorting code is wonderfully repetitive, and I'd expect that it's already been optimized a fair amount.
By the way, the number of nodes and the hardware in the nodes for this Hadoop cluster are -optimized- for this contest.
There are no perfect answers, only the right questions. More questions at http://foresightandhindsight.blogspot.com/
Google's sorting results from last year are much faster; they did a petabyte in 362 minutes, or about 2.8 TB/min. The minute sort didn't exist last year, but Google did 1TB in 68 seconds last year, so I think it may be safe to assume that they could do 1 TB in under a minute this year. Google just hasn't submitted any of their runs to the competition.
The sort benchmark page lists the winning run as Yahoo's 100TB run, leaving out the 1PB run; that implies the 1PB run didn't conform to the rules, or was late, or something.
People have commented that this is a "who has the biggest cluster" competition; the sort benchmark also includes the 'penny' sort, which is how much can you sort for 1 penny of computer time (assuming your machine lasts 3 years), and 'Joule' sort, how much energy does it take you to sort a set amount of data. Not surprisingly, the big clusters appear to be neither cost efficient nor energy efficient.
But has anyone patented it yet? Patents trump copyright after all.
There are a number of patent applications related to MapReduce from Google and Yahoo.
Dual Opteron < $600
Why isn't this illegal - adding unrelated legislation to a bill? Is there anywhere in the world where this practice is not permitted, or better yet, prosecuted?
Pain is merely failure leaving the body
I'm always confused when teams use Java that don't REQUIRE complex cross platform support. IME, all the old complaints about java are still true.
It is slow, when compared to C/C++ or other mature compiled languages.
It uses more RAM than C++.
Development isn't any easier or faster than C++.
It is fairly easy to write cross platform C++ code that uses less compute resources and easily runs on 10 platforms. Besides, having developers who are nearly clueless about the platform isn't good. I've seen some really bad java developer teams and some really bad C++ teams. Overall, the java developers knew less about the platform and hardware than the C++ teams. Java let them be lazy.
Don't get me wrong, there are many uses for java and as CPUs have gotten faster and hold more RAM, we aren't trying to suck every bit of performance. That's a good fit for java programs.
Imagine a beowulf cluster of those!
I really hope that this works across multiple drives, because my p0rn collection is so spread out it would take for ever to sort manually!
Why isn't this illegal - adding unrelated legislation to a bill? Is there anywhere in the world where this practice is not permitted, or better yet, prosecuted?
I never heard of it happening here in the UK, as far as I knew only the US did. Shows how little I knew.
If God forks the Universe every time you roll a die, he'd better have a damned good memory.
I'm looking forward to sorting my search results by Date, Title, Description, Author, etc.
The Gray sort metric is defined as TB/minute on a large data set (>=100TB). Apache Hadoop got 100TB in 173min = 0.578TB / min.
Half a year ago, Google's MapReduce sorted 1PB in 362 minutes. Rate = 2.762TB / min
http://developers.slashdot.org/article.pl?sid=08/11/23/1637219&from=rss
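The arithmetic behind those rates is easy to check. A quick Python sketch, using only figures quoted in this thread and treating 1PB as 1000TB (that conversion is my assumption); the helper name `sort_rate_tb_per_min` is made up:

```python
def sort_rate_tb_per_min(terabytes: float, minutes: float) -> float:
    """Gray sort style rate: terabytes sorted per minute."""
    return terabytes / minutes


# Figures as quoted in the thread; 1 PB is taken as 1000 TB (an assumption).
hadoop_100tb = sort_rate_tb_per_min(100, 173)
google_1pb = sort_rate_tb_per_min(1000, 362)
hadoop_1pb = sort_rate_tb_per_min(1000, 16.25 * 60)

print(f"Hadoop 100TB run: {hadoop_100tb:.3f} TB/min")  # 0.578
print(f"Google 1PB run:   {google_1pb:.3f} TB/min")    # 2.762
print(f"Hadoop 1PB run:   {hadoop_1pb:.3f} TB/min")    # 1.026
```

Note the runs used different cluster sizes and hardware, so the rates alone don't say which software is faster.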
My sorting algorithm operates in constant time. I should really enter it into one of these competitions. It's called Intelligent Design Sort: http://www.dangermouse.net/esoteric/intelligentdesignsort.html
There are 2 kinds of people in this world. Those that can keep their train of thought,
Here in the UK, the patent office has been issuing software patents for some time in "anticipation" of them becoming legal at some point in the future.
No, I don't understand that either.
That depends how you define unrelated, but I think that the Anti-terrorism, Crime and Security Act of 2001 is a perfect example of the fact that the name of a law is chosen to try and make sure that it gets passed.
Why isn't this illegal - adding unrelated legislation to a bill? Is there anywhere in the world where this practice is not permitted, or better yet, prosecuted?
The GP is confusing a bunch of things. First, the Council of Ministers threw out all limiting amendments from the European Parliament and reached a Political Agreement on a shoddy text through backdoor maneuvering by Germany and the European Commission. That text would have turned the European Patent Office's practice of granting software patents into EU legislation.
A Political Agreement has neither juridical nor legislative value, but it has never happened that a political agreement was later annulled and negotiations reopened. So in this case too, even though the German, Dutch, Spanish and Danish parliaments afterwards passed motions asking to reopen the discussions, the Council's bureaucrats did not want to do that because it "would undermine the efficiency of the decision making process".
Anyway, once you have a Political Agreement (which is reached by the representatives of the ministries responsible for the matter at hand) and nobody "wants" to discuss it anymore, the agreement can be placed as an "A item" on any EU Council of Ministers meeting, since it only needs rubber stamping in that case. In the case of the Software Patents Directive, it appeared several times as an A item on the agenda of an Agriculture and Fisheries meeting (which is presumably where the GP's confusion stems from).
In principle, there would have been nothing wrong with that, but in this case there was no actual political agreement, and in particular Poland was very unhappy with the way it had been treated. So 4 times in a row, Poland either had this "A item" removed from the agenda (sometimes at the last minute, because the responsible Polish minister had to be informed that they were again trying to get it through at a meeting he had no business with), or turned it into a "B item", which means that it can't be rubber stamped but that they first have to talk a bit about it (which nobody wanted to do).
In the end it still did get approved, but that whole circus helped in convincing the EU Parliament to table a resolution asking the Commission to restart the directive's process and, when the Commission refused, to later squarely reject it.
You can find some more of my thoughts on the Council's behaviour here.
Donate free food here
GP was referring to tacking on legislation to an unrelated bill, i.e. patent legislation on an agricultural bill. It's my understanding that this is sometimes used in the US to block a bill by means of appending something that no fool will vote in.
This is indeed the case (for killing bills). The nastier version is tacking some random crap on to the annual budget, and using the excuse of getting the budget passed to ram it through even though the bill alone wouldn't even get to a vote by itself.
Like the widely-held belief that sorting speed is related to the software license used.
Make that "all of our" instead of "all are". A mind is a terrible thing to waste.
Hadoop's name (and mascot) came from Doug [the project leader] Cutting's son's yellow stuffed elephant toy.
Good luck on your lawsuit with DJ Danger Mouse.
(Kinda stupid to be whoring your vanity site out on the same day as a front page story about the person who could easily sue you and take it from you.)
READ THE GODDAMNED FUCKING ARTICLE YOU STUPID MOTHERFUCKING LAZY COCKSUCKING PIECE OF SHIT.
Really, dude... it's not that hard to put in a modicum of effort that will pay dividends in terms of you not looking like a totally clueless fucking moron.
Gee, I actually thought it was funny. Now, maybe I'm just an idiot that's easily amused, but you sir have problems that could make just about anybody feel better about themselves.
and it's not detrimental to the people that work at those companies to protect the corporation's intellectual property.
There was an episode of the Simpsons where Springfield is going to be destroyed by a meteor. Congress meets to quickly pass legislation to fund the evacuation of the city. At the last moment, a Congressman steps up to the podium and says "I'd like to add a rider providing $30 million for the perverted arts". The bill is defeated.
It's funny because it's true.
Anyone who loves or hates any language, platform, or manufacturer, doesn't know what they're talking about.
"Why isn't this illegal"
Because they made it legal by passing it on a Totally Unrelated Bill.
I had one of the Hadoop guys on my podcast a while back and we talked about this, and what Hadoop does (Map/Reduce),
http://www.rce-cast.com/index.php/Podcast/rce04-hadoop.html
Never mind sorting. I'd settle for a filesystem that could stat 1TB (in approx. 800,000 files) in under an hour. Mind you, it doesn't have to md5 them in that time, but it would be nice. I'd settle for just stat.
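If you want to measure how long a plain stat pass actually takes on your own tree, here is a rough Python sketch. `stat_tree` is a made-up helper, and the timing will depend heavily on the filesystem and on whether the metadata is already cached, so treat any number it prints as a ballpark.

```python
import os
import sys
import time


def stat_tree(root: str) -> tuple[int, int, float]:
    """Walk root, stat every non-directory entry, return (files, bytes, seconds)."""
    files = 0
    total_bytes = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                info = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            files += 1
            total_bytes += info.st_size
    return files, total_bytes, time.perf_counter() - start


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    count, size, seconds = stat_tree(root)
    print(f"{count} files, {size} bytes, {seconds:.2f} s")
```

It only reads metadata, never file contents, so it says nothing about how long an md5 pass would take.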
Sadly, tacking stuff on isn't limited to the budget bills. It happens routinely.
geek friendly VPS's and free API enabled DNS : zerigo.com
Same number of computers 68 seconds.
http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html
I'd heard of a US bill getting pork added to it AFTER it had been voted on - sometime within the last year. I haven't been able to find the story again and would hope that this isn't a common occurrence.
Yeah. They'll add it in conference committee, where, after the initial vote, they reconcile differences in bills between the House and Senate versions. It goes back for a quick final vote in each chamber but that's usually considered procedural as I understand.
I don't know for sure, but somehow doubt that it's uncommon. More likely, the changes snuck in aren't enough to raise significant ire so they get away with it. And if people figure it out and are unhappy, there's always plausible deniability: "Some intern added it; it wasn't supposed to be there."
The minidisk player in my closet wants to know why it's not on the list
if not "for months"
-- "As a human being I claim the right to be widely inconsistent", John Peel