Slashdot Mirror


Open Source Solution Breaks World Sorting Records

allenw writes "In a recent blog post, Yahoo's grid computing team announced that Apache Hadoop was used to break the current world sorting records in the annual GraySort contest. It topped the 'Gray' and 'Minute' sorts in the general purpose (Daytona) category. They sorted 1TB in 62 seconds, and 1PB in 16.25 hours. Apache Hadoop is the only open source software to ever win the competition. It also won the Terasort competition last year."

36 of 139 comments

  1. Overlords by Narpak · · Score: 3, Funny

    I for one welcome our new datasorting overlords!

    1. Re:Overlords by Jurily · · Score: 4, Funny

      I for one welcome our new datasorting overlords!

      With a name like Apache Hadoop, I wouldn't be surprised if they came from Star Wars.

    2. Re:Overlords by rackserverdeals · · Score: 3, Informative

      I wouldn't be surprised if they came from Star Wars.

      Actually, it came from Google. Sorta.

      Apache Hadoop is an open source implementation of MapReduce, the framework Google uses in its search engine. I believe the details came from a paper Google released on its implementation of MapReduce.

      --
      Dual Opteron < $600
    3. Re:Overlords by daemonburrito · · Score: 3, Informative

      "MapReduce: Simplified Data Processing on Large Clusters." Jeffrey Dean and Sanjay Ghemawat, OSDI '04.

      They wrote about it in Beautiful Code, too (great book). MapReduce isn't complex; in fact, the name comes from a feature that a lot of functional languages provide (yeah, I know, it's not exactly the same thing).

      There are many implementations of it. The wikipedia article is pretty informative: http://en.wikipedia.org/wiki/MapReduce. I didn't know about "BashReduce"... Heh.
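
For readers who have not met the functional-language feature the name borrows from, here is a minimal single-process sketch in Python. It is not Hadoop's or Google's API; the helper names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the "shuffle" step stands in for the grouping that a real framework does across machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(documents):
    """Emit (word, 1) pairs, the way a mapper emits key/value pairs."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Group values by key (a stand-in for the framework's shuffle step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Fold each key's values into a single result, as a reducer would."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(documents):
    """Chain the three steps: map, group by key, reduce."""
    return reduce_phase(shuffle(map_phase(documents)))


# The plain functional primitives the name comes from:
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]
total = reduce(lambda a, b: a + b, squares)          # 30
```

Calling word_count(["a b a", "B c"]) counts "a" and "b" twice and "c" once. The point of the pattern is that map_phase and reduce_phase have no shared state, so a framework can run them on many machines at once.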

    4. Re:Overlords by davester666 · · Score: 3, Funny

      Fastest implementation of BubbleSort EVER!

      --
      Sleep your way to a whiter smile...date a dentist!
    5. Re:Overlords by ModMeFlamebait · · Score: 5, Funny

      datasorting for I new one our overlords! welcome

      --
      Pavlov. Does this name ring a bell?
  2. Re:I'm sure that I can rock their scores by Thinboy00 · · Score: 5, Funny

    My sort will totally beat yours!

    --
    $ make available
  3. Re:I'm sure that I can rock their scores by Midnight Thunder · · Score: 2, Funny

    Bogosort: for when you are paid by the hour, but aren't penalised for being late.
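
In case the joke needs unpacking: bogosort shuffles the input at random until it happens to come out sorted. A minimal Python sketch follows; the max_shuffles cap and the returned shuffle count are additions for illustration, not part of any standard definition.

```python
import random


def is_sorted(items):
    """True if every element is <= its successor."""
    return all(a <= b for a, b in zip(items, items[1:]))


def bogosort(items, rng=None, max_shuffles=1_000_000):
    """Shuffle until sorted. Returns (sorted_list, number_of_shuffles)."""
    if rng is None:
        rng = random.Random()
    result = list(items)          # work on a copy, leave the input alone
    shuffles = 0
    while not is_sorted(result):
        if shuffles >= max_shuffles:
            raise RuntimeError("gave up; bogosort makes no promises")
        rng.shuffle(result)
        shuffles += 1
    return result, shuffles
```

The expected number of shuffles grows roughly with n!, so anything beyond a handful of elements will keep you on the clock for a very long time.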

    --
    Jumpstart the tartan drive.
  4. Is it settled? by Jah-Wren Ryel · · Score: 4, Funny

    So, it appears they have finally sorted out whether open source beats proprietary.

    --
    When information is power, privacy is freedom.
  5. When's it going to be 1.0? by AlexBirch · · Score: 3, Insightful

    If it's winning competitions at 0.20, when will they release it?

    1. Re:When's it going to be 1.0? by Anonymous Coward · · Score: 5, Informative

      It's 0.20 but it's stable and production ready already. I use it with HBase and it scales awesomely.

    2. Re:When's it going to be 1.0? by BikeHelmet · · Score: 2, Insightful

      Isn't 1.0 production for most software jargon?

      Nah, that's 6.0

      MS DOS 6.0
      IE 6.0
      Visual Studio 6.0

      I doubt anybody would want to use an earlier version than that!

    3. Re:When's it going to be 1.0? by TheRaven64 · · Score: 2, Insightful

      You realise, I hope, that Vista is Windows NT 6.0...

      --
      I am TheRaven on Soylent News
  6. They won the "Who has the most moneys" award. by nathan.fulton · · Score: 5, Insightful

    ...this cluster had nearly 4 times the number of nodes as the previous records. This competition was testing who had more nodes working together the best, but when you have so many more nodes, it would be hard not to top other clusters.

    1. Re:They won the "Who has the most moneys" award. by Rockoon · · Score: 3, Interesting

      I was doing some back-of-the-envelope, and they are sorting 17.7GB/second, which at a minimum would require 177 HD's if each drive can write 100MB/sec.

      If it's not written to disk, then there is no achievement here (you don't perform 1 minute+ sorts and then throw the result away in real-world scenarios)
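
The back-of-the-envelope numbers above can be reproduced in a few lines of Python. The 17.7GB/second figure seems to match 1 TB read as 2^40 bytes sorted in 62 seconds; that reading is an assumption, so the decimal (10^12 bytes) version is shown alongside it. The 100MB/sec per drive rate is the comment's own figure.

```python
ELAPSED_SECONDS = 62        # minute-sort time from the summary
DRIVE_WRITE_BPS = 100e6     # 100 MB/sec per drive, as assumed in the comment


def sort_rate(total_bytes, seconds=ELAPSED_SECONDS):
    """Return (GB per second, drives needed at DRIVE_WRITE_BPS)."""
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e9, bytes_per_second / DRIVE_WRITE_BPS


binary_gb_s, binary_drives = sort_rate(2**40)     # 1 TB as 2^40 bytes
decimal_gb_s, decimal_drives = sort_rate(10**12)  # 1 TB as 10^12 bytes

print(f"binary TB:  {binary_gb_s:.1f} GB/s, about {binary_drives:.0f} drives")
print(f"decimal TB: {decimal_gb_s:.1f} GB/s, about {decimal_drives:.0f} drives")
```

Either way the estimate lands in the range of 160 to 180 drives writing flat out, which is only a lower bound since a sort also has to read its input.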

      --
      "His name was James Damore."
  7. Java by cratermoon · · Score: 5, Insightful

    OK, so where are the "Java is slow" comments? o.O

    1. Re:Java by hey! · · Score: 4, Insightful

      Well, not to endorse the "Java is slow" meme or anything, but starting from a red light I can beat most cars across the intersection on my bike.

      Likewise if I had to drive across country in the shortest time possible, I'd choose a Ford F250 if the challenge stipulated I had to bring 3000 pounds of bricks with me.

      Speed is a very task specific notion.

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  8. 100 bytes, 10 byte keys. by eddy · · Score: 5, Informative

    Probably why the second sentence in the article is "All of the sort benchmarks measure the time to sort different numbers of 100 byte records. The first 10 bytes of each record is the key and the rest is the value."
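
To make the record layout concrete, here is a small Python sketch that builds 100-byte records and orders them by their first 10 bytes. The helper names and the use of os.urandom for test data are assumptions for illustration; the contest has its own data generator, which this does not try to imitate.

```python
import os

RECORD_SIZE = 100   # bytes per record, per the article
KEY_SIZE = 10       # the first 10 bytes are the key, the rest is the value


def make_records(count):
    """Generate `count` random 100-byte records (illustrative test data)."""
    return [os.urandom(RECORD_SIZE) for _ in range(count)]


def split_records(blob):
    """Cut a flat byte string into fixed-size records."""
    if len(blob) % RECORD_SIZE:
        raise ValueError("input is not a whole number of records")
    return [blob[i:i + RECORD_SIZE] for i in range(0, len(blob), RECORD_SIZE)]


def sort_records(records):
    """Order records by key only; the 90-byte value just rides along."""
    return sorted(records, key=lambda record: record[:KEY_SIZE])
```

At this record size, 1TB (10^12 bytes) works out to 10 billion records, which is why nobody does it with a single call to sorted().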

    --
    Belief is the currency of delusion.
  9. Re:What data? by Antisyzygy · · Score: 2, Insightful

    Things can be sorted by any of their properties. What is important is this software sorted data objects this quickly regardless of what property they were being ordered by. It beats all of the other sorting algorithms.

    --
    That brings me to an interesting point, / . is just "the ramblings of socially-inept, technology-literate news-mongers".
  10. Re:I'm sure that I can rock their scores by bmajik · · Score: 3, Funny

    I've asked lots of interview candidates to implement randomSort. They've never heard of it, so then I describe the algorithm.

    Watching their eyes go wide is the highlight of the interview, typically.

    Occasionally some person who has overcome their interview nervousness will, with eager honesty, try to implore to me that this is not a very good sort algorithm, and that much better ones are taught in universities these days.

    Good Times.

    --
    My opinions are my own, and do not necessarily represent those of my employer.
  11. Re:Great! It's open source! by berend botje · · Score: 4, Interesting

    Also, you can't patent software in Europe

    Not yet, but they are working on it. They tried to sneak it through by hiding it in the amendments of an agricultural bill. Luckily Poland kept watch and raised a stink about it.

    It's not over. There is too much money to be gained for that.

  12. The benefits of parallelizing everything! by Celeste R · · Score: 2, Informative

    According to a post on the Yahoo developer forums:

    2005 winner used only 80 cores and achieved it in 435 seconds. So with 800 cores what 2007 winner achieved is 297 seconds ?

    It's not only the number of cores; how well the logic uses parallel nodes for a particular task is also important.

    Hadoop won with 1820 cores (910 nodes w/ 2 cores each) at 209 seconds.

    I'm all for better sorting algorithms, but eventually the cost of parallelizing something overtakes the profit made. That being said, Hadoop's internal filesystem is made to be redundant, which is an important feature whenever you're dealing with large amounts of data.
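
One rough way to put numbers on the cost of parallelizing is to multiply cores by seconds for the three runs quoted above. This Python sketch uses only those quoted figures; it assumes the runs sorted the same amount of data and ignores that the cores come from different years and different hardware, so treat the "efficiency" column as a conversation starter rather than a measurement.

```python
# (label, cores, seconds) as quoted above
runs = [
    ("2005 winner", 80, 435),
    ("2007 winner", 800, 297),
    ("Hadoop", 1820, 209),
]

base_cores, base_seconds = runs[0][1], runs[0][2]


def summarize(cores, seconds):
    """Return (core-seconds, speedup vs. baseline, parallel efficiency)."""
    core_seconds = cores * seconds
    speedup = base_seconds / seconds          # how much faster than 2005
    scale = cores / base_cores                # how many times more cores
    return core_seconds, speedup, speedup / scale


for label, cores, seconds in runs:
    core_seconds, speedup, efficiency = summarize(cores, seconds)
    print(f"{label:12s} {core_seconds:7d} core-seconds, "
          f"{speedup:.2f}x faster, efficiency {efficiency:.2f}")
```

By this crude measure the 2007 run used 10 times the cores for about 1.5 times the speed, and the Hadoop run used almost 23 times the cores for about 2.1 times the speed.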

    Hadoop implements Google's MapReduce model, by the way, whereas the competition didn't. It's nice to see MapReduce being used in a more public eye.

    While better sorting algorithms -do- matter, I have to say that maintenance and running costs also matter.

    I'd also like to see how a compatible C version of this software compares with the Java version. However, as I see it, the Java overhead seems fairly limited; sorting code is wonderfully repetitive, and I'd expect that it's already been optimized a fair amount.

    By the way, the number of nodes and the hardware in the nodes for this Hadoop cluster is -optimized- for this contest.

    --
    There are no perfect answers, only the right questions. More questions at http://foresightandhindsight.blogspot.com/
  13. Re:What data? by rackserverdeals · · Score: 4, Funny

    They sorted 1TB in 62 seconds, and 1PB in 16.25 hours.

    This doesn't say anything if we don't know what kind of records were supposed to be sorted.

    It's amazing what you can learn if you actually RTFA.

    All of the sort benchmarks measure the time to sort different numbers of 100 byte records.

    If that's not good enough for you, post your email address and maybe someone will be kind enough to send you the 100TB and 1PB data files they used.

    --
    Dual Opteron < $600
  14. Not quite as impressive as it sounds by Sangui5 · · Score: 4, Informative

    Google's sorting results from last year are much faster; they did a petabyte in 362 minutes, or 2.8 TB/min. The minute sort didn't exist last year, but Google did 1TB in 68 seconds then, so I think it may be safe to assume that they could do 1 TB in under a minute this year. Google just hasn't submitted any of their runs to the competition.

    From the sort benchmark page, they list the winning run as Yahoo's 100TB run, leaving out the 1PB run; that implies the 1PB run didn't conform to the rules, or was late, or something.

    People have commented that this is a "who has the biggest cluster" competition; the sort benchmark also includes the 'penny' sort, which is how much can you sort for 1 penny of computer time (assuming your machine lasts 3 years), and 'Joule' sort, how much energy does it take you to sort a set amount of data. Not surprisingly, the big clusters appear to be neither cost efficient nor energy efficient.
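
The 'penny' sort budget described above is easy to work out: spread the machine's price over three years and see how many seconds one penny buys. In this Python sketch the $1,000 machine price is purely hypothetical, and the official rules may define the budget in more detail than the one-line description here.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
LIFETIME_YEARS = 3           # machine lifetime assumed by the benchmark


def penny_budget_seconds(machine_price_dollars):
    """Seconds of machine time that one penny buys over the lifetime."""
    lifetime_seconds = LIFETIME_YEARS * SECONDS_PER_YEAR
    price_in_pennies = machine_price_dollars * 100
    return lifetime_seconds / price_in_pennies


# Hypothetical $1,000 box: 94,608,000 seconds / 100,000 pennies
budget = penny_budget_seconds(1000)
print(f"one penny buys {budget:.2f} seconds")
```

A cheaper machine gets more seconds per penny, which is why this category rewards modest hardware rather than the biggest cluster.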

    1. Re:Not quite as impressive as it sounds by owenomalley · · Score: 4, Interesting

      In sorting a terabyte, Hadoop beat Google's time (62 versus 68 seconds). For the petabyte sort, Google was faster (6 hours versus 16 hours). The hardware is of course different. (from Yahoo's blog and Google's blog)

      Terabyte:
          Machines: Yahoo 1,407 Google 1,000
          Disks: Yahoo 5,628 Google 12,000
      Petabyte:
          Machines: Yahoo 3,658 Google 4,000
          Disks: Yahoo 14,632 Google 48,000

      Yahoo published their network specifications, but Google did not. Clearly the network speed is very relevant.

      The two take away points are: Hadoop is getting faster and it is closing in on Google's performance and scalability.
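
Since the hardware differs, it can help to normalize the times by machine and by disk. This Python sketch uses the machine and disk counts listed above and the times quoted in this thread (62 and 68 seconds, 16.25 hours, and 362 minutes for Google's petabyte run). Decimal units (1TB = 10^12 bytes) are an assumption, and nothing here accounts for network or CPU differences.

```python
# Figures quoted in this thread; sizes assume decimal TB/PB.
runs = {
    ("terabyte", "Yahoo"):  dict(size=1e12, seconds=62,           machines=1407, disks=5628),
    ("terabyte", "Google"): dict(size=1e12, seconds=68,           machines=1000, disks=12000),
    ("petabyte", "Yahoo"):  dict(size=1e15, seconds=16.25 * 3600, machines=3658, disks=14632),
    ("petabyte", "Google"): dict(size=1e15, seconds=362 * 60,     machines=4000, disks=48000),
}


def rates(run):
    """Return (total GB/s, MB/s per machine, MB/s per disk)."""
    bytes_per_second = run["size"] / run["seconds"]
    return (bytes_per_second / 1e9,
            bytes_per_second / run["machines"] / 1e6,
            bytes_per_second / run["disks"] / 1e6)


for (sort_name, who), run in runs.items():
    total, per_machine, per_disk = rates(run)
    print(f"{sort_name:9s} {who:7s} {total:6.1f} GB/s total, "
          f"{per_machine:5.1f} MB/s per machine, {per_disk:4.2f} MB/s per disk")
```

On these figures the per-machine rates for the terabyte sort are in the same ballpark, while for the petabyte sort Google's per-machine rate comes out at more than twice Yahoo's.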

  15. Re:Great! It's open source! by haruchai · · Score: 2, Interesting

    Why isn't this illegal - adding unrelated legislation to a bill? Is there anywhere in the world where this practice is not permitted, or better yet, prosecuted?

    --
    Pain is merely failure leaving the body
  16. Google Sort by jlebrech · · Score: 2, Interesting

    I'm looking forward to sorting my search results by Date, Title, Description, Author, etc.

  17. Re:Great! It's open source! by jimicus · · Score: 3, Informative

    Here in the UK, the patent office has been issuing software patents for some time in "anticipation" of them becoming legal at some point in the future.

    No, I don't understand that either.

  18. Re:Great! It's open source! by Halo1 · · Score: 4, Interesting

    Why isn't this illegal - adding unrelated legislation to a bill? Is there anywhere in the world where this practice is not permitted, or better yet, prosecuted?

    The GP is confusing a bunch of things. First, the Council of Ministers threw out all limiting amendments from the European Parliament and reached a Political Agreement on a shoddy text through backdoor maneuvering by Germany and the European Commission. That text would have turned the European Patent Office's practice of granting software patents into EU legislation.

    A Political Agreement has no juridical nor legislative value, but it has never happened that a political agreement was later on annulled and that negotiations were reopened. So also in this case, even though the German, Dutch, Spanish and Danish parliaments afterwards passed motions asking to reopen the discussions, the Council's bureaucrats did not want to do that because it "would undermine the efficiency of the decision making process".

    Anyway, once you have a Political Agreement (which is reached by the representatives of the ministries responsible for the matter at hand) and nobody "wants" to discuss it anymore, the agreement can be placed as an "A item" on any EU Council of Ministers meeting, since it only needs rubber stamping in that case. In the case of the Software Patents Directive, it appeared several times as an A item on the agenda of an Agriculture and Fisheries meeting (which is presumably where the GP's confusion stems from).

    In principle, there would have been nothing wrong with that, but in this case there was no actual political agreement, and in particular Poland was very unhappy with the way it had been treated. So 4 times in a row, Poland either had this "A item" removed from the agenda (sometimes at the last minute, because the responsible Polish minister had to be informed that they were again trying to get it through at a meeting he had no business with), or turned it into a "B item", which means that it can't be rubber stamped but that they first have to talk a bit about it (which nobody wanted to do).

    In the end it still did get approved, but that whole circus helped in convincing the EU Parliament to table a resolution asking the Commission to restart the directive's process and, when the Commission refused, to later squarely reject it.

    You can find some more of my thoughts on the Council's behaviour here.

    --
    Donate free food here
  19. Re:I'm sure that I can rock their scores by Anonymous Coward · · Score: 3, Funny

    Bogosort: for when you are paid by the hour, but aren't penalised for being late.

    with my luck, bogosort would get it right the first time.

  20. Re:I'm sure that I can rock their scores by Anpheus · · Score: 3, Funny

    No, he clearly changed roles from developer to Evil HR. He's probably directly subservient to Catbert.

  21. Re:Overlords - Trivia by e9th · · Score: 5, Informative

    Hadoop's name (and mascot) came from Doug [the project leader] Cutting's son's yellow stuffed elephant toy.

  22. Re:C++ port of Java Hadoop? by Yosho · · Score: 2, Informative

    It usually outperforms its Java sibling by an order of magnitude.

    Do you have any actual benchmarks for that? According to the benchmarks page at the official cLucene wiki, cLucene is roughly twice as fast as the Java Lucene at indexing, and it's only about 10% faster at the actual searching. That's not even close to an order of magnitude.

    --
    Karma: Terrifying (mostly affected by atrocities you've committed)
  23. Re:Use C++ and save 10x the hardware by Yosho · · Score: 2, Informative

    But how much of those libraries exist to achieve Java's religious beliefs on abstraction?

    Wow, how did this get modded insightful? For one, calling the design of a programming language a "religious belief", then asking a vague question about it without providing even a basis of an answer is just inflammatory.

    But the answer that anybody who knows what they're talking about will tell you is, none of them. Java's abstraction mechanisms are built into the language. None of the standard libraries are necessary to support it. They take advantage of it, of course, and you'd be crazy to not take advantage of one of the language's features. Try taking a look at a tree representation of all of the classes in the standard library. The vast majority of classes are not more than one or two levels down from the top-level Object. The things that are deeper are typically things that are complex in any language -- CORBA, GUI toolkits, etc. It certainly looks much cleaner than many graphs I've seen of C++ libraries that abused multiple inheritance.

    --
    Karma: Terrifying (mostly affected by atrocities you've committed)
  24. Re:Great! It's open source! by jjohnson · · Score: 2, Interesting

    There was an episode of the Simpsons where Springfield is going to be destroyed by a meteor. Congress meets to quickly pass legislation to fund the evacuation of the city. At the last moment, a Congressman steps up to the podium and says "I'd like to add a rider providing $30 million for the perverted arts". The bill is defeated.

    It's funny because it's true.

    --
    Anyone who loves or hates any language, platform, or manufacturer, doesn't know what they're talking about.
  25. Re:Great! It's open source! by turbidostato · · Score: 3, Funny

    "Why isn't this illegal"

    Because they made it legal by passing it on a Totally Unrelated Bill.