Slashdot Mirror


Open Source Solution Breaks World Sorting Records

allenw writes "In a recent blog post, Yahoo's grid computing team announced that Apache Hadoop was used to break the current world sorting records in the annual GraySort contest. It topped the 'Gray' and 'Minute' sorts in the general purpose (Daytona) category. They sorted 1TB in 62 seconds, and 1PB in 16.25 hours. Apache Hadoop is the only open source software to ever win the competition. It also won the Terasort competition last year."

139 comments

  1. Overlords by Narpak · · Score: 3, Funny

    I for one welcome our new datasorting overlords!

    1. Re:Overlords by Jurily · · Score: 4, Funny

      I for one welcome our new datasorting overlords!

      With a name like Apache Hadoop, I wouldn't be surprised if they came from Star Wars.

    2. Re:Overlords by rackserverdeals · · Score: 3, Informative

      I wouldn't be surprised if they came from Star Wars.

      Actually, it came from Google. Sorta.

      Apache Hadoop is an implementation of MapReduce, the technique Google uses in its search engine. I believe the details came from a paper Google released on its implementation of MapReduce.

      --
      Dual Opteron < $600
    3. Re:Overlords by daemonburrito · · Score: 3, Informative

      "MapReduce: Simplified Data Processing on Large Clusters." Jeffrey Dean and Sanjay Ghemawat, OSDI '04.

      They wrote about it in Beautiful Code, too (great book). MapReduce isn't complex, in fact the name comes from a feature that a lot of functional languages provide (yeah, I know, it's not exactly the same thing).

      There are many implementations of it. The wikipedia article is pretty informative: http://en.wikipedia.org/wiki/MapReduce. I didn't know about "BashReduce"... Heh.
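
For anyone who hasn't met the functional-language version of the idea, here is a minimal single-process sketch in Python. The word-count task and the names `map_phase`, `shuffle`, and `reduce_phase` are assumptions chosen for illustration; this is not Hadoop's API or the code from the Google paper, just the general shape of map, group by key, reduce.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Fold all values for one key into a single count."""
    return reduce(lambda a, b: a + b, values, 0)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return {key: reduce_phase(values) for key, values in shuffle(pairs).items()}
```

In a real cluster the map calls and the per-key reduce calls would run on different machines; here everything runs in one process so the data flow is easy to follow.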

    4. Re:Overlords by Briareos · · Score: 1

      With a name like Apache Hadoop, I wouldn't be surprised if they came from Star Wars.

      Apache Hadouken probably would have packed even more of a punch... :D

      np: DJ Walkman - Milk Und Herring (Milk Und Herring)

      --

      "I'm not anti-anything, I'm anti-everything, it fits better." - Sole

    5. Re:Overlords by davester666 · · Score: 3, Funny

      Fastest implementation of BubbleSort EVER!

      --
      Sleep your way to a whiter smile...date a dentist!
    6. Re:Overlords by ModMeFlamebait · · Score: 5, Funny

      datasorting for I new one our overlords! welcome

      --
      Pavlov. Does this name ring a bell?
    7. Re:Overlords by Anonymous Coward · · Score: 1, Funny

      Actually, it came from Google. Sorta.

      i actually like the name 'Google Sorta' better than Apache Hadoop

  2. I'm sure that I can rock their scores by Anonymous Coward · · Score: 1, Funny

    Just give me a few minutes to patch together a bubblesort from my highschool Pascal class. I'll show them record speed!

    1. Re:I'm sure that I can rock their scores by Thinboy00 · · Score: 5, Funny

      My sort will totally beat yours!

      --
      $ make available
    2. Re:I'm sure that I can rock their scores by Midnight+Thunder · · Score: 2, Funny

      Bogosort: for when you are paid by the hour, but aren't penalised for being late.

      --
      Jumpstart the tartan drive.
    3. Re:I'm sure that I can rock their scores by bmajik · · Score: 3, Funny

      I've asked lots of interview candidates to implement randomSort. They've never heard of it, so then I describe the algorithm.

      Watching their eyes go wide is the highlight of the interview, typically.

      Occasionally some person who has overcome their interview nervousness will, with eager honesty, try to implore to me that this is not a very good sort algorithm, and that much better ones are taught in universities these days.

      Good Times.

      --
      My opinions are my own, and do not necessarily represent those of my employer.
    4. Re:I'm sure that I can rock their scores by __aaclcg7560 · · Score: 1

      Just give me enough money to remember my Logo sort algorithm from grade school. The turtle will always be the fastest! :P

    5. Re:I'm sure that I can rock their scores by Midnight+Thunder · · Score: 1

      I've asked lots of interview candidates to implement randomSort. They've never heard of it, so then I describe the algorithm.

      Did you change roles from developer to HR?

      --
      Jumpstart the tartan drive.
    6. Re:I'm sure that I can rock their scores by Anonymous Coward · · Score: 3, Funny

      Bogosort: for when you are paid by the hour, but aren't penalised for being late.

      with my luck, bogosort would get it right the first time.

    7. Re:I'm sure that I can rock their scores by Anpheus · · Score: 3, Funny

      No, he clearly changed roles from developer to Evil HR. He's probably directly subservient to Catbert.

    8. Re:I'm sure that I can rock their scores by mwvdlee · · Score: 1

      In what way exactly does this help weed out the bad from the good candidates?

      Any candidate that actually tries and successfully implements the algorithm is someone you DON'T want on your team.
      Any candidate that runs off screaming is one you DO want, but they're already gone.

      --
      Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
    9. Re:I'm sure that I can rock their scores by prockcore · · Score: 1

      If HR is actually conducting interviews, instead of just vetting resumes, then something has gone seriously wrong at your company.

    10. Re:I'm sure that I can rock their scores by bmajik · · Score: 1

      It's a good question for university students because they've been thinking about sorting problems "recently". So asking someone how to implement a given sort is a fair question. This has the upside of not being one they've had to implement, and it doesn't work like anything they have done. Of course the implementation is trivial; that's not the point. The solution is just the gateway into the rest of the conversation. The follow up questions are

      "is it guaranteed to return? Why or why not?"

      "what is the time complexity?"

      You get a wider spectrum of answers than you might imagine when you start picking into the follow up questions.

      So it's a question that scales well -- there are people that can't get the implementation, there are people that can't decide what they mean about its completion, etc. And no, it's not a "serious" technical question, but I do find it useful.
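
The two follow-up questions can be explored by brute force on tiny inputs. The sketch below assumes the shuffle is uniform and uses made-up helper names (`success_probability`, `expected_shuffles`). For n distinct items exactly one of the n! orderings is sorted, so each shuffle succeeds with probability 1/n! and the expected number of shuffles is n!. Since each check is a linear pass, the expected work is on the order of n * n!. The sort finishes with probability 1, but no fixed number of shuffles is guaranteed to be enough.

```python
from fractions import Fraction
from itertools import permutations


def is_sorted(seq):
    """True when every element is <= its successor."""
    return all(a <= b for a, b in zip(seq, seq[1:]))


def success_probability(items):
    """Fraction of all orderings of `items` that are already sorted."""
    orderings = list(permutations(items))
    hits = sum(1 for ordering in orderings if is_sorted(ordering))
    return Fraction(hits, len(orderings))


def expected_shuffles(items):
    """Mean of a geometric distribution with the success probability above."""
    return 1 / success_probability(items)
```

Enumerating permutations is only practical for a handful of elements, which is rather the point of the interview question.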

      --
      My opinions are my own, and do not necessarily represent those of my employer.
    11. Re:I'm sure that I can rock their scores by Anonymous Coward · · Score: 0

      Any candidate that actually tries and successfully implements the algorithm is someone you DON'T want on your team.
      Any candidate that runs off screaming is one you DO want, but they're already gone.

      Cute, but stupid.

      The really smart candidates are likely to realize that this guy knows other sorting algorithms. If they come right out and ask "Do you actually use this algorithm for anything here?" of course he will tell them "no."

      The best interview questions I have seen are simple to code but give the interviewee room for expression. There's only one (good) way to write SwapInt() in C; it's too trivial to be interesting.

      In an interview I once had to write "reverse a linked list". I had never actually coded that before, and I found the little details to be trickier than I had expected. Good times.
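
For reference, one common iterative way to do it, sketched in Python rather than C. The `Node` class and the helper names are assumptions for illustration. The fiddly detail is saving the next pointer before overwriting it.

```python
class Node:
    """One cell of a singly linked list."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def reverse(head):
    """Reverse the list in place and return the new head."""
    prev = None
    while head is not None:
        following = head.next  # save the rest of the list before relinking
        head.next = prev       # point this node back at the reversed part
        prev = head
        head = following
    return prev


def from_list(values):
    """Build a linked list from a Python list (helper for trying it out)."""
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head


def to_list(head):
    """Collect the node values into a Python list (helper for trying it out)."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out
```

The empty list and the single-node list fall out of the loop without special cases, which is usually where hand-written versions go wrong.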

    12. Re:I'm sure that I can rock their scores by The_mad_linguist · · Score: 1

      Oh, it's pretty easy to optimize, though, if you subscribe to the many-worlds interpretation of quantum mechanics.

      Sort it randomly, test it, and if it's wrong, destroy the universe.

      In any observable universe, it sorts correctly 100% of the time.

    13. Re:I'm sure that I can rock their scores by Anonymous Coward · · Score: 0

      Cute, but it's not how MWI works.

      As the first qubit is resolved the universe branches into a 1 and a 0 universe, and the tester's worldline has already followed one of the branches by the time she knows the result. This continues in succession for each bit. The other universes are inaccessible.

      Also note that all the other quantum interactions also have their own branchings which the tester will have followed, and in a warm massive object like a human being in a lab full of STP air and warm equipment with blinkinglights, there's a lot of quantum interactions.

      MWI is just a tracing tool for following an event's history; you know a posteriori which branch you must have followed, but you can't be certain about which precise world you are in at any given time because your certainty is rooted in warm massive systems (i.e., your brain).

      MWI's approach to resolving quantum correlation problems is to insist that there are no quantum paradoxes, and that every event has its own consistent history. The concept of a real physical creation of parallel universes is a tool to assist with analysis, rather than a physical aspect of MWI. MWI does not deviate in any way from QM; it makes no testable claims that differentiate it from other common interpretations of QM (like Copenhagen or Decoherence); it is merely a different way of thinking about the same results, including rejecting the existence of quantum paradoxes. MWI is falsifiable in the same way as Copenhagen: if QM is falsified, these interpretations are also false. The interpretations themselves are not currently amenable to experimental comparison.

      MWI as a toolset has some advantages in terms of mechanically producing solutions to some problems (in particular, Lorentz invariance is easy to deal with) and some disadvantages in terms of explanatory power (it traces results but does not explain them; several other interpretations lean more towards explanation of events as opposed to tracing them, and so usually mechanically producing the latter can become very difficult as a result).

      Most objections to and misunderstandings of MWI come from a bias towards explaining why one can only say a limited amount about the evolution of a microscopic system under study; MWI starts with an information-hiding mechanism (other universes are always inaccessible) that is useful even if it is obviously non-physical.

      (Again, a physical branched universe that does interact in any way with another would invalidate a lot of physics, including QM, which would also invalidate MWI (and all the other common interpretations)).

      MWI is a local theory, so there is no mechanism for destroying a whole universe at a speed greater than the relativistic speed limit (c). MWI is a realistic theory so there are also conservation problems associated with destroying a universe. Moreover, MWI has a "superconservation" which requires that the sum of all possible universes has a fixed energy that is always conserved (this is how statistics (like Fermi-Dirac) are recovered in MWI; the distribution of energy among the many universes is weighted by probability).

      Although there may be mechanisms that could effectively destroy a universe, no local theory consistent with relativity can do it instantly, and no universe in which the metric expansion of space has a net acceleration can be entirely destroyed because of the resulting horizon.

      Since MWI is a local theory that is consistent with relativity, and since there is good evidence for Cosmic Inflation, and for Lambda-CDM, your idea of destroying the whole universe at all is pretty much SOL.

      MWI however does allow for quantum suicide, if your goal is to present you with a sorted list or effective oblivion. If you get a sorted list you are pretty lucky. If you don't, someone even luckier will probably be wiping your brain matter off the floor while muttering about your misunderstanding of universe branching.

      Let us know how it goes.

    14. Re:I'm sure that I can rock their scores by tkinnun0 · · Score: 1
      Ah, the good old

      Collections.reverse(myLinkedList);

      If you forget that tricky little s there, it will instead refer to the interface Collection, which of course doesn't have a static reverse method.

    15. Re:I'm sure that I can rock their scores by Repton · · Score: 1

      Hah. Easy-peasy in python!

      while lst != sorted(lst): random.shuffle(lst)

      --
      Repton.
      They say that only an experienced wizard can do the tengu shuffle.
    16. Re:I'm sure that I can rock their scores by RivieraKid · · Score: 1
      I guess you're American? In the UK, HR always conducts one interview, usually the final interview to vet your "soft skills". Over here, HR don't know enough about most jobs to vet the CV, the recruitment agent does that, and they work for a totally different company usually.

      Of course, you tell them what they want to hear and nobody minds. The HR interview is usually just a formality, but it's not good to "fail" the HR interview in most companies.

      --
      "Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves
    17. Re:I'm sure that I can rock their scores by svick · · Score: 1

      Hah. Easy-peasy in python!

      while lst != sorted(lst): random.shuffle(lst)

      Isn't it cheating when you implement a slow sorting algorithm using a fast one?
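
One way around the objection is to check the order with a single linear pass instead of calling a real sort. This is only a sketch; the optional `rng` parameter and the shuffle counter are additions for illustration, not part of the one-liner being discussed.

```python
import random


def is_sorted(lst):
    """Linear scan: no comparison sort is hiding in here."""
    return all(lst[i] <= lst[i + 1] for i in range(len(lst) - 1))


def bogosort(lst, rng=random):
    """Shuffle `lst` in place until it is sorted; return how many shuffles it took."""
    shuffles = 0
    while not is_sorted(lst):
        rng.shuffle(lst)
        shuffles += 1
    return shuffles
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which is about the only predictable thing this algorithm offers.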

  3. Is it settled? by Jah-Wren+Ryel · · Score: 4, Funny

    So, it appears they have finally sorted out whether open source beats proprietary.

    --
    When information is power, privacy is freedom.
    1. Re:Is it settled? by x78 · · Score: 1

      I wonder if they _sorted_ it themselves..

      --
      Don't panic
    2. Re:Is it settled? by marcosdumay · · Score: 1

      They did, but they took a lot more time than the FOSS people.

    3. Re:Is it settled? by nacturation · · Score: 1

      So... open sorts wins?

      --
      Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
  4. When's it going to be 1.0? by AlexBirch · · Score: 3, Insightful

    If it's winning competitions at 0.20, when will they release it?

    1. Re:When's it going to be 1.0? by SunTzuWarmaster · · Score: 1

      Software is done when it's good and ready!

    2. Re:When's it going to be 1.0? by Anonymous Coward · · Score: 5, Informative

      It's 0.20 but it's stable and production ready already. I use it with HBase and it scales awesomely.

    3. Re:When's it going to be 1.0? by AlexBirch · · Score: 1

      It'd be nice to put it into a production app...

    4. Re:When's it going to be 1.0? by AlexBirch · · Score: 1

      Isn't 1.0 production for most software jargon?

    5. Re:When's it going to be 1.0? by Anonymous Coward · · Score: 0

      Just like beta is only for pre-production software used by a very limited set of users...oh wait.

    6. Re:When's it going to be 1.0? by nacturation · · Score: 1

      Good enough for Amazon to use and sell: http://aws.amazon.com/elasticmapreduce/

      --
      Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
    7. Re:When's it going to be 1.0? by BikeHelmet · · Score: 2, Insightful

      Isn't 1.0 production for most software jargon?

      Nah, that's 6.0

      MS DOS 6.0
      IE 6.0
      Visual Studio 6.0

      I doubt anybody would want to use an earlier version than that!

    8. Re:When's it going to be 1.0? by lostguru · · Score: 1

      I doubt anyone would want to use those versions either

      --
      Jayne: "These are stone killers, little man. They ain't cuddly like me."
      98% of America's teens drink alcohol, smok
    9. Re:When's it going to be 1.0? by TheRaven64 · · Score: 2, Insightful

      You realise, I hope, that Vista is Windows NT 6.0...

      --
      I am TheRaven on Soylent News
  5. I'm only wish more programs were open source by blahplusplus · · Score: 1, Interesting

    ... truth be told, a lot of good engineering could happen if many of people's favorite commercial applications had the source distributed with them; a lot of old games, for instance, could be updated and maintained.

    I think what holds the progress of open source back is interesting projects that exist that people want to work on but are locked away under corporate lock and key.

    1. Re:I'm only wish more programs were open source by Bert64 · · Score: 1

      People maintaining old games is not what the companies that produced those games want... They would rather sell you new games, or sell the old ones to you again..
      Corporations will always put their own interests first, and those interests will often be detrimental to everyone else.

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  6. Great! It's open source! by erroneus · · Score: 1

    But has anyone patented it yet? Patents trump copyright after all.

  7. What data? by Tinctorius · · Score: 1, Insightful

    They sorted 1TB in 62 seconds, and 1PB in 16.25 hours.

    This doesn't say anything if we don't know what kind of records were supposed to be sorted.

    1. Re:What data? by Antisyzygy · · Score: 2, Insightful

      Things can be sorted by any of their properties. What is important is this software sorted data objects this quickly regardless of what property they were being ordered by. It beats all of the other sorting algorithms.

      --
      That brings me to an interesting point, / . is just "the ramblings of socially-inept, technology-literate news-mongers".
    2. Re:What data? by rackserverdeals · · Score: 4, Funny

      They sorted 1TB in 62 seconds, and 1PB in 16.25 hours.

      This doesn't say anything if we don't know what kind of records were supposed to be sorted.

      It's amazing what you can learn if you actually RTFA.

      All of the sort benchmarks measure the time to sort different numbers of 100 byte records.

      If that's not good enough for you, post your email address and maybe someone will be kind enough to send you the 100TB and 1PB data files they used.

      --
      Dual Opteron < $600
    3. Re:What data? by allenw · · Score: 1

      I see Hadoop Summit door prizes.

    4. Re:What data? by VeNoM0619 · · Score: 1

      They sorted 1TB in 62 seconds, and 1PB in 16.25 hours.

      This doesn't say anything if we don't know what kind of records were supposed to be sorted.

      It's amazing what you can learn if you actually RTFA.

      If that's not good enough for you, post your email address and maybe someone will be kind enough to send you the 100TB and 1PB data files they used.

      It's amazing what you can learn if you actually RTFS, or read the comment, or read the quote you just quoted: 100TB != 1TB. We can be pedantic all day though if you like. Mr. Smug "I can RTFA"

      --
      Disclaimer: I am not god.
      We may not be created equal
      But we can be treated equal.
  8. Cool by bigdaddy25fb · · Score: 1

    Gonna pass this on to my boss, hopefully now we can move off of our terrible, terrible proprietary sorting software...Good to see open source making inroads in so many areas!!!

    1. Re:Cool by PiSkyHi · · Score: 1

      Maybe your boss could write to google and let them know just how special sorting is, I'm sure they'd love to hear it.

  9. They won the "Who has the most moneys" award. by nathan.fulton · · Score: 5, Insightful

    ...this cluster had nearly 4 times the number of nodes as the previous records. This competition was testing who had more nodes working together the best, but when you have so many more nodes, it would be hard not to top other clusters.

    1. Re:They won the "Who has the most moneys" award. by Rockoon · · Score: 3, Interesting

      I was doing some back-of-the-envelope, and they are sorting 17.7GB/second, which at a minimum would require 177 HD's if each drive can write 100MB/sec.

      If it's not written to disk, then there is no achievement here (you don't perform 1 minute+ sorts and then throw the result away in real-world scenarios)
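
The back-of-the-envelope figure can be reproduced in a few lines. This is only a sketch: the 100 MB/s per-drive write rate is the assumption stated in the comment, the function name `required_drives` is made up, and treating 1 TB as 2^40 bytes is a guess at how 17.7 GB/s was obtained. With 1 TB taken as 10^12 bytes the rate comes out nearer 16.1 GB/s.

```python
import math


def required_drives(total_bytes, seconds, drive_bytes_per_sec=100e6):
    """Return (aggregate bytes/sec, minimum whole drives to sustain that write rate)."""
    rate = total_bytes / seconds
    return rate, math.ceil(rate / drive_bytes_per_sec)


# 1 TB sorted in 62 seconds, under both readings of "TB".
binary_rate, binary_drives = required_drives(2**40, 62)
decimal_rate, decimal_drives = required_drives(10**12, 62)

print(f"binary TB:  {binary_rate / 1e9:.1f} GB/s, at least {binary_drives} drives")
print(f"decimal TB: {decimal_rate / 1e9:.1f} GB/s, at least {decimal_drives} drives")
```

Either way the count only covers writing the output once; reading the input and any intermediate spills would add to it.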

      --
      "His name was James Damore."
    2. Re:They won the "Who has the most moneys" award. by Anonymous Coward · · Score: 0

      It still shows that it scales!

    3. Re:They won the "Who has the most moneys" award. by Anonymous Coward · · Score: 0

      There were over 15,000 disks in the cluster.

    4. Re:They won the "Who has the most moneys" award. by marcosdumay · · Score: 1

      Yep, when you have a problem that big, you want scalability before anything else. How well could the other candidates use a machine that big?

    5. Re:They won the "Who has the most moneys" award. by Anonymous Coward · · Score: 0

      If it's not written to disk, then there is no achievement here (you don't perform 1 minute+ sorts and then throw the result away in real-world scenarios)

      That's not true at all. I deal with a number of processes that have intermediate sort steps that do not require that the data touch disk.

    6. Re:They won the "Who has the most moneys" award. by gardyloo · · Score: 0

      So you have a TB of RAM?

    7. Re:They won the "Who has the most moneys" award. by drizek · · Score: 1

      SO Open Source isn't all that great at sorting, but since they had the biggest toys, it obviously shows that it works as a business model.

    8. Re:They won the "Who has the most moneys" award. by Anonymous Coward · · Score: 0

      I would assume they wouldn't want to introduce storage bottlenecks as a variable. You can buy the storage needed to support what your solution can do.

      If the different teams were using different storage it would kind of destroy the results of comparing Apache Hadoop to something else. Infiniband could easily carry that amount of throughput to a single node. As far as where or how you want to store it permanently that really isn't relevant to the results.

    9. Re:They won the "Who has the most moneys" award. by Anonymous Coward · · Score: 0

      sed -e "s/a single node/the nodes/g"

    10. Re:They won the "Who has the most moneys" award. by Anonymous Coward · · Score: 1, Interesting

      You don't _always_ need that much main memory -- there's a concept of something called a data-flow architecture.

      The old Tandem (I think HP calls it Neoview now) does this w/ their SQL engine. Of course, you would likely still need the last step to use temporary/overflow files on disk but the intermediate steps could potentially be done w/o data touching disk -- depends on the generated query plan or how you are "reducing" the problem.

    11. Re:They won the "Who has the most moneys" award. by Anonymous Coward · · Score: 0

      Infiniband could easily carry that amount of throughput to a single node.

      *cough* bullshit *cough*

    12. Re:They won the "Who has the most moneys" award. by Anonymous Coward · · Score: 0

      If you RTFA, Yahoo was using 5,628 disks for the terabyte sort and 14,632 disks for the petabyte sort.

    13. Re:They won the "Who has the most moneys" award. by TheRaven64 · · Score: 1

      That's the problem. All of the other groups used big clusters, but with different interconnects and different hardware. All the result shows is that this algorithm on this hardware performs better than other algorithms on other hardware. It doesn't show you whether the hardware or the algorithm was the determining factor.

      Ideally, each group should have run all of the competing algorithms on their own hardware so you would have a two-dimensional data set for the results and be able to see whether an algorithm benefited from a specific hardware setup (e.g. is it computer or interconnect limited) and which one performed best overall.

      --
      I am TheRaven on Soylent News
    14. Re:They won the "Who has the most moneys" award. by Anonymous Coward · · Score: 0

      Something like this perhaps : Texas Memory Systems demos 1TB RAM disk. "24GByte/sec throughput".

    15. Re:They won the "Who has the most moneys" award. by Anonymous Coward · · Score: 0
      Two things:

      A) You missed the point entirely -- our tasks have intermediate sort steps that do not require the result of a number of sort processes be written to disk. It slows the process down considerably when the sort must be an external sort.

      B) We have machines with .5TB RAM, some of it is used for sorting.

    16. Re:They won the "Who has the most moneys" award. by Anonymous Coward · · Score: 0

      If the business model is having the biggest, most expensive setup, then yeah, I guess it does.

  10. Java by cratermoon · · Score: 5, Insightful

    OK, so where are the "Java is slow" comments? o.O

    1. Re:Java by French31 · · Score: 1
      --
      They who would give up an essential liberty for temporary security, deserve neither liberty or security. --Ben Franklin
    2. Re:Java by dodobh · · Score: 1

      Waiting for similar hardware to become available for other languages.

      I think Tim Bray is on the right track with his widefinder idea.

      See Widefinder 1 and Widefinder 2 for details.

      --
      I can throw myself at the ground, and miss.
    3. Re:Java by hey! · · Score: 4, Insightful

      Well, not to endorse the "Java is slow" meme or anything, but starting from a red light I can beat most cars across the intersection on my bike.

      Likewise if I had to drive across country in the shortest time possible, I'd choose a Ford F250 if the challenge stipulated I had to bring 3000 pounds of bricks with me.

      Speed is a very task specific notion.

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    4. Re:Java by asserted · · Score: 1

      well, actually... i sort of do. look, google has already reported on sorting 1 PB in 6 hours on a 4000 node cluster. their implementation is in C++. yahoo's result is 16.25 hours on a 3800 node cluster and hadoop is written in java. even taking into account the 200 node difference, yahoo's implementation is ~2.6 times slower than google's. it may not all be java's fault, but still.
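
The ~2.6 figure is a node-hours comparison, which can be checked quickly. A rough sketch using only the numbers quoted in the comment; the helper name `node_hours` is made up, and it assumes every node is interchangeable, which is a big assumption.

```python
def node_hours(hours, nodes):
    """Total machine time consumed, ignoring any differences between nodes."""
    return hours * nodes


yahoo = node_hours(16.25, 3800)  # Hadoop run, figures as quoted in the comment
google = node_hours(6.0, 4000)   # Google run, figures as quoted in the comment
slowdown = yahoo / google

print(f"Yahoo: {yahoo:.0f} node-hours, Google: {google:.0f} node-hours")
print(f"Node-normalised slowdown: {slowdown:.2f}x")
```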

    5. Re:Java by rackserverdeals · · Score: 1

      Something doesn't seem right.

      Yahoo's cluster had 3,800 nodes with 4 disks per node giving it roughly 15,200 drives plus or minus the dead nodes/drives in the cluster.

      The Google cluster had 4,000 nodes with 48,000 hard drives. 12 drives per node doesn't sound like the typical Google servers I've seen. That one looks like 2-4 drives. This other video seems to show the storage node which looks like it has 5-10 drives.

      The reason I bring up drives is that sorting 1PB likely involves HD access; the more drives, the higher the I/O throughput.

      Whatever the case, the nodes seem to be vastly different and making a comparison based on the number of nodes doesn't seem appropriate.
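
To put numbers on that, here is a small sketch that normalises by drives instead of nodes. It uses only the figures quoted in this thread (3,800 nodes with 4 drives each and 16.25 hours for Yahoo, 4,000 nodes with 48,000 drives and 6 hours for Google); the `Cluster` class is a made-up container, and dead drives are ignored.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    nodes: int
    drives: int
    hours: float  # time reported for the 1 PB sort

    @property
    def drives_per_node(self):
        return self.drives / self.nodes

    @property
    def drive_hours(self):
        """Drive count times elapsed time: a crude measure of I/O capacity used."""
        return self.drives * self.hours


yahoo = Cluster("Yahoo", nodes=3800, drives=3800 * 4, hours=16.25)
google = Cluster("Google", nodes=4000, drives=48000, hours=6.0)
ratio = yahoo.drive_hours / google.drive_hours

for cluster in (yahoo, google):
    print(f"{cluster.name}: {cluster.drives_per_node:.0f} drives/node, "
          f"{cluster.drive_hours:.0f} drive-hours")
print(f"Yahoo/Google drive-hours: {ratio:.2f}")
```

By this crude measure the Yahoo run used fewer drive-hours than the Google one, which shows how much the verdict depends on what you divide by.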

      --
      Dual Opteron < $600
    6. Re:Java by AlexBirch · · Score: 1

      Let's compare oranges and apples.
      After all, a node with 2 GB of memory is equivalent to one with 4 GB, because a node is a node. And Myrinet is the same as Ethernet, right? What about gigabits?
      Or, if we really wanted to compare the numbers, we would ensure they sort the same data on the same hardware.

    7. Re:Java by dodobh · · Score: 1

      Just use a Boeing 747. You drive so fast that your wheels don't touch the ground.

      (That's like throwing RAM at an IO limited problem).

      --
      I can throw myself at the ground, and miss.
    8. Re:Java by Anonymous Coward · · Score: 0

      Everybody knows those kinds of comments are redundant given common knowledge.

    9. Re:Java by Timmmm · · Score: 1

      Running mathematical algorithms is usually a very small part of what the average desktop program does. Java can be very fast at this since it essentially produces the same assembly as the equivalent C program. For example something like

      for (int i = 0; i < 1000; ++i) total += i;

      is going to produce exactly the same machine code in Java or C. I think the difference comes when you start writing real-world desktop programs. These make use of things like vectors, strings, function calls, etc., which seem to slow Java programs down. I'd also argue that the Java programming paradigm and API force you to do 'slow things'. Consider the mess that is String/StringBuilder compared to std::string or QString.

      As evidence I present the only non-trivial Java programs I've actually used:

      Azureus: I think this one isn't actually too bad, but many people complain about its memory use/speed.
      Netbeans: Awesome program but very slow.
      Eclipse: Tried this once. It was excruciatingly slow. Slow enough to be unusable.

      Compare that to roughly equivalent C++ programs:

      KTorrent/uTorrent: Admittedly not as large or advanced but they are both much much faster and smaller.
      MS Visual Studio/KDevelop: Again, much much faster.

      No doubt the Java fan-boys are going to say "It's all swing's fault!" or "That's not a very objective argument." to which I would say:

      * Yes it probably is largely the fault of the Java API being really really annoying, but there's no way to avoid that is there?
      * It's also a very hard thing to test. The reason benchmarks are all the misleading mathematical algorithm type is because they are quick to write. No-one's going to rewrite eclipse in C++. We can only compare roughly equivalent large programs.

  11. It was only a matter of time by Anonymous Coward · · Score: 0

    Dang nabbit. I was depending on the World Sorting Records to be my reference for how people sort in other countries than my own. Avoid open source, next time it will break your nose.

  12. 100 bytes, 10 byte keys. by eddy · · Score: 5, Informative

    Probably why the second sentence in the article is "All of the sort benchmarks measure the time to sort different numbers of 100 byte records. The first 10 bytes of each record is the key and the rest is the value."
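
A toy in-memory version of that record layout, sketched in Python. The constant and function names are assumptions for illustration, and a real entry sorts data far too large for one machine's memory; this only shows what "100 byte records, first 10 bytes are the key" means in practice.

```python
RECORD_SIZE = 100  # bytes per record, per the article
KEY_SIZE = 10      # leading bytes of each record that form the sort key


def split_records(blob):
    """Cut a byte string into fixed-size 100-byte records."""
    if len(blob) % RECORD_SIZE != 0:
        raise ValueError("input is not a whole number of records")
    return [blob[i:i + RECORD_SIZE] for i in range(0, len(blob), RECORD_SIZE)]


def sort_records(blob):
    """Order records by their 10-byte key; the 90-byte value just rides along."""
    records = split_records(blob)
    records.sort(key=lambda record: record[:KEY_SIZE])
    return b"".join(records)
```

Python's `bytes` compare lexicographically by byte value, so slicing off the first ten bytes is enough to get a key ordering.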

    --
    Belief is the currency of delusion.
  13. Re:Great! It's open source! by buchner.johannes · · Score: 1

    You can't patent Apache 2.0 licensed stuff.
    Also, you can't patent software*.

    *in Europe

    --
    NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
  14. Re:Great! It's open source! by berend+botje · · Score: 4, Interesting

    Also, you can't patent software in Europe

    Not yet, but they are working on it. They tried to sneak it through by hiding it in the amendments of an agricultural bill. Luckily Poland kept watch and raised a stink about it.

    It's not over. There is too much money to be gained for that.

  15. C++ port of Java Hadoop? by frik85 · · Score: 1

    Java doesn't fit for my environment, does someone know of an open source C++ port of Java Hadoop?

    If there is no such port, is anyone interested in starting a port? An example of a Java port is "cLucene", a C++ port of the Java Lucene search engine. It usually outperforms its Java sibling by an order of magnitude.

    --
    My favourite operating system is ReactOS; binary compatible to WinNT series :P
    1. Re:C++ port of Java Hadoop? by vbraga · · Score: 1

      It would be interesting to do it.

      I'm willing to volunteer if you're starting a port. Just mail me (my email is on my user page).

      --
      English is not my first language. Corrections and suggestions are welcome.
    2. Re:C++ port of Java Hadoop? by Yosho · · Score: 2, Informative

      It usually outperforms its Java sibling by an order of magnitude.

      Do you have any actual benchmarks for that? According to the benchmarks page at the official cLucene wiki, cLucene is roughly twice as fast as the Java Lucene at indexing, and it's only about 10% faster at the actual searching. That's not even close to an order of magnitude.

      --
      Karma: Terrifying (mostly affected by atrocities you've committed)
    3. Re:C++ port of Java Hadoop? by owenomalley · · Score: 1

      There isn't a C++ port of Hadoop's map/reduce, but there is a C++ interface to the Java code. It is used by Yahoo's WebMap, which is the largest Hadoop application. It lets you write your mapper and reducer code as C++ classes.

      The Hadoop Distributed File System (HDFS) also has C bindings to let C programs access the system. If you want another alternative, the Kosmos File System (KFS) is also a distributed file system and was written in C++. Hadoop includes bindings for HDFS and KFS, so that the application code can transparently use either at run time depending on the path (hdfs://server/path instead of kfs://server/path).
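
      The idea of picking a file system from the path can be sketched like this. It is a toy in Python, not Hadoop's actual API: `HdfsClient`, `KfsClient` and `client_for` are hypothetical names invented for the example, and the only real library call is `urllib.parse.urlparse`.

```python
from urllib.parse import urlparse


# Hypothetical stand-ins for real file system clients; illustration only.
class HdfsClient:
    name = "HDFS"


class KfsClient:
    name = "KFS"


# Map each URI scheme to the client class that handles it.
CLIENTS = {"hdfs": HdfsClient, "kfs": KfsClient}


def client_for(path: str):
    """Return a client instance chosen by the scheme of the path."""
    scheme = urlparse(path).scheme
    try:
        return CLIENTS[scheme]()
    except KeyError:
        raise ValueError(f"no file system registered for scheme {scheme!r}")


if __name__ == "__main__":
    for path in ("hdfs://server/path", "kfs://server/path"):
        print(path, "->", client_for(path).name)
```

      The application code only ever sees the path string, so switching file systems is a matter of changing the prefix.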

  16. The benefits of parallelizing everything! by Celeste+R · · Score: 2, Informative
    According to a post on the Yahoo developer forums:

    2005 winner used only 80 cores and achieved it in 435 seconds. So with 800 cores what 2007 winner achieved is 297 seconds ?

    It's not only the number of cores; the logic that uses the parallel nodes properly for a particular task is important too.

    Hadoop won with 1820 cores (910 nodes w/ 2 cores each) at 209 seconds.

    I'm all for better sorting algorithms, but eventually the cost of parallelizing something overtakes the profit made. That being said, Hadoop's internal filesystem is made to be redundant, which is an important feature whenever you're dealing with large amounts of data.

    Hadoop implements Google's MapReduce model, by the way, whereas the competition didn't. It's nice to see MapReduce being used more in the public eye.

    While better sorting algorithms -do- matter, I have to say that maintenance and running costs also matter.

    I'd also like to see how a compatible C version of this software compares with the Java version. However, as I see it, the Java overhead seems fairly limited; sorting code is wonderfully repetitive, and I'd expect that it's already been optimized a fair amount.

    By the way, the number of nodes and the hardware in the nodes for this Hadoop cluster is -optimized- for this contest.

    --
    There are no perfect answers, only the right questions. More questions at http://foresightandhindsight.blogspot.com/
    1. Re:The benefits of parallelizing everything! by rackserverdeals · · Score: 1

      By the way, the number of nodes and the hardware in the nodes for this Hadoop cluster is -optimized- for this contest.

      The number of nodes was reduced to run the 100TB benchmark but I don't see anything that backs up your comment that the hardware was optimized for this contest. The cluster hardware doesn't look like anything special. Maybe it's optimized for Hadoop which is different than being optimized for the contest.

      --
      Dual Opteron < $600
    2. Re:The benefits of parallelizing everything! by leenks · · Score: 1

      There is a C++ framework called Sector/Sphere, that is quite a bit faster but not as stable. I don't think it scales as well either (yet)

      http://sector.sourceforge.net/

  17. Not quite as impressive as it sounds by Sangui5 · · Score: 4, Informative

    Google's sorting results from last year (link) are much faster; they did a petabyte in 362 minutes, or 2.8 TB/min. The minute sort didn't exist last year, but Google did 1TB in 68 seconds last year, so I think it may be safe to assume that they could do 1 TB in under a minute this year. Google just hasn't submitted any of their runs to the competition.

    On the sort benchmark page, they list the winning run as Yahoo's 100TB run, leaving out the 1PB run; that implies the 1PB run didn't conform to the rules, or was late, or something.

    People have commented that this is a "who has the biggest cluster" competition; the sort benchmark also includes the 'penny' sort, which is how much can you sort for 1 penny of computer time (assuming your machine lasts 3 years), and 'Joule' sort, how much energy does it take you to sort a set amount of data. Not surprisingly, the big clusters appear to be neither cost efficient nor energy efficient.
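
    To put a number on the penny sort idea: if a machine's price is spread evenly over 3 years, a penny buys a fixed slice of its time. A rough sketch, assuming 365-day years and a made-up $1000 machine (the price is only an example, not a figure from the benchmark page):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes 365-day years
LIFETIME_YEARS = 3                     # machine lifetime stated above


def penny_budget_seconds(machine_cost_dollars: float) -> float:
    """Seconds of machine time that one penny buys over a 3-year lifetime."""
    lifetime_seconds = LIFETIME_YEARS * SECONDS_PER_YEAR
    cost_per_second = machine_cost_dollars / lifetime_seconds
    return 0.01 / cost_per_second


if __name__ == "__main__":
    # Hypothetical $1000 machine: a penny buys about 946 seconds.
    print(f"{penny_budget_seconds(1000):.2f} seconds per penny")
```

    Double the hardware price and the time budget halves, which is why big clusters don't show up in that category.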

    1. Re:Not quite as impressive as it sounds by owenomalley · · Score: 4, Interesting

      In sorting a terabyte, Hadoop beat Google's time (62 versus 68 seconds). For the petabyte sort, Google was faster (6 hours versus 16 hours). The hardware is of course different. (from Yahoo's blog and Google's blog)

      Terabyte:
          Machines: Yahoo 1,407 Google 1,000
          Disks: Yahoo 5,628 Google 12,000
      Petabyte:
          Machines: Yahoo 3,658 Google 4,000
          Disks: Yahoo 14,632 Google 48,000

      Yahoo published their network specifications, but Google did not. Clearly the network speed is very relevant.

      The two take away points are: Hadoop is getting faster and it is closing in on Google's performance and scalability.

    2. Re:Not quite as impressive as it sounds by fluffykitty1234 · · Score: 1

      One other difference:

      Google:
      "we asked the Google File System to write three copies of each file to three different disks."

      Yahoo:
      "On the larger runs, failure is expected and thus replication of 2 is required. HDFS protects against data loss during rack failure by writing the second replica on a different rack and thus writing the second replica is relatively slow."

      Google is using 4x as many disks, but writing 1.5x as much data.

      I'm actually more impressed that Google is cramming 12 disks onto a single machine, how do they get them to fit?

    3. Re:Not quite as impressive as it sounds by imsabbel · · Score: 1

      Er... Take a bigger case?

      You can easily get 12, 18 or 24 disks into a server....

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    4. Re:Not quite as impressive as it sounds by Anonymous Coward · · Score: 0

      And Google's code is proprietary so it's pretty much irrelevant to most people in any case...

    5. Re:Not quite as impressive as it sounds by zarqman · · Score: 1

      I'm actually more impressed that Google is cramming 12 disks onto a single machine, how do they get them to fit?

      umm... a rubber mallet?

      More seriously, Google has a history of not even using cases some of the time -- at least not cases as most people think of them.

      As I recall, they're even using custom motherboards and such, so custom cases (or special racks if they're still doing the caseless thing) to accommodate 12 disks per mobo seems very reasonable for them.

      --
      geek friendly VPS's and free API enabled DNS : zerigo.com
  18. Re:Great! It's open source! by rackserverdeals · · Score: 1

    But has anyone patented it yet? Patents trump copyright after all.

    There are a number of patent applications related to MapReduce from Google and Yahoo.

    --
    Dual Opteron < $600
  19. Re:Great! It's open source! by haruchai · · Score: 2, Interesting

    Why isn't this illegal - adding unrelated legislation to a bill? Is there anywhere in the world where this practice is not permitted, or better yet, prosecuted?

    --
    Pain is merely failure leaving the body
  20. Use C++ and save 10x the hardware by Anonymous Coward · · Score: 0

    I'm always confused when teams use Java that don't REQUIRE complex cross platform support. IME, all the old complaints about java are still true.
    It is slow, when compared to C/C++ or other mature compiled languages.
    It uses more RAM than C++.
    Development isn't any easier or faster than C++.

    It is fairly easy to write cross platform C++ code that uses less compute resources and easily runs on 10 platforms. Besides having developers who are nearly clueless about the platform isn't good. I've seen some really bad java developer teams and some really bad C++ teams. Overall, the java developers knew less about the platform and hardware than the C++ teams. Java let them be lazy.

    Don't get me wrong, there are many uses for java and as CPUs have gotten faster and hold more RAM, we aren't trying to suck every bit of performance. That's a good fit for java programs.

    1. Re:Use C++ and save 10x the hardware by Anonymous Coward · · Score: 1, Insightful

      Development isn't any easier or faster than C++.

      Ridiculous. Java's library provides 100 times what C++'s library provides, which makes it a solid ground for application development (which is what it excels at).

    2. Re:Use C++ and save 10x the hardware by Anonymous Coward · · Score: 1, Funny

      Use C++ and save 10x the hardware

      You tell em brutha! I'm so tired of carrying 10 cell phones to play java games.

    3. Re:Use C++ and save 10x the hardware by ClosedSource · · Score: 0, Flamebait

      But how much of those libraries exist to achieve Java's religious beliefs on abstraction?

    4. Re:Use C++ and save 10x the hardware by Yosho · · Score: 2, Informative

      But how much of those libraries exist to achieve Java's religious beliefs on abstraction?

      Wow, how did this get modded insightful? For one, calling the design of a programming language a "religious belief", then asking a vague question about it without providing even a basis of an answer is just inflammatory.

      But the answer that anybody who knows what they're talking about will tell you is, none of them. Java's abstraction mechanisms are built into the language. None of the standard libraries are necessary to support it. They take advantage of it, of course, and you'd be crazy to not take advantage of one of the language's features. Try taking a look at a tree representation of all of the classes in the standard library. The vast majority of classes are not more than one or two levels down from the top-level Object. The things that are deeper are typically things that are complex in any language -- CORBA, GUI toolkits, etc. It certainly looks much cleaner than many graphs I've seen of C++ libraries that abused multiple inheritance.

      --
      Karma: Terrifying (mostly affected by atrocities you've committed)
    5. Re:Use C++ and save 10x the hardware by Anonymous Coward · · Score: 0

      Uh, you realize you can just compile your Java code with gcj and get the same native optimizations that g++ will give you, right?

  21. Beowulf by bhsbulldozer · · Score: 1

    Imagine a beowulf cluster of those!

  22. lets just hope.. by pablo_max · · Score: 1

    I really hope that this works across multiple drives, because my p0rn collection is so spread out it would take forever to sort manually!

    1. Re:lets just hope.. by tinkerghost · · Score: 1

      Do you have that tagged already or are you going to run it through photo recognition software?

  23. Re:Great! It's open source! by newcastlejon · · Score: 1

    Why isn't this illegal - adding unrelated legislation to a bill? Is there anywhere in the world where this practice is not permitted, or better yet, prosecuted?

    I never heard of it happening here in the UK, as far as I knew only the US did. Shows how little I knew.

    --
    If God forks the Universe every time you roll a die, he'd better have a damned good memory.
  24. Google Sort by jlebrech · · Score: 2, Interesting

    I'm looking forward to sorting my search results by Date, Title, Description, Author, etc..

    1. Re:Google Sort by daemonburrito · · Score: 1

      This may get easier if HTML5 catches on. I've been playing with it, and the new <time> and <article> tags are extremely useful.

      I used to be sympathetic to the "limited view of html" argument, but after writing a couple of tools that need to search the dom, I'm convinced that the semantic tags work a lot better than abusing css classes. The consistency is going to help search engines, too.

  25. World Record? by Anonymous Coward · · Score: 0

    The Gray sort metric is defined as TB/minute on a large data set (>=100TB). Apache Hadoop got 100TB in 173min = 0.578TB / min.

    Half a year ago, Google's MapReduce sorted 1PB in 362 minutes. Rate = 2.762TB / min
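
    The arithmetic behind those two rates, as a quick sketch (the function name is just for illustration; the inputs are the figures quoted above):

```python
def sort_rate_tb_per_min(terabytes: float, minutes: float) -> float:
    """Gray sort style metric: terabytes sorted per minute."""
    return terabytes / minutes


hadoop = sort_rate_tb_per_min(100, 173)   # Hadoop: 100TB in 173 minutes
google = sort_rate_tb_per_min(1000, 362)  # Google: 1PB in 362 minutes

print(f"Hadoop: {hadoop:.3f} TB/min")
print(f"Google: {google:.3f} TB/min")
print(f"Google/Hadoop: {google / hadoop:.1f}x")
```

    Keep in mind the two runs used different data sizes and different hardware, so the ratio is only a rough comparison.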

    http://developers.slashdot.org/article.pl?sid=08/11/23/1637219&from=rss

  26. My algorithm can sort anything in 1 second by nganju · · Score: 1

    My sorting algorithm operates in constant time. I should really enter it into one of these competitions. It's called Intelligent Design Sort: http://www.dangermouse.net/esoteric/intelligentdesignsort.html

    --
    There are 2 kinds of people in this world. Those that can keep their train of thought,
  27. Re:Great! It's open source! by jimicus · · Score: 3, Informative

    Here in the UK, the patent office has been issuing software patents for some time in "anticipation" of them becoming legal at some point in the future.

    No, I don't understand that either.

  28. Re:Great! It's open source! by franki.macha · · Score: 1

    That depends how you define unrelated, but I think that the Anti-terrorism, Crime and Security Act of 2001 is a perfect example of the fact that the name of a law is chosen to try and make sure that it gets passed.

  29. Re:Great! It's open source! by Halo1 · · Score: 4, Interesting

    Why isn't this illegal - adding unrelated legislation to a bill? Is there anywhere in the world where this practice is not permitted, or better yet, prosecuted?

    The GP is confusing a bunch of things. First, the Council of Ministers threw out all limiting amendments from the European Parliament and reached a Political Agreement on a shoddy text through backdoor maneuvering by Germany and the European Commission. That text would have turned the European Patent Office's practice of granting software patents into EU legislation.

    A Political Agreement has no juridical nor legislative value, but it has never happened that a political agreement was later on annulled and that negotiations were reopened. So also in this case, even though the German, Dutch, Spanish and Danish parliaments afterwards passed motions asking to reopen the discussions, the Council's bureaucrats did not want to do that because it "would undermine the efficiency of the decision making process".

    Anyway, once you have a Political Agreement (which is reached by the representatives of the ministries responsible for the matter at hand) and nobody "wants" to discuss it anymore, the agreement can be placed as an "A item" on any EU Council of Ministers meeting, since it only needs rubber stamping in that case. In the case of the Software Patents Directive, it appeared several times as an A item on the agenda of an Agriculture and Fisheries meeting (which is presumably where the GP's confusion stems from).

    In principle, there would have been nothing wrong with that, but in this case there was no actual political agreement, and in particular Poland was very unhappy with the way it had been treated. So 4 times in a row, Poland either had this "A item" removed from the agenda (sometimes at the last minute, because the responsible Polish minister had to be informed that they were again trying to get it through at a meeting he had no business with), or turned it into a "B item", which means that it can't be rubber stamped but that they first have to talk a bit about it (which nobody wanted to do).

    In the end it still did get approved, but that whole circus helped in convincing the EU Parliament to table a resolution asking the Commission to restart the directive's process and, when the Commission refused, to later squarely reject it.

    You can find some more of my thoughts on the Council's behaviour here.

    --
    Donate free food here
  30. Re:Great! It's open source! by newcastlejon · · Score: 1

    GP was referring to tacking on legislation to an unrelated bill, i.e. patent legislation on an agricultural bill. It's my understanding that this is sometimes used in the US to block a bill by means of appending something that no fool will vote in.

    --
    If God forks the Universe every time you roll a die, he'd better have a damned good memory.
  31. Re:Great! It's open source! by psycho12345 · · Score: 1

    This is indeed the case (for killing bills). The nastier version is tacking some random crap on to the annual budget, and using the excuse of getting the budget passed to ram it through even though the bill alone wouldn't even get to a vote by itself.

  32. This overturns all are fundamental assumptions! by ClosedSource · · Score: 1

    Like the widely-held belief that sorting speed is related to the software license used.

  33. Boy did I screw up that title by ClosedSource · · Score: 1

    Make that "all of our" instead of "all are". A mind is a terrible thing to waste.

    1. Re:Boy did I screw up that title by dotgain · · Score: 1
      No no no, it's "ALL YOUR", look, I'll show you:

      All your fundamentals are overturn. Move belief! What you pay!

  34. Re:Overlords - Trivia by e9th · · Score: 5, Informative

    Hadoop's name (and mascot) came from Doug [the project leader] Cutting's son's yellow stuffed elephant toy.

  35. Re:My algorithm can sort anything in 1 second by Anonymous Coward · · Score: 0

    Good luck on your lawsuit with DJ Danger Mouse.

    (Kinda stupid to be whoring your vanity site out on the same day as a front page story about the person who could easily sue you and take it from you.)

  36. READ THE MOTHERFUCKING ARTICLE YOU STUPID MORON by Anonymous Coward · · Score: 0, Troll

    READ THE GODDAMNED FUCKING ARTICLE YOU STUPID MOTHERFUCKING LAZY COCKSUCKING PIECE OF SHIT.

    Really, dude... it's not that hard to put in a modicum of effort that will pay dividends in terms of you not looking like a totally clueless fucking moron.

    1. Re:READ THE MOTHERFUCKING ARTICLE YOU STUPID MORON by jjohnson · · Score: 1

      Mod parent +1 cluestick.

      --
      Anyone who loves or hates any language, platform, or manufacturer, doesn't know what they're talking about.
  37. Re:My algorithm can sort anything in 1 second by dotgain · · Score: 1

    Gee, I actually thought it was funny. Now, maybe I'm just an idiot that's easily amused, but you sir have problems that could make just about anybody feel better about themselves.

  38. Corporations employee people by Anonymous Coward · · Score: 0

    and it's not detrimental to the people that work at those companies to protect the corporations' intellectual property.

    1. Re:Corporations employee people by Bert64 · · Score: 1

      The needs of the many outweigh the needs of the few.

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  39. Re:Great! It's open source! by jjohnson · · Score: 2, Interesting

    There was an episode of the Simpsons where Springfield is going to be destroyed by a meteor. Congress meets to quickly pass legislation to fund the evacuation of the city. At the last moment, a Congressman steps up to the podium and says "I'd like to add a rider providing $30 million for the perverted arts". The bill is defeated.

    It's funny because it's true.

    --
    Anyone who loves or hates any language, platform, or manufacturer, doesn't know what they're talking about.
  40. Re:Great! It's open source! by turbidostato · · Score: 3, Funny

    "Why isn't this illegal"

    Because they made it legal by passing it on a Totally Unrelated Bill.

  41. Hadoop Covered on Podcast by Anonymous Coward · · Score: 0

    I had one of the Hadoop guys on my podcast a while back and we talked about this, and what Hadoop does (Map/Reduce),

    http://www.rce-cast.com/index.php/Podcast/rce04-hadoop.html

  42. I'd settle for... by Anonymous Coward · · Score: 0

    Never mind sorting. I'd settle for a filesystem that could stat 1TB (in approx. 800,000 files) in under an hour. Mind you, it doesn't have to md5 them in that time, but it would be nice. I'd settle for just stat.

  43. Re:Great! It's open source! by zarqman · · Score: 1

    Sadly, tacking stuff on isn't limited to the budget bills. It happens routinely.

    --
    geek friendly VPS's and free API enabled DNS : zerigo.com
  44. Already beaten by Anonymous Coward · · Score: 0

    Same number of computers 68 seconds.

    http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html

  45. Re:Great! It's open source! by haruchai · · Score: 1

    I'd heard of a US bill getting pork added to it AFTER it had been voted on - sometime within the last year. I haven't been able to find the story again and would hope that this isn't a common occurrence.

    --
    Pain is merely failure leaving the body
  46. Re:Great! It's open source! by zarqman · · Score: 1

    Yeah. They'll add it in conference committee, where, after the initial vote, they reconcile differences in bills between the House and Senate versions. It goes back for a quick final vote in each chamber but that's usually considered procedural as I understand.

    I don't know for sure, but somehow doubt that it's uncommon. More likely, the changes snuck in aren't enough to raise significant ire so they get away with it. And if people figure it out and are unhappy, there's always plausible deniability: "Some intern added it; it wasn't supposed to be there."

    --
    geek friendly VPS's and free API enabled DNS : zerigo.com
  47. Minidisk Technology by highonv8splash · · Score: 1

    The minidisk player in my closet wants to know why it's not on the list

  48. best comment this month by egghat · · Score: 1

    if not "for months"

    --
    -- "As a human being I claim the right to be widely inconsistent", John Peel