Yahoo Releases Open Source Hadoop Distribution
ruphus13 writes "Yahoo has been a vociferous Apache Hadoop user and supporter for several years now, and uses it extensively within its Search technologies. Hadoop has been gaining popularity in the Cloud Computing space, with companies like the NYTimes converting 4TB and 11 million articles to PDFs in under 24 hours using Hadoop and EC2 in late 2007. Hadoop has been made available in Amazon's cloud and Yahoo has now released its own Hadoop version. From the article: 'At today's Hadoop Summit in Silicon Valley, Yahoo! announced the availability of the Yahoo! Distribution of Hadoop, a source-only version of Apache Hadoop that Yahoo! uses within its own search engine. [Hadoop] is an open source software framework that helps process very large data sets, and is widely used in large-scale data mining applications as well as in search tools at sites like Facebook and many others. For developers and users interested in Hadoop, it's worth noting that the Yahoo! Distribution of Hadoop has been widely tested and developed at Yahoo! for years now.'"
Perhaps the Ask Slashdot inquirer in this thread will find this news useful.
Can we bring back the ordinary, sensible pre-Web 2.0 names please?
Not only is it used by Yahoo, but also by Facebook, who get 15TB of new data a day to handle. Check out the very useful free vids from Cloudera: http://www.cloudera.com/hadoop-training-thinking-at-scale You can also download a canned VM preloaded with Hadoop/Pig/Hive goodness, and even a copy of Eclipse preconfigured: http://www.cloudera.com/hadoop-training-virtual-machine
I think HBase 0.20 is being released today as well, with a new and much faster file format, better memory management and better availability.
I claim first use of "Error No. 0B" - or "No. 0B error." It'll be the new ID 10T!
Java is slow. How could it possibly be used to process so much data?
Yahoo! really does get a lot of flak around here, but I have to say, they have contributed quite a bit of free and open-source software for developers to use. The list of APIs and web services that are available is quite impressive, and many of them are better than Google's similar offerings (BOSS vs Google's AJAX search, for example). For anybody who's interested, I really recommend checking out the Yahoo! Developer Network site.
I'll admit to knowing basically nothing about Hadoop, but if I saw the same article with "Hadoop" replaced by "GCC", "Postfix", or "OpenOffice", I wouldn't see it as being a good thing.
Usually I only need to google new technology terms that I haven't heard before. Today I had to google vociferous. I was thinking it sounded like a condition that you need to take Levitra for. It didn't really make sense in the sentence, but thinking about Yahoo suffering from erectile dysfunction has its own childish humor when you're on your 5th beer.
I've evaluated Hadoop (and Cloudbase, HBase and a few other things) for transaction log mining purposes and found it to be VERY inefficient. Basically, if your machine has a decent RAID array (by "decent" I mean 500-700MB/sec linear read throughput, and 300-500MB/sec write throughput), you will need 12-15 8-core Hadoop boxes to even come close to a single machine's performance. This, IMO, is fucked up. I expected it would be much more efficient than it is.
Therefore, my conclusion was that Hadoop only makes sense when you can't solve a problem any other way and are prepared to pay through the nose for hundreds or thousands of machines to alleviate its performance shortcomings.
Caveat lector - my biggest Hadoop cluster consisted of 20 8-core nodes, with 32GB RAM per node and GigE interconnect.
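For anyone who wants to sanity-check the numbers above, here is a rough back-of-envelope sketch in Python. It takes the commenter's figures at face value (500-700MB/sec read on the single machine, 12-15 Hadoop boxes to match it) and works out what each node would effectively be delivering. The function name and variable names are made up for illustration, and nothing here was measured; the GigE figure is just the theoretical line rate of a 1 Gbit/s link.

```python
# Back-of-envelope: effective per-node throughput implied by the comment above.
# All input figures are the commenter's estimates, not measurements.

single_machine_read_mb_s = (500, 700)   # "decent" RAID linear read, MB/s
nodes_needed = (12, 15)                 # 8-core Hadoop boxes to roughly match it


def implied_per_node_range(throughput, nodes):
    """Return (low, high) MB/s that each node would effectively deliver."""
    low = min(throughput) / max(nodes)    # slowest array spread over most nodes
    high = max(throughput) / min(nodes)   # fastest array spread over fewest nodes
    return low, high


low, high = implied_per_node_range(single_machine_read_mb_s, nodes_needed)

# Theoretical line rate of a 1 Gbit/s link, expressed in MB/s.
GIGE_LINE_RATE_MB_S = 1000 / 8

print(f"implied per-node throughput: {low:.0f}-{high:.0f} MB/s")
print(f"GigE theoretical line rate:  {GIGE_LINE_RATE_MB_S:.0f} MB/s")
```

If the commenter's estimate holds, each node is contributing well under what a single GigE link could carry in theory, which is consistent with the complaint about inefficiency but says nothing about where the time actually goes.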
Fail!
ISO 32000-1:2008: PDF
Because stitching together numerous TIFF files on your own is so much better!