MapReduce Goes Commercial, Integrated With SQL
CurtMonash writes "MapReduce sits at the heart of Google's data processing — and Yahoo's, Facebook's and LinkedIn's as well. But it's been highly controversial, due to an apparent conflict with standard data warehousing common sense. Now two data warehouse DBMS vendors, Greenplum and Aster Data, have announced the integration of MapReduce into their SQL database managers. I think MapReduce could give a major boost to high-end analytics, specifically to applications in three areas: 1) Text tokenization, indexing, and search; 2) Creation of other kinds of data structures (e.g., graphs); and 3) Data mining and machine learning. (Data transformation may belong on that list as well.) All these areas could yield better results if there were better performance, and MapReduce offers the possibility of major processing speed-ups."
and can I run Linux on it? Or it on Linux? Is it available for my iPhone?
Not like MySQL cared about data integrity in the past. . . whay start now?!
Gaaah! Data corruption!
Your post must have been stored in MySQL...
The correct project name is Hadoop. It was factored out of Nutch 2.5 years ago. And Yahoo has been putting a lot of effort into making it scale up. We run 15,000 nodes with Hadoop in clusters of up to 2,000 nodes each and soon that will be 3,000 nodes. I used 900 nodes to win Jim Gray's terabyte sort benchmark by sorting 1 TB of data (10 billion 100-byte records) in 3.5 minutes. It is also used to generate Yahoo's Web Map, which has 1 trillion edges in it.
http://research.google.com/archive/sawzall.html
Sawzall is essentially designed around the mapreduce framework. It's impossible to *not* write a mapreduction in Sawzall. The way it works:
Your program is written to process a single record. The magic part happens when you output: you have to output to special tables. Each of these table types has a different way that it combines data emitted to it.
So, during the map phase, your program is run in parallel on each input record. During the reduce phase, the reduction happens according to the way the output tables do whatever operation was specified.
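To make the shape of that concrete, here is a rough sketch in Python (not Sawzall, and not Google's implementation). The table names, the process_record function, and the choice of combiners are all made up for illustration; the only point is that the per-record code just emits, and the table type decides how emitted values get combined.

```python
from collections import defaultdict
from functools import reduce
import operator

# Hypothetical "output tables": each table name maps to the operation
# used to combine every value emitted under the same key.
TABLES = {
    "word_count": operator.add,  # sum table
    "longest": max,              # maximum table
}

def process_record(record):
    """Per-record program: sees exactly one record and only emits."""
    emissions = []
    for word in record.split():
        emissions.append(("word_count", word, 1))
        emissions.append(("longest", "chars", len(word)))
    return emissions

def map_phase(records):
    # Each record is handled independently, so a real system could run
    # these calls in parallel; here they simply run one after another.
    return [e for record in records for e in process_record(record)]

def reduce_phase(emissions):
    # Group values by (table, key), then combine with the table's operation.
    grouped = defaultdict(list)
    for table, key, value in emissions:
        grouped[(table, key)].append(value)
    result = defaultdict(dict)
    for (table, key), values in grouped.items():
        result[table][key] = reduce(TABLES[table], values)
    return {table: dict(rows) for table, rows in result.items()}

def run(records):
    return reduce_phase(map_phase(records))
```

Calling run(["the quick fox", "the lazy dog jumped"]) gives a "word_count" table with one summed count per word and a "longest" table holding the largest word length seen. Note that process_record never mentions summing or taking a maximum; that lives entirely in the table definitions.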
Some work was needed to provide enough different output table types to do everything that was useful, especially since you might want to take the output and plug it in as the input to another phase of mapreduction.
One of the biggest reasons this was a major innovation for Google was that it let some of the people who weren't really programmers still come up with useful programs, because the Sawzall language was pretty simple (especially when combined with some of the library functions that had been implemented to do common sorts of computations). There were also some interesting ways in which the security model was implemented, but as far as I know they haven't been published yet.
There certainly are plenty of other technical things that can be done to improve a system like MapReduce (and I know that many of them were in various forms of experimentation when I left the company) but at least some of them are highly dependent on Google's infrastructure, and not really relevant to a general discussion. (I suspect that the papers linked above might have some hints, but it has been a while since I looked at them.)