MapReduce Goes Commercial, Integrated With SQL
CurtMonash writes "MapReduce sits at the heart of Google's data processing — and Yahoo's, Facebook's and LinkedIn's as well. But it's been highly controversial, due to an apparent conflict with standard data warehousing common sense. Now two data warehouse DBMS vendors, Greenplum and Aster Data, have announced the integration of MapReduce into their SQL database managers. I think MapReduce could give a major boost to high-end analytics, specifically to applications in three areas: 1) Text tokenization, indexing, and search; 2) Creation of other kinds of data structures (e.g., graphs); and 3) Data mining and machine learning. (Data transformation may belong on that list as well.) All these areas could yield better results if there were better performance, and MapReduce offers the possibility of major processing speed-ups."
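For readers who have not seen the pattern, the first application area in the summary (tokenization and indexing) can be sketched as a tiny in-memory map/reduce job. This is only an illustration under stated assumptions: the names tokenize, map_document, reduce_postings and build_index are made up for this sketch, and nothing here reflects how Greenplum, Aster Data or Google actually implement it.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def map_document(doc_id, text):
    # Map step: emit one (token, doc_id) pair per token occurrence.
    return [(token, doc_id) for token in tokenize(text)]

def reduce_postings(token, doc_ids):
    # Reduce step: collapse all doc ids for a token into a sorted posting list.
    return token, sorted(set(doc_ids))

def build_index(documents):
    # Group the mapped pairs by token (the "shuffle"), then reduce each group.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for token, emitted_id in map_document(doc_id, text):
            grouped[token].append(emitted_id)
    return dict(reduce_postings(t, ids) for t, ids in grouped.items())

docs = {1: "Map reduce", 2: "reduce, REDUCE again"}
index = build_index(docs)
```

In a real cluster the map calls would run on many machines and the grouping would happen over the network; here everything runs in one process so the shape of the computation is easy to see.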
and can I run Linux on it? Or it on Linux? Is it available for my iPhone?
People who don't know LISP are bound to reinvent it, badly.
Data warehousing (here I mean databases stored in column order for faster queries, etc.) may get a lift from using map reduce over server clusters. This would get away from using relational databases for massive data stores for problems where you need to sweep through a lot of data, collecting specific results.
I think it is interesting, useful, and cool that Yahoo is supporting the open source Nutch system, which implements map reduce APIs for a few languages. That makes it easier to experiment with map reduce on a budget.
Not like MySQL cared about data integrity in the past... why start now?!
Gaaah! Data corruption!
Your post must have been stored in MySQL...
they go together like paint and peanut butter.
Map/Reduce is better suited for read-only data mining situations.
http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
then they embrace it
Intron: the portion of DNA which expresses nothing useful.
In functional programming, map and reduce are very, very old knowledge (and, yup, functional programming has its uses and, yes, there are some very good and very successful programs written in functional languages).
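For anyone who has not met the two functions in their original functional-programming sense, here is a minimal sketch using Python's built-in map and functools.reduce. The helper names square and add and the sample numbers are just assumptions for the example.

```python
from functools import reduce

def square(x):
    # Function applied independently to every element (the "map" part).
    return x * x

def add(acc, x):
    # Binary function that folds the mapped values into one result (the "reduce" part).
    return acc + x

numbers = [1, 2, 3, 4]
squares = list(map(square, numbers))   # [1, 4, 9, 16]
total = reduce(add, squares, 0)        # 30
```

Because square touches each element independently, the map step can be split across machines; the reduce step only needs the function to combine partial results sensibly.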
What's next? A product called DepthFirstSearch (notice the uber broken camel case for a product name) that has nothing to do with the depth-first search algorithm?
Google? Allo?
Doesn't Oracle have this sort of feature already, without the Google "MapReduce" buzzword buzz?
Are you adequate?
The original paper for map reduce, http://labs.google.com/papers/mapreduce-osdi04.pdf is actually of pretty poor quality.
There are not really any useful comparisons in the paper. They do not indicate how it scales with increases in the number of processors, so while it may be very fast on the mammoth amount of hardware used, how much faster would it actually get on additional hardware?
If you look at the Sort section of the comparison they seem to be comparing to http://www.almaden.ibm.com/cs/gpfs-spsort.html
which is a 10% improvement on wildly improved hardware, which would seem to be rather disappointing results. This would not have been a problem with the paper had there been any mention of this, but there was not.
I am with Bjarne on this one.
Bjarne Stroustrup, creator of the C++ programming language, claims that C++ is experiencing a revival and
that there is a backlash against newer programming languages such as Java and C#. "C++ is bigger than ever.
There are more than three million C++ programmers. Everywhere I look there has been an uprising
- more and more projects are using C++. A lot of teaching was going to Java, but more are teaching C++ again.
There has been a backlash.", said Stroustrup.
He continues... What would the world be like without Google?... Only C++ can allow you to create applications as powerful as MapReduce, which allows them to create fast searches.
I totally agree. If Java (or Python etc. for that matter) were fast enough, why did Google choose C++ to build their insanely fast search engine? MapReduce rocks. No Java solution can even come close.
I rest my case.
The Map/Confuse algorithm.
I don't think you can credit Bjarne with "compiled code is faster than interpreted code" (or the 21st century version: "compilers can perform better optimizations than JIT translators").
C++ happens to be the most popular fully compiled language, having edged Fortran out of that position some time near the end of the last century.
Back in the early '80s, when he was coming up with C++, the big Fortran savants were saying stuff like "Fortran is bigger than ever. There are more than X million Fortran programmers. Everywhere I look there has been an uprising... a lot of teaching was going to Pascal, but more are teaching Fortran again. There has been a backlash."
----
And that's not the only thing C++ has in common with Fortran, either.
Though this post is my introduction to both MapReduce and the argument, it strikes me that the people arguing are arguing the wrong problem.
While MapReduce might be used against some structured data, it looks to be something for unstructured data and dynamically inventing structures in unstructured data. Additionally, you might want to keep that new structure around for a while. You might want to load it up with terabytes of data. At the same time, this data is less and less useful over time.
Think about two of the key pieces of data Google has, web pages and user interaction and preference data. Web pages change over time. Web sites come and go. Some change a lot (news sites) and some change very little.
There is a LOT of user interaction data: clicks on pages, javascript that fires to doubleclick, etc. Preferences change over time, too. Also, marketers want to react dynamically to the clicks and even to the minute change of a preference that generates a buck.
With such a large, changing, and time sensitive dataset, how could it be structured into something as relatively static as a schema? You would box yourself in by making it a schema and defining all the possible relationships.
So, you take it up one abstraction level and make a "schema" for making relationships. Furthermore, there is a narrow window within which you even care about data and how it is structured. Granted, you want the webpage/site data to stick around for queries. But even that is marginally useful. Think about how many pages deep you go into a query on Google. I'm sure that will vary by person, but I'd also bet that in practice it is pretty small.
Maybe everyone else gets that and I'm just late to the party. But my point is that the wrong argument is being made that this should follow all the RDBMS work that has come to date.
Sure, I do agree that they shouldn't completely ignore all of the research, but to suggest it has to have a schema, indices, etc. just comes across as arguing all data problems belong in a traditional database.
Or maybe I can take a different approach to this... my brain doesn't have an index. It does categorize data and it can categorize the same piece of data in multiple ways. As I learn new things, my brain creates new "indices" of sorts. A large portion of the data in my brain is time sensitive, or indexed over time. The older I get, the more the minutiae of life (what I had for dinner this evening) stop being important and lose their categorization. I don't have one schema for my brain; rather I have multiple, and I invent and dissolve them over time. I don't know what new one I'll need in the future. I can't know that, and without that, I can't make a schema for it. I also can't be constantly modifying the same schema in place. It is easier for me to invent a new one as I go and just abandon the old ones. Sure, new schemas will have parts of the old, but it is still a new schema, with the old one still in place and referencing the same data that the new one will soon reference.
-fragbait
As a matter of fact, it was... :/
I have developed a truly marvelous proof of this comment, which this signature is too narrow to contain.
How many of you familiar with functional programming just *cringe* when you see how badly basic math is discussed in the programming mainstream?
Anyone remember this story: http://tech.slashdot.org/tech/08/07/08/201245.shtml? According to Google:
Protocol buffers are now Google's lingua franca for data -- at time of writing, there are 48,162 different message types defined in the Google code tree across 12,183 .proto files. They're used both in RPC systems and for persistent storage of data in a variety of storage systems.
(See http://code.google.com/apis/protocolbuffers/docs/overview.html.)
If you think about it, Protocol Buffers are just about perfect for MapReduce applications. First, Protocol Buffers data streams are "flat" structures, very similar to database tables. If you need hierarchical data, I think that you'd tend to use multiple tables that incorporate foreign keys, rather than embedding the hierarchy every time it's referenced (as XML does).
Second, and again unlike XML, the data serialization is described via a .proto file, which can itself be serialized in exactly the same way as the data stream. It looks fairly easy to write a "Map" or a "Reduce" program that works with any Protocol Buffers data stream.
I suspect that this, rather than SQL compatibility, is the road to success with MapReduce processes.
Nothing for 6-digit uids?
Stonebraker isn't exactly the one to complain about this: just as MapReduce is being overhyped these days, relational databases were being overhyped in the 1970's, and he rode that wave all the way to fame and fortune. 30 years later, although every database system in the world calls itself "relational", very few database applications actually are relational.
MapReduce is indeed a simple, decades-old parallel programming technique. It's not the be-all-and-end-all of parallel programming, but it's good for solving a lot of real-world problems with minimum fuss and hassle.
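To make the "simple parallel programming technique" concrete, here is a rough word-count sketch with the three usual phases: map, group by key, reduce. It is an assumption-laden toy, not Google's or Hadoop's implementation: the function names are invented for the example, and a thread pool on one machine stands in for a cluster of workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs_per_doc):
    # Shuffle: gather all values emitted for the same key into one list.
    groups = defaultdict(list)
    for pairs in pairs_per_doc:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the values for one key into a single count.
    return key, sum(values)

def word_count(documents, workers=4):
    # The thread pool is a stand-in for many machines; each document is
    # mapped independently, and each key is reduced independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, documents))
        grouped = shuffle(mapped)
        reduced = pool.map(lambda kv: reduce_phase(*kv), grouped.items())
        return dict(reduced)

counts = word_count(["a b a", "b c"])
```

The programmer only writes map_phase and reduce_phase; partitioning, scheduling and regrouping are the framework's job, which is where the "minimum fuss" comes from.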
Between the relational database hype of yore and today's MapReduce hype, give me the MapReduce hype any day. Relational database hype was all about pseudo-mathematical formality and ad hoc formalisms. MapReduce is at least about simple, working, real-world programming techniques. The sooner we get rid of Stonebraker's approach to computer science, the better off we will all be.
I'm astounded that so few people here know about MapReduce. There are lots of good videos about it made by Google.
There's a five-part lecture about it starting here (use this link to view the rest)
Or simply search for "google mapreduce". I suggest watching one of the videos though :)
I'm not insane! My mother had me tested.