MapReduce Goes Commercial, Integrated With SQL
CurtMonash writes "MapReduce sits at the heart of Google's data processing — and Yahoo's, Facebook's and LinkedIn's as well. But it's been highly controversial, due to an apparent conflict with standard data warehousing common sense. Now two data warehouse DBMS vendors, Greenplum and Aster Data, have announced the integration of MapReduce into their SQL database managers. I think MapReduce could give a major boost to high-end analytics, specifically to applications in three areas: 1) Text tokenization, indexing, and search; 2) Creation of other kinds of data structures (e.g., graphs); and 3) Data mining and machine learning. (Data transformation may belong on that list as well.) All these areas could yield better results if there were better performance, and MapReduce offers the possibility of major processing speed-ups."
and can I run Linux on it? Or it on Linux? Is it available for my iPhone?
People who don't know LISP are bound to reinvent it, badly.
Data warehousing (here I mean databases stored in column order for faster queries, etc.) may get a lift from using map reduce over server clusters. This would get away from using relational databases for massive data stores for problems where you need to sweep through a lot of data, collecting specific results.
I think that it is interesting, useful, and cool that Yahoo is supporting the open source Nutch system, that implements map reduce APIs for a few languages - makes it easier to experiment with map reduce on a budget.
Not like MySQL cared about data integrity in the past. . . whay start now?!
Gaaah! Data corruption!
Your post must have been stored in MySQL...
http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
then they embrace it
Intron: the portion of DNA which expresses nothing useful.
I don't think you can credit Bjarne with "compiled code is faster than interpreted code" (or the 21st century version: "compilers can perform better optimizations that JIT translators").
C++ happens to be the most popular fully compiled language, having edged Fortran out of that position some time near the end of the last century.
Back in the early '80s, when he was coming up with C++, the big Fortran savants were saying stuff like "Fortran is bigger than ever. There are more than X million Fortran programmers. Everywhere I look there has been an uprising... a lot of teaching was going to Pascal, but more are teaching Fortran again. There has been a backlash."
----
And that's not the only thing C++ has in common with Fortran, either.
Actually, the two are paired: programming model and implementation. The reason there's a programming model is that functional methods allow Google's implementation to automatically parallelize the input data for feeding to the cluster. So the implementation is very important, because that's actually how the data is processed and returned.
In that sense, Oracle's clustering optimizations are also a paired programming model and implementation, since, presumably, you need to know Oracle's SQL language extensions in order to take advantage of them (disclaimer: I don't use Oracle). From what I understand about functional programming, SQL should be ideally positioned to take advantage of these kinds of optimizations, since the actual implementation details of any SQL query are always left to the query optimizer, SQL being a declarative language. I'm going to speculate wildly and say that you could probably write a SQL interpreter using a functional style as well, and that good ones probably already do.
If Java ( or Pyhton etc. for that matter ) were fast enough why did Google choose C++ to build their insanely fast search engine.
Because their developers knew it better? Because it had better 64-bit support when they started it? Because full GC's weren't compatible with their use case and IBM's parallel GC VM hadn't been released yet? Because they could get and modify all the source to all the libraries?
I don't know the answer, but there are a lot of possibilities besides speed. You're jumping to an awfully big conclusion there, Mr. Coward.
E pluribus unum
To most people, C++ is C. :-) Unfortunate but true.
Well someone should tell that to the people working on Hadoop. I'm sure they'd love to know that their java mapreduce based framework is impossible. Maybe they'll even be able to use the paradox to built a perpetual motion machine and power the world.
See: http://developers.slashdot.org/comments.pl?sid=900359&cid=24756761
Though this post is my introduction to both MapReduce and the argument, it strikes me that the people arguing are arguing the wrong problem.
While MapReduce might be used against some structured data, it looks to be something for unstructured data and dynamically inventing structures in unstructured data. Additionally, you might want to keep that new structure around for a while. You might want to load it up with terabytes of data. At the same time, this data is less and less useful over time.
Think about two of the key pieces of data Google has, web pages and user interaction and preference data. Web pages change over time. Web sites come and go. Some change a lot (news sites) and some change very little.
There is a LOT of user interaction data. Clicks on pages, javascript that fires to doubleclick, etc. With preferences, that changes over time, too. Also, marketers want to dynamically react to the clicks and even the minute change of a preference that generates a buck.
With such a large, changing, and time sensitive dataset, how could it be structured into something as relatively static as a schema? You would box yourself in by making it a schema and defining all the possible relationships.
So, you take it up one abstraction level and make a "schema" for making relationships. Further more, there is a narrow window within which you even care about data and how it is structured. Granted, you want the webpage/site data to stick around for queries. But even that is marginally useful. Think about how many pages you go into a query on google? I'm sure that will vary by person, but I'd also bet that in practice it is pretty small.
Maybe everyone else gets that and I'm just late to the party. But my point is that the wrong argument is being made that this should follow all the RDBMS work that has come to date.
Sure, I do agree that they shouldn't completely ignore all of the research, but to suggest it has to have a schema, indices, etc. just comes across as arguing all data problems belong in a traditional database.
Or maybe I can take a different approach to this....my brain doesn't have an index. It does categorize data and it can categorize the same piece of data in multiple ways. As I learn new things, my brain creates new "indices" of sort. A large portion of the data in my brain is time sensitive, or indexed over time. The older I get, the more the details of the minutia of life (what I had for dinner this evening) isn't important any more and it loses its categorization. I don't have a schema for my brain, rather I have multiple and I invent and dissolve them over time. I don't know what new one I'll need in the future. I can't know that and without that, I can't make a schema for it. I also can't be constantly modifying the same schema in place. It is easier for me to invent a new one as I go and just abandon the old ones. Sure, new schemas will have parts of the old, but it is still a new schema with the old one still in place and referencing the same data that the new one will soon reference.
-fragbait