Slashdot Mirror


MapReduce — a Major Step Backwards?

The Database Column has an interesting, if negative, look at MapReduce and what it means for the database community. MapReduce is a software framework developed by Google to handle parallel computations over large data sets on cheap or unreliable clusters of computers. "As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is: a giant step backward in the programming paradigm for large-scale data intensive applications; a sub-optimal implementation, in that it uses brute force instead of indexing; not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago; missing most of the features that are routinely included in current DBMS; incompatible with all of the tools DBMS users have come to depend on."

8 of 157 comments

  1. Vertica by QuietLagoon · · Score: 3, Interesting

    The column is copyrighted by Vertica. Wouldn't they be concerned about the type of competition that MapReduce presents?

  2. Re:may be missing the (data)points by DragonWriter · · Score: 3, Interesting

    > I don't know why this article is so harshly critical of MapReduce.

    The primary ground for complaint seems to be "this isn't the way we do things in the database world". Each of the complaints (except #3) boils down to this. #1: The database community had its arguments a few decades back and settled, at the time, on a set of conventions; MapReduce doesn't follow them and is, therefore, bad. #2: All databases use one of two kinds of indexes to accelerate data access; MapReduce doesn't and is, therefore, bad. #3: Databases already do something like MapReduce, so MapReduce isn't necessary. #4: Modern databases tend to offer a variety of support utilities and features that MapReduce doesn't, so MapReduce is bad. #5: MapReduce isn't out-of-the-box compatible with existing tools designed to work with existing databases and is, therefore, bad.

    And it's from The Database Column, a blog that, according to its own "About" page, is written by experts from the database industry.

    I suspect part of the reason they are harshly critical is that this is a technology whose adoption and use in large, data-centric tasks is (regardless of efficiency) a threat to the market value of the skills in which they've invested years and $$ developing expertise.

    At the end, they note (as an afterthought) that they recognize that MapReduce is an underlying approach, and that there are ongoing projects to build DBMSs on top of MapReduce. That fact, if considered for more than a second, explodes all of their criticism, which is entirely premised on the idea that MapReduce is intended as a general-purpose replacement for existing DBMSs. It is instead a lower-level technology, currently used stand-alone for applications where current RDBMSs do not provide adequate performance (regardless of their other features), and on which DBMS implementations (with all the features they complain about MapReduce lacking) might, in the future, be built.
  3. Re:may be missing the (data)points by mini+me · · Score: 2, Interesting

    CouchDB, ThruDB, RDDB, and SimpleDB, to name a few.

  4. Re:may be missing the (data)points by mishabear · · Score: 5, Interesting

    > I don't know why this article is so harshly critical of MapReduce.
    > Are these guys just trying to stake a reputation based on being critical of Google?

    Um... yes?

    The Database Column is being coy about it, but it is in fact a corporate blog for Vertica, a high-performance database product. Vertica is a commercial implementation of C-Store and was founded by Michael Stonebraker, the most prominent proponent of column-based databases (get it? the database column). So yes, they have a very good reason to be hostile to Google.

    http://www.vertica.com/company/leadership
    http://en.wikipedia.org/wiki/C-Store
    http://en.wikipedia.org/wiki/Michael_Stonebraker
    http://www.databasecolumn.com/2007/09/contributors.html

  5. Re:Money, meet mouth by StarfishOne · · Score: 2, Interesting

    Agreed.

    I recently read somewhere (if only I could recall the link...) that on average Google's MapReduce jobs process something on the order of 100 GB/second, 24/7/365.

    I've got nothing against RDBMSs... but how can you be critical of a tool that scales and performs so well? It's just a matter of selecting and using the right tool for the job.

  6. Re:may be missing the (data)points by Blakey+Rat · · Score: 2, Interesting

    What bothers me the most is how much hype it gets. I work for a company that has had a "MapReduce" implementation (used internally) for as long as Google has, and we're not getting drooled over by the tech press. I'm sure tons of companies that have had to solve similar problems have already built this tool. The languages and syntax involved might change between implementations, but it's nothing all that great.

  7. Re:may be missing the (data)points by merreborn · · Score: 4, Interesting
    His conclusion really hits the nail on the head:

    What the authors really want to gripe about is distributed "cloud" data management systems like Amazon's SimpleDB; in fact if you change "MapReduce" to "SimpleDB" the original article almost makes sense.
  8. Re:may be missing the (data)points by eh2o · · Score: 3, Interesting

    MapReduce falls under the category of embarrassingly parallel algorithms. It isn't a step backwards, it just has a limited scope.

    Google's contribution (and yes, the idea predates them by a long time) is to point out that MapReduce is a bit more than an algorithm: it is a design pattern. Design patterns help us write clean code by establishing a consistent vocabulary (e.g. actors, containers, operators, etc.), and furthermore are important insofar as they make algorithms accessible to programmers. Right now we badly need more well-defined design patterns in the area of parallel computing, as this is essentially the future of programming.
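
For readers who have not seen the pattern the thread keeps arguing about, here is a minimal single-machine sketch of it, using word counting as the example. It is not Google's implementation or API. The names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration, and a thread pool stands in for a cluster of machines, so it shows only the shape of the pattern, not its scalability or fault tolerance.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key, the step a framework runs between phases."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: fold all values collected for one key into a single count."""
    return reduce(lambda a, b: a + b, values, 0)


def word_count(documents, workers=4):
    # Each document is mapped independently of the others; that independence
    # is what makes the map step embarrassingly parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, documents))
    grouped = shuffle(mapped)
    # Each key is reduced independently too, so this step could also be spread out.
    return {key: reduce_phase(values) for key, values in grouped.items()}


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(docs))
```

Note that nothing here uses an index: every document is scanned in full, which is the "brute force" the original article objects to, and also why the work splits so easily across workers.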