Slashdot Mirror


MapReduce — a Major Step Backwards?

The Database Column has an interesting, if negative, look at MapReduce and what it means for the database community. MapReduce is a software framework developed by Google to handle parallel computations over large data sets on cheap or unreliable clusters of computers. "As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is: a giant step backward in the programming paradigm for large-scale data intensive applications; a sub-optimal implementation, in that it uses brute force instead of indexing; not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago; missing most of the features that are routinely included in current DBMS; incompatible with all of the tools DBMS users have come to depend on."

14 of 157 comments

  1. may be missing the (data)points by yagu · · Score: 5, Insightful

    I don't know why this article is so harshly critical of MapReduce. They base their critique and criticism on the following five tenets, which they further elaborate in detail in the article:

    1. A giant step backward in the programming paradigm for large-scale data intensive applications
    2. A sub-optimal implementation, in that it uses brute force instead of indexing
    3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
    4. Missing most of the features that are routinely included in current DBMS
    5. Incompatible with all of the tools DBMS users have come to depend on

    If you take the time to read the article you'll find they use axiomatic arguments with lemmas like "schemas are good" and "separation of the schema from the application is good", etc. First, they make the assumption that these points are relevant and germane to MapReduce. But they mostly aren't.

    Also taking the five tenets listed, here are my observations:

    1. A giant step backward in the programming paradigm for large-scale data intensive applications

      They don't offer any proof, merely their view. However, the fact that Google used this technique to re-generate their entire internet index leads me to believe that if this were indeed a giant step backward, we must have been pretty darned evolved to step "back" into such a backwards approach.

    2. A sub-optimal implementation, in that it uses brute force instead of indexing

      Not sure why brute force is such a poor choice, especially given what this technique is used for. From Wikipedia:

      MapReduce is useful in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the World Wide Web, and replaced the old ad hoc programs that updated the index and ran the various analyses.
    3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago

      Again, not sure why something "old" represents something "bad". The most reliable rockets for getting our space satellites into orbit are the oldest ones.

      I would also argue their bold approach to applying these techniques in such a massively aggregated architecture is at least a little novel, and based on results of how Google has used it, effective.

    4. Missing most of the features that are routinely included in current DBMS

      They're mistakenly assuming this is for database programming.

    5. Incompatible with all of the tools DBMS users have come to depend on

      See previous bullet

    Are these guys just trying to stake a reputation based on being critical of Google?

    1. Re:may be missing the (data)points by starwed · · Score: 4, Informative

      I thought that this blog post was a pretty good-sounding critique of the article in question. (Of course, I don't know a damn thing about DB, relational or otherwise.)

    2. Re:may be missing the (data)points by Anonymous Coward · · Score: 5, Funny

      You missed points 6 through 9:

      6. New things are scary.
      7. Google is on their lawn.
      8. Matlock is the best television show ever.

    3. Re:may be missing the (data)points by mishabear · · Score: 5, Interesting

      > I don't know why this article is so harshly critical of MapReduce.
      > Are these guys just trying to stake a reputation based on being critical of Google?

      Um... yes?

      The Database Column is being coy about it, but it is in fact a corporate blog for Vertica, a high-performance database product. Vertica is a commercial implementation of C-Store and was founded by Michael Stonebraker, the most prominent proponent of column-based databases (get it? the database column). So yes, they have a very good reason to be hostile to Google.

      http://www.vertica.com/company/leadership
      http://en.wikipedia.org/wiki/C-Store
      http://en.wikipedia.org/wiki/Michael_Stonebraker
      http://www.databasecolumn.com/2007/09/contributors.html

    4. Re:may be missing the (data)points by abscondment · · Score: 4, Funny

      It's also terrible for painting.

      1. Since the bucket doesn't enforce any schema, you never know what color paint the bucket might hold. Heck, it could even be full of honey. You just can't know, and not being able to know is, well, like programming assembly.
      2. Buckets aren't indexed, so you're not able to find that one ounce of paint that you really want to use next. You've got to split up all of the paint into ounce cups each time and examine every cup. It's very intensive, and really slows down your painting. If you stored the paint in a B-tree of ounce cups, your search for the right ounce of paint would be much more efficient.
      3. Painting is so old. I mean, get with the program. Gold plate your house, or something newer (since newer is always better!). In fact, decades of research into titanium has determined that it'll hold up better to the elements, anyway, so you should just get titanium siding instead of painting.
      4. Painting is an incomplete process. What if you want a window? Yeah, you can't paint a window for yourself, now can you? Did you need a jacuzzi? A fireplace? A new car? Sorry! Painting doesn't support those features yet. You'd better not paint at all if you want those things.
      5. Painting, believe it or not, is incompatible with tennis. There's no racket, there's no court, and there's no ball. There's not even a net (unless you're working from a really tall building, in which case you might fall and so a net is often used). I mean, you don't even need to paint with another person. It's so... incompatible.
    5. Re:may be missing the (data)points by merreborn · · Score: 4, Interesting
      His conclusion really hits the nail on the head:

      What the authors really want to gripe about is distributed "cloud" data management systems like Amazon's SimpleDB; in fact if you change "MapReduce" to "SimpleDB" the original article almost makes sense.
  2. Blink blink by Thelasko · · Score: 4, Funny

    Once I saw the word paradigm in the summary I just glazed over like I do whenever our CEO gives a speech.

    --
    One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".
    1. Re:Blink blink by spun · · Score: 4, Funny

      Ah, the old "eyes glazing over" paradigm. Definitely no synergy in that. Here's an action item: leverage your value added intellectual capital to architect a new scenario.

      --
      - None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
  3. Databases? WTF? by mrchaotica · · Score: 4, Insightful

    Since when did MapReduce have anything to do with databases? It's actually about parallel computations, which are entirely different.

    --

    "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

  4. Ideas ahead of their time? by dazedNconfuzed · · Score: 4, Insightful

    it represents a specific implementation of well known techniques developed nearly 25 years ago

    There are many classic/old techniques which are only now being used - and very successfully - precisely because the hardware simply wasn't there. A recent /. post told of ray-tracing being soon used for real-time 3D gaming, and how it beats the socks off "rasterized" methods when a critical mass of polygons is involved; the techniques were well known and developed nearly 25 years ago, but only now do we have the CPU horsepower and vast fast memory capacities available for those "old" techniques to really shine. Likewise "old" "brute force" database techniques: they may not be clever and efficient like what we've been using for highly stable processing of relatively small-to-medium databases, but they work marvelously well when involving big unreliable networks of processors working on vast somewhat-incoherent databases - systems where modern shiny techniques just crumble and can't handle the scaling.

    Sometimes the "old" methods are best - you just need the horsepower to pull it off. Clever improvements only scale so long.

    --
    Can we get a "-1 Wrong" moderation option?
  5. FTFA by smcdow · · Score: 4, Insightful

    Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.


    That's a joke, right?

    I think Google's already taken care of all the experimental evaluations you'd need.

    --
    In the course of every project, it will become necessary to shoot the scientists and begin production.
  6. Article really misses the point by steveha · · Score: 4, Insightful

    I read through the whole article, and was just bemused. According to the article, MapReduce isn't as good as a real database at doing the sorts of things real databases do well. Um, okay, I guess, but MapReduce can do quite a lot of other things that they seem to have missed.

    Also, I had a major WTF moment when I read this:

    Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.

    Empirical evidence to date suggests that MapReduce scales insanely well. Exhibit A: Google, which uses MapReduce running on literally thousands of servers at a time to chew through literally hundreds of terabytes of data. (Google uses MapReduce to index the entire World Wide Web!)

    This in turn suggests that the authors of TFA are firmly ensconced in the ivory tower.

    They complained that brute-force is slower than indexed searches. Well, nothing about MapReduce rules out the use of indexes; and for common problems, Google can add indexes as desired. (Google uses MapReduce to build their index to the Web in the first place.) And because Google adds servers by the rackful, they have quite a lot of CPU power just waiting to be used. Brute force might not be slower if you split it across thousands of servers!
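
    As a rough illustration of building an index with map and reduce steps, here is a minimal single-process Python sketch. It is not Google's implementation; the names map_doc, reduce_postings and run_job and the sample documents are made up for the example, and a real job would spread the map and reduce calls across many servers.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_postings(word, doc_ids):
    """Reduce step: collapse all doc ids seen for one word into a sorted list."""
    return word, sorted(set(doc_ids))

def run_job(docs):
    """Run map, group the pairs by key (the 'shuffle'), then reduce, in one process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, value in map_doc(doc_id, text):
            grouped[word].append(value)
    return dict(reduce_postings(word, ids) for word, ids in grouped.items())

# Hypothetical sample data: document id -> document text.
docs = {1: "brute force scan", 2: "indexed scan", 3: "brute force wins"}
index = run_job(docs)
```

    Each map_doc call only looks at one document, which is why the map work can be handed to as many machines as are available.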

    Likewise, they complain that one can't use standard database report-generating tools with MapReduce; but if the Reduce tasks insert their results into a standard database, one could then use any standard report-generating tools.

    MapReduce lets Google folks do crazy one-off jobs like ask every single server they own to check through their system logs for a particular error, and if it's found, return a bunch of config files and log files. Even if you had some sort of distributed database that could run on thousands of machines, any of which might die at any moment, and if you planned ahead and set the machines to copy their system logs into the database, I don't see how a database would be better for that task. That's just a single task I just invented as an example; there are many others, and MapReduce can do them all.

    And one of the coolest things about MapReduce is how well it copes with failure. Inevitably some servers will respond very slowly, or will die and not respond; the MapReduce scheduler detects this and sends the Map tasks out to other servers so the job still finishes quickly. And Google keeps statistics on how often a computer is slow. At a lecture, I heard a Google guy explain how there was a BIOS bug that made one server in 50 disable some cache memory, thus greatly slowing down server performance; the MapReduce statistics helped them notice they had a problem, and isolate which computers had the problem.
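
    The reassignment idea can be sketched in a few lines of Python. This is only a toy under stated assumptions: workers are plain functions, a dead worker is simulated by raising RuntimeError, tasks run one after another, and the names run_with_reassignment, healthy and dead are invented here. A real scheduler would use timeouts and run tasks in parallel.

```python
def run_with_reassignment(tasks, workers, map_fn):
    """Run each task on a worker; if that worker fails, hand the task to the next one."""
    results = {}
    failures = {name: 0 for name, _ in workers}  # per-worker failure statistics
    for i, task in enumerate(tasks):
        for attempt in range(len(workers)):
            name, worker = workers[(i + attempt) % len(workers)]
            try:
                results[task] = worker(map_fn, task)
                break  # task finished, move on to the next task
            except RuntimeError:
                failures[name] += 1  # remember which machine let us down
        else:
            raise RuntimeError(f"no worker could finish task {task!r}")
    return results, failures

def healthy(map_fn, task):
    """Simulated working server: just runs the map function."""
    return map_fn(task)

def dead(map_fn, task):
    """Simulated dead server: never returns a result."""
    raise RuntimeError("worker did not respond")

workers = [("w0", healthy), ("w1", dead), ("w2", healthy)]
results, failures = run_with_reassignment(["a", "bb", "ccc", "dddd"], workers, len)
```

    The failures dictionary plays the role of the statistics mentioned above: a machine that keeps showing up there is the one worth inspecting.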

    MapReduce lets you run arbitrary jobs across thousands of machines at once, and all the authors of the article seem to be able to see is that it's not as database-oriented as a real database.

    steveha

    --
    lf(1): it's like ls(1) but sorts filenames by extension, tersely
  7. Indexing is useless here. by SharpFang · · Score: 4, Insightful

    Indexing works by picking a small slice of the data you have (as a list of hashes), and changing it into a much smaller table mapping the data onto a group of records matching it. The index is smaller and conforms to a certain strict standard, so it's very fast to brute force. Then as you get the list of indices, you brute force them, and this way you get the record.

    This works well if you can create such a slice - a piece of data you will match against. It becomes increasingly unwieldy if there are many ways to match the data - multiple columns mean multiple indices. And then if you remove columns entirely, making records just long strings, and start matching random words in the record, the index becomes useless - hashes become bigger than the chunks of data they match against, indexing all possible combinations of words you can match against results in an index bigger than the database, and generally... bummer. An index doesn't work well against freestyle data searchable in random form.

    Imagine a database with its main column being VARCHAR(255) and using about full length of it, then search using a lot of LIKE and AND, picking various short pieces out of that column, and the database being terabytes big. Try to invent a way to index it.
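
    To make the contrast concrete, here is a small Python sketch. The records list, exact_index, lookup_exact and scan_like are all hypothetical names for illustration; the dictionary stands in for an exact-match index on the column, and scan_like stands in for a query built from several LIKE '%...%' conditions joined with AND.

```python
from collections import defaultdict

# Hypothetical free-text column values.
records = [
    "error disk full on node seven",
    "warning cache disabled by bios",
    "error cache disabled on node nine",
]

# Exact-match index: maps a whole column value to the row numbers holding it.
exact_index = defaultdict(list)
for row, value in enumerate(records):
    exact_index[value].append(row)

def lookup_exact(value):
    """Indexed path: one dictionary probe, no scan of the records."""
    return exact_index.get(value, [])

def scan_like(*fragments):
    """Brute-force path: every row is examined for every fragment."""
    return [row for row, value in enumerate(records)
            if all(fragment in value for fragment in fragments)]
```

    The index answers a lookup only when the full value is known. A query such as scan_like("cache", "node") gets nothing from it and has to walk every record, and that scan is the part that splits naturally across many machines.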

    --
    45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
  8. In related news: Screwdrivers suck because... by DragonWriter · · Score: 5, Funny

    1) They don't look like hammers,
    2) They don't work like hammers,
    3) You can already drive in a screw with a hammer,
    4) They aren't good at ripping out nails, and
    5) They aren't good at driving nails.

    Brought to you by The Hammer Column, a blog written by experts in the hammer industry, and launched by Hammertron, makers of a revolutionary new kind of hammer.