Slashdot Mirror


MapReduce — a Major Step Backwards?

The Database Column has an interesting, if negative, look at MapReduce and what it means for the database community. MapReduce is a software framework developed by Google to handle parallel computations over large data sets on cheap or unreliable clusters of computers. "As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is: a giant step backward in the programming paradigm for large-scale data intensive applications; a sub-optimal implementation, in that it uses brute force instead of indexing; not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago; missing most of the features that are routinely included in current DBMS; incompatible with all of the tools DBMS users have come to depend on."
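
For readers who have not seen the pattern the article is arguing about, here is a minimal single-process sketch of the map, group, and reduce steps. It is not Google's implementation or API; `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical names invented for this illustration, and a real framework would spread the work over many machines.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group the emitted pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map step: emit (key, value) pairs
            groups[key].append(value)       # shuffle step: group values by key
    # reduce step: collapse each group of values into one result
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1

def count_reducer(word, counts):
    # Sum the 1s emitted for this word.
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = map_reduce(lines, word_mapper, count_reducer)
print(word_counts["the"], word_counts["fox"])
```

The point of the pattern is that the mapper and reducer know nothing about where the data lives, which is what lets a framework distribute them.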

157 comments

  1. may be missing the (data)points by yagu · · Score: 5, Insightful

    I don't know why this article is so harshly critical of MapReduce. They base their critique and criticism on the following five tenets, which they further elaborate in detail in the article:

    1. A giant step backward in the programming paradigm for large-scale data intensive applications
    2. A sub-optimal implementation, in that it uses brute force instead of indexing
    3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
    4. Missing most of the features that are routinely included in current DBMS
    5. Incompatible with all of the tools DBMS users have come to depend on

    If you take the time to read the article you'll find they use axiomatic arguments with lemmas like: "schemas are good", and "Separation of the schema from the application is good", etc. First, they make the assumption that these points are relevant and germane to MapReduce. But they mostly aren't.

    Also taking the five tenets listed, here are my observations:

    1. A giant step backward in the programming paradigm for large-scale data intensive applications

      They don't offer any proof, merely their view... However, the fact that Google used this technique to re-generate their entire internet index leads me to believe that if this were indeed a giant step backward, we must have been pretty darned evolved to step "back" into such a backwards approach.

    2. A sub-optimal implementation, in that it uses brute force instead of indexing

      Not sure why brute force is such a poor choice, especially given what this technique is used for. From wikipedia:

      MapReduce is useful in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the World Wide Web, and replaced the old ad hoc programs that updated the index and ran the various analyses.
    3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago

      Again, not sure why something "old" represents something "bad". The most reliable rockets for getting our space satellites into orbit are the oldest ones.

      I would also argue their bold approach to applying these techniques in such a massively aggregated architecture is at least a little novel, and based on results of how Google has used it, effective.

    4. Missing most of the features that are routinely included in current DBMS

      They're mistakenly assuming this is for database programming

    5. Incompatible with all of the tools DBMS users have come to depend on

      See previous bullet

    Are these guys just trying to stake a reputation based on being critical of Google?
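
    To make the "brute force instead of indexing" point concrete, here is a rough sketch contrasting a full scan (the distributed-grep style) with a lookup in an inverted index. The brute-force pass over every document is also how the index gets built in the first place. The document set and function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus; real inputs would be crawled pages.
documents = {
    "doc1": "mapreduce uses brute force scans",
    "doc2": "databases use indexes",
    "doc3": "brute force can still be fast on a cluster",
}

def brute_force_search(docs, term):
    """Scan every document on every query, grep style."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())

def build_inverted_index(docs):
    """One full pass over the data produces a word -> document ids index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():   # map: emit (word, doc_id)
        for word in text.split():
            index[word].add(doc_id)     # reduce: collect the doc ids per word
    return index

index = build_inverted_index(documents)
scan_hits = brute_force_search(documents, "brute")
index_hits = sorted(index["brute"])
print(scan_hits, index_hits)
```

    Both approaches return the same hits; the difference is that the scan pays the full cost on every query while the index pays it once up front.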

    1. Re:may be missing the (data)points by CajunArson · · Score: 3, Insightful

      Are these guys just trying to stake a reputation based on being critical of Google?

      I tend to agree. I could probably write a nice article about how map-reduce would be a terrible system to use in making a 3D game. Could an article like that be technically true? Sure. Would it be anything more than a logical non sequitur? Not unless Google all of a sudden came out and claimed MapReduce is the new platform for all 3D game development (not likely).

      --
      AntiFA: An abbreviation for Anti First Amendment.
    2. Re:may be missing the (data)points by starwed · · Score: 4, Informative

      I thought that this blog post was a pretty good sounding critique of the article in question. (Of course, I don't know a damn thing about DB, relational or otherwise. . )

    3. Re:may be missing the (data)points by dezert_fox · · Score: 3, Insightful

      > If you take the time to read the article you'll find they use axiomatic arguments with lemmas like: "schemas are good", and "Separation of the schema from the application is good", etc.

      Actually, it says: "The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968. Schemas are good. Separation of the schema from the application is good. High-level access languages are good." Way to conveniently drop important contextual information. Axioms like these, derived from 40 years of experience, carry a lot of weight for me.

    4. Re:may be missing the (data)points by Anonymous Coward · · Score: 5, Funny

      You missed points 6 through 9:

      6. New things are scary.
      7. Google is on their lawn.
      8. Matlock is the best television show ever.

    5. Re:may be missing the (data)points by Otter · · Score: 1
      Are these guys just trying to stake a reputation based on being critical of Google?

      I don't know much about database theory, but do know that Michael Stonebraker already has a reputation.

    6. Re:may be missing the (data)points by oldhack · · Score: 1

      I'm guessing MapReduce schemes are eating into traditional RDBMS market? If so, are there concrete products that implement MapReduce algorithm?

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
    7. Re:may be missing the (data)points by samkass · · Score: 3, Insightful

      Speaking as someone who works for a company whose product uses a database that is neither relational nor object-oriented, I can say from experience that folks who have devoted a significant amount of their lives to mastering that methodology see anything else as a threat. There are definitely use-cases for non-relational databases-- they're used at both Google and Amazon, as well as many other places. You can either burn significant effort defending your decision to go non-relational, or you can move on and ignore these folks and produce great products. The problem is that sometimes they make good points (especially about some aspects of indexing), but it's almost always lost in the "but... but... but... you're not relational!" argument.

      --
      E pluribus unum
    8. Re:may be missing the (data)points by Anonymous Coward · · Score: 0

      So it is a tool from google for google, the rest of us who are not interested in building search engines can safely ignore this framework and use and build better tools for the job

    9. Re:may be missing the (data)points by DragonWriter · · Score: 3, Interesting

      I don't know why this article is so harshly critical of MapReduce.


      The primary grounds for complaint seems to be "this isn't the way we do things in the database world". Each of the complaints (except #3) boils down to this (#1: The database community had arguments a few decades back and developed, at the time, a set of conventions; Map Reduce doesn't follow them and is, therefore, bad; #2: All databases use one of two kinds of indexes to accelerate data access; MapReduce doesn't and is, therefore, bad; #3: Databases do something like MapReduce, so MapReduce isn't necessary; #4: Modern databases tend to offer a variety of support utilities and features that MapReduce doesn't, so MapReduce is bad; #5: MapReduce isn't out-of-the-box compatible with existing tools designed to work with existing databases and is, therefore, bad.)

      And it's from The Database Column, a blog that from its own "About" page is comprised of experts from the database industry.

      I suspect part of the reason they are harshly critical is that this is a technology whose adoption and use in large, data-centric tasks is (regardless of efficiency) a threat to the market value of the skills in which they've invested years and $$ developing expertise.

      At the end, they note (as an afterthought) that they recognize that MapReduce is an underlying approach, and that there are ongoing projects to build DBMSs on top of MapReduce. That fact, if considered for more than a second, explodes all of their criticism, which is entirely premised on the idea that MapReduce is intended as a general-purpose replacement for existing DBMSs. It is instead a lower-level technology that is currently used stand-alone for applications where current RDBMSs do not provide adequate performance (regardless of their other features), and on which DBMS implementations (with all the features they complain about MapReduce lacking) might, in the future, be built.
    10. Re:may be missing the (data)points by ShakaUVM · · Score: 2, Insightful

      Map/Reduce is a very common operation in parallel processing. From my very quick look, it does seem as if the authors are right -- it looks like a quick and dirty implementation of a common operation, and not a "paradigm shift" in the slightest.

    11. Re:may be missing the (data)points by Splab · · Score: 2

      Did an assignment on MapReduce some time ago. While I wasn't really impressed with it as a "database", they did some really cool stuff with distributing the calculations. I did note back then that it wasn't really useful for general industry, but it was still a very nice piece of software.

    12. Re:may be missing the (data)points by NewbieProgrammerMan · · Score: 1

      Speaking as someone who works for a company whose product uses a database that is neither relational nor object-oriented, I can say from experience that folks who have devoted a significant amount of their lives to mastering that methodology see anything else as a threat.
      I've bumped into this attitude in the little bit of time I spent as a developer: people who think that every last bit of configuration and data can (and must!) be crammed into a relational model, whether it belongs there or not. Performance and complexity be damned....it's relational! It's got to be good!
      --
      [b.belong('us') for b in bases if b.owner() == 'you']
    13. Re:may be missing the (data)points by datablaster · · Score: 1

      Kinda looks like somebody who didn't really get it saw Google's paper, told a DBA half the details of what they didn't understand anyway, the DBA heard the word "data", read half a paragraph of Google's paper, and next thing you know all hell broke loose in the DBA's office. The DBAs called their friends who didn't actually read the paper, etc etc.

    14. Re:may be missing the (data)points by Anonymous Coward · · Score: 1, Funny

      And it's from The Database Column, a blog that from its own "About" page is comprised of experts from the database industry

      Yes, I'm sure they are, but notice that they were unable to resolve a many-to-many relationship for authors and articles on their own website's db:
       
        [Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]

    15. Re:may be missing the (data)points by Larry+Lightbulb · · Score: 1

      Pick?

    16. Re:may be missing the (data)points by Anonymous Coward · · Score: 0

      Where does anyone say Map reduce is a general purpose DBMS?
      Sure, it's good for 'G's problem, just like my indexed recipe card file is good in the kitchen.

    17. Re:may be missing the (data)points by MajinBlayze · · Score: 1
      This article comes from a website called "databasecolumn.com"; they are going to look at this from a database perspective (assuredly a *relational* database perspective).

      The first sentence states

      On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we'll begin here with our views on MapReduce.

      This indicates that they are responding to a fairly specific question, or rather a series of specific questions made by individuals who (misguidedly) asked if MapReduce (for example) seemed like an advancement beyond RDBMS's.

      To me, it seems a little more like responding to your example by saying MapReduce won't work for 3D gaming development for these reasons instead of saying MapReduce has implications in other fields, but probably won't affect 3D gaming development.

      If you re-read the article with this in mind, it seems more correct if still a little troll-ish.
      --
      "Hate is baggage. Life's too short to be pissed off all the time." Danny Vinyard -American History X
    18. Re:may be missing the (data)points by MagikSlinger · · Score: 1

      Way to conveniently drop important contextual information. Axioms like these, derived from 40 years of experience, carry a lot of weight for me.

      While a good point, it's still irrelevant to Google and Map-Reduce because Google's search engine is NOT an RDBMS. It's almost pure indexing, and what they are doing is comparing, say, Oracle to a specific B+-Tree implementation. They are seeing the Map-Reduce algorithm purely from an RDBMS perspective, not a "let's solve this specific problem" perspective.

      I'm reminded of a co-worker who accused me of being a database-bigot: you want to solve everything with a [RDBMS]. He was right. :-)

      --
      The bitter lessons of a veteran coder: http://bitterprogrammer.blogspot.com
    19. Re:may be missing the (data)points by mini+me · · Score: 2, Interesting

      CouchDB, ThruDB, RDDB, and SimpleDB, to name a few.

    20. Re:may be missing the (data)points by mishabear · · Score: 5, Interesting

      > I don't know why this article is so harshly critical of MapReduce.
      > Are these guys just trying to stake a reputation based on being critical of Google?

      Um... yes?

      The Database Column is being coy about being a corporate blog for Vertica, a high performance database product, but in fact it is. Vertica is a commercial implementation of C-Store and was founded by Michael Stonebraker, the most prominent proponent of column-based databases (get it? the database column). So yes, they have a very good reason to be hostile to Google.

      http://www.vertica.com/company/leadership
      http://en.wikipedia.org/wiki/C-Store
      http://en.wikipedia.org/wiki/Michael_Stonebraker
      http://www.databasecolumn.com/2007/09/contributors.html

    21. Re:may be missing the (data)points by einhverfr · · Score: 3, Insightful

      Hmmm.... ISTM that the basic critiques come down to:

      1) No indexing.

      Which means

      2) Certain types of constraints probably don't work (such as UNIQUE constraints)

      Which also means

      3) Referential integrity checking and other things don't work.

      This leads to the conclusion that the idea is good for certain types of data-intensive but not integrity-intensive applications (think Ruby on Rails-type apps) but *not* good for anything Edgar Codd had in mind....
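
      As a rough illustration of the chain from indexing to constraints, here is a small sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in here (the article names no particular DBMS); it enforces a UNIQUE column through an automatically created index, which is the kind of machinery a plain scan-everything system does not give you. The table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT UNIQUE)")
conn.execute("INSERT INTO users VALUES ('a@example.com')")

# A second insert of the same value is rejected by the database itself.
try:
    conn.execute("INSERT INTO users VALUES ('a@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The UNIQUE constraint is backed by an index that SQLite created on its own.
indexes = conn.execute("PRAGMA index_list(users)").fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(duplicate_rejected, len(indexes), row_count)
```

      Without that index, checking uniqueness on each insert would mean scanning all existing rows, which is why the three critiques hang together.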

      --

      LedgerSMB: Open source Accounting/ERP
    22. Re:may be missing the (data)points by MorpheousMarty · · Score: 1

      I am still missing point 9.

    23. Re:may be missing the (data)points by Anonymous Coward · · Score: 1, Funny

      9. Profit!

    24. Re:may be missing the (data)points by abscondment · · Score: 4, Funny

      It's also terrible for painting.

      1. Since the bucket doesn't enforce any schema, you never know what color paint the bucket might hold. Heck, it could even be full of honey. You just can't know, and not being able to know is, well, like programming assembly.
      2. Buckets aren't indexed, so you're not able to find that one ounce of paint that you really want to use next. You've got to split up all of the paint into ounce cups each time and examine every cup. It's very intensive, and really slows down your painting. If you stored the paint in a B-tree of ounce cups, your search for the right ounce of paint would be much more efficient.
      3. Painting is so old. I mean, get with the program. Gold plate your house, or something newer (since newer is always better!). In fact, decades of research into titanium has determined that it'll hold up better to the elements, anyway, so you should just get titanium siding instead of painting.
      4. Painting is an incomplete process. What if you want a window? Yeah, you can't paint a window for yourself, now can you? Did you need a jacuzzi? A fireplace? A new car? Sorry! Painting doesn't support those features yet. You'd better not paint at all if you want those things.
      5. Painting, believe it or not, is incompatible with tennis. There's no racket, there's no court, and there's no ball. There's not even a net (unless you're working from a really tall building, in which case you might fall and so a net is often used). I mean, you don't even need to paint with another person. It's so... incompatible.
    25. Re:may be missing the (data)points by Blakey+Rat · · Score: 2, Interesting

      What bothers me the most is how much hype it gets. I work for a company that has had a "MapReduce" implementation (used internally) for as long as Google has, and we're not getting drooled over by the tech press. I'm sure tons of companies that have had to solve similar problems have already made this tool. Even though the languages and syntax involved might differ between implementations, it's nothing all that great.

    26. Re:may be missing the (data)points by merreborn · · Score: 4, Interesting
      His conclusion really hits the nail on the head:

      What the authors really want to gripe about is distributed "cloud" data management systems like Amazon's SimpleDB; in fact if you change "MapReduce" to "SimpleDB" the original article almost makes sense.
    27. Re:may be missing the (data)points by jacquesm · · Score: 1

      map-reduce has *many* applications outside search

    28. Re:may be missing the (data)points by ampsicora · · Score: 1

      I agree that the article misses many points.
      Comparing MapReduce to a DBMS is like comparing apples to oranges.
      In the article it says MapReduce may be a good idea for writing certain types of general-purpose computations. Thing is, MapReduce works where traditional databases don't.
      DBMS are general purpose and deal with concurrency, and have a Read/Insert/Update/Delete paradigm: at any time, you could perform any of these operations on any record.
      MapReduce is specialized on performing calculations (i.e. reading) on a huge amount of data on a cluster, with performance that increases almost linearly with the number of nodes.
      Also, the article cites MapReduce only, but it doesn't mention that MapReduce works hand-in-hand with a DFS (Distributed File System). Try to load 1 terabyte of data into Oracle (or any other DBMS) daily and let me know how it goes... where are you going to store it?
      If your data is write once, read many (which is the case when you're dealing with logs, like search event logs), you can make assumptions that simplify your storage and processing a great deal (indexes are easier to maintain, etc.). That leads to an inexpensive distributed file system on a cluster of commodity hardware, which is the real key here, as cheap, scalable processing power and storage space are both paramount for growth. It's kind of useless to talk about MapReduce without talking about DFS.
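
      A toy sketch of why this scales with the number of nodes: if the log is already split into partitions (as a DFS would store it), each worker can scan its own partition independently and only the small partial results need to be merged. Threads stand in for cluster nodes here, and the log lines and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical search-event log, already split into three partitions.
partitions = [
    ["search cats", "search dogs", "click cats"],
    ["search cats", "click dogs"],
    ["search birds", "search cats"],
]

def count_events(partition):
    """Map step: each worker scans only its own partition."""
    return Counter(line.split()[0] for line in partition)

def merge_counts(partials):
    """Reduce step: combine the small per-partition tallies."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# One worker per partition; the workers never need to talk to each other.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_counts = list(pool.map(count_events, partitions))

event_totals = merge_counts(partial_counts)
print(event_totals["search"], event_totals["click"])
```

      Because the data is never updated in place, no worker has to lock anything, which is the simplifying assumption the write-once, read-many model buys you.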

    29. Re:may be missing the (data)points by Tablizer · · Score: 1

      I've bumped into this attitude in the little bit of time I spent as a developer: people who think that every last bit of configuration and data can (and must!) be crammed into a relational model, whether it belongs there or not. Performance and complexity be damned....it's relational! It's got to be good!

      As a table fan, what I find is that the alternatives are usually worse, or at least not different enough to fight over. Very large organizations that do something fairly specific and narrow over and over can indeed experience benefits from custom-built or nichey non-relational or semi-relational DB's. However, for small and medium companies, off-the-shelf RDBMS are usually plenty sufficient such that a roll-your-own-DB solution is not worth it.

      But if you have specific things that RDBMS suck at, let's look at them and learn. Maybe it was just bad schema design instead of a need to toss RDBMS altogether. Or perhaps existing RDBMS can be adjusted or fixed. Let's not toss the baby out with the bath-water.

    30. Re:may be missing the (data)points by tiberiusteng · · Score: 1

      The primary grounds for complaint seems to be "this isn't the way we do things in the database world"
      I think they basically chose the wrong product to compare against... This product is more likely the "database" of Google. http://labs.google.com/papers/bigtable.html
    31. Re:may be missing the (data)points by rakslice · · Score: 1

      I don't think that these guys are being deliberately antagonistic, it's just that, well,...

      Relational databases are so routinely used in applications they aren't a good fit for that I suspect many relational database "experts" aren't even acquainted with the practise of selecting network computing paradigms and technologies that are a good fit for the application.

    32. Re:may be missing the (data)points by Anonymous Coward · · Score: 0

      Sure, but that's not really the point. There are dozens of different approaches in high performance computing; I prefer MPI, where you have freedom to choose and program it the way you want. I don't want big corporations promoting their platforms tailored for search engines. The issue is the hype around Google's map-reduce implementation, not the map-reduce algorithm, which has been known for a long time and is studied in undergrad CS courses. I don't want to couple my simulations to their codebase.

    33. Re:may be missing the (data)points by Anonymous Coward · · Score: 1, Informative
      I also work for a company that is working on a "post-relational" database. OK... I work for the same company as the parent, but I'm one of the guys in charge of building the back-end. He's one of the guys that makes me look good.


      Relational databases were the perfect solution for the data processing environments of the 60's through the end of the century, but the computational landscape has changed significantly in three ways: scale, dirty data, and distribution.

      To support ACID semantics, relational databases require convergence of control for transactional domains. While this can be distributed, the sacrifices necessary to make it happen typically reduce your relational performance beyond what is acceptable. To handle the scale of data that Amazon and Google process in real time every moment of every day, you need to work independently across dozens or hundreds of machines. Each one can't block its processing in a mother-may-I request to lock objects or commit a transaction. To be able to process every comment on any particular item in Amazon's database in time to spit out a web page in 300ms, you need to leave the transactions behind.

      Especially for Google, the data does not lend itself well to indexing. They're sucking down their data from all corners of the Earth (Earth is processed in another system...) and trying to do meaningful analysis on this grungy data. It's much easier to have local parsing and exception handling rather than trying to stuff everything into the same rigid schema. Most post-relational systems have soft notions of schema; it's more in the realm of metadata giving hints about how you might want to look at the data rather than a guarantee about what form the data will have, and the code adapts to dirty data as it comes through the pipe.
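
      A small sketch of what "local parsing and exception handling" with a soft schema might look like. The tab-separated record layout (url, status, bytes) and the fallback choices are assumptions made up for this example, not anything Google or Amazon has described.

```python
def parse_record(raw):
    """Best-effort parse of 'url<TAB>status<TAB>bytes'; returns None if unusable."""
    fields = raw.rstrip("\n").split("\t")
    if not fields[0]:
        return None                       # no URL at all: drop the record
    record = {"url": fields[0]}
    try:
        record["status"] = int(fields[1])
    except (IndexError, ValueError):
        record["status"] = None           # missing or garbled status: keep going
    try:
        record["bytes"] = int(fields[2])
    except (IndexError, ValueError):
        record["bytes"] = 0               # missing or garbled size: treat as zero
    return record

raw_lines = [
    "http://a.example\t200\t512",
    "http://b.example\tOK",               # bad status, no size
    "",                                   # empty line
    "http://c.example\t404\tn/a",         # bad size
]
parsed = [r for r in (parse_record(line) for line in raw_lines) if r is not None]
total_bytes = sum(r["bytes"] for r in parsed)
print(len(parsed), total_bytes)
```

      A rigid schema would reject three of these four lines at load time; here the code salvages what it can and moves on.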

      Related to the first point, connectivity is now so cheap that it's a requirement to make these systems distributed. You can't have all your data sitting in one data center; all it takes is one mouth-breather with a backhoe to turn off your company. So you replicate the data to multiple centers. Of course, you want to send updates to all those data centers, which brings us to the first point that distributed transactions are a barrier to scalability. But most fundamental to the whole discussion is the CAP theorem: you can have Consistency, Availability, or survive network Partitions. Pick two. Post-relational systems choose Availability and Partition survival over guaranteed consistency of their data. This allows them to scale tremendously and be very, very resilient to interruptions in the underlying communications systems.

      For a very interesting read, I recommend reading Werner Vogels's excellent paper on the theory and practice of Amazon's Dynamo back-end.

    34. Re:may be missing the (data)points by samkass · · Score: 1

      A colleague of mine replied anonymously to my post and said things much more eloquently than I could. No, it's nothing to do with table design or adjustments... it's a fundamental shortcoming of relational theory. Relational databases require transactional integrity and therefore a centralized locking system. There is no way to scale it beyond a certain point.

      --
      E pluribus unum
    35. Re:may be missing the (data)points by MightyYar · · Score: 1

      9 minus 6 is 3, and there are 3 points there. So what's your problem? :)

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    36. Re:may be missing the (data)points by Allador · · Score: 1

      A giant step backward in the programming paradigm for large-scale data intensive applications
      they don't offer any proof, merely their view...

      Actually they do. Their proof is that this sort of approach was tried several decades ago, and has been found to be a fundamentally lacking approach for general purpose data processing. They provide several examples.

      Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago

      Again, not sure why something "old" represents something "bad". The most reliable rockets for getting our space satellites into orbit are the oldest ones.

      It's not bad because it's 'old'. It's bad because it was rejected by the community consensus a long time ago as a general purpose solution. Their argument isn't that old = bad. Their argument is that ignoring the lessons of the past is bad.

      Missing most of the features that are routinely included in current DBMS

      They're mistakenly assuming this is for database programming

      They never state that MapReduce is for database programming. They are evaluating it for applicability in 'large-scale data intensive applications', which is also what modern RDBMSs are used for.

      I think the valid criticism is that anyone is suggesting that MapReduce is a useful general purpose framework. It's clearly not. It's only useful in a highly specific (and fairly niche) set of situations. Further, it is really only successful now because of the specific economics around very large scale data processing. It, combined with Google's business model, allows them to have an economic advantage (above a certain scale size).

      In other words, the excitement around Google's use of MapReduce is as much around their economic structuring of their business around some concepts as it is about the technology itself. MapReduce in and of itself isn't terribly novel and interesting. MapReduce as implemented and used by Google as a business is interesting.
    37. Re:may be missing the (data)points by Tablizer · · Score: 1

      A colleague of mine replied anonymously to my post and said things much more eloquently than I could.

      Where is it?

      Relational databases require transactional integrity and therefore a centralized locking system. There is no way to scale it beyond a certain point.

      Not true. Transaction integrity tools are merely a nice feature, but not a necessary one.

    38. Re:may be missing the (data)points by Anonymous Coward · · Score: 0

      Looks like it was #22107502.

      Even MySQL recognized the need for half-assed transactions. If two clients do "UPDATE t SET v = v + 1" simultaneously, the result will always be v + 2, never v + 1 or garbage. There is necessarily a cost for this serialization, but without it you can't scale the system beyond a single client having write access unless you give up correctness (which of course Google can because nobody else has the resources to catch them at it).
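
      To illustrate the difference, here is a sketch using Python's built-in `sqlite3` as a stand-in for MySQL. It does not run real concurrent clients; it replays one unlucky interleaving by hand on a single connection, which is an assumption made to keep the example deterministic.

```python
import sqlite3

def fresh_counter():
    """Create an in-memory table t with a single row v = 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (v INTEGER)")
    conn.execute("INSERT INTO t VALUES (0)")
    return conn

def read_v(conn):
    return conn.execute("SELECT v FROM t").fetchone()[0]

# Unsafe: clients A and B each read v, then each write back their own v + 1.
conn = fresh_counter()
seen_by_a = read_v(conn)
seen_by_b = read_v(conn)
conn.execute("UPDATE t SET v = ?", (seen_by_a + 1,))
conn.execute("UPDATE t SET v = ?", (seen_by_b + 1,))
lost_update_result = read_v(conn)   # one of the two increments is lost

# Safe: the increment happens inside the database, one statement at a time.
conn = fresh_counter()
conn.execute("UPDATE t SET v = v + 1")
conn.execute("UPDATE t SET v = v + 1")
atomic_result = read_v(conn)

print(lost_update_result, atomic_result)
```

      The second form only works because the database serializes the two statements, and that serialization is exactly the cost being argued about in this thread.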

    39. Re:may be missing the (data)points by jd · · Score: 1
      MapReduce is a clustered database system, which implies you have a cluster, which should be very good for games. Alternatively, if you wanted a VERY large game of Empire or NetPanzer, a distributed world would actually be quite helpful.

      I am a little surprised by the parallelization methods, though. Informix developed some parallel database methods some time back, which is partly why IBM bought them. I'm sure that the cutting-edge parallel database techniques that exist today have advanced beyond heuristics and hack-and-slash work division.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    40. Re:may be missing the (data)points by eh2o · · Score: 3, Interesting

      MapReduce falls under the category of embarrassingly parallel algorithms. It isn't a step backwards, it just has a limited scope.

      Google's contribution (and yes it does predate them by a long time) is to point out that MapReduce is a bit more than an algorithm: it is a design pattern. Design patterns help us write clean code by establishing a consistent vocabulary (e.g. actors, containers, operators, etc.), and furthermore are important insofar as they make algorithms accessible to programmers. Right now we badly need more well-defined design patterns in the area of parallel computing as this is essentially the future of programming.

    41. Re:may be missing the (data)points by samkass · · Score: 1

      you can't scale the system beyond a single client having write access unless you give up correctness

      ...or explicitly manage it. Detecting the conflicts is significantly cheaper and more parallelizable than preventing them. Then it's a matter of doing something reasonable about it after-the-fact.
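
      One common way to detect conflicts rather than prevent them is a version check on write (optimistic concurrency). The sketch below is a toy, single-process illustration under that assumption; `VersionedStore` and `increment_with_retry` are hypothetical names, not part of any product mentioned in this thread, and "retry" is just one possible "something reasonable".

```python
class VersionedStore:
    """Toy key-value store; every write must name the version it was based on."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def write(self, key, based_on_version, value):
        current_version, _ = self.read(key)
        if based_on_version != current_version:
            return False                     # conflict detected: someone wrote first
        self._data[key] = (current_version + 1, value)
        return True

def increment_with_retry(store, key):
    """Resolve a conflict after the fact by re-reading and trying again."""
    while True:
        version, value = store.read(key)
        if store.write(key, version, (value or 0) + 1):
            return

store = VersionedStore()
version_a, value_a = store.read("hits")      # writer A reads version 0
version_b, value_b = store.read("hits")      # writer B reads version 0 too
a_ok = store.write("hits", version_a, 1)     # A wins
b_ok = store.write("hits", version_b, 1)     # B's write is stale and is refused
if not b_ok:
    increment_with_retry(store, "hits")      # B recovers without any lock
final_version, final_value = store.read("hits")
print(a_ok, b_ok, final_value)
```

      No writer ever waits on a lock; the price is that every client has to handle a refused write.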

      --
      E pluribus unum
    42. Re:may be missing the (data)points by Anonymous Coward · · Score: 0

      Then you have to implement your conflict recovery in a consistent and correct way in every programming language anyone in the organization is using, and convince everyone to use it. In practice people are going to build and use tools without you even knowing about it. Safety checks belong in the database to ensure they'll actually happen.

    43. Re:may be missing the (data)points by samkass · · Score: 1

      Then you have to implement your conflict recovery in a consistent and correct way in every programming language anyone in the organization is using

      Yup, pretty much. Preferably you provide mechanisms to make this easy in a library you distribute along with the DB access tools.

      Safety checks belong in the database to ensure they'll actually happen.

      Not if you want your database to operate distributed across the planet with thousands or hundreds of thousands of simultaneous writers.

      This is exactly the sort of response I mentioned in my original message. Folks whose livelihood depends on traditional database ideas will fight tooth and nail to preserve the status quo.

      --
      E pluribus unum
    44. Re:may be missing the (data)points by Anonymous Coward · · Score: 0

      The key to my argument was: People are going to access your database behind your back, and they aren't going to use your abstraction layer (which you didn't port to PHP3 or VB6 or whatever, no matter how many dozen languages you did support) when they do it. I've seen it happen. I've even had to do it myself.

    45. Re:may be missing the (data)points by samkass · · Score: 1

      The key to my argument was: People are going to access your database behind your back, and they aren't going to use your abstraction layer (which you didn't port to PHP3 or VB6 or whatever, no matter how many dozen languages you did support) when they do it. I've seen it happen. I've even had to do it myself.

      Although the primary language is Java, I expect the only way non-Java folks will even be able to connect to, query from, and submit to our future data stores is through a web service accessible from all modern languages. Besides, there's no other choice. Relational databases with locking are simply algorithmically unsuitable.

      --
      E pluribus unum
  2. Just watch. by jonnythan · · Score: 1, Insightful

    It's a technical step backwards, they're doing it all wrong, experts say you should do it this other way....

    And watch. It'll be massively successful because it works.

  3. Blink blink by Thelasko · · Score: 4, Funny

    Once I saw the word paradigm in the summary I just glazed over like I do whenever our CEO gives a speech.

    --
    One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".
    1. Re:Blink blink by spun · · Score: 4, Funny

      Ah, the old "eyes glazing over" paradigm. Definitely no synergy in that. Here's an action item: leverage your value added intellectual capital to architect a new scenario.

      --
      - None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
    2. Re:Blink blink by putch · · Score: 1

      [obligatory simpsons quote] Excuse me, but "proactive" and "paradigm"? Aren't these just buzzwords that dumb people use to sound important?

      Not that I'm accusing you of anything like that.

      I'm fired, aren't I?[/obligatory simpsons quote]

      --
      just because I don't care doesn't mean I don't understand!
    3. Re:Blink blink by MetalPhalanx · · Score: 1

      You sound like my boss.... Except he's not joking when he does that. :(

    4. Re:Blink blink by bar-agent · · Score: 1

      Here's an action item: leverage your value added intellectual capital to architect a new scenario.

      Come on, man, you need at least 30 multi-syllabic words to get a proper eye-glaze going. You only had 11. It was really jarring when I hit the end; I actually had to wait for my eye-glaze to finish forming and dissolve before continuing to the next comment!
      --
      i'd hit it so hard, if you pulled me out you'd be the king of britain [bash.org]
  4. Databases? WTF? by mrchaotica · · Score: 4, Insightful

    Since when did MapReduce have anything to do with databases? It's actually about parallel computations, which are entirely different.

    --

    "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

    1. Re:Databases? WTF? by mini+me · · Score: 1

      Those newfangled document databases utilize MapReduce to gather records. I'm guessing that's what the article is about.

    2. Re:Databases? WTF? by dezert_fox · · Score: 1

      from TFA: 'The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.'

      (key, data) = database output, dude. RTFA. And by the way, databases can be used for computation.
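
      A minimal sketch of the map-and-split step that quote describes, assuming word counting as the job. The function names, the CRC32 hash and the in-memory buckets are illustrative choices, not Google's implementation; a real system would write each bucket to its own file.

```python
import zlib
from collections import defaultdict


def map_records(lines):
    """Map step: turn input records into (key, data) pairs."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def split(key, m):
    """Split function: a deterministic hash of the key picks one of m buckets."""
    return zlib.crc32(key.encode("utf-8")) % m


def run_map(lines, m):
    """Run the map step and partition its output into m disjoint buckets."""
    buckets = defaultdict(list)
    for key, data in map_records(lines):
        buckets[split(key, m)].append((key, data))
    # Stand-in for the M output files: one list per bucket.
    return [buckets[i] for i in range(m)]


buckets = run_map(["the quick fox", "the lazy dog"], m=3)
```

      Because the split is deterministic, every record for a given key lands in the same bucket, so a reducer that owns a bucket sees all the data for its keys.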

    3. Re:Databases? WTF? by DragonWriter · · Score: 1

      Since when did MapReduce have anything to do with databases?


      MapReduce is a tool, one of whose principal applications is conducting queries on large bodies of data consisting of records of similar structure. It, therefore, competes with traditional DBMSs to a degree.

      Now, (largely because of the limitations the authors note), it generally is only used currently for the kind of applications where setting up a traditional RDBMS to handle them would be impractical: Google developed their implementation of MapReduce to handle a task which was at the time straining their existing RDBMS resources.

      And, yeah, it requires more custom programming to make work at all than a DBMS, and probably isn't suitable for typical tasks. OTOH, If you are Google, what you are doing isn't a typical DBMS task or load, even if it is within the scope of things a DBMS could do, abstractly, performance considerations aside.

      Sometimes rolling-your-own with just the features you need, and an implementation tailored to your particular challenges is more efficient than taking something off the shelf.

      And given that there are now open implementations of MapReduce, for people with similar challenges, there isn't as much roll-your-own involved as there was for the people implementing MapReduce the first time, reducing the cost. Yeah, this means that traditional DBMSs have a challenge, though no doubt as MapReduce implementations mature, more of the traditional DBMS features and interfaces (including SQL or something like it) will be bolted on to the successful implementations.

    4. Re:Databases? WTF? by Temporal · · Score: 1

      Um... Nope, sorry, the OP is right. MapReduce is a framework for batch processing of gigantic data sets where you intend to do something with every item in the set, or at least a large fraction of them. Relational databases are better for quickly looking up subsets of the items in a database based on query terms, and can be used for serving real-time queries.

    5. Re:Databases? WTF? by DragonWriter · · Score: 1

      MapReduce is a tool, one of whose principal applications is conducting queries on large bodies of data consisting of records of similar structure. It, therefore, competes with traditional DBMSs to a degree.


      Responding to myself is bad form, but:

      Obviously, this only makes sense taking "similar structure" extremely loosely; still, the point is that MapReduce was developed to fill a niche for which RDBMSs were being used previously, in the absence of a more specialized tool, so it clearly competes with them, to a degree, even while it isn't a DBMS. So there is a relationship.

    6. Re:Databases? WTF? by mg_729 · · Score: 1

      I recently helped a team implement a distributed MapReduce system for a very large dataset. The team had previously attempted to use a database to solve their business problem, but found performance to be unacceptable.

      The MapReduce implementation was simple and exceeded all performance requirements. However, their DBA threw fits every step of the way. To him, everything involving data could and should be solved with a SQL statement.

      More and more systems use databases simply as a data archive, not for primary work. I think the DBAs are starting to be concerned that they will no longer be necessary. Obviously that isn't true; there will always be bigger and tougher problems to solve.

    7. Re:Databases? WTF? by einhverfr · · Score: 1

      Not really.

      And no, it is not really a step backwards for databases. It is actually something which offers a niche solution for large-scale single-purpose, semi-accurate databases.

      This is almost but not entirely unlike what Codd had in mind when he wrote his seminal paper: "A Relational Model of Data for Large Shared Data Banks."

      If it were a paradigm shift, it would be a step backward. However, as "one more tool in the toolbox" it is useful in some cases where RDBMS's are not.

      --

      LedgerSMB: Open source Accounting/ERP
    8. Re:Databases? WTF? by Anonymous Coward · · Score: 0

      My guess is the DBA was throwing a fit because DBAs are really *really* uptight about the data being right. MapReduce isn't ACID compliant, which means data might get trashed. DBAs hate corrupted data, and don't tend to trust it unless it's been handled by a system that has data integrity as a top priority.

      Google is a rare case. If there is a glitch which trashes some of their records, the worst thing likely to happen is you get 107,000 URLs back from a search rather than 109,500 URLs. No big deal. But the vast vast majority of DBAs out there work with databases that have to be exact all the time (finance, inventory, etc, etc) where if you don't get the exactly right and complete answer back, the shit hits the fan. When you get used to all the data being precious, using a system where it isn't (and treats it accordingly) can cause a bit of anxiety.

    9. Re:Databases? WTF? by martin-boundary · · Score: 2, Insightful
      Um, nope. You're not thinking abstractly enough, that is, you're not thinking like a computer scientist. MapReduce is a (rather obvious) framework for processing large lists of (key,data) pairs in parallel, therefore it can be compared with other such systems. Both MapReduce and RDBMSes basically compute a function on a set of (key,data) pairs.

      1) The fact that MapReduce is being used for specific low level applications does not make it intrinsically different or uncomparable to an RDBMS, although it may not be worthwhile.

      2) The more MapReduce gets used for things other than search engine calculations, the more it becomes worthwhile to do the comparison.

    10. Re:Databases? WTF? by martin-boundary · · Score: 2, Insightful

      More and more systems use databases simply as a data archive, not for primary work.
      I wouldn't count on even that being a long term trend. It takes time for people to come up with things to do with a database. Especially really big databases. Wait another ten years, and people will complain that their dumb data archives are not RDBMSes.
    11. Re:Databases? WTF? by Temporal · · Score: 2, Insightful

      I guess if you consider anything that involves (key, value) pairs to be basically an RDBMS, you might as well classify almost everything as an RDBMS, which seems to make the term pointless. Why write software anymore when we can just use a database? The reality is that I would use MapReduce and MySQL to solve very different problems.

      I think TFA is being silly in trying to compare MapReduce to DBMSs. Yes, of course MapReduce compares unfavorably, because it isn't a DBMS. The comment that MapReduce is "A sub-optimal implementation, in that it uses brute force instead of indexing" is particularly telling: MapReduce is not intended for situations where you would want indexing, and never was. In general, the whole article is trying to judge MapReduce on points that are completely irrelevant to what it was designed for and the way it is actually used.

      Really, if MapReduce were a DBMS, then why did the creators of MapReduce also create BigTable? BigTable *is* meant to be like a database, although it omits a lot of features in favor of scalability. MapReduce and BigTable are used for completely different things. I think Jeff and Sanjay (creators of both MapReduce and BigTable) probably find it pretty amusing to see MapReduce evaluated as a DBMS.

    12. Re:Databases? WTF? by martin-boundary · · Score: 1

      I guess if you consider anything that involves (key, value) pairs to be basically an RDBMS, you might as well classify almost everything as an RDBMS, which seems to make the term pointless.
      I don't think it's pointless. It (ie database theory) actually explains why RDBMSes are so popular: they are flexible enough to solve many problems, and people find it easy to think in those terms, and apply that viewpoint.

      I think TFA is being silly in trying to compare MapReduce to DBMSs. Yes, of course MapReduce compares unfavorably, because it isn't a DBMS.
      From reading the article, my impression is that the authors wrote it in response to some questions, and it's targeted at database people who have heard the MapReduce hype and are wondering if it will help them. Kind of like when people wondered if they should use Java for everything in the late 90s.

      MapReduce is not intended for situations where you would want indexing, and never was.
      Such things, if they work well, often outgrow their initial application area. The MapReduce abstraction is the obvious approach for performing large scale matrix multiplication in parallel, such as is needed for PageRank. Once the framework is in place for that, it's a small step to extend it for other things like filtering URL lists, managing robot farms, etc.

      I think Jeff and Sanjay (creators of both MapReduce and BigTable) probably find it pretty amusing to see MapReduce evaluated as a DBMS.
      I think it's good to have comparisons for people who otherwise only realize the advantages and limitations of a technology when it's too late. Some of the slashdot comments here say things like it's scalable and it works because Google use it. But use it for what? should be the question.
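
      For what the matrix multiplication mentioned above might look like in map/reduce form, here is a rough sketch of one sparse matrix-vector product, the core step of a PageRank-style iteration. The `(row, column, value)` entry format and the function names are assumptions made for illustration; this is not a description of how Google actually does it.

```python
from collections import defaultdict


def map_entry(entry, vector):
    """Map step: one nonzero matrix entry becomes (row, partial product)."""
    row, col, value = entry
    return row, value * vector[col]


def reduce_by_key(pairs):
    """Reduce step: sum the partial products that share a row key."""
    sums = defaultdict(float)
    for key, partial in pairs:
        sums[key] += partial
    return dict(sums)


def matvec(entries, vector):
    """Multiply a sparse matrix, given as (row, col, value) triples, by a vector."""
    return reduce_by_key(map_entry(entry, vector) for entry in entries)


# Hypothetical 2x2 matrix [[1, 2], [0, 3]] times the vector [10, 1].
entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
result = matvec(entries, [10.0, 1.0])
```

      Each map call needs only its own entry plus the vector, so the entries can be spread over as many machines as you like; only the per-row sums have to be brought together.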
    13. Re:Databases? WTF? by Temporal · · Score: 1

      From reading the article, my impression is that the authors wrote it in response to some questions, and it's targeted at database people who have heard the MapReduce hype and are wondering if it will help them. Kind of like when people wondered if they should use Java for everything in the late 90s.
      The article seems to assume that MapReduce is trying to compete with RDBMSs, and even attacks the authors of MapReduce, suggesting that they should read up on database theory. An article which simply argued that MapReduce is not a good alternative to RDBMSs while acknowledging that it is very useful in other areas would be more agreeable.

      I agree that the hype around MapReduce seems a bit silly. It's just an engineering tool, not a computer science revelation. The main things it provides are scalability and fault tolerance (when running a task across thousands of machines, you have to expect failures). Theoretically speaking, it's not very interesting.
    14. Re:Databases? WTF? by martin-boundary · · Score: 1

      The article seems to assume that MapReduce is trying to compete with RDBMSs, and even attacks the authors of MapReduce, suggesting that they should read up on database theory.
      Fair point.
    15. Re:Databases? WTF? by Thyrteen · · Score: 1

      *clap clap*

  5. Huh??? by LWATCDR · · Score: 0

    5. MapReduce is incompatible with the DBMS tools
    A modern SQL DBMS has available all of the following classes of tools:
            * Report writers (e.g., Crystal reports) to prepare reports for human visualization

    Perl? Really Perl was made for doing reports. I am sure that somebody will create a report writer for it. I am just amazed that Crystal Reports has become the universal solution for so many things.

    This is a pretty new bit of kit. If it catches on then people will start porting tools to it. When it comes to database tech I tend to believe that IBM really knows what they are doing. If this interests them I bet there is something to it.

    --
    See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
  6. Money, meet mouth by tietokone-olmi · · Score: 3, Insightful

    Perhaps the traditional RDBMS experts will return when they can scale their paradigms to datasets that are measured in the tens of terabytes and stored on thousands of computers. Following the airplane rule the solution needs to be able to withstand a crash in a bunch of those hosts without coming unglued.

    Now, this is not to say that a more sophisticated approach wouldn't work. It's just that when you have thousands of boxes in a few ethernet segments, communication overhead becomes really quite large, so large in fact that whatever can be saved with brute-force computation it'll usually be worth it. Consider that from what I've heard, at Google these thousands of boxes are mostly containers for RAM modules so there's rather a lot of computation power per gigabyte available to throw away with a brute force system.

    Also, I would like to point out that map/reduce is demonstrated to work. Apparently quite well too. Certainly better than any hypothetical "better" massively parallel RDBMS available in a production quality implementation today.

    1. Re:Money, meet mouth by StarfishOne · · Score: 2, Interesting

      Agreed.

      I recently read somewhere (if only I could recall the link...) that on average Google's MapReduce jobs process something in the order of 100 GB/second, 24/7/365

      I've got nothing against RDBMS... but how can you be critical about a tool that scales and performs so well? It's just a matter of selecting and using the right tool for the job.

    2. Re:Money, meet mouth by Chuck_McDevitt · · Score: 2, Informative

      Parallel DBMS systems like Teradata, Greenplum, Netezza, etc. have already solved this in the relational DBMS world. Teradata was doing 10's of Terabytes across many computers 15 years ago, and there are Teradata and Greenplum systems that have many hundreds of terabytes.

    3. Re:Money, meet mouth by kpharmer · · Score: 1

      > Perhaps the traditional RDBMS experts will return when they can scale
      > their paradigms to datasets that are measured in the tens of terabytes
      > and stored on thousands of computers.

      I'm not defending the authors of the article, but...

      1. this article wasn't written by rdbms people, but rather by column database people. There's nothing traditional or relational about their background.

      2. Every solution is a result of a variety of compromises, there is no such thing as the perfect solution outside of a context. And google's context is almost completely unique. So, it is somewhat meaningless that a solution works well for them. That doesn't mean it would work at all well for HP, Time-Warner, the State of Maine, the University of Colorado, or Ma Kettle's Fish Shack.

    4. Re:Money, meet mouth by Eivind+Eklund · · Score: 1

      I'm not defending the authors of the article, but...

      1. this article wasn't written by rdbms people, but rather by column database people. There's nothing traditional or relational about their background.

      A column database is a particular way of implementing relational databases. David DeWitt (faculty homepage, wikipedia entry) is best known for object-relational work, which is in the traditional relational area. Michael Stonebraker, the other author, is probably the best known relational database person living today (though C.J. Date might be another candidate).

      Eivind.

      --
      Doubting the existence of evolution is like doubting the existence of China: It just shows that you're uninformed.
  7. As one of the comments on the blog ... by tcopeland · · Score: 3, Insightful

    ...entry says;

    "You seem to not have noticed that mapreduce is not a DBMS."

    Exactly. These are the same sort of criticisms that you hear around memcached - the feature set is smaller, etc - and they make the same mistake. It's not a DBMS, and it's not supposed to be. But it does what it does quite well nonetheless!

  8. Summary of reaction... by R2.0 · · Score: 1

    "MapReduce is a software framework developed by Google to handle parallel computations over large data sets on cheap or unreliable clusters of computers."

    It ought to be a database, but since it isn't a database, it sucks.

    --
    "As God is my witness, I thought turkeys could fly." A. Carlson
  9. not so fast db snobs... by sneakyimp · · Score: 1

    I'm not at all certain about this but I'd bet that indexes can't solve every problem. I was working on a search routine that would attempt to pick 5 records at random from a database containing potentially a billion records. The search criteria were quite complex and included full-text search of a TEXT field and geographic proximity to a given zip code among other things. The client wanted this done in a fraction of a second.

    Personally, I'm amazed at what the various google search engines do and would bet that this technique they describe is what ties together their 200,000 servers. I wouldn't dismiss it so quickly.

    1. Re:not so fast db snobs... by e4g4 · · Score: 1

      a search routine that would attempt to pick 5 records at random from a database containing potentially a billion records

      Yeah, I'd say an index wouldn't be much help in that situation. A monkey with a keyboard could probably handle it, though.
      --
      The secret to creativity is knowing how to hide your sources. - Albert Einstein
    2. Re:not so fast db snobs... by Anonymous Coward · · Score: 0

      since you could index the geographic location, that actually would beat a brute force map reduce. Using the index to narrow down potential tuples and then going massively multithreaded (as needed) would probably be faster, of course.

  10. distributed indexes? by magarity · · Score: 1

    in that it uses brute force instead of indexing
     
    Isn't the overhead of a distributed index usually not worth the bother? This scheme sounds similar to the way Teradata handles its distribution and it manages to get a lot done with hardly any secondary indexes. I think the thinking in the article indicates standalone database server box thinking.

  11. Whew! by Anonymous Coward · · Score: 0

    I'm glad someone finally had the nerve to put MapReduce into real perspective. MapReduce has absolutely none of the "why didn't I think of that" factor.

  12. Ideas ahead of their time? by dazedNconfuzed · · Score: 4, Insightful

    it represents a specific implementation of well known techniques developed nearly 25 years ago

    There are many classic/old techniques which are only now being used - and very successfully - precisely because the hardware simply wasn't there. A recent /. post told of ray-tracing being soon used for real-time 3D gaming, and how it beats the socks off "rasterized" methods when a critical mass of polygons is involved; the techniques were well known and developed nearly 25 years ago, but only now do we have the CPU horsepower and vast fast memory capacities available for those "old" techniques to really shine. Likewise "old" "brute force" database techniques: they may not be clever and efficient like what we've been using for highly stable processing of relatively small-to-medium databases, but they work marvelously well when involving big unreliable networks of processors working on vast somewhat-incoherent databases - systems where modern shiny techniques just crumble and can't handle the scaling.

    Sometimes the "old" methods are best - you just need the horsepower to pull it off. Clever improvements only scale so long.

    --
    Can we get a "-1 Wrong" moderation option?
    1. Re:Ideas ahead of their time? by Duncan3 · · Score: 1

      Exactly, no one except Google claims the MapReduce methods are new in any way. And given their lots-of-junk-machines, it's the way to do it. Anyone in the distributed computing space over the last _35_ years would have done it exactly the same way, just in older programming languages :)

      The rest of the article is just DB-centric whining.

      --
      - Adam L. Beberg - The Cosm Project - http://www.mithral.com/
    2. Re:Ideas ahead of their time? by Anonymous Coward · · Score: 0

      In this video http://channel9.msdn.com/Showpost.aspx?postid=314874 (590MB WMV to download) of Brian Beckman explaining the difficult physics of driving games, he demonstrates that it's now possible to achieve something similar by simulating some of the particles (in a sense) in a car rather than using abstract models of how wheels work. He shows a slide in the video with "Future = better simulations through simpler physics and more horsepower" on it.

  13. Bad Perspective by Evets · · Score: 1

    This article was written from the perspective that map-reduce based architectures are in competition with common relational database architecture. They're not.

    Certainly if you were to implement map-reduce within the confines of the relational database world, there are implementation methodologies that would need to be taken to make it easier for the RDBMS developer to work with the storage and querying mechanisms.

    The article implies that map-reduce is bad because it doesn't place restrictions common to the database world on developers. When you get down to programming anything at a basic level, the implementation of standards is an optional step to take.

    I would agree that abstraction and structure would be good things because developers would be able to concentrate on higher level problems, but I would strongly disagree that anybody learning about map-reduce algorithms should be confined to a particular implementation methodology.

  14. A completely uninformed analysis by abes · · Score: 2, Insightful

    Well, INDBE, but MapReduce seems like a pretty cool idea (even if it is old [which in my books does not equate to bad]). A similar argument could be made against SQL -- it's not appropriate to all solutions. It's used for most nowadays, in part because it's the simplest to use, but that doesn't make it necessarily better. It (of course) depends on what data you want to represent.

    Even more importantly, you can create schemas with MapReduce by how you write your Map/Reduce functions. This is a matter of the datafunction exchange (all data can be represented as a function, likewise all functions can be represented as data). I admit ignorance to how this MapReduce system works, but I would be surprised if you couldn't get a relational database back out.

    The advantage you get with MapReduce is that you aren't necessarily tied to a single representation of data. Especially for companies like Google, which may want to create dynamic groups of data, this could be a big win. Again, this is all speculative, as I have very little experience with these systems.

  15. A Very Human Response by Anonymous Coward · · Score: 3, Insightful

    The reaction seems straightforward enough. The MapReduce paradigm has proved to be very effective for a company that lives and breathes scalability, while it apparently ignores a whole bunch of database work that's been going on in academia. The fact that industry was able to produce something so effective without making use of all this knowledge base at least implicitly undercuts the importance of that work, and is thus threatening to the community which produced that work. Is it any surprise that the researchers whose work was completely side-stepped by this approach aren't happy with the current situation?

    1. Re:A Very Human Response by Anonymous Coward · · Score: 0

      Or for another view: Yahoo is the database-research-community's company, and Google the systems-research-community's company ;-)

    2. Re:A Very Human Response by DragonWriter · · Score: 1

      The reaction seems straightforward enough. The MapReduce paradigm has proved to be very effective for a company that lives and breathes scalability, while it apparently ignores a whole bunch of database work that's been going on in academia.


      That's not the "problem" (from the perspective of the authors of TFA), really.

      The problem is that it provides an alternative to work that has been going on in industry, and in particular that it provides a way to end-run some of the limitations of traditional databases that may reduce demand for the particular alternative-model (column-based) database that the company that launched the blog on which TFA is posted is trying to sell as the way to work around the limitations of traditional row-based databases.

      Of course, mostly MapReduce isn't about being a database, so the criticism in the article seems bizarre, but it's mostly intended for the audience who might have a need for which MapReduce might seem like a viable approach and Vertica's "revolutionary" column-oriented database also might seem to be a viable approach. While the two tools don't target the exact same needs, the places where they might be useful do overlap.

      Which is why a blog whose initial post heralded the demise of the one-size fits all database is criticizing MapReduce for not being a traditional database. The problem isn't that it's different, it's that it's not the version of "different" that Vertica is selling.
  16. Try by Anonymous Coward · · Score: 0


    Lisp

  17. Even if it was .... by Anonymous Coward · · Score: 0

    Even if it was an RDBMS, there are damn good reasons for violating the "rules" in certain situations. If the only tool in your toolbox is a hammer, everything looks like a nail. Knowing the rules and guidelines goes hand in hand with knowing the situations where they don't work or work against you... academics are big on the former and short on the latter, and that is a real thing in the real world outside of academia.

    I had to write a DB application once to handle about 80 full CDs of telephone records from a RDMS. I was able to reduce it so it all fit on one CD and was blazingly fast, but I had to violate several "rules" of proper database programming and layout. It happens.

  18. belly acres by rodentia · · Score: 1

    A sub-optimal implementation, in that it uses brute force instead of indexing

    As though these are the exclusive choices. TFA goes on to complain about implementing 25 year old ideas, though they are actually rather older than that--they just didn't strike the RDB types until the eighties. They proceed to insist that the system cannot scale. Arguing google's scalability is like arguing gravity.

    --
    illegitimii non ingravare
  19. FTFA by smcdow · · Score: 4, Insightful

    Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.


    That's a joke, right?

    I think Google's already taken care of all the experimental evaluations you'd need.

    --
    In the course of every project, it will become necessary to shoot the scientists and begin production.
    1. Re:FTFA by sammy+baby · · Score: 1

      That's a joke, right?


      I know, that's what I thought.

      But then again... a few weeks ago I was involved in a phone call with one of our clients. They're a huge client for us, to a degree that they can significantly influence the future direction of our product by complaining loud enough, and our first client to use some new "high-availability" features we're gradually rolling out.

      In the course of our conversation, one of the client's guys essentially pooped on a large part of our product roadmap, basically because it involved a load balancer. And when we asked why, he said, "Because nobody in the world has been able to demonstrate a working, network-based load balancing solution."

      And that's it. Seriously. As if the entire notion of network based load balancing was a hoax perpetrated on the IT industry, and Google and Yahoo were just having a laugh on us while relying on plain old round-robin DNS or something. I mean, this client has two whole nodes to load balance, and that's clearly out of the reach of, say, F5...

      (Okay. Tangent. Sorry.)
    2. Re:FTFA by Bill,+Shooter+of+Bul · · Score: 1

      People have funny beliefs about things they don't completely understand, especially with technology. By "network based" he is probably referring to some asinine criterion that is most likely not related to any network that's ever existed or could ever exist, like routing the request based on the evil bit of the packet.

      --
      Well.. maybe. Or Maybe not. But Definitely not sort of.
    3. Re:FTFA by Anonymous Coward · · Score: 0

      I think what they mean is they're concerned because it *does* scale so well. That means there's no money to be made by consultants telling you that you need 10 more database servers, each with 4 CPUs and SCSI RAID arrays and Oracle licenses.

      Also, when they say it's just an implementation of decades-old ideas, what they really mean is there's no money to be made in patent fees. It's like how quicksort may be a neat algorithm, but there's no money in licensing it.

  20. A step from where? by 644bd346996 · · Score: 3, Funny

    If you are starting with a good database, MapReduce is definitely a step backwards. But that isn't what MapReduce is designed to replace. In reality, MapReduce replaces the for loop, and viewed from that perspective, it is a major step forward. Most languages (C, C++, Java, etc.) define the for loop and other iteration facilities in such a way that the compiler can seldom safely parallelize the loop. MapReduce gives the programmer an easy way to convert probably 90% of their for loops into highly scalable code.
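
    A minimal sketch of that "replace the for loop" idea in plain Python, not Google's framework. The sample data and the names `map_line` and `merge_counts` are made up for illustration. The loop version mutates one shared dict, while the map/reduce version splits the work into independent per-line steps plus an associative merge, which is the property a framework needs before it can spread the work across machines.

```python
from collections import Counter
from functools import reduce

lines = ["map reduce map", "reduce loop", "map"]

# Plain for loop: every iteration updates one shared dict, so the
# iterations cannot safely run independently of each other.
counts_loop = {}
for line in lines:
    for word in line.split():
        counts_loop[word] = counts_loop.get(word, 0) + 1

# Map step: each line is processed on its own, with no shared state.
def map_line(line):
    return Counter(line.split())

# Reduce step: merging two partial counts is associative, so partial
# results can be combined in any grouping.
def merge_counts(left, right):
    return left + right

counts_mr = reduce(merge_counts, map(map_line, lines), Counter())
```

    Both versions produce the same counts; only the second one has a shape that can be handed out to many workers.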

    1. Re:A step from where? by Duncan3 · · Score: 1

      Unless you're using one of the dozens of compilers that can do just that, or FORTRAN, or OpenMP, or...

      --
      - Adam L. Beberg - The Cosm Project - http://www.mithral.com/
    2. Re:A step from where? by Rakishi · · Score: 1

      Your compiler will parallelize a for loop across 1000 machines AND split the input data across them before you even run the program?

  21. Translation: by Chris+Mattern · · Score: 1

    "We spent all these years making these complex, elegant algorithms--see how intricate this wonderful indexing algorithm is?--and then they solve things by simply throwing cheap hardware at it. It's not *fair!*"

    1. Re:Translation: by Anonymous Coward · · Score: 0

      It's not smart. Anytime you have a dumb algorithm and make it solve your problem by throwing more hardware at it, you're losing. You could use a smart algorithm on more hardware and get more work done. I don't know very many businesses where developer time is worth more than the millions of dollars it takes to build out massively parallel systems to compensate for stupid algorithms.

      Duh.

  22. Missing the forest for the trees... by brundlefly · · Score: 3, Insightful

    The point of MapReduce is that It Works. Cheaply. Reliably. It's not a solution for the Cathedral, it's one for the Bazaar.

    Comparing it to a DBMS on fanciness is pointless, because the DBMS solution fails where MapReduce succeeds.

  23. Step backward? by gmuslera · · Score: 1

    The 1st that come to my mind when i read that was the evolution of a programmer, when a "program" evolving started to get back thin in lines didnt meant that were a step backwards.

    1. Re:Step backward? by Anonymous Coward · · Score: 0

      get back thin in lines? what?

  24. Huh? by Black+Parrot · · Score: 1

    we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications.

    So much hype that I never even heard of it before their complaint hit Slashdot...
    --
    Sheesh, evil *and* a jerk. -- Jade
  25. so soon? by ImTheDarkcyde · · Score: 1

    I wasn't expecting Google to seize control of the world databases and force people to use their software till at least 2012.

    1. Re:so soon? by ScrewMaster · · Score: 1

      Now, had Google been around when James Cameron's first Terminator film came out, I'll bet that Skynet would have sprung forth from a Google AI project gone awry.

      --
      The higher the technology, the sharper that two-edged sword.
  26. Vertica by QuietLagoon · · Score: 3, Interesting

    The column was copyright by Vertica. Wouldn't they be concerned about the type of competition that MapReduce presents?

    1. Re:Vertica by SteeldrivingJon · · Score: 1

      Maybe they're getting customers asking about mapreduce, and are tired of trying to convince customers that a conventional system is the way to go.

      --
      September 2011: Looking for Cocoa/iOS work in Boston area Cocoa Programmer Quincy, MA
  27. Vertica launches database-focused blog by QuietLagoon · · Score: 1
  28. Information and knowledge management by thomp · · Score: 1

    Data management is becoming so much more than just the data stored in a DBMS. As a data management geek, it's sad that the authors, experts in my field, fail to put MapReduce in its proper context and recognize its value. My bread and butter is DBMS, and even I could see the potential of MapReduce and the failure of the authors' arguments.

    tap

    --
    .sig
  29. They are afraid... by mini+me · · Score: 1

    I gather this is a publication for DBAs. It seems they are worried about their jobs more than anything. With the map-reduce-style databases there isn't a need for any kind of special database expert. The business logic all happens in the application. There is no need for tuning indexes. You don't even need to define a schema. When things get slow any monkey can drop in another computer and you're back up to speed and ready to go.

    Traditional RDBMSes have their place, but we're going to see a lot more applications built on this technology in the near future. The big players (Google, Amazon, etc.) have been doing it for quite some time and we're now finally seeing the technology available to the average Joe. It's a very interesting shift in how data is stored and should lead to some interesting applications that we can only dream of today.

  30. like Spider Robinson sang.. by hmaon · · Score: 2, Funny

    "...I taped twenty cents to my transmission
    So I could shift my pair 'a dimes..."

  31. Article really misses the point by steveha · · Score: 4, Insightful

    I read through the whole article, and was just bemused. According to the article, MapReduce isn't as good as a real database at doing the sorts of things real databases do well. Um, okay, I guess, but MapReduce can do quite a lot of other things that they seem to have missed.

    Also, I had a major WTF moment when I read this:

    Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.

    Empirical evidence to date suggests that MapReduce scales insanely well. Exhibit A: Google, which uses MapReduce running on literally thousands of servers at a time to chew through literally hundreds of terabytes of data. (Google uses MapReduce to index the entire World Wide Web!)

    This in turn suggests that the authors of TFA are firmly ensconced in the ivory tower.

    They complained that brute-force is slower than indexed searches. Well, nothing about MapReduce rules out the use of indexes; and for common problems, Google can add indexes as desired. (Google uses MapReduce to build their index to the Web in the first place.) And because Google adds servers by the rackful, they have quite a lot of CPU power just waiting to be used. Brute force might not be slower if you split it across thousands of servers!

    Likewise, they complain that one can't use standard database report-generating tools with MapReduce; but if the Reduce tasks insert their results into a standard database, one could then use any standard report-generating tools.

    MapReduce lets Google folks do crazy one-off jobs like ask every single server they own to check through their system logs for a particular error, and if it's found, return a bunch of config files and log files. Even if you had some sort of distributed database that could run on thousands of machines, any of which might die at any moment, and if you planned ahead and set the machines to copy their system logs into the database, I don't see how a database would be better for that task. That's just a single task I just invented as an example; there are many others, and MapReduce can do them all.

    And one of the coolest things about MapReduce is how well it copes with failure. Inevitably some servers will respond very slowly, or will die and not respond; the MapReduce scheduler detects this and sends the Map tasks out to other servers so the job still finishes quickly. And Google keeps statistics on how often a computer is slow. At a lecture, I heard a Google guy explain how there was a BIOS bug that made one server in 50 disable some cache memory, thus greatly slowing down server performance; the MapReduce statistics helped them notice they had a problem, and isolate which computers had the problem.
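
    A toy sketch of the re-dispatch idea described above, under loud assumptions: a dead worker is modelled as a function call that raises `RuntimeError`, and the names `run_with_reassignment`, `fake_run` and the worker labels are hypothetical. Detection of slow (rather than dead) workers, which the real scheduler also handles, is not modelled here.

```python
def run_with_reassignment(tasks, workers, run_task):
    """Run every task, moving on to the next worker when one fails."""
    results = {}
    for i, task in enumerate(tasks):
        for attempt in range(len(workers)):
            worker = workers[(i + attempt) % len(workers)]
            try:
                results[task] = run_task(worker, task)
                break  # task finished, stop retrying
            except RuntimeError:
                continue  # worker failed, hand the task to the next one
        else:
            raise RuntimeError(f"no worker could run task {task!r}")
    return results

# Hypothetical cluster in which worker "w2" never responds.
DEAD_WORKERS = {"w2"}

def fake_run(worker, task):
    if worker in DEAD_WORKERS:
        raise RuntimeError(f"{worker} did not respond")
    return (worker, len(task))

results = run_with_reassignment(["aa", "bbb", "c"], ["w1", "w2", "w3"], fake_run)
```

    The task that was first handed to the dead worker ends up on another machine, and the job as a whole still completes.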

    MapReduce lets you run arbitrary jobs across thousands of machines at once, and all the authors of the article seem to be able to see is that it's not as database-oriented as a real database.

    steveha

    --
    lf(1): it's like ls(1) but sorts filenames by extension, tersely
    1. Re:Article really misses the point by marcosdumay · · Score: 1

      Also, I had a major WTF moment when I read this:

      Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.

      Empirical evidence to date suggests that MapReduce scales insanely well.

      Well, I didn't RTFA, but I also had a major WTF moment when I read that line. I don't understand what all that buzz is about, map reduce is an old approach that is known to work well. Also, it scales the best way any algorithm could scale, its only bottleneck is data distribution.

    2. Re:Article really misses the point by Anonymous Coward · · Score: 0

      Map/reduce is an old concept. Google MapReduce is a specific implementation of the idea. Google MapReduce is used by Google to scale code up to thousands of machines.

      http://en.wikipedia.org/wiki/MapReduce

  32. A better Google? by pH7.0 · · Score: 1

    They should implement their own Google using "modern techniques" and make billions!!!

  33. Also: How's a DBM supposed to profit off that? by SteeldrivingJon · · Score: 1


    And, if mapreduce doesn't generate vast license income for Oracle, it must suck. Imagine the per-processor charges Google would be paying!

    --
    September 2011: Looking for Cocoa/iOS work in Boston area Cocoa Programmer Quincy, MA
  34. Article misses the point of MapReduce/RDBMS by duffbeer703 · · Score: 1

    Sounds like the rumblings of grumpy DBAs.

    The whole point of a relational DBMS is to store, link and maintain the integrity of data in tables based on the relationships among the data.

    MapReduce is about processing data... it's not focused on maintaining integrity, and the kinds of datasets suitable for MapReduce probably don't have well defined relationships.

    --
    Conformity is the jailer of freedom and enemy of growth. -JFK
    1. Re:Article misses the point of MapReduce/RDBMS by DragonWriter · · Score: 1

      Sounds like the rumblings of grumpy DBAs.


      Or maybe talking down the (free!) competition from a blog launched by a company trying to sell a different alternative to traditional databases (but one which outwardly looks more like a traditional RDBMS) for an overlapping problem domain (that is, column-oriented databases, which address some of the same distribution and parallelization issues that MapReduce addresses, and target some of the same areas [e.g., "big science"] where it has been suggested that MapReduce might be useful.)
    2. Re:Article misses the point of MapReduce/RDBMS by IvyKing · · Score: 1

      On the other hand TFA may be more of a caution to those who think that Google has solved the "Database on Clusters" problem. The key point is integrity or lack thereof (as you pointed out); Google doesn't need to maintain 100% integrity, unlike the typical user of a DBMS.

    3. Re:Article misses the point of MapReduce/RDBMS by duffbeer703 · · Score: 1

      Thanks for pointing that out -- I hadn't realized that the article was part of a corporate blog!

      --
      Conformity is the jailer of freedom and enemy of growth. -JFK
  35. Indexing is useless here. by SharpFang · · Score: 4, Insightful

    Indexing works by picking a small slice of the data you have (as a list of hashes), and changing it into a much smaller table mapping the data onto a group of records matching it. The index is smaller and conforms to a certain strict standard, so it's very fast to brute force. Then as you get the list of indices, you brute force them, and this way you get the record.

    This works well if you can create such a slice - a piece of data you will match against. It becomes increasingly unwieldy if there are many ways to match a data - multiple columns mean multiple indices. And then if you remove columns entirely, making records just long strings, and start matching random words in the record, index becomes useless - hashes become bigger than chunks of data they match against, indexing all possible combinations of words you can match against results in index bigger than the database, and generally... bummer. Index doesn't work well against freestyle data searchable in random form.

    Imagine a database with its main column being VARCHAR(255) and using about full length of it, then search using a lot of LIKE and AND, picking various short pieces out of that column, and the database being terabytes big. Try to invent a way to index it.

    --
    45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
    1. Re:Indexing is useless here. by Timothy+Brownawell · · Score: 3, Funny

      Imagine a database with its main column being VARCHAR(255) and using about full length of it, then search using a lot of LIKE and AND, picking various short pieces out of that column, and the database being terabytes big. Try to invent a way to index it.

      Convert it to an HTML table and put it where googlebot can see it.
    2. Re:Indexing is useless here. by Anonymous Coward · · Score: 0

      An efficient index for a common LIKE/AND based text keyword search in the terabyte range? Off the top of my head? A large Bloom mirror filter to eliminate everywhere we don't have matches would be a keen start.

      Hashes might be bigger than chunks of data, and the resulting simple index might be bigger, but that doesn't mean you can't fold the index into a Bloom filter and do a space tradeoff (for a known false positive rate and zero false negative rate).

      You could use a distributed Bloom filter if you liked, to balance the load across search cluster nodes in a tree, and thus eliminate sections of the distributed database which you know don't have hits.

      We've come a long, long way since table indexes you know...
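
      A small sketch of the Bloom filter idea, using only the standard library. It covers whole-word membership only, not arbitrary LIKE substrings, and the sizes, shard names and helper names (`BloomFilter`, `candidate_shards`) are illustrative assumptions rather than anything from a real search system.

```python
import hashlib

class BloomFilter:
    """Set membership with possible false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

# One filter per (hypothetical) shard of the big text column.
shards = {
    "shard-a": ["linux kernel", "google index"],
    "shard-b": ["oracle license"],
}
filters = {}
for name, rows in shards.items():
    bloom = BloomFilter()
    for row in rows:
        for word in row.split():
            bloom.add(word)
    filters[name] = bloom

def candidate_shards(words):
    """Shards that may contain every word; the rest can be skipped."""
    return [name for name, bloom in filters.items()
            if all(bloom.might_contain(word) for word in words)]
```

      A shard left out of `candidate_shards(...)` definitely lacks at least one of the words, so only the remaining shards need the expensive scan.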

    3. Re:Indexing is useless here. by Anonymous Coward · · Score: 0

      Indexing works by picking a small slice of the data you have (as a list of hashes), and changing it into a much smaller table mapping the data onto a group of records matching it. The index is smaller and conforms to a certain strict standard, so it's very fast to brute force.

      No. The essential property of indexing is to get better performance than "brute force" (by which I assume you mean linear search). Multiple types of indexes exist. The most famous index undoubtedly is the sorted index, which can be searched very fast (O(log N)) with bisection. Indexes do not make essential use of the fact that the indexed data is smaller than the source data.
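
      As a rough illustration of a sorted index searched by bisection, here is a sketch using Python's standard `bisect` module. The sample records and the `lookup` helper are invented for the example.

```python
import bisect

# Unsorted "table" of (name, age) rows.
records = [("carol", 31), ("alice", 27), ("dave", 45), ("bob", 19)]

# The index: (key, row position) pairs kept in key order.
index = sorted((name, pos) for pos, (name, _age) in enumerate(records))
keys = [key for key, _pos in index]

def lookup(name):
    """Find a row by name in O(log N) comparisons instead of a full scan."""
    i = bisect.bisect_left(keys, name)
    if i < len(keys) and keys[i] == name:
        return records[index[i][1]]
    return None
```

      Here the index is about as large as the table itself, yet the lookup still beats a linear scan, because the speed comes from the ordering and not from the index being small.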

    4. Re:Indexing is useless here. by Allador · · Score: 1

      Imagine a database with its main column being VARCHAR(255) and using about full length of it, then search using a lot of LIKE and AND, picking various short pieces out of that column, and the database being terabytes big. Try to invent a way to index it.

      Turn on full-text indexing?

      This is also an old problem, with well understood solutions. Nearly all modern RDBMSes ship with a solution to this built in and well documented.

    5. Re:Indexing is useless here. by Anonymous Coward · · Score: 0

      Indexing works by picking a small slice of the data you have (as a list of hashes), and changing it into a much smaller table mapping the data onto a group of records matching it. The index is smaller and conforms to a certain strict standard, so it's very fast to brute force. Then as you get the list of indices, you brute force them, and this way you get the record.

      This works well if you can create such a slice - a piece of data you will match against. It becomes increasingly unwieldy if there are many ways to match a data - multiple columns mean multiple indices. And then if you remove columns entirely, making records just long strings, and start matching random words in the record, index becomes useless - hashes become bigger than chunks of data they match against, indexing all possible combinations of words you can match against results in index bigger than the database, and generally... bummer. Index doesn't work well against freestyle data searchable in random form.

      Imagine a database with its main column being VARCHAR(255) and using about full length of it, then search using a lot of LIKE and AND, picking various short pieces out of that column, and the database being terabytes big. Try to invent a way to index it.


      Conceptually at least you are wrong. What you are describing is the cost of having an index, an index being a way to find a subset of the data faster than scanning the data.

      The most efficient index is the one that doesn't need additional data, for example, a hash table. (A minimal perfect hashing for the purists).

  36. Authors went off topic by pontificator · · Score: 1

    In the intro they mention that

    "a few select universities to teach students how to program such clusters using a software tool called MapReduce [1]. Berkeley has gone so far as to plan on teaching their freshman how to program using the MapReduce framework"

    and you would assume that the article would argue why this is a bad trend. They may be right that MapReduce might be getting more attention than it deserves, but their article doesn't discuss this at all. Their editor should have pointed out to them that they went way off topic.

  37. In related news: Screwdrivers suck because... by DragonWriter · · Score: 5, Funny

    1) They don't look like hammers,
    2) They don't work like hammers,
    3) You can already drive in a screw with a hammer,
    4) They aren't good at ripping out nails, and
    5) They aren't good at driving nails.

    Brought to you by The Hammer Column, a blog written by experts in the hammer industry, and launched by Hammertron, makers of a revolutionary new kind of hammer.

  38. Google = statistical database? by gnuman99 · · Score: 2, Insightful

    I thought Google searches weren't exact. You know, they were more statistical in nature. The entire algorithm is probably not based on absolute numbers (guessing, but otherwise it would not make sense).

    The thing is if Google uses this to create their index-like structure of the internet for their search engine, and it is not exactly like a RDBMS, well, so what? The MapReduce thing seems to be targeted at large sets of data and semi-accurate data mining, not exact results. No one really cares if there are 3,000,000,000 sites or 3,000,000,002 sites with Linux in it somewhere.

    Comparing RDBMS to MapReduce is like comparing math function to a paper graph of that function. The first one gives you exact results for all data in its domain. The second gives out quick, pain-free and semi-accurate results for some parts of the domain.

    Now, I will not be using MapReduce but then I don't see why Google should not. It is their business.

  39. They have a point. And it matters by Animats · · Score: 1

    I understand what they're getting at. What makes modern SQL-driven databases so useful is that they optimize queries. If you're asking for every entry in A that's also in B, any modern database will check whether it's faster to look up every A in B, every B in A, or do a match where both databases are read through sequentially by the same key. The best choice depends on the database record counts, available indices, and key types and lengths. The database system figures that out; it's not in the SQL query.

    So the user says what they want, and the system figures out how to do it. It's "do what I mean" that really works. We don't see enough of that in programming.

    Google search itself works much more like a database than a map/reduce system. Think about what has to happen when you search for multiple keywords. That's a join, and joins on big data sets take forever if you don't have the right data structures and an optimizer.

    1. Re:They have a point. And it matters by Anonymous Coward · · Score: 0

      Google search itself works much more like a database than a map/reduce system. Think about what has to happen when you search for multiple keywords. That's a join, and joins on big data sets take forever if you don't have the right data structures and an optimizer.

      Sure, but how do you think google generates that giant index? I bet you'd need a good bit of infrastructure code to map over all those crawled pages, and reduce it into a set of (term, document list) pairs.
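
      A single-process sketch of that "map over pages, reduce to (term, document list) pairs" step. It is a toy under stated assumptions, not Google's code: the pages are made up, and `map_page`, `shuffle`, `reduce_term` and `search` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical crawled pages: document id -> text.
pages = {
    "page1": "linux kernel news",
    "page2": "google linux search",
    "page3": "search news",
}

def map_page(doc_id, text):
    # Map: emit one (term, doc_id) pair per distinct term on the page.
    return [(term, doc_id) for term in set(text.split())]

def shuffle(pairs):
    # Group the emitted pairs by term, as the framework would between phases.
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped

def reduce_term(term, doc_ids):
    # Reduce: collapse one term's pairs into a sorted document list.
    return term, sorted(doc_ids)

emitted = [pair for doc_id, text in pages.items()
           for pair in map_page(doc_id, text)]
index = dict(reduce_term(term, ids) for term, ids in shuffle(emitted).items())

def search(*terms):
    # Multi-keyword lookup: intersect the document lists of all terms.
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

      The batch job builds the index; the multi-keyword `search` afterwards is the join-like lookup the parent comment describes, done against the finished index rather than by MapReduce.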
  40. Maybe it is a step backwards, but I use it a lot.. by 10537 · · Score: 1

    ...at work. We use it to aggregate millions of dumped events every day, and while it may be missing features that are common in RDBMSes or use brute force rather than special magic, the fact is that we can point it at a cluster of machines and get aggregated stuff out with a lot less computational overhead than if we used anything else. It's not an RDBMS, and we don't use it as one, and therefore don't give a rat's ass if it's any good as one -- it does one thing, and it does it at a good price/scalability/performance/modifiability/ease-of-use multiratio. (And at the risk of being redundant: Photoshop is a crap word-processor, but the problem there isn't Photoshop, it's the fucktard who uses it to write letters.)

    --
    This sentence no verb.
  41. This coming from the DB Community? by Qbertino · · Score: 1

    Seriously, the DB Community calling something 'backwards' is a joke. Before going after others the DB people maybe should get up to date with their technology and maybe just get rid of that ancient, crappy POS PL called SQL. They should spend their time migrating to some up-to-date LGPLd solution for connection and glue-code. 'Them' using an early 70s interactive terminal hack as cornerstone of their work and calling others 'backwards' is just plain silly.
    When rotating hard disks are replaced by SSDs and go the way of the dodo, then we'll see who's backwards and outdated. Until then I'd tune low on any wisecracking about something being 'backwards' compared to DB technology.

    --
    We suffer more in our imagination than in reality. - Seneca
    1. Re:This coming from the DB Community? by geekboy642 · · Score: 1

      Oh wise and mighty Qbertino,

      What, pray tell would you replace SQL with? Bear in mind you have to make it capable of replacing BILLIONS of lines of stored procedures and db code. And whatever magnificent replacement you invent also has to be a great enough improvement over SQL to be worth the TRILLIONS of dollars necessary to retrain the entire SQL economy.

      Go ahead, let's hear your awesome idea! Or were you just being an idiot?

      --
      Just another "DOJ fascist authoritarian totalitarian bootlicker" -- Zeio
    2. Re:This coming from the DB Community? by Qbertino · · Score: 1

      I'll give you a counter example:
      You know all these frameworks out there that are gaining traction? Rails, Django, CakePHP, etc.? Well, there is this one, Symfony, that uses a powerful PHP DB abstraction layer called 'Propel'. One of its perks is that you don't write any SQL anymore. None. Meaning: You write your transactions and persistence layer interactions in the programming language in which you write everything else as well. If SQL is so cool, then why don't we have a different PL for each task? We could use 'Loop-Language' for loops, 'description-language' for building our objects, etc. I'm not saying "replace everything". There's still enough COBOL out there being maintained - but no one would use COBOL today to build a new system. The same should apply even more so to SQL.

      --
      We suffer more in our imagination than in reality. - Seneca
    3. Re:This coming from the DB Community? by mdfst13 · · Score: 1

      If SQL is so cool, then why don't we have a different PL for each task?

      SQL (Structured Query Language) isn't a programming language. It's a client/server communication protocol. When you use a database, your program (the client) is calling the database (server). You need to have some protocol for that communication. SQL is more on par with FTP or HTTP than PHP.

      Now, it's quite reasonable to want to abstract out that communication in your code. For many applications, you won't care about the underlying database implementation and can manage with just create, read, update, and delete operations. That's why Object-Relational Mappers (ORMs) exist. It sounds like Symfony (which I haven't used) implements an ORM. A pretty common ORM for Java is Hibernate. An ORM can make things easier for many tasks. However, in practice, one occasionally finds that an ORM does not support asking the questions that actually need to be answered. SQL does. That's why Hibernate still allows people to write direct SQL.

      Another issue is that SQL isn't always used from a programming language. It's quite common to connect to a database directly and query with SQL. It would be very inconvenient to have to write a PHP program every time I wanted to query the database. It's much easier to use SQL either on an ad hoc basis or via a script.

      If ORMs are enough for you, that's great. In my experience, they aren't enough for me and I need something with the flexibility of SQL.

      You can see an example of a query that's simpler in SQL than in Propel at http://propel.phpdb.org/trac/wiki/Users/Documentation/1.3/ManyToManyRelationships

      They basically implement something along the lines of

      for each book in DB
          get the reader line mapped by the book_reader table

      This takes about seven lines of code and forces the program to call the database n+1 times (where n is the number of books). SQL can express the same question in a single query. If there are a small number of readers and a lot of books, this can significantly reduce the request/reply overhead. I suspect that Propel would also have trouble with questions that would require a left join. I suspect that it would require the same kind of application level mapping of something that the database already implements.
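
      To make the n+1 point concrete, here is a sketch using Python's built-in `sqlite3` module. It does not reproduce Propel's API or its generated queries; the three-table schema and the sample rows are assumptions chosen to mirror the book/reader example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE reader (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book_reader (book_id INTEGER, reader_id INTEGER);
    INSERT INTO book VALUES (1, 'Dune'), (2, 'Emma');
    INSERT INTO reader VALUES (1, 'Ann'), (2, 'Bo');
    INSERT INTO book_reader VALUES (1, 1), (1, 2), (2, 1);
""")

# n+1 pattern: one query for the books, then one more query per book.
n_plus_one = []
books = conn.execute("SELECT id, title FROM book ORDER BY id").fetchall()
for book_id, title in books:
    rows = conn.execute(
        "SELECT r.name FROM reader r "
        "JOIN book_reader br ON br.reader_id = r.id "
        "WHERE br.book_id = ? ORDER BY r.name",
        (book_id,),
    ).fetchall()
    n_plus_one += [(title, name) for (name,) in rows]

# Single query: the database resolves the many-to-many mapping itself.
single_query = conn.execute(
    "SELECT b.title, r.name FROM book b "
    "JOIN book_reader br ON br.book_id = b.id "
    "JOIN reader r ON r.id = br.reader_id "
    "ORDER BY b.id, r.name"
).fetchall()
```

      Both approaches return the same rows; the difference is three round trips to the database versus one, and that gap grows with the number of books.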
    4. Re:This coming from the DB Community? by Anonymous Coward · · Score: 0

      Most languages make creating and embedding domain-specific languages within them so painful (see PRO*C) that nobody bothers. This is why even SQL is almost always packaged as a flat string and submitted to the database client to be parsed at least once at runtime. As usual, a bleeding edge language (Perl 6) now has the potential to get us another solution to a problem nobody knows Lisp nailed decades ago. But I'm not bitter.

  42. The only thing wrong with map-reduce... by frank_adrian314159 · · Score: 1

    ... is that they misspelled xapping.

    --
    That is all.
  43. Stream processing. by mypalmike · · Score: 1

    The whole point of MapReduce is to take an unindexed stream of data and shrink it down based on some criteria where numerous records can be associated (Map) and aggregated (Reduce). It is a process. The *result* of the process is an indexed database, which is often inserted into a relational or time-series database.

    It's an apples and oranges comparison, and the author's never eaten an orange.

    --
    There are 0x40000000 types of people: those who understand 32-bit IEEE 754 floating point, and those who don't.
  44. reminiscent...(philosophical digression) by cjonslashdot · · Score: 1

    All your comments bring back to my mind the criticisms of XML-based messaging technologies (SOAP, Web services). "A huge step backwards", "incompatible with existing technologies and approaches" (BNF, parsers, languages), "inefficient" (compared to binary formats), etc. Those complaints were right, but they fell on deaf ears, just as these will.... IT is driven by fads and the availability of high-productivity gizmos. Ironically, productivity often suffers in the long run, as people then have to deal with the mess that gets created using approaches that are fundamentally wrong.

  45. Mapreduce is not a database. by Anonymous Coward · · Score: 0

    So of course rating it like one will fail.

    I see map reduce as a really great way to take 10,000,000,000,000 bytes of raw data, map it to a set of computers and reduce the data to a set of tables that could then be placed in a regular database and queried.

    Or is that not how google is using it?

  46. Fnord? by SanityInAnarchy · · Score: 1

    Paradigm.

    Does that mean Paradigm is a Fnord? As in, I can now say stuff you won't be able to consciously read, because it has the Fnord Paradigm in it?

    --
    Don't thank God, thank a doctor!
  47. Wait, are they "experts", though? by SanityInAnarchy · · Score: 1

    Note how their blog represents the post as having a single author, when, in fact, it has multiple authors?

    That does not sound at all like a database expert to me. It's a simple many-to-many relationship!

    --
    Don't thank God, thank a doctor!
  48. Index Every Column? by Tablizer · · Score: 1

    a sub-optimal implementation, in that it uses brute force instead of indexing;

    For Query-by-Example-like tools, often you cannot predict which columns need indexing: they ALL do. At some point it just seems easier to split the data sets up onto dozens or hundreds of hard drives and just do a sequential search on each one in parallel. I cannot say whether it is clearly faster than indexing every column, but it is certainly simpler from a technical standpoint. And, it would possibly require less disk space because there would only be one copy of each cell, unlike indexing which replicates the contents of the indexed column into the index.

    What seems conceptually simpler: maintaining 300 indexes, or simply sequentially scanning tables split across many hard drives? (I've thought about this because I've been kicking around how to build a truly dynamic relational database with auto-columns proof-of-concept because the current "Oracle clones" are too stiff for many kinds of nimble apps.)
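
    A rough sketch of the "scan every partition in parallel, no indexes" approach, using the standard `concurrent.futures` module. The in-memory partitions stand in for separate drives, the rows are invented, and `scan` and `query_by_example` are hypothetical names. Threads in CPython will not speed up a CPU-bound scan, so this only shows the structure; the real gain would come from separate disks or machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the rows stored on one drive.
partitions = [
    [{"id": 1, "city": "Oslo", "plan": "basic"},
     {"id": 2, "city": "Rome", "plan": "pro"}],
    [{"id": 3, "city": "Oslo", "plan": "pro"}],
    [{"id": 4, "city": "Lima", "plan": "pro"},
     {"id": 5, "city": "Oslo", "plan": "pro"}],
]

def scan(partition, example):
    # Sequential scan: keep rows matching every column given in the example.
    return [row for row in partition
            if all(row.get(col) == val for col, val in example.items())]

def query_by_example(example):
    # One scan task per partition; any column can be queried without an index.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        chunks = list(pool.map(lambda part: scan(part, example), partitions))
    return [row for chunk in chunks for row in chunk]

matches = query_by_example({"city": "Oslo", "plan": "pro"})
```

    Any combination of columns can be queried this way with no index maintenance, at the price of reading every row on every query.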

  49. Q: Implementation issues by harmonica · · Score: 1

    The authors raise two interesting questions on skew and data interchange wrt scaling in section 2. (poor impl), issues others supposedly have solved. Has anyone run into those problems with MapReduce? Are they not important when using MapReduce in the "real world"?

  50. What is MapReduce SPECIFICALLY useful for? by CurtMonash · · Score: 1

    At the risk of quoting myself,

    Proponents of MapReduce highlight two advantages:

          1. MapReduce makes it very easy to program data transformations, including ones to which relational structures are of little relevance.
          2. MapReduce runs in massively parallel mode "for free," without extra programming.

    Based on those advantages, MapReduce would indeed seem to have significant uses, including:

            * Specialized indexing of large quantities of data. Obviously, MapReduce was built for text indexing of the Web. But it would likely also be useful for, say, preprocessing satellite telemetry or intelligence intercepts, or for doing early steps in large-scale network traffic analysis. MapReduce may not be good for data management, but it looks good for banging stuff into specialized data management systems.
            * Computer-scientific research. If you're trying to figure out better ways to, say, digest and analyze huge amounts of astronomical data, MapReduce seems like a great platform. Today's researchers - even the students - aren't nearly as adept at parallel algorithms as one would hope. Perhaps we should take those complications away to let them focus on the unique parts of their work. Breakthrough programming is hard enough anyway, especially if you're trying to do all the work yourself.
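
    To make the text-indexing use concrete, here is a toy inverted index written in the map/reduce style. It is a single-process sketch, and `run_mapreduce`, `map_fn` and `reduce_fn` are hypothetical names, not Google's API. A real framework would run the mapper and reducer calls on many machines; this only shows what the programmer has to write:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map step: emit one (word, doc_id) pair per word in the document."""
    for word in text.lower().split():
        yield word, doc_id

def reduce_fn(word, doc_ids):
    """Reduce step: collapse all doc ids seen for a word into a sorted posting list."""
    return word, sorted(set(doc_ids))

def run_mapreduce(inputs, mapper, reducer):
    """Sequential stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in groups.items())

# Hypothetical documents.
docs = [
    ("d1", "map reduce at scale"),
    ("d2", "reduce the index"),
    ("d3", "map the web"),
]
index = run_mapreduce(docs, map_fn, reduce_fn)
print(index["map"])
print(index["reduce"])
```

    Neither `map_fn` nor `reduce_fn` mentions threads, machines or failures. That is the "for free" part of advantage 2: the framework owns the grouping and distribution.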

    --
    To err is human. To forgive is good system design.
  51. I figured it out. Really. by mypalmike · · Score: 1

    I had a lightbulb moment after rereading this thing a few times. The authors of the paper think MapReduce is a distributed query processor, backed by a datastore of unstructured records. They picture this database where every query kicks off a MapReduce operation. Seriously, reread it from that perspective. It makes sense. Too bad for them, their fundamental assumption is wrong. It helps to have even a small amount of experience working with a technology before writing a critique of it.

    --
    There are 0x40000000 types of people: those who understand 32-bit IEEE 754 floating point, and those who don't.
  52. ObDilbert by markov_chain · · Score: 1

    "You'd be fools to ignore the Boolean anti-binary least-square approach!"

    --
    Tsunami -- You can't bring a good wave down!
  53. OMFG! It's going to RAIN PIGS! by killmofasta · · Score: 1

    Just as soon as someone comes up with something that is a lateral improvement (and this type of data architecture is a definite improvement), someone else comes up with the incompatibility argument.

    HEY! GET A MAC!

    It is a definite MISNOMER to label this type of data architecture 'unreliable.' The failsafe and reliability only make failure a little bit slower. The redundancy is *IMPROVED* by multiple failover.

    I hope this technology takes the industry by storm, making all those 350 lb database admins actually crack a book.

    I was talking to a Java friend of mine about how he was adding interoperability with SAP to his CRM, and I said what a PIG SAP was. He said, "No one cares about efficiency or formats anymore. It's only interoperability, and most of that is just minor 'get this, configure this.'"

    Of course, being skeptical, I asked a PeopleSoft programmer about interoperability, and she said, "Interoperability is a done deal; we worked that all out with the y3k problem. It's only us, the programmers, who worry about having the data move between clouds. The DBAs don't really care. The real danger to this industry is the EXPENSIVE house of cards that the database infrastructure is, and how cheap/free upstarts like MySQL are making what costs tens of thousands of dollars available for free."

    Can you imagine Google Earth as a database browser, like Apple's ProjectX/Hot Sauce? (Very, very obscure reference, folks.)

  54. Chitty Chitty Death Bang by billcopc · · Score: 1

    This article is even less coherent than a Family Guy episode.

    Why the hell are they comparing MapReduce to a DBMS? I mean, there are some terribly misguided DBMSes out there (Oracle!), but MapReduce is a distributed computing paradigm.

    Saying MapReduce is a crappy DBMS is like saying the Macbook Thin is a crappy pogo stick.

    --
    -Billco, Fnarg.com
  55. Crawl is concurrent database update, not batch by Animats · · Score: 1

    No, crawling the web isn't a map/reduce type problem. It's a large number of long-running processes feeding a database-like engine.

    Map/reduce is for batch-like jobs. Long-running systems with intercommunication have to be organized differently.

  56. 8th BME International 24-hour Programming Contest! by Anonymous Coward · · Score: 0

    Do you like programming? Can you take challenges? Would you match yourself with others?

    Here is your chance! Grab it!

    The Electrical Engineering Students' Hungarian Association and the Károly Simonyi College for Advanced Studies are proud to present the

    8th BME International 24-hour Programming Contest!

    If you have missed the previous seven occasions, it is now time to join the adventure!

    This contest is a real test of creativity, knowledge, endurance and teamwork, an EXTREME CHALLENGE! Sponsors and prizes worth 5000 euros contribute to the high standards of the contest. The team that gains the most points takes home the award and the cup.

    Teams that have finished registration by 17th February 2008 must do their best during the online preliminary qualifier on 24th February 2008. The thirty best-performing teams can participate onsite at the Finals in Budapest, Hungary, between 2nd and 4th of May 2008. During the adventurous 24-hour round, the contestants will have to solve a single but extremely complicated and interesting task. They will need all of their knowledge of algorithm theory, artificial intelligence and program design, and strong teamwork skills are also required.

    The contestants will be allowed to use their own computers. Technical background and catering will be provided. There are no restrictions on the hardware, software and support they use, but communication with the outside world is strictly forbidden.

    For further information and for tasks from previous years, please visit the official homepage of the Challenge, where future announcements will also be available to everyone.

    If you are not afraid of eXtreme challenges, all you have to do is form a team of 3 members and register at http://www.challenge24.org/ ! Participation is completely free of charge.

    Have fun and good luck!
    The Organizers

    Deadline for registration: 17th February 2008
    Website: http://www.challenge24.org/
    For further information please check our website, or contact us by email


  57. source by Anonymous Coward · · Score: 0

    The article seems to have been written by someone living in a true IT mindset. It's like suggesting that the space shuttle's systems should be interconnected with web services because the way they communicate today is old and not user-friendly enough. In reality, it works efficiently and meets all the goals. MapReduce is similar in that regard.

  58. commodity tools are for commodity apps by unexpectedvalue · · Score: 1

    While there are strong reasons for high-level abstractions in applications where you need hordes of cheap, available developers (these days most people you see on the street are considered web, database and PHP designers), there are applications that require custom tools and custom training. And they will be completely unportable and un-reusable, and it doesn't matter.

    I think that all really interesting apps fall in this category and require their own unique tools that often need to work on a very low level, and that capable engineers couldn't care less whether they deal with tools that are known to fifty million or five people.

    And that is a prerequisite of true innovation, daring to use tools and methods in spite of IT press hacks.

  59. Google clearly gives credit by danielsanII · · Score: 1

    The article states:

    Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years; exactly the techniques that the MapReduce crowd claims to have invented.

    However, the paper on MapReduce clearly states:

    Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages.

    The column writers claim to be "educators and researchers" and they can't even read the *only paper* there is on MapReduce?
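
    For anyone who hasn't met the functional primitives the paper is talking about, here is roughly what they look like. This uses Python's built-in `map` and `functools.reduce` as stand-ins for the Lisp originals, and the word list is made up:

```python
from functools import reduce

words = ["map", "reduce", "lisp"]

# map: apply a function to every element independently.
lengths = list(map(len, words))

# reduce: fold the mapped values into a single result.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths)
print(total)
```

    Each `len` call is independent of the others, so the map step can be spread across machines. The fold is the only step that has to combine results.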