Slashdot Mirror


MapReduce Goes Commercial, Integrated With SQL

CurtMonash writes "MapReduce sits at the heart of Google's data processing — and Yahoo's, Facebook's and LinkedIn's as well. But it's been highly controversial, due to an apparent conflict with standard data warehousing common sense. Now two data warehouse DBMS vendors, Greenplum and Aster Data, have announced the integration of MapReduce into their SQL database managers. I think MapReduce could give a major boost to high-end analytics, specifically to applications in three areas: 1) Text tokenization, indexing, and search; 2) Creation of other kinds of data structures (e.g., graphs); and 3) Data mining and machine learning. (Data transformation may belong on that list as well.) All these areas could yield better results if there were better performance, and MapReduce offers the possibility of major processing speed-ups."

29 of 99 comments (clear)

  1. Um, first question: WTF is MapReduce? by Anonymous Coward · · Score: 5, Funny

    and can I run Linux on it? Or it on Linux? Is it available for my iPhone?

    1. Re:Um, first question: WTF is MapReduce? by spun · · Score: 4, Funny

      MapReduce is the algorithm used to determine the optimum folding pattern used to reduce a standard road map back into its folded state. Duh.

      --
      - None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
    2. Re:Um, first question: WTF is MapReduce? by AKAImBatman · · Score: 4, Informative

      Good question. I had to look it up. (Would it have killed the submitter or editor to include a link?)

      Basically, the software gets its name from the list processing functions "map" (to take every item in a list and transform it, thus producing a list of the same size) and "reduce" (to perform an operation on a list that produces a single value or smaller list). The actual software has nothing to do with "map" and "reduce", but it does to tokenization and processing on massive amounts of data.

      Presumably the Map/Reduce part comes from first normalizing the items being processed (a map operation) then reducing them down to a folded data structure (reduce), thus creating indexes of data suitable for fast searching.

    3. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · · Score: 4, Funny

      and can I run Linux on it? Or it on Linux?

      Have you ever considered that it might itself be a distro? A, like, super-leet distro that the big Valley firms have been hacking together for the past ten years, only giving access to employees that sign a super-nasty NDA? A disto that traces back to a Photoshop 1.0 plugin for resizing GIFs?

    4. Re:Um, first question: WTF is MapReduce? by jbolden · · Score: 5, Informative

      Here is the connection between map and reduce.

      In programming

      map takes a function from A to B, a list of A's and produces a list of B's

      reduce are associative fold functions. They take a list of B's and an initial value and produce a single C.

      Like say for example MAP a collection of social security numbers to ages and then select (REDUCE TO) the maximum age from the collection.

      Now there are results called "fusions" which allow you make computational reductions for example:
      foldr f a . map g = foldr (f.g) a

      So in other words the data set is being treated like a large array using array manipulation commands.

    5. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · · Score: 2, Funny

      Why can't they just look at the creases? Duuuuuuh.

    6. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · · Score: 3, Insightful

      Probably someone who read the post and knows how wrong he is. Like you traverse the web every time you want to look up a search term or how a map is really the same as load balancing...

    7. Re:Um, first question: WTF is MapReduce? by AmberBlackCat · · Score: 4, Funny

      I thought those were like Rubik's Cubes where you just rip them apart and put them back together right.

    8. Re:Um, first question: WTF is MapReduce? by Jack9 · · Score: 2, Informative

      Google's mapreduce framework has a native resource manager that's aware of what resources are available, aware of failures, and is prepared to reschedule failed processes and where (and when?) to direct finished tasks. Basically it's a job que for distributed processing using a private network. MapReduce is just one tool. You aren't going to get much out of it after you max out your local machine's processing until you start work on the rest of it. What's really scary is that MySQL announces that they finally discovered the ancient algorithm of multithreaded recursive aggregation, "Hey look, in some cases MySQL wont waste processing power!" //i'm a mysql fanboy, but this is really an embarassing announcement

      --

      Often wrong but never in doubt.
      I am Jack9.
      Everyone knows me.
    9. Re:Um, first question: WTF is MapReduce? by Jack9 · · Score: 4, Funny

      I'm a little dyslexic. I immediately see the wheelbarrow as a MySQL icon (which is almost universally a MySQL article) and read _M_apReduce into SQL = MYSQL in the title. This is proof I'm a reactionary blowhard who often fails to comprehend the summary, much less read the article.

      There is no link because my wrongometer is not working, it has melted through its resin casing.

      --

      Often wrong but never in doubt.
      I am Jack9.
      Everyone knows me.
    10. Re:Um, first question: WTF is MapReduce? by severoon · · Score: 5, Informative

      Map-Reduce is definitely a technique related to grid computing, but they are not one and the same.

      The most popular (to my knowledge) open source Java library implementing MR is Hadoop.

      Here's the algorithm in a nutshell (anyone who knows more than me, please correct, and I'll be forever grateful). I have a bunch of documents and I want to generate a list of word counts. So I begin with the first document and map each word in the document to the value 1. I return each mapping as I do it, and it is merge-sorted by key into a map. Let's say I start with a document of a single sentence: John likes Sue, but Sue doesn't like John. At the end of the map phase, I have compiled the following map, sorted by key:

      • but - 1
      • doesn't - 1
      • like - 1
      • likes - 1
      • John - 1
      • John - 1
      • Sue - 1
      • Sue - 1

      Now begins the reduce phase. Since the map is sorted by key, all the reduce phase does is iterate through the keys and add up the associated values until a new key is encountered. The result is:

      • but - 1
      • doesn't - 1
      • like - 1
      • likes - 1
      • John - 2
      • Sue - 2

      Simple. Stupid. What's the point? The point is that the way this algorithm divides up the work happens to be extremely convenient for parallel processing. So, the map phase of a single document can be split up and farmed out to different nodes in the grid for processing, which can be processed separately from the reduce phase. The merge-sort can even be done at a different processing node as mappings are returned. Redundancy can be achieved if the same document chunk is farmed out to several nodes for simultaneous processing, and the first one that returns the result is used, the others simply ignored or canceled (maybe they're queued up at redundant nodes that were busy, so canceling means simply removing from the queue with very few cycles wasted). Similarly, because the resulting map is sorted by key, an extremely large map can easily be split and sent to several processing nodes in parallel. The original task of counting words across a set of documents can be decomposed to an ridiculous extent for parallelization.

      Of course, this doesn't make much sense to actually do this unless you have a very large number of documents. Or, let's say you have a lot of computing resources, but each resource on its own is very limited in terms of processing power. Or both.

      This is very close to the problem a company like Google has to solve when indexing the web. The number of documents is huge (every web page), and they don't have any super computers—just a whole ton of cheap, old CPUs in racks.

      At the end of the day, Map-Reduce is only useful for tasks that can be decomposed, though. If you have a problem with separate phases, where the input of each phase is determined by the output of the previous phase, then they must be executed serially and Map-Reduce can't help you. If you consider the word-counting example I posted above, it's easy to see that the result required depends upon state that is inherent in the initial conditions (the documents)—it doesn't matter how you divide up a document or if you jumble up the words, the count associated with each word doesn't change, so the result you're after doesn't depend on the context surrounding those words. On the other hand, if you're interested in counting the number of sentences in those documents, you might have a much more difficult problem. (You might think you could just chunk the documents up at the sentence level, but whether or not something is a sentence depends upon surrounding context—a machine can easily mistake an abbreviation like Mr. for the end of a sentence, especially if that Mr. is followed by a capital letter which could indicate the beginning of a new sentence...which it almost always is. Actually...if you're smart you can probably come up with a very compelling argument that this

      --
      but have you considered the following argument: shut up.
    11. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · · Score: 4, Informative

      This classic word count example by Google is exactly what Aster demonstrated in their webinar via a live demo of their In-database MapReduce software:

      http://www.asterdata.com/product/webcast_mapreduce.html

  2. Mmm.. MapReduce is LISP by Anonymous Coward · · Score: 2, Insightful

    People who don't know LISP are bound to reinvent it, badly.

    1. Re:Mmm.. MapReduce is LISP by geminidomino · · Score: 4, Funny

      Well done, AC. You've exposed their dirty little Scheme.

  3. Perhaps a good addition to data warehousing by MarkWatson · · Score: 4, Interesting

    Data warehousing (here I mean databases stored in column order for faster queries, etc.) may get a lift from using map reduce over server clusters. This would get away from using relational databases for massive data stores for problems where you need to sweep through a lot of data, collecting specific results.

    I think that it is interesting, useful, and cool that Yahoo is supporting the open source Nutch system, that implements map reduce APIs for a few languages - makes it easier to experiment with map reduce on a budget.

    1. Re:Perhaps a good addition to data warehousing by roman_mir · · Score: 2, Interesting

      Except that relational databases are not just indexed objects copied across a large network of cheap PCs. What's good for Google may not be suitable for other databases, who actually care about ACID properties of transactions and not necessarily have the infrastructure to run highly parallel select queries.

    2. Re:Perhaps a good addition to data warehousing by ELProphet · · Score: 4, Informative

      Actually, MapReduce doesn't do anything in the way data's stored- it's just a pipe between two sets of stored data, and really just needs an interface on both ends to get the task into MapReduce (which is what it seems the projects TFS/A mention do). BigTable is the storage mechanism that's incompatible with most traditional row-based RDBMSs. GFS is just the underlying storage mechanism.

      http://labs.google.com/papers/gfs.html
      http://labs.google.com/papers/bigtable.html
      http://labs.google.com/papers/mapreduce-osdi04.pdf

      Note that all of those were published several years ago- I'd bet dollars to donuts that Google is _WAY_ beyond this internally if it's just reaching commercial use by their competitors.

    3. Re:Perhaps a good addition to data warehousing by owenomalley · · Score: 5, Informative

      The correct project name is Hadoop. It was factored out of Nutch 2.5 years ago. And Yahoo has been putting a lot of effort to make it scale up. We run 15,000 nodes with Hadoop in clusters of up to 2,000 nodes each and soon that will be 3,000 nodes. I used 900 nodes to win Jim Gray's terabyte sort benchmark by sorting 1 TB of data (100 billion 100 byte records) in 3.5 minutes. It is also used to generate Yahoo's Web Map, which has 1 trillion edges in it.

    4. Re:Perhaps a good addition to data warehousing by grae · · Score: 5, Informative
      If you're interested in one of the sorts of things that Google has done with MapReduce, look no further than Sawzall.

      http://research.google.com/archive/sawzall.html

      Sawzall is essentially designed around the mapreduce framework. It's impossible to *not* write a mapreduction in Sawzall. The way it works:

      Your program is written to process a single record. The magic part happens when you output: you have to output to special tables. Each of these table types has a different way that it combines data emitted to it.

      So, during the map phase, your program is run in parallel on each input record. During the reduce phase, the reduction happens according to the way the output tables do whatever operation was specified.

      There was some work to be done having enough different output tables to do everything that was useful, especially since you might want to take the output and plug it in as the input to another phase of mapreduction.

      One of the biggest reasons this was a major innovation for Google was that it let some of the people who weren't really programmers still come up with useful programs, because the Sawzall language was pretty simple (especially when combined with some of the library functions that had been implemented to do common sorts of computations.) There were also some interesting ways in which the security model was implemented, but as far as I know they haven't been published yet.

      There certainly are plenty of other technical things that can be done to improve a system like MapReduce (and I know that many of them were in various forms of experimentation when I left the company) but at least some of them are highly dependent on Google's infrastructure, and not really relevant to a general discussion. (I suspect that the papers linked above might have some hints, but it has been a while since I looked at them.)

  4. Re:MySQL has no common sense anyway. . . by SgtPepperKSU · · Score: 5, Funny

    Not like MySQL cared about data integrity in the past. . . whay start now?!

    Gaaah! Data corruption!
    Your post must have been stored in MySQL...

  5. First they attack it by Intron · · Score: 3, Interesting
    --
    Intron: the portion of DNA which expresses nothing useful.
    1. Re:First they attack it by Bazouel · · Score: 3, Interesting

      From a comment made about the article:

      You [the articles authors] seem to be under the impression that MapReduce is a database. It's merely a mechanism for using lots of machines to process very large data sets. You seem to be arguing that MapReduce would be better (for some value of better) if it were a data warehouse product along the lines of TeraData. Unfortunately the resulting tool would be less effective as a general purpose mechanism for processing very large data sets.

      --
      Intelligence shared is intelligence squared.
  6. Got what right? by argent · · Score: 3, Interesting

    I don't think you can credit Bjarne with "compiled code is faster than interpreted code" (or the 21st century version: "compilers can perform better optimizations that JIT translators").

    C++ happens to be the most popular fully compiled language, having edged Fortran out of that position some time near the end of the last century.

    Back in the early '80s, when he was coming up with C++, the big Fortran savants were saying stuff like "Fortran is bigger than ever. There are more than X million Fortran programmers. Everywhere I look there has been an uprising... a lot of teaching was going to Pascal, but more are teaching Fortran again. There has been a backlash."

    ----

    And that's not the only thing C++ has in common with Fortran, either.

    1. Re:Got what right? by johanatan · · Score: 3, Interesting

      " (or the 21st century version: "compilers can perform better optimizations that JIT translators").

      Actually, JITters can do some optimizations that compilers can't--by splitting the compilation into a frontend and a backend. The front end is essentially just a parser, and the later the back-end compile happens, the more opportunities for optimizations actually open up (including such things as utilizing specific instruction sets for given architectures and fine tuning the compile based on run time statistics).

      See the LLVM for more info: http://llvm.org/

      (or .NET for that matter--but we're anti-MS around here. :-)

  7. Re:Um. by raddan · · Score: 2, Informative

    Actually, the two are paired: programming model and implementation. The reason there's a programming model is that functional methods allow Google's implementation to automatically parallelize the input data for feeding to the cluster. So the implementation is very important, because that's actually how the data is processed and returned.

    In that sense, Oracle's clustering optimizations are also a paired programming model and implementation, since, presumably, you need to know Oracle's SQL language extensions in order to take advantage of them (disclaimer: I don't use Oracle). From what I understand about functional programming, SQL should be ideally positioned to take advantage of these kinds of optimizations, since the actual implementation details of any SQL query are always left to the query optimizer, SQL being a declarative language. I'm going to speculate wildly and say that you could probably write a SQL interpreter using a functional style as well, and that good ones probably already do.

  8. Re:Again Bjarne got it right by samkass · · Score: 4, Interesting

    If Java ( or Pyhton etc. for that matter ) were fast enough why did Google choose C++ to build their insanely fast search engine.

    Because their developers knew it better? Because it had better 64-bit support when they started it? Because full GC's weren't compatible with their use case and IBM's parallel GC VM hadn't been released yet? Because they could get and modify all the source to all the libraries?

    I don't know the answer, but there are a lot of possibilities besides speed. You're jumping to an awfully big conclusion there, Mr. Coward.

    --
    E pluribus unum
  9. Re:Again Bjarne got it right by johanatan · · Score: 3, Insightful

    To most people, C++ is C. :-) Unfortunate but true.

  10. Re:Again Bjarne got it right by Rakishi · · Score: 3, Informative

    Well someone should tell that to the people working on Hadoop. I'm sure they'd love to know that their java mapreduce based framework is impossible. Maybe they'll even be able to use the paradox to built a perpetual motion machine and power the world.

    See: http://developers.slashdot.org/comments.pl?sid=900359&cid=24756761

  11. wrong argument? by fragbait · · Score: 2, Insightful

    Though this post is my introduction to both MapReduce and the argument, it strikes me that the people arguing are arguing the wrong problem.

    While MapReduce might be used against some structured data, it looks to be something for unstructured data and dynamically inventing structures in unstructured data. Additionally, you might want to keep that new structure around for a while. You might want to load it up with terabytes of data. At the same time, this data is less and less useful over time.

    Think about two of the key pieces of data Google has, web pages and user interaction and preference data. Web pages change over time. Web sites come and go. Some change a lot (news sites) and some change very little.

    There is a LOT of user interaction data. Clicks on pages, javascript that fires to doubleclick, etc. With preferences, that changes over time, too. Also, marketers want to dynamically react to the clicks and even the minute change of a preference that generates a buck.

    With such a large, changing, and time sensitive dataset, how could it be structured into something as relatively static as a schema? You would box yourself in by making it a schema and defining all the possible relationships.

    So, you take it up one abstraction level and make a "schema" for making relationships. Further more, there is a narrow window within which you even care about data and how it is structured. Granted, you want the webpage/site data to stick around for queries. But even that is marginally useful. Think about how many pages you go into a query on google? I'm sure that will vary by person, but I'd also bet that in practice it is pretty small.

    Maybe everyone else gets that and I'm just late to the party. But my point is that the wrong argument is being made that this should follow all the RDBMS work that has come to date.

    Sure, I do agree that they shouldn't completely ignore all of the research, but to suggest it has to have a schema, indices, etc. just comes across as arguing all data problems belong in a traditional database.

    Or maybe I can take a different approach to this....my brain doesn't have an index. It does categorize data and it can categorize the same piece of data in multiple ways. As I learn new things, my brain creates new "indices" of sort. A large portion of the data in my brain is time sensitive, or indexed over time. The older I get, the more the details of the minutia of life (what I had for dinner this evening) isn't important any more and it loses its categorization. I don't have a schema for my brain, rather I have multiple and I invent and dissolve them over time. I don't know what new one I'll need in the future. I can't know that and without that, I can't make a schema for it. I also can't be constantly modifying the same schema in place. It is easier for me to invent a new one as I go and just abandon the old ones. Sure, new schemas will have parts of the old, but it is still a new schema with the old one still in place and referencing the same data that the new one will soon reference.

    -fragbait