MapReduce Goes Commercial, Integrated With SQL

Posted by kdawson on Tuesday August 26, 2008 @08:48AM from the patterns-in-the-data dept.

CurtMonash writes "MapReduce sits at the heart of Google's data processing — and Yahoo's, Facebook's and LinkedIn's as well. But it's been highly controversial, due to an apparent conflict with standard data warehousing common sense. Now two data warehouse DBMS vendors, Greenplum and Aster Data, have announced the integration of MapReduce into their SQL database managers. I think MapReduce could give a major boost to high-end analytics, specifically to applications in three areas: 1) Text tokenization, indexing, and search; 2) Creation of other kinds of data structures (e.g., graphs); and 3) Data mining and machine learning. (Data transformation may belong on that list as well.) All these areas could yield better results if there were better performance, and MapReduce offers the possibility of major processing speed-ups."

99 comments

Um, first question: WTF is MapReduce? by Anonymous Coward · 2008-08-26 08:53 · Score: 5, Funny

and can I run Linux on it? Or it on Linux? Is it available for my iPhone?
1. Re:Um, first question: WTF is MapReduce? by spun · 2008-08-26 08:57 · Score: 4, Funny
  
  MapReduce is the algorithm used to determine the optimum folding pattern used to reduce a standard road map back into its folded state. Duh.
  
  --
  - None can love freedom heartily, but good men; the rest love not freedom, but license. -- John Milton
2. Re:Um, first question: WTF is MapReduce? by AKAImBatman · 2008-08-26 08:58 · Score: 4, Informative
  
  Good question. I had to look it up. (Would it have killed the submitter or editor to include a link?)
  Basically, the software gets its name from the list processing functions "map" (to take every item in a list and transform it, thus producing a list of the same size) and "reduce" (to perform an operation on a list that produces a single value or smaller list). The actual software has nothing to do with "map" and "reduce", but it does do tokenization and processing on massive amounts of data.
  Presumably the Map/Reduce part comes from first normalizing the items being processed (a map operation) then reducing them down to a folded data structure (reduce), thus creating indexes of data suitable for fast searching.
  
  --
  Javascript + Nintendo DSi = DSiCade
3. Re:Um, first question: WTF is MapReduce? by moderatorrater · 2008-08-26 09:00 · Score: 0, Troll
  
  Map reduce: a framework for taking a problem and breaking it up into smaller pieces. As I understand it, Map is the program that decides which server the data gets sent to, Reduce is the program that actually processes it. For google, when you write a query, they send the query to several different servers. Those servers then search their subset of the internet for that term, rank them, and return them. The central server then combines those results and returns them to the user. In this case, the Map program would send the request to the servers and be smart enough to make sure that you don't get duplicate servers. The Reduce program is the one that does the searching and sends them back.
4. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · 2008-08-26 09:03 · Score: 4, Funny
  
  and can I run Linux on it? Or it on Linux?
  
  Have you ever considered that it might itself be a distro? A, like, super-leet distro that the big Valley firms have been hacking together for the past ten years, only giving access to employees that sign a super-nasty NDA? A distro that traces back to a Photoshop 1.0 plugin for resizing GIFs?
5. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · 2008-08-26 10:06 · Score: 1, Informative
  
  If my memory's right, this Java API for doing grid computing uses this pattern and gives quite a good explanation of it (I think it was developed by Google)
  http://www.gridgain.com/
6. Re:Um, first question: WTF is MapReduce? by jefu · 2008-08-26 10:19 · Score: 1
  
  MapReduce is just an idiom (pattern if you will) for processing collections (arrays, lists, trees, database tables...) of data. There is often another piece, filter, that cuts out the bits you don't want to process. That can easily be done in the reduce step, though sometimes it is done somewhere else.
  For example, suppose you want to compute exp(x) using the usual Taylor series expansion and 20 terms. Start with the list [0,1,2,3,4,5, .. 19]. Then map the function :
  f(i) = x^i / i!
  to each entry in the list. Then reduce the list by adding all the pieces. (This is, admittedly, a trivial example and one that would be better done in other ways -- skip the map step and just do a reduce that computes the polynomial using Horner's rule.)
  Doing this in code in general can make things easier to read (but not always - sometimes the reduce step can get messy). But suppose you wanted to do something like that on ten million numbers. With a hundred processors, you could split the numbers up so each processor would have about the same number, then do the maps on each processor, and reduce (on that same processor) all of the mapped values, then collect the values and reduce again. Less data movement and often a much (sometimes much much) less complicated program.
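  A minimal Python sketch of the exp(x) example above, first as a plain map followed by a reduce, then split into chunks the way the hundred-processor version would be. The function names and the choice of four "workers" are assumptions made for illustration, and the chunks here run one after another rather than on separate processors.

```python
from functools import reduce
from math import exp, factorial
from operator import add


def taylor_term(x, i):
    """The i-th term of the Taylor series for exp(x): x^i / i!."""
    return x ** i / factorial(i)


def exp_map_reduce(x, terms=20):
    """Map each index 0..terms-1 to its term, then reduce by addition."""
    mapped = map(lambda i: taylor_term(x, i), range(terms))
    return reduce(add, mapped, 0.0)


def exp_chunked(x, terms=20, workers=4):
    """Same sum, but mapped and reduced per chunk, then reduced again.

    Each chunk stands in for the share of the list one processor would get.
    """
    indices = list(range(terms))
    chunks = [indices[w::workers] for w in range(workers)]
    partials = [
        reduce(add, map(lambda i: taylor_term(x, i), chunk), 0.0)
        for chunk in chunks
    ]
    return reduce(add, partials, 0.0)
```

  Both versions should agree with math.exp up to floating-point rounding, since addition order is the only thing that changes between them.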
7. Re:Um, first question: WTF is MapReduce? by jbolden · 2008-08-26 10:59 · Score: 5, Informative
  
  Here is the connection between map and reduce.
  In programming
  map takes a function from A to B, a list of A's and produces a list of B's
  reduce is an associative fold function. It takes a list of B's and an initial value and produces a single C.
  Like say for example MAP a collection of social security numbers to ages and then select (REDUCE TO) the maximum age from the collection.
  Now there are results called "fusions" which allow you to make computational reductions, for example:
  foldr f a . map g = foldr (f.g) a
  So in other words the data set is being treated like a large array using array manipulation commands.
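  A rough Python rendering of that fusion law, as a sketch only. The foldr and compose helpers are written here for illustration (Python has no built-in right fold), and the ID-to-birth-year table is made-up data standing in for the social security number example.

```python
from functools import reduce


def foldr(f, init, xs):
    """Right fold: f(x1, f(x2, ... f(xn, init)))."""
    return reduce(lambda acc, x: f(x, acc), reversed(xs), init)


def compose(f, g):
    """(f . g) for a two-argument f: apply g to the element, then f."""
    return lambda x, acc: f(g(x), acc)


# Hypothetical data: person ID -> birth year.
birth_years = {"id-1": 1950, "id-2": 1987, "id-3": 1972}
ids = list(birth_years)


def age_in_2008(person_id):
    return 2008 - birth_years[person_id]


# Two passes: build the list of ages, then fold it down to the maximum.
two_pass = foldr(max, 0, list(map(age_in_2008, ids)))

# One pass: the fused fold never builds the intermediate list of ages.
one_pass = foldr(compose(max, age_in_2008), 0, ids)
```

  The two results are equal; the fused form just skips materializing the mapped list.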
8. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · 2008-08-26 11:06 · Score: 2, Funny
  
  Why can't they just look at the creases? Duuuuuuh.
9. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · 2008-08-26 11:07 · Score: 3, Insightful
  
  Probably someone who read the post and knows how wrong he is. Like you traverse the web every time you want to look up a search term or how a map is really the same as load balancing...
10. Re:Um, first question: WTF is MapReduce? by AmberBlackCat · 2008-08-26 11:11 · Score: 4, Funny
  
  I thought those were like Rubik's Cubes where you just rip them apart and put them back together right.
11. Re:Um, first question: WTF is MapReduce? by Jack9 · 2008-08-26 11:13 · Score: 2, Informative
  
  Google's mapreduce framework has a native resource manager that's aware of what resources are available, aware of failures, and is prepared to reschedule failed processes and where (and when?) to direct finished tasks. Basically it's a job queue for distributed processing using a private network. MapReduce is just one tool. You aren't going to get much out of it after you max out your local machine's processing until you start work on the rest of it. What's really scary is that MySQL announces that they finally discovered the ancient algorithm of multithreaded recursive aggregation, "Hey look, in some cases MySQL won't waste processing power!" //i'm a mysql fanboy, but this is really an embarrassing announcement
  
  --
  
  Often wrong but never in doubt.
  I am Jack9.
  Everyone knows me.
12. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · 2008-08-26 11:23 · Score: 0
  
  Haha, you bought an !phone.
  Everyone point and laugh!
13. Re:Um, first question: WTF is MapReduce? by FilterMapReduce · 2008-08-26 12:20 · Score: 1
  
  Basically, the software gets its name from the list processing functions "map" (to take every item in a list and transform it, thus producing a list of the same size) and "reduce" (to perform an operation on a list that produces a single value or smaller list).
  As does my Slashdot user name. Great, now everyone is going to think I'm calling on people to "filter" this software somehow, which I'd never heard of before this story. And it's "highly controversial", that's helpful.
14. Re:Um, first question: WTF is MapReduce? by fimbulvetr · 2008-08-26 12:33 · Score: 1
  
  Got a link for the mysql thing you mentioned?
15. Re:Um, first question: WTF is MapReduce? by Jack9 · 2008-08-26 12:49 · Score: 4, Funny
  
  I'm a little dyslexic. I immediately see the wheelbarrow as a MySQL icon (which is almost universally a MySQL article) and read _M_apReduce into SQL = MYSQL in the title. This is proof I'm a reactionary blowhard who often fails to comprehend the summary, much less read the article.
  There is no link because my wrongometer is not working, it has melted through its resin casing.
  
  --
  
  Often wrong but never in doubt.
  I am Jack9.
  Everyone knows me.
16. Re:Um, first question: WTF is MapReduce? by severoon · 2008-08-26 12:49 · Score: 5, Informative
  Map-Reduce is definitely a technique related to grid computing, but they are not one and the same.
  The most popular (to my knowledge) open source Java library implementing MR is Hadoop.
  Here's the algorithm in a nutshell (anyone who knows more than me, please correct, and I'll be forever grateful). I have a bunch of documents and I want to generate a list of word counts. So I begin with the first document and map each word in the document to the value 1. I return each mapping as I do it, and it is merge-sorted by key into a map. Let's say I start with a document of a single sentence: John likes Sue, but Sue doesn't like John. At the end of the map phase, I have compiled the following map, sorted by key:
  
  but - 1
  doesn't - 1
  like - 1
  likes - 1
  John - 1
  John - 1
  Sue - 1
  Sue - 1
  Now begins the reduce phase. Since the map is sorted by key, all the reduce phase does is iterate through the keys and add up the associated values until a new key is encountered. The result is:
  
  but - 1
  doesn't - 1
  like - 1
  likes - 1
  John - 2
  Sue - 2
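  The map, sort, and reduce phases described above can be sketched in a few lines of Python. This is only an illustration on one machine: the function names and the word-splitting regex are assumptions, and a plain sort stands in for the merge-sort by key. Python's default string ordering puts capitalized words first, so the keys come out in a different order than in the lists above, but the counts are the same.

```python
import re
from itertools import groupby
from operator import itemgetter


def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in re.findall(r"[A-Za-z']+", document)]


def reduce_phase(sorted_pairs):
    """Walk the key-sorted pairs and sum the values of each run of equal keys."""
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(sorted_pairs, key=itemgetter(0))
    ]


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    pairs.sort(key=itemgetter(0))  # stand-in for the merge-sort by key
    return reduce_phase(pairs)


counts = word_count(["John likes Sue, but Sue doesn't like John."])
```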
  Simple. Stupid. What's the point? The point is that the way this algorithm divides up the work happens to be extremely convenient for parallel processing. So, the map phase of a single document can be split up and farmed out to different nodes in the grid for processing, which can be processed separately from the reduce phase. The merge-sort can even be done at a different processing node as mappings are returned. Redundancy can be achieved if the same document chunk is farmed out to several nodes for simultaneous processing, and the first one that returns the result is used, the others simply ignored or canceled (maybe they're queued up at redundant nodes that were busy, so canceling means simply removing from the queue with very few cycles wasted). Similarly, because the resulting map is sorted by key, an extremely large map can easily be split and sent to several processing nodes in parallel. The original task of counting words across a set of documents can be decomposed to a ridiculous extent for parallelization.
  Of course, this doesn't make much sense to actually do this unless you have a very large number of documents. Or, let's say you have a lot of computing resources, but each resource on its own is very limited in terms of processing power. Or both.
  This is very close to the problem a company like Google has to solve when indexing the web. The number of documents is huge (every web page), and they don't have any super computers—just a whole ton of cheap, old CPUs in racks.
  At the end of the day, Map-Reduce is only useful for tasks that can be decomposed, though. If you have a problem with separate phases, where the input of each phase is determined by the output of the previous phase, then they must be executed serially and Map-Reduce can't help you. If you consider the word-counting example I posted above, it's easy to see that the result required depends upon state that is inherent in the initial conditions (the documents)—it doesn't matter how you divide up a document or if you jumble up the words, the count associated with each word doesn't change, so the result you're after doesn't depend on the context surrounding those words. On the other hand, if you're interested in counting the number of sentences in those documents, you might have a much more difficult problem. (You might think you could just chunk the documents up at the sentence level, but whether or not something is a sentence depends upon surrounding context—a machine can easily mistake an abbreviation like Mr. for the end of a sentence, especially if that Mr. is followed by a capital letter which could indicate the beginning of a new sentence...which it almost always is. Actually...if you're smart you can probably come up with a very compelling argument that this
  --
  but have you considered the following argument: shut up.
17. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · 2008-08-26 14:14 · Score: 0
  
  Pfffffffffffttttt
  Map reduce, in my day it was called LISP....
  GET OFF MY LAWN!!!
18. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · 2008-08-26 14:40 · Score: 4, Informative
  
  This classic word count example by Google is exactly what Aster demonstrated in their webinar via a live demo of their In-database MapReduce software:
  http://www.asterdata.com/product/webcast_mapreduce.html
19. Re:Um, first question: WTF is MapReduce? by Paradise+Pete · 2008-08-26 15:13 · Score: 1
  
  What drunk moderated parent a Troll?
  I'm guessing he figured that the post is so thoroughly wrong that it must be deliberate.
20. Re:Um, first question: WTF is MapReduce? by gslavik · 2008-08-26 16:35 · Score: 1
  
  I don't think that there is sorting going on in MapReduce (from what I've read). Could be that I missed something ...
21. Re:Um, first question: WTF is MapReduce? by Anonymous Coward · 2008-08-26 18:05 · Score: 0
  
  So dude how can we filter MapReduce and why do you want us to do it?
22. Re:Um, first question: WTF is MapReduce? by jonaskoelker · 2008-08-27 00:19 · Score: 1
  
  I'm not quite entirely sure what you mean by the verb "map", the noun "map", and in which sense you use it in each instance. Also, I'm unsure why you think sorting enters into it.
  My understanding of MapReduce is that it's (surprise!) all about applying the higher-order functions map and then reduce. Here's what they do:
  Map takes a function f and a list [x_1, ..., x_n], then returns [f(x_1), ..., f(x_n)]. That is, it applies f to all the elements of the list. [Variants take multi-argument functions and multiple lists.]
  Reduce takes an associative operator ++ and a non-empty list [y_1, ..., y_n] and returns y_1 ++ y_2 ++ ... ++ y_n. [Variants take an initial value, and may accept empty lists then.]
  Example: you want to know the sum of the squares of 1 through k, and have a list [1, ..., k]. You can evaluate Reduce(addition, map(squaring, [1..k])) to get exactly what you want.
  So, what's the big fuss about Google's MapReduce? It's an implementation of map and reduce that works in parallel on many machines; note that if f has no side effects, you can compute f(x_i) independently from f(x_j). Also, if you know f(x_i) and f(x_{i+1}) then you can compute f(x_i) ++ f(x_{i+1}) without worrying what happens elsewhere in the reduce job.
  Also, Google probably uses it for something other than computing sums of lists of numbers. Especially the ones that have closed form expressions ;)
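  A small Python sketch of the sum-of-squares example, once as a direct Reduce(addition, map(squaring, [1..k])) and once with the list split into pieces whose partial results are combined afterwards. Threads stand in for separate machines here; the function names and the number of pieces are assumptions, and k is assumed to be at least 1 so the list is non-empty.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def square(n):
    return n * n


def sum_of_squares(k):
    """Reduce(addition, map(squaring, [1..k])) on a single machine."""
    return reduce(add, map(square, range(1, k + 1)))


def partial_sum(chunk):
    """Map and reduce one piece of the list independently of the others."""
    return reduce(add, map(square, chunk))


def sum_of_squares_split(k, pieces=4):
    """Same result, with each contiguous piece handled by its own worker.

    This is valid because squaring has no side effects and addition is
    associative, so the partial sums can be combined in a final reduce.
    """
    numbers = list(range(1, k + 1))
    size = -(-len(numbers) // pieces)  # ceiling division
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return reduce(add, partials)


def closed_form(k):
    """The closed form the comment alludes to: k(k+1)(2k+1)/6."""
    return k * (k + 1) * (2 * k + 1) // 6
```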
23. Re:Um, first question: WTF is MapReduce? by Varun+Soundararajan · 2008-08-27 02:03 · Score: 1
  
  and can I run Linux on it? Or it on Linux? Is it available for my iPhone?
  First let's figure out if we can run Vista with it. Vista is toooooo slow.
24. Re:Um, first question: WTF is MapReduce? by spazdor · 2008-08-27 04:34 · Score: 1
  
  That problem has already been solved by a collaboration of millions of computers! Haven't you ever heard of Folding@Home?
  "Ok, what about accordion-style from the leftmost edge, with a vertical fold at the beginning!?"
  
  --
  DRM: Terminator crops for your mind!
25. Re:Um, first question: WTF is MapReduce? by severoon · 2008-08-27 09:23 · Score: 1
  
  I only have a passing familiarity with Map-Reduce, so I'm definitely not an authoritative source. It's definitely possible that sorting isn't part of the algorithm itself, but rather one example of context around how it's often implemented. It definitely makes sense, though—why not merge-sort the results as mappings are returned? If you do implement it this way, it just makes it possible to deal with really large maps that need to be spread over multiple nodes.
  
  --
  but have you considered the following argument: shut up.
26. Re:Um, first question: WTF is MapReduce? by severoon · 2008-08-27 09:30 · Score: 1
  
  map, v. - to perform a mapping
  map, n. - a collection of mappings
  I think you describe the nuts and bolts of the algorithm...but that's not really that helpful when it comes to understanding the usefulness.
  The big fuss about map-reduce (not necessarily Google's) is that we've pretty much hit the speed limit in single core processing power. 4GHz is about it...it's not going to get any faster for some time. Unfortunately, most programs are written to only run on a single core, so adding more cores is only going to get you so far. If you want truly distributive load at a low level of granularity, map-reduce can contribute to a compelling story.
  
  --
  but have you considered the following argument: shut up.
27. Re:Um, first question: WTF is MapReduce? by zevans · 2008-08-29 09:34 · Score: 1
  
  MapReduce is the algorithm used to determine the optimum folding pattern used to reduce a standard road map back into its folded state. Duh.
  Coded for, we assume, on the Y chromosome only.
  
  --
  "... and more and more now there are all kinds of electronic goodies available" -- Pink Floyd 1972
28. Re:Um, first question: WTF is MapReduce? by Hurricane78 · 2008-08-29 11:25 · Score: 1
  
  In Haskell, there is the command "fold" (foldr or foldl) for this. What's so special about this?
  Haskell has "map", "filter", "zip", "reverse" and whatnot...
  (... why must I think of Missy Elliott songs now?)
  
  --
  Any sufficiently advanced intelligence is indistinguishable from stupidity.
Mmm.. MapReduce is LISP by Anonymous Coward · 2008-08-26 09:01 · Score: 2, Insightful

People who don't know LISP are bound to reinvent it, badly.
1. Re:Mmm.. MapReduce is LISP by geminidomino · 2008-08-26 10:27 · Score: 4, Funny
  
  Well done, AC. You've exposed their dirty little Scheme.
2. Re:Mmm.. MapReduce is LISP by Anonymous Coward · 2008-08-26 10:48 · Score: 1, Funny
  
  Yes, they're out to Steele LISP's imaginary property.
3. Re:Mmm.. MapReduce is LISP by Anonymous Coward · 2008-08-26 21:55 · Score: 0
  
  People who don't realize that LISP is just functional programming are bound to assume that LISP is special.
  Clearly, MapReduce is conceptually inspired by functional programming languages such as LISP.
  But LISP won't distribute your MapReduce instances, or run them efficiently.
Perhaps a good addition to data warehousing by MarkWatson · 2008-08-26 09:02 · Score: 4, Interesting

Data warehousing (here I mean databases stored in column order for faster queries, etc.) may get a lift from using map reduce over server clusters. This would get away from using relational databases for massive data stores for problems where you need to sweep through a lot of data, collecting specific results.
I think that it is interesting, useful, and cool that Yahoo is supporting the open source Nutch system, that implements map reduce APIs for a few languages - makes it easier to experiment with map reduce on a budget.
1. Re:Perhaps a good addition to data warehousing by roman_mir · 2008-08-26 09:25 · Score: 2, Interesting
  
  Except that relational databases are not just indexed objects copied across a large network of cheap PCs. What's good for Google may not be suitable for other databases, which actually care about ACID properties of transactions and do not necessarily have the infrastructure to run highly parallel select queries.
  
  --
  You can't handle the truth.
2. Re:Perhaps a good addition to data warehousing by ELProphet · 2008-08-26 09:30 · Score: 4, Informative
  
  Actually, MapReduce doesn't do anything in the way data's stored- it's just a pipe between two sets of stored data, and really just needs an interface on both ends to get the task into MapReduce (which is what it seems the projects TFS/A mention do). BigTable is the storage mechanism that's incompatible with most traditional row-based RDBMSs. GFS is just the underlying storage mechanism.
  http://labs.google.com/papers/gfs.html
  http://labs.google.com/papers/bigtable.html
  http://labs.google.com/papers/mapreduce-osdi04.pdf
  Note that all of those were published several years ago- I'd bet dollars to donuts that Google is _WAY_ beyond this internally if it's just reaching commercial use by their competitors.
3. Re:Perhaps a good addition to data warehousing by owenomalley · 2008-08-26 09:30 · Score: 5, Informative
  
  The correct project name is Hadoop. It was factored out of Nutch 2.5 years ago. And Yahoo has been putting a lot of effort into making it scale up. We run 15,000 nodes with Hadoop in clusters of up to 2,000 nodes each and soon that will be 3,000 nodes. I used 900 nodes to win Jim Gray's terabyte sort benchmark by sorting 1 TB of data (10 billion 100-byte records) in 3.5 minutes. It is also used to generate Yahoo's Web Map, which has 1 trillion edges in it.
4. Re:Perhaps a good addition to data warehousing by MarkWatson · 2008-08-26 10:14 · Score: 1
  
  Cool!! And thanks for the correction.
5. Re:Perhaps a good addition to data warehousing by jefu · 2008-08-26 10:24 · Score: 1
  
  I'm currently working on a project where users will be able to apply different types of transformation and collection to timestamped data and map/filter/reduce style algorithms are perfect ways to give them that capability.
  The kind of capability might look something like : give me the average temperature at hourly intervals for each day in the year for a dataset that spans multiple years. In this case there's no map, and the reduce does the work, in other cases this may be turned around.
  The data involved is sitting on one processor and not overly large, but a map/reduce view is probably the easiest one for people to understand.
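  A sketch of what such a reduce-only query might look like in Python, under one reading of the example: the average temperature for each (month, day, hour) slot, pooled across all the years in the dataset. The function names and the sample readings are made up for illustration.

```python
from datetime import datetime
from functools import reduce


def add_reading(totals, reading):
    """Reduce step: fold one (timestamp, temperature) pair into the totals."""
    stamp, temperature = reading
    key = (stamp.month, stamp.day, stamp.hour)
    total, count = totals.get(key, (0.0, 0))
    totals[key] = (total + temperature, count + 1)
    return totals


def hourly_averages(readings):
    """Average temperature per (month, day, hour), across every year present."""
    totals = reduce(add_reading, readings, {})
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical sample data spanning two years.
sample = [
    (datetime(2006, 7, 4, 14, 5), 30.0),
    (datetime(2006, 7, 4, 14, 50), 32.0),
    (datetime(2007, 7, 4, 14, 20), 34.0),
    (datetime(2007, 7, 4, 15, 0), 33.0),
]
averages = hourly_averages(sample)
```

  Carrying a (sum, count) pair through the reduce, rather than a running average, keeps the step easy to merge if the data were ever split across machines.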
6. Re:Perhaps a good addition to data warehousing by grae · 2008-08-26 11:11 · Score: 5, Informative
  
  If you're interested in one of the sorts of things that Google has done with MapReduce, look no further than Sawzall.
  http://research.google.com/archive/sawzall.html
  Sawzall is essentially designed around the mapreduce framework. It's impossible to *not* write a mapreduction in Sawzall. The way it works:
  Your program is written to process a single record. The magic part happens when you output: you have to output to special tables. Each of these table types has a different way that it combines data emitted to it.
  So, during the map phase, your program is run in parallel on each input record. During the reduce phase, the reduction happens according to the way the output tables do whatever operation was specified.
  There was some work to be done having enough different output tables to do everything that was useful, especially since you might want to take the output and plug it in as the input to another phase of mapreduction.
  One of the biggest reasons this was a major innovation for Google was that it let some of the people who weren't really programmers still come up with useful programs, because the Sawzall language was pretty simple (especially when combined with some of the library functions that had been implemented to do common sorts of computations.) There were also some interesting ways in which the security model was implemented, but as far as I know they haven't been published yet.
  There certainly are plenty of other technical things that can be done to improve a system like MapReduce (and I know that many of them were in various forms of experimentation when I left the company) but at least some of them are highly dependent on Google's infrastructure, and not really relevant to a general discussion. (I suspect that the papers linked above might have some hints, but it has been a while since I looked at them.)
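  To make the shape of that model concrete, here is a rough Python imitation. It is not Sawzall syntax and not any Google API: the aggregator names, the run and emit helpers, and the log lines are all invented for this sketch. The point it tries to show is that the per-record program only sees one record and only emits values, while the table type decides how those values get combined.

```python
from collections import defaultdict

# Hypothetical table types: each one says how emitted values are combined.
AGGREGATORS = {
    "sum": lambda values: sum(values),
    "maximum": lambda values: max(values),
    "collection": lambda values: list(values),
}


def run(per_record, records, tables):
    """Run per_record on every record, then combine whatever it emitted.

    tables maps a table name to an aggregator name from AGGREGATORS.
    """
    emitted = defaultdict(list)

    def emit(table, value):
        emitted[table].append(value)

    # "Map" phase: each record is handled independently of the others.
    for record in records:
        per_record(record, emit)

    # "Reduce" phase: the table type alone decides how values are combined.
    return {name: AGGREGATORS[kind](emitted[name]) for name, kind in tables.items()}


def analyze(line, emit):
    """Per-record program: looks at one log line and emits to two tables."""
    emit("requests", 1)
    emit("largest_response", int(line.split()[-1]))


log_lines = ["GET /a 200 512", "GET /b 404 128", "GET /c 200 2048"]
result = run(analyze, log_lines, {"requests": "sum", "largest_response": "maximum"})
```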
7. Re:Perhaps a good addition to data warehousing by Anonymous Coward · 2008-08-26 11:12 · Score: 0
  
  You guys have done an awesome job getting such ugly and poorly implemented code to perform. It's still downright nasty code, but at least it's not dog slow anymore. Congrats on the great job!
8. Re:Perhaps a good addition to data warehousing by targyros · 2008-08-26 16:23 · Score: 1
  
  This is a great point. To add to that, the way we see it is that MapReduce serves two purposes:
  
  1) Go beyond SQL. This is not a big deal for transactional databases, where most of the logic is well-expressible in standard SQL. But analytics are another story since there is so much custom logic (how do you implement a data mining algorithm, like association rules, in SQL? It's not easy!)
  
  2) Go parallel. Nobody knew what a good parallel API looked like before Google brought MapReduce and proved its value by using its own systems as guinea pigs. Since our Data Warehouse architecture is natively MPP, MapReduce is a great fit to speed up analytical applications.
  
  The combination of these two possibilities we believe can be revolutionary for Data Warehousing. If you're interested to read more take a look at our blog.
9. Re:Perhaps a good addition to data warehousing by Zaaf · 2008-08-26 19:16 · Score: 1
  
  Since the main difference between a RDBMS and MapReduce seems to be that the former is most suited for structured data and the latter best suited for unstructured data, it might be a good fit to use them both. And according to studies, it might be that north of 80% of our data is unstructured. This has been a big topic in data warehousing and led to the start of the whole DWH 2.0 thing.
  So the fact that MapReduce is used in massive parallel processing machines like the ones from Greenplum (as quoted from the article) is not as bad as Stonebraker and Co. seem to think.
  Zaaf
  
  --
  
  ---
  "Multiple exclamation marks are a sure sign of a sick mind." (Terry Pratchett)
10. Re:Perhaps a good addition to data warehousing by tuomoks · 2008-08-26 19:24 · Score: 1
  
  Correct. I sometimes wonder how many /. readers are really developers? Mapreduce is old, old technology, Google just made it famous and, maybe, documented. It is not always useful in all cases but never worse than any other method in throughput. If you have to "map" information and the more they are unbalanced, the better it gets.
  Actually the question about developers came because a lot of replies are talking about API - if you code, write your own, it is very easy once you understand the principle. And I can tell, multiple cpus, parallel processing and mapreduce method was known already in 70's when I had to write data collections for whatever reason.
  And yes, BigTable is a (almost) totally different issue but even that is not new, just used by Google at this scale maybe for the first time. Not sure even of that - huge government systems do sometimes crazy things but don't tell anybody how.
11. Re:Perhaps a good addition to data warehousing by poot_rootbeer · 2008-08-27 02:27 · Score: 1
  
  The correct project name is Hadoop. It was factored out of Nutch 2.5 years ago
  SPEAK
  ENGLISH
Re:MySQL has no common sense anyway. . . by SgtPepperKSU · 2008-08-26 09:04 · Score: 5, Funny

Not like MySQL cared about data integrity in the past. . . whay start now?!
Gaaah! Data corruption!
Your post must have been stored in MySQL...
Good luck with transactions and map/reduce by Anonymous Coward · 2008-08-26 09:05 · Score: 1, Insightful

they go together like paint and peanut butter.
Map/Reduce is better suited for read-only data mining situations.
First they attack it by Intron · 2008-08-26 09:10 · Score: 3, Interesting

http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
then they embrace it

--
Intron: the portion of DNA which expresses nothing useful.
1. Re:First they attack it by sohp · 2008-08-26 11:09 · Score: 1
  
  Mahatma Gandhi actually said, "First they ignore you, then they ridicule you, then they fight you, then you win."
  The tool custodians of the massively complex relational database warehouse tools are seeing their world turn obsolete as the lighter weight MySQL and the more flexible mapreduce and the BASE worlds evolve beyond them, so yes, they are going to kick up a fight. Don't let the screen door hit you in the butt on the way out, guys.
2. Re:First they attack it by Bazouel · 2008-08-26 12:19 · Score: 3, Interesting
  
  From a comment made about the article:
  You [the article's authors] seem to be under the impression that MapReduce is a database. It's merely a mechanism for using lots of machines to process very large data sets. You seem to be arguing that MapReduce would be better (for some value of better) if it were a data warehouse product along the lines of TeraData. Unfortunately the resulting tool would be less effective as a general purpose mechanism for processing very large data sets.
  
  --
  Intelligence shared is intelligence squared.
3. Re:First they attack it by shutdown+-p+now · 2008-08-27 18:39 · Score: 1
  
  It would only be fair to include the article authors' answer as well:
  
  It's not that we don't understand this viewpoint. We are not claiming that MapReduce is a database system. What we are saying is that like a DBMS + SQL + analysis tools, MapReduce can be and is being used to analyze and perform computations on massive datasets. So we aren't judging apples and oranges. We are judging two approaches to analyzing massive amounts of information, even for less structured information.
4. Re:First they attack it by Bazouel · 2008-08-28 03:51 · Score: 1
  
  That answer does not make sense at all given the points they try to make in the article, which clearly show their misunderstanding of what MapReduce is.
  
  --
  Intelligence shared is intelligence squared.
What a silly name... by Anonymous Coward · 2008-08-26 09:15 · Score: 1, Interesting

In functional programming map and reduce is very very old knowledge (and, yup, functional programming has its use and, yes, there are some very good and very successful programs written using functional languages).
What's next? A product called DepthFirstSearch (notice the uber broken camel case for a product name) that has nothing to do with the depth-first search algorithm?
Google? Allo?
Um. by Estanislao+Martínez · 2008-08-26 09:15 · Score: 1

Doesn't Oracle have this sort of feature already, without the Google "MapReduce" buzzword buzz?

--
Are you adequate?
1. Re:Um. by EvilIntelligence · 2008-08-26 09:20 · Score: 1, Informative
  
  Yes, it's called hash partitioning. Been around since version 7 or 8 about 10 years ago (current release is 11).
2. Re:Um. by Anonymous Coward · 2008-08-26 09:30 · Score: 1, Informative
  
  Uh, no. MapReduce is a parallel programming model -- not a way of laying out data on disk.
3. Re:Um. by raddan · 2008-08-26 10:18 · Score: 2, Informative
  
  Actually, the two are paired: programming model and implementation. The reason there's a programming model is that functional methods allow Google's implementation to automatically parallelize the input data for feeding to the cluster. So the implementation is very important, because that's actually how the data is processed and returned.
  
  In that sense, Oracle's clustering optimizations are also a paired programming model and implementation, since, presumably, you need to know Oracle's SQL language extensions in order to take advantage of them (disclaimer: I don't use Oracle). From what I understand about functional programming, SQL should be ideally positioned to take advantage of these kinds of optimizations, since the actual implementation details of any SQL query are always left to the query optimizer, SQL being a declarative language. I'm going to speculate wildly and say that you could probably write a SQL interpreter using a functional style as well, and that good ones probably already do.
4. Re:Um. by Estanislao Martínez · 2008-08-26 13:11 · Score: 1
  
  IIRC, Oracle has features for automatically parallelizing query execution. These features are enabled by various combinations of session settings and query hints, and can parallelize execution either within a single server machine or across multiple machines in a cluster.
  
  I'm going to speculate wildly and say that you could probably write a SQL interpreter using a functional style as well, and that good ones probably already do.
  It's deeper than that. Save for relvar update operations, relational algebra just is a functional language. Relational algebra really just consists of relation types and higher-order functional operators over them. For example, relational restriction is an operator that takes a relation over a tuple type and a predicate over that tuple type, and returns another relation over the same tuple type.
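  A rough sketch of restriction as a higher-order function, in Python. The `Employee` tuple type and the sample rows are invented for illustration; the point is only that `restrict` takes a relation plus a predicate and returns a relation over the same tuple type, so calls compose.

```python
from collections import namedtuple

# Hypothetical tuple type; a relation is modelled as a frozenset of these.
Employee = namedtuple("Employee", ["name", "dept", "salary"])


def restrict(relation, predicate):
    """Relational restriction: keep the tuples that satisfy the predicate."""
    return frozenset(row for row in relation if predicate(row))


employees = frozenset({
    Employee("ann", "eng", 120),
    Employee("bob", "ops", 90),
    Employee("cho", "eng", 95),
})

# The result is again a relation over Employee, so restrictions chain.
engineers = restrict(employees, lambda e: e.dept == "eng")
well_paid_engineers = restrict(engineers, lambda e: e.salary > 100)
```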
  
  --
  Are you adequate?
5. Re:Um. by raddan · 2008-08-26 14:36 · Score: 1
  
  Yeah, that's why I speculated that SQL might be done so easily-- Oracle really is a rabbit hole. I've done some relational algebra in a database course (and also was exposed to set theory in my discrete maths course), but it was unclear to me whether query optimizers actually broke a query down into relational algebra or not. In fact, I remember that despite having had prior experience with SQL, relational algebra was much easier for me to wrap my head around than SQL. My professor was hesitant to go too much into optimizers, though, since most of that is implementation-dependent. He thought it was much more important to talk about ACID, so that's what we spent a good deal of time on.
6. Re:Um. by EvilIntelligence · 2008-08-27 06:30 · Score: 1
  
  The optimizer in an Oracle database (and others, I'm sure) actually determines "access path" based on resource cost. It automatically generates many different access paths, and based on known statistics about the underlying objects in question, determines the cost in resources to execute that path (CPU, memory, disk I/O, etc, etc), then chooses the one with the least cost. It's not always correct 100% of the time, but you can influence the optimizer through configuration parameters at the database level as well as "hints" in the SQL statement itself as specially coded comments. Oracle supports parallel inserts/updates/deletes across multiple partitions, as well as parallel reads.
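  As a toy illustration of "generate access paths, cost them from statistics, pick the cheapest": the cost formulas, the `random_io_penalty` factor and all the names below are assumptions made up for this sketch, not Oracle's actual cost model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    rows: int            # total rows in the table
    selectivity: float   # estimated fraction of rows matching the predicate


def full_scan_cost(stats):
    # Sequential read of every row.
    return float(stats.rows)


def index_lookup_cost(stats, random_io_penalty=4.0):
    # Only matching rows are read, but each read is weighted as a costlier random I/O.
    return stats.rows * stats.selectivity * random_io_penalty


def choose_access_path(stats):
    """Cost every candidate path and return the name of the cheapest one."""
    candidates = {
        "full_scan": full_scan_cost(stats),
        "index_lookup": index_lookup_cost(stats),
    }
    return min(candidates, key=candidates.get)


selective = choose_access_path(TableStats(rows=1_000_000, selectivity=0.001))
unselective = choose_access_path(TableStats(rows=1_000_000, selectivity=0.6))
```

  With these made-up numbers the highly selective predicate favours the index and the unselective one favours the full scan, which is also why stale statistics can lead to a bad plan.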
  
  Whether you use partitioning in a relational database vs data sharding across multiple machines will depend on what you intend to do with that data. If you only plan to simply use a given value to do a lookup (is the word "car" in that page?), then sharding may be the way to go, since it easily creates a wide and flat surface to lay out your data for quick lookup. If you plan on joining that data or doing any kind of complex analysis, then a relational database is the way to go. So it all still comes back to business requirements for the system.
7. Re:Um. by Estanislao Martínez · 2008-08-27 09:35 · Score: 1
  
  The optimizer in an Oracle database (and others, I'm sure) actually determines "access path" based on resource cost. It automatically generates many different access paths, and based on known statistics about the underlying objects in question, determines the cost in resources to execute that path (CPU, memory, disk I/O, etc, etc), then chooses the one with the least cost.
  Leaving aside the issue of where the "query rewriter" ends and where the "optimizer" starts, no, that's not all that happens to go from SQL query against a database to execution plan. Access paths are things like choosing various types of index access or table scans. However, many optimizations to SQL statements are purely syntactic, and are based on semantic equivalences guaranteed by the relational algebra.
  An easy example: a restriction that applies to just one of the relations in a join can be pushed inside the join onto that relation. For example, if you have restrict(join(A,B), predicate_over_A), you can transform that relational algebra expression into join(restrict(A, predicate_over_A), B). This optimization is called "pushing the restriction," and it reduces the number of rows that have to be processed for the join.
  So, database query optimization, deep down, involves reasoning both about equivalent query transformations and hardware resource costs for various operations.
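  The equivalence can be checked on a small example. This is a sketch with made-up data; relations are plain lists of dicts, `join` is a naive nested-loop join on one shared attribute, and the `key` argument is an addition for the sketch.

```python
def join(a, b, key):
    """Naive join of two lists of dicts on a shared attribute."""
    return [{**ra, **rb} for ra in a for rb in b if ra[key] == rb[key]]


def restrict(relation, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


A = [{"id": 1, "region": "eu"}, {"id": 2, "region": "us"}, {"id": 3, "region": "eu"}]
B = [{"id": 1, "total": 10}, {"id": 2, "total": 20}, {"id": 3, "total": 30}]


def predicate_over_A(row):
    # Mentions only an attribute that comes from A.
    return row["region"] == "eu"


# Restrict after the join: all 3 rows of A take part in the join.
late = restrict(join(A, B, "id"), predicate_over_A)

# Restriction pushed inside the join: only 2 rows of A take part.
early = join(restrict(A, predicate_over_A), B, "id")
```

  Both expressions produce the same rows; the second just does less work in the join.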
  
  --
  Are you adequate?
8. Re:Um. by Anonymous Coward · 2008-08-28 05:33 · Score: 0
  
  Yes. Oracle has more than one way to skin this cat:
  - User-defined aggregates (since 8i, I think).
  - Table functions. Oracle's table functions can run in parallel and support more than one way to partition the input. It is fairly easy to simulate MapReduce using them.
  Google's MapReduce is so powerful (IMHO) not because of the programming paradigm but because Google built a distributed, fault-tolerant data store (GFS) and the environment (a cluster manager?) to manage 1000's of processes on 100's to 1000's of machines.
Low Quality Paper by Anonymous Coward · 2008-08-26 09:16 · Score: 0

The original paper for map reduce, http://labs.google.com/papers/mapreduce-osdi04.pdf is actually of pretty poor quality.
There are not really any useful comparisons in the paper. They do not indicate how it scales with increases in the number of processors, so while it may be very fast on the mammoth amount of hardware used, how much faster would it actually get on additional hardware.
If you look at the Sort section of the comparison they seem to be comparing to http://www.almaden.ibm.com/cs/gpfs-spsort.html
which is a 10% improvement on wildly improved hardware, which would seem to be rather disappointing results. This would not have been a problem with the paper had there been any mention of this, but there was not.
Again Bjarne got it right by Anonymous Coward · 2008-08-26 09:42 · Score: 1, Insightful

I am with Bjarne on this one.
Bjarne Stroustrup, creator of the C++ programming language, claims that C++ is experiencing a revival and
that there is a backlash against newer programming languages such as Java and C#. "C++ is bigger than ever.
There are more than three million C++ programmers. Everywhere I look there has been an uprising
- more and more projects are using C++. A lot of teaching was going to Java, but more are teaching C++ again.
There has been a backlash.", said Stroustrup.
He continues.. ..What would the world be like without Google?... Only C++ can allow you to create applications as powerful as MapReduce which allows them to create fast searches.
I totally agree. If Java (or Python etc. for that matter) were fast enough, why did Google choose C++ to build their insanely fast search engine? MapReduce rocks. No Java solution can even come close.
I rest my case.
1. Re:Again Bjarne got it right by johanatan · 2008-08-26 10:13 · Score: 1
  
  You are aware that Python has built in support for map and reduce, no? And that the Python interpreter and most JVMs are written in C++ (not to mention many operating systems). When did the implementation language ever prove the abstraction worthwhile?
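  For anyone who hasn't seen them, a tiny sketch of the two functions. Note that `reduce` was a builtin in Python 2 and lives in `functools` in Python 3; the word list is just sample data.

```python
from functools import reduce

words = ["map", "reduce", "slashdot"]

# map: transform every item, producing a list of the same size.
lengths = list(map(len, words))

# reduce: fold the list down to a single value.
total = reduce(lambda acc, n: acc + n, lengths, 0)
```

  These are ordinary single-process list functions; they share the names with MapReduce, not the distribution and fault tolerance.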
2. Re:Again Bjarne got it right by Anonymous Coward · 2008-08-26 10:14 · Score: 1, Informative
  
  Don't confuse the search engine with MapReduce. The MapReduce engine creates the indexes for the search engine; it's a batch job processor. Just because Google chose C++ does not mean it is the only choice, even if it was the best choice for them. Hadoop (a Java project at Yahoo, and open source too) has a MapReduce implementation.
3. Re:Again Bjarne got it right by Lucas.Langa · 2008-08-26 10:28 · Score: 1
  
  Python is written in C, actually.
  
  --
  Build a tool even an idiot can use and only an idiot will want to use it. -S.O.B.
4. Re:Again Bjarne got it right by samkass · 2008-08-26 10:40 · Score: 4, Interesting
  
  If Java ( or Python etc. for that matter ) were fast enough why did Google choose C++ to build their insanely fast search engine.
  Because their developers knew it better? Because it had better 64-bit support when they started it? Because full GC's weren't compatible with their use case and IBM's parallel GC VM hadn't been released yet? Because they could get and modify all the source to all the libraries?
  I don't know the answer, but there are a lot of possibilities besides speed. You're jumping to an awfully big conclusion there, Mr. Coward.
  
  --
  E pluribus unum
5. Re:Again Bjarne got it right by Jack9 · 2008-08-26 10:57 · Score: 1
  
  Only C++ can allow you to create applications as powerful as MapReduce which allows them to create fast searches.
  Except that MapReduce is not an application, that it was originally codified in LISP, and that Google started using the technology because they bought AltaVista, where it was originally used for searching.
  An AC getting it all wrong? Unpossible.
  
  --
  
  Often wrong but never in doubt.
  I am Jack9.
  Everyone knows me.
6. Re:Again Bjarne got it right by johanatan · 2008-08-26 11:02 · Score: 3, Insightful
  
  To most people, C++ is C. :-) Unfortunate but true.
7. Re:Again Bjarne got it right by Rakishi · 2008-08-26 12:31 · Score: 3, Informative
  
  Well, someone should tell that to the people working on Hadoop. I'm sure they'd love to know that their Java MapReduce-based framework is impossible. Maybe they'll even be able to use the paradox to build a perpetual motion machine and power the world.
  See: http://developers.slashdot.org/comments.pl?sid=900359&cid=24756761
8. Re:Again Bjarne got it right by adpowers · 2008-08-26 13:16 · Score: 1
  
  Except that AltaVista was bought by Overture who were then bought by Yahoo!. Also, I wouldn't really call MapReduce a technology. The individual functions (Map and Reduce) come from functional programming, but the concept is becoming popular because Google's implementation and Hadoop have made it easy to write large scale data processing applications without having to worry about scaling or failures yourself. It also doesn't hurt that many problems can be solved with MapReduce.
  A five digit user getting it all wrong? Unpossible.
9. Re:Again Bjarne got it right by smellotron · 2008-08-26 13:21 · Score: 1
  
  Stop embracing the ignorance.
10. Re:Again Bjarne got it right by Jack9 · 2008-08-26 15:04 · Score: 1
  
  The technology is not just mapreduce, it's how you manage multiple resources to leverage what is essentially brute force. Now, try to keep up, someone can buy the shell of a company after another buys the heart:
  http://arnoldit.com/wordpress/2008/01/18/map-reduce-the-great-database-controversy/
  Hey look, we're both guilty of not being perfect. Thanks for the vote of confidence though!
  
  --
  
  Often wrong but never in doubt.
  I am Jack9.
  Everyone knows me.
11. Re:Again Bjarne got it right by Anonymous Coward · 2008-08-26 15:17 · Score: 1, Informative
  
  Hadoop is written in Java and does a fine job. And Google uses more Java than you can imagine.
12. Re:Again Bjarne got it right by Anonymous Coward · 2008-08-26 20:48 · Score: 0
  
  Because compiled languages with a frugal library use _far_ less memory than the VM + JIT compiler + All-included class library?
  When the amount of data to process is insanely big, the most important optimizations include reducing run-time space and execution time. Pseudo-compiled garbage-collected languages fail big on both optimizations. Those are better at optimizing development time, though.
  As always, you have to use the right tool for the job.
13. Re:Again Bjarne got it right by johanatan · 2008-08-27 01:33 · Score: 1
  
  Oh, I don't embrace it! In fact, I don't care to ever use C (proper) and I certainly never intend to use C++ as if it were C (that's actually my biggest gripe with C++ currently as recent co-workers do not always agree that high-level design is good and the language [and apparently sound arguments] do nothing to convince them of that).
  
  But, my original point still stands if you substitute 'C' for 'C++'. Heck, I could've even mentioned assembly if we really want to talk perf. Everyone knows that hand-tuned assembly beats everything else, no? But, the point of MapReduce is to provide a high-level abstraction for massive parallelization. And, in fact, it is something that you'd get for free if using a functional language like Haskell or the built-in map and reduce of Python (and, there's quite a bit of Python at Google if I am not mistaken).
  
  In short: the language of the implementation says nothing of the validity of the abstraction. Yes, C++ is the fastest language, but there are times when even it is not fast enough and assembly must be hand-tuned.
14. Re:Again Bjarne got it right by apathy maybe · 2008-08-27 03:12 · Score: 1
  
  To me, C is basically a subset of C++ (and I am well aware that C came first, and that it is exactly a subset).
  That is, if I can program in C, I can do C++ as well, and if I can do C++, I can use many of the techniques when programming C.
  Of course, I can't program either C or C++ (Java and PHP are the closest I've got).
  So, your original comment the "Python interpreter and most JVMs are written in C++" is correct, if you understand C as being a subset of C++. But actually, you are wrong when it comes down the nitty gritty. And you should have said C originally, if you knew that was what Python was actually written in.
  
  --
  I wank in the shower.
15. Re:Again Bjarne got it right by Rakishi · 2008-08-27 09:35 · Score: 1
  
  Well, technically the most popular (and, I believe, fastest) implementation of Python is written in C, but Python itself doesn't need to be written in C. There is a Java implementation, a Python implementation, a .NET implementation and probably a few others.
16. Re:Again Bjarne got it right by Anonymous Coward · 2008-08-27 09:45 · Score: 0
  
  Uhhhm, you're contradicting yourself. When your data set is 20 GB you really don't care if the program has an overhead of 20 MB or 200 MB. In other words, loading all the extra VM and library stuff is inconsequential. Now if you had said memory efficiency (or garbage collection overhead) that would be a different point, but you didn't.
  CPU usage is probably not the bottleneck in many of these cases since data reading alone is a huge bottleneck. Large amount of data processing is NOT equivalent to large amounts of computation. Protein folding may have a couple kilobytes of data but still require a super computer (and still fail). Corporate data aggregation may have a couple terabytes of data but require little more than a Pentium 1 if the statistics being computed are simple enough.
  Also in terms of execution time Java is close to C/C++ in most cases or at least close enough that it probably doesn't matter much.
17. Re:Again Bjarne got it right by johanatan · 2008-08-27 14:30 · Score: 1
  
  It was a slip. I am and was aware that Python is written in C (though I fail to see why really). C++ can do everything C can and better. And, I disagree with the statement about C programmers being able to program C++. That is just not true. C++ is a multi-paradigm language and C is essentially only a single paradigm--namely, procedural. It is exactly C++'s support for the [obsolete] procedural/structured methodology that would [mis]-lead a C programmer into thinking that they know C++.
18. Re:Again Bjarne got it right by johanatan · 2008-08-27 14:38 · Score: 1
  
  And, one other minor point-- C is not exactly a subset of C++. Ever since C99 brought about new features to C (the specific details of which I do not recall) which C++ does not support (and possibly even before then), they have diverged. It is true though that C is essentially a subset of C++.
19. Re:Again Bjarne got it right by apathy maybe · 2008-08-28 03:55 · Score: 1
  
  I meant to say "not exactly", damn brain running ahead of myself again...
  It makes more sense if you automatically insert the "not" that I inadvertently missed.
  (I seem to be doing it quite often as well, forgetting my negatives...)
  
  --
  I wank in the shower.
Simple alternative to Map/Reduce by Anonymous Coward · 2008-08-26 09:53 · Score: 0

The Map/Confuse algorithm.
Got what right? by argent · 2008-08-26 09:59 · Score: 3, Interesting

I don't think you can credit Bjarne with "compiled code is faster than interpreted code" (or the 21st century version: "compilers can perform better optimizations than JIT translators").
C++ happens to be the most popular fully compiled language, having edged Fortran out of that position some time near the end of the last century.
Back in the early '80s, when he was coming up with C++, the big Fortran savants were saying stuff like "Fortran is bigger than ever. There are more than X million Fortran programmers. Everywhere I look there has been an uprising... a lot of teaching was going to Pascal, but more are teaching Fortran again. There has been a backlash."
----
And that's not the only thing C++ has in common with Fortran, either.
1. Re:Got what right? by johanatan · 2008-08-26 11:08 · Score: 3, Interesting
  
  (or the 21st century version: "compilers can perform better optimizations than JIT translators").
  Actually, JITters can do some optimizations that compilers can't--by splitting the compilation into a frontend and a backend. The front end is essentially just a parser, and the later the back-end compile happens, the more opportunities for optimizations actually open up (including such things as utilizing specific instruction sets for given architectures and fine tuning the compile based on run time statistics).
  
  See the LLVM for more info: http://llvm.org/
  
  (or .NET for that matter--but we're anti-MS around here. :-)
2. Re:Got what right? by argent · 2008-08-27 00:53 · Score: 1
  
  including such things as utilizing specific instruction sets for given architectures and fine tuning the compile based on run time statistics
  1. That's a nice theory but in practice JIT implementations of interpreters are not actually anywhere near as fast as compilers for real world workloads.
  2. When performance is critical (or even if you only THINK it's critical, see "Gentoo Linux"), compilers can use the same techniques, and still take advantage of the better regional and global optimizations they can do... see Intel's compiler for the IA64 architectures for an extreme example.
  3. Improvements in local optimization are nice, but unless you're running on something like Itanium regional optimizations trump local ones. And if you are, regional optimizations STILL trump local ones.
  4. Finally, when you're REALLY up against a wall, there's JIT recompilation.
wrong argument? by fragbait · 2008-08-26 13:09 · Score: 2, Insightful

Though this post is my introduction to both MapReduce and the argument, it strikes me that the people arguing are arguing the wrong problem.
While MapReduce might be used against some structured data, it looks to be something for unstructured data and dynamically inventing structures in unstructured data. Additionally, you might want to keep that new structure around for a while. You might want to load it up with terabytes of data. At the same time, this data is less and less useful over time.
Think about two of the key pieces of data Google has, web pages and user interaction and preference data. Web pages change over time. Web sites come and go. Some change a lot (news sites) and some change very little.
There is a LOT of user interaction data. Clicks on pages, javascript that fires to doubleclick, etc. With preferences, that changes over time, too. Also, marketers want to dynamically react to the clicks and even the minute change of a preference that generates a buck.
With such a large, changing, and time sensitive dataset, how could it be structured into something as relatively static as a schema? You would box yourself in by making it a schema and defining all the possible relationships.
So, you take it up one abstraction level and make a "schema" for making relationships. Furthermore, there is a narrow window within which you even care about data and how it is structured. Granted, you want the webpage/site data to stick around for queries. But even that is marginally useful. Think about how many pages deep you go into a query's results on Google. I'm sure that will vary by person, but I'd also bet that in practice it is pretty small.
Maybe everyone else gets that and I'm just late to the party. But my point is that the wrong argument is being made that this should follow all the RDBMS work that has come to date.
Sure, I do agree that they shouldn't completely ignore all of the research, but to suggest it has to have a schema, indices, etc. just comes across as arguing all data problems belong in a traditional database.
Or maybe I can take a different approach to this....my brain doesn't have an index. It does categorize data and it can categorize the same piece of data in multiple ways. As I learn new things, my brain creates new "indices" of sorts. A large portion of the data in my brain is time sensitive, or indexed over time. The older I get, the more the minutiae of life (what I had for dinner this evening) aren't important any more and lose their categorization. I don't have a single schema for my brain; rather, I have multiple, and I invent and dissolve them over time. I don't know what new one I'll need in the future. I can't know that and without that, I can't make a schema for it. I also can't be constantly modifying the same schema in place. It is easier for me to invent a new one as I go and just abandon the old ones. Sure, new schemas will have parts of the old, but it is still a new schema with the old one still in place and referencing the same data that the new one will soon reference.
-fragbait
Re:MySQL has no common sense anyway. . . by slimjim8094 · 2008-08-26 14:24 · Score: 1

As a matter of fact, it was... :/

--
I have developed a truly marvelous proof of this comment, which this signature is too narrow to contain.
Functional Programming by KliX · 2008-08-26 14:27 · Score: 0, Offtopic

How many of you familiar with functional programming just *cringe* when you see how badly basic math is discussed in the programming mainstream?
just add Protocol Buffers by vrmlguy · 2008-08-26 19:03 · Score: 1

Anyone remember this story: http://tech.slashdot.org/tech/08/07/08/201245.shtml? According to Google:

Protocol buffers are now Google's lingua franca for data -- at time of writing, there are 48,162 different message types defined in the Google code tree across 12,183 .proto files. They're used both in RPC systems and for persistent storage of data in a variety of storage systems.

(See http://code.google.com/apis/protocolbuffers/docs/overview.html.)
If you think about it, Protocol Buffers are just about perfect for MapReduce applications. First, Protocol Buffers data streams are "flat" structures, very similar to database tables. If you need hierarchical data, I think that you'd tend to use multiple tables that incorporate foreign keys, rather than embedding the hierarchy every time it's referenced (as XML does).
Second, and again unlike XML, the data serialization is described via a .proto file, which can itself be serialized in exactly the same way as the data stream. It looks fairly easy to write a "Map" or a "Reduce" program that works with any Protocol Buffers data stream.
I suspect that this, rather than SQL compatibility, is the road to success with MapReduce processes.

--
Nothing for 6-digit uids?
1. Re:just add Protocol Buffers by Prof.Phreak · 2008-08-27 00:55 · Score: 1
  
  I suspect that this, rather than SQL compatibility, is the road to success with MapReduce processes.
  Why not both? :-)
  A lot of distributed databases already implicitly support functionality that's equivalent to MapReduce, especially Greenplum and Netezza.
  i.e., the map operation is just:
  create table output as
  select [cols] from [table] where [condition] distribute on (key1,key2,key3);
  This will scan the table stored on all nodes and deposit the data across all the nodes in Netezza, distributed on key1,key2,key3, i.e., an implicit ``map''.
  One can then apply aggregate functions to do a ``reduce'' (possibly group by key1,key2,key3?)
  The upshot? It's a lot more flexible in SQL than pretty much any weird language structure I've seen.
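  For readers without a distributed database handy, here is a rough Python sketch of what the two steps amount to. Everything in it is an assumption for illustration: `distribute` stands in for "distribute on (key...)", `group_sum` for a SUM ... GROUP BY run on each node, and the rows and the `amount` column are made up.

```python
from collections import defaultdict


def distribute(rows, key_cols, num_nodes):
    """'Map' step: route each row to a node chosen by hashing its key columns."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        nodes[hash(key) % num_nodes].append(row)
    return nodes


def group_sum(node_rows, key_cols, value_col):
    """'Reduce' step on one node: SUM(value_col) GROUP BY key_cols."""
    totals = defaultdict(int)
    for row in node_rows:
        totals[tuple(row[c] for c in key_cols)] += row[value_col]
    return dict(totals)


rows = [
    {"key1": "a", "amount": 1},
    {"key1": "b", "amount": 2},
    {"key1": "a", "amount": 3},
]

nodes = distribute(rows, ["key1"], num_nodes=2)

# All rows for a given key land on the same node, so per-node results never
# overlap and can simply be merged.
result = {}
for node in nodes:
    result.update(group_sum(node, ["key1"], "amount"))
```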
  
  --
  "If anything can go wrong, it will." - Murphy
1970's style hype meets 2000's style hype by speedtux · 2008-08-26 20:00 · Score: 1

Stonebraker isn't exactly the one to complain about this: just as MapReduce is being overhyped these days, relational databases were being overhyped in the 1970's, and he rode that wave all the way to fame and fortune. 30 years later, although every database system in the world calls itself "relational", very few database applications actually are relational.
MapReduce is indeed a simple, decades-old parallel programming technique. It's not the be-all-and-end-all of parallel programming, but it's good for solving a lot of real-world problems with minimum fuss and hassle.
Between the relational database hype of yore and today's MapReduce hype, give me the MapReduce hype any day. Relational database hype was all about pseudo-mathematical formality and ad hoc formalisms. MapReduce is at least about simple, working, real-world programming techniques. The sooner we get rid of Stonebraker's approach to computer science, the better off we will all be.
Wow... So few know about MapReduce? by ZerdZerd · 2008-08-27 06:22 · Score: 1

I'm astounded that so few people here know about MapReduce. There are lots of good videos about it made by Google.
There's a five-part lecture about it starting here (use this link to view the rest)
Or simply search for "google mapreduce". I suggest watching one of the videos though :)

--
I'm not insane! My mother had me tested.