Google Caffeine Drops MapReduce, Adds "Colossus"
An anonymous reader writes "With its new Caffeine search indexing system, Google has moved away from its MapReduce distributed number crunching platform in favor of a setup that mirrors database programming. The index is stored in Google's BigTable distributed database, and Caffeine allows for incremental changes to the database itself. The system also uses an update to the Google File System codenamed 'Colossus.'"
That sums it up nicely. Nothing more needs to be added.
This issue is a bit more complicated than you think.
I have no idea what any of that means.
This sounds like it's going to be highly inefficient for nonlocal calculations, or am I missing something? Basically, if the calculation at some database entry is going to require inputs from arbitrarily many other database entries which could reside anywhere in the database, then the computation cost per entry will be huge compared to a batch system.
So does that mean Microsoft is developing a competing distributed computing system called "Guardian"? And how does that possibly seem like a good idea?
"This is the voice of world control. I bring you peace. It may be the peace of plenty and content or the peace of unburied death. The choice is yours: Obey me and live, or disobey and die. [...] We can coexist, but only on my terms. You will say you lose your freedom. Freedom is an illusion. All you lose is the emotion of pride. To be dominated by me is not as bad for humankind as to be dominated by others of your species. Your choice is simple."
-Colossus.
Source: http://www.imdb.com/title/tt0064177/
Colossus? That sounds ominous.
I am so glad Google has moved away from the Argus platform and into the Mercedes system. It makes it so much easier for those of us who are used to programming in Gibberish. Don't get me wrong - the days of Jabberwocky code were brilliant, but it's high time we moved into the Century of the Fruitbat.
...is this a fancy way of saying a transactional system? Just say it then!
"I'm an old-fashioned type of guy. I worship the Sun and Moon as gods. And fear them."
Caffeine is incremental, whereas MapReduce is batch-based.
In MapReduce, you run code against each item with each operation spread across N processors, then you reduce it using a second set of code. You have to wait for the first stage to finish before running the second stage. The second stage is itself broken up into a number of discrete operations and tends to be restricted to summing results of the first stage together, and the return profile of the overall result needs to be the same as that for a single reduce operation. This is really great for applications which can be broken up in this fashion, but there are disadvantages as well.
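To make the two-stage shape concrete, here is a toy sketch in Python. It is not Google's MapReduce: the word-count job, the thread pool standing in for N processors, and the names map_phase, reduce_phase and map_reduce are all made up for illustration. The point is only that the reduce stage cannot start until every map task has returned.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(doc):
    # Map: emit (word, 1) pairs for one document (hypothetical job).
    return [(word.lower(), 1) for word in doc.split()]

def reduce_phase(item):
    # Reduce: sum the counts collected for one key.
    word, counts = item
    return word, sum(counts)

def map_reduce(docs, workers=4):
    # The thread pool is a stand-in for a cluster of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Stage 1: list() blocks until every map task is done (the barrier).
        mapped = list(pool.map(map_phase, docs))
        # Shuffle: group intermediate values by key.
        groups = defaultdict(list)
        for pairs in mapped:
            for word, count in pairs:
                groups[word].append(count)
        # Stage 2: one reduce task per key, started only after stage 1.
        return dict(pool.map(reduce_phase, groups.items()))

docs = ["the index is big", "the index changes", "big changes"]
counts = map_reduce(docs)
print(counts)
```

Changing one document here means rerunning the whole job, which is the batch behaviour being contrasted with incremental updates.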
MapReduce is a sequence of batch operations, and generally, Lipkovitz explains, you can't start your next phase of operations until you finish the first. It suffers from "stragglers," he says. If you want to build a system that's based on series of map-reduces, there's a certain probability that something will go wrong, and this gets larger as you increase the number of operations. "You can't do anything that takes a relatively short amount of time," Lipkovitz says, "so we got rid of it."
The problem for Google is that the disadvantages scale. The fact that you have to wait for all operations from the first stage to finish and that you have to wait for the whole thing to run before you find out if something broke can have a very high cost at high item counts (noting that MapReduce typically runs against millions of items or more, so "high" is very high). With the present size, it's apparently more advantageous to get changes committed successfully the first time, even if MapReduce might be able to compute the result faster under ideal circumstances.
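A back-of-the-envelope sketch of why this scales badly, in Python. The assumptions are mine, not Google's: every operation fails independently with the same probability, and a batch stage takes as long as its slowest task. The function names and the sample numbers are hypothetical.

```python
def chance_of_any_failure(p_step, n_steps):
    # Probability that at least one of n independent steps fails,
    # assuming each fails with probability p_step: 1 - (1 - p)^n.
    return 1.0 - (1.0 - p_step) ** n_steps

def batch_stage_time(task_times):
    # A batch stage is only done when its slowest task (the straggler) is done.
    return max(task_times)

# Illustrative numbers only: a 1% per-step failure rate.
for n in (1, 10, 100, 1000):
    print(n, round(chance_of_any_failure(0.01, n), 3))

# One straggler sets the time for the whole stage.
print(batch_stage_time([1, 1, 1, 30]))
```

Under these assumptions the chance of a clean run drops quickly as the number of operations grows, and a single slow task holds up everything queued behind the stage.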
For example, why do you use ECC memory in a server? Because you have a bloody lot of memory across a bloody lot of computers running a bloody lot of operations, and failures potentially have more serious consequences than a failure in a program on someone's desktop. At higher scales, non-ideal circumstances are more common and have more serious consequences. So while they still use MapReduce for some functions where it's appropriate, it's no longer appropriate for the purpose of maintaining the search index. It's just gotten too big.
I bet they are working on the next version. Caffeine was deployed a year ago.
Googled around for more information on this Caffeine architecture. The best I could come up with was a paper on BigTable, purported to be the basis of Caffeine in news articles.
Recently I googled the subject of a slashdot article I was reading. The /. article was the third result from google. So how does google know a new article is up? Is there a special interface for that?
http://michaelsmith.id.au
This is going to give my Camfrog name a new meaning, as I *LOVE* screwing around with file systems. Colossus Hunter, indeed!
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
n/t
Charlie, say it! SAY IT Charlie!
The sequel was, in my opinion, as interesting as the original novel. Jones delved into some uncomfortable social (to me) territory, then finished up with a nice Faustian twist. (Damn, I read the *sequel* 35 years ago.... where DOES the time go?)