Collaborative Map-Reduce In the Browser
igrigorik writes "The generality and simplicity of Google's Map-Reduce is what makes it such a powerful tool. However, what if instead of using proprietary protocols we could crowd-source the CPU power of millions of users online every day? Javascript is the most widely deployed language — every browser can run it — and we could use it to push the job to the client. Then, all we would need is a browser and an HTTP server to power our self-assembling supercomputer (proof of concept + code). Imagine if all it took to join a compute job was to open a URL."
Two comments:
1. He places the map/emit/reduce functions in the page itself. This is unnecessary. Since Javascript can easily be passed around in text form, the packet that initializes the job can pass a map/emit/reduce function to run. e.g.:
In fact, the entire architecture would work more smoothly using AJAX with either JSON or XML rather than passing the data around as HTML content. As a bonus, new types of jobs can be injected into the compute cluster at any time.
2. Both Gears and HTML5 have background threads for this sort of thing. Since abusing the primary thread tends to lock the browser, it's much better to make use of one of these facilities whenever possible. Especially since multithreading appears to be well supported by the next batch of browser releases.
(As an aside, I realize this is just a proof of concept. I'm merely adding my 2 cents worth on a realistic implementation. ;-))
Javascript + Nintendo DSi = DSiCade
Imagine how much *spam* you could send using this approach.
No, wait...
Save your wrists today - switch to Dvorak
We already have that. See botnets.
If you were really interested enough to donate your CPU cycles, is it really that much harder to install BOINC, and get a job running?
Plus then you can run native code instead of having to run in [shudder]Javascript[/shudder].
Convert FLACs to a portable format with FlacSquisher
Progress is running less JavaScript, not more.
You could also use this to index the MP3 files on everybody's hard drives, then share the music just by visiting a URL!! ... oh wait...
Javascript really isn't suited for this kind of thing, even with worker threads, for two reasons I can think of. First, web clients are transient... they'd have to report back often in case the user clicks away.
But more importantly, Javascript just isn't a good language for massive computation. It only supports one kind of number (double), has no vectorization or multicore capabilities, has no unboxed arrays, and even for basically scalar code is some 40x slower than C, let alone optimized ASM compute kernels. (This is for crypto on Google Chrome. Other browsers are considerably slower on this benchmark. YMMV.)
I hereby place the above post in the public domain.
So, where do I sign up for grid time?
Time to work on a Javascript MD5 bruteforce implementation. SHA, too.
for those like myself that had no idea what MapReduce was:
http://en.wikipedia.org/wiki/MapReduce
KDawson! No, its not the typical KD FUD, but still - we all could continue to thrive without seeing some "proof of concept" bullshit as useless as this. Remind me again how he is still an editor? I mean, this is unbelievable! I stared at the headline and summary for a second, and then it dawned on me - I knew who posted this before even looking under the title, and that is just pitiful.
The virus/worm writers, that is...
[Personal Opinion]
joining a supercomputer by just clicking a URL.. isn't that the definition of a botnet?
Restrictions are prohibited. Be well, get better.
Oh, please, make the MapReduce fanboyism stop.
Yes, it's a neat technique. It's also very old and obvious. Google's implementation is also good, but this stuff is just not rocket surgery. It's just a simple pattern of how to massively parallelize some types of computational tasks.
But somehow, just because some dudes at Google wrote a paper about it, it's become the second coming of Alan Turing or something among some silly folks. Hell, a couple of weeks ago somebody was saying on the comments here that MapReduce was a good alternative to relational databases. Now that is silly.
Are you adequate?
If clicking a link is all it would take to donate my CPU cycles I'm getting off the internet now. I'm sure we've all been conned into clicking on tubgirl, goatse, lemon party, etc. what's to stop somebody from conning you into clicking a mass computing (see botnet above) cloud masked by a fun but pointless flash game people could enjoy for hours on end?
One-click CPU cycle donation is a bad idea. Make it deliberate, make it a program that has 30 checkboxes making sure the user understands they're donating cycles and on their terms.
"Javascript...â" every browser can run it..."
There is a huge difference between being able to run javascript apps and run javascript apps well - not to forget that a lot of the javascript I see out there really only works on PC's with IE or Firefox, Opera and Safari, especially on OS X seem to have trouble with some sites that aren't coded for compatibility, but instead pushed out quickly with little regard for anything other than IE on Windows.
Ave Molech Setting
A common mistake in multi-server builds is that bandwidth is free.
Bandwidth Costs Money and Time. Both are reduced by having the network closer to the processing. This is one of the reasons google bought all that "dark fiber" left around after the .com bust.
Another flaw is that computation of data is difficult to provide "good results" in blocks unless they're doing relativity matrices (Think PageRank).
Something to think about:
If I'm sending names to your pc, what can I derive from that list without having the entire list?
If there were a couple-few or more orgs competing to use my extra cycles, outbidding each other with money in my account buying my cycles, I might trust them to control those extra cycles. If they sold time on their distributed supercomputer, they'd have money to pay me.
As a variation, I wouldn't be surprised to see Google distribute its own computing load onto the browsers creating that load.
Though both models raise the question of how to protect that distributed computing from being attacked by someone hostile, poisoning the results to damage the central controller's value from it (or its core business).
--
make install -not war
You cannot say that guy. This is web 2.0. This is progress. If it does not work in your browser, you can still an web 2.0 internet browser application to make it compatible.
Is this why my browser keeps telling me scripts on the slashdot main page are taking too long and do I want to stop them for the last few months?
This seems to me a self-defeating idea. The obvious goal is to get more processing power. Yet using a scripted language is inefficient, and a waste of processing power. If you want more processing power, you need to group computers of the same general instruction set, and which can run compiled (or, dare I say it?) assembled machine code.
This revolutionary technology can be used to create a superior networked colonial defence system.
-- Gaius Baltar
First off, I'm impressed by the geekiness and the creativity of the idea. It's awesome.
But, scientists had better performance 10 years ago. On a single machine. Because they wrote things in FORTRAN.
"Crowdsource" is not a word. It's not hip or cool or whatever, it's an annoying buzzword. Stop using it. kthxbye.
hijack-weasels who need to be shot have pretty much ruined the idea of distributed donated computing resources, thanks.
if this is supposed to be a new economy, how come they still want my old fashioned money?
My CPU time isn't idle. It's keeping my laptop from being too hot to touch and too noisy to work on. And there's no reason to pay more for electricity than I already do.
-B
Ash and Hickory, straight-grained and true, make excellent bludgeons, dandy for the cudgeling of vegetarians.
install powertop and watch what those "spare" CPU cycles cost you in terms of Watts. Every computation costs energy. Even worse is when you waste energy on inefficient computation. Don't promote wasteful computing, use a real language that is actually capable of providing fast array access (not possible in JS due to prototypes and parent array deletions).
and you don't think you could get 100 times more users to visit your web app than you could convince to download and install an exe?
Because you can - or because you should?
Think Google already does that with their Chrome, Desktop, and Updater apps on your machine. Problem I have is how the author dismiss MOST of the issues and purpose off hand with an over-simplified sentence: "Aside from storing and distributing the data the most expensive part of any job is the CPU time..." (my emphasis). The LHC project would've given us a chance to test the capability of one of the world's biggest Grid infrastructure - if it ever run in production. P2P does not a Super-Computer/Search Engine make.
Most popular mature application has a scripting interface. Like it or not, if MSFT cannot kill JavaScript, it is here to stay. But to condamn JS straight away is like saying no language - C/C++/C#/Java/etc. - can and ever should evolve. Might as well belittle Ruby/Perl too since they cannot even render a web page without HTML/CSS. So is proprietary plug-in like Adobe Flash/Flex your preference? Just get rid of browsers altogether and implemnet custom HTTP interface in each app?
I think this approach to MapReduce is a pretty creative angle to take on it. However, there are a number of distributed systems-type problems with doing it this way, that would need to be solved to actually make this realistically possible:
1) The dataset size is currently limited by the web server's disk size.
Possible solution: push the data to S3 or some other large store.
2) There is a single bottleneck/point-of-failure in the web server. In theory 10,000 clients could try to emit their map keys all at once to the web server. IIRC, Google's mapreduce elects nodes in the cluster to act as receivers for map keys during the map/sort phase.
Possible solution: Again, if you were using S3, you could assign them temporary tokens to push their data to S3 -- but that would be a large number of S3 PUT requests (one per key).
3) Fault-tolerance -- what happens when a node in the browser compute cluster fails for any of N reasons? How does the web server re-assign that map task? You'd especially want to ensure that computation finishes on a job in an unknown environment such as 1,000,000 random machines on the internet.
Possible solution: If you haven't heard from a node in N seconds, you could reassign their map task to someone else. This is a similar idea to the MapReduce paper's description of sending multiple machines on a single map task, and racing them to the finish.
4) Security -- there is no way to deterministically know whether the data emit()ed from a user's browser session is real or not. How do you trust the output of 1,000,000 users' Javascript browser executions (I think the answer is, you don't).
http://www.pluraprocessing.com/ has been doing this for a while now using Java.
MapReduce is interesting because at Google and Hadoop it has a distributed filesystem underneath it too. The clever part is how data is distributed, and processing is moved to the data rather than moving data to the processing. I don't really see how this really helps matters, unless you are going to have data involved too - which brings in the privacy concerns yadda yadda yadda.
Sure, some things would work that require huge amounts of processing power on limited data, but why would you use map-reduce for that - why not just use conventional divide and conquer techniques.
Does anyone else find it worrying that one of the projects linked to in TFA is called (out of choice) Skynet? Unlikely, I know, but really, why tempt fate?
is it beta?
Or they could use an Applet or JWS and get several times the performance for only a mild reduction in install base. JWS would even be able to run offline or when the browser window's closed and cache some output to a JVM-managed scratchpad file on disk.
"Anyone who attempts to generate random numbers by deterministic means is living in a state of sin." -- John von Neumann
I think you accidentally the last sentence of your post.
factor 966971: 966971
It's probably related to brain science.
factor 966971: 966971
And why clicking on a tag pops up a login box, NOT THE ARTICLES TAGGED WITH THE TAG?
Change != progress, assmonkeys.
Javascript really isn't suited for this kind of thing, even with worker threads, for two reasons I can think of. First, web clients are transient... they'd have to report back often in case the user clicks away.
I don't see why web clients being transient is a problem. The whole point of the MapReduce algorithm is that each worker (the web clients in this case) don't need to know anything about what the other worker is doing, what the system as a whole is doing, nor what it had done with any past job.
It would need to be 10000x at the very minimum.
If a user downloads, say, folding@home, it's running all day, every day, on all cores of the machine, whenever the computer is on and idle, which is most of the time. The user doesn't have to remember to run it, doesn't have to devote screen real estate, attention and so on, and the program is less annoying because of its low priority and relatively low memory footprint (less boxing).
Additionally, the 40x I cited was in the fastest available browser (Chrome), compared to a relatively slow implementation (OpenSSL), for code that doesn't benefit from vectorization (at least, not on x86-based processors). I expect that the difference between a scientific compute kernel in JS and in assembly would be at least 100x, maybe 200x or more.
Let's suppose that everyone in your rosy world uses FF 3.1 with JIT. That's 3-5x slower than Chrome in my benchmarks; say 4x. Let's suppose that Chrome is 25x slower than unvectorized C, which is 4x slower than optimized assembly. Let's say people run the site 5 hours a day on one core for a week, but have their dual-core computers on for 10 hours a day, 90% idle and would keep folding@home installed for a year.
Then the EXE is 4 * 25 * 4 * 2 * 2 * 50 * 0.9 = 72000x more productive.
Use the right tool for the job.
I hereby place the above post in the public domain.
And I'll perform floating point operations on them with my Pentium and send them back.
(ca. '94 rimshot)
AJAX + self updating js saved in cookies
Where is the "Ignorant" mod tag?
Local Storage APIs would probably work better. The entire data set could even be dumped to local storage to allow recovery from browser failures. In addition, using the SQL engine of the Local Storage database can speed up certain sorting and aggregation tasks, thus (potentially) allowing for a faster response than making Javascript do all the heavy lifting.
Javascript + Nintendo DSi = DSiCade
It's ridiculous. Javascript? What a waste! Just rent part of the Amazon elastic cloud and at least do your preciousss calculations in native code. And pay for it yourself too.
The increase in computing power caused by more users joining because it's so simple will be offset by the massive decrease associated with using Javascript rather than native code.
I don't see why web clients being transient is a problem. The whole point of the MapReduce algorithm is that each worker (the web clients in this case) don't need to know anything about what the other worker is doing, what the system as a whole is doing, nor what it had done with any past job.
Which is why Map-Reduce is only suitable for "easily" distributed problems. Lucky for Google that almost all their computational problems fit into this mold. But in the rest of the world, this just isn't the case. Which is why Map-Reduce is more interesting and trendy than a solid change in how distributed systems are designed.
"Those that start by burning books, will end by burning men."
If the user leaves before a task completes, you don't have anything to reduce.
Nerd rage is the funniest rage.
Without reading the entire article in depth, seems very similar as an idea to the legion distributed computing grid done in silverlight without the bells and whistles:
http://www.codeproject.com/KB/silverlight/GridComputing.aspx
but you can find 41 idiots with webbrowsers in the time it take you to find 1 preapared to install something.
IranAir Flight 655 never forget!
If the user leaves before a task completes, you don't have anything to reduce.
Google's implementation of MapReduce already takes this into account. Haven't you heard of how they just have a bunch of vanilla x86 networked together, and when one of them fails, they just throw it away plug in a new one.
I'm sorry, but isn't this practically identical to the patent application to use javascript to treat browsers as distributed clients to perform a job like a distributed super computer? The patent application is at http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220020198932%22.PGNR.&OS=DN/20020198932&RS=DN/20020198932
The previous comment is purposely vague and generalized, but all of the facts are completely true.
So because a tool which uses X resources is considerably more efficient than a tool which uses Y resources, we should use the first tool and X resources only and ignore the Y resources?
You seem to be confusing Javascript and its implementations versus the DOM and its implementations. The source of the problems you describe are the latter pair, not the former pair.
And it's called a virus.
This could be a possible way to generate revenue from popular websites... instead of selling something of such dubious quality as "advertising impressions", high-volume sites such as /. could support themselves by taxing, say, 10% of a viewer's CPU with an unobtrusive background thread, and selling the aggregated processing power to customers. I'd certainly be happier donating a percentage of my otherwise totally wasted CPU time to a site than having to read crappy ads for products I don't want.
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
Why have you assumed the javascript user ran the site for 5 hours a day for a week, but that the installed .exe user ran it for a year? Even if one accepts your estimate of an installed .exe as being 400x faster than a javascript app, you should at least allow equal running time. And are you sure that modern browsers on multicore machines don't let multiple JS threads run on different cores?
/. window open while they're at work (and possibly have another one open at work).
In which case, I would find it easy to believe that for every one slashdotter who would install a distributed computing node, there would be at least 500 who would leave a
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
Javascript just isn't a good language for massive computation. It only supports one kind of number (double), has no vectorization or multicore capabilities, has no unboxed arrays, and even for basically scalar code is some 40x slower than C, let alone optimized ASM compute kernels. (This is for crypto on Google Chrome. Other browsers are considerably slower on this benchmark. YMMV.)
YMMV is true. I see speed differences of x5-x10 between -O3 C code and V8 - significant, but far from x40.
As for having only doubles, that is true for the language, but not for engines, which can implement an integer type as well. This is a little tricky to do, but certainly possible: Anything that starts out as an integer will remain one over addition, subtraction and multiplication; you need to add checks for overflows and to handle division. In other words, developers have the convenience of only working with doubles - something Python 3.0 sort of agrees with, as division now returns float values by default - but a clever engine can make a lot of code as fast as if it were using integers.
Saying that javascript is great because it works on all browsers, is like saying that anal sex is great since it works on all genders.
You know what really annoys me? Seeing people (and I'm guilty of this too) mixing hot and cold fluids together. Cold milk out of the fridge into hot coffee, using a hot gas flame to warm up food, that sort of thing. I should imagine that in millions of years' time, when the heat death of the universe is well underway, needless and wasteful increase in entropy will be punishable by permanent deletion.
Rampant carbon sequestration destroyed the Dinosaurs' tropical paradise. I'm here to help repair the damage.
Remember folks, be sure to filter ALL user input!
We published a paper that did evolutionary algorithms in the browser some time ago: Browser-based distributed evolutionary computation: performance and scaling behavior. In the same conference, there was another paper: Unwitting distributed genetic programming via asynchronous JavaScript and XML
It's just a BloJJ
Comment removed based on user account deletion
I think you are vastly underestimating the JIT engines of Chrome and FF. While these JIT engines still have a way to go, I would expect the execution speed of Javascript to approach the performance of other modern virtual machines like the JVM.
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
Sure. My point is that if the javascript tasks are long enough, every task will fail.
Nerd rage is the funniest rage.
http://it.slashdot.org/article.pl?sid=03/12/31/2246241&tid=93
MD5CRK used a JavaApplet that used this Chinese Lottery concept. The applet performed 95% as fast as a pure C implementation of MD5. JavaScript is another matter however. And an assebly code that inlieved MMX/SSE with ALU was much faster.
Background threads in browsers will help of course.
Actually, once upon a time, there was a distributed Java applet, alot like BOINC but in a browser. This particular project was about calculating the emission of gamma rays from nuclear waste.
It didn't last long, probably about a year or two, but it did get quite a few results.
(T>t && O(n)--) == sqrt(666)
Interesting post, and remember: a double is 55 integer bits.
Map reduce is not a good candidate for this type of distribution. We are talking about huge amounts of data that would have to be sent to your browser. The data locality is very important to Google's success. If the computation to data set size is linear or better, and/or it a very large data set, transport of the data swamps any gains. If the computation is much higher than the ratio to the data set, as in folding, then you have a win. Google's scalable approach to adding a another, cheap CPU along with another box full of storage scales nicely. Their real goal is adding redundant, replaceable storage on a massive scale, right-sized with the CPU power for computation and communication. If that CPU is not near the data, it won't work. On another front, if you have ever done anything large in browser javascript, you might have noticed that memory management is not that great. Perhaps it is getting better. Lastly, javascript is a great tool and easy to program. If you have a really intensive job, the expense of that job should justify the proper engineering resources to move the problem computationally as close to the hardware (CPU, main memory) as possible. This might suggest C, assembler, etc.
More or less, yes. If both types of resources are abundant, and X resources are more efficient to use, then there's no reason to use Y resources... just focus your effort on getting more X resources. This is why most groups just run their stuff on a cluster, rather than dealing with the development, hosting, marketing and distribution of a folding@home type project.
I hereby place the above post in the public domain.
Why have you assumed the javascript user ran the site for 5 hours a day for a week, but that the installed .exe user ran it for a year?
Because user attention is a scarce resource. If the user has to open your application for it to run, it will not run as often.
And are you sure that modern browsers on multicore machines don't let multiple JS threads run on different cores?
No, although with the exception of Safari 4, no browser with worker threads is even out of beta.
In which case, I would find it easy to believe that for every one slashdotter who would install a distributed computing node, there would be at least 500 who would leave a /. window open while they're at work (and possibly have another one open at work).
Perhaps for a while, but I think the novelty of this would wear off quickly. Like I said, attention is a scarce resource.
I hereby place the above post in the public domain.
When the world starts massive computing using javascript, I think I quit IT. :)
That sounds like a pretty serious environmental disaster. Computers are already a noticeable user of power. If big problems are being solved so inefficiently then that will get much worse.
=~ s,(.*),<sarcasm>$1</sarcasm>,g if any_point_you_wish();
As for having only doubles, that is true for the language, but not for engines, which can implement an integer type as well.
True enough. But the fundamental problem remains: the more information is available statically, the better you can compile it. Javascript has relatively little information available statically, and so is hard to compile.
I hereby place the above post in the public domain.
True enough. But the fundamental problem remains: the more information is available statically, the better you can compile it. Javascript has relatively little information available statically, and so is hard to compile.
Emphasis mine.
JavaScript, as a dynamic language, can't be statically compiled - just as you say. But, at runtime a lot more information is available (in fact, perhaps more than at compile time with a static language). That information is used by modern JavaScript engines, e.g., TraceMonkey utilitizes Trace Trees to optimize such things. So it can detect that a loop iterates over an integer value - even though that isn't obvious from looking at the code - and generates native code for that.
Static information does make compilation optimizations easy, but isn't necessary for performance, if you have a suitably sophisticated runtime environment - which we are now finally starting to have.
You do realize that that sounds insightful, and not sarcastic/funny to some of us? Mod parent up, BTW.
I know tobacco is bad for you, so I smoke weed with crack.