To Grid Or Not To Grid?
dbgimp writes "In my job at a (large) investment bank I am constantly being pushed to use grid technology.
I have many problems with this (not least that our data center is at best 100 Mb/s and our software is actually more data than computation heavy). A typical batch job takes 10-30 minutes consisting of around 10,000 trades. I would far rather spend the time and money on multi-core machines and optimizing the software than on the latest fad technology.
I am interested to hear from other people in a similar position and, in particular, why or why not they chose grid software over improving the existing code to leverage better processor technology, and which grid software they chose to use and why. Or, conversely, why they chose not to use grid software."
Make sure you know the difference between grid technology and clustering. Basically, grid is much more complicated but more flexible; the name means you can connect something to a grid to get computing power, just like you can connect to the power grid to get electricity. It looks like you're thinking of clustering instead, which is easier to deploy and in many ways closer to a multiprocessor machine.
Escher was the first MC and Giger invented the HR department.
I can't help but feel that people are missing the point of grid computing. Grid is not HPC. It's not super computers. You can build grids using HPC, but they don't have to go hand in hand. As such, all this talk about parallel whatnot is actually missing the point. I assume there exists code. I assume the code is serial, since most is. I also assume that there are many of these jobs, rather than one job that currently takes a day and a half. Typically there's no need to start getting exotic with MPI/OpenMP or whatever. Simply submit more serial jobs to do what needs to be done. Look at it from a batch scheduling point of view, and see what can be done. If you want to parallelise it as well feel free.
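To make the "just submit more serial jobs" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: price_chunk is a hypothetical stand-in for the existing serial code, and the executor stands in for whatever batch scheduler actually hands out the work.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def price_chunk(trades):
    """Hypothetical stand-in for the existing serial job, run unchanged."""
    return sum(t * 2 for t in trades)


def run_batch(trades, chunk_size, executor):
    """Submit one independent serial job per chunk; results keep chunk order."""
    return list(executor.map(price_chunk, chunk(trades, chunk_size)))


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(run_batch(list(range(10)), 3, pool))
```

The serial code is not touched; only the submission side changes. For CPU-bound work you would swap the thread pool for a process pool or a real scheduler queue.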
Grid within a company typically just means decent remote access to a shared cluster. A web service that submits jobs to Sun Grid Engine (which has nothing to do with 'grid', btw) would probably meet all the buzzword bingo requirements of a grid project without being anything of the sort. For sadists, look into OMII and GT4, but don't feel compelled...
jh
My management is similarly obsessed with virtualization: they want to lower admin costs, lower lab costs, etc through this simple technology.
So, rather than move everything over to lpars I took a simple step - purchased a large virtualization-oriented server highly touted as perfect for this, and moved over a single app, with the goal of putting two apps on this server. Along the way I learned:
- io virtualization sucks for io-heavy applications
- the tools to determine how much of the cpu your app is getting at a given moment stink
- memory virtualization in which you resize application memory is primitive and almost useless
- there were no guidelines for optimization of the server - just recommendations to try it hundreds of different ways and leave it on the best settings
- basic setup of the machine required wading through tons of jargon that even the os engineers didn't seem to know well
- out of the box - a single app on the new virtualization server performed more slowly than it did on a free seven year-old server
- some of the most heavily-advertised virtualization features of the product just don't work
- virtualization of multiple busy apps onto the same server is mostly a waste of money
- virtualization of multiple mostly idle apps (failover servers, test servers, demo servers, etc) should work very well
- we spent at least $25k on labor just to create something that was supposed to be a slam dunk
- I'm glad that we started with a small prototype - and didn't waste a ton of cash moving everything over immediately the way some management hoped
- I think in the end we'll get multiple apps working on this box just fine. BUT - we will have spent more money on this scenario than by simply purchasing separate systems. We may recoup some savings if we move enough idle systems onto virtual boxes.
As a result of this experience my team now knows more about virtualization than any other people in the division, we now have a production server supporting it, my management is now cool on this technology, and there is no risk of being forced to migrate critical servers over quickly to the virtual world. I'd call that a success.
I think that you're right - that grid is in a hype cycle right now. So - there are quite a few disappointments to be had along the way to its implementation. For example - if your workload is heavily transactional - you're really not going to get much benefit. Oracle, for example, supports grids - but it is really more about failover than performance. If you roll your own or use a more sophisticated product you can safely assume that you'll hit unexpected issues, a gap between vendor marketecture & what you really need, and possibly the pain of having a vendor talking directly to your management.
You might want to consider having management fund a small prototype to prove out the benefits. Then let them see that they can achieve perhaps better availability but worse performance at a very high cost through this approach.
good luck
Sounds like a trade-auditing project I was once on.
If the 10,000 trades are easily broken into small groups, such as by the initial letter of the ticker symbol, and if all the data for the analysis is fetched in the first step, you can in fact spread the processing over 26-odd machines for a run time of roughly (fixed part + (per-ticker-symbol part/26)), instead of (fixed part + per-ticker-symbol part) on one machine.
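A rough sketch of that arithmetic in Python, assuming the per-ticker work divides evenly across machines. It rarely does, since some letters have far more tickers than others, so treat the result as a best case. The function names and the 5 and 20 minute figures are made up for illustration.

```python
from collections import defaultdict


def group_by_initial(tickers):
    """Bucket ticker symbols by their first letter, one bucket per machine."""
    groups = defaultdict(list)
    for ticker in tickers:
        groups[ticker[0].upper()].append(ticker)
    return dict(groups)


def estimated_runtime(fixed, parallel, machines):
    """Best-case run time: the fixed part is not divided, the rest is."""
    return fixed + parallel / machines


def estimated_speedup(fixed, parallel, machines):
    """Single-machine run time divided by the distributed run time."""
    return (fixed + parallel) / estimated_runtime(fixed, parallel, machines)


if __name__ == "__main__":
    # Illustrative numbers only: 5 minutes fixed, 20 minutes of per-ticker work.
    print(estimated_runtime(5, 20, 26))
    print(estimated_speedup(5, 20, 26))
```

With those made-up numbers the 25 minute job drops to a bit under 6 minutes, a speedup of a little over 4 and nowhere near 26, because the fixed part dominates once the rest is spread out.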
I have an article on doing the load-balancing part of this kind of processing, albeit on a large multiprocessor, at http://www.sun.com/blueprints/0605/819-2888.pdf (PDF).
As you've already guessed, sometimes the problem doesn't decompose nicely into parts that can be distributed to machines far from the database.
The rule of thumb is that grid does distributed computation, where you ship small amounts of data to many CPUs. If you have large amounts of data, you need to have previously distributed data stores, and then you ship the processing to reside with the data, instead of the other way around. Alas, some folks call the latter grid, when it should be called something like "data grid" (;-))
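A back-of-the-envelope way to check which side of that rule you are on, sketched in Python. It assumes the data has to cross a single shared link before any remote work starts and that the compute then divides evenly. The 100 Mb/s figure is from the original question; the data sizes and compute times are invented.

```python
def transfer_seconds(data_bytes, link_mbps):
    """Time to push data_bytes over a link rated in megabits per second."""
    return data_bytes * 8 / (link_mbps * 1_000_000)


def remote_is_faster(data_bytes, link_mbps, compute_seconds, machines):
    """Compare running locally with shipping the data out to `machines` nodes."""
    remote = transfer_seconds(data_bytes, link_mbps) + compute_seconds / machines
    return remote < compute_seconds


if __name__ == "__main__":
    # 1 GB over 100 Mb/s takes 80 seconds before any remote CPU does anything.
    print(transfer_seconds(1e9, 100))
    print(remote_is_faster(1e9, 100, 600, 10))
    print(remote_is_faster(1e10, 100, 600, 10))
```

If the transfer term alone is close to the whole local run time, adding CPUs cannot help, and the data needs to live where the processing is.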
--dave
davecb@spamcop.net
Something you might want to experiment with is Quantian. It is a bootable Linux distro (Knoppix descendant) with clustering (openMosix) and a huge variety of cluster-capable scientific & financial open source tools built in. It is a very quick & easy way to set up a cluster to experiment and see how your application could be altered to work well in a massively parallel environment. I've never seen a quicker or easier way of building a cluster. With Quantian and a pile of networked PCs, you can literally have an openMosix cluster in minutes.
Wow, that stuff about scapegoating is a pretty jaded take on things. Can't say I haven't been there, but still...
If we're talking about an application that can truly benefit from clustering, and is built so that node failure can be detected and worked around relatively gracefully, this isn't much of a consideration. If you have 10 machines, and 1 goes down, you lose 10% throughput. If you look at it in terms of cores, 20 single-core machines are equivalent to 10 dual-core machines, so the cores lost per machine failure essentially double, true. But! In any sufficiently large system you should account for the fact that n machines will be down at any given point in time. So, make sure you have 2n spare cores (n systems) instead of n spare cores, and you're fine. Even if you estimate n at something as high as 25%, the economics of things will still force you into dual-core servers, since all the new CPUs are dual core, and it's getting hard to find single-core server-grade hardware. In short, the economics clearly balance out in favor of dual-core CPUs.
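For anyone who wants to plug in their own numbers, here is a small Python sketch of that spare-capacity arithmetic. It assumes a fixed fraction of machines is down at any moment, rounded up, and that every machine has the same core count. The function names and the 20-core target are invented for illustration.

```python
import math


def lost_fraction(machines_down, machines_total):
    """Share of throughput lost when identical machines go down."""
    return machines_down / machines_total


def machines_needed(required_cores, cores_per_machine, down_fraction):
    """Smallest fleet that still delivers required_cores with some machines down."""
    if not 0 <= down_fraction < 1:
        raise ValueError("down_fraction must be in [0, 1)")
    machines = math.ceil(required_cores / cores_per_machine)
    while True:
        down = math.ceil(machines * down_fraction)
        if (machines - down) * cores_per_machine >= required_cores:
            return machines
        machines += 1


if __name__ == "__main__":
    print(lost_fraction(1, 10))
    print(machines_needed(20, 2, 0.25))
    print(machines_needed(20, 1, 0.25))
```

With 25% of boxes assumed down, 20 usable cores takes 14 dual-core machines or 27 single-core ones, so the dual-core fleet stays about half the size even after padding for failures.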