To Grid Or Not To Grid?
dbgimp writes "In my job at a (large) investment bank I am constantly being pushed to use grid technology.
I have many problems with this (not least that our data center is at best 100 Mb/s and our software is actually more data-heavy than computation-heavy). A typical batch job takes 10-30 minutes and consists of around 10,000 trades. I would far rather spend the time and money on multi-core machines and on optimizing the software than on the latest fad technology.
I am interested to hear from other people in a similar position and, in particular, why or why not they chose grid software over improving the existing code to leverage better processor technology, and which grid software they chose to use and why. Or, conversely, why they chose not to use grid software."
Well, I'm not sure what your particular job is, but my current job is developing web services. I use two servers, one clustered and one unclustered. I deploy the same projects to both, and occasionally find myself fixing strange resource-allocation problems on the clustered server. There are only two machines in that cluster, so right now it is mostly a symbol to the customer that our software is scalable.
That's right: it seems to me that upper management likes the idea of a clustered system because if a customer ever asked whether our software would work for 1,000 people, my manager could say, "Sure, just buy more machines for the cluster." And everyone likes that idea. The system might not be able to handle everyone right away, but wait a year or two and CPU cycles will be so cheap that we can just buy 30 low-end machines and cluster them to get the job done. Thanks to the common access scheme that all databases use, this is a real option.
I'd only suggest that maybe your bosses like the idea of being able to just throw more machines at the problem. Look at it from a financial perspective: if you tailored the code for multi-core CPUs (something I'm not even sure how to do), you might have to rebuild, and maybe recode, everything for future generations of machines. I can see why grid computing sounds so enticing to your employer. Look at Google's distributed scheme: hundreds of thousands of cheap machines running a stripped-down form of Red Hat. I don't know whether that counts as 'grid' computing, but I imagine it's along the same lines.
It isn't clear to me whether your bank offers a trading service or processes trades in batches; it seems to be the latter. Now, you mentioned you work at an investment bank, so money probably isn't that big an issue. Just go to your superior and say, "Look, I need the following," and if he balks, ask him how important these 10,000 transactions a day are to him.
So, to me, the more intelligent approach would be this. Buy new network hardware that handles Gigabit Ethernet: the cards, the router, whatever you have. Just upgrade it so that your internal network can really throw data around. Maybe look at re-laying fibre if you don't have it. Then take whatever money is left over and buy a few more machines. Get a low-end server to act as a proxy that dishes out trade requests to a cluster of machines. Write the software independently of the hardware so that you can always just buy more machines and install your client application on them. At some point your choke point is going to be your database, but if you make it that far you've hit a wall, in my opinion, and the only solution is to juice up the box that serves your database (with database-specific hardware).
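To put the 100 Mb/s versus gigabit argument in numbers, here is a back-of-envelope sketch. The 1 GB payload per batch is a hypothetical figure (the original post gives no data volume), and the calculation assumes ideal throughput with no protocol overhead, contention or latency, so real transfers will be slower.

```python
# Back-of-envelope transfer time for moving batch data across a network link.
# Assumptions (hypothetical, not from the post): ideal throughput with no
# protocol overhead, contention or latency, and a made-up payload size.

MEGABIT = 1_000_000        # network link speeds use decimal units
GIGABIT = 1_000_000_000


def transfer_seconds(payload_bytes: int, link_bits_per_sec: float) -> float:
    """Ideal time in seconds to push payload_bytes over a link of the given speed."""
    return payload_bytes * 8 / link_bits_per_sec


if __name__ == "__main__":
    payload = 1_000_000_000  # hypothetical 1 GB of trade/market data per batch
    for name, speed in [("100 Mb/s", 100 * MEGABIT), ("1 Gb/s", GIGABIT)]:
        print(f"{name}: {transfer_seconds(payload, speed):.0f} s")
```

Under those assumptions the same data takes 80 s at 100 Mb/s and 8 s at 1 Gb/s. The tenfold gain in ideal transfer time matters most for a job that is data-heavy rather than computation-heavy, which is how the original question describes this workload.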
My work here is dung.
If the process is more data-intensive than computation-intensive, then throwing more machines at the problem is the most cost-efficient way forward. You have already countered your own argument for multi-core machines. Especially in finance, it is highly unlikely that optimizing the software will produce anything remotely practical in a short time or at low cost. Software optimization can also introduce bugs and lock you into an implementation that cannot easily be updated.
Take search engine technology as an example: Google have hundreds of thousands of machines running advanced software on platforms that are not ultra-optimized, such as Java and Python. The alternative is a couple of hundred big-iron machines running hand-tweaked C/assembly. As a business you should be seeking to reduce operating overhead. By increasing the number of machines, lowering the cost of each machine, and cutting the time spent optimizing software by allowing higher-level languages that are easier to use and maintain, you can actually get better performance, reliability, and flexibility.
Period.
In fact, GRID software is constantly in flux, because there is no grant money to run a GRID, only to develop one, so they keep throwing stuff out and developing new parts to get more grant money.
And yes, I am posting this anonymously because I work for such a place, and mostly like my job.
Our processes tend to be more computation-heavy (than data-heavy) compared with what you describe, but we use lots of clustered computers. Take your 10,000 trades, split them into chunks of 100 trades, have separate machines value each chunk, and reassemble the results. Depending on what your software does, this may or may not make sense, but if you can split your workload into small chunks that can be analyzed independently, you can achieve much better throughput.
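A minimal sketch of that split/value/reassemble pattern in Python. `value_trade` is a hypothetical placeholder for the real pricing code, trades are assumed to be independent of one another, and a local `ProcessPoolExecutor` stands in for separate machines. The hypothetical `map_fn` parameter lets the same chunking logic be handed a different dispatcher (for example, a cluster or grid submission function) without changes.

```python
# Scatter/gather sketch: split a batch of trades into fixed-size chunks,
# value each chunk independently, then reassemble the results in order.
# Assumptions: value_trade() is a hypothetical stand-in for real pricing code,
# trades are independent, and a local process pool stands in for machines.
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Iterable, List


def value_trade(trade_id: int) -> float:
    """Hypothetical pricing function; replace with the real valuation."""
    return trade_id * 1.5


def value_chunk(chunk: List[int]) -> List[float]:
    """Value one chunk of trades; this is the unit of work sent to a worker."""
    return [value_trade(t) for t in chunk]


def split(items: List[int], size: int) -> List[List[int]]:
    """Break items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def value_batch(
    trades: List[int],
    chunk_size: int = 100,
    map_fn: Callable[..., Iterable[List[float]]] = map,
) -> List[float]:
    """Scatter chunks via map_fn, then flatten the per-chunk results in order."""
    chunks = split(trades, chunk_size)
    per_chunk = map_fn(value_chunk, chunks)  # order-preserving for map and pool.map
    return [v for values in per_chunk for v in values]


if __name__ == "__main__":
    trades = list(range(10_000))
    with ProcessPoolExecutor() as pool:
        # Executor.map returns results in submission order, so reassembly is a flatten.
        results = value_batch(trades, chunk_size=100, map_fn=pool.map)
    print(len(split(trades, 100)), len(results))  # prints: 100 10000
```

Because both the built-in `map` and `Executor.map` return results in submission order, reassembly is just a flatten. The sequential default also makes it easy to check the parallel output against a single-process run.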
The newer cluster/grid software can be really shiny, but you don't always need it. Plain old PVM can still work wonders. Also, a lot of the commercial cluster software out there isn't well suited to this kind of high-performance computational clustering.