To Grid Or Not To Grid?

Clustered Benefit by eldavojohn · 2006-10-03 23:27 · Score: 5, Interesting

Well, I'm not sure about what your particular job is but my current job is developing webservices. There are two servers that I use, a clustered and an unclustered. I deploy the same projects to them--and occasionally find myself rectifying strange resource allocation problems on the clustered server. There's only two machines on that cluster so it's more symbolism right now to the consumer that our software is scalable.

That's right, it seems to me that upper management likes the idea of having a clustered system because if a customer ever asked if our software would work for 1,000 people, my manager would say, "Sure, just buy more machines for the cluster." And everyone likes that idea. The idea that well, the system might not be able to handle everyone right away but wait a year or two and CPU cycles will be so cheap we can just buy 30 low end machines and cluster them to get the job done. Thanks to the common scheme of access that all databases use, this is an actual option.

I offer only the suggestion that maybe your bosses like the idea of just being able to throw more machines at it. Look at it from a financial perspective, if you tailored the code for multi-core CPUs--something I'm not even sure how to do--you would have to rebuild and maybe recode everything for future generations of machines. I can see why grid computing might sound so enticing to your employer. Look at Google's distributed scheme, hundreds of thousands of cheap machines running a stripped down form of Red Hat--I don't know if that's 'grid' computing but I imagine it's along the same lines.

It isn't clear to me whether your bank offers a service for trading or you do them in batches. It seems that the latter is true. Now, you mentioned you work at an investment bank so money probably isn't that big of an issue. Just go to your superior and say, "Look, I need the following." and if he balks at you just ask him how important these 10,000 transactions a day are to him.

So, to me, it would seem more intelligent to use the following idea. Buy new network hardware that handles gigabit ethernet. The cards, the router, whatever you have, just up it so that your internal network can really throw data around. Maybe look at relaying fibre if you don't have it. Then take what money is left over and buy a few more machines. Get a low-end server to act as a proxy that dishes out the requests for a trade to a cluster of machines. Write the software independent of the hardware so that you can always just buy more machines and install your client application on them. At some point, your choke point is going to be your database, but if you make it that far you've kind of hit a wall, in my opinion, and the only solution for that is to juice up the box (with database-specific hardware) that's serving your database.

--
My work here is dung.

Re:Clustered Benefit by cofaboy · 2006-10-04 01:44 · Score: 1

Cluster/grid it. If you use multi-core CPUs you have fewer machines; if one of those machines goes down, you lose major processing power.
Remember to ask yourself, "How much is MY job worth if those trades DON'T get processed because of something I recommended?"
Plan your response accordingly.
Your boss is already looking for a potential scapegoat; make sure you're not it.

--
In the end, It's all bovine dung you know
Re:Clustered Benefit by dirtyhippie · 2006-10-04 02:46 · Score: 2, Informative

Wow, that stuff about scapegoating is a pretty jaded take on things. Can't say I haven't been there, but still...

If we're talking about an application that can truly benefit from clustering, and is built so that node failure can be detected and worked around relatively gracefully, this isn't much of a consideration. If you have 10 machines and 1 goes down, you lose 10% throughput. If you look at it in terms of cores, 20 single-core machines are equivalent to 10 dual-core machines, so each machine failure takes out twice as many cores, true. But! In any sufficiently large system you should account for the fact that n machines will be down at any given point in time. So, make sure you have 2n spare cores (n systems) instead of n spare cores, and you're fine. Even if you estimate n at something as high as 25%, the economics of things will still force you into dual-core servers, since all the new CPUs are dual-core and it's getting hard to find single-core server-grade hardware. In short, the economics clearly balance out in favor of dual-core CPUs.
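A back-of-the-envelope version of that spare-capacity argument, as a minimal Python sketch. The function name and the 20-core workload are made up for illustration, and it assumes the "n machines down" estimate is expressed as a fraction of machines unavailable at any moment:

```python
import math

def machines_to_buy(cores_needed, cores_per_machine, down_fraction):
    """Machines to provision so the cores still running cover the workload.

    Assumption: `down_fraction` of the machines (and so of the cores)
    is unavailable at any given time.
    """
    # Only (1 - down_fraction) of the provisioned cores are usable.
    cores_to_provision = cores_needed / (1.0 - down_fraction)
    return math.ceil(cores_to_provision / cores_per_machine)

single_core_boxes = machines_to_buy(20, 1, 0.25)
dual_core_boxes = machines_to_buy(20, 2, 0.25)
```

With a 25% down estimate, 20 cores of work come out at 27 single-core boxes or 14 dual-core ones. The spare cores are the same either way; the dual-core option just packs them into half as many chassis.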

Typical "ask slashdot" whining. by cerberusss · 2006-10-03 23:30 · Score: 1

I would far rather spend the time and money on multi-core machines and optimizing the software than on the latest fad technology

I get the impression from the poster that it's not up to him to choose how he spends his time.

Yet, he says to us: my boss asks me, but I don't feel like it. Well, tough shit, brother.

--
8 of 13 people found this answer helpful. Did you?

Re:Typical "ask slashdot" whining. by kjart · 2006-10-03 23:33 · Score: 1

I get the impression from the poster that it's not up to him to choose how he spends his time.

Seems more likely to me that he is looking for a case with which to justify selling a non-grid solution to his boss. He may not be making the final choice, but it seems quite likely that he would have at least some influence over it. Having the right justifications and facts to back yourself up would obviously be of great help.
Re:Typical "ask slashdot" whining. by cerberusss · 2006-10-04 00:02 · Score: 1

Seems more likely to me that he is looking for a case with which to justify selling a non-grid solution to his boss
Exactly. But that's only interesting to him. What's interesting to us is in which cases a grid vs. non-grid solution would be a good choice. However, the poster has such an obvious slant against a grid solution that the discussion doesn't get a fresh start.

--
8 of 13 people found this answer helpful. Did you?

Well compare costs by jellomizer · 2006-10-03 23:37 · Score: 2, Insightful

I would say give them the full price to get it done right, then have management decide. Be sure to give them the full quote, including the cost of a 10 Gb/s or faster network for these systems. Then you need to work out how much of the code can be parallelized, estimate the time it will take you to make the changes, and add in proper debugging time. Next, find alternative solutions that will increase performance. For example, apart from grids you have clustering, which works better for applications that have fast calculations but so many users that things get slow. Grid computing works well when you have large segments of code with minimal communication between systems. If you need a lot of CPU-to-CPU communication then you will need a real supercomputer, where the processors communicate across the bus.

Or you just need to index your tables.

The rule of thumb is to go the safe route unless you are told by higher-ups to do otherwise. Have the higher-ups sign off on the riskier method (to save your butt), then get the method working, focusing on getting the job done right, and stop complaining about how bad a decision it was.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.

Re:Well compare costs by DaveV1.0 · 2006-10-04 00:23 · Score: 2, Insightful

Well, that is half the work. For this instance, he should do a cost benefit analysis.

Just providing cost comparisons boils down to "Your way costs X, my way costs Y." But that may not matter to someone who wants to be buzzword-compliant. When an executive gets it in his head that "this" is better than "that", the best way to handle it is to show that "this" will give a crappy ROI while "that" will give a great ROI.

Unfortunately, sometimes even that does not work and you end up doing it the boss' way, because he is the boss.

--
There is no "-1 offended" or "-1 you don't agree with me" mod options for a reason.
Re:Well compare costs by jellomizer · 2006-10-04 03:27 · Score: 1

Well, of course in the end you will need to do what the manager states. But giving them the facts first will guide them toward the correct decision. If they choose not to follow it and problems occur or it gets too expensive, you have a printed and dated "I Told You So" on your desk or in your email logs. If you feel your boss is making a mistake, it is important to do things that keep the backlash from landing on you. If you were wrong and the boss was right and you did his work well, then you look good and he looks good. Either way it is a win-win situation.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.

Different technology by doctor_nation · 2006-10-03 23:39 · Score: 2, Insightful

Maybe it's just me, but you sound like you're just resisting going to a different technology where you have to learn a new set of skills. Grid/multi-processor computing is definitely not simple, but depending on how many spare CPU cycles you have, you'll get a much faster and larger speedup in your runs than if you tinker with the code to make it run faster single-threaded. Also, won't you need to do some multi-threading on a multi-core machine as well? (I'm not particularly familiar with multi-cores, so I could be wrong).

I recommend learning how to use grid computing and convert the program. Not only will your code run faster at the end of the day, you'll gain a valuable set of skills that will look great on a resume.

Re:Different technology by CastrTroy · 2006-10-03 23:44 · Score: 1

I took a parallel processing course in university, and can very much attest that it's a lot more difficult than programming single-threaded programs. Because of that, it's also more interesting, and therefore a lot more fun. I mean, you could spend your time doing the same tired old thing, but isn't it every geek's dream to experiment with new and different technologies?

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:Different technology by Anonymous Coward · 2006-10-04 00:16 · Score: 0

You might try Condor, which is a big player in the Globus/grid community. Globus is very complex and takes a long time to grasp, but one of the things grid software like Globus needs in order to run well is a job scheduler, and most of the time they use Condor.

Condor by itself can schedule jobs across multiple clusters with much less overhead and a simpler learning curve.

And it's free.
Re:Different technology by duffbeer703 · 2006-10-04 03:00 · Score: 1

You must be a student who hasn't been jaded yet. My guess is that this guy's boss's boss went to a conference and heard some consultant pine on about how amazing grid technology is. Then he read Oracle magazine on the flight home and read about how Oracle uses the grid to save civilization as we know it.

So the boss's boss told the boss "we're falling behind, use Grid technology"! And this poor bastard is stuck sticking square pegs in round holes.

--
Conformity is the jailer of freedom and enemy of growth. -JFK

Answered your self ... by phoebe · 2006-10-03 23:47 · Score: 2, Interesting

... our software is actually more data than computation heavy ... I would far rather spend the time and money on multi-core machines and optimizing the software than on the latest fad technology. ...

If the process is more data than computation intensive then throwing more machines at the problem is the most cost efficient way of going forward. You have already countered your argument for multi-core machines. Especially if this is finance it is highly unlikely that optimizing the software will produce anything remotely practical in a short time period or at low cost. Software optimization also can introduce bugs and lock you down on an implementation that cannot be easily updated.

Take search engine technology as an example: Google have hundreds of thousands of machines running advanced software on non-ultra-optimized platforms such as Java and Python. The alternative is having a couple of hundred big-iron machines running hand-tweaked C / assembly. As a business you should be seeking to reduce the overhead of operations. By increasing the number of machines, lowering the cost of each machine, and reducing the time spent optimizing the software (by allowing higher-level languages that are easier to use and maintain), you can actually get better performance, reliability, and flexibility.

Re:Answered your self ... by christophe.vg · 2006-10-04 00:25 · Score: 1

I agree with this personally ... but let's play devil's advocate.

Dealing with large quantities of data has always been the sales pitch for mainframes. The question could therefore maybe be broadened to "can grids/clusters/multi-core/... really replace the mainframe?"
Re:Answered your self ... by jsailor · 2006-10-04 02:44 · Score: 1

While I don't necessarily disagree with you, it should be pointed out that this mentality has a much larger operational impact: cost for facilities. Power consumption from these massive deployments of PC servers is responsible for an explosion of operating expense budgets, especially for financial firms. Power (and cooling) requirements are exhausting existing capacity and forcing the construction of massively powered data centers at a cost of $300+ million each. Paying the power bill is the other problem. Power costs typically exceed the cost of capital for the build after 10-15 years, depending on your utility rate. These costs only include the physical, mechanical, and electrical components of the data center. Add in the systems, migration costs, MAN/WAN bandwidth and things get even uglier.

In short - you're applying a very inefficient solution that impacts other departments - namely the data center guys.

grid by Anonymous Coward · 2006-10-03 23:56 · Score: 0

If the problem and the underlying software are such that they are easily amenable to parallel computing then, provided the infrastructure is in place, it is much cheaper and easier for a bank to simply throw hardware at it, rather than waste developer time eking performance out of it. Moreover, parallel solutions are scalable, whereas code optimization is eventually going to hit a ceiling. So I can't see where the difficulty arises!

Having said that, if you can easily make a significant (>=10x) improvement in your underlying solution, then of course you should do that as well.

Clusters innit by frinkacheese · 2006-10-04 00:02 · Score: 1

Cus then you get a load of iron on which to run distributed.net ! We went up like 3000 places when we installed our cluster! It rocks!

Correction On Database by eldavojohn · 2006-10-04 00:04 · Score: 1

At some point, your choke point is going to be your database but if you make it that far, you've kind of hit a wall, in my opinion, and the only solution for that is to juice up the box (with database-specific hardware) that's serving your database.

Allow me to correct myself. If you fear this occurring further down the line, you do have another option. Buy multiple database machines and, in your client app, select the connection information for an account based on a lookup table. Then split your database among different machines (I know this is scary because global queries will have to be done across many machines). When the proxy dishes a request to a member of the cluster, they use the lookup table to see where the transactions should be coming from/going to and they will connect to those database machines. This way, you kind of develop your own cluster yourself.
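A minimal sketch of that lookup-table idea in Python. The account ranges, host names, and function names here are all hypothetical; a real table would more likely live in a small shared database or config file than in code:

```python
# Hypothetical lookup table: inclusive account-ID range -> database host.
SHARDS = [
    (0, 49_999, "db1.example.internal"),
    (50_000, 99_999, "db2.example.internal"),
]

def shard_for_account(account_id):
    """Return the database host that holds this account."""
    for low, high, host in SHARDS:
        if low <= account_id <= high:
            return host
    raise KeyError(f"no shard configured for account {account_id}")

def shards_for_transfer(from_account, to_account):
    """Hosts a cluster member must connect to for one transaction."""
    return sorted({shard_for_account(from_account),
                   shard_for_account(to_account)})
```

A transaction between accounts on the same machine touches one database; one that crosses machines touches two, which is where the scary global-query part comes in.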

Of course, there are clustered database options (PDF warning) out there if you want to go that route. I might not be understanding the difference between grid and clustered but ... I think that multi-machine schemes are much more cost efficient than multi-CPU machines.

--
My work here is dung.

What sort... by BJH · 2006-10-04 00:17 · Score: 1

...of processing on the trades are you actually doing?
30 minutes for 10,000 trades seems an awfully long time - I work in the same industry (specifically developing position management systems) and the only thing we do as a batch job is our daily rollover/mark-to-market, which finishes in less time than yours with a hell of a lot more trades than that.

A Cynics reply... by Anonymous Coward · 2006-10-04 00:26 · Score: 5, Interesting

The real reason folks like High Energy Physics experiments and university groups are using/developing GRID software is to get grant money.

Period.

In fact, GRID software is constantly in flux, because there is no grant money to run a GRID, only to develop one, so they keep throwing stuff out and developing new parts -- to get grant money.

And yes, I am posting this anonymously because I work for such a place, and mostly like my job.

Grid vs cluster by TeknoHog · 2006-10-04 00:31 · Score: 4, Informative

Make sure you know the difference between grid technology and clustering. Basically, grid is much more complicated but more flexible; the name means you can connect something to a grid to get computing power, just like you can connect to the power grid to get electricity. It looks like you're thinking of clustering instead, which is easier to deploy and in many ways closer to a multiproc machine.

--
Escher was the first MC and Giger invented the HR department.

Re:Grid vs cluster by Anonymous Coward · 2006-10-04 01:30 · Score: 0

I think even you're confused. Clustering is used for High Availability. (Or you can split the load and have everything go down one pipe when there's a failure if things aren't time critical.)
Re:Grid vs cluster by Fastolfe · 2006-10-04 02:31 · Score: 1

Different kind of clustering. The parent post is referring to a high-performance computing cluster (which is normally what people are talking about when they talk about clusters), which differs from an HA/load balancing cluster by having an OS tailored to the task, and applications and OS working together to split computation across multiple cluster members.

"Clusters" as used by most web communities are normally HA clusters, which are simply a logical group of servers running the same software, with a request being sent to one cluster member. The goal isn't to split computation across systems; it's to distribute traffic and (mostly) survive the loss of a single cluster member.

Grid != Parallel by prefect42 · 2006-10-04 00:40 · Score: 2, Informative

I can't help but feel that people are missing the point of grid computing. Grid is not HPC. It's not super computers. You can build grids using HPC, but they don't have to go hand in hand. As such, all this talk about parallel whatnot is actually missing the point. I assume there exists code. I assume the code is serial, since most is. I also assume that there are many of these jobs, rather than one job that currently takes a day and a half. Typically there's no need to start getting exotic with MPI/OpenMP or whatever. Simply submit more serial jobs to do what needs to be done. Look at it from a batch scheduling point of view, and see what can be done. If you want to parallelise it as well feel free.

Grid within a company typically just means decent remote access to a shared cluster. A web service that submits jobs to Sun Grid Engine (which has nothing to do with 'grid' btw) would probably fill in all the buzzword bingo requirements of a grid project without being anything of the sort. For sadists look into OMII and GT4, but don't feel compelled...

--

jh

Cluster where it makes sense by lesinator · 2006-10-04 01:07 · Score: 4, Interesting

I work for a large bank, doing very much what you describe.

Our processes tend to be more computation (than data) heavy compared to what you describe, but we are using lots of clustered computers. Take your 10,000 trades and split them into chunks of 100 trades and have separate machines value each chunk and reassemble the results. Depending on the nature of what your software does this may or may not make sense. If you can split your workload into small chunks that can be analyzed independently you can achieve much better throughput.
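The split/value/reassemble pattern can be sketched in a few lines of Python. This is only an illustration: the trade format (quantity, price) and the "valuation" are placeholders, and a local thread pool stands in for the separate machines a real cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def value_chunk(trades):
    """Placeholder valuation: quantity * price for each trade in the chunk."""
    return [qty * price for qty, price in trades]

def value_all(trades, chunk_size=100, workers=4):
    """Value every trade, one chunk per task, and reassemble in input order."""
    chunks = chunked(trades, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order the chunks were submitted,
        # so the flattened list lines up with the input trades.
        results = pool.map(value_chunk, chunks)
        return [value for chunk_result in results for value in chunk_result]
```

This only pays off if the chunks really are independent; if valuing one trade needs the result of another, the chunks have to be cut along those dependencies instead.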

The newer cluster/grid software can be really shiny, but you don't always need it. Plain old PVM can still work wonders. Also, a lot of the commercial cluster software out there isn't well suited to this kind of high performance computation clustering.

I like the middleware layers by SpaghettiPattern · 2006-10-04 01:35 · Score: 1

I'm not a typical Grid user. I have tested and reviewed DataSynapse to use inside our organisation.

We want to offload processing cycles from z/OS onto a cheaper platform. As we process highly secure data we do not want any of this to land on insecure Windows boxes so our Grid engines will be typically tightly controlled Solaris or Linux boxes.

What I like best is the re-division of the application. The application submits a request for processing to a broker/manager. The broker/manager dispatches requests to available engines. Engines register to the broker/manager as "ready to process" and are given work to do.
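To make the broker/engine split concrete, here is a toy version in Python. It is not how DataSynapse works internally (I'm not describing its API at all); threads play the engines, a queue plays the broker, and all the names are invented for the sketch.

```python
import queue
import threading

def run_broker(requests, handler, engine_count=3):
    """Dispatch each request to whichever engine is free; return results by request id."""
    work = queue.Queue()
    results = {}
    lock = threading.Lock()

    def engine(name):
        while True:
            item = work.get()      # "ready to process": wait until the broker has work
            if item is None:       # sentinel from the broker: nothing left to do
                break
            request_id, payload = item
            outcome = handler(payload)
            with lock:
                results[request_id] = (name, outcome)

    engines = [threading.Thread(target=engine, args=(f"engine-{i}",))
               for i in range(engine_count)]
    for thread in engines:
        thread.start()
    for request_id, payload in enumerate(requests):
        work.put((request_id, payload))
    for _ in engines:
        work.put(None)             # one sentinel per engine
    for thread in engines:
        thread.join()
    return results
```

Because engines pull work when they are idle, a slow job ties up only one engine and the rest keep draining the queue. Fail-over and prioritisation are exactly the parts this sketch leaves out.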

The beauty of this is the prioritisation and the load management that comes with the package. You can set up your engine systems to process a maximum number of requests, so that the jobs are processed in the best possible time. Each job gets the full CPU, does the processing as fast as possible, and then the engine continues with the next one. Note that we run highly CPU-intensive jobs.

And then there's the redundancy and automatic fail-over of the broker/manager and engine. If any one of these dies, jobs are rerouted/rescheduled. We actually need high availability.

--

I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)

i have a similar problem with virtualization by kpharmer · 2006-10-04 01:38 · Score: 4, Informative

My management is similarly obsessed with virtualization: they want to lower admin costs, lower lab costs, etc through this simple technology.

So, rather than move everything over to lpars I took a simple step - purchased a large virtualization-oriented server highly touted as perfect for this, and moved over a single app, with the goal of putting two apps on this server. Along the way I learned:
- io virtualization sucks for io-heavy applications
- the tools to determine how much of the cpu your app is getting at a given moment stink
- memory virtualization in which you resize application memory is primitive and almost useless
- there were no guidelines for optimization of the server - just recommendations to try it hundreds of different ways and leave it on the best settings
- basic setup of the machine required wading through tons of jargon that even the os engineers didn't seem to know well
- out of the box - a single app on the new virtualization server performed more slowly than it did on a free seven year-old server
- some of the most heavily-advertised virtualization features of the product just don't work
- virtualization of multiple busy apps onto the same server is mostly a waste of money
- virtualization of multiple mostly idle apps (failover servers, test servers, demo servers, etc) should work very well
- we spent at least $25k on labor just to create something that was supposed to be a slam dunk
- I'm glad that we started with a small prototype - and didn't waste a ton of cash moving everything over immediately the way some management hoped
- I think in the end we'll get multiple apps working on this box just fine. BUT - we will have spent more money on this scenario than by simply purchasing separate systems. We may recoup a savings if we move enough idle systems onto virtual boxes.

As a result of this experience my team now knows more about virtualization than any other people in the division, we now have a production server supporting it, my management is now cool on this technology, and there is no risk of being forced to migrate critical servers over quickly to the virtual world. I'd call that a success.

I think that you're right - that grid is in a hype cycle right now. So - there are quite a few disappointments to be had along the way to its implementation. For example - if your workload is heavily transactional - you're really not going to get much benefit. In this example Oracle supports grids - but it is really more about failover than performance. If you roll your own or use a more sophisticated product you can safely assume that you'll hit unexpected issues, a gap between vendor marketecture & what you really need, and possibly the pain of having a vendor talking directly to your management.

You might want to consider having management fund a small prototype to prove out the benefits. Then let them see that they can achieve perhaps better availability but worse performance at a very high cost through this approach.

good luck

It's a trade-off by davecb · 2006-10-04 01:44 · Score: 3, Informative

Sounds like a trade-auditing project I was once on.

If the 10,000 trades are easily broken into small groups, such as by the initial letter of the ticker symbol, and if all the data for the analysis is fetched in the first step, you can in fact spread the processing over 26-odd machines, for a run time of roughly (fixed part + (per-ticker-symbol part/26)).
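To put numbers on the (fixed part + (per-ticker-symbol part/26)) expression above, here is a small Python sketch. The 4-minute/26-minute split is invented for illustration; the shape of the result is just Amdahl's law, where the fixed part caps how much the extra machines can help.

```python
def estimated_runtime(fixed_minutes, parallel_minutes, machines):
    """Run time when only the per-ticker-symbol part is spread over the machines."""
    return fixed_minutes + parallel_minutes / machines

def speedup(fixed_minutes, parallel_minutes, machines):
    """How many times faster the spread-out run is than the single-machine run."""
    serial = fixed_minutes + parallel_minutes
    return serial / estimated_runtime(fixed_minutes, parallel_minutes, machines)

# Hypothetical 30-minute batch: 4 minutes of fixed work, 26 minutes that splits cleanly.
runtime_on_26 = estimated_runtime(4, 26, 26)
speedup_on_26 = speedup(4, 26, 26)
```

With those made-up numbers, 26 machines bring the batch down to 5 minutes, a 6x speedup rather than 26x, because the 4 fixed minutes never shrink.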

I have an article on doing the load-balancing part of this kind of processing, albeit on a large multiprocessor, at http://www.sun.com/blueprints/0605/819-2888.pdf (PDF). As you've already guessed, sometimes the problem doesn't decompose nicely into parts that can be distributed to machines far from the database.

The rule of thumb is that grid does distributed computation, where you ship small amounts of data to many CPUs. If you have large amounts of data, you need to have previously distributed data stores, and then you ship the processing to reside with it, instead of the other way around. Alas, some folks call the latter grid, when it should be called something like "data grid" (;-))

--dave

--
davecb@spamcop.net

Re:It's a trade-off by Anonymous Coward · 2006-10-04 05:12 · Score: 0

broken into small groups, such as by the initial letter of the ticker symbol
That's a pretty primitive hash function you got there. ;)
Re:It's a trade-off by davecb · 2006-10-04 05:31 · Score: 1

Yup: if you could do it at run-time, you'd use a bin-packing algorithm to create N equally-sized buckets of trades (;-))
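For the curious, a rough Python sketch of what that could look like. It is a greedy heuristic (biggest ticker first, always into the currently lightest bucket), not an exact bin-packing solver, and the ticker counts in the usage line are made up:

```python
import heapq

def balance_buckets(trade_counts, bucket_count):
    """Spread tickers over buckets so the trade totals come out roughly equal.

    trade_counts: dict mapping ticker -> number of trades.
    Returns a list of (total_trades, [tickers]) in bucket order.
    """
    # Heap entries are (total, bucket_index, tickers); the unique index
    # breaks ties, so the ticker lists themselves are never compared.
    heap = [(0, index, []) for index in range(bucket_count)]
    heapq.heapify(heap)
    # Place the largest tickers first, each into the lightest bucket so far.
    for ticker, count in sorted(trade_counts.items(),
                                key=lambda kv: (-kv[1], kv[0])):
        total, index, tickers = heapq.heappop(heap)
        tickers.append(ticker)
        heapq.heappush(heap, (total + count, index, tickers))
    return [(total, tickers)
            for total, _, tickers in sorted(heap, key=lambda b: b[1])]

buckets = balance_buckets(
    {"IBM": 500, "MSFT": 300, "SUNW": 200, "ORCL": 200, "AAPL": 100}, 2)
```

Compared with splitting on the first letter, this needs the per-ticker counts up front, which is the "if you could do it at run-time" catch.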
--dave

--
davecb@spamcop.net

Quantian by lesinator · 2006-10-04 01:53 · Score: 3, Informative

Something you might want to experiment with is Quantian. It is a bootable Linux distro (a Knoppix descendant) with clustering (openmosix) and a huge variety of cluster-capable scientific & financial open source tools built in. It is a very quick & easy way to set up a cluster to experiment and see how your application could be altered to work well in a massively parallel environment. I've never seen a quicker or easier way of building a cluster. With Quantian and a pile of networked PCs, you can literally have an openmosix cluster in minutes.

Re:Quantian by M1FCJ · 2006-10-04 04:38 · Score: 1

I used to use OpenMosix on ClusterKnoppix but there hasn't been a new release since 2004, OpenMosix still runs on 2.4 kernel. It would be quite nice if OpenMosix managed to move to more recent kernels - still quite useful.

some warnings from experience by Ubergrendle · 2006-10-04 02:30 · Score: 1

Our bank has experimented, and is running some production systems on grid infrastructure (Sun to be specific).

Some learnings:
1. Software licensing is your biggest enemy. Oracle in particular is evil in this regard, but every vendor fears grid computing since it doesn't conform to their pricing models and gives you more bang for the buck. Investigate the consequences of grid at the earliest opportunity.
2. By linking numerous apps to a pool of servers, you've just complicated your software currency lifecycle tremendously. You don't now have to upgrade one O/S, you have to upgrade 10 or 100 simultaneously. And EVERY APP IN THE POOL has to be ready for the O/S upgrade too.
3. Your infrastructure costs go towards networking, storage, and monitoring. Your operations staff needs to be aware of every app utilising the grid now, not just their pool of servers running single applications.

Find your greyframers in the organisation and tap their knowledge. They've been working with LPARS before you were born most likely, and know how to work/coordinate in a virtualised environment.

Use grid as an extended cluster for like-services. Don't mix different workloads on the grid as it gets confusing to manage. Watch your licensing implications at all times -- be prepared to walk away from vendors that don't want to play nicely.

IMHO I'm not convinced grid computing is much better than standard N-tier architectures. The labour to learn + setup + maintain a well run grid is not always returned in $ value. The sad fact is that developers and operations staff are expensive, hardware is cheap. Grid saves you h/w costs, which isn't our biggest worry. If grid becomes EASIER to manage than stand-alone servers (again, not convinced) it will return on its promise.

--
John Maynard Keynes: "When the facts change, I change my mind. What do you do?"

Re:A Cynics reply: You aren't alone. by Anonymous Coward · 2006-10-04 02:34 · Score: 0

I also work on/with Grid software (and for similar reasons choose to remain anonymous). I'm under the impression that computational scientists, like myself (note the distinction between computational science and computer science) are forced to develop this software because the software developed by the computer scientists (who for some reason were tasked to do it) is largely crap.

The most common Grid toolkit, Globus, is basically the very definition of bloatware. Every incarnation adds new standards that no-one except computer scientists could possibly care about. It used to depend on so many packages that it was impossible to install. They "fixed" this by bundling their own modified versions of those packages, shifting the problem so that the software is impossible to maintain instead (if a security hole is found in a library, you have to update the Globus version separately, and often you have to wait for the fix to be back-ported into the Globus specific version). It is largely impossible (or at least extremely difficult) to install only the bits of Globus you want to use. It now relies on Java, limiting its portability to platforms with a fairly modern and complete VM. Globus is also extremely user-unfriendly. I can use it fairly comfortably, but that's only because I'm comfortable with the command-line. Finally, but most damningly, it doesn't actually solve any of the difficult problems which you need it to do, instead, it "re-solves" problems that have already been solved elsewhere, like executable staging and secure file transfer.

For example, it handles the heterogeneous nature of the Grid by passing the problem onto the user. Want to be able to run on a number of different machines with different architectures at different sites? That's fine, you still need to write as much scripting as you would for different batch systems, just you have to do it in one file in RSL, which has an alarming lack of documentation. Oh, and each site has its own variables anyway. You still have to build binaries at each site or at least for each site if you have access to the appropriate cross-compilers. What is the point of executable staging if you have to have binaries at each site anyway?

Or how about co-scheduling of resources so that you can use resources at multiple sites at the same time? Or security? GSI layered on top of UNIX/Linux is a fantastic kludge, joyously combining the security of having only one password for every account you own with the inconvenience of actually having to have multiple accounts/passwords, since you need them anyway.

With each release, the bloat gets worse. It is extremely telling that the most successful scientific Grids are still using Globus 2.x rather than 4.x because quite frankly, after 2.x they just started adding useless features that no-one wants but are interesting to computer scientists, like web services.

And Globus is not alone. I think that all you need to do to convince someone new to the technology that Grid technology is not ready for production use is show them AccessGrid in action. It's a farce. I can't begin to describe how much Grid software in general has hindered rather than helped my current project. Back in the good old days, you got access to a resource and you got an account name and a password. You logged in and did your work. Now you get them (if you are lucky; some places force you to authenticate with GSI), and have to get certificates set up and all sorts of other hassle, before you can even get as far as running a job.

Finally, having the software funded by grants is, as you suggest, a disaster, not least because when the grant runs out, the software becomes abandon-ware with nobody having the funds to maintain it. In this country, the phrases "e-Science" and "Grid Computing" are regularly tacked onto grant proposals, because having them there makes it easy to get funding for what we were doing anyway before the buzz-word craze invaded our quiet little labs.

In one of those hilarious coincidences, my image authentication thingy was "unproven". Says it all really.

hmm. by pizza_milkshake · 2006-10-04 02:41 · Score: 2, Insightful

I would far rather spend the time and money on multi-core machines and optimizing the software than on the latest fad technology.

think about that for a second.

Re:hmm. by Anonymous Coward · 2006-10-04 05:24 · Score: 0

Yeah, software optimization is a flash in the pan.

Identify bottle necks by Neil+Watson · 2006-10-04 02:51 · Score: 1

It seems to me that you do not know, or have not stated, why these transactions take twenty to thirty minutes. Identify the bottleneck (e.g. network, memory, CPU, I/O) and then you'll know how to improve the performance.
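
As a first step toward that, here is a minimal sketch of per-phase timing. It assumes the batch run can be split into named phases; the `timed` helper, the phase names and the stand-in workload are all hypothetical and not taken from the poster's system. Wall-clock timing only shows where the time goes, not why, so it is a starting point rather than a diagnosis.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase (phase names are illustrative).
phase_seconds = defaultdict(float)

@contextmanager
def timed(phase):
    """Add the wall-clock time spent inside the block to phase_seconds[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] += time.perf_counter() - start

def report():
    """Return (phase, seconds) pairs, slowest phase first."""
    return sorted(phase_seconds.items(), key=lambda kv: kv[1], reverse=True)

# Stand-in workload: replace the bodies with the real fetch/compute steps.
with timed("io"):
    time.sleep(0.05)                     # pretend to wait on disk or network
with timed("cpu"):
    sum(i * i for i in range(10_000))    # pretend to compute
```

Calling `report()` after a run lists the phases in order of time consumed, which is usually enough to decide whether to look at the network, the disks or the code first.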

--
UNIX/Linux Consulting

A few comments from my experience by GogglesPisano · 2006-10-04 02:57 · Score: 1

I've been developing grid-based applications for my company (we work in the financial industry) for several years. We chose a grid solution because the nature of our applications lent itself well to parallel processing. We were able to reduce the processing time of our runs from hours (or even days) to seconds.

Not every application is an ideal candidate for the grid: the problems that scale best are composed of many discrete, independent calculations (think Monte Carlo simulations). Strictly linear processes (step A must finish before step B can proceed) do not fare as well in performance, although they do benefit from the other advantages of grid computing (redundancy and virtualization of resources).
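
To make "many discrete, independent calculations" concrete, here is a small sketch of a Monte Carlo estimate split into chunks. The pi estimate is only a stand-in for a real pricing or risk simulation, and the function names are hypothetical. The point is that each chunk has its own seed and shares no state, so every call to `simulate_chunk` could run on a different node and the results are merged with a plain sum.

```python
import random

def simulate_chunk(seed, samples):
    """Count random points that land inside the unit quarter-circle."""
    rng = random.Random(seed)   # private generator: no state shared between chunks
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(chunks, samples_per_chunk):
    """Combine independent chunk results; each chunk could run on its own node."""
    hits = sum(simulate_chunk(seed, samples_per_chunk) for seed in range(chunks))
    return 4.0 * hits / (chunks * samples_per_chunk)
```

A strictly linear process has no equivalent of `simulate_chunk`: step B needs the output of step A, so there is nothing to hand to a second node.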

Data movement is one of our biggest challenges. In the (likely) event that your application requires database access, you really need to think carefully about concurrency. The database server may have to contend with hundreds (or thousands) of simultaneous connections from the grid nodes - you probably want to use something like Oracle here, rather than MySQL. Some of the grid platform implementations provide data caching schemes that can help.

Careful error handling is also (even more) critical with the grid. When your application is spread out across a few hundred machines, sooner or later a node will fail - maybe the hardware is flaky, or the network hiccupped, or some dimwit shut down the machine in the middle of the calc. It is unacceptable to lose a whole run just because one node crapped out - the application must be able to recover (automatic retry on node failure). It is also essential to have a centralized means of retrieving error logs from the nodes for post-mortems. Again, your choice of grid platform will likely provide help with this.
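
A minimal sketch of the retry-on-node-failure idea, assuming each node can be modelled as a callable that either returns a result or raises. `NodeFailure`, `run_with_retry` and the two example nodes are hypothetical names for illustration; a real grid platform would supply its own mechanism.

```python
class NodeFailure(Exception):
    """Raised by a node that cannot finish its task (hypothetical error type)."""

def run_with_retry(task, nodes, max_attempts=3):
    """Try the task on successive nodes; return (result, failures) or raise.

    nodes is a list of callables, each standing in for "run this task on
    that machine". failures collects (node_index, message) pairs so the
    errors are kept centrally for post-mortems.
    """
    failures = []
    for attempt in range(max_attempts):
        index = attempt % len(nodes)
        try:
            return nodes[index](task), failures
        except NodeFailure as exc:
            failures.append((index, str(exc)))
    raise RuntimeError(f"task {task!r} failed {max_attempts} times: {failures}")

def flaky_node(task):
    # Stand-in for a machine that dies in the middle of the calculation.
    raise NodeFailure("machine shut down mid-calc")

def healthy_node(task):
    # Stand-in for a machine that completes the work.
    return task * 2
```

With `run_with_retry(21, [flaky_node, healthy_node])` the first attempt fails, the task is retried on the second node, and the caller gets both the result and the failure record instead of losing the whole run.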

fix the easy problems first by blackcoot · 2006-10-04 03:22 · Score: 1

low hanging fruit (which, in your case, appears to be abundant) is what you really should be pursuing first. it seems like an upgrade of your networking equipment to gigE or 10gigE would be a good start, since regardless of what forward path you choose (clustering, grid based, etc.) you're going to have to make the investment in fat enough pipes to get the data to where it needs to be.

the next thing to think about is how to educate the powers that be on their options in terms of parallel processing. this means you yourself are going to have to become very smart about the difference between clustering (MPI is a rather nice interface with C and C++ bindings, i prefer lam but others swear by mpich) and grid technologies (the only possible candidate that immediately comes to mind is apple's Xgrid, although HP, IBM, etc. all claim to have various grid services). as you become smart about these solutions, bear in mind that parallel programming is, generally, the hardest programming that you will do. it's hard to debug (doubly so when it's distributed), it's hard to write, it's hard to understand and it's really easy to screw up. take your development time estimates, apply whatever internal multiplier that you use, then add another factor of 2 to 3 for anything that isn't essentially trivially parallel. while you're at it, make sure you include debugging tools (etnus totalview is one of the few debuggers i know of that natively groks mpi). add another factor of at least 1.5 if you want to make sure that your processing is robust and generally fault tolerant (i.e. able to handle losing hardware mid-transaction at inopportune moments, which will happen). you'll also need to factor in training and the hiring of some extremely clever people who can wrap their heads around the parallelization strategies you'll be using (and trust me, these can be counterintuitive in the extreme).
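
a rough sketch of how those multipliers stack up, assuming they are meant to be applied multiplicatively, one after the other. the function name, the 10-week base estimate and the 1.5 internal multiplier in the usage note are made-up illustration values, not figures from this post.

```python
def parallel_estimate(base_weeks, internal_multiplier,
                      parallel_factor=(2.0, 3.0), robustness_factor=1.5):
    """Return (low, high) estimates in weeks after applying each factor in turn."""
    padded = base_weeks * internal_multiplier          # the usual in-house padding
    low = padded * parallel_factor[0] * robustness_factor
    high = padded * parallel_factor[1] * robustness_factor
    return low, high
```

for example, `parallel_estimate(10, 1.5)` returns `(45.0, 67.5)`: a job first estimated at 10 weeks becomes 45 to 67.5 weeks once the parallel and fault-tolerance factors are included, and the 1.5 is described above as a minimum.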

ultimately, you'll arrive at a cost benefit analysis that will show that the features that grid computers buy you (potentially cheaper hardware, potentially easier scalability — both easily achieved if you wrote a clustered version "properly") come at a hefty price tag, one that may well outweigh the value that the grid solution offers.

good luck :)

Seperate cluster from datacenter.. by dk.r*nger · 2006-10-04 03:24 · Score: 1

(not least that our data center is at best 100 Mb/s and our software is actually more data than computation heavy)

First: I assume that you are talking about clusters, not grids (grid=>cluster as road=>car).

Second: The computation nodes *do not* sit on your regular datacenter network. A computation node only ever talks to its master and its peers, so they sit on their own, dedicated, high-speed network (usually no less than 1 gbps).

Third: Some tasks are better for SMP, other for clusters. Find out which yours are. As others have pointed out, the degree of data dependency is the most interesting parameter.

SMPs are usually easier to program for and to set up, but at a certain scale they just stop being very cost-efficient.
Example: A Sun Fire E20K with 36 UltraSparc IV+ 1.8 GHz and 144 GB RAM will set you back $2,500,000.
Alternative? Let's build a 36-node cluster..
1: Spend $100,000 on network infrastructure (there are numerous approaches to this, I'm no expert)
2: Buy 36 of the fattest servers from Penguin Computing, at $30,000 each.
Now, at $1,180,000, you have 36x4 Opteron 885 (dual-core) CPUs and 4 GB RAM per core (32 GB per machine).
Don't tell me that won't kick the Sun Fire's ass at any problem appropriately parallel... But obviously there are painfully dependent problems out there, and that's how Sun sells those beasts.
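
The cluster arithmetic can be checked with a few lines. This is only a restatement of the prices quoted in this comment; the variable names are illustrative.

```python
network_cost = 100_000      # step 1: network infrastructure, USD
server_cost = 30_000        # step 2: price per server, USD
servers = 36
cpus_per_server = 4         # "36x4 Opteron 885"
cores_per_cpu = 2           # the Opteron 885 is described as dual-core

cluster_total = network_cost + servers * server_cost
total_cores = servers * cpus_per_server * cores_per_cpu

smp_total = 2_500_000       # quoted Sun Fire E20K price, USD
savings = smp_total - cluster_total
```

That gives a cluster total of $1,180,000 for 288 cores, which is $1,320,000 less than the quoted SMP price.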

Grid and parallelize by ahmusch · 2006-10-04 03:37 · Score: 1

1. Ensure the database layer is a parallelizable RDBMS that has a concurrent access mechanism and is running on a multicore/multiCPU box.

2. Grid/Parallelize the application layer -- i.e., ensure you can run parallel jobs with discrete data.

3. If that doesn't help, then grid the database layer.

If your application isn't built to scale today -- see the second point -- all the grid in the world won't help you.

I agree with you that it sounds like the code needs some optimization -- 10-30 minutes to process is 5 to 15 trades per second, and that's ridiculously slow. One thing you haven't addressed is the I/O subsystem -- if you're using crappy disk in an I/O-intensive batch system, it can kill you. If you were to move to a grid, you'd have to move to some sort of SAN/NAS in order for your application to function. Of course, you can move to a SAN/NAS without being on a grid or in a cluster.
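
The trades-per-second figure is easy to reproduce. The sketch below assumes a run of 10,000 trades, which is the volume mentioned elsewhere in this thread rather than a number stated in this comment.

```python
def trades_per_second(trades, minutes):
    """Average throughput of a batch run."""
    return trades / (minutes * 60)

slow = trades_per_second(10_000, 30)   # 30-minute run
fast = trades_per_second(10_000, 10)   # 10-minute run
```

Under that assumption the range works out to about 5.6 to 16.7 trades per second, in line with the rough "5 to 15" quoted above.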

Oracle's got lots of performance diagnostic tools to help you determine what's going on between the time the app makes a database call and the app gets control back, if that's what you're running. Okay, only one really good one (extended SQL trace), but it's really good. If you're running Oracle, get a trace or a statspack report of one of your batch runs to try to determine what the performance issue is before you spend any money chasing an architectural solution.

Ok first off, the two are not mutually exclusive by Colin+Smith · 2006-10-04 04:06 · Score: 1

You can have a grid system in place as well as multi core machines.
You can implement a grid in a couple of days' worth of sysadmin time. A few wrapper scripts should be able to simplify and manage job submission.
A grid is probably not what you think it is. Current grid technologies are essentially updated network batch queueing systems. You kick a job off, the grid determines the least loaded and fastest machine to run it on. Distributing parts of jobs over the grid essentially requires rewriting the software to take advantage of pvm or mpi, but that can be done later once there's a stable infrastructure in place.
New (super-duper multi-core) machines added to a grid are typically the most utilised, completely automatically: they tend to be the fastest so run through jobs faster, and are therefore the least loaded and are handed correspondingly more jobs to run. You can increase the utilisation of your hardware from around 10% on average to 80% or 90%.
Most of the grid software I've seen automatically handles failures of applications or hardware, so you can get redundancy essentially for free.
You can get the software for free (there are several of them), a bit of a no brainer.
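
The "least loaded machine gets the next job" behaviour can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not how any particular grid product schedules: job sizes and machine speeds are known in advance, and the `dispatch` function and machine names are hypothetical.

```python
import heapq

def dispatch(job_sizes, machine_speeds):
    """Assign each job to the machine that becomes idle first.

    job_sizes: work units per job, in submission order.
    machine_speeds: dict of machine name -> work units per second.
    Returns a dict of machine name -> list of job indexes it ran.
    """
    # Heap of (time the machine is next free, name); all are idle at t=0.
    free_at = [(0.0, name) for name in sorted(machine_speeds)]
    heapq.heapify(free_at)
    assigned = {name: [] for name in machine_speeds}
    for job, size in enumerate(job_sizes):
        t, name = heapq.heappop(free_at)       # least loaded machine right now
        assigned[name].append(job)
        heapq.heappush(free_at, (t + size / machine_speeds[name], name))
    return assigned
```

Running `dispatch([10] * 12, {"old": 1.0, "new": 4.0})` hands most of the twelve jobs to the faster machine without any special rule favouring it, which is the effect described above for newly added hardware.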

Frankly I don't see much of a downside to gridding your existing machines, it's simple, low cost and really just a change of the way you look at them. Instead of individual systems, they become components in a coherent network.

--
Deleted

Improve code! by Khyber · 2006-10-04 04:13 · Score: 1

in particular, why or why not they chose grid software over improving the existing code to leverage better processor technology

To a point, I have to wonder this as well. I'm really annoyed at not having the ability to use 32-bit drivers in a 64-bit OS when it's still going to end up addressing the same fucking registers. (I'm talking about a webcam, here, BTW) Considering that current 64-bit processors are based off of and share the same registers/opcodes (in the AMD/Intel market, that is, I can't speak about other processors) as a 32-bit processor, why the hell not just make a 64-bit wrapper/emulator for 32-bit drivers? Until the architecture undergoes a MAJOR change, it shouldn't be all that hard, should it?

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.

Re:Improve code! by Toba82 · 2006-10-04 05:43 · Score: 1

What the hell do drivers have to do with this discussion again?

--
I pretend to know more than I really do by mooching off google and wikipedia.

I forgot by Colin+Smith · 2006-10-04 04:30 · Score: 2, Insightful

7. It's not a fad. The technology has been around since the 80s IIRC, possibly earlier. The word "grid" is a fad, but not the technology. They started as network or batch queueing systems. The word "grid" is like the word "middleware". It isn't well defined and means a bunch of different things to different people.
8. Off the top of my head, freebies include Torque, GridEngine, Condor.
9. Yes it would be a Beowulf of those. Mwhahaha!

--
Deleted

Grid means more than you think by Anonymous Coward · 2006-10-04 04:54 · Score: 0

I suggest going to www.gridtoday.com and using the search function. There have been dozens of articles within the past few months talking about grids and data. At this point in its evolution, Grid computing is a lot more than just cycle scavenging.

Gotta ask, is this EQRMS? by Anonymous Coward · 2006-10-04 04:55 · Score: 0

"Large investment bank using grid computing"

Citigroup? EQRMS System?

GRID is an administration system, not HPC by RogerWilco · 2006-10-04 05:06 · Score: 0

You shouldn't confuse GRID with HPC, supercomputers, or clustering.
GRID is complex; its main advantage is the way it can handle users, data and computational resource administration in a very heterogeneous environment. GRID is all about administering groups of users all across the globe, using all kinds of different hardware, to process data that's too big to be stored in one location, and therefore also very distributed. It has all kinds of tools to distribute the management of access to resources, users, etc. And as other posters have pointed out, it's still kind of immature. If you have a Petabyte of data, distributed in 40 Terabyte chunks across the globe, or 57 different user communities, with varying needs, permissions and members, and 57 different people are responsible for those, but all need access to the same data or computational resources, then GRID is a nice idea.

If that describes your kind of environment, then you should consider using GRID technology to administer it.

Otherwise some simpler solution will probably save you a lot of headaches.

--
RogerWilco the Adventurous Janitor

Re:GRID is an administration system, not HPC by oaktowner · 2006-10-04 10:58 · Score: 1

Most grids in place today are not linking thousands of machines across dozens of sites--most are deployed within enterprises. Some are into the thousands of nodes, but far more are in the hundreds or scores of nodes.

Check out IceGrid from zeroc by Catamaran · 2006-10-04 05:39 · Score: 1

I have been using Ice for about a year now and can strongly recommend it as a middleware framework. They now support IceGrid which I have not tried but it appears much more elegant than Globus.

--
Test 1 2 3 4

Mod parent up. by Animats · 2006-10-04 05:40 · Score: 1

This is the only post that sounds like it's coming from someone with a clue.

Re:Mod parent up. by davecb · 2006-10-04 06:11 · Score: 1

Thanks, kind sir (;-))

--
davecb@spamcop.net

Warning signs by georgewilliamherbert · 2006-10-04 06:49 · Score: 1

A large investment bank running a datacenter on "at best 100 megabit" ??? For data-intensive workloads? I don't know if you have a SAN behind that or not, but... you don't need a grid, you need a gig switch and an architect internally who has a clue and can bring management around.

I've been at two investment banks, one midsized and one gargantuan. Gargantuan one has a grid, along with piles of Linux servers, piles of Sun servers, large medium and small databases of both transactional and data warehousing nature and software choice. The grid has its place in things, but that place is not doing any of the actual trading. If someone is suggesting that a grid is the right solution for trading, based on my knowledge of how trading is done at places I've been at, the apps, and the description that was given... run, don't walk, away. You may have a custom parallelizable griddable trading app, but I've never seen one before. Trading apps typically are network, data, and CPU centric in a very internally connected way. Grids are pretty much the absolutely wrong solution for the trading apps I have seen. It might be able to run on one, but it's a complete architectural mismatch.

You keep using this word 'Parallel'... by mkcmkc · 2006-10-04 07:41 · Score: 1

Er, you seem to be saying that if I run a large number of independent ("serial") tasks at the same time that this is not parallelism. I'm trying to think of what parallelism would be, if it is not this. :-)

--
"Not an actor, but he plays one on TV."

Re:You keep using this word 'Parallel'... by prefect42 · 2006-10-04 07:49 · Score: 1

;) What I mean is, take advantage of any parallelism possible due to the nature of the task without recoding the application. An individual job won't be finished any quicker that way, but all jobs will be finished sooner. Parallel programming is different, since this involves multiple threads or processes communicating either through shared memory or with some form of message passing.

--
jh

Re:You keep using this word 'Parallel'... by tolldog · 2006-10-04 13:52 · Score: 1

I call those tasks "embarrassingly parallel".

--
-I just work here... how am I supposed to know?

Data heavy parallelization by solid_liq · 2006-10-04 08:16 · Score: 1

The key here is that the task you need to parallelize is DATA heavy, not computation heavy. In other words, what would be the benefit of multicore processors? Practically none. It makes me curious why you haven't mentioned traditional multiprocessor machines. A NUMA multiprocessor machine would show benefit, because of the separation of memory. Having separate caches will help also. Having a multicore processor with a shared cache will just cause more cores to sit idle while the cache is thrashing. Having many machines in a cluster will work well, because they can each chew on their own chunk of the data. This is an application which should be very easy to parallelize. Just divvy up the task into batches, dedicate one node to management, and "a batch batch here, a batch batch there, here a batch, there a batch, everywhere a batch batch." You shouldn't even need MPI or any other clustering libraries to help you out. Just create a simple communications framework with asynchronous sockets, distribute the tasks, and wait for the results. The results will come back much faster than they will with a multicore architecture. Remember Danielson, the processor waits on data, not the other way around. Anyone who disputes this argument needs to come down from their 10,000 foot view of how things work and perhaps use a tool like VTune to give them insight into how the hardware actually operates. It's all about data, so distribute. Sorry for the rambling approach to an explanation. Too much on my mind...
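
A minimal sketch of the "manager node plus asynchronous sockets" idea, using Python's standard `asyncio` streams. Everything here is an illustrative assumption: the workers are local servers standing in for remote nodes, the wire format is one JSON list per line, and the "work" is just summing the batch. All function names are hypothetical.

```python
import asyncio
import json

async def handle_batch(reader, writer):
    """Worker side: read one JSON batch per line, reply with its sum."""
    while line := await reader.readline():
        batch = json.loads(line)
        writer.write((json.dumps(sum(batch)) + "\n").encode())
        await writer.drain()
    writer.close()

async def send_batch(port, batch):
    """Manager side: ship one batch to the worker on `port`, await the result."""
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write((json.dumps(batch) + "\n").encode())
    await writer.drain()
    result = json.loads(await reader.readline())
    writer.close()
    await writer.wait_closed()
    return result

async def run(data, workers=3, batch_size=4):
    # Start the workers on OS-chosen local ports (stand-ins for remote nodes).
    servers = [await asyncio.start_server(handle_batch, "127.0.0.1", 0)
               for _ in range(workers)]
    ports = [s.sockets[0].getsockname()[1] for s in servers]
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    try:
        # Deal batches out to the workers in turn and wait for every reply.
        results = await asyncio.gather(
            *(send_batch(ports[i % workers], b) for i, b in enumerate(batches)))
    finally:
        for s in servers:
            s.close()
    return sum(results)

def distribute(data):
    """Synchronous entry point: split data, farm it out, combine the results."""
    return asyncio.run(run(data))
```

In a real deployment the workers would be separate processes on separate machines, each holding its own chunk of the data, and the manager would need the kind of failure handling discussed elsewhere in this thread.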

Just some ideas... by Iron+Condor · 2006-10-04 09:49 · Score: 1

I am interested to hear from other people in a similar position and, in particular, why or why not they chose grid software over improving the existing code to leverage better processor technology,

Not sure how "comparable" my situation is to yours (aerospace industry) but in a similar "many machines versus optimizing to exploit smaller, faster, better machines" situation we came down soundly on the side of the former. The reasoning went roughly like this:

We know today's budget. We can use it to upgrade all machines to the fastest, shiniest stuff out there and hire the people to optimize the code for that. Or we can buy vast numbers of mediocre metal and spend the programmer-money on making our tasks take advantage of this grid.

We do NOT know tomorrow's budget. The markets. The directions the businesses will go. If we have sudden extra demand AND the money to satisfy it, a gridded solution will let us simply expand the grid. A Mainframe solution will at best obsolete our current best machines (and at worst there's no computer out there fast enough for us). If we get a bump in demand but NOT adequate funding (the usual case) then a grid is as scalable as our budget will allow, while each new generation of mainframe requires massive new porting investment. If we get a drop in demand, simple/dumb grid machines can easily be turned over to other departments/tasks, while super-high-powered metal requires super-high-powered programming to make it useful.

Manpower costs more than equipment. A hundred boxes running a thin veneer of load-balancing on top of what amounts to Linux can be run and maintained by a high-schooler. And if a couple of the boxes crash and burn because of mis-management it'll affect the grid only minimally. A few high-powered/specialized machines cost the price of a sysadmin who knows how to handle the thing and in case a box goes down, you've lost a significant fraction of your computing ability.

If your software is already geared towards using some large-but-unspecified grid for its operations and you get in a sudden crunch, you can buy grid-(CPU-)power by the hour without having to shell out more for equipment.

Equipment depreciates. If you need "the fastest hardware out there" you're forever going to buy new machines because today's "fastest" is tomorrow's "junk". A grid merely needs "some machines" and if they're a year or two out of date, then you can still count the depreciation against your tax burden...

Really: the only advantage to concentrating computational power is that it consumes less space and less power -- so if you're in a situation where these two are negligible against the rest of your operating costs (which is almost always the case) then there's really no good reason to mess with it.

--
We're all born with nothing.
If you die in debt, you're ahead.

Re:Ok first off, the two are not mutually exclusiv by oaktowner · 2006-10-04 10:49 · Score: 1

Modern grids do a lot more than just kick off batch processes--they can be integrated into the programming model, and distribute objects (think .NET or Java) and interact with those objects. The idea is the same, but the interfaces are a lot more convenient than the old command-line tools.

More of a summary by Vanth+Dreadstar · 2006-10-04 11:08 · Score: 1

...than anything else, of a lot of things said previously, except the first line: Determine what you need, to get what you have to do done. A three-tiered design will likely give you the most flexibility to solve the problem in any number of ways, plus give you the ability to upgrade layers independently if bottlenecks are found. As someone mentioned, you obviously need a big infrastructure update. At the very least, get into gigabit ethernet. Put your database on, at least, an MP machine, such as a Core Duo or Core 2 Duo, or a small cluster of such machines (although for 10k transactions I can't see a need for more than one). This machine is isolated from user machines by the middle layer. Put your business logic on a small cluster of inexpensive machines that have direct access to the database and each other. The database can respond to requests from this layer, and load balancing or simple round-robin transaction allocation between the machines can be used to employ each unit efficiently. Requests could come not only from the daily batch allocation, but also be processed in real-time from any terminal, spanning the spectrum from dumb terminals through PCs to web pages accessed from anywhere, as long as your security issues are properly evaluated and accounted for. I doubt grid computing will give you any benefit over this as it would not give you one central data repository, just a bunch of number crunching machines which you specifically said would not help. Grid computing is a number crunching solution, not a transaction processing solution.
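
For the middle tier, "simple round-robin transaction allocation" can be as small as the sketch below. The `round_robin` function and the machine names are hypothetical; real load balancing would also account for how busy each machine is.

```python
from itertools import cycle

def round_robin(transactions, machines):
    """Deal transactions out to machines in strict rotation."""
    allocation = {m: [] for m in machines}
    for txn, machine in zip(transactions, cycle(machines)):
        allocation[machine].append(txn)
    return allocation
```

Calling `round_robin(range(7), ["a", "b", "c"])` gives `{"a": [0, 3, 6], "b": [1, 4], "c": [2, 5]}`, so each business-logic machine gets an even share regardless of what the transactions contain.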

The key to running a grid successfully is attitude by Colin+Smith · 2006-10-04 12:06 · Score: 1

You stop thinking of computers as individual machines and start thinking of the network as the machine. This has implications for how you manage the systems. If you know what you're doing it actually simplifies almost everything, though this is really a function of the attitude and isn't specific to grids, any network would benefit from the same way of working. A grid fits right into an N tier architecture, it's just one of the layers.

--
Deleted

it all depends by tolldog · 2006-10-04 14:03 · Score: 1

How many of these 30 minute jobs do you have? Do you have 10 or do you have 10,000?
How long are people willing to wait on the 30 minute jobs?

A lot of tasks like this tend to batch easily, and if you can batch it, then you can throw it on a batch queueing system (like LSF, the one I have my experience with).

At the end of the day, it's a lot easier to run multiple jobs on multiple machines than it is to optimize a single job. It all depends on where you want to spend your time and what return you want and expect.

Having 2 machines and an even load could cut the turnaround time in half; writing cleaner, threaded code may not give you that. How much communication do the threads need? If the threads are just there to take advantage of multiple cores, then you probably should just batch up a bunch of serial jobs and run N jobs per host, where N is the number of cores.
Only when the labor becomes indivisible on a current job, like 1 trade, will you really see the benefit of parallelizing the actual job.
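
A sketch of "N serial jobs per host, where N is the number of cores", using only the Python standard library instead of a batch system like LSF. The job command is a stand-in (it just squares the job id); `run_serial_job` and `run_batch` are hypothetical names.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_serial_job(job_id):
    """Launch one unmodified serial job as its own OS process."""
    # Stand-in command: a real setup would invoke the existing batch program.
    cmd = [sys.executable, "-c", f"print({job_id} * {job_id})"]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return int(done.stdout)

def run_batch(job_ids, jobs_per_host=None):
    """Keep at most N jobs running at once, N defaulting to the core count."""
    n = jobs_per_host or os.cpu_count() or 1
    # Threads only wait on child processes here; the work runs in the children.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(run_serial_job, job_ids))
```

The serial program itself is untouched; only the launcher knows about the cores. No single job finishes sooner, but the whole batch does.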

Lots of people use batch processing via LSF or some other mechanism in financials.

--
-I just work here... how am I supposed to know?

What virtualization platform did you use? by Slashdot+Parent · 2006-10-05 01:54 · Score: 1

I'm curious, what virtualization platform did you use? Which guest operating systems?

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock

Re:What virtualization platform did you use? by kpharmer · 2006-10-05 03:07 · Score: 1

> I'm curious, what virtualization platform did you use? Which guest operating systems?

Wish I could tell you, but I can't.

its important too.. by pjr.cc · 2006-10-05 02:10 · Score: 1

Cluster and grid aren't mutually exclusive either... personally, I tend to think of grid computing as "this generation's MPAR" ;)

Mod this up. by Anonymous Coward · 2006-10-07 16:08 · Score: 0

That's the first thing I thought when I read the figures. 1000 trades doesn't sound like a lot of data to me. I don't know much about investment, but this sounds way off. Is this a legacy system or something? If so, you can increase performance 1000x by re-implementing it.

I don't know why, but when I hear "batch" I think "legacy." I know modern systems still batch things at times, but I don't think batching is ever needed or desirable. I don't know if you can eliminate it in your case (sometimes you have to work with people that 'need' it), but it's shown up at most of my jobs, and it's usually the major cause of frustration; if one little thing is out of order your whole batch is forfeit. (Sadly, our systems don't (and can't be modified to) do data verification, so it has to be eyeballed every time... luckily it will be rewritten soon.)

68 comments