What Cluster Management Software is Out There?
Dreddlox writes "I'm looking at a company producing cluster management software for Linux-based clusters such as Beowulf and would like to get a feel for what's out there right now. I know Turbolinux, for example, has a product, enFuzion, as do a few others. I'm trying to get a complete list of the players in this market, as well as any open source software and distributions of it, if any exist. I'd also welcome any pointers or information on the use of Linux clusters in non-military and non-academic applications (such as auto manufacture and finance), especially outside the US. I'm curious as to the potential for such software in Europe and Asia (with all the talk of China being a ripe market for Linux due to the open source angle)."
If you are looking for software to create a cluster, there are several options, depending upon what type of cluster you are trying to create. If you are creating a service-based cluster, check out TurboLinux Cluster Server, Linux Virtual Server, PolyServe Understudy, and Legato. There are many others available, including hardware solutions from Cisco, F5, and Alteon. I'm not too familiar with Beowulf-type clusters.
If you are looking for software to manage groups of systems, that's a whole different story. You might look into Enlighten DSM, Tivoli, or OpenNMS. I'm sure there's a lot of competition in that field as well, but I don't have any experience with those products.
Software sucks. Open Source sucks less.
I'm involved with a part of the Human Genome Project, and we run around 100 Linux boxen under Condor for our often quite large computations.
This is not what everyone wants from a cluster; the focus is on large amounts of computation over time, not particularly fast computation over a few minutes. And it can take some work to break a task into jobs it will distribute effectively. Perhaps I should be clearer: it's intended to distribute jobs, not cycles. And it can be hard to control how local file access and networked file access interact, if you need to. And it does some tricks with user priority which don't fit with some models of shared access.
Nevertheless, it works, and it's a wonderful thing to see a giant numbercrunch you've been running as one process split up and run in parallel, my goodness. And Condor has allowed us to do that without getting involved in any tricky parallel programming or even changing code: jobs run on their own box, happily unaware that they're part of a cluster at all, essentially.
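To make the "jobs, not cycles" idea a bit more concrete, here is a rough Python sketch of how one big numbercrunch might be cut into independent jobs. This is not Condor's API and not our actual code: the function names `split_into_jobs` and `run_job`, and the sum-of-squares workload, are made up for illustration. The only assumption is that the scheduler hands each job a distinct index, so every job can work out its own slice without talking to the others.

```python
def split_into_jobs(n_items, n_jobs):
    """Return one half-open (start, stop) range per job, covering range(n_items).

    Hypothetical helper: the first `extra` jobs get one more item each,
    so the slices differ in size by at most one.
    """
    base, extra = divmod(n_items, n_jobs)
    ranges = []
    start = 0
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def run_job(job_index, n_items, n_jobs):
    """What a single job does: process only its own slice.

    The workload (a sum of squares) is a stand-in for the real computation.
    Each job needs nothing but its index, so it never has to know that
    other jobs exist.
    """
    start, stop = split_into_jobs(n_items, n_jobs)[job_index]
    return sum(i * i for i in range(start, stop))


def combine(partial_results):
    """Merge the per-job outputs once every job has finished."""
    return sum(partial_results)
```

In practice each job would be a separate process that receives its index on the command line and writes its partial result to its own file, with a final step collecting the files. The hard part mentioned above is choosing slices that are large enough to be worth scheduling and that don't all fight over the same files.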
So while I've had my issues with it, it has certainly been useful to us and is worth checking out. It's easy to find faults with anything; this has allowed us to get our work done, which is the main thing. I should also mention that Condor is continually improving, and the new version we've just installed seems to resolve at least some of our problems.
I'm interested to see what else is out there...