Implementing Better Task Scheduling for Servers?
trifakir asks: "We are running some quite expensive SunFire servers with Solaris 8. In the 'crontabs' of these hosts we have scheduled maybe some hundred odd jobs, which are constrained by multiple factors: dependencies on other jobs, time constraints, CPU and memory usage, network bandwidth, and so on. Obviously this imposes a CSP. On the other hand, the number of these jobs, each of which can take from minutes to hours, is growing, and we are now experiencing performance problems given the limited resources we have. Of course we have opened the bag-of-tricks with our best *ad-hoc* solutions, using mostly Open Source software, to make our system event-based and less dependent on the scheduling expertise of the admins. At a certain point we were considering using AutoSys and I was looking for a grid-like scheduler like OpenPBS, both of which were discarded for various reasons. I am curious how you guys would solve this problem, which seems very trivial for many environments. Both advice about theory (scheduling) and practice will help us and any other readers who may be tackling this difficult problem."
First off, let me warn you. Do whatever you can to make someone else support this. The tangled web you weave becomes a nightmare after a few years.
We use a software package called Control-M. It works on mainframes, Unix, and Windows. Jobs can be scheduled to run within windows of time, and can depend on out conditions of one or many other jobs. You can specify things like the second Tuesday of the month, etc. Jobs on UNIX can depend on successful status of jobs on the mainframe, etc. It really does it all.
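For anyone staying on cron for now, a calendar rule like "second Tuesday of the month" can be approximated with a small guard script. This is only a minimal sketch in Python, not how Control-M does it; the function names `nth_weekday` and `should_run_today` are made up for illustration.

```python
import calendar
from datetime import date


def nth_weekday(year: int, month: int, weekday: int, n: int) -> date:
    """Return the n-th (1-based) occurrence of a weekday in a month.

    weekday uses the calendar module's numbering (calendar.MONDAY == 0).
    """
    # Weeks start on Monday here, so week[weekday] is the day number for
    # that weekday, or 0 when the day falls outside the month.
    weeks = calendar.Calendar(firstweekday=calendar.MONDAY).monthdayscalendar(year, month)
    days = [week[weekday] for week in weeks if week[weekday] != 0]
    if not 1 <= n <= len(days):
        raise ValueError(f"month has no occurrence number {n} of that weekday")
    return date(year, month, days[n - 1])


def should_run_today(today: date) -> bool:
    """Hypothetical guard: True only on the second Tuesday of the month."""
    return today == nth_weekday(today.year, today.month, calendar.TUESDAY, 2)
```

A job that cron starts every Tuesday could call `should_run_today(date.today())` first and exit quietly when it returns False.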
A few problems from the sysadmin perspective:
The user interface, job naming conventions and client/server/GUI server model appear to have been designed by drunken crack smoking mainframe geezers. (ALL CAPS ABRVTDJBNMS) It is one of those interfaces where you'll find yourself randomly clicking on buttons and menu items while praying and cursing.
Developers see it as a crutch (why check dependencies when the scheduler is supposed to do it?) and a nice place to point fingers when things don't work. They can get away with writing little scripts and then inserting them into this tangled mess of jobs and dependencies. That mess is your problem.
You'll get hammered with requests to add, delete, and change jobs for whatever reason. It will become this central time clock from which most major business processing is done. Once everything is migrated from cron (*heh*) then you're very vulnerable to problems. Oh yes, have a problem late at night and the execs' precious TPS reports don't arrive on their desks in the morning. Heads roll....
It's hell to manage and support. It's really a half-time position for a large organization. Send some poor sucker to the class and then make them responsible for the whole thing.
Support is decent. As good as most vendors.
So Control-M does everything you could need, but is probably the most miserable application I've ever had the displeasure of managing.
Oh yeah, and it is expensive $$$$$$$
Best of luck. If you're really THAT bound by resources, buy more servers and spread the load.
You didn't mention why you discarded pretty common solutions for this problem, namely why you didn't like AutoSys. It's not free, nor cheap, but its ability to group tasks together and control dependencies between jobs, groups of jobs and resources is pretty nice. Just get used to using return codes in your scheduled jobs and you're good to go; no other change is required in the jobs you already have.
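To make the return-code point concrete, here is a rough sketch in Python of the convention a dependency-aware scheduler relies on: each job exits 0 on success and non-zero on failure, and downstream jobs only run after a success. This is not AutoSys syntax, just an illustration; `run_chain` is a hypothetical helper name.

```python
import subprocess
import sys


def run_chain(commands):
    """Run commands in order and stop at the first non-zero exit code.

    Each command is an argument list such as ["/path/to/job", "arg"].
    Returns a list of (command, returncode) for the commands that ran.
    """
    results = []
    for cmd in commands:
        completed = subprocess.run(cmd)
        results.append((cmd, completed.returncode))
        if completed.returncode != 0:
            # A failed job blocks everything that depends on it.
            break
    return results


if __name__ == "__main__":
    # Stand-in jobs: the second one fails, so the third never starts.
    ok = [sys.executable, "-c", "raise SystemExit(0)"]
    bad = [sys.executable, "-c", "raise SystemExit(3)"]
    for cmd, rc in run_chain([ok, bad, ok]):
        print(rc)
```

The existing scripts only need to exit with a meaningful status for this to work, which is the "no other change" part.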
If you're interested, put some kind of reply so that I can get in direct contact with you.
I use OpenPBS (patched out the wazoo) with Maui as the scheduler. The scheduler that comes with PBS sucks. Bad.
PBS can do dependencies, and you can set up node properties for heterogeneous environments.
I hate to say this, but buying faster and cheaper machines may help too. Sun/Solaris is slow. No flames intended, but it's a fact. Fortunately you're not using Solaris 9 with its 30% decrease in TCP/IP performance vs 7. I'm not sure how robust 8 is, but 9 has too many "features" (read bugs) for my taste.
Maui also works with other resource managers. The Maui people have also forked OpenPBS into something that is "better", YMMV. Maui also has a text-based interface called wiki that you can use to build your own resource manager.
The info in your problem description is kinda lacking, but there should be a reasonable solution to your problem.
This is a common problem on supercomputers: you have lots of users that want to run lots of jobs that have conflicting requirements for resources, and typically some dependencies between jobs and the like. Take a look at some of the scheduling and resource management tools available for supercomputers and maybe one of those will scratch your itch.
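To give a feel for what these tools do underneath, here is a toy sketch in Python of greedy list scheduling: start any job whose dependencies have finished and whose CPU request fits in what is free, then advance the clock to the next completion. It is a simple heuristic, not an optimal solver and not the algorithm of any particular product; the `Job` fields and the example job names are invented for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    duration: int      # minutes the job runs
    cpus: int          # CPUs the job holds while running
    deps: tuple = ()   # names of jobs that must finish first


def schedule(jobs, total_cpus):
    """Greedy list scheduling; returns {job name: start time in minutes}."""
    by_name = {j.name: j for j in jobs}
    for j in jobs:
        if j.cpus > total_cpus:
            raise ValueError(f"{j.name} needs more CPUs than exist")
        for d in j.deps:
            if d not in by_name:
                raise ValueError(f"{j.name} depends on unknown job {d}")

    pending = list(jobs)   # input order doubles as priority
    running = []           # min-heap of (finish_time, name)
    done, start = set(), {}
    now, free = 0, total_cpus

    while pending or running:
        waiting = []
        for j in pending:
            if all(d in done for d in j.deps) and j.cpus <= free:
                start[j.name] = now
                free -= j.cpus
                heapq.heappush(running, (now + j.duration, j.name))
            else:
                waiting.append(j)
        pending = waiting
        if not running:
            # Nothing is running and nothing could start: a dependency cycle.
            raise ValueError("dependency cycle among remaining jobs")
        # Jump to the next completion and release every job ending then.
        now = running[0][0]
        while running and running[0][0] == now:
            _, name = heapq.heappop(running)
            done.add(name)
            free += by_name[name].cpus
    return start


if __name__ == "__main__":
    jobs = [
        Job("extract", 30, 2),
        Job("load", 20, 2, ("extract",)),
        Job("report", 10, 4, ("load",)),
        Job("backup", 60, 2),
    ]
    print(schedule(jobs, total_cpus=4))
```

With 4 CPUs in this made-up example, `extract` and `backup` start together, `load` follows `extract`, and `report` has to wait for `backup` to free its CPUs even though its own dependency finished earlier. That kind of resource stall is exactly what a plain crontab cannot see.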
A couple pointers to get you started:
Those are the ones I think it would be useful to look at for now. Most of the other systems are vendor specific.
-"Zow"