Slashdot Mirror


Implementing Better Task Scheduling for Servers?

trifakir asks: "We are running some quite expensive SunFire servers with Solaris 8. In the 'crontabs' of these hosts we have scheduled some hundred-odd jobs, which are constrained by multiple factors: dependencies on other jobs, time constraints, CPU and memory usage, network bandwidth, and so on. Obviously this poses a constraint satisfaction problem (CSP). On the other hand, the number of these jobs, each of which can take from minutes to hours, is growing, and we are now experiencing performance problems given the limited resources we have. Of course we have opened the bag-of-tricks with our best *ad-hoc* solutions, using mostly Open Source software, to make our system event-based and less dependent on the scheduling expertise of the admins. At a certain point we were considering using AutoSys and I was looking for a grid-like scheduler like OpenPBS, both of which were discarded for various reasons. I am curious, how you guys, would solve this problem, which seems very trivial for many environments. Both advice about theory (scheduling) and practice will help us and any other readers who may be tackling this difficult problem."


  1. OMFG!? by Anonymous Coward · · Score: 3, Funny

    I think this is the first Ask Slashdot question ever asked that can't be answered by a search on Google. Nice!

    Now let's see if anyone even has an answer...

    1. Re:OMFG!? by jon787 · · Score: 3, Funny

      Now let's see if anyone even has an answer...

      No, no, no, that's not how an Ask Slashdot works!

      One group berates the person for not Googling first.
      Another points out this has been asked before.
      A third group goes and argues about a minor detail in the question instead of the real issue.
      A fourth will make jokes completely irrelevant to the issue.
      The next group will troll the debate by saying Windows already does it.
      One person will eventually answer the actual question but they will get modded to -1 because their answer isn't nearly as interesting as the rest of the comments.
      --
      X(7): A program for managing terminal windows. See also screen(1).
  2. Celebrity Slashdotter by c0d3h4x0r · · Score: 5, Funny

    I am curious, how you guys, would solve this problem, which seems very trivial for many environments.

    Oh my god! Christopher Walken posts on Slashdot!

    --
    Moderator hint: a comment is neither "Flamebait" nor "Troll" if it is true.
    1. Re:Celebrity Slashdotter by Enry · · Score: 3, Funny

      Then it's obvious.

      Your servers need more cowbell!

  3. Welcome to the nightmare. CONTROL-M by crstophr · · Score: 4, Informative

    First off, let me warn you. Do whatever you can to make someone else support this. The tangled web you weave becomes a nightmare after a few years.

    We use a software package called Control-M. It works on mainframes, Unix, and Windows. Jobs can be scheduled to run within windows of time, and can depend on out conditions of one or many other jobs. You can specify things like the second Tuesday of the month, etc. Jobs on UNIX can depend on successful status of jobs on the mainframe, etc. It really does it all.

    A few problems from the sysadmin perspective:

    The user interface, job naming conventions and client/server/GUI server model appear to have been designed by drunken, crack-smoking mainframe geezers. (ALL CAPS ABRVTDJBNMS) It is one of those interfaces where you'll find yourself randomly clicking on buttons and menu items while praying and cursing.

    Developers see it as a crutch (why check dependencies when the scheduler is supposed to do it?) and a nice place to point fingers when things don't work. They can get away with writing little scripts and then inserting them into this tangled mess of jobs and dependencies. That mess is your problem.

    You'll get hammered with requests to add, delete and change jobs for whatever reason. It will become this central time clock from which most major business processing is done. Once everything is migrated from cron (*heh*), you're very vulnerable to problems. Oh yes, have a problem late at night and the execs' precious TPS reports don't arrive on their desks the next morning. Heads roll....

    It's hell to manage and support. It's really a half time position for a large organization. Send some poor sucker to the class and then make them responsible for the whole thing.

    Support is decent. As good as most vendors.

    So Control-M does everything you could need, but is probably the most miserable application I've ever had the displeasure of managing.

    Oh yeah, and it is expensive $$$$$$$

    Best of luck. If you're really THAT bound by resources, buy more servers and spread the load.

  4. MAKE, or PMAKE by fdragon · · Score: 3, Interesting

    I am not sure on the details as I have not done this myself.

    But in your situation I would create a makefile to schedule the jobs with. Make can handle concurrency and, with available patches, can be made to distribute jobs to multiple nodes. Parallel Make Patches for GNU Make.

    With a method like this I would recommend a small shared file system so that as you complete each job you can touch a file. This would allow you to continue from the point you left off or, if you wish, clear it out and start over.
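
    The touch-a-file idea can be sketched in plain sh, independent of make. This is only an illustration under assumptions: the stamp directory default and the names run_once and reset_cycle are hypothetical, not part of any existing tool.

```shell
#!/bin/sh
# Sketch: record each finished job with a stamp file so a rerun can
# resume where it left off. STAMP_DIR's default is a hypothetical path.
STAMP_DIR=${STAMP_DIR:-/var/tmp/job-stamps}

# run_once NAME COMMAND [ARGS...]
# Skips COMMAND if NAME already has a stamp; stamps NAME only on success.
run_once() {
    name=$1
    shift
    mkdir -p "$STAMP_DIR" || return 1
    if [ -e "$STAMP_DIR/$name.done" ]; then
        return 0    # finished in an earlier run, nothing to do
    fi
    "$@" && touch "$STAMP_DIR/$name.done"
}

# reset_cycle removes every stamp so the next run starts from scratch.
reset_cycle() {
    rm -f "$STAMP_DIR"/*.done
}
```

    A driver script would call something like run_once extract /opt/jobs/extract.sh for each step in dependency order (the job name and path are made up). After a failure, running the driver again skips whatever already has a stamp.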

    --
    The program isn't debugged until the last user is dead.
  5. openpbs by hackstraw · · Score: 4, Informative

    If you're interested, post some kind of reply so that I can get in direct contact with you.

    I use openpbs (patched out the wazoo) with maui as the scheduler. The scheduler that comes with pbs sucks. Bad.

    PBS can do dependencies, and you can set up node properties for heterogeneous environments.

    I hate to say this, but buying faster and cheaper machines may help too. Sun/Solaris is slow. No flames intended, but it's a fact. Fortunately you're not using Solaris 9 with its 30% decrease in TCP/IP performance vs 7. I'm not sure how robust 8 is, but 9 has too many "features" (read bugs) for my taste.

    Maui also works with other resource managers. The Maui people have also forked OpenPBS into something that is "better", YMMV. Maui also has a text-based interface called wiki that you can use to write your own resource manager.
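
    As a rough sketch of what a PBS dependency looks like from the command line: qsub prints the new job's id, and a later submission can name that id in an afterok dependency. The function name submit_chain and the QSUB override are assumptions made for illustration, and it is assumed the installed PBS accepts the -W depend=afterok: form.

```shell
#!/bin/sh
# QSUB can be overridden so the sketch can be dry-run without a PBS server.
QSUB=${QSUB:-qsub}

# submit_chain SCRIPT [SCRIPT...]
# Submits each script so that it becomes eligible to run only after the
# previous one has exited successfully. Prints each job id as returned.
submit_chain() {
    prev=
    for job in "$@"; do
        if [ -z "$prev" ]; then
            prev=$("$QSUB" "$job") || return 1
        else
            prev=$("$QSUB" -W "depend=afterok:$prev" "$job") || return 1
        fi
        echo "$prev"
    done
}
```

    For example, submit_chain extract.sh transform.sh load.sh (hypothetical script names) would queue three jobs that run strictly one after another, each waiting for the previous one to succeed.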

    The info in your problem description is kinda lacking, but there should be a reasonable solution to your problem.

  6. How I solved similar prob: make + daemontools by Anonymous Coward · · Score: 5, Interesting

    First of all, I don't know if you actually need to run certain tasks at certain specific times. If so, you will need to use cron after all. But here are some ideas:

    I had an app server that ran a number of critical tasks, and they had a somewhat arbitrary and complex dependency graph. A good analogy would be eBay's indexing cycle: a bunch of stuff has to happen as often as possible, but it's not really important if it takes 30 minutes or 35 minutes or 1 hour to do each cycle.

    Also it needed to be easy to extend: a programmer should be able to write the code and "stick it somewhere" to extend the system.

    The previous admin had set some tasks to run in cron every 5 minutes, not realizing that after a while some of the jobs actually took 6-7 minutes and growing (you can imagine what the process table looked like after a while, and some of these were not locking resources properly.....)
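
    If a job does have to stay in cron, one common way to stop runs from piling up is a lock that makes a new run bow out while the previous one is still active. The following is a minimal sketch, not the setup described in this comment; the function name with_lock and the exit status 75 are arbitrary choices.

```shell
#!/bin/sh
# with_lock LOCKDIR COMMAND [ARGS...]
# mkdir is atomic, so only one caller at a time can create LOCKDIR.
with_lock() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "previous run still active, skipping: $lockdir" >&2
        return 75    # arbitrary "try again later" status
    fi
    status=0
    "$@" || status=$?
    rmdir "$lockdir"
    return "$status"
}
```

    A cron entry would then run something like with_lock /var/tmp/reindex.lock /opt/jobs/reindex.sh (both paths are hypothetical). Note that if the job is killed outright, the lock directory is left behind and has to be removed by hand.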

    I came up with a system using Makefiles (which take care of interdependencies, and the -j flag will run independent processes in parallel) and djb's daemontools package.

    If you're not familiar with daemontools, it is an incredibly tight little set of tools that lets you *atomically* and *reliably* start, stop, and configure daemons, and it lets you turn ANY script into a daemon. It just runs a "run" script you supply, and when the script dies for any reason, it restarts it. So you can create a script like this:

    #!/bin/sh
    make -C /foo/bar/baz update
    sleep 60

    and it will run it over and over again. Combine this with resource limits and multilog logging and you have a bulletproof way to keep things going in the background.

    So I set up the dependencies in the Makefile, threw in a couple of scripts to run all the scripts in a couple of "drop box" directories that programmers can use, documented everything, and made a web interface for checking the results. Now it doesn't really matter if the cycle takes 5 minutes or 10 minutes or 30 minutes; the makefiles are run over and over again in a loop, keeping things up to date.
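
    A "drop box" runner along these lines might look roughly like the following. This is a guess at the shape of such a script, not the original one; the name run_dropbox is made up.

```shell
#!/bin/sh
# run_dropbox DIR
# Runs every executable regular file in DIR, in the shell's glob order.
# A failing script is reported on stderr but does not stop the others.
# Returns non-zero if any script failed.
run_dropbox() {
    dir=$1
    failures=0
    for script in "$dir"/*; do
        [ -f "$script" ] && [ -x "$script" ] || continue
        if ! "$script"; then
            echo "dropbox job failed: $script" >&2
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}
```

    A Makefile target can call this on a directory, so programmers extend the cycle by dropping in an executable file. Prefixing file names with numbers (01-foo, 02-bar) gives a predictable order.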

    Again, I don't know the specifics of your needs but this is definitely something to consider. Especially if your crontab has grown into a huge confusing mess, and you don't actually care what exact time things are running.

  7. "Click and Pray" by MarcQuadra · · Score: 3, Funny

    you'll find yourself randomly clicking on buttons and menu items while praying and cursing

    You're the reason my bank statements were two days late last month, aren't you?

    --
    "Sometimes, I think Trent just needs a cup of hot chocolate and a blankie." -Tori Amos on Nine Inch Nails
  8. scheduler or resource manager by "Zow" · · Score: 3, Informative

    This is a common problem on supercomputers: you have lots of users that want to run lots of jobs that have conflicting requirements for resources, and typically some dependencies between jobs and the like. Take a look at some of the scheduling and resource management tools available for supercomputers and maybe one of those will scratch your itch.

    A couple pointers to get you started:

    • SLURM, which while designed for Linux clusters is a good system and at least should seed a Google search (disclaimer: I work for LLNL and am on the user end of slurm, and I'm only speaking for myself here).
    • Condor is a lot more than scheduling, but it does that as well.

    Those are the ones I think it would be useful to look at for now. Most of the other systems are vendor specific.

    -"Zow"

    1. Re:scheduler or resource manager by wik · · Score: 3, Informative

      I have mod points, but I'd rather add to the discussion. Condor does an excellent job of scheduling tasks with the resource constraints that you mentioned and it works across machines. I'd highly suggest looking into it. While it doesn't have a periodic submission feature (AFAIK, and I wouldn't use it even if it did), I don't see why you couldn't use cron to submit an initial Condor job that starts everything else off every day.

      The DAG scripts handle dependencies (setting these up the first time might be a bit hairy). Setting resource requirements is easy. Scheduling comes for free.

      --
      / \
      \ / ASCII ribbon campaign for peace
      x
      / \