Slashdot Mirror


Resources On Practical Job Scheduling?

felciano asks: "We have a fairly involved content build process that involves about 20 different jobs that have dependencies between them and are of sometimes radically different running times. Some jobs are parallelizable. We have a suite of machines across which we distribute the load right now, but the actual job-to-machine allocation is done by hand. I'm trying to do capacity and hardware purchase planning, and am getting a headache trying to model this. I've tried doing it in both MS Excel and MS Project, but neither seems suited for this type of model, especially when it comes to asking "what if" questions (e.g. if we bought 3 more machines, how much would that buy us in overall build time?). I realize that scheduling, network flow, etc. problems can get into into hairy (sometimes NP) problem spaces, but given the rise in popularity of Linux compute farms as ways to address scalability problems, I would assume that there are at least some basic tools to help model this. I've looked at distributed job-scheduling software and haven't found much on the automated side -- are there any practical toolkits, apps, libraries, etc. to help do this type of modeling and planning in other ways?"

4 of 32 comments (clear)

  1. Looks like my systems course really has a use. by Christopher+Thomas · · Score: 3

    This is the kind of job queueing theory was designed for. Pick up a textbook from your local university bookstore, if you're interested in the topic. This will let you fairly easily get estimates of how varying system parameters affect performance.

    I'm afraid I don't know what commercial packages handle this, though. We used a high-level system simulation tool called "MAP", but it wasn't very intuitive to use. Better solutions certainly exist.

    1. Re:Looks like my systems course really has a use. by Christopher+Thomas · · Score: 2

      hmmm.. interesting....! Which model are we talkin about here?

      CAVEAT: Bear in mind that I've only taken one systems course, so my terminology and descriptions may be a bit off, and I won't know about the more advanced techniques...

      The idea is to represent your system as a series of processing nodes with task queues, and to see what happens as you send tasks through the system. Based on the length of time spent at each node for the different classes of tasks you're sending around (specified as a probability distribution), you get the average "utilization" of each node (how much of the time it's being used), the average queue length at each node, and so forth. This helps you find system bottlenecks for any given configuration.

      Best of all, the math that produces these statistics is straightforward enough that, with a few approximations, it works even for very large systems. This is what your off-the-shelf software packages will handle. You can vary the design of the system, and the software will rapidly tell you how the change affected the system "hot spots".

      For small systems and coarse approximations (exponential/memoryless probability distributions), you can even work out a lot of this by hand.

      In this context, I'd probably declare the compiling machines to be nodes, and use existing task data to come up with a (back of the envelope) probability distribution for task lengths on the various machines. Questions like "what's the point of diminishing returns for adding nodes?" and "what happens if I change the paths of task flow like *this*?" can be easily answered from there.

  2. Option: A tweaking script? by Christopher+Thomas · · Score: 2

    The best I've come up with is a glorified spreadsheet that estimates total build time and lets me tweak #of machines and # of split-up jobs, and watch the bottom line change. Gross but I can't think of anything better.

    Have you considered writing a script that tries to optimize this for you?

    I've hacked together SA-based optimizers in Perl on occasion for similar tasks.

    Whether this is effective for you depends on how much time you'd spend scanning solutions by other methods (a script like this takes anywhere from a couple of hours to a couple of days to write, depending on complexity).

    1. Re:Option: A tweaking script? by Christopher+Thomas · · Score: 2

      Do you have any sample code or resources you could point to?

      A search for "genetic algorithm"+"simulated annealing" gives mixed results.

      I once wrote a brief description of both for a friend. I'll email it to you.

      There are definitely points in the spreadsheet where it stops being a good idea to buy more new machines and instead becomes a better idea to add more RAM to your existing machines. So the GA solution would somehow need to take into account the number and type of machines available at a given time T in order to optimize what should happen at time T+1.

      That's all in how you decide to specify the system state vector and evaluate costs.

      SA and GA approaches represent "solutions" to the task as long strings of numbers (vectors). How you translate a vector to a real system configuration is up to you; the GA or SA, on the other hand, just needs the vector, and doesn't care how its interpreted.

      One approach is to just have two numbers, specifying the number of machines and the RAM per machine. Another approach is to use variable-length vectors and have one (speed,RAM) pair per entry. A myriad of other approaches are possible. You just have to have a black-box function that can evaluate how well the configuration specified by a given vector would perform. Both GA and SA approaches work by perturbing state vectors in various ways, trying to get a system with a better "performance" number. That's all they do.

      The re-use of existing equipment can be taken into account when generating the "performance" number. A system that's more expensive could be deemed to "perform" worse (this isn't just compilation speed - it's an arbitrary figure of merit that you define). This would give better "performance" for systems that re-used existing hardware, and would place an upper bound on how much new hardware you could buy while still improving "performance".

      This is an interesting idea. Right now this particular model is embedded in a pretty large scalability spreadsheet, but I could probably extract the key variables and formulas. What's less clear to me is how to model this over time under a GA approach.

      For any GA or SA-based solution, you only need to come up with two things: a way of representing a system configuration as a vector, and a way of figuring out how "good" the solution represented by a given vector is. Plug these in, and watch it go :).