Slashdot Mirror


MIT Randomizes Tasks To Speed Massive Multicore Processors

itwbennett writes Researchers at the Massachusetts Institute of Technology have created a data structure that they claim can help large multicore processors churn through their workloads more effectively. Their trick? Do away with the traditional first-come, first-served work queue and assign tasks more randomly. The SprayList algorithm allows processors with many cores to spread out their work so they don't stumble over one another, creating bottlenecks that hamper performance.

15 of 63 comments (clear)

  1. I don't know enough about this stuff by Pikoro · · Score: 2

    It always seemed wrong that out of order execution was more efficient than in order. I'll have to read up on it more. You'd think that the results coming in out of order would mess things up.

    --
    "Freedom in the USA is not the ability to do what you want. It is the ability to stop others from doing what THEY want"
    1. Re:I don't know enough about this stuff by mwvdlee · · Score: 3, Insightful

      Results don't come in out-of-order. Imagine two variables, A and B, each undergoes a number of calculation steps which don't refer to the other variable. I.e. A=A+5/2*13-29 and B=B*B*3+12/N, then finally adding them together as Z=A+B. Normal execution would first do all the calculations for A then all the calculations for B, then finally Z. Out-of-order execution would calculate both A and B simultaneously, wait for both to finish, then calculate Z. Out-of-order execution involves a lot of this type of waiting, but since it's waiting for just the slowest calculation instead of the sum of both the slowest and fastest calculation it ends up being done sooner. If things cannot be calculated like this, an out-of-order capable processor will simply do things in-order.

      At least that's how I understand it at a very abstract level.

      --
      Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
    2. Re:I don't know enough about this stuff by fey000 · · Score: 3, Informative

      If I'm not mistaken, you are thinking about branch prediction, not out-of-order-executions in an otherwise serial pipe.

      To elaborate, OOE deals with computing as much as possible without having to wait for a result first.
      Branch prediction is a cache separate from the execution tray that attempts to predict the outcome of an if/switch or other branching evaluation and then load the pipelines to the execution tray with the computations following that branching, since the time it takes to evaluate an if/switch can be long, and without a prediction the cpu would have to stall until the evaluation is complete.

    3. Re:I don't know enough about this stuff by anarcobra · · Score: 2

      What you say is pretty much correct.
      The reason out of order execution is faster, is because in most cases the compiler doesn't know the full pipeline delay of each instruction.
      This helps with making binaries compatible with different processors implementing the same instruction set.
      For instance, the compiler might assume that +, -, *, and / all take the same amount of cycles to calculate, when in reality of course some of those will be faster than others.

      A calculation like A = B*C; B = A+D; C= C+1 could then be reordered to A=B*C C=C+1 B=A+D. (assuming that multiplication takes longer than addition.)
      In this case the B=A+D calculation waits for the A=B*C calculation, but the C=C+1 calculation doesn't have to wait.
      If the compiler was fully aware of pipeline delays of each instruction, then it could have scheduled the instructions like that in the first place.
      But then you code would only run optimally on processors with those exact pipeline delays. It also makes scheduling more complicated.

    4. Re:I don't know enough about this stuff by gnupun · · Score: 3, Interesting

      It always seemed wrong that out of order execution was more efficient than in order.

      Dependency between instructions can temporarily delay execution of some instructions. This can be easily explained with an example:


      1) A = A + 1
      2) B = A + 2
      3) C = D + 3

      Above is pseudo code of 3 instructions being executed by a CPU. While instr #1 is being executed, the CPU will also try to execute instr #2, but since instr #2 needs the current value of register A, the CPU waits until instr #1 is complete and value of A can be used to calculate B. This is why in-order CPUs can be simpler to create, but be slightly slow.

      Out-of-order CPUs work around this problem by deferring instr #2. Instead, the CPU executes instr #3 which has no dependency to instr #1 or #2. Therefore an OOO CPU can execute two instructions in parallel in this case, whereas the in-order CPU may not be able to execute multiple instructions in some cases.

    5. Re:I don't know enough about this stuff by userw014 · · Score: 2

      The article is about scheduling access to thread prioritization lists implemented as linked lists, not scheduling individual instructions. The issue involves multiple CPUs contending for the root (start) of a linked list, which causes cache-invalidation in the CPUs, thereby causing systemic inefficiencies.

      A very dumbed-down description of their solution is to have multiple "root" lists in order to reduce the frequency of cache-contention/invalidation.

  2. Oooh I have an idea by Anonymous Coward · · Score: 5, Funny

    3D print it too, that always helps!

  3. It never ceases to amaze me how .... by OneSmartFellow · · Score: 2

    ...many computing algorithms rely upon randomization for efficiency.

  4. Avoiding bottlenecks by Zorpheus · · Score: 3, Interesting
    The article says that the SprayList algorithm is faster for many cores than a traditional priority queue, since there are collisions when several cores ask for the top priority task at once.
    Couldn't you just distribute the tasks ahead of time, giving every core a new task before its current task is finished?
    Also, the article syas:

    Random assignment has traditionally been frowned upon by those who think a lot about computer processors, the researchers noted in a paper explaining the work. A random scheduling algorithm takes longer to jump around the queue than a conventional one does. Caches can't be used to store upcoming work items. And if a set of tasks needed to perform one job are executed out of order, then the computer needs additional time to reassemble the final results.

    I would think these problems are the same for the priority queue that they compare performance to. And I guess there are other ways which avoid these problems, which might produce faster results.

    1. Re:Avoiding bottlenecks by Bengie · · Score: 2

      They're not pushing or keeping a back-log of work to be done. Each core only accesses the task list once it's done with its current task, but it grabs a task random from a range of tasks at the beginning of the queue instead of just the first task.

      "Distributing" requires something to push. You don't want to push because you might send a task to an already overloaded core. You could keep track of which core has the least amount of work remaining, assuming that's even possible to know, but then you're back to keeping a central data structure up to date with core information, and you've gained nothing.

      You could break up cores into groups and have groups of cores share a common task pool, then have a central distribution logic monitor those pools. Everything comes with a trade off and random just works very well. Fewer corner cases and typically graceful degradation, compared to catastrophic performance lost due to an unforeseen interaction of task run-time patterns and the pool and the distribution algorithm.

    2. Re:Avoiding bottlenecks by Half-pint+HAL · · Score: 2

      But assuming you have a known number of worker threads you could predistribute tasks in a round-robin fashion so that when thread #3 asks for a tasks it gets nextTask[3], off you go and we'll repopulate nextTask lazily once we've got lock on the main queue again, I assume if it's acceptable to pick from a range any one task will finish within an acceptable delay so that next task still gets done in time too. I wouldn't want to put a call to rand() anywhere in there.

      It's not about assigning tasks to threads, though, it's about assigning threads to cores. There is no authority above the core, and so the cores have to manage their own scheduling. If you assign each core an index, then you no longer have a true queue, and if core 1 hangs, the head of the list will just sit there indefinitely as core 2 continues to take task 2, etc. And what about dynamic systems that power down cores. Do you reassign core indexes ever time a core shuts down? Now suddenly you're introducing more intercore coordination.

      The thing to remember is that "random" isn't always "truly random". Think instead of it as "stochastic" in this case. The results are not exactly predictable, but give a predictable distribution of results. That those results include fewer resource clashes is a Very Good Thing.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
    3. Re:Avoiding bottlenecks by Half-pint+HAL · · Score: 2

      Rule number one about computer performance . . . know your workload!

      I think of the algorithm in TFA as "Socratic computing performance" -- the only true knowledge about your workload is knowing that you know nothing about your workload. The more the computer has to introspect about individual tasks on the fly, the higher the computational load. Eventually you end up with the cost of optimisation being higher than the cost of the inefficiencies in the scheduling.

      --
      Got them moderator blues I blieve I walk out the do', With these mod-points I been gettin', I 'most never post no mo'
  5. Throughput versus Latency ... Again ... by Wrath0fb0b · · Score: 2

    This is old hat in the CS world that gets re-discovered fairly often: you can increase throughput at the cost of ravaging your latency. For some tasks, this is an acceptable tradeoff -- for others (especially anything interactive) it's completely unacceptable. Moreover, any synchronization point in the program converts the worst-case latency of the previous tasks into limits on throughput, e.g. the time it takes to join a set of N threads is equal to the maximum latency of any single thread in the list.

    The best analogy is an elevator (sorry car folks): you can optimize your elevator for throughput by having it always select the closest floor with an active call. The cost, obviously, is that if people are shuffling between floors 1 & 5 a lot, then the poor guy up on 30 might wait a really long time. The throughput is still maximized though, since the elevator can do more operations per unit time by avoiding the long trip to and from floor 30.

    In some cases this is fine, in the vast majority of cases you want to ensure that all tasks complete in a more bounded amount of time, even if that reduces the total number of tasks completed per unit time.

  6. Is this news? by LostMyBeaver · · Score: 2

    I have been workong directly and indirectly with massive scale computing for decades:

    1) break single threaded tasks into smaller jobs that can be distributed.
    2) test that they still work
    3) establish a chain of dependencies so that tasks which must be performed in order will.
    4) establish pipeline cost and cache coherency cost for each job.
    5) distribute tasks who share common data sets to cores which have the lowest latency between them
    6) take jobs which depend on output of other jobs and offset their start time so the input data from the other job is available befor it is need in the current job decreasing the cost or synchronization.
    8) optimize job distribution so that tasks running in parallel with dependencies will be distributed to cores nearest to one another in the coherency ring.

    This is old school thinking. Randomizing should not offer better performance than properly scheduled tasks.

  7. Re:If random selection is bad... by Bengie · · Score: 2

    We're reaching a point where we have more transistors than we can turn on without melting the chip. You can have a lot of spare cores that are turned off and can be turned back on for the occasion jitter inducing burst of work. The extra cores act as a buffer to absorb unexpected work, while allowing your target core utilization as your average.

    Traffic shapers already have this notion of "real time" traffic, where the packet scheduler can pretty much guarantee that latency sensitive traffic can get scheduled up to 80% of the link speed. Non latency sensitive traffic can use the other 20%, but that 20% cannot be used by real time traffic because it's too close to saturation to guarantee that the packet can get scheduled in time.