MIT Randomizes Tasks To Speed Massive Multicore Processors
itwbennett writes Researchers at the Massachusetts Institute of Technology have created a data structure that they claim can help large multicore processors churn through their workloads more effectively. Their trick? Do away with the traditional first-come, first-served work queue and assign tasks more randomly. The SprayList algorithm allows processors with many cores to spread out their work so they don't stumble over one another, creating bottlenecks that hamper performance.
It always seemed wrong that out of order execution was more efficient than in order. I'll have to read up on it more. You'd think that the results coming in out of order would mess things up.
"Freedom in the USA is not the ability to do what you want. It is the ability to stop others from doing what THEY want"
3D print it too, that always helps!
...many computing algorithms rely upon randomization for efficiency.
Couldn't you just distribute the tasks ahead of time, giving every core a new task before its current task is finished?
Also, the article syas:
Random assignment has traditionally been frowned upon by those who think a lot about computer processors, the researchers noted in a paper explaining the work. A random scheduling algorithm takes longer to jump around the queue than a conventional one does. Caches can't be used to store upcoming work items. And if a set of tasks needed to perform one job are executed out of order, then the computer needs additional time to reassemble the final results.
I would think these problems are the same for the priority queue that they compare performance to. And I guess there are other ways which avoid these problems, which might produce faster results.
This is old hat in the CS world that gets re-discovered fairly often: you can increase throughput at the cost of ravaging your latency. For some tasks, this is an acceptable tradeoff -- for others (especially anything interactive) it's completely unacceptable. Moreover, any synchronization point in the program converts the worst-case latency of the previous tasks into limits on throughput, e.g. the time it takes to join a set of N threads is equal to the maximum latency of any single thread in the list.
The best analogy is an elevator (sorry car folks): you can optimize your elevator for throughput by having it always select the closest floor with an active call. The cost, obviously, is that if people are shuffling between floors 1 & 5 a lot, then the poor guy up on 30 might wait a really long time. The throughput is still maximized though, since the elevator can do more operations per unit time by avoiding the long trip to and from floor 30.
In some cases this is fine, in the vast majority of cases you want to ensure that all tasks complete in a more bounded amount of time, even if that reduces the total number of tasks completed per unit time.
I have been workong directly and indirectly with massive scale computing for decades:
1) break single threaded tasks into smaller jobs that can be distributed.
2) test that they still work
3) establish a chain of dependencies so that tasks which must be performed in order will.
4) establish pipeline cost and cache coherency cost for each job.
5) distribute tasks who share common data sets to cores which have the lowest latency between them
6) take jobs which depend on output of other jobs and offset their start time so the input data from the other job is available befor it is need in the current job decreasing the cost or synchronization.
8) optimize job distribution so that tasks running in parallel with dependencies will be distributed to cores nearest to one another in the coherency ring.
This is old school thinking. Randomizing should not offer better performance than properly scheduled tasks.
We're reaching a point where we have more transistors than we can turn on without melting the chip. You can have a lot of spare cores that are turned off and can be turned back on for the occasion jitter inducing burst of work. The extra cores act as a buffer to absorb unexpected work, while allowing your target core utilization as your average.
Traffic shapers already have this notion of "real time" traffic, where the packet scheduler can pretty much guarantee that latency sensitive traffic can get scheduled up to 80% of the link speed. Non latency sensitive traffic can use the other 20%, but that 20% cannot be used by real time traffic because it's too close to saturation to guarantee that the packet can get scheduled in time.