SW Weenies: Ready for CMT?


Posted by Hemos on Monday June 13, 2005 @01:20AM from the step-on-up dept.

tbray writes "The hardware guys are getting ready to toss this big hairy package over the wall: CMT (Chip Multi Threading) and TLP (Thread Level Parallelism). Think about a chip that isn't that fast but runs 32 threads in hardware. This year, more threads next year. How do you make your code run fast? Anyhow, I was just at a high-level Sun meeting about this stuff, and we don't know the answers, but I pulled together some of the questions."



Re:how much for the best of both worlds? by InvalidError · 2005-06-13 02:51 · Score: 4, Informative

Hardware threading has been mainstream for more than two years in the form of HyperThreading.

Simultaneous Multi-Threading is a CPU's ability to concurrently execute mixed instructions from multiple threads. Intel's HT is simply 2-way SMT.

Chip Multi-Threading is a CPU's ability to hold execution states for multiple threads, executing instructions from only one of them at a time unless the chip is also SMT.

In Sun's case, the mid-term plan is to offer 8-way SMT with 32-way CMT: the CPU can hold state for up to 32 threads and have in-flight instructions from as many as eight of them.

Why the future of SMT is bleak by spockvariant · 2005-06-13 04:54 · Score: 5, Informative

I'm a researcher working on high performance computing and have used various configurations of Simultaneous Multithreading (aka Hyperthreading aka CMT) (Intel Xeon, IBM POWER5). The result is always the same - in the end, memory latencies and OS overheads kill most of the gains in instruction-level parallelism that come from SMT. Look at it this way - typical operation latencies on most modern processors are on the order of 1 nanosecond, whereas DRAM latencies are on the order of 200ns. As long as you can't do anything about this latency, there's no point in cutting down on processing times. There's a very nice paper in this year's ACM SIGMETRICS that gives real experimental data to illustrate this fact: http://www.cs.princeton.edu/~yruan/XeonSMT/smt.pdf. The paper shows that the speedups obtained using SMT in practice are meagre. The simulation results from the original UWashington research on the subject (http://www.cs.washington.edu/research/smt/) looked far better for two reasons: they used unreasonably large caches in their simulations, and they completely ignored the OS overhead of enabling SMT, which is non-negligible and has often been pointed out on the Linux Kernel mailing list as well.
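
To put rough numbers on the latency argument, here is a back-of-the-envelope sketch. It uses the 1 ns and 200 ns figures quoted above. The fraction of operations that go all the way to DRAM (`miss_fraction`) is an assumption picked for illustration, not a number from the comment or the paper. The model is deliberately crude: it treats every DRAM access as a full stall and ignores overlapping memory accesses, cache hierarchy effects, and OS overhead.

```python
# Toy model: average time per operation when some fraction of operations
# stall on DRAM. All function and parameter names are hypothetical.

COMPUTE_NS = 1.0   # per-operation latency quoted in the comment
DRAM_NS = 200.0    # DRAM latency quoted in the comment


def avg_time_per_op_ns(compute_ns, dram_ns, miss_fraction):
    """Compute time plus the expected DRAM stall per operation."""
    return compute_ns + miss_fraction * dram_ns


def speedup_from_faster_compute(compute_ns, dram_ns, miss_fraction, factor):
    """Overall speedup if only the compute part gets `factor` times faster."""
    before = avg_time_per_op_ns(compute_ns, dram_ns, miss_fraction)
    after = avg_time_per_op_ns(compute_ns / factor, dram_ns, miss_fraction)
    return before / after


if __name__ == "__main__":
    # miss_fraction values are assumed, chosen only to show the trend.
    for miss_fraction in (0.0, 0.005, 0.02, 0.05):
        s = speedup_from_faster_compute(COMPUTE_NS, DRAM_NS, miss_fraction, 2.0)
        print(f"miss_fraction={miss_fraction:.3f}  speedup from 2x compute: {s:.2f}x")
```

Under these assumptions, with 2% of operations going to DRAM, the average cost per operation is 1 + 0.02 × 200 = 5 ns, so halving the compute time only gets you from 5 ns to 4.5 ns. How closely real workloads match this depends on their actual miss rates and on how much memory latency the hardware can overlap.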