Reverse Multithreading CPUs

← Back to Stories (view on slashdot.org)

Posted by ryuzaki0 on Tuesday April 18, 2006 @10:26AM from the quick-geordi-reverse-the-polarity dept.

microbee writes "The register is reporting that AMD is researching a new CPU technology called 'reverse multithreading', which essentially does the opposite of hyperthreading in that it presents multiple cores to the OS as a single-core processor." From the article: "The technology is aimed at the next architecture after K8, according to a purported company mole cited by French-language site x86 Secret. It's well known that two CPUs - whether two separate processors or two cores on the same die - don't generate, clock for clock, double the performance of a single CPU. However, by making the CPU once again appear as a single logical processor, AMD is claimed to believe it may be able to double the single-chip performance with a two-core chip or provide quadruple the performance with a quad-core processor."

17 of 263 comments (clear)

Min score:

Reason:

Sort:

Re:I suggest a compromise by Anonymous Coward · 2006-04-18 10:37 · Score: 1, Informative

That may not be as absurd as you first thought...
Re:Scheduling Threads by DerGeist · 2006-04-18 10:42 · Score: 5, Informative

It's not, you're actually losing parallism here. The idea is to hide the multiple processors from the OS and make it think it is scheduling for only one. The OS is so good at single-processor scheduling that allowing the CPUs to take care of who does what will effect better performance than splitting up the tasks among the processors at the OS level.
At least that's the idea. Whether or not it works is yet to be seen.
Sounds a lot like Intel's Mitosis research by Anonymous Coward · 2006-04-18 10:46 · Score: 3, Informative

Despite the lack of details, it sounds quite a bit like Intel's Mitosis research:
http://www.intel.com/technology/magazine/research/ speculative-threading-1205.htm

The article has simulated performance comparisons.

From the article:
"Today we rely on the software developer to express parallelism in the application, or we depend on automatic tools (compilers) to extract this parallelism. These methods are only partially successful. To run RMS workloads and make effective use of many cores, we need applications that are highly parallel almost everywhere. This requires a more radical approach."
occam by EmbeddedJanitor · 2006-04-18 11:01 · Score: 4, Informative

About the best language I've ever seen for multi-threading is occam, the language used with Transputers. occam allows threading to be done as a language primitive. http://en.wikipedia.org/wiki/Occam_programming_lan guage

--
Engineering is the art of compromise.
It's not exactly clear what they have in mind by ameline · 2006-04-18 11:07 · Score: 4, Informative

There are several techniques for increased performance or throughput that the designers of next gen microarchitectures are likely looking at.

There are extensions to known techniques;

A: more execution units, deeper reorder buffers, etc trying to extract more Instruction Level Paralelism (ILP).

B: More cores = more threads

C: hyper threading -- fill in pipeline bubbles in an OOO superscaler architetcure; also = more threads

I personally don't think any of these carry you very far...

Then there are some new ideas:

a: run-ahead threads -- use another core/hyperthread to perform only the work needed to discover what memory accesses are going to be performed and preload them into the cache - mainly a memory latency hiding technique, but that's not a bad thing as there are many codes that are dominated by memory latency

a': More aggressive OoO run-ahead where other latencies are hidden

Intel has published some good papers on these techniques, but according to those papers these techniques help in-order (read Itanic) cores much more than OoO.

b: aggressive peephole optimization (possibly other simple optimizations usually performed by compilers) done on a large trace cache. Macro/micro-op fusion is a very simple and limited start at this sort of thing. (Don't know if this is a good idea or not, or whether anyone is doing it)

But it's far from clear what AMD is doing. Whatever it is, anything that improves single threaded performance will be very welcome. Threading is hard (hard to design, implement, debug, maintain, and hard to QA). And not all code bases or algorithms are amenable to it.

Intels next gen (nahalem) is likely going to do some OoO look-ahead, as they have Andy Glew working on it, and that's been an area of interest to him...

A very interesting new concept is that of "strands" (AKA: dependency chains, traces, or sub-threads). (The idea is instead of scheduling independent instructions, schedule independent dependency chains. - For more info, see http://www.cse.ucsd.edu/users/calder/papers/IPDPS- 05-DCP.pdf)
But it's not clear how well it would apply to OoO architectures, but I would expect that likely approaches would also need large trace caches.

Applying this to an OoO x86 architecture, and detecting the critical strand dynamically in that processor could be very cool, and potentially revolutionary.

It will be very interesting to see what Intel and AMD are up to -- it would be even cooler of they both find different ways to make things go faster...

--
Ian Ameline
1. Re:It's not exactly clear what they have in mind by ameline · 2006-04-19 08:35 · Score: 2, Informative
  
  Where did I find the Evil Research(tm)? Where else but directly from the source of evil -- no, no, not Microsoft, the *other* source of evil -- Intel :-)
  
  It's already in their compiler;
  http://www.intel.com/software/products/compilers/c lin/docs/main_cls/mergedprojects/optaps_cls/common /optaps_pgo_sspopt.htm
  (Their compiler absolutely rocks BTW)
  
  And their excellent paper titled "Speculative Precomputation: Long-range Prefetching of Delinquent Loads" by Jamison Collins, Hong Wang, Dean Tullsen, Christopher Hughes, Yong-Fong Lee, Dan Lavery, and John Shen can be found here;
  http://www.intel.com/research/mrl/library/148_coll ins_j.pdf
  
  (Those damn delinquent loads -- GET OFF OF MY LAWN YOU DELINQUENTS! :-)
  
  There's also
  "Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors" by
  Dongkeun Kim, Steve Shih-wei Liao, Perry Wang, Juan del Cuvillo, Xinmin Tian, Xiang Zou, Hong Wang, Donald Yeung, Milind Girkar, and John Shen which can be found here;
  http://www.cgo.org/cgo2004/papers/02_80_Kim_D_REVI SED.pdf
  
  And also;
  "Speculative Precomputation on Chip Multiprocessors" by Jeffery Brown, Hong Wang, George Chrysos, Perry Wang, and John Shen at;
  http://www.cs.ucsd.edu/~jbrown/papers/sp-cmp.pdf
  
  There, that ought to cure your insomnia and answer your question: "How could a thread possibly be executed far enough in advance to make the time savings worth while, yet be sure that it is "predicting" memory accesses correctly?"
  
  Read the papers carefully -- there will be a quiz later.
  
  --
  Ian Ameline
Re:Software isn't evolving. by DrSkwid · 2006-04-18 11:07 · Score: 3, Informative

Limbo is an example of a CSP programming language. One definitely worth having a look at.

--
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
Academia's been proposing this for awhile by Mifflesticks · 2006-04-18 11:15 · Score: 4, Informative

There are various projects that take differing views about how to do this. One class of such processors are "run-ahead" microprocessors. The idea here is to allow invalid results to be executed but not retired by a second processor running up to a few thousand instructions "ahead" of the processor executing real code to be retired.

There are several variations of this. One is to use the second core to run in advance of the 1st thread, the first thread effectively acting as a dynamic and instruction-driven prefetcher. One such effort includes "slipstreaming" processors, which works by using the advanced stream to "warm up" caches, while the rear stream makes sure the results are accurate, and to dynamically remove unecessary instructions in the advanced stream. Prior, similar research has been done to perform the same work using various forms of multithreading (like HT/SMT, and even coarse-grained multithreading). See the www.cs.ucf.edu/~zhou/dce_pact05.pdf for more details.

Others, such as Dynamic Multithreading techniques take single-threaded code and use hardware to generate other threads from from a single instruction stream. Akkaray (at Intel) and Andy Glew (previously intel, then amd, then...?) have proposed these ideas, as have others. Some call it "Implicit Multithreading".

Now, the register article is so wimpy (as usual) that there's no actual information about what technologies are used, but maybe it's a variation on one of the above.
Shi's law by G3ckoG33k · 2006-04-18 11:15 · Score: 5, Informative

From here:

Researchers in the parallel processing community have been using Amdahl's Law and Gustafson's Law to obtain estimated speedups as measures of parallel program potential. In 1967, Amdahl's Law was used as an argument against massively parallel processing. Since 1988 Gustafson's Law has been used to justify massively parallel processing (MPP). Interestingly, a careful analysis reveals that these two laws are in fact identical. The well publicized arguments were resulted from misunderstandings of the nature of both laws.

This paper establishes the mathematical equivalence between Amdahl's Law and Gustafson's Law. We also focus on an often neglected prerequisite to applying the Amdahl's Law: the serial and parallel programs must compute the same total number of steps for the same input. There is a class of commonly used algorithms for which this prerequisite is hard to satisfy. For these algorithms, the law can be abused. A simple rule is provided to identify these algorithms.

We conclude that the use of the "serial percentage" concept in parallel performance evaluation is misleading. It has caused nearly three decades of confusion in the parallel processing community. This confusion disappears when processing times are used in the formulations. Therefore, we suggest that time-based formulations would be the most appropriate for parallel performance evaluation.
Re:bullshit by tomstdenis · 2006-04-18 11:20 · Score: 3, Informative

For those not in the know... reading a register from core 1 and loading it in core 0 would work like this

1. core 1 issues a store to memory [dozens if not hundreds of cycles]
2. core 0 issues a read, the XBAR realises it owns the address and the SRQ picks up the read
3. core 0 now read a register from core 1

It would be so horribly slow that accessing the L1 data cache as a place to spill would be faster.

The IPC of most applications is less than three and often around one. So more ALU pipes is not what K8 needs. It needs more access to the L1 data cache. Currently it can handle two 64-bit reads or one 64-bit store per cycle. It takes three cycles from issue to fetched.

Most stalls are because of [in order of frequency]

1. Cache hit latency
2. Cache miss latency
3. Decoder stalls (e.g. unaligned reads or instructions which spill over 16 byte boundary)
4. Vectorpath instruction decoding
5. Branch misprediction

AMD making the L1 cache 2 cycle instead of 3 cycle would immediately yield a nice bonus in performance. Unfortunately it's probably not feasible with the current LSU. That is, you can get upto 33% faster in L1 intense code with that change.

But compared to "pairing" a core, die space is better used improving the LSU, adding more pipes to the FPU, etc.

Tom

--
Someday, I'll have a real sig.
Re:Scheduling Threads by Homology · 2006-04-18 11:31 · Score: 3, Informative

The problem with this reasoning is that all contemporary OSes have been designed with multiprocessor machines in mind and are thus not only heavily multithreaded, but also have schedulers designed to detect and take maximum advantage of, multiple CPUs.
A kernel intended to run on a single CPU machine can be made to run faster, partly due to less need to use locks. OpenBSD has offers two kernels for the archs that supports multi CPU: one single CPU kernel, and a multi CPU kernel. The single CPU kernel is faster.
Re:Scheduling Threads by drsmithy · 2006-04-18 11:50 · Score: 3, Informative

A kernel intended to run on a single CPU machine can be made to run faster, partly due to less need to use locks. OpenBSD has offers two kernels for the archs that supports multi CPU: one single CPU kernel, and a multi CPU kernel. The single CPU kernel is faster.
OpenBSD's SMP support is not particularly good, I don't think it's a good example to use for performance comparison purposes.
Re:Isn't that just superscalar? by lbrandy · 2006-04-18 12:14 · Score: 2, Informative

. Last time I heard about that, it was just called "superscalar execution".

That's not quite right, and I think there is alot of misunderstanding going around. So let me tell you what I know about this technology. First of all, the entire idea of having two processors work on a single thread of a program isn't that far-fetched, and has been a topic of research for a long time. What most people don't understand is that, in general, it requires a massive revamp of the instruction set. What happens is you design instructions in very particular ways to maxmimize parallelism, and you also, generally, INCLUDE dependancy information in the instruction. This, essentially, pushes the blind scheduler currently in hardware onto the compiler. However, with this setup, you can generally create computers that get close to x2 speedups with x2 cores. Of course, the real question is, how does a 1x processor compare with and old-style current-instruction-set processor? The answer is almost always, unfavorably. To create such a "parrallel" instruction set, you really end up gimping the instruction set in some ways. There is, of course, always room for improvement.

So, to summerize... there is, in modern compiled software, alot of parrallelism to be taken advantage of.. the problem is that recognizing it is incredibly difficult.. especially on the fly, blindly, and in hardware... So, the future lies, most likely, in a new type of instruction set that is simpler, that pushes the dependancy information onto the compiler. This will make compilers quite a bit more difficult, but allow processors and multi-cores to more effectively split stuff up. This is largely, as I understand it, way off in the distance.. so I think someone, somewhere, got a little excited by the prospects and is pumping this "development". However, I don't think it's quite as ridiculous as most of ya'll believe.
Re:This is Like RAID for CPU's by be-fan · 2006-04-18 13:25 · Score: 2, Informative

Yes, it would be. RAM takes up a lot of area. Have you ever looked at a RAM module? It's made up of 8-16 seperate chips. The densest single RAM chips are on the order of 128MB. Moreover, RAM is manufactured on very different (and much cheaper) processes than CPUs are. Certain types of RAM are compatible with certain CPU processes (eDRAM, for example), but these are not cheap, nor particularly dense.

--
A deep unwavering belief is a sure sign you're missing something...
It's not branch mis-prediction, it's the memory by vlad_petric · 2006-04-18 14:32 · Score: 4, Informative

Current superscalars still fetch instructions in order, and squash everything after a mis-predicted branch. The cost of branch mis-speculation is in fact getting higher, because deeper pipelines means longer times between the mis-prediction of the branch and the execution (where the correction takes place). In other words, it means longer times on the wrong path.
The purpose of "good dispatching" (i.e. out-of-order execution) is to hide the latencies of misses to main memory (it takes between 200 and 400 cycles these days to get something from memory, assuming that the memory bus isn't saturated), by executing instructions following the miss but not dependent on it. Out-of-order execution has been around Pentium Pro, btw.

--
The Raven
1. Re:It's not branch mis-prediction, it's the memory by Anonymous Coward · 2006-04-18 17:42 · Score: 1, Informative
  
  This is a false economy, since the bottleneck in modern CPUs is power. Executing both ways of a branch means you always pay the power cost of a "mispredict" whereas with traditional branch prediction you only pay if you guess wrong.
Re:Software isn't evolving. by 0xABADC0DA · 2006-04-19 03:28 · Score: 2, Informative

The main problem is that the CPUs are only designed for high-level parallelism. So you don't get much benefit from Haskell of Erlang because they could theoretically say "run this loop of 1..10 as two loops 1..5 and 6..10", but in practice doing so would be much slower on today's multi-core multiprocessors due to the setup overhead.

I actually have a brand new dual-core box and the gnome System Monitor shows both cpu's separately. The only time I've seen both cpu's used by a single program it was a Java program. So I don't really understand where you are coming from. Today's multiprocessors are designed for high-level parallelism that the programmer has to declare in some way, and Java is excellent for this (C# to a somewhat lesser degree).

Now a language like Erlang or Scheme could get a huge performance boost from processors like the now-outdated Tera MTA (aka Cray), which was the inspiration for Sun's MAJC processor. These have hardware threading, so the language compiler can break loops out into multiple threads with next to no overhead. This also benefits Java to some extent since it has moderately better rules regarding aliasing for instance than C#. C/C++ can also benefit automatically with 'loose' compilers, macros, and generally a lot of work on the programmer's side.