Reverse Multithreading CPUs

Isn't that just superscalar? by tepples · 2006-04-18 10:29 · Score: 4, Interesting

Multiple cores presented as one sounds familiar. Last time I heard about that, it was just called "superscalar execution". As I understand it, multithreading and multicore were added because CPUs' instruction schedulers were having a hard time extracting parallelism from within a thread.

I suggest a compromise by Quaoar · 2006-04-18 10:30 · Score: 5, Funny

I believe that one and a half cores, sideways-threaded, is the way to go.

--
I'll form my OWN solar system! With blackjack! And hookers!

Re:I suggest a compromise by mctk · 2006-04-18 10:34 · Score: 5, Funny

Of cores!

--
Paul Grosfield - the quicker picker upper.

Scheduling Threads by mikeumass · 2006-04-18 10:31 · Score: 5, Insightful

If the OS scheduler only knows about one core, how in the world would it ever know to set two threads in the execute state simultaneously to take advantage of the extra horsepower? This article is lacking any substantial detail.

Re:Scheduling Threads by DerGeist · 2006-04-18 10:42 · Score: 5, Informative

It's not, you're actually losing parallelism here. The idea is to hide the multiple processors from the OS and make it think it is scheduling for only one. The OS is so good at single-processor scheduling that allowing the CPUs to take care of who does what will effect better performance than splitting up the tasks among the processors at the OS level.
At least that's the idea. Whether or not it works is yet to be seen.

Re:Scheduling Threads by Homology · 2006-04-18 11:31 · Score: 3, Informative

The problem with this reasoning is that all contemporary OSes have been designed with multiprocessor machines in mind and are thus not only heavily multithreaded, but also have schedulers designed to detect and take maximum advantage of, multiple CPUs.
A kernel intended to run on a single CPU machine can be made to run faster, partly due to less need to use locks. OpenBSD offers two kernels for the archs that support multi CPU: one single CPU kernel, and a multi CPU kernel. The single CPU kernel is faster.

Re:Scheduling Threads by drsmithy · 2006-04-18 11:50 · Score: 3, Informative

A kernel intended to run on a single CPU machine can be made to run faster, partly due to less need to use locks. OpenBSD offers two kernels for the archs that support multi CPU: one single CPU kernel, and a multi CPU kernel. The single CPU kernel is faster.
OpenBSD's SMP support is not particularly good; I don't think it's a good example to use for performance comparison purposes.

Re:Scheduling Threads by somersault · 2006-04-18 20:54 · Score: 3, Insightful

I did some rudimentary benchmarking for Debian with UP and SMP kernels (same config except for the SMP option), in each case using only one processor

Why do you think they included 2 different kernels, and how do you expect a kernel that has been optimised for parallelisation to run as well on a single processor? Seems rather trivial to me..

--
which is totally what she said

Huh? by SilentJ_PDX · 2006-04-18 10:32 · Score: 3, Interesting

What's the difference between 'reverse multithreading' (it sounds like having one execution pipeline on a chip with enough hardware for 2 cores) and just adding more Logic/Integer/FP units to a chip?

Re:Huh? by mrscorpio · 2006-04-18 10:34 · Score: 4, Funny

......these amps go to 11!

Sounds familiar by Anonymous Coward · 2006-04-18 10:32 · Score: 5, Funny

Didn't they do this on Star Trek once to get more power or something?

Re:Sounds familiar by aliens · 2006-04-18 11:15 · Score: 4, Funny

No that was Ghostbusters. They crossed the streams.

--
-- taking over the world, we are.

Software isn't evolving. by Anonymous Coward · 2006-04-18 10:38 · Score: 3, Interesting

Part of the problem is that we're still writing software using techniques that were designed for single-processor systems. Languages like C and C++ just aren't suited for writing large distributed and/or concurrent programs. It's a shame to see that even languages like Java and C# only have rudimentary support for such programming.

The future lies not with languages such as Erlang, and Haskell, but likely with languages heavily influenced by them. Erlang is well known for its uses in massively concurrent telephony applications. Programs written in Haskell, and many other pure functional languages, can easily be executed in parallel, without the programmer even having to consider such a possibility.

What is needed is a language that will bring the concepts of Erlang and Haskell together, into a system that can compete head-on with existing technologies. But more importantly, a generation of programmers who came through the ranks without much exposure to the techniques of Haskell and Erlang will need to adapt, or ultimately be replaced. That is the only way that software and hardware will be able to work together to solve the computational problems of tomorrow.

Re:Software isn't evolving. by DrSkwid · 2006-04-18 11:07 · Score: 3, Informative

Limbo is an example of a CSP programming language. One definitely worth having a look at.

--
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter

may not want to go back.. yeah right by igotmybfg · 2006-04-18 10:39 · Score: 5, Funny

However, by the time the technology ships - if it proves real, and ever becomes more than a lab experiment - the software industry will have had several years focusing on multi-threaded apps, and it may not want to go back.

Hah, yeah right, we started parallel programming just this semester and already I want to kill myself. "May not want to go back"? I'd go back in a heartbeat!

Re:may not want to go back.. yeah right by ivan256 · 2006-04-18 10:50 · Score: 3, Insightful

Boy are you screwed.

Even though the trade rags haven't realized it, real life software engineers have been using parallel programming techniques for decades. Sure, apps are optimized for what they run on, so most shrinkwrap software at your local CompUSA probably doesn't have much of that in there, but the author missed the boat already when it comes to "had several years focusing on...".

Better learn to like that parallel programming stuff. It's the way things work.

Gotta love these CPU companies... by __aaclcg7560 · 2006-04-18 10:39 · Score: 5, Funny

First, they get the software industry's licensing panties in a knot because users only want to pay a license fee for one physical chip instead of paying for each processor on the chip. Now, twisting the panties in the other direction, they want to reverse all that by representing multiple processors as one virtual processor. Would that be covered by a multi or single processor license agreement? Do I still get a free wedgie with that one?

Amdahl's Law by overshoot · 2006-04-18 10:41 · Score: 4, Interesting

OK, I know some of the gang doing architecture for AMD and they are damned sharp people.

What I want to know is which of the premises underlying Amdahl's Law they've managed to escape?

--
Lacking <sarcasm> tags, /. substitutes moderation as "Troll."

Re:Amdahl's Law by grumbel · 2006-04-18 10:57 · Score: 3, Interesting

Quick guess:

Amdahl's Law has little impact when the number of cores is small and the available task is "large", as today's multitasking OSs are.

Of course that doesn't mean that AMD will get a 100% improvement, but something close to that might be doable if they can break the tasks at hand into parallel stuff at a much smaller level than threads.
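For anyone who wants to put numbers on this, here is a minimal sketch of Amdahl's Law for a fixed-size workload. The function name and the sample parallel fractions are illustrative assumptions, not anything from the article; the point is only that with two cores the speedup gets close to 2x only when nearly all of the work is parallel.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's Law for a fixed-size workload."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical parallel fractions, chosen only for illustration.
    for p in (0.50, 0.90, 0.99):
        row = [round(amdahl_speedup(p, n), 2) for n in (2, 4, 64)]
        print(f"parallel fraction {p:.2f}: speedup on 2/4/64 cores = {row}")
```

Run it with your own guess for the parallel fraction; the 2-core column is the one that matters for the "close to 100%" claim.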

No, superscalar is different by overshoot · 2006-04-18 10:45 · Score: 5, Interesting

Superscalar refers to having multiple execution paths inside of a single processor, allowing the dispatch of multiple instructions in a single clock cycle. However, the register sets (etc.) maintain a common state (although keeping the out-of-order updates straight sucks a huge amount of complexity and power.)

In this case, AMD appears to be trying to decouple the states enough that the out-of-order resolution doesn't require micromanaging all of the processes from a single control point.

--
Lacking <sarcasm> tags, /. substitutes moderation as "Troll."

Re:No, superscalar is different by RalphTWaP · 2006-04-18 11:00 · Score: 5, Insightful

What AMD appears to be trying isn't the same as superscalar processing, but it might run into a similar problem.

Where superscalar requires a good dispatcher to minimize branch prediction misses, AMD appears to be making decisions, not about dispatch, but about how to do locking of shared memory (think critical sections).

Critical section prediction might prove less expensive than branch prediction in practice even if they are similar in theory (http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html shows the problem, which already is an issue on 64-bit hardware).

Sounds a lot like Intel's Mitosis research by Anonymous Coward · 2006-04-18 10:46 · Score: 3, Informative

Despite the lack of details, it sounds quite a bit like Intel's Mitosis research:
http://www.intel.com/technology/magazine/research/speculative-threading-1205.htm

The article has simulated performance comparisons.

From the article:
"Today we rely on the software developer to express parallelism in the application, or we depend on automatic tools (compilers) to extract this parallelism. These methods are only partially successful. To run RMS workloads and make effective use of many cores, we need applications that are highly parallel almost everywhere. This requires a more radical approach."

Similar to MacOSRumors rumor by salimma · 2006-04-18 10:46 · Score: 3, Insightful

.. in this post they reported on a project supposedly aiming at breaking down single threads into multiple threads so as to better utilize cores beyond the fourth.

It supposedly involves Intel. I personally think both rumors are just that, but the timing is curious. Same source behind both? AMD PR people not wanting to lose out in imaginary rumored technology to Intel?

--
Michel
Fedora Project Contribut

I know... by Expert+Determination · 2006-04-18 10:47 · Score: 4, Funny

Hyperthreading makes one core look like two. Reverse hyperthreading makes two cores look like one. So if we chain reverse hyperthreading with hyperthreading we can make one core look like one core but have twice as many features for the marketing department to brag about.

--
"The White House is not an intelligence-gathering agency," -- Scott McClellan, Whitehouse spokesman.

Re:I know... by barracg8 · 2006-04-18 12:31 · Score: 4, Insightful

Ironic that this post is modded funny, since I think it might be closest to the mark.
I'd suggest x86-secret & the Reg have got the wrong end of the stick here. SMT is running two threads on one core - try taking "reverse hyperthreading" literally. I'd suggest that AMD are looking at running the one same thread in lock-step on two cores simultaneously. This is not about performance, it is about reliability - AMD looking at the market for big iron (running execution cores in lock-step is the kind of hardware reliability you are looking at on mainframe systems).
The behaviour of a CPU core should be completely deterministic. If the two cores are booted up on the same cycle they should make the same set of I/O requests at the same point, and so long as the system interface satisfies these requests identically and on the same cycle, then the cores should have no reason not to remain in sync with each other until the next point that they both should put out the next, identical pair of I/O requests. If the cores ever get out of sync with each other, this indicates an error.
Just speculation of course, but I seem to recall AMD looking into this having been rumoured previously.
G.
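The lock-step idea can be sketched in a few lines. This is a toy model under loud assumptions: the "core" is just a deterministic running sum, its output stands in for an I/O request, and `fault_at` is a hypothetical knob that flips one bit to simulate a hardware upset. Nothing here reflects how AMD or any mainframe actually implements it.

```python
from typing import Iterator, List, Optional


def core(program: List[int], fault_at: Optional[int] = None) -> Iterator[int]:
    """Toy deterministic core: yields a running sum as its 'I/O request'."""
    acc = 0
    for step, value in enumerate(program):
        acc += value
        if step == fault_at:
            acc ^= 1  # hypothetical single-bit upset in this core only
        yield acc


def run_lockstep(program: List[int], fault_at: Optional[int] = None) -> Optional[int]:
    """Run two cores on the same program and compare their outputs each step.

    Returns the first step at which the outputs differ, or None if the
    cores stayed in sync for the whole program.
    """
    healthy = core(program)
    suspect = core(program, fault_at)
    for step, (out_a, out_b) in enumerate(zip(healthy, suspect)):
        if out_a != out_b:
            return step
    return None


if __name__ == "__main__":
    print(run_lockstep([1, 2, 3, 4]))              # cores agree throughout
    print(run_lockstep([1, 2, 3, 4], fault_at=2))  # divergence is detected
```

Note that a comparator with only two cores can detect the disagreement but cannot tell which core is the faulty one.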

occam by EmbeddedJanitor · 2006-04-18 11:01 · Score: 4, Informative

About the best language I've ever seen for multi-threading is occam, the language used with Transputers. occam allows threading to be done as a language primitive. http://en.wikipedia.org/wiki/Occam_programming_language

--
Engineering is the art of compromise.

It's not exactly clear what they have in mind by ameline · 2006-04-18 11:07 · Score: 4, Informative

There are several techniques for increased performance or throughput that the designers of next gen microarchitectures are likely looking at.

There are extensions to known techniques;

A: more execution units, deeper reorder buffers, etc trying to extract more Instruction Level Parallelism (ILP).

B: More cores = more threads

C: hyper threading -- fill in pipeline bubbles in an OOO superscalar architecture; also = more threads

I personally don't think any of these carry you very far...

Then there are some new ideas:

a: run-ahead threads -- use another core/hyperthread to perform only the work needed to discover what memory accesses are going to be performed and preload them into the cache - mainly a memory latency hiding technique, but that's not a bad thing as there are many codes that are dominated by memory latency

a': More aggressive OoO run-ahead where other latencies are hidden

Intel has published some good papers on these techniques, but according to those papers these techniques help in-order (read Itanic) cores much more than OoO.

b: aggressive peephole optimization (possibly other simple optimizations usually performed by compilers) done on a large trace cache. Macro/micro-op fusion is a very simple and limited start at this sort of thing. (Don't know if this is a good idea or not, or whether anyone is doing it)

But it's far from clear what AMD is doing. Whatever it is, anything that improves single threaded performance will be very welcome. Threading is hard (hard to design, implement, debug, maintain, and hard to QA). And not all code bases or algorithms are amenable to it.

Intel's next gen (Nehalem) is likely going to do some OoO look-ahead, as they have Andy Glew working on it, and that's been an area of interest to him...

A very interesting new concept is that of "strands" (AKA: dependency chains, traces, or sub-threads). (The idea is instead of scheduling independent instructions, schedule independent dependency chains. - For more info, see http://www.cse.ucsd.edu/users/calder/papers/IPDPS-05-DCP.pdf)
It's not clear how well it would apply to OoO architectures, but I would expect that likely approaches would also need large trace caches.

Applying this to an OoO x86 architecture, and detecting the critical strand dynamically in that processor could be very cool, and potentially revolutionary.

It will be very interesting to see what Intel and AMD are up to -- it would be even cooler if they both find different ways to make things go faster...

--
Ian Ameline

But what I really want to know... by Joebert · 2006-04-18 11:10 · Score: 4, Interesting

Is Microsoft going to recognise this contraption as a single, or multi-license-able processor ?

And

Will AMD only hide the fact there's multi-cores from Operating systems other than Microsoft ?

--
Wanna fight ? Bend over, stick your head up your ass, and fight for air.

Academia's been proposing this for awhile by Mifflesticks · 2006-04-18 11:15 · Score: 4, Informative

There are various projects that take differing views about how to do this. One class of such processors are "run-ahead" microprocessors. The idea here is to allow invalid results to be executed but not retired by a second processor running up to a few thousand instructions "ahead" of the processor executing real code to be retired.

There are several variations of this. One is to use the second core to run in advance of the 1st thread, the first thread effectively acting as a dynamic and instruction-driven prefetcher. One such effort includes "slipstreaming" processors, which works by using the advanced stream to "warm up" caches, while the rear stream makes sure the results are accurate, and to dynamically remove unnecessary instructions in the advanced stream. Prior, similar research has been done to perform the same work using various forms of multithreading (like HT/SMT, and even coarse-grained multithreading). See www.cs.ucf.edu/~zhou/dce_pact05.pdf for more details.
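To make the cache warm-up effect concrete, here is a toy simulation. Everything in it is an assumption for illustration: one memory access per time step, a fixed fetch `latency`, an unbounded cache, and a leading stream that stays exactly `distance` accesses ahead. It is not a model of any real slipstream or run-ahead design.

```python
from typing import Dict, List


def count_misses(addresses: List[int], distance: int = 0, latency: int = 4) -> int:
    """Count misses seen by the trailing stream.

    distance: how many accesses ahead the leading stream runs (0 = no run-ahead).
    latency:  time steps between issuing a fetch and the line being usable.
    """
    ready_at: Dict[int, float] = {}  # address -> time step at which the line is usable
    misses = 0
    for now, addr in enumerate(addresses):
        ahead = now + distance
        if distance > 0 and ahead < len(addresses):
            # Leading stream issues the fetch early; keep the earliest ready time.
            ready_at.setdefault(addresses[ahead], now + latency)
        if ready_at.get(addr, float("inf")) > now:
            misses += 1           # line absent or still in flight
            ready_at[addr] = now  # demand fetch fills the line
    return misses


if __name__ == "__main__":
    trace = list(range(16))  # hypothetical trace of 16 distinct lines
    for d in (0, 2, 8):
        print(f"run-ahead distance {d}: {count_misses(trace, distance=d)} misses")
```

The sketch shows the one property that matters: running ahead only helps once the lead is longer than the memory latency it is trying to hide.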

Others, such as Dynamic Multithreading techniques, take single-threaded code and use hardware to generate other threads from a single instruction stream. Akkary (at Intel) and Andy Glew (previously Intel, then AMD, then...?) have proposed these ideas, as have others. Some call it "Implicit Multithreading".

Now, the register article is so wimpy (as usual) that there's no actual information about what technologies are used, but maybe it's a variation on one of the above.

Shi's law by G3ckoG33k · 2006-04-18 11:15 · Score: 5, Informative

From here:

Researchers in the parallel processing community have been using Amdahl's Law and Gustafson's Law to obtain estimated speedups as measures of parallel program potential. In 1967, Amdahl's Law was used as an argument against massively parallel processing. Since 1988 Gustafson's Law has been used to justify massively parallel processing (MPP). Interestingly, a careful analysis reveals that these two laws are in fact identical. The well publicized arguments were resulted from misunderstandings of the nature of both laws.

This paper establishes the mathematical equivalence between Amdahl's Law and Gustafson's Law. We also focus on an often neglected prerequisite to applying the Amdahl's Law: the serial and parallel programs must compute the same total number of steps for the same input. There is a class of commonly used algorithms for which this prerequisite is hard to satisfy. For these algorithms, the law can be abused. A simple rule is provided to identify these algorithms.

We conclude that the use of the "serial percentage" concept in parallel performance evaluation is misleading. It has caused nearly three decades of confusion in the parallel processing community. This confusion disappears when processing times are used in the formulations. Therefore, we suggest that time-based formulations would be the most appropriate for parallel performance evaluation.
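The claimed equivalence is easy to check numerically when both laws are written in terms of times rather than percentages. The sketch below assumes a perfectly parallel portion and the same total number of steps in the serial and parallel runs (the prerequisite the abstract mentions); the function and variable names are mine, not the paper's.

```python
from typing import Tuple


def speedups(t_serial: float, t_parallel: float, n: int) -> Tuple[float, float, float]:
    """Return (direct, Amdahl, Gustafson) speedups for the same timings.

    t_serial:   time spent in the serial part (same on 1 or n processors)
    t_parallel: time spent in the parallel part when run on n processors
    """
    time_on_n = t_serial + t_parallel        # wall time on n processors
    time_on_1 = t_serial + n * t_parallel    # same work on one processor
    direct = time_on_1 / time_on_n

    # Amdahl: serial fraction measured on the one-processor run.
    s_amdahl = t_serial / time_on_1
    amdahl = 1.0 / (s_amdahl + (1.0 - s_amdahl) / n)

    # Gustafson: serial fraction measured on the n-processor run.
    s_gustafson = t_serial / time_on_n
    gustafson = s_gustafson + (1.0 - s_gustafson) * n

    return direct, amdahl, gustafson


if __name__ == "__main__":
    print(speedups(t_serial=1.0, t_parallel=9.0, n=8))
```

All three numbers agree; the two laws differ only in which run the "serial percentage" is measured on, which is the source of the confusion described above.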

Re:bullshit by tomstdenis · 2006-04-18 11:20 · Score: 3, Informative

For those not in the know... reading a register from core 1 and loading it in core 0 would work like this

1. core 1 issues a store to memory [dozens if not hundreds of cycles]
2. core 0 issues a read, the XBAR realises it owns the address and the SRQ picks up the read
3. core 0 has now read a register from core 1

It would be so horribly slow that accessing the L1 data cache as a place to spill would be faster.

The IPC of most applications is less than three and often around one. So more ALU pipes is not what K8 needs. It needs more access to the L1 data cache. Currently it can handle two 64-bit reads or one 64-bit store per cycle. It takes three cycles from issue to fetched.

Most stalls are because of [in order of frequency]

1. Cache hit latency
2. Cache miss latency
3. Decoder stalls (e.g. unaligned reads or instructions which spill over 16 byte boundary)
4. Vectorpath instruction decoding
5. Branch misprediction

AMD making the L1 cache 2 cycle instead of 3 cycle would immediately yield a nice bonus in performance. Unfortunately it's probably not feasible with the current LSU. That is, you can get up to 33% faster in L1 intense code with that change.
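The 33% figure is the upper bound you get if all of the run time is L1 hit latency. A rough sketch of the arithmetic, assuming (my simplification, not a measured model of K8) that the L1-bound share of run time scales linearly with hit latency and nothing else changes:

```python
def runtime_reduction(l1_bound_fraction: float,
                      old_latency: int = 3,
                      new_latency: int = 2) -> float:
    """Fraction of total run time removed by a shorter L1 hit latency.

    l1_bound_fraction: share of run time spent waiting on L1 hits (hypothetical input).
    """
    return l1_bound_fraction * (old_latency - new_latency) / old_latency


if __name__ == "__main__":
    for share in (1.0, 0.5, 0.2):
        saved = runtime_reduction(share)
        print(f"{share:.0%} L1-bound: run time drops by {saved:.1%}")
```

Code that spends less of its time waiting on L1 hits sees proportionally less benefit.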

But compared to "pairing" a core, die space is better used improving the LSU, adding more pipes to the FPU, etc.

Tom

--
Someday, I'll have a real sig.

Not True! by Gorimek · 2006-04-18 11:21 · Score: 4, Funny

We have always been at war with hyperthreading!

Speculative Multithreading by DrDitto · 2006-04-18 11:45 · Score: 4, Interesting

This was proposed in academia over 10 years ago. It's called speculative multithreading, or "multiscalar" as coined by one of the primary inventors at the University of Wisconsin (Guri Sohi).

Basically the processor will try to split a program into multiple threads of execution, but make it appear as a single thread. For example, when calling a function, execute that function on a different thread and automatically shuttle dependent data back/forth between the callee and the caller.

Re:multi cpu by cgenman · 2006-04-18 11:57 · Score: 3, Insightful

I'm guessing economic reasons push harder than technical ones.

Sony already assumes that their PS3 chips will have a fault in one of the cores, and simply lock off that section when one is found. One fault no longer kills a chip, though two can render the power unacceptably low.

The cool thing is this scales. If you have a 10cm^2 chip, traditionally your chance of perfection is 1/4th that of a 5cm chip, cutting your yield drastically. But if you have 6 cores on a chip with one dead one, and you want to go to 12, you should get a similar yield for a proportionally similar amount of dead cores.

Cores let you limit damage from manufacturing errors, letting you build bigger chips more cheaply. At least, that's my layman's understanding.
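A back-of-the-envelope way to check the scaling claim, assuming the simple Poisson yield model (probability that an area is defect-free is exp(-defect density x area)) and independent failures per core. The defect density, core area, and per-core probability below are made-up numbers, not Sony's or anyone else's.

```python
import math


def core_good_probability(defects_per_cm2: float, core_area_cm2: float) -> float:
    """Poisson yield model: chance that one core has no defects."""
    return math.exp(-defects_per_cm2 * core_area_cm2)


def usable_yield(cores: int, spares: int, p_good: float) -> float:
    """Chance that at most `spares` of `cores` are dead (binomial sum)."""
    return sum(
        math.comb(cores, dead) * (1.0 - p_good) ** dead * p_good ** (cores - dead)
        for dead in range(spares + 1)
    )


if __name__ == "__main__":
    p = 0.9  # hypothetical per-core probability of being defect-free
    print("6 cores, all must work:   ", round(usable_yield(6, 0, p), 3))
    print("6 cores, 1 may be dead:   ", round(usable_yield(6, 1, p), 3))
    print("12 cores, all must work:  ", round(usable_yield(12, 0, p), 3))
    print("12 cores, 2 may be dead:  ", round(usable_yield(12, 2, p), 3))
```

With these made-up numbers the "one dead in six" and "two dead in twelve" cases come out close to each other, while the chip that needs every core working gets much worse as it grows.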

--
The ______ Agenda

Load balancing might be interesting by Mia'cova · 2006-04-18 11:57 · Score: 4, Insightful

It might be interesting if they took this idea in a slightly different direction. Set it up so the OS detects two CPUs. But, when the OS fails to utilize both CPUs effectively, allow the idle CPU to take some of the active CPU's load. I'm taking this idea from nVidia working on load balancing between graphics and physics in a SLI setup. So in this case the OS gets the best of both worlds, the ability to break tasks off to each CPU and a free boost when it's stuck with a single cpu-limited thread.

This is Like RAID for CPU's by Marc_Hawke · 2006-04-18 11:57 · Score: 4, Interesting

Striping: What is that? Raid 1? Raid 0? You take multiple disks, present them as one, and let the controller make the most efficient use of them while the OS and all the programs just have to deal with one big disk.

Looks like the same thing. You take multiple CPUs, present them as one, and let the controller figure out how to best use them.
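For reference, striping is RAID 0 (RAID 1 is mirroring). The mapping a striping controller does can be sketched like this, assuming one block per stripe unit and a hypothetical function name; real controllers use larger stripe units, but the idea is the same.

```python
from typing import Tuple


def locate_block(logical_block: int, disks: int) -> Tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk)."""
    return logical_block % disks, logical_block // disks


if __name__ == "__main__":
    # The OS sees blocks 0..5 on "one big disk"; the controller spreads them out.
    for block in range(6):
        disk, offset = locate_block(block, disks=3)
        print(f"logical block {block} -> disk {disk}, offset {offset}")
```

The disk case is easy because consecutive blocks are independent of each other. Consecutive instructions usually are not, which is where the CPU analogy gets hard.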

This could make for hot-swappable CPUs (heh) and the ability to have a CPU die without taking out your system. The redundancy nature of the other RAID configurations doesn't seem to translate very easily, but the 'encapsulation' concept seems to fit nicely.

--
--Welcome to the Realm of the Hawke--

It's not branch mis-prediction, it's the memory by vlad_petric · 2006-04-18 14:32 · Score: 4, Informative

Current superscalars still fetch instructions in order, and squash everything after a mis-predicted branch. The cost of branch mis-speculation is in fact getting higher, because deeper pipelines means longer times between the mis-prediction of the branch and the execution (where the correction takes place). In other words, it means longer times on the wrong path.

The purpose of "good dispatching" (i.e. out-of-order execution) is to hide the latencies of misses to main memory (it takes between 200 and 400 cycles these days to get something from memory, assuming that the memory bus isn't saturated), by executing instructions following the miss but not dependent on it. Out-of-order execution has been around since the Pentium Pro, btw.

--

The Raven

Or maybe predicting both branches? by silverdirk · 2006-04-18 18:18 · Score: 5, Interesting

As one reply stated, you can't know which is right unless you had 3 cores.

But, with two cores, you could have a way to predict "branch" and "not branch" at every prediction spot. The core that gets it right sends the registers to the other core so they can continue as if every branch were predicted correctly...

That would only work if you had a nice fast way to copy registers across in a very small number of clock cycles... so again, just a bunch of speculation. But it was a neat enough idea I had to say it.
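In software terms the idea looks roughly like the sketch below: run both arms of the branch at once on private copies of the "registers", then keep whichever result matches the condition. This is only an analogy, assuming both arms are free of side effects; the function names are hypothetical, threads stand in for cores, and the dictionary copy stands in for the register transfer that would have to be very cheap in hardware.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Registers = Dict[str, int]


def eager_branch(condition: Callable[[Registers], bool],
                 taken: Callable[[Registers], Registers],
                 not_taken: Callable[[Registers], Registers],
                 registers: Registers) -> Registers:
    """Execute both branch arms speculatively and keep the correct one."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Each "core" gets its own copy of the register state.
        future_taken = pool.submit(taken, dict(registers))
        future_not_taken = pool.submit(not_taken, dict(registers))
        # Resolve the branch while both arms are already running.
        winner = future_taken if condition(registers) else future_not_taken
        return winner.result()  # the losing arm's result is simply discarded


if __name__ == "__main__":
    regs = {"x": 5}
    result = eager_branch(
        condition=lambda r: r["x"] > 3,
        taken=lambda r: {**r, "y": r["x"] * 2},
        not_taken=lambda r: {**r, "y": 0},
        registers=regs,
    )
    print(result)
```

Half of the work is always thrown away, so this trades power and a second core for never paying a misprediction penalty on that branch.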

--
Mark of the Coder fades from you. You perform Opening on World of Warcraft. Warcraft crits GPA for 4. GPA dies.
