Reverse Multithreading CPUs

Isn't that just superscalar? by tepples · 2006-04-18 10:29 · Score: 4, Interesting

Multiple cores presented as one sounds familiar. Last time I heard about that, it was just called "superscalar execution". As I understand it, multithreading and multicore were added because CPUs' instruction schedulers were having a hard time extracting parallelism from within a thread.

Re:Isn't that just superscalar? by JordanL · 2006-04-18 10:41 · Score: 0, Offtopic

Too late. I've already patented it.

--
FanFictionRecs.net
Re:Isn't that just superscalar? by lbrandy · 2006-04-18 12:14 · Score: 2, Informative

Last time I heard about that, it was just called "superscalar execution".

That's not quite right, and I think there is a lot of misunderstanding going around. So let me tell you what I know about this technology. First of all, the entire idea of having two processors work on a single thread of a program isn't that far-fetched, and has been a topic of research for a long time. What most people don't understand is that, in general, it requires a massive revamp of the instruction set. What happens is you design instructions in very particular ways to maximize parallelism, and you also, generally, INCLUDE dependency information in the instruction. This, essentially, pushes the job of the blind scheduler currently in hardware onto the compiler. However, with this setup, you can generally create computers that get close to 2x speedups with 2x cores. Of course, the real question is, how does a 1x processor compare with an old-style current-instruction-set processor? The answer is almost always, unfavorably. To create such a "parallel" instruction set, you really end up gimping the instruction set in some ways. There is, of course, always room for improvement.

So, to summarize... there is, in modern compiled software, a lot of parallelism to be taken advantage of. The problem is that recognizing it is incredibly difficult, especially on the fly, blindly, and in hardware. So, the future lies, most likely, in a new type of instruction set that is simpler and that pushes the dependency information onto the compiler. This will make compilers quite a bit more difficult to write, but allow processors and multi-cores to more effectively split stuff up. This is largely, as I understand it, way off in the distance, so I think someone, somewhere, got a little excited by the prospects and is pumping this "development". However, I don't think it's quite as ridiculous as most of y'all believe.
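
To make the "dependency information in the instruction" idea concrete, here is a rough sketch, not anything taken from the article or from a real instruction set. It assumes a made-up program where every instruction lists the instructions whose results it needs, and groups them into bundles that could issue side by side. The names `issue_groups` and `program` are invented for the example.

```python
from typing import Dict, List, Set


def issue_groups(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group instructions into bundles that could issue in parallel.

    deps maps each instruction name to the set of instructions whose
    results it needs (the kind of info a compiler could encode).
    """
    remaining = {name: set(needs) for name, needs in deps.items()}
    done: Set[str] = set()
    groups: List[List[str]] = []
    while remaining:
        # Everything whose inputs are already computed can run together.
        ready = sorted(n for n, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError("cyclic or missing dependency")
        groups.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return groups


# Hypothetical four-instruction program: a and b are independent,
# c needs both of them, d needs only a.
program = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"a"}}
print(issue_groups(program))
```

With the dependencies spelled out, the grouping is a trivial loop. The hard part the comment describes is discovering those dependencies blindly in hardware, which this sketch skips entirely.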
Re:Isn't that just superscalar? by modecx · 2006-04-18 13:40 · Score: 1

You know, all we need is for AMD to buy SGI!

--
Constitutional rights may be respected, repealed, or modified; but they must never be ignored.
Re:Isn't that just superscalar? by Chris+Snook · 2006-04-18 18:49 · Score: 1

No, this is super-duper-scalar!

--
There's no failure quite as dissatisfying as a complete and total solution to the wrong problem.
Re:Isn't that just superscalar? by Sigg3.net · 2006-04-19 02:15 · Score: 1

Multiple cores presented as one sounds familiar.

It does sound like 'multiple whores fermented for fun', doesn't it?

Yes. But Voodoo5 went bankrupt. They knew they wouldn't cover the costs if they didn't get to bundle Duke Nukem Forever demo with the card.

--
Defining Statistics and Social Research

I suggest a compromise by Quaoar · 2006-04-18 10:30 · Score: 5, Funny

I believe that one and a half cores, sideways-threaded, is the way to go.

--
I'll form my OWN solar system! With blackjack! And hookers!

Re:I suggest a compromise by mctk · 2006-04-18 10:34 · Score: 5, Funny

Of cores!

--
Paul Grosfield - the quicker picker upper.
Re:I suggest a compromise by Anonymous Coward · 2006-04-18 10:37 · Score: 1, Informative

That may not be as absurd as you first thought...
Re:I suggest a compromise by PunTrollCritic · 2006-04-18 10:42 · Score: 2, Funny

Score -1 Punny
Re:I suggest a compromise by Anonymous Coward · 2006-04-18 19:17 · Score: 0

I believe that one and a half cores, sideways-threaded, is the way to go.

Right or left handed threads?
Re:I suggest a compromise by fdisk3hs · 2006-04-19 08:50 · Score: 1

Sideways threading is so March 2006. Everyone knows that the Infinite Improbability Threading is the wave of the future. Every instruction passes through every register in the Universe simultaneously, for infinite computational power, without all that tedious mucking about with three dimensional space-time.

Scheduling Threads by mikeumass · 2006-04-18 10:31 · Score: 5, Insightful

If the OS scheduler only knows about one core, how in the world would it ever know to set two threads in the execute state simultaneously to take advantage of the extra horsepower? This article is lacking any substantial detail.

Re:Scheduling Threads by WindBourne · 2006-04-18 10:41 · Score: 2, Insightful

It won't. But that is not the problem. They are moving the selection down below and making the access to memory and other resources the issues. It is possible that this will increase the overall system performance.

--
I prefer the "u" in honour as it seems to be missing these days.
Re:Scheduling Threads by DerGeist · 2006-04-18 10:42 · Score: 5, Informative

It's not, you're actually losing parallelism here. The idea is to hide the multiple processors from the OS and make it think it is scheduling for only one. The OS is so good at single-processor scheduling that allowing the CPUs to take care of who does what will effect better performance than splitting up the tasks among the processors at the OS level.
At least that's the idea. Whether or not it works is yet to be seen.
Re:Scheduling Threads by drsmithy · 2006-04-18 10:54 · Score: 1

The OS is so good at single-processor scheduling that allowing the CPUs to take care of who does what will effect better performance than splitting up the tasks among the processors at the OS level.
The problem with this reasoning is that all contemporary OSes have been designed with multiprocessor machines in mind and are thus not only heavily multithreaded, but also have schedulers designed to detect and take maximum advantage of, multiple CPUs.
I'm highly sceptical that any CPU can do a better job than the OS of scheduling when it has (relatively speaking) only an extremely limited view of what the machine's state is (not to mention the user's intentions), so I can't see this doing more help than harm, but I'm always willing to wait and see.
Re:Scheduling Threads by misleb · 2006-04-18 10:59 · Score: 1

But I don't understand how the CPU can split a single process/thread among cores without the same problems encountered with superscalar architectures.

-matthew

--
"THERE IS NO JUSTICE, THERE IS ONLY ME." -Death
Re:Scheduling Threads by giminy · 2006-04-18 11:15 · Score: 1

The OS is so good at single-processor scheduling that allowing the CPUs to take care of who does what will effect better performance than splitting up the tasks among the processors at the OS level.

I thought OSes were only so good at multiprocessor scheduling because things can only be done in parallel to a certain level of granularity -- data dependencies, data locking, and other problems cause stalls in how well multithreading can work.

I guess what we're all trying to figure out: how does 'figuring out what stuff should be in what thread at the hardware level instead of at the OS level' work? Logic is logic, whether it's done in hardware or software -- if a set of operations simply can't be broken down into separate threads and scheduled separately in software, how do we expect to break down the logically equivalent set of operations into threads in hardware?

--
The Right Reverend K. Reid Wightman,
Re:Scheduling Threads by Boronx · 2006-04-18 11:23 · Score: 0

This is just a guess, but aren't some processors able to change the order of instructions? Some subset of these instructions must be interchangeable because they are independent of one another, and these might as well be run in parallel.

--
Play Command HQ online
Re:Scheduling Threads by Homology · 2006-04-18 11:31 · Score: 3, Informative

The problem with this reasoning is that all contemporary OSes have been designed with multiprocessor machines in mind and are thus not only heavily multithreaded, but also have schedulers designed to detect and take maximum advantage of, multiple CPUs.
A kernel intended to run on a single CPU machine can be made to run faster, partly due to less need to use locks. OpenBSD offers two kernels for the archs that support multiple CPUs: a single CPU kernel and a multi CPU kernel. The single CPU kernel is faster.
Re:Scheduling Threads by drsmithy · 2006-04-18 11:50 · Score: 3, Informative

A kernel intended to run on a single CPU machine can be made to run faster, partly due to less need to use locks. OpenBSD offers two kernels for the archs that support multiple CPUs: a single CPU kernel and a multi CPU kernel. The single CPU kernel is faster.
OpenBSD's SMP support is not particularly good, so I don't think it's a good example to use for performance comparison purposes.
Re:Scheduling Threads by diegocgteleline.es · 2006-04-18 11:52 · Score: 1

Export that info somewhere. Once the CPU scheduler knows what features the CPU has, it can start making decisions optimized for that CPU. 2.6.17 will feature a new "scheduler domain" which optimizes scheduling decisions for multi-core CPUs, for example.

Of course you could choose not to export that info and let the CPU do it transparently, but does that make any sense at all? Now that cores are becoming so important you may end up with more than one CPU, each with a different number of cores, and the OS wants to know that.
Re:Scheduling Threads by misleb · 2006-04-18 11:53 · Score: 1

Right, this is currently done inside modern CPUs. The reason for doing hyperthreading (single CPU presented as 2) was because of the trouble of parallelizing single threads. The idea is that the OS knows better than the CPU what can run in parallel. But, I guess if you are targeting home users who generally only run one program at a time...

-matthew

--
"THERE IS NO JUSTICE, THERE IS ONLY ME." -Death
Re:Scheduling Threads by Theatetus · 2006-04-18 12:06 · Score: 1

If the OS scheduler only knows about one core, how in the world would it ever know to set two threads in the execute state simultaneously to take advantage of the extra horsepower

It won't and that doesn't matter. When will you SMP fetishists learn that two simultaneous threads won't each be running on their own CPU? If you have two threads (or processes) running, that doesn't mean that each gets its own CPU; they'll share the 2 CPUs along with the dozens of other running processes. If AMD is right that they can get the scheduling overhead lower with this than you can get the scheduling overhead for SMP, then you'll see performance improvements.

--
All's true that is mistrusted
Re:Scheduling Threads by ttfkam · 2006-04-18 12:49 · Score: 1

Perhaps for the same reasons that some folks use stored procedures in databases rather than sending a series of queries and responses over the wire. A highly tuned CPU and northbridge chipset combination may be able to perform functions with faster timings than a somewhat more generic OS-level version by reducing the number of roundtrips in and out of the CPU and memory subsystems.

And hardware can indeed be faster than the equivalent software. Witness the rise of the GPU. A dedicated hardware item like a GPU will run (and does run) circles around the software equivalent running on a general purpose CPU.

Never underestimate the potential of the transistor. ;-)

--

- I don't need to go outside, my CRT tan'll do me just fine.
Re:Scheduling Threads by bleak+sky · 2006-04-18 13:07 · Score: 1

This is not specific to OpenBSD. I did some rudimentary benchmarking for Debian with UP and SMP kernels (same config except for the SMP option), in each case using only one processor, and found between a 15% and 30% performance hit depending on hardware configuration.

The issue was that Debian was (probably still is) considering not shipping any UP kernels, since it's kind of a pain to maintain a UP and SMP flavor for each kernel configuration. It turns out the performance hit is still big enough that, except on architectures where uniprocessor models are rare, it still makes sense to ship a UP kernel.

See http://movingsucks.org/benchmark if you're interested.
Re:Scheduling Threads by skwirlmaster · 2006-04-18 13:09 · Score: 1

You're correct, OpenBSD has primitive SMP support. However it seems the point was to show an example of an OS that would perform better without dealing with SMP.

--
My inner self is ineffable, so don't eff with me.
Re:Scheduling Threads by macshit · 2006-04-18 19:53 · Score: 1

This is not specific to OpenBSD.

There is always some overhead related to MP support, and there will always be benchmarks which can show its worst effect, but OpenBSD is quite well known for being far worse in this respect than other mainstream FOSS kernels. Linux in particular has spent vast amounts of effort optimizing the MP case, and there are many core linux hackers who have very strong MP experience from IBM, SGI, etc.

--
We live, as we dream -- alone....
Re:Scheduling Threads by somersault · 2006-04-18 20:54 · Score: 3, Insightful

I did some rudimentary benchmarking for Debian with UP and SMP kernels (same config except for the SMP option), in each case using only one processor

Why do you think they included 2 different kernels, and how do you expect a kernel that has been optimised for parallelisation to run as well on a single processor? Seems rather trivial to me..

--
which is totally what she said
Re:Scheduling Threads by somersault · 2006-04-18 20:58 · Score: 1

Would the problem there not be that you'll just be adding complexity to the processor, so it will be more difficult to achieve high clock speeds for a CISC as opposed to a RISC processor?

--
which is totally what she said
Re:Scheduling Threads by TheRaven64 · 2006-04-18 22:26 · Score: 2

Actually, it is, for exactly that reason. OpenBSD's current SMP support uses the simplest possible approach; put the entire kernel inside a big giant lock. This means that every system call has an implicit get-lock operation at the start and a release-lock operation at the end. This is the best possible case performance for a SMP-capable kernel when running on a single CPU.
Consider a write operation. With the OpenBSD kernel, you make the system call, lock the kernel, run to completion, unlock the kernel and return. With a kernel that supports finer-grained locking then you might have to lock (and later unlock) the VFS layer, the filesystem driver and the disk driver. This gives three times as many lock operations, which makes the whole thing slower. The advantage, of course, is that three processes on different CPUs can be completing parts of a write operation simultaneously. As with so many other things, you are trading speed for scalability; take a 10%[1] performance hit in exchange for a 50%[1] greater performance gain when adding another CPU. Until you run into Amdahl's law, of course.
Some kernels, such as Linux, make extensive use of spin-locks in the kernel. These are much cheaper in the case when you can gain the lock immediately, but much more expensive when you can't. On a large SMP system, this doesn't matter since the CPU running the thread doesn't have anything better to do with its time than spin. On a small SMP or UP system, that spinning takes time that could have been used for other processes (especially ones that are not system-call bound).
On a single processor machine, you can guarantee that only a single system call will be executing at a time[2] and so you don't need to do any locking of the kernel, making it even faster. For a 2-4 processor system, the big giant lock approach is often faster, particularly if you are not running many system calls. My ad-hoc testing shows that most processes spend about 5-10% (often less) of their time making system calls. A big giant lock allows the CPU(s) to spend more time executing the userspace code and less time worrying about locking.

[1] This number was pulled out of the air.
[2] As long as your system calls are not pre-emptible. This is true on a traditional UNIX kernel, but not on Windows. Modern UNIX-like systems vary in their approach.
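
The lock-count argument above can be played with in a toy model. This is only a sketch under loud assumptions: the three layers (VFS, filesystem, disk driver) are the ones named in the comment, `CountingLock` and the two `write_*` functions are invented for illustration, and no real kernel is structured this simply. `amdahl_speedup` is the standard Amdahl's law bound.

```python
import threading


class CountingLock:
    """A lock that counts how many times it is acquired."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self) -> "CountingLock":
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, *exc) -> None:
        self._lock.release()


def write_big_lock(kernel_lock: CountingLock) -> None:
    # One lock held for the whole (pretend) system call.
    with kernel_lock:
        pass  # VFS + filesystem + disk driver work would happen here


def write_fine_grained(vfs: CountingLock, fs: CountingLock,
                       disk: CountingLock) -> None:
    # Each layer takes and releases its own lock in turn.
    for layer_lock in (vfs, fs, disk):
        with layer_lock:
            pass  # that layer's share of the work


def amdahl_speedup(parallel_fraction: float, cpus: int) -> float:
    """Upper bound on speedup when only part of the work parallelizes."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cpus)


big = CountingLock()
vfs, fs, disk = CountingLock(), CountingLock(), CountingLock()
for _ in range(1000):
    write_big_lock(big)
    write_fine_grained(vfs, fs, disk)

print("big lock acquisitions:", big.acquisitions)
print("fine-grained acquisitions:",
      vfs.acquisitions + fs.acquisitions + disk.acquisitions)
print("Amdahl bound, half the work parallel, 2 CPUs:", amdahl_speedup(0.5, 2))
```

The model only counts acquisitions; it says nothing about what each one costs or how often locks are contended, which is where the real trade-off lives.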

--
I am TheRaven on Soylent News
Re:Scheduling Threads by AlecC · 2006-04-18 23:57 · Score: 1

Presumably, because (as was said) they were considering dumping the UP kernel and shipping only the SMP kernel.

Obviously, SMP will have some overhead - but what is that? 1% would be so small that you could ignore it for nearly all purposes. 15% is too large to ignore. It seems to me a perfectly reasonable piece of research to measure what that overhead actually is.

--
Consciousness is an illusion caused by an excess of self consciousness.
Re:Scheduling Threads by Anonymous Coward · 2006-04-19 00:52 · Score: 0

Would the problem there not be that you'll just be adding complexity to the processor, so it will be more difficult to achieve high clock speeds...
I suppose the trade off for this higher complexity, lower clock speed should be weighed against the CPU cycles required for a traditional multi CPU OS to perform the parallel task (data locking etc).
Let's not forget that this new tech will probably come with an interface that a new modified single core CPU OS can use to "optimize" its scheduling.
Imagine three threads waiting to be scheduled to a single core in a round robin fashion. Thread 1 gets clocks (hypothetical perfect conditions here) 0 - 10000, 30000 - 40000, 60000 - 70000. Threads two and three get those clocks during which thread 1 is waiting.
This new processing mechanism might allow threads two and three to execute during the same clock cycles as thread 1. So thread 1 executes during 0 - 10000, and the CPU might instantly send an interrupt to the OS which gets processed in 1000 cycles. Thread one continues to execute out of direct control of the OS, and the OS initiates execution of thread 2 during cycles 1000 - 11000... So the same three threads execute during 0 - 10000 (thread 1), 1000 - 11000 (thread 2), 2000 - 12000 (thread 3)...
Now hardware must keep track of data locks so that, just in case the two threads work on the same data, they don't "create interesting side effects".
This architecture might not end up with quadruple performance on single threads but rather permit 4 threads to execute simultaneously without the OS managing the multi CPU locking overhead. 4 threads could almost execute at full speed... (where 4 is the number of processor cores being reverse-hyperthreaded)
I am curious about whether this architecture could actually simplify multi-core designs, being finer grained and completely under hardware control. The multi CPU architecture with an on-die cache must be a beast itself. Lcam
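
The clock ranges in that thought experiment are easy to tabulate. The following is just arithmetic under the comment's own assumptions (three threads, a 10000-cycle slice, a 1000-cycle hand-off through the OS); the function names are invented, and nothing here models real hardware.

```python
from typing import Dict, List, Tuple


def round_robin_slices(threads: int, quantum: int,
                       rounds: int) -> Dict[int, List[Tuple[int, int]]]:
    """Clock ranges each thread gets when one core runs them in turn."""
    slices: Dict[int, List[Tuple[int, int]]] = {
        t: [] for t in range(1, threads + 1)
    }
    clock = 0
    for _ in range(rounds):
        for t in range(1, threads + 1):
            slices[t].append((clock, clock + quantum))
            clock += quantum
    return slices


def staggered_slices(threads: int, quantum: int,
                     handoff: int) -> Dict[int, Tuple[int, int]]:
    """Clock range per thread if each one starts `handoff` cycles after
    the previous one and then runs alongside it on another core."""
    return {
        t: (handoff * (t - 1), handoff * (t - 1) + quantum)
        for t in range(1, threads + 1)
    }


print(round_robin_slices(threads=3, quantum=10000, rounds=3)[1])
print(staggered_slices(threads=3, quantum=10000, handoff=1000))
```

In the round-robin case the first slice of the last thread ends at cycle 30000; in the staggered case it ends at 12000. That gap is the whole pitch, assuming the hardware really can keep the threads out of each other's way.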

Even better.... by Anonymous Coward · 2006-04-18 10:32 · Score: 0

four cores presented as two?

Re:Even better.... by s0m3body · 2006-04-18 13:03 · Score: 1

great news for everyone paying oracle licenses :-)
Re:Even better.... by buttered+noodle · 2006-04-18 16:53 · Score: 1

overclocking won't work well; if they vary in speed slightly (at all?) the thread will be screwed

Huh? by SilentJ_PDX · 2006-04-18 10:32 · Score: 3, Interesting

What's the difference between 'reverse multithreading' (it sounds like having one execution pipeline on a chip with enough hardware for 2 cores) and just adding more Logic/Integer/FP units to a chip?

Re:Huh? by mrscorpio · 2006-04-18 10:34 · Score: 4, Funny

......these amps go to 11!
Re:Huh? by the+phantom · 2006-04-18 10:36 · Score: 1

Why don't you just make 10 louder?

--
Rhapsody in Numbers
Re:Huh? by intangible · 2006-04-18 11:25 · Score: 1

It grows exponentially! At 10 we're everywhere in the universe at the same time! Or maybe I'm confusing something here...
Re:Huh? by cyngus · 2006-04-18 11:47 · Score: 1

Count me in the confused camp. It would seem to achieve roughly the same effect as adding more functional units, but with lower returns because the code running on each chip can't share state information as easily. My best guess is that they want the processor to appear as two cores sometimes and one core at other times. I haven't a clue who makes this decision, perhaps a hypervisor in a virtualized environment? My only other speculation is that perhaps branch prediction is somehow easier, although I can't imagine why or how.
Re:Huh? by smallfries · 2006-04-18 12:07 · Score: 1

The effect is the same but you gain more flexibility. If you only add more EUs to a chip then they can get starved because it's hard to do the despatch to keep them full. If you have multicore, but only one main thread, then at least half of your EUs are going unused. The cute answer, from an engineering point of view, is to allow both, and then switch between them. Then if you have long single-threaded sections of code with lots of implicit parallelism (e.g. games) you can load up the EUs, or if you have lots of threads without lots of implicit parallelism then you can still load them up - but segregated into cores.

The article had no details at all, one of the hard questions is who decides what mode the chip is in? Does it dynamically switch modes depending on the current load (very tricky, and will require a lot of logic between the cores)? Or is it a software enabled mode switch - tough on the thread scheduler in the kernel.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Re:Huh? by cyngus · 2006-04-18 12:40 · Score: 1

The effect is the same but you gain more flexibility. If you only add more EUs to a chip then they can get starved because it's hard to do the despatch to keep them full. If you have multicore, but only one main thread, then at least half of your EUs are going unused. The cute answer, from an engineering point of view, is to allow both, and then switch between them. Then if you have long single-threaded sections of code with lots of implicit parallelism (e.g. games) you can load up the EUs,

You definitely lost me here. Why would it be easier for a dispatcher to keep twice the number of execution units fed when they're on two different cores as opposed to twice the number of execution units on a single core? If you put all of the EUs on one core, just increase the size of the dispatcher, since you're effectively doubling it here anyway. I think you're hinting at interleaving which core instructions are dispatched to, but I would imagine the switching time would be huge, so I don't think this will really work.

or if you have lots of threads without lots of implicit parallelism then you can still load them up - but segregated into cores.

But if you're representing the dual core processor as single core one, you'll never know there are additional threads, because the OS will never schedule them to run. In fact a processor has no idea how many threads there are, it just gets told which one to run when.
Re:Huh? by smallfries · 2006-04-18 13:55 · Score: 1

The difficulty in keeping the cores full is because most programs don't expose that much parallelism. Adding cores doesn't magically fix that - but some programs do. Choosing the chip design is just an engineering tradeoff - find which is the most common case, and then optimise for that. So Intel/AMD went down the superscalar route for as far as they could, but they got diminishing returns after a while.

Multicore designs are optimising for a different kind of code - but they suck at running the programs that do expose a lot of implicit parallelism. Think game inner loops that have been partially unrolled by a compiler and lots of registers have been used so that the processor can see that there are no data dependencies. A multicore chip can't take advantage of this kind of code - the parallelism is too fine grained. It is pretty good at code with lots of coarse grain parallelism that has been separated into lots of threads. But when it is executing inner loops half of the EUs are on another core, unused.

AMD's idea (I guess) is that you can switch which way the chip operates, so in 'single-thread' mode the EUs on both cores can be fed by a common despatcher which means that implicitly parallel code can be accelerated. In the other mode two threads (without much implicit parallelism) can be executed so you still get an increase in throughput.

So it isn't easier for the despatcher to keep the EUs fed on both cores, in fact as you say, it is much harder. I don't know how they're going to get around the complexity issue (despatch difficulty isn't linear in the number of cores), or the communication costs between the cores. The gain is that you can still run multi-threaded code fast by switching modes, whereas in a pure superscalar design you are choosing one sweetspot in the design over another.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Re:Huh? by Malcolm+Chan · 2006-04-18 20:15 · Score: 1

But surely that would be the same as HyperThreading. You basically have 2 despatchers, using up the available EUs as necessary. If there is only 1 main thread, then one of the despatchers will have to try to use up the EUs as efficiently as possible. With more threads, well, they just compete for the available EUs.

--
/MC
Re:Huh? by smallfries · 2006-04-19 00:36 · Score: 1

That would make sense, although from what I know of hyperthreading the virtual core doesn't have as high a priority as the real core. So the despatcher issues as many instructions from the main thread as possible, but the second thread is more of a guest with instructions being issued when the real thread isn't using all the resources. I'm not 100% sure about that though.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Re:Huh? by metroplex · 2006-04-19 02:20 · Score: 1

haha, that was so appropriate. You made my slashdot day

--
"Words of wisdom: drop that zero and get with the hero" -- Vanilla Ice

Sounds familiar by Anonymous Coward · 2006-04-18 10:32 · Score: 5, Funny

Didn't they do this on Star Trek once to get more power or something?

Re:Sounds familiar by ScrewMaster · 2006-04-18 10:48 · Score: 1

No, that was the "Reverse Algorithmic" one that the video technician on CSI uses to sharpen blurry images.

--
The higher the technology, the sharper that two-edged sword.
Re:Sounds familiar by aliens · 2006-04-18 11:15 · Score: 4, Funny

No that was Ghostbusters. They crossed the streams.

--
-- taking over the world, we are.
Re:Sounds familiar by zippthorne · 2006-04-18 12:25 · Score: 1

You're both wrong. It has to do with the process used to change Bender Bending Rodriguez from robot to frat guy.

--
Can you be Even More Awesome?!
Re:Sounds familiar by DigiShaman · 2006-04-18 14:45 · Score: 1

Dr. Egon Spengler: There's something very important I forgot to tell you.

Dr. Peter Venkman: What?

Dr. Egon Spengler: Don't cross the streams.

Dr. Peter Venkman: Why?

Dr. Egon Spengler: It would be bad.

Dr. Peter Venkman: I'm fuzzy on the whole good/bad thing. What do you mean, "bad?"

Dr. Egon Spengler: Try to imagine all life as you know it stopping instantaneously and every molecule in your body exploding at the speed of light.

Dr Ray Stantz: Total protonic reversal.

Dr. Peter Venkman: Right. That's bad. Okay. All right. Important safety tip. Thanks, Egon.

from IMDB.com

--
Life is not for the lazy.
Re:Sounds familiar by p3d0 · 2006-04-19 01:30 · Score: 1

This is the kind of comment that makes me wish there was a Score:6, Funny.

--
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....

Great... by Anonymous Coward · 2006-04-18 10:34 · Score: 0

I'm thrilled to know that the threaded code I write isn't going to behave that way in hardware.

Hmmm by TheSpoom · 2006-04-18 10:36 · Score: 1

This would seem to be better for processes designed to only use one CPU, but it then prevents me from coding something in, say, OpenMP, in order to fine tune the parallelisation of my code (which would almost certainly work better than the generic optimizations that they would be putting in the CPU). Admittedly 95+% of programs aren't coded to be parallel, but this would still take away an option that would otherwise be there.

Perhaps there could be a documented way to access both CPUs directly? That may solve the problem.

--
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs

Re:Hmmm by InstinctVsLogic · 2006-04-18 12:59 · Score: 0

I believe they are trying to combine the cores into a single processor at the hardware level, rather than software. This would almost certainly be multiples faster than any optimizations you could make to your code.
Re:Hmmm by CastrTroy · 2006-04-18 13:36 · Score: 1

I haven't used OpenMP, but I took a class in parallel processing, and we used LAM-MPI. If they are anything alike, then anything you program takes 10 times as long, plus you have to explicitly tell it how to split up and collect the data in an efficient manner, which is often the hardest part. Anyway, I think that this kind of stuff is only necessary for applications which are required to be highly parallel. Otherwise, it would probably be easier just to add a couple of threads to your application, and let the OS figure out how to schedule them properly. Oh, and the other thing: most parallel algorithms only work well on a large number of processors/processes. For instance, you can sort N items in O(1) time, but you need to run N processes. Most of the time you don't have anywhere near N processors, so you are running more than 1 process per processor. You don't end up getting extra performance once you factor in the overhead.
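
One possible reading of the sorting remark is an enumeration (rank) sort, where each element's final position is computed independently of the others. The sketch below is a guess at what was meant, not the commenter's algorithm, and `rank_sort` is an invented name. Each rank still needs a pass over all N items, so with one worker per element the wall time is roughly one pass rather than constant; whether constant time is reachable depends on the machine model, which the comment doesn't specify. In CPython the thread pool only shows the structure and won't speed anything up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def rank_sort(items: List[int], workers: int) -> List[int]:
    """Enumeration sort: every element's final index is found independently."""

    def rank(i: int) -> int:
        x = items[i]
        # Count the elements that must come before items[i]; ties are
        # broken by original position so every rank is unique.
        return sum(1 for j, y in enumerate(items)
                   if y < x or (y == x and j < i))

    # Each rank() call reads shared data but writes nothing, so the calls
    # can be handed to separate workers without any locking.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        ranks = list(pool.map(rank, range(len(items))))

    out = [0] * len(items)
    for i, r in enumerate(ranks):
        out[r] = items[i]
    return out


print(rank_sort([3, 1, 2, 1], workers=4))
```

With fewer workers than items the ranks simply queue up behind each other, which matches the point about running more than one process per processor.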

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:Hmmm by TheSpoom · 2006-04-18 14:06 · Score: 1

Your parallelized program took ten times as long? What were you doing?

Although I will say that when we were doing parallel programming, we did have access to Sharcnet nodes, so perhaps you were more limited.

--
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs
Re:Hmmm by CastrTroy · 2006-04-18 14:13 · Score: 1

Sorry for the ambiguity. It takes 10 times as long to write the code. Can't really say much about the performance increase, as I mostly ran the stuff on a single CPU, so the overhead would make things run slower. There were a couple of 2 and 4 processor machines that we could SSH into, but when you're sharing with 50 other users, it's hard to gauge from one run to the next whether or not you're seeing any actual improvement.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:Hmmm by Anonymous Coward · 2006-04-18 15:38 · Score: 0

but it then prevents me from coding something in, say, OpenMP [openmp.org], in order to fine tune the parallelisation of my code

Nothing personal ... but I'd have to say, great!

The number of performance issues that can be attributed to programmers poorly managing multi-threaded apps is enormous. Let's remove the complexity and see a simpler programming model with generic multi-threading that 'just works' ... silicon speeds up at rates far faster than anybody is able to re-write a multi-threaded app to handle the new silicon.

Re:Frenchie Site! by rmsmith · 2006-04-18 10:36 · Score: 1

Eh? What? The Register is British.

Software isn't evolving. by Anonymous Coward · 2006-04-18 10:38 · Score: 3, Interesting

Part of the problem is that we're still writing software using techniques that were designed for single-processor systems. Languages like C and C++ just aren't suited for writing large distributed and/or concurrent programs. It's a shame to see that even languages like Java and C# only have rudimentary support for such programming.

The future lies not with languages such as Erlang, and Haskell, but likely with languages heavily influenced by them. Erlang is well known for its uses in massively concurrent telephony applications. Programs written in Haskell, and many other pure functional languages, can easily be executed in parallel, without the programmer even having to consider such a possibility.

What is needed is a language that will bring the concepts of Erlang and Haskell together, into a system that can compete head-on with existing technologies. But more importantly, a generation of programmers who came through the ranks without much exposure to the techniques of Haskell and Erlang will need to adapt, or ultimately be replaced. That is the only way that software and hardware will be able to work together to solve the computational problems of tomorrow.

Re:Software isn't evolving. by DrSkwid · 2006-04-18 11:07 · Score: 3, Informative

Limbo is an example of a CSP programming language. One definitely worth having a look at.

--
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
Re:Software isn't evolving. by Anpheus · 2006-04-18 11:23 · Score: 1

Fortress is shaping up to be an interesting competitor, with some pretty nifty features, and integrated into the language are certain concepts such as atomicity, parallel processing, and novel, more powerful methods for implementing loops and operators.
Re:Software isn't evolving. by richman555 · 2006-04-18 13:49 · Score: 1

Replaced? My company still has a boatload of COBOL programmers.
Re:Software isn't evolving. by junklight · 2006-04-18 20:41 · Score: 1

Yep,

this is just a fix for the python global interpreter lock

Why sort out language issues when you can make the hardware suit it better instead
Re:Software isn't evolving. by Anonymous Coward · 2006-04-18 22:53 · Score: 0

That would be Oz. In Oz you have dataflow variables on which you can synchronize without having to use explicit locks: if the value of an unbound variable is needed in some thread, it is blocked until some other thread binds it.
Run a procedure P in a new thread? thread {P ...} end.
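
For anyone who hasn't used Oz, the dataflow-variable idea can be loosely imitated in Python. This is only a sketch by analogy: it uses `concurrent.futures.Future` as a stand-in for an unbound variable, and the names `x`, `consumer` and `results` are made up for illustration. Real Oz dataflow variables are built into the language and work differently under the hood.

```python
import threading
from concurrent.futures import Future

# A Future stands in for an unbound dataflow variable (an analogy, not Oz semantics).
x = Future()
results = []

def consumer():
    # result() blocks this thread until some other thread binds x.
    results.append(x.result() + 1)

t = threading.Thread(target=consumer)
t.start()          # the consumer may now be waiting on the unbound variable
x.set_result(41)   # "binding" x wakes the consumer; no explicit lock in user code
t.join()
print(results)
```

The point of the sketch is that the waiting thread synchronizes on the value itself rather than on a lock the programmer has to remember to take.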

Oz support for Haskell: fun lazy {F ...} ... end. Functions without "lazy" are strict as in Scheme and SML. Oz support for Erlang: {NewPort Stream Port}, {Send Port ...}. Oz support for Prolog: choice Alt1 [] Alt2 [] ... end, {Search.base.all Query}. Oz support for Java: class ... end. Oz also has support for finite domain constraints (X::4#7 meaning variable X must be between 4 and 7), etc, etc.

http://www.mozart-oz.org/
Re:Software isn't evolving. by PhrostyMcByte · 2006-04-19 00:09 · Score: 1

At the 2005 PDC, Microsoft presented some very interesting ideas on building threading primitives into C++. It wasn't half-assed at all either. If you are serious about the issue I suggest you google for the presentation.
Re:Software isn't evolving. by excelsior_gr · 2006-04-19 01:18 · Score: 1

Talking about evolution... In the science field the most popular languages that are used in parallel systems are C and Fortran 77! In most situations, you'll find them parallelized with OpenMP and MPI (depending on whether the system has shared memory or not, but most of the time such systems are hybrids). I've attended a seminar recently on the subject, and the conclusion was that there is way too much legacy code already there for the people that really have the need for speed (meaning mathematicians, physicists and chemists)! Furthermore, Fortran's latest updates have made it quite attractive for writing new programs as well(!), always in the scientific field of course.
Re:Software isn't evolving. by frostfreek · 2006-04-19 01:23 · Score: 1

What is needed is a language that will bring the concepts of Erlang and Haskell together,

Hmmm.... Hurl-ing? or Erkel? It's doomed already!
Re:Software isn't evolving. by Anonymous Coward · 2006-04-19 03:13 · Score: 0

I've been writing C/C++ for twenty years, and I've created rather large, multi-threaded applications, some of which were distributed across networks. I see no reason why C and C++ aren't suited for such tasks. The concepts are the same, no matter what computer language you use to implement them with. If you don't have some widget, some functionality you like, create a new class and build the functionality into it. That's the whole idea behind object oriented programming.

Your argument is like saying English isn't a language you can properly talk about sunsets with, because French is so much more romantic. Give me a break.
Re:Software isn't evolving. by 0xABADC0DA · 2006-04-19 03:28 · Score: 2, Informative

The main problem is that the CPUs are only designed for high-level parallelism. So you don't get much benefit from Haskell or Erlang because they could theoretically say "run this loop of 1..10 as two loops 1..5 and 6..10", but in practice doing so would be much slower on today's multi-core multiprocessors due to the setup overhead.
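
A rough illustration of that loop split, written in Python purely for readability (the function names `partial_sum` and `split_sum` are invented here, and a thread pool is just one assumed way of doing the split):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(lo, hi):
    # Sum the integers lo..hi inclusive.
    return sum(range(lo, hi + 1))

def split_sum():
    # Hypothetical split of the loop 1..10 into 1..5 and 6..10.
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(partial_sum, 1, 5)
        second = pool.submit(partial_sum, 6, 10)
        return first.result() + second.result()

serial = partial_sum(1, 10)
parallel = split_sum()
print(serial, parallel)
```

Both paths give the same answer, but the split version has to create a pool, start two workers and collect two results before it can add five numbers each. For a loop this small that setup is the dominant cost, which is the overhead being described.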

I actually have a brand new dual-core box and the gnome System Monitor shows both cpu's separately. The only time I've seen both cpu's used by a single program it was a Java program. So I don't really understand where you are coming from. Today's multiprocessors are designed for high-level parallelism that the programmer has to declare in some way, and Java is excellent for this (C# to a somewhat lesser degree).

Now a language like Erlang or Scheme could get a huge performance boost from processors like the now-outdated Tera MTA (aka Cray), which was the inspiration for Sun's MAJC processor. These have hardware threading, so the language compiler can break loops out into multiple threads with next to no overhead. This also benefits Java to some extent since it has moderately better rules regarding aliasing for instance than C#. C/C++ can also benefit automatically with 'loose' compilers, macros, and generally a lot of work on the programmer's side.

may not want to go back.. yeah right by igotmybfg · 2006-04-18 10:39 · Score: 5, Funny

However, by the time the technology ships - if it proves real, and ever becomes more than a lab experiment - the software industry will have had several years focusing on multi-threaded apps, and it may not want to go back.

Hah, yeah right, we started parallel programming just this semester and already I want to kill myself. "May not want to go back"? I'd go back in a heartbeat!

Re:may not want to go back.. yeah right by ivan256 · 2006-04-18 10:50 · Score: 3, Insightful

Boy are you screwed.

Even though the trade rags haven't realized it, real life software engineers have been using parallel programming techniques for decades. Sure, apps are optimized for what they run on, so most shrinkwrap software at your local CompUSA probably doesn't have much of that in there, but the author missed the boat already when it comes to "had several years focusing on...".

Better learn to like that parallel programming stuff. It's the way things work.
Re:may not want to go back.. yeah right by DrSkwid · 2006-04-18 11:27 · Score: 1

What language are they foisting upon you ?

--
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
Re:may not want to go back.. yeah right by esmrg · 2006-04-18 12:01 · Score: 1

You just started.
Had you spent the time writing a multi-threaded software package that runs well - and then were informed multithreading is now handled by the processor - you wouldn't want to rewrite it, would you?
Re:may not want to go back.. yeah right by John_Sauter · 2006-04-18 12:14 · Score: 1

Better learn to like that parallel programming stuff. It's the way things work.

I can echo that. I have been doing programming on parallel CPUs since 1968 (on a monstrosity at Stanford University that included a 166 and a KA-10 processor). You have to think differently to write parallel code, but once you learn to think that way it becomes no harder than conventional, “linear” programming.
Re:may not want to go back.. yeah right by Anonymous Coward · 2006-04-18 12:21 · Score: 0

vb.NET
Re:may not want to go back.. yeah right by Diabolus777 · 2006-04-18 13:52 · Score: 1

I just did my parallel processing course. On clusters we used C lib MPI (LAMMPI). On shared memory machines we used pthreads. We had a brief brush with OpenMP and JavaMPI, which were both just presented for completeness, not for any serious use (labs). I'm glad I did the course even though it's just an introduction. I'm sure that in the future I'll see more and more of pthreads.

--
We should have been
So much more by now
Too dead inside
To even know the guilt
Re:may not want to go back.. yeah right by XchristX · 2006-04-18 14:14 · Score: 1

Maybe if you were using kick-ass parallel grids like the ones I am for my simulations:

http://www.tacc.utexas.edu/services/userguides/lonestar/
http://www.tacc.utexas.edu/services/userguides/champion/

you might change your mind. Plus, multithreading using OpenMP (http://www.openmp.org/) is relatively easy. Message Passing (MPI) is trickier, but much more powerful.

I'm just running a proggie on the champion grid (above) that's using just about all 96 cpus off-and-on for 3 days straight (Floquet time evolution of a 2-boson interacting system). On a single cpu system it would've taken months to make just one run.

--
l'Homme n'est Rien l'Oeuvre Tout: Gustave Flaubert to George Sand
Re:may not want to go back.. yeah right by Gr8Apes · 2006-04-18 14:52 · Score: 1

I can echo that. I have been doing programming on parallel CPUs since 1968 (on a monstrosity at Stanford University that included a 166 and a KA-10 processor). You have to think differently to write parallel code, but once you learn to think that way it becomes no harder than conventional, "linear" programming.

What's this "conventional, 'linear' programming"?

Some of us have been writing conventional asynchronous code for quite some time now. e.g., I couldn't imagine having to have a single thread sit there and wait on some slow-ass DB to respond... it should be doing some useful work! ;)

--
The cesspool just got a check and balance.
Re:may not want to go back.. yeah right by MemoryDragon · 2006-04-18 19:45 · Score: 1

Ahem, parallelism is the norm, not the exception. I have not written a program with fewer than three threads in the last 5 years. Not SIMD parallelism or active agents, but multithreading has become a standard technique everyone has to use. Multithreading is basically one of those things you cannot avoid anymore once your programs are a little bit bigger than hello world or the average small shell script.
Re:may not want to go back.. yeah right by tdi1 · 2006-04-18 20:32 · Score: 1

The problem is that most programmers can't even program single-threaded in C without bringing the machine crashing down. As the OP's comment showed, thinking about multi-threaded coding issues makes their head explode.

It's like the manual transmission - those that want the best performance learn how to use it but most people just want to cruise.

Ultimately, it won't matter whether the masses are programming single or multi-threaded, because their apps are inefficiently designed. What will matter is how a few select pieces of code run. This includes the OS, some games and a few apps like Photoshop. The coders working on those will have to continue to use multi-threaded techniques to get decent performance/system responsiveness. The multi-core processors will win out. AMD is wasting their time.
Re:may not want to go back.. yeah right by TheRaven64 · 2006-04-18 22:31 · Score: 1

I don't understand the reluctance many people have to write asynchronous code. When I first learned to program, I remember learning about functions and thinking how great they were - until I discovered that the rest of my code had to wait until they returned. It turns out that some other people had the same idea, and modern OO languages support the idea of continuations (an object immediately returns a proxy object in response to a message, and any operations on this proxy object block until the real object is available). Meanwhile, I've been writing code in languages like Erlang and Termite that runs happily on a 64 CPU machine, or on a cluster.
When I have to write serial code, I find that I first design a parallel algorithm and then try to shoe-horn it into a serial execution model. Being tied to a serial model is far less flexible and far less fun to program.
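
What is described above (a proxy object returned immediately, with operations on it blocking until the real value exists) is usually called a future or promise. A minimal Python sketch, assuming a made-up `slow_lookup` function standing in for any slow call:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_lookup(key):
    # Stand-in for a slow call, e.g. a database query (hypothetical).
    time.sleep(0.1)
    return key.upper()

with ThreadPoolExecutor(max_workers=1) as pool:
    proxy = pool.submit(slow_lookup, "answer")  # returns a Future immediately
    other_work = sum(range(1000))               # caller keeps going meanwhile
    value = proxy.result()                      # blocks only when the value is needed

print(other_work, value)
```

The caller only waits at the point where it actually needs the answer, which is the property the proxy-object approach is after.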

--
I am TheRaven on Soylent News
Re:may not want to go back.. yeah right by ivan256 · 2006-04-19 00:33 · Score: 1

The problem is that most programmers can't even program single-threaded in C without bringing the machine crashing down. As the OP's comment showed, thinking about multi-threaded coding issues makes their head explode.

After 10 years experience as a professional software engineer, I can say that certainly isn't true of most programmers I've worked with. People like that, quite honestly, have no business writing code that can run in a place that may bring the system down. Let 'em write shell scripts and VB.NET (which are frequently multi-'threaded' anyway).

The mechanics of parallel programming are becoming easier and easier as languages and libraries evolve. If you can't mentally compartmentalize the threads of a program you're writing, either you have terrible teachers (which is more common than should be the case), or you have no business being a professional software engineer.
Re:may not want to go back.. yeah right by Gr8Apes · 2006-04-19 03:37 · Score: 1

Serial code is a crap-load easier to learn to debug, for most people. It's similar to object modeling or music: 2-D items or single melodies are just easier to visualize initially for most people. To truly get into higher orders apparently takes certain skills, whether natural or learned, that many don't have or take the time to invest in learning.

--
The cesspool just got a check and balance.
Re:may not want to go back.. yeah right by Nevyn · 2006-04-19 05:55 · Score: 1

After 10 years experience as a professional software engineer, I can say that certainly isn't true of most programmers I've worked with. People like that, quite honestly, have no business writing code that can run in a place that may bring the system down.

Again with the "developers just need to understand threads" theory? Unless you are talking about everyone moving to some new language that significantly helps developers manage synchronization and sharing ... then I call your "theory" and raise you the real-world data that OS kernel developers (Linux, Solaris, FreeBSD) repeatedly have problems managing the above threading problems. Coverity has found hundreds if not thousands of threading bugs in Linux, and there must certainly be many it hasn't.

So you (and "most programmers" you've worked with) either believe that you really are much better than all of the Linux and Solaris kernel developers ... or, more likely, are idiots who can't judge their own incompetence.

So listen up, exposing large parts of your entire application to run at multiple points simultaneously is really hard to do well. And if you have a real OS (i.e. not from Microsoft) then doing fork() and explicitly stating which bits of data you want to share is less prone to errors, more maintainable and just as efficient/scalable.

Yes, sometimes, you'll then be required to design the interfaces to communicate between the processes instead of just hacking it ... but again, often, just hacking it with threads produces buggy unmaintainable crap.
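
A small sketch of the fork()-plus-explicit-interface style, in Python for brevity. It assumes a POSIX system (`os.fork` is not available on Windows), and the function name `child_computes` and the choice of a pipe as the interface are illustrative, not anything prescribed above:

```python
import os

def child_computes(n):
    # After fork() the two processes share no writable state implicitly;
    # the pipe is the only channel explicitly set up between them.
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: compute, send the result, exit without cleanup handlers.
        os.close(read_fd)
        os.write(write_fd, str(sum(range(n))).encode())
        os.close(write_fd)
        os._exit(0)
    # Parent process: read the child's answer, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return int(data)

total = child_computes(100)
print(total)
```

Everything the child hands back has to go through the pipe, so the shared surface is exactly what was written down, at the cost of designing that interface up front.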

--
ustr: Managed string API with ave. 44% overhead over strdup(), for 0-20B
Re:may not want to go back.. yeah right by ivan256 · 2006-04-19 08:23 · Score: 1

Oh great. A response from a know-it-all.

Again with the "developers just need to understand threads" theory? [...] And if you have a real OS (i.e. not from Microsoft) then doing fork() and explicitly stating which bits of data you want to share is less prone to errors, more maintainable and just as efficient/scalable.

Are we having the same conversation here? Did you insert a whole bunch of discourse that didn't actually happen inside your head so you could show how much you know? What theory am I talking about all of a sudden, and without my knowledge? Congratulations, you have a four digit user id and a three character domain name. That and a quarter will get you a pack of gum. Go get some real world experience and reading comprehension skills.

Just FYI: Calling fork gives you another thread of execution. Just because I used the word 'threads' doesn't mean I meant it in the posix sense. I meant it as a computer science term, and not an implementation specific term. Multiple processes using IPC of one form or another is a parallel programming technique.

So you (and "most programmers" you've worked with) either believe that you really are much better than all of the Linux and Solaris kernel developers ...

The problems you point to aren't due to the problems being difficult (not that they aren't difficult, but who says your job as a programmer has to be easy?), but due to laziness. Do an informal survey of how many Linux kernel hackers think it's easiest to design while you're writing the code instead of designing ahead of time and you'll find the problem.

We'll skip the paradoxical nature of your question about what I believe, as I, and many of the people I have worked with, are kernel developers.

Coverity has found hundreds if not thousands of threading bugs in Linux, and there must certainly be many it hasn't.

Oh, come on. Have you ever used Coverity? 99% of the shit it finds is because it is too stupid to realize what code and what states are unreachable. They're 'bugs in theory', not necessarily real bugs. It's telling you that your code is sloppy, not necessarily that it's going to break.
Re:may not want to go back.. yeah right by Nevyn · 2006-04-19 10:35 · Score: 1

Just FYI: Calling fork gives you another thread of execution. Just because I used the word 'threads' doesn't mean I meant it in the posix sense.

Oh ... come ON, if you'd have said task without qualifying it, then I might have hoped you didn't mean pthread_create() but I still wouldn't have put a lot of money on it.

I meant it as a computer science term, and not an implementation specific term. Multiple processes using IPC of one form or another is a parallel programming technique.

Yes, well done ... threads and processes have well defined meanings as computer science terms. And I can guarantee that any non-four digit /. user is going to take:
"If you can't mentally compartmentalize the threads of a program you're writing, you [...] have no business being a professional software engineer."
...and interpret it as "professional software engineers" can easily use pthread_create() and manage the outcome that results. fork() is certainly not going to enter the equation.

--
ustr: Managed string API with ave. 44% overhead over strdup(), for 0-20B
Re:may not want to go back.. yeah right by ivan256 · 2006-04-19 12:05 · Score: 1

Oh ... come ON, if you'd have said task without qualifying it, then I might have hoped you didn't mean pthread_create() but I still wouldn't have put a lot of money on it.

Yeah, I could have said that, but it wouldn't have meant the right thing, because I was trying really hard not to be specific.

You're taking what I'm saying, applying all sorts of technical details that were intentionally left out, and calling me wrong based on those details. The best part about all of this is that it doesn't even negate my initial assertion, which was that in real life, programming involves multiple threads of execution.

can easily use pthread_create() and manage the outcome that results. fork() is certainly not going to enter the equation

It doesn't matter if you use pthread_create, fork, &, open PIPE, "$1 |", mexec, or whatever. The point was that you need to be able to view each thread of execution as a unit.

Gotta love these CPU companies... by __aaclcg7560 · 2006-04-18 10:39 · Score: 5, Funny

First, they get the software industry's licensing panties in a knot because users only want to pay a license fee for one physical chip instead of paying for each processor on the chip. Now, twisting the panties in the other direction, they want to reverse all that by representing multiple processors as one virtual processor. Would that be covered by a multi or single processor license agreement? Do I still get a free wedgie with that one?

Re:Gotta love these CPU companies... by suv4x4 · 2006-04-18 11:12 · Score: 1

"Now, twisting the panties in the other direction, they want to reverse all that by representing multiple processors as one virtual processor. Would that be covered by a multi or single processor license agreement?"

Or maybe the software industry can start acting logically and license per machine.

That's of course until the "reverse virtualisation" from Intel happens, that makes your entire server cluster run as a single PC :)
Re:Gotta love these CPU companies... by dgatwood · 2006-04-18 11:19 · Score: 1

That's of course until the "reverse virtualisation" from Intel happens, that makes your entire server cluster run as a single PC :)
You mean NUMA?

--
Check out my sci-fi/humor trilogy at PatriotsBooks.
Re:Gotta love these CPU companies... by EmperorKagato · 2006-04-18 11:23 · Score: 1

Excellent concept! You are right: when it comes to licensing for SQL Server, you would have to pay for the license based on the number of processors. With this chip IT departments can get away with murder.

However, whether the software would perform as well as it does on a multiprocessor is something I would like to see.

--
----- You know you have ego issues when you register a domain in your name.
Re:Gotta love these CPU companies... by Brandybuck · 2006-04-18 12:39 · Score: 1

I don't believe in paying license fees to begin with, so there!

--
Don't blame me, I didn't vote for either of them!

In Soviet Russia...(I know, sorry) by TheDarkener · 2006-04-18 10:40 · Score: 0, Offtopic

The cores thread YOU!

--
It is pitch black. You are likely to be eaten by a grue.

Re:In Soviet Russia...(I know, sorry) by Anonymous Coward · 2006-04-18 10:46 · Score: 0

In Soviet Slashdot, bad joke is sorry for you.

Amdahl's Law by overshoot · 2006-04-18 10:41 · Score: 4, Interesting

OK, I know some of the gang doing architecture for AMD and they are damned sharp people.

What I want to know is which of the premises underlying Amdahl's Law they've managed to escape?

--
Lacking <sarcasm> tags, /. substitutes moderation as "Troll."

Re:Amdahl's Law by grumbel · 2006-04-18 10:57 · Score: 3, Interesting

Quick guess:

Amdahl's Law has little impact when the number of cores is small and the available task is "large", as today's multitasking OSes are.

Of course that doesn't mean that AMD will get a 100% improvement, but something close to that might be doable if they can break the tasks at hand into parallel stuff at a much smaller level than threads.
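
For reference, Amdahl's Law says the speedup on n cores is 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel. A quick Python sketch to put numbers on it (the p values below are arbitrary examples, not measurements of anything):

```python
def amdahl_speedup(parallel_fraction, cores):
    # Amdahl's Law: speedup = 1 / ((1 - p) + p / n)
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

for p in (0.5, 0.9, 1.0):
    print(p, round(amdahl_speedup(p, 2), 3), round(amdahl_speedup(p, 4), 3))
```

With p = 0.9, two cores give about 1.8x and four give about 3.1x, so with few cores and mostly parallel work the law doesn't bite very hard. A full 2x on two cores needs p = 1, i.e. no serial part at all.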
Re:Amdahl's Law by carlmenezes · 2006-04-18 19:04 · Score: 1

Seems to me to be like a kind of RAID0 setup for CPUs. Am I right?

--
Find a job you like and you will never work a day in your life.

No, superscalar is different by overshoot · 2006-04-18 10:45 · Score: 5, Interesting

Superscalar refers to having multiple execution paths inside of a single processor, allowing the dispatch of multiple instructions in a single clock cycle. However, the register sets (etc.) maintain a common state (although keeping the out-of-order updates straight sucks a huge amount of complexity and power.)

In this case, AMD appears to be trying to decouple the states enough that the out-of-order resolution doesn't require micromanaging all of the processes from a single control point.

--
Lacking <sarcasm> tags, /. substitutes moderation as "Troll."

Re:No, superscalar is different by RalphTWaP · 2006-04-18 11:00 · Score: 5, Insightful

What AMD appears to be trying isn't the same as superscalar processing, but it might run into a similar problem.

Where superscalar requires a good dispatcher to minimize branch prediction misses, AMD appears to be making decisions, not about dispatch, but about how to do locking of shared memory (think critical sections).

Critical section prediction might prove less expensive than branch prediction in practice even if they are similar in theory (http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html shows the problem, which is already an issue on 64-bit hardware).

Sounds a lot like Intel's Mitosis research by Anonymous Coward · 2006-04-18 10:46 · Score: 3, Informative

Despite the lack of details, it sounds quite a bit like Intel's Mitosis research:
http://www.intel.com/technology/magazine/research/speculative-threading-1205.htm

The article has simulated performance comparisons.

From the article:
"Today we rely on the software developer to express parallelism in the application, or we depend on automatic tools (compilers) to extract this parallelism. These methods are only partially successful. To run RMS workloads and make effective use of many cores, we need applications that are highly parallel almost everywhere. This requires a more radical approach."

Re:Sounds a lot like Intel's Mitosis research by logicnazi · 2006-04-18 13:48 · Score: 1

No, it sounds nothing at all like this research. Intel's research (in the paper you link and with the entire Itanium system) has been all about exposing the out-of-order execution and speculative execution capabilities of the processors to the compiler. In other words, the exact opposite of what AMD is supposedly doing here by hiding the dual-core nature of the chip.

For what it's worth, I think in the long run Intel has the right answer; the question is whether AMD can steal lots of market share in the short run before they run into a performance wall.

--
If you liked this thought maybe you would find my blog nice too:

Similar to MacOSRumors rumor by salimma · 2006-04-18 10:46 · Score: 3, Insightful

.. in this post they reported on a project supposedly aiming at breaking down single threads into multiple threads so as to better utilize core utilization beyond the fourth core.

It supposedly involves Intel. I personally think both rumors are just that, but the timing is curious. Same source behind both? AMD PR people not wanting to lose out in imaginary rumored technology to Intel?

--
Michel
Fedora Project Contribut

Re:Similar to MacOSRumors rumor by Calroth · 2006-04-18 14:40 · Score: 1

.. in this post they reported on a project supposedly aiming at breaking down single threads into multiple threads so as to better utilize core utilization beyond the fourth core.

In the Eiffel programming language, they've proposed a concurrency algorithm that doesn't use "traditional" threads.

The idea is, you sprinkle the "separate" keyword onto various objects that it makes sense for. The compiler or runtime then does a dependency analysis and breaks out your program into different threads or processes. Automatically. As many different threads as is optimal for your system.

That's it. No locking, no threading, no synchronisation. Just the one damn keyword.

OK, so the above is a simplification. But it's not a huge one. You can read up on it at the Eiffel site or in your local copy of OOSC.

The "problem" is, Eiffel is already a niche language, and this is even more niche. In fact, I don't think anyone's even implemented this concurrency scheme in production code. Also, from what I've read, it's formally proven to work, but this was some time ago so someone might have poked (theoretical or practical) holes in the scheme.

I know... by Expert+Determination · 2006-04-18 10:47 · Score: 4, Funny

Hyperthreading makes one core look like two. Reverse hyperthreading makes two cores look like one. So if we chain reverse hyperthreading with hyperthreading we can make one core look like one core but have twice as many features for the marketing department to brag about.

--
"The White House is not an intelligence-gathering agency," -- Scott McClellan, Whitehouse spokesman.

Re:I know... by cnettel · 2006-04-18 11:43 · Score: 1

Actually, that would be the answer to what to do with code that really is multi-threaded on a CPU like this. Give the OS two states (stack + registers and whatnot) to run threads on, but virtualize that above actual cores, so one thread might totally dominate both cores, especially if the other is just executing HLT. I still think this is vaporware close to vacuum, though.
Re:I know... by barracg8 · 2006-04-18 12:31 · Score: 4, Insightful

Ironic that this post is modded funny, since I think it might be closest to the mark.
I'd suggest x86-secret & the Reg have got the wrong end of the stick here. SMT is running two threads on one core - try taking "reverse hyperthreading" literally. I'd suggest that AMD are looking at running the one same thread in lock-step on two cores simultaneously. This is not about performance, it is about reliability - AMD looking at the market for big iron (running execution cores in lock-step is the kind of hardware reliability you are looking at on mainframe systems).
The behaviour of a CPU core should be completely deterministic. If the two cores are booted up on the same cycle they should make the same set of I/O requests at the same point, and so long as the system interface satisfies these requests identically and on the same cycle, then the cores should have no reason not to remain in sync with each other until the next point that they both should put out the next, identical pair of I/O requests. If the cores ever get out of sync with each other, this indicates an error.
Just speculation of course, but I seem to recall AMD looking into this having been rumoured previously.
G.
Re:I know... by logicnazi · 2006-04-18 13:56 · Score: 1

Mod up!

This is the most plausible explanation offered so far. At least it makes sense.

--
If you liked this thought maybe you would find my blog nice too:
Re:I know... by rabiddeity · 2006-04-18 18:07 · Score: 1

But assuming the cores go out of sync because of an error (not an outright lockup), how would you know which core is wrong? For true failsafe reliability you'd need a 3rd core as well as some extra logic. If 1 core disagrees with the other 2 it gets shut down and the system keeps working. But if the cores are on the same chip, you can't exactly hotswap one of the cores out like you could hotswap a failed RAID 1 drive. I don't see how this would be useful.
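
The two-versus-three distinction can be shown with a toy voter. This is a Python sketch with a made-up `vote` function and made-up output values; real lock-step hardware compares bus traffic cycle by cycle, not Python lists:

```python
from collections import Counter

def vote(outputs):
    # Return (majority_value, indexes_of_disagreeing_replicas).
    # majority_value is None when no value has a strict majority.
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    return value, [i for i, out in enumerate(outputs) if out != value]

print(vote([7, 9]))      # two replicas disagree: error detected, culprit unknown
print(vote([7, 9, 7]))   # three replicas: replica 1 is outvoted
```

With two replicas a mismatch can only be detected. With three, the odd one out can be identified and the majority result kept.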
Re:I know... by kermitthefrog917 · 2006-04-18 19:41 · Score: 1

So this is basically similar to RAID, but for processors, with RAID 0 and RAID 1 being the proposed options?

--
I may be wrong but you're downright ugly!
Re:I know... by Anonymous Coward · 2006-04-18 20:41 · Score: 0

You don't know. But you do know that whatever this processor was doing went wrong, and that can be corrected in software. Of course this requires that the system has more than one multi-core processor. One well-known way to do it would be to have the processors only communicate through a database, so that the database can detect a dropped connection and roll back the current transaction. All client software will also have to be notified, and be able to try their transactions another time.
Re:I know... by yarbo · 2006-04-19 05:06 · Score: 1

You can have several units like that. If one has a contradiction, take it offline and bring in a spare.

Yes, AMD! You get it! by totro2 · 2006-04-18 10:57 · Score: 2, Interesting

As a systems admin in a large datacenter with many AIX, Solaris, HPUX, Redhat, and Suse boxes, I'm glad to see a vendor who wants to simplify management of systems (one processor is easier to manage than two). This is to say nothing about all the developer effort that would be saved from not needing to write SMP-safe code. I want large, enterprise level boxes to be just as easy to administer/use as the cheapest desktop in their line. The OS should see as-simple-as-possible hardware. You wouldn't believe all the different kinds of "system management consoles" I have to log into, which are always vendor specific and annoying.

Alas... by thePowerOfGrayskull · 2006-04-18 11:00 · Score: 0, Flamebait

... these are the kinds of stories we get when the digg kiddies submit articles to /.

Rumor, wisps of hot air, and nothing definitive.

Re:Alas... by Anonymous Coward · 2006-04-18 11:55 · Score: 0

these are the kinds of stories we get when the digg kiddies submit articles to /.

sounds like a digg of jealousy from a slash snot. Slash not ratings are going down and diggs are going up. Me thinks it's due to snotty noses and mindless bantering/.
Re:Alas... by WilliamSChips · 2006-04-18 12:27 · Score: 1

You do realize that's based on Alexa, which is based on a spyware toolbar which 1) the "average" /.er is unlikely to use and 2) diggers installed to artificially boost their ratings.

--
Please, for the good of Humanity, vote Obama.
Re:Alas... by thePowerOfGrayskull · 2006-04-19 12:22 · Score: 1

Eh? Ratings? Why would I care about them? I'm here for the articles my bravely unannounced friend. Could give two shits about the ratings.

occam by EmbeddedJanitor · 2006-04-18 11:01 · Score: 4, Informative

About the best language I've ever seen for multi-threading is occam, the language used with Transputers. occam allows threading to be done as a language primitive. http://en.wikipedia.org/wiki/Occam_programming_language

--
Engineering is the art of compromise.

Too sweet to be true by suv4x4 · 2006-04-18 11:02 · Score: 2, Insightful

"AMD is claimed to believe it may be able to double the single-chip performance with a two-core chip or provide quadruple the performance with a quad-core processor."

Even the article's writers don't seem sure that's possible, but apparently it's possible to "claim" it. What isn't? :)

Modern processors, including the Core Duo, rely on a complex "infrastructure" that allows them to execute instructions out of order if certain requirements are met, or to execute several "simple" instructions at once. This is completely transparent to the code being executed.

Apparently, for this to be possible, the instructions must not produce results that depend on each other, meaning you can't, for example, reorder or co-issue instructions that modify the same register.
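A rough sketch of that register check, under the assumption that each instruction can be summarized by the set of registers it reads and the set it writes. The `Instr` type and the x86-looking strings are purely illustrative. Note that real out-of-order cores use register renaming to remove the WAR and WAW cases, so only the RAW (true) dependency is a hard limit.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    text: str
    reads: frozenset
    writes: frozenset

def hazards(first, second):
    """List the register hazards between two instructions in program order."""
    found = []
    if first.writes & second.reads:
        found.append("RAW")  # second needs a result that first produces
    if first.reads & second.writes:
        found.append("WAR")  # second overwrites something first still reads
    if first.writes & second.writes:
        found.append("WAW")  # both write the same register
    return found

def can_issue_together(first, second):
    """True when the pair has no register hazard at all (no renaming assumed)."""
    return not hazards(first, second)

a = Instr("add eax, ebx", frozenset({"eax", "ebx"}), frozenset({"eax"}))
b = Instr("mov ecx, eax", frozenset({"eax"}), frozenset({"ecx"}))
c = Instr("add edx, esi", frozenset({"edx", "esi"}), frozenset({"edx"}))

print(hazards(a, b))             # ['RAW']
print(can_issue_together(a, c))  # True
```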

This is an area where multiple cores could join forces and compute results for one single programming thread as the article suggests.

But you can hardly get twice the performance from two cores out of that.

License Cost?? by hallkbrdz · 2006-04-18 11:04 · Score: 0, Redundant

I just wonder what Oracle's CPU price index will be for this thing if it makes it out? Let me see, you have an AMD super single CPU core, not a hyper-threaded or standard one. You need to multiply (PI*GHZ/Watts*Sockets*200000)+(5000/days_until_quarter_closeout) to get the correct license fee. :-)

Probably... by Anonymous Coward · 2006-04-18 11:06 · Score: 0

After all, there isn't a problem in time or space that can't be solved by simply reversing the polarity of the neutron flow.

multi cpu by tacocat · 2006-04-18 11:07 · Score: 0

Considering how awesomely powerful many CPUs are, I would think that they would continue moving towards more multi-core CPUs instead. After a while, lots of CPUs will outflank a fast one.

Re:multi cpu by nsayer · 2006-04-18 11:17 · Score: 1

Sun has been saying this, more or less, since about 1994. Personally, I always saw that argument as similar to the guy who gets the wooden medal saying he could have won the 100 meter if it had been best-of-4.
Re:multi cpu by cgenman · 2006-04-18 11:57 · Score: 3, Insightful

I'm guessing economic reasons push harder than technical ones.

Sony already assumes that their PS3 chips will have a fault in one of the cores, and simply lock off that section when one is found. One fault no longer kills a chip, though two can render the power unacceptably low.

The cool thing is this scales. If you have a 10cm^2 chip, traditionally your chance of perfection is 1/4th that of a 5cm^2 chip, cutting your yield drastically. But if you have 6 cores on a chip with one dead one, and you want to go to 12, you should get a similar yield for a proportionally similar amount of dead cores.

Cores let you limit damage from manufacturing errors, letting you build bigger chips more cheaply. At least, that's my layman's understanding.
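To put rough numbers on that, here is a sketch using the simple Poisson yield model (yield = exp(-area * defect density)) plus a binomial count of good cores. The defect density and core areas are made-up inputs, and the model ignores everything on the die that is not a core. Under this model doubling the area squares the yield, so the exact ratio depends on the defect density.

```python
import math

def die_yield(area_cm2, defects_per_cm2):
    """Poisson model: probability that a region of silicon has zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)

def yield_with_spares(n_cores, n_required, core_area_cm2, defects_per_cm2):
    """Probability that at least n_required of n_cores are defect-free."""
    p = die_yield(core_area_cm2, defects_per_cm2)
    return sum(
        math.comb(n_cores, k) * p**k * (1 - p) ** (n_cores - k)
        for k in range(n_required, n_cores + 1)
    )

D = 0.1  # hypothetical defects per cm^2

print(f"5 cm^2 monolithic:  {die_yield(5, D):.3f}")
print(f"10 cm^2 monolithic: {die_yield(10, D):.3f}")

# Same silicon split into cores, tolerating dead ones (1 of 6, 2 of 12).
print(f"6 cores, need 5:    {yield_with_spares(6, 5, 5 / 6, D):.3f}")
print(f"12 cores, need 10:  {yield_with_spares(12, 10, 10 / 12, D):.3f}")
```

Requiring every core to work gives back the monolithic number, which is the point of the comparison: the spare cores are what buy the yield.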

--
The ______ Agenda
Re:multi cpu by Billly+Gates · 2006-04-18 12:26 · Score: 1

It's still 2. The difference is the threading and scheduling for SMP systems is now done in hardware rather than software.

Makes sense since it's a difficult and complex mess to write an app or an operating system that can run on 2 or more CPUs efficiently.

My guess is in hardware you can do a lot more than in software.

--
http://saveie6.com/
Re:multi cpu by tacocat · 2006-04-18 12:46 · Score: 1

Well you can certainly do it faster. But it's a lot harder to work on a patch.
Re:multi cpu by Lehk228 · 2006-04-18 18:36 · Score: 1

in theory, assuming the best of the best are designing the systems and all software, software/compiler side optimization for multiple cores is best, since it has a long time (relatively infinite compared to how long a CPU gets to look at the binary). however it is much easier to make sure your relatively small team of hardware engineers knows what the fuck they are doing.

moving complicated stuff from run to design time is often good. moving mentally challenging tasks from very frequent to rather infrequent is better, especially when screwing up can mean crap performance.

--
Snowden and Manning are heroes.
Re:multi cpu by Trejkaz · 2006-04-18 23:08 · Score: 1

But this technology they're talking about here is what will *make* multiple CPUs out-flank the single fast one. Cause, and effect.

--
Karma: It's all a bunch of tree-huggin' hippy crap!

It's not exactly clear what they have in mind by ameline · 2006-04-18 11:07 · Score: 4, Informative

There are several techniques for increased performance or throughput that the designers of next gen microarchitectures are likely looking at.

There are extensions to known techniques;

A: more execution units, deeper reorder buffers, etc trying to extract more Instruction Level Parallelism (ILP).

B: More cores = more threads

C: hyper threading -- fill in pipeline bubbles in an OOO superscalar architecture; also = more threads

I personally don't think any of these carry you very far...

Then there are some new ideas:

a: run-ahead threads -- use another core/hyperthread to perform only the work needed to discover what memory accesses are going to be performed and preload them into the cache - mainly a memory latency hiding technique, but that's not a bad thing as there are many codes that are dominated by memory latency

a': More aggressive OoO run-ahead where other latencies are hidden

Intel has published some good papers on these techniques, but according to those papers these techniques help in-order (read Itanic) cores much more than OoO.

b: aggressive peephole optimization (possibly other simple optimizations usually performed by compilers) done on a large trace cache. Macro/micro-op fusion is a very simple and limited start at this sort of thing. (Don't know if this is a good idea or not, or whether anyone is doing it)

But it's far from clear what AMD is doing. Whatever it is, anything that improves single threaded performance will be very welcome. Threading is hard (hard to design, implement, debug, maintain, and hard to QA). And not all code bases or algorithms are amenable to it.

Intel's next gen (Nehalem) is likely going to do some OoO look-ahead, as they have Andy Glew working on it, and that's been an area of interest to him...

A very interesting new concept is that of "strands" (AKA: dependency chains, traces, or sub-threads). (The idea is instead of scheduling independent instructions, schedule independent dependency chains. - For more info, see http://www.cse.ucsd.edu/users/calder/papers/IPDPS-05-DCP.pdf)
But it's not clear how well it would apply to OoO architectures, but I would expect that likely approaches would also need large trace caches.

Applying this to an OoO x86 architecture, and detecting the critical strand dynamically in that processor could be very cool, and potentially revolutionary.
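To make the "strand" idea concrete, here is a greedy toy that groups a straight-line instruction list into dependency chains. It is not the algorithm from the linked paper; the `(dest, sources)` tuple format and the rule "follow the first source whose producer is still the tail of a chain" are assumptions made for illustration.

```python
def build_strands(instrs):
    """Group instructions into dependency chains.

    instrs: list of (dest_register, [source_registers]) in program order.
    Returns a list of strands, each a list of instruction indices.
    """
    last_writer = {}  # register -> index of the instruction that last wrote it
    strand_of = {}    # instruction index -> strand id
    strands = []
    for i, (dest, sources) in enumerate(instrs):
        chosen = None
        for src in sources:
            producer = last_writer.get(src)
            if producer is not None:
                sid = strand_of[producer]
                if strands[sid][-1] == producer:  # producer still ends its chain
                    chosen = sid
                    break
        if chosen is None:
            strands.append([i])  # nothing to extend: start a new strand
            chosen = len(strands) - 1
        else:
            strands[chosen].append(i)
        strand_of[i] = chosen
        last_writer[dest] = i
    return strands

program = [
    ("r1", []),            # 0: r1 = load a
    ("r2", []),            # 1: r2 = load b
    ("r3", ["r1"]),        # 2: r3 = r1 + 1
    ("r4", ["r2"]),        # 3: r4 = r2 * 2
    ("r5", ["r3", "r4"]),  # 4: r5 = r3 + r4
]
print(build_strands(program))  # [[0, 2, 4], [1, 3]]
```

The two strands only meet at instruction 4, so a scheduler could hand each chain to a different set of execution resources until that join.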

It will be very interesting to see what Intel and AMD are up to -- it would be even cooler if they both find different ways to make things go faster...

--
Ian Ameline

Re:It's not exactly clear what they have in mind by Anonymous Coward · 2006-04-18 20:51 · Score: 0

Yes, there are lots of methods that are known to improve single thread performance through aggressive techniques like the ones you've cited in your excellent and well informed post (you must keep up with all of the MICRO and ISCA papers).

However - the problem is that all of these techniques require an ever increasing amount of power to implement. The performance/watt number is extremely poor for these techniques.

Computer performance has fallen dramatically behind the exponential improvement curve in the last few years (this is a decades-long trendline that has been broken). The reason is not the much feared "memory-wall". We know how to climb that - that's just a speed-bump compared to the Mount Everest that is the "power-wall".

Explicitly parallel architectures are coming. Programmers better get used to writing parallel programs because it's the most power efficient thing we can come up with. The ground-breaking research will not be in architecture, but in new software, languages and programming paradigms designed to make the life of the parallel programmer easier. Maybe the functional programming crowd will finally have their day.
Re:It's not exactly clear what they have in mind by ameline · 2006-04-19 01:37 · Score: 1

AC writes: "Programmers better get used to writing parallel programs because it's the most power efficient thing we can come up with. The ground-breaking research will not be in architecture, but in new software, languages and programming paradigms designed to make the life of the parallel programmer easier."

In order for a CPU architecture to be commercially viable, it must run existing programs fast. And existing programs -- especially large ones, and they're all getting large now, are very unwieldy. Imagine taking something like Alias' Maya -- over 25 million lines of C++, and making it run well on an explicitly parallel machine. It's just not going to happen. It's hard enough to get some parallelism out of some small fragments of it. On a large scale, parallel evaluation and traversal of the Maya Dependency Graph just does not seem feasible. And rewriting it is just not going to happen.

What Maya likes is exactly what Intel is doing with Conroe/Merom -- shorter pipelines, lower branch mispredict penalty, branch prediction for computed jumps, better memory disambiguation for better OoO around memory accesses, more execution resources, better OoO, etc, etc.

Some of those aggressive techniques I wrote about will help even more.

It runs like *crap* on explicitly parallel machines.

--
Ian Ameline
Re:It's not exactly clear what they have in mind by Anonymous Coward · 2006-04-19 02:22 · Score: 0

Sorry, but programmers will not have a choice. Believe me we've tried, and we're well aware of the implications. Every computer vendor is going in this direction. This is not some blind "follow the leader" marketing thing. Billions of dollars of R&D and thousands of man years have gone into solving the problem. The fact that we've all come to the same conclusion is telling.

It's no accident that the BEST thing architects could think of to do with Moore's riches of transistors is to add another duplicate core. This is across the board from Intel, AMD, Sun and IBM - some of the smartest computer architects in the world work there. If you look out at future roadmaps, you'll see some incremental single thread improvements but the bulk of the transistor budget is going to more cores.

Single thread performance will increase, just not at the rate that we have been accustomed to. Your program performance will scale at an ever slower rate and at some point in the not too distant future - will not scale at all! Since all computer vendors are heading in this direction, only those programmers that take advantage of the explicit parallelism will see performance scaling. Go plot SPEC performance from 1989 until today. You'll clearly see the 'knee' in the curve.

Bottom line - Single thread performance is power limited. It's not that we don't know how to increase single thread performance from a hardware-algorithm point of view. It's the fact that the power dissipation of those algorithms is unacceptable. This is a fundamental "laws-of-physics" kind of thing. You won't find any clever solutions or workarounds to it.

Nobody said free performance would last forever :)
Re:It's not exactly clear what they have in mind by jon_c · 2006-04-19 04:05 · Score: 1

dude you said "aggressive peephole optimization"

huh huh huhhuhhuhu

--
this is my sig.
Re:It's not exactly clear what they have in mind by slcdb · 2006-04-19 05:31 · Score: 1

run-ahead threads -- use another core/hyperthread to perform only the work needed to discover what memory accesses are going to be performed and preload them into the cache - mainly a memory latency hiding technique, but that's not a bad thing as there are many codes that are dominated by memory latency
Good god man! Where did you dig up this EVIL research?!?

How could a thread possibly be executed far enough in advance to make the time savings worth while, yet be sure that it is "predicting" memory accesses correctly? The only way that could happen is with code where memory accesses are independent of the result of previously executed instructions. Where does this happen in the real world? Almost nowhere.

Such a technology would inevitably make incorrect predictions. As a result, there's a high probability that it would induce a cache miss where there otherwise would not have been one (not to mention the extra memory bandwidth used by these shenanigans).

What's worse is this sort of thing would play absolute havoc with programmers' and compilers' cache optimizations. This is pure EEEEEVVIIIILLL.

--
Despite what EULAs say, most software is sold, not licensed.
Re:It's not exactly clear what they have in mind by ameline · 2006-04-19 08:35 · Score: 2, Informative

Where did I find the Evil Research(tm)? Where else but directly from the source of evil -- no, no, not Microsoft, the *other* source of evil -- Intel :-)

It's already in their compiler;
http://www.intel.com/software/products/compilers/clin/docs/main_cls/mergedprojects/optaps_cls/common/optaps_pgo_sspopt.htm
(Their compiler absolutely rocks BTW)

And their excellent paper titled "Speculative Precomputation: Long-range Prefetching of Delinquent Loads" by Jamison Collins, Hong Wang, Dean Tullsen, Christopher Hughes, Yong-Fong Lee, Dan Lavery, and John Shen can be found here;
http://www.intel.com/research/mrl/library/148_collins_j.pdf

(Those damn delinquent loads -- GET OFF OF MY LAWN YOU DELINQUENTS! :-)

There's also
"Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors" by
Dongkeun Kim, Steve Shih-wei Liao, Perry Wang, Juan del Cuvillo, Xinmin Tian, Xiang Zou, Hong Wang, Donald Yeung, Milind Girkar, and John Shen which can be found here;
http://www.cgo.org/cgo2004/papers/02_80_Kim_D_REVISED.pdf

And also;
"Speculative Precomputation on Chip Multiprocessors" by Jeffery Brown, Hong Wang, George Chrysos, Perry Wang, and John Shen at;
http://www.cs.ucsd.edu/~jbrown/papers/sp-cmp.pdf

There, that ought to cure your insomnia and answer your question: "How could a thread possibly be executed far enough in advance to make the time savings worth while, yet be sure that it is "predicting" memory accesses correctly?"

Read the papers carefully -- there will be a quiz later.

--
Ian Ameline

K8L by Anonymous Coward · 2006-04-18 11:07 · Score: 1, Interesting

This would be, perhaps, more useful with the quad-core version - appearing as two processors to still allow the OS to allocate multiple threads?

But what I really want to know... by Joebert · 2006-04-18 11:10 · Score: 4, Interesting

Is Microsoft going to recognise this contraption as a single, or multi-license-able processor ?

And

Will AMD only hide the fact there's multi-cores from Operating systems other than Microsoft ?

--
Wanna fight ? Bend over, stick your head up your ass, and fight for air.

Re:But what I really want to know... by Zak3056 · 2006-04-18 12:11 · Score: 1

Is Microsoft going to recognise this contraption as a single, or multi-license-able processor, and will AMD only hide the fact there's multi-cores from Operating systems other than Microsoft ?

You're barking up the wrong tree here. MS has already addressed this in favor of their customers, and licenses on a per-socket rather than a per-core basis. One core, two cores, four cores, doesn't matter--one processor.

--
What part of "shall not be infringed" is so hard to understand?
Re:But what I really want to know... by TubeSteak · 2006-04-18 12:35 · Score: 1

Will AMD only hide the fact there's multi-cores from Operating systems other than Microsoft?
I'm going to guess that AMD will hide the multi-cores from everyone.

The idea is that AMD will have the CPU do all the fancy (de)threading stuff on the chip. The entire point is to increase performance for non-optimized applications.

If you're going to be using programs optimized for dual CPUs/cores, then there really isn't a point in buying a chip with AMD's technology on it, unless AMD plans to stop selling 'normal' dual core CPUs entirely.

--
[Fuck Beta]
o0t!

Yes but can I get one by the end of the week? by edwardpickman · 2006-04-18 11:11 · Score: 0

I'm starting a big rendering job and this type of chip would be perfect. I've been using dual chip motherboards for years and started recently using multicore chips. The best I've ever gotten is 1.5X for any dual chip configuration and generally it's much less. On rare occasions I've had dual systems run slower but that hasn't happened too often. The only reason they are practical is the software is so bloody expensive it makes sense. Most people get no benefit off dual chip configurations since very few software packages can use more than one processor. Renderers are the exception. Even things like Maya can only use multiple cores or chips in rendering. All other functions use one processor. This type of integrated chip should solve the problem. Now how many CPUs can you cram on a chip? Personally I'd like to see one the size of a dinner plate. Alright it might have to have its own 220 line but I can work with that. Neighbors might get annoyed when I start a render and their lights dim though.

bullshit by tomstdenis · 2006-04-18 11:11 · Score: 1

The bus between the two cores is FAR TOO SLOW for this sort of operation. Moving [say] EAX from core 0 to core 1 would take hundreds of cycles.

So if the theory is to take the three ALU pipes from core 1 and pretend they're part of core 0... it wouldn't work efficiently. Also what instruction set would this run? I mean how do we address registers on the second core?

AMD would get more bang for buck by doing other improvements such as adding more FPU pipes, adding a 2nd multiplier to the integer side, increasing L1 bandwidth, etc.

This story is pure and utter bullshit.

Tom

--
Someday, I'll have a real sig.

Re:bullshit by tomstdenis · 2006-04-18 11:20 · Score: 3, Informative

For those not in the know... reading a register from core 1 and loading it in core 0 would work like this

1. core 1 issues a store to memory [dozens if not hundreds of cycles]
2. core 0 issues a read, the XBAR realises it owns the address and the SRQ picks up the read
3. core 0 has now read a register from core 1

It would be so horribly slow that accessing the L1 data cache as a place to spill would be faster.

The IPC of most applications is less than three and often around one. So more ALU pipes is not what K8 needs. It needs more access to the L1 data cache. Currently it can handle two 64-bit reads or one 64-bit store per cycle. It takes three cycles from issue to fetched.

Most stalls are because of [in order of frequency]

1. Cache hit latency
2. Cache miss latency
3. Decoder stalls (e.g. unaligned reads or instructions which spill over 16 byte boundary)
4. Vectorpath instruction decoding
5. Branch misprediction

AMD making the L1 cache 2 cycle instead of 3 cycle would immediately yield a nice bonus in performance. Unfortunately it's probably not feasible with the current LSU. That is, you can get up to 33% faster in L1 intense code with that change.
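One way to read the 33% figure, sketched as arithmetic: going from 3 to 2 cycles removes a third of the load-to-use time, which is a 1.5x speedup only for code that is entirely bound by that latency. The `fraction_l1_bound` parameter is an assumed knob for how much of the run time really is L1 latency; the blend is plain Amdahl's law.

```python
def latency_change(old_cycles, new_cycles):
    """Return (fraction of time saved, speedup) for purely latency-bound code."""
    time_saved = (old_cycles - new_cycles) / old_cycles
    speedup = old_cycles / new_cycles
    return time_saved, speedup

def overall_speedup(fraction_l1_bound, old_cycles, new_cycles):
    """Amdahl's law: only the L1-bound fraction of run time gets faster."""
    remaining = (1 - fraction_l1_bound) + fraction_l1_bound * new_cycles / old_cycles
    return 1 / remaining

saved, speedup = latency_change(3, 2)
print(f"time saved: {saved:.0%}, best-case speedup: {speedup:.2f}x")

for fraction in (1.0, 0.5, 0.2):
    print(f"{fraction:.0%} L1-bound -> {overall_speedup(fraction, 3, 2):.2f}x overall")
```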

But compared to "pairing" a core, die space is better used improving the LSU, adding more pipes to the FPU, etc.

Tom

--
Someday, I'll have a real sig.
Re:bullshit by DrDitto · 2006-04-18 12:21 · Score: 1

Tom, quit being an armchair architect. Read this paper:

ftp://ftp.cs.wisc.edu/sohi/papers/1995/isca.multiscalar.pdf

BTW-- RC delay is causing on-chip wires to get pretty slow, but nowhere near "hundreds of cycles".
Re:bullshit by tomstdenis · 2006-04-18 12:35 · Score: 1

"armchair"... whatever. I'd say I know a bit more about the K8 design than your average slashdotter.

The point is as it stands now the K8 cannot, repeat cannot, get a register from one core to another FASTER THAN THE L1 CACHE WORKS.

Now that we got that out of the way... realize that ...

IPC OF 99% OF ALL CODE is less than 1 in most cases and why is that? Aside from register contention there is the three cycle latency of the L1. So it's very trivial to stall an entire execution unit.

So AMD would see little benefit from tying the ALUs on core 1 (which can only access the registers local to it) to core 0 since they would just go unused most of the time.

The only possible benefit is the FPU of the second core but even then it's pushing it. Getting data from one core to the other is really slow.

AMD would benefit more from just adding another FPU adder or multiplier [or both] to a single core than by adding high speed super-wide busses between cores (which in terms of processors are "far away").

Tom

--
Someday, I'll have a real sig.
Re:bullshit by DrDitto · 2006-04-18 12:47 · Score: 1

No, you don't understand modern processor issues. The problem is increasing the size of the instruction window. It simply cannot be done within today's frequency and power constraints. An 8-wide machine with a large enough instruction window to get the IPC increases can't be built due to the size of the CAMs.
The idea is to get a larger instruction window via multithreading. Either via programmed threads (standard multithreading), or speculative techniques for a single-threaded program.
"Adding another FPU" is funny and makes no sense. The ILP cannot be extracted to use additional ALU units without making a larger instruction window.
Re:bullshit by tomstdenis · 2006-04-18 13:40 · Score: 1

Um actually you're wrong. The Core [64-bit stuff coming out] processors have a 4-way instruction window which is 1 larger than AMD already. That means they can issue up to 4 macro-ops per cycle. So processors are already using more pipes.

There are THREE FPU pipes. Therefore it is possible to add an adder [or vice versa] to the multiplier then have the decoder be aware of this and feed stuff into either pipe. So technically you don't have to change the ICU at all to support more FPU resources.

As for the ALU performance I never said make it wider. They're vastly underutilized as it is. L1 cache stalls account for quite a bit of cycles even when there is a hit.

As for threading... that's an OS issue. Doing anything on the level the CPU will recognize is not feasible. You simply cannot extract architectural state fast enough. The best way to use two cores is with SMP aware software.

I haven't heard of any AMD projects to merge cores like this and in fact the emphasis has always been on SMP and NUMA aware development practices.

Tom

--
Someday, I'll have a real sig.
Re:bullshit by glsunder · 2006-04-18 14:53 · Score: 1

AMD making the L1 cache 2 cycle instead of 3 cycle would immediately yield a nice bonus in performance.

General Question: would it be possible to add something like a direct mapped and very small (like 4KB or less) L0 cache, where if it hit, you'd get the data in 2 (or ideally, 1) clock cycle and if it missed you'd still have a shot at getting it from the larger L1? If it was under the page size, could you avoid accessing the TLB (I think early alphas did this)? I'm by no means an expert, and since very small but fast caches don't seem to be used much, I'm sure there's something wrong with the above idea.

I called it L0 because you'd be requesting from both L0 and L1 at the same time, keeping the L1 latency at 3 clock cycles.
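A back-of-the-envelope model of the idea, assuming every access hits in L1 at worst and ignoring ports, power, and scheduler effects. The "parallel" function is the proposal as stated (L0 and L1 probed together); the "serial" one is the pessimistic case where the L0 probe delays the L1 access. The function names and default cycle counts are illustrative.

```python
def avg_latency_parallel(hit_rate, l0_cycles=1, l1_cycles=3):
    """L0 and L1 are probed together, so an L0 miss still costs only l1_cycles."""
    return hit_rate * l0_cycles + (1 - hit_rate) * l1_cycles

def avg_latency_serial(hit_rate, l0_cycles=1, l1_cycles=3):
    """L1 is only probed after the L0 miss is known, so a miss pays for both."""
    return hit_rate * l0_cycles + (1 - hit_rate) * (l0_cycles + l1_cycles)

for h in (0.2, 0.5, 0.8, 0.95):
    print(f"L0 hit rate {h:.0%}: "
          f"parallel {avg_latency_parallel(h):.2f} cycles, "
          f"serial {avg_latency_serial(h):.2f} cycles (plain L1: 3)")
```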
Re:bullshit by markhahn · 2006-04-18 15:36 · Score: 1

nah. latency between cores is O(L2 latency), which is ~20-30 cycles. so if you know a thread is going to take a major bubble (cache miss), you might well do something about it by punting it off to an otherwise idle core. that assumes that the other core has some mechanism to retrieve the microarchitectural state of the thread, though (instructions and data in flight, etc). unless, of course, all that state is already available to the other "core", in which case it's really HT all over again. this is not a bad thing, since HT (SMT in general) got a bad rap because of how it was conflated with some other issues on the P4. the idea of switching to useful work rather than idling the pipe during a bubble is a sound one. the real question is how much the threads will interfere (contending for cache, etc), and how much overhead you pay for doing the fine-grain switching (extra tags on the internal non-architectural reg file, for instance, complicating retirement, etc.)

clearly this could also be done speculatively, as mentioned in the reference to mitosis. but it's pretty unclear how much it could be done on-the-fly, since there's a huge danger in drowning in speculative work which winds up useless when the initial fork is resolved...

that said, AMD still has lots of other things to clean up: wider superscalarity and single-cycle SSE to match Intel, for instance.
Re:bullshit by Lehk228 · 2006-04-18 18:43 · Score: 1

i am not a CPU engineer, but how feasible would a cache/memory coprocessor be? its sole function would be to predict what the CPU would grab next based on past actions and the execution thread entering the CPU. depending on how sure the co-processor is, it would move data closer to or farther from the CPU. if it screws up too many times in a row, or in too great a percentage of actions, it would stop acting for a period of time so it would not interfere too much with code execution which does the opposite of what it expects.
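A toy version of that self-throttling predictor, to show the control logic only. It assumes a simple "same stride as last time" guess and implements just the "too many misses in a row" rule; the class name, thresholds, and addresses are all made up, and real hardware prefetchers are considerably more elaborate.

```python
class ThrottledPrefetcher:
    """Stride predictor that goes quiet after repeated wrong guesses."""

    def __init__(self, max_wrong_in_row=4, cooldown=16):
        self.max_wrong_in_row = max_wrong_in_row
        self.cooldown = cooldown      # accesses to sit out after giving up
        self.last_addr = None
        self.stride = None
        self.prediction = None
        self.wrong_in_row = 0
        self.sleep_left = 0

    def access(self, addr):
        """Observe one access; return an address to prefetch, or None."""
        # Score the previous guess, if one was made.
        if self.prediction is not None:
            if self.prediction == addr:
                self.wrong_in_row = 0
            else:
                self.wrong_in_row += 1
                if self.wrong_in_row >= self.max_wrong_in_row:
                    self.sleep_left = self.cooldown
                    self.wrong_in_row = 0
        # Learn the most recent stride.
        if self.last_addr is not None:
            self.stride = addr - self.last_addr
        self.last_addr = addr
        # Stay out of the way while cooling down.
        if self.sleep_left > 0:
            self.sleep_left -= 1
            self.prediction = None
            return None
        self.prediction = addr + self.stride if self.stride else None
        return self.prediction

regular = ThrottledPrefetcher()
print([regular.access(a) for a in (0, 64, 128, 192)])
# [None, 128, 192, 256]

erratic = ThrottledPrefetcher(max_wrong_in_row=2, cooldown=3)
print([erratic.access(a) for a in (0, 100, 150, 400, 410, 900, 1000)])
# [None, 200, 200, None, None, None, 1100]
```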

--
Snowden and Manning are heroes.
Re:bullshit by tomstdenis · 2006-04-18 22:46 · Score: 1

Snooping between cores is fast but not super efficient. It also can send out snoops to the HT bus in MP systems if neither core owns the cache line. And as far as I know [from public info] you have to hit the SRQ before a memory read from one core can see something written by the other core. The latency to the L2 is ~20 cycles or so on its own. etc, etc, etc....

As for the other comments, the ALU is already wide enough. You're right about the SSE side. At best FPU opcodes are 2 [of 4] EX cycles giving a latency of 2 cycles. That's partly because of the scheduler though as it looks for things in steps of 2 cycles.

Tom

--
Someday, I'll have a real sig.
Re:bullshit by tomstdenis · 2006-04-18 22:50 · Score: 1

[speaking in general].

You'll find that most LSUs in modern processors are really their own independent units.

Think of a processor as a program with a bunch of threads and really efficient IPC [inter process communication]. the Load-Store Unit [LSU] is just one of many things going on.

Both Intel and AMD have hardware prefetchers which examine memory usage and make fetches to system memory to bring stuff in [L1 or L2 depending on the design].

Tom

--
Someday, I'll have a real sig.
Re:bullshit by tomstdenis · 2006-04-18 22:54 · Score: 1

Answer: No. It would cause the real L1 to have L0.delay additional cycles of delay. This is also why most cpus don't have L3s even if you could make them out of DRAM on chip. So unless your hit rate for the L0 was like 99% you'd expect to lose performance.

In both the Intel and AMD cases the L1 access is pipelined which is why it's multiple cycles. Intel merely has a shorter pipe to the LSU which is why they have [often] 2 cycle caches as opposed to the 3 cycle AMD has.

Tom

--
Someday, I'll have a real sig.
Re:bullshit by be-fan · 2006-04-19 03:08 · Score: 1

Actually, the cache access latency is dominated by the size of the cache. Intel managed 2-cycle L1s in the P4 because the data cache was only 8KB. The 32KB data cache in the NGMA is 3 cycles.

--
A deep unwavering belief is a sure sign you're missing something...
Re:bullshit by tomstdenis · 2006-04-19 03:25 · Score: 1

Again only partially correct. While it's true you need more address decoder bits and area [e.g. longer wires] the actual data read is a single 64-bit value from a cache bank [bits discarded if smaller]. Both Intel and AMD pipeline their LSUs because they actually have multiple steps of work.

This is how RaW works for instance... You'd have at least two cycles

1. present address (write buffers, L1 and L2 pick up request)

This is important because cache coherency is important. You have to make sure that you're going to read the latest and greatest copy. It may be anywhere in the CPU.

2. read data (if in write buffer or L1)or stall (if in memory or L2)
3. either done or ... wait ....
X. read from L2 or memory

I don't know the exact design of either off by heart but that's the gist of an LSU's job.

A larger cache does mean a larger area [longer wires] so that's entirely possibly the reason for 3 cycles instead of 2. But fundamentally the LSU of the K8 and P4 are not the same so even if the K8 had an 8KB cache it's possible that the delay would still be 3 cycles.

Tom

--
Someday, I'll have a real sig.
Re:bullshit by Anonymous Coward · 2006-04-19 04:00 · Score: 0

Yes, it's bullshit alright.

One could move EAX, or any gp register, in 7 cycles. Come to think of it, depending on the circuit I could probably do it in 3.
Re:bullshit by DrDitto · 2006-04-19 07:30 · Score: 1

There is no such thing as a "4-way instruction window". There is 4-issue (4-way superscalar). Instruction windows are sized to dozens of instructions.
Re:bullshit by Anonymous Coward · 2006-04-19 12:22 · Score: 0

You're a CS guy who seriously overestimates his knowledge of chip design. Better stop before AMD sees how retarded you are and fires you.
Re:bullshit by be-fan · 2006-04-19 12:52 · Score: 1

Your description of the steps is not accurate. In the K8's LSU, the three cycle latency is divided into the following tasks:

1) Address computation in the AGU
2) Cache address decode and virtual -> physical translation (in parallel)
3) Data cache access

In any case, the LSU is usually not the bottleneck for something like this. There is a reason why Intel's Core microarchitecture has a 3-cycle 32KB cache even though the P4 had a 2-cycle cache. The LSU can kill your latency, sure, but you're really limited mostly by the latency of the SRAM itself. All the 32KB-64KB physically-addressed data caches I'm aware of (the Power6's, the Power5's, the K8's, and the Core's) are 3-cycle latency. Indeed, the documents on Fujitsu's SPARC64 V processor go into detail about the tradeoff between cache size and latency. They point out that an 8KB, 2-cycle data cache was originally considered, but ultimately they decided to go with a 128KB 4-cycle data cache. The variety of processor architectures (and thus LSU designs) that have not been able to achieve 2-cycle latency for a large (32KB+) L1 in a 2GHz+ processor suggests that the physical limitations of the access latency of large SRAMs are the bottleneck, rather than LSU design.

--
A deep unwavering belief is a sure sign you're missing something...
Re:bullshit by tomstdenis · 2006-04-20 03:07 · Score: 1

Yeah I wasn't talking about cache "ways". I meant the standard issue of macrops is 4. Just like in AMDland where we retire a line of [up to] 3 macrops at once, Intel likely does that with 4.

Tom

--
Someday, I'll have a real sig.

Huh = A Little More Moore's Law by sreekotay · 2006-04-18 11:13 · Score: 0

The idea is basically a way to continue to extend Moore's Law with current Comp Sci paradigms. Multi-core (and multi-CPUs generally) is the same idea, but requires software re-thinking to really take advantage of it.

More units don't help things go much faster unless you can figure out how to feed them.

Like multi-CPU tech, there's probably a big diminishing return, so this seems like a 2 to 4-ish X multiplier - or about 18 to 36 months more of Moore.
--
graphically speaking

Re:Huh = A Little More Moore's Law by Anonymous Coward · 2006-04-18 16:36 · Score: 0

Moore's "Law": The complexity available to an IC doubles approximately every eighteen months.

Academia's been proposing this for awhile by Mifflesticks · 2006-04-18 11:15 · Score: 4, Informative

There are various projects that take differing views about how to do this. One class of such processors is "run-ahead" microprocessors. The idea here is to allow invalid results to be executed but not retired by a second processor running up to a few thousand instructions "ahead" of the processor executing real code to be retired.

There are several variations of this. One is to use the second core to run in advance of the 1st thread, the first thread effectively acting as a dynamic and instruction-driven prefetcher. One such effort includes "slipstreaming" processors, which work by using the advanced stream to "warm up" caches, while the rear stream makes sure the results are accurate, and by dynamically removing unnecessary instructions from the advanced stream. Prior, similar research has been done to perform the same work using various forms of multithreading (like HT/SMT, and even coarse-grained multithreading). See www.cs.ucf.edu/~zhou/dce_pact05.pdf for more details.

Others, such as Dynamic Multithreading techniques, take single-threaded code and use hardware to generate other threads from a single instruction stream. Akkary (at Intel) and Andy Glew (previously Intel, then AMD, then...?) have proposed these ideas, as have others. Some call it "Implicit Multithreading".

Now, the Register article is so wimpy (as usual) that there's no actual information about what technologies are used, but maybe it's a variation on one of the above.

Re:Academia's been proposing this for awhile by shieldforyoureyes · 2006-04-18 19:39 · Score: 1

So it's a low-latency, low-efficiency opposite of Sun's Niagara chip.
Re:Academia's been proposing this for awhile by Anonymous Coward · 2006-04-18 20:40 · Score: 0

Andy is back at Intel by the way.
It makes you wonder what scared him away from AMD.

Shi's law by G3ckoG33k · 2006-04-18 11:15 · Score: 5, Informative

From here:

Researchers in the parallel processing community have been using Amdahl's Law and Gustafson's Law to obtain estimated speedups as measures of parallel program potential. In 1967, Amdahl's Law was used as an argument against massively parallel processing. Since 1988 Gustafson's Law has been used to justify massively parallel processing (MPP). Interestingly, a careful analysis reveals that these two laws are in fact identical. The well publicized arguments were resulted from misunderstandings of the nature of both laws.

This paper establishes the mathematical equivalence between Amdahl's Law and Gustafson's Law. We also focus on an often neglected prerequisite to applying the Amdahl's Law: the serial and parallel programs must compute the same total number of steps for the same input. There is a class of commonly used algorithms for which this prerequisite is hard to satisfy. For these algorithms, the law can be abused. A simple rule is provided to identify these algorithms.

We conclude that the use of the "serial percentage" concept in parallel performance evaluation is misleading. It has caused nearly three decades of confusion in the parallel processing community. This confusion disappears when processing times are used in the formulations. Therefore, we suggest that time-based formulations would be the most appropriate for parallel performance evaluation.
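
A quick way to check the equivalence claim in the quoted abstract is to start from processing times, as it recommends, and derive both "serial percentages" from them. The sketch below is only an illustration under assumed inputs: the function names and the 10/90 time split are made up here, not taken from the paper.

```python
def amdahl_speedup(serial_fraction, n):
    """Amdahl form: serial_fraction is measured on the 1-processor run."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def gustafson_speedup(serial_fraction_parallel, n):
    """Gustafson form: the serial fraction is measured on the n-processor run."""
    return serial_fraction_parallel + (1.0 - serial_fraction_parallel) * n


def speedups_from_times(t_serial, t_parallel, n):
    """Compute the speedup three ways from the same measured times.

    t_serial:   time spent in code that cannot be parallelized
    t_parallel: time the parallelizable code takes on ONE processor
    Assumes the serial and parallel programs do the same total work.
    """
    one_cpu_time = t_serial + t_parallel
    n_cpu_time = t_serial + t_parallel / n
    direct = one_cpu_time / n_cpu_time
    s_amdahl = t_serial / one_cpu_time      # fraction of the 1-CPU run
    s_gustafson = t_serial / n_cpu_time     # fraction of the n-CPU run
    return (direct,
            amdahl_speedup(s_amdahl, n),
            gustafson_speedup(s_gustafson, n))


# Hypothetical workload: 10 time units serial, 90 parallelizable, 8 processors.
print(speedups_from_times(10.0, 90.0, 8))
```

All three numbers agree; the two laws only look different because each "serial percentage" is a fraction of a different run time.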

Re:Yes, AMD! You get it! by saleenS281 · 2006-04-18 11:18 · Score: 1

that makes no sense at all. So you want all boxes to act as uniprocessor... and then what happens when you want to run multiple tasks at once? You do realize sometimes you just want things to run parallel don't you?

I guess by your response I'm highly doubting you admin systems in a large datacenter because it makes absolutely no sense. I don't know any admin that would only want to have one processor, logical or not, in a large server. There's WAYYYY too many things that need to go on at the same time. There's a reason why Sun sells 128-way systems, and it's not because they can get the job done with one really fast cpu.

Not True! by Gorimek · 2006-04-18 11:21 · Score: 4, Funny

We have always been at war with hyperthreading!

Re:Not True! by WilliamSChips · 2006-04-18 12:30 · Score: 1

No, we have always been at war with HyperTransport. You have committed thoughtcrime. Doubleplusungood.

--
Please, for the good of Humanity, vote Obama.

Re:Oooooh by NuShrike · 2006-04-18 11:22 · Score: 1

Imagine if you had TWO C&C chips to split the workload.. What about 4? What about an array of C&C chips to send jobs to a farm of workhorse processors?

Where are the monkeys?

Still sounds like distributed.net.

Magically Parallelized? by Bob9113 · 2006-04-18 11:28 · Score: 1

I write a fair shitload of multithreaded and single threaded code. Most code cannot be magically parallelized. Parallel execution of code that has not been made thread-safe would cause teeming masses of race conditions. Null pointers everywhere. Division by zero would be the norm, not an exception.
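
To make the race-condition point concrete, here is a minimal sketch that replays a fixed interleaving of two unsynchronized "counter += 1" operations. It is a deterministic simulation rather than real threads, and the names (run_schedule, the "load"/"store" steps) are invented for illustration.

```python
def run_schedule(schedule):
    """Replay a fixed interleaving of unsynchronized increments.

    Each simulated thread performs 'counter += 1' as two separate steps:
    'load'  copies the shared counter into a private register, and
    'store' writes register + 1 back to the shared counter.
    schedule is a list of (thread_id, step) pairs in execution order.
    """
    counter = 0
    registers = {}
    for thread_id, step in schedule:
        if step == "load":
            registers[thread_id] = counter
        elif step == "store":
            counter = registers[thread_id] + 1
    return counter


# One thread finishes before the other starts: both increments survive.
serial = [("A", "load"), ("A", "store"), ("B", "load"), ("B", "store")]
# Both threads load before either stores: one increment is lost.
interleaved = [("A", "load"), ("B", "load"), ("A", "store"), ("B", "store")]

print(run_schedule(serial), run_schedule(interleaved))
```

Hardware that split a single thread across cores on its own would have to guarantee it never produces the second ordering for dependent operations.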

Now, if they're talking about allowing separate processes to run separately without specific SMP code in the kernel, fine. But that's not 2x performance.

--
Stop-Prism.org: Opt Out of Surveillance

Re:Magically Parallelized? by DoctorSVD · 2006-04-18 11:54 · Score: 1

> Now, if they're talking about allowing separate processes to run separately without
> specific SMP code in the kernel, fine. But that's not 2x performance.

Hmm, how would that even work without a type of OS (scheduler) completely different from what exists today? The scheduler on a single (logical) CPU system is multiplexing the CPU between different threads/processes. One thread is running while the others sleep, so only instructions from a single instruction stream are fed to the CPU at any given moment. Thus we are back at the much harder problem of extracting parallelism from a single instruction stream. In other words, "reverse hyperthreading" as described only reduces the amount of potential parallelism presented to the hardware at any given moment.
I don't think the authors of that article know what they are talking about.

Word by Bill,+Shooter+of+Bul · 2006-04-18 11:31 · Score: 1

Sorry, I don't have mod points. That's pretty darn informative right there.

I think that's a great example of the problems facing researchers in mathematics (and sciences) today. It's really hard to make connections between all of the disparate facts, theories, and experimental data to draw conclusions and lead to productive research and development. In short, we often experience mental stack overflow errors.

--
Well.. maybe. Or Maybe not. But Definitely not sort of.

Mathematically impossible. by blair1q · 2006-04-18 11:34 · Score: 0

Unless their compiler can predict the future, multiple cores will always have synchronization issues that keep them from approaching twice the performance of one core.

Re:Mathematically impossible. by DrDitto · 2006-04-18 11:57 · Score: 1

Super-linear speedup is possible and has been observed in SMPs because of greater cache locality.
Re:Mathematically impossible. by Lehk228 · 2006-04-18 18:46 · Score: 1

but couldn't such benefits also be realized by, rather than putting in a second core, putting in a huge cache and a few extra levels of slower (but still faster than main memory) cache?

--
Snowden and Manning are heroes.

Speculative Multithreading by DrDitto · 2006-04-18 11:45 · Score: 4, Interesting

This was proposed in academia over 10 years ago. It's called speculative multithreading, or "multiscalar" as coined by one of the primary inventors at the University of Wisconsin (Guri Sohi).

Basically the processor will try to split a program into multiple threads of execution, but make it appear as a single thread. For example, when calling a function, execute that function on a different thread and automatically shuttle dependent data back/forth between the callee and the caller.
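
As a rough software analogy for the function-call example above (not how Multiscalar hardware actually works), the sketch below forks the callee onto another thread at the call site, lets the caller continue with independent work, and waits only where the result is first needed. The function names are hypothetical, and it leaves out the hard part: detecting a dependence violation and squashing the speculative work.

```python
from concurrent.futures import ThreadPoolExecutor


def callee(x):
    """Stand-in for a function whose result the caller needs later."""
    return x * x


def caller_sequential(x):
    """Ordinary single-threaded order: call, wait, then continue."""
    result = callee(x)
    independent = x + 1          # does not depend on callee's result
    return result + independent


def caller_speculative(x, pool):
    """Fork at the call site, join at the first use of the result."""
    future = pool.submit(callee, x)   # callee runs on another thread
    independent = x + 1               # caller's continuation proceeds
    return future.result() + independent  # dependent data shuttled back


with ThreadPoolExecutor(max_workers=2) as pool:
    print(caller_sequential(7), caller_speculative(7, pool))
```

Both versions return the same value; the program still sees a single thread's semantics. In CPython this gives no real speedup for CPU-bound work, it only shows the fork/join shape.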

Re:Speculative Multithreading by pkhuong · 2006-04-18 12:21 · Score: 1

Or, potentially more simply, speculatively execute both branches and only commit the changes from cache (and kill the mispredicted thread) when the branch has been resolved. Things become easier to safely multi-thread when mutable state isn't shared. *cough* FP *cough* ;)

--
Try Corewar @ www.koth.org - rec.games.corewar
Re:Speculative Multithreading by AcidPenguin9873 · 2006-04-18 12:52 · Score: 2, Insightful

The "problem" with Multiscalar is that it requires compiler support to partition the program (AFAIK). It's not really a problem, I suppose, because Multiscalar is an academic project. But for millions of existing codes out there, a compiler-driven TLS system isn't going to buy you anything in terms of single-thread performance.
There are other academic projects that are attempting to do TLS dynamically, in hardware. PolyFlow at Illinois is one, Dynamic Multithreading (mentioned elsewhere in this story) is another, I'm sure there are others.

beware... by xiao_haozi · 2006-04-18 11:45 · Score: 2

"...two-core chip or provide quadruple the performance with a quad-core processor." unify, unite, and unihilate....beware the QUAD LAY-ZAH!

--
my site of misleading and incorrect information!

Load balancing might be interesting by Mia'cova · 2006-04-18 11:57 · Score: 4, Insightful

It might be interesting if they took this idea in a slightly different direction. Set it up so the OS detects two CPUs. But, when the OS fails to utilize both CPUs effectively, allow the idle CPU to take some of the active CPU's load. I'm taking this idea from nVidia working on load balancing between graphics and physics in a SLI setup. So in this case the OS gets the best of both worlds, the ability to break tasks off to each CPU and a free boost when it's stuck with a single cpu-limited thread.

Re:Load balancing might be interesting by tomstdenis · 2006-04-18 12:41 · Score: 1

The problem is once an instruction gets down to the core of the ... er core ... it's hard to get it to another core.

So you can only load balance at the process/thread level.

Tom

--
Someday, I'll have a real sig.
Re:Load balancing might be interesting by Mia'cova · 2006-04-18 17:17 · Score: 1

Very true. It's hard. But clearly they're trying to split things or we wouldn't be having this conversation. I agree that you'd have to commit to doing something one way or the other and rebalancing might be slow. But I don't think a typical use case would be jumping all over the map. We have to assume that there's some level of predictability. I suppose it might be easier to envision an effective load balancing approach if this scaled to a higher number of cores per chip and didn't balance between modes within a single core. Let's say in five or ten years we have a 16 core CPU. Maybe it'll make sense to put 15 cores against a CPU hogging task and assign a ton of IO bound threads to the last. This would avoid slowing down the main execution with context switching between processes, which is usually considered to be costly. Context switching might be even more of a pain with reverse multithreading. I couldn't really say.

See my other comment for a little more idea tossing on this.
Re:Load balancing might be interesting by GauteL · 2006-04-18 21:20 · Score: 1

"I'm taking this idea from nVidia working on load balancing between graphics and physics in a SLI setup"

Sadly, this is not as easy with general purpose CPUs as it is with graphics. A very large part of the graphics process is so-called "embarrassingly parallel". Basically, in many ways each pixel can be rendered independently. If the GPU is too busy, some of these calculations can simply be offloaded to the other GPUs.

For general purpose computing, there may be many, many forms of dependences that ruin this form of load-balancing. It will be interesting to see how well AMD solves this problem, most likely it will involve a lot of guesswork from the cores, but as long as it guesses correctly most of the time (like the branch prediction in super-scalar CPUs), this could very well lead to massive speed increases for single threaded applications.
Re:Load balancing might be interesting by tomstdenis · 2006-04-18 22:35 · Score: 1

You're missing the point.

Any task you can sufficiently isolate to different cores is a task you can thread. Otherwise if there is a lot of interdependence the concept won't work. Especially if they work in the same memory space. Keeping the caches sync'ed between cores would basically kill any benefit you think you can get.

Unlike say the Intel designs the AMD "dual cores" are really two distinct independent cores with their own caches living in their own worlds.

So why would AMD spend money on researching a concept which is basically doomed to failure?

Tom

--
Someday, I'll have a real sig.
Re:Load balancing might be interesting by P3NIS_CLEAVER · 2006-04-19 02:54 · Score: 1

I was thinking on a much higher level. A gamer or CAD guy would install the single-processor os and the office user would install the multi-processing os.

--
Please sign petition to restore sanity to our banking system!!!

http://financialpetition.org/

This is Like RAID for CPU's by Marc_Hawke · 2006-04-18 11:57 · Score: 4, Interesting

Striping: What is that? Raid 1? Raid 0? You take multiple disks, present them as one, and let the controller make the most efficient use of them while the OS and all the programs just have to deal with one big disk.

Looks like the same thing. You take multiple CPU's present them as one, and let the controller figure out how to best use them.

This could make for hot-swappable CPUs (heh) and the ability to have a CPU die without taking out your system. The redundancy nature of the other RAID configurations doesn't seem to translate very easily, but the 'encapsulation' concept seems to fit nicely.

--
--Welcome to the Realm of the Hawke--

Re:This is Like RAID for CPU's by SGrunt · 2006-04-18 12:08 · Score: 1

Ooh, hot-swappable CPUs. Now if somebody can come up with hot-swappable RAM we'll be *really* set. Fully redundant systems, here we come. 100% uptimes are the way of the future!
Re:This is Like RAID for CPU's by verbatim_verbose · 2006-04-18 12:23 · Score: 1

Actually, Linux has supported hot-swappable CPUs for years. You generally need special hardware for it, but it's been out there for a while.
Re:This is Like RAID for CPU's by algae · 2006-04-18 12:27 · Score: 1

1990 called - they want their IBM mainframe back.

--
Causation can cause correlation
Re:This is Like RAID for CPU's by zippthorne · 2006-04-18 12:34 · Score: 1

This may be a stupid question, but why is the ram separate from the CPU anymore anyway? Would it really be that difficult to include a few gigs on the main chip with enough redundancy to overcome manufacturing defects?

--
Can you be Even More Awesome?!
Re:This is Like RAID for CPU's by x2A · 2006-04-18 13:13 · Score: 1

Well then you've got two options - upgrading the RAM would mean you have some of your address space slower than the rest. So, you'd wanna use the faster RAM for most frequently used stuff, ie, it becomes a level3 (or for some CPU's) cache, you'd need management and storage to keep track of what's most often used, and swap pages out to the slower ram.

So all you're talking about is having a larger cache, which is happening.

--
The revolution will not be televised... but it will have a page on Wikipedia
Re:This is Like RAID for CPU's by be-fan · 2006-04-18 13:25 · Score: 2, Informative

Yes, it would be. RAM takes up a lot of area. Have you ever looked at a RAM module? It's made up of 8-16 separate chips. The densest single RAM chips are on the order of 128MB. Moreover, RAM is manufactured on very different (and much cheaper) processes than CPUs are. Certain types of RAM are compatible with certain CPU processes (eDRAM, for example), but these are not cheap, nor particularly dense.

--
A deep unwavering belief is a sure sign you're missing something...
Re:This is Like RAID for CPU's by tomstdenis · 2006-04-18 13:43 · Score: 1

The problem is that CPUs are very independent once instructions get into the decoder window. The only way to stop it is to raise an exception or interrupt (e.g. APIC signal).

So just because you may have 4 cores in your box [say dual-core 2P] doesn't mean all of the cores can act as one logically to the OS in a meaningful and efficient manner.

The striping analogy would be to dispatch instructions in round-robin fashion to all the processors. The problem with that is that the architectural state has to be shared. Keeping that in sync with current cores would kill any sort of performance gain you might hope to obtain.

Tom

--
Someday, I'll have a real sig.
Re:This is Like RAID for CPU's by Lehk228 · 2006-04-18 18:50 · Score: 1

Sun Microsystems called and wants their hardware back... and i think IBM is on the other line

--
Snowden and Manning are heroes.
Re:This is Like RAID for CPU's by Trejkaz · 2006-04-18 23:11 · Score: 1

Whoops, your motherboard just fried. Too bad you didn't have a redundant motherboard.

--
Karma: It's all a bunch of tree-huggin' hippy crap!

Re:Yes, AMD! You get it! by misleb · 2006-04-18 12:13 · Score: 1

How is one processor easier to manage than two? The OS takes care of it for you. All you have to do is make sure the load is appropriate and balanced. But you have to do that anyway... The problem with the OS seeing "as-simple-as-possible" hardware is that it can't take advantage of any of the features that you get with high end hardware. You can't get good diagnostics. And it is difficult to tune for a particular task. What if the algorithm that AMD uses to parallelize single threads isn't very good for your particular application? How do you tune the OS for a particular task if everything is done by the hardware?

If you don't care about these things, perhaps you should consider a new line of work.

-matthew

--
"THERE IS NO JUSTICE, THERE IS ONLY ME." -Death

Well, hey... by SGrunt · 2006-04-18 12:14 · Score: 0, Offtopic

To heck with powering millions of individual computers. Just make a million-core chip and wire them all up like this and you'll have one computer with the power of millions less all that nasty overhead of having to run an OS on all of the other computers. Power to the people!

Maybe looser coupling but shared cache by Anonymous Coward · 2006-04-18 12:16 · Score: 0

Could be that what is envisioned is to have one and the same cache shared between two processor cores, no need to separate it or arbitrate most of the time, so the coupling is looser than multiple execution units on a machine. What could wind up happening is that programs would be sequential only when forced to be. If it were a whole new ISA this might be interesting to program anew. As it is there'll need to be a bunch of these implied synch points to emulate 80x86 well enough. Lord knows if enough parallelism can be found to make it worth while.

a beowulf cluster of cpus? by pork0ne · 2006-04-18 12:47 · Score: 0, Redundant

imagine what you could do with it!

Re:a beowulf cluster of cpus? by Anonymous Coward · 2006-04-18 13:28 · Score: 0

i can finally play tetris

The Ideal Processor and Software Model by MOBE2001 · 2006-04-18 12:49 · Score: 1

FTA: It's the very antithesis of the push for greater levels of parallelism

There is only one way to achieve optimum performance using multiple cores (or multiple processors) and that is to adopt a non-algorithmic, signal-based, synchronous software model. In this reactive model, there are no threads at all, or rather, every instruction is its own thread or processor object. It waits for a signal to do something, performs its operation and then sends a signal to one or more objects. There can be many operations executing in parallel. At every tick of a virtual clock, there is a list of operations to be executed. These can be channelled to the available cores for processing, assuring a full load for the cores at all times.
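
For readers trying to picture that model, here is a toy sketch of one possible reading of it: each operation fires when it receives a signal, sends its result to the operations wired after it, and everything ready on a given tick is dealt out to the available cores. All names (run_reactive, wiring, the "src"/"double"/"inc" operations) are invented for illustration, and operations that need several inputs at once are not modelled.

```python
from collections import defaultdict


def run_reactive(operations, wiring, initial_signals, cores, max_ticks=100):
    """Simulate a tick-driven, signal-based execution model.

    operations:      name -> function(value) returning a value
    wiring:          name -> list of operation names to signal afterwards
    initial_signals: list of (name, value) pairs ready at tick 0
    Returns (trace, outputs): trace[t] maps core index -> operations run
    on tick t; outputs holds results of operations with no successors.
    """
    ready = list(initial_signals)
    trace = []
    outputs = {}
    for _ in range(max_ticks):
        if not ready:
            break
        assignment = defaultdict(list)
        next_ready = []
        for i, (name, value) in enumerate(ready):
            assignment[i % cores].append(name)   # deal ops out round-robin
            result = operations[name](value)
            targets = wiring.get(name, [])
            if not targets:
                outputs[name] = result
            for target in targets:
                next_ready.append((target, result))  # signal successors
        trace.append(dict(assignment))
        ready = next_ready
    return trace, outputs


operations = {"src": lambda v: v, "double": lambda v: v * 2, "inc": lambda v: v + 1}
wiring = {"src": ["double", "inc"]}
trace, outputs = run_reactive(operations, wiring, [("src", 3)], cores=2)
print(trace, outputs)
```

Note that even in this tiny example one core sits idle on the first tick, because only one operation is ready; the "full load at all times" claim depends on the program exposing enough ready operations every tick.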

The only caveat is the von Neumann memory access bottleneck, which gets you every time. In the end, I suspect that only optical computing or something based on quantum tunnelling will get around this very serious problem.

Re:The Ideal Processor and Software Model by cnettel · 2006-04-18 12:57 · Score: 1

The only caveat is that it would easily kill performance, and that your reliability statements are bogus. Wired hardware isn't reliable because you handle signals. It's relatively reliable because mistakes are expensive. Implement something like a chess AI or a nice user interface without algorithmic parts and without bugs. THAT would impress me.

Re:Yes, AMD! You get it! by Anonymous Coward · 2006-04-18 13:12 · Score: 0

You have a point about 1 processor being basically equally as easy as two to manage. It is the usefulness of this example as perhaps the start of a larger trend of simplifying system management that you fail to give value to. It is this potential trend that got me happy.

Think what you will, I know damn well where I work, and how much of my time is wasted learning yet another proprietary management tool that only brings very marginal benefits at the end of the day. That's why all of Google's datacenters run ONLY 1U cheapo rackmount servers with virtually no hardware redundancy. They agree that big iron just isn't worth the huge increase in cost. It is me who doubts you understand how marginal the gains from "big iron" usually are. They do failover in software (designed in house, unfortunately for the rest of us). I agree that many processors are good, but it is even better when they appear as one.

The easiest boxes to admin have one processor, one power supply, one nic, etc. When there are problems, there are less places to check. They do not need exotic drivers, or exotic options to turn on "the full powers". Think "make -j 2" when you compile a kernel, when all I wish I **needed** to know was "make". The more I have to learn all the little gotchas and caveats to using all this fancy stuff, the less productive I am, generally speaking.

Itanium by logicnazi · 2006-04-18 13:22 · Score: 1

Isn't this exactly what Intel did with the IA64 instruction set, i.e., the Itanium family? Added explicit support for simultaneous instruction execution?

Personally I'm still a big fan of this instruction set/system and feel it's a real shame that backwards compatibility/resistance to change has kept it out of the mainstream. I would dearly love the irony if AMD tried to introduce an Itanium-like processor now.

--

If you liked this thought maybe you would find my blog nice too:

Re:Itanium by Anonymous Coward · 2006-04-18 17:51 · Score: 0

"I would dearly love the irony if AMD tried to introduce an Itanium like processor now."

This is extremely unlikely for two reasons:
1) Explicit static parallelism still can't compete with the dynamic parallelism of an OoO CPU in terms of keeping execution units busy on common codes.
2) One of the big reasons Intel broke backwards compatibility with Itanium was so that they could generate an interlocking web of patents, so that they would never again have to face compatible competition.

Itanium was a clever architecture that is optimized for the old world where parallelism, issue width and frequency defined performance. It doesn't work now that power consumption (or heat dissipation, depending on your perspective) is the key bottleneck. Netburst/P4 ran into the same frequency/leakage wall at around the same time, though at least it wasn't an in-order design.
Re:Itanium by imgod2u · 2006-04-19 05:01 · Score: 1

I fail to see why it is not relevant to modern day problems of power consumption. One of the key ideas of IA-64 was to reduce transistor count (as transistors, like power now, were at a premium back then). This is indeed true even today as the IA-64 core itself is tiny in comparison to a Netburst or K8 core. The vast majority of the Itanium chips (even more so than the Netburst or K8 chips) is cache.
As for parallelism, it is very much the "new world". Multicore is the new trend, didn't you hear? With an ISA that has explicit parallelism built in, it would be much easier to do something like the proposed scheme that started this thread.
Of course, the strict design that Intel built into IA-64 does make it less flexible to things such as long memory latencies and such. I don't see it as what I would envision in an explicitly parallel ISA but that's just me.

Registers by Anonymous Coward · 2006-04-18 13:28 · Score: 0

Armchair engineer here: Why couldn't they just virtualize the registers?

You've missed something. by Anonymous Coward · 2006-04-18 13:34 · Score: 0

The case you stated is balancing 2 tasks (1. physics, 1. graphics) over 2 processors. - fine
Then you say 'well, they're doing that easily enough, lets divide one task over two processors', without noticing it's a totally different problem.

I'm sorry, but I think the only thing 'insightful' about this comment is that you don't really understand what you're talking about.

Re:You've missed something. by Mia'cova · 2006-04-18 17:06 · Score: 1

I was citing where my idea came from. I certainly did NOT suggest that a GPU and CPU were the same thing. Thanks for that. I suppose I meant core rather than CPU as this whole article is about a multi-core processor, not two GPUs. But let me be a little more descriptive. You've got three cases in the simple GPU example which sparked this for me, although one isn't really practical.

1) 100% graphics
2) A balance between graphics and physics (unlikely to be 50:50)
3) 100% physics (seems unlikely in practice)

So we have two GPUs and one or two tasks. Take a 60:40 graphics:physics situation in case 2. Ignoring overhead, nVidia would aim to use 100% of one GPU on graphics and 20% graphics, 80% physics on the second GPU. I'm only considering a case where CPUs could offer a similar kind of load balancing. In case 1, they operate in a manner similar to how AMD is proposing, where the single task is split between two cores. In case 2, they operate independently.
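
The 60:40 arithmetic above can be checked with a few lines. This is just a sketch of the bookkeeping under simple assumptions (two identical GPUs, no overhead, graphics fills the first GPU before spilling over); the function name is made up here.

```python
def split_two_gpus(graphics_share):
    """Split a graphics/physics workload across two identical GPUs.

    graphics_share: fraction of the total work that is graphics (0.0-1.0);
    the remainder is physics. Each GPU supplies half the total capacity.
    Returns [(graphics, physics), (graphics, physics)] as fractions of
    each GPU's own capacity.
    """
    capacity = 0.5                                  # one GPU's share of total work
    physics_share = 1.0 - graphics_share
    g0 = min(graphics_share, capacity)              # fill GPU 0 with graphics first
    p0 = capacity - g0                              # any room left goes to physics
    g1 = graphics_share - g0                        # leftover graphics spills to GPU 1
    p1 = physics_share - p0                         # remaining physics
    return [(g0 / capacity, p0 / capacity), (g1 / capacity, p1 / capacity)]


for graphics, physics in split_two_gpus(0.6):
    print(f"graphics {graphics:.0%}, physics {physics:.0%}")
```

With a 0.6 graphics share this gives 100%/0% on the first GPU and 20%/80% on the second, matching the figures in the paragraph.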

Let's say it takes a full half second for a dual core CPU to switch from dual mode to reverse multithreading mode. But let's say that the first core can continue working without noticeable interruption. An OS's dispatcher could be implemented to pick modes. If you've got 50 idle processes/threads and one CPU-bound thread, use reverse multithreading. If you have a sufficient number of 'active enough' threads to bring each CPU to 100% independently, do that.
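
The dispatcher rule just described could look something like the sketch below. Everything in it is hypothetical: no real OS or CPU exposes such a mode switch, and the function name, the mode strings and the 0.5 "active enough" threshold are placeholders.

```python
def pick_mode(thread_loads, cores=2, busy_threshold=0.5):
    """Choose a mode for a hypothetical dual-mode multi-core chip.

    thread_loads: estimated CPU demand per runnable thread, 0.0 to 1.0.
    Returns "independent" when at least `cores` threads are busy enough
    to occupy a core each, otherwise "reverse-multithreading" so the
    cores gang up on whatever single thread is CPU-bound.
    """
    busy = sum(1 for load in thread_loads if load >= busy_threshold)
    return "independent" if busy >= cores else "reverse-multithreading"


# 50 nearly idle threads plus one CPU hog: gang the cores together.
print(pick_mode([0.01] * 50 + [1.0]))
# Two busy threads: run the cores separately.
print(pick_mode([0.9, 0.8, 0.1]))
```

Given the assumed half-second switch cost, a real policy would also want some hysteresis so it doesn't flip modes on every brief spike.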

This would work under the reasonable (imho) assumption that a typical machine does not flip between many active processes and one single active process rapidly. Low load would probably run faster in a reverse multithreading mode. If multiple applications happen to spike at the same time, multiprocessing can handle that just as easily. Or just idle the 2nd core entirely for power consumption.

That's doing it without a smooth load balancing but would be effective in most situations where you're going to have a clear choice. Doing a smooth load balancer would certainly take more silicon and may or may not make sense. Kinda hard to say when we haven't heard how reverse multithreading is being implemented.

You completely failed to explain to me what is different between the two cases or why specifically this doesn't work. I'm throwing ideas around here so if you have anything productive to say, feel free to add constructively, sir anon coward...
Re:You've missed something. by Anonymous Coward · 2006-04-19 03:30 · Score: 0

My humble apologies for flying off the handle there. No offence was meant to yourself (it was directed more at the moderators).

>Take a 60:40 graphics:physics situation in case 2. Ignoring overhead, nVidia would aim to use 100% of one GPU on graphics and 20% graphics, 80% physics on the second GPU.

Ok, this seems fine, but for this to work we'd still need to be able to sub-divide the 'graphics' task across the two processors. This is easy to do if the task is designed in this manner (i.e. it's multi-threaded), but a nightmare if it isn't. The interesting aspect about a graphics pipeline is that large parts of it can be parallelised fairly easily. Scan-Line Interleaving is a good example of this. Having two threads modifying the same variables requires synchronisation, and is therefore more complicated. This is the more general problem we have.
Certain aspects of a rendering pipeline lend themselves well to parallelism, due to the type of number-crunching that's being performed. Unfortunately, most software is very different to these types of algorithms, and less easy to parallelise in this manner, so we're back to either writing software in a multi-threaded manner, or having a programming language take care of it 'under the hood'.

Typically for power-saving you have a thread that's the lowest priority on the system, but always ready to run, that puts the core it's running on into a sleep mode. When another thread is ready to run, it gets pre-empted. So yes, that's what typically happens when you can't sub-divide the task, and there's nothing left to run.

Imagine... by Junior+J.+Junior+III · 2006-04-18 13:39 · Score: 1

Is this more or less like a beowulf cluster on a chip?

No, seriously, I'm having trouble envisioning it.

--
You see? You see? Your stupid minds! Stupid! Stupid!

What about this... by Darkenreaper57 · 2006-04-18 13:55 · Score: 2, Interesting

I don't have a lot of background in CPU architecture, but what if there was a parallel processing unit designed specifically to allocate threads to the CPUs? This way, the cores can all function as one at the hardware level, rather than the software level (thus making it easier on developers and potentially increasing performance). Would it be better to have a dedicated unit/sector to process this information and divvy it up to the separate cores, or no?

theory by Anonymous Coward · 2006-04-18 14:14 · Score: 0

I might be talking out my ass here, but it seems to me that the best way to run a multi-core processor as a single would be to have the processor being used change when there is a context switch. Seems simple enough doesn't it?

It's not branch mis-prediction, it's the memory by vlad_petric · 2006-04-18 14:32 · Score: 4, Informative

Current superscalars still fetch instructions in order, and squash everything after a mis-predicted branch. The cost of branch mis-speculation is in fact getting higher, because deeper pipelines means longer times between the mis-prediction of the branch and the execution (where the correction takes place). In other words, it means longer times on the wrong path.

The purpose of "good dispatching" (i.e. out-of-order execution) is to hide the latencies of misses to main memory (it takes between 200 and 400 cycles these days to get something from memory, assuming that the memory bus isn't saturated), by executing instructions following the miss but not dependent on it. Out-of-order execution has been around since the Pentium Pro, btw.

--

The Raven

Re:It's not branch mis-prediction, it's the memory by flogic42 · 2006-04-18 16:05 · Score: 0

One of the benefits of having superscalar or multiple processors is that you could execute both the branch-taken and the branch-not-taken codes at the same time before the branch is resolved, reducing the penalty for misprediction to zero. This has not yet been implemented by Intel or AMD.

--
Check out my women's designer clothing store.
Re:It's not branch mis-prediction, it's the memory by Anonymous Coward · 2006-04-18 17:42 · Score: 1, Informative

This is a false economy, since the bottleneck in modern CPUs is power. Executing both ways of a branch means you always pay the power cost of a "mispredict" whereas with traditional branch prediction you only pay if you guess wrong.
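
A quick way to see the trade-off, as a sketch: count the work thrown away per branch under each scheme. The function names, the 95% predictor accuracy, and the 20-unit wrong-path cost are all assumptions for illustration, not measurements.

```python
def wasted_work_predict(accuracy, wrong_path_cost):
    """Expected wasted work per branch when only the predicted path runs."""
    return (1.0 - accuracy) * wrong_path_cost


def wasted_work_both(wrong_path_cost):
    """Running both paths always discards one of them."""
    return wrong_path_cost


# Assumed numbers: a 95%-accurate predictor, 20 units of work per wrong path.
print(wasted_work_predict(0.95, 20))  # wasted only on a wrong guess
print(wasted_work_both(20))           # wasted on every branch
```

Under those assumptions, running both paths throws away roughly twenty times as much work as predicting, which is the power argument in a nutshell.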

windows multi proc support by sentientbrendan · 2006-04-18 14:50 · Score: 1

I've always heard that windows support for multiple processors was pretty limited. Is this still the case? Is limited multi proc support for windows encouraging the development of this technology? Does load balancing of processes or threads across processors happen automatically?

I keep hearing that people get dual proc or dual core machines like the apple core duos, but that one proc or core lies dead under windows. Is this actually the case? Is this just a driver problem specific to a few machines? Do you need windows server 2003 or something?

Re:windows multi proc support by Slashcrap · 2006-04-18 21:55 · Score: 1

I've always heard that windows support for multiple processors was pretty limited. Is this still the case?

No. Not unless you're running Windows 98.

Is limited multi proc support for windows encouraging the development of this technology?

No. As a side note, this whole story is probably bullshit. The concept doesn't really make any sense at all. More likely they are just adding more execution units - this is a suspected feature of the upcoming K8L anyway. Remember that this story was reported by the Inquirer. They are rather prone to technical flights of fancy.

Does load balancing of processes or threads across processors happen automatically?

The OS scheduler does that.

I keep hearing that people get dual proc or dual core machines like the apple core duos, but that one proc or core lies dead under windows. Is this actually the case?

No. Unless they are loading XP Home on it. In fact I seem to remember that although XP Home doesn't support SMP it does support dual cores. So they're probably just bullshitting you. Although I find it hard to believe that a Mac owner would lie about Windows to make it look bad.
Re:windows multi proc support by pinkfloyd89 · 2006-04-19 02:25 · Score: 1

I run dual, dual core opterons in windows xp, and I get 4 cpus just fine. Windows handles multiple cores and multiple processors just fine

Better use of parallelism by mentaldrano · 2006-04-18 15:19 · Score: 1

Rather than trying to make an end run around Amdahl's law, why not duplicate the processor paths?

Say you have a single threaded application with lots of branches and little instruction level parallelism (ILP). Rather than trying to predict the branches or worry about read-before write errors, just clone the processor state and run BOTH branches simultaneously! If you have a core (or three) just lying around while you run a single threaded app, use it. No need for prediction at all, and no penalty for mispredicting a branch. Just dump the state of the cores that missed, clone the "correct" path to all available cores, and keep going. Assuming core-to-core cloning is fast and there is no ILP that could be taken advantage of by the other cores, why waste them?

Tell me I'm not the first person to think of this, 'cause it's too obvious.

Re:Better use of parallelism by Anonymous Coward · 2006-04-19 01:49 · Score: 0

Tell me I'm not the first person to think of this, 'cause it's too obvious.

You aren't. What you are describing is how the Itanium works with predication except there's no "cloning". Duplication of effort isn't a good thing as it's a waste of resources. However, the Itanium (EPIC) architecture can execute both/multiple paths of logic simultaneously and when one path is found to be the incorrect one, you can throw the incorrect one away and continue on with the correct one.

Multi-cpu emulation? by cbreaker · 2006-04-18 15:30 · Score: 1

Think of it - Multi-core CPU's bound to appear as a single CPU, and then Hyperthreading on top? =)

--
- It's not the Macs I hate. It's Digg users. -

Re:Multi-cpu emulation? by Anonymous Coward · 2006-04-18 22:21 · Score: 0

That might not be a bad idea! I guess we'll find out when/if Intel steals it.

_V_

Very limited applicability by hyc · 2006-04-18 15:37 · Score: 1

None of this will mean anything to desktop PCs, this is something that only the HPC crowd would need. First you have to start from the basic assumption that you have an algorithm that you only know how to code in serial fashion. And, you need it to run as fast as possible. You would like to run it on a single 6 GHz CPU but all you have on hand is a pair of 3 GHz CPUs, and this algorithm of yours can't be partitioned for parallel execution.

Intuition tells me that if a competent programmer with complete access to information about and understanding of a particular algorithm can't figure out how to effectively parallelize it, there's no way a hardware state machine will do a better job of it.
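
For reference, Amdahl's law (mentioned elsewhere in this discussion) puts a number on the "pair of 3 GHz CPUs" problem. A minimal sketch, where `parallel_fraction` is the share of the run time that can actually be split across cores:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# A fully serial algorithm gains nothing from the second core;
# only a fully parallel one behaves like the imaginary 6 GHz part.
for p in (0.0, 0.5, 1.0):
    print(p, round(amdahl_speedup(p, 2), 3))
```

So for the strictly serial algorithm described above, two cores give a speedup of exactly 1, unless the hardware can somehow find parallelism the programmer could not.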

Again, only the HPC crowd would ever run this way. E.g., you could boot DOS on your 3 GHz Opteron and it would be several orders of magnitude faster than Windows, but it would only run one thing at a time. Very few desktop PC users can live with single-tasking today, otherwise more people would still be using DOS. And multiple cores are better for running multiple independent tasks than a single core, so this whole pursuit is only useful if you want to use a monster machine to solve a single problem at a time (e.g. Deep Thought).

--
-- *My* journal is more interesting than *yours*...

Re:Very limited applicability by cnettel · 2006-04-18 22:22 · Score: 1

I beg to disagree. If you look at the peaks of a dual-core system for a desktop user, you still don't reach 100 %. General GUI responsiveness is goodie-goodie, but halving waiting times would also be good. Remember that a P4 with HT will give much of the responsiveness, so a system that could merge threads in theory, but also split them, would be quite nice. (If you solve the "merging" problem, it seems trivial to split them up again with quite short notice.)
Re:Very limited applicability by hyc · 2006-04-19 09:54 · Score: 1

Perhaps. I guess we won't know until someone actually defines the theory and puts it into practice.

At one extreme, you'll have the strictly serial code, a sequence of instructions that all depend on the preceding instructions/results. Splitting this into separate threads would be horrendous for performance, because you'd have to be continuously shuttling data between the registers of one core to the other. At the opposite extreme will be code with very few sequential dependencies, something that could easily have been written as explicitly parallel if the programmer wanted to. I suspect many applications will fall somewhere in between, but the majority of code will be serial.
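
The two extremes can be sketched in a few lines of Python. The `step` function is an arbitrary stand-in invented for the example; what matters is the shape of the dependencies, not the arithmetic.

```python
def step(x):
    """Arbitrary stand-in for 'one instruction's worth' of work."""
    return (x * 31 + 7) % 1_000_003


def serial_chain(seed, steps):
    """Strictly serial: every step needs the result of the previous one."""
    x = seed
    for _ in range(steps):
        x = step(x)
    return x


def independent_map(values, order):
    """No cross-item dependencies: any visiting order gives the same answer."""
    out = [None] * len(values)
    for i in order:
        out[i] = step(values[i])
    return out


print(serial_chain(1, 3))
print(independent_map([1, 2, 3], order=[2, 0, 1]))
```

The second function could hand each index to a different core with no communication at all. The first cannot: splitting it would mean passing `x` between cores on every single step, which is the register shuttling described above.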

That's an observation based on common practice today, of course. The basic definition of a "program" is a *sequence* of instructions, so that's how we're taught to think about computing. That's also how we relate to life though: in terms of timelines. Even if we can train ourselves to think of programs as collections of independent events, in order to interface them to the real world, we have to serialize them.

That assumes the program has to interface to the real world while it runs. E.g., a video codec that is decoding a bitstream for display on a framebuffer in realtime, probably ought to do the decoding in scanline order. But a batch-mode codec that's just transcoding from one file format to another can process the data in an arbitrary order. If the memory hardware had a quirk that made some other order most optimal, you could break up the processing to accommodate that... I think this illustrates the divide between "theory" and "reality" pretty well - programs that must operate in real time, or provide real time response to user input, probably won't be able to take much advantage of the theoretical merge/split optimizations.

--
-- *My* journal is more interesting than *yours*...

Mod parent up by ameline · 2006-04-18 16:54 · Score: 1

Excellent and informative post.

--
Ian Ameline

I think it will look like two cores after all by Dr.+Spork · 2006-04-18 17:22 · Score: 1

My initial guess would be something like yours, except not intended as funny. I pictured a dual core CPU that looks to the OS like a dual-core CPU, and works like one when the OS gives both cores some work to do. But if the OS only schedules real work for one of the cores, the chip logic kicks in and the entire chip poses as the one core that is being given work. I assume that the posing is done by some sort of run-ahead or something - that's the big mystery - but anyway, in single-thread apps, the core asked to run the thread somehow covertly gets help from the other core and with this help the task gets done faster.

When you have two cores, it would be a real waste to not let an SMP-optimized program see both of them. That's why I doubt that if this ever becomes a product, it will look to the OS like a single core. But if it really is possible to let two cores cooperate on running a single thread, it would be nice for them to do so when an application is only willing to run as a single thread.

Let's try an analogy: Assume two heads are better than one. But some tasks are explicitly not meant for two heads, say, taking a math test. So say I go in for the test, appearing only as a single "head" to the test-giver (the "interface") but I covertly ask my friend to help me on the side. This makes my result better. Of course if the "interface" explicitly allowed for team work on the test, there would be no reason for the covert (probably inefficient) communication, so we could drop the pretense of being one person. So the analogy is, when a single core is asked to "work alone" on a problem and it can figure out how to get useful help from its friend so the work goes faster, AMD wants to make sure that it really gets the help.

the reason for AMD combining cores by techrunner · 2006-04-18 17:25 · Score: 1

AMD is planning to increase the number of cores on a single chip. I would expect to see hundreds of cores on a single chip a decade from now. The question becomes, how best to allocate cores to different processes. Should each thread get just one core? Should the cores run at different speeds and some threads get faster cores than other threads?

From a design standpoint, it is better to run every core at the same speed. It makes design much simpler because every core is the same. The problem then becomes what to do with a thread that needs a faster core. It would be very nice to be able to combine multiple slower running cores to form a virtually fast running core.

AMD's strategy by Anonymous Coward · 2006-04-18 17:31 · Score: 0

AMD's strategy is obvious: when Intel steps in a direction that breaks backwards compatibility or requires industry wide changes, they step in and provide the lazy route for the industry. If they hadn't adopted this strategy for x86_64, x86 would be on its dying breath right now. It's annoying really... I used to hate Intel's backwards compatibility dogma, but now AMD has also tried to take advantage of it. Here they are now, with a likely half-baked solution that promotes status quo. This will likely persuade the business types wary of risk. In my view, Intel, AMD, and Microsoft are all guilty of holding the industry back. We are doomed to sit at the local maxima because of these policies.

Or maybe predicting both branches? by silverdirk · 2006-04-18 18:18 · Score: 5, Interesting

As one reply stated, you can't know which is right unless you had 3 cores.

But, with two cores, you could have a way to predict "branch" and "not branch" at every prediction spot. The core that gets it right sends the registers to the other core so they can continue as if every branch were predicted correctly...

That would only work if you had a nice fast way to copy registers across in a very small number of clock cycles... so again, just a bunch of speculation. But it was a neat enough idea I had to say it.

--
Mark of the Coder fades from you. You perform Opening on World of Warcraft. Warcraft crits GPA for 4. GPA dies.

Re:Or maybe predicting both branches? by qa'lth · 2006-04-18 19:02 · Score: 1

Why even transfer the state over? You are already executing the correct path on one core, just prime the other core to pick up the next branch and let things go from there.

This seems like it would work really really well on very parallel CPUs, or quantum CPUs.
Re:Or maybe predicting both branches? by Anonymous Coward · 2006-04-18 21:06 · Score: 0

congrats, you have just re-invented the itanium processor

they do something very similar about misbranches

and look how the industry loved that
Re:Or maybe predicting both branches? by Anonymous Coward · 2006-04-19 03:53 · Score: 0

...someone tell the parent poster he just described IA64.
Re:Or maybe predicting both branches? by mikefe · 2006-04-19 20:21 · Score: 1

...someone tell the parent poster he just described IA64.

I believe it is the instruction set that is a major hurdle, not executing the same instructions on two cores and switching cores based on correct branch outcomes.

Also there are certain workloads that other processors can't beat when compared with IA64 IIRC.

--
There: Something at a specific location.
Their: Owned by someone.
Please make sure your english compiles.
Re:Or maybe predicting both branches? by silverdirk · 2006-04-20 09:57 · Score: 1

Well, in order to prime the other core.. I'd think you'd need to get its state to match the first one. Like, you can't start executing instructions that need the value of registers EAX and EBX unless you have the correct value stored in them, and the correct values are what is in the registers of the processor that chose the right branch.
Unfortunately, there's also a lot of other state that would need to be copied as well... like the contents of the entire pipeline, just about. But, you don't need them to be synchronized. Assume one processor is designed to always take branches (call it T), and the other is always designed to not take branches (call it NT). Let's say NT is wrong about a branch, and takes 4 clocks to get the state streamed from T. I'm assuming this is faster than dumping its pipeline and doing the calculations itself. It is now about 4 clocks behind. Suppose it is wrong again. It loads the state from T, and is still 4 clocks behind. Suppose we are in a loop, and it is wrong for the next 500 iterations. It is still 4 clocks behind. Now suppose the loop ends, and T is wrong, and NT is right. NT now becomes the "official" processor, and sends its state to T. T is now 4 clocks behind. (8 clocks behind where it was originally).
So basically, this would use two processors to reduce the penalty of a mis-predicted branch to some small number of clock cycles. I'm not sure if that would be worth it or not.
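
Here is a toy Python model of that bookkeeping, just to check the arithmetic. The function name is made up, the 4-clock transfer is the figure assumed above, and the model deliberately ignores everything else (including whether streaming a whole pipeline in 4 clocks is remotely realistic). It only tracks how far the "official" core has fallen behind.

```python
def official_core_lag(outcomes, transfer_clocks=4):
    """Clocks lost by the 'official' core over a run of branch outcomes.

    outcomes: iterable of booleans, True = branch taken.
    Core T always guesses taken, core NT always guesses not-taken.
    The losing core reloads state and trails by transfer_clocks, so the
    program only loses time when the official role changes hands.
    """
    lag = 0
    official = None
    for taken in outcomes:
        winner = "T" if taken else "NT"
        if official is not None and winner != official:
            lag += transfer_clocks  # new official core was trailing
        official = winner
    return lag


# The loop from the post: taken over and over, then one exit branch.
print(official_core_lag([True] * 502 + [False]))
# Worst case for this scheme: the direction flips on every branch.
print(official_core_lag([True, False] * 3))
```

In this model the cost scales with how often the branch direction changes, not with how many branches there are, so long loops are nearly free and alternating branches pay the transfer cost every time.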

--
Mark of the Coder fades from you. You perform Opening on World of Warcraft. Warcraft crits GPA for 4. GPA dies.

Re:Yes, AMD! You get it! by Anonymous Coward · 2006-04-18 21:04 · Score: 0

"This is to say nothing about all the developer effort that would be saved from not needing to make making SMP-safe code".

That's silly. Come on, you still have to write "smp-safe" code if you want to run more than one thread, or even more than one process if they access shared resources. And it happens every day...

Perhaps with virtualisation by Donegrim · 2006-04-18 21:45 · Score: 1

Most of the time the CPU would be running its special "appears to the OS to be single core" thing, but if a process really explicitly wanted to see all of the available cores, then it could call on some SMP-aware OS to run on the CPU's hardware virtualisation. Might not work, I'm not sure.

Re:Frenchie Site! by Slashcrap · 2006-04-18 21:58 · Score: 1

Eh? What? The Register is British.

Yes, embarrassing isn't it old chap?

FreeScale already has this in their MPC8641D PPC by willy_me · 2006-04-18 22:15 · Score: 1

For more info, click here.

Willy

MT is More than just speed by mindflow · 2006-04-18 23:18 · Score: 1

Multithreading is more than just speeding things up. Quite importantly, it also allows for more responsive operation. Ex: having GUI and logic in separate processes, allowing the GUI to respond when the logic is busy. This doesn't make anything "faster", it just makes things more responsive.
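
A bare-bones sketch of that split in Python, using threads rather than separate processes to keep it short. `slow_logic` and `run_ui` are invented names standing in for the application's logic and its event loop, and the 10 ms poll interval is arbitrary.

```python
import queue
import threading


def slow_logic(n, results):
    """Stand-in for the 'logic' side: some long computation."""
    results.put(sum(i * i for i in range(n)))


def run_ui(n):
    """Stand-in for the 'GUI' side: keeps ticking while the logic runs."""
    results = queue.Queue()
    worker = threading.Thread(target=slow_logic, args=(n, results))
    worker.start()
    ticks = 0
    while True:
        ticks += 1  # a real GUI would handle events and redraw here
        try:
            answer = results.get(timeout=0.01)
            break
        except queue.Empty:
            continue
    worker.join()
    return answer, ticks


answer, ticks = run_ui(1000)
print(answer, ticks >= 1)
```

The total work is the same as calling `slow_logic` directly; the only gain is that the loop in `run_ui` never blocks for longer than the poll interval.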

Re:MT is More than just speed by mindflow · 2006-04-18 23:47 · Score: 1

I guess what I wanted to say was that you still need skills in concurrent programming, even though CPU throughput continues to increase with new technologies.

basic electronics by Anonymous Coward · 2006-04-18 23:24 · Score: 0

whats the deal with putting all these things in parallel?
everybody knows they would work faster in SERIES!

AMD = Less Intel Market share? by SphericalCrusher · 2006-04-18 23:41 · Score: 1

It seems that the thing is that there will be a two core AMD processor working as hard as it can and displaying to the system as one single processor. I don't know if that will be able to work harder over time or not. And about a cooling method, the way it sounds, it shouldn't be overheating that much. So in reality, I don't really know what AMD is aiming for other than putting Intel's HyperThreading technology down on the market.

--
"Instant gratification takes too long." - Carrie Fisher

A good start... by BecomingLumberg · 2006-04-19 00:19 · Score: 1

This is certainly a good start, although another user (above) was correct in saying that it is superscaling all over again. The real magic would be the ability to switch between modes dynamically- have 4 processors when you are IMing, /.ing, and other general dicking around, 1 big processor when you are gaming, and maybe two processors when you are running a few big programs. Even more cool would be a 3-1 split when you are using one heavy duty app and a few lightweights at the same time.

One good example: The ability to dynamically switch between SLI modes is what is currently making me look into it for my new build, since I can use multiple monitors during the day for work and one for gaming at night. Using this concept for CPUs, I think you could build chips that can be both powerful and flexible, instead of either/or as is now the case.

--
If a nation expects to be ignorant and free, in a state of civilization, it expects what never was and never will be.-TJ

The Pugh link is Java-specific by p3d0 · 2006-04-19 00:57 · Score: 1

What part of that article is relevant to the topic of the new AMD processors?

--
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....

What I think this might be. by Illissius · 2006-04-19 01:09 · Score: 1

On the surface, this sounds a bit like transforming single threaded into multithreaded code, which as I gather, is pretty much impossible to do in a generic and widely useful fashion.

What it may be instead, is the ability to dynamically reallocate execution logic into different configurations of 'cores' at will; so the CPU could appear either as a single core with 6 execution units, or as a dual core with 3 units in each, to take advantage of either instruction- or thread-level parallelism, whichever is in greater abundance at the moment. This way it's not creating extra parallelism out of thin air, which wasn't there before (which, I believe, would either be ridiculously hard or impossible to do), but rather makes maximum use of whatever parallelism is there to be found*.

I'm not sure how theoretically possible this is, but it does seem more likely than the other proposition.

--
Work is punishment for failing to procrastinate effectively.

SMP Safe Code by everphilski · 2006-04-19 01:45 · Score: 1

I'm surprised you are willing to sacrifice performance for simplicity. You realise a single core can only run a single process at once, correct?

This is to say nothing about all the developer effort that would be saved from not needing to make making SMP-safe code.

lol. It really isn't that hard to make code thread and SMP safe. Check here for a good starting point. I learned how to write thread-safe code as a hobby over the course of about two weeks, I converted my thesis work and a simulation toolkit to be thread-safe in about a week. It really isn't hard. I'm not a CS either, I'm an aerospace engineer. Now I can do a sh*tload of monte carlo runs on my dual core box at home at double the rate. (By the way: Qt's thread library is great but if you can't live with the license then check out OpenSceneGraph's OpenThreads library)
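
For anyone who hasn't seen it, the core of "thread-safe" is usually just this: guard shared state with a lock. A minimal Python sketch follows. The class and function names are made up for the example, and it uses the standard `threading` module rather than Qt or OpenThreads.

```python
import threading


class SafeCounter:
    """A shared counter whose updates are serialized by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def add(self, n):
        # Only one thread at a time may do the read-modify-write.
        with self._lock:
            self.value += n


def run_workers(num_threads, increments):
    """Hammer one SafeCounter from several threads and return the total."""
    counter = SafeCounter()

    def work():
        for _ in range(increments):
            counter.add(1)

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print(run_workers(4, 10000))
```

For Monte Carlo runs the easier route is usually to give each worker its own state and random stream and combine the results at the end, so the lock is only touched once per worker.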

Your website reads like crackpottery by p3d0 · 2006-04-19 01:59 · Score: 1

I'd like to believe you have found the Silver Bullet. Any examples of real complex systems you have developed with it?

--
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....

It's a software licensing issue by MykePagan · 2006-04-19 02:06 · Score: 1

Forget any talk about CPU efficiency. This is a software licensing play first and foremost. Much of the most expensive software (*cough*Oracle*cough*) is licensed per CPU. There has been some browbeating by AMD and Intel to get ISVs to license per socket, and some have gone this way, but there has been a lot of acrimony. If you license per CPU, the software company makes out like a bandit as all the machines have twice as many CPUs in them. The result is some customers defer upgrading CPUs, and AMD/Intel lose out. If the ISVs charge per socket, AMD/Intel is very happy since it's a no-brainer for their customers to upgrade, but the ISV perceives a potential loss of revenue.

In the "old days" where CPUs just got faster and faster, the ISVs didn't complain about this. In fact, they benefitted since they could cram more bugs^H^H^H^Hfeatures into each sequential release and the customer didn't complain. Now the prevalence of dual core CPUs makes them feel like they're potentially leaving money on the table.

Enter "reverse hyperthreading." Make a multicore CPU look like a single CPU, and those software licensing issues go away and we're back to the good old days.

Ob SNL ref by IceAgeComing · 2006-04-19 02:53 · Score: 1

Do I still get free wedgie with that one?

OKLAHOMA !!! OOOOOOOKLAHOMA !!!

Re:Yes, AMD! You get it! by saleenS281 · 2006-04-19 05:41 · Score: 1

You have a point about 1 processor being basically equally as easy as two to manage. It is the usefulness of this example as perhaps the start of a larger trend of simplifying system management that you fail to give value to. It is this potential trend that got me happy.

I don't think that's a trend at all. I think this is AMD trying to find a gimmick to oust intel, assuming it's true at all. If you want easy to manage, buy M$. But realize with that ease of management you lose the flexibility of *nix

Think what you will, I know damn well where I work, and how much of my time is wasted learning yet another proprietary management tool that only brings very marginal benefits at the end of the day. That's why all of Google's datacenters run ONLY 1U cheapo rackmount servers with virtually no hardware redundancy. They agree that big iron just isn't worth the huge increase in cost. It is me who doubts you understand how marginal the gains from "big iron" usually are. They do failover in software (designed in house, unfortunately for the rest of us). I agree that many processors are good, but it is even better when they appear as one.

I hate to be the one to break it to you, but google is moving away from this. We have several accounts with them and I can tell you they're doing just the opposite. They've found all the 1U's to be a waste of energy and resources and have been looking to purchase "big iron" to replace it.

The easiest boxes to admin have one processor, one power supply, one nic, etc. When there are problems, there are less places to check. They do not need exotic drivers, or exotic options to turn on "the full powers". Think "make -j 2" when you compile a kernel, when all I wish I **needed** to know was "make". The more I have to learn all the little gotchas and caveats to using all this fancy stuff, the less productive I am, generally speaking.

No, the easiest boxes to administer are the ones that were engineered with administration in mind. I'd MUCH rather have a "big iron" sun box that says "HEY FUCKER, CPU6 IS FUBAR", than a 1u/1cpu whitebox that just starts having compile issues, and random reboots with no apparent cause. What you want is good engineering, you're just confused. I can tell you from personal experience, it can be far easier to track down a problem on a 28way Sun box than a no-name built-from-newegg whitebox.

How about games? by freezin+fat+guy · 2006-04-19 06:50 · Score: 1

John Carmack was recently interviewed regarding the new multicore game consoles. One of the more memorable quotes:

"...Anything that makes the game development process more difficult is not a terribly good thing."

Re:How about games? by hyc · 2006-04-19 09:29 · Score: 1

You have to look deeper to understand that statement. Games inherently involve multiple tasks running simultaneously. Look at every arcade game made from the 1980s onward and you'll see a cluster of processors working in concert. In a modern game every enemy has its own independent AI controlling it. The reality is that it *simplifies* game development to make those AIs run in their own thread. It simplifies game development to have a smart dedicated audio processor that you just periodically tell "play this list of sounds (i.e., an audio program) at time t" because your main CPU can keep focused on processing the game action. It simplifies game development to have a smart dedicated graphics object processor that you tell "draw this list of objects/textures (i.e., a video program) and move it there, and tell me about any collisions."

Look at MAME and look at how much effort it takes to make a single Pentium running at however-many gigahertz play a 1990s era arcade game smoothly. The fact is that making a single CPU do everything *sucks* for responsiveness. The secret to the playability of those games is multiprocessing, multiple CPUs splitting up the tasks amongst themselves. Jumping through hoops to make all of that stuff play nicely on a PC is ridiculous. (And yes, I was in the arcade game business in the 80s and 90s, I know a bit about how game development on PCs and on arcade consoles works.)

The fact is that good, responsive games inherently need to be multi-threaded. Running them on multiple cores is a natural path to improving their responsiveness.

--
-- *My* journal is more interesting than *yours*...
Re:How about games? by freezin+fat+guy · 2006-04-19 10:06 · Score: 1

Wish I could mod your comment up but someone else will have to do that for me.

What is the issue Carmack is addressing then? (I'm not implying, like some, that he's the be-all of the industry, but he is certainly significant.) Is he wishing for a standard between consoles? Does his concern stem from the fact he has always supported PC's?

I'd like to know what application this is for by GWBasic · 2006-04-19 08:18 · Score: 1

I'd like to know what application this is for. Making a multi-core chip appear as one chip would cause performance degradation in a multi-threaded environment because of the overhead with context switches. (For example, I don't see how this would help someone who's ripping a CD, running a virus scanner, and browsing the web at the same time.)

This might help games, but by the time it comes out it could be too late. At IDF, Intel demonstrated some very powerful programmer aids for multi-threaded programming. By the time AMD's technology is on the market, will it offer an improvement to games that take advantage of new multi-threading techniques?

Given that the current "make your program multithreaded" techniques are primarily making "for" loops multi-threaded, is AMD's technique going to "automagically" distribute loops among all of the cores?
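
For context, this is roughly what "making a for loop multi-threaded" means when done by hand: split the iteration range into chunks, run each chunk on a worker, then combine the partial results. A Python sketch with invented helper names follows. Note that in CPython the global interpreter lock means threads like these will not actually speed up a CPU-bound loop, so treat this as an illustration of the structure only.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(n, workers):
    """Split range(n) into `workers` contiguous (start, stop) pieces."""
    step, extra = divmod(n, workers)
    ranges, start = [], 0
    for i in range(workers):
        stop = start + step + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def parallel_sum_squares(n, workers=2):
    """Sum i*i for i in range(n), one chunk of the loop per worker."""

    def partial(bounds):
        lo, hi = bounds
        return sum(i * i for i in range(lo, hi))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial, chunk_ranges(n, workers)))


print(chunk_ranges(10, 3))
print(parallel_sum_squares(1000))
```

This only works because the iterations are independent and the final sum does not care about order. Hardware that wanted to do the same thing "automagically" would have to prove both of those properties on its own.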

--
No, I will not work for your startup

263 comments