AMD Going Dual-Core In 2005
gr8_phk writes "We recently learned of Intel's
plans to go dual-core in late 2005. Well it seems AMD has
decided to follow suit. It should be noted that the K8 architecture has had this designed in
from the start. Will this be socket 939 or should I try to hold out another year to buy?"
linky linky!
2's company, 3's a crowd, and 4 is for the fat cats who wipe their ass with 50 dollar bills.
you can find them all here. It seems news has gotten around, and that AMD's dual core will consume just about as much power as a single core CPU at 90nm.
ignorance is bliss. googlefiberatx.com
actually it'll probably be more like the processor gets so big that you just clip things onto the outside of it and it takes the place of the motherboard.
A feeling of having made the same mistake before: Deja Foobar
I have seen some licensing schemes that apply to per-processor costs... 1 CPU = $1,000, 2 CPU = $2,000 etc.
How long will it take to argue that consumers with a dual core processor should pay 2x the price? I'm betting not long.
Interestingly, in a review of P4 vs. K8, the K8 had a clear advantage at the 4 processor level and above, apparently because of reduced bus conflicts with their individual memory spaces. If AMD were to proliferate cores on chip, they'd wind up contesting for the memory bandwidth, just like the P4.
--- Bill
heat
Yes..the evil of all machines
the reason why, when the AC is not on in my house and it is 90 degrees outside, my computer resets
and of course..the reason why we're not going quad core
well..at least that's my personal opinion...as for the real reason...probably for profit...
Moore's Law has NOTHING to do with CPU speed.
/morz law/ prov. The observation that the logic density of silicon integrated circuits has closely followed the curve (bits per square inch) = 2^(t - 1962) where t is time in years; that is, the amount of information storable on a given amount of silicon has roughly doubled every year since the technology was invented. This relation, first uttered in 1964 by semiconductor engineer Gordon Moore (who co-founded Intel four years later) held until the late 1970s, at which point the doubling period slowed to 18 months. The doubling period remained at that value through time of writing (late 1999). Moore's Law is apparently self-fulfilling. The implication is that somebody, somewhere is going to be able to build a better chip than you if you rest on your laurels, so you'd better start pushing hard on the problem. See also
from a google search.
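As a rough illustration of the curve in the quoted definition, here is a minimal Python sketch. The function names are made up for this example, and the numbers are only what the quoted formula predicts, not measured densities.

```python
def logic_density(year):
    """Bits per square inch according to the quoted curve 2^(t - 1962)."""
    return 2.0 ** (year - 1962)


def growth_factor(years, doubling_period_years):
    """How much density multiplies over `years` for a given doubling period."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Ten yearly doublings after the 1962 starting point of the curve.
    print(logic_density(1972))
    # Same three-year span, yearly doubling vs. the later 18-month doubling.
    print(growth_factor(3, 1.0))
    print(growth_factor(3, 1.5))
```

The second function is there only to show how much the slowdown from a 12-month to an 18-month doubling period matters over a few years.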
Moore's Law
This could ultimately lead to a reformulation of Moore's Law. Thus, I propose k4_pacific's hypothesis:
The number of processor cores doubles every eighteen months.
Unknown host pong.
They're making the first Desktop Fusion Unit!
You're planning on waiting more than a full year between computer upgrades? Are you sure you're on the right website?
How can we continue to believe in a just universe and freedom to eat crackers if we have no ale?
I could see a big future of heatsink business in Intel and AMD's plans.
There is a spark in every single flame bait point.
Sure, if you are happy having not only liquid radiator cooling but also copper heatpipe cooling. That is right: as I have discovered here, Apple has had to implement not one but two separate cooling solutions for their 2.5GHz PowerMac G5. What were you saying again? You do realize, don't you, that you will be able to swap the two single-core CPUs in a dual Opteron system for two dual-core CPUs and have quad-CPU power? And that makes the G5 an advantage how?
You'll need a new motherboard.
The DDR memory interface appears to wrap around both L2 caches, meaning that it looks like both cores have their own 128-bit memory interface; whether or not both memory controllers will be enabled is another thing, but if this is true we have a number of implications to talk about. If dual core Opterons do indeed have two memory controllers, the pincount of dual core Opterons will go up significantly - it will also make them incompatible with current sockets. AMD is all about maintaining socket compatibility so it is quite possible that they could only leave half of the memory controllers enabled, in order to offer Socket-940 dual core Opterons. AMD isn't being very specific in terms of implementation details, but these are just some of the options.
Are you a VF grad? Check out the VFMA Alumni Forums VFMA Alumni Forum
is dilithium cores!
From the article. "If dual core Opterons do indeed have two memory controllers, the pincount of dual core Opterons will go up significantly - it will also make them incompatible with current sockets. AMD is all about maintaining socket compatibility so it is quite possible that they could only leave half of the memory controllers enabled, in order to offer Socket-940 dual core Opterons. AMD isn't being very specific in terms of implementation details, but these are just some of the options."
Dual core processors seem to me like a pretty good alternative to a dual processor system. You don't have the hassle of 2 huge coolers blowing out hot air, the mainboards don't have to be overpriced, and it is already supported by all OSes.
Some years ago I was thinking about getting a dual processor system. The motherboard alone was twice as expensive as a similar single processor one, applications did not support it at all, and so on. I hope newer applications are ready for dual cores. Quake III was the first game I know of that used two processors, and finally I can consider that animated desktop background.
Is there a list of which applications can effectively use dual cores, besides obvious things like webservers?
Anything multithreaded. Which is just about any modern GUI app.
-
You mean 8. This is a computer, you're legally bound to use a power of 2.
I still have more fans than freaks. WTF is wrong with you people?
Dual cores have been in the IBM PPC pipeline for quite a while - of course the (now old) Power4 arch has been multi-core all along.
In all probability the PPC little brother of Power5 (rumored to be called the 975) will debut at 90 nanometers and the next chip will be a ~60 nanometer dual core version possibly called the 976.
Which of these will be called the G6 is left up to the reader as an exercise. My money is on the 976. Either way the PPC has some serious legs.
=tkk
Bill Gates - Creationist?!?
Just when I thought I had saved up enough money between upgrades to splurge on those fancy ramen noodles, you know, the one with the dried peas, this comes along.
Hey, Wal-Mart brand noodles are only 8 cents!
Diablo II, Starcraft, Warcraft
Unreal Tournament 2004, Neverwinter Night, Dungeon Siege, Civ III
Myst, Riven, Exile
Medal of Honor and expansions, Battlefield 1942, Ghost Recon
Ghost Master
Quake III, Beyond Castle Wolfenstein
Escape Velocity Series, among others
There are plenty of other games for the Mac platform as well, check the Apple website for a larger list.
It's amusing to watch the chip manufacturers scramble desperately to meet the recommended specifications for Longhorn in time.
Oh, c'mon don't look at me like that. A slashdot story without some kind of Microsoft snipe just wouldn't be the same now, would it?
Alright, fine. I'll pick on SCO or AdTi next time. Sheesh. /me crawls back under his rock
She's built like a steak house, but she handles like a bistro....
actually there is plenty of bandwidth left in hypertransport to pull it off. also each cpu gets its own bank of memory. the design is superior to all others for SMP. even AMD's main CPU man says that at infoworld
AMD's dual-core server processors will share a single memory controller, Weber said. This won't create a bottleneck because a server with two Opteron chips, and therefore two memory controllers, already has more than enough memory bandwidth required to run that system, he said.
"It's always a juggling act to add a little more processing and a little more memory. Right now, we have plenty of memory and I/O bandwidth, so we're adding processing," Weber said.
The dual-core chips will work with current socket technology in motherboards that are rated for the specifications of the dual-core chips, Weber said. A BIOS change will be required, but otherwise the chips will work in the same sockets as single-core Opterons, he said.
doesn't require 5 loud fans in the case to keep it cool enough
While I understand the desire to build your own and preferring not to be vendor locked, your G5 fan comments are quite ignorant. The Apple G5s are well designed and exceptionally well laid out to create thermal zones serviced by different variable speed fans. It is a very quiet solution. Do not confuse the G5 with some of the homebuilt Athlon abominations that have poor layout, poor airflow, and require multiple screaming fans. YMMV.
While the idea of dual core cpus is really cool, and will take over shortly due in part to the fact that we need something to do with all those extra transistors, I wonder why the focus of the industry is on chip multi-processors (CMP).
While CMP processors can give us roughly the same performance as a standard SMP system (somewhat faster due to interprocessor communication and shared memory, but also slower due to a larger memory bottleneck), I don't think that a CMP system would compete with a simultaneous multi-threading (SMT) solution.
While Intel's response to SMT (hyperthreading) has some benefits, its performance is rather lackluster. The reason has more to do with their particular implementation. If you've read about the initial observations on SMT, an 8-way SMT processor was shown to outperform a 4-way CMP processor. Now, I must note that the 8-way SMT processor had more functional units than the cores in the 4-way CMP processor, but the overall area of the 8-way SMT processor would be much, much smaller (far fewer structures need to be duplicated for SMT as opposed to CMP). For more information on this check out some of the papers at http://www.cs.washington.edu/research/smt/
What I don't understand is the insistence of the industry on using CMP first. From everything I've read, an 8-way SMT processor should take up less die space than a two-way CMP processor, even assuming that the 8-way processor contains more functional units. It kind of makes sense that a CMP processor is faster when there aren't enough threads to fully utilize an SMT processor (say only 2 or 3 threads that want full cpu usage). I guess SMT is a big change in the model of programming and application development (I'm currently running research on the subject, which is why I'm so interested in it). Is the reason to embrace CMPs simply that there's less new technology to add (they "just" have to interconnect two cores as opposed to adding the extra logic for SMT)?
Does anyone else have any other opinions regarding this matter, or any idea why no one seems to be fully embracing SMT's potential?
Philip Garcia
I agree.
:)
I got to drive one of the nice newer Mercedes coupes, with a big V8 in it. They were bragging up the horsepower, so I wanted proof. "Let me drive." I ran it hard. The owner, in the passenger seat, was impressed with the power I was pulling from it. Then I asked the owner how much the car cost. Something around $100k. I handed him the keys to my car (2000 TransAm WS/6) and said "now drive this."
I paid about $25k for my car. New it was something like $30k. My car has better handling, better acceleration, better braking, and is faster. This was before I did any mods to it. The interior trim may not be as nice, but my car does have all the options including leather seats, and it turns more heads when I drive past, than a Mercedes does. It's comfortable enough for two people to ride in it all day (done that many times), and the back seats are just about as big.
Apples are very pretty. I've used a few. I was happy that my girlfriend was on one using OS X, but when that machine started acting flaky, we didn't buy a new Apple; we spent $1500 on really good parts: AMD 2800+, 1GB RAM, 200GB hdd, DVD reader, DVD writer, Asus motherboard, high end video card, etc. What Apple does $1500 buy you? When we want faster, all we have to do is buy some faster components. When the G6, G7, or whatever comes out, well, you're buying a new Apple.
You can buy a new Mercedes at the really fancy store, or you can (could) buy a TransAm at any dealership. If I want more power, I grab Jegs or Summit, and start shopping.
You can buy an Apple at the fancy Apple store, or buy parts from a wholesaler whose "Will Call" area is the back door of the warehouse.
I still say "Pretty" every time I look at an Apple. I give them that. Then I hop back on my x86 based Linux machine and drive faster.
Serious? Seriousness is well above my pay grade.
The opteron (k8) has an integrated memory controller and up to three hypertransport links. In a dual k8 system, the cpus communicate over a single hypertransport link and are usually paired with their own memory bank. If one cpu needs data from the other's bank, it comes over the hypertransport link. Some cheap dual opteron boards save traces by pairing one cpu with all the memory banks - so every memory operation on the non directly linked cpu passes over the h-link.
The dual core cpu might have the pins for two separate memory bank arrays or just the pins for one. Either way, the situation as far as dual k8s go is not really different from what we have already. Either way, it's a few steps above the p4 design: shared cpu bus to northbridge to memory. (yech! with a single proc, this introduces latency; with multiproc, you get contention and latency at every level)
AMD's cpu interconnect is so well thought out... it gives me the warm fuzzies pondering it:
A uniproc hammer needs one h-link for io.
A dually needs two per core: 1 for core to core, 1 for io (though all the io on all the boards I have seen feeds to only one proc's h-link... so that you don't lose PCI busses and such if you have only one proc installed, I suppose).
Quad and above requires three: each core links to two other cores, leaving one h-link per core for io. One could have a pci-e bus per proc, if one desired. But again, I haven't seen a design that doesn't feed all io into a single h-link.
Since no one uses the extra h-link anyway, a dual core package for a dual core system would need only one external h-link (saving some cash).
A quad core, dual package system would require three h-links feeding out of each package, though. But even then, the number of h-links laid out on the mobo is reduced and the whole shebang should be cheaper.
Intel's "one huge shared bus" + northbridge design is definitely being trampled...
But then, the trick is that he did not mention memory latency, only bandwidth! Getting the latter is relatively easy -- just make the memory bus wider (at a given bus speed); trying to decrease latency will pretty soon make you run into the speed-of-light limitation.
;-)
Maybe those processors do have enough memory bandwidth to load two of them completely doing SAXPY? Assuming 12 GFLOPS sustained (3 GHz, 2 cores, separate ADD and MUL on each), you need to feed input vectors at 12*8 bytes/double = 96 GB/sec; for, say, a 1 GHz memory bus that translates into 96*8=768 memory pins for input alone -- well, wider than I've seen on desktop PCs...
When you start doing anything else, the roundtrip time between processors and memory (latency) becomes more important than raw bandwidth.
Paul B.
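The SAXPY estimate above can be redone as a small Python sketch. It only repeats the poster's own back-of-envelope assumptions (one 8-byte double fetched per floating-point operation, one bit per pin per bus cycle); the function name and parameter names are invented for illustration.

```python
def saxpy_input_pins(clock_ghz, cores, flops_per_cycle, bytes_per_double, bus_ghz):
    """Back-of-envelope memory width needed to keep the FPUs fed.

    Assumes every floating-point operation needs one fresh double from
    memory, and that each pin moves one bit per memory bus cycle.
    """
    gflops = clock_ghz * cores * flops_per_cycle       # sustained GFLOPS
    gbytes_per_sec = gflops * bytes_per_double         # input traffic in GB/s
    bytes_per_bus_cycle = gbytes_per_sec / bus_ghz     # bytes moved each bus cycle
    pins = bytes_per_bus_cycle * 8                     # one bit per pin per cycle
    return gflops, gbytes_per_sec, pins


if __name__ == "__main__":
    # 3 GHz, 2 cores, separate ADD and MUL units, 8-byte doubles, 1 GHz bus.
    print(saxpy_input_pins(3, 2, 2, 8, 1))
```

Changing `bus_ghz` or `flops_per_cycle` shows how quickly the required width moves, which is the point of the original estimate.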
Because the K8 has the memory controller on die, as you add processors, you actually add memory bandwidth. It kinda stands the old logic on its head. Really, the only thing that can be an issue with this design is latency, which can make a difference at 16 CPUs or more ;-)
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
No it doesn't
Yes, it does.
If you were at all familiar with the Opteron architecture, you'd realize that each chip's memory controller does, indeed, go to a new memory bank.
As an example, I just bought a 4-way Opteron. It's got four separate banks of memory on it. Each processor has a 128-bit, DDR400 memory controller, all independent of each other.
If you have a program on each CPU, accessing memory tied to that CPU, the 4-way machine I mentioned would have a theoretical memory throughput of 25.6 gigabytes/second. The theoretical throughput of a dual-Xeon machine is 5.4 gigabytes/second. That's a huge difference.
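The 25.6 gigabytes/second figure follows directly from the bus width and transfer rate. A minimal Python sketch of that arithmetic is below; the function name is made up, DDR400 is taken as 400 million transfers per second, and the dual-Xeon number is left as quoted rather than derived.

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_sec, controllers=1):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1000 MB here)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_sec * controllers / 1000


if __name__ == "__main__":
    # One 128-bit DDR400 controller, then four independent ones.
    print(peak_bandwidth_gb_s(128, 400))
    print(peak_bandwidth_gb_s(128, 400, controllers=4))
```

These are peak numbers only; they assume each program touches memory attached to its own CPU, as the post says.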
You're right, it takes some intelligent work to schedule programs on CPUs that are close to the memory the program will access. If you hadn't been in a hole for the past year or two, you'd know that there has been a lot of work put into Linux to make it handle these NUMA architectures more intelligently. IBM has some VERY large NUMA systems, and has been pouring a lot of development into Linux.
As for system costs going up so much that it would be prohibitive for a desktop, think again. AMD's entire desktop line is transitioning to the Opteron architecture. Even the lowly 1xx single-proc Opterons and Athlon64s have nearly all of the features of the highest 8xx 8-way chips. The difference between an 848 and a 148 is just reduced cache, and fewer Hypertransport lines out of the chip.
steve
Oh, you're not stuck, you're just unable to let go of the onion rings.
Ah but with multi-core chips they can transduce their flux capacitors with the onboard trans-mogrification controllers. Seriously "reduced bus conflicts with their memory space", what does that mean?? That's gibberish.
The P4 bus, presumably, like the P6 GTL+ host bus, is a shared bus (like most buses are). Only one CPU can use the bus at any one time. If the bus does x GB/s, that's only to one CPU at any given time - effectively it is shared. Further, P6 and P4 do not have integrated memory controllers, and must access RAM via the (shared) GTL+ bus, if it is not in cache. Eg, a 4 CPU machine looks like:
Also GTL+ is limited to 4 CPUs and one controller. To get 8 CPUs some controller vendors have invented a GTL+ 'bridge' to stitch 2 GTL+ buses together, but that just makes things worse really from a scalability POV I'd imagine.
The K8 on the other hand uses a point-to-point (PtP) serialish, packet based transport, HyperTransport to interconnect CPUs and has onboard memory controller(s) (connected internally via HyperTransport links). A 4 CPU K8 machine looks like:
Each of the lines out of a K is a HyperTransport link. Each MC is integrated into the die itself. (you'll have to imagine interconnects and right-hand top/bottom MC's lining up with the K symbols, cause /.'s filter is chomping whitespace in some strange way on me).
Each CPU has 4 HT links, two to other CPUs, two to its (integrated on die) memory controller. For dual CPU setups, each CPU needs only one link to another CPU, obviously. Indeed the difference between 2xx, 4xx and 8xx AMD Opteron CPUs is the number of HyperTransport links. Indeed in large multi-CPU (ie 8+) SMP setups one need not attach a memory controller to each CPU; one might choose to have a central "cross-bar" of fully-meshed K8s which then connect to peripheral K8s which have memory controllers and hence RAM. Tis all down to the board designers I guess. And a bit of a fun computer science problem too, in terms of designing optimal 'networks' of interconnected nodes with the best compromise of maximum node to node distance for lowest number of required interconnects.
The K8 is actually a ccNUMA (cache coherent, Non-Uniform Memory Architecture) machine, in SMP configurations. Ie, different memory is at different distances to different CPUs, or to put it another way, some memory is local, other memory is distant, and some memory may be more distant than other memory. Eg, for the top-left CPU to access RAM on its "local" MC is obviously potentially far quicker, in terms of latency, than to access "distant" RAM on another node, and to access memory on an adjacent K8's memory controller will have lower latency than to access memory allocated in the bottom-right CPU's RAM. A good OS aware of the issues can try to keep processes on the CPUs to which that process's memory is "local" and hence maximise performance, but it's quite a juggling act (Linux has some NUMA support).
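To make the "local vs distant" idea concrete, here is a small Python sketch that counts HyperTransport hops between nodes. The square topology (each K8 linked to its two neighbours) is an assumption based on the description above, and hop count is only a crude stand-in for latency; the names are illustrative.

```python
from collections import deque

# Assumed topology: four K8s in a square, each linked to its two neighbours.
LINKS = {
    "top-left": ["top-right", "bottom-left"],
    "top-right": ["top-left", "bottom-right"],
    "bottom-left": ["top-left", "bottom-right"],
    "bottom-right": ["top-right", "bottom-left"],
}


def hops(src, dst, links=LINKS):
    """Breadth-first search for the fewest links between two nodes."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("no route between %s and %s" % (src, dst))


if __name__ == "__main__":
    for target in LINKS:
        print("top-left ->", target, ":", hops("top-left", target), "hop(s)")
```

A NUMA-aware scheduler is, in effect, trying to keep each process on a node where this distance to its memory is zero.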
What AMD will do for multi-core we don't know. For certain the individual cores will be connected by HyperTransport. Most likely AMD will give each core its own dedicated memory controller, which would simply make a multi-core SMP exactly the same in terms of architecture as the current dual K8 architecture (ie 2xx Opteron), and hence no different in terms of bandwidth contention from existing SMP Opterons.
It will make large SMP machines a lot easier to build though. Eg
I use Friend/Foe + mod-point modifiers as a karma/reputation system.
Did you miss the part about shrinking it down to modern geometry, meaning it would run faster on less power (read less heat) than the original? Sure a 90nm i486 isn't going to run at 3.6GHz like a P4, however I expect it would run a good amount faster than a 486DX2-66 once did.
Unfortunately, nothing will beat the architectural gains which have advanced since the 486 era, and the "worst case" pipeline waits will keep your clockspeed at an insanely low level.
Let me try to explain. The 486 had a 5 stage pipeline - fetch, decode, dispatch, execute, and writeback. Now, each of those pipeline stages isn't going to take the same "minimum" amount of time - some of them are fixed by things other than switching latencies. So, say your execute stage is fine taking only 1 clock cycle up to, say, 2 GHz (a minimum latency of 500 ps), but your decode stage, simply from physical concerns, is going to take at least 5 ns to complete. This means that the maximum you can ramp the clock speed up to is 200 MHz, because each stage in the pipeline has to take 1 clock cycle, so if 5 ns is your minimum, you'll have a max clock speed of 200 MHz.
The solution, though, is obvious - break that "5 ns" decode step into multiple pipeline steps - say, 5 of them, each taking 1 ns. Now your maximum clock frequency is 1 GHz. The problem is that your pipeline is now 9 stages long, and you have a new architecture - which is precisely what Intel did several times over to allow the clock speed to ramp.
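The pipeline argument above boils down to "the slowest stage sets the clock". A minimal Python sketch follows; only the 5 ns decode and 500 ps execute latencies come from the post, the other stage latencies are placeholder values assumed for illustration, and the function names are made up.

```python
def max_clock_mhz(stage_latencies_ns):
    """Every stage gets one cycle, so the slowest stage limits the clock."""
    return 1000.0 / max(stage_latencies_ns)


def split_stage(stage_latencies_ns, index, parts):
    """Return a new pipeline with one stage split into `parts` equal stages."""
    latency = stage_latencies_ns[index]
    pieces = [latency / parts] * parts
    return stage_latencies_ns[:index] + pieces + stage_latencies_ns[index + 1:]


if __name__ == "__main__":
    # fetch, decode, dispatch, execute, writeback (ns); 1.0 values are assumed.
    original = [1.0, 5.0, 1.0, 0.5, 1.0]
    deeper = split_stage(original, 1, 5)

    print(len(original), "stages:", max_clock_mhz(original), "MHz")
    print(len(deeper), "stages:", max_clock_mhz(deeper), "MHz")
```

The sketch ignores the cost of the extra pipeline registers and of longer branch mispredict penalties, which is part of why a deeper pipeline counts as a new architecture rather than a free speedup.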
And that's just the pipelining limitation. There are other architectural problems with "ancient cores" as well. One basic problem is that the x87 floating-point architecture is crap. It's stack-based, which means you can only do math with the "stack head". So in order to store things in the registers, you need to use the FXCH instruction to switch the stack head and one of the registers. Well, modern CPUs (the P3 and the Athlon) got around this by saying "we'll make FXCH be a zero-cycle execute when paired with an arithmetic instruction (and after the Pentium, screw it, they're free totally)". Since the modern CPUs can decode more than one instruction per cycle (3 for an Athlon), and the FXCH instruction only lives up to the decode stage, you're really not hurt, as the FXCH fills a pipeline stage that probably would've been left empty anyway. Now consider the P4, which was designed to try to encourage people to move away from x87: it does not have a zero-cycle FXCH, and its x87 performance is abysmal. (The 486 does not have a pipelined FPU, nor a free FXCH instruction. It would be even worse.)
And I haven't even mentioned register renaming yet, which works around the register limitations of the x86 ISA by providing registers that the software doesn't know about: the hardware can "cheat" by recognizing the code patterns compilers use to work around the register limitation and mapping them onto those hidden registers.
In short - many core 486 CPUs would suck. Even many core Pentiums would suck. Architecturally, they're old, dead ends. The best designs for multicore processors would be the P6 design (PPro/PII/PIII/PM) and the Athlon design (K7/K8 - while the K8 is "new", it's about as new as the PM is to the P6 design). Curiously enough, Intel is likely to go with a multicore PM, and AMD is likely to go with a multicore K8.
It should also be noted that a 486DX had a transistor count of 1.2M transistors. A P3 had a transistor count of 9.5M transistors. That's an increase of about 8X - however, the P3 also has twice the data width (64-bit rather than 32-bit), 4X the L1 cache (32KB rather than 8KB), and had two instruction set enhancements tacked onto it, as well as massive architectural improvements, including, essentially, multiple versions of the 486 execute engines inside it. An 8X increase in size for those enhancements is not crazy at all.