AMD's x86-64 Moves Forward
MBCook writes "AMD's Hammer line is definitely moving forward. The Inquirer has a supposedly leaked memo from MS saying that they have working x86-64 silicon that runs both 32 and 64-bit Win XP. Van's Hardware is reporting that MS is backing x86-64 over Intel's IA-64, and that MS has apparently convinced Intel to move to x86-64! There is an article over at Ace's Hardware from CeBIT that includes some coverage of AMD's Hammer line (including its NUMA). Last but not least is News.com's report that MS is preparing Windows to support NUMA." And it looks like the line will be named Opteron.
Check out http://zdnet.com.com/2100-1103-847712.html and SuSE already supports it: http://www.suse.com/us/press/press_releases/archive02/x86_64.html
-Jason Yates
And a 64 in the processor. IIRC, the processor is based on the Alpha core. And when the Alpha came out, it was faster than anything 32-bit. But comparing the X-box and the Nintendo 64, which were released many years apart, won't buy you much of a conclusion other than that current processors are generally faster than older processors.
All other things being equal, a processor with a larger word size (instruction sizes and address sizes) will be faster than one with a smaller word size, though, depending on the application, the gains can be negligible or even negative, especially if compilers and programs aren't properly optimized.
Of course, I don't really know, I'm just guessing, just like the rest of you ;).
XML causes global warming.
I wonder why this is listed as a leak when AMD announced it at a press conference a few hours ago.
AMD nails Microsoft backing for Hammer-CNET
MS to confirm Hammer support-The Register, UK
Microsoft to Support AMD's Hammer-eWeek
A webcast of today's conference call announcing Opteron is available here
-Mark
"As we reported yesterday, Microsoft is committed to supporting AMD's x86-64, the open standard instruction set native to Hammer. Key decision makers inside the software company have been so enamored with x86-64 that the software giant persuaded Intel to adopt the Hammer language.
One of the most avid proponents of x86-64 is the legendary Microsoft programmer responsible for the NT kernel. David Cutler, formerly of DEC, has reportedly voiced inside Microsoft extremely strong preferences for x86-64 over Intel IA-64. Allegedly, Cutler and Microsoft do not want to expend the extra resources to produce ongoing support of IA-64, an instruction set Cutler and others inside the software giant reportedly disdain as inferior."
Go to:
http://www.vanshardware.com/
Hmm. Well. This kinda sums it up:
;-)
1) Cost.
If you have to make everything X times wider (e.g. registers, buses...) you will use approx X times more chip area -> Higher cost. So to keep a constant cost, you have to wait until the silicon guys come up with some process that can make your chip X times smaller.
2) Power consumption (heat dissipation)
A wider bus will need more current to charge its capacitive load. If the load gets higher, your only options for keeping power consumption constant and keeping your processor from melting are these:
* lower probability for bitchange (smarter coding)
* lower frequency (not desired, right?), or
* lower voltage (1990: 5V, 2002: 3.3V. 2010: 1.5V?)
Perhaps 2) is not a real issue wrt bus sizes, I haven't investigated it. Take it for what it's worth; a semi-educated guess.
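The three options listed above line up with the terms of the usual first-order formula for CMOS switching power, P = activity * C * V^2 * f. A minimal sketch, assuming that simplified model and made-up illustrative numbers (the capacitance and activity values are not from any real chip):

```python
def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order switching power in watts: activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz

# Hypothetical baseline: 1 nF switched load, 3.3 V, 1 GHz, half the bits toggling.
base = dynamic_power(0.5, 1e-9, 3.3, 1e9)
# Doubling the bus width roughly doubles the capacitive load.
wider = dynamic_power(0.5, 2e-9, 3.3, 1e9)
# Dropping the supply voltage to 1.5 V more than pays for the wider bus.
low_v = dynamic_power(0.5, 2e-9, 1.5, 1e9)

print(wider / base)   # relative power of the wider bus
print(low_v / base)   # relative power of the wider bus at lower voltage
```

Because voltage enters squared, lowering it is by far the most effective of the three knobs in this model.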
I guess that in video game consoles they have either the margins to take the higher cost, or they use 128-bits only in parts of the processor and didn't tell the marketing department.
When a graphics processor is 128-bit, it means you can work on 128 one-bit pixels at once, basically doing the same thing to each one in parallel. Or, alternatively, 16 pixels at once, operating on an 8-bit color channel, still doing essentially the same operation to each one at the same time, in parallel.
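To make the "16 pixels at once" idea concrete, here is a rough sketch that treats one 128-bit integer as 16 independent 8-bit lanes and adds two such words lane by lane. The helper names (`pack`, `unpack`, `packed_add`) are made up for illustration; real hardware does this in dedicated circuitry, not with these bit tricks.

```python
LANES = 16          # 16 lanes of 8 bits = 128 bits
LANE_BITS = 8
LOW7 = int.from_bytes(b"\x7f" * LANES, "little")   # low 7 bits of every lane
HIGH1 = int.from_bytes(b"\x80" * LANES, "little")  # top bit of every lane

def pack(values):
    """Pack 16 small integers into one 128-bit word, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a 128-bit word back into its 16 eight-bit lanes."""
    return [(word >> (i * LANE_BITS)) & 0xFF for i in range(LANES)]

def packed_add(a, b):
    """Add all 16 lanes at once, wrapping mod 256, with no carry between lanes."""
    # Add the low 7 bits of each lane (cannot overflow the lane),
    # then fold in the top bits with XOR so carries stop at lane edges.
    return ((a & LOW7) + (b & LOW7)) ^ ((a ^ b) & HIGH1)

pixels = pack([250] * LANES)
brighten = pack([10] * LANES)
print(unpack(packed_add(pixels, brighten)))
```

Each lane wraps around on its own (250 + 10 becomes 4), which is exactly the "same operation on every pixel" behaviour described above.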
When a CPU is 64- or 128-bit, that means its computational units can crunch on integers 64 or 128 bits in size. Roughly speaking, a 32-bit processor can work on integers between 0 and a few billion (2^32 is around 4 x 10^9), and a 64-bit processor can work on integers between 0 and about eighteen billion billion (2^64 is around 1.8 x 10^19). Think of a pocket calculator with twice as many digits across the display, and you have the right idea. Same old calculator, bigger numbers.
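For anyone who wants the exact figures, a quick sketch (the function name `unsigned_max` is just for illustration):

```python
def unsigned_max(bits):
    """Largest unsigned integer that fits in the given number of bits."""
    return (1 << bits) - 1

for bits in (32, 64, 128):
    value = unsigned_max(bits)
    print(bits, value, f"{value:.2e}")
```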
Those 16 Billion GB of data fill up 64-bit. Say you had a 256-bit processor? You could count to a LARGE number. Whoop-de-frickin'-doo. When was the last time you needed to count to that large a number? Or one of your programs did? Probably very rarely.
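The "16 billion GB" figure can be checked directly. A small sketch, assuming byte-addressable memory and 1 GB = 2^30 bytes:

```python
def addressable_gib(address_bits):
    """Bytes reachable with the given address width, expressed in GiB (2^30 bytes)."""
    return (1 << address_bits) // (1 << 30)

print(addressable_gib(32))   # the familiar 4 GB limit
print(addressable_gib(64))   # roughly 17 billion GiB, i.e. 16 * 2^30
```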
SIMD uses aside (which have been mentioned elsewhere in responses to this), you don't need that much. In fact, 256 bits would decrease performance. Why? Because when you tried to pull that information in out of memory to the cache, or from disk to the memory, you'd have more time spent waiting for these bits to be transferred around. In fact, this sucks the most because almost all of those bits will be 0 (unless you're dealing with small negative numbers, in which case they'll all mostly be 1).
Numbers that are this big are used so infrequently that you can use bignums (used in lisp, for example) to represent them without taking much of a performance hit at all in the common case, and the exceptional case (VERY rare) will only be maybe twice as slow, which makes the average case faster.
Thus, if you want a slower computer, go design a 256-bit processor, by all means.
-Dan
Yeah, I didn't mean to bring price into it - that will of course change everything. That's why I said "all things being equal" - assume that applies to price too ;) The most important part of "64-bit" for most people is address space. There can be some performance benefit as well, but as you say, it's not worth the cost considering alternative ways to get the same raw horsepower. The address space increases exponentially, but the performance gains from parallelism increase only linearly. The cost increase is probably somewhere in between.
XML causes global warming.
Actually, it's only funny on two levels. One if you take it literally, and one if you realize Intel makes StrongARMs.
I disagree from an architectural standpoint. In an ideal world, we'd all have 8-bit machines. All our arithmetic would be insanely fast; we'd be able to use combinational logic to allow two propagation levels for ANY operation (add, sub, mul, div, sqrt, log, etc). That's because it's cost effective to do so with such a minimal set of possible outcomes. I'm not completely sure, but I'll speculate that it's possible to generate an arbitrarily sized number from just these 8 bits; though most likely it would be done programmatically (even if via micro-code), and thus would be non-optimal for larger than 8-bit data-sets. So obviously, as we've been able to, we've increased the data-length throughout history as we've demonstrated a need.
Contrary to the impression that's given in these posts, a larger word size fundamentally is slower in calculating smaller values. Sticking with higher performance two-stage combinational logic requires an exponentially increasing number of transistors. Breaking the logic up into tiers allows designers to trade the number of transistors for the number of propagation delays. The more delays, the slower the clock; the more transistors, the less practical the design (due to heat, cost, and feasibility of fabrication). Pipelining somewhat helps alleviate the issue of extreme propagation delay, but it's impossible to achieve 100% efficiency, and thus you're practically guaranteed slower operation for deeper pipelines. What's more, pipelining requires additional propagation layers for buffering, so you take an immediate performance hit, speculating that you'll achieve greater overall performance.
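One way to see the propagation-delay point is the simplest tiered adder, a ripple-carry chain, where the carry may have to pass through one stage per bit. The sketch below is a toy software model (not how any particular CPU is built) that adds bit by bit and reports the longest run of stages the carry had to travel through:

```python
def ripple_carry_add(a, b, width):
    """Add two unsigned ints with a chain of 1-bit full adders.

    Returns (sum mod 2**width, carry_out, longest run of consecutive
    stages that produced a carry).
    """
    carry = 0
    total = 0
    run = longest = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        total |= (x ^ y ^ carry) << i                # sum bit of this stage
        carry = (x & y) | (carry & (x ^ y))          # carry into the next stage
        run = run + 1 if carry else 0
        longest = max(longest, run)
    return total, carry, longest

print(ripple_carry_add(0xFFFFFFFF, 1, 32))            # carry ripples through 32 stages
print(ripple_carry_add(0xFFFFFFFFFFFFFFFF, 1, 64))    # twice the width, twice the chain
```

Faster adder designs shorten that chain, but only by spending more transistors, which is the trade described above.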
In an ideal architecture, you'd minimize the propagation delays for each instructional unit, but practical measures say you must group most, if not all, of the CPU such that the slowest part drives the system. (P4s are nice in that they sub-divide the clock for the simpler integer units.)
Combining the two ends, we can better appreciate the trade-off. If we're performing large-valued arithmetic which is slow programmatically (emulated 64bit), then it's worth the extra cost (towards the speed of each operation, and in terms of the number of transistors). In other words, one hardware 64bit add is most certainly faster than several assembly language instructions that piece together 32bit values. BUT, now all your 32bit arithmetic is slower (unless you have separate 32bit/64bit logic cores).
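For reference, this is roughly what "piecing together 32bit values" looks like: an add of the low halves, then an add-with-carry of the high halves. A sketch in Python standing in for the two machine instructions (the function name and argument order are just for illustration):

```python
MASK32 = 0xFFFFFFFF

def add64_via_32(a_lo, a_hi, b_lo, b_hi):
    """Emulate a 64-bit add using only 32-bit pieces.

    Returns (lo, hi) of the 64-bit result, wrapping on overflow.
    """
    lo = a_lo + b_lo
    carry = lo >> 32              # 1 if the low halves overflowed 32 bits
    lo &= MASK32
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi

# 0x00000000FFFFFFFF + 1 = 0x0000000100000000
print(add64_via_32(0xFFFFFFFF, 0, 1, 0))
```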
It's possible to design 32/64bit cores that only take as many clock-ticks to complete as necessary, and thus 32bit arith isn't horribly slower, but there are definitely additional propagation delays. The augmentation to 64bit can never increase the speed of a 32bit operation. (Any speed-ups must be due to overall advances in computational efficiency, which should benefit a pure 32bit core even more.)
The trade-off must then be a statistical one. We cost out the largest word size that provides benefit. You're going to have arbitrarily large ALU operations (just look at encryption), so choose a cutoff where a certain percentage of all operations occur at that high of a word size. This is how we moved from 8 to 16 bit, and then the painful shift from 16 to 32 bits. And for server-targeted machines, the shift to 64bits has already been costed out and adopted. The desktop has not yet made sufficient demands to adopt 64bits, though the underlying x86 CPUs are being shared in server-space, which is nudging 64bit's acceptance.
Another important factor (which is presumably obvious in concept) is that a higher word-size has a greater probability of wasted space. A 1-bit boolean, for example, wastes 63 bits. Booleans are very common, and though they can easily be consolidated in C structs, such is rarely the case, since there are memory alignment issues (and flat-out laziness on the part of programmers). The wasted word-space also affects the instructions. Rarely do you actually see 64bit aligned CPU-instructions (except in VLIW or in places where the data-word-size was irrelevant). Such a situation would have massive implications for performance. But one serious consideration is that the population of 64bit constants using a 32bit instructional word is expensive. Now you have to perform at least 3 (probably 4 or 5) instructions just to load a constant. Suddenly "a++" starts to look scary (at least when non-optimal compilers are used). In all cases sub-word-sized instructional arguments are permissible, to the delight of compiler designers, but there are still classes of problems that thwart this. Namely memory addressing...
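The constant-loading cost can be sketched with a made-up instruction set: assume each 32bit instruction word can carry only a 16-bit immediate, plus a hypothetical fused "shift left and OR in an immediate" operation. Neither opcode name below belongs to a real ISA; they are placeholders for illustration.

```python
def load_constant_sequence(value, imm_bits=16, word_bits=64):
    """Build the instruction list needed to materialize a constant."""
    mask = (1 << imm_bits) - 1
    ops = []
    # Walk the constant from the most significant chunk to the least.
    for shift in range(word_bits - imm_bits, -1, -imm_bits):
        chunk = (value >> shift) & mask
        opcode = "load_imm" if not ops else "shift_left_or_imm"
        ops.append((opcode, chunk))
    return ops

def execute(ops, imm_bits=16, word_bits=64):
    """Run the hypothetical instruction list and return the register value."""
    reg = 0
    for opcode, imm in ops:
        if opcode == "load_imm":
            reg = imm
        else:
            reg = ((reg << imm_bits) | imm) & ((1 << word_bits) - 1)
    return reg

ops = load_constant_sequence(0x0123456789ABCDEF)
print(len(ops), hex(execute(ops)))
```

Even with the generous fused operation this takes four instructions for one constant; with separate shift and OR steps the count only goes up.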
Memory addressing is arguably the strongest supporter of 64bit architectures. The 4GB limit is already upon us on desk-top machines (I have half a gig on all my home machines, and I don't need it). When you add swap-space, it's entirely possible for modern desk-tops to run enough apps to desire 4+Gig of memory. (Especially considering that large chunks of the address space are wasted.) Aside from the various tricks designers have employed over the years to avoid augmenting the address space (8086's segment-registers, 80386's segment-selectors, OS's swapping out apps completely from memory, etc), it's arguably slower to emulate larger address spaces.
In addition to the above arguments against larger address spaces, there is massive cache pollution; doubling the word-length literally halves the usefulness of a cache-line-load, unless you were previously emulating a larger word-size. You can only load 4 words on a pentium-class cache-line-fill instead of 8. Your bandwidth requirements literally double (unless you don't standardize at one word-length).
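The 8-versus-4 figure follows from simple division. A tiny sketch, assuming the 32-byte line that those numbers imply for a pentium-class cache:

```python
CACHE_LINE_BYTES = 32   # assumed line size, consistent with "8 words" at 32 bits

def words_per_line(word_bits, line_bytes=CACHE_LINE_BYTES):
    """How many full machine words fit in one cache-line fill."""
    return line_bytes // (word_bits // 8)

print(words_per_line(32))   # 32-bit words per line
print(words_per_line(64))   # 64-bit words per line
```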
Now in contrast, there are a few advantages. If your minimal word-size is larger, then the number of address pins that you need is reduced. But this is really independent of the core word-size. Pentiums have long required 64bits for their external bus, and use even larger cache-line sizes. Thus most of the advantages attributed to this argument are moot.
Theoretically, an architecture can be designed to split an ALU such that it acts as either 1 64bit unit or 2 32bit units. This is especially true for vector-cores (which are already up to 128bits for main-stream processors). In general, however, there is still the trade-off here, since additional logic-propagations are required which slow down the general case of only a single
I'd like to address the nature of large bit-sizes with respect to graphics. While this isn't my expertise to the extent of the above, this primarily affects the bus width. In graphics, you commonly have multi-integer structures (red, blue, green, alpha (opacity), Z-depth (the depth into the screen of the geometrical object that drew this dot), stencil, etc). The entire structure is usually just 16bits, 32bits, 64bits, etc. The larger the structure, the more features you can pack into it, and the more accurate each individual number can be. Thus saying that an architecture is 128bits purely based on this is very misleading. What's even worse is labeling by the bus-width numbers (e.g. 128, 256). That's like calling the Pentium I a 64bit CPU, just because it has a 64bit bus (used purely for cache-line burst fills). Yes it makes it go faster, but so does shortening the length of each wire; it's not really innovative. I'll throw this in, but I'm starting to get in over my head: graphics units (especially the filtering parts) make heavy use of hard-wired combinational logic units. The number of bits going into these units is really meaningless (how many thousands of wires go into the control logic portion of a CPU?). Thus the ability of custom pieces of hardware to utilize larger bit-depths is unimpressive. What would be impressive would be to say that a graphics unit does its integer / floating arithmetic in 128bits so as to minimize error (even though the input/output might only be 8 or 16bits per atomic unit).
-Michael
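As an illustration of the multi-integer pixel structures mentioned above, here is a sketch of one common layout: four 8-bit channels packed into a single 32-bit value. The channel order (red in the top byte) is an arbitrary choice for the example, not a claim about any specific hardware.

```python
def pack_rgba(r, g, b, a):
    """Pack four 8-bit channels into one 32-bit pixel (red in the top byte)."""
    return ((r & 0xFF) << 24) | ((g & 0xFF) << 16) | ((b & 0xFF) << 8) | (a & 0xFF)

def unpack_rgba(pixel):
    """Split a 32-bit pixel back into its (r, g, b, a) channels."""
    return (pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF

pixel = pack_rgba(255, 128, 0, 255)
print(hex(pixel), unpack_rgba(pixel))
```

The "32 bits" here describes the size of the record, not the width of any arithmetic: each channel is still only an 8-bit number.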
Here's a weird question... Why don't they put some sort of optical controller and coupler between chips?
...but we know most about Intel and AMD's because they need the marketing gee-whiz factor to sell their crap.
Cost. The bill of materials on a motherboard is insanely tight -- they count resistors, remember! All of the fancy interconnect to go optical is way too expensive, and has very little benefit: aka, ROI.
Besides, what good does it bring? I agree that copper limits on a motherboard are approaching rapidly: anyone who has ever tried to debug a RAMBUS implementation knows how painful it is to build interface hardware to debug a 1 GHz strobed differential bus. However, I would think that until the b/w at the CPU and DRAM _PINS_ vastly exceeds what is possible with a copper trace, the ROI on optical would be nonexistent.
I need to re-read AMD's point-to-point proposal; I'm not sure I buy their claim that adding additional CPUs doesn't decrease bandwidth.
As for symmetric-MP, et al: there are lots of weird topologies for MP out there in server-land. The first teraflop machine was PentiumPro bus architecture, which is only 4P scalable on Intel arch, but custom chipsets can do anything!
And when I say "crap", I mean, "crap". I firmly believe that if both giants were not chained to their product roadmap and stock-holders, aka profits, then we would see some truly efficient high-performing architectures. A giant can't take a 90-degree turn... that's why all x86 architectures are basically souped-up jalopies -- making too sudden a change can be devastating if you don't have the capital. Look at Itanium -- if AMD spent the resources Intel did, it would have broken their bank like the Cold War broke Russia's. Look at AMD bolting on *-64, it's basically a blower on a Chevette!!!
I'm _really_ rambling, I'll stop now.
A little, but not much. It keeps most of the ugly instruction decode of traditional x86, but adds an extra 8 GPRs (accessed via an additional opcode prefix). It also does a few other minor things, like I think it adds IP-relative addressing.
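For the curious, the additional prefix in the published x86-64 encoding is a single byte of the form 0100WRXB: W selects 64-bit operand size, and R, X and B each supply a fourth bit to a register field, which is where the 8 extra GPRs come from. A small sketch of how that byte is assembled (the function name `rex_prefix` is mine, not an official term):

```python
def rex_prefix(w=0, r=0, x=0, b=0):
    """Build the x86-64 prefix byte 0100WRXB from its four flag bits."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b

# 64-bit operand size, no extended registers: the prefix seen on e.g. mov rax, rbx.
print(hex(rex_prefix(w=1)))
# 64-bit operand size with the B bit set, reaching one of r8..r15 as the r/m operand.
print(hex(rex_prefix(w=1, b=1)))
```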
Basically, they did what they could to make it better while still using the same hardware to decode instructions. I'd have preferred it if they had a second, simpler decode unit to handle the new stuff so that the overcomplicated x86 decode could eventually be phased out, but it didn't make sense business-wise for them to do so.