Posted by
michael
on from the double-the-pleasure-double-the-fun dept.
msolnik writes: "Over at RealWorldTech they've published an article on the future of 64-bit performance. This article covers the different technology from Sparc to Hammer. It's a great read if you are looking for information on up-and-coming products from Intel, AMD, Sun, and Compaq."
AMD's going down a slightly different track: AMD
is the only one trying to put 64-bit on the
desktop. Now for us Linux freaks, SUSE Linux
and NetBSD will be fine for a 64-bit desktop,
but if AMD wants to lock up some of the market
into x86-64, they really need a mainstream OS.
Unfortunately that means Windows, and "if
we build it they will come" doesn't necessarily
work if there is no competition. Still, in the
meantime, Clawhammer will be a damn fine 32-bit
chip as well, and Sledgehammer will bring
high-end servers right down to mid-range prices.
Re:AMD's gonna win
by
Space+cowboy
·
· Score: 3, Informative
Whereas I agree with you on the AMD front, I strongly disagree regarding SCSI vs IDE in a server environment.
SCSI drives have disconnect abilities, which means they can have commands sent to them, and the bus is then disconnected (free for other use) while the drive is seeking to the sectors required and buffering in the internal drive RAM. This means that other drives can be instructed in this 'dead' time. On a single-drive system, this is irrelevant, but even on a small server (say a 0.5 TB disk array) it is crucial.
IDE drives hog the channel - which is why you can't get much more speed out of a RAID-0 array with 3 or 4 drives than one with 2 (masters) on a standard PC. There are only 2 channels, so only 2 drives can be accessed at once. Contrast this to a SCSI system, where anything up to ~64 disks might be attached to a single channel, but using disconnect to manage that channel amongst them.
To see why disconnect works so well, remember that the time it takes to seek the disk head is measured in milliseconds - this is several orders of magnitude slower than the time to send the commands/data over the bus to the host computer.
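A back-of-the-envelope sketch of that argument, in Python. It assumes every drive gets one request at the same moment and that all seeks take equally long; the 8 ms seek and 0.1 ms transfer figures are made up for illustration, not measurements of any real drive or bus.

```python
def serialized_time_ms(drives, seek_ms, transfer_ms):
    """No disconnect: the channel is held for each drive's seek plus its transfer."""
    return drives * (seek_ms + transfer_ms)


def overlapped_time_ms(drives, seek_ms, transfer_ms):
    """With disconnect: all seeks run in parallel, only the transfers share the bus."""
    return seek_ms + drives * transfer_ms


# Hypothetical figures: 8 ms per seek, 0.1 ms of bus time per request.
for n in (1, 2, 4, 8):
    print(n, serialized_time_ms(n, 8.0, 0.1), overlapped_time_ms(n, 8.0, 0.1))
```

With one drive the two models give the same answer, which is the "irrelevant on a single-drive system" case; the gap only opens up as drives are added to the channel.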
Also remember that the ATA-100 is (AFAIK) a burst-speed, ie: it can transfer at that speed when the source data is in the cache - it cannot read the data at that speed... The latest SCSI standard is 320MBytes/sec (Seagate, I believe) although I think 160 MBytes/sec is the highest widely available standard. Given the architecture underlying both technologies, which do you think will have the best chance of filling its cache more often in a RAID array? (Hint: it starts with an 'S':-)
The only company I have seen to make large-scale IDE RAID arrays work as fast as SCSI ones uses an IDE controller *per drive*, and attaches a SCSI/Fibrechannel front-end via custom hardware. It's still cheaper than SCSI, but not by that much, and getting people who know about it is more difficult when it goes wrong...
Simon.
-- Physicists get Hadrons!
Re:AMD's gonna win
by
tap
·
· Score: 3, Interesting
Furthermore, the actual drive mechanics are the same for both
SCSI and IDE versions of a drive
Why do people keep repeating this myth? If you look at the physical parameters for any SCSI and IDE drive made in the last 5 years, you will see that they are completely different. I dare anyone to find a SCSI and IDE drive from the same manufacturer produced since 1998 that has the same number of heads, spins at the same speed, and has the same capacity. You won't find any.
Re:AMD's gonna win
by
tap
·
· Score: 4, Informative
SCSI's disconnect ability looks good in theory, but in practice it's not such a great advantage. With SCSI you can attach up to 15 devices to a single channel, and effectively access them all at the same time. With IDE you can attach up to two devices to a single channel, and only access one at a time. Sounds like SCSI is lots better, but only if you have a single IDE/SCSI channel and more than one drive. If you put each IDE drive on a separate channel, and you can buy IDE controllers with 8 channels, then there really is no advantage to SCSI's disconnect/reconnect ability.
Re:AMD's gonna win
by
MSG
·
· Score: 3, Informative
BTW, High-end servers tend to use fibre channel as a physical interconnect for SCSI devices, anyway.
Might have 64-bit computing very soon.
by
x136
·
· Score: 3, Interesting
If the Power Mac G5 is introduced at Macworld on Monday*, you can all have your 64-bit goodness by the month's end!
*I'm not really expecting it to be released this soon, maybe later this year. But who knows? It could happen.
-- SIGFEH
Re:Might have 64-bit computing very soon.
by
inio
·
· Score: 3, Informative
Yes, but the G5 replaces the 32-bit ALU with a 32/64-bit ALU. The PPC spec has included 64-bit instructions from day 1, but they've only been used in IBM's mainframes. The problem with Apple using a standard 64-bit PPC is that there are a few minor differences in how certain generic instructions are handled (most instructions are specific to single- or double-words) which make running code compiled for 32-bit PPC uncertain on 64-bit PPC. So what I'm assuming Mot has done with the G5 is add a "64-bit mode" that Apple disabled by default and applications must explicitly request.
link to Full article
by
mESSDan
·
· Score: 5, Interesting
is here: That way you only have to wait a longass time for it to load once, instead of a longass time for each of the 5 or 6 pages.
--
-- Dan
Contents of the Article
by
Anonymous Coward
·
· Score: 3, Informative
Looking Forward to 2002
By: Paul DeMone (pdemone@realworldtech.com) Updated: 01-02-2002
A Quick Look Back
In the last six months several noteworthy events and disclosures have occurred in the fast moving world of microprocessors. AMD started shipping its Palomino K7 processor as the Athlon XP. Despite the controversy surrounding the performance rating based model naming scheme associated with the XP, it appears the latest refinement of AMD's venerable K7 design has, by most measures relevant to the PC world, eclipsed the performance of the 2 GHz Pentium 4 (P4), the highest speed grade offered for Intel's first implementation of its new x86 microarchitecture. However, this advantage should prove short-lived, as the second generation 0.13 um Northwood P4 will be officially released in early January. The Northwood will offer higher clock rates, an L2 cache doubled in size, and minor internal performance enhancements.
Extending their rivalry on a different front, Intel and AMD unveiled microarchitectural details of their forthcoming 64-bit standard bearers at Microprocessor Forum in October. Although the McKinley and Hammer are both future flagship parts, and thus important symbols of Intel's and AMD's struggle for technological leadership, the two processor families will be sold into different markets and won't directly compete. In other 64-bit news, IBM officially unveiled the POWER4 processor in several different hardware configurations with clock rates as high as 1.3 GHz and took the top spot in both the integer and floating point performance categories of the SPEC CPU 2000 benchmark. However, preliminary "teaser" numbers from Compaq suggest that IBM will lose SPEC performance leadership when the EV7, the final major product introduction in the doomed Alpha line, is unveiled. Regardless of who wins bragging rights for technical computing, both processors will offer memory and I/O bandwidth far ahead of their competitors and both should do quite well on commercial workloads.
Sun Microsystems continues to slowly upgrade its UltraSPARC-III line in the face of an increasingly difficult competitive environment. Sun recently introduced its copper process based version of the US-III at 900 MHz. The latest device ostensibly includes a fix to the prefetch buffer bug that vexed the earlier aluminum based device. Far more interesting than the new silicon was the latest version of Sun's compiler. It raised the new copper US-III/900's SPECfp2k score by roughly 20% by spectacularly accelerating one of the 14 programs in the suite using an undisclosed optimization. A recent call was issued for new programs for the next generation of the SPEC CPU benchmark. Tentatively named SPEC 2004, it now seems like it couldn't come soon enough.
McKinley: Little more Logic, Lots more Cache
The most striking aspect of McKinley is its size and transistor count. Weighing in at a hefty 220 million transistors, this 0.18 um device occupies a substantial 465 mm2 of die area. The majority of McKinley's transistor count is tied up in its cache hierarchy. It is the first microprocessor to include three levels of cache hierarchy on chip. The first level of cache consists of separate 16 KB instruction and data caches, the second level of cache is unified and 256 KB in size, and the third level of cache is an astounding 3 MB in size. The die area consumed by the final level of on-chip cache can be seen in the floorplan of the McKinley and some representative server and PC class MPUs shown in Figure 1.
Figure 1 Floorplan of McKinley and Select Server and PC MPUs.
The Itanium (Merced) floorplan is shown as blank because although its chip floorplan has been previously disclosed its die size is still considered sensitive information by Intel and has not been released. The outlines shown indicate the range of likely sizes of the Itanium die based on estimates from a number of industry sources.
Both the first and second generation IA64 designs, Itanium/Merced and McKinley, are six issue wide in-order execution processors. In-order execution processors cannot execute past stalled instructions so it is important to have low average memory latency to achieve high performance. This focus on the memory hierarchy can be clearly seen in the McKinley [1]. Although it is not surprising that the on-chip level 3 cache in McKinley is much faster than the external custom L3 SRAMs used in the Itanium CPU module, it is interesting to see how much faster in terms of processor cycles the McKinley level 1 and 2 caches are despite the McKinley's 25 to 50 percent faster clock rate in the same 0.18 um aluminum bulk CMOS process.
The improvement in average memory latency between Itanium and McKinley can be approximated using the comparative access latencies presented by Intel at their last developers conference, combined with representative hit rates based on the size of each cache in the two designs and an assumed average memory access time of 160 ns. This data is shown in Table 1.
CPU    Processor                    Itanium   McKinley
       Frequency (MHz)              800       1000
L1     Size (KB)                    16        16
       Latency (cycles)             2         1
       Miss rate                    5.0%      5.0%
L2     Size (KB)                    96        256
       Latency (cycles)             12        5
       Global miss rate             1.8%      1.1%
L3     Size (MB)                    4         3
       Latency (cycles)             21        12
       Global miss rate             0.5%      0.6%
Mem    Latency (ns)                 160       160
       Latency (cycles)             128       160
Total  Average latency (cycles)     3.62      2.34
       Average latency (ns)         4.52      2.34
The back of the envelope type calculations in Table 1 suggest that a load instruction will be executed by McKinley with about half the average latency, in absolute time, that it would have on Itanium. No doubt this is a major contributor to the much higher performance of the second generation IA64 processor. Although the large die area of McKinley suggests a substantial cost premium compared to typical desktop MPUs, for large scale server applications the extra silicon cost is insignificant compared to the overall system cost budget. In fact, from the system design perspective, the ability to reasonably forgo board level cache probably more than pays for the extra silicon cost of McKinley through reduction of board/module area, power, and cooling requirements per CPU. Large scale systems based on the EV7 will also eschew board level cache(s), although with the Alpha it is the greater latency tolerance of the out-of-order execution CPU core plus the integration of high performance memory controllers that permit this, rather than gargantuan amounts of on-chip cache.
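The article does not spell out how the Total rows of Table 1 were derived. The Python sketch below shows one formula that reproduces them: the L1 latency is always paid, and each deeper level adds its latency weighted by the fraction of accesses that reach it. The function and variable names are illustrative, not taken from the article.

```python
def average_latency_cycles(l1_cycles, l1_miss, l2_cycles, l2_global_miss,
                           l3_cycles, l3_global_miss, mem_cycles):
    """Average load latency: L1 cost plus each deeper level weighted by how often it is reached."""
    return (l1_cycles
            + l1_miss * l2_cycles
            + l2_global_miss * l3_cycles
            + l3_global_miss * mem_cycles)


def ns_to_cycles(latency_ns, mhz):
    """Convert a latency in nanoseconds to clock cycles at the given frequency."""
    return latency_ns * mhz / 1000.0


# Inputs copied from Table 1.
itanium = average_latency_cycles(2, 0.050, 12, 0.018, 21, 0.005, ns_to_cycles(160, 800))
mckinley = average_latency_cycles(1, 0.050, 5, 0.011, 12, 0.006, ns_to_cycles(160, 1000))

# Convert back to absolute time: one cycle lasts 1000 / MHz nanoseconds.
itanium_ns = itanium * 1000.0 / 800
mckinley_ns = mckinley * 1000.0 / 1000

print(round(itanium, 2), round(itanium_ns, 2))
print(round(mckinley, 2), round(mckinley_ns, 2))
```

Under this reading, main memory contributes 0.96 of McKinley's 2.34 cycles even at a 0.6% global miss rate, which is why the assumed 160 ns memory access time matters so much to the result.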
Besides the greatly enhanced cache hierarchy, the McKinley will boast two more "M-units" than Itanium. These are functional units that perform memory operations as well as most types of integer operations. In a recent article I speculated about the nature of McKinley design improvements. I suggested that it would contain 2 more I-units and 2 more M-units than Itanium in order to simplify instruction dispatch and reduce the frequency of split issue due to resource oversubscription. In IA64 parlance, both I-units and M-units can execute simple ALU based integer instructions like add, subtract, compare, bitwise logical, simple shift and add, and some integer SIMD operations. I-units also execute integer instructions that occur relatively infrequently in most programs but require substantial and area intensive functional units. These include general shift, bit field insertion and extraction, and population count.
Because the integer instructions that cannot be executed by an M-unit are relatively rare, the McKinley designers saved significant silicon area with little performance loss by only adding two M-units (for a total of four) and staying with the two I-units of Itanium. Data on the relative frequency of different integer operations suggest that the vast majority of integer operations, about 90%, that occur in typical programs are of the type that can be executed by either an M-unit or I-unit [2]. If we consider a random selection of six integer operations, each with a 90% chance of being executable by an M-unit, then the odds are better than 98% that any six instructions are compatible with the MMI + MMI bundle pair combination and can be dual issued by McKinley. Thus there is practically no incentive to add two extra I-units to McKinley to permit the dual issue of the MII + MII bundle pair combination.
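The "better than 98%" figure can be checked with a short binomial calculation. The Python sketch below assumes the six operations are independent, that each is I-unit-only with probability 10%, and that an MMI + MMI pair can absorb at most two I-unit-only operations (one per bundle). The function name and parameters are illustrative.

```python
from math import comb


def prob_dual_issue(n_ops=6, p_m_capable=0.9, i_slots=2):
    """Probability that at most `i_slots` of `n_ops` operations require an I-unit."""
    p_i_only = 1.0 - p_m_capable
    return sum(comb(n_ops, k) * p_i_only ** k * p_m_capable ** (n_ops - k)
               for k in range(i_slots + 1))


print(prob_dual_issue())
```

The three terms (zero, one, or two I-unit-only operations) sum to about 0.984, consistent with the figure quoted above.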
One curiosity in the McKinley disclosure was the fact that the basic execution pipeline was revealed to be 8 stages long. Although this is still 2 stages shorter than the pipeline in the slower clocked Itanium, it is one more stage than the 7 stages previously attributed to McKinley [3]. Whether this represents a slightly different way of counting the pipe stages or an actual design change isn't clear. Ironically, it has long been rumored that the Itanium pipeline was stretched by at least one stage quite late in development. It will be interesting to see if the new IA64 core under development by the former Alpha EV8 design team (now at Intel) also suffers this strange pipeline growth affliction.
Hammering x86 into the 64 bit World
In October AMD revealed some aspects of K8, its next generation x86 core code-named Hammer [4]. This new design is primarily distinguished by being the first processor to implement x86-64, AMD's extension to the x86 instruction set that supports 64 bit flat addressing, 64 bit GPRs, as well as other enhancements. As can be seen in Figure 2, the Hammer core heavily leverages AMD's highly successful K7 core.
Figure 2 Comparison of K7 Athlon and K8 Hammer Organization
The back end execution engine of the K8 Hammer core is basically identical to that of the K7 except that the integer schedulers are expanded from 5 to 8 ROPs. The increase in the integer out-of-order instruction scheduling capability this implies may have been intended to better hide the data cache's two cycle load-use latency, and thus slightly increase per clock performance. An alternative hypothesis is that the latency of some integer operations may have been increased to allow higher clock rates and the change was made to prevent a slight loss in per clock performance. The basic execution pipeline of the K7 and K8 are compared in Figure 3.
Figure 3 Comparison of K7 and K8 Basic Execution Pipeline
The K8 execution pipeline has two more stages than K7, and these new stages seem to be related to x86 instruction decode and macro op distribution to the integer and floating point schedulers. Although some of the stages have been renamed it appears that the final five pipe stages, representing the back end execution engine, are comparable. This is unsurprising as the most complex and difficult task in an x86 processor like the K7 or K8 is the parallel parsing of up to three variable length x86 instructions from the instruction fetch byte stream and their decoding into groups of systematized internal operations. In comparison, the execution engine is hardly much more complex than a typical out-of-order execution RISC processor.
Both the block diagram and execution pipeline indicate that AMD has spent nearly all its effort in Hammer development on revamping the front end of the K7 design. Some of the extra degree of pipelining may be related to the extra degree of complexity in decoding yet another level of extensions (x86-64) on top of the already Byzantine x86 ISA. Some of the increase may be related to increased flexibility in internal operation dispatch to reduce the occurrence of stall conditions and increase IPC. And some of the increase may simply reflect a reduction in the work per stage to increase the clock scalability relative to the K7 core. Without a detailed description of each of the pipeline stages in the K8 it is difficult to correlate front end pipe stages in the K7 to the K8, and next to impossible to assess how the benefit of the extra two pipe stages is allocated between accounting for increased ISA complexity, measures to increase IPC, and reduction in timing pressure per pipe stage to allow higher clock rates.
Although the 64-bit instruction set extension makes for attention grabbing headlines in the technical trade press, the major performance enhancements in the Hammer series are much more prosaic from a processor architecture point of view. These enhancements are the direct integration of interprocessor communications interfaces and a high performance memory controller. Like a "poor man's EV7", the Hammer includes three bi-directional HyperTransport (HT) links and a memory controller supporting a 64 or 128-bit wide DDR memory system using unbuffered or registered DIMMs. With the latter, a K8 processor can directly connect to 8 DIMMs, although this number may be reduced at the higher memory speeds supported. It is interesting to compare the results of the same design philosophy applied to the high-end server and mainstream PC segments of the MPU market as shown in Table 2. Power and clock rates for the Hammer MPU are estimates.
                    Alpha EV7 [5]              K8 Hammer
Process             0.18 um bulk CMOS          0.13 um SOI CMOS
Die Size            397 mm2                    104 mm2
Power               125 W @ 1.2 GHz            ~70 W @ 2 GHz
Comm Links          4 links, each 6.4 GB/s,    3 links, each ~6 GB/s
                    one 6.4 GB/s IO bus
Memory Controller   2 x 64 bit DRDRAM          64 or 128 bit DDR
                    12.8 GB/s peak             2.7 or 5.4 GB/s peak
Package             1443 LGA                   ?
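The Hammer memory bandwidth figures can be sanity-checked with the usual peak-bandwidth formula: bus width in bytes times transfers per second. The article does not state the DDR speed grade, so the 333 million transfers per second used in this Python sketch is an assumption, chosen because it lands close to the quoted 2.7 GB/s for a 64-bit interface.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_second):
    """Peak bandwidth in GB/s (10^9 bytes/s) for a bus of the given width."""
    return bus_width_bits / 8 * transfers_per_second / 1e9


# Assumption: DDR333-class memory; the speed grade is not given in the article.
DDR_TRANSFERS_PER_S = 333e6

for width in (64, 128):
    print(width, round(peak_bandwidth_gb_s(width, DDR_TRANSFERS_PER_S), 2))
```

Under that assumption the 128-bit configuration works out to roughly 5.3 GB/s; the 5.4 GB/s in the comparison looks like the rounded 64-bit figure doubled.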
Although the Intel McKinley and AMD Hammer are both 64 bit MPUs, these devices are directed at different markets. While the large and expensive McKinley will target medium and high-end server applications, the first member of the Hammer family, code named "Clawhammer", will target the high end desktop PC market. That is not to say that McKinley will outperform the Clawhammer device. Indeed, I expect the AMD device will easily beat the much slower clocked IA64 server chip in SPECint2K and many other integer benchmarks, as well as challenge much faster clocked Pentium 4 devices in both integer and floating point performance.
Exactly how much performance the Hammer core may provide is the subject of some controversy. AMD's Fred Weber was quoted as stating the Hammer core could offer SPECint2k performance as much as twice that of current processors. Although this comment is vague enough to drive a truck through (twice as fast as the best AMD processor? Best x86 processor? Best processor announced but not yet shipping? IA-32 or x86-64 code? Clawhammer or the big cache Sledgehammer?), a few web based news sites interpreted this comment as meaning the Hammer would achieve 1400 SPECint2k, and now some people are incorrectly attributing this figure to Weber himself. Keep in mind that no Hammer device has even taped out as of the end of 3Q01, let alone been fabricated, debugged, verified, and benchmarked at the target clock frequency. Whatever figure Mr. Weber had in mind was derived from architectural simulation, and for a benchmark suite as cycle intensive as SPEC CPU, simulation results are approximate at best [6][7]. As has been shown time and time again, it is best not to count performance chickens too closely before the silicon eggs hatch.
Alpha Goes Out With a Bang not a Whimper
Although Compaq announced the wind down of Alpha development in June and transferred nearly the entire EV8 development team to Intel over the summer, there is still one more surprise in store for the computer industry. The EV7, the final major design revision in store for Alpha, has been the subject of intense testing, verification, and system integration exercises since late spring. This design has been in the pipeline for a long time. It was first announced more than three years ago and finally taped out in early 2001. Because of the complexity of this device (basically a complex CPU and large scale server chipset all on one die) and the incredible degree of shakedown server class MPUs and systems undergo, the EV7 will not go into volume production until the second half of 2002. To bridge the gap between current products and EV7 based systems Compaq will shortly release a 1.25 GHz version of the workhorse EV68.
Although general details of the EV7 design have been in the public domain for more than three years, and specific facts about the performance of this MPU's router and memory controllers were disclosed in February, I think the performance it will achieve when officially rolled out in 2H02 will surprise and dismay many in the industry (possibly including senior Compaq management). At the Microprocessor Forum in October Compaq's Peter Bannon unveiled some preliminary performance numbers for the EV7, namely 804 SPECint2k, 1253 SPECfp2k, and roughly 5 GB/s STREAM performance.
Although these numbers are quite good in absolute terms, comparable to the fastest speed grade POWER4 running in a contrived and unrealistic hardware configuration, the numbers failed to live up to my estimates given in a previous article. However, former members of the Alpha design team have privately confirmed my suspicions that Mr. Bannon was clearly sandbagging the EV7 numbers, keeping a not insignificant amount of performance off the table. For a product still more than six months from release that is a not unexpected tactic. I still hold the opinion that when it is all said and done the EV7 has a good chance of being the highest performance general purpose microprocessor ever fabricated in 0.18 um technology, a fitting ending to a remarkable and tragic technological saga (EV79, an EV7 shrink to 0.13 um SOI is on the roadmap for 1H04 but the continued turmoil at Compaq suggests a healthy amount of scepticism is in order).
Sun's Surprising Spike SPARCs SPECulation
Sun recently introduced a new member of its UltraSPARC-III family. This new 900 MHz device differs from earlier US-III parts by the use of copper interconnect instead of aluminum. Although Sun submitted official SPEC scores for a 900 MHz Sun Blade 1000 Model 1900 using an aluminum US-III in late 2000, yield was apparently poor and this speed grade wasn't generally available. A rarely occurring bug related to a prefetch buffer inside the US-III was discovered and as a work around this feature was disabled in firmware. Unfortunately for Sun Microsystems, this caused the SPECfp_base2k score for the Model 1900 to drop from an already lackluster 427 to a lamentable 369 in a second SPEC submission in the spring of 2001. So it comes as no small surprise that the Sun Blade 1000 Model 900 Cu workstation, based on the new copper processor turned in a SPECfp_base2k score of 629 in a recent submission. Both the Model 1900 and Model 900 Cu versions of the Blade 1000 feature 8 MB of L2 cache.
It is possible that the copper US-III incorporates improvements beyond a fix to the prefetch buffer bug as well as improvements to system level hardware between the Model 1900 and Model 900 Cu. However it appears much of the improvement can be attributed to the use of the Sun Forte 7 EA compiler instead of the earlier Forte 6 update 1 compiler used to generate the 427 and 369 scores. The reason why I say that with confidence can be seen quite readily in the graph in Figure 4.
Figure 4 SPECfp_base2k Component Scores for US-III and Competitors
The SPECfp_base2k scores for the 14 sub-component programs for the pre-bug fix Sun Blade 1000 Model 1900 submission using the Forte 6 compiler are compared to the recent Sun Blade Model 900 submission using the Forte 7 compiler. In addition, scores for the Itanium (4MB, 800 MHz version in an HP i2000), Alpha EV68C (1000 MHz version in an ES45/1000), and POWER4 (1300 MHz version in a pSeries 690 Turbo) are provided for reference. It is the new compiler's score on the 179.art program that quite literally stands out from the rest. Although several other programs see appreciable improvement (the 183.equake score nearly triples), the new compiler increases the score of 179.art by more than 800%. In absolute terms this score, 8176, is more than four times higher than that achieved by the Alpha EV68 and POWER4, MPUs that easily beat the copper US-III on nearly every other SPECfp2k program. The 179.art score achieved by the Forte 7 compiler is vital to the new machine's pumped up SPECfp_base2k score. If you leave 179.art out of the geometric mean then its SPECfp_base2k score would drop by nearly 18% from 629 to 516.
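The drop from 629 to 516 follows directly from how a geometric mean behaves when one component is removed. The Python sketch below uses only the three figures quoted above (14 programs, an overall score of 629, and a 179.art score of 8176); the function name is illustrative.

```python
def geomean_without(overall_geomean, n_scores, removed_score):
    """Geometric mean of the remaining scores after dropping one component."""
    # The product of all n scores is overall_geomean ** n; divide out the removed one.
    remaining_product = overall_geomean ** n_scores / removed_score
    return remaining_product ** (1.0 / (n_scores - 1))


without_art = geomean_without(629, 14, 8176)
drop_percent = (629 - without_art) / 629 * 100

print(round(without_art), round(drop_percent, 1))
```

Because the geometric mean multiplies the component ratios together, a single outlier of this size shifts the composite by nearly a fifth even though thirteen other programs are unchanged.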
This remarkable improvement on 179.art is unusual in the field of compiler engineering where single digit percentage performance increases are often considered major victories. So it is no surprise that Sun's achievement immediately raised suspicions among industry observers and competitors about the nature of the optimization employed by the Forte 7 compiler. It is hard not to think of Intel's infamous eqntott compiler bug that erroneously increased the SPECint92 score of its processors by about 10% until caught and fixed [8]. This bug used an illegal optimization that allowed the output of 023.eqntott to pass result checking with the test data used but was invalid in the general case.
Although the exact nature of the new Sun optimization isn't known, suspicion has fallen on several inner loops within the 179.art program. Speculation is that this code was originally written in FORTRAN and converted to C. Because FORTRAN and C access two dimensional arrays in opposite row and column order it is presumed that 179.art accesses arrays by the wrong index in the innermost loop causing poor cache locality. It is possible that the new Sun compiler recognizes this situation and turns the nested loops that step through the array accesses "inside out" and achieves much lower cache miss rates. Whatever the exact nature of the Sun optimization turns out to be there is the question of whether it violates one of the SPEC rules, namely "Optimizations must improve performance for a class of programs where the class of programs must be larger than a single SPEC benchmark or benchmark suite".
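To make the speculated transformation concrete, the Python sketch below lists the order in which a loop nest touches the elements of a row-major (C style) two dimensional array, before and after the loops are interchanged. It illustrates loop interchange in general; it is not known to be what the Forte 7 compiler actually does, and the 3 x 4 array size is arbitrary.

```python
def offsets(rows, cols, row_inner):
    """Row-major linear offsets (row * cols + col) in the order a loop nest visits them."""
    if row_inner:
        # Inner loop walks the row index: consecutive accesses are `cols` elements apart.
        return [r * cols + c for c in range(cols) for r in range(rows)]
    # Interchanged nest: inner loop walks the column index, so accesses are contiguous.
    return [r * cols + c for r in range(rows) for c in range(cols)]


before = offsets(3, 4, row_inner=True)
after = offsets(3, 4, row_inner=False)

print(before)  # strided walk through memory
print(after)   # sequential walk through memory
```

Both orders visit exactly the same elements, which is why the interchange is legal when the loop body has no dependence on iteration order; only the stride between consecutive accesses changes, and with it the cache miss rate on large arrays.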
Without knowing the nature of the new Sun optimization it is impossible to say whether Sun should be praised or scolded. But here are the words of Sun engineer John Henning who made the following comments in a November 27 post to the comp.arch usenet news group:
"Our compiler team believes that what Sun has done with art is (1) the result of perfrectly [sic] legitimate optimizations (2) compliant with SPEC's rules and (3) not appropriate for further discussion - if you want to figure out to make art faster, go work on it yourself, don't ask Sun how we did it!"
With the widespread attention this incident has engendered within the industry we can presume that compiler and benchmarking experts working for Sun's competitors have closely scrutinized the code Forte 7 generates for 179.art. The fact that Sun's new scores haven't been withdrawn from the SPEC official web site yet suggests that Mr. Henning is correct. No doubt we can expect competitors' processors to score much higher on 179.art in the months and years to come as the Sun optimization migrates to other compilers. Depreciation of a benchmark's value is seldom as spectacular as in the case of 179.art, but still naturally occurs over time and provides incentive to accelerate the development of a successor to the SPEC CPU 2000 benchmark suite (which no doubt will not include 179.art). A message soliciting programs for this new suite, tentatively named SPEC 2004, was posted on comp.arch on December 28. Ironically the author of this message, the secretary of the SPEC CPU subcommittee, is none other than the previously mentioned John Henning.
Conclusion
It is comforting to see the pace of innovation in the microprocessor field shows no sign of slackening. The great seesaw battle between Intel and AMD for share of silicon's richest prize, the x86 microprocessor market, is about to enter a new phase with the imminent release of the 0.13 um Northwood Pentium 4. Although AMD will also migrate its K7 core to 0.13 um later in 2002 with both bulk and SOI versions, it is unlikely to be in the position to regain the performance advantage over Intel it previously achieved with the T-bird and XP Athlon until its new 64-bit Hammer core ships. Unlike AMD, Intel plans to reserve its 64-bit offerings for the high-end market. With McKinley Intel hopes to address the significant performance difficulties seen in the Itanium in part by taking advantage of its capacious manufacturing facilities to incorporate a huge amount of on-chip cache on its sizable die.
It seems like the time it takes for new ideas and features to migrate down from high-end server MPUs to mass-market devices is shrinking. The integration of high performance interprocessor communication links and memory controller(s) onto a processor die has been on the drawing board for many years and will soon be realized in the high end server market in the form of the EV7. Remarkably, the same concepts will appear in a mass-market x86 processor, the first of AMD's Hammer series, not too much later. Although these features will naturally be more limited in scope in the x86 device to keep costs under control, they should still provide a large boost in performance from significantly reduced memory access latency as well as a dramatic reduction in the cost of producing multiprocessor systems based on this device.
Few topics in the computer and microprocessor field can raise a controversy, as well as blood pressure, as quickly as benchmarks and benchmarking. Sun managed to throw a hand grenade into the simmering debate between the supporters and detractors of the industry standard SPEC CPU benchmark by speeding up the execution of one of the fourteen programs in the floating point suite by nearly an order of magnitude through the use of a previously unexploited compiler optimization. This in turn raised the SPECfp2k score of its latest US-III processor by roughly 20%. We can now look forward to the spectacle of competing firms scrambling to reverse engineer Sun's new compiler trick and incorporate the same voodoo into their own wares.
References
[1] Krewell, K., "Intel's McKinley Comes Into View", Microprocessor Report, October 2001, Volume 15, Archive 10.
[2] Hennessy, J. and Patterson, D., "Computer Architecture A Quantitative Approach", Morgan Kaufmann Publishers Inc., 1990, ISBN 1-55860-069-8, p. 181.
[3] "Advance Program, 2001 IEEE International Solid-State Circuits Conference", p. 35.
[4] Weber, F., "AMD's Next Generation Microprocessor Architecture", October 2001, Downloaded from AMD web site.
[5] Jain, A. et al, "A 1.2 Ghz Alpha Microprocessor with 44.8 GB/s Chip Pin Bandwidth", Digest of Technical Papers, ISSCC 2001, Feb 6, 2001, p. 240.
[6] Dulong, C. et al, "The Making of a Compiler for the Intel Itanium Processor", Intel Technology Journal, Q3 2001, Downloaded from Intel web site.
[7] Desikan, R. et al, "Measuring Experimental Error in Microprocessor Simulation", Digest of Technical Papers, 28th Annual International Symposium on Computer Architecture, June 2001.
[8] "Intel OverSPECs Parts", Microprocessor Report, January 22, 1996, Volume 10, Number 1, P. 5.
Compaq and Alphacide
by
Anonymous Coward
·
· Score: 4, Interesting
There is an interesting discussion over in comp.arch on Usenet about
Compaq, Alpha, and the Itanium. The thread is called
Alphacide. Interesting stuff. It appears
that Compaq drank the Kool-Aid.
By the way, Pricewatch is quoting about $3K for the low-end Itaniums running at about 700 MHz.
No thanks.
Shrinkage
by
CaptainAlbert
·
· Score: 5, Informative
Impressive though 64-bit processors might be, I'm not convinced that the performance improvement is going to be as big as people are expecting.
Remember that the components in any digital system - and I'm not just talking about your windoze desktop PC, but servers, mainframes and embedded systems too - have to talk to each other in order to do anything remotely useful. Last time I looked, most PCI devices didn't utilise the provision for 64-bit data bus operation.
There's a perfectly good reason for this, of course... in order to attach a chip to a circuit board, you need an array of pins (or solder balls) that are macroscopic, so they can be soldered and handled without too much risk of accidental damage. Additionally, PCB tracks can only go so small (and so close together) without undesirable electrical effects and again, an inability to work with it in a production environment.
The "more bits" phenomenon has been sustained by improvements in VLSI and the advent of true System-on-a-chip design, but this too has its limits. If you compare a P4 motherboard with, say, a 386 mobo circa 1995, you'll see the chip count is drastically reduced. But fewer interconnected components means less repairability, upgradability, and interoperability. My old 486 had a VLB EIDE hard disk controller, which I swapped in after the last one failed. If my controller failed today, I couldn't do that; I'd either need to buy a new mobo or start replacing chips on the old one (which is just as expensive).
Don't get me wrong - I'm all for progress! And I expect we'll see more and more 64/128-bit chips springing up inside custom devices (e.g. 3D cards, routers) where the local interconnect can be made as fat as necessary. But the PC will remain shackled by slow frontside busses for a while yet, I reckon.
Re:Shrinkage
by
CaptainAlbert
·
· Score: 5, Interesting
> Perhaps your 486 MB was the first of its kind,
> but modern motherboards with integrated devices
> have the ability to disable them so that can be
> replaced by cards in slots.
True, but that presupposes the existence of spare slots;-)
I hear what you're saying about trashable chips, but I think the real phenomenon is the "trashable board". Think about it - if your mobo dies and your warranty has run out, you go buy a replacement and ditch the old board. If it happens still to be under its manufacturer's warranty, most likely you just take it back to the shop and swap it for a working one. What happens to the old one? Most likely, they throw it away. The cost of postage, packing, an engineer's time to find the problem, repairs, parts... it's more than the damn thing retails for anyway.
I think this is missing the point anyway. The integration idea goes like this: with today's technology, you could put the equivalent of an early Pentium processor, plus hard disk and graphics controllers, BIOS chipset, etc. onto a single piece of silicon. Pretty much all you'd be left with off-chip would be (a) RAM and (b) I/O circuitry, because they're both harder to integrate. So your computer is about four or five chips. This is approximately the case in palm-tops now.
The point is that you've lost all ability to choose your own components. That graphics block/macrocell has probably been chosen by the manufacturer because it was the best value for money (i.e. the cheapest they could find). If you're lucky, they will give you expansion ports so you can plug your own stuff in. But that costs money, and if they think you'll pay for the lesser product then they'll make that instead.
Does it matter? Probably not to the average user. But I think it would matter to the industry. The whole point of having standard architectures like PCI, SCSI, EIDE (and before them, ISA et al.) is that many vendors can compete to produce compatible products, which drives innovation and generally provides a good deal for the consumer.
But if the minimisation continues and the busses become subsumed into the very chips themselves, then the chances are the manufacturers will cut corners. They won't wait for the not-quite-standard-yet SuperBus2005 architecture... they'll design their own and make you buy their proprietary upgrades. Again, the economics work out such that you the consumer probably get a good deal. But trading off good deals today against innovation tomorrow is dangerous.
So, it would be much better to keep all those busses outside the individual components, right? But that's exactly what is keeping the PC architecture slow at the moment (which was the point of my previous post. I think.).
Re:So why do I need 64bits?
by
4im
·
· Score: 5, Interesting
One word: addressing. With those 32 bits,
you can
typically address up to 2 gig files on your
machine - which is a limit easily encountered when
you start working with video, for instance.
It took hacks to get 4 gig of RAM working on x86
with the linux kernel.
Go 64 bit, and that limit vanishes. You keep your
linear addressing, none of those ugly segments like
in the infamous real-mode of PC-XT times.
I don't see what's really new about it all though,
we've had 64 bit since Alpha, and there's several
64 bit architectures around. It may not be
mainstream yet, but will IA 64 or Hammer really
change that (soon)? Allow me to have doubts.
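For anyone who wants the numbers behind those limits, here is a rough sketch (plain arithmetic, nothing taken from any kernel source). It assumes the 2 gig file limit comes from file offsets being held in a signed 32-bit integer, which is my assumption about the cause; the 4 gig figure is simply 2^32 bytes. The function names are made up for illustration.

```python
# Sizes reachable with a given number of address or offset bits.
GIB = 2**30  # one gibibyte in bytes

def addressable_bytes(bits: int) -> int:
    """Bytes reachable with an unsigned flat address of the given width."""
    return 2**bits

def signed_offset_limit(bits: int) -> int:
    """Number of distinct non-negative offsets in a signed integer of the
    given width (assumed here to be what caps the file size)."""
    return 2**(bits - 1)

if __name__ == "__main__":
    print(signed_offset_limit(32) // GIB, "GiB file limit with signed 32-bit offsets")
    print(addressable_bytes(32) // GIB, "GiB flat address space with 32-bit pointers")
    print(addressable_bytes(64) // GIB, "GiB flat address space with 64-bit pointers")
```

The 64-bit figure is far beyond what any machine will physically hold for a long time, which is why the limit effectively vanishes.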
Intel learning from their mistakes
by
jazzyjez
·
· Score: 5, Insightful
Much as I hate to say it, the Intel McKinley looks like a very well designed piece of kit, and it appears Intel have learned from their mistakes with the P4 by including a big, fast 3-level cache on the McKinley. It's also good to see them reducing their pipeline size, which means it may finally be able to compete with the G4 in terms of efficiency. However, this is of course going to kick them in the teeth in terms of competing on processor speed, which they have been pushing so hard recently in their marketing.
The same can't be said of AMD's offering, although in fairness the Hammer is not directed at the server market unlike the McKinley. The pipeline is longer than both their previous design and the McKinley, which is going to give them a performance hit. We can only hope that their cache is as good as Intel's.
What amazes me is that they can still keep adding instruction extensions without too much of a performance hit. Anyone looked at the latest instruction set documentation for these processors? Eugh! The pain of backwards compatibility...
Re:Intel learning from their mistakes
by
nusuth
·
· Score: 4, Informative
IA64 is a new and incompatible instruction set; Intel is not adding anything to their x86 ISA.
Hammer does not have a 3MB L3 but it has an integrated memory controller, which would drastically reduce the latency of cache misses.
Assuming AMD will go for a bigger than 32 KB L1 cache, and will not succeed in making cache hits as fast as McKinley's (speculation based on current offerings), the picture is a bit complicated:
Watch it: Hammer and McKinley each ask for an instruction or piece of data. If both hit, McKinley wins. A more probable scenario is that McKinley misses and Hammer hits, a clear win for Hammer. A still more probable scenario is that both miss. If the data is in L2, McKinley is faster: it has a lower miss penalty and can fetch from L2 faster. But it is more probable that the data is in Hammer's cache and not in McKinley's, which would benefit Hammer. If L2 misses too but McKinley scores an L3 hit, McKinley wins. If it suffers an L3 miss, it has to pay both the L3 miss latency and the memory latency, while Hammer pays no L3 miss latency and its memory latency is probably much lower. So with huge data processed in not-so-tight loops Hammer wins hands down, while for medium-sized data that could fit into L3 McKinley wins hands down.
Although McKinley is a server product and Hammer is not (or so it is said), an integrated memory controller benefits Hammer in multiway systems so much that it may as well be positioned as a server product. No more asking the chipset to fetch a piece of data and waiting until the chipset serves other processors' requests, just go and grab it!
Finally, some of the Hammer line will have L3 caches and the Hammer line will have a higher clock rate than McKinley. If AMD can deliver what they have promised, they have a clear winner overall. But I'm still a bit skeptical.
--
Gentlemen, you can't fight in here, this is the War Room!
AMD is deceiving you
by
hatchet
·
· Score: 3, Insightful
First of all I'd like to say, I am not biased in either way.. after all I'm going to get me a new AthlonXP next week.
IA64 is very different from x86-64. AMD's 64bit solution is nothing more than extension to current 32bit instruction set. Of course there are some tweaks, but nothing very radical. You will still be able to run old 16 and 8bit code efficiently.
Intel's IA64 is a huge step in the future... architecture wise is far superior to x86-64. Why?
Why do we need 64bit processors? Addressing? Nah, current processors can address enough space.. with 386 processors FAR addressing was introduced, which expanded allocatable address space drastically. (those silly DS, SS,.. registers) And newest processors can deal with them with same ease as with non-far addressing.
AMD's 64bit solution currently has no real value.. except for huge data storage (could work faster with 64bit data blocks) and probably some heavy encryption. x86-64 compiled Quake3 would make minimum use of 64bit registers.. and would probably be just a margin faster than IA32 compiled version.
Is IA64 better? Yes it is. IA64 has 128 usable 64bit registers, predicates... But that is not all.. in single 64bit register you can store 4 16bit values(common integer). (or 8 8bit or 2 32bit)And manipulate with them almost as much as you like. And if you have 4 integers in other register.. you can make 4 arithmetical operations with SINGLE instruction. You can do similar things with floating point operations... and with ILP you could do 3 instructions per cycle. This means that Quake2's VectorAdd/Subtract could be done in SINGLE cycle.
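To make the "four 16-bit values in one 64-bit register" idea concrete, here is a minimal sketch in Python that emulates a packed add. This is not IA64 code, and the function names are invented for illustration; real hardware does this in one instruction, while the masking below just mimics the rule that carries must not leak from one 16-bit lane into the next.

```python
# Emulate adding four 16-bit lanes packed into one 64-bit word.
LANES = 4
LANE_BITS = 16
LANE_MASK = 0xFFFF
HIGH_BITS = 0x8000800080008000  # top bit of every 16-bit lane
LOW_BITS = 0x7FFF7FFF7FFF7FFF   # the lower 15 bits of every lane

def pack(values):
    """Pack four 16-bit integers into one 64-bit word, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]

def packed_add16(a, b):
    """Lane-wise add with wraparound; no carry crosses a lane boundary."""
    # Add the low 15 bits of each lane; the result fits inside the lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold the top bits back in with XOR so their carry-out is discarded.
    return low_sum ^ ((a ^ b) & HIGH_BITS)

if __name__ == "__main__":
    x = pack([1, 2, 3, 4])
    y = pack([10, 20, 30, 40])
    print(unpack(packed_add16(x, y)))
```

The same trick works for eight 8-bit or two 32-bit lanes by changing the masks, which is essentially what packed integer instructions provide in hardware.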
Clawhammer will be better for a year or so.. but soon it will hit the ceiling. Intel will be able to get better performance from 1/2 clocked IA64.
And please don't respond with lame comments if you haven't read at least whitepapers from Intel and AMD.
Re:AMD is deceiving you
by
statusbar
·
· Score: 3, Interesting
And unfortunately it will be a LONG TIME before good non-buggy optimizing compilers will be available for such a complex architecture.
Software pipelining and parallel instructions give you a real complex monster CPU. Languages like C and C++ make it extremely hard to optimize for a CPU like this, since C and C++ were never designed for fine-grained parallelism and software pipelining. So the result is a lot of wasted clock cycles.
What I'd LOVE to see is a statically typed pure functional language that could be used to generate the code for IA64. Then it would be feasible to fully take advantage of the IA64's features!
In the meantime, people compiling IA64 C code with GCC will be extremely disappointed. People compiling IA64 C code with Intel's optimizing compiler will be happier but will only be mildly impressed.
I've worked with VLIW (256 bit instructions) software pipelined DSP's before and learned very quickly that the C and C++ language standards are fundamentally limited for these things. I also learned very quickly that writing assembly language directly for them is an easy way to gain a special invitation to a padded room!!! I shudder to think what the compiler writers have to go through!
--jeff
-- ipv6 is my vpn
Re:AMD is deceiving you
by
SQL+Error
·
· Score: 5, Informative
Bullshit.
AMD has stated *explicitly* that the Hammer is an evolutionary rather than revolutionary design. They've said all along that it is an Athlon with 64-bit extensions and some minor tweaks (SSE2, extended pipeline). They haven't deceived anyone.
Now, as to the relative performance of the two architectures (x86-64 vs. IA-64): the Athlon XP 1900+ achieves a SpecInt2000 score of 701 (peak) while the 800MHz Itanium manages... 314. On floating point the Itanium does rather better: 645 vs. 634 for the Athlon. (The current leader is the IBM Power4, which gets 814 SpecInt and 1169 SpecFP.)
Having 128 64-bit registers is good, but remember that the Athlon and Hammer have far more physical registers than are presented in the programming model, and automatically map them according to the requirements of instructions in the pipeline. And the predicates and wide issue of the Itanium are balanced against the ability of the Athlon to *automatically* issue instructions speculatively and re-order the instruction queue to improve ILP.
And on the subject of manipulating multiple values with a single instruction: ever heard of MMX? 3DNow? SSE? Athlon has all of these, and Hammer will add SSE2. What do you think these are for?
As to the value of 64-bit addressing: I've programmed for machines (Suns and Compaq Alphas) with as much as 64GB of memory. While you *can* address that much with a 32-bit CPU, it means that you have to constantly re-map your view of memory, which is a royal pain. Moving to 64 bit addressing makes the problem disappear. And with current memory prices, even small commodity servers could make good use of more than 4GB of memory.
And 64-bit integer registers are good for a lot of things, and while you can certainly use 64-bit integers on a 32-bit CPU, making them faster won't hurt.
So, Athlon currently has a huge performance advantage over Itanium on integer apps, and a huge price/performance advantage (with comparable absolute performance) on FP apps. AMD's aim with Hammer is to extend Athlon cheaply and effectively into the 64-bit realm.
Intel's aim with Itanium appears to be to crush all competition; unfortunately, they've placed a *huge* bet on improvements in compiler technology that just hasn't paid off yet, resulting in a high-end chip that lags behind not just the high-end RISC chips like Alpha and Power, but low-cost desktop chips. To achieve commercial success, the Itanium needs integer performance somewhere in the vicinity of their competitors, but they currently trail the pack by a huge margin. Even SGI do better, and they all but shut down their CPU design efforts years ago.
Maybe McKinley will be the answer - but it doesn't look like it, given that the promised speeds have dropped to 1GHz. IA-64 is an interesting architecture which may even have a future, but so far it just don't fly.
Re:AMD is deceiving you
by
SurfsUp
·
· Score: 4, Insightful
You don't have a clue. Let me just pick out a couple of the grossly wrong items...
Why do we need 64bit processors? Addressing? Nah, current processors can address enough space.. with 386 processors FAR addressing was introduced, which expanded allocatable address space drastically. (those silly DS, SS,.. registers) And newest processors can deal with them with same ease as with non-far addressing.
Sheesh, where are you coming from? You can address 64 Gig of physical memory with an x86 now, but you can only address 4 Gig (at most!) of it linearly. 32 bit address registers, get it? Gosh, and far addressing was introduced with 386's was it? Give me a break, try 8086's.
AMD's 64bit solution currently has no real value.. except for huge data storage (could work faster with 64bit data blocks) and probably some heavy encryption. x86-64 compiled Quake3 would make minimum use of 64bit registers.. and would probably be just a margin faster than IA32 compiled version.
Right, and I'm supposed to believe you on this, given your performance above. Um, you seem to have ignored the value of being able to crunch 8 byte integers, or pixels 8 bytes at a time, nicely matching the width of the MMX registers. For starters. Repeat this to yourself: "sledge hammer". "sledge hammer". Good, that's more like it.
Is IA64 better? Yes it is. IA64 has 128 usable 64bit registers, predicates... But that is not all.. in single 64bit register you can store 4 16bit values(common integer). (or 8 8bit or 2 32bit)
Um, and guess how many 16 bit values you can store in a 64 bit sledgehammer register? Ah, and guess how many fp/mms instructions sledge can retire per cycle?
Clawhammer will be better for a year or so.. but soon it will hit the ceiling. Intel will be able to get better performance from 1/2 clocked IA64.
You don't have any idea why it's called itanic, do you? Moderators, take a look above. Remember, that's what 'random' looks like. Yes, I've got mod points right now. No, I won't waste them on you.
-- Life's a bitch but somebody's gotta do it.
Now we can wait for software support...
by
green+pizza
·
· Score: 5, Interesting
Once we get the 64-bit hardware, we still have the MMOS (minor matter of software) to worry about.
Cases in point:
Silicon Graphics machines with MIPS R4400 (and up) CPUs were 64-bit, but the additional address and pointer space weren't utilized until IRIX 6.0 in 1994 -- over 18 months later. (And, of course, certain SGIs still run in 32-bit mode due to RAM concerns -- 64-bit requires more RAM -- all Indys, all Indigos, all O2s, and R4400 Indigo2s).
Sun machines with UltraSPARC CPUs were 64-bit, but again, the additional address and pointer space had to wait for software support. (Multi-stage transition to 64-bit, starting with Solaris 2.5 and finally complete with Solaris 7 in 1998).
Then there's application optimization. Many apps can get slight speedups by processing data in larger chunks (say, 32-bit or even 64-bit). Sometimes the difference is huge, many times it's small. But, lots of little speedups can add up across an entire system. Still, someone has to make these changes to apps and compilers. It takes time, testing, and adoption. In better times, SGI did several such overhauls... they got some insane speed out of Netscape Enterprise and Netscape FastTrack web servers during the Everest project. One of their engineers also did some cool (but nonstandard) hacks to Apache, including the very first pure, clean 64-bit port/mod.
Newer, faster, wider, more-torque hardware is always great. But don't forget the software.
Re:Now we can wait for software support...
by
dunstan
·
· Score: 4, Informative
Even with a reference application (Oracle 8.1.6) on a reference OS (Solaris 8), the patch levels for the 64-bit version were 3 revs behind those for the 32-bit version when I last looked. What bothered me was that the bug I'd run into was fixed in the 32-bit version but still there in the 64-bit version. Guess which version I ran.
Dunstan
--
The last scintilla of doubt just rode out of town
Re:64-bit is more than speed
by
BCoates
·
· Score: 3, Insightful
Huh? you could do 32 bit addressing long before you could buy 4GB drives, but nobody thought memory mapping your hard drive into your address space was a good idea then... What would be the point?
--
Benjamin Coates
Re:So why do I need 64bits?
by
thogard
·
· Score: 3, Insightful
Since that was marked troll, I'll blow more karma...
With most operations, 64 bits isn't 2x as fast, it's 1x as fast, unless you deal with the stack, in which case it could be even slower.
Addressing has little to do with word size. The 8088 shows that.
Suns running in 64 bit mode are often slower than running in 32 bit mode.
Nintendo 64 games are all 32 bit code with just a few 64 bit operations. The good emulator proved that.
As far as doing two 32 bit ops at once, I still don't need a 64 bit data path to do that, I just need several 32 bit data paths. What I don't need is to dump a bunch of unused 64 bit numbers on the stack every time an exception happens (which one of my computers has done about 1047563950 times in the last 51 days)
Chips, maybe, but applications?
by
Zocalo
·
· Score: 3, Informative
There may well be a slew of 64 bit chips by the year's end, but I doubt you are going to see much non-specialist application support for some time. Sure PhotoShop and a few other desktop applications will arrive fairly quickly, but look at Windows and 32 bit support; Intel shipped the 80386 in 1985 and only now can you boot a Windows PC without running 16 bit code from the HDD.
Actually, even that's not strictly true, since according to the Resource Kit documentation Windows XP's initial configuration detection is *still* 16 bit.
-- UNIX? They're not even circumcised! Savages!
FAR addressing.
by
leuk_he
·
· Score: 3, Insightful
Why do we need 64bit processors? Addressing? Nah, current processors can address enough space.. with 386 processors FAR addressing was introduced, which expanded allocatable address space drastically.
Far addressing can handle 4GB, but 4GB is not much by today's standards (it is not little either). You want a flat address space, and RAM is cheap these days. 32 bits get you to 4GB; if you want more you have to resort to tricks (those silly DS, SS etc).
With the first 64 bit Alphas they used this as an argument: it is useful for fast memory scans when using big databases.
AMD is the future. Glad to see an underdog win.
If the Power Mac G5 is introduced at Macworld on Monday*, you can all have your 64-bit goodness by the month's end! *I'm not really expecting it to be released this soon, maybe later this year. But who knows? It could happen.
SIGFEH
is here: That way you only have to wait a longass time for it to load once, instead of a longass time for each of the 5 or 6 pages.
-- Dan
Looking Forward to 2002
By: Paul DeMone (pdemone@realworldtech.com) Updated: 01-02-2002
A Quick Look Back
In the last six months several noteworthy events and disclosures have occurred in the fast moving world of microprocessors. AMD started shipping its Palomino K7 processor as the Athlon XP. Despite the controversy surrounding the performance-rating-based model naming scheme associated with the XP, it appears the latest refinement of AMD's venerable K7 design has, by most measures relevant to the PC world, eclipsed the performance of the 2 GHz Pentium 4 (P4), the highest speed grade offered for Intel's first implementation of its new x86 microarchitecture. However, this advantage should prove short-lived, as the second generation 0.13 um Northwood P4 will be officially released in early January. The Northwood will offer higher clock rates, an L2 cache doubled in size, and minor internal performance enhancements.
Extending their rivalry on a different front, Intel and AMD unveiled microarchitectural details of their forthcoming 64-bit standard bearers at Microprocessor Forum in October. Although the McKinley and Hammer are both future flagship parts, and thus important symbols of Intel's and AMD's struggle for technological leadership, the two processor families will be sold into different markets and won't directly compete. In other 64-bit news, IBM officially unveiled the POWER4 processor in several different hardware configurations with clock rates as high as 1.3 GHz and took the top spot in both the integer and floating point performance categories of the SPEC CPU 2000 benchmark. However, preliminary "teaser" numbers from Compaq suggest that IBM will lose SPEC performance leadership when the EV7, the final major product introduction in the doomed Alpha line, is unveiled. Regardless of who wins bragging rights for technical computing, both processors will offer memory and I/O bandwidth far ahead of their competitors and both should do quite well on commercial workloads.
Sun Microsystems continues to slowly upgrade its UltraSPARC-III line in the face of an increasingly difficult competitive environment. Sun recently introduced its copper process based version of the US-III at 900 MHz. The latest device ostensibly includes a fix to the prefetch buffer bug that vexed the earlier aluminum based device. Far more interesting than the new silicon was the latest version of Sun's compiler. It raised the new copper US-III/900's SPECfp2k score by roughly 20% by spectacularly accelerating one of the 14 programs in the suite using an undisclosed optimization. A recent call was issued for new programs for the next generation of the SPEC CPU benchmark. Tentatively named SPEC 2004, it now seems like it couldn't come soon enough.
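A quick sanity check shows how one program can move the whole score this much. The sketch below relies on the fact that a SPEC CPU2000 suite score is the geometric mean of the per-program ratios, and it assumes the other 13 floating point programs were unaffected by the new compiler, which is an idealization. The function names are illustrative only.

```python
# Effect of speeding up one program on a geometric-mean benchmark score.

def overall_gain(speedup: float, n_programs: int = 14) -> float:
    """Factor by which the geometric mean rises when exactly one of
    n_programs runs `speedup` times faster and the rest are unchanged."""
    return speedup ** (1.0 / n_programs)

def speedup_needed(gain: float, n_programs: int = 14) -> float:
    """Single-program speedup required to lift the geometric mean by `gain`."""
    return gain ** n_programs

if __name__ == "__main__":
    print(f"10x on one program lifts the score by {overall_gain(10.0):.3f}x")
    print(f"A 1.20x score gain needs a {speedup_needed(1.2):.1f}x single-program speedup")
```

Under these assumptions a tenfold speedup of one program lifts the suite score by about 18%, and a full 20% gain implies a speedup somewhat beyond 10x, which is in the same ballpark as the figures quoted above.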
McKinley: Little more Logic, Lots more Cache
The most striking aspect of McKinley is its size and transistor count. Weighing in at a hefty 220 million transistors, this 0.18 um device occupies a substantial 465 mm2 of die area. The majority of McKinley's transistor count is tied up in its cache hierarchy. It is the first microprocessor to include three levels of cache hierarchy on chip. The first level of cache consists of separate 16 KB instruction and data caches, the second level of cache is unified and 256 KB in size, and the third level of cache is an astounding 3 MB in size. The die area consumed by the final level of on-chip cache can be seen in the floorplan of the McKinley and some representative server and PC class MPUs shown in Figure 1.
Figure 1 Floorplan of McKinley and Select Server and PC MPUs.
The Itanium (Merced) floorplan is shown as blank because although its chip floorplan has been previously disclosed its die size is still considered sensitive information by Intel and has not been released. The outlines shown indicate the range of likely sizes of the Itanium die based on estimates from a number of industry sources.
Both the first and second generation IA64 designs, Itanium/Merced and McKinley, are six issue wide in-order execution processors. In-order execution processors cannot execute past stalled instructions so it is important to have low average memory latency to achieve high performance. This focus on the memory hierarchy can be clearly seen in the McKinley [1]. Although it is not surprising that the on-chip level 3 cache in McKinley is much faster than the external custom L3 SRAMs used in the Itanium CPU module, it is interesting to see how much faster in terms of processor cycles the McKinley level 1 and 2 caches are despite the McKinley's 25 to 50 percent faster clock rate in the same 0.18 um aluminum bulk CMOS process.
The improvement in average memory latency between Itanium and McKinley can be approximated using the comparative access latencies presented by Intel at their last developers conference, combined with representative hit rates based on the size of each cache in the two designs and an assumed average memory access time of 160 ns. This data is shown in Table 1.
Level  Parameter                 Itanium  McKinley
CPU    Frequency (MHz)           800      1000
L1     Size (KB)                 16       16
L1     Latency (cycles)          2        1
L1     Miss rate                 5.0%     5.0%
L2     Size (KB)                 96       256
L2     Latency (cycles)          12       5
L2     Global miss rate          1.8%     1.1%
L3     Size (MB)                 4        3
L3     Latency (cycles)          21       12
L3     Global miss rate          0.5%     0.6%
Mem    Latency (ns)              160      160
Mem    Latency (cycles)          128      160
Total  Average latency (cycles)  3.62     2.34
Total  Average latency (ns)      4.52     2.34
The back-of-the-envelope calculations in Table 1 suggest that a load instruction will be executed by McKinley with about half the average latency, in absolute time, that it would have on Itanium. No doubt this is a major contributor to the much higher performance of the second generation IA64 processor. Although the large die area of McKinley suggests a substantial cost premium compared to typical desktop MPUs, for large scale server applications the extra silicon cost is insignificant compared to the overall system cost budget. In fact, from the system design perspective, the ability to reasonably forgo board level cache probably more than pays for the extra silicon cost of McKinley through reduction of board/module area, power, and cooling requirements per CPU. Large scale systems based on the EV7 will also eschew board level cache(s), although with the Alpha it is the greater latency tolerance of the out-of-order execution CPU core plus the integration of high performance memory controllers that permit this, rather than gargantuan amounts of on-chip cache.
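The totals in Table 1 can be reproduced with a few lines of Python. The formula is not spelled out in the text, so the one used here is an assumption: each level's latency is weighted by the fraction of accesses that reach it (the L1 miss rate for L2, and the global miss rates for L3 and memory). It does match the table's totals to two decimal places. The dictionary keys and function names are illustrative.

```python
# Back-of-the-envelope average load latency from the Table 1 inputs.
# Assumed formula (not stated in the article):
#   avg = L1_latency + L1_miss * L2_latency
#         + L2_global_miss * L3_latency + L3_global_miss * memory_cycles

def memory_cycles(mem_ns, freq_mhz):
    """Convert a memory latency in ns to processor cycles."""
    return mem_ns * freq_mhz / 1000.0

def average_latency_cycles(cpu):
    """Average load latency in cycles under the assumed formula."""
    return (cpu["l1_lat"]
            + cpu["l1_miss"] * cpu["l2_lat"]
            + cpu["l2_gmiss"] * cpu["l3_lat"]
            + cpu["l3_gmiss"] * memory_cycles(cpu["mem_ns"], cpu["mhz"]))

def cycles_to_ns(cycles, freq_mhz):
    """Convert processor cycles to nanoseconds."""
    return cycles * 1000.0 / freq_mhz

ITANIUM = {"mhz": 800, "l1_lat": 2, "l1_miss": 0.05, "l2_lat": 12,
           "l2_gmiss": 0.018, "l3_lat": 21, "l3_gmiss": 0.005, "mem_ns": 160}
MCKINLEY = {"mhz": 1000, "l1_lat": 1, "l1_miss": 0.05, "l2_lat": 5,
            "l2_gmiss": 0.011, "l3_lat": 12, "l3_gmiss": 0.006, "mem_ns": 160}

if __name__ == "__main__":
    for name, cpu in (("Itanium", ITANIUM), ("McKinley", MCKINLEY)):
        cyc = average_latency_cycles(cpu)
        print(f"{name}: {cyc:.2f} cycles, {cycles_to_ns(cyc, cpu['mhz']):.2f} ns")
```

Changing the assumed memory latency or miss rates in the two dictionaries shows how sensitive the comparison is to those inputs, in particular to the L3 global miss rate, which multiplies the largest latency term.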
Besides the greatly enhanced cache hierarchy, the McKinley will boast two more "M-units" than Itanium. These are functional units that perform memory operations as well as most types of integer operations. In a recent article I speculated about the nature of McKinley design improvements. I suggested that it would contain 2 more I-units and 2 more M-units than Itanium in order to simplify instruction dispatch and reduce the frequency of split issue due to resource oversubscription. In IA64 parlance, both I-units and M-units can execute simple ALU based integer instructions like add, subtract, compare, bitwise logical, simple shift and add, and some integer SIMD operations. I-units also execute integer instructions that occur relatively infrequently in most programs but require substantial and area intensive functional units. These include general shift, bit field insertion and extraction, and population count.
Because the integer instructions that cannot be executed by an M-unit are relatively rare, the McKinley designers saved significant silicon area with little performance loss by only adding two M-units (for a total of four) and staying with the two I-units of Itanium. Data on the relative frequency of different integer operations suggest that the vast majority of integer operations, about 90%, that occur in typical programs are of the type that can be executed by either an M-unit or I-unit [2]. If we consider a random selection of six integer operations, each with a 90% chance of being executable by an M-unit, then the odds are better than 98% that any six instructions are compatible with the MMI + MMI bundle pair combination and can be dual issued by McKinley. Thus there is practically no incentive to add two extra I-units to McKinley to permit the dual issue of the MII + MII bundle pair combination.
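The "better than 98%" figure can be checked with a short binomial calculation. The sketch assumes the six operations are independent, that each has a 10% chance of needing an I-unit, and that the only constraint is the two I slots in an MMI + MMI pair; dependencies between instructions and other issue restrictions are ignored. The function name and parameters are illustrative.

```python
from math import comb

def prob_dual_issue(p_m_capable: float = 0.9, n_ops: int = 6, i_slots: int = 2) -> float:
    """Probability that at most `i_slots` of `n_ops` independent integer
    operations need an I-unit, so the rest can go to M-units."""
    p_i_only = 1.0 - p_m_capable
    return sum(
        comb(n_ops, k) * p_i_only**k * p_m_capable**(n_ops - k)
        for k in range(i_slots + 1)
    )

if __name__ == "__main__":
    print(f"P(six ops fit MMI + MMI) = {prob_dual_issue():.4f}")
```

With the default inputs this comes to roughly 0.984, consistent with the estimate above. Lowering p_m_capable shows how quickly the case for extra I-units would strengthen if M-unit-compatible operations were less common.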
One curiosity in the McKinley disclosure was the fact that the basic execution pipeline was revealed to be 8 stages long. Although this is still 2 stages shorter than the pipeline in the slower clocked Itanium, it is one more stage than the 7 stages previously attributed to McKinley [3]. Whether this represents a slightly different way of counting the pipe stages or an actual design change isn't clear. Ironically, it has long been rumored that the Itanium pipeline was stretched by at least one stage quite late in development. It will be interesting to see if the new IA64 core under development by the former Alpha EV8 design team (now at Intel) also suffers this strange pipeline growth affliction.
Hammering x86 into the 64 bit World
In October AMD revealed some aspects of K8, its next generation x86 core code-named Hammer [4]. This new design is primarily distinguished by being the first processor to implement x86-64, AMD's extension to the x86 instruction set that supports 64 bit flat addressing, 64 bit GPRs, as well as other enhancements. As can be seen in Figure 2, the Hammer core heavily leverages AMD's highly successful K7 core.
Figure 2 Comparison of K7 Athlon and K8 Hammer Organization
The back end execution engine of the K8 Hammer core is basically identical to that of the K7 except that the integer schedulers are expanded from 5 to 8 ROPs. The increase in the integer out-of-order instruction scheduling capability this implies may have been intended to better hide the data cache's two cycle load-use latency, and thus slightly increase per clock performance. An alternative hypothesis is that the latency of some integer operations may have been increased to allow higher clock rates and the change was made to prevent a slight loss in per clock performance. The basic execution pipelines of the K7 and K8 are compared in Figure 3.
Figure 3 Comparison of K7 and K8 Basic Execution Pipeline
The K8 execution pipeline has two more stages than K7, and these new stages seem to be related to x86 instruction decode and macro op distribution to the integer and floating point schedulers. Although some of the stages have been renamed, it appears that the final five pipe stages, representing the back end execution engine, are comparable. This is unsurprising as the most complex and difficult task in an x86 processor like the K7 or K8 is the parallel parsing of up to three variable length x86 instructions from the instruction fetch byte stream and their decoding into groups of systematized internal operations. In comparison, the execution engine is hardly more complex than that of a typical out-of-order execution RISC processor.
Both the block diagram and execution pipeline indicate that AMD has spent nearly all its effort in Hammer development on revamping the front end of the K7 design. Some of the extra degree of pipelining may be related to the extra degree of complexity in decoding yet another level of extensions (x86-64) on top of the already Byzantine x86 ISA. Some of the increase may be related to increased flexibility in internal operation dispatch to reduce the occurrence of stall conditions and increase IPC. And some of the increase may simply reflect a reduction in the work per stage to increase the clock scalability relative to the K7 core. Without a detailed description of each of the pipeline stages in the K8 it is difficult to correlate front end pipe stages in the K7 to the K8, and next to impossible to assess how the benefit of the extra two pipe stages is allocated between accounting for increased ISA complexity, measures to increase IPC, and reduction in timing pressure per pipe stage to allow higher clock rates.
Although the 64-bit instruction set extension makes for attention grabbing headlines in the technical trade press, the major performance enhancements in the Hammer series are much more prosaic from a processor architecture point of view. These enhancements are the direct integration of interprocessor communications interfaces and a high performance memory controller. Like a "poor man's EV7", the Hammer includes three bi-directional HyperTransport (HT) links and a memory controller supporting a 64 or 128-bit wide DDR memory system using unbuffered or registered DIMMs. With the latter, a K8 processor can directly connect to 8 DIMMs, although this number may be reduced at the higher memory speeds supported. It is interesting to compare the results of the same design philosophy applied to the high-end server and mainstream PC segments of the MPU market as shown in Table 2. Power and clock rates for the Hammer MPU are estimates.
Table 2 Comparison of Alpha EV7 and K8 Hammer
                  | Alpha EV7 [5]                               | K8 Hammer
Process           | 0.18 um bulk CMOS                           | 0.13 um SOI CMOS
Die Size          | 397 mm2                                     | 104 mm2
Power             | 125 W @ 1.2 GHz                             | ~70 W @ 2 GHz
Comm Links        | 4 links, each 6.4 GB/s, one 6.4 GB/s IO bus | 3 links, each ~6 GB/s
Memory Controller | 2 x 64 bit DRDRAM, 12.8 GB/s peak           | 64 or 128 bit DDR, 2.7 or 5.4 GB/s peak
Package           | 1443 LGA                                    | ?
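The peak memory bandwidth figures in the table follow from bus width times transfer rate. The sketch below is only a sanity check: the transfer rates (800 MT/s for the DRDRAM channels, 333 MT/s for DDR) are assumptions chosen because they reproduce the quoted numbers, not values stated in the article, and the function name is illustrative.

```python
def peak_bandwidth_gb_s(bus_bits, transfers_per_sec):
    """Peak bandwidth in GB/s: bytes moved per transfer times transfers per second."""
    return bus_bits / 8 * transfers_per_sec / 1e9

# Assumed rates (not given in the article)
RDRAM_RATE = 800e6   # transfers per second, assumed
DDR_RATE = 333e6     # transfers per second, assumed

ev7 = peak_bandwidth_gb_s(2 * 64, RDRAM_RATE)    # two 64 bit channels
k8_narrow = peak_bandwidth_gb_s(64, DDR_RATE)    # 64 bit DDR
k8_wide = peak_bandwidth_gb_s(128, DDR_RATE)     # 128 bit DDR

print(f"EV7 {ev7:.1f} GB/s, K8 {k8_narrow:.1f} or {k8_wide:.1f} GB/s")
```

With these assumed rates the 128 bit DDR case works out to roughly 5.3 GB/s; the table's 5.4 GB/s appears to be simply twice the rounded 2.7 GB/s figure.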
Although the Intel McKinley and AMD Hammer are both 64 bit MPUs, these devices are directed at different markets. While the large and expensive McKinley will target medium and high-end server applications, the first member of the Hammer family, code named "Clawhammer", will target the high end desktop PC market. That is not to say that McKinley will outperform the Clawhammer device. Indeed, I expect the AMD device will easily beat the much slower clocked IA64 server chip in SPECint2K and many other integer benchmarks, as well as challenge much faster clocked Pentium 4 devices in both integer and floating point performance.
Exactly how much performance the Hammer core may provide is the subject of some controversy. AMD's Fred Weber was quoted as stating the Hammer core could offer SPECint2k performance as much as twice that of current processors. Although this comment is vague enough to drive a truck through (twice as fast as the best AMD processor? Best x86 processor? Best processor announced but not yet shipping? IA-32 or x86-64 code? Clawhammer or the big cache Sledgehammer?), a few web based news sites interpreted this comment as meaning the Hammer would achieve 1400 SPECint2k, and now some people are incorrectly attributing this figure to Weber himself. Keep in mind that no Hammer device has even taped out as of the end of 3Q01, let alone been fabricated, debugged, verified, and benchmarked at the target clock frequency. Whatever figure Mr. Weber had in mind was derived from architectural simulation, and for a benchmark suite as cycle intensive as SPEC CPU, simulation results are approximate at best [6][7]. As has been shown time and time again, it is best not to count performance chickens too closely before the silicon eggs hatch.
Alpha Goes Out With a Bang not a Whimper
Although Compaq announced the wind down of Alpha development in June and transferred nearly the entire EV8 development team to Intel over the summer, there is still one more surprise in store for the computer industry. The EV7, the final major design revision in store for Alpha, has been the subject of intense testing, verification, and system integration exercises since late spring. This design has been in the pipeline for a long time. It was first announced more than three years ago and finally taped out in early 2001. Because of the complexity of this device (basically a complex CPU and large scale server chipset all on one die) and the incredible degree of shakedown server class MPUs and systems undergo, the EV7 will not go into volume production until the second half of 2002. To bridge the gap between current products and EV7 based systems Compaq will shortly release a 1.25 GHz version of the workhorse EV68.
Although general details of the EV7 design have been in the public domain for more than three years, and specific facts about the performance of this MPU's router and memory controllers were disclosed in February, I think the performance it will achieve when officially rolled out in 2H02 will surprise and dismay many in the industry (possibly including senior Compaq management). At the Microprocessor Forum in October Compaq's Peter Bannon unveiled some preliminary performance numbers for the EV7, namely 804 SPECint2k, 1253 SPECfp2k, and roughly 5 GB/s STREAM performance.
Although these numbers are quite good in absolute terms, comparable to the fastest speed grade POWER4 running in a contrived and unrealistic hardware configuration, the numbers failed to live up to my estimates given in a previous article. However, former members of the Alpha design team have privately confirmed my suspicions that Mr. Bannon was clearly sandbagging the EV7 numbers, keeping a not insignificant amount of performance off the table. For a product still more than six months from release that is a not unexpected tactic. I still hold the opinion that when it is all said and done the EV7 has a good chance of being the highest performance general purpose microprocessor ever fabricated in 0.18 um technology, a fitting ending to a remarkable and tragic technological saga (EV79, an EV7 shrink to 0.13 um SOI is on the roadmap for 1H04 but the continued turmoil at Compaq suggests a healthy amount of scepticism is in order).
Sun's Surprising Spike SPARCs SPECulation
Sun recently introduced a new member of its UltraSPARC-III family. This new 900 MHz device differs from earlier US-III parts by the use of copper interconnect instead of aluminum. Although Sun submitted official SPEC scores for a 900 MHz Sun Blade 1000 Model 1900 using an aluminum US-III in late 2000, yield was apparently poor and this speed grade wasn't generally available. A rarely occurring bug related to a prefetch buffer inside the US-III was discovered and as a work around this feature was disabled in firmware. Unfortunately for Sun Microsystems, this caused the SPECfp_base2k score for the Model 1900 to drop from an already lackluster 427 to a lamentable 369 in a second SPEC submission in the spring of 2001. So it comes as no small surprise that the Sun Blade 1000 Model 900 Cu workstation, based on the new copper processor turned in a SPECfp_base2k score of 629 in a recent submission. Both the Model 1900 and Model 900 Cu versions of the Blade 1000 feature 8 MB of L2 cache.
It is possible that the copper US-III incorporates improvements beyond a fix to the prefetch buffer bug, and that system level hardware was also improved between the Model 1900 and Model 900 Cu. However, it appears much of the improvement can be attributed to the use of the Sun Forte 7 EA compiler instead of the earlier Forte 6 update 1 compiler used to generate the 427 and 369 scores. The reason why I say that with confidence can be seen quite readily in the graph in Figure 4.
Figure 4 SPECfp_base2k Component Scores for US-III and Competitors
The SPECfp_base2k scores for the 14 sub-component programs for the pre-bug fix Sun Blade 1000 Model 1900 submission using the Forte 6 compiler are compared to the recent Sun Blade 1000 Model 900 Cu submission using the Forte 7 compiler. In addition, scores for the Itanium (4MB, 800 MHz version in an HP i2000), Alpha EV68C (1000 MHz version in an ES45/1000), and POWER4 (1300 MHz version in a pSeries 690 Turbo) are provided for reference. It is the new compiler's score on the 179.art program that quite literally stands out from the rest. Although several other programs see appreciable improvement (the 183.equake score nearly triples), the new compiler increases the score of 179.art by more than 800%. In absolute terms this score, 8176, is more than four times higher than that achieved by the Alpha EV68 and POWER4, MPUs that easily beat the copper US-III on nearly every other SPECfp2k program. The 179.art score achieved by the Forte 7 compiler is vital to the new machine's pumped up SPECfp_base2k score. If you leave 179.art out of the geometric mean then its SPECfp_base2k score would drop by nearly 18% from 629 to 516.
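The effect of dropping 179.art can be reproduced from the numbers quoted above alone. The sketch below relies on the fact that a geometric mean of n scores is the n-th root of their product, so one known score can be divided back out of the overall figure; the function name is illustrative.

```python
def geomean_without(overall, n, outlier):
    """Geometric mean of the remaining n - 1 scores once one known score is removed."""
    return (overall**n / outlier) ** (1.0 / (n - 1))

overall = 629      # SPECfp_base2k with all 14 programs
art_score = 8176   # 179.art component score

rest = geomean_without(overall, 14, art_score)
drop = 1 - rest / overall

print(f"without 179.art: {rest:.0f}, a drop of {drop:.1%}")
```

The result is about 516, a drop of roughly 18%, which agrees with the figures in the text.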
This remarkable improvement on 179.art is unusual in the field of compiler engineering where single digit percentage performance increases are often considered major victories. So it is no surprise that Sun's achievement immediately raised suspicions among industry observers and competitors about the nature of the optimization employed by the Forte 7 compiler. It is hard not to think of Intel's infamous eqntott compiler bug that erroneously increased the SPECint92 score of its processors by about 10% until caught and fixed [8]. This bug used an illegal optimization that allowed the output of 023.eqntott to pass result checking with the test data used but was invalid in the general case.
Although the exact nature of the new Sun optimization isn't known, suspicion has fallen on several inner loops within the 179.art program. Speculation is that this code was originally written in FORTRAN and converted to C. Because FORTRAN and C access two dimensional arrays in opposite row and column order it is presumed that 179.art accesses arrays by the wrong index in the innermost loop causing poor cache locality. It is possible that the new Sun compiler recognizes this situation and turns the nested loops that step through the array accesses "inside out" and achieves much lower cache miss rates. Whatever the exact nature of the Sun optimization turns out to be there is the question of whether it violates one of the SPEC rules, namely "Optimizations must improve performance for a class of programs where the class of programs must be larger than a single SPEC benchmark or benchmark suite".
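The speculated transformation is usually called loop interchange. The sketch below is not Sun's optimization, whose details are unknown, and it does not use the 179.art source; it only shows, for a hypothetical small array stored in C's row-major layout, how swapping the two loops changes the order in which memory offsets are touched.

```python
def offsets(rows, cols, column_first):
    """Flat row-major offsets touched by a two-level loop nest, in visit order."""
    if column_first:
        # FORTRAN-style nest: the inner loop walks down a column (stride = cols)
        return [i * cols + j for j in range(cols) for i in range(rows)]
    # Interchanged nest: the inner loop walks along a row (stride = 1)
    return [i * cols + j for i in range(rows) for j in range(cols)]

def strides(seq):
    """Distance between consecutive accesses."""
    return [b - a for a, b in zip(seq, seq[1:])]

before = offsets(3, 4, column_first=True)
after = offsets(3, 4, column_first=False)

print("before:", before, strides(before))
print("after: ", after, strides(after))
```

Both nests visit the same elements, but the interchanged version moves through memory one element at a time, which is the access pattern caches reward.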
Without knowing the nature of the new Sun optimization it is impossible to say whether Sun should be praised or scolded. But here are the words of Sun engineer John Henning who made the following comments in a November 27 post to the comp.arch usenet news group:
"Our compiler team believes that what Sun has done with art is (1) the result of perfrectly [sic] legitimate optimizations (2) compliant with SPEC's rules and (3) not appropriate for further discussion - if you want to figure out to make art faster, go work on it yourself, don't ask Sun how we did it!"
With the widespread attention this incident has engendered within the industry we can presume that compiler and benchmarking experts working for Sun's competitors have closely scrutinized the code Forte 7 generates for 179.art. The fact that Sun's new scores haven't been withdrawn from the SPEC official web site yet suggests that Mr. Henning is correct. No doubt we can expect competitors' processors to score much higher on 179.art in the months and years to come as the Sun optimization migrates to other compilers. Depreciation of a benchmark's value is seldom as spectacular as in the case of 179.art, but it still naturally occurs over time and provides incentive to accelerate the development of a successor to the SPEC CPU 2000 benchmark suite (which no doubt will not include 179.art). A message soliciting programs for this new suite, tentatively named SPEC 2004, was posted on comp.arch on December 28. Ironically the author of this message, the secretary of the SPEC CPU subcommittee, is none other than the previously mentioned John Henning.
Conclusion
It is comforting to see the pace of innovation in the microprocessor field shows no sign of slackening. The great seesaw battle between Intel and AMD for share of silicon's richest prize, the x86 microprocessor market, is about to enter a new phase with the imminent release of the 0.13 um Northwood Pentium 4. Although AMD will also migrate its K7 core to 0.13 um later in 2002 with both bulk and SOI versions, it is unlikely to be in the position to regain the performance advantage over Intel it previously achieved with the T-bird and XP Athlon until its new 64-bit Hammer core ships. Unlike AMD, Intel plans to reserve its 64-bit offerings for the high-end market. With McKinley Intel hopes to address the significant performance difficulties seen in the Itanium in part by taking advantage of its capacious manufacturing facilities to incorporate a huge amount of on-chip cache on its sizable die.
It seems like the time it takes for new ideas and features to migrate down from high-end server MPUs to mass-market devices is shrinking. The integration of high performance interprocessor communication links and memory controller(s) onto a processor die has been on the drawing board for many years and will soon be realized in the high end server market in the form of the EV7. Remarkably, the same concepts will appear in a mass-market x86 processor, the first of AMD's Hammer series, not too much later. Although these features will naturally be more limited in scope in the x86 device to keep costs under control, they should still provide a large boost in performance from significantly reduced memory access latency as well as a dramatic reduction in the cost of producing multiprocessor systems based on this device.
Few topics in the computer and microprocessor field can raise a controversy, as well as blood pressure, as quickly as benchmarks and benchmarking. Sun managed to throw a hand grenade into the simmering debate between the supporters and detractors of the industry standard SPEC CPU benchmark by speeding up the execution of one of the fourteen programs in the floating point suite by nearly an order of magnitude through the use of a previously unexploited compiler optimization. This in turn raised the SPECfp2k score of its latest US-III processor by roughly 20%. We can now look forward to the spectacle of competing firms scrambling to reverse engineer Sun's new compiler trick and incorporate the same voodoo into their own wares.
References
[1] Krewell, K., "Intel's McKinley Comes Into View", Microprocessor Report, October 2001, Volume 15, Archive 10.
[2] Hennessy, J. and Patterson, D., "Computer Architecture A Quantitative Approach", Morgan Kaufmann Publishers Inc., 1990, ISBN 1-55860-069-8, p. 181.
[3] "Advance Program, 2001 IEEE International Solid-State Circuits Conference", p. 35.
[4] Weber, F., "AMD's Next Generation Microprocessor Architecture", October 2001, Downloaded from AMD web site.
[5] Jain, A. et al, "A 1.2 Ghz Alpha Microprocessor with 44.8 GB/s Chip Pin Bandwidth", Digest of Technical Papers, ISSCC 2001, Feb 6, 2001, p. 240.
[6] Dulong, C. et al, "The Making of a Compiler for the Intel Itanium Processor", Intel Technology Journal, Q3 2001, Downloaded from Intel web site.
[7] Desikan, R. et al, "Measuring Experimental Error in Microprocessor Simulation", Digest of Technical Papers, 28th Annual International Symposium on Computer Architecture, June 2001.
[8] "Intel OverSPECs Parts", Microprocessor Report, January 22, 1996, Volume 10, Number 1, p. 5.
Copyright © 1996-2001, Real World Technologies - All Rights Reserved
By the way, Pricewatch is quoting about $3K for the low-end Itaniums running at about 700 MHz. No thanks.
Impressive though 64-bit processors might be, I'm not convinced that the performance improvement is going to be as big as people are expecting.
Remember that the components in any digital system - and I'm not just talking about your windoze desktop PC, but servers, mainframes and embedded systems too - have to talk to each other in order to do anything remotely useful. Last time I looked, most PCI devices didn't utilise the provision for 64-bit data bus operation.
There's a perfectly good reason for this, of course... in order to attach a chip to a circuit board, you need an array of pins (or solder balls) that are macroscopic, so they can be soldered and handled without too much risk of accidental damage. Additionally, PCB tracks can only go so small (and so close together) without undesirable electrical effects and again, an inability to work with it in a production environment.
The "more bits" phenomenon has been sustained by improvements in VLSI and the advent of true System-on-a-chip design, but this too has its limits. If you compare a P4 motherboard with, say, a 386 mobo circa 1995, you'll see the chip count is drastically reduced. But fewer interconnected components means less repairability, upgradability, and interoperability. My old 486 had a VLB EIDE hard disk controller, which I swapped in after the last one failed. If my controller failed today, I couldn't do that; I'd either need to buy a new mobo or start replacing chips on the old one (which is just as expensive).
Don't get me wrong - I'm all for progress! And I expect we'll see more and more 64/128-bit chips springing up inside custom devices (e.g. 3D cards, routers) where the local interconnect can be made as fat as necessary. But the PC will remain shackled by slow frontside busses for a while yet, I reckon.
These sigs are more interesting tha
One word: addressing. With those 32 bits, you can typically address up to 2 gig files on your machine - which is a limit easily encountered when you start working with video, for instance.
It took hacks to get 4 gig of RAM working on x86 with the linux kernel.
Go 64 bit, and that limit vanishes. You keep your linear addressing, none of those ugly segments like in the infamous real-mode of PC-XT times.
I don't see what's really new about it all though, we've had 64 bit since Alpha, and there's several 64 bit architectures around. It may not be mainstream yet, but will IA 64 or Hammer really change that (soon)? Allow me to have doubts.
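The limits mentioned above are plain powers of two. The sketch below assumes the 2 gig file limit comes from a signed 32-bit file offset, which loses one bit to the sign, while the 4 gig RAM limit comes from an unsigned 32-bit address; the function name is illustrative.

```python
GIB = 2**30

def addressable_gib(bits, signed=False):
    """Largest byte range reachable with an offset of the given width, in GiB."""
    usable = bits - 1 if signed else bits
    return 2**usable / GIB

print(addressable_gib(32, signed=True))   # signed 32-bit file offset
print(addressable_gib(32))                # unsigned 32-bit address
print(addressable_gib(64))                # flat 64-bit address
```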
Much as I hate to say it, the Intel McKinley looks like a very well designed piece of kit, and it appears Intel have learned from their mistakes with the P4 by including a big, fast 3-level cache on the McKinley. It's also good to see them reducing their pipeline size, which means it may finally be able to compete with the G4 in terms of efficiency. However, this is of course going to kick them in the teeth in terms of competing on processor speed, which they have been pushing so hard recently in their marketing.
The same can't be said of AMD's offering, although in fairness the Hammer is not directed at the server market unlike the McKinley. The pipeline is longer than both their previous design and the McKinley, which is going to give them a performance hit. We can only hope that their cache is as good as Intel's.
What amazes me is that they can still keep adding instruction extensions without too much of a performance hit. Anyone looked at the latest instruction set documentation for these processors? Eugh! The pain of backwards compatibility...
First of all I'd like to say, I am not biased in either way.. after all I'm going to get me a new AthlonXP next week.
Why do we need 64bit processors? Addressing? Nah, current processors can address enough space.. with 386 processors FAR addressing was introduced, which expanded allocatable address space drastically. (those silly DS, SS, .. registers) And newest processors can deal with them with same ease as with non-far addressing.
IA64 is very different from x86-64. AMD's 64bit solution is nothing more than extension to current 32bit instruction set. Of course there are some tweaks, but nothing very radical. You will still be able to run old 16 and 8bit code efficiently.
Intel's IA64 is a huge step in the future... architecture wise is far superior to x86-64. Why?
AMD's 64bit solution currently has no real value.. except for huge data storage (could work faster with 64bit data blocks) and probably some heavy encryption. x86-64 compiled Quake3 would make minimum use of 64bit registers.. and would probably be just a margin faster than IA32 compiled version.
Is IA64 better? Yes it is. IA64 has 128 usable 64bit registers, predicates... But that is not all.. in a single 64bit register you can store 4 16bit values (common integer), or 8 8bit or 2 32bit, and manipulate them almost as much as you like. And if you have 4 integers in another register.. you can make 4 arithmetical operations with a SINGLE instruction. You can do similar things with floating point operations... and with ILP you could do 3 instructions per cycle. This means that Quake2's VectorAdd/Subtract could be done in a SINGLE cycle.
Clawhammer will be better for a year or so.. but soon it will hit the ceiling. Intel will be able to get better performance from 1/2 clocked IA64.
And please don't respond with lame comments if you haven't read at least whitepapers from Intel and AMD.
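The packed arithmetic described in this comment, four 16bit values in one 64bit register added by a single instruction, can be imitated in software. The sketch below is not IA64 code; it is a plain Python illustration of the lane-wise add such an instruction performs, and all names in it are made up for the example.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1
HIGH_BITS = 0x8000_8000_8000_8000   # top bit of each 16-bit lane
WORD_MASK = 0xFFFF_FFFF_FFFF_FFFF

def pack(values):
    """Pack four 16-bit values into one 64-bit word, lane 0 in the low bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]

def packed_add(a, b):
    """Add all four lanes at once; carries never cross a lane boundary."""
    low = (a & ~HIGH_BITS) + (b & ~HIGH_BITS)   # add the low 15 bits of each lane
    return (low ^ ((a ^ b) & HIGH_BITS)) & WORD_MASK   # fold the top bits back in

result = unpack(packed_add(pack([1, 2, 3, 4]), pack([10, 20, 30, 40])))
print(result)
```

Each lane wraps around independently on overflow, which is the modular behaviour one would expect from a hardware parallel add without saturation.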
Once we get the 64-bit hardware, we still have the MMOS (minor matter of software) to worry about.
Cases in point:
Silicon Graphics machines with MIPS R4400 (and up) CPUs were 64-bit, but the additional address and pointer space weren't utilized until IRIX 6.0 in 1994 -- over 18 months later. (And, of course, certain SGIs still run in 32-bit mode due to RAM concerns -- 64-bit requires more RAM -- all Indys, all Indigos, all O2s, and R4400 Indigo2s).
Sun machines with UltraSPARC CPUs were 64-bit, but again, the additional address and pointer space had to wait for software support. (Multi-stage transition to 64-bit, starting with Solaris 2.5 and finally complete with Solaris 7 in 1998).
Then there's application optimization. Many apps can get slight speedups by processing data in larger chunks (say, 32-bit or even 64-bit). Sometimes the difference is huge, many times it's small. But lots of little speedups can add up across an entire system. Still, someone has to make these changes to apps and compilers. It takes time, testing, and adoption. In better times, SGI did several such overhauls... they got some insane speed out of Netscape Enterprise and Netscape FastTrack web servers during the Everest project. One of their engineers also did some cool (but nonstandard) hacks to Apache, including the very first pure, clean 64-bit port/mod.
Newer, faster, wider, more-torque hardware is always great. But don't forget the software.
Huh? you could do 32 bit addressing long before you could buy 4GB drives, but nobody thought memory mapping your hard drive into your address space was a good idea then... What would be the point?
--
Benjamin Coates
Since that was marked troll, I'll blow more karma...
With most operations, 64 bits isn't 2x as fast, it's 1x as fast, unless you deal with the stack, in which case it could be even slower.
Addressing has little to do with word size. The 8088 shows that.
Suns running in 64 bit mode are often slower than running in 32 bit mode.
Nintendo 64 games are all 32 bit code with just a few 64 bit operations. The good emulator proved that.
As far as doing two 32 bit ops at once, I still don't need a 64 bit data path to do that, I just need several 32 bit data paths. What I don't need is to dump a bunch of unused 64 bit numbers on the stack every time an exception happens (which one of my computers has done about 1047563950 times in the last 51 days)
Actually, even that's not strictly true, since according to the Resource Kit documentation Windows XP's initial configuration detection is *still* 16 bit.
UNIX? They're not even circumcised! Savages!
Why do we need 64bit processors? Addressing? Nah, current processors can address enough space.. with 386 processors FAR addressing was introduced, which expanded allocatable address space drastically.
Far addressing can handle 4GB, but 4GB is not much by today's standards (it is not little either). You want a flat address space, and RAM is cheap these days. 32 bits get you to 4GB; if you want more you have to resort to tricks. (those silly DS.DD etc)
With the first 64 bit Alphas they used this as an argument: it is useful for fast memory scans when using big databases.
--640 Kb will be enough....