HyperTransport 3.0 Ratified
Hack Jandy writes "The HyperTransport consortium just released the 3.0 specification of HyperTransport. The new specification allows for external HyperTransport interconnects, basically meaning you might plug your next generation Opteron into the equivalent of a USB port at the back of your computer. Among other things, the new specification also includes hot swap, on-the-fly reconfigurable HT links and also a hefty increase in bandwidth."
HT 3.0 increases the bandwidth to 41.6 GB/s, which is 86% more than 2.0. It's also expected to be backwards compatible with current motherboards using 2.0: the new processor will be capable of 3.0 speeds, while the motherboard will be stuck with 2.0. The new Rev. F AMD CPUs are expected to have HT 3.0. It should help with multi-processor systems, where the high-bandwidth links connect each CPU.
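For anyone who wants to check the 86% figure, here is a rough back-of-the-envelope sketch. The clock rates (1.4 GHz for 2.0, 2.6 GHz for 3.0) and the 32-bit link width are my assumptions from the published maximums, not numbers from the post above, and the function name is made up for illustration.

```python
def ht_aggregate_gb_per_s(clock_ghz, link_width_bits=32):
    """Peak aggregate bandwidth of one HT link in GB/s (both directions)."""
    transfers_per_ns = clock_ghz * 2           # double data rate: two transfers per clock
    bytes_per_transfer = link_width_bits / 8   # 32-bit link moves 4 bytes per transfer
    one_way = transfers_per_ns * bytes_per_transfer
    return one_way * 2                         # links are bidirectional

ht2 = ht_aggregate_gb_per_s(1.4)   # assumed HT 2.0 maximum clock
ht3 = ht_aggregate_gb_per_s(2.6)   # assumed HT 3.0 maximum clock
increase_pct = (ht3 / ht2 - 1) * 100

print(f"HT 2.0: {ht2:.1f} GB/s, HT 3.0: {ht3:.1f} GB/s, increase: {increase_pct:.0f}%")
```

Note these are peak numbers for the widest link at the top clock; real parts may ship with narrower or slower links.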
In a design class I took, our professor talked about something called "processor-in-RAM". The idea is that you'd have a few processors, each with its own dedicated RAM. The program you are running would be copied into each processor's RAM. When a branch was ready to be taken, half the processors would go one way and the other half the other. The processors that guessed right would let the other processors know they were wrong and update them with the new information. This way there is no misprediction penalty, since one group of processors is always on the correct path.
I'm guessing that a whopping 64-128 meg cache ought to be enough for some time.
Yeah, it'd provide some huge performance gain, but the sheer cost of that much cache would easily be on the order of tens if not hundreds of thousands of dollars. Cache (SRAM) requires several transistors for each bit stored, typically six, while DRAM uses a single transistor to control one capacitor for each bit.
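To put a number on that, here is a quick sketch of the transistor count for the storage array alone. It assumes the classic six-transistor SRAM cell and a one-transistor DRAM cell, and it ignores tags, decoders and sense amps, so treat it as a lower bound rather than a real die estimate.

```python
def transistors_needed(megabytes, transistors_per_bit):
    """Transistors in the storage array only (no tags, decoders, or sense amps)."""
    bits = megabytes * 1024 * 1024 * 8
    return bits * transistors_per_bit

sram = transistors_needed(128, 6)   # assumed 6T SRAM cell
dram = transistors_needed(128, 1)   # 1 transistor + 1 capacitor per bit

print(f"128 MB as SRAM: {sram:,} transistors")
print(f"128 MB as DRAM: {dram:,} transistors")
```

That comes out to roughly 6.4 billion transistors for a 128 MB cache, which is why nobody is putting one on a die any time soon.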
Why are MacBook Pros so much faster than Powerbooks?
:/
The MacBook Pro sports a 667 MHz DDR FSB, while the Powerbook sports a 133 MHz FSB. It doesn't matter how fast your processor is if you don't have a fast enough way to feed it (much like a V-12 will not do well with a single-barrel carb meant for a lawnmower engine).
The Von Neumann bottleneck is the significant limiting factor in all machines once your working set of data exceeds the size of your L1/L2 cache. Suddenly your 1.5 GHz G4 is a 266 MHz machine.
Faster HyperTransport means happier users of AMD machines. My AMD64 beats the pants off my Sempron 2500 because its 800 MHz HT bus allows it to do context switches in less than 1/3rd the time of the Sempron!
--
Internet Explorer (n): Another bug -- that is, a feature that can't be turned off -- in Windows.
Fiber optic cable (particularly glass core) is so expensive because of the difficult and sensitive process required to manufacture it, even though the materials used are extremely inexpensive. The diameter of the glass core must be matched exactly to the wavelength of light that will travel over that fiber. In addition, the composition and purity of the glass must meet certain standards to prevent reflection, signal attenuation, or signal skew, all of which would result in inconsistent or degraded performance. As far as the lasers being cheap: yes, a laser can be cheap, but the same demanding requirements apply to both versions of laser used in data communications, which again increases the manufacturing cost.
You're mixing up a few pieces of technology here. Processors with their own dedicated memory have been invented many times by different people. Modern loosely coupled clusters fit this bill, but further back there were the transputer systems, in which each processor had memory on board. Systems like this are more difficult to program than single-image systems (even with a CSP derivative as the language), but they produce higher performance.
The other thing that you are describing is multiway branch prediction. A processor like the Pentium guesses which way a branch goes and dispatches instructions down that path to the pipeline. When it is wrong there is a hit, as the pipeline stalls and all of those cycles are lost. In multiway branching, both outcomes of the branch are dispatched to the pipeline. The cost is that half the instructions being executed will be thrown away. If you go 2 branches deep then it is 75%. The advantage is that latency is minimised, as the pipeline is always full.
The last thing is processor-in-RAM, or smart memory. In this system a miniature processor is embedded on the DRAM die. The small processor is capable of computing striding patterns in arrays. As the program executes on the main processor, the smaller processor predicts which memory locations are going to be accessed and presends the data to the host processor, reducing latency.
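The 50% and 75% figures follow from the number of paths doubling at each unresolved branch, with only one path surviving. A minimal sketch of that arithmetic, assuming every path gets an equal share of the execution resources (the function name is mine):

```python
def wasted_fraction(depth):
    """Fraction of speculative work discarded when executing both sides
    of `depth` nested, unresolved branches."""
    paths = 2 ** depth        # each branch doubles the number of live paths
    return 1 - 1 / paths      # only one path turns out to be the real one

for depth in (1, 2, 3, 4):
    print(f"{depth} branch(es) deep: {wasted_fraction(depth):.1%} thrown away")
```

So the waste climbs quickly toward 100%, which is one reason going many branches deep this way gets expensive fast.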
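To make the striding idea concrete, here is a toy sketch of stride detection. It is not how any particular smart-memory part works; the class and method names are hypothetical, and real hardware would track many streams at once rather than just one.

```python
class StridePredictor:
    """Toy single-stream stride detector: once two consecutive accesses
    are the same distance apart, predict the next address."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr):
        """Record an access; return the predicted next address, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Only predict when the same non-zero stride has repeated.
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction

predictor = StridePredictor()
# Walking an array of 8-byte elements, then jumping somewhere unrelated.
predictions = [predictor.observe(a) for a in (100, 108, 116, 124, 200)]
print(predictions)
```

The predictor needs two matching strides before it commits, so the first couple of accesses get no prediction, and the unrelated jump at the end resets it.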
Good luck on your class. Architecture is one of the more interesting courses in a CS degree.
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Err ... your AMD64 is good because it's got a low latency on-die memory controller. It doesn't even have to think about the slow FSB bottleneck.
The fact that the link to the chipset is also fast is just a bonus.
So fifteen years ago everyone else had 20GB/sec buses? Funny, Sun seems to think they were using MBus, which peaked at around 350-400MB/sec. And HP was dropping CPUs on a GSC bus running at ~ 250MB/sec. I'd look up what state of the art was for SGI and IBM, but it would be silly. AMD and Intel surpassed other chip vendors on a number of fronts years ago.
Perhaps it's because your Sempron 2500 is a socket 754 chip, so cannot use dual-channel memory. The AMD64 has a faster FSB, and it's dual-channel.
Many people (including yourself, it seems) misunderstand HT. It isn't the FSB; an Athlon 64 has no FSB. HT is only used to communicate non-memory I/O and to synchronize caches between processors when doing memory I/O. So it's rather unlikely that HT could make your context switches 3X faster. The best thing for that would be a bigger cache, which your AMD64 probably also has.
http://lkml.org/lkml/2005/8/20/95
My AMD64 is a Socket 754, and my Sempron is Socket 462. It's on a much, much slower bus connection to its RAM. The Sempron has 180ns latency to RAM, while my AMD64 has 60 ns (worst case).
The AMD64 average context switch latency is a few microseconds; 15ns average. Sempron is 10ns best, 70ns average. I can send you a PDF with a few hundred graphs I did with lmbench on several platforms for a research project recently, if you don't believe me.
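If anyone wants to try this kind of measurement themselves, here is a rough sketch of a pipe ping-pong between two processes, which is the general idea behind lmbench-style context switch tests. It is not lmbench's code, it only runs on Unix-like systems, and the function name is made up. In Python the interpreter and syscall overhead will dominate, and on a multi-core box the two processes may sit on different cores, so read the result as an upper bound on a handoff, not a clean context switch time.

```python
import os
import time

def pingpong_latency(rounds=10000):
    """Estimate the cost of one process-to-process handoff, in seconds."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:
        # Child: echo one byte back per round, then exit without cleanup.
        try:
            os.close(p2c_w)
            os.close(c2p_r)
            for _ in range(rounds):
                os.read(p2c_r, 1)
                os.write(c2p_w, b"x")
        finally:
            os._exit(0)
    # Parent: send a byte, block until the child answers, repeat.
    os.close(p2c_r)
    os.close(c2p_w)
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    os.close(p2c_w)
    os.close(c2p_r)
    # Each round trip is two handoffs: parent -> child and child -> parent.
    return elapsed / (2 * rounds)

if __name__ == "__main__":
    print(f"~{pingpong_latency() * 1e6:.2f} microseconds per handoff")
```

For numbers worth comparing across machines you would want a C version pinned to one CPU.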
So, if my kernel is doing a context switch HZ times a second, I'm getting way better interactive performance on my AMD64 machine -- which is a socket 754 single-channel memory device. The FSB dominates.
The bus connection between my CPU and the RAM is, indeed, HyperTransport. Northbridge, CPU, and RAM are all connected by it. Perhaps you missed all the AMD documentation on this, or the entry in Wikipedia:
"Front-Side Bus Replacement
The primary use for HyperTransport is to replace the front-side bus, which is currently different for every machine (or some set of them). For instance, a Pentium cannot be plugged into a PCI bus. In order to expand the system the front-side bus must connect through adaptors for the various standard buses, like AGP or PCI. These are typically included in a controller called the northbridge."
And, yes, I am taking into account caches as well. I do appreciate the healthy skepticism.
Guess what:
Cray uses HyperTransport now.
XML - A clever joke would be here if