Domain: hotchips.org
Stories and comments across the archive that link to hotchips.org.
Comments · 25
-
Proprietary all the way down.
So I was interested in what drives this thing, the Myriad 2 VPU, and found out that it's right up Intel's alley because it's proprietary from top to bottom. Everything needs software only they can provide, and naturally it comes with conditions. I found a presentation that clearly shows what their priorities are:
- 8+ years of heritage. Close to $60M invested into technology development
- Proven architecture. 100% internally developed. Strong IP position
Buy into the lock-in now! -_-
-
By 2040 maybe
We have a long way to go. The longest simulation by a purpose-built computer covers 1.0 ms and contains fewer than 250,000 atoms. For comparison, the average cell contains more than 1 trillion atoms.
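To put that gap in numbers, here is a back-of-the-envelope calculation in Python using only the two figures quoted in the comment. It compares system size alone and ignores simulated time; the function name is just for illustration.

```python
# Figures quoted in the comment above (not independently verified here).
simulated_atoms = 250_000             # upper bound: "fewer than 250,000 atoms"
atoms_per_cell = 1_000_000_000_000    # lower bound: "more than 1 trillion atoms"

def scale_gap(system_atoms: int, target_atoms: int) -> float:
    """How many times larger the target system is than the simulated one."""
    return target_atoms / system_atoms

gap = scale_gap(simulated_atoms, atoms_per_cell)
# Both bounds point the same way, so this ratio is a lower bound on the gap.
print(f"A cell is at least {gap:,.0f}x larger than the simulated system")
```

Because the simulated system is smaller than 250,000 atoms and the cell is larger than 1 trillion, the true ratio can only be bigger than the roughly 4-million-fold figure this prints.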
-
Re:Caches, eh?
A quite cool (actually hot) chip presentation (see page 15): http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-9-Big-Iron/HC24.29.918-SPARC64X-Maruyama-Fujitsu-rev2.5.2.pdf and the talk itself: http://www.youtube.com/watch?feature=player_embedded&v=ipirVUart88#t=1072
Itanium has/had retry as well. I don't know if the E7 Xeons have it already. As the number of cores increases faster than memory systems can serve them, it might become increasingly practical to execute the same code in different parts of the chip, even in non-mainframe applications.
-
Re:1.6 ghz?
Except that you're forgetting one key component of the 360 CPU: SMT.
Fine-grained SMT (the only SMT worth pursuing) lets a second thread populate unused execution units, so an in-order CPU core can potentially exceed 1.0 IPC when running highly threaded code (or stay near 1.0 in I/O-blocked cases).
The 360 cores are dual-decode, dual-issue (just like the Pentium, Intel Atom), as anything less would make zero sense to implement SMT for, and anything more would be overkill for an in-order design. It features triple 128-bit vector units, but will usually only be able to execute 2 vector instructions per-cycle. Here are the specs if you want to peruse them.
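The idea that a second thread can fill issue slots an in-order thread leaves idle can be illustrated with a toy cycle-count model in Python. Everything in it is an assumption for illustration, not the real 360 pipeline: the core issues up to two ops per cycle, each thread issues at most one op per cycle in order, and after a hypothetical "load" op the thread's next op waits four cycles.

```python
from dataclasses import dataclass

# Toy model (hypothetical, not the actual Xbox 360 CPU):
# - the core can issue up to `issue_width` ops per cycle across all threads;
# - each in-order thread issues at most one op per cycle;
# - after a "load", that thread's next op must wait `load_latency` cycles.

@dataclass
class Thread:
    ops: list
    pc: int = 0        # index of the next op to issue
    ready_at: int = 0  # earliest cycle at which the next op may issue

def simulate(threads, issue_width=2, load_latency=4):
    """Return instructions per cycle (IPC) for the given threads."""
    cycle = issued = 0
    while any(t.pc < len(t.ops) for t in threads):
        slots = issue_width
        for t in threads:
            if slots and t.pc < len(t.ops) and t.ready_at <= cycle:
                op = t.ops[t.pc]
                t.pc += 1
                issued += 1
                slots -= 1
                # A load stalls the thread; other ops allow issue next cycle.
                t.ready_at = cycle + (load_latency if op == "load" else 1)
        cycle += 1
    return issued / cycle if cycle else 0.0

program = ["alu", "alu", "alu", "load"] * 25   # 100 ops per thread

print(f"1 thread : IPC = {simulate([Thread(program)]):.2f}")
print(f"2 threads: IPC = {simulate([Thread(program), Thread(program)]):.2f}")
```

In this toy run one thread manages about 0.58 IPC, because every load leaves it idle for three cycles, while two threads together reach about 1.16 IPC, above 1.0 without any out-of-order hardware. Real gains depend on how often both threads stall at the same time and on shared caches and memory bandwidth, none of which this sketch models.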
The AMD Bobcat core is not a very powerful out-of-order unit. Like the 360 CPU, it features dual-decode and dual-issue (a trait shared by the Jaguar refresh). You can see how little boost Bobcat receives from out-of-order by putting it up against the Intel Atom.
The Atom gets trounced in single-threaded operations, and also in some tests where Brazos can keep itself fed. But in some tests I/O becomes the bottleneck, and in those cases Atom catches up with (or even exceeds) Brazos.
Thus, for certain operations SMT offers similar per-clock performance to out-of-order execution. This means that an optimized multi-threaded load on the 3.2 GHz 360 CPU may run 50-75% faster than on a 1.6 GHz Jaguar core.
So if you assume PERFECT scaling for those 3 cores on the 360 and 8 cores of Jaguar, you really see only about a 2x overall speedup (especially since Jaguar is getting an upgrade to dual 128-bit vector units).
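The aggregate estimate can be checked with a few lines of Python. The inputs are only the commenter's own estimates (3 vs. 8 cores, and a 360 core running 50-75% faster than a Jaguar core), and perfect scaling is assumed as in the comment; the names are illustrative.

```python
# Inputs taken from the comment above; all of them are the commenter's estimates.
XENON_CORES = 3     # Xbox 360 CPU cores
JAGUAR_CORES = 8    # Jaguar cores in the comparison
# Claimed per-core advantage of a 3.2 GHz 360 core over a 1.6 GHz Jaguar core
# on an optimized multi-threaded load: "50-75% faster".
PER_CORE_ADVANTAGE = (1.50, 1.75)

def aggregate_speedup(xenon_vs_jaguar_per_core: float) -> float:
    """Jaguar-system throughput relative to the 360 CPU, assuming perfect scaling."""
    jaguar_total = JAGUAR_CORES * 1.0                 # one Jaguar core = 1 unit
    xenon_total = XENON_CORES * xenon_vs_jaguar_per_core
    return jaguar_total / xenon_total

for adv in PER_CORE_ADVANTAGE:
    print(f"360 core {adv:.2f}x a Jaguar core -> 8 Jaguar cores are "
          f"{aggregate_speedup(adv):.2f}x the 360 CPU")
```

On these inputs the perfectly scaled gain works out to roughly 1.5x to 1.8x, consistent with the comment's "only about 2x" as a round upper figure; the Jaguar vector-unit upgrade the comment mentions is not modeled.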
-
Re:Just let x86 die, please.
There is IBM Mainframe System 360 code from the '60s still running on current zEnterprise systems today.
...and the latest implementations of it use the same "translate some multi-step instructions into internal micro-ops" technique that a lot of x86 processors, dating back to the Pentium Pro, do. (They also use Alpha's "trap some instructions to processor-dependent code running in a special mode with access to some special internal registers" technique, only they call it "millicode" rather than "PALcode" - and they have some more instructions to trap.)
-
Re:Silly.
That's silly. They're trying to build a supercomputer out of MIPS chips. That'll never work...
Do yourself a favour by watching this.
-
Re:Because when I think graphics, I think intel
Larrabee is expected to at least be competitive with nVidia/AMD's stuff, although it might not be until the second generation product before they're on equal footing.
Competitiveness is not a function of generation number. Still, what statistics have you seen that compare Larrabee with something people use right now (ATI/nVidia)? There is this presentation (PDF) they made at SIGGRAPH, which shows that performance increases as you add more Larrabee cores. Here's a graph which may mean something. The y-axis is "scaled performance." What might that mean?
Graphs show how many 1 GHz Larrabee cores are required to maintain 60 FPS at 1600x1200 resolution in several popular games. Roughly 25 cores are required for Gears of War with no antialiasing, 25 cores for F.E.A.R. with 4x antialiasing, and 10 cores for Half-Life 2: Episode 2 with 4x antialiasing.
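To see what those core counts would imply for a larger part, here is a Python sketch that extrapolates frame rates for a hypothetical 32-core chip at 1 GHz. It assumes performance scales linearly with cores times clock, which is an assumption and exactly what a "scaled performance" axis leaves open; the function and table names are illustrative.

```python
# Cores at 1 GHz needed for 60 FPS at 1600x1200, as quoted above.
CORES_FOR_60FPS_AT_1GHZ = {
    "Gears of War (no AA)": 25,
    "F.E.A.R. (4x AA)": 25,
    "Half-Life 2: Episode 2 (4x AA)": 10,
}

def estimated_fps(cores: int, clock_ghz: float, cores_needed: int) -> float:
    """Frame rate assuming performance scales linearly with cores x clock.

    Linear scaling is an assumption, not something the quoted figures prove.
    """
    return 60.0 * (cores * clock_ghz) / cores_needed

# Hypothetical configuration: 32 cores at 1 GHz.
for game, needed in CORES_FOR_60FPS_AT_1GHZ.items():
    print(f"{game}: ~{estimated_fps(32, 1.0, needed):.0f} FPS")
```

Under that assumption a 32-core part would land around 77 FPS in the first two games and around 192 FPS in Half-Life 2: Episode 2; memory bandwidth, achievable clock speed, or power limits could easily break the linear model.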
Sounds neat. I guess that's why they're going to promote the 32-core Larrabee. How much will something to run these cost and how much power will it consume? They're still developing this thing, so why do I keep hearing that it will BLOW MY MIND? I have no doubt that Intel has an army of capable engineers that could build something to render graphics great, but if it costs more than the consumer can possibly pay, there's no real point. Intel is gunning for 2 TFLOPS. I'm pretty sure the Radeon HD 4870 passes that mark already (and you can purchase it for less than $500). Sure, it's a cool technology, but I'd like to see some more facts and figures.
What have I heard? Power usage/heat: 300W TDP. That's pretty horrific. Cost: 12-layer PCB. That's twice the layer count of a typical graphics card and four more than high-end Radeon and GeForce cards. That doesn't directly translate into cost, but generally more complicated equals more expensive.
But back to the PS4 -- Sony's real mistake with the PS3 was expecting the Cell processor to be the most incredible computing device ever. Original plans for the PS3 included 2 Cell processors, but they changed to the RSX when they realized the Cell wasn't capable of rendering graphics like they wanted to (whereas the Xbox 360's architecture was designed with the GPU and CPU co-existing from the start). You can't build a bunch of fast parts and stick them together; you have to build a fast system. Perhaps Sony has learned their lesson.
-
It still doesn't exist on the xbox 360
It is not the architecture referenced in the article, where GPU commands are integrated with the CPU commands. Please examine the specs from Microsoft:
There is a dedicated GPU, with a memory controller on-die. The CPU cores are on a separate die, and contain a separate instruction set. The CPU's memory access is tightly linked with the GPU so that the CPU can send geometry data to the GPU with no delays.
If anything, this represents a third architecture, more different from that discussed in the article than it is from a PC shared-memory architecture (except in the PC, the north bridge mediates the memory access, instead of the GPU.)
My point is entirely that if you merge the CPU and GPU command set into the same core, you will slow the system down. The XBOX360 gets speed from linking the CPU's geometry data send to the GPU's memory access, in essence allowing it to get a free pre-fetch. That's not what the article is about.
-
Re:Smart
You just attacked a stereotypical claim that 64-bit is somehow inherently faster, and then you made an erroneous claim of your own.
64 bit processors also need larger instruction caches because the instructions are way bigger in size. As a result, some small subset of things perform slower in 64 bit mode.
Actually, because x86-64 is an extension of the IA-32 CISC instruction set, it benefits from the exceptionally small instruction size. x86 uses variable instruction sizes. Thus, the move to a larger address space likely only affects instructions with explicit access to memory.
The addition of 8 more registers means that you can reduce the number of instructions, because you no longer have to cleverly juggle your data quite as much.
According to AMD, the instruction size for x86-64 is typically 10-15% larger, while the number of instructions is reduced by about 10%. This is why we see virtually no performance hit with x86-64, and in certain situations we see a huge performance increase.
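The two AMD figures combine multiplicatively: total code size is instruction count times average instruction size. A short Python check, using only the percentages quoted above (the function name is illustrative):

```python
# AMD's figures as quoted above: average instruction size grows 10-15%,
# instruction count drops about 10%. Code size = count x average size.
def relative_code_size(size_growth: float, count_reduction: float) -> float:
    """x86-64 code size relative to IA-32 (1.0 = unchanged)."""
    return (1 + size_growth) * (1 - count_reduction)

for growth in (0.10, 0.15):
    ratio = relative_code_size(growth, 0.10)
    print(f"+{growth:.0%} size, -10% count -> {ratio - 1:+.1%} code size")
```

So net code size ends up between about 1% smaller and 3.5% larger than IA-32, which supports the point that larger instructions alone need not bloat the instruction cache much. It says nothing by itself about speed, which also depends on the extra registers.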
See AMD's presentation here.
-
Cell
It is funny to see posts like this: Sony was hyping up the Cell so much it was almost guaranteed to suck. It's almost like the Cell architecture was designed to score the highest possible score on trivial benchmarks (like the ones that give you FLOPS) without worrying about real world performance. Where have we seen this before? Oh yeah, the Emotion Engine (PS2)! Wasn't Sony saying that we'd be sticking Cell processors in everything because they were going to be so great? I seem to recall talk about personal computers switching over to Cell because it was going to blow regular processors away. In a way, it does (FLOPS), but in practice it's way slower than even processors from last year.
How is it, then, that the Cell processor has been presented at various supercomputer conferences and will take a major slot at the Hot Chips Symposium for High-Performance Chips?
The first benchmark proved that it is about 100 times faster in large FFTs than a Xeon processor: PDF
I can't remember any presentations of the Emotion Engine at a supercomputer conference.
-
Re:but I want
Sounds like you want the POWER5 processor from IBM.
Dual-core with simultaneous multithreading for each core, 4 chips (8 cores) can be connected together on a module and modules can be connected with each other forming up to 64-way systems.
Each processor chip has 1.875 MiB of on-chip L2 cache and 36 MiB of off-chip L3 cache.
And yes, it is a server processor and no you won't be able to afford it...
-
Sun, IBM, other major vendors also going dual-core
The UltraSPARC IV processor is also essentially two UltraSPARC III processors on a chip, integrated using chip multithreading (CMT) technology. Here is an article and some marketing blurbs about the UltraSPARC IV.
The current IBM POWER4 and upcoming POWER5 chips are both dual-core chips. Here is a nice presentation(PDF format) about the POWER5; you can see in the die photos where there are two cores. There have also been rumors of a dual-core PowerPC based on it, but nothing concrete yet.
Broadcom (which bought SiByte) markets a dual-core, 1GHz 64-bit MIPS chip called the BCM1250 which has a lot of integrated networking goodies.
Finally, it bears pointing out that on the other side of Intel's severed corpus callosum, they're also working on a dual-core chip.
-
Good for Power5
This is an important step, at least for the Power5. It's immensely complex, and I think feedback from collaborators such as OS people is important when they (IBM) ask themselves if a design decision makes sense. For example, SMT adds 24% to the die area for each core (see here). Compare that with Intel's HyperThreading, which adds little area but is still complicated to verify. Getting feedback and involving other groups can help determine if design decisions/features are worthwhile.
-
Sun will exit the hardware side of the systems market.
Sun Microsystems (SUNW) is being rapidly forced off the desktop. SUNW has no intention of hanging around in the workstation market because SUNW does not make a competitive product. Athlon64 has locked up, and Prescott will lock up, the workstation market. PowerPC970 (in G5) is the wild card and could capture a nice 20+% of the market if Steve Jobs were not so clueless.
Now, SUNW is conceding the market for high-end servers.
SUNW recently purchased Afara. It supplies processors for low-end servers. SUNW will still try to maintain a presence there. Unfortunately, with the SPARC64 going to 4 cores per die and 2 threads per core, the processor from Afara is starting to look less and less competitive. SUNW will exit the market for even low-end servers by 2007.
The announcement of Power5, with its SMT capabilities, is tantamount to announcing a starship for intergalactic space travel when all the spacecraft in the Federation can only travel within the solar system. Power5 and, to a lesser extent, SPARC64 basically killed the UltraSPARC line and the entire hardware business of Sun Microsystems.
By the way, Professor Susan Eggers of the University of Washington must be tickled pink, because she developed most of the technology for simultaneous multithreading. IBM, with its Power5, proved that her ideas were all right. The Draper Prize in engineering should be going her way.
... from the desk of the reporter
-
More on what Google's CEO said
According to this article, the issue had to do with both price and power consumption.
From the article:
Eric Schmidt, the computer scientist who is chief executive of Google, told a gathering of chip designers at Stanford last month that the computer world might now be headed in a new direction. In his vision of the future, small and inexpensive processors will act as Lego-style building blocks for a new class of vast data centers, which will increasingly displace the old-style mainframe and server computing of the 1980's and 90's.
It turns out, Dr. Schmidt told the audience, that what matters most to the computer designers at Google is not speed but power -- low power, because data centers can consume as much electricity as a city.
He gave the Monday keynote at the "Hot Chips" conference at Stanford last August.
There is an abstract of his keynote.
-
Re:HPs Strategy
It's a sad state of affairs when the superior architecture...
IA64 is the superior architecture, by a wide margin.
If you don't understand the article, you're not sufficiently qualified to comment on computer architecture, much less run around chanting that Alpha is "so fundamentally sound".
Alpha is OK. IA64 is better. (x86 is horrible, and x86-64 is only marginally better.) Maybe it's time to pull a finger out?
-
Re:Intel is in trouble
The Itanium is a dud: systems based on it are hugely expensive, have iffy performance, and are not usefully x86 compatible.
Hugely expensive? You can get yourself a nice Itanium-2 workstation for less than you can get a 2x1.4G Opteron box.
Iffy performance? Fastest SPECfp2000 result, bar none. Second fastest SPECint2000 result, clock-for-clock: only HP's awesome PA-8700 is ahead.
Not usefully x86 compatible? It's more than enough to run acroread. For everything else, you've got source, right? With over 90% of Debian packages available for IA64, and three free (as in beer) compilers for Itanium available to download, what's the problem with porting?
AMD aren't going to have a big winner on their hands. They're going to have another Athlon - enough to make an initial impression, that slowly fades into market oblivion over a few years as Intel take advantage of an inherently superior architecture while AMD are stuck trying to make their 64 bit extension of a 32 bit extension of a 16 bit extension to an 8 bit microprocessor go faster.
Do you think Intel are stupid, or something?
-
You're missing the point... and IBM already did this
Why does it seem like everyone is missing the point of the story? Built-in cryptographic hardware engines on the CPU! Transmeta doesn't give any performance numbers, so I wonder how they compare to other hardware implementations...
IBM did this first, and announced it last year at the Hot Chips conference. See here.
Integrated Cryptographic Hardware Engines on the zSeries Microprocessor
The presentation gives an overview of how IBM did it, and predicted that other platforms would have to adopt this class of features in the future.
The future is now.
-
Re:it is sad
you're a stupid buttfucker, aren't you.
electrical engineer maybe, computer scientist? no fucking way. and that's always been alpha's problem. designed by engineers who can cook cucumbers (bonus prize for knowing what i'm referring to here ;) but who honestly have nfi what a computer actually spends its time doing.
-
Official from HP: Alpha *SUCKS*
proof
"Itanium arch has 40% fewer memory ops and 30% fewer branches than Alpha"
stupid, stupid slashdot fuckers. it's people like you who are ruining the US economy. well, keep it up, i'm not a dumb yankee!
i do feel sorry for the 0.01% of americans with their brains switched on, though :(
-
Already licensed
IBM licensed Altivec from Motorola as far back as 1998-2000, allegedly to help with the design of the PowerPC variant used in the Gamecube. The Gamecube's Gekko processor, a PowerPC 750 derivative, contains 38 additional instructions for vector FP math (vs. the 162 in Altivec). A glance at this PDF file from the web makes me pretty sure that these aren't just lifted from Altivec. Instead of Altivec's 128-bit vectors of four 32-bit floats, Gekko adds instructions for fitting two 32-bit FP numbers in a single 64-bit FPU register and working with them. It also adds some odd but interesting MMU features.
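To illustrate the layout described above, two single-precision values sharing one 64-bit floating-point register, here is a Python sketch that models the register as a 64-bit integer and performs a lane-wise add. It is a hypothetical model of the idea, not the actual Gamecube instruction set: the function names are made up, putting the first value in the high half is an assumption, and real hardware would do both adds in a single instruction rather than by unpacking.

```python
import struct

def pack_pair(a: float, b: float) -> int:
    """Pack two 32-bit floats into one 64-bit integer (a in the high half)."""
    hi, = struct.unpack(">I", struct.pack(">f", a))
    lo, = struct.unpack(">I", struct.pack(">f", b))
    return (hi << 32) | lo

def unpack_pair(reg: int) -> tuple[float, float]:
    """Split a 64-bit value back into its two 32-bit floats."""
    a, = struct.unpack(">f", struct.pack(">I", reg >> 32))
    b, = struct.unpack(">f", struct.pack(">I", reg & 0xFFFFFFFF))
    return a, b

def paired_add(x: int, y: int) -> int:
    """Add two packed registers lane by lane, rounding each lane to single precision."""
    xa, xb = unpack_pair(x)
    ya, yb = unpack_pair(y)
    return pack_pair(xa + ya, xb + yb)

r = paired_add(pack_pair(1.5, 2.25), pack_pair(0.5, 0.75))
print(unpack_pair(r))  # (2.0, 3.0)
```

The point of the design is that one 64-bit register and one instruction handle two floats at once, giving half of Altivec's four-wide parallelism while reusing the existing FPU registers instead of adding a separate 128-bit register file.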
Anyway, I know it's been licensed because back in 2000 there were a lot of conspiracy theories that Motorola, via an obscure clause in that license, was preventing IBM from selling Apple faster-clocked PPC chips than Motorola itself could produce. Both parties denied it, of course. I don't really believe that was the case. I think it was just bitter rumor-mongering by Mac users who were rightfully angry at Motorola for pissing away the performance (and MHz) advantage that PowerPC had over x86 chips back in the 603/604e and Pentium/PPro days.
Oh, admittedly, the MHz advantage went away as Intel/AMD extended their pipelines for that explicit purpose, earning themselves increased penalties for mispredicted branches and higher CPI for many instructions, but I still miss the days when PPCs were faster per clock AND had higher clock rates. Now the clock-rate gap is so extreme that the PowerPCs' better performance per cycle doesn't make up for it on the most commonly executed code. Once again, though, I digress.
-
Re:The size of...
The answer to your "what is the fastest single CPU out there" question can probably be found at the Hot Chips web site.
My guess is that the Japanese (NEC or Fujitsu) are the current leaders, as they have continued to build highly vectorized processors - along the lines of what Cray used to do in the past.
Another thing to keep in mind is that these machines are very rarely run in a mode where a single application is using all of the machine. I work on these machines (currently ASCI blue), and the real payoff is that a dozen or so people can be running moderately parallel jobs all at the same time.