Larrabee Based On a Bundle of Old Pentium Chips
arcticstoat writes "Intel's Pat Gelsinger recently revealed that Larrabee's 32 IA cores will in fact be based on Intel's ancient P54C architecture, which was last seen in the original Pentium chips, such as the Pentium 75, in the early 1990s. The chip will feature 32 of these cores, which will each feature a 512-bit wide SIMD (single input, multiple data) vector processing unit."
Ah the dreams of the past, a beowulf cluster of old computers come to life :)
A little context might help. This isn't the Inquirer for god's sake.
http://en.wikipedia.org/wiki/Get_Smart
Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
Sounds great, as long as you don't plan on doing any floating point math on it!
"Stone knives and bearskins"
This is just unbelievably good news. After all this time, I get to start telling Pentium jokes again! I never thought I would!
Aide-toi, le Ciel t'aidera - Jeanne D'Arc.
Get your acronyms right....
No sig today...
The card features one 150W power connector, as well as a 75W connector. Heise deduces that this results in a total power consumption of 300W,
Um, that just doesn't seem to quite add up to me.
Prediction: The real iPhone killer is going to be sex robots from Japan. Think about it.
It really is all about the Pentiums.
Doh! Intel already beat me to it.
good. sounds like a sensible engineering decision.
on the basis that..
the design is well known, understood and has had rigorous testing in the field
they will no doubt fix any understood errors first
limits the RnD to the multicore section
as long as the chip performs well for the silicon overhead then they should feel free to cram as many in as they want.
seems perfectly sensible to me.
Core 1: 4195835/3145727 = 1.33382
Core 2: 4195835/3145727 = 1.33382
Core 3: 4195835/3145727 = 1.33382
Core 4: 4195835/3145727 = 1.33382
.
.
.
Core 31: 4195835/3145727 = 1.33382
Core 32: 4195835/3145727 = mmm... 1.33374? Oh, f*ck!
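For anyone who missed the reference: this is the classic FDIV test division. A minimal Python sketch of the check, assuming it runs on hardware with a correct FPU; the variable names are just illustrative:

```python
# The well-known FDIV test case: a correct FPU gives about 1.33382,
# while the flawed Pentium returned about 1.33374.
NUMERATOR = 4195835
DENOMINATOR = 3145727

quotient = NUMERATOR / DENOMINATOR

# The usual form of the check: x - (x / y) * y should be (almost exactly) zero.
# The flawed chips famously produced 256 here.
residual = NUMERATOR - quotient * DENOMINATOR

print(f"quotient = {quotient:.5f}")
print("FPU looks fine" if abs(residual) < 1e-6 else "FDIV bug suspected")
```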
I doubt it. Maybe they mentioned the Pentium as an example to explain an in-order superscalar architecture as opposed to more modern CPUs.
-There is a lot of overhead in the P54C to execute complex CISC operations that are completely useless for graphics acceleration.
-The P54C was manufactured in a 0.6micron BiCMOS process. Shrinking this to 0.045micron CMOS (more than 100x smaller!) would require a serious redesign up to the RTL level. Circuit design has evolved with process technology.
-a lot more...
Careful, Intel! Don't base these on core designs that are TOO old!
Or do y'all still think math is like playing horseshoes?
Larrabee is going to be Intel's next creation in the GPU world. A many core GPU which has the following peculiarities :
- fully compatible with x86 instruction set. (whereas other GPU use different architecture, and often instruction sets that aren't as much adapted to run general computing).
Thus, the Larrabee could *also* be used as a many core main processor (if popped into a quick path socket) and used to execute a good multicore OS. Something that's not achievable with any current GPU (both ATI's and nVidia's completely lack some control structures - both are unable to use subroutines and everything must be in-lined at compile time)
- unlike most current Intel x86 CPUs, features a shallow pipeline, executing instructions in-order. Hence, the Larrabee (and the Silverthorne, which also has such characteristics) has regularly been compared with old Pentiums (which also share those characteristics) since the initial announcement, including in TFA.
- features more cores with narrower SIMD : 32 cores, each able to handle 16 32-bit floats simultaneously. Whereas, for example, nVidia's CUDA-compatible GPUs have up to 16 cores only, but each is able to execute 32 threads over 4 cycles and keep up to 768 threads in flight.
This enables Larrabee to cope with slightly more divergent code than traditional GPUs and makes it a good candidate to run stuff like GPU-accelerated raytracing.
Hence all the recent technical demos running Quake 4 in raytracing mentioned on /.
That's for what Intel tells you.
Now the old and experienced geek will also notice that Intel has only kept making press releases and technical demos running on plain regular multi-chip multi-core Intel Cores (just promising that the real chip will be even better than the demoed stuff).
Meanwhile, ATI and nVidia are churning out new "half"-generations every 6 months.
And the whole Larrabee is starting to sound like a big vaporware.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Larrabee is the Chief's cousin
Faith: n. -- That human impulse that drives them to steal appliances when the power goes out
... of the A20 gate!
...and the Pentium III was basically the same as the Pentium Pro.
If Intel is going backwards then why not go all the way back to the original Pentium? Makes sense to me.
No sig today...
It's more likely that they are taking basic design concepts. It says 'based on', not 'clone of'. By optimizing away some of the overhead you mention with more modern architectural techniques, they can both keep it simple and capitalize on modern optimizations.
It's only 13x smaller. :)
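Both numbers can be defended, depending on whether you count linear feature size or area. A quick sketch using the 0.6 micron and 0.045 micron figures quoted above (the assumption that area scales with the square of feature size is an idealization, not something real layouts follow exactly):

```python
OLD_NODE_UM = 0.6    # P54C process, as quoted in the thread
NEW_NODE_UM = 0.045  # 45 nm process

# Linear shrink: ratio of feature sizes.
linear_shrink = OLD_NODE_UM / NEW_NODE_UM

# Idealized area shrink: square of the linear ratio.
area_shrink = linear_shrink ** 2

print(f"linear: {linear_shrink:.1f}x, area: {area_shrink:.0f}x")
```

So "13x" holds for linear dimensions and "more than 100x" holds for area.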
If anyone remembers those old original Pentiums, their 16-bit processing sucked - so much that a similarly clocked 486 could outperform them. I guess that it would be reasonably trivial for Intel to slice off the 16bit microcode on this old chip to make a 'pure' 32-bit only processor. I am sure that they will be using the designs with a working FPU... but for many visual operations, occasional maths errors would largely go unnoticed. Remember when some graphics chip vendors were cheating on benchmarks by reducing the quality ... and how long it took for people to notice?
Although, if I had Intel's resources and was designing a 32-core CPU, I would probably choose the core from the later 486 chips... I don't think a graphics pipeline processor would benefit much from the Pentium's dual instruction pipelines and I doubt that it would be worth the silicon real estate. The 486 has all the same important instructions useful for multi-core work - the CMPXCHG instruction debuted on the 486.
No sig. Move along - nothing to see here.
From TFA "Heise also claims that the cores will feature a 512-bit wide SIMD (single input, multiple data) vector processing unit. The site calculates that 32 such cores at 2GHz could make for a massive total of 2TFLOPS of processing power."
I don't see how they get to 2 TFLops.
512-bit = 64 bit * 8 way SIMD or 32 bit * 16 way SIMD. Let's go with the bigger of these two and say we are performing 16 single-precision floating point operations per clock cycle per core. 16 operations per clock per core * 32 cores * 2 billion clocks per second = 1024 single-precision GFLOPS. It looks more like 512 double-precision GFLOPS for 300 watts, which means a DP teraflop on Larrabee will cost you 513 dollars a year at 10 cents/kWh. If we're considering single precision, we can cut this in half to 257 dollars per year per single-precision teraflop.
Compare to Clearspeed which offers 66 DP GFLops at 25 Watts costing 332 dollars for a sustained DP teraflop for a year.
even the NVidia Tesla has better performance at single precision: you can buy 4 SP TFlops consuming only 700W or 5.7 GFLops/Watt, for an annual power budget of 153 dollars.
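The arithmetic above can be reproduced with a short Python sketch. It assumes, as the comment does, 10 cents per kWh, year-round operation at the quoted power draw, and one operation per SIMD lane per clock; the function and variable names are made up for illustration:

```python
HOURS_PER_YEAR = 24 * 365
PRICE_PER_KWH = 0.10  # dollars, as assumed in the comment above

def dollars_per_tflop_year(gflops, watts):
    """Annual electricity cost of sustaining one teraflop at the given efficiency."""
    watts_per_tflop = watts / (gflops / 1000.0)
    kwh_per_year = watts_per_tflop * HOURS_PER_YEAR / 1000.0
    return kwh_per_year * PRICE_PER_KWH

# 16 single-precision lanes * 32 cores * 2 GHz = 1024 GFLOPS
larrabee_sp_gflops = 16 * 32 * 2.0

systems = {
    "Larrabee (DP)": (larrabee_sp_gflops / 2, 300),
    "Larrabee (SP)": (larrabee_sp_gflops, 300),
    "ClearSpeed (DP)": (66, 25),
    "Tesla (SP)": (4000, 700),
}

for name, (gflops, watts) in systems.items():
    cost = dollars_per_tflop_year(gflops, watts)
    print(f"{name}: about ${cost:.0f} per teraflop-year")
```

These are peak figures only; sustained throughput on real code would change every row.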
our precise calculations at Intel suggest that partial core technology has great potential.
if this is supposed to be a new economy, how come they still want my old fashioned money?
Obviously they're not just going to slap a bunch of Pentium cores on there and call it good. But the high-level design can probably start off with the P54, and just rip out stuff that doesn't need to be supported, possibly including:
Scalar floating-point, 16-bit protected mode, real mode, operand size overrides, segment registers, the whole v86 mode, the i/o address space, BCD arithmetic, virtual memory, interrupts, #LOCK, etc, etc.
Once you've done that, you'll have a much simpler model to synthesize down to an implementation. And with a slightly-modified compiler spec, you can crank out code for it with existing compilers, like ICC and GCC.
I get the feeling this is supposed to be shocking news, but I must be missing something important. Isn't the Core microarchitecture also based on the original Pentium? I mean, I thought it was a redesign of the Pentium M series which was derived from the Pentium III which evolved from the Pentium II...and we know where that came from.
But when I run CPU-Z on the system, it only reports 31.33374 cores
First Core tech was based off the pre-Netburst architecture, and now this. In 5 years Intel will announce a 4096-core 80386 for your sound card or something. ;P
One does not "shrink" a chip by taking photomasks and shrinkenating.
'course not. You use a transmogrifier. In the industry, it is known as the "Bill Watterson" process.
It can also be used to turn photomasks into elephants, which, while less profitable, is immensely entertaining if the operator didn't see you change the setting.
dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
Right. It clearly isn't using the Pentium design, but a Pentium-like design.
To that, they will have added SMT, because (a) in-order designs adapt to SMT well because they have a lot of pipeline bubbles and (b) there will be a lot of latency in the memory system and SMT helps hide that. I would assume 4 way SMT, but maybe 8. Larrabee will therefore support 128 or 256 hardware threads. nVidia's GT280 supports 768.
The closest chip I can think of right now is Sun's Niagara and Niagara 2 processors, except with a really beefy SIMD unit on each core, and a large number of cores on the die because of 45nm. I think Niagara 3 is going to be a 16 core device with 8 threads/core, can anyone confirm?
Note that this is pretty much what Sony wanted with Cell, but Cell was 2 process shrinks too early. 45nm PowerXCell32 will have 32 SPUs and 2 PPUs (whereas Larrabee looks like it is matching an equivalent of a weak-PPU with each SPU equivalent). It could run at 5GHz too... power/cooling notwithstanding.
and they want to keep Kontrol... They want to shag the field with Austen-sible power CON-sumptions? So, do they want to *86* or DEEP-SIX ATI & nVidia and others?
Previously: "Linux... Toward the Sunrise..." Now: "Linux... Toward the-- No, now, part of Every Sunrise"
At least 20 years ago, I thought, hey, with the density and speed of transistors these days, and with RISC being popular, why not go all the way and make a chip with literally hundreds of (wait for it..) Z80 CPUs?
Of course I and others dismissed the idea as being just slightly ludicrous. But then, at the time, I also thought eventually there would be Amiga emulators and interpreted versions of C language, for which I was also called crazy to think...
-- Senior Software Engineer, Attorney appearance services, locallawyerapp.com.
Why not 486 cores? Then you could put 4X as many of them on your die. They already include integral FP and 1 op/cycle for most instructions.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
ha! anyone remember the f00f bug?
I learned how to embed machine code into C and ran amok halting university systems with that for a little while.
Or about that floating point bug?
Will it include the FDIV bug X32?
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Maybe we'll go back to a million 6502 cores running at 3 Ghz. Personally think programming this in C or assembly would be more exciting than implementing Java RFC 56532.1324342 on the latest Pentagoogaxeon 256000.
Oh, so 2 years from now (two lifetimes in the GPU business) Intel will be releasing a chip comparable to this month's ATI HD 4870 X2.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
So it will be choked by the FSB?
http://babelfish.yahoo.com/translate_url?doit=done&tt=url&intl=1&fr=bf-home&trurl=http%3A%2F%2Fwww.heise.de%2Fct%2F08%2F15%2F022%2F&lp=de_en&btnTrUrl=Translate
Actually, they got the "Gelsinger said so" remark from Expreview, itself a Chinese site:
http://en.expreview.com/2008/07/07/larrabee-unleashes-2-tflops-capacity (note they courteously attached the Larrabee board diagram leaked a while back):
"Gelsinger said the Larrabee will be a 45nm product featuring SIMD technique, 64-bit address. Besides, 32 of cores runing at 2.00 GHz will unleash 2 TFLOPS capacity, twice as much as the RV770XT."
But did Gelsinger really SAY those things?
Here is the Google translation of the same Heise article: http://translate.google.com/translate?u=http%3A%2F%2Fwww.heise.de%2Fct%2F08%2F15%2F022%2F&hl=en&ie=UTF8&sl=de&tl=en
It seems that no matter which crappily translated version of the German article one looks at, it appears that Gelsinger said no such thing... The part about Larrabee containing P54C cores was clearly in a separate paragraph, written after a speculative question.
So I guess Expreview THOUGHT Pat said something after taking too short a look at the Heise article, after which CustomPC sensationalized the whole thing, not really bothering to actually read even the translated link it posted. Now, some random Slashdotter is doing us the same courtesy.
There you go, folks- Internet reporting.
Sometimes you have to wonder about Intel. Here they have their low-power small footprint completely modern Atom chip already working on the modern foundry process. So instead of a multiple implementation of them they go back to the P54C. Was Atom a poor design choice, or does the right hand not know what the left hand is doing? Why wasn't Atom P54C based also?
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
Maybe you should be trying GPU-Z.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
I bet Duke Nukem Forever is gonna look SWEET on one of these!
- fully compatible with x86 instruction set. (whereas other GPU use different architecture, and often instruction sets that aren't as much adapted to run general computing).
I was about to ask "Since when is the x86 instruction set optimized to run general computing?"
Then I noticed that the word was "adapted". Yeah, that's fair...
Seriously: The x86 (inspired by the hardware driving Datapoint's early smart terminals and previous chips for building hand calculators) was contemporary with Motorola's 68x (inspired by Gordon Bell's masterfully engineered PDP-11 and VAX instruction sets). While a lot of good people have poured their hearts and souls into turning it into a silk purse, and the original sows were particularly good examples of their breeds, the x86's descent from a pair of sow's ears is still apparent.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
First, Atom is much more complex than P54C. I found a source that claims an Atom has approximately 47 million transistors, and someone previously posted that the P54C core was more like 3 million transistors. Granted, more of those transistors are used for cache on the Atom than on the P54C, but it's still 15 times the size.
The Atom also has a bunch more architectural baggage that wouldn't be applicable to Larrabee's intended use. I doubt that Larrabee will need to support 32 and 64-bit modes, or three varieties of SSE, or enhanced virtualization whatsits.
According to the diagram in the article, the Larrabee has 8 GDDR memory interfaces, which will supply rather a lot of bandwidth. Presumably, those are GDDR4 or GDDR5 interfaces, so that's 4.5 Gb/s * 8 = 36 Gb/s, or 4.5 GB/s of bandwidth.
Getting data onto and off the board will still be a challenge - you're limited by PCI Express transfers.
They are not even getting a banana for their efforts.
If you mod me down, I shall become more powerful than you could possibly imagine.
Pentium is so much sexier than 486. I think Intel's marketing department pulled rank on this one.
So how does memory access work? Does each little CPU have its own memory, like the Cell? Do they all work through interlocked caches, as a symmetrical shared-memory multiprocessor? Or is there some partially-shared scheme?
What do you run as an OS on this thing? Something like VXworks? Real time Linux? Windows CE? You're going to need some kind of OS to manage resource allocation, even if the OS isn't exposed to the customer.
And the real question: is this a useful mainstream graphics architecture? This sounds like one of Intel's "build it and they will come" architectures, like the Itanic and the IXP series of network processors.
Isn't SIMD = Single Instruction, Multiple Data?
I knew I was right!
http://groups.google.co.uk/group/comp.arch/browse_thread/thread/a100e90bba2e2b12/fde291da3e7ecab8?hl=en&lnk=st&q=free+advice+to+amd+#fde291da3e7ecab8
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
Eww, that comment just sticks in my Craw!
In the early 90s my team thought about building a multi-core design with 16 to 64 6502 cores on a chip with the same complexity as a 68020. Some rough estimates showed that this system could outperform an equally complex 68020 system by a factor of ten. Later this drifted towards using an AMD29000 core, but it all failed because ready-made designs weren't available for licensing back then.
Back to topic: a 6502 has around 4 000 micro elements, compared to a GT-280 with 1 400 000 000 elements. Given that a present-day 6502 would include some additional circuits, this would mean we could include 1400000000/5000=280 000 cores running at 1.2 GHz each, resulting in 150-200 tera integer operations per second. Given that many long floating point operations can be reprogrammed as short integer operations with less than 100 cycles per op, this would outperform every recent solution.
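A back-of-the-envelope version of that estimate in Python. The transistor budget, per-core size and clock come from the comment; the cycles-per-instruction values are my own assumption (6502 instructions take 2 to 7 cycles), and interconnect, memory and power are ignored entirely:

```python
GT200_ELEMENTS = 1_400_000_000
ELEMENTS_PER_CORE = 5_000   # ~4000 for a 6502 plus some extra circuitry
CLOCK_HZ = 1.2e9

cores = GT200_ELEMENTS // ELEMENTS_PER_CORE
total_cycles_per_second = cores * CLOCK_HZ

def tera_ops_per_second(cycles_per_instruction):
    """Aggregate integer ops/s in units of 10^12, for a given average CPI."""
    return total_cycles_per_second / cycles_per_instruction / 1e12

print(f"{cores} cores")
for cpi in (2, 3, 4):  # hypothetical average cycles per instruction
    print(f"CPI {cpi}: about {tera_ops_per_second(cpi):.0f} tera-ops/s")
```

An average of about 2 cycles per instruction lands inside the quoted 150-200 range; slower instruction mixes fall below it.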
"Life is short and in most cases it ends with death." Sir Sinclair
Yes. And newer GPUs have even more transistors than that. The AMD RV770 GPU (HD 4800 series video cards) has 956 million transistors. The Nvidia GT200 (GTX 200 series cards) has 1.4 billion of them.
Also, the RV770 is based on a VLIW architecture with 160 stream processing units, 5 ALUs each, 800 ALUs total, potentially executing 800 instructions per clock (theoretical of course, only ideal code can reach this throughput). The RV770 is clocked at up to 750 MHz (HD 4870). As to the GT200, it has 240 streaming processors and can theoretically execute up to 240 instructions per clock. It runs at up to 1296 MHz (GTX 280) or 1500 MHz (Tesla-based S1070 1U box). Of course AMD's and Nvidia's next generation GPUs (8 months from now?) will be even faster.
All this to say that it is way too early to know how Larrabee, planned for production in 2009, will rank among the competition. x86 compatibility will be nice for sure, and I assume Larrabee will have low instruction latencies and good caching performance, but as a GPGPU and i386/amd64 assembly developer, hearing about a "32-core P54C-based GPU" is actually below my expectations. I predict AMD/Nvidia GPUs will still perform better on highly parallelizable workloads with low instruction interdependencies. Intel seems to be positioning Larrabee to be better at more general-purpose workloads that are inconvenient to port to today's GPU architectures.
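To put the per-clock figures above on one scale, here is a small Python sketch that multiplies execution units by clock. It only restates the theoretical peaks quoted in this comment (one instruction per unit per clock); the dictionary keys and function name are illustrative:

```python
# (execution units, clock in MHz), figures as quoted in the comment above
gpus = {
    "HD 4870": (800, 750),
    "GTX 280": (240, 1296),
    "Tesla S1070": (240, 1500),
}

def peak_ginstr_per_second(units, clock_mhz):
    """Theoretical peak in billions of instructions per second."""
    return units * clock_mhz / 1000.0

for name, (units, clock_mhz) in gpus.items():
    peak = peak_ginstr_per_second(units, clock_mhz)
    print(f"{name}: {peak:.0f} G instructions/s peak")
```

Real code rarely fills every VLIW slot, so the gap between these peaks says little about delivered performance.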
I have discovered that ATI is poorly enough supported for Linux so another manufacturer is just going to split the opensource devs time up. Nvidia is the only way to go for people who are actually interested in playing flightgear(sorry I mean seriously flying a plane).
On a long enough timeline. The survival rate for everyone drops to zero. Chuck Palahniuk, Fight Club, 1996
Partly, it could be a way to get early adopters started with a seriously multi-core CPU. Getting some cool apps developed and tested with it will validate the platform, and will invite the next stage of adopters. By the time the proper CPU line has 32 cores (in a few short years), the platform will be ready - or, at least, more ready than alternatives (like the IBM/PS3 Cell processor).
Whoever gets real traction with multi-core will win. This discontinuity is an opportunity for a new manufacturer (ie. that no one has ever heard of) to "own" computers.
It's mainly a question of "on which scale are we comparing chips".
Yes, the x86 instruction set is utterly ugly and horribly contrived compared to nice contemporary architectures like the 68k. Computing would probably involve fewer hoops had IBM decided to go with Motorola chips for their PCs (as a lot of other home computers, arcade machines and home consoles did).
*BUT*
if we place GPUs on the same scale, suddenly the x86 shines : it doesn't completely suck at branching, and has an actual stack that can be used to call sub procedures, has interrupts, etc.
It is an architecture able to run an OS.
nVidia's CUDA machines, on the other hand, mainly use SIMD masking for most conditional operations, aren't really brilliant when it comes to branching, and completely lack any way to do sub-procedures. Those chips have loads of registers. But instead of using them for register windows and RISC-style sub calls, they use the registers to keep more threads in flight.
It definitely makes a lot of sense from a functional point of view (those are GPUs, they are made to process fuck-loads of pixels per second), but this makes them unable to run Linux.
On that scale, having x86 on a GPU suddenly makes it a lot more interesting for usages outside the usual "draw triangles very fast". Even if x86 sucks to begin with.
And for the record : there's hardly any way the 68k architecture could ever have prevailed. It's a good one. But IBM never saw its PC as anything better than a glorified terminal. For that kind of machine, they were of course going for the cheapest possible chip.
Given a choice between a half-assed chip from Intel with a 16-bit extension quickly tacked onto a design inherited from early 8-bit chips (8008, 8080 and the competing Z80 - most assembler code can be directly recompiled for the 8088 after a few register renamings) AND a very nice chip from Motorola redesigned from the ground up to be a nice and clean 16/32-bit architecture designed for future expansion :
Of course they will pick the Intel. It's cheaper and there's no need for a future proof 32bits processor in a fucking "Terminal Deluxe".
And of course, because of the (relatively) low cost, because of the (very strong) brand recognition, because of the (somewhat) open nature of the platform enabling clones (in the sense that it was documented. Of course, Phoenix had to completely rewrite the BIOS because of copyright restrictions - but IBM considered big iron to be their main product and didn't mind such clones), and because they were taking on a relatively uncrowded market (most home computers were for homes, schools, and small shops - PCs were marketed to corporations) :
The PC was bound to take over the market very quickly - *with* its bad design (almost *because* of it). And it was bound to set the standard, as bad as this standard is.
And by then, it was too late for IBM to take a better architecture to produce a "Terminal Deluxe Pro Mark-III" with a clean 68k chip.
Of course, had the PC had a less crippled OS, designed to be slightly more extensible and making fewer assumptions about the architecture than MS-DOS (you know the "we laid everything around 1MiB and thought it would last for at least 10 years" by mr. Gates), perhaps a switch to a better, different architecture could have been less painful, and a cleaner architecture could have blessed the PC world sooner.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
This rig sounds like a DEC-based graphics system that I saw being developed at Bell Labs in Holmdel. It was around 1986 and they were getting Mandelbrot zooms in real time and could zoom in past the processor's precision in less than a minute, taking small gulps. It was a rather impressive thing to watch in the days of ASCII graphics. The stuff that they were doing with ray tracing was phenomenal. It probably wouldn't impress a 20 year old today but this was some very cool stuff. I saw them doing stuff I still haven't seen anyone duplicate. I am looking forward to seeing some real time computer generated graphics that aren't dependent on all kinds of raster trickery.
Yes, the T2 was an UltraSPARC IIi chip, the same as in my laptop, multiplied N ways. The Intel sounds more like something to hang SIMD processors on.
I'm not sure what I'd use a SIMD attached processor for in normal office use: if finding a multi-instance or multi-threaded program is hard for a PC, imagine how hard it is to find vector programs!
Off-topic: my IIi with a slow 3600 RPM disk loads full Mozilla in less than 10 seconds, and the T5 is faster than it is.
--dave
davecb@spamcop.net
instead of trying to make your own improvements, just shrink it down and use multiple instances of it?
In all seriousness, I was wondering what intel's next plan was since it was stated a couple years ago that multicores was the way to go.
Andreas Stiller from c't speculates about the P54C in the current issue of the German magazine. That guy is one of the most well informed people in that particular area. If Stiller speculates on something it usually means that it is true, but the sources didn't want to be named or identified. They probably spilled the news over a beer or something.
Yeah, funny story about that, actually...
Mind you, I was a dumb kid at the time...
Anyway, our local dial-up ISP ran a Unix shell host, including a C compiler. One day I decided to see if they'd patched their system for the F00F bug...
In hindsight, I would say, don't do things like that! It's really not a very good idea.
Bow-ties are cool.
Don't you think it's a bit retarded to give my comment a +5 funny, and give the OP a -1 offtopic? It was quite relevant to the thread, I would go so far as to suggest it was informative. Can we all start paying more attention to context in addition to content?
Sometimes, life itself is sarcasm...
Germany is a long way from Santa Clara. The fact remains that the original poster typed "Pat Gelsinger recently revealed that Larrabee's 32 IA cores will in fact be based..." instead of "Speculations are that..." I see the thread as "extreme sloppy reporting and non-checking by all parties" followed by "a massive episode of base-jumping related lemming deaths".
From my testing, these older chips did more "work" per clock tick than the current line of P4s and use fewer transistors, and so make a much better candidate for a cluster of processors on a single chip.
http://www.dnull.com/cpubenchmark/budmark3.html
When I was researching this type of thing back in 2001, it turned out that using even smaller, lower-end processors and running them faster makes even more sense.
I think their choice of the P54C really is the best decision, but it is just not obvious without all of the facts.
A P54C at clock rates of 2 GHz and up is like a 3 GHz P4 but uses less power. It's a better design with a much lower transistor count, and with modern high-resolution fabs they are probably fitting 32 of them into the same silicon as a single current Pentium D Extreme.
I really think this is going to be the coolest chip out in a very long time.
I am always doing that which I can not do, in order that I may learn how to do it. - Pablo Picasso