Intel Squeezes 1.8 TFlops Out of One Processor
Jagdeep Poonian writes "It appears as though Intel has been able to squeeze 1.8 TFlops out of one processor and with a power consumption of 62 watts." The AP version of the story is mostly the same; a more technical examination of TeraScale is also available.
Imagine a Beowulf cluster of those!!
The trick, as with SPEs, is finding a way to use them efficiently in as many tasks as possible.
I'm glad to see Intel is using their size for more than x86 core production though.
Tom
Someday, I'll have a real sig.
That's not 62 watts at 1.8 teraflops. That's 62 watts at 3.16 GHz. FTFA: "Intel claims that it can scale the voltage and clock speed of the processor to gain even more floating point performance. For example, at 5.1 GHz, the chip reaches 1.63 TFlops (2.61 Tb/s) and at 5.7 GHz the processor hits 1.81 TFlops (2.91 Tb/s). However, power consumption rises quickly as well: Intel measured 175 watts at 5.1 GHz and 265 watts at 5.7 GHz. However, considering the fact that just 202 of these 80-core processors could replicate the floating point performance of today's highest performing supercomputer, those power consumption numbers appear even more convincing: The Department of Energy's BlueGene/L system, rated at a peak performance of 367 TFlops, houses 65,536 dual core processors."
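For anyone who wants to check the efficiency trade-off in those quoted figures, here is a rough sketch. The numbers come straight from the quote above; the helper names are made up for illustration, and the 3.16 GHz point is left out because the quote gives its wattage but not its TFlops.

```python
import math

# Clock in GHz -> (TFlops, watts), as quoted from the article above.
QUOTED = {
    5.1: (1.63, 175),
    5.7: (1.81, 265),
}

def gflops_per_watt(tflops, watts):
    """Efficiency in GFlops per watt (illustrative helper, not from the article)."""
    return tflops * 1000 / watts

def chips_needed(target_tflops, tflops_per_chip):
    """Whole chips needed to reach a target peak rating (illustrative helper)."""
    return math.ceil(target_tflops / tflops_per_chip)

for ghz, (tflops, watts) in QUOTED.items():
    print(f"{ghz} GHz: {gflops_per_watt(tflops, watts):.2f} GFlops/W")

# BlueGene/L peak of 367 TFlops, per the quote.
print(chips_needed(367, 1.81), "chips at 1.81 TFlops each")
```

On these figures, efficiency drops from about 9.3 to about 6.8 GFlops/W between 5.1 and 5.7 GHz. Also, 367 / 1.81 comes to roughly 202.8, so the quoted 202 looks rounded down.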
Now if they can only find a way to lessen its thirst for volts they could make it useful for the masses.
"Intel" "Introducing the NEW CORE 80, personal laptop supercomputer running Windows waste my ram and cpu cycles SP2 edition" But seriously this looks interesting for the future. Now we just need software to fully utilize multicore processors.
bus speed *cough* bus speed *cough* bus speed
Does this permit the practical use of any truly breakthrough apps?
Does it suddenly make previously crappy technologies worthwhile? I.e., does image recognition or untrained speech recognition become a mainstream technology with this new processing power?
I gotta get me one of these. This lends new credence to the Staples Red Button for major scientific and engineering problems. "That was easy!"
The first thing that jumped out at me was the presence of MACs. They are the heart of any DSP. So, this chip is good for computation although not necessarily processing. As other posters have pointed out, this chip could become a very cool GPU. It should also be awesome for encryption and compression. Given that the processor is already an array, it should be a natural for spreadsheets and math programs such as Matlab and Scilab. Having a chip like this in my computer just might obviate the need for a Beowulf cluster. :-)
64 cores should be enough for anybody.
Get the bugs worked out by Xmas and you could sell a 1.81 TFlop Easy-Bake oven
{...I need more sleep...}
A goal is a dream with a deadline
Gonna get one of these. That should bump up my Vista Experience score.
w00t
Ray tracing is embarrassingly parallelizable, and while I'm no expert, two teraflops might just be enough calculating power to do a pretty good job at scene rendering, maybe even in real time. To think this performance would be available from a standard 65nm die that uses 65 watts... that really could make a difference to gamers!
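A back-of-the-envelope way to size that claim is to ask how many floating point operations each pixel gets per frame. The sketch below uses the headline 1.8 TFlops as a peak figure; the 1280x720 resolution and 30 fps target are assumptions picked for illustration, and the function name is hypothetical.

```python
def flops_per_pixel(total_flops, width, height, fps):
    """Peak floating point operations available per pixel per frame."""
    return total_flops / (width * height * fps)

# Assumed: 1280x720 at 30 fps; 1.8e12 is the headline peak figure.
budget = flops_per_pixel(1.8e12, 1280, 720, 30)
print(f"{budget:,.0f} floating point operations per pixel per frame")
```

Whether that budget is enough depends on scene complexity, rays per pixel, and how close real code gets to peak, none of which this sketch models.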
The architecture is very much like how one might build a cellular automata machine, albeit with FPUs instead of lookup tables.
As an example, check out CAM-8: http://www.ai.mit.edu/projects/im/cam8/
This dates from 1993 or so; it took at least a 1 GHz Pentium III to match its cellular automata performance, if I recall correctly.
I hope they can get them back in.
33 of these CPUs should be more than enough to construct Lt. Cmdr Data.
Since petaflops are likely by the end of the decade, it's time to imagine exaflops in 2020.
"Intel Squeezes 1.8 TFlops Out of One Processor"
In other news, AMD pinches a 1.9 TFlops loaf out of one processor
I will forever be a student.
The FSB will be a big bottleneck, even more so with the CPU needing to use it to get to RAM. You would need about 3-4 FSBs with 1-2 MB of L2 per core to make it fast.
Comment removed based on user account deletion
Yep. The only way to really use this effectively is to load it up with lots of bloatware. Imagine the tons of ads one can finally get with this type of CPU! doubleclick.net would seriously love this.
People still effectively use processing power equivalent to that of an 800 MHz Pentium 3 for basic stuff (and I'm just talking about word processing, email, internet, no gaming) on average. Why would someone need a quad-core CPU and a crappy video card just for surfing the net, typing, etc.?
In reality, that is what will ultimately happen. Just lots of stuff running in the background without us really noticing it. The speed and cores can make it easier to hide spyware in the background because you won't notice any slowdown in your system when the spyware loads, whereas if you have an older PC you will notice when something is running in the background as it will slow it down considerably. Bloatware will end up becoming tolerable when these types of CPUs start being put in desktop PCs. People will get used to it as much as most people tolerate spam in their email.
Previewing comments are for sissies!
Many comments on this post are centered around the processor's use as a personal computing solution. There is much more to computing than PCs! When viewed alongside specialized programming technology, bioinformatics, neurology, and psychology, this (rather large) leap in processing power brings AI to yet another level, and continues the law of accelerated returns. I'm not saying "oh wow now we can have human-like AI", I'm just saying that the ability to process 1.8 Tflops is nothing to scoff at. Personal computing is inane and almost moot when compared to the other applications that new processors may pave the way for. Know your facts, but use your imagination.
They've already allocated 40 cores to the RIAA and MPAA for DRM processing, 30 cores to NSA/Homeland Security surveillance of all your computing activities, and 6 cores to combat spam and phishing. In the end, there is no net gain in performance over today's processors. Sorry.
(tongue firmly planted in cheek)
$nice = $webHosting + $domainNames + $sslCerts
Looks like Intel finally put the "80" in 80x86.
This clearly isn't for CPUs. It's for building GPUs and, more importantly, for Intel to get a part of the huge growing market demand for general purpose programming on GPUs. We'll have to call them something other than GPUs in 5-10 years as they'll do all sorts of other jobs too.
IBM saw this coming and went with the Cell, AMD saw this coming and bought ATi, NVidia already has a card that has all these shader units. Intel would be stupid not to respond. They've already admitted a discrete GPU part is on the way (http://www.reghardware.co.uk/2007/01/23/intel_discrete_gpu_return).
Only the other day there was a story (in either The Register or The Inquirer, and AFAIK now deleted...) about their GPU part being a whole chunk of in-order x86 parts on a chip. Pieces of the jigsaw are slotting together. Makes programming GPGPU stuff easy for many. Intel want to move x86 architecture onto GPUs.
Ah well, I wonder when we'll get that story confirmed. Intel are clearly up to something... I think we'll know what shortly. All in all it spells trouble for NVidia, which is left out of the CPU part of the equation, with Intel, AMD and in some respects IBM all having combos.
Anon because I've signed way too many NDAs...
I want something that will do 1.8 trillion integer operations per second (single threaded). This simulation is taking 5 hours per run with this A64 3200+. Give me 1.8 TIOPS and I'll be listening.
FGD 135
You can't say this is useless and still support nVidia's or ATI's stream computing; they are the same thing.
This is the future of CPUs: everyone is doing it, and with GFX manufacturers heading down this path, it proves to be a very interesting future.
Others have built large-scale parallelism in the past, such as Thinking Machines and MasPar. They were not fully general CPUs, i.e. floating point. Plus the companies could only develop new generations on a 3-5 year time scale, so the general purpose workstations and clusters almost caught up by then. Having a "major" back large-scale parallelism may finally lift the curse.
One year later, and /. has updated their Intel logo to the new one?
...the future crusty old bastards are already drinking the Kool-Aid.
Easy, you use something like CSP where just about everything is a thread.
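To make the CSP idea a bit more concrete, here is a minimal sketch in which every stage is its own thread and stages talk only through channels. It is an approximation under stated assumptions: Python's `queue.Queue` with `maxsize=1` stands in for a CSP channel (true CSP channels are unbuffered rendezvous points), and the stage names are invented for the example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream

def producer(out_ch, n):
    """Stage 1: emit the integers 0..n-1, then signal completion."""
    for i in range(n):
        out_ch.put(i)
    out_ch.put(DONE)

def squarer(in_ch, out_ch):
    """Stage 2: square each incoming value and pass it downstream."""
    while True:
        item = in_ch.get()
        if item is DONE:
            out_ch.put(DONE)
            return
        out_ch.put(item * item)

def run_pipeline(n):
    """Wire the stages together and collect results on the calling thread."""
    a, b = queue.Queue(maxsize=1), queue.Queue(maxsize=1)
    threads = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=squarer, args=(a, b)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        item = b.get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results

print(run_pipeline(5))
```

No stage shares mutable state with another, so adding stages (or cores) does not require new locking; the channels are the only synchronization.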
Only on slashdot will you find a post complaining about how bad of an idea an 80-core processor is. (On a side note, I'll finally be able to open PDFs in less time than it takes to go to the bathroom and back.)
Remember that a lot of algorithms can be parallelized to use any number of cores (it gets inefficient after a point, but there definitely is an initial speedup).
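The "inefficient after a point" part is usually described with Amdahl's law, which the comment does not name but which matches the behaviour it describes. A small sketch, where the 95% parallel fraction is an assumed figure chosen only for illustration:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup when only part of the work can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Assumed workload: 95% of the run time parallelizes perfectly.
for cores in (1, 2, 8, 80):
    print(cores, "cores:", round(amdahl_speedup(0.95, cores), 2), "x")
```

Under that assumption, 8 cores give roughly a 5.9x speedup while 80 cores give only about 16x, because the serial 5% dominates once the parallel part is spread thin.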
An old-timer with old-timey ideas.
Please use a power of two for the number of cores. Base 10 sucks.
Sincerely,
/. nerds
The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
with compilers/tools meant for programming it. before virtual memory programmers had to program for their machine's RAM size and manually manage their memory using "overlays" (or so i've read), but now this concept seems horrid to younger programmers. a generation from now, programmers will read about how computers used to only have one logical core and think it ludicrous.
my uninformed, amateur guess is that functional languages will become more popular for programming massively multi-core machines (this coming from a C programmer). they will start to become faster than imperative languages because their workloads can be more easily recognized and farmed out to multiple cores.
OpenMP hides multithreading from the developer and makes parallelization almost completely transparent. A couple of OpenMP directives can parallelize a complex loop with almost no effort from the developer. That is especially easy in physical simulation and AI. http://en.wikipedia.org/wiki/OpenMP
Sorry, I obviously meant "Base 1010 sucks"...
Take a game like starcraft:
As mentioned earlier, ray tracing is embarrassingly parallel; each core can render a few hundred pixels, making real-time ray tracing possible at 30fps.
AI: some strategy applications of AI are parallel: i.e. figuring out several possible paths at once; as the path branches, more cores can be used to determine the best possible approach.
Each unit can have an AI (probably more useful in FPS games)
and finally: there is more to computing than starcraft. (sorry, Korea)
"Hate is baggage. Life's too short to be pissed off all the time." Danny Vinyard -American History X
anyone for Duke Nukem forever??
The interesting question is, if you take a special-purpose processor (GPU) and turn it into a general-purpose processor, which was the wrong classification initially?
"There are a dozen opinions on a matter until you know the truth. Then there is only one." - CS Lewis (paraprhase)
Who says each game would have to utilize 80 cores? These multi core processors make other things possible such as encoding videos and such in the background while playing a video game. Imagine with things like this everyone could easily run a game server while playing the game on the computer and having no slowdown. You could run a CS, UT, and a couple other game servers for friends all while playing one of these games!
Xbox360 = ~1TFLOPS
PS3 = ~2.18TFLOPS
According to Wikipedia
Also, why does the article compare to a BlueGene variant, when in supercomputer terms it's really competing against things like MDGRAPE-3 which are already in the PFLOP range?
Did anyone else see that?
"Even more impressive, this chip is able to achieve incredibly high clock speeds on modest power usage. Running on a 1.0v current at 110 degrees C the tile maximum frequency is 3.13 GHz while at 1.2v the tiles can run at 4.0 GHz."
That would be about 230 °F. Would Peltier coolers be mandatory?
OS X automatically uses as many cores as it can find, splitting the job between them. They swapped out the dual Core2Duos on a Mac Pro for two four-core processors and it was able to see and use every core it had.
As the article points out, this is a VLIW (Very Long Instruction Word) design -- in effect, each instruction word will be broken up into chunks, with a chunk going to each processor. This means that you can end up with some bizarre situations -- what happens, for example, if one processor needs to jump to one location in memory and the other 79 don't? Effectively, your compiler would need to be able to realize this, and have the instructions at that memory location for the 79 processors be the same. (In reality, I don't think you'd do this -- that processor would probably just sit and wait for the others.) This is not the equivalent of having two cores, with each able to run independently.
The real bottleneck here is the compiler, not the processor, because the compiler has to be able to pick up on implicit parallelism in the code and dole it out among the available cores. While it's possible to help the compiler by using a language where the programmer specifies the parallelism, if you think about it, that's the opposite direction from the progress of computer languages in the last 20 years.
The biggest problem that this technology has is that it is expensive when compared with a compute cluster, which can scale easily and can be more easily programmed. The main time the cluster won't do better are the instances where each core needs results from other cores so frequently that the overhead in message passing is too high.
Thread-based programming really isn't that hard, particularly where you have a problem space which can be split up into discrete chunks of work. Example: a Photoshop blur filter. Just divide the image up into (overlapping) chunks and blur each piece on a different thread. Another example: digital audio. Put each VST instrument on its own thread. Once your apps are well threaded (and in many cases they already are) you can simply rely on the OS to schedule them over however many cores are available. For example, I write server code on my desktop box (single core w/hyperthreading) and it runs perfectly happily on the 64 core production servers, just faster.
Of course this is simplifying things a bit, and it is hard to get the very best performance from any given environment, but you can make a big difference quite easily.
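The blur example can be sketched roughly as follows. This is a toy, not Photoshop's algorithm: it uses a 1-D box blur on a list of numbers, the function names are invented, and the "overlap" shows up as each chunk reading `radius` pixels past its own boundaries so the seams come out right.

```python
from concurrent.futures import ThreadPoolExecutor

def blur_range(pixels, start, stop, radius):
    """Box-blur output positions [start, stop), reading neighbours past the chunk edge."""
    out = []
    for i in range(start, stop):
        lo = max(0, i - radius)
        hi = min(len(pixels), i + radius + 1)
        window = pixels[lo:hi]
        out.append(sum(window) / len(window))
    return out

def parallel_blur(pixels, radius=1, workers=4):
    """Split the image into chunks and blur each chunk on its own worker thread."""
    n = len(pixels)
    step = max(1, -(-n // workers))  # ceiling division
    bounds = [(s, min(n, s + step)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: blur_range(pixels, b[0], b[1], radius), bounds)
        return [value for part in parts for value in part]

row = [0, 0, 9, 0, 0, 3, 3, 3]
print(parallel_blur(row, radius=1, workers=3))
```

The chunked result matches a single-threaded pass because the input is only read, never written, by the workers. One caveat: in standard CPython builds the global interpreter lock keeps pure-Python threads from using several cores at once, so this shows the decomposition rather than an actual speedup.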
---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"
Now that's a summary I'm willing to read- Bravo editors!!!111
Ya know, at those clock speeds/flops Java and .NET start to look attractive! ;-)
I have one too... It does 2 Tflops on the same amount of power. As long as all of the opcodes are "NOOP".
In Soviet Russia, sig types you!
I'm sure this has already been pointed out, but unless you're working with a well-documented, highly scalable and very specific problem like, say, Navier-Stokes simulations, you're not going to utilise many of those 80 cores. Running a single-threaded program will still only use one core in all likelihood. (But hey, Vista needs more cores to run all those DRM checks 30 times a second!) And most parallel programs don't scale well past 8 processors anyway, so until some new programming paradigm and better compilers are available, this is just another "Everest" a.k.a. "Because it was there and we could".
Are we allowed to imagine a Beowulf cluster of chips that obviate the need for a Beowulf cluster?
If you want a vision of the future, imagine a youtube comments section scrolling - forever.
With 80 threads.
Stop thinking sequentially. How about a thread FOR EACH AI that needs to find a path. What's the branch factor in that tree? How about a thread FOR EACH top-level branch FOR EACH thing that needs to find a path.
Got 200 units in the game? How about a thread FOR EACH one's AI, a thread to update screen data FOR EACH one, a thread FOR EACH one to accept new commands.
Gaming with autonomous AIs is embarrassingly parallel.
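A small sketch of the one-task-per-unit idea: each unit's pathfinding is an independent job handed to a worker pool. The grid, the breadth-first search, and every name here are assumptions made for illustration; a real game would use its own map format and probably A*.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

# Hypothetical map: "." is walkable, "#" is a wall.
GRID = [
    "....",
    ".##.",
    "....",
]

def shortest_path_length(start, goal, grid=GRID):
    """Breadth-first search; returns the step count, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        (r, c), dist = frontier.popleft()
        if (r, c) == goal:
            return dist
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == "." and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return None

def plan_all(units, goal, workers=8):
    """Run one pathfinding task per unit; the tasks share no mutable state."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda unit: shortest_path_length(unit, goal), units))

print(plan_all([(0, 0), (2, 0), (0, 3)], goal=(2, 3)))
```

Because each search only reads the shared map, no locking is needed, which is what makes this kind of workload easy to spread across many cores. In standard CPython the threads will not actually run in parallel, so treat this as a picture of the task breakdown rather than a benchmark.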
I didn't (yet) read the Fine Article, but I did envision an 80 cores "Intel Hasty Style" :
(Warning, this vision's description could definitely damage a Hardware Geek's brain, or cause him to get an erection, depending)
80 centimeters of stacked Very Expensive Interconnected Xeons (tm) with a 2 cubic meter deep-frozen radiator enclosing them.
It takes 40+ muscles to frown, but only four to extend your arm and bitchslap the motherfucker
Isn't this the same number of flops it takes before Microsoft gets a product right?
Imagine a Beowulf cluster on a single chip! Oh, never mind ...
But a lot more up to date :-)
I can't wait to build a hypercube with those ...
...is a version of the Sims 2 rewritten so that the Sims have a much greater degree of genuine autonomy, and for said version to be run without human intervention (and recorded) for a period of months or years on a multiple TFlop system. If the environment was made a lot more detailed than it is in the retail version of the game, and if the Sims were given somewhat more capacity for learning than what they've currently got, something tells me the results of such an experiment might be extremely interesting, given enough time.
So did I understand correctly that POV-ray at this point doesn't support parallel processing? If that's so, it would be a shame and it must really limit its usefulness in big projects.
It would be cool if, just as the routines got more sophisticated, they'd get a consumer-grade processor that could run them in real-time.
Well, it doesn't solve world hunger, but parallel processing on an 80-core chip? :-)
Yup...can eat this sucker for breakfast... well, when it can support a JVM that is
http://www.pervasivedatarush.com/
Java To Excess
Well, if you are building data-intensive apps (gigs and terabytes of data in a batch processes), then you use http://www.pervasivedatarush.com/ and not OpenMP....or COBOL !!
Totally disagree with the notion that compilers need to somehow glean implied parallelism from the source code. If .NET and J2EE taught us anything, it is that developers can indeed consume relatively complex frameworks to achieve threaded/concurrent application design. In this case, .NET and J2EE take care of OLTP computing architectures.
For data-intensive batch processing, we have http://www.pervasivedatarush.com/ and other frameworks. Yes, you have to be able to build self-contained components that can be assembled in a dataflow graph of sorts... but developers had to learn the same craft when EJBs were invented.
Bring on the cores, Intel. Do you feel lucky? Do ya?