Domain: clearspeed.com
Stories and comments across the archive that link to clearspeed.com.
Comments · 22
-
Re:Instruction set...
That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.
It's scaled that way until now. We've hit a power wall in the last few years: as you increase the number of transistors on chip it gets more difficult to distribute a faster clock synchronously, so you increase the power, which is why Nehalem is so power hungry, and why you haven't seen clock speeds really increase since the P4. In any case, we're talking about parallelism, not just "increasing the clock speed" which isn't even a viable approach anymore.
When you said "Compact" I assumed you meant the instruction set itself was compact rather than the average instruction length; I was talking about the hardware needed to decode, not necessarily code density. Even so, x86 is nothing special when it comes to density, especially compared with things like ARM's Thumb-2.
If you take a look at Nehalem's pipeline, there's a significant chunk of it simply dedicated to translating x86 instructions into RISC uops, which is only there for backwards compatibility. The inner workings of the chip don't even see x86 instructions.
Sure you can do everything the same with shared memory and channel comms, but if you have a multi-node system, you're going to be doing channel communication anyway. You also have to consider that memory speed is a bottleneck that just won't go away, and for massive parallelism on-chip networks are just faster. In fact, Intel's QPI and AMD's HyperTransport are examples of point-to-point interconnect networks: they provide NUMA on Nehalem and whatever AMD have these days. Indeed, the article says: Mattson has argued that a better approach would be to eliminate cache coherency and instead allow cores to pass messages among one another.
The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores. x86 is stuck with larger cores because of all the translation and prediction it's required to do to be both backwards compatible and reasonably well-performing. If you're scaling horizontally like that, you want the simplest core possible, which is why this chip only has 48 cores, while ClearSpeed's 2-year-old CSX700 had 192.
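To make the message-passing idea concrete, here's a toy sketch in Python. It is my own illustration, not anything from the article or Intel's chip: threads stand in for cores, `queue.Queue` stands in for a hardware channel, and the names `worker` and `parallel_sum` are made up for the example. The point is only that no worker touches shared data; each one gets its input as a message and sends its result back as a message.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive chunks on inbox, send each chunk's sum on outbox.

    A None message tells the worker to shut down.
    """
    while True:
        chunk = inbox.get()
        if chunk is None:
            break
        outbox.put(sum(chunk))


def parallel_sum(data, n_workers=4):
    """Sum data by dealing it out to workers over per-worker channels."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(inbox, results))
        for inbox in inboxes
    ]
    for t in threads:
        t.start()
    # Each worker gets exactly one chunk, followed by the shutdown message.
    for i, inbox in enumerate(inboxes):
        inbox.put(data[i::n_workers])
        inbox.put(None)
    for t in threads:
        t.join()
    # One partial sum per worker is waiting on the results channel.
    return sum(results.get() for _ in range(n_workers))
```

Since the workers never share state, the same structure would carry over to separate nodes by swapping the queues for a network transport, which is the argument being made above.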
-
why wait five years when you can buy 96 cores now?
96 cores, 10 watts, Intel has some catching up to do: http://www.clearspeed.com/products/csx600/
About the people wanting to accelerate arbitrary functions, AMD's HyperTransport has the lead there: http://www.xtremedatainc.com/Products.html, http://www.fpgajournal.com/news_2006/09/20060906_0 1.htm -
Re:76 too many cores?
In fact the summary and the write-up are very confusing, or even slightly wrong. According to what I took from the keynote, the architecture is something similar to ClearSpeed's, which already has more than 80 parallel floating-point cores.
-
Re:Any 64 bit GPU's?
-
Re:Physics Engine !!!
A friend had an interview for a job as a maths person with these people:
http://www.clearspeed.com/
They are making an accelerator card for numerical work. They claim they can get a sustained 50 GFlops in a BLAS matrix multiply. They hope to put several cards into a PC, make a farm of them, and sell the thing as a supercomputer.
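For a feel of what a sustained 50 GFlops means, here's a back-of-the-envelope sketch in Python. It assumes the usual operation count for a dense matrix multiply, about 2·m·n·k floating-point operations for an (m×k) by (k×n) product; the function names are made up for the example, and the 50 GFlops figure is just the vendor claim quoted above, not something I've measured.

```python
def matmul_flops(m, n, k):
    """Approximate flop count for an (m x k) times (k x n) matrix multiply.

    Each of the m*n output entries needs k multiplies and about k adds.
    """
    return 2 * m * n * k


def seconds_at(flops, gflops_sustained):
    """Time to execute `flops` operations at a sustained rate in GFlops."""
    return flops / (gflops_sustained * 1e9)


# Example: a 10,000 x 10,000 square multiply at the claimed 50 GFlops.
estimate = seconds_at(matmul_flops(10_000, 10_000, 10_000), 50)
```

That works out to roughly 40 seconds for a 10,000 x 10,000 multiply at the claimed rate, ignoring the time to move the matrices on and off the card.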
Their card is much, much more expensive than PhysX, but they still can't get (in my opinion) the kind of performance advantage that you'd need to really make a compelling case for the thing. -
Re:use as a cpu?
You think 7x is fast? You should check out the ClearSpeed CSX600, which is going to do 10x with AMD's coherent HT.
-
Re:What I want to see.
There is a card for you, but it will not help if you aren't a scientist.
http://www.clearspeed.com/
Folding@home (Stanford) is internally testing it. That thing is damn fast. We are speaking about 25 GFlops (sustained) speed here. -
Re:big margins
I totally agree, $1 per CPU hour is an outrage. When you think about forming large computer clusters it may make sense, but that is gradually changing. Check this out: http://www.clearspeed.com/ These guys sell a 64-bit, 50 GFLOPS co-processor for $10,000 that is added to the system through a PCI-e port, and this is a comparatively new application of good old co-processor technology. As it develops further, we can expect better prices and more processing power. For the same processing power, an array made with these babies would currently cost about half as much in hardware. It would also mean about 1/10 the cluster size, further reducing maintenance costs significantly.
-
Re:BlueGene/C will be finished soon
The man said revolutionary, not exotic. This chip is going to create a revolution because it packs **immense** computing power (the 500 MHz clock given in the post is misleading in that a lot happens in each processor in a clock cycle) while consuming very little electrical power and costing just what any chip produced in the zillions costs.
These features are what you need to build machines that interact with the real environment. A welding robot in a factory places a weld at the right spot whether the part is there or not. The machine that drives a vehicle in the desert *reliably* (un)like in the DARPA challenge needs thousands of times more computing power. Or consider a machine that moves around in an orchard and picks ripe fruit, trims the trees, fertilizes and removes weeds and is overall cost competitive with a migrant worker.
Our technology is moving in that direction now; it is the next logical step. Whether it will be Blue Gene Cyclops or a similar chip from http://www.clearspeed.com/ or a Chinese design depends on company politics. The older ones here will remember that the IBM PC was consistently downgraded by IBM management because it did what a competing IBM product (the Displaywriter) was doing, for one third the cost. As a result, IBM profited from it mildly, everybody else profited wildly. -
Re:Farked if you do, or not
-
Re:Farked if you do, or not
-
Re:92 short...
The ClearSpeed CSX600
http://www.clearspeed.com/bio/ -
Why not reuse existing technology?
ClearSpeed http://www.clearspeed.com/ is just coming to market with their CSX600 'application accelerator' processor.
It has 96 execution elements, so 96 ops run simultaneously on your data. Sounds ideal for graphics processing. Power consumption 5 watts, 50 GFlops of computing power.
And they make PCI-X cards for PC systems. You can have several cards in one system for compounded processing power. Now, all this monster would need is the graphics output parts and drivers. They even have a full development kit for both Windows and Linux. The card's programmed in C.
Perhaps the PCs of the future would have two CPUs, one linear general-purpose CPU (current x86 based) for program code and system management and one massively parallel CPU for tasks better suited to it. If there's no one true road to happiness, make it two then. -
New trend in computing. Vector processing
Has anyone else noticed that vector processing is gaining momentum? Some array processing links.
-
Re:echoes of Transmeta
I have little faith in processors from unknown companies that claim to do what Intel, AMD and IBM combined haven't yet been able to achieve.
Well, Intel, AMD, and IBM haven't really tried. This chip isn't a normal microprocessor, it's an "array processor", meaning that it's designed specifically to execute operations on large floating-point arrays. The market for such processors has been rather small in the past. It's hard to write code for them with traditional programming tools, and they're only applicable to a restricted class of problems. Most high-end supercomputing these days uses massively parallel sets of conventional processors. So there hasn't been an opportunity large enough to be worthy of the attention of the big guys.
Once you take a closer look at ClearSpeed's PR, this thing seems interesting, but not worth making a huge fuss about. The ClearSpeed press release shows that 25.6 GFLOPS is the peak performance. That's the absolute maximum the chip can do for the easiest kind of problem, which is why some people call peak performance "guaranteed never to exceed". With 64 processing elements, 25.6 GFLOPS means 400 MFLOPS per element. According to ClearSpeed's Microprocessor Forum presentation (warning: this is a slow download), the chip runs at 200 MHz, so this implies each element can do up to 2 floating-point operations per cycle. In fact, ClearSpeed's press release claims only "more than twice the processing speed of competitive products". So this isn't exactly an earth-shattering advance.
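If you want to check that arithmetic yourself, here it is as a few lines of Python. The inputs are just the numbers quoted above (25.6 GFLOPS peak, 64 processing elements, 200 MHz); the function name is made up for the example.

```python
def flops_per_cycle_per_element(peak_gflops, n_elements, clock_mhz):
    """Peak floating-point operations per clock cycle for one element.

    peak_gflops * 1000 / n_elements gives MFLOPS per element; dividing by
    the clock in MHz gives operations per cycle.
    """
    per_element_mflops = peak_gflops * 1000 / n_elements
    return per_element_mflops / clock_mhz


# Figures from the press release and the Microprocessor Forum presentation.
per_cycle = flops_per_cycle_per_element(25.6, 64, 200)
```

With those inputs the result comes out at about 2 operations per cycle per element, which would be consistent with something like one multiply and one add per cycle (that last part is my guess, not something the presentation states).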
The power usage figure is more eye-opening, of course. Here ClearSpeed's press release claims they are twenty times more efficient than their competitors. But it turns out they're comparing the thing with conventional high-speed processors like the Pentium 4, hardly paragons of power efficiency. And ClearSpeed's presentation says they are using IBM's 0.13-micron process, so IBM should get some credit for providing the semiconductor technology to make this possible. 2-3 watts for modern 200 MHz logic using that kind of process doesn't sound outside the realm of possibility. (Remember, this isn't a conventional superscalar processor which requires huge amounts of logic for instruction issuing and control. This thing is mostly a simple mass of ALUs.)
In any case, ClearSpeed's presentation says they'll be sampling in the 4th quarter of 2003, so they'll have to demonstrate real hardware soon.
-
Re:Two words
Stock. Price.
Except that they're privately owned.
- Peter -
Re:not very impressive
According to this presentation, it runs at 200MHz. It's refreshing to see someone taking this approach, rather than insane clock frequency/power dissipation. I'll be impressed, though, if real application software can use it efficiently.
-
Vapor... or not?
From ClearSpeed's website:
HPEC 2003
Lexington, MA
September, 2003
Lockheed-Martin and Worldscape Defense presented the results of their work using ClearSpeed's processing solutions.
They benchmarked FFT and pulse compression algorithms and found between 20 and 30 times improvement in performance per watt against competitive solutions.
That page also has a PDF of their presentation at the 2003 Microprocessor Forum. Whether this technology will pan out is a matter for the markets, but ClearSpeed isn't looking very vaporous.
-
Somewhat off Topic but...
I would expect to see the price of supercomputing drop dramatically in the coming year. This is based on the press release from ClearSpeed regarding their new processor. Here is a Wired article that talks a little more about it. It's worth the read.
-
Very soon we will have that power..
Not that kind of purchasing power, but the super computing power rather..
According to this Wired article, a small firm in CA called ClearSpeed will soon revolutionize the PC space with supercomputing power.
I know we will all believe it when we can find these chips in Best Buy aisle 4, but from where I am sitting (in a Microsoft BizTalk Server 2004 training session, boring as hell, being inundated with claims of innovation by a clueless trainer who has programmed in Visual Basic her entire life), ClearSpeed is as close to innovation as anything I can think of, and it's more of that which this industry/world needs, and less of Microsoft.
Oh..lunch time.. gotta go hit on some free sandwiches. Viva la microsoft..