Clockless Computing?
ContinuousPark writes: "Ivan Sutherland, father of computer graphics, has spent the last ten years designing chips that don't use a clock. He's been proposing a method called asynchronous logic, in which no clock signal is distributed to regulate every part of the chip. The article doesn't give many technical details (greatly needed), but Sutherland, now doing research for Sun, says that significant breakthroughs have been made recently to make this technology viable for mass production. It is estimated that 15% of a chip's circuitry is dedicated to distributing the clock signal and as much as 20% of the power is consumed by the clock. This is indeed intriguing; what unit will replace the familiar megahertz?"
So, what is a CFPP? It is a processor with a pipeline where data and instructions flow in opposite directions, with the instructions usually thought of as moving "up" and data as moving "down". The functional units (FU) are attached as sidings to the main pipeline. Each FU launches from a single pipeline stage and writes its results to a different stage, further "up" the pipeline. The main goals of this architecture were to make the processor simple and regular enough to create a correctness proof and to achieve purely local control.
If Sun ever produces a processor that is asynchronous, it will likely look similar to this.
--
"You can put a man through school,
But you cannot make him think."
Ben Harper
What makes it interesting is that you have to fundamentally redesign your whole logical design so that you have a general purpose design.
With clocked computing, it is easy to see how you would flush buffers, etc. Clockless computing would be more problematic, and of course, would probably be proprietary.
My initial reaction is that it would work easiest in things like embedded processing. I also wonder if there would have to be some sort of evolution similar to what we have seen over the past few years with Intel, Motorola, etc.
One must not forget that the increases in performance for an awful lot of these chips have to do with clock speed increases, as well as code designed to take advantage of certain coding features in the hardware.
An early example of this is when the Pentiums first came out. For a while you had 486 boxes and Pentiums with the same clock speeds on the market. You could compare performance between systems with the same video cards, same RAM, same cache, etc., even though the chipsets were not the same. This was educational. As I recall, the performance boost for software not taking advantage of the Pentium feature set was about 20-25% (?). I may have this wrong, of course.
But at a time when Pentium systems cost twice as much as a 486, it was definitely buying for the future.
"It is a greater offense to steal men's labor, than their clothes"
FLOPS, of course.
Even without a processor clock, you should still be able to measure how many operations it can do per (real-time) second.
It is both amusing and frustrating to hear all of the "armchair computer scientists" discussing the reasons this technology is a bad idea. As if they knew more about the subject than the many PhDs who have dedicated their careers to it, based on the knowledge gleaned from the one Computer Architecture class the poster took as an undergraduate.
I was invited to work on a team at the University of Utah (Sutherland's old school) where they were researching this very topic. This is old news; they have been working on it for years. And as some people have correctly pointed out, there are both good and bad points to sync or async logic.
There are two major reasons to work on async logic: clock skew and power savings. The reason for power savings alone is a good one. People here have been complaining that it "is not worth it for only a 20% power savings".
Yes it is! In a modern office, computers end up taking a lot of power. Imagine your local server room. Don't you think they would like a 20% decrease in their power bill?
That means instead of building five power plants, you only need four (on a grand scale; please no newbie replies like doodz, thiz guy thinkz you n33d five pawer pl3nts to run a box). That is significant. And with today's high-MHz CPUs this means even more. Some think >50% savings, and even more during low cycle time.
The clock skew issue has been covered somewhat here. One of the major hurdles in solving the design problem is the development of new design tools, which is what many people at Utah are currently working on.
The way to move forward is not to argue for the limitations of systems of the past. Don't make me pull out Ken Olsen quotes here.
Well, it's not exactly a dataflow machine, anyway.
The old E&S machines were dataflow architectures at the equivalent of the "machine code" level. Newer architectures are using similar ideas, but in a way that does not require details of the dataflow model leaking outside the chip.
Look at the Pentium 3, for example. It exploits dataflow ideas at the microcode level by prefetching several machine code instructions, splitting them into a larger set of "micro-instructions" and then scheduling them together. That's not really a dataflow architecture, but it does use ideas from it: the idea of deciding how to schedule the instructions at run-time.
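To make the run-time scheduling idea concrete, here is a minimal Python sketch. It is not a model of the Pentium 3 or any real CPU. The micro-op names and the `dataflow_schedule` helper are made up for illustration; the only idea taken from the above is that an operation is issued as soon as its inputs are ready, rather than strictly in program order.

```python
def dataflow_schedule(ops):
    """Group operations into waves; each wave holds every op whose inputs are ready.

    ops maps a (hypothetical) micro-op name to the set of ops it depends on.
    """
    done = set()
    remaining = dict(ops)
    waves = []
    while remaining:
        # An op is ready once everything it depends on has completed.
        ready = sorted(name for name, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic dependency between operations")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


# Hypothetical micro-ops for something like "mem[c] = mem[a] + mem[b]".
micro_ops = {
    "load_a": set(),
    "load_b": set(),
    "add": {"load_a", "load_b"},
    "store": {"add"},
}
print(dataflow_schedule(micro_ops))
# prints [['load_a', 'load_b'], ['add'], ['store']]
```

The two loads have no dependency on each other, so they land in the same wave even though a program would list them one after the other.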
The new clockless CPUs will exploit dataflow ideas by implementing a kind of dataflow machine between the functional units of the CPU itself. The CPU, remember, is like an interpreter for machine code. Since the "program" for that interpreter does not change, it can be implemented in a "plugboard" kind of way and people or programs producing machine code will never know the difference, apart from speed.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
I'm also an old fart and not some software geek to whom every hardware technology mentioned is something unheard of before. That being said...
The first computing machines weren't synchronous. I forget the names, but this kind of thing was being done way back when because it was impractical to distribute a common clock across the racks and racks of equipment that made up a CPU back then.
Also, Motorola's PowerPC chips implement an asynchronous divider, so you might be using asynchronous technology right now.
The idea of having a computer run as fast as the transistors can go is a great goal, but there are some impractical aspects to the use of asynchronous circuits.
First, how do you know your computation is done? Well, there are several different ways of telling. You can use a current sensor to decide when your gates have settled out for a decent length of time, or you can wait a predetermined amount of time based on worst case. All solutions involve bloating the design with more transistors to time the handshaking between Muller-C elements. Whether it's some type of current sensor or just inverter chains, there's at least 10% of a circuit tied up in timing (and it can run much, much higher).
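For anyone who hasn't met a Muller C-element, here is a small behavioural sketch in Python. It only models the logical rule (the output follows the inputs when they agree and holds its previous value when they disagree); the `CElement` class is an illustrative name, and nothing here models delays, current sensing, or real transistor counts.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        # When both inputs agree, the output takes their value.
        # When they disagree, the output holds its previous state.
        if a == b:
            self.out = a
        return self.out


c = CElement()
print(c.update(1, 0))  # inputs disagree, output holds 0
print(c.update(1, 1))  # both high, output rises to 1
print(c.update(0, 1))  # inputs disagree, output holds 1
print(c.update(0, 0))  # both low, output falls to 0
```

That hold-until-both-agree behaviour is what lets two stages wait for each other, and it is also extra state that a clocked design would not need at that point.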
Also, what do you do with the data once you've processed it so fast? The IOBs are only so quick in driving pins, so while the core of the design can run really stinkin' fast asynchronously, it's hampered by the ability to get data in and out.
Design verification is also a nightmare with asynchronous logic. It's a hard enough problem figuring out my longest path between registers across process and temperature variations, but to add in the factor of not knowing your clock is... well, icky.
Finally, what about noise in an asynchronous design? For my current work, I have to make sure everything happens synchronously... or I end up with nasty noise in my CCD section. I can tolerate a little bit of asynchronous behaviour, but not a lot.
Where asynchronous technology makes sense now is something like Motorola's divider circuit. By making it asynchronous, they gain the speed advantage of not having to rely on a slower, global clock distribution network; by making it a local function, they avoid the problem of slow IO; and by using it for a "small" amount of their design, they avoid die bloat and noise problems.
I guess the idea of asynchronous design boils down to one of history. If it's such a wonderful thing and has been around for so long, why doesn't everybody do it? Well, because it has drawbacks and the design philosophy rarely fits the design criteria (cost, tools, reliability, performance, and function).
I don't think this is a newsworthy item. In asynchronous design, it's pretty much ALL old hat. Academic papers recycle the same ideas and the UK email reflector for asynchronous "researchers" goes quiet for months at a time.
Maybe tomorrow, /. will report the discovery of fire.
Unfortunately for Sutherland, there's something called the PS300.
Back in the late 70's and early 80's, his company, Evans and Sutherland, ruled the world of computer graphics with their very slick Picture System machines. These were peripherals to PDP-11s and VAXes, and were wonderfully programmable machines. There was a fast interface between host memory and Picture System memory, letting you mess with the bits to your heart's content. We had a couple of them at NYIT's computer graphics lab, and did a lot of great animation with them.
E&S's next machine, though, was the PS300. This was a far more powerful machine, its first machine with a raster display. It was an advance in every way, except that it imposed a dataflow paradigm on programming the machine. You could only write programs by wiring up functional units. It was astonishingly difficult to write useful programs using this technology. Everybody I know who tried (and this was the early 80s, when people were used to having to work very hard to get anything on the screen at all) gave up in frustration and disgust.
ILM got the most out of the machine, but that was by imposing their will on E&S to provide them with a fast direct link to the PS300's internal frame buffer.
Basically, dataflow ideas killed the PS300, which destroyed the advantage that E&S had as the pioneer graphics company, and they have never recovered from it. While the idea is charming, and to a hardware engineer it makes a lot of sense, programming them takes you back to the plugboard era of the very first WW-II machines. Nobody wants to do that.
thad
I love Mondays. On a Monday, anything is possible.
The problem with Mips:
;-)
Not all Mips are created equal. For example: is it fair and reasonable to compare a CISC Mips to a RISC Mips? The CISC may be doing something like a string move with one instruction while the RISC machine does it with a series of instructions in a loop. Obviously this is an apples and oranges comparison.
Okay - next you look at Flops - aren't Flops the same on every machine? Well - no, though that is probably less of an issue for comparing IEEE based implementations. The question comes up (and it has already been mentioned) that Flops don't compare useful work loads! The vast majority of computer work loads don't involve significant floating point operations. (Yes you can find workloads where that is the case - but it isn't the majority situation.)
So it comes down to this: comparing computer "systems" is a tricky business. Even MHz within the same architecture family doesn't work, because you don't know how efficiently the machine is designed - the hardware might be capable of more than one instruction per clock!
Finally - I don't believe the estimate of up to 15% for clock distribution. It's more like 1%-2%. (I do chip design for a living, so at least I have an educated opinion on this!) The clocks ARE a significant part of the power issue though. CMOS burns power when signals move. The clock moves. Simple enough analysis there.
Asynch design methods have been around forever, but present a number of problems for traditional design tools that depend on the clock to do their work. Further, there are a lot of chip designers that throw up their hands if you just mention the words "asynchronous design" to them. Any push to this kind of design would be traumatic, to say the least.
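The "CMOS burns power when signals move" point is usually written as the textbook dynamic power relation P = a * C * V^2 * f, where a is the fraction of cycles in which a node switches, C is the switched capacitance, V the supply voltage and f the clock frequency. A rough Python sketch follows; the function name and every number in it are made up for illustration, not measurements from any chip.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """Textbook CMOS switching power: P = a * C * V^2 * f (in watts)."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Hypothetical values: a clock net switches every cycle (activity 1.0),
# while a typical logic node switches far less often (say 0.1).
clock_net = dynamic_power(1.0, 1e-9, 2.0, 1e9)
logic_net = dynamic_power(0.1, 1e-9, 2.0, 1e9)
print(clock_net, logic_net)
```

With the same capacitance, the clock net burns ten times the power of the logic node in this made-up case, purely because it never sits still. That is why the clock can be a big slice of the power budget even if it is a small slice of the circuitry.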
Have you compiled your kernel today??
Here's the URL for the asynchronous design group's homepage. There's more info there.
The Amulet project has been going for over 10 years (it's an asynchronous ARM-like core, IIRC). I remember seeing a circuit that did asynchronous addition (or was it multiplication?) in a lecture about 2 years ago.
Another advantage besides power is speed; the speed isn't determined by the worst case of the most expensive instruction. (e.g. adding 0 and 1 can be done a lot quicker than adding (2^31)-1 and 1, because no carry has to ripple through the word)
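The data-dependent timing in the 0 + 1 versus (2^31)-1 + 1 example comes from how far a carry has to ripple. Here is a small Python sketch that measures the longest carry chain in a simple ripple-carry add. The `longest_carry_chain` helper is an illustrative name, and chain length is only a stand-in for delay; a real self-timed adder would be built differently.

```python
def longest_carry_chain(a, b, width=32):
    """Length of the longest run of bit positions a single carry travels through."""
    carry = 0
    run = 0
    longest = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        generate = ai & bi    # this bit creates a carry on its own
        propagate = ai ^ bi   # this bit passes an incoming carry along
        if generate:
            run = 1           # a fresh carry starts here
        elif propagate and carry:
            run += 1          # the existing carry ripples one bit further
        else:
            run = 0           # the carry (if any) dies here
        carry = generate | (propagate & carry)
        longest = max(longest, run)
    return longest


print(longest_carry_chain(0, 1))          # prints 0
print(longest_carry_chain(2**31 - 1, 1))  # prints 31
```

A clocked adder has to budget for the 31-bit case on every add; a self-timed one could, in principle, finish the first case as soon as the sum bits settle.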
First, most ASICs built these days are built with logic synthesis tools from Synopsys or Cadence. The inputs are typically register transfer level (RTL) code written in either the VHDL or Verilog languages. These logic synthesis tools have been around for quite some time (well over a decade for Synopsys) and have a significant infrastructure built around them. This design paradigm and sets of tools all assume synchronous logic. I can't fathom how you would build/constrain/debug these circuits in an asynchronous style with the existing toolset. And don't say "we'll use something else". It is these types of tools which have made our million gate ASICs possible. If we were still using schematics or other hack tools we would barely have passed the 80286. The current design tools took a long time to develop, hone, and get the bugs out of. The amount of money involved in just the tools is on the order of billions of dollars per year. That's a lot of inertia to move away from.
Second, yes the asynchronous approach can reduce the power consumption of ASICs. However, there are a lot of clocked approaches that do a very good job of reducing power. It all depends on what goals you have when you design the ASIC. Having multiple clocks and clock gating is common in the low power and embedded domains. It hasn't been as much of a factor in desktop systems but is certainly in use in handheld devices. The Crusoe takes these approaches to an extreme level. It's all a matter of what you want to design for and time to market pressures.
Lastly, speed. I think folks forget the feedback path. If you're going to rely on this asynchronous handshake, it requires a given stage to hold its outputs until the next stage acknowledges (asynchronously) that it got the data. This means the given stage can't accept anything new yet. This cascades/ripples back through the pipeline. This feedback takes time (and logic levels) that don't exist in clocked logic. Imagine an automotive assembly line where things could only move forward if each station got permission from his adjacent stations. In clocked logic you've guaranteed that the data is ready to move forward because you've calculated these things out. You've removed a bunch of communication overhead. Yes, there is slack in the synchronous pipeline, but for the most part current designs are pretty well balanced so that each stage uses a large portion of its clock cycle.
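The feedback effect can be seen in a toy Python model. This is only a sketch under simplifying assumptions: each stage holds one item, a stage may hand its item forward only if the next stage was already empty at the start of the round, and the function names are made up. It is not a model of any particular asynchronous pipeline style.

```python
def handshake_round(stages, sink_ready=True):
    """One round of handshakes; stage 0 is the input end, the last stage feeds the sink."""
    n = len(stages)
    can_move = []
    for i in range(n):
        # The successor must have been empty (acknowledged) before this round began.
        successor_free = stages[i + 1] is None if i + 1 < n else sink_ready
        can_move.append(stages[i] is not None and successor_free)
    nxt = list(stages)
    delivered = None
    for i in range(n):
        if can_move[i]:
            if i + 1 < n:
                nxt[i + 1] = stages[i]
            else:
                delivered = stages[i]
            nxt[i] = None
    return nxt, delivered


def drain(stages):
    """Run rounds until the pipeline is empty; record what reaches the sink each round."""
    out = []
    while any(s is not None for s in stages):
        stages, delivered = handshake_round(stages)
        out.append(delivered)
    return out


print(drain(["c", "b", "a"]))
# prints ['a', None, 'b', None, 'c']
```

In this toy model a full pipeline delivers one item every other round, because each stage has to wait for the empty slot to ripple back to it. A clocked pipeline under the same assumptions would shift everything forward on every tick.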
That's about all I can think of at the moment. I need to be getting home before I get snowed in! ;-) Just a few comments from a digital hardware designer. Hope this provided some food for thought...
Asynchronous ARM core nears commercial debut (1998)
ARM researches asynchronous CPU design (Feb 1995)
AMULET3: A High-Performance Self-Timed ARM Microprocessor (1998)