Clockless Computing?
ContinuousPark writes: "Ivan Sutherland, father of computer graphics, has spent the last ten years designing chips that don't use a clock. He's been proposing a method called asynchronous logic, in which no clock signal is distributed to regulate every part of the chip. The article doesn't give many technical details (greatly needed) but Sutherland, now doing research for Sun, says that significant breakthroughs have been made recently to make this technology viable for mass production. It is estimated that 15% of a chip's circuitry is dedicated to distributing the clock signal and as much as 20% of the power is consumed by the clock. This is indeed intriguing; what unit will replace the familiar megahertz?"
So, what is a CFPP? It is a processor with a pipeline where data and instructions flow in opposite directions, with the instructions usually thought of as moving "up" and data as moving "down". The functional units (FU) are attached as sidings to the main pipeline. Each FU launches from a single pipeline stage and writes its results to a different stage, further "up" the pipeline. The main goals of this architecture were to make the processor simple and regular enough to create a correctness proof and to achieve purely local control.
If Sun ever produces a processor that is asynchronous, it will likely look similar to this.
--
"You can put a man through school,
But you cannot make him think."
Ben Harper
What makes it interesting is that you have to fundamentally rework your whole logic design to arrive at a general-purpose design.
With clocked computing, it is easy to see how you would flush buffers, etc. Clockless computing would be more problematic, and of course, would probably be proprietary.
My initial reaction is that it would work most easily in things like embedded processing. I also wonder whether there would have to be some sort of evolution similar to what we have seen over the past few years with Intel, Motorola, etc.
One must not forget that the performance increases for an awful lot of these chips have come from clock speed increases, as well as from code designed to take advantage of particular features in the hardware.
An early example of this is when the Pentiums first came out. For a while you had 486 boxes and Pentiums with the same clock speeds on the market. You could compare performance between systems with the same video cards, same RAM, same cache, etc., even though the chipsets were not the same. This was educational. As I recall, the performance boost for software not taking advantage of the Pentium feature set was about 20-25% (?). I may have this wrong, of course.
But at a time when Pentium systems cost twice as much as a 486, it was definitely buying for the future.
"It is a greater offense to steal men's labor, than their clothes"
FLOPS, of course.
Even without a processor clock, you should still be able to measure how many operations it can do per (real-time) second.
It is both amusing and frustrating to hear all of the "armchair computer scientists" discussing the reasons this technology is a bad idea. As if they knew more about the subject than the many PhDs who have dedicated their careers to it, based on the knowledge gleaned from the one Computer Architecture class the poster took as an undergraduate.
I was invited to work on a team at the University of Utah (Sutherland's old school) where they were researching this very topic. This is old news; they have been working on it for years. And as some people have correctly pointed out, there are both good and bad points to sync or async logic.
There are two major reasons to work on async logic: clock skew and power savings. The reason for power savings alone is a good one. People here have been complaining that it "is not worth it for only a 20% power savings".
Yes it is! In a modern office, computers end up taking a lot of power. Imagine your local server room. Don't you think they would like a 20% decrease in their power bill?
That means instead of building five power plants, you only need four (on a grand scale; please no newbie replies like doodz, thiz guy thinkz you n33d five pawer pl3nts to run a box). That is significant. And with today's high-MHz CPUs this means even more. Some think >50% savings, and even more during low cycle time.
The clock skew issue has been covered somewhat here. One of the major hurdles in solving the design problem is the development of new design tools, which is what many people at Utah are currently working on.
The way to move forward is not to argue for the limitations of systems of the past. Don't make me pull out Ken Olson quotes here.
Well, it's not exactly a dataflow machine, anyway.
The old E&S machines were dataflow architectures at the equivalent of the "machine code" level. Newer architectures are using similar ideas, but in a way that does not require details of the dataflow model leaking outside the chip.
Look at the Pentium 3, for example. It exploits dataflow ideas at the microcode level by prefetching several machine code instructions, splitting them into a larger set of "micro-instructions" and then scheduling them together. That's not really a dataflow architecture, but it does use an idea from it: deciding how to schedule the instructions at run-time.
The new clockless CPUs will exploit dataflow ideas by implementing a kind of dataflow machine between the functional units of the CPU itself. The CPU, remember, is like an interpreter for machine code. Since the "program" for that interpreter does not change, it can be implemented in a "plugboard" kind of way and people or programs producing machine code will never know the difference, apart from speed.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
I'm also an old fart and not some software geek to whom every hardware technology mentioned is something unheard of before. That being said...
The first computing machines weren't synchronous. I forget the names, but this kind of thing was being done way back when because it was impractical to distribute a common clock across the racks and racks of equipment that made up a CPU back then.
Also, Motorola's PowerPC chips implement an asynchronous divider, so you might be using asynchronous technology right now.
The idea of having a computer run as fast as the transistors can go is a great goal, but there are some impractical aspects to the use of asynchronous circuits.
First, how do you know your computation is done? Well, there are several different ways of telling. You can use a current sensor to decide when your gates have settled out for a decent length of time, or you can wait a predetermined amount of time based on the worst case. All solutions involve bloating the design with more transistors to time the handshaking between Muller-C elements. Whether it's some type of current sensor or just inverter chains, there's at least 10% of a circuit tied up in timing (and it can run much, much higher).
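For readers who haven't met the Muller C-element mentioned above, here is a minimal behavioural sketch in Python (the class and method names are made up for illustration, and this models only the logic, not any timing or transistor-level detail). The output changes only when both inputs agree; otherwise it holds its previous value, which is what lets it act as the "both sides are ready" primitive in a handshake.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element.

    The output follows the inputs when they agree and holds its
    previous value when they differ.
    """

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:          # both inputs agree: output switches to that value
            self.out = a
        return self.out     # inputs disagree: output is unchanged


c = CElement()
inputs = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
trace = [c.update(a, b) for a, b in inputs]
print(trace)  # [0, 0, 1, 1, 0]
```

Note how the output only rises once both inputs are 1 and only falls once both are 0 again.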
Also, what do you do with the data once you've processed it so fast? The IOBs are only so quick in driving pins, so while the core of the design can run really stinkin' fast asynchronously, it's hampered by the ability to get data in and out.
Design verification is also a nightmare with asynchronous logic. It's a hard enough problem figuring out my longest path between registers across process and temperature variations, but to add in the factor of not knowing your clock is... well, icky.
Finally, what about noise in an asynchronous design? For my current work, I have to make sure everything happens synchronously... or I end up with nasty noise in my CCD section. I can tolerate a little bit of asynchronous behaviour, but not a lot.
Where asynchronous technology makes sense now is something like Motorola's divider circuit. By making it asynchronous, they gain the speed advantage of not having to rely on a slower, global clock distribution network; by making it a local function, they avoid the problem of slow IO; and by using it for a "small" amount of their design, they avoid die bloat and noise problems.
I guess the idea of asynchronous design boils down to one of history. If it's such a wonderful thing and has been around for so long, why doesn't everybody do it? Well, because it has drawbacks and the design philosophy rarely fits the design criteria (cost, tools, reliability, performance, and function).
I don't think this is a newsworthy item. In asynchronous design, it's pretty much ALL old hat. Academic papers recycle the same ideas and the UK email reflector for asynchronous "researchers" goes quiet for months at a time.
Maybe tomorrow, /. will report the discovery of fire.
This is indeed intriguing; what unit will replace the familiar megahertz?
Given the absence of a clock, I'd go for Inhertzia.
Karma karma karma karma karmeleon: it comes and goes, it comes and goes.
The story is about asynchronous computing, not about clocks in general. Asynchronous computing is to synchronous computing as functional programming is to imperative programming. Sure you may have methods of synchronizing with external entities, but the internal processes are (mainly) asynchronous.
The brain is an excellent example of parallel asynchronous computing, since a neuron will only fire when its input threshold has been reached. However, many internal processes in the brain may in fact be more or less synchronous, perhaps because that is an evolutionary advantage :-) So the basic idea is that a neuron is asynchronous in principle, but groups of them may find it easier to communicate synchronously.
- Steeltoe
http://www.debunkingskeptics.com/
Unfortunately for Sutherland, there's something called the PS300.
Back in the late 70's and early 80's, his company, Evans and Sutherland, ruled the world of computer graphics with their very slick Picture System machines. These were peripherals to PDP-11s and VAXes, and were wonderfully programmable machines. There was a fast interface between host memory and Picture System memory; letting you mess with the bits to your heart's content. We had a couple of them at NYIT's computer graphics lab; and did a lot of great animation with them.
E&S's next machine, though, was the PS300. This was a far more powerful machine, its first machine with a raster display. It was an advance in every way, except that it imposed a dataflow paradigm on programming the machine. You could only write programs by wiring up functional units. It was astonishingly difficult to write useful programs using this technology. Everybody I know who tried (and this was the early 80s, when people were used to having to work very hard to get anything on the screen at all), every one, gave up in frustration and disgust.
ILM got the most out of the machine; but that was by imposing their will on E&S to provide them with a fast direct link to the PS300's internal frame buffer.
Basically, dataflow ideas killed the PS300, which destroyed the advantage that E&S had as the pioneer graphics company, and they have never recovered from it. While the idea is charming, and to a hardware engineer it makes a lot of sense, programming them takes you back to the plugboard era of the very first WW-II machines. Nobody wants to do that.
thad
I love Mondays. On a Monday, anything is possible.
Our group at Cornell University works on asynchronous design. My advisor built an asynchronous MIPS processor at Caltech a couple of years ago. It works, and it is extremely energy-efficient (better than pretty much anything in existence for the same process). We use a different design methodology than Sutherland's group, and none of the criticisms posted here about asynchronous design apply to us (for example, all of our circuits -- including full CPUs -- have been formally proved to be 100% correct).
c as e.html
If anyone's interested, our group's page is:
http://vlsi.cornell.edu/
Anyone who wants a good overview of asynchronous design should read this paper:
http://vlsi.cornell.edu/~rajit/abstracts/async-
ck
A fully asynchronous design requires lots of ready signals, or some very careful time-of-flight constraints. Aside from the fact that the current popular logic-synthesis tools don't provide neatly packaged solutions for this kind of design, if you don't implement this stuff in an intelligent manner, you can easily create a design which completely destroys any advantage over a synchronous design in speed, power, reliability and/or area.
On the other hand, one more advantage that I haven't seen mentioned about asynchronous design is modularity - most synchronous designs can only be verified for correctness in the context of the global clock signal, whereas if you've verified the correctness of an asynchronous module, you can plug it in wherever its functionality fits, without having to adjust all the stuff around it.
When you think about it, however, you will note that synchronous design is actually just a SUBSET of asynchronous design - the clock signals are just a way of indicating a "data ready" condition to the next bunch of logic gates. Careful logic designers who hold this viewpoint can design hybrid synchronous/asynchronous designs, where the overall design is actually a bunch of smaller synchronous designs, where each block of synchronous logic receives a "clock" which is actually a data ready signal for the logic block as a whole.
The problem with Mips:
;-)
Not all Mips are created equal. For example: is it fair and reasonable to compare a CISC Mips to a RISC Mips? The CISC may be doing something like a string move with one instruction while the RISC machine does it with a series of instructions in a loop. Obviously this is an apples-and-oranges comparison.
Okay - next you look at Flops - aren't Flops the same on every machine? Well - no, though that is probably less of an issue when comparing IEEE-based implementations. The question comes up (and it has already been mentioned) that Flops don't compare useful workloads! The vast majority of computer workloads don't involve significant floating point operations. (Yes, you can find workloads where that is the case - but it isn't the majority situation.)
So it comes down to this: comparing computer "systems" is a tricky business. Even MHz within the same architecture family doesn't work, because you don't know how efficiently the machine is designed - the hardware might be capable of more than one instruction per clock!
Finally - I don't believe the estimate of up to 15% for clock distribution. It's more like 1%-2%. (I do chip design for a living... at least I have an educated opinion on this!) The clocks ARE a significant part of the power issue though. CMOS burns power when signals move. The clock moves. Simple enough analysis there.
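The "CMOS burns power when signals move" argument is usually written with the standard first-order formula for dynamic power. This is the textbook approximation, added here for illustration rather than taken from the post, and the symbols are the conventional ones:

```latex
P_{\text{dyn}} = \alpha \, C \, V_{dd}^{2} \, f
```

Here $C$ is the switched capacitance, $V_{dd}$ the supply voltage, $f$ the clock frequency, and $\alpha$ the activity factor (roughly, the fraction of cycles in which a node makes a power-consuming transition). Most logic nodes have $\alpha$ well below 1, while the clock net toggles every cycle, which is why it can take a large share of the power even if it is a small share of the circuitry.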
Asynch design methods have been around forever, but present a number of problems for traditional design tools that depend on the clock to do their work. Further, there are a lot of chip designers who throw up their hands if you just mention the words "asynchronous design" to them. Any push to this kind of design would be traumatic, to say the least.
Have you compiled your kernel today??
I think in such a system, other features (code optimization, use of 3D accelerators, etc) will be more important than the speed of an add. It will even take several years of experimentation to determine what optimizations to make (how many times is it better to add than multiply, how should loops be unrolled, etc).
I think many traditional measurements will become worse than useless: they will be misleading. Since a lot of your repetitive math operations may be offloaded to your 3D accelerator, it is questionable whether floating-point operations per second would be a real indicator, even if you could decide how to measure it. I wouldn't want the manufacturer optimizing for that over other, useful things.
A better question is, how long does a NOP last? Won't this system optimize it out? How can you time a NOP without a clock?
One problem with asynchronous systems is testing.
If you have a chip where some of the units are slower than expected, you might get curious interactions and "race conditions" that are very hard to test for before you put the chip into service.
Also, designing with asynchronous logic has been difficult - designing clocked and even pipelined systems is a breeze compared to dealing with asynchronous design. A lot of the structured methods that have been developed for conventional clocked circuits cannot be used, and so designers have a lot of trouble building complex systems.
A little bit of a pain, but far from impossible. Anyone who works on software for a multithreaded, multiprocessor, or distributed environment solves asynchrony-related problems all the time. We do it by having locks instead of clocks; hardware folks can do and on occasion have done just the same. I'm sorry to hear that such basically simple problems are considered unsolvable by garden-variety EEs.
Slashdot - News for Herds. Stuff that Splatters.
Here's the URL for the asynchronous design group's homepage. There's more info there.
You shouldn't use the term "Tri-State" in this context. Tri-State, a trademark of National Semiconductor, means drivers that can be put into a high-impedance output mode and so be disconnected from a bus in a simple way. What you are referring to is called "Multiple Valued Logic" and has been researched forever. It has found its way into a few products (ROMs most notably) but in general is more work than it's worth.
There have already been commercially successful asynchronous computers. For instance, in the DEC PDP-10 family, the KA10 (1968) and KI10 (1972) processors were asynchronous, as was their predecessor, the 166 processor of the PDP-6 (1964). The PDP-10 family was commonly found in universities until the late 1980s.
The Amulet project has been going for over 10 years (it's an asynchronous ARM-like core, IIRC). I remember seeing a circuit that did asynchronous addition (or was it multiplication?) in a lecture about 2 years ago.
Another advantage besides power is speed: the operating speed isn't dictated by the worst case of the most expensive instruction (e.g. adding 0 and 1 can be done a lot quicker than adding (2^31)-1 and 1, because no carry has to ripple through the word).
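To make the adder example concrete, here is a small Python sketch that counts how far a carry has to ripple in a plain ripple-carry adder. It assumes a 32-bit ripple-carry design purely for illustration (real adders use faster carry schemes, and the function name is hypothetical); the point is only that the settling time of such a circuit depends on the operands, which a self-timed design can exploit and a clocked one cannot.

```python
def longest_carry_chain(a, b, width=32):
    """Length of the longest run of consecutive bit positions that
    produce a carry-out when adding a and b in a ripple-carry adder."""
    carry = 0
    run = longest = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        # Carry-out is generated (x and y) or propagated (carry-in and x xor y).
        carry = (x & y) | (carry & (x ^ y))
        run = run + 1 if carry else 0
        longest = max(longest, run)
    return longest


print(longest_carry_chain(0, 1))          # 0: no carry at all
print(longest_carry_chain(2**31 - 1, 1))  # 31: the carry ripples through bits 0..30
```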
So far most of the comments here are along the lines of "this won't work, it's too hard to debug, etc.". But it seems to me that the human brain is a pretty good example of asynchronous computing. The last time I checked, there wasn't any sort of high frequency clock signal running down my spine.
- Mike
First, most ASICs built these days are built with logic synthesis tools from Synopsys or Cadence. The inputs are typically register transfer level (RTL) code written in either the VHDL or Verilog languages. These logic synthesis tools have been around for quite some time (well over a decade for Synopsys) and have a significant infrastructure built around them. This design paradigm and sets of tools all assume synchronous logic. I can't fathom how you would build/constrain/debug these circuits in an asynchronous style with the existing toolset. And don't say "we'll use something else". It is these types of tools which have made our million gate ASICs possible. If we were still using schematics or other hack tools we would barely have passed the 80286. The current design tools took a long time to develop, hone, and get the bugs out of. The amount of money involved in just the tools is on the order of billions of dollars per year. That's a lot of inertia to move away from.
Second, yes the asynchronous approach can reduce the power consumption of ASICs. However, there are a lot of clocked approaches that do a very good job of reducing power. It all depends on what goals you have when you design the ASIC. Having multiple clocks and clock gating is common in the low power and embedded domains. It hasn't been as much of a factor in desktop systems but is certainly in use in handheld devices. The Crusoe takes these approaches to an extreme level. It's all a matter of what you want to design for and time to market pressures.
Lastly, speed. I think folks forget the feedback path. If you're going to rely on this asynchronous handshake, it requires a given stage to hold its outputs until the next stage acknowledges (asynchronously) that it got the data. This means the given stage can't accept anything new yet. This cascades/ripples back through the pipeline. This feedback takes time (and logic levels) that don't exist in clocked logic. Imagine an automotive assembly line where things could only move forward if each station got permission from its adjacent stations. In clocked logic you've guaranteed that the data is ready to move forward because you've calculated these things out. You've removed a bunch of communication overhead. Yes, there is slack in the synchronous pipeline, but for the most part current designs are pretty well balanced so that each stage uses a large portion of its clock cycle.
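The handshake-overhead argument can be put into a very rough steady-state model. The sketch below is an assumption-laden illustration, not a description of any real design: stage delays and the handshake cost are made-up numbers in arbitrary units, and both function names are hypothetical.

```python
def sync_time_per_item(stage_delays):
    """Clocked pipeline: the clock period must cover the slowest stage."""
    return max(stage_delays)


def async_time_per_item(stage_delays, handshake_delay):
    """Handshaking pipeline: each stage pays its own delay plus the
    request/acknowledge round trip; the slowest such stage sets the pace."""
    return max(d + handshake_delay for d in stage_delays)


balanced = [10, 10, 10, 10]              # hypothetical, well-balanced stages
print(sync_time_per_item(balanced))      # 10
print(async_time_per_item(balanced, 2))  # 12
```

With well-balanced stages the handshake is pure overhead, which is the poster's point. The model deliberately ignores data-dependent stage delays, which is where self-timed designs hope to win some of that time back.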
That's about all I can think of at the moment. I need to be getting home before I get snowed in! ;-) Just a few comments from a digital hardware designer. Hope this provided some food for thought...
A floating point operation is usually taken to mean a floating point multiply followed by a floating point addition, also known as a Multiply/Accumulate Cycle (MAC).
A MAC is a very important operation in digital signal processing. For example, to implement a digital lowpass filter (to remove tape hiss, say), you define a finite impulse response filter (FIR filter) of some number of taps. You might need 256 taps to implement the needed lowpass filter (this is a shot from the hip; the actual number of taps may be more or less). That means for every sample of audio (88.2 kSamples/second for stereo audio) you need to do 256 MACs, which is 88,200 x 256, or about 22.6 MFLOPS.
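The arithmetic above can be checked with a few lines of Python. This is a plain direct-form FIR sketch written for clarity rather than speed; the function names and the tiny two-tap example are illustrative assumptions, and the MAC count follows the poster's convention of one multiply-accumulate per tap per sample.

```python
def fir_filter(samples, taps):
    """Direct-form FIR filter: each output sample costs up to
    len(taps) multiply-accumulate (MAC) operations."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, coeff in enumerate(taps):
            if n - k >= 0:
                acc += coeff * samples[n - k]  # one MAC
        out.append(acc)
    return out


def macs_per_second(sample_rate_hz, num_taps):
    """MACs needed per second to run the filter in real time."""
    return sample_rate_hz * num_taps


print(fir_filter([1, 2, 3, 4], [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
print(macs_per_second(88_200, 256))          # 22579200, i.e. about 22.6 million
```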
www.eFax.com are spammers
Asynchronous ARM core nears commercial debut (1998)
ARM researches asynchronous CPU design (feb 1995)
AMULET3: A High-Performance Self-Timed ARM Microprocessor (1998)