Clockless Computing: The State Of The Art
Michael Stutz writes: "This article in Technology Review is a good overview of the state of clockless computing, and profiles the people today who are making it happen." The article explains in simple terms some of the things that clockless chips are supposed to offer (advantages in raw performance, power consumption and security) and what characteristics make these advantages possible.
The AMULET group at Manchester University have been developing this for years based on ARM cores.
http://www.cs.man.ac.uk/amulet/index.html
The old CDC supercomputers, and the Cray 1, were clockless. They were designed by that inspired madman, ...
The reason he built them clockless is that the propagation time to get the clock signal across the machines (which were fairly large) would have significantly slowed the performance. Instead, all of the wires are the right length so that all of the signals arrive at their destination at the right time. I've been told horror stories by ex-CDC salesmen that when they installed new machines, they would spend days or weeks clipping wires to different lengths and debugging hardware failure modes until it all ran smoothly.
Cray also solved the heat dissipation problem by designing the computer to run hot. This meant that when you turned it on it didn't work reliably until all of the ceramic boards heated up (and expanded) so that the connections were solid, etc.
F-ing brilliant.
if there is no mass market for asynchronous chips, there's little incentive to create tools to build them; if there are no tools, no chips get produced. The same problem applies to the development of chip-testing technologies. Without any significant quantity of asynchronous circuits to test, there is no market for third-party testing tools.
But at least here there's an accidental solution - the Cross-Check Array.
Conventional clocked chips can be tested by scan: A multiplexer is added to the flop inputs, and a test signal turns them into one or more long shift registers. The old state of the flops is shifted out for examination while a new state is shifted in to start the next phase of the test. This only works when the flops to be strung together are all part of a common clocking domain.
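To make the scan idea concrete, here is a minimal Python sketch of one shift pass. The function name `scan_shift` and the list-of-bits representation are my own assumptions for illustration, not any real test tool's API; it only models the "old state out, new state in" behaviour described above.

```python
def scan_shift(flops, scan_in):
    """Shift new bits into a scan chain while the old state shifts out.

    flops   -- current flop values; index 0 is nearest the scan input
    scan_in -- bits presented at the scan input, one per test clock
    Returns (new_flops, scanned_out).
    """
    chain = list(flops)
    scanned_out = []
    for bit in scan_in:
        scanned_out.append(chain[-1])  # last flop drives the scan output
        chain = [bit] + chain[:-1]     # every flop takes its neighbour's value
    return chain, scanned_out


new_state, old_state = scan_shift([1, 1, 0], [1, 0, 0])
```

Note that every flop moves on the same test clock, which is exactly why this trick needs a common clocking domain.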
The Cross-Check Array is more like a RAM. A grid of select lines and sense lines is laid down on the chip, with a transistor at each intersection. The transistor is undersized compared to those of the gates, forming a small tap on a nearby signal - or it can inject a signal if the sense line is driven rather than monitored. Select drivers are laid down along one edge of the chip, sense amplifiers/drivers along another.
This approach does not depend on the flip-flops to be active participants in the observation process (though it can still force their state), and thus can observe signals in asynchronous as well as synchronous designs. It also gives observability of testpoints in combinatorial logic without the addition of extra flops. Compared to a fullscan design it gives much greater observability and takes about half the silicon-area overhead.
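As a rough software analogy only (the class `CrossCheckArray` and its methods are hypothetical names I made up, and real silicon is obviously not a dictionary), the RAM-like addressing can be sketched in Python like this: pick a select line, then either read the sense lines or drive them.

```python
class CrossCheckArray:
    """Toy model: a tap transistor at some (select, sense) intersections."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.taps = {}  # (select_row, sense_col) -> name of the tapped signal

    def tap(self, row, col, signal_name):
        self.taps[(row, col)] = signal_name

    def observe(self, row, signals):
        """Assert one select line and return what each sense line reads."""
        readings = []
        for col in range(self.cols):
            name = self.taps.get((row, col))
            readings.append(signals[name] if name is not None else None)
        return readings

    def force(self, row, driven, signals):
        """Drive some sense lines instead of monitoring them."""
        for col, value in driven.items():
            name = self.taps.get((row, col))
            if name is not None:
                signals[name] = value


signals = {"a": 1, "b": 0, "c": 1}   # assumed internal nodes of the design
array = CrossCheckArray(rows=2, cols=3)
array.tap(0, 0, "a")
array.tap(0, 2, "b")
array.tap(1, 1, "c")
row0 = array.observe(0, signals)
```

Nothing here needs a clock or a flop, which is the property that makes the approach usable on asynchronous logic.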
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
Of course there is some overhead. There has to be a system telling other parts of the computer when something is finished. But if that is a long enough stage (perhaps thousands of instructions) then it'll be faster overall.
Escher was the first MC and Giger invented the HR department.
It *can* be different, but that's really a function of the state of compilers and languages adapted for an async system. It needn't be different at all.
Disclaimer: I was a student at Caltech, and I took one async VLSI course, and not a very in-depth one at that.
One way to go about it is to make an async CPU that externally looks like a sync CPU; then you drop it into just about any system, and it works. Speed is wholly dependent upon VCore settings, cooling solutions, and drive strength, I think, though of course there are always gate and transistor performance bottlenecks. Programming and using such a chip would be no different than any other CPU.
Another method is to have a partially async system, in which the CPU, some of the motherboard, and the RAM interface are async because of how fast they operate; go ahead and clock something like PCI, USB, etc, because those operate slowly enough that the effort of async isn't worth it. This solution is just a question of degrees, really, on how much of the system is async and how much isn't.
Now, that aside, there's the software aspect; how do you program an async system? At the lowest level it resembles, slightly, multi-threaded programming, in which you have multiple threads equating to the multiple function units, execution units, decoders, and stages in the pipeline, etc.
You shuttle data around and wait for acknowledges that the data has been processed before you continue shuttling and processing data. You can synchronize around stages or functional units by making other stages or units dependent upon the output of said unit; instead of waiting for a clock to signal the next cycle of execution, you wait for an acknowledge signal.
To be a little more clear, at the ASM level you would mov data, wait for an ack before another mov data, wait for an ack before sending an instruction, etc. Due to the magic of pipelining, the CPU doesn't have to be finished before you can start stuffing the pipeline, and because it's asynchronous, that means you can actually feed in data as fast as the processor can receive it, even if the back end or the core is choking on a particularly nasty multiplication.
So you're feeding data at a furious rate into the CPU, while the CPU is processing prior instructions. If the front end gets full, or whatnot, it fails to signal an ack, so whatever mechanism is feeding data in (ram, cache, memory, whatever) pauses until the CPU can handle more data.
The core, independent of the front end, is processing the data and sending out more instructions, branches, setting bits. With multiple functional units, each unit can run at its own speed at its own rate. So if all it's doing is adds, checking conditionals, etc, it may be able to outrun the data feed mechanism, since an add can be completed in one pipeline unit, while data always has to wait upon a slower storage mechanism.
Or if the execution units are waiting because it's doing a square root or something, it just tells the prefetch or whatever front end units to wait, because it cannot handle another chunk of data or instruction, yet, which propagates back to the data feed to wait as well.
When it finishes with its current instruction a ready signal would get propagated back through all the stages or so, and then more data would get fed in.
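Since the comparison is to threaded code anyway, here is a small Python sketch of the ack/stall behaviour using threads and one-slot queues. It is only a software analogy under my own assumptions: `stage` and `run_pipeline` are made-up names, a full one-slot queue stands in for "no ack yet", and `None` is reserved as a shutdown marker (so stage functions must not return `None`).

```python
import queue
import threading


def stage(work, inbox, outbox):
    """One pipeline stage running at its own pace."""
    while True:
        item = inbox.get()       # wait for data from upstream
        if item is None:         # shutdown marker: pass it on and stop
            outbox.put(None)
            return
        outbox.put(work(item))   # blocks until downstream has room (the "ack")


def run_pipeline(items, works):
    # One single-slot channel between neighbours; a full slot stalls the sender.
    chans = [queue.Queue(maxsize=1) for _ in range(len(works) + 1)]
    results = []

    def sink():
        while True:
            out = chans[-1].get()
            if out is None:
                return
            results.append(out)

    threads = [threading.Thread(target=stage, args=(w, chans[i], chans[i + 1]))
               for i, w in enumerate(works)]
    threads.append(threading.Thread(target=sink))
    for t in threads:
        t.start()
    for item in items:
        chans[0].put(item)       # the feeder pauses here when the front end is full
    chans[0].put(None)
    for t in threads:
        t.join()
    return results


squares_plus_one = run_pipeline([1, 2, 3], [lambda x: x * x, lambda x: x + 1])
```

If one stage is slow, its inbox stays full, the stage before it blocks on `put`, and the stall walks back to the feeder without any shared clock.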
So at the lowest levels it would start to resemble writing threaded code, in which you have to wait for the thread to be ready, to be awake, to be active before you send data, and if the thread is asleep, you wait until it awakes, or something like that.
Multiprocessor async is similar, except that each CPU is just another thread. If there's a hardware front end that decides which CPU to send instructions to, then it's really just a matter of stuffing instructions into the least loaded or fastest running CPU. Each CPU could, more or less, look like just another functional unit, and the whole thing clusters pretty well because the CPUs all run asynchronously, meaning you don't have to do anything particularly special for load balancing: just send the data to the first one that signals ready, or if multiple CPUs are ready, read a status register to see which is more empty, or whatever.
Apologies if I made some errors, especially to those who know much more than I; this is a 4-year-old interpretation of my async VLSI class =)
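The dispatch rule could look something like the Python sketch below. The `(is_ready, queue_depth)` status tuples and the name `pick_cpu` are assumptions I invented to stand in for the ready signal and the status register; real hardware would do this with wires, not lists.

```python
def pick_cpu(status):
    """Choose a CPU to receive the next chunk of work.

    status -- list of (is_ready, queue_depth) tuples, one per CPU
    Returns the index of the emptiest ready CPU, or None if none is ready.
    """
    ready = [(depth, idx) for idx, (is_ready, depth) in enumerate(status) if is_ready]
    if not ready:
        return None          # nobody signalled ready: the feeder just waits
    return min(ready)[1]     # lowest depth wins; ties go to the lowest index


choice = pick_cpu([(False, 0), (True, 3), (True, 1)])
```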
GPL Deconstructed
The article is surprisingly accurate, for a change. Read it.
However, it seems to have spawned the usual problems here with misunderstanding and confusion. Practically a /. trademark by this point...
Whether you construct a processor using conventional or asynchronous logic makes no difference to the programmer. The programming paradigm can be completely independent from the underlying hardware. (Admittedly, if you want to squeeze the absolute most performance from a given hardware design, you need to program with it in mind, but there is no reason why an ix86, or PPC, or SPARC, or MIPS chip couldn't be implemented asynchronously.)
One of the most interesting advantages of asynchronous logic is that it allows the use of arbitrarily large die sizes. In synchronous logic, you're limited by the delays that arise from transmitting your clock pulses across the chip... at some point maintaining a global lock-step becomes infeasible.
One of the most marketable advantages of asynchronous logic is the power saved by not having to constantly drive the same clock circuitry. Most chips support a 'sleep' or 'low power' mode where they turn off the clock or provide it to only a limited portion of the chip. The chip then has to go through a 'wake up' cycle to re-establish the clock throughout the chip before returning to normal operation. The power saved by asynchronous operation can be substantial, and the lack of a 'wake up' latency can be critical in certain applications.
The biggest problem right now is that the vast Layout and Design masses are used to solving the synchronous problems and not the asynchronous problems, ditto for the available tools. However, with an asynchronous-savvy group, a given solution can be designed in less time than the equivalent synchronous solution (someone here was claiming otherwise...).
And this technology is -not- vaporware... it's real and it's here. And whether you believe it or not, it's at least one part of the future.
-YA
PS: BS in EE from Caltech. Working for a company mentioned in the article, although their opinions have no logical relation or tie to mine.