When Mistakes Improve Performance
jd and other readers pointed out BBC coverage of research into "stochastic" CPUs that allow communication errors in order to reap benefits in performance and power usage. "Professor Rakesh Kumar at the University of Illinois has produced research showing that allowing communication errors between microprocessor components and then making the software more robust will actually result in chips that are faster and yet require less power. His argument is that at the current scale, errors in transmission occur anyway and that the efforts of chip manufacturers to hide these to create the illusion of perfect reliability simply introduces a lot of unnecessary expense, demands excessive power, and deoptimises the design. He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully. He believes he has shown such a design would work and that it would permit Moore's Law to continue to operate into the foreseeable future. However, this is not the first time someone has tried to fundamentally revolutionize the CPU. The Transputer, the AMULET, the FM8501, the iWARP, and the Crusoe were all supposed to be game-changers but died cold, lonely deaths instead — and those were far closer to design philosophies programmers are currently familiar with. Modern software simply isn't written with the level of reliability the stochastic processor requires (and many software packages are too big and too complex to port), and the volume of available software frequently makes or breaks new designs. Will this be 'interesting but dead-end' research, or will Professor Kumar pull off a CPU architectural revolution really not seen since the microprocessor was designed?"
If the processor goofs up the instructions that its supposed to execute, how does it recover gracefully?
The "robustification" of software, as he calls it, involves re-writing it so an error simply causes the execution of instructions to take longer.
Ooh, this is tricky. So we can reduce CPU power consumption by a certain amount if we rewrite software in such a way that it can slowly roll over errors when they take place. There are some crude numbers in the document: a 1% error rate, whatever that means, causes a 23% drop in power consumption. What if the `robustification' of software means that it has an extra "check" instruction for every three "real" instructions? Now you're back to where you started, but you had to rewrite your software to get here. I know, it's unfair to compare his proven reduction in power consumption with my imaginary ratio of "check" instructions to "real" instructions, but my point still stands. This system may very well move the burden of error correction from the hardware to the software in such a way that there is no net gain.
More importantly, if the software is more robust so as to detect and correct errors, then it will require more clock cycles of the CPU and negate the performance gain.
This CPU design sounds like the processing equivalent of a perpetual motion device. The additional software error correction is like the friction that makes the supposed gain impossible.
What he is proposing is to reduce the number of transistors on a chip, to increase its speed and reduce power usage.
So in fact he is trying to reverse more's law.
Ethernet is an improvement over than token ring, yet Ethernet has collisions and token ring doesn't.
Token ring avoids collisions, Ethernet accepts collisions will take place but has a good error recovery system.
The summary talked about the communication links... I remember when we were running SLIP over two-conductor serial lines and "overclocking" the serial lines. Because the networking stack (software) was doing checksums and retries, it worked faster to run the line into its fast but noisy mode, rather than to clock it conservatively at a rate with low noise.
If the chip communications paths start following the trend of off-chip paths (packetized serial streams), then you could have higher level functions of the chip do checksums and retries, with a timeout that aborts back even higher to a software level. Your program could decide how much to wait around for correct results versus going on with partial results, depending on its real-time requirements. The memory controllers could do this, using the large, remote SRAM and RAM spaces as an assumed-noisy space and overlaying its own software checksum format on top.
This is really not so different from modern filesystems which start to reduce their trust in the storage device, and overlay their own checksum, redundancy, and recovery methods. You can imagine bringing these reliability boundaries ever "closer" to the CPU. Of course, you are right that it doesn't make sense to allow noisy computed goto addresses, unless you can characterize the noise and link your code with "safe landing areas" around the target jump points. It makes even less sense to have noisy instruction pointers, e.g. where it could back up or skip steps by accident, unless you can design an entire ISA out of idempotent instructions which you can then emit with sufficient redundancy for your taste.
He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully.
I dub thee neuron.
Shai Schticks:"You don't make peace with friends, you make peace with enemies"
...the problem is software. In the last twenty years, we've gone from machines running at a few MHz to multicore, multi-CPU machines with clock speeds in the GHz, with corresponding increases in memory capacity and other resources. While the hardware has improved by several orders of magnitude, the same has largely not been true of software. With the exception of games and some media software, which actually require and can use all the hardware you can throw at them, end user software that does very little more than it did twenty years ago could not even run on a machine from 1990, much less run usably fast. I'm not talking enterprise database software here, I'm talking about spreadsheets and word processors.
All of the gains we make in hardware are eaten up as fast or faster than they are produced by two main consumers: useless eye-candy for end users, and higher and higher-level programming languages and tools that make it possible for developers to build increasingly inefficient and resource-hungry applications faster than before. And yes, I realize that there are irresistible market forces at work here, but that only applies to commercial software; for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.
It really doesn't matter how powerful the hardware becomes. For specialist applications, it's still a help. But for the average user, an increase in processor speed and memory simply means that their 25 meg printer drivers will become 100 meg printer drivers and their operating system will demand another gig of RAM and all their new clock cycles. Anything that's left will be spent on menus that fade in and out and buttons that look like quivering drops of water -- perhaps next year, they'll have animated fish living inside them.
Proud member of the Weirdo-American community.
Because the stochastic processor will be able to make mistakes much more quickly of course.
Don't you understand progress?
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.
Suppose those CPUs really allow for faster instruction handling using less resources, maybe you could put more in a package, for the same price, which on a hardware level would give rise to more processing cores at the same cost. (Multi-Core stochastic CPUs)
Naturally, you have the ability to do parallel processing, with errors possible, but you are able to process instructions at a faster rate.
On the software side, the support for concurrency is a mayor selling point, of course, there has to be something able recover from those pesky stochastics gracefully. I come up with the functional language 'Erlang'.
This is taken from wikipedia
http://en.wikipedia.org/wiki/Erlang_(programming_language)#Concurrency_and_distribution_orientation
Concurrency supports the primary method of error-handling in Erlang. When a process crashes, it neatly exits and sends a message to the controlling process which can take action. This way of error handling increases maintainability and reduces complexity of code
From the official source:
http://www.erlang.org/doc/reference_manual/processes.html#errors
Erlang has a built-in feature for error handling between processes. Terminating processes will emit exit signals to all linked processes, which may terminate as well or handle the exit in some way. This feature can be used to build hierarchical program structures where some processes are supervising other processes, for example restarting them if they terminate abnormally.
Asked to 'refer to OTP Design Principles for more information about OTP supervision trees, which use[s] this feature' I read this:
http://www.erlang.org/doc/design_principles/des_princ.html
A basic concept in Erlang/OTP is the supervision tree. This is a process structuring model based on the idea of workers and supervisors. Workers are processes which perform computations, that is, they do the actual work. Supervisors are processes which monitor the behaviour of workers. A supervisor can restart a worker if something goes wrong. The supervision tree is a hierarchical arrangement of code into supervisors and workers, making it possible to design and program fault-tolerant software.
This seems well fit? Create a real, physical machine for a language both able to reap its benefits and cope with the trade-off.
Or maybe I'm too far off (I'm bored technologically, allow me some paradigmatic change at slashdot).
TamedStochastics - Hiring.
Yes, checksumming on dedicated hardware was my first thought as well.
AMD and nVidia's workstation cards are the same as their gaming cards, the only difference being that the workstation ones are certified to produce 100% accurate output. If a gaming card colours a pixel wrong every now and then it's no big deal and the player probably won't even notice.
As OpenCL and other "abuses" of GPU power become more popular, "colors a pixel wrong" will eventually happen in the wrong place at the wrong time on someone using a "gaming" card.
...that the Transmeta Crusoe processor has sod-all to do with porting or different programming models. The whole point of the Crusoe was that it could distil down various types of instruction (e.g. x86, even Java bytecode) to native instructions it understood. It could run 'anything' so to speak, given the right abstraction layer in between
Its lack of success was nothing to do with programming - just that no one needed a processor that could these things. The demand wasn't there
And yes, I realize that there are irresistible market forces at work here, but that only applies to commercial software; for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.
I think I know why. If free software lacks eye candy, free software has trouble gaining more users. If free software lacks users, hardware makers won't cooperate, leading to the spread of "paperweight" status on hardware compatibility lists. And if free software lacks users, there won't be any way to get other software publishers to document data formats or to get publishers of works to use open data formats.
I've seen this before, except for an application that made more sense: GPUs. A GPU instruction is almost never critical. Errors writing pixel values will just result in minor static, and GPUs are actually far closer to needing this sort of thing. The highest-end GPUs draw far more power than the highest-end CPUs, and heating problems are far more common.
It may even improve results. If you lower the power by half for the least significant bit, and by a quarter for the next-least, you've cut power 3% for something invisible to humans. In fact, a slight variation in the rendering could make the end result look more like our flawed reality.
A GPU can mess up and not take down the whole computer. A CPU can. What happens when the error hits during a syscall? During bootup? While doing I/O?
Seriously, people, get some common sense.
What use is a blazing fast computer that is no longer reliable
Meh.. I'm pretty happy to have my brain, even if it makes some mistakes sometimes.
which is totally what she said
Ethernet has lower latency than token ring, and is over all easier to implement. However its bandwidth scaling is poor, when you are talking old school hub ethernet. The more devices you have, the less total throughput you get due to collisions. Eventually you can grind a segment to a complete halt because collisions are so frequent little data gets through.
Modern ethernet does not have collisions. It is full duplex, using separate transmit and receive wires. It scales well because the devices that control it, switches, handle sending data where it needs to go and can do bidirectional transmission. The performance you see with gig ethernet is only possible because you do full duplex communications with essentially no errors.
I see lots of people down on the theory - even though the original proposal was for highly-error forgiving applications - because somehow it means we can't trust the computations from the CPU anymore.
People - realize that you can't trust them NOW.
As someone who's spent way too much time in the ZFS community talking about errors, their sources and how to compensate, let me enlighten you:
modern computers are full of uncorrected errors
By that, I mean that there is a decided tradeoff between hardware support for error correction (in all the myriad places in a computer, not just RAM) and cost, and the decision has come down on the side of screw them, they don't need to worry about errors, at least for desktops. Even for better quality servers and workstations, there are still a large number of places where the hardware simply doesn't check for errors. And, in many cases, the hardware alone is unable to check for errors and data corruption.
So, to think that your wonderful computer today is some sort of accurate calculating machine is completely wrong! Bit rot and bit flipping happens very frequently for a simple reason: error rates per billion operations (or transmissions, or whatever) have essentially stayed the same for the past 30 years, while every other component (and bus design, etc.) is pretty much following Moore's Law. The ZFS community is particularly worried about disks, where the hard error rates are now within two orders of magnitude of the disk's capacity (e.g. for a 1TB disk, you will have a hard error for every 100TB or so of data read/written). But, there's problems in on-die CPU caches, bus line transmissions, SAS and FC cable noise, DRAM failures, and a whole host of other places.
Bottom line here: the more research we can do into figuring out how to cope with the increasing frequency of errors in our hardware, the better. I'm not sure that we're going to be able to force a re-write of applications, but certainly, this kind of research and possible solutions can be taken care of by the OS itself.
Frankly, I liken the situation to that of using IEEE floating point to calculate bank balances: it looks nice and a naive person would think it's a good solution, but, let me tell you, you come up wrong by a penny more often that you would think. Much more often.
-Erik
There are always four sides to every story: your side, their side, the truth, and what really happened.
That's why some languages use ComeFrom rather than GoTo.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)