When Mistakes Improve Performance

← Back to Stories (view on slashdot.org)

When Mistakes Improve Performance

Posted by kdawson on Saturday May 29, 2010 @10:21AM from the let's-change-everything dept.

jd and other readers pointed out BBC coverage of research into "stochastic" CPUs that allow communication errors in order to reap benefits in performance and power usage. "Professor Rakesh Kumar at the University of Illinois has produced research showing that allowing communication errors between microprocessor components and then making the software more robust will actually result in chips that are faster and yet require less power. His argument is that at the current scale, errors in transmission occur anyway and that the efforts of chip manufacturers to hide these to create the illusion of perfect reliability simply introduces a lot of unnecessary expense, demands excessive power, and deoptimises the design. He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully. He believes he has shown such a design would work and that it would permit Moore's Law to continue to operate into the foreseeable future. However, this is not the first time someone has tried to fundamentally revolutionize the CPU. The Transputer, the AMULET, the FM8501, the iWARP, and the Crusoe were all supposed to be game-changers but died cold, lonely deaths instead — and those were far closer to design philosophies programmers are currently familiar with. Modern software simply isn't written with the level of reliability the stochastic processor requires (and many software packages are too big and too complex to port), and the volume of available software frequently makes or breaks new designs. Will this be 'interesting but dead-end' research, or will Professor Kumar pull off a CPU architectural revolution really not seen since the microprocessor was designed?"

23 of 222 comments (clear)

Min score:

Reason:

Sort:

Impossible design by ThatMegathronDude · 2010-05-29 10:24 · Score: 4, Interesting

If the processor goofs up the instructions that its supposed to execute, how does it recover gracefully?
1. Re:Impossible design by Anonymous Coward · 2010-05-29 10:36 · Score: 3, Funny
  
  The Indian-developed software will itself fuck up in a way that negates whatever fuck up just happened with the CPU. In the end, it all balances out, and the computation is correct.
2. Re:Impossible design by TheThiefMaster · 2010-05-29 10:42 · Score: 4, Insightful
  
  Especially a JMP (GOTO) or CALL. If the instruction is JMP 0x04203733 and a transmission error makes it do JMP 0x00203733 instead, causing it to attempt to execute data or an unallocated memory page, how the hell can it recover from that? It could be even worse if the JMP instruction is changed only subtly, jumping only a few bytes too far or too close could land you the wrong side of an important instruction that throws off the entire rest of the program. All you could do is to detect the error/crash and restart from the beginning and hope. What if the error was in your error detection code? Do you have to check the result of your error detection for errors too?
3. Re:Impossible design by Anonymous Coward · 2010-05-29 10:44 · Score: 3, Interesting
  
  Thats a good point. You accept mistakes with the data, but don't want the operation to change from add (where, when doing large averages plus/minus a few hundreds wont matter) to multiply or divide.
  But once you have the opcode separated from the data, you can mess with the former. E.g. not care when something is a race condition because that happening every 1000th operation doesn't matter too much.
  And as this is a source of noise, you just got a free random data!
  Still, this looks more like something for scientific computing, and when they build the next big one that can easily be factored in. For home computing, not so much, 99% of the time they wait for user input anyhow.
4. Re:Impossible design by AmiMoJo · 2010-05-29 11:15 · Score: 3, Informative
  
  The first thing to say is that we are not talking about general purpose CPU instructions but rather the highly repetitive arithmetic processing that is needed for things like video decoding or 3D geometry processing.
  The CPU can detect when some types of error occur. It's a bit like ECC RAM where one or two bit errors can be noticed and corrected. It can also check for things like invalid op-codes, jumps to invalid or non-code memory and the like. If a CPU were to have two identical ALUs it could compare results.
  Software can also look for errors in processed data. Things like checksums and estimation can be used.
  In fact GPUs already do this to some extent. AMD and nVidia's workstation cards are the same as their gaming cards, the only difference being that the workstation ones are certified to produce 100% accurate output. If a gaming card colours a pixel wrong every now and then it's no big deal and the player probably won't even notice. For CAD and other high end applications the cards have to be correct all the time.
  
  --
  const int one = 65536; (Silvermoon, Texture.cs)
  SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
5. Re:Impossible design by Chowderbags · 2010-05-29 11:19 · Score: 3, Insightful
  
  Moreover, if the processor goofs on the check, how will the program know? Do we run every operation 3 times and take the majority vote (then we've cut down to 1/3rd of the effective power)? Even if we were to take the 1% error rate, given that each core of CPUs right now can run billions of instructions per second, this CPU will fail to check correctly every second (even checking, rechecking, and checking again every single operation). And what about memory operations? Can we accept errors in a load or store function? If so, we can't in practice trust our software to do what we tell it. (change a bit on load and you could do damn near anything from adding the wrong number, to saying an if statement is true when it should be false, to not even running the right fricken instruction.
  
  There's a damn good reason why we want our processors to be rock solid. If they don't work right, we can't trust anything they output.
6. Re:Impossible design by Interoperable · 2010-05-29 12:46 · Score: 5, Insightful
  
  The research is targeted specifically at dedicated audio/video encoding/decoding blocks within the processors of mobile devices and similar error-tolerant applications. The journalist just didn't mention the fact that the idea isn't to expose the entire system to fault-prone components. When considered in the light that the most power-sensitive mainstream devices (cell-phones) spend most of their time doing these error-tolerant tasks, the research becomes quite interesting. They claim to have demonstrated the effectiveness of the technique to encode an h.264 video.
  
  --
  So if this is the future...where's my jet pack?
7. Re:Impossible design by pipatron · 2010-05-29 13:04 · Score: 4, Insightful
  
  Not very insightful. You seem to say that a CPU today is error-free, and if this is true, the part of the new CPU that does the checks could also be made error-free so there's no problem.
  Well, they aren't rock-solid today either, so you can not trust their output even today. It's just not very likeley that there will be a mistake. This is why mainframes execute a lot of instructions at least twice, and decides on-the-fly if something went wrong. This idea is just an extension of that.
  
  --
  c++; /* this makes c bigger but returns the old value */
8. Re:Impossible design by demerzeleto · 2010-05-29 15:07 · Score: 3, Interesting
  
  There's a damn good reason why we want our processors to be rock solid. If they don't work right, we can't trust anything they output.
  Have you ever tried transferring large files over a 100 MBps ethernet link? Thats right, billions of bytes over a noisy, unreliable wired link. And how often have you seen files corrupted? I never have. The link runs along extremely reliably (BER of 10^-9 I think) with as little as 12MBps out of the 100MBps spent on error checking and recovery.
  
  Same case here. I'd expect the signal-to-noise ratio on the connects within CPUs (when the voltage is cut by say 25%) to be similar, if not better, than ethernet links. So the CPU could probably get along with lesser error checking and recovery. Or, if you choose applications (like video decoding or graphics rendering) that have no problems with a few bad bits here and there, you could manage with almost no ECC at all.
  
  If you were to plot Error Rates vs CPU power, I'd say most modern CPUs lie at the far end of the region of diminishing returns. Theres a gold mine to be reaped by moving backwards on the curve.
Moving, not fixing, the problem by Red+Jesus · 2010-05-29 10:35 · Score: 4, Interesting

The "robustification" of software, as he calls it, involves re-writing it so an error simply causes the execution of instructions to take longer.
Ooh, this is tricky. So we can reduce CPU power consumption by a certain amount if we rewrite software in such a way that it can slowly roll over errors when they take place. There are some crude numbers in the document: a 1% error rate, whatever that means, causes a 23% drop in power consumption. What if the `robustification' of software means that it has an extra "check" instruction for every three "real" instructions? Now you're back to where you started, but you had to rewrite your software to get here. I know, it's unfair to compare his proven reduction in power consumption with my imaginary ratio of "check" instructions to "real" instructions, but my point still stands. This system may very well move the burden of error correction from the hardware to the software in such a way that there is no net gain.
1. Re:Moving, not fixing, the problem by Interoperable · 2010-05-29 12:34 · Score: 4, Insightful
  
  I did some digging and found some material by the researcher, unfiltered by journalists. I don't have any background in processor architecture but I'll present what I understood. The original publications can be found here.
  The target of the research is not general computing, but rather low-power "client-side" computing, as the author puts it. I understand this to be decoding application, such as voice or video in mobile devices. Furthermore, the entire architecture would not be stochastic, but rather it would contain some functional blocks that are stochastic. I think the idea is that certain mobile hardware devices devote much of their time to specialized applications that do not require absolute accuracy.
  A mobile phone may spend most of it's time being used encode/decode low resolution voice and video and would have significant blocks within the processor devoted to those tasks. Those tasks could be considered error tolerant. The operating system would not be exposed to error-prone hardware, only applications that use hardware acceleration for specialized, error-tolerant tasks. In fact, the researchers specifically mention encoding/decoding voice and video and have demonstrated the technique on encoding h.264 video.
  
  --
  So if this is the future...where's my jet pack?
2. Re:Moving, not fixing, the problem by jd · 2010-05-29 12:57 · Score: 3, Insightful
  
  For this, I'd point to the RISC vs. CISC debate. RISC programs took many more instructions to do the same things, but the gain in performance was so great that you ended up with greater performance. Extra steps = some amount of overhead, but so long as the net gain is greater than the net overhead, you will gain overall. The RISC chip demonstrated that such examples really do exist in the real world.
  But just because there are such situations, it does not automatically follow that more steps always equals greater performance. It may be that better error-correction techniques in the hardware would handle the transmission errors just fine without having to alter any software at all. It depends on the nature of the errors (bursty vs randomly-distributed, percent of signal corrupted, etc) as to what error-correction would be needed and whether the additional circuitry can be afforded.
  Alternatively, the problem may in fact turn out to be a solution. Once upon a time, electron leakage was a serious problem in CPU designs. Then, chip manufacturers learned that they could use electron tunneling deliberately. The cause of these errors may be further electron leakage or some other quantum effect, it really doesn't matter. If it leads to a better practical understanding of the quantum world to the point where the errors can be mitigated and the phenomenon turned to the advantage of the engineers, it could lead to all kinds of improvement.
  There again, it might prove redundant. There are good reasons for believing that "Processor In Memory" architectures are good for certain types of problem - particularly for providing a very standard library of routines, but certain opcodes can be shifted there as well. There is also the Cell approach, which is to have a collection of tightly-coupled but specialized processors of different kinds. A heterogenious cluster with shared memory, in effect. If you extend the idea to allow these cores to be physically distinct, you can offload from the original CPU that way. In both cases, you distribute the logic over a much wider area without increasing the distance signals have to travel (you may even shorten the distance). As such, you can eliminate a lot of the internal sources of errors.
  It may prove redundant in other ways, too. There are plenty of cooling kits these days that will take a CPU to much lower temperatures. Less thermal noise may well result in fewer errors, since that is a likely source of some of them. Currently, processors often run at 40'C - and I've seen laptop CPUs reach 80'C. If you can bring the cores down to nearly 0'C and keep them there, that should have a sizable impact on whether the data is being transmitted accurately. The biggest change would be to modify the CPU casing so that heat is reliably and rapidly transferred from the silicon. I would imagine that you'd want the interior of the CPU casing to be flooded with something that conducts heat but not electricity - fluorinert does a good job there - and then have the top of the case to also be an extra-good heat conductor (plastic only takes you so far).
  However, if programs were designed with fault-tolerence in mind, these extra layers might not be needed. You might well be able to get away with better software on a poorer processor. Indeed, one could argue that the human brain is an example of an extremely unreliable processor whose net processing power (even allowing for the very high error rates) is far far beyond any computer yet built. This fits perfectly with the professor's description of what he expects, so maybe this design actually will work as intended.
  
  --
  It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Re:Moore's law by takev · 2010-05-29 10:49 · Score: 3, Informative

What he is proposing is to reduce the number of transistors on a chip, to increase its speed and reduce power usage.
So in fact he is trying to reverse more's law.
Sounds reasonable to me by bug1 · 2010-05-29 10:59 · Score: 4, Insightful

Ethernet is an improvement over than token ring, yet Ethernet has collisions and token ring doesn't.
Token ring avoids collisions, Ethernet accepts collisions will take place but has a good error recovery system.
A brainy idea. by Ostracus · 2010-05-29 11:28 · Score: 4, Interesting

He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully.

I dub thee neuron.

--
Shai Schticks:"You don't make peace with friends, you make peace with enemies"
The problem isn't hardware to begin with... by Angst+Badger · 2010-05-29 11:31 · Score: 4, Insightful

...the problem is software. In the last twenty years, we've gone from machines running at a few MHz to multicore, multi-CPU machines with clock speeds in the GHz, with corresponding increases in memory capacity and other resources. While the hardware has improved by several orders of magnitude, the same has largely not been true of software. With the exception of games and some media software, which actually require and can use all the hardware you can throw at them, end user software that does very little more than it did twenty years ago could not even run on a machine from 1990, much less run usably fast. I'm not talking enterprise database software here, I'm talking about spreadsheets and word processors.
All of the gains we make in hardware are eaten up as fast or faster than they are produced by two main consumers: useless eye-candy for end users, and higher and higher-level programming languages and tools that make it possible for developers to build increasingly inefficient and resource-hungry applications faster than before. And yes, I realize that there are irresistible market forces at work here, but that only applies to commercial software; for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.
It really doesn't matter how powerful the hardware becomes. For specialist applications, it's still a help. But for the average user, an increase in processor speed and memory simply means that their 25 meg printer drivers will become 100 meg printer drivers and their operating system will demand another gig of RAM and all their new clock cycles. Anything that's left will be spent on menus that fade in and out and buttons that look like quivering drops of water -- perhaps next year, they'll have animated fish living inside them.

--
Proud member of the Weirdo-American community.
1. Re:The problem isn't hardware to begin with... by Draek · 2010-05-29 13:02 · Score: 4, Insightful
  
  for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.
  You yourself stated that high-level languages allow for a much faster rate of development, yet you dismiss the idea of using them in the F/OSS world as a mere "desire to emulate corporate software development"?
  Hell, you also forgot another big reason: high-level code is almost always *far* more readable than its equivalent set of low-level instructions, the appeal of which for F/OSS ought to be similarly obvious.
  Sorry but no, the reason practically the whole industry has been moving towards high-level languages isn't because we're all lazy, and if you worked in the field you'd probably know why.
  
  --
  No problem is insoluble in all conceivable circumstances.
2. Re:The problem isn't hardware to begin with... by Homburg · 2010-05-29 13:13 · Score: 3, Insightful
  
  So you're posting this from Mosaic, I take it? I suspect not, because, despite your "get off my lawn" posturing, you recognize in practice that modern software actually does do more than twenty-year-old software. Firefox is much faster and easier to use than Mosaic, and it also does more, dealing with significantly more complicated web pages (like this one; and terrible though Slashdot's code surely is, the ability to expand comments and comment forms in-line is a genuine improvement, leaving aside the much more significant improvements of something like gmail). Try using an early 90s version of Word, and you'll see that, in the past 20 years word processors, too, have become significantly faster, easier to use, and capable of doing more (more complicated layouts, better typography).
  Sure, the laptop I'm typing this on now is, what, 60 times faster than a computer in 1990, and the software I'm running now is neither 60 times faster nor 60 times better than the software I was running in 1990. But it is noticeably faster, at the same time that it does noticeably more and is much easier to develop for. The idea that hardware improvements haven't led to huge software improvements over the past 20 years can only be maintained if you don't remember what software was like 20 years ago.
Re:why use a stochastic processor? by Arker · 2010-05-29 11:37 · Score: 3, Funny

Why use a stochastic processor which makes mistakes when we can use our brains, which make mistakes?
Because the stochastic processor will be able to make mistakes much more quickly of course.
Don't you understand progress?

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.
Late, and innaccurate by gman003 · 2010-05-29 12:19 · Score: 3, Interesting

I've seen this before, except for an application that made more sense: GPUs. A GPU instruction is almost never critical. Errors writing pixel values will just result in minor static, and GPUs are actually far closer to needing this sort of thing. The highest-end GPUs draw far more power than the highest-end CPUs, and heating problems are far more common.
It may even improve results. If you lower the power by half for the least significant bit, and by a quarter for the next-least, you've cut power 3% for something invisible to humans. In fact, a slight variation in the rendering could make the end result look more like our flawed reality.
A GPU can mess up and not take down the whole computer. A CPU can. What happens when the error hits during a syscall? During bootup? While doing I/O?
Re:Wrong approach? by somersault · 2010-05-29 12:54 · Score: 3, Interesting

What use is a blazing fast computer that is no longer reliable
Meh.. I'm pretty happy to have my brain, even if it makes some mistakes sometimes.

--
which is totally what she said
Not really by Sycraft-fu · 2010-05-29 13:46 · Score: 3, Informative

Ethernet has lower latency than token ring, and is over all easier to implement. However its bandwidth scaling is poor, when you are talking old school hub ethernet. The more devices you have, the less total throughput you get due to collisions. Eventually you can grind a segment to a complete halt because collisions are so frequent little data gets through.
Modern ethernet does not have collisions. It is full duplex, using separate transmit and receive wires. It scales well because the devices that control it, switches, handle sending data where it needs to go and can do bidirectional transmission. The performance you see with gig ethernet is only possible because you do full duplex communications with essentially no errors.
This branch could bear some interesting fruit... by trims · 2010-05-29 17:26 · Score: 4, Interesting

I see lots of people down on the theory - even though the original proposal was for highly-error forgiving applications - because somehow it means we can't trust the computations from the CPU anymore.
People - realize that you can't trust them NOW.
As someone who's spent way too much time in the ZFS community talking about errors, their sources and how to compensate, let me enlighten you:
modern computers are full of uncorrected errors
By that, I mean that there is a decided tradeoff between hardware support for error correction (in all the myriad places in a computer, not just RAM) and cost, and the decision has come down on the side of screw them, they don't need to worry about errors, at least for desktops. Even for better quality servers and workstations, there are still a large number of places where the hardware simply doesn't check for errors. And, in many cases, the hardware alone is unable to check for errors and data corruption.
So, to think that your wonderful computer today is some sort of accurate calculating machine is completely wrong! Bit rot and bit flipping happens very frequently for a simple reason: error rates per billion operations (or transmissions, or whatever) have essentially stayed the same for the past 30 years, while every other component (and bus design, etc.) is pretty much following Moore's Law. The ZFS community is particularly worried about disks, where the hard error rates are now within two orders of magnitude of the disk's capacity (e.g. for a 1TB disk, you will have a hard error for every 100TB or so of data read/written). But, there's problems in on-die CPU caches, bus line transmissions, SAS and FC cable noise, DRAM failures, and a whole host of other places.
Bottom line here: the more research we can do into figuring out how to cope with the increasing frequency of errors in our hardware, the better. I'm not sure that we're going to be able to force a re-write of applications, but certainly, this kind of research and possible solutions can be taken care of by the OS itself.
Frankly, I liken the situation to that of using IEEE floating point to calculate bank balances: it looks nice and a naive person would think it's a good solution, but, let me tell you, you come up wrong by a penny more often that you would think. Much more often.
-Erik

--
There are always four sides to every story: your side, their side, the truth, and what really happened.