When Mistakes Improve Performance

← Back to Stories (view on slashdot.org)

When Mistakes Improve Performance

Posted by kdawson on Saturday May 29, 2010 @10:21AM from the let's-change-everything dept.

jd and other readers pointed out BBC coverage of research into "stochastic" CPUs that allow communication errors in order to reap benefits in performance and power usage. "Professor Rakesh Kumar at the University of Illinois has produced research showing that allowing communication errors between microprocessor components and then making the software more robust will actually result in chips that are faster and yet require less power. His argument is that at the current scale, errors in transmission occur anyway and that the efforts of chip manufacturers to hide these to create the illusion of perfect reliability simply introduces a lot of unnecessary expense, demands excessive power, and deoptimises the design. He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully. He believes he has shown such a design would work and that it would permit Moore's Law to continue to operate into the foreseeable future. However, this is not the first time someone has tried to fundamentally revolutionize the CPU. The Transputer, the AMULET, the FM8501, the iWARP, and the Crusoe were all supposed to be game-changers but died cold, lonely deaths instead — and those were far closer to design philosophies programmers are currently familiar with. Modern software simply isn't written with the level of reliability the stochastic processor requires (and many software packages are too big and too complex to port), and the volume of available software frequently makes or breaks new designs. Will this be 'interesting but dead-end' research, or will Professor Kumar pull off a CPU architectural revolution really not seen since the microprocessor was designed?"

11 of 222 comments (clear)

Min score:

Reason:

Sort:

Impossible design by ThatMegathronDude · 2010-05-29 10:24 · Score: 4, Interesting

If the processor goofs up the instructions that its supposed to execute, how does it recover gracefully?
1. Re:Impossible design by TheThiefMaster · 2010-05-29 10:42 · Score: 4, Insightful
  
  Especially a JMP (GOTO) or CALL. If the instruction is JMP 0x04203733 and a transmission error makes it do JMP 0x00203733 instead, causing it to attempt to execute data or an unallocated memory page, how the hell can it recover from that? It could be even worse if the JMP instruction is changed only subtly, jumping only a few bytes too far or too close could land you the wrong side of an important instruction that throws off the entire rest of the program. All you could do is to detect the error/crash and restart from the beginning and hope. What if the error was in your error detection code? Do you have to check the result of your error detection for errors too?
2. Re:Impossible design by Interoperable · 2010-05-29 12:46 · Score: 5, Insightful
  
  The research is targeted specifically at dedicated audio/video encoding/decoding blocks within the processors of mobile devices and similar error-tolerant applications. The journalist just didn't mention the fact that the idea isn't to expose the entire system to fault-prone components. When considered in the light that the most power-sensitive mainstream devices (cell-phones) spend most of their time doing these error-tolerant tasks, the research becomes quite interesting. They claim to have demonstrated the effectiveness of the technique to encode an h.264 video.
  
  --
  So if this is the future...where's my jet pack?
3. Re:Impossible design by pipatron · 2010-05-29 13:04 · Score: 4, Insightful
  
  Not very insightful. You seem to say that a CPU today is error-free, and if this is true, the part of the new CPU that does the checks could also be made error-free so there's no problem.
  Well, they aren't rock-solid today either, so you can not trust their output even today. It's just not very likeley that there will be a mistake. This is why mainframes execute a lot of instructions at least twice, and decides on-the-fly if something went wrong. This idea is just an extension of that.
  
  --
  c++; /* this makes c bigger but returns the old value */
Moving, not fixing, the problem by Red+Jesus · 2010-05-29 10:35 · Score: 4, Interesting

The "robustification" of software, as he calls it, involves re-writing it so an error simply causes the execution of instructions to take longer.
Ooh, this is tricky. So we can reduce CPU power consumption by a certain amount if we rewrite software in such a way that it can slowly roll over errors when they take place. There are some crude numbers in the document: a 1% error rate, whatever that means, causes a 23% drop in power consumption. What if the `robustification' of software means that it has an extra "check" instruction for every three "real" instructions? Now you're back to where you started, but you had to rewrite your software to get here. I know, it's unfair to compare his proven reduction in power consumption with my imaginary ratio of "check" instructions to "real" instructions, but my point still stands. This system may very well move the burden of error correction from the hardware to the software in such a way that there is no net gain.
1. Re:Moving, not fixing, the problem by Interoperable · 2010-05-29 12:34 · Score: 4, Insightful
  
  I did some digging and found some material by the researcher, unfiltered by journalists. I don't have any background in processor architecture but I'll present what I understood. The original publications can be found here.
  The target of the research is not general computing, but rather low-power "client-side" computing, as the author puts it. I understand this to be decoding application, such as voice or video in mobile devices. Furthermore, the entire architecture would not be stochastic, but rather it would contain some functional blocks that are stochastic. I think the idea is that certain mobile hardware devices devote much of their time to specialized applications that do not require absolute accuracy.
  A mobile phone may spend most of it's time being used encode/decode low resolution voice and video and would have significant blocks within the processor devoted to those tasks. Those tasks could be considered error tolerant. The operating system would not be exposed to error-prone hardware, only applications that use hardware acceleration for specialized, error-tolerant tasks. In fact, the researchers specifically mention encoding/decoding voice and video and have demonstrated the technique on encoding h.264 video.
  
  --
  So if this is the future...where's my jet pack?
Sounds reasonable to me by bug1 · 2010-05-29 10:59 · Score: 4, Insightful

Ethernet is an improvement over than token ring, yet Ethernet has collisions and token ring doesn't.
Token ring avoids collisions, Ethernet accepts collisions will take place but has a good error recovery system.
A brainy idea. by Ostracus · 2010-05-29 11:28 · Score: 4, Interesting

He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully.

I dub thee neuron.

--
Shai Schticks:"You don't make peace with friends, you make peace with enemies"
The problem isn't hardware to begin with... by Angst+Badger · 2010-05-29 11:31 · Score: 4, Insightful

...the problem is software. In the last twenty years, we've gone from machines running at a few MHz to multicore, multi-CPU machines with clock speeds in the GHz, with corresponding increases in memory capacity and other resources. While the hardware has improved by several orders of magnitude, the same has largely not been true of software. With the exception of games and some media software, which actually require and can use all the hardware you can throw at them, end user software that does very little more than it did twenty years ago could not even run on a machine from 1990, much less run usably fast. I'm not talking enterprise database software here, I'm talking about spreadsheets and word processors.
All of the gains we make in hardware are eaten up as fast or faster than they are produced by two main consumers: useless eye-candy for end users, and higher and higher-level programming languages and tools that make it possible for developers to build increasingly inefficient and resource-hungry applications faster than before. And yes, I realize that there are irresistible market forces at work here, but that only applies to commercial software; for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.
It really doesn't matter how powerful the hardware becomes. For specialist applications, it's still a help. But for the average user, an increase in processor speed and memory simply means that their 25 meg printer drivers will become 100 meg printer drivers and their operating system will demand another gig of RAM and all their new clock cycles. Anything that's left will be spent on menus that fade in and out and buttons that look like quivering drops of water -- perhaps next year, they'll have animated fish living inside them.

--
Proud member of the Weirdo-American community.
1. Re:The problem isn't hardware to begin with... by Draek · 2010-05-29 13:02 · Score: 4, Insightful
  
  for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.
  You yourself stated that high-level languages allow for a much faster rate of development, yet you dismiss the idea of using them in the F/OSS world as a mere "desire to emulate corporate software development"?
  Hell, you also forgot another big reason: high-level code is almost always *far* more readable than its equivalent set of low-level instructions, the appeal of which for F/OSS ought to be similarly obvious.
  Sorry but no, the reason practically the whole industry has been moving towards high-level languages isn't because we're all lazy, and if you worked in the field you'd probably know why.
  
  --
  No problem is insoluble in all conceivable circumstances.
This branch could bear some interesting fruit... by trims · 2010-05-29 17:26 · Score: 4, Interesting

I see lots of people down on the theory - even though the original proposal was for highly-error forgiving applications - because somehow it means we can't trust the computations from the CPU anymore.
People - realize that you can't trust them NOW.
As someone who's spent way too much time in the ZFS community talking about errors, their sources and how to compensate, let me enlighten you:
modern computers are full of uncorrected errors
By that, I mean that there is a decided tradeoff between hardware support for error correction (in all the myriad places in a computer, not just RAM) and cost, and the decision has come down on the side of screw them, they don't need to worry about errors, at least for desktops. Even for better quality servers and workstations, there are still a large number of places where the hardware simply doesn't check for errors. And, in many cases, the hardware alone is unable to check for errors and data corruption.
So, to think that your wonderful computer today is some sort of accurate calculating machine is completely wrong! Bit rot and bit flipping happens very frequently for a simple reason: error rates per billion operations (or transmissions, or whatever) have essentially stayed the same for the past 30 years, while every other component (and bus design, etc.) is pretty much following Moore's Law. The ZFS community is particularly worried about disks, where the hard error rates are now within two orders of magnitude of the disk's capacity (e.g. for a 1TB disk, you will have a hard error for every 100TB or so of data read/written). But, there's problems in on-die CPU caches, bus line transmissions, SAS and FC cable noise, DRAM failures, and a whole host of other places.
Bottom line here: the more research we can do into figuring out how to cope with the increasing frequency of errors in our hardware, the better. I'm not sure that we're going to be able to force a re-write of applications, but certainly, this kind of research and possible solutions can be taken care of by the OS itself.
Frankly, I liken the situation to that of using IEEE floating point to calculate bank balances: it looks nice and a naive person would think it's a good solution, but, let me tell you, you come up wrong by a penny more often that you would think. Much more often.
-Erik

--
There are always four sides to every story: your side, their side, the truth, and what really happened.