Slashdot Mirror


When Mistakes Improve Performance

jd and other readers pointed out BBC coverage of research into "stochastic" CPUs that allow communication errors in order to reap benefits in performance and power usage. "Professor Rakesh Kumar at the University of Illinois has produced research showing that allowing communication errors between microprocessor components and then making the software more robust will actually result in chips that are faster and yet require less power. His argument is that at the current scale, errors in transmission occur anyway and that the efforts of chip manufacturers to hide these to create the illusion of perfect reliability simply introduces a lot of unnecessary expense, demands excessive power, and deoptimises the design. He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully. He believes he has shown such a design would work and that it would permit Moore's Law to continue to operate into the foreseeable future. However, this is not the first time someone has tried to fundamentally revolutionize the CPU. The Transputer, the AMULET, the FM8501, the iWARP, and the Crusoe were all supposed to be game-changers but died cold, lonely deaths instead — and those were far closer to design philosophies programmers are currently familiar with. Modern software simply isn't written with the level of reliability the stochastic processor requires (and many software packages are too big and too complex to port), and the volume of available software frequently makes or breaks new designs. Will this be 'interesting but dead-end' research, or will Professor Kumar pull off a CPU architectural revolution really not seen since the microprocessor was designed?"

44 of 222 comments (clear)

  1. Impossible design by ThatMegathronDude · · Score: 4, Interesting

    If the processor goofs up the instructions that its supposed to execute, how does it recover gracefully?

    1. Re:Impossible design by Anonymous Coward · · Score: 3, Funny

      The Indian-developed software will itself fuck up in a way that negates whatever fuck up just happened with the CPU. In the end, it all balances out, and the computation is correct.

    2. Re:Impossible design by TheThiefMaster · · Score: 4, Insightful

      Especially a JMP (GOTO) or CALL. If the instruction is JMP 0x04203733 and a transmission error makes it do JMP 0x00203733 instead, causing it to attempt to execute data or an unallocated memory page, how the hell can it recover from that? It could be even worse if the JMP instruction is changed only subtly, jumping only a few bytes too far or too close could land you the wrong side of an important instruction that throws off the entire rest of the program. All you could do is to detect the error/crash and restart from the beginning and hope. What if the error was in your error detection code? Do you have to check the result of your error detection for errors too?

    3. Re:Impossible design by Anonymous Coward · · Score: 3, Interesting

      Thats a good point. You accept mistakes with the data, but don't want the operation to change from add (where, when doing large averages plus/minus a few hundreds wont matter) to multiply or divide.

      But once you have the opcode separated from the data, you can mess with the former. E.g. not care when something is a race condition because that happening every 1000th operation doesn't matter too much.
      And as this is a source of noise, you just got a free random data!
      Still, this looks more like something for scientific computing, and when they build the next big one that can easily be factored in. For home computing, not so much, 99% of the time they wait for user input anyhow.

    4. Re:Impossible design by Turzyx · · Score: 2, Interesting

      Or worse, it could jump to itself repeatedly, thereby creating a HCF situation.

    5. Re:Impossible design by AmiMoJo · · Score: 3, Informative

      The first thing to say is that we are not talking about general purpose CPU instructions but rather the highly repetitive arithmetic processing that is needed for things like video decoding or 3D geometry processing.

      The CPU can detect when some types of error occur. It's a bit like ECC RAM where one or two bit errors can be noticed and corrected. It can also check for things like invalid op-codes, jumps to invalid or non-code memory and the like. If a CPU were to have two identical ALUs it could compare results.

      Software can also look for errors in processed data. Things like checksums and estimation can be used.

      In fact GPUs already do this to some extent. AMD and nVidia's workstation cards are the same as their gaming cards, the only difference being that the workstation ones are certified to produce 100% accurate output. If a gaming card colours a pixel wrong every now and then it's no big deal and the player probably won't even notice. For CAD and other high end applications the cards have to be correct all the time.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    6. Re:Impossible design by Chowderbags · · Score: 3, Insightful

      Moreover, if the processor goofs on the check, how will the program know? Do we run every operation 3 times and take the majority vote (then we've cut down to 1/3rd of the effective power)? Even if we were to take the 1% error rate, given that each core of CPUs right now can run billions of instructions per second, this CPU will fail to check correctly every second (even checking, rechecking, and checking again every single operation). And what about memory operations? Can we accept errors in a load or store function? If so, we can't in practice trust our software to do what we tell it. (change a bit on load and you could do damn near anything from adding the wrong number, to saying an if statement is true when it should be false, to not even running the right fricken instruction.

      There's a damn good reason why we want our processors to be rock solid. If they don't work right, we can't trust anything they output.

    7. Re:Impossible design by Interoperable · · Score: 5, Insightful

      The research is targeted specifically at dedicated audio/video encoding/decoding blocks within the processors of mobile devices and similar error-tolerant applications. The journalist just didn't mention the fact that the idea isn't to expose the entire system to fault-prone components. When considered in the light that the most power-sensitive mainstream devices (cell-phones) spend most of their time doing these error-tolerant tasks, the research becomes quite interesting. They claim to have demonstrated the effectiveness of the technique to encode an h.264 video.

      --
      So if this is the future...where's my jet pack?
    8. Re:Impossible design by pipatron · · Score: 4, Insightful

      Not very insightful. You seem to say that a CPU today is error-free, and if this is true, the part of the new CPU that does the checks could also be made error-free so there's no problem.

      Well, they aren't rock-solid today either, so you can not trust their output even today. It's just not very likeley that there will be a mistake. This is why mainframes execute a lot of instructions at least twice, and decides on-the-fly if something went wrong. This idea is just an extension of that.

      --
      c++; /* this makes c bigger but returns the old value */
    9. Re:Impossible design by maxume · · Score: 2, Funny

      Encomistakesding.

      Or maybe truoble.

      --
      Nerd rage is the funniest rage.
    10. Re:Impossible design by demerzeleto · · Score: 3, Interesting

      There's a damn good reason why we want our processors to be rock solid. If they don't work right, we can't trust anything they output.

      Have you ever tried transferring large files over a 100 MBps ethernet link? Thats right, billions of bytes over a noisy, unreliable wired link. And how often have you seen files corrupted? I never have. The link runs along extremely reliably (BER of 10^-9 I think) with as little as 12MBps out of the 100MBps spent on error checking and recovery.

      Same case here. I'd expect the signal-to-noise ratio on the connects within CPUs (when the voltage is cut by say 25%) to be similar, if not better, than ethernet links. So the CPU could probably get along with lesser error checking and recovery. Or, if you choose applications (like video decoding or graphics rendering) that have no problems with a few bad bits here and there, you could manage with almost no ECC at all.

      If you were to plot Error Rates vs CPU power, I'd say most modern CPUs lie at the far end of the region of diminishing returns. Theres a gold mine to be reaped by moving backwards on the curve.

    11. Re:Impossible design by DavidRawling · · Score: 2, Informative

      space the instructions further apart so that one or two bit flips won't map to another instruction.

      Yeah - I think you left out the thinking bit before your comment.

      Sure, a single bit flip in the least significant bit only moves you 1 byte forward or backward in RAM. But in the most significant bit in a 32 bit CPU it moves you 2GB away (let alone the 8.4 billion GB in a 64 bit CPU, if my mental maths is correct).

      Just how far apart do you want the instructions?

  2. Moving, not fixing, the problem by Red+Jesus · · Score: 4, Interesting

    The "robustification" of software, as he calls it, involves re-writing it so an error simply causes the execution of instructions to take longer.

    Ooh, this is tricky. So we can reduce CPU power consumption by a certain amount if we rewrite software in such a way that it can slowly roll over errors when they take place. There are some crude numbers in the document: a 1% error rate, whatever that means, causes a 23% drop in power consumption. What if the `robustification' of software means that it has an extra "check" instruction for every three "real" instructions? Now you're back to where you started, but you had to rewrite your software to get here. I know, it's unfair to compare his proven reduction in power consumption with my imaginary ratio of "check" instructions to "real" instructions, but my point still stands. This system may very well move the burden of error correction from the hardware to the software in such a way that there is no net gain.

    1. Re:Moving, not fixing, the problem by sourcerror · · Score: 2, Insightful

      This system may very well move the burden of error correction from the hardware to the software in such a way that there is no net gain.

      People said the same about RISC processors.

    2. Re:Moving, not fixing, the problem by Turzyx · · Score: 2, Interesting

      I'm making assumptions here, but if these errors are handled by software would it not be possible for a program to 'ignore' errors in certain circumstances? Perhaps this could result in improved performance/battery life for certain low priority tasks. Although an application where 1% error is acceptable doesn't spring immediately to mind, maybe supercomputing - where anomalous results are checked and verified against each other...?

    3. Re:Moving, not fixing, the problem by DigiShaman · · Score: 2, Informative

      From what I understand, all modern processors are now a hybrid of both RISK and CISC (Intel Core 2, AMD K8, etc). Except for embedded applications, the generic CPU doesn't have that kind of pure classification anymore. Right?

      --
      Life is not for the lazy.
    4. Re:Moving, not fixing, the problem by xZgf6xHx2uhoAj9D · · Score: 2, Informative

      The classifications weren't totally meaningful to begin with, but CISC has essentially died. I don't mean there aren't CISC chips anymore--any x86 or x64 chip can essentially be called "CISC"--but certainly no one's designed a CISC architecture in the past decade at least.

      RISC has essentially won and we've moved into the post-RISC world as far as new designs go. VLIW, for instance, can't really be considered classical RISC, but it builds on what RISC has accomplished.

      The grandparent's point is a good one: people thought RISC would never succeed; they were wrong.

    5. Re:Moving, not fixing, the problem by xZgf6xHx2uhoAj9D · · Score: 2, Informative

      Error Correction Codes (aka Forward Error Correction) are typically more efficient for high-error channels than error detection (aka checksum and retransmit), which is why 10Gbps Ethernet uses Reed-Solomon rather than CRC in previous Ethernet standards: it avoids the need to retransmit.

      I had the same questions about how this is going to work, though. What is the machine code going to look like and how will it allow the programmer to check for errors? Possibly each register could have an extra "error" bit (similarly to IA-64's NaT bit on its GP registers). E.g., if you do an "add" instruction, it checks the error bits on its source operands and propagates them. So long as you only allow false positives and false negatives, it would work, and could be relatively efficient.

    6. Re:Moving, not fixing, the problem by JoeMerchant · · Score: 2, Interesting

      Why rewrite the application software? Why not catch it in the firmware and still present a "perfect" face to the assembly level code? Net effect would be an unreliable rate of execution, but who cares about that if the net rate is faster?

    7. Re:Moving, not fixing, the problem by Interoperable · · Score: 4, Insightful

      I did some digging and found some material by the researcher, unfiltered by journalists. I don't have any background in processor architecture but I'll present what I understood. The original publications can be found here.

      The target of the research is not general computing, but rather low-power "client-side" computing, as the author puts it. I understand this to be decoding application, such as voice or video in mobile devices. Furthermore, the entire architecture would not be stochastic, but rather it would contain some functional blocks that are stochastic. I think the idea is that certain mobile hardware devices devote much of their time to specialized applications that do not require absolute accuracy.

      A mobile phone may spend most of it's time being used encode/decode low resolution voice and video and would have significant blocks within the processor devoted to those tasks. Those tasks could be considered error tolerant. The operating system would not be exposed to error-prone hardware, only applications that use hardware acceleration for specialized, error-tolerant tasks. In fact, the researchers specifically mention encoding/decoding voice and video and have demonstrated the technique on encoding h.264 video.

      --
      So if this is the future...where's my jet pack?
    8. Re:Moving, not fixing, the problem by jd · · Score: 3, Insightful

      For this, I'd point to the RISC vs. CISC debate. RISC programs took many more instructions to do the same things, but the gain in performance was so great that you ended up with greater performance. Extra steps = some amount of overhead, but so long as the net gain is greater than the net overhead, you will gain overall. The RISC chip demonstrated that such examples really do exist in the real world.

      But just because there are such situations, it does not automatically follow that more steps always equals greater performance. It may be that better error-correction techniques in the hardware would handle the transmission errors just fine without having to alter any software at all. It depends on the nature of the errors (bursty vs randomly-distributed, percent of signal corrupted, etc) as to what error-correction would be needed and whether the additional circuitry can be afforded.

      Alternatively, the problem may in fact turn out to be a solution. Once upon a time, electron leakage was a serious problem in CPU designs. Then, chip manufacturers learned that they could use electron tunneling deliberately. The cause of these errors may be further electron leakage or some other quantum effect, it really doesn't matter. If it leads to a better practical understanding of the quantum world to the point where the errors can be mitigated and the phenomenon turned to the advantage of the engineers, it could lead to all kinds of improvement.

      There again, it might prove redundant. There are good reasons for believing that "Processor In Memory" architectures are good for certain types of problem - particularly for providing a very standard library of routines, but certain opcodes can be shifted there as well. There is also the Cell approach, which is to have a collection of tightly-coupled but specialized processors of different kinds. A heterogenious cluster with shared memory, in effect. If you extend the idea to allow these cores to be physically distinct, you can offload from the original CPU that way. In both cases, you distribute the logic over a much wider area without increasing the distance signals have to travel (you may even shorten the distance). As such, you can eliminate a lot of the internal sources of errors.

      It may prove redundant in other ways, too. There are plenty of cooling kits these days that will take a CPU to much lower temperatures. Less thermal noise may well result in fewer errors, since that is a likely source of some of them. Currently, processors often run at 40'C - and I've seen laptop CPUs reach 80'C. If you can bring the cores down to nearly 0'C and keep them there, that should have a sizable impact on whether the data is being transmitted accurately. The biggest change would be to modify the CPU casing so that heat is reliably and rapidly transferred from the silicon. I would imagine that you'd want the interior of the CPU casing to be flooded with something that conducts heat but not electricity - fluorinert does a good job there - and then have the top of the case to also be an extra-good heat conductor (plastic only takes you so far).

      However, if programs were designed with fault-tolerence in mind, these extra layers might not be needed. You might well be able to get away with better software on a poorer processor. Indeed, one could argue that the human brain is an example of an extremely unreliable processor whose net processing power (even allowing for the very high error rates) is far far beyond any computer yet built. This fits perfectly with the professor's description of what he expects, so maybe this design actually will work as intended.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  3. More likely: Impossible gains by Anonymous Coward · · Score: 2, Insightful

    More importantly, if the software is more robust so as to detect and correct errors, then it will require more clock cycles of the CPU and negate the performance gain.

    This CPU design sounds like the processing equivalent of a perpetual motion device. The additional software error correction is like the friction that makes the supposed gain impossible.

  4. Re:Moore's law by takev · · Score: 3, Informative

    What he is proposing is to reduce the number of transistors on a chip, to increase its speed and reduce power usage.
    So in fact he is trying to reverse more's law.

  5. Sounds reasonable to me by bug1 · · Score: 4, Insightful

    Ethernet is an improvement over than token ring, yet Ethernet has collisions and token ring doesn't.

    Token ring avoids collisions, Ethernet accepts collisions will take place but has a good error recovery system.

  6. Actually you've got the right idea by Anonymous Coward · · Score: 2, Interesting

    The summary talked about the communication links... I remember when we were running SLIP over two-conductor serial lines and "overclocking" the serial lines. Because the networking stack (software) was doing checksums and retries, it worked faster to run the line into its fast but noisy mode, rather than to clock it conservatively at a rate with low noise.

    If the chip communications paths start following the trend of off-chip paths (packetized serial streams), then you could have higher level functions of the chip do checksums and retries, with a timeout that aborts back even higher to a software level. Your program could decide how much to wait around for correct results versus going on with partial results, depending on its real-time requirements. The memory controllers could do this, using the large, remote SRAM and RAM spaces as an assumed-noisy space and overlaying its own software checksum format on top.

    This is really not so different from modern filesystems which start to reduce their trust in the storage device, and overlay their own checksum, redundancy, and recovery methods. You can imagine bringing these reliability boundaries ever "closer" to the CPU. Of course, you are right that it doesn't make sense to allow noisy computed goto addresses, unless you can characterize the noise and link your code with "safe landing areas" around the target jump points. It makes even less sense to have noisy instruction pointers, e.g. where it could back up or skip steps by accident, unless you can design an entire ISA out of idempotent instructions which you can then emit with sufficient redundancy for your taste.

  7. A brainy idea. by Ostracus · · Score: 4, Interesting

    He favors a new architecture, that he calls the 'stochastic processor,' which is designed to handle data corruption and error recovery gracefully.

    I dub thee neuron.

    --
    Shai Schticks:"You don't make peace with friends, you make peace with enemies"
  8. The problem isn't hardware to begin with... by Angst+Badger · · Score: 4, Insightful

    ...the problem is software. In the last twenty years, we've gone from machines running at a few MHz to multicore, multi-CPU machines with clock speeds in the GHz, with corresponding increases in memory capacity and other resources. While the hardware has improved by several orders of magnitude, the same has largely not been true of software. With the exception of games and some media software, which actually require and can use all the hardware you can throw at them, end user software that does very little more than it did twenty years ago could not even run on a machine from 1990, much less run usably fast. I'm not talking enterprise database software here, I'm talking about spreadsheets and word processors.

    All of the gains we make in hardware are eaten up as fast or faster than they are produced by two main consumers: useless eye-candy for end users, and higher and higher-level programming languages and tools that make it possible for developers to build increasingly inefficient and resource-hungry applications faster than before. And yes, I realize that there are irresistible market forces at work here, but that only applies to commercial software; for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.

    It really doesn't matter how powerful the hardware becomes. For specialist applications, it's still a help. But for the average user, an increase in processor speed and memory simply means that their 25 meg printer drivers will become 100 meg printer drivers and their operating system will demand another gig of RAM and all their new clock cycles. Anything that's left will be spent on menus that fade in and out and buttons that look like quivering drops of water -- perhaps next year, they'll have animated fish living inside them.

    --
    Proud member of the Weirdo-American community.
    1. Re:The problem isn't hardware to begin with... by Draek · · Score: 4, Insightful

      for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.

      You yourself stated that high-level languages allow for a much faster rate of development, yet you dismiss the idea of using them in the F/OSS world as a mere "desire to emulate corporate software development"?

      Hell, you also forgot another big reason: high-level code is almost always *far* more readable than its equivalent set of low-level instructions, the appeal of which for F/OSS ought to be similarly obvious.

      Sorry but no, the reason practically the whole industry has been moving towards high-level languages isn't because we're all lazy, and if you worked in the field you'd probably know why.

      --
      No problem is insoluble in all conceivable circumstances.
    2. Re:The problem isn't hardware to begin with... by Homburg · · Score: 3, Insightful

      So you're posting this from Mosaic, I take it? I suspect not, because, despite your "get off my lawn" posturing, you recognize in practice that modern software actually does do more than twenty-year-old software. Firefox is much faster and easier to use than Mosaic, and it also does more, dealing with significantly more complicated web pages (like this one; and terrible though Slashdot's code surely is, the ability to expand comments and comment forms in-line is a genuine improvement, leaving aside the much more significant improvements of something like gmail). Try using an early 90s version of Word, and you'll see that, in the past 20 years word processors, too, have become significantly faster, easier to use, and capable of doing more (more complicated layouts, better typography).

      Sure, the laptop I'm typing this on now is, what, 60 times faster than a computer in 1990, and the software I'm running now is neither 60 times faster nor 60 times better than the software I was running in 1990. But it is noticeably faster, at the same time that it does noticeably more and is much easier to develop for. The idea that hardware improvements haven't led to huge software improvements over the past 20 years can only be maintained if you don't remember what software was like 20 years ago.

    3. Re:The problem isn't hardware to begin with... by jasonwc · · Score: 2, Informative

      The problem isn't really modern CPUs but the lack of improvement in conventional hard drive speeds. With a Core i7 processor and a 160 GB X-25M Gen 2 Intel SSD, pretty much everything I run loads within 1-2 seconds, and there is little or no difference between loading apps from RAM or my hard drive. Even with a 1 TB WD Caviar Black 7200 RPM drive, my Core i7 machine was constantly limited by the hard drive.

      With an SSD, I boot to a usable desktop and can load Firefox with Adblock, Pidgin, Skype, Foobar2000 and Word in around 2 seconds. Many programs like Chrome load so quickly that they are effectively instant-on. Even though quad-core processors are often derided for desktop use, I see a tremendous improvement with a Core i7 + high-performance SSD vs. a Core 2 Duo + mediocre laptop drive. Modern CPUs can make your desktop experience much more responsive. You just need a hard drive that can keep up.

      Oh, and in video playback, the difference is incredibly obvious. My roommate is still using a 7 year old laptop which can barely playback a DVD (MPEG-2). In contrast, my Core i7 can simultaneously decode 5 1080p H.264 videos with ease (after this point, the hard drives can't keep up). While this might be considered useless, it definitely makes a difference when running background tasks such as backups. With my Core 2 Duo without hardware decoding, I would have to pause when scheduled backups started or video would skip. With my quad-core system, I can run any task in the background without fear of slowdown, and also use high-quality upscale filters and renderers that would have slowed my dual-core system to a crawl.

      Too many people claim that modern processors and hardware do not provide meaningful improvements to the desktop experience. I just don't find this to be true. Multi-core processors have allowed users to run background tasks, install software etc. with no noticeable speed degradation. When I am working with old single-core machines, I miss this benefit.

      In addition, today's software is more powerful. You may not need all the features in Word 2007 or the latest Firefox build, but that doesn't mean they aren't useful.

      Adblock, Flashblock, Session Management, the ability to have dozens of tabs loaded without memory issues, the ability to stream high definition video in my browser with no or minimal buffering, the "Awesomebar" etc. are all features that didn't exist 5+ years ago.

      Real-time indexing of system files and applications is relatively recent, and yet I find that it has fundamentally transformed how I access data.

      There are many more examples. It may be popular to say that things haven't changed much in two decades, because word processors are superficially similar for example, but a great deal has changed.

    4. Re:The problem isn't hardware to begin with... by rdnetto · · Score: 2, Insightful

      a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.

      Probably because the major FOSS developers are in corporate software development.

      --
      Most human behaviour can be explained in terms of identity.
  9. Re:why use a stochastic processor? by Arker · · Score: 3, Funny

    Why use a stochastic processor which makes mistakes when we can use our brains, which make mistakes?

    Because the stochastic processor will be able to make mistakes much more quickly of course.

    Don't you understand progress?

    --
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Friends don't let friends enable ecmascript.
  10. While not move and fix it? by BartholomewBernsteyn · · Score: 2, Informative
    This may be a far thought, but if stochastic CPUs allow for increased performance in a trade-off for correctness, maybe something like following description may reap the benefits while keeping out the stochastics right away:
    Suppose those CPUs really allow for faster instruction handling using less resources, maybe you could put more in a package, for the same price, which on a hardware level would give rise to more processing cores at the same cost. (Multi-Core stochastic CPUs)
    Naturally, you have the ability to do parallel processing, with errors possible, but you are able to process instructions at a faster rate.
    On the software side, the support for concurrency is a mayor selling point, of course, there has to be something able recover from those pesky stochastics gracefully. I come up with the functional language 'Erlang'.
    This is taken from wikipedia

    http://en.wikipedia.org/wiki/Erlang_(programming_language)#Concurrency_and_distribution_orientation

    Concurrency supports the primary method of error-handling in Erlang. When a process crashes, it neatly exits and sends a message to the controlling process which can take action. This way of error handling increases maintainability and reduces complexity of code

    From the official source:

    http://www.erlang.org/doc/reference_manual/processes.html#errors

    Erlang has a built-in feature for error handling between processes. Terminating processes will emit exit signals to all linked processes, which may terminate as well or handle the exit in some way. This feature can be used to build hierarchical program structures where some processes are supervising other processes, for example restarting them if they terminate abnormally.

    Asked to 'refer to OTP Design Principles for more information about OTP supervision trees, which use[s] this feature' I read this:

    http://www.erlang.org/doc/design_principles/des_princ.html

    A basic concept in Erlang/OTP is the supervision tree. This is a process structuring model based on the idea of workers and supervisors. Workers are processes which perform computations, that is, they do the actual work. Supervisors are processes which monitor the behaviour of workers. A supervisor can restart a worker if something goes wrong. The supervision tree is a hierarchical arrangement of code into supervisors and workers, making it possible to design and program fault-tolerant software.

    This seems well fit? Create a real, physical machine for a language both able to reap its benefits and cope with the trade-off.
    Or maybe I'm too far off (I'm bored technologically, allow me some paradigmatic change at slashdot).

    TamedStochastics - Hiring.

    Yes, checksumming on dedicated hardware was my first thought as well.

  11. OpenCL by tepples · · Score: 2, Interesting

    AMD and nVidia's workstation cards are the same as their gaming cards, the only difference being that the workstation ones are certified to produce 100% accurate output. If a gaming card colours a pixel wrong every now and then it's no big deal and the player probably won't even notice.

    As OpenCL and other "abuses" of GPU power become more popular, "colors a pixel wrong" will eventually happen in the wrong place at the wrong time on someone using a "gaming" card.

  12. I'd just like to point out... by DavidR1991 · · Score: 2, Insightful

    ...that the Transmeta Crusoe processor has sod-all to do with porting or different programming models. The whole point of the Crusoe was that it could distil down various types of instruction (e.g. x86, even Java bytecode) to native instructions it understood. It could run 'anything' so to speak, given the right abstraction layer in between

    Its lack of success was nothing to do with programming - just that no one needed a processor that could these things. The demand wasn't there

    1. Re:I'd just like to point out... by 10101001+10101001 · · Score: 2, Interesting

      The whole point of the Crusoe was that it could distil down various types of instruction (e.g. x86, even Java bytecode) to native instructions it understood. It could run 'anything' so to speak, given the right abstraction layer in between

      Yea, uh, that's true for *any* general purpose processor. What Crusoe original promised was that this dynamically recompiled code might be either faster (by reordering and optimizing many instructions to fit Crusoe's Very Large Instruction Word design--not unlike how the Pentium Pro and above do it in hardware with multiple APU/FPU functional units) or more power efficiently (by removing the hardware parts of the reorderer/optimizer and having the software equivalent run unoften). Of course, the former just didn't hold because Intel/AMD could just pump out higher hertz processors and the latter didn't matter as much as simply underclocking the whole CPU when the system was idle (which is often enough). In short, Crusoe found two niches that both Intel and AMD cornered.

      Its lack of success was nothing to do with programming - just that no one needed a processor that could these things. The demand wasn't there

      Well, that's the other part of the equation. If the Crusoe had actually provided multiple abstraction layers and not just the one (the x86 one), perhaps they could have survived. Crusoe would have been a great platform for emulating the PSX, for example. Or providing multiple, concurrent x86/Java/whatever systems to sandbox for servers--and for which the power efficiency would be important. But, then, providing well-optimized and many software solutions isn't an easy task, especially when balanced against the task of avoiding running the "Code Morphing" software as much as possible to avoid all the penalties associated with it.

      In short, the problem wasn't the demand per se; it was a lack of supply. Pining the hopes of the company on a few niches of which the competitors managed to relatively quickly occupy certainly didn't help.

      --
      Eurohacker European paranoia, gun rights, and h
  13. How eye candy helps interoperability by tepples · · Score: 2, Interesting

    And yes, I realize that there are irresistible market forces at work here, but that only applies to commercial software; for the FOSS world, it's a tremendous lost opportunity that appears to have been driven by little more than a desire to emulate corporate software development, which many FOSS developers admire for reasons known only to them and God.

    I think I know why. If free software lacks eye candy, free software has trouble gaining more users. If free software lacks users, hardware makers won't cooperate, leading to the spread of "paperweight" status on hardware compatibility lists. And if free software lacks users, there won't be any way to get other software publishers to document data formats or to get publishers of works to use open data formats.

  14. Late, and innaccurate by gman003 · · Score: 3, Interesting

    I've seen this before, except for an application that made more sense: GPUs. A GPU instruction is almost never critical. Errors writing pixel values will just result in minor static, and GPUs are actually far closer to needing this sort of thing. The highest-end GPUs draw far more power than the highest-end CPUs, and heating problems are far more common.

    It may even improve results. If you lower the power by half for the least significant bit, and by a quarter for the next-least, you've cut power 3% for something invisible to humans. In fact, a slight variation in the rendering could make the end result look more like our flawed reality.

    A GPU can mess up and not take down the whole computer. A CPU can. What happens when the error hits during a syscall? During bootup? While doing I/O?

    1. Re:Late, and innaccurate by thegarbz · · Score: 2, Interesting

      I can't see this working. The premise here was that the hardware allows faults, yet I don't see how you could design hardware like this to be accurate on demand. GPUs aren't only used for games these days. Would an error still be tolerated while running Folding@Home?

  15. Re:Wrong approach? by jibjibjib · · Score: 2, Funny
    Yeah because academics are idiots who measure performance in incorrect calculations per second, and they did this research without thinking of all these things that you've thought up in two minutes reading the Slashdot summary.

    Seriously, people, get some common sense.

  16. Re:Wrong approach? by somersault · · Score: 3, Interesting

    What use is a blazing fast computer that is no longer reliable

    Meh.. I'm pretty happy to have my brain, even if it makes some mistakes sometimes.

    --
    which is totally what she said
  17. Not really by Sycraft-fu · · Score: 3, Informative

    Ethernet has lower latency than token ring, and is over all easier to implement. However its bandwidth scaling is poor, when you are talking old school hub ethernet. The more devices you have, the less total throughput you get due to collisions. Eventually you can grind a segment to a complete halt because collisions are so frequent little data gets through.

    Modern ethernet does not have collisions. It is full duplex, using separate transmit and receive wires. It scales well because the devices that control it, switches, handle sending data where it needs to go and can do bidirectional transmission. The performance you see with gig ethernet is only possible because you do full duplex communications with essentially no errors.

  18. This branch could bear some interesting fruit... by trims · · Score: 4, Interesting

    I see lots of people down on the theory - even though the original proposal was for highly-error forgiving applications - because somehow it means we can't trust the computations from the CPU anymore.

    People - realize that you can't trust them NOW.

    As someone who's spent way too much time in the ZFS community talking about errors, their sources and how to compensate, let me enlighten you:

    modern computers are full of uncorrected errors

    By that, I mean that there is a decided tradeoff between hardware support for error correction (in all the myriad places in a computer, not just RAM) and cost, and the decision has come down on the side of screw them, they don't need to worry about errors, at least for desktops. Even for better quality servers and workstations, there are still a large number of places where the hardware simply doesn't check for errors. And, in many cases, the hardware alone is unable to check for errors and data corruption.

    So, to think that your wonderful computer today is some sort of accurate calculating machine is completely wrong! Bit rot and bit flipping happens very frequently for a simple reason: error rates per billion operations (or transmissions, or whatever) have essentially stayed the same for the past 30 years, while every other component (and bus design, etc.) is pretty much following Moore's Law. The ZFS community is particularly worried about disks, where the hard error rates are now within two orders of magnitude of the disk's capacity (e.g. for a 1TB disk, you will have a hard error for every 100TB or so of data read/written). But, there's problems in on-die CPU caches, bus line transmissions, SAS and FC cable noise, DRAM failures, and a whole host of other places.

    Bottom line here: the more research we can do into figuring out how to cope with the increasing frequency of errors in our hardware, the better. I'm not sure that we're going to be able to force a re-write of applications, but certainly, this kind of research and possible solutions can be taken care of by the OS itself.

    Frankly, I liken the situation to that of using IEEE floating point to calculate bank balances: it looks nice and a naive person would think it's a good solution, but, let me tell you, you come up wrong by a penny more often that you would think. Much more often.

    -Erik

    --
    There are always four sides to every story: your side, their side, the truth, and what really happened.
  19. Re:So... It was harmful after all by jd · · Score: 2, Funny

    That's why some languages use ComeFrom rather than GoTo.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)