New Framework For Programming Unreliable Chips
rtoz writes "To handle the unreliable chips of the future, a research group at MIT's Computer Science and Artificial Intelligence Laboratory has developed a new programming framework that lets software developers specify when errors may be tolerable. The system then calculates the probability that the software will perform as intended.
As transistors get smaller, they also become less reliable. This unreliability won't be a major issue in some cases. For example, if a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won't notice, but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency."
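The summary does not say how that probability is computed. As a minimal sketch, assuming each unreliable operation fails independently with a fixed probability (the figures below are made up for illustration and are not taken from the MIT work), the chance that a result is correct is at least the product of the success probabilities of the operations it depends on:

```python
from math import prod

# Hypothetical per-operation success probabilities (illustrative only).
P_UNRELIABLE_MUL = 0.999999
P_UNRELIABLE_ADD = 0.999999

def reliability_lower_bound(op_probs):
    """Probability that every operation in a chain succeeds, assuming
    independent failures. It is a lower bound on getting a correct result,
    because some errors may be masked or cancel out."""
    return prod(op_probs)

# Assumed cost model: each decoded pixel needs one unreliable multiply and one add.
per_pixel = reliability_lower_bound([P_UNRELIABLE_MUL, P_UNRELIABLE_ADD])

# Assuming any failure corrupts its pixel, a whole 1080p frame is perfect
# only if every pixel-level computation succeeds.
frame = per_pixel ** (1920 * 1080)

print(f"single pixel correct: {per_pixel:.6f}")
print(f"whole frame perfect:  {frame:.4f}")
```

Under these assumptions any single pixel is almost always right, yet the whole frame comes out perfect less than 2% of the time, which is the kind of gap that makes per-computation error tolerance attractive.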
Asking software to correct hardware errors is like asking Godzilla to protect Tokyo from Mega Godzilla.
This does not lead to rising property values.
So, expect the quality of computers to go downhill over the next few years, but we'll do our best to fix it in software?
To me, that sounds like putting the quality control on the wrong side of the equation.
Lost at C:>. Found at C.
...Our new snowy-screen overlords. I was just thinking how much I liked the good old days where if the TV was flaking out, you could just give the set a good whack and it would get its act together.
Seriously, why do we want to do this? Is power usage going to be cut in half? Are yields going to double? I think it's nice to talk about (especially in the interest of making systems that go "kinda bad" instead of completely breaking), but why not just invest the time and effort into fixing the issues directly? Have we run out of ideas on how to do that?
"...but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency."
Which you could already get now simply by not doing error correction. No need for some other programming framework to get this.
Or we could just use Java, with its "almost" IEEE-compliant libraries. I mean, who really needs a perfect answer anyway?
Does that register "really" need to contain that value? How about any value?
Does the stack pointer "really" need to be there?
Does the password really need to match? It's just a hash anyway, what are a few bits of uncertainty?
Does the packet "really" need to get sent?
Does the CRC "really" have to match?
These functions will make great improvements to Java.
if a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won't notice
We used to return those LCD monitors to the store.
H.264 relies heavily on pixels in previously decoded reference frames. Incorrectly decoded pixels will be visible in many of the frames that follow. What's worse, motion compensation will start moving them around and spreading them.
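A toy one-dimensional model of that effect, assuming each frame is predicted from the previous one with half-pixel motion and simple two-tap averaging (real H.264 uses block-based motion compensation, multiple reference frames and a 6-tap filter for half-sample positions), shows a single bad pixel both moving and widening:

```python
WIDTH = 8
FRAMES = 4

def predict(prev):
    """Half-pixel motion to the right: average each pixel with its left
    neighbour (rounding half up), replicating the edge pixel."""
    return [(prev[max(i - 1, 0)] + prev[i] + 1) // 2 for i in range(len(prev))]

def decode(first_frame, n_frames):
    """Decode n_frames predicted frames with a zero residual."""
    decoded = [first_frame]
    for _ in range(n_frames):
        decoded.append(predict(decoded[-1]))
    return decoded

clean = decode([100] * WIDTH, FRAMES)

bad_first = [100] * WIDTH
bad_first[2] = 180                      # one mis-decoded pixel in the reference frame
bad = decode(bad_first, FRAMES)

# Positions that differ from the clean decode, frame by frame.
spread = [[i for i in range(WIDTH) if c[i] != b[i]] for c, b in zip(clean, bad)]
for n, wrong in enumerate(spread):
    print(f"frame {n}: wrong pixels at {wrong}")
```

The error shrinks in amplitude as it is averaged, but the number of affected pixels grows by one per frame until the next intra-coded frame resets the chain.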
Are they going to make "unreliable transistors" that, upon failure, simply decode a pixel incorrectly, rather than, oh, I don't know, branching the program to an unspecified memory address in the middle of nowhere and borking everything?
They'd have to completely re-architect whatever chip is doing the calculations. You'd need three classes of "data": instructions, important data (branch addresses, etc.), and unimportant data. Only the last of these could be run on unreliable transistors.
I can't imagine a way of doing that where the overhead takes less time than actually using decent transistors in the first place.
Oh, wait. It's a software lab that's doing this. Never mind, they're not thinking about the hardware at all.
It may look like I'm doing nothing, but I'm actively waiting for my problems to go away.
--Scott Adams
Yeah, we really want those "almost working" machines:
- flying planes
- controlling infrastructure
- running financial transactions
- doing medical inference
Well, they would probably be all right in mobile devices, except
- when authorizing transactions
- doing secure communications
Well, they would be fine for Angry Birds.
We need software to design hardware to make software...
In short it is about better adjusting your tolerance levels on individual features.
I want my integer arithmetic to be perfect, and my floating point good to 8 decimal places. But there are components meant for interfacing with humans. Take audio: so much is altered or lost because of differences in speaker quality, even on top-notch ones with gold (or whatever crazy stuff) cables. So in your digital-to-analog conversion, you may be fine if a voltage is a bit off or you skip a random change, as the smoothing mechanism will often hide that little mistake.
Now for displays... We need to be pixel-perfect when the screen has little movement. But if we are watching a movie, a pixel whose color should be #8F6314 can be #A07310 for 1/60th of a second and we wouldn't notice. And most displays are not even high enough quality to show these differences.
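The arithmetic behind that example, as a quick sketch (the helper name channels is just for illustration):

```python
def channels(hex_colour: str):
    """Split '#RRGGBB' into (r, g, b) integers in the range 0-255."""
    h = hex_colour.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

intended = channels("#8F6314")
shown = channels("#A07310")
deltas = [s - i for i, s in zip(intended, shown)]

print(intended, shown, deltas)
print(f"one frame at 60 Hz lasts {1000 / 60:.1f} ms")
```

Each channel is off by at most 17 out of 255, about 7% of full scale, and only for a single 16.7 ms frame.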
We hear of these errors and think, how horrible that we are not getting perfect products... However, it is really a trade-off: getting smaller and faster at the cost of a few more glitches.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
So it's like the Pentium FDIV bug, where "a little error" wasn't considered enough reason to recall the processors until Intel got bashed for it.
The idea of fault-tolerant computing is similar to the EnerJ programming language being developed at the University of Washington for power savings: "The Language of Good Enough Computing".
The gist of the idea is that the programmer can specify which variables need to be exact and which can be approximate. The approximate variables would then be stored in low-refresh RAM, which is more prone to errors but saves power, while the precise variables would be stored in higher-power memory that would be error-free.
The example they gave was calculating the average shade of grey in a large image of 1000 by 1000 pixels. The running total could be held in an approximate variable, since the error from adding one pixel incorrectly out of a million would be small, while the loop control variable would be precise, since you wouldn't want your loop to overrun its bounds.
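EnerJ itself is a Java extension, with type annotations such as @Approx marking data that may live in approximate storage. The Python below only simulates the averaging example under an assumed fault model (an approximate add occasionally sees one flipped bit in the 8-bit pixel it is adding); the error rate and random seeds are made up:

```python
import random

ERROR_RATE = 1e-4      # hypothetical chance that one approximate add is faulty
WIDTH = HEIGHT = 1000

def approx_add(total, pixel, rng):
    """Simulated approximate addition: with probability ERROR_RATE, one
    random bit of the 8-bit pixel value is flipped before it is added."""
    if rng.random() < ERROR_RATE:
        pixel ^= 1 << rng.randrange(8)
    return total + pixel

def average_grey(pixels, rng):
    approx_total = 0                      # approximate: may absorb faults
    for i in range(len(pixels)):          # precise: the loop index is never faulted
        approx_total = approx_add(approx_total, pixels[i], rng)
    return approx_total / len(pixels)

image_rng = random.Random(42)
pixels = [image_rng.randrange(256) for _ in range(WIDTH * HEIGHT)]

exact = sum(pixels) / len(pixels)
approx = average_grey(pixels, random.Random(7))
print(f"exact {exact:.4f}  approximate {approx:.4f}  error {abs(exact - approx):.5f}")
```

Each fault changes one pixel by at most 128, so around a hundred faults across a million pixels move the average by at most a few hundredths of a grey level, whereas the same kind of fault in the loop index could skip pixels or run past the end of the image.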
Am I the only French person who thinks that "Computer Science and Artificial Intelligence Laboratory" sounds like this in French:
CS-AIL?
For example, you could use imprecise calculations during an animation sequence that moves a UI around, since it isn't really worth making it 100% perfect when it will be moving fast.
For something like the UI of a mobile app, it's extremely worth doing, but for a game that requires pixel-perfect smooth animation, it might be a problem. (Mind you, in games both cases are useful, since you wouldn't notice, such as when spinning stupidly fast in an FPS or slowly moving through corridors hunting for people.)
Simply put, some things do not require quality to perform; they just need something that is more or less accurate up to a point.
It isn't hard to imagine such a thing working; you just need to give the system the right interfaces so software can punt code off to either the "TCP" of CPUs or the "UDP" of CPUs, to put it another way.
With TCP we want 100% of the data to arrive intact; with UDP we only really care that most of it gets there, more or less accurately.
We already use it. So if this could drop prices considerably, hell, even moderately, for some parts of a hardware design, it would be good in the long run.
Mobile applications such as tablets, phones, watches and whatever comes next would benefit greatly from this.
But even fixed devices like desktops would benefit, since it would drop prices in general. Only a percentage of the hardware would need to work 100%, while some could have failed or be of bad quality. Most hardware already has a yield, the fraction of products that survive manufacturing; the rest are usually gimped and then sold at a cheaper price to make the most of them.
Designing them with failure in mind could alleviate a lot of headaches and allow more headroom to fix other issues and come up with better designs.
It wouldn't be well received at first. When people hear "inefficient" they think "bad", which would be right, but it isn't all bad.
If it adds more frames and uses less power, with very minimal impact on smooth animation when that is needed, what is there to hate?
Not everything requires quality. There are acceptable levels of bad quality. Like JPEG for natural pictures. But not GIF. Screw GIF. And screw people that use GIF for non-animated imagery. Why won't you just unexist?
OK, I have to ask this... If the chip itself is unreliable, then how can you trust software running on that chip to detect errors reliably? Wouldn't the unreliability of the chip affect the software as well?
I love this idea, because it reminds me of the most energy efficient signal processing tool in the known universe, the human brain. Give Ken Jennings a granola bar, and he'll seriously challenge Watson, who will be needing several kilowatt-hours to do the same job. Plus, Ken Jennings is a lot more flexible. He can carry on conversations, tie shoes, etc. This is because his central processing unit basically relies on some sort of fault-tolerant software. I think that there will be a lot more applications of a fault-tolerant, energy efficient software strategy, beyond just media decoding. When we get around to asking computers to be creative and apply variously-weighted "rules of thumb", I expect that those operations will run best on systems that sacrifice calculation accuracy for speed and energy efficiency. You gain almost nothing when you apply rough heuristic rules precisely. Let's allow the computers to apply rough rules imprecisely, and reap the speed and energy benefits of the trade.
For example, if a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won't notice
Ok, fine, but that assumes those transistors are dedicated to decoding, and are not used for anything that requires complete accuracy.
In other words, it assumes that we won't be using general-purpose computers in the future.
Remember that the future will be full of encryption and DRM. Those technologies have maximum brittleness -- just one wrong bit will cause large amounts of data to be discarded or blocked.
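A quick way to see that brittleness, using SHA-256 as a stand-in for the integrity checks inside crypto and DRM schemes (the message is an arbitrary example): flipping a single input bit changes roughly half of the 256 output bits, so the check fails outright rather than degrading gracefully.

```python
import hashlib

def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions where two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

message = b"decryption key for chapter 7"   # arbitrary example input
original = hashlib.sha256(message).digest()
corrupted = hashlib.sha256(flip_bit(message, 13)).digest()

print(f"{differing_bits(original, corrupted)} of 256 digest bits changed")
```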
For every transistor that is DIRECTLY connected to a specific software algorithm, even when this algorithm is encapsulated in a hardware block, there are >10 transistors acting in essential support roles, whose malfunction CANNOT be trivially ignored. Either a chip is fabricated so that the TOTAL number of detectable errors across the die is vanishingly low, or the chip is USELESS.
We already have the power/speed trade-off of lower precision for chains of maths calculations that can withstand the accumulating errors such lowered precision creates. Running a maths block at such a high speed, or such low power, that the actions of individual transistors become impossible to predict is self-defeating.
The problem is that cretinous PhDs, people who have remained in academia for all the wrong reasons, get to spew blue-sky papers to justify their existence to their universities, with no regard for the real world. 'Hey, my maths is correct' does NOT ensure the quality of a paper. The 'maths' can always be made correct, with no regard for real-world, applied issues.
As for error tolerance, well, almost every computer is producing errors all the time. Your Windows PC, for instance, is designed to be error tolerant in the sense that most errors do not 'crash' the machine and are frequently handled invisibly (to the user) in the background. HOWEVER, this does not mean any test is made to ensure the errors are 'harmless'. While your hardware and OS may seemingly continue to function happily, many errors CAN silently corrupt data or processing on which you rely. It just so happens that, with the experience of billions of computers deployed across the decades, such errors have proven to be worth ignoring on average.
AGAIN, raising clocks or lowering power increases the probability of errors, so the PROPER Computer Science method is to use the LEAST work to complete your task, and this includes using numerical analysis to do no more maths than is absolutely necessary. For instance, morons as poorly skilled as those responsible for the paper in the article specified MPEG-1/2 decoding as a fully floating-point process, because "everyone knows floats, especially doubles, are always FAR better than integers". As a consequence, no two MPEG-1/2 decode units created exactly the same output from the same input data.
MPEG-4 AVC (H.264), on the other hand, was designed by DECENT mathematicians and uses integer decode methods, producing more consistent output with less energy per unit of maths work (although decoding it is more maths-intensive than decoding MPEG-1/2). Better, every MPEG-4 AVC decode unit produces identical results (if coded correctly).
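For reference, the integer transform usually meant here is the 4x4 core transform of H.264/AVC. A minimal sketch of the forward transform follows; in the standard the scaling factors are folded into quantization, so they are omitted here, and the sample residual block is arbitrary. The decoder's inverse transform is likewise defined with integer additions and shifts, which is what makes decoders bit-exact.

```python
# Forward 4x4 core transform matrix from H.264/AVC (integer entries only).
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def matmul(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def core_transform(block):
    """W = CF . X . CF^T, computed entirely in integers, so every
    implementation gets bit-identical coefficients."""
    return matmul(matmul(CF, block), transpose(CF))

residual = [
    [5, 11, 8, 10],
    [9, 8, 4, 12],
    [1, 10, 11, 4],
    [19, 6, 15, 7],
]
print(core_transform(residual))
```

Because only integer multiplications and additions are involved, there is no rounding mode or floating-point precision for two implementations to disagree on.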
May the best chi(m)p win.
Doesn't Intel already make a chip that is unreliable?
Yeah, let's take away the only thing that computers had going for them - doing exactly what they're told. THAT sounds like a GREAT idea.
It can be done; we don't have to race for atomic-size transistors before we have the technology to make them more reliable.
Do not look at laser with remaining good eye.
Now that would be world-changing!
What do you think the artefacts shown on screen are when you overclock your video card too high? Acceptable (sometimes) hardware errors.
This is why everything is disposable and nothing works anymore. People are too willing to sacrifice quality and reliability for cost.
If errors in the low-order bits were economically acceptable, we wouldn't be using high-precision data formats like floats and doubles in the first place. We'd be using 4-bit fixed point or some such BS.
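How much a single bit error costs depends heavily on which bit it hits. A small sketch that flips individual bits of an IEEE 754 double using the standard struct module:

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of an IEEE 754 double.
    Bits 0-51 are the mantissa, 52-62 the exponent, 63 the sign."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out

value = 1.5
for bit in (0, 51, 52, 62, 63):
    print(f"bit {bit:2d}: {value!r} -> {flip_bit(value, bit)!r}")
```

Flipping bit 0 changes 1.5 by about 2e-16, bit 51 (the top mantissa bit) turns it into 1.0, bit 52 (the lowest exponent bit) halves it to 0.75, bit 62 produces NaN, and bit 63 flips the sign.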
If you look at the history of GPUs, you see the exact opposite trend. The native data types have gotten larger and more precise every generation, because this is actually a very cheap thing to improve.
These assholes are hyping the shit out of a deliberately crippled product nobody asked for. Fuck them. Fuck them all.
So: this assumes that something, somewhere knows which transistors are unreliable. This data needs to be stored somewhere, on the "good" transistors. How is this data obtained? Is there a trustworthy "map" of "unreliable transistors"? And the code that determines the probability has to run on the "good" transistors too. Will those transistors stay good?
I cannot see any way of allowing *any* transistor to be unreliable... And based on my (admittedly incomplete) understanding of chip production, *any one* of the transistors on the silicon can be faulty, so there is still a chicken-and-egg problem in here somewhere.
Surely, such "suspect" transistors can only be used for storing the final end result of a calculation: If you were to use it for intermediate values on which you base "if" statements (or any sort of branch), your code will end up unreliable as a result. Unfortunately, 99% of the time the "end result" of one calculation is used as input to another calculation, so the problem spreads like rings in the water.
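A small illustration of that difference, using a hypothetical summing loop: a bit error in the final result is a small, contained mistake, while the same kind of error in the loop bound (the value the branch depends on) either silently truncates the work or crashes.

```python
def sum_first(values, count):
    """Sum the first `count` values; `count` drives the loop's branch."""
    total = 0
    for i in range(count):
        total += values[i]
    return total

values = list(range(1, 101))          # 1..100, exact sum 5050
exact = sum_first(values, 100)

# Case 1: the lowest bit of the final result is corrupted.
result_fault = exact ^ 1              # off by one

# Case 2: one bit of the loop bound is corrupted before it is used.
short_loop = sum_first(values, 100 ^ 64)   # bound becomes 36
try:
    sum_first(values, 100 ^ 128)           # bound becomes 228
    crashed = False
except IndexError:
    crashed = True

print(exact, result_fault, short_loop, crashed)   # prints: 5050 5051 666 True
```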
What if humans want to rely on the output of the computer? Does that pixel on the screen matter? If you are playing Angry Birds, fine. But the pixels may be important if you're a doctor looking at a scan. Or you're a flight controller scanning the screen for planes. The graphics routines do not know the context in which they run. So the actual usability of this ends up being radically diminished....
What use is a computer where you cannot trust the result? We already have logic bugs, race conditions, usability issues, etc., confusing everybody. I don't think we need to make the computers even more unreliable...
I already thought we had a framework for making chips unreliable in the programming realm known as Windows API.
Oh wait...
-Hackus
Got Geometrodynamics? Aw, too hard to figure out? Too bad.
If it's a choice between a slower chip that is reliable and a chip that is blistering fast but makes mistakes, I'll take the slower chip every time.
Anons need not reply. Questions end with a question mark.
From the article: "A third possibility, which some researchers have begun to float, is that we could simply let our computers make more mistakes."
A fourth possibility is to forget this silliness before it turns into epic failure, go back to the drawing board, and design computers that make fewer mistakes, not more mistakes. Sheesh, what lunacy!
Circle the wagons and fire inward. Entropy increases without bounds.