Why Do Computers Still Crash?
geoff lane asks: "I've used computers for about 30 years and over that time their hardware reliability has improved (but not that much), but their software reliability has remained largely unchanged. Sometimes a company gets it right -- my Psion 3a has never crashed despite being switched on and in use for over five years, but my shiny new Zaurus crashed within a month of purchase (a hard reset losing all data was required to get it running again). Of course, there's no need to mention Microsoft's inability to create a stable system. So, why are modern operating systems still unable to deal with and recover from problems? Is the need for speed preventing the use of reliable software design techniques? Or is modern software just so complex that there is always another unexpected interaction that's not understood and not planned for? Are we using the wrong tools (such as C) which do not provide the facilities necessary to write safe software?" If we were to make computer crashes a thing of the past, what would we have to do, both in our software and in our operating systems, to make this come to pass?
People write sloppy code that makes them...
Solid!
Any of the following reasons conspire to produce buggy software these days:
(a) Clueless marketing departments, project managers, etc. set unrealistic deadlines for completing code to an acceptable standard. Shortcuts are taken to meet the unrealistic deadlines, and buggy products are the end result.
(b) Satisfying client demands for increased functionality (no matter how unnecessary) results in more complex code. Complex code is harder to maintain and troubleshoot. I sometimes think IT people have forgotten the notion that a simple solution that achieves the functionality is the best solution.
(c) Programmers are humans, and humans make mistakes in code.
(d) To reduce the time and resources needed to complete a product, companies put in place anemic testing and load-testing methodologies.
(e) People often compare a computer to a kettle, a car, etc. and ask why it can't just work like that. Well, kettles do one thing and that's it. Computers do many complex things, from rendering a CAD diagram through to running a large-scale mail server. Cars do one thing by comparison too, but even most cars get more maintenance than some IT environments I've seen, and you don't see people rushing out to buy a no-name, no-brand car (the way they buy PC clones).
And many more, I'm sure. How many more failed IT projects and buggy software releases have to occur before people realize these things?
I saw a quote from a game maker on this...he said something to the effect of, "We have to be really careful with testing the console versions of our games, because if something goes wrong we can't just issue a patch to fix it like we can the PC version."
Even my uptime experiments, which consisted of taking old but reliable hardware, installing Windows 95/OSR2/98/98SE/ME, and then letting the computer idle and do nothing, never resulted in more than about 25 days of uptime before I came over and found that Windows was fubar'ed or the computer was simply locked hard.
Windows 3.1 actually did quite well if I remember right, as it seemed perfectly content sitting idle doing nothing seemingly forever. Windows 9x always seemed to randomly thrash the HDD, even after a clean install, which led me to believe that Windows 9x is never truly idle, it's always up to something (virtual memory?), and that something eventually will bring it down.
Windows 9x actually has a bug in it that locks the computer after 49.7 days of uptime, but it took years to catch because almost no one ever got close to that mark.
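For what it's worth, the figure usually quoted for this bug is 49.7 days, which is exactly how long a 32-bit counter of milliseconds lasts before it wraps back to zero. A minimal C++ sketch of that arithmetic, plus the usual wrap-safe way to measure elapsed time. The function names are made up for illustration and are not taken from any Windows API:

```cpp
#include <cstdint>

// Days until an unsigned 32-bit millisecond counter wraps back to zero.
double days_until_wrap() {
    const double ticks = 4294967296.0;                       // 2^32 milliseconds
    const double ms_per_day = 24.0 * 60.0 * 60.0 * 1000.0;
    return ticks / ms_per_day;                               // roughly 49.71
}

// Wrap-safe elapsed time: unsigned subtraction is modulo 2^32, so the
// difference stays correct even after `now` has wrapped past zero.
std::uint32_t elapsed_ms(std::uint32_t start, std::uint32_t now) {
    return now - start;
}

// Timeout test built on the wrap-safe difference
// (valid as long as the real elapsed time is under 2^32 ms).
bool timed_out(std::uint32_t start, std::uint32_t now, std::uint32_t timeout_ms) {
    return elapsed_ms(start, now) >= timeout_ms;
}
```

Code that instead compares raw tick values (something like `now >= deadline`) works fine for weeks and then misbehaves once the counter wraps, which fits a bug that nobody saw for years.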
I've got the solution: the TeXOS(tm)!
Slogan: Crash-free -- Donald Knuth guarantees it!
-Waldo Jaquith
I develop software at a small shop for a living. We're scraping by; money is extremely tight. As a result, anything we code is coded as quickly as possible. The boss always says "we need this done fast and we need it done right." This sentence is almost always followed up with statements like "don't build it for the ages" or any number of quotes that indicate he doesn't care how, just get it finished as soon as possible.
Welcome to the sorry state of affairs in the software industry today. Developers are too rushed (or don't care themselves) to come up with good designs and write solid implementations. Weaker coders are rewarded for their speed while stronger coders are degraded for software built to last.
Good engineering principles must be applied if software is not to crash all the time, contain more than its fair share of bugs, contain security vulnerabilities, or corrupt data. These engineering principles are complicated in practice, but not so numerous. I cannot be exhaustive here, but I am trying to convey a general idea.
- Build tiny, atomic pieces and make sure they work. It amazes me how my peers always come up with blanket solutions to problems. These solutions are remarkably complex and may work for most of the data, but not all of the data. Remember tiny pieces! The immediate question is how to make sure these pieces work. It's more than just testing here. You cannot just evaluate a small number of pre and post conditions and assume something works. Prove mathematically that for all possible inputs/pre-states you receive correct outputs/post-states. Remember your discrete math class? Remember doing proofs? Apply it! Computers are fundamentally number crunchers, and your inputs and outputs are fundamentally numbers that can be represented symbolically and in finite terms. Certainly cases exist where this principle cannot be employed, but those are rare. People working in the encryption field should understand this principle very well.
- Have clearly defined specifications for the software to be written. Strive to work out any questions or ambiguities in the specification before even embarking on the design process. If the specification is unclear or ambiguous, it is simply a matter of time before programmers do the wrong thing or begin to make incorrect or unreasonable assumptions. Another important note on this principle is the partitioning of specifications where appropriate. Do not let specifications for the user interface mingle with those for the back-end. While they may be closely related, try to follow the Model-View-Controller (MVC) pattern. This must be adhered to from the earliest stages of the specification all the way up to the actual pounding of keyboards.
- Conduct frequent peer review! This is one of the strongest points of open source software development. I argue that it does not occur frequently in the commercial world because everyone is afraid of their peers negatively reviewing their code, placing their jobs at risk. Sadly, this only results in a suboptimal product. The more other people look at your code, the more likely your mistakes (and they do exist) are to be found. It's a shame workplace environments are not geared to eliminate fear of failure; I think most software would be a lot better today if people were eager to do reviews.
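To make the first point concrete: when a piece is small enough, you don't even need a pencil-and-paper proof, because you can check its postcondition over every possible input. A rough C++ sketch, using a made-up `saturating_add` on bytes as the tiny piece (both function names are illustrative, not from any library):

```cpp
#include <cstdint>

// A tiny, atomic piece: add two bytes, clamping at 255 instead of wrapping.
std::uint8_t saturating_add(std::uint8_t a, std::uint8_t b) {
    const unsigned sum = static_cast<unsigned>(a) + static_cast<unsigned>(b);
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}

// Check the postcondition for every possible input pair (65,536 cases):
// the result equals a + b when that fits in a byte, and 255 otherwise.
// Returns the number of pairs that violate it; 0 means none do.
int count_violations() {
    int bad = 0;
    for (unsigned a = 0; a <= 255u; ++a) {
        for (unsigned b = 0; b <= 255u; ++b) {
            const unsigned got = saturating_add(static_cast<std::uint8_t>(a),
                                                static_cast<std::uint8_t>(b));
            const unsigned exact = a + b;
            const bool ok = (exact <= 255u) ? (got == exact) : (got == 255u);
            if (!ok) ++bad;
        }
    }
    return bad;
}
```

This only works because the input space is finite and small. For wider inputs you are back to proofs, but the habit of stating the postcondition explicitly carries over.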
Once again, this isn't entirely complete, but I think the point is clear. This was written on the fly and mostly off the top of my head, but I think I've got it right. In general, a lot of common sense needs to be applied. For example, if your input is for all intents and purposes random (it's coming from the user), then do extensive checking on it! If you don't want to encounter unexpected values in your data structures, make sure you hide as much as possible from the rest of the code. It amazes me how little the most basic computer science principles are followed in most software development projects. This is one of the biggest reasons software is so unstable.
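A small C++ sketch of those last two pieces of advice, checking user input hard and hiding the representation from the rest of the code. `Percentage` and `is_valid_percentage` are hypothetical names invented for the example:

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// A value that is always in 0..100. The constructor is private, so the only
// way to get one is through parse(), which rejects anything unexpected.
class Percentage {
public:
    static Percentage parse(const std::string& text) {
        // Reject empty input, leading whitespace, and signs up front.
        if (text.empty() || !std::isdigit(static_cast<unsigned char>(text[0]))) {
            throw std::invalid_argument("expected a digit");
        }
        std::size_t used = 0;
        const int value = std::stoi(text, &used);   // throws std::out_of_range on huge numbers
        if (used != text.size()) {
            throw std::invalid_argument("trailing characters");
        }
        if (value > 100) {
            throw std::out_of_range("percentage above 100");
        }
        return Percentage(value);
    }

    int value() const { return value_; }

private:
    explicit Percentage(int v) : value_(v) {}
    int value_;   // hidden, so no other code can put a bad value here
};

// Convenience wrapper: true if the text parses as a valid percentage.
bool is_valid_percentage(const std::string& text) {
    try {
        Percentage::parse(text);
        return true;
    } catch (const std::logic_error&) {   // base of invalid_argument and out_of_range
        return false;
    }
}
```

Once a `Percentage` exists, nothing downstream has to re-check its range, which is the payoff of doing the checking in exactly one place.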
In C++, which a great deal of software is written in, an exception block [or the language or system equivalent] placed around the entire application will catch just about any recoverable error. This is how most of the Windows blue screens or 'your application has performed an illegal operation and will be terminated' messages are brought up. This is how Linux and other Unixes generate a core dump.
The actual handling may be in a signal handler, try/catch block, or abend, but the functionality is present in every actively developed language I have ever worked with, from COBOL and Fortran to C, C++, Java, and Object Pascal.
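A minimal sketch of the C++ flavor of this, assuming the whole program is reachable from one entry function. `run_application` and `guarded_run` are placeholder names for illustration. Note that a try/catch like this only sees C++ exceptions; hardware faults such as a bad pointer dereference show up as signals on Unix or structured exceptions on Windows and need the system equivalent mentioned above.

```cpp
#include <exception>
#include <iostream>
#include <stdexcept>

// Stand-in for the real program; `fail` simulates an error deep inside it.
int run_application(bool fail) {
    if (fail) {
        throw std::runtime_error("simulated failure");
    }
    return 0;
}

// The block "placed around the entire application": main() would just
// return guarded_run(...), so nothing escapes without being reported.
int guarded_run(bool fail) {
    try {
        return run_application(fail);
    } catch (const std::exception& e) {
        std::cerr << "fatal: " << e.what() << '\n';
        return 1;
    } catch (...) {
        std::cerr << "fatal: unknown error\n";
        return 2;
    }
}
```

Whether the handler logs and exits, or tries to save state first, is a design choice. The point is only that the hook exists and costs a few lines.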
The main reason for applications actually crashing is programmer laziness.
The main reason for applications getting into a state that they can crash is improper complexity management.
When it comes to drivers, I'm much more forgiving, since it is quite difficult to manage both the hardware and software, and the communication between different programs.
Finally, the operating system itself, which is the layer between the drivers and the applications, I haven't seen any in the last 5 years that has been unstable. Even Windows ME, for all its faults, was very stable in the actual 'operating system'.
But that's just my 2 pesos.
frob
I've crashed my Super Nintendo. Quite a bit, actually.
In Final Fantasy III, in the Phoenix Cave, occasionally when I encountered a random battle, the sprites would all become garbled, and then the game would hang. At first I thought it was a secret, because one of my characters turned into General Leo, but then the game stopped working and I had to reset.
You can also sometimes crash Final Fantasy III by using Relm's "Sketch" command on Gau. What you do is let Gau use "Leap" to learn a new Rage ability when you're roaming the Veldt, and then, when you find him in another battle, sketch him when he appears. I'm not sure if it always crashes the game - I recall that sometimes it gave you tons of extra random items (like 99 daggers, among other things) - but that might be another bug.
My Win2k box plays games reliably and maintains more than a few months of uptime.
Please refer to this post for more information.
Thank you.
Debian now has 8,710 applications. There are few things I'd like to do that I can't. I spend much less time "rebuilding" computers and more time doing those things now.
I mean, you can't do much except play back multimedia,
Hmmm, ever heard of film gimp? Sure, there are some hardware problems but those will go away as M$ dies. Hardware makers are already taking free software into account.
there's seldom any games you can play on it.
I'm not a game boy, quake II is good enough for me. More will come, in the mean time dual boot. Woody takes care of that auto-magically now.
I liken it to a rock in the middle of a field. Damned stable, that rock. It just sits there.
Yes, that's the picture you drew. Reality is different. Think of it as a tremendous magic building, where everyone is invited to come and do as they please. Building materials are free, and so long as you follow a few basic guidelines, your changes and additions will be as sturdy as any piece and everyone can enjoy it at once.
For many megs of answers about why software isn't 100% reliable, read Risks Digest.
There is indeed hardware out there with this level of reliability (like an AT&T 5ESS/Lucent 7R/E phone switch); however, it is very expensive and very inflexible.
I don't mean to bash AT&T. In fact, the very infrequency of this sort of problem is a strong argument for their reliability. I had to go back to the pre-Lucent days for this one, folks. However, they do have some occasional bugs in their software. And it makes the news when they do:
Risks Digest, Volume 9: Issue 69, Tuesday 20 February 1990
Computer, Heal Thyself
"Systems inevitably fail. The key to reliable computing is building systems that crash gracefully and recover quickly."
A lot of people are answering the question of why there are bugs at all, and it's an important question, but I'd like to take a different angle and consider why there are so many visible bugs. Why does a bug in a driver, or even an application, bring down a whole system? In addition to reducing the incidence of actual bugs, IMO, we should also do a better job of containing the bugs that will inevitably exist even if we all use the latest whiz-bang code analysis tools (which rarely work for kernel code anyway). Some of the semi-informed members of the audience are probably thinking that's the job of the operating system; I'd argue that our entire current notion of operating systems is flawed. There are way too many components in a typical computer system that "trust each other with their lives" in the sense that if one dies all die. Memory protection between user processes is great, but there should be memory protection between kernel entities, and other kinds of protection, as well. One of the basic services that operating systems need to provide going forward is greater fault isolation and graceful instead of catastrophic degradation.
The Recovery Oriented Computing project at Berkeley has gotten some press recently for trying to address this issue. Many here on Slashdot don't seem to "get it" because they've never worked on systems in which a component failure was survivable; they don't realize that rebooting a single component - perhaps even preemptively - is better than having the whole system crash. "Software rot" is a real problem, no matter how hard we try to wish it away. ROC isn't about saying bugs are OK; it's about saying that bugs happen even though they're not OK, and let's do the best we can about that. Another project in the same space, with more of a hardware/security orientation, is Self Securing Devices at CMU. There, the idea is to find ways that parts of a system can work together without having to share each others' fate. While the focus of the work is on security, it shouldn't be hard to see how much of the same technology could be applied to protect a system from outright failure as well as compromise. There are plenty of other projects out there trying to address this problem, but those are two with which I happen to have personal experience.
The key idea in all cases is that current OS design forces us to put all of our eggs in one basket, and that's really not necessary. Designing fault-resilient systems is tough - few know that better than I do - but that's only a reason why we should do it once instead of devising ad-hoc clustering solutions for each specific application. Lots of people use various forms of clustering as a way to achieve fault containment and survive failures, but the solutions tend to be very ad-hoc and application-specific. Do you think Google's solution works for anything but Google, or that a database transaction monitor is useful for anything that's not a database? Fault containment needs to be a fundamental part of the OS, not something we layer on top of it.
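At the user-process level, where memory protection already exists, the "restart one component instead of losing everything" idea looks roughly like the following. This is only a sketch under POSIX assumptions (`fork`, `waitpid`); `run_isolated`, `supervise`, and the two component functions are invented for illustration and have nothing to do with the actual ROC or Self Securing Devices code.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cstdlib>

// Run one component in its own process so a crash cannot take the caller down.
// Returns true only if the component finished and exited cleanly.
bool run_isolated(void (*component)()) {
    const pid_t pid = fork();
    if (pid < 0) {
        return false;                 // could not create the process
    }
    if (pid == 0) {                   // child: do the work, then leave
        component();
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return false;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

// Supervisor: restart a failed component up to `max_attempts` times.
// Returns the attempt number that succeeded, or 0 if every attempt failed.
int supervise(void (*component)(), int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (run_isolated(component)) {
            return attempt;
        }
    }
    return 0;
}

void healthy_component() {}                    // stand-in for real work
void crashing_component() { std::abort(); }    // stand-in for a bug
```

The supervisor survives `crashing_component` dying three times in a row and can still report the failure. The argument above is that this kind of containment should also be available between kernel entities, not just bolted on per application.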
I have a Microsoft reference driver for my soundcard (i.e. Microsoft made the driver and approved it themselves). I use it on my computer.
Unfortunately, two things cause it to fail.
1) It doesn't play nice with other drivers on the same IRQ.
2) Microsoft's advanced power management driver assigns it to the same IRQ as my USB port and my network card, and that can't be changed without a reinstall of Windows.
So basically, what happens is that the sound card will eventually crap out completely and never work again (until reboot) if it attempts to work at the same time either of the other two devices on that IRQ are working.
Keep in mind:
1) Microsoft knows about this bug
2) It causes system instability for lots of drivers - even certified ones
I should also mention that there is nowhere that this bug is reported by the OS; I had to find it through trial, error, and lots of research. Win2K is not as stable as you think.
You ought to work tech support some time. There are real costs associated with software bugs. These costs are measured. Many times these costs are measured more meticulously than software vendors would like to admit. There are more organizations than you might realize that purposely delay software deployments to make sure that they do not ruin their technology infrastructure. Often times, when I work with a senior admin within the organization, I find they are the "NO" people. "No", we will not apply that patch unless you prove to us it will fix the problem. "No", we will not apply that patch unless you prove it won't introduce new problems. And, in the case that there are unforeseen complications in a software upgrade, guess who gets the heat? Directly, it's the senior admins. Both directly and indirectly, it's the software vendor. Bad publicity == lost sales. Ask any sales person (technical or non-technical).
Of course, I'm at the end of the equation where these costs are realized after the fact. Also, I think that since I come from the Unix world, I've seen more preference towards quality over quantity. Unix-oriented orgs are much more cautious than Windows-oriented orgs. I attribute this to lack of experience in that market, but the way things are going, experience is not in short supply. Bugs and security breaches are costing companies real dollars nowadays, and commercial and gov't organizations are not ignorant of this fact, even at the high echelon levels.
For proof, look at Microsoft. I certainly remember reading that they decided to go for a company-wide code freeze to resolve bugs and security issues. This code freeze lasted for SIX MONTHS. That's a HUGE risk for a software company. Also, there's that whole trillion dollar fine against the company thing, too, that's been circulating a bit lately. It also undermines any arguments based on "customers are lemmings that will buy anything we dangle in front of them". Maybe the fact that features outweighed stability was true during the dot-com boom. I think it's definitely less true now, by a significant degree.
Microsoft says so.
Actually it's in some driver, not the core OS, so it's not surprising that it doesn't happen to everyone. (There's a few other things with similar problems.)
I have spoken with Microsoft OS developers about this issue in some detail. They make risk/benefit choices all the time where they know one way will never crash, and the other way will crash but will be amazingly faster.
Guess which way consumers pay them to build it. When they choose the crash-but-fast method, they just put an astounding amount of QA into it to whittle the probability of a crash down to an acceptable level. And I agree with them about what an acceptable level is, because I know that when I crash my Win2k system, it's my fault.
They put more testing and research into their OSes than anyone these days. Maybe Sun used to have them beat there, but Sun isn't nearly as focused on those things anymore.
Scheme and Smalltalk are bad examples, because dynamically typed languages produce entirely different types of faults (typing errors).
ML is a much better example.
Others will claim Java is a good example, but it's a bad one, because despite being statically typed it causes typing errors (from casts) and null-pointer exceptions (ick).
Safer languages still don't mean that programs don't fail, but they eliminate some of the ways they can fail.
Not that anyone will even read this, but good call.
I see so many people spend beaucoup money on their internals and then say, "oh, yeah, and this 420 Watt PS I got for $35. What a steal!"
A good power supply is not cheap. On the flip side, though, a cheap power supply is not good. And a bad power supply is the most annoying thing to troubleshoot.
Antec makes some good stuff that I've been very happy with, ditto for PC Power and Cooling. Expensive, but so worth it it isn't funny.
Actually, you're right. Words used together as a compound adjective modifying a noun should be hyphenated. There's one little catch here: Because "high school" is itself an adjective modifying "level," we should put a hyphen between "high" and "school" (think of it as a first-order compound) but a longer en-dash between "high-school" and "level" (a second-order compound).
So: high-school-level, where the second dash should be the HTML entity &ndash; (but Slashcode won't allow it).
It's x86 hardware, but it's powered through the video card. Looks pretty good at 800x600 too (it's a TFT display).
Unisys 10.5" LCD Monitor w/ 2MB PCI Video Card
It says 2MB video card, but the one I got was a 4MB video card. It happily supports dual displays with Windows 9x and higher, but it doesn't support video playback, so scrap the idea of getting it to watch TV or play movies on. But for what you're describing, a small monitor for a low-power system, I think this would be ideal.
Sadly, they don't have a Linux driver for the required 65550 video card, but there's always Google and the price is right.
I know this is slashdot, but what about Win2000/XP? I've only had XP crash when I used old software drivers. Since then it's been fine.
Overall a very good read, highly recommended.
"Most computer crashes are caused by an INTERACTION of two pieces of code that did not know about each other and were never tested."
A majority of crashes are at the kernel level. How do you suppose one would go about "introducing" one's code (drivers) and ensuring compatibility with an OS that is not open source?
"1) accept a restricted operating system that will never be able to compete with a commercial system like Windows."
Yep - unix, linux, OS X - they are all so "restricted" and shrouded in a veil of secrecy. Ha. Must be why they are not "commercial", right?
"2) Never install a program that was not A) created by the same company/group that wrote your operating system, B) specifically designed for your particular computer, and C) designed to be used with and thoroughly tested against all the other software that is currently installed on your PC."
You highlighted this while reading your Getting Started with Windows XP booklet - didn't you?
"That is what companies do when they make non-pc computer equipment (cars have tiny computers) and is the reason why such things do not crash."
You are referring to what is called an embedded system - your reasoning/comparison is sorely invalid.
Why? To waste processing time? Programmers should most certainly check beforehand if something catastrophic will happen...such as checking for NULL pointers, or making sure a drive head won't go too far and crash. But I don't see why they should have to predict if an error will happen beforehand for every case.
Sometimes a more elegant, efficient, or easy solution is to check afterward. Who even says the result will be wanted if an overflow happens anyway? The program may just end the function on an error, or pop up a warning box.
If needed, an increment can easily be undone, and overflow checking is far easier than comparing against the max value--which you'll have to figure out. Not easy if you don't know the size of the operand (such as with C and ints). Not easy if the size of the operand may be changed at a later date (you'll also have to change all your comparison code, or the max constant if you used one). This may not seem like a big deal, but it's one more thing to get wrong.
Some 80x86 assembly to illustrate how overflow works to the programmer's advantage:
inc al ; increment the al register
jo errorhandler ; if overflow, jump to error handler
; no error, continue on...
Without the overflow flag:
cmp al,127 ; compare al against the max value of a signed byte
jge errorhandler ; if al is already at the max, jump to error handler
inc al
Yeah, the extra cmp opcode may not look like much, but it does add to the code size, and will use additional processing time. Doing this a lot in a large program will add up--especially if it needs to be fast.
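For comparison, here are the same two styles in C++, as a sketch rather than a recommendation. Two caveats apply that are not part of the argument above: the check-afterward style is only well defined for unsigned types in C and C++ (signed overflow is undefined behavior), and `std::numeric_limits` supplies the max value without hard-coding the operand size. The function names are made up:

```cpp
#include <limits>

// Check-before style: works for signed and unsigned types alike, and the
// max value follows the type, so changing the operand size needs no edits.
template <typename T>
bool checked_increment(T& value) {
    if (value == std::numeric_limits<T>::max()) {
        return false;                 // incrementing would overflow
    }
    ++value;
    return true;
}

// Check-after style: increment first, then look for wraparound.
// Only valid for unsigned types, where wrapping to 0 is defined.
bool increment_then_check(unsigned& value) {
    ++value;
    if (value == 0u) {                // wrapped past the max
        --value;                      // undo: wraps back to the max
        return false;
    }
    return true;
}
```

In the check-after version the undo really is one instruction, as claimed above. What C++ lacks is a portable way to read the CPU's overflow flag for signed values.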
This was just a simple case for an increment. What about adds? The precheck will be much more complex than the previous example and probably use up an extra register.
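For adds, the precheck in C++ comes out to two comparisons against the type's limits. A minimal sketch for `int`; `checked_add` is a hypothetical helper name, not a standard function:

```cpp
#include <limits>

// Stores a + b in `out` and returns true when the sum fits in an int.
// Returns false and leaves `out` untouched when it would overflow.
bool checked_add(int a, int b, int& out) {
    const int max = std::numeric_limits<int>::max();
    const int min = std::numeric_limits<int>::min();
    if (b > 0 && a > max - b) {
        return false;                 // would overflow upward
    }
    if (b < 0 && a < min - b) {
        return false;                 // would overflow downward
    }
    out = a + b;                      // safe: both bounds were checked
    return true;
}
```

That is noticeably more work than an add followed by `jo`. GCC and Clang offer `__builtin_add_overflow`, which does the add and reports overflow in one step and is much closer to the flag-based approach.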