Why Do Computers Still Crash?

← Back to Stories (view on slashdot.org)

Why Do Computers Still Crash?

Posted by Cliff on Tuesday May 20, 2003 @12:55PM from the shouldn't-this-have-been-fixed-by-now dept.

geoff lane asks: "I've used computers for about 30 years and over that time their hardware reliability has improved (but not that much), but their software reliability has remained largely unchanged. Sometimes a company gets it right -- my Psion 3a has never crashed despite being switched on and in use for over five years, but my shiny new Zaurus crashed within a month of purchase (a hard reset losing all data was required to get it running again). Of course, there's no need to mention Microsoft's inability to create a stable system. So, why are modern operating systems still unable to deal with and recover from problems? Is the need for speed preventing the use of reliable software design techniques? Or is modern software just so complex that there is always another unexpected interaction that's not understood and not planned for? Are we using the wrong tools (such as C) which do not provide the facilities necessary to write safe software?" If we were to make computer crashes a thing of the past, what would we have to do, both in our software and in our operating systems, to make this come to pass?

10 of 1,224 comments (clear)

Min score:

Reason:

Sort:

Take your pick... by Rocker2000 · 2003-05-20 13:18 · Score: 3, Informative

Any of the following reasons conspire to result in buggy software these days.. (a) clueless marketing departments, project managers, etc set unrealistic deadlines for completing code to an acceptable standard. shortcuts are taken to meet the unrealistci deadliens and buggy products are the end result... (b) to satisfy client demands for increased functionality (no matter how unnecessary) results in more compelx code.. complx code is harder to maintain and troubleshoot... i sometimes think IT peopel have forgotten the notion that a simple solution that achieves functionality is the best solution... (c) programmers are humans, humans make mistakes in code... (d) companies to reduce the time/resource necessary to complete a product put in place aenemic testing/load testing methologies... (e) people often compare a computer to a kettle, car etc.. why can't it just work like that... well kettles do one thing and that's it.. computers do many complex things from rendering a CAD diagram through to a large scale mail server... etc etc... cars do one thing by relative comparison too but even most cars get more maintenace than some IT environments i've seen and you don't see people rushing out to buy a no name no brand car (e.g. like pc clones etc etc)... and many more im sure... how many more faield IT projects/Buggy software have to occur before peopel realize these things?
Re:Whose computers still crash? by toddestan · 2003-05-20 13:49 · Score: 5, Informative

Even with my uptime experiments, which consisted of taking an old but reliable hardware, installing Windows 95/OSR2/98/98SE/ME, and then letting the computer idle and do nothing never resulted in more than about 25 days before I came over and windows was fubar'ed or the computer was simply locked hard.

Windows 3.1 actually did quite well if I remember right, as it seemed perfectly content sitting idle doing nothing seemly forever. Windows 9x always seemed to randomly thrash the HDD, even after a clean install, which led me to believe that Windows 9x is never truly idle, it's always up to something (virtual memory?), and that something eventually will bring it down.

Windows 9x actually has a bug in it that would lock the computer after 46 days of uptime, but it took years to catch it because no one ever got close to that mark.
A Contributing Factor & Principles by Jerk+City+Troll · 2003-05-20 13:53 · Score: 3, Informative

I develop software at a small shop for a living. We're scraping by; money is extremely tight. As a result, anything we code is coded as quickly as possible. The boss always says "we need this done fast and we need it done right." This sentence is almost always followed up with statements like "don't build it for the ages" or any number of quotes that indicate he doesn't care how, just get it finished as soon as possible.

Welcome to the sorry state of affairs in the software industry today. Developers are too rushed (or don't care themselves) to come up with good designs and write solid implementations. Weaker coders are rewarded for their speed while stronger coders are degraded for software built to last.

Good engineering principles must be applied if software is to not: crash all the time, contain more than a fair share of bugs, contain security vulnerabilities, and not corrupt data. These engineering principles are complicated in practice, but not so numerous. I cannot be exhaustive here, but I am trying to convey a general idea.

- Build tiny, atomic pieces and make sure they work. It amazes me how my peers always come up with blanket solutions to problems. These solutions are remarkably complex and may work for most of the data, but not all of the data. Remember tiny pieces! The immediate question is how to make sure these pieces work. It's more than just testing here. You cannot just evalute a small number of pre and post conditions and assume something works. Prove mathematically that for all possible inputs/pre-states you receive correct outputs/post-states. Remember your discrete math class? Remember doing proofs? Apply it! Computers are fundamentally number crunchers and your input/output are fundamentally numbers and can be represented symbolically and in finite terms. Certainly cases exist where this principle cannot be employed, but those are rare. People working in the encryption field should understand this principle very well.

- Have clearly defined specifications for the software to be written. Strive to work out any questions or ambiguities in the specification before even embarking on the design process. If the specification is unclear or ambiguous, it is simply a matter of time before programmers do the wrong thing or begin to make incorrect or unreasonable assumptions. Another important note on this principle is the partitioning of specifications where appropriate. Do not let specifications for user interface mingle with those for the back-end. While they may be closely related, try to follow the Model Control View (MCV or MVC... it varies). This must be adhered to at the earliest stages of the specification, all the way up to the actual pounding of keyboards.

- Conduct frequent peer review! This is one of the strongest points of open source software development. I argue that it does not occur frequently in the commercial world because everyone is afraid of their peers negatively reviewing their code, placing their jobs at risk. Sadly, this only results in a suboptimal product. The more other people look at your code, the more likely your mistakes (and they do exist) are likely to be found. It's a shame work place environments are not geared to eliminate fear of failure, otherwise I think most software would be a lot better today if people were eager to do reviews.

Once again, this isn't entirely complete, but I think the point is clear. This was written on the fly and mostly off the top of my head, but I think I've got it right. In general, a lot of common sense needs to be applied. For example, if your input is for all intents and purposes random (it's coming from the user) then do extensive checking on it! If you want to encounter unexpected values in your data structures, make sure you hide as much as possible from the rest of the code. It amazes me how little the most basic computer science principles are followed in most software development projects. This is one of the biggest reasons software is so unstable.

--
Join Tor today!
I'm surprised nobody has pointed out yet... by Frobnicator · 2003-05-20 13:56 · Score: 3, Informative

That beyond all the hyperbole and other reasons, there is something that could be done but usually isn't.
In C++, which a great deal of software is written in, an exception block [or the language or system equivalent] placed around the entire application will catch just about any recoverable error. This is how most of the windows blue-screens or 'your application has performed an illigal operation and will be terminated' messages are brought up. This is how Linux and other unixes generates a core dump.
The actual handling may be in a signal handler, try/catch block, or abend, but the functionality is present in every activly developed language I have ever worked with from cobol and fortran to c, c++, java, and object pascal.
The main reason for applications actually crashing is programmer lazyness.
The main reason for applications getting into a state that they can crash is improper complexity management.
When it comes to drivers, I'm much more forgiving, since it is quite difficult to manage both the hardware and software, and the communication between different programs.
Finally, the operating system itself, which is the layer between the drivers and the applications, I haven't seen any in the last 5 years that has been unstable. Even Windows ME, for all its faults, was very stable in the actual 'operating system'.
But that's just my 2 pesos.
frob

--
//TODO: Think of witty sig statement
Re:and by Temporal · 2003-05-20 14:31 · Score: 3, Informative

My Win2k box plays games reliably and maintains more than a few months of uptime.

Please refer to this post for more information.

Thank you.
Containing the Damage by Salamander · 2003-05-20 15:24 · Score: 4, Informative

A lot of people are answering the question of why there are bugs at all, and it's an important question, but I'd like to take a different angle and consider why there are so many visible bugs. Why does a bug in a driver, or even an application, bring down a whole system? In addition to reducing the incidence of actual bugs, IMO, we should also do a better job of containing the bugs that will inevitably exist even if we all use the latest whiz-bang code analysis tools (which rarely work for kernel code anyway). Some of the semi-informed members of the audience are probably thinking that's the job of the operating system; I'd argue that our entire current notion of operating systems is flawed. There are way too many components in a typical computer system that "trust each other with their lives" in the sense that if one dies all die. Memory protection between user processes is great, but there should be memory protection between kernel entities, and other kinds of protection, as well. One of the basic services that operating systems need to provide going forward is greater fault isolation and graceful instead of catastrophic degradation.

The Recovery Oriented Computing project at Berkeley has gotten some press recently for trying to address this issue. Many here on Slashdot don't seem to "get it" because they've never worked on systems in which a component failure was survivable; they don't realize that rebooting a single component - perhaps even preemptively - is better than having the whole system crash. "Software rot" is a real problem, no matter how hard we try to wish it away. ROC isn't about saying bugs are OK; it's about saying that bugs happen even though they're not OK, and let's do the best we can about that. Another project in the same space, with more of a hardware/security orientation, is Self Securing Devices at CMU. There, the idea is to find ways that parts of a system can work together without having to share each others' fate. While the focus of the work is on security, it shouldn't be hard to see how much of the same technology could be applied to protect a system from outright failure as well as compromise. There are plenty of other projects out there trying to address this problem, but those are two with which I happen to have personal experience.

The key idea in all cases is that current OS design forces us to put all of our eggs in one basket, and that's really not necessary. Designing fault-resilient systems is tough - few know that better than I do - but that's only a reason why we should do it once instead of devising ad-hoc clustering solutions for each specific application. Lots of people use various forms of clustering as a way to achieve fault containment and survive failures, but the solutions tend to be very ad-hoc and application-specific. Do you think Google's solution works for anything but Google, or that a database transaction monitor is useful for anything that's not a database? Fault containment needs to be a fundamental part of the OS, not something we layer on top of it.

--
Slashdot - News for Herds. Stuff that Splatters.
Nope! Case in point. by fireboy1919 · 2003-05-20 15:35 · Score: 3, Informative

I have a Microsoft reference driver for my soundcard (i.e. Microsoft made the driver and approved it themselves). I use it on my computer.

Unfortunately, two things cause it to fail.
1) It doesn't play nice with other drivers on the same IRQ.
2) Microsoft's advanced power management driver assigns it to the same IRQ as my USB port and my network card, and that can't be changed without a reinstall of Windows.

So basically, what happens is that the sound card will eventually crap out completely and never work again (until reboot) if it attempts to work at the same time either of the other two devices on that IRQ are working.

Keep in mind:
1) Microsoft knows about this bug
2) It causes system instability for lots of drivers - even certified ones

I should also mention that there is nowhere that this bug is reported by the OS; I had to find it through trial, error, and lots of research. Win2K is not as stable as you think

--
Mod me down and I will become more powerful than you can possibly imagine!
Re:Simple, yes, for other reasons by Surazal · 2003-05-20 16:29 · Score: 3, Informative

You ought to work tech support some time. There are real costs associated with software bugs. These costs are measured. Many times these costs are measured more meticulously than software vendors would like to admit. There are more organizations than you might realize that purposely delay software deployments to make sure that they do not ruin their technology infrastructure. Often times, when I work with a senior admin within the organization, I find they are the "NO" people. "No", we will not apply that patch unless you prove to us it will fix the problem. "No", we will not apply that patch unless you prove it won't introduce new problems. And, in the case that there are unforeseen complications in a software upgrade, guess who gets the heat? Directly, it's the senior admins. Both directly and indirectly, it's the software vendor. Bad publicity == lost sales. Ask any sales person (technical or non-technical).

Of course, I'm at the end of the equation where these costs are realized after the fact. Also, I think that since I come from the Unix world, I've seen more preference towards quality over quantity. Unix-oriented orgs are much more cautious than Windows-oriented orgs. I attribute this to lack of experience in that market, but the way things are going, experience is not in short supply. Bugs and security breaches are costing companies in real dollors nowadays, and commercial and gov't organizations are not ignorant of this fact, even at the high echelon levels.

For proof, look at Microsoft. I certainly remember reading that they decided to go for a company-wide code freeze to resolve bugs and security issues. This code freeze lasted for SIX MONTHS. That's a HUGE risk for a software company. Also, there's that whole trillion dollor fine against the company thing, too, that's been circulating a bit lately. It also undermines any arguments based on "customers are lemmings that will buy anything we dangle in front of them". Maybe the fact that features outweighed stability was true during the dot-com boom. I think it's definitely less true now, by a significant degree.

--
--- Journals are boring; Go to my web page instead
Re:Whoops, bullshit alert. by rabidcow · 2003-05-20 16:40 · Score: 4, Informative

Microsoft says so.

Actually it's in some driver, not the core OS, so it's not surprising that it doesn't happen to everyone. (There's a few other things with similar problems.)
Re:Computers don't crash by geekoid · 2003-05-20 17:49 · Score: 3, Informative

I would call anything that unnesarily foists a business rule onto the user an error.

" If this can occur, it is the developer who is at fault, not the user."

thank you. I have to combat 'stupid user' mentality at work every day.
from "the user will never do that, so don't worry about it" to "I can't be blamed if the user wont read the manual".
I try not to say it, but it is hell working with coders who got into code 3 years ago for the money. Whining about working 45 hours a week and not understanding things like pointer and user defined types. Normally thats fine, I don't mind mentoring, but when you explain it, to a developer, and there eyes glaze over until you tell them exactly what to write, for the 3rd time that day. I thought it was me, but I even gave the photo copies of very clear explaination, with very simple examples and diagram.

hmmm, sorry about that, I think work is getting to me.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect