Why Do Computers Still Crash?

crashes? by Moridineas · 2003-05-20 13:03 · Score: 3, Interesting

Well, on the computers that I manage, we've got an OpenBSD server that never crashes (max uptime is around 6 months, which is when a new release comes out) and a FreeBSD server that has never crashed--max uptime has been around 140-150 days, and the reboots were for system upgrades/hardware additions.

On the workstation side they are definitely not THAT stable, but since we've switched to XP/2K on the PC side, those PCs regularly get 60+ days of uptime. Just as a note--I had an XP computer the other day that would crash about two or three times a day. The guy that was using it kept yelling about Microsoft, etc etc etc. Turned out to be bad RAM. After switching in new RAM it's currently at 40 days uptime (not a single crash).

For some reason the Macs we have get turned off every night so their uptime isn't an issue, but from what I hear OS X is quite stable.

Touchy subject by aarondyck · 2003-05-20 13:04 · Score: 5, Interesting

I remember years ago having a conversation with an IT manager at IBM. We were talking about the inability of computer programmers to make their code foolproof. His point was that we don't see problems like this with proprietary hardware. When was the last time someone crashed their Super Nintendo? Of course, with a PC platform (or even Mac, or whatever else) there are problems of unreliability. His idea was that this is because of sloppy programming. The reason we were having this conversation is that I had a piece of software (brand new, I might add) that would not install on my computer. You would think that a reputable software company (and this was a reputable company) would test their product on at least a few systems to make sure that it would at least install! The end result was that I ended up never playing the game (not even to this day), nor have I purchased another title from that company since that time. Perhaps that is the solution to the root problem?

--
In order to be immortal you must be organize

Scientific American... by Hanji · 2003-05-20 13:04 · Score: 4, Interesting

Scientific American actually had an article on a similar topic. Basically, they seem to be accepting crashes as inevitable, and were focusing on systems to help computers recover from crashes faster and more reliably...

They also propose that all computer systems should have an "undo" feature built in to allow harmful changes (either due to mistakes or malice) to be easily undone...

--
A Minesweeper clone that doesn't suck

Re:Simple ... by The+Analog+Kid · 2003-05-20 13:07 · Score: 5, Interesting

Yes, on my parents' computer, which has 2000 on it (tried Linux, it didn't work for them), I set most of the services that aren't needed to manual. Disabled Auto-update. Put it behind a router, of course. The only problem that remained was Internet Exploder, so I just installed Mozilla with an IE theme (they haven't noticed a difference). I think killing most of the services keeps it up. Haven't had a problem with it. This was done before KDE 3.1.x, so who knows, Linux might work after all.

The ultimate solution by dsanfte · 2003-05-20 13:09 · Score: 4, Interesting

The ultimate solution to the problem is to let computers write the software themselves. Give them a goal, set up evolutionary and genetic algorithms, and let them go at it on a supercomputer cluster for a few months.

Of course, you'd need to make sure the algorithms that humans wrote aren't flawed themselves, but once you got that pinned down, you would be more or less home-free.

Even if you didn't take this drastic a step, another solution would be computer-aided software burn-in. Let the computer test the software for bugs. A super-QA Analysis if you will. Log complete program traces for every trial run, and let the machine put the software through every input/output possibility.

--
occultae nullus est respectus musicae - originally a Greek proverb

Re:The ultimate solution by Jeremi · 2003-05-20 13:24 · Score: 5, Interesting

The ultimate solution to the problem is to let computers write the software themselves. Give them a goal, set up evolutionary and genetic algorithms, and let them go at it on a supercomputer cluster for a few months.

That only works if you can write a fitness algorithm that can tell whether the program did the correct thing or not -- otherwise, you have no way to decide what to "breed" and what to throw away. And for many types of program, that fitness algorithm would be more difficult to write than the program you are trying to auto-generate...
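To make that concrete, here is a minimal sketch in C of the simplest kind of fitness function: score a candidate by the fraction of sample cases it gets right. Everything here (candidate_fn, fitness, the two candidates) is made up for illustration, and the "program" being evolved is just max(a, b).

```c
#include <stddef.h>

/* A candidate "evolved program": it is supposed to return max(a, b). */
typedef int (*candidate_fn)(int a, int b);

struct test_case {
    int a;
    int b;
    int expected;
};

/* Fitness = fraction of sample cases the candidate answers correctly. */
double fitness(candidate_fn candidate, const struct test_case *cases, size_t count) {
    size_t passed = 0;
    for (size_t i = 0; i < count; i++) {
        if (candidate(cases[i].a, cases[i].b) == cases[i].expected) {
            passed++;
        }
    }
    return count ? (double)passed / (double)count : 0.0;
}

/* Correct for every input. */
int correct_max(int a, int b) {
    return a > b ? a : b;
}

/* Wrong in general: it only ever returns its second argument. */
int lucky_max(int a, int b) {
    (void)a;
    return b;
}
```

If every sample case happens to have b >= a, lucky_max gets the same perfect score as correct_max, so this fitness function can't tell them apart. Closing that gap means the fitness function has to know what "correct" is for every input, which is the hard part.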

Of course, you'd need to make sure the algorithms that humans wrote aren't flawed themselves, but once you got that pinned down, you would be more or less home-free.

All you've done is replace a hard problem ("write a program that does X") with a harder problem ("write a program that teaches a computer to write a program that does X"). No dice.

Even if you didn't take this drastic a step, another solution would be computer-aided software burn-in. Let the computer test the software for bugs. A super-QA Analysis if you will. Log complete program traces for every trial run, and let the machine put the software through every input/output possibility.

For most modern programs, there isn't nearly enough time left before the heat-death of the universe to do this. Hell, for programs other than simple batch-processors, the number of possible inputs and outputs is infinite (since the program can do an arbitrary number of actions before the user quits it).
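A back-of-the-envelope version of that argument, as a small C sketch. The function names and the rate of one billion tests per second are assumptions picked for illustration, not measurements of any real test rig.

```c
/* Number of distinct inputs for a function that takes `input_bits` bits. */
double input_count(unsigned input_bits) {
    double count = 1.0;
    for (unsigned i = 0; i < input_bits; i++) {
        count *= 2.0;
    }
    return count;
}

/* Years needed to try every input once at a fixed test rate. */
double years_to_test_all(unsigned input_bits, double tests_per_second) {
    const double seconds_per_year = 365.25 * 24.0 * 60.0 * 60.0;
    return input_count(input_bits) / tests_per_second / seconds_per_year;
}
```

At 1e9 tests per second, a function taking just two 32-bit integers (64 bits of input) works out to a bit under 585 years, and 128 bits of input to something on the order of 1e22 years. That is for a single stateless call; anything interactive is worse, for the reason given above.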

--

I don't care if it's 90,000 hectares. That lake was not my doing.

Mandate memory checking tools by hawkstone · 2003-05-20 13:15 · Score: 5, Interesting

I'm sure it's harder to accomplish this for kernel level code (it's primarily OSes being pointed at right here) but you can think everything is working hunky-dory and not realize something is going wrong under the covers.

Most errors of this kind can be found with testing under tools like Valgrind or Rational's Purify. I'm sure there are others (I've heard of ParaSoft Insure++, ATOM Third Degree, CodeGuard, and ZeroFault), but the quality of these tools really matters.

The issue is that tiny errors can cause crashes intermittently, and not immediately. For example:
uninitialized memory reads -- usually not a problem, but if this value is ever actually used, it will be.
array bounds reads -- never acceptable, but depending on the structure of memory, may not always cause an immediate crash.
array bounds writes -- like ABRs, may not be immediately fatal, but these are going to crash your code sooner or later.

Since they don't always cause an immediate crash, these errors are likely to creep into released code without use of one of these tools. And if you want to know why we shouldn't always run programs in an environment that checks these kinds of things, try it once; you'll notice a speed hit of usually an order of magnitude. C/C++ is a perfectly acceptable language -- not all debugging has to be done by the compiler/interpreter or only after you notice a problem.
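Here is a small C sketch of how an array bounds read can hide. The function names are made up for the example. The buggy loop reads one element past the count it was given; if the neighbouring memory happens to be readable, nothing crashes and you just get a quietly wrong answer.

```c
#include <stddef.h>

/* BUGGY: "i <= count" reads values[count], one element past the range the
   caller asked for. On a buffer that is exactly `count` elements long this
   is an array bounds read. */
long sum_buggy(const int *values, size_t count) {
    long total = 0;
    for (size_t i = 0; i <= count; i++) {
        total += values[i];
    }
    return total;
}

/* Fixed: the loop stops at the last valid index. */
long sum_fixed(const int *values, size_t count) {
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}
```

With a four-element buffer {1, 2, 3, 100} and a count of 3, sum_buggy returns 106 instead of 6 and nothing visibly goes wrong. When the buffer is a heap allocation of exactly `count` elements, running the program under a checker (for instance `valgrind ./your_program`) is the sort of thing that should flag the stray read instead of leaving it to luck.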

Anyway, hope that wasn't too pedantic....

We've got a lot of techniques in the gaming world by Samir+Gupta · 2003-05-20 13:18 · Score: 3, Interesting

In the world of games, especially console games, a crash immediately spoils the user's gameplay experience, and it's doubly so if you don't have a mechanism to patch games as in the PC world.

In the GameCube, crashes are alleviated by having only a thin OS layer between the hardware and the game, and by allowing only a single task to run at a single privilege level of the CPU. This avoids context switches and going back and forth between user and kernel mode, which introduce complexity and can wreak havoc if malicious data is present.

Furthermore, we have a set hardware configuration, running a well defined consistent set of drivers, which are again, minimal, and this eliminates another factor that often leads to crashes in the PC world.

The most important thing, though, is robust software design. In our games, we all code exception handlers for the software, so that a single errant NULL pointer doesn't bring the whole thing down with a "Segmentation fault" message as PC users seem to experience with their software. Instead, we gracefully recover, perhaps by immediately rolling back to the previous iteration of the game loop and "moving" the player a bit; for instance, in an FPS the player might have entered an area in an orientation that happens to create a divide-by-zero error due to numerical imprecision.
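A minimal sketch of that roll-back-and-nudge idea in plain C. This is not the actual console code: the GameState fields, the EPSILON and NUDGE values, and the function names are all hypothetical, and an explicit check stands in for a real hardware exception handler.

```c
#define EPSILON 1e-9   /* assumed threshold for "too close to zero" */
#define NUDGE   0.01   /* assumed small displacement applied after a rollback */

typedef struct {
    double player_x;
    double velocity;
    double target_x;
    double inverse_distance;   /* 1 / (target_x - player_x) */
} GameState;

/* One simulation step. Returns 0 on success, -1 if the step would divide
   by (nearly) zero. Note that the state is already modified by then. */
static int update(GameState *state) {
    state->player_x += state->velocity;
    double distance = state->target_x - state->player_x;
    if (distance > -EPSILON && distance < EPSILON) {
        return -1;
    }
    state->inverse_distance = 1.0 / distance;
    return 0;
}

/* One game-loop iteration with recovery. Returns 1 if it had to roll back. */
int tick(GameState *state) {
    GameState previous = *state;     /* snapshot of the last good iteration */
    if (update(state) != 0) {
        *state = previous;           /* discard the half-applied step */
        state->player_x += NUDGE;    /* "move" the player a bit */
        return 1;
    }
    return 0;
}
```

With player_x = 4.0, velocity = 1.0 and target_x = 5.0, the first tick lands exactly on the target, fails, and rolls back with a nudge; the next tick goes through normally.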

In the future with CPU and memory speeds increasing, we are investigating new designs, such as microkernel based architectures where individual game entities are separate protected "processes" that communicate via some fast IPC mechanism such as shared memory or a "tuplespace", so that a bug in one entity doesn't bring the whole universe crashing to a halt, and I hope that such techniques are adopted by the general computing world.

--
-- Samir Gupta, Ph. D. Head, New Technology Research Group, Nintendo Co. Ltd., Kyoto, Japan.

Re:Microsoft by VTS · 2003-05-20 13:20 · Score: 5, Interesting

Some time ago I would have agreed with you, but not anymore. If Media Player crashes playing some video, then the whole system becomes unstable, and then even doing something like sending a file to the Recycle Bin freezes the UI...

--
--- No 16-bit support in Vista? Half of our modules still use it! ---

Don't single out Microsoft by callipygian-showsyst · 2003-05-20 13:32 · Score: 3, Interesting

Of course, there's no need to mention Microsoft's inability to create a stable system
My Windows XP box, which is my fileserver, has been up for 5 months so far.

My OS X box, which I use for web browsing and word processing, crashes about once every three days.

Now, I certainly have some bones to pick with Microsoft, but Apple is no better.

--
Best Buy can have you arrested

Turing showed this by martin-boundary · 2003-05-20 13:41 · Score: 4, Interesting

A crashed computer is a computer that's stopped. Alan Turing proved in 1936 that the halting problem is unsolvable. So, it's impossible to know when and how a computer is going to crash or not under all possible circumstances (inputs).

Accept it. It's a fact of nature.

all systems crash, not just MS by dirk · 2003-05-20 13:44 · Score: 4, Interesting

When can we finally give up the FUD of "MS crashes all the time"? Anyone who has used a later MS OS (Win2k or XP) can easily see they crash very rarely. My Red Hat install has had more problems than my Windows install in the past 6 months, and on the MS system most of the problems have been 3rd party software, while on Linux most of the problems have been the OS itself. The reason systems crash is that there are many pieces, written by many different people, interacting with each other. This is the same whether the OS is Linux or Windows. The harping on the instability of Windows does nothing but hurt the Linux cause, since anyone who actually uses a newer version of Windows knows that the claim has no basis in reality.

--

"Information wants to be expensive" - Stewart Brand, the same guy who said "Information wants to be free"

Time is Money. by Rimbo · 2003-05-20 13:53 · Score: 5, Interesting

I think this is basically the right answer.

A couple of months ago, the company I worked for spent a lot of time and effort developing a robust testing methodology. We had a software product that, through blood, sweat, and tears, would not crash unless you basically blasted the hardware in some way.

But that led to two problems. First, we only had so many people working, and resources spent testing and bugfixing were not being used to add new features. Second, the time it took to get it that robust delayed the product's release beyond the point where we could recover the investment. [Time developing] * [Cost of operating] was greater than [expected number of units sold] * [price per unit].
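That inequality is easy to play with. A small C sketch follows; the function names are made up, and any figures you plug in are your own assumptions, not the real numbers from the company in question.

```c
/* [Time developing] * [Cost of operating] */
double development_cost(double months, double monthly_operating_cost) {
    return months * monthly_operating_cost;
}

/* [expected number of units sold] * [price per unit] */
double expected_revenue(double units_sold, double price_per_unit) {
    return units_sold * price_per_unit;
}

/* Returns 1 if expected revenue covers the development cost, 0 otherwise. */
int recovers_investment(double months, double monthly_operating_cost,
                        double units_sold, double price_per_unit) {
    return expected_revenue(units_sold, price_per_unit) >=
           development_cost(months, monthly_operating_cost);
}
```

For example, with a purely hypothetical 50,000 a month in operating cost and 2,000 units at 300 each, 9 months of development costs 450,000 against 600,000 in revenue and pays off, while 18 months costs 900,000 and does not. Each extra month spent on robustness moves the left side up without touching the right.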

What ended up happening was that we lacked the features to justify the price and number of units we needed to sell to cover the cost of developing it. We had no bugs -- and we could be certain of it -- that would crash the machine.

As of last month, the company could no longer afford to pay me. I'm not there any more.

The moral of the story is that trying to make a bug-free product will bankrupt your company, especially a startup. Software tools have improved, but the benefit largely goes towards adding new whiz-bang features that sell the product for more money, not to being able to fix more bugs.

What we should do as engineers and managers of software products is to not be afraid of getting the product out the door with a few bugs in it if we want our company to do well; this business reality is ultimately why bugs will be a big part of software for the foreseeable future.

What are you smoking? by Jerk+City+Troll · 2003-05-20 14:02 · Score: 4, Interesting

My OS X box, which I use for web browsing and word processing, crashes about once every three days.

The Ti PowerBook G4 I am writing this post on is running Mac OS X 10.2.x. It goes in and out of sleep on an irregular basis, and not always when it is idle. I swap PCMCIA cards in and out. It hops from network to network. I do a lot more than browsing and word processing.

According to my Konfabulator uptime widget, I have 83 days, 23 hours, 20 minutes. My load average at the moment is 1.7. It has not been rebooted since I installed OS X (I did it myself after buying it just for messing around purposes).

You, sir, are either lying, have bad hardware, or you've severely corrupted your installation. This operating system (which is BSD) is solid as a rock.

--
Join Tor today!

Re:Computers don't crash by Anonymous Coward · 2003-05-20 14:06 · Score: 5, Interesting

The current issue of Scientific American states that 51% of crashes are due to user error. 15%=software error. 34%=hardware error. Refer to article for further info.

You made a little "user error" there yourself-- the article says that 34%=software error and 15%=hardware error.

Oh, and those figures are just for Web applications, not software applications in general.

It's an interesting article. Unfortunately, they're not very clear about what constitutes a "user error." I've filled out Web forms that gave me an "error" when I included hyphens in my phone number or credit card number. That's far from an error, it's just poor user interface design.
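For what it's worth, accepting those hyphens takes only a few lines. A sketch in C of one lenient approach: strip the common separators and keep the digits. The function name and the set of separators it accepts are my own choices for the example.

```c
#include <ctype.h>
#include <stddef.h>

/* Copies the digits of `input` into `out`, skipping common separators
   (hyphen, space, parentheses, dot). Returns the number of digits written,
   or -1 if the input contains anything else or does not fit in `out`. */
int normalize_digits(const char *input, char *out, size_t out_size) {
    size_t written = 0;
    if (out_size == 0) {
        return -1;
    }
    for (const char *p = input; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (isdigit(c)) {
            if (written + 1 >= out_size) {
                return -1;           /* no room left for digit + terminator */
            }
            out[written++] = (char)c;
        } else if (c == '-' || c == ' ' || c == '(' || c == ')' || c == '.') {
            continue;                /* separator: ignore it, don't complain */
        } else {
            return -1;               /* genuinely unexpected character */
        }
    }
    out[written] = '\0';
    return (int)written;
}
```

So "555-867-5309" comes out as "5558675309" with a return value of 10, and only input like "555-CALL" is rejected.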

In my opinion, something the user does should never cause a program or operating system crash. If this can occur, it is the developer who is at fault, not the user.

Apple's Human Interface Guidelines are a nice introduction to user-fault tolerance, even if you're developing for other platforms.

Re:For those who are willing to pay... by dghcasp · 2003-05-20 14:22 · Score: 4, Interesting

Think of the systems used by the telcos, or NASA. Are they perfect? No, but they are much, much more stable than Win32, or Mac, or Linux. The reason is simple, the owners demand them to be.

This reminds me of a story I read in the internal magazine of a telecommunications equipment supplier that I used to work for. It was about an international toll switch somewhere in the U.K. that had been up for 17 years (or something extreme like that.) Furthermore, this included having all of its hardware upgraded and replaced. Twice.

Just stop and think about that for a while in PC terms... "I replaced my motherboard with the power on without rebooting my system, while it was serving 10,000 web pages a second."

Granted, this is a higher level of hardware with full redundancy, but it still boggles my mind.

Re:Simple, yes, for other reasons by Chris+Carollo · 2003-05-20 14:25 · Score: 3, Interesting

Jets are complex too. So is the Space Shuttle. Cruise ships. CARS are pretty complex.

Then again, if one of the overhead bin latches gets stuck, or my overhead light burns out, or my seatbelt gets stuck, the entire plane or car doesn't instantly explode. The issue isn't complexity, it's fragility.

Software is incomprehensibly fragile -- any single thing can cause a crash, taking the whole system or application down. And even those critical parts of things like airplanes have multiple redundancies, something that's hard to build into software. You can do things like catching exceptions, but you typically can't recover as gracefully as if there was never a problem at all.

The shuttle is actually not a bad analogy -- it's also very fragile due to the stresses it endures. And we've effectively had two crashes in 100 runs. Most software is more stable than that.

Re:Microsoft by CognitivelyDistorted · 2003-05-20 14:47 · Score: 4, Interesting

Yes, NT5+ is very stable. MS is working on the driver problem. SLAM is a tool for verifying drivers. Given a requirement, e.g., after acquiring a kernel lock the driver must release it exactly once on all control paths, and some driver source code, SLAM can find all the ways the driver can fail the requirement. They have specifications for various driver types and are using them to test some drivers. It's a research project by the Software Development Tools group in MSR, but they're working on getting it stable and powerful enough to verify more drivers. If they can get it to work well enough, they'll supply it to hardware vendors.
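To illustrate the kind of rule being checked (release the lock exactly once on every control path), here is a toy C sketch. It has nothing to do with SLAM's actual implementation or with real kernel APIs: FakeLock and the handler functions are hypothetical, and the counters only let you observe at run time what a static checker would try to prove for all paths.

```c
/* A stand-in lock that just records what happened to it. */
typedef struct {
    int held;
    int acquire_count;
    int release_count;
} FakeLock;

void lock_acquire(FakeLock *lock) {
    lock->held = 1;
    lock->acquire_count++;
}

void lock_release(FakeLock *lock) {
    lock->held = 0;
    lock->release_count++;
}

/* BUGGY: the early return on the error path leaves the lock held. */
int handle_request_buggy(FakeLock *lock, int request_size) {
    lock_acquire(lock);
    if (request_size < 0) {
        return -1;               /* lock is never released on this path */
    }
    lock_release(lock);
    return 0;
}

/* Fixed: a single exit, so every path releases exactly once. */
int handle_request_fixed(FakeLock *lock, int request_size) {
    int status = 0;
    lock_acquire(lock);
    if (request_size < 0) {
        status = -1;
    }
    lock_release(lock);
    return status;
}
```

The buggy version works fine on every well-formed request, which is why this sort of thing survives ordinary testing and only shows up when the rare error path is hit.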

it DOES cause an error by ChrisCampbell47 · 2003-05-20 15:17 · Score: 4, Interesting

Interesting that the first two posts in the thread had English syntax errors in their first sentences. We can still understand it, but compilers/CPUs would have problems. Seems that the real problem is the difference in the natures of wetware and hardware.

Actually, "syntax errors" like this DO cause a problem for wetware systems -- they cause the brain (well, mine at least) to kind of glaze over and take the remainder of the sentence/thought much less seriously. Kind of like aborting/returning out of a subroutine.

Here in the Slashdot world of "definately" and "righting", I've learned that any posted comment that makes high-school-level grammatical or spelling errors is not worth my time and I immediately skip the post. I've been doing this quite rigorously lately -- blah blah blah "seperate" PAGE DOWN.

OK now, everybody nod and think I'm talking about someone else's posts ...

--
One simple rule for its versus it's

Re:Human Error by JohnsonWax · 2003-05-20 16:00 · Score: 4, Interesting

"All programs (for the most part) must be written by people. ... Computers crash because people cant catch that one little fatal error in 10,000 lines of code."

All bridges (for the most part) must be built by people. Bridges collapse because people can't catch that one little fatal error in one or two million components.

The shit coders put out there, I swear... The reason software crashes is that by-and-large it's hacked together, not engineered. You hack a bridge together, and yes, it'll fail. You engineer software, and yes, it will run reliably. It's not fun to do - no easter eggs, no cool tricks, no cramming features in weeks before ship.

I'm stunned at the amount of code that goes out that was written by interns, by inexperienced coders, by people that just don't have a clue. The software industry really has no concept of best practices, no leadership, no authority body. The fact that buffer overflows still happen is stunning.

It's not small projects that work well because out of dumb luck they happen to not fail, or larger projects that work okay because we have 34,000 people looking at the code. If that's 'best practices', then we're doomed.

"Mozilla (www.mozilla.org) has a feedback option to help them debug, many software companies are including this."

Uh huh. Let's translate that to my car: "Hi. Yeah, I'd like to report a bug. I have a Saturn Ion, version 1.1v4. Yeah, when I turn on the left turn signal and then turn on the lights, the car catches on fire. You might want to fix that in the next version. Just thought you might want to know. Bye."

Re:Try the UML by Billly+Gates · 2003-05-20 16:18 · Score: 3, Interesting

Architects and engineers use extremely detailed drawings. Have you ever taken any drafting courses in high school or college? Every piece, and even the size of every screw, is detailed as accurately as possible. It takes forever to get anything done because the precision is more important. It drives some people like myself crazy.

The blueprint is the actual prototype of the product being designed.

The problem is that if you document every step and algorithm in exact detail, you will spend weeks, months, and yes, years without a single line of code!

This is unacceptable in today's business world, where all the projects are due yesterday and your bosses demand to know what percentage of the code has been developed. If you spend a month planning and not a single line of code is developed, you're canned.

My father took over a project that a clueless IT manager got because she slept with the CIO. Anyway, she went to a seminar which claimed that flowcharting everything would be the wave of the future. She then had all the programmers draft every single algorithm, down to the very if statements themselves, on paper. After 4 months and not a single line of code, my old man took over. From there he finished the project within 3 weeks!

My point is that drafting programs is too time consuming. In a way your drawing is the program, and changes can be made as you go. It's essential to have good flowcharts and notes, but they need to be generalized. If there is an error in it you can delete the line and fix it. In engineering you would have to disassemble the actual product and redesign it. Because that would cost time and money, it is not accepted. In software that limitation is not there, or is not as severe.

UML tries to be the blueprint of all software programs but instead is only used to explain certain subsystems and algorithms. Mostly flowcharts are used so all the developers have a sense on how the program will work and how to invoke different pieces of the program.

I do not think this is going to change unless there is a quick and easy way to debug UML charts. Logic errors are killer, and if it's perfect I suppose you can compile the UML directly into the language of choice.

Hmmm, in fact this might be the way to do it in the future.

--
http://saveie6.com/

Re:OT: Electric overconsumption by doorbot.com · 2003-05-20 16:54 · Score: 5, Interesting

I wish there was consumer demand for low power desktop computing.

My mail/web server would run fine off of something ridiculously small, like a Sharp Zaurus. Here are my requirements, and I will pay for one if it is available.

Non-x86 hardware designed for lower power -- extra speed is nice, but not required; Pentium 200 speeds or better
Low power, with 9V or AA-based battery backup (changeable while system is running)
3" - 4" LCD (with manual switch to turn off) at 640 x 480, or some sort of LED array/VFD, because all I really need is a low power terminal supporting 80 x 24 characters.
USB port for keyboard
Serial port
Two or three 10/100 NICs
Full (Debian) Linux support of all hardware
Some sort of expansion (PCMCIA maybe, or via USB)
Support for CompactFlash for backups
Hardware encryption would be a nice goodie but not required

Yes, I could probably build this with PC104 components, but I want a pre-built product, and I'm willing to pay for it (maybe $300 - $400).

Re:and by sheldon · 2003-05-20 17:08 · Score: 3, Interesting

Interesting.

I play RTCW quite a bit on my WinXP box with no issues. RTCW occasionally crashes, and I have to hit CTRL-ALT-DEL to bring up task manager and kill it, but the system remains stable.

When I first built this box I had some issues, after a while it would lock up. Turned out it was because the video card was overheating. The system itself wasn't locking up, just the video card. Put the system in a new Antec SX-835II case with better cooling and haven't had a problem since.

Easy.. economics and ongoing profit by smeenz · 2003-05-20 20:27 · Score: 3, Interesting

In the vast majority of cases, it's simply not economic to release bug-free code.

1. Any programmer knows that 90% of the code is written in the first 90% of the time, and the other 10% of the code is written in the other 90% of the time. (no typo). That is to say, it takes a lot more time, effort, and hence money, to move a project from "working well" to "working perfectly".

2. Many software companies these days make very little profit on the 1.0 release of their software, and make huge amounts of money through ongoing support charges. Microsoft is a classic example of this type of company.

3. If you release a piece of software that works really well, does everything the users want, and never crashes or causes trouble, then you may as well pack up shop and go out of business quietly. The unfortunate truth is that nobody is going to buy version 2 if they can do everything they want with version 1, and they're not getting constantly frustrated by crashes. The only carrot you have in this situation is to think up some really great ideas for version 2 in order to encourage people to upgrade - in fact, some of those ideas may have been deliberately left out of version 1 just so that they could be added later. Version 3 is more difficult still, and version 5 is right out. By comparison - how many versions of Office are we up to now?

A notable exception to this business model is the games writers. Companies like Valve and id Software consistently produce very near to bug-free code that works well and generally impresses the masses.

In all the years since Half-Life was released, there have been relatively few patches and fixes, and many of those were to prevent ingenious new methods of cheating, or to add support for hardware that didn't exist when the game was first released. The Unreal engine had a similar history.

People buy new games because they crave the excitement or challenge of exploring and interacting with it. That's not something that could really be said about Excel or Word, so those sorts of products have to rely on the "draw out the profit over many releases" strategy described above.

Another (big) factor is people's expectations - most people expect that Word will crash from time to time, and given Microsoft's past history, they have little reason to expect that to change. On the other hand, gamers have an expectation that the latest game from id Software will be as solid as a rock, and that the few problems that do crop up after the release will be fixed quickly.

If a games company didn't spend that "other" 90% on the last 10% of development, and released something that crashed as often as Explorer, their reputation would be mud within days, and people would stop buying their games.

And lastly, choice.

People have a choice as to which games they want to buy. It's a competitive market out there, with many people having little disposable income to spend on games. On the other hand, despite what Linux advocates (I can't believe I'm saying this on Slashdot) say, most people use MS apps and operating systems because they don't have a choice - say due to corporate rules.

You might think that it is the end user that gets the sharp end of the stick here, but the people that really get screwed are the dedicated and talented programmers, who are working for companies that don't care too much if they release code before it has been fully tested.

Re:Whoops, bullshit alert. by tagevm · 2003-05-20 22:08 · Score: 3, Interesting

I bet the piece of code causing this looks something like this:

...
/*
  Check every second....
  Maybe GetTickCount wraps, but I don't care,
  something else will probably break before 49 days anyway
*/

if ((GetTickCount() - m_dwLastTick) > 1000)
{
    DoSomeThingImportant();
    m_dwLastTick = GetTickCount();
}

GetTickCount returns the number of milliseconds since reboot; after about 49 days it will wrap and start over, so lazy programmers using code such as the above will have a problem.
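To make the wrap concrete, here is a small simulation in portable C, with a uint32_t standing in for the real tick counter. The function names are made up for this sketch and are not Win32 APIs. A 32-bit millisecond counter wraps after 2^32 ms, roughly 49.7 days. How badly that bites seems to depend on how the comparison is written: subtracting the old tick from the new one in unsigned arithmetic keeps working across the wrap, while comparing against a precomputed deadline (or doing the math in a signed or wider type) can stall.

```c
#include <stdint.h>

/* Days until a 32-bit millisecond counter wraps (2^32 ms). */
double wrap_period_days(void) {
    return 4294967296.0 / (1000.0 * 60.0 * 60.0 * 24.0);
}

/* Naive check: compare "now" against an absolute deadline. If the counter
   wraps before the deadline is reached, "now" becomes a small number and
   the condition stays false for a very long time. */
int interval_elapsed_naive(uint32_t now, uint32_t last, uint32_t interval_ms) {
    return now > last + interval_ms;
}

/* Wrap-tolerant check: unsigned subtraction is modulo 2^32, so (now - last)
   is the true elapsed time as long as less than one full wrap period has
   passed since `last`. */
int interval_elapsed_wrap_safe(uint32_t now, uint32_t last, uint32_t interval_ms) {
    return (uint32_t)(now - last) > interval_ms;
}
```

Take last = 0xFFFFF000 (about four seconds before the wrap) and now = 0x00000100 (just after it). About 4.3 seconds have really passed, and the subtraction form says so, but the deadline form reports that the one-second interval still hasn't elapsed and will keep saying that until the counter climbs all the way back up.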

programmers trending downward by junkgoof · 2003-05-21 02:41 · Score: 3, Interesting

I think this brings up a good point. Hardware may have improved, software development tools may have improved, but the people writing software have gotten much worse. A few years ago most people who were in the computer industry were there because they knew something. Now they are there because they wanted money, some HR droid picked their CV out of a pile because of the acronyms, and some manager does not know enough to fire them. Layoffs haven't helped either; generally the knowledgeable people with higher salaries get booted first. Security vulnerabilities are up (including old stuff that has not been patched) and successful projects are down.

--
You got me into this! You were the ideologue! I'm only a poor assassin! - Twenty evocations, Bruce Sterling

Slashdot Mirror
