Blackout Cause: Buggy Code
blanca writes "The big northeast blackout from last summer was caused in part by a software bug in an energy management system sold by General Electric, according to a story on SecurityFocus. The bug meant that a computerized alarm that should have been triggered never went off, hindering FirstEnergy's response to the train of events that led to the cascading blackout. Investigators found the bug in an intensive code audit following the outage, and a patch is now available."
Didn't the story used to be that a tech forgot to re-enable an alarm after doing maintenance on the machine?
http://www.schneier.com/crypto-gram-0312.html#1
A snippet of the article:
With all the lip service about "homeland security," one ought to be concerned about anything affecting national infrastructure being sent abroad, where you really don't know who is doing the coding, whether the coding projects are being further outsourced to, say, alQaidaSoft, etc.
People say I'm crazy, I got diamonds on the soles of my shoes...
How about the energy companies?
Certainly, the energy corporations must be somewhat culpable for not rigorously testing the software in the first place? It is not in the interest of a for-profit company to see to it that such systems are functioning correctly, as that cost will detract from the bottom line profit. Only when disaster strikes can they be goaded into looking into problems.
Stop corporate
From the article:
When a backup server kicked-in, it also failed, unable to handle the accumulation of unprocessed events that had queued up since the main system's failure. Because the system failed silently, FirstEnergy's operators were unaware for over an hour that they were looking at outdated information on the status of their portion of the power grid, according to the November report.
How in the world did they manage to build a system nearly completely dependent upon computers, and yet not know when they lost not just one, but two computers that monitored the system?
Homer: Don't turn off the computer! Don't turn off the computer! Don't turn off the computer!
"Click"
Your wish is Bruce Schneier's command
I'd sort of tend to agree, although under your standards, the stuff I do as an EE really would fit under development, since we don't have the budget to send out for external certification and external testing. No biggie, I guess I can live with being a hardware developer.
Is it true that some states have prohibited Microsoft from issuing MCSEs? I heard this somewhere but I can't remember where. Something about Microsoft not having the authority to certify engineers.
I bet they had much wider safety margins built into the system which prevented blackouts. But these safety margins probably cost money (I say this without knowing a thing about the electrical system); they probably mean a less efficient use of resources. So power companies buy GE's software. They don't buy it so that they can have an added measure of blackout prevention; they buy it because it enables them to cut out expensive/inefficient safety margins without (supposedly) sacrificing reliability. They do this to lower their cost of providing electricity to you.
Eat at Joe's.
By my calculations, assuming air ionizes at about 10,000 volts per centimeter, a 50 MV line should be at least 5,000 cm (or 50 meters) from any ground. 50 meters on either side of a line is a lot of property for an electrical company to buy, and with a surge in the line I'd bet the distance would need to be even more.
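To make the arithmetic above explicit, here is a minimal Python sketch. The 10,000 V/cm breakdown figure is the poster's assumption, not a measured value, and the function name `min_clearance_m` is made up for illustration.

```python
def min_clearance_m(line_voltage_v, breakdown_v_per_cm=10_000.0):
    """Return the air gap, in meters, at which the assumed breakdown
    field strength would be reached for the given line voltage."""
    clearance_cm = line_voltage_v / breakdown_v_per_cm
    return clearance_cm / 100.0  # centimeters to meters


# The poster's example: a 50 MV line at an assumed 10,000 V/cm.
print(min_clearance_m(50e6))  # 50.0
```

Any surge margin would just scale the voltage passed in, so the required distance grows linearly with it.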
what good is a backup system if it's never been tested?
If she floats, she's a witch.
Actually, from the way the article sounds, the blackout might not have been as large or as long, or might not have happened at all, if the software had been updating properly. The electrical grid is constantly falling apart. It is never all up. That's O.K. It is the status quo. It is when the electric company doesn't know what is happening and can't get people to the trouble spots that these things become noticeable. Usually they are fixed within 30 minutes to 2 hours. From everything that I've read, it wasn't a big problem at all. It was a fixable problem that was allowed to exist too long. After that point it became a big problem. I'd hold the monitoring software responsible.
The software handled one part of the electrical system involved.
What about a good Electrical/Mechanical/Civil Engineering solution that would have prevented it from cascading through different systems / electrical companies / countries?
One piece of software which didn't raise an alarm is shocking. The fact that it cascaded over such a wide area is simply mind blowing.
Before we talk about "software engineers" how about talking about "traditional engineers" and their role in this massive failure?
The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
We may slam Microsoft for all of its bugs, but it's really hard to top a software bug triggering an international blackout the size of the one last summer. I think I should sue GE for making me walk 3.5 hours home in the heat with no money in Toronto, uphill, because I couldn't take a subway home. I smell a lawsuit the size of the eastern seaboard.
In Canada, "Engineer" is a protected term, like "Doctor."
Doctor is not a protected term. Perhaps you mean "Medical Doctor"? There are lots of non-medical doctors.
I was arguing once with an MD friend of mine who thought that PhDs (like myself) don't have the right to call themselves Doctor. I explained that while medicine has been around for a very long time, the degree of MD has not. PhD degrees have a much longer history than MD degrees.
It gets very funny when another friend of mine (who has a PhD in nursing) is called "Dr" in her hospital.
After lots of years as a developer, I realized that the engineering process that goes into other professions (for example, civil engineering) can't be applied to software. The reason is simple: software is many orders of magnitude more complex. Software has many interdependencies between components, has many states, and is subject to change every minute. It's very difficult to see ahead and provide APIs that fit all the needs; that's why we go back and change the damn thing. What does a civil engineer have to do? He/she has to combine parts and test whether they hold together. There are a lot of parts, but the general principles are few and can be easily remembered...unlike software.
Furthermore, the tools we have for the job are inadequate. The programming languages are primitive. The debugging tools are dumb. The machines are not clever and strong enough to prove the mathematical theorems behind their programs. We don't even learn these things in college...we learn how to use programming languages, but we don't learn how to program...but I seriously believe we will never learn how to program, because a program's complexity increases tenfold for each line of code written!!!
Just a nitpick,
Creating a true software engineer is different from making them PEs. Right now, most of the engineers who design things in industry don't have PEs, and if they do, they don't make it known publicly for the very reasons you mentioned.
The rest of us without PEs don't need the insurance, as that is supplied by the company.
Also, keep in mind that just because an engineer worked on something doesn't mean that it will be expensive. Most of what I engineer costs less than a dollar.
If you haven't guessed, IAAE (I am an Engineer)
"...At the end of the day"..."when everyone goes home, you're stuck with yourself." RIP Layne Staley
You have 4 main variables in the software development equation: Time, Quality, Functionality, and Efficiency. Notice that we only measure time, not man-hours or monetary cost. As we know from reading The Mythical Man-Month, we cannot reduce time by adding more people or by spending more money. While we list efficiency as a variable, we really have to treat it as a constant within the scope of a single release cycle. Improvements in efficiency are generally very gradual and incremental, and for the most part cannot be effectively implemented in the middle of a release cycle.
I postulate that Time is directly proportional to the product of Quality, Functionality, and Efficiency [T = EQF]. Since E is constant within the scope of a single release, we can't use process improvements or similar techniques to improve quality in the short term. Assuming our goal is to improve quality, we either have to decrease functionality or increase time. Since monetary cost is directly proportional to time (time is money!), managers are very reluctant to give you more time. Furthermore, we are frequently under hard time constraints due to contractual obligations or market pressure. If we can't change time, we either have to sacrifice quality or functionality. Missing functionality is very obvious, whereas low quality isn't necessarily noticeable in the short term, so it should be no surprise that quality almost always takes a back seat to functionality.
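A quick sketch of the T = EQF postulate above, in Python. The units are arbitrary and the function name `quality_for` is hypothetical; it only rearranges the poster's formula to Q = T / (E * F) to show what happens to quality when time and efficiency are held fixed and functionality grows.

```python
def quality_for(time, functionality, efficiency=1.0):
    """Solve the postulated T = E * Q * F for Q (all units arbitrary)."""
    return time / (efficiency * functionality)


# Fixed schedule and fixed efficiency: adding features can only lower quality.
baseline = quality_for(time=12.0, functionality=3.0)
more_features = quality_for(time=12.0, functionality=4.0)
print(baseline, more_features)  # 4.0 3.0
```

Under this model, the only ways to get the baseline quality back are to cut the extra functionality or to stretch the schedule, which is the trade-off described above.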
Why is it that the proponents of "one nation under God" are so eager to get rid of "liberty and justice for all"?
I always treat watchdog software with just a bit of skepticism. The problem, as pointed out by NERC, was that a process in the system was somehow present, but not communicating well.
The alarm subsystem is often a separate process. It doesn't talk to the field. That's the job for other elements of the SCADA system. It was supposed to watch for semaphores, messages, or read shared memory somewhere. How do you watchdog something like that if it gets the message, but doesn't do what it's supposed to?
In a SCADA system near and dear to my career, we set alarm thresholds so low that the operators expect a certain amount of alarm traffic even for routine events. This helps to discover any misbehavior in the alarm system.
There is such a thing as a control center which is TOO quiet.
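One way to picture the "too quiet" check described above is a monitor that flags the alarm subsystem itself when routine alarm traffic stops arriving. This is only a minimal Python sketch under stated assumptions: the class name `QuietMonitor`, its methods, and the 600-second window are all invented for illustration and are not taken from any real SCADA product.

```python
class QuietMonitor:
    """Flags an alarm subsystem that has gone suspiciously silent.

    Assumes thresholds are set low enough that some alarm traffic is
    expected during normal operation (hypothetical design).
    """

    def __init__(self, max_quiet_s, started_at_s=0.0):
        self.max_quiet_s = max_quiet_s
        self.last_alarm_at_s = started_at_s

    def record_alarm(self, now_s):
        # Call this whenever any alarm, even a routine one, is displayed.
        self.last_alarm_at_s = now_s

    def is_suspiciously_quiet(self, now_s):
        # True when no alarm has been seen for longer than the window.
        return (now_s - self.last_alarm_at_s) > self.max_quiet_s


monitor = QuietMonitor(max_quiet_s=600.0)
monitor.record_alarm(now_s=100.0)
print(monitor.is_suspiciously_quiet(now_s=500.0))  # False
print(monitor.is_suspiciously_quiet(now_s=701.0))  # True
```

The check looks at what the operators actually see rather than whether the alarm process is alive, so a process that is running but not doing its job still trips it.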
Nearly fifty percent of all graduates come from the bottom half of the class!
In the case of the electric blankets, you're not exposing yourself to much of any B or H field; there's not enough current present to generate much. Now, if you'd said something like a hair dryer, where the field is concentrated to power the motor...
The phone may generate more relative power, but it's at a different frequency. With regard to electricity and the human body, frequency matters as much as anything else.
For DC, 10 mA of current may not be noticeable to a person.
For 50/60 Hz AC, that same 10 mA is going to cause a twitching of the muscles.
For DC at 100 mA to 1 A of current, you're going to get a zap similar in nature to sticking your tongue on a 9 V battery, proportionate to the current in question.
For 50/60 Hz AC at 100 mA to 1 A, it's going to be causing painful contractions of your muscles, and very probably stopping your heart outright if the conduction pathway crosses it.
There have been studies that tend to show that even low energy densities of 50/60 Hz AC can accelerate tumor growth; no studies have actually proven that they generate tumors, though. Effects like the one mentioned tend to be caused more by continuous exposure than by point exposure, so the low levels of energy radiated by the high-tension lines may be a problem if you're next to them, since it's a continuous background level sort of thing.
I am not merely a "consumer" or a "taxpayer". I am a Citizen of the State of Texas
I just never thought I'd see this in reality.
But the second unit had failed in the identical manner a few milliseconds before. And why not? It was running the same software.
I have read that story before on a different site. Everybody keep this in mind before you assume redundant systems can protect you against software errors.
Tell me more, tell me more