Blackout Cause: Buggy Code
blanca writes "The big northeast blackout from last summer was caused in part by a software bug in an energy management system sold by General Electric, according to a story on SecurityFocus. The bug meant that a computerized alarm that should have been triggered never went off, hindering FirstEnergy's response to the train of events that led to the cascading blackout. Investigators found the bug in an intensive code audit following the outage, and a patch is now available."
The first thing I saw at that site: "Reliable, Field-Proven & Adaptable". Funny.
Well, that statement is only half false: its reliability has been field-proven.
Vonal Declosion
Didn't the story used to be that after a tech serviced the machine, he forgot to re-enable an alarm?
tasks(723) drafts(105) languages(484) examples(29106)
"Patch available"
Phew! Then at least I can patch my own power craft before anything happens!
Oh this bug took six months to find and now a patch is available. I thought someone said the bug was found six months ago and now the patch was available. My bad, nobody would ever do that :-)
I have been dreaming of writing such a bug myself. Quite an achievement to black out a quarter of a continent with some crappy code...
Aure entuluva!
... when you outsource to the lowest bidder?
I've said enough.
Only half right. We have to find a way to make Linux and/or open source the shining alternative.
Toronto-area transit rider? Rate your ride.
http://www.schneier.com/crypto-gram-0312.html#1
A snippet of the article:
The term 'Software Engineering' is bandied about in the software industry. I think little that you could call engineering happens. Software is developed. It doesn't meet the strict standards of testing and reliability of physical products.
I am a software developer not an engineer, as are most people in the field. Software won't become an engineering science until companies are willing to pay for that process. Given the current trend towards cost cutting I don't see that happening anytime soon.
As x approaches total apathy I couldn't care less.
Just a question for everyone here:
Who thinks this could have been any better with Open Source and why?
People make the comment of the many eyes, but who is really looking at the code?
Curiosity was framed; ignorance killed the cat. -- Author unknown
I thought the Canadians did it?
Disclaimer: This opinion was created without the use of any facts
Chalk up another one for the most disastrous software bugs in history. This one should give the Ariane 5 explosion a go for No. 1.
I'm waiting for the next big power failure, then the excuses about why the patch was never applied. :)
One code to light it all, ...
One coder to code it,
One debugger to miss the bug
and into the darkness lead them.
That might be the case, except that XA21 is developed in Melbourne (Fl.)
facts before hysteria thanks
How about the energy companies?
Certainly, the energy corporations must be somewhat culpable for not rigorously testing the software in the first place? It is not in the interest of a for-profit company to see to it that such systems are functioning correctly, as that cost will detract from the bottom line profit. Only when disaster strikes can they be goaded into looking into problems.
Stop corporate
Now if in fact this was buggy code, and if Software Engineers are in fact part of the engineering profession, then a professional body should be taking the engineer(s) to task. This would be the same thing that would take place in the event that a civil engineer signed off on faulty building plans. But smart money says no software "engineer" will get nailed.
A look at the software industry will show this to be the norm. And that is why there is such a problem with having people claiming the title of "software engineer". "Engineer" doesn't just mean having the technical savvy, it also means having a responsibility to the public for the use of that knowledge and being beholden to a professional body charged with ensuring you are held accountable.
"Consensus" in science is _always_ a political construct.
This is Slashdot! Isn't that supposed to say Microsoft? It's always Microsoft.
I was going to put a sig here, but I had already submitted the message.
If you read the article and other associated articles, you will realise that this bug did not *cause* the blackout; on its own, this bug would have had no effect on the continued power supply. However, the timing of the bug, along with a number of other issues (which I won't repeat here; read the article for a clue!), all contributed.
From the article:
When a backup server kicked-in, it also failed, unable to handle the accumulation of unprocessed events that had queued up since the main system's failure. Because the system failed silently, FirstEnergy's operators were unaware for over an hour that they were looking at outdated information on the status of their portion of the power grid, according to the November report.
How in the world did they manage to build a system nearly completely dependent upon computers, and yet not know when they lost not just one, but two computers that monitored the system?
Homer: Don't turn off the computer! Don't turn off the computer! Don't turn off the computer!
"Click"
XA21 is developed in Melbourne (Fl.)
And yep, it runs on major critical systems, including energy systems and satellites.
Lean on it in the slightest and it will crash and burn with little chance for recovery. Tibco even says they don't test their own software (lack of docs lowers their liability). Press them for test results and they will offer you to pay them to test for you.
When a backup server kicked-in, it also failed, unable to handle the accumulation of unprocessed events that had queued up since the main system's failure.
Sounds like classic Tibco.
I can tell you from working a couple of cases at GE Power Systems that a LOT of their coding is done in India, and that the teams they work with stateside are composed mostly of Indians on work visas and some naturalized Americans of Indian origin. Specifically, the guys I talked with were from Gousherott (sp?). Btw, this work wasn't outsourced; these were regular employees of GE, just on another continent.
Blaming the blackout on a software bug is a damn cop-out. The cause of the blackout was a horribly managed electrical grid that can barely keep up with current demand. Any major failure in the system can cause a cascading failure of an entire section of the grid. That is a horrible design. A software bug may have been the trigger, but it is by no means the true cause.
The grid in the Northeast US is supplied by horribly inefficient and antiquated power lines that were struggling to keep up thirty years ago. That they are still in use today is an outright crime. There's also the issue of the operators of the lines and generators trying to save a few bucks by cutting maintenance on equipment and facilities and cutting supervising staff down to skeleton crews. It is much easier to fit "software bug" into a sound bite, so the news media will stick with that. Unfortunately, the real cause of the blackout is never going to be patched, and another blackout is as inevitable as this last one was. I hope next time a few more people will have invested in backup generators or some alternate form of power to keep from losing their business during a blackout.
I'm a loner Dottie, a Rebel.
Posting anonymously for obvious reasons to me :)
Given my personal experience with this certain Fortune 5 company and software development as a whole, I am not surprised.
The bottom line is that there is soooo much software developed here by non-computer programmers. There are many great Engineers (Mechanical, Aerospace, etc.) here, yet very few can write good code. Many of them are asked to write code nonetheless and, thanks to the travesty that is Visual Basic and other Rapid Application Development tools, the code that is produced is extremely unmaintainable.
Then you have the matter of people moving jobs every 2 years and the poor bastard who has to maintain someone else's code gets lost inside of it.
Consider me very frustrated at the whole process.
I bet they had much wider safety margins built into the system, which prevented blackouts. But these safety margins probably cost money (I say this without knowing a thing about the electrical system); they probably mean a less efficient use of resources. So power companies buy GE's software. They don't buy it so that they can have an added measure of blackout prevention; they buy it because it enables them to cut out expensive/inefficient safety margins without (supposedly) sacrificing reliability. They do this to lower their cost of providing electricity to you.
Eat at Joe's.
To me, this report gives a good example of why a monolithic (monocultural) dispatching system is not a good idea. If every transaction were controlled by a single center, a single software bug could shut down the entire North American grid.
sPh
In all fairness...
The Mars Rover's software crashed in just a few days.
Virtually all software should be designed and tested better than it is.
However, I'm perplexed at why the Mars Rover failure and resurrection is considered a miracle of human ingenuity, rather than an indictment of crummy testing.
I'll not excuse the power grid software either; but it seems to work more reliably than the software on the Rover.
Well, I have news for you: 50MV lines don't exist! Not out in the open, anyway. Was it 50 kV, perchance?
The software handled one part of the electrical system involved.
What about a good Electrical/Mechanical/Civil Engineering solution that would have prevented it from cascading through different systems / electrical companies / countries?
One piece of software which didn't raise an alarm is shocking. The fact that it cascaded over such a wide area is simply mind blowing.
Before we talk about "software engineers" how about talking about "traditional engineers" and their role in this massive failure?
The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
After lots of years as a developer, I realized that the engineering process that goes into other professions (for example, civil engineering) can't be applied to software. The reason is simple: software is many orders of magnitude more complex. Software has many interdependencies between components, has many states, and is subject to change every minute. It's very difficult to see ahead and provide APIs that fit all the needs; that's why we go back and change the damn thing. What does a civil engineer have to do? He or she has to combine parts and test whether they hold together. There are a lot of parts, but the general principles are few and easily remembered... unlike software.
Furthermore, the tools we have for the job are inadequate. The programming languages are primitive. The debugging tools are dumb. The machines are not clever and strong enough to prove the mathematical theorems behind their programs. We don't even learn these things in college... we learn how to use programming languages, but we don't learn how to program... but I seriously believe we will never learn how to program, because a program's complexity increases tenfold for each line of code written!!!
...that presented itself in the AT&T software is told at the end of the chapter, repeated here for your convenience:
"As it happened, the problem itself - the problem per se - took this form. A piece of telco software had been written in C language, a standard language of the telco field. Within the C software was a long "do... while" construct. The "do... while" construct contained a "switch" statement. The "switch" statement contained an "if" clause. The "if" clause contained a "break." The "break" was supposed to "break" the "if" clause. Instead, the "break" broke the "switch" statement."
Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
I'm sure this was mentioned in the original blackout posts - since the Blaster virus was running full tilt at that time, there was an increased load on servers, routers, switches, hubs and blinky things that go whoop! whoop!! WHOOOOP! The increased demand on computing resources caused increased power demand (not to mention the cranked ACs at the homes of the poor IT staff who were staring at their blackberrys and sweating bullets) which in turn caused the alarm conditions which didn't get alarmed properly and so the powergrid went down. All because of an MS security hole.
How's that?
So the software didn't raise alarms as it should've. That's bad. But it seems to me that the software is being made a scapegoat here. It's much easier to blame "that #$@&@$ computer" than "FirstEnergy's failure to trim back trees encroaching on high-voltage power lines" or the fact that the infrastructure for the power grid is old and poorly set up, such that one failure can bring down the whole system. There's no reason why a failure in Ohio should black out New York, and there's nothing software can do to fix that.
Oh my god, we better stop outsourcing our precious programming jobs to Florida!
It is unpatriotic to move them from California, where they belong! I bet they pay the people in Florida a lot less.
(This is a joke)
The land beneath the lines was clear-cut about 12 years ago. But there are now trees under this line that are about 10 meters high.
Years ago when my wife was concerned about "power line emissions" the power company loaned her a meter that showed "electrical fields." I don't remember the scale, or even what it was supposed to measure, but I do remember that we had to actually get about 200 feet from the wire before the field from the line stopped affecting the meter. (Yes, on a humid summer day I once stood in my back yard with a neon bulb and caused it to illuminate by simply dangling a three foot wire from one lead and touching the other.) I had always assumed it was a 750kV line, and that the 100 foot easement was more than sufficient. Now, I wonder. Hey, maybe this is enough of an excuse to go out and get one of those IKE toys!
John
So what? You use a cell phone, don't you? The electrical energy exposure you get from that is substantially greater.
How about electric blankets or heating pads? How about a battery powered shaver?
You expose yourself to these fields every day to an extent far greater than what you may have received from that transmission line.
By the way, you can light a neon light with a bit of wire and very little power. You can also light it with a MW AM broadcast transmitter less than a mile away; you can light it with a CB radio; and with just a bit more wire, and a location closer to the poles of the earth, you can light it when the earth is hit by a solar flare. Many among the various eco-scare-monger groups like to make this demonstration as if it were an indicator of something dangerous. If it were, there would be no life anywhere near the Arctic Circle.
Aside from the poor maintenance of the clear-cut area, you really have no need to be concerned about this.
Nearly fifty percent of all graduates come from the bottom half of the class!
Had this been a Windows-based system, the torrent of comments about how fundamentally unreliable the OS and platform were would be huge.
Funny, just because this ships for "industrial strength" AIX / Solaris RISC systems (see specs on pg 8), I don't see any cheap, reflexive comments about the platform.
I guess the message here is that good or bad code can be written for any architecture.
I always treat watchdog software with just a bit of skepticism. The problem, as pointed out by NERC, was that a process in the system was somehow present, but not communicating well.
The alarm subsystem is often a separate process. It doesn't talk to the field. That's the job for other elements of the SCADA system. It was supposed to watch for semaphores, messages, or read shared memory somewhere. How do you watchdog something like that if it gets the message, but doesn't do what it's supposed to?
In a SCADA system near and dear to my career, we set alarm thresholds so low that the operators expect a certain amount of alarm traffic even for routine events. This helps to discover any misbehavior in the alarm system.
There is such a thing as a control center which is TOO quiet.
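For what it's worth, the "too quiet" check can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real SCADA product does it; the class name, method names and the quiet window are all made up for the example.

```python
class QuietAlarmDetector:
    """Flags an alarm feed as suspect when it has been silent for too long.

    Hypothetical helper: assumes routine alarms arrive often enough that a
    long silence is more likely a dead alarm process than a calm grid.
    """

    def __init__(self, max_quiet_seconds):
        self.max_quiet_seconds = max_quiet_seconds
        self.last_alarm_time = None  # no alarm seen yet

    def record_alarm(self, timestamp):
        # Call this for every alarm that reaches the operator, routine or not.
        self.last_alarm_time = timestamp

    def is_suspiciously_quiet(self, now):
        # Never having seen an alarm counts as suspect too.
        if self.last_alarm_time is None:
            return True
        return (now - self.last_alarm_time) > self.max_quiet_seconds
```

The point of the design is that the check lives outside the alarm process and watches its output, so it still fires when the alarm process is present but no longer doing its job.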
Nearly fifty percent of all graduates come from the bottom half of the class!
After looking at the original report, it looks more like the GE XA21 SCADA network failure was not the primary cause of the cascading failure but more an effect of it. The key failure seems to be a software system called the "State Estimator" (SE) that is used by the Midwest Independent System Operator (MISO), a NERC reliability coordinator, to develop optimal solutions for the planned operating level of all of the power generation and transmission equipment in the MISO area, covering about 10 midwest states and 1 million square miles. It is not described in much detail, but the SE seems to be an optimization tool using a linear programming model that gathers availability data for all of the major system components and load demand every five minutes and then calculates the 'optimal' use of those components to maintain system reliability at the required level. The 'solution' of the model is then used to plan the operation of the overall system by sending the target operating levels to each facility in the system.
So why did it fail? Two reasons. First, the model depends on having accurate availability information from each major system component. Status information is sent to MISO in Indiana by the "ECAR" data network or by direct links. On the day of the failure, the direct link to a key transmission line was not working and the analyst had turned off the estimator to troubleshoot it. After fixing the problem, he went to lunch and forgot to put the system back in automatic mode, where it would develop updated solutions. This situation lasted nearly two and a half hours, from 12:15 to 14:40. Second, when the estimator was switched back to automatic, it was unable to develop a solution because another key transmission line had overloaded and tripped and *its* new non-operational status was unknown to the model, apparently because the status of that line is assumed to be 'on' until told otherwise. This problem was not corrected until 16:04.
The bottom line is that a critical major planning tool was not available for nearly 4 hours for a regional generation and distribution system that absolutely required its use to be operated successfully when the system power supply was very close to the demand.
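To illustrate the "assumed 'on' until told otherwise" trap: a model input can carry an explicit UNKNOWN state plus an age limit, so stale data is never silently treated as good. A rough Python sketch follows; the names and the age limit are my own assumptions, and the report does not say how the real estimator represents line status.

```python
from enum import Enum


class LineStatus(Enum):
    IN_SERVICE = "in_service"
    TRIPPED = "tripped"
    UNKNOWN = "unknown"


def effective_status(last_report, report_time, now, max_age_seconds):
    """Return the status to feed the model, refusing to trust stale data.

    Hypothetical helper: last_report is a LineStatus or None, and the time
    arguments are plain seconds on any common clock.
    """
    # No report at all: say so instead of defaulting to IN_SERVICE.
    if last_report is None or report_time is None:
        return LineStatus.UNKNOWN
    # A report older than the limit is treated the same as no report.
    if now - report_time > max_age_seconds:
        return LineStatus.UNKNOWN
    return last_report
```

A solver that sees UNKNOWN can at least refuse to run or warn the analyst, rather than produce a plan built on a line that tripped an hour ago.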
The SCADA system itself did not fail, but its alarm function did, which provides alarms to control room operators about system operational problems. The problem with the alarm function seems to be a case of too many alarms for the system to handle as the problems multiplied. The software bug that they are now reporting was probably related to the unexpectedly large number of alarms that the system was experiencing. The new alarm inputs built up and then overflowed the process input buffers. The alarm system just stalled while processing an alarm event and the alarm function stopped. Then, at 14:41 the primary server hosting the alarm processing application failed due to some combination of the stalling of the alarm application and the queueing to the remote terminals. The hapless backup server then was automatically activated and everything was transferred to it, even the functional non-alarm stuff. The backup server failed after 13 minutes. Basically, the SCADA alarm system seems to have been massively overloaded (which shouldn't ever happen, of course) beyond the capability of the system design to cope with. The bug apparently prevented an indication that the alarm system was failing, but it looks like the cascading failure still would have occurred even if the software bug had not been present, because the system deterioration had progressed too far to recover by the time that the bug manifested itself.
The immediate cause of the failure seems to be the forgetfulness of the analyst who was operating the planning model. The significant underlying contributory cause seems to be a very poor regional operational design in which a critical centralized system planning tool was being used with insufficient backup and oversight. It looks as though both Unix and Windows escape blame. The SCADA system probably was doing far more than its designers intended and probably performed heroically until it died. 'Aye Captain...I canna do no more.'
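As an aside, the "overflowed silently" part is the piece that a few lines of code can guard against. Here is a toy Python sketch of a bounded alarm queue that counts what it drops and reports itself unhealthy; it is purely illustrative, with invented names, and says nothing about how XA21 is actually built.

```python
from collections import deque


class BoundedAlarmQueue:
    """Fixed-capacity event queue that makes overflow visible.

    Hypothetical sketch: a full queue rejects new events and counts them,
    so a health check can tell operators the alarm path is overloaded.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def push(self, event):
        # Reject and count instead of blocking or overwriting memory.
        if len(self.events) >= self.capacity:
            self.dropped += 1
            return False
        self.events.append(event)
        return True

    def pop(self):
        # Oldest event first; None when the queue is empty.
        return self.events.popleft() if self.events else None

    def is_healthy(self):
        # Any dropped event means operators may be missing alarms.
        return self.dropped == 0
```

Dropping alarms is still bad, but an operator who is told "the alarm queue is dropping events" is in a very different position from one staring at a screen that quietly stopped updating.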