Blackout Cause: Buggy Code
blanca writes "The big northeast blackout from last summer was caused in part by a software bug in an energy management system sold by General Electric, according to a story on SecurityFocus. The bug meant that a computerized alarm that should have been triggered never went off, hindering FirstEnergy's response to the train of events that led to the cascading blackout. Investigators found the bug in an intensive code audit following the outage, and a patch is now available."
... when you outsource to the lowest bidder?
I've said enough.
The term 'Software Engineering' is bandied about in the software industry, but I think little of what happens could be called engineering. Software is developed. It doesn't meet the strict standards of testing and reliability applied to physical products.
I am a software developer, not an engineer, as are most people in the field. Software won't become an engineering science until companies are willing to pay for that process. Given the current trend towards cost cutting, I don't see that happening anytime soon.
As x approaches total apathy I couldn't care less.
Just a question for everyone here:
Who thinks this could have been any better with Open Source and why?
People make the comment of the many eyes, but who is really looking at the code?
Curiosity was framed; ignorance killed the cat. -- Author unknown
"Things are so complicated, we don't know that a small event, or series of small events won't bring down the whole system"
Yeah, well I don't know that I won't be fired tomorrow for reading Slashdot at work, but that doesn't mean that I will.
I want to delete my account but Slashdot doesn't allow it.
Now if in fact this was buggy code, and if Software Engineers are in fact part of the engineering profession, then a professional body should be taking the engineer(s) to task. This would be the same thing that would take place in the event that a civil engineer signed off on faulty building plans. But smart money says no software "engineer" will get nailed.
A look at the software industry will show this to be the norm. And that is why there is such a problem with having people claiming the title of "software engineer". "Engineer" doesn't just mean having the technical savvy, it also means having a responsibility to the public for the use of that knowledge and being beholden to a professional body charged with ensuring you are held accountable.
"Consensus" in science is _always_ a political construct.
And yep, it runs on major critical systems, including energy systems and satellites.
Lean on it in the slightest and it will crash and burn with little chance for recovery. Tibco even says they don't test their own software (lack of docs lowers their liability). Press them for test results and they will offer to let you pay them to test it for you.
When a backup server kicked-in, it also failed, unable to handle the accumulation of unprocessed events that had queued up since the main system's failure.
Sounds like classic Tibco.
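For what it's worth, here is a minimal Python sketch of one way to keep a failover server from inheriting an unbounded backlog like the one in the quote above. Nothing here reflects how XA/21 or any Tibco product actually works; `BoundedEventQueue`, its capacity, and the drop-oldest policy are hypothetical choices made only for illustration.

```python
from collections import deque


class BoundedEventQueue:
    """Hypothetical event queue that sheds load instead of growing forever."""

    def __init__(self, capacity):
        # A deque with maxlen silently evicts the oldest item when full.
        self.events = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, event):
        if len(self.events) == self.events.maxlen:
            # Count the event that is about to be evicted.
            self.dropped += 1
        self.events.append(event)

    def overloaded(self):
        # Any dropped event is something operators should be told about.
        return self.dropped > 0


queue = BoundedEventQueue(capacity=3)
for n in range(5):
    queue.push(f"event-{n}")
```

Dropping the oldest events is only one possible policy. The point of the sketch is that the queue stays a fixed size and that `overloaded()` gives the system something to raise an alarm on, rather than failing silently.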
Blaming the blackout on a software bug is a damn cop-out. The cause of the blackout was a horribly managed electrical grid that can barely keep up with the current demand. Any major failure in the system can cause a cascading failure of the entire section of the grid. That is a horrible design. A software bug may have been the trigger but it is by no means the true cause.
The grid in the Northeast US is supplied by horribly inefficient and antiquated power lines that were struggling to keep up thirty years ago. That they are still in use today is an outright crime. There's also the issue of the operators of the lines and generators trying to save a few bucks by cutting maintenance on equipment and facilities and cutting supervising staffs down to skeleton crews. It is much easier to fit "software bug" into a sound bite, so the news media will stick with that. Unfortunately the real cause of the blackout is never going to be patched, and another blackout is as inevitable as this last one was. I hope next time a few more people will have invested in backup generators or some alternate form of power to keep from losing their business during a blackout.
I'm a loner Dottie, a Rebel.
If this isn't a call to take a closer look at the possibility of more widely using tools like Z and B to develop important software, I don't know what is.
Yes, they're difficult. Yes, they aren't likely to eliminate all bugs. BUT. They provide a much better chance (as I understand it - I'm not an expert) that what is designed is what actually gets implemented. That shifts the burden onto the design, but that's OK - that burden was always there. It just means that the design gets properly implemented, which is all that can reasonably be asked of the coding process.
Currently, again as I understand it, the life of a software program in development is a constant struggle by the developers to cope with ever changing demands of customers. I think if people want matters to improve the customers are going to have to come to grips with reality, take the time to sit down and think things through, and make all critical design decisions BEFORE the development process begins.
More expensive up front? You bet. That's why I think companies should look at a cooperative effort for this type of thing. Distribute the cost of developing one really good program across an industry. A lot of the same core functionality can likely be shared between businesses - if they all pay for one proper design and implementation of an open program up front, and they all get copies of the logic and proof code with rights to extend as they see fit, they all benefit. They can also open up the more general parts of the package to the world at large under GPL, and anyone could contribute who can generate valid B and Z designs/proofs. Sort of an "academic" open source code development forum - peer review and all. The companies get the benefit of all new development - if they are using it internally they can extend the GPL code for themselves, so long as they don't distribute it. If they do distribute it, they can do so under GPL for everyone to enhance. A plugin based model can also allow them to develop components to the system they can sell as commercial software, if they wish.
Whether this would work/appeal with corporate thinking I have no idea - many of those folks seem to view cooperation like the plague. But it might allow a higher grade of software to be developed and universally used, and I have a hard time imagining how that could be a bad thing for anyone.
"I object to doing things that computers can do." -- Olin Shivers, lispers.org
Posting anonymously for reasons that are obvious to me :)
Given my personal experience with this certain Fortune 5 company and software development as a whole, I am not surprised.
The bottom line is that there is soooo much software developed here by non-computer programmers. There are many great Engineers (Mechanical, Aerospace, etc.) here, yet very few can write good code. Many of them are asked to write code nonetheless and thanks to the travesty that is Visual Basic and other Rapid Application Development tools the code that is produced is extremely un-maintainable.
Then you have the matter of people moving jobs every 2 years and the poor bastard who has to maintain someone else's code gets lost inside of it.
Consider me very frustrated at the whole process.
and yes, there is no reason that a 12" tree should be anywhere CLOSE to a 50 MV line.
Rather, there is no reason that a 50 MV line should be anywhere close to a 12" tree.
To me, this report gives a good example of why a monolithic (monocultural) dispatching system is not a good idea. If every transaction were controlled by a single control center, a single software bug could shut down the entire North American grid.
sPh
This is informative why??? mention of some of your friends who have nothing to do with XA21?
And some random comments on GE selling crypto hardware....
where's the connection??
Clues please?
In all fairness...
The Mars Rover's software crashed in just a few days.
Virtually all software should be designed and tested better than it is.
However, I'm perplexed at why the Mars Rover failure and resurrection is considered a miracle of human ingenuity, rather than an indictment of crummy testing.
I'll not excuse the power grid software either; but it seems to work more reliably than the software on the Rover.
According to the SecurityFocus article, the operators had no way of knowing, because the data wasn't "live." This is a common problem with SCADA systems--the systems will display the "last known-good value" if something goes offline. However, the system should also visibly identify the data as "out of service" or "offline," and this didn't seem to happen. That could be an issue at the server, or it could be something blamed on the people commissioning the XA/21 system (assuming the display is configurable enough to allow you to program it at this level).
Even so, there should have been sufficient watchdog messages between the client, the server, and the field hardware for the XA/21 to broadcast a general alarm along the lines of "I can't talk to the stinking field, so we're all flying blind here, you morons!" This is exactly the same as software in my industry (HVAC fire/security systems for large buildings), where if you lose communication to a subsystem or the field, you have to raise alarms all over the place.
The real question is how such comm could be lost while the operators had no visible indication that they were relying on old data. This sounds like a missed requirement, if not insufficient testing.
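To make the idea concrete, here is a rough Python sketch of the two safeguards described above: tagging a last known-good value as stale on the display, and raising a general alarm when nothing fresh is arriving from the field. This is only an illustration and not how XA/21 works; `PointValue`, `comm_alarm`, the `max_age_s` threshold, and the message text are all made-up names and values.

```python
import time


class PointValue:
    """Hypothetical telemetry point that remembers when it was last updated."""

    def __init__(self, name, max_age_s):
        self.name = name
        self.max_age_s = max_age_s
        self.value = None
        self.updated_at = None

    def update(self, value, now=None):
        self.value = value
        self.updated_at = time.monotonic() if now is None else now

    def is_stale(self, now=None):
        now = time.monotonic() if now is None else now
        return self.updated_at is None or now - self.updated_at > self.max_age_s

    def display(self, now=None):
        # Show the last known-good value, but never without a stale marker.
        tag = " [STALE]" if self.is_stale(now) else ""
        return f"{self.name}={self.value}{tag}"


def comm_alarm(points, now=None):
    """Return a general alarm message if every point has gone stale."""
    if points and all(p.is_stale(now) for p in points):
        return "ALARM: no fresh data from the field"
    return None
```

With a point created as `PointValue("line_mw", max_age_s=10)` and updated at time 0, `display(now=5)` shows the plain value while `display(now=20)` appends the stale tag, and `comm_alarm` fires once every point is in that state. The `now` parameter exists only so the behavior can be checked without waiting on a real clock.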
Tim
So the software didn't raise alarms as it should've. That's bad. But it seems to me that the software is being made a scapegoat here. It's much easier to blame "that #$@&@$ computer" than "FirstEnergy's failure to trim back trees encroaching on high-voltage power lines" or the fact that the infrastructure for the power grid is old and so poorly set up that one failure can bring down the whole system. There's no reason why a failure in Ohio should black out New York and there's nothing software can do to fix that.
There is also the fact that consumer use of electricity is growing faster than the infrastructure to support it. If you can squeeze an additional 10-20% transmission capacity by more efficient use of existing facilities, then you can hold on until new infrastructure is built.
Had this been a Windows-based system, the torrent of comments about how fundamentally unreliable the OS and platform were would be huge.
Funny, just because this ships for "industrial strength" AIX / Solaris RISC systems (see specs on pg 8), I don't see any cheap, reflexive comments about the platform.
I guess the message here is that good or bad code can be written for any architecture.
Water accelerates the growth of a plant, but it doesn't cause the plant to be. The seed did that.