Blackout Cause: Buggy Code

Posted by michael on Thursday February 12, 2004 @01:31AM from the civilization-meets-the-woodpecker dept.

blanca writes "The big northeast blackout from last summer was caused in part by a software bug in an energy management system sold by General Electric, according to a story on SecurityFocus. The bug meant that a computerized alarm that should have been triggered never went off, hindering FirstEnergy's response to the train of events that led to the cascading blackout. Investigators found the bug in an intensive code audit following the outage, and a patch is now available."

12 of 377 comments

Bad bugs by Rico_za · 2004-02-12 01:42 · Score: 5, Informative

Chalk up another one for the most disastrous software bugs in history. This one should give the Ariane 5 explosion a go for No. 1.
Re:GE Outsourcing To India by cassidyc · 2004-02-12 01:49 · Score: 5, Informative

That might be the case except that XA 21 is developed in Melbourne (Fl.)

facts before hysteria thanks
Re:This is unacceptable by cassidyc · 2004-02-12 01:52 · Score: 5, Informative

If you read the article and other associated articles, you will realise that this bug did not *cause* the blackout; on its own, this bug would have had no effect on the continued power supply. However, the timing of the bug along with a number of other issues (which I won't repeat here, read the article for a clue!) all contributed.
Re:speaking of outsourcing... by cassidyc · 2004-02-12 01:55 · Score: 4, Informative

XA21 is developed in Melbourne (Fl.)
Re:speaking of outsourcing... by Anonymous Coward · 2004-02-12 01:56 · Score: 4, Informative

I can tell you from working a couple of cases at GE Power Systems that a LOT of their coding is done in India, and that the teams they work with stateside are comprised mostly of Indians on work visas and some naturalized Americans of Indian origin. Specifically, the guys I talked with were from Gousherott (sp?). Btw this work wasn't outsourced; these were regular employees of GE, just on another continent.
Re:Uh... by UnknowingFool · 2004-02-12 02:36 · Score: 4, Informative

An initial cause has always been that the alarm did not sound when the problem occurred; however, FirstEnergy was also blamed because even though there was no alarm, the operators should have seen the problem because the instrumentation display indicated that there was a dangerous surge.

--
Well, there's spam egg sausage and spam, that's not got much spam in it.
Re:Development vs Engineering by Kombat · 2004-02-12 02:38 · Score: 4, Informative

It was in Canada. In Canada, "Engineer" is a protected term, like "Doctor." I can't take a 6-month IT course and call myself a "Network Doctor," and put the title "Dr. Kevin" on my business cards. It's the same thing with "Engineer" in Canada (and "Architect", too, interestingly enough).

There is only one university in Canada that is actually allowed to graduate "Software Engineers," and it's in Newfoundland (MUN). Other universities are not allowed to call their grads "Engineers" unless they follow the strict curriculum requirements of the main engineering authority in Canada, whose name escapes me at the moment.

This is all second-hand info, spoken as a guy who's married to a genuine, certified Engineer (Industrial). :)

--
Like woodworking? Build your own picture frames.
Re:Development vs Engineering by superflex · 2004-02-12 03:17 · Score: 4, Informative

Universities in Canada must have their curriculum certified by the Canadian Engineering Accreditation Board, the national body for regulating engineering education.
Furthermore, each province has a regulatory body that manages the licensing of Professional Engineers (P.Eng.), which is a regulated designation. In Ontario this body is the PEO. They have a webpage here on the whole "software engineering" issue.

--
sigs are for suckers
Re:More Reliable than Mars Rover by Ken D · 2004-02-12 03:57 · Score: 4, Informative

The Rover did not crash in "just a few days". The Rover crashed after the number of files in its flash filesystem accumulated to the point where the file table couldn't fit in the available memory anymore. This took 6 months of file accumulation to occur.
Re:Electrical Field Exposure? by AB3A · 2004-02-12 04:30 · Score: 5, Informative

So what? You use a cell phone, don't you? The electrical energy exposure you get from that is substantially greater.

How about electric blankets or heating pads? How about a battery powered shaver?

You expose yourself to these fields every day to an extent far greater than what you may have received from that transmission line.

By the way, you can light a neon light with a bit of wire and very little power. You can also light it with a MW AM broadcast transmitter less than a mile away; you can light it with a CB radio; and with just a bit more wire, and a location closer to the poles of the earth, you can light it when the earth is hit by a solar flare. Many among the various eco-scare-monger groups like to make this demonstration as if it were an indicator of something dangerous. If it were, there would be no life anywhere near the Arctic Circle.

Aside from the poor maintenance of the clear-cut area, you really have no need to be concerned about this.

--
Nearly fifty percent of all graduates come from the bottom half of the class!
Re:Yeah, right. by per unit analyzer · 2004-02-12 04:35 · Score: 4, Informative

> Well, I have news for you: 50MV lines don't exist! Not out in the open, anyway. Was it 50 kV, perchance?

>>nope, MV... though it may have been 45MV...

The first guy is right; there is no such thing as a 45 MV transmission line. The highest voltage transmission line classification is 765 kV. (That would be 0.765 MV.) In the mid-1970s American Electric Power and Ohio Brass played with some experimental 1.5 MV transmission equipment, but they killed the project when they realized landowners would never let AEP put a 1.5 MV line in their back yards.

The lines that FirstEnergy put in the trees were 345 kV. I'm guessing they were rated to carry between 1000 and 1500 MVA. I have no idea where the 45 number came from or what unit would have been associated with it.
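
For anyone who wants to check the units, here is a rough sketch of the arithmetic. It assumes a balanced three-phase line and the usual relation S = sqrt(3) * V * I (line-to-line voltage); the function names are made up for illustration, and the 1000 to 1500 MVA range is only the guess given above.

```python
import math

def kv_to_mv(kilovolts):
    """Convert kilovolts to megavolts."""
    return kilovolts / 1000.0

def line_current_amps(apparent_power_mva, line_voltage_kv):
    """Per-phase current of a balanced three-phase line.

    Uses S = sqrt(3) * V_LL * I, so I = S / (sqrt(3) * V_LL).
    """
    watts = apparent_power_mva * 1e6
    volts = line_voltage_kv * 1e3
    return watts / (math.sqrt(3) * volts)

if __name__ == "__main__":
    print("765 kV in MV:", kv_to_mv(765))
    # Guessed rating range for a 345 kV line (an assumption, not a spec)
    for mva in (1000, 1500):
        amps = line_current_amps(mva, 345)
        print(f"{mva} MVA at 345 kV is roughly {amps:.0f} A per phase")
```

At those ratings the current comes out in the low thousands of amps per phase, which is why the voltage is in the hundreds of kV and nowhere near tens of MV.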

--zawada

--
In Soviet Russia, the Beowulf cluster imagines you!
The alarm bug contributed but was not the cause by dtjohnson · 2004-02-12 04:58 · Score: 5, Informative

After looking at the original report, it looks more like the GE XA21 SCADA network failure was not the primary cause of the cascading failure but more an effect of it. The key failure seems to be a software system called the "State Estimator" (SE) that is used by the Midwest Independent System Operator (MISO), a NERC reliability coordinator, to develop optimal solutions for the planned operating level of all of the power generation and transmission equipment in the MISO area, covering about 10 Midwest states and 1 million square miles. It is not described in much detail, but the SE seems to be an optimization tool using a linear programming model that gathers availability data for all of the major system components and load demand every five minutes and then calculates the 'optimal' use of those system components to maintain system reliability at the required level. The 'solution' of the model is then used to plan the operation of the overall system by sending the target operating levels to each facility in the system.

So why did it fail? Two reasons. First, the model depends on having accurate availability information from each major system component. Status information is sent to MISO in Indiana by the "ECAR" data network or by direct links. On the day of the failure, the direct link to a key transmission line was not working and the analyst had turned off the estimator to troubleshoot it. After fixing the problem, he went to lunch and forgot to put the system back in automatic mode where it would develop updated solutions. This situation existed for nearly two and a half hours, from 12:15 to 14:40. Second, when the estimator was switched back to automatic, it was unable to develop a solution because another key transmission line had overloaded and tripped and *its* new non-operational status was unknown to the model, apparently because the status of that line is assumed to be 'on' until told otherwise. This problem was not corrected until 16:04.

The bottom line is that a critical major planning tool was not available for nearly 4 hours for a regional generation and distribution system that absolutely required its use to be operated successfully when the system power supply was very close to the demand.
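
To make the "assumed 'on' until told otherwise" problem concrete, here is a toy sketch. It is not how MISO's estimator works (the report does not say); it only illustrates the general idea of treating stale telemetry as unknown instead of silently trusting the last value. The class, function, and line names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LineStatus:
    in_service: bool       # last reported state of the line
    last_update_min: int   # minutes since midnight of the last report

def effective_status(status, now_min, max_age_min=5):
    """Return 'on', 'off', or 'unknown'.

    A report older than max_age_min is treated as 'unknown' rather
    than being trusted, so a tripped line with a dead data link does
    not keep looking healthy to the model.
    """
    if now_min - status.last_update_min > max_age_min:
        return "unknown"
    return "on" if status.in_service else "off"

if __name__ == "__main__":
    now = 14 * 60 + 40  # 14:40
    lines = {
        "line_a": LineStatus(in_service=True, last_update_min=14 * 60 + 38),
        "line_b": LineStatus(in_service=True, last_update_min=12 * 60 + 10),
    }
    for name, status in lines.items():
        print(name, effective_status(status, now))
```

A model fed 'unknown' can refuse to produce a solution or raise a flag for the operator, which is arguably better than solving against a network that no longer exists.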

The SCADA system itself did not fail, but its alarm function, which alerts control room operators to system operational problems, did. The problem with the alarm function seems to be a case of too many alarms for the system to handle as the problems multiplied. The software bug that they are now reporting was probably related to the unexpectedly large number of alarms that the system was experiencing. The new alarm inputs built up and then overflowed the process input buffers. The alarm system just stalled while processing an alarm event and the alarm function stopped. Then, at 14:41, the primary server hosting the alarm processing application failed due to some combination of the stalling of the alarm application and the queueing to the remote terminals. The hapless backup server was then automatically activated and everything was transferred to it, even the functional non-alarm stuff. The backup server failed after 13 minutes. Basically, the SCADA alarm system seems to have been massively overloaded (which shouldn't ever happen, of course) beyond the capability of the system design to cope with. The bug apparently prevented an indication that the alarm system was failing, but it looks like the cascading failure still would have occurred even if the software bug had not been present, because the system deterioration had progressed too far to recover by the time that the bug manifested itself.
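
As an illustration only (this is not XA21's design, which I have not seen), here is a minimal sketch of an alarm queue that at least knows when it is in trouble: the buffer is bounded, dropped alarms are counted, and a simple health check reports a stall when alarms are waiting but none has been processed recently. All names and the tick-based clock are assumptions made for the example.

```python
from collections import deque

class AlarmProcessor:
    def __init__(self, capacity):
        self.queue = deque()
        self.capacity = capacity
        self.dropped = 0         # alarms rejected because the buffer was full
        self.last_progress = 0   # logical tick when an alarm was last processed

    def submit(self, alarm):
        """Queue an alarm; count it as dropped if the buffer is full."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1
            return False
        self.queue.append(alarm)
        return True

    def process_one(self, tick):
        """Process the oldest queued alarm and record the progress tick."""
        if not self.queue:
            return None
        alarm = self.queue.popleft()
        self.last_progress = tick
        return alarm

    def is_healthy(self, tick, max_idle_ticks):
        """False if alarms were dropped or the queue has stalled."""
        stalled = bool(self.queue) and (tick - self.last_progress > max_idle_ticks)
        return not stalled and self.dropped == 0

if __name__ == "__main__":
    proc = AlarmProcessor(capacity=3)
    for i in range(5):
        proc.submit(f"alarm-{i}")
    print("dropped:", proc.dropped)
    print("healthy at tick 10:", proc.is_healthy(tick=10, max_idle_ticks=5))
```

The point is the last method: whatever else goes wrong, something outside the alarm path should be able to tell the operators that the alarms themselves have stopped.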

The immediate cause of the failure seems to be the forgetfulness of the analyst who was operating the planning model. The significant underlying contributory cause seems to be a very poor regional operational design in which a critical centralized system planning tool was being used with insufficient backup and oversight. It looks as though both Unix and Windows escape blame. The SCADA system probably was doing far more than its designers intended and probably performed heroically until it died. 'Aye Captain...I canna do no more.'