Slashdot Mirror


Blackout Cause: Buggy Code

blanca writes "The big northeast blackout from last summer was caused in part by a software bug in an energy management system sold by General Electric, according to a story on SecurityFocus. The bug meant that a computerized alarm that should have been triggered never went off, hindering FirstEnergy's response to the train of events that led to the cascading blackout. Investigators found the bug in an intensive code audit following the outage, and a patch is now available."

29 of 377 comments

  1. Re:Patch Available by will_die · · Score: 2, Informative

    But they did not provide a site to get it from.

  2. Bad bugs by Rico_za · · Score: 5, Informative

    Chalk up another one for the most disastrous software bugs in history. This one should give the Ariane 5 explosion a go for No. 1.

  3. Re:GE Outsourcing To India by cassidyc · · Score: 5, Informative

    That might be the case except that XA/21 is developed in Melbourne (Fl.)

    facts before hysteria thanks

  4. Re:Another opinion: maybe Blaster is to blame by YU+Nicks+NE+Way · · Score: 3, Informative

    Did you read the Security Focus article? It explicitly stated both that Blaster was not related to the blackout and that SF had been one of the first publications to extend the hypothesis that they had been related.

    In short, the Microsoft bashers were wrong -- and at least Security Focus had the guts to acknowledge it.

  5. Re:This is unacceptable by cassidyc · · Score: 5, Informative

    If you read the article and other associated articles, you will realise that this bug did not *cause* the blackout; on its own, this bug would have had no effect on the continued power supply. However, the timing of the bug along with a number of other issues (which I won't repeat here, read the article for a clue!) all contributed.

  6. Visual Basic by Shanep · · Score: 1, Informative

    Some of my friends were software developers at General Electric years ago (admittedly doing Wintel desktop software).

    I'm too tired to read the article, but I will say this, everything they did, they did in VB.

    I know GE has also sold US approved crypto hardware to other countries, gear which was found to have back doors or known weaknesses that have allowed the US to eavesdrop on their supposed "friends" with ease.

    Maybe they should stick to designing jet engines and toasters.

    --
    War crimes, torture, lies, illegal spying... Would someone give Bush a blowjob, already, so he can be impeached?
  7. Re:speaking of outsourcing... by cassidyc · · Score: 4, Informative

    XA21 is developed in Melbourne (Fl.)

  8. Re:speaking of outsourcing... by Anonymous Coward · · Score: 4, Informative

    I can tell you from working a couple of cases at GE Power Systems that a LOT of their coding is done in India, and that the teams they work with stateside are comprised mostly of Indians on work visas and some naturalized Americans of Indian origin. Specifically the guys I talked with were from Gousherott (sp?). Btw this work wasn't outsourced, these were regular employees of GE, just on another continent.

  9. SCADA is really neato... by Anonymous Coward · · Score: 2, Informative
    SCADA (supervisory control and data acquisition) is a class of control system which can be used to control and monitor small things; it is not just in use in the power industry managing high-tension wires, but is also used to control conveyor belts in manufacturing facilities, or even automatic doors on trains. All of that stuff has code around it, one way or another, and every so often bugs do appear.

    No-one writes flawless code, not Sun, not IBM, and not even Linus or Alan Cox or Larry Wall. Anything that is controlled by code is bound to break, but that is why there are humans around and ways to override systems.

    Regardless, First Energy had many, many ways to know something was up (whether it was MISO calling them, or the general disruption they had before it could cascade) but they refused to take the necessary actions and close themselves off from the grid.

  10. Re:A patch is now available by mstyne · · Score: 1, Informative

    Right here.

    --
    mstyne: real name, no gimmicks
  11. Apparently, not DCOM/OPC related by TimTheFoolMan · · Score: 2, Informative

    Based on the PDF for the XA/21 system, it sounds like this wasn't related to some of the DCOM/OPC issues many (myself included) were speculating about. Though it's a SCADA control system (where Windows is common, though not universal), it's running on AIX (IBM or Motorola) or Solaris.

    Interestingly enough, the sales literature describes it as having, "[an] established track record of field performance - over one million hours of online operation."

    I wonder if they'll revise the brochure now?

    Tim

  12. Re:Uh... by UnknowingFool · · Score: 4, Informative

    An initial cause has always been that the alarm did not sound when the problem occurred; however, First Energy was also blamed because even though there was no alarm, the operators should have seen the problem because the instrumentation display indicated that there was a dangerous surge.

    --
    Well, there's spam egg sausage and spam, that's not got much spam in it.
  13. Re:Development vs Engineering by Kombat · · Score: 4, Informative

    It was in Canada. In Canada, "Engineer" is a protected term, like "Doctor." I can't take a 6-month IT course and call myself a "Network Doctor," and put the title "Dr. Kevin" on my business cards. It's the same thing with "Engineer" in Canada (and "Architect", too, interestingly enough).

    There is only one university in Canada that is actually allowed to graduate "Software Engineers," and it's in Newfoundland (MUN). Other universities are not allowed to call their grads "Engineers" unless they follow the strict curriculum requirements of the main engineering authority in Canada, whose name escapes me at the moment.

    This is all second-hand info, spoken as a guy who's married to a genuine, certified Engineer (Industrial). :)

    --
    Like woodworking? Build your own picture frames.
  14. Re:Development vs Engineering by bpfinn · · Score: 2, Informative
    Is it true that some states have prohibited Microsoft from issuing MCSEs? I heard this somewhere but I can't remember. Something about Microsoft not having the authority to certify engineers.

    In Texas, you can't legally call yourself an Engineer until you've passed the Professional Engineering examination. I haven't heard of anyone in Texas who had to stop calling themselves an MCSE, however.

  15. Re:Development vs Engineering by Troed · · Score: 3, Informative

    Agreed. I'm both a Mechanical Engineer and a Software Engineer, and I work as a consultant in embedded software development. The embedded sector is WAY ahead of "desktop programming" when it comes to strict requirements and processes, and yet not even that is close to being a true engineering discipline.

    I've actually concluded myself that software development _can never_ become an engineering discipline, it's too creative a process for that. A software developer is more an artist than an engineer.

    Really.

  16. History repeats itself... by thrill12 · · Score: 2, Informative

    as described in the excellent work by Bruce Sterling, "The Hacker Crackdown" (which everyone probably read): the blackout of the AT&T telephone switching system in 1990 also occurred because of a software error.
    What happened then (accusing hackers of being responsible) is happening again: people pointing to external factors as the culprit.

    When will people start to learn from past mistakes and realize that instead of accusing people, they can better spend their time on software audits?

    --
    Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
  17. Re:Development vs Engineering by superflex · · Score: 4, Informative

    Universities in Canada must have their curriculum certified by the Canadian Engineering Accreditation Board, the national body for regulating engineering education.
    Furthermore, each province has a regulatory body which manages licensing of Professional Engineers (P.Eng.'s) which is a regulated designation. In Ontario this body is the PEO. They have a webpage here on the whole "software engineering" issue.

    --
    sigs are for suckers
  18. By the way, the actual bug... by thrill12 · · Score: 3, Informative

    ...that presented itself in the AT&T software is told at the end of the chapter, repeated here for your convenience:
    "As it happened, the problem itself - the problem per se - took this form. A piece of telco software had been written in C language, a standard language of the telco field. Within the C software was a long "do... while" construct. The "do... while" construct contained a "switch" statement. The "switch" statement contained an "if" clause. The "if" clause contained a "break." The "break" was supposed to "break" the "if" clause. Instead, the "break" broke the "switch" statement."

    --
    Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
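
    To make the quoted description concrete, here is a minimal C sketch of that control-flow trap. It is not the AT&T code: the message kinds, the `buffer_busy` flag and the `steps_run` function are all hypothetical, chosen only to show that a `break` inside an `if` inside a `switch` leaves the `switch`, not the `if`, while the enclosing `do ... while` keeps running.

```c
/* Hypothetical message kinds; not taken from any real telco source. */
enum { MSG_STATUS = 1, MSG_OTHER = 2 };

/*
 * Runs the do...while / switch / if nesting `passes` times (at least once)
 * and returns how many of the follow-up steps inside the case executed.
 */
int steps_run(int kind, int buffer_busy, int passes)
{
    int steps = 0;

    do {
        switch (kind) {
        case MSG_STATUS:
            if (buffer_busy) {
                /* Meant to skip only the "if"; in C it exits the switch. */
                break;
            }
            steps++;    /* follow-up step A: never reached when busy */
            steps++;    /* follow-up step B: never reached when busy */
            break;
        default:
            break;
        }
        /* Execution resumes here after either break; the loop continues. */
    } while (--passes > 0);

    return steps;
}
```

    With three passes, `steps_run(MSG_STATUS, 0, 3)` returns 6 but `steps_run(MSG_STATUS, 1, 3)` returns 0: every time the flag is set, both follow-up steps are silently skipped.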
  19. Re:Hmm by westlake · · Score: 2, Informative
    Once upon a time, there was a power grid without any software. This is true because electricity predates computers. What did they do then?

    They accepted more frequent but more localized power outages. Rural electric service didn't become available in our area until 1926 and four decades later you could still safely predict it would go down in a storm.

  20. Re:More Reliable than Mars Rover by Ken+D · · Score: 4, Informative

    The Rover did not crash in "just a few days". The Rover crashed after the number of files in its flash filesystem accumulated to the point where the file table couldn't fit in the available memory anymore. This took 6 months of file accumulation to occur.

  21. One more event to add to Engineering 101 disasters by wetshoe · · Score: 2, Informative
    I remember Engineering 101 my first semester in college. It was a general introduction to engineering for the entire engineering school.

    Part of the class was dedicated to ensuring that we learned from the mistakes of the past. They showed us the video of the infamous Tacoma Narrows bridge, and several other engineering mishaps. I was a computer science major and most, if not all, of the examples shown in the class, as far as I can remember, were engineering mishaps. I think this is a great example that can now be added to the list of infamous engineering slip-ups. This is a particularly good example for computer science majors: it shows that yes, you really do need good testing, and yes, major disasters can be caused by as little as one line of bad code.

    I always wondered why we CS majors had to sit through that class, but here's a great example why.

  22. Try something new by sleepingsquirrel · · Score: 2, Informative

    ...If programming is so complex, then why don't we try something new? You want a program without state? Try Haskell. You want to be able to prove something about your program? Try ML. But don't despair, I think the reason for crummy software is that it hasn't been around for that long. Civil engineers have had the hindsight of building roads, and aqueducts, and buildings for thousands of years. Software has been around for what, 2 generations?

  23. Re:More Reliable than Mars Rover by ed1park · · Score: 3, Informative

    Your opinion comes from a "glass half full/half empty" perspective, which you can't really address.

    What you should be asking is why is it so difficult to write bug free code? The obvious answer is because developing and testing code is harder than you realize. A simple if statement looping 10 times will have over 1000 different code paths that you would need to test if you wanted to be thorough. So a large software project makes this kind of testing impossible.

    What people try to do instead is use Pareto's 80/20 rule. Basically, you try and focus on a few modules that generate the majority of bugs. There are many other methods of testing, but none are 100% and any significant project will have errors. Unfortunate, but a fact of life. People are not perfect.
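
    The "over 1000 code paths" figure can be checked with a short C sketch. It assumes the simplest reading of the claim: one two-way `if` evaluated on each of 10 loop passes, with every taken/not-taken combination counted as a separate path. The function name `count_paths` is made up for this illustration.

```c
/*
 * Counts distinct execution paths through a loop that runs `iterations`
 * times with one two-way branch per pass, by walking every
 * taken / not-taken combination recursively.
 */
unsigned long count_paths(unsigned iterations)
{
    if (iterations == 0)
        return 1;   /* reached the end of the loop: one complete path */

    /* Paths where this pass takes the branch, plus paths where it does not. */
    return count_paths(iterations - 1) + count_paths(iterations - 1);
}
```

    `count_paths(10)` returns 1024, i.e. 2 to the 10th power, and each extra pass doubles it, which is why exhaustive path testing stops being practical so quickly.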

  24. Re:Electrical Field Exposure? by AB3A · · Score: 5, Informative

    So what? You use a cell phone, don't you? The electrical energy exposure you get from that is substantially greater.

    How about electric blankets or heating pads? How about a battery powered shaver?

    You expose yourself to these fields every day to an extent far greater than what you may have received from that transmission line.

    By the way, you can light a neon light with a bit of wire and very little power. You can also light it with a MW AM broadcast transmitter less than a mile away; you can light it with a CB radio; and with just a bit more wire, and a location closer to the poles of the earth, you can light it when the earth is hit by a solar flare. Many among the various eco-scare-monger groups like to make this demonstration as if it were an indicator of something dangerous. If it were, there would be no life anywhere near the Arctic Circle.

    Aside from the poor maintenance of the clear-cut area, you really have no need to be concerned about this.

    --
    Nearly fifty percent of all graduates come from the bottom half of the class!
  25. Re:Yeah, right. by per+unit+analyzer · · Score: 4, Informative
    > Well, I have news for you: 50MV lines don't exist! Not out in the open, anyway. Was it 50 kV, perchance?

    >>nope, MV... though it may have been 45MV...

    The first guy is right; there is no such thing as a 45 MV transmission line. The highest voltage transmission line classification is 765 kV. (That would be 0.765 MV.) In the mid-1970s American Electric Power and Ohio Brass played with some experimental 1.5 MV transmission equipment but they killed the project when they realized land owners would never let AEP put a 1.5MV line in their back yards.

    The lines that First Energy put in the trees were 345 kV. I'm guessing they were rated to carry between 1000 to 1500 MVA. I have no idea where the 45 number came from or what unit would have been associated with it.
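
    For a sense of scale, the numbers above can be turned into line currents with a few lines of C. This is only a sketch under stated assumptions: a balanced three-phase line, where apparent power is sqrt(3) times line-to-line voltage times line current, and the guessed 1000 to 1500 MVA ratings taken at face value. The function names are invented for the example.

```c
/* Converts kilovolts to megavolts, e.g. 765 kV is 0.765 MV. */
double kv_to_mv(double kv)
{
    return kv / 1000.0;
}

/*
 * Line current in amperes for a balanced three-phase line (assumed),
 * from apparent power in MVA and line-to-line voltage in kV:
 *     I = S / (sqrt(3) * V)
 */
double line_current_amps(double mva, double kv_line_to_line)
{
    const double sqrt3 = 1.7320508075688772;
    return (mva * 1.0e6) / (sqrt3 * kv_line_to_line * 1.0e3);
}
```

    Under those assumptions, `line_current_amps(1000.0, 345.0)` comes out at roughly 1670 A and `line_current_amps(1500.0, 345.0)` at roughly 2510 A.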

    --zawada

    --
    In Soviet Russia, the Beowulf cluster imagines you!
  26. Re:Software "Engineering"? by KenSeymour · · Score: 2, Informative

    Don't get your hopes up. In the recent Wired article about software development, it was pointed out that some of the Indian companies are SEI level 4 and 5 shops.

    So if tougher standards are required, more work could go to India.
    The required activities to get to SEI level 3 are mostly management, so programmers by themselves cannot bring the level of software development beyond that.

    --
    "We can't solve problems by using the same kind of thinking we used when we created them." -- Albert Einstein
  27. The alarm bug contributed but was not the cause by dtjohnson · · Score: 5, Informative

    After looking at the original report, it looks more like the GE XA21 SCADA network failure was not the primary cause of the cascading failure but more an effect of the failure. The key failure seems to be a software system called the "State Estimator" (SE) that is used by the Midwest Independent System Operator (MISO), a NERC reliability coordinator, to develop optimal solutions for the planned operating level of all of the power generation and transmission equipment in the MISO area, covering about 10 midwest states and 1 million square miles. It is not described in much detail, but the SE seems to be an optimization tool using a linear programming model that gathers availability data for all of the major system components and load demand every five minutes and then calculates the 'optimal' use of those system components to maintain system reliability at the required level. The 'solution' of the model is then used to plan the operation of the overall system by sending the target operating levels to each facility in the system.

    So why did it fail? Two reasons. First, the model depends on having accurate availability information from each major system component. Status information is sent to MISO in Indiana by the "ECAR" data network or by direct links. On the day of the failure, the direct link to a key transmission line was not working and the analyst had turned off the estimator to troubleshoot it. After fixing the problem, he went to lunch and forgot to put the system back in automatic mode where it would develop updated solutions. This situation existed for nearly two and a half hours, from 12:15 to 14:40. When the estimator was switched back to automatic, it was unable to develop a solution because another key transmission line had overloaded and tripped and *its* new non-operational status was unknown to the model, apparently because the status of that line is assumed to be 'on' until told otherwise. This problem was not corrected until 16:04.

    The bottom line is that a critical major planning tool was not available for nearly 4 hours for a regional generation and distribution system that absolutely required its use to be operated successfully when the system power supply was very close to the demand.
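
    The "assumed to be 'on' until told otherwise" behaviour is the kind of default that a staleness check can guard against. The C sketch below is purely illustrative and is not how the MISO state estimator works: the types, the `effective_state` function and the 300-second limit (one five-minute data cycle) are all assumptions made for this example.

```c
/* Hypothetical status model for one transmission line. */
enum line_state { LINE_IN_SERVICE, LINE_OUT_OF_SERVICE, LINE_UNKNOWN };

struct line_status {
    enum line_state reported;   /* last state received over telemetry */
    long last_update_s;         /* time of that report, in seconds */
};

/* Assumed limit: one five-minute data-gathering cycle. */
#define MAX_STATUS_AGE_S 300L

/*
 * Returns the state a model should use at time `now_s`. A report older
 * than the limit is treated as unknown instead of being trusted, so a
 * dead link cannot leave a tripped line looking "in service".
 */
enum line_state effective_state(const struct line_status *s, long now_s)
{
    if (now_s - s->last_update_s > MAX_STATUS_AGE_S)
        return LINE_UNKNOWN;
    return s->reported;
}
```

    The design choice here is simply to fail towards "unknown": a model that refuses to solve with unknown inputs is annoying, but it does not produce a plan built on a line that is no longer there.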

    The SCADA system itself did not fail, but its alarm function, which provides alarms to control room operators about system operational problems, did. The problem with the alarm function seems to be a case of too many alarms for the system to handle as the problems multiplied. The software bug that they are now reporting was probably related to the unexpectedly large number of alarms that the system was experiencing. The new alarm inputs built up and then overflowed the process input buffers. The alarm system just stalled while processing an alarm event and the alarm function stopped. Then, at 14:41, the primary server hosting the alarm processing application failed due to some combination of the stalling of the alarm application and the queueing to the remote terminals. The hapless backup server was then automatically activated and everything was transferred to it, even the functional non-alarm stuff. The backup server failed after 13 minutes. Basically, the SCADA alarm system seems to have been massively overloaded (which shouldn't ever happen, of course) beyond the capability of the system design to cope with. The bug apparently prevented an indication that the alarm system was failing, but it looks like the cascading failure still would have occurred even if the software bug had not been present because the system deterioration had progressed too far to recover by the time that the bug manifested itself.
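
    As a rough illustration of the overflow scenario described above, here is a minimal C sketch of a bounded alarm queue that records dropped events and exposes a "degraded" flag instead of stalling silently. It is not the XA/21 design, which is not described in that detail; the structure, the function names and the tiny capacity are all assumptions made for the example.

```c
#include <stddef.h>

#define ALARM_QUEUE_CAPACITY 4   /* deliberately tiny, for illustration */

struct alarm_queue {
    int events[ALARM_QUEUE_CAPACITY];
    size_t head;            /* index of the oldest queued event */
    size_t count;           /* number of events currently queued */
    unsigned long dropped;  /* events rejected because the queue was full */
};

void alarm_queue_init(struct alarm_queue *q)
{
    q->head = 0;
    q->count = 0;
    q->dropped = 0;
}

/* Queues an event. Returns 1 on success, 0 if full (the drop is counted). */
int alarm_queue_push(struct alarm_queue *q, int event_id)
{
    if (q->count == ALARM_QUEUE_CAPACITY) {
        q->dropped++;
        return 0;
    }
    q->events[(q->head + q->count) % ALARM_QUEUE_CAPACITY] = event_id;
    q->count++;
    return 1;
}

/* Removes the oldest event into *event_id. Returns 0 if the queue is empty. */
int alarm_queue_pop(struct alarm_queue *q, int *event_id)
{
    if (q->count == 0)
        return 0;
    *event_id = q->events[q->head];
    q->head = (q->head + 1) % ALARM_QUEUE_CAPACITY;
    q->count--;
    return 1;
}

/* Nonzero once any alarm has been lost: the state operators need to see. */
int alarm_queue_degraded(const struct alarm_queue *q)
{
    return q->dropped > 0;
}
```

    Pushing six events into this four-slot queue leaves four queued and two counted as dropped, and `alarm_queue_degraded` starts returning nonzero, so the overload itself becomes something that can be displayed to an operator.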

    The immediate cause of the failure seems to be the forgetfulness of the analyst who was operating the planning model. The significant underlying contributory cause seems to be a very poor regional operational design in which a critical centralized system planning tool was being used with insufficient backup and oversight. It looks as though both Unix and Windows escape blame. The SCADA system probably was doing far more than its designers intended and probably performed heroically until it died. 'Aye Captain...I canna do no more.'

  28. It gets worse (oh, and not 50 MV) by Beryllium+Sphere(tm) · · Score: 2, Informative

    The clearance can narrow in some conditions. When the lines get hot, they expand and sag noticeably. Hot weather will do it, and so will high current.

    Then, just when you most need the power, a tree that used to be at a just barely safe distance shorts the power line.

    The high end for mainstream deployments, by the way, is 750 kV or 1 MV. Corona losses get really bad above that level.

  29. Re:50MV arc'd to a tree by CreatureComfort · · Score: 2, Informative

    Here in Texas a lot of the trans-state electrical transmission lines run across ranches, or the right-of-way is leased to ranchers. Many, many generations of cattle are conceived, born, raised, bred, slaughtered, and sent to market spending their, and their ancestors', lives entirely under the power lines. Considerably closer than U.S. regulations allow you to build your house to the same power lines.

    I have yet to have any of my friends who are ranchers complain about cancer, or other health problems in their cattle raised under these conditions.

    --
    "Unheard of means only it's undreamed of yet,
    Impossible means not yet done." ~~ Julia Ecklar