Slashdot Mirror


Blackout Cause: Buggy Code

blanca writes "The big northeast blackout from last summer was caused in part by a software bug in an energy management system sold by General Electric, according to a story on SecurityFocus. The bug meant that a computerized alarm that should have been triggered never went off, hindering FirstEnergy's response to the train of events that led to the cascading blackout. Investigators found the bug in an intensive code audit following the outage, and a patch is now available."

20 of 377 comments

  1. Uh... by Short Circuit · · Score: 5, Interesting

    Didn't the story used to be that after a tech did maintenance on the machine, he forgot to re-enable an alarm?

    1. Re:Uh... by TimTheFoolMan · · Score: 5, Insightful

      According to the SecurityFocus article, the operators had no way of knowing, because the data wasn't "live." This is a common problem with SCADA systems--the systems will display the "last known-good value" if something goes offline. However, the system should also visibly identify the data as "out of service" or "offline," and this didn't seem to happen. That could be an issue at the server, or it could be something blamed on the people commissioning the XA/21 system (assuming the display is configurable enough to allow you to program it at this level).

      Even so, there should have been sufficient watchdog messages between the client, the server, and the field hardware for the XA/21 to broadcast a general alarm along the lines of "I can't talk to the stinking field, so we're all flying blind here, you morons!" This is exactly the same as software in my industry (HVAC fire/security systems for large buildings), where if you lose communication to a subsystem or the field, you have to raise alarms all over the place.

      The real question is how you could lose such comm while the operators had no visible indication that they were relying on old data. This sounds like a missed requirement, if not insufficient testing.
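
      A minimal Python sketch of the two behaviours described above: flagging a last-known-good value as offline, and raising a general alarm when nothing from the field is fresh. Every name here (`PointReading`, `STALE_AFTER_S`, `display_text`, `comm_loss_alarm`) and the 10-second limit are assumptions made up for illustration; none of it comes from the XA/21 or any real SCADA product.

```python
from dataclasses import dataclass
from typing import List, Optional

STALE_AFTER_S = 10.0  # assumed freshness limit, purely illustrative


@dataclass
class PointReading:
    name: str
    value: float
    updated_at: float  # time (in seconds) of the last good update from the field


def display_text(reading: PointReading, now: float) -> str:
    """Render a point, visibly flagging it when the value is only last-known-good."""
    age = now - reading.updated_at
    if age > STALE_AFTER_S:
        return f"{reading.name}: {reading.value} [OFFLINE, {age:.0f}s old]"
    return f"{reading.name}: {reading.value}"


def comm_loss_alarm(readings: List[PointReading], now: float) -> Optional[str]:
    """Return a general alarm when every point is stale, i.e. the field is unreachable."""
    if readings and all(now - r.updated_at > STALE_AFTER_S for r in readings):
        return "GENERAL ALARM: no fresh field data, displays show old values"
    return None
```

      The point of the sketch is only that staleness is computed from timestamps the system already has, so the display and the watchdog do not depend on the failed link reporting its own failure.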

      Tim

  2. Wrong article! by ThePretender · · Score: 5, Funny

    Oh this bug took six months to find and now a patch is available. I thought someone said the bug was found six months ago and now the patch was available. My bad, nobody would ever do that :-)

  3. the bug of my dreams by vargul · · Score: 5, Funny

    i have been dreaming of writing such a bug myself. quite an achievement to black out a quarter of a continent with some crappy code...

    --
    Aure entuluva!
  4. Bad bugs by Rico_za · · Score: 5, Informative

    Chalk up another one for the most disastrous software bugs in history. This one should give the Ariane 5 explosion a go for no. 1.

  5. Re:GE Outsourcing To India by cassidyc · · Score: 5, Informative

    That might be the case except that XA/21 is developed in Melbourne (FL).

    facts before hysteria thanks

  6. Re:This spells trouble by duffbeer703 · · Score: 5, Funny

    Indeed. We all must consider ourselves incredibly lucky that the /. editors are not working on energy management software or embedded medical devices.

    Subscribe to Slashdot -- we have to keep these guys employed and out of the real world!

    --
    Conformity is the jailer of freedom and enemy of growth. -JFK
  7. Software "Engineering"? by fygment · · Score: 5, Insightful

    Now if in fact this was buggy code, and if Software Engineers are in fact part of the engineering profession, then a professional body should be taking the engineer(s) to task. This would be the same thing that would take place in the event that a civil engineer signed off on faulty building plans. But smart money says no software "engineer" will get nailed.

    A look at the software industry will show this to be the norm. And that is why there is such a problem with having people claiming the title of "software engineer". "Engineer" doesn't just mean having the technical savvy, it also means having a responsibility to the public for the use of that knowledge and being beholden to a professional body charged with ensuring you are held accountable.

    --
    "Consensus" in science is _always_ a political construct.
    1. Re:Software "Engineering"? by Detritus · · Score: 5, Insightful

      You can't have responsibility without authority. The building never gets built without the signature of the civil engineer on the plans. Few software engineers have that control.

      --
      Mea navis aericumbens anguillis abundat
    2. Re:Software "Engineering"? by Anonymous Coward · · Score: 5, Insightful

      That's why you'll never see a proper software ENGINEER... when engineers undertake a project they know the materials, the requirements, the environment, etc. As soon as a piece of software goes out the door all bets are off.

      How long do you think engineering (as it stands today) would last if that bridge meant to stand on bedrock spanning no more than 1000' and carry a load of no more than 1500 tons at any given time were suddenly put on a sandy bed, stretched to cover 1100' and carry 1600 tons... oh yes, and the user didn't like that third support so they removed it.

      Software and engineering are VASTLY different disciplines. If software is ever judged like engineering then it would kill the market because the EULAs would have to say that you use THIS motherboard with X amount of RAM and Y amount of hard-drive space. The agreement would only be in effect as long as you used OS "ABC" and no other processes besides those required by the OS and the programme in question were running. It would make the cost of running a business prohibitively expensive.

      When you consider that most large-scale software development projects are equivalent in complexity to building structures like the Golden Gate Bridge or the Empire State Building (I didn't want to mention any buildings outside the US since I realise the audience on here is largely American and probably wouldn't know what I was talking about) consider the cost of actually treating software development the same way... I'm sure companies everywhere will be lining up to pay $300M for that content management system.

  8. Re:This is unacceptable by cassidyc · · Score: 5, Informative

    If you read the article and other associated articles, you will realise that this bug did not *cause* the blackout. On its own, this bug would have had no effect on the continued power supply. However, the timing of the bug along with a number of other issues (which I won't repeat here, read the article for a clue!) all contributed.

  9. Re:Text of the article by AKnightCowboy · · Score: 5, Funny
    The comment preceding the code in question was:
    // Not sure why this works for my test data.
    // Probably should come back and re-write this
    // if we have time before the product ships.
  10. Re:Would this be any better in an OSS environment? by eraserewind · · Score: 5, Insightful
    People make the comment of the many eyes, but who is really looking at the code?
    Probably nobody, especially if you are talking about something as dull as a utility management app. That's why companies pay people to look at these things.

    Open source almost certainly would not have prevented the bug. The bug might have been found faster after it happened though, because curious (or under pressure from their boss) engineers in every facility affected would spend at least some time trying to figure out what went wrong.

    Having the source is great, and you would be surprised at the number of companies who license the source for what they use. Risk management is important. Free isn't everything, you can get many of the same things by paying :-)
  11. Metroid by Graymalkin · · Score: 5, Insightful

    Blaming the blackout on a software bug is a damn cop-out. The cause of the blackout was a horribly managed electrical grid that can barely keep up with the current demand. Any major failure in the system can cause a cascading failure of the entire section of the grid. That is a horrible design. A software bug may have been the trigger but it is by no means the true cause.

    The grid in the North East US is supplied by horribly inefficient and antiquated power lines that were struggling to keep up thirty years ago. That they are still in use today is an outright crime. There's also the issue of the operators of the lines and generators trying to save a few bucks by cutting maintenance on equipment and facilities and cutting supervising staffs down to skeleton crews. It is much easier to fit "software bug" into a sound bite so the news media will stick with that. Unfortunately the real cause of the blackout is not ever going to be patched and another blackout is as inevitable as this last one was. I hope next time a few more people will have invested in backup generators or some alternate form of power to keep from losing their business during a blackout.

    --
    I'm a loner Dottie, a Rebel.
  12. Yeah, right. by Anonymous Coward · · Score: 5, Funny

    Well, I have news for you: 50MV lines don't exist! Not out in the open, anyway. Was it 50 kV, perchance?

  13. Re:Development vs Engineering by Anonymous Coward · · Score: 5, Interesting

    In Canada, "Engineer" is a protected term, like "Doctor."

    Doctor is not a protected term. Perhaps you mean "Medical Doctor"? There are lots of non-medical doctors.

    I was arguing once with an MD friend of mine who thought that PhDs (like myself) don't have the right to call themselves Doctor. I explained that while medicine has been around for a very long time, the degree of MD has not. PhD degrees have a much longer history than MD degrees.

    It gets very funny when another friend of mine (who has a PhD in nursing) is called "Dr" in her hospital.

  14. Re: ms WAS responsible - chain of events by galtsavenger · · Score: 5, Funny

    I'm sure this was mentioned in the original blackout posts - since the Blaster virus was running full tilt at that time, there was an increased load on servers, routers, switches, hubs and blinky things that go whoop! whoop!! WHOOOOP! The increased demand on computing resources caused increased power demand (not to mention the cranked ACs at the homes of the poor IT staff who were staring at their blackberrys and sweating bullets) which in turn caused the alarm conditions which didn't get alarmed properly and so the powergrid went down. All because of an MS security hole.

    How's that?

  15. Re:50MV arc'd to a tree by plover · · Score: 5, Interesting
    My property abuts a set of high voltage transmission lines. (I'm about three miles from a coal plant.) The lines cut a long, skinny park through my city. The plat for the site shows a 200 foot wide easement, which is about 30 meters to the property on either edge of the park. I've never measured the height of the towers, but my rough guess is that the line itself is perhaps 25 meters above ground. That puts the line itself about 39 meters from the edge of my property.
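
    The 39 meter figure is just the hypotenuse of the horizontal and vertical distances. A quick Python check, assuming the line runs down the middle of the easement (the function name is made up for illustration, and the 25 meter height is the rough guess from above):

```python
import math

FEET_TO_METERS = 0.3048


def distance_to_line_m(easement_width_ft: float, line_height_m: float) -> float:
    """Straight-line distance from the easement edge to a line centred over it."""
    half_width_m = (easement_width_ft / 2) * FEET_TO_METERS  # 100 ft is about 30.5 m
    return math.hypot(half_width_m, line_height_m)


distance = distance_to_line_m(200, 25.0)
print(f"{distance:.1f} m")
```

    With 100 feet taken as 30.48 meters rather than a round 30, the result comes out a little over 39 meters, so the estimate holds.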

    The land beneath the lines was clear-cut about 12 years ago. But there are now trees under this line that are about 10 meters high.

    Years ago when my wife was concerned about "power line emissions" the power company loaned her a meter that showed "electrical fields." I don't remember the scale, or even what it was supposed to measure, but I do remember that we had to actually get about 200 feet from the wire before the field from the line stopped affecting the meter. (Yes, on a humid summer day I once stood in my back yard with a neon bulb and caused it to illuminate by simply dangling a three foot wire from one lead and touching the other.) I had always assumed it was a 750kV line, and that the 100 foot easement was more than sufficient. Now, I wonder. Hey, maybe this is enough of an excuse to go out and get one of those IKE toys!

    --
    John
  16. Re:Electrical Field Exposure? by AB3A · · Score: 5, Informative

    So what? You use a cell phone, don't you? The electrical energy exposure you get from that is substantially greater.

    How about electric blankets or heating pads? How about a battery powered shaver?

    You expose yourself to these fields every day to an extent far greater than what you may have received from that transmission line.

    By the way, you can light a neon light with a bit of wire and very little power. You can also light it with a MW AM broadcast transmitter less than a mile away; you can light it with a CB radio; and with just a bit more wire, and a location closer to the poles of the earth, you can light it when the earth is hit by a solar flare. Many among the various eco-scare-monger groups like to make this demonstration as if it were an indicator of something dangerous. If it were, there would be no life anywhere near the Arctic Circle.

    Aside from the poor maintenance of the clear-cut area, you really have no need to be concerned about this.

    --
    Nearly fifty percent of all graduates come from the bottom half of the class!
  17. The alarm bug contributed but was not the cause by dtjohnson · · Score: 5, Informative

    After looking at the original report, it looks more like the GE XA/21 SCADA network failure was not the primary cause of the cascading failure but more an effect of the failure. The key failure seems to be a software system called the "State Estimator" (SE) that is used by the Midwest Independent System Operator (MISO), a NERC reliability coordinator, to develop optimal solutions for the planned operating level of all of the power generation and transmission equipment in the MISO area covering about 10 midwest states and 1 million square miles. It is not described in much detail but the SE seems to be an optimization tool using a linear programming model that gathers availability data for all of the major system components and load demand every five minutes and then calculates the 'optimal' use of those system components to maintain system reliability at the required level. The 'solution' of the model is then used to plan the operation of the overall system by sending the target operating levels to each facility in the system.

    So why did it fail? Two reasons. First, the model depends on having accurate availability information from each major system component. Status information is sent to MISO in Indiana by the "ECAR" data network or by direct links. On the day of the failure, the direct link to a key transmission line was not working and the analyst had turned off the estimator to troubleshoot it. After fixing the problem, he went to lunch and forgot to put the system back in automatic mode where it would develop updated solutions. This situation existed for more than 2 hours, from 12:15 to 14:40. When the estimator was switched back to automatic, it was unable to develop a solution because another key transmission line had overloaded and tripped and *its* new non-operational status was unknown to the model, apparently because the status of that line is assumed to be 'on' until told otherwise. This problem was not corrected until 16:04.

    The bottom line is that a critical major planning tool was not available for nearly 4 hours for a regional generation and distribution system that absolutely required its use to be operated successfully when the system power supply was very close to the demand.
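
    Both weak points described above (an estimator left out of automatic mode with nobody noticing, and a line assumed 'on' until told otherwise) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `LineStatus`, `line_status`, `estimator_warnings` and the two-missed-runs tolerance are hypothetical names and values, not anything taken from the MISO State Estimator.

```python
from enum import Enum
from typing import Dict, List

RUN_INTERVAL_MIN = 5   # the estimator is described as running every five minutes
MAX_MISSED_RUNS = 2    # assumed tolerance before someone gets warned


class LineStatus(Enum):
    IN_SERVICE = "in_service"
    TRIPPED = "tripped"
    UNKNOWN = "unknown"


def line_status(reported: Dict[str, LineStatus], line: str) -> LineStatus:
    """Look up a line's status, defaulting to UNKNOWN instead of IN_SERVICE."""
    return reported.get(line, LineStatus.UNKNOWN)


def estimator_warnings(minutes_since_last_solution: float,
                       reported: Dict[str, LineStatus],
                       lines: List[str]) -> List[str]:
    """Collect warnings about a stalled estimator and about lines with no telemetry."""
    warnings: List[str] = []
    if minutes_since_last_solution > RUN_INTERVAL_MIN * MAX_MISSED_RUNS:
        warnings.append(
            f"no estimator solution for {minutes_since_last_solution:.0f} minutes"
        )
    for line in lines:
        if line_status(reported, line) is LineStatus.UNKNOWN:
            warnings.append(f"no status telemetry for {line}; not assuming in service")
    return warnings
```

    For example, `estimator_warnings(145, {"A": LineStatus.IN_SERVICE}, ["A", "B"])` returns two warnings: one for the 145 minutes between 12:15 and 14:40 without a solution, and one for line B, whose status was never reported.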

    The SCADA system itself did not fail, but its alarm function, which provides alarms to control room operators about system operational problems, did. The problem with the alarm function seems to be a case of too many alarms for the system to handle as the problems multiplied. The software bug that they are now reporting was probably related to the unexpectedly large number of alarms that the system was experiencing. The new alarm inputs built up and then overflowed the process input buffers. The alarm system just stalled while processing an alarm event and the alarm function stopped. Then, at 14:41 the primary server hosting the alarm processing application failed due to some combination of the stalling of the alarm application and the queueing to the remote terminals. The hapless backup server then was automatically activated and everything was transferred to it, even the functional non-alarm stuff. The backup server failed after 13 minutes. Basically, the SCADA alarm system seems to have been massively overloaded (which shouldn't ever happen, of course) beyond the capability of the system design to cope with. The bug apparently prevented an indication that the alarm system was failing but it looks like the cascading failure still would have occurred even if the software bug had not been present because the system deterioration had progressed too far to recover by the time that the bug manifested itself.
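
    To make the overflow guess concrete, here is a minimal Python sketch of a bounded alarm queue that counts what it cannot accept and reports its own degraded state instead of stalling silently. It illustrates the hypothesis in this comment only, not the actual XA/21 defect; `AlarmQueue`, `push` and `health_alarm` are hypothetical names.

```python
from collections import deque
from typing import Deque, Optional


class AlarmQueue:
    """Bounded alarm buffer that tracks overflow rather than hiding it."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pending: Deque[str] = deque()
        self.dropped = 0

    def push(self, alarm: str) -> bool:
        """Queue an alarm; return False and count it when the buffer is full."""
        if len(self.pending) >= self.capacity:
            self.dropped += 1
            return False
        self.pending.append(alarm)
        return True

    def health_alarm(self) -> Optional[str]:
        """Return a meta-alarm if any alarm has been lost, else None."""
        if self.dropped:
            return f"ALARM SYSTEM DEGRADED: {self.dropped} alarm(s) dropped"
        return None
```

    The design choice worth noting is the separate `health_alarm` channel: operators learn that the alarm function is failing even when individual alarms can no longer get through.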

    The immediate cause of the failure seems to be the forgetfulness of the analyst who was operating the planning model. The significant underlying contributory cause seems to be a very poor regional operational design in which a critical centralized system planning tool was being used with insufficient backup and oversight. It looks as though both Unix and Windows escape blame. The SCADA system probably was doing far more than its designers intended and probably performed heroically until it died. 'Aye Captain...I canna do no more.'