Slashdot Mirror


Tracking the Blackout Bug

Alien54 writes "This earlier Slash story cited a CNN news report on how the August blackout was preventable. But, as seen in this Security Focus article, things are not so simple. 'In the initial stages, nobody really knew what the root cause was,' says Mike Unum, manager of commercial solutions at GE Energy. 'We test exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug,' says Unum. 'I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software... you may never find the problem. I don't think that's unique to control systems or any particular vendor software.' Which leads to a number of other questions."

9 of 207 comments

  1. Re:Software bug was just one part of bigger proble by Raindance · · Score: 4, Interesting

    I agree that there's more to this than just one line of code, as some folks seem to believe. I think referring to it as 'one bug' is rather misleading.

    One might as well refer to the things leading up to WWII as 'one problem'.

  2. B Method? by starseeker · · Score: 5, Interesting

    "the bug was unmasked as a particularly subtle incarnation of a common programming error called a "race condition," triggered on August 14th by a perfect storm of events and alarm conditions on the equipment being monitored. The bug had a window of opportunity measured in milliseconds. "

    Isn't this the type of problem the B Method (and maybe the Z language too) are designed to address? Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.

    That doesn't mean the DESIGN is flawless, of course. But if we start engineering software on as many levels as we can, mightn't things improve? Normal software development and testing would never have found a critical bug with rare trigger conditions and a millisecond window. If you need precision on that level, you need (for starters) to KNOW your implementation of your design is sound, and preferably the code you are running exactly implements the proven logic. Isn't this what the B Method was created for?
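
    To make the "millisecond window" point concrete, here is a minimal sketch. It is not B or Z, just a brute-force enumeration of every interleaving of two hypothetical threads that each do a read-then-write increment on a shared counter. The thread steps and the counter are assumptions for illustration, not anything from the actual system. Exhaustive exploration finds the lost update that random testing could miss for millions of hours.

```python
from itertools import combinations

def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    n, m = len(a), len(b)
    for positions in combinations(range(n + m), n):
        chosen = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n + m)]

def run(schedule):
    """Execute one schedule of read/write steps against a shared counter."""
    shared = 0
    local = {}  # each thread's private copy of the value it last read
    for thread, op in schedule:
        if op == "read":
            local[thread] = shared
        else:  # "write": store the stale-or-fresh local value plus one
            shared = local[thread] + 1
    return shared

# Hypothetical threads: each performs a non-atomic increment.
thread_a = [("A", "read"), ("A", "write")]
thread_b = [("B", "read"), ("B", "write")]

results = [run(s) for s in interleavings(thread_a, thread_b)]
bad = [r for r in results if r != 2]
print(f"{len(results)} schedules, {len(bad)} lose an update")
# prints "6 schedules, 4 lose an update"
```

    In a real system the bad schedules may be a vanishingly small fraction of what actually occurs at run time, which is why enumeration or proof catches what testing does not.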

    --
    "I object to doing things that computers can do." -- Olin Shivers, lispers.org
    1. Re:B Method? by mccalli · · Score: 4, Interesting
      Isn't this the type of problem the B Method (and maybe the Z language too) are designed to address? Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.

      Ye gods, you've frightened the hell out of me with reference to Z. I'd almost entirely forgotten it, and had hoped its cold corpse would lie in the ground undisturbed, undiscovered and most importantly of all unreferenced until the end of time. Still, "That is not dead which may eternal lie"...

      Z is a beautiful way to mathematically prove that you have design bugs at the highest level possible. You can then design your unit tests around those bugs, and confirm that they're valid.

      That's it. It provides nothing else that unit testing on its own couldn't do, with the exception of a few salaries and a research grant here and there. Whilst you can mathematically prove implementations of certain designs, the vast majority of designs have more complex interactions. Try using Z for a multithreaded real-time environment, for example - my Software Engineering tutor at the time, Iain Sommerville (well known in the field due to his books; oh, and 'at the time' would be ~1993), basically said that Z just breaks down in those circumstances. I wouldn't know - I personally had no clue how to even make it begin in those circumstances, let alone break down.

      Please confine Z to camp-fire ghost stories used to scare new programmers. It always was a living hell, and it really shouldn't be resurrected now.

      Cheers,
      Ian

    2. Re:B Method? by Orne · · Score: 4, Interesting

      SCADA systems transport data samples. My company's system collects from several hundred thousand meters, about half of which are expected to send in a sample about once every 10 seconds, some as fast as once every two seconds. The concept is that you have a communications buffer that collects the data: the link writes to the memory while the other EMS applications (about a dozen) read from it.

      Now admittedly, FirstEnergy's system is a little smaller in territory, but I wonder if their mergers over the recent years (Cleveland Electric and Ohio Edison became FE, and then proceeded to take Toledo Edison and GPU of PA) have outpaced the collection capabilities of their mainframe (which was already at the end of its life and was scheduled to be replaced). That could account for some of the "slowing" that the G.E. testers said they had to do to make the race condition appear.
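
      For anyone who hasn't worked with this pattern, here is a rough sketch of one writer and a dozen readers sharing a buffer. Everything in it (the `SampleBuffer` class, the meter names, the sequence-number invariant) is made up for illustration and says nothing about how any vendor's EMS is actually built. The point is only that the lock makes each update and each snapshot a single step, so a reader can never observe a half-written sample set.

```python
import threading

class SampleBuffer:
    """Hypothetical shared buffer: one link thread writes, many applications read."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seq = 0
        self._values = {}

    def write(self, seq, values):
        # Update the sequence number and the values as one atomic step.
        with self._lock:
            self._seq = seq
            self._values = dict(values)

    def snapshot(self):
        # Copy under the lock so the caller never sees a half-updated pair.
        with self._lock:
            return self._seq, dict(self._values)

def writer(buf, rounds):
    for seq in range(1, rounds + 1):
        buf.write(seq, {"meter_1": seq, "meter_2": seq})

def reader(buf, rounds, errors):
    for _ in range(rounds):
        seq, values = buf.snapshot()
        # Invariant: every value in a snapshot belongs to the same write.
        if any(v != seq for v in values.values()):
            errors.append((seq, values))

buf = SampleBuffer()
errors = []
threads = [threading.Thread(target=writer, args=(buf, 2000))]
threads += [threading.Thread(target=reader, args=(buf, 2000, errors)) for _ in range(12)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("inconsistent snapshots:", len(errors))
```

      Take the lock away and the invariant can break, but possibly only once in millions of reads, and only under the right load.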

  3. Re:The problem with SCADA systems by Vancorps · · Score: 5, Interesting
    This all reminds me of the movie Resident Evil where they shut down power and all the doors unlock when power is restored.

    You bring up a great point about failure states. I work for several large hotels, and the fire control systems are the ones that alert whenever there is any problem of any kind. That makes sense, largely because any problem of any kind needs to be addressed immediately.

    I would expect power systems to work along the same lines, since odds are that ANY failure whatsoever needs the immediate attention of the engineers who maintain the system. This is not a requirement for all software, but when it comes to such critical services, why doesn't everybody follow the same practice? It seems so blatantly obvious that alarms should have been raised.
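
    A minimal sketch of what "alert on any failure" could look like when the thing that fails is the alarm software itself: a watchdog that expects periodic heartbeats and flags whatever goes quiet. The `Watchdog` class, the component names, and the 5-second timeout are all assumptions for the example, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class Watchdog:
    """Flags any monitored component that has stopped reporting in."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_beat = {}  # component name -> time of last heartbeat

    def beat(self, name, now):
        self.last_beat[name] = now

    def stale(self, now):
        # A component is stale when its last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_beat.items() if now - t > self.timeout_s)

dog = Watchdog(timeout_s=5.0)
dog.beat("alarm_processor", now=0.0)
dog.beat("scada_link", now=0.0)
dog.beat("scada_link", now=4.0)   # the link keeps reporting
print(dog.stale(now=6.0))         # → ['alarm_processor']
```

    The silence of the alarm processor becomes an alarm in its own right, instead of looking like a quiet day.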

    Also, in situations where you don't work on a live environment, you can always create a test environment that is for all intents and purposes "live". For web development work I do, I have a testing domain which is used to test sites to ensure that because they work here in my lab they will work when I hand them off to the client. It's 100% accurate, and I've seen it done with countless other systems, so why wasn't it done here?

  4. The American jackasses who blamed Canada by Kevin Mitnick · · Score: 5, Interesting

    Did anyone ever retract their statements? I know the NY Mayor was pretty quick to blame us Canucks.

  5. Reasons for power blackouts by pcraven · · Score: 4, Interesting

    I've been reading several papers on this for a grad class I'm taking. One of the several problems is the lack of government control. If a power outage might be prevented by shedding some load (turning off power to some people), no company wants to step up to the plate and be the one to cut power to its customers. So they either luck out or have a massive power outage.

    This paper (click on the PDF link) has a good summary of the problems in keeping power outages from happening again.

  6. World's largest machine by stefanb · · Score: 4, Interesting
    An article featured on Slashdot last year lays out the underlying complexity of the power grid very well: "The World's Largest Machine"

    OK, it's nitpicking, but the largest machine is arguably the telephone system. Among other things, it maintains a synchronized clock (8 kHz base), even across oceans and continents.

  7. Software ENGINEERING by Anonymous Coward · · Score: 4, Interesting

    If I want to build a large structure (bridge or building) where it is possible that public safety is at issue, I had better have an engineer's signature on the drawings.

    This case seems like a real good argument for having the same requirement for software.

    Good engineering practice would probably have prevented this. A simple example of such a system would be a burglar/fire alarm panel. The system is self-checking. If any part of the system isn't working (i.e., someone cuts a wire), then that causes an alarm.

    I realize that there will be strange undetectable bugs in software but if the system as a whole is properly engineered, the system will fail gracefully and safely.
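
    As a rough sketch of the self-checking idea: the check below returns "ok" only when a loop reading is positively confirmed as normal, and treats everything else (cut wire, short, missing data) as an alarm. The function name, the 2200 ohm expected value, and the 20% tolerance are assumptions picked for the example, not taken from any particular panel.

```python
import math

def loop_in_alarm(resistance_ohms, expected_ohms=2200.0, tolerance=0.2):
    """Return True unless the reading is positively confirmed as normal."""
    if resistance_ohms is None or not math.isfinite(resistance_ohms):
        return True  # missing reading or open circuit (e.g. a cut wire)
    low = expected_ohms * (1 - tolerance)
    high = expected_ohms * (1 + tolerance)
    return not (low <= resistance_ohms <= high)

readings = {"intact": 2150.0, "cut wire": math.inf, "shorted": 0.0, "no data": None}
for label, value in readings.items():
    print(label, "->", "ALARM" if loop_in_alarm(value) else "ok")
```

    The design choice is that the safe answer is the default: the code has to prove the loop is healthy, rather than prove it is broken.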