Slashdot Mirror


Tracking the Blackout Bug

Alien54 writes "This earlier Slash story cited a CNN news report on how the August blackout was preventable. But, as seen in this Security Focus article, things are not so simple. 'In the initial stages, nobody really knew what the root cause was,' says Mike Unum, manager of commercial solutions at GE Energy. 'We test exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug,' says Unum. 'I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software... you may never find the problem. I don't think that's unique to control systems or any particular vendor software.' Which leads to a number of other questions."

7 of 207 comments (clear)

  1. Software bug was just one part of bigger problem by bonnyman · · Score: 5, Informative

    The software bug was just one piece of a much bigger problem; I wouldn't want to overstate its' role. There were many other factors; here are just a few:

    Poor vegetation management probably played an even bigger role as overloaded power lines warmed up, expanded and sagged into trees and bushes that were supposed to have been cut back.

    Poor communications between utilities played a major role.

    This whole section of the transmission system was known to be unstable.

    An inadequate regulatory structure lacked teeth to deal with known problems.

    Lack of adequate transmission line capacity

    If all these other problems hadn't been in place, the software bug might never have surfaced. And certainly, the rpoblems would have been contained within a much smaller area -- maybe just First Energy's service area.

    An article featured on Slashdot last year lays out the underlying complexity of the power grid very well: "The World's Largest Machine"

  2. B Method? by starseeker · · Score: 5, Interesting

    "the bug was unmasked as a particularly subtle incarnation of a common programming error called a "race condition," triggered on August 14th by a perfect storm of events and alarm conditions on the equipment being monitored. The bug had a window of opportunity measured in milliseconds. "

    Isn't this the type of problem the B Method (and maybe the Z language too) are designed to address? Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.

    That doesn't mean the DESIGN is flawless, of course. But if we start engineering software on as many levels as we can, mightn't things improve? Normal software development and testing would never have found a critical bug with rare trigger conditions and a millisecond window. If you need precision on that level, you need to (for starters) to KNOW your implimentation of your design is sound, and preferably the code you are running exactly impliments the proven logic. Isn't this what the B Method was created for?

    --
    "I object to doing things that computers can do." -- Olin Shivers, lispers.org
  3. Re:The problem with SCADA systems by Vancorps · · Score: 5, Interesting
    This all reminds me of the movie Resident Evil where they shut down power and all the doors unlock when power is restored.

    You bring up a great point about failure states. I work for several large hotels and the fire control systems are the ones that alert whenever there is any problem of any kind largely because any problem of any kind needs to be addressed immediately so it makes sense.

    I would think power systems would think along the same lines since the odds are, ANY failure whatsoever needs immediate attention of engineers that maintain the system. This is not a requirement for all software but when it comes to such critical services why doesn't everybody do the same practice? It seems so blatently obvious that alarms should have been raised.

    Also, in situation's where you don't work on a live environment you can always create a test environment that is for all intensive purposes "live" For web development work I do I have a testing domain which is used to test sites to ensure that because they work here in my lab they will work when I hand them off to the client. Its 100% accurate, I've seen it done with countless other systems, so why wasn't it done here?

  4. The American jackasses who blamed Canada by Kevin+Mitnick · · Score: 5, Interesting

    Did anyone ever retract their statements? I know the NY Mayor was pretty quick to blame us Canucks.

  5. Testing isn't the answer... by evilviper · · Score: 5, Insightful

    You can't expect just testing to reveal all bugs in a program. Even a simple program would have to be fed completely random data constantly, in every different order and circumstance concievable, for a very long time, to reveal all bugs. That's just not a real option.

    The only way to have bug-free software is to write it properly. You have to modularize and simplify everything down to the point that each one is easilly understandable, and it is easy to detect when one is providing a sensless answer (in other words, cross-checking every result). Then, you have to tie them all together in a robust but simple way.

    I know it's far easier to say it than do it, but it seems like nobody even tries to do it these days. Even mission-critical systems are commonly built as a single monolithic program, and when you have a lot of things going on within a single program, with no checks of the sanity of the data going into or comming out of each component, there is no way to be 100% certain that the program is theoretically and genuinely perfect. Meanwhile, by modularizing everything, you can PROVE that it is actually perfect.

    But this is really just the old Macrokernel vs. Microkernel arguement all over again. A Microkernel can be perfect, while a macrokernel can never be completely bug-free, but people just find the latter to be easier to write, and then spend hundreds times more man-hours finding and removing bugs, rather than spending (less, overall) time doing it correctly in the first place.

    Oh yes, almost forgot, IMHO...

    --
    Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
  6. Race conditions are nasty ... by cagle_.25 · · Score: 5, Insightful
    As you programmers all know, avoiding race conditions is really difficult. The fellow Neumann quoted in the article who said
    But Peter Neumann, principal scientist at SRI International and moderator of the Risks Digest, says that the root problem is that makers of critical systems aren't availing themselves of a large body of academic research into how to make software bulletproof.
    is overly optimistic; it's theoretically impossible to write a general test to find all race conditions in code. This is a variant of the Halting Problem.
    --
    Human being (n.): A genetically human, genetically distinct, functioning organism.
  7. Re:Software bug was just one part of bigger proble by Shakrai · · Score: 5, Insightful
    The software bug was just one piece of a much bigger problem; I wouldn't want to overstate its' role. There were many other factors; here are just a few:

    Why don't we point out the real problem that likely caused this to happen. Energy deregulation in the first place.

    I know I'll be jumped on by the free market types for daring to suggest this, but I'd rather have a regulated monopoly then a free-market for my life essential services anyday of the week. That article you linked is very interesting reading. Some quotes:

    Prior to the implementation of Federal Energy Regulatory Commission Order 888, which greatly expanded electricity trading, the cost of electricity, excluding fuel costs, was gradually falling. However, after Order 888, and some retail deregulation, prices increased by about 10%, costing consumers $20 billion a year.

    "Under the new system, the financial incentive was to run things up to the limit of capacity," explains Carreras. In fact, energy companies did more: they gamed the system. Federal investigations later showed that employees of Enron and other energy traders "knowingly and intentionally" filed transmission schedules designed to block competitors' access to the grid and to drive up prices by creating artificial shortages. In California, this behavior resulted in widespread blackouts, the doubling and tripling of retail rates, and eventual costs to ratepayers and taxpayers of more than $30 billion. In the more tightly regulated Eastern Interconnect, retail prices rose less dramatically.

    In the four years between the issuance of Order 888 and its full implementation, engineers began to warn that the new rules ignored the physics of the grid. The new policies " do not recognize the single-machine characteristics of the electric-power network," Casazza wrote in 1998. "The new rule balkanized control over the single machine," he explains. "It is like having every player in an orchestra use their own tunes."

    Equally important, the frequency stability of the grid rapidly deteriorated, with average hourly frequency deviations from 60 Hz leaping from 1.3 mHz in May 1999, to 4.9 mHz in May 2000, to 7.6 mHz by January 2001. As predicted, the new trading had the effect of overstressing and destabilizing the grid.

    Of course it's the first quote that rings true with me. If deregulation is so friggen great then where is the cheap electric? Why can my Village sell me electric for $0.04/kWh with their regulated municipal power authority (while paying their workers Government rates and with Government benefits) when my girlfriend (who lives a whole two miles away) pays $0.14/kWh for electric supplied by a company that is supposedly part of the free market (a company that pays their employees crap and outsources their call center/billing functions to India). What's the problem with that picture?

    Before energy deregulation the price of our electric was regulated by the PSC (Public Service Commission) and was fairly stable. The company that had the monopoly in this area made a set amount of profit (it wasn't a bad stock to pick up either -- you knew what you were getting), treated their employees well and charged a fair rate. Nowadays they treat their employees like crap, the stock has tanked because they are eating the price difference from their suppliers (otherwise we'd be paying about $0.20 kWh) and they are being raped by out of state suppliers that bought all of their generation capacity.

    In another slightly related story the out of state company that bought one of their power plants sued the local township because they wanted the tax levy on the power plant reduced. They claimed that it wasn't worth what it used to be because they didn't plan on operating it (it was to be backup generation). After a three-year legal battle the township lost (ran out of money to pay the lawyers) and the tax levy was reduced by some 60%. Property and school taxes on

    --
    I want peace on earth and goodwill toward man.
    We are the United States Government! We don't do that sort of thing.