Tracking the Blackout Bug
Alien54 writes "This earlier Slash story cited a CNN news report on how the August blackout was preventable. But, as seen in this Security Focus article, things are not so simple. 'In the initial stages, nobody really knew what the root cause was,' says Mike Unum, manager of commercial solutions at GE Energy. 'We test exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug,' says Unum. 'I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software... you may never find the problem. I don't think that's unique to control systems or any particular vendor software.' Which leads to a number of other questions."
The software bug was just one piece of a much bigger problem; I wouldn't want to overstate its role. There were many other factors; here are just a few:
Poor vegetation management probably played an even bigger role as overloaded power lines warmed up, expanded and sagged into trees and bushes that were supposed to have been cut back.
Poor communications between utilities played a major role.
This whole section of the transmission system was known to be unstable.
An inadequate regulatory structure lacked teeth to deal with known problems.
Transmission line capacity was inadequate.
If all these other problems hadn't been in place, the software bug might never have surfaced. And certainly, the problems would have been contained within a much smaller area -- maybe just FirstEnergy's service area.
An article featured on Slashdot last year lays out the underlying complexity of the power grid very well: "The World's Largest Machine"
Al Bonnyman
Community Broadband Networks
From the perspective of New York, they saw a surge race through their system east to west, through the choke point into Canada at Niagara station. NY constantly has problems with IMO not following schedules, and from their perspective, this was yet another incident of bad reliability control across the border.
What they didn't know is that the energy was routed through the southern bit of Canada along the lake area, back into the USA in Michigan, to feed all of the communities along the southern shores of the Great Lakes. The reason this happened is that the coastal towns became electrically isolated from southern Ohio because of failures in FirstEnergy territory. I don't think to this day FE has accepted full responsibility for its role in the failures, something I think should come with a good house-cleaning in their company...
Baloney. It is possible to write programs for which race conditions are undecidable. Such programs are broken. It is possible to write programs for which race condition detection is NP-hard. Such programs are broken if N is large. It is also possible to write programs for which race conditions can be proven to be absent. That's what you want to do.
Actually, it's straightforward to design software to be free of race conditions on a single machine. You then have a deadlock avoidance problem, but deadlocks are easily detected when they occur.
Hardware is routinely designed to be free of race conditions, after all.
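To make the "deadlocks are easily detected" point concrete, here is a minimal sketch, assuming the runtime can report which thread is waiting on which. The wait-for graph, the thread names and the function find_deadlock are all hypothetical; a deadlock shows up as a cycle in that graph.

```python
def find_deadlock(wait_for):
    """Return a list of threads forming a wait cycle, or None if there is none.

    wait_for is a hypothetical mapping: thread name -> set of thread names
    it is currently blocked on.
    """
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, done
    colour = {t: WHITE for t in wait_for}
    path = []                             # threads on the current DFS path

    def visit(thread):
        colour[thread] = GREY
        path.append(thread)
        for blocker in sorted(wait_for.get(thread, ())):
            state = colour.get(blocker, WHITE)
            if state == GREY:
                # We came back to a thread already on the path: a cycle.
                return path[path.index(blocker):]
            if state == WHITE:
                cycle = visit(blocker)
                if cycle:
                    return cycle
        colour[thread] = BLACK
        path.pop()
        return None

    for thread in sorted(wait_for):
        if colour[thread] == WHITE:
            cycle = visit(thread)
            if cycle:
                return cycle
    return None


# Illustrative data only: A waits on B, B on C, C on A.
deadlocked = find_deadlock({"A": {"B"}, "B": {"C"}, "C": {"A"}})
healthy = find_deadlock({"A": {"B"}, "B": set()})
```

This only detects a deadlock after it has formed; it says nothing about avoiding one, which is the harder design problem.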
Umm, actually, you can get sample slippage
when crossing clock domains in the telephone
network. Although it _appears_ to be completely
synchronized, it's really just that all the
different master clocks have really, really
tight tolerances.
how do you have a large nondeterministic?
Hint: an NP-hard problem is one that is NP-complete, or worse; that is, it is at least as hard as every problem in NP. An NP-hard problem does not have to be solvable. NP in this context stands for nondeterministic polynomial (with reference to time bounds). NP means that a problem can be solved in polynomial time on a nondeterministic machine, which you can think of as an infinitely parallel system. NP-complete problems are in NP and are at least as hard as all other NP problems.
Sorry, it just bugs me whenever people try to talk about theory of CS and use "non-polynomial" or something else for NP.
And the muscular cyborg German dudes dance with sexy French Canadians
For some problems, it is possible to construct a formal
description of the code. There are many,
many tools (e.g., SPIN, ACL2) that take this
formal description and produce a rigorous
proof of some property, e.g., that some state is
never reached, that a safety or liveness property
is upheld, etc.
http://spinroot.com/spin/whatispin.html
AMD uses this to test the floating point unit
in their chips, to make sure the algorithm they
use will not result in an Intel-style half
billion dollar mistake.
The question is: does your application warrant
the time and cost needed to create the formal
description of the problem that is needed to drive
these tools?
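To give a feel for what "proving that some state is never reached" means, here is a toy sketch in Python. It is not SPIN or ACL2 and does not use their input languages; the two-process lock model and all names are made up for illustration. It enumerates every reachable state of the model and checks that none has both processes in the critical section.

```python
from collections import deque


def successors(state):
    """Yield next states of a hypothetical two-process lock model.

    state = (pcs, locked): pcs is a tuple of "idle"/"crit" per process,
    locked says whether the lock is held. Acquiring is assumed atomic.
    """
    pcs, locked = state
    for i, pc in enumerate(pcs):
        if pc == "idle" and not locked:
            new = list(pcs)
            new[i] = "crit"               # process i takes the lock
            yield (tuple(new), True)
        elif pc == "crit":
            new = list(pcs)
            new[i] = "idle"               # process i releases the lock
            yield (tuple(new), False)


def reachable(initial, step):
    """Breadth-first search over every state reachable from initial."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        current = queue.popleft()
        for nxt in step(current):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def both_in_critical_section(state):
    """The bad state we want to show is unreachable."""
    return state[0].count("crit") > 1


states = reachable((("idle", "idle"), False), successors)
violations = [s for s in states if both_in_critical_section(s)]
```

The catch is the same one raised above: the result is only as good as the model, and real tools exist largely to cope with state spaces far too big to enumerate this naively.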
NOTE: I AM NOT SAYING BRAZIL IS BETTER THAN THE USA... JUST THAT IT'S NOT WORSE EITHER.
Brazil's electrical power, as of 2001, was about 97% hydroelectric. Because of years of below-average rainfall, this system was threatened, and in 2001, we were told there might be "rolling blackouts" here (except that the Brazilian government, unlike the US government, was honest enough to call it what it was: power rationing). We ended up not getting any "rolling blackouts," and a regression toward the mean in rainfall has left us sufficiently well off that we don't even have to use the new polluting thermo plants that were built around the time of the crisis. Electrical power here is cheap and reliable, especially compared to places like California, where a lot of my friends had to endure "rolling blackouts" because the folks at the deregulated power companies decided to put more money on their bottom line by not investing in infrastructure upgrades and maintenance. So the execs who made those decisions increased profits in the short term, increasing their bonuses and the value of their stock. When the $#!+ hit the fan, guess who had to pay, both in damages from "rolling blackouts" and in higher rates? The consumers, of course!
The only power problems I've had here in São Paulo were a neighborhood issue, not a city-wide, state-wide, or nation-wide problem. Basically, the new condo across the street overloaded the local grid 3 times in a 2-week span. The worst thing is that the new condo has its own generator, so the newcomers would knock out the neighborhood power and then not even notice, because their generator kicked in. Meanwhile, those of us who had already been in the neighborhood were screwed. Even those problems have been resolved, though. With even more people moving into the new condo, it's been about 6 weeks since we had a problem. The power companies here are pretty efficient. Yeah, I'd have liked for somebody to stop people from moving into the new condo until the local power grid was adequately updated, but they responded pretty quickly once the problem did present itself in an inconvenient way.
--Mark
"It is nice to know that the computer understands the problem. But I would like to understand it too." --Eugene Wigner
From the reports I have seen, other than FE, the various companies did take appropriate action and shed load where necessary. It's just that the situation developed too quickly (from their perspective) and was too large to save by the time they could see it.
The problem was that the grid was running too close to capacity in general. Since the electricity is traveling as fast as any control signal could, it is necessary for the system to be able to tolerate whatever condition may exist long enough for systems to react and get a command transmitted. To make matters worse, you can't just switch off that much current; it takes several seconds for a switch to trip and the arc to be extinguished.
At 50% of design limits, a sudden doubling of demand due to a failure is no big problem (but needs to be dealt with before something else goes wrong). If you're at 90%, though, you have a problem.
The real problem is that peak capacity simply isn't there. Our grid does run around 90% during peak load. The question is, are we willing to pay for the extra peak capacity?
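The arithmetic behind the 50% versus 90% comparison can be sketched as follows. This is a deliberately crude model, assuming identical parallel lines that share the load evenly once one of them trips; real power flows follow network impedances, and the function name is made up.

```python
def loading_after_trip(loading_before, parallel_lines=2):
    """Per-line loading (fraction of rating) after one of N lines trips.

    Assumes N identical lines sharing the load evenly, which is a
    simplification for illustration only.
    """
    if parallel_lines < 2:
        raise ValueError("need at least two lines for one to pick up the load")
    return loading_before * parallel_lines / (parallel_lines - 1)


for before in (0.5, 0.9):
    after = loading_after_trip(before)
    status = "within rating" if after <= 1.0 else "overloaded"
    print(f"{before:.0%} before -> {after:.0%} after one trip ({status})")
```

With two lines, the survivor carries double what it did before, so 50% lands right at the design limit while 90% goes far past it.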
California's problems were quite different since it was basically an effort to wring out more profit than existed in the system. THAT is a good reason for regulation. It may be that a more limited form of that is why we don't have more peak capacity, and that needs to be addressed.
Exhaustive testing, however you wish to define that
Exhaustive \Ex*haust"ive\, a.
Serving or tending to exhaust; exhibiting all the facts or arguments; as, an exhaustive method. Ex*haust"ive*ly, adv.
Basically, it should mean you've tested everything (which is of course impossible in most cases).
The term usually used (and rightfully so) is extensive testing.
One issue is that there is no safe state for the system to go to if the control system breaks down. Bringing the power grid in an area down safely is as hard as bringing it up safely (which, if you remember, took a while) and is harder than just keeping the system running.
The system is full of inductors, whose voltage drop is determined by the rate of change of the current through them. If you disconnect a transmission line, you are suddenly trying to force the current to 0, which puts all of the inductors at whatever voltage is necessary to make the current change more slowly. Generally, the way of making the current change more slowly is either to shoot a bolt of lightning across the gap you're creating or to melt your equipment into a conductive lump of metal, but this is only a temporary solution. Instead, the inductors (inside transformers and such) can melt down so that they aren't inductors any more and the current can change more quickly. Of course, when this happens, the next segment of transmission line is now not getting current, so it has the same problem.
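The relationship at work here is the standard inductor equation V = L * dI/dt. A rough sketch with made-up numbers (the inductance, current and switching times below are illustrative assumptions, not figures from any real line) shows why interrupting the current quickly is so violent:

```python
def inductor_voltage(inductance_h, delta_current_a, delta_time_s):
    """Average voltage across an ideal inductor: V = L * dI/dt."""
    if delta_time_s <= 0:
        raise ValueError("delta_time_s must be positive")
    return inductance_h * delta_current_a / delta_time_s


# Illustrative values only: 0.1 H carrying 1000 A that is forced to zero.
fast = inductor_voltage(0.1, -1000.0, 0.001)   # interrupted in 1 ms
slow = inductor_voltage(0.1, -1000.0, 1.0)     # ramped down over 1 s

print(f"1 ms interruption: about {abs(fast):,.0f} V")
print(f"1 s ramp-down:     about {abs(slow):,.0f} V")
```

The same current change spread over a thousand times longer produces a thousandth of the voltage, which is why a coordinated ramp-down is the safe way out.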
The only safe way to bring down the grid is by coordinating with the adjacent grids to carefully remove the load on the line you want to disable; but that's not really an option when the problem is that communication is out.
I agree. These SCADA systems can become quite complex. If you are interested, you can even read General Electric's brochures for the XA/21 system.