Tracking the Blackout Bug

Software bug was just one part of bigger problem by bonnyman · 2004-04-10 07:04 · Score: 5, Informative

The software bug was just one piece of a much bigger problem; I wouldn't want to overstate its role. There were many other factors; here are just a few:

Poor vegetation management probably played an even bigger role as overloaded power lines warmed up, expanded and sagged into trees and bushes that were supposed to have been cut back.

Poor communications between utilities played a major role.

This whole section of the transmission system was known to be unstable.

An inadequate regulatory structure lacked teeth to deal with known problems.

Lack of adequate transmission line capacity.

If all these other problems hadn't been in place, the software bug might never have surfaced. And certainly, the problems would have been contained within a much smaller area -- maybe just FirstEnergy's service area.

An article featured on Slashdot last year lays out the underlying complexity of the power grid very well: "The World's Largest Machine"

--

Al Bonnyman
Community Broadband Networks

Re:Software bug was just one part of bigger proble by Raindance · 2004-04-10 07:07 · Score: 4, Interesting

I agree that there's more to this than just one line of code, as some folks seem to believe; I think referring to it as 'one bug' is rather misleading.

As well refer to the things leading up to WWII as 'one problem'.

For the 21st century... by Anonymous Coward · 2004-04-10 07:10 · Score: 3, Funny

If a bug exists in the code, but it's never triggered, is it really a bug?

Re:For the 21st century... by Raven42rac · 2004-04-10 07:14 · Score: 4, Insightful

Yes, yes it is. If a mime gets hit by a tree in the forest, does anyone care? Sometimes, no matter how much testing you do, shit just happens. It is a fact of life. Show me one perfect, bug free, piece of software. Stuff breaks all the time, we only notice it when it affects us. We take for granted sometimes how good we have it. Power in this country is extremely reliable. We act as if a bomb dropped when the power goes out. Some parts of the world do not have power, clean water, etc. We should think of that before we start whining about having to actually talk to each other, use candles, read books, etc.

--
I hate sigs.

Re:For the 21st century... by evilviper · 2004-04-10 08:15 · Score: 4, Insightful

Power in this country is extremely reliable.

Actually, that's statistically untrue. We have, perhaps, the least reliable power system in all the countries of the first-world. Sure, 3rd-world countries have worse-off power systems, but the comparison isn't valid at all.

Some parts of the world do not have power, clean water, etc. We should think of that before we start whining about having to actually talk to each other, use candles, read books, etc.

Since when does the hardship of others make an unreliable power system a plus? Some places may be worse, but so what? We pay a lot for power, and expect our money is being spent on making sure we DO NOT have many outages.

Meanwhile, in California, prices are high, and power was VERY unreliable. "Rolling Blackouts" anyone?

My point is this. If something is broken, we want to fix it. We don't want to sit around saying "Well, it isn't as broke as that one". If we do, pretty soon it will get worse, and worse, and worse, until we have no other countries to point at.

How about our medical system, and water utility? Should we accept thousands of deaths due to malpractice, or contaminated water, by just saying "Well, it's not as bad as country XYZ"? No, I don't think anyone would believe that, but it's really the same thing. Power outages do mean deaths, and do mean losses of lots of money. Businesses can't run, food can't be properly preserved, or even delivered. People die of heat-stroke, or hypothermia due to power loss. Ambulances can't get through dense traffic caused by traffic signals losing power, etc.

A power outage is a lot more serious than people "whining" about not being able to watch TV... And yet you get moderated up anyhow... Amazing.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant

Re:For the 21st century... by Mark_in_Brazil · 2004-04-10 10:07 · Score: 4, Informative

Meanwhile, in California, prices are high, and power was VERY unreliable. "Rolling Blackouts" anyone?

Good point. I live in Brazil, and there's a real sick tendency among people here to kiss American ass and fantasize that the United States is a place where everything works perfectly and nobody has to pay for anything. When they do that, I chuckle and point out things like the difference in the electrical power systems in the two countries.
NOTE: I AM NOT SAYING BRAZIL IS BETTER THAN THE USA... JUST THAT IT'S NOT WORSE EITHER.
Brazil's electrical power, as of 2001, was about 97% hydroelectric. Because of years of below-average rainfall, this system was threatened, and in 2001, we were told there might be "rolling blackouts" here (except that the Brazilian government, unlike the US government, was honest enough to call it what it was: power rationing). We ended up not getting any "rolling blackouts," and a regression toward the mean in rainfall has left us sufficiently well off that we don't even have to use the new polluting thermo plants that were built around the time of the crisis. Electrical power here is cheap and reliable, especially compared to places like California, where a lot of my friends had to endure "rolling blackouts" because the folks at the deregulated power companies decided to put more money on their bottom line by not investing in infrastructure upgrades and maintenance. So the execs who made those decisions increased profits in the short term, increasing their bonuses and the value of their stock. When the $#!+ hit the fan, guess who had to pay, both in damages from "rolling blackouts" and in higher rates? The consumers, of course!
The only power problems I've had here in São Paulo were a neighborhood issue, not a city-wide, state-wide, or nation-wide problem. Basically, the new condo across the street overloaded the local grid 3 times in a 2-week span. The worst thing is that the new condo has its own generator, so the newcomers would knock out the neighborhood power and then not even notice, because their generator kicked in. Meanwhile, those of us who had already been in the neighborhood were screwed. Even those problems have been resolved, though. With even more people moving into the new condo, it's been about 6 weeks since we had a problem. The power companies here are pretty efficient. Yeah, I'd have liked for somebody to stop people from moving into the new condo until the local power grid was adequately updated, but they responded pretty quickly once the problem did present itself in an inconvenient way.

--Mark

--
"It is nice to know that the computer understands the problem. But I would like to understand it too." --Eugene Wigner

B Method? by starseeker · 2004-04-10 07:12 · Score: 5, Interesting

"the bug was unmasked as a particularly subtle incarnation of a common programming error called a "race condition," triggered on August 14th by a perfect storm of events and alarm conditions on the equipment being monitored. The bug had a window of opportunity measured in milliseconds. "

Isn't this the type of problem the B Method (and maybe the Z language too) are designed to address? Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.

That doesn't mean the DESIGN is flawless, of course. But if we start engineering software on as many levels as we can, mightn't things improve? Normal software development and testing would never have found a critical bug with rare trigger conditions and a millisecond window. If you need precision on that level, you need (for starters) to KNOW your implementation of your design is sound, and preferably the code you are running exactly implements the proven logic. Isn't this what the B Method was created for?

--
"I object to doing things that computers can do." -- Olin Shivers, lispers.org

Re:B Method? by mccalli · 2004-04-10 07:38 · Score: 4, Interesting

Isn't this the type of problem the B Method (and maybe the Z language too) are designed to address? Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.
Ye gods, you've frightened the hell out of me with reference to Z. I'd almost entirely forgotten it, and had hoped its cold corpse would lie in the ground undisturbed, undiscovered and most importantly of all unreferenced until the end of time. Still, "That is not dead which may eternal lie"...
Z is a beautiful way to mathematically prove that you have design bugs at the highest level possible. You can then design your unit tests around those bugs, and confirm that they're valid.
That's it. It provides nothing else that unit testing on its own couldn't do, with the exception of a few salaries and a research grant here and there. Whilst you can mathematically prove implementations of certain designs, the vast majority of designs have more complex interactions. Try using Z for a multithreaded real-time environment for example - my Software Engineering tutor at the time, Iain Sommerville (well known in the field due to his books, oh and 'at the time' would be ~1993), basically said that Z just breaks down in those circumstances. I wouldn't know - I personally had no clue how to even make it begin in those circumstances, let alone break down.
Please confine Z to camp-fire ghost stories used to scare new programmers. It always was a living hell, and it really shouldn't be resurrected now.
Cheers,
Ian

Re:B Method? by Orne · 2004-04-10 07:39 · Score: 4, Interesting

SCADA systems transport data samples. My company's system collects from several hundred thousand meters, about half of which are expected to send in a sample about once every 10 seconds, some as fast as once every two seconds. The concept is that you have a communications buffer that collects the data: the link writes to the memory while the other EMS applications (about a dozen) read from the memory.
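A minimal sketch of that one-writer, many-readers buffer, in Python. Everything here (ScanBuffer, write_scan, snapshot, run_demo, the meter and reader counts) is a made-up illustration, not anything from a real EMS product. The point is only that the link and the reading applications have to agree on a lock, otherwise a reader can see half of one scan and half of the next.

```python
import threading


class ScanBuffer:
    """Hypothetical shared buffer: one telemetry link writes, several apps read."""

    def __init__(self, meter_ids):
        self._lock = threading.Lock()
        self._values = {meter_id: 0 for meter_id in meter_ids}

    def write_scan(self, scan_number):
        # The whole scan is written under the lock, so no reader can
        # observe a mix of old and new samples.
        with self._lock:
            for meter_id in self._values:
                self._values[meter_id] = scan_number

    def snapshot(self):
        # Readers copy under the same lock and then work on their own copy.
        with self._lock:
            return dict(self._values)


def run_demo(scans=200, readers=4, meters=50):
    """Run one writer against several readers; count inconsistent snapshots."""
    buffer = ScanBuffer(range(meters))
    mixed_snapshots = []
    done = threading.Event()

    def reader():
        while not done.is_set():
            snap = buffer.snapshot()
            if len(set(snap.values())) != 1:  # samples from different scans
                mixed_snapshots.append(snap)

    threads = [threading.Thread(target=reader) for _ in range(readers)]
    for thread in threads:
        thread.start()
    for scan_number in range(1, scans + 1):
        buffer.write_scan(scan_number)
    done.set()
    for thread in threads:
        thread.join()
    return len(mixed_snapshots), buffer.snapshot()
```

With the lock in place run_demo() should never report a mixed snapshot. Take the two "with self._lock" lines out and it may or may not report one on any given run, which is exactly what makes this kind of bug hard to catch.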

Now admittedly, FirstEnergy's system is a little smaller in territory, but I wonder if their mergers over the recent years (Cleveland Electric and Ohio Edison became FE, and then proceeded to take Toledo Edison and GPU of PA) have outpaced the collection capabilities of their mainframe (which was already at the end of its life and was scheduled to be replaced). That could account for some of the "slowing" that the G.E. testers said they had to do to make the race condition appear.

Re:The problem with SCADA systems by Vancorps · 2004-04-10 07:13 · Score: 5, Interesting

This all reminds me of the movie Resident Evil where they shut down power and all the doors unlock when power is restored.

You bring up a great point about failure states. I work for several large hotels and the fire control systems are the ones that alert whenever there is any problem of any kind largely because any problem of any kind needs to be addressed immediately so it makes sense.

I would think power systems would think along the same lines since the odds are, ANY failure whatsoever needs the immediate attention of the engineers that maintain the system. This is not a requirement for all software but when it comes to such critical services why doesn't everybody follow the same practice? It seems so blatantly obvious that alarms should have been raised.

Also, in situations where you don't work on a live environment you can always create a test environment that is for all intents and purposes "live". For web development work I do I have a testing domain which is used to test sites to ensure that because they work here in my lab they will work when I hand them off to the client. It's 100% accurate, I've seen it done with countless other systems, so why wasn't it done here?

The American jackasses who blamed Canada by Kevin+Mitnick · 2004-04-10 07:14 · Score: 5, Interesting

Did anyone ever retract their statements? I know the NY Mayor was pretty quick to blame us Canucks.

Re:The American jackasses who blamed Canada by spinkham · 2004-04-10 08:11 · Score: 3, Funny

We blame you, you blame the Newfies. It's the pecking order around here, deal with it ;-)

--
Blessed are the pessimists, for they have made backups.

Testing isn't the answer... by evilviper · 2004-04-10 07:25 · Score: 5, Insightful

You can't expect just testing to reveal all bugs in a program. Even a simple program would have to be fed completely random data constantly, in every different order and circumstance conceivable, for a very long time, to reveal all bugs. That's just not a real option.

The only way to have bug-free software is to write it properly. You have to modularize and simplify everything down to the point that each one is easily understandable, and it is easy to detect when one is providing a senseless answer (in other words, cross-checking every result). Then, you have to tie them all together in a robust but simple way.

I know it's far easier to say it than do it, but it seems like nobody even tries to do it these days. Even mission-critical systems are commonly built as a single monolithic program, and when you have a lot of things going on within a single program, with no checks of the sanity of the data going into or coming out of each component, there is no way to be 100% certain that the program is theoretically and genuinely perfect. Meanwhile, by modularizing everything, you can PROVE that it is actually perfect.
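One way to read "checks of the sanity of the data" going in and out of each component, as a rough Python sketch. The checked decorator, line_loading_percent, and all the limits (5000 amps, 200 percent, a 1000 amp rating) are invented for illustration. Note that a check like this only detects a senseless value at the module boundary; it is not a proof of anything.

```python
import functools


def checked(precondition, postcondition):
    """Wrap a component so bad inputs and senseless outputs raise immediately."""

    def decorate(component):
        @functools.wraps(component)
        def wrapper(value):
            if not precondition(value):
                raise ValueError(f"{component.__name__}: rejected input {value!r}")
            result = component(value)
            if not postcondition(value, result):
                raise RuntimeError(f"{component.__name__}: senseless output {result!r}")
            return result

        return wrapper

    return decorate


# Hypothetical component: percent loading of a line with an assumed 1000 A rating.
@checked(
    precondition=lambda amps: 0 <= amps <= 5000,
    postcondition=lambda amps, percent: 0 <= percent <= 200,
)
def line_loading_percent(amps, rating_amps=1000):
    return 100.0 * amps / rating_amps
```

Calling line_loading_percent(500) passes both checks, a negative reading is rejected on the way in, and a reading that produces an implausible loading is caught on the way out instead of being handed to the next module.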

But this is really just the old Macrokernel vs. Microkernel argument all over again. A Microkernel can be perfect, while a macrokernel can never be completely bug-free, but people just find the latter to be easier to write, and then spend hundreds of times more man-hours finding and removing bugs, rather than spending (less, overall) time doing it correctly in the first place.

Oh yes, almost forgot, IMHO...

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant

Reasons for power blackouts by pcraven · 2004-04-10 07:27 · Score: 4, Interesting

I've been reading several papers on this for a grad class I'm taking. One of the several problems is no government control. If a power outage might be prevented by shedding some load (turning out power to some people), no company wants to step up to the plate and be the one to turn out the power to their customers. So they luck out, or they have a massive power outage.

This paper (click on the PDF link) has a good summary of the problems in keeping power outages from happening again.

World's largest machine by stefanb · 2004-04-10 07:30 · Score: 4, Interesting

An article featured on Slashdot last year lays out the underlying complexity of the power grid very well: "The World's Largest Machine"

OK, it's nitpicking, but the largest machine is arguably the telephone system. Among other things, it maintains a synchronized clock (8 kHz base), even across oceans and continents.

Race conditions are nasty ... by cagle_.25 · 2004-04-10 07:36 · Score: 5, Insightful

As you programmers all know, avoiding race conditions is really difficult. The fellow Neumann quoted in the article who said

But Peter Neumann, principal scientist at SRI International and moderator of the Risks Digest, says that the root problem is that makers of critical systems aren't availing themselves of a large body of academic research into how to make software bulletproof.

is overly optimistic; it's theoretically impossible to write a general test to find all race conditions in code. This is a variant of the Halting Problem.
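For anyone who hasn't been bitten by one, here is a toy model of a race, sketched in Python. It is not the FirstEnergy bug and not a general detector; it just enumerates every possible interleaving of two tiny made-up "threads" that each increment a shared counter with a separate read and write. The names (interleavings, run, count_lost_updates) and the "work" padding steps are assumptions for the demo.

```python
from itertools import combinations


def interleavings(len_a, len_b):
    """Yield every schedule as a string of 'A'/'B' picks, one per step."""
    total = len_a + len_b
    for a_slots in combinations(range(total), len_a):
        a_set = set(a_slots)
        yield "".join("A" if i in a_set else "B" for i in range(total))


def run(schedule, program_a, program_b):
    """Execute both programs step by step in the given order."""
    counter = 0
    local = {"A": 0, "B": 0}
    position = {"A": 0, "B": 0}
    programs = {"A": program_a, "B": program_b}
    for who in schedule:
        step = programs[who][position[who]]
        position[who] += 1
        if step == "read":
            local[who] = counter       # non-atomic increment, first half
        elif step == "write":
            counter = local[who] + 1   # non-atomic increment, second half
        # "work" steps touch no shared state
    return counter


def count_lost_updates(padding):
    """Return (schedules that lose an increment, total schedules)."""
    program_a = ["read", "write"] + ["work"] * padding
    program_b = ["work"] * padding + ["read", "write"]
    total = bad = 0
    for schedule in interleavings(len(program_a), len(program_b)):
        total += 1
        if run(schedule, program_a, program_b) != 2:
            bad += 1
    return bad, total
```

With no padding, 4 of the 6 schedules lose an update. With padding of 3, only 25 of 252 do, and the bad fraction keeps shrinking as the unrelated work grows. Exhaustive enumeration like this is only feasible for toy programs, which is roughly the practical side of the point above.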

--
Human being (n.): A genetically human, genetically distinct, functioning organism.

Re:Race conditions are nasty ... by platipusrc · 2004-04-10 08:19 · Score: 3, Informative

how do you have a large nondeterministic?

hint: NP-hard is a problem that is NP-complete, or worse. An NP-hard problem does not have to be solvable. NP in this context stands for nondeterministic polynomial (with reference to time bounds). NP means that a problem can be solved in polynomial time with an infinitely parallel system. NP-complete problems are at least as hard as all other NP problems.

Sorry, it just bugs me whenever people try to talk about theory of CS and use "non-polynomial" or something else for NP.

--
And the muscular cyborg German dudes dance with sexy French Canadians

Software ENGINEERING by Anonymous Coward · 2004-04-10 07:39 · Score: 4, Interesting

If I want to build a large structure (bridge or building) where it is possible that public safety is at issue, I had better have an engineer's signature on the drawings.

This case seems like a real good argument for having the same requirement for software.

Good engineering practice would probably have prevented this. A simple example of such a system would be a burglar/fire alarm panel. The system is self-checking. If any part of the system isn't working (ie. someone cuts a wire), then that causes an alarm.

I realize that there will be strange undetectable bugs in software but if the system as a whole is properly engineered, the system will fail gracefully and safely.
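The alarm-panel idea carries over to software as a heartbeat: the alarm path has to keep proving it is alive, and silence counts as a fault. A small Python sketch follows; AlarmChannelMonitor, its method names, and the 30 second timeout in the example are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class AlarmChannelMonitor:
    """Treats silence as a fault: the alarm path must prove it is alive."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = None

    def heartbeat(self, now):
        # Called by the alarm process every time it completes a cycle.
        self.last_heartbeat = now

    def status(self, now):
        # Called by an independent watcher, not by the alarm process itself.
        if self.last_heartbeat is None:
            return "FAULT: no heartbeat ever received"
        age = now - self.last_heartbeat
        if age > self.timeout_seconds:
            return f"FAULT: alarm channel silent for {age:.0f}s"
        return "OK"


monitor = AlarmChannelMonitor(timeout_seconds=30)
monitor.heartbeat(now=100)
print(monitor.status(now=120))  # OK
print(monitor.status(now=200))  # FAULT: alarm channel silent for 100s
```

The design choice that matters is that the watcher runs separately from the thing being watched, the same way the panel notices a cut wire without the sensor's cooperation.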

Re:Software ENGINEERING by Orne · 2004-04-10 08:45 · Score: 3, Interesting

The two systems you describe are fundamentally different from the design of this alarming system. In fire or safety, the "reading" is the voltage of the closed loop wire itself; 12 volts connected, 0 volts open.

Now imagine if you have a layer in between; you want to monitor the fire status of a complex of warehouses from a single room several miles away. Run the signals from all of the individual buildings through analog-to-digital conversion, transport the data to a common computer, and view the data there. Figure you have several hundred buildings you're watching at once, and now you're getting closer in scale to how the grid dispatchers get their data.

Now imagine that the computer's software back at the main station reads all these meters, and if a line's open (say you're tracking window openings for security), it writes an alarm to a text log on the screen; on a good day, you don't get any alarms. Now suppose the driver that writes the alarms to the screen hangs; since you weren't expecting any alarms, you're not that concerned that you aren't seeing anything. That's pretty much what caught FirstEnergy for those 3 hours that afternoon, while the system was failing and they didn't realize they needed to act.

Re:Software ENGINEERING by sjames · 2004-04-10 09:52 · Score: 3, Interesting

At first glance, that can seem like a good idea, but are you prepared to pay for that signoff from each engineer whenever you install a piece of software?

A PE signs off on each particular instance of a design taking intended use, site and other construction into account. If you then build elsewhere, you need a new signoff. If you make any significant change (including adding other structural elements to the design, that is, installing more software), you'll need a new signoff. Add a new network driver, another signoff. Upgrade the CPU? You guessed it!

Some software is poorly designed and crash prone. Other software is well designed but cannot be signed off on because it might be installed on nearly anything that pretends to be a compatible platform.

The one justification for that sort of signoff is in situations where a bug will kill someone. Even then, the system should be divided into critical and auxiliary parts to limit what must be signed off on.

Autopilots work that way. You have a small and relatively simple part that assures safe conditions, is extensively tested, and rarely changed. Another portion is more frequently updated, attempts to optimize the flight and provides a nicer interface. The latter can fail completely and the plane will continue to fly (possibly with poor fuel economy and the pilot navigating manually, but it won't fall out of the sky).
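The split described above might look something like this in miniature (Python, purely illustrative). The function names, the pitch limits, and the fallback value are assumptions made up for the sketch and have nothing to do with any real autopilot.

```python
import math


def safety_core(requested_pitch_deg, min_pitch_deg=-10.0, max_pitch_deg=15.0):
    """Small, rarely changed part: clamp any command into the safe envelope."""
    return max(min_pitch_deg, min(max_pitch_deg, requested_pitch_deg))


def command_pitch(optimizer, sensor_state, fallback_pitch_deg=0.0):
    """Ask the frequently updated optimizer, but never trust it blindly."""
    try:
        requested = float(optimizer(sensor_state))
        if not math.isfinite(requested):
            requested = fallback_pitch_deg
    except Exception:
        # The optimizer may fail completely; the core keeps flying.
        requested = fallback_pitch_deg
    return safety_core(requested)
```

Only safety_core would need the heavy testing and signoff. The optimizer passed in can crash, return garbage, or ask for something outside the envelope, and the command that comes out still stays within the limits.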

There are many tradeoffs. In some sense, many small distributed systems are more robust than centralized control. However, it's a lot easier to create a chaotic system that way. If you do, you won't know until the system falls into a weird state without warning.

Bug free! by Ghoser777 · 2004-04-10 07:43 · Score: 4, Funny

int main()
{
    return 0;
}

Because I have shown you bug free software, does that invalidate the rest of your argument?

Matt Fahrenbacher

--
James Tiberius Kirk: "Spock, the women on your planet are logical. No other planet in the galaxy can make that claim."

Additional Information by Orne · 2004-04-10 07:47 · Score: 3, Interesting

Oddly enough, while writing a comment to another user's message, I threw some info in Google to learn about FirstEnergy's EMS system, and found this other SecurityFocus story in February 2004, which gives more raw facts than this newer story.

"DiNicola said Thursday that the company, working with GE and energy consultants from Kema Inc., had pinned the trouble on a software glitch by late October and completed its fix by Nov. 19..."

"With the software not functioning properly at that point, data that should have been deleted were instead retained, slowing performance, he said. Similar troubles affected the backup systems. " This dovetails well with why the testers had to "slow" their testing to make the race condition appear.

Re:342 years of online operational hours? by Creepy+Crawler · 2004-04-10 07:58 · Score: 3, Interesting

342/x

x = "how many reactors they have in operation"
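Spelling out the arithmetic, on the assumption that the 342 in the subject comes from the "three million online operational hours" figure quoted elsewhere in this thread. The installation count x is not given anywhere, so the 100 below is just a placeholder.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def years_per_installation(total_hours, installations):
    """Average years of operation per installation for a pooled hour count."""
    return total_hours / HOURS_PER_YEAR / installations


print(round(years_per_installation(3_000_000, 1), 1))    # 342.5
print(round(years_per_installation(3_000_000, 100), 1))  # 3.4 (hypothetical x = 100)
```

So the pooled number sounds enormous, but spread over many installations it can amount to only a few years of running time each.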

--

Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37

Testing vs RTFS. Proprietary vs open. by SharpFang · 2004-04-10 08:02 · Score: 4, Insightful

if (int(rand() * 1e20) == 31337) {
    blow_up();
} else {
    do_your_work();
}

Now I can't imagine any amount of testing in proprietary software that could reveal this example of malicious code. In open source one look at the code will reveal it. Of course not all cases are so obvious, but reading the code should always be used together with "testing the software". How do you know lots of proprietary software that IS closed-source isn't, e.g., a gateway for terrorists? How do you know biggest companies' stuff isn't all trojans? It wouldn't be hard to hide it. Say your software is a kind of server. It does its job okay unless it receives TCP packets starting with a certain string. Then it just executes commands contained after that string. Boom. No amount of -testing- will reveal this.
And there are bugs that can be triggered once in several billion cases. Only looking at the code could fix them and explaining "we did a lot of tests" is bullshit.
I put a lot of iron, gum, different materials, C4, glass and some more together and it goes, I call it "a car" and I rode 1000's of kilometers okay. Now no amount of testing in all road conditions will reveal it contains the C4 explosives. Looking under the hood will reveal it really fast.
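To put a rough number on "once in several billion cases": assuming each test run independently hits the trigger with probability 1 in N (a simplification, real triggers are rarely that uniform), the chance that a test campaign ever sees it is 1 - (1 - 1/N)^runs. A Python sketch, with chance_of_hitting as a made-up helper name:

```python
import math


def chance_of_hitting(trigger_odds, test_runs):
    """Probability that at least one of test_runs independent runs hits
    a trigger that fires once in trigger_odds cases (trigger_odds > 1)."""
    # log1p/expm1 keep precision when 1/trigger_odds is tiny.
    return -math.expm1(test_runs * math.log1p(-1.0 / trigger_odds))


# A million random test runs against a one-in-a-billion trigger:
print(chance_of_hitting(1e9, 1e6))
```

That comes out to roughly a 0.1% chance of ever seeing the bug in testing, so passing a million runs says very little about whether it is there.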

--
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2

"We test exhaustively..." by Fratz · 2004-04-10 08:12 · Score: 4, Insightful

Um, no you don't. By definition, if you tested exhaustively, you'd have found everything that could possibly go wrong with whatever you tested.

I'm not saying it's always feasible to test exhaustively, but don't say you did when you clearly didn't.

Also: "we had in excess of three million online operational hours in which nothing had ever exercised that bug"

Taken with the "exhaustively" statement, I'm thinking that whoever said these things doesn't understand QA very well. It's easy to write code that works well when everything's good, and it's often just as easy to test that. It's another thing entirely to write code that works well (or fails gracefully) when everything's wrong. And again, it's harder to test that.

--
-- Fratz, human

Re:Software bug was just one part of bigger proble by Shakrai · 2004-04-10 08:20 · Score: 5, Insightful

The software bug was just one piece of a much bigger problem; I wouldn't want to overstate its role. There were many other factors; here are just a few:

Why don't we point out the real problem that likely caused this to happen. Energy deregulation in the first place.

I know I'll be jumped on by the free market types for daring to suggest this, but I'd rather have a regulated monopoly than a free-market for my life essential services any day of the week. That article you linked is very interesting reading. Some quotes:

Prior to the implementation of Federal Energy Regulatory Commission Order 888, which greatly expanded electricity trading, the cost of electricity, excluding fuel costs, was gradually falling. However, after Order 888, and some retail deregulation, prices increased by about 10%, costing consumers $20 billion a year.

"Under the new system, the financial incentive was to run things up to the limit of capacity," explains Carreras. In fact, energy companies did more: they gamed the system. Federal investigations later showed that employees of Enron and other energy traders "knowingly and intentionally" filed transmission schedules designed to block competitors' access to the grid and to drive up prices by creating artificial shortages. In California, this behavior resulted in widespread blackouts, the doubling and tripling of retail rates, and eventual costs to ratepayers and taxpayers of more than $30 billion. In the more tightly regulated Eastern Interconnect, retail prices rose less dramatically.
In the four years between the issuance of Order 888 and its full implementation, engineers began to warn that the new rules ignored the physics of the grid. The new policies "do not recognize the single-machine characteristics of the electric-power network," Casazza wrote in 1998. "The new rule balkanized control over the single machine," he explains. "It is like having every player in an orchestra use their own tunes."
Equally important, the frequency stability of the grid rapidly deteriorated, with average hourly frequency deviations from 60 Hz leaping from 1.3 mHz in May 1999, to 4.9 mHz in May 2000, to 7.6 mHz by January 2001. As predicted, the new trading had the effect of overstressing and destabilizing the grid.

Of course it's the first quote that rings true with me. If deregulation is so friggen great then where is the cheap electric? Why can my Village sell me electric for $0.04/kWh with their regulated municipal power authority (while paying their workers Government rates and with Government benefits) when my girlfriend (who lives a whole two miles away) pays $0.14/kWh for electric supplied by a company that is supposedly part of the free market (a company that pays their employees crap and outsources their call center/billing functions to India). What's the problem with that picture?

Before energy deregulation the price of our electric was regulated by the PSC (Public Service Commission) and was fairly stable. The company that had the monopoly in this area made a set amount of profit (it wasn't a bad stock to pick up either -- you knew what you were getting), treated their employees well and charged a fair rate. Nowadays they treat their employees like crap, the stock has tanked because they are eating the price difference from their suppliers (otherwise we'd be paying about $0.20/kWh) and they are being raped by out of state suppliers that bought all of their generation capacity.

In another slightly related story the out of state company that bought one of their power plants sued the local township because they wanted the tax levy on the power plant reduced. They claimed that it wasn't worth what it used to be because they didn't plan on operating it (it was to be backup generation). After a three-year legal battle the township lost (ran out of money to pay the lawyers) and the tax levy was reduced by some 60%. Property and school taxes on

--
I want peace on earth and goodwill toward man.
We are the United States Government! We don't do that sort of thing.

Re:The problem with SCADA systems by miu · 2004-04-10 09:22 · Score: 4, Insightful

For web development work I do I have a testing domain which is used to test sites to ensure that because they work here in my lab they will work when I hand them off to the client. It's 100% accurate, I've seen it done with countless other systems, so why wasn't it done here?

Mostly because web systems are still toys compared to real systems.

These systems get real and very intensive testing in labs as close to live as they can get. Even once they knew the conditions and affected subsystems it took the dev and testing teams months to recreate this bug in the lab. The lab is never just like real life, it cannot be - because even real life now is not always the real life of 10 seconds ago.

--

[Set Cain on fire and steal his lute.]

bugs are not inevitable by ummit · 2004-04-10 10:21 · Score: 4, Insightful

We test exhaustively... I'm not sure that more testing would have revealed that.

For an obscure race condition, this is undoubtedly true.

Unfortunately, that's kind of the nature of software... you may never find the problem.

This is sorta true, sorta false, and definitely misleading.

I don't think that's unique to control systems or any particular vendor software.

No, it's not unique; bugs that may never be found are rampant in most varieties of software. What's false -- tragically, crushingly false -- is the presumption that these unfindable bugs are therefore inevitable. They are not.

If there's a class of bugs that's hard to test for -- and of course there are many such classes -- the prudent thing to do is to find development methodologies that skirt those bugs entirely. If you don't put in so many bugs in the first place, you obviously don't have to work so hard trying to find and fix them.
