Tracking the Blackout Bug
Alien54 writes "This earlier Slash story cited a CNN news report on how the August blackout was preventable. But, as seen in this Security Focus article, things are not so simple. 'In the initial stages, nobody really knew what the root cause was,' says Mike Unum, manager of commercial solutions at GE Energy. 'We test exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug,' says Unum. 'I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software... you may never find the problem. I don't think that's unique to control systems or any particular vendor software.' Which leads to a number of other questions."
According to the SecurtyFocus article, the operators had no way of knowing, because the data wasn't "live." This is a common problem with SCADA systems--the systems will display the "last known-good value" if something goes offline. However, the system should also visibly identify the data as "out of service" or "offline," and this didn't seem to happen. That could be an issue at the server, or it could be something blamed on the people commissioning the XA/21 system (assuming the display is configurable enough to allow you to program it at this level).
Even so, there should have been sufficient watchdog messages between the client, the server, and the field hardware for the XA/21 to broadcast a general alarm along the lines of "I can't talk to the stinking field, so we're all flying blind here, you morons!" This is exactly the same as software in my industry (HVAC fire/security systems for large buildings), where if you lose communication to a subsystem or the field, you have to raise alarms all over the place.
The real question is how you could lose such comm and the operators had no visible indication that they were relying on old data. This sounds like a missed requirement, if not insufficient testing.
Tim
how can you respond to an incident? It just goes to show the need for multiple monitoring systems in mission critical systems.
Wireless News www.DailyWireless
Yes, yes it is. If a mime gets hit by a tree in the forest, does anyone care? Sometimes, no matter how much testing you do, shit just happens. It is a fact of life. Show me one perfect, bug free, piece of software. Stuff breaks all the time, we only notice it when it affects us. We take for granted sometimes how good we have it. Power in this country is extremely reliable. We act as if a bomb dropped when the power goes out. Some parts of the world do not have power, clean water, etc. We should think of that before we start whining about having to actually talk to each other, use candles, read books, etc.
I hate sigs.
You can't expect just testing to reveal all bugs in a program. Even a simple program would have to be fed completely random data constantly, in every different order and circumstance concievable, for a very long time, to reveal all bugs. That's just not a real option.
The only way to have bug-free software is to write it properly. You have to modularize and simplify everything down to the point that each one is easilly understandable, and it is easy to detect when one is providing a sensless answer (in other words, cross-checking every result). Then, you have to tie them all together in a robust but simple way.
I know it's far easier to say it than do it, but it seems like nobody even tries to do it these days. Even mission-critical systems are commonly built as a single monolithic program, and when you have a lot of things going on within a single program, with no checks of the sanity of the data going into or comming out of each component, there is no way to be 100% certain that the program is theoretically and genuinely perfect. Meanwhile, by modularizing everything, you can PROVE that it is actually perfect.
But this is really just the old Macrokernel vs. Microkernel arguement all over again. A Microkernel can be perfect, while a macrokernel can never be completely bug-free, but people just find the latter to be easier to write, and then spend hundreds times more man-hours finding and removing bugs, rather than spending (less, overall) time doing it correctly in the first place.
Oh yes, almost forgot, IMHO...
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Human being (n.): A genetically human, genetically distinct, functioning organism.
So, as far as I can figure, there are 24 hours in a day, and 365 days in a year, which equals about 8760 hours in a year (give or take).
Now then, 3 million hours divided by 8760 hours per year equals approximately 342 years, modulo 4070 hours (i.e. approximately 169 days...).
Now then... how the hell do they get the idea that they've been up-and-running for 342 years? Are they counting things in parallel? Even if they were counting end-user operational hours, the number should at least be a couple orders-of-magnitude higher, no?
3M online operational hours sounds like fuddy-duddy accounting to me... although, obviously I haven't looked over the books. I would be interested to see how they came up with this number.
Comment removed based on user account deletion
if(int(rand()*1e20)==31337){
blow_up();
} else {
do_your_work();
}
Now I can't imagine amount of testing in proprietary software that could reveal this example of malicious code. In open source one look at the code will reveal it. Of course not all cases are so obvious, but always reading the code should be used together with "testing the software". How do you know lots of proprietary software that IS close-source isn't i.e. a gatweway for terrorists? How do you know biggest companies' stuff isn't all trojans? It wouldn't be hard to hide it. Say your software is kind of server. It does its job okay unless it receives TCP packets starting with certain string. Then it just executes commands contained after that string. Boom. No amount of -testing- will reveal this.
And there are bugs that can be triggered once in several billion cases. Only looking at the code could fix them and explaining "we did a lot of tests" is bullshit.
I put a lot of iron, gum, different materials, C4, glass and some more together and it goes, I call it "a car" and I rode 1000's of kilometers okay. Now no amount of testing in all road conditions will reveal it contains the C4 explosives. Looking under the hood will reveal it really fast.
45 5F E1 04 22 CA 29 C4 93 3F 95 05 2B 79 2A B2
I'm not saying it's always feasible to test exhaustively, but don't say you did when you clearly didn't.
Also: "we had in excess of three million online operational hours in which nothing had ever exercised that bug"
Taken with the "exhaustively" statement, I'm thinking that whoever said these things doesn't understand QA very well. It's easy to write code that works well when everything's good, and it's often just as easy to test that. It's another thing entirely to write code that works well (or fails gracefully) when everything's wrong. And again, it's harder to test that.
-- Fratz, human
Actually, that's statistically untrue. We have, perhaps, the least reliable power system in all the countries of the first-world. Sure, 3rd-world countries have worse-off power systems, but the comparison isn't valid at all.
Since when does the hardship of others make an unreliable power system a plus? Some places may be worse, but so what? We pay a lot for power, and expect our money is being spent on making sure we DO NOT have many outages.
Meanwhile, in California, prices are high, and power was VERY unreliable. "Rolling Blackouts" anyone?
My point is this. If something is broken, we want to fix it. We don't want to sit around saying "Well, it isn't as broke as that one". If we do, pretty soon it will get worse, and worse, and worse, until we have no other countries to point at.
How about our medical system, and water utility? Should we accept thousands of deaths due to malpractice, or contaminated water, by just saying "Well, it's not as bad as country XYZ"? No, I don't think anyone would believe that, but it's really the same thing. Power outages do mean deaths, and do mean losses of lots of money. Businesses can't run, food can't be properly preserved, or even delivered. People die of heat-stroke, or hypothermia due to power loss. Ambulances can't get through dense traffic caused by traffic signals loosing power, etc.
A power outage is a lot more serious than people "whining" about not being able to watch TV... And yet you get moderated up anyhow... Amazing.
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
I think the nation/region would be served better if we stepped back a bit and took another look at more decentralised power generation as a full bore government encouraged option. Not as a complete replacement, but frankly, I see no reason we can't have millions more solar panels and wind generators out there. Economy of scale in manufacturing, spurring on even more R&D, etc, works for everything else it appears. And having a lot more points of production, spread out, would help to mitigate cascading failures, especially if islanding was more precise and easier to implement in smaller areas. Wind, were the average wind is adequate enough, is especially cost effective now, approaching coal burning costs per watt. Solar is nice where applicable by climate because it can make use of dead space going to waste, millions of roofs already there.
I like the "not all your eggs in one basket" approach to problems, and I believe in backups for everything.
Why don't we point out the real problem that likely caused this to happen. Energy deregulation in the first place.
I know I'll be jumped on by the free market types for daring to suggest this, but I'd rather have a regulated monopoly then a free-market for my life essential services anyday of the week. That article you linked is very interesting reading. Some quotes:
Of course it's the first quote that rings true with me. If deregulation is so friggen great then where is the cheap electric? Why can my Village sell me electric for $0.04/kWh with their regulated municipal power authority (while paying their workers Government rates and with Government benefits) when my girlfriend (who lives a whole two miles away) pays $0.14/kWh for electric supplied by a company that is supposedly part of the free market (a company that pays their employees crap and outsources their call center/billing functions to India). What's the problem with that picture?
Before energy deregulation the price of our electric was regulated by the PSC (Public Service Commission) and was fairly stable. The company that had the monopoly in this area made a set amount of profit (it wasn't a bad stock to pick up either -- you knew what you were getting), treated their employees well and charged a fair rate. Nowadays they treat their employees like crap, the stock has tanked because they are eating the price difference from their suppliers (otherwise we'd be paying about $0.20 kWh) and they are being raped by out of state suppliers that bought all of their generation capacity.
In another slightly related story the out of state company that bought one of their power plants sued the local township because they wanted the tax levy on the power plant reduced. They claimed that it wasn't worth what it used to be because they didn't plan on operating it (it was to be backup generation). After a three-year legal battle the township lost (ran out of money to pay the lawyers) and the tax levy was reduced by some 60%. Property and school taxes on
I want peace on earth and goodwill toward man.
We are the United States Government! We don't do that sort of thing.
the system will fail gracefully and safely.
A mission-critical system, by definition, cannot fail "safely", since it must not fail at all.
For an obscure race condition, this is undoubtedly true.
Unfortunately, that's kind of the nature of software... you may never find the problem.
This is sorta true, sorta false, and definitely misleading.
I don't think that's unique to control systems or any particular vendor software.
No, it's not unique; bugs that may never be found are rampant in most varieties of software. What's false -- tragically, crushingly false -- is the presumption that these unfindable bugs are therefore inevitable. They are not.
If there's a class of bugs that's hard to test for -- and of course there are many such classes -- the prudent thing to do is to find development methodologies that skirt those bugs entirely. If you don't put in so many bugs in the first place, you obviously don't have to work so hard trying to find and fix them.
Yes, we do not hear about them, because they do not exist.
Sure, it was First Energy's lines that failed initially, but if it wasn't First Energy, some other utility would have failed eventually. The engineering and the legal descriptions of the current electrical generation and distriubtion system in North America are at odds with one another.
There's a good technical discussion on the failings of the power grid that may interest you.
I'm going to call bullshit on that. My Village has been running municipal electric since the 1910s. It is self-sustaining (i.e: takes in enough money to operate without using tax dollars) and geared towards the future. They didn't annex any lines or equipment from private companies -- it was built from the ground up. They don't own their own generator plants anymore (last one went offline in the 50s) -- they buy it from the wholesale grid just like everybody else. And yet somehow they are able to provide it at $0.04 kWh without screwing over their employees or customers. This municipal grid feeds everybody in town from houses to streetlights to factories. I'll grant you they don't have to serve a rural area but rural areas aren't automatically four to five times as expensive -- if that was the case then why do my parents, girlfriend, grandparents and friends all pay the same high rate even though they all live in suburban or urban areas with the exception of Grandma?
I find it hard to feel sorry for that one township you sited, since there are many townships without that high taxed power plant around.
Perhaps you'd feel for them more if it was your friends and family that lived there. Perhaps you'd feel for them if half of them had previously worked there before being laid off by the company that bought the plant and screwed the town over. Then the company fired the plant back up and brought in it's own people from out of state to run things. But that's ok -- last I heard New York state was going to go after them. Care to place bets on who will run out of money first in that legal battle?
Wholesale rates have gone up everywhere. I live in an area where there never was regulation, and we face exactly the same higher rates. Coal prices are higher.
So coal prices are the reason why Enron and it's buddies were slashing power production at their plants in California to drive up the wholesale prices? I wish my state would just tell the Feds to fuck off and regulate our own power industry. Everybody was better served when it was regulated -- from the power company itself (I never heard them complaining when they were posting a 15-20% profit margin) to the consumers. The product was more reliable. Those are facts. I just don't think it's a good idea to allow essential services that you really can't get elsewhere to be run by unregulated industries. What's your option if your power company is screwing you (and they are screwing you in all likelihood because if they don't they won't survive because they are being screwed by the wholesalers)? Not using electric? Try that one in a New England winter.
While I'm on this rant I might also point out the phone and cable companies. Before the deregulation of the phone companies my standard phone service was $15-$20 (before long distance charges). Now it's $35 -- or would be if I still had a landline phone. Before the deregulation of the cable industry (forced on my state by the Feds) we had tons of local cable companies and basic cable (50-70 channels depending on where you lived) was $15 a month. Now with Time Warner it's $45 a month. The only thing you can count on Time Warner for is to raise their rates once a year. You can set your watch by it.
I want peace on earth and goodwill toward man.
We are the United States Government! We don't do that sort of thing.
Now, for an example. I stress tested a database I am in the process of building for Mercedes, I made the machine come to crawl. I did it to a dual cpu server, a quad cpu server, and a 16 cpu server. Guess what? They all behaved exactly the same as the system grew. Now scale it up to the DB/2 cluster that it will actually be working on. I do the same thing and guess what? Yep, the exact same result.
If testing fails to produce an outcome that brings a fault then there is a flaw in the testing procedure. The real world can have more connections, but I don't care, software can be 100% bug free. I constantly hear programmers saying that every complex piece of software will have bugs. Its poor planning, and management, even if the programmer is incredibly gifted they will make mistakes without proper planning.Yes, in the real world there are unforeseen variables but as systems because critical they should be undergoing more testing to ensure such things don't occur. But as I said in my original post. One outage in recent years is hardly a trend and is more likely just blown out of proportion.
Among other things, it maintains a synchronized clock (8 kHz base), even across oceans and continents.
It's actually plesiochronus, and only synchronized within certain (relatively large) regions. And I don't know where you're getting that 8 kHz figure from.
Basic relativity (not to mention propogation) will tell you that what you're describing is impossible.
design the system in such a way that it is provable it executes this design
Unfortunately, if you actually come out of the library and the computer labs, software has to be done -- yesterday. Flawless, provable code would cause most software houses to go bankrupt. It's a fact of life...
The first thing they teach you in a software testing course is that testing cannot guarantee the absence of bugs. The only way you can guarantee, through testing alone, that your program is error-free is to exhaustively test every possible "input" (combination of external inputs and internal state) and check them. When was the last time you wrote a program with a finite (and tractable) input space?
If you need 100% reliable programs, you'll need to prove them correct, and that's enormously difficult to do, and doesn't help if the bug is the result of a flaw in the program's requirement specification rather than an incorrect implementation of that specification.
What testing *can* do is provide estimates of a system's reliability, and in the real world that's all you're going to get.
Any sufficiently advanced technology is indistinguishable from a rigged demo
--Andy Finkel (J. Klass?)