Tracking the Blackout Bug
Alien54 writes "This earlier Slash story cited a CNN news report on how the August blackout was preventable. But, as seen in this Security Focus article, things are not so simple. 'In the initial stages, nobody really knew what the root cause was,' says Mike Unum, manager of commercial solutions at GE Energy. 'We test exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug,' says Unum. 'I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software... you may never find the problem. I don't think that's unique to control systems or any particular vendor software.' Which leads to a number of other questions."
I agree that there's more to this than just one line of code, as some folks seem to believe. I think referring to it as 'one bug' is rather misleading.
As well refer to the things leading up to WWII as 'one problem'.
"the bug was unmasked as a particularly subtle incarnation of a common programming error called a 'race condition,' triggered on August 14th by a perfect storm of events and alarm conditions on the equipment being monitored. The bug had a window of opportunity measured in milliseconds."
Isn't this the type of problem the B Method (and maybe the Z language too) are designed to address? Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.
That doesn't mean the DESIGN is flawless, of course. But if we start engineering software on as many levels as we can, mightn't things improve? Normal software development and testing would never have found a critical bug with rare trigger conditions and a millisecond window. If you need precision on that level, you need (for starters) to KNOW your implementation of your design is sound, and preferably that the code you are running exactly implements the proven logic. Isn't this what the B Method was created for?
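To make the "millisecond window" concrete, here is a minimal sketch of a generic race condition (a lost update on shared data). It is not the actual GE code, which isn't public in this thread; the `AlarmLog` class and both function names are made up for illustration. The first function replays the bad interleaving by hand so it is repeatable; the second shows the usual fix of holding a lock around the read-modify-write.

```python
import threading


class AlarmLog:
    """Hypothetical shared counter of recorded alarm events."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def record_safe(self):
        # The lock makes read-modify-write one indivisible step.
        with self._lock:
            self.count += 1


def interleaved_lost_update():
    """Replay the bad interleaving by hand: two writers, one update lost."""
    log = AlarmLog()
    seen_by_a = log.count      # writer A reads 0
    seen_by_b = log.count      # writer B reads 0 before A has written
    log.count = seen_by_a + 1  # A writes 1
    log.count = seen_by_b + 1  # B also writes 1, overwriting A's update
    return log.count


def locked_updates(n_threads=8, per_thread=1000):
    """Same kind of work done by real threads, but guarded by the lock."""
    log = AlarmLog()

    def worker():
        for _ in range(per_thread):
            log.record_safe()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log.count


print(interleaved_lost_update())  # two events recorded, count shows 1
print(locked_updates())           # 8 * 1000 events, none lost
```

In real code the bad interleaving only happens when two threads hit those lines within the same tiny window, which is why ordinary testing can run for years without ever exercising it.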
"I object to doing things that computers can do." -- Olin Shivers, lispers.org
You bring up a great point about failure states. I work for several large hotels and the fire control systems are the ones that alert whenever there is any problem of any kind largely because any problem of any kind needs to be addressed immediately so it makes sense.
I would think power systems would think along the same lines since the odds are, ANY failure whatsoever needs immediate attention of engineers that maintain the system. This is not a requirement for all software but when it comes to such critical services why doesn't everybody follow the same practice? It seems so blatantly obvious that alarms should have been raised. Also, in situations where you don't work on a live environment you can always create a test environment that is for all intents and purposes "live". For web development work I do, I have a testing domain which is used to test sites to ensure that because they work here in my lab they will work when I hand them off to the client. It's 100% accurate, I've seen it done with countless other systems, so why wasn't it done here?
Did anyone ever retract their statements? I know the NY Mayor was pretty quick to blame us Canucks.
I've been reading several papers on this for a grad class I'm taking. One of the several problems is no government control. If a power outage might be prevented by shedding some load (turning off power to some people), no company wants to step up to the plate and be the one to turn off the power to their customers. So they luck out, or they have a massive power outage.
This paper (click on the PDF link) has a good summary of the problems in keeping power outages from happening again.
OK, it's nitpicking, but the largest machine is arguably the telephone system. Among other things, it maintains a synchronized clock (8 kHz base), even across oceans and continents.
I suppose one silver lining in having an outage once a year or so is that it forces us to keep backup systems for hospitals etc in place. If we only lost power once every 10 years, probably nobody at the hospital would even know what to do when power was lost, and people could die. It's just so hard to keep a backup system maintained and working if you are never forced to really use it once in a while. Like planning ahead for a weeklong camping trip, if you don't work up to it by taking shorter trips your chances of being fully prepared are nigh on 0%.
If I want to build a large structure (bridge or building) where it is possible that public safety is at issue, I had better have an engineer's signature on the drawings.
This case seems like a real good argument for having the same requirement for software.
Good engineering practice would probably have prevented this. A simple example of such a system would be a burglar/fire alarm panel. The system is self-checking. If any part of the system isn't working (i.e., someone cuts a wire), then that causes an alarm.
I realize that there will be strange undetectable bugs in software but if the system as a whole is properly engineered, the system will fail gracefully and safely.
Oddly enough, while writing a comment to another user's message, I threw some info into Google to learn about FirstEnergy's EMS system, and found this other SecurityFocus story from February 2004, which gives more raw facts than this newer story.
"DiNicola said Thursday that the company, working with GE and energy consultants from Kema Inc., had pinned the trouble on a software glitch by late October and completed its fix by Nov. 19..."
"With the software not functioning properly at that point, data that should have been deleted were instead retained, slowing performance, he said. Similar troubles affected the backup systems." This dovetails well with why the testers had to "slow" their testing to make the race condition appear.
342/x
x = "how many reactors they have in operation"
I'm also a big fan of watchdog timers. The process that periodically resets the timer can make all sorts of health and sanity checks.
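A rough sketch of the idea, in software rather than hardware: the monitored process "kicks" the watchdog only when its sanity checks pass, and anything watching the watchdog treats silence as a failure. The `Watchdog` class, its method names, and the injectable `clock` parameter are all assumptions for illustration, not any particular vendor's API.

```python
import time


class Watchdog:
    """Hypothetical software watchdog: expires unless kicked in time."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the timer can be tested
        self._last_kick = clock()

    def kick(self, health_checks=()):
        """Reset the timer, but only if every health check passes."""
        if all(check() for check in health_checks):
            self._last_kick = self._clock()
            return True
        return False

    def expired(self):
        """True once more than timeout_s has passed since the last kick."""
        return self._clock() - self._last_kick > self.timeout_s


# Usage with a fake clock so the behaviour is repeatable.
now = [0.0]
wd = Watchdog(timeout_s=5.0, clock=lambda: now[0])

now[0] = 3.0
wd.kick([lambda: True])   # checks pass, timer is reset at t=3

now[0] = 7.0
print(wd.expired())       # False: only 4 seconds since the last kick

wd.kick([lambda: False])  # a failed sanity check does NOT reset the timer
now[0] = 9.0
print(wd.expired())       # True: 6 seconds since the last good kick
```

The point of tying the kick to the health checks is that a process which is alive but insane stops resetting the timer, so it gets flagged the same way a dead one would.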
Mea navis aericumbens anguillis abundat
This is exactly the same as software in my industry (HVAC fire/security systems for large buildings), where if you lose communication to a subsystem or the field, you have to raise alarms all over the place.
And perhaps the software in question also tries to do that. However, there are any number of reasons it could still fail.
Consider the following scenario: one software component (a process, if you will) is responsible for synchronizing the data between the remote testing station and the local data storage. Another pulls the locally stored data and displays it to the user. The natural place to check for lost comm is in the first component; but if, for some reason, the lost comm causes that component to fail, the second one may not be aware that the locally cached data is not being refreshed (a silly mistake, but I've seen it happen). Furthermore, the user will be unaware that the link failed because the process responsible for generating the notification will no longer be running.
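One way around that scenario, sketched under assumptions: the display component checks the age of the cached data itself instead of trusting the sync process to report trouble. The `CachedReading` type, the `render` function, and the 30-second limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CachedReading:
    """Hypothetical locally cached value plus the time it was last refreshed."""
    value: float
    updated_at: float  # seconds, same clock as `now` below


def render(reading, now, max_age_s=30.0):
    """Format a reading for the operator, flagging it if it is too old.

    The display side does its own staleness check, so it still warns the
    user even when the sync process has died and can no longer raise alarms.
    """
    age = now - reading.updated_at
    if age > max_age_s:
        return f"STALE ({age:.0f}s old): {reading.value}"
    return f"{reading.value}"


reading = CachedReading(value=60.0, updated_at=100.0)
print(render(reading, now=110.0))  # fresh: shown as-is
print(render(reading, now=200.0))  # sync died: flagged as stale
```

The design choice is simply that the timestamp travels with the data, so no single process failure can make old numbers look current.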
You make a good point, but in my company, we have hundreds of data points reporting continuously. When the communications link (telephone company) fails, which it does multiple times every day, you end up with wrong data temporarily. If the operator had to investigate every comm failure, he'd never get anything else done. So, there has to be a threshold somewhere for when a problem reaches a level that needs to generate an alarm.
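A threshold like that could look something like the sketch below: brief dropouts are tolerated, and the alarm is raised only after several consecutive failed polls. The `CommAlarm` class and the limit of three are assumptions for illustration; a real system would tune the number (or use elapsed time) per link.

```python
class CommAlarm:
    """Hypothetical debounce: alarm only after N consecutive comm failures."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.failures = 0

    def report(self, ok):
        """Record one poll result; return True if the alarm should sound."""
        if ok:
            self.failures = 0  # any good poll clears the streak
        else:
            self.failures += 1
        return self.failures >= self.max_consecutive_failures


alarm = CommAlarm(max_consecutive_failures=3)
polls = [False, True, False, False, False]
print([alarm.report(ok) for ok in polls])
# [False, False, False, False, True]
```

A single blip followed by a good poll never alarms, while a sustained outage does, which keeps the operator from drowning in noise.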
Why don't we point out the real problem that likely caused this to happen. Energy deregulation in the first place.
I think it is more accurate to say that deregulation enabled, not caused, the problem. Certainly First Energy used deregulation to put in place many of the pieces of the problem. You just don't hear about all the well-run deregulated power systems.
If you open your mind too wide, people will throw trash in it.
For example, I once worked at a place with many many Windows web servers. Every time a server failed, an alarm would sound. But the reason we used Windows servers is that they were dirt cheap so we could buy enough to compensate for the expected frequent failures. The result was near constant alarms that were uniformly ignored. Therefore, the alarms resulted in no security benefits. This place had many other examples of impressive front door security with nonexistent backdoor security.
It could be that the data was often "not live". Such 'failures' might be due to perfectly legitimate and expected conditions. As such, these would not be exceptions, in the sense that they were not unexpected. It is quite possible that the system was designed to have a human check some board on a periodic basis to confirm the age of the data. It may be that as long as an operator did this job once an hour there would be no problem. Some group decided that additional indication would not do any good because the data was so often "not live" that the operators would suffer blindness to the alarm.
Of course we do not know this for sure, but it is a consideration. As another example, my check engine light has been on for a long time, and yet the mechanic says that nothing is significantly wrong with the engine. How will I ever trust the light again?
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
Wasn't Chernobyl taken out by a test gone bad?
Testing is all fine and good, but there are always going to be instances where something will remain undetectable for years until circumstances are just right (wrong?)
I am a technician at a plant that makes batteries and we see this all the time.
I remember one time where an operator was cleaning a conveyor with a cloth soaked in methanol (standard procedure) but forgot about the rag he had left on the underside of the running conveyor. Once the methanol had all evaporated, the dry rag got caught on the conveyor and jammed in the sprocket. At the same instant a valve had opened to fill the electrolyte tank. The jammed sprocket blew a breaker which stopped the machine. The PLC (Programmable Logic Controller) is programmed to keep valves in their current state in the case of an emergency (you kill fewer operators this way), but in this case it should have closed the valve. The result was a large puddle of nasty smelling, toxic, expensive electrolyte underneath the machine. Much fun.
My point is that as much as we try to make our machines foolproof, there is always at least one fool out there that will one day outsmart you.
The elitism seen here is incredible. Just because a system in and of itself isn't complex doesn't mean you can take stock of how they manage. Although personally I'm about to design a call center application for Mercedes that will be used by hundreds of thousands of people. This system can get quite complex, albeit not as important as a power system.
When it comes to troubleshooting systems you always have the option of making an exact scale model. You scale it up for more precision. This is a simple concept and apparently a lot of people think just because a system is complex and antiquated the same ideas can't apply.
Even if you could create a model to test with that is identical to the live system, you cannot test every possible situation which can occur in the real world. Integration testing can only test those things which can be envisioned by those responsible for testing.
You absolutely do the best testing you can, unit test every piece of functionality, test subsystems and whole systems in integration testing, but you will never test every single possibility. The more complex (and antiquated) the system, the greater the number of interactions, and the greater the potential for bugs. I'm convinced that there are bugs lurking in every piece of hardware and software I use; the conditions under which those bugs manifest may never have occurred, but they are there.
I'm not fatalistic about software quality, and I don't disagree that we need to test better, but the relationship between complexity and testing difficulty is not linear and I dislike seeing it trivialized. People who underestimate the difference between a system with 100 parts and 1000 parts are in for a rough time.
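For a back-of-the-envelope feel for that, assume (purely for illustration) that every pair of parts can interact and nothing else matters. The number of pairs is n(n-1)/2, so ten times the parts gives roughly a hundred times the interactions to test. The function name below is made up.

```python
from math import comb


def pairwise_interactions(parts):
    """Number of distinct pairs among `parts` components: n * (n - 1) / 2."""
    return comb(parts, 2)


for parts in (100, 1000):
    print(parts, pairwise_interactions(parts))
# 100 4950
# 1000 499500
```

And that is the optimistic case: once you count three-way interactions, timing, and ordering, the numbers grow much faster than this.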
[Set Cain on fire and steal his lute.]
Bzzzt wrong answer. My municipal power agency has been self-sustaining since 1920. They don't take in any tax dollars -- they run it all on the money they take in. Sure it's a Government run Agency so it can't make a profit (though they do take in extra cash for a rainy day fund) -- but for the sake of the argument if they increased prices 50% (to make a profit) they'd still be cheaper than the non-municipal options.
If it's not taxes, then the municipal funds itself by offering bonds, which then pushes the higher costs onto future subscribers.
Wrong again. The last bond they issued was back in the 1950s to build a new substation. The Agency started in the 1900s off tax dollars with a charter to provide street lighting. Over time they hooked up private customers (the infrastructure was already in place) and became self-sustaining. Perhaps that's the exception rather than the rule but you shouldn't go painting all municipal power with a broad brush of "You are just being screwed on your taxes" or what not.
Enron is the exception, and not the norm. Not many companies operate like Enron did, or were as unethical as they were.
Really? Did you bother to read the story about the power plant in a local township near me? After they won their petty tax battle by exhausting the town's financial resources they fired the plant back up with out of state employees that they brought in. Sure we could rehire the local people that used to work there but they actually fought us on our tax levy so fuck em! I hope NYS shoves it up their ass -- they are going after them last I heard and something tells me that NYS won't run out of money like the township did.
I think we can all agree that unethical behavior, ignorance, and incompetence are not limited to private corporations; government agencies and municipal authorities also exhibit those human qualities.
Your point?
btw, nice strawman, mentioning outsourcing while talking about a deregulated power company. sure to get a raise, but can we keep the logical fallacies to a minimum please? thanks
Why not? It's a valid point. Our power company (which was always a publicly held company) used to make enough profit that they could hire local people and pay them a decent (some would say too high but that's another story) wage. Now that they were forced to sell off their generation capacity they are being raked over the coals by the out of state suppliers and profits are a thing of the past.
So how did they respond? By laying off as many workers as possible and outsourcing whatever they could. And they still aren't back in the black. The PSC isn't going to let them charge the $0.20/kWh it would cost to put them in the black (why should they? All the money would just be leaving NYS) so it's a lose-lose battle for all involved. The customers get screwed, the employees get screwed, the townships get screwed and the shareholders (of the power company) get screwed. The only people who are winning are the shareholders of the out of state energy company that's screwing us over. The only reason it's not as bad as it was in California is because NYS has access to cheap hydroelectric power from Canada. That's the only thing keeping them from screwing us completely -- and it's the only thing keeping our power companies solvent. Thank god the Canadian companies at least have some ethics and responsibility.
So keep advocating your deregulated industry. I'm waiting for individual states to just start regulating it on their own. It wouldn't be the first time.
I want peace on earth and goodwill toward man.
We are the United States Government! We don't do that sort of thing.