Tracking the Blackout Bug

Re:Software bug was just one part of bigger proble by Raindance · 2004-04-10 07:07 · Score: 4, Interesting

I agree that there's more to this than just one line of code, as some folks seem to believe- I think referring to it as 'one bug' is rather misleading.

As well refer to the things leading up to WWII as 'one problem'.

B Method? by starseeker · 2004-04-10 07:12 · Score: 5, Interesting

"the bug was unmasked as a particularly subtle incarnation of a common programming error called a "race condition," triggered on August 14th by a perfect storm of events and alarm conditions on the equipment being monitored. The bug had a window of opportunity measured in milliseconds. "

Isn't this the type of problem the B Method (and maybe the Z language too) are designed to address? Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.

That doesn't mean the DESIGN is flawless, of course. But if we start engineering software on as many levels as we can, mightn't things improve? Normal software development and testing would never have found a critical bug with rare trigger conditions and a millisecond window. If you need precision on that level, you need (for starters) to KNOW your implementation of your design is sound, and preferably the code you are running exactly implements the proven logic. Isn't this what the B Method was created for?

--
"I object to doing things that computers can do." -- Olin Shivers, lispers.org

Re:B Method? by mccalli · 2004-04-10 07:38 · Score: 4, Interesting

Isn't this the type of problem the B Method (and maybe the Z language too) are designed to address? Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.

Ye gods, you've frightened the hell out of me with reference to Z. I'd almost entirely forgotten it, and had hoped its cold corpse would lie in the ground undisturbed, undiscovered and most importantly of all unreferenced until the end of time. Still, "That is not dead which may eternal lie"...

Z is a beautiful way to mathematically prove that you have design bugs at the highest level possible. You can then design your unit tests around those bugs, and confirm that they're valid.

That's it. It provides nothing else that unit testing on its own couldn't do, with the exception of a few salaries and a research grant here and there. Whilst you can mathematically prove implementations of certain designs, the vast majority of designs have more complex interactions. Try using Z for a multithreaded real-time environment for example - my Software Engineering tutor at the time, Iain Sommerville (well known in the field due to his books, oh and 'at the time' would be ~1993), basically said that Z just breaks down in those circumstances. I wouldn't know - I personally had no clue how to even make it begin in those circumstances, let alone break down.

Please confine Z to camp-fire ghost stories used to scare new programmers. It always was a living hell, and it really shouldn't be resurrected now.

Cheers,
Ian

Re:B Method? by Orne · 2004-04-10 07:39 · Score: 4, Interesting

SCADA systems transport data samples. My company's system collects from several hundred thousands of meters, about half of which are expected to send in a sample about once every 10 seconds, some as fast as once every two seconds. The concept is that you have a communications buffer that collects the data, the link writes to the memory while the other EMS applications (about a dozen) read from the memory.
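To make the buffer arrangement concrete, here is a minimal sketch of one writer path and several readers sharing memory. It is not FirstEnergy's or G.E.'s code; the names SampleBuffer, write, snapshot and run_demo are made up for illustration, and a plain mutex is assumed as the synchronization method. The point is only that the update and the read must each be atomic, or a reader can catch the buffer half-written in a window of milliseconds.

```python
import threading


class SampleBuffer:
    """Hypothetical shared buffer: link threads write, EMS apps read."""

    def __init__(self):
        self._lock = threading.Lock()
        self._samples = {}  # meter_id -> (value, sequence_number)
        self._seq = 0

    def write(self, meter_id, value):
        # The lock makes "bump the sequence and store the sample" one step.
        with self._lock:
            self._seq += 1
            self._samples[meter_id] = (value, self._seq)

    def snapshot(self):
        # Readers copy under the lock so they never see a half-updated view.
        with self._lock:
            return dict(self._samples)

    def sequence(self):
        with self._lock:
            return self._seq


def run_demo(writers=4, writes_each=1000):
    """Run several writer threads against one buffer and return it."""
    buf = SampleBuffer()

    def link(worker):
        for i in range(writes_each):
            buf.write(f"meter-{worker}", i)

    threads = [threading.Thread(target=link, args=(w,)) for w in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buf
```

Without the lock, the read-modify-write on the sequence counter is exactly the kind of step that can interleave between threads and only misbehave under unusual load.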

Now admittedly, FirstEnergy's system is a little smaller in territory, but I wonder if their mergers over the recent years (Cleveland Electric and Ohio Edison became FE, and then proceeded to take Toledo Edison and GPU of PA) have outpaced the collection capabilities of their mainframe (which was already at the end of its life and was scheduled to be replaced). That could account for some of the "slowing" that the G.E. testers said they had to do to make the race condition appear.

Re:B Method? by Mr.+Slippery · 2004-04-10 08:02 · Score: 2, Interesting

Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.

Problem is, doing and verifying proofs is just as subject to error as creating and reviewing code. All you've really done is change your symbol set.

--
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood

Re:The problem with SCADA systems by Vancorps · 2004-04-10 07:13 · Score: 5, Interesting

This all reminds me of the movie Resident Evil where they shut down power and all the doors unlock when power is restored.

You bring up a great point about failure states. I work for several large hotels and the fire control systems are the ones that alert whenever there is any problem of any kind largely because any problem of any kind needs to be addressed immediately so it makes sense.

I would think power systems would think along the same lines since the odds are, ANY failure whatsoever needs immediate attention of engineers that maintain the system. This is not a requirement for all software but when it comes to such critical services why doesn't everybody follow the same practice? It seems so blatantly obvious that alarms should have been raised.

Also, in situations where you don't work on a live environment you can always create a test environment that is for all intents and purposes "live". For the web development work I do, I have a testing domain which is used to test sites to ensure that because they work here in my lab they will work when I hand them off to the client. It's 100% accurate, I've seen it done with countless other systems, so why wasn't it done here?

The American jackasses who blamed Canada by Kevin+Mitnick · 2004-04-10 07:14 · Score: 5, Interesting

Did anyone ever retract their statements? I know the NY Mayor was pretty quick to blame us Canucks.

Reasons for power blackouts by pcraven · 2004-04-10 07:27 · Score: 4, Interesting

I've been reading several papers on this for a grad class I'm taking. One of the several problems is no government control. If a power outage might be prevented by shedding some load (turning out power to some people), no company wants to step up to the plate and be the one to turn out the power to their customers. So they luck out, or they have a massive power outage.

This paper (click on the PDF link) has a good summary of the problems in keeping power outages from happening again.

World's largest machine by stefanb · 2004-04-10 07:30 · Score: 4, Interesting

An article featured on Slashdot last year lays out the underlying complexity of the power grid very well: "The World's Largest Machine"

OK, it's nitpicking, but the largest machine is arguably the telephone system. Among other things, it maintains a synchronized clock (8 kHz base), even across oceans and continents.

Re:For the 21st century... by timeOday · 2004-04-10 07:31 · Score: 2, Interesting

I suppose one silver lining in having an outage once a year or so is that it forces us to keep backup systems for hospitals etc in place. If we only lost power once every 10 years, probably nobody at the hospital would even know what to do when power was lost, and people could die. It's just so hard to keep a backup system maintained and working if you are never forced to really use it once in a while. Like planning ahead for a weeklong camping trip, if you don't work up to it by taking shorter trips your chances of being fully prepared are nigh on 0%.

Software ENGINEERING by Anonymous Coward · 2004-04-10 07:39 · Score: 4, Interesting

If I want to build a large structure (bridge or building) where it is possible that public safety is at issue, I had better have an engineer's signature on the drawings.

This case seems like a real good argument for having the same requirement for software.

Good engineering practice would probably have prevented this. A simple example of such a system would be a burglar/fire alarm panel. The system is self-checking. If any part of the system isn't working (ie. someone cuts a wire), then that causes an alarm.

I realize that there will be strange undetectable bugs in software but if the system as a whole is properly engineered, the system will fail gracefully and safely.

Re:Software ENGINEERING by Orne · 2004-04-10 08:45 · Score: 3, Interesting

The two systems you describe are fundamentally different from the design of this alarming system. In fire or safety, the "reading" is the voltage of the closed loop wire itself; 12 volts connected, 0 volts open.
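As a rough sketch of that closed-loop idea: the reading itself proves the wire is intact, so anything other than the expected voltage is treated as an alarm, including the case where no reading arrives at all. The 12 volt figure comes from the comment above; the function name, the tolerance and the status strings are assumptions made for the example.

```python
def loop_state(volts, nominal=12.0, tolerance=1.5):
    """Classify a supervised loop reading; anything unexpected is an alarm.

    `tolerance` is an assumed value, not taken from any real panel.
    """
    if volts is None:
        # No reading at all is a failure, not "nothing to report".
        return "ALARM: no reading"
    if abs(volts - nominal) <= tolerance:
        return "OK"
    # 0 V (cut wire) and any other out-of-band voltage both land here.
    return "ALARM: loop fault"
```

The design choice is that "OK" has to be positively demonstrated on every check. Silence never counts as health.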

Now imagine if you have a layer in between; you want to monitor the fire status of a complex of warehouses from a single room several miles away. Convert the signals at each of the individual buildings from analog to digital, transport the data to a common computer, and view the data there. Figure you have several hundred buildings you're watching at once, and now you're getting closer in scale to how the grid dispatchers get their data.

Now imagine that the computer's software back at the main station reads all these meters, and if a line's open (say you're tracking window openings for security), it writes an alarm to a text log on the screen; on a good day, you don't get any alarms. Now suppose the driver that writes the alarms to the screen hangs; since you weren't expecting any alarms, you're not that concerned that you aren't seeing anything. That's pretty much what caught FirstEnergy for those 3 hours that afternoon, while the system was failing and they didn't realize they needed to act.

Re:Software ENGINEERING by sjames · 2004-04-10 09:52 · Score: 3, Interesting

At first glance, that can seem like a good idea, but are you prepared to pay for that signoff from each engineer whenever you install a piece of software?

A PE signs off on each particular instance of a design taking intended use, site and other construction into account. If you then build elsewhere, you need a new signoff. If you make any significant change (including adding other structural elements to the design, that is, installing more software), you'll need a new signoff. Add a new network driver, another signoff. Upgrade the CPU? You guessed it!

Some software is poorly designed and crash prone. Other software is well designed but cannot be signed off on because it might be installed on nearly anything that pretends to be a compatible platform.

The one justification for that sort of signoff is in situations where a bug will kill someone. Even then, the system should be divided into critical and auxiliary parts to limit what must be signed off on.

Autopilots work that way. You have a small and relatively simple part that assures safe conditions, is extensively tested, and rarely changed. Another portion is more frequently updated, attempts to optimize the flight and provides a nicer interface. The latter can fail completely and the plane will continue to fly (possibly with poor fuel economy and the pilot navigating manually, but it won't fall out of the sky).

There are many tradeoffs. In some sense, many small distributed systems are more robust than centralized control. However, it's a lot easier to create a chaotic system that way. If you do, you won't know until the system falls into a weird state without warning.

Additional Information by Orne · 2004-04-10 07:47 · Score: 3, Interesting

Oddly enough, while writing a comment to another user's message, I threw some info in google to learn about FirstEnergy's EMS system, and found this other SecurityFocus story from February 2004, which gives more raw facts than this newer story.

"DiNicola said Thursday that the company, working with GE and energy consultants from Kema Inc., had pinned the trouble on a software glitch by late October and completed its fix by Nov. 19..."

"With the software not functioning properly at that point, data that should have been deleted were instead retained, slowing performance, he said. Similar troubles affected the backup systems. " This dovetails well with why the testers had to "slow" their testing to make the race condition appear.

Re:342 years of online operational hours? by Creepy+Crawler · 2004-04-10 07:58 · Score: 3, Interesting

342/x

x = "how many reactors they have in operation"

--

Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37

Clocks by Detritus · 2004-04-10 08:13 · Score: 2, Interesting

That's one reason that I like to put UTC clocks on displays. A quick glance at the clock will tell you if the display subsystem has crashed.

I'm also a big fan of watchdog timers. The process that periodically resets the timer can make all sorts of health and sanity checks.
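A software watchdog along those lines might look like the sketch below. It is only an illustration: the Watchdog class, its method names and the injectable clock are assumptions for the example, and a real system would more likely use a hardware timer that resets the box when it expires.

```python
import time


class Watchdog:
    """Hypothetical software watchdog: expires unless kicked in time."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_kick = clock()

    def kick(self, health_checks=()):
        # Reset the timer only if every health/sanity check passes.
        if all(check() for check in health_checks):
            self._last_kick = self._clock()
            return True
        return False

    def expired(self):
        # True once the monitored process has gone quiet for too long.
        return self._clock() - self._last_kick > self._timeout_s
```

A supervisor that polls expired() can raise an alarm or restart the process. Because a failed health check also withholds the kick, a process that is running but unhealthy eventually trips the timer too.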

--
Mea navis aericumbens anguillis abundat

Re:Clocks by corngrower · 2004-04-10 11:45 · Score: 2, Interesting

A watchdog timer on the alarm system (that was deadlocked) would probably have prevented this scenario. And I also agree that displaying clocks on the screen is a good way for the operator to see if the display system is functioning properly.

Not having the display system give a visual indication of stale data was also a deficiency.

There also seems to have been a problem in that the data collection and monitoring portion of the system was held up by a malfunctioning alarm system.

Re:The problem with SCADA systems by Kirill+Lokshin · 2004-04-10 09:32 · Score: 2, Interesting

This is exactly the same as software in my industry (HVAC fire/security systems for large buildings), where if you lose communication to a subsystem or the field, you have to raise alarms all over the place.

And perhaps the software in question also tries to do that. However, there are any number of reasons it could still fail.

Consider the following scenario: one software component (a process, if you will) is responsible for synchronizing the data between the remote testing station and the local data storage. Another pulls the locally stored data and displays it to the user. The natural place to check for lost comm is in the first component; but if, for some reason, the lost comm causes that component to fail, the second one may not be aware that the locally cached data is not being refreshed (a silly mistake, but I've seen it happen). Furthermore, the user will be unaware that the link failed because the process responsible for generating the notification will no longer be running.
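One way around that scenario, sketched under assumptions: have the sync component stamp the local store each time it refreshes, and let the display component judge staleness for itself instead of waiting for a notification that may never come. LocalCache, refresh, read and max_age_s are hypothetical names, not taken from any real SCADA product.

```python
import time


class LocalCache:
    """Hypothetical local store written by the sync process."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the logic can be tested
        self._data = {}
        self._refreshed_at = None

    def refresh(self, data):
        # Called by the sync component after each successful transfer.
        self._data = dict(data)
        self._refreshed_at = self._clock()

    def read(self, max_age_s):
        """Return (data, is_stale); the reader decides staleness itself."""
        if self._refreshed_at is None:
            return {}, True
        age = self._clock() - self._refreshed_at
        return dict(self._data), age > max_age_s
```

If the sync process dies, nothing has to notice and report it. The timestamp simply stops advancing and every reader sees is_stale turn True.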

Re:The problem with SCADA systems by spurdy · 2004-04-10 09:58 · Score: 2, Interesting

You make a good point, but in my company, we have hundreds of data points reporting continuously. When the communications (telephone company) fails, which it does multiple times every day, you end up with wrong data temporarily. If the operator had to investigate every comm failure, he'd never get anything else done. So, there has to be a threshold somewhere of when does a problem reach a level that it needs to generate an alarm.
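A threshold like that is often expressed as a hold-off: a dropped link only becomes an alarm once it has stayed down for some minimum time. The sketch below assumes that approach; CommAlarm, update and threshold_s are made-up names, and the right threshold is a judgment call the comment leaves open.

```python
class CommAlarm:
    """Hypothetical hold-off: alarm only if the link stays down long enough."""

    def __init__(self, threshold_s):
        self.threshold_s = threshold_s
        self._down_since = None

    def update(self, link_up, now):
        """Feed the latest link status; return True when an alarm is due."""
        if link_up:
            self._down_since = None  # a brief dropout is forgotten
            return False
        if self._down_since is None:
            self._down_since = now   # start timing this outage
        return now - self._down_since >= self.threshold_s
```

The tradeoff is visible in the one parameter: set it too low and operators drown in nuisance alarms, too high and a real outage sits unreported for that long.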

Re:Software bug was just one part of bigger proble by Grayswan · 2004-04-10 10:11 · Score: 2, Interesting

Why don't we point out the real problem that likely caused this to happen. Energy deregulation in the first place.

I think it is more accurate to say that deregulation enabled, not caused, the problem. Certainly First Energy used deregulation to put in place much of the pieces of the problem. You just don't hear about all the well run deregulated power systems.

--
If you open your mind too wide, people will throw trash in it.

Re:The problem with SCADA systems by fermion · 2004-04-10 10:13 · Score: 2, Interesting

It kind of depends on how often the out of date conditions occur and how long they last. My understanding is that the design of proper alarms is actually a complicated security issue, and improper alarms lead to less effective security.

For example, I once worked at a place with many many Windows web servers. Every time a server failed, an alarm would sound. But the reason we used Windows servers is that they were dirt cheap so we could buy enough to compensate for the expected frequent failures. The result was near constant alarms that were uniformly ignored. Therefore, the alarms resulted in no security benefits. This place had many other examples of impressive front door security with nonexistent backdoor security.

It could be that the data was often "not live". Such 'failures' might be due to perfectly legitimate and expected conditions. As such, these would not be exceptions, in the sense that they were not unexpected. It is quite possible that the system was designed to have a human check some board on a periodic basis to confirm the age of the data. It may be that as long as an operator did this job once an hour there would be no problem. Some group decided that additional indication would not do any good because the data was so often "not live" that the operators would suffer blindness to the alarm.

Of course we do not know this for sure, but it could happen. But it is a consideration. As another example my check engine light has been on for a long time, and yet the mechanic says that nothing is significantly wrong with the engine. How will I ever trust the light again?

--
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black

Re:The problem with SCADA systems by Fishead · 2004-04-10 10:51 · Score: 2, Interesting

Wasn't Chernobyl taken out by a test gone bad?

Testing is all fine and good, but there are always going to be instances where something will remain undetectable for years until circumstances are just right (wrong?)

I am a technician at a plant that makes batteries and we see this all the time.

I remember one time where an operator was cleaning a conveyor with a cloth soaked in Methanol (standard procedure) but forgot about the rag he had left on the underside of the running conveyor. Once the Meth had all evaporated, the dry rag got caught on the conveyor and jammed in the sprocket. At the same instant a valve had opened to fill the electrolyte tank. The jammed sprocket blew a breaker which stopped the machine. The PLC (Programmable Logic Controller) is programmed to keep valves in their current state in the case of an emergency (you kill fewer operators this way), but in this case it should have closed the valve. The result was a large puddle of nasty smelling, toxic, expensive electrolyte underneath the machine. Much fun.
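The underlying design question is whether "hold current state" should be one blanket rule or something decided per output. A sketch of the per-output version follows, written in Python for readability instead of ladder logic; the output names and the SAFE_STATES table are invented for the example and do not describe the plant's actual program.

```python
HOLD = "hold"  # sentinel: keep whatever state the output is already in

# Hypothetical per-output safe states; the names are illustrative only.
SAFE_STATES = {
    "conveyor_motor": "off",
    "electrolyte_fill_valve": "closed",
    "clamp": HOLD,
}


def apply_emergency_stop(current, safe_states):
    """Return the output states to drive after an emergency stop."""
    result = {}
    for name, state in current.items():
        target = safe_states.get(name, HOLD)  # unlisted outputs hold state
        result[name] = state if target == HOLD else target
    return result
```

Holding state stays the default, but a fill valve can be listed as closing on a stop. Someone still has to think of each case in advance, which is the commenter's point.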

My point is that as much as we try to make our machines foolproof, there is always at least one fool out there that will one day outsmart you.

Re:The problem with SCADA systems by Vancorps · 2004-04-10 16:19 · Score: 2, Interesting

Web systems were but one example. I'll throw in another, much more complex example. Take DNA from bacteria and splice it with stem cells to produce nerves much more resistant to damage. You are talking thousands upon thousands of long protein strands, and for most of them you have no idea which performs what task. Do this without destroying the cell. When you are done with that test you move on to a more complex test until ultimately you are ready to do it with humans, at which time you can accurately predict exactly what it will do. Yes there are occasions when that doesn't happen that way but that is usually because something was missed in the testing procedure.

The elitism seen here is incredible, just because a system in and of itself isn't complex doesn't mean you can take stock of how they manage. Although personally I'm about to design a call center application for Mercedes that will be used by hundreds of thousands of people. This system can get quite complex albeit, not as important as a power system.

When it comes to troubleshooting systems you always have the option of making an exact scale model. You scale it up for more precision. This is a simple concept and apparently a lot of people think just because a system is complex and antiquated the same ideas can't apply.

Re:The problem with SCADA systems by miu · 2004-04-10 18:18 · Score: 2, Interesting

When it comes to troubleshooting systems you always have the option of making an exact scale model. You scale it up for more precision. This is a simple concept and apparently a lot of people think just because a system is complex and antiquated the same ideas can't apply.

Even if you could create a model to test with that is identical to the live system you cannot test every possible situation which can occur in the real world. Integration testing can only test those things which can be envisioned by those responsible for testing.

You absolutely do the best testing you can, unit test every piece of functionality, test subsystems and whole systems in integration testing, but you will never test every single possibility. The more complex (and antiquated) the system, the greater the number of interactions, and the greater the potential for bugs. I'm convinced that there are bugs lurking in every piece of hardware and software I use, the conditions under which those bugs manifest may have never occurred, but they are there.

I'm not fatalistic about software quality, and I don't disagree that we need to test better, but complexity to testing difficulty is not linear and I dislike seeing it trivialized. People who underestimate the difference between a system with 100 parts and 1000 parts are in for a rough time.
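A back-of-the-envelope way to see the non-linearity, under the simplifying assumption that only pairwise interactions between parts matter (real systems also have three-way and timing-dependent interactions, so this understates it):

```python
from math import comb


def pairwise_interactions(parts):
    """Number of distinct pairs among `parts` components: n * (n - 1) / 2."""
    return comb(parts, 2)


small = pairwise_interactions(100)    # 4950
large = pairwise_interactions(1000)   # 499500
growth = large / small                # roughly 100x for 10x the parts
```

Ten times the parts gives about a hundred times the pairs to consider, before any ordering or timing effects are counted.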

--

[Set Cain on fire and steal his lute.]

Re:Software bug was just one part of bigger proble by Shakrai · 2004-04-11 09:20 · Score: 2, Interesting

Well, let's handle this first. Your municipal can offer lower rates because you're paying more in taxes to subsidize them. You pay local, state, and federal taxes which then go to artificially lower the up-front costs you pay for electricity. But it is not necessarily cheaper.

Bzzzt wrong answer. My municipal power agency has been self-sustaining since 1920. They don't take in any tax dollars -- they run it all on the money they take in. Sure it's a Government run Agency so it can't make a profit (though they do take in extra cash for a rainy day fund) -- but for the sake of the argument if they increased prices 50% (to make a profit) they'd still be cheaper than the non-municipal options.

If it's not taxes, then the municipal funds itself by offering bonds, which then pushes the higher costs onto future subscribers.

Wrong again. The last bond they issued was back in the 1950s to build a new substation. The Agency started in the 1900s off tax dollars with a charter to provide street lighting. Over time they hooked up private customers (the infrastructure was already in place) and became self-sustaining. Perhaps that's the exception rather than the rule but you shouldn't go painting all municipal power with a broad brush of "You are just being screwed on your taxes" or what not.

Enron is the exception, and not the norm. Not many companies operate like Enron did, or was as unethical they were.

Really? Did you bother to read the story about the power plant in a local township near me? After they won their petty tax battle by exhausting the town's financial resources they fired the plant back up with out of state employees that they brought in. Sure we could rehire the local people that used to work there but they actually fought us on our tax levy so fuck em! I hope NYS shoves it up their ass -- they are going after them last I heard and something tells me that NYS won't run out of money like the township did.

I think we can all agree that unethical behavior, ignorance, and incompetence are not limited to private corporations, but government agencies, municipal authorities also exhibit those human qualities.

Your point?

btw, nice strawman, mentioning outsourcing while talking about a deregulated power company. sure to get a raise, but can we keep the logical fallacies to a minimum please? thanks

Why not? It's a valid point. Our power company (which was always a publicly held company) used to make enough profit that they could hire local people and pay them a decent (some would say too high but that's another story) wage. Now that they were forced to sell off their generation capacity they are being raked over the coals by the out of state suppliers and profits are a thing of the past.

So how did they respond? By laying off as many workers as possible and outsourcing whatever they could. And they still aren't back in the black. The PSC isn't going to let them charge the $0.20 kWh it would cost to put them in the black (why should they? All the money would just be leaving NYS) so it's a lose-lose battle for all involved. The customers get screwed, the employees get screwed, the townships get screwed and the shareholders (of the power company) get screwed. The only people who are winning are the shareholders of the out of state energy company that's screwing us over. The only reason it's not as bad as it was in California is because NYS has access to cheap hydroelectric power from Canada. That's the only thing keeping them from screwing us completely -- and it's the only thing keeping our power companies solvent. Thank god the Canadian companies at least have some ethics and responsibility.

So keep advocating your deregulated industry. I'm waiting for individual states to just start regulating it on their own. It wouldn't be the first time.

--
I want peace on earth and goodwill toward man.
We are the United States Government! We don't do that sort of thing.
