Blackout Cause: Buggy Code
blanca writes "The big northeast blackout from last summer was caused in part by a software bug in an energy management system sold by General Electric, according to a story on SecurityFocus. The bug meant that a computerized alarm that should have been triggered never went off, hindering FirstEnergy's response to the train of events that led to the cascading blackout. Investigators found the bug in an intensive code audit following the outage, and a patch is now available."
The first thing I saw at that site, "Reliable, Field-Proven & Adaptable". Funny.
Well, that statement is only half false: its reliability has been field-proven.
Vonal Declosion
Didn't the story used to be that after a tech did maintenance on the machine, he forgot to re-enable an alarm?
tasks(723) drafts(105) languages(484) examples(29106)
So where's my indemnification?
My karma is not a Chameleon.
It's dark here, what about a bug?
"Patch available"
Phew! then at least i can patch my own power craft before anything happens!
With well over one million hours of online operation, the XA/21 system has improved utilities' bottom lines by helping to: ... ...
Avert potential outages
Truth in advertising.
Oh this bug took six months to find and now a patch is available. I thought someone said the bug was found six months ago and now the patch was available. My bad, nobody would ever do that :-)
With all the brainpower on Slashdot, I'm sure we can find a way!
I have been dreaming of writing such a bug myself. Quite an achievement to black out a quarter of a continent with some crappy code...
Aure entuluva!
Where's the URL, dude? I want to apply it to my local copy.
Cause my system's crashing whenever there's a thunder storm...
... when you outsource to the lowest bidder?
I've said enough.
The code did work, but there was no hardware left to signal the alarm ! Someone likely snarfed the alarm for a CPU usage monitor..
http://www.schneier.com/crypto-gram-0312.html#1
A snippet of the article:
I'm pretty sure the idiom is chain of events and not train of events.
Might I add GE outsources much of its software to India. GE railroads is a BIG outsourcing of software to India. I hope we don't have a railroad accident because of poor quality.
The term 'Software Engineering' is bandied about in the software industry. I think little that you could call engineering happens. Software is developed. It doesn't meet the strict standards of testing and reliability of physical products.
I am a software developer not an engineer, as are most people in the field. Software won't become an engineering science until companies are willing to pay for that process. Given the current trend towards cost cutting I don't see that happening anytime soon.
As x approaches total apathy I couldn't care less.
Just a question for everyone here:
Who thinks this could have been any better with Open Source and why?
People make the comment of the many eyes, but who is really looking at the code?
Curiosity was framed; ignorance killed the cat. -- Author unknown
I thought the Canadians did it?
Disclaimer: This opinion was created without the use of any facts
Chalk up another one for the most disastrous software bugs in history. This one should give the Ariane 5 explosion a go for no. 1.
I'm waiting for the next big power failure, then the excuses about why the patch was never applied. :)
Wouldn't it just be classic if it turns out that this code was outsourced?
who are those slashdot people? they swept over like Mongol-Tartars.
One code to light it all, ...
One coder to code it,
One debugger to miss the bug
and into the darkness lead them.
With all the lip service about "homeland security," one ought to be concerned about anything affecting national infrastructure being sent abroad where you really don't know who is doing the coding, whether the coding projects are being further outsourced to, say, alQaidaSoft, etc.
People say I'm crazy, I got diamonds on the soles of my shoes...
"Things are so complicated, we don't know that a small event, or series of small events won't bring down the whole system"
Yeah, well I don't know that I won't be fired tomorrow for reading Slashdot at work, but that doesn't mean that I will.
I want to delete my account but Slashdot doesn't allow it.
The XA/21 brochure has a few inconsistencies. It states that its client, server, and front-end processors are supported on a mix of IBM AIX6000, Sun Solaris, and Motorola AIX hardware. The whole thing appears to use X-windows for management, yet a few of the screenshots on page 7 look like Windows to me. Or perhaps I just need another cup of coffee this morning.
Nothing but the finest in meaningless drivel
But is anyone else thinking of Medal of Honor?
Sound zee alarm!!
...what badly audited code can do these days? In this case the results weren't nearly as disastrous as they could have been. For example, if a similar software error had prevented an alarm from going off in a nuclear power plant, we could be in for another Chernobyl. Now one could argue that all code in that kind of situation would be properly audited, but I'm sure the GE code had been tested fairly thoroughly. I find it quite disturbing that occurrences like this can happen.
tim
Why can't we all just get along???
Still, even if one bug caused the blackout, it should never have happened. One company, be it by mistake or software glitch or whatever, should absolutely not have the ability to take out the power grid of tens of millions of people. Period. Each company should have at least some independence in the event that something like this occurs. It is irresponsible of all parties involved to not have any form of backup plan in an event like this, software bug or not. Each company needs to be able to run on its own in case something catastrophic happens.
How about the energy companies?
Certainly, the energy corporations must be somewhat culpable for not rigorously testing the software in the first place? It is not in the interest of a for-profit company to see to it that such systems are functioning correctly, as that cost will detract from the bottom line profit. Only when disaster strikes can they be goaded into looking into problems.
Stop corporate
Now if in fact this was buggy code, and if Software Engineers are in fact part of the engineering profession, then a professional body should be taking the engineer(s) to task. This would be the same thing that would take place in the event that a civil engineer signed off on faulty building plans. But smart money says no software "engineer" will get nailed.
A look at the software industry will show this to be the norm. And that is why there is such a problem with having people claiming the title of "software engineer". "Engineer" doesn't just mean having the technical savvy, it also means having a responsibility to the public for the use of that knowledge and being beholden to a professional body charged with ensuring you are held accountable.
"Consensus" in science is _always_ a political construct.
Back in my younger days a bug patch was a piece of steel mesh placed over a hole to keep the moths out of the relay contacts.
This is Slashdot! Isn't that supposed to say Microsoft? It's always Microsoft.
I was going to put a sig here, but I had already submitted the message.
From the article:
When a backup server kicked-in, it also failed, unable to handle the accumulation of unprocessed events that had queued up since the main system's failure. Because the system failed silently, FirstEnergy's operators were unaware for over an hour that they were looking at outdated information on the status of their portion of the power grid, according to the November report.
How in the world did they manage to build a system nearly completely dependent upon computers, and yet not know when they lost not just one, but two computers that monitored the system?
Homer: Don't turn off the computer! Don't turn off the computer! Don't turn off the computer!
"Click"
Some of my friends were software developers at General Electric years ago (admittedly doing Wintel desktop software).
I'm too tired to read the article, but I will say this, everything they did, they did in VB.
I know GE has also sold US approved crypto hardware to other countries, gear which was found to have back doors or known weaknesses that have allowed the US to eavesdrop on their supposed "friends" with ease.
Maybe they should stick to designing jet engines and toasters.
War crimes, torture, lies, illegal spying... Would someone give Bush a blowjob, already, so he can be impeached?
And yep, it runs on major critical systems, including energy systems and satellites.
Lean on it in the slightest and it will crash and burn with little chance for recovery. Tibco even says they don't test their own software (lack of docs lowers their liability). Press them for test results and they will offer you to pay them to test for you.
When a backup server kicked-in, it also failed, unable to handle the accumulation of unprocessed events that had queued up since the main system's failure.
Sounds like classic Tibco.
Let me guess, they blacked out the Northeast in retaliation for blowing up Siberia with our trojan-horse pump and valve control system.
Did anyone check snopes.com for this one?
Blaming the black out on a software bug is a damn cop-out. The cause of the black out was a horribly managed electrical grid that can barely keep up with the current demand. Any major failure in the system can cause a cascading failure of the entire section of the grid. That is a horrible design. A software bug may have been the trigger but it is by no means the true cause.
The grid in the North East US is supplied by horribly inefficient and antiquated power lines that were struggling to keep up thirty years ago. That they are still in use today is an outright crime. There's also the issue of the operators of the lines generators trying to save a few bucks by cutting maintenance on equipment and facilities and cutting supervising staffs down to skeleton crews. It is much easier to fit "software bug" into a sound bite so the news media will stick with that. Unfortunately the real cause of the black out is not ever going to be patched and another blackout is as inevitable as this last one was. I hope next time a few more people will have invested in backup generators or some alternate form of power to keep from losing their business during a blackout.
I'm a loner Dottie, a Rebel.
If this isn't a call to take a closer look at the possibility of more widely using tools like Z and B to develop important software, I don't know what is.
Yes, they're difficult. Yes, they aren't likely to eliminate all bugs. BUT. They provide a much better chance (as I understand it - I'm not an expert) that what is designed is what actually gets implemented. That shifts the burden onto the design, but that's OK - that burden was always there. It just means that the design gets properly implemented, which is all that can reasonably be asked of the coding process.
Currently, again as I understand it, the life of a software program in development is a constant struggle by the developers to cope with ever changing demands of customers. I think if people want matters to improve the customers are going to have to come to grips with reality, take the time to sit down and think things through, and make all critical design decisions BEFORE the development process begins. More expensive up front? You bet. That's why I think companies should look at cooperative effort for this type of thing. Distribute the cost of developing one really good program across an industry. A lot of the same core functionality can likely be shared between businesses - if they all pay for one proper design and implementation of an open program up front, and they all get copies of the logic and proof code with rights to extend as they see fit, they all benefit. They can also open up the more general parts of the package to the world at large under GPL, and anyone could contribute who can generate valid B and Z designs/proofs. Sort of an "academic" open source code development forum - peer review and all. The companies get the benefit of all new development - if they are using it internally they can extend the GPL code for themselves, so long as they don't distribute it. If they do distribute it, they can do so under GPL for everyone to enhance. A plugin based model can also allow them to develop components to the system they can sell as commercial software, if they wish.
Whether this would work/appeal with corporate thinking I have no idea - many of those folks seem to view cooperation like the plague. But it might allow a higher grade of software to be developed and universally used, and I have a hard time imagining how that could be a bad thing for anyone.
"I object to doing things that computers can do." -- Olin Shivers, lispers.org
Posting anonymously for obvious reasons to me :)
Given my personal experience with this certain Fortune 5 company and software development as a whole, I am not surprised.
The bottom line is that there is soooo much software developed here by non-computer programmers. There are many great Engineers (Mechanical, Aerospace, etc.) here, yet very few can write good code. Many of them are asked to write code nonetheless and thanks to the travesty that is Visual Basic and other Rapid Application Development tools the code that is produced is extremely un-maintainable.
Then you have the matter of people moving jobs every 2 years and the poor bastard who has to maintain someone else's code gets lost inside of it.
Consider me very frustrated at the whole process.
That some of the guys who coded something that worked with such critical hardware MUST have been employed for M$ at some point? Them: "Crap. A bug. 5 minutes to coffee break. It'll sort itself out, I'm off to check the stock prices." Their management: "Be sure to drum up the fact we have a patch for a bug, but not what the bug really is, how severe it is, or how it got there! But patch it quick!"
One of the 187.
Man, is my dad going to be relieved when he reads this article. He works for FirstEnergy and will be glad to know that it's not his fault that the blackout was his fault.
That's a cover-up. It was really a Martian invasion. Mars was at its closest point to Earth at the time. Read more!
The Uncoveror: It's the real news.
this patch has been available for over 8 months and they're just finding it now? sheesh!
http://www.theregister.co.uk/content/53/35511.html
... stop slacking! :-)
Reported on this hours ago.
Come on people
Worst
No-one writes flawless code, not Sun, not IBM, and not even Linus or Alan Cox or Larry Wall. Anything that is controlled by code is bound to break, but that is why there are humans around and ways to override systems.
Regardless, First Energy had many, many ways to know something was up (whether it was MISO calling them, the general disruption they had before it could cascade) but they refused to take the necessary actions and close themselves off from the grid.
Right here.
mstyne: real name, no gimmicks
whats funny is that the RCA didnt point to software at all...
here's what happened:
a 50 MV line arc'd to a 12" diameter tree.
and yes, there is no reason that a 12" tree should be anywhere CLOSE to a 50 MV line.
... hi bingo
OK, B is an outdated precursor to the C language, so I'm not sure why that would be any use, but wtf is Z??
They monitor every field widget, but forget to monitor the monitoring servers? That's bright...dark...err...
The cesspool just got a check and balance.
The bug wasn't responsible for the alarm not going off. Turns out that there was a small slip of paper that had fallen down between the hammer and the bell of the alarm that was supposed to ring. The paper dampened the ring of the alarm, and thus it was never heard.
Oh well, i've got to go back to sniffing more Sterno.
There are 01 types of people in this world. Those that understand binary, and me.
one more time...
;)
root@powerplant12:/# apt-get update && apt-get -s upgrade
Get:1 ftp://ftp.gepower.com stable/main Packages [2726kB]
Hit ftp://ftp.gepower.com stable/main Release
Fetched 2.8MB in 2s (1408kB/s)
Reading Package Lists... Done
Building Dependency Tree... Done
Reading Package Lists... Done
Building Dependency Tree... Done
1 packages upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Inst xa-21-base (2.1-3 GE-whoops:stable)
Conf xa-21-base (2.1-3 GE-whoops:stable)
Yup, its official...
To me, this report gives a good example of why a monolithic (monocultural) dispatching system is not a good idea. If every transaction were controlled by a single central center, a single software bug could shut down the entire North American grid.
sPh
- train of events + chain of events
I write code.
A "professional body" isn't going to do anything. Let's say this guy is a member of The Loyal Order of Moose Engineers, chapter 471. He gets a reprimand, or he gets the boot from the professional organization. How does this solve the problem? As is, there's a mechanism in place already. They're called "lawsuits". Rest assured that somebody is going to pay for this fuckup. Doubtful the engineer will personally pay, but his employer will, and he'll get canned. That's a much more severe correction than any professional body could ever accomplish.
Based on the PDF for the XA/21 system, it sounds like this wasn't related to some of the DCOM/OPC issues many (myself included) were speculating about. Though it's a SCADA control system (where Windows is common, though not universal), it's running on AIX (IBM or Motorola) or Solaris.
Interestingly enough, the sales literature describes it as having, "[an] established track record of field performance - over one million hours of online operation."
I wonder if they'll revise the brochure now?
Tim
give your dog a bad name and hang it.
give coders a bad name and outsource them.
Sure. Patch that code. Maybe a higher percentage of power companies will apply these patches than apply Microsoft patches. Next summer, another blackout will be caused by a different bug, 148,234 lines away and just as hidden. The way to prevent a recurrence is to set up a system that moonitors the monitors. Every few seconds, it contacts each of the critical monitoring PCs, asking them to do a health-check. When one PC doesn't respond, a screen goes red (and maybe makes a sound). Additionally, this central monitor system puts the time of the last successful survey up in big letters on the screen. Power company personnel will find out within a few seconds that something is wrong. Hey! G.E.! Are you listening?!
To a politician, one email equals one voter.
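For what it's worth, the "monitor the monitors" idea above can be sketched in a few lines of C. This is only an illustration under my own assumptions: `monitor_t`, `record_heartbeat`, `is_stale`, and `count_stale` are made-up names, the timeout is arbitrary, and nothing here reflects how XA/21 or any real EMS actually works.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical record for one monitored server; all names are illustrative. */
typedef struct {
    const char *name;
    time_t last_ok;   /* time of the last successful health check */
} monitor_t;

/* Record a successful health-check reply from a server. */
void record_heartbeat(monitor_t *m, time_t now) {
    m->last_ok = now;
}

/* A server is stale when it has not answered within timeout_s seconds. */
bool is_stale(const monitor_t *m, time_t now, double timeout_s) {
    return difftime(now, m->last_ok) > timeout_s;
}

/* Count stale servers. A real console would go red (and maybe make a
   sound) whenever this is nonzero, and display each last_ok time. */
size_t count_stale(const monitor_t *ms, size_t n, time_t now, double timeout_s) {
    size_t stale = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_stale(&ms[i], now, timeout_s)) {
            stale++;
        }
    }
    return stale;
}
```

The point of keeping `last_ok` around is the second half of the suggestion: operators can see how old their data is, instead of silently staring at a frozen screen.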
Finally! The Y2K bug bit....
Oh wait..
Finally! The Y2K + 3 years, 8 months bug bit!
See? All those powdered eggs and shotgun shells paid off.
Hushed voice in my head: (PSST! The power was only out for a day or so)
Uhhhhh, nevermind.
WTF? Over?
By my calculations, assuming air ionizes at about 10,000 Volts / centimeter, a 50MV line should be at least 5,000 cm (or 50 meters) from any ground. 50 meters on either side of a line is a lot of property for an electrical company to buy, and with a surge in the line I'd bet the distance would need to be even more.
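The arithmetic above is just voltage divided by the assumed breakdown field. A tiny C sketch, where `min_clearance_m` is a made-up helper and the 10,000 V/cm figure is the poster's assumption, not an engineering standard:

```c
/* Minimum air gap in metres for a given line voltage, using an assumed
   breakdown field in volts per centimetre (the 10,000 V/cm above is the
   poster's rough figure, not a clearance rule). */
double min_clearance_m(double line_volts, double breakdown_v_per_cm) {
    double gap_cm = line_volts / breakdown_v_per_cm;
    return gap_cm / 100.0;  /* centimetres to metres */
}
```

With `min_clearance_m(50e6, 1e4)` this gives the 50 meters quoted above. Real clearance requirements depend on much more than this one division, so treat it as a back-of-the-envelope check only.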
Does this mean we can't blame Slammer for this?
In all fairness...
The Mars Rover's software crashed in just a few days.
Virtually all software should be designed and tested better than it is.
However, I'm perplexed at why the Mars Rover failure and resurrection is considered a miracle of human ingenuity, rather than an indictment of crummy testing.
I'll not excuse the power grid software either; but it seems to work more reliably than the software on the Rover.
Well, I have news for you: 50MV lines don't exist! Not out in the open, anyway. Was it 50 kV, perchance?
but I feel like I'd be showing my age if I said
I worked at WildFire.
what good is a backup system if it's never been tested?
If she floats, she's a witch.
What you are saying makes perfect sense. Just one question, when was the last time you saw a sensible corporation?
So what? We still have an electric grid that needs a complete overhaul.
In HW development, I always create the test units before anything else. It's part of the specification. How many SW developers do that? How many even bother to create test units?
Oops, I hit submit when I should'a hit preview.
Mentally start new paragraphs at "The Way..." and "Hey...".
Also, note that I meant "monitors", not "moonitors", though moonitors could mean: "monitoring with attitude"...
To a politician, one email equals one voter.
Z is a "formal methods" language. These are languages that allow you to write proofs about your programs - given some specification, you can generate a proof which demonstrates that your program complies with the specification.
Obviously it doesn't/can't deal with errors in the specification itself, but it can reduce errors in the implementation process.
Nae bother
The software handled one part of the electrical system involved.
What about a good Electrical/Mechanical/Civil Engineering solution that would have prevented it from cascading through different systems / electrical companies / countries?
One piece of software which didn't raise an alarm is shocking. The fact that it cascaded over such a wide area is simply mind blowing.
Before we talk about "software engineers" how about talking about "traditional engineers" and their role in this massive failure?
The surprise isn't how often we make bad choices; the surprise is how seldom they defeat us.
Obviously this has to be the security bug that MicroSoft sat on for 6 months.
We may slam Microsoft for all of its bugs, but it's really hard to top a software bug triggering an international blackout the size of the one last summer. I think I should sue GE for making me walk 3.5 hours home in the heat with no money in Toronto, uphill, because I couldn't take a subway home. I smell a lawsuit the size of the eastern seaboard.
If there was an engineer signing off on the project (not just working on it, mind, but signing off on it), his license to practise would be revoked (he could never sign off on another project, ever), he would face huge fines (tens of thousands to millions of dollars, but probably covered by insurance that he would be required to have to practise), and there would probably be jail time if an investigation found him negligent in his duties. There would be an investigation, and considering that the system failed spectacularly he probably would be found negligent.
I don't know of any professional licensing body for software development. So that almost surely isn't the case here. For chemical reactors and other things that do require oversight by a licensed engineer this is the case, and when they fail, engineers are disbarred, fined, and imprisoned.
... shortly before the blackout, a chick in black PVC dropped a motorcycle full of explosives on the control room and then jacked in with a laptop full of unauthorized software.
But that couldn't possibly cause a blackout, could it?
As described in the excellent work by Bruce Sterling, "The Hacker Crackdown" (which everyone probably read): the blackout of the AT&T telephone switching system in 1990 also occurred because of a software error.
What happened then (accusing hackers of being responsible) is happening again: people pointing to external factors as the cause.
When will people start to learn from past mistakes and realize that, instead of accusing people, they would do better to spend their time on software audits?
Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
After lots of years as a developer, I realized that the engineering process that goes into other professions (for example, civil engineering) can't be applied to software. The reason is simple: software is many orders more complex. Software has many interdependencies between components, has many states, and it is subject to change every minute. It's very difficult to see ahead and provide APIs that fit all the needs; that's why we go back and change the damn thing. What does a civil engineer have to do? He/she has to combine parts and test if they hold together. There are a lot of parts, but the general principles are few and can be easily remembered...unlike software.
Furthermore, the tools we have for the job are inadequate. The programming languages are primitive. The debugging tools are dumb. The machines are not clever and strong enough to prove the mathematical theorems behind its program. We don't even learn these things in college...we learn how to use programming languages, but we don't learn how to program...but I seriously believe we will never learn how to program, because a program's complexity increases tenfold for each line of code written!!!
It was terrorists! We had a lightning strike! We didn't have enough money to upgrade our equipment! My transformer didn't come back from the cleaners! Dick Cheney came in from out of town! Someone stole my power! There was an earthquake! A terrible flood! A GE software bug! IT WASN'T MY FAULT, I SWEAR TO GOD!
...that presented itself in the AT&T software is told at the end of the chapter, repeated here for your convenience:
"As it happened, the problem itself - the problem per se - took this form. A piece of telco software had been written in C language, a standard language of the telco field. Within the C software was a long "do... while" construct. The "do... while" construct contained a "switch" statement. The "switch" statement contained an "if" clause. The "if" clause contained a "break." The "break" was supposed to "break" the "if" clause. Instead, the "break" broke the "switch" statement."
Slashdot: stuff for news, nerds that matter, matter for news, stuff that nerd
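For anyone who hasn't been bitten by this: in C, `break` leaves the innermost enclosing loop or `switch`, never an `if`. Here is a minimal sketch of the shape Sterling describes. It is emphatically not the AT&T code; `handle_message`, `msg_type`, and `buffer_full` are names I made up for illustration.

```c
/* Returns how many steps ran for one message. Hypothetical names
   throughout; this only illustrates the break-inside-if pitfall. */
int handle_message(int msg_type, int buffer_full) {
    int steps = 0;
    do {
        switch (msg_type) {
        case 1:
            steps++;          /* step A */
            if (buffer_full) {
                break;        /* exits the switch, not just the if */
            }
            steps++;          /* step B: silently skipped when buffer_full */
            break;
        default:
            break;
        }
        steps++;              /* step C: runs after the switch either way */
    } while (0);              /* single pass, just to mirror the do...while */
    return steps;
}
```

If the author expected step B to run after the `if`, they get a surprise: `handle_message(1, 1)` runs only steps A and C, while `handle_message(1, 0)` runs all three.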
I'm sure this was mentioned in the original blackout posts - since the Blaster virus was running full tilt at that time, there was an increased load on servers, routers, switches, hubs and blinky things that go whoop! whoop!! WHOOOOP! The increased demand on computing resources caused increased power demand (not to mention the cranked ACs at the homes of the poor IT staff who were staring at their blackberrys and sweating bullets) which in turn caused the alarm conditions which didn't get alarmed properly and so the powergrid went down. All because of an MS security hole.
How's that?
Silly person. You didn't read the EULA on that software before clicking install. There is no warranty or guarantee that the software will even do what it claims to do, let alone function correctly in any way. You waive all right to hold the company responsible.
"I am not a number! I am a free man!"-- The Prisoner
So the software didn't raise alarms as it should've. That's bad. But it seems to me that the software is being made a scapegoat here. It's much easier to blame "that #$@&@$ computer" than "FirstEnergy's failure to trim back trees encroaching on high-voltage power lines" or the fact that the infrastructure for the power grid is old and poorly set up such that one failure can bring down the whole system. There's no reason why a failure in Ohio should black out New York and there's nothing software can do to fix that.
has anyone made a bittorrent for the patch and seeded it?
Snippet from the top of the file in question // Copyright (c) SCO group, Inc.
Now, where's that $699 they owe?
Who would win this election: Andrew Weiner vs Andrew Weiner's weiner.
I would like to be able to see the Milky Way again...
O Canada, that far-off land somewhere up north. I didn't even know they had electricity. For the rest of us, that's eight states and 1 Canadian province.
-Malloc___________________ I want to be free()!
On Tuesday, the North American Electric Reliability Council (NERC), the industry group responsible for preventing blackouts in the U.S. and Canada, approved a raft of directives to utility companies aimed at preventing a recurrence of the outage. One of them gives FirstEnergy a June 30th deadline to install any known patches for its XA/21 system.
Giving them till the end of June to install software patches is ridiculous! Do they want another blackout or something? I wonder what the deadlines for the other directives are like.
Part of the class was dedicated to ensuring that we learned from the mistakes of the past. They showed us the video of the infamous Tacoma Narrows bridge, and several other engineering mishaps. I was a computer science major and most, if not all, of the examples shown in the class, as far as I can remember, were engineering mishaps. I think this is a great example that can now be added to the list of infamous engineering slip-ups. This is a particularly good example for computer science majors: it shows that yes, you really do need good testing, and yes, major disasters can be caused by as little as one line of bad code.
I always wondered why we CS majors had to sit through that class, but here's a great example why.
Where can I download "the patch"?
LOL. are you stupid?
Patch refers to the open source world. In commercial apps, a guy with a hat comes to the office, sits in front of the PC and installs the new version. It's not like the company releases the patch and whoever wants can download/check it....
...If programming is so complex, then why don't we try something new? You want a program without state? Try Haskell. You want to be able to prove something about your program? Try ML. But don't despair, I think the reason for crummy software is that it hasn't been around for that long. Civil engineers have had the hindsight of building roads, and aqueducts, and buildings for thousands of years. Software has been around for what, 2 generations?
Rent out the upstairs as a battery charger.
My co-worker and I have come up with the idea of using "RE" in our title... signifying "Real Engineer".
Not that I have anything against primates.
When a backup server kicked-in, it also failed, unable to handle the accumulation of unprocessed events that had queued up since the main system's failure.
I'd have to say that this stands out to me to be a big part of the problem.
You can't expect these people to be able to make the appropriate choices if they don't have relevant data.
--Phillip
Can you say BIRTH TAX
So what? You use a cell phone, don't you? The electrical energy exposure you get from that is substantially greater.
How about electric blankets or heating pads? How about a battery powered shaver?
You expose yourself to these fields every day to an extent far greater than what you may have received from that transmission line.
By the way, you can light a neon light with a bit of wire and very little power. You can also light it with a MW AM broadcast transmitter less than a mile away; you can light it with a CB radio; and with just a bit more wire, and a location closer to the poles of the earth, you can light it when the earth is hit by a solar flare. Many among the various eco-scare-monger groups like to make this demonstration as if it were an indicator of something dangerous. If it were, there would be no life anywhere near the Arctic Circle.
Aside from the poor maintenance of the clear-cut area, you really have no need to be concerned about this.
Nearly fifty percent of all graduates come from the bottom half of the class!
We electrical engineers screwed ourselves by making systems run well for years, unlike our computer science and electronics cousins, who still sell a bunch of wires connected to a PCB version of a circuit board with an SMPS sitting one inch from the CPU and call it a state-of-the-art computer. Things go down, get fixed, and people get a lot of respect for screwing it up in the first place.
One outage in a decade--and you are bitching ! shame on you !
- People who believe other people have no right to live, got no right to live ...
Great! Where do I download it? Oh, wait. I don't own a massive piece of electricity distribution infrastructure; never mind.
"'I pass the test,' she said. 'I will diminish, and go into the West, and remain Galadriel.'"
- JRR Tolkien.
the electricity blacks you out... :~)
"'I pass the test,' she said. 'I will diminish, and go into the West, and remain Galadriel.'"
- JRR Tolkien.
Had this been a Windows-based system, there would be a huge torrent of comments about how fundamentally unreliable the OS and platform were.
Funny, just because this ships for "industrial strength" AIX / Solaris RISC systems (see specs on pg 8), I don't see any cheap, reflexive comments about the platform.
I guess the message here is that good or bad code can be written for any architecture.
GE's software may suck. I don't know. I've never seen it. I am suspicious of people who attempt to hide their own negligence by blaming a third party.
I've seen it, and worked in software engineering at GE (not in Power Systems, though). Like any other place, you have some brain-dead code monkeys and lots of good people. Sometimes a BCM is promoted to management, and you get crappy or nonexistent code reviews. Just like anywhere else.
It's interesting that GE has been touting Six Sigma as a way of ensuring that this sort of thing can't happen, yet trying to apply statistical quality analysis to software development is inherently doomed; it's like trying to measure the color of the wind or the temperature of music; it's a measuring tool that doesn't operate on the same kind of thing as what it is trying to measure. So, the six-sigma projects in software development tend to be very, very indirect measurements of anything useful, let alone code stability and quality.
It's a software bug, plain and simple, and it's got the GE Meatball plastered all over it; no point in trying to shift blame when they sold and controlled it.
The bug in GE Energy's XA/21 system was discovered in an intensive code audit..."This fault was so deeply embedded, it took them weeks of poring through millions of lines of code and data to find it."
Ah, the benefits of outsourcing.
I always treat watchdog software with just a bit of skepticism. The problem, as pointed out by NERC, was that a process in the system was somehow present, but not communicating well.
The alarm subsystem is often a separate process. It doesn't talk to the field. That's the job for other elements of the SCADA system. It was supposed to watch for semaphores, messages, or read shared memory somewhere. How do you watchdog something like that if it gets the message, but doesn't do what it's supposed to?
In a SCADA system near and dear to my career, we set alarm thresholds so low that the operators expect a certain amount of alarm traffic even for routine events. This helps to discover any misbehavior in the alarm system.
There is such a thing as a control center which is TOO quiet.
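The "too quiet" check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name, the `max_quiet_seconds` parameter, and the idea of stamping each alarm that actually reaches the operator display are all hypothetical, not taken from any real SCADA product.

```python
class QuietAlarmMonitor:
    """Flags an alarm system that has been silent for suspiciously long.

    Hypothetical sketch: thresholds are assumed to be set low enough that
    routine alarm traffic arrives at least every max_quiet_seconds.
    """

    def __init__(self, max_quiet_seconds, start_time):
        self.max_quiet_seconds = max_quiet_seconds
        self.last_alarm_time = start_time

    def alarm_displayed(self, now):
        # Call this only when an alarm actually reaches the operator,
        # so the check covers the whole path and not just the input side.
        self.last_alarm_time = now

    def is_too_quiet(self, now):
        # Silence longer than the expected routine interval is itself
        # treated as a fault indication.
        return (now - self.last_alarm_time) > self.max_quiet_seconds


monitor = QuietAlarmMonitor(max_quiet_seconds=600, start_time=0)
monitor.alarm_displayed(now=500)
print(monitor.is_too_quiet(now=1000))  # False: last alarm 500 s ago
print(monitor.is_too_quiet(now=1101))  # True: 601 s of silence
```

Times are passed in explicitly so the logic can be tested without a real clock; a deployment would presumably feed it a monotonic time source.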
Nearly fifty percent of all graduates come from the bottom half of the class!
Well, the story is stupid, not the doctor.
At Virginia Tech, I started off in the honors program. The Dean of the Honors program was named, you guessed it, "Nurse". So he was Dr. Nurse.
Not so funny, you say? Well wait, there's more !
Dr. Nurse's wife worked in the infirmary, and she was a nurse !
So we would wander by his office, and say, "Hello Dr. Nurse, how is the wife? Nurse Nurse?"
I guess it wasn't really that funny.
I am on a team to build a SCADA system. Doing it right depends on two things:
1) Defining all your status bits to have zero be fail safe.
2) Clearing all the status bits from the reporting system when communications go down.
For example, if you are doing train control, you define train on this track to be zero, no train to be one.
If your communications go down, all the tracks reported by that field processor show up as having trains on them, so you don't send another train through that track.
For those that haven't seen the acronym before:
SCADA = Supervisory Control and Data Acquisition
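The two rules above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `FieldProcessor` class and track names; it is not code from any real train control or SCADA system.

```python
TRAIN_PRESENT = 0  # fail-safe value: zero means "assume the track is occupied"
TRACK_CLEAR = 1


class FieldProcessor:
    """Hypothetical field processor holding one status bit per track."""

    def __init__(self):
        self.track_bits = {}

    def report(self, track, bit):
        self.track_bits[track] = bit

    def comms_lost(self):
        # Rule 2: clear every status bit when communications go down,
        # so every track falls back to the fail-safe state.
        for track in self.track_bits:
            self.track_bits[track] = TRAIN_PRESENT

    def may_dispatch(self, track):
        # Rule 1: only an explicit 1 permits a dispatch; tracks we have
        # never heard about also read as 0 (occupied).
        return self.track_bits.get(track, TRAIN_PRESENT) == TRACK_CLEAR


fp = FieldProcessor()
fp.report("track-7", TRACK_CLEAR)
print(fp.may_dispatch("track-7"))  # True
fp.comms_lost()
print(fp.may_dispatch("track-7"))  # False
```

The point of the design is that every way of losing information (a dropped link, an unknown track) lands on the safe answer without any extra code path.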
"We can't solve problems by using the same kind of thinking we used when we created them." -- Albert Einstein
After looking at the original report, it looks more like the GE XA/21 SCADA network failure was not the primary cause of the cascading failure but more an effect of the failure. The key failure seems to be a software system called the "State Estimator" (SE) that is used by the Midwest Independent System Operator (MISO), a NERC reliability coordinator, to develop optimal solutions for the planned operating level of all of the power generation and transmission equipment in the MISO area, covering about 10 midwest states and 1 million square miles. It is not described in much detail, but the SE seems to be an optimization tool using a linear programming model that gathers availability data for all of the major system components and load demand every five minutes and then calculates the 'optimal' use of those system components to maintain system reliability at the required level. The 'solution' of the model is then used to plan the operation of the overall system by sending the target operating levels to each facility in the system.
So why did it fail? Two reasons. First, the model depends on having accurate availability information from each major system component. Status information is sent to MISO in Indiana by the "ECAR" data network or by direct links. On the day of the failure, the direct link to a key transmission line was not working and the analyst had turned off the estimator to troubleshoot it. After fixing the problem, he went to lunch and forgot to put the system back in automatic mode where it would develop updated solutions. This situation existed for 2 hours from 12:15 to 14:40. When the estimator was switched back to automatic, it was unable to develop a solution because another key transmission line had overloaded and tripped and *its* new non-operational status was unknown to the model, apparently because the status of that line is assumed to be 'on' until told otherwise. This problem was not corrected until 16:04.
The bottom line is that a critical major planning tool was not available for 4 hours for a regional generation and distribution system that absolutely required its use to be operated successfully when the system power supply was very close to the demand.
The SCADA system itself did not fail, but its alarm function, which provides alarms to control room operators about system operational problems, did. The problem with the alarm function seems to be a case of too many alarms for the system to handle as the problems multiplied. The software bug that they are now reporting was probably related to the unexpectedly large number of alarms that the system was experiencing. The new alarm inputs built up and then overflowed the process input buffers. The alarm system just stalled while processing an alarm event and the alarm function stopped. Then, at 14:41 the primary server hosting the alarm processing application failed due to some combination of the stalling of the alarm application and the queueing to the remote terminals. The hapless backup server then was automatically activated and everything was transferred to it, even the functional non-alarm stuff. The backup server failed after 13 minutes. Basically, the SCADA alarm system seems to have been massively overloaded (which shouldn't ever happen, of course) beyond the capability of the system design to cope with. The bug apparently prevented an indication that the alarm system was failing, but it looks like the cascading failure still would have occurred even if the software bug had not been present, because the system deterioration had progressed too far to recover by the time that the bug manifested itself.
The immediate cause of the failure seems to be the forgetfulness of the analyst who was operating the planning model. The significant underlying contributory cause seems to be a very poor regional operational design in which a critical centralized system planning tool was being used with insufficient backup and oversight. It looks as though both Unix and Windows escape blame. The SCADA system probably was doing far more than its designers intended and probably performed heroically until it died. 'Aye Captain...I canna do no more.'
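To make the overload scenario above concrete, here is a purely illustrative sketch of an alarm input buffer that reports overflow instead of stalling silently. It says nothing about how the XA/21 is actually built; `BoundedAlarmQueue`, `capacity`, and the `overloaded` flag are all hypothetical names.

```python
from collections import deque


class BoundedAlarmQueue:
    """Fixed-capacity alarm buffer that counts what it had to drop."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def push(self, event):
        # Refuse the event when full and record the loss, so the
        # overload itself becomes something an operator can be told about.
        if len(self.events) >= self.capacity:
            self.dropped += 1
            return False
        self.events.append(event)
        return True

    def pop(self):
        # Oldest event first; None when there is nothing queued.
        return self.events.popleft() if self.events else None

    @property
    def overloaded(self):
        return self.dropped > 0


queue = BoundedAlarmQueue(capacity=2)
for name in ("line trip", "overload", "voltage low"):
    queue.push(name)
print(queue.dropped, queue.overloaded)  # 1 True
```

Whether dropping, blocking, or coalescing is the right policy for a full buffer is a design decision; the only point here is that the full condition is surfaced rather than hidden.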
In the case of the electric blankets, you're not exposing yourself to much of any B or H field; there's not enough current present to generate much. Now, if you'd said something like a hair dryer, where the field is concentrated to power the motor...
The phone may generate more relative power, but it's at a different frequency; with regard to electricity and the human body, frequency matters as much as anything else.
For DC, 10 mA of current may not be noticeable to a person.
For 50/60 Hz AC, it's going to cause a twitching of the muscles.
For DC at 100 mA to 1 A of current, you're going to get a zap similar in nature to sticking your tongue on a 9 V battery, proportionate to the current in question.
For 50/60 Hz AC at 100 mA to 1 A, it's going to be causing painful contractions of your muscles, and very probably stopping your heart outright if the conduction pathway crosses it.
There have been studies that tend to show that even low energy densities of 50/60 Hz AC can accelerate tumor growth; no studies have actually proven that they cause tumors, though. Effects like the one mentioned tend to be caused more by continuous exposure than point exposure, so the low levels of energy radiated by the high-tension lines may be a problem if you're next to them, since it's a continuous background level sort of thing.
I am not merely a "consumer" or a "taxpayer". I am a Citizen of the State of Texas
I guess the danger comes more from the magnetic field (measured in tesla) induced by the power lines than from the electric field, and that magnetic field is far more significant than the one induced by your cell phone.
Violence is the last refuge of the incompetent - Salvor Hardin
The reason it spread so far was the amount of load that was shifted was more than the adjacent stations could handle. So they took themselves offline to protect themselves, and redistributed the load causing it to just get bigger and bigger like the proverbial snowball rolling down the hill. As for why the adjacent stations weren't able to handle the additional load, you're looking at management decisions not to increase capacity, and government/electorate decisions to prevent building of new power plants.
The system worked as designed by the engineers to prevent a far worse calamity than a blackout that lasted only a few days.
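The snowball effect described above can be sketched with a toy model. This is a deliberately crude illustration with made-up numbers: real grids redistribute power according to network physics, not an even split, and `simulate_cascade` is a hypothetical helper rather than anything from the blackout report.

```python
def simulate_cascade(loads, capacities, first_failure):
    """Return the stations that trip, in order, after one initial failure.

    Toy assumption: a tripped station's load is split evenly among the
    stations still online, and any station pushed past its capacity
    takes itself offline to protect itself.
    """
    loads = dict(loads)
    online = set(loads)
    tripped = []
    pending = [first_failure]
    while pending:
        station = pending.pop(0)
        if station not in online:
            continue
        online.remove(station)
        tripped.append(station)
        shed = loads.pop(station)
        if not online:
            break  # nobody left to pick up the load
        share = shed / len(online)
        for other in sorted(online):
            loads[other] += share
            if loads[other] > capacities[other] and other not in pending:
                pending.append(other)
    return tripped


capacities = {"A": 100, "B": 100, "C": 100}
# Little spare capacity: one failure takes everything down.
print(simulate_cascade({"A": 90, "B": 80, "C": 80}, capacities, "A"))
# Plenty of headroom: the neighbours absorb the load.
print(simulate_cascade({"A": 30, "B": 40, "C": 40}, capacities, "A"))
```

With the first set of numbers all three stations trip; with the second only the initial failure does, which is the spare-capacity point in a nutshell.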
The bigger problem here is that people have a problem with the job title of others.
The branch of Computer Science called "Software Engineering" teaches the various ways of constructing large scale computer programs. What is a logical name for someone that works in the field of "Software Engineering?"
There is a difference between someone who is a software engineer and someone who just writes code.
The clearance can narrow in some conditions. When the lines get hot, they expand and sag noticeably. Hot weather will do it, and so will high current.
Then, just when you most need the power, a tree that used to be at a just barely safe distance shorts the power line.
The high end for mainstream deployments, by the way, is 750 kV or 1 MV. Corona losses get really bad above that level.
SW engineering is still in its infancy: we've only been writing software for 50 years. Look at Civil Engineering for a comparison. As a craft, it dates back to classical antiquity (e.g. late Bronze Age civs like Ancient Greece and Egypt). It didn't become a true engineering discipline as we know it today until (at least) the Renaissance. (One could argue that engineering, in the modern sense of the word, didn't exist until the Industrial Revolution.)
Why is it that the proponents of "one nation under God" are so eager to get rid of "liberty and justice for all"?
Can you please remove the above post, which is a cut & paste of the article on our site. Sorry, but that's a copyright infringement and contrary to the nature of Slashdot. We always LOVE it when our articles get slashdotted, but not posts like this. Instead, just follow the link to the article like usual. Thanks, Kelly Martin Content Editor, SecurityFocus kel@securityfocus.com
my sincere apologies to above post. no you you
moron.
maybe the radiation from a cell phone might be more
but a cell phone isn't 90 foot tall!
you might get an X-ray for a sore tooth or a
broken bone. you can't feel anything. but
step outside your house on a sunny summer day and
you can feel the heat from the sun really good.
the field from the mobile phone might be stronger
but just affects a really small region.
while having a 90 foot monster pumping
three quarters of a million volts away can affect
you.
i hope top poster sues the government or moves.
i would never live near a monster that can make
a "neon" tube glow just by pointing it at it.
It was system engineering.
This was analogous to a broadcast storm on a network. You can't audit the code for one node and say "remove the subroutine for creating broadcast storms". The problem is an emergent property that only shows up when lots of things are put together, not a property of any single installation.
Humans have only been building huge tightly coupled systems with fast-moving surprises for ~50 years. That's less time than it took to figure out how to build bridges and cathedrals that didn't fall down. Common sense doesn't suffice. The big AT&T long distance outage was caused when a simple coding bug interacted with, of all things, the fault tolerance design and made the switching network DDoS itself.
Making things worse, one cause of cascade failures is running things without a lot of spare capacity, or in other words, economic efficiency. Expect a lot more events like this in the future, and don't expect EEs to prevent them.
(I'm a CISSP not an EE).
"...Congress must act now to rein in the Patriot Act" - Newt Gingrich
To state that the cause of the blackout was "buggy code" is a bit much. A factor, yes. Major factor, perhaps. But the fact is that there were - and still are - many things wrong with the system that led to the power disruptions experienced.
Software bug, could only mean one thing... Windows!, wonder if the patch is up on Windows Update yet :)
FirstEnergy says it already patched the blackout bug last fall, when GE made a fix available, and is in the process of replacing the XA/21 with a competing system -- a changeover that was planned before the blackout.
FE patched the bug, but didn't disclose that it was the cause of the Blackout. Even though they were already dumping GE's software. What is the mysterious power GE has over FE, that FE wouldn't deflect the blame, that it so clearly wants to shirk, onto GE, the weak link?
--
make install -not war
Be thankful the comment did not look like this:
' Not sure why this works for my test data.
' Probably should come back and re-write this
' if we have time before the product ships.
It didn't say that Blaster wasn't related to the blackout, just that Blaster didn't cause this particular bug in the GE software.
I find it hard to believe that one single bug can cause the whole grid to go down. More likely it was a combination of factors; indeed, later on in the article it says:
"FirstEnergy says its problems were some of many issues destabilizing power flow in the northeast that day, and that its role in the outage is overstated in the interim report."
They found this bug during a code audit. The real question, then, is why the hell did they not do just as intensive an audit BEFORE releasing the software?
Chalk one up for software again! First the Mars lander Spirit and now this! w007! 1337 programming!
Software: 2
Hardware: 0
"Injustice anywhere is a threat to justice everywhere." - Martin Luther King, Jr.
I just never thought I'd see this in reality.
Also, where are all the +5 posts that unequivocally claimed "this is the type of thing open source would have prevented" what with all those eyes looking at all those bugs.
Nope, all I can hear is the sound of crickets in the background.
Oh but wait, we don't want to talk about these things here, $deity forbid that the "community" be somehow characterized as hysterical FUD-spreading blob of mindless sheep. Kinda like that FUD-spreading mindless corporation they accuse of everything and anything. Nah, that would be just too painful.
We also don't want to compare roblimo's article with the BBC's editorial that gently placed the MyDoom blame on open source developers. Wait, I don't remember anyone complaining about the utter stupidity of the NewsForge article so I guess we can't compare them. Although one might certainly make the case that they're basically the same thing, if one wanted.
And I'll post this at +2, just so I can bleed off more moderator points. But reality sucks, doesn't it?
It was a combination of factors, so therefore, your pet theory about a possible factor is correct.
Makes you wonder who wrote this software? Offshore developers?
Clearly you haven't read the other comments: the XA/21 code is not "outsourced" but developed in Melbourne, FL.
But then that would require a quick investigation of the facts!
Ha ha! Wish I could mod you up.
Isn't this exactly like saying that a faulty fire alarm caused a fire? This is a non sequitur used as a smokescreen to cover up the real cause.
Deregulation.
This software bug may play in heavily - but there's a reason why that bug is there. . . a reason why it made it into mission-critical software, on a live production system, and a reason why it contributed to a massive cascade failure that affected such a wide area.
This is the same reason for the Enron/Worldcom/Tyco failures (via 1995 Private Securities Litigation Reform Act).
This is the same reason for 9/11 (airline security deregulation).
This is the same reason for Janet Jackson's tit (regardless of whether you felt offended or outraged by it).
This is the same reason for the California Rolling Blackouts of 2000.
Please THINK before you vote.
Not all regulation is bad regulation. But even bad regulation is better than NO regulation.
These are my friends, See how they glisten. See this one shine, how he smiles in the light.
So how do you do this? I live very close to a bunch of broadcast towers. Can I replace my solar collectors by collecting this power and storing it in my batteries somehow?
I'd guess that cost of labor is part of the picture. Instead of remotely operating switchgear, you send someone to a substation to do it. Instead of remotely reading currents and temperatures, you send someone around in a truck to write down these readings.
Ooh, a patch is available? Where can I download, anybody got a link?
You're not going to heat your house, but you might make a noticeable difference in a rechargeable AAA.
By B I mean the B method:
http://vl.fmnet.info/b/
"I object to doing things that computers can do." -- Olin Shivers, lispers.org
In a SCADA system near and dear to my career, we set alarm thresholds so low that the operators expect a certain amount of alarm traffic even for routine events. This helps to discover any misbehavior in the alarm system.
It also trains your operators to treat alarms as expected events which can be ignored, rather than something unusual which must be attended to immediately.
455fe10422ca29c4933f95052b792ab2
OH THE SHAME I fell off the wagon and use sigs again!
I would like to patch my energy management system but I can't seem to find the file. Has anyone got a BitTorrent link?
Folks, be sure to remind your less computer-literate acquaintances to patch their energy management systems too!
Funny, just because this ships for "industrial strength" AIX / Solaris RISC systems (see specs on pg 8), I don't see any cheap, reflexive comments about the platform.
The only thing "funny" about that is how it reflects poorly on Windows developers (bet you didn't see that coming! :-). When people have a Windows problem, yeah, it's just an endless stream of "boy, isn't Microsoft putting out crap". When a Unix system has a similar problem, it's essentially "we screwed this up; no scapegoat to blame".
What you try to spin as something Windows coders can point to and gloat is actually something Unix coders can point at and gloat. It's called professionalism, which is something sorely lacking at Camp Microsoft these days. The fact you didn't realize that before posting means you're probably a Windows goon. How embarrassing for you. Don't you wish you could delete posts? :-)
Had it been Freeware GE might have gotten a shock before the end of the world.
In reality, I think these power lines may have done me more good than harm. During thunderstorms, I can pretty much count on the lightning striking the tower in the backyard instead of the house.
But yeah, I'd like them to take a bit more care of the easement too. We really don't need anyone hurt by an arc some stormy night.
John
They should have gone with Sorny.
Sorry, I don't want to delete the post. Despite your alleged implications about the situation, surely you're not denying that had the problem occurred on a Windows platform many slashdotters would make reflex comments about the platform whether the problem lay with the coders or not?
I certainly don't concede your suggestion that RISC/UNIX systems are so stable that it would be clear that it was bad coding. There are lots of examples of flaws in various *NIX OSs and physical issues on RISC servers (ask current Sun users about system quality and reliability on the low-end V series boxes).
BTW, I spend about 50% of my time with Windows environments and 50% with *UNIX/Linux environments. My single most consistent observation is that the quality of end-user management discipline by managers and administrators is the most likely source of system failures, and variance here is far more important than the particular platform.
In this vein, the closest I'll come to your position is that Microsoft made it easy (via quick GUIs and reams of pre-baked defaults) for poorly trained people to poorly deploy Windows systems that do run. *nix/Linux systems are just hard enough to get installed that if you really don't know what you're doing you can't put them into production at all. Economics ensures that some fraction of the business community will go to the lowest cost option that seems to work (badly or not), which is why there are so many poor Windows installations out there. Because there CAN be.
I can light a fluorescent tube with my 5W amateur radio handheld (144 MHz).
Another difference between the power lines and your cell phone is that the wavelength of the RF used by the cell phone is much more likely to interact with your body than 50/60Hz. A 2 meter tall human is approximately:
- a fullwave at 150 MHz
- a half wave at 75 MHz
- a quarter wave at 37.5 MHz.
That same human would be about 4e-7 of a wave at 60 Hz. At cell phone frequencies (850 MHz, 1900 MHz), parts of the body begin to exhibit resonances; 850 MHz = 35.3 cm, 1900 MHz = 15.8 cm. If you're curious, there are MPE limits (Maximum Permissible Exposure) that apply to RF sources regulated by the FCC. Amateur radio operators, cell tower operators, etc. must abide by these safety rules. Oddly, the limits seem not to apply under 300 kHz.
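The wavelength figures above are easy to check. A small sketch, assuming free-space propagation and a 2 meter body as in the post; the function names are just for illustration.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def wavelength_m(freq_hz):
    """Free-space wavelength in meters for a given frequency."""
    return C / freq_hz


def body_fraction(freq_hz, height_m=2.0):
    """How many wavelengths a body of the given height spans."""
    return height_m / wavelength_m(freq_hz)


for label, freq in [("150 MHz", 150e6), ("75 MHz", 75e6), ("37.5 MHz", 37.5e6),
                    ("850 MHz", 850e6), ("1900 MHz", 1900e6), ("60 Hz", 60.0)]:
    print(f"{label:>9}: wavelength {wavelength_m(freq):.4g} m, "
          f"2 m body = {body_fraction(freq):.3g} wave")
```

At 60 Hz the wavelength comes out to roughly 5000 km, so a 2 meter body spans only about 4e-7 of a wave, while at 850 MHz and 1900 MHz the wavelengths are about 35.3 cm and 15.8 cm.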
Tiller's Rule: Never use a word in written form that you've only heard and never read. You will end up looking foolish.
Sorry, I don't want to delete the post. Despite your alleged implications about the situation, surely you're not denying that had the problem occurred on a Windows platform many slashdotters would make reflex comments about the platform whether the problem lay with the coders or not?
A software problem can only lie with the coders. My point is that coders anywhere can name Windows as a cop-out, but they can't do that with a Unix (especially an open one) unless they demonstrate an actual OS bug.
I certainly don't concede your suggestion that RISC/UNIX systems are so stable that it would be clear that it was bad coding. There are lots of examples of flaws in various *NIX OSs and physical issues on RISC servers (ask current Sun users about system quality and reliability on the low-end V series boxes).
Again, it's not about what/where flaws do or don't exist, it's about the professionalism of the coders who step forward and accept the faults in their code. The Windows camp has so often blamed MS that it's a running joke. The Unix camp simply can't blame the OS unless it's a serious issue. My point stands.