Software Update Shuts Down Nuclear Power Plant

Posted by Soulskill on Friday June 6, 2008 @11:58AM from the we-have-safety-systems-because-we-are-very-stupid dept.

Garabito writes "Hatch Nuclear Power Plant near Baxley, Georgia was forced into a 48-hour emergency shutdown when a computer on the plant's business network was rebooted after an engineer installed a software update. The Washington Post reports, 'The computer in question was used to monitor chemical and diagnostic data from one of the facility's primary control systems, and the software update was designed to synchronize data on both systems. According to a report filed with the Nuclear Regulatory Commission, when the updated computer rebooted, it reset the data on the control system, causing safety systems to errantly interpret the lack of data as a drop in water reservoirs that cool the plant's radioactive nuclear fuel rods. As a result, automated safety systems at the plant triggered a shutdown.' Personally, I don't think letting devices on a critical control system accept data values from the business network is a good idea."

Install Complete... by Anonymous Coward · 2008-06-06 12:02 · Score: 5, Funny

Must restart reactor to complete software installation.

[Yes] [No] [OMFG!]
Hmmm, threw an exception by Anonymous Coward · 2008-06-06 12:03 · Score: 5, Insightful

I'd rather it shut itself down then suffer major failure.
1. Re:Hmmm, threw an exception by xlv · 2008-06-06 12:44 · Score: 5, Funny
  
  I'd rather it shut itself down then suffer major failure. Personally, I'd rather it doesn't suffer a major failure at all, whether it's after a shutdown or not. Oh you meant than and not then, never mind...
Critical Update by Enderandrew · 2008-06-06 12:04 · Score: 5, Funny

Adds a whole new meaning to "Critical Update".

--
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
Fail-Safe by lobiusmoop · 2008-06-06 12:07 · Score: 4, Insightful

Personally, I am reassured that these reactors are designed to shut down at the drop of a hat. This is not a situation where fuck-ups should be masked; any discontinuity, however minor, really needs to be highlighted and dealt with immediately.

--
"I bless every day that I continue to live, for every day is pure profit."
1. Re:Fail-Safe by snkline · 2008-06-06 12:23 · Score: 4, Insightful
  
  Umm, yes you do. If something in the system is shit, you don't want the reactor ON!
Oblig Simpsons reference by J'ai+Friedpork · 2008-06-06 12:11 · Score: 5, Funny

"Vent radioactive gas? Venting gas prevents explosion. [Yes / No]"

--
Took this comment seriously, did you?
Misreading of the Article by Anonymous Coward · 2008-06-06 12:12 · Score: 5, Interesting

"Personally, I don't think letting devices on a critical control system accept data values from the business network is a good idea." The article did not say that the data values were being read from the machine that was rebooted. It actually said that the rebooting triggered a problem in which values could not be read.

I wonder if they were using something like EPICS. I worked on a large experiment which used EPICS to control the system. Rebooting a machine would sometimes expose a problem with resources not being freed, eventually leading to a situation where data channels would read the 'INVALID/MISSING' value. The solution, as anyone who has worked on this sort of experiment will know, was to reboot more machines until the thing worked. ;-)

(I don't mean to complain about EPICS. It is very powerful and flexible... it's just that the version we used had these occasional hiccups.)
the slashdot crowd is dying to know... by mathfeel · 2008-06-06 12:13 · Score: 4, Funny

did it run Windows?

--
The only possible interpretation of any research whatever in the 'social sciences' is: some do, some don't
1. Re:the slashdot crowd is dying to know... by Anonymous Coward · 2008-06-06 13:05 · Score: 5, Funny
  
  If it was running Windows the OS is at fault.
  If it was running something else then the application was at fault.
Re:More like bad system design by RiotingPacifist · 2008-06-06 12:20 · Score: 4, Insightful

The only safe way to update a system is a reboot. Sure, you CAN do some stuff on Linux, BSD, etc. to avoid having to reboot (hell, this was probably running some Unix derivative, so it was probably possible to do the update without rebooting), but you wouldn't want to run the risk of introducing an unchecked bug by doing a live update. When your choices are:
a) high chance of accidentally shutting down a reactor harmlessly
b) small chance of fucking up a nuclear reactor
you'll always go for (a), if you're sane.

--
IranAir Flight 655 never forget!
EULA! by bluephone · 2008-06-06 12:20 · Score: 5, Funny

It says right in the EULA that it's not to be used in a nuclear power plant!

--
jX [ Make everything as simple as possible, but no simpler. - Einstein ]
The problem is the update - not business network by markdj · 2008-06-06 12:21 · Score: 5, Interesting

I write this type of software for a living so I know that having a computer on the business network connected to the control computers is a risk, but that risk can be managed. The problem here is that the software update wiped out the nuclear control system data. This exposes two bad problems. First customers are always asking why they can't update their system while it is still running. We liken that to changing your tire while driving down the road. Secondly the software update did not respect the data in the nuclear control system and synchronized it to new initial data in the update on the other system! Not a good idea. In critical safety systems, you always practice an update before actually doing one.
"King-size Homer" season 7 episode 7, Nov 5, 1995 by layer3switch · 2008-06-06 12:25 · Score: 4, Funny

"... The move to SCADA systems boosts efficiency at utilities because it allows workers to operate equipment remotely."

Another proof that Homer Simpson was truly ahead of his time.

Are you mad, woman? You never know when an old calendar might come in handy. Sure, it's not 1985 now, but who knows what tomorrow will bring? -Homer

--
"Don't let fools fool you. They are the clever ones."
Re:Obligatory by Kamokazi · 2008-06-06 12:26 · Score: 4, Funny

Don't forget about the now mutated sharks living in the coolant water growing frickin' laser beams on their heads.

--
As our way of thanking you for your positive contributions to Slashdot, you are eligible to disable Slashdot 2.0.
Re:The problem is the update - not business networ by dissy · 2008-06-06 12:30 · Score: 4, Funny

First customers are always asking why they can't update their system while it is still running. We liken that to changing your tire while driving down the road. Oh sure, NOW you think of a debian slogan ;}
Re::O by Lurker2288 · 2008-06-06 12:40 · Score: 5, Insightful

What exactly do you find frightening about an automatic safety system doing exactly what it's supposed to in response to unusual input?
Re:Wow that is so funny by Anonymous Coward · 2008-06-06 12:41 · Score: 4, Insightful

Correct. It is not the better choice. In the foreseeable future, it is the only choice.
Only the biz machine was updated. Why trouble? by Ungrounded+Lightning · 2008-06-06 12:46 · Score: 5, Insightful

Secondly the software update did not respect the data in the nuclear control system and synchronized it to new initial data in the update on the other system! Not a good idea. In critical safety systems, you always practice an update before actually doing one.

I have no problem with a computer on the process control subnet reporting information to a computer on the business subnet.

I have a BIG problem with a computer on the business subnet being able to modify and corrupt data in a computer on the process control subnet.

"I can't dump data to the business side" is a reason to make a log entry and maybe sound a minor alarm. It's not a reason to shut down the reactor (unless the data is needed for regulatory compliance and the process control side isn't able to buffer it until the business side is working correctly.)
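A minimal sketch of that idea, assuming a push-only reporter living on the process control side: it buffers records while the business side is down and writes a log entry, and it never reads anything back. The class name, the `send` callable, and the capacity are all hypothetical, not anything from the plant's actual software.

```python
from collections import deque
from typing import Callable, Deque, List


class OutboundReporter:
    """Control-side, push-only reporter; it never reads anything back."""

    def __init__(self, send: Callable[[str], None], capacity: int = 1000) -> None:
        self._send = send  # hypothetical transport to the business side
        self._pending: Deque[str] = deque(maxlen=capacity)  # oldest records drop first
        self.log: List[str] = []

    def report(self, record: str) -> None:
        """Queue a record and try to deliver everything queued so far."""
        self._pending.append(record)
        self.flush()

    def flush(self) -> None:
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                # A failed dump is worth a log entry, not a change in plant state.
                self.log.append(
                    f"business side unreachable, {len(self._pending)} record(s) buffered"
                )
                return
            self._pending.popleft()
```

The point of the design is that a failure on the business side can only ever fill a buffer and a log; nothing on that side has a path back into control data.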

But if a business subnet computer can tamper with something as critical as a process control machine's idea of the level of coolant in a reservoir, it rings my "design flaw" alarms.

Is it ONLY able to reset it to "empty" as a poorly-designed part of a communication restart sequence? Or could it also make the process control machine think the level was nominal when it WAS empty?

IMHO this should be examined more closely. It may have exposed a dangerous flaw in the software design.

Security flaws don't care if they're exercised by mischance or malice. If nothing else, this is a way to DoS a nuclear plant through a break-in on the business side of the net.

--
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
This was not a "fail-safe" incident by Drenaran · 2008-06-06 12:48 · Score: 5, Insightful

The problem here is that the system didn't shut down because it detected an error in the data collection system; instead it incorrectly detected a problem that did not in fact exist and then proceeded to take action. While the engineer in me is fairly certain that the system is designed to always fail to a safe state (as in, any automatic emergency response couldn't accidentally make things worse - at least not without raising all sorts of alarms), it is still concerning that internal control systems can be so affected by external servers.

In the article they mention that the system wasn't designed for security (since it was meant to be internal) - but this isn't a security issue at all! Any sort of system that relies upon other systems should be designed to assume failure can and will occur in other systems - that is not to say that it needs to verify/evaluate incoming data to make sure it is "good", but rather that it can tell the difference between receiving data (such as current water levels) and receiving no data at all (system failure). Once it has that it can ideally automatically switch to a backup system, or do what it did here and enter a fail-safe state (the difference being that it does so while pointing out the actual problem and not an incorrectly perceived problem in a different part of the system).
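The distinction between "low reading" and "no reading" could be sketched roughly as follows. This is only an illustration: the limits, field names, and status values are made up, and nothing here is taken from the plant's real software.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    OK = "ok"
    LOW_LEVEL = "low level"  # a real reading below the limit
    NO_DATA = "no data"      # the feed itself has failed


@dataclass
class Sample:
    level_m: Optional[float]  # None means the feed delivered nothing
    age_s: float              # seconds since the sample was taken


# Hypothetical limits, chosen only for illustration.
MIN_LEVEL_M = 2.0
MAX_AGE_S = 5.0


def classify(sample: Sample) -> Status:
    """Report why the input is unusable instead of folding it into 'low'."""
    if sample.level_m is None or sample.age_s > MAX_AGE_S:
        return Status.NO_DATA
    if sample.level_m < MIN_LEVEL_M:
        return Status.LOW_LEVEL
    return Status.OK


def must_trip(status: Status) -> bool:
    """Both failure kinds lead to the safe state; only the reported reason differs."""
    return status is not Status.OK
```

Either failure still ends in the safe state, so nothing is lost on the safety side; the operator just gets told "the feed died" rather than "the reservoir is empty".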
Re:Wow that is so funny by Anonymous Coward · 2008-06-06 12:50 · Score: 5, Insightful

And a shutdown, while inconvenient, is not a catastrophe. In fact, it speaks well for the plant's safety that it did automatically shut down when faced with bad data.
MOD PARENT UP! by Lux · 2008-06-06 13:01 · Score: 4, Funny

He's trying to find an opportunity to bash Microsoft!
just to shortcircuit the nuclear hysteria by circletimessquare · 2008-06-06 13:06 · Score: 4, Informative

most freakouts surrounding nuclear power are based on 1960s technology. modern reactor designs, such as pebble bed reactors, are designed to be passively safe. that is, you can just walk away from them, doing nothing, and they will not release gas, go china syndrome, or anything else unsafe. older nuke tech requires active safety management: someone must always be on the job, making sure nothing f***s up. designing safety into nuclear reactor design from the philosophical ground up is the way of the future

--
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
1. Re:just to shortcircuit the nuclear hysteria by dbIII · 2008-06-06 13:32 · Score: 5, Insightful
  
  While that may be true the first full scale prototypes of pebble bed are yet to go online - however construction of several in China is at an advanced stage. As Superphoenix showed with fast breeders you really need a full scale prototype to identify all of the problems (it was economic ones that killed fast breeders and not safety issues).
  India's accelerated thorium idea is also very promising.
  The major problem I see with US nuclear power is the assumption that it is a solved problem and almost zero has been spent on R&D for decades. The "new generation" of reactors from Westinghouse and others is little more than 1960's white elephants painted green.
Re:Lesson learned: by bluefoxlucid · 2008-06-06 13:13 · Score: 4, Informative

No, it has no reason to believe the coolant system has water. It's called FAIL SAFE; if I'm not quite sure, then fuck it, back off and shut the grid down and go MAKE SURE everything looks right.

The proper response of a nuclear cooling system to not knowing whether or not it's working correctly is not "let's keep running hot and see if more sample data comes across."

--
Support my political activism on Patreon.
Re:One begs the question by badboy_tw2002 · 2008-06-06 13:20 · Score: 5, Funny

Good enough evidence for me! Microsoft caused a nuclear meltdown! Quickly, to the Blogo-Sphere!
Business Network? by camperdave · 2008-06-06 13:31 · Score: 4, Interesting

The business computers should not be connected to the control network.

From the summary:
The computer in question was used to monitor chemical and diagnostic data from one of the facility's primary control systems...
... when the updated computer rebooted, it reset the data on the control system...
If it's monitoring the primary control system then it seems to me like the machine would have to be on the control network. The real issue is why did the primary control system accept a reset from a monitoring system. It sounds like there's more than one bug to track down.

--
When our name is on the back of your car, we're behind you all the way!
Re::O by afidel · 2008-06-06 13:34 · Score: 4, Insightful

I have quite a few Windows 2003 servers that haven't been rebooted since August 2006 when we upgraded our computer room to a small datacenter (we went from a single busline and a constantly breaking AC unit to dual UPS's powered by separate generators and dual chillers with separate condensers.) It's not like it's impossible to get good uptimes on Windows, the only servers we reboot on a regular basis are our Citrix servers due to some bad code on Citrix's part that leaks memory over time and our Oracle server due to a bug where 10gR2 pulls time from the deprecated ticks counter (the same one that used to crash Windows 9x) which rolls over after ~49.7 days. Both of those are the result of poor third party coding, not bugs in Windows.
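For reference, the rollover interval of a tick counter follows directly from its width. The sketch below assumes a 32-bit counter of milliseconds; the comment does not say exactly which counter Oracle reads, so treat that width as an assumption. The helper names are made up for illustration.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds


def wrap_days(bits: int) -> float:
    """Days until an unsigned millisecond counter of the given width wraps."""
    return 2 ** bits / MS_PER_DAY


def elapsed_ms(start: int, now: int, bits: int = 32) -> int:
    """Elapsed ticks computed modulo the counter width, so one wrap is harmless."""
    return (now - start) % (2 ** bits)


print(round(wrap_days(32), 1))        # about 49.7 days under the 32-bit assumption
print(elapsed_ms(2 ** 32 - 100, 50))  # 150 ticks, even though the counter wrapped
```

Code that compares raw tick values breaks at the wrap; code that only ever takes differences modulo the counter width keeps working, as long as the interval being measured is shorter than one full wrap.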

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:Only the biz machine was updated. Why trouble? by Platinumrat · 2008-06-06 13:35 · Score: 4, Interesting

Secondly the software update did not respect the data in the nuclear control system and synchronized it to new initial data in the update on the other system! Not a good idea. In critical safety systems, you always practice an update before actually doing one.

I have no problem with a computer on the process control subnet reporting information to a computer on the business subnet.

I have a BIG problem with a computer on the business subnet being able to modify and corrupt data in a computer on the process control subnet.

"I can't dump data to the business side" is a reason to make a log entry and maybe sound a minor alarm. It's not a reason to shut down the reactor (unless the data is needed for regulatory compliance and the process control side isn't able to buffer it until the business side is working correctly.)

But if a business subnet computer can tamper with something as critical as a process control machine's idea of the level of coolant in a reservoir, it rings my "design flaw" alarms.

Is it ONLY able to reset it to "empty" as a poorly-designed part of a communication restart sequence? Or could it also make the process control machine think the level was nominal when it WAS empty?

IMHO this should be examined more closely. It may have exposed a dangerous flaw in the software design.

Security flaws don't care if they're exercised by mischance or malice. If nothing else, this is a way to DoS a nuclear plant through a break-in on the business side of the net.

I agree with the previous post. In railway signalling (at least outside of the USA) formal safety processes must be followed with software design and configuration. Part of that is a formal hazard analysis. There are various Safety Integrity Levels (SIL) for systems that are applied to different control and monitoring components (SIL-0 being lowest to SIL-4 for stuff that can kill people if it goes wrong).

There is no condition under which it is even acceptable for a business system to feed vital sensor data to the control system. There should always be a hazard analysis performed when making any changes to a control system, at which point this sort of thing should have been detected.
Re:Wow that is so funny by zappepcs · 2008-06-06 13:36 · Score: 4, Insightful

I'd go just a bit further and say that it speaks well for the software coders. There are at least three ways to treat any 'out of bounds' condition. They chose to make sure that the safe action was chosen.

This is an area where loosely controlled teamwork gets into trouble unless all coders treat data passed to their code, and from their code, in the same uniform functional way.

It also makes me wonder how the code will react to certain malicious software, should it get loose in the facility. If I were writing code to destroy a nuclear facility, it is how data is passed from one process to another that I would definitely attack as well as other vulnerable places.

It is sort of reassuring to have seen a failure result in a controlled shutdown rather than some other, more undesirable action.

--
Support NYCountryLawyer RIAA vs People
Re:Wow that is so funny by Wo1ke · 2008-06-06 13:50 · Score: 5, Insightful

Yeah, so when a sensor breaks and stops sending in data, it'll keep running like usual, with maybe a small error code in the background. Cause, you know, that's how we want nuclear fucking powerplants to work.
Re:Wow that is so funny by spikedvodka · 2008-06-06 15:06 · Score: 4, Insightful

It's not a nuclear power plant, but still, my network...

I've set nagios up to monitor my network, and any loss of signal is considered CRITICAL, not just a warning, but critical... and I need to know then.
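As a rough sketch of that policy, here is the decision part of a check script that maps a missing reading straight to CRITICAL. The numeric codes are the standard Nagios plugin exit statuses (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function name, threshold, and message text are made up for illustration.

```python
from typing import Optional, Tuple

# Standard Nagios plugin exit statuses.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_signal(reading: Optional[float], warn_below: float) -> Tuple[int, str]:
    """Return (exit status, status line) for one reading; None means no signal."""
    if reading is None:
        # Loss of signal is never downgraded to a warning.
        return CRITICAL, "CRITICAL - loss of signal"
    if reading < warn_below:
        return WARNING, f"WARNING - signal at {reading:.1f}"
    return OK, f"OK - signal at {reading:.1f}"
```

A real plugin would print the status line and then call `sys.exit()` with the status code, which is how Nagios learns the result.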

--
I will not give in to the terrorists. I will not become fearful.
Re:Wow that is so funny by profplump · 2008-06-06 15:22 · Score: 4, Informative

The system as a whole *did* know the reading was bogus. The control/safety system shut down because it stopped getting "safe" indications from the monitoring/input system. It seems pretty clear that the input system itself correctly logged the reason for the error.

The interface to the control system for the tank level doesn't (or at least shouldn't) have an entire separate "error" parameter -- it probably takes a simple numeric value from the input system.

The input software knows when the readings are bogus or missing. In that case it either stops sending input, which would presumably trigger a watchdog in the control system, or it sends data that indicates a worst-case scenario, with which the control system can do whatever it does in a worst-case scenario.

The control system itself doesn't care why safe input parameters may be missing, it only cares that it cannot rely on the input it needs for safe operation. Giving it any more information just adds code and interface complexity to safety-critical software.

Here's the system as implemented:
// Input side: fold any insane reading into the worst-case value.
level = tank.getLevel()
if (level < SANE_MIN || level > SANE_MAX)
    level = 0
control.input.set(TANK_LEVEL, level)

Here's the system you describe:
// Input side: pass the raw reading plus a separate error flag.
error = 1
level = tank.getLevel()
if (level >= SANE_MIN && level <= SANE_MAX)
    error = 0
control.input.set(TANK_LEVEL, level, error)

The latter makes the safety-critical control software more complex, with more test cases and more input parameters, none of which add any value to the safe operation of the control system. The error parameter potentially allows for operation during transient errors, but that's a decision you can make in other ways, without adding interface complexity.

The only inconvenience of the simpler interface is that you have to check the logs from the input device in addition to the control device to determine why the error occurred. And please don't argue that consolidated error logging is worth extra code complexity -- that's probably not even true in a web app, let alone a human-safety control system.
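The watchdog mentioned above could look something like this minimal sketch. It assumes the control loop calls `kick()` whenever fresh input arrives and polls `expired()` to decide whether to go to the safe state; the class, its method names, and the timeout are hypothetical.

```python
import time
from typing import Callable


class Watchdog:
    """Trips when the input side has not checked in within timeout_s seconds."""

    def __init__(
        self, timeout_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the timing can be tested
        self._last_kick = clock()

    def kick(self) -> None:
        """Record that fresh input just arrived."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """True once the input has been silent for longer than the timeout."""
        return self._clock() - self._last_kick > self._timeout_s
```

Using a monotonic clock rather than wall-clock time means that adjusting the system time cannot fake or hide a silent input.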
Re:Wow that is so funny by icebike · 2008-06-06 16:10 · Score: 5, Informative

What part of FAIL SAFE don't you understand?

The System FAILED. It is programmed to SAFE the reactor when shit happens.

Without its sensors it had no choice but to assume worst case and scram the reactor.

It did it the right way. It did it the way it was programmed to do it.

What would you have it do to determine why it is no longer getting critical data? Send out a droid to check the cat5 cables? It's a frikin computer in a rack, not R2D2.

It worked the way it was supposed to.

Take a step back and let the big boys handle the reactor, Please.

--
Sig Battery depleted. Reverting to safe mode.
Re:Wow that is so funny by GigaplexNZ · 2008-06-06 17:20 · Score: 4, Insightful

It did it the way it was programmed to do it. Based on the information provided in the article, it was programmed to shut down due to lack of water. What actually happened was an accidental data reset. A separate fail-safe mechanism should have detected the missing critical data. Instead, it "errantly interpret[ed] the lack of data as a drop in water reservoirs". I would rather it correctly, as opposed to errantly, detect unsafe conditions. The plant should have shut down as it did, but it sounds a bit like chance that it actually did.
Re:Wow that is so funny by GigaplexNZ · 2008-06-06 17:23 · Score: 4, Insightful

I understand exactly what fail safe means. I agree that no data = very very very bad data. I agree that it should have gone into the safest possible mode. I don't agree that the "low water level" detection is the correct mechanism to determine the "no data = very very very bad data" condition. I'm suggesting that, based on the information quoted in my original post ("safety systems to errantly interpret the lack of data as a drop in water reservoirs"), this does not necessarily sound like good planning; it sounded more like chance that some erroneous interpretation picked up on the invalid state. It may have detected the "no data = very very very bad data" case and shut down for that reason, but that's not what the article is suggesting. Other users hinting that I am a moron for thinking that the plant shouldn't have shut down have misinterpreted what I was trying to get across.
Re:Wow that is so funny by barius · 2008-06-06 17:42 · Score: 5, Insightful

I think you're missing the real point, which is that the central safety systems are being fed data from a 'business network'. What would happen if that computer had an issue that caused it to send the same data continuously even when the coolant level had really dropped? WHY are any safety systems receiving data from an insecure network?

It's bad enough that most reactors use regular PC's to do the data collection and reporting, given the security risks posed by such systems (especially if networked), but I never realized they would be so stupid as to feed data in the other direction like this!
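The "same data sent continuously" failure is one a receiver can guard against, at least in principle. A minimal sketch, assuming each message from the sender carries a sequence number that must advance: the class name, the sequence field, and the repeat limit are all hypothetical, and this is not a description of any real plant protocol.

```python
from typing import Optional


class FreshnessGuard:
    """Flags a feed whose sequence number has stopped advancing."""

    def __init__(self, max_repeats: int = 3) -> None:
        self._max_repeats = max_repeats
        self._last_seq: Optional[int] = None
        self._repeats = 0

    def accept(self, seq: int) -> bool:
        """Return False once the same sequence number has repeated too often."""
        if seq == self._last_seq:
            self._repeats += 1
        else:
            self._last_seq = seq
            self._repeats = 0
        return self._repeats < self._max_repeats
```

A frozen sender that keeps replaying its last message then looks like a dead feed rather than a healthy one, and can be handled the same way as missing data.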
Re:Wow that is so funny by Anonymous Coward · 2008-06-06 18:11 · Score: 5, Informative

I think you're missing the real point, which is that the central safety systems are being fed data from a 'business network'. What would happen if that computer had an issue that caused it to send the same data continuously even when the coolant level had really dropped? WHY are any safety systems receiving data from an insecure network?

It's bad enough that most reactors use regular PC's to do the data collection and reporting, given the security risks posed by such systems (especially if networked), but I never realized they would be so stupid as to feed data in the other direction like this!

Obviously you have -zero- experience with power plant networks. Allow me to enlighten you, albeit anonymously.

The reason machines like this receive data from networks that could be considered 'less secure' is because telemetry is required from a multitude of sources to actually ascertain any useful realtime information. Aggregation machines have to speak many different protocols and translate between them while communicating with other machines that belong at other plants, cities, states, and even companies to effectively get an accurate picture of the entire grid's current conditions.

The world of plant control machines themselves is very vendor-driven. Most facilities have turnkey solutions brought in by the few major players in this field. ABB, Hathaway, GE, etc. Those players don't even use the same SCADA protocols. Some use ICCP, some use DNP, and others prefer Etherpoll. I've seen RS232 data encapsulated into everything from fully-meshed TCP connections via OSI-Soft's PI to barely encoded into modbus and slapped onto ethernet with only an understanding of ARP.

The solutions are required because electricity is not just one powerplant pumping watts blindly. Instead, you have a multitude of plants all pushing power onto ISO-controlled grids that all have to work in concert with each other. This requires -- yes, you guessed it -- networking! The world of plant networks is pretty complex despite the hype you see in the media. The business of making actual watts appear magically at your house at a nice, consistent 60Hz is vastly more involved than most people realize.

Telemetry comes from secured networks, business networks, and other companies and controlling agencies. That is how it works. Period.

If you are actually interested in seeing the way these are regulated to be secured, the information is cleverly hidden in plain sight at the NERC website.