Software Update Shuts Down Nuclear Power Plant

Posted by Soulskill on Friday June 6, 2008 @11:58AM from the we-have-safety-systems-because-we-are-very-stupid dept.

Garabito writes "Hatch Nuclear Power Plant near Baxley, Georgia was forced into a 48-hour emergency shutdown when a computer on the plant's business network was rebooted after an engineer installed a software update. The Washington Post reports, 'The computer in question was used to monitor chemical and diagnostic data from one of the facility's primary control systems, and the software update was designed to synchronize data on both systems. According to a report filed with the Nuclear Regulatory Commission, when the updated computer rebooted, it reset the data on the control system, causing safety systems to errantly interpret the lack of data as a drop in water reservoirs that cool the plant's radioactive nuclear fuel rods. As a result, automated safety systems at the plant triggered a shutdown.' Personally, I don't think letting devices on a critical control system accept data values from the business network is a good idea."

Misreading of the Article by Anonymous Coward · 2008-06-06 12:12 · Score: 5, Interesting

"Personally, I don't think letting devices on a critical control system accept data values from the business network is a good idea." The article did not say that the data values were being read from the machine that was rebooted. It actually said that the rebooting triggered a problem in which values could not be read.

I wonder if they were using something like EPICS. I worked on a large experiment which used EPICS to control the system. Rebooting a machine would sometimes expose a problem with resources not being freed, eventually leading to a situation where data channels would read the 'INVALID/MISSING' value. The solution, as anyone who has worked on this sort of experiment will know, was to reboot more machines until the thing worked. ;-)

(I don't mean to complain about EPICS. It is very powerful and flexible... it's just that the version we used had these occasional hiccups.)
1. Re:Misreading of the Article by Anonymous Coward · 2008-06-06 13:14 · Score: 1, Interesting
  
  It actually said that the rebooting triggered a problem in which values could not be read.
  
  I feel so fucking vindicated:
  
  Long uptimes are a bad thing! How do you know a configuration change hasn't rendered one of your startup scripts ineffective? If you have to reboot for some unexpected reason, you could be stuck debugging unrelated problems at very inopportune moments.
  
  You need to schedule regular reboots so that you can test that your servers can start up fine at a moment's notice. Long uptimes are a sign a sysadmin hasn't been doing his job.
  
  You're right. While you're on the phone with hazmat explaining that you have an issue with green goo, how about I test the reboots of my PBX before you give your address?
  
  Yeah, I run mission critical systems. Yes, I have proper redundancy and resiliency systems. Think I'm going to disrupt operations to test my reboots? Hell no. When it comes to public safety, 5 nines is the *only* option.
  
  Looks like necrogram or somebody with his attitude is responsible for this.
Terminal Error by Anubis_Ascended · 2008-06-06 12:13 · Score: 2, Interesting

Reminds me of Terminal Error.
The problem is the update - not business network by markdj · 2008-06-06 12:21 · Score: 5, Interesting

I write this type of software for a living, so I know that having a computer on the business network connected to the control computers is a risk, but that risk can be managed. The problem here is that the software update wiped out the nuclear control system data. This exposes two bad problems. First, customers are always asking why they can't update their system while it is still running. We liken that to changing your tire while driving down the road. Second, the software update did not respect the data in the nuclear control system and synchronized it to new initial data in the update on the other system! Not a good idea. In critical safety systems, you always practice an update before actually doing one.
This is why... by rat7307 · 2008-06-06 12:47 · Score: 3, Interesting

This is why you keep the IT nerds away from the process network.

I've had a whole plant lose view of its system because some well-meaning soul in IT decided to push updates onto a SCADA system without qualifying the updates... never had it KILL the control side of things though... well done whoever you were, you've done well.

--
Burma?
Business Network? by camperdave · 2008-06-06 13:31 · Score: 4, Interesting

The business computers should not be connected to the control network.

From the summary:
The computer in question was used to monitor chemical and diagnostic data from one of the facility's primary control systems...
... when the updated computer rebooted, it reset the data on the control system...
If it's monitoring the primary control system then it seems to me like the machine would have to be on the control network. The real issue is why did the primary control system accept a reset from a monitoring system. It sounds like there's more than one bug to track down.

--
When our name is on the back of your car, we're behind you all the way!
Re:Only the biz machine was updated. Why trouble? by Platinumrat · 2008-06-06 13:35 · Score: 4, Interesting

Secondly the software update did not respect the data in the nuclear control system and synchronized it to new initial data in the update on the other system! Not a good idea. In critical safety systems, you always practice an update before actually doing one.

I have no problem with a computer on the process control subnet reporting information to a computer on the business subnet. I have a BIG problem with a computer on the business subnet being able to modify and corrupt data in a computer on the process control subnet.

"I can't dump data to the business side" is a reason to make a log entry and maybe sound a minor alarm. It's not a reason to shut down the reactor (unless the data is needed for regulatory compliance and the process control side isn't able to buffer it until the business side is working correctly).

But if a business subnet computer can tamper with something as critical as a process control machine's idea of the level of coolant in a reservoir, it rings my "design flaw" alarms. Is it ONLY able to reset it to "empty" as a poorly designed part of a communication restart sequence? Or could it also make the process control machine think the level was nominal when it WAS empty?

IMHO this should be examined more closely. It may have exposed a dangerous flaw in the software design. Security flaws don't care if they're exercised by mischance or malice. If nothing else, this is a way to DoS a nuclear plant through a break-in on the business side of the net.

I agree with the previous post. In railway signalling (at least outside of the USA) formal safety processes must be followed with software design and configuration. Part of that is a formal hazard analysis. There are various Safety Integrity Levels (SIL) for systems that are applied to different control and monitoring components (from SIL-0, the lowest, up to SIL-4 for stuff that can kill people if it goes wrong).

There is no condition under which it is acceptable for a business system to feed vital sensor data to the control system. There should always be a hazard analysis performed when making any changes to a control system, at which point this sort of thing should have been detected.
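
To make the point about write direction concrete, here is a minimal sketch in Python of a control-side data point that only accepts writes originating from the control zone and logs everything else. All names here (`ControlPoint`, `CONTROL_ZONE`, `BUSINESS_ZONE`, `reservoir_level_pct`) are hypothetical and invented for illustration; nothing in the article says how the Hatch systems were actually built, and a real plant would enforce this with network architecture rather than a few lines of application code.

```python
from dataclasses import dataclass, field

# Hypothetical zone labels, assumed for this sketch only.
CONTROL_ZONE = "control"
BUSINESS_ZONE = "business"


@dataclass
class ControlPoint:
    """A value owned by the control side; other zones may read but not write."""
    name: str
    value: float
    rejected: list = field(default_factory=list)

    def read(self) -> float:
        # Reads are allowed from anywhere (monitoring is fine).
        return self.value

    def write(self, new_value: float, source_zone: str) -> bool:
        # Refuse any write that does not come from the control zone,
        # and keep a record so the attempt can be logged or alarmed.
        if source_zone != CONTROL_ZONE:
            self.rejected.append((source_zone, new_value))
            return False
        self.value = new_value
        return True


level = ControlPoint("reservoir_level_pct", 87.0)

# A reset pushed from the business side (for example after a reboot) is ignored.
accepted = level.write(0.0, BUSINESS_ZONE)
print(accepted, level.read(), level.rejected)
```

In this sketch the rejected write becomes a log entry instead of a changed level, which is the behaviour the quoted post argues for.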
Such systems already exist. by ZombieEngineer · 2008-06-06 15:20 · Score: 3, Interesting

Safety control systems in the chemical industry have been in use for 20+ years. These systems have:
- redundant CPU modules (which can be hot plugged)
- redundant IO modules (which can be hot plugged)
- redundant communication systems
- self diagnostics (can detect a failed output transistor)
- internal diagnostics (CPU voting to detect failed CPU core)
- standard algorithms for redundant transmitters

Shutting down is the "safer option"; however, there are still risks (such as thermally stressing pipework). It is a lesser-of-two-evils problem.

This stuff is bread & butter for the chemical industry; there are a number of control companies that refuse to deal with the nuclear industry due to the requirement for unlimited indemnity.

ZombieEngineer
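
As an illustration of what an algorithm for redundant transmitters can look like, here is a minimal Python sketch of mid-value (median) selection over three transmitters, with a simple deviation check to flag a suspect instrument. The function name, the `max_deviation` parameter and the example readings are assumptions made for this sketch; the post does not say which algorithms any particular vendor uses.

```python
from statistics import median


def select_transmitter_value(readings, max_deviation):
    """Mid-value select over three redundant transmitter readings.

    Returns (selected_value, suspects), where suspects lists the indices of
    readings that differ from the selected value by more than max_deviation.
    """
    if len(readings) != 3:
        raise ValueError("this sketch expects exactly three readings")
    selected = median(readings)
    suspects = [
        i for i, reading in enumerate(readings)
        if abs(reading - selected) > max_deviation
    ]
    return selected, suspects


# One transmitter drops to zero (for example, its data source was reset).
value, suspects = select_transmitter_value([50.2, 49.8, 0.0], max_deviation=2.0)
print(value, suspects)
```

With three readings, a single bad transmitter cannot drag the selected value to an extreme; it only shows up in the suspect list for maintenance to follow up.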
Re:Install Complete... by Joe+Jay+Bee · 2008-06-06 15:54 · Score: 2, Interesting

Tell that to the people who worked at Chernobyl.

Oh wait, you can't; they were all blowed up. ;)

--
I write bullshit
Re:Water by Anonymous Coward · 2008-06-06 17:45 · Score: 1, Interesting

You can't put too much water into a nuclear reactor

Actually, you can, particularly in a PWR. If the turbine trips, or any other kind of loss-of-heat-sink accident occurs, the primary loop coolant will initially heat up and expand. Without a gas volume to buffer the resulting pressure increase, the piping would burst. To make an automotive analogy (which I'm sure the typical /. user will appreciate), it would be like putting very rigid shocks and springs on a car. This is the exact reason why the operators at TMI initially turned off the Emergency Core Cooling System. They saw pressurizer water level rising and were concerned that the pressurizer would "go solid". Since the pressurizer is physically located above the pressure vessel, they assumed (wrongly) that the core was covered and turned off ECCS.

Also, if the water level is too high in the steam generator (or pressure vessel, in the case of a BWR), you will get water droplets mixed in with the steam going to the turbines. This is a good way to damage turbine blades.

Third, if you're concerned about maintaining a BWR subcritical, you shouldn't let the water level get too high. The water surrounding the core acts as a reflector, decreasing neutron leakage. So, higher water level leads to increased reactivity. In fact, my recollection is that, in some cases, the emergency operating procedures suggest lowering the water level in order to control reactivity.

On a different note, the reason this incident is somewhat concerning (to me, at least) is that the logic for the reactor protection system is supposed to be not only fail-safe but also fault-tolerant. There are typically four independent channels, and the logic to actually get a scram is ((A || B) && (C || D)). So the question is, how did one computer failure cause multiple, supposedly-independent channels to indicate a scram condition?
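
The channel logic above is easy to check by brute force. The following Python sketch encodes ((A || B) && (C || D)) and enumerates which sets of tripped channels produce a scram; the function names are made up for illustration, and the sketch models only the Boolean expression from this post, not any real reactor protection system.

```python
from itertools import product


def scram_demanded(a: bool, b: bool, c: bool, d: bool) -> bool:
    """Scram only if at least one of A/B and at least one of C/D has tripped."""
    return (a or b) and (c or d)


def tripping_combinations(n_tripped: int) -> list:
    """Return the channel sets of size n_tripped that produce a scram."""
    names = "ABCD"
    result = []
    for states in product([False, True], repeat=4):
        if sum(states) == n_tripped and scram_demanded(*states):
            result.append("".join(n for n, s in zip(names, states) if s))
    return result


print(tripping_combinations(1))  # prints [] : no single channel can scram alone
print(sorted(tripping_combinations(2)))
```

Under this logic at least two channels must trip, one from each pair, which is why a single failed computer producing a scram suggests the channels shared a common input.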

Lastly, given the many statements suggesting that the electrical and software systems are on a hair-trigger, it's worthwhile to note that many mechanical failures don't require the plant to shut down immediately. The tech specs have the details. For example, the Hope Creek plant has been operating since Wednesday morning with one of its Emergency Core Cooling Systems declared inoperable. That's right, they do not currently have a safety-rated system capable of injecting water when the reactor is at operating pressure. And they're allowed, by law, to operate like this for two weeks.