Software Update Shuts Down Nuclear Power Plant
Garabito writes "Hatch Nuclear Power Plant near Baxley, Georgia was forced into a 48-hour emergency shutdown when a computer on the plant's business network was rebooted after an engineer installed a software update. The Washington Post reports, 'The computer in question was used to monitor chemical and diagnostic data from one of the facility's primary control systems, and the software update was designed to synchronize data on both systems. According to a report filed with the Nuclear Regulatory Commission, when the updated computer rebooted, it reset the data on the control system, causing safety systems to errantly interpret the lack of data as a drop in water reservoirs that cool the plant's radioactive nuclear fuel rods. As a result, automated safety systems at the plant triggered a shutdown.' Personally, I don't think letting devices on a critical control system accept data values from the business network is a good idea."
I wonder if they were using something like EPICS. I worked on a large experiment which used EPICS to control the system. Rebooting a machine would sometimes expose a problem with resources not being freed, eventually leading to a situation where data channels would read the 'INVALID/MISSING' value. The solution, as anyone who has worked on this sort of experiment will know, was to reboot more machines until the thing worked.
(I don't mean to complain about EPICS. It is very powerful and flexible... it's just that the version we used had these occasional hiccups.)
Reminds me of Terminal Error.
I write this type of software for a living, so I know that having a computer on the business network connected to the control computers is a risk, but that risk can be managed. The problem here is that the software update wiped out the nuclear control system data. This exposes two serious problems. First, customers are always asking why they can't update their system while it is still running. We liken that to changing your tire while driving down the road. Second, the software update did not respect the data in the nuclear control system and synchronized it to the new initial data in the update on the other system! Not a good idea. In critical safety systems, you always practice an update before actually doing one.
This is why you keep the IT nerds away from the process network.
I've had a whole plant lose view of its system because some well-meaning person in IT decided to push updates onto a SCADA system without qualifying the updates... never had it KILL the control side of things though... well done, whoever you were, you've done well.
Burma?
From the summary: If it's monitoring the primary control system then it seems to me like the machine would have to be on the control network. The real issue is why did the primary control system accept a reset from a monitoring system. It sounds like there's more than one bug to track down.
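To illustrate the point about the control side accepting a reset from a monitoring machine, here is a minimal sketch in which a control value only accepts writes from an explicit list of sources. The class, the source names, and the values are all hypothetical; nothing here reflects how the plant's actual system works.

```python
class ControlPoint:
    """Hypothetical control-system value with an explicit list of allowed writers."""

    def __init__(self, value: float, writers: frozenset):
        self._value = value
        self._writers = writers

    def read(self) -> float:
        # Anyone may read; monitoring only needs this.
        return self._value

    def write(self, source: str, value: float) -> bool:
        # Reject writes (including resets) from any source not explicitly allowed.
        if source not in self._writers:
            return False
        self._value = value
        return True


# Assumed names: "operator_console" may write, "monitoring_pc" may only read.
level = ControlPoint(value=42.0, writers=frozenset({"operator_console"}))
rejected = level.write("monitoring_pc", 0.0)
accepted = level.write("operator_console", 41.5)
```

With a rule like this, a rebooted monitoring machine pushing fresh initial data would be refused rather than overwriting the live value.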
When our name is on the back of your car, we're behind you all the way!
Safety control systems have been used in the chemical industry for 20+ years. These systems have:
- redundant CPU modules (which can be hot plugged)
- redundant IO modules (which can be hot plugged)
- redundant communication systems
- self diagnostics (can detect a failed output transistor)
- internal diagnostics (CPU voting to detect a failed CPU core)
- standard algorithms for redundant transmitters
Shutting down is the "safer option"; however, there are still risks (such as thermal stressing of pipework). It is a lesser-of-two-evils problem. This stuff is bread & butter for the chemical industry; there are a number of control companies that refuse to deal with the nuclear industry due to the requirement for unlimited indemnity.
ZombieEngineer
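As an illustration of what an algorithm for redundant transmitters can look like, here is a minimal sketch of median selection across several readings, one common approach. The function name, the use of `None` for a missing reading, and the two-valid-readings threshold are assumptions made for the example, not anything taken from a specific vendor's system.

```python
from statistics import median
from typing import Optional, Sequence


def select_reading(readings: Sequence[Optional[float]]) -> Optional[float]:
    """Median-select across redundant transmitters, ignoring missing readings.

    Returns None when fewer than two readings are valid (an assumed threshold),
    so the caller decides how to fail safe instead of treating missing data
    as a real measurement.
    """
    valid = [r for r in readings if r is not None]
    if len(valid) < 2:
        return None
    return median(valid)


# One transmitter reading far off is outvoted by the other two.
outlier_case = select_reading([10.0, 12.0, 55.0])
# One transmitter missing: the median of the remaining two is used.
missing_case = select_reading([10.0, None, 12.0])
# Too little valid data: report "unknown" rather than a number.
unknown_case = select_reading([None, None, 12.0])
```

The point of returning `None` is that "no data" stays distinguishable from "low level", which is the distinction that appears to have been lost in the incident described in the summary.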
Tell that to the people who worked at Chernobyl.
;)
Oh wait, you can't; they were all blowed up.
I write bullshit
Also, if the water level is too high in the steam generator (or pressure vessel, in the case of a BWR), you will get water droplets mixed in with the steam going to the turbines. This is a good way to damage turbine blades.
Third, if you're concerned about maintaining a BWR subcritical, you shouldn't let the water level get too high. The water surrounding the core acts as a reflector, decreasing neutron leakage. So, higher water level leads to increased reactivity. In fact, my recollection is that, in some cases, the emergency operating procedures suggest lowering the water level in order to control reactivity.
On a different note, the reason this incident is somewhat concerning (to me, at least) is that the logic for the reactor protection system is supposed to be not only fail-safe but also fault-tolerant. There are typically four independent channels, and the logic to actually get a scram is ((A || B) && (C || D)). So the question is, how did one computer failure cause multiple, supposedly-independent channels to indicate a scram condition?
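The ((A || B) && (C || D)) logic can be written out as a short sketch to show why a single tripped channel should not be enough. The function name is hypothetical and this is only the boolean expression from the comment, not actual plant logic.

```python
def scram_demanded(a: bool, b: bool, c: bool, d: bool) -> bool:
    """Return True when the trip logic (A or B) and (C or D) is satisfied."""
    return (a or b) and (c or d)


# A single tripped channel is not enough to scram.
single_channel = scram_demanded(True, False, False, False)
# Both channels of the same pair are still not enough.
same_pair = scram_demanded(True, True, False, False)
# One tripped channel in each pair is enough.
one_per_pair = scram_demanded(True, False, False, True)
# A shared input that trips every channel at once bypasses the independence.
common_cause = scram_demanded(True, True, True, True)
```

So for one computer to cause a scram, its failure must have reached at least one channel in each pair, which is exactly what independent channels are meant to prevent.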
Lastly, given the many statements suggesting that the electrical and software systems are on a hair-trigger, it's worthwhile to note that many mechanical failures don't require the plant to shut down immediately. The tech specs have the details. For example, the Hope Creek plant has been operating since Wednesday morning with one of its Emergency Core Cooling Systems declared inoperable. That's right, they do not currently have a safety-rated system capable of injecting water when the reactor is at operating pressure. And they're allowed, by law, to operate like this for two weeks.