Data Storm Caused Nuclear Plant To Shut Down

Shut down? by Anonymous Coward · 2007-05-19 09:04 · Score: 5, Insightful

>Investigators want to know whether the data storm could have been initiated from outside the plant.

Do investigators also want to know how a "data storm" could have caused a nuclear plant to shut down?

nothing to see, move along. by SuperBanana · 2007-05-19 09:10 · Score: 5, Insightful

Some choice quotes, emphasis added:

An investigation into the failure found that the controllers for the pumps locked up following a spike in data traffic -- referred to as a "data storm" in the NRC notice -- on the power plant's internal control system network. The deluge of data was apparently caused by a separate malfunctioning control device, known as a programmable logic controller (PLC).

"Conversations between the Homeland Security Committee staff and the NRC representatives suggest that it is possible that this incident could have come from outside the plant," Committee Chairman Bennie G. Thompson (D-Miss.) and Subcommittee Chairman James R. Langevin (D-RI) stated in the letter. "Unless and until the cause of the excessive network load can be explained, there is no way for either the licensee (power company) or the NRC to know that this was not an external distributed denial-of-service attack."

Wow. Just...wow. As if you needed more proof that this wasn't a hacking attempt:

"The integrated control system (ICS) network is not connected to the network outside the plant, but it is connected to a very large number of controllers and devices in the plant," Johnson said. "You can end up with a lot of information, and it appears to be more than it could handle."

Seriously, how stupid do you have to be to think "OMG, Haxxors?" Answer: work at Homeland inSecurity, or be a Congresscritter. They already figured it out. It was a controller for a specific piece of equipment that flooded the network and triggered a bug in the variable-frequency-drive controllers for pumps.

--
Please help metamoderate.

Re:nothing to see, move along. by (negative+video) · 2007-05-19 14:02 · Score: 2, Insightful

A random fluctuation in internal traffic levels seems equally unlikely.

Look up "Poisson distribution". At low packet rates, large rate fluctuations by random chance are the rule. You also have to consider events that can trigger a common packet rate spike, such as a non-critical subnet being power cycled. Combine this with a device that has an overflowable packet buffer and you have a recipe for inevitable failure.
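To put rough numbers on the Poisson point, here is a minimal sketch. The two packet rates (2 and 200 per interval) and the "twice the mean" burst threshold are illustrative assumptions, not figures from the plant or the NRC notice.

```python
import math

def poisson_tail(mean, threshold):
    """P(X >= threshold) for X ~ Poisson(mean), summed term by term."""
    term = math.exp(-mean)          # P(X = 0)
    cdf = 0.0
    for k in range(threshold):
        cdf += term                 # add P(X = k)
        term *= mean / (k + 1)      # step to P(X = k + 1)
    return max(0.0, 1.0 - cdf)

def relative_spread(mean):
    """Standard deviation divided by mean; for Poisson this is 1/sqrt(mean)."""
    return 1.0 / math.sqrt(mean)

# Hypothetical rates: a quiet link and a busy one.
for mean in (2, 200):
    burst = 2 * mean                # "burst" = at least twice the average
    print(f"mean={mean:>3}  relative spread={relative_spread(mean):.2f}  "
          f"P(count >= {burst})={poisson_tail(mean, burst):.3g}")
```

Under these assumptions the quiet link doubles its average rate in a meaningful fraction of intervals, while the busy link essentially never does, which is the sense in which large relative fluctuations are normal at low rates.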

A true network storm is unlikely - the term exists, but describes an astronomically rare situation. ... A network storm is when capacity is exceeded in a way that is self-perpetuating.

At work we recently had a cheap router near the edge that decided to start echoing broadcast packets. ARP traffic was not pretty, and DHCP got so confused that the Windows clients went all plug-n-play and started making up their own addresses. The core routers automatically detected the repeated packets and decided to go into cycle-breaking mode: automatic rolling network bisection. Unfortunately they had the smarts to find cycles on their own ports but not echoes from a misbehaving device, so that actually made the network more confusing. Eventually IS had to manually bisect the network until the talky node could be found.

In other words, this is a gross programming error that the coders and managers are desperately trying to blame on something - anything - other than their own ineptness.

It's an honest description of the final event that resulted in the system failure.

Re:nothing to see, move along. by kasperd · 2007-05-19 23:31 · Score: 2, Insightful

A random fluctuation in internal traffic levels seems equally unlikely. Why? Because it has worked for some time, and I doubt the reactor was doing anything unusual at the time.
This is not about the network being highly loaded with lots of packets coming from all sorts of places. This is about a single device for some reason flooding the network. I have seen the results of units flooding a network with broadcast traffic. I don't consider it highly unlikely for one unit to eventually start doing that because of a design flaw. Somebody should take a closer look at the design of that PLC to see if there is a likely explanation. Maybe a physical defect could have caused it to send a broadcast packet and afterwards think it had not been sent yet and send it again and again. Maybe the explanation is something else. There is no way I can say for sure without having seen the PLC.

Network drivers may also be event-driven, but if the interrupt handler is buggy - which would usually mean the handler can be interrupted by itself indefinitely - it's hardly the fault of the network.
If the handler could interrupt itself, it would probably result in a stack overflow and crash the unit. But that is not the most likely bug to introduce. A more likely and almost as bad problem would be if by the time the interrupt handling ended, it would immediately take another pending interrupt. In that case it would never be processing more than one interrupt at the same time, but yet it would spend all of its CPU time handling interrupts. The unit would appear locked up, but would come back to life shortly after the flooding stops. I have seen the latter happen with Linux machines (I don't remember which kernel version, I think 2.4.something). I later repeated the experiment with a Windows ME machine, which also locked up, but didn't come back to life when the network cable was disconnected. This situation was quite easy to test, just loop a cheap 100Mbit/s switch back to itself. It would probably take a 1000Mbit/s network to actually cause this with the last generation of CPUs. I don't know if switches and/or network drivers have been improved to avoid the exact scenario I tested.
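The "all CPU time goes to interrupts" case can be sketched with a toy model. This is not a model of any real kernel or of the plant's controllers; the CPU budget per tick, the per-interrupt cost, and the traffic trace are all made-up numbers chosen only to show the shape of the problem.

```python
def app_time_per_tick(packets_per_tick, cpu_budget_us=1000, irq_cost_us=10):
    """CPU microseconds left for the application in one tick when every
    packet raises an interrupt and interrupts always run first."""
    irq_time = min(cpu_budget_us, packets_per_tick * irq_cost_us)
    return cpu_budget_us - irq_time

# Hypothetical traffic trace: normal load, a flood, then normal load again.
trace = [5, 5, 150, 150, 150, 5, 5]
leftover = [app_time_per_tick(p) for p in trace]
print(leftover)
```

In this model the application gets no CPU time at all while the flood lasts and recovers as soon as it stops, which matches the Linux behaviour described above. A unit that stays dead afterwards, like the Windows ME machine, has some additional failure that this sketch does not capture.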

In my case this was not a problem, but of course in some critical systems, it can be. I see at least two problems. Units not tested against this scenario, and having redundant units communicate to each other over the same ethernet. Of course just having two ethernets does not solve the problem of one of them being able to take down units. Redundant units protect you against physical defects in one unit, not against design flaws.

--

Do you care about the security of your wireless mouse?

Re:nothing to see, move along. by DerekLyons · 2007-05-20 20:20 · Score: 2, Insightful

A random fluctuation in internal traffic levels seems equally unlikely. Why? Because it has worked for some time, and I doubt the reactor was doing anything unusual at the time. A true network storm is unlikely - the term exists, but describes an astronomically rare situation.

When investigating an accident you cannot rule out an occurrence just because it is unlikely or rare, unless you have positive evidence that said unlikely or rare condition did not occur, or positive evidence of another cause. "Unlikely" and "rare" are not synonyms for impossible.

In other words, this is a gross programming error that the coders and managers are desperately trying to blame on something - anything - other than their own ineptness.

Absent any facts (as opposed to opinions presented as facts), what precisely is your evidence for this conclusion?

Standards! by 26199 · 2007-05-19 09:12 · Score: 5, Insightful

You'd hope that in something as critical as a nuclear power plant the answer would be, very quickly, "no, it didn't come from an external source because that's impossible". Followed by detailed analysis of the logs to determine which internal system screwed up.

That said, the article is a bit sparse on actual technical details, so my derision may be unwarranted.

Re:Standards! by mrchaotica · 2007-05-19 09:37 · Score: 4, Insightful

You'd hope that in something as critical as a nuclear power plant the answer would be, very quickly, "no, it didn't come from an external source because that's impossible".

Actually, power plants have to have a connection to the outside world. Why? Load-balancing for the power grid. If another plant goes down somewhere, this plant needs to know about it so that it can adjust output to compensate. For that, all the plants need to be hooked to a communications grid, which could conceivably be hacked (even though -- I would hope -- it's not connected to the Internet).

--
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

Redesign the entire infrastructure by packetmon · 2007-05-19 09:19 · Score: 1, Insightful

Firstly I would re-design that entire infrastructure and rid that power plant of incompetent IT people. Secondly I would hold those in power responsible for 1) not having failover measures in place 2) not having a stable and robust enough infrastructure in place 3) obviously not being SCADA compliant. If they can't implement simple IT security measures such as a properly designed network, it makes me wonder about the physical security aspects of it. What am I paying higher taxes for every time the gov cries about strengthening infrastructure when they couldn't even manage something as simple as a 1) safe 2) stable network? Why wasn't there any failover? Who knows. Insanity when three different agencies can all come down on one agency instead of WORKING with that agency to take corrective measures. US Tax dollars at work. We need to redesign infrastructure and some of these idiots in office.

--
Infiltrated dot Net

Re:Redesign the entire infrastructure by Detritus · 2007-05-19 09:25 · Score: 2, Insightful

When you get back to the real world, let us know. You don't just wave a magic wand and completely redesign and reimplement a highly complex safety-critical system.

--
Mea navis aericumbens anguillis abundat

Re:Redesign the entire infrastructure by mrcdeckard · 2007-05-19 09:34 · Score: 4, Insightful

i think the fact that an unforeseen erroneous condition caused the plant to *shutdown* and not *meltdown* is a pretty good indication that it was designed quite well.

There will always be unforeseen situations. The key is for the system to shut down in an orderly fashion. In programming, this is accomplished through use of error traps.

Now, the hysteria surrounding terrorism is another thing the plant engineers have to worry about.

i just wonder if and when we get to put this hysteria behind us, and get along with our lives. unfortunately, terry gilliam's brazil is on a constant loop in my mind these days. . . .

mr c

--
"Physics is like sex. Sure, it may give some practical results, but that's not why we do it." - R. Feynman

Even stupider by packetmon · 2007-05-19 09:30 · Score: 4, Insightful

After yet re-reading, I find this government even more insanely stupider than I would have hoped for... Such failures are common among PLC and supervisory control and data acquisition (SCADA) systems, because the manufacturers do not test the devices' handling of bad data, said Dale Peterson, CEO of industrial system security firm DigitalBond.

"What is happening in this marketplace is that vendors will build their own (network) stacks to make it cheaper," Peterson said. "And it works, but when (the device) gets anything that it didn't expect, it will gag." So you mean to tell me pretty much there is no enforcement for manufacturers to maintain compliance on their products even if those products are going into a nuclear *ANYTHING... Which on the worst case scenario could cause catastrophe, yet we have regulatory commissions on the flow of ketchup, regulatory commissions/directions/etc., on weight loss products, lipsticks, etc. (FDA), but this place is not concerned with nuclear plants. Sinful.

--
Infiltrated dot Net

It's not stupid. by twitter · 2007-05-19 09:51 · Score: 5, Insightful

Seriously, how stupid do you have to be to think "OMG, Haxxors?" Answer: work at Homeland inSecurity, or be a Congresscritter. They already figured it out. It was a controller for a specific piece of equipment that flooded the network and triggered a bug in the variable-frequency-drive controllers for pumps.

As someone who used to work in systems engineering for a sister BWR, I think the inspection is a good idea. Oh, there's dumb and there's nuclear dumb but this is not a case of either. Nuclear dumb involves putting machine gun nests inside the plant. Finding the root cause of the accident is a good idea.

Handwaving about a PLC device won't do. What ultimately caused the PLC malfunction needs to be answered at a component level. There's going to be something wrong with it and that should be reported and every other device like it needs to be ripped out and trashed. If there is no component failure, there's a software problem which also must be understood.

Yes, it could have been hackers. The "internal control network" might at some point hit a desk that's connected to the wider world. It could be something mundane and unintentional, like an operator's virused up laptop.

An outage like that is something that's going to have both NRC and corporate ass-chewers looking at everything. Corporate might want to paint a nice picture for the NRC, but the poor devil that lies to them goes to jail. In either case, the problem will be identified and eliminated.

You might also have noted in the article that this is not the first plant to go thumbs down over some winblows-borne virus. In 2003, the slammer worm caused havoc at an offline Ohio plant. Yes, that was hackers. They did not mean to do it, but the plant's systems were open to it and failed. That's not acceptable from any standpoint.

Despite the better advice of the computer people at the plants, Entergy is a big M$ Partner. They take the big dogs out fishing and sell them the works. Ten years ago, M$ had something worth while and interesting. It was used in places it should not have been. Worse, the flaws from ten years ago have not been addressed or fixed. A good clean up is in order.

--

Friends don't help friends install M$ junk.

Re:Storm in the tubes by ichigo+2.0 · 2007-05-19 11:49 · Score: 5, Insightful

Because "spike in network traffic" sounds lame. Data storm, OTOH, sounds cool and dangerous. Contact Jack Bauer quickly! We need to open a new port for the nucular plant, so the terrorists don't destroy us! And while you're at it, give us more money so we can prevent these awful storms in the future!

Re:What network technology were they using? by mplex · 2007-05-19 12:00 · Score: 3, Insightful

Using Ethernet is not odd, that's literally all there is these days. Sure, there are technologies like Infiniband, but Ethernet is far and away the cheapest and most widely supported networking standard. It sounds like they were experiencing a broadcast storm from a locked up device. I can't tell you the number of times I've seen stand-alone devices lock up on a busy network because of a bad TCP/IP stack. Oftentimes they will flood packets, especially broadcast frames. There are protections against bad devices such as broadcast limiters and a number of features that protect and limit unauthorized or undesirable traffic.
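As a rough illustration of what a broadcast limiter does, here is a token-bucket sketch. It is not how any particular switch implements storm control; the class name, the 100 frames/s rate, the burst of 20, and the 10,000 frames/s flood are all hypothetical values picked for the example.

```python
class BroadcastLimiter:
    """Token bucket: forward at most `rate` broadcast frames per second on
    average, while tolerating short bursts of up to `burst` frames."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)   # start with a full bucket
        self.last = 0.0              # timestamp of the previous frame

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True              # forward the frame
        return False                 # drop it: the port is over its allowance

# A misbehaving device sends 10,000 broadcast frames spread over one second.
flood = BroadcastLimiter(rate=100, burst=20)
forwarded = sum(flood.allow(i / 10000) for i in range(10000))

# A well-behaved device sends one broadcast every 100 ms.
quiet = BroadcastLimiter(rate=100, burst=20)
quiet_ok = all(quiet.allow(i * 0.1) for i in range(10))

print(forwarded, quiet_ok)
```

With these numbers only a little over a hundred of the ten thousand flood frames get through, while the quiet device is never throttled. The rest of the network sees a bounded load no matter what the faulty device does.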

Ethernet isn't perfect but it's the only realistic option. Managed properly, it can be very reliable. The biggest problem I see from this article is that there is a lack of regulation and testing of the equipment that goes into these plants. These poor TCP/IP stacks should have never gotten past the testing phase when it comes to a nuclear power plant.

Re:Storm in the tubes by Jugalator · 2007-05-19 14:57 · Score: 2, Insightful

I've worked in IT a while now & have never heard of a "data storm".

Maybe it's the precursor to a logic bomb!

Wow, can't you request article deletion from Wikipedia on the basis of "ridiculous term"?
Or better yet, mind erasing for the very same reason... :-p

--
Beware: In C++, your friends can see your privates!

Re:What network technology were they using? by Bo'Bob'O · 2007-05-19 17:42 · Score: 2, Insightful

These are PLCs we're talking about, there are loads of network, protocol and connection systems, proprietary or otherwise, for all ranges of complexity.

Network stack has too high priority by Esben · 2007-05-19 20:50 · Score: 3, Insightful

I have actually seen such a problem myself: Controllers crashing because someone was testing the network. The problem was, of course, that the CPU spent a lot of time handling the number of packets on the network and therefore didn't have enough time for its real-time application. (It didn't help that the platform didn't support DMA.)

Solution: Make the network interrupt handler threaded and prioritize it below the real-time application. Sure, that doesn't help the SCADA performance, but you have to make sure that the real-time application meets its deadlines no matter what is going on on the network. I simply don't buy that you can secure a network stretching over more than 1 meter against "data storms."
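The scheduling idea can be sketched as a toy simulation. This is not RTOS code and not the platform described above; the function name, the tick budget, the real-time task cost, the per-packet cost, and the queue limit are all assumed values. The only point is that packet handling runs on leftover CPU time and overload is dropped rather than allowed to starve the control loop.

```python
from collections import deque

def run_tick(queue, arrivals, rt_cost_us=600, cpu_budget_us=1000,
             pkt_cost_us=10, queue_limit=64):
    """One scheduler tick: the real-time task is charged first and packet
    handling only gets the CPU time that is left. Returns (handled, dropped)."""
    dropped = 0
    for pkt in arrivals:
        if len(queue) < queue_limit:
            queue.append(pkt)
        else:
            dropped += 1                    # bounded queue: shed the overload
    spare_us = cpu_budget_us - rt_cost_us   # real-time work is already paid for
    handled = 0
    while queue and spare_us >= pkt_cost_us:
        queue.popleft()
        spare_us -= pkt_cost_us
        handled += 1
    return handled, dropped

queue = deque()
results = []
for load in (5, 500, 500, 5):               # quiet, storm, storm, quiet
    handled, dropped = run_tick(queue, range(load))
    results.append((handled, dropped, len(queue)))
print(results)
```

During the storm ticks most packets are dropped and the network side falls behind, but the real-time task's share of each tick is untouched. That is the trade-off described above: worse SCADA throughput in exchange for deadlines that hold regardless of traffic.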
