One Failed NIC Strands 20,000 At LAX

Posted by kdawson on Wednesday August 15, 2007 @07:56AM from the comp-dot-risks dept.

The card in question experienced a partial failure that started about 12:50 p.m. Saturday, said Jennifer Connors, a chief in the office of field operations for the Customs and Border Protection agency. As data overloaded the system, a domino effect occurred with other computer network cards, eventually causing a total system failure. A spokeswoman for the airports agency said airport and customs officials are discussing how to handle a similar incident should it occur in the future.

Whiskey Tango Foxtrot by SatanicPuppy · 2007-08-15 07:58 · Score: 5, Insightful

According to the effing article, it wasn't even a server, but a goddamn desktop. How in the holy hell does a desktop take down the whole system? I can't even conceive of a situation where that could be the case on anything other than a network designed by chimps, especially through a hardware failure...A compromised system might be able to do it, but a system just going dark?

For that to have had any effect at all, that system must have been the lynchpin for a critical piece of the network...probably some Homeland security abortion tacked on to the network, or some such crap...This is like the time I traced a network meltdown to a 4 port hub (not a switch, an unmanaged hub) that was plugged into (not a joke) a T-3 concentrator on one port, and three subnets of around 200 computers each on the other 3 ports. Every single one of the outbound cables from the $15.00 hub terminated in a piece of networking infrastructure costing not less than $10,000.

This is like that. Single point of failure in the worst possible way. Gross incompetence, shortsightedness, and general disregard for things like "uptime"; pretty much what we've come to expect from the airline industry these days. If I'm not flying myself, I'm going to be driving, sailing, or riding a goddamn bicycle before I fly commercial.

--
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
1. Re:Whiskey Tango Foxtrot by MightyMartian · 2007-08-15 08:07 · Score: 5, Insightful
  
  If the NIC starts broadcasting like nuts, it will overwhelm everything on the segment. If you have a flat network topology, then kla-boom, everything goes down the shits. A semi-decent switch ought to deal with a broadcast storm. The best way to deal with it is to split your network up, thus rendering the scope of such an incident significantly smaller.
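To put rough numbers on the "split your network up" point, here is a minimal sketch. The 600-host count and the three 200-host segments are made-up figures for illustration, `broadcast_reach` is a hypothetical helper, and the model assumes broadcasts are never forwarded between segments.

```python
from typing import Dict, List


def broadcast_reach(segments: Dict[str, List[str]], sender: str) -> int:
    """Count the other hosts that receive a broadcast frame sent by `sender`.

    Assumption: a broadcast stays inside the sender's own segment.
    """
    for hosts in segments.values():
        if sender in hosts:
            return len(hosts) - 1
    raise ValueError(f"unknown host: {sender}")


# Hypothetical network of 600 hosts.
hosts = [f"host{i}" for i in range(600)]

# Flat topology: one big segment.
flat = {"lan": hosts}

# Same hosts split into three segments of 200 each.
split = {f"vlan{n}": hosts[n * 200:(n + 1) * 200] for n in range(3)}

flat_reach = broadcast_reach(flat, "host0")
split_reach = broadcast_reach(split, "host0")
print(flat_reach, split_reach)
```

In this toy model a jabbering NIC on the flat network hits every other host, while on the split network it only hits the hosts in its own segment.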
  
  --
  The world's burning. Moped Jesus spotted on I50. Details at 11.
Re:That's all it takes by Svet-Am · 2007-08-15 08:00 · Score: 5, Insightful

Of course they're running old and outdated hardware. When things work, particularly in a mission critical situation, you don't touch them! Even if the IT admins knew that computer was old and on the brink of dying, how are they supposed to convince the suits and beancounters of that? Non-technical people take the approach that since computers are inherently binary (work or no-work) that if the machine is up and running _right now_ then there is no problem and no sense in spending money to replace it.

If the IT folks were clueless about this machine's age or condition, then the blame lies solely with them for not knowing what the hell they were doing. However, if it was the other folks who shot the IT folks down about upgrading then "welcome to the current state of business", unfortunately.

--
[move .sig! for great justice, take off every .sig!]
The backup plan by Animats · 2007-08-15 08:08 · Score: 5, Funny

DHS's idea of a "backup plan" will probably be to build a huge fenced area into which to dump arriving passengers when their systems are down.
Re:That's all it takes by KillerCow · 2007-08-15 08:15 · Score: 5, Interesting

I am not a networks guy... but it's my understanding that a switch acts like a hub when it sees a TO: MAC address whose port it doesn't know. Switches learn the layout of a network by watching the FROM fields on the frames. When the switch powers up, it behaves exactly like a hub and just watches/learns which MAC addresses are on which ports and builds a switching table. If it starts getting garbage packets, it will look at the TO field and say "I don't know what port this should go out on, so I have to send it on all of them." So garbage packets would overwhelm a network even if it was switched.

It would take a router to stop this from happening. I don't think that there are many networks that use routers for internal partitioning. Even then, that entire network behind that router would be flooded.
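The learn-then-flood behavior described above can be sketched in a few lines. This is a toy model, not how any particular vendor's switch is implemented: the `LearningSwitch` class, the port count, and the MAC addresses are all made up for illustration, and real switches also age out table entries and cap the table size.

```python
from typing import Dict, List

BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningSwitch:
    """Toy model of a layer-2 switch with a MAC address table."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        self.mac_table: Dict[str, int] = {}  # MAC -> port it was last seen on

    def receive(self, in_port: int, src: str, dst: str) -> List[int]:
        """Return the list of ports the frame is sent out on."""
        # Learn: the FROM address lives behind the port the frame arrived on.
        self.mac_table[src] = in_port
        out_port = self.mac_table.get(dst)
        if dst == BROADCAST or out_port is None:
            # Broadcast or unknown TO address: flood to every other port.
            return [p for p in range(self.num_ports) if p != in_port]
        if out_port == in_port:
            return []  # destination is on the arrival port; nothing to forward
        return [out_port]


switch = LearningSwitch(num_ports=4)
# First frame: destination not learned yet, so it is flooded.
first = switch.receive(0, "aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02")
# Reply: the switch already learned where ...:01 lives, so one port only.
reply = switch.receive(1, "aa:aa:aa:aa:aa:02", "aa:aa:aa:aa:aa:01")
# Garbage frame with a TO address nobody owns: flooded again.
garbage = switch.receive(2, "de:ad:be:ef:00:01", "de:ad:be:ef:00:99")
print(first, reply, garbage)
```

In this model every frame with a junk destination goes out all ports except the one it came in on, which is the flooding case the comment describes.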
Re:Head of IT for LAX should be fired... by Rob+T+Firefly · 2007-08-15 08:17 · Score: 5, Funny

They have to find someone who can not only design a vital high-traffic network and maintain it... but who didn't have fish for dinner.

--
Slashdot Burying Stories About Slashdot Media Owned
Re:That's all it takes by Kadin2048 · 2007-08-15 08:23 · Score: 5, Interesting

Would you think that LAX is running anything that out-of-date or crappy? I assume that they're running everything with spit, duct tape, wishful thinking, ancient custom software, near-fossilized hardware, and Excel spreadsheets ... just like pretty much everything else in the public sector.

I've seen what's running some government agencies, and it's frightening.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Re:That's all it takes by EmperorKagato · 2007-08-15 09:13 · Score: 5, Insightful

Even if the IT admins knew that computer was old and on the brink of dying, how are they supposed to convince the suits and beancounters of that?
You show the suits and bean counters how much it costs the company if the system failed and time was spent recovering that system.

--
----- You know you have ego issues when you register a domain in your name.
Re:You figure it out by ctr2sprt · 2007-08-15 10:32 · Score: 5, Informative

One not too unreasonable strategy is to set up SNMP traps on all your NICs.

That doesn't make much sense. If the NIC goes down or starts misbehaving, the chances of your NIC's SNMP traps arriving at their destination are effectively zero. You probably mean setting up traps on your switches with threshold traps on all the interfaces, the switch's CPU, CAM table size, etc. Which would be more useful. You could also use a syslog server, which is going to be considerably easier if you don't have a dedicated monitoring solution.

But they are all pretty standard these days, and your polling interval could be fairly long, like every 2 minutes.

You're not thinking of traps if you're talking about polling. Traps are initiated by the switch (or other device) and sent to your log monster. You can use SNMP polling of the sort that e.g. MRTG and OpenNMS do which, with appropriate thresholds, can get you most of the same benefits. But don't use it on Cisco hardware, not if you want your network to function, anyway. Their CPUs can't handle SNMP polling, not at the level you're talking about.

No alarms, but at least a quick heartbeat of your (conceivably very large) network. A similar system can be used to watch 30,000+ cable modems, without too much load on the SNMP trap server.

I think you are underestimating exactly how much SNMP trap spam network devices send. You'll get a trap for the ambient temperature being too high. You'll get a trap if you send more than X frames per second ("threshold fired"), and another trap two seconds later when it drops below Y fps ("threshold rearmed"). You'll get at least four link traps whenever a box reboots (down for the reboot, up/down during POST, up when the OS boots; probably another up/down as the OS negotiates link speed and duplex), plus an STP-related trap for each link state change ("port 2/21 is FORWARDING"). You'll get traps when CDP randomly finds, or loses, some device somewhere on the network. You'll get an army of traps whenever you create, delete, or change a vlan. If you've got a layer 7 switch that does health checks, you'll get about ten traps every time one of your HA webservers takes more than 100ms to serve its test page, which happens about once per server per minute even when nothing is wrong.
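The "threshold fired" / "threshold rearmed" pair is a rising threshold with hysteresis. A rough sketch of the logic follows; the function name, the sample values, and the X = 1000 / Y = 500 frames-per-second limits are all hypothetical, and a real device would send each event as a trap rather than collect them in a list.

```python
from typing import List, Tuple


def threshold_events(
    samples: List[int], fire_at: int, rearm_at: int
) -> List[Tuple[int, str]]:
    """Return (sample index, event) pairs for a rising threshold with hysteresis."""
    events: List[Tuple[int, str]] = []
    armed = True
    for i, fps in enumerate(samples):
        if armed and fps > fire_at:
            # Crossed X going up: fire once, then stay quiet until rearmed.
            events.append((i, "threshold fired"))
            armed = False
        elif not armed and fps < rearm_at:
            # Dropped below Y: rearm so the next spike fires again.
            events.append((i, "threshold rearmed"))
            armed = True
    return events


# Hypothetical frames-per-second readings from one interface.
samples = [100, 900, 1200, 1100, 700, 400, 1500, 300]
events = threshold_events(samples, fire_at=1000, rearm_at=500)
for index, event in events:
    print(index, event)
```

Even this short series produces four events from one port, which gives a feel for how fast the trap count climbs across thousands of ports.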

And the best part is that because SNMP traps are UDP, they are the first thing to get thrown away when the shit hits the fan. So when a failing NIC starts jabbering and the poor switch's CPU goes to 100%, you'll never see a trap. All you'll see are a bunch of boxes on the same vlan going up and down for no apparent reason. You might get a fps threshold trap from some gear on your distribution or core layers, assuming it's sufficiently beefy to handle a panicked switch screaming ARPs at a gig a second and have some brains left over, but that's about it. More likely you won't have a clue that anything is wrong until the switch kicks and 40 boxes go down for five minutes.

Monitoring a network with tens of thousands of switch ports sucks hardcore, there's no way around it.
Re:That's all it takes by quanticle · 2007-08-15 10:53 · Score: 5, Insightful

You show the suits and bean counters how much it costs the company if the system failed and time was spent recovering that system.

That's very difficult to do, and your estimates of the costs will be called into question. It's often impossible to predict how long it'll take to diagnose and fix a problem unless you've already diagnosed and fixed a similar problem.

Making this kind of estimate also places you into a lose-lose position. If your estimate was high, then management sees you as "chicken little" and will be more likely to dismiss further concerns as more fearmongering. If your estimate was low, then the blame for the outage will cascade down onto you for not showing/convincing management that new equipment was needed.

--
We all know what to do, but we don't know how to get re-elected once we have done it
The scope of the problem by WheelDweller · 2007-08-15 11:47 · Score: 5, Interesting

I agree, but the scope of the problem is much larger.

Americans are still designing systems (and I'm talking WHOLE systems, not just the computers) for the industrial revolution. Much the same way, we're educating our kids for the same purpose- to make them cogs for manufacturing.

The Japanese have a more 'cellular' structure, as opposed to the 'pyramid' designed back a couple of 'turns of the century' ago. One man on top drives five, who drive 200, who drive them all. But the Japanese model is more like object orientation: each unit has private parts. So long as the command it's given produces the proper results and stays within budget, who cares?

Assembly lines gather at their meetings and decide policy on their own. "Fred has been late 3 times this week; do we care?" and the only people to whom it matters, decide. There's no need for a strict, top-down policy, especially since only tiny organizations do just one job.

Imagine the broken structures in a holding company; they own a newspaper, a carwash and a grocery store; the top man can't say "We'll only use glass containers", because that would be a disaster in a car wash. They can't say "we choose leaded inks" which might be fine for the car wash, but dangerous at the newspaper. Each unit has its own purpose.

So how about giving the network admins the power to do *whatever* it takes to let them keep the equipment up to date? As long as it runs, under budget, and doesn't get 'em in the newspapers, who cares about the specifics? Why not let the unused budget from every year sit in an account (not being taken back) and use THAT to improve infrastructure?

If these guys were able to have that kind of control, this discussion wouldn't be happening.

--
--- For a good time mail uce@ftc.gov