One Failed NIC Strands 20,000 At LAX
The card in question experienced a partial failure that started about 12:50 p.m. Saturday, said Jennifer Connors, a chief in the office of field operations for the Customs and Border Protection agency. As data overloaded the system, a domino effect occurred with other computer network cards, eventually causing a total system failure. A spokeswoman for the airports agency said airport and customs officials are discussing how to handle a similar incident should it occur in the future.
According to the effing article, it wasn't even a server, but a goddamn desktop. How in the holy hell does a desktop take down the whole system? I can't even conceive of a situation where that could be the case on anything other than a network designed by chimps, especially through a hardware failure...A compromised system might be able to do it, but a system just going dark?
For that to have had any effect at all, that system must have been the lynchpin for a critical piece of the network...probably some Homeland security abortion tacked on to the network, or some such crap...This is like the time I traced a network meltdown to a 4 port hub (not a switch, and unmanaged hub) that was plugged into (not a joke) a T-3 concentrator on one port, and and three subnets of around 200 computers each on the other 3 ports. Every single one of the outbound cables from the $15.00 hub terminated in a piece of networking infrastructure costing not less than $10,000 dollars.
This is like that. Single point of failure in the worst possible way. Gross incompetence, shortsightedness, and general disregard for things like "uptime"; pretty much what we've come to expect from the airline industry these days. If I'm not flying myself, I'm going to be driving, sailing, or riding a goddamn bicycle before I fly commercial.
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
Of course they're running old and outdated hardware. When thing work, particularly in a mission critical situation, you don't touch them! Even if the IT admins knew that computer was old and on the brink of dying, how are they supposed to convince the suits and beancounters of that? Non-technical people take the approach that since computers are inherently binary (work or no-work) that if the machine is up and running _right now_ then there is no problem and no sense on spending money to replace it.
If the IT folks were clueless about this machine's age or condition, then the blame lies solely with them for not knowing what the hell they were doing. However, if it was the other folks who shot the IT folks down about upgrading then "welcome to the current state of business", unfortunately.
[move
First you see latency on a network, then you fire up a sniffer and hope to god you can get enough packets to deduce which is the flaky card without shutting down every NIC on your network.
Of course I did write a paper on this behavior years ago in my CS networking class. Taking a Snort box and a series of custom scripts to notify admins with spikes on the network outside of normal operating ranges for that device's history. However implementing this successfully in an elegant fashion has been beyond me and I just rely on Nagios to do a lot of my bidding.
CS: It is all sink or swim...oh and did I mention there are sharks in that water?
...for not firing the networking manager. The fact that they were NOT terrified that this news would get out and were too stupid to cover it up indicates he/she and their subordinates SIMPLY DON'T KNOW THEY DID ANYTHING WRONG by not putting in a sufficently montiored switch architecture which would rapidly alert IT staff and lock out the offending node.
Simply amazing. Will someone in the press publish the names of these losers so they can be blacklisted?
DHS's idea of a "backup plan" will probably be to build a huge fenced area into which to dump arriving passengers when their systems are down.
I am not a networks guy... but it's my understanding that a switch acts like a hub when it sees a TO: MAC address that it doesn't know what port it's on. They learn the switching structure of a network by watching the FROM fields on the datagrams. When the switch powers up, it behaves exactly like a hub and just watches/learns what MAC addresses are on which ports and builds a switching table. If it starts getting garbage packets, it will look at the TO field and say "I don't know what port this should go out on, so I have to send it on all of them." So garbage packets would overwhelm a network even if it was switched.
It would take a router to stop this from happening. I don't think that there are many networks that use routers for internal partitioning. Even then, that entire network behind that router would be flooded.
To all you novice net admins out there: network cards do *not* like chunky peanut butter! Smooth/creamy only, please.
Now you see what happens when some joker thinks [s]he can get away with using chunky for something as critical as proper care and feeding of network cards. Pfft.
Bah! Kids these days... I tell ya. Probably the same folks that think the interwebnet is the same as the World Wide Web.
Great, Scott! What's next?!
Running 'Nix is like owning a Lightsaber. It's "a more elegant weapon for a more civilized time."
When the U.S. Government is letting millions of illegal aliens cross over from Mexico and live here with impunity, then what the fuck is the point with stopping a few thousand document carrying people getting off of planes from entering the country?
I guess the system exists to give the appearance that the feds actually give a shit.
And then the Pres and Congress wonder why their approval ratings are as small as their shoe sizes...
Out of order? Fuck! Even in the future nothing works! - Dark Helmet (Rick Moranis) "Spaceballs"
I've seen what's running some government agencies, and it's frightening.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Years ago we had a 10BT nic go defective so that whenever the nic was plugged into the switch it would obliterate traffic on that segment. The fun part: EVEN IF THE NIC WAS NOT PLUGGED INTO THE PC. Luckily that happened in one of the few areas that had switches at the time, everything else was one huge flat lan.
You're right to a point. An ethernet frame, along with the source and destination addresses, has a checksum. A switch that is using a store and forward procedure is supposed to drop the frame if the checksum is invalid. If the nic was throwing garbled frames onto the network, it would have to be garbled in such a way as to have a valid checksum (assuming they are using store and forward switches in the first place).
When our name is on the back of your car, we're behind you all the way!
----- You know you have ego issues when you register a domain in your name.
Sadly, many real-world systems are often nothing like what people might envision as them as. We all sit back in our chairs reading slashdot and thinking everything is masterfully architected, fully HA, redundant, etc.
Then as you work more places you start seeing that this is pretty far from actual truth. Many "production" systems are held together by rubber bands, and duct tape if you're lucky (but not even the good kind.) In my experience it can be a combination of poor funding, poor priorities, technical management that doesn't understand technology, or just a lack of experience or skills among the workers.
Not every place is a Google or Yahoo!, that I can imagine look and smell like technology wherever you go on their fancy campuses. Most organizations are businesses first, and tech shops last. If software and hardware appears to "work", it is hard to convince anybody in a typical business that anything should change- even if what is "working" is a one-off prototype running on desktop hardware. It often requires strong technical management and a good CIO/CTO to make sure that things happen like they should.
I suspect that a lot of things that we consider "critical" in our society are a hell of a lot less robust under then hood than anything Google is running.
There's no reason you can't leave the almost-broken computer there and get a new one. You just build a backup system. Surely management understands that redundancy is good. Then, when the crappy one breaks, you can swap it out instantly. That way, you don't have to mess with things prematurely, but you're only down for hopefully a few minutes. (Of course, replacing it "intentionally", before it fails, is more reliable, but keeping a backup system is a viable alternative if nobody wants to touch the working system.)
ttuttle is a rankmaniac
Why are you assuming that this is an Ethernet network? As old as the equipment they are using is, it may be a Token Ring network - the symptoms that were described sound just like a "beaconing" token ring network.
The new system is usually extremely expensive. Why spend all that money on a new system when the old one works? I know programmers who refuse to update their code from VB3.
You show the suits and bean counters how much it costs the company if the system failed and time was spent recovering that system.
That's very difficult to do, and your estimates of the costs will be called into question. Its often impossible to predict how long it'll take to diagnose and fix a problem unless you've already diagnosed and fixed a similar problem.
Making this kind of estimate also places you into a lose-lose position. If your estimate was high, then management sees you as "chicken little" and will be more likely to dismiss further concerns as more fearmongering. If your estimate was low, then the blame for the outage will cascade down onto you for not showing/convincing management that new equipment was needed.
We all know what to do, but we don't know how to get re-elected once we have done it
It is laughable that there is no non-computerised backup for the system. (How about filling out the forms and scanning them in later?)
I agree, but the scope of the problem is much larger.
Americans are still designing systems (and I'm talking WHOLE systems, not just the computers) for the industrial revolution. Much the same way, we're educating our kids for the same purpose- to make them cogs for manufacturing.
The Japanese have a more 'cellular' structure, as opposed to the 'pyramid' designed back a couple of 'turns of the century' ago. One man on top drives five, who drive 200, who drive them all. But the Japanese model is more like object orientation: each unit has private parts. So long as the command it's given produces the proper results and stays within budget, who cares?
Assembly lines gather at their meetings and decide policy on their own. "Fred has been late 3 times this week; do we care?" and the only people to whom it matters, decide. There's no need for a strict, top-down policy, especially since only tiny organizations all do only one job.
Imagine the broken structures in a holding company; they own a newspaper, a carwash and a grocery store; the top man can't say "We'll only use glass containers", because that would be a disaster in a car wash. They can't say "we choose leaded inks" which might be fine for the car wash, but danger at the newspaper. Each unit has it's own purpose.
So how about giving the network admins the power to do *whatever* it takes to let them keep the equipment up to date? As long as it runs, under budget, and doesn't get'em on the newspapers, who cares about the specifics? Why not let the unused budget from every year sit in an account (not being taken back) and use THAT to improve infrastructure?
If these guys were able to have that kind of control, this discussion wouldn't be happening.
--- For a good time mail uce@ftc.gov
Yes NICs can go crazy and start blasting broadcasts or Unicasts over your network, if you have a Cisco switch (or any other that supports storm control like features) you may want to enable it, it costs you nothing but the time it takes you to update the config. on the access switch (the one connected to your PCs) get into config mode at type this on every interface that connects directly to a PC (use the interface range command to speed things up if you want). Switch(config-if)#storm-control unicast level X where X is the percent of total interface bandwidth you specify as the threshold for cutting access to that port. Its measure every second, so if you have 100 meg port and you set it to 30, if the PC pushes more than 30 meg a sec in unicasts the switch kills the port, till the pc calms down, if its a 10 meg port the 30 then equals 3 meg, etc etc. You can also add a second line to control broadcasts by changing the word unicast to broadcast. If that had this in place, when the NIC went nuts, the switch would have killed the port, and no outage (I assume a lot here, but you get the point).
In contrast technical staff get to hear a lot about the Tacoma Narrows Bridge, Liberty Ships, Titanic or similar disasters from long ago as illustrations of how things can go wrong before they get let out of their first year of training. Some management would discard those lessons as things from the days of dinosaurs which is why we seem to have maintainance, infrastructure and contingincy plans reduced to nothing every decade and then be seen as important in the years immediately following a string of expensive or deadly disasters.