Dublin Air Traffic Control Brought Down By Faulty NIC

Posted by timothy on Thursday July 17, 2008 @07:40PM from the can-go-wrong-can-go-wrong-nothing-can dept.

Not so very long after passengers were left hanging by a similar glitch at LAX, Gilby4mPuck writes with another story of NIC failure leading to a disruption of air traffic, this time in Ireland, excerpting: "Data showing the location, height and speed of approaching planes disappeared from screens for 10 minutes each time. ... Thales ATM stated that in 10 similar air traffic control Centres worldwide with over 500,000 flight hours (50 years), this is the first time an incident of this type has been reported. ... '[They] confirmed the root cause of the hardware system malfunction as an intermittent malfunctioning network card which consequently overcame the built-in system redundancy,' said an IAA spokeswoman."

203 comments

testing and QA by hostyle · 2008-07-17 19:43 · Score: 0, Redundant

Whatever happened to testing of installed hardware? You'd think they might consider that sort of thing important when it involves the lives of thousands of people. Then again, maybe they were drunk at the time.

--
Caesar si viveret, ad remum dareris.
1. Re:testing and QA by Thanshin · 2008-07-17 19:55 · Score: 4, Insightful
  
  Testing doesn't confer prescience.
2. Re:testing and QA by zach_d · 2008-07-17 19:56 · Score: 2, Interesting
  
  I think the issue is one of maintenance. Things need to be replaced after their life cycle is over, even if they seem to be functioning at the time.
3. Re:testing and QA by MortenLJ · 2008-07-17 19:58 · Score: 4, Informative
  
  The possibility of failure can be reduced, but never completely removed. It's a simple matter of probabilities. E.g. a certain component fails on any day with probability p; if we add n redundant fail-overs, the total system will fail with probability 1-p^n, an equation which will never be one, but it can get close.
4. Re:testing and QA by tinkertim · 2008-07-17 20:06 · Score: 5, Funny
  
  > Whatever happened to testing of installed hardware? You'd think they might consider that sort of thing important when it involves the lives of thousands of people. Then again, maybe they were drunk at the time.
  Well, when we set up some cheap NAS boxes with redundant nics .. some load balancers and other goodies .. we tested it by yanking cables on the bonded nics and making sure everything still worked.
  This was for an e-commerce site.. I would agree in hoping more testing with real failures would be done on systems that monitor air traffic.
  Also, we were very drunk when yanking cables during our test .. so I don't think intoxication is really a factor. In fact, turning a drunken monkey loose in a data center with a clearance to pull cables is _very_ good fail over testing :)
5. Re:testing and QA by mcrbids · 2008-07-17 20:11 · Score: 5, Insightful
  
  The very best planned of redundant systems can be brought to its knees by hardware that "mostly works".
  It's not hard to have system B check that system A is on/off line, and step in if the latter is the case. But what happens when A is *mostly* or *sorta* online? Does system B check that ALL functionality done by A is being done appropriately? Almost never.
  And that's why, even in the best, most carefully designed, fully redundant high-availability systems, you never, ever see 100% uptime. It's just not possible to anticipate everything that can go wrong.
  So design a system that fails gracefully! That's what nature did.
  Take a look at your own body. It's a gorgeous example of a high-availability, high-redundancy system. There are literally BILLIONS of cells in your body, each operating as a semi-independent unit, such that any of them can fail without bringing down the whole, or even affecting it noticeably. Your body is an excellent example of a cheap, redundant, high-availability system.
  Yet catastrophic failures still occur. Whether by cancer, diabetes, or heart disease, even a well-designed, tested-for-millions-of-years high-redundancy system with billions of individual, replaceable parts fails catastrophically from time to time.
  It's the nature of the beast.
  Mother nature has compensated by making not only the system redundant, but the need for the system also redundant. Rapid reproduction is nature's friend! Not just redundancy, but redundant redundancy.
  High availability - it's much, much, MUCH harder than you thought.
  
  --
  I have no problem with your religion until you decide it's reason to deprive others of the truth.
6. Re:testing and QA by boaworm · 2008-07-17 20:18 · Score: 1
  
  Quite likely it did work at the time of FAT, SAT, Shadow operation and when going into live operation.
  Whether it breaks down later on is another issue, one that's not possible to test for beforehand. Isn't that pretty obvious? It is like testing a car to see if it will ever be in an accident. You sir, are the drunk one :)
  
  --
  Probable impossibilities are to be preferred to improbable possibilities.
  Aristotle
7. Re:testing and QA by Valehru · 2008-07-17 20:40 · Score: 2, Interesting
  
  I had an engineer stuck in Germany for three days due to this stupidity. He got his fill of beer, good hotel rooms and sightseeing done, so in his mind it was a decent holiday. The insane thing was that this issue had already happened a few weeks earlier; there was an investigation, but it did not discover the faulty NIC then either.
8. Re:testing and QA by HungryHobo · 2008-07-17 20:41 · Score: 5, Insightful
  
  I'm inclined to trust the card which has been working fine for 5 years over a card which was put in yesterday.
9. Re:testing and QA by kitgerrits · 2008-07-17 20:43 · Score: 5, Insightful
  
  The problem is not that redundancy wasn't implemented.
  The problem is that redundancy doesn't handle 'flapping' hardware very well.
  The NIC intermittently failed, causing the redundancy to switch cards several times.
  This can play havoc on systems that work on a LAN and assume the MAC address to stay the same.
  Also, a NIC that does not report an error, doesn't fail completely and simply swaps a few bits around can be nigh-on impossible to diagnose.
  This could have been caught with real-time hardware and log-monitoring, but I have to confess even I only check the logs daily, not real-time. While some monitoring systems can mail the admin in the event of failure, not all systems are usually configured that way ('workstations' being a prime candidate).
  There is a line you draw between monitoring and cost-effectiveness. Every company takes a calculated risk in this and they got bitten.
  
  --
  "I was in love with a beautiful blonde once, dear. She drove me to drink. It's the one thing I am indebted to her for."
10. Re:testing and QA by zach_d · 2008-07-17 20:50 · Score: 3, Interesting
  
  in a high noise/vibration/dust environment?
11. Re:testing and QA by leuk_he · 2008-07-17 20:53 · Score: 1
  
  If it works after 5 years, sure.
  If not, there is always a backup, isn't there? Well, in that case there is a backup of the backup.
12. Re:testing and QA by Hal_Porter · 2008-07-17 20:53 · Score: 5, Funny
  
  Only The Spice confers prescience.
  
  --
  echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
13. Re:testing and QA by DaedalusHKX · 2008-07-17 20:55 · Score: 3, Insightful
  
  The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
  I call "CYA kissass excuse maker" to the stand!
  Someone screwed up big, and they're Covering Their Asses now.
  
  --
  " What luck for rulers that men do not think" - Adolf Hitler
14. Re:testing and QA by Hal_Porter · 2008-07-17 20:56 · Score: 1
  
  This is very true. I worked on a system where we had lots of redundancy in critical places. But given enough tests sometimes the bugs would get through, usually in ways that you hadn't thought of.
  
  --
  echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
15. Re:testing and QA by seifried · 2008-07-17 21:01 · Score: 2, Interesting
  
  Yes but if _one_ NIC can bring the entire system down what other single failures in a component could bring the entire system down? Obviously the system with the malfunctioning NIC can do any number of things that may result in a similar failure mode. Or what happens if the network switch it is attached to fails (I assume they use multiple paths... but if one nic can nuke it all, imagine if a switch went bonkers).
16. Re:testing and QA by Anonymous Coward · 2008-07-17 21:09 · Score: 2, Insightful
  
  > The very best planned of redundant systems can be brought to its knees by hardware that "mostly works".
  NO, you are wrong there. What this indicates is that someone skimped. Techniques for processing and getting reliable signals through systems that only mostly work are very well known and used routinely. What this event means is that someone, either explicitly or implicitly assumed that NICs are binary - they either work or they don't, and designed accordingly.
  What should have been used is multiple (more than two) parallel simultaneous communication paths with comparison voting at the far end to determine if the information received can be regarded as valid or not. Assuming that a NIC will fail gracefully is so boneheaded that in a safety critical application like this someone could likely be prosecuted for negligence.
  Unfortunately, using the right techniques is expensive. Given that systems like this are provided by tender processes that favour low-bidders, it is not surprising that problems appear.
  Of course, somebody may have done a cost-benefit analysis and decided the risk of one (or several) aircraft accidents didn't merit the extra expenditure. That's unlikely to be publicised 'though - although it would be a correct calculation to run.
  As for testing - well running around pulling cables out at random doesn't really do it. Unplugging and plugging cables at various frequencies/intervals, swapping cables, plugging them into incorrect sockets, injecting noise, dropping the voltage on the power supply, overvoltage/spikes are all things that could and should be done - and in some cases mathematical formal proof that the system will work as required. All of this (and more) is done for safety critical applications.
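A rough sketch of the comparison-voting idea above, assuming each of three or more paths delivers its own copy of the same reading. The function name `vote` and the sample altitude values are made up for illustration; a real system would also need timing and tolerance rules that this ignores.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(readings: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of paths, else None.

    `readings` holds one value per communication path (hypothetical layout).
    None means "no trustworthy answer", which the caller must handle.
    """
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    # Require more than half of the paths to agree; a tie is not a majority.
    return value if count * 2 > len(readings) else None


if __name__ == "__main__":
    # Three paths, one of which has dropped a digit (illustrative values).
    print(vote([35000, 35000, 3500]))
    # No two paths agree, so nothing is accepted.
    print(vote([1, 2, 3]))
```

Returning None instead of guessing is the point: with only two paths, a disagreement tells you something is wrong but not which side to believe.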
17. Re:testing and QA by tinkertim · 2008-07-17 21:10 · Score: 2, Insightful
  
  > The problem is not that redundancy wasn't implemented. The problem is that redundancy doesn't handle 'flapping' hardware very well.
  > The NIC intermittently failed, causing the redundancy to switch cards several times.
  > This can play havoc on systems that work on a LAN and assume the MAC address to stay the same.
  That's what got me curious, it looked like they were using takeover instead of bonding devices.
  The most well engineered system in the world can not hope to escape a ~9 minute ARP cache upstream, which makes me wonder why it was designed the way that it was.
  I'm not thinking in an antagonistic sense, I'm more wondering what changed in the network _after_ the system was deployed.
18. Re:testing and QA by Hognoxious · 2008-07-17 21:10 · Score: 5, Funny
  
  > if we add n redundant fail-overs, the total system will fail with probability 1-p^n
  Any number raised to the power 0 is 1. So if you don't install anything, hence n is 0, it will always work since the probability of failure is 1-1 = 0.
  
  --
  Confucius say, "Find worm in apple - bad. Find half a worm - worse."
19. Re:testing and QA by TheThiefMaster · 2008-07-17 21:13 · Score: 2, Insightful
  
  If one fails with probability p, and you have n of them, a total system failure is probability p^n, not 1-p^n. Well technically it's Mult(p,1->n) where p1 is the probability of the first failing, p2 the probability of the second, etc, multiplying them all together to get the chance of a total system failure.
  The probability of any one device in a redundant system failing is (1-((1-p)^n)). This equation rapidly approaches 1, so in larger setups failures will be a common occurrence, but they'll largely be harmless due to redundancy.
  Of course this all assumes the failure mode of the device is "off" or "non-functioning". If it fails in a way which routes 15A of mains power into a network cable, redundancy might not help a whole lot.
  Obviously that's not what happened, but it's not outside possibility for one device to take down an entire redundant system.
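To make the two formulas above concrete, here is a small sketch assuming every unit fails independently with the same probability p (a big assumption, as the 15A example shows). The function names and the sample value p = 0.01 are illustrative only.

```python
def p_all_fail(p: float, n: int) -> float:
    """Probability that all n independent units fail, i.e. a total outage."""
    return p ** n


def p_any_fail(p: float, n: int) -> float:
    """Probability that at least one of n independent units fails."""
    return 1 - (1 - p) ** n


if __name__ == "__main__":
    p = 0.01  # assumed per-unit failure probability, purely illustrative
    for n in (1, 2, 3, 10):
        print(f"n={n}: total outage {p_all_fail(p, n):.2e}, "
              f"some unit down {p_any_fail(p, n):.2e}")
```

Adding units drives the total-outage figure toward zero while the chance of having some broken unit to replace goes up. With n = 0 the total-outage probability is 1, which matches intuition: a system with nothing installed does no work.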
20. Re:testing and QA by diskis · 2008-07-17 21:17 · Score: 2, Informative
  
  Air traffic towers generally are not noisy or dusty. And in any case, disregarding the ports, the NIC card itself is practically eternal. Compared to the rest of the system, and the lifetime of the system that is.
  Two lessons learned from years of technical support. The NIC isn't broken, unless the computer has been dragged from the network cable. And that the CPU is not broken as long as the system has not been overclocked, and the heatsink is still in place.
21. Re:testing and QA by Phroggy · 2008-07-17 21:21 · Score: 2, Informative
  
  The problem is, NICs can fail in all kinds of ways that yanking cables won't simulate. In this case it sounds like if they had yanked the cable, the backup system would have come online exactly like it was supposed to, but because the faulty NIC was kinda-sorta-almost-but-not-really working, it didn't. That's a difficult thing to test in the lab.
  
  --
  $x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
  $x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
22. Re:testing and QA by diskis · 2008-07-17 21:22 · Score: 1
  
  > But what happens when A is *mostly* or *sorta* online?
  
  You have dual NICs on system A, with an etherkiller connected to the second card.
  
  When B takes over, it then can make sure that A stays down :)
23. Re:testing and QA by mcrbids · 2008-07-17 21:28 · Score: 1
  
  And this is where the "cheap" part of my comment "cheap, redundant, high-availability system" comes into play.
  See, the likelihood of failure in a redundant system goes *up* as the number of units increases. But as the number of units in a redundant system increases, the likelihood of a *complete* failure drops to a number never equalling zero. In other words, no matter how much redundancy you build in, you'll never achieve zero downtime over the long haul.
  The human body achieves zero downtime over a few decades in many cases, and close to 5 nines (99.999%) over 6-7 decades in most cases. This is very, very good uptime and is very noteworthy, but requires BILLIONS of redundant units and expensive external intervention (AKA the "hospital") to achieve.
  You'll never get 100%. So get off it, already. Instead, prepare for the 1% to 0.1% of downtime and call it a day!
  
  --
  I have no problem with your religion until you decide it's reason to deprive others of the truth.
24. Re:testing and QA by clickety6 · 2008-07-17 21:28 · Score: 1
  
  why stop with the drunken monkey...
  http://www.the5thwave.com/gallery/comp_misc/677.html
  
  --
  ----------------------------------- My Other Sig Is Hilarious -----------------------------------
25. Re:testing and QA by Anonymous Coward · 2008-07-17 21:30 · Score: 0
  
  Yes, because as soon as we get up in the morning before we even have breakfast we have a pint of Guinness and by the time we get to work we can barely even stand, never mind operate a radar system. Then again it might have been because we left the job up to the leprechauns who were too focused on protecting their crock of gold.
26. Re:testing and QA by Thanshin · 2008-07-17 21:32 · Score: 1
  
  > Only The Spice confers prescience.
  Actually, it confers "the ability to fold space. That is, travel to any part of universe without moving."
  Well, at least they got the "not moving" part right.
27. Re:testing and QA by Anonymous Coward · 2008-07-17 21:37 · Score: 0
  
  > if we add n redundant fail-overs,
  if we have n redundant fail-overs
  
  > the total system will fail with probability 1-p^n
  the probability for a fail with n redundant fail-overs is
  p^n
  no?
28. Re:testing and QA by xiox · 2008-07-17 21:43 · Score: 1
  
  I always thought that CPUs could never be broken. We had an Athlon 64 processor 4600+, it was never overclocked and always used with a standard fan/heatsink, in a well ventilated case. After a year of work, it then started randomly crashing every few weeks. Replacing all the components except the CPU didn't fix the problem (different motherboard, memory, etc). Replacing the CPU did fix the problem. They can die randomly but it is very rare.
29. Re:testing and QA by HJED · 2008-07-17 21:51 · Score: 1
  
  > Whatever happened to testing of installed hardware? You'd think they might consider that sort of thing important when it involves the lives of thousands of people. Then again, maybe they were drunk at the time.
  > Well, when we set up some cheap NAS boxes with redundant nics .. some load balancers and other goodies .. we tested it by yanking cables on the bonded nics and making sure everything still worked.
  > This was for an e-commerce site.. I would agree in hoping more testing with real failures would be done on systems that monitor air traffic.
  > Also, we were very drunk when yanking cables during our test .. so I don't think intoxication is really a factor. In fact, turning a drunken monkey loose in a data center with a clearance to pull cables is _very_ good fail over testing :)
  It is pointless to destroy a system while testing it unless you have a big budget and can rebuild the system exactly the same.
  intoxicated monkey destroying data center != testing
  intoxicated monkey destroying data center == (waste_of_money && destroying_data_center)
  destroying_data_center == waste_of(time && money)
  destroying_data_center != testing
  
  --
  null
30. Re:testing and QA by mowall · 2008-07-17 22:06 · Score: 4, Insightful
  
  > The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
  I suspect it's because, as mentioned in the summary, it was "an intermittent malfunctioning network card". i.e. the failover system must have thought the card was functioning.
31. Re:testing and QA by putaro · 2008-07-17 22:10 · Score: 2, Informative
  
  That was in the movie. Read the book, it's much better.
32. Re:testing and QA by jimicus · 2008-07-17 22:17 · Score: 4, Interesting
  
  > Yes but if _one_ NIC can bring the entire system down what other single failures in a component could bring the entire system down? Obviously the system with the malfunctioning NIC can do any number of things that may result in a similar failure mode. Or what happens if the network switch it is attached to fails (I assume they use multiple paths... but if one nic can nuke it all, imagine if a switch went bonkers).
  You don't need to bring the entire system down to cause havoc. What if there's a hitherto unknown bug in one of the CPUs which under some very specific set of circumstances causes aircraft altitude to be misreported on the operator's screen? As the GP said, most redundant systems only ensure that the components appear to be broadly working. They seldom check that all the components are doing something sensible.
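One partial answer to "is it doing something sensible" is to check the data itself rather than the link state. The sketch below assumes each frame carries a send timestamp and a CRC32 of its payload; the frame layout, the name `frame_ok`, and the 2-second age limit are hypothetical. It would catch stale or bit-flipped frames, not a logic bug that produces a well-formed wrong altitude.

```python
import zlib


def frame_ok(payload: bytes, crc: int, sent_at: float, now: float,
             max_age_s: float = 2.0) -> bool:
    """Accept a frame only if it is both recent and uncorrupted.

    payload  : raw frame contents (hypothetical format)
    crc      : CRC32 the sender computed over payload
    sent_at  : sender timestamp in seconds
    now      : receiver clock in seconds (assumed roughly in sync)
    """
    fresh = 0.0 <= now - sent_at <= max_age_s
    intact = zlib.crc32(payload) == crc
    return fresh and intact


if __name__ == "__main__":
    payload = b"EI123 FL350"          # made-up track report
    crc = zlib.crc32(payload)
    print(frame_ok(payload, crc, sent_at=10.0, now=11.0))         # recent, intact
    print(frame_ok(payload, crc, sent_at=10.0, now=20.0))         # too old
    print(frame_ok(b"EI123 FL250", crc, sent_at=10.0, now=11.0))  # altered payload
```

A failover monitor that counts rejected frames sees a "mostly working" card as unhealthy even while its link light stays on.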
33. Re:testing and QA by david.given · 2008-07-17 22:27 · Score: 3, Informative
  
  > Actually, it confers "the ability to fold space. That is, travel to any part of universe without moving."
  Actually actually, the space folding is done using the Holtzman drive, which is a perfectly ordinary machine. The Navigator merely navigates, plotting a safe path through the non-space/time foldspace. The spice grants the Navigator the limited prescience required to do this.
  Eventually the Navigators become obsolete, replaced by Ixian semisentient machines known as Compilers that perform the same task without needing melange. A good thing too, because by that point Arrakis is rubble and sandworms are pretty much extinct.
  Details courtesy of Wikipedia (and my lack of a social life).
34. Re:testing and QA by TheSunborn · 2008-07-17 22:34 · Score: 1
  
  Not according to the book. According to the book the spice is needed to predict what will happen when you arrive, that is: To ensure that you don't arrive inside a planet, or other dangerous place. The point is that when you do something that amounts to traveling faster than light, the only way to know anything about where you arrive, is to predict the future.
  This is also a big reason that the guild never took over Dune. They were so conditioned to always seek the safe path (because that was what their ships needed) that they could not imagine starting an operation that was not known to be 100% safe for the guild.
  (Only slightly offtopic)
35. Re:testing and QA by kitgerrits · 2008-07-17 22:44 · Score: 1
  
  Indeed.
  I've seen Network Bonding in RedHat Enterprise Linux with HP hardware use a 'fake' MAC address that is bound to several interfaces to avoid just this problem.
  Unfortunately, it may confuse the switch it is connected to, because of said ARP cache (CAM table, ours was 16 hours).
  Really-HA systems require genuine engineers with tons of real-life experience, just to know what bits work and what bits you want to avoid.
  I hope to become one, one day ;-)
  
  --
  "I was in love with a beautiful blonde once, dear. She drove me to drink. It's the one thing I am indebted to her for."
36. Re:testing and QA by somersault · 2008-07-17 22:46 · Score: 1
  
  If you didn't apply the heatsink yourself, you don't know if it's been done correctly. On my first PC that I bought with my own money (well, half bought and my dad paid the rest), it kept locking up randomly, and after lots of IRQ and driver troubleshooting my dad removed the heatsink only to find that they hadn't applied it correctly. One reapplication of thermal paste and proper connection to the CPU later, and everything was fine (until that system got messed up in a lightning storm a few years later, but I still use the case when building my own machines)
  
  --
  which is totally what she said
37. Re:testing and QA by somersault · 2008-07-17 22:48 · Score: 1
  
  Sometimes, pure intuition can be more handy than maths.
  
  --
  which is totally what she said
38. Re:testing and QA by Anonymous Coward · 2008-07-17 22:54 · Score: 0
  
  My method is foolproof: I do processing with millionfold redundancy. I partition the nodes into segmented networks and when the results within a segment disagree (which is the usual case) I go with the plurality for that segment. The individual segments then get weighted scores proportionate to the number of units in that segment and finally the computation is performed by averaging the result of each segment weighted by the percentage of all nodes that are in that segment.
  This computation is then compared (for historical reasons) with a separate result computed by nine "supreme" nodes, and finally I return the latter value, ignoring the former.
  Foolproof, I tell you!
39. Re:testing and QA by LiquidCoooled · 2008-07-17 22:57 · Score: 1
  
  CPUs with correctly seated heatsinks which stay within their prime operating temperature usually have no problems.
  However, it's rather easy to get the wrong amount of goop or something else wrong with the airflow, or just a marginal chip, etc.
  I had an AMD t'bird 1.4ghz which would NOT run happily at 1.4ghz no matter how much I tried.
  In the end I gave up and ran it happily at 1.33 for years.
  
  --
  liqbase :: faster than paper
40. Re:testing and QA by ebolaZaireRules · 2008-07-17 23:10 · Score: 1
  
  I'm sure that it's not merely the destination, but also the journey that needs to be known beforehand... 1 in 10 'disappeared', not ended up 1/2 parked inside of an asteroid.
  
  --
  The Bible: Historically verifiable fact from an observers point of view
41. Re:testing and QA by xiox · 2008-07-17 23:14 · Score: 1
  
  We did - we applied the heatsink several times, when we moved the CPU between different motherboards. Proper thermal transfer compound was used. The temperature of the CPU was fine.
42. Re:testing and QA by methamorph · 2008-07-17 23:30 · Score: 3, Insightful
  
  > The article says "it overcame the built in system redundancy"... how the hell does ONE failing card in a redundant setup "overcome" the redundant backup parts/systems ??
  If the card had failed completely the redundant one would probably have kicked in. What I think happened is the card malfunctioned in a way causing the system to still think that the card is fine and there is no need for the redundant one to kick in.
43. Re:testing and QA by witherstaff · 2008-07-17 23:38 · Score: 1
  
  I recall some DEC NICS that when they started to fail, all got the same MAC. Talk about a fun thing to troubleshoot on a network! If it was a plain old switch using MAC switching, you can cause havoc pretty easily.
44. Re:testing and QA by Anonymous+Cowpat · 2008-07-17 23:48 · Score: 1
  
  If you did apply it yourself, you can't be sure that it's been done correctly! On the first system I built, I put the heat sink on pi radians rotated from where it was supposed to be, and without thermal paste at all. (How was I supposed to know?)
  Talk about expensive mistake.
  
  --
  FGD 135
45. Re:testing and QA by dotancohen · 2008-07-18 00:01 · Score: 1
  
  > High availability - it's much, much, MUCH harder than you thought.
  I would like nothing more than to be able to breed my way out of hardware failures.
  "What? Another NIC failed? Honey, spread 'em!"
  
  --
  It is dangerous to be right when the government is wrong.
46. Re:testing and QA by Z00L00K · 2008-07-18 00:07 · Score: 1
  
  The big problem here was the intermittent function and that can happen anywhere in the lifecycle. Just replacing things won't help much and can cause a bigger problem.
  What has to be improved is the redundancy system solution that has to be able to detect intermittent function and therefore do a complete failover.
  And this isn't the first time this kind of problem has happened, and it's probably not the last time either. But in this case it was on a mission critical system.
  
  --
  If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
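The intermittent-function detection suggested above could be sketched roughly as follows. This assumes the monitor gets periodic up/down samples for a link; the class name `FlapDetector`, the threshold of 3 transitions, and the 60-second window are all made up for illustration, not taken from any real failover product.

```python
from collections import deque


class FlapDetector:
    """Latch a link as failed once it changes state too often in a time window."""

    def __init__(self, max_transitions: int = 3, window_s: float = 60.0):
        self.max_transitions = max_transitions  # tolerated state changes per window
        self.window_s = window_s
        self._transitions = deque()             # timestamps of recent state changes
        self._last_state = None
        self.latched_down = False               # once True, stays True until reset by hand

    def observe(self, link_up: bool, now: float) -> bool:
        """Record one sample; return True if the link should be used."""
        if self._last_state is not None and link_up != self._last_state:
            self._transitions.append(now)
        self._last_state = link_up
        # Forget state changes that fell out of the window.
        while self._transitions and now - self._transitions[0] > self.window_s:
            self._transitions.popleft()
        if len(self._transitions) > self.max_transitions:
            self.latched_down = True
        return link_up and not self.latched_down


if __name__ == "__main__":
    detector = FlapDetector()
    # A link bouncing once per second (illustrative samples).
    for t, state in enumerate([True, False, True, False, True]):
        print(t, state, detector.observe(state, float(t)))
```

The latch is deliberate: a card that has flapped is not handed control again just because it currently reads OK; someone has to look at it and reset it.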
47. Re:testing and QA by colfer · 2008-07-18 00:16 · Score: 1
  
  Lightning damage can come in through the NIC. On the network I saw, a Linksys router hooked to a satellite modem started sending damaging voltage down the ethernet after a storm. One computer lost its NIC, two others lost NIC + motherboard. After the storm was over, the Linksys box was still deadly to a laptop.
48. Re:testing and QA by Hal_Porter · 2008-07-18 00:21 · Score: 5, Interesting
  
  One of the odd and very likable things about Dune is that there are occasionally implications that the society we read about is not the most advanced. Maybe their taboos are limiting them. Essentially the world we read about is actually in its own version of the Dark Ages where progress has all but stopped and feudalism is the only system. The Tleilaxu and the Ixians aren't in a Dark Age though. But we don't hear too much about them because they are outside the known world because they violate the taboos that govern the known world.
  Essentially it's a bit like reading the history of Taliban-controlled Afghanistan, or unfortunately anywhere with an Islamic government. And I'm sure it's deliberate - Frank Herbert apparently was inspired by the Islamic uprisings against the British.
  Or if you look at it another way he wanted to write a hallucinogenic, retro sci fi epic, and he came up with a bunch of explanations - the Butlerian Jihad, the necessity of spice-based prescience for interstellar travel, and the incompatibility between directed energy weapons and shields to explain why his universe was that way and not like conventional sci fi with ray guns, robots and open societies in the Popper sense.
  
  --
  echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
49. Re:testing and QA by vmcto · 2008-07-18 00:44 · Score: 1
  
  Probably a cosmic ray.... http://www.scitech.ac.uk/PMC/PRel/STFC/Cosmic.aspx
50. Re:testing and QA by Anonymous Coward · 2008-07-18 00:49 · Score: 0
  
  Don't be a dick. You mentioned cheap specifically in the context of the human body, not in the overall context. And of course the likelihood of a component failure in a redundant system goes up as the component count increases: but we are talking about the system availability. Don't confuse the two.
  You are right, however, in saying guaranteed 100% uptime is not achievable. Planning what to do when (not if) the system fails is always well worth doing. With luck, you'll never need the plans.
51. Re:testing and QA by kaiidth · 2008-07-18 01:06 · Score: 1
  
  Flew into Belfast instead last week because of this... when you're inconvenienced by technology it is very calming to know that what caused it was a real honest to goodness fuck-up, rather than a much less interesting case of human error :-)
52. Re:testing and QA by Big+Hairy+Ian · 2008-07-18 01:08 · Score: 1
  
  Hardware can still go down regardless of how often it's tested. The issue here is it took 10 minutes to rectify the fault.
  
  --
  Build a Man a Fire, and He'll Be Warm for a Day. Set a Man on Fire, and He'll Be Warm for the Rest of His Life.
53. Re:testing and QA by knightri · 2008-07-18 01:47 · Score: 0
  
  If the NIC fails, the secondary NIC will take over permanently. I see no reason why control would be given back to a card that has failed but now reads OK. At least that's how we do things in the power industry.
  
  --
  'Or else pizza is going to order out for you'
54. Re:testing and QA by xalorous · 2008-07-18 02:01 · Score: 1
  
  Yes, the books are better, and the spice does confer prescience in a small number of instances.
  
  --
  TANSTAAFL GIGO Acronyms to live by!
55. Re:testing and QA by hostyle · 2008-07-18 02:01 · Score: 1
  
  > Data showing the location, height and speed of approaching planes disappeared from screens for 10 minutes each time.
  The airport has been in chaos for the week or so since it started to happen. It took them that long to get back to full working capacity.
  
  --
  Caesar si viveret, ad remum dareris.
56. Re:testing and QA by xalorous · 2008-07-18 02:07 · Score: 1
  
  NICs go bad. Fact of life. Equipment fails. This particular piece of equipment failed in a dramatic fashion that caused errors resulting in a denial of service to other network devices. The fix is monitoring at the switch level for these types of failures and disabling the port before the errors can cause cascading failures.
  Do you run a full function check on every piece of hardware on your computer every time you turn it on? Or do you run it til it breaks then fix it? Also, consider how hard it is to trace this type of error. It is a partial failure, intermittent, and the failure brings down other machines. Any one of the machines that go down could be the culprit.
  Kudos to the troubleshooters who found this beast!
  
  --
  TANSTAAFL GIGO Acronyms to live by!
57. Re:testing and QA by Anonymous Coward · 2008-07-18 02:38 · Score: 0
  
  A faulty PSU can often cause CPU death.
58. Re:testing and QA by Anonymous Coward · 2008-07-18 02:53 · Score: 0
  
  This comment makes me wish multiple moderations were available. -1 Offtopic, +1 Funny
59. Re:testing and QA by morgan_greywolf · 2008-07-18 03:18 · Score: 1
  
  > echo -e 'global _start \n _start: \n mov eax, 2 \n int 80h \n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a; a
  Hmm....I wouldn't do that if you have 'ulimit' set to 'unlimited' number of processes allowed per user.
  
  --
  My blog
60. Re:testing and QA by Dirtside · 2008-07-18 03:33 · Score: 2, Funny
  
  So if you don't install anything, hence n is 0, it will always work since the probability of failure is 1-1 = 0.
  Actually, it would be more accurate to say that it would never fail. ;)
  
  --
  "Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased
61. Re:testing and QA by ColdWetDog · 2008-07-18 03:39 · Score: 1
  
  .echo -e 'global _start \n _start: \n mov eax, 2 \n int 80h \n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a; a
  Hmm....I wouldn't do that if you have 'ulimit' set to 'unlimited' number of processes allowed per user.
  
  You need to go OUTSIDE today, my man.
  
  --
  Faster! Faster! Faster would be better!
62. Re:testing and QA by Anonymous Coward · 2008-07-18 03:41 · Score: 0
  
  Indeed, to create a logic system B that can confirm _exactly_ whether or not a logic system A is functioning, one must encode completely the logic of system A into system B, thus creating a sort of recursive "who watches the watchers" type situation...
63. Re:testing and QA by kilodelta · 2008-07-18 03:46 · Score: 1
  
  Network Interface Cards and chips are probably the most failure prone components in a computer.
  
  If you've been around networks long enough you've found jabbering NICs, dead NICs, etc.
64. Re:testing and QA by Ihmhi · 2008-07-18 04:05 · Score: 2, Interesting
  
  So putting in a faulty NIC card and seeing what happened wouldn't have done anything at all, huh?
  Part of testing systems is trying to emulate what happens when a portion goes down.
  
  --
  Random Thoughts From A Diseased Mind (Not For Dummies)
65. Re:testing and QA by ISoldat53 · 2008-07-18 04:41 · Score: 1
  
  Read Brian Herbert's books. They are prequels to the original Dune series and cover the history of most of the Dune races.
66. Re:testing and QA by morgan_greywolf · 2008-07-18 04:51 · Score: 3, Funny
  
  Yeah. Fortunately I just got back from going outside. OTOH, it was just raining, and I saw all the millions and millions of tiny water drops falling from the sky. Which made me think of Interrupt 80 and all those forked off-processes it would spawn with that code...
  
  --
  My blog
67. Re:testing and QA by geminidomino · 2008-07-18 05:11 · Score: 1
  
  Just have a solid supply of your own hallucinogen of choice if you plan to make it through them. Pee-ew
68. Re:testing and QA by geminidomino · 2008-07-18 05:14 · Score: 1
  
  On the first system I built; I put the heat sink on pi radians rotated from where it was supposed to be, and without thermal paste at all.
  The word is "backwards," mate.
69. Re:testing and QA by geminidomino · 2008-07-18 05:21 · Score: 1
  
  intoxicated monkey destroying data center != testing
  intoxicated monkey destroying data center == (waste_of_money && destroying_data_center)
  destroying_data_center == waste_of(time && money)
  destroying_data_center != testing
  You forgot
  intoxicated_monkey_destroying_data_center == really_fucking_funny
70. Re:testing and QA by geminidomino · 2008-07-18 05:23 · Score: 1
  
  Oh gods no... My bosses are highly fond of linksys for some reason.
  My dick would fall off inside of 6 months. :'( I'm not a machine, dammit!
71. Re:testing and QA by mr_3ntropy · 2008-07-18 05:24 · Score: 1
  
  The problem is not that redundancy wasn't implemented.
  [...]
  This can play havoc on systems that work on a LAN and assume the MAC address to stay the same.
  In other words, redundancy wasn't implemented.
72. Re:testing and QA by skarphace · 2008-07-18 05:48 · Score: 2, Funny
  
  So putting in a faulty NIC card and seeing what happened wouldn't have done anything at all, huh?
  You keep a bunch of 'faulty' NICs around?
  
  --
  Bullish Machine Tzar
73. Re:testing and QA by mpeg4codec · 2008-07-18 06:04 · Score: 1
  
  Guess it really depends on where you are (or probably more poignantly, where you think you are) on the bathtub curve.
74. Re:testing and QA by A440Hz · 2008-07-18 06:34 · Score: 1
  
  Are you suggesting that NICs should reproduce?
75. Re:testing and QA by bangwhistle · 2008-07-18 08:06 · Score: 1
  
  If "redundancy" is clumsily implemented, then having multiple widgets doesn't help. Your monitor may think widget 1 is up when it really isn't. I've seen that problem in a common load balancing appliance.
76. Re:testing and QA by seifried · 2008-07-18 08:30 · Score: 1
  
  (Relatively) simple then: you have multiple systems, they vote on an answer, and if someone is out they get voted off the island. You have another system with a different implementation also check to make sure the answer is sensible. Granted, this is hideously expensive and probably only suitable for really expensive things like the space shuttle, but it is possible.
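The voting scheme above can be sketched roughly as follows. This is an illustration only: the function name, the dictionary of unit answers, and the `is_sensible` callback standing in for the "different implementation" check are all assumptions, not a description of any flight system.

```python
from collections import Counter


def vote(answers, is_sensible):
    """Return (agreed_value, dissenting_unit_names).

    answers: dict mapping unit name -> reported value (must be hashable).
    is_sensible: independent plausibility check applied to the winning value.
    Raises ValueError if there is no strict majority or the winner looks wrong.
    """
    if not answers:
        raise ValueError("no answers to vote on")
    winner, count = Counter(answers.values()).most_common(1)[0]
    if count * 2 <= len(answers):
        raise ValueError("no strict majority")
    if not is_sensible(winner):
        raise ValueError("majority answer failed the independent sanity check")
    # Units that disagreed with the majority are the ones "voted off the island".
    dissenters = sorted(name for name, value in answers.items() if value != winner)
    return winner, dissenters
```

For example, `vote({"a": 350, "b": 350, "c": 90}, lambda fl: 0 <= fl <= 600)` agrees on 350 and names unit "c" as the dissenter.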
77. Re:testing and QA by AnotherDaveB · 2008-07-18 08:43 · Score: 1
  
  Frank Herbert apparently was inspired by the Islamic uprisings against the British.
  Where did you hear that?
78. Re:testing and QA by lgw · 2008-07-18 09:32 · Score: 1
  
  Perhaps the fact that the leader of the Islamic uprisers was named "Mahdi", or the fact that that uprising defeated a world-spanning empire with "desert power"? Or many other similar events?
  
  --
  Socialism: a lie told by totalitarians and believed by fools.
79. Re:testing and QA by lgw · 2008-07-18 09:39 · Score: 1
  
  You can test a kinda-sorta-almost-but-not-really working NIC by putting a jammer in-line with the test card. Systematically mess with bits important to every protocol layer intermittently and look for potential issues, then focus there and do more tests.
  It also helps to keep any bad NICs you find around forever for this kind of testing, since they're far cheaper than jammers.
  
  --
  Socialism: a lie told by totalitarians and believed by fools.
80. Re:testing and QA by DaedalusHKX · 2008-07-18 10:25 · Score: 1
  
  You would think on a NIC or some form of transmission interface, the status could be checked by simply pinging a known good host/point on the route the service normally takes... yep, and as soon as interface A doesn't provide adequate bandwidth, it is rerouted to B, or a better approach is to evenly share the load and to reroute failed/unconfirmed transmissions via a different interface than the last one used to try. After a certain amount of FAILED transmissions, a simple algorithm to compare how many fail on each NIC could simply update a VISIBLE reliability counter in a control panel available to any operator of the machine who could then say... HMMMM... packet A was resent 200 times through interface A and failed 70 times. Packet A was sent successfully each time through interface B. Packet B failed once or twice on interface B, failed over to interface A, failed once or twice, then went back to interface C and successfully sent through and confirmed.
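A rough sketch of the per-interface counter idea from the paragraph above, assuming a caller-supplied `transmit(name, packet)` function that returns True on confirmed delivery. The class name, the round-robin load sharing, and the report format are all illustrative choices, not part of any real driver API:

```python
class InterfaceStats:
    """Shares load across interfaces, retries on a different one, and counts failures."""

    def __init__(self, interfaces):
        self.interfaces = list(interfaces)
        self.attempts = {name: 0 for name in self.interfaces}
        self.failures = {name: 0 for name in self.interfaces}
        self._next = 0  # index of the interface the next packet starts on

    def send(self, packet, transmit):
        """Try each interface once, starting from a rotating position.

        Returns the name of the interface that confirmed delivery, or None.
        """
        start = self._next
        self._next = (self._next + 1) % len(self.interfaces)
        for offset in range(len(self.interfaces)):
            name = self.interfaces[(start + offset) % len(self.interfaces)]
            self.attempts[name] += 1
            if transmit(name, packet):
                return name
            self.failures[name] += 1
        return None

    def failure_rate(self, name):
        tried = self.attempts[name]
        return self.failures[name] / tried if tried else 0.0

    def report(self):
        """Operator-visible summary: interface names, worst failure rate first."""
        return sorted(self.interfaces, key=self.failure_rate, reverse=True)
```

An operator panel could poll `report()` and `failure_rate()` so that a card that fails 70 sends out of 200 stands out long before it takes anything else down.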
  There's a million samples of verbal flowcharting/pseudocode type stuff we did when I was in comp sci 101 and 102 back in college. Hell we sat around brainstorming server queue ideas for miniature servers to perform stupid little operations with no purpose. Call it "practice" since it was nothing but.
  Anyways... it isn't that hard to handle in software as a means to combat subtle hardware failure, especially in a system that SHOULD be redundant and unbreakable. After all, they spent how many trillions on "anti terrorism" ?? And we had ONE plausible airborne terrorist attack in how many decades? Meanwhile we have HOW many fatal airliner wrecks in same decades? Perhaps the money is ill spent by self serving bureaucrats, but I could, of course, be mistaken. After all, we know self serving bureaucrats only have our best interests in mind. 6 figure salaries never figure into it... I'm sure.
  
  --
  " What luck for rulers that men do not think" - Adolf Hitler
81. Re:testing and QA by kitgerrits · 2008-07-18 10:51 · Score: 1
  
  It may have been implemented, even implemented correctly.
  It may have signaled the problem and worked around it, the way it should.
  The bug may also have shown up because the software assumes there will never be a hardware failure, no matter how short.
  Cisco switches implement redundancy by default with STP.
  What they don't tell you is that it may take over a minute before the network has fully recovered.
  (That's why you want to set up rapid STP for backbone switches and maybe tune the setting for userland.)
  
  --
  "I was in love with a beautiful blonde once, dear. She drove me to drink. It's the one thing I am indebted to her for."
82. Re:testing and QA by kbahey · 2008-07-18 15:03 · Score: 1
  
  I don't know about uprisings against the British, but for sure, Herbert used many Arabic and Islamic themes in Dune. Some of the stuff is obscure historical terms, so he dug deeper than just current colloquial terms in use in the Middle East at the time.
  
  --
  2bits.com, Inc: Drupal, WordPress, and LAMP performance tuning.
83. Re:testing and QA by Hognoxious · 2008-07-18 22:44 · Score: 1
  
  It's like the old story that the more engines a plane has, the more likely it is to have an engine failure, so single engined planes are safest. It's not how many fail, what really matters is how many you have left working.
  
  --
  Confucius say, "Find worm in apple - bad. Find half a worm - worse."
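The arithmetic behind the engine story can be sketched as follows, assuming every engine fails independently with the same probability p (an idealization; real failures can share a common cause, and the function names are illustrative):

```python
def p_any_failure(p, n):
    """Probability that at least one of n independent units fails."""
    return 1 - (1 - p) ** n


def p_all_fail(p, n):
    """Probability that all n independent units fail, i.e. nothing is left working."""
    return p ** n
```

With p = 0.01, going from one engine to four makes "some engine fails" roughly four times as likely, while "every engine fails" drops from 1 in 100 to 1 in 100 million. The second number is the one that matters.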
84. Re:testing and QA by Ihmhi · 2008-07-25 13:28 · Score: 1
  
  Give me two minutes and a soldering gun and I can make one.
  
  --
  Random Thoughts From A Diseased Mind (Not For Dummies)
The way the summary reads by Centurix · 2008-07-17 19:57 · Score: 0, Redundant

Makes it sound like the NIC was fighting against The Man! Go NIC!

--
Task Mangler
There's only one way to solve this by anomnomnomymous · 2008-07-17 19:58 · Score: 5, Funny

Put all those NIC's on the terror watchlist!

--
When you shoot a mime, do you use a silencer?
1. Re:There's only one way to solve this by eclectro · 2008-07-17 20:17 · Score: 1
  
  Put all those NIC's on the terror watchlist!
  Why would anyone listen to you? Somebody who was just put on the terror watchlist by a bad NIC.
  
  --
  Take the cheese to sickbay, the doctor should see it as soon as possible - B'Elanna Torres, "Learning Curve"
2. Re:There's only one way to solve this by Anonymous Coward · 2008-07-17 22:40 · Score: 0
  
  You joke, but we really should put the manufacturer on our personal list. It's a NIC! We should be able to rely on it. This is like paper turning black if you leave it in sunlight. Fuck them. Anyone know who the vendor is?
3. Re:There's only one way to solve this by Anonymous Coward · 2008-07-18 01:01 · Score: 0
  
  *MICKs
redundant? by Anonymous Coward · 2008-07-17 20:05 · Score: 0

if they were really smart they would have two separate machines dedicated to this information for more redundancy.
More scary stories. by rixster_uk · 2008-07-17 20:18 · Score: 2, Interesting

People - I am trying to collect airport related scary stories. I haven't got many yet but if you have some then please let me know - you can email me at admin@scareports.com or just visit the site (blatant pimping) here .
1. Re:More scary stories. by DriedClexler · 2008-07-18 02:35 · Score: 1
  
  Well, gee, I tried to email you the videofeed of every airport security line in the US, but the server rejected it for being too big.
  
  --
  Information theory is life. The rest is just the KL divergence.
Intermittent problems are the worst by niks42 · 2008-07-17 20:24 · Score: 2, Interesting

I'd have to have some sympathy that it was an intermittent problem. They can really cause confusion to automated systems that are designed to cope with hard failures. I've had many occasions in my latter career in Service Delivery and support where it's taken human conviction to sort out issues caused by the cluster software trying to cope with intermittent connections
1. Re:Intermittent problems are the worst by mikael · 2008-07-18 03:44 · Score: 1
  
  In the early days of office LANs, when there was just one big Ethernet cable with repeaters for a whole building, there were basically four ways a network card could fail:
  o It just stopped working - no problem, just replace the card.
  o It just kept jabbering, i.e. kept sending out random data or the same packet over and over again. This would jam the entire network segment. The only way to fix this was to detach each segment onto an alternative backbone until the rest of the office could get back to work.
  o It just sent out the occasional runt - a small packet consisting of less than a header's worth of data packet. Most network cards would just ignore them. Others would keel over faster than a drunk parrot.
  o Two cards had the same MAC address. I'm not sure how this happened, but one card just seemed to lose one bit of MAC address. Nobody really complained.
  
  --
  Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
2. Re:Intermittent problems are the worst by sjames · 2008-07-18 14:42 · Score: 1
  
  That's why, in spite of all the marketing bullet points, often the best "failover" is manual. In case of failure, unplug this machine and plug that one in.
NICtzche by cornjchob · 2008-07-17 20:27 · Score: 3, Funny

if this piece of hardware was capable of "overc[oming] the built-in system redundancy", perhaps its ilk ought to be patrolling the transistorized wunderplatz of interconnected morsels governing our most hubris means of transportation? I, for one, would certainly feel safer.

--
We now have confirmed reports from an informed Orange County minister that Ethel is still an active communist.
Well its a step above the old AppleTalk by LM741N · 2008-07-17 20:35 · Score: 2, Interesting

When I was administering a small network in Marin, every time we had a small earthquake, all of the AppleTalk connectors would come loose. Took hours to find the faults and push them together. I guess we should have used duct tape.
I suppose at an airport as each jet came in creating vibrations, those same connectors would have dislodged.
ten minutes by Iamthecheese · 2008-07-17 20:35 · Score: 1

Ten minutes at a time? That doesn't sound like a "mostly broken" problem to me, that sounds like a 10 minute fail-over time. Shit happens, but if it takes you 10 minutes for your stuff to automatically start working again you're doing it wrong, especially since it's all in one data center. And whatever happened to redundant off-site systems? New law: As a conversation progresses, the chance of someone saying "terrorist" approaches 100%

--
If video games influenced behavior the Pac Man generation would be eating pills and running away from their problems.
1. Re:ten minutes by wintermute000 · 2008-07-17 20:49 · Score: 5, Informative
  
  there are plenty of examples of 10 minute failover
  Older cisco ATAs take 10 minutes to swing onto SRST if keepalives are lost to the callmanager cluster.
  a complex routing protocol refresh (big BGP networks) can take many minutes
  a faulty NIC can easily bring down a LAN segment, with or without redundant switching paths - and it makes it look like a router failure as the router overloads trying to deal with the broadcast storm
2. Re:ten minutes by Anonymous Coward · 2008-07-18 05:44 · Score: 0
  
  this doesn't make any sense.
  protections at the switch level have been available for years.
  a NIC can fail, leading it to spew thousands of garbage frames a second onto the network (broadcast, even).
  or a NIC can fail by repeatedly flapping up/down; tying up switch resources.
  any mid-to enterprise level switch most likely supports admin_disable of the port/interface based on configurable thresholds (e.g., 10 link flaps/minute), or CP/Rate-limits to auto-disable the interface if #broadcast packets/second threshold is exceeded.
  how these protections were not put in place is astounding.
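The link-flap threshold mentioned above (for example, 10 flaps per minute) can be modelled with a sliding window. This is a toy model of the idea only, not any vendor's switch configuration; the class name and defaults are assumptions:

```python
from collections import deque


class FlapGuard:
    """Disables a port once it sees too many link transitions inside a sliding window."""

    def __init__(self, max_flaps=10, window_seconds=60.0):
        self.max_flaps = max_flaps
        self.window_seconds = window_seconds
        self.events = deque()   # timestamps of recent link transitions
        self.disabled = False

    def link_changed(self, now):
        """Record a link up/down transition at time `now` (seconds).

        Returns True if the port is (now) disabled.
        """
        if self.disabled:
            return True
        self.events.append(now)
        # Drop transitions that have aged out of the window.
        while self.events and now - self.events[0] > self.window_seconds:
            self.events.popleft()
        if len(self.events) > self.max_flaps:
            self.disabled = True   # stays down until an operator re-enables it
        return self.disabled
```

Occasional transitions spread over hours never trip the guard, but a burst inside one window shuts the port before the rest of the segment suffers.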
3. Re:ten minutes by wintermute000 · 2008-07-18 11:44 · Score: 1
  
  Yeah but it might overload the local router before it exceeds the switch threshold.
  Though most of the time I have seen this is with old/crappier models (26xx, 17xx), I've never seen this with 28xx or 18xx series before the switch err-disables the port.
  Then again, like 99% of workplaces, it's probably 10 year old gear that just worked - until it borked. I'm on the network team of a Fortune 500 and I swear there are sites larger than 10 people still using 10Meg HUBS that they are too cheap to replace. Heck, our internet facing checkpoint FWs are 5 year old Nokia IP440s. No budget.
Re:Last time by HungryHobo · 2008-07-17 20:45 · Score: 1, Redundant

we don't have a prime minister and I'm fairly sure customs don't wear green.
Re:Last time by JohnHegarty · 2008-07-17 20:52 · Score: 0, Redundant

And I am not entirely sure what a Lucky Charm is ...

--
Cruise TT
It's a success story. by Farmer+Tim · 2008-07-17 20:54 · Score: 4, Funny

"...an intermittent malfunctioning network card which consequently overcame the built-in system redundancy"
But it's one of the lucky ones.
Every year, thousands of NICs fall victim to built-in system redundancy; if you know a card whose activity indicators are darkened and lifeless, it may have a redundancy problem. With your support and donations, we at Ethernetics Anonymous can help more network cards beat the scourge of built-in system redundancy, and make them feel like a useful part of society again.

--
Blank until /. makes another boneheaded UI decision.
1. Re:It's a success story. by karnal · 2008-07-18 00:50 · Score: 1
  
  Should have added:
  "With your donation of only two bits per day"
  
  --
  Karnal
2. Re:It's a success story. by Farmer+Tim · 2008-07-18 02:41 · Score: 1
  
  I don't want to sound greedy, but we're trying to make packets.
  
  --
  Blank until /. makes another boneheaded UI decision.
My idea of fault tolerance by Bromskloss · 2008-07-17 20:54 · Score: 1

in this case would be the ability to run air traffic control without all those fancy computrons, should the need arise.

--
Swedish plasma phys. PhD student; MSc EE; knows maths, programming, electronics; finance interest; seeks opportunities
1. Re:My idea of fault tolerance by a_real_bast... · 2008-07-17 21:09 · Score: 2, Insightful
  
  Unfortunately, this NIC's fault showed up as the radar not working. What were they supposed to fail-over to? Binoculars?
  
  --
  You're making me think. You won't like me when I'm thinking.
2. Re:My idea of fault tolerance by Bromskloss · 2008-07-17 21:28 · Score: 2, Interesting
  
  Unfortunately, this NIC's fault showed up as the radar not working. What were they supposed to fail-over to? Binoculars?
  I suppose so, if it's possible to do it that way. Also, have the planes do the old-fashioned "circle the airport and keep an eye out for other traffic" if that works with big, heavy planes. It sure gives you (the pilot) a nice sense of being a free and sovereign person anyway, like on small airfields. :-)
  
  --
  Swedish plasma phys. PhD student; MSc EE; knows maths, programming, electronics; finance interest; seeks opportunities
3. Re:My idea of fault tolerance by clickety6 · 2008-07-17 21:32 · Score: 1
  
  What were they supposed to fail-over to? Binoculars?
  And a giant relief model of the airport with young ladies pushing around little model aircraft with billiard cues. And a big glass panel with people marking up aircraft positions with wax crayons.
  
  --
  ----------------------------------- My Other Sig Is Hilarious -----------------------------------
4. Re:My idea of fault tolerance by a_real_bast... · 2008-07-17 21:36 · Score: 1
  
  And gives the Comptroller a fit about the extra fuel expenditure? (",)
  
  --
  You're making me think. You won't like me when I'm thinking.
5. Re:My idea of fault tolerance by zmollusc · 2008-07-17 23:29 · Score: 1
  
  That is the stupidest plan ever, it is snooker cue _rests_ with which the ladies push the little model aircraft around.
  
  --
  They whose government reduces their essential liberties for temporary security, receive neither liberty nor security.
6. Re:My idea of fault tolerance by NoPantsJim · 2008-07-17 23:43 · Score: 1
  
  (This is coming from a future air traffic controller)
  You're forgetting a few things.
  1. Not all air traffic control is done from airport towers. There are also TRACONs and ARTCCs, which is the type of facility you see in the movie Pushing Tin. Basically big dark buildings filled with radar screens and strung out people completely messed up on caffeine.
  2. "circle the airport and watch for traffic" doesn't work for airplanes at FL350 doing 500+ knots. Usually that's IFR traffic, so the planes would have no chance of seeing each other anyway. Also, research has shown that at this speed, even on a cloudless day, two planes heading for each other would be unable to react and prevent a collision, no matter what.
  3. If voice communication is going to be digitized as they plan to do with NextGen, a faulty NIC would silence all radios. The pilots would never even realize there had been a problem.
  I wrote a research paper on NextGen and its faults this past Spring. The whole concept needs a serious overhaul if it's ever going to be safe.
  
  --
  Name...That...Autocomplete!
7. Re:My idea of fault tolerance by NoPantsJim · 2008-07-17 23:50 · Score: 1
  
  Quick correction on an error. I said that anything over FL350 is 'usually' IFR traffic. What I meant to say was IFR conditions. Everything over 18,000ft is always IFR traffic.
  
  --
  Name...That...Autocomplete!
8. Re:My idea of fault tolerance by ddrichardson · 2008-07-18 02:11 · Score: 1
  
  The default scenario, wherever possible, would be to divert to other airports. This is perfectly viable in the UK given the relatively small distances between airports.
  
  --
  A thistle is a fat salad for an ass's mouth...
9. Re:My idea of fault tolerance by Anonymous Coward · 2008-07-18 03:01 · Score: 0
  
  I'm not familiar with this particular Air Traffic Control system vendor, but historically there have been various options used for fail-over when your radar goes wonky.
  if it's not an airport control tower ATC:
  1) "shrimp boats" - commonly used before the introduction of radar-based (aka "positive control") ATC, the older green-screen ATC consoles seen in older movies were specially designed to be able to swing down from a vertical orientation to a horizontal one so that the controllers could revert to using "shrimp boats" when radar wasn't working (http://jurassicbark.blogspot.com/2008/05/fix-on-fail.html/).
  2) redundant ATC systems complete with redundant feeds from the radar sites - the "DARC" with "RDP" referenced in the blog post at that URL is one such redundant system
  for airport control tower ATCs:
  1) you said it as a joke, but binoculars are (or at least used to be) an officially-planned fail-over method for use when the radars went out. Airport tower ATC is mostly visual-based control anyway (that's why they have/need all those panoramic windows) so it's not as silly as you thought.
  It really comes down to how many "nines" of availability Dublin ATC imposed as a requirement upon their ATC vendor. If their ATC operations were as disrupted by a single NIC failure as the article implies, they can't have asked for a highly available system but that doesn't mean they couldn't have had a fail-over ATC method available if they had wanted one.
In the queue by davew · 2008-07-17 20:59 · Score: 3, Funny

I was due to fly the evening it all went wrong. Here's a lesson: if you're standing in a three-hour queue for the Ryanair desk, and they tell people to rebook on the web, and you take out a laptop and 3G modem, be prepared for a stampede.
1. Re:In the queue by a_real_bast... · 2008-07-17 21:12 · Score: 1
  
  You made two fatal mistakes:
  1) You didn't do it where no-one could see you.
  2) You flew Ryanair.
  
  --
  You're making me think. You won't like me when I'm thinking.
2. Re:In the queue by Anonymous Coward · 2008-07-17 21:48 · Score: 0
  
  3. you didn't display your premium usage rates
3. Re:In the queue by bernywork · 2008-07-17 22:09 · Score: 1
  
  On top of that..
  You are flying Ryanair, everything is an extra, I am surprised they don't charge to use the bathroom onboard (Having said that, they probably will now).
  5 a person to rebook on your laptop, would have paid for a new laptop!
  
  --
  Curiosity was framed; ignorance killed the cat. -- Author unknown
4. Re:In the queue by Anonymous Coward · 2008-07-17 22:33 · Score: 0
  
  Serves you right for flying Ryanair ( or Not So Easyjet for that matter )
  Ryanair bumped me off a flight earlier this year. No warning except a refusal to let me board.
  Ok they offered me a later flight but my almost 5 month pregnant partner was already on the flight and we missed our connecting flight.
  They then called security when I kicked up a stink.
  They Suck royally
5. Re:In the queue by caluml · 2008-07-17 22:40 · Score: 1
  
  You could make a pretty penny with that. £5 a shot, or whatever. Plus your keystroke logger would have tonnes of valid credit card information. :)
  
  --
  Get your own free personal location tracker
6. Re:In the queue by heathen_01 · 2008-07-17 22:55 · Score: 1
  
  Did your laptop do you any good?
  The Ryanair website was unusable (for me) during that time.
7. Re:In the queue by davew · 2008-07-18 01:22 · Score: 1
  
  1) I wasn't leaving that queue. :-)
  2) The thing is - and believe me, I do shop around - none of the other options are very much better. :( The national flag carrier has remodelled itself to a low-cost airline and now matches ryanair feature for misfeature. BMI (Baby) are quite good where they fly, but to many other locations options seem to be very expensive, even accounting for ultimate cost including disasters like the above, or nonexistent.
8. Re:In the queue by davew · 2008-07-18 01:24 · Score: 1
  
  Unfortunately not. The flight I was on appeared not to be formally cancelled on the system, so I wasn't allowed to rebook at the time - and of course their system was jammers anyway. I tried to help out the people either side of me but was pretty unsuccessful.
9. Re:In the queue by davew · 2008-07-18 01:31 · Score: 1
  
  Actually, that's not true - it did do me good. I booked an Easyjet flight from Belfast for the next day. :-)
10. Re:In the queue by a_real_bast... · 2008-07-18 01:54 · Score: 1
  
  Aer Lingus has turned shit, yes - but you get marginally more legroom, still, and they aren't out to gouge you quite as hard. It's the difference between 5 turns of the thumbscrew and 4, I know, but it's still just that tiny bit nicer... plus it's not giving money to that tosser O'Leary, which is a bonus.
  
  --
  You're making me think. You won't like me when I'm thinking.
First time? by Anonymous Coward · 2008-07-17 21:00 · Score: 0

"Thales ATM stated that in 10 similar air traffic control Centres worldwide with over 500,000 flight hours (50 years), this is the first time an incident of this type has been reported."
Is the LAX incident not of the same type, then?
1. Re:First time? by Anonymous Coward · 2008-07-17 21:33 · Score: 0
  
  Thales don't have any comparable ATC systems in the US.
One card "overcame the redundancy"??? by gweihir · 2008-07-17 21:01 · Score: 3, Insightful

If they have good redundancy, they have two separate networks and two independent, preferably different network cards, in all systems. Then they would do fail-over. Seems to me that if one card can bring this down, then the people that designed the redundancy screwed up badly.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
1. Re:One card "overcame the redundancy"??? by jacquesm · 2008-07-17 21:38 · Score: 1
  
  second that... sorry, I missed your post before I wrote mine. Whoever built the system goofed, and to screw up with flight control systems at this level should be grounds for termination and never ever to get work in mission critical systems again. There really is no room for error in systems like this.
  I've worked a bit in the aerospace industry, specifically on software that would estimate the amount of fuel required for a flight taking into account alternative landing areas, winds and so on.
  The amount of checking I did on that code bordered on the paranoid but I really could not live with some plane going down somewhere because of a stupid error in design.
  Come to think of it, mission critical software should probably be open source, *always* so you can see what you're entrusting your life to and so that the 'many eyes' out there can point out the flaws. (assuming they're not eyes that will use that knowledge to bring down your system...).
  
  --
  MP3 Search Engine
2. Re:One card "overcame the redundancy"??? by gweihir · 2008-07-18 01:41 · Score: 1
  
  Indeed. The open source angle is also critical for fast and conclusive accident investigation.
  Come to think of it, I have never worked on a really critical system, but I am in IT security, which shares the thinking about ways to break a system. One difference is that our "malfunctions" are intelligent and malicious. On the other hand, they typically cannot kill large numbers of people. I think I prefer that. Having software out there that can kill, would probably give me bad dreams....
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
3. Re:One card "overcame the redundancy"??? by Anonymous Coward · 2008-07-18 02:42 · Score: 0
  
  no, ideally they have 4 cards, two bonded pairs on different networks.
4. Re:One card "overcame the redundancy"??? by Anonymous Coward · 2008-07-18 03:23 · Score: 0
  
  Come come my good Negro, doest thou not know of "network teaming"?
  Not to be corn-fused with "network teeming" of course.
  Such a simplistic solution might well appeal to the simple-minded PHBs sure to be spearheading systems services in this sector.
5. Re:One card "overcame the redundancy"??? by Phroggy · 2008-07-18 03:54 · Score: 1
  
  If they have good redundancy, they have two separate networks and two independent, preferably different network cards, in all systems. Then they would do fail-over. Seems to me that if one card can bring this down, then the people that designed the redundancy screwed up badly.
  It sounds to me like they DID have two separate networks. A faulty NIC was able to overcome that setup, by tricking the fail-over system into thinking that it didn't need to switch to the backup network.
  
  --
  $x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
  $x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
6. Re:One card "overcame the redundancy"??? by Anonymous Coward · 2008-07-18 16:00 · Score: 0
  
  It's all well and good to have redundancy. However many vendors will sell a redundant system but after the sale no monitoring is done to ensure that when one of the components goes off-line, it's immediately replaced.
  Without monitoring & maintenance a system is only redundant until the first component fails.
  I see this time & again with customers using RAID systems - they purchase off-the-shelf systems, don't do any training & possibly just don't have the tools for managing the device. The first they know there's a failure with any of the disks is when the 2nd or 3rd drive fails and takes the whole set out.
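The point about monitoring can be made concrete with a small status check. This is a sketch under assumptions: the function name, the member-health dictionary, and the default of one tolerated failure (roughly a single-parity array) are illustrative and not tied to any real RAID tool:

```python
def array_status(member_ok, tolerated_failures=1):
    """Classify a redundant set from per-member health flags.

    member_ok: dict mapping member name -> True if that member is healthy.
    tolerated_failures: how many members may fail before the set is lost
    (1 is an assumption, roughly a single-parity array).
    Returns (status, sorted list of failed member names).
    """
    failed = sorted(name for name, ok in member_ok.items() if not ok)
    if not failed:
        return "healthy", failed
    if len(failed) <= tolerated_failures:
        # Still serving, but the redundancy is already spent: this is the alert
        # that has to reach a human before the next failure does.
        return "degraded: replace now", failed
    return "failed", failed
```

Run on a schedule and wired to an alert, a check like this turns the first drive failure into a replacement ticket instead of a surprise when the second one goes.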
Was it running windows? by iwein · 2008-07-17 21:06 · Score: 3, Funny

http://www.networkworld.com/community/node/29644?page=2&ts=

--
Show a man some news, distract him for an hour. Show a man some mod points, distract him for the rest of his life.
Re:Last time by Anonymous Coward · 2008-07-17 21:14 · Score: 0

we don't have a prime minister and I'm fairly sure customs don't wear green.
Lordy, someone's lost the craic! Quick get this man a pint of the black stuff and a plate of annoying stereotype! /me dances a jig to up-beat folk music.
More seriously: had I said Taoiseach the joke would have flown over everyone's heads. Secondly, Taoiseach is just Irish for prime minister, so get off your high horse. Thirdly, I can exploit Irish stereotypes for a cheap joke, because I am Irish. Personally my favourite examples of Irish stereotypes are the Monkey Dust sketches: Diary of Anne Frank and The Crusades, always worth a plug.
Why!? by damburger · 2008-07-17 21:21 · Score: 4, Funny

I am flying to Florida tomorrow, it will only be my fifth plane flight in total and my first transatlantic flight. Despite being a rational scientist, who knows how safe it is statistically, I am having trouble suppressing my anxiety.
And at this point, fate sees fit to bombard me with horror stories about flying. This news about air traffic control comes on the heels of a headline I just saw on the front page of the Independent about pilots not reporting faults on aircraft and thus unsafe ones still flying about. I can't remember the exact wording because my brain parsed it as "TOMORROW YOU WILL DIE IN FLAMES"

--
If we can put a man on the moon, why can't we shoot people for Apollo-related non-sequiturs?
1. Re:Why!? by FrostedWheat · 2008-07-17 21:51 · Score: 3, Funny
  
  A long time ago I went on a school trip to London, and it was the first time I had ever been on a plane so I was a bit nervous. In the airport shop there was a magazine (can't remember which now) with a plane in flames on the front cover, with the large headline "Why Planes Crash". Whoever put them out must have had an evil streak too, they had spread them out to fill the entire top shelf.
2. Re:Why!? by damburger · 2008-07-17 21:57 · Score: 1
  
  Damn, that's cold
  Your signature, however, gives me something else to focus on. Fucking software patents! Idiotic corporate pandering EU! Grrrrrr! I'm not afraid of flying, I'm angry about IP abuse!
  
  --
  If we can put a man on the moon, why can't we shoot people for Apollo-related non-sequiturs?
3. Re:Why!? by Anonymous Coward · 2008-07-17 22:10 · Score: 0
  
  When I took a trip over to the Emerald Isle last, the night before we were due to fly over to Blighty I made everyone in the party watch a Discovery Channel programme on Helios Flight 522.
  Needless to say, anxiety has never been an issue for me.
  I got away with it by pointing out that there was no conceivable way in which watching a TV show could alter the probabilities of the plane dropping out of the sky the next day.
4. Re:Why!? by damburger · 2008-07-17 22:15 · Score: 2, Funny
  
  Depends, was the pilot at your house?
  
  --
  If we can put a man on the moon, why can't we shoot people for Apollo-related non-sequiturs?
5. Re:Why!? by Anonymous Coward · 2008-07-17 23:16 · Score: 0
  
  Then I guess this won't help very much :)
  
  See, this is why people are stupid, because all you see is this. Also, I've noticed that page contains too many "terrorists" words, really. Yes, there are bad things happening but those are the only ones that hit the news, the good stories where you were asked to go through the X-ray machine and you were just scared of X-rays so they let you pass don't make the news or they make the news as "guy walks through airport without being scanned! terrorists could do that."
  
  Just ignore this /. story and move along because good things have always happened and will happen to you in the future and feeling good is all you should expect from flying. While you're up there, instead of thinking of what could happen (and probably never will), think of what is really happening in that moment: you're flying!
  
  Thanks for flying with us, have a nice trip ;)
6. Re:Why!? by Anonymous Coward · 2008-07-17 23:49 · Score: 0
  
  That's one small step for lawyers, one giant leap for software patents.
7. Re:Why!? by dotancohen · 2008-07-18 00:04 · Score: 1
  
  I can't remember the exact wording because my brain parsed it as "TOMORROW YOU WILL DIE IN FLAMES"
  Just to reassure you, you are far more likely to die from the impact, or failing that smoke inhalation, or failing that drowning. The passenger compartment is rather well insulated from flame.
  
  --
  It is dangerous to be right when the government is wrong.
8. Re:Why!? by adamofgreyskull · 2008-07-18 00:34 · Score: 2, Funny
  
  --EXT: PLANE FLYING OVERHEAD
  --INT: PLANE COCKPIT
  PILOT #1: Oh wow, I really hope we don't have a crash.
  PILOT #2: Me too.
  PILOT #1: But they say it's safer than crossing the road!
  PILOT #2: Yes, but we have to do that too.
  PILOT #1: Best not to think about it.
9. Re:Why!? by TheGratefulNet · 2008-07-18 01:49 · Score: 1
  
  "TOMORROW YOU WILL DIE IN FLAMES"
  it's not the flames that kill you, it's the long long LONG fall.
  but don't think of it as an end; think of it as a really effective way to cut down on your living expenses.
  
  --
  
  --
  "It is now safe to switch off your computer."
10. Re:Why!? by ddrichardson · 2008-07-18 02:15 · Score: 1
  
  they had spread them out to fill the entire top shelf.
  Top shelves aren't what they used to be.
  
  --
  A thistle is a fat salad for an ass's mouth...
11. Re:Why!? by gad_zuki! · 2008-07-18 02:17 · Score: 1
  
  >I am having trouble suppressing my anxiety.
  What I do is I look at the flight crew and pilot and think "These people have been on hundreds if not thousands of flights each and they are still alive and uninjured. I'm just a beginner compared to them." That kinda kills the drama right there.
12. Re:Why!? by Anonymous Coward · 2008-07-18 02:22 · Score: 0
  
  There's one rather crucial point you're missing out on here... Air Traffic Control was "down" for 10 minute increments... guess what did NOT happen? Nobody was injured, no planes crashed.
  What I take away from this is that even small system glitches in Air Traffic Control can be smoothly overcome.
  As a rational scientist, shouldn't this make you even more comfortable?
13. Re:Why!? by damburger · 2008-07-18 02:52 · Score: 1
  
  Such things make me, the rational scientist, much more comfortable. However, they have little effect on the parts of my brain I inherited from flighty little rodents.
  
  --
  If we can put a man on the moon, why can't we shoot people for Apollo-related non-sequiturs?
14. Re:Why!? by LWATCDR · 2008-07-18 04:27 · Score: 1
  
  Don't worry. I have flown a lot and nothing terrible has ever happened.
  I don't know where you are flying into but let's say it is Miami. If the ATC for Miami airport goes down they can have you divert to Fort Lauderdale, West Palm, Tampa or Orlando. If the ATC center goes down, Patrick AFB, MacDill AFB, Cape Canaveral AFS, or Key West NAS could take over using military systems.
  If the weather is clear which it often is in Florida then it really isn't a big issue at all. If the weather is terrible there are alternate airports and Military bases you could divert to.
  Rotten time to be coming to Florida. It is hot and humid here this time of year. Winter is when we have the nice weather.
  
  --
  See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
15. Re:Why!? by LWATCDR · 2008-07-18 04:30 · Score: 1
  
  Funny but I have no fear of flying at all. I wonder if that has to do with the fact that the first flight I remember was with my dad in a Piper PA-28-180. A little four-seat puddle jumper.
  I hope you have a nice stay here. It is hot and humid. Kennedy Space Center is a must see.
  
  --
  See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
16. Re:Why!? by gacl · 2008-07-18 04:49 · Score: 1
  
  http://www.newsday.com/services/newspaper/printedition/friday/news/ny-bzair185767879jul18,0,1384201.story
  Hope you feel better!
17. Re:Why!? by damburger · 2008-07-18 07:04 · Score: 1
  
  Seeing KSC first day after we land :)
  I'm hoping to catch a glimpse of the giant finger they use to check the space shuttle's centre of mass
  
  --
  If we can put a man on the moon, why can't we shoot people for Apollo-related non-sequiturs?
18. Re:Why!? by Anonymous Coward · 2008-07-18 08:16 · Score: 0
  
  One of my last flights, which went well, came with an in-flight video. The flight was about seven hours and the video looped two or three times during the trip. The whole thing was the physics and analysis of a recent plane crash. Not exactly a good thing to have playing while you're trying to nap.
19. Re:Why!? by LWATCDR · 2008-07-18 08:50 · Score: 1
  
  Well, if you have your wife with you and you're youngish I would suggest a trip to Ron Jon Surf Shop. It is the world's largest surf shop and is right on Cocoa Beach; my wife loves it.
  Also they have nice beaches around there as well.
  I grew up south of there in Vero Beach FL. My family used to run up to the Merritt Island mall twice a year. Back to school shopping and Christmas shopping. It also had the first arcade I ever saw. I played Lunar Lander there:) Yes I am old.
  My guess is you are going to land in Orlando. I just had some people from the UK here for training. They were shocked by the distances here. Orlando is about a two hour drive from KSC depending where you are in Orlando. Try to get to KSC early since we often have afternoon thunder showers this time of year. In fact it is pouring right now.
  
  --
  See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
20. Re:Why!? by damburger · 2008-07-18 09:28 · Score: 1
  
  Sadly we aren't likely to be able to get to the beaches (although apparently my fiancée's parents have arranged some form of dolphin-related activity for us).
  
  --
  If we can put a man on the moon, why can't we shoot people for Apollo-related non-sequiturs?
21. Re:Why!? by LWATCDR · 2008-07-18 09:40 · Score: 1
  
  My sister took her little girl to that if it is the one in Orlando.
  They liked it.
  Too bad, really. Orlando is fun but it really doesn't give you much of the feeling of what the US or Florida is really like. A lot of it is just a giant tourist trap.
  But just like Vegas, if you have money you will never be bored.
  Oh, when you go to KSC make sure you go to the Saturn V exhibit.
  
  --
  See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
22. Re:Why!? by Anonymous Coward · 2008-07-26 12:47 · Score: 0
  
  so, did you die?
Re:Last time by Anonymous Coward · 2008-07-17 21:32 · Score: 0

Customs do wear green, but it is hard to prove since the camera does not manage to capture them.
Besides that, it is not an area where photographing is allowed. If you do so anyway the X-ray machine will wipe your camera (It is not supposed to do that, but it will do anyway!)
that redundancy by Anonymous Coward · 2008-07-17 21:33 · Score: 0

was not 'built in' properly, the system should have been isolated after the first fault and not put back into service until the fault was diagnosed and fixed.
pretty sloppy if you ask me...
Truth? by Fri13 · 2008-07-17 21:36 · Score: 1

"[They] confirmed the root cause of the hardware system malfunction as an intermittent malfunctioning network card which consequently overcame the built-in system redundancy,' said an IAA spokeswoman."
And if we edit it a little bit, can we have the truth?
They confirmed the root caused the hardware system malfunction using an intermittent malfunctioning network card which consequently overcame the built-in system redundancy.
1. Re:Truth? by ddrichardson · 2008-07-18 02:18 · Score: 1
  
  The "overcoming" redundancy thing is unusual and I wonder if it's linked to the specified time between failures. If the self test in the system operated at a periodicity as large as ten minutes it could miss such intermittent faults. This is why bus systems with self testing bus controllers are normally used in these circumstances. Either way, it shouldn't happen.
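  To illustrate the sampling point, here is a minimal sketch, assuming a hypothetical self test that checks link health once every `period` seconds and a fault that only shows up in short bursts. The function name, burst times and intervals are all made up for illustration and are not taken from the Dublin system.

```python
def fault_detected(fault_windows, period, horizon):
    """Return True if any periodic self test lands inside a fault window.

    fault_windows: list of (start, end) times in seconds when the NIC misbehaves.
    period: seconds between self tests (first test at t = 0).
    horizon: total observation time in seconds.
    """
    t = 0
    while t <= horizon:
        if any(start <= t < end for start, end in fault_windows):
            return True
        t += period
    return False


# Hypothetical intermittent fault: three 20-second bursts within one hour.
bursts = [(130, 150), (1990, 2010), (3130, 3150)]

slow = fault_detected(bursts, period=600, horizon=3600)  # test every 10 minutes
fast = fault_detected(bursts, period=5, horizon=3600)    # test every 5 seconds
print(slow, fast)  # False True
```

  With these made-up numbers the ten-minute test never coincides with a burst, while the five-second test catches the first one.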
  
  --
  A thistle is a fat salad for an ass's mouth...
Confusing terminology by ddrichardson · 2008-07-17 21:43 · Score: 5, Informative

I work in aviation and wonder if the terminology being used by the newspaper articles is correct.
It appears to be talking about mode S IFF (Identification Friend or Foe) or SIFF radar systems which identify aircraft and append height data. The speed is the only thing that needs calculating, as it isn't encoded in the pulse train.
Why this is weird is that bus technologies much older than current network technology, such as MIL-STD-1553, are normally used to transfer this data.
This makes me wonder if it was one of two things - a system inputting to an Ethernet PC system that calculates and displays the information or more likely they are talking about a DLTU type stub connector (or remote terminal) used in such typical buses. This is unlikely because, on the bus systems they are employed on, the bus controller would have picked up on the failure during continuous built in test and pulled in an alternative.
If it's the former then someone needs shooting. ATC is a realtime application and the overhead involved here would be unacceptable. I'm not even sure of the benefit of a network, multiple self contained individual terminals would be safer.

--
A thistle is a fat salad for an ass's mouth...
1. Re:Confusing terminology by ledow · 2008-07-17 23:05 · Score: 1
  
  A quick google turns up:
  http://en.wikipedia.org/wiki/Avionics_Full-Duplex_Switched_Ethernet
  Which suggests that Ethernet-derived products are, indeed, used in critical systems (although this seems to be on-aircraft rather than in ATC). It (apparently) has seen wide deployment on common "famous" aircraft.
  And the UK has been "upgrading" its air traffic control for years and years - so much so that they now appear to be nothing more than an office with some multi-head display if the footage shown on news-reports of a year or so ago is to be believed. It's conceivable that this is truer than you would think.
  However, I bow down to your knowledge as I know nothing about aviation at all.
2. Re:Confusing terminology by ddrichardson · 2008-07-17 23:27 · Score: 2, Interesting
  
  While you're right, the key phrase from the article you give is:
  
  ARINC 664 Specification which defines how Commercial Off-the-Shelf networking components will be used for future generation Aircraft Data Networks (ADN).
  Specifically, this standard is aimed at use on aircraft not in ATC, in fact because of the weight reduction it offers.
  Also not to split hairs but Dublin is not in the UK, this seems trite but is valid as there are different agencies involved. Moreover, the appropriation of new technologies is obsessive in the UK at present and has been for some time (except in the financial sector). There is a perception that newer is better and that answers to questions nobody asked are best solved by combining off the shelf components in a similar topology to older generation systems.
  There is an argument to upgrade ATC due to higher volumes of aircraft but I can't help but wonder if there is a bigger drive towards efficiency rather than safety.
  
  --
  A thistle is a fat salad for an ass's mouth...
3. Re:Confusing terminology by Anonymous Coward · 2008-07-18 06:45 · Score: 0
  
  I work in aviation and wonder if the terminology being used by the newspaper articles is correct.
  If the news media are talking about technology, you can pretty much guarantee that there's a mistake in the details, if not the "big picture".
  
  /former broadcast engineer, who regularly shakes his head at what goes out over the air.
annoying.. by Anonymous Coward · 2008-07-17 21:43 · Score: 0

This is the first time in 50 years that this has happened. And the first time they had accurate information on screens was 10-15 years ago..
Re:Last time by Anonymous Coward · 2008-07-17 22:04 · Score: 0

Outside ireland, lucky charms is an american-style (sickly sweet and artificial, featuring sort of freeze-dried marshmallow lumps) breakfast cereal, with a hollywood-irish-accented mascot "lucky the leprechaun" (akin to "tony the tiger" from "frosties"), with a surrounding advertising campaign that could be considered vaguely offensive (on grounds of nauseating cutesyness if nothing else), at least if irish people were excessively thin-skinned (fortunately, they're generally not, and since they're also white-skinned people in america probably wouldn't care if they were upset anyway). It's not hugely offensive, it's not "Bloody Sunday Breakfast Snacks" or something, but all the same, it's not sold in the Republic of Ireland, and was withdrawn from the UK market fairly rapidly upon introduction.
As part of that campaign, the leprechaun obsessively worries about people trying to take his lucky charms, or at least used to, now he just seems to be resigned to thieving kids running off with it.
Irish Examiner, ha! by PinkyDead · 2008-07-17 22:34 · Score: 3, Funny

Everyone in Ireland knows that the Irish Examiner used to be the Cork examiner - and they never miss an opportunity to point out how Dublin is doing a bad job.
This is because Cork thinks that it's the centre of the friggin' universe. The 'Real Capital', my arse! Just a bunch of thunderin' ejits, living in their little Blarney fantasy land. Sure they can't even talk right. What the hell is a 'langer', anyway. They wouldn't even know how to spell NIC.
The fact that they are right is quite beside the point.
(For a North American cultural equivalent, please see http://en.wikipedia.org/wiki/South_Park:_Bigger%2C_Longer_%26_Uncut)
Anyone who mods me down is from Cork - believe it!

--
Genesis 1:32 And God typed :wq!
1. Re:Irish Examiner, ha! by Anonymous+Cowpat · 2008-07-18 00:13 · Score: 1
  
  but Cork is the only bit of Ireland that will still float if the country falls into the sea!
  
  --
  FGD 135
2. Re:Irish Examiner, ha! by cwgatling · 2008-07-18 01:06 · Score: 1
  
  Indeed. A well-balanced Cork man has a chip on both shoulders.
No system, no failure by duyn · 2008-07-17 23:16 · Score: 1

G-GP:

if we add n redunndant[sic] fail-overs, the total system will fail with probability 1-p^n
GP:

Any number raised to the power 0 is 1. So if you don't install anything, hence n is 0, it will always work since the probability of failure is 1-1 = 0.
P:

Sometimes, pure intuition can be more handy than maths.
Only if you're not good at the math.
The way the G-GP described the system, the number of redundant fail-overs includes the primary system. With n=0, you have no system in place. No system, no possibility of system failure.
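For reference, a small sketch of the arithmetic being argued over, assuming each of the n units (primary plus fail-overs) fails independently with the same probability p on a given day. Under that assumption the chance that every unit is down at once is p^n, so the quoted 1-p^n is the chance that at least one unit is still up. The function names are illustrative.

```python
def p_all_fail(p, n):
    """Probability that all n independent units fail, each with probability p."""
    return p ** n


def p_at_least_one_works(p, n):
    """Complement: at least one of the n units is still up."""
    return 1 - p_all_fail(p, n)


# Tabulate both quantities for a hypothetical per-unit failure probability of 1%.
for n in range(0, 4):
    print(n, p_all_fail(0.01, n), p_at_least_one_works(0.01, n))
```

With n = 0 the empty product gives p^0 = 1, so whether an empty system "never fails" or "always fails" depends entirely on which of the two expressions you read as the failure probability.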
1. Re:No system, no failure by somersault · 2008-07-17 23:53 · Score: 1
  
  No system, no possibility of system failure.
  That was my point.
  
  --
  which is totally what she said
2. Re:No system, no failure by Hognoxious · 2008-07-18 00:23 · Score: 1
  
  And mine!
  
  --
  Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Zing Zang Zoom by worldcitizen · 2008-07-17 23:41 · Score: 1

It's a cover-up for Zing Zang Zoom rolling out a rootkit protection
Speaking as a sysadmin of just such a network... by Interfacer · 2008-07-17 23:50 · Score: 1

...it is not so black and white.
I administer a network like that. Pharmaceutical plant to be precise.
All machines on the production network have 2 independent PCI NICs, connecting to 2 identical but separate networks, using separate routers and switches. The critical servers are Stratus high availability servers which have dual redundant everything, driving all components in lock-step and correcting errors on the fly.
If something happens to cause a network switch over, there is a bulk of network traffic to deal with it, because sockets have to be opened and closed, state has to be transferred, system control message flow has to be restarted so that all controllers go back to the normal state, ... And at application level, everything is RPC and DCOM based, so this will cause a significant disruption for the running services, since COM objects and RPC marshalling have to be destroyed and recreated, reinitialized, ...
The whole thing is very complex from a systemwide point of view, and causes a significant disruption.
Now, switchovers can be triggered by different things, like a maintenance technician replacing a controller and causing a timeout halfway through a message flow, network cable that has to be unplugged for some reason, ...
When that happens, performance of the system sucks for the next couple of minutes.
Critical signals will work within defined tolerances, but anything else will simply stop responding during the switchover.
I can perfectly imagine that IF you have a nic which is failing in just such a way that makes the network switch back and forth, it'll disrupt and eventually kill the entire network.
Unless of course the switches are smart enough to detect this and disable the physical port. But even then, we are not talking about DDoSing, but just minor errors at exactly the right time to trigger network failover.
Network redundancy of complex systems (and air traffic is much more complex than a pharma plant) can protect very well against single or predictable failures.
But there will always be very specific failures, depending on dozens of variables, which have the power to bring down the system. Since this is the first time such an event occurred with that air traffic control system, it is likely to be one of those corner cases.
It's the same with other procedures in aviation. Everything is supposed to be double and triple redundant, but still National Geographic has enough material to create the series 'Seconds from disaster' where one unlikely error amplifies another, and in the end, the plane hits the building / swamp / ground.
Re:Last time by Bloke+down+the+pub · 2008-07-18 00:38 · Score: 1

More seriously: had I said Taoiseach the joke would have flown over everyone's heads.
I'll have you know, my good man, that not all of slashdot's readers are American.

--
It's true I tell you, feller at work's next door neighbour read it in the paper.
But what is a "contol"? by rbanffy · 2008-07-18 00:39 · Score: 2, Funny

What is a "contol" and why is this so important?

--
http://www.dieblinkenlights.com
Re:Speaking as a sysadmin of just such a network.. by bepe86 · 2008-07-18 00:42 · Score: 1

Ever heard of a nice thing called Spanning Tree Protocol and the Campus Model for network design? With proper network design, this would not be an issue.
Blame it on the sales guy by Midnight+Thunder · 2008-07-18 01:03 · Score: 1

I think this is a case of Sales guy vs web dude.

--
Jumpstart the tartan drive.
Bloody Sunday Breakfast Snacks by Anonymous Coward · 2008-07-18 01:04 · Score: 0

I prefer Bobby Sands Diet Bars myself.
Success vs failure by sjbe · 2008-07-18 01:13 · Score: 1

No system, no possibility of system failure.
No possibility of success either. Brilliant! Let's all just go back to bed and forget about trying anything that might possibly fail...
Re:Speaking as a sysadmin of just such a network.. by gweihir · 2008-07-18 01:37 · Score: 1

If something happens to cause a network switch over, there is a bulk of network traffic to deal with it, because sockets have to be opened and closed, state has to be transferred, system control message flow has to be restarted so that all controllers go back to the normal state, ... And at application level, everything is RPC and DCOM based, so this will cause a significant disruption for the running services, since COM objects and RPC marshalling have to be destroyed and recreated, reinitialized, ...
That strikes me as the wrong approach. Use the two physical NICs as interfaces of a router and route everything from/to a logical NIC. Then there is no failover disruption. I think that today there is no need to do this so that the applications actually notice or (worse) need to be able to deal with this type of failure.
I can perfectly imagine that IF you have a nic which is failing in just such a way that makes the network switch back and forth, it'll disrupt and eventually kill the entire network.
The well established way to deal with that is lockout periods, where only a failure of the second network and a re-availability of the first will cause a switch-back before a certain time has passed. Note that this is needed regardless of switchover delay, since the system can ''oscillate'' faster for faster switchovers.
I see three possible problems: Either this system was not state-of-the-art and has serious issues simply because of its old and inflexible design, or the designers screwed up, or it was indeed some very complex failure that had several unexpected side effects. Avionics people are very good at learning from past disasters. Their history of prevention and prediction is not nearly as good.
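The lockout idea can be sketched roughly as follows. This is only an illustration under stated assumptions: network "A" is the preferred one, health is reported as two booleans per tick, and the class name, method names and timings are all hypothetical rather than anything from the Thales system.

```python
class FailoverController:
    """Two-network failover with a lockout (hold-down) period.

    Network "A" is preferred. After failing over to "B", a return to "A" is
    only allowed once `lockout` seconds have passed, unless "B" itself fails
    while "A" is available.
    """

    def __init__(self, lockout):
        self.lockout = lockout
        self.active = "A"
        self.last_switch = None  # time of the most recent switch, if any
        self.switches = 0

    def _switch(self, target, now):
        self.active = target
        self.last_switch = now
        self.switches += 1

    def update(self, now, a_ok, b_ok):
        """Feed in the current health of both networks; return the active one."""
        if self.active == "A":
            if not a_ok and b_ok:
                self._switch("B", now)
        else:  # currently on the backup network
            if not b_ok and a_ok:
                self._switch("A", now)  # forced: backup is down
            elif a_ok and now - self.last_switch >= self.lockout:
                self._switch("A", now)  # voluntary switch-back after lockout
        return self.active


def count_switches(lockout, samples):
    """Run a controller over (time, a_ok, b_ok) samples and count switchovers."""
    ctrl = FailoverController(lockout)
    for now, a_ok, b_ok in samples:
        ctrl.update(now, a_ok, b_ok)
    return ctrl.switches


# Hypothetical flapping NIC on network A: health alternates every second
# for a minute, while network B stays healthy throughout.
flapping = [(t, t % 2 == 0, True) for t in range(60)]

print(count_switches(0, flapping))    # 59: switches on every tick after the first
print(count_switches(300, flapping))  # 1: fails over once, then stays put
```

Without a lockout the controller oscillates on every sample, and each of those switchovers would carry the disruption described in the parent post. With a five-minute lockout it fails over once and then ignores the flapping card.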

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Oops! by cashman73 · 2008-07-18 01:55 · Score: 1

Sounds like they spilled Guinness on the servers again! Either that, or it's those damned pesky Leprechauns! :-)
Re:testing and QA: and dynamic behavior by pruneau · 2008-07-18 02:24 · Score: 1

It's hard enough designing a redundant system, but when you have an intermittent failure, the complexity steps up a bit, because redundant system design generally involves people thinking along the lines: OK, if this fails, how best can the system react? But when people think "if this fails", they practically always assume "if this fails, _and_stays_that_way_".
That's why intermittent failures are so bad, because they introduce a dynamic element into an already very complicated equation, and usual testing strategies don't go as far as this.
For example, just try to see how any routing algorithm reacts to a router that's up and down, up and down, up and down...

--
[Pruneau /\o^O/\ warranty void if this .sig is removed]
My first flight and a book from the wife by Dareth · 2008-07-18 02:40 · Score: 1

I was on my first flight on an airplane and was excited and a bit nervous. I decided to read the book my wife gave me at the last minute for some reading material. The first line was, "I was going down fast..." regarding a crashing plane. My wife claims she had no idea of course. After the fact, I can appreciate the irony at least.

--

I only look human.
My mother is a halfling and my dad is an ogre, so that makes me an Ogreling
How we do it by Anonymous Coward · 2008-07-18 02:58 · Score: 1, Informative

I work in a company that makes stuff for ATC. Our systems have 2 networks (2 NICs in each box, 2 sets of cables, switches etc)
Every packet is sent on both networks and alarms are set off if a packet turns up somewhere without being received on the other network.
Iffy connectors etc. are discovered pretty quickly, but it is still possible for the system to fail if, for example, a switch on the 'red' network failed and a box had a dodgy NIC on the 'green' network -- this kind of thing happens when somebody chooses to ignore an alarm because "it's working fine, we'll get round to replacing that card next week"
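A toy sketch of the "send everything twice and compare" check, for anyone curious what that looks like. It assumes each packet carries a sequence number and that the two networks are simply labelled 'red' and 'green'. The class and method names are invented for illustration, and a real system would also need timeouts rather than checking after the fact.

```python
from collections import defaultdict


class DualPathMonitor:
    """Track which of two networks delivered each packet sequence number."""

    NETWORKS = ("red", "green")

    def __init__(self):
        self.seen = defaultdict(set)  # sequence number -> networks it arrived on

    def receive(self, network, seq):
        if network not in self.NETWORKS:
            raise ValueError(f"unknown network: {network}")
        self.seen[seq].add(network)

    def alarms(self):
        """Return (seq, missing_network) for packets that arrived on one path only."""
        out = []
        for seq in sorted(self.seen):
            for net in self.NETWORKS:
                if net not in self.seen[seq]:
                    out.append((seq, net))
        return out


monitor = DualPathMonitor()
for seq in range(5):
    monitor.receive("red", seq)
    if seq != 3:  # hypothetical: the green NIC drops packet 3
        monitor.receive("green", seq)

print(monitor.alarms())  # [(3, 'green')]
```

The alarm only helps if somebody acts on it, which is the point of the last paragraph above.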
Re:Last time by geminidomino · 2008-07-18 05:26 · Score: 2, Funny

Lucky Charms never pissed me off so much as Trix did. I remember one commercial where the rabbit actually *bought his own cereal* and the kids took it because "trix are for kids". I wanted to see him mow the little fuckers down for home invasion or something...
Silly rabbit, supersonic lead is for thieving, speciesist little pricks.
Network Access Control to the rescue? by TheSync · 2008-07-18 07:22 · Score: 1

I swear I've heard about something like this happening before.
Can't a Network Access Control (NAC) enabled switch with some kind of "reasonable NIC operation metrics" shut down the port the bad NIC card is on? (While notifying the network admin, of course)
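Something like the following is what I have in mind. It is a sketch of the general idea only, not of any vendor's NAC feature: the class name, the sliding-window size and the error-ratio threshold are all assumptions picked for illustration.

```python
from collections import deque


class PortGuard:
    """Disable a port when too many of its recent frames were bad."""

    def __init__(self, window=100, max_error_ratio=0.2, notify=print):
        self.recent = deque(maxlen=window)  # True marks a bad frame
        self.max_error_ratio = max_error_ratio
        self.notify = notify  # callback used to alert the network admin
        self.enabled = True

    def observe(self, frame_ok):
        """Record one frame; shut the port once the window is full and too noisy."""
        if not self.enabled:
            return
        self.recent.append(not frame_ok)
        full = len(self.recent) == self.recent.maxlen
        ratio = sum(self.recent) / len(self.recent)
        if full and ratio > self.max_error_ratio:
            self.enabled = False
            self.notify(f"port disabled: error ratio {ratio:.2f}")


messages = []
guard = PortGuard(window=10, max_error_ratio=0.2, notify=messages.append)

# Hypothetical faulty NIC: every third frame is corrupt.
for i in range(30):
    guard.observe(frame_ok=(i % 3 != 0))

print(guard.enabled, messages)
```

Picking the threshold is the hard part: too tight and a noisy but usable link gets shut off, too loose and an intermittent card slips through.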
Looks like they caught the error... by Anonymous Coward · 2008-07-18 09:29 · Score: 0

...just in the NIC of time.
you don't understand what "open source" means by Anonymous Coward · 2008-07-18 13:59 · Score: 0

"Open source" is not equivalent to "source code is available to the end-user for inspection" even though that is one of the consequences of making software open source.
"Open source" means that the author places little-to-no restrictions on what the end-user may do with the software.
There are many, MANY other ways to supply a software's source code to its end-users including under contracts which explicitly place heavy restrictions on what an end-user may do with the supplied source code. I've received (from Sun) the source code to Solaris 2.6, but that was certainly never open source!
If you'd developed actual flight control systems (or even air traffic control systems such as the one which failed here) you'd know that such systems are always subjected to review by 'many eyes' -- they're just drawn from a restricted set. Think about all the havoc experienced when some hacker is the first to find a 0-day exploit in IIS or Apache and then imagine how much worse it would be if hacker-Ivan was the first to find an exploitable bug in a nuclear reactor control system or air traffic radar processing.
No. Odds are exceptionally high that whoever built this system built it to exactly the level of system availability the Irish ATC asked them to provide and the folks at-fault are still Ireland's ATC for failing to specify a more reliable system.
Whose NIC? by herbierobinson · 2008-07-19 04:38 · Score: 1

I know of one NIC (that was never recalled) that mysteriously stops taking interrupts every so often. I am prohibited by NDA from mentioning that manufacturer's name. Anybody want to guess who it is?

--
An engineer who ran for Congress. http://herbrobinson.us