Blow-by-Blow Account of the OSDN Outage
Our network operations staff was shorthanded; one of our most knowledgeable people had quit recently to go into business with a friend and had not yet been replaced. Another was in the hospital, ill and unreachable. A third's cell phone was on the kitchen counter, unhearable from the bedroom, and the fourth one's cell phone battery had fallen out. It was a frustrating comedy of errors, and an unusual one. Our netops staff is typically "on the bounce" 24/7.
Dave Olszewski, an OSDN programmer who is not technically part of our netops staff and is not trained in our equipment setup, happened to be on IRC at the time. He doesn't live far from the Exodus facility in Waltham, MA, where our server cage lives, so he went there immediately. Kurt Gray, lead programmer, whom we dragged out of bed, was not far behind. Hemos and others were awake by then, growing frantic as we found that not only Slashdot, but also NewsForge, freshmeat, OSDN.com, ThinkGeek, and QuestionExchange were down, along with our old -- but still popular -- MediaBuilder and AnimationFactory sites. Arrgh!
This is Kurt's "on the scene" report from Exodus:
Walk into our cage at Exodus and it seems harmless enough, but try to learn what everything is doing and where the wires are all going in less than an hour and you could go insane. You're standing in a nice, clean, uncomfortably air-conditioned facility with 150 of VA's FullOn and various other servers humming away. Greeting you at the door is "Big Gay Al," our Cisco 6509, which contains two redundant router modules: Kyle and Stan. If Stan dies, Kyle takes over, and vice versa. Across the cage are two Arrowpoint CS800 load-balancing switches: one is racked and idle (as a hot spare) and the other is live and balancing the load for most of our OSDN web sites. Between the Cisco 6509 and the Arrowpoint is a bridging FreeBSD firewall using ipfw rules to block stuff like ping, basically just to drive everyone nuts.
"I can't ping your site!"
"Yeah, we know."
Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documentation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here, and that person wasn't available this weekend) or had the joyful opportunity of spending a night trying to trace through it all under the pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you.
At this point, if you know anything about networking, you'll demand an explanation for why we're using each piece of equipment in the cage and not a WhizBang 9000 SuperRouter like the one you've been using flawlessly that even washes your dishes for you and makes food taste better too... I can only tell you that I'm not the networking design person here. I didn't choose this equipment or configure it, but I'm told it's very good hardware as long as you know what you're doing. As CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."
So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.
"Did you reboot the firewall?" I asked Dave.
"I rebooted everything," he said. "I think it's the Cisco."
So we console into the Cisco 6509. What a mess. Neither of us understands how this switch was configured or what it is trying to do. We don't fully understand why you can get a console connection on Stan but not Kyle (turns out the standby module doesn't have an active console; that's normal).
Headshaking all around. Meanwhile, about 11:40 a.m., Yazz Atlas woke up and got his cell phone reunited with its battery. He picked up his voice mail messages, tossed on clothes, and hustled over to Exodus.
Yazz says, "When I arrived at Exodus, Kurt and Dave were trying every combination of things to get the 6509 back. But neither they nor I even knew the Cisco passwords." The op who was supposed to be on duty (the one whose phone was out of hearing) was still nowhere to be found. They called their hospitalized coworker and got the Cisco passwords.
But, says Yazz, "Since the Cisco was rebooted there were no logs to look at. We could ping something on the inside but not everything. On some VLANs we could ping the gateway and others not. The outside world could ping one of the IPs the 6509 handles but not the other. From the inside we could not ping the IP that the outside world could ping. We could ping the one that they couldn't...very frustrating..."
Kurt again:
Several hours of this sort of network debugging went on until 3:00 AM Sunday. By then we had called Cisco for help. They couldn't help us until they saw the switch config and got a chance to review it. We were spent. We had to go to bed and stay down for the night.
Next morning we're back at Exodus and the situation hasn't changed -- our network is unreachable to the outside world. I was hoping that during the wee hours of the morning the Cisco 6509 had become sentient and fixed its own configuration or perhaps a friendly hacker had cracked into it and fixed it for us, or perhaps ball lightning would travel down a drain spout and shock our cage back to life like those heart paddles paramedics use... "It's a miracle!" No such luck.
So I called Cisco tech support. I wish I had done this sooner. I was amazed first of all by how you can talk to a qualified Cisco tech immediately... we're talking an 800 number that you dial and within less than a minute you are talking to a technician... doesn't Cisco realize how shocking this is to technical people, to actually be able to talk to qualified technicians immediately who say things other than, "Well, it works on my computer here..."? Do they not know that tech support phone numbers are supposed to be 900 numbers that require you to enter your personal information and product license number, then forward you to unthinking robots who put you on hold for hours, then drop your call to the Los Angeles Bus Authority switchboard... does Cisco not understand that if you do not put people on hold for at least 10 minutes they might pass out in shock from being able to talk to a human too soon? Apparently not.
So I asked the Cisco technician, Scott, to telnet into our switch and take a look at the config. I figured he'd balk and say, "No I can't do that," because of course this is a tech support number I called so he's going to tell me to give the phone to my mommy if she's there and ask her to log into the switch because, since I don't have a lot of experience with IOS, I must be some kind of idiot to even call tech support without knowing what my HSRP configuration is on VLAN 4. Instead he says, "OK, what's the login password?" I can't believe this... I must have dialed the wrong number, he's not going to just go into our switch and sort this out for me right here and now, is he?
So he's in the switch and he's disgusted and horrified by how we have it configured, and I'm sure he's right. So I ask him, "Well, can you change all that?" I figure he'd say, "No, this is your equipment, you fix it yourself," but he doesn't, he says, "Sure, what's the config password?" You gotta be kidding me, I must have dialed the wrong number here... this cannot be a tech support line... you can't actually get a tech support rep on a toll-free number simply to log in and fix your router setup while you whine at him on the phone... this is not real.
So he's in the switch config and he's having a great time pointing out everything some of our people warned us about months ago. He tells me this is wrong, we shouldn't be doing this or that... "Well, then change it if you don't mind," I tell him. "Switch broke. Me dumb. You fix." ...so at one moment Scott wanted to undo some changes. He bounces the switch... copy startup-config running-config ... the switch resets itself... then email starts streaming into my inbox... then I can ping our sites all of a sudden... we're back online! Everything is back! Weird.
OK, that's all fine, but Scott is still freaked out about how we have the switch configured. Soon I get a call from Barnaby, another hot shot Cisco tech rep. He just logged into our switch and he's horrified too. He wants to walk me through a total switch upgrade and cleanup right now. "Not tonight," I tell him, "I'm burnt and I need to consult some network people over here before we mess with this any further."
The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers. Instead of getting an answer from Exodus and running to Cisco with it, and then back again, he got Cisco and Exodus engineers to talk directly to each other and work it out. He conferenced an Exodus network engineer to Barnaby at Cisco and, Kurt says, "they talked alien code about VLANs, standby IPs, HSRP, multihoming, etc. etc., and they came to an agreement: our switch config was a mess... but at least Barnaby knew what the settings were supposed to be and an Exodus engineer agreed with him."
Before moving on to the (short) Tuesday outage, here are a few more notes from Yazz:
The one card going bad wouldn't have been such a big deal if the config in both were set up correctly. It was meant to flop to the other interface if the primary card died, which it did, but not with all the info it needed... AKA it was misconfigured...
Exodus really wasn't set up to handle the type of failover the 6509 was meant to do. That's what the Cisco folks said basically, and the Exodus people are no longer supporting this type of Cisco in their setups. Half the VLANs were only stored on one unit and the other half of them on the other. So when one died it only knew half of the full setup and couldn't route things correctly since the VLANs it wanted weren't there... Fun!!!
Tuesday was router reconfig day. It was originally only supposed to cause "about five minutes" of downtime, so it didn't seem worth posting any kind of notice that it was going to happen. Why the middle of the day instead of a low-traffic post-midnight time? Because this way, if there was any trouble, lots of people at Exodus and Cisco would be awake and around to help. And it was a good thing this choice was made. Kurt picks up the story:
Tuesday 11:00 a.m. we're back in the cage. Barnaby is logged into our switch while he's talking to me on my cell phone (which disconnects every 5 minutes just to make my day more challenging), helping us by upgrading the Cisco 6509 firmware, then he's going to clean up the config. First step was getting the firmware patches onto a TFTP server near the switch (had to be less than 3 hops from the switch, TFTP doesn't work over longer hops). Yazz took care of that. From there Barnaby patched the firmware, had me reboot the switch, and we should be down for just 5 minutes. Unfortunately 5 minutes turned into 2 hours.
After the switch reboot, part of our network was unreachable again, much like Saturday's episode only this time with a Cisco rep on the phone helping us work it out. Again we started tracing cables all over the cage, pinging every corner of the matrix. Barnaby got an Arrowpoint tech rep, Jim, on the line and into our Arrowpoint. But this is tech support, Jim isn't just going to log into our Arrowpoint and debug it for us, right? Wrong, this is Cisco tech support: Jim logs into our Arrowpoint and works with Barnaby to trace packets and debug our network.
For a while we put a cross-over cable in place of the firewall just to be sure the firewall box wasn't jamming us. Nope. Didn't help. Barnaby and Jim are mapping hardware addresses to IP addresses to figure out where each packet is going. Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light leading it to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"
"Hold on," he says over the phone to me. Then the light goes green, and after a few seconds of routers correcting their spantrees we're back online. Everything is back online. All this time it was this little interface to an ignored switch that none of us bothered to account for. Make a big note about it in the network documentation, please.
After we came back online Barnaby went ahead and cleaned up our switch configuration, put things the way they ought to be, made our connections sane and stable.
This has not been OSDN's finest week. But we thought it was better to give you the full rundown than try to pretend we're perfect. At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be. If nothing else, perhaps this story can help others avoid some of the mistakes we made. We certainly aren't going to make the same ones again! (~.*)
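For anyone wondering what the "half the VLANs on one unit, half on the other" problem from Yazz's notes looks like in practice, here is a minimal sketch of a consistency check. It is not our actual tooling, and the VLAN numbers are invented for illustration; in real life the two lists would have to come from the configuration on each router module.

```python
# Hypothetical VLAN IDs configured on each router module. These numbers
# are made up; they are not taken from the real switch configuration.
stan_vlans = {1, 2, 3, 4, 5, 6}
kyle_vlans = {4, 5, 6, 7, 8}


def vlan_mismatches(primary, standby):
    """Return (missing_on_standby, missing_on_primary) as sorted lists."""
    return sorted(primary - standby), sorted(standby - primary)


missing_on_kyle, missing_on_stan = vlan_mismatches(stan_vlans, kyle_vlans)
print("Defined on Stan but not Kyle:", missing_on_kyle)
print("Defined on Kyle but not Stan:", missing_on_stan)
```

If either list is non-empty, a failover to that module would come up knowing only part of the setup, which is roughly what bit us.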
I have worked in the Cisco TAC for about 2.5 years. Currently on the routing protocols team (EIGRP, OSPF, BGP etc.) Prior to coming here I had never dealt with people this obsessed with getting everything right all the time. Really they drill it into you. Mandatory perpetual training and such.
Many people who call don't understand how the system works internally so here's a summary: We have cases in 4 groups, priorities 1 through 4, 1 being the most important. The designation of the priority of the case is entirely up to you as a customer. All cases are P3s by default which more or less means they need resolution within 72 hours. If your network is down and you need help right now, today with no waiting we'll elevate to a P2. If you are in a serious network situation like the one described in the article then it's a P1 and literally everything else stops, a bell goes off and everyone crowds around the tech w/ the problem (unless it's a softball case).
There are TACs all over the world but for English-speaking customers what usually happens is the US TACs roll over to the Australian TACs in the early evening who in turn roll over to Belgium and then back to the US. P1s get worked 24 hours until they're resolved, and if they're not fixed in less than 4 hours it's not so good for us.
We have to close about 5 of these cases a day which is sometimes cake (I can't ping my interface which is shut down) and sometimes nasty (redistribution 12 times over).
Also, those little surveys you get every time you work with us (Bingos) are very important. If you'll recall you can rate us from 1 through 5 in 8 to 10 different categories. Anyone who doesn't maintain an average of at least 4.59 is not long for the TAC, 2 or 3 months tops.
The pay is actually kind of crap but there's no better place in the world to prep for your CCIE. I don't think anyone views the TAC as a long-term environment. Too much stress honestly.
Very easy - someone typed in the wrong time. We deemed the shared source to be more important, and that was supposed to go up first.
Not everything is a conspiracy folks.
Yeah, I'm that guy.
No black eye for Exodus, please. Our router config was not a standard one they support. Exodus dude Derek Lam, especially, went way "above and beyond" this last week.
- Robin
I'm not sure what's happening with moderation but since so many people want to know: One of our netops quit suddenly Sunday without any explanation. I assume she was put off by being called in on a weekend and being asked to stay late until it was fixed. I don't know, but these things happen so we deal with it. One thing you don't want to do is publicly flame someone who still has your root passwords (although I trust this particular person with our root still). Besides, we're not mad at her, wish her well, sorry things didn't work out.
Was anyone else waiting for the "*clickity-click* Wow, it looks like your entire root directory was deleted!" punchline? :-)
"Pinky, you've left the lens cap of your mind on again." - P&TB
"I can see my house from here!" - ST:
Was this configuration ever tested?! It sounds like it was put together, prayed over and sent out into the world.
It would have been simple to test too... pull out one of the uplinks... then the other... now try pulling out some of the webservers... and so on.
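The pull-one-cable-at-a-time drill can even be rehearsed on paper before anyone touches hardware. Below is a rough sketch, assuming a made-up and heavily simplified topology (the node names and links are hypothetical, not the real OSDN cage): it removes each link in turn and reports which single removals cut a web server off from the outside.

```python
from collections import deque

# Hypothetical, simplified topology: each pair is one cable.
LINKS = [
    ("internet", "uplink-a"), ("internet", "uplink-b"),
    ("uplink-a", "core"), ("uplink-b", "core"),
    ("core", "firewall"), ("firewall", "balancer"),
    ("balancer", "web1"), ("balancer", "web2"),
]


def reachable(links, start):
    """Breadth-first search over undirected links; returns reachable nodes."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_points_of_failure(links, start, targets):
    """Links whose removal alone cuts off at least one target from start."""
    bad = []
    for i, link in enumerate(links):
        remaining = links[:i] + links[i + 1:]
        if not set(targets) <= reachable(remaining, start):
            bad.append(link)
    return bad


for link in single_points_of_failure(LINKS, "internet", ["web1", "web2"]):
    print("Pulling this cable takes a site down:", link)
```

This only models cabling, not configuration, so it would not have caught a misconfigured standby module by itself; it just shows where to go pull plugs first.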
By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem.
Reboot the database?? WTF? You just proved my point as to why MySQL is NOT ready for primetime. Reboot the fscking database??
So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.
Guys, this isn't Windows -- Rebooting is an absolute last resort and if it works then you have discovered a problem, either in hardware or software and it needs fixed, not just a "oh well, a reboot fixed it, life goes on." Bastions of professionalism you're not.
I don't normally flame people for this kind of thing but the Slashdot crew are especially keen on bashing Windows, yet you resort to their exact tactics whenever a problem comes up.
Reboot the database?? I still can't believe I read that. Sorry.
Cisco Systems have some wonderful systems -- Hell I just recently found out about their stack trace analyzer... feed it a "sh stack" and it emails you back a list of IOS and/or hardware bugs which likely caused the crash. That is just plain old SCHWEEEET. Or being able to read their memory mappings to find out what is causing a bus crash... Ideal. You don't just randomly reboot the damn shit to try and get it to work. If it isn't working something is causing it. Embedded systems are generally pretty good at throwing up the red flags; you just need to look for them (logs, stack traces, extensive use of the debugging facilities...) Use the tools at hand instead of the big red button!
First step was getting the firmware patches onto a TFTP server near the switch (had to be less 3 hops from the switch, TFTP doesn't work over longer hops).
Unless this is something specific to the IOS or router, that's bullshit. I just upgraded 5 AS5248s to IOS 12.1(9) with a TFTP server that is 8 hops away. I'm not aware of any TTL issues with TFTP.
Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light leading it to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"
You mention that your network documentation is shitty -- I sure as hell hope you'll push to have it upgraded and maintained with a high degree of readability. Even complex systems do not have to be undocumented just because they're complex. Use pictures, use words. I haven't found anything in IT which cannot be explained by a combination of both. And throw in a glossary for the non-techies like yourself who are called upon to fix it. :-)
Don't get me wrong; I'm glad you're back up. But this could have been prevented. Very easily from the sounds of it. I hope you did fire your Cisco admin; it sounds like s/he didn't have a clue and was so terrified of losing his/her job that s/he didn't ask for help. Cisco has mailing lists, tons of documentation and there are many IRC channels to ask for help.
Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outside world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.
Up until recently you had no choice but to telnet to Cisco equipment. I came up with a quick solution: deny telnet from anywhere but a same-segment computer (in our case, it's our RADIUS authentication box). Now ssh to the server and telnet from there to the NAS. Problem solved. :-)
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
While I usually agree, sometimes it is necessary to do a quick check. Even with the number of blackhats out there the chances of them doing anything significant (or anything at all) for the 2-5 minutes you have the firewall out are insignificantly small.
I can confirm this. I've been a network consultant for almost a decade, primarily as a Cisco router/switch jock. I've dealt with the TAC (Technical Assistance Center) too many times to count.
Hold times can vary, depending on time of day, but are never as bad as the stories from other companies. In most cases, you are on the phone with a real, live engineer within 5 minutes.
90% of the time, the engineer you are transferred to will be able to get your problem corrected. On the few occasions where they have not been able to help me, Cisco has moved mountains to get the right people involved. I had an issue with Serial SNA - DLSW+ encapsulation last year that was escalated to the point where the guy that wrote that portion of the code for IOS was on the phone, and was prepared to come to my client's site (True, they had purchased about $8M in hardware...).
You do, typically, have to have a Smartnet contract, but as other posters have pointed out, if the problem is not hardware related, they will generally help you straighten out your configurations even without the contract.
A lot of people like to make comparisons between Cisco and Microsoft. Anyone who has dealt with the two will be quick to dispel any similarities. Cisco is a first-rate organization, with first-rate support, and I've made a career out of working with their products.
For those that would die defending it, Freedom
has a sweet taste that the protected will never know.
Yes, if you have a SmartNet contract for that device, it's pretty much true. Cisco, mid-1990s Novell, and Oracle are the only organizations I know of that provide this kind of help. Microsoft "Gold" support plan, anyone? (gag).
Caveat: Cisco basically does not have first level support (i.e. "Is the router plugged in?" "What's a router?") - you are supposed to have second level knowledge and have completed the first level troubleshooting before you call TAC.
But - I have been out of the office and had brand-new network techs call Cisco with a problem, and they did help out even then.
sPh
Actually it isn't. As the other respondent to your comment pointed out, it's possible to determine system type from the ICMP responses. One should also realize that not all exploits use fragmented ICMP attacks. There's all kinds of abuses of ICMP that could conceivably be used to take a system down. It's better to nip any of those in the bud for a high volume site or set of sites.
I am not merely a "consumer" or a "taxpayer". I am a Citizen of the State of Texas
Luckily, /. is monitored, this historical event will be kept in the monitoring systems for ever and ever ;)
Go to the monitoring system page.
Click the www.slashdot.org link
Select services
This will give you some graphs showing the outage.
A much more common experience is to wait on hold for 15-20 minutes, but I have waited on hold as long as an hour with them.
Well, in this case Slashdot was down. That can explain the instant response.
__
__
Men with no respect for life must never be allowed to control the ultimate instruments of death.
GW Bu
Look at it this way: every time he comes back again, all by itself! Other people die once and for all...
Szo
Red Leader Standing By!
...Cisco is reporting a projected 40% upswing in earnings for the next quarter, after a favorable review of their technical support personnel on the discussion site Slashdot led to a surge in sales for support contracts.
"It's the first time the Slashdot effect has been a productive one," said an unnamed Cisco official, pausing briefly to dodge a large bag of cash sailing through a nearby window.
Jay (=
Sometimes rebooting will fix the problem. Sometimes you don't have any alternative. Sometimes you can't fix the problem, but you can get things working again (e.g., Windows). And rebooting may be the best (or only) way to do that.
It is clear that they were out of their depth. It is clear that they didn't know what they were doing. They knew that they didn't know what they were doing. But the experts were unreachable. So they tried something that sometimes works. I really don't see how you can fault them for that. It would, of course, have been better if they had known what their choices and options were, but they didn't.
I wouldn't have either. Probably most of us wouldn't have.
Caution: Now approaching the (technological) singularity.
I think we've pushed this "anyone can grow up to be president" thing too far.
Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outside world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.
Bah, you're talking without knowing the parameters. For all you know, they could've enabled the telnet access on the outbound interface specifically for the checking/cisco rep, disabling it afterwards.
Secondly -- if I remember correctly you can have pretty damn long passwords on Cisco equipment. We do not know the length of the password, but it's highly probable that the password is 10+ characters. A bruteforce-attack is pretty damn difficult when you have to check 64^10 possibilities. According to my bc:
arcade@lux:~$ echo 64^10 | bc
1152921504606846976
Now, that is a pretty impressive number of queries you've got to make to exhaust that pwd-space. To be quite frank -- I don't see the problem.
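To put that number in perspective, here is a back-of-the-envelope sketch. The 64-character alphabet and 10-character length are the figures from above; the guess rate is purely an assumption picked for illustration, since nobody here knows how fast logins could really be attempted against the switch.

```python
ALPHABET_SIZE = 64         # from the estimate above: 64 possible characters
PASSWORD_LENGTH = 10       # from the estimate above: a 10-character password
GUESSES_PER_SECOND = 1000  # assumption for illustration only

# Total number of candidate passwords, same figure bc printed above.
keyspace = ALPHABET_SIZE ** PASSWORD_LENGTH

# Time to try every candidate at the assumed rate.
seconds = keyspace / GUESSES_PER_SECOND
years = seconds / (365 * 24 * 3600)

print(keyspace)
print(f"about {years:.1e} years to try every password")
```

On average an attacker would hit the right password after trying half the space, and a dictionary word would fall far sooner, so this only describes a truly random password.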
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
Oh, yes of course. If you don't have a firewall You are phooked!!
Ehh? Excuse me? Why the fsck does a properly configured serverfarm need firewalls _at all_? Please, enlighten us with your wisdom oh dimwit.
Firewalls _are not needed_ if you're not running services that _should not be running_ on servers for the internet.
--
"Rune Kristian Viken" - http://www.nwo.no - arca
Maybe that's the next site OSDN should come up with. The idea is that anyone who has had a major problem with their network or computers and solved the problem, could post their write up to help others who find themselves in such a situation.
I definitely enjoyed reading this article and I am sure that it will be bookmarked by a fair few techie-minded network admins, just in case.
Jumpstart the tartan drive.
Of course, it depends greatly on who you are talking to. The platforms team does have a huge slant toward NT/2000 because that's what they support and allegedly like. Those of us in Exchange support (I'll leave it to you to figure out what part of Exch. support I'm in) handle calls where Unix servers are relays, Pix firewalls sit between systems and load balancers continually send packets off into the woods. If you *don't* know non-Microsoft stuff, aren't prepared to acknowledge that non-MS works and works well, or just can't handle the idea of public standards, you are fucked in that group.
It all comes down to who you get on the phone. If you don't like who you are dealing with, ask to speak with their manager or technical lead. Get it straightened out with them or request another support tech. You're paying for it, get what you are paying for.
(As always, my comments are my own and my employer doesn't take any responsibility for them. Like they would want to anyway.)
---
I don't normally swear, but if someone asks me if Cisco support is good, I have to reply: "Abso-fucking-lutely". They are easily the tightest organization out there, bar none. I don't think anyone: UPS, the Military, Wall Street, runs as good an operation as they do.
And I've sat with two engineers at 1:00am through to 11:00am as they fixed my small gateway to an ISP, not a big ticket item. At one point, they did an engineer transfer, connecting me to a different part of the world, and spent thirty minutes overlapped, with the engineers working together to make sure that the new engineer knew what the first had tried. As it turned out, the firmware storage was flakey, and the config corrupted itself semi-randomly.
Years later, I watched Cisco do the exact same thing - only this time, they correctly identified that the problem wasn't them, but in some Bay routing equipment, *and* they told us the exact commands to fix it (I was an outside consultant just watching, but I believe they even offered to telnet in and fix it themselves).
So, yes. Cisco is the only brand I will buy, no matter how expensive they are. Think of the extra expense as insurance. You *may* not need it, but it sure pays for itself if you do.
--
Evan
"$30 for the One True Ring. $10 each additional ring!" -- JRR "Bob" Tolkien
I think the Anne Tomlinson post was a particularly brilliant troll.
A quick Google search for "Anne Tomlinson" returns an orchestra conductor and someone in a retirement community.
If it was a real post, CmdrTaco probably would have ignored it. His good humored response makes me think it was a troll.
Is there any evidence that it was real?
-B
When I was able to do my own spam-armoring, you got a chance to email me. Now you can only hope I see your reply.
http://www.bmug.org/news/articles/MSvsPF.html
.....'" We do, and then we reboot. Problem solved.
I beg to differ.
That article details calling the 900 line, but even with support contracts, most MS tech support reps toe the company line in a distressing fashion.
"Unplug all the unix servers, that'll fix it"
"Upgrade everything to Win2k Adv Serv, that'll fix it"
"Upgrade to SQL Server (from Oracle), that'll fix it."
They seem to have no ability to distinguish which network components could be involved in a problem and are unwilling to accept that you've already localized the problem.
Case in point, there was a problem where two WinNT boxes wouldn't see each other. They both had IPs, they could both ping everything else. They were connected via a 100mbps switch.
We made sure each properly had an IP, that it could reach other machines, that the switch worked, and then swapped ports with two machines that were working just fine. We also tried isolating these two machines on their own switch, to avoid potential IP conflicts.
When we called the support number we honestly described the situation to the tech. He asked what else was on the network. We explained that it was in a different IP range, but on the same switches as a bunch of Linux machines, an Open BSD (firewall for the desktop machines), and a couple Suns (doing something for the other department, dunno what.)
He then proceeded to tell us that it was the other computers, despite our telling him that we had isolated the NT boxes in question on their own switch and we still had the problem, but when we put a third computer on, both of the NT boxes could reach it just fine.
We eventually lied to him, telling him that yes, we had unplugged all the unix machines, etc. (Like we're going to just unplug our company on the say-so of a moron, and like two junior techs would have the authority to do so anyway.) So now jim-bob starts to help, by telling us that Win2k is so much better, etc, that we wouldn't have these problems with it, etc.
When we flat-out refuse to "upgrade" to fix this bug, his advice is that we format the drives and reinstall. ARGH!
We finally convince him that these machines are somewhat important and we can't just wipe them every time there's a small problem.
After over an hour with this jack-off, we hang-up, problem unresolved.
We get permission from the boss to call someone in... So we look through our list of contacts and grab someone whose card says they deal with networking and windows. Call him up. As we're describing the problem he listens quietly, grunts affirmatively when we describe how we isolated the problem, agrees that it couldn't be any of the other machines.
Then he says, "It sounds like it's an issue with a bad route, type 'route
He said that it, whatever it was, was a very common problem where the machines basically forget how to get from A to B. That command zeroed the routing (which didn't show any bad routes) and the reboot brought it back up.
Cost, a 15-minute phone consultation. $45
Microsoft tech support was basically a sales department, staffed with the marketing rejects.
So, don't EVER believe it if someone tells you that MS supports their products. Any company whose line is "Format and reinstall" has no business calling a product "Server", let alone claiming they're in the enterprise level.
Schon, earlier in this thread, said "Rebooting doesn't solve the problem!!" I wonder what he'd say about formatting and reinstalling.
No, we don't have a right to know. Ms. Tomlinson's departure is between her and her employer; not some tabloid expose for a bunch of overly curious rumor mongering conspiracy theorists. I wouldn't be surprised if the people who blurted this out on a public forum have been seriously bitch slapped by HR.
As a community it would be best to let the matter drop. I'm sure if you were in Anne's position you'd be severely pissed. A little perspective and some empathy would be appropriate.
I don't want knowledge. I want certainty. - Law, David Bowie
Who's to say that it's not a 1 PPM problem that won't affect the system again for another hour/day/month/year? Once the packets are flowing again, then you can relax and take the time to root cause the problem and fix it.
And who's to say that the problem that's being experienced will be fixed by a reboot?
We had a server running, one of the things it did was SMB sharing - one of the drives (the one dedicated to non-critical SMB shares, in fact) died. This box was doing MUCH more than SMB - it was also our internal DHCP and DNS server.
I was out, and one of our MS guys decided "I don't know what all these error messages mean, but I can't see my windows drives, so I'll just reboot it." Because the drive was dead, the machine wouldn't boot. He took the WHOLE DAMN DEPARTMENT OUT - nobody had DNS, and when people's windows machines stopped working, the solution was (guess what?) REBOOT them - so THEY stop talking to the network altogether.
Now, the kicker is that the drives in this machine were hot pluggable. If the reboot hadn't happened, I could have swapped in a new drive, restored from last night's tape backup, and people could have continued working. Instead, because the machine was rebooted the whole department was down for several hours.
The mantra stands - REBOOTING WILL NOT FIX THE PROBLEM. And if you reboot before you know what the problem is, then not only don't you know if it will help at all, but you also don't know if it will make the situation worse.
sometimes getting back online as fast as possible is more important.
That's the trap - there is no guarantee that rebooting will do this - and you might just be screwing it even worse.
Getting back online as fast as possible involves solving the problem first - REBOOTING WILL NOT FIX THE PROBLEM.
it may have resolved the problem for a short while
Even though you think you're saying the opposite of what I said, you've hit the nail squarely on the head - rebooting never fixes any problem.
It may temporarily fix the symptom, but the problem is still there.
It is possible for routers, Linux boxes, etc to crash.
Yes, it is. But if they crash, it's for a reason - perhaps there is a bug in the configuration, or firmware; or perhaps it's hardware.. but what's important is that rebooting will not actually fix the problem, all it will do is temporarily alleviate the symptom.
If the problem is with the configuration, then you fix the configuration. If there is a bug in your software, you fix that. If it's hardware, you replace the faulty hardware. If it's firmware, you upgrade the firmware (or replace the unit with a different model, from a manufacturer who actually does quality testing.)
But you do not just blindly reboot - if a reboot is required, you do it after you've discovered WHY the machine has crashed, and you've fixed it. Once again, the mantra is "Rebooting will not fix the problem."
I laughed out loud when I read this:
But, says Yazz, "Since the Cisco was rebooted there were no logs to look at."
You fell into the classic "Windows" trap.. this is what I tell the Jr. tech guys here when one of the servers goes wonky: "If it doesn't work, there is a reason; something is wrong. Rebooting will not fix the problem."
They usually respond with "but I didn't know what else to do."
To which I answer "Repeat after me - REBOOTING WILL NOT FIX THE PROBLEM."
"But I didn't know what else to do."
"Then call someone who does - REBOOTING WILL NOT FIX THE PROBLEM."
Just wanted to say thank you for the explanation. After all, we are your customers! :) It is really nice to get an accounting of what happened.
BTW: Are you going to plan any redundancy/failover drills as a result of this?
If this is a "blow-by-blow" account, then could someone, I dunno, involved in the mess explain that little comment Taco made for about 20 minutes on Sunday about when the "qualified personnel" arrived, "[they] discovered that she wasn't actuually as qualified as we had hoped. Then she quit, thus terminating 3 local star systems."
Was Rob just popping off at random, or was that little bit removed trying to cover /.'s ass in the face of a potential libel suit?
Jes' wondering...
Someday, you're going to die. Get over it.
While technically correct, you have to look at the bigger picture. Rebooting may not fix the root cause of the problem, but it could very possibly get the system back online. Who's to say that it's not a 1 PPM problem that won't affect the system again for another hour/day/month/year? Once the packets are flowing again, then you can relax and take the time to root cause the problem and fix it.
You can make a case that valuable troubleshooting info is lost when systems are rebooted. I agree, but counter that all good systems should have detailed event logging. Leaving the system online and intact is the best way to root cause a bug. But, sometimes getting back online as fast as possible is more important.
...to all of us that do this for a living. Forget for a moment that most here have never set foot in a real data center, much less even own a server. No pros want to see another's network go down (well, most of the time ;-) ), and we don't want ours down. I've spent many an hour looking at an errant PIX, or troubleshooting some other network config. I know what those guys were going through. It sucks...
Don't slack. When you slack it bites you in the ass. Maybe not today, maybe not tomorrow, but someday, someday soon, it will.
Test your failover configs. How? By actually making them fail. During the maintenance window, power that primary router/firewall/load balancer down hard and see if the failover works. It's like testing backups, kids. You have to know they work before you need them.
Realistically develop on call strategies. OSDN didn't really have a net ops staff of four. One had quit (why are they counted?), one was in hospital, and two had weak "couldn't reach my cell phone" excuses. That just don't work in the real world. If you are on call, you are on call. The "phone too far away" and "battery fell out" just don't cut it in the adult world of professional net ops. Get a satellite pager, and if you are on call, make sure it's on, and near you so you can hear it.
Don't bash your employees/former employees, particularly during a heated situation. Shows no class. Besides, if you are such a hot shit, grab that console and fix it. Otherwise, keep your mouth shut. Besides, who is in charge of making sure the people that are hired are qualified? Hmmm?
Document your shit. It's not that hard. Visio can do much of it for you. I'm going to break an NDA here, but the Exodus Service Agreement states that all machines and cables are to be labeled. That is so when the dude (or dudette) has to leave the NOC and enter your cage to reboot your lame box, they know what is going on. Also works well for when your net ops staff is too concerned with getting drunk or laid and your poor programmers have to go in to fix the network.
Some folks really went above and beyond, but it seems to me that the management severely dropped the ball.
Is VA really ready to abandon the hardware market for software services? One has to wonder.
Dave
been there before...
Yes, but have you tried dialing that number when Slashdot wasn't down?
Rumour has it the conversation went a little something like this:
[Kurt] Hi, cisco tech support?
[TAC] Yes
[Kurt] this is Kurt at slashdot...
[TAC] Oh my god, it's about time you called us. You've been offline for nearly 24 hours, we're all going through withdrawals. Hang on a sec, our top techs are dying to help.
I talked to a friend in cisco TAC (Brussels) who said that they regularly lurk on /., and in the TAC they could see it was a major network outage since the whole of the OSDN sites were unreachable. Nothing to do but wait, or answer calls from other customers :-)
Since summer weather had come to Europe, I, personally, did not notice the outage. But I promise in the future to not have a life.
the AC
[Note to Kurt and company, make sure you return your customer satisfaction survey. Those TAC folks live and die based on keeping a very high level of sat scores. I think they need a 4.85 (on scale of 1 to 5) just to keep their jobs within cisco, and a 4.89 to get a raise. So 5's across the board, and in the comments put a link to this /. story for their manager]
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
I've seen this scenario over and over again... one guy who knows and understands the network, ten people standing around at the equipment trying various silly commands to fix it when it's down...
Here's some suggestions -- you probably already realize that 90% of your pain was avoidable, but everyone has to learn "the hard way" the first time, right?
We've got piles of printouts and documentation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer...
That's called bad documentation that no one ever reads.
Get your networking guys to document TROUBLESHOOTING techniques and to teach the programmers how the network is actually set up and why. You have plenty of talent capable of understanding how it all works there.
Get more than one way (cell phone) to reach your most important network engineers. Pop for a guaranteed delivery text pager and ask them to carry that as well as the cell phone.
Yazz says, "When I arrived at Exodus, Kurt and Dave were trying every combination of things to do to get the 6509 back. But neither they nor I even knew the Cisco Passwords."
Paper. Wallet. Put them there. Better yet, PGP encrypted password escrow somewhere that anyone can get access to, and a locked cheap fire safe at the office with the public and private PGP keys on a CD-R inside -- for just this type of scenario.
So I asked the Cisco technician, Scott, to telnet into our switch...
Bad bad bad... telnet = bad. Good network security always goes out the window when the network's down...
So he's in the switch and he's disgusted and horrified by how we have it configured...
This is probably the most important hint during your entire outage... your network people either don't know what they're doing, or you're not ALLOWING them to do their jobs, or they're understaffed, or whatever other excuses can be made up ... your call, but don't forget this -- if Cisco's "horrified" by your configs, there's a serious issue you need to find and correct somewhere in your organization. Everything from training, to documentation, to troubleshooting procedures needs a serious walk-through.
The one card going bad wouldn't have been such a big deal if the config in both were set up correctly. It was meant to flop to the other interface if the primary card died, which it did, but not with all the info it needed... AKA it was misconfigured...
DO FAIL-OVER TESTING. If you'd have done a fail-over test of this config you'd have known it didn't work correctly during a nice scheduled time when your network engineers are available and at the equipment, instead of the middle of the night during an outage with all of them MIA. This is so easy to avoid.
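For what it's worth, even before you pull the plug on the primary, a dumb comparison of the two configs would have flagged "the standby doesn't have all the info it needs." Here's a rough Python sketch of the idea. The setting names and values are made up for illustration (they are not real 6509 config), and `config_gaps` is just a hypothetical helper name. It supplements a real fail-over test, it doesn't replace one.

```python
# Hypothetical settings for illustration only; not real 6509 configuration.
primary = {"uplink_vlan": 10, "default_route": "10.0.0.1", "hsrp_group": 1}
standby = {"uplink_vlan": 10, "hsrp_group": 1}


def config_gaps(primary, standby):
    """Return settings the standby lacks or holds with a different value."""
    return {key: value for key, value in primary.items()
            if standby.get(key) != value}


gaps = config_gaps(primary, standby)
if gaps:
    print("Standby is not ready to take over; missing or different:", gaps)
```

An empty result only means the two configs agree with each other. It says nothing about whether the shared config is any good, which is why you still power the primary down during a maintenance window.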
Exodus really wasn't set up to handle the type of failover the 6509 was meant to do. Thats what the Cisco folks said basically, and the Exodus people are no longer supporting this type of Cisco in their setups.
Nice of them to tell you. Who is the customer here again?
Put a $20/month POTS line in your cabinet for goodness sake!
That's enough... I'm appalled, but hopefully you will straighten out some things now that the site was down for an extended period. Done properly, network downtime should be a rare event, usually caused by human error, not by bad configuration.
Many outages are unavoidable, your outage sounds like it was avoidable, and certain steps could have been taken to minimize the length of the outage.
+++OK ATH
What you said.
I did a bit of (very junior-level) sysadminning back in my day.
First thing the BOFH told me was "Buy a hard-cover notebook. Not spiral-bound. Not softcover. Write down everything you do. Feel free to doodle and write obscenities if you like. Someday you'll thank me for this".
I was a bit befuddled, and then he showed me his notebooks. Five years of dramatic fuckups and even more dramatic recoveries. His own personal "deja.google.com" (but it was 1992, and long-term USENET searching hadn't been invented yet, hell our office was using UUCP!) for everything he'd had to work out from first principles on his own.
And thus was the PFY enlightened.
(And yes, I did buy him a beer in late 1992, when something I wrote down in mid-1992 jumped off my page and saved my ass.)
While I agree that I usually get someone at cisco who knows what they're talking about, it is very rare in my experience that it happens in only a minute, although it does occasionally happen. A much more common experience is to wait on hold for 15-20 minutes, but I have waited on hold as long as an hour with them.
All of that being said, I would have to agree that cisco's TAC is probably one of the best tech support groups I've ever worked with.
--
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
Someone kind of alluded to this but MY GOD are your security procedures busted!
Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outside world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
This is exactly what we need on the 'net for us sysadmins to read. Failure stories. Why? You don't learn much from success stories, because things worked the first time.
"Welcome to the HOWTO. My setup worked the first time. Why didn't yours?"
Granted, no one wants to see stuff on the 'net go down (and we're glad you're back, /.) But writeups like this one and Steve Gibson's at GRC about the DDoS attacks are priceless. They show what people have tried, what hasn't worked, what did work, and definitely where to start the next time.
Really, what Linux (and other geek subjects) need is to have a Great Book of Failure Stories -- writeups like these that detail horrible outages, downtimes, misconfigurations, security hacks, etc., so that we all can learn from others' mistakes.
Blog,Twitter
I hereby propose the term "anne-tomlinson", or "tomlinson" to describe the act of departing a company in the most suspicious of circumstances, known only to a very privileged few. Used in the following example:
X: "What happened to Anne?"
Y: "I don't know; all I know is that she anne-tomlinsonned from work."
Note that this verb should have the subject of the remark used as the subject of the verb, and the organization left as the indirect object. This should be adhered to regardless if the subject quit, was fired, laid off, died, disappeared, never existed, or there was a mutual decision for the subject to leave. In fact, the verb should mainly be used when the method of departure is unknown or never officially stated (or, even officially acknowledged).
Also note that this verb should NOT refer to a person leaving another person, as in "Fred's now-ex-wife had tomlinsonned from him." The number of people (one or more) that are the subject should be less than the number of people who the object represents.
Continuing on, this verb should NEVER be applied in a self-referential manner, IE: "I anne-tomlinsonned from them". This implies that the subject either A) knows the reasons, and is just being a prick about not stating them, or B) the subject does not know the reasons due to massive thick-headedness.
Lastly, this term should only be used to convey the sense of impenetrable mystery surrounding the departure. It would be oxy-moronic to state: "Ted tomlinsonned because he was bored and wanted to leave." If the mystery surrounding the departure is penetrable, use another phrase.
anne-tomlinson, v,: to leave or be removed from a group under extremely odd, and mysterious, circumstances; especially when the actual method of departure or initiating party of departure is unknown. More especially, when the actual departure is apparently covered up or left un-acknowledged.
tenses: anne-tomlinson, anne-tomlinsons, anne-tomlinsonning, anne-tomlinsonned, had anne-tomlinsonned.
"Don't mind me cutting myself on Occam's Razor"
Well, we have a $LITTLE_NUMBER support contract with Cisco, and have had similar with two previous companies.
Our results were much the same. Very, very responsive people.
I have to agree with Taco, if they gave this kind of service down at the DMV, they'd be picking up passed out folks left and right.
*scoove*
Wish the companies I deal with on a regular basis ever showed that level of skill when I need help. well... hmm... actually Speakeasy is generally pretty good about accepting that my problem is accurately diagnosed and figuring out what's wrong. And Viewsonic the other day was able to provide refresh-rate specs on a monitor I wanted to order within about 60 seconds of my placing the call (Though they dropped the ball by not having the specs I wanted available on their web page) What is this trend of good service? It's scaring me...
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
Just because someone screwed up at your work doesn't make your mantra a universal rule.. Especially when dealing with something like a router or a switch.. These things are normally not meant to be user serviceable and will take a reboot just fine (no hot swappable drives there).. You could have hit a 1ppm problem and rebooting just brings everything back online until statistics kick in again. A little uptime is better than none.
Sure it won't fix anything per se, but getting things normalized enables you to start concentrating on the problems at a less hectic pace..
---
/bin/fortune | slashdotsig.sh
I'm not sure I understand that. Why does the router purge its logs when you reboot it?
That sounds lame as hell. (Granted, though, configuring a Pipeline 50 goes right over my little bow head, much less a Cisco. So yes, I'll stipulate that I'm talking out of my ass here.)
The act of rebooting should be just another event that gets logged, NOT a synonym for "oh, and by the way, you can delete the old log file now."
IMHO log deletion should be done on a calendar basis; everything more than x days old gets purged automatically. What's Cisco's rationale for auto-deleting logs during the boot process?
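To show what I mean by calendar-based purging, here's a tiny Python sketch. It is purely illustrative: the timestamps and messages are invented, `purge_expired` is a name I made up, and it has nothing to do with how Cisco gear actually stores its logs. The point is only that an entry's age decides whether it stays, and a reboot is recorded like any other event.

```python
from datetime import datetime, timedelta


def purge_expired(entries, now, max_age_days):
    """Keep only entries newer than max_age_days; age, not a reboot, decides."""
    cutoff = now - timedelta(days=max_age_days)
    return [(ts, msg) for ts, msg in entries if ts >= cutoff]


# Invented sample data for illustration.
now = datetime(2001, 6, 25, 12, 0)
log = [
    (datetime(2001, 6, 1, 3, 0), "link flap on uplink"),
    (datetime(2001, 6, 23, 7, 0), "module failover"),
    (datetime(2001, 6, 23, 11, 45), "system reboot"),  # logged like any other event
]

kept = purge_expired(log, now, max_age_days=7)
for ts, msg in kept:
    print(ts.isoformat(), msg)
```

With a seven-day window the three-week-old entry goes away, while the failover and the reboot that followed it are both still there to read afterwards.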
Dahlmann tightly grips the knife, which he may have no idea how to use, and steps out into the plain.
C'mon, tell us the full story!
BTW, feel free to mod me down, prove my point and compound my paranoia; I've got karma to spare : )
We're a very small installation and get similar response. If you've paid for a support contract and it even smells like a router problem, the fastest way to fix it is to call them right away. They are a model tech support organization.
.sig: file not found
The only thing worse than having no backups/redundancy is having backups/redundancy that you think will work, but, in fact, don't.
Vintage computer games and RPG books available. Email me if you're interested.
I will now prove, using extremely shaky methods, that "Blow-by-Blow Account of the OSDN Outage" by Roblimo is, in fact, an epic myth.
I. Call to Adventure
"By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem. The network operations people were paged, but did not respond."
II. Meeting the Mentor
CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."
Whoops, that's not it.
"So I called Cisco tech support."
There we go.
III. Obstacles
"Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documentation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here and wasn't available this weekend) or if you've had the joyful opportunity of spending a night trying to trace through it all under pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you."
IV. Fulfilling The Quest
"He bounces the switch... copy startup-config running-config ... the switch resets itself... then email starts streaming into my inbox... then I can ping our sites all of a sudden... we're back online! Everything is back! Weird."
V. Return of the Hero
"The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers."
"Tuesday was router reconfig day."
VI. Transformation of the Hero
"At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be."
"We certainly aren't going to make the same ones [ed: mistakes] again!"
Peace,
Amit
ICQ 77863057
[o]_O
Monkey sense
Take a relational database for example; there is so much that can go wrong with it. For starters, there are bugs in such complex products and fixing them (save for PostgreSQL) is beyond your control.
But it need not even be a bug in the database code. It can be something in your network component (we chased cases for months which turned out to be a DECnet issue, but were attributed to the database server), or it could be the fact that the db vendor compiles his product on multiple platforms and it's virtually impossible to test every function of a new release on every supported platform. Yes, I know that in an ideal world this should be done, but it isn't.
Assume it were possible to perform such tests. Save for proprietary (or semi-proprietary) architectures like OpenVMS/AXP, you can have so many different hardware and network components that it's just not possible to foresee all eventualities.
After ruling out such possibilities, we're not there yet: What are the query characteristics? How many concurrent users do what, and when? What front ends do they use, and how are they connected? The problem may even be caused by a component that has nothing to do with the database engine (Access front end, anyone?)
Although the fundamental cause of the problem might never be detected, a reboot of the data server might fix the problem and it will never occur again, since the same combination of factors occurs so rarely that it's impossible to even reproduce the problem.
However, the [alt-ctrl-del] attitude of younger IT folks (specifically those who grew up in a PC environment) makes me barf and indicates just how clueless a lot of those folks are. You never reboot a production IT component unless there is no other choice, or in the context of your normal maintenance cycle (memory leaks do occur in software).
ich bin der musikant
mit taschenrechner in der hand
kraftwerk
Either Anne is real or she isn't. If she's real, this is an internal matter that we really don't need to interfere in. If, as the "Anne" poster suggested, she quit because Taco and Hemos are hard to work with, she was within her rights and should get at least some support from a community which often says "Quit! Now!" to Ask Slashdots about PHBs.
If she's not, this is all a big waste of everyone's time, and possibly the best troll we have ever seen on slashdot. (An account by that name has a brand new uid (462836) and zero comments.) Think of the trolls you've posted - how many led to 100s of posts on other threads, conspiracy theories galore, and posts by #1 and #2? Whoever did this (if not Anne) should get mad props from the troll fans, but should not take any more of our time.
My bet is that she's not real. But in either case we should drop it and get on to more important things.
sulli
RTFJ.
OK, so the config was a mess. But it was like that BEFORE the outage, right? So what happened between "running OK" and "we're down" to cause it to fail? I didn't see anything to explain that in the report. Or maybe they don't know...
-S
--- What parts of "shall make no law", "shall not be infringed", and "shall not be violated" don't you understand?
Just don't name the router "Kenny".... he dies every week.
-S
--- What parts of "shall make no law", "shall not be infringed", and "shall not be violated" don't you understand?
Everything failed at about 7AM Sat. Dave was at Exodus between 8:30 - 10:30AM Sat (didn't look at the log book when I got there). Kurt arrived shortly after that, I believe (again, I didn't look at the log book). I arrived there around 11:40AM Sat.
And yes, my battery was loose on my Nextel. It just takes a little upward pressure to loosen the battery on the i1000plus I have. The battery doesn't fall out; it just gets loose enough that it loses contact and turns the phone off.
I have now taped the battery in place!
Yazz Atlas
> (we have 4 T-1's...some ask why don't you go with a OC-3? 4 T-1's are probably cheaper and provide redundancy...nuff said).
:-)
ehr... 4 T1's = 4 x 1.544 Mbit. (=6.176Mbit... der)
1 OC-3 = 155.52 Mbit.
Not really similar, eh
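If anyone wants to check the arithmetic, here it is as a few lines of Python. The line rates are the ones quoted above (1.544 Mbit/s per T-1, 155.52 Mbit/s per OC-3); the helper name `aggregate_mbps` is just something made up for this example.

```python
T1_MBPS = 1.544     # line rate of one T-1, in Mbit/s
OC3_MBPS = 155.52   # line rate of one OC-3, in Mbit/s


def aggregate_mbps(line_rate, count):
    """Total raw capacity of `count` identical lines."""
    return line_rate * count


four_t1 = aggregate_mbps(T1_MBPS, 4)
ratio = OC3_MBPS / four_t1

print(f"4 x T-1  = {four_t1:.3f} Mbit/s")
print(f"1 x OC-3 = {OC3_MBPS:.2f} Mbit/s, roughly {ratio:.1f} times as much")
```

So the four T-1s add up to 6.176 Mbit/s, and a single OC-3 carries about 25 times that. The redundancy argument may still hold; the capacity comparison doesn't.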
After having been modded down next to the goatse links, somebody please explain to me how the hell we're supposed to discuss the decidedly strange disappearance (and subsequent reappearance) of this story on the site without getting modded as "offtopic"?
Just where, exactly, are we to discuss this little point? For example, why did this story disappear? Was it technical? Was it editorial?
For a group that is so damned keen on openness and truth, it strikes me as somewhat ironic that several dozen mod points have been used to effectively suppress this part of the thread.
I want to know what happened. Others do too. If you can't give us a decent place on Slashdot to discuss this issue, then don't mod us down as offtopic!
Obliteracy: Words with explosions
Security through obscurity is no security.
No matter how FUBAR'd your router/switch/firewall configuration is, it's still no serious obstacle to crackers, Robin.
The next Slashdot story will be ready soon, but subscribers can beat the rush and slashdot the links early!
if (comment like "%girl%" or comment like "%What happened to%" or comment like "%original story%") update posting set score = -1, reason = random("Troll","Offtopic")
Is there someone else out there hosting a site where we can have a non-biased discussion?
BROWSE AT -1 Check out how many posts have gone straight in at -1 (and this one too will, I betcha...)
Two wrongs may not make a right, but three
Where does this mysterious woman fit into the story above?
sig sig sputnik
what happened to the woman that quit?
-
sean
The world moves for love. It kneels before it in awe.
I remember when I started out in computer networking (and it didn't seem like it was that long ago), I was told this by one of the other technical members of our team, something that I haven't forgotten: redundancy in a system is necessary not only in the hardware and software in that system, but also in the resources that are used to keep that system running (that includes, of course, human resources, as well as power, HVAC, and so on).
Too often, the human part of the redundancy equation isn't totally factored in. When you don't put all of the human factors into the redundancy equation, you have a redundant system that isn't really redundant.
Of course, it helps if you have a vendor that will work with you (and those of you who remember working with Novell servers in "the old days" know what I'm talking about, too).
These are the good old days you'll be telling your children about. Make them worthwhile.