Blow-by-Blow Account of the OSDN Outage

← Back to Stories (view on slashdot.org)

Blow-by-Blow Account of the OSDN Outage

Posted by Roblimo on Wednesday June 27, 2001 @02:36AM from the warts-and-all dept.

The first hint that all was not well came at about 2 a.m. on Saturday, US eastern time, in the form of slow-loading pages. By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem. The network operations people were paged, but did not respond. Uh-oh.

Our network operations staff was shorthanded; one of our most knowledgable people had quit recently to go into business with a friend and had not yet been replaced. Another was in the hospital, ill and unreachable. A third's cell phone was on the kitchen counter, unhearable from the bedroom, and the fourth one's cell phone battery had fallen out. It was a frustrating comedy of errors, and an unusual one. Our netops staff is typically "on the bounce" 24/7.

Dave Olszewski, an OSDN programmer who is not technically part of our netops staff and is not trained in our equipment setup, happened to be on IRC at the time. He doesn't live far from the Exodus facility in Waltham, MA, where our server cage lives, so he went there immediately. Kurt Gray, lead programmer, who we dragged out of bed, was not far behind. Hemos and others were awake by then, growing frantic as we found that not only Slashdot, but also NewsForge, freshmeat, OSDN.com, ThinkGeek, and QuestionExchange were down, along with our old -- but still popular -- MediaBuilder and AnmationFactory sites. Arrgh!

This is Kurt's "on the scene" report from Exodus:

Walk into our cage at Exodus and it seems harmless enough but try to learn what everything is doing and where the wires are all going in less than an hour and you could go insane. You're standing in a nice, clean, uncomfortably air-conditioned facility with 150 of VA's FullOn and various other servers humming away. Greeting you at the door is "Big Gay Al" our Cisco 6509, which contains two redundant router modules: Kyle and Stan. If Stan dies, Kyle takes over and vice-versa. Across the cage are two Arrowpoint CS800 load balancing switches: one is racked and idle (as a hot spare) and the other is live and balancing the load for most of our OSDN web sites. Between the Cisco 6509 and the Arrowpoint is a bridging FreeBSD firewall using ipfw rules to block stuff like ping just to drive everyone nuts basically.
"I can't ping your site!"
"Yeah, we know."
Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documetation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here and wasn't available this weekend) or if you've had the joyful opportunity of spending a night trying to trace through it all under pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you.
At this point if you know anything about networking you'll demand an explanantion for why we're using each piece of equipment in the cage and not a WhizBang 9000 SuperRouter like the one you've been using flawlessly that even washes your dishes for you and makes food taste better too... I can only tell you that I'm not the networking design person here, I didn't chose this equipment or configure it but I'm told it's very good hardware as long as you know what you're doing, but as CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."
So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.
"Did you reboot the firewall?" I asked Dave.
"I rebooted everything," he said. "I think's it's the Cisco."
So we console into the Cisco 6509. What a mess. Neither of us understand how this switch was configured and what it is trying to do. We don't fully understand why you can get a console connection on Stan but not Kyle (turns out the standby module doesn't have active console, that's normal).

Headshaking all around. Meanwhile, about 11:40 a.m. Yazz Atlas woke up and got his cell phone reunited with its battery. He picked up his voice mail messages, tossed on clothes, and hustled over to Exodus.

Yazz says, "When I arrived at Exodus, Kurt and Dave were trying every combination of things to do to get the 6509 back. But neither they nor I even knew the Cisco Passwords." The op who was supposed to be on duty (the one whose phone was out of hearing) was still nowhere to be found. They called their hospitalized coworker and got the Cisco passwords.

But, says Yazz, "Since the Cisco was rebooted there were no logs to look at. We could ping something on the inside but not everything. On some VLANs we could ping the gateway and others not. The outside world could ping one of the IPs the 6509 handles but not the other. From the inside we could not ping the IP that the outside world could ping. We could ping the one that they couldn't...very frustrating..."

Kurt again:

Several hours of this sort of network debugging went on until 3:00 AM Sunday. By then we had called Cisco for help. They couldn't help us until they saw the switch config and got a chance to review it. We were spent. We had to go to bed and stay down for the night.
Next morning we're back at Exodus and the situation hasn't changed -- our network is unreachable to the outside world. I was hoping that during the wee hours of the morning the Cisco 6509 had become sentient and fixed its own configuration or perhaps a friendly hacker had cracked into it and fixed it for us, or perhaps ball lighting would travel down a drain spout and shock our cage back to life like those heart paddles paramedics use... "It's a miracle!" No such luck.
So I called Cisco tech support. I wish had done this sooner. I was amazed first of all by how you can talk to a qualified Cisco tech immediately... we're talking an 800 number that you dial and within less than a minute you are talking to a technician... doesn't Cisco realize how shocking this is to technical people, to actually be able to talk to qulified technicians immediately who say things other than, "Well, it works on my computer here..."? Do they not know that tech support phone numbers are supposed to be 900 numbers that require you to enter your personal information and product license number, then forward you to unthinking robots who put you on hold for hours, then drop your call to the Los Angeles Bus Authority switchboard... does Cisco not understand that if you do not put people on hold for at least 10 minutes they might pass out in shock for being able to talk to a human too soon? Apparently not.
So I asked the Cisco technician, Scott, to telnet into our switch and take a look at the config. I figured he'd balk and say, "No I can't do that," because of course this is a tech support number I called so he's going to tell me to give the phone to my mommy if she's there and ask her to log into the switch because, since I don't have a lot of experience with IOS, I must be some kind of idiot to even call tech support without knowing what my HSRP configuration is on VLAN 4. Instead he says, "OK, what's the login password?" I can't believe this... I must have dialed the wrong number, he's not going to just go into our switch and sort this out for me right here and now, is he?
So he's in the switch and he's disgusted and horrified by how we have it configured, and I'm sure he's right. So I ask him, "Well, can you change all that?" I figure he'd say, "No, this your equipment, you fix it yourself," but he doesn't, he says, "Sure, what's the config password?" You gotta be kidding me, I must have dialed the wrong number here... this cannot be a tech support line... you can't actually get a tech support rep on a toll-free number simply to log in and fix your router setup while you whine at him on the phone... this is not real.
So he's in the switch config and he's having a great time pointing out everything some of our people warned us about months ago. He tells me this is wrong, we shouldn't be doing this or that... "Well, then change it if you don't mind," I tell him. "Switch broke. Me dumb. You fix." ...so at one moment Scott wanted to undo some changes. He bounces the switch... copy startup-config running-config ... the switch resets itself... then email starts streaming into my inbox... then I can ping our sites all of a sudden... we're back online! Everything is back! Weird.
Ok, that's all fine, but Scott is still freaked out about how we have the switch configured. Soon I get a call from Barnaby, another hot shot Cisco tech rep. He just logged into our switch and he's horrified too. He wants to walk me through a total switch upgrade and cleanup right now. "Not tonight", I tell him, "I'm burnt and I need to consult some some network people over here before we mess with this any further."

The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers. Instead of getting an answer from Exodus and running to Cisco with it, and then back again, he got Cisco and Exodus engineers to talk directly to each other and work it out. He conferenced an Exodus network engineer to Barnaby at Cisco and, Kurt says, "they talked alien code about VLANs, standby IPs, HSRP, multihoming, etc. etc., and they came to an agreement: our switch config was a mess... but at least Barnaby knew what the settings were supposed to be and an Exodus engineer agreed with him."

Before moving on to the (short) Tuesday outage, here are a few more notes from Yazz:

The one card going bad wouldn't have been such a big deal if the config in both were set up correctly. It was meant to flop to the other interface if the primary card died, which it did, but not with all the info it needed... AKA it was misconfigured...
Exodus really wasn't set up to handle the type of failover the 6509 was meant to do. Thats what the Cisco folks said basically, and the Exodus people are no longer supporting this type of Cisco in their setups. Half the VLANs were only stored on one unit and the other half of them on other. So when one died it only knew half of the full setup and couldn't route things correctly since the VLANs it wanted weren't there... Fun!!!

Tuesday was router reconfig day. It was originally only supposed to cause "about five minutes" of downtime, so it didn't seem worth posting any kind of notice that it was going to happen. Why the middle of the day instead of a low-traffic post-midnight time? Because this way, if there was any trouble lots of people at Exodus and Cisco would be awake and around to help. And it was a good thing this choice was made. Kurt picks up the story:

Tuesday 11:00 a.m. we're back in the cage. Barnaby is logged into our switch while he's talking to me on my cell phone (which disconnects every 5 minutes just to make my day more challenging), helping us by upgrading the Cisco 6509 firmware, then he's going to clean up the config. First step was getting the firmware patches onto a TFTP server near the switch (had to be less 3 hops from the switch, TFTP doesn't work over longer hops). Yazz took care of that. From there Barnaby patched the firmware, had me reboot the switch, and we should be down for just 5 minutes. Unfortunately 5 minutes turned into 2 hours.
After the switch reboot part of our network was unreachable again, much like Saturday's episode only this time with a Cisco rep on the phone helping us work it out. Again we started tracing cables all over the cage, pinging every corner of the matrix. Barnaby got an Arrowpoint tech rep, Jim, on the line and into our Arrowpoint. But this is tech support, Jim isn't just going to log into our Arropoint and debug for it for us, right? Wrong, this is Cisco tech support: Jim logs into our Arrowpoint and works with Barnaby to trace packets and debug our network.
For a while we put a cross-over cable in place of the firewall just to be sure the firewall box wasn't jamming us. Nope. Didn't help. Barnaby and Jim are mapping hardware addresses to IP addresses to figure out where each packet is going. Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light leading it to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"
"Hold on," he says over the phone to me. Then the light goes green, and after a few seconds of routers correcting their spantrees we're back online. Everything is back online. All this time it was this little interface to an ignored switch that none of us bothered to account for. Make a big note about in the network documentation, please.
After we came back online Barnaby went ahead and cleaned up our switch configuration, put things the way they ought to be, made our conections sane and stable.

This has not been OSDN's finest week. But we thought it was better to give you the full rundown than try to pretend we're perfect. At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be. If nothing else, perhaps this story can help others avoid some of the mistakes we made. We certainly aren't going to make the same ones again! (~.*)

25 of 389 comments (clear)

Min score:

Reason:

Sort:

Re:Cisco Support by Anonymous Coward · 2001-06-27 00:06 · Score: 5

I have worked in the Cisco TAC for about 2.5 years. Currently on the routing protocols team (EIGRP, OSPF, BGP etc.) Prior to coming here I had never dealt with people this obsessed with getting everything right all the time. Really they drill it into you. Mandatory perpetual training and such.

Many people who call don't understand how the system works internally so here's a summary: We have cases in 4 groups, priorities 1 through 4, 1 being the most important. The designation of the priority of the case is entirely up to you as a customer. All cases are P3s by default which more or less means they need resolution within 72 hours. If your network is down and you need help right now, today with no waiting we'll elevate to a P2. If you are in a serious network situation like the one described in the article then it's a P1 and literally everything else stops, a bell goes off and everyone crowds around the tech w/ the problem (unless it's a softball case).

There are TACs all over the world but for English-speaking customers what usually happens is the US TACs roll over to the Australian TACs in the early evening who in turn roll over to Belgium and then back to the US. P1s get worked 24 hours until they're resolved, and if they're not fixed in less than 4 hours it's not so good for us.

We have to close about 5 of these cases a day which is sometimes cake (I can't ping my interface which is shut down) and sometimes nasty (redistribution 12 times over).

Also, those little surveys you get everytime you work with us (Bingos) are very important. If you'll recall you can rate us from 1 through 5 in 8 to 10 different categories. Anyone who doesn't maintain an average of at least 4.59 is not long for the TAC, 2 or 3 months tops.

The pay is actually kind of crap but there's no better place in the world to prep for your CCIE. I don't think anyone views the TAC as a long-term environment. Too much stress honestly.
Re:Grr... by Hemos · 2001-06-27 00:40 · Score: 5

Very easy - someone typed in the wrong time. We deemed the shared source to be more important, and that was supposed to go up first.

Not everything is a conspiracy folks.

--
Yeah, I'm that guy.
Re:Eeep - scary moderators! by Kurt+Gray · 2001-06-26 23:31 · Score: 5

I'm not sure what's happening with moderation but since so many people wanting to know: One of our netops quit suddenly Sunday without any explanantion, I assume she was put off by being called in on a weekend and being asked to stay late until it was fixed. I don't know, but these things happen so we deal with it. One thing you don't want to do is publically flame someone who still has your root passwords (although I trust this particular person with our root still), besides we're not mad at her, wish her well, sorry things didn't work out.
Within a minute ??? by Masem · 2001-06-26 22:12 · Score: 5

...I was amazed first of all by how you can talk to a qualified Cisco tech immediately... we're talking an 800 number that you dial and within less than a minute you are talking to a technician...Instead he says, "OK, what's the login password?" .... he says, "Sure, what's the config password?"...
Was anyone else waiting for the "*clickity-click* Wow, it looks like your entire root directory was deleted!" punchline? :-)

--
"Pinky, you've left the lens cap of your mind on again." - P&TB
"I can see my house from here!" - ST:
I just have a couple of questions by tzanger · 2001-06-26 23:05 · Score: 5

By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem.

Reboot the database?? WTF? You just proved my point as to why MySQL is NOT ready for primetime. Reboot the fscking database??

So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.

Guys, this isn't Windows -- Rebooting is an absolute last resort and if it works then you have discovered a problem, either in hardware or software and it needs fixed, not just a "oh well, a reboot fixed it, life goes on." Bastions of professionalism you're not.

I don't normally flame people for this kind of thing but the Slashdot crew are especially keen on bashing Windows, yet you resort to their exact tactics whenever a problem comes up.

Reboot the database?? I still can't believe I read that. Sorry.

Cisco Systems have some wonderful systems -- Hell I just recently found out about their stack trace analyzer... feed it a "sh stack" and it emails you back a list of IOS and/or hardware bugs which likely caused the crash. That is just plain old SCHWEEEET. Or being able to read their memory mappings to find out what is causing a bus crash... Ideal. You don't just randomly reboot the damn shit to try and get it to work. If it isn't working something is causing it. Embedded systems are generally pretty good at throwing up the red flags; you just need to look for them (logs, stack traces, extensive use of the debugging facilities...) Use the tools at hand instead of the big red button!

First step was getting the firmware patches onto a TFTP server near the switch (had to be less 3 hops from the switch, TFTP doesn't work over longer hops).

Unless this is something specific to the IOS or router, that's bullshit. I just upgraded 5 AS5248s to IOS 12.1(9) with a TFTP server that is 8 hops away. I'm not aware of any TTL issues with TFTP.

Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light leading it to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"

You mention that your network documentation is shitty -- I sure as hell hope you'll push to have it upgraded and maintained with a high degree of readability. Even complex systems do not have to be undocumented just because they're complex. Use pictures, use words. I haven't found anything in IT which cannot be explained by a combination of both. And throw in a glossary for the non-techies like yourself who are called upon to fix it. :-)

Don't get me wrong; I'm glad you're back up. But this could have been prevented. Very easily from the sounds of it. I hope you did fire your cisco admin; it sounds like s/he didn't have a clue and was too terrified of losing his/her job that s/he didn't ask for help. Cisco has mailing lists, tons of documentation and there are many IRC channels to ask for help.
Re:OSDN, Audit ALL of your systems NOW. by tzanger · 2001-06-26 23:16 · Score: 5

Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outisde world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.

Up until recently you had no choice but to telnet to Cisco equipment. I came up with a quick solution: deny telnet from anywhere but a same-segment computer (in our case, it's our RADIUS authentication box). Now ssh to the server and telnet from there to the NAS. Problem solved. :-)

Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!

While I usually agree, sometimes it is necessary to do a quick check. Even with the number of blackhats out there the chances of them doing anything signficant (or anything at all) for the 2-5 minutes you have the firewall out are insignficantly small.
Re:Cisco Support by Pii · 2001-06-26 23:06 · Score: 5

I can confirm this. I've been a network consultant for almost a decade, primarily as a Cisco router/switch jock. I've dealt with the TAC (Technical Assistance Center) too many times to count.
Hold times can vary, depending on time of day, but are never as bad as the stories from other companies. In most cases, you are on the phone with a real, live engineer within 5 minutes.
90% of the time, the engineer you are transferred to will be able to get your problem corrected. On the few occassions where they have not been able to help me, Cisco has moved mountains to get the right people invloved. I had an issue with Serial SNA - DLSW+ encapulation last year that was escalated to the point where the guy that wrote that portion of the code for IOS was on the phone, and was prepared to come to my client's site (True, they had purchased about $8M dollars in hardware...).
You do, typically, have to have a Smartnet contract, but as other posters have pointed out, if the problem is not hardware related, they will generally help you straighten out your configurations even without the contract.
Alot of people like to make comparisons between Cisco and Microsoft. Anyone who has dealt with the two will be quick to dispell any similarities. Cisco is a first-rate organization, with first-rate support, and I've made a career out of working with their products.

--
For those that would die defending it, Freedom
has a sweet taste that the protected will never know.
Re:Microsoft Support by WNight · 2001-06-27 01:16 · Score: 5

http://www.bmug.org/news/articles/MSvsPF.html

I beg to differ.

That article details calling the 900 line, but even with support contracts, most MS tech support reps toe the company line in a distressing fashion.

"Unplug all the unix servers, that'll fix it"
"Upgrade everything to Win2k Adv Serv, that'll fix it"
"Upgrade to SQL Server (from Oracle), that'll fix it."

They seem to have no ability to distinguish which network components could be involved in a problem and are unwilling to accept that you've already localized the problem.

Case in point, there was a problem where two WinNT boxes wouldn't see each other. They both had IPs, they could both ping everything else. They were connected via a 100mbps switch.

We made sure each properly had an IP, that it could reach other machines, that the switch worked, and then swapped ports with two machines that were working just fine. We also tried isolating these two machines on their own switch, to avoid potential IP conflicts.

When we called the support number we honestly described the situation to the tech. He asked what else was on the network. We explained that it was in a different IP range, but on the same switches as a bunch of Linux machines, an Open BSD (firewall for the desktop machines), and a couple Suns (doing something for the other department, dunno what.)

He then proceeded to tell us that it was the other computers, despite our telling him that we had isolated the NT boxes in question on their own switch and we still had the problem, but when we put a third computer on, both of the NT boxes could reach it just fine.

We eventually lied to him, telling him that yes, we had unplugged all the unix machines, etc. (Like we're going to just unplug out company on the say-so of a moron, and like two junior techs would have the authority to do so anyway.) So now jim-bob starts to help, by telling us that Win2k is so much better, etc, that we wouldn't have these problems with it, etc.

When we flat-out refuse to "upgrade" to fix this bug, his advice is that we format the drives and reinstall. ARGH!

We finally convince him that these machines are somewhat important and we can't just wipe them everytime there's a small problem.

After over an hour with this jack-off, we hang-up, problem unresolved.

We get permission from the boss to call someone in... So we look through our list of contacts and grab someone whose card says they deal with networking and windows. Call him up. As we're describing the problem he listens quietly, grunts affirmatively when we describe how we isolated the problem, agrees that it couldn't be any of the other machines.

Then he says, "It sounds like it's an issue with a bad route, type 'route .....'" We do, and then we reboot. Problem solved.

He said that it, whatever it was, was a very common problem where the machines basically forget how to get from A to B. That command zeroed the routing (which didn't show any bad routes) and the reboot brought it back up.

Cost, a 15-minute phone consultation. $45

Microsoft tech support was basically a sales department, staffed with the marketing rejects.

So, don't EVER believe it if someone tells you that MS supports their products. Any company whose line is "Format and reinstall" has no business calling a product "Server", let alone claiming they're in the enterprise level.

Schon, earlier in this thread, said "Rebooting doesn't solve the problem!!" I wonder what he'd say about formatting and reinstalling.
Re:Eeep - scary moderators! by Flower · 2001-06-26 23:30 · Score: 5

We have a right to know.

No, we don't have a right to know. Ms. Tomlinson's departure is between her and her employer; not some tabloid expose for a bunch of overly curious rumor mongering conspiracy theorists. I wouldn't be surprised if the people who blurted this out on a public forum haven't been seriously bitch slapped by HR.

As a community it would be best to let the matter drop. I'm sure if you were in Anne's position you'd be severely pissed. A little perspective and some empathy would be appropriate.

--
I don't want knowledge. I want certainty. - Law, David Bowie
You know you've been using windows too long when.. by schon · 2001-06-26 22:02 · Score: 5

I laughed out loud when I read this:

But, says Yazz, "Since the Cisco was rebooted there were no logs to look at."

You fell into the classic "Windows" trap.. this is what I tell the Jr. tech guys here when one of the servers goes wonky: "If it doesn't work, there is a reason; something is wrong. Rebooting will not fix the problem."

They usually respond with "but I didn't know what else to do."

To which I answer "Repeat after me - REBOOTING WILL NOT FIX THE PROBLEM."

"But I didn't know what else to do."

"Then call someone who does - REBOOTING WILL NOT FIX THE PROBLEM."
Really full disclosure? by Platinum+Dragon · 2001-06-26 23:17 · Score: 5

If this is a "blow-by-blow" account, then could someone, I dunno, involved in the mess explain that little comment Taco made for about 20 minutes on Sunday about when the "qualified personnel" arrived, "[they] discovered that she wasn't actuually as qualified as we had hoped. Then she quit, thus terminating 3 local star systems."

Was Rob just popping off at random, or was that little bit removed trying to cover /.'s ass in the face of a potential libel suit?

Jes' wondering...

--

Someday, you're going to die. Get over it.
Re:You know you've been using windows too long whe by mgoff · 2001-06-26 23:20 · Score: 5

While technically correct, you have to look at the bigger picture. Rebooting may not fix the root cause of the problem, but it could very possibly get the system back online. Who's to say that it's not a 1 PPM problem that won't affect the system again for another hour/day/month/year? Once the packets are flowing again, then you can relax and take the time to root cause the problem and fix it.

You can make a case that valuable troubleshooting info is lost when systems are rebooted. I agree, but counter that all good systems should have detailed event logging. Leaving the system online and intact is the best way to root cause a bug. But, sometimes getting back online as fast as possible is more important.
Re:Responsive techies by anticypher · 2001-06-26 23:57 · Score: 5

Yes, but have you tried dialing that number when Slashdot wasn't down?

Rumour has it the conversation went a little something like this:
[Kurt] Hi, cisco tech support?
[TAC] Yes
[Kurt] this is Kurt at slashdot...
[TAC] Oh my god, its about time you called us. You've been offline for nearly 24 hours, we're all going through withdrawls. Hang on a sec, our top techs are dying to help.

I talked to a friend in cisco TAC (Brussels) who said that they regularly lurk on /., and in the TAC they could see it was a major network outage since the whole of the OSDN sites were unreachable. Nothing to do but wait, or answer calls from other customers :-)

Since summer weather had come to Europe, I, personally, did not notice the outage. But I promise in the futur to not have a life.

the AC

[Note to Kurt and company, make sure you return your customer satisfaction survey. Those TAC folks live and die based on keeping a very high level of sat scores. I think they need a 4.85 (on scale of 1 to 5) just to keep their jobs within cisco, and a 4.89 to get a raise. So 5's across the board, and in the comments put a link to this /. story for their manager]

--
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
Re:More Writeups Needed by Tackhead · 2001-06-27 00:12 · Score: 5

> Really, what Linux (and other geek subjects) need is to have a Great Book of Failure Stories -- writeups like these that detail horrible outages, downtimes, misconfigurations, security hacks, etc., so that we all can learn from other's mistakes.
What you said.
I did a bit of (very junior-level) sysadminning back in my day.
First thing the BOFH told me was "Buy a hard-cover notebook. Not spiral-bound. Not softcover. Write down everything you do. Feel free to doodle and write obscenities if you like. Someday you'll thank me for this".
I was a bit befuddled, and then he showed me his notebooks. Five years of dramatic fuckups and even more dramatic recoveries. His own personal "deja.google.com" (but it was 1992, and long-term USENET searching hadn't been invented yet, hell our office was using UUCP!) for everything he'd had to work out from first principles on his own.
And thus was the PFY enlightened.
(And yes, I did buy him a beer in late 1992, when something I wrote down in mid-1992 jumped off my page and saved my ass.)
Not like my experiences.. by mjh · 2001-06-26 21:47 · Score: 5

So I called Cisco tech support. I wish had done this sooner. I was amazed first of all by how you can talk to a qualified Cisco tech immediately... we're talking an 800 number that you dial and within less than a minute you are talking to a technician.

While I agree that I usually get someone at cisco who knows what they're talking about, it is very rare in my experience that it happens in only a minute, although it does occasionally happen. A much more common experience is to wait on hold for 15-20 minutes, but I have waited on hold as long as an hour with them.
All of that being said, I would have to agree that cisco's TAC is probably one of the best tech support groups I've ever worked with.
--

--
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
More Writeups Needed by tarsi210 · 2001-06-26 22:05 · Score: 5

This is exactly what we need on the 'net for us sysadmins to read. Failure stories. Why? You don't learn much from success stories, because things worked the first time.

"Welcome to the HOWTO. My setup worked the first time. Why didn't yours?"

Granted, noone wants to see stuff on the 'net go down (and we're glad you're back, /.) But writeups like this one and Steve Gibson's at GCR about the DDOS attacks are priceless. They show what people have tried, what hasn't worked, what did work, and definately where to start the next time.

Really, what Linux (and other geek subjects) need is to have a Great Book of Failure Stories -- writeups like these that detail horrible outages, downtimes, misconfigurations, security hacks, etc., so that we all can learn from other's mistakes.

--
Blog,Twitter
1. Re:More Writeups Needed by plcurechax · 2001-06-26 23:38 · Score: 5
  
  If you haven't heard of it before, get your browser over to RISKS digest or comp.risks. Forum On Risks To The Public In Computers And Related Systems
  I discovered the RISKS digest when reading the about software engineering, and it has certainly helped me think about failures and recovery when designing and building systems.
  There is also the underlying thesis in the article about how complexity, whether in a bungled redundant network connection or just a large poorly documented, poorly tested, and poor configured system is your enemy in building reliable systems. A lot of systems were built like Slashdot during the dot.com IPO craze, I wonder how many of us rely on such poorly built systems?
  Building complex, reliable networks is hard, and expensive. About 3 times what your estimate is, which is about 2 times what you boss expects to spend.
Eeep - scary moderators! by Dr_Cheeks · 2001-06-26 22:50 · Score: 5

Hey, try browsing at -1 nested - seems like everyone who's questioned the story about the woman who "quit" (Anne Tomlinson?) has suddenly been modded down. I'm not usually one for conspiracy theories, but is this surprising anyone else? I think that the question couldn't really be more on topic, and they're hardly flamebaits or trolls - what's up?
BTW, feel free to mod me down, prove my point and compound my paranoia; I've got karma to spare : )

--
hang on, i can do this... by Lord+Omlette · 2001-06-27 00:40 · Score: 5

I will now prove, using extremely shaky methods, that "Blow-by-Blow Account of the OSDN Outage" by Roblimo is, in fact, an epic myth.

I. Call to Adventure

"By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem. The network operations people were paged, but did not respond."

II. Meeting the Mentor

CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."

Whoops, that's not it.

"So I called Cisco tech support."

There we go.

III. Obstacles

"Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documetation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here and wasn't available this weekend) or if you've had the joyful opportunity of spending a night trying to trace through it all under pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you."

IV. Fulfilling The Quest

"He bounces the switch... copy startup-config running-config ... the switch resets itself... then email starts streaming into my inbox... then I can ping our sites all of a sudden... we're back online! Everything is back! Weird."

V. Return of the Hero

"The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers."
"Tuesday was router reconfig day."

VI. Transformation of the Hero

"At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be."
"We certainly aren't going to make the same ones [ed: mistakes] again!"

Peace,
Amit
ICQ 77863057

--
[o]_O
Kenny by sdo1 · 2001-06-26 21:52 · Score: 5

Just don't name the router "Kenny".... he dies every week.

-S

--
--- What parts of "shall make no law", "shall not be infringed", and "shall not be violated" don't you understand?
Re:Beware of departure from original statement by Aaton · 2001-06-27 00:25 · Score: 5

Are we really to believe that nobody was available for 48 hours?

Everything failed at about 7AM Sat. Dave was at Exodus between 8:30 - 10:30AM Sat (didn't look at the log book when I got there). Kurt arrived shortly after that I belive (again I didn't look at the log book). I arrived there around 11:40AM. Sat.
And yes my battery was lose on my Nextel. Just takes a little pressure upwards to lossen the batter on the i1000plus I have. The batter doesn't fall out, just loss enought so it lose contact and turns the phone off.
I have now taped the battery in place!
Yazz Atlas
Grr... by American+AC+in+Paris · 2001-06-26 22:50 · Score: 5

All right.
After having been modded down next to the goatse links, somebody please explain to me how the hell we're supposed to discuss the decidedly strange disappearance (and subsequent reappearance) of this story on the site without getting modded as "offtopic"?
Just where, exactly, are we to discuss this little point? For example, why did this story disappear? Was it technical? Was it editorial?
For a group that is so damned keen on openness and truth, it strikes me as somewhat ironic that several dozen mod points have been used to effectively supress this part of the thread.
I want to know what happened. Others do to. If you can't give us a decent place on Slashdot to discuss this issue, then don't mod us down as offtopic!

--
Obliteracy: Words with explosions
1. Re:Grr... by American+AC+in+Paris · 2001-06-27 01:19 · Score: 5
  
  Thank you, Hemos. I was kinda guessing and hoping that it was something technical.
  You'll understand my consternation, though, upon seeing my (admittedly offtopic) post on the Shared Source article regarding the disappearance of this article modded down three points to -1 in the course of roughly one minute, and the seemingly similar fate of a good many other posts like it. Also worth noting is the fact that this article touches on what many would consider a rather sensitive issue with the OSDN and /. crew right now. I don't like conspiracy theories much, but I'll be damned if the situation didn't seem, well, rather odd.
  Might I suggest, though, that once stories are actually posted to the front page, they remain as is, even if the order of presentation is not the most desirable? Consistency is key, and having articles disappearing from the front page is not terribly consistent.
  
  --
  Obliteracy: Words with explosions
How many times do we have to say it... by Bonker · 2001-06-26 21:54 · Score: 5

Security through obscurity is no security.

No matter how FUBAR'd your router/switch/firewall configuration is, it's still no serious obstacle to crackers, Robin.

--
The next Slashdot story will be ready soon, but subscribers can beat the rush and slashdot the links early!
Beware of departure from original statement by aldjiblah · 2001-06-26 22:18 · Score: 5

Quoted from the original Slashdot Back Online article (before it was modified): "And when our qualified personel arrived, we discovered that she wasn't actuually as qualified as we had hoped. Then she quit, thus terminating 3 local star systems."
Where does this mysterious woman fit into the story above?

--
sig sig sputnik