Blow-by-Blow Account of the OSDN Outage
Our network operations staff was shorthanded; one of our most knowledgeable people had quit recently to go into business with a friend and had not yet been replaced. Another was in the hospital, ill and unreachable. A third's cell phone was on the kitchen counter, unhearable from the bedroom, and the fourth one's cell phone battery had fallen out. It was a frustrating comedy of errors, and an unusual one. Our netops staff is typically "on the bounce" 24/7.
Dave Olszewski, an OSDN programmer who is not technically part of our netops staff and is not trained in our equipment setup, happened to be on IRC at the time. He doesn't live far from the Exodus facility in Waltham, MA, where our server cage lives, so he went there immediately. Kurt Gray, lead programmer, whom we dragged out of bed, was not far behind. Hemos and others were awake by then, growing frantic as we found that not only Slashdot, but also NewsForge, freshmeat, OSDN.com, ThinkGeek, and QuestionExchange were down, along with our old -- but still popular -- MediaBuilder and AnimationFactory sites. Arrgh!
This is Kurt's "on the scene" report from Exodus:
Walk into our cage at Exodus and it seems harmless enough, but try to learn what everything is doing and where the wires are all going in less than an hour and you could go insane. You're standing in a nice, clean, uncomfortably air-conditioned facility with 150 of VA's FullOn and various other servers humming away. Greeting you at the door is "Big Gay Al," our Cisco 6509, which contains two redundant router modules: Kyle and Stan. If Stan dies, Kyle takes over and vice-versa. Across the cage are two Arrowpoint CS800 load balancing switches: one is racked and idle (as a hot spare) and the other is live and balancing the load for most of our OSDN web sites. Between the Cisco 6509 and the Arrowpoint is a bridging FreeBSD firewall using ipfw rules to block stuff like ping just to drive everyone nuts basically.
Headshaking all around. Meanwhile, about 11:40 a.m. Yazz Atlas woke up and got his cell phone reunited with its battery. He picked up his voice mail messages, tossed on clothes, and hustled over to Exodus.
"I can't ping your site!"
"Yeah, we know."
Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documentation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here and wasn't available this weekend) or if you've had the joyful opportunity of spending a night trying to trace through it all under pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you.
At this point if you know anything about networking you'll demand an explanation for why we're using each piece of equipment in the cage and not a WhizBang 9000 SuperRouter like the one you've been using flawlessly that even washes your dishes for you and makes food taste better too... I can only tell you that I'm not the networking design person here, I didn't choose this equipment or configure it but I'm told it's very good hardware as long as you know what you're doing, but as CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."
So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.
"Did you reboot the firewall?" I asked Dave.
"I rebooted everything," he said. "I think it's the Cisco."
So we console into the Cisco 6509. What a mess. Neither of us understands how this switch was configured and what it is trying to do. We don't fully understand why you can get a console connection on Stan but not Kyle (turns out the standby module doesn't have active console, that's normal).
Yazz says, "When I arrived at Exodus, Kurt and Dave were trying every combination of things to do to get the 6509 back. But neither they nor I even knew the Cisco Passwords." The op who was supposed to be on duty (the one whose phone was out of hearing) was still nowhere to be found. They called their hospitalized coworker and got the Cisco passwords.
But, says Yazz, "Since the Cisco was rebooted there were no logs to look at. We could ping something on the inside but not everything. On some VLANs we could ping the gateway and others not. The outside world could ping one of the IPs the 6509 handles but not the other. From the inside we could not ping the IP that the outside world could ping. We could ping the one that they couldn't...very frustrating..."
Kurt again:
Several hours of this sort of network debugging went on until 3:00 AM Sunday. By then we had called Cisco for help. They couldn't help us until they saw the switch config and got a chance to review it. We were spent. We had to go to bed and stay down for the night.
The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers. Instead of getting an answer from Exodus and running to Cisco with it, and then back again, he got Cisco and Exodus engineers to talk directly to each other and work it out. He conferenced an Exodus network engineer to Barnaby at Cisco and, Kurt says, "they talked alien code about VLANs, standby IPs, HSRP, multihoming, etc. etc., and they came to an agreement: our switch config was a mess... but at least Barnaby knew what the settings were supposed to be and an Exodus engineer agreed with him."
Next morning we're back at Exodus and the situation hasn't changed -- our network is unreachable to the outside world. I was hoping that during the wee hours of the morning the Cisco 6509 had become sentient and fixed its own configuration or perhaps a friendly hacker had cracked into it and fixed it for us, or perhaps ball lightning would travel down a drain spout and shock our cage back to life like those heart paddles paramedics use... "It's a miracle!" No such luck.
So I called Cisco tech support. I wish I had done this sooner. I was amazed first of all by how you can talk to a qualified Cisco tech immediately... we're talking an 800 number that you dial and within less than a minute you are talking to a technician... doesn't Cisco realize how shocking this is to technical people, to actually be able to talk to qualified technicians immediately who say things other than, "Well, it works on my computer here..."? Do they not know that tech support phone numbers are supposed to be 900 numbers that require you to enter your personal information and product license number, then forward you to unthinking robots who put you on hold for hours, then drop your call to the Los Angeles Bus Authority switchboard... does Cisco not understand that if you do not put people on hold for at least 10 minutes they might pass out in shock for being able to talk to a human too soon? Apparently not.
So I asked the Cisco technician, Scott, to telnet into our switch and take a look at the config. I figured he'd balk and say, "No I can't do that," because of course this is a tech support number I called so he's going to tell me to give the phone to my mommy if she's there and ask her to log into the switch because, since I don't have a lot of experience with IOS, I must be some kind of idiot to even call tech support without knowing what my HSRP configuration is on VLAN 4. Instead he says, "OK, what's the login password?" I can't believe this... I must have dialed the wrong number, he's not going to just go into our switch and sort this out for me right here and now, is he?
So he's in the switch and he's disgusted and horrified by how we have it configured, and I'm sure he's right. So I ask him, "Well, can you change all that?" I figure he'd say, "No, this is your equipment, you fix it yourself," but he doesn't, he says, "Sure, what's the config password?" You gotta be kidding me, I must have dialed the wrong number here... this cannot be a tech support line... you can't actually get a tech support rep on a toll-free number simply to log in and fix your router setup while you whine at him on the phone... this is not real.
So he's in the switch config and he's having a great time pointing out everything some of our people warned us about months ago. He tells me this is wrong, we shouldn't be doing this or that... "Well, then change it if you don't mind," I tell him. "Switch broke. Me dumb. You fix." ...so at one moment Scott wanted to undo some changes. He bounces the switch... copy startup-config running-config ... the switch resets itself... then email starts streaming into my inbox... then I can ping our sites all of a sudden... we're back online! Everything is back! Weird.
Ok, that's all fine, but Scott is still freaked out about how we have the switch configured. Soon I get a call from Barnaby, another hot shot Cisco tech rep. He just logged into our switch and he's horrified too. He wants to walk me through a total switch upgrade and cleanup right now. "Not tonight," I tell him, "I'm burnt and I need to consult some network people over here before we mess with this any further."
Before moving on to the (short) Tuesday outage, here are a few more notes from Yazz:
The one card going bad wouldn't have been such a big deal if the config in both were set up correctly. It was meant to flop to the other interface if the primary card died, which it did, but not with all the info it needed... AKA it was misconfigured...
Tuesday was router reconfig day. It was originally only supposed to cause "about five minutes" of downtime, so it didn't seem worth posting any kind of notice that it was going to happen. Why the middle of the day instead of a low-traffic post-midnight time? Because this way, if there was any trouble lots of people at Exodus and Cisco would be awake and around to help. And it was a good thing this choice was made. Kurt picks up the story:
Exodus really wasn't set up to handle the type of failover the 6509 was meant to do. That's what the Cisco folks said basically, and the Exodus people are no longer supporting this type of Cisco in their setups. Half the VLANs were only stored on one unit and the other half of them on the other. So when one died it only knew half of the full setup and couldn't route things correctly since the VLANs it wanted weren't there... Fun!!!
Tuesday 11:00 a.m. we're back in the cage. Barnaby is logged into our switch while he's talking to me on my cell phone (which disconnects every 5 minutes just to make my day more challenging), helping us by upgrading the Cisco 6509 firmware, then he's going to clean up the config. First step was getting the firmware patches onto a TFTP server near the switch (had to be less than 3 hops from the switch, TFTP doesn't work over longer hops). Yazz took care of that. From there Barnaby patched the firmware, had me reboot the switch, and we should be down for just 5 minutes. Unfortunately 5 minutes turned into 2 hours.
This has not been OSDN's finest week. But we thought it was better to give you the full rundown than try to pretend we're perfect. At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be. If nothing else, perhaps this story can help others avoid some of the mistakes we made. We certainly aren't going to make the same ones again! (~.*)
After the switch reboot part of our network was unreachable again, much like Saturday's episode only this time with a Cisco rep on the phone helping us work it out. Again we started tracing cables all over the cage, pinging every corner of the matrix. Barnaby got an Arrowpoint tech rep, Jim, on the line and into our Arrowpoint. But this is tech support, Jim isn't just going to log into our Arrowpoint and debug it for us, right? Wrong, this is Cisco tech support: Jim logs into our Arrowpoint and works with Barnaby to trace packets and debug our network.
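The failure Yazz describes can be illustrated with a toy model. This is only a sketch under stated assumptions: the module names come from the story, but the VLAN numbers and the `routable_vlans` helper are made up for illustration and do not reflect the real 6509 configuration.

```python
# Hypothetical model: two redundant router modules, each holding some
# part of the VLAN configuration. VLAN numbers are illustrative only.

def routable_vlans(modules, failed):
    """Return the set of VLANs still routable after the `failed` modules die."""
    alive = [vlans for name, vlans in modules.items() if name not in failed]
    return set().union(*alive) if alive else set()

all_vlans = {1, 2, 3, 4}

# Split config: each module knows only half of the VLANs.
split = {"stan": {1, 2}, "kyle": {3, 4}}
# Mirrored config: both modules know every VLAN.
mirrored = {"stan": set(all_vlans), "kyle": set(all_vlans)}

for label, modules in (("split", split), ("mirrored", mirrored)):
    survivors = routable_vlans(modules, failed={"stan"})
    lost = sorted(all_vlans - survivors)
    print(f"{label}: VLANs lost after stan fails -> {lost}")
```

With the split layout the surviving module can only route the VLANs it already knew about, which matches the "could ping some things but not others" symptoms; with the mirrored layout nothing is lost.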
For a while we put a cross-over cable in place of the firewall just to be sure the firewall box wasn't jamming us. Nope. Didn't help. Barnaby and Jim are mapping hardware addresses to IP addresses to figure out where each packet is going. Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light leading it to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"
"Hold on," he says over the phone to me. Then the light goes green, and after a few seconds of routers correcting their spantrees we're back online. Everything is back online. All this time it was this little interface to an ignored switch that none of us bothered to account for. Make a big note about it in the network documentation, please.
After we came back online Barnaby went ahead and cleaned up our switch configuration, put things the way they ought to be, made our connections sane and stable.
Please explain, if you can, why CmdrTaco wrote "we discovered that she wasn't actually as qualified as we had hoped."
It is obvious that he tried to cover up this unfortunate choice of words.
What is not so obvious is why you have written a story which causes his earlier account to make absolutely no sense. If "she" was the tech who had quit earlier, what on earth did her qualification have to do with the outage?
..not to throw pennies into the router.
I thought that it was only Windows that needed rebooting.
We have a $MEDIUM_NUMBER contract w/Cisco on our 6509 and 2900 switches, and Cisco ABSOLUTELY ROCKS! when it comes to support. I've had to pick myself up off the floor a couple of times while working with them... Cisco provides the best tech support on the planet, bar none!
Cisco's daytime support (from a US perspective) is handled in the US, while calling after hours gets you the UK. While the US guys are good, the UK guys are excellent.
I have only read a small portion of this, since I now have to go on shift in the Cisco TAC. :-)
I sure have enjoyed the thread though... I've been with the TAC for nearly 7 years, first in Routing Protocols, and now a "newbie" in switching. (Boy, that's an odd feeling!) Most people burn out after a year or two, but I actually _enjoy_ making my customers happy as I fix their networks. I think that is what makes the difference here, most of us like our work and really _want_ to help. We have a lot of fun in our jobs here, laughing with each other and customers, and making the networks hum again...
I came from SynOptics TAC, and there is a huge difference at Cisco. We can actually make a difference here, suggest ways to improve, and are empowered to make decisions for ourselves and customers, reach developers, etc. as a normal part of our duties.
My hat is off to John Chambers for being so candid, easy to approach, humorous, and having a great sense of how to manage Cisco so well.
The reason it is not so good for our TAC if a P1 isn't resolved in 4 hours is that a business rule fires somewhere and John gets a page. :)
And when John gets a page... people die. (Just kidding but I was watching Austin Powers the other day and couldn't resist throwing that in.)
Your post is violating the principle of causality!
Don't let women near the computers!!!
8M in cisco hardware? oh, they bought ONE 12000. Cisco is a horrifying rip off.
Well deserved though.
I think that it's called a lawsuit (or threat thereof). The original post could be considered libel, because if the person in question went somewhere else and had /. on her resume, she might have a really hard time convincing them that she's worth a darn. IANAL, but it's just my thoughts....
I have worked in the Cisco TAC for about 2.5 years. Currently on the routing protocols team (EIGRP, OSPF, BGP etc.) Prior to coming here I had never dealt with people this obsessed with getting everything right all the time. Really they drill it into you. Mandatory perpetual training and such.
Many people who call don't understand how the system works internally so here's a summary: We have cases in 4 groups, priorities 1 through 4, 1 being the most important. The designation of the priority of the case is entirely up to you as a customer. All cases are P3s by default which more or less means they need resolution within 72 hours. If your network is down and you need help right now, today with no waiting we'll elevate to a P2. If you are in a serious network situation like the one described in the article then it's a P1 and literally everything else stops, a bell goes off and everyone crowds around the tech w/ the problem (unless it's a softball case).
There are TACs all over the world but for English-speaking customers what usually happens is the US TACs roll over to the Australian TACs in the early evening who in turn roll over to Belgium and then back to the US. P1s get worked 24 hours until they're resolved, and if they're not fixed in less than 4 hours it's not so good for us.
We have to close about 5 of these cases a day which is sometimes cake (I can't ping my interface which is shut down) and sometimes nasty (redistribution 12 times over).
Also, those little surveys you get every time you work with us (Bingos) are very important. If you'll recall you can rate us from 1 through 5 in 8 to 10 different categories. Anyone who doesn't maintain an average of at least 4.59 is not long for the TAC, 2 or 3 months tops.
The pay is actually kind of crap but there's no better place in the world to prep for your CCIE. I don't think anyone views the TAC as a long-term environment. Too much stress honestly.
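For readers who think better in code, the rules in this comment could be written down roughly as below. It is only a reading of the post, not Cisco's actual system: the names `TARGET_HOURS`, `is_overdue` and `bingo_ok` are invented here, and the targets for P2 and P4 are left as `None` because the comment does not state them.

```python
# Resolution targets in hours as stated in the comment above.
# P2 and P4 targets are not given there, so they stay unknown (None).
TARGET_HOURS = {1: 4, 2: None, 3: 72, 4: None}
DEFAULT_PRIORITY = 3   # "All cases are P3s by default"
BINGO_MINIMUM = 4.59   # minimum average survey score mentioned above

def is_overdue(priority, hours_open):
    """True if a case has been open longer than its stated target."""
    target = TARGET_HOURS[priority]
    return target is not None and hours_open > target

def bingo_ok(scores):
    """True if the average of the 1-5 survey scores meets the minimum."""
    return sum(scores) / len(scores) >= BINGO_MINIMUM

print(is_overdue(1, 5))            # a P1 open for 5 hours
print(bingo_ok([5, 5, 4, 5, 4]))   # average of 4.6
```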
It ain't Robcode anymore. :)
Yeah, I'm that guy.
Very easy - someone typed in the wrong time. We deemed the shared source to be more important, and that was supposed to go up first.
Not everything is a conspiracy folks.
Yeah, I'm that guy.
--
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
No black eye for Exodus, please. Our router config was not a standard one they support. Exodus dude Derek Lam, especially, went way "above and beyond" this last week.
- Robin
There is no such thing as a small support contract with Cisco - you basically pay the cost of the equipment again every two years as a minimum, IME.
See I run a Linux router/firewall, and I do that too. 99% of the time it works. X fubared the display? No problem, reboot and it works again. ipfilter stopped forwarding NAT packets for some reason? No problem, restart it and it works again. etc.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Not anymore, because I've come to determine that X sucks, which is why it's now just a firewall/router box (headless). It used to be an attempt at a Linux workstation before I gave that up...
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
I also agree... This is what "hacking" is really about, solving complex problems through ingenuity and diligence.
Yes you can, for example browse this article at -10 like this: http://slashdot.org/comments.pl?sid=01%2F06%2F27%2F124207&cid=&pid=0&startat=&threshold=-10&mode=thread&commentsort=3&op=Change
I'm not sure, but not under this article anyway...
Posted by polar_bear:
This is on a different scale entirely, but I host several small-traffic Websites through PHPWebHosting.com, and they rock. Their support is only via email (hey, it's only $9.95 a month! to host w/them...) but they're very quick, efficient and cheerful. (Yes, cheerfulness can come through in email...)
They offer MySQL, PHP, Perl and (I think...) Python - though I haven't done any Python. A few weeks ago I asked if it would be possible to install a few libraries they didn't already have installed so that I could use Midgard - I expected a "maybe in 6 months" or "are you kidding?" but they were happy to help and said they'd install them ASAP - which turned out to be less than a week. I'm wholly pleased with their service - if you need to host a small site, they're awesome.
Sites I host with them:
http://www.zonkerbooks.net/
http://www.dissociatedpress.net/
http://www.linuxroutingbook.com/
You put our favorite news engine in the middle of a routing mess that the network engineers had been warning you about for months?
What were you thinking?
You must be able to find a nice, comfortable colocation site somewhere.
..and then she was erased from the latest "official" version of the story. What the fuck is this? This isn't a "blow-by-blow account", it's a service pack to fix the "bugs" in your last account of what happened!
And we have always been at war with Eurasia.
..and then she was erased from the latest "official" version of the story. What the fuck is this? This isn't a "blow-by-blow account", it's a service pack to fix the "bugs" in your last account of what happened!
Go on! Mod me down to -1 again! You'll have to do it a few times before I go below the "post-at-2" threshold!!
I'm not sure what's happening with moderation but since so many people want to know: One of our netops quit suddenly Sunday without any explanation, I assume she was put off by being called in on a weekend and being asked to stay late until it was fixed. I don't know, but these things happen so we deal with it. One thing you don't want to do is publicly flame someone who still has your root passwords (although I trust this particular person with our root still), besides we're not mad at her, wish her well, sorry things didn't work out.
Only #2 ever gets to see #1. What I want to know is, where is #6 in all this?
~^~~^~^^~~^
Was anyone else waiting for the "*clickity-click* Wow, it looks like your entire root directory was deleted!" punchline? :-)
"Pinky, you've left the lens cap of your mind on again." - P&TB
"I can see my house from here!" - ST:
Oracle are a big company, and vary hugely in the support they give you. I've had situations where I've been given the runaround, like you. Getting passed from extension to extension, explaining my problem over and over again, "oh, umm, we don't do that stuff here, call this number..." and finding out that Bob's on holiday and his secretary has no idea who else I could speak to...
I've also had situations where Oracle have said our engineers aren't sleeping until this gets fixed, and a few hours later there's a motorcycle courier at my door with a gold disc containing a brand new build of Oracle with the bug fixed. I've had Oracle techs ssh into my servers, I've had them come to the data centre with mysterious CDs containing Oracle software that they don't let outsiders have, and that they erase from your machine once they're done.
Helps to have (or at least have access to) a high-end support contract, tho'. If you're some kid who downloaded 9i onto his Red Hat box, forget it.
Cisco rocks for support. I can open a case online, within 20 minutes have a call back. And, I can say I've tried x,y,z, and ask if they have any ideas, and say 'I think there is a hardware problem', and they go 'ok, what's the part number?' and have a part on my doorstep the next day. For what it's worth, if you ever do anything mission critical, or pseudo mission critical with Cisco stuff, ask one of their SE's to look over your config. They'll do it for free if you are buying their hardware (and, I suppose, are paying enough.) I know that the 6509 in this case was probably around $120k, so they probably would have done this for free, and it could have alleviated this problem in the first place. Then again, hindsight is always right. Then again, I've also had good luck with Sun support on hardware. I haven't used them for software that much, however.
Uh... TFTP uses UDP, which is a connectionless protocol, you can of course transfer files over more hops, but keep in mind, the more routers, etc you have in the middle, the more chance of a packet being dropped, and one packet can mean quite a bit when you're transferring a new IOS image to your cisco ;)
Now it's been quite some time since I've looked at the TFTP RFC but I'm pretty damn sure it has the capability to request a block be retransmitted in the case of a timeout (packet loss). In fact, I'm sure of it; during the upgrade a few '.'s were noticed amongst a ton of '!'s and the checksum still worked out.
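To make the retransmission point concrete, here is a minimal simulation. It is not a TFTP client and makes several assumptions: the `transfer` function, the loss model and the retry limit are invented for illustration, and the end-of-file handling from RFC 1350 (a final short block) is left out. It only shows that a lock-step protocol which resends each 512-byte block on timeout still delivers an intact file over a lossy path, with '!' marking a delivered block and '.' marking a timeout as in the upgrade described above.

```python
import random

BLOCK_SIZE = 512  # TFTP data blocks carry up to 512 bytes (RFC 1350)

def transfer(data, loss_rate, rng, max_tries=50):
    """Simulate a lock-step, resend-on-timeout transfer over a lossy path.

    Returns (received_bytes, progress), where progress holds one '!' per
    delivered block and one '.' per timeout that forced a retransmission.
    """
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    received, progress = bytearray(), []
    for block in blocks:
        for _ in range(max_tries):
            if rng.random() >= loss_rate:  # block and its ACK both arrived
                received += block
                progress.append("!")
                break
            progress.append(".")           # timeout: send the block again
        else:
            raise TimeoutError("gave up after too many retransmissions")
    return bytes(received), "".join(progress)

image = bytes(range(256)) * 20             # stand-in for a firmware image
copy, progress = transfer(image, loss_rate=0.3, rng=random.Random(1))
print(progress)
print("intact:", copy == image)
```

Packet loss shows up as extra '.' marks and a slower transfer, not as a corrupted image, which is consistent with the checksum working out.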
Was this configuration ever tested?! It sounds like it was put together, prayed over and sent out into the world.
it would have been simple to test too... pull out one of the uplinks... then the other... now try pulling out some of the webservers... and so on.
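The pull-one-thing-at-a-time test suggested here can be written as a small loop. This is a sketch, not a real monitoring tool: `failover_matrix`, the `probe` callback and the toy topology are all hypothetical. In practice `probe` would wrap an actual reachability check run while a person unplugs the named component.

```python
def failover_matrix(components, targets, probe):
    """For each single component pulled, list the targets that go dark.

    `probe(target, down)` is supplied by the caller and returns True if
    `target` is still reachable while the components in `down` are unplugged.
    """
    report = {}
    for component in components:
        down = {component}
        report[component] = [t for t in targets if not probe(t, down)]
    return report

# Toy topology (made up): each target lists its alternative paths, and a
# path works only if none of its components is down.
topology = {
    "web": [{"uplink-a", "switch-1"}, {"uplink-b", "switch-1"}],
    "mail": [{"uplink-a", "switch-2"}],  # no redundant uplink configured
}

def toy_probe(target, down):
    return any(not (path & down) for path in topology[target])

report = failover_matrix(
    ["uplink-a", "uplink-b", "switch-1", "switch-2"], list(topology), toy_probe
)
for component, lost in report.items():
    print(f"pull {component}: unreachable -> {lost}")
```

Any component whose list is not empty is a single point of failure for those targets, which is exactly what a pre-launch test would have surfaced.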
By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem.
Reboot the database?? WTF? You just proved my point as to why MySQL is NOT ready for primetime. Reboot the fscking database??
So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.
Guys, this isn't Windows -- Rebooting is an absolute last resort and if it works then you have discovered a problem, either in hardware or software and it needs fixed, not just a "oh well, a reboot fixed it, life goes on." Bastions of professionalism you're not.
I don't normally flame people for this kind of thing but the Slashdot crew are especially keen on bashing Windows, yet you resort to their exact tactics whenever a problem comes up.
Reboot the database?? I still can't believe I read that. Sorry.
Cisco Systems have some wonderful systems -- Hell I just recently found out about their stack trace analyzer... feed it a "sh stack" and it emails you back a list of IOS and/or hardware bugs which likely caused the crash. That is just plain old SCHWEEEET. Or being able to read their memory mappings to find out what is causing a bus crash... Ideal. You don't just randomly reboot the damn shit to try and get it to work. If it isn't working something is causing it. Embedded systems are generally pretty good at throwing up the red flags; you just need to look for them (logs, stack traces, extensive use of the debugging facilities...) Use the tools at hand instead of the big red button!
First step was getting the firmware patches onto a TFTP server near the switch (had to be less 3 hops from the switch, TFTP doesn't work over longer hops).
Unless this is something specific to the IOS or router, that's bullshit. I just upgraded 5 AS5248s to IOS 12.1(9) with a TFTP server that is 8 hops away. I'm not aware of any TTL issues with TFTP.
Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light leading it to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"
You mention that your network documentation is shitty -- I sure as hell hope you'll push to have it upgraded and maintained with a high degree of readability. Even complex systems do not have to be undocumented just because they're complex. Use pictures, use words. I haven't found anything in IT which cannot be explained by a combination of both. And throw in a glossary for the non-techies like yourself who are called upon to fix it. :-)
Don't get me wrong; I'm glad you're back up. But this could have been prevented. Very easily from the sounds of it. I hope you did fire your cisco admin; it sounds like s/he didn't have a clue and was too terrified of losing his/her job that s/he didn't ask for help. Cisco has mailing lists, tons of documentation and there are many IRC channels to ask for help.
Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outside world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.
Up until recently you had no choice but to telnet to Cisco equipment. I came up with a quick solution: deny telnet from anywhere but a same-segment computer (in our case, it's our RADIUS authentication box). Now ssh to the server and telnet from there to the NAS. Problem solved. :-)
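The rule described here boils down to "accept telnet only from one management host." The check below is a sketch of that logic in Python, not router configuration: `telnet_allowed` and `ALLOWED_SOURCES` are invented names, and the addresses are documentation-range placeholders rather than anything from the poster's network.

```python
import ipaddress

# Hypothetical allow-list: only the jump host (e.g. the RADIUS box) may telnet.
ALLOWED_SOURCES = [ipaddress.ip_network("192.0.2.10/32")]

def telnet_allowed(source_ip, allowed=ALLOWED_SOURCES):
    """Return True only if source_ip falls inside an allowed network."""
    src = ipaddress.ip_address(source_ip)
    return any(src in network for network in allowed)

for candidate in ("192.0.2.10", "192.0.2.11", "198.51.100.7"):
    verdict = "permit" if telnet_allowed(candidate) else "deny"
    print(f"{candidate}: {verdict}")
```

Everything else reaches the equipment by first logging in to the allowed host over ssh, so the cleartext telnet hop never leaves the local segment.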
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
While I usually agree, sometimes it is necessary to do a quick check. Even with the number of blackhats out there the chances of them doing anything significant (or anything at all) for the 2-5 minutes you have the firewall out are insignificantly small.
I don't know about Cisco's daytime support, but I can confirm that if you call 'em at three or four in the morning, they're incredibly helpful. I had to pull an all-nighter to get a site live a couple of months ago, and they spent a number of hours on the phone with us figuring out why the traffic wasn't routing (turns out this particular firewall didn't like doing NAT when the internal IP address had a number in the 60's--don't know why). Very polite, knowledgeable, and willing to help--certainly the high point of that particular hellish project.
... the continued poor performance of the site. Connections frequently time out or are very slow. Quite often, garbage is returned, or the wrong page. To all intents and purposes, the router/switch might as well be down more frequently. Just trying to get this posted has taken half-an-hour whilst I waited for every link to stop taking me back to the main page (with the login box present).
I can confirm this. I've been a network consultant for almost a decade, primarily as a Cisco router/switch jock. I've dealt with the TAC (Technical Assistance Center) too many times to count.
Hold times can vary, depending on time of day, but are never as bad as the stories from other companies. In most cases, you are on the phone with a real, live engineer within 5 minutes.
90% of the time, the engineer you are transferred to will be able to get your problem corrected. On the few occasions where they have not been able to help me, Cisco has moved mountains to get the right people involved. I had an issue with Serial SNA - DLSW+ encapsulation last year that was escalated to the point where the guy that wrote that portion of the code for IOS was on the phone, and was prepared to come to my client's site (True, they had purchased about $8M in hardware...).
You do, typically, have to have a Smartnet contract, but as other posters have pointed out, if the problem is not hardware related, they will generally help you straighten out your configurations even without the contract.
A lot of people like to make comparisons between Cisco and Microsoft. Anyone who has dealt with the two will be quick to dispel any similarities. Cisco is a first-rate organization, with first-rate support, and I've made a career out of working with their products.
For those that would die defending it, Freedom
has a sweet taste that the protected will never know.
I have been TAC'd "around the world" literally with Cisco support; One TAC case lasted 32 hours, all on the phone. We went from California, to the East Coast, to Brussels/Belgium, to Egypt, to Asia then Asia-Pacific and back to California. We had several problems that basically caused us to create a new core network from other pieces of equipment, tear down and rebuild the original router from the chassis up after a bad power supply ate the old one, and each and every card in it. This was a few years ago before everyone could afford or manage completely redundant network infrastructure, and things like 2 hour turnaround on hardware were supposed to alleviate things like this. The problem was, some of the cards would pass first level diags, but not run for long. Each part got there in less than 2 hours tho! It was one of those 'one in a million' cases, but the rep on the other end of the phone was cooperative the whole time.
That said, I've also had low priority cases where they don't respond for weeks; It's almost to the point at times that anything I've opened gets opened at Medium priority (business impact) or higher.
You only get the 1 minute response if you hit the Network Down Emergency option.
True. Even at the best of the best, there will be better and "less better" people. And anyone can have an off day.
OTOH, I also had the experience of a TAC rep spending 2 hours on the phone with a competitor's tech support line, explaining to them why their config wasn't working. He was right, too.
A good long-term sales tactic, though: guess whose product I specified the next time.
I do wonder what will happen to the quality level at Cisco TAC with the recent layoffs, though. The first sign of impending doom at both WordPerfect and Novell was when the tech support quality suddenly headed down the tubes.
sPh
Yes, if you have a SmartNet contract for that device, it's pretty much true. Cisco, mid-1990's Novell, and Oracle are the only organizations I know of that provide this kind of help. Microsoft "Gold" support plan, anyone? (gag).
Caveat: Cisco basically does not have first level support (i.e. "'Is the router plugged in?' 'What's a router?'") - you are supposed to have second level knowledge and have completed the first level troubleshooting before you call TAC.
But - I have been out of the office and had brand-new network techs call Cisco with a problem, and they did help out even then.
sPh
There's a reason why their stuff's pricier than the rest- it's overall reliability (except on their low, low end...) AND the support.
They really ARE this responsive.
I am not merely a "consumer" or a "taxpayer". I am a Citizen of the State of Texas
Actually it isn't. As the other respondent to your comment pointed out, it's possible to determine system type from the ICMP responses. One should also realize that not all exploits use fragmented ICMP attacks. There's all kinds of abuses of ICMP that could conceivably be used to take a system down. It's better to nip any of those in the bud for a high volume site or set of sites.
I am not merely a "consumer" or a "taxpayer". I am a Citizen of the State of Texas
Luckily, /. is monitored, this historical event will be kept in the monitoring systems for ever and ever ;)
Go to the monitoring system page.
Click the www.slashdot.org link
Select services
This will give you some graphs showing the outage.
A much more common experience is to wait on hold for 15-20 minutes, but I have waited on hold as long as an hour with them.
Well, in this case Slashdot was down. That can explain the instant response.
__
__
Men with no respect for life must never be allowed to control the ultimate instruments of death.
GW Bu
--Dan
The Attacks on GRC.COM.
Jason.
Er, all Nextel phones are made by Motorola.
Of course it's a troll. This conspiracy crap is just a bunch of idiots. Don't they have anything better to do?
Sun is ceasing its open-source program, which now turns out to be an experiment. DirecTV is being hacked to possibly allow open content. And for Chrissakes, Microsoft floated a trial balloon about "shared source" and is waiting to see the community's reaction before they decide on the terms of the license.
And what will be the Microserf's report to his boss about Slashdot's reaction? "Boss, we floated the shared-source balloon, and nobody seems to care -- they're awful concerned about a woman named Anne someone who doesn't even seem to exist." "Excellent. Deploy the death ray, oops I mean the we-share-our-source meme."
Open your fucking eyes and look at the big picture. Don't get dragged into this pedantic navel-gazing meta-meta-meta bullshit. Some days I think our readership really has gone to hell. You all suck. Bah.
Jamie McCarthy
jamie.mccarthy.vg
I recently purchased a Grand Junction FastSwitch
2100 off eBay. The FS2100 is the same hardware
as the Cisco Catalyst 2100 - Cisco just bought
GJ, and repainted and re-logo'ed their current
product line.
Anyway, the switch had a console password set,
and I couldn't get in. The Cisco web page on
bypassing console passwords didn't work (said
"If you're running xx revision call TAC").
I called TAC, opened up a case, told them it
wasn't urgent, and prepared to wait a couple
hours.
Five minutes later, a guy in Australia calls
(this was 9pm CST or so), asks for the serial
number of the switch, takes a minute, and
proceeds to give me the hard-coded override
password so that I can get into the switch,
change the settings, and update the firmware.
That quick response - on a clearly NON-priority
case, and I didn't have any kind of support
contract, and wasn't the original owner of the
hardware.
I'm *still* impressed. Cisco costs more, but
when stuff is broken, they WILL fix it.
Cisco tech support has done -nothing- but be the best, most understanding (especially in network down situations) and knowledgeable support I have ever come across. You owe those guys a beer.
Mmm.. beeer.
Hey, the big 4 he was mentioning happen to have a LOT of hardware out there. Whether or not they suck is moot. If they're out there, and out there THAT much, they've gotta have the support their Service Level Agreement customers deserve.
I called cisco one night, while replacing my Nortel BCN with a cisco3662. For some reason, I couldn't get my BGP peer established with Sprint. It was 3am Central US time when I called cisco. I first talked to an individual who stated that they were on a callback. I figured that I would get a call in 45 minutes to an hour. 3 minutes later, my phone rang, and it was a gent from Belgium. He logged into my router for me, found the bgp error immediately, fixed it, and I was on my merry way. He even fixed some of my access-lists for me while he was there.
:)
Cisco has _the_best_ customer service that I have ever seen. It is good enough that I don't mind paying a bit more for the hardware, because I know that if it breaks, there will always be someone to help me out.
And, I don't work for cisco
Joseph W. Breu
Not always true. I used to admin JSP-based web servers. My experience is that the Java virtual machines that serve JSP pages have a way of starting to act funny. Stopping and restarting the services fixes the problem.
If I was ever building a network, I would not allow JSP to be a part of the network for this very reason.
Then again, if a JSP guru knows what can cause a JSP engine to act wonky, or how to set up a JSP engine so it is stable and doesn't need reboots, please post a follow up describing how to do this.
- Sam
The secret to enjoying Slashdot is to realize that it should not be taken too seriously.
Of course the original story, or, I should say, some of the versions of the original story (how often can you rewrite the original and it still be the original?) mentioned "...when our qualified personnel arrived, we discovered that she wasn't actually as qualified as we had hoped. Then she quit..." which doesn't sound like someone who was already not working there anymore before the troubles started, so I assume that we're talking about 2 different people here, only one of which was identified one way or the other by sex/gender.
Quite a ways down in the responses to the aforementioned "original" story is an AC post signed Anne Tomlinson that seems to give another perspective on the events that weekend. It's a little ways down the page from another post that has some of the different versions of the original story.
I see even classic Slashdot is now pretty much unusable on dial up anymore.
For all we know they are and we can't browse low enough to see it.
I see even classic Slashdot is now pretty much unusable on dial up anymore.
Okay, so were there any posts at -2 or lower?
I see even classic Slashdot is now pretty much unusable on dial up anymore.
Cisco tech support, for me at least, has been nothing short of amazing. With any problem it takes a few moments after you call in to get someone on the line, and it's 24/7 and anywhere in the world apparently. One night at 3AM, we called up Cisco and talked to someone in Australia apparently. Cisco has an awesome VoIP telephony network that they use to handle all these calls. And Cisco is also willing to config whatever you need. Pity not all companies are like this :(
-
aphex
I Steal Music!
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
While the crossover cable was active I broke into slashdot and stole the source code and I'm selling it on eBay.. Just look for open source slashd code.
"It's not like your minds are as open as the source you love..." - Me to the majority of Slashdot.
"Next time you think you are calling technical support droids, next time you think that you will be put on hold for hours, be careful, you may be placing a call to ... the twilight zone."
Yike, I say. Yike. Competent tech support does not exist in this earth. What planet is Cisco on, and to what worthy cause can I donate money to see that humans never send a manned mission there and pollute this fascinating superior alien culture?
--G
Look at it this way: every time he comes back again, all by itself! Other people die once and for all...
Szo
Red Leader Standing By!
Service Level Agreements are all well and good, but if the cell phones are dead or whatever, all it does is give you a better excuse to yell when it's all over. SLA's don't change the laws of physics, or violate Murphy's law. Itshay appenhays.
Ironically apparently this story was pulled (I got here by a direct link). Anyone else envisioning a fist fight happening in the OSDN server room?
Agree with you to a point, however there is lots of equipment out there (Cisco switches definitely being among them...especially with the older software versions) where a problem suddenly crops up yet there have been zero changes in configuration or demands: whether it's a bit that was changed in volatile RAM by a solar disturbance, or a power fluctuation that disturbed the microprocessor, or an errant software pathway that is seldom travelled that a particular piece of data "exploited", and in these cases a reboot does fix the problem (I've seen this in network switches, routers, phone PBXs, and of course PC machines). It's hardly a scientific method, but when you don't have qualified staff to evaluate the system sometimes it's your best hope.
Where I do agree with your assessment is the classic where someone modifies a couple of settings/wires/configurations and then problems occur, and then they think rebooting will fix them (usually while telling coworkers that they didn't touch a thing...it just "went wonky"). Correlation/causation.
While from the story it appears that the routers were configured in an odd manner, evidence has shown us that the site obviously worked before this outage. So the question then is what changed that caused the system to collapse? Did someone try their hand at playing with the settings?
Again: Obviously the OSDN network was working with this configuration, so what happened that caused the collapse? Was the router 0wZ3d?
Best computer support in any category that I've run into was back in the very early '90's. Phar-Lap had a bunch of smart guys there who wrote their DOS extender. I called them once to ask them a simple question, and happened to mention some of the frustrating things I was running into on an unrelated problem. Without even blinking, the support guy proceeded to give me great advice on how to fix the problem, and gave me a detailed explanation at the BIOS level of what was going wrong. I was amazed. They didn't have to do that, first of all. The level of pure competence displayed by a tech support person is something that I still remember clearly.
If tits were wings it'd be flying around.
This topic comes up many times on comp.risks: there's no point in having a backup (server, archive, database, router, etc.) unless you TEST your backup procedure to make sure it works. Pull the plug on the server - does the backup kick in? Kick over the router - does it fail over to the backup? Those who ignore the RISKS digest are doomed to repeat it!
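A rough sketch of what such a drill checks, in Python. This is only a simulation under stated assumptions: the node names are borrowed from Kurt's report, and `pick_active`, `failover_drill`, and the health check are hypothetical helpers, not anything OSDN or Cisco actually runs. A real test still means pulling the plug in a maintenance window and watching what happens.

```python
from typing import Callable, Dict, List, Optional


def pick_active(nodes: List[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None


def failover_drill(
    nodes: List[str], is_healthy: Callable[[str], bool]
) -> Dict[str, Optional[str]]:
    """For each node, pretend it has failed and record which node would take over.

    A value of None means that losing that node leaves nothing in service.
    """
    results: Dict[str, Optional[str]] = {}
    for failed in nodes:
        results[failed] = pick_active(
            nodes, lambda n: n != failed and is_healthy(n)
        )
    return results


if __name__ == "__main__":
    routers = ["stan", "kyle"]  # assumed priority order

    # Hypothetical case: the standby has never been tested and is misconfigured.
    broken_standby = {"kyle"}
    report = failover_drill(routers, lambda n: n not in broken_standby)
    for failed, takeover in report.items():
        print(f"if {failed} dies -> {takeover or 'OUTAGE'}")
```

The useful part is the `None` entries: they are the failures you would rather find in a drill than at 6 a.m. on a Sunday.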
--Jim
Not sure if anyone posted this yet, but you could have recovered the enable password using the instructions here http://www.cisco.com/warp/public/474/pswdrec_6000access.shtml, which took me about 15 seconds to find on Google (first response for 'cisco 6000 reset password').
:-)
And OSDN technical people needed Cisco tech support's help to upgrade the IOS?
Perhaps OSDN should make passing CCNA mandatory for their networking people (as well as brushing up on their Googling skills).
We'll just have to wait for an article giving a blow-by-blow account of the Slashdot outage article's outage.
Ita erat quando hic adveni.
This plot is a total rip-off of the Miyazaki Classic: "Hagamaki Ortifunk", or (from its American release) "Whistling in the Dark, with Daisies". Can't Americans think of anything original anymore? Could they ever?
I'm working on a web site to expose this travesty to the world. I'm sure everyone will be impressed with my esoteric knowledge of this classic of Japanese animation.
by Mike Buddha -- Someday the mountain might get him, but the law never will.
...Cisco is reporting a projected 40% upswing in earnings for the next quarter, after a favorable review of their technical support personnel on the discussion site Slashdot led to a surge in sales for support contracts.
"It's the first time the Slashdot effect has been a productive one", said an unnamed Cisco official, pausing briefly to dodge a large bag of cash sailing through a nearby window.
Jay (=
At an old job we had a wee Cisco 1604 router, just doing ISDN for our /24 (at the time ISDN was the only affordable thing in our area)
I had a problem with something and mailed Cisco. No more than an hour went by and I had email from a real life person in front of me telling me what to do to fix our problem.
Cisco isn't cheap, but you do get what you pay for.
grubTrolling is a art,
i've always gotten nothing but 100% help from the cisco tech line. i guess there IS a reason for paying $20k for a router.
i can't believe exodus doesn't have a cisco person that could fix that problem.
cisco - kudos
exodus - slight black eye
OSDN sites - inevitable downtime
moral of the story:
don't fire the network people, and make sure all network people are on the same page. there's no reason to even have 4 on-call people if none of them are going to be available. also, documentation, documentation, documentation!!!
Why read the article when I can just make up a snap judgement?
I found it interesting that it clears the logs - shouldn't such a big, expensive piece of HW have at least some non-volatile storage? Minimum it should log to a separate box that does have a disk drive, since I could see how incremental logging to flash parts might have some issues.
While I'm at it, this article seemed pretty slim on the details about the mystery woman. It didn't sound like anyone quit, in fact. I expect the full details on the story...
...about as soon as I get the full story on my URL above. In other words, I'm not holding my breath :)
Caution: contents may be quarrelsome and meticulous!
Your right to not believe: Americans United for Separation of Church and
For those of us who aren't netops, it was an interesting read. If it bugs you so much, just don't read the article!
Caution: contents may be quarrelsome and meticulous!
Your right to not believe: Americans United for Separation of Church and
There's a big difference between always walking around with your pants around your ankles, and briefly lowering your pants for a minute when your doctor wants to do the "turn your head and cough" thing.
---
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
Sometimes rebooting will fix the problem. Sometimes you don't have any alternative. Sometimes you can't fix the problem, but you can get things working again (e.g., Windows). And rebooting may be the best (or only) way to do that.
It is clear that they were out of their depth. It is clear that they didn't know what they were doing. They knew that they didn't know what they were doing. But the experts were unreachable. So they tried something that sometimes works. I really don't see how you can fault them for that. It would, of course, have been better if they had known what their choices and options were, but they didn't.
I wouldn't have either. Probably most of us wouldn't have.
Caution: Now approaching the (technological) singularity.
I think we've pushed this "anyone can grow up to be president" thing too far.
Four words: (D)DoS by ping flood.
-- Veni, vidi, dormivi
> Reality corrupted. Reboot universe? (Y/N)
Shouldn't that be:
Reality corrupted. Reboot universe? [confirm]
Couldn't help it. On topic - Cisco TAC is amazing (of course I have a lot of friends either there or graduates of TAC). OTOH it doesn't hurt that I am a block away from the main campus. I once had a TAC engineer offer to drive the part over to me when a 6509 sup1a failed.
- Chris
-- I need more coffee. It's Monday. There is no such thing as enough coffee on a Monday.
I need to add something here. Of course, if it's nonencrypted telnet, it shouldn't be used most of the time. If it's a crisis - then change it to a scrappable password, let the service engineer do his thing, then change it afterwards.
Preferably encrypted login should be used, of course. Be it ssh, telnet-ssl or whatever.
--
"Rune Kristian Viken" - http://www.nwo.no - arca
Defense in depth is a good philosophy to have, protecting against configuration mistakes.
Of course.
You are also protected if exploit code is run (say via a buffer overflow that changes hosts.deny).
uh? That sounds pretty damn unlikely. The buffer overflow could just as well execute a reverse-channel back to the attacker. Of course, you limit the possibilities of the attackers. However, you're now already talking about running services with known vulnerabilities.
Firewalls can also protect against low-level attacks that don't attack the services/applications themselves.
That is better done at core-routers.
When properly configured, firewalls can be invaluable in logging traffic and otherwise keeping out unwanted traffic and IP spoofs -- and can do a far better job than simple packet filtering on a router.
That is better done by snort, or any other decent IDS.
I think it's pretty poor form to call someone else a dimwit when you're lacking a lot of info yourself. There's a reason that a firewall is industry-wide best practice for an Internet site or user network, and it's not because we're all dimwits
I regularly call those that think running firewalls is the be-all or end-all of security dimwits. Unplugging a firewall on a network you know isn't exactly a horrible thing to do.
A Firewall is a good thing to have when you've got a network you don't have time to audit, and that doesn't have people to audit it on a regular basis. It's a good thing to have when you've got servers which you don't have any possibility of patching, or upgrading -- but that need to be running some services (nonvulnerable) to the internet.
Of course, you could do lots of these things with NAT-devices. (Which of course isn't a perfect solution either).
Blargh, I could rant on forever.
--
"Rune Kristian Viken" - http://www.nwo.no - arca
Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outside world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.
Bah, you're talking without knowing the parameters. For all you know, they could've enabled the telnet access on the outbound interface specifically for the checking/cisco rep, disabling it afterwards.
Secondly -- if I remember correctly you can have pretty damn long passwords on Cisco equipment. We do not know the length of the password, but it's highly probable that the password is 10+ characters. A bruteforce-attack is pretty damn difficult when you have to check 64^10 possibilities. According to my bc:
arcade@lux:~$ echo 64^10 | bc
1152921504606846976
Now, that is a pretty impressive number of queries you've got to make to exhaust that pwd-space. To be quite frank -- I don't see the problem.
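To put that number in perspective, here is a back-of-the-envelope sketch in Python. The 64-symbol alphabet and the 10-character length come from the figures above; the guess rate of 1,000 attempts per second is purely an assumed number for illustration, not a measurement of any Cisco login prompt.

```python
# Rough estimate of how long an exhaustive password search would take.
# ALPHABET_SIZE and LENGTH mirror the 64^10 figure quoted above;
# GUESSES_PER_SECOND is an assumption made up for this example.

ALPHABET_SIZE = 64
LENGTH = 10
GUESSES_PER_SECOND = 1_000  # assumed rate against a network login prompt

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct passwords of exactly `length` symbols."""
    return alphabet_size ** length


def years_to_exhaust(space: int, guesses_per_second: int) -> float:
    """Years needed to try every password at a constant guess rate."""
    return space / guesses_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    space = keyspace(ALPHABET_SIZE, LENGTH)
    print(space)
    print(f"{years_to_exhaust(space, GUESSES_PER_SECOND):.3g} years to try them all")
```

On average an attacker finds the password after searching half the space, so halve the result. None of this helps if the password is a dictionary word or is sniffed off an unencrypted telnet session, which is a separate argument.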
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
Oh, yes of course. If you don't have a firewall You are phooked!!
Ehh? Excuse me? Why the fsck does a properly configured server farm need firewalls _at all_? Please, enlighten us with your wisdom oh dimwit.
Firewalls _are not needed_ if you're not running services that _should not be running_ on servers for the internet.
--
"Rune Kristian Viken" - http://www.nwo.no - arca
They did change the IP back. The switch over was temporary to get an announcement up ... and that was outside the Exodus cage. Fortunately they did have 1 (out of 3) authority DNS servers outside of there, so they could get people over to the announcement ... eventually as cache TTLs expired.
It's already bad enough to have a 24 hour expiration on the A-record. But you don't anticipate these outages, so 1D is fairly common practice (even longer in some places trying to reduce their DNS load). But the real mistake was putting 24 hours expiration on the temporary IP. Basically that says "as soon as I change this, everyone who cached this temporary IP address is going to have to wait a day from when they first saw the page, before they can get their /. fix (or other OSDN stuff)". What? Did someone actually think they were going to change the IP back 24 hours BEFORE the sites were back up? The temporary A-record should have had a TTL of less than about 30 minutes. I'd have put in 10 minutes if it were me. But then, if I were there, I'd have also been doing the Cisco stuff and actually tested the failover configuration.
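The TTL arithmetic can be sketched in a few lines of Python. The function names and the sample times are assumptions made up for illustration; the only point is that a resolver which cached the temporary A-record may keep serving it until its own TTL runs out, no matter when the zone is changed back.

```python
from datetime import datetime, timedelta


def stale_until(cached_at: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a resolver may keep serving a record it cached at `cached_at`."""
    return cached_at + timedelta(seconds=ttl_seconds)


def extra_wait(cached_at: datetime, ttl_seconds: int, switched_back_at: datetime) -> timedelta:
    """How long after the switch-back a resolver can still hand out the old address."""
    remaining = stale_until(cached_at, ttl_seconds) - switched_back_at
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    # Hypothetical timeline: a reader's resolver caches the temporary record
    # at noon, and the real address is restored three hours later.
    cached = datetime(2001, 6, 25, 12, 0)
    restored = datetime(2001, 6, 25, 15, 0)

    print(extra_wait(cached, 24 * 3600, restored))  # 1D TTL on the temporary record
    print(extra_wait(cached, 10 * 60, restored))    # 10 minute TTL
```

With a one-day TTL that resolver is stuck on the announcement page for most of another day; with ten minutes it has already moved on by the time the sites are back.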
I do recommend:
These are the kinds of things system and network administrators are supposed to do. Programmers tend to hate that kind of work, so that's why there are separate job descriptions. Just because a good programmer can install and configure a server doesn't mean that just doing that is all that needs to be done. Businesses run smoothly when people know what they are supposed to do. And in the exceptional circumstances, they're doing things they don't routinely do, and it is essential to not only have those things written down, but also make sure they do work, and can be found even in a power failure.
now we need to go OSS in diesel cars
In this case, "Full Disclosure" means, "A good yarn."
--
--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
Not a libel suit, but rather /. is afraid of all the angry parents who would claim that his post would make little girls avoid taking classes involving computers, networks, etc., in much the same way that Barbie convinces girls to avoid math.
--
--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
Maybe that's the next site OSDN should come up with. The idea is that anyone who has had a major problem with their network or computers and solved the problem, could post their write up to help others who find themselves in such a situation.
I definitely enjoyed reading this article and I am sure that it will be bookmarked by a fair few techie minded network admins, just in case.
Jumpstart the tartan drive.
I've had this kind of support as an end user with a Cisco 804 ISDN router. The same quality of support that we were getting on our support contract with a 7206, at my previous job.
The main reason that they're so prompt, is that they have a global network for phone support. When you call them, your call gets transferred to a technician who has just arrived at work (ie, if you're in the US and call at 3am, you'll probably end up speaking to a technician in central or western Asia).
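A toy version of that "follow the sun" routing in Python, just to show the idea. The three support centers, their fixed UTC offsets (daylight saving ignored), and the 9-to-5 working day are all assumptions for this sketch; it is not a description of how Cisco actually routes TAC calls.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Hypothetical support centers with fixed UTC offsets in hours (DST ignored).
CENTERS: Dict[str, int] = {
    "California": -8,
    "Brussels": 1,
    "Sydney": 10,
}


def open_centers(
    now_utc: datetime,
    centers: Dict[str, int] = CENTERS,
    start_hour: int = 9,
    end_hour: int = 17,
) -> List[str]:
    """Return the centers whose local time falls inside the assumed working day."""
    result = []
    for name, offset_hours in centers.items():
        local = now_utc + timedelta(hours=offset_hours)
        if start_hour <= local.hour < end_hour:
            result.append(name)
    return result


if __name__ == "__main__":
    # 3am US Central (UTC-6) is 09:00 UTC.
    print(open_centers(datetime(2001, 6, 25, 9, 0, tzinfo=timezone.utc)))
    # 9pm US Central is 03:00 UTC the next day.
    print(open_centers(datetime(2001, 6, 26, 3, 0, tzinfo=timezone.utc)))
```

With these made-up offsets, a 3am Central call lands in Brussels and a 9pm Central call lands in Sydney, which lines up with the Belgium and Australia callbacks other posters describe.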
I'm reminded of an intrusion team story about one such team that faked a package from an OS vendor (letterhead, box, etc) containing a "patch." The admins looked at the box, assumed the obvious, and installed the patch which, while fixing an actual problem, also backdoored their system.
I could see running a remote exploit to crash your box, sending you mail about it (faked, of course) and then sending you a "patch" to "fix" the exploit (while adding some of my own...).
Be careful, there are some tricky bastards around with way too much time on their hands. Check those MD5 sums...
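For anyone who hasn't done it, checking a sum looks roughly like this in Python. The file name and the published digest in the usage example are placeholders, and the helper names are made up for this sketch. MD5 is what vendors publish these days, though a stronger hash such as SHA-256 is a better choice where one is offered, so the algorithm is a parameter. The check is only worth anything if the published sum comes from a channel the attacker doesn't control, i.e. not from the same envelope as the "patch".

```python
import hashlib
import hmac


def file_digest(path: str, algorithm: str = "md5", chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large images don't need to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path: str, published_hex: str, algorithm: str = "md5") -> bool:
    """Compare the file's digest with the sum published by the vendor."""
    expected = published_hex.strip().lower()
    return hmac.compare_digest(file_digest(path, algorithm), expected)


if __name__ == "__main__":
    import sys

    # Usage (placeholder arguments): python checksum.py patch.bin <published md5>
    if len(sys.argv) == 3:
        ok = matches_published(sys.argv[1], sys.argv[2])
        print("OK" if ok else "MISMATCH - do not install")
```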
-- "I am disrespectful to dirt. Can you not see that I am serious!"
The logs are held in a ring buffer in RAM. What you are supposed to do is configure the router/switch with the address of a syslogd server which will handle the logs better.
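A toy model in Python of why the logs vanished, and why forwarding them helps. This is not how IOS implements its buffer; the class names, the capacity of three entries, and the plain list standing in for a remote syslogd are all assumptions for illustration.

```python
from collections import deque
from typing import List


class RingBufferLog:
    """Fixed-size in-RAM log: old entries fall off the front, a reboot wipes it."""

    def __init__(self, capacity: int) -> None:
        self._entries: deque = deque(maxlen=capacity)

    def log(self, message: str) -> None:
        self._entries.append(message)

    def reboot(self) -> None:
        # RAM contents do not survive a power cycle.
        self._entries.clear()

    def entries(self) -> List[str]:
        return list(self._entries)


class ForwardingLog(RingBufferLog):
    """Same buffer, but every message is also handed to a remote sink."""

    def __init__(self, capacity: int, sink: List[str]) -> None:
        super().__init__(capacity)
        self._sink = sink  # stands in for a syslog server with a disk

    def log(self, message: str) -> None:
        super().log(message)
        self._sink.append(message)


if __name__ == "__main__":
    remote: List[str] = []
    switch = ForwardingLog(capacity=3, sink=remote)
    for i in range(5):
        switch.log(f"event {i}")
    print(switch.entries())  # only the newest three survive in the buffer
    switch.reboot()
    print(switch.entries())  # nothing left on the device
    print(len(remote))       # the remote copy still has everything
```

So the reboot that was supposed to fix things also throws away the evidence, unless a copy was already sitting on another box.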
Of course, it depends greatly on who you are talking to. The platforms team does have a huge slant toward NT/2000 because that's what they support and allegedly like. Those of us in Exchange support (I'll leave it to you to figure out what part of Exch. support I'm in) handle calls where Unix servers are relays, Pix firewalls sit between systems and load-balancers continually send packets off into the woods. If you *don't* know non-Microsoft stuff, aren't prepared to acknowledge that non-MS works and works well, or just can't handle the idea of public standards, you are fucked in that group.
It all comes down to who you get on the phone. If you don't like who you are dealing with, ask to speak with their manager or technical lead. Get it straightened out with them or request another support tech. You're paying for it, get what you are paying for.
(As always, my comments are my own and my employer doesn't take any responsibility for them. Like they would want to anyway.)
---
Her only post was just a few days ago, but her user info page now claims she has zero posts. My user info page currently shows my posts going back about two weeks. Why would Anne Tomlinson's one post mysteriously "disappear" from the system so soon??!
cpeterso
If they aren't already doing this, they are even more clueless than I previously thought. ugh.
EOM
"First step was getting the firmware patches onto a TFTP server near the switch (had to be less 3 hops from the switch, TFTP doesn't work over longer hops)."
I've tftp'd images to cisco's and ascend's across the Internet (many hops) without problems. It's not smart because if you lose your path to the server you're screwed, but it does work.
I don't normally swear, but if someone asks me if Cisco support is good, I have to reply: "Abso-fucking-lutely". They are easily the tightest organization out there, bar none. I don't think anyone: UPS, the Military, Wall Street, runs as good an operation as they do.
And I've sat with two engineers at 1:00am through to 11:00am as they fixed my small gateway to an ISP, not a big ticket item. At one point, they did an engineer transfer, connecting me to a different part of the world, and spent thirty minutes overlapped, with the engineers working together to make sure that the new engineer knew what the first had tried. As it turned out, the firmware storage was flaky, and the config corrupted itself semi-randomly.
Years later, I watched Cisco do the exact same thing - only this time, they correctly identified that the problem wasn't them, but in some Bay routing equipment, *and* they told us the exact commands to fix it (I was an outside consultant just watching, but I believe they even offered to telnet in and fix it themselves).
So, yes. Cisco is the only brand I will buy, no matter how expensive they are. Think of the extra expense as insurance. You *may* not need it, but it sure pays for itself if you do.
--
Evan
"$30 for the One True Ring. $10 each additional ring!" -- JRR "Bob" Tolkien
Not sure Nokia's the answer. My Nokia has the same problem that Yazz's has.
Admit nothing, deny everything and make counter-accusations.
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable.
...unless the risk of being compromised within that short period is outweighed by the information you will gain by testing around your firewall. It is a simple trade-off.
Better yet, ssh into their home boxen and blast an mp3. According to /. polls, their computers probably aren't more than 10 ft. from their beds.
___
Cognitive Overflow
more than yo
I think the Anne Tomlinson post was a particularly brilliant troll.
A quick Google search for "Anne Tomlinson" returns an orchestra conductor and someone in a retirement community.
If it was a real post, CmdrTaco probably would have ignored it. His good humored response makes me think it was a troll.
Is there any evidence that it was real?
-B
When I was able to do my own spam-armoring, you got a chance to email me. Now you can only hope I see your reply.
http://www.bmug.org/news/articles/MSvsPF.html
.....'" We do, and then we reboot. Problem solved.
I beg to differ.
That article details calling the 900 line, but even with support contracts, most MS tech support reps toe the company line in a distressing fashion.
"Unplug all the unix servers, that'll fix it"
"Upgrade everything to Win2k Adv Serv, that'll fix it"
"Upgrade to SQL Server (from Oracle), that'll fix it."
They seem to have no ability to distinguish which network components could be involved in a problem and are unwilling to accept that you've already localized the problem.
Case in point, there was a problem where two WinNT boxes wouldn't see each other. They both had IPs, they could both ping everything else. They were connected via a 100mbps switch.
We made sure each properly had an IP, that it could reach other machines, that the switch worked, and then swapped ports with two machines that were working just fine. We also tried isolating these two machines on their own switch, to avoid potential IP conflicts.
When we called the support number we honestly described the situation to the tech. He asked what else was on the network. We explained that it was in a different IP range, but on the same switches as a bunch of Linux machines, an OpenBSD box (firewall for the desktop machines), and a couple Suns (doing something for the other department, dunno what.)
He then proceeded to tell us that it was the other computers, despite our telling him that we had isolated the NT boxes in question on their own switch and we still had the problem, but when we put a third computer on, both of the NT boxes could reach it just fine.
We eventually lied to him, telling him that yes, we had unplugged all the unix machines, etc. (Like we're going to just unplug our company on the say-so of a moron, and like two junior techs would have the authority to do so anyway.) So now jim-bob starts to help, by telling us that Win2k is so much better, etc, that we wouldn't have these problems with it, etc.
When we flat-out refuse to "upgrade" to fix this bug, his advice is that we format the drives and reinstall. ARGH!
We finally convince him that these machines are somewhat important and we can't just wipe them every time there's a small problem.
After over an hour with this jack-off, we hang-up, problem unresolved.
We get permission from the boss to call someone in... So we look through our list of contacts and grab someone whose card says they deal with networking and windows. Call him up. As we're describing the problem he listens quietly, grunts affirmatively when we describe how we isolated the problem, agrees that it couldn't be any of the other machines.
Then he says, "It sounds like it's an issue with a bad route, type 'route
He said that it, whatever it was, was a very common problem where the machines basically forget how to get from A to B. That command zeroed the routing (which didn't show any bad routes) and the reboot brought it back up.
Cost, a 15-minute phone consultation. $45
Microsoft tech support was basically a sales department, staffed with the marketing rejects.
So, don't EVER believe it if someone tells you that MS supports their products. Any company whose line is "Format and reinstall" has no business calling a product "Server", let alone claiming they're in the enterprise level.
Schon, earlier in this thread, said "Rebooting doesn't solve the problem!!" I wonder what he'd say about formatting and reinstalling.
I don't know about the non-fiber portion of Cisco, but the last guy I talked to got $800 or so just for taking our call at 3am. This is on top of whatever they're being paid to answer the phone.
Look at it this way:
How much redundancy?!?!? Even if they're on different ---, they're probably still riding in on a loop from the same ---.. and then, they're most likely delivered on the same physical ---.... Don't be too reliant on this setup for redundancy's sake...
You learn things by context. Obviously, soulseller is saying that having multiple T1 lines isn't going to make things any more redundant because they're all going to be part of the same physical connection. If one goes down, they all go down.
ReadThe ReflectionEngine, a cyberpunk style n
OSDN, Audit ALL of your systems NOW.
:)
They should be well schooled in how to do it after the FluffyBunny cracks...
/me cowers in anticipation of Flamebait moderation.
deus does not exist but if he does
So, some people think that the editors sit around and mod down trolls, crap, and maybe also some posts that the editor just isn't comfortable with.
:)
Personally, I find this ridiculous. Yes, it is possible that the editors could be doing this, but who would be willing to waste that much of their time? I can't think of a more boring job.
You, too, can intern at VA Linux this summer! Our stock price may not be all that high, so we might find it hard to pay decent wages, but we'll give you unlimited mod points on Slashdot!
deus does not exist but if he does
So that means that someone on the Slashdot staff was reading Slashdot at 2 A.M.?
Apparently OSDN does not have something like "mon", "Big Brother", or "Spong" monitoring all the equipment and links. Or if they do, they didn't mention looking at the screen and seeing the flashing red "Failure" icons on some equipment.
Apparently the network staff had to be manually paged, due to not having any of the aforementioned monitoring tools.
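For anyone wondering what the bare minimum of that kind of monitoring looks like, here is a rough Python sketch. It is not how mon, Big Brother, or Spong actually work; it just polls a list of TCP services and collects the ones that fail so that something can page a human. The service names, hosts, and the `probe` hook are made up for illustration.

```python
import socket

def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def failed_services(services, probe=tcp_probe):
    """Return the (name, host, port) entries whose probe failed."""
    return [(name, host, port)
            for name, host, port in services
            if not probe(host, port)]

def report(services, probe=tcp_probe):
    """Build one alert line per failed service; a real setup would page someone."""
    return [f"FAILURE: {name} ({host}:{port}) is unreachable"
            for name, host, port in failed_services(services, probe)]

# Hypothetical inventory; replace with real hosts before calling report(SERVICES).
SERVICES = [
    ("web", "www.example.com", 80),
    ("mail", "mail.example.com", 25),
]
```

Run `report(SERVICES)` from cron every minute or so, from a box outside the cage, and send any non-empty result to a pager gateway. Checking from outside matters: a monitor that sits behind the dead router cannot tell anyone about it.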
Science college students may know this as a "Lab Notebook"; there actually is a course which tells you to write everything down and suggests the types of format and details to include.
If the only problem was a switch/router config, why was a change of IP involved?
--
My comments and opinions completely reflect those of anyone and anything I am remotely associated with.
Probably memory leaks in your JSP code.
(Yes, java can have memory leaks. Accidentally keeping references you don't need will leak memory, and/or other resources)
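The same failure mode exists in any garbage-collected language. A minimal sketch, in Python rather than Java, so `Session` and `SESSIONS` here are purely illustrative names and not part of any servlet API: an object stays alive as long as some long-lived container still refers to it, even if nothing will ever use it again.

```python
import weakref

class Session:
    """Stand-in for per-user state; hypothetical, not a servlet API class."""
    def __init__(self, user):
        self.user = user
        self.payload = bytearray(1024 * 1024)  # pretend this is expensive

SESSIONS = {}  # long-lived registry: every entry is a strong reference

def login(user):
    session = Session(user)
    SESSIONS[user] = session  # leaks unless something removes it later
    return session

def logout(user):
    SESSIONS.pop(user, None)  # dropping the reference lets the object be reclaimed

s = login("alice")
alive = weakref.ref(s)             # observes the object without keeping it alive
del s                              # the caller is done with it...
still_held = alive() is not None   # ...but the registry still holds it
logout("alice")
released = alive() is None         # in CPython, freed once the last reference is gone
```

Forget the `logout` call (or never expire idle sessions) and the registry grows until the process falls over, which is the pattern described above.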
Various abuse of the Session object is probably the biggest one fer this.
-- -- The Dragon De Monsyne
I wonder if they will investigate syslogging the messages to another box. Would this even be worth the effort?
I don't want knowledge. I want certainty. - Law, David Bowie
What I take issue with is not only did the editors divulge that someone quit but that they also labeled her as incompetent (My interpretation of the comments.) That should have never happened. You, I and everybody else here should have never been told that and it boggles me that those who defend this "right to know" BS are chomping at the bit to get the dirt on this alleged dustup. There isn't a single person I know in this field who would want to be put in that spotlight.
But what irks me the most is that this thread is so hot but my other post about syslogging the Cisco, which is much more relevant to the article, just sits with little discussion.
Christ people. Ditch the tabloid mentality and get back to the Nerd stuff.
I don't want knowledge. I want certainty. - Law, David Bowie
No, we don't have a right to know. Ms. Tomlinson's departure is between her and her employer; not some tabloid expose for a bunch of overly curious rumor mongering conspiracy theorists. I wouldn't be surprised if the people who blurted this out on a public forum haven't been seriously bitch slapped by HR.
As a community it would be best to let the matter drop. I'm sure if you were in Anne's position you'd be severely pissed. A little perspective and some empathy would be appropriate.
I don't want knowledge. I want certainty. - Law, David Bowie
shouldn't such a big, expensive piece of HW have at least some non-volatile storage?
:o)
Usually not. I'd guess that it's because a HD would become a potential source of failure (mechanical parts tend to wear out before non-mechanical ones.)
Minimum it should log to a separate box that does have a disk drive
Yes. Every "real" router I've seen has the option of logging to a remote syslog. (I LOVE standards.)
Would this even be worth the effort?
Short answer: Yes.
Long answer: Yes, it's ALWAYS worth the effort.
Setting up a remote syslog takes all of 20 minutes and a spare box. It's trivial, even without considering the payoff.
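To give a feel for how little is involved on the receiving end, here is a rough Python sketch of a UDP syslog listener. It assumes the classic BSD syslog format, where each message starts with a `<PRI>` number equal to facility * 8 + severity. Port 514 is the standard syslog port; the log path is made up, and in practice you would just run a stock syslogd with remote reception turned on rather than anything like this.

```python
import re
import socket

PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)

def parse_pri(line):
    """Split a BSD-syslog line into (facility, severity, message).

    Returns None if the line does not start with a <PRI> field.
    """
    match = PRI_RE.match(line)
    if not match:
        return None
    pri = int(match.group(1))
    return pri // 8, pri % 8, match.group(2)

def serve(path="remote.log", host="0.0.0.0", port=514):
    """Append every datagram received on host:port to a file.

    Binding to port 514 normally requires root.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    with open(path, "a") as log:
        while True:
            data, (sender, _) = sock.recvfrom(4096)
            text = data.decode("utf-8", "replace").rstrip()
            log.write(f"{sender} {text}\n")
            log.flush()  # keep what was received even if this box dies too
```

For example, `parse_pri("<189>link down")` gives facility 23 (local7) and severity 5 (notice). The point is that the messages land on a disk that survives the router being power-cycled.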
Who's to say that it's not a 1 PPM problem that won't affect the system again for another hour/day/month/year? Once the packets are flowing again, then you can relax and take the time to root cause the problem and fix it.
And who's to say that the problem that's being experienced will be fixed by a reboot?
We had a server running, one of the things it did was SMB sharing - one of the drives (the one dedicated to non-critical SMB shares, in fact) died.. This box was doing MUCH more than SMB - it was also our internal DHCP, and DNS server
I was out, and one of our MS guys decided "I don't know what all these error messages mean, but I can't see my windows drives, so I'll just reboot it." Because the drive was dead, the machine wouldn't boot. He took the WHOLE DAMN DEPARTMENT OUT - nobody had DNS, and when people's windows machines stopped working, the solution was (guess what?) REBOOT them - so THEY stop talking to the network altogether.
Now, the kicker is that the drives in this machine were hot pluggable. If the reboot hadn't happened, I could have swapped in a new drive, restored from last night's tape backup, and people could have continued working. Instead, because the machine was rebooted the whole department was down for several hours.
The mantra stands - REBOOTING WILL NOT FIX THE PROBLEM. And if you reboot before you know what the problem is, then not only don't you know if it will help at all, but you also don't know if it will make the situation worse.
sometimes getting back online as fast as possible is more important.
That's the trap - there is no guarantee that rebooting will do this - and you might just be screwing it even worse.
Getting back online as fast as possible involves solving the problem first - REBOOTING WILL NOT FIX THE PROBLEM.
it may have resolved the problem for a short while
Even though you think you're saying the opposite of what I said, you've hit the nail squarely on the head - rebooting never fixes any problem.
It may temporarily fix the symptom, but the problem is still there.
It is possible for routers, Linux boxes, etc to crash.
Yes, it is. But if they crash, it's for a reason - perhaps there is a bug in the configuration, or firmware; or perhaps it's hardware.. but what's important is that rebooting will not actually fix the problem, all it will do is temporarily alleviate the symptom.
If the problem is with the configuration, then you fix the configuration. If there is a bug in your software, you fix that. If it's hardware, you replace the faulty hardware. If it's firmware, you upgrade the firmware (or replace the unit with a different model, from a manufacturer who actually does quality testing.)
But you do not just blindly reboot - if a reboot is required, you do it after you've discovered WHY the machine has crashed, and you've fixed it. Once again, the mantra is "Rebooting will not fix the problem."
I laughed out loud when I read this:
But, says Yazz, "Since the Cisco was rebooted there were no logs to look at."
You fell into the classic "Windows" trap.. this is what I tell the Jr. tech guys here when one of the servers goes wonky: "If it doesn't work, there is a reason; something is wrong. Rebooting will not fix the problem."
They usually respond with "but I didn't know what else to do."
To which I answer "Repeat after me - REBOOTING WILL NOT FIX THE PROBLEM."
"But I didn't know what else to do."
"Then call someone who does - REBOOTING WILL NOT FIX THE PROBLEM."
Just wanted to say thank you for the explanation. After all, we are your customers! :) It is really nice to get an accounting of what happened.
BTW: Are you going to plan any redundancy/failover drills as a result of this?
IIRC, Boeing equipped all their maintenance staff a few years ago with laptops, cameras and video editing software. The engineers then made training vids for each other, in particular noting the fuckups, i.e. they were not edited out, but deliberately left in so ppl could know what went wrong.
Of course, I'm not so fond of Cisco as they are, after having to type "no shutdown" to bring up an ISDN router...
A client of mine had no support contract with Cisco other than having a Cisco ISDN router that was still under warranty. The tech explained that my client was supposed to have a support contract to get support but fixed the router configuration anyway.
I'd say that hold time is probably tied to your support contract with them. I work for a large uni with a huge userbase and a pricey support contract. The network outage queue is never more than 5 minutes of hold time, if you end up on hold. For RMA or other non-urgent calls, I've waited a lot longer. Of course, our hefty contract means we're paying for that prompt response, but IMHO we more than get our money's worth. Also, the professionalism, I could go on about that for days... Wonderful.
I've also watched an outage start with a fellow engineer on the phone with a router manufacturer who shall remain nameless (not Cisco) where the support tech managed to give advice so erroneous as to cause a network outage that took down our entire core during the middle of the day. Not very fun.
itachi
No, see, they hadn't logged into the console and typed "show log", or they would've seen the failover attempt. In fact, as far as I can tell, they didn't log into any of the network devices before they rebooted ALL OF THEM! Managed network devices are usually pretty helpful in terms of troubleshooting if you go to the trouble of getting console access. In this case, they went with the "Reboot, then troubleshoot" approach, which is dumb.
As for routers and switches being user-serviceable, sure, you aren't supposed to be in there with a soldering iron and a multimeter, but a config is absolutely user-serviceable. If it might be a config error, rebooting will do more harm than good. The only time you should need to reboot a piece of serious network hardware (by price alone, I think we can define a 6509 as serious hardware) is when you have no console access, the lights that are supposed to blink aren't blinking and the lights that aren't supposed to blink are blinking. Or smoke. That might also be a valid excuse. But it would have to be a lot of smoke...
itachi
If this is a "blow-by-blow" account, then could someone, I dunno, involved in the mess explain that little comment Taco made for about 20 minutes on Sunday about when the "qualified personnel" arrived, "[they] discovered that she wasn't actuually as qualified as we had hoped. Then she quit, thus terminating 3 local star systems."
Was Rob just popping off at random, or was that little bit removed trying to cover /.'s ass in the face of a potential libel suit?
Jes' wondering...
Someday, you're going to die. Get over it.
Yes, but you don't have to fill the whole pipe. It gets channelized into T-1's and you only have to turn on (and pay for) as many as you need. And since you don't have to pay for local access on the T's, the cost evens out at a pretty low number (7 T's in our case).
OK, I should read before posting. I'm talking about a DS-3, not an OC-3.
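For the numbers behind that: a DS-3 multiplexes 28 T1 (DS-1) channels of 1.544 Mbit/s each, and its own line rate is 44.736 Mbit/s. A quick back-of-the-envelope in Python (the function name is just for illustration):

```python
T1_MBPS = 1.544      # DS-1 line rate
T1_PER_DS3 = 28      # a DS-3 multiplexes 28 DS-1s
DS3_MBPS = 44.736    # DS-3 line rate, including multiplexing overhead

def turned_up_capacity(t1_count):
    """Aggregate line rate in Mbit/s for t1_count T1s on one channelized DS-3."""
    if not 0 <= t1_count <= T1_PER_DS3:
        raise ValueError("a DS-3 carries between 0 and 28 T1s")
    return round(t1_count * T1_MBPS, 3)

seven = turned_up_capacity(7)    # 10.808 Mbit/s, the "7 T's" case
full = turned_up_capacity(28)    # 43.232 Mbit/s of T1s in a fully used DS-3
```

So seven T's is a quarter of the pipe, with room to turn up more channels without pulling a new circuit. As pointed out earlier in the thread, they all still ride the same physical line.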
I had a support guy go through a complicated debugging procedure having to do with password changes failing under obscure conditions with Win95 clients and NT servers.
Would this be the infamous MIT Realm bug? ("Passwords must be at least 16785 characters in length...")
-Lx?
While technically correct, you have to look at the bigger picture. Rebooting may not fix the root cause of the problem, but it could very possibly get the system back online. Who's to say that it's not a 1 PPM problem that won't affect the system again for another hour/day/month/year? Once the packets are flowing again, then you can relax and take the time to root cause the problem and fix it.
You can make a case that valuable troubleshooting info is lost when systems are rebooted. I agree, but counter that all good systems should have detailed event logging. Leaving the system online and intact is the best way to root cause a bug. But, sometimes getting back online as fast as possible is more important.
Firewalls _are not needed_ if you're not running services that _should not be running_ on servers for the internet.
Because
Defense in depth is a good philosophy to have, protecting against configuration mistakes.
You are also protected if exploit code is run (say via a buffer overflow that changes hosts.deny).
Firewalls can also protect against low-level attacks that don't attack the services/applications themselves.
When properly configured, firewalls can be invaluable in logging traffic and otherwise keeping out unwanted traffic and IP spoofs -- and can do a far better job than simple packet filtering on a router. That said, anyone who believes that firewalls are the be-all end-all of security is fooling themselves.
I think it's pretty poor form to call someone else a dimwit when you're lacking a lot of info yourself. There's a reason that a firewall is industry-wide best practice for an Internet site or user network, and it's not because we're all dimwits.
"You can never have too many elephants on your team."
Do Cisco tech folks take tips? I mean, of course they're only doing their job, but when someone saves your ass bigtime and really impresses you in the process, it's often nice to show some appreciation.
I've had this sig for three days.
I suspect that someone from legal saw that, feared a lawsuit, and whacked Taco with a clue-stick. Thus necessitating a quick edit.
Best Slashdot Co
That actually makes sense. So, who whacked Taco with the clue-stick to get him to pull the original posting?
Best Slashdot Co
I had an experience recently with a Cisco firewall and due to the nature of the problem I would have to wait-and-see days at a time to see if the problem was reoccurring or had been fixed as we tried different things. The tech I was assigned to called me several times a day, was willing to stay on the phone with me for hours at a time educating me on the issues, the technology and walking me through the solution. I couldn't believe it. He emailed several times a day too, as did the Cisco dispatch system that kept track of the issue.
Congratulations to Cisco. They are huge and have a massive install base but provide the best tech support I have ever seen.
I don't even want to talk about my bad experiences with Microsoft's premier technical support.
----------------------------
...to all of us that do this for a living. Forget for a moment that most here have never set foot in a real data center, much less even own a server. No pros want to see another's network go down (well, most of the time ;-) ), and we don't want ours down. I've spent many an hour looking at an errant PIX, or troubleshooting some other network config. I know what those guys were going through. It sucks...
Don't slack. When you slack it bites you in the ass. Maybe not today, maybe not tomorrow, but someday, someday soon, it will.
Test your failover configs. How? By actually making them fail. During the maintenance window, power that primary router/firewall/load balancer down hard and see if the failover works. It's like testing backups, kids. You have to know they work before you need them.
Realistically develop on call strategies. OSDN didn't really have a net ops staff of four. One had quit (why are they counted?), one was in hospital, and two had weak "couldn't reach my cell phone" excuses. That just don't work in the real world. If you are on call, you are on call. The "phone too far away" and "battery fell out" just don't cut it in the adult world of professional net ops. Get a satellite pager, and if you are on call, make sure it's on, and near you so you can hear it.
Don't bash your employees/former employees, particularly during a heated situation. Shows no class. Besides, if you are such a hot shit, grab that console and fix it. Otherwise, keep your mouth shut. Besides, who is in charge of making sure the people that are hired are qualified? Hmmm?
Document your shit. It's not that hard. Visio can do much of it for you. I'm going to break an NDA here, but the Exodus Service Agreement states that all machines and cables are to be labeled. That is so when the dude (or dudette) has to leave the NOC and enter your cage to reboot your lame box, they know what is going on. Also works well for when your net ops staff is too concerned with getting drunk or laid and your poor programmers have to go in to fix the network.
Some folks really went above and beyond, but it seems to me that the management severely dropped the ball.
Is VA really ready to abandon the hardware market for software services? One has to wonder.
Dave
been there before...
There are times you need to reboot a cisco, particularly during your CCIE exam. :-)
At the start of the 2 day CCIE exam, the proctors casually mention they knock off points for unnecessary rebooting of routers. But the progression of the modules in the test will likely wedge a routing protocol, requiring a reboot, and they are really looking for those monks wise enough to know when to reboot.
IOS is an amazing mess of spaghetti modules, and the fact they work together so well is a testament to cisco's dev test and solution test people. But sometimes the appletalk routing module will choke, and a reboot is the only remedy. Or NetFlow forgets, or Policy Routing doesn't. But a wise cisco expert will copy the logs and generally preserve the state of the machine for analysis after the reboot in case the machine doesn't come back. But wise cisco experts cost a lot of money.
It's a good mantra in networking: "REBOOTING WILL NOT FIX THE PROBLEM"
the AC
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
Yes, but have you tried dialing that number when Slashdot wasn't down?
I talked to a friend in cisco TAC (Brussels) who said that they regularly lurk on /., and in the TAC they could see it was a major network outage since the whole of the OSDN sites were unreachable. Nothing to do but wait, or answer calls from other customers :-)
Rumour has it the conversation went a little something like this:
[Kurt] Hi, cisco tech support?
[TAC] Yes
[Kurt] this is Kurt at slashdot...
[TAC] Oh my god, it's about time you called us. You've been offline for nearly 24 hours, we're all going through withdrawals. Hang on a sec, our top techs are dying to help.
Since summer weather had come to Europe, I, personally, did not notice the outage. But I promise in the future to not have a life.
the AC
[Note to Kurt and company, make sure you return your customer satisfaction survey. Those TAC folks live and die based on keeping a very high level of sat scores. I think they need a 4.85 (on scale of 1 to 5) just to keep their jobs within cisco, and a 4.89 to get a raise. So 5's across the board, and in the comments put a link to this /. story for their manager]
Hemos is like...sci-fi fans;he thinks technology is cool, but he hasn't bothered to understand the science it's based on
Actually I have a couple of puny little 2500s and no contract and they still helped me with an IOS bug a year after I purchased them - eventually giving me a free upgrade of the IOS. If for ANY reason (IOS/hardware) a Cisco is having problems I guarantee that Cisco will help you.
About a year and a half ago I bought a used 3600 and then found out via Cisco tech support that that particular run of 3600s had a hardware bug and they RMA'd the damn thing, shipping me a BRAND NEW unit before I returned the other one. Past that, I called later with a problem of my own causing, and they still had a couple of their techs help me out.
All in all, Cisco does a great job of supporting their hardware.
Shameless plug: Check out Grub!
>small branch (...but managing all of Benelux no less..), and get hardly any more info than you
>have.". And no, that particular problem (RMI in Jserver crashing after several hours of just
>sitting there..) has not been fixed in a week.
I just had to laugh when I saw your comment. I also get my Oracle Support from Benelux, and I just happen to also have an Apache/Jserv related problem outstanding for a while.
One thing though, your Oracle office was telling the truth about being a small branch, and not being able to do too much. All of the real work goes on in the US, and the local offices don't really have a lot of contact with Oracle US.
A little disclosure is in order here. I worked for one of the Benelux support offices for 2 years as a manager of one of the support groups - this was after working in two of their other offices, including the US HQ. I was amazed how much of a backwater the Benelux offices were. In fact, I quit out of frustration after a fight with my boss - the support center manager - because I knew that we weren't giving an acceptable level of support.
"The best part? I became an ordained minister while not wearing pants." -- CleverNickName
I've seen this scenario over and over again... one guy who knows and understands the network, ten people standing around at the equipment trying various silly commands to fix it when it's down...
Here's some suggestions -- you probably already realize that 90% of your pain was avoidable, but everyone has to learn "the hard way" the first time, right?
We've got piles of printouts and documentation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer...
That's called bad documentation that no one ever reads.
Get your networking guys to document TROUBLESHOOTING techniques and to teach the programmers how the network is actually set up and why. You have plenty of talent capable of understanding how it all works there.
Get more than one way (cell phone) to reach your most important network engineers. Pop for a guaranteed delivery text pager and ask them to carry that as well as the cell phone.
Yazz says, "When I arrived at Exodus, Kurt and Dave were trying every combination of things to do to get the 6509 back. But neither they nor I even knew the Cisco Passwords."
Paper. Wallet. Put them there. Better yet, PGP encrypted password escrow somewhere that anyone can get access to, and a locked cheap fire safe at the office with the public and private PGP keys on a CD-R inside -- for just this type of scenario.
So I asked the Cisco technician, Scott, to telnet into our switch...
Bad bad bad... telnet = bad. Good network security always goes out the window when the network's down...
So he's in the switch and he's disgusted and horrified by how we have it configured...
This is probably the most important hint during your entire outage... your network people either don't know what they're doing, or you're not ALLOWING them to do their jobs, or they're understaffed, or whatever other excuses can be made up ... your call, but don't forget this -- if Cisco's "horrified" by your configs, there's a serious issue you need to find and correct somewhere in your organization. Everything from training, to documentation, to troubleshooting procedures needs a serious walk-through.
The one card going bad wouldn't have been such a big deal if the config in both were set up correctly. It was meant to flop to the other interface if the primary card died, which it did, but not with all the info it needed... AKA it was misconfigured...
DO FAIL-OVER TESTING. If you'd have done a fail-over test of this config you'd have known it didn't work correctly during a nice scheduled time when your network engineers are available and at the equipment, instead of the middle of the night during an outage with all of them MIA. This is so easy to avoid.
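A cheap check that complements (but does not replace) a real pull-the-plug test, sketched in Python under the assumption that you can dump both units' configs as plain text: list the lines the primary has and the standby lacks. The sample config lines are made up for illustration, and the `ignore_prefixes` default is a guess at what legitimately differs between two units.

```python
def missing_on_standby(primary_cfg, standby_cfg, ignore_prefixes=("hostname", "!")):
    """Return primary config lines that do not appear in the standby config.

    Lines starting with any of ignore_prefixes are skipped, since some
    settings are expected to differ between the two units.
    """
    def significant(cfg):
        lines = (line.strip() for line in cfg.splitlines())
        return [l for l in lines if l and not l.startswith(ignore_prefixes)]

    standby = set(significant(standby_cfg))
    return [l for l in significant(primary_cfg) if l not in standby]

# Made-up example configs, not a real 6509 setup.
primary = """hostname stan
ip route 0.0.0.0 0.0.0.0 10.0.0.1
access-list 10 permit 10.0.0.0 0.0.0.255
"""
standby = """hostname kyle
ip route 0.0.0.0 0.0.0.0 10.0.0.1
"""
gaps = missing_on_standby(primary, standby)
# gaps == ["access-list 10 permit 10.0.0.0 0.0.0.255"]
```

This flat line-by-line comparison ignores nesting (which interface a line belongs to), so treat an empty result as "nothing obviously missing", not as proof that failover will work. Only the scheduled test proves that.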
Exodus really wasn't set up to handle the type of failover the 6509 was meant to do. That's what the Cisco folks said basically, and the Exodus people are no longer supporting this type of Cisco in their setups.
Nice of them to tell you. Who is the customer here again?
Put a $20/month POTS line in your cabinet for goodness sake!
That's enough... I'm appalled, but hopefully you will straighten out some things now that the site was down for an extended period. Done properly, network downtime should be a rare event, usually caused by human error, not by bad configuration.
Many outages are unavoidable, your outage sounds like it was avoidable, and certain steps could have been taken to minimize the length of the outage.
+++OK ATH
I think `everything' in this context refers to all the network equipment, not the servers. All the OSDN services were down, so it obviously wasn't a problem with just the slashdot servers.
What you said.
I did a bit of (very junior-level) sysadminning back in my day.
First thing the BOFH told me was "Buy a hard-cover notebook. Not spiral-bound. Not softcover. Write down everything you do. Feel free to doodle and write obscenities if you like. Someday you'll thank me for this".
I was a bit befuddled, and then he showed me his notebooks. Five years of dramatic fuckups and even more dramatic recoveries. His own personal "deja.google.com" (but it was 1992, and long-term USENET searching hadn't been invented yet, hell our office was using UUCP!) for everything he'd had to work out from first principles on his own.
And thus was the PFY enlightened.
(And yes, I did buy him a beer in late 1992, when something I wrote down in mid-1992 jumped off my page and saved my ass.)
So.. how much would you pay for the chance to go into that big room with all the whizzo big computers and routers and wires, and see a PC and go "Hey! That's slashdot!! :D!"
If this isn't 31337 then I don't know what is..
(Maybe a topic for next week's poll)
Rebooting does not solve the problem because Windows is the problem.
Hmm... Australian chicks on an 800- number. I got to get me one of them Cisco thingies.
How do you reboot a cisco router?
But are they all CISCO certified?
AFAIK, the only way to spoof IP addresses in Windows is to install a new networking stack, and that is difficult to do in the kind of generic way that zombie clients work, for reasons Gibson discusses in an article at his site.
While I agree that I usually get someone at cisco who knows what they're talking about, it is very rare in my experience that it happens in only a minute, although it does occasionally happen. A much more common experience is to wait on hold for 15-20 minutes, but I have waited on hold as long as an hour with them.
All of that being said, I would have to agree that cisco's TAC is probably one of the best tech support groups I've ever worked with.
--
Key to financial independence: Spend less than you earn. Save and invest the difference. Do it for a long time.
Someone kind of alluded to this but MY GOD are your security procedures busted!
Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outside world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
Unfortunately I've spent many nights there in the past. I can confirm their remarks about Cisco support. It's usually fantastic! Lucent should take a few pointers from them, especially the guys in their INS group. They did forget to mention the security there. If you're not in the Exodus database, you won't get past the little lobby with the bullet proof glass. You've gotta use your proximity card like six times just to go to the bathroom!
Um, this is my sig.
A Cisco 6509 is not a cheap piece of hardware (yours was at least $30K w/ what you've described in it). Especially if you've got redundant supervisor engines, and doubly so if you're doing layer-3 in the box. That having been said, why couldn't you have hired a competent technician to install the box properly when you installed it in the first place, rather than having a half-assed configuration loaded?
I doubt your setup would've been more than three or four hours of configuration, based on what you've described, and you'd have gotten decent documentation out of all of it, if you'd hired a good technician. It's quite obvious that one visit by a technician at $250.00/hr would've more than made up for the cost of the downtime and headaches you incurred as a result of having a poor configuration in the first place.
The Attitude Adjuster, I hate me, you can too.
You made your point with capitalized letters but still.. If everything has failed and you have nothing to lose why not give it a shot?
:-)
If everything else has failed & you haven't found the problem, you haven't tried everything. Granted, there are things like memory leaks that won't purge themselves, or total hangs where the box is unresponsive, but if you're in there looking at the system, you should be (a) finding and fixing, or (b) recording everything you can to find the problem later.
Because sometimes, in my experience, you do have to reboot. You've got 2,000 users sitting on their thumbs and the regulatory commission wants data & is going to fine you for every hour you're late and the division VP is on the conference call wanting to "help" and.... It may be better to reboot & get the thing back up and running, but even so, do not assume you've fixed the problem.
Except in Windows, for those problems that don't require a re-install.
That's not the point. Some of us don't want people to know that our machine exists.
Interested in open source engine management for your Subaru?
Yeah, but you wouldn't think you'd have Joe Average asking why their Oracle setup is generating redo logs at 200 MB/minute under light load. Yet, unless you have platinum support ${SO FUCKING HUGE NUMBER YOU CANT BELIEVE IT} you have to sit there explaining to first tier tech support what a redo log is.
Once you get to second tier, Oracle support is pretty good, though not spectacular. But of course, you have to pay $BIGNUMBER to get any support at all, and unless you pay even more, they hang up the phone at 5pm.
Once you escalate a few levels they are OK. And the people I have talked to were friendly and honestly trying. They just were clueless. I worked with Oracle for a year and a half, and knew more about it than just about anyone I talked to there.
On the other hand, the one Oracle consultant I talked to was a genius. Unfortunately, he was way, way, way too expensive for us, and the 3rd party consultants we talked to sucked total ass.
That's a damn lie. Engineers from separate companies always require some clueless idiot to act as a go-between.
If you aren't part of the solution, there is good money to be made prolonging the problem
This is a little off topic, but I've seen some other similar things, so, my $.02: I just had an amazing experience with tech support at Penguin Computing. I have a server that was in immediate danger of losing its IBM Deskstar 75 GB drive. The following is an excerpt from the letter that I wrote to them thanking them; the new drive (under warranty) was in my hand less than 24 hours after my phone call. 22 hours later the server is back on line with all of the data restored. My total cost? Nada.
Much to my surprise, my call was answered by a human. They asked how they could help -- I told them, and I was immediately connected to another human. No hold time, no muzak, no "press 9 if your laptop is on fire" messages. The fellow that I talked to -- I regret to say that I'm not sure of his name; I thought it was *******, but your customer service rep, *********, says it was ********... at any rate, he sounded British, if that helps -- was courteous, and, much to my surprise, he ... listened. No script, no checklist, no "let's spend three hours going over all of the stuff that I had figured out before I called tech support in the first place." He let me talk, asked two very specific questions about the contents of the log file, then simply agreed with my assessment that the hard drive was in imminent danger of failing, and that a new one would be shipped to me right away. Despite the fact that it was about 4 o'clock on the east coast when I called, **** informs me the new drive will arrive this morning.
I was stunned. I can honestly say that even when I was working in an environment where we paid over $100K/year for support to Sun, I have never been treated with the courtesy, respect, promptness and knowledgeable professionalism that I was when I called Penguin Computing yesterday. I felt that my problems were of genuine concern and that everything possible was being done to correct them. I promise you that from this point on, as long as you keep doing what you're doing, I will never buy a server or Linux workstation from anyone else.
This is exactly what we need on the 'net for us sysadmins to read. Failure stories. Why? You don't learn much from success stories, because things worked the first time.
"Welcome to the HOWTO. My setup worked the first time. Why didn't yours?"
Granted, no one wants to see stuff on the 'net go down (and we're glad you're back, /.). But writeups like this one and Steve Gibson's at GRC about the DDoS attacks are priceless. They show what people have tried, what hasn't worked, what did work, and definitely where to start the next time.
Really, what Linux (and other geek subjects) need is to have a Great Book of Failure Stories -- writeups like these that detail horrible outages, downtimes, misconfigurations, security hacks, etc., so that we all can learn from others' mistakes.
Blog,Twitter
I hereby propose the term "anne-tomlinson", or "tomlinson" to describe the act of departing a company in the most suspicious of circumstances, known only to a very privileged few. Used in the following example:
X: "What happened to Anne?"
Y: "I don't know; all I know is that she anne-tomlinsonned from work."
Note that this verb should have the subject of the remark used as the subject of the verb, and the organization left as the indirect object. This should be adhered to regardless of whether the subject quit, was fired, laid off, died, disappeared, never existed, or there was a mutual decision for the subject to leave. In fact, the verb should mainly be used when the method of departure is unknown or never officially stated (or, even officially acknowledged).
Also note that this verb should NOT refer to a person leaving another person, as in "Fred's now-ex-wife had tomlinsonned from him." The number of people (one or more) that are the subject should be less than the number of people who the object represents.
Continuing on, this verb should NEVER be applied in a self-referential manner, i.e.: "I anne-tomlinsonned from them". This implies that the subject either A) knows the reasons, and is just being a prick about not stating them, or B) the subject does not know the reasons due to massive thick-headedness.
Lastly, this term should only be used to convey the sense of impenetrable mystery surrounding the departure. It would be oxymoronic to state: "Ted tomlinsonned because he was bored and wanted to leave." If the mystery surrounding the departure is penetrable, use another phrase.
anne-tomlinson, v.: to leave or be removed from a group under extremely odd, and mysterious, circumstances; especially when the actual method of departure or initiating party of departure is unknown. More especially, when the actual departure is apparently covered up or left un-acknowledged.
tenses: anne-tomlinson, anne-tomlinsons, anne-tomlinsonning, anne-tomlinsonned, had anne-tomlinsonned.
"Don't mind me cutting myself on Occam's Razor"
Well, we have a $LITTLE_NUMBER support contract with Cisco, and have had similar with two previous companies.
Our results were much the same. Very, very responsive people.
I have to agree with Taco, if they gave this kind of service down at the DMV, they'd be picking up passed out folks left and right.
*scoove*
for Cisco's folks to be so empowered. I know at my tech support job I can run around my wheel for hours before being told anything of merit. During a DOS attack on my university, the operations folks did not correctly diagnose the problem for hours, and left me in tech support with about 8000 angry students calling in...I wish more places would follow Cisco and give the techs some real power.
And I love the review of what went wrong. Reminds me of similar situations with missing computers...check everything and with everyone, and no one knows where computer is, call more people, try more logs, nothing, then the one guy who's out of touch comes in and tells us he has it...
http://thechubbyferret.net - Ferret pictures and informative links.
Half the VLANs were only stored on one unit and the other half of them on other. So when one died it only knew half of the full setup and couldn't route things correctly since the VLANs it wanted weren't there
Basically the network was fine as long as both cards were up since they could share their half of the VLAN info with the other. Once one card went down, the other had no idea what to do with traffic to/from the other half of the VLAN.
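The failure mode described above (each module holding only half of the VLAN definitions) is the kind of thing a config audit can catch before a card dies. A minimal sketch, assuming the VLAN list of each module has already been extracted somehow; the VLAN IDs below are hypothetical, and `missing_vlans` is an illustrative helper, not a Cisco tool:

```python
def missing_vlans(module_vlans):
    """Map each module name to the VLAN IDs it would lack if it had to run alone."""
    all_vlans = set().union(*module_vlans.values())
    return {
        name: sorted(all_vlans - set(vlans))
        for name, vlans in module_vlans.items()
    }


# Hypothetical VLAN assignments for the two router modules.
modules = {"kyle": [10, 20, 30], "stan": [40, 50, 60]}
gaps = missing_vlans(modules)

for name, missing in gaps.items():
    if missing:
        print(f"{name} cannot take over alone; missing VLANs {missing}")
```

If both modules carried the full set, every list in `gaps` would be empty and a failover would leave the survivor with everything it needs.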
Oracle? Maybe if you live in the US. Around here we get the line "Sure we entered your bug report into our database. However, we are unable to tell you when it will be fixed. Maybe next week, maybe in ten years. Sorry, we are only a small branch (...but managing all of Benelux no less..), and get hardly any more info than you have.". And no, that particular problem (RMI in Jserver crashing after several hours of just sitting there..) has not been fixed in a week. Actually, we still haven't heard back about it, even though it was reported last autumn.
Say no to software patents.
Kurt:JESUS CHRIST!!! Don't go in there man!
Dave:What the hell is it!
Kurt:I don't know but it's big and it's pissed off!
AARRRrrhhhggh....
Kurt:That was the Exodus admin. I told him not to go in there!
Yazz:Shit! It's the fucking Lameness Filter man! The Lameness Filter!!!
Kurt:It must have mutated or something, my god it's turned on us! I don't understand... wait... what is that sound...
ARAAARAAAAHHHHHHHhhhhhh
Someone you trust is one of us.
What about the chick that quit or was fired in the middle of the crisis. I want to see photos of Rob crying when the boys at VA told him to get the site back up or he'd lose his job. This isn't a blow by blow. This is a cover up. The Powers That Be are pulling the wool over our eyes.
Someone you trust is one of us.
Reading the description of this outage was just like a day-to-day description of life at my job. I'm not a network engineer, I'm a software developer. And explaining this stuff as part of my job to non-programmers is next to impossible.
My job description involves lots of doing requirements, designing solutions and then implementing those solutions in software (with some testing thrown in if the PHB will allow for it in the schedule). All of this sounds like normal programmer fare. But then a production outage rolls in... Some client calls saying the system is down, or the data is corrupt, or the response times are unacceptable, or whatever, and the firefighting mode goes into full gear. All of the developers go into full debug mode: is the database OK, did the software get changed, is this a code bug or something environmental, etc. Everyone brainstorms, tries ideas and eventually the problem is solved. Sometimes the developer who wrote that module/class/program/function is on vacation, or [s]he was a contractor who quit or was fired last week. Sometimes a sysadmin helpfully made a system change that fubared all of the software's configuration, sometimes an idiot developer hardcoded a value that should be dynamic. And the options go on and on.
What surprises me is how regularly this occurs in all of the various software jobs I have had. And how large a part of my job this kind of real-time, customers-are-waiting-and-we're-losing-money-for-every-second-of-outage production debugging really is. I would ass-u-me that if I ever worked on an air traffic control system or NASA flight software that the testing would be rigorous enough that this would not happen. But the reality seems to be that in most software jobs, good enough is good enough, and bugs really do happen in production systems.
Thanks for the cool blow-by-blow analysis of the outage. Of course, I'd prefer if /. was just up and available 24/7, but the reality is, there will always be problems. It's interesting to read about what they were and how they were fixed.
Blocking pings makes it slightly harder for crackers to find your machine when they are scanning huge subnets at a time... They won't want to waste their time running a port scan on a machine that they're not sure is up.
I was recently running a network security scan at my place of work, pinging machines to see if they were up before scanning them and was unaware that icmp traffic was dropped by default, so when the results turned up about 3 vulnerabilities on a network saturated with Microsoft products, I was a little skeptical.
SuPz.orG
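When ICMP is dropped, as in the scan described above, one common workaround is to treat a host as alive if any TCP port answers at all, whether it accepts the connection or actively refuses it. A minimal sketch, assuming you are only probing machines you are authorized to scan; `tcp_alive` and its default port list are illustrative choices, not taken from any particular scanner:

```python
import socket


def tcp_alive(host, ports=(22, 80, 443), timeout=1.0):
    """Return True if any port answers at the TCP level (accepts or refuses)."""
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True   # something accepted the connection
        except ConnectionRefusedError:
            return True       # a reset came back, so something is answering
        except OSError:
            continue          # timeout or unreachable: try the next port
    return False
```

A firewall that silently drops everything will still look dead to this check, and a firewall that sends resets on behalf of a host can make a dead machine look alive, so the result is a hint rather than proof.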
It's the hardware.
They don't have a hard drive in them. All they have is N megabytes of Flash RAM that has to store the OS, and the configuration.
If they stored those logs in the Flash RAM then:
It's funny what people believe they actually have a right to.
there is no thing
what else could you want?
The next time you hear those words bring a rechargeable shaver, a toothbrush, toiletries, a sleeping bag, and a charger for your cell phone. It sounds like you had to learn the hard way that "about five minutes" is geekspeak for "let's take a sail on the S.S. Minnow." :)
Get off my virtual lawn, you damned virtual kids!
So when DOES The streaming WEB-based movie short of this disaster come out? Who will play cmdrtaco?
:)
The natural English interpretation of a layman would be that she didn't fix the problem. In other words, it is probably a slightly hyperbolic reference to the fact that the problem continued to persist even after the network tech had arrived. I wouldn't read into that that Rob was disparaging the actual technical skills of the person in question (unless he has those skills himself he'd hardly be in a position to judge), but rather take it as a simple tongue-in-cheek reference to the distance between skill and expectation (i.e.: as we had hoped..)
It is easy to hope that your techs can walk on water. It is slightly less common to find people who take a shortcut across the duck pond on the way to work.
LibBT: BitTorrent for C - small - fast - clean (Now Versio
Wish the companies I deal with on a regular basis ever showed that level of skill when I need help. well... hmm... actually Speakeasy is generally pretty good about accepting that my problem is accurately diagnosed and figuring out what's wrong. And Viewsonic the other day was able to provide refresh-rate specs on a monitor I wanted to order within about 60 seconds of my placing the call (Though they dropped the ball by not having the specs I wanted available on their web page) What is this trend of good service? It's scaring me...
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
we're talking an 800 number that you dial and within less than a minute you are talking to a technician...
Can anyone familiar with Cisco, besides people working for /., confirm this? I'm curious how much is Cisco's good customer support and how much is the fact that OSDN probably has a $BIG_NUMBER support contract with Cisco.
Just curious.
My my, what a putz you are.
--
NetInfo connection failed for server 127.0.0.1/local
I think the SEC needs to be informed about this...a public company lying to its "customers" (aka, the Slashdot readers)? Hmmm. Time to drop them an email, I think!
Just because someone screwed up at your work doesn't make your mantra a universal rule.. Especially when dealing with something like a router or a switch.. These things are normally not meant to be user serviceable and will take a reboot just fine (no hot swappable drives there).. You could have hit a 1ppm problem and rebooting just brings everything back online until statistics kick in again. A little uptime is better than none.
Sure it won't fix anything per se, but getting things normalized enables you to start concentrating on the problems at a less hectic pace..
I can only say one word: WOW! Pretty kewl! I am sure glad we have Cisco hardware now! :) We used to be an IBM-only shop, even for switches! But Cisco bought IBM's networking out, so we are switching to Cisco. We still have a big IBM switch left, but they are swapping it out for a gigabit switch. We're going to be 100 Mb to the desktop, with a gigabit backbone. Not sure what we are going to do with the T-1's (we have 4 T-1's...some ask why don't you go with an OC-3? 4 T-1's are probably cheaper and provide redundancy...nuff said).
Gorkman
---
/bin/fortune | slashdotsig.sh
They could take some pointers from Cisco, about what "tech support" is. Anyone here ever deal with Netscape/AOL over the phone? And I don't mean for client support, we're talking iPlanet and commercial product support.
also, isn't it against some RFC or something?
--
Free Mac Mini
WTF?! There were Dave, Kurt, Hemos, Yazz, and possibly others. Not one person could stay up? They all were too tired? You couldn't take shifts?
Yeah right.
I mean, how could you even sleep, realizing that
Free Manning, jail Obama.
You only have one DNS server?
Ah, that makes sense. I was assuming that these things had their own hard drives.
Considering what their routers cost, I'm surprised Cisco doesn't give you a dedicated journaling drive. Guess they don't fail often enough to justify anything that elaborate.
Dahlmann tightly grips the knife, which he may have no idea how to use, and steps out into the plain.
I'm not sure I understand that. Why does the router purge its logs when you reboot it?
That sounds lame as hell. (Granted, though, configuring a Pipeline 50 goes right over my little bow head, much less a Cisco. So yes, I'll stipulate that I'm talking out of my ass here.)
The act of rebooting should be just another event that gets logged, NOT a synonym for "oh, and by the way, you can delete the old log file now."
IMHO log deletion should be done on a calendar basis; everything more than x days old gets purged automatically. What's Cisco's rationale for auto-deleting logs during the boot process?
Dahlmann tightly grips the knife, which he may have no idea how to use, and steps out into the plain.
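The calendar-based purge proposed above can be sketched in a few lines. This is only an illustration of the idea, not how any router actually manages its log buffer; the timestamps, messages, and 30-day window are made up for the example:

```python
from datetime import datetime, timedelta


def purge_old(entries, now, max_age_days=30):
    """Keep only (timestamp, message) entries newer than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [(ts, msg) for ts, msg in entries if ts >= cutoff]


# Hypothetical log contents; note that the reboot is just another entry.
now = datetime(2001, 6, 27, 12, 0)
log = [
    (datetime(2001, 5, 1, 3, 0), "link flap on uplink"),
    (datetime(2001, 6, 24, 7, 0), "system restarted"),
    (datetime(2001, 6, 26, 9, 30), "config changed"),
]
kept = purge_old(log, now, max_age_days=30)
```

Age is the only thing that removes an entry here, so a restart never wipes the history that would explain the restart.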
In all truth I was simply asking for the full story - it seemed like some pertinent facts were being swept under the carpet in the write-up, and this is the last place I ever expected that to happen. I don't want to know what names people called each other; just let us know if the original story was crap or not. If it was just Taco suffering lack of sleep, then say so. If there was a dispute and someone quit then say so.
I don't care what the tech's name is; I just don't like it when people "wash" stories to avoid anything that may reflect badly on them. Don't just tell the truth, tell the whole truth.
And as for the modding down of posts - I guess no-one's talking about that and it's not something that we're likely to hear about any time soon. Sure, it could be the Slashdot editors, but it could equally be a bunch of rampantly loyal moderators who don't want to upset the status quo, but who're rarely seen since Slashdot rarely gets criticised much on its own site. Don't go running away with the conspiracy theory idea, OK kids?
OTOH, if you've calmed down a bit Jamie (or any other editors), would you care to give us a definitive answer as to whether the editors can/would moderate down like this? Try to remember that we all like coming here and expressing ourselves, and that's the only reason we're asking.
But check out http://slashdot.org/comments.pl?sid=01/06/27/124207&cid=86 - a thread I started to ask what was going on when I noticed everyone asking about this was suddenly hitting -1 despite it clearly being a popular question. Kurt Gray makes some worthwhile input, and Jamie McCarthy shoots his mouth off too, though the question still isn't really cleared up.
C'mon, tell us the full story!
BTW, feel free to mod me down, prove my point and compound my paranoia; I've got karma to spare : )
Cisco places only a very small PROM in most of their equipment for NV Storage. This is where the IOS and config are stored. The idea behind not storing logs there is so that you'll have plenty of space for your access lists and other custom config stuff.
EveryDNS. Use it. It works.
AC's need not reply
There's 4 little copper contacts on the top of where the battery clips in. Bend those outward, and your problem should be fixed. Before I was laid off from my last sysadmin position, we had these as well. When we asked our Nextel rep about it, they told us that's what they do to them.
EveryDNS. Use it. It works.
AC's need not reply
Yeah, as if you couldn't discover the same information about the machine by talking to the HTTP server :P
-SK7
"a powerful and unexpected ally..."
I am glad to hear the actual details and not some PR created legally scrubbed crap (assuming there was no scrubbing in there...)
It's scrubbed. The original story mentioned one of their network techs quit mid-crisis. Not that I blame them for brushing that under the carpet.
Temkin
You can't buy advertising like this. (you have to get it the old fashioned way...you EARN it). Let's hope the faster/better/cheaper downsizing craze doesn't hit the guy who runs the TAC organization. Or, better yet, let's hope it does; my company would love to have him do his thing for them!
The key is that their "redundant" Supervisor modules weren't. I didn't even know you could do that with Sups on a 6500 (share VLANs). My understanding is that only one is active at any given time. Is this right?
Derek
Don't Panic...
But, says Yazz, "Since the Cisco was rebooted there were no logs to look at."
You fell into the classic "Windows" trap.. this is what I tell the Jr. tech guys here when one of the servers goes wonky: "If it doesn't work, there is a reason; something is wrong. Rebooting will not fix the problem."
WTF!
Any halfway decent Network Ops (of which I'm in charge of one) have syslog servers for all their stuff. Often one uses some sort of smart agent checking those logs, like CiscoWorks or DFM. (Or handwritten in Perl, like for our firewall logs.)
BTW, 6509s are pretty good stuff, we have a few here at The Big Hospital. And believe me, you don't want any network downtime here...can you say "dead people are cool"?
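A handwritten log-checking agent of the kind mentioned above does not need to be elaborate. Here is a minimal sketch in Python rather than Perl, assuming the collected lines carry Cisco-style `%FACILITY-SEVERITY-MNEMONIC:` tags; the sample lines, the hostname `core1`, and the threshold are hypothetical:

```python
import re

# Cisco-style messages carry a tag of the form %FACILITY-SEVERITY-MNEMONIC:,
# where severity runs from 0 (emergency) to 7 (debug).
TAG = re.compile(r"%([A-Z0-9_]+)-([0-7])-([A-Z0-9_]+):")


def alerts(lines, max_severity=3):
    """Yield (severity, facility, mnemonic, line) for messages at least as severe as max_severity."""
    for line in lines:
        match = TAG.search(line)
        if match and int(match.group(2)) <= max_severity:
            yield int(match.group(2)), match.group(1), match.group(3), line.rstrip()


# Hypothetical lines as they might appear in a syslog server's file.
sample = [
    "Jun 24 06:58:01 core1 %LINK-3-UPDOWN: Interface Vlan10, changed state to down",
    "Jun 24 06:58:02 core1 %SYS-5-CONFIG_I: Configured from console by vty0",
]
found = list(alerts(sample))
```

Lower numbers mean more severe, so the default threshold of 3 passes errors and worse while ignoring routine notices.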
"We live in the lovely quiet and dark." - John Varley
Hard drives fail too often to justify anything that elaborate.
Karma: Bored. (Thinking about resurrecting the "Anyone else is an imposter" joke.)
Maybe you need to read the account again. They replaced the FreeBSD bridge/firewall with the crossover cable to see IF the firewall was causing the problem. Their firewall was NOT a crossover cable.
Kent
I have never seen a better proxy advertisement for any company than this slurry of posts regarding the overall superiority of Cisco tech support. If getting their routers did not require the purchasing power of selling my soul or my firstborn child, I'd buy one.
p.s. would've applied (R), sm, and tm as needed, but <sup> isn't allowable HTML. :P
"[T]he single essential element on which all discoveries will be dependent is human freedom." -- Barry Goldwater
"Since the Cisco was rebooted there were no logs to look at" or you guys have a seriously weird logging set up. Rebooting a Cisco box shouldn't nuke the log, it should still be there. Also, here's a hint "logging a.b.c.d" will get it to log to an external syslog server.
mas cerveza, por favor politically incorrect stu
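On the receiving end of that `logging a.b.c.d` line, something has to listen for the UDP datagrams and write them to disk. A minimal sketch of such a collector using Python's standard `socketserver` module, assuming classic BSD-style syslog packets with a leading `<PRI>` field; `LOG_PATH`, `decode_pri`, and `serve` are illustrative names, and a real deployment would use a proper syslog daemon:

```python
import socketserver

LOG_PATH = "router.log"   # hypothetical destination file


def decode_pri(packet):
    """Split b'<PRI>rest' into (facility, severity, rest); PRI = facility * 8 + severity."""
    text = packet.decode("utf-8", errors="replace")
    if text.startswith("<") and ">" in text[:6]:
        pri, rest = text[1:].split(">", 1)
        if pri.isdigit():
            return int(pri) // 8, int(pri) % 8, rest
    return None, None, text


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        data = self.request[0]   # for UDP servers, request is (data, socket)
        facility, severity, message = decode_pri(data)
        with open(LOG_PATH, "a", encoding="utf-8") as fh:
            fh.write(f"{self.client_address[0]} {facility} {severity} {message.strip()}\n")


def serve(host="0.0.0.0", port=514):
    """Run until interrupted; binding to port 514 normally requires root."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` on a separate box means the messages survive no matter how many times the router itself is rebooted.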
But I think that his point is that it's certainly NOT the FIRST BLOODY THING YOU DO. Cuz depending on what's wonky, it might not come back up at all.
Vintage computer games and RPG books available. Email me if you're interested.
"I do agree that they should probably try SOMETHING before resorting to rebooting it, but it's the easiest way to tell if something is broke."
Actually, I'd say that looking at the logs and doing diagnostics is the easiest way to tell if something is broke, for a piece of hardware like a Cisco. But oops, they weren't syslogging to a box (ideally with a dot-matrix printer; try cleaning THOSE logs, cracker-boy!) so they lost one of their best tools for finding out what went wrong.
Vintage computer games and RPG books available. Email me if you're interested.
I know that it's a bad reflection upon your business and all to hear of things like this POSTED to your site about how badly configured stuff was behind the scenes. At least at first glance.
I am glad to hear the actual details and not some PR created legally scrubbed crap (assuming there was no scrubbing in there...) There always are bad moments when you get a group of cranky nerds up way past their bedtimes under a lot of stress, but hey that's what makes us human. So props to you guys for being open with this kind of stuff :P It's nice to know some people are honest...
Jeremy
Look, and you're part of the reason no doubt, /. is like home to a bazillion computer geeks. A small minority that is hugely vocal.
Okay, so people get stressed and feelings get bruised. Someone quit under certain circumstances. It was not anyone's business honestly to post that to the FRONT PAGE of Slashdot. It's none of our business. They are being damn nice as it is being this open with it. Things get crazy when you have THAT MUCH stress on top of you. It's like the end of the world to the people trying to fix it.. trust me I have been there.
Anyway, just let it go folks..
Jeremy
Where did you get your hard cover notebook? I've never heard of such a thing and my search at local shops has proved fruitless...
Peace,
Amit
ICQ 77863057
[o]_O
I will now prove, using extremely shaky methods, that "Blow-by-Blow Account of the OSDN Outage" by Roblimo is, in fact, an epic myth.
I. Call to Adventure
"By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem. The network operations people were paged, but did not respond."
II. Meeting the Mentor
CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."
Whoops, that's not it.
"So I called Cisco tech support."
There we go.
III. Obstacles
"Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documentation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here and wasn't available this weekend) or if you've had the joyful opportunity of spending a night trying to trace through it all under pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you."
IV. Fulfilling The Quest
"He bounces the switch... copy startup-config running-config ... the switch resets itself... then email starts streaming into my inbox... then I can ping our sites all of a sudden... we're back online! Everything is back! Weird."
V. Return of the Hero
"The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers."
"Tuesday was router reconfig day."
VI. Transformation of the Hero
"At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be."
"We certainly aren't going to make the same ones [ed: mistakes] again!"
Peace,
Amit
ICQ 77863057
[o]_O
Now, let me tell you about Minimed tech support. They are unbelievable. The most I have ever had to wait for assistance with my pump is 3 minutes, and this is several calls at random hours (3 am, 2 pm, 11:30 am, etc, all hours, no matter what). The tech support people know the Minimed pump product better than anything.
But the real reason is that they have to. Sure, Cisco and Arrowpoint have big $$ contracts with their customers to keep routers and such in order. But if a router goes down for a few hours, it doesn't mean the death of the customer. If a Minimed insulin pump stops working for even a few hours, and the user doesn't realize it, they can go into diabetic keto-acidosis and potentially die. Now, this hasn't ever happened, to my knowledge, but the possibility is still present. Therefore, to avoid the unbelievable criminal lawsuits, Minimed has what I would expect to be the best tech support in the world.
I dont have a
Years ago a customer sent us an obsolete Cisco router (very old) for repair. They actually bought the router from someone else in Australia or something.
Basically the thing was really dead, and really obsolete so we didn't have any in stock, so I reported it to the Cisco TAC.
Days later, a replacement router came in (lower priority coz the customer had other routers). I asked the TAC how are we going to sort out the payment part - e.g. who pays who etc. Turns out the TAC doesn't care - as long as the problem is fixed!
So the fix was on the house! Of course the customer was informed that as this was an obsolete router, fixes like this might not happen in the future as there probably won't be any routers of that model in stock.
But the main thing was: problem fixed for free, boss happy, me happy, TAC happy, customer very happy (I think we didn't charge them anything either :) ).
So guess what routers the customer will replace their obsolete ones with? :)
That's why I have no reservations recommending Cisco routers to people who can afford it. Life's tough for those that can't afford it.
These guys know their stuff. Heck many of them know their competitor's stuff better than their competitor's tech. And there were cases where Cisco's stuff worked better with Brand X's stuff than Brand X's stuff with itself!
I'd prefer Cisco's way of achieving market share. I hope they keep trying to do a "good clean fight" and showing that you can win without playing dirty (unlike some other companies..). They're going through tough times. Economy slow down, stiff competition (Juniper etc).
Cheerio,
Link.
It's not like a good monitoring system is hard to find.
Milalwi
Service Level Agreement.
BlackNova Traders
The dual sup part works great. If things are configured correctly then the primary sup config is copied to the backup sup, and if there is a failover then you still have the good config.
However, it don't work that way with MSFCs. If you have dual MSFCs in a chassis then the MSFC on the secondary card does NOT have the config of the primary MSFC. It's not a primary/secondary setup, it's more like having 2 completely separate routers. So if you have only one 6509 chassis and you have dual MSFCs but you didn't configure the 2nd one then you're h0zed. In our normal config we'd never notice...if one MSFC fails then the primary MSFC in the other chassis will take over (HSRP/VRRP being the handy thing that it is) and you swap out the bad hardware. But if you're someone that says "hey, I paid for 4 MSFCs, and dammit I want full redundancy" then be prepared for this tres funky 4-way HSRP setup. It's tasty.
And guess what...while our reseller was happy to sell us that config (2 chassis each with dual Sup/MSFC), Cisco does NOT recommend it. They recommend in our situation, 1 Sup2A/MSFC and 1 Sup2A WITHOUT MSFC per chassis. Go fig...
"Where quality is like a dead stinking rat - you just can't miss it."
I find it highly amusing that amid all this praise of Cisco's excellent support services, their training manuals are horribly written, have piss poor grammar, and even the occasional glaring technical error.
You'd figure that Cisco is big enough that they could afford to hire a few decent writers :)
Need help treating your acne? Come here!
Well it usually does with windows, so you can't really blame them...
-- Computers are not intelligent. They just think they are.
Yeah, check out my post on the previous story. I guess the powers-that-be don't like you to point out their mistakes.
Think this one will get modded down too?
Monkey sense
"I didn't see anything to explain that in the report."
From the article:
One of the two router cards in the 6509 died. When the failover to the other card occured, the other card didn't have all the information it needed to do its job. Thus, the original problem was the dead primary card. The contributory problem was the screwy config. But the screwy config didn't matter until the primary card died.
#!/usr/bin/perl
use Conspiracy qw(Censorship Story);   # fictional module, part of the joke
my $outage = Story->new;
$outage->change();
$outage->change();
my $controversy = $outage->replies();
$controversy->mod_down();
my $interestlevel = $controversy->modpoints();
if ($interestlevel > 1) {
    $controversy->mod_down();
    $controversy++;
}
I have a feeling it was just a simple re-classification of the article. I noticed this too, and thought it was quite odd. So I did some poking around before (at the time) just shrugging and bookmarking the story.
The article first showed up on the front page, but disappeared sometime while I was reading the comments. However, the topic "slashdot" icon was still on the top of the screen. Strange, but I don't know how the site works in that regard. I thought maybe they reclassified the story and thought it wasn't good enough for the front page. Turns out, that wasn't too far from the truth.
I searched around in some of the different sections but didn't find the article. I even manually searched for the article itself, but still didn't find it. Oh well, I thought, and I went on with my day.
Lo and behold, an hour or two later, the article is back, but now it is contained in the "Features" section. I had checked there before, but didn't see it. Also, I think the "Features" label didn't appear in front of the story title initially (probably because it hadn't yet been assigned to that section).
Simply put, I think the disappearance and reappearance, was just a side effect of reassigning the story to a new section. Short story made long, I know, but I thought it was kind of interesting, too.
"I say consider this day seized!" -Hobbes
"I say consider this day seized!" -Hobbes
"Tomorrow we'll seize the day and throttle it!" -Calvin
Except, of course, when it does. This is Windows we're talking about...
Doug
I was rather mystified by the moderations myself. I tried to right the wrong with my own measly 5 points (which I'm giving up by posting this), but I'm sorry I couldn't make much of a difference there.
In any well managed network system, there is always one person who is the "Network Enforcer". That is, someone whose basic function is to be a dick about disaster planning, redundancy, regular backups, frequent system failure tests, network management, documentation, etc. I've played the Network Enforcer role before and I initially made a lot of people unhappy. But when all the major network problems finally disappeared and we went 6 months without any downtime, everyone appreciated the work I did. Especially my wife because I could leave at 5pm every day and never worry about getting called into the office.
Frylock: That's not a toy!
Master Shake: You say that about everything you own. You should own toys. They're fun.
No matter WHAT transpired, it is NOT any of our business what happened with their tech. Whether they're all sexist bastards, or she's a flaming idiot, they have obligations as an employer not to discuss things of that nature. No one here has "a right to know," period.
Just because it's a popular website doesn't remove it from certain rules that all businesses have to follow; Employee/Employer relations are private, and they are to remain that way. The post about her was probably pulled down after it was posted because it was done in the heat of the moment; as well it should have been. We all have moments of anger, frustration, or just stupidity. But then we regroup, gather up our senses, fix what we can and move on. If she feels she was unfairly fired or treated, she can file a lawsuit. If they feel they didn't receive the services they should have, they can do the same (if she was a contractor). Either way, it's not our concern.
So why don't y'all calm the fsck down, and back the fsck off, because it's none of our fscking business.
Oh, IANAL, so I don't wanna hear any crap about who can sue who, that's what I believe is true, I may be wrong. The only person I want to hear tell me I am wrong is a lawyer, if you don't fit that bill, I don't want to hear it.
I Haven't Lost My Mind -- It's Backed Up On Disk Somewhere
My rantings, only longer and with better spelling..
Not everything is a conspiracy folks.
:)
I'm sure that's what you'd want us to think.
Take a relational database for example; there is so much that can go wrong with it. For starters, there are bugs in such complex products, and fixing them (save for PostgreSQL) is beyond your control.
But it need not even be a bug in the database code. It can be something in your network component (we chased cases for months which turned out to be a DECnet issue, but were attributed to the database server), or it could be the fact that the db vendor compiles his product on multiple platforms and it's virtually impossible to test every function of a new release on every supported platform. Yes, I know that in an ideal world this should be done, but it isn't.
Assume it would be possible to perform such tests. Save for proprietary (or semi-proprietary) architectures like OpenVMS/AXP, you can have so many different hardware and network components that it's just not possible to foresee all eventualities.
After ruling out such possibilities, we're not there yet: What are the query characteristics? How many concurrent users do what, and when? What front ends do they use, and how are they connected? The problem may even be caused by a component that has nothing to do with the database engine (Access front end, anyone?).
Although the fundamental cause of the problem might never be detected, a reboot of the data server might fix the problem, and it will never occur again, since the same combination of factors occurs so rarely that it's impossible to reproduce the problem.
However, the [alt-ctrl-del] attitude of younger IT folks (specifically those that grew up in a PC environment) makes me barf and indicates just how clueless a lot of those folks are. You never reboot a production IT component unless there is no other choice or it is part of your normal maintenance cycle (memory leaks do occur in software).
I am the musician
with a pocket calculator in my hand
Kraftwerk
A third's cell phone was on the kitchen counter, unhearable from the bedroom, and the fourth one's cell phone battery had fallen out. It was a frustrating comedy of errors, and an unusual one.
Yeah! And a dog ate my book report. And I was abducted by aliens. And my tyre was flat. And I forgot to feed the dog. And And And - and these two techs are full of it. They were probably smoking weed and playing Nintendo.
What is this zikzak troll? Yes, I did find the user zikzak, but I could not find anything about the great troll he made, and I think I was not reading /. when he made it. Tell me about it.
Yes, you can mod this as offtopic (how much karma can I lose in 1 post?). But why did they post about the outage? To advertise Cisco? Because it was a detailed geeky story? (It's not; the very technical details are missing. Tell me more about this network layout. Why can't we ping the site?)
Blame Cisco
Times have changed,
Our Slashdot's getting worse,
There's no more "stuff that matters,"
Just a hit on VA's purse!
Should we blame the government?
Or blame our ISP?
Or should we blame the h4x0rs at DirecTV?
No!
Blame Cisco! Blame Cisco!
With their blinking LEDs,
And inflated techsupport fees,
Blame Cisco! Blame Cisco!
We need to form a full assault!
It's Cisco's fault!
Don't blame me for old JonKatz,
He lost his damn connection,
Now he's shooting at little brats!
And poor Roblimo once had
pictures of Heidi Wall,
But now, when I see him,
He tells me to suck his balls!
Well,
Blame Cisco! Blame Cisco!
It seems like everything's gone down
Since Cisco came to town.
Blame Cisco! Blame Cisco!
They're not even a real company, anyway.
Slashdot could've been the place to get our fix of daily news,
Instead we just get jpegs of the results of anal screws!
Should we blame the Editors?
Should we blame the Trolls?
Or the moderators who let them take their toll?
Heck, no!
Blame Cisco! Blame Cisco!
With all their worthless stock options,
And that bitch Anne Tomlinson,
Blame Cisco! Shame on Cisco, for...
The crap that we flood,
The news that's a dud,
The MPAA,
Your Rights gone away:
We must blame Cisco! Shout and cuss -
Before somebody thinks of blaming us!
sulli
RTFJ.
Either Anne is real or she isn't. If she's real, this is an internal matter that we really don't need to interfere in. If, as the "Anne" poster suggested, she quit because Taco and Hemos are hard to work with, she was within her rights and should get at least some support from a community which often says "Quit! Now!" to Ask Slashdots about PHBs.
If she's not, this is all a big waste of everyone's time, and possibly the best troll we have ever seen on slashdot. (An account by that name has a brand new uid (462836) and zero comments.) Think of the trolls you've posted - how many led to 100s of posts on other threads, conspiracy theories galore, and posts by #1 and #2? Whoever did this (if not Anne) should get mad props from the troll fans, but should not take any more of our time.
My bet is that she's not real. But in either case we should drop it and get on to more important things.
sulli
RTFJ.
Also, out of curiosity, why haven't you guys hired a Cisco buff dude? At least someone over there has to be Cisco competent or certified or something. I mean you're Slashdot, that's gotta mean something. The mentality of "Server malfunctioning? Let's reboot and make it better" is what I expect from a Windows admin ;P
Magius_AR
There is something that seems to be missing these days, not only in the high technology industries, but almost any industry...
Customer Service.
Most companies don't give a rat's ass about you after they have your money in their grubby hands. They don't want to spend the money required to maintain proper support for their products. They don't want to spend the money required to train and retain the people who can take care of their customers. And in the end, those affected the most are the users who need the support the most.
So some seriously mad props to Cisco for having the foresight to maintain and train a workforce that can help out their customers in a timely and efficient manner. It would be a much better world if more corporations got the clue that you already have.
Take a look: www.netzilient.com
Rob Eden
Sr. Software Engineer
NetZilient Corporation
"Let your heart soar as high as it will. Refuse to be average." - A. W. Tozer
I must say that my experience with cisco is slightly different. We bought a couple of switches, a 5000 and a 5501. We wanted cisco to put someone on site to help us move it in and configure it.
Cisco sends this piece of meat. He comes in and starts asking for this 5501 "router" which is sitting on the table right in front of him. You can't miss it, it's about 3-4 feet high, big blue box.
In the end, all he did was call the TAC and got someone to walk him thru stuff. I came in on a weekend thinking I was going to learn something. Bloody waste of my time.
-the_B0fh
Yeah, but if it's so screwed up that it's not even working as-is, where's the harm in rebooting it? If it doesn't come back to life after you try to boot it, then you've just nailed the problem right on the head. Otherwise, you're still running around the server cage like a chicken with your head cut off; not the most effective way to deal with a problem.
I do agree that they should probably try SOMETHING before resorting to rebooting it, but it's the easiest way to tell if something is broke..
All I know about Bush is I had a good job when Clinton was president.
Ahhh... thanks! I managed to miss that on the first read.
-S
--- What parts of "shall make no law", "shall not be infringed", and "shall not be violated" don't you understand?
OK, so the config was a mess. But it was like that BEFORE the outage, right? So what happened between "running OK" and "we're down" to cause it to fail? I didn't see anything to explain that in the report. Or maybe they don't know...
-S
--- What parts of "shall make no law", "shall not be infringed", and "shall not be violated" don't you understand?
Just don't name the router "Kenny".... he dies every week.
-S
--- What parts of "shall make no law", "shall not be infringed", and "shall not be violated" don't you understand?
I also need training in words spelling correctly before I hit submit. *ugh*
Everything failed at about 7AM Sat. Dave was at Exodus between 8:30 - 10:30AM Sat (didn't look at the log book when I got there). Kurt arrived shortly after that, I believe (again, I didn't look at the log book). I arrived there around 11:40AM Sat.
And yes, my battery was loose on my Nextel. It just takes a little pressure upwards to loosen the battery on the i1000plus I have. The battery doesn't fall out; it just gets loose enough that it loses contact and turns the phone off.
I have now taped the battery in place!
Yazz Atlas
My site got over 500 hits from Google over the weekend from people searching for Slashdot. And who said search engine placement doesn't mean anything?
Heck, it's like that all the time. Or did you mean slower? I just assume it's the slashdot effect happening to slashdot.
-- All your .sig are belong to us!
A feeling of having made the same mistake before: Deja Foobar
Let me first mention I'm a CCIE, so somewhat biased. I call TAC for one reason or another a couple of times per week - Generally, I can talk to an engineer on average in 15 minutes. Of course, the problem you have will make the wait time vary greatly - If it's a simple hardware RMA, it may be 1-2 min. IBM problems, 5-6 min. Common troubles with route protocols may take longer, depending on the size of your network and what priority you can get (Priority 1, big network, will get attention quickly)
Don't pick up the pho*(@)$*@&@!@ NO CARRIER
You forgot to document how it takes you 20 minutes to get to your cage from the front door of Boston 2.
:)
You forgot to document how you had to bang on the bullet-proof glass to wake the guard up at 3:00 in the morning.
Obviously, I've been to Exodus a few times
Yes, my girlfriend is a BitchX
That is funny. Hehehehe.
I agree that it's important to have editorial discretion and, after all, you never remove posts. It just might be walking a dangerous line is all.
~
If you don't test, failover will fail.
Many Internet eons ago, I dealt with a system (codenamed Rosewood)
and its offspring that were fault tolerant.
I learned early on to simulate failures to exercise the redundant components
to make sure they were functioning. Sometimes daily but at least weekly.
It's less stressful to catch that failure of the backup, and go back to the primary
while the primary can *still* function.
It also makes life easier on the weekends!
--
You are being MICROattacked, from various angles, in a SOFT manner.
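The weekly drill described in the comment above can be sketched roughly as follows. This is only a toy model under stated assumptions: the `Node`, `Pair`, and `drill` names are hypothetical, the "health" flag stands in for whatever real check a site would run, and the Stan/Kyle names are borrowed from the article's redundant router modules purely for flavor.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One half of a redundant pair (hypothetical model)."""
    name: str
    healthy: bool = True


class Pair:
    """A primary/standby pair where exactly one node carries the load."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.active = primary

    def fail_over(self):
        """Switch load to the other node; refuse if that node is unhealthy."""
        target = self.standby if self.active is self.primary else self.primary
        if not target.healthy:
            return False
        self.active = target
        return True


def drill(pair):
    """Simulate a failure on purpose, then go back to the original node.

    Returns True if the other node really took the load. Running this
    while the original is still healthy means a broken backup is found
    on a quiet weekday instead of during a real outage.
    """
    took_over = pair.fail_over()
    if took_over:
        pair.fail_over()  # return to the original node
    return took_over


if __name__ == "__main__":
    pair = Pair(Node("Stan"), Node("Kyle"))
    print("standby works:", drill(pair))
    pair.standby.healthy = False  # the backup has silently died
    print("standby works:", drill(pair))
```

The point of the design is that a failed drill leaves the primary untouched and still serving, which is exactly the "less stressful" case the comment describes.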
Same story here, but I feel it does reflect on me, and it makes fixing things a pain in the @$$. Getting the boss to pay for fixing the mess is more of a pain though.
"Profanity is the crutch of an inarticulate motherfucker" - Unknown
Dream as if you'll live forever.
Live as if you'll die tomorrow.
~Anonymous~
Thanks again for the info and the honesty.
AAiP
Obliteracy: Words with explosions
After having been modded down next to the goatse links, somebody please explain to me how the hell we're supposed to discuss the decidedly strange disappearance (and subsequent reappearance) of this story on the site without getting modded as "offtopic"?
Just where, exactly, are we to discuss this little point? For example, why did this story disappear? Was it technical? Was it editorial?
For a group that is so damned keen on openness and truth, it strikes me as somewhat ironic that several dozen mod points have been used to effectively suppress this part of the thread.
I want to know what happened. Others do too. If you can't give us a decent place on Slashdot to discuss this issue, then don't mod us down as offtopic!
Obliteracy: Words with explosions
This is why Cisco is on top. Their hardware is not necessarily the best, but it is good and their support is excellent. Try to find any other company competing in the same arena that has the same combination of hardware, software, and support. It does not exist.
-N
Actually, there is a very large class of bug for which that is not true - and a large subset of that is where the bug is somewhat repeatable. Sometimes the effort of grovelling around in crash debris is just not worth the effort. There have been many times when restarting, installing extra logging, then watching the crash happen has provided far more information in 5 minutes than grovelling for hours could have.
This is my World Wide Web of Whatever
You have your router/switch send messages to a syslog server as well as the router's log. That way, when it reboots you can still see the logs.
Joe
Security through obscurity is no security.
No matter how FUBAR'd your router/switch/firewall configuration is, it's still no serious obstacle to crackers, Robin.
if (comment like "%girl%" or comment like "%What happened to%" or comment like "%original story%") update posting set score = -1, reason = random("Troll","Offtopic")
Is there someone else out there hosting a site where we can have a non-biased discussion?
BROWSE AT -1 Check out how many posts have gone straight in at -1 (and this one too will, I betcha...)
Two wrongs may not make a right, but three
I know this is redundant, but I know people are too lazy to cut and paste, so the link is helpful.
- Dan I.
...one of our most knowledgable people had quit recently...
... and they came to an agreement: our switch config was a mess..
/. /    |\/| |\/| |\/| / Run, Bill!
Hmmm...that most knowledgable people didn't leave for nothing, it seems....
 _
Lesson Two-Tell the whole story, including the politically incorrect drama
I might have missed it, but I saw nothing that mentioned the "She" that quit for whatever reason, but I wasn't reading for that reason. However, I did read the whole account and find it strange that that was glossed over.
Thing is, all the ppl who were crying about how you were changing the posts before are going to go into hissy conspiracy fits now.
---"What did I say that sounded like 'Tell me about your day?'"---
My first argument was going to be the BSoD. It occurs all the time, you just have to reboot. But then I realized something -- it's not a random error (or so we're led to believe...), it's caused by actual flaws in the code. The best solution would be to fix these, although you can't. :)
________________________________________________
suwain_2
Beware...soon they will send you groceries when your fridge is empty! -ted
Had a similar problem with my old 6160 Nokia. Any shaking or quick motions would cause the contacts to lose connection.
:)
The solution was simple, and low-tech.
Hacked a plastic drinking straw into 3 lengths of about 7cm. Shimmed/crushed them between the battery and the telephone.
This kept the battery from moving around on the back of the phone, thus fixing my mysterious power-off problems.
And no unsightly tape!
If there's a castle floating upside down in the sky, then there's a castle floating upside down in the sky.
Man alive. I don't remember how much a phone is in the States, but I made sure to get a reliable Nokia. And I'm not even a network guy.
;)
Get a new phone, dude!
I would have actually suggested Motorola, because of their extreme quality initiatives. But I have more personal experience with Nokia.
Thanks.
Yes... that's why I thanked him. I would have suggested Motorola, and now I know it would have been a little absurd to do so, since Nextel == Motorola.
;)
I was aware I didn't make that clear... I put such crap in my slashdot posts.
Some Cisco routers are really pieces of junk.
No, Cisco support has already told me that the 67x series was not designed for what we're using it for. The solution is to get a DSLAM card for a 6100 series router, but alas we're a struggling .com...
Not completely true...
I was doing an IP address switch over a couple weeks ago, and bonehead me forgets to do a "clear arp;clear xlate" on the firewall. None of my translations are working, I can't figure out why. Access lists look good, static commands all check out, why the hell can't I ping things properly? I reboot the firewall and everything starts working. Rebooting the firewall cleared the xlate and arp tables. This was not a "symptom" that would come back (unless my fsck-ing ISP changes our IP block on us without warning *AGAIN*), it was me being a bonehead after a long day and forgetting to clear a table. The reboot fixed this nicely.
Granted when I figured out what the problem was, I felt like a total idiot for having to take down the firewall, but at least the network started working again.
I'm not sure I understand that. Why does the router purge its logs when you reboot it?
Unless you have syslog logging set up on the Cisco, and it can talk to your syslog server to offload the data, the logs are stored in RAM. Reboot and the logs go bye bye. Since the router was acting toasty and the VLANs were disappearing, the router probably couldn't access whatever box was set up to receive the logs (assuming that syslog was turned on to begin with).
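To make the "offload the logs" idea concrete, here is a minimal sketch of the receiving side: something that takes a syslog line as it would arrive from a router and appends it to a file on another box, so it survives the router's reboot. It assumes the classic BSD syslog framing, where each message starts with `<PRI>` and PRI is facility * 8 + severity. The function names and the sample message are made up for illustration, and the network listener itself is left out.

```python
import re

# Severity names in numeric order (0 = most severe), per the BSD syslog convention.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

PRI_RE = re.compile(r"^<(\d{1,3})>")


def split_pri(line):
    """Split a raw syslog line into (facility, severity_name, message).

    Returns None when the line does not start with a <PRI> field.
    """
    match = PRI_RE.match(line)
    if match is None:
        return None
    facility, severity = divmod(int(match.group(1)), 8)
    return facility, SEVERITIES[severity], line[match.end():]


def archive(line, path):
    """Append one received line to a log file that lives off the router."""
    parsed = split_pri(line)
    if parsed is None:
        return False
    facility, severity, message = parsed
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"facility={facility} severity={severity} {message}\n")
    return True


if __name__ == "__main__":
    # Illustrative line only; 189 = facility 23 (local7) * 8 + severity 5 (notice).
    sample = "<189>%SYS-5-CONFIG_I: Configured from console"
    print(split_pri(sample))
```

Whether any of this helps in an outage like the one described depends on the collector still being reachable from the router, which is the caveat the comment already raises.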
(drop quick instead of process incoming ICMP through all the firewall rules then process the outgoing icmp through all the FW rules for outbound packets.)
and, if it doesn't stop the ping flood, it prevents you from flooding with icmp responses.
(i can't remember right now any occurrence where just not doing anything takes more resources/time than actually *doing* something.)
or maybe you're pissed off because you're a little bit short on smurf amplifiers lately?
i had a sig, once..
Well, let's see: they don't know how this particular setup is configured and can't get hold of anyone who does. They aren't sure what the settings are supposed to be, so why not reboot? With a lot of problems it would have gotten them back up, and then they would have time to track down somebody. There was a valid reason to do so in this case. A reboot should have restored the configurations and gotten them up unless it was a hardware problem, and if that was the case they might actually get an idea of which hardware the problem was in. If they were networking people they would know of other ways to do it. Guess what? They're not. In this context it even made sense to reboot the firewall. They had no idea what the configuration was supposed to be, so reboot and let it start up with the services and configurations that it had stored on the hard drive. They could hardly have been in worse shape, and it had a reasonable chance of working.
"If there is nothing you are willing to die for, then you are not really alive." Myself
Some of the stories may be a little outdated, but it may help to read how other people solved their problems. Even if it doesn't help, it might still be something to enjoy :)
karma capped
Where does this mysterious woman fit into the story above?
sig sig sputnik
If you had read the whole thing, you'd know that putting the crossover cable in place was to bypass the firewall *TEMPORARILY* to eliminate it as the cause of the outage.
With the FW bypassed: If you have data flowing, your FW needs work (reconfigure, reboot, or other). If no data is flowing, look for another cause.
And if you'd ever worked on-call in this sort of situation, I think your comments would be a little softer. Nobody wants their own network to be down. If the staff thought that sleep was more important to them than getting the network up, I trust that judgement. A downed network is not something you want to be working on with 2-3 hours sleep. You can cause more harm than good. I am man enough to admit that I speak from experience, and have learned my lessons quickly.
/rant
My sources are unreliable, but their information is fascinating. -- Ashleigh Brilliant
If you can explain to me how one is supposed to methodically test a network connection with multiple single-points-of-failure, I'd love to hear it. This test usually only lasts for about a minute or two, during which time /. was not functioning, anyway. The /. staff quickly and successfully eliminated the firewall as a source of their outage.
Bypassing the firewall *is*, in some cases, the only way to determine if it's the cause of your network outage. At the companies I've worked at, this test is also documented. But it's only implemented in extreme cases due to the inherent security risks you list above.
If you saw no data flow on your network, how would *you* go about determining the cause? Please be specific.
And, yes, I came from a tolerant company. We also had multiple teams of people to handle problems. The teams consisted of the normal daytime IS staff, and on-call was rotated among us all.
The company actually told us to get sleep if we felt we needed it, even during an on-call outage. The cost of further network downtime due to lack of repair by any one individual is far less than the downtime incurred by tracking down a non-working "fix" by someone who was too tired to know when to call it a day. The potential gains from keeping us all up and working do not outweigh the risks to network stability and reliability. Because we all rotated this duty, there was never a time that we couldn't get back online in a hurry. Besides, the adage "too many cooks spoil the broth" comes to mind. Sometimes time!=money. Do a quality job that takes a little longer in the short-term, and it will be rewarded in innumerable ways in the long-term.
My sources are unreliable, but their information is fascinating. -- Ashleigh Brilliant
You actually think that "ReBOOTING WILL NOT FIX THE PROBLEM"? I work in the software/cable/digital video industry, and more often than not, when something is "wonky", you have to reboot the hardware.
And yes...the documentation actually says to reboot the hardware as a Troubleshooting technique. I've read countless docs that indicate that as a mid to last resort reboot hardware.
But you have the attitude of every Linux/*nix/whatever except Windows user I work with(yes I'm a software guy too). I know everything about everything, I'll tell you how to do your job because I use Linux/*nix/whatever except Windows. Let the guys alone and let them figure it out, that's what they're paid to do.
The reason I'm coming down on this, is because I actually lost a previous job because I helped out a new Hardware guy that was in trouble. I did his job for him, and subsequently got fired.
i understand that under pressure people get cranky and possible social problems are not our (the readers) business. still this thing left rather uneasy feeling inside because /. is the one usually laughing, bitching and flaming others of coverups.
i think there's not much else to be done to this than take a lesson: think before you submit, especially you editors! i don't want to read about accusations of incompetence in a popular public forum. it's odd to read this "Important Stuff:" section under this posting form, knowing that even editors of /. don't follow these rules.
Preserve old classics: copy your collection onto all hard drives.
what happened to the woman that quit?
-
sean
The world moves for love. It kneels before it in awe.
"I rebooted everything," he said. "I think it's the Cisco."
If he rebooted everything, why does the Slashdot stats box say, "uptime: 106 days, 4:15, 0 users, "?
"Oh, we were tiiiiiiiiiiiiired.. so we went to sleep.. AND LEFT ALL OF THE WEBSITES DOWN." :P
:P~
What kind of geeks are you people? My god, has the Slashdot Staff gotten OLD or something? Wait, no.. I've seen 65yr old COBOL programmers stay up for 36hrs straight.
The Slashdot Staff didn't get old.. they just started getting paid and now they're all soft and weak. Bah! Pathetic!
I bet CmdrTaco can't even handle a 24hr Quake-III-a-thon these days.
I too bow to Cisco.... For God's sake - this is the *biggest* company in the World...I recently set up a complex network for a growing company. I bought Cisco cause I trust their stuff and...ahem, own their stock...When I had some problems I just called and spent HOURS on the 800 number. Some of the tech people think they have the answers and don't but, no prob, just try someone else. Turns out a WAN module I got did not work with a newer IOS - so Cisco rewrote a fixed version for me....Holy Cow!!!!
After looking intensively at products like Slashcode, it does not surprise me one bit that their routers are a snarl of obfuscated tables that nobody understands.
No amount of obscurity will provide security (and as we all know, there is always a new hack waiting to be found). However, I don't think it is a good idea to paint a bright pretty picture for a hacker to go by :)
If you give someone a set of keys that will open only 1 out of 1000 available automobiles, you are doing yourself a disservice if you tell him to look for the one that is a black, 2 door, GMC Sonoma extended cab, with fuzzy dice hanging from the mirror.
The reason people block ping is because you can tell a lot about someone's equipment based on the response to an ICMP packet. In some cases you can get info on the OS/equipment models that the packet bounced off of on the other end. From this information it "might" be possible to determine which hacks/scripts/weaknesses you could try instead of just blindly trying everything in the book.
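One small, commonly cited example of what a ping reply gives away is its TTL. Different systems tend to start packets at different TTL values (64, 128, and 255 are the usual ones; which systems use which is a rule of thumb, not something taken from this thread), and each router hop subtracts one. The sketch below only does that arithmetic; the function name and the list of starting values are assumptions for illustration, and real fingerprinting tools look at far more than this.

```python
# Commonly seen starting TTLs; treated here as an assumption, not a complete list.
COMMON_INITIAL_TTLS = (64, 128, 255)


def guess_initial_ttl(observed_ttl):
    """Guess the sender's starting TTL and the hop count from a reply's TTL.

    Picks the smallest common starting value that is not below the
    observed TTL, and returns (initial_ttl, hops_away).
    """
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial, initial - observed_ttl
    raise AssertionError("unreachable: 255 is the maximum TTL")


if __name__ == "__main__":
    for ttl in (52, 117, 250):
        print(ttl, guess_initial_ttl(ttl))
```

Dropping ICMP echo at the firewall removes this particular hint, though other traffic to the same host carries a TTL too.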
Dropped 79% today, according to cnbc, any relation to the /. outage?
-----
japh
(i wish =) )
I've had to deal with Tech Support from many companies, and have been quite a bit less-than-satisfied from most. I haven't ever dealt directly with Cisco before, but I look forward to doing so. Having word-of-mouth compliments about Cisco's tech support makes me 100% more confident about purchasing their products, especially coming from (what I consider) a reliable source. Anyone who has dealt with poor tech support will surely agree. Kudos to Cisco for getting smart people and training them sufficiently to do their jobs!
VLANs were a stupid idea to begin with. I wish they had never been invented. Look. Get your own ASN and IP space. Get 2 cheap Cisco 72xx or Foundry BigIrons. Chuck the 6509. Get two BGP feeds from Exodus, one to each router (to different upstream Exodus boxes). Set up VRRP (HSRP in Cisco) on the backend of the routers - so the default gateway for the back end stuff is x.x.x.1. (Front end are /30's to Exodus.) Using the VRRP both routers will respond and be active. One fails, no big deal. This is really the cleanest IP based way to do it. Hell, I will do it for you for the right $$$. I can get it done in a 1/2 hour at the most (the cutover part!)
Just my (I have been doing this for 10 years) two cents...
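As a rough illustration of the gateway redundancy suggested above: in VRRP the routers in a group share one virtual gateway address, and the live router with the highest priority answers for it, with ties going to the higher interface address. The sketch below models only that election. The router names, addresses, and priorities are made up, and real VRRP or HSRP adds advertisements, timers, and preemption settings that are left out here.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Router:
    """A router taking part in one virtual-gateway group (hypothetical model)."""
    name: str
    address: str      # the router's own interface address
    priority: int     # higher wins the election
    alive: bool = True


def elect_master(routers):
    """Return the router that should answer for the shared gateway address.

    Highest priority wins; ties are broken by the numerically higher
    interface address. Returns None if no router is alive.
    """
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda r: (r.priority, int(ipaddress.ip_address(r.address))),
    )


if __name__ == "__main__":
    a = Router("rtr-a", "192.0.2.2", priority=110)
    b = Router("rtr-b", "192.0.2.3", priority=100)
    print(elect_master([a, b]).name)            # both up
    a_down = Router("rtr-a", "192.0.2.2", priority=110, alive=False)
    print(elect_master([a_down, b]).name)       # rtr-a has failed
```

The back-end hosts keep pointing at the same gateway address throughout, which is why a single router failure is "no big deal" in the setup the comment proposes.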
When we installed it, it asked for an email gateway and the email address of our network administrator. We thought this was so it could email us about problems... Close, but even better...
One night, one of our NetApps spontaneously rebooted itself. Came up just fine all by itself... One of those little hiccoughs that you normally wouldn't even know about if you weren't doing monitoring...
We actually found out about the reboot when we came in at 8:00 that morning because there was an email in our inboxes (reconstruction, not quote):
From: xxx@netapp.com
Subject: Patch Notification
At 3:14 a.m., your file server "z003" crashed and rebooted. From the information that it sent to our autosupport service, we see this was due to bug #754783.
Please download patch http://..... and install it to prevent further problems.
Regards,
Network Appliance Automatic Support
In other words, it emailed its logs and core files to the vendor, who had someone look at them, figure out the problem and give us a solution before we even *knew* we had the problem. Wow!
Same sort of thing when we had a disk in the RAID array fail one night... We discovered it had failed because there was a box on my desk with a replacement disk. Yes, it sent the email and they fedex'ed out a replacement without me asking!
Now *that's* support!
You missed a very important factor!
ME!
Listen, you want the real goods on what happened?
They called me up and asked me to fix it after various insults.
I quit afterwards and now I am going to make my own business.
Employer Etiquette
I am not Slashdot's bitch!
I remember when I started out in computer networking (and it didn't seem like it was that long ago), I was told this by one of the other technical members of our team, something that I haven't forgotten: redundancy in a system is necessary not only in the hardware and software in that system, but also in the resources that are used to keep that system running (that includes, of course, human resources, as well as power, HVAC, and so on).
Too often, the human part of the redundancy equation isn't totally factored in. When you don't put all of the human factors into the redundancy equation, you have a redundant system that isn't really redundant.
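The human side of that equation can be put into rough numbers. The sketch below assumes each on-call person is unreachable independently with some probability, which is a simplification, and the 10% figure is invented for illustration. It then shows how quickly the odds of reaching nobody rise once two of four people are effectively gone, as happened in the outage report.

```python
from math import prod


def p_nobody_reachable(p_unreachable):
    """Probability that every on-call person is unreachable at once.

    Assumes the people fail independently, so the individual
    probabilities simply multiply.
    """
    return prod(p_unreachable)


def p_someone_reachable(p_unreachable):
    """Probability that at least one person picks up."""
    return 1.0 - p_nobody_reachable(p_unreachable)


if __name__ == "__main__":
    # Four people, each assumed unreachable 10% of the time.
    full_team = [0.1, 0.1, 0.1, 0.1]
    # Same team after one person quits and one is in hospital.
    short_team = [1.0, 1.0, 0.1, 0.1]
    print("full team, nobody answers: ", p_nobody_reachable(full_team))
    print("short team, nobody answers:", p_nobody_reachable(short_team))
```

Under these made-up numbers the chance of an unanswered page goes from about one in ten thousand to about one in a hundred, which is the sense in which an unstaffed rotation is not really redundant.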
Of course, it helps if you have a vendor that will work with you (and those of you who remember working with Novell servers in "the old days" know what I'm talking about, too).
These are the good old days you'll be telling your children about. Make them worthwhile.
We don't sleep at GENUiTY :)