Slashdot Mirror


Slashdot.org Self-Slashdotted

Slashdot.org was unreachable for about 75 minutes this evening. Here is the post-mortem from Sourceforge's chief network engineer Uriah Welcome. "What we had was indeed a DoS, however it was not externally originating. At 8:55 PM EST I received a call saying things were horked, at the same time I had also noticed things were not happy. After fighting with our external management servers to login I finally was able to get in and start looking at traffic. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. After further investigation, I was able to eliminate anything outside our network as the cause, as the incoming ports from Savvis showed very little traffic. So I started poking around on the internal switch ports. While I was doing that I kept having timeouts and problems with the core switches. After looking at the logs on each of the core switches they were complaining about being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down to a pair of switches... After shutting the downlink ports to those switches off, the network recovered and everything came back. I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something — I just don't know what yet. Luckily we don't have any machines deployed on [that row in that cabinet] yet so no machines are offline. 
The network came back up around 10:10 PM EST."

34 of 388 comments

  1. Thanks for the information by sleeponthemic · · Score: 5, Funny

    Now if you could just post the link to the form where I can claim my full refund (for time not wasted incurred) I'll go back to being a loyal "customer".

    --
    I record my sleeptalking
    1. Re:Thanks for the information by Anonymous Coward · · Score: 5, Funny

      Okay, here is the link: http://slashdot.org/subscribe.pl

      You probably owe about $10 for your time not wasted.

    2. Re:Thanks for the information by Arthur+Grumbine · · Score: 5, Funny

      I don't know about you, but I'm suing for punitive damages. Do you have any idea how much pain and suffering the work I did in that time caused me?!

      --
      Now that I think about it, I'm pretty sure everything I just said is completely wrong.
    3. Re:Thanks for the information by Atario · · Score: 5, Funny

      Trust me, it's nothing compared to the pain and suffering your work caused us.

      -- The testing staff

      --
      "A great democracy must be progressive or it will soon cease to be a great democracy." --Theodore Roosevelt
    4. Re:Thanks for the information by spartacus_prime · · Score: 5, Informative

      I don't know about you, but I'm suing for compensatory damages. Do you have any idea how much pain and suffering the work I did in that time caused me?!

      Fixed that for you. Sorry, law student.

      --
      If you can read this, it means that I bothered to log in.
  2. In Soviet Russia by MindlessAutomata · · Score: 5, Funny

    In Soviet Russia, Slashdot slashdots Slashdot!

    1. Re:In Soviet Russia by ocularDeathRay · · Score: 5, Funny

      the headline is confusing, was the problem caused by a recursive dupe or something?

      I didn't read the rest of the summary cause it is longer than my finger and that is how we used to roll on the dialup BBSs... never read anything longer than your finger held up to the screen. this message is only intended for people of all finger sizes.

      --
      Obama is a twitter sock puppet
    2. Re:In Soviet Russia by robophilosopher · · Score: 5, Informative

      I believe you mean: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo. The caps matter. In other words, buffalo from the city of Buffalo that are pushed around by (other) buffalo from the city of Buffalo in turn push around (still more) buffalo from the city of Buffalo. And you thought this was unrelated to the recursive dupe comment.

    3. Re:In Soviet Russia by Anonymous Coward · · Score: 5, Funny

      Yo dawg, I herd u like Slashdot so I slashdotted your Slashdot!

    4. Re:In Soviet Russia by Zarf · · Score: 5, Funny

      In Soviet Russia ...

      1. Meme Very Tired. No Longer Wired.
      2. 'Soviet Russia' ceased to exist last century.
      3. Profit!!!

      I for one welcome our previous-century-meme based overlords.

      --
      [signature]
    5. Re:In Soviet Russia by Zarf · · Score: 5, Funny

      Was it maybe a feedback loop of that very thing that caused the slashdotting?

      I think the switch was trying to get first post.

      --
      [signature]
  3. A.I. by gmuslera · · Score: 5, Funny

    Probably the biggest proof that Slashdot has become sentient is that it is willing to commit suicide before seeing yet another batch of Idle videos.

    1. Re:A.I. by BLT2112 · · Score: 5, Funny

      Like the poet from HHGG whose own intestines leaped out of his throat to strangle himself...

  4. On the plus side by Toe,+The · · Score: 5, Funny

    Any day you get to legitimately use "horked" in a public post can't be all bad. :P

  5. Would like final analysis by Midnight+Thunder · · Score: 5, Interesting

    When you do work out what the root cause was, I am sure we would all like to find out what it was, so please post an update when you can.

    --
    Jumpstart the tartan drive.
    1. Re:Would like final analysis by Anonymous Coward · · Score: 5, Funny

      The problem was the system was HORKED, didn't you get that?

    2. Re:Would like final analysis by yanyan · · Score: 5, Funny

      The switches were running Windows 7 Starter Edition. http://tech.slashdot.org/article.pl?sid=09/02/09/1348255

    3. Re:Would like final analysis by Precision · · Score: 5, Informative

      I'll be sure to when I get to the data center next week and am able to get my hands on the angry switch in question. I do love how it just sat there quietly for two weeks w/o doing anything and then decided randomly to just start blasting out 20 Gbit.. sigh.. hardware..

      --
      - U
  6. and still no work done by qw0ntum · · Score: 5, Insightful

    Even though /. was down, I still managed to not get any work done. Maybe it had something to do with the fact I kept rechecking to see if it were back up. Or maybe I should just stop blaming my laziness on external factors and just admit it is a personal problem: I would still find ways to not do work even without Slashdot! :P

    --
    'Every story, if continued long enough, ends in death.' --Ernest Hemingway
  7. Re:*Sniff* they grow up so fast! by adolf · · Score: 5, Insightful

    Naw. Stuff sometimes, yaknow, happens. People sometimes make mistakes, and hardware sometimes just breaks. It's not always ignorance -- especially, I'd guess, at the level of Slashdot's back end.

    I once implemented a VoIP phone system at a factory in an evening. (This, in itself, was an undertaking - close to 200 extensions, up and running, between Wednesday at close of business and Thursday when folks started showing up, including three hours on the phone with Sprint to get the PRI and T1 circuits reconfigured at 2:00AM.)

    We left, tired and groggy, with an IP phone placed in a common area for the facilities network admins to train any staff who needed training, at about 7:30AM. At 8:30, after I finally got home and managed to close my eyes, my phone rang. It was the network admin. He had a few minor issues which could've waited, but the real problem was that their network was totally fucked: Packets everywhere. No capacity to do anything. An amazing cascading failure of the sort that one hopes to never see.

    And it wasn't any hodge-podge network, either. HP Procurve switches configured in a redundant fabric mode with gigabit fiber links - hot stuff for the time, especially for a factory. The wiring was all new, and was all good. The network had been designed specifically to avoid the limitations of Ethernet, and was successful to that end (a non-trivial task in an existing building complex). But it was tripping all over itself.

    Turns out that someone had taken that fancy IP phone in the common area with its built-in unmanaged switch, and plugged both of its 10/100 Ethernet jacks into the wall. (Nobody knows who.)

    The ensuing packet storm broke everything. Unplugging one of them fixed the problem pretty much immediately.

    I wrote about this here once before, and everyone's immediate reply was this: "Well, duh. They should've turned the Spanning Tree Protocol on, and this wouldn't have happened. They're obviously idiots."

    But the truth is so much more simple: People make mistakes. It was a mistake to keep STP turned off in that environment, and it was a mistake to plug two fancy ports of a Procurve switch into two dumb ports on an IP phone. Had either of those mistakes not happened, things would've been fine.

    But mistakes happen anyway. We do our best, as IT professionals, to minimize these mistakes, or at least keep them away from production. But sometimes, despite having the best people and the best tools and all the knowledge it takes to make stuff work, shit just happens.
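
The loop described in this comment can be reproduced in a toy model. What follows is a minimal sketch under stated assumptions, not the factory's actual network: the node names (`pc`, `sw1`, `sw2`, `phone`) are made up, every device floods a broadcast frame out of every port except the one it arrived on, and `spanning_tree` only illustrates the idea behind STP (block redundant links until the topology is loop-free) rather than the real protocol with its bridge IDs and BPDUs.

```python
def flood(links, start, max_steps):
    """Flood one broadcast frame and count the copies in flight per step.

    links: list of (a, b) tuples, one per cable; a repeated pair models
    two parallel cables between the same devices.
    """
    # Adjacency list: node -> [(link_id, neighbor), ...]
    adj = {}
    for link_id, (a, b) in enumerate(links):
        adj.setdefault(a, []).append((link_id, b))
        adj.setdefault(b, []).append((link_id, a))

    # Each in-flight copy is (node it arrives at, link it arrived on).
    in_flight = [(nbr, link_id) for link_id, nbr in adj[start]]
    history = [len(in_flight)]
    for _ in range(max_steps):
        next_round = []
        for node, came_in_on in in_flight:
            for link_id, nbr in adj.get(node, []):
                if link_id != came_in_on:  # never send back out the ingress port
                    next_round.append((nbr, link_id))
        in_flight = next_round
        history.append(len(in_flight))
    return history


def spanning_tree(links):
    """Keep a link only if it joins two so-far-unconnected parts (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    kept = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            kept.append((a, b))
    return kept


# Hypothetical topology: one PC, two switches, and an IP phone.
tree = [("pc", "sw1"), ("sw1", "sw2"), ("sw2", "phone")]
loop = tree + [("sw2", "phone")]  # second jack of the phone plugged into the wall

print("single cable:", flood(tree, "pc", 10))
print("both jacks:  ", flood(loop, "pc", 10))
print("loop blocked:", flood(spanning_tree(loop), "pc", 10))
```

With a single cable the frame dies out once it reaches the edge of the network. With both jacks plugged in, copies of that one frame keep circulating between `sw2` and `phone` indefinitely, and every further broadcast adds more, which is consistent with the "packets everywhere" symptom and with unplugging one cable fixing it immediately.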

  8. A tour of Slashdot... by lymond01 · · Score: 5, Funny

    The year is 2025.

    Well, Ladies and Gentlemen, here you see what you may think is an archaic lot of old computers. You would be mistaken. These are Slashdot. No, no cause for alarm...and that door's locked anyway, you can't get out through there. The tour only goes forward. But I'm glad at the very least that you know what Slashdot is. Not was. IS.

    It's a safeguard against...something. Something that was unleashed for 75 minutes in 2009 that crippled what was rumored to be the most robust public-facing cluster known. All we have left from that fateful day is the single post from the Slashdot network admin. Someone archived it, lucky us, because he was never seen after that day. I have a copy here, hardcopy of course -- no sense in taking risks so close to...well....

    Here it is:

    I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something. I just don't know what yet.

  9. Is it possible.... by GaryOlson · · Score: 5, Funny

    ...the problem down to a pair of switches...I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something -- I just don't know what yet.

    Is it possible the duplicate article generator tried to spawn, became entangled in its own potential well of duplicity, and now is trapped like two Lisp programmers deep inside their parenthesis?

    --
    Every mans' island needs an ocean; choose your ocean carefully.
  10. Re:This isn't the first time... by MBGMorden · · Score: 5, Funny

    Indeed. Studies show that you're far more likely to get hacked if you keep a computer in your home. Indeed it's often even a case where an attacker is able to wrest control of your own computer from you and use it against you.

    At the very minimum, given the elevated hazard potential to kids (over 90% of kids will suffer a computer accident before the age of 18), you should always keep your computers and networking equipment securely locked in separate compartments.

    I'm not going to go so far as you and call for an outright ban, but I think it's obvious that we need common-sense computer control laws put into place. In particular, we need to stop the widespread smuggling of these devices from across the borders of places such as Taiwan, Japan, and California, into our outer-city suburbs.

    --
    "People who think they know everything are very annoying to those of us who do."-Mark Twain
  11. Re:And finally the question is answered: by eosp · · Score: 5, Funny

    Quis slashdotiet ipsos slashdotes?

  12. Re:*Sniff* they grow up so fast! by Nyall · · Score: 5, Interesting

    I'm not a network engineer, but I think we did that senior year of college (2004). The engineering department provided us with our own work rooms we could lock. The rooms only had a couple of Ethernet jacks, so we brought in our own switch, which I remember could auto-detect the uplink. It was plugged into the wall, then someone by mistake plugged both ends of another CAT cable into some open ports. That mistake took down half the campus network for a couple of hours till some very mad IT guys found us.

    --
    http://en.wikipedia.org/wiki/Jury_nullification
  13. Slashdotted by Greyfox · · Score: 5, Funny
    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  14. Re:Wow, that sucks by jd · · Score: 5, Funny

    Act as a data source to Excel.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  15. Re:*Sniff* they grow up so fast! by adolf · · Score: 5, Interesting

    The timeframe is pretty close - my story happened late in 2004. The network admins in my story were pretty livid as well. (Well, panicked, followed by anger and lividity once they'd found the fault. They blamed everyone, including us for selling them unmanaged switches in their telephones, and promised to find the responsible party and throw them under the bus. It never happened. I hope that they eventually turned STP on.)

    It seems to be common in network administration to think (and I've mistakenly thought this way, too) that once some random person does something stupid and the entire fucking thing crashes, they'd just simply undo whatever it was and never do it again. Nevertheless, if lay people (or, no offense, students) were all that good at networking or computers, they'd probably never have produced the problem to begin with.

    These days, in my day job, I work with salespeople and law enforcement. They're not stupid -- in fact, most of the clients I work with do things daily that I could never accomplish -- but they occasionally do stupid things with computers and networks. I try hard to avoid blaming them for what they've done wrong, and to instead try to use it as an opportunity to better (and gently) show them how things actually work.

    I learned this, oddly enough, when pulling some Cat5 at a plastics factory. I moved a ceiling tile in an office that had a photo sensor fire alarm in it, and it went off. The entire plant was evacuated. The fire department showed up. Of course, there was no real fire -- the dust from the fiberglass insulation that I'd set the photo sensor on was enough to trigger it. And, thankfully, they were understanding. Because of my mistake, they learned a few weaknesses of their fire alarm system (some employees couldn't hear it and had to be found and dragged outside, which is a very real problem), and they considered it to be a good fire drill. They continue to hire us back for work today, and I learned not to do that again. :)

  16. Seen That Once by maz2331 · · Score: 5, Interesting

    A couple years ago, I had to troubleshoot a problem that was similar for a school district's network. Absolutely nothing could communicate.

    I checked switches, routers, and servers for a while until I hooked a sniffer up, and still got baffling results.

    THEN I decided to go low-tech, and start disconnecting cables. That got me somewhere - certain backbone connections could be disconnected and traffic levels dropped to normal levels.

    So, I hooked them back up, and went to the other end of the link, and started disconnecting things port by port until I found the problem.

    It turned out to be an unauthorized little 4-port switch that had malfunctioned, and was spewing perfectly valid (as in, good CRC) packets to the LAN, but with random source MAC addresses.

    THAT took down every switch in the network, as it required them to update their internal tables on a per-packet basis. The thing was actually not sending much data, but it was poisoning the switches' internal tables. Not at the IP layer, but at the MAC layer.

    When networking gear goes rogue, it can do really bad things to other connected equipment.

    It's really hard to find the problem because every indication from every other piece of equipment is confusing. You almost always have to go to the backbone and disconnect entire segments to find it.
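
Why a trickle of frames with random source MACs hurts so much can be shown with a toy learning switch. This is a rough sketch under assumptions that are not in the comment: the table size of 64 entries, the least-recently-seen eviction policy, the 20 hosts, and all MAC addresses are invented for illustration. Real switches have much larger tables and hardware aging timers, and the switches in the story may have failed in a different way (for example, CPU load from constant table updates).

```python
import random
from collections import OrderedDict


class LearningSwitch:
    """Toy MAC-learning switch with a small, fixed-size address table."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.table = OrderedDict()  # MAC -> port, least recently seen first

    def receive(self, src, dst, port):
        """Learn the source address; return True if the frame must be flooded."""
        self.table[src] = port
        self.table.move_to_end(src)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)  # evict the least recently seen MAC
        # Unknown destination means the frame goes out of every port.
        return dst not in self.table


def flooded_fraction(rogue_frames_per_legit, rounds=200, hosts=20,
                     capacity=64, seed=1):
    """Fraction of legitimate frames that end up flooded to all ports."""
    rng = random.Random(seed)
    switch = LearningSwitch(capacity)
    macs = [f"02:00:00:00:00:{i:02x}" for i in range(hosts)]
    flooded = total = 0
    for _ in range(rounds):
        for port, src in enumerate(macs):
            dst = rng.choice([m for m in macs if m != src])
            # The rogue device injects frames with random source addresses.
            for _ in range(rogue_frames_per_legit):
                fake = "06:" + ":".join(f"{rng.randrange(256):02x}" for _ in range(5))
                switch.receive(fake, dst, port=99)
            flooded += switch.receive(src, dst, port)
            total += 1
    return flooded / total


print("healthy network:  ", flooded_fraction(0))
print("with rogue device:", flooded_fraction(10))
```

In the healthy run only the first few frames are flooded, because every host is learned once and stays in the table. With the rogue device, the fake addresses keep pushing the real ones out, so most legitimate frames are treated as having an unknown destination and get flooded everywhere, even though the rogue itself sends little data.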

  17. Sometimes You Have To Be There by maz2331 · · Score: 5, Interesting

    It may be strange for those not in the networking field, but when things really go bad, the only place to be is physically in the data center.

    That means looking at the LEDs on switches for traffic indications. If you see a single port is spewing a LOT of activity during an outage, disconnect it. No, don't make it "down" but pull the cable out of the port.

    Then go downstream and repeat until the potential problem set is reduced to an understandable level.

    What really sucks about these kinds of outages is that you can't remotely log in to various hosts or switches - you have to pull wires out of ports to break the "spew" that is taking things down.

    I have to remember to charge a 100-X surcharge the next time I troubleshoot one of these... (300X if after-hours)

    These sorts of problems are REALLY hard to find, but trivial to fix.
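
The "disconnect, then go downstream and repeat" procedure is a walk down the switch hierarchy. Here is a small sketch of it, with everything concrete being an assumption: the topology, the device names, and the `storm_persists` probe are hypothetical stand-ins for a person pulling a cable and watching whether the traffic calms down.

```python
def subtree(tree, node):
    """All devices at or below `node` in the hierarchy."""
    seen = {node}
    stack = [node]
    while stack:
        for child in tree.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def find_culprit(tree, root, storm_persists):
    """Follow the storm downstream by elimination.

    tree: dict mapping a switch to the devices on its downlink ports.
    storm_persists(child): True if the storm is still visible at the root
    after the cable to `child` has been pulled (and then reconnected).
    Returns the path from the root to the suspected device.
    """
    node = root
    path = [root]
    while tree.get(node):
        for child in tree[node]:
            if not storm_persists(child):
                # Pulling this cable stopped the storm: the source is below it.
                node = child
                path.append(child)
                break
        else:
            # No single downlink stops the storm: suspect this device itself.
            return path
    return path


# Hypothetical network with a misbehaving 4-port switch under a desk.
tree = {
    "core": ["cab1", "cab2", "cab3"],
    "cab2": ["sw-a", "sw-b"],
    "sw-b": ["rogue-4port", "pc"],
}
rogue = "rogue-4port"


def storm_persists(unplugged):
    # Simulated measurement: the storm stops only if the rogue is cut off.
    return rogue not in subtree(tree, unplugged)


print(" -> ".join(find_culprit(tree, "core", storm_persists)))
```

Each level costs at most one cable pull per downlink, so the search stays short even on a large network, which fits the observation that these faults are hard to find but trivial to fix once located.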

    1. Re:Sometimes You Have To Be There by jamie · · Score: 5, Informative

      Our network engineer lives a couple of states away from the data center. The work he's talking about doing, he did from home.

    2. Re:Sometimes You Have To Be There by Bearhouse · · Score: 5, Funny

      It may be strange for those not in the networking field, but when things really go bad, the only place to be is physically in the data center.

      Heh. I've heard that in the old days you could find broken Token ring hardware by listening for a high pitched whining noise. Guess one really has to be there for stuff like that.

      Was there, and confirm true. Whining noise normally came from IBM SE who was trying to fix problem.

    3. Re:Sometimes You Have To Be There by sentientbeing · · Score: 5, Interesting

      Those times coincide with recent posts you made at slashdot (216.34.181.45) I think after each post slashcode quickly scans the originating IP to check for proxy trolling.

      --

      ------
      beware he who would deny you access to information, for in his mind he dreams himself your master
  18. Re:Wow, that sucks by Achromatic1978 · · Score: 5, Informative

    He (she?)

    For Slashdot staff, I think the generally accepted nominal is "It"...