Slashdot.org Self-Slashdotted
Slashdot.org was unreachable for about 75 minutes this evening. Here is the post-mortem from Sourceforge's chief network engineer Uriah Welcome. "What we had was indeed a DoS, however it was not externally originating. At 8:55 PM EST I received a call saying things were horked, at the same time I had also noticed things were not happy. After fighting with our external management servers to login I finally was able to get in and start looking at traffic. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. After further investigation, I was able to eliminate anything outside our network as the cause, as the incoming ports from Savvis showed very little traffic. So I started poking around on the internal switch ports. While I was doing that I kept having timeouts and problems with the core switches. After looking at the logs on each of the core switches they were complaining about being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down to a pair of switches... After shutting the downlink ports to those switches off, the network recovered and everything came back. I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something — I just don't know what yet. Luckily we don't have any machines deployed on [that row in that cabinet] yet so no machines are offline. 
The network came back up around 10:10 PM EST."
So if you hammer your own servers, do you have to send an email to krow to get your privileges restored?
So why didn't y'all have access from the home office?
2^3 * 31 * 647
Now if you could just post the link to the form where I can claim my full refund (for time not wasted incurred) I'll go back to being a loyal "customer".
I record my sleeptalking
In Soviet Russia, Slashdot slashdots Slashdot!
Probably the biggest proof that Slashdot has become sentient is that it is willing to kill itself rather than see another batch of Idle videos.
Slashdot has apparently learned how to masturbate, because it is now fucking with itself!
Any day you get to legitimately use "horked" in a public post can't be all bad. :P
When you do work out what the root cause was, I am sure we would all like to find out what it was, so please post an update when you can.
Jumpstart the tartan drive.
Who Slashdots the Slashdotters?
When even Slashdot gets slashdotted. Now if only we can make the Digg effect bury that site. For good.
If you can read this, it means that I bothered to log in.
First thing I'd do as Cyber Security Tzar would be to outlaw any network device that has the potential to become faulty.
We could've avoided this tragedy entirely.
Modding me -1 troll doesn't make me wrong.
Even though /. was down, I still managed to not get any work done. Maybe it had something to do with the fact I kept rechecking to see if it were back up. Or maybe I should just stop blaming my laziness on external factors and just admit it is a personal problem: I would still find ways to not do work even without Slashdot! :P
'Every story, if continued long enough, ends in death.' --Ernest Hemingway
The year is 2025.
Well, Ladies and Gentlemen, here you see what you may think is an archaic lot of old computers. You would be mistaken. These are Slashdot. No, no cause for alarm...and that door's locked anyway, you can't get out through there. The tour only goes forward. But I'm glad at the very least that you know what Slashdot is. Not was. IS.
It's a safeguard against...something. Something that was unleashed for 75 minutes in 2009 that crippled what was rumored to be the most robust public-facing cluster known. All we have left from that fateful day is the single post from the Slashdot network admin. Someone archived it, lucky us, because he was never seen after that day. I have a copy here, hardcopy of course -- no sense in taking risks so close to...well....
Here it is:
I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something. I just don't know what yet.
Is it possible the duplicate article generator tried to spawn, became entangled in its own potential well of duplicity, and now is trapped like two Lisp programmers deep inside their parenthesis?
Every man's island needs an ocean; choose your ocean carefully.
The worst thing about this? 5,000,000 people who think they know what happened, posting "helpful" suggestions or analysis
"The problem is definitely spanning tree!"
or
"Back in 1998, we were running these HP switches right, and ..."
or
"Did you try resetting the flanglewidget interface?!"
or
"I've seen this exact problem! You need to upgrade to v5.1!"
etc
It's not your network. It doesn't matter how much you think you know, you don't know the topology, or the systems involved. It'll be interesting to know what the ACTUAL reason was, when they figure it out. Assuming it isn't aliens.
And before anyone says this is a shitty plot... I *did* say Michael Bay.
Today is red jello day - all workers must eat all of their red jello. Failure to comply will result in five demerits.
Since no one would ever make the mistake of making a loop in a datacenter, it's fairly common to disable STP, among a few other things. It makes the time bringing a machine up on a port a bit quicker. On a Cisco, you're usually looking at 30 seconds. It'll bring it down to a fraction of a second.
And it was (obviously) a big mistake.
I leave it on in the datacenters. I can live with 30 seconds to bring the port up, if it means I'll never flood the whole network with bogus traffic. :) The only place I've tweaked my switches for connection speed is my own desk. There's only 1 wire coming in. There's only 1 switch. It helped when I had to bring up some machines via PXE. Some of them couldn't tolerate the 30 second delay when requesting DHCP. Still, I know the degree of isolation, so I can't screw it up without running a long wire from somewhere else. :)
But, we're just assuming. Maybe one of the switches just started generating lots and lots of traffic all on its own. Somehow. In the mysterious locked cabinet that none of us get to see into. :)
It's always embarrassing when things go down, and even more so when it was something that could have been prevented. They should have reported that a line card in a core switch went down, and it took that long to bring it back up. :) Come on, how many times have you heard that from your upstream providers (if you have direct connects to big providers). I swear, for as many times as I've heard the excuse, every router on their networks must have been refreshed a dozen times over. :)
At least it's a better excuse than I used to get. I think it was "GoodNet" that would claim a train derailed every time there was an outage of some sort. "Oh a train derailed, and cut the fiber. We have technicians out there repairing it right now." Somehow we never saw the news reports of dozens of trains derailing. :)
Serious? Seriousness is well above my pay grade.
Mirror
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
We had something similar happen at a client site - a switch failed in a rack so we temporarily replaced it with an 8 port 'desktop' switch, and then a day later installed the proper replacement back in the rack. We didn't want any unnecessary downtime though so we linked them together and left instructions with the onsite guy to move all the connections from the desktop switch into the proper switch after hours. Which he did, including the cable that linked them together. The switch was in 'portfast' mode so any broadcast packet that got 'onto' the switch, stayed there :)
But I thought "horked" meant, y'know, horked, eh? Meaning, like, "stolen" --
Doug: Hey - somebody horked our clothes!
Bob: Geez, who'd want to hork our clothes, eh?
Cheers,
"What in the name of Fats Waller is that?"
"A four-foot prune."
You've considered using portfast on edge ports? :P
You know, it's been there for a while...
February 9th, 2009 8:55pm Slashdot becomes self-aware.
...were he not typing that long-a$$ summary. Twice as fast if he didn't have to spellcheck.
(j/k)
Which leads me to this question:
What do Slashdotter staff read to avoid doing work?
WARNING: Smartphones have side effects--most of them undocumented.
This thing usually happens when two switches are attached with 2 (or more) trunked links ("etherchannel" in Cisco terminology), and one of the switches has the trunk disabled on one of the ports (or someone moved the cable to another port during a diag). Thus the attachment becomes a loop. STP could take care of this, but it's common to disable it on access switches.
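For anyone who wants to see why a loop with no STP never calms down, here is a toy sketch in Python. It is not a model of any real switch: the function name `frames_in_flight` and the topology format are made up for illustration, and it assumes each switch simply floods a broadcast out every port except the one it arrived on.

```python
from collections import defaultdict

def frames_in_flight(links, start, steps):
    """Flood one broadcast frame from `start`; return frames in flight per step.

    links: list of (switch_a, switch_b) cables; a repeated pair means two
           parallel cables (e.g. a trunk whose bundling is broken).
    Assumption: no STP, and Ethernet frames carry no TTL, so nothing ever
    ages a looping frame out.
    """
    ports = defaultdict(list)  # switch -> list of (link_id, neighbor)
    for link_id, (a, b) in enumerate(links):
        ports[a].append((link_id, b))
        ports[b].append((link_id, a))

    # A frame in flight is (link it is travelling on, switch it arrives at).
    in_flight = [(link_id, nbr) for link_id, nbr in ports[start]]
    history = [len(in_flight)]
    for _ in range(steps):
        nxt = []
        for arrived_on, sw in in_flight:
            for link_id, nbr in ports[sw]:
                if link_id != arrived_on:  # flood out every other port
                    nxt.append((link_id, nbr))
        in_flight = nxt
        history.append(len(in_flight))
    return history

single_cable = frames_in_flight([("A", "B")], "A", 5)
two_cables = frames_in_flight([("A", "B"), ("A", "B")], "A", 5)
```

With one cable the frame is delivered and the count drops to zero. With two unbundled cables between the same pair of switches, the count never reaches zero: the same broadcast keeps circling, and every new broadcast adds to the pile.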
A couple years ago, I had to troubleshoot a problem that was similar for a school district's network. Absolutely nothing could communicate.
I checked switches, routers, and servers for a while until I hooked a sniffer up, and still got baffling results.
THEN I decided to go low-tech, and start disconnecting cables. That got me somewhere - certain backbone connections could be disconnected and traffic levels dropped to normal levels.
So, I hooked them back up, and went to the other end of the link, and started disconnecting things port by port until I found the problem.
It turned out to be an unauthorized little 4-port switch that had malfunctioned, and was spewing perfectly valid (as in, good CRC) packets to the LAN, but with random source MAC addresses.
THAT took down every switch in the network, as it required them to update their internal tables on a per-packet basis. The thing was actually not sending much data, but it was poisoning the switches' internal tables. Not at the IP layer, but at the MAC layer.
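A rough way to picture the per-packet table updates, as a Python sketch. Everything here is hypothetical (the helper names, the port number, the frame counts), and it only counts writes to a dictionary; a real switch also has a finite table and its own aging rules, which this ignores.

```python
import random

def table_updates(source_macs, port=1):
    """Count frames that force a learning switch to write to its MAC table."""
    table = {}  # MAC address -> port it was last seen on
    updates = 0
    for mac in source_macs:
        if table.get(mac) != port:  # unknown MAC, or MAC moved ports
            table[mac] = port
            updates += 1
    return updates

def random_mac(rng):
    """Build a random 48-bit MAC address string."""
    return ":".join(f"{rng.randrange(256):02x}" for _ in range(6))

rng = random.Random(0)
# Four ordinary hosts talking a lot: learned once each, then no more writes.
normal = [f"00:00:00:00:00:{i % 4:02x}" for i in range(10_000)]
# A rogue device inventing a new source MAC for every frame.
rogue = [random_mac(rng) for _ in range(10_000)]

normal_updates = table_updates(normal)
rogue_updates = table_updates(rogue)
```

The normal traffic costs four table writes in total. The rogue traffic costs close to one write per frame, which matches the description of a low-bandwidth sender still keeping every switch busy.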
When networking gear goes rogue, it can do really bad things to other connected equipment.
It's really hard to find the problem because every indication from every other piece of equipment is confusing. You almost always have to go to the backbone and disconnect entire segments to find it.
It may be strange for those not in the networking field, but when things really go bad, the only place to be is physically in the data center.
That means looking at the LEDs on switches for traffic indications. If you see a single port is spewing a LOT of activity during an outage, disconnect it. No, don't make it "down" but pull the cable out of the port.
Then go downstream and repeat until the potential problem set is reduced to an understandable level.
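The unplug-and-go-downstream routine can be sketched like this in Python. The topology, device names, and the `storm_stops_when_unplugged` check are all invented for illustration; in real life that check is a person pulling a cable and watching the LEDs.

```python
def find_rogue(topology, root, storm_stops_when_unplugged):
    """Walk downstream from `root`, following the link whose removal stops the storm.

    topology: dict mapping a device to the devices plugged into it (hypothetical).
    storm_stops_when_unplugged: callable(device) -> bool, a stand-in for
        physically disconnecting that device and checking traffic levels.
    Returns the most-downstream device whose removal fixes things, or None.
    """
    current, culprit = root, None
    while True:
        for child in topology.get(current, []):
            if storm_stops_when_unplugged(child):
                culprit = current = child  # narrow the search to this branch
                break
        else:
            # No downstream cable makes a difference: stop here.
            return culprit

topology = {
    "core": ["backbone-1", "backbone-2"],
    "backbone-2": ["closet-a", "closet-b"],
    "closet-b": ["pc-1", "rogue-4port"],
}
# Pretend the storm stops whenever any device on the path to the rogue is unplugged.
bad_path = {"backbone-2", "closet-b", "rogue-4port"}
found = find_rogue(topology, "core", lambda device: device in bad_path)
```

Each pass only tests the cables on one device, so the number of cables pulled grows with the depth of the network rather than with the total number of ports.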
What really sucks about these kinds of outages is that you can't remotely log in to various hosts or switches - you have to pull wires out of ports to break the "spew" that is taking things down.
I have to remember to charge a 100-X surcharge the next time I troubleshoot one of these... (300X if after-hours)
These sorts of problems are REALLY hard to find, but trivial to fix.
It sounds more like a network configuration accident or glitch than an attack. Besides, netsplits aren't incredibly unusual.
...being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down...
What did I say that sounded like "Tell me about your day at work" ?
Squirrel!
I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something - I just don't know what yet.
Um, trying to get first post?
rewriting history since 2109