Slashdot.org Self-Slashdotted
Slashdot.org was unreachable for about 75 minutes this evening. Here is the post-mortem from Sourceforge's chief network engineer Uriah Welcome. "What we had was indeed a DoS, however it was not externally originating. At 8:55 PM EST I received a call saying things were horked, at the same time I had also noticed things were not happy. After fighting with our external management servers to login I finally was able to get in and start looking at traffic. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. After further investigation, I was able to eliminate anything outside our network as the cause, as the incoming ports from Savvis showed very little traffic. So I started poking around on the internal switch ports. While I was doing that I kept having timeouts and problems with the core switches. After looking at the logs on each of the core switches they were complaining about being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down to a pair of switches... After shutting the downlink ports to those switches off, the network recovered and everything came back. I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something — I just don't know what yet. Luckily we don't have any machines deployed on [that row in that cabinet] yet so no machines are offline. The network came back up around 10:10 PM EST."
So why didn't ya'll have access from the home office?
2^3 * 31 * 647
When you do work out what the root cause was, I am sure we would all like to find out what it was, so please post an update when you can.
Jumpstart the tartan drive.
My guess is there is a loop somewhere and the traffic is just multicast traffic going in circles! Is there some kind of redundancy that depends on Spanning Tree?
I'm surprised STP was off by default. I remember in 1999 or so I had some trouble that resulted in my having to turn STP off on Cisco switches (they shipped with it on (these were 3524s and a 5505). I can't actually remember why. I think it had something to do with a Novell server?
In any case, I remember saying to the Cisco phone support guy, who had been baffled for 4 hours or so before he told me to turn it off (and things started to work) "Who the heck would plug in two ports from one device into the same network?"
Since then, I have seen exactly that situation many times in small office environments. Also, the classic plugging in while also being on the wireless side of the network.
The brains of a chicken, coupled with the claws of two eagles, may well hatch the eggs of our destruction.
I'm not a network engineer but I think we did that senior year of college (2004). The engineering department provided us with our own work rooms we could lock. The rooms only had a couple of Ethernet jacks so we brought in our own switch which I remember could auto detect the uplink. It was plugged into the wall then someone by mistake plugged both ends of another CAT cable into some open ports. That mistake took down half the campus network for a couple of hours till some very mad IT guys found us.
http://en.wikipedia.org/wiki/Jury_nullification
...Because if it's aliens, then it won't be interesting?
We had something similar happen at a client site - a switch failed in a rack so we temporarily replaced it with an 8 port 'desktop' switch, and then a day later installed the proper replacement back in the rack. We didn't want any unnecessary downtime though so we linked them together and left instructions with the onsite guy to move all the connections from the desktop switch into the proper switch after hours. Which he did, including the cable that linked them together. The switch was in 'portfast' mode so any broadcast packet that got 'onto' the switch, stayed there :)
The timeframe is pretty close - my story happened late in 2004. The network admins in my story were pretty livid as well. (Well, panicked, followed by angry and lividity once they'd found the fault. They blamed everyone, including us for selling them unmanaged switches in their telephones, and promised to find the responsibile party and throw them under the bus. It never happened. I hope that they eventually turned STP on.)
It seems to be common in network administration to think (and I've mistakenly thought this way, too) that once some random person does something stupid and the entire fucking thing crashes that they'd just simply undo whatever it was and never do it again. Nevertheless, if lay people (or, no offense, students) were all that good at networking or computers, they'd probably never have produced the problem to begin with.
These days, in my day job, I work with salespeople and law enforcement. They're not stupid -- in fact, most of the clients I work with do things daily that I could never accomplish -- but they occasionally do stupid things with computers and networks. I try hard to avoid blaming them for what they've done wrong, and to instead try to use it as an opportunity to better (and gently) show them how things actually work.
I learned this, oddly enough, when pulling some Cat5 at a plastics factory. I moved a ceiling tile in an office that had a photo sensor fire alarm in it, and it went off. The entire plant was evacuated. The fire department showed up. Of course, there was no real fire -- the dust from the fiberglass insulation that I'd set the photo sensor on was enough to trigger it. And, thankfully, they were understanding. Because of my mistake, they learned a few weaknesses of their fire alarm system (some employees couldn't hear it and had to be found and dragged outside, which is a very real problem), and they considered it to be a good fire drill. They continue to hire us back for work today, and I learned not to do that again. :)
Kid-proof tablet..
It's likely multicast-related, as that's where TFA states the problem was seen. There are only so many multicast issues you can have. True, we don't know the topology. True, we don't know the switch configuration. True, it's just as possible this is some sort of revenge by the Church of Scientology for all the Slashdot articles on them.
However, some things seem more plausible than others. Since this was a spontaneous problem, hardware seems more suspect than software. If it is software (unlikely but possible), the only multicast protocol most switches use are the spanning-tree protocols.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Nah, I used to run one of the bigger, well know publically facing clusters. It was ranked #300 by Alexa when I left over 2 years ago. What's happened since is their own fault. :)
Actually, this wouldn't have downed that network. Every GigE circuit was individual to a city, or set of racks (depending on the site). There were no cross connects between them. Almost everything was designed so if we lost a city for any reason, it didn't hurt the site. We had connectivity outages, and even a couple brownouts that upset the power systems, but the sites were always accessible.
Slashdot should not, under any circumstances, be hosted in one location. In my opinion, they should be at the largest continental and intercontinental peerings that they can be at.
1 Wilshire, Los Angeles, CA - providing the west coast of the US, and the most substantial fiber links on the Pacific.
111 8th Ave, New York, NY - providing the east coast of the US, and virtually all of the links to Europe.
36 NE 2nd St, Miami, FL - providing the southeast US, redundancy for the Southeast US, and some fiber to Europe and S. America
Redundant options.
426 S LaSalle St, Chigaco, IL - providing good service to the East and West coast of the US
55 S Market St, San Jose, CA - providing good service to the West coast of the US, and some trans-Pacific connectivity
Some people really like Atlanta, Dallas, Houston, Las Vegas, Salt Lake City, and Vienna/Ashburn/Reston. I don't really suggest it, if you can have a presence in the better locations.
There are some very nice global options too. I'm not sure how well the European networks have cleaned up. Several years ago, due to peering arrangements over there, most European traffic ended up going to New York and back to Europe, even though we were on one of the top Tier 1 providers. We ditched the site, and sent all of Europe to New York. Our users sent complements on our "new data center in Europe", since it was so fast. :) People like to complain, but rarely send complements. That was interesting. There are some great locations in Australia and Asia also, but ... well ... it's all in how much you want to spend.
I know people in the Silicon Valley always scream when I suggest them as secondary, but if you've had a good look at all the major cities, you'd get over yourselves. Just because you live there, and there are expensive neighbors, it doesn't make you the center of the world.
Slashcode would need some revamping to make work in this environment. There are lots of options there too.
But, I'm not on the Slashdot IT team, so I don't get to make these decisions (or even give opinions).
Serious? Seriousness is well above my pay grade.
A couple years ago, I had to troubleshoot a problem that was similar for a school district's network. Absolutely nothing could communicate.
I checked switches, routers, and servers for a while until I hooked a sniffer up, and still got bafflling results.
THEN I decided to go low-tech, and start disconnecting cables. That got me somewhere - certain backbone connections could be disconnected and traffic levels dropped to normal levels.
So, I hooked them back up, and went to the other end of the link, and started disconnecting things port by port until I found the problem.
It turned out to be an unauthorized little 4-port switch that had malfunctioned, and was spewing perfectly valid (as in, good CRC) packets to the LAN, but with random source MAC addresses.
THAT took down every switch in the network, as it required them to update their internal tables on a per-packet basis. The thing was actually not sending much data, but it was poisoning the switchs' internal tables. Not at the IP layer, but at the MAC layer.
When networking gear goes rogue, it can do really bad things to other connected equipment.
It's really hard to find the problem because every indication from every other piece of equipment is confusing. You almost always have to go to the backbone and disconnect entire segmets to find it.
It may be strange for those not in the networking field, but when things really go bad, the only place to be is physically in the data center.
That means looking at the LEDs on switches for traffic indications. If you see a single port is spewing a LOT of activity during an outage, disconnect it. No, don't make it "down" but pull the cable out of the port.
Then go downstream and repeat until the potential problem set is reduced to an understandable level.
What really sucks about these kind of outages is that you can't remotely log in to various hosts or switches - you have to pull wires out of ports to break the "spew" that is taking things down.
I have to remember to charge a 100-X surcharge the next time I troubleshoot one of these... (300X if after-hours)
These sort of problems are REALLY hard to find, but trivial to fix.
"Nevertheless, if lay people (or, no offense, students) were all that good at networking or computers, they'd probably never have produced the problem to begin with."
I've seen IT professionals do exactly the same thing many a time. I don't think students are particularly special here, anyone who has never encountered the problem before is prone to it I'd say but most people in IT encounter it eventually one way or another!
The HP Procurve switches had something called "Mesh mode" which allowed you to have and to utilise multiple uplinks. So if you had 2 x 1 Gb uplinks, then you could use both of them. If you had STP protocol turned on you would have one online and one offline. It's for this reason that Cisco now does PVST or Per VLAN Spanning Tree. This allows you to utilise both uplinks, and just use a different uplink for a different VLAN.
Curiosity was framed; ignorance killed the cat. -- Author unknown
In my 10 years inside a network operation center of a 10K active hosts campus, I have seen this happening by two causes:
First, some smartass uni professor plugs two network outlets onto a switch of his own in order to 'double the bandwidth'.
Second, some semi-smartass professor wants to ghost at the same time all the computers in his lab, and uses a wrong multicast address (or even broadcast). This way his lab in Greece is ghosted, as well as some random PCs in Texas, US.
Needless to say, in order for those things to happen, some security measures on behalf of the net admins have been forgotten. But who's perfect?
I agree, this is a great example. As someone who has worked in manufacturing before, I can say without a doubt most "fire drills" aren't much of a drill since they're planned in advance and staff are notified prior.
The issue is that during production, staff can't just walk away from their machines without causing tremendous costs. To avoid those costs, management sees fit to notify staff prior to shutdown gracefully which kind of defeats the purpose of a drill.
The effect is that most manufacturers do not know the true ability of their staff to exit under a true emergency.
You can't legislate goodness. Let each to his own destiny, by will of his freely made choices.
Yes, unfortunately though at many places, they're not.
I think the real question is, why the fuck is this even possible? There shouldn't be a single piece of networking hardware available today that's vulnerable to this by default, it's not as if the problem hasn't been known about since about as long as the relevant networking hardware has been around.