Slashdot.org Self-Slashdotted
Slashdot.org was unreachable for about 75 minutes this evening. Here is the post-mortem from Sourceforge's chief network engineer Uriah Welcome. "What we had was indeed a DoS, however it was not externally originating. At 8:55 PM EST I received a call saying things were horked, at the same time I had also noticed things were not happy. After fighting with our external management servers to login I finally was able to get in and start looking at traffic. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. After further investigation, I was able to eliminate anything outside our network as the cause, as the incoming ports from Savvis showed very little traffic. So I started poking around on the internal switch ports. While I was doing that I kept having timeouts and problems with the core switches. After looking at the logs on each of the core switches they were complaining about being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down to a pair of switches... After shutting the downlink ports to those switches off, the network recovered and everything came back. I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something — I just don't know what yet. Luckily we don't have any machines deployed on [that row in that cabinet] yet so no machines are offline. 
The network came back up around 10:10 PM EST."
It's OK, they were all public domain.
And why no out-of-band management network?
Did the little dunking bird alarm not work this time?
There are a few network design failures that you need to acknowledge:
People are idiots. If there are free points (and even if there aren't, in some cases), people are going to plug stuff in.
Unused ports should be disabled (shut down). That way it takes someone with half a brain to connect a device.
This is standard operating procedure.
Spanning tree was not enabled, which caused your loop.
This is standard operating procedure.
Learn from your mistakes; don't blame them on user stupidity, as you can only control your own actions.
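For anyone wondering what "loop" means here, a minimal Python sketch follows. It assumes the loop theory in this comment is right (the post-mortem itself does not confirm the cause). The switch names and the find_loop_links helper are made up for illustration, and this is only a cycle check on a list of links, not an implementation of the spanning tree protocol, which elects a root bridge and blocks ports based on path cost.

```python
def find_loop_links(links):
    """Return the links that close a loop in an undirected layer-2 topology.

    links: list of (switch_a, switch_b) pairs. All names are hypothetical.
    Uses union-find: a link whose two ends are already connected is redundant,
    so forwarding on it without something like spanning tree creates a loop.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            redundant.append((a, b))  # both ends already reachable: loop
        else:
            parent[root_a] = root_b   # merge the two connected groups
    return redundant


# Hypothetical cabinet: two switches uplinked to one core, plus a stray
# patch cable between the two cabinet switches.
links = [
    ("core1", "cab-sw-a"),
    ("core1", "cab-sw-b"),
    ("cab-sw-a", "cab-sw-b"),
]
print(find_loop_links(links))  # [('cab-sw-a', 'cab-sw-b')]
```

Any link this reports is one that a loop-prevention mechanism would need to block, or that someone should not have plugged in.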
The one-half rule for thrashing hardware, using a network monitor plus RMON: power off or disconnect one half of the hardware and/or software I/O services. If the problem is mitigated or the status changes for the better, then the half that is out/off holds part or all of the problem. Bring one half of the out/off portion back on; if the condition stays stable, bring back one half of what is still out, and so on. Do the reverse if the problem does not change. If the problem is only partially improved, then proceed logically....
When taking a single box/brick/#U... off or on (out or in) changes the status, leave it off and get a replacement or repair....
Total time to a stable connection service depends on size, but it is less than a couple of hours... the old-school way, from the old days.
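The one-half rule above is essentially a binary search over the hardware. Here is a rough Python sketch of the idea, assuming exactly one faulty unit and a yes/no health check. The unit names, isolate_faulty, and the probe function are all hypothetical stand-ins, not anything from the actual incident.

```python
def isolate_faulty(units, is_healthy_without):
    """Half-split search for a single faulty unit.

    units: list of unit names (hypothetical).
    is_healthy_without(disabled): hypothetical probe that returns True when
    the network is healthy while the given units are switched off.
    Assumes exactly one unit is causing the problem.
    """
    if not units:
        raise ValueError("no units to test")
    suspects = list(units)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if is_healthy_without(half):
            suspects = half                  # fault is in the half we turned off
        else:
            suspects = suspects[len(half):]  # fault is in the half still on
    return suspects[0]


# Simulated data center: 16 made-up cabinets, one of them misbehaving.
units = [f"cab{n:02d}" for n in range(1, 17)]
bad = "cab11"


def probe(disabled):
    # Stand-in for watching the monitor: healthy only if the bad unit is off.
    return bad in disabled


print(isolate_faulty(units, probe))  # cab11
```

With 16 units this takes 4 probes instead of up to 15 one-at-a-time checks. With more than one faulty unit the "partially improved" case applies and the simple version above no longer holds.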
Unaccountable leaders are masters, and unrepresented people are slaves. How do US and EU fare?
Ya know, if I had just quoted this:
After fighting with our external management servers to login I finally was able to get in and start looking at traffic.
you would have immediately been labeled a troll. As it is, you've been labeled insightful because neither you nor the mods read the summary. Excellent. What IS your secret?
The point is, they hadn't already given him direct access to those connections before yesterday, and he had to spend a large chunk of those 75 minutes getting the authorization to access the equipment so he COULD fix it.
2^3 * 31 * 647