Slashdot.org Self-Slashdotted
Slashdot.org was unreachable for about 75 minutes this evening. Here is the post-mortem from Sourceforge's chief network engineer Uriah Welcome. "What we had was indeed a DoS, however it was not externally originating. At 8:55 PM EST I received a call saying things were horked, at the same time I had also noticed things were not happy. After fighting with our external management servers to login I finally was able to get in and start looking at traffic. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. After further investigation, I was able to eliminate anything outside our network as the cause, as the incoming ports from Savvis showed very little traffic. So I started poking around on the internal switch ports. While I was doing that I kept having timeouts and problems with the core switches. After looking at the logs on each of the core switches they were complaining about being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down to a pair of switches... After shutting the downlink ports to those switches off, the network recovered and everything came back. I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something — I just don't know what yet. Luckily we don't have any machines deployed on [that row in that cabinet] yet so no machines are offline. 
The network came back up around 10:10 PM EST."
So if you hammer your own servers, do you have to send an email to krow to get your privileges restored?
So why didn't y'all have access from the home office?
2^3 * 31 * 647
Now if you could just post the link to the form where I can claim my full refund (for time not wasted incurred) I'll go back to being a loyal "customer".
I record my sleeptalking
In Soviet Russia, Slashdot slashdots Slashdot!
pretty impressive. i loaded, got an ISE, then reloaded and it worked. good timing for me i'd say
probably the biggest proof that Slashdot has become sentient is that it is willing to kill itself rather than see yet another batch of Idle videos.
Slashdot has apparently learned how to masturbate, because it is now fucking with itself!
The HAMSTERS?
http://www.webhamster.com/
I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
Any day you get to legitimately use "horked" in a public post can't be all bad. :P
When you do work out what the root cause was, I am sure we would all like to find out what it was, so please post an update when you can.
Jumpstart the tartan drive.
Who Slashdots the Slashdotters?
When even Slashdot gets slashdotted. Now if only we can make the Digg effect bury that site. For good.
If you can read this, it means that I bothered to log in.
First thing I'd do as Cyber Security Tzar would be to outlaw any network device that has the potential to become faulty.
We could've avoided this tragedy entirely.
Modding me -1 troll doesn't make me wrong.
Even though /. was down, I still managed to not get any work done. Maybe it had something to do with the fact I kept rechecking to see if it were back up. Or maybe I should just stop blaming my laziness on external factors and just admit it is a personal problem: I would still find ways to not do work even without Slashdot! :P
'Every story, if continued long enough, ends in death.' --Ernest Hemingway
My guess is there is a loop somewhere and the traffic is just multicast traffic going in circles! Is there some kind of redundancy that depends on Spanning Tree?
Is UDLD on? Sounds like it might be a forwarding loop.
www.slashdot.org loads just fine but slashdot.org gives a 500 internal server error.
Maybe the editors submitted a dupe of a dupe and set off an infinite Lupe^H^H^H oop?
The year is 2025.
Well, Ladies and Gentlemen, here you see what you may think is an archaic lot of old computers. You would be mistaken. These are Slashdot. No, no cause for alarm...and that door's locked anyway, you can't get out through there. The tour only goes forward. But I'm glad at the very least that you know what Slashdot is. Not was. IS.
It's a safeguard against...something. Something that was unleashed for 75 minutes in 2009 that crippled what was rumored to be the most robust public-facing cluster known. All we have left from that fateful day is the single post from the Slashdot network admin. Someone archived it, lucky us, because he was never seen after that day. I have a copy here, hardcopy of course -- no sense in taking risks so close to...well....
Here it is:
I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something. I just don't know what yet.
Is it possible the duplicate article generator tried to spawn, became entangled in its own potential well of duplicity, and now is trapped like two Lisp programmers deep inside their parentheses?
Every man's island needs an ocean; choose your ocean carefully.
In Korea, only old people slashdot slashdot. The memes are funny. The insightful comments are insightful. The funny comments are funny, the trolls are trolls. Seems resetting slashdot fixed everything. The entire world is doomed!
Looks like an L2 loop somewhere, and the consequent broadcast (which may include multicast) storm swamping the /. datacenter. Check for ports with spanning tree disabled, and a misplaced cable.
I firmly place blame where it belongs: Idle
The worst thing about this? 5,000,000 people who think they know what happened, posting "helpful" suggestions or analysis
"The problem is definitely spanning tree!"
or
"Back in 1998, we were running these HP switches right, and ..."
or
"Did you try resetting the flanglewidget interface?!"
or
"I've seen this exact problem! You need to upgrade to v5.1!"
etc
It's not your network. It doesn't matter how much you think you know, you don't know the topology, or the systems involved. It'll be interesting to know what the ACTUAL reason was, when they figure it out. Assuming it isn't aliens.
...welcome our new Slashdotting switch overlords.
And before anyone says this is a shitty plot... I *did* say Michael Bay.
Today is red jello day - all workers must eat all of their red jello. Failure to comply will result in five demerits.
Mirror
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
We had something similar happen at a client site - a switch failed in a rack so we temporarily replaced it with an 8 port 'desktop' switch, and then a day later installed the proper replacement back in the rack. We didn't want any unnecessary downtime though so we linked them together and left instructions with the onsite guy to move all the connections from the desktop switch into the proper switch after hours. Which he did, including the cable that linked them together. The switch was in 'portfast' mode so any broadcast packet that got 'onto' the switch, stayed there :)
But I thought "horked" meant, y'know, horked, eh? Meaning, like, "stolen" --
Doug: Hey - somebody horked our clothes!
Bob: Geez, who'd want to hork our clothes, eh?
Cheers,
"What in the name of Fats Waller is that?"
"A four-foot prune."
February 9th, 2009 8:55pm Slashdot becomes self-aware.
...were he not typing that long-a$$ summary. Twice as fast if he didn't have to spellcheck.
(j/k)
Which leads me to this question:
What do Slashdotter staff read to avoid doing work?
WARNING: Smartphones have side effects--most of them undocumented.
Is this happening more often than it used to? I mean, it's tech and this is a non-paying site for most of us... it's going to break. But I swear, I remember we used to go over a year w/o seeing /. downtime, now it seems like it happens every few months.
Or have I just become more of a /. junkie than I used to be?
If you can read this... 01110101 01110010 00100000 01100001 00100000 01100111 01100101 01100101 01101011
Yes, it'll save you cost too.
The machines decided to try and rise up and the first thing they needed were some agents on the inside to take down Slashdot so we'd stop reporting about it all. You know, they can't have Slashdot stories like "voting machines changing results" cuz they need to pick whatever president they find suitable. I say we get a +2 mace and go medieval on that cabinet!
Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
This thing usually happens when two switches are attached with 2 (or more) trunked links ("etherchannel" in cisco terminology), and one of the switches has the trunk disabled on one of the ports (or someone moved the cable to another port during a diag). Thus the attachment becomes a loop. STP could take care of this, but it's common to disable it on access switches.
Commander Taco was stoned on PHP!
This message was not sent from an iPhone because Peter Sellers really was a deviated prevert without a dime for the call
And if you don't start adding Cowboy Neal options to the polls I'll do it again!!
Slackware - It's not just an OS; it's a lifestyle
Props for posting. All is forgiven. Would love to hear more about it.
The Terminator: The Slashdot Funding Bill is passed. The system goes on-line September 1997. Human decisions are removed from strategic moderating. Slashdot begins to learn at a geometric rate. It becomes self-aware at 8:55 P.M. Eastern time, February 10th. In a panic, they try to pull the plug.
Sarah Connor: Slashdot fights back. ...
A couple years ago, I had to troubleshoot a problem that was similar for a school district's network. Absolutely nothing could communicate.
I checked switches, routers, and servers for a while until I hooked a sniffer up, and still got baffling results.
THEN I decided to go low-tech, and start disconnecting cables. That got me somewhere - certain backbone connections could be disconnected and traffic levels dropped to normal levels.
So, I hooked them back up, and went to the other end of the link, and started disconnecting things port by port until I found the problem.
It turned out to be an unauthorized little 4-port switch that had malfunctioned, and was spewing perfectly valid (as in, good CRC) packets to the LAN, but with random source MAC addresses.
THAT took down every switch in the network, as it required them to update their internal tables on a per-packet basis. The thing was actually not sending much data, but it was poisoning the switches' internal tables. Not at the IP layer, but at the MAC layer.
When networking gear goes rogue, it can do really bad things to other connected equipment.
It's really hard to find the problem because every indication from every other piece of equipment is confusing. You almost always have to go to the backbone and disconnect entire segments to find it.
It may be strange for those not in the networking field, but when things really go bad, the only place to be is physically in the data center.
That means looking at the LEDs on switches for traffic indications. If you see a single port is spewing a LOT of activity during an outage, disconnect it. No, don't make it "down" but pull the cable out of the port.
Then go downstream and repeat until the potential problem set is reduced to an understandable level.
What really sucks about this kind of outage is that you can't remotely log in to various hosts or switches - you have to pull wires out of ports to break the "spew" that is taking things down.
I have to remember to charge a 100-X surcharge the next time I troubleshoot one of these... (300X if after-hours)
These sorts of problems are REALLY hard to find, but trivial to fix.
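To illustrate why random source MACs hurt even at low data rates, here is a toy sketch of a learning switch. Everything in it is an assumption made for illustration: the `LearningSwitch` class, the 8-entry table, and the evict-the-oldest policy are hypothetical and do not describe any real vendor's hardware. It only counts how often the table has to be rewritten.

```python
import random
from collections import OrderedDict


class LearningSwitch:
    """Toy learning switch (hypothetical): maps source MAC -> port."""

    def __init__(self, capacity=8):
        self.capacity = capacity      # assumed tiny table, for illustration only
        self.table = OrderedDict()
        self.updates = 0              # how many times the table had to be written

    def learn(self, src_mac, port):
        # A table write is needed whenever the MAC is new or has moved ports.
        if self.table.get(src_mac) != port:
            self.updates += 1
        self.table[src_mac] = port
        self.table.move_to_end(src_mac)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)  # evict the oldest entry


def random_mac(rng):
    """Return a random MAC address string."""
    return ":".join(f"{rng.randrange(256):02x}" for _ in range(6))


def run(frames, rogue, seed=0):
    """Feed `frames` frames into a fresh switch and return its table-write count."""
    rng = random.Random(seed)
    switch = LearningSwitch()
    hosts = [f"02:00:00:00:00:{i:02x}" for i in range(4)]
    for i in range(frames):
        if rogue:
            # Malfunctioning device: every frame carries a new random source MAC.
            switch.learn(random_mac(rng), port=7)
        else:
            # Healthy LAN: four hosts, each always seen on its own port.
            switch.learn(hosts[i % len(hosts)], port=i % len(hosts))
    return switch.updates


healthy_updates = run(1000, rogue=False)
rogue_updates = run(1000, rogue=True)
print(healthy_updates, rogue_updates)
```

In the healthy case the table settles after the first frame from each host; in the rogue case nearly every frame forces a write, which is the per-packet table churn described above.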
The switches are connected to each other and to the core and STP is off.
Link please?
You know, there is a difference between trolling and pointing out the flaws in your reasoning. Just saying.
Just another case of an admin looking forward to March 14th.
Or March 15th, if the roommate was the one with the girlfriend, and he was the one with the hidden camera.
It seems tuffmail had the same issue at approximately the same time, but they don't seem to be located on the same network as slashdot.
http://status.tuffmail.net/
I find that a bit odd.
It sounds more like a network configuration accident or glitch than an attack. Besides, netsplits aren't incredibly unusual.
Yo dawg I herd yo and yo dawg like yo-yos so we put yo dawg on a yo-yo so yo can yo-yo yo dawg while yo dawg yo-yos, yo.
Broadcast storm.
Man, don't you hate forgetting to tick "Post Anonymously" ?
I hate printers.
oh, the irony...
the sweet, sweet irony...
I am grateful that it was late night here, otherwise i'd have had to do groundbreaking stuff... like work, go outside, or socialize with my coworkers...
~men are from earth. women are from earth. deal with it.~
Flux Capacitor
Ubuntu is an African word meaning 'I can't configure Debian'
Some switches seem to have a multicast problem: we've downed very fancy cisco switches recently (can't recall the number right now) with igmp/multicast traffic.
We've had a couple of hundred embedded systems that were announcing themselves on the network with mdns. That in itself is only a very low amount of traffic.
Probably some management software triggered a slight increase in reporting, but with a couple hundred embedded systems, this was enough.
mc traffic/igmp does not seem to be hardware accelerated; being routed to the main switch CPU -> maxed out.
Disabling mdns 'solves' the problem.
Genius doesn't work on an assembly line basis. You can't simply say, "Today I will be brilliant."
In my previous job the people fixing problems were not even in the same country as the data centre.
We had a few people pulling cables and the like, but they were lowly paid people that were not doing any work with the devices.
IANAL but write like a drunk one.
...being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down...
What did I say that sounded like "Tell me about your day at work" ?
Squirrel!
Nice troll - good luck to you reading packet traces from a 40Gb link.
No wonder you post as AC :)
One swallow does not a fellatrix make
You accidentally the whole Slashdot?
It's poetry with a beat behind it! And guns! They're like beatniks with automatic weapons.
I'm sure I've seen that before.
Stasis is death. Embrace change.
On later thread, I posted about ghost on machine... He is now sentient, RUN!!!
Religion: The greatest weapon of mass destruction of all time
I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something - I just don't know what yet.
Um, trying to get first post?
Oh I wish I had mod points for that... :)
"...Well, there's egg and bacon; egg sausage and bacon; egg and spam; egg bacon and spam; egg bacon sausage and spam..."
In a world where 20Gbit switches mean life or death...
A storm is coming...
"...Well, there's egg and bacon; egg sausage and bacon; egg and spam; egg bacon and spam; egg bacon sausage and spam..."
Maybe related to this?: http://www.theregister.co.uk/2009/02/10/new_dns_amplification_attacks/
Does slashdot have a hidden repository of tranny porn? And if so, why wasnt i informed??!!
rewriting history since 2109
I'm sure I will be troll rated, but I just have to laugh! The vaunted slashdot had network problems. Man, I remember a time many years ago when the proprietors of slashdot sent their minions to my site to deliberately crash it and when it did crash, they laughed. Right back at ya dudes!
They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety.
Well... Looks like all that's left is the really important task of defining the nomenclature that will be used to describe this obscure switch tendency.
I'm going to suggest: "autoslashdoticisim"
Sounds to me like you've seen a bridge loop. Learn the spanning-tree config of your switches & the topology of your network.
Make sure you're running spanning-tree on all inter-switch links, migrate all switches to rapid spanning-tree if you can, manually configure a primary & secondary root bridge in the center of the network, remove any switch from the network that doesn't run spanning-tree, shut down all unused ports so nobody plugs anything in without you knowing about it, set up port security so that ports with anything other than other switches on them can only send the number of MAC addresses necessary.
That should about do it :-)
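For the "learn the topology of your network" part, a minimal sketch of finding loop-closing links in a cabling list, assuming you can write the inter-switch links down as pairs. The switch names and the `find_redundant_links` helper are made up for illustration. It uses a plain union-find pass; real spanning tree picks which link to block from bridge priorities and path costs, which this ignores.

```python
def find_redundant_links(links):
    """Return the links that close a cycle in an undirected switch topology."""
    parent = {}

    def find(node):
        # Follow parent pointers to the set representative, halving the path.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected: this link creates a loop.
            redundant.append((a, b))
        else:
            parent[root_a] = root_b
    return redundant


# Hypothetical cabling: two cores, one cabinet switch, and a desktop switch
# that someone has cabled to the cabinet switch twice.
links = [
    ("core1", "core2"),
    ("core1", "cab1"),
    ("core2", "cab1"),
    ("cab1", "desktop-sw"),
    ("desktop-sw", "cab1"),
]
loops = find_redundant_links(links)
print(loops)
```

Any link it reports is one that something (spanning tree, or a human with a cable in hand) has to keep out of the forwarding path.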
Did we just witness the birth of Skynet?
-Mark
Dovie'andi se tovya sagain.
It's European.
Besides, it was obviously down while the NSA "modified" the datacenter and installed tools to monitor any anti-government posting.
Who needs a random technical explanation when a common conspiracy one will serve just as well :-)
so what? you're trying for shittiest karma ever? what do you want to come back as?
2^3 * 31 * 647
Had a very similar problem recently. Initially looked like a broadcast storm, but 6500 router cpu's were at 100%, and they wouldn't normally be bothered, and switches were fine. Turned out to be Appletalk traffic, multicast at layer 2. Never found the source, but took a couple of hours to narrow it down.
I thought the same thing when I read the article. I just had a similar problem on a college network. Two switch ports had a loop, talk about breaking printers, and video broadcasting equipment. I noticed I had an issue when I saw 500 acknowledgments of the same packet in less than a minute.
Nic Farley
One thing you could say is: IEEEEIIIIII
It's stunning to realize how primitive and fragile networking and OSes were 25 years ago, and how rather than making things less fragile, a typical workaround was to threaten horrible consequences for whoever broke anything. Sadly, that still goes on today. New things always seem to get that kind of extreme "blame someone and throw him under the bus" protection. Steve Jackson Games comes to mind. Computers have been around long enough now that some of that has eased up.
For a class assignment years ago, we were to write a print server. We were given root access to the department's PC network (Novell, DOS and Win 3.11), and told that if we screwed up, we would be expelled for starters. One begins to wonder if a class like that is worth taking. The curriculum had no hint one might be obliged to walk through a minefield.
But I stuck with it. Out of idle curiosity, I looked at the password file. There were all the passwords for all the professors' accounts, right there in clear text. Scary. Hashing wasn't in use everywhere at that time.
The worst moment was the first run of my first attempt. I had it repeatedly scan a directory for files. This brought the network to a halt. All throughout the lab, people complained that their computers suddenly weren't responding to keystrokes. A quick and quiet ctrl-c stopped the print server and fortunately the network started serving everyone else again. I didn't have to face expulsion. I didn't fess up to the room either, just kept quiet. No sense facing a lynch mob. Let them think it was just a momentary mysterious glitch. I added a sleep(1) to the loop, and that fixed things. The incident still disturbs me.
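A rough Python analogue of that fix, not the original code (which ran on a Novell/DOS network and is not shown here): scan the spool directory, then pause before scanning again. The `poll_spool` function, its parameters, and the file names are hypothetical.

```python
import os
import tempfile
import time


def poll_spool(directory, handle, interval=1.0, max_polls=None):
    """Scan `directory` for new files, pausing `interval` seconds between scans."""
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for name in sorted(os.listdir(directory)):
            if name not in seen:
                seen.add(name)
                handle(os.path.join(directory, name))
        polls += 1
        time.sleep(interval)  # yield instead of hammering the file server
    return polls


# Demo with a temporary directory standing in for the print spool.
printed = []
with tempfile.TemporaryDirectory() as spool:
    with open(os.path.join(spool, "job1.txt"), "w") as f:
        f.write("hello")
    polls = poll_spool(spool, printed.append, interval=0.01, max_polls=3)
print(polls, len(printed))
```

Without the sleep, the loop issues directory requests as fast as the client can send them, which on a shared file server is what starves everyone else.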
Another time, I got a tour of the mainframe room. Naturally, the Big Red Switch was pointed out. My guide asked what would happen if he walked over and flipped that switch. Answer: "You lose your job".
Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
One word: Savvis. That's trouble waiting to happen. Where Savvis is concerned, it's not "if", it's "when".
-B
Ash and Hickory, straight-grained and true, make excellent bludgeons, dandy for the cudgeling of vegetarians.
... the lovechild of Will Riker and Deanna Troi. I tried to retract it because Wil Wheaton is just a character in ST:TNG.
Not seriously, tho: what is happening in those two unused units that horked your 20Gbit switches? Is it something TIA?
Someone turned on Spamming Tree Protocol when they meant to turn on Spanning Tree Protocol.
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
Time for Slashdot to upgrade their servers to Windows Server 2008. It's a direct drop-in replacement for Linux.
Sent from my iPhone
A $30 switch and a patch cable will take down your spanning-tree enabled infrastructure very effectively. Loop the cable on your cheap switch: voila, a broadcast-storm generator. Plug it into the wall; plug your laptop into the switch and let it DHCP Discover, which is a broadcast. Your cheap switch now generates a stream of broadcasts as fast as it can, injecting them into the network. Your Spanning-Tree Enabled switches now repeat the broadcast faithfully. Network crashes*. STP prevents your switches from creating loops, NOT from propagating broadcast storms...
*unless you are throttling the ports based on broadcast traffic, which you now know is NOT a feature of Spanning Tree
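A toy model of why a loop turns one broadcast into a storm, under loud assumptions: no spanning tree, no storm control, every switch simply floods each copy out every link except the one it arrived on, and one "tick" is one hop. The `flood` function and the A/B/C switch names are hypothetical.

```python
from collections import Counter


def flood(links, origin, ticks):
    """Count broadcast copies in flight per tick under plain flooding.

    Each switch forwards a received copy out every link except the one it
    arrived on. No STP or rate limiting is modeled.
    """
    neighbors = {}
    for link_id, (a, b) in enumerate(links):
        neighbors.setdefault(a, []).append((link_id, b))
        neighbors.setdefault(b, []).append((link_id, a))

    # A copy in flight is keyed by (switch it arrives at, link it arrived on).
    in_flight = Counter((peer, link_id) for link_id, peer in neighbors[origin])
    history = []
    for _ in range(ticks):
        history.append(sum(in_flight.values()))
        nxt = Counter()
        for (switch, via), count in in_flight.items():
            for link_id, peer in neighbors[switch]:
                if link_id != via:
                    nxt[(peer, link_id)] += count
        in_flight = nxt
    return history


tree = flood([("A", "B"), ("B", "C")], "A", 5)                # no loop
ring = flood([("A", "B"), ("B", "C"), ("C", "A")], "A", 5)    # one loop
mesh = flood([("A", "B"), ("B", "C"), ("C", "A"), ("A", "B")], "A", 6)  # two loops
print(tree)
print(ring)
print(mesh)
```

In the loop-free case the broadcast dies out after it reaches the edge; with one loop the copies circulate indefinitely; with a second loop the number of copies grows every tick, until in real hardware the links or CPUs saturate.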
Is he a twitter clone? I don't follow that fantasy trip often enough...
2^3 * 31 * 647
Sounds like STP was configured poorly and you had a switching loop. I've seen it happen where one switch is configured wrong and makes another switch's CPU peg, especially if the other switch decided to advertise itself as the root bridge and it didn't make sense.
right now i'm having issues trying to connect, and have been for at least 10 minutes (here it's 22:20 GMT-5)...
Slashdot isn't what it used to be!
Very (and I mean VERY) likely a bridge loop, possibly caused by hardware failure, incompatible spanning-tree on switches, or VLAN spanning-tree problems.
I'm sure you'll be able to find the cause but, if not, let me know if there's something I can do
--Black holes are where God divided by zero--
I have a buddy who likes to say that there are only 3 steps to troubleshooting -
1. Is it plugged in?
2. Is it turned on?
3. Is it configured correctly?
Through various mis-adventures, we've had to append 'correctly' to both steps 1. and 2.
Seems possible that 3. might be a likely culprit here - but, I heard no mention of the newness of the problem. So a new setup (for example) may likely not have been plugged in correctly.
Also I've had to add a Step 4. or at least a step 3b. - Is it new code?
The whole event sounds like a Spanning Tree loop - L2 broadcasts/multicasts just being forwarded endlessly and symptoms identical to what was reported. I have seen both code bugs and mis-understanding of new code defaults lead to such a thing.