Electricity Outage Puts Routing to a Tough Test
infofarmer writes "Today at about 11:30 MSD (GMT+4) a major electricity outage in Moscow, Russia brought new meanings to words like "uninterruptible", "redundant" and "uptime" for network administrators, who haven't experienced such harsh and unexpected power failures since the USSR got its Internet connection. Half of the city is totally out of electricity, including the subway and the most important traffic exchange point. Half of the top Russian sites went down, including www.mail.ru, www.rambler.ru and www.lenta.ru; some of them haven't been brought back up yet. IP packets going from ADSL users in Moscow to some local sites got rerouted to somewhere in London and then back to Scandinavia, where they met their "No route to host" dead end. Other routers found themselves in a loopback, which made many packets get dropped with TTL expired. The point is that most of the popular servers have two or three mainline Internet connections, but lack of BGP/RIP2/whatever configuration resulted in packets losing their way to hosts."
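The two failure modes in the summary ("No route to host" and drops with TTL expired) can be illustrated with a minimal Python sketch. It assumes a toy network in which every router has at most one next hop toward a single destination; the `forward` function, the `next_hop` table and the router names are hypothetical and are not taken from any real routing table.

```python
def forward(next_hop, start, dest, ttl=64):
    """Follow next-hop entries from start toward dest.

    next_hop maps a router name to the router it forwards to for this
    one destination (a simplification: real tables are per prefix).
    Returns (path, outcome).
    """
    node = start
    path = [node]
    while node != dest:
        if node not in next_hop:
            # This router has no route for the destination at all.
            return path, "no route to host"
        ttl -= 1  # each forwarding router decrements the TTL
        if ttl == 0:
            # The packet is discarded instead of circulating forever.
            return path, "ttl expired"
        node = next_hop[node]
        path.append(node)
    return path, "delivered"


# Hypothetical tables, for illustration only.
healthy = {"moscow-adsl": "msk-ix", "msk-ix": "site"}
dead_end = {"moscow-adsl": "london", "london": "scandinavia"}
looping = {"moscow-adsl": "r1", "r1": "r2", "r2": "r1"}

for name, table in [("healthy", healthy), ("dead end", dead_end), ("loop", looping)]:
    path, outcome = forward(table, "moscow-adsl", "site", ttl=8)
    print(name, outcome, len(path))
```

In the looping table the packet bounces between `r1` and `r2` until its TTL runs out, which is the behaviour the summary describes as packets being dropped with TTL expired.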
I checked right away, it's still up.
For the last three or four weeks my gmail account has been POUNDED by 100-200 cyrillic spam messages every day. The filters catch them, but I have to clean out my spam folder pretty often.
I've gotten none in the last couple hours.
I think you need to check your priorities. How do you think geeks all over the world just found out about the power failure?
I live in Russia, about 1000 km from Moscow. We were hit by a network outage: nothing worked (even Slashdot :( ) for about 30 minutes. The number of routes announced by both of our peers was about 700 instead of the normal 150000.
:)
But then routes began to appear again! I was amazed: the Internet routed itself around the damaged segments, and packets were routed through Japan (!), Finland and Holland instead of Moscow. The funniest part was when I traced the route to a computer in the next building: it went through Saint Petersburg.
I was able to access Slashdot and most Russian sites (http://newsru.com/ , http://ntv.ru/ , http://nbc.ru/) not directly affected by the outage.
Yes, but they're not public :(
Right now the poor admins are trying to find stable routes for Russian traffic, which has overloaded some international channels.
http://mosnews.com/news/2005/05/25/chubaiscriminal case.shtml
From the article:
Russian prosecutors on Wednesday opened a criminal case against the management of power monopoly Unified Energy System (UES) after a major power outage in Moscow, agencies reported Wednesday.
The case was opened to investigate possible negligence, the Interfax agency quoted the Prosecutor General's Office as saying.
Well, the wiki on my site had been continually probed with vandalism attempts by various machines around the world for the past couple of days, and it stopped dead right around the time of the power failure.
So... no prizes for guessing where the control machines for the botnet were.
I doubt it was the lack of RIP2 configuration that caused this. You don't use RIP in the core, you use BGP as the exterior protocol and most likely OSPF or ISIS as the interior protocol.
UPS: at least in one place in MSK-IX they did have proper UPS backups, you can tell from routing tables that some BGP connections have an uptime of 4 weeks plus. They did bounce (or it had a power failure) one of their core routers as all those peering connections only have an uptime of 8.5 hours. I'd rather not provide a link to this as the last thing they need is their core routers slashdotted with BGP table summary requests.
Connectivity: it appears MSK-IX is peered with at least 12 other sites that are also peered with another major IX. For example they are connected to three other sites that are also connected to AMS-IX and four other sites that are also peered with LINX, among a few others with only 1 connection to another Internet Exchange. Many of these were thru Informtelecom XXI, so if they also had power problems everything was running on 50% normal capacity. There should have been enough connections to keep things running (i.e. no single point of failure), but that is assuming everything is working/powered, and assuming these guys in the middle could/would handle all the traffic (unlikely).
BTW, packets don't lose their way; routers lose their routes to destinations. When all the crap started, the routes began to "flap", i.e. go up and down as routers were reset, power came back on, routers went back down under the heavy load, admins manually tried to route around the problem, etc. When your peer sees your routes flapping, they usually put a holddown on them for a period of time, meaning they won't readvertise your route updates to other routers on the internet (said flaps propagate all over the world, putting undue stress on other routers). So even once you get everything working again, the internet waits a little while before accepting your routes. Well, some peers do and some don't, and some wait longer. That's why you see routers still forwarding packets to London: apparently London thinks it can still get to Moscow, so it's still advertising routes. You don't get the count-to-infinity problem with BGP, but loops are still possible, especially during major outages and route flapping. And routers get "routing loops," not "found themselves in a loopback."
I provided as much detail as I could; it's lacking in a few places because I can't follow Russian websites.
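The holddown on flapping routes is usually called route flap damping, and a rough Python sketch of the idea follows. It assumes the common scheme in which each flap adds a fixed penalty that decays exponentially over time; the class name `FlapDamper` and the numbers used (penalty of 1000 per flap, suppress above 2000, reuse below 750, half-life of 900 seconds) are illustrative assumptions, not the settings of any router mentioned in this thread.

```python
class FlapDamper:
    """Toy per-route flap damping state (all values are illustrative)."""

    def __init__(self, penalty_per_flap=1000.0, suppress_at=2000.0,
                 reuse_below=750.0, half_life_s=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_at = suppress_at
        self.reuse_below = reuse_below
        self.half_life_s = half_life_s
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # The penalty halves every half_life_s seconds.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now

    def flap(self, now):
        """Record one up/down transition of the route at time now (seconds)."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty > self.suppress_at:
            self.suppressed = True

    def is_suppressed(self, now):
        """True while the peer would still ignore updates for this route."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_below:
            self.suppressed = False
        return self.suppressed


route = FlapDamper()
for t in (0, 60, 120):  # three flaps within two minutes
    route.flap(t)
print(route.is_suppressed(120))   # suppressed right after the third flap
print(route.is_suppressed(1920))  # penalty has decayed below the reuse level
```

With these assumed values the route stays suppressed for roughly half an hour after the last flap even if it is perfectly stable, which matches the point that the rest of the internet waits a while before accepting your routes again.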