Lessons Learned From Skype’s Outage
aabelro writes "On December 22nd, 1600 GMT, the Skype services started to become unavailable, at first for a small share of users, then for more and more, until the network was down for about 24 hours. A week later, Lars Rabbe, CIO at Skype, explained what happened in a post-mortem analysis of the outage."
For us it's nearly our only way to speak to our loved ones at home. I'm just glad it's back up...
Not sure why you didn't link to the actual article on Skype's own blog (http://blogs.skype.com/en/2010/12/cio_update.html) instead of the blogspam site.
"Would you, could you, with a goat?" Dr Seuss
Seriously?
If you are a node-based company worth several billion, charge for services, and don't even run and monitor enough of your own supernodes to handle an outage effectively, you need serious help.
When the foot seeks the place of the head, the line is crossed. Know your place. Keep your place. Be a shoe.
We've got people at work bitching about how it doesn't work from time to time, and about why I've blocked its ability to do voice/video at the firewall. If you want VOIP, use something that uses standard SIP or some other documented, configurable traffic.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Hmm. Seems to me their biggest problem is that they allowed clients with a known bug to become supernodes; if 50% of the network had upgraded, they should only have been creating supernodes from the upgraded clients.
And in hindsight (I don't know that they should be blamed for not considering this before), the number of supernodes should probably be ~100-150% more than needed to service expected load. That way, if a third of them die, they _still_ have more than needed to handle the expected load. (And thus, hopefully, more than needed to handle the excessive load without causing them to shut down).
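The headroom arithmetic is easy to check. A minimal sketch, assuming capacity scales linearly with the number of supernodes and that "needed" means exactly enough for expected load; the function name is made up for illustration.

```python
def surviving_headroom(overprovision_pct: float, failed_fraction: float) -> float:
    """Return remaining capacity as a multiple of expected load.

    overprovision_pct: extra supernodes beyond what expected load needs, in percent
    failed_fraction: share of supernodes that go offline (0.0 to 1.0)
    """
    total_capacity = 1.0 + overprovision_pct / 100.0  # 1.0 means exactly enough
    return total_capacity * (1.0 - failed_fraction)


# Compare a few overprovisioning levels when a third of the supernodes die.
for pct in (0, 50, 100, 150):
    left = surviving_headroom(pct, 1 / 3)
    print(f"{pct:>3}% extra -> {left:.2f}x expected load after losing a third")
```

With 100% to 150% extra, losing a third still leaves roughly 1.33x to 1.67x of what expected load needs; with no extra, the same loss leaves only about two thirds.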
"At its core, Skype relies on a third generation P2P network that has lots of peer nodes and a number of supernodes, one for several hundreds of nodes. Since Skype does not have a centralized directory to support finding routes between two or more nodes that want to communicate, the virtual network uses supernodes as directories. When a client enters Skype, it registers itself with a supernode, giving its IP address so it can be found by other clients who might want to establish a communication."
Skype is a peer-to-peer network? Like torrent? So the supernode is like a tracker website, to connect peers to one another? No supernode==no tracker==no calls going through. Hmmmm. Maybe they should try DHT.
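The registration and lookup described in the quote can be sketched as a toy directory. This is only an illustration of the idea: Skype's real protocol is proprietary, and the class name, method names, and addresses here are all made up.

```python
from typing import Dict, Optional


class Supernode:
    """Toy directory: maps user names to the address they registered with."""

    def __init__(self) -> None:
        self._directory: Dict[str, str] = {}

    def register(self, user: str, address: str) -> None:
        # A client announces where it can be reached when it signs in.
        self._directory[user] = address

    def lookup(self, user: str) -> Optional[str]:
        # Another client asks where to find that user; None means unknown.
        return self._directory.get(user)


node = Supernode()
node.register("alice", "203.0.113.7:40000")
print(node.lookup("alice"))  # the address alice registered
print(node.lookup("bob"))    # None, bob never registered
```

If the supernode holding a user's entry disappears, lookups for that user have nowhere to go until the client registers somewhere else, which is the tracker-style weakness being pointed at.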
"I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
Lots of users were using an old, outdated, buggy version of Skype; lots of client crashes at once brought down big chunks of the P2P network; the remaining network couldn't handle the load and went down too; and it took a while for Skype to put its own supernodes up to help get the network self-sustaining again.
They're considering an auto-update feature now since such a feature could have kept this from happening. Personally I think old versions should be blocked from making or receiving calls too, so users would be encouraged to update (works for Team Fortress 2). Of course auto updates would make updating super easy anyway so impact from that would be minimal.
...unless you need something in the newer version (feature, security update etc.). Of course us geeks like to have the latest to fiddle with, but for the average Joe end-user, if it ain't broke, don't fix it. There is always the risk that the newer software will contain new bugs. At one point the buggy version of the Skype software was the latest version and was what users were being pushed to upgrade to. If the crash had happened then, I wonder if they'd find a new way to scapegoat users.
By the way, new versions breaking existing functionality isn't theoretical, or rare. I'm currently installing software on my new laptop. I've had to downgrade both ZoneAlarm and VirtualBox. The former broke remote desktop. The latter broke file sharing. No idea why, but in each case uninstalling and installing an older version I knew worked fixed the issue for me.
These posts express my own personal views, not those of my employer
How about they release some supernode-only software that people can set up on a server, and possibly the ability to set up Skype to use a preferred supernode? That way a business can set up a supernode of its own and point its users to it. But that supernode is also part of the collective of supernodes and routes Skype connections for everyone else too. This would hopefully give Skype more supernodes out there that are up 24/7 and not desktop computers routing the traffic.
Google video chat, perhaps? Or maybe acknowledge that it's fairly impossible to provide both 100% uptime and free video chat at the same time, without the resources of a major player behind you to promote goodwill?
Seriously, they were down for some percentage of the people for 1% of one year, during which time many competitive products were available. This is not an earth-shattering catastrophe.
You're special forces then? That's great! I just love your olympics!
"We expected a Limewire topology to be as reliable as a phone company topology and oddly enough that bit us in the ass."
The QA of this release is way down. On top of that, Skype auto-updated people from 4.0 to 5.0. Within a few days, the buggy 5.0 had enough penetration (50%) to bring them down.
The Windows client has widely been reported to:
consume 2x as much CPU (33% to 60% on mine after upgrade)
leak RAM (starts out OK, but after some use it needs over 1.5 GB)
have a slow GUI, so the fade effects on some computers (mine) cause video tearing. It is no longer possible to run full-screen. (320x240 is all I get before tearing sets in)
render the fonts in the video area incorrectly.
It should be noted that I have an AMD X2 1.6 and Radeon 1200 card in this computer. It's not shabby. But the 5.0 client brought it to its knees.
It plays SCII just fine (albeit on the lowest setting).
It comes at a bad time when they are trying for more corporate agreements, but can't run on my 3-year-old hardware.
I uninstalled 5.0 and installed 4.0 and it's back to normal.
Slashdot's rate-of-post filter: Preventing you from posting too many great ideas at once.
You can bitch they didn't QA the release. You can bitch that you don't like a P2P topology. But it is nice to see a public post-mortem.
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
that's right, because everyone who wants to use VOIP should review the source code and familiarize themselves with the relevant RFC specs
classic "if you aren't a computer scientist you shouldn't use the internet" ignorant geek snobbery. how's that standard of behavior working for you?
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
I hate when apps run auto-update daemons. This is precisely the reason why I don't use any Google desktop software on my computers.
The proper thing to do in this case is simply to refuse the login and show users a message that they need to upgrade their client if they want to continue to use the app. That's a simple thing to do, rather than each app running a daemon. Soon enough there will be a hundred update daemons on each user's computer, eating resources, connecting online all the time and bogging down the user experience. Thanks but no thanks. I refuse to use any of those.
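A minimal sketch of that refuse-at-login check, assuming plain dotted numeric version strings. The minimum version and the function names are hypothetical and not taken from Skype.

```python
from typing import Tuple

MIN_SUPPORTED = "2.1.0"  # hypothetical oldest version still allowed to log in


def parse_version(text: str, width: int = 3) -> Tuple[int, ...]:
    """Turn "2.1" into (2, 1, 0) so versions compare numerically, not as text."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (width - len(parts))  # pad short versions with zeros
    return tuple(parts)


def may_log_in(client_version: str, minimum: str = MIN_SUPPORTED) -> bool:
    """Allow the login only if the client is at least the minimum version."""
    return parse_version(client_version) >= parse_version(minimum)


for version in ("2.0.9", "2.1.0", "2.10.0"):
    verdict = "ok" if may_log_in(version) else "please upgrade to continue"
    print(f"{version}: {verdict}")
```

Comparing tuples of integers rather than raw strings matters here: as text, "2.10.0" would sort before "2.9.0" and a newer client could be turned away.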
As the island of our knowledge grows, so does the shore of our ignorance.
I think we're talking about better up-time than that for Skype. If we believe the outage numbers presented on their Wikipedia page http://en.wikipedia.org/wiki/Skype, they've had a total of 72 hours down time since the initial release in 2003--and assuming a 100% outage in all cases (which was not the case here)--their up-time minutes work out to something like:
99.88%
Seven years and 72 hours of total down-time... It might not be five nines, but it does seem a pretty respectable up-time percentage.
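For anyone who wants to redo the arithmetic, here is a minimal sketch, assuming 72 hours of total downtime over seven full years and treating every outage as a complete one. The function name is made up for illustration. On those assumptions it comes out near 99.88%.

```python
HOURS_PER_YEAR = 365.25 * 24  # average year, leap days included


def uptime_percent(downtime_hours: float, years: float) -> float:
    """Return uptime as a percentage of the whole period."""
    total_hours = years * HOURS_PER_YEAR
    return 100.0 * (1.0 - downtime_hours / total_hours)


print(f"{uptime_percent(72, 7):.3f}%")  # prints 99.883%
```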
Well said. Skype is primarily a piece of technology aimed at the individual consumer. It is made completely clear at the outset that it doesn't claim to be a landline replacement, so anyone who lost business as a result of the outage doesn't get much sympathy from me.
:-}
The downtime period for me was about a day and a half, which amounts to 0.41% of the year. No biggie, I have SIP and mobile alternatives. Or both if I run a SIP client over my wireless internet dongle or phone tether.
I get very tired of those who insist on telling everybody to stop using Skype and to use this or that product instead. Skype has a commanding and undeniable position in people's headspace because it offers a fucking good product. For me, the combination of IM client with voice calling capability is a killer. My non-geek friends will never be persuaded to run a separate IM and SIP client. I can (and do) leave video calling alone, since nobody needs to see me after (or during) an evening on the single-malts...
One important lesson to be learned is this: many users do not update their software if they don’t have to. Skype had a newer version in place, without the triggering bug, but most users had the buggy one.
Yeah. Right. Because all recent Skype updates (starting with version 3(?)) were known to contain mostly only one of these: more ads or more UI bloat. And occasional breakages.
So why would they expect users to update it regularly?
All hope abandon ye who enter here.
Ah, but it's a brave new world where the client/server relationship is becoming fuzzier all the time. The part I think you are missing is that if you read the actual post, it is obvious that everything that was crashing was an application on clients' computers. It appears that some clients are promoted to server status to handle routing requests.
As for bad design/software, I would instead say they had features without consideration of consequences. Here is where their problems are, from what I can see.
1. Non-Patched Nodes Become Servers - Seriously, if you are relying on your customers' computers to be part of your infrastructure, you would think that you would want to use only fully patched versions of your application for this.
2. Failed Message Queue - In an attempt to deliver messages to offline/crashed users, they are queuing the messages and delivering them later. These types of systems ALWAYS exacerbate load problems unless implemented extremely carefully. Interestingly, they hit both of the major issues with them in one go. First they had a bug that caused messages to crash systems (which they blindly kept trying to deliver). Then they had more of these messages out there causing more traffic on an already crippled network.
3. Shutdown On Overflow - Since they are running on clients' networks, they shut down clients running as servers when the load becomes too great and they are running too hot. This one is just made to cause cascading failures. While I am unsure how their lookup domains are set up, it would probably be much better to spawn new servers to deal with increased load instead of shutting down working servers.
But that is just what I think,
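Point 3 is easy to see in a toy model. A minimal sketch, assuming load is split evenly across whatever supernodes are still alive and that any node asked to carry more than its capacity switches itself off. The capacities, loads, and function name are made up for illustration and are not from Skype's post-mortem.

```python
from typing import List


def cascade(capacities: List[float], total_load: float) -> int:
    """Return how many supernodes survive when overloaded ones shut down.

    Load is split evenly across live nodes; any node asked to carry more
    than its capacity shuts itself off, and the load is split again.
    """
    alive = list(capacities)
    while alive:
        share = total_load / len(alive)
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            break  # everyone can carry their share, so the network is stable
        alive = survivors  # the rest shut down and push load onto the others
    return len(alive)


healthy = list(range(10, 20))     # ten nodes with capacities 10..19
after_crash = list(range(10, 17)) # the three biggest nodes have crashed

print(cascade(healthy, 90))       # 10: each node carries 9, all are fine
print(cascade(after_crash, 90))   # 0: 7 nodes -> 4 nodes -> none
print(cascade(after_crash, 60))   # 7: lighter load, no cascade
```

In the middle case each round of shutdowns raises the share for whoever is left, so removing "hot" nodes makes the remaining ones hotter until nothing is standing.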
That doesn't mean, of course, that China isn't becoming a superpower. They may be, or may not, I don't know the future. Militarily, they already are...
I think you've overrated China's status as a military power. Sure they have the capability of attacking and perhaps overrunning neighboring countries like, say, Taiwan and Vietnam, with whom they waged a brief but bloody war in the late 1970s. But the Chinese lack the ability to deploy their forces across continents and the two largest oceans, an ability which the Russians, as the main heirs of the former Soviet war machine, still have. In fact, after the end of World War 2, the US remains the only country to have waged multiple large scale wars overseas: the Vietnam War and the two Iraq Wars. (A possible exception might be the UK, which won the Falkland Islands war against Argentina, but was merely part of the supporting cast in the Iraq wars.)
While undoubtedly enough of a deterrent to avert a US invasion, China's nuclear might is just on a par with the other permanent members of the Security Council. So, no, barring the political disintegration of the US, China still has a long way to go before becoming a Cold War class superpower.
Why do I block skype? Because the only way to have it work properly through most firewalls is to allow ALL outgoing ports.
Skype lists three other firewall configurations that work, including two that only require egress on a single port that's almost always open anyway.
It's a massive, massive security issue you could drive an oil tanker through.
Oh, come on. Sure, egress filtering is a polite thing to do, but it's inbound connections that put you at risk. And chances are, if you do fall victim to some nefarious piece of malware that's making unwanted outbound connections, simple packet filtering will be useless anyway because it will fall back to TCP 80, or TCP 443, or even UDP 53, to tunnel out. Just like Skype does.
You advertise yourself as an "admin of some 12 years" experience, but you're exactly the type of admin I dislike. You take a personal stance against something, and then back up your bias with a mixture of pseudo-facts, deliberate omission, and high-handed horseshit.