Lessons Learned From Skype’s Outage
aabelro writes "On December 22th, 1600 GMT, the Skype services started to become unavailable, in the beginning for a small part of the users, then for more and more, until the network was down for about 24 hours. A week later, Lars Rabbe, CIO at Skype, explained what happened in a post-mortem analysis of the outage."
For us it's nearly our only way to speak to our loved ones at home. I'm just glad it's back up...
Not sure why you didn't link to the actual article on Skype http://blogs.skype.com/en/2010/12/cio_update.html Instead of the blogspam site.
"Would you, could you, with a goat?" Dr Seuss
Seriously?
It's all crystal clear now. Do not use Skype!
Shouldn't a major company stop piggybacking on its users and actually own infrastructure that wouldn't go down like that?
If you are a node-based company worth several billion, charge for services, and don't even run enough of your own supernodes and monitor them in such a way that they cannot handle an outage effectively, you need serious help.
When the foot seeks the place of the head, the line is crossed. Know your place. Keep your place. Be a shoe.
we've got people bitching at work about how it doesn't work from time to time, and why I've blocked its ability to do voice/video at the firewall. If you want VOIP, use something that uses standard SIP or some other documented, configurable traffic.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Sorry if this is off topic or an ignorant question, but how does Skype define supernodes? Does the company just randomly choose users who are online a lot and declare them supernodes without the owner's knowledge, or is there some other process?
cheers
Hmm. Seems to me their biggest problem is that they allowed clients with a known bug to become supernodes; if 50% of the network had upgraded, they should only have been creating supernodes from the upgraded clients.
And in hindsight (I don't know that they should be blamed for not considering this before), the number of supernodes should probably be ~100-150% more than needed to service expected load. That way, if a third of them die, they _still_ have more than needed to handle the expected load. (And thus, hopefully, more than needed to handle the excessive load without causing them to shut down).
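The headroom argument above can be checked with a little arithmetic. This is only a rough sketch: `surviving_capacity` is a made-up helper, and it assumes every supernode carries an equal share of the load, which nothing in the post-mortem confirms.

```python
def surviving_capacity(overprovision: float, failed_fraction: float) -> float:
    """Return the remaining capacity as a multiple of the expected load.

    overprovision: extra supernodes beyond what the expected load needs
                   (1.0 means 100% more, i.e. twice the required number).
    failed_fraction: share of supernodes that have gone offline (0.0 to 1.0).
    """
    if not 0.0 <= failed_fraction <= 1.0:
        raise ValueError("failed_fraction must be between 0 and 1")
    return (1.0 + overprovision) * (1.0 - failed_fraction)


# Compare no headroom, 100% extra, and 150% extra when a third of the nodes die.
for extra in (0.0, 1.0, 1.5):
    left = surviving_capacity(extra, 1 / 3)
    print(f"{extra:.0%} extra supernodes, a third lost: {left:.2f}x expected load")
```

Under that equal-share assumption, 100% extra leaves roughly 1.33 times the expected load after losing a third of the supernodes, while no headroom at all leaves roughly 0.67 times, which is below what is needed.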
"At its core, Skype relies on a third generation P2P network that has lots of peer nodes and a number of supernodes, one for several hundreds of nodes. Since Skype does not have a centralized directory to support finding routes between two or more nodes that want to communicate, the virtual network uses supernodes as directories. When a client enters Skype, it registers itself with a supernode, giving its IP address so it can be found by other clients who might want to establish a communication."
Skype is a peer-to-peer network? Like torrent? So the supernode is like a tracker website, to connect peers to one another? No supernode==no tracker==no calls going through. Hmmmm. Maybe they should try DHT.
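For anyone wondering what "supernode as directory" means in the quoted description, here is a toy sketch. It is not Skype's protocol, which is proprietary; `SupernodeDirectory` and its methods are invented for illustration, and the addresses are documentation-range placeholders.

```python
from typing import Dict, Optional


class SupernodeDirectory:
    """Toy in-memory directory: maps a user name to the address it registered."""

    def __init__(self) -> None:
        self._addresses: Dict[str, str] = {}

    def register(self, user: str, ip_address: str) -> None:
        # A client announces where it can currently be reached.
        self._addresses[user] = ip_address

    def unregister(self, user: str) -> None:
        # Dropping an unknown user is a no-op.
        self._addresses.pop(user, None)

    def lookup(self, user: str) -> Optional[str]:
        # Returns None when the user is not registered with this supernode.
        return self._addresses.get(user)


directory = SupernodeDirectory()
directory.register("alice", "203.0.113.7")
print(directory.lookup("alice"))  # the address alice registered
print(directory.lookup("bob"))    # None: bob never registered here
```

If the supernode holding a client's registration disappears, lookups for that client fail until it re-registers somewhere else, which is why losing many supernodes at once hurts so much.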
"I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
Lots of users were using an old outdated buggy version of Skype, lots of client crashes at once bringing down big chunks of the P2P network, remaining network couldn't handle the load and went down too, took a while for Skype to put its own supernodes up to help get the network self-sustaining again.
They're considering an auto-update feature now since such a feature could have kept this from happening. Personally I think old versions should be blocked from making or receiving calls too, so users would be encouraged to update (works for Team Fortress 2). Of course auto updates would make updating super easy anyway so impact from that would be minimal.
please mod this up and the /. article down.
...unless you need something in the newer version (feature, security update etc.). Of course us geeks like to have the latest to fiddle with, but for the average Joe end-user, if it ain't broke, don't fix it. There is always the risk that the newer software will contain new bugs. At one point the buggy version of the Skype software was the latest version and was what users were being pushed to upgrade to. If the crash had happened then, I wonder if they'd find a new way to scapegoat users.
By the way, new versions breaking existing functionality isn't theoretical, or rare. I'm currently installing software on my new laptop. I've had to downgrade both Zonealarm and Virtualbox. The former broke remote desktop. The latter broke file sharing. No idea why, but in each case uninstalling and installing an older version I knew worked fixed the issue for me.
These posts express my own personal views, not those of my employer
How about they release some supernode-only software that people can set up on a server, and possibly the ability to set up Skype to use a preferred supernode? That way a business can set up a supernode of its own and point its users to it. But that supernode would also be part of the collective of supernodes and route Skype connections for everyone else too. This would hopefully give Skype more supernodes out there that are 24/7 and not desktop computers routing the traffic.
-----BEGIN PGP SIGNATURE-----
12345
-----END PGP SIGNATURE-----
"A bug in Skype for Windows version 5.0.0.152 made the application crash when receiving late messages..... previous versions for Windows and all the other versions for non-Windows machines were not affected by the bug."
The new versions are often *more* buggy than the last version. Just recently Microsoft auto-updated my work computer from IE7 to 8, and the browser worked perfectly but something in the update killed my network connection. I had to waste an hour going back to the previous version (as did most people in the office). And then there was that antivirus software update from three weeks ago that killed people's Windows PCs by making them unbootable.
Programmers really should be more careful with their updates, to make sure the new X.y release is better rather than worse. But since they aren't careful I turned off auto-updates. They are too dangerous.
"I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
So this outage was triggered by out-dated clients and proprietary support servers. That's like saying IE6+IIS users can bring down 50% of the Web. And to think people depend on services like Skype to keep in touch with loved ones, never realizing there are simpler, almost better alternatives that do the exact same thing.
If problems with the client can lead to problems with the server then the server system lacks robustness. For applications like this the servers should be practically immune to any client state muck-ups.
Seems to me skype needs to work on their server side state machines.
For as much bull$hit as is spread about this, 99% of skype users were UNAFFECTED.
The only crisis is the fabricated one by websites and media.
"We expected a Limewire topology to be as reliable as a phone company topology and oddly enough that bit us in the ass."
China doesn't want them competing with state run telecoms, so they are being "asked" not to expand in China.
http://finance.yahoo.com/news/China-to-go-after-Internet-apf-78040210.html
The lesson they learned is that the users like to use buggy versions of their software? Sure, blame your users... Maybe the lesson to learn is not to release buggy software!
Coderz 4 Life
The QA of this release is way down. On top of that, skype auto-updated people from 4.0 to 5.0. Within a few days, the buggy 5.0 had enough penetration (50%) to bring them down.
The windows client has widely been reported to:
consume 2x as much CPU (33% to 60% on mine after upgrade)
leak RAM (starts out ok but after some use over 1.5gig needed)
the GUI is slow, so the fade effects on some computers (mine) causes video tearing. It is no longer possible to run full-screen. (320x240 is all I get before tearing sets in)
The fonts in the video area don't render correctly.
It should be noted that I have an AMD X2 1.6 and Radeon 1200 card in this computer. It's not shabby. But the 5.0 client brought it to its knees.
It plays SCII just fine (albeit on the lowest setting).
It comes at a bad time when they are trying for more corporate agreements, but can't run on my 3-year-old hardware.
I uninstalled 5.0 and installed 4.0 and it's back to normal.
Slashdot's rate-of-post filter: Preventing you from posting too many great ideas at once.
The problem is that it is broke, you just often don't realize it. Older doesn't mean more secure or more stable inherently. New versions fix bugs discovered in old versions. If everyone did update immediately, then everyone would have had the bug fix and this outage wouldn't have happened.
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
You can bitch they didn't QA the release. You can bitch that you don't like a P2P topology. But it is nice to see a public post-mortem.
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
..unless you need something in the newer version (feature, security update etc.).
And especially when the update is a 20-megabyte file. In fact, we need to reinstall the whole software every time.
Why such a lame updating system?
Back when I was doing one of the first VOIP solutions (this one mostly for LAN use) we dreamed up something like Skype, that would work in similar fashion. The big advantage is that it could be done by any reasonably large group of users and no phone company at all need be involved -- no charge to anyone, no control over anyone by some big monolithic corp. It could still be done, and I wonder why no one in the open source area has managed? Critical mass issue; selling the first phone is a bear -- who you gonna call? Once going, a completely free open source solution would keep going just fine I'd think. I'd suppose the main problems would be with security, outside actors diddling supernodes to break it, as some companies would have a large interest in not having it as a competitor? Not sure how you'd handle those issues.
Why guess when you can know? Measure!
I hate when apps run auto update daemons. This is precisely the reason why I don't use any Google desktop software on my computers.
Proper thing to do in this case is simply disallow users to log in with a message they need to upgrade their client if they want to continue to use the app. Simple thing to do, rather than each app running a daemon. Soon enough there will be hundred update daemons on each user's computer, eating resources, connecting online all the time and bogging down the user experience. Thanks but no thanks. I refuse to use any of those.
As the island of our knowledge grows, so does the shore of our ignorance.
Shouldn't that be December 22nd?
On est ce qu'on veut (A man is what he wills himself to be). -- Sartre
And that's exactly why this happened. People were satisfied with the initial release of v5 and saw no need to update (meaningless bug fixes, no useful features, who cares). Then they broke everything...
a client (or even many) crashing shouldn't cause the server to, too. That's just bad design/software.
Skype seems clueless. They're thinking of using "processes for providing ‘automatic’ updates to our users so that we can help keep everyone on the latest Skype software. We believe these measures will reduce the possibility of this type of failure occurring again." Contrariwise - this would only make the matter worse. What if the _current_ version were the one with the problem, and an automated update system had forced everyone onto it? Then, instead of 50% of the clients contributing to the problem, they'd have 100%.
"National Security is the chief cause of national insecurity." - Celine's First Law
You're suffering from sample bias. Newer software is also 'broke' and you also don't know that. I think the point would be, if it is 'broke' but not impacting you in a way that you'd know it, do you care? In some cases yes, in other cases no.
(Satire)
Sorry, no. In Today's Post 911 World, rational decision making can never be the same again. We have to Respond to an Event like this. Remember the Day That Skype Was Down forever!
In other censorship news, all discussions of Averages and Means have been blocked, because 7 years of past performance will never matter again.
(/Satire)
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
About 20 years ago now... sent out code with a bug in the fault recovery code, then a problem in one node cascaded throughout the network. http://www.phworld.org/history/attcrash.htm
...with Skype who^H^H^Hsharing the keys with every major gubmint out there (maybe that's a revenue stream too?)
Encrypted my ass.
It is equally possible that newer software introduces bugs as much as fixes them. But the assumption that older is always more secure and stable is flawed.
In reality, the best solution is to review changelogs and make informed decisions when upgrading. But avoiding all upgrades isn't the solution.
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
"We believe that increased load in supernode traffic led to some of these parameters exceeding normal limits, and as a result, more supernodes started to shut down"
Maybe I'm missing something, but why are supernodes coded to shut down during increased load instead of simply throttling requests? It seems like the idea of 'too many requests, shut down' is what caused the cascade. Can someone enlighten me as to why this is the preferred overload handling mechanism?
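For comparison, this is roughly what the "throttle instead of shut down" alternative asked about above could look like. It is a hypothetical sketch, not how Skype's supernodes work: `ThrottlingNode` and its `max_in_flight` limit are invented names, and a real node would also have to watch memory, CPU, and sockets.

```python
class ThrottlingNode:
    """Toy node that rejects excess requests instead of shutting down."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.rejected = 0

    def try_accept(self) -> bool:
        # Shed load once the limit is reached; the node itself stays up.
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1
            return False
        self.in_flight += 1
        return True

    def finish(self) -> None:
        # Called when a previously accepted request completes.
        if self.in_flight > 0:
            self.in_flight -= 1


node = ThrottlingNode(max_in_flight=3)
results = [node.try_accept() for _ in range(5)]
print(results, node.rejected)
```

With a limit of three, the fourth and fifth requests are turned away while the first three keep being served, so the node stays in the pool rather than pushing all of its load onto its neighbours.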
Well said. Skype is primarily a piece of technology aimed at the individual consumer. It is made completely clear at the outset that it doesn't claim to be a landline replacement, so anyone who lost business as a result of the outage doesn't get much sympathy from me.
:-}
The downtime period for me was about a day and a half, which amounts to 0.41% of the year. No biggie, I have SIP and mobile alternatives. Or both if I run a SIP client over my wireless internet dongle or phone tether.
I get very tired of those who insist on telling everybody to stop using Skype and to use this or that product instead. Skype has a commanding and undeniable position in people's headspace because it offers a fucking good product. For me, the combination of IM client with voice calling capability is a killer. My non-geek friends will never be persuaded to run a separate IM and SIP client. I can (and do) leave video calling alone, since nobody needs to see me after (or during) an evening on the single-malts...
Mod parent up.
See my journal for slashdot ID's by year. Mine created in 2005. http://slashdot.org/journal/289875/slashdot-ids-by-year
One important lesson to be learned is this: many users do not update their software if they don’t have to. Skype had a newer version in place, without the triggering bug, but most users had the buggy one.
Yeah. Right. Because all recent Skype updates (starting with version 3(?)) were known to contain mostly only one of these: more ads or more UI bloat. And occasional breakages.
So why would they expect users to update it regularly?
All hope abandon ye who enter here.
> And that's exactly why this happened.
It happened because their system is vulnerable to cascading failure. They've managed to combine the disadvantages of a centralized system with those of a decentralized one.
Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
Well, that'll teach them for relying so much on misconfigured Windows clients. Any properly configured Windows client won't work as a supernode, so their network depends on misconfigured, most probably malware infested machines. I'm amazed that it works at all.
Excuse me, but please get off my Pennisetum Clandestinum, eh!
And yes, breaking software with an upgrade can happen to anyone and even happens to the Mac. My 12" Powerbook DVD player software died last year after a recommended graphics update. Lots of panicked complaining on the Apple's bulletin board, but dead silence from Apple. A fix from them took more than a month, during which VLC was the only workaround.
Reading that analysis in the TFA, it seems that (a) a systemwide crash was inevitable and (b) just about anyone who knew how Skype functions at the network level really, really should have anticipated that exactly this would happen. So the real problem here is the kind of systemic human failure to act in the face of an emergency that we've seen time and time again, from the Deepwater Horizon to 9/11 to the Columbia and the Challenger.
-- "The only thing that is ever new in the world is the history you do not know." -- Harry Truman
The going answer is "why waste time and effort making updates smaller?"
December twentytooth?!? Mildly amusing...almost as much so as the guy up there who used the term "Loads and Loads". If you have Loads of something, how do you make more loads? Add another plural of course! Reminds me of the "Infinity and beyond" ripple of 1998. I do believe that the Slashdot editors are all off on vacation...
"My immediate reaction is "WTF? What kind of moron doesn't make things 64-bit safe to begin with?" Linus
There is an option between "auto-update" and "update when you want": deprecated versions. If a version has a known major bug in it that could compromise the system, require updates for only those versions. That way only the bad version will be replaced and we won't be updating everyone at every release. The main advantage is that the system is kept safe without unnecessary updates.
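A minimal sketch of that middle option, assuming the service can see the client's version string at login. `BLOCKED_VERSIONS` and `login_decision` are hypothetical names; the only real detail is the version number, which is the buggy build named in the post-mortem quote.

```python
# Hypothetical blocklist: only builds with a known critical bug are refused.
BLOCKED_VERSIONS = {"5.0.0.152"}


def login_decision(client_version: str) -> str:
    """Return 'allow' or 'update_required' for a connecting client."""
    if client_version in BLOCKED_VERSIONS:
        return "update_required"
    # Older or newer builds without a known critical bug are left alone.
    return "allow"


print(login_decision("5.0.0.152"))  # update_required
print(login_decision("4.0"))        # allow
```

Only the one bad build is forced to update; anyone happily running an older or newer version is not touched.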
NAT is evil. Skype needs to build an overly complex networking protocol because too many people are behind NAT gateways. Skype *could* probably get away with their basic available hardware if only they got to design for a NAT free world.
One could also say they were trying to cheap out and not invest in as much hosting as is required to assure reliability of their chosen networking architecture.
Of course, on the flip side, Skype as a service would be nearly useless in a NAT-free world. No need for a coordinating entity other than DNS if all peers can directly ring up the address of their recipient. Even in multi-user desktop/migration scenarios one could have a DNS record that points to the 'active' user and deregisters on logout. Some may argue that skype would still be cleaner, and of course they still have the bridge to phone feature.
XML is like violence. If it doesn't solve the problem, use more.
Ohh noesss 20MB it will take me seconds on my internet connection. All that wasted time. Ohh noooeess
They are starting to roll out enterprise service. Skype for SIP now available in Beta.
Skype For SIP is the perfect way to integrate Skype with your existing PBX, allowing the communications from your PBX to be complemented by Skype functionality – head over to the Business blog to find out more about the Beta programme.
Somehow I don't think PBX interoperability is aimed at the consumer market. (though SIP support might help some consumers)
SSC
So a failure of only 30% of supernodes brought the system down. They should have had a lot more redundancy in their network than they did. The outage was NOT due to some fluke. It was due to an inherently inadequate network.
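To see how a partial loss can take out everything, here is a toy cascade model. It is a sketch under invented assumptions, not a reconstruction of the real outage: the load is split evenly over whatever nodes are up, a node shuts down when its share exceeds its limit, and all numbers are made up.

```python
from typing import List


def simulate_cascade(capacities: List[float], total_load: float, initial_failures: int) -> int:
    """Return how many nodes are still up once the cascade settles.

    capacities: per-node limit; a node shuts down when its share exceeds it.
    total_load: load split evenly across whichever nodes are still up.
    initial_failures: nodes knocked out at the start (taken from the front of the list).
    """
    alive = list(capacities[initial_failures:])
    while alive:
        share = total_load / len(alive)
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            break  # every remaining node can carry its share
        alive = survivors  # overloaded nodes drop out and the load is re-split
    return len(alive)


# Ten nodes with 20% headroom each (capacity 12 against a fair share of 10).
print(simulate_cascade([12.0] * 10, 100.0, 1))  # losing one node is absorbed
print(simulate_cascade([12.0] * 10, 100.0, 3))  # losing three takes down the rest
```

In this model the network shrugs off losing one node in ten, but losing three pushes every survivor over its limit at once. The only way to survive a 30% loss here is for the remaining 70% to be able to carry the full load, which is the redundancy point made above.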
Here is what really happened.
A non-telephone company had a cascading problem with its ad-hoc peer-to-peer networking that provides telephony and video services at costs way below any telephone (or cable) company. The company is profitable enough to make its own way in this world.
This story was broadcast pretty-much worldwide by all media.
The non-telephone company was embarrassed and released a statement to the media about how this happened, as a means by which it might encourage everyone to download new, free software that will fix the problem and to cover for the public relations problem.
Skype is not a telephone company, but they allow you to provide telephony and video conferencing by using their software for free. And, for calls made to regular telephones, it's between 2.3 and 1.2 per minute anywhere in the world, offering a considerable savings over telephone companies and cable companies. When John Thomas Draper (AKA Captain Crunch) tried that with AT&T, he was convicted for wire fraud.
Five years ago, the only people who knew what Skype was were computer nerds. Today, as a result of the incredible savings people are receiving by making long-distance and international calls through Skype, almost everyone does. Five years ago, the only people who would have known of this outage were Slashdot users and a few other geeks. It would not have made news.
And that, dear reader, is the reason why this is important.
I don't plan to buy any stock in any phone company that doesn't do what Skype does.
Gods don't kill people, people with gods kill people.
Even Macs have software bugs? Wow, next thing you know you'll be telling me they can even get a virus. Maybe if I paid Lord Jobs more he could get me bug-free Mac software.
by now they should be big enough to be able to afford to run proper supernodes on the cloud proper and not rely on ordinary people's clients to do the "cloud" job for them.
Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
Not to make light of your hardships abroad, but aren't there other means of communication? I mean, there's email, Facebook, and Google Voice, and good old instant messaging.
For me the problem usually is balancing the bug patched version with extra marketing shite (I don't want facebook or anything like that, just give me streamlined contacts) against an older buggy version which doesn't offend my eyes as it sits in the corner.
It's the same reason why I stayed with ICQ99b instead of the later versions. For all you script kiddies out there, I'm at 127.0.0.1.
And FWIW according to their blog, their commercial service didn't experience any outages whatsoever.
You're special forces then? That's great! I just love your olympics!
That doesn't mean, of course, that China isn't becoming a superpower. They may be, or may not, I don't know the future. Militarily, they already are...
I think you've overrated China's status as a military power. Sure they have the capability of attacking and perhaps overrunning neighboring countries like, say, Taiwan and Vietnam, with whom they waged a brief but bloody war in the late 1970s. But the Chinese lack the ability to deploy their forces across continents and the two largest oceans, an ability which the Russians, as the main heirs of the former Soviet war machine, still have. In fact, after the end of World War 2, the US remains the only country to have waged multiple large scale wars overseas: the Vietnam War and the two Iraq Wars. (A possible exception might be the UK, which won the Falkland Islands war against Argentina, but was merely part of the supporting cast in the Iraq wars.)
While undoubtedly enough of a deterrent to avert a US invasion, China's nuclear might is just on a par with the other permanent members of the Security Council. So, no, barring the political disintegration of the US, China still has a long way to go before becoming a Cold War class superpower.
> December 22th
Lessons Learnt - Lesson One:
Ordinal Numbering in Popular English Pronunciation
Why do I block skype? Because the only way to have it work properly through most firewalls is to allow ALL outgoing ports.
Skype lists three other firewall configurations that work, including two that only require egress on a single port that's almost always open anyway.
It's a massive, massive security issue you could drive an oil tanker through.
Oh, come on. Sure, egress filtering is a polite thing to do, but it's inbound connections that put you at risk. And chances are, if you do fall victim to some nefarious piece of malware that's making unwanted outbound connections, simple packet filtering will be useless anyway because it will fall back to TCP 80, or TCP 443, or even UDP 53, to tunnel out. Just like Skype does.
You advertise yourself as an "admin of some 12 years" experience, but you're exactly the type of admin I dislike. You take a personal stance against something, and then back up your bias with a mixture of pseudo-facts, deliberate omission, and high-handed horseshit.
People were satisfied with the initial release of v5 and saw no need to update (meaningless bug fixes, no useful features, who cares).
Maybe there are people who think that way, but to me it sounds totally backwards. I usually prefer the updates that are just bugfixes. Being forced to upgrade to a new version with new features because of security bugs in the old version is an annoyance. It would be so much more convenient to install a new version that was identical to what I already had with just that one bug fixed. But instead I may have to upgrade to a new version that adds some features that I never needed, removed some features that I used on a daily basis, fixed one critical security bug, and added a few new bugs (that were not security problems, just making my computer crash once a day or something like that).
What I think would really be much better would be to first of all let users know if an update was bugfixes or adding features, second give users the opportunity to see more details about what is changed in the update, and finally give users a reliable way to downgrade if they should experience problems with the update. I think one of the main reasons for people not installing updates is a history of some companies using automated updates to push their own agenda instead of giving users improved software.
Do you care about the security of your wireless mouse?
The article leads to something that could bring down the network again, if Skype hasn't learnt from their failure:
Ask me about repetitive DNA
Well, is it better to have programs doing their randomness on port 80 (or 443)?
And, assuming your sales or other staff are halfway presentable, isn't it much better for sales to be able to see your customers and vice versa (if they want to)?
Yeah, bandwidth costs, but how does that compare to the cost of the warm body?
I'm not a lawyer, but I play one on the Internet. Blog
Do most DSL providers allow full-scale businesses (in commercial zones instead of SOHO) to buy a consumer Internet connection?
I'm not a lawyer, but I play one on the Internet. Blog
Insisting on being a douchebag over a few kilobytes of bandwidth (non-video calls over Skype are NOT that heavy on traffic) just makes you look like an asswipe.
He(?) did say he was instructed to block it - do you get that?
Whether Skype can be a productivity bonus, ditto every other thing people are "demanding" obviously varies from job to job.
But the job will determine that - not you. In the end you're just the asswipe without the job. In real life you can't even argue that until you're blue in the face - you'd get forcibly removed before then. In real life you'd be best employing a compelling argument.
But what would I know... oh, and your views *are* fascinating.
the lusers as you call them are whom the internet is for
In much the same way as water is there for *you* to swim in, piss in, and drink, and please do post me some of what you're smoking - I'll get it tested, I suspect it's been dipped in cat tranq.
I didn't say that egress filtering has no merit, and yes, there are situations where it's called for.
Either you're confused - or meta-semantic trolling. I wrote, unambiguously, and quoted you - that the assertion "it's inbound connections that put you at risk" is "frankly, insane". (check it - the words are still there). Anything less than the complete truth *is* a lie. E.g. to claim that inbound is the main cause of malware would be speculative at best, and lacking in good faith (omitting important information).
[snip] It doesn't mean that anyone who doesn't follow your policy is "frankly, insane".
Nor do I claim that they are. It's not "my" policy. It's good practice (see ITIL) to recognize that security is not simply a matter of blocking exits. Again - nowhere did I say (or imply) that failure to follow my policy was "insane" - that's you either putting a spin on an opinion that contradicts your own, or failing to comprehend. If the English language is a problem I'm happy to "try" and accommodate your native preference.
Nor did I say that all networks should follow my policy - reread my post - I celebrate that others don't.
Nor is allowing outbound connections "a massive, massive security issue you could drive an oil tanker through".
Again - I did not say, or imply that.. Though I agree with blocking any *unnecessary* connection - regardless of the direction. Nor is that the language I would use if I had said it.
SOHO routers by Linksys, D-Link, SMC, Netgear, etc. allow unrestricted outbound connections by default,
Apropos of what? Are you confusing the lowest common denominator with best practice? If 15 million people believe a stupid thing does it make it *not* stupid? Does a product protocol reflect best practice or simply acknowledge the limitations of the largest market? That's rhetorical - if I believed that those products would be locked down.
and a hell of a lot of people are using them without it causing "massive security issues".
Now you are really grasping and flailing.
That's not to say these people don't have any massive security issues. They're just not caused by their egress filtering policy.
Contrasting a professional environment to an amateur environment - even with red herrings like "SOHO", is just silly. "massive security issues" is your hyperbole - you can keep it. Likewise the stupidity of an argument where you both deny, and concede the same slimy allegation. I never said it. Your baffle 'em with machinegun bullshit might impress the lads down the bus stop - but it fails in real life where your claims are tested. Any fool can claim they don't have security problems - to extend that to it "being a result of a policy (no outbound filtering)" is just demonstrating the utility of ridiculum absolutum. When you're done plucking spurious claims out of your arse try considering that not all security concerns come from the outside - particularly when it comes to loss of critical information in a business environment. Test your security theories - just like in the real world. It's called a tiger team - not teenage tautology.
Your opinions *are* not necessarily valid, though they are common.
What you say is true for power-users, but an average user has neither the requisite understanding, nor desire, nor availability to do the manual labor necessary.
My wife doesn't know the first thing about auto-updates beyond asking me "hey I'm getting this pop-up in the bottom right-corner of my screen, do you know why I'm getting it?" And I just don't have the time to do it on her laptop regularly. I don't auto-update everything on her laptop and periodically I'll update her software (about every three or four months, like I did for 4 hours yesterday), but for some things it's a necessity in order to get the bona fide security patches she actually needs in a timely manner.
Weylin
67.5% Slashdot Pure I guess I need to work on that....