Slashdot Mirror


Lessons Learned From Skype’s Outage

aabelro writes "On December 22th, 1600 GMT, Skype services started to become unavailable, at first for a small share of users, then for more and more, until the network was down for about 24 hours. A week later, Lars Rabbe, CIO at Skype, explained what happened in a post-mortem analysis of the outage."

38 of 278 comments

  1. Deployed Soldiers. by puterg33k · · Score: 5, Insightful

    For us it's nearly our only way to speak to our loved ones at home. I'm just glad it's back up...

    1. Re:Deployed Soldiers. by Ihmhi · · Score: 2

      "Doubles" refers to the last two digits in your post number (22 in this case).

      Every post on 4chan is numbered, with each forum having its own individual counter. So while something small like /int/ (International) might have tens of thousands of posts, something more popular like /v/ (Video Games) or /b/ (Random, the sewage drain of the Internet) has millions.

      There are often posts such as "doubles/triples/quads names my dog", or games wherein events are determined by post numbers like a roll of the dice. During the leadup to Christmas, there were more than a few threads that would gift games to people who managed to reach a certain number or pattern of numbers.

      Aside from this, there are the odd coincidences that result, such as a post saying "I am God" ending in 666.

      Lastly, certain numbers on certain boards have a special significance, and bits here and there of Internet culture were born just because a particular idea, image, etc. managed to get that post number. Aside from obvious stuff like post #2,000,000, there are things such as post 11223344 or post 44444444.

      But yes, as the brother post says, it's essentially cultural bleedover from 4chan.

  2. Blogspam by ralf1 · · Score: 5, Informative

    Not sure why you didn't link to the actual article on Skype http://blogs.skype.com/en/2010/12/cio_update.html Instead of the blogspam site.

    --
    "Would you, could you, with a goat?" Dr Seuss
    1. Re:Blogspam by commodore64_love · · Score: 2, Informative

      Not sure why you didn't link to the actual article on Skype http://blogs.skype.com/en/2010/12/cio_update.html [skype.com] Instead of the blogspam site.

      Here's why: "Your organization's Internet use policy restricts access to this web page.
      "Reason:
      "Internet Telephony is filtered." - So I'm glad slashdot linked to the blog so I'd be able to read what was going on. My workplace is so backwards they still use old-fashioned telephone lines rather than internet phones. Oh and hot water radiators with that classic "thunk thunk thunk" sound when they turn on. Feels like I'm living in the 1930s. ;-)

      --
      "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
    2. Re:Blogspam by Jurily · · Score: 4, Insightful

      But how else will aabelro promote his own site on Slashdot?! It's just good business sense.

      And people wonder why we don't RTFA.

    3. Re:Blogspam by Monkeedude1212 · · Score: 5, Funny

      We didn't want to Slashdot Skype and cause any more issues.

    4. Re:Blogspam by John+Hasler · · Score: 4, Insightful

      My workplace is so backwards they still use old-fashioned telephone lines rather than internet phones.

      And consequently you had reliable service while all the "modern, forward thinking" Skype users were down.

      --
      Warning: this article may contain humor, sarcasm, parody, and perhaps even irony. Read at your own risk.
    5. Re:Blogspam by iluvcapra · · Score: 2

      Logic: We need enhanced 911 service and reliable telephony during power outages, therefore block connections to skype.com on port 80.

      --
      Don't blame me, I voted for Baltar.
  3. December 22th? by colinRTM · · Score: 5, Funny

    Seriously?

  4. you are kidding me by alphatel · · Score: 5, Interesting

    If you are a node-based company worth several billion, charge for services, and don't even run enough of your own supernodes and monitor them in such a way that they cannot handle an outage effectively, you need serious help.

    --
    When the foot seeks the place of the head, the line is crossed. Know your place. Keep your place. Be a shoe.
    1. Re:you are kidding me by TubeSteak · · Score: 5, Insightful

      If you are a node-based company worth several billion, charge for services, and don't even run enough of your own supernodes and monitor them in such a way that they cannot handle an outage effectively, you need serious help.

      No one expects 40% of a globally distributed network to crash at once. No one.
      FTFA:

      The initial crashes happened just before our usual daily peak-hour (1000 PST/1800 GMT), and very shortly after the initial crash, which resulted in traffic to the supernodes that was about 100 times what would normally be expected at that time of day.

      Not even a multi-billion dollar company would have a disaster plan that provisions 100x capacity as a hot/cold spare.
      Though I bet their new plan includes automatic spawning of nodes on EC2 or some other distributed CDN.

      --
      [Fuck Beta]
      o0t!
    2. Re:you are kidding me by marcosdumay · · Score: 3, Interesting

      "China isn't deluded about itself like America"

      I'll believe that when I hear a Chinese person (one who hasn't been out of the country for decades) say that China will rule the world for any reason other than being a superior race or culture. China is quite deluded, even more so than the US. Half the world (the Occident) is helping it get even more deluded, and the other half (the Orient) is too afraid to help it shed any kind of delusion.

      That doesn't mean, of course, that China isn't becoming a superpower. They may be, or may not, I don't know the future. Military, they already are...

    3. Re:you are kidding me by TubeSteak · · Score: 2

      No one expects 40% of a globally distributed network to crash at once. No one.

      Oops. I made a mistake.
      It's 40% of 50%. So actually ~20% of global users crashed.
      The problem was that those ~20% of global users represented 25%~30% of active supernodes.

      Either way, losing 20%~30% or 40% of a globally distributed network is still the kind of stuff that only the RAND corporation and the Pentagon make plans for.

      If Skype hadn't included circuit breakers (so that the client would go easy on your bandwidth and CPU), their network might have stayed up.

      --
      [Fuck Beta]
      o0t!
  5. lesson (hopefully) learned... by smash · · Score: 4, Insightful

    ... relying on dodgy peer to peer VOIP telephony for business purposes is retarded.

    we've got people bitching at work about how it doesn't work from time to time, and why I've blocked its ability to do voice/video at the firewall. If you want VOIP, use something that uses standard SIP or some other documented, configurable traffic.

    --
    I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    1. Re:lesson (hopefully) learned... by commodore64_love · · Score: 5, Interesting

      Ahh so YOU'RE the one blocking my skype. ;-)
      I don't understand why Net Admins (such as yourself) block useful tools like Skype. Or streaming radio. I don't see any harm in letting those things into the office space, and it provides a more pleasant working environment (to distract from the boredom of sitting at a desk all day).

      --
      "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
    2. Re:lesson (hopefully) learned... by smash · · Score: 5, Informative

      Why do I block skype? Because the only way to have it work properly through most firewalls is to allow ALL outgoing ports. Which means you allow any random program to do any random shit through your firewall to the outside network. It's a massive, massive security issue you could drive an oil tanker through.

      Also, many companies pay for bandwidth. I don't want all of my bandwidth chewed up on video calls instead of mission critical apps.

      It's not just because we're nazis, it's because the skype protocol is completely fucked when it comes to the ability of your admin to control resources. Want voip/video? Use something else.

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    3. Re:lesson (hopefully) learned... by smash · · Score: 5, Insightful

      Just let me clarify: corporate networks are different to your home network. your home network? fine, use skype. in the office, where you've got several hundred PCs that may/may not have malicious software and/or users at the helm - allowing all outgoing connections is just begging for trouble.

      Egress filtering is a good thing.

      Making your day at work "less boring" by enabling you to do non-work related shit with company resources is not what my job is about. It is about ensuring the continued operation of the company's network - and skype is a liability.

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
    4. Re:lesson (hopefully) learned... by BobMcD · · Score: 2

      Making your day at work "less boring" by enabling you to do non-work related shit with company resources is not what my job is about. It is about ensuring the continued operation of the company's network - and skype is a liability.

      Careful there, BOFH. Here I'll help:

      Making your day at work "less boring" by enabling you to do non-work related shit with company resources is none of my business. Get it requested through the proper channels and you can have it. I don't make the business decisions here, I just do what the company needs done to be successful.

    5. Re:lesson (hopefully) learned... by BobMcD · · Score: 2

      Good luck with that. Welcome to 2010's economy.

      Meanwhile, CYA and collect your paycheck. Let those with the MBA's make the calls and take the heat, and NEVER bicker with the end user. You're not paid enough to deal with their crap.

    6. Re:lesson (hopefully) learned... by smash · · Score: 4, Informative
      1. Because skype wasn't written that way. You want standard voice/video, use a SIP program. Skype was written deliberately by the developers to allow it to talk to anywhere and everywhere through your network so it can route other people's calls, and connect to random other nodes for your own call routing. That free lunch you're eating? Paid for by others' use of your bandwidth.
      2. Multiply 500 users by 48kbit. That's 24 megabit in streaming audio. That you can get off that fucking $10 FM radio on your desk. Now I'm not sure how expensive bandwidth is where you are, but a business-grade 24 meg METERED (say, 300 gigs) internet connection here is about 5-10 grand a month. The business is not going to wear the cost of 5-10k per month for our users to listen to shitty quality streaming MP3. That's before you take into account the increased latency to mission critical apps, or remote end points on crappy satellite connections paying anywhere up to $7 per MEG of data.
      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
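
The bandwidth arithmetic in the comment above can be checked with a short sketch. The 500 users and 48 kbit/s come from the comment; the function names, the 8-hour listening day and the 22 working days per month are assumptions added here purely for illustration.

```python
# Back-of-the-envelope check of the streaming-audio figure.
# 500 users and 48 kbit/s are taken from the comment; everything else
# (function names, 8 h/day, 22 days/month) is an illustrative assumption.

def aggregate_mbit(users: int, kbit_per_user: float) -> float:
    """Total bandwidth in Mbit/s if every user streams at once."""
    return users * kbit_per_user / 1000.0


def monthly_gigabytes(mbit: float, hours_per_day: float, days: int) -> float:
    """Data volume in GB for a constant stream of `mbit` Mbit/s."""
    seconds = hours_per_day * 3600 * days
    return mbit * seconds / 8 / 1000.0  # Mbit -> MB -> GB


total = aggregate_mbit(500, 48)
print(total)                             # Mbit/s for all listeners combined
print(monthly_gigabytes(total, 8, 22))   # GB per month under the assumptions
```

Under those assumptions the monthly volume lands far above a 300 GB metered cap, which is the point the comment is making about cost.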
  6. Obvious problem.... by dstar · · Score: 4, Interesting

    Hmm. Seems to me their biggest problem is that they allowed clients with a known bug to become supernodes; if 50% of the network had upgraded, they should only have been creating supernodes from the upgraded clients.

    And in hindsight (I don't know that they should be blamed for not considering this before), the number of supernodes should probably be ~100-150% more than needed to service expected load. That way, if a third of them die, they _still_ have more than needed to handle the expected load. (And thus, hopefully, more than needed to handle the excessive load without causing them to shut down).
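
The headroom argument above can be worked through numerically. This is only a sketch: the function name and the figure of 1000 needed supernodes are hypothetical, and it models nothing beyond the simple proportion the comment describes.

```python
def surviving_capacity(needed: int, headroom: float, loss_fraction: float) -> float:
    """Ratio of surviving supernodes to the number needed for expected load.

    headroom      -- extra supernodes beyond `needed`, as a fraction (1.0 = 100% more)
    loss_fraction -- share of supernodes that fail at the same time
    """
    provisioned = needed * (1 + headroom)
    surviving = provisioned * (1 - loss_fraction)
    return surviving / needed


# 100% and 150% headroom, losing a third of the supernodes at once.
for headroom in (1.0, 1.5):
    print(headroom, round(surviving_capacity(1000, headroom, 1 / 3), 2))
```

With 100% to 150% headroom, losing a third still leaves roughly 1.3 to 1.7 times the needed count. The sketch says nothing about whether that margin would absorb the traffic spike quoted elsewhere in the thread.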

  7. I don't understand this. by commodore64_love · · Score: 4, Interesting

    "At its core, Skype relies on a third generation P2P network that has lots of peer nodes and a number of supernodes, one for several hundreds of nodes. Since Skype does not have a centralized directory to support finding routes between two or more nodes that want to communicate, the virtual network uses supernodes as directories. When a client enters Skype, it registers itself with a supernode, giving its IP address so it can be found by other clients who might want to establish a communication."

    Skype is a peer-to-peer network? Like torrent? So the supernode is like a tracker website, to connect peers to one another? No supernode==no tracker==no calls going through. Hmmmm. Maybe they should try DHT.

    --
    "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
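
The directory role described in the quoted passage can be pictured with a toy model. This is not Skype's protocol, which is not documented in the thread; the class, its methods and the example addresses are all hypothetical.

```python
from typing import Dict, Optional


class Supernode:
    """Toy directory: maps user names to the address they registered from."""

    def __init__(self) -> None:
        self._directory: Dict[str, str] = {}

    def register(self, user: str, ip_address: str) -> None:
        # A client announces itself when it comes online.
        self._directory[user] = ip_address

    def unregister(self, user: str) -> None:
        # Removing an unknown user is a no-op.
        self._directory.pop(user, None)

    def lookup(self, user: str) -> Optional[str]:
        # Returns None when the user is unknown to this supernode.
        return self._directory.get(user)


node = Supernode()
node.register("alice", "203.0.113.10")
print(node.lookup("alice"))
print(node.lookup("bob"))
```

If the directory holder disappears, lookups have nowhere to go, which is the "no supernode, no calls" situation the comment describes.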
  8. TL;DR version: by The+MAZZTer · · Score: 4, Interesting

    Lots of users were using an old outdated buggy version of Skype, lots of client crashes at once bringing down big chunks of the P2P network, remaining network couldn't handle the load and went down too, took a while for Skype to put its own supernodes up to help get the network self-sustaining again.

    They're considering an auto-update feature now since such a feature could have kept this from happening. Personally I think old versions should be blocked from making or receiving calls too, so users would be encouraged to update (works for Team Fortress 2). Of course auto updates would make updating super easy anyway so impact from that would be minimal.

  9. Never makes sense to upgrade working software... by syousef · · Score: 5, Interesting

    ...unless you need something in the newer version (feature, security update etc.). Of course us geeks like to have the latest to fiddle with, but for the average Joe end-user, if it ain't broke, don't fix it. There is always the risk that the newer software will contain new bugs. At one point the buggy version of the Skype software was the latest version and was what users were being pushed to upgrade to. If the crash had happened then, I wonder if they'd find a new way to scapegoat users.

    By the way, new versions breaking existing functionality isn't theoretical, or rare. I'm currently installing software on my new laptop. I've had to downgrade both ZoneAlarm and VirtualBox. The former broke remote desktop. The latter broke file sharing. No idea why, but in each case uninstalling and installing an older version I knew worked fixed the issue for me.

    --
    These posts express my own personal views, not those of my employer
  10. Supernode Software by varmittang · · Score: 4, Interesting

    How about they release some supernode-only software that people can set up on a server, and possibly the ability to set up Skype to use a preferred supernode. So a business can set up a supernode of its own and point its users to it. But that supernode is also part of the collective of supernodes and routes Skype connections for everyone else too. This would hopefully give Skype more supernodes out there that are up 24/7 and not desktop computers routing the traffic.

    --
    -----BEGIN PGP SIGNATURE-----
    12345
    -----END PGP SIGNATURE-----
  11. Re:Lessons Learned From Skype’s Outage by rjstanford · · Score: 2

    Google video chat, perhaps? Or maybe acknowledge that it's fairly impossible to provide both 100% uptime and free video chat at the same time, without the resources of a major player behind you to promote goodwill?

    Seriously, they were down for some percentage of the people for 1% of one year, during which time many competitive products were available. This is not an earth-shattering catastrophe.

    --
    You're special forces then? That's great! I just love your olympics!
  12. Article Summary [sarcastic] by Ukab+the+Great · · Score: 4, Funny

    "We expected a Limewire topology to be as reliable as a phone company topology and oddly enough that bit us in the ass."

  13. Skype Win 5.0 client sucks by scorp1us · · Score: 4, Interesting

    The QA of this release is way down. On top of that, skype auto-updated people from 4.0 to 5.0. Within a few days, the buggy 5.0 had enough penetration (50%) to bring them down.

    The windows client has widely been reported to:
    consume 2x as much CPU (33% to 60% on mine after upgrade)
    leak RAM (starts out ok but after some use over 1.5gig needed)
    the GUI is slow, so the fade effects on some computers (mine) cause video tearing. It is no longer possible to run full-screen. (320x240 is all I get before tearing sets in)
    The fonts in the video area don't render correctly.
    It should be noted that I have an AMD X2 1.6 and Radeon 1200 card in this computer. It's not shabby. But the 5.0 client brought it to its knees.

    It plays SCII just fine (albeit on the lowest setting).

    It comes at a bad time when they are trying for more corporate agreements, but can't run on my 3-year-old hardware.

    I uninstalled 5.0 and installed 4.0 and it's back to normal.

    --
    Slashdot's rate-of-post filter: Preventing you from posting too many great ideas at once.
    1. Re:Skype Win 5.0 client sucks by smash · · Score: 2

      Maybe you're a supernode? :)

      --
      I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
  14. Public Post-Mortem by Enderandrew · · Score: 4, Insightful

    You can bitch they didn't QA the release. You can bitch that you don't like a P2P topology. But it is nice to see a public post-mortem.

    --
    http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
  15. Re:How are supernodes defined? by circletimessquare · · Score: 2

    that's right, because everyone who wants to use VOIP should review the source code and familiarize themselves with the relevant RFC specs

    classic "if you aren't a computer scientist you shouldn't use the internet" ignorant geek snobbery. how's that standard of behavior working for you?

    --
    intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
  16. Forced auto updates are not the solution. by mario_grgic · · Score: 4, Interesting

    I hate when apps run auto update daemons. This is precisely the reason why I don't use any Google desktop software on my computers.

    The proper thing to do in this case is simply to refuse the login, with a message telling users they need to upgrade their client if they want to continue to use the app. Simple thing to do, rather than each app running a daemon. Soon enough there will be a hundred update daemons on each user's computer, eating resources, connecting online all the time and bogging down the user experience. Thanks but no thanks. I refuse to use any of those.

    --
    As the island of our knowledge grows, so does the shore of our ignorance.
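
The login-time version check proposed above might look something like the following sketch. The minimum version "2.1.0", the function names and the message text are hypothetical and do not correspond to any real Skype release or API.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    # Assumes purely numeric dotted versions with the same number of parts.
    return tuple(int(part) for part in text.split("."))


def login_allowed(client_version: str, minimum: str = "2.1.0") -> Tuple[bool, str]:
    """Return (allowed, message) for a client trying to sign in."""
    if parse_version(client_version) < parse_version(minimum):
        return False, f"Please upgrade to version {minimum} or later to sign in."
    return True, "OK"


print(login_allowed("2.0.9"))
print(login_allowed("2.10.0"))
```

Comparing parsed integer tuples rather than raw strings keeps "2.10.0" newer than "2.1.0", which a plain string comparison would get wrong.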
  17. Re:Lessons Learned From Skype’s Outage by tenex · · Score: 2

    I think we're talking about better up-time than that for Skype. If we believe the outage numbers presented on their Wikipedia page http://en.wikipedia.org/wiki/Skype, they've had a total of 72 hours down time since the initial release in 2003--and assuming a 100% outage in all cases (which was not the case here)--their up-time works out to something like:

              99.88%

    Seven years and 72 hours of total down-time... It might not be five nines, but does seem a pretty respectable up-time percentage.
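
A quick way to check the up-time arithmetic, using only the figures given in the comment (72 hours of total downtime over roughly seven years). The function name and the 365.25-day year are assumptions for illustration.

```python
HOURS_PER_YEAR = 365.25 * 24  # average year length, an assumption


def uptime_percent(downtime_hours: float, years: float) -> float:
    """Up-time as a percentage of the whole period."""
    total_hours = years * HOURS_PER_YEAR
    return 100.0 * (1 - downtime_hours / total_hours)


print(round(uptime_percent(72, 7), 2))
```

Under those assumptions the result is roughly 99.88%, or about two nines. For comparison, five nines over the same seven years would allow well under an hour of total downtime.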

  18. Re:Lessons Learned From Skype's Outage by BrokenHalo · · Score: 2

    Well said. Skype is primarily a piece of technology aimed at the individual consumer. It is made completely clear at the outset that it doesn't claim to be a landline replacement, so anyone who lost business as a result of the outage doesn't get much sympathy from me.

    The downtime period for me was about a day and a half, which amounts to 0.41% of the year. No biggie, I have SIP and mobile alternatives. Or both if I run a SIP client over my wireless internet dongle or phone tether.

    I get very tired of those who insist on telling everybody to stop using Skype and to use this or that product instead. Skype has a commanding and undeniable position in people's headspace because it offers a fucking good product. For me, the combination of IM client with voice calling capability is a killer. My non-geek friends will never be persuaded to run a separate IM and SIP client. I can (and do) leave video calling alone, since nobody needs to see me after (or during) an evening on the single-malts... :-}

  19. Autoupdates by ThePhilips · · Score: 2

    One important lesson to be learned is this: many users do not update their software if they don’t have to. Skype had a newer version in place, without the triggering bug, but most users had the buggy one.

    Yeah. Right. Because all recent Skype updates (starting with version 3(?)) were known to contain mostly only one of these: more ads or more UI bloat. And occasional breakages.

    So why do they expect that users would be updating it regularly?

    --
    All hope abandon ye who enter here.
  20. Re:Back up... by pagedout · · Score: 2

    Ah, but it's a brave new world where the client/server relationship is becoming fuzzier all the time. The part I think you are missing is that if you read the actual post it is obvious that everything that was crashing was applications on clients' computers. It appears that some clients are promoted to server status to handle routing requests.

    As for bad design/software I would instead say they had features without consideration of consequences. Here is where their problems are from what I can see.

    1. Non-Patched Nodes Become Servers - Seriously, if you are relying on your customers' computers to be part of your infrastructure you would think that you would want to use only fully patched versions of your application for this.

    2. Failed Message Queue - In an attempt to deliver messages to offline/crashed users they are queuing the messages and delivering them later. These types of systems ALWAYS exacerbate load problems unless implemented extremely carefully. Interestingly they hit both of the major issues with them at one go. First they had a bug that caused messages to crash systems (which they blindly kept trying to deliver). Then they had more of these messages out there causing more traffic on an already crippled network.

    3. Shutdown On Overflow - Since they are running on clients' networks, if the load becomes too great they shut down clients running as servers when they are running too hot. This one is just made to cause cascading failures. While I am unsure how their lookup domains are set up, it would probably be much better to spawn new servers to deal with increased load instead of shutting down working servers.

    But that is just what I think,
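
The cascading-failure concern in point 3 can be illustrated with a toy simulation. It assumes load is split evenly across live nodes and that any node pushed past its capacity shuts itself down; both are simplifications chosen for illustration, the capacities and loads are made up, and nothing here is a model of Skype's real behaviour.

```python
from typing import List


def simulate(capacities: List[float], total_load: float, max_rounds: int = 100) -> List[int]:
    """Return the number of live nodes after each round.

    Load is split evenly across live nodes; any node whose share exceeds
    its capacity shuts itself down, and its share moves to the survivors.
    """
    live = list(capacities)
    history = [len(live)]
    for _ in range(max_rounds):
        if not live:
            break
        share = total_load / len(live)
        survivors = [c for c in live if c >= share]
        if len(survivors) == len(live):
            break  # stable: nobody is overloaded
        live = survivors
        history.append(len(live))
    return history


capacities = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]  # total capacity: 190
print(simulate(capacities, total_load=100))  # [10]
print(simulate(capacities, total_load=130))  # [10, 8, 6, 4, 0]
```

In the second run the combined capacity (190) exceeds the load (130), yet shutting down the overloaded nodes pushes their share onto the rest until nothing is left. That is the feedback loop the comment warns about.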

  21. China cannot project its military by Troll-Under-D'Bridge · · Score: 2

    That doesn't mean, of course, that China isn't becoming a superpower. They may be, or may not, I don't know the future. Military, they already are...

    I think you've overrated China's status as a military power. Sure they have the capability of attacking and perhaps overrunning neighboring countries like, say, Taiwan and Vietnam, with whom they waged a brief but bloody war in the late 1970s. But the Chinese lack the ability to deploy their forces across continents and the two largest oceans, an ability which the Russians, as the main heirs of the former Soviet war machine, still have. In fact, after the end of World War 2, the US remains the only country to have waged multiple large scale wars overseas: the Vietnam War and the two Iraq Wars. (A possible exception might be the UK, which won the Falkland Islands war against Argentina, but was merely part of the supporting cast in the Iraq wars.)

    While undoubtedly enough of a deterrent to avert a US invasion, China's nuclear might is just on a par with the other permanent members of the Security Council. So, no, barring the political disintegration of the US, China still has a long way to go before becoming a Cold War class superpower.

  22. Not true. by nuckfuts · · Score: 2

    Why do I block skype? Because the only way to have it work properly through most firewalls is to allow ALL outgoing ports.

    Skype lists three other firewall configurations that work, including two that only require egress on a single port that's almost always open anyway.

    It's a massive, massive security issue you could drive an oil tanker through.

    Oh, come on. Sure, egress filtering is a polite thing to do, but it's inbound connections that put you at risk. And chances are, if you do fall victim to some nefarious piece of malware that's making unwanted outbound connections, simple packet filtering will be useless anyway because it will fall back to TCP 80, or TCP 443, or even UDP 53, to tunnel out. Just like Skype does.

    You advertise yourself as an "admin of some 12 years" experience, but you're exactly the type of admin I dislike. You take a personal stance against something, and then back up your bias with a mixture of pseudo-facts, deliberate omission, and high-handed horseshit.