Slashdot Mirror


Wikipedia Explains Today's Global Outage

gnujoshua writes "The Wikimedia Tech Blog has a post explaining why many users were unable to reach Wikimedia sites due to DNS resolution failure. The article states, 'Due to an overheating problem in our European data center many of our servers turned off to protect themselves. As this impacted all Wikipedia and other projects access from European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place, that changes our DNS entries. However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.'"
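
The "caching effects" in the quoted post follow from how resolvers honour record TTLs: a resolver that cached an answer just before the fix keeps serving it until the TTL runs out. A minimal sketch of that behaviour, assuming a one-hour TTL (inferred from the "up to an hour" figure, not stated by Wikimedia) and hypothetical names and addresses:

```python
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    value: str        # what the resolver cached (an address or a failure)
    fetched_at: int   # time of the lookup, in seconds
    ttl: int          # how long the answer may be reused, in seconds


class CachingResolver:
    """Toy resolver: reuses an answer until its TTL expires."""

    def __init__(self, authoritative):
        self.authoritative = authoritative  # callable: name -> (value, ttl)
        self.cache = {}

    def resolve(self, name, now):
        hit = self.cache.get(name)
        if hit is not None and now - hit.fetched_at < hit.ttl:
            return hit.value  # still fresh, the authoritative side is not asked
        value, ttl = self.authoritative(name)
        self.cache[name] = CachedAnswer(value, now, ttl)
        return value


# Hypothetical zone state; 192.0.2.10 is a documentation-only address.
zone = {"wiki.example": ("BROKEN", 3600)}
resolver = CachingResolver(lambda name: zone[name])

before_fix = resolver.resolve("wiki.example", now=0)
zone["wiki.example"] = ("192.0.2.10", 3600)  # operators fix the records
shortly_after = resolver.resolve("wiki.example", now=600)
an_hour_later = resolver.resolve("wiki.example", now=3600)
```

In this toy model the resolver still hands out the broken answer ten minutes after the fix and only picks up the corrected record once the cached entry has aged out.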

153 comments

  1. DNA? Really? by Anonymous Coward · · Score: 0

    How 'bout proofreading titles

    1. Re:DNA? Really? by Anonymous Coward · · Score: 0

      They're too busy downmodding nigger jokes.

  2. Wow! by Anonymous Coward · · Score: 1, Funny

    DNA resolution failure

    1. Re:Wow! by suso · · Score: 2

      Maybe the reverse DNA wasn't set right.

    2. Re:Wow! by Midnight+Thunder · · Score: 3, Funny

      DNA resolution failure

      Clearly they thought captchas were too easy to defeat.

      --
      Jumpstart the tartan drive.
    3. Re:Wow! by monkeySauce · · Score: 1

      No, the RNA is just fine. The real problem is a DNA CNAME pointing to an A molecule that isn't resolving.

    4. Re:Wow! by Anonymous Coward · · Score: 0

      Nice try, but I think you stretched that one a little too far. And besides, pointing a CNAME to an A record is OK; it's pointing a CNAME to a CNAME that is forbidden.

    5. Re:Wow! by DA-MAN · · Score: 1
      Nah, it's standard practice.

      $ host download.microsoft.com
      download.microsoft.com is an alias for download.microsoft.com.nsatc.net.
      download.microsoft.com.nsatc.net is an alias for mscom-dlc.vo.llnwd.net.
      mscom-dlc.vo.llnwd.net has address 208.111.161.113
      mscom-dlc.vo.llnwd.net has address 208.111.161.89
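
The `host` output above shows a resolver following two aliases before it reaches address records. A minimal sketch of that walk over an in-memory table that mirrors the output (the `resolve` helper, the table layout and the hop limit are assumptions for illustration, not how any particular resolver is implemented):

```python
# Records copied from the host output above; the table layout is illustrative.
RECORDS = {
    "download.microsoft.com": ("CNAME", ["download.microsoft.com.nsatc.net"]),
    "download.microsoft.com.nsatc.net": ("CNAME", ["mscom-dlc.vo.llnwd.net"]),
    "mscom-dlc.vo.llnwd.net": ("A", ["208.111.161.113", "208.111.161.89"]),
}


def resolve(name, records, max_hops=8):
    """Follow CNAME aliases until address records are found.

    Returns (alias_chain, addresses). The hop limit guards against loops.
    """
    chain = [name]
    for _ in range(max_hops):
        rtype, values = records[chain[-1]]
        if rtype == "A":
            return chain, values
        chain.append(values[0])  # a CNAME has exactly one target
    raise RuntimeError("too many CNAME hops: " + " -> ".join(chain))


chain, addresses = resolve("download.microsoft.com", RECORDS)
```

The hop limit is there because two aliases pointing at each other would otherwise never terminate.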

      --
      Can I get an eye poke?
      Dog House Forum
    6. Re:Wow! by omnichad · · Score: 1

      No, it's pointing an MX record to a CNAME that's "forbidden." It's done all the time, but it's technically not allowed.
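
The rule referred to here is in RFC 2181, section 10.3, which says the name in an MX (or NS) record must not be an alias. A minimal sketch of a zone check for it, using a hypothetical record list and helper name, and leaving out the MX preference value for brevity:

```python
def mx_targets_that_are_aliases(records):
    """Return (owner, target) pairs where an MX target is itself a CNAME owner.

    `records` is a list of (owner, rtype, value) tuples; names are compared
    case-insensitively and without a trailing dot.
    """
    def norm(name):
        return name.rstrip(".").lower()

    aliases = {norm(owner) for owner, rtype, _ in records if rtype == "CNAME"}
    return [
        (owner, value)
        for owner, rtype, value in records
        if rtype == "MX" and norm(value) in aliases
    ]


# Hypothetical zone data under example.com.
ZONE = [
    ("example.com", "MX", "mail.example.com"),
    ("example.com", "MX", "backup.example.com"),
    ("mail.example.com", "CNAME", "mx.provider.example"),
    ("backup.example.com", "A", "192.0.2.25"),
]

violations = mx_targets_that_are_aliases(ZONE)
```

Only the first MX record is flagged here, because its target is an alias while the second points straight at an address record.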

  3. This was terrible! by Anonymous Coward · · Score: 1, Funny

    Because of this outage, I actually had to work this morning.

    1. Re:This was terrible! by DeadDecoy · · Score: 1

      Then why are you on slashdot?

  4. DNA caused outage? by xata_boy · · Score: 1

    Human or otherwise?

  5. DNA DNS? by Anonymous Coward · · Score: 2, Funny

    I could see why the failover didn't work... They should try resolving names instead of nucleic acids. :\

  6. Oh teh noes! by Anonymous Coward · · Score: 0

    It's the T-virus, run!

  7. errr by yerpo · · Score: 0

    The blog link explaining the whole thing of course doesn't work for us Europeans, either.

  8. Test, and Test Again by Jazz-Masta · · Score: 3, Insightful

    However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally.

    Good thing Wikimedia pays their System Administrators well enough to test their backup systems.

    1. Re:Test, and Test Again by X0563511 · · Score: 3, Insightful

      I know people who work in the Florida DC. They do, and they are smart people. Don't assume incompetence.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    2. Re:Test, and Test Again by cryfreedomlove · · Score: 2, Insightful

      You say test and test again. I say that this is true only when the cost of an outage outweighs the cost of testing. What does this one hour, once per year really cost wikipedia?

    3. Re:Test, and Test Again by commodore64_love · · Score: 2, Insightful

      Free media publicity.

      --
      "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
    4. Re:Test, and Test Again by Dahamma · · Score: 2, Insightful

      True, and the cost was probably fairly minor, as they are not advertising based... so only the cost of any people so pissed off with the downtime that they refuse to donate :)

    5. Re:Test, and Test Again by Jazz-Masta · · Score: 2, Informative

      I actually wasn't assuming incompetence. Many sysadmins are understaffed, overworked and underpaid, and thus do not have the resources to properly test all backup and redundant systems.

      Consultants and contractors in the area of system administration get let go if anything like this were ever to happen. This is why they charge a little bit more.

      Whatever happened, it failed. A good lesson for next time. I don't know the exact cause, but it is safe to say there were too many eggs in one basket. Multiple geographically diverse load-balanced DNS servers? Why was there an overheating problem in the first place? Only one air conditioner?

      Wikipedia has had a few failures, not all their fault. In 2006 Cogent pulled a block of IP addresses that were leased to Wikipedia.

    6. Re:Test, and Test Again by geniice · · Score: 4, Interesting

      Going by past statistics, the cost of downtime to wikipedia tends to be negative since donations rise. Not that this is something wikimedia aims to do.

    7. Re:Test, and Test Again by Anonymous Coward · · Score: 1, Insightful

      I know people who work in the Florida DC. They do, and they are smart people. Don't assume incompetence.

      I'm going to assume incompetence. The only question is whose incompetence: the admins, or the folks higher up the food chain who didn't give them the resources they needed. But I have no doubt somebody was incompetent somewhere, how else do you explain the failure? Can you answer that instead of telling people what to think?

    8. Re:Test, and Test Again by geniice · · Score: 2, Informative

      Wikipedia has a fairly limited budget and has historically accepted the odd few hours of downtime now and again as the natural result of this. The number of such incidents has fallen over the years though.

    9. Re:Test, and Test Again by Anonymous Coward · · Score: 0

      I'm going to assume incompetence. The only question is whose incompetence: the admins, or the folks higher up the food chain who didn't give them the resources they needed. But I have no doubt somebody was incompetent somewhere, how else do you explain the failure? Can you answer that instead of telling people what to think?

      Wow. For someone who probably uses the service and doesn't pay for it, you're sure griping a lot. They don't serve ads (well, except to solicit funds to keep their servers up and running); they rely on private donations (which sets them apart from even public television, which gets some of its funding from the government).

      When you pay for an SLA with Wikipedia (signed by someone with the authority to make such an agreement) then you have the right to throw rude accusations around.

    10. Re:Test, and Test Again by Critical+Facilities · · Score: 1

      That may be true. Although, this whole thing has me wondering if they're re-thinking their "Green Technology" push. According to this article, they've recently partnered with a European firm that specializes in Green Data Center Technology, most specifically, using "air side economization" cooling techniques (cool the data center with outside air as opposed to mechanical cooling). Now, while I think this is a viable and worthwhile technology and strategy (and before anyone flames me into oblivion for being a naysayer when it comes to becoming more energy efficient), I do have to raise the question as to whether they had fully commissioned this facility. Had they determined at what point(s) supplemental cooling was brought online? Had they tested the mechanical cooling systems to ensure they would come online successfully and in enough time? Did anyone ask any questions relating to these issues?

      Just makes me wonder. I hope they share the information so that we can all learn from it.

    11. Re:Test, and Test Again by Chris+Mattern · · Score: 1

      I'm going to assume incompetence.

      Damn right we're going to assume incompetence! Why aren't we getting the uptime we're paying for?

      Oh, wait...

    12. Re:Test, and Test Again by petermgreen · · Score: 1

      I doubt this was a case of wikimedia deliberately going green. I suspect it's far more likely that they happened to be in the right place and happened to make an offer that wikimedia liked.

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
    13. Re:Test, and Test Again by FuckingNickName · · Score: 2, Insightful

      Wow. For someone who probably uses the service and doesn't pay for it, you're sure griping a lot.

      0. For someone who is going off on a rant based on a reasoned assumption, you sure aren't setting off on the right foot by starting with the unjustified assumption that the poster uses Wikipedia;

      1. You don't have to pay for or be a net consumer of something in order to criticise it - all you have to do is provide a reasonable explanation for the criticism. The alternative, that only the paying consumer should have a voice, is irrational and harmful;

      2. All this said, maybe the poster has donated time and/or money to Wikipedia - you do realise it's produced by thousands of (sometimes even well-meaning) volunteers, right?

      They don't serve ads (well, except to solicit funds to keep their servers up and running),

      So they don't, except when they do. At least regular adverts give you the opportunity to learn about some product. Huge banners telling you that the Child in Africa will die from not knowing all the Pokemon characters if you don't donate are quite pathetic.

      When you pay for an SLA with Wikipedia (signed by someone with the authority to make such an agreement) then you have the right to throw rude accusations around.

      I know America has such a macho culture that it's considered life-destroying to receive public criticism, but it's actually useful to be told that you're incompetent when you're incompetent. It's the first step to finding out where you've demonstrated incompetence, which is the precursor to fixing (i) your approach; (ii) the problem. Brushing the truth under the carpet by sounding the "I/Wikipedia admins have the right not to be offended!" klaxon solves nothing.

    14. Re:Test, and Test Again by Anonymous Coward · · Score: 0

      Minus the cost of people donating so there's less downtime

    15. Re:Test, and Test Again by Anonymous Coward · · Score: 0

      Don't forget that unqualified "CTO" they hired this year.

    16. Re:Test, and Test Again by Dencrypt · · Score: 1

      Pays them what? $3.3M a year isn't really much for paying 30 people.

    17. Re:Test, and Test Again by Anonymous Coward · · Score: 0

      Just to nit pick, smart != competent though the two usually go together. In many cases of failure, it is a matter of the sysadmin not growing a pair and knuckling under to some know-nothing MBA. "Told ya so" won't cut it; even though the MBA forced the order down the throats of the IT dept, it will still be IT that gets the blame.

    18. Re:Test, and Test Again by David+Gerard · · Score: 1

      Wikimedia is a charity. $8M to run a top 5 website is approximately NOTHING. My suggested slogan for the last fundraiser wasn't used: "Give us money. Or the homework GETS IT."

      --
      http://rocknerd.co.uk
    19. Re:Test, and Test Again by David+Gerard · · Score: 1

      There's nothing quite as uplifting as the complaints of someone getting something for free. Not even for ads.

      --
      http://rocknerd.co.uk
    20. Re:Test, and Test Again by David+Gerard · · Score: 1

      It is actually true that downtime used to be by far our most profitable product ;-)

      --
      http://rocknerd.co.uk
    21. Re:Test, and Test Again by similar_name · · Score: 1

      For someone who is going off on a rant based on a reasoned assumption, you sure aren't setting off on the right foot by starting with the unjustified assumption that the poster uses Wikipedia;

      How is jumping to incompetence a reasoned assumption, but assuming someone has used wikipedia unjustified? It's at least reasonable.

      You don't have to pay for or be a net consumer of something in order to criticise it - all you have to do is provide a reasonable explanation for the criticism.

      So what was the useful criticism in assuming incompetence over lack of funding?

      All this said, maybe the poster has donated time and/or money to Wikipedia - you do realise it's produced by thousands of (sometimes even well-meaning) volunteers, right?

      If so, then according to his own assumption of incompetence, is he partly to blame?

      So they don't, except when they do.

      That's rather pedantic. I think (most) everybody knows the difference between a non-profit fundraising banner and an ad.

      I know America has such a macho culture that it's considered life-destroying to receive public criticism, but it's actually useful to be told that you're incompetent when you're incompetent

      Very true. I work with people who should be told so more often. However, it is equally a mistake to think that human error is always the result of incompetence and that getting rid of people who make mistakes will lead to a more perfect system. A good system deals with human error; it doesn't assume it won't exist.

      I think it is fair to say that someone may have made a mistake and that Wikipedia may not be funded well enough to deal with human error.

  9. Rumor was.. by cybrthng · · Score: 1

    Some government pencil pusher mixed up wikileaks with wikipedia... after all the "strange tweets" from @wikileaks it sounded feasible ;)

    1. Re:Rumor was.. by Anonymous Coward · · Score: 0

      Wikileaks is part of wikimedia, so it went down too (along with wikinews, wikispecies, etc.).

    2. Re:Rumor was.. by Anonymous Coward · · Score: 2, Informative

      Wikileaks is part of wikimedia, so it went down too (along with wikinews, wikispecies, etc.).

      Wikileaks is certainly NOT part of Wikimedia. You can see such at http://wikimediafoundation.org/wiki/Our_projects

    3. Re:Rumor was.. by David+Gerard · · Score: 1

      This is on the press coverage bingo card.

      --
      http://rocknerd.co.uk
  10. Do we accept this... by Al's+Hat · · Score: 4, Funny

    ...as proof of global warming?

    1. Re:Do we accept this... by alta · · Score: 1

      +1 So True...

      --
      Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
    2. Re:Do we accept this... by Pete+Venkman · · Score: 1

      It came from Wikipedia so it must be true--right?

    3. Re:Do we accept this... by smooth+wombat · · Score: 1

      Not necessarily, but this might be.

      --
      We will bankrupt ourselves in the vain search for absolute security. -- Dwight D. Eisenhower
    4. Re:Do we accept this... by Al's+Hat · · Score: 1

      Possibly lame attempt at humor on my part. I'm more worried about ocean acidification as a more immediate threat.

    5. Re:Do we accept this... by afabbro · · Score: 0, Troll

      Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.

      Wow - is that the worst sig on Slashdot or what?

      • "Do not meddle in the affairs of sysadmins, for they are subtle..." The sysadmins or their affairs? And why would subtle affairs (or subtle sysadmins) be either significant or threatening? Oh my God, he's...he's...SUBTLE! RUN!
      • "..and quick to anger." Angry affairs? Angry sysadmins?
      • And to top it off, your URL points to a slimey affiliate site that promotes "Free Advertising System" and "Copy the Super Affiliates".
      --
      Advice: on VPS providers
    6. Re:Do we accept this... by Anonymous Coward · · Score: 0

      Troll, or woefully ignorant of Tolkien? I can't tell.

    7. Re:Do we accept this... by jeffasselin · · Score: 0, Offtopic

      Either you should read more or you have some serious linguistic credentials.

      It's a reference to a quote in The Lord of the Rings by JRR Tolkien, something said to Frodo Baggins by Gildor Inglorion in The Fellowship of the Ring (tome 1) in chapter 3 "Three is Company":

      "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."

      The quote is of course in reference to the wizards of Middle-Earth. The user sig which you tried to disparage is an attempt to make an analogy between sysadmins to wizards.

      My problem here is that you weren't attacking the analogy, but the syntax of the sentence. A sentence crafted by JRR Tolkien, one of the most well-known scholars of the English language, and one who was named "Author of the Century" for the last century. I think he knew English syntax better than you do. Unless of course you're a well-known, published writer who has studied the English language extensively and you have the diplomas to prove it on the wall in your office.

      --
      If he explores all forms and substances Straight homeward to their symbol-essences; He shall not die.
    8. Re:Do we accept this... by Anonymous Coward · · Score: 0

      ^Try not failing English 101. Sysadmins are the subject of the entire sentence. Thus, sysadmins are subtle, and quick to anger. HTH

    9. Re:Do we accept this... by fuzza · · Score: 1

      The version I used to have was: "Do not meddle in the affairs of sysadmins, for they are quick to anger and have no need for subtlety."

      --
      Can't find examples of evolution? No matter, neither could Dawkins
    10. Re:Do we accept this... by Jupix · · Score: 1

      While we're on the subject of temperature, can anyone enlighten me as to the point of having a data center in Florida? Wouldn't you want hardware like this in a climate that's naturally as cool as possible?

    11. Re:Do we accept this... by alta · · Score: 1

      I like that version :)

      --
      Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
    12. Re:Do we accept this... by alta · · Score: 1

      All I can say is, thanks for learning what's at the site, you just earned me a quarter. Have a nice day Portland.

      --
      Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
  11. Hate by olddotter · · Score: 1

    Hate it when my DNA doesn't resolve.

    Sorry, I know it's just a typo, but it's funny to me.

    1. Re:Hate by Locke2005 · · Score: 2, Funny

      I don't have a problem with DNA not resolving.
      I have a problem with getting it out of the sheets.

      --
      I've abandoned my search for truth; now I'm just looking for some useful delusions.
  12. welcome to the cloud by Anonymous Coward · · Score: 0

    With both stormy and sunny days.

  13. Good choice on the article you're linking to... by Anonymous Coward · · Score: 0

    I don't know which is more awesome - that this article came up just as I was wondering what happened to Wikipedia, or that the post links to an article which I CAN'T READ BECAUSE WIKIPEDIA IS DOWN.

  14. Didn't know DNA could cause an outage by jmdevince · · Score: 0, Redundant

    When did we start using DNA to resolve domain names? I mean we can fit a butt-load of information in a DNA strand but I think the overhead would be too high for DNS resolutions. (Should be DNS)

    1. Re:Didn't know DNA could cause an outage by wizardforce · · Score: 1

      Just to give an idea of just how vast DNA's information storage is, the average human cell contains about as much information as most DVDs can store. So hypothetically, if you could reliably transport DNA like we do electrons on the internet, the bandwidth would be enormous (1 gram DNA can store ~10^21 bits) although lag might be a problem unless you can route these DNA packets at relativistic velocities.
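
Those figures can be checked with back-of-the-envelope arithmetic. A rough sketch, assuming 2 bits per nucleotide, an average mass of about 330 g/mol per nucleotide of single-stranded DNA, roughly 3.1 billion base pairs per haploid human genome, and a 4.7 GB single-layer DVD (all round textbook values, none of them from the post):

```python
AVOGADRO = 6.022e23            # molecules per mole
NUCLEOTIDE_G_PER_MOL = 330.0   # assumed average mass of one nucleotide
BITS_PER_NUCLEOTIDE = 2        # four bases -> log2(4) bits
HAPLOID_BASE_PAIRS = 3.1e9     # assumed size of one copy of the human genome
DVD_BYTES = 4.7e9              # single-layer DVD capacity

# Bits that fit in one gram of single-stranded DNA.
nucleotides_per_gram = AVOGADRO / NUCLEOTIDE_G_PER_MOL
bits_per_gram = nucleotides_per_gram * BITS_PER_NUCLEOTIDE

# Raw sequence in a diploid cell (two genome copies), in bytes.
cell_bytes = 2 * HAPLOID_BASE_PAIRS * BITS_PER_NUCLEOTIDE / 8

print(f"bits per gram: {bits_per_gram:.1e}")
print(f"bytes per cell: {cell_bytes:.2e}")
print(f"fraction of a DVD: {cell_bytes / DVD_BYTES:.2f}")
```

Under these assumptions a gram holds on the order of 10^21 bits and one cell holds about a third of a DVD, so both claims are right to within an order of magnitude.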

      --
      Sigs are too short to say anything truly profound so read the above post instead.
    2. Re:Didn't know DNA could cause an outage by Andy+Dodd · · Score: 1

      I'm assuming that material containing large amounts of DNA gummed up a cooling fan, causing the overheating. :)

      --
      retrorocket.o not found, launch anyway?
    3. Re:Didn't know DNA could cause an outage by Anonymous Coward · · Score: 0

      Hey, they really love their servers!

  15. DNA resolution failure??? by Anonymous Coward · · Score: 0

    >>due to DNA resolution failure. ...Also known as mutation...or X-men

  16. wait what my dna is being resolved? by Anonymous Coward · · Score: 0

    Can't we simply hack that by modifying gens.conf?

  17. rndc flush by ls671 · · Score: 2, Funny

    I noticed wikipedia wasn't resolving this morning.

    Flushing my "DNA" cache fixed it ;-))

    rndc flush

    --
    Everything I write is lies, read between the lines.
    1. Re:rndc flush by ls671 · · Score: 1

      I will add that this is a good thing this article was posted. It caused me to stop investigating the possibilities of somebody hacking into my "DNA". ;-))

      --
      Everything I write is lies, read between the lines.
    2. Re:rndc flush by Aladrin · · Score: 1

      That is disgusting. :D

      --
      "If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
    3. Re:rndc flush by Sigma+7 · · Score: 1

      Flushing my "DNA" cache fixed it ;-))

      Not for everyone, since some ISPs cache DNS lookup results.

    4. Re:rndc flush by ls671 · · Score: 1

      > Not for everyone, since some ISPs cache DNS lookup results.

      It should have been obvious that you needed admin access to your own "DNA" in order for this fix to work... ;-))

      Also your ISP must not intercept your "DNA" queries (redirecting deoxyribonucleic acid #53 to their own DNA)

      --
      Everything I write is lies, read between the lines.
    5. Re:rndc flush by icebraining · · Score: 1

      Why, are you forced to use your ISP's DNS servers? here.

    6. Re:rndc flush by compro01 · · Score: 1

      I prefer level3's DNS servers (4.2.2.1-4.2.2.4). I've heard rumours of them planning to block public access to them, but never heard anything more about it. Works great for me.

      --
      upon the advice of my lawyer, i have no sig at this time
    7. Re:rndc flush by ls671 · · Score: 1

      Here is the list of DNS to query when you run your own DNS, as I stated in my OP. You obviously need to run your own DNS in order to be able to flush the DNS cache as I mentioned in my OP ;-)

      This list of root DNS is guaranteed to remain free for public access. These DNS only return pointers to other DNS and are the foundation of how name resolving works on the internet so you are guaranteed to get the correct data as far as it is possible to get it.

      In short, no third party is required to run your own DNS. Some will say this is slower because you have to first populate your cache doing multiple queries but I have never noticed any slowness so I do not care about that. Once your cache is populated it is much faster anyway because you do not have to go to the network at all to resolve a name.

      It is easy and free to set up your own DNS on most OSes and it could be safer because you get the information as accurate as it can get. My DNS process uses about 13 MB of RAM plus the configured cache size, which is very light.

      List of root DNSes, most of those IPs use "anycast addressing and routing to provide resilience and load balancing across a wide geographic area", so you always query a root DNS close to you anyway:
      http://www.internic.net/zones/named.root

      Wikipedia documentation:
      http://en.wikipedia.org/wiki/DNS_root_zone

      I realize that you probably already knew this, I am just posting to clarify for others ;-))
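
The root servers' role described above, returning pointers rather than final answers, can be illustrated with a toy delegation walk. This is a sketch over made-up data, not real DNS traffic: the server names, the `REFERRALS` table and the address are all hypothetical.

```python
# Each "server" maps a zone suffix to either a referral or a final answer.
REFERRALS = {
    "root": {"org.": ("referral", "org-server")},
    "org-server": {"example.org.": ("referral", "example-server")},
    "example-server": {"www.example.org.": ("answer", "192.0.2.80")},
}


def iterate(name, start="root", max_steps=10):
    """Walk referrals from the root until some server gives an answer.

    Returns (address, servers_asked).
    """
    server, asked = start, []
    for _ in range(max_steps):
        asked.append(server)
        # Pick the longest suffix this server knows about.
        matches = [s for s in REFERRALS[server] if name.endswith(s)]
        if not matches:
            raise LookupError(f"{server} knows nothing about {name}")
        kind, value = REFERRALS[server][max(matches, key=len)]
        if kind == "answer":
            return value, asked
        server = value
    raise RuntimeError("referral chain too long")


address, path = iterate("www.example.org.")
```

A caching resolver keeps the referrals it learns along the way, which is why later lookups rarely have to start from the root again.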

      --
      Everything I write is lies, read between the lines.
  18. DNA resolution failure by TheCreeep · · Score: 1

    Guess it resolved to a chimp?

  19. DNA resolution? by gad_zuki! · · Score: 2, Funny

    Whoa, why is the DNS resolving dATP.dGTP.dCTP.dATP?!?

  20. Hour Delay by Reason58 · · Score: 5, Funny

    This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.

    If you don't want to wait an hour for it to update, you can open a command prompt and type "ipconfig /flushdna".

    Please be warned that this may also revert you to some sort of single-celled organism.

    1. Re:Hour Delay by Dancindan84 · · Score: 4, Funny

      I /flushdna all the time. Hasn't had any noticeable effect except clogging my toilet.

      --
      "Always forgive your enemies; nothing annoys them so much." - Oscar Wilde
    2. Re:Hour Delay by shoehornjob · · Score: 1

      LMFAO wish I still had mod points +1 funny.

      --
      "We are just a war away from Amerikastan. When god vs god the undoing of man." Dave Mustaine
    3. Re:Hour Delay by mrdogi · · Score: 1

      OK, I'm somewhat worried now. I was going to make a snarky comment on how I can't seem to find the ipconfig command on my Mac, but it *actually* has one! Mac is following Windows?!?

      At least I'm still safe with not having one on my Solaris boxen...

    4. Re:Hour Delay by Shemmie · · Score: 1

      Not often I laugh out loud at /. - thank you.

    5. Re:Hour Delay by BertieBaggio · · Score: 2, Funny

      I /flushdna all the time. Hasn't had any noticeable effect except clogging my toilet.

      ...?!

      I'd recommend you see a doctor about that.

      --
      If all you have is a grenade, pretty soon every problem looks like a foxhole -- MightyYar
    6. Re:Hour Delay by Hurricane78 · · Score: 1

      I AM a single-celled organism, you insensitive clod!

      And: I’m also your single-celled overlord! So bow to me!
      No! Not to wipe me away with your... sponge...! Please no! Aaaaahhhh!
      *wipe*

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
  21. Overheating by Anonymous Coward · · Score: 0

    Obviously this was caused by Global Warming.

  22. DNA? by MrTripps · · Score: 0, Offtopic

    With DNA resolution problems, apple.com resolves to 64.38.232.180 (oranges.com).

    --
    "I'm not a quack, I'm a mad scientist! There's a difference." - Dr. Cockroach
    1. Re:DNA? by omnichad · · Score: 1

      Nah...it points to orange.co.uk - which is a UK mobile phone company that offers the iphone.

  23. FTFA by Rik+Sweeney · · Score: 4, Funny

    We apologize for the inconvenience this has caused.

    [Citation needed]

    1. Re:FTFA by macbuzz01 · · Score: 1

      We apologize [citation needed] for the inconvenience [citation needed] this has caused [citation needed].

    2. Re:FTFA by kenj0418 · · Score: 1

      I think they should have at least put up a page that had, in large, friendly letters "Don't Panic".

  24. Actually an RNA problem by Anonymous Coward · · Score: 0

    RNA actually translates the name into an IP address.

    You could read up on this at http://en.wikipedia.org/wiki/Translation_(genetics) if wikipedia's ribosomes weren't down right now.

  25. Oops by girlintraining · · Score: 3, Insightful

    You see guys, this is why you regularly test your backup plans and failovers. This is equivalent to building maintenance making sure the fire extinguishers aren't expired... it's basic to IT. Unfortunately, Wikipedia just reminded us that what's basic isn't always what's remembered. Someone just lost their job.

    --
    #fuckbeta #iamslashdot #dicemustdie
    1. Re:Oops by cryfreedomlove · · Score: 2, Insightful

      I doubt anyone lost their job over this. What is the real cost of a 1 hour global outage for wikipedia if it only occurs once per year?

    2. Re:Oops by Anonymous Coward · · Score: 1, Insightful

      Wikipedia does not profit from traffic so they actually save money for every hour the site is down. Looks like someone just got promoted!

    3. Re:Oops by ArundelCastle · · Score: 1

      Wikipedia just reminded us that what's basic isn't always what's remembered.

      TFA quote did say it was a standard procedure. Seems like an accurate description leading to the common SNAFU, or "Administrivia" if you prefer. It's the weird shit you're always checking on.
      Building maintenance is an interesting comparison to use. Every year I see plenty of elevator licenses and fire extinguisher tags in many, many buildings that are expired.

    4. Re:Oops by Anonymous Coward · · Score: 0

      For every hour? Really? With that logic they should just keep it down 24/7 then.

    5. Re:Oops by geniice · · Score: 1

      Since wikimedia's server admins have long since been divided into two departments known as wing and prayer they can probably avoid any job losses by blaming each other.

    6. Re:Oops by Arthur+Grumbine · · Score: 2, Insightful

      For every hour? Really? With that logic they should just keep it down 24/7 then.

      Only when combined with the premise that profit is a goal for them. Which it's not.

      --
      Now that I think about it, I'm pretty sure everything I just said is completely wrong.
    7. Re:Oops by VTEX · · Score: 2, Insightful

      Someone just lost their job.

      I highly doubt someone lost their job over this - and they shouldn't. There are no perfect systems out there, period. Given Wikipedia is a not for profit corporation, they very likely have limited resources and the IT staff does the best with what they have. Even with a virtually unlimited amount of resources things can still go wrong in a "Perfect Storm".

      If anything, the System Administrators should be commended for their quick actions to get the site back up and running as soon as they did.

    8. Re:Oops by Yvanhoe · · Score: 1

      Someone does an awesome job at having a failover procedure for such an incredible non-profit project. And for resuming access within one hour. For heaven's sake, they don't even make money keeping the biggest encyclopedia of all History online, give them a break !

      Come on wikipedia, fix this, but rest assured that we all love you !

      --
      The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
    9. Re:Oops by Anonymous Coward · · Score: 0

      This is equivalent to building maintenance making sure the fire extinguishers aren't expired...

      It's as if they were making sure that the fire extinguishers' paint wasn't faded or peeling, rather than checking the mechanism and pressure.
      Can't have rotten looking fire extinguishers now, can we? Nobody would want to pick them up if there was an actual fire, they might get all icky and covered in paint shards.

    10. Re:Oops by Anonymous Coward · · Score: 0

      Expired buildings?

    11. Re:Oops by u38cg · · Score: 1

      Since donations spike after an outage, they profit from downtime :p

      --
      [FUCK BETA]
    12. Re:Oops by tlhIngan · · Score: 1

      I doubt anyone lost their job over this. What is the real cost of a 1 hour global outage for wikipedia if it only occurs once per year?

      Having to deal with the students who couldn't crib their report off Wikipedia an hour before it was due?

      (Yes, I'm joking. But I suppose we should continue this thread with other fun things we couldn't do with Wikipedia... like make bets about something on Wikipedia - only having edited the article in your favor minutes before).

    13. Re:Oops by Anonymous Coward · · Score: 0

      Survival and dominance is though. Not for profit organisations still need money to run.

    14. Re:Oops by BitZtream · · Score: 1

      Yea, the problem is people tend to 'regularly test' during the work day in my experience which results in the exact same event happening anyway.

      It generally only happens once, either by accident or during testing, and gets fixed. Unless you're going to do ALL your testing during off hours, which is really hard to define for a global operation, then any test that fails is just the same as a failure during non-test conditions.

      Testing for no reason other than testing is not always the brightest of ideas, contrary to what you've been told.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    15. Re:Oops by FunPika · · Score: 1

      Yes, and to boot the site being down for 1 hour will give people incentive to donate thinking it'll stop it from happening again. :D

      --
      After years of not using a signature, I am going to make one to say the following: Fuck Beta
    16. Re:Oops by David+Gerard · · Score: 1

      This is literally true! But we swear we don't take the site down deliberately ;-)

      --
      http://rocknerd.co.uk
  26. Only to Be Expected by PingPongBoy · · Score: 1

    Nothing to see here. Overheating was normal behavior after I updated the Pr0n article.

    --
    Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.
    1. Re:Only to Be Expected by ls671 · · Score: 1

      Are you Polish ? I thought the movement was abolished in 1989...

      Pron:

      http://en.wikipedia.org/wiki/Patriotyczny_Ruch_Odrodzenia_Narodowego

      --
      Everything I write is lies, read between the lines.
  27. I just assumed DNS was reverted... by Anonymous Coward · · Score: 1, Funny

    ...to the old setting by some Admin who edited it the last time, and who would be damned if he let anyone else get in the last word.

  28. genetic level? by Anonymous Coward · · Score: 0

    DNA resolution failure? sounds serious.

  29. Serves them right by Anonymous Coward · · Score: 0

    Wikipedia admins need to get out of their basement anyway.

  30. makes sense by nomadic · · Score: 1

    Judging by traveling through Europe in the summer, they've never discovered "air conditioning."

  31. Uptime is dumb by Anonymous Coward · · Score: 0

    Guess what Wikipedia, you are a free service. You could be up 10 hours a day or 24, existing is sufficient. Damn demanding internayz morons.

  32. Edited? by DarkKnightRadick · · Score: 2, Insightful

    Well, looks like all the DNA jokes are now -1 off topic

    Well played /., well played.

    --
    "There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)
  33. I saw some issues with wiktionary... by Qubit · · Score: 1

    But when I got to the wiktionary.org main page I didn't see any kind of note or warning.

    Couldn't they have at least put up some kind of warning box, hopefully with a list of IP addresses underneath so that one could directly access the services when in dire need?

    .
    .
    .
    .
    .

    (I'm not really sure what constitutes "dire need" of wikimedia services, but I'm sure someone can come up with a list of relevant circumstances)

    --

    coding is life /* the rest is */
    1. Re:I saw some issues with wiktionary... by PPH · · Score: 1

      I'm not really sure what constitutes "dire need" of wikimedia services, but I'm sure someone can come up with a list of relevant circumstances

      You could look up 'Dire Need' on Wiki..... oh, never mind.

      --
      Have gnu, will travel.
  34. Run both systems live at half capacity by Colin+Smith · · Score: 2, Interesting

    active/passive systems are a pain in the arse. The whole concept of testing failover in an active/passive situation is wrong. Anything which relies on human beings doing this and that and that and that is a bad solution.

    Just run active/active and load balance over both sites. If one fails its tests, you just pull it.
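
    A minimal sketch of the "if one fails its tests, you just pull it" idea, assuming each site exposes some health check. The `ActiveActivePool` class and the site names are hypothetical and only illustrate the technique; this is not how Wikimedia's setup works.

```python
from itertools import cycle
from typing import Callable, Dict


class ActiveActivePool:
    """Round-robin over all sites, skipping any site whose health check fails."""

    def __init__(self, checks: Dict[str, Callable[[], bool]]) -> None:
        # checks maps a (hypothetical) site name to a function returning True when healthy
        self.checks = checks
        self._order = cycle(list(checks))

    def pick(self) -> str:
        # Try each site at most once per call, so a fully failed pool raises
        # instead of looping forever.
        for _ in range(len(self.checks)):
            site = next(self._order)
            if self.checks[site]():
                return site
        raise RuntimeError("no healthy site available")


# Usage: both sites serve traffic until one fails its check.
status = {"amsterdam": True, "florida": True}
pool = ActiveActivePool({name: (lambda name=name: status[name]) for name in status})
first_two = [pool.pick(), pool.pick()]   # alternates between both sites
status["amsterdam"] = False              # simulated overheating
after_failure = [pool.pick(), pool.pick()]  # only the surviving site is returned
```

    Because both sites take live traffic all the time, the "failover" path is the normal path and gets exercised on every request.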

     

    --
    Deleted
    1. Re:Run both systems live at half capacity by rmm4pi8 · · Score: 1

      For systems that can be stateless, this is always the best approach. master-master replication with conflict resolution isn't always that easy, however, especially when you think about something like the way wikipedia edits can potentially interact. So developing a conflict resolution scheme can be extraordinarily expensive, and MySQL isn't the most stable in multi-master anyway. Thus while you're right in principle, the expense can be prohibitive.

      --
      U.S. War Crimes blog. Email for free Mandriva support.
    2. Re:Run both systems live at half capacity by GNUALMAFUERTE · · Score: 2, Informative

      Yes, I agree. But the main issue with that paradigm is that many times the expense of one of your locations (and the quality of that location) is substantially lower than the other.

      Example: I run servers in the US, Brazil and Argentina. The US server has better, cheaper bandwidth than the other two. Also, since these are VoIP servers, sometimes the services I send the calls to are in the US anyway, so even if the call goes originally to Argentina's POP, I'm still forwarding it to some IP in the US anyway.

      So, in that case, I want the Arg/Brazil locations for other traffic (that's why they are there), and for local connectivity, but balancing our main traffic there makes no sense from any point of view. So, I only failover to those servers when I have an issue in our main location.

      Sometimes, you have many resources you can use in emergencies, but you don't want to use them when the main location is clearly cheaper and better.

      --
      WTF am I doing replying to an AC at 5 A.M on a Friday night?
    3. Re:Run both systems live at half capacity by xaxa · · Score: 1

      Ping [Amsterdam wikimedia cluster]: 30ms
      Ping [Florida wikimedia cluster]: 130ms

      That's from London. It's obviously better if I normally access the Amsterdam site.

    4. Re:Run both systems live at half capacity by Colin+Smith · · Score: 1

      powerdns geo backend.

      Which they're already using.... Which means it looks like the problem may be more related to automation of the testing of the sites and the subsequent automatic (vs manual) pulling of a site from the dns when it fails.
       

      --
      Deleted
  35. Administrative problem by PPH · · Score: 1

    They couldn't get to the Wiki page about failover testing.

    --
    Have gnu, will travel.
  36. From hot to hotter by LoRdTAW · · Score: 1

    From the Summary:
    "Due to an overheating problem in our European data center many of our servers turned off to protect themselves"
    "we were forced to move all user traffic to our Florida cluster"

    I think Wikipedia needs to build some data centers further north.

    1. Re:From hot to hotter by Anonymous Coward · · Score: 0

      Maybe they should have spent some money on their data center instead of hiring Danese Cooper as CTO.

  37. Deleted? by Grishnakh · · Score: 4, Funny

    I thought maybe they had simply deleted Wikipedia because some admin decided nothing on there was "notable".

  38. I disagree. by Colin+Smith · · Score: 1

    You build your systems to be fault tolerant. They automatically continue with half the components missing. Automatically disable those which fail the continually running tests.

    Build your backup tests into daily procedures. i.e. don't copy/scp files to other locations/servers/sites, restore them to the other location. Autorestore DB backups to the staging/test/dev/reporting systems daily.

    Computers are there to do stuff automatically. Getting human beings to do them is prone to failure.
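
    A rough sketch of "don't copy, restore" as an automated daily check. It uses SQLite purely so the example is self-contained; the in-memory target stands in for a staging server, and `restore_matches_source` is a hypothetical helper name.

```python
import sqlite3


def restore_matches_source(source: sqlite3.Connection) -> bool:
    """Restore a backup of `source` into a fresh database and verify it by reading it back."""
    restored = sqlite3.connect(":memory:")  # stand-in for the staging/test system
    try:
        source.backup(restored)  # the backup is proven by actually restoring it
        # Compare full SQL dumps, so schema and every row must survive the round trip.
        return list(source.iterdump()) == list(restored.iterdump())
    finally:
        restored.close()


# Usage: run this from a scheduled job and alert when it returns False.
production = sqlite3.connect(":memory:")
production.execute("CREATE TABLE pages (title TEXT, body TEXT)")
production.execute("INSERT INTO pages VALUES ('DNS', 'Domain Name System')")
production.commit()
backup_ok = restore_matches_source(production)
```

    The point is that a backup only counts once something has read the restored copy, and that no human has to remember to do it.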
     

    --
    Deleted
    1. Re:I disagree. by RAMMS+EIN · · Score: 1

      You make some very good points in your post.

      At the end of it all comes the realization that planning for crisis is complicated, and getting it right is hard. It's also something that every organization I have ever worked with has underestimated considerably. From what little information I have about this incident with Wikimedia (I noticed nothing, myself), they did considerably better than average.

      But you are right: the right approach is not to prepare for contingency, but to make recovery part of the normal flow. If failure and recovery are part of the everyday routine, you will know what things are broken before disaster strikes, and when it does, you will be prepared. Nothing will make your organization infallible, but at least you will have procedures, people who know how to execute them, and experience with doing so.

      --
      Please correct me if I got my facts wrong.
  39. Funny by Anonymous Coward · · Score: 0

    I didn't understand some terms in the summary, so I was about to wiki them... *sigh*

  40. backup failure doesn't mean a failure to test by rritterson · · Score: 4, Insightful

    I see lots of comments stating that this would not have happened had admins run regular tests on the failover mechanisms. That seems a poor assumption- if the system happens to fail and then an outage occurs before the next scheduled test, one may not be aware of it.

    We had this problem recently where we were testing our backup generator. Normally, we cut power to the local on-campus substation, which kicks in the generator and activates a failover mechanism, rerouting power. Well, the generator came on no problem but the failover mechanism was broken, so every server in the datacenter spontaneously lost power. Had we known the failover was broken, we would have not done the regular test. However, the last test on the failover (done directly without cutting power), a mere month prior, had shown the failover mechanism was fine.

    Point being, unless you are going to literally continuously test everything, there is still some probability of an unexpected double failure.
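
    To put a rough number on that probability: a back-of-the-envelope sketch, assuming the failover mechanism breaks at a constant random rate and that every test leaves it verified working. The rate and intervals below are made-up illustration values, not measurements from any real data center.

```python
import math


def fraction_of_time_broken(failures_per_year: float, test_interval_years: float) -> float:
    """Average chance that the failover is silently broken when an outage strikes.

    Assumes failures arrive at a constant rate (exponential model) and that the
    outage hits at a uniformly random moment between two tests.
    """
    if failures_per_year <= 0 or test_interval_years <= 0:
        raise ValueError("both arguments must be positive")
    x = failures_per_year * test_interval_years
    # Average of (1 - exp(-rate * t)) over t in [0, interval]; roughly x / 2 for small x.
    return 1.0 + math.expm1(-x) / x


# Hypothetical numbers: one silent failover breakage per year on average.
monthly = fraction_of_time_broken(1.0, 1 / 12)   # about 4%
daily = fraction_of_time_broken(1.0, 1 / 365)    # about 0.14%
```

    More frequent tests shrink the window but never close it, which is the double-failure risk described above.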

    --
    -Ryan
    AUWYHSTOT (Acronyms are Useless When You Have to Spell Them Out Too)
    1. Re:backup failure doesn't mean a failure to test by BitZtream · · Score: 1

      As you pointed out, testing can (and in my experience with data center failures is usually) be the cause of a failure.

      The only time I've ever had an 'outage' in a data center, it was during a test cycle. While that's great that it was during a test cycle, it STILL resulted in an outage. Had the tests not been performed, no service disruption would have happened.

      Testing software in a test lab ... you test continuously.

      Testing a production environment ... you do it only when you have a real reason to suspect a possible problem, and only then if you can perform the test in such a way that a failure during the test will be less harmful than at some random time.

      For a global operation, there often isn't a 'less harmful time'.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
  41. Denmark is still without Wikipedia by Anonymous Coward · · Score: 1, Interesting

    20:47 UTC+1, we are still without Wikipedia probably due to poor DNS propagation

  42. great replies by vxice · · Score: 1

    If only their blog had mod points. All the comments are of the form "still down wherever I am".

    --
    every anarchist is a baffled dictator. Benito_Mussolini
  43. Wikipedia goes off the air... by Anonymous Coward · · Score: 0

    ...and nobody really gave a damn.

    So why are they considered relevant again?

  44. Read that wrong by leamanc · · Score: 1

    Darn, I thought Wikipedia was going to explain today's global outrage.

    --
    :q!
  45. School assignments by bjb_admin · · Score: 1

    How many kids will go to school tomorrow and say they couldn't complete an assignment because Wikipedia is down?

  46. Distributed Wikipedia by RAMMS+EIN · · Score: 2, Interesting

    Speaking of Wikipedia, an idea that has long been in my mind, but that I have never sat down and worked out is distributed hosting of Wikipedia. The idea is that volunteers each contribute some resources (network capacity, storage space, RAM, and CPU cycles) to host and serve part of the content.

    This way, we should be able to reduce the load on the (donation supported) Wikimedia servers, as well as increase the redundancy in the system.

    Is anybody already working on this or are there perhaps even already implementations of this idea?

    --
    Please correct me if I got my facts wrong.
    1. Re:Distributed Wikipedia by u38cg · · Score: 2, Interesting

      Attempts have been made at the general case, but it is a hard problem: how do you ensure fair resource sharing and reliability?

      --
      [FUCK BETA]
    2. Re:Distributed Wikipedia by BitZtream · · Score: 4, Insightful

      It's hard enough keeping a bunch of nodes that you control online and functioning properly (hence the failure) ... trying to run anything reliable when you give any control you had to other random people on the Internet is doomed to fail.

      The only reason distributed computing projects like SETI@HOME and distributed.net work is because the server gives clients data to process but it doesn't need a quick response, nor does it have to trust that the data returned is actually valid ... it's going to have another host check it at some point anyway to be sure. Those clients are used to weight the data so the master server only processes the most likely packets that may match and need authoritative checking.

      Doing that for a web server would be ... well, a complete and total waste of resources, as it's likely to be worse in every single way, including reliability.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    3. Re:Distributed Wikipedia by Anonymous Coward · · Score: 0

      So, freenet without the anonymity? ;)

    4. Re:Distributed Wikipedia by David+Gerard · · Score: 1

      It's really hard keeping the databases distributed. Basically, all WMF wikis are served from three large database clusters in Florida. The parallelisation is having those large DB servers feed lots and lots of Apaches (which run PHP and render the pages into HTML) and worldwide Squids for reverse proxying.

      Wikileaks was mooting plans for a distributed MediaWiki backend - they have serious need for such a thing - but they haven't managed it either.

      There are perennial experimental projects to put something like git as the back end instead of MySQL, but none of these are in any usable condition as yet.

      --
      http://rocknerd.co.uk
  47. China? by zorro-z · · Score: 1

    And here, I thought that the Great Firewall of China had been blocking access to politically-charged Websites again.

    --
    -Z
  48. My bad! Sorry... by citab · · Score: 1

    I was looking up curry recipes ... the really hot ones.

    Man, I never thought my tapeworm would cause a global outage.

  49. Global "Outrage"? by sxedog · · Score: 2, Funny

    I started reading the article and wondered why there was such global outrage about dns resolution on Wikipedia, then I went back and looked at the title again...

    --
    If it ain't broke, DON'T fix it.
  50. Ha by mikazo · · Score: 1

    I happened to be looking up "DNS Failover" on Wikipedia at the time of the DNS failure

    --
    I was only 28,931 registrations away from having a 6-digit UID
  51. More donations = More uptime by saibot834 · · Score: 2, Informative

    Wikimedia is terribly understaffed. They have about 35 employees for one of the five largest sites on the Internet (and that includes legal/finance/MediaWiki devs/etc. staff). Basically the site is run by a dozen guys. Compare that to any other Top 10 site; this is just crazy.

    Given their limited resources (both human and financial), it is amazing that Wikipedia is down so rarely. If you want the site to be more reliable, there is something you can do: Donate to the Wikimedia Foundation

    1. Re:More donations = More uptime by troll8901 · · Score: 1

      Well said both GP and PP. Someone kindly mod them up.

      I donated a meager US$20, and I have not a single word to criticize Wikimedia. I'm just grateful that it exists to serve me, and has been serving me for FREE for many years.

  52. Overheating?!? by Anonymous Coward · · Score: 0

    Overheating?!? Must be Global Warming - ahem - [man made] Climate Change!

  53. Jokes aside, how do you feel when you lost Wiki? by porky_pig_jr · · Score: 2, Interesting

    I was rather pissed. And the only thing I was going to do was look up a few math terms. Ended up using PlanetMath and a few other sites, but when Wiki came back, I checked them as well, and guess what: they had the most comprehensive and informative articles. That's the first outage I remember since I started using Wiki.

  54. Re:Jokes aside, how do you feel when you lost Wiki by owlnation · · Score: 1, Troll

    It was as though a great calm came over the Force. Like the dawning of a new age, one based on freedom and facts. One where people were free to write articles without fear of deletion and condemnation. Or edit articles without fear of biased reversion, or banishment.

    Suddenly people saw the wood from the trees, and realized there was a whole Internet out there with truth and beauty in it, where jack-booted book-burners were not only not in control, but not welcome either.

    And then they brought wikipedia back up again; the black flags flew, and the click of heels was once again heard around the net.

    Jokes aside. Indeed, I wish I were joking. There is a lot of pure evil at the heart of Wikipedia.

  55. content-addressed content by jonaskoelker · · Score: 1

    trying to run anything reliable when you give any control you had to other random people on the Internet is doomed to fail.

    I've heard a talk from someone who suggested moving to content-addressing: instead of giving you a URL, I give you a sha1 hash of the page you want (and maybe a URL to tell you where to start looking). Then, you don't care from where you get your data, as long as it matches. You can grab the page from the originating host, or from a local cache, or from a bunch of different peers, or from... well, you name it. As long as you get the bits that match the hash, you're happy.

    I think the idea is (1) good; (2) pie in the sky; and (3) applicable here.
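
    A small sketch of what the content-addressed lookup could look like. The `fetch_by_hash` function and the list of sources are hypothetical; SHA-1 is used only because it is the hash named above, and a stronger hash such as SHA-256 would be the usual choice today.

```python
import hashlib
from typing import Callable, Iterable, Optional


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def fetch_by_hash(wanted: str, sources: Iterable[Callable[[], Optional[bytes]]]) -> bytes:
    """Try each source in turn and accept the first reply whose SHA-1 matches."""
    for fetch in sources:
        data = fetch()
        # An unreachable source returns None; a tampered or stale copy fails the hash check.
        if data is not None and sha1_hex(data) == wanted:
            return data
    raise LookupError("no source returned content matching " + wanted)


# Usage: the origin is down and one peer lies, but an honest mirror still satisfies the request.
page = b"<html>DNS failover</html>"
wanted_hash = sha1_hex(page)
sources = [
    lambda: None,                      # originating host unreachable
    lambda: b"<html>defaced</html>",   # untrusted peer with wrong content
    lambda: page,                      # local cache or honest peer
]
result = fetch_by_hash(wanted_hash, sources)
```

    The requester never has to trust a source, only the hash, which is what answers the "random people on the Internet" objection quoted above.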

    1. Re:content-addressed content by David+Gerard · · Score: 1

      That's why a git backend is such a tempting idea. Leading to many abortive attempts to write such a thing.

      --
      http://rocknerd.co.uk
  56. must be nice to have a job where you can f-up by Anonymous Coward · · Score: 0

    that bad and instead of having to apologize, you just have your buddies brag about how smart and competent you are and how well paid.

    us little people down here in meatspace... when we f-up that bad, we get this thing called 'fired'.

    1. Re:must be nice to have a job where you can f-up by X0563511 · · Score: 1

      Generally there are lots of other little things you do wrong on the road to that firing.

      Too bad for you. Do your job right and you tend to not get axed at the next convenient excuse.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
  57. Re:Jokes aside, how do you feel when you lost Wiki by David+Gerard · · Score: 1

    You realise of course that pretty much all of Wikipedia is multiply mirrored ... answers.com, Google cache, Bing ...

    --
    http://rocknerd.co.uk