Wikipedia Explains Today's Global Outage

Posted by timothy on Wednesday March 24, 2010 @06:25AM from the citation-you-requested dept.

gnujoshua writes "The Wikimedia Tech Blog has a post explaining why many users were unable to reach Wikimedia sites due to DNS resolution failure. The article states, 'Due to an overheating problem in our European data center many of our servers turned off to protect themselves. As this impacted all Wikipedia and other projects access from European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place, that changes our DNS entries. However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.'"
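
The "up to an hour" delay mentioned in the summary is what one would expect from DNS caching: a resolver keeps an answer until its time-to-live (TTL) expires, so a bad answer can outlive the fix. The sketch below is a toy in-memory cache, not a real resolver. The 3600-second TTL, the `BROKEN` placeholder answer, the name `example.org`, and the address `192.0.2.10` are all assumptions made for illustration; the summary only says "up to an hour".

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    address: str
    fetched_at: float  # time the answer was cached, in seconds
    ttl: float         # how long the answer may be reused, in seconds


class ToyResolverCache:
    """Hypothetical caching resolver backed by a dict acting as the zone."""

    def __init__(self, authoritative):
        self.authoritative = authoritative  # name -> (address, ttl)
        self.cache = {}

    def resolve(self, name, now):
        record = self.cache.get(name)
        # Serve from cache while the TTL has not yet expired.
        if record is not None and now < record.fetched_at + record.ttl:
            return record.address
        # Otherwise ask the authoritative source and cache the answer.
        address, ttl = self.authoritative[name]
        self.cache[name] = CachedRecord(address, now, ttl)
        return address


zone = {"example.org": ("BROKEN", 3600)}          # assumed broken answer
cache = ToyResolverCache(zone)
first = cache.resolve("example.org", now=0)       # caches the broken answer
zone["example.org"] = ("192.0.2.10", 3600)        # operators fix the record
during = cache.resolve("example.org", now=1800)   # cache still holds the old answer
after = cache.resolve("example.org", now=3600)    # TTL expired, answer re-fetched
```

In this toy model, flushing the cache (as `rndc flush` does on a BIND resolver) amounts to emptying `cache.cache`, which forces a fresh lookup without waiting out the TTL.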

DNA DNS? by Anonymous Coward · 2010-03-24 06:28 · Score: 2, Funny

I could see why the failover didn't work... They should try resolving names instead of nucleic acids. :\
Test, and Test Again by Jazz-Masta · 2010-03-24 06:29 · Score: 3, Insightful

However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally.
Good thing Wikimedia pays their System Administrators well enough to test their backup systems.
1. Re:Test, and Test Again by X0563511 · 2010-03-24 06:47 · Score: 3, Insightful
  
  I know people who work in the Florida DC. They do, and they are smart people. Don't assume incompetence.
  
  --
  For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
2. Re:Test, and Test Again by cryfreedomlove · 2010-03-24 06:55 · Score: 2, Insightful
  
  You say test and test again. I say that this is true only when the cost of an outage outweighs the cost of testing. What does this one hour, once per year really cost wikipedia?
3. Re:Test, and Test Again by commodore64_love · 2010-03-24 06:59 · Score: 2, Insightful
  
  Free media publicity.
  
  --
  "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
4. Re:Test, and Test Again by Dahamma · 2010-03-24 07:01 · Score: 2, Insightful
  
  True, and the cost was probably fairly minor, as they are not advertising based... so only the cost of any people so pissed off with the downtime that they refuse to donate :)
5. Re:Test, and Test Again by Jazz-Masta · 2010-03-24 07:03 · Score: 2, Informative
  
  I actually wasn't assuming incompetence. Many SysAdmins are understaffed, overworked and underpaid, and thus do not have the resources to properly test all backup and redundant systems.
  Consultants and contractors in the area of System Administration get let go if anything like this ever happens. This is why they charge a little bit more.
  Whatever happened, it failed. A good lesson for next time. I don't know the exact cause, but it is safe to say there were too many eggs in one basket. Multiple geographically diverse load-balanced DNS servers? Why was there an overheating problem in the first place? Only one air conditioner?
  Wikipedia has had a few failures, not all their fault. In 2006 Cogent pulled a block of IP addresses that were leased to Wikipedia.
6. Re:Test, and Test Again by geniice · 2010-03-24 07:23 · Score: 4, Interesting
  
  Going by past statistics, the cost of downtime to Wikipedia tends to be negative, since donations rise. Not that this is something Wikimedia aims to do.
7. Re:Test, and Test Again by geniice · 2010-03-24 07:38 · Score: 2, Informative
  
  Wikipedia has a fairly limited budget and has historically accepted the odd few hours of downtime now and again as the natural result of this. The number of such incidents has fallen over the years though.
8. Re:Test, and Test Again by FuckingNickName · 2010-03-24 09:51 · Score: 2, Insightful
  
  Wow. For someone who probably uses the service and doesn't pay for it, you're sure griping a lot.
  0. For someone who is going off on a rant based on a reasoned assumption, you sure aren't setting off on the right foot by starting with the unjustified assumption that the poster uses Wikipedia;
  1. You don't have to pay for or be a net consumer of something in order to criticise it - all you have to do is provide a reasonable explanation for the criticism. The alternative, that only the paying consumer should have a voice, is irrational and harmful;
  2. All this said, maybe the poster has donated time and/or money to Wikipedia - you do realise it's produced by thousands of (sometimes even well-meaning) volunteers, right?
  
  They don't serve ads (well, except to solicit funds to keep their servers up and running),
  So they don't, except when they do. At least regular adverts give you the opportunity to learn about some product. Huge banners telling you that the Child in Africa will die from not knowing all the Pokemon characters if you don't donate are quite pathetic.
  
  When you pay for an SLA with Wikipedia (signed by someone with the authority to make such an agreement) then you have the right to throw rude accusations around.
  I know America has such a macho culture that it's considered life-destroying to receive public criticism, but it's actually useful to be told that you're incompetent when you're incompetent. It's the first step to finding out where you've demonstrated incompetence, which is the precursor to fixing (i) your approach; (ii) the problem. Brushing the truth under the carpet by sounding the "I/Wikipedia admins have the right not to be offended!" klaxon solves nothing.
Do we accept this... by Al's+Hat · 2010-03-24 06:30 · Score: 4, Funny

...as proof of global warming?
Re:Wow! by suso · 2010-03-24 06:32 · Score: 2

Maybe the reverse DNA wasn't set right.
rndc flush by ls671 · 2010-03-24 06:33 · Score: 2, Funny

I noticed wikipedia wasn't resolving this morning.
Flushing my "DNA" cache fixed it ;-))
rndc flush

--
Everything I write is lies, read between the lines.
Re:Hate by Locke2005 · 2010-03-24 06:33 · Score: 2, Funny

I don't have a problem with DNA not resolving.
I have a problem with getting it out of the sheets.

--
I've abandoned my search for truth; now I'm just looking for some useful delusions.
DNA resolution? by gad_zuki! · 2010-03-24 06:34 · Score: 2, Funny

Whoa, why is the DNS resolving dATP.dGTP.dCTP.dATP?!?
Hour Delay by Reason58 · 2010-03-24 06:34 · Score: 5, Funny

This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.
If you don't want to wait an hour for it to update, you can open a command prompt and type "ipconfig /flushdna".

Please be warned that this may also revert you to some sort of single-celled organism.
1. Re:Hour Delay by Dancindan84 · 2010-03-24 06:39 · Score: 4, Funny
  
  I /flushdna all the time. Hasn't had any noticeable effect except clogging my toilet.
  
  --
  "Always forgive your enemies; nothing annoys them so much." - Oscar Wilde
2. Re:Hour Delay by BertieBaggio · 2010-03-24 11:17 · Score: 2, Funny
  
  I /flushdna all the time. Hasn't had any noticeable effect except clogging my toilet.
  
  ...?!
  I'd recommend you see a doctor about that.
  
  --
  If all you have is a grenade, pretty soon every problem looks like a foxhole -- MightyYar
FTFA by Rik+Sweeney · 2010-03-24 06:36 · Score: 4, Funny

We apologize for the inconvenience this has caused.
[Citation needed]

--
Summation 2
Re:Wow! by Midnight+Thunder · 2010-03-24 06:36 · Score: 3, Funny

DNA resolution failure
Clearly they thought captchas were too easy to defeat.

--
Jumpstart the tartan drive.
Oops by girlintraining · 2010-03-24 06:41 · Score: 3, Insightful

You see guys, this is why you regularly test your backup plans and failovers. This is equivalent to building maintenance making sure the fire extinguishers aren't expired... it's basic to IT. Unfortunately, Wikipedia just reminded us that what's basic isn't always what's remembered. Someone just lost their job.

--
#fuckbeta #iamslashdot #dicemustdie
1. Re:Oops by cryfreedomlove · 2010-03-24 06:50 · Score: 2, Insightful
  
  I doubt anyone lost their job over this. What is the real cost of a 1 hour global outage for wikipedia if it only occurs once per year?
2. Re:Oops by Arthur+Grumbine · 2010-03-24 07:39 · Score: 2, Insightful
  
  For every hour? Really? With that logic they should just keep it down 24/7 then.
  Only when combined with the premise that profit is a goal for them. Which it's not.
  
  --
  Now that I think about it, I'm pretty sure everything I just said is completely wrong.
3. Re:Oops by VTEX · 2010-03-24 07:50 · Score: 2, Insightful
  
  Someone just lost their job.
  I highly doubt someone lost their job over this - and they shouldn't. There are no perfect systems out there, period. Given Wikipedia is a not-for-profit corporation, they very likely have limited resources and the IT staff does the best with what they have. Even with a virtually unlimited amount of resources things can still go wrong in a "Perfect Storm".
  
  If anything, the System Administrators should be commended for their quick actions to get the site back up and running as soon as they did.
Re:Rumor was.. by Anonymous Coward · 2010-03-24 06:59 · Score: 2, Informative

Wikileaks is part of wikimedia, so it went down too (along with wikinews, wikispecies, etc.).
Wikileaks is certainly NOT part of Wikimedia. You can see such at http://wikimediafoundation.org/wiki/Our_projects
Edited? by DarkKnightRadick · 2010-03-24 07:05 · Score: 2, Insightful

Well, looks like all the DNA jokes are now -1 off topic
Well played /., well played.

--
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)
Run both systems live at half capacity by Colin+Smith · 2010-03-24 07:18 · Score: 2, Interesting

active/passive systems are a pain in the arse. The whole concept of testing failover in an active/passive situation is wrong. Anything which relies on human beings doing this and that and that and that is a bad solution.
Just run active/active and load balance over both sites. If one fails its tests, you just pull it.
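
A minimal sketch of the active/active idea described above, assuming two hypothetical sites named `europe` and `florida` and a caller-supplied health check. A real deployment would do this in a load balancer or in DNS rather than in application code:

```python
import itertools


def healthy_sites(sites, is_healthy):
    """Return only the sites that currently pass their health check."""
    return [site for site in sites if is_healthy(site)]


def route_requests(sites, is_healthy, count):
    """Spread `count` requests round-robin over the healthy sites."""
    pool = healthy_sites(sites, is_healthy)
    if not pool:
        raise RuntimeError("no healthy site available")
    rotation = itertools.cycle(pool)
    return [next(rotation) for _ in range(count)]


SITES = ["europe", "florida"]  # hypothetical site names

# Both sites healthy: traffic alternates between them.
normal = route_requests(SITES, lambda site: True, 4)

# One site fails its tests: it is pulled and the other takes everything.
degraded = route_requests(SITES, lambda site: site != "europe", 4)
```

Because both sites carry live traffic all the time, the "failover" path is the same code path that runs every day, which is the point of the argument above.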

--
Deleted
1. Re:Run both systems live at half capacity by GNUALMAFUERTE · 2010-03-24 08:34 · Score: 2, Informative
  
  Yes, I agree. But the main issue with that paradigm is that many times the expense of one of your locations (and the quality of that location) is substantially lower than the other.
  Example: I run servers in the US, Brasil and Argentina. The US server has better, cheaper bandwidth than the other two. Also, since these are VoIP servers, sometimes the services I send the calls to are in the US anyway, so even if the call goes originally to Argentina's POP, I'm still forwarding it to some IP in the US anyway.
  So, in that case, I want the Arg/Brasil locations for other traffic (that's why they are there), and for local connectivity, but balancing our main traffic there makes no sense from any point of view. So, I only fail over to those servers when I have an issue in our main location.
  Sometimes, you have many resources you can use in emergencies, but you don't want to use them when the main location is clearly cheaper and better.
  
  --
  WTF am I doing replying to an AC at 5 A.M on a Friday night?
Deleted? by Grishnakh · 2010-03-24 07:24 · Score: 4, Funny

I thought maybe they had simply deleted Wikipedia because some admin decided nothing on there was "notable".
backup failure doesn't mean a failure to test by rritterson · 2010-03-24 07:49 · Score: 4, Insightful

I see lots of comments stating that this would not have happened had admins run regular tests on the failover mechanisms. That seems a poor assumption: if the system happens to fail and then an outage occurs before the next scheduled test, one may not be aware of it.
We had this problem recently where we were testing our backup generator. Normally, we cut power to the local on-campus substation, which kicks in the generator and activates a failover mechanism, rerouting power. Well, the generator came on no problem but the failover mechanism was broken, so every server in the datacenter spontaneously lost power. Had we known the failover was broken, we would have not done the regular test. However, the last test on the failover (done directly without cutting power), a mere month prior, had shown the failover mechanism was fine.
Point being, unless you are going to literally continuously test everything, there is still some probability of an unexpected double failure.
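
To put a rough number on that residual risk, here is a back-of-the-envelope sketch. It assumes (and these are assumptions, not facts from the incident) that the failover mechanism breaks at a constant random rate with some mean time between failures, that each test finds and fixes any breakage, and that the primary outage strikes at a uniformly random moment between two tests. Under those assumptions the chance that the failover is already broken when it is needed is 1 - (1 - e^(-T/M)) / (T/M), where T is the test interval and M the mean time between failures.

```python
import math


def prob_failover_broken(test_interval_days, mtbf_days):
    """Chance the failover is silently broken at a random moment between tests.

    Assumes exponentially distributed failures and that every test
    restores the mechanism to a working state.
    """
    ratio = test_interval_days / mtbf_days
    return 1.0 - (1.0 - math.exp(-ratio)) / ratio


# Hypothetical numbers: a mechanism that breaks about once every ten years.
monthly = prob_failover_broken(test_interval_days=30, mtbf_days=3650)
daily = prob_failover_broken(test_interval_days=1, mtbf_days=3650)
```

For short test intervals the result is close to T / (2M): testing more often shrinks the risk roughly in proportion, but it never reaches zero.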

--
-Ryan
AUWYHSTOT (Acronyms are Useless When You Have to Spell Them Out Too)
Distributed Wikipedia by RAMMS+EIN · 2010-03-24 07:55 · Score: 2, Interesting

Speaking of Wikipedia, an idea that has long been in my mind, but that I have never sat down and worked out is distributed hosting of Wikipedia. The idea is that volunteers each contribute some resources (network capacity, storage space, RAM, and CPU cycles) to host and serve part of the content.
This way, we should be able to reduce the load on the (donation supported) Wikimedia servers, as well as increase the redundancy in the system.
Is anybody already working on this or are there perhaps even already implementations of this idea?

--
Please correct me if I got my facts wrong.
1. Re:Distributed Wikipedia by u38cg · 2010-03-24 08:36 · Score: 2, Interesting
  
  Attempts have been made at the general case, but it is a hard problem: how do you ensure fair resource sharing and reliability?
  
  --
  [FUCK BETA]
2. Re:Distributed Wikipedia by BitZtream · 2010-03-24 09:38 · Score: 4, Insightful
  
  It's hard enough keeping a bunch of nodes that you control online and functioning properly (hence the failure) ... trying to run anything reliable when you give any control you had to other random people on the Internet is doomed to fail.
  The only reason distributed computing projects like SETI@HOME and distributed.net work is because the server gives clients data to process but it doesn't need a quick response, nor does it have to trust that the data returned is actually valid ... it's going to have another host check it at some point anyway to be sure. Those clients are used to weight the data so the master server only processes the most likely packets that may match and need authoritative checking.
  Doing that for a web server would be ... well, a complete and total waste of resources, as it's likely to be worse in every single way, including reliability.
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Global "Outrage"? by sxedog · 2010-03-24 08:25 · Score: 2, Funny

I started reading the article and wondered why there was such global outrage about dns resolution on Wikipedia, then I went back and looked at the title again...

--
If it ain't broke, DON'T fix it.
More donations = More uptime by saibot834 · 2010-03-24 11:01 · Score: 2, Informative

Wikimedia is terribly understaffed. They have about 35 employees for one of the five largest sites on the Internet (and that includes legal/finance/MediaWiki devs/etc. staff). Basically the site is run by a dozen guys. Compared to any other Top 10 site, this is just crazy.
Given their limited resources (both human and financial), it is amazing that Wikipedia is down so rarely. If you want the site to be more reliable, there is something you can do: Donate to the Wikimedia Foundation
Jokes aside, how do you feel when you lost Wiki? by porky_pig_jr · 2010-03-24 12:16 · Score: 2, Interesting

I was rather pissed. And the only thing I was going to do was look up a few math terms. Ended up using PlanetMath and a few other sites, but when Wiki came back, I checked there as well and guess what: they had the most comprehensive and informative articles. That's the first outage I remember since I started using Wiki.