Wikipedia Explains Today's Global Outage


Posted by timothy on Wednesday March 24, 2010 @06:25AM from the citation-you-requested dept.

gnujoshua writes "The Wikimedia Tech Blog has a post explaining why many users were unable to reach Wikimedia sites due to DNS resolution failure. The article states, 'Due to an overheating problem in our European data center many of our servers turned off to protect themselves. As this impacted all Wikipedia and other projects access from European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place, that changes our DNS entries. However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.'"


Test, and Test Again by Jazz-Masta · 2010-03-24 06:29 · Score: 3, Insightful

However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally.
Good thing Wikimedia pays their System Administrators well enough to test their backup systems.
1. Re:Test, and Test Again by X0563511 · 2010-03-24 06:47 · Score: 3, Insightful
  
  I know people who work in the Florida DC. They do, and they are smart people. Don't assume incompetence.
  
  --
  For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
2. Re:Test, and Test Again by geniice · 2010-03-24 07:23 · Score: 4, Interesting
  
  Going by past statistics, the cost of downtime to Wikipedia tends to be negative, since donations rise. Not that this is something Wikimedia aims to do.
Do we accept this... by Al's+Hat · 2010-03-24 06:30 · Score: 4, Funny

...as proof of global warming?
Hour Delay by Reason58 · 2010-03-24 06:34 · Score: 5, Funny

This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.
If you don't want to wait an hour for it to update, you can open a command prompt and type "ipconfig /flushdna".

Please be warned that this may also revert you to some sort of single-celled organism.
1. Re:Hour Delay by Dancindan84 · 2010-03-24 06:39 · Score: 4, Funny
  
  I /flushdna all the time. Hasn't had any noticeable effect except clogging my toilet.
  
  --
  "Always forgive your enemies; nothing annoys them so much." - Oscar Wilde
FTFA by Rik+Sweeney · 2010-03-24 06:36 · Score: 4, Funny

We apologize for the inconvenience this has caused.
[Citation needed]

--
Summation 2
Re:Wow! by Midnight+Thunder · 2010-03-24 06:36 · Score: 3, Funny

DNA resolution failure
Clearly they thought captchas were too easy to defeat.

--
Jumpstart the tartan drive.
Oops by girlintraining · 2010-03-24 06:41 · Score: 3, Insightful

You see guys, this is why you regularly test your backup plans and failovers. This is equivalent to building maintenance making sure the fire extinguishers aren't expired... it's basic to IT. Unfortunately, Wikipedia just reminded us that what's basic isn't always what's remembered. Someone just lost their job.

--
#fuckbeta #iamslashdot #dicemustdie
Deleted? by Grishnakh · 2010-03-24 07:24 · Score: 4, Funny

I thought maybe they had simply deleted Wikipedia because some admin decided nothing on there was "notable".
backup failure doesn't mean a failure to test by rritterson · 2010-03-24 07:49 · Score: 4, Insightful

I see lots of comments stating that this would not have happened had admins run regular tests on the failover mechanisms. That seems a poor assumption: if the system happens to fail and then an outage occurs before the next scheduled test, one may not be aware of it.
We had this problem recently when we were testing our backup generator. Normally, we cut power to the local on-campus substation, which kicks in the generator and activates a failover mechanism, rerouting power. Well, the generator came on no problem, but the failover mechanism was broken, so every server in the datacenter spontaneously lost power. Had we known the failover was broken, we would not have done the regular test. However, the last test on the failover (done directly without cutting power), a mere month prior, had shown the failover mechanism was fine.
Point being, unless you are going to literally continuously test everything, there is still some probability of an unexpected double failure.
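To put a rough number on that point, here is a minimal sketch. It assumes the failover worked at its last test and then breaks at a constant rate (an exponential model), and that the outage arrives at a random moment before the next test. The function name, the 1000-day mean time between failures, and the test intervals are all hypothetical, not figures from this thread.

```python
import math

def prob_failover_broken(test_interval_days: float, mtbf_days: float) -> float:
    """Chance the failover is already broken when an outage hits at a random
    moment between two tests (exponential failure model, assumed)."""
    ratio = test_interval_days / mtbf_days
    # Average of 1 - exp(-t / mtbf) for t uniform over [0, test_interval].
    return 1.0 - (1.0 - math.exp(-ratio)) / ratio

# Hypothetical numbers: daily, monthly, and yearly testing.
for interval in (1, 30, 365):
    print(interval, round(prob_failover_broken(interval, mtbf_days=1000), 4))
```

Under these assumptions, testing more often shrinks the chance of a silent failure but never brings it to zero.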

--
-Ryan
AUWYHSTOT (Acronyms are Useless When You Have to Spell Them Out Too)
Re:Distributed Wikipedia by BitZtream · 2010-03-24 09:38 · Score: 4, Insightful

It's hard enough keeping a bunch of nodes that you control online and functioning properly (hence the failure) ... trying to run anything reliable when you give any control you had to other random people on the Internet is doomed to fail.
The only reason distributed computing projects like SETI@HOME and distributed.net work is because the server gives clients data to process but it doesn't need a quick response, nor does it have to trust that the data returned is actually valid ... it's going to have another host check it at some point anyway to be sure. Those clients are used to weight the data so the master server only processes the most likely packets that may match and need authoritative checking.
Doing that for a web server would be ... well, a complete and total waste of resources, as it's likely to be worse in every single way, including reliability.

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager