Wikipedia Explains Today's Global Outage
gnujoshua writes "The Wikimedia Tech Blog has a post explaining why many users were unable to reach Wikimedia sites due to DNS resolution failure. The article states, 'Due to an overheating problem in our European data center many of our servers turned off to protect themselves. As this impacted all Wikipedia and other projects access from European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place, that changes our DNS entries. However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.'"
I could see why the failover didn't work... They should try resolving names instead of nucleic acids. :\
However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally.
Good thing Wikimedia pays their System Administrators well enough to test their backup systems.
...as proof of global warming?
Maybe the reverse DNA wasn't set right.
I noticed Wikipedia wasn't resolving this morning.
Flushing my "DNA" cache fixed it ;-))
rndc flush
Everything I write is lies, read between the lines.
I don't have a problem with DNA not resolving.
I have a problem with getting it out of the sheets.
I've abandoned my search for truth; now I'm just looking for some useful delusions.
Whoa, why is the DNS resolving dATP.dGTP.dCTP.dATP?!?
This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.
If you don't want to wait an hour for it to update, you can open a command prompt and type "ipconfig /flushdna".
Please be warned that this may also revert you to some sort of single-celled organism.
We apologize for the inconvenience this has caused.
[Citation needed]
Summation 2
DNA resolution failure
Clearly they thought captchas were too easy to defeat.
Jumpstart the tartan drive.
You see guys, this is why you regularly test your backup plans and failovers. This is equivalent to building maintenance making sure the fire extinguishers aren't expired... it's basic to IT. Unfortunately, Wikipedia just reminded us that what's basic isn't always what's remembered. Someone just lost their job.
#fuckbeta #iamslashdot #dicemustdie
Wikileaks is part of wikimedia, so it went down too (along with wikinews, wikispecies, etc.).
Wikileaks is certainly NOT part of Wikimedia. You can see this at http://wikimediafoundation.org/wiki/Our_projects
Well, looks like all the DNA jokes are now -1 off topic
Well played /., well played.
"There is a way that seems right to a man, but its end is the way of death." Proverbs 16:25 (NKJV)
active/passive systems are a pain in the arse. The whole concept of testing failover in an active/passive situation is wrong. Anything which relies on human beings doing this and that and that and that is a bad solution.
Just run active/active and load balance over both sites. If one fails its tests, you just pull it.
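For what it's worth, the "pull it when it fails its tests" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Site` class, the `healthy_sites` and `pick_site` functions, and the site names are made up for the example, and a real deployment would more likely do this in the load balancer or DNS layer than in application code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Site:
    """A serving site plus a health probe (both hypothetical)."""
    name: str
    health_check: Callable[[], bool]


def healthy_sites(sites: List[Site]) -> List[Site]:
    """Return the sites whose health check currently passes."""
    alive = []
    for site in sites:
        try:
            ok = site.health_check()
        except Exception:
            # A probe that blows up counts as a failed probe.
            ok = False
        if ok:
            alive.append(site)
    return alive


def pick_site(sites: List[Site], request_id: int) -> Site:
    """Round-robin over healthy sites only; failed sites are skipped."""
    alive = healthy_sites(sites)
    if not alive:
        raise RuntimeError("no healthy sites left to serve traffic")
    return alive[request_id % len(alive)]


if __name__ == "__main__":
    sites = [
        Site("europe", lambda: False),   # pretend this one overheated
        Site("florida", lambda: True),
    ]
    for request_id in range(3):
        print(request_id, pick_site(sites, request_id).name)
```

The point of the design is that both sites take real traffic all the time, so there is no rarely exercised switch-over step that can rot unnoticed.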
Deleted
I thought maybe they had simply deleted Wikipedia because some admin decided nothing on there was "notable".
I see lots of comments stating that this would not have happened had admins run regular tests on the failover mechanisms. That seems a poor assumption: if the system happens to fail and then an outage occurs before the next scheduled test, one may not be aware of it.
We had this problem recently where we were testing our backup generator. Normally, we cut power to the local on-campus substation, which kicks in the generator and activates a failover mechanism, rerouting power. Well, the generator came on no problem but the failover mechanism was broken, so every server in the datacenter spontaneously lost power. Had we known the failover was broken, we would not have done the regular test. However, the last test on the failover (done directly without cutting power), a mere month prior, had shown the failover mechanism was fine.
Point being, unless you are going to literally continuously test everything, there is still some probability of an unexpected double failure.
-Ryan
AUWYHSTOT (Acronyms are Useless When You Have to Spell Them Out Too)
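To put a rough number on the double-failure point above, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the failover mechanism is modeled as breaking at a constant rate (exponential, with a mean time between failures `mtbf_days`), a test is assumed to catch and fix any breakage, and the real outage is assumed to strike at a uniformly random moment between two tests. The function name and the example numbers are made up.

```python
import math


def prob_broken_when_needed(test_interval_days: float, mtbf_days: float) -> float:
    """Chance the failover is silently broken when an outage hits.

    Averages 1 - exp(-t / mtbf) over t uniform in [0, test_interval],
    i.e. over the time elapsed since the last successful test.
    """
    if test_interval_days <= 0 or mtbf_days <= 0:
        raise ValueError("both arguments must be positive")
    ratio = test_interval_days / mtbf_days
    return 1.0 - (1.0 - math.exp(-ratio)) / ratio


if __name__ == "__main__":
    # Hypothetical numbers: failover breaks once every ten years on average.
    for interval in (7, 30, 365):
        p = prob_broken_when_needed(interval, mtbf_days=3650)
        print(f"test every {interval:>3} days -> P(broken when needed) = {p:.4f}")
```

Under this model, testing more often shrinks the risk roughly in proportion to the test interval, but it never reaches zero, which is the point being made above.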
Speaking of Wikipedia, an idea that has long been in my mind, but that I have never sat down and worked out, is distributed hosting of Wikipedia. The idea is that volunteers each contribute some resources (network capacity, storage space, RAM, and CPU cycles) to host and serve part of the content.
This way, we should be able to reduce the load on the (donation supported) Wikimedia servers, as well as increase the redundancy in the system.
Is anybody already working on this or are there perhaps even already implementations of this idea?
Please correct me if I got my facts wrong.
I started reading the article and wondered why there was such global outrage about DNS resolution on Wikipedia, then I went back and looked at the title again...
If it ain't broke, DON'T fix it.
Wikimedia is terribly understaffed. They have about 35 employees for one of the five largest sites on the Internet (and that includes legal/finance/MediaWiki devs/etc. staff). Basically the site is run by a dozen guys. Compare that to any other Top 10 site; this is just crazy.
Given their limited resources (both human and financial), it is amazing that Wikipedia is down so rarely. If you want the site to be more reliable, there is something you can do: donate to the Wikimedia Foundation.
I was rather pissed. And the only thing I was going to do was look up a few math terms. Ended up using PlanetMath and a few other sites, but when Wiki came back, I checked there as well, and guess what: they had the most comprehensive and informative articles. That's the first outage I remember since I started using Wiki.