Wikipedia Explains Today's Global Outage
gnujoshua writes "The Wikimedia Tech Blog has a post explaining why many users were unable to reach Wikimedia sites due to DNS resolution failure. The article states, 'Due to an overheating problem in our European data center many of our servers turned off to protect themselves. As this impacted all Wikipedia and other projects access from European users, we were forced to move all user traffic to our Florida cluster, for which we have a standard quick failover procedure in place, that changes our DNS entries. However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.'"
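The "caching effects" mentioned in the post can be illustrated with a rough sketch. It assumes a simplified model in which a resolver keeps whatever answer it last fetched for a fixed TTL; the one-hour TTL is an assumption inferred from the "up to an hour" figure, not a value given in the post, and the function name and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: a one-hour TTL, inferred from "up to an hour" in the post.
ASSUMED_TTL = timedelta(hours=1)


def remaining_wait(cached_at, fixed_at, ttl=ASSUMED_TTL):
    """How long after the fix a resolver keeps serving its cached answer.

    cached_at: when the resolver fetched (and cached) the broken answer.
    fixed_at:  when the authoritative DNS entries were corrected.
    Returns a zero timedelta if the cache entry expired before the fix.
    """
    expires_at = cached_at + ttl
    return max(expires_at - fixed_at, timedelta(0))


# Arbitrary illustrative timestamps, not the real outage times.
fixed = datetime(2000, 1, 1, 12, 0)
early = remaining_wait(fixed - timedelta(minutes=50), fixed)
worst = remaining_wait(fixed, fixed)
print(early, worst)
```

Under this model, a resolver that cached the bad answer just before the fix waits almost the full TTL, which is why recovery is gradual rather than immediate.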
Wikileaks is part of Wikimedia, so it went down too (along with Wikinews, Wikispecies, etc.).
Wikileaks is certainly NOT part of Wikimedia. You can see this at http://wikimediafoundation.org/wiki/Our_projects
I actually wasn't assuming incompetence. The hallmark of many sysadmins is being understaffed, overworked and underpaid, and thus not having the resources to properly test all backup and redundant systems.
Consultants and contractors in system administration get let go if anything like this ever happens. This is why they charge a little bit more.
Whatever happened, it failed. A good lesson for next time. I don't know the exact cause, but it is safe to say there were too many eggs in one basket. Multiple geographically diverse load-balanced DNS servers? Why was there an overheating problem in the first place? Only one air conditioner?
Wikipedia has had a few failures, not all their fault. In 2006 Cogent pulled a block of IP addresses that were leased to Wikipedia.
Wikipedia has a fairly limited budget and has historically accepted the odd few hours of downtime now and again as the natural result of this. The number of such incidents has decreased over the years, though.
Yes, I agree. But the main issue with that paradigm is that, many times, one of your locations is substantially cheaper (and of better quality) than the others.
Example: I run servers in the US, Brazil and Argentina. The US server has better, cheaper bandwidth than the other two. Also, since these are VoIP servers, sometimes the services I send the calls to are in the US, so even if the call originally goes to Argentina's POP, I'm still forwarding it to some IP in the US anyway.
So, in that case, I want the Argentina/Brazil locations for other traffic (that's why they are there) and for local connectivity, but balancing our main traffic there makes no sense from any point of view. So I only fail over to those servers when I have an issue in our main location.
Sometimes, you have many resources you can use in emergencies, but you don't want to use them when the main location is clearly cheaper and better.
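The primary-first policy described above (use the main location whenever it is up, fall back to the others only in an emergency) could be sketched roughly like this. The `Site` class, the priority numbers and the site names are hypothetical, chosen only to illustrate the idea; real health checks and traffic routing are left out.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    priority: int          # lower number = more preferred (hypothetical scheme)
    healthy: bool = True   # would be set by a health check in a real setup


def pick_site(sites):
    """Return the healthy site with the lowest priority number, or None."""
    healthy_sites = [s for s in sites if s.healthy]
    # min() returns the first minimal element, so list order breaks ties.
    return min(healthy_sites, key=lambda s: s.priority, default=None)


# Hypothetical layout: the US is the main location, the others are fallbacks.
sites = [Site("us", 0), Site("argentina", 1), Site("brazil", 1)]
print(pick_site(sites).name)   # main location is used while it is healthy

sites[0].healthy = False       # simulate an outage at the main location
print(pick_site(sites).name)   # traffic fails over to a fallback site
```

This is active-passive failover rather than load balancing: the fallback sites carry main traffic only while the preferred one is marked unhealthy.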
WTF am I doing replying to an AC at 5 A.M on a Friday night?
Wikimedia is terribly understaffed. They have about 35 employees for one of the five largest sites on the Internet (and that includes legal/finance/MediaWiki devs/etc. staff). Basically the site is run by a dozen guys. Compare that to any other top-10 site; this is just crazy.
Given their limited resources (both human and financial), it is amazing that Wikipedia is down so rarely. If you want the site to be more reliable, there is something you can do: donate to the Wikimedia Foundation.