Power Outage Takes Wikimedia Down
Baricom writes "Just a few weeks after a major power outage took out well-known blogging service LiveJournal for several hours, almost all of Wikimedia Foundation's services are offline due to a tripped circuit breaker at a different colo. Among other services, Wikimedia runs the well-known Wikipedia open encyclopedia. Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers. They have established an off-site backup of the fundraising page here until power returns."
They'll turn the lights off.
Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers.
"You see, guys? This is what could happen if we ever ran out of money. Now cough up some dough!"
The coolest voice ever.
What happened?
At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers.
What's wrong?
After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.
The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state. Attempting to bring up the master database and one of the slaves immediately after the downtime showed corruption in parts of the database. We're currently running full backups of the raw data on two other database slave servers prior to attempting recovery on them (recovery alters the data).
If these machines also can't be recovered, we may have to restore from backup and replay log files, which could take a while.
After returning from the power outage, the servers have just been slash-fried.
Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state
Ya know, I just don't understand why so many projects with such high visibility and requirements for reliability use a toy database like MySQL.
Someone PLEASE tell me why. Because right now the only thing I can think is that people just don't know how to pronounce "Postgres".
I found this useful information about power outages:
http://www.wikipedia.org/search?/power_outage
I don't know if PostgreSQL has similar problems, but I very much doubt that Oracle or DB2 do. I know that improved failover support has been a target of the PSQL developers for a little while now, so while it may not be on par with Oracle and DB2 it's probably closer than MySQL. At least for now.
I wish this had prompted management to consider alternatives to MySQL, at least for our mission-critical database servers, but unfortunately it hasn't. They don't even see that we could sell an enterprise-level RDBMS as a significant feature - we're a webhosting company - and charge through the nose for it. Oh well. They don't listen to peons like me, they just make me fix MySQL replication every two weeks.
So far one of our database servers has completed a successful recovery (we're working through them all). On a gigabit link it takes something between 90 minutes and 4 hours to rsync from one to another. As soon as we have two database servers working, we'll be restoring service in read-only mode. Worst case, that is likely to be 90 minutes to 4 hours from now.
I'll post followups to this post later, as we're closer to being fully recovered.
I find it an interesting coincidence that the power outage happened so soon after the Xenu article was featured.
Gee, you just had to mention the X-word! Now this thread won't load for most Scientologists because the keyword filters they were forced to install by their Church will see "Xenu" and block the site. After all the mere sight of the word could cause "pneumonia and death" if you haven't paid the Church of Scientology for the proper preparation.
Wikipedia's Xenu article has an interesting history if you look, as I did the other night when it was featured. Scientologists vandalize it regularly. You're supposed to pay them a half million (or some absurd sum of money) to find out about Xenu. After you find out, you're too embarrassed to admit to anybody that you paid a half million to learn that your problems are caused by bad science fiction, when you could have bought a house in Silicon Valley instead. So they obviously don't want a Wikipedia article giving away their half-million-dollar "trade secret" for free.
One trick I saw was to use HTML entities to spell out insults at the top of the article- like "only an idiot would believe this" or something. In the editor window, the entities weren't rendered and each letter appeared as a hex code.
A more effective attack took a different approach. The vandal in this case changed "Scientologists" to "Muslims", "Scientology" to "Islam", and inserted a boring-sounding sentence at the end of the first paragraph claiming that "Xenu" is another name that Muslims use for "Allah". It completely discouraged you from reading further. If you didn't know better you wouldn't find out how "Allah" distributed the thetans around volcanoes on various planets and blew them up with hydrogen bombs, and how their blown-up spirits cause problems in your personal life today.
This is OT, but what the hell, why not whack a beehive? Additional information on Xenu:
Operation Clambake (Hubbard maintained that humans are descended from clams)
The Xenu leaflet (all about Xenu- this information can save you lots of $$$$$)
The road to Xenu (authored by a woman who got suckered)
The Google cache of Wikipedia's Xenu article is also a must read.
I'm wondering if I'll get a lot of freaks, downmoderations, and hostile AC replies after I post this. After all, that's the kind of thing that Hubbard called "fair game". If it sinks below default visibility I'll repost it again with my karma bonus, so you theta-clear-wannabes out there can save your points for someone else.
Easily. See what those saying that MySQL can't do what MySQL does are promoting. :)
LiveJournal found that it had some disk systems which lied about having committed writes. They have a preliminary tool which copies what it's writing to disk to a networked system and then, after a power-off and recovery, compares the state on disk to what the disk system said it had written. They are going to make it available to the community as time allows.
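For anyone wondering what that kind of check boils down to, here is a minimal sketch in Python. It is not LiveJournal's tool, whose details aren't given here. It assumes a hypothetical setup where every write the disk acknowledged is also logged on another machine under a sequence number, and `find_lost_writes` is a made-up name for the comparison done after the power cut.

```python
def find_lost_writes(acknowledged, recovered):
    """Return the sequence numbers of writes that the disk acknowledged
    as durable but that are missing or different after recovery.

    acknowledged: dict of sequence number -> bytes, as logged on the
                  remote machine at the moment the disk reported success.
    recovered:    dict of sequence number -> bytes, as read back from the
                  disk after the power cut and reboot.
    """
    return sorted(
        seq
        for seq, payload in acknowledged.items()
        if recovered.get(seq) != payload
    )


# Hypothetical run: the disk acknowledged three writes, but only the
# first two are actually there after the power cut.
remote_log = {1: b"alpha", 2: b"beta", 3: b"gamma"}
after_reboot = {1: b"alpha", 2: b"beta"}
lost = find_lost_writes(remote_log, after_reboot)
```

If `lost` comes back non-empty, something in the disk system reported a write as committed before it really was on stable storage.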
I expect we're going to find the same at Wikipedia. Here's a pretty typical error log, this one from the server which was master database server:
050222 5:11:12 InnoDB: Database was not shut down normally.
InnoDB: Starting recovery from log files...
InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 303 1283776146
InnoDB: Doing recovery: scanned up to log sequence number 303 1289018880
InnoDB: Doing recovery: scanned up to log sequence number 303 1294261760
InnoDB: Doing recovery: scanned up to log sequence number 303 1299504640
InnoDB: Doing recovery: scanned up to log sequence number 303 1304747520
InnoDB: Doing recovery: scanned up to log sequence number 303 1309990400
InnoDB: Doing recovery: scanned up to log sequence number 303 1315233280
InnoDB: Doing recovery: scanned up to log sequence number 303 1320476160
InnoDB: Doing recovery: scanned up to log sequence number 303 1325719040
InnoDB: Doing recovery: scanned up to log sequence number 303 1330961920
InnoDB: Doing recovery: scanned up to log sequence number 303 1336204800
InnoDB: Doing recovery: scanned up to log sequence number 303 1341447680
InnoDB: Doing recovery: scanned up to log sequence number 303 1346690560
InnoDB: Doing recovery: scanned up to log sequence number 303 1347688389
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 14 row operations to undo
InnoDB: Trx id counter is 1 935480064
050222 5:11:13 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 8617985.
InnoDB: You may have to recover from a backup.
050222 5:12:20 InnoDB: Page dump in ascii and hex (16384 bytes):
Observe that the database engine went back to its last checkpoint, noticed the partial transaction and undid it and was rolling ahead in the write-ahead log when it encountered a database page which failed its checksum test. That failed checksum test is why I think it's a problem with the disk system lying about what was written. You can get that when a database page spans two drives in a stripe set and one has committed the update while the other hasn't.
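A rough illustration of that torn-page case, in Python. This is only a sketch under stated assumptions: it uses CRC-32 from `zlib` and a made-up page layout (4-byte checksum followed by the page body), not InnoDB's real page format or checksum algorithm. The only number taken from the log is the 16384-byte page size. `make_page`, `page_is_valid` and `torn_write` are hypothetical helper names.

```python
import zlib

PAGE_SIZE = 16384      # page size reported in the log's page dump line
CHECKSUM_BYTES = 4     # assumed layout: checksum first, then the body


def make_page(payload: bytes) -> bytes:
    """Build a page: CRC-32 of the zero-padded body, then the body."""
    body_size = PAGE_SIZE - CHECKSUM_BYTES
    if len(payload) > body_size:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(body_size, b"\x00")
    return zlib.crc32(body).to_bytes(CHECKSUM_BYTES, "big") + body


def page_is_valid(page: bytes) -> bool:
    """Check that the stored checksum matches the page body."""
    if len(page) != PAGE_SIZE:
        return False
    stored = int.from_bytes(page[:CHECKSUM_BYTES], "big")
    return stored == zlib.crc32(page[CHECKSUM_BYTES:])


def torn_write(old_page: bytes, new_page: bytes, split: int = PAGE_SIZE // 2) -> bytes:
    """Simulate a page striped across two drives at `split`: the first
    drive committed the new data, the second still holds the old data."""
    return new_page[:split] + old_page[split:]


# The two versions differ only in the part stored on the second drive.
old_page = make_page(b"x" * 10000 + b"old")
new_page = make_page(b"x" * 10000 + b"new")
torn_page = torn_write(old_page, new_page)

print(page_is_valid(old_page))   # True
print(page_is_valid(new_page))   # True
print(page_is_valid(torn_page))  # False
```

The torn page carries the new checksum with the old body, so the check fails even though each drive on its own holds a perfectly good half.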
In more typical situations MySQL simply applies the updates and all is well. I've had a server set up to exceed RAM with swap turned off and get killed every ten minutes for hours and recover every time.
Just to be complete: