Power Outage Takes Wikimedia Down

This is why you don't turn Google down by Anonymous Coward · 2005-02-21 14:29 · Score: 5, Funny

They'll turn the lights off.

Coincidence... ;) by Faust7 · 2005-02-21 14:29 · Score: 5, Funny

Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers.

"You see, guys? This is what could happen if we ever ran out of money. Now cough up some dough!"

--
The coolest voice ever.

Re:Coincidence... ;) by xsupergr0verx · 2005-02-21 14:32 · Score: 5, Funny

So... slashdot the offsite backup?

--

Click here for a free picture of an iPod!
Re:Coincidence... ;) by daveo0331 · 2005-02-21 14:40 · Score: 5, Interesting

On the other hand, subjecting the donation page to the Slashdot effect seems like a great way to reach the fundraising goal in no time. Assuming of course the page itself stays up.

Seriously though, if you like wikipedia, consider donating, even if it's just 5 bucks. I think it's even tax deductible if you itemize.

--
Remember the days when Republicans were the party of fiscal responsibility?
Re:Coincidence... ;) by Raul654 · 2005-02-21 14:41 · Score: 5, Informative

I was just in freenode joking with Jimbo about this. He said he was wondering how long it would be before slashdot ran a story about it (2 hours) and asked people to please stop with the conspiracy theories. Meanwhile, the devs are working fairly furiously to get it back up (Kate hasn't slept in 27 hours. Jimbo just declared Feb 22 to be Kate-day) (--A wikipedia admin.)

--

To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
Re:Coincidence... ;) by fredrikj · 2005-02-21 22:21 · Score: 4, Interesting

On the other hand, subjecting the donation page to the Slashdot effect seems like a great way to reach the fundraising goal in no time. Assuming of course the page itself stays up.

You do know that Wikipedia receives something like 100 times the traffic Slashdot does, right?

What Happened. by Anonymous Coward · 2005-02-21 14:29 · Score: 5, Informative

What happened?
At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers.

What's wrong?
After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.

The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state. Attempting to bring up the master database and one of the slaves immediately after the downtime showed corruption in parts of the database. We're currently running full backups of the raw data on two other database slave servers prior to attempting recovery on them (recovery alters the data).

If these machines also can't be recovered, we may have to restore from backup and replay log files which could take a while.
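
As a rough illustration of "restore from backup and replay log files", here is a minimal sketch. It assumes a toy key/value store and a hypothetical log of (sequence number, key, value) entries; it is not how MySQL's own backup or log tooling works.

```python
from typing import Dict, List, Tuple

# Hypothetical log entry format: (sequence number, key, new value)
LogEntry = Tuple[int, str, str]

def restore(snapshot: Dict[str, str], snapshot_seq: int,
            log: List[LogEntry]) -> Dict[str, str]:
    """Start from the backup, then reapply every logged update newer than it."""
    db = dict(snapshot)
    for seq, key, value in sorted(log):
        if seq <= snapshot_seq:
            continue  # already contained in the backup
        db[key] = value
    return db

backup = {"Main_Page": "rev 10"}
log = [(10, "Main_Page", "rev 10"),
       (11, "Main_Page", "rev 11"),
       (12, "Xenu", "rev 1")]
restored = restore(backup, 10, log)
print(restored)
```

The replay time grows with the number of log entries written since the backup, which is presumably why this path "could take a while".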

Re:What Happened. by Anonymous Coward · 2005-02-21 15:46 · Score: 5, Funny

Real datacenters don't have PeeCees.

Oh, maybe one, out at the guard's desk.

News Update by Anonymous Coward · 2005-02-21 14:31 · Score: 5, Funny

After returning from the power outage, the servers have just been slash-fried.

Another indictment of MySql by Anonymous Coward · 2005-02-21 14:34 · Score: 5, Insightful

Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state

Ya know, I just don't understand why so many projects with such high visibility and requirements for reliability use a toy database like MySQL.

Someone PLEASE tell me why. Because right now the only thing I can think is that people just don't know how to pronounce "Postgres".

Re:Another indictment of MySql by ergo98 · 2005-02-21 14:51 · Score: 5, Interesting

No database can guarantee data integrity in the case of a power failure.

Barring a couple of extreme exceptions, of course a modern database system should protect integrity in the case of a power failure, or any other sudden system failure (kernel panic, GPF, whatever). In the case of the much maligned SQL Server, you can hit the power button all you want mid-transaction and you're going to get a blister on your finger before the database is corrupted.
Re:Another indictment of MySql by Anonymous Coward · 2005-02-21 14:55 · Score: 5, Insightful

No database can guarantee data integrity in the case of a power failure

This is false. SQL Server 2000 (yeah, I know, instant mod-down) has a transaction log and so does Oracle and I'm sure every other half-decent database. ALL committed transactions are preserved and the data is in a consistent state.

MySQL does not have this and the developers don't seem to care much about it. This is the problem with open-source in general, if someone is just doing it for fun they aren't going to spend any time on the stuff they don't care about personally.
Re:Another indictment of MySql by sploo22 · 2005-02-21 14:58 · Score: 5, Informative

No database can guarantee data integrity in the case of a power failure.

Think again. Techniques to do this have been around for years -- it's called stable storage. You just keep redundant copies of data that's changing, and use a neat and simple procedure to ensure that either they both get updated by a transaction, or the original data can be recovered. Certainly the most recent data might be lost, but there's no reason for the database to be corrupted or even in an inconsistent state.
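
A minimal in-memory sketch of the stable storage idea, under stated assumptions: two copies of a block, each prefixed with a CRC32, written one after the other. The class and function names are made up for illustration, and the "crash" is simulated by leaving the first copy half-written.

```python
import zlib
from typing import Optional

def seal(data: bytes) -> bytes:
    """Prefix the payload with a CRC32 so a torn write can be detected."""
    return zlib.crc32(data).to_bytes(4, "big") + data

def unseal(block: bytes) -> Optional[bytes]:
    """Return the payload, or None if the checksum does not match."""
    if len(block) < 4:
        return None
    crc, data = block[:4], block[4:]
    return data if zlib.crc32(data).to_bytes(4, "big") == crc else None

class StableCell:
    """Hypothetical stable-storage cell holding two redundant copies."""

    def __init__(self, initial: bytes):
        self.copy_a = seal(initial)
        self.copy_b = seal(initial)

    def write(self, data: bytes, crash_after_bytes: Optional[int] = None) -> None:
        block = seal(data)
        if crash_after_bytes is not None:
            # Simulated power loss part-way through writing the first copy.
            self.copy_a = block[:crash_after_bytes] + self.copy_a[crash_after_bytes:]
            return
        self.copy_a = block  # first copy written completely...
        self.copy_b = block  # ...and only then the second copy

    def recover(self) -> Optional[bytes]:
        a, b = unseal(self.copy_a), unseal(self.copy_b)
        if a is None:
            # Copy A is torn: fall back to the older, intact copy B.
            self.copy_a = self.copy_b
            return b
        # Copy A is intact: bring B up to date in case it is stale or torn.
        self.copy_b = self.copy_a
        return a

cell = StableCell(b"old value")
cell.write(b"new value", crash_after_bytes=6)
after_crash = cell.recover()
print(after_crash)
```

In this sketch the interrupted update is lost but the cell falls back to the previous value, which matches the point above: the most recent data may be gone, yet the stored state stays consistent.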

--
Karma: Segmentation fault (tried to dereference a null post)
Re:Another indictment of MySql by imroy · 2005-02-21 15:15 · Score: 5, Informative

I just love stupid trolls that can't even use Google.

Tsearch2 - full text extension for PostgreSQL
DevX: Implementing Full Text Indexing with PostgreSQL - about Tsearch2.

Tsearch2 is included in the postgresql-contrib package of at least Debian and Novell/SuSE. Is that "out of the box" enough for a clueless MySQL user?
Re:Another indictment of MySql by Tough+Love · 2005-02-21 17:35 · Score: 4, Insightful

Since at least one of our MySQL database servers has so far restarted successfully with all InnoDB data intact, perhaps you'd care to reconsider your assessment that MySQL is incapable of doing what it just did?

But one didn't. That's a much more informative data point.

--
When all you have is a hammer, every problem starts to look like a thumb.

mysql bad at disaster recovery? by bdigit · 2005-02-21 14:34 · Score: 4, Interesting

This is not a troll or a flame at all but between this and the livejournal servers, it sure sounds like hell if your mysql servers ever go down unexpectedly.

Is mysql the only dbase like this or does postgres get corrupted as well during unplanned downtime? If I recall from using MSSQL servers, we never had a problem like this. We would simply reboot the servers and not worry about tables being left in unrecoverable states. Please correct me if I am wrong though.

Is there any way around this or will this always be a problem with mysql?

Re:mysql bad at disaster recovery? by YU+Nicks+NE+Way · 2005-02-21 14:46 · Score: 4, Insightful

There's a simple way around this: stick to PostgreSQL, MSSQL, Oracle, DB/2, or some other real database. MySQL doesn't make the grade, precisely because things like this can happen.
Re:mysql bad at disaster recovery? by ctr2sprt · 2005-02-21 14:50 · Score: 5, Interesting

We have a similar problem at work. There we don't endure database corruption, we just get broken replication. It appears to be working, but it actually isn't. So we have to take the master offline (actually just acquire a write lock on the DB, it can still answer SELECTs), tar up its (massive) database, scp it to the slaves, start the master, stop the slaves, untar the database, restart the slaves, and restart replication. The entire process can take several hours and it's easy to make mistakes. We put stickers on our MySQL servers saying "DO NOT REBOOT WITHOUT CONTACTING OPS MANAGEMENT," though unfortunately faulty DIMMs are illiterate.
I don't know if PostgreSQL has similar problems, but I very much doubt that Oracle or DB2 do. I know that improved failover support has been a target of the PSQL developers for a little while now, so while it may not be on par with Oracle and DB2 it's probably closer than MySQL. At least for now.
I wish this had prompted management to consider alternatives to MySQL, at least for our mission-critical database servers, but unfortunately it hasn't. They don't even see that we could sell an enterprise-level RDBMS as a significant feature - we're a webhosting company - and charge through the nose for it. Oh well. They don't listen to peons like me, they just make me fix MySQL replication every two weeks.
Re:mysql bad at disaster recovery? by ctr2sprt · 2005-02-21 15:48 · Score: 4, Interesting

Mysql can handle reboots well.
No. It can't. We have two concrete examples in this very page - one provided by Wikimedia, one provided by me - which directly contradict your statement. Maybe under some circumstances MySQL can handle reboots, but it's been proven already that it can't always do so. Perhaps your MySQL experience is not with high-load applications (at least not the level of load Wikimedia and my employer see).
BTW, write a damn script. Mysql was written for unix, unix thrives on scripts. If you can't handle writing a script, why the hell are you a DB admin?
Because the process doesn't lend itself well to scripting. For example, MySQL automatically releases locks when you close your connection to the DB. Presumably this is to avoid deadlocks and for other good reasons, but it's not trivial to write a script to do that. Also, since this is an important system, we don't like the idea of trusting computers to handle its repair: we want someone knowledgeable monitoring every step in case something doesn't work exactly right. I can of course sit there and watch the script do its thing, but that defeats the purpose of scripting the process in the first place.
Regardless, the difficulty of the task is not the main issue. The main issue is that we are dealing with north of 1GB of data here, and on busy servers on a busy network that means restarting replication takes an hour or longer. So not only is performance reduced by 33% when we take the slaves offline one at a time, performance is reduced further by the traffic of tar/scp in the background. Not to mention that, because we have a lock on the master's DB, you can't even consider the DB cluster fully functional.
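
For what it's worth, the lock-lifetime problem described above (the lock disappears when the connection closes) is the kind of thing a script would have to handle by keeping one session open for the whole copy. A minimal sketch, assuming a stand-in `FakeConnection` class that only records statements; `FLUSH TABLES WITH READ LOCK` and `UNLOCK TABLES` are real MySQL statements, everything else here is hypothetical.

```python
from contextlib import contextmanager

class FakeConnection:
    """Stand-in for a real database connection; it only records statements."""

    def __init__(self):
        self.statements = []
        self.open = True

    def execute(self, sql: str) -> None:
        if not self.open:
            raise RuntimeError("connection closed")
        self.statements.append(sql)

    def close(self) -> None:
        self.open = False

@contextmanager
def master_read_lock(conn):
    """Hold the read lock for as long as the with-block runs."""
    conn.execute("FLUSH TABLES WITH READ LOCK")
    try:
        yield conn
    finally:
        conn.execute("UNLOCK TABLES")

def resync(conn, copy_snapshot) -> None:
    with master_read_lock(conn):
        copy_snapshot()  # e.g. the tar/scp step, run while the session stays open
    conn.close()

events = []
conn = FakeConnection()
resync(conn, lambda: events.append(list(conn.statements)))
print(events, conn.statements)
```

This does nothing about the other objections raised above (wanting a person to watch each step, and the hour-long copy itself); it only shows the lock being held across the copy.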
Re:mysql bad at disaster recovery? by Dachannien · 2005-02-21 16:15 · Score: 4, Funny

So we have to take the master offline (actually just acquire a write lock on the DB, it can still answer SELECTs), tar up its (massive) database, scp it to the slaves, start the master, stop the slaves, untar the database, restart the slaves, and restart replication.

You forgot the part where you have to take the chicken across first, because the fox won't eat the grain if you leave them alone.
Re:mysql bad at disaster recovery? by TheNarrator · 2005-02-21 18:31 · Score: 4, Informative

PostgreSQL is far superior to MySql in its disaster recovery ability, namely WAL (Write Ahead Logging). I've been using PostgreSQL since version 7.0 came out and I've never had it fail to come back up on me after any power outage or reset.
http://www.postgresql.org/docs/8.0/interactive/wal.html
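
A toy sketch of the write-ahead idea, not PostgreSQL's actual implementation: each change is appended to a log and fsynced before it is applied, and startup replays the log, stopping at a torn final record. The `TinyWAL` class and its JSON-lines log format are assumptions made for this example.

```python
import json
import os
import tempfile

class TinyWAL:
    """Hypothetical key/value store that logs every change before applying it."""

    def __init__(self, path: str):
        self.path = path
        self.data = {}
        self._replay()

    def _replay(self) -> None:
        if not os.path.exists(self.path):
            return
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final record: ignore it and stop replaying
                self.data[rec["key"]] = rec["value"]

    def put(self, key, value) -> None:
        rec = json.dumps({"key": key, "value": value})
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(rec + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to make the record durable first
        self.data[key] = value    # only then apply the change

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "wal.log")
    db = TinyWAL(path)
    db.put("a", 1)
    db.put("b", 2)
    with open(path, "a", encoding="utf-8") as f:
        f.write('{"key": "c", "val')  # simulated power loss mid-write
    recovered = TinyWAL(path).data
print(recovered)
```

Note that the scheme depends on the fsync actually reaching stable media; a disk that acknowledges writes it has not committed defeats it.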
Re:mysql bad at disaster recovery? by Jamesday · 2005-02-22 01:41 · Score: 5, Informative

>>Can anyone guess why it's the case?

Easily. See what those saying that MySQL can't do what MySQL does are promoting.:)

LiveJournal found that it had some disk systems which lied about having committed writes. They have a preliminary tool which copies what it's writing to disk to a networked system and then compares the state after power-off and recovery with what the disk system said it had done. They are going to make it available to the community as time allows.
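
The general shape of such a check can be sketched as below. This is not LiveJournal's tool; it is a simulation with a made-up `FakeDisk` class, where the `acknowledged` list stands in for the record kept on the networked system.

```python
class FakeDisk:
    """Simulated disk: 'platter' survives power loss, 'cache' does not."""

    def __init__(self, honest: bool):
        self.honest = honest
        self.platter = []
        self.cache = []

    def write(self, record) -> None:
        self.cache.append(record)

    def sync(self) -> bool:
        # An honest disk flushes its cache before reporting success;
        # a lying one reports success and flushes nothing.
        if self.honest:
            self.platter.extend(self.cache)
            self.cache.clear()
        return True

    def power_cut(self) -> None:
        self.cache.clear()

def run_test(disk: FakeDisk, n_writes: int) -> list:
    acknowledged = []  # stands in for the log kept on the networked witness
    for seq in range(n_writes):
        disk.write(seq)
        if disk.sync():
            acknowledged.append(seq)
    disk.power_cut()
    # Anything the disk acknowledged but cannot produce afterwards was a lie.
    return [seq for seq in acknowledged if seq not in disk.platter]

honest_missing = run_test(FakeDisk(honest=True), 5)
lying_missing = run_test(FakeDisk(honest=False), 5)
print(honest_missing, lying_missing)
```

The important design point is that the witness lives on a different machine, so its record of acknowledged writes survives the power cut being tested.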

I expect we're going to find the same at Wikipedia. Here's a pretty typical error log, this one from the server which was master database server:

050222 5:11:12 InnoDB: Database was not shut down normally.
InnoDB: Starting recovery from log files...
InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 303 1283776146
InnoDB: Doing recovery: scanned up to log sequence number 303 1289018880
InnoDB: Doing recovery: scanned up to log sequence number 303 1294261760
InnoDB: Doing recovery: scanned up to log sequence number 303 1299504640
InnoDB: Doing recovery: scanned up to log sequence number 303 1304747520
InnoDB: Doing recovery: scanned up to log sequence number 303 1309990400
InnoDB: Doing recovery: scanned up to log sequence number 303 1315233280
InnoDB: Doing recovery: scanned up to log sequence number 303 1320476160
InnoDB: Doing recovery: scanned up to log sequence number 303 1325719040
InnoDB: Doing recovery: scanned up to log sequence number 303 1330961920
InnoDB: Doing recovery: scanned up to log sequence number 303 1336204800
InnoDB: Doing recovery: scanned up to log sequence number 303 1341447680
InnoDB: Doing recovery: scanned up to log sequence number 303 1346690560
InnoDB: Doing recovery: scanned up to log sequence number 303 1347688389
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 14 row operations to undo
InnoDB: Trx id counter is 1 935480064
050222 5:11:13 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 8617985.
InnoDB: You may have to recover from a backup.
050222 5:12:20 InnoDB: Page dump in ascii and hex (16384 bytes):

Observe that the database engine went back to its last checkpoint, noticed the partial transaction and undid it and was rolling ahead in the write-ahead log when it encountered a database page which failed its checksum test. That failed checksum test is why I think it's a problem with the disk system lying about what was written. You can get that when a database page spans two drives in a stripe set and one has committed the update while the other hasn't.
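
A rough illustration of that torn-page scenario, assuming a CRC32 trailer as a stand-in for InnoDB's real page checksum (whose algorithm and layout are not shown here). The page is split in two halves to mimic one stripe member committing the update while the other keeps the old contents.

```python
import zlib

PAGE_SIZE = 16384       # same size as the page dump mentioned in the log
HALF = PAGE_SIZE // 2   # assumed stripe boundary in the middle of the page

def make_page(payload: bytes) -> bytes:
    """Pad the payload and append a CRC32 trailer (illustrative layout only)."""
    body = payload.ljust(PAGE_SIZE - 4, b"\0")
    return body + zlib.crc32(body).to_bytes(4, "big")

def page_ok(page: bytes) -> bool:
    body, stored = page[:-4], page[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == stored

old = make_page(b"A" * 12000)
new = make_page(b"B" * 12000)

# Drive 1 committed its half of the new page; drive 2 still holds the old half.
torn = new[:HALF] + old[HALF:]

print(page_ok(old), page_ok(new), page_ok(torn))
```

Both complete versions of the page verify, while the mixed one does not, which is the symptom a checksum test is meant to catch.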

In more typical situations MySQL simply applies the updates and all is well. I've had a server set up to exceed RAM with swap turned off and get killed every ten minutes for hours and recover every time.

Just to be complete:
- The database servers have dual redundant power supplies. TWO breakers at the colo tripped, taking out both.
- The systems are a mix of SCSI and SATA, so no point in arguing about one being lousy. SATA and Linux win if you want a winner: it was a SATA box using Linux RAID 0 which completed full recovery. It wasn't one of the normal servers - it was used for backup and offline report generation.
- Two different disk controller makers, one each for SCSI and SATA.
- Battery backed up write cache on most of the main server disk controllers but the one without the battery backup for the write cache had the same problem (which shouldn't surprise anyone - that one should be expected not to recover well).
- After LJ's experience we were after UPS systems in the racks but hadn't yet checked whether the local fire code allows them. Some don't, for el

More information here... by Anonymous Coward · 2005-02-21 14:40 · Score: 5, Funny

I found this useful information about power outages:
http://www.wikipedia.org/search?/power_outage

Re:More information here... by daveo0331 · 2005-02-21 15:00 · Score: 4, Funny

Unfortunately that site you linked to appears to be slashdotted, or something. Here's a mirror:

http://www.answers.com/topic/power-outage-1

--
Remember the days when Republicans were the party of fiscal responsibility?

Join me, my friends! by mctk · 2005-02-21 14:44 · Score: 4, Funny

A power outage has taken down wikipedia! As a community we must carry the torch!

--
Paul Grosfield - the quicker picker upper.

Aaaaand... by Faust7 · 2005-02-21 14:52 · Score: 4, Funny

Meanwhile, the devs are working fairly furiously to get it back up

Don't worry, we'll take care of your backup servers in the meantime. ;)

--
The coolest voice ever.

ETA for read only service is now 2-4 hours. by Jamesday · 2005-02-21 14:55 · Score: 5, Informative

So far one of our database servers has completed a successful recovery (we're working through them all). On a gigabit link it takes something between 90 minutes and 4 hours to rsync from one to another. As soon as we have two database servers working, we'll be restoring service in read only mode. Likely to be that 90 minutes to 4 hours from now as worst case.

I'll post followups to this post later, as we're closer to being fully recovered.

Re:ETA for read only service is now 2-4 hours. by Jamesday · 2005-02-21 15:46 · Score: 4, Informative

May be longer so I withdraw that time estimate.

lame quotes rule by mrpuffypants · 2005-02-21 15:00 · Score: 4, Funny

it's as though 300,000 people cried out and were suddenly silenced ...

and then somebody diffed the change and made them speak again

Re:Ironic by Jamesday · 2005-02-21 15:33 · Score: 4, Informative

Yes. I wrote that cached page and it's now a bit out of date. IF, and it's not certain, local fire regulations permit the use of UPS systems in the racks we're going to be installing them. Decided on that after LiveJournal's unfortunate experience. But don't yet have them.

Re:They should ask for more... by brion · 2005-02-21 16:01 · Score: 4, Interesting

Our database masters do have dual power supplies. The circuit breakers were tripped on both sides.

--

Chu vi parolas Vikipedion?

Re:Xenu Strikes Again! by MillionthMonkey · 2005-02-21 17:51 · Score: 5, Informative

I find it an interesting coincidence the power outage happened so soon after the Xenu article was featured.

Gee, you just had to mention the X-word! Now this thread won't load for most Scientologists because the keyword filters they were forced to install by their Church will see "Xenu" and block the site. After all the mere sight of the word could cause "pneumonia and death" if you haven't paid the Church of Scientology for the proper preparation.

Wikipedia's Xenu article has an interesting history if you look, as I did the other night when it was featured. Scientologists vandalize it regularly. You're supposed to pay them a half million (or some absurd sum of money) to find out about Xenu. After you find out, you're too embarrassed to admit to anybody that you paid a half million to learn that your problems are caused by bad science fiction, when you could have bought a house in Silicon Valley instead. So they obviously don't want a Wikipedia article giving away their half-million-dollar "trade secret" for free.

One trick I saw was to use HTML entities to spell out insults at the top of the article- like "only an idiot would believe this" or something. In the editor window, the entities weren't rendered and each letter appeared as a hex code.
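
The mechanism is plain HTML character references: the page source holds hex codes, while a browser renders the decoded letters. A small sketch using Python's standard `html` module; the helper name `to_hex_entities` is made up for this example.

```python
import html

def to_hex_entities(text: str) -> str:
    """Encode every character as a hexadecimal character reference."""
    return "".join(f"&#x{ord(ch):x};" for ch in text)

encoded = to_hex_entities("hi")   # what the editor window would show
decoded = html.unescape(encoded)  # what a rendered page would show
print(encoded, decoded)
```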

A more effective attack took a different approach. The vandal in this case changed "Scientologists" to "Muslims", "Scientology" to "Islam", and inserted a boring-sounding sentence at the end of the first paragraph claiming that "Xenu" is another name that Muslims use for "Allah". It completely discouraged you from reading further. If you didn't know better you wouldn't find out how "Allah" distributed the thetans around volcanoes on various planets and blew them up with hydrogen bombs, and how their blown-up spirits cause problems in your personal life today.

This is OT, but what the hell, why not whack a beehive? Additional information on Xenu:
Operation Clambake (Hubbard maintained that humans are descended from clams)
The Xenu leaflet (all about Xenu- this information can save you lots of $$$$$)
The road to Xenu (authored by a woman who got suckered)
The Google cache of Wikipedia's Xenu article is also a must read.

I'm wondering if I'll get a lot of freaks, downmoderations, and hostile AC replies after I post this. After all, that's the kind of thing that Hubbard called "fair game". If it sinks below default visibility I'll repost it again with my karma bonus, so you theta-clear-wannabes out there can save your points for someone else.

:::eyes UPS under table::: by shoemakc · 2005-02-21 18:02 · Score: 4, Funny

:::eyes my UPS::::

::::ponders for a moment::::

:::eyes the serial cable that gracefully shuts down said computer in the event of a power failure::::

:::ponders some more::::

:::eyes the spare UPS sitting in the corner that used to be connected to a database server::::

Hmm, I think i'm almost onto something here, but i just can't seem to nail it down...

-Chris

--
--an unbreakable toy is useful for breaking other toys--

Latest news by saforrest · 2005-02-22 01:27 · Score: 4, Informative

Posted on the mailing list wikipedia-l 32 minutes ago:

From: Brion Vibber
Reply-To: wikipedia-l@wikimedia.org
To: Wikipedia-l, Wikimedia Foundation Mailing List, Wikimedia developers
Date: Tue, 22 Feb 2005 04:47:56 -0800
Subject: Re: [Wikipedia-l] Wiki Problems?

Brion Vibber wrote:
> There was some sort of power failure at the colocation facility. We're
> in the process of rebooting and recovering machines.

The power failure was due to circuit breakers being tripped within the colocation facility; some of our servers have redundant power supplies but *both* circuits failed, causing all our machines and the network switch to unceremoniously shut down.

Whether a problem in MySQL, with our server configurations, or with the hardware (or some combination thereof), most of our database servers managed to glitch the data on disk when they went down. (Yes, we use InnoDB tables. This ain't good enough, apparently.)

The good news: one server maintained a good copy, which we've been copying to the others to get things back on track. We're now serving all wikis read-only.

The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170gb of data to each DB server we have to apply the last day's update logs before we can restore read/write service.

I don't know when exactly we'll have everything editable again, but it should be within 12 hours.
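
For a sense of scale, here is a back-of-the-envelope calculation combining the 170gb figure above with the 90 minutes to 4 hours quoted earlier in the thread for an rsync over a gigabit link. It assumes 1 GB = 10**9 bytes and that the earlier estimate referred to roughly the same amount of data, neither of which the thread confirms.

```python
DATA_BYTES = 170e9        # "170gb", assuming decimal gigabytes
LINK_BITS_PER_S = 1e9     # gigabit link, ignoring protocol overhead

def minutes_at(rate_bytes_per_s: float) -> float:
    """Copy time in minutes at a given sustained throughput."""
    return DATA_BYTES / rate_bytes_per_s / 60

def rate_mb_per_s(minutes: float) -> float:
    """Sustained throughput in MB/s implied by a given copy time."""
    return DATA_BYTES / (minutes * 60) / 1e6

line_rate_minutes = minutes_at(LINK_BITS_PER_S / 8)
fast_case = rate_mb_per_s(90)
slow_case = rate_mb_per_s(240)
print(round(line_rate_minutes, 1), round(fast_case, 1), round(slow_case, 1))
```

Under those assumptions the wire alone would allow the copy in about 23 minutes, while 90 minutes to 4 hours corresponds to roughly 31 down to 12 MB/s, so the bottleneck is more likely the disks or rsync itself than the network.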
