Power Outage Takes Wikimedia Down

This is why you don't turn Google down by Anonymous Coward · 2005-02-21 14:29 · Score: 5, Funny

They'll turn the lights off.

Re:This is why you don't turn Google down by Captain+Nitpick · 2005-02-21 15:41 · Score: 2, Informative

Nobody's turned Google down. There have been no actual proposals to turn down yet.

--
But then again, I could be wrong.
Re:This is why you don't turn Google down by multisync · 2005-02-21 16:42 · Score: 2, Insightful

They should talk about his work and his contribution to American culture. They shouldn't be making fun of him. He deserves better.

If Hunter S. Thompson were still alive, he'd be making fun of himself for killing himself.

--
I don't care why you're posting AC
Re:This is why you don't turn Google down by mdecarle · 2005-02-21 21:14 · Score: 2, Insightful

Must you really know what the money is being spent on?

If you donate money, you are asking them to continue to offer their great service to you and other people. How they achieve that goal, is up to them, no?

You don't ask the Red Cross what they use your money for, do you? The organisation usually tells you afterwards.

Coincidence... ;) by Faust7 · 2005-02-21 14:29 · Score: 5, Funny

Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers.

"You see, guys? This is what could happen if we ever ran out of money. Now cough up some dough!"

--
The coolest voice ever.

Re:Coincidence... ;) by xsupergr0verx · 2005-02-21 14:32 · Score: 5, Funny

So... slashdot the offsite backup?

--

Click here for a free picture of an iPod!
Re:Coincidence... ;) by daveo0331 · 2005-02-21 14:40 · Score: 5, Interesting

On the other hand, subjecting the donation page to the Slashdot effect seems like a great way to reach the fundraising goal in no time. Assuming of course the page itself stays up.

Seriously though, if you like wikipedia, consider donating, even if it's just 5 bucks. I think it's even tax deductible if you itemize.

--
Remember the days when Republicans were the party of fiscal responsibility?
Re:Coincidence... ;) by Raul654 · 2005-02-21 14:41 · Score: 5, Informative

I was just in freenode joking with Jimbo about this. He said he was wondering how long it would be before slashdot ran a story about it (2 hours) and asked people to please stop with the conspiracy theories. Meanwhile, the devs are working fairly furiously to get it back up (Kate hasn't slept in 27 hours. Jimbo just declared Feb 22 to be Kate-day) (--A wikipedia admin.)

--

To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
Re:Coincidence... ;) by Raul654 · 2005-02-21 14:57 · Score: 3, Insightful

No no, but with the google deal looming, the tin-foil-hatters are paying close attention to wikipedia, and every little thing gets overly-scrutinized.

--

To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
Re:Coincidence... ;) by Random+Chaos · 2005-02-21 18:15 · Score: 2, Informative

Well...Slashdot has fully hit:

Temporary fundraising site: "This account has exceeded it's bandwidth quota and has been temporarily disabled."
Re:Coincidence... ;) by fredrikj · 2005-02-21 22:21 · Score: 4, Interesting

On the other hand, subjecting the donation page to the Slashdot effect seems like a great way to reach the fundraising goal in no time. Assuming of course the page itself stays up.

You do know that Wikipedia receives something like 100 times the traffic Slashdot does, right?
Re:Coincidence... ;) by David+Gerard · 2005-02-21 22:50 · Score: 2, Interesting

I can now see why Kate NEVER EVER emerges from her heavily-armed bunker in Oxfordshire.

--
http://rocknerd.co.uk
Re:Coincidence... ;) by DA_Chef · 2005-02-22 02:54 · Score: 3, Interesting

Something like that, yes: Alexa's statistics

What Happened. by Anonymous Coward · 2005-02-21 14:29 · Score: 5, Informative

What happened?
At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers.

What's wrong?
After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.

The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state. Attempting to bring up the master database and one of the slaves immediately after the downtime showed corruption in parts of the database. We're currently running full backups of the raw data on two other database slave servers prior to attempting recovery on them (recovery alters the data).

If these machines also can't be recovered, we may have to restore from backup and replay log files which could take a while.

Re:What Happened. by wakejagr · 2005-02-21 15:18 · Score: 2, Insightful

Kudos to Wikimedia for actually explaining what happened and not just putting a "This page is down, please try again later" message up. Many people/companies/groups/etc would be too proud or too afraid of bad publicity to actually explain the problem.

--
Don't save Windows XP! http://www.petitiononline.com/jjw1xp/petition.html
Re:What Happened. by Anonymous Coward · 2005-02-21 15:33 · Score: 2, Insightful

You do know that in real datacenters you don't have a UPS on each PC, but a UPS for the ROOM, and between this UPS and your servers you are going to need breakers, so if you put too many things on a circuit it may cause problems, as simple as that.
Re:What Happened. by Anonymous Coward · 2005-02-21 15:46 · Score: 5, Funny

Real datacenters don't have PeeCees.

Oh, maybe one, out at the guard's desk.
Re:What Happened. by Leo+McGarry · 2005-02-21 18:16 · Score: 3, Interesting

What constitutes a "real" datacenter.

One that complies with building and safety codes, for starters. In every jurisdiction with which I'm familiar -- admittedly not even close to all of them -- it's actually against the law to have a battery unit inside a data center cage. It's a violation of the safety code. When fire and rescue personnel go into a commercial building, they have to be sure that the power is really off. If there's a battery lying around somewhere, shorting to ground through a desk or door frame for instance, it can cause big problems.

Ask around. I bet you'll find that your data center explicitly forbids customer-installed battery units.
Re:What Happened. by mr_zorg · 2005-02-21 18:48 · Score: 2, Informative

You laugh at this, but it's 100% true.

News Update by Anonymous Coward · 2005-02-21 14:31 · Score: 5, Funny

After returning from the power outage, the servers have just been slash-fried.

They should ask for more... by PornMaster · 2005-02-21 14:33 · Score: 2, Informative

If they bought actual servers with dual power supplies and got power from multiple PDUs at their data center, they would be much better off. If this is really because of a tripped breaker, then it's pretty inexcusable, since dual power supplies fed from separate circuits would have prevented it... unlike the LJ outage which was from the power being cut to all circuits.

But if they're going to cobble together some whitebox crap servers, and not change the architecture, they'll be right back to an outage next time it happens.

--
500GB of disk, 5TB of transfer, $5.95/mo

Re:They should ask for more... by Raul654 · 2005-02-21 14:39 · Score: 2, Insightful

Right, because we all know money grows on trees...

--

To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
Re:They should ask for more... by man_ls · 2005-02-21 15:05 · Score: 3, Insightful

IIRC, that's the Fire Code. The breaker needs to be able to unconditionally kill all power inside the facility. Thus -- it kills the power post-UPS.
Re:They should ask for more... by PornMaster · 2005-02-21 15:08 · Score: 3, Insightful

Sometimes it costs more to do things wrong, in the long term, than to do them right.

--
500GB of disk, 5TB of transfer, $5.95/mo
Re:They should ask for more... by mboverload · 2005-02-21 15:11 · Score: 2, Insightful

Hey man, they have their traffic doubling every 4 months, they NEVER planned for this success this early. Building infrastructure is hard when you never plan for it.
Re:They should ask for more... by Jamesday · 2005-02-21 15:24 · Score: 3, Informative

The database servers have dual redundant supplies and the colo tells us that TWO circuit breakers tripped. Fun. Not. Do try to avoid having the same happen to you - losing both circuits isn't fun.
Re:They should ask for more... by brion · 2005-02-21 16:01 · Score: 4, Interesting

Our database masters do have dual power supplies. The circuit breakers were tripped on both sides.

--
Chu vi parolas Vikipedion?
Re:They should ask for more... by Cramer · 2005-02-21 23:32 · Score: 2, Informative

A word on breakers... first, they aren't fuses. They are magnetically thrown -- pull too much current through it and an electromechanical break is closed releasing the breaker contacts which are pushed/pulled apart by springs. As it's magnetically thrown, tripping one breaker can (and does) trip surrounding breakers. I've seen it happen a number of times -- with brand new breakers, even.

Another indictment of MySql by Anonymous Coward · 2005-02-21 14:34 · Score: 5, Insightful

Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state

Ya know, I just don't understand why so many projects with such high visibility and requirements for reliability use a toy database like MySQL.

Someone PLEASE tell me why. Because right now the only thing I can think is that people just don't know how to pronounce "Postgres".

Re:Another indictment of MySql by ergo98 · 2005-02-21 14:51 · Score: 5, Interesting

No database can guarantee data integrity in the case of a power failure.

Barring a couple of extreme exceptions, of course a modern database system should protect integrity in the case of a power failure, or any other sudden system failure (kernel panic, GPF, whatever). In the case of the much maligned SQL Server, you can hit the power button all you want mid-transaction and you're going to get a blister on your finger before the database is corrupted.
Re:Another indictment of MySql by Anonymous Coward · 2005-02-21 14:55 · Score: 5, Insightful

No database can guarantee data integrity in the case of a power failure

This is false. SQL Server 2000 (yeah, I know, instant mod-down) has a transaction log and so does Oracle and I'm sure every other half-decent database. ALL committed transactions are preserved and the data is in a consistent state.

MySQL does not have this and the developers don't seem to care much about it. This is the problem with open-source in general, if someone is just doing it for fun they aren't going to spend any time on the stuff they don't care about personally.
Re:Another indictment of MySql by sploo22 · 2005-02-21 14:58 · Score: 5, Informative

No database can guarantee data integrity in the case of a power failure.

Think again. Techniques to do this have been around for years -- it's called stable storage. You just keep redundant copies of data that's changing, and use a neat and simple procedure to ensure that either they both get updated by a transaction, or the original data can be recovered. Certainly the most recent data might be lost, but there's no reason for the database to be corrupted or even in an inconsistent state.
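
The two-copy procedure described above can be sketched roughly as follows. This is a toy model, assuming two in-memory tuples stand in for two disk blocks and a CRC32 stands in for whatever integrity check a real system uses; `StableCell`, `write`, `recover`, and the `crash_after` switch are hypothetical names made up for illustration.

```python
import zlib
from typing import Optional, Tuple


class StableCell:
    """Two redundant copies of one record, each stored with a CRC32 checksum."""

    def __init__(self, initial: bytes):
        self.copies = [self._seal(initial), self._seal(initial)]

    @staticmethod
    def _seal(data: bytes) -> Tuple[int, bytes]:
        return (zlib.crc32(data), data)

    @staticmethod
    def _valid(copy: Tuple[int, bytes]) -> bool:
        crc, data = copy
        return zlib.crc32(data) == crc

    def write(self, data: bytes, crash_after: Optional[int] = None) -> None:
        """Update copy 0, then copy 1.

        crash_after simulates power loss (illustrative only):
        0 = copy 0 is torn mid-write, 1 = copy 0 finished, copy 1 untouched.
        """
        if crash_after == 0:
            # Torn write: checksum of the full record, but only half the bytes.
            self.copies[0] = (zlib.crc32(data), data[: len(data) // 2])
            return
        self.copies[0] = self._seal(data)
        if crash_after == 1:
            return
        self.copies[1] = self._seal(data)

    def recover(self) -> bytes:
        """After a crash, make both copies agree on one intact value."""
        first, second = self.copies
        if self._valid(first):
            self.copies[1] = first    # finish the interrupted update
        elif self._valid(second):
            self.copies[0] = second   # fall back to the old value
        else:
            raise RuntimeError("both copies damaged")
        return self.copies[0][1]
```

In this model a crash during the first write loses the newest value but leaves the old one intact, and a crash between the two writes is completed on recovery. A real implementation would also have to force each copy to the disk surface before starting the next one.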

--
Karma: Segmentation fault (tried to dereference a null post)
Re:Another indictment of MySql by imroy · 2005-02-21 15:15 · Score: 5, Informative

I just love stupid trolls that can't even use Google.

Tsearch2 - full text extension for PostgreSQL
DevX: Implementing Full Text Indexing with PostgreSQL - about Tsearch2.

Tsearch2 is included in the postgresql-contrib package of at least Debian and Novell/SuSE. Is that "out of the box" enough for a clueless MySQL user?
Re:Another indictment of MySql by Jamesday · 2005-02-21 15:20 · Score: 3, Informative

Since at least one of our MySQL database servers has so far restarted successfully with all InnoDB data intact, perhaps you'd care to reconsider your assessment that MySQL is incapable of doing what it just did?

For the rest, we'll see as we get to them and, for any that fail, then look to see whether it was the disk controller or the disk drive lying about having the data written to battery backed up RAM or the disk surface.

Wikipedia hasn't suffered a day of downtime yet for this reason and looks to be down for no more than a few hours this time. A previous incident lasting more than a day was a human or three screwing up and having two copies of the server software writing to the same database files without any locking to prevent conflicting updates. The result of that shouldn't surprise anyone.
Re:Another indictment of MySql by Anonymous Coward · 2005-02-21 16:56 · Score: 2, Insightful

I realise since you seem to be involved with wikipedia, you'll be modded up no matter what. However, what you just said makes no logical sense. The grandparent basically said that mysql's transaction support sucks and consequently it can't guarantee db integrity over a power failure. You said that because *one* server came back up with no problems that he should reassess mysql. You could have *all* your servers come back with no problems and it still wouldn't change the grandparent's assessment. You would just be getting lucky.
Re:Another indictment of MySql by Tough+Love · 2005-02-21 17:35 · Score: 4, Insightful

Since at least one of our MySQL database servers has so far restarted successfully with all InnoDB data intact, perhaps you'd care to reconsider your assessment that MySQL is incapable of doing what it just did?

But one didn't. That's a much more informative data point.

--
When all you have is a hammer, every problem starts to look like a thumb.
Re:Another indictment of MySql by Jamesday · 2005-02-22 06:51 · Score: 2, Insightful

Depends on the cause. If the database server software was being lied to by the OS, controller or drives I'm not sure just how much I'm inclined to blame the database server software.

I am inclined to ask the database server vendor to see if they can find ways to protect against it and I've briefly discussed that already.

mysql bad at disaster recovery? by bdigit · 2005-02-21 14:34 · Score: 4, Interesting

This is not a troll or a flame at all but between this and the livejournal servers, it sure sounds like hell if your mysql servers ever go down unexpectedly.

Is mysql the only dbase like this or does postgres get corrupted as well during unplanned downtime? If I recall from using MSSQL servers, we never had a problem like this. We would simply reboot the servers and not worry about tables being left in unrecoverable states. Please correct me if I am wrong though.

Is there any way around this or will this always be a problem with mysql?

Re:mysql bad at disaster recovery? by YU+Nicks+NE+Way · 2005-02-21 14:46 · Score: 4, Insightful

There's a simple way around this: stick to PostgreSQL, MSSQL, Oracle, DB/2, or some other real database. MySQL doesn't make the grade, precisely because things like this can happen.
Re:mysql bad at disaster recovery? by ctr2sprt · 2005-02-21 14:50 · Score: 5, Interesting

We have a similar problem at work. There we don't endure database corruption, we just get broken replication. It appears to be working, but it actually isn't. So we have to take the master offline (actually just acquire a write lock on the DB, it can still answer SELECTs), tar up its (massive) database, scp it to the slaves, start the master, stop the slaves, untar the database, restart the slaves, and restart replication. The entire process can take several hours and it's easy to make mistakes. We put stickers on our MySQL servers saying "DO NOT REBOOT WITHOUT CONTACTING OPS MANAGEMENT," though unfortunately faulty DIMMs are illiterate.
I don't know if PostgreSQL has similar problems, but I very much doubt that Oracle or DB2 do. I know that improved failover support has been a target of the PSQL developers for a little while now, so while it may not be on par with Oracle and DB2 it's probably closer than MySQL. At least for now.
I wish this had prompted management to consider alternatives to MySQL, at least for our mission-critical database servers, but unfortunately it hasn't. They don't even see that we could sell an enterprise-level RDBMS as a significant feature - we're a webhosting company - and charge through the nose for it. Oh well. They don't listen to peons like me, they just make me fix MySQL replication every two weeks.
Re:mysql bad at disaster recovery? by ctr2sprt · 2005-02-21 15:48 · Score: 4, Interesting

Mysql can handle reboots well.
No. It can't. We have two concrete examples in this very page - one provided by Wikimedia, one provided by me - which directly contradict your statement. Maybe under some circumstances MySQL can handle reboots, but it's been proven already that it can't always do so. Perhaps your MySQL experience is not with high-load applications (at least not the level of load Wikimedia and my employer see).
BTW, write a damn script. Mysql was written for unix, unix thrives on scripts. If you can't handle writing a script, why the hell are you a DB admin?
Because the process doesn't lend itself well to scripting. For example, MySQL automatically releases locks when you close your connection to the DB. Presumably this is to avoid deadlocks and for other good reasons, but it's not trivial to write a script to do that. Also, since this is an important system, we don't like the idea of trusting computers to handle its repair: we want someone knowledgeable monitoring every step in case something doesn't work exactly right. I can of course sit there and watch the script do its thing, but that defeats the purpose of scripting the process in the first place.
Regardless, the difficulty of the task is not the main issue. The main issue is that we are dealing with north of 1GB of data here, and on busy servers on a busy network that means restarting replication takes an hour or longer. So not only is performance reduced by 33% when we take the slaves offline one at a time, performance is reduced further by the traffic of tar/scp in the background. Not to mention the fact that, because we have a lock on the master's DB, you can't even consider the DB cluster fully functional.
Re:mysql bad at disaster recovery? by fimbulvetr · 2005-02-21 16:09 · Score: 2, Insightful

I'd rather just agree to disagree on this one, at this point it's all just what we have observed. It heavily depends on the situation, how the db is setup, etc.

As far as the script, yes, it does have locks, and rightly so. It's not terribly tough to write a lock aware script. In my opinion, the replication setup is extremely easy to script. I'd much rather script it than sit in front of the console. Once I see it work, I know it will work every time, and I won't worry about something like me or a peer mistyping the server-id at 4:00am. Even at 20GB, it can't be terribly long at 100Mb/s.

You only need the lock on the master while you're tar'ing the snapshot for distribution to the other servers. Once it's tar'd, unlock master, gzip, redistribute, tar zxvf, setup slave and it will catch up.
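
A lock-aware script mostly comes down to keeping one connection open for the whole snapshot, since the lock goes away when the connection closes. Here is a minimal sketch of that ordering, assuming a stand-in connection object that only records statements; `FakeConnection`, `master_read_lock`, `reseed_slaves`, and the two callbacks are hypothetical names, while `FLUSH TABLES WITH READ LOCK` and `UNLOCK TABLES` are the usual MySQL statements for this.

```python
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for a real database connection; it only records statements."""

    def __init__(self, log):
        self.log = log

    def execute(self, sql):
        self.log.append(sql)

    def close(self):
        self.log.append("CLOSE")


@contextmanager
def master_read_lock(conn):
    """Hold a global read lock for exactly as long as the with-block runs."""
    conn.execute("FLUSH TABLES WITH READ LOCK")
    try:
        yield conn
    finally:
        # Runs even if the snapshot step raises, so the master is never
        # left locked by a failed script.
        conn.execute("UNLOCK TABLES")


def reseed_slaves(conn, take_snapshot, ship_to_slaves):
    """Snapshot under the lock, then distribute after the lock is released."""
    with master_read_lock(conn):
        snapshot = take_snapshot()   # e.g. tar up the data directory
    # Lock released here: the master takes writes again while we copy.
    ship_to_slaves(snapshot)
    conn.close()
```

The same connection is used from lock to unlock, and it is closed only after the snapshot exists. A real script would also need to record the master's replication position while the lock is held, which this sketch leaves out.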
Re:mysql bad at disaster recovery? by Dachannien · 2005-02-21 16:15 · Score: 4, Funny

So we have to take the master offline (actually just acquire a write lock on the DB, it can still answer SELECTs), tar up its (massive) database, scp it to the slaves, start the master, stop the slaves, untar the database, restart the slaves, and restart replication.

You forgot the part where you have to take the chicken across first, because the fox won't eat the grain if you leave them alone.
Re:mysql bad at disaster recovery? by Tough+Love · 2005-02-21 17:22 · Score: 2, Informative

The main issue is that we are dealing with north of 1GB of data here

Nice post. I'd just like to add that Wikipedia deals with north of 170 GB, not counting images.
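
For scale, the ideal copy times for the sizes mentioned in this thread can be worked out directly. A rough sketch, assuming decimal gigabytes, a fully available 100 Mb/s link, and no protocol or compression overhead (the function name is made up for illustration):

```python
def transfer_minutes(size_gb: float, link_mbit_per_s: float) -> float:
    """Ideal copy time in minutes: gigabytes -> megabits, divided by link rate."""
    megabits = size_gb * 1000 * 8
    return megabits / link_mbit_per_s / 60


for size in (1, 20, 170):
    print(f"{size:>4} GB at 100 Mb/s: {transfer_minutes(size, 100):6.1f} min")
```

That works out to roughly 1.3 minutes for 1 GB, 26.7 minutes for 20 GB, and about 227 minutes (close to 4 hours) for 170 GB, before any contention from live traffic on a busy network.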

--
When all you have is a hammer, every problem starts to look like a thumb.
Re:mysql bad at disaster recovery? by darkpixel2k · 2005-02-21 17:27 · Score: 2

We're a bunch of GEEKS. Why hasn't anyone invented some sort of device that could keep power going to a computer during a power outage. Maybe some sort of...battery backup? We could even give it a cool name like "Uninterruptible Power Supply"--or even a cool acronym like UPS.

We could even attach circuitry to this 'UPS' that could send a signal to the computer when the power goes out or the battery runs low. If we work hard, we could even get this circuitry to give us an estimated time before the battery is totally dead. We could call this new technology 'smart signaling'.

Wow--this would totally enable some cool features on the computer side. You could send an email, or even a page to the responsible server admin. You could even...
...wait for it...
SHUTDOWN THE F*CKING DATABASE IN A SAFE MANNER!

Ok--I understand the deal with LiveJournal and the EPO device--but why should a tripped circuit breaker cause a hard shutdown of a server? Seriously. Battery Backup people!

Even my home computer does that. Power goes out, five minutes later the box begins shutting down...

--
There's no place like ::1 (I've completed my transition to IPv6)
Re:mysql bad at disaster recovery? by sarahemm · 2005-02-21 17:39 · Score: 3, Informative

A lot of datacentres don't allow UPSes within customer enclosures, as even if the EPO is triggered they keep supplying power which can be dangerous for fire/rescue crews. I'm aware this wasn't an EPO situation AFAIK, but the rules still apply.
Re:mysql bad at disaster recovery? by sterwill · 2005-02-21 17:48 · Score: 3, Informative

I know that PostgreSQL uses write-ahead-logging so it can avoid exactly these kinds of problems. It doesn't matter how much I/O PostgreSQL is doing; all writes go to the log. If the machine crashes, it replays the log file up to its most recent write. Worst case: data that was in the process of being appended to the log when the machine crashed didn't get flushed to disk, and that last transaction is lost. No tables are corrupt. No 6+ hour delay getting back online.
You would know this too if you had read the PostgreSQL documentation.
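
The replay idea can be illustrated with a toy redo log. This is a sketch of the general write-ahead-logging principle only, not PostgreSQL's actual log format or recovery code; the record layout and the `recover` function are assumptions made up for the example.

```python
def recover(log):
    """Rebuild table state from a redo log, applying only committed transactions."""
    pending = {}   # transaction id -> list of (key, value) not yet committed
    table = {}
    for record in log:
        kind, txid = record[0], record[1]
        if kind == "write":
            pending.setdefault(txid, []).append(record[2:])
        elif kind == "commit":
            # Only now do the transaction's writes become visible.
            for key, value in pending.pop(txid, []):
                table[key] = value
    # Writes belonging to transactions with no commit record are dropped.
    return table


log = [
    ("write", 1, "a", 10),
    ("commit", 1),
    ("write", 2, "a", 99),   # crash happened before ("commit", 2) reached disk
]
print(recover(log))
```

Transaction 2 never committed, so its write is discarded and the table keeps the last committed value. That matches the worst case described above: the in-flight transaction is lost, but nothing is left half-applied.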
Re:mysql bad at disaster recovery? by TheNarrator · 2005-02-21 18:31 · Score: 4, Informative

PostgreSQL is far superior to MySql in its disaster recovery ability, namely WAL (Write Ahead Logging). I've been using PostgreSQL since version 7.0 came out and I've never had it fail to come back up on me after any power outage or reset.
http://www.postgresql.org/docs/8.0/interactive/wal.html
Re:mysql bad at disaster recovery? by Jugalator · 2005-02-21 19:22 · Score: 2, Insightful

They were lucky with that server?

I mean, if only a few servers' databases survived, that may say more about the luck of not being in a vulnerable state when the power outage occurred. If all of the databases survived, that speaks of MySQL being resistant to this sort of thing.

--
Beware: In C++, your friends can see your privates!
Re:mysql bad at disaster recovery? by Jamesday · 2005-02-22 01:41 · Score: 5, Informative
>>Can anyone quess why its the case?

Easily. See what those saying that MySQL can't do what MySQL does are promoting.:)

LiveJournal found that it had some disk systems which lied about having committed writes. They have a preliminary tool which copies what it's writing to disk to a networked system and then compares the after power off and recovery state to what the disk system said it could do. They are going to make it available to the community as time allows.

I expect we're going to find the same at Wikipedia. Here's a pretty typical error log, this one from the server which was master database server:

050222 5:11:12 InnoDB: Database was not shut down normally.
InnoDB: Starting recovery from log files...
InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 303 1283776146
InnoDB: Doing recovery: scanned up to log sequence number 303 1289018880
InnoDB: Doing recovery: scanned up to log sequence number 303 1294261760
InnoDB: Doing recovery: scanned up to log sequence number 303 1299504640
InnoDB: Doing recovery: scanned up to log sequence number 303 1304747520
InnoDB: Doing recovery: scanned up to log sequence number 303 1309990400
InnoDB: Doing recovery: scanned up to log sequence number 303 1315233280
InnoDB: Doing recovery: scanned up to log sequence number 303 1320476160
InnoDB: Doing recovery: scanned up to log sequence number 303 1325719040
InnoDB: Doing recovery: scanned up to log sequence number 303 1330961920
InnoDB: Doing recovery: scanned up to log sequence number 303 1336204800
InnoDB: Doing recovery: scanned up to log sequence number 303 1341447680
InnoDB: Doing recovery: scanned up to log sequence number 303 1346690560
InnoDB: Doing recovery: scanned up to log sequence number 303 1347688389
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 14 row operations to undo
InnoDB: Trx id counter is 1 935480064
050222 5:11:13 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 8617985.
InnoDB: You may have to recover from a backup.
050222 5:12:20 InnoDB: Page dump in ascii and hex (16384 bytes):

Observe that the database engine went back to its last checkpoint, noticed the partial transaction and undid it and was rolling ahead in the write-ahead log when it encountered a database page which failed its checksum test. That failed checksum test is why I think it's a problem with the disk system lying about what was written. You can get that when a database page spans two drives in a stripe set and one has committed the update while the other hasn't.
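
The striped-page failure can be illustrated with a toy page format. This is a sketch only: CRC32 and the 4-byte header are stand-ins, not InnoDB's real checksum algorithm or page layout, and `make_page` and `page_ok` are hypothetical helpers. The 16384-byte size is taken from the log above.

```python
import zlib

PAGE_SIZE = 16384  # page size reported in the error log


def make_page(fill: bytes) -> bytes:
    """Build a page: 4-byte CRC32 of the body, followed by the body itself."""
    body = fill * (PAGE_SIZE - 4)
    return zlib.crc32(body).to_bytes(4, "big") + body


def page_ok(page: bytes) -> bool:
    """Recompute the body checksum and compare it with the stored one."""
    return int.from_bytes(page[:4], "big") == zlib.crc32(page[4:])


old = make_page(b"A")   # page contents before the update
new = make_page(b"B")   # page contents after the update

half = PAGE_SIZE // 2
# Drive 1 of the stripe committed its half of the new page;
# drive 2 still holds its half of the old page.
torn = new[:half] + old[half:]

print(page_ok(old), page_ok(new), page_ok(torn))
```

Both the old and the new page verify on their own, but the mixed page does not, which is how a checksum catches a write that only one of the two drives actually completed.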

In more typical situations MySQL simply applies the updates and all is well. I've had a server set up to exceed RAM with swap turned off and get killed every ten minutes for hours and recover every time.

Just to be complete:
- The database servers have dual redundant power supplies. TWO breakers at the colo tripped, taking out both.
- The systems are a mix of SCSI and SATA, so no point in arguing about one being lousy. SATA and Linux win if you want a winner: it was a SATA box using Linux RAID 0 which completed full recovery. It wasn't one of the normal servers - it was used for backup and offline report generation.
- Two different disk controller makers, one each for SCSI and SATA.
- Battery backed up write cache on most of the main server disk controllers but the one without the battery backup for the write cache had the same problem (which shouldn't surprise anyone - that one should be expected not to recover well).
- After LJ's experience we were after UPS systems in the racks but hadn't yet checked whether the local fire code allows them. Some don't, for el
Re:mysql bad at disaster recovery? by Cramer · 2005-02-22 03:00 · Score: 3, Informative

You mis-read the comment. (and again, know little about wiki's setup) The point is, there aren't any Microsoft Windows boxes in the cluster. And I don't expect the Wikimedia Foundation to approve the cash outlay for buying windows and mssql licenses for the number of systems currently serving as database engines. Plus there's the complexity of those boxes not just being db servers, and the fact that none of the admins are anywhere near the actual hardware -- remote management of windows boxes is not something I would recommend (and not something easily done via ssh.)

MySQL is not inferior in all possible characteristics... MSSQL is a windows only product. I cannot run it on Solaris, OS X, AIX, Tru64, linux, etc. It, thus, loses on that characteristic. Wiki is not a windows shop, so stop wasting your time suggesting a windows product. The cluster is running linux. Bring linux software to the table and we'll talk. Wiki uses mysql because it's free and it's fast. My suggestion would be Oracle, but it's most certainly not cheap or free.
Re:mysql bad at disaster recovery? by jdavidb · 2005-02-22 03:36 · Score: 2, Informative

Unfortunately for webhosting the demand for MySQL is higher than for the other available DBMS's, since most available open source software and gratis software that requires a database is going to have been developed originally with MySQL. I would much prefer to be using PostgreSQL for the applications I run with a hosting provider, but the apps I use don't function with it, and the hosting provider (NearlyFreeSpeech.net) doesn't offer anything else, anyway.

I figure once the advantages of the other DBMS systems become more apparent (and enough disaster stories happen to highlight the advantages) the apps will begin to offer and improve support for PostgreSQL and others, and then there will be a demand for them and some hosting providers will begin to offer them. I do understand that PostgreSQL consumes a lot more resources than MySQL, though, so it will not be cheap.

You get what you pay for. How much is reliability worth?

--
Secession is the right of all sentient beings.
Re:mysql bad at disaster recovery? by defile · 2005-02-22 03:39 · Score: 2, Interesting

No. It can't. We have two concrete examples in this very page - one provided by Wikimedia, one provided by me - which directly contradict your statement. Maybe under some circumstances MySQL can handle reboots, but it's been proven already that it can't always do so. Perhaps your MySQL experience is not with high-load applications (at least not the level of load Wikimedia and my employer see).

I don't mean to diminish what you guys do, or question your abilities. I simply want to offer my perspective because I've been in similar situations.

I ran an extremely high load site off of MySQL for about 4 years. It started out modestly and went up to around 2000-3000 queries/sec hitting the RDBMS, about 30% of them data updates.

There were genuine cases where MySQL annoyed the hell out of me.

For example, the use of pthreads is a huge pain in the ass under Linux because all of the thread stacks share address space. On a 32-bit platform and a large InnoDB buffer pool, it's easy to run out of pointer bits. Once we switched to a replicated setup this wasn't such a big deal anymore. Moving to a 64-bit platform would've made this a non-issue too, but we didn't have the luxury of doing this at the time.

Regarding data corruption? Every time I've blamed MySQL 4 + InnoDB for ruining my data, I've been too soon to do so. MySQL is often the messenger of a real underlying problem.

What underlying problem? Well, OS, disk, or RAM for starters. And these aren't always easy to find.

I went to enormous lengths to verify that all of those things were working properly so I could blame MySQL. Kernel updates, memtest86, I even ran (VA-)CTCS on it for a week and the machine showed no problems. But every time we'd bring MySQL up on it we'd encounter data corruption within 20 days or so.

The site continued to run off of the remaining servers and we just hoped MySQL wouldn't "corrupt" those as well while we tried to figure out what the problem was.

Anyway, one day a few weeks later someone was playing around with the retired machine and found that 1 time out of 20, on boot it wouldn't find one of the hard disks. Oh, and this disk just happened to be used in the MySQL data partition. We replaced the disk container and the machine hasn't had a problem since and runs in production.

We should've just thrown the fucker away.

I've wasted a lot of time because I had more confidence in machines than I should've. Most computers have terrible reliability, even the ones that are marketed as being reliable. Most people just don't notice. MySQL notices. :(

Power outages suck. by goofyheadedpunk · 2005-02-21 14:39 · Score: 2, Interesting

Power outages suck, and a great way to protect from them is to distribute your project over a large area of electrical service.

I know the wikimedia folks are fundraising for more servers, but I wonder if this will provide more incentive to accept Google's offer?

--

What if the entire Universe were a chrooted environment with everything symlinked from the host?

More information here... by Anonymous Coward · 2005-02-21 14:40 · Score: 5, Funny

I found this useful information about power outages:
http://www.wikipedia.org/search?/power_outage

Re:More information here... by daveo0331 · 2005-02-21 15:00 · Score: 4, Funny

Unfortunately that site you linked to appears to be slashdotted, or something. Here's a mirror:

http://www.answers.com/topic/power-outage-1

--
Remember the days when Republicans were the party of fiscal responsibility?

Join me, my friends! by mctk · 2005-02-21 14:44 · Score: 4, Funny

A power outage has taken down Wikipedia! As a community we must carry the torch!

--
Paul Grosfield - the quicker picker upper.

Oh, great... by ral315 · 2005-02-21 14:48 · Score: 3, Funny

Even when the servers go back on, they'll be slashdotted.

There's a lesson to be learned here by Raul654 · 2005-02-21 14:52 · Score: 3, Funny

As that economic genius, Eric Cartman, taught us:

1) Get something other people love
2) Don't let them use it
3) Profit!

It doesn't hurt if you are running a fund drive at the same time, either.

--

To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton

Aaaaand... by Faust7 · 2005-02-21 14:52 · Score: 4, Funny

Meanwhile, the devs are working fairly furiously to get it back up

Don't worry, we'll take care of your backup servers in the meantime. ;)

--
The coolest voice ever.

ETA for read only service is now 2-4 hours. by Jamesday · 2005-02-21 14:55 · Score: 5, Informative

So far one of our database servers has completed a successful recovery (we're working through them all). On a gigabit link it takes something between 90 minutes and 4 hours to rsync from one to another. As soon as we have two database servers working, we'll be restoring service in read only mode. Likely to be that 90 minutes to 4 hours from now as worst case.

I'll post followups to this post later, as we're closer to being fully recovered.

Re:ETA for read only service is now 2-4 hours. by Jamesday · 2005-02-21 15:46 · Score: 4, Informative

May be longer so I withdraw that time estimate.

Distributed Wikipedia? by femto · 2005-02-21 15:00 · Score: 2, Interesting

Isn't raising money for servers a short term solution? Surely the real solution is to invest time and effort into finding a way to distribute wikipedia across the 'net?

Google seems to have succeeded in building a distributed platform. What about something similar to seti@home, which takes a chunk of each user's disk space and bandwidth and uses them to implement a virtual computer on which wikimedia projects may be run?

Surely someone is already working on something like this (pointers anyone??)

Re:Distributed Wikipedia? by midom · 2005-02-21 15:31 · Score: 3, Insightful

Well, distributing a wiki is a bit more complex than distributing a search index (async!) or seti@home (async). With async data arrays you don't care whether the packet you sent to some node is an hour or a day old. You do care about that in a wiki, because every user will be pressing the 'edit' button, and data should be consistent everywhere. We are working on distribution.

- Distributed caches - the majority of hits are now served by caches, and some of them are offsite. It was a pilot project for a while and now we're trying to design and build scalable infrastructure for that. But still, lots of edits are served uncached.
- Distributed file systems - are there any? NFS is a single-server system, MS has something, PVFS has no redundancy, GoogleFS is closed and not released; Coda, AFS, all of those just don't work. Right now we're trying to develop a MogileFS (the Perl-based app-level file storage by LiveJournal) store, and sure, there are other ideas.
- Distributed database - there are no proper large multimaster open-source database solutions. We use MySQL with replication and a transactional data store. In this event it would have been great to have a second datacenter nearby with additional DB replicas and a gigabit interconnection, but that costs money. App-level bidirectional replication is planned for both MySQL and PostgreSQL. And SAN deployment is too costly.

And yes, MediaWiki code has PostgreSQL support, but migrating from one database to another without proper tests, benchmarks and insurance isn't very mature.

Re:Distributed Wikipedia? by InfiniteWisdom · 2005-02-21 15:47 · Score: 2, Insightful

170GB isn't that big and people routinely run far more critical stuff without any kind of exotic seti@home-like distribution. What's really inexcusable is the fact that a power failure caused database corruption that turned a 2 minute power outage into major downtime.

lame quotes rule by mrpuffypants · 2005-02-21 15:00 · Score: 4, Funny

it's as though 300,000 people cried out and were suddenly silenced ...

and then somebody diffed the change and made them speak again

URI to the Rescue by Doc+Ruby · 2005-02-21 15:03 · Score: 3, Interesting

This outage, as well as our beloved slashdotting, is yet another argument for URIs, rather than just URLs. URLs are like IP#s; they're absolute pointers to specific object locations, in terms of the storage/retrieval interface of a single instance. URIs are virtual, like domain names. They are distributed in DNS, a Netwide database, updated for current lookup values for actual retrieval. URLs need the same kind of layer. Of course, some other characteristics of these objects must be reflected in the URI model that are not appropriate to IP#/domain names, like multiple identical copies, or perhaps versions.

Just caching copies, either actively with a redirection URL, or passively in caching backbone webservers, isn't cutting it. Caching values is always better suited to solving performance problems, while creating its own concurrency and identity problems. Not to mention the publication limits of "opt-in" caches, like Coral or Google, which are an afterthought (and usually unknown) to the published object itself. Google has a huge, high-performance URL lookup system. It's taken quite a bit of value from the Internet, and all the content creators it rides on to derive all its value. It gives back quite a bit, with its simple, fast, effective interface. Google is perfectly positioned to make its name truly synonymous with an Internet revolution (not just a pinnacle of search evolution) by implementing URIs. If Google let objects get looked up by a URI code as simple as, say, [A-Za-z0-9]+, it could get halfway to its namesake in objects with just 28 "digits"; just 7 digits would cover each object instance in its database right now, dozens of times over. If Google opened up such a URI protocol to anyone on the Web running such a "DIS" server, just like DNS, they could offload much of the work, avoid accusations of trying to "own the Internet", and improve their own service immeasurably, not least by making broken links in their database a quaint old curiosity. Will they rock our world, or will another big player, like Archive.org, do it, before Microsoft, desperate to distinguish MSN Search, ruins it for everyone with some kind of proprietary hack that favors MS objects?
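
For anyone who wants to check the "digits" arithmetic above, here is a rough sketch. It assumes the identifier alphabet is exactly the 62 characters matched by [A-Za-z0-9], and that "namesake" means a googol (10^100); the function name is made up for illustration.

```python
import math

ALPHABET_SIZE = 26 + 26 + 10  # characters matched by [A-Za-z0-9]

def id_space(digits: int) -> int:
    """Number of distinct identifiers that are exactly `digits` characters long."""
    return ALPHABET_SIZE ** digits

seven = id_space(7)
twenty_eight = id_space(28)

print(f"7 digits:  {seven:,} identifiers")
print(f"28 digits: about 10^{math.log10(twenty_eight):.1f} identifiers")
```

Seven characters give roughly 3.5 trillion identifiers, and 28 characters give about 10^50, which is halfway to 10^100 on a logarithmic scale.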

--

--
make install -not war

Re:URI to the Rescue by Amit+J.+Patel · 2005-02-21 15:10 · Score: 2, Interesting

URLs contain a domain name. Domain names already provide a level of indirection. Why can't we use that level of indirection for Wikipedia's problem? I don't see what URIs buy us -- if we're already not using the indirection we have, what does a second level give us?

Re:URI to the Rescue by Doc+Ruby · 2005-02-21 15:19 · Score: 3, Informative

Because domain names equate to a single IP# (even if that number changes) - a single instance of the object. A URI is just a unique ID across the whole Net, for an object class, which can have single instances. A good URI scheme will take different states of that class into account, like different versions of the object. Domain names, as implemented in DNS, can't give us the one (URI) to many (instances) we obviously need to support scalability and distributed objects.

--
--
make install -not war

Re:URI to the Rescue by J'raxis · 2005-02-21 16:23 · Score: 2, Insightful

URIs are a superset of URLs and URNs. I think what you're talking about is a URN, isn't it? These are the URIs that specifically name something uniquely (for example, urn:isbn:1902593790 or urn:oid:1.3.6.1.4.1.20115) but don't necessarily help you locate it at a specific place.

--
Liberty in your lifetime

Re:Where are you guys hosting from? by Rakishi · 2005-02-21 15:19 · Score: 2, Informative

Because the power didn't actually go out?

Answers.com by stevemm81 · 2005-02-21 15:30 · Score: 3, Informative

You can look things up on answers.com.. They mirror wikimedia, as well as other dictionaries/encyclopedias.

Re:Integrity? by Jamesday · 2005-02-21 15:30 · Score: 3, Interesting

Yes. It's in our plans regardless of what happens with Google.

Re:Ironic by Jamesday · 2005-02-21 15:33 · Score: 4, Informative

Yes. I wrote that cached page and it's now a bit out of date. IF, and it's not certain, local fire regulations permit the use of UPS systems in the racks, we're going to be installing them. We decided on that after LiveJournal's unfortunate experience, but we don't yet have them.

Absolute power corrupts. by kiwidefunkt · 2005-02-21 15:33 · Score: 3, Funny

As soon as I saw "Power corrupts. Power failure corrupts absolutely" I thought, the damn commies finally did it! But no, not hacked by commies...just by a renegade circuit breaker.

--
www.kiwilyrics.com - a wiki for lyrics

Re:Stupid question... by ScrewMaster · 2005-02-21 15:34 · Score: 2, Insightful

Something still doesn't add up. Even if a backup generator autostarts successfully, there's a significant delay between mains failure, switchover, and the generator picking up the load. That's usually a few seconds or more, too long for a computer to run off the residual charge in its power supply filter caps. There would still have been an inverter-charger somewhere to keep the equipment running until the generator was fired up. Sounds like somebody screwed up, either by tripping the wrong breaker, or by designing the facility improperly to begin with.

--
The higher the technology, the sharper that two-edged sword.

why, why, why? by CAIMLAS · 2005-02-21 15:42 · Score: 2, Insightful

Why were they not using battery backup on their database servers (IE, their critical servers)? That way the servers would have the necessary 10 minutes (or whatever) so that they can shut down the DBs and power off the systems.

This is a negligible cost for something as integral as an active sync with the work that people have performed - for free.

Why is this not seen as important? "The wiki users will just recreate the material"? That's somewhat presumptuous.

Now, livejournal I can understand not doing this (as there are many clients which allow people to sync with their online journals and the material is fairly culturally worthless), but wikipedia? It's one of the better things on the Internet.

--
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers

Re:Stupid question... by Carnildo · 2005-02-21 15:51 · Score: 2, Informative

Fire code. When someone hits the Big Red Button, all electrical power in the server room must be out. Therefore, UPSs can't be located in server racks (or if they are, you need to go to the effort of wiring them into the BRB).

--
"They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.

Oh...ok... by buffy · 2005-02-21 16:46 · Score: 2, Funny

So, _this_ is where I should be posting my outage reports! And here I've been sending them only to people who would care.

"Slashdot...outage reports for nerds! Stuff that doesn't matter to me!"

Lol!

-buf

MySQL not ACID by Tough+Love · 2005-02-21 17:06 · Score: 2, Insightful

From the wikipedia page:

At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers. (Yes, even the machines with dual power supplies -- both circuits got shut off.)

After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.

The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state.

(Bolding mine.) This proves that MySQL is not ACID, there is no way that a power outage is supposed to cause corruption in a database. This is not a troll, this is a simple conclusion. I really think that Wikipedia should switch to PostgreSQL, which is considerably more mature in terms of ACID compliance.

--
When all you have is a hammer, every problem starts to look like a thumb.

Re:Backup power supply? by brion · 2005-02-21 17:11 · Score: 3, Informative

The colocation facility has diesel generators to protect against the outside power going out. Thanks to the miracle of circuit breakers, power circuits inside the facility shut off (including both circuits feeding our dual-power supply machines).

--

Chu vi parolas Vikipedion?

Re:Easy, brain-dead sql db recovery (if possible) by Tough+Love · 2005-02-21 17:16 · Score: 3, Interesting

A completely designed, 100% empty database.

A COMPLETE log of all the SQL statements that were applied to it IN the order they were used. This is obtained by the application logging the SQL statements to the SQL log file AFTER the SQL statement is successfully executed.

When a database failure occurs, stop everything, 'replay' the backed up SQL logfile (that's on a separate backup system) on a copy of the empty DB there. TADA! You are back in business back to the point of failure!

Read the Wikipedia page. That's exactly what they've done, but because the MySQL database got corrupted, instead of just falling back a few minutes, they may have to go right back to a full backup and replay the log since then, which takes a lot more time than replaying a few transactions.

The solution is to switch to a database that actually implements ACID (the second letter stands for "Consistency" and the last letter stands for "Durability" which is what failed here).
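
To make the quoted scheme concrete, here is a minimal sketch of "empty schema plus statement log". It uses Python's built-in sqlite3 as a stand-in (not MySQL), and the `pages` table, the helper names, and the in-memory list used as the log are all made up for illustration.

```python
import sqlite3

SCHEMA = "CREATE TABLE pages (title TEXT PRIMARY KEY, body TEXT)"

def apply_and_log(conn, log, sql, params=()):
    """Run a statement and append it to the log only after it has succeeded."""
    conn.execute(sql, params)
    conn.commit()
    log.append((sql, params))

def rebuild_from_log(log):
    """Create an empty database from the schema and replay the log in order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    for sql, params in log:
        conn.execute(sql, params)
    conn.commit()
    return conn

# "Live" database plus its statement log (a list here; in practice a file
# kept on a separate machine).
live = sqlite3.connect(":memory:")
live.execute(SCHEMA)
statement_log = []
apply_and_log(live, statement_log,
              "INSERT INTO pages VALUES (?, ?)", ("Xenu", "stub"))
apply_and_log(live, statement_log,
              "UPDATE pages SET body = ? WHERE title = ?", ("expanded", "Xenu"))

# After a failure: throw the live copy away and rebuild from the log alone.
recovered = rebuild_from_log(statement_log)
rows = recovered.execute("SELECT title, body FROM pages").fetchall()
```

Replay time grows with the length of the log, which is why falling all the way back to a full backup hurts so much more than replaying the last few minutes.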

--
When all you have is a hammer, every problem starts to look like a thumb.

Taken down by CO$, coincidence or not! by friendscallmelenny · 2005-02-21 17:27 · Score: 3, Funny

Yesterday they had frontpage Scientology entry with Xenu stuff. I told my friend, "That site will be in trouble soon."
He thinks I'm a god now!

perhaps I just inadvertently reached clear

Re:170 gigs? by brion · 2005-02-21 17:42 · Score: 2, Interesting

The vast majority of this space is taken up by revision histories (and those are compressed!) Periodic database dumps are available for download. Image and multimedia uploads have been taking up a bigger share lately, but those are on a separate server which recovered just fine.

A German company has published an end-user-friendly CD-ROM of material from the German-language Wikipedia, but afaik no one's published an English-language edition yet.

--

Chu vi parolas Vikipedion?

Re:What's the Name of Wikimedia's Colo? by timstarling · 2005-02-21 17:51 · Score: 2, Informative

http://www.neutelligent.com/

Re:Xenu Strikes Again! by MillionthMonkey · 2005-02-21 17:51 · Score: 5, Informative

I find it an interesting coincidence the power outage happened so soon after the Xenu article was featured.

Gee, you just had to mention the X-word! Now this thread won't load for most Scientologists because the keyword filters they were forced to install by their Church will see "Xenu" and block the site. After all the mere sight of the word could cause "pneumonia and death" if you haven't paid the Church of Scientology for the proper preparation.

Wikipedia's Xenu article has an interesting history if you look, as I did the other night when it was featured. Scientologists vandalize it regularly. You're supposed to pay them a half million (or some absurd sum of money) to find out about Xenu. After you find out, you're too embarrassed to admit to anybody that you paid a half million to learn that your problems are caused by bad science fiction, when you could have bought a house in Silicon Valley instead. So they obviously don't want a Wikipedia article giving away their half-million-dollar "trade secret" for free.

One trick I saw was to use HTML entities to spell out insults at the top of the article- like "only an idiot would believe this" or something. In the editor window, the entities weren't rendered and each letter appeared as a hex code.

A more effective attack took a different approach. The vandal in this case changed "Scientologists" to "Muslims", "Scientology" to "Islam", and inserted a boring-sounding sentence at the end of the first paragraph claiming that "Xenu" is another name that Muslims use for "Allah". It completely discouraged you from reading further. If you didn't know better you wouldn't find out how "Allah" distributed the thetans around volcanoes on various planets and blew them up with hydrogen bombs, and how their blown-up spirits cause problems in your personal life today.

This is OT, but what the hell, why not whack a beehive? Additional information on Xenu:
Operation Clambake (Hubbard maintained that humans are descended from clams)
The Xenu leaflet (all about Xenu- this information can save you lots of $$$$$)
The road to Xenu (authored by a woman who got suckered)
The Google cache of Wikipedia's Xenu article is also a must read.

I'm wondering if I'll get a lot of freaks, downmoderations, and hostile AC replies after I post this. After all, that's the kind of thing that Hubbard called "fair game". If it sinks below default visibility I'll repost it again with my karma bonus, so you theta-clear-wannabes out there can save your points for someone else.

Notice their Database worries by iwadasn · 2005-02-21 17:52 · Score: 2, Funny

Apparently one of their MySQL databases got corrupted as well. Figures. You'd think with all that volume they'd be wise enough to use a DB that can withstand a hard powercycle without losing data.

Just remember, friends don't let friends use MySQL for important data.

:::eyes UPS under table::: by shoemakc · 2005-02-21 18:02 · Score: 4, Funny

:::eyes my UPS::::

::::ponders for a moment::::

:::eyes the serial cable that gracefully shuts down said computer in the event of a power failure::::

:::ponders some more::::

:::eyes the spare UPS sitting in the corner that used to be connected to a database server::::

Hmm, I think I'm almost onto something here, but I just can't seem to nail it down...

-Chris

--
--an unbreakable toy is useful for breaking other toys--

Fortunately, Wikicities is still online . . . by greenreaper · 2005-02-21 18:38 · Score: 2, Informative

And going strong!

Proper fundraising link by Jugalator · 2005-02-21 19:19 · Score: 3, Informative

The link in the article is broken, here's the proper one:
http://wikimedia.org/fundraising/

--
Beware: In C++, your friends can see your privates!

Re:Xenu Strikes Again! by Silentnite · 2005-02-21 19:21 · Score: 3, Interesting

If only I were a mod. Informative, and just plain funny if you ask me. I've read about that entire thing going back and forth and it's kinda odd. On the one hand I think that Wikipedia should be limited to who can change it. But on the other it's really neat and diverse to let everybody at it.

Oh well.. Slightly OT

At the risk of pointing out the obvious... by Headcase88 · 2005-02-21 20:49 · Score: 2, Funny

So now they'll have to put up a page to say "The temp page that says that our site is down is down. We are working around the clock to get the temp page back up.".

--
"When the atomic bomb goes off there's devastation...but when the atomic bong goes off there's celebraaaaation!"

Linux Kernel bug?!? by peterwilm · 2005-02-21 21:34 · Score: 2, Interesting

I recall a discussion about fsync not being properly implemented both in Linux Kernel 2.4 as well as 2.6. I think it was patched in 2.6.9 or so, but not in 2.4.

Unfortunately, I cannot find the thread any more. Does anybody remember?

So, this might be rather a linux kernel bug, not a mysql bug.

Secondly, why does everybody say that mysql does not support ACID-transactions? MySQL does advertise them. Are you talking about pre-4.0 MyIsam tables? Or do you suggest that 4.0/4.1 InnoDB-tables aren't ACID-compliant either?
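
For readers who haven't met fsync: a minimal sketch of what an application does when it wants a write to survive a power cut. The file name and helper are hypothetical. Note that os.fsync only asks the kernel to flush; whether the bytes actually reach the platters still depends on the kernel, the controller and the drive honoring that request, which is exactly the part under suspicion here.

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to flush it to the storage device before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to flush its cache to the device

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "commit.log")
    durable_write(target, b"txn 42 committed\n")
    with open(target, "rb") as f:
        contents = f.read()
```

A database that reports a commit before this flush has really completed can lose or tear that commit on power loss, no matter how careful the SQL layer is.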

It's not SATA by Jamesday · 2005-02-22 00:11 · Score: 2, Informative

The best copy we have is on a lowly pair of 250GB SATA drives using Linux RAID 0 and since that's the best it's the one we used.

Every main database server had corrupt database pages. That is, 3 systems with battery backed up write caching controllers and SCSI drives and 2 SATA systems with write caching SATA controllers, one battery backed up, the other not, two different SATA disk drive makers.

Involved:

Two completely different caching controller brands
Two different SATA drive makers
Seagate only on the SCSI drive maker side

Obvious speculation involves the controllers not telling the drives not to write buffer or the drives not listening. No point in getting into SCSI or SATA or this disc controller or that controller fights when there's this much variation involved.

Latest news by saforrest · 2005-02-22 01:27 · Score: 4, Informative

Posted on the mailing list wikipedia-l 32 minutes ago:

From: Brion Vibber
Reply-To: wikipedia-l@wikimedia.org
To: Wikipedia-l, Wikimedia Foundation Mailing List, Wikimedia developers
Date: Tue, 22 Feb 2005 04:47:56 -0800
Subject: Re: [Wikipedia-l] Wiki Problems?

Brion Vibber wrote:
> There was some sort of power failure at the colocation facility. We're
> in the process of rebooting and recovering machines.

The power failure was due to circuit breakers being tripped within the colocation facility; some of our servers have redundant power supplies but *both* circuits failed, causing all our machines and the network switch to unceremoniously shut down.

Whether a problem in MySQL, with our server configurations, or with the hardware (or some combination thereof), most of our database servers managed to glitch the data on disk when they went down. (Yes, we use InnoDB tables. This ain't good enough, apparently.)

The good news: one server maintained a good copy, which we've been copying to the others to get things back on track. We're now serving all wikis read-only.

The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170gb of data to each DB server we have to apply the last day's update logs before we can restore read/write service.

I don't know when exactly we'll have everything editable again, but it should be within 12 hours.
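
A back-of-the-envelope check on the copy times, as a rough sketch: it takes the 170 GB figure from the message above and the "90 minutes to 4 hours" rsync estimate given earlier in the thread, and assumes decimal units and that both numbers refer to the same data set. Function names are made up.

```python
DATA_GB = 170            # database size quoted in the thread
LINK_GBIT_PER_S = 1.0    # nominal gigabit link

def transfer_minutes(size_gb: float, gbit_per_s: float) -> float:
    """Minutes to move size_gb gigabytes at a sustained gbit_per_s."""
    seconds = size_gb * 8 / gbit_per_s
    return seconds / 60

def effective_mb_per_s(size_gb: float, minutes: float) -> float:
    """Average throughput in MB/s implied by moving size_gb in the given time."""
    return size_gb * 1000 / (minutes * 60)

ideal = transfer_minutes(DATA_GB, LINK_GBIT_PER_S)
fast = effective_mb_per_s(DATA_GB, 90)
slow = effective_mb_per_s(DATA_GB, 240)
print(f"ideal: {ideal:.0f} min; observed range implies {slow:.0f}-{fast:.0f} MB/s")
```

At wire speed the copy would take about 23 minutes, so under these assumptions the quoted range works out to roughly 12-31 MB/s sustained, which points at disks and rsync overhead rather than the network as the limit.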

Imagine by Anonymous Coward · 2005-02-22 01:32 · Score: 2, Funny

Imagine what would happen if there were a link on Wikipedia's main site to Slashdot and from Slashdot back to Wikipedia... Boom?

ACID by Jamesday · 2005-02-22 07:24 · Score: 2, Informative

Except it's now been a few years since MySQL incorporated InnoDB, so maybe it's time to move on and rejoice that it's now one of the free database servers with ACID support? This one happens to come with standard replication and fulltext search. Also with a range of other engines to choose. PostgreSQL, last I knew, doesn't have built in replication, fulltext search and alternative storage engines but has its own particular strengths. In the end, every end user gets to benefit from the competition between excellent tools. Good for us all to be happy about that.

InnoDB does use WAL by Heikki_Tuuri · 2005-02-22 11:13 · Score: 2, Informative

Hi!

InnoDB has used WAL since I wrote it in mid-1990s. To PostgreSQL, WAL came later, around 2000.

Regards,
Heikki
Innobase Oy
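
For readers unfamiliar with the term, here is a toy sketch of the write-ahead logging idea: record the change in an append-only log and force it to disk first, apply it second, and rebuild state on startup by replaying the log. This is not how InnoDB or PostgreSQL actually implement it; the class, the JSON record format and the file name are all invented for illustration.

```python
import json
import os
import tempfile

class TinyStore:
    """Key-value store that appends each change to a log before applying it."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.data = {}
        self._recover()

    def _recover(self) -> None:
        # Rebuild state by replaying every complete log record in order.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final record from a crash mid-write: ignore it
                self.data[record["key"]] = record["value"]

    def put(self, key: str, value: str) -> None:
        # 1. Log first, and force the record to disk...
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. ...only then apply the change.
        self.data[key] = value

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal.log")
    store = TinyStore(path)
    store.put("Xenu", "rev 1")
    store.put("Xenu", "rev 2")
    # Simulate a crash that leaves a half-written record at the end of the log.
    with open(path, "a", encoding="utf-8") as f:
        f.write('{"key": "Xenu", "val')
    recovered = TinyStore(path).data
```

The recovered store ends up with "rev 2" and silently drops the torn record. The whole scheme rests on the flush in step 1 really reaching stable storage.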
