Power Outage Takes Wikimedia Down

Posted by ryuzaki0 on Monday February 21, 2005 @02:28PM from the problem-with-non-absolute-power dept.

Baricom writes "Just a few weeks after a major power outage took out well-known blogging service LiveJournal for several hours, almost all of Wikimedia Foundation's services are offline due to a tripped circuit breaker at a different colo. Among other services, Wikimedia runs the well-known Wikipedia open encyclopedia. Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers. They have established an off-site backup of the fundraising page here until power returns."

21 of 577 comments

mysql bad at disaster recovery? by bdigit · 2005-02-21 14:34 · Score: 4, Interesting

This is not a troll or a flame at all but between this and the livejournal servers, it sure sounds like hell if your mysql servers ever go down unexpectedly.

Is mysql the only database like this or does postgres get corrupted as well during unplanned downtime? If I recall from using MSSQL servers, we never had a problem like this. We would simply reboot the servers and not worry about tables being left in unrecoverable states. Please correct me if I am wrong though.

Is there any way around this or will this always be a problem with mysql?
1. Re:mysql bad at disaster recovery? by ctr2sprt · 2005-02-21 14:50 · Score: 5, Interesting
  
  We have a similar problem at work. There we don't endure database corruption, we just get broken replication. It appears to be working, but it actually isn't. So we have to take the master offline (actually just acquire a write lock on the DB, it can still answer SELECTs), tar up its (massive) database, scp it to the slaves, start the master, stop the slaves, untar the database, restart the slaves, and restart replication. The entire process can take several hours and it's easy to make mistakes. We put stickers on our MySQL servers saying "DO NOT REBOOT WITHOUT CONTACTING OPS MANAGEMENT," though unfortunately faulty DIMMs are illiterate.
  I don't know if PostgreSQL has similar problems, but I very much doubt that Oracle or DB2 do. I know that improved failover support has been a target of the PSQL developers for a little while now, so while it may not be on par with Oracle and DB2 it's probably closer than MySQL. At least for now.
  I wish this had prompted management to consider alternatives to MySQL, at least for our mission-critical database servers, but unfortunately it hasn't. They don't even see that we could sell an enterprise-level RDBMS as a significant feature - we're a webhosting company - and charge through the nose for it. Oh well. They don't listen to peons like me, they just make me fix MySQL replication every two weeks.
2. Re:mysql bad at disaster recovery? by ctr2sprt · 2005-02-21 15:48 · Score: 4, Interesting
  
  Mysql can handle reboots well.
  No. It can't. We have two concrete examples in this very page - one provided by Wikimedia, one provided by me - which directly contradict your statement. Maybe under some circumstances MySQL can handle reboots, but it's been proven already that it can't always do so. Perhaps your MySQL experience is not with high-load applications (at least not the level of load Wikimedia and my employer see).
  BTW, write a damn script. Mysql was written for unix, unix thrives on scripts. If you can't handle writing a script, why the hell are you a DB admin?
  Because the process doesn't lend itself well to scripting. For example, MySQL automatically releases locks when you close your connection to the DB. Presumably this is to avoid deadlocks and for other good reasons, but it's not trivial to write a script to do that. Also, since this is an important system, we don't like the idea of trusting computers to handle its repair: we want someone knowledgeable monitoring every step in case something doesn't work exactly right. I can of course sit there and watch the script do its thing, but that defeats the purpose of scripting the process in the first place.
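The obstacle described here is that the read lock only lives as long as the client session that took it, so a script has to keep one connection open across the whole copy step instead of issuing the lock from a one-shot command. A minimal sketch of that shape follows. `FakeConnection` is a hypothetical stand-in, not a real MySQL driver, so the example runs without a server; `read_locked` and `snapshot` are likewise illustrative names. With a real client library the same two statements would be sent over a single long-lived connection.

```python
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for a database session (hypothetical; not a real MySQL driver)."""

    def __init__(self):
        self.statements = []  # every statement sent on this session, in order
        self.locked = False
        self.open = True

    def execute(self, sql):
        if not self.open:
            raise RuntimeError("connection is closed")
        self.statements.append(sql)
        if sql == "FLUSH TABLES WITH READ LOCK":
            self.locked = True
        elif sql == "UNLOCK TABLES":
            self.locked = False

    def close(self):
        # Mirrors the behaviour described above: closing the session drops its locks.
        self.locked = False
        self.open = False


@contextmanager
def read_locked(conn):
    """Hold a global read lock on one open session for the duration of the block."""
    conn.execute("FLUSH TABLES WITH READ LOCK")
    try:
        yield conn
    finally:
        # Released explicitly, even if the copy step raises.
        conn.execute("UNLOCK TABLES")


def snapshot(conn, copy_step):
    """Run copy_step (for example the tar/scp stage) while the lock is held."""
    with read_locked(conn):
        return copy_step(conn)


if __name__ == "__main__":
    conn = FakeConnection()
    held = snapshot(conn, lambda c: c.locked)
    print("lock held during copy:", held)
    print("lock held afterwards:", conn.locked)
    conn.close()
```

This only addresses the lock-lifetime problem; it does nothing about the other objection, that someone knowledgeable should still watch each step.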
  Regardless, the difficulty of the task is not the main issue. The main issue is that we are dealing with north of 1GB of data here, and on busy servers on a busy network that means restarting replication takes an hour or longer. So not only is performance reduced by 33% when we take the slaves offline one at a time, performance is reduced further by the traffic of tar/scp in the background. Not to mention the fact that, because we have a lock on the master's DB, you can't even consider the DB cluster fully functional.
3. Re:mysql bad at disaster recovery? by defile · 2005-02-22 03:39 · Score: 2, Interesting
  
  No. It can't. We have two concrete examples in this very page - one provided by Wikimedia, one provided by me - which directly contradict your statement. Maybe under some circumstances MySQL can handle reboots, but it's been proven already that it can't always do so. Perhaps your MySQL experience is not with high-load applications (at least not the level of load Wikimedia and my employer see).
  
  I don't mean to diminish what you guys do, or question your abilities. I simply want to offer my perspective because I've been in similar situations.
  
  I ran an extremely high load site off of MySQL for about 4 years. It started out modestly and went up to around 2000-3000 queries/sec hitting the RDBMS, about 30% of them data updates.
  
  There were genuine cases where MySQL annoyed the hell out of me.
  
  For example, the use of pthreads is a huge pain in the ass under Linux because all of the thread stacks share address space. On a 32-bit platform and a large InnoDB buffer pool, it's easy to run out of pointer bits. Once we switched to a replicated setup this wasn't such a big deal anymore. Moving to a 64-bit platform would've made this a non-issue too, but we didn't have the luxury of doing this at the time.
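To make the address-space squeeze concrete, here is a rough back-of-the-envelope sketch. The figures are assumptions for illustration, not numbers from the post: 3 GiB of user address space (the common 3G/1G split on 32-bit Linux), a 2 GiB buffer pool, and 2 MiB per thread stack. `max_threads` is a hypothetical helper name.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def max_threads(user_space, buffer_pool, stack_size, other=0):
    """Upper bound on thread stacks that fit once the buffer pool is mapped.

    All sizes are in bytes. `other` covers code, heap and libraries and
    defaults to zero, so the result is optimistic.
    """
    remaining = user_space - buffer_pool - other
    if remaining <= 0:
        return 0
    return remaining // stack_size


if __name__ == "__main__":
    # Assumed values: 3 GiB user space, 2 GiB buffer pool, 2 MiB stacks.
    print(max_threads(3 * GIB, 2 * GIB, 2 * MIB))
```

Under those assumptions the ceiling is a few hundred connection threads before anything else is accounted for, which is why a smaller buffer pool per replica or a 64-bit platform relieves the pressure.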
  
  Regarding data corruption? Every time I've blamed MySQL 4 + InnoDB for ruining my data, I've been too quick to do so. MySQL is often the messenger of a real underlying problem.
  
  What underlying problem? Well, OS, disk, or RAM for starters. And these aren't always easy to find.
  
  I went to enormous lengths to verify that all of those things were working properly so I could blame MySQL. Kernel updates, memtest86, I even ran (VA-)CTCS on it for a week and the machine showed no problems. But every time we'd bring MySQL up on it we'd encounter data corruption within 20 days or so.
  
  The site continued to run off of the remaining servers and we just hoped MySQL wouldn't "corrupt" those as well while we tried to figure out what the problem was.
  
  Anyway, one day a few weeks later someone was playing around with the retired machine and found that 1 time out of 20, on boot it wouldn't find one of the hard disks. Oh, and this disk just happened to be used in the MySQL data partition. We replaced the disk container and the machine hasn't had a problem since and runs in production.
  
  We should've just thrown the fucker away.
  
  I've wasted a lot of time because I had more confidence in machines than I should've. Most computers have terrible reliability, even the ones that are marketed as being reliable. Most people just don't notice. MySQL notices. :(
Power outages suck. by goofyheadedpunk · 2005-02-21 14:39 · Score: 2, Interesting

Power outages suck, and a great way to protect from them is to distribute your project over a large area of electrical service.

I know the wikimedia folks are fundraising for more servers, but I wonder if this will provide more incentive to accept Google's offer?

--

What if the entire Universe were a chrooted environment with everything symlinked from the host?
Re:Coincidence... ;) by daveo0331 · 2005-02-21 14:40 · Score: 5, Interesting

On the other hand, subjecting the donation page to the Slashdot effect seems like a great way to reach the fundraising goal in no time. Assuming of course the page itself stays up.

Seriously though, if you like wikipedia, consider donating, even if it's just 5 bucks. I think it's even tax deductible if you itemize.

--
Remember the days when Republicans were the party of fiscal responsibility?
Re:Another indictment of MySql by ergo98 · 2005-02-21 14:51 · Score: 5, Interesting

No database can guarantee data integrity in the case of a power failure.

Barring a couple of extreme exceptions, of course a modern database system should protect integrity in the case of a power failure, or any other sudden system failure (kernel panic, GPF, whatever). In the case of the much maligned SQL Server, you can hit the power button all you want mid-transaction and you're going to get a blister on your finger before the database is corrupted.
Distributed Wikipedia? by femto · 2005-02-21 15:00 · Score: 2, Interesting

Isn't raising money for servers a short term solution? Surely the real solution is to invest time and effort into finding a way to distribute wikipedia across the 'net?
Google seems to have succeeded in building a distributed platform. What about something similar to seti@home, which takes a chunk of each user's disk space and bandwidth and uses them to implement a virtual computer on which wikimedia projects may be run?
Surely someone is already working on something like this (pointers anyone??)
URI to the Rescue by Doc+Ruby · 2005-02-21 15:03 · Score: 3, Interesting

This outage, as well as our beloved slashdotting, is yet another argument for URIs, rather than just URLs. URLs are like IP#s; they're absolute pointers to specific object locations, in terms of the storage/retrieval interface of a single instance. URIs are virtual, like domain names. They are distributed in DNS, a Netwide database, updated for current lookup values for actual retrieval. URLs need the same kind of layer. Of course, some other characteristics of these objects must be reflected in the URI model that are not appropriate to IP#/domain names, like multiple identical copies, or perhaps versions.

Just caching copies, either actively with a redirection URL, or passively in caching backbone webservers, isn't cutting it. Caching values is always better suited to solving performance problems, while creating its own concurrency and identity problems. Not to mention the publication limits of "opt-in" caches, like Coral or Google, which are an afterthought (and usually unknown) to the published object itself. Google has a huge, high-performance URL lookup system. It's taken quite a bit of value from the Internet, and all the content creators it rides on to derive all its value. It gives back quite a bit, with its simple, fast, effective interface. Google is perfectly positioned to make its name truly synonymous with an Internet revolution (not just a pinnacle of search evolution) by implementing URIs. If Google let objects get looked up by a URI code as simple as say, [A-Za-z0-9]+, it could get halfway to its namesake in objects with just 28 "digits"; just 7 digits would cover each object instance in its database right now, dozens of times over. If Google opened up such a URI protocol to anyone on the Web running such a "DIS" server, just like DNS, they could offload much of the work, avoid accusations of trying to "own the Internet", and improve their own service immeasurably, not least by making broken links in their database a quaint old curiosity. Will they rock our world, or will another big player, like Archive.org do it, before Microsoft, desperate to distinguish MSN Search, ruins it for everyone with some kind of proprietary hack that favors MS objects?
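The "digits" arithmetic can be checked directly. The alphabet [A-Za-z0-9] has 62 symbols, so a code of length n names 62^n objects. A short sketch (`keyspace` is an illustrative name, not part of any proposed protocol):

```python
import math

ALPHABET_SIZE = 26 + 26 + 10  # upper case, lower case, digits: [A-Za-z0-9]


def keyspace(length):
    """Number of distinct codes of exactly `length` symbols."""
    return ALPHABET_SIZE ** length


if __name__ == "__main__":
    for n in (7, 28):
        size = keyspace(n)
        # log10 gives the order of magnitude, which is what the comparison is about.
        print(n, size, round(math.log10(size), 2))
```

Seven symbols give about 3.5 × 10^12 codes, and 28 symbols give roughly 1.5 × 10^50, which is the square root of a googol (10^100) and so "halfway" there when counted in digits.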

--
--
make install -not war
1. Re:URI to the Rescue by Amit+J.+Patel · 2005-02-21 15:10 · Score: 2, Interesting
  
  URLs contain a domain name. Domain names already provide a level of indirection. Why can't we use that level of indirection for Wikipedia's problem? I don't see what URIs buy us -- if we're already not using the indirection we have, what does a second level give us?
Re:Integrity? by Jamesday · 2005-02-21 15:30 · Score: 3, Interesting

Yes. It's in our plans regardless of what happens with Google.
Re:They should ask for more... by brion · 2005-02-21 16:01 · Score: 4, Interesting

Our database masters do have dual power supplies. The circuit breakers were tripped on both sides.

--
Chu vi parolas Vikipedion?
Re:Easy, brain-dead sql db recovery (if possible) by Tough+Love · 2005-02-21 17:16 · Score: 3, Interesting

A completely designed, 100% empty database.

A COMPLETE log of all the SQL statements that were applied to it IN the order they were used. This is obtained by the application logging the SQL statements to the SQL log file AFTER the SQL statement is successfully executed.

When a database failure occurs, stop everything, 'replay' the backed up SQL logfile (that's on a separate backup system) on a copy of the empty DB there. TADA! you are back in business back to the point of failure!

Read the Wikipedia page. That's exactly what they've done, but because the MySQL database got corrupted, instead of just falling back a few minutes, they may have to go right back to a full backup and replay the log since then, which takes a lot more time than replaying a few transactions.

The solution is to switch to a database that actually implements ACID (the second letter stands for "Consistency" and the last letter stands for "Durability" which is what failed here).
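The log-and-replay scheme quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses SQLite from the Python standard library rather than MySQL, an in-memory list stands in for the log file kept on a separate backup system, and the `pages` table and all function names are made up for the example.

```python
import sqlite3

# Hypothetical one-table schema standing in for the "100% empty database".
SCHEMA = "CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"


def empty_db():
    """Return a fresh in-memory database containing only the schema."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    db.commit()
    return db


def apply_and_log(db, log, sql, params=()):
    """Execute a statement and append it to the log only after it succeeds."""
    db.execute(sql, params)
    db.commit()
    log.append((sql, params))


def replay(log):
    """Rebuild a database by replaying the log, in order, onto an empty copy."""
    db = empty_db()
    for sql, params in log:
        db.execute(sql, params)
    db.commit()
    return db


def rows(db):
    return db.execute("SELECT id, title FROM pages ORDER BY id").fetchall()


if __name__ == "__main__":
    log = []
    live = empty_db()
    apply_and_log(live, log, "INSERT INTO pages (id, title) VALUES (?, ?)", (1, "Main Page"))
    apply_and_log(live, log, "UPDATE pages SET title = ? WHERE id = ?", ("Main_Page", 1))
    restored = replay(log)
    print(rows(restored) == rows(live))
```

As the reply points out, the cost is replay time: the log has to be replayed from the last good starting point, so the longer ago that was, the longer recovery takes.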

--
When all you have is a hammer, every problem starts to look like a thumb.
Re:170 gigs? by brion · 2005-02-21 17:42 · Score: 2, Interesting

The vast majority of this space is taken up by revision histories (and those are compressed!) Periodic database dumps are available for download. Image and multimedia uploads have been taking up a bigger share lately, but those are on a separate server which recovered just fine.

A German company has published an end-user-friendly CD-ROM of material from the German-language Wikipedia, but afaik no one's published an English-language edition yet.

--
Chu vi parolas Vikipedion?
Re:What Happened. by Leo+McGarry · 2005-02-21 18:16 · Score: 3, Interesting

What constitutes a "real" datacenter?

One that complies with building and safety codes, for starters. In every jurisdiction with which I'm familiar -- admittedly not even close to all of them -- it's actually against the law to have a battery unit inside a data center cage. It's a violation of the safety code. When fire and rescue personnel go into a commercial building, they have to be sure that the power is really off. If there's a battery lying around somewhere, shorting to ground through a desk or door frame for instance, it can cause big problems.

Ask around. I bet you'll find that your data center explicitly forbids customer-installed battery units.
Re:Coincidence... ;) by Anonymous Coward · 2005-02-21 18:54 · Score: 1, Interesting

That's not "SYSTEMS", it's "System 5 R3.0"

In other words, Wikipedia runs on SCO technology. By the Power of Moroni, SCOPOWER!!!
Re:Xenu Strikes Again! by Silentnite · 2005-02-21 19:21 · Score: 3, Interesting

If only I were a mod. Informative, and just plain funny if you ask me. I've read about that entire thing going back and forth and it's kinda odd. On the one hand I think that Wikipedia should be limited to who can change it. But on the other it's really neat and diverse to let everybody at it.

Oh well.. Slightly OT
Linux Kernel bug?!? by peterwilm · 2005-02-21 21:34 · Score: 2, Interesting

I recall a discussion about fsync not being properly implemented both in Linux Kernel 2.4 as well as 2.6. I think it was patched in 2.6.9 or so, but not in 2.4.

Unfortunately, I cannot find the thread any more. Does anybody remember?

So, this might be rather a linux kernel bug, not a mysql bug.

Secondly, why does everybody say that mysql does not support ACID-transactions? MySQL does advertise them. Are you talking about pre-4.0 MyISAM tables? Or do you suggest that 4.0/4.1 InnoDB-tables aren't ACID-compliant either?
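For context on why fsync matters to this question: a database can only promise durability if a commit returns after the data has actually reached stable storage, and on POSIX systems fsync is the call meant to provide that. A minimal sketch of a durable append follows. It is not MySQL's or InnoDB's code, and `durable_append` is a hypothetical helper name.

```python
import os
import tempfile


def durable_append(path, record: bytes):
    """Append a record and return only after the OS reports it flushed to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(record)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file's data to the storage device.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "commit.log")
        durable_append(log_path, b"BEGIN;UPDATE 1;COMMIT\n")
        with open(log_path, "rb") as fh:
            print(fh.read())
```

If fsync returns before the data is truly on disk, whether through a kernel bug or a drive that acknowledges writes from its cache, the application has no way to notice, which is the scenario raised above.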
Re:Coincidence... ;) by fredrikj · 2005-02-21 22:21 · Score: 4, Interesting

On the other hand, subjecting the donation page to the Slashdot effect seems like a great way to reach the fundraising goal in no time. Assuming of course the page itself stays up.

You do know that Wikipedia receives something like 100 times the traffic Slashdot does, right?
Re:Coincidence... ;) by David+Gerard · 2005-02-21 22:50 · Score: 2, Interesting

I can now see why Kate NEVER EVER emerges from her heavily-armed bunker in Oxfordshire.

--
http://rocknerd.co.uk
Re:Coincidence... ;) by DA_Chef · 2005-02-22 02:54 · Score: 3, Interesting

Something like that, yes: Alexa's statistics