Power Outage Takes Wikimedia Down
Baricom writes "Just a few weeks after a major power outage took out well-known blogging service LiveJournal for several hours, almost all of Wikimedia Foundation's services are offline due to a tripped circuit breaker at a different colo. Among other services, Wikimedia runs the well-known Wikipedia open encyclopedia. Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers. They have established an off-site backup of the fundraising page here until power returns."
They'll turn the lights off.
Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers.
"You see, guys? This is what could happen if we ever ran out of money. Now cough up some dough!"
The coolest voice ever.
What happened?
At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers.
What's wrong?
After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.
The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state. Attempting to bring up the master database and one of the slaves immediately after the downtime showed corruption in parts of the database. We're currently running full backups of the raw data on two other database slave servers prior to attempting recovery on them (recovery alters the data).
If these machines also can't be recovered, we may have to restore from backup and replay log files which could take a while.
After returning from the power outage, the servers have just been slash-fried.
If they bought actual servers with dual power supplies and got power from multiple PDUs at their data center, they would be much better off. If this is really because of a tripped breaker, then it's pretty inexcusable, since dual power supplies fed from separate circuits would have prevented it... unlike the LJ outage which was from the power being cut to all circuits.
But if they're going to cobble together some whitebox crap servers, and not change the architecture, they'll be right back to an outage next time it happens.
500GB of disk, 5TB of transfer, $5.95/mo
Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state
Ya know, I just don't understand why so many projects with such high visibility and requirements for reliability use a toy database like MySQL.
Someone PLEASE tell me why. Because right now the only thing I can think is that people just don't know how to pronounce "Postgres".
This is not a troll or a flame at all, but between this and the LiveJournal servers, it sure sounds like hell if your MySQL servers ever go down unexpectedly.
Is MySQL the only database like this, or does Postgres get corrupted as well during unplanned downtime? If I recall from using MSSQL servers, we never had a problem like this. We would simply reboot the servers and not worry about tables being left in unrecoverable states. Please correct me if I am wrong though.
Is there any way around this, or will this always be a problem with MySQL?
Power outages suck, and a great way to protect from them is to distribute your project over a large area of electrical service.
I know the wikimedia folks are fundraising for more servers, but I wonder if this will provide more incentive to accept Google's offer?
What if the entire Universe were a chrooted environment with everything symlinked from the host?
I found this useful information about power outages:
http://www.wikipedia.org/search?/power_outage
A power outage has taken down wikipedia! as a community we must carry the torch!
Paul Grosfield - the quicker picker upper.
Even when the servers go back on, they'll be slashdotted.
As that economic genius, Eric Cartman taught us:
1) Get something other people love
2) Don't let them use it
3) Profit!
It doesn't hurt if you are running a fund drive at the same time, either.
To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
Meanwhile, the devs are working fairly furiously to get it back up
;)
Don't worry, we'll take care of your backup servers in the meantime.
The coolest voice ever.
So far one of our database servers has completed a successful recovery (we're working through them all). On a gigabit link it takes something between 90 minutes and 4 hours to rsync from one to another. As soon as we have two database servers working, we'll be restoring service in read only mode. Likely to be that 90 minutes to 4 hours from now as worst case.
I'll post followups to this post later, as we're closer to being fully recovered.
Google seems to have succeeded in building a distributed platform. What about something similar to seti@home, which takes a chunk of each user's disk space and bandwidth and uses them to implement a virtual computer on which wikimedia projects may be run?
Surely someone is already working on something like this (pointers anyone??)
it's as though 300,000 people cried out and were suddenly silenced ...
and then somebody diffed the change and made them speak again
This outage, as well as our beloved slashdotting, is yet another argument for URIs, rather than just URLs. URLs are like IP#s; they're absolute pointers to specific object locations, in terms of the storage/retrieval interface of a single instance. URIs are virtual, like domain names. They are distributed in DNS, a Netwide database, updated for current lookup values for actual retrieval. URLs need the same kind of layer. Of course, some other characteristics of these objects must be reflected in the URI model that are not appropriate to IP#/domain names, like multiple identical copies, or perhaps versions.
Just caching copies, either actively with a redirection URL, or passively in caching backbone webservers, isn't cutting it. Caching values is always better suited to solving performance problems, creating its own concurrency and identity problems. Not to mention the publication limits of "opt-in" caches, like Coral or Google, which are an afterthought (and usually unknown) to the published object itself.
Google has a huge, high-performance URL lookup system. It's taken quite a bit of value from the Internet, and all the content creators it rides on to derive all its value. It gives back quite a bit, with its simple, fast, effective interface. Google is perfectly positioned to make its name truly synonymous with an Internet revolution (not just a pinnacle of search evolution) by implementing URIs. If Google let objects get looked up by a URI code as simple as, say, [A-Za-z0-9]+, it could get halfway to its namesake in objects with just 28 "digits"; just 7 digits would cover each object instance in its database right now, dozens of times over.
If Google opened up such a URI protocol to anyone on the Web running such a "DIS" server, just like DNS, they could offload much of the work, avoid accusations of trying to "own the Internet", and improve their own service immeasurably, not least by making broken links in their database a quaint old curiosity. Will they rock our world, or will another big player, like Archive.org, do it, before Microsoft, desperate to distinguish MSN Search, ruins it for everyone with some kind of proprietary hack that favors MS objects?
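For anyone who wants to check the digit counts above, here is a quick sketch of the arithmetic. It assumes fixed-length codes over the 62 characters in [A-Za-z0-9], and it reads "halfway to its namesake" as roughly 10^50 (half the exponent of a googol, 10^100); both readings are assumptions, and the function name is made up for illustration.

```python
import math

# 26 uppercase + 26 lowercase + 10 digits, matching [A-Za-z0-9]
ALPHABET_SIZE = 26 + 26 + 10


def id_space(length: int) -> int:
    """Number of distinct fixed-length codes of the given length."""
    return ALPHABET_SIZE ** length


if __name__ == "__main__":
    # 7 characters: about 3.5 trillion distinct codes
    print(id_space(7))
    # 28 characters: the base-10 exponent lands just above 50
    print(math.log10(id_space(28)))
```

So 7 characters give a few trillion codes, and 28 characters give a number with an exponent a little over 50.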
--
make install -not war
Because the power didn't actually go out?
You can look things up on answers.com. They mirror Wikimedia, as well as other dictionaries/encyclopedias.
Yes. It's in our plans regardless of what happens with Google.
Yes. I wrote that cached page and it's now a bit out of date. If (and it's not certain) local fire regulations permit the use of UPS systems in the racks, we're going to be installing them. We decided on that after LiveJournal's unfortunate experience, but don't yet have them.
As soon as I saw "Power corrupts. Power failure corrupts absolutely" I thought, the damn commies finally did it! But no, not hacked by commies...just by a renegade circuit breaker.
www.kiwilyrics.com - a wiki for lyrics
Something still doesn't add up. Even if a backup generator autostarts successfully, there's a significant delay between mains failure, switchover, and the generator picking up the load. That's usually a few seconds or more, too long for a computer to run off the residual charge in its power supply filter caps. There would still have been an inverter-charger somewhere to keep the equipment running until the generator was fired up. Sounds like somebody screwed up, either by tripping the wrong breaker, or by designing the facility improperly to begin with.
The higher the technology, the sharper that two-edged sword.
Why were they not using battery backup on their database servers (i.e., their critical servers)? That way the servers would have the necessary 10 minutes (or whatever) to shut down the DBs and power off the systems.
This is a negligible cost for something as integral as an active sync with the work that people have performed - for free.
Why is this not seen as important? "The wiki users will just recreate the material"? That's somewhat presumptuous.
Now, livejournal I can understand not doing this (as there are many clients which allow people to sync with their online journals and the material is fairly culturally worthless), but wikipedia? It's one of the better things on the Internet.
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Fire code. When someone hits the Big Red Button, all electrical power in the server room must be out. Therefore, UPSs can't be located in server racks (or if they are, you need to go to the effort of wiring them into the BRB).
"They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
So, _this_ is where I should be posting my outage reports! And here I've been sending them only to people who would care.
"Slashdot...outage reports for nerds! Stuff that doesn't matter to me!"
Lol!
-buf
From the wikipedia page:
At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers. (Yes, even the machines with dual power supplies -- both circuits got shut off.)
After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.
The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state.
(Bolding mine.) This proves that MySQL is not ACID, there is no way that a power outage is supposed to cause corruption in a database. This is not a troll, this is a simple conclusion. I really think that Wikipedia should switch to PostgreSQL, which is considerably more mature in terms of ACID compliance.
When all you have is a hammer, every problem starts to look like a thumb.
The colocation facility has diesel generators to protect against the outside power going out. Thanks to the miracle of circuit breakers, power circuits inside the facility shut off (including both circuits feeding our dual-power supply machines).
Chu vi parolas Vikipedion?
A completely designed, 100% empty database.
A COMPLETE log of all the SQL statements that were applied to it, IN the order they were used. This is obtained by the application logging the SQL statements to the SQL log file AFTER the SQL statement is successfully executed.
When a database failure occurs, stop everything and 'replay' the backed-up SQL logfile (that's on a separate backup system) on a copy of the empty DB there. TADA! You are back in business, right up to the point of failure!
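A minimal sketch of that empty-database-plus-statement-log idea, using Python's built-in sqlite3 module rather than MySQL. The table layout, the function names, and the in-memory list standing in for the log file on the separate backup system are all made up for illustration; this is not how Wikimedia's setup works.

```python
import sqlite3

# Hypothetical schema standing in for the "completely designed, empty database".
SCHEMA = "CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"


def new_empty_db() -> sqlite3.Connection:
    """Create a fresh database that contains the schema and no rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def execute_and_log(conn, log, sql, params=()):
    """Run one statement and append it to the log only after it committed."""
    conn.execute(sql, params)
    conn.commit()
    log.append((sql, params))


def replay(log) -> sqlite3.Connection:
    """Rebuild the data by applying the logged statements, in order, to an empty copy."""
    conn = new_empty_db()
    for sql, params in log:
        conn.execute(sql, params)
    conn.commit()
    return conn


def dump(conn):
    """Return all rows in a stable order so two databases can be compared."""
    return conn.execute("SELECT id, title, body FROM page ORDER BY id").fetchall()


if __name__ == "__main__":
    log = []  # stand-in for the SQL log file kept on a separate system
    live = new_empty_db()
    execute_and_log(live, log, "INSERT INTO page (title, body) VALUES (?, ?)", ("Xenu", "v1"))
    execute_and_log(live, log, "UPDATE page SET body = ? WHERE title = ?", ("v2", "Xenu"))
    restored = replay(log)
    print(dump(restored) == dump(live))
```

One caveat with logging after the commit: a crash between the commit and the log write leaves that last statement out of the log, so the replayed copy can trail the live database by one statement.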
Read the Wikipedia page. That's exactly what they've done, but because the MySQL database got corrupted, instead of just falling back a few minutes, they may have to go right back to a full backup and replay the log since then, which takes a lot more time than replaying a few transactions.
The solution is to switch to a database that actually implements ACID (the second letter stands for "Consistency" and the last letter stands for "Durability" which is what failed here).
When all you have is a hammer, every problem starts to look like a thumb.
He thinks I'm a god now!
perhaps I just inadvertently reached clear
The vast majority of this space is taken up by revision histories (and those are compressed!) Periodic database dumps are available for download. Image and multimedia uploads have been taking up a bigger share lately, but those are on a separate server which recovered just fine.
A German company has published an end-user-friendly CD-ROM of material from the German-language Wikipedia, but afaik no one's published an English-language edition yet.
Chu vi parolas Vikipedion?
http://www.neutelligent.com/
I find it an interesting coincidence the power outage happened so soon after that the Xenu article was featured.
Gee, you just had to mention the X-word! Now this thread won't load for most Scientologists because the keyword filters they were forced to install by their Church will see "Xenu" and block the site. After all the mere sight of the word could cause "pneumonia and death" if you haven't paid the Church of Scientology for the proper preparation.
Wikipedia's Xenu article has an interesting history if you look, as I did the other night when it was featured. Scientologists vandalize it regularly. You're supposed to pay them a half million (or some absurd sum of money) to find out about Xenu. After you find out, you're too embarrassed to admit to anybody that you paid a half million to learn that your problems are caused by bad science fiction, when you could have bought a house in Silicon Valley instead. So they obviously don't want a Wikipedia article giving away their half-million-dollar "trade secret" for free.
One trick I saw was to use HTML entities to spell out insults at the top of the article- like "only an idiot would believe this" or something. In the editor window, the entities weren't rendered and each letter appeared as a hex code.
A more effective attack took a different approach. The vandal in this case changed "Scientologists" to "Muslims", "Scientology" to "Islam", and inserted a boring-sounding sentence at the end of the first paragraph claiming that "Xenu" is another name that Muslims use for "Allah". It completely discouraged you from reading further. If you didn't know better you wouldn't find out how "Allah" distributed the thetans around volcanoes on various planets and blew them up with hydrogen bombs, and how their blown-up spirits cause problems in your personal life today.
This is OT, but what the hell, why not whack a beehive? Additional information on Xenu:
Operation Clambake (Hubbard maintained that humans are descended from clams)
The Xenu leaflet (all about Xenu- this information can save you lots of $$$$$)
The road to Xenu (authored by a woman who got suckered)
The Google cache of Wikipedia's Xenu article is also a must read.
I'm wondering if I'll get a lot of freaks, downmoderations, and hostile AC replies after I post this. After all, that's the kind of thing that Hubbard called "fair game". If it sinks below default visibility I'll repost it again with my karma bonus, so you theta-clear-wannabes out there can save your points for someone else.
Apparently one of their MySQL databases got corrupted as well. Figures. You'd think with all that volume they'd be wise enough to use a DB that can withstand a hard powercycle without losing data.
Just remember, friends don't let friends use MySQL for important data.
:::eyes my UPS::::
::::ponders for a moment::::
:::eyes the serial cable that gracefully shuts down said computer in the event of a power failure::::
:::ponders some more::::
:::eyes the spare UPS sitting in the corner that used to be connected to a database server::::
Hmm, I think I'm almost onto something here, but I just can't seem to nail it down...
-Chris
--an unbreakable toy is useful for breaking other toys--
And going strong!
The link in the article is broken, here's the proper one:
http://wikimedia.org/fundraising/
Beware: In C++, your friends can see your privates!
If only I were a mod. Informative, and just plain funny if you ask me. I've read about that entire thing going back and forth and it's kinda odd. On the one hand, I think Wikipedia should limit who can change it. But on the other, it's really neat and diverse to let everybody at it.
Oh well.. Slightly OT
So now they'll have to put up a page to say "The temp page that says that our site is down is down. We are working around the clock to get the temp page back up."
"When the atomic bomb goes off there's devastation...but when the atomic bong goes off there's celebraaaaation!"
I recall a discussion about fsync not being properly implemented both in Linux Kernel 2.4 as well as 2.6. I think it was patched in 2.6.9 or so, but not in 2.4.
Unfortunately, I cannot find the thread any more. Does anybody remember?
So, this might be rather a linux kernel bug, not a mysql bug.
Secondly, why does everybody say that MySQL does not support ACID transactions? MySQL does advertise them. Are you talking about pre-4.0 MyISAM tables? Or do you suggest that 4.0/4.1 InnoDB tables aren't ACID-compliant either?
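For readers who haven't met fsync: a rough sketch of what an application does when it wants a write to survive a power cut. This shows the general POSIX pattern from Python, not MySQL's or InnoDB's actual code, and the function and file names are made up for illustration.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data, then ask the OS to push it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush its cache to the device
    if os.name == "posix":
        # Syncing the parent directory makes the new directory entry durable too.
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "page.bin")  # hypothetical file name
        durable_write(target, b"committed row")
        with open(target, "rb") as f:
            print(f.read())
```

The guarantee is only as good as the layers underneath: if the kernel, controller, or drive reports the flush as done while the data still sits in a volatile cache, the application cannot tell.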
Every main database server had corrupt database pages. That is, 3 systems with battery-backed write-caching controllers and SCSI drives, and 2 SATA systems with write-caching SATA controllers (one battery-backed, the other not) using drives from two different SATA disk makers.
Involved:
Obvious speculation involves the controllers not telling the drives to turn off write buffering, or the drives not listening. No point in getting into SCSI vs. SATA or this-controller vs. that-controller fights when there's this much variation involved.
Posted on the mailing list wikipedia-l 32 minutes ago:
From: Brion Vibber
Reply-To: wikipedia-l@wikimedia.org
To: Wikipedia-l, Wikimedia Foundation Mailing List, Wikimedia developers
Date: Tue, 22 Feb 2005 04:47:56 -0800
Subject: Re: [Wikipedia-l] Wiki Problems?
Brion Vibber wrote:
> There was some sort of power failure at the colocation facility. We're
> in the process of rebooting and recovering machines.
The power failure was due to circuit breakers being tripped within the colocation facility; some of our servers have redundant power supplies but *both* circuits failed, causing all our machines and the network switch to unceremoniously shut down.
Whether a problem in MySQL, with our server configurations, or with the hardware (or some combination thereof), most of our database servers managed to glitch the data on disk when they went down. (Yes, we use InnoDB tables. This ain't good enough, apparently.)
The good news: one server maintained a good copy, which we've been copying to the others to get things back on track. We're now serving all wikis read-only.
The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170 GB of data to each DB server we have to apply the last day's update logs before we can restore read/write service.
I don't know when exactly we'll have everything editable again, but it should be within 12 hours.
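A back-of-the-envelope check on those copy times, combining the 170 GB figure here with the gigabit link and the 90-minute-to-4-hour rsync range from the earlier status update in this thread. It assumes decimal gigabytes and that the earlier rsync moved about the same amount of data; neither is confirmed by the posts, and the function names are made up for illustration.

```python
def transfer_seconds(size_bytes: float, link_bits_per_s: float) -> float:
    """Ideal copy time if the link ran at full line rate with no overhead."""
    return size_bytes * 8 / link_bits_per_s


def effective_mbit_per_s(size_bytes: float, seconds: float) -> float:
    """Average throughput implied by copying size_bytes in the given time."""
    return size_bytes * 8 / seconds / 1e6


SIZE_BYTES = 170e9  # assumption: 170 GB, decimal
GIGABIT = 1e9       # bits per second

if __name__ == "__main__":
    print(transfer_seconds(SIZE_BYTES, GIGABIT) / 60)      # minutes at line rate
    print(effective_mbit_per_s(SIZE_BYTES, 90 * 60))       # Mbit/s if it takes 90 minutes
    print(effective_mbit_per_s(SIZE_BYTES, 4 * 60 * 60))   # Mbit/s if it takes 4 hours
```

Under those assumptions the wire alone would need a bit under 23 minutes, so the quoted 90 minutes to 4 hours works out to somewhere between roughly 250 and 95 Mbit/s, which suggests the network link is not the limiting factor.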
Imagine what would happen if there were a link on Wikipedia's main site to Slashdot and from Slashdot back to Wikipedia... Boom?
Except it's now been a few years since MySQL incorporated InnoDB, so maybe it's time to move on and rejoice that it's now one of the free database servers with ACID support? This one happens to come with standard replication and fulltext search, and with a range of other engines to choose from. PostgreSQL, last I knew, doesn't have built-in replication, fulltext search and alternative storage engines, but has its own particular strengths. In the end, every end user gets to benefit from the competition between excellent tools. Good for us all to be happy about that.
Hi!
InnoDB has used WAL since I wrote it in the mid-1990s. To PostgreSQL, WAL came later, around 2000.
Regards, Heikki
Innobase Oy