Power Outage Takes Wikimedia Down
Baricom writes "Just a few weeks after a major power outage took out well-known blogging service LiveJournal for several hours, almost all of Wikimedia Foundation's services are offline due to a tripped circuit breaker at a different colo. Among other services, Wikimedia runs the well-known Wikipedia open encyclopedia. Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers. They have established an off-site backup of the fundraising page here until power returns."
Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state
Ya know, I just don't understand why so many projects with such high visibility and requirements for reliability use a toy database like MySQL.
Someone PLEASE tell me why. Because right now the only thing I can think is that people just don't know how to pronounce "Postgres".
Right, because we all know money grows on trees...
To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
There's a simple way around this: stick to PostgreSQL, MSSQL, Oracle, DB2, or some other real database. MySQL doesn't make the grade, precisely because things like this can happen.
No no, but with the Google deal looming, the tin-foil-hatters are paying close attention to Wikipedia, and every little thing gets overly scrutinized.
To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
IIRC, that's the Fire Code. The breaker needs to be able to unconditionally kill all power inside the facility. Thus -- it kills the power post-UPS.
Sometimes it costs more to do things wrong, in the long term, than to do them right.
500GB of disk, 5TB of transfer, $5.95/mo
Hey man, they have their traffic doubling every 4 months, they NEVER planned for this success this early. Building infrastructure is hard when you never plan for it.
Kudos to Wikimedia for actually explaining what happened and not just putting a "This page is down, please try again later" message up. Many people/companies/groups/etc would be too proud or too afraid of bad publicity to actually explain the problem.
Don't save Windows XP! http://www.petitiononline.com/jjw1xp/petition.html
- Distributed caches - the majority of hits are now served by caches, and some of them are offsite. It was a pilot project for a while and now we're trying to design and build scalable infrastructure for that. But still, lots of edits are served uncached.
- Distributed file systems - are there any? NFS is a single-server system, MS has something, PVFS has no redundancy, GoogleFS is closed and not released, Coda, AFS, all of those just don't work. Right now we're trying to develop a MogileFS (the Perl-based app-level file storage by LiveJournal) store, and sure, there are other ideas.
- Distributed database - there are no proper open-source multimaster solutions for large databases. MySQL with replication and a transactional data store is used. In this event it would be great to have a second datacenter nearby with additional DB replicas and a gigabit interconnection, but that costs money. And app-level bidirectional replication is planned for both MySQL and PostgreSQL. And SAN deployment is too costly.
And yes, MediaWiki code has PostgreSQL support, but migrating from one database to another without proper tests, benchmarks and insurance isn't very mature.
You do know that in real datacenters you don't have a UPS on each PC, but a UPS for the ROOM, and between this UPS and your servers you are going to need breakers, so if you put too many things on a circuit it may cause problems, as simple as that.
Something still doesn't add up. Even if a backup generator autostarts successfully, there's a significant delay between mains failure, switchover, and the generator picking up the load. That's usually a few seconds or more, too long for a computer to run off the residual charge in its power supply filter caps. There would still have been an inverter-charger somewhere to keep the equipment running until the generator was fired up. Sounds like somebody screwed up, either by tripping the wrong breaker, or by designing the facility improperly to begin with.
The higher the technology, the sharper that two-edged sword.
Why were they not using battery backup on their database servers (i.e., their critical servers)? That way the servers would have the necessary 10 minutes (or whatever) so that they can shut down the DBs and power off the systems.
This is a negligible cost for something as integral as an active sync with the work that people have performed - for free.
Why is this not seen as important? "The wiki users will just recreate the material"? That's somewhat presumptuous.
Now, LiveJournal I can understand not doing this (as there are many clients which allow people to sync with their online journals and the material is fairly culturally worthless), but Wikipedia? It's one of the better things on the Internet.
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
170GB isn't that big and people routinely run far more critical stuff without any kind of exotic seti@home-like distribution. What's really inexcusable is the fact that a power failure caused database corruption that turned a 2 minute power outage into major downtime.
I'd rather just agree to disagree on this one; at this point it's all just what we have observed. It heavily depends on the situation, how the db is set up, etc.
As far as the script, yes, it does have locks, and rightly so. It's not terribly tough to write a lock aware script. In my opinion, the replication setup is extremely easy to script. I'd much rather script it than sit in front of the console. Once I see it work, I know it will work every time, and I won't worry about something like me or a peer mistyping the server-id at 4:00am. Even at 20GB, it can't be terribly long at 100Mb/s.
You only need the lock on the master while you're tar'ing the snapshot for distribution to the other servers. Once it's tar'd, unlock the master, gzip, redistribute, tar zxvf, set up the slave and it will catch up.
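The ordering described above can be sketched as a dry-run plan in Python. This is only an illustration under assumptions: the function names `snapshot_plan` and `transfer_seconds`, the data directory and the host name are hypothetical, and the code only builds strings, it runs nothing. FLUSH TABLES WITH READ LOCK, SHOW MASTER STATUS, UNLOCK TABLES, CHANGE MASTER TO and START SLAVE are real MySQL statements, but whether a file-level copy taken under a read lock is safe for InnoDB data depends on the setup, so treat that part as an assumption too.

```python
def snapshot_plan(datadir, slaves):
    """Return the ordered steps for seeding slaves from a master snapshot.

    Nothing is executed; the strings only document the order.
    `datadir` and the slave host names are hypothetical placeholders.
    """
    steps = [
        # The read lock only lasts while the session that took it stays open.
        "master-sql: FLUSH TABLES WITH READ LOCK",
        "master-sql: SHOW MASTER STATUS",             # note binlog file/position
        f"master-sh: tar cf snapshot.tar {datadir}",  # the only step under lock
        "master-sql: UNLOCK TABLES",                  # master takes writes again
        "master-sh: gzip snapshot.tar",               # compress outside the lock
    ]
    for host in slaves:
        steps += [
            f"master-sh: scp snapshot.tar.gz {host}:",
            f"{host}-sh: tar zxvf snapshot.tar.gz",
            f"{host}-sql: CHANGE MASTER TO ... (file/position noted above)",
            f"{host}-sql: START SLAVE",  # slave replays the binlog to catch up
        ]
    return steps


def transfer_seconds(size_gb, link_mbit_per_s):
    """Best-case copy time: decimal gigabytes over a fully used link."""
    return size_gb * 8 * 1000 / link_mbit_per_s
```

For the 20GB at 100Mb/s case, `transfer_seconds(20, 100)` gives 1600 seconds, so roughly 27 minutes per uncompressed copy before any protocol overhead, and the master is only locked for the tar step.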
The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state.
I don't get it, then why the fuck bother with InnoDB. Transactions/ACIDity imply a performance penalty over just cache and async write of a direct image. One pays this penalty for the benefits (usually critical for many applications) of data integrity and robustness. How would you like your bank to run on MySQL?
This is the dumbest thing I've ever heard. I used to tell MySQL weenies that their DBMS sucked because it had no transaction support, then recently these annoying inbred fuckwits tell me that MySQL is just as good as Oracle because it has InnoDB support (we'll let the fact that the schema is kept in the shitball format slide)...Well apparently these morons don't have a fucking clue what transaction processing really means. Usually COMMIT and ROLLBACK are supposed to actually mean something... and even working 90% of the time doesn't cut it.
I would never donate to this goddamn Wikipedia project as long as I know that the funds are going to end up being sapped to support their crippled shitball database.
URIs are a superset of URLs and URNs. I think what you're talking about is a URN, isn't it? These are the URIs that specifically name something uniquely (for example, urn:isbn:1902593790 or urn:oid:1.3.6.1.4.1.20115) but don't necessarily help you locate it at a specific place.
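A quick way to see the difference, as a rough sketch: Python's standard `urllib.parse.urlparse` splits any URI by scheme, and a URN comes back with no network location at all. The helper name `describe` and the `example.org` address are made up for illustration.

```python
from urllib.parse import urlparse


def describe(uri):
    """Classify a URI by scheme: a URN (a name) or something locator-style."""
    parts = urlparse(uri)
    kind = "URN" if parts.scheme == "urn" else "URL-like"
    # A URN has an empty netloc: it names a thing without saying where it is.
    return kind, parts.netloc, parts.path


isbn = describe("urn:isbn:1902593790")
page = describe("http://example.org/book")  # hypothetical address
```

Here `isbn` has an empty host and the path `isbn:1902593790`, while `page` carries the host `example.org` that tells you where to go looking.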
Liberty in your lifetime
If Hunter S. Thompson were still alive, he'd be making fun of himself for killing himself.
I don't care why you're posting AC
From the wikipedia page:
At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers. (Yes, even the machines with dual power supplies -- both circuits got shut off.)
After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.
The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state.
(Bolding mine.) This proves that MySQL is not ACID: there is no way that a power outage is supposed to cause corruption in a database. This is not a troll, this is a simple conclusion. I really think that Wikipedia should switch to PostgreSQL, which is considerably more mature in terms of ACID compliance.
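For anyone unsure what the atomicity part of that promise looks like, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in, not MySQL or InnoDB, and the `accounts` table, `transfer` and `demo` are hypothetical names. It only shows that a rolled-back transaction leaves no partial changes within one process; it does not simulate recovery after a power cut.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two rows as one transaction, or not at all."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        (low,) = conn.execute("SELECT MIN(balance) FROM accounts").fetchone()
        if low < 0:
            raise ValueError("insufficient funds")
        conn.commit()    # both updates become visible together
        return True
    except ValueError:
        conn.rollback()  # both updates are undone together
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    ok = transfer(conn, "a", "b", 30)       # fits, so it commits
    failed = transfer(conn, "a", "b", 500)  # overdraws, so it rolls back
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    return ok, failed, balances
```

After `demo()`, the balances are 70 and 30: the failed transfer left no half-applied update behind, which is the behavior the comment above expects to hold across a crash as well.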
When all you have is a hammer, every problem starts to look like a thumb.
They were lucky with that server?
I mean, if only a few servers' databases survived, that may speak more of random luck: they happened to be in a state where nothing bad happened when the power outage occurred. If all of the databases survived, that speaks of MySQL being resistant to this sort of thing.
Beware: In C++, your friends can see your privates!
Must you really know what the money is being spent on?
If you donate money, you are asking them to continue to offer their great service to you and other people. How they achieve that goal is up to them, no?
You don't ask the Red Cross what they use your money for, do you? The organisation usually tells you afterwards.
Sometimes the history of an article says just as much (if not more) than the article itself.