Power Outage Takes Wikimedia Down
Baricom writes "Just a few weeks after a major power outage took out well-known blogging service LiveJournal for several hours, almost all of Wikimedia Foundation's services are offline due to a tripped circuit breaker at a different colo. Among other services, Wikimedia runs the well-known Wikipedia open encyclopedia. Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers. They have established an off-site backup of the fundraising page here until power returns."
They'll turn the lights off.
Coincidentally, the foundation is in the middle of a fundraising drive to pay for new servers.
"You see, guys? This is what could happen if we ever ran out of money. Now cough up some dough!"
The coolest voice ever.
What happened?
At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers.
What's wrong?
After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.
The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state. Attempting to bring up the master database and one of the slaves immediately after the downtime showed corruption in parts of the database. We're currently running full backups of the raw data on two other database slave servers prior to attempting recovery on them (recovery alters the data).
If these machines also can't be recovered, we may have to restore from backup and replay log files which could take a while.
So like PBS, they bring the service down to remind you they need the cash to provide you with the service you wanted to see but they just brought down.
Nothing for you to see here. Please move along. lol, or does slashdot just take a long time to update? great, i click submit and "The operation timed out while trying to connect to slashdot.org"
After returning from the power outage, the servers have just been slash-fried.
If they bought actual servers with dual power supplies and got power from multiple PDUs at their data center, they would be much better off. If this is really because of a tripped breaker, then it's pretty inexcusable, since dual power supplies fed from separate circuits would have prevented it... unlike the LJ outage which was from the power being cut to all circuits.
But if they're going to cobble together some whitebox crap servers, and not change the architecture, they'll be right back to an outage next time it happens.
500GB of disk, 5TB of transfer, $5.95/mo
Perfect. Now I can't look up stuff Q-wiki-ly. Har har.
I was affected by LJ going down too, so I know how this is. Pain in my ass.
Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state
Ya know, I just don't understand why so many projects with such high visibility and requirements for reliability use a toy database like MySQL.
Someone PLEASE tell me why. Because right now the only thing I can think is that people just don't know how to pronounce "Postgres".
This is not a troll or a flame at all, but between this and the LiveJournal servers, it sure sounds like hell if your MySQL servers ever go down unexpectedly.
Is MySQL the only database like this, or does Postgres get corrupted as well during unplanned downtime? If I recall from using MSSQL servers, we never had a problem like this. We would simply reboot the servers and not worry about tables being left in unrecoverable states. Please correct me if I am wrong though.
Is there any way around this, or will this always be a problem with MySQL?
Wikimedia/pedia don't run after Homer
Homer: Me Homer, I'm running from PBS.
Power outages suck, and a great way to protect from them is to distribute your project over a large area of electrical service.
I know the wikimedia folks are fundraising for more servers, but I wonder if this will provide more incentive to accept Google's offer?
What if the entire Universe were a chrooted environment with everything symlinked from the host?
I found this useful information about power outages: http://www.wikipedia.org/search?/power_outage
A power outage has taken down wikipedia! as a community we must carry the torch!
Paul Grosfield - the quicker picker upper.
Sorry, dude. You missed the LJ outage. This was the one where you should've posted some misleading-but-legitimate-looking "information."
Even when the servers go back on, they'll be slashdotted.
Google's like them nazi's:
If you don't join their party, they'll come get you!
As that economic genius, Eric Cartman taught us:
1) Get something other people love
2) Don't let them use it
3) Profit!
It doesn't hurt if you are running a fund drive at the same time, either.
To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
Meanwhile, the devs are working fairly furiously to get it back up
;)
Don't worry, we'll take care of your backup servers in the meantime.
The coolest voice ever.
Why didn't their servers have a UPS? If the power was down for only a few minutes, it wouldn't have been such a big deal.
And here is the data Paris Hilton photos and phone numbers
If you don't want your random nonsense modded down or deleted, post it on Everything2.com.
So far one of our database servers has completed a successful recovery (we're working through them all). On a gigabit link it takes something between 90 minutes and 4 hours to rsync from one to another. As soon as we have two database servers working, we'll be restoring service in read only mode. Likely to be that 90 minutes to 4 hours from now as worst case.
I'll post followups to this post later, as we're closer to being fully recovered.
I remember my mail service provider once went offline too, a year or so back, due to a power failure, but fortunately they had diesel generators for backup power. Doesn't Wikimedia have the same facility?
The only thing worse than somebody faking a heart attack on slashdot is someone else BELIEVING HIM!!
Google seems to have succeeded in building a distributed platform. What about something similar to seti@home, which takes a chunk of each user's disk space and bandwidth and uses them to implement a virtual computer on which wikimedia projects may be run?
Surely someone is already working on something like this (pointers anyone??)
it's as though 300,000 people cried out and were suddenly silenced ...
and then somebody diffed the change and made them speak again
Jamesday is Wikipedia's chief sysadmin, so his comment is probably one of the most informative ones here.
To make laws that man cannot, and will not obey, serves to bring all law into contempt.
--E.C. Stanton
This outage, as well as our beloved slashdotting, is yet another argument for URIs, rather than just URLs. URLs are like IP#s; they're absolute pointers to specific object locations, in terms of the storage/retrieval interface of a single instance. URIs are virtual, like domain names. They are distributed in DNS, a Netwide database, updated for current lookup values for actual retrieval. URLs need the same kind of layer. Of course, some other characteristics of these objects must be reflected in the URI model that are not appropriate to IP#/domain names, like multiple identical copies, or perhaps versions.
Just caching copies, either actively with a redirection URL, or passively in caching backbone webservers, isn't cutting it. Caching values is always better suited to solving performance problems, while creating its own concurrency and identity problems. Not to mention the publication limits of "opt-in" caches, like Coral or Google, which are an afterthought (and usually unknown) to the published object itself. Google has a huge, high-performance URL lookup system. It's taken quite a bit of value from the Internet, and all the content creators it rides on to derive all its value. It gives back quite a bit, with its simple, fast, effective interface. Google is perfectly positioned to make its name truly synonymous with an Internet revolution (not just a pinnacle of search evolution) by implementing URIs. If Google let objects get looked up by a URI code as simple as say, [A-Za-z0-9]+, it could get halfway to its namesake in objects with just 28 "digits"; just 7 digits would cover each object instance in its database right now, dozens of times over. If Google opened up such a URI protocol to anyone on the Web running such a "DIS" server, just like DNS, they could offload much of the work, avoid accusations of trying to "own the Internet", and improve their own service immeasurably, not least by making broken links in their database a quaint old curiosity. Will they rock our world, or will another big player, like Archive.org, do it, before Microsoft, desperate to distinguish MSN Search, ruins it for everyone with some kind of proprietary hack that favors MS objects?
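As a rough check of the namespace arithmetic above, here is a small Python sketch. It assumes the alphabet is exactly the 62 characters matched by [A-Za-z0-9], and it uses an index size of about 8 billion objects, which is an assumed figure for illustration and not something stated in this comment.

```python
import math

ALPHABET_SIZE = 26 + 26 + 10  # characters matched by [A-Za-z0-9]

def namespace_size(length: int) -> int:
    """Number of distinct codes of exactly `length` characters."""
    return ALPHABET_SIZE ** length

# Assumed index size, for illustration only (not taken from the comment).
ASSUMED_INDEX_SIZE = 8_000_000_000

# A googol is 10**100, so "halfway" in terms of digits is about 10**50.
digits_28 = math.log10(namespace_size(28))
coverage_7 = namespace_size(7) / ASSUMED_INDEX_SIZE

print(f"62**28 is about 10**{digits_28:.1f}")
print(f"62**7 = {namespace_size(7):,}, about {coverage_7:.0f}x the assumed index")
```

Under that assumed index size, 7 characters give several hundred codes per object, which is comfortably more than "dozens of times over".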
--
make install -not war
Time to move your operations to Winnipeg, Canada. The power never stops flowing (in the 12 years I've lived here, I only remember two power failures in my residential neighbourhood). I really don't understand why there aren't network server operations set up in reliable power centres such as these.
1) Register a wiki domain name like wikisearch.org.
2) Host a "backup" fundraising page there that sends money to us instead.
3) Have someone mess with the Wikimedia circuit breaker.
4) Send the power outage news to slashdot with our link.
5) Profit!!
What wikipedia needs is some UPS technology between the wall and those critical servers they are spending hours restoring.
I'm not surprised.
The facility they are coloed in is considered "rickety" by many.
From what I hear, they are expanding into a decidedly "non-rickety" location.
Hopefully, this is the last outage we'll see due to these circumstances..
...with all the porn sites that they host on the side.
Is there a way to ensure integrity of the data with such a setup?
All of these "values" are artistically incorporated in one person: Wikipedia.
There's a person named Wikipedia now? Weird parents...
That aint much. My older harddisk is 200. I'm planning to get a 400 gig one.
I wonder if wikimedia will ship the whole wikipedia on a few bzipped DVD isos to people who want a not-so-up-to-date encyclopaedia. I was researching a period of 1200AD, not much chance that data will change in the next few months.
And I DO wonder why doesnt another database company take up a mirror of wikipedia, just to show the reliability, speed, scalability etc of their database.... great marketing tool especially if you own all the ad bars. Sybase? Ingres? MSSQL? sleepycat even?
Why do I have a feeling someone kicked a Pentium1 server running freebsd with a 200GB harddisk somewhere out there...
"Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky
You can look things up on answers.com.. They mirror wikimedia, as well as other dictionaries/encyclopedias.
It was Xenu! Great God of the Scientologists who caused the power outage. He/she/it was angry you didn't pay all your hard earned cash to learn the inner secrets to find out about he/she/it. Read all about it on Wikipedia. Oh, wait you can't!
I find it an interesting coincidence that the power outage happened so soon after the Xenu article was featured. I may be paranoid, but the Scientologists have taken paranoia to a new dimension. They are not above dirty tricks. Karl "Turd Blossom" Rove could learn a thing or two.
What's the name of Wikimedia's colo?
Ron
As soon as I saw "Power corrupts. Power failure corrupts absolutely" I thought, the damn commies finally did it! But no, not hacked by commies...just by a renegade circuit breaker.
www.kiwilyrics.com - a wiki for lyrics
Why were they not using battery backup on their database servers (IE, their critical servers)? That way the servers would have the necessary 10 minutes (or whatever) so that they can shut down the DBs and power off the systems.
This is a negligible cost for something as integral as an active sync with the work that people have performed - for free.
Why is this not seen as important? "The wiki users will just recreate the material"? That's somewhat presumptuous.
Now, livejournal I can understand not doing this (as there are many clients which allow people to sync with their online journals and the material is fairly culturally worthless), but wikipedia? It's one of the better things on the Internet.
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Where am I to do research? The internet? I haven't visited that place in years!
ba dum pish.
oh wait.
Ad revenue ripoff site here
SAILING MISHAP
So how long should it take before they resolve all of their issues? History reports are pretty hard without Wikipedia...
I had just begun working at Hurricane Electric when they had their big power failure. (It was the first day I was answering the phone on the help desk. Not a pleasant experience!) In that case the power loss was due to a mistake by a technician servicing the backup power supply. Then there was the Internap failure, which seems to have been caused by a similar human error. Now a third provider has had some weird circuit breaker issues. That makes three major outages in less than a year. Either there's some evil conspiracy, or a lot of different companies are using the same bad procedures.
>Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state. Attempting to bring up the master database and one of the slaves immediately after the downtime showed corruption in parts of the database.
Well this is just great PR for mySQL.
(To DBA: have you guys ever heard of the replicating feature?)
It's a good thing that I got hooked on everything2.com before I even heard of wikipedia. Otherwise I wouldn't be able to get my info fix.
Here are the ingredients to this solution:
A completely designed, 100% empty database.
A COMPLETE log of all the SQL statements that were applied to it, IN the order they were applied. This is obtained by having the application write each SQL statement to the SQL log file AFTER the statement has executed successfully.
When a database failure occurs, stop everything and 'replay' the backed-up SQL log file (which lives on a separate backup system) against a copy of the empty DB there. TADA! You are back in business, right up to the point of failure!
The downsides....
Redesigning the database will screw everything up unless the SQL statements used during the redesign are logged as well.
All SQL requests must be funneled through 1 and only 1 DB connection. Otherwise the SQL statements in the log file stand a chance of being recorded 'out of sequence'. Here is a brief example:
With one DB connection, user 1 edits record x two separate times in succession, then user 2 comes along behind user 1 and modifies the record with no problems. Without record locking, or with indiscriminate multithreading, record x will be corrupted if user 2 edits record x between user 1's two consecutive edits. See the downside?
The SQL log file gets corrupted due to storage media failure. The only way around this is to copy the log file to a backup mirror system on a periodic basis and verify that the copy is good, using a strong cryptographic hash such as SHA-512 or, for the utterly anal and paranoid, a byte-for-byte comparison.
The EXTREME volume of data may make this approach unfeasible due to time constraints: too much data to restore via 'replaying'. 'Checkpointing' from a known good database state will cut down the size of the SQL log file, but introduces the possibility of database corruption by simply using the wrong checkpoint database when replaying the SQL statements.
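The log-and-replay idea above can be sketched roughly like this. It is only an illustration: it uses Python's built-in sqlite3 as a stand-in for the real database, and the `pages` table, the in-memory list standing in for the log file, and the helper names are all hypothetical.

```python
import json
import sqlite3

SCHEMA = "CREATE TABLE pages (title TEXT PRIMARY KEY, body TEXT)"  # hypothetical schema

def execute_and_log(conn, log, sql, params=()):
    """Run one statement; append it to the log only after it succeeds."""
    conn.execute(sql, params)
    conn.commit()
    log.append(json.dumps([sql, list(params)]))

def replay(log):
    """Rebuild a database from the empty schema by replaying the log in order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    for line in log:
        sql, params = json.loads(line)
        conn.execute(sql, params)
    conn.commit()
    return conn

# A single connection, so the log order matches the execution order.
live = sqlite3.connect(":memory:")
live.execute(SCHEMA)
statement_log = []  # stands in for a log file kept on a separate backup system

execute_and_log(live, statement_log,
                "INSERT INTO pages VALUES (?, ?)", ("Xenu", "first draft"))
execute_and_log(live, statement_log,
                "UPDATE pages SET body = ? WHERE title = ?", ("second draft", "Xenu"))

restored = replay(statement_log)
rows = restored.execute("SELECT title, body FROM pages").fetchall()
print(rows)
```

Because every statement goes through one connection and is logged only after it succeeds, replaying the log against the empty schema should reproduce the same rows as the live database.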
Speaking of 'tar' in the parent post, I 'cowrote' a simple, high-performance freeware Windows file archiver that combines file aggregation with data compression. If you want to try it out, it is here.
What is most pathetic is that those actual, real stars all gave their numbers to that moron. Money talks, I guess.
I don't know who I angered, but I'm getting modded down more than usual, including at least three different moderators who voted the parent post a Troll. A troll? Anyone reading it can see I was making a joke based on what the submitter said. Not funny? OK, that's a valid criticism for everyone has their own view of humor, and I respect that. But a troll? Wow, What did I do to them?
So, _this_ is where I should be posting my outage reports! And here I've been sending them only to people who would care.
"Slashdot...outage reports for nerds! Stuff that doesn't matter to me!"
Lol!
-buf
The sooner that asinine word exits the lexicon the better.
From the wikipedia page:
At about 14:15 PST some circuit breakers were tripped in the colocation facility where our servers are housed. Although the facility has a well-stocked generator, this took out power to places inside the facility, including the switch that connects us to the network and all our servers. (Yes, even the machines with dual power supplies -- both circuits got shut off.)
After some minutes, the switch and most of our machines had rebooted. Some of our servers required additional work to get up, and a few may still be sitting there dead but can be worked around.
The sticky point is the database servers, where all the important stuff is. Although we use MySQL's transactional InnoDB tables, they can still sometimes be left in an unrecoverable state.
(Bolding mine.) This proves that MySQL is not ACID, there is no way that a power outage is supposed to cause corruption in a database. This is not a troll, this is a simple conclusion. I really think that Wikipedia should switch to PostgreSQL, which is considerably more mature in terms of ACID compliance.
When all you have is a hammer, every problem starts to look like a thumb.
He thinks I'm a god now!
perhaps I just inadvertently reached clear
I hope I don't get in trouble, but I submitted a story that said "My site went off line during the reboot to upgrade the kernel, then, went down for a few seconds while overwriting files to implement the new Refrozen Upload"
Big deal, who cares. Do I really need to know when some group's servers crash?
It probably is a troll. But, maybe someone will think the joke was funny enough to remember the symptoms and recognize them if they happen for real at some point in the future.
Apparently one of their MySQL databases got corrupted as well. Figures. You'd think with all that volume they'd be wise enough to use a DB that can withstand a hard powercycle without losing data.
Just remember, friends don't let friends use MySQL for important data.
But as for hardware that can be used to serve two instances of a website, Cisco makes a product called Distributed Director.
From the product description: I am only mildly familiar with Distributed Director, but it gives different IP answers to DNS queries based on some formulas, one of which can be whichever server farm is considered closer to the client.
In the case of this or a planned outage, with DD you can take a site (i.e. the down site) out of the active config. DD is for geographically dispersed server farms.
Cisco also makes a product called Local Director (both of these may have been replaced with "Intelligent Director" in some part, IDK anymore). LD allows you to balance across web servers for example (in the same server farm).
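The DNS-based approach described above can be sketched roughly as follows. This is not how Cisco's product is implemented; it only illustrates answering a lookup with the closest site that is still in the active config. The site names, the distance table, and the addresses (taken from documentation-only ranges) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    ip: str
    active: bool = True

# Hypothetical server farms; the addresses come from documentation-only ranges.
SITES = {
    "east": Site("east", "192.0.2.10"),
    "west": Site("west", "198.51.100.10"),
}

# Hypothetical "distance" from each client region to each site (lower is closer).
DISTANCE = {
    "europe": {"east": 1, "west": 3},
    "asia": {"east": 3, "west": 1},
}

def resolve(client_region: str) -> str:
    """Return the IP of the closest site that is still in the active config."""
    candidates = [s for s in SITES.values() if s.active]
    if not candidates:
        raise RuntimeError("no active sites")
    closest = min(candidates, key=lambda s: DISTANCE[client_region][s.name])
    return closest.ip

print(resolve("europe"))      # closest active site for this client
SITES["east"].active = False  # take the down site out of the active config
print(resolve("europe"))      # the remaining site now answers
```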
Also as for a big caching system, most of the time I think the people that are serving something want to be the ones to serve it, directly. Reasons for not using your suggestion could include security, advertising revenue based on traffic stats, etc.
Silly Rabbit: tricks are for kids.
Could be. But we took it offline and exchanged a few more messages that make me think it was more likely to have been real.
LJ was down, WP is down, Server Beach had an outage two weeks ago, and I at least have had the misfortune to have my ISP down for a week. Is it me, or does it seem like colocation center outages are becoming rampant lately?
Terrorists can attack freedom, but only Congress can destroy it.
OK, the "Eminem" voice has an Aussie accent, just like the douchebags that supposedly made the call. I call bullshit on this one.
:::eyes my UPS::::
::::ponders for a moment::::
:::eyes the serial cable that gracefully shuts down said computer in the event of a power failure::::
:::ponders some more::::
:::eyes the spare UPS sitting in the corner that used to be connected to a database server::::
Hmm, I think i'm almost onto something here, but i just can't seem to nail it down...
-Chris
--an unbreakable toy is useful for breaking other toys--
And going strong!
The link in the article is broken, here's the proper one:
http://wikimedia.org/fundraising/
Beware: In C++, your friends can see your privates!
that someone might as well trip the power cord between the UPS and the computer. Your UPS doesn't look so good now, does it?
There are valid reasons for real databases having proper disaster recovery mechanisms built-in.
Man, I'd hit it. You hear that, kturner? I'd hit it! Judging from the back of her neck, that is.
--grendel drago
Laws do not persuade just because they threaten. --Seneca
There are also valid reasons for having more than a single point of failure in a system.
--grendel drago
Laws do not persuade just because they threaten. --Seneca
You'd think that, since MySQL has been around for a number of years and other databases have it, high reliability would have been contributed or at the very least funded by somebody.
Maybe the performance penalty it incurs is prohibitive---one can run the site reliably, or one can run it fast, but not both. Ugh, what a choice.
--grendel drago
Laws do not persuade just because they threaten. --Seneca
Loco colo severs servers?
If Google let objects get looked up by a URI code as simple as say, [A-Za-z0-9]+ ... just 7 digits would cover each object instance in its database right now, dozens of times over. If Google opened up such a URI protocol to anyone on the Web running such a "DIS" server, just like DNS, they could offload much of the work...
Yeah. I'm going to go register n8y9vtw before anyone else does. 'Cause everyone knows that the entirety of the namespace you mentioned is useful. Uh huh.
--grendel drago
Laws do not persuade just because they threaten. --Seneca
Clearly, someone set up you the bomb.
--grendel drago
Laws do not persuade just because they threaten. --Seneca
Running Wikimedia is insanely costly. Did you donate to them? (I thought not)
"Oppression and harassment is a small price to pay to live in the land of the free." -- Montgomery Burns.
Nope. I don't use them.
Perhaps they should just give up since it costs so much.
So now they'll have to put up a page to say "The temp page that says that our site is down is down. We are working around the clock to get the temp page back up."
"When the atomic bomb goes off there's devastation...but when the atomic bong goes off there's celebraaaaation!"
Many of us do use them, many times each day, and find it an incredibly useful resource. It's just too bad it's so slow, and even worse when they're completely down like now, so we donate what we can.
"Oppression and harassment is a small price to pay to live in the land of the free." -- Montgomery Burns.
I recall a discussion about fsync not being properly implemented in both Linux kernel 2.4 and 2.6. I think it was patched in 2.6.9 or so, but not in 2.4.
Unfortunately, I cannot find the thread any more. Does anybody remember?
So, this might be rather a linux kernel bug, not a mysql bug.
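For readers unfamiliar with what fsync is meant to guarantee, here is a minimal Python sketch of a write that asks the operating system to push data to stable storage before returning. The file name is hypothetical, and the sketch says nothing about how MySQL itself issues its writes.

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's buffer into the OS
        os.fsync(f.fileno())  # ask the OS to flush its cache to the device

# Usage: write to a temporary file (hypothetical name) and read it back.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "page.bin")
durable_write(demo_path, b"committed row")
with open(demo_path, "rb") as f:
    print(f.read())
```

A successful return only helps if the kernel and the drive both honor the request; if either one reports completion before the data is really on disk, a power cut can still lose it.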
Secondly, why does everybody say that MySQL does not support ACID transactions? MySQL does advertise them. Are you talking about pre-4.0 MyISAM tables? Or do you suggest that 4.0/4.1 InnoDB tables aren't ACID-compliant either?
need servers in different co-los.
Google suffered the same problem, and that was the last time they had all their eggs in one basket (err, servers in one co-lo).
Yeah, because you should always give up whenever you meet some resistance, because you're destined to fail anyway.
</sarcasm>
It seems to be working in read-only mode now. Now let's see how the single slave DB server reacts to the upcoming slashdot effect.
15 minutes later
1 hour and 15 minutes later
Anyway, I'm a big fan of the Wikimedia foundation and recently donated a small sum to their cause. I will donate again later if they still run and they need help.
If Slashdot has decided that it's a priority to report when popular servers experience downtime, shouldn't it let us know the status of our favorite pr0n servers, too?
How much longer do I have to wait before I can once again get access to my daily pr0n? I need to know!
The concept of wikipedia is a good one, but I experienced first hand a flaw in their system of open moderation. I wrote an article that used images from my own website that I created, and someone registered my entry as possible copyright violation of my own work, even though I clearly stated on the entry that the images were mine! So my entry has lingered in CV limbo awaiting judgement. Laughable at best... ivanjs
There was a dispute last night that didn't go Jimbo's way, so maybe he pulled the plug? Makes for nice conspiracy anyway
They should have used a UPS...we use it all the time!!
Every main database server had corrupt database pages. That is, 3 systems with battery backed up write caching controllers and SCSI drives and 2 SATA systems with write caching SATA controllers, one battery backed up, the other not, with two different SATA disk drive makers.
Involved:
Obvious speculation involves the controllers not telling the drives not to write buffer or the drives not listening. No point in getting into SCSI or SATA or this disc controller or that controller fights when there's this much variation involved.
I've set up www.WikiMirror.com a while back.
;)
Let's see how it manages to survive the Slashdot effect
Note: Some pages might be out of date
Yes, so does the InnoDB engine in MySQL. Doesn't help so much when the drive system has lied about which page writes have been committed to disk and you have a RAID system where a page is spanning a couple of drives and one wrote its update to the page while the other didn't, which is what I speculate happened. That leaves you with a database page which fails its checksumming. Can recover from individual bad pages but not worth doing it when there's a complete copy available. Since we don't need to recover them, we aren't. Copying from one with no apparent damage instead.
Did, of course, make a copy of the databases before trying to restart some of them, so we'd have that recovery option if it turns out that we need it.
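To illustrate the kind of torn page speculated about above, here is a toy Python sketch. It is not InnoDB's actual page format or checksum algorithm: it uses zlib.crc32 and a 16-byte "page" whose two halves are imagined to live on different drives.

```python
import zlib

PAGE_SIZE = 16          # toy size; real database pages are much larger
HALF = PAGE_SIZE // 2   # pretend each half lives on a different drive

def make_page(payload: bytes):
    """Pad payload to a full page and return it with its checksum."""
    page = payload.ljust(PAGE_SIZE, b"\0")
    return page, zlib.crc32(page)

def is_intact(page: bytes, checksum: int) -> bool:
    """A page is intact if its contents still match the stored checksum."""
    return zlib.crc32(page) == checksum

old_page, old_sum = make_page(b"old old old old")
new_page, new_sum = make_page(b"new new new new")

# Power fails mid-update: drive A persisted its half, drive B did not,
# even though both writes were reported as complete.
torn_page = new_page[:HALF] + old_page[HALF:]

print(is_intact(new_page, new_sum))   # a fully written page verifies
print(is_intact(torn_page, new_sum))  # the torn page does not
```

The torn page matches neither the old nor the new checksum, which is why copying from an undamaged replica can be simpler than repairing pages one at a time.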
Slightly Operating Thetan?
I don't think so. Sounds to me like an ARC-break!
There was a dispute last night that didn't go Jimbo's way, so maybe he pulled the plug? Makes for nice conspiracy anyway...
As long as the servers are not "outsourced", there is little transparency in WP. Google, please
to keep them from tripping over the powercord again?
no worries. Living in Ontario Canada there aren't any doctors. Government cutbacks and low Doctor pay.
Sometimes the history of an article says just as much (if not more) than the article itself.
Must you really know what the money is being spent on?
If you donate money, you are asking them to continue to offer their great service to you and other people. How they achieve that goal, is up to them, no?
Except, well, every major charitable organization of decent size issues itemized reports as to where the money is budgeted and how last year's money was actually spent versus budget. So yes, I will ask the organization how they plan to spend my money before giving it. Well, except for webcomics... After all, we already know they need the money for server costs, pens, crack...ers. Can't do without those saltines.
This sig has absolutely no significance and serves only to take up screen space and waste the time of the reader.
Posted on the mailing list wikipedia-l 32 minutes ago:
From: Brion Vibber
Reply-To: wikipedia-l@wikimedia.org
To: Wikipedia-l, Wikimedia Foundation Mailing List, Wikimedia developers
Date: Tue, 22 Feb 2005 04:47:56 -0800
Subject: Re: [Wikipedia-l] Wiki Problems?
Brion Vibber wrote:
> There was some sort of power failure at the colocation facility. We're
> in the process of rebooting and recovering machines.
The power failure was due to circuit breakers being tripped within the colocation facility; some of our servers have redundant power supplies but *both* circuits failed, causing all our machines and the network switch to unceremoniously shut down.
Whether a problem in MySQL, with our server configurations, or with the hardware (or some combination thereof), most of our database servers managed to glitch the data on disk when they went down. (Yes, we use InnoDB tables. This ain't good enough, apparently.)
The good news: one server maintained a good copy, which we've been copying to the others to get things back on track. We're now serving all wikis read-only.
The bad news: that copy was a bit over a day behind synchronization (it was stopped to run maintenance jobs), so in addition to slogging around 170gb of data to each DB server we have to apply the last day's update logs before we can restore read/write service.
I don't know when exactly we'll have everything editable again, but it should be within 12 hours.
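As a back-of-the-envelope check on the copy times, here is a short Python sketch. It assumes "170gb" means 170 decimal gigabytes and takes the 90 minute to 4 hour rsync estimate from the sysadmin's earlier comment in this thread; the helper names are made up for illustration.

```python
DATA_BYTES = 170 * 10**9   # "170gb" from the message, assumed to mean decimal gigabytes
LINK_BITS_PER_SEC = 10**9  # nominal gigabit link, ignoring protocol overhead

def seconds_at_line_rate(num_bytes: int, bits_per_sec: int) -> float:
    """Best-case transfer time if the link were the only bottleneck."""
    return num_bytes * 8 / bits_per_sec

def effective_mb_per_sec(num_bytes: int, seconds: float) -> float:
    """Throughput implied by an observed transfer time, in MB/s."""
    return num_bytes / seconds / 10**6

best_case_min = seconds_at_line_rate(DATA_BYTES, LINK_BITS_PER_SEC) / 60
fast = effective_mb_per_sec(DATA_BYTES, 90 * 60)      # 90-minute copy
slow = effective_mb_per_sec(DATA_BYTES, 4 * 60 * 60)  # 4-hour copy

print(f"line rate: about {best_case_min:.0f} minutes")
print(f"implied throughput: {slow:.0f} to {fast:.0f} MB/s")
```

Under these assumptions, the gap between the line-rate figure and the quoted copy times would suggest that the disks or rsync itself, rather than the network, set the pace.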
Imagine what would happen if there were a link on Wikipedia's main site to Slashdot and from Slashdot back to Wikipedia... Boom?
::::kicks out the cord between your computer and the UPS, which is analogous to what happened to Wikipedia::::
They are using MySQL's transactional tables. It still corrupts the whole damn table a lot of the time when it's unexpectedly terminated. Real databases do not do this; they will not corrupt the table.
Today in fact. I built a wiki to work as a knowledgebase; it works superbly on a small scale, and there's a possibility of using it on a larger scale. My main concern is the database. I'd like to port it to DB2 (we're an IBM shop), as I don't want to be the one that gets fired when what happened to Wikimedia happens to me. Might even get them to hire someone to do the job, if they're feeling generous...
Meanwhile, the devs are working fairly furiously to get it back up (Kate hasn't slept in 27 hours
That doesn't sound right.
"Fairly" means moderately, while "furiously" is pretty extreme.
You could say something's "fairly big", which would mean it's pretty big, or you could say it's "gigantic", which means it has extreme size. But you'd never say it's "fairly gigantic", since the meanings of the two words conflict.
I know the wikimedia folks are fundraising for more servers, but I wonder if this will provide more incentive to accept Google's offer?
What you mean to ask is if Google's team of secret agents sneaking into the colocation facility and tripping the circuit breakers will result in wikipedia deciding to accept Google's offer and thus further their plans for world domination.
In case anyone else finds the highlighting as distracting as I do:
Google cache of Wikipedia's Xenu article without highlighting
I am surprised their servers are so fragile. A UPS/surge protector aside from generators can do wonders... I thought they had some kick ass backup power... Although Wikipedia seems to be back, rather slow but better than none.
"All your base" used to be fun.
Wikipedia updates one of its records regarding Slashdot.org (a.k.a. /.)
The new record reads "You will rue this day
/., you will rue this day..."
I mod down so you can mod up. Your welcome.
Yes, that's true - I've used DD (and its predecessors) since 1999. It's got exactly the limitations addressed by URIs. Because URIs are a higher-level layer, while DD is a lower-level layer, addressing the problems of distributed objects keyed as URLs. DD makes you clone entire webservers for a single distributed object. And doesn't let you distinguish between versions, or other state differences. I expect that Cisco will make terrific products supporting URIs when we get software that uses more than URLs.
--
make install -not war
Heh, from the Day 2 Fund Drive report:
"What can I say? I owe half of what I know to the Wikipedia! Keep up the good work." by Jonathan Grose
So a full quarter of what Jonathan knows is misinformed, inaccurate, or 1337 speek!
(Nah, really I love Wikipedia. But Jonathan was asking for it.)
Ok... but that's the point. YOU use them, have YOU donated? No? STFU about slowness then!
Except it's now been a few years since MySQL incorporated InnoDB, so maybe it's time to move on and rejoice that it's now one of the free database servers with ACID support? This one happens to come with standard replication and fulltext search, and also with a range of other engines to choose from. PostgreSQL, last I knew, doesn't have built-in replication, fulltext search and alternative storage engines, but has its own particular strengths. In the end, every end user gets to benefit from the competition between excellent tools. Good for us all to be happy about that.
Are you tired of getting spam, scams, and malware via email?
Yes! Not only via email, also via Slashdot comments!
STOP YOUR Spam/scams/malware NOW!
NOW? Oh dear God, where do I sign up!!!
FREE download!
This is too good to be true!!!!
Learn how to stop your unwanted email now for FREE!
All for FREE???
New email filtering programs for Windows makes spam and malware sent by email 'almost impossible'.
Where do I send the cheque???
Now, seriously, I find it utterly hilarious that the solution to spam that READS like spam is spamvertized on Slashdot by its author who doesn't do anything else than spamming Slashdot threads with his spam links! Hilarious.
Moderators: WAKE UP!
Wikipedia is now read-write on a limited number of servers. Enough for most things but we still have some features disabled as the rest of the database servers catch up. Any data loss was limited, so far as we can tell at present, to the last few seconds at most.
Hi!
InnoDB has used WAL since I wrote it in the mid-1990s. To PostgreSQL, WAL came later, around 2000.
Regards, Heikki
Innobase Oy
And you should be worried because those shooting pains in your wrist sound like carpal tunnel syndrome. Try to use your keyboard mostly with your right wrist while the pain is stronger and do some exercises every fifteen minutes or so (push-ups are great for stretching wrists). The shortness of breath has nothing to do with carpal tunnel and is a normal symptom of being overworked. But keep in mind that while a few nice shots of espresso will help you stabilise your breath and feel less tired, only proper exercises can help your overworked wrists.
You should try to use your right wrist for the hardest jobs, do a lot of exercises, and if the symptoms don't disappear in a few weeks or months at most you should probably see a doctor. I wish you good luck and I hope your wrist problems will not stop you from posting on Slashdot.
Sincerely,
Pan Tarhei Hosé, PhD.
"Homo sum et cogito ergo odi profanum vulgus et libido."
I find it curious that both Livejournal and Wikipedia were using fancy battery backed RAID controllers and this still happened. I have no personal experience with such controllers but I assume they must work like this: the most recent N writes are stored in battery backed RAM. After power is restored, the controller rewrites ALL of the data in battery backed RAM in case some of it didn't make it out to disk.
Now there are three scenarios where this could fail, that I can think of. In one, the total cache on all connected disk drives exceeds the total size of the battery backed RAM. Unless there is a message you can send to the drive forbidding it to use a portion of its cache for writes (ok to use all for reads), you are SOL in this case. The second scenario is that the controller clears some of the data in the battery backed RAM after it has been told by the drive that it has been written (via serial number) or after some amount of time (or number of write operations) has passed and this assumption is wrong. The third scenario is a variation of the second in which the controller assumes the drive does not cache writes at all and immediately invalidates the RAM copy as soon as a write xfer operation completes but that is rather pointless. I am ignoring problems due to equipment failures or week long power outages here.
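To make the first scenario concrete, here is a toy simulation in Python. Everything in it is an assumption for illustration: the `simulate` function, the slot counts, and the model of the controller (it keeps only the most recent N writes and replays them all after power returns) follow the guess in the comment above, not any real controller's firmware.

```python
from collections import deque


def simulate(nvram_slots, drive_cache_slots, writes):
    """Return the blocks that are on the platter after a power cut and replay.

    Assumed model: the controller's battery-backed RAM holds only the most
    recent `nvram_slots` writes; the drive's volatile cache holds up to
    `drive_cache_slots` writes that have not yet reached the platter.
    """
    nvram = deque(maxlen=nvram_slots)  # oldest entries fall off automatically
    drive_cache = []                   # volatile: lost when power fails
    platter = {}                       # what is really on disk

    for block, value in writes:
        nvram.append((block, value))
        drive_cache.append((block, value))
        if len(drive_cache) > drive_cache_slots:
            # The drive finally commits its oldest cached write.
            old_block, old_value = drive_cache.pop(0)
            platter[old_block] = old_value

    drive_cache.clear()                # power failure
    for block, value in nvram:         # power restored: replay everything
        platter[block] = value
    return platter


writes = [(i, "v%d" % i) for i in range(6)]
safe = simulate(nvram_slots=4, drive_cache_slots=2, writes=writes)
unsafe = simulate(nvram_slots=2, drive_cache_slots=4, writes=writes)
```

With battery-backed RAM larger than the drive cache, all six blocks survive. With the sizes reversed, blocks 2 and 3 were only ever in the drive's volatile cache and had already aged out of the controller's RAM, so they are gone.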
Write back caching on the drive is loaded with problems and in lieu of such a controller, it would seem that about the only way it can safely be enabled (without supplemental storage on a device without write back cache) is if there is the ability for the operating system to send a query to the drive to ask if a specific serial number of write operation has completed or at least to return the highest serial number such that all serial numbers lower than that have completed. Then the operating system can allow, for example, 15 processes to simultaneously schedule write operations (or one process multiple writes) which the drive can complete in optimal order but none of those processes can continue once they have called flush() until the data has actually been written.
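The "highest serial number such that all lower ones have completed" idea can be sketched in a few lines of Python. This is only an illustration of the bookkeeping; the `WriteTracker` class and its method names are hypothetical, and no real drive or operating system interface is implied.

```python
class WriteTracker:
    """Tracks out-of-order write completions reported by a hypothetical drive."""

    def __init__(self):
        self.next_serial = 1    # serial number for the next submitted write
        self.completed = set()  # serials finished ahead of the watermark
        self.watermark = 0      # every serial <= watermark is on disk

    def submit(self):
        """Hand out a serial number for a newly scheduled write."""
        serial = self.next_serial
        self.next_serial += 1
        return serial

    def complete(self, serial):
        """Record that the drive reports this write as physically done."""
        self.completed.add(serial)
        # Advance over any contiguous run of completed serials.
        while self.watermark + 1 in self.completed:
            self.watermark += 1
            self.completed.remove(self.watermark)

    def is_durable(self, serial):
        """A flush() on this write may return only once this is True."""
        return serial <= self.watermark


tracker = WriteTracker()
a, b, c = tracker.submit(), tracker.submit(), tracker.submit()
tracker.complete(c)  # the drive reorders: the last write lands first
tracker.complete(a)
```

At this point only write `a` counts as durable: `c` is on disk, but `b` is not, so a process waiting on `c` still has to block. Once `b` completes, the watermark jumps straight to 3.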
I assume wikipedia has something like 100 reads for every write. I wonder if the performance from allowing write caching is really necessary during normal operations (as opposed to database or index rebuilds or replication operations) and if it could be turned off most of the time. This is not a fix for the underlying problems (unless it stays off all the time) but would be a way to improve the odds considerably.
It seems to me that data center procedures should be designed (and fire codes need to accommodate this) to allow operations to occur in a reasonable sequence. First the alarm goes off. Then a signal is transmitted to all servers to begin shutdown (can be delivered by ethernet). Lights must remain powered until people have a chance to evacuate. Then there is a reasonable delay to allow evacuation of personnel and to allow someone to reach the hold off switches. There would be separate hold off switches for halon and power down. Then halon would be released. Power would not be shut off until servers had time to power down or fire crews needed to enter the area for purposes other than verifying that the fire was out. Even if the fire is being maintained by emergency power, the halon should kill it so a facility with adequate halon should never need to actually do an emergency power off unless perhaps the fire affected an area outside the area protected by halon.
At a minimum, I would expect the data center to have two complete UPS systems. These would supply power to racks in a cartesian arrangement where each rack was supplied power from two different UPSes, one from a row bus and one from a column bus. With the exception of the emergency power off required by fire codes, it would be very difficult for a rack to lose power on both busses unless there was a short from one b
You should go make a page on wikipedia that has information about what regional porn servers are up or down.
;-)
Perhaps you could get Netcraft and Vivid Video to co-sponsor a page with up to date information on it -- just like what Netcraft does now. I suggest that you call it
Vivid Craft
(GoDaddy.com says that vividcraft.com is not taken yet!...)
coding is life
This was a very good post. Unfortunately, it will be modded down because it dared to say the truth, while the laughable reply by incompetent Wikimedia DBA saying "Since at least one of our MySQL database servers has so far restarted successfully with all InnoDB data intact, perhaps you'd care to reconsider your assessment that MySQL is incapable of doing what it just did?" was obviously modded up... Yes, Jamesday, if *one* database in a cluster (the only one which was not live, I might add) is not corrupted, it means that a database is reliable... Let's just ignore that *every single one* of the live database servers was not reliable, let's concentrate on the one that was off-line and was not processing any updates while the power went down. And the rest of the servers must have had hardware problems, all of them at the same time. What a joke.
There are 8 people who have decided to call themselves that and are doing something. There's no broad community action on it and it's not in any way any sort of official editorial team with any official role.
Editing articles to a fixed state seems very unlikely to happen, since it's pretty thoroughly contrary to the method by which the project works and the complete and comprehensive objectives of the project. The general result of people trying to do it is them being barred from the project for uncooperative editing.
Paper and CD are risky targets because they lose the CDA and OCILLA protections which keep wikipedia.org, the Wikimedia Foundation and other contributors very safe from legal action based on content.
Frankly, could be a heart attack. I am a doctor. You need to go and see someone IMMEDIATELY, even if the pains subside. I hope that it isn't so, but sometimes chest pain which goes away (unstable angina) is a precursor of a heart attack. IF you are reading this, ask someone to take you to the doctor and get an ECG and your serum enzymes (CPK-MB) done, NOW.
I have indeed donated, out of my poor student budget.
"Oppression and harassment is a small price to pay to live in the land of the free." -- Montgomery Burns.
The standard in the "big iron" database world is that no matter what the hardware lies to you about, you can still come up in a consistent state, assuming that there is some time t in the past at which all data up to that point is successfully written to disk. Algorithms for figuring out what hasn't been written yet, even in the face of inconsistent write caches, are probably 30 years old by now.
Losing recent changes is certainly acceptable, but the DB simply giving up and saying "restore from backup" isn't.
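A minimal sketch of that idea in Python: scan a checksummed log from the start and keep everything up to the first record that fails verification, so a torn tail costs only the most recent changes. The record layout and the `make_record` and `recover` names are made up for this example; this is not how InnoDB, DB2, or any particular engine lays out its log.

```python
import zlib


def make_record(payload):
    """Toy log record: 4-byte length, 4-byte CRC32, then the payload."""
    header = len(payload).to_bytes(4, "big") + zlib.crc32(payload).to_bytes(4, "big")
    return header + payload


def recover(log):
    """Return every payload up to the first torn or corrupt record."""
    records = []
    pos = 0
    while pos + 8 <= len(log):
        length = int.from_bytes(log[pos:pos + 4], "big")
        crc = int.from_bytes(log[pos + 4:pos + 8], "big")
        payload = log[pos + 8:pos + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # everything from here on is untrusted; stop, don't give up
        records.append(payload)
        pos += 8 + length
    return records


log = make_record(b"edit 1") + make_record(b"edit 2") + make_record(b"edit 3")
torn = log[:-3]  # power died while the last record was being written
recovered = recover(torn)
```

Here `recovered` holds the first two edits and silently drops the half-written third one, which is the "lose recent changes but come up consistent" behaviour described above.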
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Big iron (and expected corporate features) is still an area where MySQL is rapidly evolving. I doubt it'll take two years. Likely less with stimulus from high profile incidents.
Restore from backup for MySQL really means "restore from your backup and replay your binary log until you get back to the point of failure". Or ask MySQL for assistance - they will look at such cases. Neither is as good as I'd like of course - either involves more extended unavailability of data when the site needs to be up, if with incomplete data, within minutes or a small number of hours.
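As a rough picture of what "restore and replay" involves, here is a toy Python model. The `restore` function, the backup dictionary, and the `(position, key, value)` event tuples are all invented for illustration; MySQL's real binary log format and tooling are quite different.

```python
def restore(backup, binlog):
    """Rebuild state from a backup plus the log events recorded after it.

    `backup` is assumed to hold a snapshot and the log position it was
    taken at; `binlog` is assumed to be an ordered list of
    (position, key, value) events.
    """
    data = dict(backup["data"])
    applied = backup["position"]
    for position, key, value in binlog:
        if position <= applied:
            continue  # already reflected in the backup
        if position != applied + 1:
            break     # gap in the log: stop at the last known-good point
        data[key] = value
        applied = position
    return data, applied


backup = {"position": 2, "data": {"Xenu": "rev1", "Slashdot": "rev1"}}
binlog = [
    (1, "Xenu", "rev1"),
    (2, "Slashdot", "rev1"),
    (3, "Xenu", "rev2"),
    (4, "Wikipedia", "rev1"),
]
data, position = restore(backup, binlog)
```

The replay skips the two events the backup already contains and applies the rest, ending at position 4. The slow part in practice is that every event since the backup has to be applied in order, which is why this takes a while on a large site.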
On the followup side, additional power lines are being run to our racks and discussion with one RAID controller vendor indicates that a maximum of 20 minutes of battery backup can be expected. That's not long enough for a colo situation, so more followup with their engineers is needed, to see if they can produce something more realistic.