Why Power Failures Can Always Lead To Data Loss
bigsmoke writes "So, all your servers run on RAID. You back up religiously. You're even sure that your backups are recoverable. But do you also need a UPS? According to Halfgaar (on Slashdot before to promote better Linux backup practices), yes, usually you do. He argues that despite technological advancements such as file system journaling, power failures can still cause data loss in most setups."
UPS is more than just saving your data.
Ok, people who don't just read the executive summary knew this all along, but perhaps it's necessary that someone spells it out for the rest: Journaling and RAID do not prevent data loss in case of a power outage (and many more circumstances). If you know why, just skip the article. If you're wondering how you can lose data if you write everything to two disks and your filesystem guarantees its own consistency, then perhaps this is the wake up call that you need.
APC is the only UPS maker on the market that has at least spent some small effort so that their UPSs can be properly integrated with a Linux machine. I made the mistake of purchasing an Ultra UPS as it was cheaper than the APC.
"Thanks for all the money you paid to us. We've used it to buy off ISO among other things" -Microsoft
Computer power supplies should be built with enough spare capacitance to run things long enough for the computer to save critical data
Here's a question for you: Calculate the size of the capacitor needed that can hold enough power to run a 200W load for 5 minutes and maintain a voltage level within a specific usable range.
Hint: its BIG. batteries are more space efficient, but the chemicals and outgassing make them inappropriate for location INSIDE the computer box.
This morning we had a planned shutdown of 100 servers for eletricity works, all were on the same 40 kVA UPS. All went fine, we shutdown all servers to be safe, and kept some stuff online for montoring and the like, then main power was shut off. The UPS gladly took the load, with an estimated battery life of 75 minutes, more than what was needed for the electrical work. Once this was done, the electrician put the main power back on, and... the UPS shutdown !
Since all servers were stopped already we didn't lose anything, but we had to put the UPS in bypass mode for a while, then back on, and now we hope for the best waiting for the UPS to be repaired, crossing most of our fingers because of the holidays...
In summary : testing that the UPS can handle the power coming back is as important as testing for it to be able to handle the power shutting down.
Votez ecolo : Chiez dans l'urne !
The typical small UPS system has some amount of surge protection built-in. But it's typically only good for at most a couple thousand joules. But then, if you get a spike that is big enough to blow a varister, you also get to buy a new ups.
A better solution is to put a "whole house" surge protector on the circuit-breaker panel. It protects everything, with a much higher number of joules. Five or six pounds of varisters can absorb a lot more shock than one ounce of varisters. They cost about $100, and can be found at most big hardware stores or electrical supply houses. That doesn't eliminate the need for a ups. It does protect the ups, along with the other equipment, from most voltage spikes.
Last year, lightning hit the power pole 20 feet from my house. We know where it hit because the pole caught fire. My next-door neighbors on both sides lost every single piece of electrical equipment -- not just computers, TV's, and stereos, but also fridge, microwave, water heater, and range. All of it was damaged beyond repair. We barely noticed the hit, except for the bright flash of light, and had no damage at all.
Are you sure your disks are in write-through mode? Have you checked? Brad Fitzpatrick (of LiveJournal, memcache, OpenID, etc. fame) discovered that many disks lie about being in write-through mode, and wrote a utility to check it.
Why 5 minutes? It usually takes less than a second to run a sync on the disks depending on how active they are. A couple seconds of runtime should be enough to do an "emergency shutdown" and avoid data corruption.
####@johncash:~$ time sync
real 0m0.004s
user 0m0.004s
sys 0m0.000s
That will sync the disks, but it won't stop the database from accepting incoming data. It won't stop cron jobs which might be just about to trigger. It won't deal with tasks that are in the middle of a big operation which involves a lot of writing to disk.