Why Power Failures Can Always Lead To Data Loss
bigsmoke writes "So, all your servers run on RAID. You back up religiously. You're even sure that your backups are recoverable. But do you also need a UPS? According to Halfgaar (on Slashdot before to promote better Linux backup practices), yes, usually you do. He argues that despite technological advancements such as file system journaling, power failures can still cause data loss in most setups."
This is old hat in embedded systems.
Yes, but embedded systems usually have lower power requirements, or at the very least, a smaller range of power requirements. You can't add 3 PCIe cards, a few extra drives, and a few more GB of RAM to most embedded systems.
I worked on the design of an embedded system a few years ago that had a holdup spec - I think it was supposed to survive for 50 ms with no power. So a 50 ms power interruption would result in continued operation, while an outage longer than that was allowed to reset the board. However, the power draw on the board was around 200 Watts; being able to supply that much power for that long in a fairly compact form factor was a huge hurdle. It also caused airflow problems, because the giant capacitors would prevent air from getting to other components on the board, like the CPU. In the next version of the spec, I believe the holdup requirement was eliminated - apparently we weren't the only ones having trouble meeting that requirement.
Rule #1.
NEVER plug a laser printer into a UPS. The power that the fuser draws is WAY too much.
Look at some of the cheap office units, they show little pictures on them, notice the printer icon is on the surge side, NOT battery/surge side.
If the power goes out, you should NOT be trying to print.
http://articles.techrepublic.com.com/5100-10878_11-6085460.html See #6
http://arstechnica.com/guides/other/ups.ars/3
http://www.jetcafe.org/npc/doc/ups-faq.html#0405 see 04.05
Would you put a space heater on a UPS? Shredder? Vacuum? Table Saw? If you put a laser printer on it, you may as well.
Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
I 100% agree with the idea of testing under controlled conditions. The oops you guys discovered is a good thing to be caught early on. I can imagine the look on your support team's faces when the UPS suddenly turned itself off while the remaining servers still trying to perform a safe shutdown. I'm sure the secondary UPS was left running as a precaution until the test is successful.
I have seen a screw up where somebody cut into a live power cord thinking it was a tie wrap caused a major short in the PDU. The guy thought he was safe until he discovered whoever installed the servers didn't double check the power connections and loads so it created a cascade failure in several racks and lost several tons of data. Recovery took awhile.
Least to say it was not a good day.
...by design. TFA doesn't delve into too much detail, but a sudden power loss on such software RAID systems is a condition that ZFS accounts for. Its Copy-on-write (COW) and write-length stiping strategy prevents things such as the RAID5 write hole condition, a condition that has the biggest chance of occurring when a power loss event happens.
Last night we had a power outage. I shut down the desktop and was able to continue working for almost 2 hours on the laptop because with the Desktop down the UPS was only carrying the DSL router and the WiFi box.
At work. Power is a whole enterprise within the company I work for.
Dual gas powered Generators at each location, Rooms full of Batteries for the Telecoms gear (most is straight DC) and Inverters for the Servers. (DC PSUs are available for some of the servers we use but at so high a premium that the inverters are cheaper.)
We can handle a dozen Power cuts in a day with no service interruption or data loss ("Tested" 2 weeks ago) and we can stay up without external power for more than a week. After that we have to start trucking in additional diesel.
Yep. That's right. With sufficient fuel we can be online indefinably. Which we will have to do if we get hit by a major hurricane.
Which means the phone network is a lot more reliable than the Power grid where I live.
As for Data loss. I have over the years done a lot of recovery work. "Morfy" of "Murfy's Law" fame isn't a guy or a girl. He is a deamon from the darkest pits of hell sent to torment the souls of IT workers everywhere.
Imagine a server, where UPS #2 is down for repairs, UPS #1 fails during a power cut, When everything comes back up we find 2 failed hard drives in the RAID 5 on the email server.
despite previous testing and confirmation that the backups work the most recent tapes failed to read.
Eventually we sent the failed drives off to a Data recovery company in Florida because
#1. The customer can afford it.
#2. Simply "skipping" a few days of Email is not an option for a bank (hence the ability to afford data recovery).
So yeah. A UPS is essential. Just like RAID, Clustering and Backups but in the end it can all fail.
Best advise? Memorize all your important data. That way if you loose your mind, you are not responsible for the lost Data (or anything else).
--= Isn't it surprising how badly I spell ?
UPS units are relatively cheap, it's well worthwhile to invest in one, not just to protect from data loss:
* Hardware loss: I've seen a lot of hardware blown up from power interruptions. Do you trust your power company that much to provide clean power to you? Sure surge protectors help a bit, but a decent UPS costs maybe twice as much as a good surge protector.
* Time lost restoring your session after blackouts / brownouts: OK, maybe you're used to restarting your computer every morning anyway. But I like to leave things open and return to my desktop just the way I left it arranged.
* Stats: Using NUT and Munin, you get to monitor and log your power, so you can see things like exactly when your electricity went out and for how long, what load your PC is drawing after that last upgrade, etc. e.g.: http://hairball.bumba.net/cgi-bin/nut/upsstats.cgi?host=apc@localhost
* Graceful shutdown: you have a chance to tell your buddies that your power just went out, and you'll be coming back once it's restored.
Frankly, I'm a little surprised a backup battery isn't built into PC power supplies already, so they'd work a bit more like laptops. Same with networking gear.
This reminds me of my favorite power loss story. The facility was doing a generator test, where we were supposed to switch over from city power to the generator. Unfortunately it didn't happen smoothly and the UPS kicked in. Sadly it turned out that so many servers had been added since the original design, the UPS was really only good for fifteen minutes or so. The final problem was that our operator didn't notice the issue quickly enough and so the next thing everyone in IT knew is that our main data center just lost power.
We spent most of the day getting our servers back up from various states of disrepair (confirming the article, power loss is superbad). It turns out that our main medical software ran on a Tandem. Though the drives and such lost power, the CPU had a backup of D-batteries and survived the power loss just fine. Needless to say, we stopped making fun of their seemingly primitive emergency backup power.
That's why any datacenter worth putting your servers in pipes its power through a flywheel or some other electricity "cleaner". A 1-ton lead ball spinning at 10,000 RPMs isn't going to speed up that much on a spike like that.
rooooar
Less filling but tastes great!
Ok back on subject
A UPS isn't even a panacea... I had a server lose 3 out of 4 HDs in a 4 hour period. (The 3rd drive went at 4:57 PM Thursday Dec 11th 1997. Not that I would remember...) When I looked at the service history on it it had been losing drives for 8 months at an accelerating rate.
Turns out that the 3000va rack mount wonder UPS from that big, well known vendor was the problem. The switching unit in it was sending spikes into the equipment.
They wouldn't warranty it so I ended up putting a Triplite ISObar surge suppressor between it and the server in our test environment and it was in service for years after that.
Never trust any piece of equipment...
You now have a 1% efficient power supply.
A slightly more practical option (with better isolation than a standard electromagnetic transformer, but unfortunately also some inductive effects) would be to couple two motors with an insulative shaft.