Why Power Failures Can Always Lead To Data Loss
bigsmoke writes "So, all your servers run on RAID. You back up religiously. You're even sure that your backups are recoverable. But do you also need a UPS? According to Halfgaar (on Slashdot before to promote better Linux backup practices), yes, usually you do. He argues that despite technological advancements such as file system journaling, power failures can still cause data loss in most setups."
Power losses can cause data loss? Gee, you mean that my system that relies on electricity for everything it does can be adversely affected by power outages even if I take precautions? That's some good admin work there, Lou -- if only there were some sort of law that covered the tendency of things that can go wrong to go wrong...
Next week: Fires can make things warm, floods can make things wet.
Every year during my review, I just pray the words "slashdot.org" aren't mentioned.
From TFA:
(DRAM needs to be refreshed constantly otherwise it will loose it's data)
Fly, little data! Be free!
Definitely maybe?
UPS is more than just saving your data.
I remember a discussion on the PostgreSQL hacker's list about recoverability and transaction logs.
You can't make a system that will not lose data, you can only make a system that knows the last save point of 100% integrity.
There are too many variables and too much randomness on a cold hard power failure. You absolutely need a UPS that gives you time to shut down cleanly.
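To make the "last save point of 100% integrity" idea concrete, here is a minimal sketch of a checksummed append-only log. It is not how PostgreSQL implements its transaction log; the record layout and the names `encode_record` and `recover` are made up for illustration. Recovery simply keeps every record up to the first one that was cut off or corrupted.

```python
import struct
import zlib

# Hypothetical record layout: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Frame one log record as length + checksum + payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def recover(log: bytes) -> list[bytes]:
    """Return every record up to the first torn or corrupt one."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, checksum = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != checksum:
            break  # the write was interrupted here: stop at the last good record
        records.append(payload)
        offset = start + length
    return records


log = encode_record(b"debit A 10") + encode_record(b"credit B 10")
torn = log + encode_record(b"debit C 5")[:-3]  # simulate a cut-off final write
print(recover(torn))  # the two complete records; the torn third one is dropped
```

The third record is lost, but the system knows exactly where its trustworthy data ends, which is the most a log can promise after a hard power cut.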
APC is the only UPS maker on the market that has at least spent some small effort so that their UPSs can be properly integrated with a Linux machine. I made the mistake of purchasing an Ultra UPS as it was cheaper than the APC.
"Thanks for all the money you paid to us. We've used it to buy off ISO among other things" -Microsoft
is a weak spot in the design of most computers.
Computer power supplies should be built with enough spare capacitance to run things long enough for the computer to save critical data, and operating systems and critical apps should be able to handle an emergency shutdown and save critical data in very short order.
This is old hat in embedded systems.
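As a rough back-of-the-envelope sketch of what "spare capacitance" buys, the usable energy in a capacitor sagging from one voltage to another is ½·C·(V_start² − V_min²), and dividing by the load gives the hold-up time. The component values below are hypothetical, not taken from any particular power supply.

```python
def holdup_seconds(capacitance_f: float, v_start: float, v_min: float, load_w: float) -> float:
    """Time a capacitor can carry a constant-power load while sagging from v_start to v_min."""
    usable_joules = 0.5 * capacitance_f * (v_start**2 - v_min**2)
    return usable_joules / load_w


# Hypothetical figures: 470 uF bulk capacitor, 325 V sagging to 250 V, 200 W load.
t = holdup_seconds(470e-6, 325.0, 250.0, 200.0)
print(f"{t * 1000:.0f} ms of hold-up")
```

With these assumed numbers the result is on the order of tens of milliseconds, so any emergency save would have to be very short or the capacitance much larger.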
"Prefiero morir de pie que vivir siempre arrodillado!"
The funny part is someone had to have thought they were safe without a UPS for this to become news.
Wanna fight ? Bend over, stick your head up your ass, and fight for air.
In my company, everything is behind UPSs. Our SAN is even behind 2 separate UPSs. We thought everything was configured properly, but you'd be surprised what comes to roost when you test everything.
We recently had a test night where all we did was test the UPS system and shutdown procedures, and there were a couple of gotchas. Interestingly, the APC PowerChute app we were using defaulted to shutting down the UPS completely after the [first] server went down - not good. This was buried fairly deeply in the configuration.
Equally important to any protection measure, be it RAID, Power Protection, whatever - is testing!
You can recover your RAM minutes after losing power... no kidding! http://citp.princeton.edu/memory/
have you been defaced today?
I really can't understand people who don't have a UPS. Don't you care about your data? At all? The UPS is not very expensive (My BackUPS 900 is very nice and only $100), and will last a long time (you just replace the batteries now and then). Once you are on UPS, you can stop worrying about any power issues, journalling file systems, crash recovery, and all that. The computer will never fail due to power. If you run Linux, it will also never fail due to the OS. If you are a normal user, that means your computer will never fail, period. Seriously, there is no excuse for not having a UPS. Go and get one right now!
Ok, now everyone has something to give to your kid for the sysadmin-in-training class.
For the rest of us... back to work, nothing here you didn't learn your first year.
For the poster... Shame shame... Turn in your card.
Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
If there's clouds in your server room, your server's probably been slashdotted and is on fire!
mcgrew's razor: Never attribute to stupidity that which can be explained by greedy self-interest
"3.2. (Ecrypted) file systems"
Please tell me more about these ecrypted file systems. Do they also do gurnalling?
Intron: the portion of DNA which expresses nothing useful.
Rule #1.
NEVER plug a laser printer into a UPS. The power that the fuser draws is WAY too much.
Look at some of the cheap office units, they show little pictures on them, notice the printer icon is on the surge side, NOT battery/surge side.
If the power goes out, you should NOT be trying to print.
http://articles.techrepublic.com.com/5100-10878_11-6085460.html See #6
http://arstechnica.com/guides/other/ups.ars/3
http://www.jetcafe.org/npc/doc/ups-faq.html#0405 see 04.05
Would you put a space heater on a UPS? Shredder? Vacuum? Table Saw? If you put a laser printer on it, you may as well.
Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
...by design. TFA doesn't delve into too much detail, but a sudden power loss on such software RAID systems is a condition that ZFS accounts for. Its copy-on-write (COW) and write-length striping strategy prevents things such as the RAID5 write hole, a condition that is most likely to occur when power is lost.
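For anyone who hasn't met the write hole, here is a toy sketch of it using plain XOR parity. This is an illustration only, not ZFS or any real RAID code; the block contents and function names are invented.

```python
from functools import reduce


def parity(blocks: list[bytes]) -> bytes:
    """XOR equal-sized blocks together, as RAID 5 does for its parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving: list[bytes], parity_block: bytes) -> bytes:
    """Reconstruct one missing data block from the survivors plus parity."""
    return parity(surviving + [parity_block])


# A three-disk stripe: two data blocks and one parity block.
d0, d1 = b"AAAA", b"BBBB"
p = parity([d0, d1])

# In-place update of d0; power fails before the matching parity write lands.
d0 = b"CCCC"
stale_p = p

# Later the disk holding d1 dies and is rebuilt from d0 and the stale parity.
print(rebuild([d0], stale_p) == d1)  # False: d1 was never written to, yet it comes back wrong
```

A copy-on-write design sidesteps this by never overwriting a live stripe in place: new data and its parity go to fresh locations, and the old, still-consistent stripe stays valid until the new one is complete.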
The scary thing is that yet one more person can't freakin' tell the difference between "loose" and "lose." It's becoming an epidemic.
Don't disappoint your bird dog. Go to the range.
Yes, quite. It can't handle the substantial inrush current needed by the laser printer.
The "click" you hear in the UPS when the laser printer warms up is the UPS noting the drops on the power mains, which gives you some idea just how much current that printer needs.
I have a Samsung ML2150, and have noticed the same thing. Lights flicker, etc. whenever I submit a print job and the printer transitions from standby to active. The various UPSes in my office sense that, and respond with clicks and beeps.
Take the laser printer off the UPS. If you really need printer capability during a power failure, switch to an ink jet.
b.g.
Last night we had a power outage. I shut down the desktop and was able to continue working for almost 2 hours on the laptop because with the Desktop down the UPS was only carrying the DSL router and the WiFi box.
At work. Power is a whole enterprise within the company I work for.
Dual gas powered Generators at each location, Rooms full of Batteries for the Telecoms gear (most is straight DC) and Inverters for the Servers. (DC PSUs are available for some of the servers we use but at so high a premium that the inverters are cheaper.)
We can handle a dozen Power cuts in a day with no service interruption or data loss ("Tested" 2 weeks ago) and we can stay up without external power for more than a week. After that we have to start trucking in additional diesel.
Yep. That's right. With sufficient fuel we can be online indefinitely. Which we will have to do if we get hit by a major hurricane.
Which means the phone network is a lot more reliable than the Power grid where I live.
As for Data loss. I have over the years done a lot of recovery work. "Morfy" of "Murfy's Law" fame isn't a guy or a girl. He is a demon from the darkest pits of hell sent to torment the souls of IT workers everywhere.
Imagine a server where UPS #2 is down for repairs and UPS #1 fails during a power cut. When everything comes back up we find 2 failed hard drives in the RAID 5 on the email server.
Despite previous testing and confirmation that the backups work, the most recent tapes failed to read.
Eventually we sent the failed drives off to a Data recovery company in Florida because
#1. The customer can afford it.
#2. Simply "skipping" a few days of Email is not an option for a bank (hence the ability to afford data recovery).
So yeah. A UPS is essential. Just like RAID, Clustering and Backups but in the end it can all fail.
Best advice? Memorize all your important data. That way if you lose your mind, you are not responsible for the lost Data (or anything else).
--= Isn't it surprising how badly I spell ?
UPS units are relatively cheap, it's well worthwhile to invest in one, not just to protect from data loss:
* Hardware loss: I've seen a lot of hardware blown up from power interruptions. Do you trust your power company that much to provide clean power to you? Sure surge protectors help a bit, but a decent UPS costs maybe twice as much as a good surge protector.
* Time lost restoring your session after blackouts / brownouts: OK, maybe you're used to restarting your computer every morning anyway. But I like to leave things open and return to my desktop just the way I left it arranged.
* Stats: Using NUT and Munin, you get to monitor and log your power, so you can see things like exactly when your electricity went out and for how long, what load your PC is drawing after that last upgrade, etc. e.g.: http://hairball.bumba.net/cgi-bin/nut/upsstats.cgi?host=apc@localhost
* Graceful shutdown: you have a chance to tell your buddies that your power just went out, and you'll be coming back once it's restored.
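On the NUT point above: the `upsc` tool prints one `name: value` line per variable, which makes it easy to script against. The following is only a sketch of the shutdown decision, fed with made-up sample text rather than a live UPS; the 300-second threshold and the function names are assumptions, and in practice NUT's own `upsmon` handles this for you.

```python
def parse_upsc(text: str) -> dict[str, str]:
    """Turn 'name: value' lines into a dictionary."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep:
            values[name.strip()] = value.strip()
    return values


def should_shut_down(values: dict[str, str], min_runtime_s: float = 300.0) -> bool:
    """Shut down when on battery and either flagged low or nearly out of runtime."""
    flags = values.get("ups.status", "").split()
    if "OB" not in flags:  # OB = on battery; OL = on line power
        return False
    # A missing runtime reading is treated as zero, i.e. the cautious choice.
    runtime = float(values.get("battery.runtime", "0"))
    return "LB" in flags or runtime < min_runtime_s


# Sample text in the shape upsc prints; the numbers are made up.
sample = """battery.charge: 41
battery.runtime: 240
ups.load: 23
ups.status: OB"""
print(should_shut_down(parse_upsc(sample)))
```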
Frankly, I'm a little surprised a backup battery isn't built into PC power supplies already, so they'd work a bit more like laptops. Same with networking gear.
This reminds me of my favorite power loss story. The facility was doing a generator test, where we were supposed to switch over from city power to the generator. Unfortunately it didn't happen smoothly and the UPS kicked in. Sadly it turned out that so many servers had been added since the original design, the UPS was really only good for fifteen minutes or so. The final problem was that our operator didn't notice the issue quickly enough and so the next thing everyone in IT knew is that our main data center just lost power.
We spent most of the day getting our servers back up from various states of disrepair (confirming the article, power loss is superbad). It turns out that our main medical software ran on a Tandem. Though the drives and such lost power, the CPU had a backup of D-batteries and survived the power loss just fine. Needless to say, we stopped making fun of their seemingly primitive emergency backup power.
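The "so many servers had been added since the original design" trap is easy to see with some rough arithmetic. This is a simplified sketch with invented numbers (battery capacity, per-server draw, inverter efficiency are all assumptions), treating runtime as usable stored energy divided by load.

```python
def runtime_minutes(battery_wh: float, load_w: float, inverter_efficiency: float = 0.9) -> float:
    """Rough UPS runtime: usable stored energy divided by the load it carries."""
    return battery_wh * inverter_efficiency / load_w * 60


# Hypothetical battery bank sized for the original server count.
battery_wh = 5000.0
for servers in (10, 20, 40):
    load_w = servers * 400.0  # assume 400 W per server
    print(servers, "servers:", round(runtime_minutes(battery_wh, load_w)), "minutes")
```

Real batteries do even worse than this linear estimate at high discharge rates, so quadrupling the load cuts the runtime by more than a factor of four.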
...If you're a Mac fanboy running a network of Apple computers. If anything goes wrong, it's an artistic expression and anyone who criticizes the problem is a closed-minded square who "doesn't get it." Then you sit back in self satisfaction listening to alternative pop, thinking about how hip and different and enlightened you are.
Happy thoughts power supply: Dead stable.
Linux networks can run on happy thoughts as well as long as you run on electricity during the setup and installation stages and then switch to happy thoughts once everything's running properly...you just have to make sure you never, ever run emacs, vi, or Gpaint.
"When information is power, privacy is freedom" - Jah-Wren Ryel
This morning we had a planned shutdown of 100 servers for electrical work; all were on the same 40 kVA UPS. All went fine: we shut down all servers to be safe, kept some stuff online for monitoring and the like, and then main power was shut off. The UPS gladly took the load, with an estimated battery life of 75 minutes, more than what was needed for the electrical work. Once this was done, the electrician put the main power back on, and... the UPS shut down!
Since all servers were stopped already we didn't lose anything, but we had to put the UPS in bypass mode for a while, then back on, and now we hope for the best waiting for the UPS to be repaired, crossing most of our fingers because of the holidays...
In summary: testing that the UPS can handle the power coming back is as important as testing that it can handle the power going off.
Votez ecolo : Chiez dans l'urne !
Last night we had a power outage. I shut down the desktop and was able to continue working for almost 2 hours on the laptop because with the Desktop down the UPS was only carrying the DSL router and the WiFi box.
good uptime for a laptop. got a second battery? (I know I do)
Inverters for the Servers. (DC PSUs are available for some of the servers we use but at so high a premium that the inverters are cheaper.)
that's because it just has to invert it before it can step it up or down. If you supply DC you are actually introducing another necessary step. It gets hard to cram 2x the electronics into the PS. Inverters are definitely the way to go.
We can handle a dozen Power cuts in a day with no service interruption or data loss ("Tested" 2 weeks ago) and we can stay up without external power for more than a week. After that we have to start trucking in additional diesel.
Yep. That's right. With sufficient fuel we can be online indefinitely. Which we will have to do if we get hit by a major hurricane.
Might want to rethink how easy it is to get a truck in during a hurricane. ;) Unless it's more of a boat, think Katrina.
Imagine a server where UPS #2 is down for repairs and UPS #1 fails during a power cut. When everything comes back up we find 2 failed hard drives in the RAID 5 on the email server. Despite previous testing and confirmation that the backups work, the most recent tapes failed to read.
um, ouch?
Best advice? Memorize all your important data. That way if you lose your mind, you are not responsible for the lost Data (or anything else).
Was going to say, all of the above is moot if an EF5 rolls through town. Better add "offsite backup" to your list if it's not already there. With the EF5 that ran through here last month, some people got their backups turned into "offsite" backups. (maintenance guy was here last week, said they are still looking for their dump truck )
I work for the Department of Redundancy Department.
Any professional server or data center setup that does not include a UPS for a graceful shutdown... is almost by definition NOT professional.
The typical small UPS has some amount of surge protection built in, but it's typically only good for a couple thousand joules at most. And if you get a spike that is big enough to blow a varistor, you also get to buy a new UPS.
A better solution is to put a "whole house" surge protector on the circuit-breaker panel. It protects everything, and is rated for far more joules: five or six pounds of varistors can absorb a lot more shock than one ounce of varistors. They cost about $100, and can be found at most big hardware stores or electrical supply houses. That doesn't eliminate the need for a UPS, but it does protect the UPS, along with the other equipment, from most voltage spikes.
Last year, lightning hit the power pole 20 feet from my house. We know where it hit because the pole caught fire. My next-door neighbors on both sides lost every single piece of electrical equipment -- not just computers, TVs, and stereos, but also fridge, microwave, water heater, and range. All of it was damaged beyond repair. We barely noticed the hit, except for the bright flash of light, and had no damage at all.
Real text editors will recover gracefully from such situations. :-)
(I'm thinking along the lines of @UEDIT on OS2200, which saves its entire virtual memory state to disk periodically and can recover it with ease at the next startup, or the old EDT editor on VMS, which saved the commands one entered and could replay them when a recovery was specified).
I'm surprised more text editors don't have a similar feature. I think vim does, tho...?
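The replay approach described above is simple enough to sketch. This is not EDT's actual journal format, just a hypothetical illustration: a real editor would append each command to a journal file as it is entered, so that after a crash the last saved file plus the journal is enough to rebuild the buffer.

```python
def apply_command(buffer: list[str], command: tuple) -> None:
    """Apply one editing command to the buffer in place."""
    op, *args = command
    if op == "insert":
        line_no, text = args
        buffer.insert(line_no, text)
    elif op == "delete":
        (line_no,) = args
        del buffer[line_no]


def replay(saved_lines: list[str], journal: list[tuple]) -> list[str]:
    """Rebuild the unsaved buffer from the last saved file plus the command journal."""
    buffer = list(saved_lines)
    for command in journal:
        apply_command(buffer, command)
    return buffer


# Commands entered since the last save (here just an in-memory list for the demo).
journal = [("insert", 1, "second line"), ("insert", 2, "third line"), ("delete", 0)]
print(replay(["first line"], journal))
```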
Mainframe/UNIX Bit Twiddler and long time Windows/Linux Hobbyist.
The Theorem Theorem: If If, Then Then.
If you're not at the machine, or don't know how to shut down without a CRT, the disk can get messed up when the UPS runs out of power. Unless you only have a desktop machine with no network applications writing to disk (no BitTorrent); then you might be OK if you just walk away from your keyboard and let the system become quiescent before it loses power.
1) You build a RAID5 array
2) You backup
3) You test your backups
4) You plug your server DIRECTLY INTO THE WALL?!?!
Ummm DUH! Of course you need a UPS - what kind of yutz does 1-3 and then powers the server off of unconditioned wall power?
---- "Logoff! That cookie shit makes me nervous!" - A. Soprano
Less filling but tastes great!
Ok back on subject
A UPS isn't even a panacea... I had a server lose 3 out of 4 HDs in a 4 hour period. (The 3rd drive went at 4:57 PM Thursday Dec 11th 1997. Not that I would remember...) When I looked at the service history on it it had been losing drives for 8 months at an accelerating rate.
Turns out that the 3000 VA rack mount wonder UPS from that big, well known vendor was the problem. The switching unit in it was sending spikes into the equipment.
They wouldn't warranty it, so I ended up putting a Tripp Lite Isobar surge suppressor between it and the server in our test environment, and it was in service for years after that.
Never trust any piece of equipment...
I was heavily involved in the planning for moving our I.T. infrastructure to a different place.
It went from what was essentially a closet in a basement with a single AC unit and individual UPSes on each server.
So I decided redundancy was key. We had redundant AC, but the best part was power.
All servers (70 of them at last reckoning) are attached to an APC Symmetra that nominally gives 40 minutes of battery power. The Symmetra in turn is backed up by a 125kW natural-gas fired generator that spools up within 10 seconds.
It was decided we could suffer a brief AC outage so that was simply attached to the generator. There were two 2 ton AC units in place.
Even had the foresight to extend a tendril out to the MDF in the building so that our telecom and ISP could plug their UPS into the generator circuit.
And what was the fly in the ointment? Our DNS services were provided by an outside entity. So one day we had a power failure that hit a very large swath of the city and included us and the entity that provided DNS services.
So while everything in our shop was running, nobody from the outside could see our public services, and nobody inside could get out.
We actually got hold of the DNS zone and had our own after that.
All you need to do is have the grid power feed some high wattage light bulbs. And near the light bulbs are some solar cells. The output from the solar cells is used to charge batteries which feed an inverter that actually powers the computer. Of course there is some power loss in the conversion process, and you need to have some (OK, a lot) of the input power to the system committed towards running a cooling unit to keep things at a reasonable temperature. But the resulting device provides clean power with no possibility of any surges getting through to the protected equipment.
Of course, if you go to this level of trouble for your power source, then I'd also suggest opto-isolating all signal lines to and from the server. And enclose the server in a well grounded Faraday cage. And it wouldn't be a bad idea to have a dedicated comm link to a duplicate server located elsewhere. Preferably on a different tectonic plate.
The hard drives and DMA controller however, will run a bit longer; so if data is being written to disk, the DMA controller will keep reading data from memory, but it has no idea that this data is corrupted.
Pretty sure that's wrong. It used to be (20 years ago) that hard drives losing power in this way had a chance of the heads crashing against the platters (the fabled "hard drive crash"). To solve this, modern drives are very sensitive to the power input. As soon as power fails the drives extract power from the spinning platters to move the heads over to the parked position. Regardless of what the DMA controller thinks it should be doing, the hard drive is busy parking the heads.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
I don't think this has been true since... maybe 8-10 years now? Definitely since MR drives came on the market (ages ago).
Modern drives have:
It does NOT go writing crap all over whatever's between your data and the parked position, unless the drive is a defective design. The emergency park is a fairly brutal affair, and you'll typically see the datasheet list a maximum number that's notably lower than the max power cycles.
It's also essential these days because:
Normally that holds true. I've seen some drives (1.0" and 1.8" miniature ones) which suffered from head-on-platter contact, but that was due to misdesign in the power supply feeding them (e.g. voltage rails going slightly negative, draining the cap early).
But anyway, the worst you'll get with the power going out is a partially written sector, which will then be marked bad, probably permanently. Or maybe a bunch of sectors. Or maybe bad in a different order to what the OS sent, due to caching.
If you had a drive and/or RAID fail due to power outage, you should get a refund. You might lose a tiny amount of data, not the whole lot.