Why Power Failures Can Always Lead To Data Loss
bigsmoke writes "So, all your servers run on RAID. You back up religiously. You're even sure that your backups are recoverable. But do you also need a UPS? According to Halfgaar (on Slashdot before to promote better Linux backup practices), yes, usually you do. He argues that despite technological advancements such as file system journaling, power failures can still cause data loss in most setups."
Power losses can cause data loss? Gee, you mean that my system that relies on electricity for everything it does can be adversely affected by power outages even if I take precautions? That's some good admin work there, Lou -- if only there were some sort of law that covered the tendency of things that can go wrong to go wrong...
Next week: Fires can make things warm, floods can make things wet.
Every year during my review, I just pray the words "slashdot.org" aren't mentioned.
What if your data's on the cloud?
First!
From TFA:
(DRAM needs to be refreshed constantly otherwise it will loose it's data)
Fly, little data! Be free!
Definitely maybe?
UPS is more than just saving your data.
I remember a discussion on the PostgreSQL hacker's list about recoverability and transaction logs.
You can't make a system that will not lose data, you can only make a system that knows the last save point of 100% integrity.
There are too many variables and too much randomness on a cold hard power failure. You absolutely need a UPS that gives you time to shut down cleanly.
As my nieces would say, Durrrr! Yes, of course - you need a UPS. Next question please.
When It Counts.
into a huge cache on the drive don't get written permanently if the power quits? Why didn't somebody tell me about this before?
I always thought Gremlins caused data loss.
Since when did power have anything to do with it?
APC is the only UPS maker on the market that has at least spent some small effort so that their UPSs can be properly integrated with a Linux machine. I made the mistake of purchasing an Ultra UPS as it was cheaper than the APC.
"Thanks for all the money you paid to us. We've used it to buy off ISO among other things" -Microsoft
is a weak spot in the design of most computers.
Computer power supplies should be built with enough spare capacitance to run things long enough for the computer to save critical data, and operating systems and critical apps should be able to handle an emergency shutdown and save critical data in very short order.
This is old hat in embedded systems.
"Prefiero morir de pie que vivir siempre arrodillado!"
The funny part is someone had to have thought they were safe without a UPS for this to become news.
Wanna fight ? Bend over, stick your head up your ass, and fight for air.
In my company, everything is behind UPSs. Our SAN is even behind 2 separate UPSs. We thought everything was configured properly, but you'd be surprised what comes to roost when you test everything.
We recently had a test night where all we did was test the UPS system and shutdown procedures, and there were a couple of gotchas. Interestingly, the APC PowerChute app we were using defaulted to shutting down the UPS completely after the [first] server went down - not good. This was buried fairly deeply in the configuration.
Equally important to any protection measure, be it RAID, Power Protection, whatever - is testing!
You can recover your RAM minutes after losing power. No kidding! http://citp.princeton.edu/memory/
have you been defaced today?
I know that PHB's will try to cut costs, and that unnecessary hardware is the first to be cut, but is there ANYONE who believes that a UPS is not needed? Are there really people out there that think, "We don't need the UPS right now. We can wait until we have more money."
It boggles my mind that there is even a need for such an article
Great civilizations have lived and died on false theories. Don't mess up mine with a few facts.
Well, duh. Thank you Captain Obvious.
Here's a question for you all. I have a cheap Conext (made by APC) IPS. Yes, it's an interruptible power supply. It used to work fine, but once I added a Samsung b/w laser printer, whenever the printer's heating element first comes on, the UPS drops out immediately and the computer restarts. Even put a new battery in it; no help. The printer, btw, is NOT plugged into the UPS. The line voltage appears to get yanked down just momentarily, and the computer ignores it when off the UPS. The UPS, with nothing plugged into it, always clicks off then back on once during the printer's warm-up cycle. Is the UPS just too small (900 AVR)?
"I might have made a tactical error in not going to a physician for 20 years." -- Warren Zevon
I really can't understand people who don't have a UPS. Don't you care about your data? At all? The UPS is not very expensive (My BackUPS 900 is very nice and only $100), and will last a long time (you just replace the batteries now and then). Once you are on UPS, you can stop worrying about any power issues, journalling file systems, crash recovery, and all that. The computer will never fail due to power. If you run Linux, it will also never fail due to the OS. If you are a normal user, that means your computer will never fail, period. Seriously, there is no excuse for not having a UPS. Go and get one right now!
Ok, now everyone has something to give to your kid for the sysadmin-in-training class.
For the rest of us... back to work, nothing here you didn't learn your first year.
For the poster... Shame shame... Turn in your card.
Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
If you back up religiously, assuming you have the backups on some sort of removable media, why would recovering from them be impossible when data loss via electrical outage occurs?
Dur-durdur!
It is pitch black. You are likely to be eaten by a grue.
"3.2. (Ecrypted) file systems"
Please tell me more about these ecrypted file systems. Do they also do gurnalling?
Intron: the portion of DNA which expresses nothing useful.
So a UPS is needed, really. Say you're working on a long block of code, haven't hit save in a while, and autosave is off... Bam, power is out and you just lost 100 lines of code you spent hours on. Go get a UPS.
...by design. TFA doesn't delve into too much detail, but a sudden power loss on such software RAID systems is a condition that ZFS accounts for. Its copy-on-write (COW) and write-length striping strategy prevents things such as the RAID5 write hole, a condition that is most likely to occur when power is lost.
The scary thing is that yet one more person can't freakin' tell the difference between "loose" and "lose." It's becoming an epidemic.
Don't disappoint your bird dog. Go to the range.
Last night we had a power outage. I shut down the desktop and was able to continue working for almost 2 hours on the laptop because with the Desktop down the UPS was only carrying the DSL router and the WiFi box.
At work. Power is a whole enterprise within the company I work for.
Dual gas powered Generators at each location, Rooms full of Batteries for the Telecoms gear (most is straight DC) and Inverters for the Servers. (DC PSUs are available for some of the servers we use but at so high a premium that the inverters are cheaper.)
We can handle a dozen Power cuts in a day with no service interruption or data loss ("Tested" 2 weeks ago) and we can stay up without external power for more than a week. After that we have to start trucking in additional diesel.
Yep. That's right. With sufficient fuel we can be online indefinably. Which we will have to do if we get hit by a major hurricane.
Which means the phone network is a lot more reliable than the Power grid where I live.
As for Data loss. I have over the years done a lot of recovery work. "Morfy" of "Murfy's Law" fame isn't a guy or a girl. He is a deamon from the darkest pits of hell sent to torment the souls of IT workers everywhere.
Imagine a server, where UPS #2 is down for repairs, UPS #1 fails during a power cut, When everything comes back up we find 2 failed hard drives in the RAID 5 on the email server.
despite previous testing and confirmation that the backups work the most recent tapes failed to read.
Eventually we sent the failed drives off to a Data recovery company in Florida because
#1. The customer can afford it.
#2. Simply "skipping" a few days of Email is not an option for a bank (hence the ability to afford data recovery).
So yeah. A UPS is essential. Just like RAID, Clustering and Backups but in the end it can all fail.
Best advise? Memorize all your important data. That way if you loose your mind, you are not responsible for the lost Data (or anything else).
--= Isn't it surprising how badly I spell ?
UPS units are relatively cheap, it's well worthwhile to invest in one, not just to protect from data loss:
* Hardware loss: I've seen a lot of hardware blown up from power interruptions. Do you trust your power company that much to provide clean power to you? Sure, surge protectors help a bit, but a decent UPS costs maybe twice as much as a good surge protector.
* Time lost restoring your session after blackouts / brownouts: OK, maybe you're used to restarting your computer every morning anyway. But I like to leave things open and return to my desktop just the way I left it arranged.
* Stats: Using NUT and Munin, you get to monitor and log your power, so you can see things like exactly when your electricity went out and for how long, what load your PC is drawing after that last upgrade, etc. e.g.: http://hairball.bumba.net/cgi-bin/nut/upsstats.cgi?host=apc@localhost
* Graceful shutdown: you have a chance to tell your buddies that your power just went out, and you'll be coming back once it's restored.
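For the stats point above, a rough sketch of what reading NUT's numbers can look like. NUT's `upsc` client prints one "name: value" pair per line; the sample text below is made up for illustration (it is not output from the poster's UPS), and the helper names are hypothetical.

```python
def parse_upsc(text):
    """Parse `upsc`-style "name: value" lines into a dict of strings."""
    readings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon at all
            readings[name.strip()] = value.strip()
    return readings


def on_battery(readings):
    # NUT reports space-separated flags in ups.status,
    # e.g. "OL" (on line power) or "OB" (on battery).
    return "OB" in readings.get("ups.status", "").split()


# Illustrative sample only; real values depend on the UPS and driver.
sample = """battery.charge: 100
battery.runtime: 2400
ups.load: 23
ups.status: OL"""

readings = parse_upsc(sample)
print(readings["ups.load"], on_battery(readings))
```

In practice you would feed it the text captured from something like `upsc apc@localhost` and log the fields you care about.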
Frankly, I'm a little surprised a backup battery isn't built into PC power supplies already, so they'd work a bit more like laptops. Same with networking gear.
And I bet they have a longer uptime than yours....
No sig today...
This reminds me of my favorite power loss story. The facility was doing a generator test, where we were supposed to switch over from city power to the generator. Unfortunately it didn't happen smoothly and the UPS kicked in. Sadly it turned out that so many servers had been added since the original design, the UPS was really only good for fifteen minutes or so. The final problem was that our operator didn't notice the issue quickly enough and so the next thing everyone in IT knew is that our main data center just lost power.
We spent most of the day getting our servers back up from various states of disrepair (confirming the article, power loss is superbad). It turns out that our main medical software ran on a Tandem. Though the drives and such lost power, the CPU had a backup of D-batteries and survived the power loss just fine. Needless to say, we stopped making fun of their seemingly primitive emergency backup power.
...If you're a Mac fanboy running a network of Apple computers. If anything goes wrong, it's an artistic expression and anyone who criticizes the problem is a closed-minded square who "doesn't get it." Then you sit back in self satisfaction listening to alternative pop, thinking about how hip and different and enlightened you are.
Happy thoughts power supply: Dead stable.
Linux networks can run on happy thoughts as well as long as you run on electricity during the setup and installation stages and then switch to happy thoughts once everything's running properly...you just have to make sure you never, ever run emacs, vi, or Gpaint.
"When information is power, privacy is freedom" - Jah-Wren Ryel
Post this under most obvious thing ever!
I guess the author wasn't worried about any events or transactions that were in the process of being committed. Nor has he managed any production databases.
Next thing you know there will be an article about not being able to surf the web when the Internet connection is down.
He fails to mention battery packs for RAID cards. They maintain power to the disk cache memory on the card in the event of a power failure, which allows the card to finish writing the cache to disk once main power is up again. That's one of the arguments for a hardware RAID solution.
Always? Maybe if you are using Linux. Not if you are using an OS that runs ZFS filesystems.
--AC
Linux software RAID, and any RAID basically, needs to know if the disks of the array are still properly matched to each other when the array is initialized. When power fails, or when you press reset, they will be in a "dirty" state, and the system may need to recreate the array. That is, if it can. I've never tried it, but I can imagine that a RAID0 can be completely destroyed by a power failure. But, don't take my word for that...
One way is if the partition table and drivers on one slice gets trashed, and the first few meg of the data partition (directory mostly) get trashed on the other slice, by the same event that also happens to hang the computer.
You have no idea how unpleasant it is to reconstruct a partition table from scratch and reinstall firewire driver partitions using DD. It didn't BOOT, but I was able to bribe it to mount and copy the data off.
I work for the Department of Redundancy Department.
I never thought of the problem with the degaussing coil in the monitor. But then again, who still uses a CRT anymore? (Well I do at work since my company is too cheap to buy me a new monitor). Point is you can leave the monitor powered by the UPS if it is an LCD type.
At my company's NOC the UPS failed... so everything failed, except the fancy new generator they had just installed.
Big problem though: when the UPS totally crapped itself, power from the generator couldn't get through the UPS to any of the devices plugged into it. Whoops.
Blessed be he who reads this post, Cursed be he who tells my boss.
http://developers.slashdot.org/article.pl?sid=08/07/20/1624253&from=rss
This morning we had a planned shutdown of 100 servers for electricity works; all were on the same 40 kVA UPS. All went fine: we shut down all servers to be safe, kept some stuff online for monitoring and the like, then main power was shut off. The UPS gladly took the load, with an estimated battery life of 75 minutes, more than what was needed for the electrical work. Once this was done, the electrician put the main power back on, and... the UPS shut down!
Since all servers were stopped already we didn't lose anything, but we had to put the UPS in bypass mode for a while, then back on, and now we hope for the best waiting for the UPS to be repaired, crossing most of our fingers because of the holidays...
In summary: testing that the UPS can handle the power coming back is as important as testing that it can handle the power shutting down.
Votez ecolo : Chiez dans l'urne !
I worked for a respectable insurance company. The other day a "well-known" H/W maker came to our place to upgrade the hardware for a mainframe, in our computer room.
They unscrewed the mainframe's panels and put them aside, on the large thingy right beside it.
That thingy aside happened to be the UPS, which started to heat up, having its vents blocked by the panels. At some point, it gave up, sending a massive "shutdown now" command to all connected computers, including most of the web infrastructure...
It's been more than 2 days now, and we are still struggling to bring all the pieces together...
Last night we had a power outage. I shut down the desktop and was able to continue working for almost 2 hours on the laptop because with the Desktop down the UPS was only carrying the DSL router and the WiFi box.
good uptime for a laptop. got a second battery? (I know I do)
Inverters for the Servers. (DC PSUs are available for some of the servers we use but at so high a premium that the inverters are cheaper.)
that's because it just has to invert it before it can step it up or down. If you supply DC you are actually introducing another necessary step. It gets hard to cram 2x the electronics into the PS. Inverters are definitely the way to go.
We can handle a dozen Power cuts in a day with no service interruption or data loss ("Tested" 2 weeks ago) and we can stay up without external power for more than a week. After that we have to start trucking in additional diesel.
Yep. That's right. With sufficient fuel we can be online indefinably. Which we will have to do if we get hit by a major hurricane.
Might want to rethink how easy it is to get a truck in during a hurricane. ;) Unless it's more of a boat, think Katrina.
Imagine a server, where UPS #2 is down for repairs, UPS #1 fails during a power cut, When everything comes back up we find 2 failed hard drives in the RAID 5 on the email server. despite previous testing and confirmation that the backups work the most recent tapes failed to read.
um, ouch?
Best advise? Memorize all your important data. That way if you loose your mind, you are not responsible for the lost Data (or anything else).
Was going to say, all of the above is moot if an EF5 rolls through town. Better add "offsite backup" to your list if it's not already there. With the EF5 that ran through here last month, some people got their backups turned into "offsite" backups. (maintenance guy was here last week, said they are still looking for their dump truck )
I work for the Department of Redundancy Department.
TFA has no mention of battery backup, which is insane, since not only can it improve reliability, it can allow the drive to return write success before the data actually makes it to disk, leading to significant write performance gains in many circumstances.
Any professional server or data center setup that does not include a UPS for a graceful shutdown... is almost by definition NOT professional.
The typical small UPS system has some amount of surge protection built in. But it's typically only good for at most a couple thousand joules. And if you get a spike that is big enough to blow a varistor, you also get to buy a new UPS.
A better solution is to put a "whole house" surge protector on the circuit-breaker panel. It protects everything, with a much higher joule rating. Five or six pounds of varistors can absorb a lot more shock than one ounce of varistors. They cost about $100, and can be found at most big hardware stores or electrical supply houses. That doesn't eliminate the need for a UPS. It does protect the UPS, along with the other equipment, from most voltage spikes.
Last year, lightning hit the power pole 20 feet from my house. We know where it hit because the pole caught fire. My next-door neighbors on both sides lost every single piece of electrical equipment -- not just computers, TV's, and stereos, but also fridge, microwave, water heater, and range. All of it was damaged beyond repair. We barely noticed the hit, except for the bright flash of light, and had no damage at all.
The major reason for doing this is that I live in a rural area where bad weather can make the power glitchy. One of the neatest things about using UPSs is that I can unplug stuff from the wall (e.g. if I need to move cables to a different power socket) and keep the computers alive.
Engineering is the art of compromise.
If you're not at the machine, or don't know how to shut down without a CRT, the disk can get messed up when the UPS runs out of power. Unless you only have a desktop machine with no network applications writing to disk (no BitTorrent); then you might be OK if you just walk away from your keyboard and let the system become quiescent before it loses power.
ZFS has end-to-end checksums for every block, for data and for metadata, so this problem will never arise for it.
http://opensolaris.org/os/community/zfs/faq/
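To make the checksum idea concrete, here is a toy sketch of per-block checksumming. It is not how ZFS is implemented; the in-memory "record" and the function names are assumptions for illustration. Note that a checksum lets the reader detect a torn or garbled block; it does not by itself bring back data that never reached the disk.

```python
import hashlib


def store_block(data: bytes):
    """Return a (checksum, data) record; the checksum travels with the block."""
    return (hashlib.sha256(data).digest(), data)


def read_block(record):
    """Return the data only if it still matches the checksum stored with it."""
    checksum, data = record
    if hashlib.sha256(data).digest() != checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return data


good = store_block(b"important data")
# Simulate a write that was cut short by a power failure.
torn = (good[0], b"important da")

print(read_block(good))
```

Reading `torn` raises `IOError` instead of silently handing back garbage, which is the property the comment is pointing at.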
People, this is such a non-issue. Are there really places out there that run prod-level systems without battery backup? I bet those sysadmins got their degree from a cereal box.
Hi, I Boris. Hear fix bear, yes?
Everyone is worried about data loss, but sometimes I wonder what sort of data gains happen without our knowledge or consent?
“Common sense is not so common.” — Voltaire
For those who want to know what TFA says without actually reading it, it boils down to:
1. What you think you have saved and what has actually been written to stable storage may not be the same. In particular, things may still sit in DRAM, waiting to be written to disk.
2. What gets written to stable storage after the power failure may not be what was intended to be written. You could end up with corrupt data.
3. That's the hardware side of the story; software introduces many more hazards by lengthening the path between your actions and stable storage.
Please correct me if I got my facts wrong.
And remember to recondition those RAID controller cache batteries! Nine out of ten servers I ran in to at my last job had a daily-recurring syslog entry that the RAID battery was shot because no one had ever bothered to recondition it because of the (relatively minor) performance hit to turn the cache off.
If I remember correctly, LiveJournal had a MAJOR data corruption issue where they had to reformat and restore off of tape to repair because their cache batteries had gone tits up and their wonky drivers lacked the verbiage to remind them.
I'm confused by a number of his recommendations. LVM storage is default for most distros now. There are a few things that he suggests...like keep forcing disk syncs, that slow down the processes and still allow for files to change as you read them.
2005 called, they wanted to let you know snapshots are stable.
According the Gentoo Wiki, you are even more susceptible to data loss in the event of a power failure when using an encrypted file system. I have to admit that I can't think of the reason why this would be so, because as I explained, after a power failure, everything that is written to disk is garbage anyway, whether it passes through some encryption pipeline or not. But, it's something you want to keep in mind.
Can anyone please explain why encrypted file systems should be more susceptible to data loss? (If it is true, of course. If not, please confirm that it isn't.)
thomasdamgaard.dk.
What kind of idiot would run a server without a UPS?!?! Perhaps some pimply faced 15 year old... but you wouldn't catch me dead without my 80 KVA UPS!!!!
1) You build a RAID5 array
2) You backup
3) You test your backups
4) You plug your server DIRECTLY INTO THE WALL?!?!
Ummm DUH! Of course you need a UPS - what kind of yutz does 1-3 and then powers the server off of unconditioned wall power?
---- "Logoff! That cookie shit makes me nervous!" - A. Soprano
...in financial transactions. Database transactions are interlocked in such a way that if $1000 is transferred from an account in bank A to an account in bank B, then no matter what happens, come hell or high water, when the dust settles the $1000 has either been moved to bank B or remains in bank A. There cannot be $0 in both or $1000 in both.
If file systems aren't designed to work this way, it's not because of any intrinsic limitation on what is or is not possible, it's because system designers have made a conscious effort to favor speed over reliability.
Even in supposedly mission-critical servers.
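A small sketch of the all-or-nothing behaviour described above, using Python's bundled `sqlite3` module. The table layout and the `transfer` helper are made up for the example, and both "banks" live in one in-memory database, so this only shows atomicity within a single database; a real inter-bank transfer needs a distributed protocol on top, and an in-memory database is obviously not durable.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two rows; both updates land or neither does."""
    with conn:  # commits on normal exit, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (remaining,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if remaining < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 1000), ("B", 0)])
conn.commit()

transfer(conn, "A", "B", 1000)
print(balances(conn))
```

A second transfer out of the now-empty account raises `ValueError`, and the rollback leaves both balances exactly as they were.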
"How to Do Nothing," kids activities, back in print!
I have thought about this matter, and I think it is important to factor in how much data is an acceptable loss.
My /home is backed up every night, and backups are kept for 12 months. That means there are three ways I can lose data:
1. I lose it before it's been backed up. This applies to, at most, the last 24 hours of work I do.
2. I lose data after it's been backed up, and I don't notice for at least 12 months.
3. The data disappears from both the live system and the backup, before I can recover it.
2 is fairly unlikely. 3 is one that worries me, but I'm working on that by arranging my backup to be duplicated off-site. 1 is the one that bothers me the most. I imagine my harddisk failing on the day I just finished a large project, thus replacing the joy of having finished it by the pain of just having lost the final pieces.
What TFA talks about (losing data before it has been written to disk) doesn't worry me so much. I doubt I'd lose more than a few minutes of work that way. And I'd know that something had failed, so I would expect the data loss.
It gets more worrisome when you provide services to remote users. Imagine that you run an e-commerce website, and a customer has just placed an order and received a confirmation, and then your machine goes *poof* before the order has been committed to stable storage. The customer would not be pleased, especially if they had already paid for the order.
I think, in cases like the e-commerce example above, you would want to make sure that changes had been recorded before telling anyone that they had. And now comes a question: is there any way I, as a programmer, can verify that something has been written to stable storage? Can I tell the system (library/operating system/database/whatever) "write this down and don't return to me before you've actually written it"? And preferably without writing _everything_ to stable storage.
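One common answer to that question on POSIX systems is `fsync`, which asks the kernel to push one file's data to the device before returning, without flushing everything else. The sketch below assumes a POSIX platform; `durable_write` is a hypothetical helper name, and a drive with a volatile write cache that ignores flush requests can still lose the data, so treat it as a sketch rather than a guarantee.

```python
import os


def durable_write(path, data: bytes):
    """Write to a temp file, fsync it, rename it into place, fsync the directory."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to push it to the device
    os.replace(tmp, path)      # atomic rename: old or new contents, never a mix
    # fsync the directory so the rename itself is recorded (POSIX only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

For the e-commerce case, you would call something like this (or let the database do the equivalent at commit time) and only then send the confirmation to the customer.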
Please correct me if I got my facts wrong.
Less filling but tastes great!
Ok back on subject
A UPS isn't even a panacea... I had a server lose 3 out of 4 HDs in a 4 hour period. (The 3rd drive went at 4:57 PM Thursday Dec 11th 1997. Not that I would remember...) When I looked at the service history on it it had been losing drives for 8 months at an accelerating rate.
Turns out that the 3000 VA rack mount wonder UPS from that big, well known vendor was the problem. The switching unit in it was sending spikes into the equipment.
They wouldn't warranty it so I ended up putting a Tripp Lite Isobar surge suppressor between it and the server in our test environment and it was in service for years after that.
Never trust any piece of equipment...
How often is this?
is it equivalent to "possibly always" or "always sometimes"?
I was heavily involved in the planning for moving our I.T. infrastructure to a different place.
It went from what was essentially a closet in a basement with a single AC unit and individual UPS's on each server.
So I decided redundancy was key. We had redundant AC, but the best part was power.
All servers (70 of them at last reckoning) are attached to an APC Symmetra that nominally gives 40 minutes of battery power. The Symmetra in turn is backed up by a 125kW natural-gas fired generator that spools up within 10 seconds.
It was decided we could suffer a brief AC outage so that was simply attached to the generator. There were two 2 ton AC units in place.
Even had the foresight to extend a tendril out to the MDF in the building so that our telecom and ISP could plug their UPS into the generator circuit.
And what was the fly in the ointment? Our DNS services were provided by an outside entity. So one day we had a power failure that hit a very large swath of the city and included us and the entity that provided DNS services.
So while everything in our shop was running, nobody from the outside could see our public services, and nobody inside could get out.
We actually got hold of the DNS zone and had our own after that.
All you need to do is have the grid power feed some high wattage light bulbs. And near the light bulbs are some solar cells. The output from the solar cells is used to charge batteries which feed an inverter that actually powers the computer. Of course there is some power loss in the conversion process, and you need to have some (ok, a lot) of the input power to the system committed towards running a cooling unit to keep things at a reasonable temperature. But the resulting device provides clean power with no possibility of any surges getting thru to the protected equipment.
Of course, if you go to this level of trouble for your power source, then I'd also suggest opto-isolating all signal lines to and from the server. And enclose the server in a well grounded Faraday cage. And it wouldn't be a bad idea to have a dedicated comm link to a duplicate server located elsewhere. Preferably on a different tectonic plate.
Anchorman
The grade-school English used by Slashdot's "editors" is so sad.
Advice: on VPS providers
Good you could continue working (posting to Slashdot), but I think there's one small problem: the UPS is causing excessive newlines to be inserted into your outgoing stream.
battery backed write cache (usually only costs a couple/few hundred extra on a decent server)
not only is time travel possible, it's irrelevant.
This is one reason why servers should have power supplies that have enough reserve power to write out journaled data before it gets corrupted, along with alarm mechanisms when the voltage begins to drop and when it is about to drop so low that data is compromised.
They should also have power-conserving circuits so "non-essential" components get cut off completely so the reserve power isn't used up too fast.
With good capacitors, holding 2-4 seconds of power should be trivial, holding 10-20 should be very doable at some cost.
Desktops can have this or not depending on cost and market demand.
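A back-of-the-envelope way to check hold-up claims like these: the usable energy in a capacitor discharging from v_start to v_min is 0.5 * C * (v_start^2 - v_min^2), and dividing by the load gives seconds. The function names and the numbers in the example call are assumptions picked for illustration, and converter losses are ignored.

```python
def holdup_seconds(capacitance_f, v_start, v_min, load_w):
    """Seconds a capacitor can carry a constant load while dropping from v_start to v_min."""
    usable_joules = 0.5 * capacitance_f * (v_start**2 - v_min**2)
    return usable_joules / load_w


def capacitance_needed(seconds, v_start, v_min, load_w):
    """Farads needed to carry load_w for `seconds` over the same voltage window."""
    return 2.0 * load_w * seconds / (v_start**2 - v_min**2)


# Assumed figures: 1 F on a 12 V rail, usable down to 9 V, 100 W load.
print(holdup_seconds(1.0, 12.0, 9.0, 100.0))
print(capacitance_needed(2.0, 12.0, 9.0, 100.0))
```

Plugging in your own rail voltage, cutoff voltage and load shows quickly how much capacitance a given number of seconds really costs.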
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
The hard drives and DMA controller however, will run a bit longer; so if data is being written to disk, the DMA controller will keep reading data from memory, but it has no idea that this data is corrupted.
Pretty sure that's wrong. It used to be (20 years ago) that hard drives losing power in this way had a chance of the heads crashing against the platters (the fabled "hard drive crash"). To solve this, modern drives are very sensitive to the power input. As soon as power fails the drives extract power from the spinning platters to move the heads over to the parked position. Regardless of what the DMA controller thinks it should be doing, the hard drive is busy parking the heads.
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Indefinitely
Murphy
Advice
Lose
And generally "Demon" when you're not talking about Linux.
And I won't even get started on your capitalization or grammar.
This is why our entire IT infrastructure is based on laptops. Built in protection against power failure.
You have to love someone who posts what almost reads as authoritative, and then puts crap in it like:
> "When power fails, or when you press reset, they will be in a
> "dirty" state, and the system may need to recreate the array.
> That is, if it can. I've never tried it, but I can imagine"
MD devices generally recover fine from a power loss. And, I've tried it. A lot. I've had quite a few machines (say hundreds) which have had various events in the past, which caused them to lose power. Here's an example now. Someone accidentally pulled the power cord out on this machine today. It wasn't intentional, they thought they were pulling the one above it.
cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid5 hdd2[1] hda2[2]
351421696 blocks level 5, 4k chunk, algorithm 2 [3/2] [_UU]
md0 : active raid1 hdd1[1] hdc1[0] hda1[2]
40064 blocks [3/3] [UUU]
unused devices: <none>
One of the drives didn't come up. Not surprising, this machine has been up for a long time, under heavy load. I'm pretty sure we're beyond MTBF on most of the components. It will be replaced soon anyways. We swapped drives, and ran:
raidhotadd /dev/md1 /dev/hdc2
Now it's rebuilding. There's no noticeable performance impact. No data was lost. The only "downtime" was for the tech to realize they pulled the wrong cord, and put it back in.
cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid5 hdc2[3] hdd2[1] hda2[2]
351421696 blocks level 5, 4k chunk, algorithm 2 [3/2] [_UU]
[>....................] recovery = 0.0% (54992/175710848) finish=159.6min speed=18330K/sec
md0 : active raid1 hdd1[1] hdc1[0] hda1[2]
40064 blocks [3/3] [UUU]
unused devices:
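For anyone who wants to catch a degraded array like md1 above without eyeballing the output, here is a rough sketch in Python. It assumes the status line keeps the "[expected/active] [_UU]" shape shown in this comment; the function name and the sample text are made up for illustration, and real mdstat output varies between kernel versions.

```python
import re

# Matches the "[3/2] [_UU]" status that follows each mdN entry.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
HEADER_RE = re.compile(r"^(md\d+)\s*:")


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected, active)} for arrays missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            # Fewer active members than expected, or a "_" slot, means degraded.
            if active < expected or "_" in status.group(3):
                degraded[current] = (expected, active)
            current = None
    return degraded


# Sample text copied from the first mdstat listing in the comment above.
SAMPLE = """\
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid5 hdd2[1] hda2[2]
351421696 blocks level 5, 4k chunk, algorithm 2 [3/2] [_UU]
md0 : active raid1 hdd1[1] hdc1[0] hda1[2]
40064 blocks [3/3] [UUU]
unused devices:
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a live box you would feed it the contents of /proc/mdstat instead of SAMPLE, for example from a cron job that mails you when the result is not empty.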
Serious? Seriousness is well above my pay grade.
"Power loss can cause data corruption? Duh... "
What a novel idea. Yes - you need a UPS and you need to test its operation.
Didn't Novell sort this out years ago with something called 'transaction tracking' or was that some kind of parallel universe?
"NetWare includes a transaction-monitoring feature called the Transaction Tracking System (TTS(TM))"
--
You seem to be using Microsoft Internet Explorer. Although this site is made up of valid HTML 4.01 and CSS2, when using Internet Explorer, the layout will be messed up, because Microsoft deliberately sabotages the CSS standard (among other things). I have no intention of including all sorts of work arounds for this. Do yourself a favor, get Firefox. It is more secure and much more convenient. Any other normal browser will do as well...
davecb5620@gmail.com
If a hard drive loses power while it is writing to the platter there is a very high chance that you will lose the HD. Not only is the HD likely doing a full-track write, and thus touching sectors that aren't even part of the computer-directed write operation, but HD manufacturers skimp on the parts (basically just a few big capacitors) to detect the loss of power, finish the write out, and prevent the HD's heads from writing a swath of garbage across half the disk as they whip over to their parked position.
The result? Data corruption and even physical damage.
Probably five out of the last six drives I've lost were due to uncontrolled power failures. One was even from a UPS, which worked just fine but the computers failed to get the shutdown signal and were still writing when the UPS finally ran out of battery time.
The really funny thing about this is that your typical RAID system is doing parallel writes. You can wind up losing several disks at once and if that happens you probably won't be recovering anything off that RAID. Oops!
This has nothing to do with the computer's DMA still running and passing corrupted data.
-Matt
Journaling filesystems are not meant to prevent data loss, they only (help) prevent the filesystem from becoming trashed if the disk loses power in the middle of a write. No amount of software can change that.
From the article:
"The surge protection on UPSes also often includes protection for ethernet and/or telephone networks. I really advice against using those. When there is a surge, the MOVs temporarily short the line containing the surge with the safety earth, but it will also connect the data networks to it. This safety earth, however, does not have infinitely low impedance, and therefore it's possible that some of the excess current will travel up the network, as opposed to down the safety earth. The exact details of this are more complex than this, but as always, the internet is your tool should you want to find out more."
While this may be true, what happens if a power surge instead comes in through a telephone line or an Ethernet cable (possibly by way of a cable modem or something)? If, for example, a lightning strike hits the cable or phone lines or something similar, that power surge can come right through and fry everything plugged into your fancy expensive surge protector or UPS.
What you really want to do if you're protecting your equipment from power surges is to create a barrier between everything you have plugged in and the outside world. A high quality surge suppressor is that barrier.
Do people actually consider running anything the least bit important without a UPS?
Is it common for anyone to run production equipment in the US without power protection?
Synchronous I/O with IBM HACMP/XD clusters. (Now PowerHA).
No data loss.
Nuff said.
but here it goes anyway.
You mean there are sysadmins out there who would spend hundreds to thousands on RAID and not at least buy a cheap UPS for their server?
There are a billion parts of the system in which data could be lost:
Applications write to disks, even the apps could do some caching, or could partially write files - corrupting them if they leave files half-written. This could be corrected if apps always did "safe-writing" for proper recovery, but some do, some don't.
A kernel or OS will usually implement some buffering or caching of data. This could be bypassed, but at the severe expense of performance.
Filesystems control the writing of data blocks and filesystem metadata to the disks. Oftentimes, if one piece of this gets written but another doesn't, corruption could occur. Things like journaling exist, but most of the time (for example) ext3 journaling prevents the filesystem metadata from getting corrupted, not the file data.
RAID controllers often have battery-backed caches. Often, these batteries are dead, and you may not even know. As someone who has worked extensively in this area, I can assure you there is no way of knowing the health of the battery without completely draining it, recharging it, and looking at its charge/discharge capacity. Trust me, your RAID controller does not do this. If it did, you'd be completely vulnerable to data loss while this test was in progress and the battery was depleted. Does your RAID controller have two batteries? I didn't think so.
Your RAID controller sucks. You might feel all warm and fuzzy that you have a RAID-5 array from a name-brand vendor, but until you've pushed that card, and tested it in all the potential edge conditions, you don't know how many glaring issues it really has. Really, I'm serious.
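To make the "safe-writing" from the applications item above concrete, here is a minimal sketch of the usual pattern on a POSIX system: write to a temporary file, fsync it, then rename it over the original. The function name safe_write is made up for illustration. It only covers the application layer; it assumes the OS, controller, and drive below it honor fsync, which is exactly what the other items in the list call into question.

```python
import os
import tempfile


def safe_write(path, data):
    """Replace `path` with `data` (bytes) so a crash leaves the old or the
    new contents on disk, never a half-written mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new contents out of the OS cache
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself is on disk (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

An app that overwrites its file in place has a window where the file is half old and half new; with this pattern that window moves into the rename, which the filesystem either completes or doesn't.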
So here's the issue: Power outage is only one of several (billion) things that could go wrong rendering the system inoperable, and causing data loss. Correcting the issue means understanding the problems at all of these layers and fixing them. The chain is only as strong as its weakest link.
Are there implementations that go the whole nine yards and do all this? Yes:
Databases (at the application layer) tend to be anal about how they write to the disk/filesystem, journaling, safe-writing, etc. These are sometimes even mirrored at this level (MySQL Cluster) to prevent problems.
Filesystems (like ext3) can journal actual data (not just metadata) at the severe expense of performance. This is so bad, you probably want to handle things at the application layer above.
High-end systems can use (typically external) active-active RAID controllers with mirrored caches. Pricey. You put it all together - and you're talking a system which is well integrated, purpose-built, and very, very, very well tested at the extremes of any edge conditions. This is what separates very expensive high-end solutions from cheap things thrown together with mix-and-matched commodity hardware. Not to say "commodity" hardware isn't okay - but you have to really know what it's doing - and how the pieces interact.
So to the topic - having a UPS is like pissing in a dark blue suit - it makes you feel all warm and comfy, but no one really sees that you're really just covered in piss.
It will protect against one issue. If you're serious about protection, you need more.
So like everything in life, everything comes at a cost. No, your $200 UPS is not the magic bullet to protect your data that is sooo critical to you.
So now how important is that data really to you? How much you got to spend? ;-)
Just brownouts alone permanently degrade circuits, let alone most power losses.
This is one reason you will find even old laptops survive longer than a desktop counterpart that is not running on regulated power. (Except for the few laptops with GPUs/CPUs that can cook eggs and people don't clean the vents out.)
Seriously, if this concept is new to anyone, run to buy a UPS with uber-fast switching or looped continuous power...
As for good old power losses, nothing is coded to be completely impervious, although some out there do a better than expected job, especially when it comes to data loss situations. The tricks do help, like a 'good' RAID and journaling and an OS that expects people to be stupid enough to pull the plug at any time. Here is an area where Windows and OS X tend to be a bit better as the OS and software integration is designed around users that think unplugging the unit or flipping a power switch is normal. And between the two, I give a nod to Vista because of NTFS and its journaling, until Apple gets around to ZFS.
That applies to every moment in time, whether there is a power outage or not
So that is what those SPS things were for, all of this time I just thought they were shocking homeless people.
You like being homeless now? How about now you filthy bitch!? How does getting a job and working sound about now!? How about some health insurance? How about some responsibility! Uh oh, order another CX300, this shocker is out of homeless motivational juice.
I dunno though, data loss, or motivational torture. With all of the shades of grey, who can say what is right and wrong anymore.
"but they need a very fast backup generator to sustain anything more than 30 seconds of outage."
Crap, that is an eternity for a backup generator. Cheap automatic ones can be online in 5 seconds or less. Expensive ones with air start can be online in half a cycle, or 1/120th of a second. With the fast generators, you really don't need a UPS, though prudence dictates otherwise.
wouldn't that be more appropriate?
A more common and efficient way of isolating from the power grid is via a flywheel. The grid runs a motor that's connected to a moderate sized flywheel and then the flywheel is connected to a DC (or AC with a converter) generator that charges the UPS batteries. This provides excellent isolation from the grid and not much loss of power efficiency. If there is a spike/lightning strike the motor/generator set can ride it out without any problems. If there is a short (less than 2 seconds) drop out the flywheel will keep everything going.
Motor/generator sets are off the shelf technology that have been proven for many years in data centers. And besides they look really cool.
See http://www.pscpower.com/pages/industrial%20motor%20generator%20rt.htm "The Series IMG-RT Ride Thru Motor Generator integrates state-of-the-art controls, a single-shaft motor generator and a mechanical flywheel into a power conditioning system that can deliver up to 5 seconds of ride-thru during an interruption of power. The typical induction-synchronous MG set delivers this ride-thru with a maximum frequency drop at full load of 1%, or 0.6Hz on a 60Hz system. For sensitive applications where no frequency variation is acceptable, a synchronous-synchronous MG set is available. Series IMG-RT ride thru motor generator systems are sized and customized to meet a wide range of customer driven application criteria. MG sets are available in size ranges up to 2500 kVA for low voltage applications up to 600V. For medium voltage applications, please consult the factory."
See also http://www.pscpower.com/pages/series%20xc.htm for up to 10,000 kVA (parallel modules and "ring bus" configuration), claims to have 20 year service life.
Caution--some serious high voltage/current here. Do not attempt at home.
Where some = those not suitable for coping with a power loss scenario, quel surprise.
There are places where the networks are not touching, and there are places where they are. -Boeing's Lori Gunter
Why do you think that a real database doesn't count a transaction as committed until the disk reports the relevant parts of the transaction log *written to disk* rather than sitting in a cache on the way there?
That's the ideal way of dealing with this. But it requires the hardware to never lie, the OS to never lie, the database to be designed to cope with this (and allow for multiple outstanding requests otherwise performance will suffer.)
Of course, hardware often does lie. (SCSI gear is better than ATA gear due to command queuing; in theory SATA with NCQ can be as good as SCSI).
Note: Running sync() won't deal with the situation if your hard drive lies and claims the sector is written to disk when it's really sitting in the HDD cache.
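A minimal sketch of the "not committed until the log is on disk" rule, in Python. TinyLog is a made-up name, not any real database's API. The whole point is the ordering: commit() does not return until fsync() has. As the note above says, this still assumes the drive isn't lying about what has reached the platter.

```python
import os
import tempfile


class TinyLog:
    """Append-only log; commit() returns only after fsync() has returned."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def commit(self, record):
        data = record + b"\n"
        while data:  # os.write may write fewer bytes than asked
            written = os.write(self._fd, data)
            data = data[written:]
        # Until this returns, the record may exist only in the OS cache.
        os.fsync(self._fd)
        return True  # only now may the caller report "committed"

    def close(self):
        os.close(self._fd)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "txn.log")
    log = TinyLog(demo_path)
    print(log.commit(b"debit account A 100"))
    log.close()
```

A real database batches many transactions per fsync (the "multiple outstanding requests" mentioned above); one fsync per record, as here, is the simplest correct version and also the slowest.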
Need a UPS too! I am amazed that someone will spend anywhere from $250 to several thousand on a new system, but they will not spend $70 on a decent UPS to protect it. I live in a town of about 25,000 in Southeast Iowa. We seem to get a lot of short (less than 1 second) power drops. Usually several per day, sometimes dozens. The power company gets many complaints, but is not interested in fixing the problem. In such a situation, I consider a UPS essential. And not that certain cheap brand that a popular chain of department stores sells (or used to)...I went through 3 of them in less than a year. I now have a more reliable name brand UPS that has lasted 2 1/2 years so far, with no problems.
I also consider frequent backups of my data to be essential.
C'mon. Show of hands... Who here runs a server with redundant power supplies, RAID configured storage, and confirmed backups without plugging the whole kit and kaboodle into a UPS?
Bueller? Bueller?
You know an article is obvious when you see Airport Security already doing it.
All their computers and hell, even their explosive dust sniffers run on UPSs.
While the author of this article does have some points, half of it is misconceptions or just plain nonsense.
I started laughing at the 2nd paragraph: did he say unrefreshed DRAM garbage being written to disk??? Regardless of the fact that DRAM keeps its content for seconds, sometimes even minutes after power goes off - even when removed from the system, can he explain how, if there's no more power to refresh the RAM, his DMA controller will be copying data? How the disk controller will send the data through the wire? How the data will be written when there's no power to spin the platters and move the heads?
Sure there used to be systems with power fail interrupt. That was the SGI's using an old version of XFS _without_ journaling. The PSU was loaded with big capacitors and upon triggering that interrupt the system would flush cache to disk before the power was out.
There's also a misconception about databases - at least MySQL. I work with it in cluster environments. In my testing, I was routinely (and automatically through network boot bars) shutting off current of the active node, causing a hard shutdown, and letting resources fail over to the passive one. Did that hundreds - maybe thousands - of times on a well loaded replicated slave cluster without a single glitch. No forced InnoDB recoveries. No replication problems.
His "disk cache" issue is nonsense too - at least the way he presents it. The proper way to demonstrate it would rather be doing the sync, then upon sync returning shutting off the computer, because there lies the problem. My MySQL cluster above was able to recover because on every fsync (and there were hundreds _per second_!) it knew the data was hard on disk (in the battery-backed RAID controller cache actually). The problem lies with consumer-level hardware. IDE/SATA has their write cache enabled by default. In some cases it can't even be turned off.
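One rough way to see whether something in the stack is acknowledging fsync early is to time it. This is a sketch only: fsyncs_per_second is a made-up helper, and the interpretation is a rule of thumb, not a proof. A 7200 rpm disk turns 120 times per second, so rewriting the same block with an fsync each time can't honestly go much faster than that on a bare spinning drive. Hundreds or thousands per second suggest a cache (battery-backed or otherwise) is answering instead of the platter.

```python
import os
import tempfile
import time


def fsyncs_per_second(directory, count=200):
    """Rewrite one 512-byte block `count` times, fsync after each, return the rate."""
    fd, path = tempfile.mkstemp(dir=directory)
    block = b"x" * 512
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.lseek(fd, 0, os.SEEK_SET)  # always overwrite the same block
            os.write(fd, block)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    rate = fsyncs_per_second(tempfile.gettempdir())
    print("about %.0f fsyncs per second" % rate)
```

Point it at a directory on the disk you care about; the system temp directory is often a RAM-backed filesystem and will tell you nothing about the drive.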
So instead of suggesting to buy UPSes to "patch the problem", data reliability should start with decent hardware components: ECC RAM, SCSI/SAS drives, etc. Sometimes that's also a tradeoff between speed and reliability as you often get the choice. And BTW in about 5 years I've seen whole datacenters lose power on 4 occasions, and not the cheapest/smallest ones (three times UPS failures, and once a generator failed to kick in. That was in two different datacenters in the US and Canada). I've also had a UPS failure from an expensive APC SMART-UPS. You can't rely only on UPSes.
Take a lesson from commercial aircraft design. Power inputs are characterized extensively and the devices on the power bus have built-in capacitance to hold up the processor/devices long enough to commit all data before the device loses power. In addition, critical devices have redundancy so if one device dies, the other device can complete the critical operation.
I suppose if this is good enough for our asses, it is good enough for some non-living data!
Forget about outages for a minute, think about lightning, surges, sags, etc. I lost some network gear to lightning before I had money to put UPSes everywhere, and I've seen modems with chipsets that have huge holes ripped in them from a strike. Currently, my TV stack (tivo, cable box, cable amp, cable modem, router, etc) is all plugged into a 1500 VA UPS, all of our computers each have UPSes which guarantee at least 10 minutes runtime (depends on number of disks, etc), and the tivo in the other room is on a 550VA UPS. The 1500VA UPS will carry our network gear, tivo, etc for about 2 hours and 20 minutes.
Most of our outages are brief cuts lasting 2 seconds or less; the rest tend to last a couple of hours. Around 4:30a the other night, we had an outage affecting some 3000 houses. Before the power went out completely, it reset 3 or 4 times as the grid tried to reroute around the fault and failed. This would have been murder on a disk if the system was set to restore to previous state. The other dozen or so blips every year would also be bad on disks. Anyway, the power finally goes out, my tivo is still recording my shows, and the only light in the house is the LCDs, which I use to shut down the systems and immediately power the screens down to increase runtime. I eventually had to shut off the smaller UPS with the extra tivo, but everything else was shut down cleanly and didn't have any bobbles in power.
After sitting around in the dark for a while, unable to sleep, I plugged a lamp into the 1500VA on my main desktop UPS (which was offline) and read a book for a couple hours until the power came back on. It's pretty damn weird when you think you're maybe the only person for a couple miles with a light on, let alone a tivo or a computer. I second what others have said about laptops, it's pretty nice to be able to keep working for a while with the power out. UPSes are a must for many reasons.
Fundamentally, any time you lose power, you WILL lose some data that is cached. Period. The important part is not whether or not you lose data, but whether you KNOW what got written and what didn't. That's the whole point of transactions.
Filesystem journals are (mostly) concerned with metadata integrity/consistency (ZFS is somewhat exceptional in this area). Unless you do full-data journalling (and the performance penalty means you don't want to), it is up to the application to implement transactional behaviour. This is a key feature of databases.
Part 2, ram can fail (refresh) before hard drive stops doing DMA. Frankly, I don't believe it. Data gets DMA'd to buffers on the drive. That's the cheap (powerwise) part. Moving the heads around and writing the data is the expensive bit. The drive is not going to tell you it has written something unless it has done so and modern drives won't try to write unless they have enough power to do so.
Part 3, you can lose cached data. Clue. The same problem happens if the drive dies or starts erroring out. If you don't do synchronous writes, you (the app) are responsible for checking to see that the writes actually got committed.
Part 3.2. This has NOTHING to do with bit flipping. The Gentoo Wiki is talking about enabling write-caching on the drives. This is incredibly dangerous (fatal) without a UPS. Basically, enabling write-caching on the drive allows the drive to say "yes this is committed to stable store" when it isn't, and if the cache is not protected, all bets are off. True, it's much more practical to do hard-luck recovery of unencrypted data in this case, but fundamentally, this is a no-no. A lot of RAID controllers offer writeback caching and most are smart enough to disallow this unless the cache is battery-backed.
Part 3.3 RAID. RAID 0 isn't worse than a single disk. RAID 1 potentially requires a full resync, but is recoverable. Similar rules for other RAID levels. This is all known/handled.
Part 3.4. VERY WRONG. All I can say is read up on ACID. PostgreSQL is fully ACID (http://en.wikipedia.org/wiki/ACID) compliant, as are most/any databases worthy of the name these days.
I won't argue that power failures are healthy or desirable. Hardware stress and failure are the most obvious issues here, but most of the "issues" brought up in the article are simply incorrect. Of course, there are an unbelievably large number of "applications"/programs out there that don't implement the necessary journalling/transactions to correctly deal with power outages/crashes etc. But that's another story. Badly-written applications are as old as the computer.
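As an illustration of "knowing what got written and what didn't", here is a hedged sketch of how an application-level log can tell a complete record from one cut off by a power failure. The framing (4-byte length plus CRC32 in front of each record) is an assumption chosen for the example, not how PostgreSQL or any particular database lays out its log.

```python
import struct
import zlib

# Hypothetical record header: payload length and CRC32, both little-endian.
HEADER = struct.Struct("<II")


def frame(payload):
    """Prefix a payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def recover(log_bytes):
    """Return (complete_records, bytes_discarded) after a crash.

    Scanning stops at the first record that is truncated or fails its
    checksum; everything before that point is known to be intact.
    """
    records = []
    offset = 0
    while offset + HEADER.size <= len(log_bytes):
        length, crc = HEADER.unpack_from(log_bytes, offset)
        start = offset + HEADER.size
        payload = log_bytes[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn write: the tail of the log never made it
        records.append(payload)
        offset = start + length
    return records, len(log_bytes) - offset


if __name__ == "__main__":
    # Simulate power failing two bytes before the third record finished.
    torn = frame(b"txn-1") + frame(b"txn-2") + frame(b"txn-3")[:-2]
    print(recover(torn))
```

The cached data for the third record is gone either way; the difference is that after recovery the application knows the last good point instead of guessing.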
"The reason I remember the exact minute it failed was I had my bag in hand and was walking toward the door when the server alarm went off"
You actually had a server alarm - cool !!!
davecb5620@gmail.com
I lost faith that this guy knows what he's talking about when I read this:
> The PostgreSQL mailinglist doesn't have "don't kill -9 the postmaster!" as a standard signature to list messages for nothing.
Indeed that is standard advice, but it has NOTHING WHATSOEVER to do with power failure recoverability. The reason you're not supposed to do it is that hard-killing the postmaster doesn't get rid of its subprocesses or shared memory segment, which could make a subsequent attempt to restart the postmaster hazardous. But those things won't survive a system crash due to power loss (or any other reason).
I'm not really qualified to evaluate all the other statements in the article, but the fact that the one statement I do know about is hogwash doesn't make me feel good about the others.
http://www.zerosurge.com/
( no dies-in-2-years MOVs: series-reactor, instead.
Been around for many years, & the only thing that can stop mega-spikes:
when nearby-lightning has killed all the normal UPSs in the region,
if YOUR UPSs are protected by these, they'll probably be fine.
no, I'm not affiliated, but prefer infrastructure to continue-working, for OUR benefit.
They have a 5A panel-mount unit, for sticking on appliances, too :)
You CAN make a system that won't lose data. The FIX protocol (Financial Information Exchange) is an example of this. The data is sent, received, written, and then CONFIRMED RECEIVED is sent back before it is deleted/processed at the source. There are all kinds of redundancy built into FIX. This of course wastes a lot of cycles but in the financial world this is what is done because data corruption could equal actual money lost. Imagine if your bank balance was being transferred and was lost!
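The send / write / confirm / then-delete cycle described above can be sketched in a few lines of Python. To be clear, this is not the FIX wire protocol or any FIX library's API; Sender, Receiver, and the sequence numbering are made-up names that only illustrate the idea that the source keeps its copy until the other side confirms it has stored the message.

```python
class Receiver:
    def __init__(self):
        self.stored = {}  # stand-in for durable storage on the receiving side

    def deliver(self, seq, message):
        # A resend of something already stored is kept once, but acked again.
        self.stored.setdefault(seq, message)
        return seq  # the acknowledgement


class Sender:
    def __init__(self):
        self.outbox = {}  # messages stay here until confirmed
        self.next_seq = 1

    def queue(self, message):
        self.outbox[self.next_seq] = message
        self.next_seq += 1

    def flush(self, send):
        """Try every pending message once. `send(seq, msg)` returns the
        acknowledged sequence number, or None if no ack came back."""
        for seq, message in sorted(self.outbox.items()):
            if send(seq, message) == seq:
                del self.outbox[seq]  # delete at the source only after the ack


if __name__ == "__main__":
    receiver, sender = Receiver(), Sender()
    sender.queue("transfer 100 from A to B")
    sender.flush(receiver.deliver)
    print(sender.outbox, receiver.stored)
```

If the acknowledgement is lost (say, the link or the power drops at the wrong moment), the message is still in the outbox and goes out again on the next flush; the receiver's duplicate check keeps it from being applied twice.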
I wish they would do this randomly at Microsoft and Intel... that and have the computers randomly fail and give bad calculation results. Maybe then we would have fault-tolerant computing and truly robust backup schemes.
The space shuttle computers could then be just normal computers.
They ARE out to get you simply because They are in it for themselves and they don't care about you.
My company had a power failure one year ago after our local electricity company finished their work in the street. The electricity went out-on-out-on, for 9 cycles. Not only were there big problems with the devices not behind a UPS, but also big problems with the servers and other equipment behind UPSes.
The current somehow managed to reach the serial port of 2 APCs, killing half of the server infrastructure in that rack.
Needless to say, I switched immediately to the higher series with online-power. I hope that prevents this kind of damage.
Every UPS can only absorb a limited amount of surge energy, expressed in Joules. Once the power problem goes over that amount, the UPS will fail its function and eventually also fail its safety precautions. The higher series have separate filtering and protection for this.
--- I am known for the ones who want to find me on the net. Is that a privacy risk or a privilege? One might wonder..