How Power Failures Corrupt Flash SSD Data
An anonymous reader writes "Flash SSDs are non-volatile, right? So how could power failures screw with your data? Several ways, according to a ZDNet post that summarizes a paper (PDF) presented at last month's FAST 13 conference. Researchers from Ohio State and HP Labs tested 15 SSDs using an automated power fault injection testbed and found that 13 lost data. 'Bit corruption hit 3 devices; 3 had shorn writes; 8 had serializability errors; one device lost 1/3 of its data; and 1 SSD bricked. The low-end hard drive had some unserializable writes, while the high-end drive had no power fault failures. The 2 SSDs that had no failures? Both were MLC 2012 model years with a mid-range ($1.17/GB) price.'"
Seriously... slap in some basic power circuitry and some caps - enough that the drive can finish the cycle it is on and do whatever it needs to do to power off safely.
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
The paper doesn't disclose the brands.
Belief is the currency of delusion.
... Power failure corrupts absolutely.
These devices have an elaborate internal database for the management of block remapping. For this to survive power failures it needs to use transactional updates. Getting this right is hard - it takes years for file systems and databases to become robust. I'd guess that many devices don't even attempt to do it and the ones that do probably have obscure failure modes. A UPS is essential.
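To make "transactional updates" concrete, here is a minimal sketch of one common pattern: keep two copies of the record and only ever overwrite the older one, so a write torn by power loss cannot destroy the last good copy. This is an illustration, not how any particular SSD firmware works; the slot layout, the `SLOT_SIZE` value, and the function names are all assumptions made up for the example.

```python
import struct
import zlib

SLOT_SIZE = 64                   # hypothetical fixed slot size in bytes
HEADER = struct.Struct("<QII")   # sequence number, payload length, CRC32 of payload


def encode_slot(seq: int, payload: bytes) -> bytes:
    """Pack one mapping snapshot into a fixed-size slot."""
    if len(payload) > SLOT_SIZE - HEADER.size:
        raise ValueError("payload too large for slot")
    header = HEADER.pack(seq, len(payload), zlib.crc32(payload))
    return (header + payload).ljust(SLOT_SIZE, b"\x00")


def decode_slot(raw: bytes):
    """Return (seq, payload) if the slot is intact, else None."""
    if len(raw) < HEADER.size:
        return None
    seq, length, crc = HEADER.unpack_from(raw)
    if seq == 0 or length > SLOT_SIZE - HEADER.size:
        return None              # never written, or header is garbage
    payload = raw[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None              # torn or corrupted write
    return seq, payload


def recover(slots):
    """Pick the intact slot with the highest sequence number.

    Returns (index, seq, payload), or None if no slot is valid.
    """
    best = None
    for i, raw in enumerate(slots):
        decoded = decode_slot(raw)
        if decoded and (best is None or decoded[0] > best[1]):
            best = (i, decoded[0], decoded[1])
    return best


def commit(slots, payload: bytes, torn_at=None) -> None:
    """Write a new snapshot into the slot that does not hold the newest copy.

    torn_at simulates power loss after that many bytes reached the medium.
    """
    current = recover(slots)
    if current is None:
        target, seq = 0, 1
    else:
        target, seq = 1 - current[0], current[1] + 1
    new = encode_slot(seq, payload)
    if torn_at is not None:
        new = new[:torn_at] + slots[target][torn_at:]
    slots[target] = new
```

With two empty slots, committing `b"lba7->blk12"`, then `b"lba7->blk40"`, then a third record with `torn_at=20` leaves `recover` returning the second record: the torn slot fails its checksum and is ignored. Note that this only holds if the device really does persist what it acknowledged, which is exactly what the paper calls into question.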
I had some original Vertex drives from OCZ that kept absolutely corrupting when my laptop got accidentally unplugged and I powered on the machine. I had to RMA them over and over and over again. I finally figured out that my battery was getting old and, although everything was functional even on battery power and it would boot, the initial large draw of power on boot must have created a voltage drop (i.e. brownout) which the SSDs weren't designed to compensate for. Within an hour of boot (even back on plugged power) they would choke, freeze the OS, and be rendered unusable from then on out.
Several SSD manufacturers are probably not engineering well for fluctuating power. Rather than fixing the problem with better engineering, OCZ simply changed their warranty policy to void the warranty if the customer does not provide proper power. Correct me if I'm wrong, but I don't think rotating-disk hard drive manufacturers have ever had that in their warranty clauses.
We encountered extensive and progressive file corruption on SSDs in an industrial device. It used the FAT file system, and after every loss of power, it ran its equivalent of chkdsk /f at the next boot. If power was lost again while this command was running, then it was guaranteed that the file system would become corrupt (despite the fact that we were writing nothing to the SSD; it held only files which were opened for reading). The window of opportunity was described as "very short", and the possibility of corruption was "very small" according to the vendor. In our experience in the field, and in our internal testing, the window of opportunity exceeded 20 seconds, and the possibility of corruption was "utter certainty".
The vendor fixed the problem in a very easy way. They changed the file system from FAT to a commercial journaling FS. In our subsequent tests, we never found any file corruption, even on iterated power loss at random intervals after power on.
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
Useless paper/test.
Who logs in to gdm? Not I, said the duck.
What some folks don't realize is that it's the seesaw nature of many power events that's primarily behind both data corruption and SSD failure. It's a rare rack system that has its own power conditioning and UPS these days (HP NonStop comes to mind), and without it you're subject to whatever the event provides in the way of under/over voltage, spikes, drops, etc. Many times these happen in timeframes too fast for power switching equipment to react, and in some cases it's that stuff that gets fried first.
Organization? You must be joking..
Most enterprise SSDs do have small supercapacitors or capacitor arrays onboard for exactly this reason. Some of the higher-end consumer drives do too. But most consumer drives don't.
The answer? Get a UPS.
A UPS is no panacea: I experience grid failure very rarely.
However, relatively speaking I experience many more kernel lockups that require an ACPI-initiated poweroff by holding down the power button until the machine abruptly powers off. What do you do when a reboot/poweroff command causes your Linux/BSD machine to hang? I/O handle leaks in the Samba SMB client (i.e. *not* the smbd daemon) and the Samba Winbind code are notorious for this. The only times I have ever had to "yank power" from a production Linux database machine were due to SMB share mount zombies or Winbind that the kernel couldn't kill even during an issued reboot command.
I have several OCZ Vertex 4 SSDs, and this concerns me—especially due to the fact that the paper/presentation does not disclose the test results. I guess I will just have to hope that my device models aren't affected and/or that waiting a minute or two during a hung poweroff/reboot means the kernel has stopped attempting to write to the devices and everything has flushed.
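On the "has everything flushed?" question: an application does not have to wait and hope, it can ask. A rough sketch of what that looks like from user space on a POSIX system is below; `durable_write` is a name invented for the example. The caveat is that `os.fsync` is only a request passed down to the device, and a drive that acknowledges a flush it has not actually completed can still lose the data.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to the device before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())   # ask the kernel to flush the file to the device

    # Also sync the containing directory so the new directory entry is
    # durable. Opening a directory like this is a POSIX assumption.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

For a system-wide flush before a forced poweroff, `os.sync()` (or the `sync` command) asks the kernel to write out all dirty buffers, with the same caveat about what the drive does afterwards.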
PS. If you compare the vague results in the summary with the paper you will find that only two of the fifteen drives passed the tests, yet four of the devices were cited to have power protection capacitors.
This is old news; see, for example, Wikipedia's coverage. Only buy SSDs with a battery or capacitor; otherwise whatever is in the SSD's DRAM cache will be lost on power failure.
This is why I don't use prototype tech that is really not ready to be used in the real world. And if you do, expect loads of bugs and bricking.
But either way, thanks for funding the development of something I am excited to try out in 2-4+ years when it will be a mature usable technology.
Troll is not a replacement for I disagree.
You got this too? I just ordered a Crucial M4 on sale a few weeks ago. The day after I installed and cloned it, I had the same situation where it wouldn't start. I called Crucial, expecting to need an RMA. Luckily I got an informed gentleman on the phone who told me to leave it at the failed POST screen for 20 minutes, reboot, and give it another 20 minutes, and reboot again. It worked. Supposedly it's not so much a 'bug' as an 'obscure feature'. ...I'm keeping my spinning rust drive around just in case.
Power loss protection (super capacitors) was stated on four of the drives (the four least expensive to boot). Only three performed flawlessly in the unserialized writes test. Those aren't great odds. In fact only two drives passed all tests with no errors, and it wasn't necessarily the SLC "enterprise" drives, though those two also passed the serialized writes test.
In case you aren't aware, unserialized writes invalidate *every* assumption, including write ahead, journaling, even your fancy BTRFS/ZFS. His example is a database where the transaction log write was sync'd before the data page write, then after a power failure the data page is persisted but the log write is gone.
You can recover from many of the other errors or at least detect them but unserialized writes can silently corrupt data or even ruin the entire filesystem.
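A toy model of the database example above may make the point clearer. It enumerates what could be on disk after a crash for two kinds of device: one that persists writes in the order they were issued, and one that may persist any subset. The write names and helper functions are made up for illustration; this is a sketch of the reasoning, not of any real database or drive.

```python
from itertools import combinations

# Writes in the order the database issued them: the log record is synced
# first, then the data page it describes.
ISSUED = [("log", "txn1: page7 = new"), ("data", "page7 = new")]


def prefix_outcomes(writes):
    """Crash states of a device that persists writes in issue order."""
    return [tuple(writes[:n]) for n in range(len(writes) + 1)]


def any_subset_outcomes(writes):
    """Crash states of a device that may persist any subset of writes."""
    return [
        subset
        for n in range(len(writes) + 1)
        for subset in combinations(writes, n)
    ]


def wal_invariant_holds(persisted):
    """The data page may be on disk only if its log record is too."""
    kinds = {kind for kind, _ in persisted}
    return "data" not in kinds or "log" in kinds


def violations(outcomes):
    """Return the crash states that break the write-ahead invariant."""
    return [state for state in outcomes if not wal_invariant_holds(state)]
```

`violations(prefix_outcomes(ISSUED))` is empty, while `violations(any_subset_outcomes(ISSUED))` contains the one state where the data page survived and its log record did not. Recovery code has no record that the page changed, which is why journaling cannot help in that case.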
Obviously the metadata/dead failures are the exception... Those render the whole SSD useless.
Natural != (nontoxic || beneficial)