Slashdot Mirror


How Power Failures Corrupt Flash SSD Data

An anonymous reader writes "Flash SSDs are non-volatile, right? So how could power failures screw with your data? Several ways, according to a ZDNet post that summarizes a paper (PDF) presented at last month's FAST 13 conference. Researchers from Ohio State and HP Labs tested 15 SSDs using an automated power fault injection testbed and found that 13 lost data. 'Bit corruption hit 3 devices; 3 had shorn writes; 8 had serializability errors; one device lost 1/3 of its data; and 1 SSD bricked. The low-end hard drive had some unserializable writes, while the high-end drive had no power fault failures. The 2 SSDs that had no failures? Both were MLC 2012 model years with a mid-range ($1.17/GB) price.'"

43 of 204 comments

  1. build in some power storage by X0563511 · · Score: 5, Insightful

    Seriously... slap in some basic power circuitry and some caps - enough that the drive can finish the cycle it is on and do whatever it needs to do to power off safely.

    --
    For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    1. Re:build in some power storage by Anonymous Coward · · Score: 2, Insightful

      I'll quote the great CliffyB: Vote with your dollars!

      What? It's valid thinking, not at all 9:th grade.

    2. Re:build in some power storage by v1 · · Score: 5, Insightful

      space is at an extreme premium in those drives. There's a reason they feel so heavy/dense. Given the quilting layout of the chips, adding a single cap would prevent several memory chips from fitting. So you may as well then fill that remaining space with more caps. But you will reduce capacity, and that's what sells SSDs.

      There's already a substantial amount of circuitry in them, far from "basic". It's essentially a CPU. I'd be interested to see some numbers as to average power drain during idle, read, and write.

      The ones that did the best during the power blips probably did have caps and a bit more in their power system to handle it though. It certainly does surprise me that the mid-range, not the high-end, were the best performers in this test.

      --
      I work for the Department of Redundancy Department.
    3. Re:build in some power storage by Guspaz · · Score: 2

      Most enterprise SSDs do have small supercapacitors or capacitor arrays onboard for exactly this reason. Some of the higher-end consumer drives do too. But most consumer drives don't.

      The answer? Get a UPS.

    4. Re:build in some power storage by Mad Merlin · · Score: 2

      space is at an extreme premium in those drives. There's a reason they feel so heavy/dense.

      I don't know what SSDs you've been using, but I've never picked up an SSD (OCZ Vertex 2/3, Intel X25-M/320/330/335/510/520) that didn't feel light and sound nearly hollow.

    5. Re:build in some power storage by hawguy · · Score: 2

      I bet no one ever thought of that!!

      Based on the paper, I guess they didn't

      Some SSDs already have capacitors that do just this, so yes, they did think of it. Did you really think that SSD manufacturers aren't aware of this issue?

      But when a few dollars can sway a purchase decision, and it's hard to convince consumers through a few sentences on the side of an SSD box that power protection circuitry is important to have, it's hard to justify putting it in. And since most SSDs are probably sold as OEM equipment where a few pennies can make the difference between getting the sale or not, then it's even harder to justify.

      It's not something I'd be willing to pay extra for - my computer hasn't lost power in years (thanks to a UPS that automatically shuts down my computer), and my computer writes to disk so rarely that there's probably only a 1 in 100 chance that it will be in the middle of a write if I just walk up and pull the plug. If I do lose data, there are always backups to fall back on.

    6. Re:build in some power storage by Mashiki · · Score: 3, Informative

      I don't know what SSDs you've been using, but I've never picked up an SSD (OCZ Vertex 2/3, Intel X25-M/320/330/335/510/520) that didn't feel light and sound nearly hollow.

      Consumer drives are usually lightweight, they don't need the extra cooling. Enterprise drives depending on who they're made by and what they're for can have heatspreaders or heatsinks within, or attached to each chip adding to the weight.

      --
      Om, nomnomnom...
    7. Re:build in some power storage by TechyImmigrant · · Score: 2

      It would be great if they told you about the feature so you could make an informed purchasing decision.

      --
      I should use this sig to advertise my book ISBN-13 : 978-1501515132.
    8. Re:build in some power storage by yurtinus · · Score: 2

      Exactly, this is buying consumer computer equipment. Put a label on the side with a bullet point touting your unexpected power fault protection and I can pretty much guarantee it will have no impact on your product sales. You know what will? The extra $2 price that puts you below the other guy on the "lowest price first" product sorting.

      --
      +1 Disagree
    9. Re:build in some power storage by hawguy · · Score: 2

      But when a few dollars can sway a purchase decision, and it's hard to convince consumers through a few sentences on the side of an SSD box that power protection circuitry is important to have, it's hard to justify putting it in

      This isn't buying a car. $3 or even $20 isn't going to be detrimental to the purchase opportunity when the consumer can TELL it is of quality above the competitors. Blaming the consumer in this case sounds like you are on the other side

      How can the consumer TELL if its quality is above the competitors? The presence of capacitors doesn't mean that it's a better drive than a drive without capacitors. It just means that you have more protection from one rare set of circumstances -- potentially with less reliability overall, since big electrolytic capacitors are known to fail, especially cheap ones.

      I suspect that most SSDs are bought as OEM drives buried inside laptops and desktops where the end user may not ever know what brand and/or model the drive is, so how will a higher cost for a feature that may offer no real benefit for most users help sell more drives?

      Don't believe me? Here's proof: Manufacturers aren't promoting it as a feature in big letters on the side of the box. If they thought they could add $5 of circuitry and sell the drive for $10 more, they would.

      If you're reading Slashdot, then you're not a typical consumer, and maybe you really are enough of an SSD expert to compare features to know what makes one SSD better than another, but for the other 99% of consumers, they will either buy an SSD with their next computer, or they'll buy the one at Best Buy that has the lowest price and the highest transfer rate since that's a number they can understand. How would you even quantify "Power protection capacitors" to know if it's worth $5, $50 or $100 to you? If it's really important to you, you can always buy an enterprise class SLC drive that includes the capacitors.

      Blaming the consumer in this case sounds like you are on the other side

      Is this one of those George Bush "If you're not with us, you're against us" false dichotomies? Believe it or not, it's possible for people to have different opinions without being enemies.

    10. Re:build in some power storage by TechyImmigrant · · Score: 3, Funny

      >yet nearly all computers sold today are portables

      What I really want is a potable computer, so I can drink it if I get thirsty.

      --
      I should use this sig to advertise my book ISBN-13 : 978-1501515132.
    11. Re:build in some power storage by TechyImmigrant · · Score: 2

      >Flash memory is accessed in blocks and only blocks. Even if you need to write to a single bit, the entire block that that bit resides in needs to be re-written. This means before you can write, the entire block has to be read and stored temporary ram. If power is interrupted during a write operation then there is a very good chance the entire block will be lost because the contents of the flash controller's ram will be lost.

      You are wrong.

      Flash is written word by word. The size of the word depends on the chip.
      Flash is *erased* a block at a time.

      That is what makes flash more efficient than EEPROM, the block erase plane.
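
      A toy sketch of that program/erase distinction, in Python. The class and method names are made up for illustration, the write unit is one small integer per "page", and real chips differ in unit sizes and details; the only point is that programming clears bits in one unit while erase resets the whole block.

```python
class FlashBlock:
    """Toy model of one flash erase block (hypothetical, for illustration only)."""

    def __init__(self, pages=4, erased=0xFF):
        self.erased = erased
        self.pages = [erased] * pages

    def program(self, page, value):
        # Programming can only flip bits from 1 to 0, so the stored
        # result is the AND of the old contents and the new value.
        self.pages[page] &= value

    def erase(self):
        # Erase always covers the whole block and returns every bit to 1.
        self.pages = [self.erased] * len(self.pages)


block = FlashBlock()
block.program(0, 0xF0)      # touches only page 0
block.program(0, 0x3C)      # programming again without an erase only clears more bits
print(hex(block.pages[0]), hex(block.pages[1]))
block.erase()               # the only way to get 1 bits back, and it hits every page
print([hex(p) for p in block.pages])
```

      Under this model, rewriting data in place is what forces an erase of the whole block; writing to a fresh, already-erased unit does not.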

      --
      I should use this sig to advertise my book ISBN-13 : 978-1501515132.
    12. Re:build in some power storage by edmudama · · Score: 2

      Most of the enterprise grade SSDs on the market that are outfitted with power-loss protection circuitry fit these capacitors within the 2.5" form factor.

      --
      More data, damnit!
    13. Re:build in some power storage by froggymana · · Score: 2

      Probably the five finger discount.

      --
      "To prevent this day from getting any worse, I'll just read ERROR as GOOD THING" 1GJU8xLuDKDxEs4KLf8fAGyptoDsqvEsBT
    14. Re:build in some power storage by thegarbz · · Score: 2

      but I've never picked up an SSD (OCZ Vertex 2/3, Intel X25-M/320/330/335/510/520) that didn't feel light and sound nearly hollow.

      Rip it open and have a look. There's not much weight at all to a piece of fibreglass and some plastic resin encasing some silicon. Circuit boards and components are really quite light when they don't require cooling or even large bits of metal for simple thermal mass.

      You'll find that even though it's light and looks hollow it'll be packed quite full. Now combine that with the problems associated with creating some form of energy storage. Storage can come in some electrical form, i.e. battery which would be great but then you need either a maintenance regime, combined with some form of monitoring and perhaps even some charging circuit.

      Your other option is capacitance, but in order to get enough useful capacitance you need large densities which invariably comes by rolling together two thin aluminium plates. All useful (for this application) capacitors will therefore be cylindrical and will immediately consume massive amounts of space (by massive I'm talking cubic mm). Again have a look and see how little actually fits in such a small form factor.

    15. Re:build in some power storage by Dunbal · · Score: 2

      The problem with voting has always been that the idiots get to vote too. So while you might "vote with your dollars" to select the most reliable drive, they will vote for the one with the cute name, or the shiny case, or the "free gift", or the special price, etc.

      --
      Seven puppies were harmed during the making of this post.
  2. Before you ask. by eddy · · Score: 5, Informative

    The paper doesn't disclose the brands.

    --
    Belief is the currency of delusion.
  3. Power corrupts... by preflex · · Score: 5, Funny

    ... Power failure corrupts absolutely.

  4. Unsurprising by Anonymous Coward · · Score: 3, Insightful

    These devices have an elaborate internal database for the management of block remapping. For this to survive power failures it needs to use transactional updates. Getting this right is hard - it takes years for file systems and databases to become robust. I'd guess that many devices don't even attempt to do it and the ones that do probably have obscure failure modes. A UPS is essential.

  5. Finally somebody said it! by Dishwasha · · Score: 5, Informative

    I had some original Vertex drives from OCZ that kept absolutely corrupting when my laptop got accidentally unplugged and I powered on the machine. I had to RMA them over and over and over again. I finally figured out that my battery was getting old and, although everything was functional even on battery power and it would boot, the initial large draw of power on boot must have created a voltage drop (i.e. brownout) which the SSDs weren't designed to compensate for. Within an hour of boot (even back on plugged power) they would choke, freeze the OS, and be rendered unusable from then on out.

    Several SSD manufacturers are probably not engineering well for fluctuating power. Rather than fixing the problem with better engineering, OCZ simply changed their warranty policy to void the warranty if the customer is not providing proper power. Correct me if I'm wrong, but I don't think rotating disk hard drive manufacturers have ever had that in their warranty clauses.

  6. We encountered something like this by AliasMarlowe · · Score: 5, Interesting

    We encountered extensive and progressive file corruption on SSDs in an industrial device. It used the FAT file system, and after every loss of power, it ran its equivalent of chkdsk/f at the next boot. If power was lost again while this command was running, then it was guaranteed that the file system would become corrupt (despite the fact that we were writing nothing to the SSD; it held only files which were opened for reading). The window of opportunity was described as "very short", and the possibility of corruption was "very small" according to the vendor. In our experience in the field, and in our internal testing, the window of opportunity exceeded 20 seconds, and the possibility of corruption was "utter certainty".

    The vendor fixed the problem in a very easy way. They changed the file system from FAT to a commercial journaling FS. In our subsequent tests, we never found any file corruption, even on iterated power loss at random intervals after power on.

    --
    Those who can make you believe absurdities can make you commit atrocities. - Voltaire
    1. Re:We encountered something like this by TheRealMindChild · · Score: 4, Insightful

      First, running an SSD on an "industrial device"

      Second, using FAT

      Third, "commercial journaling FS". What does that even mean?

      If you are industrial, where is your UPS?

      --

      "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
    2. Re:We encountered something like this by certsoft · · Score: 5, Informative

      We use USB flash drives for a data logger. Most of the time the data is being buffered in the ARM based Linux board's RAM to save power. Once we get a complete file's worth (4MB at the present) we power up, validate, write the file, and power down. Supercaps have been a lifesaver. There's even enough capacity to do the write cycle if the flash was powered down when a power fail is detected. That allows us to not lose whatever was already in the RAM buffer.

    3. Re:We encountered something like this by yurtinus · · Score: 5, Insightful

      Likely as part of an embedded system - monitoring or control software. Systems where you just flip the power switch on when you need them and off when you're done, so an UPS wouldn't apply.

      I'm not saying their implementation was right, just saying that you can't imply from his post that it was wrong :P

      --
      +1 Disagree
    4. Re:We encountered something like this by thejynxed · · Score: 3, Informative

      If it was a drive being used to read schematics for CNC for instance, there isn't a manufacturer out there that currently offers a machine-tied UPS for the CNC machine. If the CNC machine loses power, then so does the drive, and vice versa, since it's all on the same circuit (usually you'll find the power stuff hidden in a cabinet along a nearby wall, and that stuff takes power directly from the mains).

      --
      @Mindless Drivel: 100% of Twitter posts ever Tweeted.
    5. Re:We encountered something like this by Darinbob · · Score: 3, Interesting

      I hate a lot of USB drives and CompactFlash. They're all designed as dumb commodity devices for the undiscriminating user, and trying to get any solid spec sheets out of the manufacturers is impossible if you're not also a giant corporation. Instead their data sheets are just marketing literature (you rarely get anything more technical than "8x speed"). Almost all are designed to work with Windows with no concern to work with embedded systems or production automation, etc. So you end up buying a wide variety to test with and see which ones are barely adequate to work with your system.

    6. Re:We encountered something like this by certsoft · · Score: 2

      Fortunately the client has facilities to test various drives over a wide temperature range (down to -40, not sure how hot they test) while running. And yes, a lot of them are crap.

    7. Re:We encountered something like this by thejynxed · · Score: 5, Interesting

      Not just a lot of them, most of them, to the point that my former contract rolled their own due to flaky controllers, etc put out by the SSD manufacturers. Yes, they found it cheaper and more efficient to make their own SSD drives, and to incinerate the ones that failed in a blast furnace than rely on the crap the manufacturers are currently foisting on the market.

      --
      @Mindless Drivel: 100% of Twitter posts ever Tweeted.
    8. Re:We encountered something like this by hot soldering iron · · Score: 5, Interesting

      You might check into adding supercaps into the power supply, across the DC output lines.
      For a less DIY method, you could try this: http://www.beam-tech.com/093001/prd_pgs/internal_ups.htm#
      It's an internally mounted UPS. There are also some PC power supplies that have the UPS built-in, but expect to pay a premium for those.
      If your application allows it, you might want to just mount your SSD into a laptop. It already has internal battery power, and there isn't any exotic hardware you have to pay through the nose for.

      --
      When you want something built, come see me. If you want correct grammar and spelling, get a F*ing liberal arts student.
    9. Re:We encountered something like this by adolf · · Score: 2

      Do laptops ever monitor health of the battery if external power is never removed? I'm aware that laptops can tell when the battery is eventually trashed in normal use (Dells, in particular, seem to be pretty bitchy about it with continuously-blinking lights, and report their findings to the OS if it bothers to ask).

      But being plugged in forever is not "normal use" for a laptop.

      I like your idea (and no, I'm not the AC you're replying to), but I have this vision of a small laptop that has been running with external power for years and years. And for all of those years and years, it's been reporting (via ACPI or whatever hooks) that the battery is in fine, working order.

      Suddenly the power dips for a moment, and the machine crashes with neither warning nor expectation because the li-ion/li-po cells are simply very old and nothing bothered to check (let alone report) if they still work beforehand.

    10. Re:We encountered something like this by ultrasawblade · · Score: 2

      You calling ext3/ext4 shitty? I can put the journal on a separate device for performance enhancement, can NTFS do that?

      In all seriousness though, NTFS is well engineered.

    11. Re:We encountered something like this by adolf · · Score: 2

      Right, sure: All of this battery information can certainly be gleaned under any operating system, given appropriate software.

      But the question is (restated): If the machine never runs on battery, does the machine know the health status of that battery? Does it really have any idea what those figures really are? Can it possibly know, without ever having run on (or otherwise discharged) the battery what the operational status of that battery really is?

      The implication is that if it cannot, then it's really not inherently more reliable than a much simpler machine with no battery at all.

      Please remember that the context here is that of a reliable machine that generally has external power and exists in a fixed location, but which may (as any other thing also may) lose that external power at some point.

      That a laptop in normal use spends some of its time running plugged in, some of it just charging, some of it sitting in a bag doing nothing, and some of it running only from battery, and reports statistics based on that normal treatment, only indicates that a laptop battery works predictably in normal use.

      This isn't normal use, though. And I, myself, have never tried this particular abnormal use of getting a new laptop, plugging it in, and leaving it that way for a Long, Long Time.

      Hence, the question.

    12. Re: We encountered something like this by Anonymous Coward · · Score: 2, Informative

      He is talking about the file system specification (its on-disk structure), not about the specific code implementation in Windows.

    13. Re:We encountered something like this by Anonymous Coward · · Score: 2, Insightful

      Obvious troll is obviously doing just that, i think his use of the term "silly faggots" when referring to linux users is the clue that tipped me off to this fact.

  7. not naming names = data "pulled out of my ass" by citizenr · · Score: 2, Insightful

    Useless paper/test.

    --
    Who logs in to gdm? Not I, said the duck.
    1. Re:not naming names = data "pulled out of my ass" by citizenr · · Score: 2

      If they do that, they won't get any more free SSDs to test, and that'll impact their ability to write papers criticizing SSDs. What would you prefer?

      I would prefer research to be done by someone who is not a manufacturer's bitch.
      You don't need a ton of money to test commodity hardware; the trick is to SELL stuff after the test, not take it home and pretend it wasn't a bribe.

      --
      Who logs in to gdm? Not I, said the duck.
  8. up/down/up/brown/fried by h8sg8s · · Score: 2, Insightful

    What some folks don't realize is that it's the seesaw nature of many power events that's primarily behind both data corruption and SSD failure. It's a rare rack system that has its own power conditioning and UPS these days (HP NonStop comes to mind) and without it you're subject to whatever the event provides in the way of under/over voltage, spikes, drops, etc. Many times these happen in timeframes too fast for power switching equipment to react and in some cases it's that stuff that gets fried first.

    --
    Organization? You must be joking..
  9. UPS does nothing for the common fault case. by stoploss · · Score: 3, Informative

    Most enterprise SSDs do have small supercapacitors or capacitor arrays onboard for exactly this reason. Some of the higher-end consumer drives do too. But most consumer drives don't.

    The answer? Get a UPS.

    A UPS is no panacea: I experience grid failure very rarely.

    However, relatively speaking I experience many more kernel lockups that require an ACPI-initiated poweroff by holding down the power button until the machine abruptly powers off. What do you do when a reboot/poweroff command causes your Linux/BSD machine to hang? I/O handle leaks in the Samba SMB client (ie. *not* the smbd daemon) and the Samba Winbind code are notorious for this. The only times I have ever had to "yank power" from a production Linux database machine were due to SMB share mount zombies or Winbind that the kernel couldn't kill even during an issued reboot command.

    I have several OCZ Vertex 4 SSDs, and this concerns me—especially due to the fact that the paper/presentation does not disclose the test results. I guess I will just have to hope that my device models aren't affected and/or that waiting a minute or two during a hung poweroff/reboot means the kernel has stopped attempting to write to the devices and everything has flushed.

    PS. If you compare the vague results in the summary with the paper you will find that only two of the fifteen drives passed the tests, yet four of the devices were cited to have power protection capacitors.

    1. Re:UPS does nothing for the common fault case. by stoploss · · Score: 2

      I don't understand how if they claim that it takes up to 20 sec for the final write to finalize that a computer that simply shuts down in 10 sec won't have the same problem.

      Drives support a blocking "sync" command that is only supposed to return when the drive has flushed all pending writes and has reached quiescence. If there is nothing pending to flush then the command will return immediately. If not, it may take the cited 20 seconds to return. Normal reboot/poweroff procedure in the OS waits for this condition, and this has been around forever (the HDD equivalent is to flush write cache and park the heads). That's why a 10 second shutdown can be safe even with the putative 20 second worst-case writeout window: if there are no pending writes then there is nothing to do.

      Yanking the power prevents this sanity check from happening.
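
      From an application, the closest equivalent of waiting on that flush is an explicit fsync. A minimal Python sketch follows; the helper name `write_durably` is made up for illustration. Note the assumption: `os.fsync` only asks the OS to push the file's data to the device and wait. Whether the drive has really persisted it at that point is the drive's side of the bargain.

```python
import os


def write_durably(path, data):
    """Write bytes to path and block until the OS reports them flushed to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's own buffer into the OS
        os.fsync(f.fileno())  # ask the OS to flush to the device, and wait for it
```

      A call such as `write_durably("journal.bin", b"record")` returns only after the flush request completes, which is the same "wait for quiescence" step a clean shutdown performs and a yanked power cord skips.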

  10. Buy a SSD with a battery or capacitor by thue · · Score: 2

    This is old news; see e.g. Wikipedia's coverage. Only buy SSDs with a battery or capacitor, or whatever is in the DRAM cache of the SSD will be lost on power failure.

  11. My Personal Policy by wisnoskij · · Score: 2, Insightful

    This is why I don't use prototype tech that is really not ready to be used in the real world. And if you do, expect loads of bugs and bricking.

    But either way, thanks for funding the development of something I am excited to try out in 2-4+ years when it will be a mature usable technology.

    --
    Troll is not a replacement for I disagree.
  12. Re:Interesting failure mode for Crucial SSDs by Voyager529 · · Score: 2

    You got this too? I just ordered a Crucial M4 on sale a few weeks ago. The day after I installed and cloned it, I had the same situation where it wouldn't start. I called Crucial, expecting to need an RMA. Luckily I got an informed gentleman on the phone who told me to leave it at the failed POST screen for 20 minutes, reboot, and give it another 20 minutes, and reboot again. It worked. Supposedly it's not so much a 'bug' as an 'obscure feature'. ...I'm keeping my spinning rust drive around just in case.

  13. Since I did RTFA by rabtech · · Score: 2

    Power loss protection (super capacitors) was stated on four of the drives (the four least expensive to boot). Only three performed flawlessly in the unserialized writes test. Those aren't great odds. In fact only two drives passed all tests with no errors, and it wasn't necessarily the SLC "enterprise" drives, though those two also passed the serialized writes test.

    In case you aren't aware, unserialized writes invalidate *every* assumption, including write ahead, journaling, even your fancy BTRFS/ZFS. His example is a database where the transaction log write was sync'd before the data page write, then after a power failure the data page is persisted but the log write is gone.

    You can recover from many of the other errors or at least detect them but unserialized writes can silently corrupt data or even ruin the entire filesystem.

    Obviously the metadata/dead failures are the exception... Those render the whole SSD useless.
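
    The log-before-data example above can be sketched as a tiny Python simulation. Everything here is hypothetical (the function names, the two-entry write list, the "log"/"page" keys); it models no real drive, only the rule that a data page should never be on the medium without its log record.

```python
def surviving_state(writes, persisted):
    """Return what is on the medium if only the writes at the indexes in `persisted` survived."""
    medium = {}
    for index, (key, value) in enumerate(writes):
        if index in persisted:
            medium[key] = value
    return medium


def wal_invariant_holds(medium):
    """Write-ahead rule: a data page may be present only if its log record is too."""
    return "page" not in medium or "log" in medium


# The host issued (and synced) the log record first, then the data page.
writes = [("log", "txn 7: set balance=50"), ("page", "balance=50")]

# Serialized drive: a power cut can only drop a suffix of the issued writes.
print(wal_invariant_holds(surviving_state(writes, {0})))
# Unserialized drive: the later write survives while the earlier one is gone.
print(wal_invariant_holds(surviving_state(writes, {1})))
```

    In the second case recovery has nothing to replay or roll back, and nothing on the medium even signals that the page changed, which is why this failure can be silent.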

    --
    Natural != (nontoxic || beneficial)