Slashdot Mirror


Why I'm Usually Unnerved When Modern SSDs Die on Us (utoronto.ca)

Chris Siebenmann, a Unix Systems Administrator at University of Toronto, writes about the inability to figure out what went wrong when an SSD dies: What unnerves me about these sorts of abrupt SSD failures is how inscrutable they are and how I can't construct a story in my head of what went wrong. With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that; perhaps the spindle motor drive seized or the drive had some other gross mechanical failure that brought everything to a crashing halt (perhaps literally). SSDs are both solid state and opaque, so I'm left with no story for what went wrong, especially when a drive is young and isn't supposed to have come anywhere near wearing out its flash cells (as this SSD was).

(When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming. Probably there are potential undetected manufacturing flaws in the flash cells and so on, though.) When I have no story, my thoughts turn to unnerving possibilities, like that the drive was lying to us about how healthy it was in SMART data and that it was actually running through spare flash capacity and then just ran out, or that it had a firmware flaw that we triggered that bricked it in some way.

7 of 358 comments

  1. This is why you have RAID and backups by froggyjojodaddy · · Score: 3, Informative

    *shrug* ?

    I mean, manufacturing defects, environment, and just plain old bad luck? SSDs have come a long way, but if I have anything of importance, I'm RAID'ing it and backing up. I feel anyone with an understanding of technology knows the importance of this.

  2. It's not that scary... by FrankSchwab · · Score: 4, Informative

    Infant failures are common in electronics ( https://www.weibull.com/hotwir... ). From a simple standpoint, imagine a poorly soldered junction on the PCB - soldered well enough to pass QC and work initially, but after a couple of heating cycles the solder joint fractures. The same kinds of problems occur inside chips - wire bonds between the package and die may be defective but initially conductive, and fracture due to thermal cycling.
    Similar problems can occur on the die. The gate oxide for a particular transistor might be too thin due to process issues. If it's way too thin, it'll fail immediately and the die will get sorted out at test. If it's just a bit thicker, it might pass all production tests but fail after an hour or two of operation, or 100 power cycles. If it's a bit thicker still (where it should be), it might last for 20 years and a million power cycles.
    Everyone in the semiconductor industry would love to figure out how to eliminate these early failures. No one has found a way to do it.
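
    One common way to put numbers on infant failures is the Weibull hazard rate, where a shape parameter below 1 gives a failure rate that falls with age. The sketch below is only an illustration: the function names and the shape and scale values are made up for the example, not measurements from any real drive.

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t, shape, scale):
    """Probability that a part is still working at time t."""
    if t < 0 or shape <= 0 or scale <= 0:
        raise ValueError("t must be non-negative; shape and scale positive")
    return math.exp(-((t / scale) ** shape))


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to show the three regimes.
    regimes = {
        "infant mortality (shape 0.5)": 0.5,
        "random failures (shape 1.0)": 1.0,
        "wear-out (shape 3.0)": 3.0,
    }
    for label, shape in regimes.items():
        rates = [weibull_hazard(t, shape, 1000.0) for t in (10, 100, 1000)]
        print(label, ["%.5f" % r for r in rates])
```

    With a shape below 1 the printed rate drops as the part ages, which matches the "if it survives the first while, it will probably keep going" pattern; a shape above 1 is the wear-out end of the curve.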

    --
    And the worms ate into his brain.
  3. Re:Learn about the subject by Anonymous Coward · · Score: 2, Informative

    Electronics wear out slowly. In fact most will long exceed their usefulness before they die.
    More often, electronics will die early due to manufacturing defects. It's why if your device lasts the first month it will probably keep working until you upgrade it. SSDs are a different beast, though: flash cells do wear out, so they have excess capacity to handle wear leveling. Still, a young drive that dies is usually, again, a sign of a manufacturing defect.

  4. Re:With spinning disks, you do not know either by AmiMoJo · · Score: 4, Informative

    Often SSD failures can be predicted or at least diagnosed by looking at SMART data. That's what it's for, after all. Some manufacturers provide better data than others.
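
    As a rough sketch of what "looking at SMART data" can mean in practice, the snippet below flags attributes whose normalized value has dropped to or below the vendor threshold. It assumes a report shaped like the JSON that smartmontools prints for ATA drives (an `ata_smart_attributes.table` list with `name`, `value` and `thresh` fields); treat that layout as an assumption and check it against your own drive's output. NVMe drives report health differently. The function name and the sample numbers are hypothetical.

```python
def failing_attributes(report):
    """Return names of SMART attributes at or below their failure threshold.

    `report` is a dict parsed from JSON; the key layout used here is an
    assumption about the tool's output, not a guaranteed format.
    """
    table = report.get("ata_smart_attributes", {}).get("table", [])
    flagged = []
    for attr in table:
        value = attr.get("value")
        thresh = attr.get("thresh", 0)
        if value is None:
            continue
        # A threshold of 0 conventionally means "never considered failing".
        if thresh > 0 and value <= thresh:
            flagged.append(attr.get("name", str(attr.get("id"))))
    return flagged


# Hypothetical sample data, made up for illustration only.
sample_report = {
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "name": "Reallocated_Sector_Ct", "value": 8, "thresh": 10},
            {"id": 9, "name": "Power_On_Hours", "value": 99, "thresh": 0},
            {"id": 177, "name": "Wear_Leveling_Count", "value": 95, "thresh": 5},
        ]
    }
}

if __name__ == "__main__":
    print(failing_attributes(sample_report))
```

    A check like this only catches what the drive chooses to report, so a clean result does not rule out a sudden controller or firmware failure.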

    Like HDDs, sometimes the electronics die too. Usually a power supply issue. Can be tricky to diagnose. SSDs are slightly worse, as with HDDs you can often replace the controller PCB and get them working again, whereas SSDs are a single PCB with the controller and memory.

    --
    const int one = 65536; (Silvermoon, Texture.cs)
    SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
  5. Why does it matter? by CaptainDork · · Score: 4, Informative

    I'm a retired IT guy and there's no kind of something that didn't fucking break. I'm not a goddam engineer. My job was to locate the problem at a black-box level and get the shit running again. Contemplating the "why" of a hardware failure is wheel-spinning instead of pulling the stuff out of the ditch.

    For new purchases under warranty, I exchanged them and sent the dead one back to the vendor. Let them hook it up and do diagnostics over a cup of coffee.

    I had work to do.

    --
    It little behooves the best of us to comment on the rest of us.
  6. Re:Controller failure by bobbied · · Score: 5, Informative

    Wow, that PCB substitution trick became very hit or miss a long time ago.

    Nowadays, there are a whole bunch of operational parameters that need to be set properly to get data on/off a drive. I understand that some of these "configuration" items are now stored in non-volatile memory on that PCB and set during the manufacturing process. Similar serial numbers may help, but it's still very hit or miss.

    --
    "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  7. Re:With spinning disks, you do not know either by Comboman · · Score: 3, Informative

    Mod parent up. The most common cause of a sudden, unexplained failure for both HDs and SSDs is a failure of the controller rather than the media.

    --
    Support Right To Repair Legislation.