Why I'm Usually Unnerved When Modern SSDs Die on Us (utoronto.ca)

Posted by msmash on Tuesday December 11, 2018 @04:16AM from the SSDDeathDisturbing dept.

Chris Siebenmann, a Unix Systems Administrator at the University of Toronto, writes about being unable to work out what went wrong when an SSD dies: What unnerves me about these sorts of abrupt SSD failures is how inscrutable they are and how I can't construct a story in my head of what went wrong. With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that; perhaps the spindle motor drive seized or the drive had some other gross mechanical failure that brought everything to a crashing halt (perhaps literally). SSDs are both solid state and opaque, so I'm left with no story for what went wrong, especially when a drive is young and isn't supposed to have come anywhere near wearing out its flash cells (as this SSD was).

(When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming. Probably there are potential undetected manufacturing flaws in the flash cells and so on, though.) When I have no story, my thoughts turn to unnerving possibilities, like that the drive was lying to us about how healthy it was in SMART data and that it was actually running through spare flash capacity and then just ran out, or that it had a firmware flaw that we triggered that bricked it in some way.

With spinning disks, you do not know either by gweihir · 2018-12-11 04:21 · Score: 5, Insightful

Seriously, you do not. You may know the end-result sometimes (head-crash), but the root-cause is usually not clear.
So get over it. It is a new black-box replacing an older black-box.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
1. Re:With spinning disks, you do not know either by 110010001000 · 2018-12-11 04:25 · Score: 5, Insightful
  
  What is unnerving is that a guy from the Department of Computer Science thinks that SSDs are theoretically immune to manufacturing failures.
2. Re:With spinning disks, you do not know either by froggyjojodaddy · 2018-12-11 04:33 · Score: 5, Insightful
  
  From the article:
  
  "Further, when I have no narrative for what causes SSD failures, it feels like every SSD is an unpredictable time bomb. Are they healthy or are they going to die tomorrow? "
  
  Emphasis mine. I feel like this guy has opportunities to improve his coping mechanism. For someone in Computer Sciences, it seems like he's way too worried about this. I'm not trying to be mean, but it's like if I got into a car accident and then questioned the entire safety design of all vehicles rather than just taking a few steps back and understanding it's a freak event, but not a totally unexpected one. If you've been driving for 30 years, statistically, you're likely to get into at least one accident, even if it's not your fault.
3. Re:With spinning disks, you do not know either by AmiMoJo · 2018-12-11 04:44 · Score: 4, Informative
  
  Often SSD failures can be predicted or at least diagnosed by looking at SMART data. That's what it's for, after all. Some manufacturers provide better data than others.
  Like HDDs, sometimes the electronics die too. Usually a power supply issue. Can be tricky to diagnose. SSDs are slightly worse, as with HDDs you can often replace the controller PCB and get them working again, whereas SSDs are a single PCB with the controller and memory.
  
  --
  const int one = 65536; (Silvermoon, Texture.cs)
  SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
4. Re:With spinning disks, you do not know either by ctilsie242 · 2018-12-11 04:57 · Score: 5, Funny
  
  Could be worse. At a previous job, I've had someone demand "7200 RPM SSDs", and no amount of explaining could change the person's mind.
5. Re:With spinning disks, you do not know either by Stonent1 · 2018-12-11 05:05 · Score: 4, Interesting
  
  Ok, I'm in IT and it unnerves me. I've had numerous computers have an SSD totally die and lose all data with no SMART warnings in the last few years. (Not me personally, I mean people at our organization)
6. Re:With spinning disks, you do not know either by jellomizer · 2018-12-11 05:36 · Score: 4, Funny
  
  That is why I always stick to reel to reel 9 track paper tape. If you can't see the bits you just can't trust it.
  
  --
  If something is so important that you feel the need to post it on the internet... It probably isn't that important.
7. Re:With spinning disks, you do not know either by Anonymous Coward · 2018-12-11 05:39 · Score: 5, Insightful
  
  All for the SAME reason- the wrong type of cell failed, and the crappy software doesn't know how to recover. The software systems of the SSD and the OS driver side are written by idiots.
  A low level tool that knows your particular SSD driver chipset could trivially access the vast majority of flash cells on your SSD drive. But what good is that FACT if the tools are not readily available.
  And SMART warnings do NOT apply to SSD drives. SMART is for electro-mechanical systems with statistical models of gradual failure. SMART is FAKED for SSD.
  A catastrophic SSD failure is when the 'wrong' memory cell dies, and the software locks up. Since all memory cells are equally likely to die at some point, this is a terrible fault of many of these drives.
8. Re:With spinning disks, you do not know either by Sponge+Bath · 2018-12-11 05:45 · Score: 4, Funny
  
  Tell this person you could only find 7199 RPM SSDs, but if they spin in an office chair while using the system it will make up the difference.
9. Re:With spinning disks, you do not know either by gweihir · 2018-12-11 07:28 · Score: 4, Interesting
  
  Well, I originally bought OCZ. Today _all_ of 5 OCZ drives I got are stone-dead. After that I moved to Samsung, mostly "Pro". They are all still working fine and some are older now than the first OCZ when it died. So yes, it makes a difference. Incidentally, Samsung had excellent reliability in their spinning drives as well. It seems they just care more about quality and reputation.
  That said, I find it sad that you cannot get "high reliability" SSDs where you basically can forget about the risk of them dying. I am talking reliability levels like a typical CPU here. It seems the market for that is just not there.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
10. Re:With spinning disks, you do not know either by viperidaenz · 2018-12-11 07:37 · Score: 5, Insightful
  
  SMART should be able to provide the number of remapped sectors. There should be manufacturer specific counters for the amount of over provisioning that is left for remapping too. That should tell you precisely when you should plan to replace an SSD due to age.
  How hard would it be to notify something that the drive can't handle any more dead cells, so should not be written to any more? Or that it is down to x% of spare nand?
11. Re:With spinning disks, you do not know either by dgatwood · 2018-12-11 13:01 · Score: 5, Insightful
  
  I think you may be missing his point. I've had SSDs die on me as well with absolutely no warning. What's unnerving about it is you have no idea why it failed. Good engineers like failure analysis; it helps determine if you're buying a crappy product, running your product out of spec, or any number of other metrics which can inform future purchases.
  Statistically, without even knowing what the particular product was, I can tell you what caused it: RoHS.
  The change from lead-based solder to lead-free solder is one of the major causes of premature electronics failures — probably more common than all other causes put together. Between tin whiskers, cold solder joints, and stress fractures caused by thermal expansion of component packages, the RoHS lead-free solder rule is a clear example of environmentalism gone amok. Instead of improving our environment by reducing the amount of lead going out into the world, it has, IMO, made our environment worse by dramatically increasing the amount of hardware discarded as junk long before it otherwise would have been.
  
  --
  Check out my sci-fi/humor trilogy at PatriotsBooks.
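
As a rough illustration of the check viperidaenz asks for above (warn when the drive is down to x% of spare NAND), here is a minimal sketch. It assumes the health counters have already been read into a plain dict by some other tool. The key names `available_spare_pct`, `spare_threshold_pct`, and `percentage_used` are hypothetical, loosely modeled on the spare and wear figures that NVMe drives report; SATA SSDs expose vendor-specific attributes instead. As other commenters point out, a clean result here says nothing about controller or solder-joint failures.

```python
from dataclasses import dataclass


@dataclass
class SpareStatus:
    level: str   # "ok", "warn", or "replace"
    reason: str


def check_spare(health: dict, warn_margin_pct: int = 10) -> SpareStatus:
    """Classify a drive from already-parsed health counters.

    All keys are hypothetical names for this sketch, not a real tool's output:
      available_spare_pct  - remaining spare capacity, 0-100
      spare_threshold_pct  - vendor's "too low" mark for spare capacity
      percentage_used      - vendor's estimate of consumed endurance
    """
    spare = health["available_spare_pct"]
    threshold = health["spare_threshold_pct"]
    used = health.get("percentage_used", 0)

    # Spare pool exhausted (or nearly): stop trusting the drive with writes.
    if spare <= threshold:
        return SpareStatus("replace", f"spare {spare}% is at or below threshold {threshold}%")
    # Getting close to the vendor threshold: plan a replacement.
    if spare <= threshold + warn_margin_pct:
        return SpareStatus("warn", f"spare {spare}% is within {warn_margin_pct} points of threshold {threshold}%")
    # Plenty of spare left, but the rated write endurance is used up.
    if used >= 100:
        return SpareStatus("warn", f"rated endurance consumed ({used}%)")
    return SpareStatus("ok", f"spare {spare}%, endurance used {used}%")


if __name__ == "__main__":
    print(check_spare({"available_spare_pct": 100, "spare_threshold_pct": 10, "percentage_used": 3}))
    print(check_spare({"available_spare_pct": 15, "spare_threshold_pct": 10, "percentage_used": 80}))
    print(check_spare({"available_spare_pct": 8, "spare_threshold_pct": 10, "percentage_used": 95}))
```

The 10-point warning margin is an arbitrary default for the sketch; a real deployment would pick it from how quickly a replacement can be sourced.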
Controller failure by macraig · 2018-12-11 04:30 · Score: 5, Insightful

I've had two SSDs die utterly. It wasn't because there was a failure of any part of the actual storage pathways: it was irreparable failure of the embedded controller circuits. The Flash itself was still fine and safely storing all my data, but there was no means to access it. At least with a platter drive if the PCB fails, you can unscrew and detach it and replace it with a matching PCB from another drive; no way to do that with an SSD. Early on when manufacturers were spending all their time hyping the comparative robustness of the Flash medium, they conveniently forgot to mention how fragile and not-so-robust the embedded third-party controller circuits could be.
1. Re:Controller failure by bobbied · 2018-12-11 05:02 · Score: 5, Informative
  
  Wow, that PCB substitution trick became very hit or miss a long time ago.
  Nowadays, there are a whole bunch of operational parameters which need to be set properly to get data on/off a drive. I understand that some of these "configuration" items are now stored in non-volatile memory on that PCB and set during the manufacturing process. Similar serial numbers may help, but it's still very hit or miss.
  
  --
  "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
It's not that scary... by FrankSchwab · 2018-12-11 04:31 · Score: 4, Informative

Infant failures are common in electronics ( https://www.weibull.com/hotwir... ). From a simple standpoint, imagine a poorly soldered junction on the PCB - soldered well enough to pass QC and work initially, but after a couple of heating cycles the solder joint fractures. The same kinds of problems occur inside chips - wire bonds between the package and die may be defective but initially conductive, and fracture due to thermal cycling.
Similar problems can occur on the die. The gate oxide for a particular transistor might be too thin due to process issues. If it's way too thin, it'll fail immediately and the die will get sorted out at test. If it's just a bit thicker, it might pass all production tests but fail after an hour or two of operation, or 100 power cycles. If it's thicker still (where it should be), it might last for 20 years and a million power cycles.
Everyone in the semiconductor industry would love to figure out how to eliminate these early failures. No one has found a way to do it.

--
And the worms ate into his brain.
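
The infant-failure behaviour FrankSchwab describes above is commonly modeled with a Weibull distribution whose shape parameter is below 1, which gives a failure rate that falls as a part ages. A minimal sketch follows; the shape and scale values are made-up illustrative numbers, not measurements of any real drive.

```python
import math


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (k/s) * (t/s)**(k-1) for t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t: float, shape: float, scale: float) -> float:
    """Probability a unit is still working at time t: exp(-(t/s)**k)."""
    return math.exp(-((t / scale) ** shape))


if __name__ == "__main__":
    # Illustrative assumptions only: shape < 1 models early-life failures.
    shape, scale = 0.5, 50_000.0  # scale in hours
    for hours in (10, 100, 1_000, 10_000):
        h = weibull_hazard(hours, shape, scale)
        s = weibull_survival(hours, shape, scale)
        print(f"{hours:>6} h  hazard={h:.2e}/h  surviving={s:.4f}")
```

With a shape below 1 the printed hazard drops at each step, which is the "survive the first weeks and it will probably last" pattern. A shape of exactly 1 gives a constant failure rate, and a shape above 1 gives a rising one, as in wear-out.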
Why does it matter? by CaptainDork · 2018-12-11 04:46 · Score: 4, Informative

I'm a retired IT guy and there's no kind of something that didn't fucking break. I'm not a goddam engineer. My job was to locate the problem at a black-box level and get the shit running again. Contemplating the "why" of a hardware failure is wheel-spinning instead of pulling the stuff out of the ditch.
For new purchases under warranty, I exchanged them and sent the dead one back to the vendor. Let them hook it up and do diagnostics over a cup of coffee.
I had work to do.

--
It little behooves the best of us to comment on the rest of us.
The spin is in! by theendlessnow · 2018-12-11 05:06 · Score: 4, Insightful

One thing I like about spinning disks is that a lot of times the failure is gradual. Bad sectors and such and you have the opportunity to grab data off the drive (noting, you really should have backups).

With SSD, whatever the issue, it's more like losing a controller board on the drive, everything dies and ceases to operate.

So... I'll go along and say SSD is "better" and more "reliable", but when it dies, it dies hard. Just the way it is. (not talking about performance degradation... speaking about failure)