Why I'm Usually Unnerved When Modern SSDs Die on Us (utoronto.ca)
Chris Siebenmann, a Unix Systems Administrator at the University of Toronto, writes about being unable to work out what went wrong when an SSD dies: What unnerves me about these sorts of abrupt SSD failures is how inscrutable they are and how I can't construct a story in my head of what went wrong. With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that; perhaps the spindle motor drive seized or the drive had some other gross mechanical failure that brought everything to a crashing halt (perhaps literally). SSDs are both solid state and opaque, so I'm left with no story for what went wrong, especially when a drive is young and isn't supposed to have come anywhere near wearing out its flash cells (as this SSD was).
(When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming. Probably there are potential undetected manufacturing flaws in the flash cells and so on, though.) When I have no story, my thoughts turn to unnerving possibilities, like that the drive was lying to us about how healthy it was in SMART data and that it was actually running through spare flash capacity and then just ran out, or that it had a firmware flaw that we triggered that bricked it in some way.
Seriously, you do not. You may know the end-result sometimes (head-crash), but the root-cause is usually not clear.
So get over it. It is a new black-box replacing an older black-box.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
*shrug* ?
I mean, manufacturing defects, environment, and just plain old bad luck? SSDs have come a long way, but if I have anything of importance, I'm RAID'ing it and backing up. I feel anyone with an understanding of technology knows the importance of this.
Waterboarding?
I've had two SSDs die utterly. It wasn't because there was a failure of any part of the actual storage pathways: it was irreparable failure of the embedded controller circuits. The Flash itself was still fine and safely storing all my data, but there was no means to access it. At least with a platter drive if the PCB fails, you can unscrew and detach it and replace it with a matching PCB from another drive; no way to do that with an SSD. Early on when manufacturers were spending all their time hyping the comparative robustness of the Flash medium, they conveniently forgot to mention how fragile and not-so-robust the embedded third-party controller circuits could be.
Infant failures are common in electronics ( https://www.weibull.com/hotwir... ) From a simple standpoint, imagine a poorly soldered junction on the PCB - soldered well enough to pass QC and work initially, but after a couple of heating cycles the solder joint fractures. The same kinds of problems occur inside chips - wire bonds between the package and die may be defective but initially conductive, and fracture due to thermal cycling.
Similar problems can occur on the die. The gate oxide for a particular transistor might be too thin due to process issues. If it's way too thin, it'll fail immediately and the die will get sorted out at test. If it's just a bit thicker, it might pass all production tests but fail after an hour or two of operation, or 100 power cycles. If it's thicker still (where it should be), it might last for 20 years and a million power cycles.
Everyone in the semiconductor industry would love to figure out how to eliminate these early failures. No one has found a way to do it.
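The infant-failure pattern described above is commonly modeled with a Weibull distribution, where a shape parameter below 1 gives a failure rate that falls as the part ages. A minimal Python sketch follows. The function name and every number in it are made up for illustration and are not taken from the linked article or from any real drive data:

```python
def weibull_hazard(t_hours, shape, scale_hours):
    """Instantaneous failure rate h(t) = (k / lam) * (t / lam) ** (k - 1).

    shape < 1  -> failure rate falls with age (infant mortality)
    shape == 1 -> constant failure rate (random failures)
    shape > 1  -> failure rate rises with age (wear-out)
    """
    if t_hours <= 0 or shape <= 0 or scale_hours <= 0:
        raise ValueError("t_hours, shape and scale_hours must all be positive")
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)


# Illustrative parameters only (assumptions, not measurements).
INFANT_SHAPE = 0.5
SCALE_HOURS = 10_000.0

early = weibull_hazard(100.0, INFANT_SHAPE, SCALE_HOURS)
late = weibull_hazard(5_000.0, INFANT_SHAPE, SCALE_HOURS)

# With shape < 1 a part that survives its first hours is less likely
# to fail in the next hour than a brand-new one.
print(f"hazard at 100 h:  {early:.6f} per hour")
print(f"hazard at 5000 h: {late:.6f} per hour")
```

With a shape below 1, the early figure comes out higher than the late one, which is the "if it survives the first month it will probably keep working" intuition in numeric form.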
And the worms ate into his brain.
Electronics wear out slowly. In fact most will long exceed their usefulness before they die.
More often electronics will die early due to manufacturing defects. It's why if your device lasts the first month it will probably keep working until you upgrade it. SSDs are a different beast, though, and thus have excess capacity to handle wear leveling. Still, a young drive that dies is usually, again, a sign of a manufacturing defect.
It's bad firmware. Some of the drives can supposedly be resuscitated by the factory or people who have reversed the private ATA commands.
I mean, at a minimum, unless it's a PHY failure (and there's no reason to suspect those), the firmware could at least report missing storage (I've actually seen a 0MB drive failure once or twice), but their usual failure mode is to halt and catch fire, as the author notes.
With the recent reports about the inexcusable security problems on Samsung and Crucial drives this is starting to feel like the old BIOS problems with Taiwanese mobo companies outsourcing to the lowest bidder and shipping bug-laden BIOS with reckless abandon. It's OK, all the world's servers only depend on this technology.
To be fair, I have a batch of 20GB Intel SLC SSDs that have never done this, but those are notable exceptions. At this point only low-end laptops like Chromebooks don't get at least a mirror drive here.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
I can see what you mean, but I think I won't really understand it until it happens to me (and I hope it never happens to me). I'm on my third SSD and none has ever failed; my previous one was showing some age and was SATA so I upgraded to M.2 NVMe on Cyber Monday. Perhaps they haven't failed on me because I keep most of my data on a HDD RAID array and use the SSDs only for OS, program files, and very limited caching.
Incipiamus, fratres, servire Domino Deo, quia hucusque vix vel parum in nullo profecimus.
I'm a retired IT guy and there's no kind of something that didn't fucking break. I'm not a goddam engineer. My job was to locate the problem at a black-box level and get the shit running again. Contemplating the "why" of a hardware failure is wheel-spinning instead of pulling the stuff out of the ditch.
For new purchases under warranty, I exchanged them and sent the dead one back to the vendor. Let them hook it up and do diagnostics over a cup of coffee.
I had work to do.
It little behooves the best of us to comment on the rest of us.
Doesn't know how SSDs work.
No offense to CS majors, but this EE major tends to understand "How a computer works" at a lower level than most of you programmer types. While not universally true, in my experience a Computer Science major generally gets outside their comfort zone with hardware once you get past "Plug it in and turn it on." I don't blame them, there is a lot of stuff happening at lower levels than a CS major needs to know to do their job.
That some CS major is concerned about how SSDs fail because he doesn't understand their failure modes is fine. We tend to fear what we don't understand and let's face it, there is a LOT of stuff going on inside a computer that high level users simply don't need to know. Heck, even I don't need to know some of that stuff and I've designed computing systems in the past. Fear not, if it works, it works, if it doesn't you just replace it anyway.
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
That happened to me three or four times already. They die without warning. No SMART indication, nothing. It really pisses us off. Someone needs to give us some kind of technical advance warning. Maybe SMART is not supposed to work well with SSDs after all.
One thing I like about spinning disks is that a lot of times the failure is gradual. You get bad sectors and such, and you have the opportunity to grab data off the drive (though you really should have backups).
With SSD, whatever the issue, it's more like losing a controller board on the drive, everything dies and ceases to operate.
So... I'll go along and say SSD is "better" and more "reliable", but when it dies, it dies hard. Just the way it is. (not talking about performance degradation... speaking about failure)
I had a Sandisk USB stick recently go read only. I had been using it as a hypervisor boot drive and the boot was crashing. When I inspected it, it was read only and any attempts to format it, diskpart it, fdisk it failed with some kind of error. I looked it up and apparently this is the designed failure route for these USB drives. When the controller detects an inconsistency or uncorrectable error, the drive is locked from writing so you can get data off of it.
SSDs really are unpredictable timebombs, so act appropriately - take frequent backups and use RAID if the downtime from a sudden SSD failure with zero warning is unacceptable. Any IT department that hasn't been prepared for the nature of SSD failures since long before they were available off the shelf was doing it wrong anyway.
I'm most worried about what SSDs mean for the Average Joe, whose data is largely protected by the predictability and recoverability of most hard drive failures. SSDs throw all of that out the window and lure them in with the warm glow of performance like moths to a flame. Average Joes need a real wake-up call on the importance of backups with the switch to SSDs.
"When information is power, privacy is freedom" - Jah-Wren Ryel
Not using TRIM doesn't have a huge effect on SSD life. Just performance. Write amplification adds some wear, but not enough to be drastic. And it won't cause sudden failure either - just normal wear on the wear-levelling curve. Sudden failure is by definition going to be something that's not related to routine depletion of a fixed lifespan.
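The point about write amplification can be checked with back-of-the-envelope arithmetic. The Python sketch below uses a deliberately simplified endurance model (total NAND write budget divided by yearly NAND writes). All figures are assumed for illustration, including the write amplification factors with and without TRIM, and it ignores over-provisioning and retention effects:

```python
def years_until_wearout(capacity_gb, pe_cycles, host_writes_gb_per_day, waf):
    """Rough years of life before the flash hits its rated P/E cycles.

    Simplified model: every cell can be programmed pe_cycles times, and
    each GB the host writes costs waf GB of actual NAND writes.
    """
    if min(capacity_gb, pe_cycles, host_writes_gb_per_day, waf) <= 0:
        raise ValueError("all inputs must be positive")
    total_nand_budget_gb = capacity_gb * pe_cycles
    nand_writes_per_year_gb = host_writes_gb_per_day * waf * 365
    return total_nand_budget_gb / nand_writes_per_year_gb


# Hypothetical drive and workload (assumptions, not vendor figures).
CAPACITY_GB = 500
PE_CYCLES = 3000
HOST_WRITES_GB_PER_DAY = 50

with_trim = years_until_wearout(CAPACITY_GB, PE_CYCLES, HOST_WRITES_GB_PER_DAY, waf=1.5)
without_trim = years_until_wearout(CAPACITY_GB, PE_CYCLES, HOST_WRITES_GB_PER_DAY, waf=3.0)

print(f"assumed WAF 1.5: about {with_trim:.0f} years")
print(f"assumed WAF 3.0: about {without_trim:.0f} years")
```

Under these assumptions, doubling the write amplification halves the estimated life but still leaves it at decades, which fits the argument that a drive dying young is not a wear-out event.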
Who the hell cares? Replace it and restore your data.
The data on a failing drive might be a newer version than the most recent weekly backup. I see value in backing up the newer version elsewhere as the first part of replacing the drive. But SSD failure modes allegedly make this newer version inaccessible sooner than HDD failure modes.
Back up your data frequently. Stop worrying. Is that so hard?
I'm guessing the author never lived through the era when there were a lot more companies in existence for mechanical HDs than there are now. HDs can spontaneously die from a failed motor, electronics failure or catastrophic crash. Some small companies went completely under and were swallowed up by larger manufacturers due to massive defects. SSDs have gone through the same era as well with buggy firmware. Generally speaking, though, if you stick to the big manufacturers like Samsung and Intel the chances of fatal issues go down a lot. That said, an SSD is not a guarantee of safe data. They're far more reliable, but circuit failure or static electricity can kill SSDs. Besides, SSDs won't save you from an accidental erase all.
Disclaimer: I've known Chris since we were CS undergraduates together in the 1980s, and we currently work together in the CS Department in Toronto. It may seem a bit odd to some that a hard disk failure isn't unnerving but an SSD failure is. That's because one of a good sysadmin's skills is properly focused anxiety, used to motivate a mental model of how things can fail, and what to do about it. Data storage is a key part of this mental model, since data access loss, or even worse, data loss, is a major risk. That's why it's helpful to know how disks work, how they behave when they fail, and how likely it is for such things to happen. Chris has a few decades of experience in dealing with disks. SSDs take the place of disks, and they store stuff just like disks do, but they work differently, and they behave very differently when they fail. In particular, SSDs often don't seem to give any indication that things may be wrong: one moment all is well, the next moment, all is dead. So instincts honed over a few decades of experience with hard drives don't apply. Of course Chris (and we all) will develop new instincts as we get more experience with SSDs. But in the meanwhile, it's indeed unnerving. And no, this isn't some sort of profound insight. It's merely an observation. Many experienced sysadmins, I think, will "get" this. People newer to the field might not. That's OK.