Why I'm Usually Unnerved When Modern SSDs Die on Us (utoronto.ca)
Chris Siebenmann, a Unix Systems Administrator at University of Toronto, writes about the inability to figure out the bottleneck when an SSD dies: What unnerves me about these sorts of abrupt SSD failures is how inscrutable they are and how I can't construct a story in my head of what went wrong. With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that; perhaps the spindle motor drive seized or the drive had some other gross mechanical failure that brought everything to a crashing halt (perhaps literally). SSDs are both solid state and opaque, so I'm left with no story for what went wrong, especially when a drive is young and isn't supposed to have come anywhere near wearing out its flash cells (as this SSD was).
(When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming. Probably there are potential undetected manufacturing flaws in the flash cells and so on, though.) When I have no story, my thoughts turn to unnerving possibilities, like that the drive was lying to us about how healthy it was in SMART data and that it was actually running through spare flash capacity and then just ran out, or that it had a firmware flaw that we triggered that bricked it in some way.
Seriously, you do not. You may know the end-result sometimes (head-crash), but the root-cause is usually not clear.
So get over it. It is a new black-box replacing an older black-box.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Happens
Hey Chris from Department of Computer Science has a problem. Let's hear about it, Chris.
Doesn't know how SSD's work.
If you want to know the details, learn about the subject at hand. The thing is, electronics wear out. There is a reason flash cells wear out faster than other "solid state technology" like ordinary transistors, and a lot of it has to do with scaling down the chip.
Obviously there could be other issues at hand too, such as firmware failures, but then you have to know why SSDs are so much more complex than a hard drive to begin with (hint: it has primarily to do with the wear described above).
Custom electronics and digital signage for your business: www.evcircuits.com
Mine died about 2 months after I got it. A Samsung 850 Pro. I must say that they did a quick turnaround fixing it: a total of 3 days on their end to repair it and get it back to me. But I was annoyed that they required me to send the disk back for repair rather than sending out a replacement. At the time I was thinking I was going to be dead in the water for a week or more. Still, two days to mail it to them plus three more days is a long time to not be able to use a machine for work.
The cause? A power outage. I know that manufacturers have been working on power-loss problems, but the fact that most don't mention whether or not their product is safe against a power outage is disconcerting at the least.
I replaced that UPS that week. :(
"Chris Siebenmann of Department of Computer Science writes about HIS inability to figure out the bottleneck when an SSD dies"
Just because he can't do it, doesn't mean it's not possible. There are ways of making the silicon talk.
*shrug* ?
I mean, manufacturing defects, environment, and just plain old bad luck? SSDs have come a long way, but if I have anything of importance, I'm RAID'ing it and backing up. I feel anyone with an understanding of technology knows the importance of this.
Did a bad firmware flash and just won't admit it for warranty purposes!
I've had two SSDs die utterly. It wasn't because there was a failure of any part of the actual storage pathways: it was irreparable failure of the embedded controller circuits. The Flash itself was still fine and safely storing all my data, but there was no means to access it. At least with a platter drive if the PCB fails, you can unscrew and detach it and replace it with a matching PCB from another drive; no way to do that with an SSD. Early on when manufacturers were spending all their time hyping the comparative robustness of the Flash medium, they conveniently forgot to mention how fragile and not-so-robust the embedded third-party controller circuits could be.
Infant failures are common in electronics ( https://www.weibull.com/hotwir... ) From a simple standpoint, imagine a poorly soldered junction on the PCB - soldered well enough to pass QC and work initially, but after a couple of heating cycles the solder joint fractures. The same kinds of problems occur inside chips - wire bonds between the package and die may be defective but initially conductive, and fracture due to thermal cycling.
Similar problems can occur on the die. The gate oxide for a particular transistor might be too thin due to process issues. If it's way too thin, it'll fail immediately and the die will get sorted out at test. If it's just a bit thicker, it might pass all production tests but fail after an hour or two of operation, or 100 power cycles. If it's thicker still (where it should be), it might last for 20 years and a million power cycles.
Everyone in the semiconductor industry would love to figure out how to eliminate these early failures. No one has found a way to do it.
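The early-failure behaviour described above is commonly modeled with a Weibull distribution: a shape parameter below 1 gives a failure rate that falls with age (infant mortality), while a shape above 1 gives a rate that rises with age (wear-out). A minimal sketch follows; the shape and scale values are illustrative assumptions, not data from any real drive population.

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (k / lam) * (t / lam) ** (k - 1), for t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t, shape, scale):
    """Probability that a unit is still working at time t: exp(-(t / lam) ** k)."""
    return math.exp(-((t / scale) ** shape))


# Illustrative parameters only (hours); not fitted to any measured failures.
INFANT = {"shape": 0.5, "scale": 50_000.0}   # shape < 1: hazard falls with age
WEAROUT = {"shape": 3.0, "scale": 50_000.0}  # shape > 1: hazard rises with age

if __name__ == "__main__":
    for hours in (10, 1_000, 40_000):
        print(hours,
              weibull_hazard(hours, **INFANT),
              weibull_hazard(hours, **WEAROUT))
```

With a shape below 1, a unit that survives its first hours is less likely to fail in the next hour than a brand-new one, which is why burn-in testing screens out some, but never all, of the weak parts.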
And the worms ate into his brain.
The spinning parts of an HDD are not the only parts that can go bad, just as the NAND flash memory is not the only part of an SSD that can go bad. There are other components: controllers for the computer interface and the NAND chips, and the power to everything. One bad electronic component can take down either. One dead capacitor can stop a whole motherboard from running.
In my experience with HDDs you'll usually get some warning that your drive has issues before it completely calls it quits. Whether it's bad sectors turning up or noises from the drive itself. If you pay attention to that (and you're a little lucky), you can manage to salvage most of the drive's contents before it dies completely.
With an SSD one minute it's working completely fine and the next it's completely gone. While most of the data itself is probably still perfectly intact on the flash memory, getting at it is completely impossible (afaik) without going to a professional recovery service.
With a spinning disk, you'll usually get an indication of a problem with a plethora of S.M.A.R.T errors.
It's been my experience that when an SSD dies... you just suddenly appear to have an empty drive cage. It's a really ugly binary failure.
I've taken to building my boxes with mirrored SSD's combined with taking and validating my backups.
Yes Francis, the world has gone crazy.
I had a 2 GB memory card once. It had a day-one fault: one 512 MB block was dead. Windows could neither recognise this fault nor fix it. Instead, writes to the card were corrupted (obviously) whenever the faulty block was engaged.
I simply wrote a utility that marked the 'sectors' covered by the block as 'used'- and from that day on Windows played happy with a 1.5 GB card. But before my FAT hack, the card could lock up the PC, as the Windows stack just doesn't know how to handle failed Flash memory units.
So a 'dead' SSD drive is almost certainly recoverable with direct software access, but that's going to be a big pain-in-the-ass. The 'level' system that deals with real time cell failures and remaps data is going to need to be understood. All GOOD modern SSD drives write 'fast' cos they have a 1/10th size RAM cache on the drive (few know this). The REAL write speed of the flash memory is 1/2 - 1/4 of what is quoted.
PS when my simple SDCard had that block fail, I tried all the so-called recovery utilities, and all failed. Yet the problem was trivial (for me) to identify and permanently fix.
99% of so-called coders are grossly incompetent. Windows and the interface of SSD drives reflects this fact. So the point of the article is that a SSD drive is most unlikely to have the type of complete failure that renders a HDD 100% useless- and this is TRUE. But even so, what are the chances of finding tools written by good coders that can access the majority of still good cells on the SSD drive, and remap the data back to the desired file formats?
More people code than ever before, but they are mostly the monkeys trying to rewrite the works of Shakespeare on their millions of typewriters. Which is why the GARBAGE computer languages are so popular these days.
As a consequence, the first unpredicted system error of an SSD drive has the real possibility of rendering all the data inaccessible, even tho, as I said, the vast majority of cells are fine and can be read. What could be and what is are not usually the same thing.
It's bad firmware. Some of the drives can supposedly be resuscitated by the factory or people who have reversed the private ATA commands.
I mean, at a minimum unless it's a PHY failure (and there's no reason to suspect those) the firmware could at least report missing storage (I've actually seen a 0MB drive failure once or twice) but their usual failure mode is to halt and catch fire, as the author notes.
With the recent reports about the inexcusable security problems on Samsung and Crucial drives this is starting to feel like the old BIOS problems with Taiwanese mobo companies outsourcing to the lowest bidder and shipping bug-laden BIOS with reckless abandon. It's OK, all the world's servers only depend on this technology.
To be fair, I have a batch of 20GB Intel SLC SSDs that have never done this, but those are notable exceptions. At this point only low-end laptops like Chromebooks don't get at least a mirror drive here.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
As said before, shit happens. All electronics can fail and there will always be defects. Doesn't matter that it's not mechanical, it can fail. You are being too paranoid.
One of the benefits of SSDs is that they have a significantly longer MTBF than HDDs. They can also stand worse environments. However, when the ECC fails and the gates stop keeping in the electrons, there is no way to recover an SSD. When they fail, they fail hard.
This is what RAID and backups are for. It isn't if a drive fails; it is when. Don't ever count on drive recovery services. Especially with how relatively inexpensive backups are. Backups are not difficult. Veeam, Borg Backup, Arq, Time Machine, Windows Backup (wbadmin), and many more are available. At the minimum, CrashPlan.
Overall, SSDs have more advantages than disadvantages, especially newer ones. I wouldn't want to go back to spinning disks on the desktop or active use.
Just put your data in the cloud. Then, when it's gone or has been corrupted, you can ask the cloud provider what happened. No more troubleshooting unreliable hardware for this guy!
Stop buying shit SSD's. I've been using them for 7 years now and have not had a single failure. As for loss prevention, a good PC owner knows to always back up important files. You do this regularly, right? Oh you don't? Then it's your own fault.
Seven puppies were harmed during the making of this post.
>When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming.
Chris's dismay comes from believing that solid state physical devices should "in theory" perform like a deterministic idealized model.
Maybe because they're solid state? Even in solid state devices there is a great deal of movement and there always will be above 0K and especially with power cycling.
Flaws can reveal themselves long after initial testing. Quality varies between brands and even between production runs for the same brand.
Hard drives can have all sorts of bad shit happen to them too, with the controller board, processor, and RAM, that has nothing to do with the spinning discs. And you're just as much in the dark without a narrative.
I'm a retired IT guy and there's no kind of something that didn't fucking break. I'm not a goddam engineer. My job was to locate the problem at a black-box level and get the shit running again. Contemplating the "why" of a hardware failure is wheel-spinning instead of pulling the stuff out of the ditch.
For new purchases under warranty, I exchanged them and sent the dead one back to the vendor. Let them hook it up and do diagnostics over a cup of coffee.
I had work to do.
It little behooves the best of us to comment on the rest of us.
In the earlier days, when 'Intel' drives were regarded as the best/most reliable - We purchased 12. We fitted them to laptops. The performance vs spinning disk was next level. Every single drive was dead within 4 months.
This was followed by OCZ madness.
I'm frankly not over it yet. The fact that a power cut is still enough to terminate some models is bad juju. In terms of performance, I am convinced. In terms of an all-round good replacement for 'spinning rust', the jury is out for me.
When we make the majority of drives reliable, including in cases of a power outage, and get their failure rates down to the levels of spindle disks, then we can talk. Unless speed is required, I still have a leaning towards using spindle disks over SSD :/
-- All drive failures get mitigated by actions like RAID and taking backups. But I've been around. Many times a bad drive has been coaxed into a workable state long enough to pull data off it, rather than just sudden blackness. As the author states, he is not alone in disliking the sudden, without-warning nature of SSD failures.
And far as I see this is being compounded by cheap'ness' being applied in flash with die shrinkage and production leaning ever more towards components that die faster, and have less life, mitigated by clever 'wear levelling' firmware.
Doesn't know how SSD's work.
No offense to CS majors, but this EE major tends to understand "How a computer works" at a lower level than most of you programmer types. While not universally true, in my experience a Computer Science major generally gets outside their comfort zone with hardware once you get past "Plug it in and turn it on." I don't blame them, there is a lot of stuff happening at lower levels than a CS major needs to know to do their job.
That some CS major is concerned about how SSD's fail because he doesn't understand their failure modes is fine. We tend to fear what we don't understand and let's face it, there is a LOT of stuff going on inside a computer that high level users simply don't need to know. Heck, even I don't need to know some of that stuff and I've designed computing systems in the past. Fear not, if it works, it works, if it doesn't you just replace it anyway.
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
Yes, you can listen for mechanical issues, yes you can (sometimes) read bad block and other SMART data. But, ultimately, without millions in equipment and skills, you just do not know. It is a cheap data storage brick. Choose one appropriate for your capacity and I/O needs, have a good backup plan in place, and quit whining.
Silence is a state of mime.
..and the more complex a machine is, the more that can go wrong with it.
The controller PCB on a brand-new modern HDD can fail, rendering the entire device useless; any piece of silicon on a modern SSD can fail also, rendering the entire device useless. The only difference here is that with a HDD, if you happen to have another working drive of the exact same model and revision level, you could theoretically swap the controller PCB and be able to access the data on the platters again (I've done this). With an SSD it's all one PCB, and short of actually diagnosing the failure and replacing the failed component(s), you have a snowball's chance in hell of accessing the contents of the flash memory.
There's no point in worrying about it, though. Back up your important data and forget about it. If the system in question is mission-critical and up-time is essential, then use two SSDs in a mirror set, and don't worry about it. If someone is going to get their head lopped off if there's any chance of the system in question failing due to SSD failure, then mirror your mirror-set to another mirror-set (i.e. use 4 SSDs) and back the whole mess up to an off-site location regularly. Sitting around biting your nails down to the quick isn't going to help anything.
Maybe it helps the author to develop a narrative, but the long and short of it is, the author's non-volatile storage unit died, he needs to replace it to get the system back and he can send it back to where he bought it from because it died under warranty. Or, he might want to have it destroyed locally if it contains proprietary information.
If you're in IT, I'm sure you'll see everything eventually break (including things like cases which don't make any sense at all) so why sweat it?
I still don't trust SSDs. I'd rather make a virtual RAM Drive over using them. It simply isn't worth the hassle when they fail. They've proven to still be dodgy even today, never mind the shitfest that was early SSDs. Holy corruption hell Batman.
HDDs are good enough for most tasks. Demanding tasks can be done in RAM quite easily, and better, than SSDs.
What, you don't have at least 32GB of RAM? Probably because you spent it wastefully on a shitty SSD listening to hype. All your data is a guinea pig to them.
Sure hope you have backups.
HDDs are trivial to repair if the logic failed. Replace the board with one from another drive. It's harder to find the board separately, but this is why you always buy drives in pairs. A mechanical failure is less easy, but the data is still accessible if you are capable of removing the platters and putting them in another drive. That requires more effort and knowledge to do properly, but it is doable.
Or, just have backups, and restore the backup to a new drive. Cannibalize the dead drive's components as spare parts, optionally sell them, ditch the rest.
SSDs are a whole load of FUCK YOUR FILES if they fail. True even for specialists. It also requires way more effort to fix for said specialists.
Disk platters are pretty simple to work with. SSD electronics and specs are all the fuck over the place.
Until these idiots standardize everything, I'm going nowhere near them. I class them as more dangerous than IntelME in terms of destruction (to files and the mind) they can cause. Hell, even Spectre-class bugs.
Again, just not worth the hassle.
Despite what others have said, this comes down to the brick wall nature of error correction codes. Every time you erase and rewrite a flash cell, you add wear to the transistors that make up the memory cell. Eventually (and probably immediately too) some of the bits won't read correctly. To compensate for this, the controller runs a mathematical function on your data, allowing it to recover from a certain percentage of bad bits. This is good, as that combined with wear leveling allows it to run a long time. However, once it hits that percentage, it's like hitting a wall and it can't recover.
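The "brick wall" can be seen with a little arithmetic. A rough sketch, under assumptions that are mine rather than any real controller's: a hypothetical 4096-bit codeword, a code that corrects up to 8 bit errors, and bit errors that occur independently. Real drives use BCH or LDPC codes with different parameters.

```python
from math import comb


def p_uncorrectable(n_bits, t_correctable, raw_ber):
    """P(more than t bit errors in an n-bit codeword), assuming independent errors."""
    p_ok = sum(
        comb(n_bits, k) * raw_ber ** k * (1 - raw_ber) ** (n_bits - k)
        for k in range(t_correctable + 1)
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - p_ok)


if __name__ == "__main__":
    # Hypothetical codeword size and correction strength, for illustration only.
    for ber in (1e-4, 1e-3, 5e-3):
        print(f"raw BER {ber:g}: P(uncorrectable) = {p_uncorrectable(4096, 8, ber):.3g}")
```

Under these assumptions the chance of an unreadable codeword goes from negligible to near-certain as the raw bit error rate rises by well under two orders of magnitude, so the drive looks perfectly healthy until very shortly before it does not.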
...si hoc legere nimium eruditionis habes...
I still feel Flash is a flawed technology because it can wear out. With both computers and electronics in general, a "worn out" chip just doesn't happen. If a chip is dead, either it was exposed to excessive voltage or its supplied cooling apparatus failed.
I doubt that most home PC users have both the case space and the cash for a RAID. A user of a mainstream laptop sure doesn't.
Reports from the ISS are that 9 out of 24 SSD drives failed in an HP supercomputer they'd brought up there. Quite scary how fragile those things are from radiation.
Design for Use, not Construction!
That happened to me three or four times already. They die without warning. No SMART indication, nothing. It really pisses us off. Someone needs to technically give us some kind of anticipation. Maybe SMART is not supposed to work well with SSD after all.
One thing I like about spinning disks is that a lot of times the failure is gradual. Bad sectors and such and you have the opportunity to grab data off the drive (noting, you really should have backups).
With SSD, whatever the issue, it's more like losing a controller board on the drive, everything dies and ceases to operate.
So... I'll go along and say SSD is "better" and more "reliable", but when it dies, it dies hard. Just the way it is. (not talking about performance degradation... speaking about failure)
Do you get this anxious when a RAM module fails? There really is no difference between a RAM module failing and a SSD failing...
Just make sure that you have backups....
Improper handling of ungrounded components really can mess them up. They work but are defective. Take a look at some micrographs of ESD damage sometime. ESD does not always kill a part; it maims, sometimes only slightly. Anti-static mats and wrist straps are no laughing matter. Okay, they are. But use them anyway.
"No fear. No envy. No meanness." Liam Clancy
Most of the time heat kills electronics. Either they get too hot and something fries, or they suffer thermal fatigue.
One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".
Ever touched a real computer rather than just read about them in textbooks? CPUs don't have moving parts but they fail often. Usually it's due to heat (whether too much heat or too much variation). What would boggle my mind is if they made a PC part that never broke...that would change my fundamental view on the universe.
I had a Sandisk USB stick recently go read only. I had been using it as a hypervisor boot drive and the boot was crashing. When I inspected it, it was read only and any attempts to format it, diskpart it, fdisk it failed with some kind of error. I looked it up and apparently this is the designed failure route for these USB drives. When the controller detects an inconsistency or uncorrectable error, the drive is locked from writing so you can get data off of it.
SSDs really are unpredictable timebombs, so act appropriately - take frequent backups and use RAID if the downtime from a sudden SSD failure with zero warning is unacceptable. Any IT department that hasn't been prepared for the nature of SSD failures since long before they were available off the shelf was doing it wrong anyway.
I'm most worried about what SSDs mean for the Average Joe, whose data is largely protected by the predictability and recoverability of most hard drive failures. SSDs throw all of that out the window and lure them in with the warm glow of performance like moths to a flame. Average Joes need a real wake-up call on the importance of backups with the switch to SSDs.
"When information is power, privacy is freedom" - Jah-Wren Ryel
The layer that holds the charge in an SSD is only nanometers thick.
We rely on the fact that we can trap electrons in a floating gate with quantum tunnelling.
We *know* it has a finite lifespan. The floating gate can withstand only so many programming cycles.
We build in redundancy.
But obviously, not every sector on the SSD can be checked for its level of wear.
Defects on the die, latent defects not detected, materials imperfections, fab variations, WILL cause SSDs to die.
Always follow the 3-2-1 backup rule.
3 copies
2 different types of media [ think HD, tape, or DVD, Blu-ray, etc ]
1 copy off-site
the SSD will fail.
the HD will fail.
DVDs will become unreadable.
Tapes can and will be demagnetized
Add encryption, because let's face it, people will steal stuff.
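The 3-2-1 rule above is simple enough to check mechanically. A small sketch, with a hypothetical `Copy` record and media labels invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    label: str
    media: str      # e.g. "ssd", "hdd", "tape", "optical" (labels are illustrative)
    offsite: bool


def check_321(copies):
    """Return a list of violations of the 3-2-1 rule; an empty list means it is satisfied."""
    problems = []
    if len(copies) < 3:
        problems.append(f"need 3 copies, have {len(copies)}")
    if len({c.media for c in copies}) < 2:
        problems.append("need at least 2 different media types")
    if not any(c.offsite for c in copies):
        problems.append("need at least 1 copy off-site")
    return problems


if __name__ == "__main__":
    plan = [
        Copy("laptop", "ssd", offsite=False),
        Copy("NAS mirror", "hdd", offsite=False),
        Copy("tape at the bank", "tape", offsite=True),
    ]
    print(check_321(plan) or "3-2-1 satisfied")
```

Note that a RAID mirror in the same box would count as one copy here, not two: both halves die together in a fire, a theft, or a bad `rm`.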
Who the hell cares? Replace it and restore your data.
The data on a failing drive might be a newer version than the most recent weekly backup. I see value in backing up the newer version elsewhere as the first part of replacing the drive. But SSD failure modes allegedly make this newer version inaccessible sooner than HDD failure modes.
Could be worse. At a previous job, I've had someone demand "7200 RPM SSDs", and no amount of explaining could change the person's mind.
Nod your head, and say, "Yes! I'm on it!"
I was called on something like it ONCE. I was asked, "Why did you bullshit the guy?"
After explaining what happened and what he said, I asked, "So, next time I should refer someone like that to you?"
I got a disgusted look and the boss walked away. And I got a great performance review 7 months later, too!
Don't argue with ignorant people who refuse to listen. You get nowhere. Unfortunately, that's most people. - and I'm including myself. In this over-hyped marketing society, I've become cynical. I won't argue, I just think everyone is full of shit until proven otherwise.
So you can have peace of mind:
If it dies suddenly, without warning, it's 1) buggy firmware (I think this is by far the biggest culprit), or 2) bad components/soldering/cleaning on the PCB, or 3) a really dumb controller that isn't doing wear leveling on every single thing (think the master index), so when a critical flash cell dies the entire thing is dead even though there's plenty of good flash left (this was common with crappy little 'SSDs' that were just Compact Flash), or 4) a badly designed controller that leaves the drive in a bad state when power suddenly goes out and can't recover.
If it sloooows down and starts getting more and more sluggish you've lost enough flash cells that the wear leveling is losing its capacity to cope. Take some stuff off the drive to give it some breathing room and prepare for its demise. I had this happen with one of the original Intel SSDs (the X25-M). It took ten years of continuous use, though - yes, just this year.
SSDs have a bunch of tiny wires. When you push electricity through wires they heat up, they're not perfect super-conductors. If you heat it up too much, it will of course burn, but they avoid that. Still, heating up a wire over and over will have some wear and tear. For big thick power-lines in houses, this doesn't have too much effect, but for tiny precision electronics, it builds up. And SSD's have a LOT of those wires with a little bit of manufacturing variance which makes some parts fail sooner.
They burn out the same way lightbulbs burn out. They don't have moving parts, right?
Of course you can still access it!
You just transplant the controller from a working drive.
I’ve seen him do that. Anyone who can solder chips, can do it.
(It's only problematic, if the controller itself has internal permanent storage that keeps some state, like of wear leveling. But with the entire thing being a storage device, I don't think anyone is stupid enough to do that.)
What we want to know, are the physical processes that make chips fail. I'm sure somebody with an actual clue, like from an actual manufacturer instead of a /. armchair "expert", could tell us quite a lot about that.
Because humans (at least the nowadays rare self-thinker) don't like operating in the dark. I want to know what I can do to 1. avoid harming the drive, and 2. detect failure early.
I'm pretty sure you could detect it with a high-resolution heat camera microscope, pointing at the structures. If hardware fails, the best way is always to look for where it’s hot when it shouldn’t be. But I want that built in.
Garbage upvoted to 5- typical slashdot.
With a HDD, one has an electro-mechanical mechanism with many points of total system failure. Motor goes, anything to do with the heads go, circuit board goes, and the entire thing is down.
An SSD drive is closer to a SINGLE chip in concept. How many PCs fail due to CPU failure? 0.0001 percent. Yet the CPU is by far the most complex part.
An SSD drive is not one chip, but the chips for the flash, RAM (yes, most SSD drives have a 1/10th RAM cache) and interface are very reliable compared to an electro-mechanical device, and most of the faults that can happen do not break the entire system.
So SSD drive failures ARE 'mysterious' in a way most HDD failures can never be. Silent and 'puzzling'. But the answer lies with SOFTWARE.
When an SSD drive suffers certain memory fails (including standard flash cell fails), the software that drives the SSD drive may become terminally confused. In reality the vast majority of the SSD drive is still fine, and the flash still accessible. But the OS and driver software is atrociously written for robustness, and so essentially writes the SSD drive off.
The issue is that the path to the reconstructed data on the SSD drive goes thru many software layers. The OS will give up at the slightest confusion. Specialist tools that know how that particular SSD drive works could trivially recover most of the data from the SAME computer. This is NEVER the case with an HDD, which needs to be stripped down and connected to specialist hardware tools- and even then the HDD data may be unrecoverable.
But notice how many know nothing DRIBBLERS have their useless input voted up.
A SOFTWARE fail is not the same as a hardware fail. And while the EXCUSE for an SSD fail is initially a trivial hardware fail (like the wrong cell failing), the real reason an SSD drive becomes useless is 99% down to bad software. For it is always possible to make a system robust to ANY cell fail.
And NO- level wearing and other nonsense does NOT address this issue. Only certain statistical cases are caught and mitigated by these mechanisms. There are MANY types of predictable cell fails that will brick many SSD drives.
The BEST solution would be to drop ALL software mechanisms on the SSD drive itself, and allow the flash to be FLATLY addressed by the OS. The OS would then take full responsibility for detecting and remapping cells as they fail. This way the SSD drive could see increasing loss of capacity across usage WITHOUT catastrophic failure.
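To make the proposal above concrete, here is a toy model of host-managed flash in which the host keeps its own logical-to-physical map and retires blocks as they fail, so capacity shrinks instead of the device bricking. Everything here (the class, its methods, the assumption that a failing block can still be read once before retirement) is hypothetical and ignores erase-before-write, ECC, and every other detail of real NAND.

```python
class HostManagedFlash:
    """Toy model: the host tracks bad blocks itself; capacity shrinks as blocks fail."""

    def __init__(self, physical_blocks):
        self.good = list(range(physical_blocks))  # physical blocks still usable
        self.map = {}                             # logical block -> physical block
        self.data = {}                            # physical block -> payload

    @property
    def capacity(self):
        return len(self.good)

    def write(self, logical, payload):
        if logical not in self.map:
            used = set(self.map.values())
            free = [p for p in self.good if p not in used]
            if not free:
                raise IOError("no free blocks left")
            self.map[logical] = free[0]
        self.data[self.map[logical]] = payload

    def read(self, logical):
        return self.data[self.map[logical]]

    def retire(self, physical):
        """Mark a physical block bad and move any data it held to a free block."""
        self.good.remove(physical)
        for logical, p in list(self.map.items()):
            if p == physical:
                payload = self.data.pop(physical)  # assumes one last read succeeds
                del self.map[logical]
                self.write(logical, payload)       # raises IOError if nothing is free


if __name__ == "__main__":
    dev = HostManagedFlash(4)
    dev.write(0, b"hello")
    dev.retire(dev.map[0])
    print(dev.read(0), dev.capacity)
```

The sketch also shows where the idea gets hard: if the retired block cannot be read that one last time, or no free block remains, the data is gone regardless of who manages the map.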
Again, it is VANISHINGLY unlikely that a 'bricked' SSD drive does not allow access to most of its memory cells. But a BRICKED HDD can never be mitigated by software on the PC side- no PC software can 'repair' the broken motor, heads, or driving circuit board.
PS I actually have first hand experience in this field. 'Bricked' memory cards that would actually crash windows, yet could, with low level code, be directly read, bypassing the problem- a simple faulty block of flash. How I wish you could do this with a bricked HDD. .
Metal migration limits the lifetime of the interconnect in ICs. Absolutely a wear mechanism.
I'm going to disagree with the people saying that spinning disks don't give you a warning of imminent death. A bad spindle will start whirring, and steadily get louder, and my experience has been that most drives go that way. Hence, the old trick of sticking the drive in a freezer to get a few minutes more life out of it (because, you didn't keep your backups updated....again. :-(
This is a phenomenon that should always be kept in mind when switching from mechanical to electronic systems. The electronics are usually MORE reliable, in the sense that they are less likely to go belly up, but WHEN they do, they won't give you any warning. I could arguably make my home-built airplane MORE reliable and feature rich by replacing the flight controls with a fly by wire system. But, one day a gate in one of the processors will fry itself, and the whole system will quit working at once. Woe unto me if I'm at altitude at that point. The mechanical system will require more maintenance, but it will slowly wear out over time, controls will get sloppy, and exhibit more play. That is the system telling me, "I'm getting kinda tired here. I'm getting old, y'all. Replace me. Screw it. I quit." It gives warnings to the operator that knows what to listen for.
So, the article does have a point. . . sort of.
Aah, change is good. -- Rafiki
Yeah, but it ain't easy. -- Simba
God, you think this person so thick he doesn't know his running shoes will eventually fall to pieces if he never changes them?
IT has tools that SHOW the %loss of cells, so any SSD driven to disaster is no mystery to anyone when it fails.
These are the fails when the drive does NOT have significant loss of capacity. And these fails happen cos there are cells, and timings for certain cells, when a cell fail spells total system failure. And this does NOT have to be the case, but is down to crap software. Crap file and error recovery that has states that cannot be handled.
It would be TRIVIAL to improve the software and eliminate these fail modes, but would need software engineers that knew what they were doing. The 'race' to ever cheaper SSD storage is running ahead of excellence in DOABLE engineering.
Doesn't know how SSD's work.
No offense to CS majors, but this EE major tends to understand "How a computer works" at a lower level than most of you programmer types. While not universally true, in my experience a Computer Science major generally get's outside their comfort zone with hardware once you get past "Plug it in and turn it on." I don't blame them, there is a lot of stuff happening at lower levels than a CS major needs to know to do their job.
That some CS major is concerned about how SSD's fail because he doesn't understand their failure modes is fine. We tend to fear what we don't understand and let's face it, there is a LOT of stuff going on inside a computer that high level users simply don't need to know. Heck, even I don't need to know some of that stuff and I've designed computing systems in the past. Fear not, if it works, it works, if it doesn't you just replace it anyway.
This ^^^. I had a brilliant CS college roommate. But when he built his first computer himself, the motherboard was held to the case with one screw. He couldn’t figure out why it was crashing all the time. Everything in the machine was barely in their slots/socket. This is back in the Pentium days. Days of VLB and very early AGP. And sometimes IRQ switches.
" in my experience a Computer Science major generally get's outside"
Yup, I can believe you're an EE. While you go on and on congratulating yourself almost as hard as a doctor, you can't even tell the difference between GET IS and GETS.
On new SSDs, failure could be a die bond failure, an intermittent defect that allows it to pass inspection and then fail. Or a ball bond to PC failure that can be intermittent as the package, solder ball, and PC change dimensions due to different thermal expansion coefficients. The tiny contacts on the PC versus relatively huge contacts on the mechanical hard drive make these happen more often on SSDs.
On older SSDs there could be degradation of the ability to hold or modify the stored charge that represents a bit. Not likely unless you are a heavy duty user. Or metal migration from the mask layer, or metal migration at the bonding level where physical aluminum or gold wires are bonded to the actual chip. Less likely is bonding failure to the underlying substrate, as the wire material used is chosen for high compatibility.
Now Chris, feel better knowing just a few ways you can envision the failures?
- Tjp
I am in wallow with my inner money grubbing capitalistic pig. ... Oink!
Backup your data frequently. Stop worrying. Is that so hard?
Just imagine the unicorn in the drive died.
It's about as accurate as what you imagine happened to the spinning disk.
Obviously.
Not that some vultures wouldn't jump on it anyway.
So it's no great surprise that he blathers bullshit about hard drives also, the lying faggot has zero integrity.
I had a 4 tb spinning drive fail, after only 2 years. It was 75% full. That is what is scary to me. The only narrative I came up with to explain it was that it was in my system, but powered on, 24x7. Now my backup drives are external and I power them on when I need them.
As drives get bigger, that is when I get nervous. I know, there's options to mitigate that, but I'm on a budget. I just migrated my OS to an SSD a couple of months ago, and still have spinning drives holding everything else.
My beliefs do not require that you agree with them.
Wow. I think what the OP has behind the complaint is that SSDs don't have the difficult-to-instrument hardware failures of HDDs, so why do we suffer the consequences of "unknowable" SSD death? They're HW, but they are solid-state. Sure you might get tin whiskers or solder failures, but why the heck do we put up with "it died"? Heck, these things should be able to know intimately that the resistor at R63 is the likely culprit for mistimed read signaling or whatever the failure point is.
It's as if you'd need a computer to attach to a special connector to be able to diagnose these or something.
Duh, I want my computer to print out a label with the manufacturer's address on it with the exact stinking location of the failure on it so I can send it to be repaired and I get it back in three days. Problem solved and ditch the whiny IT staff.
I'm guessing the author never lived through the era when there were a lot more companies in existence for mechanical HDs than there are now. HDs can spontaneously die from a failed motor, electronics failure or catastrophic crash. Some small companies went completely under and were swallowed up by larger manufacturers due to massive defects. SSDs have gone through the same era as well with buggy firmware. Generally speaking though, if you stick to the big manufacturers like Samsung and Intel the chance of fatal issues goes down a lot. That said, an SSD is not a guarantee of safe data. They're far more reliable but circuit failure or static electricity can kill SSDs. Besides, SSDs won't save you from an accidental erase all.
Donald? Is that you?
I've fallen off your lawn, and I can't get up.
Just chiming in: my Crucial M4 128GB (Micron) drive also died on me 2 months ago after very mild use since February 2013. It was my OS drive in a Windows 7-10 desktop which I mostly used for 3-5 multiplayer games through the years, or the odd media consumption. It was a machine that was on about 1/20 of the entire 5 years and 8 months.
There's another problem I've found with SSDs in addition to their failures occurring with no previous warning signs. That is that the process of obtaining warranty replacements can be terrible.
Perhaps because hard drives were expected to fail, manufacturers put procedures in place (such as "Advance" RMA) to ship a replacement very quickly. This is important when, for example, you have a single-drive failure in a RAID configuration that can only tolerate losing one drive.
My experience with obtaining two warranty replacements on Intel M.2 SSDs has been really poor. In each case the replacement drive took so long to arrive I had to purchase a replacement drive in the meantime.
You can get the best of both worlds by setting up a RAID of both an SSD and a platter drive! :-P
The most likely reason is a firmware bug that caused enough corruption that it can't even low-level format. If it were a prototype that a developer could diagnose, it would be easy for them to patch it and get it going again. But without that specialized environment your SSD and the data on it are trash.
In many ways I think I would have preferred the raw NAND systems like SmartMedia (now obsolete), where the host had the real brains and the media was as primitive as possible. SmartMedia formatting was about conforming to a software standard on the host side and was managed by a driver. A real driver that could be debugged with ordinary tools, not some obscure firmware embedded in a device.
“Common sense is not so common.” — Voltaire
We have found with mechanical, SSD, and NVMe drives that there are points of failure that we can detect, and there are points of failure we can't. In most cases where an unpredictable failure occurs, it is at the power source, and is mostly indicative of voltage irregularity in our tests with bad drives of these 3 types. While we'd like to think that new hardware will hold up to a degree of its certified life span, voltage as a whole to power said hardware will almost certainly add the anomalous layer for a margin of error from minimal to catastrophic.
The chips store data in a capacitor.
The capacitor is connected to (or is) the gate of a mosfet so the state can be read.
To charge or discharge the capacitor, electrons must be forced over the insulation layer that stops the capacitor discharging on its own.
Every time that happens the insulation breaks down a little. Once it's all gone, the cell can no longer store data.
It's a gradual process that happens every time a cell is written to or erased. SSD's wear out as they're used, it's how they work. You should treat them as a consumable.
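A back-of-the-envelope sketch of the "treat them as a consumable" point. All the numbers here are assumptions for illustration only (a 128 GB drive, 3000 program/erase cycles per cell, 20 GB of host writes per day, a write amplification factor of 2); real drives vary widely and the function name is made up.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, host_writes_gb_per_day,
                             write_amplification=2.0):
    """Rough wear-out estimate: total data the flash can absorb / daily writes.

    Assumes perfect wear leveling, so every cell sees the same number of
    program/erase cycles. Write amplification is the ratio of flash writes
    to host writes.
    """
    total_host_writable_gb = capacity_gb * pe_cycles / write_amplification
    days = total_host_writable_gb / host_writes_gb_per_day
    return days / 365.0


print(round(estimated_lifetime_years(128, 3000, 20), 1))  # 26.3
```

Under those assumed figures, cell wear alone would take decades, which is why an early sudden death points at something other than the insulation wearing out.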
Or something randomly broke, like a solder joint from thermal cycling or something.
You are slowly perforating the gate oxide when erasing nand flash, but most chips give an indication of the read quality on write, so you know when the cell starts to get risky to use.
... I can say I never "abused" it. Never defragged it or otherwise thrashed it needlessly, so I'm a tad sad to lose the thing. And surprised by the suddenness of it, in the middle of playing Fallout.
Dagnabbit.
As with other posters, my SMART checks never disclosed any potential or actual errors in the SSD.
They oughta make some sort of warning inbuilt. Make 'em scream if they hurt, like HDs do. That grinding unhappy-drive sound the mechanical ones produce is my suggestion.
OTOH, it managed to survive long enough that I could replace it wif an EVO. It's all good, I 'spose. Hrm.
See SSD Failures in Datacenters: What? When? and Why?.
Failures include retention errors caused by leakage current, which worsen with time when not acted upon. Second, they also suffer from phenomena such as read disturb and program disturb errors, where a read or program of a row or block of cells affects the threshold voltage of untouched cells in its vicinity. In short: data retention, program disturb, read disturb, endurance, and power faults.
Flash controllers have proactive and reactive mechanisms in place, to prevent the flash error propagation to higher levels in the system stack. Consequently, not all of the above-mentioned failures propagate to upper layers. But, ones that do propagate can result in fail-stop failures.
I spent 3 years on a "deep dive" into EE basics, analog circuit design, then microcontrollers, and it really improved my software development a lot.
I don't think this is a natural blind spot in CS, I think it is just manufactured ignorance by dividing the fields in an unrealistic way. Which seems to have happened during the rush to train workers during the .com boom, so maybe it wasn't even thought out at all.
Does anyone really know why a spinning disk dies? Sure - maybe if the last operation was "dropped laptop down stairwell"
A narrative over what went wrong?! Whenever a HDD failed a light came on the RAID array - and I'd find a package from FedEx on my desk at 9AM with a replacement disk in it. As for personal computers - the drive stops working and you lose data.
What is there to think about?
I do agree about the "timebomb" thought. I know that SSDs just give up the ghost. On a HDD many times "check disk" starts reporting a high number of failures and you can be prepared...except when the head falls off the arm. That's a rather rapid failure.
SSDs have a write-lifetime that I can't predict. A HDD goes until it doesn't work anymore. In both cases you break out the backup tapes.
If you begin to notice vibration from the SSDs then you know they are near the end of their life.
I'll see your senator, and I'll raise you two judges.
We set them up in either a RAID or EC configuration or some other redundant configuration, so that the operations department can swap them out when they fail without downtime.
Unless we start to see an unusually high number of failures, we don't care.
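For anyone wondering why a redundant setup makes a single dead drive a non-event, here is a minimal sketch of single-parity reconstruction (the idea behind RAID 5 style parity, not any particular product's on-disk layout). The three "drives" and their contents are made-up sample data.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data drives plus one parity drive (sample contents are arbitrary).
data = [b"disk", b"data", b"here"]
parity = xor_blocks(data)

# Drive 1 dies without warning: rebuild it from the survivors and the parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'data'
```

Because XOR is its own inverse, any one missing block can be recomputed from the rest, so how the drive died matters much less than how many die at once.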
L'Idiot
Well, I do think it's natural for CS majors to be a bit farther away from hardware. Let's face it, much of their work these days doesn't really care what operating system they run on much less the hardware it's actually running on. I don't blame them, really the state of programming has evolved away from hardware dependence, and that's a good thing.
While I understand hardware details of what's happening behind the programming model seen by the CS guys and gals, and I believe that I have a different perspective when doing software development, I'm not sure they would benefit all that much. Programming Java is pretty hardware agnostic anyway, C/C++ a bit more specific (assuming you have the libs and compiler), but still largely portable unless you are handling actual hardware or kernel level stuff. My hardware knowledge really only serves to make me more aware of performance implications of my choices perhaps, but the CS folks do just fine with most higher level languages.
So I don't agree, CS folks really don't need to know all the same stuff I do to program. It used to be true, it used to be valuable to understand what the hardware had to go through, both to be able to optimize your code for performance and size and get it to do what you wanted. However, with the advent of the higher level languages, most CS folks don't interact with the hardware anyway, but with abstract programming models like the JREs, which for all the world look identical regardless of the hardware being used.
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
Why does the OP need a "story" (he actually keeps repeating that word)...? Reminds me of millennials grocery shopping.
I've been told (in all seriousness) that when millennials go grocery shopping, they want to know the backstory of everything they purchase - and not just what you'd think of asking: whether that chicken lived on a free-range farm, whether pesticides were used on those tomatoes, GMOs etc but rather, they want to read some story about the actual farm, the animals that live there, an anecdote from someone who works there--that sort of bullshit.
WHY?
I blame the autocorrect software.
This is theoretical, an attempt to understand why this sysadmin is "unnerved". Well, that's a psychological reaction, and it needs a psychological explanation.
The key, I think, is that he is a Unix sysadmin. In that world everything is a story. I've commented on this in the past, the bizarre naming of Unix commands. The only way to explain commands like "ls" and "grep" and "sudo" is to invoke the culture of the story. I'm discounting the command line interface, keyboards, and the virtues of short commands in this. While that is true enough, it also fails to fully explain the quixotic commands themselves.
So every Unix command has a story about how that command got its name. And people are very attuned to telling and learning stories. However the problem with that system is that you simply have to spend the time required to hear and learn all those stories. Guessing at commands is ineffective in Unix.
Thus this guy learned Unix by stories, and he needs a story to explain each SSD failure. It's right in line with the Unix sysadmin culture.
Now, there's a not-Unix specific aspect to this whole thing too. It's undesirable systems design, regardless of any other factor, to have sudden failures. Failures with a warning are a much better systems characteristic (assuming that failures are inevitable and not preventable). Traditional HDDs have a better track record of failing with signs that something was going wrong, preceding total failure.
If I wrote something like that, I would not attach my name to it. He's basically asking to be fired from his job. Hard drive failures are something everyone has dealt with for decades. What if a bus driver said that he/she had a hard time navigating in traffic, but other than that, they were great?
After looking into his homepage and blog, I can't help but get the impression this guy slacks off a lot, while thinking a lot of himself. The following parts struck me as interesting:
please do not send him unsolicited mail touting your good deals or your good cause; he will just become irate
I take that to mean he loves to pontificate, but doesn't want to hear anyone else.
His current amusement is to have as many home pages around the University of Toronto as possible; he will let finding them be your amusement.
If I told any employer my favorite pastime was to waste their time (and thus money), I could not imagine having a job much longer.
Then again, that home page was written 22 years ago. Maybe he's matured in the meantime, but I doubt it. Someone at the same job for 2 decades, who still doesn't understand basic stuff like hard drives, probably hasn't improved much. Those sorts of people tend to do the absolute minimum possible, at all times.
Disclaimer: I've known Chris since we were CS undergraduates together in the 1980s, and we currently work together in the CS Department in Toronto. It may seem a bit odd to some that a hard disk failure isn't unnerving but an SSD failure is. That's because one of a good sysadmin's skills is properly focused anxiety, used to motivate a mental model of how things can fail, and what to do about it. Data storage is a key part of this mental model, since data access loss, or even worse, data loss, is a major risk. That's why it's helpful to know how disks work, how they behave when they fail, and how likely it is for such things to happen. Chris has a few decades of experience in dealing with disks. SSDs take the place of disks, and they store stuff just like disks do, but they work differently, and they behave very differently when they fail. In particular, SSDs often don't seem to give any indication that things may be wrong: one moment all is well, the next moment, all is dead. So instincts honed over a few decades of experience with hard drives don't apply. Of course Chris (and we all) will develop new instincts as we get more experience with SSDs. But in the meanwhile, it's indeed unnerving. And no, this isn't some sort of profound insight. It's merely an observation. Many experienced sysadmins, I think, will "get" this. People newer to the field might not. That's OK.
That's the crux of the article. I should care why? He's a technical guy, he knows about memory. He just refuses to apply his knowledge to get rid of his paranoia. This guy's nothing but a low-level conspiracy theorist.
As I wrote in one of my books: "They're all alike. Conspiracy theorists. They'd rather live in a terrifying fantasy world than the real one."
With this admin's angst on full display over a piece of hardware, I worry about his ability to handle a malicious attack. Going to cry for Mother(board)?
Sandforce controllers self-brick at the first sign of trouble to prevent competitors from reverse engineering their controllers. Or at least that is the reason stated for their crappy design. IIRC, Intel developed a customized version that has better failure modes.
This is the uncanny valley in which the world of REAL slowly sinks, sinking...sunk into the technological relative world of NOW.
There is no bridge between. You stand stranded on the shores of reason while the world in which you live sinks away, out of sight and out of mind.
Millennials know the futility of questioning the NOW, it's irrelevant to wonder 'why?' Just BE now!
Enlightenment as to why, what went wrong - much less how to prevent bad things is not among possibles. Shit happens!
... are Cosmic.
I'm not sure what they call Computer Science these days, but my bachelors had a required digital design component. We started by wiring together transistors to build a gate. When you'd demonstrated that, you could use 74HC00s, and you had to build an adder. When your adder worked, you were allowed to use an ALU chip. You had to set the thing up with supporting logic and DIP switches and invent a machine code to demonstrate instruction processing and register transfers.
In the compiler class we started out by writing a simulator for that hardware, then an assembler, then a compiler.
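For those who never did that lab, a rough sketch of the adder stage in software. The 74HC00 is a quad 2-input NAND part, so everything below is built from a single `nand` function; the function names and the 4-bit width are my own choices for illustration, not the actual coursework.

```python
def nand(a, b):
    """One 2-input NAND gate on bits 0/1."""
    return 1 - (a & b)


def xor(a, b):
    """XOR built from four NAND gates."""
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def full_adder(a, b, carry_in):
    """One-bit full adder using only NAND gates; returns (sum, carry_out)."""
    partial = xor(a, b)
    total = xor(partial, carry_in)
    # carry_out = (a AND b) OR (partial AND carry_in), expressed with NANDs.
    carry_out = nand(nand(a, b), nand(partial, carry_in))
    return total, carry_out


def add_4bit(x, y):
    """Ripple-carry add of two 4-bit values; returns (4-bit result, carry)."""
    carry = 0
    result = 0
    for i in range(4):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


print(add_4bit(5, 6))  # (11, 0)
print(add_4bit(9, 8))  # (1, 1)
```

The second call overflows four bits, so the result wraps to 1 with the carry set, the same thing you would see on the carry LED of the breadboard version.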
Make it reason enough to not purchase them.
I've had plenty of Intel 5xx series "die" on me. I have one now that is BSODing 5-6 times a day. Intel SSD Toolbox Full Diagnostic says it's A-OK! I've RMA'd so many of these, I already know what the techs are going to ask for during the RMA games. For SATA SSDs, so far no issues with Crucial MX300 or MX500.
Because yes I am your God, man.
"No fear. No envy. No meanness." Liam Clancy
I was actually thinking that if they had more understanding of the hardware, they'd have a better idea what the layers actually are, and they'd end up with more portable code not less portable code as you seem to imply. Knowing about how hardware works helps to be more hardware agnostic, because if you're using intermediate layers with no idea of the hardware and OS coupling that it creates then you'll do it more often.
Yea, I see what you are saying, but remember they are stamping out CS degrees with little more than Java and Database Skills. The whole point of Java was to let you ignore all that hardware stuff through abstraction layers anyway. Most of them don't need to know how to dig through all those layers to do what they need, and with Object Oriented concepts, hardware is becoming trivia to them.
But I agree, a bit of understanding of hardware is a good thing, especially when you start talking recursion and how pointers/references are actually working. I've always been amused at the BSCS holders who didn't understand what the call stack was or how they were killing performance with all the objects going in and out of scope, or why the math was being done using integers when they wanted floating point (or vice versa). I just don't know if they have the scope in an undergraduate CS curriculum to throw that stuff in. Many won't need it, use it or remember it anyway.
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
I have been using my 2010 MacBook pretty hard as an over-the-air DVR and moving lots of files off of it to a backup RAID array, and its 256GB SSD is working just fine. I do have a Time Machine backup of it, and there is little data on the drive with the vast majority of it backed up.
I worry more about backing up my backup data safely offsite, along with organizing it all. That is what I would want an off-line AI to do for me in a future OS. I don't want to put a lot of data in "the cloud" or move it on-line since my internet connection is my cell phone and is capped at 12 GB per month.
The issue is not that it's hard to know why an SSD died. After all, as others have said, the same is true with spinning rust. The real issue is how suddenly an SSD can die. It can be perfectly healthy one day, then completely read-only or even totally dead the next day. An HDD on the other hand usually (but not always) shows symptoms of dying. It starts making noise, or the number of I/O errors spikes. Maybe it stops working when it's moved on its side.
An SSD, on the other hand, intelligently reallocates bad sectors until its dying breath. Because NAND cells were historically so fragile, the FTL is very paranoid about ensuring a sector does not die on it. That's good and all, but it means that it's harder to tell that the device is in its death throes. An HDD is much dumber. It won't reallocate something as soon as a sector becomes unhealthy. It'll be totally fine until reads start failing, in which case it will try a number of times to read it so it can relocate it, and as this starts happening more and more, it becomes very obvious.
I'm not criticizing SSDs at all. Their failure mode is not worse, it's just different. People are not used to monitoring drive health, so they (possibly foolishly) tend to rely on physical symptoms appearing. That's not a very good idea for SSDs.
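As a sketch of what "monitoring drive health" instead of listening for noises could look like: the function below takes a series of spare-capacity readings and flags sudden drops or a low floor. The readings, the thresholds, and the function name are all hypothetical; how you actually collect such a number depends on the drive and your tooling, and as others in the thread note, some drives die with clean-looking stats anyway.

```python
def health_warnings(spare_percent, floor=10.0, max_drop=5.0):
    """Return (index, message) warnings for a series of spare-capacity readings.

    spare_percent: readings over time, 100 = all spare blocks still available.
    floor: warn when a reading falls below this percentage.
    max_drop: warn when one reading is this much lower than the previous one.
    """
    warnings = []
    for i in range(1, len(spare_percent)):
        drop = spare_percent[i - 1] - spare_percent[i]
        if drop > max_drop:
            warnings.append((i, "spare capacity falling fast"))
        if spare_percent[i] < floor:
            warnings.append((i, "spare capacity below floor"))
    return warnings


readings = [100, 99, 98, 90, 8]  # made-up sample data
for index, message in health_warnings(readings):
    print(index, message)
```

With the sample readings, the jump from 98 to 90 already trips the "falling fast" warning one step before the drive is nearly out of spares.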