Ask Slashdot: How Do SSDs Die?

Posted by timothy on Tuesday October 16, 2012 @04:20AM from the whimpery-bang dept.

First time accepted submitter kfsone writes "I've experienced, first-hand, some of the ways in which spindle disks die, but either I've yet to see an SSD die or I'm not looking in the right places. Most of my admin-type friends have theories on how an SSD dies but admit none of them has actually seen commercial grade drives die or deteriorate. In particular, the failure process seems like it should be more clinical than with spindle drives. If you have X many of the same SSD drive, none of them suffer manufacturing defects, and you repeat the same series of operations on them, they should all die around the same time. If that's correct, then what happens to SSDs in RAID? Either all your drives will start to fail together or, at some point, your drives will become out of sync in terms of volume sizing. So, have you had to deliberately EOL corporate grade SSDs? Do they die with dignity or go out with a bang?"

CRC Errors by Anonymous Coward · 2012-10-16 04:22 · Score: 5, Informative

I had 2 out of 5 SSDs (OCZ) fail with CRC errors; I'm guessing faulty cells.
1. Re:CRC Errors by Anonymous Coward · 2012-10-16 04:39 · Score: 5, Informative
  
  OCZ has some pretty notorious QA issues with a few lines of their SSDs, especially if your firmware isn't brand spanking new at all times.
  I'd google your drive info to see if yours are on death row. They seem a little small (old) for that, since I only know of problems with their more recent, bigger drives.
2. Re:CRC Errors by lytles · 2012-10-16 05:40 · Score: 5, Funny
  
  power corrupts. absolute power corrupts absolutely
  
  --
  My blog
3. Re:CRC Errors by Mattcelt · 2012-10-16 05:44 · Score: 5, Funny
  
  And a lack of power enables corruption. QED
4. Re:CRC Errors by MrL0G1C · 2012-10-16 05:53 · Score: 5, Informative
  
  http://www.behardware.com/articles/862-7/components-returns-rates-6.html
  Personally, I'm glad my SSDs aren't OCZ.
  
  --
  Waterfox - a Firefox fork with legacy extension support, security updates and better privacy by default.
5. Re:CRC Errors by arth1 · 2012-10-16 07:16 · Score: 5, Insightful
  
  I am running (6) OCZ Vertex2 256GB drives under heavy use 24/7. Almost 2 years on have only had one fail and it still works, just started kicking random errors.
  Your failure rate of > 8% per year isn't very reassuring.
6. Re:CRC Errors by ZedNaught · 2012-10-16 08:13 · Score: 5, Informative
  
  Firmware release notes, from January 13th, 2012: "Correct a condition where an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time. The drive will recover after a power cycle, however, this failure will repeat once per hour after reaching this point. The condition will allow the end user to successfully update firmware, and poses no risk to user or system data stored on the drive."
How do SSD's die by AwesomeMcgee · 2012-10-16 04:26 · Score: 5, Funny

Screaming in agony, hissing bits and bleeding jumperless in the night
Re:Die! by Anonymous Coward · 2012-10-16 04:27 · Score: 5, Funny

Wow - you've been here a long long time then
They usually die gracefully... by dublin · 2012-10-16 04:31 · Score: 5, Informative

In general, if the SSD in question has a well-designed controller (Intel, SandForce), then write performance will begin to drop off as bad blocks start to accumulate on the drive. Eventually, wear levelling and write cycles take their toll, and the disk can no longer write at all. At this point, the controller does all it can: it effectively becomes a read-only disk. It should operate in this mode until something else catastrophic (tin migration, capacitor failure, etc.) keeps the entire drive from working.
BTW - I haven't seen this either, but that's the degradation profile that's been presented to me in several presentations by the folks making SSD drives and controllers. (Intel had a great one a few years back - don't have a link to it handy, though...)

--
"The future's good and the present is nothing to sneeze at." - Roblimo's last ./ post
Re:Umm by Anonymous Coward · 2012-10-16 04:31 · Score: 5, Insightful

yeah, sounds like submitter may be mildly deficient

Which is why he's asking.
Fuck people who ask questions when they don't know something, right?
Re:Umm by kelemvor4 · 2012-10-16 04:32 · Score: 5, Informative

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.
Never heard of that. I've got about 450 servers each with a raid1 and raid10 array of physical disks. We always buy everything together, including all the disks. If one fails we get alerts from the monitoring software and get a technician to the site that night for a disk replacement. I think I've seen one incident in the past 14 years I've been in this department where more than one disk failed at a time.

My thought on buying them separately is that you run the risk of getting devices with different firmware levels or other manufacturer revisions which would be less than ideal when raided together. Not to mention you have a mess for warranty management. We replace systems (disks included) when the 4 year warranty expires.
I have seen SSD death by MRGB · 2012-10-16 04:38 · Score: 5, Informative

I have seen SSD death many times and it is a strange sight indeed. What is interesting about it when compared to normal drives is that when normal drives fail it is - mostly - an all-or-nothing ordeal. A bad spot on a drive is a bad spot on a drive. With SSDs you can have a bad spot in one place, reboot, and get a bad spot in another place. Windows loaded on an SSD will exhibit all kinds of bizarre behaviour. Sometimes it will hang, sometimes it will blue-screen, sometimes it will boot normally until it tries to read or write to that random bad spot. Rebooting is like rolling the dice to see what it will do next - that is, until it fails completely.
Re:Umm by statusbar · 2012-10-16 04:40 · Score: 5, Insightful

I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! Back to backup tapes! Better to have replacement drives in boxes waiting.

--
ipv6 is my vpn
Re:Bang! by ColdWetDog · 2012-10-16 04:48 · Score: 5, Funny

Does anyone else find this sort of thing upsetting? I grew up during that period of time when tech failed dramatically on TV and in movies. Sparks, flames, explosions - crew running around randomly spraying everything with fire extinguishers. Klaxons going off. Orders given and received. Damage control reports.
None of this 'oh snap, the hard drive died'.
Personally, I think the HD (and motherboard) manufacturers ought to climb back on the horse. Make failure modes exciting again. Give us a run for the money. It can't be hard - there still must be plenty of bad electrolytic capacitors out there.
How about a little love?

--
Faster! Faster! Faster would be better!
X-25M Death: Firmware bug too? by Anonymous Coward · 2012-10-16 04:50 · Score: 5, Interesting

I had an 80G Intel X-25M fail in an interesting manner. Windows machine, formatted NTFS, Cygwin environment. Drive had been in use for about a year, "wear indicator" still read 100% fine. Only thing wrong with it is that it had been mostly (70 out of 80G full) filled, but wear leveling should have mitigated that. It had barely a terabyte written to it over its short life.
Total time from system operational to BSOD was about ten minutes. I first noticed difficulties when I invoked a script that called a second script, and the second script was missing. "ls -l" on the missing script confirmed that the other script wasn't present. While scratching my head about $PATH settings and knowing damn well I hadn't changed anything, a few minutes later, I discovered I also couldn't find /bin/ls.exe. In a DOS prompt that was already open, I could DIR C:\cygwin\bin - the directory was present, ls.exe was present, but it wasn't anything that the OS was capable of executing. Sensing imminent data loss, and panic mounting, I did an XCOPY /S /E... etc to salvage what I could from the failing SSD.
Of the files I recovered by copying them from the then-mortally-wounded system, I was able to diff them against a valid backup. Most of the recovered files were OK, but several had 65536-byte blocks consisting of nothing but zeroes.
Around this point, the system (unsurprisingly, as executables and swap and heaven knows what else were being riddled with 64K blocks of zeroes) crashed. On reboot, Windows attempted (and predictably failed) to recover (asinine that Windows tries to write to itself on boot, but also asinine of me to not power the thing down and yank the drive, LOL). The system did recognize it as an 80G drive and attempted to boot itself - Windows logo, recovery console, and all.
On an attempt to mount the drive from another boot disk, the drive still appeared as an 80G drive once; unfortunately, it couldn't remain mounted long enough for me to attempt further file recovery or forensics.
A second attempt - and all subsequent attempts - to mount the drive showed it as an 8MB (yes, eight megabytes) drive.
I'll bet most of the data's still there. (The early X-25Ms didn't use encryption). What's interesting is that the newer drives have a similar failure mode that's widely recognized as a firmware bug. If there were a way to talk to the drive over its embedded debugging port (like the Seagate Barracuda fix from a few years ago), I'll bet I could recover most of the data.
(I don't actually need the data, as I got it all back from backups, but it's an interesting data recovery project for a rainy day. I'll probably just desolder the chips and read the raw data off 'em. Won't work for encrypted drives, but it might work for this one.)
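
For anyone wanting to check salvaged files for the same damage, here is a minimal sketch that scans a byte stream for 65536-byte blocks made up entirely of zeroes. The function name and the assumption that the zeroed runs sit on 64K-aligned offsets are illustrative only; the post does not say how the blocks were aligned. A hit is a hint rather than proof, since healthy files can legitimately contain long zero runs, so a diff against a known-good backup is still the real test.

```python
import io

BLOCK = 65536  # 64 KiB, the size of the zeroed runs described in the post


def find_zero_blocks(stream, block_size=BLOCK):
    """Return offsets of aligned, full-size blocks that contain only zero bytes.

    Assumption: damaged regions start on block_size boundaries.
    """
    offsets = []
    offset = 0
    while True:
        chunk = stream.read(block_size)
        if not chunk:
            break
        # A short trailing chunk is never reported, even if it is all zeroes.
        if len(chunk) == block_size and chunk.count(0) == block_size:
            offsets.append(offset)
        offset += len(chunk)
    return offsets


# Synthetic example: good data, one zeroed 64K block, more good data.
sample = b"\x01" * BLOCK + b"\x00" * BLOCK + b"\x02" * BLOCK
print(find_zero_blocks(io.BytesIO(sample)))
```

For a real file, pass `open(path, "rb")` instead of the `io.BytesIO` object.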
Re:They die without warning and without recourse by cellocgw · 2012-10-16 04:54 · Score: 5, Interesting

With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely.
OK, so I'm sure some enterprising /.-er can write a script that watches the SSD controller and issues some clicks to the sound card when cells are marked as failed.
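
A rough sketch of that idea, with loud assumptions: how the failed-cell count is obtained is left out entirely (the `readings` list below is made-up data standing in for successive polls of the drive), and a terminal bell character stands in for a real click through the sound card.

```python
def clicks_for(previous_count, current_count):
    """Return one terminal bell character per newly failed cell."""
    newly_failed = max(0, current_count - previous_count)
    return "\a" * newly_failed


# Hypothetical successive readings of a failed-cell counter.
readings = [0, 0, 2, 2, 5]
for before, after in zip(readings, readings[1:]):
    bells = clicks_for(before, after)
    # Report the number of "clicks"; writing `bells` to a terminal would beep.
    print(f"{before} -> {after}: {len(bells)} click(s)")
```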

--
https://app.box.com/WitthoftResume Code: https://github.com/cellocgw
Re:They die without warning and without recourse by dougmc · 2012-10-16 04:54 · Score: 5, Informative

With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely.
Usually? No.
This does happen sometimes, but it certainly doesn't happen "usually". There are enough different failure mechanisms for hard drives that there isn't any one "usual" method:
1- drive starts reporting read and/or write errors occasionally, but otherwise seems to keep working
2- drive just suddenly stops working completely all at once
3- drive starts making noise (and performance usually drops massively), but the drive still works.
4- drive seems to keep working, but SMART data starts reporting all sorts of problems.
Personally, I've had #1 happen more often than anything else, usually with a healthy serving of #4 at about the same time or shortly before. #2 is the next most common failure mode, at least in my experience.
Re:They shrink by v1 · 2012-10-16 04:55 · Score: 5, Informative

The sectors you are talking about are often referred to as "remaps" (or "spares"), which is also used to describe the number of blocks that have been remapped. Strategies vary, but an off-the-cuff average would be around one available spare per 1000 allocatable blocks. Some firmware will only use a spare from the same track, other firmware will pull the next nearest available spare. (allowing an entire track to go south)
The more blocks they reserve for spares, the lower the total capacity count they can list, so they don't tend to be too generous. Besides, if your drive is burning through its spares at any substantial rate, doubling the number of spares on the drive won't actually end up buying you much time, and certainly won't save any data.
But with the hundreds of failing disks I've dealt with, when more than ~5 blocks have gone bad, the drive is heading out the door fast. Remaps only hide the problem at that point. If your drive has a single block fail when trying to write, it will be remapped silently and you won't ever see the problem unless you check the remap counter in SMART. If it gets an unreadable block on a read operation, however, you will probably see an I/O error. Some drives will immediately remap it, but most don't and will conduct the remap when you next try to write to that cell (otherwise they'd have to return fictitious data, like all zeros).
So I don't particularly like automatic silent remaps. I'd rather know when the drive first looks at me funny so I can make sure my backups are current and get a replacement on order, and swap it out before it can even think about getting worse. I prefer to replace a drive on MY terms, on MY schedule, not when it croaks and triggers any grade of crisis. There are legitimate excuses for downtime, but a slowly failing drive shouldn't be one of them.
All that said, on multiple occasions I've tried to cleanse a drive of IO errors by doing a full zero-it format. All decent OBCCs on drives should verify all writes, so in theory this should purge the drive of all IO errors, provided all available spares have not already been used. The last time I did this on a 1TB Hitachi that had ONE bad block on it, it still had one bad block (via read verify) when the format was done. The write operation did not trigger a remap, (and I presume it wasn't verified, as the format didn't fail) and I don't understand that. If it were out of remaps, the odds of it being ONE short of what it needed is essentially zero. So I wonder in reality just how many drive manufacturers aren't even bothering with remapping bad blocks. All I can attribute this to is crappy product / firmware design.
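
To put the "one spare per 1000 allocatable blocks" estimate from earlier in this comment into numbers, here is a back-of-the-envelope sketch. The 512-byte block size and the 10**12-byte reading of "1TB" are assumptions made for the example, not figures from any drive's datasheet.

```python
def spare_blocks(capacity_bytes, block_size=512, spares_per_thousand=1):
    """Estimate the spare-block count from a spares-per-1000-blocks ratio."""
    allocatable = capacity_bytes // block_size
    return allocatable * spares_per_thousand // 1000


# Assumed example: a 1 TB (10**12 bytes) drive with 512-byte blocks.
spares = spare_blocks(10**12)
print(spares)        # 1953125 spare blocks
print(spares * 512)  # 1000000000 bytes held back, i.e. 0.1% of capacity
```

Under these assumptions, doubling the ratio would cost another 0.1% of capacity, which fits the point that the reserve is kept small and that burning through it quickly is the real warning sign.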

--
I work for the Department of Redundancy Department.
Re:Umm by Anonymous Coward · 2012-10-16 05:01 · Score: 5, Interesting

Google published a study they did of their own consumer grade drives, and found the same thing. If the drive survives the first month of load, it will likely go on to work for years, but if it throws even just SMART errors in the first 30 days, it is likely to be dodgy.
Bathtub Curve by Onymous Coward · 2012-10-16 05:05 · Score: 5, Informative

The bathtub curve is widely used in reliability engineering. It describes a particular form of the hazard function which comprises three parts:
- The first part is a decreasing failure rate, known as early failures.
- The second part is a constant failure rate, known as random failures.
- The third part is an increasing failure rate, known as wear-out failures.
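
One common way to sketch such a curve is to add three Weibull hazard rates, one per part: shape below 1 gives a decreasing rate, shape equal to 1 a constant rate, and shape above 1 an increasing rate. This is only an illustration; the shape and scale values below are arbitrary assumptions chosen to make the bathtub visible, not measurements from any drive population.

```python
def weibull_hazard(t, shape, scale):
    """Hazard rate of a Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Sum of three hazards; every parameter here is an illustrative assumption."""
    early = weibull_hazard(t, shape=0.5, scale=100.0)   # decreasing: early failures
    random_ = weibull_hazard(t, shape=1.0, scale=50.0)  # constant: random failures
    wearout = weibull_hazard(t, shape=5.0, scale=60.0)  # increasing: wear-out
    return early + random_ + wearout


# The printed rate starts high, dips in mid-life, then climbs again.
for t in (0.1, 1, 10, 30, 60, 80):
    print(f"t={t:>5}: hazard={bathtub_hazard(t):.4f}")
```

Adding hazards like this corresponds to treating the three failure causes as independent competing risks, where the drive fails as soon as any one of them strikes.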
Re:Umm by Bob the Super Hamste · 2012-10-16 05:37 · Score: 5, Informative

For those who are interested the white paper is titled "Failure Trends in a Large Disk Drive Population" and can be found here. It is a fairly short read (13 total pages) and quite interesting if you are into monitoring stuff.

--
Time to offend someone
Re:Umm by Anonymous Coward · 2012-10-16 06:26 · Score: 5, Insightful

I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! Back to backup tapes! Better to have replacement drives in boxes waiting.
This. Your spares closet is your best friend in the enterprise. Ensure you keep it stocked.
And locked. And don't label them "spares". Label them "cold swap fallback device" or something that management won't see as something "extra" that can be "repurposed" (i.e. stolen)