Ask Slashdot: How Do SSDs Die?
First time accepted submitter kfsone writes "I've experienced, first-hand, some of the ways in which spindle disks die, but either I've yet to see an SSD die or I'm not looking in the right places. Most of my admin-type friends have theories on how an SSD dies but admit none of them has actually seen commercial grade drives die or deteriorate. In particular, the failure process seems like it should be more clinical than spindle drives. If you have X many of the same SSD drive and none of them suffer manufacturing defects, if you repeat the same series of operations on them they should all die around the same time. If that's correct, then what happens to SSDs in RAID? Either all your drives will start to fail together or at some point, your drives will become out of sync in-terms of volume sizing. So, have you had to deliberately EOL corporate grade SSDs? Do they die with dignity or go out with a bang?"
I had 2 out of 5 SSDs fail (OCZ) with CRC errors, I'm guessing faulty cells.
It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.
The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.
Screaming in agony, hissing bits and bleeding jumperless in the night
Wow - you've been here a long long time then
Didn't happen to me, but a number of people with the same Intel SSD reported that they booted up and the SSD claimed to be 8MB and required a secure wipe before it could be reused. Supposedly it's fixed in the new firmware, but I'm still crossing my fingers every time I reboot that machine.
From what I understand, SSD die because of "write-burnout" if they are FLASH based and from what I understand the majority of SSDs are flashed based now. So while I haven't actually had a drive fail on me, I assume that I would be able to still read data off a failing drive and restore it, making it an ideal failure path. I did a google search and found a good article on the issue: http://www.makeuseof.com/tag/data-recovered-failed-ssd/
SSDs use wear leveling algorithms to optimize each memory cell's lifespan; meaning that it keeps track of how many times each cell was written and it ensures that all cells are being utilized evenly. When the cells fail, they're being kept track of and the drive does not attempt to write to that cell any longer. When enough cells have failed the capacity of the drive will shrink noticeably. At that point it is probably wise to replace it. For a RAID configuration the wear level algorithm would presumably still work as the RAID algorithm pumps even amounts of data to each drive (whether it is mirrored or striped). When any of the drives are shrinking in size it is presumably time to replace the array.
In general, if the SSD in question has a well-designed controller (Intel, SandForce), then write performance will begin to drop off as bad blocks start to accumulate on the drive. Eventually, wear levelling and write cycles have taken their toll, and the disk can no longer write at all. At this point, the controller does all it can: it effectively becomes a read-only disk. It should operate in this mode until else something catastrophic (tin migration, capacitor failure, etc.) keeps the entire drive from working.
BTW - I haven't seen this either, but that's the degradation profile that's been presented to me in several presentations by the folks making SSD drives and controllers. (Intel had a great one a few years back - don't have a link to it handy, though...)
"The future's good and the present is nothing to sneeze at." - Roblimo's last
With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely. In my experience, though SSDs don't fail as often, when they do, it's sudden and catastrophic. Having said that, I've only seen one fail out of the ~10 we've deployed here (and it was in a laptop versus traditional desktop / workstation). So BACK IT UP. Just my $0.02.
So by reason of thinking, if you have a RAID of 15 drives for storage of images, these images never change, they are written and never over written, then the SSDs should theoretically never die because they are only reading these bits now?
All three of the commercial grade SSD failures I've cleaned up after (I do PostgreSQL data recovery) just died. No warning, no degrading in SMART attributes; works one minute, slag heap the next. Presumably some sort of controller level failure. My standard recommendation here is to consider then no more or less reliable than traditional disks and always put them in RAID-1 pairs. Two of the drives were Intel X25 models, the other was some terrible OCZ thing.
Out of more current drives, I was early to recommend Intel's 320 series as a cheap consumer solution reliable for database use. The majority of those I heard about failing died due to firmware bugs, typically destroying things during the rare (and therefore not well tested) unclean shutdown / recovery cases. The "Enterprise" drive built on the same platform after they tortured consumers with those bugs for a while is their 710 series, and I haven't seen one of those fail yet. That's not across a very large installation base nor for very long yet though.
My experience was system crash due to corruption of loaded executables, then at the hard reboot it fails the e2fsck because the "drive" is basically unwritable so the e2fsck can't complete.
It takes a long time to kill a modern SSD... this failure was from back when a CF plugged into a PATA-to-CF adapter was exotic even by /. standards
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Like spinning drives, silicon drives always die when it will do the most damage.
Like right before you find out all your backups are bad.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
Not to be confused with STD's
Personally, I wouldn't really mind all that much if my STD's died, I can see an upside...
I have seen SSD death many times and it is a strange sight indeed. What is interesting about it when compared to normal drives is that when normal drives fail it is - mostly - and all or nothing ordeal. A bad spot on a drive is a bad spot on a drive. With SSDs you can have a bad spot one place, reboot, and you get a bad spot in another place. Windows loaded on an SSD will exhibit all kinds of bizarre behaviour. Sometimes it will hang, sometimes it will blue-screen, sometimes it will boot normally until it tries to read or write to that random bad spot. Rebooting is like rolling the dice to see what it will do next - that is, until it fails completely.
Had an aftermarket SSD for a macbook air fail twice in 2 years (threw it out and placed an original hdd after that). Both times the system decided not to boot and could not find the SSD.
In both cases I have suspected that the Indilinx controller gave way. This seems mirrored in quite a few cases with the experience of others who had drives with these chips in them.
In an ideal scenario the controller should be able to handle the eventual wearout of the disk by finding other memory cells to write to. Any cells that have been used up should still be readable as well since the floating gates basically have been filled up with electrons and will not allow further erasing.
I guess the main issue right now is the fact that SSDs cant notify the user once things get a bit too worn out. Eventually the controller wont be able to keep up with the useless cells and then might simply no longer respond. Things will only get worse when the cycles go down due to smaller manufacturing processes so that useless controllers in cheap SSDs are more likely to fail
I had a FusionIO IODrive fail a few weeks ago. It was running a data array on a windows 2008 r2 server. It manifested its-self by giving errors in the windows event log and causing long boot times (even though it was not a boot device). The device was still accessible, but slower than normal. I think the answer to your question will probably vary greatly both by manufacturer and also based on what part of the device failed. The SSD's I've used generally come with a fairly large amount of "backup" memory on them so that if a cell begins to fail, the card marks the cell bad and uses one from one of the backup chips. Much like how hard drives deal with bad sectors. As I understand it, the SSD is somehow able to detect the failure before data is lost and begin using the backup chips transparently and automatically vs having to do a scandisk or similar to do the same on a physical disk. That may very well vary by manufacturer as well.
as the disk controller reads them their last rites before they integrate with the great RAID array in the sky.
Not with a bang but a whimper.
We use SSDs in a few Windows machines at work. Running 24/7/365 production. We were replacing them every couple of years.
Eternity: will that be smoking, or non-smoking? I Corinthians 6:9-10
So by reason of thinking, if you have a RAID of 15 drives for storage of images, these images never change, they are written and never over written, then the SSDs should theoretically never die because they are only reading these bits now?
Reading flash is not 100% non-destructive, if you never do a re-write cells near each read cell (which is all of them, probably) will degrade over time. I believe the stored data will degrade over long periods of time in any case, but I'm not sure. But if you re-write data every year or so, they could probably last decades.
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
I have a G.Skill Falcon 64GB SSD that is failing on me. Windows chkdsk started seeing "bad sectors" (whatever this means for SSD... I think its really slow parts of the SSD) and started seeing more and more and windows would not boot. A fresh install of windows would immediately crash in a day or two. I had done a "secure erase" and that seemed to the job, a chkdsk found no "bad sectors". But a weeks later chkdsk found 4 bad sectors. But its going on a month now and I have yet to have windows fail.
Total time from system operational to BSOD was about ten minutes. I first noticed difficulties when I invoked a script that called a second script, and the second script was missing. "ls -l" on the missing script confirmed that the other script wasn't present. While scratching my head about $PATH settings and knowing damn well I hadn't changed anything, a few minutes later, I discovered I also couldn't find /bin/ls.exe. In a DOS prompt that was already open, I could DIR C:\cygwin\bin - the directory was present, ls.exe was present, but it wasn't anything that the OS was capable of executing. Sensing imminent data loss, and panic mounting, I did an XCOPY /S /E... etc to salvage what I could from the failing SSD.
Of the files I recovered by copying them from the then-mortally-wounded system, I was able to diff them against a valid backup. Most of the recovered files were OK, but several had 65536-byte blocks consisting of nothing but zeroes.
Around this point, the system (unsurprisingly, as executables and swap and heaven knows what else was being riddled with 64K blocks of zeroes) crashed. On reboot, Windows attempted (and predictably failed) to recover (assinine that Windows tries to write to iself on boot, but also assinine of me to not power the thing down and yank the drive, LOL.) The system did recognize it as an 80G drive and attempted to boot itself - Windows logo, recovery console, and all.
On an attempt to mount the drive from another boot disk, the drive still appeared as an 80G drive once, unfortunately, it couldn't remain mounted long enough for me to attempt further file recovery or forensics.
A second attempt - and all subsequent attempts - to mount the drive showed it as an 8MB (yes, eight megabytes) drive.
I'll bet most of the data's still there. (The early X-25Ms didn't use encryption). What's interesting is that the newer drives have a similar failure mode that's widely recognized as a firmware bug. If there were a way to talk to the drive over its embedded debugging port (like the Seagate Barracuda fix from a few years ago), I'll bet I could recover most of the data.
(I don't actually need the data, as I got it all back from backups, but it's an interesting data recovery project for a rainy day. I'll probably just desolder the chips and read the raw data off 'em. Won't work for encrypted drives, but it might work for this one.)
SSD's have an advertised capacity N and an actual capacity M. Where M > N. In general the bigger M realtive to N the better the performance and lifetime of the drive. As it wears it will "silently" assign bad blocks and reduce M. Your write performance will degrade. If you have good analysis tools it will tell you when it starts getting a lot of blocks near end of life and when M is getting reduced.
Blocks near end of life are also more likely to get read errors. The drive firmware is supposed to juggle things around so all of the blocks near end of life about the same time. With a soft read error the block will be moved to a more reliable portion of the SSD. That means increased wear.
1. Watch write perforamance/spare block count
2. If you get any read errors do a block life audit
3. When you get into life limiting events things accelerate to bad due to the mitigation behaviors
Be carefull depending on the sensitivities of the firmware it will let you get closer to catastrophe before warning you. More likely to be closer in consumer grade.
In theory, yes. In flashROM devices the erase process is the aging action. Your write-once-never-erase-read-only flash should last until A) enough charge manages to leak out of gates that you get bit errors, or B) the part fails due to corrosion or other long-term aging issue, similar to any piece of electronics.
If you have raw access to the flashROM you could in theory write the same data into the same unerased bytes to recover from bit errors (if you had an uncorrupted copy), so only aging failures would occur. But of course you can't do this with an SSD as you have no direct access to the memory, and the controller A) wouldn't let you write into unerased space, and B) wouldn't write the data into the exact same place again anyway.
It doesn't hurt to be nice.
It's statistical, not fixed rate. Some cells wear faster than others due to process variations, and the failures don't show up to you until there are uncorrectable errors. If one chip gets 150 errors spread out across the chip, and another gets 150 in critical positions (near to each other), then the latter one will show failures while the first one keeps going.
So yeah, when one goes, you should replace them all. But they won't all go at once.
Also note most people who have seen SSD failures have probably seen them fail due to software bugs in their controllers, not inherent inability to store data due to wear.
http://lkml.org/lkml/2005/8/20/95
I always expected the cells to go first. I was careful to avoid unnecessary writes. In the end, though, it was a known bug that killed the drive. Well, I didn't know about it, of course, until it was too late. If I'd known, I'd have updated the drive firmware to one that didn't have a catastrophic bug.
I replaced it with a Samsung. The RMA'd replacement OCZ is still sitting in its packet on my desk.
If your comment title says 'Re: Foo', I'm not likely to read it.
After reading this horror story I arrived to the conclusion that SSDs are not for me. I wonder if it's still true.
Super Talent 32 GB SSD, failed after 137 days
OCZ Vertex 1 250 GB SSD, failed after 512 days
G.Skill 64 GB SSD, failed after 251 days
G.Skill 64 GB SSD, failed after 276 days
Crucial 64 GB SSD, failed after 350 days
OCZ Agility 60 GB SSD, failed after 72 days
Intel X25-M 80 GB SSD, failed after 15 days
Intel X25-M 80 GB SSD, failed after 206 days
http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html
I recently had a "old" (cir 2008) 64gb SSD drive die on me. It's death followed this pattern:
After popping a new disk in and doing a partition resize, my system was back up and running with no data loss. Of all the storage hardware failures I've experienced, this was probably the most pain-free as the failure caused the drive to simply degrade into a read-only device.
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
why are you complaining about our long standing culture
lister king of smeg (2481612)
Not your first account I take it?
Apparently wizard is not a legitimate career path, so I chose programmer instead.
We've got a fair number of SSDs here. Failures have been really rare. The few that have:
#1 just went dead. Not recognized by the computer at all.
#2 Got stuck in a weird read-only mode. The OS was thinking it was writing to it, but the writes weren't really happening. You'd reboot and all your changes were undone. The OS was surprisingly okay with this, but would eventually start having problems where pieces of the filesystem metadata it cached didn't sync up with new reads. Reads were still okay, and we were able to make a full backup by mounting in read only mode.
#3 Just got progressively slower and slower on writes. but reads were fine.
Overall far lower SSD failure rates than spinning disk failure rates, but we don't have many elderly SSDs yet. We do have a ton of servers running ancient hard drives, so it'll be interesting to see over time.
In theory they should degrade to read-only just as others have pointed out in other posts, allowing you to copy data off them.
In reality, just like modern hard drives, they have unrecoverable firmware bugs, fuses that can blow with a power surge, controller chips that can burn up, etc.
And just like hard drives, when that happens in theory you should still be able to read the data off the flash chips but there are revisions to the controller, firmware, etc that make that more or less successful depending on the manufacturer. You also can't just pop the board off the drive like with an HDD, you need a really good surface mount resoldering capability.
So the answer is "it depends"... If the drive itself doesn't fail but reaches the end of its useful life or was put on the shelf less than 10 years ago (flash capacitors do slowly drain away) then the data should be readable or mostly readable.
If the drive itself fails, good luck. Maybe you can bypass the fuse, maybe you can re-flash the firmware, or maybe it's toast. Get ready to pay big bucks to find out.
P.S. OCZ is fine for build it yourself or cheap applications but be careful. They have been known to buy X-grade flash chips for some of their product lines - chips the manufacturers list as only good for kid toys or non-critical, low-volume applications. Don't know if they are still doing it but I avoid their stuff.
Intel's drives are the best and have the most-tested firmware but you pay for it. Crucial is Micron's consumer brand and tends to be pretty good given they make the actual flash - they are my go-to brand right now. Samsung isn't always the fastest but seems to be reliable.
Do your research and focus on firmware and reliability, not absolute maximum throughput/IOPs.
Natural != (nontoxic || beneficial)
Tin whisker growth is another way not directly related to the flash cells. Commercial electronics use lead-free solder and no real whisker mitigation techniques. Eventually a whisker shorts between two things that shouldn't be shorted, conducts sufficient current for a sufficient amount of time, and poof, your drive is dead.
That is why one uses RAID 6 with lower tier drives and hot spares.
Works great until 3 drives in the RAID fail.
Better make it a RAID-60 just to be safe. And maybe mirror that too.
We have replaced a few 8x 15K RPM RAID5 with OCZ PCI cards of 0.5TB and 1TB. They were serving databases with high update frequency (~500Hz average in the long run). At first they were fantastic, iowait was gone, great performance, just as expected, enough to get us hooked on them and order more :)
However in a few months the performance has deteriorated dramatically, to the point where they were much worse than the disks they were replacing. Writing /dev/zero to them a few times restored for a short time the performance but finally one card died dramatically, another lost 2 of the 8 slots ... Finally we moved back to the disks and as much RAM as fitted the servers and called it a failed, very costly, experiment. I'd argue that the SSDs are probably good only for PCs since one can fit a comparable amount of memory in a server for read performance and if you need high write rates you would anyway destroy them quickly...
Did you mount it with noatime and nodiratime ?
Religion is what happens when nature strikes and groupthink goes wrong.
A: Memory cells begin to die off faster than the SSD's controller can annotate them as bad and reallocate the memory which initially shows up as major slowdown, then as crc32 errors which increase in frequency and severity due to overwrites not completing correctly. The issue accelerates until the drive becomes unusable. This failure is usually due to heavy use, age and cheap, cheap memory.
B: Solder joint on a chip cracks takes out the chip and, since the entire array of chips are set up RAID0 style, the entire drive is dead one day mysteriously. This occurs due to an extreme difference in hot temp and cold temp the drive is exposed to not by itself but by other components; lead-free solder has multiple metals in it which expand and contract at different rates, as you heat up and cool down you cause extreme contraction and expansion. Like bending a fork too many times, microfractures form which eventually coalesce to become one big open in the circuit.
C: Shorting of the internal chip components causing the infamous "black glass" situation where the voltage and grounding planes of the chip short out, heat up, and you get to see black glass on the very top of the chip and sometimes a small distortion.
D: Firmware memory fails. Shows up as every single wierd issue you can imagine.
E: Defects in the drive such as poor connectors between the die and external connectors, or lack of shock resistance during shipping for certain solder joints, usually the drives fail quick and hard.
All of the above are basically possible, save for Point A, on a regular hard drive.
Fact: If a Harddrive goes, drivesavers can toss it under an electron microscope and recover the data. SSD's have no known recovery methodologies because the above failure modes usually physically destroys the data.
Point A makes RAID arrays using SSD's particularily interesting since if you purchase a box of drives with similar Serial numbers and start running them at the same load over time, you're bound to end up with the them failing near the same point in time. Thankfully, however, different cells on each drive are going to fail at different times. The majority of harddrive failures are mechanical in nature as wear occurs at different rates for different disks.
SSD's are GREAT for certain applications where shock resistance and speed are key; you can get 15 times the random read/write at 1/100th the latency out of a SSD than you can out of the priciest harddrive, for a fraction of the cost a server racked with drives can fully saturate it's network ports . For doing large-volume data projects or running a fully virtualized infrastructure that needs tons of I/O, there really is, IMO, no other option. Doing so, however, without backups upon backups is suicide for the same reason running a SAN indefinatly without a backup is suicide. Thankfully running VM's makes backing up and restoring a breeze.
Could run a cubed strip of raid 6 arrays in a RAID666
Paying taxes to buy civilization is like paying a hooker to buy love.
The ultra-cheap SSD's in my severs lasted only 3 months. The 4 OCZ Vertex 3 IOPS have so far lasted over a year with ~2TB processed per disk, 2 Intel SLC and 2 MLC's already over 2 years over which time they have processed ~10TB each (those were all enterprise grade or close to it). They are in a 60TB array doing caching so they regularly get read/write/deleted. I have some OCZ Talos (SAS) as well where one was DoA and another early-death but simply shipping them into RMA and I had another one in a couple of days. But the rest of them do well over 6 months and going.
Several other random ones still work fine in random desktop machines and workstations.
As far as spare room on those devices, depending on the manufacturing process you get between 5 and 20% unused space where 'bad' blocks come to live. I haven't had one with bad blocks so most of mine have gone out with a bang, usually they just stop responding and drop out, totally dead. I would definitely recommend RAID6 or mirrors as they do die just like normal hard drives (I just had 3 identical 3TB drives die in the last week)
Custom electronics and digital signage for your business: www.evcircuits.com
I have ordered approximately 500 Intel SSD's over the past 18 months (320 series and the 520 series primarily). To date, we have had exactly one fail to my knowledge. It was a 320 series 160 GB with known firmware issue. We have around 80 of that type and size, and the drive that failed did so on first image. We RMA'ed the drive and got a replacement.