Ask Slashdot: How Do SSDs Die?
First time accepted submitter kfsone writes "I've experienced, first-hand, some of the ways in which spindle disks die, but either I've yet to see an SSD die or I'm not looking in the right places. Most of my admin-type friends have theories on how an SSD dies but admit none of them has actually seen commercial grade drives die or deteriorate. In particular, the failure process seems like it should be more clinical than with spindle drives. If you have X many of the same SSD and none of them suffers manufacturing defects, then if you repeat the same series of operations on them they should all die around the same time. If that's correct, then what happens to SSDs in RAID? Either all your drives will start to fail together or, at some point, your drives will become out of sync in terms of volume sizing. So, have you had to deliberately EOL corporate grade SSDs? Do they die with dignity or go out with a bang?"
I had 2 out of 5 SSDs fail (OCZ) with CRC errors, I'm guessing faulty cells.
It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.
I miss the days when people actually had something useful to add rather than constant lame attempts at humor.
The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.
Screaming in agony, hissing bits and bleeding jumperless in the night
Wow - you've been here a long long time then
is when you will see a degradation of performance and possible corruption.
Spectacularly and without warning.
Didn't happen to me, but a number of people with the same Intel SSD reported that they booted up and the SSD claimed to be 8MB and required a secure wipe before it could be reused. Supposedly it's fixed in the new firmware, but I'm still crossing my fingers every time I reboot that machine.
From what I understand, SSDs die because of "write-burnout" if they are flash-based, and from what I understand the majority of SSDs are flash-based now. So while I haven't actually had a drive fail on me, I assume that I would still be able to read data off a failing drive and restore it, making it an ideal failure path. I did a Google search and found a good article on the issue: http://www.makeuseof.com/tag/data-recovered-failed-ssd/
SSDs use wear leveling algorithms to optimize each memory cell's lifespan, meaning that the drive keeps track of how many times each cell has been written and ensures that all cells are utilized evenly. When cells fail, the drive keeps track of them and no longer attempts to write to them. When enough cells have failed, the capacity of the drive will shrink noticeably. At that point it is probably wise to replace it. For a RAID configuration the wear leveling algorithm would presumably still work, as the RAID algorithm pumps even amounts of data to each drive (whether it is mirrored or striped). When any of the drives is shrinking in size it is presumably time to replace the array.
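The wear leveling idea described above can be sketched as a toy model. This is only an illustration, not how any real controller is implemented: the class name, the block count and the `max_erases` endurance figure are all made up for the example.

```python
class WearLevelingSketch:
    """Toy model: every write goes to the least-worn block that is still alive."""

    def __init__(self, num_blocks, max_erases):
        self.max_erases = max_erases          # hypothetical per-block endurance
        self.erase_counts = [0] * num_blocks  # writes seen by each block
        self.retired = set()                  # blocks the controller no longer uses

    def usable_blocks(self):
        return len(self.erase_counts) - len(self.retired)

    def write(self):
        """Write one block; return the index used, or None if nothing is writable."""
        live = [i for i in range(len(self.erase_counts)) if i not in self.retired]
        if not live:
            return None
        target = min(live, key=lambda i: self.erase_counts[i])
        self.erase_counts[target] += 1
        if self.erase_counts[target] >= self.max_erases:
            self.retired.add(target)  # worn out: never write here again
        return target


drive = WearLevelingSketch(num_blocks=4, max_erases=3)
writes_done = 0
while drive.write() is not None:
    writes_done += 1
print(writes_done, drive.usable_blocks())  # prints "12 0"
```

In this idealized model every block has identical endurance, so perfectly even wear means all blocks run out within a few writes of each other. Real cells vary, which is why actual drives lose capacity gradually rather than all at once.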
by performing same set of actions, in unreasonable time, then with 99.999%(the more drives, add 9's) probability it's a bug in the firmware/controller. afaik there shouldn't be such drives on market anymore..
otherwise the nands shouldn't die at the same time. shitty nands I suppose will die faster (a bad batch is shitty).
some drive controllers have counters about the nand use - but they shouldn't all blow up when it hits 0, at which point you're recommended to replace them.
I haven't had one die, though I do have a vertex 2 in daily thrashing use.
world was created 5 seconds before this post as it is.
ask someone who tries to salvage data from a dead SSD.
who?
him:
http://www.youtube.com/watch?v=vLoYduckmuo
In general, if the SSD in question has a well-designed controller (Intel, SandForce), then write performance will begin to drop off as bad blocks start to accumulate on the drive. Eventually, wear levelling and write cycles have taken their toll, and the disk can no longer write at all. At this point, the controller does all it can: it effectively becomes a read-only disk. It should operate in this mode until something else catastrophic (tin migration, capacitor failure, etc.) keeps the entire drive from working.
BTW - I haven't seen this either, but that's the degradation profile that's been presented to me in several presentations by the folks making SSD drives and controllers. (Intel had a great one a few years back - don't have a link to it handy, though...)
"The future's good and the present is nothing to sneeze at." - Roblimo's last
I am new to commenting on /. and I think lame attempts at humor belong to 9GAG and Reddit.
Never mind.
With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely. In my experience, though SSDs don't fail as often, when they do, it's sudden and catastrophic. Having said that, I've only seen one fail out of the ~10 we've deployed here (and it was in a laptop versus traditional desktop / workstation). So BACK IT UP. Just my $0.02.
Pretty much all SSDs have more than 8 chips in a configuration similar to RAID0. If any single chip has a problem, the entire drive is useless. I've seen SSDs fail from the cheap 40GB Patriots all the way up to the high-end Fusion-io drives. *Most* of them died after power cycles; I guess if they are going to fail, that will usually be the time it happens. At least with mechanical disks you can spend some cash and have the data recovered after the drive fails.
Suddenly. I've had 2 SSDs fail on me and they both died a sudden and unexpected death.
All three of the commercial grade SSD failures I've cleaned up after (I do PostgreSQL data recovery) just died. No warning, no degrading in SMART attributes; works one minute, slag heap the next. Presumably some sort of controller level failure. My standard recommendation here is to consider them no more or less reliable than traditional disks and always put them in RAID-1 pairs. Two of the drives were Intel X25 models, the other was some terrible OCZ thing.
Out of more current drives, I was early to recommend Intel's 320 series as a cheap consumer solution reliable for database use. The majority of those I heard about failing died due to firmware bugs, typically destroying things during the rare (and therefore not well tested) unclean shutdown / recovery cases. The "Enterprise" drive built on the same platform after they tortured consumers with those bugs for a while is their 710 series, and I haven't seen one of those fail yet. That's not across a very large installation base nor for very long yet though.
How do they get to Silicon Heaven?
My experience was system crash due to corruption of loaded executables, then at the hard reboot it fails the e2fsck because the "drive" is basically unwritable so the e2fsck can't complete.
It takes a long time to kill a modern SSD... this failure was from back when a CF plugged into a PATA-to-CF adapter was exotic even by /. standards
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
I've had SSDs die... Basically just got an increasing number of bad blocks due to worn out flash cells.
Like spinning drives, silicon drives always die when it will do the most damage.
Like right before you find out all your backups are bad.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
I have seen SSD death many times and it is a strange sight indeed. What is interesting about it when compared to normal drives is that when normal drives fail it is - mostly - an all-or-nothing ordeal. A bad spot on a drive is a bad spot on a drive. With SSDs you can have a bad spot in one place, reboot, and get a bad spot in another place. Windows loaded on an SSD will exhibit all kinds of bizarre behaviour. Sometimes it will hang, sometimes it will blue-screen, sometimes it will boot normally until it tries to read or write to that random bad spot. Rebooting is like rolling the dice to see what it will do next - that is, until it fails completely.
I had a couple vertex2 60's in raid 0 running windows 7.
1. At first my windows would reboot in the middle of the night.
2. It kept getting worse, eventually it got to the point where it would only boot for a few minutes, before crashing. Sometimes the drive wasn't recognized on post.
3. eventually OCZ replaced it. i had to tell them the red led was blinking on the drive indicating it was dead.
I assume if it was a few cells had gone bad it could recover, but to not show up on post, there must have been some bigger issue.
Only one of the identical drives crashed, the other has been running fine for months ( as a single drive now)
Who died peacefully, in his sleep. Unlike his train passengers who died painfully while screaming..
Only seen a single SSD fail. It was a Mini-PCIe unit in a Dell Mini 9. I suspect the actual failure may have been atypical, as it seems it failed in just the right place to render the filesystem unwritable, although you could read from fairly hefty sections of it. It was immediate and irreparable, although I suspect SSD manufacturers use better quality than that built-to-a-price (possibly counterfeit) POS.
Had an aftermarket SSD for a macbook air fail twice in 2 years (threw it out and placed an original hdd after that). Both times the system decided not to boot and could not find the SSD.
In both cases I have suspected that the Indilinx controller gave way. This seems mirrored in quite a few cases with the experience of others who had drives with these chips in them.
In an ideal scenario the controller should be able to handle the eventual wearout of the disk by finding other memory cells to write to. Any cells that have been used up should still be readable as well since the floating gates basically have been filled up with electrons and will not allow further erasing.
I guess the main issue right now is the fact that SSDs can't notify the user once things get a bit too worn out. Eventually the controller won't be able to keep up with the useless cells and then might simply no longer respond. Things will only get worse as the cycle counts go down due to smaller manufacturing processes, so that useless controllers in cheap SSDs are more likely to fail.
I had a FusionIO IODrive fail a few weeks ago. It was running a data array on a Windows 2008 R2 server. It manifested itself by giving errors in the Windows event log and causing long boot times (even though it was not a boot device). The device was still accessible, but slower than normal. I think the answer to your question will probably vary greatly both by manufacturer and also based on what part of the device failed. The SSDs I've used generally come with a fairly large amount of "backup" memory on them so that if a cell begins to fail, the card marks the cell bad and uses one from one of the backup chips, much like how hard drives deal with bad sectors. As I understand it, the SSD is somehow able to detect the failure before data is lost and begin using the backup chips transparently and automatically, versus having to run a scandisk or similar to do the same on a physical disk. That may very well vary by manufacturer as well.
as the disk controller reads them their last rites before they integrate with the great RAID array in the sky.
You must be new ... oh, wait.
Wish I had mod points so bad, Izzard references are always worthy of moding up in my book. If people don't want to read the humorous posts that's what the mod system is for :D
Not with a bang but a whimper.
in a fire
Any comment mentioning moderation is automatically Offtopic.
We use SSDs in a few Windows machines at work. Running 24/7/365 production. We were replacing them every couple of years.
Eternity: will that be smoking, or non-smoking? I Corinthians 6:9-10
That is all
I have a G.Skill Falcon 64GB SSD that is failing on me. Windows chkdsk started seeing "bad sectors" (whatever this means for an SSD... I think it's really slow parts of the SSD), then started seeing more and more, and Windows would not boot. A fresh install of Windows would crash within a day or two. I did a "secure erase" and that seemed to do the job: a chkdsk found no "bad sectors". But a week later chkdsk found 4 bad sectors. It's going on a month now and I have yet to have Windows fail.
Total time from system operational to BSOD was about ten minutes. I first noticed difficulties when I invoked a script that called a second script, and the second script was missing. "ls -l" on the missing script confirmed that the other script wasn't present. While scratching my head about $PATH settings and knowing damn well I hadn't changed anything, a few minutes later, I discovered I also couldn't find /bin/ls.exe. In a DOS prompt that was already open, I could DIR C:\cygwin\bin - the directory was present, ls.exe was present, but it wasn't anything that the OS was capable of executing. Sensing imminent data loss, and panic mounting, I did an XCOPY /S /E... etc to salvage what I could from the failing SSD.
Of the files I recovered by copying them from the then-mortally-wounded system, I was able to diff them against a valid backup. Most of the recovered files were OK, but several had 65536-byte blocks consisting of nothing but zeroes.
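A comparison like the one described above (recovered copy versus known-good backup, looking for zero-filled 64K blocks) could be sketched roughly as follows. The function name is made up for illustration, and the demo uses in-memory bytes rather than real files.

```python
BLOCK_SIZE = 65536  # the 64K block size reported above


def find_zeroed_blocks(recovered: bytes, backup: bytes, block_size: int = BLOCK_SIZE):
    """Return offsets where the recovered copy is all zeroes but the backup is not."""
    offsets = []
    for offset in range(0, len(backup), block_size):
        good = backup[offset:offset + block_size]
        got = recovered[offset:offset + block_size]
        # Flag only full-length blocks that differ from the backup and contain no set bits.
        if len(got) == len(good) and got != good and not any(got):
            offsets.append(offset)
    return offsets


# Demo with synthetic data: the middle block of three has been zeroed out.
backup = b"\x01" * (3 * BLOCK_SIZE)
recovered = backup[:BLOCK_SIZE] + bytes(BLOCK_SIZE) + backup[2 * BLOCK_SIZE:]
print(find_zeroed_blocks(recovered, backup))  # prints [65536]
```

Blocks that are legitimately all zeroes in the backup are not flagged, since they compare equal.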
Around this point, the system (unsurprisingly, as executables and swap and heaven knows what else were being riddled with 64K blocks of zeroes) crashed. On reboot, Windows attempted (and predictably failed) to recover (asinine that Windows tries to write to itself on boot, but also asinine of me to not power the thing down and yank the drive, LOL.) The system did recognize it as an 80G drive and attempted to boot itself - Windows logo, recovery console, and all.
On an attempt to mount the drive from another boot disk, the drive still appeared as an 80G drive once; unfortunately, it couldn't remain mounted long enough for me to attempt further file recovery or forensics.
A second attempt - and all subsequent attempts - to mount the drive showed it as an 8MB (yes, eight megabytes) drive.
I'll bet most of the data's still there. (The early X-25Ms didn't use encryption). What's interesting is that the newer drives have a similar failure mode that's widely recognized as a firmware bug. If there were a way to talk to the drive over its embedded debugging port (like the Seagate Barracuda fix from a few years ago), I'll bet I could recover most of the data.
(I don't actually need the data, as I got it all back from backups, but it's an interesting data recovery project for a rainy day. I'll probably just desolder the chips and read the raw data off 'em. Won't work for encrypted drives, but it might work for this one.)
SSDs have an advertised capacity N and an actual capacity M, where M > N. In general, the bigger M is relative to N, the better the performance and lifetime of the drive. As it wears, the drive will "silently" mark blocks bad and reduce M. Your write performance will degrade. If you have good analysis tools, they will tell you when a lot of blocks are getting near end of life and when M is being reduced.
Blocks near end of life are also more likely to get read errors. The drive firmware is supposed to juggle things around so all of the blocks reach end of life at about the same time. With a soft read error the block will be moved to a more reliable portion of the SSD. That means increased wear.
1. Watch write performance/spare block count
2. If you get any read errors do a block life audit
3. When you get into life limiting events things accelerate to bad due to the mitigation behaviors
Be careful: depending on the sensitivity of the firmware, it may let you get closer to catastrophe before warning you. That is more likely with consumer grade drives.
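The "watch the spare block count" advice above can be sketched as a simple check. Everything here is a made-up illustration: the function names, the status strings and the `warn_fraction` threshold are assumptions, and on a real drive the numbers would have to come from vendor-specific SMART attributes.

```python
def spare_blocks_left(actual_blocks, advertised_blocks, retired_blocks):
    """Spare blocks remaining once retired blocks are subtracted from the actual capacity M."""
    return actual_blocks - retired_blocks - advertised_blocks


def drive_status(actual_blocks, advertised_blocks, retired_blocks, warn_fraction=0.25):
    """Classify a drive by how much of its original spare area (M - N) is left."""
    initial_spare = actual_blocks - advertised_blocks
    if initial_spare <= 0:
        raise ValueError("this sketch assumes M > N")
    left = spare_blocks_left(actual_blocks, advertised_blocks, retired_blocks)
    if left <= 0:
        return "out of spares"
    if left / initial_spare <= warn_fraction:  # hypothetical warning threshold
        return "replace soon"
    return "ok"


print(drive_status(1100, 1000, 10))   # prints "ok"
print(drive_status(1100, 1000, 80))   # prints "replace soon"
print(drive_status(1100, 1000, 100))  # prints "out of spares"
```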
Hi,
We bought over 70 OCZ Vertex 4, and after 1 month, we had over 20 failures. About 5 of them were DOA, and the rest died in prod. They would crash windows and would not reboot.
So my experience with SSD is, BACKUP anything critical on a regular HDD.
OCZ makes the worst SSDs in the world, and it's not even a flash wear issue. For them, it's firmware. And, FFS, you have to update firmware on the goddamn things practically daily, and you can only do it by moving the drive to another machine, or with a hokey linux bootable CD, and while the planets are in a specific alignment and while holding the rabbit ears just a little to the left, except on Tuesdays when you have to hold them just a little to the right.
They just die inexplicably, and with no warning, and all of your data is just GONE.
It's statistical, not fixed rate. Some cells wear faster than others due to process variations, and the failures don't show up to you until there are uncorrectable errors. If one chip gets 150 errors spread out across the chip, and another gets 150 in critical positions (near to each other), then the latter one will show failures while the first one keeps going.
So yeah, when one goes, you should replace them all. But they won't all go at once.
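The "statistical, not fixed rate" point can be illustrated with a small simulation. This is a toy model under invented assumptions: cell endurance is drawn from a normal distribution, wear leveling is ideal, and the parameter values (`mean_cycles`, `spread`, `spare_cells`) are made up rather than taken from any real NAND datasheet.

```python
import random


def drive_lifetime(rng, cells=1000, mean_cycles=3000.0, spread=300.0, spare_cells=20):
    """Cycles until more cells have worn out than the drive has spares for."""
    endurance = sorted(rng.gauss(mean_cycles, spread) for _ in range(cells))
    # With ideal wear leveling every cell sees the same cycle count, so the
    # drive dies when the (spare_cells + 1)-th weakest cell gives out.
    return endurance[spare_cells]


rng = random.Random(42)  # fixed seed so the run is repeatable
lifetimes = [drive_lifetime(rng) for _ in range(5)]
for n, cycles in enumerate(lifetimes, start=1):
    print(f"drive {n}: fails after about {cycles:.0f} cycles")
```

Five "identical" drives given the identical workload end up with similar but not equal lifetimes, because each one's weakest cells differ.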
Also note most people who have seen SSD failures have probably seen them fail due to software bugs in their controllers, not inherent inability to store data due to wear.
http://lkml.org/lkml/2005/8/20/95
I'm an SSD firmware engineer, so I know this all in depth. If the SSD suddenly fails, then the most likely cause is a firmware bug putting the drive into a bad state or a catastrophic NAND failure. It all depends on how well the firmware and NAND are tested. The trickiest part of the firmware to get right is the unsolicited power cycle, so make sure to shut down the system properly. As for the NAND, it might be good to do a burn-in write of random data across the full drive capacity between 3 and 30 times to scrub out the early NAND block failures. A good manufacturer would already do this.
http://computer.howstuffworks.com/solid-state-drive5.htm
Old SSDs never die. They just lose their bits.
Expected time to finish is 1 hour and 60 minutes.
I always expected the cells to go first. I was careful to avoid unnecessary writes. In the end, though, it was a known bug that killed the drive. Well, I didn't know about it, of course, until it was too late. If I'd known, I'd have updated the drive firmware to one that didn't have a catastrophic bug.
I replaced it with a Samsung. The RMA'd replacement OCZ is still sitting in its packet on my desk.
If your comment title says 'Re: Foo', I'm not likely to read it.
After reading this horror story I arrived at the conclusion that SSDs are not for me. I wonder if it's still true.
Super Talent 32 GB SSD, failed after 137 days
OCZ Vertex 1 250 GB SSD, failed after 512 days
G.Skill 64 GB SSD, failed after 251 days
G.Skill 64 GB SSD, failed after 276 days
Crucial 64 GB SSD, failed after 350 days
OCZ Agility 60 GB SSD, failed after 72 days
Intel X25-M 80 GB SSD, failed after 15 days
Intel X25-M 80 GB SSD, failed after 206 days
http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html
I recently had an "old" (circa 2008) 64GB SSD die on me. Its death followed this pattern:
After popping a new disk in and doing a partition resize, my system was back up and running with no data loss. Of all the storage hardware failures I've experienced, this was probably the most pain-free as the failure caused the drive to simply degrade into a read-only device.
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
No offense intended, but if you're new, why are you complaining about our long-standing culture of cracking lame jokes? If you don't like it, why did you join?
---Saying gnome 3 is better than windows 8 not so much a compliment as it is damning with light praise.
no duh
Well, if the drive was made by Intel, it fails because you turned off the computer or let it go to sleep.... of course newer ones do not do this, just like the old newer ones didn't do it either...
I work in a storage test lab. I've had several enterprise SSDs die randomly. They just drop off the device list in the middle of a stress test. I'd suspect it was due to a bad controller, but we didn't have time to dissect it and find out, so we just ordered a new one. It's happened from a myriad of manufacturers (who shall not be named). Granted, most of the drives that have died so far were engineering samples, but the final revisions we got later were the same hardware with updated firmware. While most SSDs are supposed to be resilient to most hardware failures, there's always a single point of failure somewhere.
SSD's are just flash memory, yea? So, all flash memory has an inherent limit of about 1 million read/writes. I would assume the SSD's would fail after they get close to approaching their read/write cycle limit.
an ask /. that actually belongs here!
Looking forward to reading the comments on the topic.
... is loss of firmware/configuration due to the firmware not refreshing these areas of storage. Flash data retention is measured in years if it isn't refreshed, and controllers that don't take this into account WILL fail prematurely, whether that means 2 years, 5 years, 10 years... internal microcontroller flash memory I use is rated to 40 years retention, but these are relatively huge flash cells.
Normally SSDs die when an X-Wing or something crashes into the conning tower and it rams into a Death Star.
(-1: Post disagrees with my already-settled worldview) is not a valid mod option.
why are you complaining about our long standing culture
lister king of smeg (2481612)
Not your first account I take it?
Apparently wizard is not a legitimate career path, so I chose programmer instead.
We've got a fair number of SSDs here. Failures have been really rare. The few that have:
#1 just went dead. Not recognized by the computer at all.
#2 Got stuck in a weird read-only mode. The OS was thinking it was writing to it, but the writes weren't really happening. You'd reboot and all your changes were undone. The OS was surprisingly okay with this, but would eventually start having problems where pieces of the filesystem metadata it cached didn't sync up with new reads. Reads were still okay, and we were able to make a full backup by mounting in read only mode.
#3 Just got progressively slower and slower on writes. but reads were fine.
Overall far lower SSD failure rates than spinning disk failure rates, but we don't have many elderly SSDs yet. We do have a ton of servers running ancient hard drives, so it'll be interesting to see over time.
Cold, alone, and in the dark.
In theory they should degrade to read-only just as others have pointed out in other posts, allowing you to copy data off them.
In reality, just like modern hard drives, they have unrecoverable firmware bugs, fuses that can blow with a power surge, controller chips that can burn up, etc.
And just like hard drives, when that happens in theory you should still be able to read the data off the flash chips but there are revisions to the controller, firmware, etc that make that more or less successful depending on the manufacturer. You also can't just pop the board off the drive like with an HDD, you need a really good surface mount resoldering capability.
So the answer is "it depends"... If the drive itself doesn't fail but reaches the end of its useful life or was put on the shelf less than 10 years ago (flash capacitors do slowly drain away) then the data should be readable or mostly readable.
If the drive itself fails, good luck. Maybe you can bypass the fuse, maybe you can re-flash the firmware, or maybe it's toast. Get ready to pay big bucks to find out.
P.S. OCZ is fine for build it yourself or cheap applications but be careful. They have been known to buy X-grade flash chips for some of their product lines - chips the manufacturers list as only good for kid toys or non-critical, low-volume applications. Don't know if they are still doing it but I avoid their stuff.
Intel's drives are the best and have the most-tested firmware but you pay for it. Crucial is Micron's consumer brand and tends to be pretty good given they make the actual flash - they are my go-to brand right now. Samsung isn't always the fastest but seems to be reliable.
Do your research and focus on firmware and reliability, not absolute maximum throughput/IOPs.
Natural != (nontoxic || beneficial)
Upgrading SSD firmware over time seems like a bum deal and I hope newer SSDs do not need this.
A Crucial C300 in MacBook Pro while idle on my desk simply stopped and the screen went white.
Turns out that in about 1 year, the Firmware revisions went from #0 to #6 or so and I was never informed you needed to do firmware upgrades.
Crucial gave me a workaround to reset the firmware. The bum deal is that I had to remove the SSD and connect it to a PC to do the reset, and then the instructions for doing the sequential firmware updates were incomprehensible, so I didn't upgrade the firmware.
Selling an SSD as a drop in replacement and not stating anything about firmware upgrades and not providing a way to easily do those upgrades with a one click application with the drive in place is BAD PR for a company. It is also bad policy to require firmware updates and not have a notification system in place.
In a flash!
I have an older 30GB OCZ drive, that has undetected read errors every 4-5 full reads. So far I have done only MD5 sums and one in 4 or 5 is different. This is the worst case of failure possible, as SSDs should, just like every other drive, have a very, very small probability of not detecting an error, far smaller than having an error.
I suspect a firmware bug in the error-correction implementation and a flaky cell that mostly reads right, but sometimes does not. This shows that at the moment, firmware bugs are a significant risk with SSDs. Fortunately the 4 other OCZ ones I have (60, 128, 240, 255G) do not seem to have this problem, but I can only be sure about the 60G one, as that one is in a 3-way RAID1 that is subject to a consistency check every week.
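The repeated-read check described above (hash the same data several times and see whether the sums agree) might look something like this sketch. The function names are made up, and the demo runs against a throwaway temp file. On a real device the OS page cache can satisfy repeat reads from RAM, so the cache would need to be dropped or bypassed between passes for the test to mean anything.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file (or device node), read in 1 MiB chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def repeated_digests(path, passes=5):
    """Read the same data `passes` times and collect one digest per pass."""
    return [md5_of(path) for _ in range(passes)]


def reads_are_consistent(digests):
    """True if every pass produced the same digest."""
    return len(set(digests)) == 1


# Demo on a temp file with random contents; a healthy read path gives identical sums.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1 << 16))
    demo_path = tmp.name
try:
    digests = repeated_digests(demo_path)
finally:
    os.remove(demo_path)
print(reads_are_consistent(digests))  # prints True
```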
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
We have a great description of the wear they will take and how to mitigate wear in the DragonFly BSD swapcache manpage. http://leaf.dragonflybsd.org/cgi/web-man?command=swapcache
To summarize, each nand cell is able to be programmed a finite number of times. Voltage required to program a cell increases as the cell becomes more worn, good drives ship with more storage than they report and will automatically relocate data to cells that are still good and retire cells that are worn out. Enterprise-class drives have a larger amount of reserve storage than consumer-grade drives do. Drives which use cheap NAND alongside an inadequate controller can result in data loss because the controller did not relocate your data before it was unable to retrieve it from a cell (OCZ).
I had a 60GB Mushkin Chronos in my MySQL server. It died in 13 days. I do not know for certain what actually killed it (it's a low profile server, with only about 3,600/1,500 select/insert statements an hour), but I received a replacement disk within hours, swapped them out, and the replacement snuffed the dust in just under 18 days. I was unable to read or write to them. They just went out, like someone flicked a switch.
That's when I decided to stick with the regular spindle disks for my database server...
I bought two of those 60GB disks at the same time; used the other one as the OS disk for my workstation. It's still running good, almost two years later. I recently purchased a 120GB Mushkin Chronos Deluxe and replaced my 60GB disk (which currently only holds my development projects and few, select games).
-- Chaos, panic, pandemonium... My job here is done!
Tin whisker growth is another way not directly related to the flash cells. Commercial electronics use lead-free solder and no real whisker mitigation techniques. Eventually a whisker shorts between two things that shouldn't be shorted, conducts sufficient current for a sufficient amount of time, and poof, your drive is dead.
I've seen many SSDs die in enterprise SAN kit and will even go so far as to say that, as a percentage, they die far more often than spinning rusty metal.
However, this is because of how they are used in SAN. Often they are used in a multi-tier way, where most frequently accessed data is pushed up to the SSD's to allow quick access, so they get hit the hardest.
I would guess you're asking this question simply because it's easier to understand why a part in motion can slowly die over time where something that's in silicon shouldn't. You'd probably be surprised to know that a lot of drives die a controller death, not a platter/motor death. A platter/motor death usually means a badly made drive (mostly because makers of spinning rusty metal have gotten very good at the mechanics behind them), and while the mechanics in a drive will slowly wear over time, typically something silicon in the controller goes first. The exception is where drives are physically moved around (laptops, for example); then the drive shaft often starts to wear unevenly, and that usually gets worse over time (or at least, this is what I've been told).
Some SAN makers will even put HDDs through their paces first to make sure they actually perform OK. For example, they'll measure vibration and other mechanical characteristics to make sure a drive is up to spec before it goes into their kit. They can't really do the same as easily with silicon, as there's not much to work off that can be measured, so often in enterprise grade SANs, silicon dies before mechanical.
http://www.makeuseof.com/tag/solidstate-drives-work-makeuseof-explains/ is a fairly good (and relevant) explanation of why SSDs die.
Having said all that, I honestly can't wait for the death of spinning rusty metal, for the simple reason that SSDs should take on (and haven't yet taken on) forms which would be much more useful - why use a SATA 2.5" format when you could have much better geometries, for example? Then there are the interfaces we use, which were really designed with hard drives in mind... but that's an entirely different issue.
As for "if it's all the same, why doesn't it die at the same time?"... because at a fundamental level, it isn't the same. When we make anything, no matter what it is, the materials go through some form of refining process where impurities are removed. It's impractical and near impossible to get anything 100% pure (not to mention entirely uneconomical - you might pay $1/tonne for 90% pure iron and $100/tonne for 91% pure iron as an example). The nature of where we get those materials often means the impurities vary greatly in composition over small spaces of time, which is why, of two HDDs sitting next to each other on the assembly line, one might die after 2 days and the other after 200 years. There are also other components to this: if you looked at iron (again an example, but true of most metals) under a microscope you'd find that it isn't a uniform substance; it's quite grainy, so to speak, and those grains vary considerably. This too impacts the nature of the substance and its longevity. Using iron as an example, when it's cooled it's very hard to get truly consistent cooling across the entire piece, and the cooling determines how those grains form - i.e. grains in the center of the material will be quite different from grains at the edge and so forth. That's just a small number of the things that explain why consistency isn't quite as close to 100% as it might appear on the surface; there are quite a lot of factors that come together to affect the materials we use. Ultimately the way we choose the materials we produce things with is by tolerance, i.e. I expect 99% of the metal you sell me to be at least 90% pure and have x tensile strength, or any number of variables you might consider important to your manufacturing process, but even then it's never 100%. You always accept that at some point you'll get raw materials that fall outside those tolerances, and as with everything on the planet it's a trade-off between price and quality!
after unexpectedly getting a static charge (yes, in the server room) when touching a box It caused me to wonder if adequate grounding is always present, and what effect that might have on an SSD.
I have added a big ground wire to each of my servers. I attach it to one of the screws holding the power supply to the frame.
When you talk about volume sizes becoming out of sync, are you referring to cells becoming worn out and being de-provisioned? Many drives, especially enterprise SSDs, are overprovisioned, meaning you might buy a 480 GB drive that really has 512 GB of space. The overprovisioned space is used to replace dead cells. Some drives even allow you to customize the amount of overprovisioning. For a great number of use cases, it "should" take 10-20 years to reach the point where the writes have worn out the drive. The extreme edge case would be a system that is constantly streaming recorded video from some surveillance system, and even then it should take a few years to wear out the drive.
In my own experience, most deaths of electronics involve a moving part failing: a fan goes and then the component overheats. That is why the failure of an HDD late in life is probably quite different from how an SSD might fail, since HDDs have mechanical wear which eventually leads to failure. Additionally, SSDs produce a tenth of the heat of an HDD, so heat-related wear and failures are much less of a risk, I am guessing.
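To put a number on the 480 GB / 512 GB example above, here is a minimal Python sketch. The function name is made up for illustration, and expressing the spare area as a percentage of the user-visible capacity is one common convention, assumed here rather than taken from any vendor's datasheet.

```python
def overprovisioning_pct(raw_gb: float, usable_gb: float) -> float:
    """Spare flash as a percentage of the capacity the user can see.

    Hypothetical helper: the formula (raw - usable) / usable is an
    assumed convention, not something a specific vendor documents here.
    """
    if usable_gb <= 0 or raw_gb < usable_gb:
        raise ValueError("expected 0 < usable_gb <= raw_gb")
    return (raw_gb - usable_gb) / usable_gb * 100.0


if __name__ == "__main__":
    raw, usable = 512, 480  # the figures from the comment above
    spare = raw - usable
    print(f"{spare} GB held back for replacing dead cells "
          f"({overprovisioning_pct(raw, usable):.1f}% overprovisioning)")
```

So the drive in that example holds back 32 GB, a bit under 7% of what you can actually address. Drives that let you raise the overprovisioning are simply trading more of the usable figure for spare cells.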
I am guessing that most SSD failures would be early in their life as it exercises different chips/cells and finds a flawed component. I wish there were reliable statistics on DOA rates for SSDs.
Other failures would be related to degradation of circuits over time. I forget the term (electromigration, I think), but the metal in circuit boards degrades over time as electricity passes through it until it causes a faulty connection. This is something that affects anything that is on a circuit board, so it is no more of a risk than your super-old, can't-replace RAID card failing or the circuit board in an HDD failing. Hopefully it would fail outright, and not start corrupting data silently. I don't know how long it takes for that to become a risk, or if there are things manufacturers have done to mitigate it.
I am also curious about the failure pattern for SSDs. Even in the world of HDDs and RAID, there are a lot of proprietary tricks that make failure scenarios and behaviors complicated and hard to predict. Legacy file systems tend to be overly trusting of the drive's error reporting. Additionally, silent data corruption is becoming a bigger risk with larger drives, and this is why newer file systems are taking a bigger role in data integrity: there are scenarios that even RAID 5 and 6 don't handle or detect.
Disk array manufacturers are dealing with this in a couple of different ways (I work for one).
1) Using different methods to determine when a SSD will fail, and proactively sparing it out
2) Inline dedupe at the cache level to reduce writes before they even hit the disks, extending disk life (example: http://www.theregister.co.uk/2012/08/27/xtremio_projectx_unveiled/)
3) MLC drives that are supposed to be "enterprise" grade. Theory is that if you can find creative ways to reduce writes (such as the previous item), you can avoid the expense of SLC drives. Large storage vendors who got into flash early typically used SLC, but expect MLC to become more accepted (cost being one big reason, improved reliability another).
Just remember, when flash drives die they really die. Due to the way files are stored you can't just ship the drive off somewhere and get files recovered. This isn't a bad thing, but something people need to keep in mind.
As far as laptops/desktops go, beware of things that increase writes. Full disk encryption is good, but if the file is encrypted after it is written, you've doubled your writes without even thinking about it. That is just one example of things that can cause flash drives to fail a little earlier than you expected. I've seen MLC flash drives that are used for array caching (hot data blocks written to flash for better response, data constantly being promoted/demoted to these drives) hit their write limit in 9 months. Not die, hit their write limit.
Ever feel like you are driving the getaway car?
I can't decide if parent is the most useless post yet, or the lamest attempt at humor yet. I'll settle it by making this post so the original distinction is moot.
Heat will kill an SSD very quickly.
Over 110°F and you should be thinking of improving your cooling.
A lot of problems are also caused by bad cables and/or connectors. I've run into a lot of really crappy SATA connectors and cables, and SSD speeds really show up their faults.
Also, Windows and many other OSes have a lot of default behaviour that is not really compatible with long SSD life. Look into tweaking your OS for SSD use.
Eventually they'll get around to fixing these issues as SSDs get more common. But for now there's no reason to let Windows chew up an SSD just because it was set badly by default.
If you order large enough batches of drives you'll eventually see one die. We had an Intel 630 600GB die in a RAID 60 on my Dell R610. Symptoms included the drive reporting the wrong drive size and overall corruption in the firmware.
On my Kingston, half the filesystem turned to crap. I managed to copy off some of the more important bits which I foolishly hadn't backed up (scripts and stuff - it was the OS drive).
A bad block check revealed that about half the drive that was in use was dead - the blank areas were fine and dandy. I tried to image it the following day (to avoid a reinstall if I could) and at that point the drive ceased to be recognised by the BIOS.
I should probably add that it failed just two days before the warranty expired. However, they had discontinued them (I wonder why...?) so I got a refund and went back to a spinning rust drive for the OS for a couple of years. I do have another one now as they were on sale - hopefully it will last longer this time.
We had a user's drive fail badly.
Extensive data corruption across the entire drive.
Most if not all of the files were still present, but almost none of them could be opened.
We even sent it to DriveSavers and there was nothing left to recover.
I had a small OCZ SSD of some variety in my foo-server (which mounted the NAS for all the important changing data). One day I realized that / had gone read-only days earlier. Console showed a write failure to the journal (ext3).
Rebooted it, and it worked for ~1 day. Reformatted (managed system, I have no idea if there was data corruption. Didn't seem to be any, but I didn't look for any) and it worked for around 1 week. At that point I gave up and replaced it. It had lasted for just over a year when it failed.
The two Intel SSDs I've bought have not failed yet, nor has another OCZ brand SSD (Vertex3, fwiw).
There was a time when SSDs made by a certain company whose name means one millionth of a meter would fail at a certain number of hours, even if they still seemed good. It was near 4K power-on hours. The place I worked that bought those drives had everybody run smartctl on a regular basis to make sure we were still safe. Multiple disks in different machines failing right near that 4K window made it awfully suspicious. Our IT guy called them, and we got a really nice discount on the next drives we bought, which didn't have that limit.
They die quickly when you buy them from OCZ... then when you RMA them, they say they'll replace it, never do, and hope you just forget about it or something.
http://goo.gl/H34U6
Make America grate again!
I would recommend watching your SMART sector reallocation totals as an indication of drive health. As the drive starts aging, it will start needing to reallocate the weakest cells first, so you should get some warning.
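For anyone who wants to automate that check, here is a rough Python sketch that pulls raw values out of an attribute table in the layout `smartctl -A /dev/sdX` prints on Linux. The sample text is made up for illustration (not captured from a real drive), the helper names are hypothetical, and attribute names such as Reallocated_Sector_Ct vary between vendors, so treat this as a starting point rather than a drop-in monitor.

```python
# Illustrative sample only: laid out like `smartctl -A` output,
# but the numbers are invented, not taken from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       3987
"""


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for rows with an integer raw value."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns;
        # this skips the header and any surrounding banner text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = int(fields[9])
        except ValueError:
            continue  # raw value in a vendor-specific format; ignore it
    return attrs


def newly_reallocated(previous, current, name="Reallocated_Sector_Ct"):
    """How much the reallocation count grew between two snapshots."""
    return current.get(name, 0) - previous.get(name, 0)


if __name__ == "__main__":
    now = parse_smart_attributes(SAMPLE)
    last_week = {"Reallocated_Sector_Ct": 10}  # hypothetical earlier snapshot
    print("reallocated since last check:", newly_reallocated(last_week, now))
    print("power-on hours:", now.get("Power_On_Hours"))
```

The idea is to keep the previous snapshot around and alert on growth rather than on the absolute number, since a count that keeps climbing is the warning sign described above.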
I only put the OS partitions on the SSD: /boot, /root, /usr, /usr/local and /var. That way it boots and loads large software very quickly. My /home stays on a spinning magnetic disk. Yes, I back up regularly, so it's trivial to replace the SSD.
I have a 90GB Kingston I bought used, that I use for my Linux partition. I've done the usual tweaks to minimize writing. It's been working fine for the single month I've had it; blazingly fast boot and load times, etc., but I'm running a full backup right now as I type, and I'm going to schedule weekly ones. As far as brand names go, I've heard very little that's good about OCZ, whereas Intel SSDs are usually praised through the roof. I'm going to get one for my Win 7 partition in the next few weeks. The 330 Series is pretty price competitive.
I'd love to see real data on SSD lifetimes. Here's mine:
2x OCZ Summit 64GB (circa 2009) - See note below for issues I had.
2x Intel X25M G2 160GB - Installed in March 2010 - Both have worked flawlessly and both show 99% of drive life available by SMART E8 entry. One is my main desktop and one my main laptop. Never had an issue with either. Both have estimated EOL of November 2020 and Dec 2021 by SSDLife.
1x Intel X25M G2 120GB - Installed in April 2010 - 99% drive life available by SMART. It is the boot drive for Server 2008 R2 and is only a file server. Not much to do there so I'm expecting a very very long life. Estimated EOL Nov 2027 by SSDLife.
1x Crucial C300 128GB - Installed Nov 2010 (was boot drive for 2 months but now used for games only) - 86% of life remaining and EOL is Nov 2020.
I don't go too far out of my way to minimize writes. I always disable hibernation and pagefile in Windows for all of my machines. I never use hibernation and my RAM is always 16GB or more. I use the drive like I normally would without regard for the "limited lifespan". If I was going to do something like copy a blu-ray or reencode a video I used to do it using only local drives and then copy it to the server. Now I just do it over the network shares. Otherwise I use my drive just like I always would. Run BOINC on it, etc.
I've gotten a few friends into Intel SSDs, and none of us have had any kind of failure at all ... yet. Everyone's drives are listed as having EOL of 2020 or later. If these drives REALLY do last that long, I expect we'll be throwing them away before 2020 because a 128GB drive will be too small for the OS and a few common programs(Office, etc). I used to tell people to go big because they can take the drive from machine to machine over the next 10 years. It really just doesn't matter though, they're dropping in price so fast you should just buy what you will want for the computer you are using.
One friend bought an OCZ drive because it was really cheap at the time after rebates. He has had to RMA it 4 times in less than 12 months. He's the only person I know personally that hasn't bought Intel, and he is the only one to have any problems.
Personally, I swear by Intels. My experience has been phenomenal with them. I have yet to see an SSD failure personally, and it seems that lots of people have heard stories of Intel drives failing, but I haven't met anyone personally. My experience is that Intel SSDs, reliability-wise, are far superior to rotating rust. I am a little concerned now that Intel is getting away from using their in-house controller and going to Sandforce. After seeing what OCZ drives do and the fact that they use Sandforce I'm a little hesitant to expect a long lifespan from them.
I'm wondering if Intel switched to the cheaper Sandforce despite the lower reliability only because they want to be competitive for the price. Who REALLY buys an SSD expecting a 2020 EOL? Allegedly the newer Intels will have a SMART failure message when you have 1% of the drive left. Intel says that for most users that should be about 2 months of regular use since 1% is not really 1/100th of the drive life remaining. If this is true and I can expect to own the drive for 3-5 years and the drive will give me a SMART error when it is nearing EOL, what more could you ask for? That's nirvana for me!
I will say this. Putting SSDs in every computer I own makes them MUCH more responsive. I've always upgraded every time a new Intel CPU design came out. Right now my desktop is using an Intel i7-920! That's circa Nov 2008. I've NEVER had a computer more than 2 years. Thanks to SSDs the machine still works great 4 years later. I'm thinking of upgrading with the next Intel CPU generation only because the machine is getting old and as a geek I need to be able to justify my geekiness. It's hard to call myself a geek if everyone else is buying $500 Dell machines with more power than my machines. A friend bought a hybrid hard drive. The
Never had a problem on my corsair gt ssd 120gb. Not one bluescreen ever. Do you guys recommend turning the computer off, or leaving it on, to prolong ssd life?
That is why one uses RAID 6 with lower tier drives and hot spares.
Works great until 3 drives in the RAID fail.
Better make it a RAID-60 just to be safe. And maybe mirror that too.
We have replaced a few 8x 15K RPM RAID5 with OCZ PCI cards of 0.5TB and 1TB. They were serving databases with high update frequency (~500Hz average in the long run). At first they were fantastic, iowait was gone, great performance, just as expected, enough to get us hooked on them and order more :)
However, within a few months the performance had deteriorated dramatically, to the point where they were much worse than the disks they were replacing. Writing /dev/zero to them a few times restored the performance for a short time, but finally one card died dramatically, another lost 2 of the 8 slots ... In the end we moved back to the disks, plus as much RAM as would fit in the servers, and called it a failed, very costly, experiment. I'd argue that SSDs are probably good only for PCs, since one can fit a comparable amount of memory in a server for read performance, and if you need high write rates you would destroy them quickly anyway...
SSDs follow the familiar engineering "bathtub curve" for failures, just like anything else. But...
In the case of SSDs, the failure of individual memory cells will follow this curve. Therefore, you will probably not notice a few dead cells at first, because SSDs are built with a few "extras." But after a number of writes, 100,000 or so, cell failure rates will rise quickly. That is why SMART or other native drive-lifetime software will exclude the failed or over-used cells from use.
Result is that an SSD drive is not likely to "fail" in the sense that a platter HD crashes. Instead, it will just slowly lose usable capacity. Once it's enough to notice, it is time to replace.
But that's really only if it's in a server. If it's in your laptop, you will have upgraded to a new laptop long before noticing the SSD capacity degradation.
Either, all the answers you find are your own, or all the answers you find are things you've already tried. And the enormous amount of advertisements by companies that will sell you your problem, or the solution for it, most often both at the same time.
I was promised a flying car. Where is my flying car?
When you order systems with a bunch of drives in them and a RAID controller, some "quality" manufacturers take it upon themselves to actually mix and match drives for you, so they don't come off the assembly line right after each other. This is probably why you don't see "mass failure" happening on those systems.
I was promised a flying car. Where is my flying car?
...sometimes. An SSD is at least as dangerous as a RAID0 array, make backups often.
"When information is power, privacy is freedom" - Jah-Wren Ryel
Just had my OCZ 1208GB SSD fail on me after 4 months. I was experiencing bluescreens, slow reads/writes, and overall just weird behavior. I'm sending it back now for RMA. At least they have a 3 year warranty...
1) OCZ drives are GARBAGE along with most products put out by that company. Avoid at all costs.
2) Crucial m4s, as wonderful as they are, had a firmware bug in the last version, whereby if the power was lost in some circumstances the drive would then fail to POST. The fix, strangely, was just to apply power with no data cable for two 20-minute periods. So just hook the drive up to power and wait 20 minutes, shut down, and then do it again.
It's amazing, but I have personally fixed 3 drives doing that magic trick. I believe they have corrected this bug in their 000F firmware. Otherwise it seemed to occur mostly in laptops when they were shut down improperly. It was scary for sure! The drive appeared to just disappear from the machine.
As a potential lottery winner, I totally support tax cuts for the wealthy
Do we still not have any type of cheap/fast/small memory storage medium that does *not* degrade over time? Physical hard disc drives can have mechanical failure/bad sectors, SSD cells die after a hundred thousand writes, even CDs and DVDs start to exhibit bad sectors after years of storage. It would be nice to have a drive that will last as long as I do, if not much longer.
Half the time a mechanical disk goes it isn't the spinning bit; it is either the power supply or the controller board that just goes dead.
Those pieces are identical in an SSD, and have no reason to be any more reliable.
An ask /. that's not insulting.
The now-fairly-ancient SSDs in my laptop, which are Samsungs from around 2009, seem to be dying slowly; every so often I'll get some ATA errors reported in the logs and the drive will either be remounted RO or simply become entirely inaccessible, resulting in everything you try to run that's not currently completely cached failing to work with an I/O error or a segfault. After a reboot, it usually comes back up working okay, but I'm sure at some point it won't.
So...really, pretty much exactly like a typical spinning disk failure, in this case. So far anyhow. I've seen the same 'periodic failures, followed by a day where it just won't work any more' pattern with spinning disks before.
A: Memory cells begin to die off faster than the SSD's controller can mark them as bad and reallocate the memory. This initially shows up as a major slowdown, then as crc32 errors, which increase in frequency and severity as overwrites fail to complete correctly. The issue accelerates until the drive becomes unusable. This failure is usually due to heavy use, age and cheap, cheap memory.
B: A solder joint on a chip cracks and takes out the chip and, since the entire array of chips is set up RAID0 style, the entire drive is mysteriously dead one day. This occurs due to an extreme difference between the hot and cold temperatures the drive is exposed to, not from itself but from other components; lead-free solder has multiple metals in it which expand and contract at different rates, so as you heat up and cool down you cause extreme contraction and expansion. Like bending a fork too many times, microfractures form which eventually coalesce to become one big open in the circuit.
C: Shorting of the internal chip components, causing the infamous "black glass" situation where the voltage and grounding planes of the chip short out and heat up, and you get to see black glass on the very top of the chip and sometimes a small distortion.
D: Firmware memory fails. Shows up as every single weird issue you can imagine.
E: Defects in the drive such as poor connections between the die and external connectors, or lack of shock resistance during shipping for certain solder joints. These drives usually fail quick and hard.
All of the above are basically possible, save for Point A, on a regular hard drive.
Fact: If a hard drive goes, DriveSavers can toss it under an electron microscope and recover the data. SSDs have no known recovery methodologies because the above failure modes usually physically destroy the data.
Point A makes RAID arrays using SSDs particularly interesting, since if you purchase a box of drives with similar serial numbers and start running them at the same load over time, you're bound to end up with them failing near the same point in time. Thankfully, however, different cells on each drive are going to fail at different times. The majority of hard drive failures are mechanical in nature, as wear occurs at different rates for different disks.
SSDs are GREAT for certain applications where shock resistance and speed are key; you can get 15 times the random read/write at 1/100th the latency out of an SSD compared with the priciest hard drive, and for a fraction of the cost a server racked with drives can fully saturate its network ports. For doing large-volume data projects or running a fully virtualized infrastructure that needs tons of I/O, there really is, IMO, no other option. Doing so, however, without backups upon backups is suicide, for the same reason running a SAN indefinitely without a backup is suicide. Thankfully, running VMs makes backing up and restoring a breeze.
About a year ago I had a new vendor ship about twenty computers to my company. They were supposed to contain Intel SSD's but instead contained Kingston drives. All of those drives failed within a year. As they fail I have been swapping them out with Intel drives and they just chug along nicely after that. I even took one of the RMA'ed Kingston drives and put it in my laptop to see how it did. It made it four months and failed just last week.
Here's what that failure looked like. It started off where the system would just freeze up for a couple of seconds about once a day or so. After a week that started happening more like once an hour, but the system would always come to life again; it would just pause for 10 seconds or so. Then after about another week it started hanging completely. First once a day, then several times a day, requiring holding the power button down and restarting the system. Then I RMA'ed it.
The original drive was a Kingston SSDNow 100, and lasted 11 months; the replacement was an SSDNow V200, and it lasted 4 months. They just shipped me a replacement today, we'll see. It may go straight on eBay!
Could run a cubed strip of raid 6 arrays in a RAID666
Paying taxes to buy civilization is like paying a hooker to buy love.
Hey everybody, the question was about COMMERCIAL drives, NOT consumer drives. You want reliability, you want SLC. No enterprise in their right mind would rely on MLC in production critical environment. Stop giving the original poster advice about consumer crap! Please?
The system asks "What SSD"?
- real hackers don't have sigs -
http://xkcd.com/979/
-=This sig has nothing to do with my comment. Move along now=-
Catastrophic failure of the internal controller circuit is one possibility. It happened to me with a small G.Skill SSD. That wasn't my judgment of what happened, that was G.Skill's. The data might have still been there, but I had no way to access it. As far as the computer was concerned, the physical device still existed but the media and partition didn't.
That was one of two SSDs that I have bought, so from my perspective it's a 50 percent failure rate for the technology. Here's the irony: I have a Conner Peripherals 170 MEGAbyte IDE platter drive - from about 1992? - that still works. I have a small box full of old magnetic platter drives like that one that still work. In 25 years of using platter drives, I've had perhaps three physically fail. Am I going to be able to say the same thing about the SSDs I have now in 20 years, especially given their guaranteed obsolescence? Not a chance. YMMV, but not by much.
In the place where I work we use some old IDE flash modules; one could call them early SSDs.
Basically, this is what happened when they died on us (nearly all of the 300+ died within a few months, after approx. 2 years of runtime):
1st, you might see some errors which are recoverable through firmware.
2nd, you might see non-recoverable sector write errors, making the partition read-only (we use Linux).
3rd, you see read errors, though only a few of our drives reached that state.
regards
I've had three 128GB Extrememory SSDs fail, with 6-18 months of use. One is still running, as is another Samsung 256GB.
The ultra-cheap SSDs in my servers lasted only 3 months. The 4 OCZ Vertex 3 IOPS have so far lasted over a year with ~2TB processed per disk, and 2 Intel SLCs and 2 MLCs are already over 2 years old, over which time they have processed ~10TB each (those were all enterprise grade or close to it). They are in a 60TB array doing caching so they regularly get read/written/deleted. I have some OCZ Talos (SAS) as well, where one was DoA and another an early death, but I simply shipped them in for RMA and had another one in a couple of days. The rest of them are well over 6 months and still going.
Several other random ones still work fine in random desktop machines and workstations.
As far as spare room on those devices, depending on the manufacturing process you get between 5 and 20% unused space where 'bad' blocks come to live. I haven't had one with bad blocks so most of mine have gone out with a bang, usually they just stop responding and drop out, totally dead. I would definitely recommend RAID6 or mirrors as they do die just like normal hard drives (I just had 3 identical 3TB drives die in the last week)
Custom electronics and digital signage for your business: www.evcircuits.com
I'm just here to watch all the old-timers post.
We have an array of SSDs acting as cache for our multi-tiered file system; 15,000 RPM SAS -> 7,200 RPM SAS -> Tape running Solaris.
We had one of these SSDs fail and continue to operate in a read-only manner for a while. It was really tricky for us to actually detect that the problem was with the cache, caused by the faulty SSD. It was even a proper enterprise grade SSD - but I guess when you are using an SSD array as a cache for a file system used by thousands of users, it's not that surprising when you have one of the aforementioned SSDs fail.
We have a smaller setup of SSDs on a ESXi host, once again the SSDs acting as cache for the File System. This has really helped our rapid development for our Standard Operating Environments and for other projects where setting up a physical box is more time consuming than virtualisation.
I have no theories for why they actually fail, just found that some of the consumer grade OCZ drives have been particularly notorious as a place to store single copies of things, although since the Vertex2 series, I think this has improved. I guess Vertex2 was really only 2nd/3rd gen SSD, so the market is still in the maturing phase - I will choose SSD for my next personal computer without doubt.
I have ordered approximately 500 Intel SSDs over the past 18 months (320 series and 520 series primarily). To date, we have had exactly one fail to my knowledge. It was a 320 series 160 GB with a known firmware issue. We have around 80 of that type and size, and the drive that failed did so on first image. We RMA'ed the drive and got a replacement.
One G1 Intel X25-E 32GB unit out of 16 failed for me about 8 months after deployment. The unit would no longer respond to SATA commands. It was like the drive vanished. Intel replaced it no problem, but that's the only SLC SSD failure I've ever encountered.
Died two weeks ago. A strange noise came out of it. After opening the case, I noticed that a diode (in series right after the point where the SATA power supply enters the device) had a burn hole in it. After some measuring, I replaced it with another diode. After powering up and a few tries, the device eventually woke up again, registering itself as an Apex drive with the operating system. However, the data was not readable anymore.
Thanks for the response! I was hoping it was all mdadm, as I love to use software raid.
May I ask, what was your use-case for using this hybrid approach? Did you do much benchmarking with the applications you were trying to benefit from faster reads? Did you also tune FS parameters like +noatime and tweak block sizes and such to minimize writes?
Would a system like this with full disk encryption be any better/worse off? The first enc pass (say with truecrypt where it writes to the whole drive) adds extra wear, but it should happen only once with subsequent writes changing small portions of the disk.
In Soviet Russia, SSD fscks YOU!
My first SSD, an OWC 120 GB SATA2, died in my notebook. It started with hidden data errors when writing large files (4+ GB), followed a few days later by CRC errors. Finally, it couldn't boot. However, I could still copy most of the data from it by attaching it to another PC.
My second SSD, an OCZ Octane 128 GB, occasionally threw a lot of errors for a few months. Then, suddenly, it couldn't be mounted any more. However, after a secure erase, it worked again without any errors.
long time ac before setting up account
---Saying gnome 3 is better than windows 8 not so much a compliment as it is damning with light praise.
I had something worse. I got a Crucial Adrenaline SSD cache so I could use my 2TB drive as main storage and still enjoy the speeds. Well, it was working beautifully, even after a power outage or two, but not three. Something went bad and it left my main drive unbootable. To make things worse, it turned it into a dynamic disk, so I had to go through all kinds of rescue software to find one that could convert it back without destroying it, so that I could use another app to recover files once it was a basic volume. Days later I got the data back, although not as pretty as I remember. All I know is that if you turn on your PC and see scandisk giving you shitloads of INODE errors, immediately shut down your PC if you want to keep your data, then try to restore it with some NTFS restore app before your MFT gets royally screwed. I formatted my 2 TB drive and all was well again. My Crucial Adrenaline stayed in my machine, but not installed, out of fear. I wouldn't trust any SSD without a UPS power backup.
What is your expectation of how the drives will begin to fail - are you expecting bulk simultaneous failures or are you expecting to get plenty of degradation warning before you start to see failures?
-- A change is as good as a reboot.
I had a Dell laptop whose SSD died from a couple of drops of water, because the chips were exposed without an enclosure and the drive was located directly under the keys.
In my limited experience (we have a VNX SAN running a few cages of EFDs, 16 servers with RAID 1 SSDs and a mix of laptops running SSDs), most die with no warning whatsoever.. they just cease to exist with nary a whimper.
the data just fades away. ;-)
Assuming it is defect free, the flash will slow down. Write cycles will take longer, and then it eventually becomes unable to store data, one bit here, one bit there. I don't know if SSDs hide all of this from you, and for how long. I assume that there is a way to determine how many bits have failed, and that there is perhaps a threshold where you will start to get warnings from the operating system. Again, that's assuming defect free, which is not an assumption I would make.
I've had many flash drives fail and that's pretty identical, assuming the SSD's chip didn't burn out or something instead. When flash memory fails, you get quirky delayed write failures but can typically still read the data for a short period of time. That happened with all 3 of my drives that failed (they were really bad brands). And by the way, everyone's hating on OCZ since their 1-3 drives were a catastrophe but their version 4 ones work great as of firmware 1.5, which they all ship with. I've used about 15 for builds over the last several months with no problems. Intel I had 2 slight problems with though. For example, their own latest of the late copy of the bootable firmware flasher doesn't recognize a 60GB 330 Maplecrest under any circumstances on any board with any SATA controller, even an Intel one.
You are welcome. mdraid is pretty cool indeed.
My original use case was that I had foolishly put my mailbox into Maildir format (one file per message) and opening it had become incredibly slow (minutes). With an SSD that was not an issue anymore. On the other hand, I did not want something this important on a single drive. Somebody suggested that --write-mostly might help. Tried it, and opening latency went down to the speed of a bare SSD. No filesystem tweaking whatsoever, so I cannot tell you whether it makes any difference.
By now I have the root partitions on two machines in the same configuration. Everything is just much, much snappier.
I have not done it for encrypted volumes so far. It should work, but there is still some subtle problem with Linux dm-crypt and RAID or LVM that causes massive write slowdowns for some people. Others have no problem at all. It may be due to the differences in write-scheduling. On reads, no problems that I know of, so reading should get the low SSD latency and, depending on CPU power and cipher, most of the SSD read speeds.
You do not need to worry about the wear of a single disk overwrite with a modern SSD. Even the lowest end FLASH chips can take 2.5k overwrites.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
My two experiences with failed SSDs have been the same. Everything seems just fine for a year or two and then, bang, the system locks up. Reboot and the drive is either completely inaccessible, or visible but completely unreadable and unwritable.
More worrying though is that this has happened twice, with different manufacturers, in the last three years. In the previous 15 years, I have only had two personal drive failures and both were such that I could scrape data off the drives before throwing them out. I've got a box on the shelf that has nine drives in it. They all work just fine and have been replaced over the course of many years due to upgrades, new system/more capacity.
It's anecdotal, but for me SSDs have proven to be very unreliable and short lived.
Someone told me rather recently that SSDs fail with spectacular explosions, fireworks, and burning. A read of several comments on this page speaks to the farcical nature of this idea. Can anyone comment on this?
Sadly, a Libertarian cannot force his views on another, and freedom cannot spread as does the cancer known as religion.
I work with enterprise-class SLC SSDs, 200 & 400 GB, by HGST (now a WD company).
I have yet to see a drive fail. Note: this SLC drive is 80% over provisioned and has a 5 year warranty that assumes a full drive write 30 times a day.
FWIW: My lab sample size is 60 drives, and thousands shipped to customers. Street price on this drive is $3K
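For a sense of scale, the warranty terms above can be turned into a lifetime write volume. This is only back-of-the-envelope arithmetic, not a vendor figure: it assumes 365-day years, decimal units (1 TB = 1000 GB), and that "a full drive write" means the nominal 200 or 400 GB capacity. The function name is made up for illustration.

```python
def warranty_write_volume_tb(capacity_gb, drive_writes_per_day, years):
    """Total data written, in TB, if the whole drive is rewritten
    `drive_writes_per_day` times every day for `years` years.

    Assumptions (not from any datasheet): 365-day years, 1 TB = 1000 GB.
    """
    days = years * 365
    total_gb = capacity_gb * drive_writes_per_day * days
    return total_gb / 1000


# The two capacities mentioned in the post, at 30 drive writes/day for 5 years.
for capacity_gb in (200, 400):
    tb = warranty_write_volume_tb(capacity_gb, 30, 5)
    print(f"{capacity_gb} GB drive: {tb:,.0f} TB ({tb / 1000:.2f} PB) over the warranty")
```

Under those assumptions the 200 GB model is covered for roughly 11 PB of writes and the 400 GB model for roughly 22 PB.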
What is the price premium? (Off to normal hardware site ...)
Seek times and data rates?
Do I need that?
Shiny, new and high-tech? Yes, but I still don't need it.
I know that it's a very un-Slashdot thing, but I don't need that shiny-bright-new-unreliable tech.
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
I personally know of (3) OCZ Agility 1 30GB drives that failed, 2 on Linux, 1 on Windows.
OCZ Agility -> Intel SSD Image -> Distro Upgrade to Fix Corruption
The failure mode on one was file system corruption like on an HDD: check disk would find problems and fix them but error out, and another run would find different errors. I was able to image that one off to an Intel X25-V (Value) G1 40GB SSD, then did a whole Ubuntu distro upgrade that overwrote pretty much all the important files on the system with freshly downloaded good ones. That took care of the corruption and any problems from the previous drive. Surprisingly, the system is still running Asterisk PBX to this day without any errors. I'm still a little amazed at how simple this recovery was and that there were no issues after the distro upgrade. I sent the failed OCZ drive back to my friends after fixing their PBX, with instructions to put a bullet through it instead of sending it for replacement. I was being serious and literal, and it is likely that they did just that.
Another failed by becoming inaccessible and unbootable under Windows XP. The last one just kernel panicked and disappeared from the BIOS completely. Both went back to OCZ for replacement and new ones showed up. I told the folks to not open the boxes and sell them on eBay and instead buy Intel X25-M or -V series drives to replace them.
OCZ Bashing
I still have a sealed OCZ Agility 1 30GB in my house and I posted it on eBay twice and nobody wants to buy it. I guess the word is out that OCZ SSD has shit for reliability. Newegg reviews are just full of failure reports. Even though Anandtech keeps reviewing these OCZ Vertex 2, 3, 4 series drives and praising them for performance I stay the hell away from OCZ as a vendor due to the massive amounts of complaints of failures people report on these.
As a side story, I also got burned by a performance grade OCZ 550W power supply with unstable 12V rail that wasn't even heavily loaded that would drop to 11V for no reason and destabilize my system causing weird behavior. Switched to Corsair TX750 after that and weirdness went away.
Intel SSDs - 3 Generations Going Strong
I've been running an Intel X25-M G1 80GB, which used to be a desktop drive, in my laptop for a few years now without issues. I have an Intel X25-M G2 80GB at work and it's still working fine. I also have an Intel 320 (G3) 160GB as my new desktop drive, and I applied the firmware upgrade that was available to fix that weird 8MB lock-up issue that was reported. I also have an Intel 320 40GB in my Ubuntu XbmcLive HTPC in my living room and another Intel X25-V G1 40GB in a friend's Ubuntu based Asterisk PBX system running just fine.
Love Intel for their SSD, never had an issue and I'm quite happy with them and the engineering that they did on the drives. Looking at the return numbers Intel has very low return rates for SSD, somewhere within the neighborhood of 1% and most of those were related to the two firmware bugs found, the one in the X25-M series early and the other the 320 series.
Intel 520 Series and SandForce SF-2281 Controller Firmware
There's a nice little story on Anandtech about when Intel was choosing the new SandForce SF-2281 controller for their Intel 520 SSD product line: they ran so many tests and did so much engineering on the drives that they came up with firmware updates, which they gave to the vendor, for the issues they discovered. Too bad that later on Intel found out that the controller can't do AES256, only AES128 encryption, and it is offering refunds for those that care about it.
http://www.anandtech.com/show/5508/intel-ssd-520-review-cherryville-brings-reliability-to-sandforce/
All of my Intel SSDs are about 2 to 3 generations behind and still use the old Intel controller that's limited to SATA-2 3Gbps speeds but
when a RAID is run with, say, X I/Os / second,
for several years,
and a drive dies,
AND the other drives in the RAID are near failure due to the same issue
( bearings in IBM's Deathstar drives, e.g,
or electro-migration in a chip in Fujitsu's nightmare enterprise drives, when they changed to a more eco-friendly chip-chemistry,
from what I've read )
suddenly the RAID is getting full NORMAL use
*PLUS* the RAID-rebuild...
A few months ago there was some article on Tom's Hardware
( first I'd bothered with that site in years )
discussing drive-reliability,
and the contributing datacentres found 2 things of interest to me:
1. the most-common time for a drive to die is *within 1h* of another drive dying.
2. Super Talent drives have a significantly higher failure rate.
I'm with the people who build RAIDs with 4 brands/models of drives, specifically to make entire-RAID-loss significantly less likely...
( & RAIDZ2 or RAID6 oughta be the law :)
As James Hamilton said in his Usenix paper ( excellent ), when you scale things up enough, it isn't IF a given failure will happen, it is WHEN it will happen: entire-rack failure will happen given enough opportunity...
I love his rule, though: if you aren't shutting your servers off by yanking the plug, you don't trust your HA system.
( :
Cheers!
There's a lot of conjecture and theorising in this thread so far. Not surprisingly some enterprising geeks have been busy testing SSDs to destruction, and they have some great stats. This thread with over 5000 posts has a ton of info about exactly what happens and some good hard numbers.
My expectation is that an SSD used by a desktop computer is most likely to indicate issues when it is being heavily used. In that sense, an imaging process is a good event.
IBM gives the SSDs that it sells an "Endurance" rating for a certain number of "Total Bytes Written"; see http://www.redbooks.ibm.com/abstracts/tips0879.html. The implication for this discussion thread is that if you build a RAID 1, 5, 6, or 10 array with new SSDs you should expect them to all reach end of life at about the same time.
The drives described in this paper from March, 2012 are rated for Endurance: "36 TB of total bytes written (TBW) at 90% full disk based on predefined usage pattern for 64 GB SSDs and 72 TB of TBW for higher capacity drives."
"Enterprise Value SSDs and Enterprise SSDs have similar read and write IOPS performance, but the key difference between them is their endurance (or life time) (that is, how long they can perform write operations because SSDs have a finite number of program/erase (P/E) cycles). Enterprise Value SSDs have a better cost/IOPS ratio but lower endurance compared to Enterprise SSDs. SSD write endurance is typically measured by the number of program/erase (P/E) cycles, that the drive incurs over its lifetime, listed as TBW in the device specification.
"The TBW value assigned to a solid-state device is the total bytes of written data (based on the number of P/E cycles) that a drive can be guaranteed to complete (% of remaining P/E cycles = % of remaining TBW). Reaching this limit does not cause the drive to immediately fail. It simply denotes the maximum number of writes that can be guaranteed. A solid-state device will not fail upon reaching the specified TBW. At some point based on manufacturing variance margin, after surpassing the TBW value, the drive will reach the end-of-life point, at which the drive will go into a read-only mode. Because of such behavior by Enterprise Value solid-state drives, careful planning must be done to use them only in read-intensive environments to ensure that the TBW of the drive will not be exceeded prior to the required life expectancy.
"The endurance of Enterprise Value drives is specified based on the following access pattern: 50% random data and 50% sequential data with block size mixes of 5% of the data as 4 KB block size, 5% of the data as 8 KB block size, 10% of the data as 16 KB block size, 35% of the data as 64 KB block size, and 35% of the data as 128 KB block size. The Enterprise Value drives described here are capable of 36 TB (64 GB SSD) or 72 TB (128 GB, 256 GB and 512 GB SSDs) of lifetime writes, with the workload stated above as the worse case. For the device to last in five years inside of the 72 TB of TBW, the drive write workload must be limited to no more than 40 GB of writes per day. For the device to last in three years, the drive write workload must be limited to no more than 65 GB of writes per day."
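The 40 GB/day and 65 GB/day figures in the quote follow directly from dividing the rated TBW by the number of days in service. A minimal sketch of that arithmetic, assuming 365-day years and decimal units (1 TB = 1000 GB); the function names are illustrative and not from the IBM paper:

```python
def daily_write_budget_gb(tbw_tb, years):
    """Average GB per day that can be written so that a drive rated for
    `tbw_tb` terabytes written lasts `years` years.

    Assumptions: 365-day years, 1 TB = 1000 GB.
    """
    return tbw_tb * 1000 / (years * 365)


def years_until_tbw(tbw_tb, gb_per_day):
    """Years until the rated TBW is used up at a steady `gb_per_day`."""
    return tbw_tb * 1000 / gb_per_day / 365


# Ratings quoted above: 36 TB for the 64 GB drive, 72 TB for the larger ones.
for tbw_tb in (36, 72):
    for years in (3, 5):
        budget = daily_write_budget_gb(tbw_tb, years)
        print(f"{tbw_tb} TB TBW over {years} years: {budget:.1f} GB/day")
```

For the 72 TB rating this gives a little under 40 GB/day over five years and a little under 66 GB/day over three, in line with the quoted limits. Since every member of a RAID 1 sees the same writes, the same calculation yields the same end-of-life date for each drive, which is the concern raised above.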
I don't have a problem with your decisions, but I don't know that I'd give that advice to everyone else?
IMO, the current state of SSDs is such that you're still paying a big price premium for one over a traditional hard drive, and you're getting technology that clearly has certain limitations (primarily being a limited lifespan if it's forced to do many, many data rewrites).
You can Google search it to see what I'm talking about, but there are quite a few sysadmins out there who got excited by the prospects of moving their relational databases onto SSDs on their servers for a big speed boost, only to find they were consistently killing off the drives in a matter of as little as 2 to 6 months' time. They clearly couldn't hold up to that type of use/abuse.
On the other hand, 99% of the other tasks you might do with a computer aren't nearly as rewrite intensive. If, say, you're a computer gamer? You're going to like an SSD for the advantages it gives of faster load time for all those levels they have to read in. The casual user will mainly appreciate the quick boot time if he/she turns the computer off when it's not in use, so finds themselves booting up from scratch pretty regularly. Digital video editors and photographers and artists should appreciate the quicker time to load plug-ins and video content, not to mention large applications.
But to me, the temporary swap file is something you can still throw onto a physical hard drive, at least in a desktop PC. You can even recycle a smaller capacity drive this way that you'd otherwise not bother using anymore. It's pretty much win-win because it won't really slow down the overall system performance much at all if everything else is on the SSD. (Ideally, you have enough RAM so the swap file isn't being relied on real heavily anyway.)
All true what you say, but apart from the remarkably faster load times of applications, the main benefit I get from the SSD is the lack of disk thrashing.
I'm not going to promote running a high-load RDBMS on an SSD as there seems to be a lot of evidence around that this indeed kills them rather quickly (*) but things like the swap and %TEMP% are well within the limitations of what you can throw at these beasts. (IMHO)
My Intel 320 120GB was installed about 8 months ago and currently stands at 3.48TB Total Reads / 3.65TB Total Writes and still shows 100%.
(SSD-Life claims 9 years, but I'm not putting too much value on that number)
(*: the high number of writes in SMART probably is due to my work with the databases. I'm not even sure if the value means actual bytes written or if it represents the total of blocks that were updated on the flash-chips which could be much higher! Anyway, I realise it eats at the drive's lifetime but there simply is no comparing to doing the same things on the HDD... To be entirely honest, if possible I try to run these things on a RAM-disk as it's even faster and I DO care about the SSD's limitations, but my laptop is limited to 8GB and I need 'some' of that for other purposes too =)
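The SMART totals quoted above can be turned into an average daily write rate and a count of full-drive overwrites. This is a rough sketch only: it assumes the counter reports host bytes written (which, as noted, is uncertain), decimal units, and an average month of 30.4 days. The function names are made up for illustration.

```python
DAYS_PER_MONTH = 30.4  # assumption: average month length


def observed_write_rate_gb_per_day(total_written_tb, months_in_service):
    """Average GB written per day, given a lifetime total in TB."""
    return total_written_tb * 1000 / (months_in_service * DAYS_PER_MONTH)


def full_drive_overwrites(total_written_tb, capacity_gb):
    """How many times the drive's nominal capacity has been written."""
    return total_written_tb * 1000 / capacity_gb


# Figures from the post: 3.65 TB written in about 8 months on a 120 GB drive.
rate = observed_write_rate_gb_per_day(3.65, 8)
overwrites = full_drive_overwrites(3.65, 120)
print(f"{rate:.1f} GB/day, {overwrites:.1f} full-drive overwrites")
```

That works out to roughly 15 GB per day and about 30 full-drive overwrites so far, before accounting for any write amplification inside the drive.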
If there is one thing to be learned on slashdot, it has to be sarcasm.
Here's my favorite part of the paper:
:/
"Failure rates are known to be highly correlated with drive models, manufacturers and vintages. [...] However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data."
Thanks, Google.