How Does Flash Media Fail?
bhodge writes "Aside from the obvious 'it stops working' answer, how does flash media, such as USB, SD, and CF, fail? Unlike with traditional hard drives, where anyone who's worked with computers for a while knows what a drive failure looks like, I don't know anyone who has experienced such a failure with flash. I haven't been able to find more than scant evidence of what such failures look like at the OS level. The one account I have found detailed using a small USB drive for /var/log storage; it failed very quickly, and then utterly (0 byte unformatted device), after five years of service in the role. This runs contrary to other anecdotal claims that you should still be able to read the media after you can no longer write to it. So my question is: what have you seen of the nature of flash media failure, if anything?"
It usually "fails" because it went through the washing machine in my pants too many times.
He'd taken it out of his camera, tried to put it back in, and nothin'. Slapped it into my Linux box. It "saw" that there was a device there, but wasn't real happy about it:
[ 5555.618324] sd 4:0:0:0: [sdb] Add. Sense: No additional sense information
[ 5558.777567] sd 4:0:0:0: [sdb] Sense Key : No Sense [current]
"It's dead, Jim."
I'm tempted to try the old hard-drive swaparoo: get the exact same SD card, unsolder the flash chips, and put the bad one's flash on the new one's circuitry. See if it's the circuitry that's bad, or the flash, itself. If anyone has any bright ideas on how to determine definitively which it is without me going through that exercise, I'm all ears.
Had two finally wear out. Both started giving "could not write to device" sort of errors. The system (Windows 2K or XP) would still recognize the drive, would show the files, etc. Indeed, I could still access (read) the files, so the data was there and copyable. But I'd get a file write error every time I read anything, because Windows was trying to update the flash drive's file directory with "last accessed" or some such, and that write would fail.
No biggie; copied the data to a replacement, threw the old ones away, after hitting them several times with a hammer to "clear" the memory :-)
Flash media fails when you write the data. In theory this means that you can always recover data as you can never write data to bad sectors. In practice the entire media device (CF, SD, etc.) fails at once.
><));>
Without knowing more about this specific situation, I'd say this failure sounds like it pre-dates wear leveling. Prior to wear leveling, the most used sectors were likely to fail the fastest. And what sector gets written to more than the file allocation table?
If the file allocation table was lost, that would explain why the device became completely inaccessible. The card might not be a total loss if the card contains firmware or circuitry to remove bad blocks from usage. In that case it might be possible to reformat it. (Of course, if it lacks wear leveling I wouldn't count on it.)
Wear leveling neatly solves this issue by shifting writes to different free blocks with every write. This assures that the maximum use of the card is obtained prior to failure. Should any given block fail the card will detect the checksum error, mark the block as bad, then attempt to rewrite to a different block. This is communicated back to the reader in a transparent way. As far as the reader knows, nothing happened.
As you can imagine, wear leveling makes it incredibly rare to see Flash failures these days. It can still happen, but the results are likely to be unpredictable. The card will need to chew through all free blocks before it starts returning errors. In that case you may be able to continue reading the media. Or it may fail like the USB drive you mentioned. It all depends on the importance of the block on which the erasure was attempted. Since you only know about a failure *after* the block erasure, you're at the mercy of the quality of the card's electronics and algorithms to protect against a dangerous erasure.
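To put a rough number on the difference, here is a toy sketch in Python. It assumes an idealized controller that always picks the least-worn block and a single "hot" sector that gets rewritten over and over (think FAT). The function name, block count, and endurance figure are made up for illustration; real controllers are far messier.

```python
def writes_until_first_failure(num_blocks, endurance, wear_leveling):
    """Count rewrites of one hot sector before a physical block wears out.

    endurance is the (assumed) number of erase cycles a block survives.
    """
    erase_counts = [0] * num_blocks
    writes = 0
    while True:
        if wear_leveling:
            # Idealized leveling: always use the least-worn block.
            target = min(range(num_blocks), key=erase_counts.__getitem__)
        else:
            # No leveling: the hot sector always lands on the same block.
            target = 0
        if erase_counts[target] >= endurance:
            return writes
        erase_counts[target] += 1
        writes += 1


without = writes_until_first_failure(8, 100, wear_leveling=False)
with_wl = writes_until_first_failure(8, 100, wear_leveling=True)
print(without, with_wl)
```

Under these assumptions the leveled device survives num_blocks times as many rewrites of the hot sector, which is the whole point of spreading the wear.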
What about redundancy and self-healing? How do those work?
Chaos maximizes locally around me.
I had a 4GB FAT32 flash drive that I used as storage for a mail server attached to an OpenWRT router. It required renaming and deleting files all the time (every time it got an e-mail)--so I think it wore down pretty quickly.
One day, the storage for the flash drive stopped working (from one hour to the next, without being touched, the computer acted like I had just yanked the drive out)--it would be recognized but report a "no media in drive" error when you tried to access it, like an empty CD drive. In fact I think Windows would say "Insert CD" or "No disc in drive F"
You're thinking of 'Tin whiskers', and I'm not sure they're an issue with Silicon chips (because, well, they're SILICON), and the amount of time it takes for whiskers to grow between SMT components shouldn't differ between SSDs and HDDs. Plus it's a very slow process anyway, especially in the atmosphere.
A few weeks ago /. linked to a really wonderfully written article by Anand Lal Shimpi about SSD drives. In the article he includes some simple and clear explanations of how flash memory works, its lifespan, and how it handles writes and deletes to maximize the life of every block of storage.
http://www.anandtech.com/printarticle.aspx?i=3531
The only thing missing from the article is a description of the behaviour of a failing drive.
Those work behind the scenes, if they are implemented. You wouldn't know they had been activated. If you lose a gate in the redundancy circuitry, that dies as well.
Some years ago I used a 64MB CF card to install a minimal Debian on an IBM PC110 with 8MB of RAM. As the install process wanted more memory, I created a 12MB swap partition.
Big mistake.
The install took a whole day. I happily ran some programs the next day and crash - kernel screams of i/o errors in the swap partition.
Formatted the card under MS-DOS; it found a few bad sectors. Then I ran Norton Disk Doctor, and on every run it found more and more bad sectors. But each time I reformatted the card using a camera, the bad sectors shifted around. Unusable.
FYI: IBM PC110 is a 486 Palmtop with a CF slot to be used as hard-drive. The CF interface is IDE.
On a modern filesystem, your writes should essentially be atomic and in theory it shouldn't be possible to leave the drive in an inconsistent state when the write fails.
Of course most camera memory cards end up being formatted with fat32 which can be a little less forgiving.
Dodgy card readers,...
That's what you get for buying a Chrysler product or any Detroit product. Try getting a Honda or Toyota card reader. Or if you're a yuppy, a BMW card reader. Although, no one holds a candle to the Japanese.
About 5-6 years ago, I decided that it would be a good idea to build a small application on a flash drive, that is, code and compile it directly to the drive. :)
After what must have been hitting compile a few hundred to a thousand times, the 128MB thumb drive started giving me drive write errors and then stopped responding altogether within about a minute after the errors started appearing.
I think the moral of this story is back up your data, even when it's on a flash-based drive, and don't code directly on a cheap thumb drive.
I am currently taking a class on solid state devices, and we just talked about how MOSFETs would fail. Basically, a high voltage to the gate would create these electrons that have so much kinetic energy that they create pairs of opposing charges (electron-hole pairs) in what was supposed to be the insulator. These pairs of charges would create an internal electric field inside the insulator. This process reduces the barrier for tunneling to occur, so more electrons are able to tunnel through the insulator and do the same thing, creating a runaway effect.
For more information, look up "Time-Dependent Dielectric Breakdown" and refer to pages 293 and 294 of Streetman and Banerjee's "Solid State Electronic Devices" (6th ed).
...and quality and longevity take a back seat. So companies stopped offering SLC Flash RAM (100,000+ writes) and only offer MLC (5,000 writes), and are now pushing even eight-level MLC, which will be even less reliable than standard 4-level MLC Flash RAM. But who cares, the consumer will be slightly fucked after a while, but that will be much later, after they enjoyed the happiness of getting slightly more GB for their buck.
The only manufacturer that I know of that is an exception is Kingston, which still offers SLC Flash products, namely their Elite Pro line of SD and CF cards and the DataTraveler USB drives. But that's it; everyone else has now completely transitioned to MLC.
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
Is there something I'm missing?
Maybe the part where you assume everyone knows the above?
Or how about the part where the submitter is asking about typical failure modes, not all possible failure modes?
AccountKiller
If a cell fails, you can't read or write that cell.
If a gate fails in a page, you lose access to the page.
If a gate fails in the overall control logic, you lose access to the whole device.
Is there something I'm missing? Did you think there were oil changes or brake shoes? It's one silicon chip with metal on it.
Conceptually at least, there are several parts to worry about:
1 - the OS & storage driver
2 - the USB driver
3 - the flash controller
4 - the flash memory
At the flash memory cell level the usual failures are breakdown of the dielectric materials and trapping charges in the memory cell that prevent an erase from happening and yield 'stuck' cells. This is normal for /all/ flash chips and is why they all have an erase cycle rating. There are certainly more exceptional ways for the chips to fail (soldering, wire bond failure, static damage, etc).
The flash controller is supposed to be doing wear leveling, error detection and correction on the flash, to get around those problems with the flash chips, and also talking USB. These chips usually have a microcontroller in them somewhere, and there's probably bugs in that code, no doubt more in the parts that get exercised the least, like error paths :-)
The OS and drivers just have the garden variety bugs and features that we all know and love...
-- All that's left of me, is slight insanity, whats on the right, I don't know. -- Bob Mould
There's no redundancy or self healing in the hardware of a common USB flash stick. The illusion that there is comes from a flash controller chip that does a mapping between disk sectors and flash sectors and shuffles things in and out so you don't notice the failures until it can't compensate for them anymore.
For a prior employer, I had set up a process to qualify flash media for use in embedded products. There's a couple of different failure modes you are likely to see.
:-)
First off, when the actual flash media itself wears out, it takes longer and longer to erase individual sectors.
A flash device such as a USB stick or a CF card is slightly more complicated because it has something known as an FTL (Flash Translation Layer). The FTL has the job of implementing the virtual media to flash sector translations, implementing wear leveling, and handling the awkward page erases. (Multiple sectors in a page, but you can only erase full pages.)
The FTL obviously must store some mapping information in the media in addition to your data.
If you start writing flash media, and time those writes, you see an initial rapid growth in the write timing that eventually levels off as the FTL tables swell to their constant operational size.
The overall flash write speed will level off to some average value that follows slow growth over a very, very long tail as the media wears.
Early flash chips supported about 10,000 erases per page, and modern chips shipped by Samsung and others support a couple million erases per page. When you consider this is spread over, say, 4GB of media, you can understand that the tail is very, very long and flash media are probably comparable to hard drives in their MTBF these days.
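For a feel of how long that tail is, here is a back-of-the-envelope sketch in Python. It assumes perfect wear leveling across the whole device and a write amplification factor you supply yourself; the function name and the 10 GB/day workload are my own illustrative assumptions, not measurements.

```python
def estimated_lifetime_years(capacity_gb, erase_cycles, gb_written_per_day,
                             write_amplification=1.0):
    """Rough wear-out estimate, assuming writes are spread perfectly evenly."""
    # Total data the media can absorb before every block hits its erase rating.
    total_writable_gb = capacity_gb * erase_cycles / write_amplification
    return total_writable_gb / gb_written_per_day / 365


# 4GB device, 10,000 erases per page (the "early chip" figure above),
# and an assumed 10 GB written per day.
print(round(estimated_lifetime_years(4, 10_000, 10), 1))
```

Real devices will do worse than this because leveling is never perfect and small writes get amplified into whole-page erases, so treat the result as an upper bound.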
Secondly, when flash actually does begin to fail, the media itself tends to exhibit a small number of different symptoms.
The flash may start to show occasional data corruption when read. You might also have instances where data persists in the media only so long as power is applied. And then of course you have the fact that erases take longer and longer to achieve. Eventually erases or programming start timing out occasionally.
With the FTL between you and the flash, you don't directly observe these effects. Presumably the FTL is smart enough to try and re-map your data elsewhere. In most cases there's ECC to attempt correction of moderately corrupted data. The real killers are when the data fails to persist after power cycling, when ECC fails to recover critical FTL data tables, or when there are no more spare sectors to re-map data to.
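The "no more spares" case is easy to picture with a toy model. The Python sketch below is a deliberately simplified, hypothetical FTL (the ToyFTL class, its one-block-per-sector mapping, and the endurance number are all invented for illustration): worn-out blocks get silently swapped for spares until the pool runs dry, and only then does the host see a write error.

```python
class ToyFTL:
    """Hypothetical, heavily simplified flash translation layer."""

    def __init__(self, logical_sectors, spare_blocks, endurance):
        total = logical_sectors + spare_blocks
        self.endurance = endurance            # assumed erase cycles per block
        self.erase_counts = [0] * total
        self.data = [None] * total
        # Logical sector -> physical block; starts as the identity mapping.
        self.mapping = {lba: lba for lba in range(logical_sectors)}
        self.spares = list(range(logical_sectors, total))

    def write(self, lba, value):
        while True:
            block = self.mapping[lba]
            if self.erase_counts[block] < self.endurance:
                self.erase_counts[block] += 1
                self.data[block] = value
                return
            # Block is worn out: retire it and remap to a spare, if any remain.
            if not self.spares:
                raise IOError("no spare blocks left: write failed")
            self.mapping[lba] = self.spares.pop()

    def read(self, lba):
        return self.data[self.mapping[lba]]


ftl = ToyFTL(logical_sectors=2, spare_blocks=2, endurance=3)
for i in range(9):          # 3 writes on the original block, 3 on each spare
    ftl.write(0, i)
try:
    ftl.write(0, 99)        # spares exhausted
except IOError as err:
    print("write error:", err)
print("last good data still readable:", ftl.read(0))
```

In this model the last successfully written data stays readable after writes start failing, which matches the "read-only but alive" reports. It says nothing about the nastier cases where the FTL's own tables get corrupted.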
Those first two critical errors are likely to produce the lightbulb effect where your flash card or USB stick one day simply fails to come up when probed after device insertion. In more rare cases, the lack of spares may show up as some sort of reported write failure in your kernel logs assuming the flash device reports proper IDE/ATAPI/??? error data.
One final note: please don't leave your USB stick inserted in the PC as you power it off! USB ports supply power and use a FET device to control that power. When you turn off the PC, the gates float and significant leakage current goes to the USB device. Some of the cheaper USB drives lack a key resistor that bleeds this current away and protects the flash memory chips. This leads to data corruption. I have seen the FTL break in such sticks simply by doing POR on the PC.
Oh...almost forgot. When you put your flash stick through the washer and dryer, always use fabric softener or Bounce strips to reduce the static.
I too had a flash drive fail, but in the "worst" way... quietly.
Fortunately, the drive was mostly used for "sneaker net" use, and did not contain any irreplaceable data. This use exposed the issue quickly too (had it been a backup device, the backup would have been useless and I wouldn't know until I needed it.)
A typical failure was to zip up a software installation on a dev machine, then take it to a clean target machine, where the zip would fail to unpack, or the installer exe, once unpacked, would fail to run with various errors.
I finally got to the point where I simply copied several megabytes of plain text data to the memory key, then copied it back and diffed the files to see the corruption (large areas of nulls, as I recall.)
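If anyone wants to automate that copy-back-and-compare test, here is a minimal Python sketch. The function names and chunk size are my own choices. One caveat that matters: the OS may serve the read-back from its cache rather than from the device, so for a meaningful test, write the file, safely remove and re-insert the drive, and only then compute the digest.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # 1 MiB per write


def write_test_file(path, size_mb):
    """Write size_mb MiB of random data to path and return its SHA-256."""
    digest = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            chunk = os.urandom(CHUNK)
            digest.update(chunk)
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return digest.hexdigest()


def file_digest(path):
    """Return the SHA-256 of the file as currently read back from path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Usage would be something like expected = write_test_file("/media/stick/test.bin", 64), then re-plug the drive and check file_digest("/media/stick/test.bin") == expected, where the mount path is just an example. A mismatch with no error from the OS is exactly the quiet failure described above.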
Never heard a peep from the OS.
It was a 1 1/2 year old Patriot XT 2GB, and, after a couple of emails and a PDF of my NewEgg receipt, a new drive showed up in the mail under the lifetime warranty.
I also had an expensive Lexar CF card for a digital SLR that failed. In that case pictures that I know I took simply weren't on the card... but could be "recovered" with the Lexar utility (along with EVERYTHING else on the card, so it was a PITA.) Since that was nearly $200 when it was new, I figured getting my lifetime warranty honored would be easy, since the cards were down to about $20. No dice. Just got the run-around and finally gave up. Lexar lost a customer.
This issue is a bit more complicated than you think.
Seen this failure mode a lot too. Static build up on your body, then when you go to insert the device the charge jumps between you, through the device to the grounded casing around the USB connector.
Can do anything from reboot your PC (if you're lucky) to destroying the stick or the USB controller on the PC (or hub, if you're lucky).
As you said, power is a major problem with USB. Cheap USB sticks need FULL power to work right. Often times we'll have a customer with a stick that works fine in one PC (at home or work, for instance) and will either not be recognized or will give read/write errors in another. Most of the time this is solved by using an external powered USB hub, as the motherboard simply isn't supplying enough voltage or current to power the stick. I'm not really sure if in general the problem is the motherboard or the stick, as I haven't bothered to pull out the multimeter and do any serious testing, but I'm inclined to think it's the stick, as it seems to happen mostly with cheap/no-name sticks that were probably rejected by the likes of Sandisk and co.
As far as pulling them out while the card is powered, that is part of the specification for SD and USB; not sure about CompactFlash, but I would assume it's there as well. USB and SD have the connector configured in such a way as to ensure power is applied and removed in the proper order, which is why their connectors have some contacts that are longer than the others.
What you said is still true, however: a cheap chip on either side may not handle that process well. I can say, however, that we have successfully run 3.3V SD cards at 7 to 9 volts for short periods of time due to misconfigured testing setups where we didn't check the voltage after switching modes. Of course, we've also lost more than a few SD cards for that very reason; even at 5 volts they won't last more than a few minutes. Mini and micro cards in an adapter to full SD generally fare better, as the minis and micros work at around 1.8V (I think; memory is fuzzy about that atm, might be 2.7) and have internal voltage dividers to cut down the 3.3V input from the system. They still fail eventually due to overvoltage; they just seem to do better, although I have only anecdotal evidence to support that.
There may be other manners of failure. I have a recent 2Gb USB thumb drive that started going ever more slowly after a few days of use. I last measured a "dd if=/dev/random of=/media/device/test" of no more than 0.5kB/s. If somebody wants to have some fun analyzing it, I can put it in an envelope free of charge.
Non-Linux Penguins ?
This is a silent failure, much like hard drives marking blocks as bad. Capacity is reduced without any obvious signs. Not sure if OS tools can recognize it unless the controller reports bad cells as bad blocks. This will eventually result in "disk full" messages when there appears to be space on the drive. Reformatting won't recover the cells but it will likely result in your OS being aware of the flash's reduced capacity.
Very similar to above, but larger amounts of data. I want to say there's 64 cells to the page but don't take that as gospel.
Hello failed/unreadable/size 0 disk error. The data storage mechanism is intact but there's no way to access them. As people stated above, a lot of the time it is not the failure of a transistor so much as a trace or solder point failing. If you know your device has been abused physically, you can try the low-tech approach of gently squeezing or bending the stick while it's in the USB port (use an extension cable so you don't damage your mobo!!) to try and get the contacts to reconnect long enough to retrieve data. If that fails you can pop the case apart and use a magnifying glass to look for breaks in the solder or traces; if you're handy with a soldering iron you can try to bridge the connection. Again, temporary fix.
Actually most of them are several silicon chips; one controller plus a variable amount of memory chips. The increase in traces and board assembly is offset by the ability to reuse components and the overall design while memory chip prices fall. It also cuts down on the impact of failed chips, since you aren't losing controller+memory for one bad gate on the controller.
I've been on slashdot so long I'm starting to get out of touch with the cool stuff if it ain't on slashdot.
Yes and no. A page or cell failure will result in I/O errors if there are no more spares, and if it occurs during a read cycle, it -should- result in I/O errors for all subsequent reads from that cell or page until it gets rewritten to a new cell or page. If it doesn't work that way, then the device is fundamentally violating the contract between the device and the OS to report all nonrecoverable errors that result in data loss.
Also, while a multi-chip design reduces the probability of a device failing outright, it dramatically increases the probability of a failure. First, using a separate controller significantly increases the probability of failure because instead of having interconnect traces on a slab of silicon that (electromigration notwithstanding) almost never change or fail if they work from the factory, you have solder joints exposed on a circuit board. Solder joints are the most common cause of circuit failure in my experience.
Even ignoring the increased risk of having extra solder joints between the controller and flash parts, the odds of failure are still much worse for multi-chip devices. Remember your RAID MTBF theory. The MTBF of a collection of devices is equal to the MTBF of one device divided by the number of devices. If you have one part, the MTBF on that slab of silicon and associated solder joints might be a year. If you have five parts, the MTBF is now 73 days. That's an extreme example, but sadly, I've seen flash sticks with large numbers of failures in the first month, so that's not nearly as gross an exaggeration as you might think.... And whether one part fails or the whole thing fails, you still lose data.
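The arithmetic behind that, as a small Python sketch. It assumes the parts fail independently at a constant rate and that any single failure kills the device (a "series" system); the function name is mine, and the one-year figure is the same extreme example as above.

```python
def series_mtbf(part_mtbfs):
    """MTBF of a device that dies when any one of its parts dies.

    Assumes independent parts with constant failure rates, so the
    failure rates (1 / MTBF) simply add up.
    """
    return 1.0 / sum(1.0 / m for m in part_mtbfs)


# One part with a 365-day MTBF versus five identical parts.
print(series_mtbf([365]))       # about 365 days
print(series_mtbf([365] * 5))   # about 73 days
```

With identical parts this reduces to the MTBF of one part divided by the number of parts, which is where the 73 days comes from.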
Also, a controller failure is still likely to cause all flash parts to be inaccessible whether it is integrated into a flash chip or is driving eight discrete flash chips. It's not like you're going to use a separate flash controller per flash part. And I -think- that a device showing zero capacity is probably caused by the flash controller being unable to communicate with the flash parts. If so, then that is much more likely to be caused by a failed connection between the two than by a failed flash controller (unless there are problems with interconnects inside the flash controller chip package failing due to overzealous compliance with ROHS rules).
The original poster also failed to mention the most common failure mode, bar none: poor solder joints or other physical interconnects getting broken by physical force. This is very common among cheap flash drives. I wouldn't expect the same with SSDs, of course---you don't normally carry a SSD in your pocket---but at least in my experience, this one cause of failure is easily an order of magnitude more frequent than any other single cause, and is in all likelihood greater than all the others put together. And that's not even counting actual abuse (washing machines, run over by cars, and so on).
My Lexar JumpDrive Secure flash drive suddenly stopped working, and I talked to my mother, whose entire university class was using that same model of drive. Turns out that between us, we had experienced close to a 50% failure rate on those things within the first month or so, having seen somewhere around 14 or 15 failures. The failure was interesting. Mine failed suddenly, but worked if you tipped the connector at an angle... at least for a couple of seconds once or twice. This told me pretty conclusively that the failure was caused by poor hardware design. As best I can tell, when you carry the drive in your pocket, the cap puts pressure on the USB connector. Over time, this gradually causes solder joint or trace failure (I never cut one open to figure out which) at or near the USB connector.
Since then, I only buy flash devices with mechanisms where the USB connector retracts into a solid housing. Sure, you have an elevated risk of gunk from your pocket getting into the connector because it isn't covered, but at least you don't have the flexing problem. Gunk can be cleaned with a flat toothpick and alcohol. Failed solder joints require disassembly and SMT soldering skills.... :-)