How Does Flash Media Fail?
bhodge writes "Aside from the obvious 'it stops working' answer, how does flash media (such as USB, SD, and CF) fail? Unlike with traditional hard drives, where anyone who's worked with computers for a while knows what a drive failure looks like, I don't know anyone who has experienced such a failure with flash. I haven't been able to find more than scant evidence of what such failures look like at the OS level. The one account I have found detailed using a small USB drive for /var/log storage; it failed suddenly and utterly (0-byte unformatted device) after five years of service in that role. This runs contrary to other anecdotal claims that you should still be able to read the media after you can no longer write to it. So my question is: what have you seen of the nature of flash media failure, if anything?"
It usually "fails" because it went through the washing machine in my pants too many times.
The next Cmdr Taco duplicate will be ready soon, but subscribers can beat the rush and see it early!
He'd taken it out of his camera, tried to put it back in, and nothin'. Slapped it into my Linux box. It "saw" that there was a device there, but wasn't real happy about it:
[ 5555.618324] sd 4:0:0:0: [sdb] Add. Sense: No additional sense information
[ 5558.777567] sd 4:0:0:0: [sdb] Sense Key : No Sense [current]
"It's dead, Jim."
I'm tempted to try the old hard-drive swaparoo: get the exact same SD card, unsolder the flash chips, and put the bad one's flash on the new one's circuitry. See if it's the circuitry that's bad, or the flash, itself. If anyone has any bright ideas on how to determine definitively which it is without me going through that exercise, I'm all ears.
From what I gather, the most common cause of failure is the flash getting fried: dodgy card readers, or pulling the card out while voltage is still applied to it. The chips are very sensitive to spikes in current or voltage and burn out because of them.
If a cell fails, you can't read or write that cell.
If a gate fails in a page, you lose access to the page.
If a gate fails in the overall control logic, you lose access to the whole device.
Is there something I'm missing? Did you think there were oil changes or brake shoes? It's one silicon chip with metal on it.
Intron: the portion of DNA which expresses nothing useful.
Had two finally wear out. Both started giving "could not write to device" sort of errors. The system (Windows 2K or XP) would still recognize the drive, would show the files, etc. Indeed, I could still access (read) the files, so the data was there and copyable. But I'd get a file write error every time I read anything, because Windows was trying to update the flash drive's file directory with "last accessed" or some such, and that write would fail.
No biggie; copied the data to a replacement, threw the old ones away, after hitting them several times with a hammer to "clear" the memory :-)
Flash media fails when you write the data. In theory this means that you can always recover data as you can never write data to bad sectors. In practice the entire media device (CF, SD, etc.) fails at once.
><));>
Maybe I am totally on the wrong track here, but doesn't the fact that they can't use lead in some of the alloys affect the lifespan of some computer parts?
As I understand it, aluminium alloys made without lead and then used in computers degrade several orders of magnitude faster than alloys with lead. Apparently the aluminium starts sprouting tiny, tiny "hairs", and when one of these connects to another one coming from somewhere else in the machine, it's thank you and good night for that part.
Anyway, the reason I mention this is that, with intensive use, 5-7 years is apparently how long it takes parts in your computer to make such a connection, and after that it is LED OFF (see what I did there?). That is, unless you have a computer built before the mid nineties (I think that was the cutoff); since those used lead in their alloys, this won't affect them (though a range of other issues will).
The Long Now Foundation
Without knowing more about this specific situation, I'd say this failure sounds like it pre-dates wear leveling. Prior to wear leveling, the most used sectors were likely to fail the fastest. And what sector gets written to more than the file allocation table?
If the file allocation table was lost, that would explain why the device became completely inaccessible. The card might not be a total loss if the card contains firmware or circuitry to remove bad blocks from usage. In that case it might be possible to reformat it. (Of course, if it lacks wear leveling I wouldn't count on it.)
Wear leveling neatly solves this issue by shifting writes to different free blocks with every write. This ensures that the maximum use of the card is obtained prior to failure. Should any given block fail, the card will detect the checksum error, mark the block as bad, then attempt to rewrite to a different block. This is handled transparently to the reader: as far as the reader knows, nothing happened.
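A toy model can make the behaviour described above concrete. This is not any card's real firmware: `WearLevelingStore`, `Block`, and the `endurance` limit are hypothetical, and a block "failing" here simply means its erase count passed that limit, standing in for the checksum/verify failure the card would detect. Each write goes to the least-worn good free block; a block that fails is marked bad and the write is retried elsewhere without the host hearing about it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Block:
    erase_count: int = 0
    bad: bool = False
    data: Optional[bytes] = None


class WearLevelingStore:
    """Hypothetical wear-leveling model with silent bad-block retirement."""

    def __init__(self, n_blocks: int, endurance: int) -> None:
        self.blocks: List[Block] = [Block() for _ in range(n_blocks)]
        self.endurance = endurance            # hypothetical erase limit per block
        self.mapping: Dict[int, int] = {}     # logical block -> physical block

    def _program(self, phys: int, data: bytes) -> bool:
        blk = self.blocks[phys]
        blk.erase_count += 1
        if blk.erase_count > self.endurance:  # stands in for a failed verify/checksum
            return False
        blk.data = data
        return True

    def write(self, logical: int, data: bytes) -> None:
        old = self.mapping.get(logical)
        in_use = set(self.mapping.values())
        while True:
            free = [i for i, b in enumerate(self.blocks) if not b.bad and i not in in_use]
            if not free:
                raise IOError("no good free blocks left")  # only now does the host see an error
            target = min(free, key=lambda i: self.blocks[i].erase_count)  # least-worn block
            if self._program(target, data):
                break
            self.blocks[target].bad = True  # retire the block and retry; host sees nothing
        self.mapping[logical] = target
        if old is not None:
            self.blocks[old].data = None    # the previous copy becomes free space

    def read(self, logical: int) -> Optional[bytes]:
        return self.blocks[self.mapping[logical]].data
```

With 4 blocks and an endurance of 3, hammering a single logical block succeeds 12 times; the 13th write retires the remaining free blocks and raises, yet `read(0)` still returns the last good data. That is the "can still read, can't write" outcome, and it only holds because the block holding the live data was never the one being erased.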
As you can imagine, wear leveling makes it incredibly rare to see Flash failures these days. It can still happen, but the results are likely to be unpredictable. The card will need to chew through all free blocks before it starts returning errors. In that case you may be able to continue reading the media. Or it may fail like the USB drive you mentioned. It all depends on the importance of the block on which the erasure was attempted. Since you only know about a failure *after* the block erasure, you're at the mercy of the quality of the card's electronics and algorithms to protect against a dangerous erasure.
Javascript + Nintendo DSi = DSiCade
I've been booting Linux servers off of flash for a few years. For some of them, the whole OS, even /var/log, is on the flash drive.
I've had one drive fail, and it basically got hot and stopped being recognized as connected by the computer. It was older-generation technology, though. Newer flash technology designed for computers doesn't fail, as far as I have experienced. I'm talking about the flash SATA drives from name-brand manufacturers.
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
I had a 4GB FAT32 flash drive that I used as storage for a mail server attached to an OpenWRT router. It required renaming and deleting files all the time (every time it got an e-mail)--so I think it wore down pretty quickly.
One day, the flash drive's storage stopped working (from one hour to the next, without being touched, the computer acted as if I had just yanked the drive out). It would be recognized but report a "no media in drive" error when you tried to access it, like an empty CD drive. In fact, I think Windows would say "Insert CD" or "No disc in drive F".
A few weeks ago /. linked to a really wonderfully written article by Anand Lal Shimpi about SSD drives. In the article he includes some simple and clear explanations of how flash memory works, its lifespan, and how it handles writes and deletes to maximize the life of every block of storage.
http://www.anandtech.com/printarticle.aspx?i=3531
The only thing missing from the article is a description of the behaviour of a failing drive.
If the flash drive fails, yes you can continue to read from it, but you also have to consider what is meant by reading.
You can always read the raw data from the device, that will never change. There is nothing that prevents the electrical signals from forming a proper read transaction on the IO pins of the flash IC chip.
However, when you consider the software that is on top of the raw data (a file system for example), this is where you will have the trouble.
With older CF cards, the concept of wear leveling was not implemented, I don't know about newer ones. This being the case, the directory structure for a file would more than likely reside in the same physical location on the flash. Opening, writing, closing a file with the same name would no doubt wear that space out as the directory entry gets hammered. Once that has "worn out", data is lost because the file system can no longer track it (even though the actual data may be viable).
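To put rough numbers on this, here is a small simulation. The layout is an assumption, not a measurement: the directory entry sits in fixed block 0, while the filesystem allocates each save's data to a different block among `data_blocks`. `simulate_fixed_directory` is a hypothetical helper.

```python
from collections import Counter


def simulate_fixed_directory(saves: int, data_blocks: int) -> Counter:
    """Erase counts per physical block when block 0 holds the directory entry
    and file data rotates through blocks 1..data_blocks (hypothetical layout,
    no wear leveling)."""
    erases: Counter = Counter()
    for n in range(saves):
        erases[1 + n % data_blocks] += 1  # file contents land in a different block each time
        erases[0] += 1                    # directory entry is rewritten in place every save
    return erases


wear = simulate_fixed_directory(saves=10_000, data_blocks=100)
print(wear[0], wear[1])  # → 10000 100
```

After 10,000 saves the directory block has taken 10,000 erases while each data block has taken 100. At the roughly 10,000-cycle endurance quoted later in this thread for early flash, the directory block is at its limit while the data area has used about 1% of its life.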
Also consider the device that does support wear leveling. At some point it will run out of places to wear. Some large files will remain static and won't move (they are only read), some files will be moved all over the device by the device's ASIC as the data in the file is updated or changed. At some point, the flash will run out of cells. This could happen as some critical directory entry is being updated, and the whole file system could be corrupted because there are no more viable flash cells to use.
Your data might still be there in all its binary glory, but without a viable file system data structure to access it, well, you're toast. Unlike a hard drive that burped and lost a few bytes, a worn-out flash drive has no recordable medium available to do any file system data structure repairs.
Kevin
I had flash fail on me 'gracefully'. The amount of available storage just keeps shrinking with use. It seems like the cells (if one can call them that) just die after repeated use. Formatting does not help either.
Some years ago I used a 64MB CF card to install a minimal Debian on an IBM PC110 with 8MB of RAM. As the install process wanted more memory, I created a 12MB swap partition.
Big mistake.
The install took a whole day. I happily ran some programs the next day and then, crash: the kernel screamed about I/O errors in the swap partition.
I formatted the card under MS-DOS and it found a few bad sectors. Then I ran Norton Disk Doctor, and every run found more and more bad sectors. But each time I reformatted the card using a camera, the bad sectors shifted around. Unusable.
FYI: the IBM PC110 is a 486 palmtop with a CF slot to be used as a hard drive. The CF interface is IDE.
Your flash memory is fine, the controller is hosed.
This kind of (essentially unrecoverable) failure will continue to be an issue wherever the logic is integrated with the storage.
If it's any consolation, except for those who are always forgetting to "eject" or turn off their device before removing the media this kind of failure should be quite rare*.
Enjoy.
*Mfrs producing shoddy products notwithstanding.
Platform advocacy is like choosing a favorite severely developmentally disabled child.
On a modern filesystem, your writes should essentially be atomic and in theory it shouldn't be possible to leave the drive in an inconsistent state when the write fails.
Of course, most camera memory cards end up being formatted with FAT32, which can be a little less forgiving.
About 5-6 years ago, I decided that it would be a good idea to build a small application on a flash drive, that is, code and compile it directly to the drive. :)
After what must have been a few hundred to a thousand compiles, the 128MB thumb drive started giving me drive write errors, and then stopped responding altogether within about a minute after the errors started appearing.
I think the moral of this story is: back up your data, even when it's on a flash-based drive, and don't code directly on a cheap thumb drive.
...and quality and longevity take a back seat. So companies stopped offering SLC flash (100,000+ writes) and only offer MLC (5,000 writes), and are now pushing even eight-level MLC, which will be even less reliable than standard 4-level MLC flash. But who cares, the consumer will be slightly fucked after a while, but that will be much later, after they've enjoyed the happiness of getting slightly more GB for their buck.
The only manufacturer I know of that is an exception is Kingston, which still offers SLC flash products, namely their Elite Pro line of SD and CF cards and the DataTraveler USB drives. But that's it; everyone else has now completely transitioned to MLC.
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
I've been able to roll over a flash drive with my car, and wash and bake one in the process of doing laundry, and it's never failed... however, I've had one on my desk for a month that failed like a whale for no good (read: user abusing it as normal) reason. Blaming gremlins, jeebus, and FSM until a solution abounds.
Good people go to bed earlier.
When I was in the digital imaging kiosk business, we had to repair about three flash drives a week. A customer would put it in one of our systems and pull it out while it was being read, or it was a cheap drive or whatever. Either way, the customer would blame our systems for killing their drives (rightly or wrongly). Of course, it would contain pictures of their dead grandfather or ex-girlfriend naked or whatever was completely priceless and irreplaceable.
The vast majority of the time, we would be able to run an application that would be able to recover whatever was on the drive. While I'm not certain of the original problem, the system acted as if the drive had no FAT (File Allocation Table... do I really need to say it?) on it or the FAT had become corrupted. This particular application would be able to go in and recover whatever was on the drive and most of the time repair the drive to its previous working state.
I say it ACTED like the FAT was corrupt, but I don't know or care if a flash drive has a FAT on it. Could have been a hardware thingie in there that hiccuped. The repair utility acted much like a scan-disk that would repair an MBR or FAT and/or act like an undelete utility would, restoring the files on the drive.
There is no "I disagree" mod for a reason. Flamebait, Troll, and Overrated are not substitutes.
For a prior employer, I set up a process to qualify flash media for use in embedded products. There are a couple of different failure modes you are likely to see.
:-)
First off, when the actual flash media itself wears out, it takes longer and longer to erase individual sectors.
A flash device such as a USB stick or a CF card is slightly more complicated because it has something known as an FTL (Flash Translation Layer). The FTL has the job of implementing the virtual-media-to-flash sector translations, implementing wear leveling, and handling the awkward page erases. (There are multiple sectors in a page, but you can only erase full pages.)
The FTL obviously must store some mapping information in the media in addition to your data.
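The FTL's core trick, never overwriting a sector in place, can be sketched in a few dozen lines. This is a hypothetical model, not any vendor's FTL: `ToyFTL` and its sizes are made up. NAND datasheets usually call the erase unit a "block" and the write unit a "page"; the sketch keeps this comment's "page"/"sector" wording. Updates are appended to the active page and the map is redirected; when no erased page is left, the page with the fewest live sectors is reclaimed.

```python
from typing import Dict, List, Optional, Tuple


class ToyFTL:
    """Toy flash translation layer: 'page' = erase unit, 'sector' = write unit."""

    def __init__(self, n_pages: int = 4, sectors_per_page: int = 4) -> None:
        self.spp = sectors_per_page
        self.flash: List[List[Optional[bytes]]] = [[None] * self.spp for _ in range(n_pages)]
        self.owner: List[List[Optional[int]]] = [[None] * self.spp for _ in range(n_pages)]
        self.erases = [0] * n_pages
        self.map: Dict[int, Tuple[int, int]] = {}   # logical sector -> (page, slot)
        self.free_pages = list(range(1, n_pages))   # erased pages ready for use
        self.active, self.slot = 0, 0               # page currently being filled

    def _is_live(self, page: int, slot: int) -> bool:
        # A slot is live only if the map still points at it; older copies are stale.
        logical = self.owner[page][slot]
        return logical is not None and self.map.get(logical) == (page, slot)

    def _live_count(self, page: int) -> int:
        return sum(self._is_live(page, s) for s in range(self.spp))

    def _program(self, logical: int, data: bytes) -> None:
        page, slot = self.active, self.slot
        self.flash[page][slot] = data
        self.owner[page][slot] = logical
        self.map[logical] = (page, slot)
        self.slot += 1

    def _next_active(self) -> None:
        if self.free_pages:
            self.active, self.slot = self.free_pages.pop(0), 0
            return
        # Garbage collection: take the page with the fewest live sectors,
        # buffer those sectors, erase the whole page, and write them back.
        victim = min(range(len(self.flash)), key=self._live_count)
        live = [(self.owner[victim][s], self.flash[victim][s])
                for s in range(self.spp) if self._is_live(victim, s)]
        if len(live) == self.spp:
            raise IOError("no reclaimable space left")
        self.flash[victim] = [None] * self.spp
        self.owner[victim] = [None] * self.spp
        self.erases[victim] += 1
        self.active, self.slot = victim, 0
        for logical, data in live:
            self._program(logical, data)

    def write(self, logical: int, data: bytes) -> None:
        # Never overwrite in place: append to the active page and remap.
        if self.slot == self.spp:
            self._next_active()
        self._program(logical, data)

    def read(self, logical: int) -> Optional[bytes]:
        page, slot = self.map[logical]
        return self.flash[page][slot]
```

To stay short, garbage collection buffers live sectors in RAM and writes them back into the just-erased page. A real FTL would copy them to a spare erased page first, because losing power between the erase and the rewrite would lose that data. The mapping table itself also has to live in flash, which this sketch ignores.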
If you start writing flash media, and time those writes, you see an initial rapid growth in the write timing that eventually levels off as the FTL tables swell to their constant operational size.
The overall flash write speed will level off to some average value that follows slow growth over a very, very long tail as the media wears.
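One way to try that measurement yourself, sketched in Python under some assumptions. The path is a hypothetical file on the flash device's mount point. `os.fsync` is called after each chunk so the data is pushed past the OS cache, although the device's own cache may still hide some effects. The chunk size and count are arbitrary. Note that the test itself consumes erase cycles.

```python
import os
import time
from statistics import mean
from typing import List


def time_writes(path: str, chunk_size: int = 64 * 1024, count: int = 100) -> List[float]:
    """Append `count` random chunks to `path`, fsync after each, and return
    the per-chunk latency in seconds."""
    buf = os.urandom(chunk_size)
    latencies: List[float] = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, buf)   # for a regular file this normally writes the whole buffer
            os.fsync(fd)        # force the data (and metadata) out to the device
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


def windowed_means(samples: List[float], window: int = 10) -> List[float]:
    """Average consecutive windows so the trend is easier to see than raw jitter."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]
```

For example, `windowed_means(time_writes("/mnt/flash/timing.bin"))` (hypothetical mount point) gives one average latency per ten chunks. Repeating the run over weeks is what would reveal the slow growth along the tail.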
Early flash chips supported about 10,000 erases per page, and modern chips shipped by Samsung and others support a couple million erases per page. When you consider this is spread over, say, 4GB of media, you can understand that the tail is very, very long, and flash media are probably comparable to hard drives in their MTBF these days.
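The "very long tail" argument is easy to turn into arithmetic. A minimal sketch, assuming perfect wear leveling spreads writes evenly over the whole capacity. `lifetime_years` is a hypothetical helper, and the `write_amplification` factor (extra internal writes per host write) and the daily write volume are hypothetical inputs, not figures from this comment.

```python
def lifetime_years(capacity_bytes: float, erase_cycles: int,
                   bytes_written_per_day: float,
                   write_amplification: float = 1.0) -> float:
    """Years until every block has used its erase budget, assuming ideal leveling."""
    total_host_bytes = capacity_bytes * erase_cycles / write_amplification
    return total_host_bytes / bytes_written_per_day / 365.0


GB = 1024 ** 3
# 4 GB device, early-flash 10,000 cycles, 1 GB of host writes per day
print(round(lifetime_years(4 * GB, 10_000, 1 * GB)))                          # → 110
print(round(lifetime_years(4 * GB, 10_000, 1 * GB, write_amplification=10)))  # → 11
```

Even at the pessimistic early-flash figure, ideal leveling gives about a century, and about 11 years with a write amplification of 10. The formula does not model devices without wear leveling, which concentrate erases on a few hot blocks.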
Secondly, when flash actually does begin to fail, the media itself tends to exhibit a small number of different symptoms.
The flash may start to show occasional data corruption when read. You might also have instances where data persists in the media only so long as power is applied. And then, of course, erases take longer and longer to complete. Eventually, erases or programming operations start timing out occasionally.
With the FTL between you and the flash, you don't directly observe these effects. Presumably the FTL is smart enough to try to re-map your data elsewhere. In most cases there's ECC to attempt correction of moderately corrupted data. The real killers are when the data fails to persist after power cycling, when ECC fails to recover critical FTL data tables, or when there are no more spare sectors to re-map data to.
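For a feel of what "ECC to attempt correction" means, here is a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. This is an illustration only: flash controllers use much stronger codes over whole sectors (BCH and LDPC codes are common), but the principle carries over. A bounded number of bad bits is repaired silently, and beyond that bound the data is lost or miscorrected.

```python
from typing import List, Tuple


def hamming74_encode(d: List[int]) -> List[int]:
    """Encode 4 data bits as 7 bits laid out p1 p2 d1 p3 d2 d3 d4 (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c: List[int]) -> Tuple[List[int], int]:
    """Return (data bits, syndrome); a nonzero syndrome is the 1-based
    position of the bit that was flipped back."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # assume a single-bit error and correct it
    return [c[2], c[4], c[5], c[6]], syndrome
```

Any single flipped bit is corrected. Flipping two bits makes the decoder "correct" a third one and hand back wrong data without complaint, which illustrates why errors past a code's limit can go unnoticed.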
Those first two critical errors are likely to produce the lightbulb effect, where your flash card or USB stick one day simply fails to come up when probed after device insertion. In rarer cases, the lack of spares may show up as some sort of reported write failure in your kernel logs, assuming the flash device reports proper IDE/ATAPI/??? error data.
One final note: please don't leave your USB stick inserted in the PC as you power it off! USB ports supply power and use a FET device to control that power. When you turn off the PC, the gates float and significant leakage current goes to the USB device. Some of the cheaper USB drives lack a key resistor that bleeds this current away and protects the flash memory chips. This leads to data corruption. I have seen the FTL break in such sticks simply by doing a power-on reset (POR) on the PC.
Oh... almost forgot. When you put your flash stick through the washer and dryer, always use fabric softener or Bounce strips to reduce the static.
I have a Philips DVD drive with a USB port, and was using a 1GB flash drive to play back video files copied from my PC. The drive failed relatively quickly: I'd had it for about a year, but hadn't used it all that often. I started to notice the video files were corrupt on playback, but initially suspected the file itself, or possibly a problem with the DVD player's decoder. I diagnosed the problem by copying a file onto the drive, then repeatedly checksumming it. The first couple of times, the checksum value would often be correct; then on subsequent checks it would change on me. I'd end up seeing several different checksum values, never seeing it return to a previous value. Whether this was due to a problem in the interface hardware when reading, or memory cells failing to retain their state, I don't know.
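That diagnosis is easy to script. A minimal sketch using SHA-256 from Python's `hashlib` (the commenter's checksum tool isn't named); `sha256_of`, `repeated_checksums`, and `is_stable` are hypothetical helpers. One assumption matters: repeated reads may be served from the OS page cache rather than the device, so between rounds you would unplug and replug the drive, or on Linux drop the caches as root (`echo 3 > /proc/sys/vm/drop_caches`).

```python
import hashlib
from typing import Iterable, List


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large video files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def repeated_checksums(path: str, rounds: int = 5) -> List[str]:
    # Each round re-reads the whole file; on real hardware, flush caches between rounds.
    return [sha256_of(path) for _ in range(rounds)]


def is_stable(digests: Iterable[str]) -> bool:
    """A healthy device returns the same digest every time."""
    return len(set(digests)) == 1
```

Comparing the first digest against `sha256_of` on the original file on the PC also tells you whether the copy was ever good.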
Even though it was a year old and I had no receipt, the manufacturer (Kingmax, I think?) was happy to send a free replacement. The new drive has seen much more use, but is still working fine.
I too had a flash drive fail, but in the "worst" way... quietly.
Fortunately, the drive was mostly used for "sneaker net" use, and did not contain any irreplaceable data. This use exposed the issue quickly too (had it been a backup device, the backup would have been useless and I wouldn't know until I needed it.)
A typical failure was to zip up a software installation on a dev machine, then take it to a clean target machine, where the zip would fail to unpack, or the installer exe, once unpacked, would fail to run with various errors.
I finally got to the point where I simply copied several megabytes of plain text data to the memory key, then copied it back and diffed the files to see the corruption (large areas of nulls, as I recall.)
Never heard a peep from the OS.
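The copy-back-and-diff test can also report where the damage is. A sketch, assuming you can read both the original and the copy as bytes; `diff_ranges` is a hypothetical helper. It lists each run of differing bytes and flags runs where the copy contains only NULs, the pattern described here. As with any read-back test, eject and reinsert the stick first so the copy is actually read from the device.

```python
from typing import List, Tuple


def diff_ranges(original: bytes, copy: bytes) -> List[Tuple[int, int, bool]]:
    """Return (start, end, all_zero) for each run of differing bytes.
    all_zero is True when the copy holds only NUL bytes in that run.
    A length mismatch is reported as one final run with all_zero=False."""
    runs: List[Tuple[int, int, bool]] = []
    n = min(len(original), len(copy))
    i = 0
    while i < n:
        if original[i] != copy[i]:
            start = i
            while i < n and original[i] != copy[i]:
                i += 1
            runs.append((start, i, all(b == 0 for b in copy[start:i])))
        else:
            i += 1
    if len(original) != len(copy):
        runs.append((n, max(len(original), len(copy)), False))
    return runs
```

For example, reading `notes.txt` and `/media/stick/notes.txt` (hypothetical paths) in binary mode and passing both to `diff_ranges` lists byte offsets, which is more useful than a text diff when whole regions have turned into NULs.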
It was a 1 1/2 year old Patriot XT 2GB, and, after a couple of emails and a PDF of my NewEgg receipt, a new drive showed up in the mail under the lifetime warranty.
I also had an expensive Lexar CF card for a digital SLR that failed. In that case pictures that I know I took simply weren't on the card... but could be "recovered" with the Lexar utility (along with EVERYTHING else on the card, so it was a PITA.) Since that was nearly $200 when it was new, I figured getting my lifetime warranty honored would be easy, since the cards were down to about $20. No dice. Just got the run-around and finally gave up. Lexar lost a customer.
This issue is a bit more complicated than you think.
I've been running my home desktop/server (Linux 2.6) on a Sandisk Cruzer 8GB usb stick (root, swap, tmp, everything except large media files) for a year and four months without any glitches. I've napkin-calculated that at current usage and wear levelling, I should be able to use it for over 50 years without a failure. Funnily enough, the portable USB drive that I use to back it up failed last December. I keep multiple backups, I didn't flinch.
Then again some flash devices fail miserably and silently. I've had a few 64MB and 128MB stick batches with stuck bits, and those were practically new. The operating systems they were used on didn't detect the errors, I did, by trying to open garbled files.
My wish list: a SATA gizmo that has 4-5 USB connectors, each on its own bus, that presents itself to the SATA bus as a single drive and does RAID-5 automatically. That'd be sweet.
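The parity part of that wish is simple XOR arithmetic. A minimal sketch of one RAID-5 stripe across hypothetical sticks: the parity block is the XOR of the data blocks, so any one missing member can be rebuilt from the rest. Real RAID-5 rotates the parity position from stripe to stripe; this sketch keeps it last for brevity, and `xor_bytes`, `make_stripe`, and `rebuild` are hypothetical helpers.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks: List[bytes]) -> List[bytes]:
    """Data blocks (one per stick) followed by their parity block."""
    if len({len(b) for b in data_blocks}) != 1:
        raise ValueError("blocks in a stripe must be the same size")
    return data_blocks + [xor_bytes(data_blocks)]


def rebuild(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Reconstruct at most one missing member (None) by XOR-ing the survivors."""
    missing = [i for i, b in enumerate(stripe) if b is None]
    if len(missing) > 1:
        raise ValueError("RAID-5 cannot survive more than one lost member")
    survivors = [b for b in stripe if b is not None]
    if not missing:
        return survivors
    rebuilt = xor_bytes(survivors)
    return [rebuilt if b is None else b for b in stripe]
```

Parity only rescues you from a stick that disappears outright. A stick quietly returning wrong data, as described just above, is only caught if the array verifies parity on reads or during a scrub, and even then plain RAID-5 can't tell which member is wrong.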
is packaging. There is stuff in the potting epoxy that holds enough electric charge to make the FET's gate start to conduct a little, playing havoc with everything. We've been having to redo parts with an extra layer of metal over the top of the IC to protect it from an intermittent contamination in our packaging material.
I believe I remember reading that Intel had problems with their water being mildly radioactive downstream of an old uranium mine, and running into the same problems (only much worse, since they're doing much finer geometry.)
So this is a case where the FET hasn't failed, precisely: it's just getting messed up by external interference.
Nostalgia's not what it used to be.
So, I will pass on what I have discussed with my brother-in-law who is an Electrical Engineer that writes software to test flash memory:
1. Flash memory is built with additional failover storage (so a 1GB SD card actually contains some percentage more than 1GB of memory).
When a section of memory fails, it is marked bad by the flash controller and some of the failover memory comes into service (marked bad much like failures on standard hard drives... although I get the impression the flash controller may be the thing remembering that it's bad; I wasn't clear on this, so now I have something else to ask him).
2. Flash memory will fail... it can only be written to so many times before it can no longer be written to, and that number is definitely not as high as for a standard hard drive.
So it's likely that you can extend the life of a flash device by writing to it less often.
And, not from my brother-in-law discussions: I personally had a flash drive fail (I was using it as the master copy of documents as I moved data between my work machine and home machine while working toward an online degree). There was no prior warning when it failed. It simply stopped working; it couldn't be read or written. I suspect my batch file that performed the backups to it must have written to it too many times (it was a smaller 128MB drive, so, considering the above discussion about failover memory, a smaller drive SHOULD fail faster...).
Hope that helps
I wonder how well that hammer thing worked.
It's beautifully effective on hard drives because they are high-speed, high-precision mechanical devices, but even if you broke up the circuit board the chips were soldered to, a guy with a soldering iron and some know-how might still be able to get it back together again. Going by the cell-to-gate progression posted earlier, it sounds like unless you actually destroy a given gate, you don't destroy access to a given chip. If you were able to access the internals of the chip, even that might not be a barrier.
Too many electrons are easy to find, though. Maybe get some rubber gloves and one of those handheld stun guns, and zap the board parts a few times after (or before) you're finished hammering. It could be fun and sparkly. This also provides an opportunity for some memorable conversations with management: "...It's these SSDs, boss. They're just really hard to erase when they fail. I'm afraid the department is going to need its own Van de Graaff generator..."
The blue smoke wants to be free.
Everyone knows that the mainstream flash media is so left wing that's going to fail.