Google-Backed SSD Endurance Research Shows MLC Flash As Reliable As SLC (hothardware.com)
MojoKid writes: Even for mainstream users, it's easy to feel the difference between a PC that has its OS installed on a solid state drive and one that boots from a mechanical hard drive. With SSD pricing where it is right now, it's also easy to justify including one in a new configuration for the speed boost. And there's obvious benefit in the enterprise and data center for both performance and durability. As you might expect, Google has chewed through a healthy pile of SSDs in its data centers over the years, and the company appears to have been one of the first to deploy SSDs in production at scale. New research results Google is sharing via a joint research project now encompass SSD use over a six-year span at one of Google's data centers. Looking over the results led to some expected and unexpected findings. One of the biggest discoveries is that SLC-based SSDs are not necessarily more reliable than MLC-based drives. This is surprising, as SLC SSDs carry a price premium with the promise of higher durability (specifically in write operations) as one of their selling points. It will come as no surprise that there are trade-offs with both SSDs and mechanical drives, but ultimately, the benefits SSDs offer often far outweigh the benefits of mechanical HDDs.
I was thinking the same thing. The grammar is appalling. They need to go to collage!
This site has "TechNerd 101" as a pre-requirement. If you're reading this site without first completing that course, please speak with your student adviser to discuss your options.
Yep, we need critical details, like if there is any way to tell that an SSD is about to become totally unreadable. There is some worrying stuff in TFA:
Other results point to the uselessness of the RBER value (raw bit error rate). It was found that there was absolutely no correlation between the number of these warnings and the number of uncorrectable errors that creep up.
Uncorrectable errors are not too bad, at least you can make a copy of the drive. It's when the drive dies completely and you have to reach for the most recent backup that we really want to predict.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
http://0b4af6cdc2f0c5998459-c0...
Someone you trust is one of us.
If you did not understand the summary, then this site is not for you.
It's when the drive dies completely and you have to reach for the most recent backup that we really want to predict.
Flash wear doesn't cause the drive to completely die; only controller failure can do that. Usually it seems to be the controller firmware that fails, rather than the electronics.
Thanks. However, what I read from this study is that SLC is indeed far more reliable, especially over time: the risk of a several-year-old MLC drive developing an uncorrectable error is an order of magnitude or more higher than for a similarly old SLC drive.
Only some less common errors are in the same ballpark, which should not be extrapolated to what the title claims.
Just as worthless as their last "study" on storage reliability, as they do not name manufacturers and models. Research published by Google sucks badly.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
These issues seem to have gone away now that the crappy SandForce controllers are no longer in use. In the past it was a firmware issue, not wear. OCZ made defective drives which had no extra capacitors, so they would fry.
However, I did recently throw out a Sandisk ultra 1 plus I bought in 2013. It experienced corruption, which at first I thought was my imagination or something I did. I put it in a girlfriend's laptop and noticed Windows reported corruption after a reboot. This could be due to the extra RAM used as cache going bad.
I had great luck with Samsung Pro SSDs.
http://saveie6.com/
Okay, thanks to selectspec for posting a link to the report: http://0b4af6cdc2f0c5998459-c0...
The bad news is that Google were using their own custom controllers. Thus we can't draw any conclusions about different manufacturers or controllers or error correction techniques. All they look at is the error rates for different types of flash memory and how often their hardware could correct the errors.
For consumers this is likely meaningless. So much depends on the drive controller and the choice of error detection/recovery scheme that it doesn't really help to look at the type of flash.
And most consumer-level drives aren't even MLC... but TLC :-/. I don't even think there are consumer-level SLC models anywhere. Even the Samsung Pro is MLC, with the Samsung EVO being TLC.
All I know is that even Intel can't make a decent SSD. The first SSD I bought was their Intel SSD 530 Series 120GB and I've never been able to use the damn thing. I've tried it on two computers, a Mac mini 2010 and a DIY PC with a recent motherboard, and in both of them the drive just won't boot after a warm reset. Even after all these years, Intel hasn't published a firmware upgrade to fix the problem.
...then it would stand to reason that other storage vendors mostly know this, too.
So why aren't there more MLC-based flash arrays, especially all-flash models? For storage capacities under 24 TB raw, it would be pretty price-competitive with HDD but produce a storage device with insane I/O potential.
Recent Intel drives have a "feature" where once a failed-write counter reaches a certain limit they go into read-only mode, and then brick after the next power cycle. Not exactly ideal behaviour.
Yup:
To quote the report:
"Revisiting Table 3, we see that this perception is correct when it comes to SLC drives and their RBER, as they are orders of magnitude lower than for MLC and eMLC drives. However, Tables 2 and 5 show that SLC drives do not perform better for those measures of reliability that matter most in practice: SLC drives don't have lower repair or replacement rates, and don't typically have lower rates of non-transparent errors"
SLC drives have fewer errors, but the errors that affect real-world usage (i.e. the non-correctable ones) show up about as often as they do on MLC. So technically MLC drives have MORE errors but are just as useful, as replacement rates are the same. Presumably that's down to more error correction.
I'm curious if the Intel drives doing this are consumer level or enterprise grade drives. The Intel enterprise grade drives did quite well at a previous job where they were used to handle an insane amount of random I/O hitting them on a constant basis.
Of course, the difference between the two is the capacitors, which hold enough electricity to finish the in-flight write transaction, so a hard power-off is less likely to cause the controller to lose its ability to find pages (the SSD equivalent of the thumping noise an HDD makes when it can't find the track servos and keeps hitting the center hub).
Er, TLC ("triple-level cell", 8 states) is a form of MLC. MLC is (blindingly obvious from the acronym) "multi-level cell", not "two-level cell" (4 states).
They don't brick, they go into write only mode. Ask Intel, they'll tell ya.
APK quotes people (including myself) without context and should not be trusted. Just thought you should know.
This paper shows that RBER and the UE rates are correlated with P/E cycles, age, and SLC vs. MLC. But this was already known. Finding the opposite would be surprising. I would expect that as the rated endurance is approached, retention is affected. It would have been interesting to see the RBER and UE rates for the MLC drives past the rated endurance points, and to see if the authors' claim that error rates are linear with P/E cycles still holds past (or at least close to) the rated endurance.
Despite the better RBER and UE rates for SLC, the paper claims that SLC is no more reliable than MLC based on per-drive repair rates. But that seems sort of obvious. The parts of the SSD other than the flash cells should be expected to fail at about the same rate.
Also, in the short discussion that compares SSDs to HDDs, the authors say that annual failure rates (AFR) for HDDs (2-9%) are higher than for SSDs (4-10% over 4 years or 1-3% AFR). However, as other reports (such as from Backblaze) show, some HDD drive models have consistently low AFR below 1%. More importantly, the use cases for SSDs and HDDs are becoming more and more distinct, so the comparison is becoming less relevant.
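For anyone wondering how "4-10% over 4 years" turns into "1-3% AFR", here is a rough sketch in Python. It assumes a constant failure rate that compounds year over year, which is my assumption and not necessarily how the authors did the conversion (simply dividing by four gives nearly the same answer at these small percentages). The function name is made up for illustration.

```python
def annualized_failure_rate(cumulative_fraction, years):
    """Constant yearly failure rate that compounds to the given
    cumulative failure fraction over `years` years (assumed model)."""
    survival = 1.0 - cumulative_fraction      # fraction of drives still alive
    yearly_survival = survival ** (1.0 / years)
    return 1.0 - yearly_survival


# The 4-10% over 4 years range quoted for SSDs above
for cumulative in (0.04, 0.10):
    afr = annualized_failure_rate(cumulative, 4)
    print(f"{cumulative:.0%} over 4 years -> {afr:.2%} per year")
```

Under that assumption the range works out to roughly 1.0% to 2.6% per year, which is consistent with the quoted 1-3% AFR.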
We should also keep in mind that large data center owners, such as Google, are extremely price sensitive. So, they will attempt to squeeze the most performance and reliability out of the cheapest drives. So, their engineering efforts will try to force the cheaper MLC to approach SLC. So, these results may not necessarily translate to single SSDs in laptops.
Enterprise drives. They are very much there for caches or RAID use where the loss of a drive won't be critical. It just seems like an odd decision, presumably due to the firmware being unable to write internal metadata back to the drive and having to keep it all in RAM. Crap design IMHO.
That makes logical sense but that is not how the terms are used in industry. MLC actually means two levels and TLC is three. QLC will be the term to describe four level cells that I believe are not yet on the market.
The problem is not so much schemes used for error recovery or wear levelling. Those are all well understood. The problem is one of QA. Poorly written firmware, unable to handle edge cases (e.g. power failure), bugs in the code etc. You only need to see the list of firmware updates some of the drive manufacturers roll out to realise that it's not the hardware that is causing problems, it's the firmware in the controllers.
Calculate the cost of the replacement cycle too and suddenly SSDs look a lot cheaper. It's just that most people can't think beyond the end of their noses, so if the up-front cost looks expensive they stop right there.
I bought my last HDDs last year. Two 4TB 'archival' drives for backups. My existing pile of new 1TB or 2TB HDDs (I have around a dozen 3.5" and half a dozen 2.5" left) will be dribbled out as needed, but I won't be buying any new HDDs from now on. In fact, I couldn't even foist off some of my extra 2TB 3.5" HDDs onto friends last weekend. They weren't interested! Bryce happily took one of my original 40G Intel SSDs (2.5") but had no interest whatsoever in a brand spanking new 2TB WDC (3.5").
Last year I scrapped the 3.5" form factor (I made two exceptions for the two backup drives). All new systems have only 2.5" hot swap slots now. And until recently I had a growing pile of 2.5" HDDs to maintain those systems.
But now my pile of 2.5" SSDs continues to grow while my pile of 3.5" and 2.5" HDDs has begun to shrink. The strange thing about my pile of 2.5" SSDs... I haven't had to throw any SSDs away since I started it! That pile began with the old 40G Intel 2.5" SSDs that really started the industry ramp. All of those originals that I had slowly replaced with higher-capacity SSDs are still in the pile and still in perfect working order! And I actually use them, they are perfect for small test systems.
So it's a bit of a strange situation. HDDs would die or get read errors and I'd wipe and throw them away. I never recycled old HDDs very much (they become unreliable when you mix cold and hot cycling or shelve them). But SSDs are a completely different matter. You can mix cold and hot cycling just fine; they basically don't go bad unless they fail outright (at a much lower rate than any HDD) or unless they are worn out from write wear (which is quite hard to do if you don't do stupid things with them). If it's just an unrecoverable block due to chance, a reformat fixes the problem and they continue soldiering on (which hasn't happened to me yet, but that's what I'll do).
So my SSD pile continues to grow and the capacities continue to cycle up. The pile saw its first 1TB SSDs last year. At this juncture if I have to replace my archival HDDs I might use some of the spares from my HDD pile, but after that I'll happily spend the money to throw in a bunch of SSDs for the same storage because they'll last a whole lot longer.
-Matt
The problem is one of QA. Poorly written firmware, unable to handle edge cases (e.g. power failure)
Nice post, except I would hardly call power failure an "edge case".
When all you have is a hammer, every problem starts to look like a thumb.
Yes "abnormal" operation could be considered a better word.
That's not the conclusion of the study at all. This is not a study of endurance.
holding down the on/off button to improperly shut the computer off when it freezes
There's an improper way to shut down an unresponsive computer? What about systems without reset buttons?
And the odd part is that a "Three level cell" is actually 8 levels (3 bits), so QLC will be 16 levels.
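For what it's worth, the naming confusion upthread comes down to bits per cell versus voltage levels per cell. A minimal sketch in Python, assuming the common industry usage where SLC, MLC, TLC and QLC store 1, 2, 3 and 4 bits per cell; the table and function names are just for illustration:

```python
# Bits stored per cell, using the common industry meaning of each acronym
# (assumption: "MLC" here means the 2-bit kind, not "multi-level" in general).
BITS_PER_CELL = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def voltage_levels(bits_per_cell):
    """Distinct charge states a cell must distinguish to hold that many bits."""
    return 2 ** bits_per_cell


for name, bits in BITS_PER_CELL.items():
    print(f"{name}: {bits} bit(s) per cell, {voltage_levels(bits)} levels")
```

So the "level" in the acronym counts bits, while the number of states the controller has to tell apart doubles with every extra bit.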