Google-Backed SSD Endurance Research Shows MLC Flash As Reliable As SLC (hothardware.com)

Posted by timothy on Tuesday March 1, 2016 @01:40AM from the utah-vs-oklahoma dept.

MojoKid writes: Even for mainstream users, it's easy to feel the difference between a PC with its OS installed on a solid state drive and one running from a mechanical hard drive. With SSD pricing where it is right now, it's also easy to justify including one in a new configuration for the speed boost. And there's obvious benefit in the enterprise and data center for both performance and durability. As you might expect, Google has chewed through a healthy pile of SSDs in its data centers over the years, and the company appears to have been one of the first to deploy SSDs in production at scale. New research results Google is sharing via a joint research project now encompass SSD use over a six-year span at one of Google's data centers. Looking over the results led to some expected and unexpected findings. One of the biggest discoveries is that SLC-based SSDs are not necessarily more reliable than MLC-based drives. This is surprising, as SLC SSDs carry a price premium with the promise of higher durability (specifically in write operations) as one of their selling points. It will come as no surprise that there are trade-offs with both SSDs and mechanical drives, but ultimately, the benefits SSDs offer often far outweigh the benefits of mechanical HDDs.

Re:Gibberish? by 110010001000 · 2016-03-01 02:15 · Score: 1

I was thinking the same thing. The grammar is appalling. They need to go to collage!
Re:Gibberish? by watermark · 2016-03-01 02:16 · Score: 3, Insightful

This site has "TechNerd 101" as a pre-requirement. If you're reading this site without first completing that course, please speak with your student adviser to discuss your options.
Re:Where is the report? by AmiMoJo · 2016-03-01 02:18 · Score: 2

Yep, we need critical details, like if there is any way to tell that an SSD is about to become totally unreadable. There is some worrying stuff in TFA:

Other results point to the uselessness of the RBER value (raw bit error rate). It was found that there was absolutely no correlation between the number of these warnings and the number of uncorrectable errors that creep up.
Uncorrectable errors are not too bad, at least you can make a copy of the drive. It's when the drive dies completely and you have to reach for the most recent backup that we really want to predict.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re:Where is the report? by selectspec · 2016-03-01 02:28 · Score: 4, Informative

http://0b4af6cdc2f0c5998459-c0...

--
Someone you trust is one of us.
Re:Gibberish? by Nunya666 · 2016-03-01 02:39 · Score: 1

If you did not understand the summary, then this site is not for you.
Re:Where is the report? by Anonymous Coward · 2016-03-01 02:42 · Score: 1

It's when the drive dies completely and you have to reach for the most recent backup that we really want to predict.
Flash wear doesn't cause the drive to completely die, only controller failure can do that. Usually it seems to be the controller firmware that fails, rather than the electronics.
Re:Where is the report? by arth1 · 2016-03-01 02:44 · Score: 5, Interesting

Thanks. However, what I read from this study is that SLC is indeed far more reliable, especially over time: the risk of a several-year-old MLC drive developing an uncorrectable error is an order of magnitude or more higher than for a similarly old SLC drive.
Only some less common errors are in the same ballpark, which should not be extrapolated to what the title claims.
Worthless.... by gweihir · 2016-03-01 02:54 · Score: 2

Just as worthless as their last "study" on storage reliability, as they do not name manufacturers and models. Research published by Google sucks badly.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
1. Re:Worthless.... by thegarbz · 2016-03-01 09:17 · Score: 1
  
  as they do not name manufacturers and models
  When you're studying differences purely from a technological point of view, trying to address the perception of SLC vs MLC, what has the manufacturer got to do with it? People constantly post about the differences between SLC and MLC regardless of which manufacturer makes the drive, so when attempting to study that, all you're doing is adding distractions from your point.
  Yes it would be nice to know who's doing the best and the worst.
  No that was not at all the point of the study.
2. Re:Worthless.... by gweihir · 2016-03-01 09:32 · Score: 1
  
  Several points:
  1. Manufacturers and models are critical to repeatability and verifiability. As it is, they could have pulled those numbers from their backsides and nobody could tell.
  2. There are quite a few SSDs out there that have problems in the relevant time-of-purchase time-span. For example, OCZ had much higher failure rates in a number of models. Without knowing whether any of those (and how many) were in the sample, you do not get a realistic picture, as you are comparing devices on different maturity levels.
  3. It depends very much on what additional measures are implemented to deal with the shortcomings of SLC and MLC, respectively. Without make, model and firmware revision, that is impossible to say. They may well have measured the effects of certain firmware artifacts, not of SLC vs. MLC at all. In fact, their numbers suggest such an error.
  4. It is also important to know what type of memory and controller was in there. Some may cause systematic errors independent of SLC or MLC usage. Again, their numbers suggest such errors.
  There are quite a few more methodical shortcomings.
  This is not science. This is statistics done without understanding and with any repeatability and verifiability removed. Work on low amateur-level that should have been rejected by any competent reviewer.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Where is the report? by Billly+Gates · 2016-03-01 02:57 · Score: 1

These issues seem to have gone away since the crappy SandForce controllers are no longer in use. It was a firmware issue and not wear in the past. OCZ made defective drives which had no extra capacitors so they would fry.
However, I did recently throw out a Sandisk ultra 1 plus I bought in 2013. It experienced corruption which I thought was my imagination or something I did. I put it in a girlfriend's laptop and noticed Windows reported corruption after a reboot. This could be due to extra RAM used as cache that went bad.
I had great luck with Samsung pro SSDs.

--
http://saveie6.com/
Re:Where is the report? by AmiMoJo · 2016-03-01 03:06 · Score: 5, Informative

Okay, thanks to selectspec for posting a link to the report: http://0b4af6cdc2f0c5998459-c0...
The bad news is that Google were using their own custom controllers. Thus we can't draw any conclusions about different manufacturers or controllers or error correction techniques. All they look at is the error rates for different types of flash memory and how often their hardware could correct the errors.
For consumers this is likely meaningless. So much is dependent on the drive controller and selection of error detection/recovery scheme, it doesn't really help to look at the type of flash.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re:Where is the report? by Anonymous Coward · 2016-03-01 03:13 · Score: 1

and most consumer level drives aren't even MLC... but TLC :-/. I don't think there are consumer level SLC models anywhere. Even the Samsung Pro is MLC, with the Samsung EVO being TLC.
Intel sucks at SSDs by U2xhc2hkb3QgU3Vja3M · 2016-03-01 03:16 · Score: 2

All I know is that even Intel can't even make a decent SSD. The first SSD I bought was their Intel SSD 530 Series 120GB and I've never been able to use the damn thing. I've tried it on two computers, a Mac mini 2010 and a DIY PC with a recent motherboard, and in both of them the drive just won't boot after a warm reset. Even after all these years, Intel hasn't published a firmware upgrade to fix the problem.
1. Re:Intel sucks at SSDs by BLToday · 2016-03-01 03:26 · Score: 1
  
  Are you sure that's just not a defective drive? I've put the same SSD in a MacBook Pro 13 2011 and some random Toshiba laptop (Windows 8.1) for my sisters in law, both with the 240GB version of the drive. Seems to work perfectly fine and they've been running for a couple years without issues.
2. Re:Intel sucks at SSDs by U2xhc2hkb3QgU3Vja3M · 2016-03-01 03:34 · Score: 1
  
  That particular problem with the 530 Series has been known for years.
  Intel says it's a problem with Macs.
  Apple says it's a problem with the drive.
3. Re:Intel sucks at SSDs by Kobun · 2016-03-01 04:50 · Score: 2
  
  I have a couple dozen each of 520 and 530 models deployed right now. Those machines have no problems with their drives.
4. Re:Intel sucks at SSDs by U2xhc2hkb3QgU3Vja3M · 2016-03-01 06:46 · Score: 1
  
  Nope they haven't. This article talks about firmware DC12 but the most recent firmware is DC33 and people are still having the same problem with these drives.
5. Re:Intel sucks at SSDs by thegarbz · 2016-03-01 09:18 · Score: 1
  
  Tell you what, I'd do a no questions asked exchange for an OCZ Vortex drive for you. How does that sound?
  Yep I'll take one for the team.
6. Re:Intel sucks at SSDs by SB5407 · 2016-03-01 11:01 · Score: 1
  
  I have about 10 of them (120 GB and 180 GB Intel 530 Series SSDs) deployed in my environment in HP laptops and they've been great. They've been much more reliable than the failed half-height laptop HDDs they replaced.
7. Re:Intel sucks at SSDs by m.dillon · 2016-03-01 11:49 · Score: 1
  
  It's almost certainly Intel's fault. Some of their SSDs do not follow the SATA spec properly on reset, which can cause the initial probe to fail with a timeout. If you probe a second time it will succeed. I actually had to add a second probe to DragonFlyBSD's AHCI driver to work around the problem. It doesn't seem to be related to startup time: even with a long delay I'll see first-probe failures on Intel SSDs in various boxes.
  Strangely enough the failures occur with Intel AHCI chipsets + Intel SSDs, but do not occur with AMD AHCI chipsets + Intel SSDs.
  Just one of those things. Intel firmware generally sucks across all of their chips. They write specs to make their hardware designs easier and don't really give a shit how much it complicates the drivers people have to write for the hardware. That said, their SSD firmware, once you can probe the SSD, seems to work just fine.
  -Matt
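The workaround described above (probe again when the first probe times out) can be sketched roughly as follows. This is only an illustration in Python, not DragonFlyBSD's actual AHCI driver code; `ProbeTimeout`, `probe_with_retry` and `FlakyDevice` are hypothetical names made up for the example.

```python
class ProbeTimeout(Exception):
    """Hypothetical error: the device did not answer the probe in time."""


def probe_with_retry(probe, attempts=2):
    """Call probe() up to `attempts` times and return the first success.

    `probe` is any zero-argument callable that returns device info or
    raises ProbeTimeout. If every attempt times out, the last timeout
    is re-raised so the caller still sees the failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return probe()
        except ProbeTimeout as err:
            last_error = err  # remember the failure and try again
    raise last_error


class FlakyDevice:
    """Stand-in for a drive that misses the first probe after a reset (assumption)."""

    def __init__(self):
        self.calls = 0

    def probe(self):
        self.calls += 1
        if self.calls == 1:
            raise ProbeTimeout("no response to initial probe")
        return "device ready"


device = FlakyDevice()
result = probe_with_retry(device.probe)
print(result, "after", device.calls, "probes")
```

Capping the number of attempts keeps a genuinely dead drive from hanging the boot, which is presumably why a single extra probe is enough here.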
8. Re:Intel sucks at SSDs by U2xhc2hkb3QgU3Vja3M · 2016-03-01 14:55 · Score: 1
  
  The same problem also happens with nVidia chipsets, such as the MCP89 in my Mac mini 2010.
If Google knows this... by swb · 2016-03-01 03:20 · Score: 2

...then it would stand to reason that other storage vendors mostly know this, too.
So why aren't there more MLC based flash arrays, especially all-flash models? For storage capacities under 24 TB raw, it would be pretty price competitive to HDD but produce a storage device with insane I/O potential.
1. Re:If Google knows this... by fnj · 2016-03-01 05:17 · Score: 1
  
  it would be pretty price competitive to HDD
  No it wouldn't/isn't. Not even close.
2. Re:If Google knows this... by Anonymous Coward · 2016-03-01 05:57 · Score: 2, Informative
  
  Your math is off... 400*24=$9600, not 96K.
3. Re:If Google knows this... by swb · 2016-03-02 00:36 · Score: 1
  
  First off, your math is way off -- 24 x $400 is $9600.
  Secondly, nobody would build a 24 TB array with 4x8TB in RAID 5. The risk of data loss during a disk rebuild is too high and it would provide so little I/O that it would be all but useless for anything but low-access archiving.
  A better comparison for disks would be 1 TB 15k SAS, and these retail for $225, so the math on disk cost alone is a lot more competitive.
  It becomes more competitive when you look at the performance -- 24 SSDs would give you close to 2 million IOPS and sequential throughput would probably max out a SAS-12 controller or 10 gig Ethernet multipath links. A 15k SAS array of a similar composition would top out around 8k IOPS.
  Plus, the cost of the disks is just a subset of the cost of a complete storage solution. An EqualLogic with 24x 15k disks would probably run in the mid-$30k range. If you just factor in the price differential for MLC SSDs, in theory the price ought not rise by more than $5k, or less than 20%, while delivering performance that spinning rust would take $100k to compete with.
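For anyone who wants to check the arithmetic in this sub-thread, here is a small sketch using the figures quoted above. The `ARRAY_PRICE` of $35,000 is an assumption: one reading of the array price estimated above, taken as thousands of dollars. All variable names are made up for the example.

```python
# Figures quoted in the thread
DRIVES = 24
SSD_PRICE = 400             # dollars per MLC SSD ("24 x $400")
SAS_PRICE = 225             # dollars per 1 TB 15k SAS disk
SSD_ARRAY_IOPS = 2_000_000  # "close to 2 million IOPS"
SAS_ARRAY_IOPS = 8_000      # "around 8k IOPS"

# Assumption: complete 24-disk 15k array priced at roughly $35,000
ARRAY_PRICE = 35_000

ssd_total = DRIVES * SSD_PRICE             # cost of the SSDs alone
sas_total = DRIVES * SAS_PRICE             # cost of the 15k disks alone
drive_delta = ssd_total - sas_total        # extra spend for flash media
price_increase = drive_delta / ARRAY_PRICE # fraction of the array price
iops_ratio = SSD_ARRAY_IOPS // SAS_ARRAY_IOPS

print(f"SSD media cost:      ${ssd_total:,}")
print(f"15k SAS media cost:  ${sas_total:,}")
print(f"Extra cost of flash: ${drive_delta:,} ({price_increase:.0%} of array price)")
print(f"IOPS advantage:      {iops_ratio}x")
```

Under those assumptions the media premium stays below both the "$5k" and "20%" bounds mentioned above.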
4. Re:If Google knows this... by jon3k · 2016-03-02 02:56 · Score: 1
  
  So why aren't there more MLC based flash arrays
  What companies are you referring to? I just installed an EMC VNX2 with a tier of MLC flash, which uses FAST VP. Nimble's arrays also use MLC flash - not eMLC, MLC:
  
  Today's SSDs degrade when burdened with continual patterns of random writes. When SSDs receive random writes, the write activity within the SSD is greater than the actual number of writes. This write amplification dramatically increases the number of write cycles that the SSD must support. Multi-level cell (MLC) flash is typically not suitable for traditional storage systems because it can only endure 5,000 to 10,000 write cycles. Instead, traditional systems must use single-level cell (SLC) SSDs and will soon begin using enterprise multi-level cell (eMLC) SSDs. SLC and eMLC technologies can endure up to 100,000 write cycles, but cost 4 to 6 times more than traditional MLC flash.
  
  Nimble Storage approaches the problem of write amplification differently. The CASL file system is optimized to aggregate a large number of random writes into sequential I/O stripes. It only writes to flash in multiples of full-erase block width sizes. As a result, write amplification is minimized, allowing the use of lower-cost MLC SSDs.
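To make the write amplification argument in the quoted passage concrete, here is a deliberately simplified toy model. It is not how Nimble's CASL or any real SSD controller works: it assumes a worst case where every single-page random update forces a whole erase block to be rewritten, and compares that with the same updates coalesced into full erase blocks. `PAGES_PER_BLOCK` and `HOST_PAGES` are made-up illustrative values.

```python
import math

PAGES_PER_BLOCK = 128  # assumed erase-block size in pages (illustrative)
HOST_PAGES = 1000      # pages the host asks to write (illustrative)


def write_amplification(host_pages, flash_pages):
    """Ratio of pages physically written to flash over pages the host wrote."""
    if host_pages <= 0:
        raise ValueError("host_pages must be positive")
    return flash_pages / host_pages


# Worst case: each random single-page update rewrites a whole erase block.
flash_pages_random = HOST_PAGES * PAGES_PER_BLOCK
wa_random = write_amplification(HOST_PAGES, flash_pages_random)

# Aggregated case: the same updates are packed into full erase blocks,
# so only the padding in the final block is wasted.
blocks_needed = math.ceil(HOST_PAGES / PAGES_PER_BLOCK)
flash_pages_striped = blocks_needed * PAGES_PER_BLOCK
wa_striped = write_amplification(HOST_PAGES, flash_pages_striped)

print(f"worst-case random writes: WA = {wa_random:.1f}")
print(f"full-block stripes:       WA = {wa_striped:.3f}")
```

Real drives sit somewhere between the two extremes because the flash translation layer and garbage collection avoid most whole-block rewrites, but the direction of the effect is the same.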
5. Re:If Google knows this... by swb · 2016-03-02 03:47 · Score: 1
  
  I know that Compellent uses MLC in their flash tiers, too, but they refer to it as "read intensive" and in the certification class it was explained it's only used for cache reads.
6. Re:If Google knows this... by jon3k · 2016-03-03 02:53 · Score: 1
  
  Neither Pure nor EMC market these for read only. The VNX2 we just installed uses it for FAST VP. Writes are cached on a handful of SLC based drives (2-4 disks usually), when possible, called "FAST Cache", to increase write performance. Then FAST VP moves the most used blocks to the MLC drives from the slower tiers (SAS, NL-SAS, SATA).
Re:Where is the report? by AmiMoJo · 2016-03-01 03:20 · Score: 1

Recent Intel drives have a "feature" where once a failed-write counter reaches a certain limit they go into read-only mode, and then brick after the next power cycle. Not exactly ideal behaviour.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re:Where is the report? by ledow · 2016-03-01 04:47 · Score: 1

Yup:
To quote the report:
"Revisiting Table 3, we see that this perception is correct when it comes to SLC drives and their RBER, as they are orders of magnitude lower than for MLC and eMLC drives. However, Tables 2 and 5 show that SLC drives do not perform better for those measures of reliability that matter most in practice: SLC drives don't have lower repair or replacement rates, and don't typically have lower rates of non-transparent errors"
SLC drives have fewer raw errors, but the errors that affect real-world usage (i.e. the non-correctable ones) occur about as often as they do on MLC. So technically MLC drives have MORE errors but are just as useful, as replacement rates are the same. Presumably thanks to more error correction.
Re:Where is the report? by castionsosa · 2016-03-01 04:53 · Score: 1

I'm curious if the Intel drives doing this are consumer level or enterprise grade drives. The Intel enterprise grade drives did quite well at a previous job where they were used to handle an insane amount of random I/O hitting them on a constant basis.
Of course, the difference between the two is the capacitors, which hold enough electricity to finish the in-flight write transaction, so a hard power off is less likely to cause the controller to lose its ability to find pages (the SSD equivalent of the thumping noise an HDD makes when it can't find the track servos, so hits the center hub constantly).
Re:Where is the report? by fnj · 2016-03-01 05:15 · Score: 1

and most consumer level drives aren't even MLC... but TLC
Er, TLC ("triple-level cell", 8 states) is a form of MLC. MLC is (blindingly obvious from the acronym) "multi-level cell", not "two-level cell" (4 states).
Re:Where is the report? by BronsCon · 2016-03-01 05:26 · Score: 3, Funny

They don't brick, they go into write only mode. Ask Intel, they'll tell ya.

--
APK quotes people (including myself) without context and should not be trusted. Just thought you should know.
Re:Where is the report? by larryjoe · 2016-03-01 05:43 · Score: 1

Thanks. However, what I read from this study is that SLC is indeed far more reliable, especially over time: the risk of a several-year-old MLC drive developing an uncorrectable error is an order of magnitude or more higher than for a similarly old SLC drive.
Only some less common errors are in the same ballpark, which should not be extrapolated to what the title claims.
This paper shows that RBER and the UE rates are correlated with P/E cycles, age, and SLC vs. MLC. But this was already known. Finding the opposite would be surprising. I would have expected that as the rated endurance is approached, retention is affected. It would have been interesting to see the RBER and UE rates for the MLC drives past the rated endurance points and to see if the authors' claim that error rates are linear with P/E cycles still holds past (or at least close to) the rated endurance.
Despite the better RBER and UE rates for SLC, the paper claims that SLC is no more reliable than MLC based on per-drive repair rates. But that seems sort of obvious. The parts of the SSD other than the flash cells should be expected to fail at about the same rate.
Also, in the short discussion that compares SSDs to HDDs, the authors say that annual failure rates (AFR) for HDDs (2-9%) are higher than for SSDs (4-10% over 4 years, or 1-3% AFR). However, as other reports (such as from Backblaze) show, some HDD models have consistently low AFRs below 1%. More importantly, the use cases for SSDs and HDDs are becoming more and more distinct, so the comparison is becoming less relevant.
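The conversion from "4-10% over 4 years" to "1-3% AFR" can be checked with a quick sketch. It assumes a constant, independent failure probability each year, which is a simplification and not necessarily how the paper derived its numbers; `annualized_rate` is a name made up for the example.

```python
def annualized_rate(cumulative_rate, years):
    """Convert a cumulative failure fraction over `years` into a per-year rate.

    Assumes the same independent failure probability every year, so that
    survival over the whole period is (1 - annual_rate) ** years.
    """
    if not 0 <= cumulative_rate < 1:
        raise ValueError("cumulative_rate must be in [0, 1)")
    if years <= 0:
        raise ValueError("years must be positive")
    return 1 - (1 - cumulative_rate) ** (1 / years)


low = annualized_rate(0.04, 4)   # 4% replaced over four years
high = annualized_rate(0.10, 4)  # 10% replaced over four years

print(f"annualized SSD replacement rate: {low:.2%} to {high:.2%}")
```

That works out to about 1.0% and 2.6% per year, in line with the 1-3% range quoted above.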
We should also keep in mind that large data center owners, such as Google, are extremely price sensitive. So, they will attempt to squeeze the most performance and reliability out of the cheapest drives. So, their engineering efforts will try to force the cheaper MLC to approach SLC. So, these results may not necessarily translate to single SSDs in laptops.
We should also keep in mind Google-specific
Re:Where is the report? by AmiMoJo · 2016-03-01 06:48 · Score: 1

Enterprise drives. They are very much there for caches or RAID use where the loss of a drive won't be critical. It just seems like an odd decision, presumably due to the firmware being unable to write internal metadata back to the drive and having to keep it all in RAM. Crap design IMHO.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re:Where is the report? by Jeff- · 2016-03-01 06:50 · Score: 1

That makes logical sense but that is not how the terms are used in industry. MLC actually means two levels and TLC is three. QLC will be the term to describe four level cells that I believe are not yet on the market.
Re:Where is the report? by thegarbz · 2016-03-01 09:14 · Score: 1

The problem is not so much schemes used for error recovery or wear levelling. Those are all well understood. The problem is one of QA. Poorly written firmware, unable to handle edge cases (e.g. power failure), bugs in the code etc. You only need to see the list of firmware updates some of the drive manufacturers roll out to realise that it's not the hardware that is causing problems, it's the firmware in the controllers.
SSD pile growing, HDD pile shrinking. by m.dillon · 2016-03-01 11:37 · Score: 1

Calculate the cost of the replacement cycle too and suddenly SSDs look a lot cheaper. It's just that most people can't think beyond the end of their noses, so if the up-front cost looks expensive they stop right there.
I bought my last HDDs last year. Two 4TB 'archival' drives for backups. My existing pile of new 1TB or 2TB HDDs (I have around a dozen 3.5" and half a dozen 2.5" left) will be dribbled out as needed, but I won't be buying any new HDDs from now on. In fact, I couldn't even foist off some of my extra 2TB 3.5" HDDs onto friends last weekend. They weren't interested! Bryce happily took one of my original 40G Intel SSDs (2.5") but had no interest whatsoever in a brand spanking new 2TB WDC (3.5").
Last year I scrapped the 3.5" form factor (I made two exceptions for the two backup drives). All new systems have only 2.5" hot swap slots now. And until recently I had a growing pile of 2.5" HDDs to maintain those systems.
But now my pile of 2.5" SSDs continues to grow while my pile of 3.5" and 2.5" HDDs has begun to shrink. The strange thing about my pile of 2.5" SSDs... I haven't had to throw any SSDs away since I started it! That pile began with the old 40G Intel 2.5" SSDs that really started the industry ramp. All of those originals that I had slowly replaced with higher-capacity SSDs are still in the pile and still in perfect working order! And I actually use them, they are perfect for small test systems.
So it's a bit of a strange situation. HDDs would die or get read errors and I'd wipe and throw them away. I never recycled old HDDs very much (they become unreliable when you mix cold and hot cycling or shelve them). But SSDs are a completely different matter. You can mix cold and hot cycling just fine, they basically don't go bad unless they fail outright (at a much lower rate than any HDD) or unless they are worn out from write wear (which is quite hard to do if you don't do stupid things with them). If it's just an unrecoverable block due to chance, a reformat fixes the problem and they continue soldiering on (which hasn't happened to me yet but that's what I'll do).
So my SSD pile continues to grow and the capacities continue to cycle up. The pile saw its first 1TB SSDs last year. At this juncture if I have to replace my archival HDDs I might use some of the spares from my HDD pile, but after that I'll happily spend the money to throw in a bunch of SSDs for the same storage because they'll last a whole lot longer.
-Matt
Re:Where is the report? by Tough+Love · 2016-03-01 12:41 · Score: 1

The problem is one of QA. Poorly written firmware, unable to handle edge cases (e.g. power failure)
Nice post, except I would hardly call power failure an "edge case".

--
When all you have is a hammer, every problem starts to look like a thumb.
Re:Where is the report? by thegarbz · 2016-03-01 22:58 · Score: 1

Yes "abnormal" operation could be considered a better word.
Re:So long, physics by jon3k · 2016-03-02 02:50 · Score: 1

That's not the conclusion of the study at all. This is not a study of endurance.
Re:Probably not applicable to portable devices by Bengie · 2016-03-02 06:26 · Score: 1

holding down the on/off button to improperly shut the computer off when it freezes
There's an improper way to shut down an unresponsive computer? What about systems without reset buttons?
Re:Where is the report? by stoatwblr · 2016-03-10 23:41 · Score: 1

And the odd part is that a "Three level cell" is actually 8 levels (3 bits), so QLC will be 16 levels.
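The naming confusion in this sub-thread comes down to bits per cell versus voltage levels per cell: a cell storing n bits has to distinguish 2^n levels. A quick sketch using the usual industry names (SLC = 1 bit, MLC = 2, TLC = 3, QLC = 4); the helper and table names are made up for the example.

```python
# Bits stored per cell, using the names as the industry applies them
BITS_PER_CELL = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def levels_for(bits):
    """Number of distinct charge levels a cell needs to store `bits` bits."""
    return 2 ** bits


LEVELS_PER_CELL = {name: levels_for(bits) for name, bits in BITS_PER_CELL.items()}

for name, bits in BITS_PER_CELL.items():
    print(f"{name}: {bits} bit(s) per cell, {LEVELS_PER_CELL[name]} levels")
```

So the "triple" in TLC counts bits, not levels, which is why a TLC cell has 8 states and a QLC cell 16.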