Backblaze Releases Billion-Hour Hard Drive Reliability Report (extremetech.com)

Posted by BeauHD on Tuesday May 17, 2016 @10:00PM from the reliability dept.

jones_supa writes: The storage services provider Backblaze has released its reliability report for Q1/2016 covering cumulative failure rates of mechanical hard disk drives by specific model numbers and by manufacturer. The company noted that as of this quarter, its 60,000 drives have cumulatively spun for over one billion hours (100,000 years). Hitachi Global Storage Technologies (HGST) is the clear leader here, with an annual failure rate of just 1% for three years running. The second position is also taken by a Japanese company: Toshiba. Third place goes to Western Digital (WD), with the company's ratings having improved in the past year. Seagate comes out the worst, though it is suspected that much of that rating was warped by the company's crash-happy 3 TB drive (ST3000DM001). Backblaze notes that 4 TB drives continue to be the sweet spot for building out its storage pods, but that it might move to 6, 8, or 10 TB drives as the price on the hardware comes down.

Japanese? Not anymore. by johnsmithperson123 · 2016-05-17 22:09 · Score: 4, Informative

HGST is owned by WD now if I recall, so it's not Japanese anymore. (Sorry if somebody already mentioned this.)
1. Re:Japanese? Not anymore. by Solandri · 2016-05-18 01:18 · Score: 4, Informative
  
  IBM sold their storage division to Hitachi, who renamed it HGST. So it was never Japanese to begin with.
  
  Several countries objected to the HGST and WD merger since it would leave only two manufacturers of 3.5" HDDs (WD and Seagate). So to push the merger through, HGST agreed to sell its 3.5" assets to Toshiba (which until then only made 2.5" HDDs) so we would have three manufacturers of 3.5" HDDs
2. Re:Japanese? Not anymore. by fnj · 2016-05-18 02:21 · Score: 2
  
  > [HGST is] still run as a separate company though, with engineers and design in Japan and manufacturing in Japan and I think Malaysia.
  I've got some bad news for you. MOFCOM has approved a full integration of the two companies, so (1) the HGST brand will disappear, and (2) all that HGST tradition of reliability is headed straight down the crapper.
  MOFCOM in 2012 at the time of the merger approval "restricted the two companies to a 'hold separate' ..., which prevented the companies from combining products and workforce."
  For, actually, three years the two companies were forced to act as separate entities. But since late last year, WD has been allowed to integrate "substantial portions" of its business with HGST. R&D and manufacturing will become fully integrated within another one to two years. Separate product lines will no longer be mandated after two years.
  It's going to be a repeat of what happened when Seagate swallowed Samsung. All the goodness of the Samsung products evaporated. The failed morass of Seagate's lackadaisical incompetence annihilated it.
Re:Why does this matter? by Anonymous Coward · 2016-05-17 22:10 · Score: 5, Informative

It will affect you, if you ignore the results and choose to buy a Seagate drive. Trust me, I've been there...
Re:Why does this matter? by Anonymous Coward · 2016-05-17 22:26 · Score: 5, Informative

> Can anyone tell me how this affects anyone? A billion hours is a ridiculous amount of time that makes this irrelevant to any reasonable person. No one cares if a hard drive lasts a billion hours.
I suggest you look at the definition of the word "cumulatively".
Here is a hint: divide 1,000,000,000 by the 60,000 HDD of the report, this makes 16,667 hours which is approximately 2 years.
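The division above can be checked in a few lines. This is a rough sketch that assumes all 60,000 drives ran for the same length of time, which the report does not claim (the fleet grew over the years); the variable names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

total_drive_hours = 1_000_000_000  # cumulative hours quoted in the summary
drive_count = 60_000               # drive count quoted in the summary

# Average running time per drive if the hours were spread evenly.
hours_per_drive = total_drive_hours / drive_count
years_per_drive = hours_per_drive / HOURS_PER_YEAR

print(f"{hours_per_drive:,.0f} hours per drive, about {years_per_drive:.1f} years")
```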
That webpage by Anonymous Coward · 2016-05-17 23:06 · Score: 2, Informative

Good god! Opening that webpage is like walking through treacle. I had to turn on Ghostery - 25 trackers!!
Re:Why does this matter? by RogueyWon · 2016-05-17 23:08 · Score: 3, Interesting

Yeah, I've been there. There's nothing quite like the sinking feeling you get when you first hear the "bird chirp" sound which is the first sign of impending catastrophic failure.
I had 3 of those drives fail in a 6 month period, all of them relatively new and only subjected to consumer-level usage. It got to the point where I was getting agitated every time there was birdsong outside my window. Seagate drives don't get anywhere near my home PC since then.
Not very useful by TheRaven64 · 2016-05-17 23:18 · Score: 2

As they note, one of their drives has an 8% annual failure rate because they have 45 and one happened to fail this quarter. A lot of the others are similar numbers, with the difference between 0 and 1 failures being 4-8%. The only ones where they have enough data to be useful are HGST, one WD, and two Seagate models. One Seagate is a lot less reliable than most HGST drives (and less reliable than the worst HGST model), the other is the most reliable disk in the set. The WD drive is the least reliable.
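To see how much a single failure moves the number for a small population, here is a minimal sketch. It assumes the rate is annualized as failures per drive-year and that all 45 drives ran for a full quarter of about 91 days; both are assumptions for illustration, so the result will not match the report's figure exactly.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Failures per drive-year, expressed as a percentage."""
    drive_years = drive_days / 365
    return 100 * failures / drive_years

# Hypothetical observation window: 45 drives, about 91 days each.
drive_days = 45 * 91

afr_zero = annualized_failure_rate(0, drive_days)
afr_one = annualized_failure_rate(1, drive_days)

print(f"0 failures: {afr_zero:.1f}%   1 failure: {afr_one:.1f}%")
```

With so few drive-days, one extra failure is the whole difference between a perfect score and a rate of several percent.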

--
I am TheRaven on Soylent News
1. Re:Not very useful by thegarbz · 2016-05-18 02:17 · Score: 2
  
  You say not very useful and then point out that they provide all the stats needed to get all required information about the usefulness, and then also go on to say that there are several models with some really large data sets.
  This is the very opposite of "not very useful".
Re:ST3000DM001? In a DATA CENTER? by Anonymous Coward · 2016-05-17 23:19 · Score: 5, Insightful

Ever price out "enterprise" drives?
When you're buying 10 drives, you pay the premium because man-hours to deal with failures are expensive. When you're buying 10,000? Not so much because failures are built into the design at that scale.
At the scale Backblaze operates, it's cheaper to build redundant systems that can handle consumer drive failures and just buy twice as many drives.
Re:Why does this matter? by GrumpySteen · 2016-05-17 23:25 · Score: 5, Informative

It's how statistics work.
There are over 7 billion people on the planet divided among 100 or so ethnicities and about 200 countries. If you're trying to determine the demographics of the world, checking only 10 random people will not give you any meaningful data. Checking a million random people, on the other hand, will give you a fairly good idea of the demographics of the world.
Same with hard drives. Statistics on 5 hard drives won't tell you anything about the likelihood of a 6th drive failing. Statistics on 100,000 drives will.
Re: Why does this matter? by Traxton · 2016-05-17 23:34 · Score: 2

My 3x Seagate NAS 4TB drives sitting in my file server have power-on hours of about 19000 without any sign of problems. I will replace them at the 3 year running mark anyway, but mostly because I am running out of space. If you wanna talk about really horrible drives, look at first generation WD Green. I sold mine after 2 months because of extremely lacking performance and worrying mechanical noises. During the 20 years I have been using hard drives I have come to the conclusion that one should never buy the largest drives and not the first generation of new technology drives. Choose the safe middle-ground.
Re: Why does this matter? by fuzzyfuzzyfungus · 2016-05-17 23:43 · Score: 4, Insightful

Depends on your use case: the Backblaze people are operating a system specifically designed for cheapo drives that are expected to have a fairly high chance of falling over and dying(pragmatically speaking, that's part of why they are so nice and friendly about drive reliability data and sharing the designs for their 'pods': their real asset as a company is the software sauce that allows them to offer cheap, reliable, storage through software-level redundancy on top of a pile of low-end drives packed tight and connected with really cheap HBAs and SATA port multipliers: no fancy hardware RAID, no redundant-controller SAS, etc.)

If you are buying drives to use as the boot volume for computers that only get a single HDD, or even systems with small RAID arrays, you are going to be seriously inconvenienced by drive models that drop dead atypically fast, even if you save a few bucks upfront. Re-imaging a replacement drive or swapping out a failed RAID disk and rebuilding the volume take time and trouble.

If your purposes are very similar to theirs, then your sensitivity to failure is lower and getting a slightly better deal per GB might start to make sense; but you have to be pretty failure insensitive(or the price of reliability really steep) to be in the same boat.
Re: Why does this matter? by dwywit · 2016-05-17 23:47 · Score: 2

IIRC, the WD green drives had firmware issues - they were "green" because the FW would power them down prematurely in an effort to save energy, only to have them powered up again because the OS requested a read/write. Too many off/on cycles = premature failure.
Also, I must be lucky - I've had one Seagate failure in 12 years, and it was replaced under warranty. Small sample, admittedly - somewhere between 100 and 150 in domestic use.

--
They sentenced me to twenty years of boredom
Re:they only run wd reds (non pro) by fuzzyfuzzyfungus · 2016-05-17 23:51 · Score: 2

Probably price: Backblaze's thing is using some sort of software abstraction and redundancy layer to get away with providing storage on the cheapest drives that they can get their hands on.

Makes them a pretty good value among providers of offsite backup/cold-ish storage; but they have a very limited interest in paying for more reliability at the hardware level, since that would fairly quickly push them into the domain of traditional storage vendors who use more expensive hardware to provide fault tolerance for software that isn't designed to handle that itself.

They obviously have an interest in getting the best value for money, hence the gathering reliability data, and they'd presumably be willing to pay a nonzero premium if the reliability difference were large enough; but their whole approach is a 'paper over lousy hardware in software' strategy. It makes their storage designs a poor drop-in replacement for many applications(even if you are using a fairly clever filesystem like ZFS that has good tolerance for some drives dying, the sight of SATA port multipliers hanging off the cheapest HBAs they can find might make you a bit nervous); but it's pretty difficult to buy a storage system where a lower percentage of the total cost is non-disk hardware.
Re:Why does this matter? by marcansoft · 2016-05-18 01:00 · Score: 3, Insightful

No, it will affect you if you choose to ignore the results and buy a *3TB* Seagate drive.
When will people stop picking stupid manufacturer sides when it comes to drive reliability? It has nothing to do with manufacturers and everything to do with models. *Every* drive maker has put out shitty models that fail in dumb ways, from HGST (ex-IBM)'s DeathStars to Samsung's firmware fail (I still own a bunch of HD204UIs with an unfixed firmware bug that eats data if you dare use SMART self-tests) to Seagate's 3TB failures. Picking manufacturer sides just means you'll get hit whenever they make the next broken drive.
If you actually look at their per-drive stats, you'll see that Seagate's 4TB drive is, so far, *more* reliable than WD's current drives. I have a bunch of those and they're mostly running fine - though I had one drop off the controller last weekend (came back after reboot), first failure in years, I need to look into that. We'll see. Right now, 4TB Seagates seem to be the best bang per buck with decent reliability. Next year it might be another brand/drive.
Actual link to report by Solandri · 2016-05-18 01:06 · Score: 4, Informative

And not some news website which doesn't even have the courtesy to provide a link to the actual source report.

https://www.backblaze.com/blog/hard-drive-reliability-stats-q1-2016/

It includes historical models as well as statistical confidence intervals - very useful for determining which model drive is more reliable. I know everyone wants to use an easy rule like "Seagate bad" when buying, but it's not that simple. Each new model of drive includes new design changes to try to increase capacity, improve speed and reliability, and/or reduce cost. Sometimes these design changes work, sometimes they don't and the model is less reliable (e.g. Samsung 840 EVO). The statistics have the greatest orthogonality when broken down by model, not by manufacturer.
1. Re:Actual link to report by PhrostyMcByte · 2016-05-18 02:44 · Score: 2
  
  Slashdot should not link to sites that hide the actual article. So annoying.
Re:Terrible Data Table by Solandri · 2016-05-18 01:32 · Score: 2

The article linked in the summary is a (bad) tech website's take on the actual report. Look above, I've provided a link to the actual Backblaze report.

The different drives have been in operation for different lengths of time. So they have to normalize the failure rate to an annual number in order to compare. e.g. If a drive model has been in use for 3 years, you just give the number of failures in the last year. If a drive model has been in use for 3 months, you multiply its failure rate by 4 to get its projected annual failure rate. It's not perfect, but at least this way you're comparing based on the same number of operating hours.

The bar graph (or anything comparing based on manufacturer) is pretty useless. I'd suggest just ignoring it. You want to concentrate on looking at the different model drives. The bar graph is made by lumping the statistics of each manufacturer's drives they used for the year. So it's a non-normalized amalgam of (1) a different mix of drive models every year, (2) in different quantities. Those two variables pretty much wipe out any statistical meaning, which you can get directly from the other charts they provide anyway. It's something that would be useful to Backblaze internally (see how well they're doing at filtering out unreliable drive models year over year), but useless for anyone else.
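The point about the per-manufacturer bar graph can be illustrated with made-up numbers. In this sketch the two models and their failure rates are hypothetical and held constant; only the mix of drives in service changes between years.

```python
# Hypothetical per-model annual failure rates, identical in both years.
MODEL_AFR = {"model_a": 0.02, "model_b": 0.20}

def aggregate_afr(drive_years_by_model: dict[str, float]) -> float:
    """Manufacturer-level failure rate obtained by lumping all models together."""
    expected_failures = sum(
        MODEL_AFR[model] * drive_years
        for model, drive_years in drive_years_by_model.items()
    )
    total_drive_years = sum(drive_years_by_model.values())
    return expected_failures / total_drive_years

# Same manufacturer, same models, different mix of drives in service.
year1 = aggregate_afr({"model_a": 9000, "model_b": 1000})
year2 = aggregate_afr({"model_a": 5000, "model_b": 5000})

print(f"year 1: {year1:.1%}   year 2: {year2:.1%}")
```

Neither model got better or worse, yet the lumped figure roughly triples, which is why the per-model tables are the ones worth reading.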
Re:Why does this matter? by gumbi+west · 2016-05-18 01:57 · Score: 2

If you randomly select drives from a population then it absolutely does tell you something about the unsampled units. Obviously they don't run their drives for one hour and then retire them.
Re:Why does this matter? by thegarbz · 2016-05-18 02:14 · Score: 2

You must be one of those "the chance of winning the lottery is 50:50, you either win or you don't" people.
A long analysis of statistics of 100000 drives most definitely gives you information about the 100001th drive when it's in a population group compared to another population group.

> Build a drive that self destructs after 2 hours, run a billion of them for 1 hour ... billion hours of time with no failures!!!!!!

Your absurd abuse of statistics would give very valuable insight into the assembly process and QA process of a manufacturer. This would produce very valuable information despite your attempt to show it's worthless, especially since infant mortality is a thing.
Interesting trends by thegarbz · 2016-05-18 02:19 · Score: 2

Aside from comments on specific models and specific manufacturers, has anyone else noticed the downward trend?
I wonder if this is due to more careful selection or (except in the case of Seagate which is quite obvious) the manufacturers are actually getting better, or age related issues in the way the stats are reported.
Re: Why does this matter? by Gr8Apes · 2016-05-18 02:23 · Score: 2

I had two of the terrible 1.5TB Seagates fail early. Didn't even do a warranty exchange on them, wasn't worth having to do another one 3-6 months down the road, and then another, and another and... So I bought WD, Toshiba, HGST, pretty much anything but Seagate. I still won't buy Seagate. Trust once lost is hard to earn back. Their drives just haven't been better than the brands I do trust, so no reason to go back to them.

--
The cesspool just got a check and balance.
Better Source by Krazy+Kanuck · 2016-05-18 03:29 · Score: 2

I realize advertising is king here, but a link to the original and far more detailed report would have been nice. https://www.backblaze.com/blog...
Re:Why does this matter? by omnichad · 2016-05-18 03:34 · Score: 2

They were still using 3TB Seagates in their last report (Q4 2015). They discontinued all use of them as a result of their findings.
Re: Why does this matter? by MachineShedFred · 2016-05-18 04:15 · Score: 3, Informative

They buy Seagate because Seagate will allow them to do volume purchases.
It's a bit easier to go to your local Best Buy and get one or two drives of whatever manufacturer you want than to buy 10,000 drives in a single order. The article specifically says that WD and Toshiba haven't been able to get that done, where Hitachi and Seagate have.

--
Slashdot still doesn't support Unicode after it was added to the HTML standard in 1997.
Re:Is cheaper really better? by brianwski · 2016-05-18 06:09 · Score: 5, Informative

> Does it really pay off in the long-run to buy lower quality drives?

Disclaimer: Brian from Backblaze here. We use a fairly small, simple spreadsheet to answer that exact question. If Drive A is the same size as Drive B but fails 1% more often, then we might choose the drive that fails at a higher rate if it is 2% cheaper, and if it is 10% cheaper it is a slam dunk. Make sense?

You ask about warranty. We enter the warranty information into the simple spreadsheet. If a warranty is 5 years long, then replacement drives are free during that time. If the failure rate is 1% per year, then that warranty is worth exactly 5% to us. If a drive with no warranty at all is 10% cheaper, then it is cheaper. If the drive with no warranty is 2% cheaper then we purchase the drive with the warranty.
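The warranty rule of thumb above reduces to one multiplication and one comparison. The following is a sketch of that reasoning only, not Backblaze's actual spreadsheet; the function names are invented for illustration, and it ignores labor, shipping, and the time value of money.

```python
def warranty_value_fraction(annual_failure_rate: float, warranty_years: float) -> float:
    """Expected share of the purchase price recovered through free replacements."""
    return annual_failure_rate * warranty_years

def prefer_no_warranty(discount_fraction: float,
                       annual_failure_rate: float,
                       warranty_years: float) -> bool:
    """True when the discount on the unwarranted drive exceeds the warranty's value."""
    return discount_fraction > warranty_value_fraction(annual_failure_rate, warranty_years)

# The figures from the comment: 1% annual failure rate, 5-year warranty.
value = warranty_value_fraction(0.01, 5)
print(f"warranty worth about {value:.0%} of the drive price")
print("10% cheaper, no warranty:", prefer_no_warranty(0.10, 0.01, 5))
print(" 2% cheaper, no warranty:", prefer_no_warranty(0.02, 0.01, 5))
```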

In reality, the simple spreadsheet has a few more categories. For example, an 8 TByte Hard Drive takes half the datacenter space rental as two 4 TByte drives and the 8 TByte drive takes about half the electricity of the two 4 TByte drives. So if they were the same price we would obviously choose the 8 TByte drive. But they aren't the same price, so the additional cost of the 8 TByte drive has to be recovered over three years of reduced cabinet space rental costs and reduced electricity costs. We purchase drives once per month, so we get 20 bids from our cheapest suppliers, and right now SOME months Backblaze ends up purchasing the 8 TByte drives because they will pay for themselves within 3 years, and some months we go back to the 4 TByte drives because they are so ridiculously cheap it would take 7 years for the 8 TByte drives to pay for themselves.
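The payback comparison described above can be sketched as follows. All prices and the yearly per-slot cost are hypothetical, and the model assumes, as the comment does, that one 8 TByte drive costs about as much to house and power as one 4 TByte drive.

```python
def payback_years(price_8tb: float, price_4tb: float, yearly_slot_cost: float) -> float:
    """Years for one 8 TB drive to recover its premium over two 4 TB drives.

    yearly_slot_cost is the assumed rack-space plus electricity cost of
    keeping one drive online for a year; the 8 TB option saves one slot.
    """
    premium = price_8tb - 2 * price_4tb
    if premium <= 0:
        return 0.0  # the larger drive is already cheaper per terabyte
    return premium / yearly_slot_cost

# Hypothetical bids for two different months.
month_a = payback_years(price_8tb=300, price_4tb=120, yearly_slot_cost=25)
month_b = payback_years(price_8tb=300, price_4tb=60, yearly_slot_cost=25)

for label, years in (("month A", month_a), ("month B", month_b)):
    choice = "8 TB" if years <= 3 else "4 TB"
    print(f"{label}: payback {years:.1f} years, buy {choice}")
```

The three-year threshold comes from the comment; everything else here is a placeholder to show why the answer can flip from month to month as 4 TByte prices move.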
Re:Is cheaper really better? by brianwski · 2016-05-18 06:13 · Score: 5, Informative

Brian from Backblaze here. This is exactly correct. We have redundancy across multiple computers in multiple locations in our datacenter, so losing one drive is usually a calm, non critical event that we take up to 24 hours to replace at our leisure during business hours.

If you are interested in details of our redundancy, here is a blog post about our "Vaults": https://www.backblaze.com/blog...

Summary of article: Backblaze uses Reed-Solomon coding across 20 computers in 20 locations in our datacenter. It is a 17 data drive plus 3 parity configuration, so we can lose any 3 entire pods in 3 separate racks in our datacenter and the data is still completely intact and available.
Re:ST3000DM001? In a DATA CENTER? by brianwski · 2016-05-18 06:40 · Score: 5, Informative

> What ... is this company doing using consumer hard drives in a ... data center? .... they will fall out of an array every time there's a URE

Brian from Backblaze here. You assume we use RAID (inside of one computer), which is incorrect. We wrote our own layer where any one piece of data is Reed-Solomon encoded across 20 different computers in 20 different locations in our datacenter (which is using some of the excellent ideas from RAID and ditching some of the parts that don't work well in our particular application). Our encoding happens to be 17 data drives plus 3 parity. We can make our own decisions about what to do with timeouts. When doing reads, we ask all 20 computers for their piece, and THE FIRST 17 THAT RETURN are used to calculate the answer. Now if one of the computers does not respond at all we send a data center tech to replace it. But if it was just momentarily slow a few times a day we let it be (we don't eject it from the Reed-Solomon Group).
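The "ask all 20, use the first 17" read pattern might look roughly like this. It is a sketch under heavy assumptions: `fetch_shard` and `read_first_k` are hypothetical names, the network read is simulated with a random sleep, and the Reed-Solomon decoding step is omitted entirely. It is not Backblaze's code.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

DATA_SHARDS = 17   # shards needed to reconstruct the data
TOTAL_SHARDS = 20  # 17 data plus 3 parity

def fetch_shard(index: int) -> tuple[int, bytes]:
    """Stand-in for reading one shard from one machine; latency is simulated."""
    time.sleep(random.uniform(0.0, 0.05))
    return index, f"shard-{index}".encode()

def read_first_k(k: int = DATA_SHARDS, n: int = TOTAL_SHARDS) -> dict[int, bytes]:
    """Request all n shards at once and return as soon as any k have arrived."""
    shards: dict[int, bytes] = {}
    pool = ThreadPoolExecutor(max_workers=n)
    futures = [pool.submit(fetch_shard, i) for i in range(n)]
    for future in as_completed(futures):
        index, payload = future.result()
        shards[index] = payload
        if len(shards) == k:
            break  # slow responders are simply ignored for this read
    # Do not wait for stragglers; drop any requests that have not started.
    pool.shutdown(wait=False, cancel_futures=True)
    return shards

shards = read_first_k()
print(f"got {len(shards)} of {TOTAL_SHARDS} shards: {sorted(shards)}")
```

A real implementation would hand the collected shards to a Reed-Solomon decoder and track which machines are repeatedly slow or silent; the sketch shows only why a single slow drive does not stall the read.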

> These drives are only meant to be powered on a few hours a day and consumer workload duty cycles

I think a really interesting study would be to power a few thousand drives up once per day for an hour and shut them down. Compare it to a control group of the same drives left on so their temperature did not fluctuate. See which ones last longer without failure. I honestly don't have the answer. (Really, I don't.) What I do know is that Backblaze has left 61,590 hard drives continuously spinning, most of these are often labeled as "consumer drives", and that the vast majority of drives last so long that we copy the data off onto massively more dense drives (like copying all the data off a 1 TByte drive into an 8 TByte drive) not because the 1 TByte fails, but because it ECONOMICALLY MAKES SENSE. An 8 TByte drive takes less electricity per TByte, takes 1/8th the rack space rental, etc. So Backblaze honestly wouldn't care if the "Enterprise Drives" lasted 10x as long in our environment-> we would STILL replace them at the same moment.
Re:Is cheaper really better? by Mr.CRC · 2016-05-18 14:59 · Score: 2

I'd rather have two of the $130 drives in RAID1, vs. the $264 drive. Then buy a cup of coffee.
I just experienced the first drive failure ever, since getting my first 20MB Winchester hard drive in the 80s. Fortunately it happened while testing before using as temp. storage to allow repartitioning another drive. Granted, my drives have only ever seen "desktop" workloads.