Backblaze Releases Billion-Hour Hard Drive Reliability Report (extremetech.com)

Posted by BeauHD on Tuesday May 17, 2016 @10:00PM from the reliability dept.

jones_supa writes: The storage services provider Backblaze has released its reliability report for Q1/2016 covering cumulative failure rates of mechanical hard disk drives by specific model numbers and by manufacturer. The company noted that as of this quarter, its 60,000 drives have cumulatively spun for over one billion hours (100,000 years). Hitachi Global Storage Technologies (HGST) is the clear leader here, with an annual failure rate of just 1% for three years running. The second position is also taken by a Japanese company: Toshiba. Third place goes to Western Digital (WD), with the company's ratings having improved in the past year. Seagate comes out the worst, though it is suspected that much of that rating was warped by the company's crash-happy 3 TB drive (ST3000DM001). Backblaze notes that 4 TB drives continue to be the sweet spot for building out its storage pods, but that it might move to 6, 8, or 10 TB drives as the price on the hardware comes down.

11 of 130 comments

Japanese? Not anymore. by johnsmithperson123 · 2016-05-17 22:09 · Score: 4, Informative

HGST is owned by WD now if I recall, so it's not Japanese anymore. (Sorry if somebody already mentioned this.)
1. Re:Japanese? Not anymore. by Solandri · 2016-05-18 01:18 · Score: 4, Informative
  
  IBM sold their storage division to Hitachi, who renamed it HGST. So it was never Japanese to begin with.
  
  Several countries objected to the HGST and WD merger since it would leave only two manufacturers of 3.5" HDDs (WD and Seagate). So to push the merger through, HGST agreed to sell its 3.5" assets to Toshiba (which until then only made 2.5" HDDs) so we would have three manufacturers of 3.5" HDDs.
Re:Why does this matter? by Anonymous Coward · 2016-05-17 22:10 · Score: 5, Informative

It will affect you, if you ignore the results and choose to buy a Seagate drive. Trust me, I've been there...
Re:Why does this matter? by Anonymous Coward · 2016-05-17 22:26 · Score: 5, Informative

> Can anyone tell me how this affects anyone? A billion hours is a ridiculous amount of time that makes this irrelevant to any reasonable person. No one cares if a hard drive lasts a billion hours.

I suggest you look at the definition of the word "cumulatively".
Here is a hint: divide 1,000,000,000 by the 60,000 HDDs in the report. That gives about 16,667 hours per drive, which is approximately 2 years.
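
The division above can be sketched in a few lines. This is only an illustration: the function name is made up here, and the hours-per-year constant (24 x 365.25) is an assumption, not something from the report.

```python
# Assumption: an average year of 365.25 days.
HOURS_PER_YEAR = 24 * 365.25

def average_hours_per_drive(total_hours, drive_count):
    """Spread cumulative spin time evenly across the fleet (hypothetical helper)."""
    return total_hours / drive_count

# Figures quoted in the story summary.
hours_each = average_hours_per_drive(1_000_000_000, 60_000)
years_each = hours_each / HOURS_PER_YEAR

print(f"{hours_each:,.0f} hours per drive, about {years_each:.1f} years")
```

An even split is a simplification, since drives were added to the fleet at different times.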
That webpage by Anonymous Coward · 2016-05-17 23:06 · Score: 2, Informative

Good god! Opening that webpage is like walking through treacle. I had to turn on Ghostery - 25 trackers!!
Re:Why does this matter? by GrumpySteen · 2016-05-17 23:25 · Score: 5, Informative

It's how statistics work.
There are over 7 billion people on the planet divided among 100 or so ethnicities and about 200 countries. If you're trying to determine the demographics of the world, checking only 10 random people will not give you any meaningful data. Checking a million random people, on the other hand, will give you a fairly good idea of the demographics of the world.
Same with hard drives. Statistics on 5 hard drives won't tell you anything about the likelihood of a 6th drive failing. Statistics on 100,000 drives will.
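
One way to put numbers on that point is a confidence interval for the failure proportion. The sketch below uses the Wilson score interval, which is a swapped-in textbook technique and not necessarily what Backblaze computes; the drive and failure counts are hypothetical, and it assumes each drive either fails or survives over one fixed period.

```python
import math

def wilson_interval(failures, drives, z=1.96):
    """Approximate 95% Wilson score interval for a failure proportion."""
    p = failures / drives
    denom = 1 + z * z / drives
    center = (p + z * z / (2 * drives)) / denom
    half = z * math.sqrt(p * (1 - p) / drives + z * z / (4 * drives * drives)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)

small = wilson_interval(0, 5)            # hypothetical: 5 drives, none failed
large = wilson_interval(2_000, 100_000)  # hypothetical: 2% of 100,000 failed

print("5 drives:       %.4f to %.4f" % small)
print("100,000 drives: %.4f to %.4f" % large)
```

With 5 drives the interval spans a large part of the possible range even when no drive failed, while with 100,000 drives it is a small fraction of a percentage point wide.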
Actual link to report by Solandri · 2016-05-18 01:06 · Score: 4, Informative

And not some news website which doesn't even have the courtesy to provide a link to the actual source report.

https://www.backblaze.com/blog/hard-drive-reliability-stats-q1-2016/

It includes historical models as well as statistical confidence intervals - very useful for determining which model drive is more reliable. I know everyone wants to use an easy rule like "Seagate bad" when buying, but it's not that simple. Each new model of drive includes new design changes to try to increase capacity, improve speed and reliability, and/or reduce cost. Sometimes these design changes work, sometimes they don't and the model is less reliable (e.g. Samsung 840 EVO). The statistics are most informative when broken down by model, not by manufacturer.
Re: Why does this matter? by MachineShedFred · 2016-05-18 04:15 · Score: 3, Informative

They buy Seagate because Seagate will allow them to do volume purchases.
It's a bit easier to go to your local Best Buy and get one or two drives of whatever manufacturer you want than to buy 10,000 drives in a single order. The article specifically says that WD and Toshiba haven't been able to get that done, where Hitachi and Seagate have.

--
Slashdot still doesn't support Unicode after it was added to the HTML standard in 1997.
Re:Is cheaper really better? by brianwski · 2016-05-18 06:09 · Score: 5, Informative

> Does it really pay off in the long-run to buy lower quality drives?

Disclaimer: Brian from Backblaze here. We use a fairly small, simple spreadsheet to answer that exact question. If Drive A is the same size as Drive B but fails 1% more often, then we might choose the drive that fails at a higher rate if it is 2% cheaper, and if it is 10% cheaper it is a slam dunk. Make sense?

You ask about warranty. We enter the warranty information into the simple spreadsheet. If a warranty is 5 years long, then replacement drives are free during that time. If the failure rate is 1% per year, then that warranty is worth exactly 5% to us. If a drive with no warranty at all is 10% cheaper, then it is cheaper. If the drive with no warranty is 2% cheaper then we purchase the drive with the warranty.
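
The warranty rule of thumb above can be written as a tiny calculation. This is a sketch of the arithmetic in the comment, not the actual spreadsheet: the function names are hypothetical, and it ignores labor, shipping, and changes in failure rate over a drive's life.

```python
def warranty_value(annual_failure_rate, warranty_years):
    """Fraction of the purchase price the warranty is worth (hypothetical helper)."""
    return annual_failure_rate * warranty_years

def prefer_no_warranty(discount, annual_failure_rate, warranty_years):
    """True if the no-warranty discount exceeds the warranty's value."""
    return discount > warranty_value(annual_failure_rate, warranty_years)

# Numbers from the comment: 1% failures per year, 5-year warranty.
print(warranty_value(0.01, 5))
print(prefer_no_warranty(0.10, 0.01, 5))  # 10% cheaper without warranty
print(prefer_no_warranty(0.02, 0.01, 5))  # 2% cheaper without warranty
```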

In reality, the simple spreadsheet has a few more categories. For example, an 8 TByte Hard Drive takes half the datacenter space rental as two 4 TByte drives and the 8 TByte drive takes about half the electricity of the two 4 TByte drives. So if they were the same price we would obviously choose the 8 TByte drive. But they aren't the same price, so the additional cost of the 8 TByte drive has to be recovered over three years of reduced cabinet space rental costs and reduced electricity costs. We purchase drives once per month, so we get 20 bids from our cheapest suppliers, and right now SOME months Backblaze ends up purchasing the 8 TByte drives because they will pay for themselves within 3 years, and some months we go back to the 4 TByte drives because they are so ridiculously cheap it would take 7 years for the 8 TByte drives to pay for themselves.
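
The payback comparison could look roughly like the sketch below. Every price and cost in it is invented for illustration (none come from Backblaze), and it assumes that one 8 TByte drive saves exactly one drive slot and one drive's worth of electricity compared with two 4 TByte drives.

```python
def payback_years(price_8tb, price_4tb, annual_slot_cost, annual_power_cost):
    """Years for one 8 TByte drive to recoup its premium over two 4 TByte drives."""
    extra_cost = price_8tb - 2 * price_4tb
    if extra_cost <= 0:
        return 0.0  # the bigger drive is already the cheaper option
    # Assumption: the 8 TByte drive frees one slot and one drive's power draw.
    annual_savings = annual_slot_cost + annual_power_cost
    return extra_cost / annual_savings

PAYBACK_LIMIT_YEARS = 3  # threshold mentioned in the comment

# Hypothetical monthly bids, in dollars.
for price_4tb in (120, 80):
    years = payback_years(300, price_4tb, annual_slot_cost=15, annual_power_cost=5)
    choice = "8 TByte" if years <= PAYBACK_LIMIT_YEARS else "4 TByte"
    print(f"4 TByte at ${price_4tb}: payback {years:.1f} years, buy {choice}")
```

The two hypothetical bids were picked so the payback periods land on the 3-year and 7-year cases described above.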
Re:Is cheaper really better? by brianwski · 2016-05-18 06:13 · Score: 5, Informative

Brian from Backblaze here. This is exactly correct. We have redundancy across multiple computers in multiple locations in our datacenter, so losing one drive is usually a calm, non-critical event, and we take up to 24 hours to replace the drive at our leisure during business hours.

If you are interested in details of our redundancy, here is a blog post about our "Vaults": https://www.backblaze.com/blog...

Summary of article: Backblaze uses Reed-Solomon coding across 20 computers in 20 locations in our datacenter. It is a 17 data drive plus 3 parity configuration, so we can lose any 3 entire pods in 3 separate racks in our datacenter and the data is still completely intact and available.
Re:ST3000DM001? In a DATA CENTER? by brianwski · 2016-05-18 06:40 · Score: 5, Informative

> What ... is this company doing using consumer hard drives in a ... data center? .... they will fall out of an array every time there's a URE

Brian from Backblaze here. You assume we use RAID (inside of one computer), which is incorrect. We wrote our own layer where any one piece of data is Reed-Solomon encoded across 20 different computers in 20 different locations in our datacenter (which is using some of the excellent ideas from RAID and ditching some of the parts that don't work well in our particular application). Our encoding happens to be 17 data drives plus 3 parity. We can make our own decisions about what to do with timeouts. When doing reads, we ask all 20 computers for their piece, and THE FIRST 17 THAT RETURN are used to calculate the answer. Now if one of the computers does not respond at all, we send a data center tech to replace it. But if it was just momentarily slow a few times a day we let it be (we don't eject it from the Reed-Solomon Group).
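
The "first 17 that return" read pattern can be illustrated with a short sketch. This is not Backblaze's code: `fetch_shard` is a hypothetical stand-in that only sleeps, the delays are invented, and the actual Reed-Solomon decoding step is left out entirely.

```python
import concurrent.futures as cf
import time

NEEDED_SHARDS = 17  # any 17 of the 20 shards are enough, per the comment

def fetch_shard(index, delay):
    """Hypothetical stand-in for asking one computer for its shard."""
    time.sleep(delay)
    return index, f"shard-{index}".encode()

def read_first_k(delays, needed=NEEDED_SHARDS):
    """Request every shard in parallel and keep the first `needed` to arrive."""
    pool = cf.ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(fetch_shard, i, d) for i, d in enumerate(delays)]
    shards = {}
    for future in cf.as_completed(futures):
        index, payload = future.result()
        shards[index] = payload
        if len(shards) == needed:
            break
    # Do not block on stragglers; a slow computer simply gets ignored.
    pool.shutdown(wait=False)
    return shards  # a real system would now run Reed-Solomon decoding

# Invented delays: 17 fast computers and 3 momentarily slow ones.
delays = [0.01] * 17 + [0.3] * 3
shards = read_first_k(delays)
print(sorted(shards))
```

The point of the design is that a slow responder never stalls the read, which is why a momentarily slow drive does not need to be ejected.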

> These drives are only meant to be powered on a few hours a day and consumer workload duty cycles

I think a really interesting study would be to power a few thousand drives up once per day for an hour and shut them down. Compare it to a control group of the same drives left on so their temperature did not fluctuate. See which ones last longer without failure. I honestly don't have the answer. (Really, I don't.) What I do know is that Backblaze has left 61,590 hard drives continuously spinning, most of which are often labeled as "consumer drives", and that the vast majority of drives last so long that we copy the data off onto massively denser drives (like copying all the data off a 1 TByte drive onto an 8 TByte drive) not because the 1 TByte drive fails, but because it ECONOMICALLY MAKES SENSE. An 8 TByte drive takes less electricity per TByte, takes 1/8th the rack space rental, etc. So Backblaze honestly wouldn't care if the "Enterprise Drives" lasted 10x as long in our environment: we would STILL replace them at the same moment.