Slashdot Mirror


Backblaze Dishes On Drive Reliability In their 50k+ Disk Data Center

Online backup provider Backblaze runs hard drives from several manufacturers in its data center (56,224, they say, by the end of 2015), and as you'd expect, the company keeps its eye on how well they work. Yesterday they published a stats-heavy look at the performance, and especially the reliability, of all those drives, which makes fun reading, even if you're only running a drive or ten at home. One upshot: they buy a lot of Seagate drives. Why? A relevant observation from our Operations team on the Seagate drives is that they generally signal their impending failure via their SMART stats. Since we monitor several SMART stats, we are often warned of trouble before a pending failure and can take appropriate action. Drive failures from the other manufacturers appear to be less predictable via SMART stats.

7 of 145 comments (clear)

  1. Not very useful. by Anonymous Coward · · Score: 0, Interesting

    The type of use case they subject their drives to is very unlike the type of use case you will likely see. I wouldn't try to read too much into their statistics. They really apply only to themselves, or someone in the same business.

    1. Re:Not very useful. by Anonymous Coward · · Score: 3, Interesting

      Backblaze doesn't always choose the most reliable drive, we look at the total cost of ownership including the amount of power the drive will consume and the drive's failure rate and let a spreadsheet kick out the correct drive for us to purchase this month. It is rarely the most reliable drive.

      Do you factor in the work cost? In an environment where services are bought by the hour, the cost of a single maintenance operation is more than the cost difference between the most expensive drive in a selected class and the drive of an average cost.

  2. Sorry WD fans by Solandri · · Score: 5, Interesting

    Can't help but feel for all the people who read Blackblaze's previous report and decided Seagate was junk and bought WD instead. I tried to warn them that the model of the drive mattered more than the manufacturer, because each manufacturer tries new technologies and new cost-cutting strategies with each different model. Sometimes it works and the model is reliable. Sometimes it doesn't and the model is unreliable. But everyone was eager to get on the bash Seagate, praise WD bandwagon and ignored me.

    Well, WD was least reliable this time around. The Seagate stats in the previous report were probably being skewed by just one or two bad models. It's skewed this time by one bad model, which due to the passage of time means it makes up a tiny portion of their Seagate sample, so doesn't spike Seagate's score like before. (You can pretty much ignore WD in the 4TB graph, as a sample size of just 46 drives means the confidence interval is a 0.3% - 8.8% failure rate.)

    At least Blackblaze addressed my criticism from before - they've broken down the stats to individual drive models. And you can see that like I said, there's huge variability in reliability between models within a manufacturer's lineup. Now they just need to add confidence interval to the graphs.

  3. This is a repeat of 6/23/15 topic . "When will" by FirstOne · · Score: 2, Interesting

    ""When will your hard drive fail"

    I pointed out that Blackblaze chassis configuration improperly stressed the fragile SATA/Power connectors by implementing a vertical disk drive mounting configuration,.
    Where the mass of drive(&vibration) is placed upon the fragile SATA data and power connectors.

    This type of vertical drive storage/raid cabinet is not conducive for long term/reliable drive lifespan., thus any number of other factors could kick in and cause a premature failure.

  4. Bad sectors? by nbritton · · Score: 5, Interesting

    What is Backblaze doing to check the drives for bad sectors? I manage a 10,000 disk openstack swift installation and I've noticed the auto sector remapping doesn't work correctly, there are a portion of drives (maybe 3%) that have a few bad sectors that need to be manually remapped using ddrescue. I ended up having to write a custom monthly cron job script that ran badblocks to first identify these drives, and then ddrescue to force a sector remap.

  5. Re:Doesn't make any mention of.. by brianwski · · Score: 5, Interesting

    Brian from Backblaze here.

    The individual drives in our datacenter run ext4 (the OS is Debian). We do an extremely simple Reed-Solomon encoding that is 17+3 (17 data drives and 3 parity) but the 20 drives are spread across 20 different computers in 20 different locations in our datacenter. This means we can lose any 3 drives and not lose data at all.

    We released the Reed-Solomon source code free (open source but even better) for anybody else to use also. You can read about it in this blog post: https://www.backblaze.com/blog...

  6. Re:RAID, let them fail by brianwski · · Score: 3, Interesting

    Brian from Backblaze here.

    Sometimes the "drive failure" is as simple as the little circuit board on the bottom of the hard drive has a component die. This won't be predicted by SMART stats at all. We have chatted very informally with the people at "Drive Savers" ( http://www.drivesaversdatareco... ) and they say one of the early steps in attempting to recover the data from a drive that won't work is to replace the circuit board with the board from an identical hard drive of same make and model.

    I have no affiliation with "Drive Savers" but from my interactions with them I trust them as quite a good and valuable service who know their craft. We even used them once in a panic once to get back the minimum number of drives for data integrity in a RAID array (a long time ago before our multi-machine vault architecture). It worked - we got all the data back from the drive!