For First Three Years, Consumer Hard Drives As Reliable As Enterprise Drives
nk497 writes "Consumer hard drives don't fail any more often than enterprise-grade hardware, despite the price difference. That's according to online storage firm Backblaze, which uses a mix of both types of drive. It studied its own hardware and found that consumer hard drives had a failure rate of 4.2%, while enterprise-grade drives failed at a rate of 4.6%. CEO Gleb Budman noted: 'It turns out that the consumer drive failure rate does go up after three years, but all three of the first three years are pretty good.' He added: 'We have no data on enterprise drives older than two years, so we don't know if they will also have an increase in failure rate. It could be that the vaunted reliability of enterprise drives kicks in after two years, but because we haven't seen any of that reliability in the first two years, I'm skeptical.'"
I thought this was already common knowledge. "Enterprise" drives are just a way to separate stupid people from their money. Sort of like "premium" gasoline.
SSDs are all the rage now!!! Who uses hard drives anymore???
At my company all the hardware is managed by CSC. They retire servers after about 3 years... including the drives.
"Enterprise" drives may have longer warranty coverage, so you are essentially just buying an extended warranty that is built into the selling price. This is how water heaters are priced...a 5 year warranty water heater is often identical to a 10 year warranty unit, but the manufacturer has crunched the failure rate numbers and will just wind up replacing a percentage of 10 year models when they start to leak in 8 years.
"We make our world significant by the courage of our questions and by the depth of our answers." Carl Sagan
Google already published many detailed reports on various issues surrounding the HDD business, showing that the money saved by buying cheaper hard drives and using them in data 'defending' situations (replicating data on multiple drives) made far more sense than using so-called 'enterprise' class equipment in complex, expensive configurations. Once again, to the surprise of no alpha, the KISS (keep it simple, stupid) principle wins out in engineering.
The buzz wordy, mock intellectual, synthetically complex world of 'enterprise' solutions is designed to appeal to the mind of the 'beta', a class of technocrat for whom rote-learning is everything. IT people are mostly of this class, so the 'paraphernalia' and 'jargon' make such people feel 'special'. The fundamentals of Computer Science fly right over the heads of most people involved in computer decision making.
It's a shame that people don't even understand why capitalist society works best with mass-manufactured items, and that limited-run items will always involve significant compromises. Make more of an item, and it gets cheaper AND more reliable out of the necessity of efficiency.
But only a few days back, in some forum, people were dribbling in ecstasy because some fake enterprise HDD (the Red series from WD?) was being 'discounted' to only 40% above the cost of the cheapest quality 3TB HDD. Many people gave EXPENSE as the primary reason for buying the vastly inferior Xbox One over the PS4 (in other words, they were 'big' individuals because they could afford the more expensive console).
But consumer hard drives are so much cheaper that it's not really cost-effective anymore to buy enterprise drives. You may need to replace them more often, but as SATA drives are hot-swappable and everyone is using some variation of RAID these days, one could argue that buying enterprise drives is an unnecessary expense. In a down economy, that might be significant.
Perhaps it's due to the smaller components or faster spindles creating more heat, but I rarely get more than a few years of service out of a single SATA drive before smartctl starts showing problems or a RAID array tosses the drive. Seagate and OCZ have always been awesome about replacing drives under warranty, but still. It seems like those 400 meg IDE drives of yesteryear lasted decades before making any clicks of death.
All the newer shelves came preloaded with Coraid-approved drives. As I said, there are hundreds of drives involved here, a lot of SATA 1TB and 2TB and some SAS 600GB. I think out of the later drives, we've had two fail. Maybe three.
Asked about it, Coraid said, yes, the warranty is better on "Enterprise-class" or "RAID-class" drives, but also, the firmware is different. They claim that drives intended for the consumer / SOHO market spend a lot of time retrying marginal reads before declaring an unreadable sector and sparing it. They say that SAN-class drives limit the retry time, because the array controller handles it more efficiently, since it has the big-picture view.
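Coraid's argument boils down to a timing race between the drive's internal retries and the array controller's patience. A toy model of that race, assuming a hypothetical controller that drops any drive whose single I/O exceeds a fixed timeout (the 8-second timeout and the 120-second consumer retry limit are illustrative assumptions, not vendor figures):

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    max_recovery_s: float  # how long the firmware keeps retrying a marginal read


def controller_outcome(drive: Drive, needed_recovery_s: float,
                       controller_timeout_s: float = 8.0) -> str:
    """What a hypothetical array controller does with one marginal read.

    needed_recovery_s: how long the drive would need to recover the sector.
    controller_timeout_s: assumed time before the controller gives up on the drive.
    """
    # The drive stops at whichever comes first: a successful retry or its own limit.
    time_spent = min(needed_recovery_s, drive.max_recovery_s)
    if time_spent > controller_timeout_s:
        # No answer in time, so the controller assumes the whole drive is dead.
        return "drive dropped from array"
    if needed_recovery_s <= drive.max_recovery_s:
        return "read recovered by drive"
    # The drive gave up quickly and reported the bad sector; the controller can
    # reconstruct the data from redundancy instead.
    return "error reported; controller rebuilds sector from redundancy"


if __name__ == "__main__":
    consumer = Drive("consumer", max_recovery_s=120.0)  # assumed long retry loop
    raid = Drive("raid-class", max_recovery_s=7.0)      # short, limited retry
    for drive in (consumer, raid):
        print(drive.name, "->", controller_outcome(drive, needed_recovery_s=60.0))
```

Under these assumptions the same marginal sector gets the consumer drive kicked out of the array, while the limited-retry drive just hands the problem back to the controller, which is the "big-picture view" Coraid describes.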
They also say that the drives are optimized for close-quarters operation, all jammed together in an array, handling vibration and heat build-up slightly differently, and that they have minor differences to keep lubrication from migrating out of the spindle bearing under continuous operation. I don't know, but I imagine loss of spindle bearing lube would add vibration and make all but the best reads more marginal.
I don't know for sure, but we've spent a great deal of money on their products, and our experience has borne out that there's a definite difference in arrays.
As for corporate desktop and/or server use, well, I don't really know. Our servers that have one to four drives were mostly shipped with those drives, so we didn't choose them. I can't tell you if they are enterprise class drives, but I imagine they are, based on the replacement costs. And I know about what some of those costs are, or anyhow I know they were way more than I personally pay for drives for home desktop and server use. I know that because occasionally they fail, and I have to buy new ones.
No difference between enterprise and home HDDs that I know of.
As for what "hammering and heavy use" of a drive is?
The biggest killer of HDDs is something called the CSS test cycle.
CSS = Contact Start Stop, where the drive is powered up, spun up, and then shut down repeatedly.
Generally, an HDD sitting there spinning away is not what kills it off;
turning it on-off-on-off a lot is the most abusive thing you can do.
I still think WD makes the best quality out there, but that's just my opinion.
just my 0.02 worth...
I wonder how many more Slashdot stories will be based on the same Backblaze "first of its kind" study (ignoring Google's older paper) on hard drive longevity, a study that doesn't name names?
Given the cheap PSUs I've seen in a lot of boxes (and their rate of failure), I'd say in many cases it's a contest between the drives and the PSU, especially in areas with flaky power.
I think you missed his point. With the money you save, buy a spare drive.
6 drives with enterprise warranty: $1800, 12 hour replacement
7 drives with consumer warranty: $1300, instant replacement
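The arithmetic behind that comparison, using the comment's own prices and assuming the seventh consumer drive sits on the shelf as a cold spare:

```python
def compare_purchases(ent_drives: int, ent_cost: float,
                      con_drives: int, con_cost: float) -> dict:
    """Per-drive prices, total savings and spare count for two purchase options."""
    return {
        "enterprise_per_drive": ent_cost / ent_drives,
        "consumer_per_drive": con_cost / con_drives,
        "savings": ent_cost - con_cost,
        "extra_spares": con_drives - ent_drives,
    }


if __name__ == "__main__":
    # Figures from the comment: 6 enterprise drives for $1800 vs 7 consumer for $1300.
    result = compare_purchases(6, 1800, 7, 1300)
    for key, value in result.items():
        print(f"{key}: {value:.2f}" if isinstance(value, float) else f"{key}: {value}")
```

The consumer option works out to roughly $186 per drive versus $300, and still leaves $500 in hand plus a spare that can be swapped in immediately instead of waiting on a warranty replacement.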
That's an unfair scenario. There could be many quality differences between enterprise and consumer drives that simply don't come up in their environment. I know when I make consumer and enterprise-grade objects, of course the consumer-grade objects work (I don't build carp), but the enterprise-grade ones work better. For many values of better. Most often, that better includes things like a wider temperature range, dirtier air, and more frequent and rougher shipping. Even my packaging is wildly different as a result: better foam, larger boxes. Also interestingly stupid things like additional electrical certifications. And then there are emergency situations like easier repair; in this case data rescue would be a major feature, as would fire and flood resistance.
The blog post states: "You might object to these numbers because the usage of the drives is different. The enterprise drives are used heavily. The consumer drives are in continual use storing users’ updated files and they are up and running all the time, but the usage is lighter. "
That invalidates the conclusion they're drawing. You can't put two different types of drives under different workloads and then conclude they fail at the same rate. The fact that other studies have reached similar conclusions (Google published one a few years back) is irrelevant when it comes to evaluating whether or not *this* study has measured what it seeks to measure.
Consumer drives and enterprise drives may fail at equal rates, but using different workloads doesn't help us reach that conclusion.
Until we see some names, those studies are useless...
Plus, you don't have to worry about the replacement drive coming in from a different batch that has 100 fewer sectors on it. You can tell a new RAID admin because he will accept the system's default of "use every sector on the drive" instead of reserving .1% for drive manufacturer rounding shenanigans.
I don't know, maybe the situation is better today, but if I'm put in charge of a RAID array, I'm always going to shave off a few sectors out of distrust of drive manufacturers.
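The failure mode being guarded against is easy to check with sector counts. A minimal sketch, assuming a hypothetical array built on 1TB drives with 1953525168 sectors and a replacement that comes in 100 sectors short:

```python
def usable_sectors(drive_sectors: int, reserve_ppm: int = 1000) -> int:
    """Sectors to give the array, leaving reserve_ppm parts-per-million unused.

    1000 ppm = 0.1%, the margin suggested above. Integer math avoids float rounding.
    """
    return drive_sectors * (1_000_000 - reserve_ppm) // 1_000_000


def replacement_fits(member_sectors_used: int, replacement_sectors: int) -> bool:
    """A replacement works only if it has at least as many sectors as each member uses."""
    return replacement_sectors >= member_sectors_used


if __name__ == "__main__":
    original = 1_953_525_168      # sector count of the drives the array was built on
    replacement = original - 100  # hypothetical replacement from a "smaller" batch
    print("use every sector:", replacement_fits(original, replacement))
    print("reserve 0.1%:    ", replacement_fits(usable_sectors(original), replacement))
```

Using every sector, the short replacement is rejected; with the 0.1% reserve there are about 1.95 million sectors of slack, far more than the 100 missing.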
You'll be glad to hear that for 250GB and larger drives HD manufacturers agreed on standardized sector counts.
Also, welcome to the 2000s.
I always assumed that Enterprise HDDs were the same as consumer ones, they just got binned differently after passing through tighter QC scrutiny or more arduous burn-in testing or something.
Either way, it's not that surprising... I mean, a drive either fails within the first several months, or runs fine until something inside's well past its MTBF. Right?
So within the first three years, we're basically talking about the near end of the curve. It's not until after 3+ years that normal HDDs start to really show their age, in my experience.
That's happened to me recently, with multi-TB drives. The last time I bought drives for an existing RAID array, I checked the sector count and found that the drive I was looking at buying had MORE sectors than the drives already in the array. That wasn't a problem, but it shows that "equal" sized drives aren't necessarily equal. Do you have a citation or some good search terms to find out about this supposed standardization? I'd love to read anything that might give some information about which manufacturers, if any, have actually standardized.
It's been said that some drives (enterprise drives?) have more spare sectors reserved. Given that that number of sectors per platter is a physical attribute they can't readily adjust, it seems that reserving more sectors would leave fewer visible sectors.
I manage a couple of petabytes worth of disks (consumer, not enterprise) for the HPC center at Vanderbilt University, and they get absolutely hammered by CMS-HI users 24/7/365. At scale, you will daily see problems that you would never even think of.
The firmware on consumer hard drives is often crap. Very few of them support TLER. We have ~400 drives (Seagates) that needed a firmware fix to prevent sudden death, but the fix wouldn't work in bulk over the SAS controller, so we had to yank/flash/replace/repeat. And drives will occasionally lock up hard and require a power cycle.
Don't believe for a second that Linux doesn't need a defrag utility. We were mystified by a sudden influx of permanent drive *slot* failures. After *much* investigation, it turned out that our users were filling the drives 100% full, erasing 5%, refilling, erasing 5%, etc., until the average file (~100 MB) had thousands of extents. The vibration from the head frantically seeking across the disk to read a file was enough for the drive's SATA connector to destroy the connector on the backplane (Supermicro chassis, would *NOT* buy again; Chenbro is the way...). We wrote a simple defrag script that copied the worst files to a different location and then moved them back.
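The copy-away-and-move-back approach can be sketched in a few lines. This is not their script, just a minimal illustration: the extent counts are assumed to come from elsewhere (on Linux a tool such as filefrag reports them), and `worst_files`/`rewrite_file` are hypothetical helper names.

```python
import shutil
import tempfile
from pathlib import Path


def worst_files(extent_counts: dict[Path, int], top_n: int = 10) -> list[Path]:
    """Return the top_n files with the most extents (most fragmented first)."""
    return sorted(extent_counts, key=extent_counts.get, reverse=True)[:top_n]


def rewrite_file(path: Path, staging_dir: Path) -> None:
    """Copy a file to a staging directory, then move the copy back over the original.

    Writing the whole file out in one sequential pass gives the filesystem a chance
    to allocate fresh, more contiguous extents. Not safe for files that are open
    for writing while this runs.
    """
    staged = staging_dir / (path.name + ".defrag-tmp")
    shutil.copy2(path, staged)           # sequential copy; keeps mode and timestamps
    shutil.move(str(staged), str(path))  # replace the original with the fresh copy


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data, tempfile.TemporaryDirectory() as stage:
        f = Path(data) / "big.dat"
        f.write_bytes(b"x" * 1024)
        rewrite_file(f, Path(stage))
        print(f.read_bytes() == b"x" * 1024)
```

Whether the rewritten file actually ends up contiguous depends on how much free space the filesystem has, which is why filling the disks 100% full caused the problem in the first place.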
RAID5 isn't nearly sufficient at this point, because you will eventually have two or more simultaneous failures just due to the number of disks. We wrote our own filesystem to offer Reed-Solomon 6+3 redundancy.
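Why extra parity matters at scale can be shown with a binomial count. A minimal sketch comparing a 9-disk RAID5 group (8 data + 1 parity) with a 6+3 Reed-Solomon group, under the hypothetical assumption that each disk independently fails with probability 1% during one rebuild window:

```python
from math import comb


def p_data_loss(n_disks: int, tolerated: int, p_fail: float) -> float:
    """Probability that more than `tolerated` of `n_disks` fail in the same window.

    Assumes every disk fails independently with probability p_fail, which is
    optimistic: real failures cluster (same batch, same heat, rebuild stress).
    """
    return sum(comb(n_disks, k) * p_fail**k * (1 - p_fail) ** (n_disks - k)
               for k in range(tolerated + 1, n_disks + 1))


if __name__ == "__main__":
    p = 0.01  # hypothetical per-disk failure probability during one rebuild window
    layouts = {"RAID5 8+1": (9, 1, 8 / 9), "Reed-Solomon 6+3": (9, 3, 6 / 9)}
    for name, (n, tol, efficiency) in layouts.items():
        print(f"{name}: P(loss) = {p_data_loss(n, tol, p):.2e}, usable = {efficiency:.0%}")
```

The trade-off is capacity: 6+3 gives up a third of the raw space instead of a ninth, in exchange for surviving any three failures in a group rather than one.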
I'd love to know if you guys have any similar "WTH" horror stories.
Only because the user isn't technically part of the PC.
Also the user isn't under warranty.
(that might explain why most of them are crap)
Stupid planned obsolescence. We should complain to the manufacturer.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Are consumer hard drives just as reliable for the first 3 years when installed in rack servers?
Are enterprise drives more reliable past 3 years when installed in poorly-maintained desktop computers?
What the standard sizes *should* be:
250GB - 488397168 sectors
320GB - 625142448 sectors
500GB - 976773168 sectors
750GB - 1465149168 sectors
1TB - 1953525168 sectors
1.5TB - 2930277168 sectors
2TB - 3907029168 sectors
3TB - 5860533168 sectors
I don't have the figure for 4TB, nor anything for 2.5" drives.
If you have a drive that appears smaller than that, check Host Protected Area and your controller settings.
If you have a drive that appears larger than that... please post the model.
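These figures fit a simple linear rule, which as far as I know comes from IDEMA's LBA1-03 capacity standard for drives with 512-byte sectors (treat that attribution as an assumption): 97696368 + 1953504 × (advertised GB − 50). A quick check against the list:

```python
def idema_lba_count(capacity_gb: int) -> int:
    """Standardized LBA count for a 512-byte-sector drive of the given decimal-GB size."""
    return 97_696_368 + 1_953_504 * (capacity_gb - 50)


# Sector counts from the list above, keyed by advertised capacity in GB.
LISTED = {
    250: 488_397_168,
    320: 625_142_448,
    500: 976_773_168,
    750: 1_465_149_168,
    1000: 1_953_525_168,
    1500: 2_930_277_168,
    2000: 3_907_029_168,
    3000: 5_860_533_168,
}

if __name__ == "__main__":
    for gb, sectors in LISTED.items():
        assert idema_lba_count(gb) == sectors, gb
    print("4TB by the same rule:", idema_lba_count(4000))
```

Every listed size matches. Extrapolated to 4TB, the rule gives 7814037168 sectors; worth checking against an actual drive before relying on it.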
These 2 factors combine to kill modern drives, IMHO. The increased density makes the drive work harder to combat vibration.
HD bays and mounts ought to come with more carefully engineered damping.
The major difference in "enterprise" NL-SAS drives versus consumer SATA drives is the native SAS command set, which makes the drives more efficient and more compatible with RAID controllers and technologies like storage pools in Windows Server 2012. Step up to true 10K and 15K RPM SAS drives and you get significantly higher IOPS: slower than SSD, but less expensive per GB.
"... but also, the firmware is different. They claim that drives intended for the consumer / SOHO market spend a lot of time retrying marginal reads before declaring an unreadable sector and sparing it. They say that SAN-class drives limit the retry time, because the array controller handles it more efficiently, since it has the big-picture view."
What you are describing is known as TLER or "Time Limited Error Recovery" (the Western Digital name for it, at least). See TLER
Unrecoverable read error rate: 1 per 10^14 bits read on consumer drives vs 1 per 10^15 on enterprise drives.
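Those exponents are, as far as I know, the usual spec-sheet unrecoverable read error (URE) rates: one unreadable sector per 10^14 or 10^15 bits read. What that implies for reading a whole drive end to end (e.g., during a RAID rebuild) can be estimated with a sketch; the 3TB capacity is an assumed example:

```python
import math


def p_ure_full_read(capacity_bytes: float, bits_per_error: float) -> float:
    """Chance of at least one unrecoverable read error while reading a whole drive.

    Treats the spec-sheet rate as a Poisson process with independent errors,
    which is a rough simplification of real drive behavior.
    """
    expected_errors = capacity_bytes * 8 / bits_per_error
    return 1 - math.exp(-expected_errors)


if __name__ == "__main__":
    three_tb = 3e12  # assumed drive size in bytes
    for label, rate in (("consumer 1e14", 1e14), ("enterprise 1e15", 1e15)):
        print(f"{label}: {p_ure_full_read(three_tb, rate):.1%}")
```

Taken at face value, the spec works out to roughly a 21% chance of hitting at least one URE per full read of a 3TB consumer drive, versus roughly 2.4% at the enterprise rating.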
Worst. Signature. Ever.
There really is a difference. However... that difference is in the firmware, and that's where the manufacturers were scamming.
There is a thing called TLER (Time-Limited Error Recovery). That is, if the drive is trying to write to or read from a block and it finds a problem, it either writes to another block or uses error recovery to read it. For servers, they really want the time the drive keeps trying to be under seven seconds. On consumer drives this could be adjusted using software like hdparm.
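One tool that exposes this limit on Linux is smartmontools, via its SCT Error Recovery Control log option (`smartctl -l scterc,READ,WRITE`), where both values are in units of 100 ms. A minimal sketch that builds that command; the device path is a placeholder, drives with locked firmware will refuse the setting, and my understanding is that on many drives the value reverts at power-off, so it would typically be reapplied at boot:

```python
import shlex


def scterc_command(device: str, read_s: float = 7.0, write_s: float = 7.0) -> list[str]:
    """Build a smartctl call that sets SCT Error Recovery Control limits.

    smartctl expects both limits in units of 100 ms, so 7 s becomes 70.
    """
    if read_s <= 0 or write_s <= 0:
        raise ValueError("limits must be positive")
    return ["smartctl", "-l", f"scterc,{round(read_s * 10)},{round(write_s * 10)}", device]


if __name__ == "__main__":
    # /dev/sda is a placeholder device name.
    print(shlex.join(scterc_command("/dev/sda")))  # smartctl -l scterc,70,70 /dev/sda
```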
Then around '09, apparently starting with WD, they made a change to the firmware so you could no longer change that setting. Servers scream, gag, give up, and tell you the drive is dying when, instead of spending under seven seconds, the drive keeps trying for *over two minutes*. Everything you read says do *NOT* use those for RAID, either.
The server grade drives are very much *not* into spinning down, and they have that short TLER.
This is why, around here, we're ecstatic about the new WD Red drives, which are "targeted towards NAS"; the reality is that they've got TLER set to seven seconds. And where the server-grade drives are two to three times the price of consumer-grade drives or more (some sources are a *lot* higher), the Reds are about 1.33 times the price of consumer-grade drives.
Reliability: we have some of everything, and have not noticed a real difference in reliability. And our drives get used a *lot*.
mark
A safe rule of thumb that I've used over the years is to only allocate 99% of the drive size when putting together the array. This is a bit easier with Software RAID and mdadm under Linux.
You might even be safe at 99.5% or 99.9%, but the latter is probably a bit risky.
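With Linux software RAID, one way to apply the 99% rule is mdadm's `--size` option, which sets how much of each member device to use (in KiB by default). A sketch that computes the value; the 1 MiB alignment, the 4-disk RAID5 layout, and the device names are assumptions for illustration:

```python
def mdadm_size_kib(drive_bytes: int, fraction: float = 0.99, align_kib: int = 1024) -> int:
    """Per-device size (KiB) to pass to mdadm --size, using only `fraction` of the drive.

    Rounds down to a multiple of align_kib (1 MiB here, an arbitrary tidy choice).
    """
    usable_kib = int(drive_bytes * fraction) // 1024
    return usable_kib - usable_kib % align_kib


if __name__ == "__main__":
    drive_bytes = 3_907_029_168 * 512  # a 2TB drive with the standard sector count
    size = mdadm_size_kib(drive_bytes)
    # Hypothetical 4-disk RAID5; device names are placeholders.
    cmd = ["mdadm", "--create", "/dev/md0", "--level=5", "--raid-devices=4",
           f"--size={size}", "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]
    print(" ".join(cmd))
```

Passing `fraction=0.995` or `0.999` gives the tighter margins mentioned above.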
Yes, but there's no difference between Cadillac parts and Chevy parts, except the price.
What this is saying is that the use portion of the bathtub curve is the same between two differently marketed products produced from the same manufacturing lots/lines. In other words, "Well duh!! What else did you expect?!??".
Don't tell me you thought there were two different manufacturing processes: one for consumer, one for business?!?
The only difference between the marketed types is which units tested out, via sampling, as more reliable. But that's AFTER the use portion of the bathtub curve. Infant failures are undesirable in every product line, so they're all removed by burn-in anyway.
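The bathtub curve is commonly modeled as the sum of three hazard terms. A toy sketch with made-up parameters (nothing here is fitted to real drive data): burn-in targets the steep left-hand part, and in this kind of picture the first few years sit on the flat bottom.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years: float) -> float:
    """Toy bathtub curve: falling infant-mortality term + flat term + rising wear-out term.

    All parameters are hypothetical, chosen only to give the familiar shape.
    """
    if t_years <= 0:
        raise ValueError("t_years must be positive")
    infant = weibull_hazard(t_years, shape=0.5, scale=20.0)   # shape < 1: falls over time
    background = 0.02                                         # constant random failures
    wear_out = weibull_hazard(t_years, shape=4.0, scale=6.0)  # shape > 1: rises over time
    return infant + background + wear_out


if __name__ == "__main__":
    for t in (0.05, 0.5, 2.0, 4.0, 8.0):
        print(f"year {t:>4}: hazard {bathtub_hazard(t):.3f}")
```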
IOW not even news.