Disk Drive Failures 15 Times What Vendors Say

Repeat? by Corith · 2007-03-02 09:16 · Score: 2, Insightful

Didn't we already see this evidence with Google's report?

--
user corith signing off...

Re:Repeat? by georgewilliamherbert · 2007-03-02 09:20 · Score: 3, Informative

We did both this study and the Google study in the first couple of days after FAST was over. Completely redundant....
Re:Repeat? by LiquidCoooled · 2007-03-02 09:23 · Score: 2, Interesting

Yes, and it's mentioned in the report.
The best part about the entire thing is the very last quote:

"If they told me it was 100,000 hours, I'd still protect it the same way. If they told me it was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."

Just common sense.

--
liqbase :: faster than paper
Re:Repeat? by ajs · 2007-03-02 09:34 · Score: 5, Informative

The best part about the entire thing is the very last quote:

"If they told me it was 100,000 hours, I'd still protect it the same way. If they told me it was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."

Just common sense.

It's "common sense," but not as useful as one might hope. What MTTF tells you is, within some expected margin of error, how much failure you should plan on in a statistically significant farm. So, for example, I know of an installation that has thousands of disks used for everything from root disks on relatively drop-in-replaceable compute servers to storage arrays. On the budgetary side, that installation wants to know how much replacement cost to expect per annum. On the admin side, that installation wants to be prepared with an appropriate number of redundant systems, and wants to be able to assert a failure probability for key systems. That is, if you have a raid array with 5 disks and one spare, then you want to know the probability that three disks will fail on it in the, let's say, 6 hour worst-case window before you can replace any of them. That probability is non-zero, and must be accounted for in your computation of anticipated downtime, along with every other unlikely, but possible event that you can account for.

When a vendor tells you to expect a 0.2% failure rate, but it's really 2-4%, that's a HUGE shift in the impact to your organization.

When you just have one or a handful of disks in your server at home, that's a very different situation from a datacenter full of systems with all kinds of disk needs.
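
For anyone who wants to put rough numbers on the planning argument above, here is a minimal sketch. It assumes failures are independent and occur at a constant rate, which is a simplification, and the fleet size of 5000 disks is a made-up stand-in for "thousands of disks". The failure rates are the 0.2% and 2-4% figures from the comment; the function names are illustrative only.

```python
from math import comb

HOURS_PER_YEAR = 8760


def expected_replacements(num_disks, afr):
    """Expected disk replacements per year for a fleet with the given annual failure rate."""
    return num_disks * afr


def prob_at_least(k, n, afr, window_hours):
    """P(at least k of n disks fail within the window), assuming independent failures."""
    # Per-disk probability of failing inside the window, derived from the annual rate.
    p = 1 - (1 - afr) ** (window_hours / HOURS_PER_YEAR)
    # Binomial tail: sum the probabilities of exactly i failures for i = k..n.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


FLEET_SIZE = 5000  # hypothetical fleet size, not taken from the thread

for afr in (0.002, 0.02, 0.04):
    yearly = expected_replacements(FLEET_SIZE, afr)
    # The array from the comment: 5 disks plus one spare, 6-hour replacement window.
    triple = prob_at_least(3, 6, afr, 6)
    print(f"AFR {afr:.1%}: ~{yearly:.0f} replacements/year, "
          f"P(3+ of 6 fail in 6 h) = {triple:.3e}")
```

The budget line scales linearly with the failure rate, while the multi-disk failure probability grows roughly with its cube, so an understated rate hurts the downtime estimate far more than the replacement budget.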
Re:Repeat? by Detritus · 2007-03-02 12:08 · Score: 2, Funny

There's also a non-zero probability that all of the air molecules in a room will rush to the corner of the room, suffocating the occupants.

--
Mea navis aericumbens anguillis abundat
Re:Repeat? by Baddas · 2007-03-02 12:48 · Score: 2, Funny

Just think what a fantastic way to die that would be. You'd get all kinds of notoriety.
Re:Repeat? by ShakaUVM · 2007-03-02 13:30 · Score: 3, Informative

Except MTBF is just pulled out of their asses. Look at the development cycle of a hard drive. Look at the MTBF. I used to work for an engineering company, and have worked doing test suites to determine MTBF. Sure, there's numbers involved, but it's probably 60% wishful thinking and 40% science.

Believe me, they aren't determining an 11 year MTBF empirically.

it's relative. by User+956 · 2007-03-02 09:17 · Score: 4, Funny

The data sheets for the drives indicated MTTF between 1 and 1.5 million hours.

Yeah, but I bet they didn't say what planet those hours are on.

--
The theory of relativity doesn't work right in Arkansas.

Re:it's relative. by bigtangringo · 2007-03-02 09:22 · Score: 2, Funny

Or what percentage of the speed of light they were traveling.

--
Yes, I am a smart ass; it's better than the alternative.

In other news... by Mr.+Underbridge · 2007-03-02 09:22 · Score: 4, Informative

...Carnegie Mellon researchers can't tell a mean from a median. This is inherently a long-tailed distribution in which the mean will be much higher than the median. Imagine a simple situation in which failure rates are 50%/yr, but those that last beyond a year last a long time. Mean time to failure might be 1000 years. You simply can't compare the statistics the way they have without knowing a lot more about the distribution than I saw in the article. Perhaps I missed it while skimming.
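
To make the mean-versus-median point concrete, here is a small sketch of the hypothetical situation described above. The population is invented purely for illustration (51 drives dying at half a year, 49 lasting 2000 years) and says nothing about the data in the actual study.

```python
from statistics import mean, median

# Hypothetical long-tailed population: just over half the drives die young,
# the rest last an absurdly long time.
lifetimes_years = [0.5] * 51 + [2000.0] * 49

mean_life = mean(lifetimes_years)
median_life = median(lifetimes_years)

print(f"mean time to failure:   {mean_life:.1f} years")
print(f"median time to failure: {median_life:.1f} years")
```

With numbers like these, the mean lands near 1000 years while the typical drive is gone in six months, which is why a mean alone says little without the shape of the distribution.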

Re:In other news... by Falkkin · 2007-03-02 09:57 · Score: 3, Informative

In other news, Carnegie Mellon researchers know more about statistics than you give them credit for; blame ComputerWorld for crappy coverage of what the paper says. If you read the paper or the abstract, the researchers actually claim the opposite of what you are suggesting, namely, that the "infant mortality effect" (bathtub curve) often claimed for hard drives isn't actually the case. See Figure 4 in the paper and Section 5 ("Statistical properties of disk failures"). The paper is online here:

http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html

Personally I am SHOCKED by dingbatdr · 2007-03-02 09:23 · Score: 2, Insightful

Yes, I am SHOCKED that companies have implemented a systematic program of distorting the truth in order to increase profits.

I propose a new term for the heinous practice---"marketing".

--
The truth is an offense, but not a sin.------R. N. Marley

Re:Personally I am SHOCKED by Beardo+the+Bearded · 2007-03-02 09:43 · Score: 4, Informative

What, really?

The same companies that lie about the capacity on EVERY SINGLE DRIVE they make? You don't think that they're a bunch of lying fucking weasels? (We're both using sarcasm here.)

I don't care how you spin it. 1024 is the multiple. NOT 1000!

Failure doesn't get fixed because making a drive more reliable means it costs more. If it costs more, it's not going to get purchased.

--

---
ECHELON is a government program to find words like bomb, jihad, plutonium, assassinate, and anarchy.
Re:Personally I am SHOCKED by Lord+Ender · 2007-03-02 10:11 · Score: 3, Informative

Before computers were used in real engineering, we could get away with "k" sometimes meaning 1024 (like in memory addresses) and sometimes meaning 1000 (like in network speeds). Those days are past. Now that computers are part of real engineering work, even the slightest amount of ambiguity is not acceptable.

Differentiating between "k" (=1000) and "ki" (=1024) is a sign that the computer industry is finally maturing. It's called progress.

--
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
Re:Personally I am SHOCKED by CorSci81 · 2007-03-02 11:36 · Score: 2

I'd just like to point out that computers were used for "real" engineering long before they became ubiquitous in the workplace or home. Why do you think FORTRAN is one of the oldest computing languages in existence?
Re:Personally I am SHOCKED by Chonine · 2007-03-02 11:39 · Score: 3, Informative

Standard metric is indeed powers of 10, and a megabyte is indeed 10^6 bytes.
To clear up the confusion, the notation for binary, as in 2^20 bytes was developed. That would be a Mebibyte.
http://en.wikipedia.org/wiki/Mebibyte
Re:Personally I am SHOCKED by binarybum · 2007-03-02 11:46 · Score: 4, Funny

yeah, I used to think they were dirty bastards, but they just work on a different scale than the rest of us.
The trick is to purchase your HD in pennies.

"100,000 pennies! why that's 1024 dollars!!"

--
ôó
Re:Personally I am SHOCKED by Timothy+Brownawell · 2007-03-02 12:15 · Score: 2, Insightful

Before computers were used in real engineering,
Computers have *always* been used for "real engineering" as you call it. It's only recently that they've gotten cheap enough to use as toys.

we could get away with "k" sometimes meaning 1024 (like in memory addresses) and sometimes meaning 1000 (like in network speeds). Those days are past.
WTF? It's like any other part of language, things have different meanings in different contexts. What does "cat" mean?

Now that computers are part of real engineering work, even the slightest amount of ambiguity is not acceptable.

Ok, so do we rename cat-the-program or cat-the-heavy-machinery (and what about cat-the-animal)? Computers and heavy machinery are both used for "real engineering work", so we can't have any ambiguity in which we're talking about. That would be not acceptable.

Differentiating between "k" (=1000) and "ki" (=1024) is a sign that the computer industry is finally maturing. It's called progress.

No, it's a sign that too many people have sticks up their butts and can't accept that language can be context-dependent. The world is not binary, and failing to recognize this is likely one reason that software sucks so much.

Also, it's a sign that disks (as opposed to ram) are sized by cost, rather than efficient use of address lines. Ram is sold in power-of-2 sizes for technical reasons. Disks are different enough that those technical reasons aren't there, so marketing dictates that the prefixes used be chosen to give the largest numbers.
Re:Personally I am SHOCKED by Ryan+Mallon · 2007-03-02 12:56 · Score: 4, Funny

Why do you think FORTRAN is one of the oldest computing languages in existence?
Because it was invented before most other computer languages? Is this a trick question ;-)

I believe it... by madhatter256 · 2007-03-02 09:23 · Score: 2, Informative

Yeh. Don't rely on the HDD after it surpasses its manufacturer warranty.

--
Previewing comments are for sissies!

Re:I believe it... by SighKoPath · 2007-03-02 09:28 · Score: 2, Insightful

Also, don't rely on the HDD before it surpasses its manufacturer warranty. All the warranty means is you get a replacement if it breaks - it doesn't provide any extra guarantees of the disk not failing.
Re:I believe it... by The+Clockwork+Troll · 2007-03-02 10:13 · Score: 2, Insightful

In my experiences with several major drive vendors, I have never gotten an "upgrade". What you get is a replacement drive, but generally it's the same drive (perhaps refurbished or firmware-revised) and the original warranty period is still in effect (with perhaps a 30 day extension to account for your downtime). I've RMA'd a lot of drives and never have I gotten one of different spec/size. I'm not even sure this would be desirable, e.g. in the case of replacing a drive in a RAID array with something of different specification (yes, even "better" specification). Symmetry and everything.

--

There are no karma whores, only moderation johns

This study is useless. by Lendrick · 2007-03-02 09:26 · Score: 2, Interesting

In the article, they mention that the study didn't track actual failures, just how often customers *thought* there was a failure and replaced their drive. There are all sorts of reasons someone might think a drive has failed. They're not all correct. I can't begin to guess what percentage of those perceived failures were for real.

This study is not news. All it says is that people *think* their hard drives fail more often than the mean time to failure.

Re:This study is useless. by crabpeople · 2007-03-02 10:12 · Score: 3, Interesting

That's fair, but if you pull a bad drive, ghost it (assuming it's not THAT bad), plop the new drive in, and the system works flawlessly, what are you to assume?

I don't really care to know exactly what is wrong with the drive. If I replace it, and the problem goes away, I would consider that a bad drive. Even if you could still read and write to it. I just did one this morning that showed no symptoms other than Windows taking what I considered a long time to boot. All the user complained about was sluggish performance, and there were no errors or drive noises to speak of. Problem fixed, user happy, drive bad.

As I already posted, a good rule of thumb is that most drives go bad about 3 years from the date of manufacture.

--
I'll just use my special getting high powers one more time...

Interface matters why? by neiko · 2007-03-02 09:30 · Score: 3, Interesting

TFA seems surprised by SATA drives lasting as long as Fibre...why on earth would your data interface have any consequences on the drive internals? Or are we assuming Interface = Data Throughput?

Re:Interface matters why? by ender- · 2007-03-02 09:36 · Score: 3, Insightful

TFA seems surprised by SATA drives lasting as long as Fibre...why on earth would your data interface have any consequences on the drive internals? Or are we assuming Interface = Data Throughput?

That statement is based on the long-held assumption that hard drive manufacturers put better materials and engineering into enterprise-targeted drives [Fibre] than they put into consumer-level drives [SATA].

Guess not...

--
Nothing to see here
Re:Interface matters why? by mollymoo · 2007-03-02 09:42 · Score: 5, Informative

TFA seems surprised by SATA drives lasting as long as Fibre...why on earth would your data interface have any consequences on the drive internals?

Fibre Channel drives, like SCSI drives, are assumed to be "enterprise" drives and therefore better built than "consumer" SATA and PATA drives. It's nothing inherent to the interface, but a consequence of the environment in which that interface is expected to be used. At least, that's the idea.

--
Chernobyl 'not a wildlife haven' - BBC News
Re:Interface matters why? by Spazmania · 2007-03-02 09:59 · Score: 2, Informative

They certainly charge enough more. SATA drives run about $0.50 per gig. Comparable Fibre Channel drives run about $3 per gig. A sensible person would expect the Fibre Channel drive to be as much as 6 times as reliable, but per the article there is no difference.

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.

I have thought the MTTF is bullshit for a while by Danga · 2007-03-02 09:30 · Score: 5, Interesting

I have had 3 personal use hard drives go bad in the last 5 years, they were either Maxtor or Western Digital. I am not hard on the drives other than leaving them on 24/7. The drives that failed were all just for data backup and I put them in big, well ventilated boxes. With this use I would think the drives would last for years (at least 5 years), but nope! The drives did not arrive broken either, they all functioned great for 1-2 years before dying. The quality of consumer hard drives nowadays is way, WAY low, and the manufacturers should do something about it.

I don't consider myself a fluke because I know quite a few other people who have had similar problems. What's the deal?

Also, does anyone else find this quote interesting?:

"and may have failed for any reason, such as a harsh environment at the customer site and intensive, random read/write operations that cause premature wear to the mechanical components in the drive."

It's a f$#*ing hard drive! Jesus H Tapdancing Christ how can they call that premature wear, do they calculate the MTTF by just letting the drive sit idle and never reading and writing to it? That actually wouldn't surprise me.

--
Hey, there is only one Return and it's not of the King, it's of the Jedi.

Re:I have thought the MTTF is bullshit for a while by paeanblack · 2007-03-02 12:45 · Score: 2, Insightful

The biggest reason I think it would make no difference is because if unconditioned power is supposed to be so bad for electronics then why is the only thing that I have a problem with turn out to be the hard drives? I would think bad power would take out RAM before a hard disk.

Ram has no significant inductive load.

I am shocked! by Anonymous Coward · 2007-03-02 09:31 · Score: 2, Insightful

I just can't believe that the same vendors that would misrepresent the capacity of their disk by redefining a Gigabyte as 1,000,000,000 bytes instead of 1,073,741,824 bytes would misrepresent their MTBF too! And by the way, nobody actually runs a statistically significant sample set of their equipment for 10,000 hours to arrive at a MTBF of 10,000 hours, so isn't their methodology a little suspect in the first place?

And that's a really wide range by VampireByte · 2007-03-02 09:33 · Score: 2, Funny

I feel sorry for anyone buying drives on the low end of that range. A MTTF of 1 hour really sucks.

--

Run and catch, run and catch, the lamb is caught in the blackberry patch.

Even better ... by khasim · 2007-03-02 09:34 · Score: 3, Interesting

Give me 6 month failure rates.

Start with 100 drives. Continuous usage.

How many fail in the first 6 months? 12 months? 18 months? ... 60 months? That would be the info that I'd need. Where's the big failure spike? I'm going to be replacing them right before that.

Re:Even better ... by Falkkin · 2007-03-02 10:01 · Score: 5, Informative

This is handled in the paper. See this graph: http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/img14b.PNG

Unfortunately there is no big "spike"; the average replacement rate just grows and grows with time.
Re:Even better ... by vux984 · 2007-03-02 13:00 · Score: 2, Funny

Alrighty then... I'll just replace them before I install them ;)

Having read the paper and seen the talk... by reset_button · 2007-03-02 09:36 · Score: 2, Informative

Here are the main conclusions:

the datasheet MTTF is always much higher than the observed time to disk replacement
SATA is not necessarily less reliable than FC and SCSI disks
contrary to popular belief, hard drive replacement rates do not enter steady state after the first year of operation, and in fact steadily increase over time.
early onset of wear-out has a stronger impact on replacement than infant mortality.
they show that the common assumptions that the time between failure follows an exponential distribution, and that failures are independent, are not correct.
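
As a rough illustration of why the exponential assumption matters: an exponential lifetime implies a constant failure rate at every age, which cannot reproduce replacement rates that keep climbing. The sketch below contrasts that with a Weibull hazard whose shape parameter is above 1. The Weibull choice and every parameter value here are assumptions made for illustration, not numbers taken from this summary.

```python
def weibull_hazard(t_years, shape, scale_years):
    """Instantaneous failure rate of a Weibull lifetime at age t_years.

    shape == 1 reduces to the exponential distribution (constant hazard);
    shape > 1 gives a hazard that increases with age (wear-out).
    """
    return (shape / scale_years) * (t_years / scale_years) ** (shape - 1)


SCALE_YEARS = 50.0  # hypothetical scale parameter

for year in range(1, 6):
    flat = weibull_hazard(year, 1.0, SCALE_YEARS)    # exponential assumption
    rising = weibull_hazard(year, 1.5, SCALE_YEARS)  # hypothetical wear-out shape
    print(f"year {year}: exponential {flat:.4f}/yr, wear-out {rising:.4f}/yr")
```

Under the exponential model a five-year-old drive is exactly as likely to fail as a new one; under the rising hazard it is not, which is the pattern described in the conclusions above.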

It was an interesting paper (won the best paper award) at this year's FAST (File and Storage Technologies) conference. Here is a link to the paper, and the summary from the conference.

Check SMART Info by Bill+Dimm · 2007-03-02 09:41 · Score: 3, Interesting

Slightly off-topic, but if you haven't checked the Self-Monitoring, Analysis and Reporting Technology (SMART) info provided by your drive to see if it is having errors, you probably should. You can download smartmontools, which works on Linux/Unix and Windows. Your Linux distro may have it included, but may not have the daemon running to automatically monitor the drive (smartd).

To view the SMART info for drive /dev/sda do:
smartctl -a /dev/sda
To do a full disk read check (can take hours) do:
smartctl -t long /dev/sda

Sadly, I just found read errors on a 375-hour-old drive (manufacturer's software claimed that repair succeeded). Fortunately, they were on the Windows partition :-)

Re:Check SMART Info by Chalex · 2007-03-02 13:32 · Score: 2, Informative

Slightly off-topic, but if you haven't checked the Google paper on Self-Monitoring, Analysis and Reporting Technology (SMART) info provided by your drive to see if it is having errors, you probably should. The paper is available here: http://hardware.slashdot.org/hardware/07/02/18/0420247.shtml

The conclusions are roughly the following: a) if there are SMART errors, the disk will fail soon, b) if there are no SMART errors, the disk is still likely to fail. They saw no SMART errors on 36% of their failed disks.

RAID = Redundant Articles of Identical Discourse by MasterC · 2007-03-02 09:42 · Score: 2, Informative

New meaning for RAID: Redundant Articles of Identical Discourse.

Feb 21: http://hardware.slashdot.org/article.pl?sid=07/02/21/004233

Slashdot has a high rate of RAID, which is a bad thing. Which is a bad thing. It has been a whole 9 days. Slashdot needs a story moderation system so dupe articles can get modded out of existence. Ditto for slashdot editors who do the duping! :) (I have long since disabled tagging since 99% of the tags were completely worthless: "yes", "no", "maybe", "fud", etc. If tagging is actually useful now, please let me know!)

Can we get redundant posting on the story about google's paper?

--
:wq

Redundancy by pizza_milkshake · 2007-03-02 09:46 · Score: 3, Funny

I thought storage-related redundancy was supposed to be a good thing ;)

Re:Redundancy by georgewilliamherbert · 2007-03-02 09:52 · Score: 5, Funny

Redundant Array of Irritating Discussions?

No way by Tablizer · 2007-03-02 10:04 · Score: 2, Funny

High rate of failure? That's a bunch of

--
Table-ized A.I.

Seagate by mabu · 2007-03-02 10:04 · Score: 3, Insightful

After 12 years of running Internet servers, I won't put anything but Seagate SCSI drives in any mission critical servers. My experience indicates Seagate drives are superior. Who's the worst? Quantum. The only thing Quantum drives are good for is starting a fire IMO.

Re:Seagate by CelticWhisper · 2007-03-02 10:09 · Score: 2, Funny

Well, duh. Why do you think they used to call them Fireballs?

--
Help protect civil rights from abuse by the TSA - visit TSA News Blog.
http://www.tsanewsblog.com

just assume 3 years by crabpeople · 2007-03-02 10:05 · Score: 4, Informative

A good rule of thumb is 3 years. Most hard drives fail in 3 years. I don't know why, but I'm currently seeing a lot of bad 2004 branded drives and consider that right on schedule. Last year the 02-03 drives were the ones failing left and right. I just pulled one this morning that's stamped March 04. Just started acting up a few days ago. Like clockwork.

--
I'll just use my special getting high powers one more time...

Faster, cheaper, more reliable by dangitman · 2007-03-02 10:09 · Score: 2, Informative

Pick any two.

I've noticed this personally. Now, anecdotal evidence doesn't count for a lot, and it may be a case that we are pushing our drives more. But back in the day of 40MB hard drives that cost a fortune, they used to last forever. The only drives I ever had fail on me in the old days were the Syquest removable HD cartridges, for obvious reasons. But even they didn't fail that often, considering the extra wear-and-tear of having a removable platter with separate heads in the drive.

But these days, with our high-capacity ATA drives, I see hard drives failing every month. Sure, the drives are cheap and huge, but they don't seem to make them like they used to. I guess it's just a consequence of pushing the storage and speed to such high levels, and cheap mass-production. Although the drives are cheap, if somebody doesn't back up their data, the costs are incalculable if the data is valuable.

--
... and then they built the supercollider.

Off-Topic: SI Units by ewhac · 2007-03-02 10:21 · Score: 5, Informative

I just can't believe that the same vendors that would misrepresent the capacity of their disk by redefining a Gigabyte as 1,000,000,000 bytes instead of 1,073,741,824 bytes would misrepresent their MTBF too!

Not that this is actually relevant or anything, but there's been a long-standing schism between the computing community and the scientific community concerning the meaning of the SI prefixes Kilo, Mega, and Giga. Until computers showed up, Kilo, Mega, and Giga referred exclusively to multipliers of exactly 1,000, 1,000,000, and 1,000,000,000, respectively. Then, when computers showed up and people had to start speaking of large storage sizes, the computing guys overloaded the prefixes to mean powers of two which were "close enough." Thus, when one speaks of computer storage, Kilo, Mega, and Giga refer to 2**10, 2**20, and 2**30 bytes, respectively. Kilo, Mega, and Giga, when used in this way, are properly slang, but they've gained traction in the mainstream, causing confusion among members of differing disciplines.

As such, there has been a decree (from the IEC, so strictly speaking these are not SI prefixes) to give the powers of two their own prefix names. The following have been established:

2**10: Kibi (abbreviated Ki)
2**20: Mebi (Mi)
2**30: Gibi (Gi)

These new prefixes are gaining traction in some circles. If you have a recent release of Linux handy, type /sbin/ifconfig and look at the RX and TX byte counts. It uses the new prefixes.
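
For the curious, the gap between the two conventions is easy to compute. This is a small sketch; the 500 GB drive is just an example size and the helper name is made up.

```python
# Decimal (SI) multipliers versus binary multipliers.
KILO, MEGA, GIGA = 10**3, 10**6, 10**9
KIBI, MEBI, GIBI = 2**10, 2**20, 2**30


def advertised_gb_to_gib(gb):
    """Convert a capacity quoted in decimal gigabytes to binary gibibytes."""
    return gb * GIGA / GIBI


# Fraction of capacity that "goes missing" when a GB is read as a GiB.
shortfall = 1 - GIGA / GIBI

print(f"1 GiB = {GIBI:,} bytes")
print(f"500 GB drive = {advertised_gb_to_gib(500):.2f} GiB")
print(f"shortfall at the giga level: {shortfall:.2%}")
```

The discrepancy grows with each prefix step, since every step multiplies in another factor of 1000/1024.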

Schwab

--
Editor, A1-AAA AmeriCaptions

Re:Not So Fuzzy math by Annoying · 2007-03-02 10:22 · Score: 4, Informative

0.88% != 0.88
0.0088 * 15 = 0.132 (13%)
13% you say? The excerpt says 2%-4%. RTA and you'll see, though, that they report up to 13% on some systems.
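
For reference, here is where a figure like 0.88% comes from. The sketch uses the common approximation that annual failure rate is hours per year divided by MTTF, which is reasonable only when the MTTF is much longer than a year. The MTTF values are the datasheet range quoted earlier in the thread; the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def afr_from_mttf(mttf_hours):
    """Approximate annual failure rate implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours


for mttf in (1_000_000, 1_500_000):
    afr = afr_from_mttf(mttf)
    print(f"MTTF {mttf:,} h -> AFR {afr:.2%}, 15x that = {15 * afr:.1%}")
```

So the headline "15 times" corresponds to the worst systems in the study (around 13%), not to the 2-4% range that was typical.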

Re:Odd numbers for memory failure? by Akaihiryuu · 2007-03-02 10:24 · Score: 2, Interesting

I had a 4MB 72-pin parity SIMM go bad one time...this was about 12 years ago in a 486 I used to have. It just didn't work one day (it worked for the first two months). Turn the computer on, get past BIOS start, bam...parity error before bootloader could even start. Reboot, try again, parity error. Turn off parity checking, it actually started to boot and then crashed. The RAM was obviously very defective...when I took that 1 stick out the computer booted normally even with parity on, if I tried to boot with just that stick it would never even POST.

That's the only time I have ever seen memory fail...but then it came from a really shady local dealer who regularly scammed people...this same guy had a rack of "shareware" DOS games with neatly printed labels (all labels he printed) for like $5/disk, all of the disks completely blank (not even formatted). I had happened to get one of those when I got the RAM, and my friend did too (from another part of the rack, we didn't give much thought to that at the time, was just an "oh, this looks like it might be neat" thing). Neither disk was even formatted. The CDROM drives he sold me and my friend died within a month also (about a month after the RAM).

Amazingly the store was still in business when I went back with the stick of RAM...he looked at it with a magnifying glass, claimed it was "scratched" and therefore abused. I burned rubber out of his parking lot, tossing a lot of gravel against the windows, then I found a reputable place to get RAM (though this was back in the days when 4MB cost $200). 2 days later I drove by, the place was boarded up and closed. Both CDROM drives died within 2 days of each other a month later. Nothing that came out of that place worked.

Actually, one useful feature of Vista... by Tim+Browse · 2007-03-02 10:38 · Score: 4, Interesting

...is that it detects SMART disk errors in normal use (i.e. you don't have to be watching the BIOS screens when your PC boots).

When I was trying the Vista RC, it told me that my drive was close to failing. I, of course, didn't believe it at first, but I ran the Seagate test floppy and it agreed. So I sent it back to Seagate for a free replacement.

About the only feature that impressed me in Vista, sadly. (And I'm not sure it should have impressed me, tbh. I'm assuming XP never did this as I've never seen/heard of such a feature.)

Re:Actually, one useful feature of Vista... by Matt+Perry · 2007-03-02 14:03 · Score: 3, Informative

When I was trying the Vista RC, it told me that my drive was close to failing. ... About the only feature that impressed me in Vista, sadly.
Be sad no more. SmartMonTools will run in UNIX or Windows and notify you if it detects SMART errors. For the Windows installer look for the phrase "Install the Windows package" on the smartmontools home page.

--
Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.

This is only news... by rickb928 · 2007-03-02 11:25 · Score: 2, Informative

...to those of you who haven't managed 24x7x365 servers very much. And little news to those of you who have a computer at all.

I expect most desktop drives to last 5 years max. MAX. No manufacturer has an edge. It's just the way it is. MTBF is fiction.

For an always-on server, I expect failures about every 3-4 years. For my clients who cared enough to pay for the very best, I replaced the drives in the 3rd year without waiting. No failures costs a bit more.

My experience is that Seagate and Fujitsu are my best server drives. IBM was also on the list, but I'm watching Hitachi. No decision.

The losers: Quantum (thankfully gone), Samsung (until recently), Maxtor. Not my opinion, my experience.

Now, in fairness, these are some of my historical losers:

Seagate: Early IDE drives and the 'stiction' problem. Remember banging drives to get them started?

Quantum 'Bigfoot' drives: popular in Compaq machines, the 5.25", 0.7"-thin piece of junk. Died often. Even Compaq admitted these were bad.

Seagate SCSI drives: Many different types had a bad habit of going off-line for no apparent reason. Your Novell server would log the 'device deactivated to a non-media defect' error. Just restarting the bus controller would sometimes wake them up. Sometimes repowering the drives. Would happen every few months. Usually when I was elsewhere...

And then there was Miniscribe.

But MTBF numbers are universally fiction. Imagine trying to sell the idea of a wave bearing lasting 16 years to an engineer with real-world experience. I figure MTBF numbers come out of the marketing department.

-rick

--
deleting the extra space after periods so i can stay relevant, yeah.

Re:Masters of estimates by Fulcrum+of+Evil · 2007-03-02 11:32 · Score: 3, Insightful

Well, the hard-drive makers are correct on the size thing - a Gigabyte is 1000 Megabytes, and the OS and software makers are wrong.

Yeah, they coined the term and have been using it for 40 years, but they're wrong.

Gigabytes are actually displayed as Gigabytes, or that the listing is changed to correctly display Gibibytes as the value? (or Kibibytes, Mebibytes, whatever)

Listen, just because someone comes up with a standard doesn't obligate everyone to use it, especially when they already have a perfectly workable system already. Claiming that NIST can impose an unwanted standard on the world is like saying that it isn't a word until the OED lists it.

--
"We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"

Ideal conditions vs. Real world by CorporalKlinger · 2007-03-02 14:08 · Score: 2, Informative

I think one of the key problems here isn't necessarily the statistical methods used, it is that the CMU team was comparing real-life drive performance to the "ideal" performance levels predicted by the drive manufacturers. Allow me to provide two examples of this "apples to oranges" comparison problem.

I have had two computers with power supply units that were "acting up." They ended up killing my hard drives on multiple occasions - Seagates, WD's, Maxtors, etc. It didn't matter what type of drive you put in these systems, the drive would die after anywhere from a week to two years. I later discovered that the power supplies were the problems, replaced them with brand new ones, and replaced the drives one last time. That was quite some time ago (years), and those drives, although small, still work, and have been transferred into newer computer systems since that time. The PSU was killing the drives; they weren't inherently bad or had a manufacturing defect. A friend of mine who lives in an apartment building constructed circa 1930 experienced similar problems with his drives. After just a few months, it seemed like his drives would spontaneously fail. When I tested his grounding plug, I found that it was carrying a voltage of about 30V (a hot ground - how wonderful). Since he moved out of that building and replaced his computer's PSU, no drive failures.

The same type of thing is true in automobile mileage testing. Car manufacturers must subject their cars to tests based on rules and procedures dictated by state and federal government agencies. These tests are almost never real world - driving on hilly terrain, through winds, with the headlights and window wipers on, plus the AC for defrost. They're based on a certain protocol developed in a laboratory to level the playing field and ensure that the ratings, for the most part, are similar. It simply means when you buy a new car, you can expect that under ideal conditions and at the beginning of the vehicle's life, it should BE ABLE to get the gas mileage listed on the window (based on an average sampling of the performance of many vehicles).

My point is that there really isn't a decent way to go about ensuring that an estimated statistic is valid for individual situations. By modifying the environmental conditions, the "rules of the game" change. A data-center with exceptional environmental control and voltage regulation systems, and top-quality server components (PSU's, voltage regulators, etc.) should expect to experience fewer drive failures per year than the drives found in an old chicken-shack data center set up in some hillbilly's back yard out in the middle of nowhere where quality is the last thing on the IT team's mind. It's impractical to expect that EVERY data center will be ideal - and since it's very very difficult to have better than the "ideal" testing conditions used in the MTTF tests - the real-life performance can only move towards more frequent and early failures. Using the car example above, since almost nobody is going to be using their vehicle in conditions BETTER than the ideal dictated by the protocols set forth by the government, and almost EVERYONE will be using their vehicles under worse conditions, the population average and median have nowhere to go but down. That doesn't mean the number is wrong, it just means that it's what the vehicle is capable of - but almost never demonstrates in terms of its performance - since ideal conditions in the real world are SO rare.
