Everything You Know About Disks Is Wrong

MTBF by seanadams.com · 2007-02-20 13:36 · Score: 5, Interesting

MT[TB]F has become a completely BS metric because it is so poorly understood. It only works if your failure rate is linear with respect to time. Even if you test for a stupendously huge period of time, it is still misleading because of the bathtub curve effect. You might get an MTBF of say, two years, when the reality is that the distribution has a big spike at one month, and the rest of the failures forming a wide bell curve centered at say, five years.

Suppose a tire manufacturer drove their tires around the block, and then observed that not one of the four tires had gone bald. Could they then claim an enormous MTBF? Of course not, but that is no less absurd than the testing being reported by hard drive manufacturers.

Re:MTBF by Wilson_6500 · 2007-02-20 13:45 · Score: 5, Informative

Um, but doesn't the summary of the paper say that there is no infant mortality effect, and that failure rates increase with time, and thus the bathtub curve doesn't actually apply?
Re:MTBF by Hubec · 2007-02-20 13:50 · Score: 1

I'm not saying you're wrong, but how does your statement about the one month infant mortality spike relate to the article's finding that no such spike is observable in the wild?
Re:MTBF by Anonymous Coward · 2007-02-20 13:58 · Score: 0

If on average one out of four drives goes bad immediately when you plug it in, it will decrease the MTBF by 25%, not make it ridiculously small or enormous.

I'm amazed by the number of people who set out to show that MTBF is easy to misunderstand, but end up proving it by example.
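
The arithmetic behind that 25% figure can be checked with a short sketch. It assumes, purely for illustration, that dead-on-arrival drives contribute zero hours and that the surviving drives average 40,000 hours; the function name and that number are made up for the example and do not come from the paper.

```python
def mean_lifetime(fraction_dead_on_arrival, mean_hours_of_survivors):
    """Mean lifetime when some fraction of drives fails at hour zero."""
    # DOA drives contribute 0 hours, so only the survivors add to the mean.
    return (1.0 - fraction_dead_on_arrival) * mean_hours_of_survivors


SURVIVOR_MEAN_HOURS = 40_000.0  # hypothetical figure, for illustration only

baseline = mean_lifetime(0.0, SURVIVOR_MEAN_HOURS)
with_doa = mean_lifetime(0.25, SURVIVOR_MEAN_HOURS)
reduction = 1.0 - with_doa / baseline

print(f"mean drops from {baseline:.0f} to {with_doa:.0f} hours ({reduction:.0%} lower)")
```

The reduction equals the dead-on-arrival fraction whatever survivor mean is plugged in, which is the point being made: an early spike shifts the mean proportionally, it does not wreck it.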
Re:MTBF by gvc · 2007-02-20 14:04 · Score: 3, Interesting

MT[TB]F has become a completely BS metric because it is so poorly understood. It only works if your failure rate is linear with respect to time. Even if you test for a stupendously huge period of time, it is still misleading because of the bathtub curve effect. You might get an MTBF of say, two years, when the reality is that the distribution has a big spike at one month, and the rest of the failures forming a wide bell curve centered at say, five years.
The simplest model for survival analysis is that the failure rate is constant. That yields an exponential distribution, which I would not characterize as a bell curve. The Weibull distribution more aptly models things (like people and disks) that eventually wear out; i.e. the failure rate increases with time (but not linearly).
With the right model, it is possible to extrapolate life expectancy from a short trial. It is just that the manufacturers have no incentive to tell the truth, so they don't. Vendors never tell the truth unless some standardized measurement is imposed on them.
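
A minimal sketch of the two models mentioned above, using the standard Weibull hazard and mean formulas. The scale of 100,000 hours and the shape of 1.5 are arbitrary illustration values, not numbers fitted to any drive data.

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at age t > 0 for a Weibull lifetime."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_mean(shape, scale):
    """Mean lifetime (MTTF) of a Weibull distribution."""
    return scale * math.gamma(1.0 + 1.0 / shape)


SCALE_HOURS = 100_000.0  # hypothetical characteristic life

# shape == 1.0 is the exponential (constant-rate) case; shape > 1 models wear-out.
for shape in (1.0, 1.5):
    early = weibull_hazard(1_000.0, shape, SCALE_HOURS)
    late = weibull_hazard(40_000.0, shape, SCALE_HOURS)
    mttf = weibull_mean(shape, SCALE_HOURS)
    print(f"shape={shape}: hazard at 1k h={early:.2e}, at 40k h={late:.2e}, MTTF={mttf:.0f} h")
```

With shape 1.0 the hazard is the same at every age; with shape 1.5 it grows with age, and it does so along a curve rather than a straight line.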
Re:MTBF by kidgenius · 2007-02-20 15:16 · Score: 2, Informative

Well, I guess you don't really understand reliability then. You also don't understand MTBF/MTTF (hint: they aren't the same). What they have said is a big "no duh" to anyone in the field. MTTF will work regardless of whether or not your failure rate is linear with time. Also, there are other failure distributions beyond just the exponential, such as the Weibull. The exponential is a special case of the Weibull. Using this distribution you can accurately calculate an MTTF. Now, the MTBF will not match the MTTF initially, but given enough time, it will eventually match the MTTF. All of this information is very useful to anyone that actually knows what to do with those numbers.
Re:MTBF by Anonymous Coward · 2007-02-20 15:21 · Score: 0

The problem with the hard drive MTBF is the way the manufacturers measure and come up with those numbers.

I'm pretty lazy, but you can search around or go here http://forums.storagereview.net/index.php to find out how they exaggerate the numbers to ridiculous proportions.
Re:MTBF by kidgenius · 2007-02-20 15:24 · Score: 2, Insightful

I'm also going to add to my statement and mention that the authors of the article do not understand MTTF. They have calculated MTBF, not MTTF. They are not the same. In fact, they have assumed that the drives fail in a random way by doing a simple hours/failures. They really need to look at failures and suspensions and perform a Weibull analysis to see how close their stuff is to the manufacturers' stated values.
Re:MTBF by kidgenius · 2007-02-20 15:35 · Score: 3, Informative

No, they don't. Hard drive manufacturers state an MTTF, which is very different from MTBF. The two can be similar, but they are not interchangeable. The author of this paper has calculated MTBF, and tried to compare it to MTTF, which is WRONG. They really should've consulted a reliability engineer. Any competent one worth their salt would see the difference. One of them varies with time, the other is static and unchanging based on age.
Re:MTBF by 6th+time+lucky · 2007-02-20 15:56 · Score: 3, Insightful

MT[TB]F has become a completely BS metric because it is so poorly understood.
Don't forget the M in MTBF. It's a mean (statistically speaking...). That means (!) that some might fail now, some later, but on average they last a while. Manipulate that information and you might get 1,000,000 hrs MTBF, but you have to account for the worst-case scenario (that's what a failure is), which might be that the next drive is going to fail *now*. That is why RAID5 isn't as good as it might seem from looking at the average statistics.

Backup, backup, backup has always been my motto (and that's just personal data). Interesting that Google thinks this is the way to go also (i.e. 3 copies of all data).
Re:MTBF by plover · 2007-02-20 16:31 · Score: 1

Metric or not, MTBF has a direct effect on our systems. Our group has about 45,000 machines deployed nationwide. An MTBF of 30,000 hours means that we're repairing a dozen machines a day (30,000 hours includes everything, not just hard drives.) It's interesting in that we have clients who expect certain bits of data will be magically "guaranteed" to arrive intact. I like explaining to them that the statistics ensure we'll have at least a dozen machine failures per day, and that they better understand that occasionally a two-phase commit will never get finished because the client died. (Besides, I hate lazy designers who think other systems will always properly clean up their messes.)
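
The fleet arithmetic can be sketched as follows, assuming a constant failure rate. The number of powered hours per day is not stated above, so both values tried here are assumptions chosen for illustration.

```python
def expected_failures_per_day(machines, mtbf_hours, powered_hours_per_day):
    """Expected daily failures for a fleet, assuming a constant failure rate."""
    fleet_hours_per_day = machines * powered_hours_per_day
    return fleet_hours_per_day / mtbf_hours


MACHINES = 45_000
MTBF_HOURS = 30_000

# Powered hours per day is an assumption; the post does not give it.
for hours in (8, 24):
    per_day = expected_failures_per_day(MACHINES, MTBF_HOURS, hours)
    print(f"{hours} powered hours/day -> about {per_day:.0f} failures/day")
```

Eight powered hours a day reproduces the dozen-a-day figure, and round-the-clock operation would triple it, so "at least a dozen" holds either way.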

--
John
Re:MTBF by bepo · 2007-02-20 17:50 · Score: 1

It would be interesting to see how they came up with these off the wall MTBF numbers. I wonder how many drives actually make it there. There are two stats I would like to see in a large field study. First, the effect of power supply stability on drive failure rate. Second, I would like to see if the failure rate is related to the technician doing the install.

I've seen techs slide computers out from under desks while they were running. Of course the rubber feet grab on the floor and cause a bumping effect for the unparked heads.
Re:MTBF by angio · 2007-02-20 17:56 · Score: 3, Informative

Your statement doesn't make a lot of sense. a) Hard drives are a non-repairable system, for all intents and purposes. Therefore, there *is* no repair. MTTF is the only useful metric. b) MTBF = MTTF + the time to repair. Assuming that's zero, then for any useful failure engineering, hard drive MTBF = hard drive MTTF. That's about all you've got if you're expressing the statistic as a single number. The reason that MTBF is a function of time is to cope with the assumption that the system is less reliable after a repair, which doesn't apply in this case.

Now, you can have all sorts of distributions that you draw that mean from, but a mean is a mean.
Re:MTBF by vtcodger · 2007-02-20 20:01 · Score: 5, Insightful

***Um, but doesn't the summary of the paper say that there is no infant mortality effect,***
It does. But it also says -- repeatedly -- that the data is disk replacement data, NOT disk failure data, i.e. it's data on the number of problems that the user tech thought might be fixed by replacing the disk, not on the number of disks that actually failed. One might wonder if, for example, the response to a system failing while it was being set up or in early lifetime might not be to put the whole damn thing into a box and ship it back to the vendor rather than dink around trying to figure out what is wrong. That won't be recorded as a disk failure.
The study is fine -- really it is. But, table 3 ought to give pause. It's quite clear that different data sets show quite different diagnostic patterns. We've got one set of data that says that power supplies, for example, are hardly ever replaced and a second set that says that they are the most frequently replaced item. There MAY be good reasons for this. But it could also be an indication that the technicians are incompetent, that the record keeping is erratic, or (and I'd seriously consider this one) that only certain kinds of failures are being recorded.
Finally, I think someone really ought to mention that there is no way that a disk manufacturer is actually going to measure MTBFs of 100000 hours prior to printing up the data sheets. The problem is that there are only around 750 hours in a month. And you need a reasonable number of failures (many quality guys would say at least 4) in order to get a reasonably valid MTBF. In order to actually measure a six digit MTBF, the manufacturer would have to run maybe 500 units for a month. My guess is that isn't going to happen. If they have the production line producing 500 units, they are going to ship them. Manufacturer MTBF data are surely based on data from a handful of engineering and preproduction units plus a bunch of wild guesses.
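
A rough sketch of that test-size arithmetic, using the simple unit-hours-per-failure estimate. The 750 hours per month and the four-failure minimum are the round figures from the paragraph above; the function names are made up for the example.

```python
HOURS_PER_MONTH = 750  # round figure used in the post


def measured_mtbf(units, hours_per_unit, failures):
    """Point estimate of MTBF: accumulated unit-hours divided by failures."""
    return units * hours_per_unit / failures


def units_needed(target_mtbf_hours, failures, hours_per_unit):
    """Units required to accumulate enough hours to observe that many failures."""
    return target_mtbf_hours * failures / hours_per_unit


estimate = measured_mtbf(500, HOURS_PER_MONTH, 4)
needed = units_needed(100_000, 4, HOURS_PER_MONTH)

print(f"500 units for one month with 4 failures -> MTBF of {estimate:.0f} hours")
print(f"units needed to demonstrate 100,000 hours in one month: {needed:.0f}")
```

Five hundred units for a month lands just short of a six-digit MTBF, which is consistent with the "maybe 500 units" estimate.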
My guess, and it is just a guess, is that manufacturer MTBFs for disks are probably pretty much the MTBF goal in the drive specifications established before the design actually started.
Incidentally, based on some experience with other sorts of high tech gadgetry, if the engineering/preproduction units do fail during test, a failure analysis will be done, and steps will be taken to fix the problem. Problem's fixed. OK, we shouldn't count those failures since they won't happen any more. That's called "censoring failure data". Begin to get an idea why disk MTBFs might be pretty much pure fiction?

--
You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
Re:MTBF by Eivind · 2007-02-20 21:23 · Score: 1

It does. The summary also claims that Google found no infant-mortality effect. Which is patently wrong, Google *definitely* showed a reasonably strong infant-mortality effect, particularly strong in high-load uses.
Re:MTBF by cowbutt · 2007-02-20 23:57 · Score: 1

a) Hard drives are a non-repairable system, for all intents and purposes. Therefore, there *is* no repair. MTTF is the only useful metric. b) MTBF = MTTF + the time to repair. Assuming that's zero, then for any useful failure engineering, hard drive MTBF = hard drive MTTF.
You're ignoring soft errors and read errors. These can usually be corrected by rewriting the block(s) in question (doing so either replaces the corrupt data with good data, or forces the drive to remap the block). This should result in MTBF being significantly less than MTTF, since the drive would only get replaced on write errors (indicating some sort of catastrophic mechanical or media problem, or the exhaustion of the reserve of spare blocks) or obvious physical failure (failing to spin up, grinding, etc).
Re:MTBF by elrous0 · 2007-02-21 03:39 · Score: 1

Well, if you put an infant in a bathtub and fail to attend it, the chance of mortality is very high.
-Eric

--
SJW: Someone who has run out of real oppression, and has to fake it.
Re:MTBF by kabocox · 2007-02-21 05:05 · Score: 1

Finally, I think someone really ought to mention that there is no way that a disk manufacturer is actually going to measure MTBFs of 100000 hours prior to printing up the data sheets. The problem is that there are only around 750 hours in a month. And you need a reasonable number of failures (many quality guys would say at least 4) in order to get a reasonably valid MTBF. In order to actually measure a six digit MTBF, the manufacturer would have to run maybe 500 units for a month. My guess is that isn't going to happen. If they have the production line producing 500 units, they are going to ship them. Manufacturer MTBF data are surely based on data from a handful of engineering and preproduction units plus a bunch of wild guesses.

Why is this allowed? Let's use our old fashioned car analogy. Consumer Reports, the government, insurance companies, environmental groups and some other car buying groups buy production cars and run them for 3 months, 1 year, 5 years to get various stats to let folks know things like what the real mpg or real cost of maintenance for a car is. Why don't we do the same things for our computers, either by entire system or by parts? (I'd say the old reason was that someone would just buy a new computer/part after 2-3 years rather than bother with properly rating parts.) Companies, government, and insurance should all want industries properly regulated with reliable stats. This is like just taking the automakers' word that all their cars have 50+ mpg, last 1 million miles, and last 20 years with no one else verifying it. Um, come on, it's natural for the drive manufacturers to slant things to their benefit. It's up to all their consumers and other folks to test, verify, and hold them accountable for their claims.
Re:MTBF by kidgenius · 2007-02-21 06:00 · Score: 1

Repair doesn't matter for MTTF and MTBF. Repairable, replaceable, whatever. It makes no difference.
Now, you can have all sorts of distributions that you draw that mean from, but a mean is a mean.
Yes, but the MTBF and MTTF are two different types of mean. MTBF is the mean time between failures, i.e. how many hours can we expect between failures in our system, whereas MTTF is the mean time to failure, i.e. how many hours, on average, does each drive see before it fails. MTTF is the mean of the distribution. MTTF = MTBF when the distribution is exponential. In this case, it is not an exponential failure distribution. It follows the Weibull distribution VERY closely as they have indicated. So MTBF ≠ MTTF.
Re:MTBF by kidgenius · 2007-02-21 06:10 · Score: 2, Insightful

And for my final trick, let me give you an example.
Let's say you have five units with an MTTF of 5000 hours, and we put a new one into service every 500 hours.
It'll look something like this:
0-5000
500-5500
1000-6000
1500-6500
2000-7000
Now, each drive failed after five thousand hours. This is the mean time to failure. In other words, each drive had, on average, 5000 hours on it when it failed.
Next, let's calculate MTBF. There were 5 failures, with a total of 7000 hours of operation. This would result in a cumulative MTBF of 7000/5 = 1400 for the system. If you really look at it even closer you can see that you had an MTBF of infinity for the first 5000 hours, then an MTBF of only 500 hours for the last 2000 hours. Notice how MTBF has changed over time but MTTF has remained the same? Notice the huge difference between MTBF and MTTF now? Notice how I didn't take repair into account at all?
So repeat after me....MTBF is NOT the same as MTTF. The paper is incorrect in this regard.
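
The example can be reproduced with a few lines of code. This sketch follows the definitions used in the example itself (system MTBF as elapsed system time divided by failure count); the variable names are made up, and the last line adds a unit-hours figure purely for comparison.

```python
START_TIMES = [0, 500, 1000, 1500, 2000]  # hour each unit entered service
LIFETIME = 5000                           # every unit fails after 5000 hours

failure_times = [start + LIFETIME for start in START_TIMES]

# MTTF: average hours each unit accumulated before it failed.
mttf = sum(f - s for s, f in zip(START_TIMES, failure_times)) / len(START_TIMES)

# Cumulative system MTBF as defined in the example: elapsed time / failures.
elapsed = max(failure_times) - min(START_TIMES)
cumulative_mtbf = elapsed / len(failure_times)

# Gaps between consecutive failures once failures start arriving.
gaps = [later - earlier for earlier, later in zip(failure_times, failure_times[1:])]

# For comparison only: total unit-hours divided by failures.
unit_hours_per_failure = len(START_TIMES) * LIFETIME / len(failure_times)

print(mttf, cumulative_mtbf, gaps, unit_hours_per_failure)
```

The two headline numbers differ because one divides system clock time by failures while the other averages per-unit hours; dividing total unit-hours by failures gives the per-unit figure again.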

moving parts by DogDude · 2007-02-20 13:41 · Score: 5, Funny

Every single mechanism with moving parts will fail. It's just a matter of when. In a few years, when everybody is using solid state drives, people will look back and shake their heads, wondering why we were using spinning magnetic platters to hold all of our critical data for such a long time.

--
I don't respond to AC's.

Re:moving parts by Nimloth · 2007-02-20 13:57 · Score: 2, Interesting

I thought flash memory had a lower read/write cycle expectancy before crapping out?
Re:moving parts by theReal-Hp_Sauce · 2007-02-20 14:05 · Score: 5, Funny

Forget Solid State Drives, soon we'll have Isolinear Chips. It won't matter if they fail or not because as long as the story line supports it Geordi can re-route the power through some other subsystem, Data can move the chips around really quickly, Picard can "make it so", and after it's all over with Wesley can wear a horrible sweater and deliver a really cheesy line.

-C
Re:moving parts by NMerriam · 2007-02-20 14:19 · Score: 4, Informative

I thought flash memory had a lower read/write cycle expectancy before crapping out?

They do have a limited read/write lifetime for each sector, BUT the controllers automatically distribute data over the least-used sectors (since there's no performance penalty to non-linear storage), and you wind up getting the maximum possible lifetime from well-built solid-state drives (assuming no other failures).

So in practice, the lifetime of modern solid state will be better than spinning disks as long as you aren't reading and writing every sector of the disk on a daily basis.

--
Recursive: Adj. See Recursive.
Re:moving parts by brarrr · 2007-02-20 14:24 · Score: 1

Says you!

I'm going to live forever!

--
to email me: take my /. handle and append .net preceded by charter.
Re:moving parts by CastrTroy · 2007-02-20 14:25 · Score: 1

Unfortunately, we don't have solid state storage that doesn't fail either. I've had more RAM chips die than hard drives. And I know that you aren't suggesting that flash memory doesn't fail. Although I've never had flash memory fail, I've only ever used it for digicams and mp3 players, and not for the kind of usage pattern you would get from a hard drive.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:moving parts by wik · 2007-02-20 14:45 · Score: 4, Informative

Not true. Transistors at really small dimensions (e.g., 32nm and 22nm processes) will experience soft breakdown during (what used to be) normal operational lifetimes. This will be a big problem in microprocessors because of gate oxide breakdown, NBTI, electromigration, and other processes. Even "solid-state" parts have to tolerate current, electric fields, and high thermal conditions and gradually break down, just like mechanical parts. Don't go believing that your storage will be much safer, either.

--
/ \
\ / ASCII ribbon campaign for peace
x
/ \
Re:moving parts by scoot80 · 2007-02-20 14:56 · Score: 2, Informative

Flash memory will have about 100,000 write cycles before you will burn it out. As parent mentioned, a controller would write that data to several different locations, at different times, thus increasing the lifetime. What this would mean though is that your flash disk will be considerably bigger than what it can actually hold.
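
A back-of-the-envelope sketch of what wear leveling buys, under strong simplifying assumptions: writes are spread perfectly evenly over every block, there is no write amplification, and nothing else fails first. The drive size and daily write volume are hypothetical.

```python
def wear_leveled_lifetime_days(capacity_gb, cycles_per_block, gb_written_per_day):
    """Days until every block has used up its write cycles, with ideal leveling."""
    # With perfect leveling the drive can absorb capacity * cycles of writes in total.
    total_writable_gb = capacity_gb * cycles_per_block
    return total_writable_gb / gb_written_per_day


CYCLES = 100_000  # per-block write endurance quoted in the post

# Hypothetical workload: a 1 GB drive rewritten ten times over every day.
days = wear_leveled_lifetime_days(capacity_gb=1, cycles_per_block=CYCLES, gb_written_per_day=10)
print(f"about {days:.0f} days, or {days / 365:.0f} years")
```

Real controllers fall short of this ideal, so the result is an upper bound on wear-out life, not a prediction.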
Re:moving parts by tedgyz · 2007-02-20 15:01 · Score: 1

So is there a MTBF for solid state drives? I'm serious.

--
"No matter where you go, there you are." -- Buckaroo Banzai
Re:moving parts by Anonymous Coward · 2007-02-20 15:16 · Score: 0

I'm gonna learn how to fly!
Re:moving parts by mightyQuin · 2007-02-20 15:31 · Score: 1

Cisco 6900 series routers - flash RAM has been very reliable in my experience...close to an HD type of demand.

--
Now, if you'll excuse me, I've got some idea balls to remove from a manatee tank.
Re:moving parts by blackest_k · 2007-02-20 15:37 · Score: 3, Interesting

Still doesn't mean it will last, got a 1 gig usb flash drive here dead in less than 8 weeks and very few read and writes. It will not identify itself. It might have 99,900 write cycles left but it's still trashed.
Let's face it, there is no reliable storage media, the only way to be safe is multiple copies.

--
Blarney Quality Restaurant, Plants
Re:moving parts by um...+Lucas · 2007-02-20 16:01 · Score: 1

But honestly, if something was going to crap out on me, and it had critical data on it that needed to be reassembled, I'd MUCH prefer to pull a few hard drives out of the server and send them to a place that has access to the 40+ years of compiled experience recovering data from magnetic platters than (gulp!) pull out the flash drive and say "god, I hope someone knows how to do something with this!"
Re:moving parts by Detritus · 2007-02-20 16:21 · Score: 1

Given some failure data, you can calculate an MTBF for almost anything. The military has been compiling reliability data for various electronic components for many decades.

--
Mea navis aericumbens anguillis abundat
Re:moving parts by AbRASiON · 2007-02-20 16:51 · Score: 1

People have been praising "solid state disks" since I was a bloody kid in high school!

I've been dying for them for so long it's just mind-bogglingly insane - I don't think they will ever happen on the desktop, not beating normal hard disks.
You've got your write cycle limitation issue, you've got the speed sucking issue (seriously, flash does not beat a good hard disk! - terrible), you've got space issues (once flash IS affordable enough to manufacture a 400GB consumer drive, we'll all be using 3TB standard magnetic drives).

I'd love to ditch hard disks, I've ranted against them many a time - slow bloody things but it's not going to happen any time soon.
Re:moving parts by DDLKermit007 · 2007-02-20 20:12 · Score: 1

Hooray for anecdotal evidence. Your experience is the minority with flash based media. Not to mention you yourself stated that it likely has a whole shitload of write cycles left. A problem like this would be averted by internal solid state memory (no frequent removal). What's gone wrong, worst case, with yours is that the controller is shot (can happen on any kind of drive). I'd put good money though that if you popped the case of the flash drive open you'd find one of the pins that leads to the USB connection has wiggled free of the solder or snapped from the user causing stress on the pins. It's a fairly common thing with devices like flash drives & USB wireless adapters with me.
Re:moving parts by am+2k · 2007-02-20 21:32 · Score: 2, Insightful

The point you didn't get was that even solid state disks can fail without warning, so you need a backup anyways.

You only need a single counterexample to disprove a theory.
Re:moving parts by Anonymous Coward · 2007-02-20 21:34 · Score: 0

But most problems like electromigration (or worse, simple thermal destruction) are related to high temperatures / high voltages.

I don't think something that isn't a high performance CPU would suffer from those problems, especially in an embedded context where care is taken to stop leakage. If you don't pump an awful lot of energy through those transistors, I doubt you'd see much electromigration.
Re:moving parts by jez9999 · 2007-02-20 21:49 · Score: 1

Sounds like you're buying crappy RAM. I always buy Crucial non-ECC, and have never had any RAM failure. Buy Crucial's ECC, and the stuff is reliable as anything.

--
== Jez ==
Do you miss Firefox? Try Pale Moon.
Re:moving parts by asuffield · 2007-02-21 01:00 · Score: 1

got a 1 gig usb flash drive here dead in less than 8 weeks and very few read and writes

Which just goes to show what anybody can tell you: there is huge variation in both quality and price of USB flash drives. You can get 1Gb of flash storage for the price of a cup of coffee. And it'll last about as long.

You can also buy drives that last a lot longer. Those cost more.
Re:moving parts by koyangi · 2007-02-21 01:11 · Score: 2

Given some failure data, you can calculate an MTBF for almost anything. The military has been compiling reliability data for various electronic components for many decades.
Yes they have.

I work in the defense industry and in general hardware works just long enough to be installed in the vehicle. All you need to know is when the first system test will be done in front of the customer and you can easily predict when failure of every critical component will occur.

Some people insist upon using math and MIL-HDBK-217, but I say give me a program schedule and I can tell you exactly when you will hit a 50% failure rate!
Re:moving parts by RalphTheWonderLlama · 2007-02-21 03:57 · Score: 1

Yeah, I was hoping someone would say this. They will probably fail less but they will be very hard to recover. Plus there is all the current infrastructure and experience built up at data recovery firms currently that will gradually become obsolete.

--
simple, fast homepage with your links: http://www.ngumbi.com/
Re:moving parts by kabocox · 2007-02-21 04:49 · Score: 1

Every single mechanism with moving parts will fail. It's just a matter of when. In a few years, when everybody is using solid state drives, people will look back and shake their heads, wondering why we were using spinning magnetic platters to hold all of our critical data for such a long time.

Well, electrons move so we are in trouble since all our computers are based on moving electricity!
Eventually, entropy will get you no matter what you use.
Re:moving parts by Boglin · 2007-02-21 05:32 · Score: 3, Funny

I think you need to reread the article. It clearly states that consumer solutions are just as good as Enterprise one.
Re:moving parts by Maximum+Prophet · 2007-02-21 07:43 · Score: 2, Informative

If you look at the numbers for the failure of the system RAM and assume that most machines have much, much more disk space than RAM, SSDs don't make sense. They are faster, but you won't get better MTBFs. On the HPC1 and COM1 groups of machines, the memory was replaced almost as often as the hard drives. If you had to replace all that HD space with RAM, your failure rate would go through the roof.

--
All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
Re:moving parts by blackest_k · 2007-02-22 00:05 · Score: 1

Thanks for the suggestion. If I can get into the case, I will see if it is purely a mechanical problem like you suggest.
It would be nice if it was repairable.

--
Blarney Quality Restaurant, Plants
Re:moving parts by Alioth · 2007-02-22 03:43 · Score: 1

Look at the paper a bit more deeply. A more important observation, I think, is the comparative table. It showed that memory had a similar failure rate to hard disks! So much for no-moving-parts being more reliable.

--
Oolite: Elite-like game. For Mac, Linux and Windows
Re:moving parts by dfries · 2007-02-24 14:10 · Score: 1

Check out Table 2 in that report: "Node outages that were attributed to hardware problems broken down by the responsible hardware component. This includes all outages, not only those that required replacement of a hardware component."
HPC1:
CPU 44%
Memory 29%
Hard drive 16%
At those kinds of failure rates I think I'll stick to the spinning magnetic disk drive.

The Abstract by Anonymous Coward · 2007-02-20 13:41 · Score: 0

Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million.

In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI and FC, as well as SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%.

We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF.

We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years.

Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks.

Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
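
The 0.88% figure follows from dividing the hours in a year by the datasheet MTTF. A minimal sketch, assuming a 365-day year and the simple ratio approximation (which holds when the MTTF is much longer than a year); the function names are made up for the example.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def nominal_afr(mttf_hours):
    """Nominal annual failure rate implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours


def implied_mttf(annual_rate):
    """MTTF that a given annual replacement rate would correspond to."""
    return HOURS_PER_YEAR / annual_rate


for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:,} h -> nominal AFR {nominal_afr(mttf):.2%}")

for rate in (0.02, 0.04, 0.13):
    print(f"observed {rate:.0%}/year -> implied MTTF of about {implied_mttf(rate):,.0f} h")
```

Run in reverse, the same ratio shows how far the observed replacement rates sit from the datasheet range.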

i'll tell you by User+956 · 2007-02-20 13:43 · Score: 2, Interesting

Bianca Schroeder, of CMU's Parallel Data Lab, submitted Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?

It means I should be storing my important, important data on a service like S3.

--
The theory of relativity doesn't work right in Arkansas.

Re:i'll tell you by Taimat · 2007-02-20 13:51 · Score: 1

I disagree.... I would much rather have control over the drives where my data resides, rather than upload it to a place where I have no idea what they are doing to protect my data. In my opinion, nothing beats a good RAID with current backups. Just because a drive isn't dead, doesn't mean it can't be replaced. After the drive has been in commission a while, replace it, and move it to a server that isn't as critical.

--
The above comments are not guaranteed to make sense to anyone other than the author...
Re:i'll tell you by udderly · 2007-02-20 13:57 · Score: 1

I always tell my customers that if you don't have an off-site backup, you're really not backed up. Of course, we have an off-site backup service, so take that with a grain of salt. But, like they say with real estate, "location, location, location." Except in this case, "redundancy, redundancy, redundancy."
Re:i'll tell you by karnal · 2007-02-20 14:03 · Score: 2

"redundancy, redundancy, redundancy."

So that Department of Redundancy Department really does something after all!

--
Karnal
Re:i'll tell you by Anonymous Coward · 2007-02-20 14:56 · Score: 0

Against that, sometimes you're just not supposed to have the data. Consider the real world case of a site that had no less than five sets of backups, three of them at distinct offsite storage locations. Come time to recover the data, the two at the local data centre were found to be bad. So they called back the first set. Exposed to magnetic fields; unreadable. They then called back the second set. Water damaged. Third set. Courier had a car crash, tapes were damaged beyond repair ...
Re:i'll tell you by DarkVader · 2007-02-20 15:12 · Score: 2, Funny

And could there be anything funnier that could happen to that comment than it being moderated "Redundant"?
Re:i'll tell you by Anonymous Coward · 2007-02-20 15:29 · Score: 1, Funny

In a perfectly humorous world, everyone would mod it as Underrated, (except the original Redundant mod,) so that it makes it to +5 Redundant.

Oh, I wish I didn't waste my mod points on the Valentine survey.
Re:i'll tell you by Brickwall · 2007-02-20 16:20 · Score: 1

It's awfully nice to see you computer guys get the message that's been burned into the soul of every telecom engineer for the last 80 years. Welcome aboard!

--
What was once true, is no longer so

"Everything You Know About Disks Is Wrong" by cookieinc · 2007-02-20 13:49 · Score: 3, Funny

Everything You Know About Disks Is Wrong

Finally, a paper which dispels the common myth that disks are made of boiled candy.

Re:"Everything You Know About Disks Is Wrong" by egr · 2007-02-20 14:44 · Score: 4, Funny

I've read the article, then the title, damn!
Re:"Everything You Know About Disks Is Wrong" by mrbcs · 2007-02-20 14:57 · Score: 1

umm So Maxtor drives really aren't that good? /smartass

--
I'm not anti-social, I'm anti-idiot.
Re:"Everything You Know About Disks Is Wrong" by cookieinc · 2007-02-20 15:04 · Score: 1

I was going to read the article, but the overly assertive title intimidated me. Instead I wept quietly in my cubicle of impending anxiety.

Dr. Schroeder is pretty hot, too! by yanyan · 2007-02-20 13:50 · Score: 1, Offtopic

http://www.cs.cmu.edu/~bianca/

I would love to give her my very large hard drive. For "performance evaluation and measurement", you understand. ;-P

Re:Dr. Schroeder is pretty hot, too! by Anonymous Coward · 2007-02-20 13:53 · Score: 5, Funny

Except she requires a MTBF of more than 3 seconds. Sorry dude.
Re:Dr. Schroeder is pretty hot, too! by gardyloo · 2007-02-20 13:56 · Score: 2, Funny

Except she requires a MTBF of more than 3 seconds. Sorry dude.

You call that failure?!? I'd call it success.
Re:Dr. Schroeder is pretty hot, too! by Anonymous Coward · 2007-02-20 14:03 · Score: 2, Insightful

A quick look into her lectures/talks in the past:
June 2006 Microsoft Research, Mountain View, CA. Host: Chandu Thekkath. "Understanding failure at scale".
It's okay man.. She will understand..
Re:Dr. Schroeder is pretty hot, too! by inviolet · 2007-02-20 14:38 · Score: 1, Offtopic

Except she requires a MTBF of more than 3 seconds. Sorry dude.
You call that failure?!? I'd call it success.

MTBF, in this case, means Mean Time Between Farkings. So yeah, three seconds is an astoundingly short refractory period. :)

--
FATMOUSE + YOU = FATMOUSE
Re:Dr. Schroeder is pretty hot, too! by Anonymous Coward · 2007-02-20 15:04 · Score: 0

You know, if I were a hot woman research scientist, I'd be pretty pissed off at how everybody always commented on my looks. On the other hand, if I were a hot woman research scientist, I'd probably be unemployed, what with spending all day at home in front of the mirror.
Re:Dr. Schroeder is pretty hot, too! by mgabrys_sf · 2007-02-20 15:09 · Score: 1

I'm surprised that didn't make it into the summary. Let's rewrite the headline please:

"Hot babe researcher has this to say about hard drives - oh momma!"

I think we'd see a bit more comments in the thread don't you think? MARKETING - you need to think about MARKETING!
Re:Dr. Schroeder is pretty hot, too! by IdolizingStewie · 2007-02-20 15:36 · Score: 1

Except she requires a MTBF of more than 3 seconds. Sorry dude.

You call that failure?!? I'd call it success. If you only last 3 seconds I guarantee it's a failure for her.
Re:Dr. Schroeder is pretty hot, too! by SirSlud · 2007-02-20 16:21 · Score: 1

You're forgetting that it's a performance measure that is also important for motivating repeat business. In this case, selling the 'drive' once is not exactly a sustainable success.

--
"Old man yells at systemd"
Re:Dr. Schroeder is pretty hot, too! by mabhatter654 · 2007-02-20 16:44 · Score: 1

not to mention impossible.
Everybody keeps beating on her for contradicting the "scientific numbers" used for advertising, but that's not what her study was about! Like the Google study, this is about the actual lifespan of drives in real installs. Sure, you could get better performance, longer life... but that's not what it's about. It's about providing meaningful numbers so that network engineers can plan their installs of disks according to when the drives will fail. We all know drives fail...it's not about pointing fingers, but rather giving engineers numbers to plan PREVENTIVE maintenance rather than REACTIVE maintenance. In the ideal world, you'd replace the drive in rotation a few days before it failed.. because you know it WILL fail. Uptime is related more to how well you maintain the computers than to how long the manufacturer says the drives will last. As installs get bigger and the numbers better, the people with little systems can benefit from these reports as many longstanding ideas are challenged.
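
For anyone who wants to turn that into planning numbers, here is a rough sketch in Python. It converts a datasheet MTBF into the annual failure rate it implies and compares the spares budget with one based on a field log. The constant failure rate assumption is the very thing this thread is arguing about, and every input below (the 1,000,000 hour MTBF, the 1000-drive fleet, the 30 failures) is a made-up placeholder, not a figure from the study.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours

def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate implied by an MTBF, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

def observed_afr(failures, drive_years):
    """Failure rate seen in the field: failures per drive-year of service."""
    return failures / drive_years

def expected_replacements(drive_count, afr):
    """Spares to budget for one year across the whole install."""
    return drive_count * afr

# Hypothetical inputs, not figures from the study.
datasheet_mtbf_hours = 1_000_000
fleet_size = 1000

datasheet_afr = afr_from_mtbf(datasheet_mtbf_hours)
print(f"datasheet AFR: {datasheet_afr:.2%}")
print(f"spares per year at datasheet rate: {expected_replacements(fleet_size, datasheet_afr):.1f}")

# Hypothetical field log: 30 failures over 1000 drive-years.
field_afr = observed_afr(30, 1000)
print(f"spares per year at field rate: {expected_replacements(fleet_size, field_afr):.1f}")
```

If your own replacement log gives a noticeably higher rate than the datasheet conversion, plan spares from the log.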
Re:Dr. Schroeder is pretty hot, too! by exultavit · 2007-02-20 18:23 · Score: 1

Best. Comment. Evar.
Re:Dr. Schroeder is pretty hot, too! by MasterGwaha · 2007-02-21 02:12 · Score: 1

I would love to give her my 3.5 in Floppy. There, fixed it for you.

Amazing! by Dr.+Eggman · 2007-02-20 13:52 · Score: 2, Insightful

You mean to tell me these people have found hard drives that don't fail beyond repair by the end of the first year? I've never encountered a HD that has done this, much to the despair of my wallet. Now, I am serious, what is wrong with the harddrives I choose that kills them so quickly? Is Western Digital no longer a good manufacturer? Should I maybe not run a virus check nightly and a disk defrag weekly? Is 6.5GB of virtual memory too much to ask? Of course not, the manufacturers are just making crappier hds. This article has told me one thing: it's time to get a RAID setup. I've been looking at RAID 5, but two things still trouble me, the price and the performance hit. Does anyone have any information on just how much a performance hit I might experience if I have to access the HD a lot?

--
Demented But Determined.

Re:Amazing! by CastrTroy · 2007-02-20 14:00 · Score: 1

If anything, RAID should make your hard disk access a lot faster. That is, unless you go for software RAID, which will put a hit on your processor. However, I think if you're going to make the investment to go with RAID 5, then buying a proper hardware controller won't add a significant amount to the cost of your set up.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:Amazing! by Rakishi · 2007-02-20 14:08 · Score: 1

I'd be tempted to say that the problem may be partially on your end either due to having improper conditions (heat, etc.) or bad power/power supplies. Likewise if you get hard drives with a 1 or 3 year warranty then don't expect too much from them (I mean if they're dead in a year then you're not out much as the warranty should cover them... well unless you buy some dirt cheap refurbished 90-day warranty pos).

Personally I backup all my data to a server running raid 1 (hard drives are relatively cheap and raid 1 is simpler to deal with in case of failure imho) daily and plan to back up important stuff onto DVDs. If you really need the space then raid 5 is better and I'd assume that with a good controller the performance hit isn't large at all.
Re:Amazing! by Anonymous Coward · 2007-02-20 14:15 · Score: 0

The RAID level depends on what you plan on using it for. I'd never recommend RAID 5 for a database server because of the performance hit you take during writes; for something like this you're better with a RAID 0+1 setup. RAID 5 is better suited for file servers or backup volumes. I'm also seeing RAID 6 become more popular for this purpose because of the double parity, but setups like this can be expensive.
Re:Amazing! by Simon+Garlick · 2007-02-20 14:32 · Score: 1

I, on the other hand, have personally experienced one HD failure -- a Western Digital drive, as it happens -- in my LIFE.

--

-----
PGP Key ID 0xCB8FF658
Re:Amazing! by obarthelemy · 2007-02-20 14:35 · Score: 1

I have ONE failed HD currently at my place, out of about 15 of various ages (oldest one is 3 gigs), and that failed HD is my lone WD drive. So I have very strong statistical evidence that WD is crap :-p. I personally buy Seagate since none has yet failed on me... and, joking aside, that WD was my first in a long time, I bought it because it was a tad cheaper than Seagate, and it failed really quickly, inside 6 months IIRC.

--
The Cloud - because you don't care if your apps and data are up in the air.
Re:Amazing! by obarthelemy · 2007-02-20 14:38 · Score: 1

I second that, power supplies especially have a very strong impact on a PC's, and especially its HD's, life. Power supplies are the single most important component of a PC, reliability wise: crappy ones tend to fail and/or fry your components, especially if your mains is not too good.

--
The Cloud - because you don't care if your apps and data are up in the air.
Re:Amazing! by Anonymous Coward · 2007-02-20 14:43 · Score: 0

I MOSTLY agree with your point. But... make sure that your raid controller does not use proprietary methodology. Or if it does, make sure to get backup controller cards. Cause if the controller card goes, then so does all the data on the array.
Re:Amazing! by obarthelemy · 2007-02-20 14:46 · Score: 0

I'm still not sold on raid at all, especially on desktops and small servers:
- it does NOT eliminate the need for backups
- the performance gains are not noticeable except in the most extreme cases, and even then, RAID is not cost-efficient compared to RAM for caching, except special cases (RAM already maxed out, extremely heavy sequential data access...)
- RAID hardware / firmware / software is nowhere near as reliable as plain old PATA/SATA/SCSI, which is something of a scandal
- Raid is rarely cost efficient, compared to putting your dollars into more servers, more RAM

I think RAID is a way to cover one's ass, by displacing responsibility for crashes on to the raid vendor, but IRL I keep hearing horror stories.

--
The Cloud - because you don't care if your apps and data are up in the air.
Re:Amazing! by Blakey+Rat · 2007-02-20 14:58 · Score: 1

... was Western Digital EVER a good manufacturer?

Seriously. The only dead drives I've ever seen are either IBM Deathstars (known by that name so completely that I don't know what the actual brand name is... 'disk star' perhaps?) and Western Digital drives. I generally buy Seagate or Hitachi drives, and I've never had a failure. Usually I run out of space and have to upgrade before the drives die. IBM drives other than the Deathstars seem to do ok as well.

--
Comment of the year
Re:Amazing! by Planesdragon · 2007-02-20 15:13 · Score: 1

Now, I am serious, what is wrong with the harddrives I choose that kills them so quickly?

First guess? Your system has a dirty power supply. (Unless you have a high-quality PSU and have a line-noise-filtering UPS, this is entirely possible.)

This article has told me one thing: it's time to get a RAID setup. I've been looking at RAID 5, but two things still trouble me, the price and the performance hit. Does anyone have any information on just how much a performance hit I might experience if I have to access the HD a lot?

RAID 5 is not a great deal more than RAID 0 with a fancy backup. You need to get yourself a good RAID controller (in hardware), and go from there. You should be able to do classic RAID 1 (two drives only) without any perceptible performance hit with a good hardware controller.

If you're still using IDE, switching to SATA would almost certainly eat up any performance hit you would otherwise experience.

And if that doesn't work, do what Windows Vista does: get yourself a large flash drive, and use that for short-term storage.
Re:Amazing! by jafiwam · 2007-02-20 15:18 · Score: 1

"Desk Store" and "Serve Store" I believe.

I lost two raid 5 setups to those because they failed faster than we could replace them. (1 spare and several days for shipping) Out of the 7 we had in two servers, 5 of them failed so the nick name is not undeserved in my opinion.
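
A back-of-the-envelope way to see why the shipping delay matters, sketched in Python. It estimates the chance that at least one more drive in the array dies while you wait for the replacement. It assumes independent drives with a constant failure rate, which a bad batch like this clearly violates, so treat the result as a floor. The array size, failure rates and wait times are invented for illustration.

```python
import math

def p_another_failure(remaining_drives, afr, window_days):
    """Chance that at least one surviving drive fails during the replacement window.

    Assumes independent drives and a constant failure rate derived from the
    annual failure rate (afr). Both are simplifying assumptions.
    """
    rate_per_day = -math.log(1.0 - afr) / 365.0
    p_one_survives = math.exp(-rate_per_day * window_days)
    return 1.0 - p_one_survives ** remaining_drives

# Hypothetical array: 3 surviving drives, 5 days waiting on shipping.
healthy_batch = p_another_failure(3, afr=0.03, window_days=5)
bad_batch = p_another_failure(3, afr=0.50, window_days=5)
print(f"healthy batch: {healthy_batch:.3%}")
print(f"bad batch:     {bad_batch:.3%}")

# Same bad batch with a spare on the shelf (hypothetical half-day swap).
print(f"bad batch, shelf spare: {p_another_failure(3, 0.50, 0.5):.3%}")
```

The window length is the only term you control, which is the argument for keeping more than one spare on site.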
Re:Amazing! by flyingfsck · 2007-02-20 15:36 · Score: 1

Hmm - I had a HDD fail on me today, while making a backup image. It didn't like all the reading activity it seems. Sigh...

--
Excuse me, but please get off my Pennisetum Clandestinum, eh!
Re:Amazing! by Anonymous Coward · 2007-02-20 15:39 · Score: 0

I've had good luck all along (well, since about 1996) with Western Digital. My current hard drive is about... 3 years old. Granted, I pretty much run my computer 24/7 (only turning off the monitor) so that may help extend hard drive life (I remember a rule of thumb that the computer should be left on if you plan on using it in the next 24 hours... but that was a long time ago and don't know if that really holds true.) I really doubt that the defrag and virus check are doing that much harm. Your problems are likely hardware related, and there's a good chance it's not the hard-drives, but inappropriate energy.

This energy can take many forms, and a few basic purchasing and set up decisions will help you out.
vibrations and shock - mount the computer in an out of the way location that you won't keep knocking it about. Not on the floor where you keep kicking it, and not on the desk that you use as a workspace. I personally have a shelving unit set on top of an old desk that I made with my dad as a kid (Okay, more I watched and went to grab tools as he made it.) Also, ensure that you aren't getting too much case vibration from your fans. I figure louder noise = more vibrational energy, so choosing larger diameter fans is a good idea. A decent case and a good mounting job will help out a bit as well.

Heat - operating at temperatures above spec will wreak havoc with your hard-drive. Make sure your case has adequate ventilation. A halfway decent case will again help you here, along with adequate fans installed. Proper cable management will also do wonders for helping the flow-through in the case. Avoid mounting unnecessary components such as extraneous old hard drives that pump out lots of heat. Storage space is decently inexpensive, so just get a decently large hard drive and copy all the old stuff over if you're the kind of guy that has four smallish hard drives installed. Try to locate your computer out of direct sunlight, and away from heaters.

Electricity - Stray static charges, along with power spikes and valleys can really increase failure rates. The first thing is to avoid static damage. Always make sure to ground yourself to the chassis while working with components, while a wrist strap is good for this, it can get annoying. I prefer to touch the chassis with my off-hand, a wrist, elbow, or some other part of exposed skin. This may not be ideal, but it works for me. Appropriately ground all components to the case when installing. Research and find whether to connect with metallic screws or nylon standoffs. Paper insulators on metallic screws can be extremely damaging if used incorrectly, as a capacitance can build up between the screw and the item being mounted. If in doubt, it's probably better to have a decent ground to the case. Next, attempt to plug in to decent mains wiring. Don't overload a circuit with too many devices. If in doubt, a line conditioner (even the line conditioning provided by a low end UPS) can save a lot of headache. Inside the case, make sure that you are not overloading the power supply. Ensure that it is rated decently high, and of decent quality. Again, running too many internal devices (AKA multiple disk drives) will take a lot of the power in spikes, and remember that modern video cards and processors are extremely power hungry. Chances are that a power supply that came with a $25 case isn't going to cut it. Research what power demands your entire system will need, add some headroom and don't cheap out on the power supply. Again, unplug those old legacy hard drives if you don't need em anymore as they are just going to be sucking down juice. Copy the data to a new hard drive, throw them in a static bag and put them somewhere safe. Maybe consider getting an external IDE enclosure if you want to try to hook them up to get data off them every now and then.

Oh, and if you are defragging weekly, there's a chance that there is a software problem that's haunting you, but it's not in the act of defragging. Do your best
Re:Amazing! by BagOBones · 2007-02-20 15:41 · Score: 2, Insightful

Those Deathstars as I like to call them were really really bad. If you build your servers with a strong support contract from your vendor you can get really fast drive replacement times. We run completely on Dell servers with GOLD level support. I had a drive fail in my primary file server, I had a replacement drive on my reception desk in 4 hours from putting my phone down to report the problem. The controller supported background rebuilding so the users didn't even feel the loss.

If you build your own servers, you need to have more spares on hand than 1

--
EA David Gardner -"... but the consumers have proven that actually what they want is fun."
Re:Amazing! by PygmySurfer · 2007-02-20 15:46 · Score: 1

IMHO, WD is STILL a good manufacturer. I've never had a problem with them. I sold a 640 MB drive to someone a few years ago, I believe it STILL works. I personally buy WD and Seagate now (I've always stayed away from Maxtor - I hope Seagate acquiring them improves the reliability of Maxtor drives, and doesn't affect Seagate drives at all). My MacBook Pro has a Hitachi drive (perpendicular, woot!). Anecdotal evidence doesn't say much, I've read good and bad things about pretty much any hard drive manufacturer (though Seagate doesn't seem to get much bad press at all).

The IBM Deskstar drives with the abnormally high failure rates were the 75GXP series (and I believe the 60GXP series to a lesser extent). I toasted a good deal of them, fortunately IBM was pretty good about replacing them. I had mine paired (Striped) with a Promise controller of some sort, and I think many other people did as well, and that pairing may have had an impact on the longevity of the drive. I'll personally never buy a Promise controller again (nor would I use the crappy onboard RAID controllers so common nowadays).

It's kind of funny you mention Hitachi - They bought IBM's hard disk division.
Re:Amazing! by PygmySurfer · 2007-02-20 15:48 · Score: 1

Deskstar (PATA,SATA), Ultrastar (SCSI,Fibre Channel, SAS), Travelstar (Laptop PATA/SATA).
Re:Amazing! by tylernt · 2007-02-20 15:49 · Score: 1

If anything, RAID should make your hard disk access a lot faster. That is, unless you go for software RAID, which will put a hit on your processor.
I've had pretty good luck with Linux software RAID improving performance without soaking the CPU. However, I have a RAID-0 (striped, no redundancy) software array on my Windows 2000 box that uses insane amounts of kernel CPU time. While disk access *is* pretty fast (10,000rpm SCSI), any kind of heavy disk activity (reading or writing) pretty much clobbers UI responsiveness. The CPU meter in Task Manager is mostly red, so I don't know what the heck Win2k's kernel is doing, but it ain't doing it well. Hopefully this has improved in XP/2003.

--
DRM 'manages access' in the same way that a prison 'manages freedom'
Re:Amazing! by drsmithy · 2007-02-20 15:57 · Score: 1

This article has told me one thing: it's time to get a RAID setup. I've been looking at RAID 5, but two things still trouble me, the price and the performance hit. Does anyone have any information on just how much a performance hit I might experience if I have to access the HD a lot?
Ignore RAID5, go straight to RAID6 or RAID10.
With RAID10 your performance will improve. With RAID6 it will probably improve, unless your usage pattern is heavy random writes.
Re:Amazing! by drsmithy · 2007-02-20 16:00 · Score: 1

With RAID6 it will probably improve, unless your usage pattern is heavy random writes.
And even then, if your benchmark is a single drive, RAID6 will be faster.
Re:Amazing! by Rakishi · 2007-02-20 16:04 · Score: 1

The point of raid is that if you lose a hard drive you don't have to have half a day of downtime as you restore backups and then possibly deal with the data loss due to non real time backups. This is MORE important on a small server where you likely lack the staff and backups systems for a fast restore of some sort. I've had two hard drives die in an old server but thanks to Raid 5 I simply plugged in a spare with no real downtime or headaches.

Raid 1 on a desktop is sure better than spending half the day reinstalling an os, restoring data from backups and all the related headaches. It's also rather hard to fuck up a raid 1 array as each hard drive can at worst be run on its own.
Re:Amazing! by Reziac · 2007-02-20 16:04 · Score: 1

When you get a bunch of the same component failing, it's usually not that component (in your case HDs), but rather something they're hooked to that's bad, or something being done wrong, such as:

-- Bad power supply in your machine (may affect RAM and motherboard, too)
-- Lack of surge protection between your PC and the wall
-- Bad RAID controller (suspect this immediately if the symptom is garbage written to the disk)
-- FAT32 partitions larger than 32GB (can cause data loss that mimics a sick HD; it's due to a bug in FAT32)

--
~REZ~ #43301. Who'd fake being me anyway?
Re:Amazing! by Anonymous Coward · 2007-02-20 16:07 · Score: 1, Informative

"RAID should make your hard disk access a lot faster. That is, unless you go for software RAID"

This is wrong! SOFTWARE raid is faster. Why? Consider:
- The CPUs one buys are usually the latest and greatest.
- A 1.6GHz Athlon XP can process raid5 data at >3GB/s. This is significantly greater than your bus speed.
- If you're waiting on a disk read, chances are, your CPU isn't doing much anyways. (That said, you need to do very little to process a disk read. It's the disk writes that require checksumming).
- A raid controller adds an extra step into the disk->cpu latency
- A raid card microprocessor is spec'ed at whatever rate is needed to max a bus, or, often, significantly less. This means that any processing needed will incur a higher latency than if the data were processed by the CPU.

Roughly, for the hardware solution, all advantages are:
- Data can be considered flushed once it reaches the raid card, not the disk, due to battery-backed RAM (only matters for ACID databases, for systems not on UPS, without redundant power supplies)
- Batch systems may see reduced CPU use. This highly depends on the device driver being well written.
- Bus usage will be divided by 3 for small (sub ((n-1)/2)*block size, where n is the number of disks in the raid) writes, due to not having to do a read and write to update the parity.

You'll note that all of these advantages are on writes! Also, the last advantage is less important than it may seem. Very few small random IO write bound loads exist. (eg. databases will try to rearrange data to make large linear writes, requiring a bus usage of n/(n-1) in the software case)

To reiterate, usually the issue with data access isn't bandwidth, but latency. A hardware solution will not decrease this, except under specialised loads.
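
To make the parity point concrete, here is a minimal Python sketch of the XOR arithmetic behind RAID 5. It is a toy on byte strings, not how any particular controller or the Linux md driver is implemented, and the function names are made up for illustration. It shows the two write paths: a full-stripe write computes parity from data already in memory, while a small write has to read the old data and old parity first (two reads, two writes).

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def full_stripe_parity(data_blocks):
    """Large sequential write: parity comes straight from the new data, no reads."""
    return xor_blocks(data_blocks)

def small_write_parity(old_data, new_data, old_parity):
    """Small write: read old data and old parity, then fold in the change."""
    return xor_blocks([old_parity, old_data, new_data])

def rebuild_missing(surviving_blocks, parity):
    """Recover the block from a dead disk using the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Toy stripe on a hypothetical 4-disk array: 3 data blocks + 1 parity block.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = full_stripe_parity(stripe)

# Overwrite one block the cheap way and check it matches a full recompute.
new_block = b"\xaa\xbb"
updated_parity = small_write_parity(stripe[1], new_block, parity)
stripe[1] = new_block
assert updated_parity == full_stripe_parity(stripe)

# Pretend disk 0 died and rebuild its block.
assert rebuild_missing(stripe[1:], updated_parity) == stripe[0]
```

The XOR itself is trivial work for a CPU. The cost that shows up in practice is the extra reads on small writes, which is where the write-side advantages listed above come from.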
Re:Amazing! by Rakishi · 2007-02-20 16:08 · Score: 1

I've heard that Windows software raid sucks, period. Linux isn't bad and I use it myself due to being lazy and cheap (it's probably more reliable than any of those cheap raid controllers anyway).
Re:Amazing! by Dr.+Eggman · 2007-02-20 16:48 · Score: 1

Thanks for the thoughts, but I do have a surge protector (as part of an APC 1250 NS back-up battery), my hard drives are consistently formatted NTFS, my power supply is quite good (as far as my APC monitoring system tells me), and I don't have a RAID setup (yet...) Is there maybe something else that may be the problem? I don't think cooling should be a problem, either. Maybe it's just the HD line I've chosen. Anyone have any comments on the WD1600JB or WD400JB series hard drives?

--
Demented But Determined.
Re:Amazing! by Anonymous Coward · 2007-02-20 16:57 · Score: 0

Well, every single WD drive I, anonymous coward, have owned has failed on me. Every damn one of them.

I learned there toward the end: my last WD system had an 80 which backed up to another WD 80. Insurance! Sure enough, one of them failed and was replaced under warranty. Then the other one failed and was replaced under warranty. They did honor the warranty which is the only good thing I can say about WD.

Well, that and previous WD failures with data loss taught me to backup backup backup. Every system I build now has dual drives.

The dual 80 failures gave me an excuse to upgrade that system to dual 300 Maxtors. Getting ready to replace them with dual 500s right now.
Re:Amazing! by Reziac · 2007-02-20 17:12 · Score: 1

APC's doodad monitors the PC's *own* power supply too?? tho now that I think of it, there's no reason it can't read voltages from the BIOS (if the BIOS allows it), much as utils like Everest do.

What symptoms do you get that indicate a HD failure?

Have you had any other components fail in that system? What is the base system, anyway? OEM or clone? Power supplies in OEMs tend to be minimal and often of marginal quality (in my experience, the single most-likely-to-fail component in OEM boxen).

Is there any question about the BIOS's support for large HDs? Some BIOSs will appear to see a HD (can ID it for size/type) that exceeds their capacity, but don't *actually* fully support it, leading to errors once the drive gets on toward full.

As to W.D., they've been my HD of choice for over a decade, because overall they have the best reliability and longevity. And when they do fail, I've yet to see one just up and die from one minute to the next (as is the case with some other brands). At worst they give you fairly obvious notice, usually in the form of strange noises.

I haven't heard of any batch failures in the model lines you mention. Last bad batches I know of were WD's first run of 8.4GB (drives made over a 3 week span were recalled because of it), and again with the first run of 40GB (tho I've come to believe *those* were 100% due to the FAT32 bug, and not the HDs at all).

I still find it highly doubtful that the HDs are the actual point of failure; as others mentioned, and in my experience, it's far more likely that they're failing (or appearing to fail) secondary to some other problem.

Back in the Olden Days, the IDE controller card was more often at fault than the HDs, but the boot error message was often just "HD failure" regardless. Rather deceptive!

--
~REZ~ #43301. Who'd fake being me anyway?
Re:Amazing! by rs79 · 2007-02-20 17:32 · Score: 1

"If anything, RAID should make your hard disk access a lot faster."

Uh sorta. Depends on the raid type. Striped will be faster, mirrored will be about as fast, raid 5 is gonna be the slowest, even in hardware. I have numbers on my other computer of comparing the same disk operations with the same drives in various configurations.

I agree you don't need expensive hardware. 80 and 160 mb/s stuff is pretty cheap now and 9 or 10 of those is fun enough for any pc.

--
Need Mercedes parts ?
Re:Amazing! by taylortbb · 2007-02-20 17:46 · Score: 1

Those HDs should be fine, I've run all Western Digital (except one machine that included Hitachis) for the past 4 years without issues. At home I've never had a drive fail, but from where I work I conclude again Western Digital is the best. We have about 300 HDs, mostly in client machines but with 4 servers. They're about 45%/45%/10% Maxtor/WD/Seagate. I've only seen one WD fail, but 60+ Maxtors in the past four years. I've experienced 100% failure rate with Seagate, but I blame that on the machines. The 1GHz IBM NetVistas were serious lemons, 80%+ have had at least one warranty claim other than the HD.
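
Putting rough numbers on that, as a quick Python sketch using only the figures above (about 300 drives, a 45/45/10 split, 1 WD and 60+ Maxtor failures over four years). The drive counts are rounded from the percentages, 60 is treated as a lower bound, and the per-year figure is a plain average that ignores replaced drives, so it is an approximation. Seagate is left out because the failures there are blamed on the machines.

```python
FLEET_SIZE = 300   # "about 300 HDs"
YEARS = 4          # "in the past four years"

share = {"Maxtor": 0.45, "WD": 0.45}
failures = {"Maxtor": 60, "WD": 1}   # "60+" taken as at least 60

# Approximate number of drives per brand, rounded from the percentages.
drive_count = {brand: round(FLEET_SIZE * s) for brand, s in share.items()}

# Fraction of each brand's drives that failed over the whole period.
cumulative = {brand: failures[brand] / drive_count[brand] for brand in share}

# Crude per-year average; ignores that failed drives were replaced.
per_year = {brand: cumulative[brand] / YEARS for brand in share}

for brand in share:
    print(f"{brand}: {drive_count[brand]} drives, "
          f"{cumulative[brand]:.1%} failed in {YEARS} years, "
          f"roughly {per_year[brand]:.1%} per year")
```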
Re:Amazing! by Kadin2048 · 2007-02-20 17:58 · Score: 2, Interesting

Somewhere around I have an Apple 20MB hard drive that is getting on 15 years old. Sure, it hasn't seen a lot of usage recently, but I still fire it up every once in a while. (It makes the greatest turbine-like startup sound; seriously, it's like a 747.) Connects to the floppy disk controller. Has its own power supply.

I'm sure there are people around with even older, still-working-fine gear. A while back, I saw some DEC disk packs for the early removable-platter hard drives selling on eBay, as pulls-from-working equipment. I'm not sure what exactly was going through the minds of the designers when they were building stuff, a decade or two ago, but they just seemed to not be planning for obsolescence in the same way that the people churning out today's disposable gear are. (Although the sample is clearly biased: looking at the 20-year-old gear from 1986 that's still around today might make you think that everything then was bulletproof, but in reality all the crappy stuff is already 30 feet down in some landfill somewhere.)

I suspect in 20 years, people will look back at 2006 gear as the height of reliability, just because it'll only be the really exceptionally well-built pieces of gear that will still be around. The Deathstars and other crap drives that failed will long be forgotten.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Re:Amazing! by Technician · 2007-02-20 18:36 · Score: 1

I, on the other hand, have personally experienced one HD failure -- a Western Digital drive, as it happens -- in my LIFE.

I take it you are quite a bit younger than I am. My first HD failure was a 5 Meg drive with removable platters. The replacement heads were ceramic pucks about the size of US quarters. I have had a few other drive failures since then from a Fuji 30 Meg to an IBM 30 Gig.

Just for grins, what size was the WD drive that failed?

--
The truth shall set you free!
Re:Amazing! by gbjbaanb · 2007-02-20 21:08 · Score: 1

If you have any kind of fast CPU, it'll be faster than those crappy cheap raid controllers (or, the raid built into motherboard bios chips). I'm not sure about reliability (apparently linux software raid5 isn't the best) but with software raid you can get at the data even if you change motherboard. If you used bios or raid card, you'd have to stick with that - as changing would require different drivers.
Re:Amazing! by petermgreen · 2007-02-20 21:30 · Score: 2, Informative

If anything, RAID should make your hard disk access a lot faster. That is, unless you go for software RAID, which will put a hit on your processor.
afaict Linux software raid is actually pretty good nowadays at least as long as you stick to the basic raid levels

beware of the very common fake hardware (e.g. really software but with some bios and driver magic to make the array bootable and generally behave like hardware raid from the user's point of view) controllers. These often have far worse performance than linux software raid and many of them only support windows.

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
Re:Amazing! by Slashcrap · 2007-02-20 22:27 · Score: 1

Is 6.5GB of virtual memory too much to ask?

If you're really asking for 6.5GB of virtual memory you are almost certainly retarded.

Now, I am serious, what is wrong with the harddrives I choose that kills them so quickly?

See above.

I've been looking at RAID 5, but two things still trouble me, the price and the performance hit.

If you have a 6.5GB swap file your choice of RAID is probably the least of your performance worries.
Re:Amazing! by drsmithy · 2007-02-20 22:37 · Score: 2, Informative

That is, unless you go for software RAID, which will put a hit on your processor.
This myth needs to die. No remotely modern processor takes a meaningful performance hit from the processing overhead of RAID.
However, I think if you're going to make the investment to go with RAID 5, then buying a proper hardware controller won't add a significant amount to the cost of your set up.
Decent RAID5-capable controllers are hundreds of dollars. Software RAID is free and - in most cases - faster, more flexible and more reliable.
Re:Amazing! by drsmithy · 2007-02-20 22:41 · Score: 2, Informative

Uh sorta. Depends on the raid type. Striped will be faster, mirrored will be about as fast, raid 5 is gonna be the slowest, even in hardware.
Compared to a single disk, RAID5 is still going to be faster (except perhaps for the odd corner-case here and there).
Also, in many cases, software RAID5 is faster than hardware RAID5.
Re:Amazing! by 10Ghz · 2007-02-20 23:30 · Score: 2, Interesting

"If anything, RAID should make your hard disk access a lot faster. That is, unless you go for software RAID, which will put a hit on your processor."

Since we are talking about IO-bound operations, does that matter? I mean, CPU is hardly ever the bottleneck these days, the hard-drive quite often is. So even if soft-RAID puts more load on the CPU, does it cause any slowdown? Especially if it makes IO faster?

--
Lesbian Nazi Hookers Abducted by UFOs and Forced Into Weight Loss Programs - -all next week on Town Talk.
Re:Amazing! by Simon Garlick · 2007-02-20 23:40 · Score: 1

I take it you are quite a bit younger than I am.

I'm 34.

Just for grins, what size was the WD drive that failed?

3.2GB. This was back in 1997.

--

-----
PGP Key ID 0xCB8FF658
Re:Amazing! by Dr. Eggman · 2007-02-21 00:15 · Score: 1

Normally, I first notice a problem at boot up: it takes longer than normal and I hear a series of clicking noises, which usually prompts me to do an unscheduled backup. During the backup, I notice a few corrupted files. Not long after this starts, usually a week or two, a few corrupted files have become a lot more and eventually it refuses to boot the OS. Sometimes I can repair it from the boot CD, but it doesn't usually last long. I've stopped bringing them into the shop; I became tired of paying $15 to hear my data was unrecoverable, so I can't tell you for certain if the last couple had really died completely. From the volume of responses I've had here, it sounds like WD is still a good manufacturer, so either I've got bad luck or some other component is failing, though I can't recall anything else I've had problems with.

--
Demented But Determined.
Re:Amazing! by Avatar8 · 2007-02-21 04:12 · Score: 1

As others have stated, it definitely sounds like you may have other factors causing your HD failure.

If you do lots of read/writes, in theory you should see better performance from RAID 5; more spindles = more reads (as long as the data is distributed across multiple disks).
I'd suggest something like http://www.intel.com/design/servers/storage/ss4000-E/index.htm or http://www.enhance-tech.com/products/desktop_array/desktoparray_Index.htm to solve several of your issues depending on how you implement it.
If you set it up as its own storage/backup server, then you're looking at network speed as your bottleneck. If you install it as a cage within your system (not sure if these models can), you're at least providing RAID 1, 5 or 10 redundancy, but you'll still have other factors (P/S, environment) impacting the drives.
Personally, if I had the money I'd have an iSCSI array like http://www.equallogic.com/products/view.aspx?id=46
Anyone want to float me $20k. :-)
Re:Amazing! by Technician · 2007-02-21 05:25 · Score: 1

I'm 34.

I've got almost a 20 year headstart.

Just for grins, what size was the WD drive that failed?

3.2GB. This was back in 1997.

My 30 Gig failed in 2003. The 5 Meg drive failed in 1978. The 5 Meg drive cost more than your computer system unless you have a real spendy system.

http://www.pdp8.net/rk05/rk05.shtml?med

--
The truth shall set you free!
Re:Amazing! by Reziac · 2007-02-21 05:51 · Score: 1

Unusual clicks at bootup are indeed a sign of drive about to go tits-up, BUT... it's usually an *old-age* thing, and I suspect has to do with the drive's power circuits being chronically stressed. Three times now, I've taken such a "failing" older drive and put it on another machine, and it stopped having the problem. (In one case, the HD's power port had been knocked loose, making the cause kinda obvious.)

Personally, I'd swap out the system's power supply and see if that helps. At the very least, check all the power leads with a voltmeter, and specifically watch for spikes or sags at power-up. From all you've said, the PSU is almost certainly the root cause. -- When you buy a new PSU, ignore reviews, brands, and hype, and pick one that feels relatively heavy and has lots of thick leads. That's the easiest way to get a good one.

I've had a few bad HDs (buy enough HDs and you'll invariably get some duds), but overall my experience with W.D. has been excellent. I've found that if a W.D. drive is going to die young, it does so within the first couple months. Otherwise, it can be counted on for a bit over 5 years of quality service; if it makes it to age 6, it will probably keep running indefinitely (hence the vast majority of old still-working HDs that I see are W.D.) And when they're sick, they give you plenty of warning. I've never seen one "just die".

Seagates seem to do about the same, but for the same class of HD are noticeably slower, and run relatively hot and noisy, so I don't buy 'em. They are somewhat more likely to die after age 5 than are W.D.

Maxtors are fast and usually quiet, but have a lot of sudden-death at 2 to 3 years, and they give absolutely no warning -- they just quit. But there again, if they make it past that threshold, they're likely to keep on running indefinitely. However, I see very few old still-working Maxtors.

These electrons have been brought to you by a 6YO W.D. HD and a 13YO PSU :)

--
~REZ~ #43301. Who'd fake being me anyway?
Re:Amazing! by baggins2001 · 2007-02-21 06:26 · Score: 1

I've got a stack of WD hard drives. About 2 years ago we went through the failed hard drives and found that of 10, 8 were WD. The other 2 were Fujitsus.
Just last week I had a drive failure and thought this would be my second Seagate to crash. Pulled it out, lo and behold it was a 300GB WD. I don't know how it got in there.
Basically quality appears to swing around. We went through a rash of failures on Fujitsus about 4 years ago. It got so bad that we actually went around and replaced the hard drives in computers that were running Fujitsus.

--
He who said 1,000,000 monkeys on 1,000,000 typewriters would eventually type the great novel, never saw an AOL chat room
Re:Amazing! by raddan · 2007-02-21 06:33 · Score: 1

Think about environmental conditions. How hot is it where you store your computer? Do you dump water on it yearly?
Re:Amazing! by glsunder · 2007-02-21 06:45 · Score: 1

While RAID helps lessen the chance for data loss, it's mainly there to eliminate downtime when a drive fails. A few hours of downtime due to a lost drive can easily eat up the extra $300 or so that a 2nd drive costs.

Even on small servers with non-hot-swappable SATA drives, you can use removable bays to make replacing a dead SATA drive easy, and almost as quick as a plain old reboot.
Re:Amazing! by nmos · 2007-02-21 06:55 · Score: 1

If your drives are consistently failing in less than a year then there is something else at work beyond just your choice of HD brand. Maybe you have a bad PSU or very poor cooling around the drives etc.
Re:Amazing! by LunaticTippy · 2007-02-21 06:58 · Score: 1

Good advice. If you have access to a scope, see if you have voltage ripple. If you're stressing your power supply you can get pretty big ripples. It's fun to take a marginal power supply and watch what happens as you connect more and more drain to it.

--
Man, you really need that seminar!
Re:Amazing! by crabpeople · 2007-02-21 07:04 · Score: 1

"my power supply is quite good (as far as my APC monitoring system tells me"

I am not sure how a UPS would know if you had a bad power supply, but the best way to check is to crack that sucker open and look for blown, oozing caps. Power supplies can work seemingly flawlessly and with no symptoms and still be bad. Although you do have a symptom: dying drives. A good power supply usually costs more than a hundred bucks, and should be made by a company that you can recognize (their homepage isn't all in Chinese would be a good start). There are SO MANY bad power supplies out there. I won't even buy power supplies anymore that cost less than 50 bucks. Usually, if it comes in a cardboard box (as in non-OEM), it's an OK unit. Those $29 400 watt PSUs just scream "rape my system".

--
I'll just use my special getting high powers one more time...
Re:Amazing! by Reziac · 2007-02-21 08:27 · Score: 1

Sounds like a useful approach, especially with those OEM machines and their uniformly barely-adequate PSUs. -- Personally, I think that's why I see such a high death rate in OEM mainboards and PSUs, whereas in clones, they hardly ever die.

--
~REZ~ #43301. Who'd fake being me anyway?
Re:Amazing! by Dr. Eggman · 2007-02-21 08:49 · Score: 1

I am not sure how a UPS would know if you had a bad power supply...

My mistake, I should have worded it better. What I meant was that as far as I can tell from my APC unit, there is nothing wrong with my PSU. By that I mean the APC battery backup works, as any other UPS, as a hub for multiple electronics, and the only noticeable power interruptions that have occurred to my system occurred to all other devices powered directly through the UPS, in which case the battery backup kicks in and, if necessary, the UPS manages a system shutdown before the battery runs out.

Perhaps, however, I don't have the right idea of a bad power supply. If the symptoms of a bad PSU can be solely that HDs are dying young but nothing else is affected, then I would agree with you. (That's a question: can that be the sole symptom? I don't know.) I used to have a bad PSU: I would start playing a graphics intensive game and the GPU would start drawing a lot of power, until the PSU dipped out and the whole system reset. Eventually, that began happening outside of just games and I had the power supply replaced (about 4 years ago). That's part of the reason I don't think it's a PSU problem: the replacement was a good hefty 550 watt unit and by no measure cheap. I can't recall the manufacturer or the price atm, but I do remember it's still under warranty. A warranty that would be voided if I simply opened the PSU up.

--
Demented But Determined.
Re:Amazing! by LunaticTippy · 2007-02-21 08:50 · Score: 1

I've had good luck with OEM power supplies as long as you don't add any extra power use. If you add a video card, 2nd HD, and run a lot of hungry USB devices that 250W dude won't last long.

Odd timing, my HP power supply died yesterday after 3 years of always-on. It had a video card, 2nd HD and a lot of USB devices. I fully intended to upgrade it, but my other computers took priority and I forgot. 3 years is pretty good considering it was used brutally.

At work, our corporate-dictated Dells soldier on with surprising durability. We've got 30 machines up to 7 years old and as long as I replace failing fans and drives have very little trouble. Some of them are tough to get around inside, but the newest ones we've got are much better at that.

--
Man, you really need that seminar!
Re:Amazing! by DRAGONWEEZEL · 2007-02-22 11:43 · Score: 1

I had not 1 but 2 seagates fail on me w/ in their first year.

--
How much is your data worth? Back it up now.
Re:Amazing! by DRAGONWEEZEL · 2007-02-22 11:51 · Score: 1

I had the same install of xp on a mb raid 0 install for 2 years.

The only problem I ever had with the HDD controller was when I left my pc plugged in during a large windstorm that flipped power on / off / out for a weekend. I was away from home and my PC was set to Boot on AC restore.

After reinst. It's been silky smooth again.

--
How much is your data worth? Back it up now.

Question by defiant1 · 2007-02-20 13:52 · Score: 1

Didn't I read about this on Slashdot a few days ago or did some drives fail and the story was lost?? Must have been a drive failure cause it's unlike Slashdot to have dups :)

Exactly! by plasmacutter · 2007-02-20 13:54 · Score: 1

And here I knew disks stored data on a set of rotating platters, I guess it's really stored on the alien spacecraft hiding behind Hale-Bopp!

--
VLC FOR MAC IS DYING! IF YOU DEVELOP, PLEASE SAVE IT!!

infant mortality by Anonymous Coward · 2007-02-20 13:57 · Score: 5, Insightful

I suspect that the 'infant mortality' syndrome really has to do with the drives being abused before they are installed in the machines (getting dropped during shipping for example)

The large shops that these studies are looking at get the drives in bulk directly from the manufacturer. The rest of us, who have to go through several middle-men before we get our drives, have more of a chance that something happened to them before we received them.

David Lang

Re:infant mortality by mabhatter654 · 2007-02-20 16:55 · Score: 2, Insightful

I think the myth of infant mortality is that if the drive works in the first week/month, it will work perfectly until the warranty/magic dust wears off, and you don't have to worry about reliability until then. What they saw in the real world was that some drives had consistently reduced performance and lifespan right from the start. You can't operate on the assumption that "I replaced 5 drives, so I'm good for 3 years" and not keep spares or backups ready. The Google report takes this another step, because they were interested in what the drives were reporting for health: is the drive's internal software giving good reliability numbers? In Google's case the drives weren't, and work needs to be done.
The whole idea is to move away from the main myth that drives never fail unless they're junk, toward the idea that drives DO and WILL fail because they are mechanical parts. Engineers aren't interested in blaming the manufacturer for imperfect parts, but in doing THEIR job of keeping the data and the network going.

Re:MTBF? RTFA. by Vellmont · 2007-02-20 13:58 · Score: 4, Informative

You might get an MTBF of say, two years, when the reality is that the distribution has a big spike at one month, and the rest of the failures forming a wide bell curve centered at say, five years.

Well, the article actually says that drives don't have a spike of failures at the beginning. It also says failure rates increase with time. So you're right that MTBF shouldn't be taken for a single drive, since the failure rate at 5 years is going to be much higher than at one.

The other thing that the article claims is that the stated MTBF is simply just wrong. It mentioned a stated MTBF of 1,000,000 hours, and an observed MTBF of 300,000 hours. That's pretty bad. It's also quite interesting that the "enterprise" level drives aren't any better than the consumer level drives.
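
To put those two MTBF figures in more familiar terms, here is a rough sketch that converts an MTBF into the fraction of drives expected to fail per year. It ASSUMES a constant failure rate (exponentially distributed lifetimes), which is exactly the assumption the article's data calls into question, so treat the output as a back-of-envelope comparison only. The function name is made up for this example.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours of continuous operation

def annualized_failure_rate(mtbf_hours):
    """Probability that a drive fails within one year of continuous use.

    ASSUMPTION: constant failure rate, i.e. exponential lifetimes with
    mean equal to the MTBF. Real drives need not behave this way.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

stated = annualized_failure_rate(1_000_000)   # datasheet MTBF quoted above
observed = annualized_failure_rate(300_000)   # observed MTBF quoted above

print(f"stated MTBF   -> {stated:.2%} of drives per year")
print(f"observed MTBF -> {observed:.2%} of drives per year")
```

Under that assumption the stated figure works out to a bit under 1% of drives per year and the observed one to nearly 3%, which is the gap the article is complaining about.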

--
AccountKiller

Desktop vs Server usage. by DigiShaman · 2007-02-20 13:58 · Score: 3, Insightful

Key observations from Dr. Schroeder's research:
High-end "enterprise" drives versus "consumer" drives?

"Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors."

Maybe consumer stuff gets kicked around more. Who knows?

Or maybe powering up the drives off and on is more stressful to the components; say in a desktop environment. With servers racked up, the drives are always spinning with near constant thermal conditions.

--
Life is not for the lazy.

Re:Desktop vs Server usage. by Anonymous Coward · 2007-02-20 14:02 · Score: 1, Interesting

Also, residential power is less clean than datacenter power. Bad power can take out the drive electronics.
Re:Desktop vs Server usage. by complete loony · 2007-02-20 14:10 · Score: 1

So... Server grade HD's have a longer average life simply because more of them are installed in servers?

--
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
Re:Desktop vs Server usage. by anagama · 2007-02-20 14:15 · Score: 1

That's reasonable. At my office in a building built in 1912, I had a computer with a power supply burn out in less than one year (decent $100 Antec case/supply). It was in a room where the lights flickered whenever the fax or printer powered up. Anyway, after I replaced the power supply, I put it on a UPS that protects against brownouts. I would imagine that bad electricity could easily be the culprit for a lot of failures.

--
What changed under Obama? Nothing Good
Re:Desktop vs Server usage. by pla · 2007-02-20 14:17 · Score: 1

Or maybe powering up the drives off and on is more stressful to the components

You just posed the one question to which I'd actually have liked to know the answer... Turn it on and off as needed (minimize runtime), or leave it on all the time if you'll use it at least a few times per day (minimize power cycling).

I know that counts as something of a religious issue among geeks, but I'd still have liked a good solid answer on it... It even has implications for whether or not we should let our non-laptops spin drives down when idle.

Oh well, better luck next study (or I can find my own collection of 100k drives to test, I suppose).
Re:Desktop vs Server usage. by Lumpy · 2007-02-20 14:38 · Score: 4, Interesting

Or she forgot to put in the part that enterprise drives are replaced on a schedule BEFORE they fail. At Comcast I used to have 30-some servers with 25-50 drives each scattered about the state. Every hard drive was replaced every 3 years to avoid failures. These servers (TV ad insertion servers) made us between $4,500 and $13,000 a minute while they were in operation, in spurts of 15 minutes down and 3-5 minutes inserting ads. Downtime was not acceptable, so we replaced them on a regular basis.

Most enterprise level operations that rely on their data replace drives before they fail. In fact the replacement rate was increased to every 2 years, not for failure prevention but for capacity increases.

--
Do not look at laser with remaining good eye.
Re:Desktop vs Server usage. by Cthefuture · 2007-02-20 14:44 · Score: 1

Do people actually shut their desktops off?

The concept is bizarre to me. I haven't shut my desktop off on a daily basis in probably 15 years (or about as long as I've been running Linux as my desktop).

This has nothing to do with the OS though. I don't power cycle any of my important electronics more than needed because I do believe it stresses them. My (PC) computers have always run 24/7 unless there is an electrical storm passing over or I don't have power.

The last time I power cycled on a daily basis was back when I had "console" type computers (C64, Amiga, etc.). Even then they often ran 24/7 serving BBS duty.

--
The ratio of people to cake is too big
Re:Desktop vs Server usage. by markov_chain · 2007-02-20 15:02 · Score: 2, Informative

I never had a hard drive fail. I buy one more new one a year, and drop the smallest one. I run 4 at a time in a beige box PC. They are a mix of all sorts of manufacturers (usually from a CompUSA sale for less than $0.30/GB).

- I never turn off the PC.
- The case has no cover.

--
Tsunami -- You can't bring a good wave down!
Re:Desktop vs Server usage. by StillAnonymous · 2007-02-20 15:25 · Score: 1

I too, have never had a hard drive fail on me. I always leave my system running and I use a UPS. When I hear the drive start to make that tell-tale sound of a bearing approaching failure, I buy a new one and replace it before it dies.
Re:Desktop vs Server usage. by Reziac · 2007-02-20 15:53 · Score: 2, Interesting

Well, I can connect my own anecdotes ;) Once they're fully set up, my everyday machines are never powered down again (except to upgrade the hardware), nor do the HDs spin down. They are also on good quality power supply units, AND are protected by a good UPS, AND have good cooling. Those 3 points can make all the difference in the world to their longevity, regardless of use patterns.

Right now my everyday HDs number thus:

6.4GB W.D. -- new in 1998, has always run 24/7. No SMART but probably has upward of 70,000 hours uptime. (Its identical twin failed about a year ago, but it had always clanked louder while doing thermal recalibration. This one is still quiet.)

8.4GB W.D. -- new in 1998, used about 12hrs/day thru 2002, offline 2002-2006, running 24/7 for the past year. No SMART but probably has about 25,000 hours uptime.

45GB W.D. -- SMART data: 42093 hours uptime, 181 power cycles (mainly as hard resets).

40GB W.D. -- SMART data: 3919 hours uptime, 197 power cycles. (Dated 2002; found in trash in 2006)

60GB W.D. -- SMART data: 28056 hours uptime, 100 power cycles (mainly as hard resets)

Running 24/7 pretty much eliminates thermal stress and the "what do you mean you're not powering up today?!!" that happens sometimes with older HDs.

Other points of conventional wisdom about running fulltime:
1) "It causes more bearing wear." I wonder if that's so -- might the lubricant stay better distributed when it never chills down and never gets a chance to settle and congeal??
2) "It's more likely to stiction if it does sit til it's cold." In my experience it's the opposite -- the HD with only intermittent use is far more likely to stiction, and sometimes can be cured permanently by letting 'em run for a few days solid.

One of the points in TFA was that over 40% of RMA'd HDs proved to have nothing wrong with them. This is in line with my own observations (in fact, closer to 100% in SOHO/home-user environments) -- many supposed HD failures are actually user or software errors, not the hardware at all.

I don't know that this is at all helpful :) But my recommendation to my clients is that if they don't want to run 24/7, they should not power the machine on and off more than once a day.

--
~REZ~ #43301. Who'd fake being me anyway?
Re:Desktop vs Server usage. by ColaMan · 2007-02-20 15:55 · Score: 1

Some stats from my MythTV box. Note the poweron hours and the number of powercycles.

This is in a box that is pretty much on 24/7, unventilated except for the power supply fan.

My original drive, 5 years old now:

smartctl version 5.33 [i686-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model:     WDC WD800JB-00ETA0
Serial Number:    WD-WMAHL3065403
Firmware Version: 77.07W77
User Capacity:    80,026,361,856 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Feb 21 13:45:40 2007 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   089   089   021    Pre-fail  Always       -       2050
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       333
  5 Reallocated_Sector_Ct   0x0033   181   181   140    Pre-fail  Always       -       295
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   060   060   000    Old_age   Always       -       29205
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       330
194 Temperature_Celsius     0x0022   098   253   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   137   137   000    Old_age   Always       -       63
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   085   051    Pre-fail  Offline      -       0

My "New" drive, 3 years old now:

smartctl version 5.33 [i686-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2000JB-00GVA0
Serial Number:    WD-WMAL71580378
Firmware Version: 08.02D08
User Capacity:    200,049,647,616 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Feb 21 13:46:06 2007 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   123   122   021    Pre-fail  Always       -       6391
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       143
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       20616
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       142
194 Temperature_Celsius     0x0022   105   087   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pe
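
For anyone who wants to pull the power-on hours and power-cycle counts out of a dump like the ones above, here is a small sketch that parses the attribute rows with a regular expression. The sample rows are copied from the first drive's output; the function name and the regex are assumptions for this example, not part of smartmontools, and rows that are cut off (like the last one above) are simply skipped.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value. Rows missing any of these fields will not match.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def parse_attributes(text):
    """Return {attribute_name: raw_value} for every complete attribute row."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(10))
    return attrs

# Rows copied from the 80GB drive's dump, plus one truncated row.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   181   181   140    Pre-fail  Always       -       295
  9 Power_On_Hours          0x0032   060   060   000    Old_age   Always       -       29205
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       330
197 Current_Pe
"""

attrs = parse_attributes(SAMPLE)
years_on = attrs["Power_On_Hours"] / (24 * 365)
hours_per_cycle = attrs["Power_On_Hours"] / attrs["Power_Cycle_Count"]

print(f"{years_on:.2f} years powered on, "
      f"{hours_per_cycle:.1f} hours per power cycle, "
      f"{attrs['Reallocated_Sector_Ct']} reallocated sectors")
```

The raw value column is vendor-specific for some attributes, so the integer parse here is only safe for simple counters like the three shown.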

--

You are in a twisty maze of processor lines, all alike.
There is a lot of hype here.
Re:Desktop vs Server usage. by PygmySurfer · 2007-02-20 15:58 · Score: 1

I can't give you anything more than anecdotal evidence, but I've been involved with a few data centre power downs, and we'd typically see quite a few drive failures when powering the servers back up. If we'd not powered them down, they likely would've kept on churning away.

There are other factors to consider for home use though: what's the cost of keeping the machine powered on all year, versus shutting it down when not in use? If it's more than the cost of a new drive, you're probably better off shutting the box down/suspending it.
Re:Desktop vs Server usage. by MadMorf · 2007-02-20 16:01 · Score: 5, Informative

Most enterprise level operations that rely on their data replace drives before they fail.

You worked at an unusual place!

I'm a Tech Support Engineer for a large storage system manufacturer and I can tell you that NONE of our customers replace disks before they fail unless our OS detects a "predictive failure" for the disk. Our customers are some of the biggest names in business from all over the planet.

--
Goofy, Geeky Gifts and More!
Re:Desktop vs Server usage. by whoever57 · 2007-02-20 16:07 · Score: 1

You had better prepare to replace that 5 year old drive soon -- the non-zero reallocated sector count is a good predictor of failure according to Google.

--
The real "Libtards" are the Libertarians!
Re:Desktop vs Server usage. by mabhatter654 · 2007-02-20 17:04 · Score: 1

This is about optimising your data center spending on disk. What they're saying is that across 100,000 drives of a bunch of types installed in data centers, whether it was an expensive FC drive or a desktop SATA didn't seem to affect the overall life as much as other things. Spending big money on expensive FC + RAID solutions doesn't gain you anything over the same solution with cheaper drives. It's all about the management of the failures. You can't believe people willing to take a lot of money and then ignore the reliability.
That makes the most interesting solution Serial Attached SCSI: the ability to mix fast or big drives as your application demands, and optimise for cost. Make up any perceived "reliability" loss with engineering and planning and consistency. That's what IT lacks. We (IT) have to get away from "magic bullets" and move to real engineering/accountability methods.
Re:Desktop vs Server usage. by ColaMan · 2007-02-20 17:38 · Score: 1

It's been like that for a few years now - it hasn't increased any. I suspect that it took a couple of knocks while spinning early in its life - the kids used to kick it a lot.

--

You are in a twisty maze of processor lines, all alike.
There is a lot of hype here.
Re:Desktop vs Server usage. by Anonymous Coward · 2007-02-20 17:45 · Score: 0

That's why we use RAID, we expect disks to blow and when they do, it's whoopdy doo!
Re:Desktop vs Server usage. by yoprst · 2007-02-20 18:05 · Score: 3, Interesting

It's broadcasting, dude! No downtime is allowed. Here in Soviet Russia we (broadcasters) do exactly the same, except that we prefer 2-year period.
Re:Desktop vs Server usage. by Anonymous Coward · 2007-02-20 18:08 · Score: 0

Our customers are some of the biggest names in business from all over the planet.
Wow, you work for Scientology?!
Re:Desktop vs Server usage. by Technician · 2007-02-20 18:49 · Score: 0, Offtopic

In fact the replacement rate was increased to every 2 years not for failure prevention but for capacity increases.

I thought cable TV was getting way too many commercials.. How about increasing the programming instead. Disclaimer.. No longer a pay TV consumer.

--
The truth shall set you free!
Re:Desktop vs Server usage. by bartwol · 2007-02-20 19:34 · Score: 1

Scheduled replacement sounds dubious and expensive. Regardless of scheduled replacement, random failures will occur. Assuming uptime is critical, contingency measures have to be in place and automatically engaged if downtime is to be avoided. If uptime isn't critical, then you handle your outages as they occur (according to less stringent service level constraints).

The increased cost of disk drives (~2x due to early retirement?) and maintenance labor (drive service every 3 years instead of ~6?) seems to only be offset by a shift of some percentage of disk repairs from unscheduled to scheduled work. That might make sense if you have to dispatch off-site people to effect repairs in very remote locations, but in a data center or large office environment, the labor is usually readily available without significant additional travel/scheduling costs.

I'd be interested in hearing a full justification of routine disk replacement.
Re:Desktop vs Server usage. by rossifer · 2007-02-20 20:20 · Score: 1

Do people actually shut their desktops off?
Well, actually, we put our computers on standby on a daily basis. Completely powering down is unusual. Not sure if you're separating these two cases.

Saves a decent amount on power to use standby. We have three laptops and the home theater desktop. The two workstation laptops and the HTPC we standby whenever they're not in active use. The older laptop is a subversion/trac dev server and is always going though it's a fairly low draw with the screen closed and the CPU throttled back (15-20W).

When we first set up the HTPC, it was just running all the time. We paid $18 more for power that month. Pretty noticeable since our typical electric bill is $35/month. Turns out that the HTPC draws between 115W and 140W while idling and only 5W in S3 standby (running folding@home, the HTPC consumes 230W and would have nearly doubled our monthly power bill). Since that little hiccup, we've become religious about putting machines on standby and powering the rest of our entertainment center down with an external switch (power strips).
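
As a sanity check on those numbers, here is a quick sketch of the arithmetic. It ASSUMES a 30-day month, a machine that idles around the clock, and that the whole $18 difference came from the HTPC; the electricity rate it prints is implied by those assumptions, not something stated in the post.

```python
def monthly_kwh(watts, hours_per_day=24.0, days=30):
    """Energy used in one (assumed 30-day) month at a constant draw."""
    return watts * hours_per_day * days / 1000.0

idle_low = monthly_kwh(115)    # HTPC idling, low end of the quoted range
idle_high = monthly_kwh(140)   # HTPC idling, high end of the quoted range
standby = monthly_kwh(5)       # HTPC in S3 standby

# If the extra $18 bought somewhere between idle_low and idle_high kWh,
# the implied price per kWh falls in this range.
implied_rate_low = 18.0 / idle_high
implied_rate_high = 18.0 / idle_low

print(f"idle:    {idle_low:.1f} to {idle_high:.1f} kWh/month")
print(f"standby: {standby:.1f} kWh/month")
print(f"implied rate: ${implied_rate_low:.2f} to ${implied_rate_high:.2f} per kWh")
```

Swapping in your own wattage and local rate gives the yearly cost to weigh against the price of a replacement drive.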

Regards,
Ross

P.S. Living in a house with gas heating and no A/C helps a lot with the power bill. What with both of us commuting on motorcycles (me $4/week on gas, her $2.50/week), we're pretty miserly, energy-wise. We're working on the food distribution costs and product packaging/garbage slowly.
Re:Desktop vs Server usage. by the_womble · 2007-02-20 21:25 · Score: 2, Informative

There are some good reasons to shut down:

1) Electricity consumption
2) Power cuts (unless you have a UPS and software for a clean shutdown installed, what happens if there is a power cut while you are away?).
3) Power fluctuations (my power supply blew dramatically after one a few months ago) and lightning.
4) Heat (in a hot climate)
Re:Desktop vs Server usage. by petermgreen · 2007-02-20 21:44 · Score: 1

Remember though that the cost of a hard drive failure can be considerably more than the cost of the replacement drive.

First you have the cost of the actual replacement, easy enough if you can do it yourself but if you have to pay someone to do it for you....

Then you have the value of the time spent getting your system back into a usable state. If you have a RAID setup this is small, but if you are relying on data-only backups or you have problems with your full backup it could take MUCH longer.

Then you have the value to you of any data that is lost forever.

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register

Re:moving parts - Don't always wear out by NFN_NLN · 2007-02-20 14:01 · Score: 1

Machines built at the molecular level can't wear out. There's either enough energy to break the bond at the molecular level or there's not. Just run it within spec. and it'll never break.

Cyrus IMAP by More+Trouble · 2007-02-20 14:05 · Score: 2, Interesting

From StorageMojo's article: Further, these results validate the Google File System's central redundancy concept: forget RAID, just replicate the data three times. If I'm an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive.

For best-of-breed open source IMAP, that means Cyrus IMAP replication.
:w

Re:Cyrus IMAP by Bronster · 2007-02-20 15:46 · Score: 1

*cough*

Eventually - but we're not turning our automated replication checks off just yet...

Amazing the little corner cases that can go in and corrupt your data for you.

seems like she could make her own job... by Anonymous Coward · 2007-02-20 14:06 · Score: 0, Offtopic

as head of an independent testing lab. That would probably be a heckuva lot more interesting, and lucrative, than some random gig with Google, IBM, or MS Research.

Every single solid state drive will fail too... by EmbeddedJanitor · 2007-02-20 14:07 · Score: 2, Informative

It is just a matter of time. Depending on the technology (eg. flash) it might be a short to medium time or a long time.

If something has an MTBF of 1 million hours (that's 114 years or so), then you'll be a long time dead before it fails.

At this stage, the only reasonable non-volatile solid state alternative is NAND flash which costs approx 2 cents per MByte ($20/Gbyte) and dropping. NAND flash has far slower transfer speeds than HDD, but is far smaller, uses less power and is mechanically robust. NAND flash typically has a lifetime of 100k erasure cycles and needs special file systems to get robustness and long life.

--
Engineering is the art of compromise.

Re:Every single solid state drive will fail too... by Anonymous Coward · 2007-02-20 14:37 · Score: 0

You fail at understanding what MTBF is.

(hint: MTBF of 1 million hours does not mean the average drive will last 1 million hours)
Re:Every single solid state drive will fail too... by Detritus · 2007-02-20 16:09 · Score: 2, Informative

MTBF tells you the failure rate over the item's service lifetime, which for hard disks, is commonly five years.
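
To make that concrete, here is a minimal sketch of turning a datasheet MTBF into a yearly failure rate. It assumes a constant failure rate over the service life (the exponential model), which is an assumption of this example and not something the study endorses; the function name is made up for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760


def annualized_failure_rate(mtbf_hours):
    """Chance that one drive fails within a year of service.

    Assumption: constant failure rate (exponential model) during the
    service life, which is the model a datasheet MTBF implies.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


afr = annualized_failure_rate(1_000_000)
print(f"AFR: {afr:.2%}")
print(f"Expected failures per year in a 1000-drive fleet: {1000 * afr:.1f}")
```

Under that model a 1 million hour MTBF works out to a bit under 1% of drives failing per year, so the figure describes a fleet, not how long any single drive will last.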

--
Mea navis aericumbens anguillis abundat

No "infant mortality" effect? by pla · 2007-02-20 14:10 · Score: 0

Everything else in there, I think most of us already knew... Except the "infant mortality" one really surprised me.

I have to wonder, though, did she include DOAs in that, or did she only include drives that worked at least for a few minutes/hours/days? I have to strongly suspect the latter - I can't argue with the statistics from 100k drives, but my personal experience with a few dozen drives has shown that they have a strong bias toward either never working, or working for at least a year.

Love the RAID5 stat, though... Perhaps this study will finally convince people to only use RAID for performance or huge-JBOD reasons, never for (the illusion of) reliability.

Re:No "infant mortality" effect? by TheLink · 2007-02-20 15:51 · Score: 1

Well "HPC1 compute nodes" seem to have infant mortality. Strangely the others don't.

As for RAID, I'm biased towards software RAID because most hardware RAID cards seem to have either poor performance or poor reliability. Also I'm biased towards RAID1 and RAID10.

Sure I have to bet on Linus and gang not screwing up software RAID, but for software if the kernel passes decent "infant mortality" tests it's likely to keep working till at least the next kernel patch ;).
--
- Too many replies beneath your current threshold
Re:No "infant mortality" effect? by ScrewMaster · 2007-02-20 16:18 · Score: 1

Perhaps this study will finally convince people to only use RAID for performance or huge-JBOD reasons, never for (the illusion of) reliability.

Properly done, mirroring can certainly increase the reliability of your storage solution. The servers in my basement are all mirrored and while I've experienced a couple of drive failures along the way I've never lost a byte. I make sure the drives have sufficient airflow to run at room temperature and they run on filtered power, but nothing lasts forever. So mirroring helps. Matter of fact, when I do have a crash, the system continues running on the remaining functional drive in the array: if it weren't for the fact that the server emails my cell phone I'd never know about it.

However, the myth that I would like to see dispelled is the one that goes "Oh cool! If I have a RAID 0+1 solution I don't need to worry about backups!"

Gagh.

--
The higher the technology, the sharper that two-edged sword.
Re:No "infant mortality" effect? by drsmithy · 2007-02-20 22:53 · Score: 1

Love the RAID5 stat, though... Perhaps this study will finally convince people to only use RAID for performance or huge-JBOD reasons, never for (the illusion of) reliability.
That is _not_ the conclusion anyone should take from this article.
The conclusion to be drawn is that single-parity RAID schemes cannot handle multiple drive failures in quick succession - something anyone involved in data storage should know already, and take into account.
Re:No "infant mortality" effect? by cowbutt · 2007-02-21 00:24 · Score: 1

As for RAID, I'm biased towards software RAID because most hardware RAID cards seem to have either poor performance or poor reliability. Also I'm biased towards RAID1 and RAID10.
Sure I have to bet on Linus and gang not screwing up software RAID, but for software if the kernel passes decent "infant mortality" tests it's likely to keep working till at least the next kernel patch ;).
Unfortunately, Linux's software RAID drops the block device out of the array if it encounters a read error. So you hot-add that drive back into the array and start rebuilding, then the other block device encounters a read error too and gets dropped. Nice failure condition!
As I understand it, at least one RAID implementation in at least one of the BSDs takes a different approach; upon encountering a read error, it tries to get that block from the rest of the array and immediately attempts to copy it to the block device that generated the read error. Only if this fails is that device dropped from the array. This strikes me as a much safer approach. I've also seen syslogs indicating that BSD makes use of advisory status reports from drives (is this a SCSI-only feature?) to notice when the drive had problems reading a block, and immediately refresh it for similar reasons.
Re:No "infant mortality" effect? by asuffield · 2007-02-21 01:29 · Score: 2, Informative

Love the RAID5 stat, though... Perhaps this study will finally convince people to only use RAID for performance or huge-JBOD reasons, never for (the illusion of) reliability.

It's true that you should never buy anything for the illusion of reliability, but the article does not claim RAID is not a good way to get reliability.

First, let's look at the common mistake when people think about RAID: "If the probability of a drive failure is X, then the probability of two drives in a RAID volume failing is X*X, which is much smaller". That's nonsense, as the article demonstrates - the probability is only X*X if the events are independent, which they are clearly not.

But the idea was nonsense even before that. The statement is taking the wrong attitude to the problem - it is considering the probability of data loss at *one point in time*. That's not actually what you care about - if your server dies on Tuesday, it is no comfort to you that it did not die on Monday. Here is a more sensible way to look at what is going on (ignoring backups for the moment):

Every drive is going to fail, typically within the first ten years of its life. So if you have a non-RAID system, the probability of data loss is 100% - certain. Really. Without RAID, sooner or later, you are going to lose that volume. What RAID gives you is a moderate chance of getting through the inevitable drive failures without losing the volume, and that's a chance that you never had at all without RAID. Different configurations can modify how large that chance is, but the essential feature of RAID is that you get the chance.

So what do backups get you? It's basically the same thing, except that you've got to rebuild the server. So if you just have backups and no RAID, it is a certainty that sooner or later your server is going to have significant amounts of downtime while it's being rebuilt from the backup. If downtime bothers you, you need RAID, period. Exactly what kind of RAID depends on what chance you want to take (standard risk management calculation), but there's just no contest between "certain failure" and "chance of avoiding failure" - even a 10% chance of surviving a disk failure is infinitely better than no chance (and the actual figure should be much better than that).

Lastly, what happens if you have RAID and no backups? It should be apparent that you get the same scenario as RAID with backups, only with a higher chance of failure. So there's no fundamental reason not to do that - line up the figures along with RAID+backup solutions in your risk management analysis, and pick the cheapest option for the level of risk you (or your insurance company) are willing to accept.

The impact of this study is a nice improvement in the accuracy of that analysis. Neither more nor less. If you're running large servers, this would be a good time to pull out those numbers and take another look at them (if you don't have those numbers on file, this study is not for you).
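
The independence point above can be sketched in a few lines of Python. The 3% figure and the 4x multiplier are hypothetical numbers picked for illustration, not values from the study.

```python
def p_both_fail(p_first, p_second_given_first):
    """P(A and B) = P(A) * P(B | A)."""
    return p_first * p_second_given_first


x = 0.03  # hypothetical chance that a given drive fails in some window

# The "X*X" reasoning: the second failure is assumed unaffected by the first.
independent = p_both_fail(x, x)

# Hypothetical correlation: a second failure is 4x more likely after a first.
correlated = p_both_fail(x, 4 * x)

print(f"independent: {independent:.4%}")
print(f"correlated:  {correlated:.4%}")
```

Whatever the real multiplier is, the structure is the same: X*X is only the right answer when the conditional probability equals X.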
Re:No "infant mortality" effect? by TheLink · 2007-02-21 02:15 · Score: 1

Thanks for that information. That's something to keep in mind.

What does the hardware RAID stuff do in that situation? So far I think some of it is no better, and looking at the history of bugs for, say, Megaraid makes me wonder whether RAID vendors really know what they are doing, and whether paying the extra money for their stuff improves availability or actually makes things _worse_. So far it seems 3ware could be the least crap ;). Anyway the main reason I prefer Linux software RAID is it's easier to detect a drive failure and alert someone. Whereas many hardware RAID products don't support Linux well enough for the O/S to be alerted to problems - sure the controller starts beeping like crazy, but often nobody is physically around to hear it[1].

I've been using FreeBSD for quite some years already, but in the office it's Linux. I quite like the FreeBSD approach to some things - but AFAIK stuff like vmware don't work on FreeBSD (without excessive tinkering anyway).

[1] I've wondered if it would be a good idea to wire up LEDs of switches, routers, servers and other stuff to speakers, so that they make some activity noise. So that an admin could just walk into the server room and know that "something's different...", or even put in some microphones and then stream the sound to the admin as a background. If you hear a buzz from the webserver and switch maybe there's a slashdotting in progress ;). And if the airconditioning fails, you'd probably hear it before the temperatures go up.
--
- Too many replies beneath your current threshold

This paper and the Google paper are complementary by Thagg · 2007-02-20 14:10 · Score: 4, Informative

What's interesting about both of these papers is that previously-believed myths are shown to be, in fact, myths.

The Google paper shows that relatively high temperatures and high usage rates don't affect disk life.
The current paper shows that interface (SCSI, FC vs ATA) had no effect either. The Google paper shows
a significant infant mortality that the CMU paper didn't, and the Google paper shows some years of flat
reliability where the current paper shows decreasing reliability from year one.

They both show that the failure rate is far higher than the manufacturers specify, which shouldn't come
as a surprise to anybody with a few hundred disks.

I'm particularly pleased to see a stake driven through the heart of "SCSI disks are more reliable."
Manufacturers have been pushing that principle for years, saying that "oh, we bin-out the SCSI disks
after testing" or some other horseshit, but it's not true and it's never been true. The disks are
sometimes faster, but they're not "better".

Thad

--
I love Mondays. On a Monday, anything is possible.

Human MTBF by EmbeddedJanitor · 2007-02-20 14:12 · Score: 4, Funny

MTBF of a human until gross catastrophic failure (ie. death) is approx 50 years which is approx 440,000 hours.

Of course if we count relatively minor failures (like forgetting to take out the trash or pick up dirty underwear), then MTBF is approx 27 minutes!

--
Engineering is the art of compromise.

We need a better file system... by complete+loony · 2007-02-20 14:14 · Score: 1

Further, these results validate the Google File System's central redundancy concept: forget RAID, just replicate the data three times. If I'm an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive. Someone needs to hurry up and write a good cross platform clustering file system solution. Something that encourages a company to buy bigger, better value HD's for their desktops so they can be used as redundant storage.

--
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.

Re:We need a better file system... by ogiller · 2007-02-20 15:42 · Score: 1

There is already a distributed file system providing location independence, scalability, security. It is called AFS.
http://www.openafs.org/

What does the rest of Slashdot think? Is OpenAFS anything like GoogleFS?
Re:We need a better file system... by Hawke666 · 2007-02-20 19:01 · Score: 1

Unfortunately, it's not even close. A couple things that spring to mind:

AFS servers don't automatically ensure that every piece of data is stored on some minimal number of them. Enabling/configuring replication of a given volume is a manual process.

AFS treats a single server as the "master" read-write copy; replicated copies are read-only.

Redhat's Global File System ( http://www.in.redhat.com/software/rha/gfs/index.php3 ) or Lustre ( http://www.lustre.org/ ) or maybe ddraid ( http://sources.redhat.com/cluster/ddraid/ ) sound much closer to what googlefs is than AFS does.

IMO AFS is more of a file sharing protocol (like NFS or SMB) than googlefs.

Infant Mortality and stuff by jmorris42 · 2007-02-20 14:16 · Score: 0, Troll

> Um, but doesn't the summary of the paper say that there is no infant mortality effect, and that
> failure rates increase with time, and thus the bathtub curve doesn't actually apply?

That may be the new 'theory' but we all know about theory vs reality. Here in reality if you put a couple of dozen new drives into service you have one or two spare hard drives to replace the ones that WILL fail in the first week. Especially with consumer grade drives typical in workstation deployment. If you only have one dud out of twenty it was a good rollout.

And as for some of the other assertions in this paper (well the summary, haven't read this one yet, still wanting to reread the google paper again, need more hours in a day.... bah!).......

> Costly FC and SCSI drives are more reliable than cheap SATA drives.

Sorta. Again, real world vs theory. Try banging the hell out of an off the shelf consumer drive 24/7/365 and see how long it holds up. Yea, thought so. Hope you didn't have anything important on that paperweight.

> RAID 5 is safe because the odds of two drives failing in the same RAID set are so low.

This one should bother ya if you are overly relying on the 'infallibility' of RAID5. Remember kids, drives fail from two major groups of causes, internal and external. If a power event kills one drive in the array the odds are pretty low of only one being dead, you just might not KNOW about #2 yet. And filesystem corruption will be faithfully mirrored onto the array. Obey the 1st Commandment: "Thou Shalt Make Backups."

--
Democrat delenda est

Re:Infant Mortality and stuff by Wilson_6500 · 2007-02-20 14:24 · Score: 5, Insightful

That may be the new 'theory' but we all know about theory vs reality.

Uh, but wasn't this data accumulated via testing actual drives? That's... kinda how science works--by replacing anecdotal evidence with scientifically-gathered data. That's basically condemning science in favor of anecdotes--and the medical fields can tell you how well _that_ works.
Re:Infant Mortality and stuff by pinkstuff · 2007-02-20 14:30 · Score: 1

But Raid 1 or 5 is what I was going to use for my (home) backup box. Do I now need a backup of that as well, ouch, my wallet is starting to hurt :).
Re:Infant Mortality and stuff by diersing · 2007-02-20 15:02 · Score: 1

But then he couldn't finish the day feeling .
Re:Infant Mortality and stuff by DarkVader · 2007-02-20 15:02 · Score: 2, Interesting

1 in 20 drive failures? What are you using, Western Digital drives? I don't see anything close to that failure rate, more like 1 in 300.

I don't deploy "enterprise" drives, they're overpriced, and the few I did install years ago proved to be less reliable than "consumer" drives. My real world experience is that the "consumer" drives are generally reliable, I just plan on a 2-3 year replacement schedule.

I can't disagree with RAID being fallible depending on what takes out the drive, though.
Re:Infant Mortality and stuff by TheLink · 2007-02-20 15:33 · Score: 4, Insightful

quote: "Sorta. Again, real world vs theory. Try banging the hell out of an off the shelf consumer drive 24/7/365 and see how long it holds up"

Uh the paper is based on _real_world_ stats (which part of "empirical evidence" + "she looked at 100,000 drives" don't you understand?).

Your assumptions = theory. Paper = real world.

And that's why the paper was voted "Best Paper", because it seems lots of people had similar assumptions and this paper is very useful to at least get some people to revisit those assumptions.

It might still be proven wrong by a bigger/better study, or it could turn out that it was flawed in some way. But I'll give them the benefit of doubt - more than I'll trust the MTTF/MTBF figures from drive manufacturers.
--
- Too many replies beneath your current threshold
Re:Infant Mortality and stuff by Anonymous Coward · 2007-02-20 16:30 · Score: 5, Insightful

Use two drives that are not in a raid setup. Use one as the data holder and rsync or tar.gz the data to the other one at your comfort level (hourly/daily/weekly/monthly or whatever time frame you would like). Much cheaper than raid, easier to get going, no gotchas involved with different HD controllers or different drives and most importantly, the second drive is not "live" and not in normal operation which constitutes a backup (remember, raid is not and never was a backup solution, it is only for uptime and maybe speed).

Raid controllers come in two flavors. Ones that are very well supported and you will always find a similar or compatible one if that controller fails, the down side of this type is it is very expensive. The other type is the cheap ones, you know, the ones for under $100 which may not exist in 2 years when yours fails leaving your raid array useless and the on board SATA raid chip sets that change at least yearly as well. Good luck with those. They do work but I'd bet you will have more problems with the raid setup itself than with the actual drives the data is on.

I know, KISS is not in typical /. speak but it definitely applies here. 300GB HDs are about $80 without rebates, using one to hold a copy of the other using rsync or robocopy is about the cheapest backup you can get and since it is not a live file system, all the other things that happen to data that are not the fault of the actual HD (virus, mouse slip, kids messing around, accidents, overwriting) will be recoverable.
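
A minimal sketch of the copy-to-a-second-drive idea, as a small Python wrapper around rsync. The paths and function names are hypothetical; the only rsync behaviour relied on is the `-a` archive flag and the trailing-slash rule for the source directory.

```python
import subprocess


def build_rsync_command(source_dir, backup_dir):
    """Return the argv for a one-way copy of source_dir into backup_dir.

    -a (archive) recurses and preserves permissions and timestamps.
    The trailing slash on the source makes rsync copy the directory's
    contents rather than the directory itself. --delete is deliberately
    left out so files removed by mistake survive on the backup drive.
    """
    return ["rsync", "-a", source_dir.rstrip("/") + "/", backup_dir]


def run_backup(source_dir, backup_dir):
    """Run the copy; raises CalledProcessError if rsync reports failure."""
    subprocess.run(build_rsync_command(source_dir, backup_dir), check=True)


# Hypothetical paths, for illustration only.
print(" ".join(build_rsync_command("/home/me/data", "/mnt/backup/data")))
```

Something like `run_backup(...)` could then be called from cron at whatever interval matches your comfort level. Leaving out `--delete` is a design choice here: it costs disk space but keeps the "mouse slip" cases recoverable.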
Re:Infant Mortality and stuff by duffbeer703 · 2007-02-20 18:09 · Score: 4, Informative

That may be the new 'theory' but we all know about theory vs reality. Here in reality if you put a couple of dozen new drives into service you have one or two spare hard drives to replace the ones that WILL fail in the first week. Especially with consumer grade drives typical in workstation deployment. If you only have one dud out of twenty it was a good rollout.

This study looks pretty realistic to me, in fact it's better data than the Google paper's because they are looking at different usage scenarios. The study also jibes with vendors' warranty periods -- right around the 3 year mark (end of warranty) failures start going up.
I take issue with your "real world vs. theory" argument about workstation disks versus server disks as well, only because I have my own numbers. Based on numbers that my company gathers for its 50,000 workstations, the disk failure rate is around 1.9% annually. (Still a lot of disks) There are exceptions -- those numbers are driven upward by one deployment of workstations from a vendor that had a 22% failure rate. (the PCs were replaced by the vendor) Server disks are in the same ballpark - slightly less than 2%.
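
To put "still a lot of disks" into numbers, here is a quick sketch using the fleet size and rate quoted above. It assumes one disk per workstation, which the post does not state, and the function name is made up for illustration.

```python
def expected_annual_failures(fleet_size, annual_failure_rate):
    """Expected number of disk failures per year across a fleet."""
    return fleet_size * annual_failure_rate


# Assumption: one disk per workstation.
per_year = expected_annual_failures(50_000, 0.019)
print(f"about {round(per_year)} failed disks per year")
print(f"about {per_year / 52:.0f} per week")
```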

Vendors provide more evidence of that fact. Many servers are being shipped with SATA disks, often the same as what you'll find in workstations. If SATA was less reliable, that would increase the vendor's support costs and they wouldn't ship them.

You're totally right about RAID-5... it can be a dangerous thing for an inept admin. Bad disks often come in batches, and bad controllers can ruin your day. A redundant array of bad data isn't very helpful ;)

--
Conformity is the jailer of freedom and enemy of growth. -JFK
Re:Infant Mortality and stuff by empaler · 2007-02-20 22:09 · Score: 2, Informative

I actually only have good experiences with WD and was about to order a new batch of SATA disks (now-ish).
Re:Infant Mortality and stuff by Anonymous Coward · 2007-02-20 22:41 · Score: 0

Want anecdotal evidence...
I've been banging "the hell out of an off the shelf consumer drive 24/7/365" since 2000... (the comp has an actual uptime of some years, the hdd has actually been heavily running for more than six years)
It's a WD 30GB or some crap like this and it still works well !!!
ad0: 28629MB <WDC WD300BB-00AUA1 18.20D18>

You'd be better off listening to someone who tested 100,000 drives...
Re:Infant Mortality and stuff by rikkards · 2007-02-21 01:18 · Score: 1

One thing I have noticed over the years is that, like banks, there will be good stories about a manufacturer and bad stories. My belief is that they probably are all the same.

Of course this is leaving the IBM Deskstar drives out of the picture. They were just evil.
Re:Infant Mortality and stuff by shakah · 2007-02-21 03:01 · Score: 1

For some more numbers from the real world:
ftp://ftp.research.microsoft.com/pub/tr/TR-2004-107.pdf
"Our expectations were not very high. Many experienced administrators told us to expect many disk failures and problems with our inexpensive, "white-box" PC configurations. Typical advice was "SATA drives will fail all the time", "SATA is not SCSI and can't keep up with the I/O demands", etc. We had previously experienced excellent reliability from our Compaq Cluster and SAN [Barclay04]. We practically never visited the data center to perform maintenance work. We were advised to be prepared to be in the data center frequently to service the bricks, disks, or other components. The advice we received was so severe that we made a substantial investment, over 5% of the total system cost, in remote management capabilities provided by the Advocent KVM/IP and ServerTech IP PDUs fearing that we would be living at the data center.

Our experience has been the exact opposite. The storage bricks and the SATA disk drives have been every bit as reliable as the Compaq Cluster and SAN containing SCSI disks. In three years, approximately thirty-two SCSI drives failed in the Compaq SAN and web farm. Due to triple-disk mirroring, we never experienced any data loss. To date, a total of nine SATA drives have failed and been replaced. Due to dual-disk mirroring, we have not experienced any data loss and have not had to put our "just-in-time-backup" process into action."
Re:Infant Mortality and stuff by Zombywuf · 2007-02-21 03:21 · Score: 1

I'm willing to make allowances for the author as his typing indicates he was half asleep (long day in the data centre perhaps). The following should be noted about the article though: The Google paper only used data from drives that had been burnt in, an action presumably designed to remove morbid infants, and still found a slight infant mortality effect. Yet the article claimed Google found no infant mortality effect.

--
If you can read this you've gone too far.
Re:Infant Mortality and stuff by MrNiceguy_KS · 2007-02-21 03:37 · Score: 1

"Hard drives are like hard liquor. Everyone has a brand that they had a bad experience with and will avoid for the rest of their life."
Read this on a forum somewhere, so I can't take credit for this, (hence the quotes).

--
Redundancy is good And also good.
Re:Infant Mortality and stuff by element-o.p. · 2007-02-21 08:47 · Score: 1

"Try banging the hell out of an off the shelf consumer drive 24/7/365 and see how long it holds up"

Umm....I do.

I run web, e-mail, Asterisk, MySQL, LDAP and DNS servers using consumer grade hardware and whatever hard drives I can scrounge. I blow PSU's all the time (what I get for buying whatever was last on sale at Crap^H^H^HompUSA) and I think I just cooked the mobo on the web and e-mail server (a K6-2/500, to give you some idea how old it is), but the hard drives have been phenomenally reliable. BUT, just to be safe, I rsync all the data from all of the machines to a new, large capacity hard drive on another machine. It's been a really good way to make sure I can return the servers to running status on those occasions when something does go horribly wrong, but I find that most of the restores I do are because someone borked something when updating a web page.

--
MCSE? No, sir...I don't do Windows. Yes, I am an idealist. What's your point?
Re:Infant Mortality and stuff by psiphre · 2007-02-22 06:59 · Score: 1

I'd honestly like to know either where you get your drives or what you're smoking that you can say a 300gb drive before rebates is typically $80... even pricewatch doesn't quote them that low.
Re:Infant Mortality and stuff by Anonymous Coward · 2007-02-22 13:16 · Score: 0

NewEgg has several 320GB models for $85-90 with free shipping.
Re:Infant Mortality and stuff by mink · 2007-02-27 07:26 · Score: 1

I was with you till that last line. Then you just went into some strange alternate reality.

The Deskstars (or deathstars as some people call them) that were prone to failure were a single model number (out of 6-7) I have purchased and used plenty of them from around that time and all I did was make sure to get the model that was not flawed and they still (6 years later roughly) are running. They see 24/7 operation with few power cycles.

Much like your first comment, not all IBM drives were evil data eaters, and people constantly harping on a single model's failure and applying that to all drives from them for years after the issue was resolved is, IMO, not fair (whether it's Seagate or IBM, or anyone else).

I purchased 2 Seagate 320GB SATA drives and had them in service for 6 months before one of them started re-allocating sectors and ran out of extras to do that (in less time than it took to notice the issue). I returned it and the replacement has survived the stress test. It could still fail in 6 months, but I don't paint Seagate as having nothing but unreliable drives for all eternity.

--
Well I've wrestled with reality for thirty five years doctor, and I'm happy to say I finally won out over it.

100,000 Disk Drives? by MindStalker · 2007-02-20 14:19 · Score: 1

No, it's 131072. Does no one care about base-2 anymore?? /To the sarcasm disabled.. It's a joke..

Re:100,000 Disk Drives? by Anonymous Coward · 2007-02-20 14:34 · Score: 0

Or maybe it's 32.
Re:100,000 Disk Drives? by dreddnott · 2007-02-20 14:39 · Score: 1

100K disk drives would actually be 102,400.

--
I may make you feel, but I can't make you think.
Re:100,000 Disk Drives? by Anonymous Coward · 2007-02-20 16:13 · Score: 0

100K disk drives would actually be -279.67 degrees Fahrenheit disk drives.
Re:100,000 Disk Drives? by MindStalker · 2007-02-21 01:41 · Score: 1

Umm.. no.. You can't just stick a zero on the back of 1024
1024
2048
4096
8192
16384
32768
65536
131072
Re:100,000 Disk Drives? by Anonymous Coward · 2007-02-21 07:47 · Score: 0

131072 is 128K but thanks for playing.
Re:100,000 Disk Drives? by MindStalker · 2007-02-21 08:29 · Score: 1

Well duh, but it's not 102400.
Re:100,000 Disk Drives? by dreddnott · 2007-02-21 14:53 · Score: 1

Then what would 100K be? I assumed you switched the measurement context to computer memory, which is base 2. One hundred kilobytes (100*1024) is 102,400 bytes. For comparison, 1000K is 1,024,000, but 1MB (1024*1024) is 1,048,576.

Your joke had promise but I think it's beyond repair now...poor thing.

--
I may make you feel, but I can't make you think.

Yay by Daishiman · 2007-02-20 14:20 · Score: 1

Software RAID FTW!!

In all seriousness, in truly critical storage you save your stuff under a RAID1. RAID5 is simply too unreliable for the task (not to mention that those controllers aren't exactly cheap).

So save yourself trouble, money, and grief, and just use logical volume management to replicate drives.

Re:Yay by Quill_28 · 2007-02-20 23:58 · Score: 1

If it was truly critical why would expense be a limiting factor?

Why not use two RAID 5 arrays in a RAID configuration?
Re:Yay by swilver · 2007-02-21 00:39 · Score: 1

Software Raid FTW!!

not to mention that those controllers aren't exactly cheap
What controllers? Ever heard of software RAID5? I've been running a 6 drive array for years now, and it has survived plenty of drive failures.

That's wrong by ArbitraryConstant · 2007-02-20 14:22 · Score: 2, Informative

It didn't conclude RAID 5 doesn't help, it concludes RAID 5 doesn't help as much as people think, because people think the probability of another failure before the rebuild is complete is negligible and they're wrong.

It helps, and distributing the data more helps more. Someone concerned about multi-drive failures can, for example, use a 3-way RAID 1 array, or a RAID 6 array (which can tolerate the loss of any 2 drives).
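
For reference, here is a sketch of the optimistic calculation that makes a second failure look negligible. It assumes drives fail independently at a constant rate, which is the textbook model and not what the study measured; the array size, failure rate and rebuild time are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def p_failure_during_rebuild(surviving_drives, afr, rebuild_hours):
    """Chance that at least one surviving drive fails before the rebuild ends.

    Assumptions: each drive fails independently at a constant rate derived
    from its annualized failure rate (afr). This is the optimistic textbook
    model, not what the study measured.
    """
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    return 1.0 - math.exp(-surviving_drives * rate_per_hour * rebuild_hours)


# Hypothetical array: 5 surviving drives, 3% AFR, 24-hour rebuild.
p = p_failure_during_rebuild(5, 0.03, 24)
print(f"{p:.4%}")
```

With these inputs the model gives a few hundredths of a percent. If failures cluster in time, as the study suggests, the real figure is higher than this model says.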

--
I rarely criticize things I don't care about.

Re:That's wrong by jcgf · 2007-02-20 14:35 · Score: 2, Insightful

In my humble opinion it also helps to use different branded drives in your raid array, that way the chance of them failing at the same time for the same reason is less and you should have longer to do your rebuild.
Re:That's wrong by twiddlingbits · 2007-02-20 15:23 · Score: 1

The probability of another failure of a drive is the same as it was for the drive that failed (assuming same type and same mfg). However, when that failure point is reached is at a random point within the distribution so while the probability of another failure at any point in time is not zero it is pretty small. MTBF can be influenced by environment and usage patterns. Rebuilding a RAID array isn't a very lengthy process, perhaps a day or two at most if it was a huge array. Plus you SHOULD have backups and snapshots if it was critical data. If you are really paranoid you could also mirror the full set of RAID drives (RAID 51 or RAID 15), that design while costly can handle any THREE drives failing. I'm rusty on my stats but I believe you would multiply the probabilities of 1 drive failing three times (X*X*X) which gets pretty small.
Re:That's wrong by Bronster · 2007-02-20 15:52 · Score: 1

Even RAID6 isn't always enough:

http://blog.fastmail.fm/?p=521

Which is why (as suggested in another thread) we use Cyrus replication now for people's email. That has its own collection of "issues" - I've probably written at least half the bugfixes that have been merged into the past couple of releases to get it stable - but it's a whole lot better than relying on a single disk unit.

We had another failure where the RAID controller went psycho and decided to lie about the status of drives, it failed 5 of them in quick succession on a single RAID6 unit, and the worst thing is we actually replaced some before we gave up and switched RAID controllers, so we lost the volume. No knowing how corrupted it would have been anyway.
Re:That's wrong by windsurfer619 · 2007-02-20 16:15 · Score: 1

Psh! People! I don't know what the fuss is all about "Raid 5". You should be using Raid 2! Superior performance, and amazing redundancy! Of course, the cost IS a little prohibitive...
Re:That's wrong by ArbitraryConstant · 2007-02-20 16:38 · Score: 1

"Even RAID6 isn't always enough"

Yes, but if a rogue comet the size of Pluto hit the earth we'd all be killed and nothing would be enough.

It's about risk mitigation -- how much you need, how much you can afford. Anyone that sees enough of the industry will eventually witness remarkably improbable data loss, but RAID or other similar strategies aren't about making you invulnerable (since a file that gets overwritten with an old version or something is just as gone on a RAID array), they're about being able to operationally tolerate the failure of individual drives. You still need proper backups, and that's completely beyond the scope of avoiding mechanical failure.

--
I rarely criticize things I don't care about.
Re:That's wrong by mabhatter654 · 2007-02-20 17:22 · Score: 1

Think about it for a minute: a typical RAID 5 is set up on one system in one shot from 3-4 identical drives with the same lot numbers. Sounds good on paper, but if one fails, then because its siblings are almost identical, they will have the same usage and wear on them, as well as the same potential defects! So they have just as much chance of failing as the one that died, and the same variables. Now you put extra stress on them to rebuild the missing drive and accelerate another failure that was "about" to occur. I like the other poster's idea about mixing drives up, but that can lead to unnecessary compatibility problems. You'd think drive monitoring would let you pick out the potential failure sooner, but as Google found out, the built-in reporting isn't reliable enough yet to make decisions from... you may replace the wrong drive and still cause the failure you're trying to prevent!
Re:That's wrong by petermgreen · 2007-02-20 22:13 · Score: 2, Informative

However, when that failure point is reached is at a random point within the distribution so while the probability of another failure at any point in time is not zero it is pretty small.

There are three real dangers with raid

The first is that arrays are typically built out of identical drives, usually drives from the same batch and then all the drives are run for the same time periods. This means that if there is a design or manufacturing fault that causes a failure peak at a certain number of operational hours there is a good chance that more than one drive in your array will fail at about the same time.

The second is that the drives in an array are typically in one machine, running off one power supply (or one pair of redundant power supplies) and connected to one controller. This means that faults with other hardware in the machine can destroy multiple hard drives at once.

The third is failure of the controller. In many cases the controller stores information on how the data is laid out in its own non-volatile memory (some better controllers do store it on the disks themselves). While losing this doesn't destroy the actual data, it can easily put reassembling the array in a way that gets the data back beyond the ability of non-experts (and if they make a mistake they can easily destroy the data they were trying to recover). There is also the problem that getting a suitable replacement controller may be difficult.

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
Re:That's wrong by complete+loony · 2007-02-21 01:29 · Score: 1

Rebuilding an array after a drive failure involves reading a complete copy of everything from the good drives, and writing a complete copy to the new drive. Disk capacities have been increasing MUCH faster than the bus speed used to talk to them, so RAID arrays are taking longer and longer to rebuild. The longer the rebuild, the more likely a second failure will occur before the array is back up.
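As a back-of-the-envelope sketch of that relationship: the rebuild time is roughly capacity divided by sustained throughput, and the exposure to a second failure grows with that time. Everything here is an assumption for illustration: the function names, the constant-rate independent failure model, and the example capacities, throughputs, and 3% annual failure rate.

```python
import math

HOURS_PER_YEAR = 24 * 365


def rebuild_hours(capacity_gb: float, throughput_mb_s: float) -> float:
    """Best-case time to write one full replacement drive, in hours."""
    seconds = capacity_gb * 1000 / throughput_mb_s
    return seconds / 3600


def second_failure_probability(surviving_drives: int, afr: float,
                               hours: float) -> float:
    """Chance that at least one surviving drive fails within `hours`.

    Assumes independent drives with a constant failure rate matching the
    given annual failure rate (afr); real arrays may be worse than this.
    """
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * hours * surviving_drives)


# Illustrative only: capacity grows 10x while throughput merely doubles.
for capacity_gb, throughput_mb_s in [(75, 30), (750, 60)]:
    hours = rebuild_hours(capacity_gb, throughput_mb_s)
    risk = second_failure_probability(surviving_drives=5, afr=0.03, hours=hours)
    print(f"{capacity_gb} GB at {throughput_mb_s} MB/s: "
          f"{hours:.2f} h rebuild, second-failure risk {risk:.2e}")
```

Throttling the rebuild lowers the effective throughput, so in this model it lengthens the window and raises the computed risk.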
If you take a data replication strategy like Google's, instead of keeping your redundant copies next to each other, you scatter the pieces evenly and almost randomly across all your storage. When you have 3 copies of the data and lose any 2 disks, chances are you'll get back to a minimum of 2 copies within minutes of the failure, and back to 3 copies within a couple of hours (this would have a lower priority than normal disk usage and would take longer so as not to affect more critical processes).

--
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
Re:That's wrong by Phishcast · 2007-02-21 02:29 · Score: 1

Disk capacities have been increasing MUCH faster than the bus speed used to talk to them, so RAID arrays are taking longer and longer to rebuild.
Actually, the speed of the disks is still the bottleneck, not the bus. Disk capacities have grown by leaps and bounds, but the performance of any individual disk hasn't increased at anywhere near the same rate.
I'd also point out that with most standard RAID scenarios, when a disk fails the rest of the disks in that RAID set generally become immensely more busy than their normal workload. For example, when I have a single drive fail with RAID 5 and my ass is on the line until the rebuild completes, the remaining disks will be drastically more stressed than during normal operation, which increases the likelihood of failure and data loss. I can throttle my rebuild to lessen that stress, but that will increase the time where I'm unprotected against a second drive failure. It's a catch-22; you may as well just cross your fingers and hope for the best.
If you really care about the availability of your data, an additional layer of redundancy is worth your time and money. RAID6 (or any double-parity implementation) and three-way RAID 1 are good examples.
Note: I am jaded. We lost data using RAID 10 (on a Sourceforge mirror site, BTW) due to a double-drive failure.
Re:That's wrong by petermgreen · 2007-02-21 02:55 · Score: 1

I'd also point out that with most standard RAID scenarios, when a disk fails the rest of the disks in that RAID set generally become immensely more busy than their normal workload
Further note that with parity-based RAID you get a double whammy of extra load. Not only does the rebuild cause extra load, but any read ops that would normally hit the failed drive require a read of ALL the remaining drives to satisfy.

and the rebuild operation itself needs to read ALL data off ALL drives.
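A minimal sketch of why that is, assuming simple single-parity (RAID 5 style) XOR over equal-sized blocks. The function names and the three-block stripe are made up for illustration; real implementations rotate parity across drives and work on much larger stripes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_missing(surviving_blocks):
    """Rebuild the one missing block of a stripe.

    Needs EVERY surviving block (data and parity), which is why a degraded
    read touches all remaining drives.
    """
    return xor_blocks(surviving_blocks)


# One stripe across three data drives plus one parity block.
data = [b"disk", b"fail", b"RAID"]
parity = xor_blocks(data)

# The drive holding data[1] dies; read all the others to get it back.
recovered = reconstruct_missing([data[0], data[2], parity])
print(recovered == data[1])
```

Leaving any one surviving block out of the XOR gives garbage, so a second failure during the rebuild leaves nothing to reconstruct from.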

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
Re:That's wrong by twiddlingbits · 2007-02-21 03:29 · Score: 1

There is no perfect solution. Disks fail, Murphy's Law says they will fail at the worst possible time (i.e during a RAID rebuild). There are some preventative strategies: 1) Mix disk mfgs in RAID arrays (not easy unless you build your own), have redundant controllers. 2) Make backups every day. 3) Don't wait until failures begin to replace disks, start replacing them at around 1/2 the stated MTBF (which is often overstated!). This is going to be tough in a production environment where the disks are online 24x7, but there are ways to do it. 4) Shop around for the most reliable drives and the best service plans. 5) Use lesser known but more reliable RAID technologies instead of just RAID 5. RAID 5 is the best balance of reliability and performance. Other solutions exist that can tremendously increase reliability w/o a big performance hit. 6) De-stress the drives with proper temp control and proper access mechanisms. Tuning the SQL can have performance benefits as well as help disks last longer.

So SSD's are not only faster, but more reliable? by gelfling · 2007-02-20 14:32 · Score: 3, Interesting

I wonder if anyone looked at what actually failed in the drives? An arm, a platter, an actuator, a board, an MPU?

Would an analysis tell us that SSDs are not only faster but more reliable and if so by how much?

forget RAID? by juventasone · 2007-02-20 14:38 · Score: 2, Informative

Translation: one array drive failure means a much higher likelihood of another drive failure ... Further, these results validate the Google File System's central redundancy concept: forget RAID, just replicate the data three times.

The fact that another drive in an array is more likely to fail if one has already failed makes a lot of sense, but the conclusion to forget RAIDs doesn't. Arrays are normally composed of the same drive model, even the same manufacturing batch, and are in the same operating environment. If something is "wrong" with any of these three variables, and it causes a drive to fail, it's common sense the other drives have a good chance at following. I've seen real-world examples of this.

In my real-world situations, the RAID still did its job, the drive was replaced, and nothing was lost, despite subsequent failure of other drives in the array. Sure you can get similar reliability at a lower price by replicating data, but I think that's always been understood as the case. Furthermore, as someone else in the forum mentioned, enterprise-class RAIDs are often used primarily for performance reasons. A modern hardware RAID controller (with a dedicated processor and RAM) can create storage performance unattainable outside of a RAID.

Re:forget RAID? by Ragin'Cajun · 2007-02-20 17:50 · Score: 2, Interesting

I used to work at a company that made network-attached storage appliances. Amazingly enough, one source of drive failures was the hot spare spinning up! The current draw during the spinup would cause a voltage dip on the power plane, which could lead to a read or write error on one of the neighbouring drives. Unfortunately, the most common cause of the hot spare spinning up was...another drive failing. So suddenly a second drive fails because of a read or write error.

The thing is, sometimes getting a read error doesn't actually mean the media is bad. There could have been some power fluctuation during the write, so the checksum doesn't match the data and the drive's controller returns a failure during the read. But if you rewrite that sector, it will be fixed (e.g. during an unconditional format).

--
--It's all fun and games, 'till someone loses an eye. Then it's one-eyed fun!--
Re:forget RAID? by swilver · 2007-02-21 00:51 · Score: 1

Excuse me, but marking a drive invalid because of a read failure (that's only temporary because of a drive spinning up) seems absolutely nonsensical to me. How about retrying the read? Surely it should be possible to read it correctly again within a sensible timeout, like 10 seconds, when the alternative is complete array failure... The second drive never failed; your software only marked it as failed because it doesn't bother to retry the read a few times.

Luckily Linux Software RAID is smarter than that. It won't mark a drive as failed because a sector cannot be read once.. it will retry until a time out is reached. If no drives have failed yet, and the read error is permanent, it will even try to reconstruct the sector and write it back in the hopes that the drive will reallocate it (most drives only reallocate sectors on writes).
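The retry-then-repair policy described above can be sketched roughly as follows. This is not the Linux md driver's actual code: the function name, the callback parameters, and the fixed retry count (standing in for a timeout) are all hypothetical and only illustrate the idea.

```python
def read_with_recovery(read_sector, reconstruct_sector, write_sector,
                       max_retries=3):
    """Read a sector, tolerating transient errors before giving up.

    read_sector, reconstruct_sector and write_sector are hypothetical
    callbacks supplied by the caller; max_retries stands in for a timeout.
    """
    for _attempt in range(max_retries):
        try:
            return read_sector()      # a transient glitch may clear on retry
        except OSError:
            continue
    # The error looks permanent: rebuild the data from the other drives and
    # write it back, hoping the drive reallocates the bad sector on write.
    data = reconstruct_sector()
    write_sector(data)
    return data
```

Only if the reconstruction or the write-back also fails would it make sense to mark the drive as failed.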
Re:forget RAID? by juventasone · 2007-02-21 14:57 · Score: 1

Interesting, but it's hard to see it as anything but a defective design. Or at the very least the product should specifically not support hot spares, or drives over a certain power requirement. It is believable though, since large arrays initiate by staggering their spin-up, although they don't start any read/writes until it's all up.

Schroeder's disk... by Anonymous Coward · 2007-02-20 14:42 · Score: 2, Funny

is neither working nor broken... Unless you look at it of course ;)

How much does handling matter? by RebornData · 2007-02-20 14:43 · Score: 5, Interesting

What's interesting to me is that neither of these papers mentions the issue of pre-installation handling. The good folks over at Storage Review seem to be of the opinion that the shocks and bumps that happen to a drive between the factory and the final installation are the most significant factor in drive reliability (much more than brand, for example).

The Google paper talks a bit about certain drive "vintages" being problematic, but I wonder if they buy drives in large lots, and perhaps some lots might have been handled roughly during shipping. If they could trace back each hard drive to the original order, perhaps they could look to see if there's a correlation between failure and shipping lot.

-R

Re:How much does handling matter? by Reziac · 2007-02-20 16:18 · Score: 1

I got to thinking about that very question when I noticed a distinct dichotomy in HD lifespan:

Long reliable life, usually 5+ years: HDs purchased new; *and* used HDs salvaged from random sources.

Short iffy life, typically 3 to 6 months: HDs *purchased from vendors of USED drives*.

Now, why is this? Speaking from what I've observed at computer swapmeets, it's because the used-HD vendors schlepp their merchandise around any which way, and it gets banged around like that every weekend until some sucker buys it. Whereas random salvage hasn't been smacked this way and that nearly as much -- probably just during initial shipping, and during its last trip to the dustbin.

BTW, where can I find the google paper? I seem to have missed the link (going blind from peering at too many charts :)

--
~REZ~ #43301. Who'd fake being me anyway?
Re:How much does handling matter? by Anonymous Coward · 2007-02-20 17:22 · Score: 0

People keep writing this, but I don't see how it can be. Have you read a drive's physical specs? They're supposedly rated to 25+ gees...
Re:How much does handling matter? by ForestGrump · 2007-02-20 18:41 · Score: 3, Informative

the google paper was posted a day or 2 ago. let me find it.
here you go
http://hardware.slashdot.org/article.pl?sid=07/02/18/0420247

--
Is it true that more people vote for the winner of American Idol, than vote for the president? -Ali G.
Re:How much does handling matter? by Reziac · 2007-02-20 18:59 · Score: 1

Thank you! I found the PDF after I'd posted that, but totally missed the discussion here. A number of interesting posts yearn to be read. :)

--
~REZ~ #43301. Who'd fake being me anyway?
Re:How much does handling matter? by Anonymous Coward · 2007-02-20 19:09 · Score: 0

I'd be much more interested to know if power supplies have effect.

There are some reported cases of file system failures due to faulty power supply, so "bad" PSU could have big impact.

I think reports like this will increase demand and use of ZFS.
Re:How much does handling matter? by jez9999 · 2007-02-20 21:23 · Score: 1

The good folks over at Storage Review seem to be of the opinion that the shocks and bumps that happen to a drive between the factory and the final installation are the most significant factor in drive reliability (much more than brand, for example).

So, perhaps there's an opportunity for a brand to get an increased reputation for itself in reliability by researching ways of protecting drives against knocks and bumps?

--
== Jez ==
Do you miss Firefox? Try Pale Moon.
Re:How much does handling matter? by cowbutt · 2007-02-21 00:34 · Score: 1

perhaps there's an opportunity for a brand to get an increased reputation for itself in reliability by researching ways of protecting drives against knocks and bumps?
Like Seagate's SeaShell packaging, you mean? Perhaps this is why Seagate is the only manufacturer who still warranties all their drives for at least five years?

Lemon or not by Bullfish · 2007-02-20 14:44 · Score: 1

I doubt MTBF fits into anyone's thoughts when buying a drive, unless they are buying bulk or such for a business and have to justify the choice. I am only talking about home use here.

Personally I have only ever had one drive go on me (a Quantum Sirocco) in 10 years. For myself, and most home users, that's a great track record. On the other hand, I have had friends and relatives whose drives just up and quit. New ones, old ones, many brands. As long as you buy a major brand, they seem to be more or less equal in practice.

That said, with drives going at 10K rpm, the heat, etc, there are going to be lemons. I suspect that will always be the case as long as we use mechanical drives. I am not surprised warranty periods dropped about the time drives began to exceed 7200 rpm. Always remember to back up data that's important and keep those receipts.

Re:This paper and the Google paper are complementa by Anonymous Coward · 2007-02-20 14:51 · Score: 1, Interesting

I'm particularly pleased to see a stake driven through the heart of "SCSI disks are more reliable."

I have been saying that for at least 10 years. Back then I worked at a large government contractor and we set up what was then a very large 2 TB array of SCSI drives (about 100 drives). Those damn things were "industrial grade" certified by a large well known server vendor yet we were losing 2 or 3 drives per day for several months. Totally ridiculous, because I extrapolated the failure rates of IDE drives from another government setup and found they were actually much better than the SCSI drives, and they weren't even rated for heavy duty usage.

Of course prior to this article the group-think Slashweenies would moderate me into oblivion (probably will anyway, but meh).

all this is moot by billcopc · 2007-02-20 14:56 · Score: 3, Insightful

Hard drives die often because the manufacturers build them cheaply, the same as every other component in a PC. Why would they ever make a bulletproof hard drive ? They'd go out of business!

Sure, some of them end up being replaced under warranty, but a lot of them don't, and so Maxtor/IBM/Hitachi make another buck off your sorry ass. There isn't a sane server admin that doesn't keep a set of spares in his desk drawer, because it's not a question of "if" it dies but WHEN. Hell, most decently-geared techies have a whole box of hard drives, pre-mounted in hotswap bays ready to rock. And if it weren't for the fact that I was just laid off a month ago, I'd be buying a couple spare SATA drives myself, I just have a funny feeling something's going to go tits up in my media server. I haven't had any warnings or hiccups, but I just know the Seagate devil's planning his move, waiting for 2 drives to start straying so he can kill my Raid-5 nice and fast. Hard drives are little more than Murphy's Law in a box.

--
-Billco, Fnarg.com

Re:all this is moot by Diordna · 2007-02-20 16:07 · Score: 2

No, they wouldn't. People buy new drives because the price of storage keeps going down and the size of the average file keeps going up.
Re:all this is moot by Anonymous Coward · 2007-02-20 16:19 · Score: 0

Why would they ever make a bulletproof hard drive ? They'd go out of business!

Of all the conspiracy theories I see on slashdot, this has to be the absolute dumbest.

Hard drive capacity is growing by approximately a factor of two every year. Demand for capacity keeps pace, because that's just how things are. HD manufacturers make almost all of their money from hard drive upgrades, not replacements for failed units. If somebody waved a magic wand and suddenly all hard drives were immune from any kind of failure, I doubt their revenues would take any noticeable hit.

RAM manufacturers are in pretty much the same boat and yet RAM lasts basically forever. If an HD manufacturer could make their products equally bulletproof without making them hugely expensive they'd jump on it in a second; it would be a huge advantage over their competitors.
Re:all this is moot by RAMMS+EIN · 2007-02-20 19:26 · Score: 1

Unfortunately, flash cards don't help much. I recently put my root filesystem and the most frequently accessed parts on flash to save power. I also figured the lack of moving parts would reduce seek times and, hopefully, make the system more reliable.

Well, the last bit certainly worked out. I can absolutely rely on the system going catatonic after a few days, due to I/O errors. I have no idea if the problem is in the flash card (controller), the reader, the USB controller, the PCI bus, or the software. All I know is that if I reboot the system, everything is fine again...for a few days.

If anybody knows how to fix this, or even what the problem is, please post.

--
Please correct me if I got my facts wrong.
Re:all this is moot by poot_rootbeer · 2007-02-21 03:40 · Score: 1

Why would they ever make a bulletproof hard drive ? They'd go out of business!

That might be true if demand for greater performance and capacity weren't perpetual.

Sure, it'd be nice if a hard drive were guaranteed to work flawlessly for 20 years -- but what use would I have today for a 1987-vintage 60MB drive and its astonishing 1.5MB/s transfer rate? They give out bigger, faster devices for free at trade shows now.
Re:all this is moot by WhoBeDaPlaya · 2007-02-22 03:57 · Score: 1

Maybe you should just use them in a JBOD like fashion?
Re:all this is moot by billcopc · 2007-03-07 07:22 · Score: 1

While the 60MB drive might be ridiculously small by today's standards, the fact that most drives die within a few years is scary for your data. If my old 40GB Maxtors hadn't blown up a gazillion times, I still probably would be using them today, in favor of the 750GB monsters I have now, but at least I'd still have the data.

Make a drive that can die while still being able to pull data off (without paying some smartass a ton of money) and I'll be just as happy. The device is cheap, what's stored on it is priceless.

--
-Billco, Fnarg.com

Exponential with time by tedgyz · 2007-02-20 14:58 · Score: 2, Informative

All the hard drives I installed in my family's computers have failed in the last 5 years - including mine. :-(

Waaaah! They cry, when I tell them there is no hope for the family photos, barring a media reclamation service == $$$

I tell everyone: "Assume your hard drive will fail at any moment, starting now! What is on your hard drive that you would be upset if you never saw it again?"

--
"No matter where you go, there you are." -- Buckaroo Banzai

No mention of the co-author? by Petro123 · 2007-02-20 15:00 · Score: 1

This paper was co-authored by Garth Gibson!

Re:This paper and the Google paper are complementa by StikyPad · 2007-02-20 15:15 · Score: 1

Almost makes some of these posts look like these in retrospect.

--
https://www.eff.org/https-everywhere

Re:moving parts - Don't always wear out by AndersOSU · 2007-02-20 15:19 · Score: 1

Yeah I wouldn't worry about quantum effects on machines built at the quantum level either.

Nothing I knew about hard drives was mentioned by AllParadox · 2007-02-20 15:21 · Score: 3, Insightful

As mechanical devices, hard drives are appallingly reliable.

The electronics on the hard drive rank as major players in heat generation in the boxen.

Heat kills transistorized components.

"Hard Drive Data Recovery" companies often have nothing more sophisticated than a hard drive buying program, and very competent techs soldering and unsoldering drive electronics. They buy a few each of most available hard drives, as the drives appear on the market. When a customer sends them a hard drive for "recovery", the techs find a matching drive in inventory, disconnect the electronics, and replace the electronics in the drive. The percentage of drive failures due to mechanical failure is very low.

When I bought a desktop computer for an unsophisticated family member, I also purchased and installed a drive cooler - a special fan that blows directly on the drive electronics.

I was very concerned about MTBF. I just assumed that the manufacturer's information was totally irrelevant to my situation - a hard drive in a corner of the tower, covered with dust, and no air circulation.

I occasionally pick up used equipment from family and friends. Usually, it is broken. Often, it is the hard drive. What is amazing is not that they failed, but that they lasted so long with a 1.5 inch coating of insulating dust.

I suspect this would also explain the rising failure rate with time. Nobody seems to clean the darned things. They just sit and run 24/7/365, until they fail.

--
All is paradox. Retired lawyer, so this is just one more layman's opinion.

Re:Nothing I knew about hard drives was mentioned by flyingfsck · 2007-02-20 15:44 · Score: 1

Sorry to burst your bubble, but the Google paper claims that temperature has no effect on the failure rate.

--
Excuse me, but please get off my Pennisetum Clandestinum, eh!
Re:Nothing I knew about hard drives was mentioned by AllParadox · 2007-02-20 15:53 · Score: 2, Funny

"temperature has no effect on the failure rate"

Said by people who do not know how to light off a cutting torch.

Trust me, I *can* make 'em fail.

Real quick, too.

--
All is paradox. Retired lawyer, so this is just one more layman's opinion.
Re:Nothing I knew about hard drives was mentioned by HPNpilot · 2007-02-20 16:31 · Score: 1

Over the years I have seen a good number of drive failures. In a lot of cases people have asked (and are willing to pay for my time and expenses) if I could recover data. I am successful well over half the time, and that is without any special equipment.

Many failures are indeed in the electronics board. eBay is a great resource to find used but functional drives. Really, really try hard to get a drive with a close serial number. It sometimes makes the difference between success and failure. You do not need to do any soldering to replace the board; I have never seen a board soldered to the drive mechanism on any 5.25, 3.5 or 2.5 inch drive. Be aware some drives store parameters in EEPROM which match bad blocks of the specific drive mechanism; some also store other calibration data. You will not always get all the data, and not in nice neat order. You will need a disk data recovery program to look at all the FATs and backup copies and piece things together.

I have gone even further, opening up the drive assembly. Be prepared to lose everything when you do this, but I can tell you it is a myth that the drive will self-destruct immediately if not in a clean room. I have run 10k SCSI drives for hours after replacing head and preamp assemblies and recovered data. Then I take the magnets out because the drive should never be used again (I have a wicked collection of astoundingly strong magnets).

As you guessed from the last paragraph I have seen failures of the read amps which are mounted frequently on the flex circuit going to the heads. I have replaced these chips but the patience of a saint is required not to mention a steady hand, magnifying light, and Pace SMT workstation.

I have only seen a few real head crashes. The absolute worst was a Barracuda SCSI of mine (figures) which wiped half the metalization off the platters. Then I found my DAT backups were useless. That one I decided to send to a pro shop where they determined the internal read preamp for a critical track died, probably due to heat, and somehow this caused full deflection head oscillation until a crash occurred. At least that's what they told me. The drive was in a system where the air conditioning failed and the room got to 160 degrees F while I was away.

Heat kills, indeed, and it shows up in greatly reduced MTBF of solid state components or in my case outright failure. I have not seen many wear out failures of the mechanicals, perhaps that is because I cater mostly to home and small business users who tend to turn their systems off at night. I also recommend replacement of drives where the bearings get noisy as the noise is caused by vibration and vibration can cause head crashes, and enough of those can cause a catastrophic crash with the head flipping over and scraping (which causes dust which causes the other heads to crash...).

I'm getting too old for these repairs. I can't wait until the solid state drives become common.

Peter
Re:Nothing I knew about hard drives was mentioned by Anonymous Coward · 2007-02-20 16:43 · Score: 0

The percentage of drive failures due to mechanical failure is very low.

I'm dubious. I know it's only anecdotal, but I've seen many hard drives fail and it's always been mechanical. I know it's mechanical because they make really horrible noises as they die, either the grinding noise of shot bearings or the click and ping of persistent head crashes. I've never personally seen a drive just die, they always go through horrible death throes first.
Re:Nothing I knew about hard drives was mentioned by Technician · 2007-02-20 19:01 · Score: 1

"Hard Drive Data Recovery" companies often have nothing more sophisticated than a hard drive buying program, and very competent techs soldering and unsoldering drive electronics.

Have you taken apart a drive lately? The electronics on most drives unplug.

--
The truth shall set you free!
Re:Nothing I knew about hard drives was mentioned by vidarh · 2007-02-20 20:41 · Score: 1

I have gone even further, opening up the drive assembly. Be prepared to lose everything when you do this, but I can tell you it is a myth that the drive will self-destruct immediately if not in a clean room.
While I haven't done this with any modern hardware, my first hard disk (a 20MB one with an XT interface) was second hand, and had a problem with the motor that caused problems getting it to spin up. I frequently opened the drive assembly, and spun the drive up with my finger (obviously I avoided touching the platter itself...). Usually it would get enough momentum after a few tries to let it spin up fully. It kept working for another 6 months or so until I finally replaced it. No problems with read/write errors at all.
Re:Nothing I knew about hard drives was mentioned by drsmithy · 2007-02-20 23:08 · Score: 0, Redundant

The electronics on the hard drive rank as major players in heat generation in the boxen.
I think you'll find the spinning disks play a bigger part in the heat generation.
Re:Nothing I knew about hard drives was mentioned by RalphTheWonderLlama · 2007-02-21 03:43 · Score: 1

Then maybe you should partner up with a solid data recovery company. I have one I could recommend :)

We do have specialized equipment (in addition to our experienced techs), even some hardware from Russia, and a large inventory of thousands of parts drives with all attributes stored and searchable in our database. Even with that we occasionally have to order a matching drive. There are just so many variations out there to deal with and occasionally finding those can be very difficult if we don't have one in our inventory. We have some vendors that help us in the search as well.

Please contact me if you are interested or want to learn more.

--
simple, fast homepage with your links: http://www.ngumbi.com/
Re:Nothing I knew about hard drives was mentioned by nmos · 2007-02-21 07:37 · Score: 1

When I bought a desktop computer for an unsophisticated family member, I also purchased and installed a drive cooler - a special fan that blows directly on the drive electronics.

There was a period where I tried using these kinds of coolers for the machines I built. Unfortunately I found that one of the few things less reliable than a hard drive is a cheap HD cooler. Not only do the fans tend to fail, but most are designed in such a way that if the fan does fail, the "cooler" ends up acting as insulation and/or blocking proper air flow, actually making things worse.
Re:Nothing I knew about hard drives was mentioned by nmos · 2007-02-21 07:40 · Score: 1

Sorry to burst your bubble, but the Google paper claims that temperature has no effect on the failure rate.

You might want to read it yourself because that's not what it said.

Everything You Know About Dupes Is Wrong by Jeff+DeMaagd · 2007-02-20 15:28 · Score: 0, Offtopic

I suppose dupes are good!

HD wear mostly a non-issue by callmetheraven · 2007-02-20 15:37 · Score: 1

Several boxes in my office closet contain a pretty good history of desktop PC hard drive technology from about 1988-2005. Much like archaeological sediments, on the bottom you will find the oldest, 10MB and 20MB drives, capacities increasing as you move up through the layers, and at the top the most recent addition, a 30GB pulled from a Dell retired last Xmas. All of these HDs were retired in good working order, and as far as I know they all still work. Every one of them succumbed to the REAL nemesis of hard drives, that is they were swallowed up by new drives with 10x their capacity.

Sure, I've seen a couple drives fail, but they've been few and very far between. I've seen a lot more drives run long beyond their usefulness whilst packed solid in dust-bunnies, running scorchingly hot, on questionable power, some even sticky with spilled Mountain Dew.

Just be sure to get good backups, and enjoy the cheap storage.

--
You can have my SIG when you pry it from my cold, dead hands.

Hard drives on consumer PCs by jmorris42 · 2007-02-20 15:39 · Score: 1

> I tell everyone: "Assume your hard drive will fail at any moment, starting now! What is on your
> hard drive that you would be upset if you never saw it again?"

True enough, I use a similar warning. Mine is, "Don't leave anything on your hard drive you care about. If you manage to make it a year without reloading Windows the drive can crap out with no warning. Burn anything you can't download again to a CD/DVD."

Personally I don't have to worry about Windows and I have a RAID5 at home.... but I still burn anything I care about. Important stuff like photos get backed up to a DVD-RAM until I fill it then I burn two DVD-R copies on different brands of quality media.

The problem is hard drives have become freaking huge. Where can you backup a modern large drive? We are back where we were when backing up a 60MB drive meant a crate of floppies, only now we need a spindle of DVD-Rs and we actually need more time. We need those holographic DVDs!

I have taken to recommending RAID1. It is cheap and almost any non-laptop can do it these days. With drives as unreliable as they have become it makes sense for anything other than a gaming rig.

--
Democrat delenda est

Re:Hard drives on consumer PCs by Anonymous Coward · 2007-02-21 07:31 · Score: 0

"Where can you backup a modern large drive?"

A USB hard drive...
Back up regularly, but be sure to keep it disconnected when not in use. I usually find that the temptation to nuke backups for more MP3 space gets too strong if it's not locked up somewhere.
Re:Hard drives on consumer PCs by tedgyz · 2007-02-21 14:57 · Score: 1

I use RAID1 for my linux server.
I use Norton Ghost (formerly DriveImage) for my C: drive backups.
Everything else (email files, media, etc.) is replicated over 2 or 3 PCs.

IMHO, 3 hard drive copies is equal to or better than spindles of backups. Why? A backup isn't a backup until it has been successfully restored. It is painfully difficult to verify backups. You also run into media decay.
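
The "a backup isn't a backup until it has been restored" point can be checked mechanically. What follows is only a minimal sketch, not anyone's actual tooling: it assumes the backup has already been restored into a scratch directory, and the names `file_digest` and `verify_restore` are made up for illustration. It compares SHA-256 digests file by file and reports anything missing or different.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(original_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored tree."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for src in sorted(original_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(original_dir)
        dst = restored_dir / rel
        # A file counts as bad if it is absent or its contents differ.
        if not dst.is_file() or file_digest(dst) != file_digest(src):
            problems.append(str(rel))
    return problems
```

An empty list means every file in the original tree came back bit-for-bit. It says nothing about files that changed after the backup was taken, so it is best run right after the backup.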

--
"No matter where you go, there you are." -- Buckaroo Banzai

Exception to the Rule by camperdave · 2007-02-20 15:56 · Score: 1

I've had more RAM chips die than hard drives.

Really! My experience is just the opposite. I've had three drives fail within the past year or so, but I've never had a RAM chip fail. I would guess that hard drives fail more often than RAM chips, and that your experience is the exception to the rule. (Perhaps some better grounding would help.)

I've only seen flash fail once, and that was a failure of my USB key to turn up after disappearing into the crack between the sofa cushions. Other than that, my flash experience is the same as yours: flawless in low volume usage.

--
When our name is on the back of your car, we're behind you all the way!

So... by maglor_83 · 2007-02-20 16:01 · Score: 1

Nothing is wrong? Phew!

Re:moving parts - Don't always wear out by Detritus · 2007-02-20 16:26 · Score: 1

Until it encounters an energetic cosmic ray or an alpha particle.

--
Mea navis aericumbens anguillis abundat

and Google contradicts. by bill_mcgonigle · 2007-02-20 16:33 · Score: 4, Interesting

Well, the article actually says that drives don't have a spike of failures at the beginning.

Hmm, the Google paper says they do, from 3-6 months (Figure 2).

Which leaves us with confirmation that 50% of all studies are wrong.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)

Re:and Google contradicts. by mauthbaux · 2007-02-20 19:59 · Score: 1

Which leaves us with confirmation that 50% of all studies are wrong.

You should do a study on that.

--
"Operating systems suck: you're better off using only the BIOS" --trainsaw.com
Re:and Google contradicts. by stonecypher · 2007-02-22 07:31 · Score: 1

Which leaves us with confirmation that 50% of all studies are wrong.

Nah. This is just one of the dangers of referring to studies in descriptive and literate terms, rather than statistical terms. I'll make up data to make the concept easier, but when you see it, it's straightforward. Let's say we have a brand of drives which has a graph representing failure rate from the remaining drive population. There's a 1% over norm at two weeks, which probably represents drives damaged during shipping. There's a 5% over norm at three months, which probably represents drives manufactured with defective or badly attached components, particularly the drive arm or platter spindle assemblies. Then there's a slow growth 15% over normal at the two year line - we'll assume the drives are manufactured for five years - which probably represents disks under near-constant wear and tear, such as those in servers or cluster farms.

The issue here is that if you ask five people, you're going to get five different answers about whether the graph has spikes. One person will say all three are significant spikes. One person will see two, and discard the 1% as an aberration, or come to the same opinion I did about the reason and decide to disclude it on expected damage (there are people that do things like that :( .) One person will only see the big spike. One person will see the 1% as a pre-echo of a 5%, or possibly both as pre-echoes of the 15%. One person will ignore or fail to notice the 15% because it's a slow growth, which makes it look like a bulge instead of an EKG, and they don't understand the significance.

The thing is, only two of those five viewpoints are mistakes. Three of them are quite valid, and it's just a matter of opinion about the importance of the characteristics of the data. If three studies were done, and if all three came up with the same data, you could still get three entirely different descriptions of the situation.

The old phrase "with training you can make statistics say anything you want" may be a bit acerbic, but there's a grain of truth to it; we all do it, and it's not a destructive thing unless it's intentionally distortional. Indeed, the purpose of citing the statistics in the first place is to get them to say something.

That said, there is a lot to be said for reading the description to get a sense of what the author believes, but reading only the data when trying to form your own beliefs.

Just because there are incompatible interpretations of data doesn't mean that the data is incorrect.

--
StoneCypher is Full of BS
Re:and Google contradicts. by bill_mcgonigle · 2007-02-23 03:21 · Score: 1

Thanks, good post.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)

Re:This paper and the Google paper are complementa by bill_mcgonigle · 2007-02-20 16:38 · Score: 1

"oh, we bin-out the SCSI disks
after testing"

As I understand it, the kernel of truth in those claims is that the testing/sampling rate on the SCSI assembly line is higher. I don't know how much higher or if it's statistically significant, but I've heard from quality engineers who work in some of these plants that they do (or did a couple years back). I expect they still do even if it's only to have something marginally defensible to back up their salesmen's pitches.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)

disk spin-up is most responsible for failure ? by cats-paw · 2007-02-20 16:52 · Score: 3, Interesting

I keep hearing this persistent rumor that it's disk spin-up which is the most significant contribution to disk failure. The moral of the story is that systems which are left on 24/7 are less likely to see HD failures than systems turned on/off everyday.

Now if that's really true, wouldn't it be quite simple for the manufacturers to simply spin-up the disk more slowly by putting in very simple and reliable motor control circuitry ?

Does anyone have any real evidence, i.e. not anecdotal, that this is really true.

--
Absolute statements are never true

Re:disk spin-up is most responsible for failure ? by Alien+Being · 2007-02-20 18:58 · Score: 1

Turning a machine off, power supply and all, and letting it cool down is probably severe compared to just spinning down an idle disk, but the latter might happen several times an hour so the cumulative effect might be worse.

A slower spinup might help, but I suspect the biggest problems with frequently starting and stopping the drive are related to:

Heat cycles causing materials (metals, coatings and lubricants) to distort, breakdown, stick, etc.

Power problems caused by relatively high current requirements at startup.

Head crashes during takeoffs and landings. Now that I think about it, maybe they spin the disk up quickly for a smoother takeoff (at least on disks which land the heads on the disk surface).
Re:disk spin-up is most responsible for failure ? by Doug+Coulter · 2007-02-21 03:54 · Score: 1

Back in the day, I worked for DEC as a tech, and did a lot of disk work. Back then spin ups and downs were a major problem, as there was no such thing as a winchester sealed drive to keep dust out, and the heads (more so in this day) fly very low to the disk. When the disk is stopped or slowed, the heads rub. Even if they are parked outside the data area, there's a chance to kick up some stuff from the disk surface that will then get between the head and the disk on the data area and cause failures.
Having said that, and having maintained a network of about 10 machines for the last decade or so (we write software here), we find that plain old time is about as good an indicator of failure as there is, along with the odd bad batch. In other words, computers infrequently powered up still had about the same disk lifetimes on a calendar given a certain brand and lot. We start getting nervous at a few years out. By then, most times the mobo has been upgraded etc anyway, so it's no big deal to put in a new drive while at it. As computers are how we eat, we feel the hardware costs of ensuring reliable ones is trivial compared to missing a few big paychecks.
We got bitten by the IBM deskstars, and it's the only time we'd ever lost any actual data, about a day's work. We were using machines to cross backup one another and in a spell of hot weather these things dropped like flies, faster than we thought we had to recover them, oops. That was surely a mess. We also have a ton of Seagate 2gb drives from when that was big, and none have failed, ever! They are rarely used now of course.
For some reason, we have some brand loyalty now, and always get drives a little down from the peak available, the ones that have been around longer and have a little more margin in the technology than the biggest/fastest ones around. This seems to be working out. Cross backups machine-machine on our network works pretty well if you do it enough! Of course it is much easier to backup and restore a linux box...as the disk organization is far cleaner -- no need or point in backing up most of the drive, just the user-changed parts. For windows boxen we image a drive and keep it in the machine, but unplugged. This saves a re-install of the system and all those apps, but you have to remember to do it often enough.

Obvious Solution by Deliveranc3 · 2007-02-20 17:28 · Score: 1

Why haven't we moved to a system where you NEVER delete?

Security would become a concern (But not too much of one, you should already shred your drive, and if you could overwrite all of one type of bits with the other one)...

This would require a new type of file system (One with a pretty strange [or flash based] file table). But you could have data that would last thousands or millions of years, and considering how many dots I can fit on a piece of paper and how much I suck at making dots, a pretty damn large storage size.

Re:Obvious Solution by vidarh · 2007-02-20 18:46 · Score: 1

This would require a new type of file system (One with a pretty strange [or flash based] file table).
No. It would be a straightforward variation on a log structured filesystem.
A naive implementation of a log structured filesystem is as a virtual never ending sequential log overlaid on a disk that is treated as a circular buffer - when you "wrap around" deleted space may be reclaimed by doing a new write of the undeleted data from the start of the disk to the end of the log (though there are many alternative ways of handling this that may be more effective).
If you have enough space you don't need to wrap around, and thus don't need to delete anything either. Unfortunately, most of us do regularly run out of space.
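
A toy version of that idea, as a rough sketch only: the `LogStore` class and its methods are hypothetical, it lives entirely in memory, and it uses a plain Python list that gets rewritten with the live records when full rather than a true circular buffer on disk. Writes and deletes are both appended to the log, and space is only reclaimed when the log hits capacity.

```python
_DELETED = object()  # tombstone marker appended to the log on delete


class LogStore:
    """Toy append-only log on a fixed-size 'disk' (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.log = []    # sequential (key, value) records, oldest first
        self.index = {}  # key -> position of its newest live record

    def _append(self, key, value):
        if len(self.log) >= self.capacity:
            self._compact()  # the "wrap around": reclaim dead space
        if len(self.log) >= self.capacity:
            raise OSError("disk full even after reclaiming deleted space")
        self.log.append((key, value))
        if value is _DELETED:
            self.index.pop(key, None)
        else:
            self.index[key] = len(self.log) - 1

    def write(self, key, value):
        self._append(key, value)

    def delete(self, key):
        self._append(key, _DELETED)

    def read(self, key):
        pos = self.index.get(key)
        return None if pos is None else self.log[pos][1]

    def _compact(self):
        # Rewrite only the newest live record for each key, in log order.
        live = sorted(self.index.items(), key=lambda kv: kv[1])
        self.log = [(k, self.log[pos][1]) for k, pos in live]
        self.index = {k: i for i, (k, _) in enumerate(self.log)}
```

With a large enough `capacity`, `_compact` is never called and nothing is ever overwritten, which is the never-delete case. With a small one, old versions and deleted keys disappear at the first wrap.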
Re:Obvious Solution by drsmithy · 2007-02-20 23:12 · Score: 1

Why haven't we moved to a system where you NEVER delete?
Because disk space is a finite resource.
What's that got to do with disk reliability, anyway ?
Re:Obvious Solution by Deliveranc3 · 2007-02-23 06:07 · Score: 1

With the proposed solution there would be no way to write over half the bits, no deleting or reclaiming used space in any way. Think writing on paper, or a simple chemical reaction or something similar.

Is comparing MTBF correct???? by rayzat · 2007-02-20 17:31 · Score: 1

Is comparing MTBF correct when saying different drive types are of the same reliability? By looking at an existing system you really aren't looking at independent variables. Let's say I have two servers, one that hits the drives a lot and another that barely touches the drives, so using what I'll call "the old rules of thumb" I would put the FC drives on the intensive server and SATA drives on the less intensive server. So after 1 year I get my first drive failure on each. One could conclude that SATA is as reliable as FC, but is it really? I set up my environment to more heavily hit the FC drives, so when would the SATA drive have failed if I had placed SATAs where the FCs were? The only way to really compare drives would be to hit a large number of each different drive type with the same workload. If you look at the way most places do tiered storage they'll put highly accessed data on an FC tier and then migrate less used data to a SATA tier, this might be one reason why the failure rates of the two drives look the same.

OSS Software RAID, too. by Kadin2048 · 2007-02-20 17:43 · Score: 4, Insightful

On the other hand, you could get a cheap drive controller, and do software RAID, using OSS tools; the setup might be more complex than hardware RAID, but there shouldn't be any issues with recovering your data later due to the format it's written in.

I agree though, that for most people, some sort of "userland RAID" where the disks are just mounted as regular volumes to the filesystem, and then you just write the data twice, is probably the best bet. There's no format problems, and you'll always be able to pull a drive out, stick it in another machine, and get at your data.
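
The "just write the data twice" approach needs very little code. Below is a minimal sketch under stated assumptions: the two disks are already mounted as ordinary directories, and `write_twice` is a made-up helper name, not part of any RAID tool. It copies a file to each mount point and compares the copy byte for byte against the source.

```python
import filecmp
import shutil
from pathlib import Path


def write_twice(src, mount_points):
    """Copy src into each mounted directory and verify every copy."""
    src = Path(src)
    copies = []
    for mount in mount_points:
        target = Path(mount) / src.name
        shutil.copy2(src, target)  # copies contents plus timestamps
        # shallow=False forces a byte-by-byte comparison of the contents.
        if not filecmp.cmp(src, target, shallow=False):
            raise OSError(f"{target} differs from {src}")
        copies.append(target)
    return copies
```

Each copy is a plain file on a plain filesystem, so either disk can be pulled and read in another machine. Unlike real RAID1 there is no automatic resync, so anything that changes on only one disk stays out of step until it is copied again.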

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Shipping definitely risky by ben_rh · 2007-02-20 17:52 · Score: 1

You're absolutely right. I do the IT work for an office of about 30 people, which recently has involved building a couple of new servers and setting up a solid backup plan. In turn this has involved buying around a dozen new disks over the course of about 6 months.

Of these drives, two (one each from two different retailers) were damaged upon unpacking them. One had a section of the plastic surrounding the jumper block punched in at an angle, and the other had a small but obvious bit of rasping on the metal of one side. (Once the damage was found, it was obvious that there was corresponding damage to the packaging, but in the process of paying etc at the store I hadn't looked close enough. Lesson: with fragile goods such as disks, insist on unpacking them in the store before paying.)

I returned both of course; the first one the retailer wouldn't accept back until I kicked up a stink and demanded to speak to the manager; the second took it back with no arguments (and even an apology).

I guess the thing is though, for every drive that has visible damage, how many have been mishandled and will show a correspondingly shorter lifetime without having any obvious damage? I can imagine many ways these drives could be dropped or shocked without leaving marks.

Solid state? by SuperKendall · 2007-02-20 18:10 · Score: 1

Every single mechanism with moving parts will fail. It's just a matter of when. In a few years, when everybody is using solid state drives, people will look back and shake their heads, wondering why we were using spinning magnetic platters to hold all of our critical data for such a long time.

You mean the "solid state" of the CF card I am recovering just this moment because the partition table magically became hosed the moment I removed the card from the camera? Or the "solid state" of the CF card that died so hard Lexar themselves could get nothing from it?

Just because something does not have moving parts does not mean it'll last forever, and if you keep a solid state device in operation long enough you'll figure that out.

--
"There is more worth loving than we have strength to love." - Brian Jay Stanley

Replicate with spare by SuperKendall · 2007-02-20 18:15 · Score: 1

This paper proves something I've always suspected - that the best backup solution for a small organization (or a home) is RAID 1 (mirror) along with swapping out one of the drives regularly. You then have three copies (as Google mandates) and also a little bit of a buffer for recovery in case you delete or modify something you should not have.

All those people who have RAID5 at home are just asking for trouble, especially if there is ever a fire...

--
"There is more worth loving than we have strength to love." - Brian Jay Stanley

Re:This paper and the Google paper are complementa by whitis · 2007-02-20 18:44 · Score: 1

Google didn't see overtemp failures only because google kept their drives cool. Possibly too cold. Their graph
cuts off at only 10 degrees hotter than a typical PC. But if you extrapolate the data on the right hand side of
the graph, you see that drives fail at higher temperatures just as expected. Also, they appear to have looked
at average temperatures over the life of the drive, not the temperatures near the time of failure. And they
totally ignored temperature fluctuations.

In fact, the conclusion one should reasonably draw from their data (if it can be trusted and I called that into
question saturday) is that drives are designed to operate at 40 degrees C (which, happens to be the operating
temperature of the hard drive on this machine right now in a typical mid tower case) and that any deviation higher
or lower will result in increased failure rates:

But it is also possible that the cooling systems, and not the temperatures themselves, are responsible for the
drive failures seen in Google's systems. They had some hard drives (the ones particularly responsible
for the low temp failures) that were operating at around room temperature. With light fan cooling, a drive
operates at around 20 celsius degrees above ambient. So how do you get an operating temperature around
room temperature? You cool the server room to freezing, you put A/C evaporator coils inside the server boxes or racks,
you water cool them, or you sandblast the drives with hurricane force winds (slight exaggeration). All of those
approaches raise the possibility of creating environmental hazards other than temperature.

But it is quite possible that it is just the temperature and that drive manufacturers have done the sensible thing and optimized their designs for the typical operating temperature of a drive. I also point out that there are
a number of failure modes associated with over-temp, under-temp, and temperature variation.

In a typical PC, the most likely cause of an overtemp failure is a fan failure.

http://hardware.slashdot.org/comments.pl?sid=222978&cid=18063644

Using Google, ironically, I found at least one example dating back to 2003 of people discussing the effects of
too low an operating temperature (i.e. room temperature) through excessive cooling adversely affecting hard drives (not even getting into industrial or outdoor temperature ranges). And I wasn't even looking for that: http://www.silentpcreview.com/forums/viewtopic.php?t=7677

Conclusion: For best results use a closed loop temperature control system with redundant variable speed fans to keep
the drive itself (not the ambient air) at a constant temperature of 40 degrees C. Or operate your machine with
moderate cooling in an environment comfortable for humans and use software to power down the drive and raise alarms if
it gets much above 50 degrees C. Whether you should shut down if the drive gets below 25 degrees C (after time to
come up to operating temp) is debatable. If you have had a major heating system failure or a broken window in winter, the drive's own heat might be giving it some protection but the drive is also more vulnerable when operating than when shut down.
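
The policy in that conclusion can be written down as two small functions. This is a sketch, not a real monitoring tool: the 40, 50 and 25 degree figures come from the paragraph above, while the function names, the `gain` and `base` values, and the action strings are all invented for illustration. How the temperature is read from the drive and how the fan is actually driven are left out entirely.

```python
def fan_duty(temp_c, target_c=40.0, gain=0.1, base=0.5):
    """Proportional fan control: fraction of full fan speed, clamped to [0, 1].

    gain and base are illustrative guesses, not tuned values.
    """
    duty = base + gain * (temp_c - target_c)
    return max(0.0, min(1.0, duty))


def drive_action(temp_c, warmed_up=True, high_c=50.0, low_c=25.0):
    """Decide what to do about one drive temperature reading (degrees C)."""
    if temp_c > high_c:
        return "power_down_and_alarm"
    # Only complain about cold once the drive has had time to warm up.
    if warmed_up and temp_c < low_c:
        return "warn_too_cold"
    return "ok"
```

The fan speeds up as the drive rises above 40 C and slows down as it falls below, which is the closed-loop part. `drive_action` is the fallback for when the loop cannot keep up, for example after a fan failure.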

Re:Duplication and DRM by Technician · 2007-02-20 18:52 · Score: 1

forget RAID, just replicate the data three times.

Sounds incompatible with most DRM that ties a key to hardware.

--
The truth shall set you free!

Re:Bullet Proof drive by Technician · 2007-02-20 18:58 · Score: 1

They'd go out of business!

Are you kidding? With Moore's law, repeat consumers would build extreme brand loyalty. Let's face it. Even though it works great, there is very little market for my 20 Meg CDC 5-1/4 inch drive on an ST-506 interface.

It's yours for free if you want to pick it up.

--
The truth shall set you free!

I'm not surprised by Anonymous Coward · 2007-02-20 19:18 · Score: 0

Having worked for a disk drive manufacturer, this doesn't surprise me one bit.

The specified MTBF is theoretical - after all you print the data sheets before you start selling drives, and by the time you have some experience making the drive you can't back off. Your customers would crucify you. Also, the competitive pressure is too high to be realistic.

The theoretical failure rate is basically the sum of the component failure rates, under the assumption that there are no surprises. That means you'll see the actual MTBF approaching the theoretical MTBF by the time the product has matured, aka is getting obsolete. If it ever gets that far given the short product cycles.

A realistic failure rate that's about four times the theoretical value sounds about right.
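
For anyone who wants to see the arithmetic, here is a rough sketch of the "sum of the component failure rates" calculation. It assumes each component fails at a constant rate (the exponential model), which is exactly the no-surprises assumption described above; the function names and any MTBF figures fed into them are made up for illustration, not taken from a data sheet.

```python
import math


def series_mtbf(component_mtbf_hours):
    """MTBF of a device that fails when any one component fails.

    Under constant failure rates the rates simply add: rate = sum(1 / MTBF_i).
    """
    total_rate = sum(1.0 / m for m in component_mtbf_hours)
    return 1.0 / total_rate


def annual_failure_rate(mtbf_hours, hours_per_year=8760):
    """Fraction of drives expected to fail within one year of 24/7 use."""
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)
```

Two components rated at 2,000,000 hours each give a combined MTBF of 1,000,000 hours, which works out to an annual failure rate a bit under 1%. A real-world failure rate four times the theoretical one corresponds to an effective MTBF of 250,000 hours, or roughly 3.4% of drives per year.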

Posting as AC for a reason...

Re:moving parts - Don't always wear out by Omestes · 2007-02-20 20:19 · Score: 1

Which is why all computers in the future will be placed in VERY large isolated lead boxes 100 meters below the Earth's surface. As for quantum effects, we won't be able to open these boxes to see if they are functioning, thus they are in a state of super-position, therefore cannot actually fail.

--
A patriot must always be ready to defend his country against his government. -edward abbey

Actually, mostly it DOESN'T contradict by Moraelin · 2007-02-20 20:19 · Score: 5, Insightful

The two don't really contradict each other that much. Google's spike is relatively small and it's really a spike in the first 1-3 months. By the 6th month it's basically settled. In this paper half the time they graph in whole year increments, so that kind of a spike would be averaged into the first year. So, no, they don't contradict each other as such. And in at least one of the graphs by month in this paper (HPC1), there is something that looks like a spike in the first month.
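
The averaging effect is easy to reproduce. The numbers below are invented purely to illustrate the point and are not taken from either paper: a monthly failure rate that starts high for three months, settles, and then sits slightly higher in the second year.

```python
def yearly_average(monthly_rates):
    """Average monthly failure rates into whole-year bins."""
    bins = []
    for start in range(0, len(monthly_rates), 12):
        year = monthly_rates[start:start + 12]
        bins.append(sum(year) / len(year))
    return bins


# Made-up monthly failure rates (percent of population per month):
# elevated in months 1-3, flat for the rest of year one, higher in year two.
monthly = [0.6, 0.5, 0.4] + [0.2] * 9 + [0.3] * 12
yearly = yearly_average(monthly)
```

Month by month, the first three months are clearly the worst in the series. Binned by year, the first year averages about 0.275 against 0.3 for the second, so the yearly graph shows a smooth rise with no early spike at all.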

More importantly, they don't contradict each other in respect to the rest of the curve. With or without that spike, the curve just doesn't look like the bathtub fairy tale that drive makers try to bullshit us with. You're led into a false sense of security that, basically, if a drive didn't fail within the first couple of months, then it'll be at a (nearly) constant and very small probability to fail for the whole next 5 years, and only then it starts rising again. Basically that if you upgrade your drives every 4 years, whatever didn't fail within 2-3 months, heck, it's very unlikely to fail. And the curve just doesn't look that way. The probability to fail rises continuously, and (again whether that spike actually exists or not) after as little as 1 year you're above the starting height of the "bathtub" already.

In retrospect, I don't even know when and why the "bathtub" myth even started. The bathtub distribution was originally for stuff like electronic components, without moving parts. For something with mechanical wear and tear like a hard drive, who the heck came up with the idea that the same curve must apply? Shouldn't it have been common sense all along that it linearly gets more wear and tear?

Both papers also tell us that the manufacturers' MTBF numbers are, basically, pure bullshit. They're some impressive number put there for the benefit of the marketing department, not because someone at Seagate/Maxtor/whatever actually believes that number.

In retrospect, again, we should have had an alarm signal when the manufacturers lowered their warranty from 3 to 1 year. If indeed there was (1) the MTBF they claim, and more importantly (2) the bathtub curve they claim, the reduction wouldn't have even made too much of a difference. I mean, most drives would have failed within a couple of months, followed by barely a trickle of defective drives for the next 5 years straight. Why bother doing the bad-for-marketing thing of lowering the warranty in that scenario? Or did they already know that they lie?

And finally, a very important point is that (again, bullshit marketing claims be damned) there is no difference in reliability between cheap SATA and expensive SCSI and FC. There is this assumption permeating the whole society that if something is expensive, it _must_ automatically be better and more durable than the cheap stuff. That if you buy a big plasma TV, it's automatically better and will last longer than an el-cheapo CRT. (Yeah, right. Plasma is actually known for its decay over time.) A whole edifice of consumerism, conspicuous consumption, and SFV (Stupid Fashion Victim) syndrome is based on that bullshit excuse to spend more than you need to spend. "Yeah, but it'll be better and last longer!" Yeah, right.

I've actually met people who wouldn't even _consider_ putting an ATA drive in any kind of server. "What, you're going to put your enterprise data on ATA drives???" (Said with a perplexed look, as if I had proposed flushing it to /dev/null or something.) Well, now we know they're not actually any worse. If you don't actually need the extra bandwidth or lower latency of a 15,000 RPM drive, then you can just as well drop a SATA drive in that machine. Even for 10,000 RPM, 4.5ms, there are the WD Raptor drives with SATA interface, and they're cheaper than a SCSI or FC drive. For a lot of stuff you don't even need those, a 7200 RPM will do perfectly fine.

--
A polar bear is a cartesian bear after a coordinate transform.

Re:Actually, mostly it DOESN'T contradict by bill_mcgonigle · 2007-02-20 23:59 · Score: 1

The two don't really contradict each other that much. Google's spike is relatively small and it's really a spike in the first 1-3 months. By the 6th month it's basically settled. In this paper half the time they graph in whole year increments, so that kind of a spike would be averaged into the first year.

Yes, that's a good point. I just think it's clear from Google's study that there is an infant mortality effect, and doing that averaging and then claiming there's no infant mortality effect is to lose the trees for the forest.

The bathtub distribution was originally for stuff like electronic components, without moving parts. For something with mechanical wear and tear like a hard drive, who the heck came up with the idea that the same curve must apply?

Somebody who never owned a car, I guess.

Why bother doing the bad-for-marketing thing of lowering the warranty in that scenario? Or did they already know that they lie?

Yes, I think you're right. Note that Seagate has a 5-year warranty on all of their drives. I think that despite their internal data they realized that it was bad for marketing. My first hand experience with Seagate drives tells me they follow a very similar curve to what is claimed in these papers, but many people don't bother doing a warranty exchange on an $80 drive - by the time you get to 3 years your drive isn't worth replacing, so it's a smart strategy, and just about everybody wins.

I've actually met people who wouldn't even _consider_ putting an ATA drive in any kind of server. "What, you're going to put your enterprise data on ATA drives???" (Said with a perplexed look, as if I had proposed flushing it to /dev/null or something.)

Note that the vast majority of people doing storage consulting are also resellers. I happen to be a consultant with a non-resell principle - and I've been recommending SATA drives, always with RAID-1, for a few years now, and with SATA-II/NCQ on a good controller (3ware, etc) I see very little difference for the vast majority of cases. And in nine of ten of the cases where there is a difference, more RAM and caching is as effective as SCSI, cheaper, and has additional benefits. SCSI still does perform better with highly concurrent high-throughput, high-seek applications, but most people don't have that and, besides, that might just be more heads and more mature controllers, not bus, it's hard to tell.

Unfortunately, I'd be making a lot more money if I were just reselling SCSI drives for a 40% mark up instead. It'd be more profitable to be dishonest.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Re:Actually, mostly it DOESN'T contradict by darCness · 2007-02-21 02:31 · Score: 2, Informative

"There is this assumption permeating the whole society that if something is expensive, it _must_ automatically be better"

This is known as the Veblen Effect based on work by Thorstein Veblen.
Re:Actually, mostly it DOESN'T contradict by virtual_mps · 2007-02-21 07:41 · Score: 1

In retrospect, I don't even know when and why the "bathtub" myth even started. The bathtub distribution was originally for stuff like electronic components, without moving parts. For something with mechanical wear and tear like a hard drive, who the heck came up with the idea that the same curve must apply? Shouldn't it have been common sense all along that it linearly gets more wear and tear?

It makes intuitive sense to me that something that is shock-sensitive (in both senses) will have more failures just after being installed than it will for a while sitting in a rack.
Re:Actually, mostly it DOESN'T contradict by Grishnakh · 2007-02-21 15:36 · Score: 1

For a lot of stuff you don't even need those, a 7200 RPM will do perfectly fine.

Personally, I'd be happy if I could get a couple of huge, but slow (like 3600 rpm) drives for storing media files. I don't need low latency for movies and music files, just good bandwidth and large storage capacity. A big bonus provided by the low speed would be low heat, a problem with my current disks, which require a noisy fan.
Re:Actually, mostly it DOESN'T contradict by seebs · 2007-02-21 18:47 · Score: 1

The reason I always use SCSI in servers is performance bottlenecks under heavy load, not "reliability". I haven't got enough experience with heavily loaded SATA to know whether it performs as well.

My guess is that this is much more a matter of controllers than of drives; PATA controllers tend to be the cheapest crap the manufacturer could get to respond to the BIOS. SCSI controllers have generally been more reliable for me.

--
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/

Re:So SSD's are not only faster, but more reliable by imsabbel · 2007-02-20 20:24 · Score: 1

What do you consider "reliable"?
There are USB sticks around that survive being driven over by a semi.
Flash regularly survives a round in the washing machine/dryer.

The "ruggedness" of SSD is much greater than that of mechanical drives, although the "soft errors" have yet to be really explored (i.e. aging of flash cells with the lower structure sizes, cell vs controller failure rates, etc.).

Otoh, with SSDs, it should be very cheap to create on-disk redundancy (maybe 2 redundant controllers with fail-over, chipkill for the flash banks, etc.).

--
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?

Re:Duplication and DRM by ppanon · 2007-02-20 20:55 · Score: 1

Sounds like DRM that ties a key to hardware is incompatible with reliable storage mechanisms required for enterprise-class systems.

There. Fixed that for you. Enterprises don't put up with this crap if an alternative exists, and they often have the purchasing power to ensure that alternative exists. Consumers shouldn't have to either.

It's good for VMware that they still have a significant edge in management and failover tools as compared to Linux-based solutions using Xen and KVM/QEMU. When the latter two start getting sufficiently sophisticated, CIOs worth their salt will look at moving to Xen or KVM not because of licencing costs, but because of the hassle (and additional failure modes) of dealing with the new Licence Manager in ESX 3.0/VI 2.0

--
Laissez lire, et laissez danser; ces deux amusements ne feront jamais de mal au monde. - Voltaire

3 Phase Commit - Re:MTBF by dave1g · 2007-02-20 21:04 · Score: 1

They should use 3 phase commit :-)

http://en.wikipedia.org/wiki/Three-phase_commit

Of course even 3 phase commit can fail lol. Though I'm a bit rusty from my Operating Systems class...

Mis-information by mal0rd · 2007-02-20 22:45 · Score: 1

Some clarifications are due. For flash disks with built in IDE controllers, they don't really distribute the writes across the disk. This is because they can't since IDE gives no indication when a sector is deleted. So if you write to 100 sectors of a 100 sector disk, then all further writes will either have to:

overwrite the logically overwritten sector
write to a different sector, but move that different sector to the logically overwritten sector

So as you see, it's not possible. The only way they can save your disk is if you don't use the whole thing. Normally they don't let you access the whole space, but that's just like spinning drives, which have reserved sectors for failures.

And the flash disks that allow direct access, without IDE controllers, don't do any load balancing. But normally one will use a load-balancing filesystem designed for flash, like JFFS.

neither the CMU paper nor the Google paper is good by Anonymous Coward · 2007-02-21 00:02 · Score: 1, Interesting

Unfortunately, this paper is severely flawed. Similar to the Google paper, it is written by academics with little understanding of the subject matter, but a strong desire to publish lengthy papers.

To write a meaningful paper, there is a lot of data about the drives and the systems they are used in that needs to be collected. These are initial conditions and operating conditions that any real system scientist will tell you cannot be ignored (to say the least). One cannot look at drives in the abstract, but must look at many details of how they are used, including the storage systems they are part of.

Google, to their credit, did collect the SMART measurements. That is a good start, but not sufficient data to support the conclusions of the Google paper.

For example, the orientation of each drive needs to be taken into account. What percentage of the drives analyzed were mounted horizontally vs. vertically? How were the drives themselves mounted? Specific mounting techniques result in a greater incidence of particular failure patterns. How were the drives cooled? Particular cooling techniques similarly result in specific failure patterns. What sort of data usage patterns were in use? What levels of RAID were used across the various drives?

I see no measurements of vibration in this paper. Drive orientation and drive vibration (including system-based vibration) are two factors that are very important in determining drive reliability. Drives have a certain resistance to vibration (and shock) that varies based on the directionality of the vibration.

We also see no meaningful treatment of the conditions for the HPC1, COM1, and COM2 systems. In HPC1 and COM1 we see massive failure levels for memory, likely indicating severe heat problems in those systems. In the COM2 system, we see a very high incidence of motherboard failure, again most likely indicating heat problems (or possibly bad caps). Specific heat conditions are operating conditions for drives that must be taken into account. Maybe the early onset of wear-out degradation is at least in part due to heat?

I have merely touched on several important elements of study that were neglected in both papers. To gain a real understanding of drive failure in the "real world", real and comprehensive data is needed first. Otherwise we are dealing with merely variations on the "GIGO / Garbage In Garbage Out" theme.

Also, I see a number of irrational conclusions being put forth by readers -- no value in RAID just replicate your data 3 times? This sounds a bit like how to get home from Oz. It works in the movies. But it doesn't work as well in real life.

RAID1 is a very solid solution for many businesses (and their correspondent data usage models), especially if there is a hot spare on the system as well. Many studies have shown the business value of the simple, transparent, low cost redundancy that RAID1 delivers. Even simple probability theory will tell you that RAID1 has clear potential for reliability improvements (that are well measured and proven in the real world).
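The "simple probability theory" can be sketched roughly as follows. This is only a back-of-envelope model: it assumes the two drives fail independently at a constant rate, and the 3% annual failure rate and 24-hour rebuild window are made-up illustrative numbers, not figures from either paper. If failures are correlated (same batch, same heat, same vibration), the real figure would be worse.

```python
# Rough sketch: chance of losing a two-drive RAID1 mirror within one year.
# Assumptions (hypothetical, for illustration only): failures are independent,
# the failure rate is constant over the year, and a failed drive is
# replaced and resynced within `rebuild_hours`.

HOURS_PER_YEAR = 8760.0


def single_drive_loss(afr: float) -> float:
    """Probability that a lone, unmirrored drive fails within the year."""
    return afr


def mirror_loss(afr: float, rebuild_hours: float) -> float:
    """Approximate probability that both halves of the mirror are lost.

    Either drive may fail first (factor of 2); data is lost only if the
    surviving drive also fails during the rebuild window.
    """
    p_second_during_rebuild = afr * rebuild_hours / HOURS_PER_YEAR
    return 2.0 * afr * p_second_during_rebuild


if __name__ == "__main__":
    afr = 0.03  # assumed 3% annual failure rate
    print(f"single drive: {single_drive_loss(afr):.4%}")
    print(f"RAID1 mirror: {mirror_loss(afr, rebuild_hours=24):.6%}")
```

The rebuild window is the term to watch: the longer a degraded mirror runs without a replacement, the closer the result creeps back toward the single-drive figure, which is the usual argument for keeping a spare on hand.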

I see a lot of analysis of RAID5 which people in the real world know is not a good choice for data that matters. There is no sane recovery procedure for RAID5. The drive access patterns tend to result in a lot of vibration as well.

Overall, I am disappointed that with all the investment that large organizations make in purchasing and deploying storage, they seem to have no one in their organization that (1) understands the mechanics and physics of even a single disk drive, (2) understands the concept of initial conditions, (3) understands the concept of operating environment/conditions, (4) has the willingness to make actual measurements vs. barf up a bunch of hearsay, and (5) truly wants to understand the reliability of storage systems vs. take pot shots at the drive industry.

Each of these papers, CMU's and Google's, is incomplete. There is not enough data to support the conclusions. There is not even enough data to support almost any conclusion beyond the basic observation, "drives fail, some days more than others."

Yes, Everything! by DeeVeeAnt · 2007-02-21 00:27 · Score: 2, Funny

It turns out they are actually triangular

--
Home fucking is killing prostitution.

Re:This paper and the Google paper are complementa by cowbutt · 2007-02-21 00:27 · Score: 1

As I understand it, the kernel of truth in those claims is that the testing/sampling rate on the SCSI assembly line is higher.

This paper from Seagate claims that SCSI drives are individually tested, whilst (S)ATA discs are only batch tested. As a result, I started running badblocks in write-test mode on my new ATA discs before putting them into service so as to attempt to reclaim that relative advantage. I also suspect that SCSI drives have a larger pool of reserved blocks for remapping failed blocks, which would go some way to explaining their funny sizes.

Re:This paper and the Google paper are complementa by swilver · 2007-02-21 00:33 · Score: 1

Google's idea of "high temperature" however is somewhere around 40 degrees C, and you'll be lucky to get such a low temperature in most desktop PC's, especially models that only bother to cool the CPU with a fan that only kicks in when CPU temperature becomes too high.

In most desktop systems drives can easily push beyond 45 C, and then you WILL see a drastic reduction of hard disk life (most drives are rated for 55 degrees C max). Often enough I see setups where people have packed 2 or 3 hard disks on top of each other in 3.5" bays, often sandwiched above or below a floppy drive. Air flow is minimal. Drives in such a configuration easily go beyond 60 degrees C.

drives by ralph1 · 2007-02-21 01:22 · Score: 0

Most of these are useless to me as my point of view is consumer and you have to name names.

Re:This paper and the Google paper are complementa by asuffield · 2007-02-21 01:41 · Score: 1

The Google paper shows that relatively high temperatures and high usage rates don't affect disk life.

Hrnghk!

The Google paper shows that relatively high temperatures do significantly affect disk life, and pins the safety point at about 45 degrees. Which is about where the manufacturers said it should be in the operating specs for the disks, if you bothered to read them.

They absolutely do not show that high temperatures don't affect disk life. Quite the opposite. Their graphs clearly show increased failure rates as the temperature rises above 45 degrees.

What they do show is that abnormally low (below 40 degree) temperatures don't improve it. That disproves the sanity of attaching watercooling rigs to your hard drive, but apart from that it's not very significant (did anybody seriously think that was a good idea?).

Normal operating temperature for a correctly installed disk in a 1-drive PC is typically 35-40 degrees. 10k RPM disks are hotter. Densely packed stacks of disks in servers can reach 50-60 degrees if not cooled *very* carefully, and that's *bad*, as the Google study shows. No myth was disproved - rather, one of the few figures which the manufacturer can and does test properly was demonstrated to be correct (it's really easy for them to find the safe operating temperature, so this is no surprise).

Re:moving parts - Don't always wear out by putaro · 2007-02-21 01:44 · Score: 1

"When I hear of Schrödinger's cat, I reach for my gun" - Stephen Hawking

Re:all this is [not] moot by Envy+Life · 2007-02-21 02:11 · Score: 1

Hard drives die often because the manufacturers build them cheaply, the same as every other component in a PC. Why would they ever make a bulletproof hard drive? They'd go out of business! There is a middle ground, however. You can buy a 250G SATA for $50. Honestly, who here wouldn't pay 2, 3 or even 4x that price for the same drive if it lasted twice as long? It costs a lot more than $50 in expenses and frustration when a drive fails unless you have good backups and/or redundancy... but who really makes the effort to do that unless they are being paid for it?

Re:So SSD's are not only faster, but more reliable by gelfling · 2007-02-21 02:17 · Score: 1

No, I don't mean rugged, I mean reliable. USB drives so far have limited performance and a limit on the rewrite cycles. Real SSD's are basically complicated RAM assemblies with their own power backup. I wonder how, if you remove all of the mechanical components, the reliability of the drive stacks up, over time? Do SSD's have the same kind of error vs age characteristics? Compared to RAID-5 what's the real difference in availability vs recoverability for example? See, we really don't care about RAID-5 except that it purports to offer us better availability and recoverability. But if that's not really the case vs BODs or mirroring, then it's possible, or it at least bears some investigation, whether chucking mechanical drives altogether is a better approach. We're spending gobs of money anyway. So why not spend it differently?

Re:moving parts - Don't always wear out by leonardluen · 2007-02-21 02:18 · Score: 2, Funny

but what happens when we run out of cats to power them?

Raid by scharkalvin · 2007-02-21 02:26 · Score: 1

While having two or more drives in a system would reduce the MTBF on the system as a whole, a raid array still makes sense as cheap insurance. You still should back things up because data loss doesn't only happen due to disk failure, there is the human error element too! As soon as one drive in a raid1 array fails, you should replace BOTH drives at once, first rebuilding the array after replacing the bad drive, and again after replacing the other original drive. Of course at the time you have a failure the exact same make/model hard disks you had in the array are probably no longer available. Just buy something as large or larger and create partitions as large or larger than on the original. When you are done replacing both drives, everything will be identical again.

Having a hot spare drive in a raid1 array isn't the best idea as the spare is aging at the same rate as the two in use. If the hot spare was kept totally powered down until needed this might make sense (but then it's not really a 'HOT' spare is it?).

The real advantage of raid isn't protection from data loss (backups are the only way to do that), but rather a good way to recover from the loss of a drive with minimal down time while the data is being restored (since rebuilding the array can sometimes be done while the array is in use, though I'd rather bring the system down to level 1 while rebuilding).

Re:neither the CMU paper nor the Google paper is g by RalphTheWonderLlama · 2007-02-21 03:18 · Score: 1

There is no sane recovery procedure for RAID5.

Please send us your broken RAID5s :)
We're quite good at recovering them and other difficult ones. (unabashed self promotion)
ESS Data Recovery

--
simple, fast homepage with your links: http://www.ngumbi.com/

IMHO, Some of the other Grads are hotter by Anonymous Coward · 2007-02-21 04:03 · Score: 0

Best dating site ever ?
http://women.cs.cmu.edu/Who/Profiles/Grad/index.php

All schools should have something like that.

Schrödinger? by D4rk+Fx · 2007-02-21 04:22 · Score: 1

Am I the only person who read the person's name as Bianca Schrödinger?

Schrödinger's Hard drives are (dead/not dead).

Forget RAID by SoVeryTired · 2007-02-21 04:45 · Score: 1

"Forget RAID, just replicate the data three times".

Just be sure not to do it on the same disk...

--
Slashdot: news for Apple. Stuff that Apple.

Re:Capacity correction... by Technician · 2007-02-21 05:29 · Score: 1

I just looked up specs to jog my memory. The RK05 drive is not 5 Meg. It was only 2.4 Meg.

--
The truth shall set you free!

Stiction by awtbfb · 2007-02-21 05:51 · Score: 1

That would probably be leftover from the good old days of stiction. There is nothing quite like pulling a drive out of a computer and smacking/snapping/what the hell are you doing to that! in front of a novice user.

Re:So SSD's are not only faster, but more reliable by strangluv2 · 2007-02-21 05:55 · Score: 1

I want to know end user experience with data recovery services. I ignored that click click sound on my WD Caviar, and now I'm crying. What is the success rate of these data recovery services, and the price range? Are there happy endings to disaster?

The big problem by LunaticTippy · 2007-02-21 06:23 · Score: 1

This is a good idea, and if/when computer tech stops advancing so fast it'll be possible. Right now, if you want 5 year data you're looking at 10GB drives using completely different technology than what is currently available. If I want to buy a 500GB SATA drive there simply isn't data going back very far. Once the data on 500GB SATA drives is collected, I'll be buying a 20TB QWERTY drive using holographic biostorage.

I suppose there might be a small amount of value in knowing how good each manufacturer's drives were 5 years ago, but I'd be surprised if they are even making drives in the same country now.

--
Man, you really need that seminar!

Re:The big problem by Reziac · 2007-02-21 08:13 · Score: 1

"...but I'd be surprised if they are even making drives in the same country now."

They're not. Frex, about the time Quantum got out of the HD business, W.D. moved from Singapore to ... um, Malaysia? I'd have to look, but anyway W.D.'s drives suddenly looked suspiciously like they were being made by Quantum's old HD factory (case design, etc.) This gave me the shuddering heebie-jeebies, because the tail end of Quantum's HDs hadn't been very good, and I was afraid W.D. drives would go likewise.

But I expect (hope?) overall quality control has more to do with ongoing drive reliability than does which plant made what.

--
~REZ~ #43301. Who'd fake being me anyway?

Re:This paper and the Google paper are complementa by darkwhite · 2007-02-21 06:37 · Score: 1

Are you a datacenter engineer? Do you have extensive experience with component cooling in datacenters?

Your whole analysis of the Google paper relies on premises like "they cool their drives too much, so moisture must be killing them" and "they really don't know how to analyze their data, they should have done forensic analysis on their drives". Particularly ridiculous was your assertion that their data should be analyzed by people less concerned with statistical analysis. Do you realize that these people are datacenter engineers? That as part of one of the biggest custom datacenter operators on the planet, they probably are mostly concerned with the engineering and cost-effectiveness aspects of their analysis? That this data was collected using highly automated methods based on SMART readings in huge environments, and performing forensic analysis on failed parts is usually a ridiculous proposition on several grounds? That this paper's material is nothing more than an extract of their internal reliability analysis, whose sole purpose is to maximize reliability and which probably analyzes factors like cooling regimes and humidity to death?

The above mostly applies to the post you linked. Each of your statements here has merit, and most agree with the Google paper's conclusions, but the difference is they're analyzing their massive operational data, while you seem to be drawing shaky conclusions from rationalizations.

Re:MTBF? RTFA. by stonecypher · 2007-02-21 07:05 · Score: 1

So you're right that MTBF shouldn't be taken for a single drive, since the failure rate at 5 years is going to be much higher than at one.

That's like saying that the inertia of an object shouldn't be taken at standstill, since the inertia at near-light-speed is going to be much higher than at rest. It's reductionist and absurd. There is no reason to discard legitimate and informative statistical measurements based on someone else's inability to apply them correctly.

An MTBF implies exactly the behavior that paper presents. If you saw MTBF 1 million hours and thought that meant for all your million drives, the flaw isn't in the measurement, it's in your comprehension of tenth grade mathematics.
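To make the "population, not single drive" reading concrete, here is a small sketch of how a datasheet MTBF turns into an expected annual failure count for a fleet. It assumes a constant failure rate (exponential lifetimes), which is the simplification baked into the datasheet number; the fleet size of 1,000 drives is just an illustrative figure.

```python
import math

HOURS_PER_YEAR = 8760.0


def annual_failure_rate(mtbf_hours: float) -> float:
    """Fraction of drives expected to fail within one year.

    Assumes a constant failure rate (exponential lifetimes), i.e. no
    infant mortality and no wear-out.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(n_drives: int, mtbf_hours: float) -> float:
    """Expected number of failed drives per year in a fleet of n_drives."""
    return n_drives * annual_failure_rate(mtbf_hours)


if __name__ == "__main__":
    mtbf = 1_000_000.0  # datasheet-style MTBF in hours
    print(f"AFR: {annual_failure_rate(mtbf):.3%}")
    print(f"failures/year in 1000 drives: "
          f"{expected_failures_per_year(1000, mtbf):.1f}")
```

Under these assumptions a million-hour MTBF works out to a bit under 1% of drives failing per year, so it says nothing like "each drive lasts 114 years".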

Anyone with a functional understanding of basic statistics knows that the significant MTBF risk increases per non-redundant failure source either exponentially (if failures don't cascade) or combinatorially (if they do.) That means the MTBF goes *down* as you add drives. Two non-redundant drives with an MTBF of X have a mean time of single failure for the group of sqrt(x). Three non redundant drives, and it's cuberoot(x).

Really, there's nothing sadder than someone sitting by the sidelines, claiming that good solid measurements should be discarded because they're too stupid to not be misled by them. You remind me of the jerks who sued because they don't know the difference between a megabyte and a mebibyte.

--
StoneCypher is Full of BS

I hope you don't have that problem by gd2shoe · 2007-02-21 07:38 · Score: 1

Every critical piece of data should have a backup plan, making this expensive recovery obsolete. But you knew that already, didn't you.

I hear your pain though. I'm not looking forward to the first time I have a customer bring me a dead solid state drive with vital data. Telling him that there is nothing that anyone can do will be painful.

(this assumes that new recovery tech is not developed - but this will certainly be much more expensive)

--
I won't join Slashcott. OTOH, If Beta goes live, I just won't be back until it's fixed. Sorry Dice.

Re:I hope you don't have that problem by um...+Lucas · 2007-02-21 07:58 · Score: 1

Yes, I'm fine... the workstations and servers are thoroughly accounted for here, between RAID 5's, tape backups, backups to hard drives taken off site, and a contract with Iron Mountain for offsite backup.

Remote users login to a dedicated Terminal Server, which is part of the office backup strategy.

But with all that, I'd hate to have my boss's wife purchase him a fancy laptop to use at home with a solid-state drive, and have him bring it to the office some months later complaining that he can't access the data. It's not on me to enforce rules on him, it's on me to make sure the system works despite him... So, I'm keeping my fingers crossed that he never gets a thing like that!

Too True,,, by Plekto · 2007-02-21 08:05 · Score: 1

I recently put together a system for a client and one of the three drives was DOA from the factory. Western Digital, 250gig - the exact ones you use in raid arrays for servers/drive arrays.

Booted up - but the heads never moved. Probably something broken inside with the armature or stepper motor.

So, yes, it happens all the time. The business I work for replaces drives in their data center every day. Now, they have something like a thousand drives in there, but that's an astonishing rate when you think about it. A year is about all you get out of a drive today before you are on borrowed time. Two years is common for home users, IME.
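For a sense of scale, the arithmetic behind "astonishing" looks something like this. The one-replacement-per-day figure is my reading of "every day", and the thousand-drive fleet is the rough number given above, so treat both inputs as approximate.

```python
def annual_replacement_rate(replacements_per_day: float, fleet_size: int) -> float:
    """Fraction of the fleet replaced per year at a steady daily rate."""
    return replacements_per_day * 365.0 / fleet_size


if __name__ == "__main__":
    # Assumed inputs: about one swap per day across roughly 1000 drives.
    rate = annual_replacement_rate(replacements_per_day=1, fleet_size=1000)
    print(f"annual replacement rate: {rate:.1%}")
```

Note this counts replacements, which is not necessarily the same thing as confirmed drive failures.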

As for data protection, four things are key:
#1: Make a CD or DVD with all of your installers. AV, firewall, acrobat, divx, and all the rest - so you can get the machine ready to install your main apps from a clean boot in an hour or so. Also include all of your data recovery software and utilities, plus sound and video drivers.

#2: Weekly backup of email and documents and such (use the tool of your choice) - this should be 20-30MB at most per week. A few minutes at most out of your schedule. Save it to a USB drive that you leave in one of the rear slots. This can be a 128MB "free" drive that you see coming with a spindle of CDs or whatever. In my case, it's a 512MB card, now, since my email and such is included, and it's grown quite large.

Now, obviously, if you have a 4 gig flash-drive, this solves #1 and #2. If it's an 8 gig model, you can install Windows(to boot/recover with) and still have enough room to partition it for data backup. But even 128MB is better than nothing, since the data is good for more than a decade.

#3: Go out and buy a good surge protector. By this, I mean an IsoBar strip (more like a metal brick - heh). It works. Most everything else doesn't. If you can nail the issues related to power as a cause for failure, or mitigate them to the level of "freak accident", you're that much better off. A UPS is, of course, better, but the number of people running without either is astounding.

#4: Raid 1 is a godsend for the average user. MTTF for both drives at once is amazingly high - on the order of 1/100K+ per day versus something closer to 1/250 or higher for a single drive. Given that a drive to run as a mirror is $60-$80 these days, it's infinitely cheaper than data recovery costs if you have a drive crash on you.

$60 now or ~$2000 to have Drivesavers recover it(no joke - it's that expensive.)

Re:MTBF? RTFA. by Vellmont · 2007-02-21 08:10 · Score: 1

Thank you for your input. I will consider it carefully.

Love Vellmont.

--
AccountKiller

Re:This paper and the Google paper are complementa by bill_mcgonigle · 2007-02-21 08:48 · Score: 1

As a result, I started running badblocks in write-test mode on my new ATA discs before putting them into service so as to attempt to reclaim that relative advantage

Good thinking, I've been doing the same lately after a couple painful failures. It really hurts under USB. :(

You might need to do that with the SCSI drives anyway. From the paper you linked:

The build and test times for ES drives are considerably
longer than PS drives. Increased test time can make a drive
more reliable. During this time, drives also undergo detailed
characterization, such as learning precisely how irregular
individual tracks are, which allows them to better keep the
heads on track during normal operation. More time spent
analyzing the media for flaws results in lower probabilities
these flaws causing unrecoverable read errors in the field.

That doesn't say to me that they test the entire SCSI drive, just that they test them for more time than ATA. Which isn't good enough for me. I admit, I only skimmed the paper.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)

Re:This paper and the Google paper are complementa by Anonymous Coward · 2007-02-21 10:23 · Score: 0

um, you're not logged in, so you can go fuck yourself.

Re:MTBF? RTFA. by Pseudonym · 2007-02-21 10:59 · Score: 1

It's also quite interesting that the "enterprise" level drives aren't any better than the consumer level drives.

As you know, you can buy different speeds of CPU or RAM. The reason why they come in fast and slow is not because anyone sets out to make a slow CPU. They make one speed of CPU, and then test them. Most of them are duds and get melted down to make new wafers. Some test okay, and they get packaged and shipped. Some test bad at high clock speeds and okay at low clock speeds, and these get packaged as lower-speed CPUs.

Now I don't know much about hard drive manufacturing, but I would guess that there's a similar thing going on here. An "enterprise"-level drive is one that tested better at the factory, but it's designed to the same specs as a consumer-level one.

--
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});

Re:Capacity correction... by Simon+Garlick · 2007-02-21 16:49 · Score: 1

Even so... holy crap, that's still impressive. I remember thinking, in 1983, that a megabyte was an inconceivably large amount of information. I couldn't imagine anyone ever needing to store a megabyte, and that was three years after you had a 2.4M drive. Whoa.

--

-----
PGP Key ID 0xCB8FF658

Re:This paper and the Google paper are complementa by cowbutt · 2007-02-21 21:53 · Score: 1

As a result, I started running badblocks in write-test mode on my new ATA discs before putting them into service so as to attempt to reclaim that relative advantage

Good thinking, I've been doing the same lately after a couple painful failures. It really hurts under USB. :(

Yes, raw PATA speeds are ~50-60MB/s, the best I've managed via USB 2.0 is ~25-30MB/s, with the same disc rehoused in a caddy using a Prolific PATA-to-USB bridge chipset, plugged into an Intel USB controller (NEC was slower, IIRC).
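A rough estimate of what that costs in wall-clock time, as a sketch. It assumes the default badblocks write-test behaviour of four patterns, each written and then read back (eight full passes), a hypothetical 250 GB drive, and constant sustained throughput, which real drives don't deliver since they slow down toward the inner tracks.

```python
def write_test_hours(capacity_gb: float, throughput_mb_s: float,
                     patterns: int = 4) -> float:
    """Estimated hours for a destructive write test over the whole drive.

    Each pattern is written across the full drive and then read back,
    so the drive is traversed 2 * patterns times.
    """
    passes = patterns * 2
    seconds_per_pass = capacity_gb * 1000.0 / throughput_mb_s
    return passes * seconds_per_pass / 3600.0


if __name__ == "__main__":
    # Assumed 250 GB drive; throughputs are midpoints of the ranges quoted above.
    for label, speed in (("native PATA", 55.0), ("USB 2.0 caddy", 27.5)):
        print(f"{label}: {write_test_hours(250, speed):.1f} hours")
```

Halving the throughput simply doubles the run time, which is why the USB case goes from an overnight job to most of a day.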

Also, another advantage of running badblocks in write-test mode before using the drive is that hopefully any marginal or failed blocks will be remapped before they contain useful data. I've never seen that happen, and to be honest, I'd now be inclined to reject a drive that did so as D<ead|amaged>OA.

Re:This paper and the Google paper are complementa by bill_mcgonigle · 2007-02-22 02:01 · Score: 1

Also, another advantage of running badblocks in write-test mode before using the drive is that hopefully any marginal or failed blocks will be remapped before they contain useful data. I've never seen that happen, and to be honest, I'd now be inclined to reject a drive that did so as DOA.

Good policy. I've been inclined to think that way myself in the past 6 months or so, and the Google study adds credence to my anecdotal experience.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)

Lousy policy. by jotaeleemeese · 2007-02-22 05:56 · Score: 1

It is a waste of time and money.

And it creates unnecessary risks (why the fluffy bunny should you be touching that hardware when it is not failing?).

You will need to replace disks, no question about it, but given the redundancy and hot swappability in modern devices of enterprise quality, preemptive action just increases the risks of something else going wrong (pulling a cable, doing something stoopid).

Or tell me, how do you explain shutting down that machine by mistake when you were changing a disk that was in perfect working order? If I was your user, it would make absolutely no sense to me.

--
IANAL but write like a drunk one.

And then your disk crashes. by jotaeleemeese · 2007-02-23 00:50 · Score: 1

I think you don't have a full grasp of priorities....

--
IANAL but write like a drunk one.
