Ask Slashdot: How Do SSDs Die?

CRC Errors by Anonymous Coward · 2012-10-16 04:22 · Score: 5, Informative

I had 2 out of 5 SSDs (OCZ) fail with CRC errors; I'm guessing faulty cells.

Re:CRC Errors by Quakeulf · 2012-10-16 04:29 · Score: 3, Interesting

How big were they, in gigabytes? I have two disks from OCZ myself and they have been pretty fine so far. The biggest is 64 GB, the smallest is 32 GB. I was thinking of upgrading to a 256 GB SSD at some point, but I honestly hadn't thought about what might kill it, and would like some input on that. My theory is that heat and a faulty power supply would play major roles. I'm not so sure about physical impact, although to some extent it would break the drive.
Re:CRC Errors by Anonymous Coward · 2012-10-16 04:39 · Score: 5, Informative

OCZ has some pretty notorious QA issues with a few lines of their SSDs, especially if your firmware isn't brand spanking new at all times.
I'd google your drive info to see if yours are on death row. They seem a little small (old) for that, since I only know of problems with their more recent, bigger drives.
Re:CRC Errors by AmiMoJo · 2012-10-16 04:48 · Score: 2

You could be more specific. Errors on reading or writing?
I have had a couple of SSDs die. The first was an Intel and ran out of spare capacity after about 18 months, resulting in write failures and occasional CRC errors on read. The other was an Adata and just died completely one day, made the BIOS hang for about five minutes before deciding nothing was connected.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re:CRC Errors by Anonymous Coward · 2012-10-16 04:48 · Score: 2

I've also had two OCZ drives die on me. The first to die was a 60 GB OCZ Agility, which actually didn't see that much usage compared to the 32 GB Agility that I had, and it died a little after a year. The Vertex, which was older than the Agility, died about 6 months later. Both were RMA'd and now I have new ones; let's see how long they last. The reason I've heard for why they die is that the controllers crap out, not that the cells reach their write limit, which makes sense to me because when both of mine died I couldn't access them at all and the computer had problems detecting them.
Re:CRC Errors by Synerg1y · 2012-10-16 04:53 · Score: 4, Interesting

OCZ makes several different product lines of SSDs, each line has it's own quirks, so generalizing OCZ's QA issues isn't accurate. I've always had good luck with the Vertex 3s, both for myself & people I install them for. I've had an SSD die once and it looked identical to a spinning disk failure from a chkdsk point of view. I can't remember what kind it was, either an OCZ or a Corsair, but I can name a ton of both of those brands that are still going 2-3y+.
Re:CRC Errors by Synerg1y · 2012-10-16 04:55 · Score: 1

I actually meant the Agility 3. I can't say either way on the Vertexes, besides that they sometimes get fewer eggs on Newegg from reviews.
Re:CRC Errors by Anonymous Coward · 2012-10-16 04:58 · Score: 0

I'm guessing you're a Windows user. ERROR_CRC is what NTFS returns when it decides the disk is bad. I've also seen ERROR_FILE_CORRUPT in similar circumstances.
Re:CRC Errors by ourlovecanlastforeve · 2012-10-16 04:58 · Score: 1

Likewise my experience with OCZ products is that either they work or they don't. One out of 3 OCZ products has to be returned for exchange. So I spend a little more to not have to do the returns.
Re:CRC Errors by Anonymous Coward · 2012-10-16 05:00 · Score: 0

He explicitly said "with a few lines of their SSDs", not them all.
Re:CRC Errors by Anonymous Coward · 2012-10-16 05:10 · Score: 1

Original Poster here:
CRC errors can signify both write and read errors. A CRC is basically a hash that lets the drive verify whether the data was written properly.
http://en.wikipedia.org/wiki/Cyclic_redundancy_check
On both drives it resulted in significant performance degradation, to well below the normal performance of a 5200 RPM SATA drive, which is saying a lot for an SSD.
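
To make the "hash that verifies the data" idea concrete, here is a minimal sketch in Python. It assumes `zlib.crc32` as a stand-in checksum; real drives compute their own checksums and ECC in hardware, and the `store` and `verify` helpers are hypothetical names used only for illustration.

```python
import zlib

def store(payload: bytes):
    """Return the payload with its CRC-32, as a writer would record it."""
    return payload, zlib.crc32(payload)

def verify(payload: bytes, stored_crc: int) -> bool:
    """Recompute the CRC on read and compare it with the stored value."""
    return zlib.crc32(payload) == stored_crc

data, crc = store(b"some block of user data")
assert verify(data, crc)            # an intact block passes the check

# Flip a single bit to mimic a faulty cell (hypothetical failure).
corrupted = bytearray(data)
corrupted[3] ^= 0x01
print(verify(bytes(corrupted), crc))  # prints False
```

A mismatch only says the stored data and the stored checksum disagree. It does not say whether the damage happened during the write or the read, which is why the read-vs-write question upthread is hard to answer from the error alone.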
Re:CRC Errors by Dishwasha · 2012-10-16 05:11 · Score: 4, Informative

I've had over 10 replacements on the original OCZ Vertex 160GB drives and an unnecessary motherboard replacement on my laptop that I eventually figured out was due to the laptop battery reaching the end of its life and not providing enough voltage. Unfortunately OCZ's engineers did not design the drives to handle loss of voltage and the drives absolutely corrupt. Eventually OCZ sneakily modified their warranty to include not providing warranty when the drives don't receive enough power rather than getting their engineers to just fix the problem. I'm actually running on a Vertex 3 and as of yet have not had that problem, but I am crossing my fingers.
Re:CRC Errors by bobcat7677 · 2012-10-16 05:12 · Score: 1

I am running (6) OCZ Vertex2 256GB drives under heavy use 24/7. Almost 2 years on have only had one fail and it still works, just started kicking random errors.
Re:CRC Errors by ArhcAngel · 2012-10-16 05:17 · Score: 2

I suspected you weren't talking about the Vertex line. I got a 60GB Vertex III last year and shortly after installation it started randomly disappearing during intense gaming. I thought it was due to heat so I bolted a fan/heat sink to it but that didn't seem to help. I struggled with the issue for over six months until OCZ finally released a firmware update (it had released several prior) that fixed it. I never lost any data but it was a little unnerving every time it would vanish.

--
"A person is smart. People are dumb, panicky dangerous animals and you know it." - K
Re:CRC Errors by DarkTempes · 2012-10-16 05:23 · Score: 1

I've had a 64GB Crucial M4 die.
Well, sort of die. It dismounts or gives read/write errors after a randomly short period of time depending on the system it's attached to.
I still need to RMA it :(
Re:CRC Errors by Anonymous Coward · 2012-10-16 05:31 · Score: 0

However if reading the same block again yields a correct result, it obviously was a read error.
Re:CRC Errors by Anonymous Coward · 2012-10-16 05:37 · Score: 0

Original Poster here:
One was a 128GB Petrol, the other a 120GB Vertex 3.
I was told it was a firmware issue, but I specifically updated the firmware to the latest release before the OS install, so I'm not sold on that.
Re:CRC Errors by ZedNaught · 2012-10-16 05:40 · Score: 1

Have you updated to the latest firmware? There was an elapsed-time counter that was declared as a smallint and it was overflowing. On reboot the drive would work again for one hour before failing on the hour counter update.
Re:CRC Errors by lytles · 2012-10-16 05:40 · Score: 5, Funny

power corrupts. absolute power corrupts absolutely

--
My blog
Re:CRC Errors by Mattcelt · 2012-10-16 05:44 · Score: 5, Funny

And a lack of power enables corruption. QED
Re:CRC Errors by franc0ph0bic · 2012-10-16 05:50 · Score: 1

It is mostly not the controllers, depending of course on which Agility/Vertex you are referring to. In non-sync NAND drives, most of the time it is the NAND having bad blocks. Those drives are cheap and not meant to be used in high-usage scenarios, so if you do, you probably won't have very good results. Generally the drives that use first generation Sandforce-based controllers, which did not have very efficient garbage collection, have problems too: if you don't enable TRIM, your drive will destroy its own NAND pretty quickly.
Re:CRC Errors by MrL0G1C · 2012-10-16 05:53 · Score: 5, Informative

http://www.behardware.com/articles/862-7/components-returns-rates-6.html
Personally, I'm glad my SSDs aren't OCZ.

--
Waterfox - a Firefox fork with legacy extension support, security updates and better privacy by default.
Re:CRC Errors by Leggman · 2012-10-16 05:58 · Score: 0

wish I had a mod point...

--
You don't eat crackers in the bed of your future or you get all...scratchy! - The Tick
Re:CRC Errors by tguyton · 2012-10-16 06:08 · Score: 1

Interesting, I've been having the same problem with a Crucial M4 drive for a while now. To be honest I haven't even investigated since I stopped using it as my boot drive before the problem started happening, but I'll have to check and see if there is a firmware issue with it. I bought it when the drive was first released almost a year and a half ago, so I'm sure an update won't hurt. On a side note - the Crucial was not replaced for any failures on its part, I definitely liked it a lot (especially the speed!). I just needed more space, and my fiance gave me a lovely Intel 510 for Christmas :)
Re:CRC Errors by Bengie · 2012-10-16 06:23 · Score: 1

The firmware is supposed to detect when the reserved wear leveling space has about run out and the cells are about at their limit. At this point the entire drive should suddenly enter a "read only" state with only the most recently written data subject to corruption.

This is my understanding, but I guess it's up to the firmware to handle this and I'm not sure if this is industry standard.
Re:CRC Errors by markhahn · 2012-10-16 06:27 · Score: 3, Insightful

this is not very useful, as it mainly points out that the initial generations of commodity SSDs were immature. not to mention that return rates contain other phenomena than wear or even failure.
Re:CRC Errors by FilmedInNoir · 2012-10-16 06:29 · Score: 1

I've had a similar experience with a Samsung 120GB PM830.
I didn't even know it was an SSD until someone told me; the failure mimicked what I would expect from a regular old spinny platter type disk.
The occasional black screen of death with "Windows shut down due to blah blah blah" and "Can't boot, non-system boot disk!" (worked on cold reboot but not warm).
Those kinds of errors.

--
Sig. Sig. Sputnik
Re:CRC Errors by MyFirstNameIsPaul · 2012-10-16 06:36 · Score: 1

There is some real truth in that.

--
I once took an excursion to Reddit, and later HN. Unlimited up/down voting sucks when dealing with a hive-mind.
Re:CRC Errors by arth1 · 2012-10-16 07:16 · Score: 5, Insightful

> I am running (6) OCZ Vertex2 256GB drives under heavy use 24/7. Almost 2 years on have only had one fail and it still works, just started kicking random errors.
Your failure rate of > 8% per year isn't very reassuring.
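
For anyone wondering where the "> 8%" comes from, a quick back-of-the-envelope sketch in Python. It assumes a simple, non-compounded rate and rounds "almost 2 years" up to exactly 2, so the true figure would be slightly higher.

```python
drives = 6       # drives in service, from the quoted post
failures = 1     # drives that started throwing errors
years = 2.0      # "almost 2 years", rounded up (assumption)

# Simple annualized failure rate: failures per drive per year.
annual_rate = failures / drives / years
print(f"{annual_rate:.1%}")  # prints 8.3%
```

With only six drives the uncertainty on that number is huge, so treat it as an anecdote rather than a measurement.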
Re:CRC Errors by Christian+Smith · 2012-10-16 07:18 · Score: 1

> Eventually OCZ sneakily modified their warranty to include not providing warranty when the drives don't receive enough power rather than getting their engineers to just fix the problem.
What is sneaky about that? Seems a perfectly reasonable clause. Such a drive is probably perfectly fine after (re)formatting.
Remember, FLASH blocks need power to erase and write correctly, and controllers rely on there being sufficient power to write anything.
It was your hardware that sucked in this case. If a laptop is having stability problems of any kind, remove the battery and run off the PSU as a matter of course to test it. I almost junked a perfectly good laptop before I learned that lesson.
Re:CRC Errors by Anonymous Coward · 2012-10-16 07:19 · Score: 0

I'm pretty sure the problem here is not that those SSDs are bad or unreliable, but that you failed to identify that you had a bad battery BEFORE replacing 10 drives and a motherboard. There must have been more telltale signs than simply SSD corruption.
Re:CRC Errors by DarkTempes · 2012-10-16 07:19 · Score: 1

Yeah, first thing I did was update to the latest firmware when I discovered the problem (6+ months ago, I'm super lazy).
Though maybe that was before the fix and I should try again. Thanks for the heads up.
Re:CRC Errors by viperidaenz · 2012-10-16 07:29 · Score: 1

spinny platter type failures tend to be along the lines of the Click of Death

I'm pretty sure SSD's won't start clicking. You'll get read errors though.
Re:CRC Errors by Anonymous Coward · 2012-10-16 07:36 · Score: 0

Yeah. If you have the power to kill all zombies that ever come near you, you will never become one.
Re:CRC Errors by Dishwasha · 2012-10-16 07:39 · Score: 4, Insightful

I would counter-argue that any flash drive manufacturer is asking for massive RMAs when the device is clearly targeted for the laptop market (otherwise they would manufacture it in a 3.5" format) where the operating environment is guaranteed to be running on a battery for long periods of time. Any research in to battery operation would expose you to the vast differences in operating voltage as batteries discharge as well as the age of the battery. It is just bad engineering to not take this in to account.
Reformatting the drive was not an option because the drive wouldn't even detect in the BIOS unless the special factory jumper was set which is a non-operational mode for the drive. This problem was reproduced over 10 times with over 10 different drives of the same model Vertex. Slightly bad power caused the entire drive to be rendered unusable. Amazingly, none of the other hardware in the laptop had any problem with the power (i.e. screen, cpu, memory, other spindle-based hard drive, gpu, etc.). As I said, bad engineering.
Re:CRC Errors by tibit · 2012-10-16 07:43 · Score: 1

The major problem, I think, is that hard drives of any sort have their costs cut to the bone. Thus spinning platter hard drives store firmware on the platter and not in flash; only simplified loader firmware is stored in a boot (EEP)ROM somewhere. This demonstrates that the same apparently goes for SSDs. The BIOS doesn't detect them because, with dead data storage flash, they can't even boot up the controller properly. There is no reason whatsoever for an SSD to slow down at all when faced with uncorrectable read errors from the data flash, or to appear unresponsive because of that. It simply needs to fail all the data reads and that's it.

--
A successful API design takes a mixture of software design and pedagogy.
Re:CRC Errors by ZedNaught · 2012-10-16 08:13 · Score: 5, Informative

Firmware release notes, from January 13th, 2012: "Correct a condition where an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time. The drive will recover after a power cycle, however, this failure will repeat once per hour after reaching this point. The condition will allow the end user to successfully update firmware, and poses no risk to user or system data stored on the drive."
Re:CRC Errors by hairyfeet · 2012-10-16 08:21 · Score: 1

The controller. I've had 2 of my gamer customers lose SSDs to controller failure. It's easy to spot: you flip the switch and there is nothing, the drive doesn't even show up in BIOS. This is why I warn folks that if they are going SSD they need to have a backup strategy they religiously stick to, because if that controller fails they are SOL. With HDDs I've been able to get a lot of data off a badly dying drive, but with SSDs once the controller goes so too goes the drive.

--
ACs don't waste your time replying, your posts are never seen by me.
Re:CRC Errors by g0tai · 2012-10-16 08:24 · Score: 1

I had 3 out of 3 OCZ Petrol SSDs fail on me (it looks like a controller burn-out issue that's present in pretty much all of the Petrol series SSDs, from my personal experience). However, on the flip side of that, I've never had one of their Agility series or Vertex 4 series SSDs die on me (we have 4 Agilities, plus another 3 at home for me, and around 8 Vertex 4s). All of those have worked fine.

It sucks to have an SSD fail; believe me, each time I had an OCZ Petrol fail, it made me sad. I've learnt my lesson though: avoid that series like the plague. Same vendor, but I only get Agility or Vertex now, and, touch wood, I've been fine so far for half a year on those (and we use the Vertex 4s on our DB servers, which get *hammered* on reads and writes).

This is my personal experience with SSDs. They're a godsend for DB stuff (and anything else that does lots of small writes/reads). OCZ are good and we still get our SSDs from them... just not the Petrol series.
Re:CRC Errors by filthpickle · 2012-10-16 08:34 · Score: 1

Platter drives fail like he describes all the time. Bad sectors usually, although I am sure it could be other things. If you are really poor, bored, or stranded with no replacement parts, you can save it with clever partitioning.....but it just isn't worth messing with that shit anymore. Just get another one.
Re:CRC Errors by filthpickle · 2012-10-16 08:47 · Score: 2

> otherwise they would manufacture it in a 3.5" format
The standard form factor for SSD's is 2.5" no matter how you intend to use them. I am not really commenting on what you say aside from that. I was honestly curious when I read that because I have never seen a 3.5" SSD (I haven't looked very hard). There are a few from OCZ on newegg but that's all a brief scan could find.
Re:CRC Errors by Anonymous Coward · 2012-10-16 09:17 · Score: 0

I had 3 SSDs die to a faulty power supply (took out a DVD drive at the same time, HDDs were unharmed). So that is a threat, but having a PSU go bad is probably rare enough not to worry about that too much. It does suck when it happens though, but we all have backups for that... right? ;)
Haven't seen any of the 6 I've had (4, plus two of the three I got back that were worth RMA'ing) die to overuse or age yet, though the longest running one was about 1.5 years as my primary drive (plus a year as a secondary) before it got murdered, so it's not exactly a great longevity example.
Re:CRC Errors by clarkn0va · 2012-10-16 09:19 · Score: 2

Exactly right. I've used, sold and supported dozens of SSDs. Most were Vertex or Agility (1, 2, 3 and 4), and I've yet to see a single one fail. By contrast, I sold exactly three OCZ Petrols and had 4 failures! The last two were RMA replaced by Agility 3 and Octane, respectively, so obviously OCZ has seen a problem there.
Similarly, I sold a batch of a dozen or so Kingston budget drives and saw nearly half of them fail around the 1 year mark. I've used a couple Corsair drives and had issues with them not coming out of sleep. I tried a handful of early Intel MLC drives and while they worked ok, performance was lacklustre for the high price.
These days I stick with the tried and true Vertex and Agility series for good value and flawless performance.
As for the OP's question of how they fail: you name it. Some develop bad blocks resulting in lost data or failure to boot. Others disappear suddenly from the BIOS altogether and refuse to be recognized by any OS. I had one that spontaneously (or so the customer claimed. It was a Kingston, after all) set itself an ATA password and was thus useless.

--
I am literally 3000 tokens away from the chaotic crossbow --Stephen
Re:CRC Errors by Anonymous Coward · 2012-10-16 09:19 · Score: 0

This is about what I saw on one product I worked on. It was about 8-15% and 1-10 meg (not gig). However, I seriously doubt the tech has changed that much other than feature size, speed and quantity. We were looking at 100% failure in 4-6 years (due to power issues). The ones that were not power related were in the 8-10% range.
The re-write limit on most cells is anywhere from 2.5k-20k cycles depending on tech and feature size. The 1 million they quote you depends on how much 'extra' in cells they packed in there. What many of the drives do is just swap out 'bad cells' for 'good ones' and you never notice. That is, until 1 of 2 things happens: you run out of extras for a critical cell (and many of these exist in all drives), or you have a power drop-off in the middle of a write, which causes corruption, but the device is 'ok' and a full format resurrects it.
Usually the result is the drive 'goes away'. If you're very, very, very lucky, the data is corrupt in a few places and you can recover most of it. But usually it's bam, gone...
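
The "swap bad cells for good ones until you run out" behaviour can be sketched as a toy model in Python. Everything here is hypothetical: the `ToyFlash` class, the block counts, and especially the read-only fallback, which follows the "read only" idea described upthread and is not something every firmware is known to do.

```python
class ToyFlash:
    """Toy model: user-visible blocks backed by a small pool of spares."""

    def __init__(self, user_blocks: int, spare_blocks: int):
        self.user_blocks = user_blocks
        self.spares_left = spare_blocks
        self.read_only = False

    def block_wears_out(self) -> None:
        """Handle one worn-out block: remap it to a spare, or give up."""
        if self.spares_left > 0:
            self.spares_left -= 1   # the host never notices the swap
        else:
            self.read_only = True   # no spares left: stop accepting writes

drive = ToyFlash(user_blocks=1000, spare_blocks=3)
for _ in range(4):                  # four blocks wear out, only three spares
    drive.block_wears_out()
print(drive.spares_left, drive.read_only)  # prints 0 True
```

The point of the sketch is that nothing visible happens for the first three failures. The drive looks perfectly healthy right up to the one it cannot absorb.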
Re:CRC Errors by aztracker1 · 2012-10-16 09:28 · Score: 1

One of my first two SSDs died in the first year... I've now got a total of 8 running in various systems, all running strong, most for a few years... Mostly Corsair Force and Force 3's... my first two were both Intel's 1st gen; the 80GB died, and the 160GB is still running, passed on to a friend... running a few 256GB drives, and two 480GB in my VM server... backups are crucial, since I didn't get any warning before the death...

--
Michael J. Ryan - tracker1.info
Re:CRC Errors by franc0ph0bic · 2012-10-16 10:02 · Score: 1

The firmware on most SSDs is stored in a dedicated SRAM chip on the board, but most drives are set up to stop functioning once S.M.A.R.T. shows the drive as failed or there are too many bad blocks to hold all of the data. This is why a 256GB drive holding 20GB of data will basically never go bad, but if it is constantly holding 250GB it will go bad very fast.
Re:CRC Errors by Electricity+Likes+Me · 2012-10-16 10:10 · Score: 1

> > otherwise they would manufacture it in a 3.5" format
> The standard form factor for SSD's is 2.5" no matter how you intend to use them. I am not really commenting on what you say aside from that. I was honestly curious when I read that because I have never seen a 3.5" SSD (I haven't looked very hard). There are a few from OCZ on newegg but that's all a brief scan could find.
That's because a 3.5" form factor drive would still cost a fortune, since capacity is limited by the cost of the chips, rather than the physical number of them you can fit inside a certain volume.
Re:CRC Errors by Anonymous Coward · 2012-10-16 10:11 · Score: 1

> I eventually figured out was due to the laptop battery reaching the end of its life and not providing enough voltage.
A dying PSU could do the same. I suspected mine after I lost a WD enterprise drive and a Seagate consumer drive in sequence. After switching it out I haven't lost a single drive.
The 2004 solar storms caused 2V voltage spikes throughout my systems, which generally exceeds the voltage tolerances of current drives. The voltage log was scary to read afterwards.
Re:CRC Errors by AaronW · 2012-10-16 10:13 · Score: 1

I had a 2 week old Agility 3 suddenly fail to the point where it was not visible on the SATA bus.

--
This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
Re:CRC Errors by antdude · 2012-10-16 10:15 · Score: 0

FYI, it's = it is. ;)

--
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
Re:CRC Errors by KreAture · 2012-10-16 11:15 · Score: 1

Some drives use capacitors in the design to provide a buffer of power. To keep the rest of the system from eating this power, they protect it with diodes.
This little power store is used to ensure that the last/current operation is completed before the drive shuts down if it detects loss of input power.
It doesn't save all your data, but at least it ensures no half-cooked corruption occurs on the write.
The drives of course cost extra, but give added safety.
Re:CRC Errors by toastar · 2012-10-16 11:17 · Score: 1

Well with motor death(click of death) you can usually send it off to a clean room to recover pretty cheaply.
Re:CRC Errors by hairyfeet · 2012-10-16 11:19 · Score: 1

My gaming customers have the 128GB OCZ and Intel and both are on their second thanks to controller failures. Don't ask me why but from what I've seen as well as what those on the forums have said it really comes down to either chips failing (which will usually give you some warning) or controller failure which does NOT give you any warning. But I doubt it has anything to do with heat as both gamers have those massive ATX cases with multiple front and rear 120mm fans, those things sound like an F15 taking off but they keep even their OCed CPUs cool.

--
ACs don't waste your time replying, your posts are never seen by me.
Re:CRC Errors by Anonymous Coward · 2012-10-16 11:47 · Score: 0

The sample size is too small to draw any useful conclusions from.
Re:CRC Errors by socceroos · 2012-10-16 12:01 · Score: 1

Yes, truth exposes corruption.
Re:CRC Errors by scubamage · 2012-10-16 12:28 · Score: 1

Why would they have to manufacture in a 3.5" format? Enterprise disks are very often 2.5" now because of the density you can get. You only use larger drives in cases where you need a larger amount of storage (>1TB/disk). 2.5" drives have been in use for servers for at least 5 years now (I know HP started using them in the DL series during generation 5, and they're up to G8 now - can't speak for other vendors).
Re:CRC Errors by Anonymous Coward · 2012-10-16 13:11 · Score: 0

FYI, just to be pedantic.
it's = it is
OR
it's = it has
Re:CRC Errors by Anonymous Coward · 2012-10-16 13:27 · Score: 1

Seriously! These days it's really simple to slap on a SMPS. If their product has tight power requirements, why not put on some electronics to guarantee it gets that.
Re:CRC Errors by jhantin · 2012-10-16 13:45 · Score: 1

You'd think they'd learn from others' gaffes. Western Digital had a problem with Velociraptor firmware where an unsigned 32-bit count of milliseconds since power on, combined with poorly written read timeout ("TLER") logic, caused all read requests submitted within TLER-timeout of counter rollover to fail. Imagine the confusion that ensues when a RAID controller (or Linux md) sees all its drives vanish for a few seconds then come back ...
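
A sketch in Python of how a wrapping millisecond counter can produce this class of bug. It is only an illustration: the actual firmware logic is not public in this thread, the function names are made up, and the 7000 ms timeout is an assumed value.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
WRAP = 2 ** 32                      # an unsigned 32-bit counter wraps here

print(WRAP / MS_PER_DAY)            # days of uptime until rollover, roughly 49.7

def buggy_timed_out(start_ms: int, now_ms: int, timeout_ms: int) -> bool:
    """Compares against an absolute deadline, which wraps to a small number."""
    deadline = (start_ms + timeout_ms) % WRAP
    return now_ms >= deadline

def safe_timed_out(start_ms: int, now_ms: int, timeout_ms: int) -> bool:
    """Modular subtraction gives the true elapsed time across the wrap."""
    elapsed = (now_ms - start_ms) % WRAP
    return elapsed >= timeout_ms

start = WRAP - 1000                 # request issued 1 s before rollover
now = WRAP - 999                    # checked just 1 ms later
print(buggy_timed_out(start, now, 7000))  # prints True (spurious timeout)
print(safe_timed_out(start, now, 7000))   # prints False
```

In the buggy version every request issued within one timeout of the rollover looks expired immediately, which matches the "all reads fail for a few seconds, then recover" symptom described above.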

--
...when you're writing a game...tweak the difficulty of "Easy" to something [your mother] can cope with. -- onion2k
Re:CRC Errors by Anonymous Coward · 2012-10-16 13:50 · Score: 0

We had one of our OCZ Enterprise level 1 TB SSD drives go out on us. We had 2 that were doing caching for our JBOD chassis.
Re:CRC Errors by Anonymous Coward · 2012-10-16 13:50 · Score: 1

> I would counter-argue that any flash drive manufacturer is asking for massive RMAs when the device is clearly targeted for the laptop market (otherwise they would manufacture it in a 3.5" format) where the operating environment is guaranteed to be running on a battery for long periods of time. Any research in to battery operation would expose you to the vast differences in operating voltage as batteries discharge as well as the age of the battery. It is just bad engineering to not take this in to account.
Sorry dude, you're being an armchair quarterback here. Any real research into how laptops are designed would expose you to the fact that laptops never power items like disks directly from battery power. Instead, the battery powers step-up and step-down voltage converters, which provide regulated DC power to everything else. It's just like how a tower has a PSU which converts 120V/240V AC to a variety of regulated DC supply rails (5V, 12V, 3.3V), except here the input power to the PSU is DC at a lower voltage.
This is done because (a) nothing in the machine but the converters can use battery cell voltage directly (it's never one of the standard supply voltages used in digital electronics) and (b) even if (a) weren't true, a brand new 100% healthy battery has a large voltage swing while being drained from full to empty. Most digital ICs in laptops require regulation of their power rails to within 5% or 10% of the nominal value, or they won't function correctly. Direct supply from the battery could never meet that requirement.
If your laptop is responding to an old failing battery by sending very poorly regulated power to a disk, that's a badly designed laptop. The entire laptop should shut itself down if its internal PSU detects that it isn't maintaining proper regulation on one of its outputs. Sure, it'd be nice if OCZ also implemented a protection circuit which shuts the SSD down cleanly if it detects bad input power (*), but the fact that it's getting bad input power at all is a legitimate problem with the design of the whole laptop.
* - This is more of an absolute requirement for enterprise grade SSDs, but it's a frequently cut corner on consumer oriented products. Lots of people aren't willing to pay extra for reliability and data integrity features.
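
To put numbers on the "within 5% or 10%" regulation requirement, a tiny Python sketch. The `within_tolerance` helper and the sample readings are hypothetical; actual limits depend on each part's datasheet.

```python
def within_tolerance(measured_v: float, nominal_v: float, tolerance: float) -> bool:
    """True if the measured rail is within +/- tolerance of its nominal value."""
    return abs(measured_v - nominal_v) <= nominal_v * tolerance

# A 5 V rail with 5% tolerance allows 4.75 V to 5.25 V.
print(within_tolerance(4.80, 5.0, 0.05))  # prints True
print(within_tolerance(4.60, 5.0, 0.05))  # prints False
```

A lithium cell swings by far more than that between full and empty, which is why nothing downstream is fed from the battery directly.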
Re:CRC Errors by jakimfett · 2012-10-16 16:23 · Score: 1

This is brilliant, and I will be searching for a way to insert this into everyday conversation.

--
Bits of code, random ramblings: jakimfett.com
Re:CRC Errors by Anonymous Coward · 2012-10-16 16:55 · Score: 0

Err, what? A 3.5" SSD is just a 2.5" SSD with extra air. Smart people will buy the air separately.
Re:CRC Errors by bemymonkey · 2012-10-16 17:11 · Score: 1

The initial generations of *cheap* SSDs were immature - as soon as Intel released the Postville, that was pretty much a thing of the past. These days, there's no excuse for buying OCZ/Verbatim/*CheapoBrand* when you can get a Samsung 830, Crucial M4 or Intel 320/520 for just a little more.
Re:CRC Errors by Anonymous Coward · 2012-10-16 18:20 · Score: 0

I've had 3 OCZ SSD's. The first was terrible. Second two (purchased two months ago) are great so far.
The initial one was a first gen OCZ Vertex, 60GB. I used it for 4 months. 3 times in that period it lost all of my data (drive/NTFS volume suddenly unreadable, but usable again after reformat). It was running the latest firmware the whole time (1.5 at the time). Finally, the 4th time, Windows would no longer install, failing to extract files each time. The forums weren't particularly helpful, but they did try. I had an nForce SATA chipset at the time, which from what I was reading took some liberties with the SATA specification (which didn't cause issues with mechanical disks, but was at fault for many SSD issues for a variety of people).
I was apprehensive about trying again, but the second two have been solid. 2x OCZ Vertex 4 256GB in RAID 0 (Intel Matrix RAID on Gigabyte GA-Z68XP-UD3P).
I didn't have a chance to try the original vertex again on a different SATA chipset. I had an office space printer moment with it before the new computer was built.
Re:CRC Errors by MikeWeller · 2012-10-16 19:06 · Score: 1

I can confirm this bug. A few months back my company MacBook started freezing up multiple times a day. I could see from the activity monitor that disk access stopped completely. The SMART status was OK, and all sorts of other disk tools could not find a problem. The drive is a Crucial M4 256 GB. After a few weeks of these random failures I realized they seemed to happen pretty consistently every hour, and eventually I found out about this firmware bug. After doing some calculations, the "5184 hours of Power-on time" matched up very nicely with the date I started using the drive and the almost 24/7 power-on time (I just leave it powered on at my desk). So I installed the firmware upgrade and the problem went away. It's a pretty crazy bug - there must be many thousands of these drives out there just waiting to hit this magic power-on time and start failing. How many of those users are going to know about the firmware update?
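
The calculation mentioned above is easy to redo. A rough Python sketch, where the 5184-hour threshold comes from the release note quoted upthread and the start date and the `date_bug_hits` helper are made up for illustration:

```python
from datetime import datetime, timedelta

BUG_THRESHOLD_HOURS = 5184          # from the firmware release note

def date_bug_hits(first_power_on: datetime, hours_on_per_day: float) -> datetime:
    """Estimate when cumulative power-on time reaches the threshold."""
    days_needed = BUG_THRESHOLD_HOURS / hours_on_per_day
    return first_power_on + timedelta(days=days_needed)

print(BUG_THRESHOLD_HOURS / 24)     # prints 216.0 (days at 24/7)

# Hypothetical drive first powered on 1 January 2012 and never switched off.
print(date_bug_hits(datetime(2012, 1, 1), 24).date())  # prints 2012-08-04
```

A machine that is only on 8 hours a day would take three times as long, about 648 days, which is part of why affected drives would have started failing at very different times.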
Re:CRC Errors by tibit · 2012-10-16 19:29 · Score: 1

Whoever thought up this "functionality" should be made to pay for all the lost data. I kid you not. There's stupid, then there's this "behavior", amounting IMHO to willful destruction of user data.

--
A successful API design takes a mixture of software design and pedagogy.
Re:CRC Errors by Xest · 2012-10-16 20:55 · Score: 1

I think the point is that if they produced 3.5" versions for desktops then they'd be able to use less fragile components, and have more room for handling heat dissipation etc., making the product that bit more durable. The problem is that to miniaturise you have to use a lot of cutting edge tech, and the potential issues that arise from that cutting edge tech in terms of degradation are often not known. This is the reason many consumer product defects arise.
I suspect the reason they don't is that it's cheaper to have one production line for all of it and just not give a shit about durability, but rest assured, if customers get fed up enough of that and make enough noise, return enough products, or file enough claims, it will change.
Re:CRC Errors by Anonymous Coward · 2012-10-16 21:02 · Score: 0

My OCZ was not detected one day. No warning. That's how it died.
Re:CRC Errors by Anonymous Coward · 2012-10-16 21:15 · Score: 0

I actually ran into this exact type of problem with a friend's SSD. A firmware update fixed it, but it took forever to figure out.
Re:CRC Errors by Alioth · 2012-10-16 22:04 · Score: 1

Why would they manufacture it in 3.5in format?
The standard form factor for *server* discs is now 2.5in - essentially the 3.5in format is on its way out. 2.5in does not imply "for laptop use".

--
Oolite: Elite-like game. For Mac, Linux and Windows
Re:CRC Errors by bfandreas · 2012-10-16 23:17 · Score: 1

It has a firmware issue and there is a patch. Basically after 5000 hours of uptime it will not return from a SMART check.
I got bitten by that nasty blighter. But I do wonder how you catch a thing like that. This must have been a bitch to debug.

--
20 minutes into the future
Re:CRC Errors by Anonymous Coward · 2012-10-17 03:30 · Score: 0

The first and last time I ever purchased an OCZ product was when I was doing a PC build for my dad.
I bought a 2x2GB RAM bundle, DDR3-833 (so not terribly high-spec or low-tolerance, even 3 years ago). Windows wouldn't install, and memtest+ showed 1100 errors. So I exchanged it for another of the same item. Windows installer wouldn't even boot, and this time memtest+ showed 13000 errors.
I returned that memory, took a refund, and bought Corsair instead. I now avoid OCZ like the plague. Microcenter doesn't even sell their stuff anymore. Apparently I wasn't the only one with problems.
I don't trust their SSDs since they can't be bothered to QC their RAM.
Re:CRC Errors by Synerg1y · 2012-10-17 03:50 · Score: 1

Corsair & OCZ also happen to sell a lot more SSD's than Intel & Crucial. Pretty sure there's a rule somewhere in sales & marketing that ties higher sales figures to higher acceptable return/failure rate.
--Stat class taught me to question #s & figures presented by random websites.
Re:CRC Errors by guruevi · 2012-10-17 04:48 · Score: 1

I think after they analyzed a bunch of RMA's they might have probed for similarities stored in the EEPROM. Then you just set the counter to the value with a production model to test.

--
Custom electronics and digital signage for your business: www.evcircuits.com
Re:CRC Errors by guruevi · 2012-10-17 04:55 · Score: 1

Bigger chips create more heat, that's why we want ever smaller chip processes because they create less heat and thus more stuff can be crammed for the same power requirements. Also, the faster you go the shorter you want the wires to be as eventually you'll become limited by reflections and capacitance of longer wiring.

--
Custom electronics and digital signage for your business: www.evcircuits.com
Re:CRC Errors by Xest · 2012-10-17 06:40 · Score: 1

Sure, I'm not talking about bigger chips though, but more room for heat management - i.e. heatsinks to help dissipate that heat more effectively.
Re:CRC Errors by Anonymous Coward · 2012-10-17 15:09 · Score: 0

What kills me is you have to RMA the fucking drive (I am at 3rd time here with Vertex) to get it off its `PANIC MODE` which is an over-protective crap put up by Sandforce.
Re:CRC Errors by RockDoctor · 2012-10-17 15:23 · Score: 1

I always thought that click-of-death sounds came from a latch on the voice coil (head positioning actuator) failing to release it for some (electronic) driver issue.
But it's 4 or 5 years since I ripped a hard drive apart ; are they back to driving heads with motors instead of using a voice coil, strong magnets, and feedback from the platter's tracks to position the heads?

--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
Re:CRC Errors by RockDoctor · 2012-10-17 15:35 · Score: 1

The standard form factor for *server* discs is now 2.5in - essentially the 3.5in format is on its way out. 2.5in does not imply "for laptop use".

I didn't know that. That makes a small (but possibly significant) difference.

Oolite: Elite-like game. For Mac, Linux and Windows [aegidian.org]
I just got back from a run to Orerve. Stroppy bastards there.

--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
Re:CRC Errors by RockDoctor · 2012-10-17 15:39 · Score: 1

The 2004 solar storms caused 2V voltage spikes throughout my systems, which generally exceeds the voltage tolerances of the current drives. The voltage log was scary to read afterwards.
It's a shame you posted as AC ; it might have been interesting to see those voltage logs. (Obviously, as this is an externally-caused event, the worst outcome your site could get is "we are now definitely aware of this problem ; potential customers are encouraged to ask about our mitigation equipment and procedures.")

--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
Re:CRC Errors by RockDoctor · 2012-10-17 15:54 · Score: 1

Whoever thought up this "functionality" should be made to pay for all the lost data. I kid you not. There's stupid, then there's this "behavio[u]r", amounting IMHO to will[-l]ful destruction of user data.
Given a modest amount of understanding of electronics and computers, I would have predicted this from first principles.
I would expect someone who is choosing a hard drive for a system to have taken this into account.
If you designed your systems (including "home-brew"), then it's your fault ; if you're talking about systems that someone else put together for you, then it's their fault. UNLESS they described the system accurately to you, and you had described your needs accurately to them. You did write a contract, didn't you?
Describing yourself as a "computer engineer" doesn't necessarily confer upon you knowledge of either computers, or engineering. (IANA Computer Engineer.)

--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
Re:CRC Errors by tibit · 2012-10-17 16:17 · Score: 1

Don't be obtuse. The scaling of lifetime in proportion to ratio of allocated to available memory -- duh, I'm not talking about that! What I'm talking about is that the drive firmware makes the drive as a whole give up the ghost as soon as the write endurance is used up. There's no first principles behind that other than someone's stupidity (even if it's institutional stupidity). Unfortunately, going full retard like that (throwing out all of the data because a tiny bit of it might be bad and some functionality is lost) also seems to find its way into space missions, occasionally. I invite you to guess which one I'm talking about :)

--
A successful API design takes a mixture of software design and pedagogy.
Re:CRC Errors by RockDoctor · 2012-10-17 17:01 · Score: 1
I am a Nu-Scribe (driver), in Nu-Egypt. I use an electronic Nu-Pen (no easy analogy) to write on Nu-Papyrus (SSD). You, oh Pharaoh, instruct me to write upon the Nu-Papyrus, but I realise that the Nu-Papyrus is worn too thinly to write upon safely. I attempt to relocate to a less-used block of Nu-Papyrus, but none are available.
First question : Do I encounter the problem more often in a 10-sheet stack of Nu-Papyrus, or in a 100-sheet stack? "Doh!"
Second question : As a Nu-Scribe, do I
- 1- write, and damage the Nu-Papyrus, including other data on the Nu-Papyrus?
- 2- Humbly beg my Pharaoh to be given new sheets of Nu-Papyrus?
- 3- Pull a sheet of hidden Nu-Papyrus out of my loincloth and write upon that, without indicating to my Pharaoh that there is a problem looming in the future?
(Other strategies may be available.)
Probably, the best all-round solution would be to go read-only, "Oh Pharaoh! forgive me but I need more Nu-Papyrus!!"
And if I recall correctly, that is pretty much what happened when the Galileo probe had a tape-memory problem. Which was managed by a firmware update from the ground. "Write less, oh Nu-Scribe, upon the Nu-Papyrus, for thou wilt get none more!"
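The Nu-Scribe's dilemma can be put as a toy model. This is only a sketch under simplifying assumptions, not how any real SSD controller works: each physical block survives a fixed number of writes, a worn block is remapped to a spare, and when no spare is left the device goes read-only instead of dropping dead. The class name `ToyFlash` and all its parameters are made up for illustration.

```python
class ToyFlash:
    """Toy model of spare-block remapping with a read-only end state."""

    def __init__(self, blocks: int, spares: int, endurance: int):
        self.wear = [0] * (blocks + spares)      # writes absorbed per physical block
        self.active = list(range(blocks))        # logical -> physical mapping
        self.spare_pool = list(range(blocks, blocks + spares))
        self.endurance = endurance               # writes a block survives
        self.read_only = False

    def write(self, logical: int) -> bool:
        """Return True if the write was accepted, False once read-only."""
        if self.read_only:
            return False
        phys = self.active[logical]
        if self.wear[phys] >= self.endurance:
            if not self.spare_pool:
                # Out of Nu-Papyrus: refuse writes but keep old data readable.
                self.read_only = True
                return False
            phys = self.spare_pool.pop()
            self.active[logical] = phys
        self.wear[phys] += 1
        return True


def writes_until_read_only(spares: int, endurance: int = 2) -> int:
    """Hammer one logical block and count the accepted writes."""
    dev = ToyFlash(blocks=1, spares=spares, endurance=endurance)
    accepted = 0
    while dev.write(0):
        accepted += 1
    return accepted
```

In this model a bigger stack of spares simply postpones the problem (the first question), and the interesting design choice is what `write` does when the pool is empty (the second question): here it refuses new writes and leaves existing data readable.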
--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
Re:CRC Errors by tibit · 2012-10-17 19:22 · Score: 1

Nope, not Galileo :) I meant Huygens/Cassini -- you have to be really, really stupid to throw away one-of-a-kind data on frame sync errors. That's what would have killed Huygens data return -- otherwise, even with cycle slips, the data could be reconstructed on the ground without much trouble. The SSDs are doing the same thing: sure, when wear limits are passed and there's plenty of read errors, you can't relocate failing blocks to save them, but don't pretend that you know better than the owner of the data. Effectively killing the whole drive just because write endurance is up is even worse than the frame dropping on Cassini's probe support avionics. It's like turning off the entire receiver simply because there were some errors.

--
A successful API design takes a mixture of software design and pedagogy.
Re:CRC Errors by tibit · 2012-10-17 19:23 · Score: 1

Probably, the best all-round solution would be to go read-only, "Oh Pharaoh! forgive me but I need more Nu-Papyrus!!"
Whooosh. No kidding! Really?! No shit Sherlock.

--
A successful API design takes a mixture of software design and pedagogy.
Re:CRC Errors by Christian+Smith · 2012-10-17 22:44 · Score: 1

I think the point is that if they produced 3.5" versions for desktops then they'd be able to use less fragile components, and have more room for handling heat dissipation etc. making the product that bit more durable.
SSDs don't run particularly hot. They consume on the order of half a watt most of the time. Not much heat to dissipate, and better done using metal to metal contact anyway.

I suspect the reason they don't is that it's cheaper to have one production line for all of it and just not give a shit about durability, but rest assured, if customers get fed up enough of that and make enough noise, return enough products, or file enough claims, it will change.
I'd be surprised if component quality was a major factor. Most SSD problems are firmware based issues. In the case of dodgy power supplies, well, all bets are off as far as I'm concerned. While undervoltage should not physically damage components, I wouldn't be surprised if a device logically bricks itself instead, though you'd hope there would be a way to reset that with a secure erase.
Re:CRC Errors by Anonymous Coward · 2012-10-18 19:49 · Score: 0

OCZ fanboy detected
Re:CRC Errors by bfandreas · 2012-10-25 10:12 · Score: 1

M4 needs a firmware update. It's got a bug which makes it not return from SMART calls after 5000 hours of total real time up time. Flash that thing and you have a good SSD hd again.

--
20 minutes into the future

Umm by The+MAZZTer · 2012-10-16 04:23 · Score: 4, Insightful

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.

Re:Umm by Anonymous Coward · 2012-10-16 04:31 · Score: 5, Insightful

yeah, sounds like submitter may be mildly deficient

Which is why he's asking.
Fuck people who ask questions when they don't know something, right?
Re:Umm by kelemvor4 · 2012-10-16 04:32 · Score: 5, Informative

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.
Never heard of that. I've got about 450 servers each with a raid1 and raid10 array of physical disks. We always buy everything together, including all the disks. If one fails we get alerts from the monitoring software and get a technician to the site that night for a disk replacement. I think I've seen one incident in the past 14 years I've been in this department where more than one disk failed at a time.

My thought on buying them separately is that you run the risk of getting devices with different firmware levels or other manufacturer revisions which would be less than ideal when raided together. Not to mention you have a mess for warranty management. We replace systems (disks included) when the 4 year warranty expires.
Re:Umm by StoneyMahoney · 2012-10-16 04:35 · Score: 4, Informative

The rationale behind splitting the hard drives in a RAID between a number of manufacturers' batches, even for identical drives, is to try to prevent a problem with an entire batch that slipped past QA from taking out an entire array of drives simultaneously.
I'm paranoid, but am I paranoid enough....?
Re:Umm by hawguy · 2012-10-16 04:38 · Score: 1

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.
I've heard the same, but judging from the serial numbers on disks in our major-vendor storage array, they seem to use same-lot disks, here's a few serial numbers from one disk shelf (partially obscured):
xx-xxxxxxxx4406
xx-xxxxxxxx4409
xx-xxxxxxxx4419
xx-xxxxxxxx4435
xx-xxxxxxxx4448
xx-xxxxxxxx4460
xx-xxxxxxxx4468
They look close enough to be from the same manufacturing lot. Unless the disk manufacturer randomizes disks before applying serial numbers when selling to this storage vendor. We do lose disks occasionally, but they seem to be spread out across different shelves (purchased at different times).
Re:Umm by statusbar · 2012-10-16 04:40 · Score: 5, Insightful

I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! back to backup tapes! Better to have replacement drives in boxes waiting.

--
ipv6 is my vpn
Re:Umm by ByOhTek · 2012-10-16 04:42 · Score: 4, Insightful

In general, if you get such an issue, it will happen early on in the life of the drives. One coworker had what he called the 30-day thrash rule: he would plan ahead and get a huge number of drives (the cheapest available meeting requirements, avoiding manufacturers we had issues with previously), take a handful, and thrash 'em for 30 days. If nothing bad happened, he'd either keep up 30-day thrashes on sets of hard drives, pulling out the duds, or just return the whole lot.

--
Self proclaimed typo king, and inventor of the bear destroying coffee table (patent not pending).
Re:Umm by MightyMartian · 2012-10-16 04:42 · Score: 4, Interesting

Too true. Years ago we bought batches of Seagate Atlas drives, and all of them pretty much started dying within weeks of each other. They were still under warranty, so we got a bunch more of the same drives, and lo and behold within nine months they were crapping out again. It absolutely amazed me how closely together the drives crapped out.

--
The world's burning. Moped Jesus spotted on I50. Details at 11.
Re:Umm by Spazmania · 2012-10-16 04:44 · Score: 2

I lost a server once where the drive batch had a 60% failure rate after 6 months. Unless you're intentionally building the raid for performance (vice reliability), you definitely want to pull drives from as many different manufacturers and batches as you can.

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Re:Umm by DarwinSurvivor · 2012-10-16 04:45 · Score: 1

I'm paranoid, but am I paranoid enough....?
Depends if you have backups.
Re:Umm by Anonymous Coward · 2012-10-16 04:47 · Score: 0

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.
I imagine the portion of the wafer from which your chips are cut is more important than precisely when they were fabbed.
Re:Umm by Anonymous Coward · 2012-10-16 04:48 · Score: 1

I've seen it a few times myself over the past decade. The greatest example of this problem I saw was with a Sun StorEdge A3500 array which had sixty disks. There were three brands of disks used in that array: Seagate, IBM, and one other (can't recall at the moment). It was an almost even three way split. The array was used by a two node Sun Cluster running Oracle and SAP.
Out of nowhere the IBM drives started dying, almost one right after another. We would pull a drive, replace it, and almost as soon as the LUN it was part of finished rebuilding another IBM drive failed, usually in a different LUN. Over the course of 48 hours we had close to 20 IBM drives fail.
It was determined that there was a bug in the firmware that caused those drives to die after a set amount of time. Any IBM drives that didn't die in that 48 hour window were replaced to avoid any more failures. Any other similar IBM drives we had in our datacenter had their firmware updated just as soon as a fix was released.
Fortunately due to good planning of how disks were spaced out between LUNs we didn't suffer an outage, but the cluster connected to that array suffered horrible performance that weekend. If the entire array used those drives we would have been screwed.
(note: this happened a long time ago, if my memory serves me correctly it was the IBM drives failing, but I could be wrong about that)
Re:Umm by Anonymous Coward · 2012-10-16 04:50 · Score: 0

I tried that theory with tapes once. I had my purchasing person track down DLT's from four different vendors. When they arrived they had clearly all come off the same manufacturing line.
I suppose they could have used identical shells but different media, but I kind of doubt it.
Oh well, at least they were probably from different batches.
Re:Umm by Anonymous Coward · 2012-10-16 04:51 · Score: 3, Interesting

Warranty replacement drives are refurbished, meaning they've already failed once. I've never had a refurb drive last a full year without failing. It's gotten bad enough that I don't bother sending them back for warranty replacement anymore.
Re:Umm by nine-times · 2012-10-16 04:51 · Score: 1

The idea is that, though you can't control all the variables of manufacturing, if you have a bunch of disks made with the same machinery at the same time, many of those variables will be the same and so they have an increased chance of failing at the same time-- especially since the increased activity of rebuilding a failed array can sometimes trigger additional failures.
So if you want to be safe, buying from different batches is preferable, even if you buy the same brand of functionally identical drives. I don't know if anyone does it anymore, but some manufacturers used to do this for you when you bought a new server with a big RAID.
Re:Umm by na1led · 2012-10-16 04:52 · Score: 1

Defective Hard Drives usually fail within the first few months. When we purchase new Disk Storage units, it's almost a guarantee 1 or 2 drives will fail in the first 6 months.

--
-- By all means let's be open-minded, but not so open-minded that our brains drop out.
Re:Umm by interval1066 · 2012-10-16 04:57 · Score: 1

Seems like if you buy and commission two identical drives they would fail at about the same time, but that's not my experience either. I guess the 2nd law expresses itself in these scenarios with differing expiration times.

--
Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
Re:Umm by Anonymous Coward · 2012-10-16 05:00 · Score: 1

http://www.redhat.com/archives/rhl-list/2003-December/msg01731.html
Art Kagel who was at Bloomberg:
"Pooh, we lost 50 drives over two weeks when a batch of 200 IBM drives began to fail. IBM discovered that the single lot of
drives would have their spindle bearings freeze after so many hours of operation. Fortunately due in part to RAID10 and in part to a herculean
effort by DG techs and our own people over 2 weeks no data was lost. HOWEVER, one RAID5 filesystem was a total loss after a second drive failed
during recover. Fortunately everything was on tape."
Re:Umm by CaptSlaq · 2012-10-16 05:00 · Score: 4, Informative

I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! back to backup tapes! Better to have replacement drives in boxes waiting.
This. Your spares closet is your best friend in the enterprise. Ensure you keep it stocked.
Re:Umm by Anonymous Coward · 2012-10-16 05:01 · Score: 5, Interesting

Google published a study they did of their own consumer grade drives, and found the same thing. If the drive survives the first month of load, it will likely go on to work for years, but if it throws even just SMART errors in the first 30 days, it is likely to be dodgy.
Re:Umm by Anonymous Coward · 2012-10-16 05:02 · Score: 0

Pretty much that. I worked through the WD quality problems in the late 90s/early 2000s, the DeathStar... er, DeskStar platter issues of the mid 2000s, and the Seagate 10,000/15,000 rpm firmware mixup from just a few years back. Diversifying your load-out, even if difficult, can save your ass when entire lines start croaking all at once.
Re:Umm by MightyMartian · 2012-10-16 05:02 · Score: 1

Needless to say neither did we. And I learned my lesson. Heterogeneous drive arrangements in RAID arrays.

--
The world's burning. Moped Jesus spotted on I50. Details at 11.
Re:Umm by RabidReindeer · 2012-10-16 05:07 · Score: 1

I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! back to backup tapes! Better to have replacement drives in boxes waiting.
Same with me. Except that the drives would invariably fail when I was on vacation. One drive would blow. OK, no problem - it's RAID. Second drive blows 2 days later. Uh-oh.
Come back to work and that size drive is no longer manufactured and no other people in a campus of approximately 1200 people had a spare.
Re:Umm by Anonymous Coward · 2012-10-16 05:10 · Score: 0

It only has to happen once.
We had to restore 20TB of data because too many drives failed before DELL could get the replacements out. Yes, the replacements were there the same day but it was still too late.
This cost a lot of money so management finally listened and declared that drives may not be purchased in large batches anymore.
Best of luck to you.
Re:Umm by infodragon · 2012-10-16 05:10 · Score: 2

Never paranoid enough when dealing with data! I had a RAID 5 (5 disks) of Seagate 80GB SATA disks; 4 failed within an 8 hour window, the 5th failed within 24 hours of the first; this was 3 months after purchase. It was a HUGE PITA. First drive failed and I started an immediate DB dump to an NFS mount. 20GB and 2 hours later the second disk failed and RAID was dead. I ran the other three disks just to see what would happen...
I will NEVER, EVER run two storage media (spinning platter, SSD, ...) from the same lot in the same RAID ever again. I was saved by 20 minutes, in the above situation, from 24 hours of hell.

--
If at first you don't succeed, skydiving is not for you.
Re:Umm by systemeng · 2012-10-16 05:11 · Score: 1

That failure pattern is called the bathtub curve or Weibull distribution.
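Strictly speaking, a single Weibull distribution gives a hazard rate that only falls (shape below 1), stays flat (shape equal to 1), or rises (shape above 1); a bathtub curve is commonly modelled by adding the three together. A minimal Python sketch, with time in months and every shape and scale value below chosen purely for illustration, not fitted to any real drive data:

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def bathtub_hazard(t_months: float) -> float:
    """Sum of three Weibull hazards; all parameters are made-up examples."""
    infant_mortality = weibull_hazard(t_months, shape=0.5, scale=100.0)  # falling
    random_failures = weibull_hazard(t_months, shape=1.0, scale=200.0)   # constant
    wear_out = weibull_hazard(t_months, shape=5.0, scale=60.0)           # rising
    return infant_mortality + random_failures + wear_out


# Compare the failure rate early in life, mid-life, and late in life.
early, middle, late = (bathtub_hazard(t) for t in (0.5, 24.0, 72.0))
```

With these made-up numbers the rate is high in the first weeks, lowest around the two-year mark, and climbs again late in life, which is the pattern the 30-day burn-in stories in this thread are exploiting.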
Re:Umm by flappinbooger · 2012-10-16 05:12 · Score: 1

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.
Excellent idea, for "serious mission critical" applications it would be good to source drives for a RAID array from different channels. Maybe same model/mfg but different vendors...
Otherwise you know they're coming from the same case of drives which came off the same assembly line in a row and had the same guy sneeze on all of them.
It's a nice theory anyway, but do people actually bother to do that? How often do simultaneous failures REALLY happen nowadays?
Maybe for SSDs which are more of a literal "black box" scenario things would be more predictable than with mechanical HDs.

--
Flappinbooger isn't my real name
Re:Umm by gweihir · 2012-10-16 05:14 · Score: 1

While your understanding is correct, there are issues in practice. The first is that many of the people responsible for RAIDs are not that smart. This means most do not get the idea, and many have trouble dealing with slight variations in HDD capacity, causing problems. Then there is the problem that mixing drives gets difficult beyond 3 drives. And when you use SSDs, you should probably mix in a few HDDs that are only written to in normal operation.
That said, I do run several 3-way RAID1 (Linux dmraid) with 3 different disks. Some have one SSD in there and the HDDs as "write mostly", which gives almost SSD speed for the whole array (writes being buffered, reads served just from the SSD). I have drives drop out of the RAID now and then (these are 2.5" notebook drives, seems they crash their firmware about once a year when run 24/7), so the 3-way has definitely paid off, and I suspect the mixing as well.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Umm by Anonymous Coward · 2012-10-16 05:18 · Score: 0

I have experienced a double drive failure on a RAID 5 NetApp box. We had a single drive fail first and the system correctly began a rebuild to the hot spare. About an hour into the rebuild a second drive died. We had to restore from backups. In the aftermath we discovered that all the drives in the system were from the exact same batch built by the manufacturer. Now, whenever I build RAID setups, I do my best to stagger the batch numbers of the drives.
Re:Umm by Sparticus789 · 2012-10-16 05:18 · Score: 2

Also useful to set up your RAID with hot-swap drives. In a 16-drive array, I like to set up with RAID 6 and one hot-swap drive. That way, I can actually lose 2 drives, then one more drive once the hot swap has been populated.
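The bookkeeping for that layout can be sketched in a few lines of Python. This assumes the standby drive (usually called a hot spare) sits outside the RAID 6 set and that each rebuild onto a spare finishes before the next failure; the 2 TB drive size is a made-up example, not from the post:

```python
def raid6_with_spares(bays: int, spares: int, drive_tb: float) -> dict:
    """Summarise a RAID 6 set built from `bays - spares` member drives."""
    members = bays - spares
    if members < 4:
        raise ValueError("RAID 6 needs at least 4 member drives")
    return {
        "members": members,
        # RAID 6 spends two drives' worth of space on parity.
        "usable_tb": (members - 2) * drive_tb,
        # Failures survivable at the same instant.
        "concurrent_failures": 2,
        # Failures survivable one after another, if every spare rebuild completes.
        "sequential_failures": 2 + spares,
    }


layout = raid6_with_spares(bays=16, spares=1, drive_tb=2.0)
```

For 16 bays with one spare this gives 15 members, 13 drives' worth of usable space, and three survivable failures in sequence, matching the "2 drives, then one more" count above.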

--
sudo make me a sandwich
Re:Umm by roc97007 · 2012-10-16 05:23 · Score: 1

I think your experience is closer to normal. Mine has been less so, but was probably an anomaly. A few years back, I built up a medium sized data center with all SCSI fast-wide disks, using them there 7200 RPM disks, which were a new thing at the time. (5400 rpm being the standard at the time.)
Less than two months later, we had our first bearing failure. That this one failed early was actually a godsend, because it gave us time to get prepared for the avalanche of failed drives that started three weeks later.
When more than half of the drives had failed, we pulled all (100%) of the drives out of service and installed older technology.
You guessed it, all the 7200 RPM drives came from the same vendor, who bought them from the same manufacturer, who had bought all their bearings from the same supplier, etc. The root cause, I remember hearing, was traced back to a batch of faulty seal material that the bearing factory had purchased from *their* vendor. As all this happened under warranty, the cost to us was production downtime and some long hours for my team. The vendor very nearly went out of business.
You're right about the firmware thing. In another environment, which featured huge raid arrays, we insisted on the same manufacturer, routinely checked manufacturer's date and flashed all incoming drives to the same firmware level.

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Re:Umm by roc97007 · 2012-10-16 05:24 · Score: 1

You are not paranoid enough. See my other post. A batch of bad drive bearing sealant took out our entire data center.

--
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Re:Umm by Atzanteol · 2012-10-16 05:26 · Score: 1

It's only anecdotal - but I had some experience with this. Home-built RAID5 and got shipped 5 drives all with near sequential serial nos. 3 of the 5 died within a couple years...

--
"Ignorance more frequently begets confidence than does knowledge"

- Charles Darwin
Re:Umm by X0563511 · 2012-10-16 05:27 · Score: 1

First drive failed and I started an immediate DB dump to an NFS mount.
That's where you fucked up. The -FIRST- thing you should have done was secured the storage - eg, replace the failed drive and rebuild. You do NOT want to stress a failed storage array.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:Umm by Lumpy · 2012-10-16 05:27 · Score: 2

I do. Get the refurb back and sell it on ebay for 50% of the going price. you at least get some money back.

--
Do not look at laser with remaining good eye.
Re:Umm by Anonymous Coward · 2012-10-16 05:35 · Score: 0

I don't think this rule of thumb would scale very well. If you have something like a large (multiple PB) Lustre, GPFS, etc. filesystem that spans multiple racks and thousands of drives you are going to be pretty limited in terms of sourcing a diverse pool of drives. Also as someone already mentioned in another post, mixing drive firmware levels in commercial disk arrays can sometimes lead to disastrous results.
Re:Umm by Medievalist · 2012-10-16 05:36 · Score: 1

I've seen every drive in a purchase batch fail within a single week. Expensive server-grade drives, too.
So, we mix our drives in RAID arrays. This is more important with larger drives and RAID5 (which I no longer use personally but others here still do) since in the time it takes to rebuild a RAID5, if you lose another disk or two you're sunk.

My thought on buying them separately is that you run the risk of getting devices with different firmware levels or other manufacturer revisions which would be less than ideal when raided together.
That's not exactly hard to check or manage, honestly.

We replace systems (disks included) when the 4 year warranty expires.
That explains a lot; you'll almost never see these kinds of problems. Most drives either fail in the first couple of weeks or they last past the warranty.
Re:Umm by Anonymous Coward · 2012-10-16 05:37 · Score: 0

Key word: "Seagate". I feel bad bashing them because ALL drive manufacturers have unacceptable failure rates IMHO. But in my experience Seagate is much, much worse. Typical company that puts money into advertising and new new new rather than improving quality and reliability. Too bad everyone else follows that leader. WD have been really good lately, in my experience.
Re:Umm by Bob+the+Super+Hamste · 2012-10-16 05:37 · Score: 5, Informative

For those who are interested the white paper is titled "Failure Trends in a Large Disk Drive Population" and can be found here. It is a fairly short read (13 total pages) and quite interesting if you are into monitoring stuff.

--
Time to offend someone
Re:Umm by Anonymous Coward · 2012-10-16 05:39 · Score: 1

You mean stress like a rebuild? I'll take my chances grabbing the critical data. If a simple read kills the remaining drives, a rebuild isn't going to end well either.
Re:Umm by mlts · 2012-10-16 05:39 · Score: 1

It is a Scylla/Charybdis scenario.
If you buy drives from the same batch, you might end up with one issue nailing your whole array.
If you buy from different batches or makers, you have to contend with drive firmwares (some arrays may not support some revisions), drive brands, the fact that one drive maker's 2TB drive isn't the exact same size as another maker's 2TB drive, and so on.
On the high end, this isn't usually a problem because drives specced for VNX or DS SANs tend to be enterprise grade with some semblance of a warranty, and someone will have their goose cooked if that tier of storage flops.
I'd rather just go with the same batch, than deal with the firmware issues, especially on lower end controllers.
Re:Umm by Anonymous Coward · 2012-10-16 05:40 · Score: 1

Fuck people who ask questions when they don't know something, right?
The worst is when you Google a common question, and it's just an echo chamber of lost souls posting your exact question on vBulletin/phpBB style forums, with nothing but tumbleweeds to show for it.
That. Is my definition of "alone".
Re:Umm by mlts · 2012-10-16 05:43 · Score: 2

That is why one uses RAID 6 with lower tier drives and hot spares.
Lower tier drives (SATA) need RAID 6 and hot spares, because it takes a long time (days) to rebuild a failed drive, which leaves a large window for another drive failure to happen.
Upper tier drives (FC/SSD) are far faster, so the window of vulnerability is a lot smaller, so RAID 5 is more useful. Even then, it doesn't hurt to have a hot spare, so no tech is needed in case of a drive failure. You just change out the failed drive at your relative leisure.
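To put rough numbers on that window of vulnerability, here is a minimal sketch in Python. It assumes drives fail independently at a constant annual failure rate, which is optimistic given how often this thread reports same-batch drives dying together. The 5% rate, the 7 surviving drives, and the 4 h and 48 h rebuild times are made-up illustration values, not measurements.

```python
HOURS_PER_YEAR = 24 * 365


def p_another_failure(surviving_drives, annual_failure_rate, rebuild_hours):
    """Chance that at least one surviving drive fails before the rebuild ends.

    Assumes independent drives with a constant failure rate (an assumption).
    """
    # Chance that a single drive survives the window, scaled down from a year.
    one_drive_survives = (1.0 - annual_failure_rate) ** (rebuild_hours / HOURS_PER_YEAR)
    return 1.0 - one_drive_survives ** surviving_drives


if __name__ == "__main__":
    # Hypothetical 8-drive array with one drive already dead.
    for label, hours in (("fast tier, 4 h rebuild", 4), ("SATA, 48 h rebuild", 48)):
        print(f"{label}: {p_another_failure(7, 0.05, hours):.5f}")
```

Under these assumptions the risk scales almost linearly with rebuild time, which is the argument for double parity on slow drives. Correlated failures would make the real numbers worse.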
Re:Umm by alen · 2012-10-16 05:44 · Score: 1

in our HP servers we use HP drives and i have never seen 2 drives fail in a RAID 5 at the same time. one time i've seen 2 drives fail in the same RAID5 within 2 days. that was once in 10 years.
With HP's warranty and custom firmware we get alerts of predictive failure and HP will replace drives that are predicted to fail.
Re:Umm by alen · 2012-10-16 05:46 · Score: 1

some sales guys from IBM once told me that for LTO the way they make tapes is they have a huge roll of LTO tape and they cut it up. supposedly the inner part is the best and IBM said that's what they use. it's not like every manufacturer makes their own tape, it all comes from the same few factories
Re:Umm by Anonymous Coward · 2012-10-16 05:48 · Score: 0

You mean "hot spare", i.e. already hooked up and known by the RAID controller. Hot swap is when you can physically remove and insert a drive while the array is still functioning.
Re:Umm by GNious · 2012-10-16 05:48 · Score: 1

Depends on spec - I'd try to avoid the same batch number on all drives in a RAID, but will accept them all being from a single supplier.
For me this came up due to bad batches from IBM, but it apparently was "standard" in some businesses before that.
My customers, who need very high uptimes, and tend to, uhm, be insufficient on the whole redundancy and backup thing, seemingly are happy to buy 4 drives from the same batch, so perhaps it is just me that is paranoid.
(No, I've actually not laughed on the few occasions their setups come crashing down)
Re:Umm by StripedCow · 2012-10-16 05:52 · Score: 0

Fuck people who ask questions when they don't know something, right?
In which case, let me ask: why are you so stupid? :)

--
If Pandora's box is destined to be opened, *I* want to be the one to open it.
Re:Umm by Anonymous Coward · 2012-10-16 05:53 · Score: 0

> Never heard of that.
It's been well studied and proven many times. I think you're taking his "around the same time" comment too strictly. Usually when you have one drive quit in an array, the fact that the rest of the drives are failing and are under the additional stress of rebuilding means that the other drives will die quickly. I only manage eight DB servers, but even I have had three arrays fail in the past two years because a second drive failed before I could get another installed and finish rebuilding.
Another thing you're forgetting is that many (most?) of the 15k drives, especially the Seagate ST3300656SS models, are complete crap and have a high failure rate. I've been averaging about one failure a week across the forty-eight ST3300656SS drives I have.
Re:Umm by Anonymous Coward · 2012-10-16 05:54 · Score: 1, Funny

Scotty is your best friend in the Enterprise.
FTFY
Re:Umm by infodragon · 2012-10-16 06:04 · Score: 3, Interesting

[Sarcasm]Nothing like 20/20 hindsight... If I had done anything like trying to rebuild the array it would have fallen apart... Oh wait... If I had followed what you suggested I would have been SCREWED.[/Sarcasm]
I made a decision based on the information on hand. The rebuild would have taken more than a few hours; the 80GB disks were SLOW, i.e. first gen SATA. By executing the DB dump I was reading less than 1/2 the disk capacity rather than writing 100% of the disk capacity. It would be significantly faster to retrieve the data than to rebuild. That time window was critical: 2 hours of read vs 4+ hours of write. I also knew I had all the data on hand and all the scripts, tested monthly, for rebuilding the entire DB on a different server. The decision was easy! Grab the DB data now, redeploy on another system and address the issue on the spot. The system ended up being down 3 hours rather than 24+.
Secondly, the failure was abrupt with no SMART messages, so I couldn't trust the others not to have the same non-reporting issues. I made a choice on the spot on how to proceed, knowing full well I may have signed my own 24h torture warrant. Fortunately I didn't have the worst case happen and I learned a critical lesson.
A bit more information...
+- 30 minutes on each one
First disk failed...
2 hours later second disk failed...
2 hours later third disk failed.
2 hours later 4th disk failed
16 hours later 5th disk failed.

--
If at first you don't succeed, skydiving is not for you.
Re:Umm by David_Hart · 2012-10-16 06:07 · Score: 2

Wait... If you are running RAID-5 without a hot spare or two, you are just doing it wrong....
Re:Umm by X0563511 · 2012-10-16 06:07 · Score: 1

Yuck. That is just plain bad luck :(

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:Umm by barc0001 · 2012-10-16 06:10 · Score: 1

I've been involved with a situation that's about as close to a perfect storm as you can imagine for that and it was still not a problem. In 2001 we had a rack of a dozen IBM servers, each running 3 drives in RAID 5 with a shelf at the bottom with another 10 drives in RAID 6 for the database. Unfortunately these were the infamous IBM drives made in Hungary so they ALL started failing. Under our service agreement we called IBM and had them down at the data center almost daily for the entire month replacing drives within hours of each failing, and then starting to pre-emptively replace the remainder. Didn't lose a single byte. The closest we came was a couple machines had second drives fail about 36-48 hours after the first drives were replaced so ~48-60 hours apart.
Re:Umm by NeverVotedBush · 2012-10-16 06:15 · Score: 3, Insightful

When a drive fails and a RAID goes into reconstruction (if you are set up that way), that's when you are significantly more likely to have another drive fail due to all the extra activity across the RAID.

We see it all the time on a big array. One must hustle to repair/rebuild the RAID... ;-)
Re:Umm by daha · 2012-10-16 06:18 · Score: 1

I recently witnessed two different RAID-6 setups belonging to two different shops where they each had a single drive start to fail, so the hot spare kicked in and started rebuilding. While it was rebuilding a second drive started to fail. No problem! It's RAID-6! While both drives were being rebuilt, a third drive failed.
Drives of a feather fail together. Thank you, I'll be here all week. Try the veal.
Re:Umm by jeffmeden · 2012-10-16 06:19 · Score: 1

I'm paranoid, but am I paranoid enough....?
Depends if you have backups.
*if* ?!!?!?
Re:Umm by infodragon · 2012-10-16 06:19 · Score: 1

That and the fab process is so precise that a fault is replicated so precisely that after 90 days of 24/7 operation they all failed within 24 hours, 4 failing in 8 hours. So it was engineered bad luck!
Anyway, I'm glad those days of system admin are behind me. I'm with my passion now, which is HPC C++ development. Those experiences stuck with me and give me much more respect for the admins of the HW I now use. It's funny and sad to watch their expressions when I talk to them intelligently and with respect. It's like they've never had that happen before.

--
If at first you don't succeed, skydiving is not for you.
Re:Umm by Anonymous Coward · 2012-10-16 06:26 · Score: 5, Insightful

I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! back to backup tapes! Better to have replacement drives in boxes waiting.
This. Your spares closet is your best friend in the enterprise. Ensure you keep it stocked.
And locked. And don't label them "spares". Label them "cold swap fallback device" or something that management won't see as something "extra" that can be "repurposed" (i.e. stolen)
Re:Umm by umghhh · 2012-10-16 06:29 · Score: 1

This is correct. In my engineering degree we had a course on the reliability of complex systems, which is pretty fascinating stuff by the way. Using the same batch increases the risk in the case of a systematic failure, i.e. something is built in incorrectly and has to fail under certain conditions; if those conditions occur, all drives of the same batch may fail the same way, landing you directly in a massive problem instead of a smaller one. This can be mitigated too (such systematic failures also have probabilities, etc.), but the simplest approach is to mix devices from different batches if possible.
The question about paranoia is difficult to answer. If your data is that precious you will also have backups as a secondary line of defense (the first being your RAID), but for high availability you may want to remove or reduce the need to use backups too, so mixing batches, even if it increases the workload, may be a good solution rather than a result of paranoia.
In electrical engineering, and I am pretty sure in any other type but IT (which somehow avoids being engineering and is a science or a craft instead???), reliability calculations are made routinely and used to decrease not only the cost of failure but also the cost of production. I still remember what they taught us in the first hour of the course: how to produce torpedoes cheaply while ensuring that they still reach the target. You have to multiply systems, but there is no need for any of them to be durable the way, say, a turbine in an airplane engine must be; they just have to work for the amount of time they are used and not much longer. Calculating this takes quite some knowledge, empirical as well as theoretical, as does working out how to handle things that you build only once, where you cannot sample a batch to determine the probability. Fascinating.
Re:Umm by Anonymous Coward · 2012-10-16 06:29 · Score: 0

No because then you would have had a box of spare identical hot-swappable drives stored onsite.
Re:Umm by Anonymous Coward · 2012-10-16 06:33 · Score: 1

I had great expectations when I read the paper when it was originally published. Google has millions of drives, so they should have statistically meaningful results. Yet the conclusions of the paper were too inconclusive to really be useful. Maybe the authors were too careful, like real scientists often are. Or perhaps it really didn't uncover any very interesting data. What exactly did you find useful in it?
Re:Umm by Binestar · 2012-10-16 06:34 · Score: 2

Those drives were hit with some sort of power issue. Even same batch of drives it's way too close together for manufacturing flaw. Congrats on getting the data off quickly though.

--
Do you Gentoo!?
Re:Umm by fa2k · 2012-10-16 06:48 · Score: 1

I don't want to discount this as I'm just an amateur, but could you have had some kind of lightning strike or other power anomaly at that time?
Re:Umm by Bob+the+Super+Hamste · 2012-10-16 06:53 · Score: 3, Informative

Mostly the methodology, as well as it disproving some of the standard thinking (that heat or activity kills drives). They were looking for a leading indicator for all drive failures (was some error reported before a given drive crapped out?), which they didn't find, as a large portion of the drives just crapped out without warning. But any drives that did start to report warnings were very likely to crap out shortly (I think their threshold was 60 days), which does help to prevent downtime. Interestingly, I had to look into disk monitoring at my job and ran across that paper, implemented some automated S.M.A.R.T. monitoring, and one of the disks in a box had tossed some errors. People complained because my code was alarming on this issue, so they thought my code was bad. A couple of days later the drive gave up the ghost and I was vindicated.
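For anyone wanting to try the same kind of check, a minimal sketch in Python follows. It is not the monitoring code described above. It assumes the attribute table has already been captured from smartmontools (for example with `smartctl -A`), and the sample text, the watched attribute names, and the "any nonzero raw value" rule are illustrative assumptions; names and raw formats vary by vendor.

```python
# Attributes commonly watched for sector trouble (an assumed, non-exhaustive set).
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

# Illustrative sample of an attribute table, not output from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9123
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def smart_warnings(attribute_table):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    warnings = {}
    for line in attribute_table.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and carry the raw value in column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                warnings[fields[1]] = int(raw)
    return warnings


if __name__ == "__main__":
    print(smart_warnings(SAMPLE))
```

A real monitor would alert on changes over time rather than on a single snapshot.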

--
Time to offend someone
Re:Umm by Peter+Bortas · 2012-10-16 07:04 · Score: 1

For mirrors, that was something that was practiced a bit in the 90s; one brand of disks for each half of the mirror. It wasn't common though, and I haven't seen anyone do it this millennium. When you buy big storage systems you buy them preloaded with whatever disks the vendor recommends.
(And for the last year you were happy if the vendor could deliver enough disks to fill your order. Damn people building HDD factories in areas that get flooded.)
Re:Umm by bobaferret · 2012-10-16 07:06 · Score: 1

That is CORRECT! I had 6 256GB m4's die on me just last week, all at the same time. Not fun. Had them in a RAID 10 configuration, which obviously didn't help. The other twin server used m3's (forget what they are actually called). Haven't had a single problem with them. Turned out to be a firmware issue. Upgrading it fixed the problem. As to how they died? SATA_REQUEST_TIMEOUT. After a point the wear leveling kicks in and can take too long, esp. for high transaction loads. We have a huge amount of churn on our data. So this begs the question: are they worth it? With these drives in a RAID 10 configuration, I have a sustained read rate of 2.7 gigaBYTES per second. This makes the users (gvt public access) very happy. It used to be that our array speeds limited our peak usage. Now it's our net connection.
Re:Umm by Peter+Bortas · 2012-10-16 07:08 · Score: 1

Raidz2+ and a few hot spares is the shit. Do you build volumes with 16 drives though? I rarely go over 8 per volume and then stripe them.
Re:Umm by kasperd · 2012-10-16 07:12 · Score: 4, Interesting

That is why one uses RAID 6 with lower tier drives and hot spares.
The best argument for RAID 6 is bad sectors discovered during reconstruction. Assume one of your disks has a bad sector somewhere. Unless you periodically read through all your disks, you may not notice this for a long time. Now assume a different disk dies. Reconstruction starts writing to your hot spare. But during reconstruction an unreadable sector is found on a different drive. On RAID 5, that means data loss.

I have on one occasion been assigned the task of recovering from pretty much that situation. And some of the data did not exist anywhere else. In the end my only option was to retry reading the bad media over and over until on one pass I got lucky.

With RAID 6 you are much better off. If one disk is completely lost and you start reconstructing to the hot spare, you can tolerate lots of bad sectors. As long as you are not so unlucky to find bad sectors in the exact same location on two different drives, reconstruction will succeed. An intelligent RAID 6 system will even correct bad sectors in the process. When a bad sector is detected during this reconstruction, the data for both the bad sector as well as this location on the hot spare are reconstructed simultaneously and both can be written to the respective disk.

At the end of the reconstruction you not only have reconstructed the lost disk, you have also reconstructed all the bad sectors found on any of the drives. Should one of the disks run out of space for remapping bad sectors in the process, then that disk is next in line to be replaced.
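The difference can be sketched by counting lost blocks per stripe. The following Python is a toy model, not a RAID implementation: it assumes one block per disk per stripe and that a stripe is rebuildable as long as it has lost no more blocks than the array has parity blocks (1 for RAID 5, 2 for RAID 6). The disk ids and stripe numbers in the example are made up.

```python
def unrecoverable_stripes(bad_sectors, dead_disk, parity_disks):
    """Return the sorted stripe numbers that cannot be rebuilt.

    bad_sectors maps disk id -> set of stripe numbers unreadable on that disk.
    parity_disks is 1 for RAID 5 and 2 for RAID 6.
    """
    bad_per_stripe = {}
    for disk, stripes in bad_sectors.items():
        if disk == dead_disk:
            continue  # the dead disk is counted once for every stripe below
        for stripe in stripes:
            bad_per_stripe[stripe] = bad_per_stripe.get(stripe, 0) + 1
    # Every stripe has lost one block to the dead disk, plus any bad sectors.
    return sorted(s for s, bad in bad_per_stripe.items() if 1 + bad > parity_disks)


if __name__ == "__main__":
    # Disk 0 has died; the survivors hide a few bad sectors.
    bad = {1: {10}, 2: {10, 20}, 3: {30}}
    print("RAID 5 loses stripes:", unrecoverable_stripes(bad, dead_disk=0, parity_disks=1))
    print("RAID 6 loses stripes:", unrecoverable_stripes(bad, dead_disk=0, parity_disks=2))
```

In this model RAID 5 loses every stripe where any survivor has a bad sector, while RAID 6 only loses a stripe when two survivors are bad at the same location.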

--

Do you care about the security of your wireless mouse?
Re:Umm by nullchar · 2012-10-16 07:13 · Score: 1

Was this really 'dmraid' or 'mdraid'? I'm curious how you got reads to only come from the SSD, and also how you force a write buffer when programs use fsync() calls. (Though I could live with slow writes with fast reads.)
It sounds awesome to use the SSD for reads, and 1 (or more) HDDs as a mirror in RAID1.
Re:Umm by Anonymous Coward · 2012-10-16 07:13 · Score: 0

quite interesting if you are into monitoring stuff.
In Soviet Russia... stuff monitors you!
Re:Umm by Anonymous Coward · 2012-10-16 07:16 · Score: 0

Same with me. Except that the drives would invariably fail when I was on vacation. One drive would blow. OK, no problem - it's RAID. Second drive blows 2 days later. Uh-oh.
Come back to work and that size drive is no longer manufactured and no other people in a campus of approximately 1200 people had a spare.
Ouch. So what did you end up doing? Restore from backup to an entirely new raid array?
Re:Umm by rijrunner · 2012-10-16 07:22 · Score: 1

Would that be the 2GB drives that shipped on Sparc systems in Q2 1994?
Not that that left me with bad memories, or anything..
Re:Umm by darguskelen · 2012-10-16 07:25 · Score: 1

Also: how many :)
Re:Umm by camperdave · 2012-10-16 07:26 · Score: 1

I'm paranoid, but am I paranoid enough....?
That would make a great signature.

--
When our name is on the back of your car, we're behind you all the way!
Re:Umm by Anonymous Coward · 2012-10-16 07:30 · Score: 1

Warranty replacement drives are refurbished, meaning they've already failed once.
So true. I used to work in a company with enough hardware that we could actually observe how the replacement part in one RMA case was the exact same faulty hardware we had created an RMA case for before. The reason this first came to my attention was that one of the data center technicians asked me: "Are you sure this part is faulty? I replaced it two days ago as well." I checked and confirmed that the replacement part was DOA.

As I was looking through my email for the serial number of the faulty part, I noticed that something didn't add up. That was the point where I realized the email I was looking at was a month old RMA case for the same faulty piece of hardware. Turned out we had just created the third RMA case for the same piece of hardware. Every time we received it as a replacement, it was still suffering from the same fault that we returned it for in the first place.
Re:Umm by viperidaenz · 2012-10-16 07:48 · Score: 1

So who is better? Western Digital or Toshiba?
Re:Umm by amorsen · 2012-10-16 07:48 · Score: 1

The problem is that you DO want all the same type of drive (at least for spinning rust), since for RAID other than RAID-0 write speed is limited by the slowest drive. Since different drives have different performance across the platters, you can easily lose a lot of performance. At least with RAID-1/RAID-10 you only write to two drives at once, so you can perhaps get away with it, but with RAID-5 you get entirely scuppered. Then again, people who care about performance avoid RAID-5 like the plague.

--
Finally! A year of moderation! Ready for 2019?
Re:Umm by Bryansix · 2012-10-16 07:49 · Score: 1

"Storage Spaces". Google it.
Re:Umm by sirsnork · 2012-10-16 07:49 · Score: 2

This is exactly why most RAID cards do patrol reads during low activity.
Of course, that assumes you use a real RAID card rather than software RAID. I'm not aware of any software RAID implementation that does patrol reads.

--

Normal people worry me!
Re:Umm by kasperd · 2012-10-16 08:05 · Score: 2

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too.
It's a claim I have seen often, but I have never seen evidence to support it. A much more likely scenario, which can easily be misdiagnosed as simultaneous drive failures is the following. One disk gets a bad sector, which at first goes unnoticed. A second disk dies. During reconstruction the first bad sector is noticed and causes reconstruction to fail. To protect against that you can use RAID 6 or RAID 1 across three or more drives.

Is there a reason to worry about simultaneous failure of multiple SSDs? I don't know for sure. The best protection against that might be RAID 1 across SSD and harddisk. I don't know if there exists a RAID system that can do this intelligently, but it certainly is feasible.

I wouldn't do this directly on the harddisk because of the bad seek performance of the harddisk. Instead I would take advantage of the much larger capacity you get on harddisks at the same price. Add a log-structured layer that causes writes on the harddisk to be sequential. If the physical capacity of the harddisk is multiple times the logical capacity (which is set to match the size of the SSD), then most of the physical sectors are unused, and there will only rarely be a need to skip used sectors during write (or they can be migrated ahead of time).

In such a setup the mirror on the harddisk would have terrible read performance, so all reads would go to SSD. If you do need to recover after the SSD has failed, the recovery can happen with a decent performance by reading data in the order it is physically stored on the disk rather than in the logical order. The result of this is sequential reads on the harddisk and random access writes on the new SSD.

Again you can use three media to protect against bad sectors discovered during recovery. For performance reasons you'd probably want the third drive to be an SSD as well such that you have two SSDs and just one harddisk.
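The log-structured idea can be sketched in a few lines of Python. This is a toy in-memory model under loud assumptions: the class and method names are hypothetical, the log never wraps, and no stale space is ever reclaimed, so it only shows why writes and recovery reads both end up sequential.

```python
class LogStructuredMirror:
    """Toy model of the harddisk side: every write is appended at the log head."""

    def __init__(self):
        self.log = []      # physical order on disk: (logical_block, data)
        self.latest = {}   # logical_block -> position of its newest copy in the log

    def write(self, logical_block, data):
        # Random logical writes become sequential physical appends.
        self.latest[logical_block] = len(self.log)
        self.log.append((logical_block, data))

    def recover(self):
        """Yield live blocks in physical order, skipping overwritten copies."""
        for position, (logical_block, data) in enumerate(self.log):
            if self.latest[logical_block] == position:
                yield logical_block, data


if __name__ == "__main__":
    mirror = LogStructuredMirror()
    for block, data in ((7, "a"), (3, "b"), (7, "c"), (1, "d")):
        mirror.write(block, data)
    # Sequential scan of the log; the replacement SSD absorbs the random writes.
    print(list(mirror.recover()))
```

Recovery walks the log front to back, so the harddisk never seeks, and the logical block numbers come out in whatever order they were last written.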

--

Do you care about the security of your wireless mouse?
Re:Umm by sabinelr · 2012-10-16 08:05 · Score: 1

If you look carefully at replacement RAID drives from Dell (HP too probably) you'll find that the same stock number might be made by two or three different manufacturers (are there more than two hard drive manufacturers?).
Re:Umm by gweihir · 2012-10-16 08:14 · Score: 1

Ah, sorry, the standard mdraid via mdadm.
The way to direct all reads to the SSD is by marking the HDDs "write-mostly" (-W in mdadm, shows up as (W) in cat /proc/mdstat ). Originally that was for block devices accessed over the net and writes do get buffered and are hence less critical, but works as well with SSD/HDD mixes. At this time you cannot set this for an array element that is up, so what you do is either set -W on creation or kick the HDD, and re-add it with -W.
And, yes, it is pretty awesome, unless you fsync() or the kernel does go into flush-mode you basically get SSD speeds and RAID1 with just a cheap HDD partition as second element. For fsync(), the HDD is still the bottleneck though and the raid driver just writes normally.
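To confirm which members ended up write-mostly, the (W) markers can be picked out of /proc/mdstat. A minimal Python sketch follows; the sample text and device names are made up for illustration, and the regular expression only handles simple device names (letters, digits, underscores).

```python
import re

# Matches members such as "sdb1[1](W)": name, slot number, optional flag markers.
MEMBER = re.compile(r"(\w+)\[(\d+)\]((?:\([A-Z]\))*)")

# Illustrative sample, not captured from a real machine.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1](W) sda1[0]
      117153216 blocks [2/2] [UU]

unused devices: <none>
"""


def write_mostly_members(mdstat_text):
    """Return {array_name: [members flagged (W)]} parsed from mdstat-style text."""
    result = {}
    for line in mdstat_text.splitlines():
        if not line.startswith("md") or " : " not in line:
            continue  # skip the Personalities, block-count and unused-devices lines
        name, rest = line.split(" : ", 1)
        result[name] = [dev for dev, _slot, flags in MEMBER.findall(rest) if "(W)" in flags]
    return result


if __name__ == "__main__":
    print(write_mostly_members(SAMPLE_MDSTAT))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the sample.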

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Umm by kasperd · 2012-10-16 08:14 · Score: 1

This is exactly why most RAID cards do patrol reads during low activity.
Of course that helps. But I still expect RAID 6 without patrol reads to provide better reliability than RAID 5 with patrol reads. I'm not sure which of the two is best for performance.

I'm not aware of any software raid implementation that does patrol reads
At some point in the past I set up a software RAID 5 on Linux and just created a weekly cron job to read through all the physical media. Not exactly a full solution, but it does ensure that bad sectors will get noticed. I don't recall if software RAID 6 already existed at the time, but it certainly wasn't mature yet, so I didn't consider that option.
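A read-through job of that kind might look roughly like the Python below. It is a sketch, not the original cron job: the function name and chunk size are arbitrary, it needs a Unix-like system for `os.pread`, and reads may be served from the page cache rather than the media, so a serious version would want direct I/O or the array's own check facility.

```python
import os


def patrol_read(path, chunk_size=1024 * 1024):
    """Read path from start to end.

    Returns (bytes_read, offsets_of_chunks_that_raised_an_I/O_error).
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # also works for Linux block devices
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, length, offset)
            except OSError:
                # Note the failing chunk and keep going past it.
                bad_offsets.append(offset)
                offset += length
                continue
            if not data:
                break  # unexpected end of device
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets


if __name__ == "__main__":
    import sys
    print(patrol_read(sys.argv[1]))
```

Run against a member device (as root), a non-empty list of offsets is the cue to look at that disk before a rebuild forces the issue.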

--

Do you care about the security of your wireless mouse?
Re:Umm by trygstad · 2012-10-16 08:23 · Score: 1

Had the same thing happen with Seagates in one of my labs; made the OEM PC vendor replace each one as it failed with a Western Digital drive.
Re:Umm by Anonymous Coward · 2012-10-16 08:25 · Score: 0

I have a RAID 0 on two partitions of one and the same disk (i. e., the very same lot) and—lo and behold—it works like a charm.
Re:Umm by Sparticus789 · 2012-10-16 08:27 · Score: 1

I skipped over a sentence there....
I use two drives for the metadata controller. With 14 drives left, I use 6 drives set up in RAID 6 for each of the two volumes I create. The RAID units that I use (no free advertisement here folks) have a global hot-swap option, so I set two drives for that. Technically that means I could lose 4 drives, depending on which drives are actually lost.

--
sudo make me a sandwich
Re:Umm by lewiscr · 2012-10-16 08:42 · Score: 1

RedHat/Fedora has a weekly cron, /etc/cron.weekly/99-raid-check. The current version won't notify you about problems until you edit /etc/sysconfig/raid-check. I only have RAID1, so I don't know if RAID6 automatically repairs. The other RAID levels don't have enough information for an auto-repair, so it requires a human to do it.
ZFS has a scrub command, but it's not scheduled automatically on FreeBSD. This is a combined test & repair command, since ZFS has the checksum information to know which disk is wrong.
Regardless of the hard/soft RAID with a scheduled check or patrol, you should also be running smartd.
Re:Umm by broggyr · 2012-10-16 08:49 · Score: 1

I was going to wish for a mod point, but then I realized you posted AC.

--
Irony? Yea, it's like goldy and bronzy, only it's made of iron!
Re:Umm by Anonymous Coward · 2012-10-16 08:51 · Score: 0

Backups of the backups of the backups of the data.
Re:Umm by aztracker1 · 2012-10-16 09:29 · Score: 1

I've twice had Raid1 incidents where both drives failed... once within 24hrs of the first drive, and the second within 2 days... both times before I could get a replacement... Fortunately, I'm pretty paranoid about backing up, so didn't lose more than 2 days work in either case.

--
Michael J. Ryan - tracker1.info
Re:Umm by aztracker1 · 2012-10-16 09:31 · Score: 1

My home uber-nas has 12 drives (3tb each), 10 in raid-z2 (similar to raid 6) and 2 hot spares... running like a rock, and with software raid, much less concerned about non-drive hardware failure.

--
Michael J. Ryan - tracker1.info
Re:Umm by Peter+Bortas · 2012-10-16 09:42 · Score: 1

That sounds about right. Still waiting for someone to tell me why I find RAID5s of 16+ drives in the wild now and again.
Re:Umm by Anonymous Coward · 2012-10-16 09:48 · Score: 0

Heh. Tell that to my former IT department!!!
Re:Umm by Electricity+Likes+Me · 2012-10-16 10:19 · Score: 2

Western Digital's warranty is still 3 years, although their drives straight up lie about reallocated sector counts in SMART (whereas Seagate does not). This makes failure planning hard, since you can't see if a drive is throwing bad sectors until you run out of replacements and get an uncorrectable error (i.e. data destruction).
Most of my WDs are in a RAIDZ3 though, so it's not so much of a problem.
Re:Umm by dave562 · 2012-10-16 10:43 · Score: 1

The array did not support hot spares?
I see RAID drives fail all the time. The hot spare kicks in and an alert goes to the help desk. The failed drive is replaced and life goes back to normal. In the good SANs, it will even go across disk array enclosures to get hot spares if necessary. We tend to provision 1 spare for every 24 disk enclosure. That has never presented a problem.
Re:Umm by Anonymous Coward · 2012-10-16 10:45 · Score: 0

I have more problems with hardware RAID because of mixing different vendors/models. I always buy the batch from the same vendor as the server, which usually makes sure that all are loaded with the same firmware (having mixtures of anything causes performance issues, which generally create the conditions for future failures).
Re:Umm by Anonymous Coward · 2012-10-16 11:00 · Score: 0

DON'T USE SATA!
Seriously I don't get why people use these (and for DB servers no less)
SATA is 1. "HALF-Duplex" 2. Not for Enterprise use in most cases.
SAS is the replacement for SCSI, which was the enterprise version of IDE/ATA, and the drives were/are manufactured with way better quality. Pick up a SATA drive and an SAS equivalent and first of all notice the weight difference (more platters in SAS, and more 'metal' components that may be plastic or metal-coated plastic in SATA). They also have higher access speeds (dedicated channels with x4, x6, x8 and some x12 connectors in one cable for back-plane and external connections), higher RPMs (good luck getting anything faster than 7200 with SATA when some SAS go up to 15K), not to mention 'FULL-Duplex' access. 6gbps SATA is read or write at 6gbps (you won't actually get that though); SAS is simultaneous Read/Write at 12gbps total on 6gbps drives (6gbps read/6gbps write).
Using SATA on a DB server is just plain stupid. That's the reason your rebuild was SLOW, being half-duplex and all. (You do know that with SATA you shouldn't even have it still online during rebuild; only SAS can do that with minimal slow-down.)
Anyone who doesn't know this shouldn't be messing around with enterprise grade equipment in the first place. (just as bad as those that say you should run networking off a Linksys router because they can do it at home, just "Stupid")
Re:Umm by physlord · 2012-10-16 11:01 · Score: 1

... since they will fail around the same time too. Same would apply to SSDs.
... and condoms
Re:Umm by Anonymous Coward · 2012-10-16 11:12 · Score: 0

Never seen this in my experience running enterprise servers/drives from 9GB SCSI-160 to 600GB SAS-6 and all in-between. Can't say that about IDE/SATA since i have never used them in my servers, always bought SCSI, now SAS and they last for years. (still have countless boxes of 9GB SCSI-160 10K RPM sitting around that still work but useless to do anything with.)
I did have some other depts. spec servers with SATA drives to be cheap (they seem to have replaced 1 or 2 every 6-8 months) and some supplied by vendors that use appliances that "have to run on their hardware" and they seem to drop dead much more frequently.
With SCSI/SAS drives I've usually only lost maybe 1 drive during the whole life of a server. Most were predictive failures and were still working when replaced (some 12+ years old), and several were on one array due to building management not believing heat damages computers (and that server/array was way over-worked and used long past its retirement date).
Now Power-Supplies, Fans die more than the drives. (occasionally a RAID controller battery will overheat and start cooking the circuit board till it blisters and quits, again not enough A/C in some places)
Re:Umm by Anonymous Coward · 2012-10-16 11:33 · Score: 0

Failures tend to follow a bathtub curve. Buying everything at the same time is a little silly unless your supplier has done some burn-in tests.
Re:Umm by drumlight · 2012-10-16 12:11 · Score: 1

I was quite impressed with the fact that both headlight bulbs in my Hyundai died in the same week after ~13 years of use. If the lifetimes were any more consistent I might have been stuck in the dark without replacements; as it was I had the second replacement close at hand when I needed it. I admit this is not very relevant to the topic at hand, but hurrah for chaos and randomness.
Re:Umm by kfsone · 2012-10-16 12:37 · Score: 1

I think the usual reason for avoiding the same batch is to avoid shared physical manufacturing defects. When you're dealing with corporate/industrial grade drives, you want the drives to be as similar to each other as possible.
SSDs and Spindle drives have finite life spans for very different reasons; and the reasons behind SSD lifespan seem like they would make matching a con rather than a pro.

--
-- A change is as good as a reboot.
Re:Umm by kelemvor4 · 2012-10-16 13:37 · Score: 1

I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! back to backup tapes! Better to have replacement drives in boxes waiting.
This. Your spares closet is your best friend in the enterprise. Ensure you keep it stocked.
Keep a hot spare in the system and maintain same day onsite service contracts.
Re:Umm by kelemvor4 · 2012-10-16 13:41 · Score: 1

When a drive fails and a RAID goes into reconstruction (if you are set up that way), that's when you are significantly more likely to have another drive fail due to all the extra activity across the RAID.
Don't use RAID 5. Battle Against Any Raid 5
Re:Umm by infodragon · 2012-10-16 13:42 · Score: 1

I couldn't agree more... But what's best often meets real world. It was a skunkworks project with no budget. It was amazing we got things working the way we did and the results got the attention it needed and then the resources were allocated.
The server was the old 737(?) pin first-gen AMD 64-bit system: 64-bit Gentoo Linux with software RAID running the 5 SATA 80GB Seagate HDs, and 2GB of RAM for a DB of 150GB of which 80% of the data was accessed on a daily basis... It was a CRAZY project put together on the lowest of budgets that achieved results good enough to actually get resources allocated rather than "it's good! keep it up!"
Gotta love the reactions on /.

--
If at first you don't succeed, skydiving is not for you.
Re:Umm by sapphire+wyvern · 2012-10-16 14:19 · Score: 1

I understand the advantages of RAID 6 in this situation, but I don't understand why the impact of one unrecoverable sector seems to be so excessively large. My understanding is that the situation you describe for RAID 5 (one previously unknown bad sector) can cause the array re-build to fail catastrophically - as you said your only option was to retry the bad media until you got lucky.
Why can't the RAID controller just mark that one sector as unrecoverable, and go on to the next set of stripes? I mean, each set of stripes has its own parity block, independent of the others, so the rest of the volume actually doesn't depend on that one bad stripe. There's a pretty good chance that the data stored in that one sector wasn't really all that critical, anyway. If you've got a giant photo collection, it probably doesn't matter that much if one .jpg can't be viewed any more. You certainly wouldn't throw away the rest of the files because one file is no longer readable.
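The point about each stripe having its own parity can be sketched in a few lines. This is only a toy model, assuming plain XOR parity with the parity block always stored last (real RAID 5 layouts rotate it across disks); the function names are made up for illustration and are not any controller's API.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one XOR parity block (toy RAID 5 stripe)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild_block(stripe, lost_index):
    """Recompute the block at lost_index from the surviving blocks.

    A surviving block that cannot be read is represented as None.
    """
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    if any(b is None for b in survivors):
        raise ValueError("a second block in this stripe is unreadable")
    return xor_blocks(survivors)


def rebuild_disk(stripes, lost_index):
    """Best-effort rebuild: one bad stripe does not stop the others.

    Returns (rebuilt_blocks, unrecoverable_stripe_numbers).
    """
    rebuilt, unrecoverable = [], []
    for number, stripe in enumerate(stripes):
        try:
            rebuilt.append(rebuild_block(stripe, lost_index))
        except ValueError:
            rebuilt.append(None)
            unrecoverable.append(number)
    return rebuilt, unrecoverable
```

In this model a surprise bad sector only costs the stripe it sits in; whether a given controller or md setup behaves this way is a separate question.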
Re:Umm by kasperd · 2012-10-16 19:26 · Score: 1

Why can't the RAID controller just mark that one sector as unrecoverable, and go on to the next set of stripes?
There may be controllers that can do that. But doing so requires a much more complicated format. To the best of my knowledge, you cannot instruct a harddisk to turn a specific sector unreadable. So you either write something to that sector, and have silent data corruption, or you need an extra bit of storage somewhere to track which sectors are lost.

If you add a bit per sector to track which are lost, you cannot put that extra bit in the same sector. It won't fit. So you need bitmap sectors somewhere. Losing a bitmap sector would be bad. You'd not know which of the affected sectors contains good data, so you'd have to assume they were all bad. It would not be as bad as losing an entire disk, but you'd lose 2MB if the sector size was 512 bytes or 128MB if the sector size was 4096.
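The 2MB and 128MB figures follow from one bit per tracked sector: a bitmap sector holds eight bits per byte, and each bit stands for one data sector of the same size. A quick check of that arithmetic (the function name is just for illustration):

```python
def data_covered_by_bitmap_sector(sector_size_bytes):
    """Bytes of data whose good/lost state lives in a single bitmap sector."""
    bits_per_bitmap_sector = sector_size_bytes * 8   # one bit per tracked sector
    return bits_per_bitmap_sector * sector_size_bytes


MIB = 1024 * 1024
print(data_covered_by_bitmap_sector(512) // MIB)    # sector size 512 bytes
print(data_covered_by_bitmap_sector(4096) // MIB)   # sector size 4096 bytes
```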

But you still haven't solved the problem. You could mark that one sector as invalid on the new disk. Effectively that one bad sector meant you lost two sectors (unless you were lucky and one of them was parity). But even worse, it put the remaining sectors at that position on other disks at risk, since they no longer have any redundancy. So you need to actually do a new parity over the readable sectors, plus you need to keep the old parity to be able to recover if the disk with the unreadable sector is eventually able to read it. There is no way to do this in place, so the RAID layer needs to set aside extra space somewhere to track this sort of complicated recovery.

If you have a RAID controller, which can take care of all of this, then it is probably using some proprietary layout of the data. That means if you lose the controller, then all the data is lost, unless you can find an identical controller. I haven't heard of any open standard for this sort of thing.

The format used for RAID 5 on Linux doesn't have any of this fancy stuff.

--

Do you care about the security of your wireless mouse?
Re:Umm by jamesh · 2012-10-16 19:43 · Score: 1

Unless you periodically read through all your disks, you may not notice this for a long time
Point me to an enterprise RAID controller that doesn't do this, as well as closely monitor SMART counters??
Even the Linux mdadm software will do a scheduled scan once a month in a default install (under Debian at least... can't speak for any other distribution).
Re:Umm by kasperd · 2012-10-16 19:57 · Score: 1

Even the Linux mdadm software will do a scheduled scan once a month in a default install
It didn't do that last time I set it up. But that was some years ago. A monthly read doesn't completely protect against this sort of problem. Two failures within a month is much more likely than two failures within the time it takes to rebuild a lost disk.

--

Do you care about the security of your wireless mouse?
Re:Umm by jamesh · 2012-10-16 20:30 · Score: 1

Even the Linux mdadm software will do a scheduled scan once a month in a default install
It didn't do that last time I set it up. But that was some years ago. A monthly read doesn't completely protect against this sort of problem. Two failures within a month is much more likely than two failures within the time it takes to rebuild a lost disk.
Maybe it is just a Debian thing?
The once-a-month scan doesn't solve all the problems, but an unused part of the disk can have an error on it and not get discovered for years (think LVM-on-RAID10 with unallocated extents), and there may not be just a single disk involved. You wouldn't know about the errors until a disk was replaced and it started a resync and then notices that another disk is failed (1 error = failed), and then another. Then you have a problem. I don't think it's that hard to solve - I believe you can bring up a raid in 'best effort' mode and just read the stuff off it that you want, but better not to have arrived in that situation in the first place.
Re:Umm by Anonymous Coward · 2012-10-16 20:33 · Score: 0

Regular RAID scrubbing should stop this kind of nasty surprise from happening. Linux's software RAID doesn't do this by default, but activating it is a trivial shell script and cron job.
Re:Umm by kasperd · 2012-10-16 20:58 · Score: 1

Maybe it is just a Debian thing?
According to some other posts in the thread, Red Hat and Fedora also have it. Fedora didn't have it last time I set up a RAID on Fedora, but maybe more recent Fedora versions do.

better not to have arrived in that situation in the first place.
Agreed. RAID6 or periodic background reads can reduce the risk. But neither of them is a suitable replacement for an off-site backup. Combining the three can reduce the risk of data loss even further.

--

Do you care about the security of your wireless mouse?
Re:Umm by Neil+Boekend · 2012-10-16 21:01 · Score: 1

The average temperature graph provided me with new information and I expect it would be actionable for datacenter builders (regulate temp of hdd's between 37 and 45 C).

--
Well, I might have a way, but it only works on a semi spherical planet in a vacuum.
Re:Umm by sapphire+wyvern · 2012-10-16 21:08 · Score: 1

Well, ok.
But how about a software RAID "recovery mode", where it tries to recover the array... if it comes across any unrecoverable surprise bad sectors, it replaces the contents of those sectors with 0, and dumps a report to console saying exactly which sector(s) were unrecoverable. Basically, like a fsck report. For bonus points, if possible, the report should list which files are damaged. "I did my best, here's the mostly-rebuilt volume, here's a list of stuff that's borked. Hope it wasn't anything critical." This sort of situation is abnormal by definition; I don't think it would be helpful or necessary to try and persistently track/manage unrecovered sectors on an ongoing basis. If this were to ever happen to me, I know that I would need to manually analyse the importance of the lost data. If it's just NataliePortmanHotGrits.avi, then not a problem. If it's the source code repository for Stuxnet (or some other piece of precious proprietary data) then clearly you have a bigger issue. And hopefully backups :D
I guess what I'm saying is, when the RAID rebuild fails, it's better to recover as much as possible (with some warnings about the issues that were detected) rather than just throw your hands in the air and say "screw you, this array is entirely garbage now because there was one bad sector on one disk." I'd rather have most of volume's data (with some unrecoverable losses) than nothing at all.
Re:Umm by kasperd · 2012-10-16 22:19 · Score: 1

For bonus points, if possible, the report should list which files are damaged.
The interaction needed between the file system layer and the RAID layer needed for that goes well beyond what the actual block layer interface supports. What I would do in such a situation is to rebuild two copies of the block device, one in which I fill the lost area with all 0 bits and another in which I fill the lost area with all 1 bits. Then I use the file system to read both copies and compare the results. That requires quite some manual work, but it is rarely needed, and if data is at risk, I like to take that approach.
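A rough sketch of that two-copy comparison, under heavy assumptions: the block device is modelled as a list of blocks with None for the lost ones, and `file_extents` is a hypothetical stand-in for reading each file through the file system (a real run would mount both images and compare file contents).

```python
def rebuild_image(blocks, fill_byte, block_size):
    """Concatenate blocks, replacing unreadable ones (None) with fill_byte."""
    filler = bytes([fill_byte]) * block_size
    return b"".join(filler if block is None else block for block in blocks)


def damaged_files(image_zeros, image_ones, file_extents):
    """Names of files whose bytes differ between the two rebuilt copies.

    file_extents maps a file name to (offset, length) within the image.
    A file that reads the same from both copies never touched a lost area.
    """
    return sorted(
        name
        for name, (offset, length) in file_extents.items()
        if image_zeros[offset:offset + length] != image_ones[offset:offset + length]
    )


blocks = [b"abcd", None, b"ijkl"]                 # middle block was unrecoverable
zeros = rebuild_image(blocks, 0x00, block_size=4)
ones = rebuild_image(blocks, 0xFF, block_size=4)
extents = {"a.txt": (0, 4), "b.jpg": (2, 6), "c.log": (8, 4)}
print(damaged_files(zeros, ones, extents))
```

Filling with two different patterns is what makes the damage visible: any byte that came from a lost area is guaranteed to differ between the copies.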

But in the slightly longer term, I would prefer to use file systems with builtin RAID functionality. They can respond much more intelligently to such problems.

I don't think it would be helpful or necessary to try and persistently track/manage unrecovered sectors on an ongoing basis.
I think it is, if you want to avoid silent corruption and still be able to read all the good data.

If it's just NataliePortmanHotGrits.avi, then not a problem.
You gotta be kidding.

I guess what I'm saying is, when the RAID rebuild fails, it's better to recover as much as possible
Of course. The tricky part is to automatically recover good data and still avoid silent corruption. I don't like the RAID to complete a rebuild and leave the RAID in a state where I can never find out which file was corrupted.

New file systems are being designed with checksums of each block and builtin RAID functionality. You can still run those on top of a hardware or software RAID, if you like to.

--

Do you care about the security of your wireless mouse?
Re:Umm by drsmithy · 2012-10-16 22:33 · Score: 1

Of course, that assumes you use a real RAID card rather than software RAID. I'm not aware of any software raid implementation that does patrol reads
Linux software RAID and ZFS can certainly do array scrubs, which is basically the same thing (albeit a bit more intrusive if your system is 24/7).
Re:Umm by marcansoft · 2012-10-17 00:04 · Score: 1

Linux software RAID doesn't do it by default, but you can sure tell it to:
15 8 * * 4 root echo check >> /sys/block/md1/md/sync_action
Re:Umm by infodragon · 2012-10-17 01:28 · Score: 1

They were on a pretty good UPS system connected to a GFI breaker. The room was climate controlled so unless something very weird happened I don't think electrical or environmental were an issue.

--
If at first you don't succeed, skydiving is not for you.
Re:Umm by DarwinSurvivor · 2012-10-17 12:59 · Score: 1

There is no data, It's backups all the way down!
Re:Umm by MikeBabcock · 2012-10-18 03:35 · Score: 1

Hot spares. I always try to insist on hot spares, but of course, very few bother to pay the extra.
A hot spare drive in the array means that as soon as a failure happens, it can take over, reducing potential downtime substantially.

--
- Michael T. Babcock (Yes, I blog)
Re:Umm by Lennie · 2012-10-18 23:20 · Score: 1

Isn't that why a good RAID system has a hot spare drive installed? (The hot spare did not get all the writes the others did.)

--
New things are always on the horizon
Re:Umm by Anonymous Coward · 2012-10-23 18:01 · Score: 0

For Linux SW Raid:
echo check > /sys/block/md0/md/sync_action
and
echo repair > /sys/block/md0/md/sync_action
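For anyone who would rather script this than type the echo commands, here is a minimal Python sketch that writes the same strings to the same sysfs file. The function name and the `sysfs_root` parameter are made up for illustration (the parameter only exists so it can be tried against a scratch directory); on a real system it needs root.

```python
from pathlib import Path

# Only the actions mentioned in this thread, plus "idle" to stop a running one.
VALID_ACTIONS = {"check", "repair", "idle"}


def request_sync_action(md_device, action, sysfs_root="/sys"):
    """Write an action such as "check" to /sys/block/<md_device>/md/sync_action."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    path = Path(sysfs_root) / "block" / md_device / "md" / "sync_action"
    path.write_text(action + "\n")
    return path


if __name__ == "__main__":
    # Equivalent to: echo check > /sys/block/md0/md/sync_action
    request_sync_action("md0", "check")
```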

Re:Die! by Anonymous Coward · 2012-10-16 04:25 · Score: 0, Offtopic

I miss the days when people actually had something useful to add rather than constant lame attempts at humor.

They shrink by Anonymous Coward · 2012-10-16 04:26 · Score: 2, Informative

The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.

Re:They shrink by tgd · 2012-10-16 04:32 · Score: 4, Informative

The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.
Filesystems, generally speaking, aren't resilient to the underlying disk geometry changing after they've been laid down. There's reserved space to replace bad cells as they start to die, but the disk won't shrink. Eventually, though, you get parts of the disk dying in an unrecoverable way and the drive is toast.
Re:They shrink by klui · 2012-10-16 04:43 · Score: 2

Newer disks' cells aren't rated for more than approximately 5000 writes due to process shrink. You're basically hoping the manufacturer's wear leveling firmware is enough to compensate.
Re:They shrink by Auroch · 2012-10-16 04:45 · Score: 1

The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.
Filesystems, generally speaking, aren't resilient to the underlying disk geometry changing after they've been laid down. There's reserved space to replace bad cells as they start to die, but the disk won't shrink. Eventually, though, you get parts of the disk dying in an unrecoverable way and the drive is toast.
Yup, I had a 2nd gen kingston die. Ever had a flash drive go bad? Unless you buy one with a decent controller (sandforce, intel) then you'll have the same experience when your ssd dies.

--
Quartz Extreme and Core Image. Are there any other real reasons to spend all that money on generic hardware?
Re:They shrink by Spazmania · 2012-10-16 04:47 · Score: 1

I can't speak for the accuracy of this, but what I read is that when the SSD runs out of reserved space as a result of reallocation, it switches itself to read-only.

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Re:They shrink by v1 · 2012-10-16 04:55 · Score: 5, Informative

The sectors you are talking about are often referred to as "remaps" (or "spares"), which is also used to describe the number of blocks that have been remapped. Strategies vary, but an off-the-cuff average would be around one available spare per 1000 allocatable blocks. Some firmware will only use a spare from the same track, other firmware will pull the next nearest available spare. (allowing an entire track to go south)
The more blocks they reserve for spares, the lower the total capacity count they can list, so they don't tend to be too generous. Besides, if your drive is burning through its spares at any substantial rate, doubling the number of spares on the drive won't actually end up buying you much time, and certainly won't save any data.
But with the hundreds of failing disks I've dealt with, when more than ~5 blocks have gone bad, the drive is heading out the door fast. Remaps only hide the problem at that point. If your drive has a single block fail when trying to write, it will be remapped silently and you won't ever see the problem unless you check the remap counter in smart. If it gets an unreadable block on a read operation, you will probably see an io error however. Some drives will immediately remap it, but most don't and will conduct the remap when you next try to write to that cell. (otherwise they'd have to return fictitious data, like all zeros)
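The write-versus-read behaviour described above can be modelled with a toy drive. This is a sketch of the description only, not of any vendor's firmware; the class name is invented, and the spare pool is sized at the off-the-cuff one spare per 1000 blocks.

```python
class ToyDrive:
    """Toy model: bad blocks are remapped on write, reported as errors on read."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.spares_left = num_blocks // 1000   # roughly 1 spare per 1000 blocks
        self.remap_count = 0                    # what a SMART remap counter would show
        self.data = {}                          # logical block -> stored bytes
        self.bad = set()                        # blocks whose current location is bad

    def mark_bad(self, block):
        """Simulate a media defect appearing under a logical block."""
        self.bad.add(block)

    def write(self, block, payload):
        if block in self.bad:
            if self.spares_left == 0:
                raise OSError("write failed: no spares left")
            # Silent remap: the host sees a successful write.
            self.spares_left -= 1
            self.remap_count += 1
            self.bad.discard(block)
        self.data[block] = payload

    def read(self, block):
        if block in self.bad:
            # No remap here: there is no good data to put in the spare.
            raise OSError(f"unreadable block {block}")
        return self.data.get(block, b"\x00")
```

With this model a failed read leaves `remap_count` unchanged until the next write to that block, which is why the counter alone can lag behind what the drive already knows.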
So I don't particularly like automatic silent remaps. I'd rather know when the drive first looks at me funny so I can make sure my backups are current and get a replacement on order, and swap it out before it can even think about getting worse. I prefer to replace a drive on MY terms, on MY schedule, not when it croaks and triggers any grade of crisis. There are legitimate excuses for downtime, but a slowly failing drive shouldn't be one of them.
All that said, on multiple occasions I've tried to cleanse a drive of IO errors by doing a full zero-it format. All decent OBCCs on drives should verify all writes, so in theory this should purge the drive of all IO errors, provided all available spares have not already been used. The last time I did this on a 1TB Hitachi that had ONE bad block on it, it still had one bad block (via read verify) when the format was done. The write operation did not trigger a remap, (and I presume it wasn't verified, as the format didn't fail) and I don't understand that. If it were out of remaps, the odds of it being ONE short of what it needed is essentially zero. So I wonder in reality just how many drive manufacturers aren't even bothering with remapping bad blocks. All I can attribute this to is crappy product / firmware design.

--
I work for the Department of Redundancy Department.
Re:They shrink by harrkev · 2012-10-16 05:10 · Score: 1

I have approximately 0% experience with SSD drives. You said ~5 blocks is a death rattle. Is this the sort of thing that S.M.A.R.T. can detect, or do you have to use special SSD monitoring software? In short: how do you detect this condition?

--
"-1 Troll" is the apparently the same as "-1 I disagree with you."
Re:They shrink by Anonymous Coward · 2012-10-16 05:38 · Score: 0

Write, read, Invert, write, read, Invert, write -- That's what I use in my disk utility to ensure every bit can be both read and written in both 1 and 0 positions. If you try that, I'm sure the drive will swap in a spare for the bad spot unless you've disabled the disk's spare-swap-in as I do (via my custom boot-loader), so I can try other more advanced techniques to recover the sector before I give up and get the spare. Yes, it does work. A verified write does not a good read make.
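A sketch of that write/read/invert sequence, using a simulated sector with one stuck bit to show why the inverted pass matters. The `write` and `read` callables are hypothetical stand-ins for raw sector access; nothing here disables a real drive's spare-swapping.

```python
def invert(data):
    """Flip every bit in a byte string."""
    return bytes(b ^ 0xFF for b in data)


def sector_passes(write, read, pattern):
    """Write and read back a pattern, then its inverse, then restore the pattern.

    Returns True only if both the pattern and its inverse read back intact,
    i.e. every bit was seen holding both a 1 and a 0.
    """
    ok = True
    for candidate in (pattern, invert(pattern)):
        write(candidate)
        if read() != candidate:
            ok = False
            break
    write(pattern)   # final write puts the original bits back
    return ok


class StuckBitSector:
    """Simulated sector whose lowest bit of byte 0 is stuck at 1."""

    def __init__(self):
        self.stored = b""

    def write(self, data):
        self.stored = bytes([data[0] | 0x01]) + data[1:]

    def read(self):
        return self.stored
```

With the pattern b"\x55\xAA" the stuck bit happens to match the data, so a single verified write looks fine; only the inverted pass exposes it.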
Re:They shrink by Bob+the+Super+Hamste · 2012-10-16 05:49 · Score: 2, Informative

From my understanding this is exactly the type of thing that S.M.A.R.T is going to detect along with a number of other issues. If you are interested I suggest checking out the paper from Google entitled "Failure Trends in a Large Disk Drive Population" as they made extensive use of S.M.A.R.T and tracked an extremely large number of drives for a number of years for the analysis.

--
Time to offend someone
Re:They shrink by harrkev · 2012-10-16 06:00 · Score: 1

I have seen this before -- good word, IF you are using old-fashioned mechanical drives. S.M.A.R.T has a couple of things about it -- I have heard that sometimes it tends to fib -- manufacturers do not want people to think that their drives are bad. The other big thing is that S.M.A.R.T. was invented BEFORE SSD drives became commercially viable.

--
"-1 Troll" is the apparently the same as "-1 I disagree with you."
Re:They shrink by Anonymous Coward · 2012-10-16 06:07 · Score: 1

Hi Sir,
I'm not sure you understand how hard drives work.
Brand new drives ship to the field with hundreds of errors re-mapped (Primary map).
Re-mapping spots on the drive is a GREAT thing, and is how your data stays protected. Having greater than 5 re-allocates (Secondary Map or Glist) is meaningless in terms of quality. It simply means that your drive is working as designed.
When your drive's quality is at risk, based on analysis of these logs and OEM proprietary logs, then the drive will fail for SMART.
Re:They shrink by v1 · 2012-10-16 06:09 · Score: 3, Informative

SMART is implemented in different ways by different manufacturers. The idea is that the host can ask the peripheral "what value does slot xx contain?" This can refer to an instantaneous condition, such as the temperature of the hard drive, a static value such as how many spares are currently available, a semidynamic value such as is this hard drive failing, and a dynamic value such as how many remap operations have occurred. There's a short list of "basic/standard" values, and then there's the "extended/optional" metrics that not all devices need to support. Each smart slot will also specify the min and max values. If any smart slot has a value outside its allowed range, overall smart status will report as failing. Once a drive toggles over to failing, there's no going back, unless you figure out a way to reset the counters.
One of the standard set is the "is the hard drive failing" metric. It allows the host to get a simple yes/no answer to summarize whether any of the metrics have gone beyond their tolerated values. For example, one drive I worked with recently was allowed to overtemp twice. If it had experienced a third overtemp during its lifetime, the drive would then permanently fail the overall test. This allows the host to "check smart status" without really having to think much about what it's doing. This is the basic test that most modern OS's check to see if a hard drive needs to be replaced. You usually need to run a special tool to check individual values being returned by smart. These tools need to have a list of what each slot means, and often will report fairly meaningless information near the end of the list, where they don't know what this 23 means in slot 85 etc.
Other known values may slowly increment over the lifetime of the drive, such as "head re-calibrations", "remaps", SMS head parks, max g forces experienced, etc. You'd have to compare their current values with their claimed limits to see how close each of these metrics is to causing overall smart to toggle to failed. Without knowing what the metric is, or what its expected limit is, the numbers aren't useful.
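The slot/limit idea above can be sketched as follows. This mirrors the simplified description in this post (a value with an allowed min and max per slot); as far as I know, real drives report a normalized value against a threshold instead, and the slot numbers and names below are made up.

```python
from dataclasses import dataclass


@dataclass
class SmartSlot:
    slot_id: int      # hypothetical slot number
    name: str         # hypothetical label for the metric
    value: int
    minimum: int
    maximum: int

    def in_range(self):
        return self.minimum <= self.value <= self.maximum

    def headroom(self):
        """Distance to the nearest limit; 0 means the next step fails the drive."""
        return min(self.value - self.minimum, self.maximum - self.value)


def overall_status(slots):
    """The simple yes/no answer: any slot out of range fails the whole drive."""
    return "PASSED" if all(slot.in_range() for slot in slots) else "FAILING"


slots = [
    SmartSlot(1, "overtemp_events", value=2, minimum=0, maximum=2),
    SmartSlot(2, "remaps", value=3, minimum=0, maximum=100),
]
print(overall_status(slots), [(s.name, s.headroom()) for s in slots])
```

The headroom figure is the part the basic yes/no check hides: a drive can report PASSED while one slot sits right at its limit.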

--
I work for the Department of Redundancy Department.
Re:They shrink by synapse7 · 2012-10-16 07:04 · Score: 1

Load gnome disk utility or crystal disk info and view the SMART info for yourself.
Re:They shrink by Macman408 · 2012-10-16 07:13 · Score: 1

On the contrary; it depends on how valuable your data is. Google's paper on hard drive failure rates notes that a drive with a single remapping is 15 times more likely to fail in the next 60 days than a disk with no remapped blocks.
Granted, the annualized failure rate for the first 3 months after the remapping is still only ~19% for such a drive - but do you want to take a chance that your drive is about to die?
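To put the ~19% annualized figure on a shorter timescale, here is a back-of-the-envelope conversion. It assumes a constant failure rate over the year, which is my simplification and not something the paper claims.

```python
def window_failure_probability(annualized_failure_rate, days):
    """Chance of failure within `days`, assuming a constant hazard all year."""
    yearly_survival = 1.0 - annualized_failure_rate
    return 1.0 - yearly_survival ** (days / 365.0)


# ~19% annualized, looked at over a 90-day window: roughly 5%
print(f"{window_failure_probability(0.19, 90):.1%}")
```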
Re:They shrink by Bob+the+Super+Hamste · 2012-10-16 07:24 · Score: 1

I thought there were some specific scans S.M.A.R.T could do on SSDs (maybe I am not remembering things exactly) to get relevant data. It wouldn't surprise me if manufacturers do fib, but then they really are just hurting themselves, as how many people actively make use of S.M.A.R.T for disk monitoring? Everything else being equal I would much rather have a drive show signs of failure before it is too late than have it die without warning.

--
Time to offend someone
Re:They shrink by lewiscr · 2012-10-16 08:55 · Score: 1

I have few enough drives that I can track the remaps manually. If a drive has a single remap event, I leave it be. If there is a second event, I replace the drive. I have one drive that remapped 451 sectors on its first day, and has been working fine for 2 years. I do the same for uncorrectable read errors, except that I rebuild the RAID after the first event.
Re:They shrink by LinuxIsGarbage · 2012-10-16 11:23 · Score: 1

I have seen this before -- good word, IF you are using old-fashioned mechanical drives. S.M.A.R.T has a couple of things about it -- I have heard that sometimes it tends to fib -- manufacturers do not want people to think that their drives are bad. The other big thing is that S.M.A.R.T. was invented BEFORE SSD drives became commercially viable.
I had a bad mechanical disk that started stalling and getting read errors, and the OS failed to boot. SMART percentages reported the drive fully healthy; only by looking at the raw data values could I see that there were remapped sectors, etc. This is the data that needs to be scrutinized for any changes.
For the interest of anyone, I booted from an Ubuntu CD and used this method to image the disk:
https://help.ubuntu.com/community/DataRecovery#Imaging_a_damaged_device.2C_filesystem_or_drive
It does one pass to get all the error free sectors first, then it will go back and retry on the sectors with errors. Then I opened the image in Test Disk to get my data.
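The two-pass idea looks roughly like this. It is a toy sketch, not the tool from the linked page (tools such as GNU ddrescue work along these lines but are far more careful); `read_sector` is a hypothetical callable that raises OSError on a bad read.

```python
def image_device(read_sector, num_sectors, sector_size, retries=3):
    """Copy every sector, leaving the troublesome ones for a second pass.

    Returns (image_bytes, sectors_still_unreadable). Sectors that never
    read successfully are zero-filled in the image.
    """
    image = [None] * num_sectors
    failed = []

    # Pass 1: grab everything that reads cleanly, skip over errors quickly.
    for n in range(num_sectors):
        try:
            image[n] = read_sector(n)
        except OSError:
            failed.append(n)

    # Pass 2: go back and retry only the sectors that had errors.
    still_bad = []
    for n in failed:
        for _ in range(retries):
            try:
                image[n] = read_sector(n)
                break
            except OSError:
                continue
        else:
            still_bad.append(n)
            image[n] = b"\x00" * sector_size

    return b"".join(image), still_bad
```

Doing the easy sectors first matters on a dying disk: the retries are what stress it, so you want the bulk of the data safe before they start.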
Re:They shrink by Anonymous Coward · 2012-10-16 11:58 · Score: 0

If you were using an older kernel, say CentOS 4.x or earlier, there was a bug in the kernel on how it accessed one of the blocks on a boundary between LBA remapping schemes. Older drives had firmware that would silently address the right block, but Hitachi at some point switched firmware to start reporting errors on that corner case. In running "badblocks" this would translate to that block *ALWAYS* being reported as bad. In my experience, running into one bad block after several re-write passes is unusual, except in this case.
More details on this are in my 2008 blog post: http://www.tummy.com/journals/entries/jafo_20081008_181615
Re:They shrink by Yaztromo · 2012-10-17 18:46 · Score: 1

Filesystems, generally speaking, aren't resilient to the underlying disk geometry changing after they've been laid down.
No, but many filesystems do have a way of flagging bad blocks as used, so that they can't be accessed. Depending on how the drive fails, as more blocks get marked as bad, the available free space can (conceptually) dwindle down to nothing without any changes in geometry. The simplest way is to claim the sectors as used by a "bad blocks file" so they can't be used for future writes -- which doesn't require any geometry changes.
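A toy allocator shows how free space can dwindle with no change in geometry. The class and method names are invented for illustration and do not correspond to any particular filesystem's on-disk format.

```python
class ToyFilesystem:
    """Fixed-size block pool where bad blocks are claimed by a 'bad blocks file'."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks      # geometry: never changes
        self.free = set(range(total_blocks))
        self.bad_blocks_file = set()          # blocks permanently marked as used

    def mark_bad(self, block):
        """Claim a failing block so it can never be handed out for writes."""
        self.free.discard(block)
        self.bad_blocks_file.add(block)

    def allocate(self):
        """Hand out the lowest-numbered free block."""
        if not self.free:
            raise OSError("no space left on device")
        block = min(self.free)
        self.free.remove(block)
        return block

    def free_space(self):
        return len(self.free)
```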
That being said, it's been a very, very, very long time since I've seen a disk fail in this manner. With over a billion sectors on a 500GB disk, chances are very good the entire drive is simply going to fail to function completely before every block is added to a filesystem's bad blocks list.
Yaz

How do SSD's die by AwesomeMcgee · 2012-10-16 04:26 · Score: 5, Funny

Screaming in agony, hissing bits and bleeding jumperless in the night

Re:How do SSD's die by Anonymous Coward · 2012-10-16 04:57 · Score: 0

Placed with all its treasures on its boat, which is then set on fire, and sent out to sea to... oh, wait, no, I'm thinking of vikings. My mistake.
Re:How do SSD's die by Lumpy · 2012-10-16 05:32 · Score: 1

So the SSD in my laptop will sound like a Borderlands Claptrap dying.... Sweet.

--
Do not look at laser with remaining good eye.

Re:Die! by Anonymous Coward · 2012-10-16 04:27 · Score: 5, Funny

Wow - you've been here a long long time then

When you're nearing maximum write limit by Anonymous Coward · 2012-10-16 04:28 · Score: 0

is when you will see a degradation of performance and possible corruption.

Re:When you're nearing maximum write limit by theNetImp · 2012-10-16 04:33 · Score: 3, Interesting

So by reason of thinking, if you have a RAID of 15 drives for storage of images, these images never change, they are written and never over written, then the SSDs should theoretically never die because they are only reading these bits now?
Re:When you're nearing maximum write limit by Baloroth · 2012-10-16 04:45 · Score: 2, Interesting

So by reason of thinking, if you have a RAID of 15 drives for storage of images, these images never change, they are written and never over written, then the SSDs should theoretically never die because they are only reading these bits now?
Reading flash is not 100% non-destructive: if you never do a re-write, cells near each read cell (which is all of them, probably) will degrade over time. I believe the stored data will degrade over long periods of time in any case, but I'm not sure. But if you re-write data every year or so, they could probably last decades.

--
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
Re:When you're nearing maximum write limit by SydShamino · 2012-10-16 04:51 · Score: 3, Informative

In theory, yes. In flashROM devices the erase process is the aging action. Your write-once-never-erase-read-only flash should last until A) enough charge manages to leak out of gates that you get bit errors, or B) the part fails due to corrosion or other long-term aging issue, similar to any piece of electronics.
If you have raw access to the flashROM you could in theory write the same data into the same unerased bytes to recover from bit errors (if you had an uncorrupted copy), so only aging failures would occur. But of course you can't do this with an SSD as you have no direct access to the memory, and the controller A) wouldn't let you write into unerased space, and B) wouldn't write the data into the exact same place again anyway.

--
It doesn't hurt to be nice.
Re:When you're nearing maximum write limit by Anonymous Coward · 2012-10-19 00:00 · Score: 0

Reading flash is not 100% non-destructive: if you never do a re-write, cells near each read cell (which is all of them, probably) will degrade over time. I believe the stored data will degrade over long periods of time in any case, but I'm not sure. But if you re-write data every year or so, they could probably last decades.
The phenomenon you described is called Read Disturb. Reads of one area of the flash will very slightly disturb (degrade) the charges in adjacent areas. Over time, the cumulative disturbances will cause data loss in the "disturbed" areas of flash, even if those areas have not been accessed (read). Flash controllers avoid this by rewriting (or copying) the disturbed area before the damage gets out of hand.
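One way a controller could decide when to rewrite is to count reads per block and refresh once a limit is hit. That counting scheme and the threshold below are assumptions for illustration; the post only says the disturbed area gets rewritten or copied before the damage gets out of hand.

```python
class ReadDisturbTracker:
    """Counts reads per flash block and flags a block for rewrite at a limit."""

    def __init__(self, refresh_threshold):
        self.refresh_threshold = refresh_threshold   # hypothetical limit
        self.reads = {}                              # block -> reads since last refresh
        self.refreshed = []                          # history of refreshed blocks

    def record_read(self, block):
        """Note one read; return True if the block should be rewritten now."""
        count = self.reads.get(block, 0) + 1
        if count >= self.refresh_threshold:
            self.reads[block] = 0        # fresh copy, disturbance starts over
            self.refreshed.append(block)
            return True
        self.reads[block] = count
        return False
```

Each refresh is itself a write, so a read-only workload still spends a small amount of the flash's erase budget over time.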

They die... by Anonymous Coward · 2012-10-16 04:28 · Score: 1

Spectacularly and without warning.

Firmware bugs by Anonymous Coward · 2012-10-16 04:29 · Score: 2

Didn't happen to me, but a number of people with the same Intel SSD reported that they booted up and the SSD claimed to be 8MB and required a secure wipe before it could be reused. Supposedly it's fixed in the new firmware, but I'm still crossing my fingers every time I reboot that machine.

Re:Firmware bugs by Anonymous Coward · 2012-10-16 04:35 · Score: 0

I can 100% confirm this. Happened to my Intel 320.
Re:Firmware bugs by greg1104 · 2012-10-16 04:38 · Score: 1

That's the Intel 320 series drives. They didn't release a version of those drives claimed suitable for commercial work until the "8MB bug" was sorted out, as the much more expensive 710 series.
Re:Firmware bugs by rrohbeck · 2012-10-16 05:30 · Score: 1

That one happened first on the 64GB X-25E so it's been around for a while.

--
thegodmovie.com - watch it
Re:Firmware bugs by arglebargle_xiv · 2012-10-16 11:01 · Score: 1

Didn't happen to me, but a number of people with the same Intel SSD reported that they booted up and the SSD claimed to be 8MB and required a secure wipe before it could be reused. Supposedly it's fixed in the new firmware, but I'm still crossing my fingers every time I reboot that machine.
They never fixed it, they just stopped responding to complaints about it. I bought one of these ticking time-bombs specifically because of all the noise that Intel makes about the high reliability of their drives, and now I'm sitting on a potential complete data loss. The worst part is that while Intel will eventually replace them when they fail, they'll give you another one of these lemons for the replacement, so there's no escape.

Flash SSD has Write Limitations so... by Anonymous Coward · 2012-10-16 04:29 · Score: 2, Informative

From what I understand, SSDs die because of "write-burnout" if they are flash-based, and from what I understand the majority of SSDs are flash-based now. So while I haven't actually had a drive fail on me, I assume that I would be able to still read data off a failing drive and restore it, making it an ideal failure path. I did a google search and found a good article on the issue: http://www.makeuseof.com/tag/data-recovered-failed-ssd/

Re:Flash SSD has Write Limitations so... by Auroch · 2012-10-16 04:46 · Score: 2

From what I understand, SSDs die because of "write-burnout" if they are flash-based, and from what I understand the majority of SSDs are flash-based now. So while I haven't actually had a drive fail on me, I assume that I would be able to still read data off a failing drive and restore it, making it an ideal failure path. I did a google search and found a good article on the issue: http://www.makeuseof.com/tag/data-recovered-failed-ssd/
Which is why you can do the same from a failed usb flash drive?

It's a nice theory, but it's highly dependent on the controller.

--
Quartz Extreme and Core Image. Are there any other real reasons to spend all that money on generic hardware?
Re:Flash SSD has Write Limitations so... by SydShamino · 2012-10-16 04:58 · Score: 4, Interesting

For flash memory it is the erase cycles, not the write cycles, that drive life.
http://en.wikipedia.org/wiki/Flash_memory
The quantum tunneling effect described for the erase process can weaken the insulation around the isolated gate, eventually preventing that gate from holding its charge. That's the typical end-of-life scenario for a bit of flash memory.
You generally don't say that writes are end-of-life because you could, in theory, write the same pattern to the same byte over and over again (without erasing it) and not cause reduction in part life. Or, since bits erase high and write low, you could write the same byte location eight times, deasserting one new bit each time, then erase the whole thing once, and the total would still only be "one" cycle.
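To make the "bits erase high and write low" point concrete, here is a minimal sketch that models a single flash byte. The `FlashByte` class is hypothetical; it only captures the rule that programming can clear bits and erasing sets them all back to one.

```python
ERASED = 0xFF  # flash erases to all ones


class FlashByte:
    """Hypothetical model of one byte of flash, for illustration only."""

    def __init__(self):
        self.value = ERASED
        self.erase_cycles = 0

    def program(self, data):
        # Programming can only clear bits (1 -> 0), so the result is an AND.
        self.value &= data

    def erase(self):
        # Erasing is the step that wears the part, so it is the one counted.
        self.value = ERASED
        self.erase_cycles += 1


cell = FlashByte()
for bit in range(8):
    cell.program(~(1 << bit) & 0xFF)  # clear one more bit on each pass
after_programming = cell.value        # every bit has now been written low
cell.erase()
```

Eight programming passes followed by one erase leave `erase_cycles` at 1, which is the accounting described above.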

--
It doesn't hurt to be nice.
Re:Flash SSD has Write Limitations so... by klui · 2012-10-16 06:07 · Score: 1

Another problem with SSDs is cells need to be periodically rewritten so that counts against the P/E cycle to which you refer.
Re:Flash SSD has Write Limitations so... by petermgreen · 2012-10-16 06:36 · Score: 1

Since the erase blocks are larger than the write blocks, and larger than the logical blocks used by the OS, SSDs work by remapping logical blocks every time they are written, hopefully in a manner that levels wear as much as possible. They have some spare space to allow for failed blocks and garbage* blocks.
The problem is that these remapping systems are damn complex and therefore prone to bugs, especially in corner cases like power failure (what happens if the power dies while the drive is moving blocks around so it can "garbage collect" an erase block?). Afaict this is the main cause of failures and data corruption on SSDs.
* That is a physical block that is no longer being used to back any logical block but has not yet been erased.
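A toy version of that remapping, as a sketch only. `ToyFTL` and its fields are invented for illustration, and it deliberately ignores the fact that real erase blocks hold many write pages; it just shows writes landing on fresh pages while the old copy becomes garbage.

```python
class ToyFTL:
    """Maps logical blocks to physical pages; every write goes to a fresh page."""

    def __init__(self, physical_pages):
        self.free = list(range(physical_pages))  # erased pages ready for writes
        self.mapping = {}                        # logical block -> physical page
        self.garbage = set()                     # stale pages awaiting erase
        self.store = {}                          # physical page -> data

    def write(self, logical, data):
        if not self.free:
            self.collect_garbage()
        if not self.free:
            raise RuntimeError("out of spare pages")
        page = self.free.pop(0)
        self.store[page] = data
        old = self.mapping.get(logical)
        if old is not None:
            self.garbage.add(old)  # old copy is now garbage, not yet erased
        self.mapping[logical] = page

    def read(self, logical):
        return self.store[self.mapping[logical]]

    def collect_garbage(self):
        # Erase stale pages and return them to the free pool.
        for page in sorted(self.garbage):
            del self.store[page]
            self.free.append(page)
        self.garbage.clear()
```

Even in this tiny form there are three pieces of state (`mapping`, `garbage`, `free`) that must stay consistent with each other, which hints at why power loss in the middle of an update is a nasty corner case.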

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
Re:Flash SSD has Write Limitations so... by SydShamino · 2012-10-16 06:54 · Score: 1

They only need to be rewritten if they have lost their stored charge and thus flipped high. As I noted, you could rewrite the data to the same location without needing to erase it first. Thus you don't need to erase the location first and trigger a cycle. That said the flash vendors I've spoken to are always cagey about writing an unerased location, usually hemming and hawing about it being technically fine before saying the policy is always to erase.

--
It doesn't hurt to be nice.
Re:Flash SSD has Write Limitations so... by Bengie · 2012-10-16 09:04 · Score: 1

Yay, race conditions (sudden power loss is an interrupt). I guess SSDs need to remap blocks atomically, or update state in a way that lets them pick up where they left off.
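One generic way to get that "pick up where it left off" behaviour is an intent log. The sketch below is an assumption about how it could be done, not a description of any actual SSD firmware; `JournaledMap` is a made-up name, and it assumes the new data has already been written to the new physical page before `begin` is called.

```python
class JournaledMap:
    """Hypothetical remap table protected by a one-entry intent journal."""

    def __init__(self):
        self.mapping = {}    # committed logical -> physical table
        self.journal = None  # at most one pending (logical, new_physical) record

    def begin(self, logical, new_physical):
        # Step 1: persist the intent before touching the table.
        self.journal = (logical, new_physical)

    def commit(self):
        # Step 2: apply the record, then clear it. Applying is idempotent,
        # so repeating it after a crash is harmless.
        logical, new_physical = self.journal
        self.mapping[logical] = new_physical
        self.journal = None

    def recover(self):
        # After power loss: finish any remap that was logged but not cleared.
        if self.journal is not None:
            self.commit()
```

If power dies between `begin` and `commit`, calling `recover` on the next boot completes the remap; if it dies before `begin`, the old mapping is still intact.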
Re:Flash SSD has Write Limitations so... by Anonymous Coward · 2012-10-17 05:43 · Score: 0

There's no fundamental difference. It's the count of write-erase cycles that matters: the total cycle of both.

wear leveling by Anonymous Coward · 2012-10-16 04:29 · Score: 2, Informative

SSDs use wear leveling algorithms to optimize each memory cell's lifespan; meaning that it keeps track of how many times each cell was written and it ensures that all cells are being utilized evenly. When the cells fail, they're being kept track of and the drive does not attempt to write to that cell any longer. When enough cells have failed the capacity of the drive will shrink noticeably. At that point it is probably wise to replace it. For a RAID configuration the wear level algorithm would presumably still work as the RAID algorithm pumps even amounts of data to each drive (whether it is mirrored or striped). When any of the drives are shrinking in size it is presumably time to replace the array.
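As a minimal sketch of the bookkeeping described here: the `WearLeveler` class, its per-block granularity and the `max_erases` cutoff are all assumptions for illustration. Real drives track wear per erase block and use more elaborate policies.

```python
class WearLeveler:
    """Hypothetical tracker that always writes to the least-worn usable block."""

    def __init__(self, num_blocks, max_erases):
        self.erase_counts = [0] * num_blocks
        self.max_erases = max_erases
        self.retired = set()  # blocks the drive no longer writes to

    def pick_block(self):
        """Return the usable block with the fewest erases."""
        usable = [b for b in range(len(self.erase_counts)) if b not in self.retired]
        if not usable:
            raise RuntimeError("no usable blocks left")
        return min(usable, key=lambda b: self.erase_counts[b])

    def erase_and_write(self):
        block = self.pick_block()
        self.erase_counts[block] += 1
        if self.erase_counts[block] >= self.max_erases:
            self.retired.add(block)  # treat as worn out from now on
        return block
```

With even wear, all blocks approach the limit at about the same time, so in this model the pool of usable blocks collapses quickly once the first retirements start.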

if they die at the same time repeatably.. by gl4ss · 2012-10-16 04:30 · Score: 1

by performing same set of actions, in unreasonable time, then with 99.999%(the more drives, add 9's) probability it's a bug in the firmware/controller. afaik there shouldn't be such drives on market anymore..

otherwise the nands shouldn't die at the same time. shitty nands I suppose will die faster (a bad batch is shitty).

some drive controllers have counters about the nand use - but they shouldn't all blow up when it hits 0, at which point you're recommended to replace them.

I haven't had one die, though I do have a vertex 2 in daily thrashing use.

--
world was created 5 seconds before this post as it is.

ask someone who tries to salvage data by Anonymous Coward · 2012-10-16 04:30 · Score: 0

ask someone who tries to salvage data from dead ssd drive.
who?
him:
http://www.youtube.com/watch?v=vLoYduckmuo

They usually die gracefully... by dublin · 2012-10-16 04:31 · Score: 5, Informative

In general, if the SSD in question has a well-designed controller (Intel, SandForce), then write performance will begin to drop off as bad blocks start to accumulate on the drive. Eventually, wear levelling and write cycles have taken their toll, and the disk can no longer write at all. At this point, the controller does all it can: it effectively becomes a read-only disk. It should operate in this mode until something else catastrophic (tin migration, capacitor failure, etc.) keeps the entire drive from working.

BTW - I haven't seen this either, but that's the degradation profile that's been presented to me in several presentations by the folks making SSD drives and controllers. (Intel had a great one a few years back - don't have a link to it handy, though...)

--
"The future's good and the present is nothing to sneeze at." - Roblimo's last ./ post

Re:They usually die gracefully... by Anonymous Coward · 2012-10-16 04:44 · Score: 0

thank you.
Re:They usually die gracefully... by AmiMoJo · 2012-10-16 04:51 · Score: 4, Interesting

I had an Intel SSD run out of spare capacity and it was not fun. Windows kept forgetting parts of my profile and resetting things to default or reverting back to backup copies. The drive didn't report a SMART failure either, even with Intel's own SSD monitoring tool. I had to run a full SMART "surface scan" before it figured it out.
That sums up the problem. The controller doesn't start reporting failures early enough and the OS just tries to deal with it as best as possible, leaving the user to figure out what is happening.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re:They usually die gracefully... by fuzzyfuzzyfungus · 2012-10-16 04:58 · Score: 1

In general, if the SSD in question has a well-designed controller (Intel, SandForce), then write performance will begin to drop off as bad blocks start to accumulate on the drive. Eventually, wear levelling and write cycles have taken their toll, and the disk can no longer write at all. At this point, the controller does all it can: it effectively becomes a read-only disk. It should operate in this mode until something else catastrophic (tin migration, capacitor failure, etc.) keeps the entire drive from working.
BTW - I haven't seen this either, but that's the degradation profile that's been presented to me in several presentations by the folks making SSD drives and controllers. (Intel had a great one a few years back - don't have a link to it handy, though...)
That's the theoretical death trajectory. For whatever reasons(presumably relative newness and high cost sensitivity among non-enterprise customers), the 'just dropping off the bus, dead as a stone' and a variety of more peculiar firmware errors seem to be surprisingly common.
I understand that flash, by its nature, degrades with use; but SSDs seem to have a bad habit of failing more and faster than you'd expect from solid state electronics, especially solid state electronics that don't operate under alarming thermal conditions or include a bunch of shoddy electrolytic capacitors or similar. It's very odd.
Re:They usually die gracefully... by Anonymous Coward · 2012-10-16 06:10 · Score: 0

One of the real issues with SSD's is that performance starts to drop off the more you use them. This is because SSD's use a translation table to translate logical LBA's to physical LBA's. It does this to save P/E cycles (Program/Erase), which can be very limited in today's SSD's.
Once your drive gets full, these operations take longer.
One thing is sure, in my mind, HDD's are much more reliable than SSD's. As the SSD market matures, I am sure this will change, just as it has with HDD's when they were first made.
Re:They usually die gracefully... by jamesmeece · 2012-10-16 06:13 · Score: 1

the above is me, sorry I wasn't logged in
Re:They usually die gracefully... by justthinkit · 2012-10-16 06:34 · Score: 2

I am curious if Event Viewer data is helpful as the SSD starts to fail.

--
I come here for the love
Re:They usually die gracefully... by Anonymous Coward · 2012-10-16 07:10 · Score: 0

I had an early model OCZ drive that died. One day, Windows randomly died, it'd bluescreen while booting. Non-specific, right? Well, I was able to reinstall Windows and it happened again in about a week, at which point the drive was essentially read-only.
That said, it wasn't an unexpected failure. I knew I was about to hit 10,000 average erases per block during that month (simple linear regression about 10 months prior using a third-party tool). I didn't get any useful errors from Windows or the BIOS or anything of that nature, so it was confusing until I remembered that my SSD was due to die. Sadly, Windows requires about 1,000,000 erased blocks per boot, which proved to be the major contributor to wear and where problems first arose.
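For anyone wanting to do the same kind of estimate, here is a sketch of a simple linear regression over erase-count readings. The monthly numbers below are made up for the example, and the function names are hypothetical; the readings themselves would have to come from whatever tool exposes the drive's average erase count.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def month_limit_reached(months, avg_erases, limit=10_000):
    """Extrapolate the fitted line to the month where it crosses `limit`."""
    slope, intercept = fit_line(months, avg_erases)
    return (limit - intercept) / slope


# Made-up readings: average erases per block, sampled once a month.
months = [0, 1, 2, 3]
avg_erases = [1000, 2000, 3000, 4000]
predicted = month_limit_reached(months, avg_erases)
```

This only works if wear grows roughly linearly, which is a fair guess for a steady workload but not for one that changes over time.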
Re:They usually die gracefully... by macraig · 2012-10-16 08:53 · Score: 1

You are optimistically assuming that the controller logic and everything else but the NAND media itself doesn't fail. When that happens it's anything but a graceful failure.
Re:They usually die gracefully... by Bengie · 2012-10-16 09:12 · Score: 1

Samsung 830s do very well as they fill up. Triple core CPU in the controller helps with quick garbage collection and remapping. The issue is mostly moot. Something like a 20%-30% drop for your more average SSDs. Still leaps and bounds better than rust buckets, and it only really affects writes.
Re:They usually die gracefully... by Anonymous Coward · 2012-10-16 10:14 · Score: 0

I understand that flash, by its nature, degrades with use; but SSDs seem to have a bad habit of failing more and faster than you'd expect from solid state electronics, especially solid state electronics that don't operate under alarming thermal conditions or include a bunch of shoddy electrolytic capacitors or similar. It's very odd.
Most SSD manufacturers are new to the long-term storage game since they mainly dealt with RAM previously. Pretty much all SSD failures that have been seen/reported have been because of controller bugs, not actual wear.
Traditional spinning disks also degrade with use but they got their wear leveling working back in the 90's.
Re:They usually die gracefully... by Anonymous Coward · 2012-10-16 11:46 · Score: 0

Well-designed controller? Sandforce? Good one.

Re:Die! by Quakeulf · 2012-10-16 04:31 · Score: 1, Insightful

I am new to commenting on /. and I think lame attempts at humor belong to 9GAG and Reddit.

It's because of the noise they make! by SternisheFan · 2012-10-16 04:31 · Score: 1

They die because the people living near Kennedy Airport complain and demonstrate about the noise... What?... Not SST's..., SS"D" 's...... Oh, well, that's different then.

Never mind.

Re:It's because of the noise they make! by Steelwings · 2012-10-16 04:35 · Score: 0

Not to be confused with STD's
Re:It's because of the noise they make! by SternisheFan · 2012-10-16 04:38 · Score: 2

Not to be confused with STD's
Personally, I wouldn't really mind all that much if my STD's died, I can see an upside...
Re:It's because of the noise they make! by neokushan · 2012-10-16 04:42 · Score: 1

Nobody on Slashdot will ever have to worry about those.

--
+1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
Re:It's because of the noise they make! by SternisheFan · 2012-10-16 04:46 · Score: 1

Nobody on Slashdot will ever have to worry about those.
Oooooh... (lol).. that's one of those "it's funny 'cause it's true" jokes. :-)
Re:It's because of the noise they make! by Minwee · 2012-10-16 05:44 · Score: 1

There's a down side. They don't go alone.
Re:It's because of the noise they make! by Anonymous Coward · 2012-10-22 10:20 · Score: 0

Because none of us are dumb enough to catch them.

They die without warning and without recourse by PeeAitchPee · 2012-10-16 04:31 · Score: 3, Informative

With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely. In my experience, though SSDs don't fail as often, when they do, it's sudden and catastrophic. Having said that, I've only seen one fail out of the ~10 we've deployed here (and it was in a laptop versus traditional desktop / workstation). So BACK IT UP. Just my $0.02.

Re:They die without warning and without recourse by PRMan · 2012-10-16 04:49 · Score: 4, Informative

I have had two SSD crashes. One was on a very cheap Zelman 32GB drive which never really worked (OK, about twice). The other was on a Kingston 64GB that I have in my server. When it gets really hot in the room (over 100, so probably over 120 for the drive itself in the case), it will crash. But when it cools down, it works perfectly well.

--
Peter predicted that you would "deliberately forget" creation 2000 years ago...
Re:They die without warning and without recourse by cellocgw · 2012-10-16 04:54 · Score: 5, Interesting

With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely.
OK, so I'm sure some enterprising /.-er can write a script that watches the SSD controller and issues some clicks to the sound card when cells are marked as failed.
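In that spirit, a sketch of the core of such a script. It assumes you already have the text of a SMART attribute table (for example from a tool like `smartctl -A`), that the table has ten whitespace-separated columns with the raw value last, and that the drive reports an attribute named `Reallocated_Sector_Ct`; many SSDs use different attribute names, so treat all of that as an assumption. The sample row is invented.

```python
def reallocated_count(smart_output):
    """Pull the raw Reallocated_Sector_Ct value out of attribute-table text."""
    for line in smart_output.splitlines():
        fields = line.split()
        # Assumed layout: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw_value.
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])
    return None


def clicks_for(previous, current):
    """One terminal bell character per newly reallocated sector."""
    return "\a" * max(0, current - previous)


# Hypothetical sample row laid out like an attribute table line.
SAMPLE = "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       3"
```

Polling the table periodically and printing `clicks_for(last, now)` to a terminal would give the audible warning; sending the clicks to the sound card is left as an exercise.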

--
https://app.box.com/WitthoftResume Code: https://github.com/cellocgw
Re:They die without warning and without recourse by dougmc · 2012-10-16 04:54 · Score: 5, Informative

With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely.
Usually? No.
This does happen sometimes, but it certainly doesn't happen "usually". There are enough different failure mechanisms for hard drives that there isn't any one "usual" method:
1- drive starts reporting read and/or write errors occasionally, but otherwise seems to keep working
2- drive just suddenly stops working completely all at once
3- drive starts making noise (and performance usually drops massively), but the drive still works.
4- drive seems to keep working, but smart data starts reporting all sorts of problems.
Personally, I've had #1 happen more often than anything else, usually with a healthy serving of #4 at about the same time or shortly before. #2 is the next most common failure mode, at least in my experience.
Re:They die without warning and without recourse by phayes · 2012-10-16 05:06 · Score: 1

I'd also heard that SSDs normally fail by going read-only... Not so, or not for me anyway. The 256 GB Crucial C300 drive I had in my Dell e6500 just refused to be recognized one morning. I figured it was just a problem with that drive or some weird conflict with the drive caddy I placed the spindle drive in (replacing the CD), because for some reason other drives worked in the caddy but not the original HD & the C300.
I bought a 256 GB Crucial M4 to replace the failed drive, which one morning started giving read errors on some files (unfortunately on the files used to store some critical data in a VM). I can see the files but cannot read them. The e6500 still runs & I may wipe it & see if the SSD lets me reinit, but I no longer trust it. Right now it's serving me as a mere terminal.
I now have a rMBP which transparently backs up to my NAS over Time Machine...

--
Democracy is a sheep and two wolves deciding what to have for lunch. Freedom is a well armed sheep contesting the issue
Re:They die without warning and without recourse by Anonymous Coward · 2012-10-16 05:21 · Score: 0

Clickers and tickers are by far the most common. Most users are merely incapable of discerning "normal" HDD noises from those that are the harbingers of data loss. These users, by and large, will attribute most #3 scenarios to #2... but it's obvious upon inspection that the drive is tapping out "Jingle Bells" to anyone who listens. This is especially true for very modern drives with their increasingly smaller and quieter components.
In my experience, the most common cause for scenario #2 is a power surge (obviously scarred controller board), excessive heat (discolored casing), or some other outside influence. I even worked on one that had been shot.
Re:They die without warning and without recourse by Anonymous Coward · 2012-10-16 05:25 · Score: 0

Actually the "traditional" failures only happen about half the time. If you lose electronics or have a head fail without crashing, you won't GET noise.
Re:They die without warning and without recourse by Anonymous Coward · 2012-10-16 06:02 · Score: 0

Hah, this reminds me of the story about adding fake engine noise to hybrid cars for safety.
Re:They die without warning and without recourse by Anonymous Coward · 2012-10-16 07:14 · Score: 0

This was a joke, you idiots. (Score:5, Interesting), really?
Re:They die without warning and without recourse by dougmc · 2012-10-16 07:42 · Score: 1

Clickers and tickers are by far the most common. Most users are merely incapable of discerning "normal" HDD noises from those that are the harbingers of data loss. These users, by and large, will attribute most #3 scenarios to #2... but it's obvious upon inspection that the drive is tapping out "Jingle Bells" to anyone who listens.
Often drives that have done a #2 will make strange noises *after* the fact, but with no unusual noises made before they failed.
Or at least no unusual noise that one would actually notice under normal conditions -- i.e. in a computer, with fans running nearby, possibly other hard drives, etc.

excessive heat (discolored casing), or some other outside influence. I even worked on one that had been shot.
Hard drives don't get enough power to heat them up enough to discolor the casing. Though certainly, you can have one component burn up and release all its magic smoke, so you'll have a scorched spot on the PCB.
In 30 or so years, I've had one hard drive attempt to cook itself. It was a ST-251 that had something wrong with it where the entire drive would get (literally) too hot to touch after a while, even when installed somewhere with (normally) adequate air flow. No idea why. That said, the drive still worked, so it didn't actually succeed in cooking itself.
(In that same period, I've had a few dozen drives die on me in other ways, but that was the only one that was trying to cook itself.)
As for "other outside influence", I think we can agree that most drives fail due to causes other than being shot with a gun.
Re:They die without warning and without recourse by Anonymous Coward · 2012-10-16 08:00 · Score: 0

I think the previous poster said "usually" to mean that, *of the failure he's noticed*, usually it was a clicking noise that brought his attention to it.
Sort of like "of the house fires I've noticed, they were usually preceded by a smoky, burning smell."
Re:They die without warning and without recourse by Anonymous Coward · 2012-10-16 08:17 · Score: 0

I never said the drive itself caused the overheating problem. The customer in question ran rack-mounted controller stations for multi-megawatt silicon furnaces. Everything got far too hot, and it was not uncommon to have a cooked controller land on my desk. Among the many indications of overheating was a distinct blue sheen to the HDD casing -- the drives also tended to whine more than normal due to excessive bearing wear.
That said, I have witnessed heat-related HDD death, not caused by an outside influence, present as a very noisy spindle that spins down, then locks and refuses to spin back up. The sound of these drives is unmistakable, and with a little practice, you know which ones not to turn off before you snag a copy of the data.
As for noises generated before and indicative of an impending failure, I can only disagree with you. In roughly 20 years of hardware support, I have found that abnormal noise of some sort almost always precedes a failure -- notable exceptions of course being those caused by firmware, surges, or some other outside influence. Whether ticking, clicking, restarting, spinning "weird", whining, or just sounding "off".... if you listen to one, you can usually tell.
I got out of direct hardware support about five years ago, so these statements do not reflect experience beyond personal builds with any of the more modern, quieter drives... nor anything to do with SSDs. YMMV.
Re:They die without warning and without recourse by Anonymous Coward · 2012-10-16 09:25 · Score: 0

Holy crap! Over 100 degrees in the room where a server is located? Optimal performance conditions say around 70 degrees. Sounds like you need some ventilation and circulation. No wonder you have had two SSDs crash!

SSDs do fail by Anonymous Coward · 2012-10-16 04:32 · Score: 1, Interesting

Pretty much all SSDs have more than 8 chips in a configuration similar to RAID0. If any single chip has a problem, the entire drive is useless. I've seen SSDs fail from the cheap 40GB Patriots all the way up to the high-end Fusion-io drives. *Most* of them died after power cycles; I guess if they are going to fail, that will usually be the time it happens. At least with mechanical disks you can spend some cash and have the data recovered after a failure.
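A back-of-the-envelope way to see why striping across many chips hurts, assuming chip failures are independent and that any one failure kills the drive (a simplification: real drives have ECC and spare area that can absorb some faults). The function name and the probabilities used with it are illustrative, not measured values.

```python
def striped_failure_probability(chip_failure_probability, chips):
    """Chance that at least one of `chips` independent chips fails."""
    return 1 - (1 - chip_failure_probability) ** chips
```

For small per-chip probabilities the result is close to the per-chip probability multiplied by the number of chips, so going from one chip to eight makes a drive roughly eight times as likely to be lost under these assumptions.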

How do SSD's die? by Anonymous Coward · 2012-10-16 04:32 · Score: 0

Suddenly. I've had 2 SSDs fail on me and they both died a sudden and unexpected death.

Re:How do SSD's die? by SternisheFan · 2012-10-16 04:40 · Score: 1

Suddenly. I've had 2 SSDs fail on me and they both died a sudden and unexpected death.
How long did they last, if you don't mind me asking. Or is it "too soon"...

Bang! by greg1104 · 2012-10-16 04:33 · Score: 4, Informative

All three of the commercial grade SSD failures I've cleaned up after (I do PostgreSQL data recovery) just died. No warning, no degrading in SMART attributes; works one minute, slag heap the next. Presumably some sort of controller level failure. My standard recommendation here is to consider them no more or less reliable than traditional disks and always put them in RAID-1 pairs. Two of the drives were Intel X25 models, the other was some terrible OCZ thing.

Out of more current drives, I was early to recommend Intel's 320 series as a cheap consumer solution reliable for database use. The majority of those I heard about failing died due to firmware bugs, typically destroying things during the rare (and therefore not well tested) unclean shutdown / recovery cases. The "Enterprise" drive built on the same platform after they tortured consumers with those bugs for a while is their 710 series, and I haven't seen one of those fail yet. That's not across a very large installation base nor for very long yet though.

Re:Bang! by ColdWetDog · 2012-10-16 04:48 · Score: 5, Funny

Does anyone else find this sort of thing upsetting? I grew up during that period of time when tech failed dramatically on TV and in movies. Sparks, flames, explosions - crew running around randomly spraying everything with fire extinguishers. Klaxons going off. Orders given and received. Damage control reports.
None of this 'oh snap, the hard drive died'.
Personally, I think the HD (and motherboard) manufacturers ought to climb back on the horse. Make failure modes exciting again. Give us a run for the money. It can't be hard - there still must be plenty of bad electrolytic capacitors out there.
How about a little love?

--
Faster! Faster! Faster would be better!
Re:Bang! by greg1104 · 2012-10-16 04:56 · Score: 1

There are more bad electrolytic capacitors out there than ever before. Problem is they don't blow in an exciting way anymore. The stupid things just bow out at the top, with the case completely able to contain the explosion. So lame.
Re:Bang! by Mashiki · 2012-10-16 05:05 · Score: 1

Oddly, I've seen Intel drives fail just as catastrophically as OCZ drives. It really doesn't matter who makes the drive: if an SSD dies, it dies. But if you're running anything important in less than paired RAID 1, and don't expect a failure of any type of storage, you should be beating your head against the wall until your brainbox is a gooey pile of mush. That way, whoever they hire next will hopefully do the job properly, and take your failure as a warning.
Firmware bugs are probably the biggest offenders of drive destruction across the board, on mechanical and silicon drives. Take a look at the seagate 7200.1 series drives, and that firmware fiasco. But at home, I'm still using a OCZ vertex(first generation) on the 1.6fw, it's now been over 5 years(5yrs 4mo actually) and still going strong.

--
Om, nomnomnom...
Re:Bang! by greg1104 · 2012-10-16 05:22 · Score: 1

OCZ is worthless to the commercial market not due to their failure rates, but to their product line thrashing always in search of the next big thing. I waited around some time to get them to ship their Vertex 2 Pro, the only drive they had with a battery cache (supercapacitor or similar) to make them more rugged against power failures. Can't reliably do database work at high speed without one of those. Once available, they were never easy to get in quantity. And then they announced the Vertex 3 Pro, followed by retailers flushing inventory on the 2 Pro. And I don't think the 3 Pro even ever shipped.
Companies putting valuable data on SSD will not stand for the non-stop enthusiast upgrade train like that. They want a drive that's stable and with predictable availability for a few years (focusing on ironing out obscure bugs) after it ships. OCZ is not that sort of company; Intel is.
Re:Bang! by ColdWetDog · 2012-10-16 05:47 · Score: 1

Problem is they don't blow in an exciting way anymore.
That should be amenable to simple engineering modifications. I'm thinking a small dollop of white phosphorus and thermite on the top of the cap. I suppose we can ask Michael Bay. He should know of something that would work.

--
Faster! Faster! Faster would be better!
Re:Bang! by rubycodez · 2012-10-16 05:57 · Score: 1

I would like to request the computer play an .mp3 file repeating a paradoxical question you asked over and over while you ignite those, 'cause every old school sci-fi fan knows that's how you blow up evil computers
Re:Bang! by MrL0G1C · 2012-10-16 06:09 · Score: 1

You want exciting failure, reach round the back of the PC to the power supply, switch the 110/220 selection switch whilst the computer is powered on.
Yeah, I couldn't find the power switch and made that mistake ONCE. (Loud bang + magic smoke, whilst practically hugging the PC)

--
Waterfox - a Firefox fork with legacy extension support, security updates and better privacy by default.
Re:Bang! by Tolkien · 2012-10-16 06:35 · Score: 1

I've heard that raiding them together drastically reduces their lifespan because the trim command gets disabled as a result.

--
how is babby formed?
Re:Bang! by Mashiki · 2012-10-16 06:45 · Score: 1

I wouldn't say they're worthless. But their business line reminds me of Iomega in terms of how they plan out their deployment strategy.

--
Om, nomnomnom...
Re:Bang! by greg1104 · 2012-10-16 07:02 · Score: 1

That depends on implementation. Software RAID on Linux for example can certainly still issue TRIM through a pair of devices.
The TRIM command was a cheap hack back when SSD firmware wasn't very smart. Nowadays good units will do wear-leveling to use up all of the flash. You don't have to tell the drive what data isn't relevant anymore for them to figure that out.
It's still possible to construct synthetic benchmarks where TRIM makes a difference in performance. I don't feel that any of those simulations are relevant to real-world use patterns though.
Re:Bang! by null+etc. · 2012-10-16 07:42 · Score: 1

The miniaturization of electronics has resulted in the miniaturization of spectacular failures.
Re:Bang! by jkflying · 2012-10-16 07:57 · Score: 1

My friend had a big lytic blow just last week... fuzz EVERYWHERE!!!

--
Help I am stuck in a signature factory!
Re:Bang! by RatherBeAnonymous · 2012-10-16 08:46 · Score: 1

The best I ever saw was a hard drive that caught fire spontaneously. I had a user complain about a burning smell in his room and he said that his video editing station would not start up. It was a home-built monstrosity that a predecessor had put together, and for some reason he had installed the hard drives upside-down. In retrospect, "installed" is too strong a word. They were just sitting loose in the drive bays. I took the side off the full tower case and reseated all the cables, and since I could not see anything wrong, I powered it up. Moments later, I saw a 1 inch flame spring up from one of the drives. As I recall it was a diode, judging from the symbol on the logic board where it left a scorch mark.
Re:Bang! by danomac · 2012-10-16 10:17 · Score: 1

You mean like:

sda on fire
Re:Bang! by KaimaraZatar · 2012-10-16 10:44 · Score: 1

the HD (and motherboard) manufacturers ought to climb back on the horse. Make failure modes exciting again.
Edit ini files to rem nosmoke.exe
Re:Bang! by Anonymous Coward · 2012-10-16 11:24 · Score: 0

Well SSD is orders of magnitude faster than Mechanical Spindle disks so seems right that Failures would manifest much faster.
This is what keeps me from spending money on them: yes, they are fast, but when they fail there is no warning. I prefer to just keep a few non-SSD drives and minimize the damage, since most of the time they give fair enough warning that I can order a replacement and start offloading data (random corrupted files start appearing, suddenly and then steadily, and if it doesn't stop after the OS re-formats the partition then the drive is going down and I replace it).
Re:Bang! by Anonymous Coward · 2012-10-16 13:06 · Score: 0

God I remember working in a shop doing some graphic design back around 1990 when smoke started coming out of the computer and then the whole thing locked up and died. I ran for a fire extinguisher right after I unplugged the computer from the wall. Seems the hard drive had caught fire and all the client's data was..... TOAST!
Re:Bang! by tibit · 2012-10-16 16:04 · Score: 1

The biggest deal is that progress is the enemy of quality here. Well-engineered software (following a tried and true process like UL 1998) takes time to be specified, designed, implemented and validated. All SSD vendors would be well advised to have a hardware abstraction layer that tracks changes in hardware, but the core function of data management should be engineered and treated like a long-lifetime, safety-critical software component.

--
A successful API design takes a mixture of software design and pedagogy.
Re:Bang! by WuphonsReach · 2012-10-17 06:05 · Score: 1

If you work in a quiet office, they can sound like a firecracker going off and are good at startling everyone. I had a bunch of NVIDIA GeForce 8400 fanless cards installed in various servers, almost all of them have defective capacitors.

--
Wolde you bothe eate your cake, and have your cake?
Re:Bang! by Anonymous Coward · 2012-10-22 13:20 · Score: 0

Try dumping 10KV into one designed for 25V. You'll get a pop alright!
Re:Bang! by Anonymous Coward · 2012-10-27 09:21 · Score: 0

None of this 'oh snap, the hard drive died'.
Until one or more of those capacitors ignites your box... You'll be wishing that 'Oh snap!' was the worst of your day. True story. My box is no longer with us :(
Re:Bang! by Anonymous Coward · 2012-10-30 11:40 · Score: 0

I taught word processing classes on early PCs and I always told the nervous, first-time users that what they saw on TV and movies was drama ... They weren't going to hurt the PC by anything they typed. Then my future wife turned hers on and a puff of smoke came out of the back. ;) I still tease her as she has crashed numerous PCs. I won't let her touch me until after I do a save. ;)

More importantly by Anonymous Coward · 2012-10-16 04:33 · Score: 1

How do they get to Silicon Heaven?

Re:More importantly by rubycodez · 2012-10-16 05:59 · Score: 1

their souls are transferred at death via T-mobile 4G to the resurrection ship
Re:More importantly by shentino · 2012-10-16 06:02 · Score: 1

Is there a hell though?
Re:More importantly by Kittenman · 2012-10-16 07:25 · Score: 1

There's a Robot hell, if that's a clue.

--
"The greatest lesson in life is to know that even fools are right sometimes" - Winston Churchill

Data corruption, then fails e2fsck upon boot by vlm · 2012-10-16 04:36 · Score: 3, Informative

My experience was system crash due to corruption of loaded executables, then at the hard reboot it fails the e2fsck because the "drive" is basically unwritable so the e2fsck can't complete.

It takes a long time to kill a modern SSD... this failure was from back when a CF plugged into a PATA-to-CF adapter was exotic even by /. standards

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger

Bad blocks by Anonymous Coward · 2012-10-16 04:36 · Score: 1

I've had SSDs die... Basically just got an increasing number of bad blocks due to worn out flash cells.

Dunno about how, but I do know WHEN by davidwr · 2012-10-16 04:36 · Score: 2

Like spinning drives, silicon drives always die when it will do the most damage.

Like right before you find out all your backups are bad.

--
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.

Re:Dunno about how, but I do know WHEN by Anonymous Coward · 2012-10-16 04:46 · Score: 0

Strictly speaking, if you don't check your backups and they're bad, your drives will always fail right before you check your backups.

I have seen SSD death by MRGB · 2012-10-16 04:38 · Score: 5, Informative

I have seen SSD death many times and it is a strange sight indeed. What is interesting about it when compared to normal drives is that when normal drives fail it is - mostly - an all-or-nothing ordeal. A bad spot on a drive is a bad spot on a drive. With SSDs you can have a bad spot in one place, reboot, and get a bad spot in another place. Windows loaded on an SSD will exhibit all kinds of bizarre behaviour. Sometimes it will hang, sometimes it will blue-screen, sometimes it will boot normally until it tries to read or write to that random bad spot. Rebooting is like rolling the dice to see what it will do next - that is, until it fails completely.

Re:I have seen SSD death by King_TJ · 2012-10-16 05:47 · Score: 1

This mimics my limited experience with SSD failure too.
One instance was with a 128GB budget priced SSD (I think a PNY brand or something similar?). I set it up as a Windows 7 boot drive in a new PC tower I assembled for a client, and after several weeks of use, he complained Windows was booting to an unusable state with a black desktop background, missing icons, and a mostly empty set of applications under the Programs menu.
I suspected a virus initially, but the system registry was so clearly damaged, I decided to just format and re-install from scratch. Everything appeared fine so I gave it back to him. A week later, same issue. When I tried to reformat/reinstall that time around, the system blue-screened in the middle of using it with an error having to do with delayed disk write issues. I swapped the drive with a new one, and he's been running fine for about a year now with that one.
SMART in the PC's BIOS is utterly useless on SSDs from what I've seen, because all it does is examine the error counts the drives track internally and flag a problem when certain values count up too high. I don't think the SSDs even use these counters. Perhaps they place false values in them that remain static, just to please software trying to probe them?
Re:I have seen SSD death by jamesh · 2012-10-16 20:16 · Score: 1

I have seen SSD death many times and it is a strange sight indeed. What is interesting about it when compared to normal drives is that when normal drives fail it is - mostly - an all-or-nothing ordeal. A bad spot on a drive is a bad spot on a drive. With SSDs you can have a bad spot in one place, reboot, and get a bad spot in another place. Windows loaded on an SSD will exhibit all kinds of bizarre behaviour. Sometimes it will hang, sometimes it will blue-screen, sometimes it will boot normally until it tries to read or write to that random bad spot. Rebooting is like rolling the dice to see what it will do next - that is, until it fails completely.
In that failure mode it sounds like RAID may be no help at all...

RAID 0 drive crashed by Anonymous Coward · 2012-10-16 04:38 · Score: 0

I had a couple of Vertex 2 60s in RAID 0 running Windows 7.
1. At first, Windows would reboot in the middle of the night.
2. It kept getting worse; eventually it got to the point where it would only run for a few minutes before crashing. Sometimes the drive wasn't recognized at POST.
3. Eventually OCZ replaced it. I had to tell them the red LED on the drive was blinking, indicating it was dead.

I assume that if only a few cells had gone bad it could have recovered, but for it not to show up at POST there must have been some bigger issue.
Only one of the identical drives crashed; the other has been running fine for months (as a single drive now).

Hopefully, like my grandfather by Anonymous Coward · 2012-10-16 04:39 · Score: 0

Who died peacefully, in his sleep. Unlike his train passengers who died painfully while screaming..

1 failed SSD experienced... by StoneyMahoney · 2012-10-16 04:40 · Score: 1

Only seen a single SSD fail. It was a Mini PCIe unit in a Dell Mini 9. I suspect the actual failure may have been atypical, as it seems it failed in just the right place to render the filesystem unwritable, although you could read from fairly hefty sections of it. It was immediate and irreparable, although I suspect SSD manufacturers use better-quality parts than that built-to-a-price (possibly counterfeit) POS.

Had one die twice by bstrobl · 2012-10-16 04:40 · Score: 2

Had an aftermarket SSD for a macbook air fail twice in 2 years (threw it out and placed an original hdd after that). Both times the system decided not to boot and could not find the SSD.

In both cases I have suspected that the Indilinx controller gave way. This seems mirrored in quite a few cases with the experience of others who had drives with these chips in them.

In an ideal scenario the controller should be able to handle the eventual wearout of the disk by finding other memory cells to write to. Any cells that have been used up should still be readable as well since the floating gates basically have been filled up with electrons and will not allow further erasing.

I guess the main issue right now is the fact that SSDs can't notify the user once things get a bit too worn out. Eventually the controller won't be able to keep up with the useless cells and then might simply no longer respond. Things will only get worse as write cycles go down due to smaller manufacturing processes, so that useless controllers in cheap SSDs are more likely to fail.

Re:Had one die twice by koshatul · 2012-10-21 17:58 · Score: 1

I've had two OCZ Octane 2's die, both the same way, clean shutdowns, no hibernation, booted one morning and "CHKDSK needs to run", takes about 15-20 minutes, finding lots of orphan files, reboots, "CHKDSK needs to run", same deal again, repeat about 4-5 more times and if windows did boot, it was pretty much corrupted, system files missing, applications failing to load.
And because I bought them from a "non-OCZ approved reseller" they're apparently not covered on warranty.
Smashed them with a hammer, bought a Samsung 830 series, hasn't skipped a beat yet.

I had one fail by kelemvor4 · 2012-10-16 04:41 · Score: 2

I had a FusionIO IODrive fail a few weeks ago. It was running a data array on a Windows 2008 R2 server. It manifested itself by giving errors in the Windows event log and causing long boot times (even though it was not a boot device). The device was still accessible, but slower than normal. I think the answer to your question will probably vary greatly both by manufacturer and also based on what part of the device failed. The SSDs I've used generally come with a fairly large amount of "backup" memory on them so that if a cell begins to fail, the card marks the cell bad and uses one from one of the backup chips, much like how hard drives deal with bad sectors. As I understand it, the SSD is somehow able to detect the failure before data is lost and begin using the backup chips transparently and automatically, vs having to do a scandisk or similar to do the same on a physical disk. That may very well vary by manufacturer as well.

Re:I had one fail by greg1104 · 2012-10-16 04:50 · Score: 1

The FusionIO devices are provisioned with a fair amount of redundancy at the storage cell level. But if a part in the main controller goes boom, so does the whole device. I've seen that once so far, wasn't fun since the most critical parts of the data were stored there--trying to get the most out of the device's expense. Some of these units are just expensive enough that I've seen a depressing number of people buy just one (rather than a mirrored pair) after buying the sales pitch on the cell redundancy. If you're going to do that, make sure you have some sort of real-time replication over to a cheaper server going on, too.
Re:I had one fail by kelemvor4 · 2012-10-16 05:28 · Score: 1

The FusionIO devices are provisioned with a fair amount of redundancy at the storage cell level. But if a part in the main controller goes boom, so does the whole device. I've seen that once so far, wasn't fun since the most critical parts of the data were stored there--trying to get the most out of the device's expense. Some of these units are just expensive enough that I've seen a depressing number of people buy just one (rather than a mirrored pair) after buying the sales pitch on the cell redundancy. If you're going to do that, make sure you have some sort of real-time replication over to a cheaper server going on, too.
We bought two per server, at about $20k a drive it was a hard sell to management! When I lost the card, the PCIe backplane also failed. No way to tell if the backplane caused the card to fail or vice versa I guess.
Re:I had one fail by Anonymous Coward · 2012-10-16 13:43 · Score: 0

Many years ago I bought a multiple disk NAS RAID device, only to have the controller chip die. I guess it's like the accidental "# rm -fr /", a rite of passage for the neophyte.

Peacefully, with their loved ones at their bedside by Revotron · 2012-10-16 04:41 · Score: 2

as the disk controller reads them their last rites before they integrate with the great RAID array in the sky.

Re:Die! by Anonymous Coward · 2012-10-16 04:42 · Score: 0

You must be new ... oh, wait.

Re:Die! by Vrekais · 2012-10-16 04:42 · Score: 1

Wish I had mod points so bad, Izzard references are always worthy of modding up in my book. If people don't want to read the humorous posts that's what the mod system is for :D

Oblig: T. S. Eliot by stevegee58 · 2012-10-16 04:43 · Score: 4, Funny

Not with a bang but a whimper.

Re:Oblig: T. S. Eliot by Anonymous Coward · 2012-10-16 06:00 · Score: 0

Megah Hurtz- he dead.
Re:Oblig: T. S. Eliot by Anonymous Coward · 2012-10-16 06:32 · Score: 0

I can show you fear in a handful of silicon.

die by SuperRenaissanceMan · 2012-10-16 04:45 · Score: 1

in a fire

--
Any comment mentioning moderation is automatically Offtopic.

Yes they do fail by AnalogDiehard · 2012-10-16 04:45 · Score: 2

We use SSDs in a few Windows machines at work. Running 24/7/365 production. We were replacing them every couple of years.

--
Eternity: will that be smoking, or non-smoking? I Corinthians 6:9-10

Robot Odyssey by Anonymous Coward · 2012-10-16 04:46 · Score: 0

That is all

My SSD is bad! by dittbub · 2012-10-16 04:46 · Score: 2

I have a G.Skill Falcon 64GB SSD that is failing on me. Windows chkdsk started seeing "bad sectors" (whatever this means for an SSD... I think it's really slow parts of the SSD), then saw more and more of them, and Windows would not boot. A fresh install of Windows would crash within a day or two. I did a "secure erase" and that seemed to do the job: a chkdsk found no "bad sectors". But a week later chkdsk found 4 bad sectors. It's going on a month now, though, and I have yet to have Windows fail.

Re:My SSD is bad! by dittbub · 2012-10-16 04:47 · Score: 1

I also updated the drive firmware since, that may have helped stability!

X-25M Death: Firmware bug too? by Anonymous Coward · 2012-10-16 04:50 · Score: 5, Interesting

I had an 80G Intel X-25M fail in an interesting manner. Windows machine, formatted NTFS, Cygwin environment. Drive had been in use for about a year, "wear indicator" still read 100% fine. Only thing wrong with it is that it had been mostly (70 out of 80G full) filled, but wear leveling should have mitigated that. It had barely a terabyte written to it over its short life.

Total time from system operational to BSOD was about ten minutes. I first noticed difficulties when I invoked a script that called a second script, and the second script was missing. "ls -l" on the missing script confirmed that the other script wasn't present. While scratching my head about $PATH settings and knowing damn well I hadn't changed anything, a few minutes later, I discovered I also couldn't find /bin/ls.exe. In a DOS prompt that was already open, I could DIR C:\cygwin\bin - the directory was present, ls.exe was present, but it wasn't anything that the OS was capable of executing. Sensing imminent data loss, and panic mounting, I did an XCOPY /S /E... etc to salvage what I could from the failing SSD.

Of the files I recovered by copying them from the then-mortally-wounded system, I was able to diff them against a valid backup. Most of the recovered files were OK, but several had 65536-byte blocks consisting of nothing but zeroes.
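A check like the one described above (comparing recovered files against a known-good backup and looking for zeroed 64K regions) could be sketched roughly as follows. The function name `zeroed_blocks` and the assumption that the damage is aligned to 65536-byte boundaries are illustrative only; the post does not say how the comparison was actually done.

```python
def zeroed_blocks(recovered, backup, block_size=65536):
    """Return offsets of aligned blocks that are all zero in `recovered`
    but hold non-zero data in `backup`."""
    zero = bytes(block_size)
    offsets = []
    # Only full blocks present in both copies are compared.
    limit = min(len(recovered), len(backup)) - block_size + 1
    for off in range(0, limit, block_size):
        chunk = recovered[off:off + block_size]
        if chunk == zero and backup[off:off + block_size] != zero:
            offsets.append(off)
    return offsets
```

Reading both files with `open(path, "rb").read()` and passing the bytes in would list the damaged offsets for one file; a trailing partial block is ignored in this sketch.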

Around this point, the system (unsurprisingly, as executables and swap and heaven knows what else was being riddled with 64K blocks of zeroes) crashed. On reboot, Windows attempted (and predictably failed) to recover (asinine that Windows tries to write to itself on boot, but also asinine of me to not power the thing down and yank the drive, LOL.) The system did recognize it as an 80G drive and attempted to boot itself - Windows logo, recovery console, and all.

On an attempt to mount the drive from another boot disk, the drive still appeared as an 80G drive once; unfortunately, it couldn't remain mounted long enough for me to attempt further file recovery or forensics.

A second attempt - and all subsequent attempts - to mount the drive showed it as an 8MB (yes, eight megabytes) drive.

I'll bet most of the data's still there. (The early X-25Ms didn't use encryption). What's interesting is that the newer drives have a similar failure mode that's widely recognized as a firmware bug. If there were a way to talk to the drive over its embedded debugging port (like the Seagate Barracuda fix from a few years ago), I'll bet I could recover most of the data.

(I don't actually need the data, as I got it all back from backups, but it's an interesting data recovery project for a rainy day. I'll probably just desolder the chips and read the raw data off 'em. Won't work for encrypted drives, but it might work for this one.)

Re:X-25M Death: Firmware bug too? by Anonymous Coward · 2012-10-16 05:39 · Score: 0

The 8MB mode is not itself a firmware bug. It is a recovery mode caused by a drive issue. The underlying issue varies. The most common cause I have seen is forced power downs.
Re:X-25M Death: Firmware bug too? by RatherBeAnonymous · 2012-10-16 08:29 · Score: 1

A second attempt - and all subsequent attempts - to mount the drive showed it as an 8MB (yes, eight megabytes) drive.
This sounds like the 8MB bug in the Intel 320 series. http://www.techspot.com/news/44694-intel-confirms-8mb-bug-in-320-series-ssds-fix-available.html
There is a known issue with the 320 series where if the drive loses power suddenly it may corrupt itself, lose all data, and report that it has only 8MB capacity. According to Intel, the only way to fix the drive is to do a secure erase, wiping all data. It happened to me once last spring when Windows locked up and I had to force a reboot by holding down the power button. Incidentally, Intel's 4PC10362 firmware that was supposed to fix the problem did not. My drive was running the updated firmware version when it corrupted. This is not considered a manufacturing defect either, so the drive is not replaceable under warranty.
My advice for everyone who asks is to avoid SSDs. If Intel's SSDs, which are widely regarded as the most reliable in the field, can't be relied on in the real world, then the technology is not ready.
Re:X-25M Death: Firmware bug too? by amorsen · 2012-10-16 08:54 · Score: 1

I have seen the exact same scenario. The lower 20GB or so of an 80GB X25-M were stuck at zero and unwritable. My home partition was above 20GB...
Well I did lose a Windows partition, but since I hadn't actually booted that one in at least a month, it probably wasn't all that important.

--
Finally! A year of moderation! Ready for 2019?
Re:X-25M Death: Firmware bug too? by Anonymous Coward · 2012-10-17 05:14 · Score: 0

The 8MB mode is not itself a firmware bug. It is a recovery mode caused by a drive issue. The underlying issue varies. The most common cause I have seen is forced power downs.

That'd be consistent with my X-25M story - when the OS finally crashed, it went down so hard I had to do a hard reset. Recovery console's attempt to recover failed just as hard. Drive was then physically powered off for removal, too.
(Although I waited a minute or so for the hung system to be quiescent, without having physical drive platters to listen to, for all I knew the half-dead OS could have been trying desperately to write to the failing disk when the reset button got hit, or when the plug got pulled. Or the controller might have had a bunch of stuff that it was trying to write to itself, only to be thwarted over and over again, as block after block became unwritable...)
On remount, I was using a removable SATA dock for convenience, also arguably a mistake. (Plugging it physically into an unused SATA port on the motherboard and taking steps to ensure that the motherboard booted the restored OS from the correct drive, would have eliminated the possibility of oddball powerup/powerdown states that are inherent in SATA docks. I've forgotten whether the dock was powered up when the drive was inserted, or if I inserted the drive into the dock, and then connected the dock's cable to the eSATA port, and if there are issues involving forced power-downs unrelated to the failing drive, the order of physical operations performed might have been important, possibly compounded by the edge case of doing so with a failing drive.)
Re:X-25M Death: Firmware bug too? by The+Finn · 2012-10-20 18:02 · Score: 1

Sounds familiar. I had an 80GB early Intel SSD (INTEL SSDSA2MH080G1GC) in a MacBook Pro, which gave me the OS X blue screen equivalent while I was working on it one day without any warning. It wouldn't even boot. Disk utilities under OS X (run from another system, obviously) were unable to even fsck the filesystems, so I replaced the disk. (I had the foresight to make backups, so I didn't lose anything.)
I moved the failed SSD to another machine to take a look at the SMART parameters, which still showed 96% lifetime left, although it did show read errors after a captive self-test. An email to Intel tech support indicated that a secure delete might bring it back to life, and indeed it did. After the secure delete, the drive was reformatted, and I now use it as storage for a couple of VMs, and so far so good, although I'm careful not to have anything critical on it.
So no, not graceful. Sudden and catastrophic. I'm wondering what a subsequent failure will look like, now that I have smartmontools keeping an eye on it.
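For anyone wanting to keep an eye on SMART values in a similar way, here is a minimal sketch that parses the attribute table printed by `smartctl -A`. The sample text is hypothetical and only imitates the usual column layout; the attribute names and numbers in it are illustrative, and the helper names `parse_smart_attributes` and `at_or_below_threshold` are made up for this example.

```python
def parse_smart_attributes(text):
    """Parse a `smartctl -A` style attribute table into a dict mapping
    attribute name -> (value, worst, thresh, raw)."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(parts) != 10 or not parts[0].isdigit():
            continue
        try:
            value, worst, thresh = int(parts[3]), int(parts[4]), int(parts[5])
        except ValueError:
            continue  # skip rows whose normalized columns are not numeric
        attrs[parts[1]] = (value, worst, thresh, parts[9].strip())
    return attrs


def at_or_below_threshold(attrs):
    """Names of attributes whose normalized value has reached a nonzero threshold."""
    return sorted(name for name, (value, _worst, thresh, _raw) in attrs.items()
                  if thresh > 0 and value <= thresh)


# Hypothetical sample output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       2
233 Media_Wearout_Indicator 0x0032   096   096   000    Old_age   Always       -       0
"""

attrs = parse_smart_attributes(SAMPLE)
print(attrs["Media_Wearout_Indicator"][0])
print(at_or_below_threshold(attrs))
```

As the story above shows, a healthy-looking wear value did not predict the failure, so this kind of check is at best one signal among several.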

--
NetBSD: the cathedral vs the bizzare.

SSD wear cliff by RichMan · 2012-10-16 04:50 · Score: 4, Informative

SSDs have an advertised capacity N and an actual capacity M, where M > N. In general, the bigger M is relative to N, the better the performance and lifetime of the drive. As it wears, the drive will "silently" retire bad blocks and reduce M, and your write performance will degrade. If you have good analysis tools, they will tell you when a lot of blocks are getting near end of life and when M is being reduced.

Blocks near end of life are also more likely to get read errors. The drive firmware is supposed to juggle things around so that all of the blocks reach end of life at about the same time. With a soft read error, the block's data will be moved to a more reliable portion of the SSD. That means increased wear.

1. Watch write performance/spare block count
2. If you get any read errors, do a block life audit
3. When you get into life-limiting events, things accelerate to bad due to the mitigation behaviors

Be careful: depending on the sensitivity of the firmware, it may let you get closer to catastrophe before warning you. Consumer-grade drives are more likely to cut it close.
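The N versus M bookkeeping above can be put into a few lines of arithmetic. This is only a sketch: the block counts would have to come from whatever the drive or vendor tool reports, and the 2% warning margin is an arbitrary assumption, not a figure from any manufacturer.

```python
def spare_ratio(advertised_blocks, physical_blocks, retired_blocks):
    """Fraction of spare capacity left, relative to advertised capacity N.

    physical_blocks is the raw capacity M when the drive was new;
    retired_blocks is how many blocks have been marked bad since then.
    """
    usable = physical_blocks - retired_blocks
    return (usable - advertised_blocks) / advertised_blocks


def should_warn(advertised_blocks, physical_blocks, retired_blocks, min_spare=0.02):
    """True once remaining spare capacity falls below an assumed safety margin."""
    return spare_ratio(advertised_blocks, physical_blocks, retired_blocks) < min_spare
```

For example, a drive advertising 1000 blocks with 1100 physical blocks and 40 retired has 6% spare left; once 90 are retired it is down to 1% and this check would start warning.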

Re:SSD wear cliff by dargaud · 2012-10-16 08:14 · Score: 1

If you have good analysis tools [...]
OK, so which ones are those wear analysis tools ? Smartmontools ? Or something specific to SSDs ? Or something specific to that special brand of SSD (boot level binary utility from the maker) ?

--
Non-Linux Penguins ?
Re:SSD wear cliff by RichMan · 2012-10-16 08:27 · Score: 1

Can't help beyond
http://serverfault.com/questions/385446/how-do-you-monitor-ssd-wear-in-windows-when-the-drives-are-presented-as-generic
Note that real enterprise RAID stuff provides really good information about wear.
See the IBM tools linked above (and likely the HP ones mentioned).

OCZ vertex 4 fails out of the box by Anonymous Coward · 2012-10-16 04:51 · Score: 0

Hi,

We bought over 70 OCZ Vertex 4, and after 1 month, we had over 20 failures. About 5 of them were DOA, and the rest died in prod. They would crash windows and would not reboot.

So my experience with SSD is, BACKUP anything critical on a regular HDD.

Wanna see them die? Just get an OCZ by Anonymous Coward · 2012-10-16 04:55 · Score: 0

OCZ makes the worst SSDs in the world, and it's not even a flash wear issue. For them, it's firmware. And, FFS, you have to update firmware on the goddamn things practically daily, and you can only do it by moving the drive to another machine, or with a hokey linux bootable CD, and while the planets are in a specific alignment and while holding the rabbit ears just a little to the left, except on Tuesdays when you have to hold them just a little to the right.

They just die inexplicably, and with no warning, and all of your data is just GONE.

No, they don't all age the same. by YesIAmAScript · 2012-10-16 04:55 · Score: 3, Informative

It's statistical, not fixed rate. Some cells wear faster than others due to process variations, and the failures don't show up to you until there are uncorrectable errors. If one chip gets 150 errors spread out across the chip, and another gets 150 in critical positions (near to each other), then the latter one will show failures while the first one keeps going.

So yeah, when one goes, you should replace them all. But they won't all go at once.

Also note most people who have seen SSD failures have probably seen them fail due to software bugs in their controllers, not inherent inability to store data due to wear.
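The spread-out versus clustered point can be illustrated with a toy model. Assume (purely for illustration) that the controller protects data in fixed-size codewords and can correct up to a fixed number of bit errors in each; the sizes below are made-up numbers, not those of any real drive.

```python
from collections import Counter


def uncorrectable_codewords(error_bits, codeword_bits, correctable):
    """Return indices of codewords holding more bit errors than ECC can fix."""
    per_codeword = Counter(pos // codeword_bits for pos in error_bits)
    return sorted(cw for cw, n in per_codeword.items() if n > correctable)


CODEWORD_BITS = 4096   # assumed codeword size, for illustration
CORRECTABLE = 8        # assumed correctable bit errors per codeword

spread = [i * CODEWORD_BITS for i in range(150)]   # one error per codeword
clustered = list(range(150))                       # all 150 errors in codeword 0

print(uncorrectable_codewords(spread, CODEWORD_BITS, CORRECTABLE))     # []
print(uncorrectable_codewords(clustered, CODEWORD_BITS, CORRECTABLE))  # [0]
```

Same number of worn bits in both cases, but only the clustered chip surfaces a failure to the user.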

--
http://lkml.org/lkml/2005/8/20/95

Usually the firmware or the NAND by Anonymous Coward · 2012-10-16 04:56 · Score: 1

I'm an SSD firmware engineer, so I know this in depth. If the SSD suddenly fails, the most likely cause is a firmware bug putting the drive into a bad state, or a catastrophic NAND failure. It all depends on how well the firmware and NAND are tested. The trickiest part of the firmware to get right is the unsolicited power cycle, so make sure to shut down the system properly. As for the NAND, it might be good to do a burn-in: write random data across the full drive capacity between 3 and 30 times to scrub out the early NAND block failures. A good manufacturer would already do this.
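A user-level approximation of that burn-in might look like the sketch below. It writes random data to an ordinary file on the drive and reads it back, rather than touching the raw device, so it is only a rough stand-in for what a manufacturer would do; the function names and the 1 MiB chunk size are assumptions made for this example.

```python
import hashlib
import os


def burn_in_pass(path, total_bytes, chunk_size=1024 * 1024):
    """Write total_bytes of random data to path, read it back, and
    return True if the SHA-256 of what was read matches what was written."""
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            f.write(chunk)
            written.update(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            read_back.update(chunk)
    return written.digest() == read_back.digest()


def burn_in(path, total_bytes, passes=3):
    """Run several passes over the same file; stop at the first mismatch."""
    return all(burn_in_pass(path, total_bytes) for _ in range(passes))
```

One caveat: the read-back may be served from the operating system's cache instead of the flash, so a clean result here is weaker evidence than a real device-level test.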

Re:Usually the firmware or the NAND by gweihir · 2012-10-16 05:38 · Score: 1

Question: How well is the "unsolicited power cycle" (nice name!) dealt with by the current generations? Is it something that can be done right, but is tricky, or is it something that always has a residual risk, even if the firmware is error-free, i.e. a fundamental issue?
I try to avoid it of course, but in particular my Laptop I run until power failure (Linux, ext3, absolutely no corruption so far from it, but HDD at this time). Is this still a significant risk to overall disk integrity? I do realize that the last few seconds of writes may be gone, but that is acceptable.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Usually the firmware or the NAND by Anonymous Coward · 2012-10-16 06:49 · Score: 0

The unsolicited power cycle is something that can be done right. There are just so many corner cases, weird NAND behavior, etc. to work around. It is a massive test, break and fix effort until you feel confident. As for laptops, it generally is not an issue because the SSD will be properly shut down when the power is about to fail. It is more an issue for desktops where the power may be lost for no reason.
Re:Usually the firmware or the NAND by gweihir · 2012-10-16 08:02 · Score: 1

Thanks for the info, that was my intuition also, but confirmation is nice. So as SSD firmware matures and designers become more and more experienced, it will eventually not be much of an issue anymore.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Usually the firmware or the NAND by tibit · 2012-10-16 16:17 · Score: 1

There is nothing tricky about handling "unsolicited power cycles" except that people with no clue about software engineering are dabbling with systems that are de facto critical for business continuity, whether it's a mom-and-pop shop, or a big operation with fast databases. The fundamental issue is with the qualifications of people who specify, design and implement the firmware, or more importantly, their management, or that management's management. I don't care if the problem is all the way up to the board of directors, as it may well be. Alas, that's all there's to it. It's a people problem. Demonstrably, people who know their shit can make data stores that survive random restarts just fine. See sqlite, for example.

--
A successful API design takes a mixture of software design and pedagogy.
Re:Usually the firmware or the NAND by Anonymous Coward · 2012-10-16 16:29 · Score: 0

You've just described the #1 reason why OCZ and Sandforce have turned me against SSDs for a really, really LONG time.
Eleven years ago, one of my friends was attending a conference in New York, and his laptop got buried under a million tons of burning rubble when the collapsing tower sheared the hotel in half (it was in the room safe, he was about a mile away). About two years later, a box arrived in the mail -- the room safe was excavated, and its contents (including his slightly melted laptop) were inside. For the hell of it, he asked his company (who sent him to New York for the conference) whether they'd pay to do data recovery on the drive. They said yes, and about 2 weeks later, the contents of his laptop's hard drive were mostly recovered intact.
Fast forward to last year. My OCZ Vertex2 bluescreened. And did it again. And again. I didn't really have a good backup, so I did what anyone would have done... I booted Linux, and tried using ddrescue on it. The goddamn drive went into Panic mode. I was informed by two data recovery firms that it was hopeless, because Sandforce's firmware apparently encrypted everything whether you wanted it to or not, and they wouldn't share their master key (that would have enabled them to recover the data) with anyone.
Let's rewind and look at that again. My friend's data from a spinning drive survived getting buried under a million tons of burning rubble in the worst urban disaster in modern American history, but my solid-state drive's data was destroyed in an instant, without warning, because of Sandforce's fucking "business policy". Sandforce and OCZ both deserve to burn in hell forever over that one.

giyfs by Anonymous Coward · 2012-10-16 04:56 · Score: 0

http://computer.howstuffworks.com/solid-state-drive5.htm

Old SSDs never die... by LeDopore · 2012-10-16 04:58 · Score: 1

Old SSDs never die. They just lose their bits.

--
Expected time to finish is 1 hour and 60 minutes.

Firmware bugs killed my OCZ Vertex 2 by ThreeDayMonk · 2012-10-16 04:58 · Score: 2

I always expected the cells to go first. I was careful to avoid unnecessary writes. In the end, though, it was a known bug that killed the drive. Well, I didn't know about it, of course, until it was too late. If I'd known, I'd have updated the drive firmware to one that didn't have a catastrophic bug.

I replaced it with a Samsung. The RMA'd replacement OCZ is still sitting in its packet on my desk.

--
If your comment title says 'Re: Foo', I'm not likely to read it.

Still relevant? by Oscaro · 2012-10-16 04:58 · Score: 2

After reading this horror story I arrived at the conclusion that SSDs are not for me. I wonder if it's still true.

Super Talent 32 GB SSD, failed after 137 days
OCZ Vertex 1 250 GB SSD, failed after 512 days
G.Skill 64 GB SSD, failed after 251 days
G.Skill 64 GB SSD, failed after 276 days
Crucial 64 GB SSD, failed after 350 days
OCZ Agility 60 GB SSD, failed after 72 days
Intel X25-M 80 GB SSD, failed after 15 days
Intel X25-M 80 GB SSD, failed after 206 days

http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html

Re:Still relevant? by Anonymous Coward · 2012-10-16 05:21 · Score: 0

I have a Corsair drive that has been running for two years now. It is not, however, based on a 1st generation controller.
SSDs are not magically different from other technology. The first wave is expensive and quirky (or flaky, if you prefer). After that, things level out nicely.
Re:Still relevant? by gweihir · 2012-10-16 05:34 · Score: 1

Don't put too much stock in this story. First, about 5%/year failure is normal even for a mature product. Then, these are all earlier SSDs with likely inferior firmware, wear-leveling, etc. It takes a while to get to the 5%/year.
As this story has a mix of drives and all failed short of their expected lifetime, there is also a significant possibility they were mistreated. SSDs die early when run hot, just like HDDs. SSDs are even more dependent on clean power. Excessive writing can kill earlier models relatively fast.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Still relevant? by Anonymous Coward · 2012-10-16 05:39 · Score: 0

People complain when things go wrong, and are often silent when things are OK.
I own 4 drives from two manufacturers on that list, Intel X25-M 80GB x2 and OCZ Agility 60GB x4. I've run the Intels in RAID 0 for close to 3 years in a workstation that sees heavy use, and I just built another newer workstation using the Agility line - no problems at all.
With the performance you gain vs the price you pay, especially now they're falling so fast, it should be an easy choice to pick a few up - just remember to keep backups like with any platter drive.
Re:Still relevant? by King_TJ · 2012-10-16 05:56 · Score: 1

I've used enough SSD's with success to say you shouldn't necessarily be afraid of them. However, there are certain things you really need to do when using one to prolong its life.
Most importantly, don't save temporary swap files on one. Every desktop PC I've set up with an SSD boot drive, I custom configured to save its swap files on a second, standard hard drive. That does wonders for cutting down on unnecessary writes/rewrites to the SSD.
Second, avoid using them for relational databases. It seems like databases are the number one killer of SSDs, probably because many of them do thousands and thousands of writes/updates to the database file(s) in a short period of time.
SSDs are really FAR better for reading in data than writing it back out. They should allow essentially an infinite number of reads (until the whole controller in the thing finally fails due to bad capacitors or what-not). So they're great for holding the core of the OS that has to get read in/loaded up every single time the machine boots up. But anything you're going to be changing often is best stored on a traditional drive you can assign as your "data" disk.
Laptop/portable users may not have this luxury, obviously .... but my strategy in those cases is not to skimp and buy a cheaper SSD for them. Spend the extra money for one of the most reliable models, and do regular backups so a failure won't cost you much data loss. Most people upgrade laptops every 2-3 years anyway, so hopefully the SSD can make it that long.
Re:Still relevant? by njahnke · 2012-10-16 06:07 · Score: 1

to jeff i say: i'll see your anecdote and raise you my own - i've never had an ssd fail.
Re:Still relevant? by Anonymous Coward · 2012-10-16 07:02 · Score: 0

As interesting as they are, they tell nothing about the drives which didn't fail. This means those numbers tell nothing about how big the risk is.
I had a WD VelociRaptor fail after a little more than half a year. That sounds bad, but then again I have 3 which just keep on working without any problems despite being far older than that. Imagine if they end up outliving all the other drives; then my experience will say they are likely more reliable than other drives even though I had one which failed.
Besides all drives will fail eventually. Making a proper plan for how to deal with it (raid-1/backup) is a lot safer than having a plan to avoid it.
Re:Still relevant? by Anonymous Coward · 2012-10-16 07:23 · Score: 0

Statistically, for one person to have that many failures the failure rate across different companies and different generations would need to be around 30%, which we know to be false. So the only conclusion to draw from this is that this person is doing something to break his SSDs. (Or he bought 200 of them.)
IOW, if someone tells you that hammers are fragile because they broke the last eight they bought (different manufactures, different price ranges), then the proper conclusion is that they're abusing their hammers, not that hammers are faulty in general.
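To put rough numbers on that argument, here is a back-of-the-envelope sketch. It assumes each drive fails independently with the same probability p over the time it was observed, which is a simplification, and the two values of p are just the 5% and 30% figures quoted in this thread. The helper name `prob_at_least` is made up.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of at least k failures among n independent drives,
    each failing with probability p over the observation window."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

if __name__ == "__main__":
    # Chance that all eight of eight drives fail, for two assumed rates.
    for p in (0.05, 0.30):
        print(f"p={p:.2f}: P(all 8 of 8 fail) = {prob_at_least(8, 8, p):.3g}")
```

Under these assumptions, eight failures out of eight is wildly unlikely at either rate, which is consistent with the point above: either the drives were not failing independently (same abuse, same environment), or far more than eight were bought.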
Re:Still relevant? by Anonymous Coward · 2012-10-16 13:49 · Score: 0

Buy SLC drives, not MLC. A 20GB Intel SLC SSD goes for $100. It's not gonna store your porn collection, but one or more of these are more than enough to handle most work.
Intel markets them as "caches" to speed up traditional disks, but they're identical to the industrial grade parts. Embedded manufacturers buy the Intel SLC chips, slap on another sticker, and sell them for 3x more. If you're paying extra for anything, it's extra testing to ensure greater temperature tolerances than Intel is willing to warrant for.
Re:Still relevant? by deroby · 2012-10-21 08:35 · Score: 1

I think about this differently.
I don't do anything special to prolong the (expected) life of my SSD (Intel 320 120Gb) except for disabling the last-access-time thingy. Why should I buy a tool that speeds up 'my computer experience' and then cripple it on purpose ? I reboot at most once every 3 days so fast booting is nice and all but hardly a selling point. Then again, when straining the machine while running tests, compiling stuff, having 20 applications open at once, browsing, etc etc... not having to wait for the spinning platters to store whatever temporary data that needs to be dumped makes one hell of a difference! (And funnily, yes that includes my RDBMS's tempdb =)
(space restrictions 'force' me to put my db-data-files on the HDDs)
OK, so maybe my drive will crap out after 4 year instead of after 12 years but at least I got as much out of it as possible in the meantime. Who knows what I'll replace the drive with in 4 years anyway ? Could be we have 1Tb SSD's by then on SATA-4 outperforming current RAM. Or whatever.
That said, I DO back up the system on a regular basis. If I'm unlucky and the thing fails cold turkey I'll lose say a week's worth of work. Not fun, but IMHO it does not out-weigh the advantages I got from it over the years.
I agree that you should use some basic caution regarding expensive electronic equipment : heat, shocks, etc... but not fully using it for its intended purpose shouldn't be one of them.

--
If there is one thing to be learned on slashdot, it has to be sarcasm.
Re:Still relevant? by loxosceles · 2012-10-29 15:17 · Score: 1

20GB is not enough.
A fairly clean Windows 7 Pro VM I have (basic Word/Excel/Outlook/PP office, firefox, chrome, acrobat reader) takes up over 20G, mainly because of winsxs.
I don't do much with that VM. IOW, it's not pristine, but it's a lot cleaner than most win7 systems.
Windirstat reports 22.9GB in C:\windows with 13GB of that in C:\windows\winsxs
Some of that might be ntfs's equivalent of symlinks, but properties on C: reports 24.1GB used, 4.5GB free (I need to resize the VM disk eventually). I don't believe the C: space usage is a lie, even if C:\windows is slightly overreported by windirstat.

First hand experience here by SeanTobin · 2012-10-16 04:58 · Score: 4, Informative

I recently had an "old" (circa 2008) 64 GB SSD die on me. Its death followed this pattern:

Inexplicable system slowdowns. In hindsight, this should have been a warning alarm.
System crash, followed by a failure to boot due to an unclean NTFS volume which couldn't be fixed by chkdsk
Failed to mount r/w under Ubuntu. Debug logs showed that the volume was unclean and all writes failed with a timeout
Successful r/o mount showed that the filesystem was largely intact
Successful dd imaged the drive and allowed a restore to a new drive.

After popping a new disk in and doing a partition resize, my system was back up and running with no data loss. Of all the storage hardware failures I've experienced, this was probably the most pain-free as the failure caused the drive to simply degrade into a read-only device.
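For anyone who wants to see what the imaging step boils down to, here is a rough sketch of a block-by-block copy that keeps going past unreadable blocks and pads them with zeros, similar in spirit to `dd conv=noerror,sync`. The function name `image_device` is made up, and a real rescue is better done with dd or ddrescue; this only illustrates the idea.

```python
def image_device(src, dst, size, block_size=64 * 1024):
    """Copy `size` bytes from the seekable file-like `src` to `dst`.

    Blocks that raise OSError on read are replaced by zeros so that
    offsets in the image still line up with offsets on the device.
    Returns the list of offsets that could not be read.
    """
    bad_offsets = []
    for offset in range(0, size, block_size):
        want = min(block_size, size - offset)
        try:
            src.seek(offset)
            chunk = src.read(want)
        except OSError:
            # Unreadable block: remember it and fall through to zero padding.
            bad_offsets.append(offset)
            chunk = b""
        # Pad short or failed reads with zeros to keep the image aligned.
        dst.write(chunk.ljust(want, b"\0"))
    return bad_offsets
```

In practice `src` would be the failing drive opened read-only in binary mode and `size` its capacity in bytes; both are left to the caller here. The returned offsets tell you which parts of the restored filesystem deserve suspicion.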

--
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;

Re:First hand experience here by halfnerd · 2012-10-16 05:26 · Score: 1

Are you sure there was no data corruption, or are you just assuming since it booted up again?
Re:First hand experience here by gweihir · 2012-10-16 05:28 · Score: 1

That is the sensible behavior. Your drive probably ran out of spare area. The best thing it can do is to refuse any more writes in that situation.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:First hand experience here by Anonymous Coward · 2012-10-16 06:06 · Score: 0

What brand / model was the drive? Curious what controller it used. That is a good failure.
Re:First hand experience here by Anonymous Coward · 2012-10-16 22:08 · Score: 0

I had my first Intel X25-M Gen1 80GB die on me in much the same manner just recently (slowdowns, blue screens and then the Windows 7 OS not booting anymore). SMART stats show read errors beyond threshold; strangely, the drive was in a stripe (RAID0) and the other drive still looks perfectly fine in SMART. Had this drive for several years in day to day office use.
Re:First hand experience here by SeanTobin · 2012-10-17 04:32 · Score: 1

After the data restore, chkdsk had some issues to take care of (as you would expect). I don't have any method to do a full data integrity check, but the system has run without issue for a few months now.

--
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
Re:First hand experience here by SeanTobin · 2012-10-17 10:24 · Score: 1

Cavalry Pelican CASD00064MIS 2.5" 64GB http://www.newegg.com/Product/Product.aspx?Item=N82E16820411004

--
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;

Re:Die! by lister+king+of+smeg · 2012-10-16 04:59 · Score: 1, Insightful

No offense intended, but if you're new, why are you complaining about our long standing culture of cracking lame jokes? If you don't like it, why did you join?

--
---Saying gnome 3 is better than windows 8 not so much a compliment as it is damning with light praise.

Hammers by Anonymous Coward · 2012-10-16 05:00 · Score: 0

no duh

4MB Bug.... by jythie · 2012-10-16 05:00 · Score: 1

Well, if the drive was made by Intel, it fails because you turned off the computer or let it go to sleep.... of course newer ones do not do this, just like the old newer ones didn't do it either...

Bad Controllers by nakedhitman · 2012-10-16 05:01 · Score: 1

I work in a storage test lab. I've had several enterprise SSDs die randomly. They just drop off the device list in the middle of a stress test. I'd suspect it was due to a bad controller, but we didn't have time to dissect it and find out, so we just ordered a new one. It's happened from a myriad of manufacturers (who shall not be named). Granted, most of the drives that have died so far were engineering samples, but the final revisions we got later were the same hardware with updated firmware. While most SSDs are supposed to be resilient to most hardware failures, there's always a single point of failure somewhere.

Re:Bad Controllers by gweihir · 2012-10-16 05:27 · Score: 1

So this was not consistent over several instances of the same drive? Interesting.
However, HDDs can also have issues. I have a fileserver running 24/7 and I have observed that the 2.5" notebook drives I use crash about every two years, with no detectable damage after an unplug and replug. This happens with WD, Seagate and Samsung drives. There are 3 drives in the server, but 2 others I replaced because of size also had the problem. Swapping the mainboard and PSU did not make a difference, and it is always only one (random) drive affected. My guess would be that firmware is just debugged up until the point that customer complaints diminish, i.e. commercially driven not quality driven.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Bad Controllers by Lumpy · 2012-10-16 05:36 · Score: 1

" It's happened from a myriad of manufacturers (who shall not be named)."
I really wish people had the honesty to name products they find are defective.. You are doing a HUGE disservice to everyone and yourself by protecting a bad supplier.
Or is Seagate paying you hush money? How do I get in on this? I'd love some hush money!

--
Do not look at laser with remaining good eye.
Re:Bad Controllers by m.dillon · 2012-10-16 06:41 · Score: 1

I've found SMP-related bugs in our disk drivers simply related to the fact that SSDs are so blasted fast. Other than that (and once fixed), and discounting the crucial POH bug which a firmware update fixed, I have not seen any of our SSDs drop-off the bus.
-Matt
Re:Bad Controllers by Miamicanes · 2012-10-16 16:42 · Score: 1

The problem with SSD failures is that, unlike spinning drives, RAID won't save you. Several generations of OCZ's Sandforce-based drives had (and still have) weird firmware bugs that caused spontaneous data-suicide. Making matters worse, if you had multiple drives in RAID configuration, and something triggered the bug in one of them, there's a pretty good chance the same bug ended up getting triggered in ALL of them.
RAID was developed to protect data from progressive drive failure. It simply can't deal with drives that will kill themselves at the drop of a hat, for any reason, or no apparent reason at all.
The sad thing is, all this time people have been neurotic about SSDs getting write errors, while totally overlooking their REAL problem -- buggy firmware and insidious mandatory encryption that makes data recovery nearly impossible when your drive dies without warning. One minute, your computer is working fine. The next minute, your data is gone forever because the controller decided to commit suicide and take your data with it.
Re:Bad Controllers by Bengie · 2012-10-17 05:43 · Score: 1

Sounds like your driver went all "Mr Bean". "I'm in a race" - Rat Race

My two cents. by Anonymous Coward · 2012-10-16 05:03 · Score: 0

SSD's are just flash memory, yea? So, all flash memory has an inherent limit of about 1 million read/writes. I would assume the SSD's would fail after they get close to approaching their read/write cycle limit.

Finally! by antdah · 2012-10-16 05:03 · Score: 1

an ask /. that actually belongs here! Looking forward to reading the comments on the topic.

The failure path I worry about... by Anonymous Coward · 2012-10-16 05:04 · Score: 0

... is loss of firmware/configuration due to the firmware not refreshing these areas of storage. Flash data retention is measured in years if it isn't refreshed, and controllers that don't take this into account WILL fail prematurely, whether that means 2 years, 5 years, 10 years... internal microcontroller flash memory I use is rated to 40 years retention, but these are relatively huge flash cells.

Re:The failure path I worry about... by Tapewolf · 2012-10-16 05:35 · Score: 1

Absolutely. This is something that's been bugging me for quite some time. I've had EPROMs which have (fortunately) held data for about 21 years (at which point I cloned them, just in case). Others aren't so lucky - machines like the Sony APR series are notorious for the control program dissolving and either rendering the machine useless, or crashing and shredding the master tape. And that's with 1980s-era die sizes - with the process shrink, things with flash firmware are just going to turn to crap in considerably less time.

Bathtub Curve by Onymous+Coward · 2012-10-16 05:05 · Score: 5, Informative

The bathtub curve is widely used in reliability engineering. It describes a particular form of the hazard function which comprises three parts:
The first part is a decreasing failure rate, known as early failures.
The second part is a constant failure rate, known as random failures.
The third part is an increasing failure rate, known as wear-out failures.
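A quick sketch of what such a hazard function can look like, built as the sum of the three parts listed above. Using Weibull hazards for the early-failure and wear-out parts is one common modelling choice, not the only one, and every parameter value below is made up purely to produce the bathtub shape.

```python
def weibull_hazard(t, shape, scale):
    """Hazard rate of a Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t, random_rate=0.02):
    """Illustrative bathtub hazard at age t (in years); parameters are invented."""
    early = weibull_hazard(t, shape=0.5, scale=50.0)     # shape < 1: decreasing rate
    wear_out = weibull_hazard(t, shape=5.0, scale=10.0)  # shape > 1: increasing rate
    return early + random_rate + wear_out                # constant part in the middle

if __name__ == "__main__":
    for years in (0.1, 1, 3, 6, 9):
        print(f"age {years:>4} y: hazard = {bathtub_hazard(years):.3f} per year")
```

With these invented parameters the rate starts high, bottoms out after a few years, and climbs again late in life, which is the shape the name refers to.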

Or RAMs, maybe. by Impy+the+Impiuos+Imp · 2012-10-16 05:06 · Score: 1

Normally SSDs die when an X-Wing or something crashes into the conning tower and it rams into a Death Star.

--
(-1: Post disagrees with my already-settled worldview) is not a valid mod option.

Re:Or RAMs, maybe. by Anonymous Coward · 2012-10-16 05:14 · Score: 0

You beat me to it >:(

Re:Die! by HaZardman27 · 2012-10-16 05:13 · Score: 2

why are you complaining about our long standing culture

lister king of smeg (2481612)

Not your first account I take it?

--
Apparently wizard is not a legitimate career path, so I chose programmer instead.

Some anecdotes by toastyman · 2012-10-16 05:15 · Score: 2

We've got a fair number of SSDs here. Failures have been really rare. The few that have:

#1 just went dead. Not recognized by the computer at all.
#2 Got stuck in a weird read-only mode. The OS was thinking it was writing to it, but the writes weren't really happening. You'd reboot and all your changes were undone. The OS was surprisingly okay with this, but would eventually start having problems where pieces of the filesystem metadata it cached didn't sync up with new reads. Reads were still okay, and we were able to make a full backup by mounting in read only mode.
#3 Just got progressively slower and slower on writes. but reads were fine.

Overall far lower SSD failure rates than spinning disk failure rates, but we don't have many elderly SSDs yet. We do have a ton of servers running ancient hard drives, so it'll be interesting to see over time.

Re:Some anecdotes by Megahard · 2012-10-16 05:34 · Score: 1

I have a Crucial CT256M4SSD2 that gradually slowed to a crawl. Updated the firmware and all is well again.

--
I eat only the real part of complex carbohydrates.

Easy by Anonymous Coward · 2012-10-16 05:16 · Score: 0

Cold, alone, and in the dark.

Theory or Practice? by rabtech · 2012-10-16 05:21 · Score: 4, Interesting

In theory they should degrade to read-only just as others have pointed out in other posts, allowing you to copy data off them.

In reality, just like modern hard drives, they have unrecoverable firmware bugs, fuses that can blow with a power surge, controller chips that can burn up, etc.

And just like hard drives, when that happens in theory you should still be able to read the data off the flash chips but there are revisions to the controller, firmware, etc that make that more or less successful depending on the manufacturer. You also can't just pop the board off the drive like with an HDD, you need a really good surface mount resoldering capability.

So the answer is "it depends"... If the drive itself doesn't fail but reaches the end of its useful life or was put on the shelf less than 10 years ago (flash capacitors do slowly drain away) then the data should be readable or mostly readable.

If the drive itself fails, good luck. Maybe you can bypass the fuse, maybe you can re-flash the firmware, or maybe it's toast. Get ready to pay big bucks to find out.

P.S. OCZ is fine for build it yourself or cheap applications but be careful. They have been known to buy X-grade flash chips for some of their product lines - chips the manufacturers list as only good for kid toys or non-critical, low-volume applications. Don't know if they are still doing it but I avoid their stuff.
Intel's drives are the best and have the most-tested firmware but you pay for it. Crucial is Micron's consumer brand and tends to be pretty good given they make the actual flash - they are my go-to brand right now. Samsung isn't always the fastest but seems to be reliable.

Do your research and focus on firmware and reliability, not absolute maximum throughput/IOPs.

--
Natural != (nontoxic || beneficial)

Re:Theory or Practice? by yenic · 2012-10-19 12:27 · Score: 1

Samsung is the best of all that you listed. It has the speed AND reliability. A better reliability record than either Crucial or Intel. The 830 series (and now 840) are faster or as fast on average. I have old Intel G2 drives but Samsung would be my universally recommended go-to today.

--
http://www.accountkiller.com/en/delete-slashdot-account Stop visiting Slashdot.

SSD Firmware Fault Shutdown by BoRegardless · 2012-10-16 05:21 · Score: 1

Upgrading SSD firmware over time seems like a bum deal and I hope newer SSDs do not need this.

A Crucial C300 in MacBook Pro while idle on my desk simply stopped and the screen went white.

Turns out that in about 1 year, the Firmware revisions went from #0 to #6 or so and I was never informed you needed to do firmware upgrades.

Crucial gave me a workaround to reset firmware. Bum deal is that I had to remove the SSD and connect to a PC to do the reset and then the instructions for doing the sequential firmware updates were incomprehensible, so I didn't upgrade firmware.

Selling an SSD as a drop in replacement and not stating anything about firmware upgrades and not providing a way to easily do those upgrades with a one click application with the drive in place is BAD PR for a company. It is also bad policy to require firmware updates and not have a notification system in place.

Re:SSD Firmware Fault Shutdown by m.dillon · 2012-10-16 06:30 · Score: 1

I had several Crucials with the power-on-hours firmware bug. Basically once they have accumulated enough power-on hours they brick. You can power cycle them and they will work for ~1 hour before bricking again.
Upgrading the firmware fixes the problem. I am amazed that SSD vendors don't just provide USB disk key images for firmware upgrading. It took a while for me to construct a dr-dos based boot stick (since Microsoft is so rabid about letting people format bootable removable devices these days).
But once I had a working image the upgrade process was painless. The machine booted into dr-dos from the usb stick, automatically detected the SSD, and asked if I wanted to upgrade the firmware. 60 seconds later it was done.
Generally speaking I think it's a good idea to keep the firmware up-to-date for Intel and Crucial and possibly other SSDs. Bricking issues aren't the only things they fix. Numerous firmware fixes seem to relate to flash cell failure handling or other bugs related to data integrity. On the other hand, it's usually NOT a good idea to upgrade the firmware on an OCZ as you are just as likely to introduce new bugs as you are to fix existing ones.
-Matt
Re:SSD Firmware Fault Shutdown by Anonymous Coward · 2012-10-16 11:57 · Score: 0

Hint: if turning it off and on again "unbricks" it, it's not bricked.

How Do SSDs Die? by Anonymous Coward · 2012-10-16 05:21 · Score: 0

In a flash!

Undetected read errors by gweihir · 2012-10-16 05:22 · Score: 1

I have an older 30GB OCZ drive, that has undetected read errors every 4-5 full reads. So far I have done only MD5 sums and one in 4 or 5 is different. This is the worst case of failure possible, as SSDs should, just like every other drive, have a very, very small probability of not detecting an error, far smaller than having an error.

I suspect a firmware bug in the error-correction implementation and a flaky cell that mostly reads right, but sometimes does not. This shows that at the moment, firmware bugs are a significant risk with SSDs. Fortunately the 4 other OCZ ones I have (60, 128, 240, 255G) do not seem to have this problem, but I can only be sure about the 60G one, as that one is in a 3-way RAID1 that is subject to a consistency check every week.
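The check described above is easy to script. This is only a sketch: the function names are made up, and on a real device the OS page cache may serve repeated reads from RAM, so you would need to drop caches or read with direct I/O between passes for the result to mean anything.

```python
import hashlib
from collections import Counter

def digest_of(path, chunk_size=1 << 20):
    """MD5 of everything readable at `path` (a file or block device)."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def repeated_read_check(path, passes=5):
    """Read `path` several times and count how often each digest shows up."""
    return Counter(digest_of(path) for _ in range(passes))
```

If the returned counter has more than one key and nothing wrote to the device in between, some read returned different data without reporting an error, which is the failure mode described here.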

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Why they die... is pretty simple. by Anonymous Coward · 2012-10-16 05:23 · Score: 0

We have a great description of the wear they will take and how to mitigate wear in the DragonFly BSD swapcache manpage. http://leaf.dragonflybsd.org/cgi/web-man?command=swapcache

To summarize, each nand cell is able to be programmed a finite number of times. Voltage required to program a cell increases as the cell becomes more worn, good drives ship with more storage than they report and will automatically relocate data to cells that are still good and retire cells that are worn out. Enterprise-class drives have a larger amount of reserve storage than consumer-grade drives do. Drives which use cheap NAND alongside an inadequate controller can result in data loss because the controller did not relocate your data before it was unable to retrieve it from a cell (OCZ).
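A toy model of the spare-area part of that summary, just to make the mechanism concrete. It only models retiring worn blocks and remapping to spares, with no real wear leveling, and the class, its parameters, and the endurance numbers are all invented. No real controller is this simple.

```python
class ToyFlash:
    """Each block survives `endurance` programs; worn blocks are retired
    and replaced from a spare pool until the pool runs dry."""

    def __init__(self, visible_blocks, spare_blocks, endurance):
        self.endurance = endurance
        self.wear = [0] * (visible_blocks + spare_blocks)  # programs per physical block
        self.mapping = list(range(visible_blocks))         # logical -> physical block
        self.spares = list(range(visible_blocks, visible_blocks + spare_blocks))
        self.read_only = False

    def write(self, logical):
        """Program one logical block; return False once the drive is read-only."""
        if self.read_only:
            return False
        physical = self.mapping[logical]
        if self.wear[physical] >= self.endurance:
            # Block is worn out: retire it and remap to a spare, if any remain.
            if not self.spares:
                self.read_only = True
                return False
            physical = self.spares.pop()
            self.mapping[logical] = physical
        self.wear[physical] += 1
        return True

def writes_until_read_only(drive, logical=0):
    """Hammer one logical block and count the writes that succeed."""
    count = 0
    while drive.write(logical):
        count += 1
    return count
```

In this model a drive with more spare blocks absorbs proportionally more writes to a hot spot before it refuses further writes, which is the same reasoning behind larger reserves on enterprise-class drives.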

Too much activity killed mine, I think by BagOCrap · 2012-10-16 05:27 · Score: 1

I had a 60GB Mushkin Chronos in my MySQL server. It died in 13 days. I do not know for certain what actually killed it (it's a low-profile server, with only about 3.600/1.500 select/insert statements an hour), but I received a replacement disk within hours, swapped them out, and the replacement bit the dust in just under 18 days. I was unable to read or write to them. They just went out, like someone flicked a switch.
That's when I decided to stick with the regular spindle disks for my database server...

I bought two of those 60GB disks at the same time; used the other one as the OS disk for my workstation. It's still running good, almost two years later. I recently purchased a 120GB Mushkin Chronos Deluxe and replaced my 60GB disk (which currently only holds my development projects and few, select games).

--
-- Chaos, panic, pandemonium... My job here is done!

Re:Too much activity killed mine, I think by lcfactor · 2012-10-16 05:31 · Score: 1

Had the same exact experience actually. We had 3 of 4 8GB SSDs fail that were installed in two pfsense routers within three weeks of turning them on. We've since learned that too much activity can in fact kill SSDs. Luckily these were being burned in prior to production deploy - but pfsense specifically writes a large amount of consistent activity to disk for traffic logging and building its analysis reports in near real time. This apparently wears out the cells quickly.
Re:Too much activity killed mine, I think by bytesex · 2012-10-16 07:18 · Score: 2

Did you mount it with noatime and nodiratime ?

--
Religion is what happens when nature strikes and groupthink goes wrong.
Re:Too much activity killed mine, I think by Anonymous Coward · 2012-10-16 12:52 · Score: 0

Most 8GB "SSD"s have *no* wear leveling whatsoever (this includes CF cards). Killing those with repeated writes to the same sector takes less than an hour.

Tin whiskers by AikonMGB · 2012-10-16 05:27 · Score: 2

Tin whisker growth is another way not directly related to the flash cells. Commercial electronics use lead-free solder and no real whisker mitigation techniques. Eventually a whisker shorts between two things that shouldn't be shorted, conducts sufficient current for a sufficient amount of time, and poof, your drive is dead.

Re:Tin whiskers by VortexCortex · 2012-10-16 07:34 · Score: 1

Eventually a whisker shorts between two things that shouldn't be shorted, conducts sufficient current for a sufficient amount of time, and poof, your drive is dead.
The same can be said for Sex Drives. Eventually you learn not to mind the women whiskers -- It's like they choose not to see them! It's not like you can shave a lady in her sleep!
Re:Tin whiskers by Anonymous Coward · 2012-10-16 10:27 · Score: 0

Commercial electronics use lead-free solder and no real whisker mitigation techniques. Eventually a whisker shorts between two things that shouldn't be shorted, conducts sufficient current for a sufficient amount of time, and poof, your drive is dead.
You can always kick it at that point, or hit it with a hammer.
Re:Tin whiskers by Anonymous Coward · 2012-10-17 02:42 · Score: 0

I've been in the business for 15 years and thousands of machines (desktops and servers) now and I have yet to see tin whiskers (not doubting their existence) let alone any failure that could be attributed to them.
For that matter, I have yet to even meet anyone that has ever seen them in the wild.

In Enterprise SAN kit.... by pjr.cc · 2012-10-16 05:28 · Score: 1

I've seen many SSDs die in enterprise SAN kit and will even go so far as to say that, as a percentage, they die far more often than spinning rusty metal.

However, this is because of how they are used in SAN. Often they are used in a multi-tier way, where most frequently accessed data is pushed up to the SSD's to allow quick access, so they get hit the hardest.

I would guess you're asking this question simply because it's easier to understand why a part in motion can slowly die over time where something that's in silicon shouldn't. You'd probably be surprised to know that a lot of drives die a controller death, not a platter/motor death. A platter/motor death usually means a badly made drive (mostly because makers of spinning rusty metal have gotten very good at the mechanics behind them), and while the mechanics in a drive will slowly wear over time, typically something silicon in the controller goes first. The exception to that is where drives are physically moved around (laptops, for example); often the drive shaft in the drive itself then starts to wear unevenly, and that usually gets worse over time (or at least, this is what I've been told).

Some SAN makers will even put HDDs through their paces first to make sure they actually perform OK - e.g., they'll measure vibration, etc. (i.e. the mechanical components) to make sure a drive is up to spec before it goes into their kit. They can't really do the same as easily with silicon, as there's not much to work off that can be measured - so often in enterprise-grade SANs, silicon dies before mechanical.

http://www.makeuseof.com/tag/solidstate-drives-work-makeuseof-explains/ is a fairly good (and relevant) explanation of why SSDs die.

Having said all that, I honestly can't wait for the death of spinning rusty metal, for the simple reason that SSDs should take on (and haven't yet) forms which would be much more useful - why use a SATA 2.5" format when you could have much better geometries, for example? Then there are the interfaces we use, which were really designed with hard drives in mind... but that's an entirely different issue.

As for "if it's all the same, why doesn't it die at the same time?"... because at a fundamental level, it isn't the same. When we make anything, no matter what it is, the materials go through some form of refining process where impurities are removed. It's impractical and near impossible to get anything 100% pure (not to mention entirely uneconomical - you might pay $1/tonne for 90% pure iron and $100/tonne for 91% pure iron, as an example). The nature of where we get those materials often means the impurities vary greatly in composition over small spaces of time, which is why, of two HDDs sitting next to each other on the assembly line, one might die after 2 days and the other after 200 years.

There are also other components to this: if you looked at iron (again an example, but true of most metals) under a microscope, you'd find that it isn't a uniform substance; it's quite grainy, so to speak, and those grains vary considerably. This too impacts the nature of the substance and its longevity. Using iron as an example, when it's cooled it's very hard to get truly consistent cooling across the entire piece, and the cooling determines how those grains form - i.e. grains in the center of the material will be quite different to grains at the edge, and so forth.

That's just a small number of the things that explain why consistency isn't quite as 100% as it might appear on the surface; there are quite a lot of factors that come together to affect the materials we use. Ultimately the way we choose the materials we produce things with is by tolerance, i.e. I expect 99% of the metal you sell me to be 90% pure and have x tensile strength, or any number of variables you might consider important to your manufacturing process. But even then it's never 100%: you always accept that at some point you'll get raw materials that fall outside those tolerances, and as with everything on the planet it's a trade-off between price and quality!

Adequate Grounding ? by Anonymous Coward · 2012-10-16 05:29 · Score: 0

After unexpectedly getting a static charge (yes, in the server room) when touching a box, I began to wonder whether adequate grounding is always present, and what effect that might have on an SSD.

I have added a big ground wire to each of my servers. I attach it to one of the screws holding the power supply to the frame.

Overprovisioning and non-moving considerations by AaronLS · 2012-10-16 05:37 · Score: 1

When you talk about volume sizes becoming out of sync, are you referring to cells becoming worn out and being de-provisioned? Many drives, especially enterprise SSDs, are overprovisioned, meaning you might buy a 480GB drive that really has 512GB of space. The overprovisioned space is used to replace dead cells. Some drives even allow you to customize the amount of overprovisioning. For a great number of use cases, it "should" take 10-20 years to reach the point where the writes have worn out the drive. The extreme edge cases would be a system that is constantly streaming recorded video from some surveillance system, and even then it should take a few years to wear out the drive.
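
To put numbers on the 480GB/512GB example, here is a rough sketch of how the spare fraction works out. The helper name is made up for illustration, and vendors differ on whether they quote spare area against usable or raw capacity, so both are shown.

```python
def spare_area(raw_gb: float, usable_gb: float) -> dict:
    """Return overprovisioned space in GB and as a fraction of usable and raw capacity.

    Hypothetical helper for illustration only; not any vendor's official formula.
    """
    if usable_gb <= 0 or raw_gb < usable_gb:
        raise ValueError("expected 0 < usable_gb <= raw_gb")
    spare = raw_gb - usable_gb
    return {
        "spare_gb": spare,
        "of_usable": spare / usable_gb,  # spare relative to what the OS sees
        "of_raw": spare / raw_gb,        # spare relative to the flash on board
    }


# The example from the post: sold as 480 GB, built with 512 GB of flash.
info = spare_area(raw_gb=512, usable_gb=480)
print(f"{info['spare_gb']:.0f} GB spare, "
      f"{info['of_usable']:.1%} of usable, {info['of_raw']:.2%} of raw")
```

So the drive in the example keeps 32 GB in reserve, a bit under 7% of what the OS sees.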

In my own experience, most deaths of electronics involve a moving-part failure: a fan goes and then the component overheats. Hence the failure of an HDD late in life is probably much different from how an SSD might fail, since HDDs have mechanical wear which would eventually lead to failure. Additionally, SSDs produce a tenth of the heat of an HDD, so heat-related wear and failures are much less of a risk, I am guessing.

I am guessing that most SSD failures would be early in their life as it exercises different chips/cells and finds a flawed component. I wish there were reliable statistics on DOA rates for SSDs.

Other failures would be related to degradation of circuits over time. I forget the term, but the metal in circuit boards degrades over time as electricity passes through it, until it causes a faulty connection. This is something that affects anything that is on a circuit board, so it is no more of a risk than your super-old, can't-replace RAID card failing or the circuit board in an HDD failing. Hopefully it would fail outright, and not start corrupting data silently. I don't know how long it takes for that to become a risk, or if there are things manufacturers have done to mitigate it.

I am also curious about the failure pattern for SSDs. Even in the world of HDDs and RAID, there are a lot of proprietary tricks that make predicting failure scenarios/behaviors complicated and unpredictable. Legacy file systems tend to be overly trustful of the error reporting of the drive. Additionally, silent data corruption is becoming a bigger risk with larger drives, and this is why newer file systems are taking a bigger role in data integrity, because there are scenarios that even RAID 5 and 6 don't handle/detect.

Some things disk array vendors are doing by RichMeatyTaste · 2012-10-16 05:38 · Score: 1

Disk array manufacturers are dealing with this in a couple of different ways (I work for one).
1) Using different methods to determine when a SSD will fail, and proactively sparing it out
2) Inline dedupe at the cache level to reduce writes before they even hit the disks, extending disk life (example: http://www.theregister.co.uk/2012/08/27/xtremio_projectx_unveiled/)
3) MLC drives, which are supposed to be "enterprise" grade. Theory is if you can find creative ways to reduce writes (such as the last line) this negates the expense of MLC drives. Large storage vendors who got into flash early typically used MLC, but expect SLC to become more accepted (cost being one big reason, improved reliability another).
Just remember, when flash drives die they really die. Due to the way files are stored you can't just ship the drive off somewhere and get files recovered. This isn't a bad thing, but something people need to keep in mind.
As far as laptops/desktops go, beware of things that increase writes. Full disk encryption is good, but if the file is encrypted after it is written you've doubled your writes without even thinking about it. That is just one example of things that can cause flash drives to fail a little earlier than you expected. I've seen MLC flash drives that are used for array caching (hot data blocks written to flash for better response, data constantly being promoted/demoted to these drives) hit their write limit in 9 months. Not die, hit their write limit.

--

Ever feel like you are driving the getaway car?

Re:Some things disk array vendors are doing by Bengie · 2012-10-17 02:03 · Score: 1

ZFS supports rate limiting the cache drives, which is configured as an interval in ms and a transfer amount in bytes.

8,388,608 every 100ms would net you 80MB/s max. If you have more than one vdev for the cache, the total is split evenly among the vdevs.

based on readings, not practice.
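
For anyone checking the 80MB/s figure, here is the arithmetic as a small sketch. The function and parameter names are invented for illustration and are not actual ZFS tunable names; the even split across cache vdevs is taken from the description above rather than from testing.

```python
def cache_write_ceiling(bytes_per_interval: int, interval_ms: int, vdevs: int = 1):
    """Return (total bytes/s, bytes/s per cache vdev) for a per-interval write cap.

    Hypothetical helper; assumes the cap is shared evenly between cache vdevs.
    """
    if interval_ms <= 0 or vdevs <= 0:
        raise ValueError("interval_ms and vdevs must be positive")
    total = bytes_per_interval * 1000 / interval_ms  # intervals per second = 1000 / ms
    return total, total / vdevs


MIB = 1024 * 1024

# 8,388,608 bytes (8 MiB) every 100 ms, spread over two cache devices.
total, per_vdev = cache_write_ceiling(8_388_608, 100, vdevs=2)
print(f"total: {total / MIB:.0f} MiB/s, per vdev: {per_vdev / MIB:.0f} MiB/s")
```

Strictly, that is 80 MiB/s (about 83.9 MB/s in decimal units), or 40 MiB/s each with two cache devices.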

Re:Die! by Anonymous Coward · 2012-10-16 05:40 · Score: 0

I can't decide if parent is the most useless post yet, or the lamest attempt at humor yet. I'll settle it by making this post so the original distinction is moot.

HEAT! by Anonymous Coward · 2012-10-16 05:42 · Score: 0

Will kill a SSD very quickly.
Over 110°F and you should be thinking of improving your cooling.

A lot of problems are also caused by bad cables and/or connectors. I've run into a lot of really crappy SATA connectors and cables, and SSD speeds really show their faults.

Also, Windows and many other OSes have a lot of default behaviour that is not really compatible with long SSD life. Look into tweaking your OS for SSD use.
Eventually they'll get around to fixing these issues as SSDs get more common. But for now there's no reason to let Windows chew up an SSD just because it was set badly by default.

Intel 630 600gb died by Anonymous Coward · 2012-10-16 05:43 · Score: 0

If you order large enough batches of drives you'll eventually see one die. We had an intel 630 600gb die in a raid 60 on my dell R610. Symptoms include drive reporting wrong drive size and overall corruption in the firmware.

Re:Intel 630 600gb died by EmagGeek · 2012-10-16 06:13 · Score: 1

How many did you have to buy to get one to fail?

Pretty quickly and messily by Tapewolf · 2012-10-16 05:43 · Score: 1

On my Kingston, half the filesystem turned to crap. I managed to copy off some of the more important bits which I foolishly hadn't backed up (scripts and stuff - it was the OS drive).

A bad block check revealed that about half the drive that was in use was dead - the blank areas were fine and dandy. I tried to image it the following day (to avoid a reinstall if I could) and at that point the drive ceased to be recognised by the BIOS.

I should probably add that it failed just two days before the warranty expired. However they had discontinued them (I wonder why...?) so I got a refund and went back to a spinning rust drive for the OS for a couple of years. I do have another one now as they were on sale - hopefully it will last longer this time.

I've had one Die catastrophically in a dell laptop by Anonymous Coward · 2012-10-16 05:46 · Score: 0

We had a user's drive fail badly.
Extensive data corruption across the entire drive.
Most if not all of the files were still present, but almost none of them could be opened.
We even sent it to drivesavers and there was nothing left to recover.

Mine went Read-Only by Hawke · 2012-10-16 05:47 · Score: 1

I had a small OCZ SSD of some variety in my foo-server (which mounted the NAS for all the important changing data). One day I realized that / had gone read-only days earlier. Console showed a write failure to the journal (ext3).

Rebooted it, and it worked for ~1 day. Reformatted (managed system, I have no idea if there was data corruption. Didn't seem to be any, but I didn't look for any) and it worked for around 1 week. At that point I gave up and replaced it. It had lasted for just over a year when it failed.

The two Intel SSDs I've bought have not failed yet, nor has another OCZ brand SSD (Vertex3, fwiw).

Planned EOL in the hardware by Anonymous Coward · 2012-10-16 05:50 · Score: 0

There was a time when SSDs made by a certain company whose name means one millionth of a meter would fail at a certain number of hours, even if they still seemed good. It was near 4K power-on hours. The place I worked that bought those drives had everybody run smartctl on a regular basis to make sure we were still safe. Multiple disks in different machines failing right near that 4K window made it awfully suspicious. Our IT guy called them, and we got a really nice discount on the next drives we bought, which didn't have that limit.

OCZ by AVryhof · 2012-10-16 05:55 · Score: 1

They die quickly when you buy them from OCZ... then when you RMA them, they say they'll replace it, never do, and hope you just forget about it or something.

http://goo.gl/H34U6

--
Make America grate again!

Watch your SMART reallocation totals by ckthorp · 2012-10-16 05:55 · Score: 1

I would recommend watching your SMART sector reallocation totals as an indication of drive health. As the drive starts aging, it will start needing to reallocate the weakest cells first, so you should get some warning.
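
A minimal sketch of what "watching" could look like, assuming you save the attribute table printed by `smartctl -A` and compare it between runs. The sample text below is made up, attribute names and raw-value meanings vary by vendor, and the helper names are hypothetical.

```python
# Made-up sample in the usual ATA attribute table layout printed by `smartctl -A`.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       3921
"""


def parse_smart_attributes(text: str) -> dict:
    """Map attribute name -> integer raw value; rows that don't fit are skipped."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a numeric ID and have at least 10 columns.
        if len(parts) >= 10 and parts[0].isdigit():
            try:
                attrs[parts[1]] = int(parts[9])
            except ValueError:
                continue  # raw value is not a plain integer on this drive
    return attrs


def reallocations_grew(previous: dict, current: dict,
                       name: str = "Reallocated_Sector_Ct") -> bool:
    """True if the reallocation count rose since the last reading."""
    return current.get(name, 0) > previous.get(name, 0)


current = parse_smart_attributes(SAMPLE)
last_week = {"Reallocated_Sector_Ct": 4}  # hypothetical earlier reading
print(current["Reallocated_Sector_Ct"], reallocations_grew(last_week, current))
```

A count that keeps climbing between readings is the warning sign; a small, stable number is less worrying.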

I don't care - OS files only by rubycodez · 2012-10-16 06:03 · Score: 1

I only put the OS partitions on the SSD: /boot, /root, /usr, /usr/local and /var. That way it boots and loads large software very quickly. My /home stays on a spinning magnetic disk. Yes, I back up regularly, so it's trivial to replace the SSD.

Kingston 90GB V+ 200 by TheGoodNamesWereGone · 2012-10-16 06:08 · Score: 1

I have a 90GB Kingston I bought used, that I use for my Linux partition. I've done the usual tweaks to minimize writing. It's been working fine for the single month I've had it; blazingly fast boot and load times, etc., but I'm running a full backup right now as I type, and I'm going to schedule weekly ones. As far as brand names go, I've heard very little that's good about OCZ, whereas Intel SSDs are usually praised through the roof. I'm going to get one for my Win 7 partition in the next few weeks. The 330 Series is pretty price competitive.

A great topic! by cyberjock1980 · 2012-10-16 06:18 · Score: 1

I'd love to see real data on SSD lifetimes. Here's mine:

2x OCZ Summit 64GB (circa 2009) - See note below for issues I had.
2x Intel X25M G2 160GB - Installed in March 2010 - Both have worked flawlessly and both show 99% of drive life available by SMART E8 entry. One is my main desktop and one my main laptop. Never had an issue with either. Both have estimated EOL of November 2020 and Dec 2021 by SSDLife.
1x Intel X25M G2 120GB - Installed in April 2010 - 99% drive life available by SMART. It is the boot drive for Server 2008 R2 and is only a file server. Not much to do there so I'm expecting a very very long life. Estimated EOL Nov 2027 by SSDLife.
1x Crucial C300 128GB - Installed Nov 2010 (was boot drive for 2 months but now used for games only) - 86% of life remaining and EOL is Nov 2020.

I don't go too far out of my way to minimize writes. I always disable hibernation and pagefile in Windows for all of my machines. I never use hibernation and my RAM is always 16GB or more. I use the drive like I normally would without regard for the "limited lifespan". If I was going to do something like copy a blu-ray or reencode a video I used to do it using only local drives and then copy it to the server. Now I just do it over the network shares. Otherwise I use my drive just like I always would. Run BOINC on it, etc.

I've gotten a few friends into Intel SSDs, and none of us have had any kind of failure at all ... yet. Everyone's drives are listed as having EOL of 2020 or later. If these drives REALLY do last that long, I expect we'll be throwing them away before 2020 because a 128GB drive will be too small for the OS and a few common programs(Office, etc). I used to tell people to go big because they can take the drive from machine to machine over the next 10 years. It really just doesn't matter though, they're dropping in price so fast you should just buy what you will want for the computer you are using.

One friend bought an OCZ drive because it was really cheap at the time after rebates. He has had to RMA it 4 times in less than 12 months. He's the only person I know personally that hasn't bought Intel, and he is the only one to have any problems.

Personally, I swear by Intels. My experience has been phenomenal with them. I have yet to see an SSD failure personally, and it seems that lots of people have heard stories of Intel drives failing, but I haven't met anyone personally. My experience is that Intel SSDs, reliability-wise, are far superior to rotating rust. I am a little concerned now that Intel is getting away from using their in-house controller and going to Sandforce. After seeing what OCZ drives do and the fact that they use Sandforce I'm a little hesitant to expect a long lifespan from them.

I'm wondering if Intel switched to the cheaper Sandforce despite the lower reliability only because they want to be competitive for the price. Who REALLY buys an SSD expecting a 2020 EOL? Allegedly the newer Intels will have a SMART failure message when you have 1% of the drive left. Intel says that for most users that should be about 2 months of regular use since 1% is not really 1/100th of the drive life remaining. If this is true and I can expect to own the drive for 3-5 years and the drive will give me a SMART error when it is nearing EOL, what more could you ask for? That's nirvana for me!

I will say this. Putting SSDs in every computer I own makes them MUCH more responsive. I've always upgraded every time a new Intel CPU design came out. Right now my desktop is using an Intel i7-920! That's circa Nov 2008. I've NEVER had a computer more than 2 years. Thanks to SSDs the machine still works great 4 years later. I'm thinking of upgrading with the next Intel CPU generation only because the machine is getting old and as a geek I need to be able to justify my geekiness. It's hard to call myself a geek if everyone else is buying $500 Dell machines with more power than my machines. A friend bought a hybrid hard drive. The

Re:A great topic! by WuphonsReach · 2012-10-17 06:13 · Score: 1

I will say this. Putting SSDs in every computer I own makes them MUCH more responsive.

Which is exactly why I put a SSD into my 2007 Thinkpad T61p laptop about 20 months ago. It turned an aging dual-core laptop that was slow and sluggish into something that was a pleasure to use again. Now my major bottlenecks are the CPU speed and the video card.

Assuming nothing goes wrong with the machine (just put Win7 and 8GB of RAM in), I suspect I won't buy a new laptop until 2014.

We're in the process of putting $75 SSDs, 2GB more RAM and Win7 into all of the desktop machines at the office to get another few years out of dual-core machines that we bought back in 2006-2009. Our hope is that we can stretch them out into the 2014-2016 timeframe.

--
Wolde you bothe eate your cake, and have your cake?

ssd lifespan by Anonymous Coward · 2012-10-16 06:18 · Score: 0

Never had a problem on my corsair gt ssd 120gb. Not one bluescreen ever. Do you guys recommend turning the computer off, or leaving it on, to prolong ssd life?

Re:Umm... by daha · 2012-10-16 06:22 · Score: 3, Funny

That is why one uses RAID 6 with lower tier drives and hot spares.

Works great until 3 drives in the RAID fail.

Better make it a RAID-60 just to be safe. And maybe mirror that too.

Several PCI cards failed by costing · 2012-10-16 06:25 · Score: 2

We have replaced a few 8x 15K RPM RAID5 with OCZ PCI cards of 0.5TB and 1TB. They were serving databases with high update frequency (~500Hz average in the long run). At first they were fantastic, iowait was gone, great performance, just as expected, enough to get us hooked on them and order more :)

However in a few months the performance has deteriorated dramatically, to the point where they were much worse than the disks they were replacing. Writing /dev/zero to them a few times restored for a short time the performance but finally one card died dramatically, another lost 2 of the 8 slots ... Finally we moved back to the disks and as much RAM as fitted the servers and called it a failed, very costly, experiment. I'd argue that the SSDs are probably good only for PCs since one can fit a comparable amount of memory in a server for read performance and if you need high write rates you would anyway destroy them quickly...

Bathtub by Sir+Holo · 2012-10-16 06:25 · Score: 1

SSD's follow the familiar engineering "bathtub curve" for failures, just like anything else. But...

In the case of SSDs, the failure of individual memory cells will follow this curve. Therefore, you will probably not notice a few dead cells at first, because SSD's are built with a few "extras." But after a number of writes, 100,000 or so, cell failure rates will rise quickly. That is why the SMART or other drive-lifetime native ware will exclude the failed or over-used cells from use.

Result is that an SSD drive is not likely to "fail" in the sense that a platter HD crashes. Instead, it will just slowly lose usable capacity. Once it's enough to notice, it is time to replace.

But that's really only if it's in a server. If it's in your laptop, you will have upgraded to a new laptop long before noticing the SSD capacity degradation.

It can be worse by dutchwhizzman · 2012-10-16 06:27 · Score: 1

Either, all the answers you find are your own, or all the answers you find are things you've already tried. And the enormous amount of advertisements by companies that will sell you your problem, or the solution for it, most often both at the same time.

--
I was promised a flying car. Where is my flying car?

Re:It can be worse by arth1 · 2012-10-16 07:30 · Score: 1

Either, all the answers you find are your own, or all the answers you find are things you've already tried. And the enormous amount of advertisements by companies that will sell you your problem, or the solution for it, most often both at the same time.
I'd take spammers over rebloggers[*]. At least the spammer's intentions aren't hidden.
[*]: Almost all of them in India or Florida. Can we just bomb both, please?

When you order systems by dutchwhizzman · 2012-10-16 06:29 · Score: 1

When you order systems with a bunch of drives in them and a RAID controller, some "quality" manufacturers take it upon themselves to actually mix and match drives for you, so they don't come off the assembly line right after each other. This is probably why you don't see "mass failure" happening on those systems.

--
I was promised a flying car. Where is my flying car?

Suddenly and without warning... by GameboyRMH · 2012-10-16 06:36 · Score: 1

...sometimes. An SSD is at least as dangerous as a RAID0 array, make backups often.

--
"When information is power, privacy is freedom" - Jah-Wren Ryel

Re:Suddenly and without warning... by sexconker · 2012-10-16 07:10 · Score: 1

...sometimes. An SSD is at least as dangerous as a RAID0 array, make backups often.
I run 2 SSDs in RAID 0. Come at me. I've got nightly, full-disk back ups.

SSD by Anonymous Coward · 2012-10-16 06:38 · Score: 0

Just had my OCZ 1208GB SSD fail on me after 4 months. I was experiencing bluescreens, slow reads/writes, and overall just weird behavior. I'm sending it back now for RMA. At least they have a 3 year warranty...

ocz is garbage by citylivin · 2012-10-16 06:48 · Score: 1

1) OCZ drives are GARBAGE along with most products put out by that company. Avoid at all costs.

2) Crucial m4s, as wonderful as they are, had a firmware bug in the last version whereby, if the power was lost in some circumstances, the drive would then fail to POST. The fix, strangely, was just to apply power with no data cable for two 20-minute periods. So just hook the drive up to power and wait 20 minutes, shut down, and then do it again.

It's amazing, but I have personally fixed 3 drives doing that magic trick. I believe they have corrected this bug in their 000F firmware. Otherwise it seemed to occur mostly in laptops when they were shut down improperly. It was scary for sure! The drive appeared to just disappear from the machine.

--
As a potential lottery winner, I totally support tax cuts for the wealthy

Re:ocz is garbage by AdamWill · 2012-10-16 07:13 · Score: 1

On 2), I believe I read that Crucial released a firmware fix for that.
Re:ocz is garbage by VanessaE · 2012-10-16 12:54 · Score: 1

We have two OCZ Vertex 2 60GB drives here, and aside from them disappearing from BIOS once due to a firmware bug (which we updated to fix), they've behaved flawlessly for about 2 years now.

Is *anything* forever?? by Anonymous Coward · 2012-10-16 06:57 · Score: 0

Do we still not have any type of cheap/fast/small memory storage medium that does *not* degrade over time? Physical hard disc drives can have mechanical failure/bad sectors, SSD cells die after a hundred thousand writes, even CDs and DVDs start to exhibit bad sectors after years of storage. It would be nice to have a drive that will last as long as I do, if not much longer.

Controller Boards by Anonymous Coward · 2012-10-16 07:01 · Score: 1

Half the time a mechanical disk goes it isn't the spinning bit, it is either the power supply or controller board that just go dead.

Those pieces are identical in an SSD, and have no reason to be any more reliable.

Finally by Anonymous Coward · 2012-10-16 07:02 · Score: 0

An ask /. that's not insulting.

my experience by AdamWill · 2012-10-16 07:12 · Score: 1

The now-fairly-ancient SSDs in my laptop, which are Samsungs from around 2009, seem to be dying slowly; every so often I'll get some ATA errors reported in the logs and the drive will either be remounted RO or simply become entirely inaccessible, resulting in everything you try to run that's not currently completely cached failing to work with an I/O error or a segfault. After a reboot, it usually comes back up working okay, but I'm sure at some point it won't.

So...really, pretty much exactly like a typical spinning disk failure, in this case. So far anyhow. I've seen the same 'periodic failures, followed by a day where it just won't work any more' pattern with spinning disks before.

Here are the failure modes. by Anonymous Coward · 2012-10-16 07:22 · Score: 2, Informative

A: Memory cells begin to die off faster than the SSD's controller can annotate them as bad and reallocate the memory. This initially shows up as a major slowdown, then as CRC32 errors which increase in frequency and severity because overwrites do not complete correctly. The issue accelerates until the drive becomes unusable. This failure is usually due to heavy use, age and cheap, cheap memory.

B: A solder joint on a chip cracks and takes out the chip, and since the entire array of chips is set up RAID0-style, the entire drive is mysteriously dead one day. This occurs due to extreme differences between the hot and cold temperatures the drive is exposed to, not by itself but by other components; lead-free solder has multiple metals in it which expand and contract at different rates, so as you heat up and cool down you cause extreme contraction and expansion. Like bending a fork too many times, microfractures form which eventually coalesce to become one big open in the circuit.

C: Shorting of the internal chip components causing the infamous "black glass" situation where the voltage and grounding planes of the chip short out, heat up, and you get to see black glass on the very top of the chip and sometimes a small distortion.

D: Firmware memory fails. Shows up as every single weird issue you can imagine.

E: Defects in the drive such as poor connectors between the die and external connectors, or lack of shock resistance during shipping for certain solder joints, usually the drives fail quick and hard.

All of the above are basically possible, save for Point A, on a regular hard drive.

Fact: If a hard drive goes, drivesavers can toss it under an electron microscope and recover the data. SSDs have no known recovery methodologies because the above failure modes usually physically destroy the data.

Point A makes RAID arrays using SSDs particularly interesting, since if you purchase a box of drives with similar serial numbers and run them at the same load over time, you're bound to end up with them failing near the same point in time. Thankfully, however, different cells on each drive are going to fail at different times. The majority of hard drive failures are mechanical in nature, as wear occurs at different rates for different disks.

SSDs are GREAT for certain applications where shock resistance and speed are key; you can get 15 times the random read/write performance at 1/100th the latency out of an SSD compared with the priciest hard drive, and for a fraction of the cost a server racked with drives can fully saturate its network ports. For doing large-volume data projects or running a fully virtualized infrastructure that needs tons of I/O, there really is, IMO, no other option. Doing so without backups upon backups, however, is suicide, for the same reason running a SAN indefinitely without a backup is suicide. Thankfully running VMs makes backing up and restoring a breeze.

Kingston drives are junk by elliott666 · 2012-10-16 07:23 · Score: 1

About a year ago I had a new vendor ship about twenty computers to my company. They were supposed to contain Intel SSD's but instead contained Kingston drives. All of those drives failed within a year. As they fail I have been swapping them out with Intel drives and they just chug along nicely after that. I even took one of the RMA'ed Kingston drives and put it in my laptop to see how it did. It made it four months and failed just last week.

Here's what that failure looked like. It started off where the system would just freeze up for a couple of seconds about once a day or so. After a week that started happening more like once an hour, but the system would always come to life again; it would just pause for 10 seconds or so. Then after about another week it started hanging completely. First once a day, then several times a day, requiring holding the power button down and restarting the system. Then I RMA'ed it.

The original drive was a kingston ssdnow 100, and lasted 11 months, the replacement was a ssdnow v200, it lasted 4 months. They just shipped me a replacement today, we'll see. It may go straight on ebay!

Re:Umm... by ArsonSmith · 2012-10-16 07:30 · Score: 2

Could run a cubed stripe of RAID 6 arrays in a RAID666

--
Paying taxes to buy civilization is like paying a hooker to buy love.

Commercial NOT Consumer by bertomatic · 2012-10-16 07:41 · Score: 1

Hey everybody, the question was about COMMERCIAL drives, NOT consumer drives. You want reliability, you want SLC. No enterprise in their right mind would rely on MLC in production critical environment. Stop giving the original poster advice about consumer crap! Please?

My luck.... by taskiss · 2012-10-16 07:47 · Score: 1

The system asks "What SSD"?

--
- real hackers don't have sigs -

Obligatory XKCD by Calydor · 2012-10-16 08:23 · Score: 1

http://xkcd.com/979/

--
-=This sig has nothing to do with my comment. Move along now=-

Catastrophic controller failure by macraig · 2012-10-16 08:48 · Score: 1

Catastrophic failure of the internal controller circuit is one possibility. It happened to me with a small G.Skill SSD. That wasn't my judgment of what happened, that was G.Skill's. The data might have still been there, but I had no way to access it. As far as the computer was concerned, the physical device still existed but the media and partition didn't.

That was one of two SSDs that I have bought, so from my perspective it's a 50 percent failure rate for the technology. Here's the irony: I have a Conner Peripherals 170 MEGAbyte IDE platter drive - from about 1992? - that still works. I have a small box full of old magnetic platter drives like that one that still work. In 25 years of using platter drives, I've had perhaps three physically fail. Am I going to be able to say the same thing about the SSDs I have now in 20 years, especially given their guaranteed obsolescence? Not a chance. YMMV, but not by much.

embedded SSDS by Anonymous Coward · 2012-10-16 08:50 · Score: 1

In the place where I work we use some old IDE flash modules; one could call them early SSDs.

Basically, here is what happened when they died on us (nearly all of the 300+ died within a few months, after approx. 2 years of runtime):

1st, you might see some errors which are recoverable through firmware.
2nd, you might see non-recoverable sector write errors, making the partition read-only (we use Linux).
3rd, you see read errors, though only a few of our drives reached that state.

regards

3 out of 5 by Anonymous Coward · 2012-10-16 08:50 · Score: 0

I've had three 128GB Extrememory SSDs fail, with 6-18 months of use. One is still running, as is another Samsung 256GB.

Experience with all ranges by guruevi · 2012-10-16 09:11 · Score: 3, Interesting

The ultra-cheap SSDs in my servers lasted only 3 months. The 4 OCZ Vertex 3 IOPS have so far lasted over a year with ~2TB processed per disk; 2 Intel SLC and 2 MLCs have already lasted over 2 years, over which time they have processed ~10TB each (those were all enterprise grade or close to it). They are in a 60TB array doing caching so they regularly get read/write/deleted. I have some OCZ Talos (SAS) as well, where one was DoA and another was an early death, but I simply shipped them in for RMA and had another one in a couple of days. The rest of them are well over 6 months and going.

Several other random ones still work fine in random desktop machines and workstations.

As far as spare room on those devices, depending on the manufacturing process you get between 5 and 20% unused space where 'bad' blocks come to live. I haven't had one with bad blocks, so most of mine have gone out with a bang: usually they just stop responding and drop out, totally dead. I would definitely recommend RAID6 or mirrors, as they do die just like normal hard drives (I just had 3 identical 3TB drives die in the last week).

--
Custom electronics and digital signage for your business: www.evcircuits.com

Re:Experience with all ranges by Bengie · 2012-10-17 03:23 · Score: 1

"depending on the manufacturing process you get between 5 and 20% unused space where 'bad' blocks come to live"

It's not just for "bad blocks" but for general wear leveling. The SSD doesn't just set aside 20% and not ever use it except corner cases, it uses it all the time, but it doesn't report that space when showing the drive size.

As an extreme example, assume your SSD is 100% full and you want to change a block of data. Because there is no more free space, there are no more blocks to swap out with for wear leveling. If you have X% reserved, then you always have some blocks to swap out with.

Also, the theoretical best amount of data that can be written to an SSD is the product of the total amount of storage and the number of writes each cell can handle. The practical amount of data that can be written is more a function of the interaction between how full the drive is and how much data is being written.

It's a balancing act of the trade off between usable storage and how fast the drive wears out.
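
The product described above is easy to put into numbers. This is a rough sketch only: the 128 GB raw capacity, the 3000 program/erase cycles and the write amplification factor are assumed example values, and "write amplification" is my shorthand for the fullness/workload effect, not a term the poster used.

```python
def theoretical_write_limit_bytes(raw_capacity_bytes, pe_cycles):
    """Best case: every cell is worn evenly and nothing is rewritten internally."""
    return raw_capacity_bytes * pe_cycles


def practical_write_limit_bytes(raw_capacity_bytes, pe_cycles, write_amplification):
    """Host-visible writes once internal rewrites are accounted for.

    write_amplification is flash bytes written per host byte written;
    it tends to grow as the drive fills up and spare blocks run short.
    """
    if write_amplification < 1.0:
        raise ValueError("write amplification cannot be below 1.0")
    return theoretical_write_limit_bytes(raw_capacity_bytes, pe_cycles) / write_amplification


if __name__ == "__main__":
    raw = 128 * 10**9   # assumed raw flash, decimal gigabytes
    cycles = 3000       # assumed P/E cycles per cell
    print(theoretical_write_limit_bytes(raw, cycles) / 10**12, "TB theoretical")
    print(practical_write_limit_bytes(raw, cycles, 3.0) / 10**12, "TB at an assumed WA of 3.0")
```

The point of the second function is only that the same flash gives very different host-visible endurance depending on how much shuffling the controller has to do, which is where the reserved space earns its keep.
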
Re:Experience with all ranges by guruevi · 2012-10-17 04:45 · Score: 1

I know how they work, I meant by 'unused' that that portion isn't reported to the OS. You can buy 3 physically similar drives from the same vendor and chipset (maybe different types of memory chips) and they will report 100GB (Enterprise, usually SLC or eMLC), 120GB (Mid-level) and 128GB (Cheap) with the same number of chips and identical controllers (although the enterprise may have a gold capacitor soldered on the board)

--
Custom electronics and digital signage for your business: www.evcircuits.com

Interesting times by azbot · 2012-10-16 09:13 · Score: 1

I'm just here to watch all the old-timers post.

It's the future by Anonymous Coward · 2012-10-16 09:14 · Score: 1

We have an array of SSDs acting as cache for our multi-tiered file system; 15,000 RPM SAS -> 7,200 RPM SAS -> Tape running Solaris.

We had one of these SSDs continue to operate in a read-only manner for a while. It was really tricky for us to actually detect that the problem was with the cache, caused by the faulty SSD. It was even a proper enterprise grade SSD - but I guess when you are using an SSD array as a cache for a file system used by thousands of users, it's not that surprising when you have one of the aforementioned SSDs fail.

We have a smaller setup of SSDs on an ESXi host, once again with the SSDs acting as cache for the file system. This has really helped our rapid development for our Standard Operating Environments and for other projects where setting up a physical box is more time consuming than virtualisation.

I have no theories for why they actually fail, just found that some of the consumer grade OCZ drives have been particularly notorious as a place to store single copies of things, although since the Vertex2 series, I think this has improved. I guess Vertex2 was really only 2nd/3rd gen SSD, so the market is still in the maturing phase - I will choose SSD for my next personal computer without doubt.

Intel SSD in the Enterprise: very low failure rate by bbasgen · 2012-10-16 09:22 · Score: 4, Informative

I have ordered approximately 500 Intel SSDs over the past 18 months (320 series and the 520 series primarily). To date, we have had exactly one fail to my knowledge. It was a 320 series 160 GB with the known firmware issue. We have around 80 of that type and size, and the drive that failed did so on first image. We RMA'ed the drive and got a replacement.

I've had one intel X25-E fail by Anonymous Coward · 2012-10-16 10:02 · Score: 0

One G1 Intel X25-E 32GB unit out of 16 failed for me about 8 months after deployment. The unit would no longer respond to SATA commands. It was like the drive vanished. Intel replaced it no problem, but that's the only SLC SSD failure I've ever encountered.

OCZ SSD APEX 250GB by Anonymous Coward · 2012-10-16 10:23 · Score: 1

Died two weeks ago. A strange noise came out of it. After opening the case, I noticed that a diode (which was in series right after the point where SATA power entered the device) had a burn hole in it. After some measuring, I replaced it with another diode. After a few power-up attempts, the device eventually woke up again, registering itself with the operating system as an Apex drive. However, the data was not readable anymore.

Re: SSD / HDD hybrid raid by nullchar · 2012-10-16 11:06 · Score: 1

Thanks for the response! I was hoping it was all mdadm, as I love to use software raid.

May I ask, what was your use-case for using this hybrid approach? Did you do much benchmarking with the applications you were trying to benefit from faster reads? Did you also tune FS parameters like +noatime and tweak block sizes and such to minimize writes?

Would a system like this with full disk encryption be any better/worse off? The first enc pass (say with truecrypt where it writes to the whole drive) adds extra wear, but it should happen only once with subsequent writes changing small portions of the disk.

Obligatory ISR by funkboy · 2012-10-16 11:16 · Score: 1, Offtopic

In Soviet Russia, SSD fscks YOU!

2 died by meam · 2012-10-16 11:26 · Score: 1

My first SSD, an OWC 120 GB SATA2, died in my notebook. It started with hidden data errors when writing large files (4+ GB), followed a few days later by CRC errors. Finally, it couldn't boot. However, I could still copy most of the data off it by attaching it to another PC.

My second SSD, an OCZ Octane 128 GB, occasionally threw a lot of errors over a few months. Then, suddenly, it couldn't be mounted any more. However, after a secure erase, it worked again without any errors.

Re:Die! by lister+king+of+smeg · 2012-10-16 11:42 · Score: 0

long time ac before setting up account

--
---Saying gnome 3 is better than windows 8 not so much a compliment as it is damning with light praise.

Crucial Adrenaline, didn't die exactly by Anonymous Coward · 2012-10-16 12:56 · Score: 0

I had something worse. I got a Crucial Adrenaline SSD cache so I could use my 2TB drive as main storage and still enjoy the speeds. Well, it was working beautifully, even after a power outage or two, but not three. Something went bad and it left my main drive unbootable. To make things worse, it turned it into a dynamic disk, so I had to try all kinds of rescue software to find one that could convert it back without destroying it, so that I could use another app to recover files once it was a basic volume again. Days later I got my data back, although not as pretty as I remember. All I know is that if you turn on your PC and see scandisk giving you shitloads of INODE errors, immediately shut down your PC if you want to keep your data, then try to restore it with some NTFS restore app before your MFT gets royally screwed. I formatted my 2TB drive and all was well again. My Crucial Adrenaline stayed in my machine, but not installed, out of fear. I wouldn't trust any SSD without a UPS power backup.

Re:Intel SSD in the Enterprise: very low failure r by kfsone · 2012-10-16 13:04 · Score: 1

What is your expectation of how the drives will begin to fail - are you expecting bulk simultaneous failures or are you expecting to get plenty of degradation warning before you start to see failures?

--
-- A change is as good as a reboot.

Laptops and liquids. by ehiris · 2012-10-16 13:43 · Score: 1

I had a Dell laptop whose SSD died from a couple of drops of water, because the chips were exposed without an enclosure and the drive was located directly under the keys.

Suddenly and without warning by felixrising · 2012-10-16 13:47 · Score: 1

In my limited experience (we have a VNX SAN running a few cages of EFDs, 16 servers with RAID 1 SSDs and a mix of laptops running SSDs), most die with no warning whatsoever.. they just cease to exist with nary a whimper.

SSDs don't die... by Leemeng · 2012-10-16 13:55 · Score: 1

the data just fades away. ;-)

slowly by Anonymous Coward · 2012-10-16 15:21 · Score: 0

Assuming it is defect free, the flash will slow down. Write cycles will take longer and then eventually it is unable to store data, one bit here, one bit there. I don't know if SSDs hide all of this from you, or for how long. I assume that there is a way to determine how many bits have failed, and perhaps a threshold where you will start to get warnings from the operating system. Again, that is assuming defect free, which is not an assumption I would make.
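
The threshold idea guessed at above does exist in the form of SMART attributes, where each attribute carries a normalized VALUE and a vendor-set THRESH. Below is a minimal sketch that reads text laid out like the attribute table from `smartctl -A` and lists attributes whose value has dropped to or below a non-zero threshold. The sample table is invented for illustration, and which attributes a given SSD exposes varies by vendor.

```python
def failing_attributes(smart_table_text):
    """Return names of attributes whose normalized VALUE <= THRESH.

    Expects rows in the column order:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    A threshold of 0 means the attribute is informational and never trips.
    """
    failing = []
    for line in smart_table_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric attribute ID; skip headers and blanks.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        value = int(fields[3])
        thresh = int(fields[5])
        if thresh > 0 and value <= thresh:
            failing.append(name)
    return failing


# Invented sample rows, not output from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       8760
177 Wear_Leveling_Count     0x0013   004   004   005    Pre-fail  Always   FAILING_NOW 2950
"""

if __name__ == "__main__":
    print(failing_attributes(SAMPLE))
```

Going by the other posts in this thread, plenty of drives die without any attribute ever tripping, so treat this as an early-warning check for the slow wear-out case only.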

correction and more info by slashmydots · 2012-10-16 17:21 · Score: 1

I've had many flash drives fail and that's pretty much identical, assuming the SSD's chip didn't burn out or something instead. When flash memory fails, you get quirky delayed write failures but can typically still read the data for a short period of time. That happened with all 3 of my drives that failed (they were really bad brands). And by the way, everyone's hating on OCZ since their 1-3 drives were a catastrophe, but their version 4 ones work great as of firmware 1.5, which they all ship with. I've used about 15 for builds over the last several months with no problems. Intel I had 2 slight problems with though. For example, their own latest copy of the bootable firmware flasher doesn't recognize a 60GB 330 Maplecrest under any circumstances on any board with any SATA controller, even an Intel one.

Re: SSD / HDD hybrid raid by gweihir · 2012-10-16 18:07 · Score: 1

You are welcome. mdraid is pretty cool indeed.

My original use case was that I had foolishly put my mailbox into Maildir format (one file per message) and that opening it got incredibly slow (minutes). With an SSD that was not an issue anymore. On the other hand, I did not want something this important on a single drive. Somebody suggested that --write-mostly may help. Tried it, and opening latency went down to the speed of a bare SSD. No filesystem tweaking whatsoever, so I cannot tell you whether it makes any difference.
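
For anyone wanting to try the same thing, here is a small sketch that only builds the mdadm command line for such a mirror and prints it, without running anything. It assumes a plain two-device RAID1 with the SSD listed first and the HDD after --write-mostly (mdadm marks the devices that follow that flag, so reads avoid them where possible). The device names are placeholders, and the helper name is made up.

```python
def mirror_create_argv(md_device, fast_device, slow_device):
    """Build an mdadm argv for a RAID1 that prefers reads from fast_device.

    --write-mostly applies to the devices listed after it, so the slow
    device must come after the flag and the fast device before it.
    """
    return [
        "mdadm", "--create", md_device,
        "--level=1", "--raid-devices=2",
        fast_device,
        "--write-mostly", slow_device,
    ]


if __name__ == "__main__":
    # Placeholder device names; substitute your own SSD and HDD partitions.
    argv = mirror_create_argv("/dev/md0", "/dev/sdX1", "/dev/sdY1")
    print(" ".join(argv))
```

Printing instead of executing is deliberate: creating an array is destructive, so it is worth reading the command before pasting it into a root shell.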

By now I have the root partitions on two machines in the same configuration. Everything is just much, much snappier.

I have not done it for encrypted volumes so far. It should work, but there is still some subtle problem with Linux dm-crypt and RAID or LVM that causes massive write slowdowns for some people. Others have no problem at all. It may be due to the differences in write-scheduling. On reads, no problems that I know of, so reading should get the low SSD latency and, depending on CPU power and cipher, most of the SSD read speeds.

You do not need to worry about the wear of a single disk overwrite with a modern SSD. Even the lowest end FLASH chips can take 2.5k overwrites.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

My Two Experiences by Anonymous Coward · 2012-10-17 02:34 · Score: 0

My two experiences with failed SSDs have been the same. Everything seems just fine for a year or two and then, bang, the system locks up. Reboot and the drive is completely inaccessible or it is visible but completely unreadable or writable.

More worrying though is that this has happened twice, with different manufacturers, in the last three years. In the previous 15 years, I have only had two personal drive failures and both were such that I could scrape data off the drives before throwing them out. I've got a box on the shelf that has nine drives in it. They all work just fine and have been replaced over the course of many years due to upgrades, new system/more capacity.

It's anecdotal, but for me SSDs have proven to be very unreliable and short lived.

With Fire! by Gallomimia · 2012-10-17 06:13 · Score: 1

Someone told me rather recently that SSDs fail with spectacular explosions, fireworks, and burning. A read of several comments on this page speaks to the farcical nature of this idea. Can anyone comment on this?

--
Sadly, a Libertarian cannot force his views on another, and freedom cannot spread as does the cancer known as religion.

Enterprise Class SSD SLC Observation by Anonymous Coward · 2012-10-17 06:35 · Score: 0

I work with enterprise-class SLC SSDs, 200 & 400 GB, by HGST (now a WD company).
I have yet to see a drive fail. Note: this SLC drive is 80% over provisioned and has a 5 year warranty that assumes a full drive write 30 times a day.
FWIW: My lab sample size is 60 drives, and thousands shipped to customers. Street price on this drive is $3K

It's got boring now ... by RockDoctor · 2012-10-17 16:11 · Score: 1

SSDs have reliability issues. We know that.

What is the price premium? (Off to normal hardware site ...)

8-10 times price of "spinning rust" ;

Seek times and data rates?

Not given, and

Do I need that?

Pretty dubious
Shiny, new and high-tech? Yes, but I still don't need it.

I know that it's a very un-Slashdot thing, but I don't need that shiny-bright-new-unreliable tech.

--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"

OCZ Questionable Reliability, Intel Fanboyism by JakFrost · 2012-10-18 06:37 · Score: 1

I personally know of (3) OCZ Agility 1 30GB drives that failed, 2 on Linux, 1 on Windows.

OCZ Agility -> Intel SSD Image -> Distro Upgrade to Fix Corruption

The failure mode on one was file system corruption like on an HDD, but check disk would find problems, fix them, then error out, and another run would find different errors. I was able to image that one off to an Intel X25-V (Value) G1 40GB SSD, then did a whole Ubuntu distro upgrade that basically overwrote pretty much all the important files on the system with freshly downloaded good ones, and that took care of the corruption and any problems from the previous drive. The system is still running Asterisk PBX to this day without any errors, surprisingly. I'm still a little amazed at how simple this recovery was and that there were no issues after the distro upgrade, which seemed to fix any corrupted files. I sent the failed OCZ drive back to my friends after fixing their PBX, with instructions to put a bullet through it instead of sending it for replacement, and I was being serious and literal and it is likely that they did just that.

Another failed by becoming inaccessible and unbootable from Windows XP. The last one just kernel panicked and disappeared from the BIOS completely. Both went back to OCZ for replacement and new ones showed up. I told the folks not to open the boxes, to sell them on eBay, and to buy Intel X25-M or -V series drives to replace them instead.

OCZ Bashing

I still have a sealed OCZ Agility 1 30GB in my house and I posted it on eBay twice and nobody wants to buy it. I guess the word is out that OCZ SSD has shit for reliability. Newegg reviews are just full of failure reports. Even though Anandtech keeps reviewing these OCZ Vertex 2, 3, 4 series drives and praising them for performance I stay the hell away from OCZ as a vendor due to the massive amounts of complaints of failures people report on these.

As a side story, I also got burned by a performance grade OCZ 550W power supply with unstable 12V rail that wasn't even heavily loaded that would drop to 11V for no reason and destabilize my system causing weird behavior. Switched to Corsair TX750 after that and weirdness went away.

Intel SSDs - 3 Generations Going Strong

I have been running an Intel X25-M G1 80GB, which used to be a desktop drive, in my laptop for a few years now without issues. I have an Intel X25-M G2 80GB at work and it's still working fine. I also have an Intel 320 (G3) 160GB as my new desktop drive, and I applied the firmware upgrade to it that was available to fix that weird lock-up 8MB issue that was reported. I also have an Intel 320 40GB in my Ubuntu XbmcLive HTPC in my living room and another Intel X25-V G1 40GB in a friend's Ubuntu based Asterisk PBX system running just fine.

Love Intel for their SSD, never had an issue and I'm quite happy with them and the engineering that they did on the drives. Looking at the return numbers Intel has very low return rates for SSD, somewhere within the neighborhood of 1% and most of those were related to the two firmware bugs found, the one in the X25-M series early and the other the 320 series.

Intel 520 Series and SandForce SF-2281 Controller Firmware

There's a nice little story on Anandtech about when Intel was choosing the new SandForce SF-2281 controller for their Intel 520 SSD product line: they ran so many tests and did so much engineering on the drives that they came up with firmware updates that they gave to the vendor for the issues they discovered. Too bad that later on Intel found out that the controller can't do AES256, only AES128 encryption, and is offering refunds for those that care about it.

http://www.anandtech.com/show/5508/intel-ssd-520-review-cherryville-brings-reliability-to-sandforce/

All of my Intel SSDs are about 2 to 3 generations behind and still use the old Intel controller that's limited to SATA-2 3Gbps speeds but

usage changes in rebuild, that change may take-out by Anonymous Coward · 2012-10-18 15:27 · Score: 0

when a RAID is run with, say, X I/Os / second,
for several years,
and a drive dies,
AND the other drives in the RAID are near failure due to the same issue
( bearings in IBM's Deathstar drives, e.g,
or electro-migration in a chip in Fujitsu's nightmare enterprise drives, when they changed to a more eco-friendly chip-chemistry,
from what I've read )

suddenly the RAID is getting full NORMAL use
*PLUS* the RAID-rebuild...

A few months ago there was some article on Tom's Hardware
( first I'd bothered with that site in years )
discussing drive-reliability,
and the contributing datacentres found 2 things of interest to me:
1. the most-common time for a drive to die is *within 1h* of another drive dying.

2. Super Talent drives have a significantly higher failure rate.

I'm with the people who build RAIDs with 4 brands/models of drives, specifically to make entire-RAID-loss significantly less likely...

( & RAIDZ2 or RAID6 oughta be the law :)

As James Hamilton said in his Usenix paper ( excellent ), when you scale things up enough, it isn't IF a given failure will happen, it is WHEN it will happen: entire-rack failure will happen given enough opportunity...

I love his rule, though: if you aren't shutting your servers off by yanking the plug, you don't trust your HA system.

( :

Cheers!

A proper answer by CoolBru · 2012-10-18 22:49 · Score: 1

There's a lot of conjecture and theorising in this thread so far. Not surprisingly some enterprising geeks have been busy testing SSDs to destruction, and they have some great stats. This thread with over 5000 posts has a ton of info about exactly what happens and some good hard numbers.

Re:Intel SSD in the Enterprise: very low failure r by bbasgen · 2012-10-19 07:52 · Score: 1

My expectation is that an SSD used by a desktop computer is most likely to indicate issues when it is being heavily used. In that sense, an imaging process is a good event.

IBM Answer: SSD life based on Total Bytes Written by dakra137 · 2012-10-21 05:47 · Score: 1

For the SSDs that IBM sells, it gives an "Endurance" rating for a certain number of "Total Bytes Written," according to http://www.redbooks.ibm.com/abstracts/tips0879.html. The implication for this discussion thread is that if you build a RAID 1, 5, 6, or 10 array with new SSDs, you should expect them all to reach end of life at about the same time.

The drives described in this paper from March, 2012 are rated for Endurance: "36 TB of total bytes written (TBW) at 90% full disk based on predefined usage pattern for 64 GB SSDs and 72 TB of TBW for higher capacity drives."

"Enterprise Value SSDs and Enterprise SSDs have similar read and write IOPS performance, but the key difference between them is their endurance (or life time) (that is, how long they can perform write operations because SSDs have a finite number of program/erase (P/E) cycles). Enterprise Value SSDs have a better cost/IOPS ratio but lower endurance compared to Enterprise SSDs. SSD write endurance is typically measured by the number of program/erase (P/E) cycles, that the drive incurs over its lifetime, listed as TBW in the device specification.

"The TBW value assigned to a solid-state device is the total bytes of written data (based on the number of P/E cycles) that a drive can be guaranteed to complete (% of remaining P/E cycles = % of remaining TBW). Reaching this limit does not cause the drive to immediately fail. It simply denotes the maximum number of writes that can be guaranteed. A solid-state device will not fail upon reaching the specified TBW. At some point based on manufacturing variance margin, after surpassing the TBW value, the drive will reach the end-of-life point, at which the drive will go into a read-only mode. Because of such behavior by Enterprise Value solid-state drives, careful planning must be done to use them only in read-intensive environments to ensure that the TBW of the drive will not be exceeded prior to the required life expectancy.

"The endurance of Enterprise Value drives is specified based on the following access pattern: 50% random data and 50% sequential data with block size mixes of 5% of the data as 4 KB block size, 5% of the data as 8 KB block size, 10% of the data as 16 KB block size, 35% of the data as 64 KB block size, and 35% of the data as 128 KB block size. The Enterprise Value drives described here are capable of 36 TB (64 GB SSD) or 72 TB (128 GB, 256 GB and 512 GB SSDs) of lifetime writes, with the workload stated above as the worse case. For the device to last in five years inside of the 72 TB of TBW, the drive write workload must be limited to no more than 40 GB of writes per day. For the device to last in three years, the drive write workload must be limited to no more than 65 GB of writes per day."

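The per-day figures in the IBM quote follow directly from dividing the TBW rating by the target lifetime. Here is a quick sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB) and 365-day years; the function name is my own.

```python
def daily_write_budget_gb(tbw_terabytes, years, days_per_year=365):
    """Average GB per day you can write and still stay within the TBW rating."""
    total_gb = tbw_terabytes * 1000  # decimal units assumed
    return total_gb / (years * days_per_year)


if __name__ == "__main__":
    for years in (5, 3):
        budget = daily_write_budget_gb(72, years)
        print(f"72 TB over {years} years: {budget:.2f} GB/day")
```

That works out to a little under 40 GB/day for five years and a little under 66 GB/day for three, which lines up with the rounded 40 and 65 GB/day limits in the quoted text.
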
re: thinking about it differently by King_TJ · 2012-10-21 12:34 · Score: 1

I don't have a problem with your decisions, but I don't know that I'd give that advice to everyone else?

IMO, the current state of SSDs is such that you're still paying a big price premium for one over a traditional hard drive, and you're getting technology that clearly has certain limitations (primarily being a limited lifespan if it's forced to do many, many data rewrites).

You can Google search it to see what I'm talking about, but there are quite a few sysadmins out there who got excited by the prospects of moving their relational databases onto SSDs on their servers for a big speed boost, only to find they were consistently killing off the drives in a matter of as little as 2 to 6 months' time. They clearly couldn't hold up to that type of use/abuse.

On the other hand, 99% of the other tasks you might do with a computer aren't nearly as rewrite intensive. If, say, you're a computer gamer? You're going to like an SSD for the advantages it gives of faster load time for all those levels they have to read in. The casual user will mainly appreciate the quick boot time if he/she turns the computer off when it's not in use, so finds themselves booting up from scratch pretty regularly. Digital video editors and photographers and artists should appreciate the quicker time to load plug-ins and video content, not to mention large applications.

But to me, the temporary swap file is something you can still throw onto a physical hard drive, at least in a desktop PC. You can even recycle a smaller capacity drive this way that you'd otherwise not bother using anymore. It's pretty much win-win because it won't really slow down the overall system performance much at all if everything else is on the SSD. (Ideally, you have enough RAM so the swap file isn't being relied on real heavily anyway.)

Re: thinking about it differently by deroby · 2012-10-21 22:12 · Score: 1

All true what you say, but apart from the remarkably faster load times of applications, the main benefit I get from the SSD is the lack of disk thrashing.
I'm not going to promote running a high-load RDBMS on an SSD as there seems to be a lot of evidence around that this indeed kills them rather quickly (*) but things like the swap and %TEMP% are well within the limitations of what you can throw at these beasts. (IMHO)

My Intel 320 120GB was installed about 8 months ago and currently stands at 3.48TB total reads / 3.65TB total writes and still shows 100%.
(SSD-Life claims 9 years, but I'm not putting too much value on that number)

(*: the high number of writes in SMART probably is due to my work with the databases. I'm not even sure if the value means actual bytes written or if it represents the total of blocks that were updated on the flash-chips, which could be much higher! Anyway, I realise it eats at the drive's lifetime but there simply is no comparing to doing the same things on the HDD... To be entirely honest, if possible I try to run these things on a RAM-disk as it's even faster and I DO care about the SSD's limitations, but my laptop is limited to 8GB and I need 'some' of that for other purposes too =)
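
For a rough cross-check of a lifetime estimate like the one above, you can project from the observed write rate. The sketch below assumes 8 months is about 240 days and borrows the 72 TB endurance figure quoted from IBM elsewhere in this thread purely as a placeholder; it is not a published rating for the Intel 320, so the years it prints are illustrative only.

```python
def daily_write_rate_gb(written_tb, elapsed_days):
    """Average host writes per day, in decimal GB."""
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    return written_tb * 1000 / elapsed_days


def projected_years_remaining(written_tb, elapsed_days, rated_tbw_tb):
    """Years until rated_tbw_tb is reached if the write rate stays the same."""
    daily_tb = daily_write_rate_gb(written_tb, elapsed_days) / 1000
    remaining_tb = max(rated_tbw_tb - written_tb, 0.0)
    return remaining_tb / daily_tb / 365


if __name__ == "__main__":
    written_tb = 3.65        # total writes reported in the post
    elapsed_days = 240       # assumption: 8 months at 30 days each
    placeholder_tbw = 72.0   # hypothetical rating, see note above
    print(f"{daily_write_rate_gb(written_tb, elapsed_days):.1f} GB/day")
    print(f"{projected_years_remaining(written_tb, elapsed_days, placeholder_tbw):.1f} years left")
```

If the SMART counter turns out to track flash-level writes instead of host writes, as the footnote wonders, the same arithmetic still applies but the rating you compare against has to be in the same terms.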

--
If there is one thing to be learned on slashdot, it has to be sarcasm.

Manufacturer/Model by Erpo · 2012-10-22 16:23 · Score: 1

Here's my favorite part of the paper:

"Failure rates are known to be highly correlated with drive models, manufacturers and vintages. [...] However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data."

Thanks, Google. :/
