Slashdot Mirror


Why I'm Usually Unnerved When Modern SSDs Die on Us (utoronto.ca)

Chris Siebenmann, a Unix Systems Administrator at University of Toronto, writes about the inability to figure out the bottleneck when an SSD dies: What unnerves me about these sorts of abrupt SSD failures is how inscrutable they are and how I can't construct a story in my head of what went wrong. With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that; perhaps the spindle motor drive seized or the drive had some other gross mechanical failure that brought everything to a crashing halt (perhaps literally). SSDs are both solid state and opaque, so I'm left with no story for what went wrong, especially when a drive is young and isn't supposed to have come anywhere near wearing out its flash cells (as this SSD was).

(When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming. Probably there are potential undetected manufacturing flaws in the flash cells and so on, though.) When I have no story, my thoughts turn to unnerving possibilities, like that the drive was lying to us about how healthy it was in SMART data and that it was actually running through spare flash capacity and then just ran out, or that it had a firmware flaw that we triggered that bricked it in some way.

243 of 358 comments

  1. With spinning disks, you do not know either by gweihir · · Score: 5, Insightful

    Seriously, you do not. You may know the end-result sometimes (head-crash), but the root-cause is usually not clear.

    So get over it. It is a new black-box replacing an older black-box.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    1. Re:With spinning disks, you do not know either by 110010001000 · · Score: 5, Insightful

      What is unnerving is that a guy from the Department of Computer Science thinks that SSDs are theoretically immune to manufacturing failures.

    2. Re:With spinning disks, you do not know either by froggyjojodaddy · · Score: 5, Insightful

      From the article:

      "Further, when I have no narrative for what causes SSD failures, it feels like every SSD is an unpredictable time bomb. Are they healthy or are they going to die tomorrow? "

      Emphasis mine. I feel like this guy has opportunities to improve his coping mechanism. For someone in Computer Sciences, it seems like he's way too worried about this. I'm not trying to be mean, but it's like if I got into a car accident and then questioned the entire safety design of all vehicles rather than just taking a few steps back and understanding it's a freak event, but not a totally unexpected one. If you've been driving for 30 years, statistically, you're likely to get into at least one accident, even if it's not your fault.

    3. Re: With spinning disks, you do not know either by chaboud · · Score: 1

      Oh, to have mod points....

      There is no other correct take but this. Solid-state does not mean "immune to wear", and anyone in a CS program should be aware of it. Anyone *teaching* a CS program should be embarrassed about this.

    4. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 1

      Yes, this is why RAID exists and why it is still equally valid with SSDs. It is also why we have good backup systems otherwise we don't sleep well at night.

      It also greatly depends on the class of SSD. I've never had an enterprise SSD die on me after going into production. They are either DOA or chug along nicely. You get what you pay for as they are quite a bit more expensive. It is also why you run a few disk benchmarks before going into production to verify build quality.

      Of course in terms of SANs and Enterprise SSDs they often give me an expected lifetime with the drive. Most of them happily working over 270% of their expected life but that is a risk people take. In my case there aren't a whole lot of writes happening so I'm not worried about it, again, it is in a RAID so if one should give out then I have time to fix the problem. If both fail then I have good backups so while I may take a brief outage it won't be the end of the world.

    5. Re:With spinning disks, you do not know either by AmiMoJo · · Score: 4, Informative

      Often SSD failures can be predicted or at least diagnosed by looking at SMART data. That's what it's for, after all. Some manufacturers provide better data than others.

      Like HDDs, sometimes the electronics die too. Usually a power supply issue. Can be tricky to diagnose. SSDs are slightly worse, as with HDDs you can often replace the controller PCB and get them working again, whereas SSDs are a single PCB with the controller and memory.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    6. Re:With spinning disks, you do not know either by alvinrod · · Score: 2

      Or to learn what causes SSDs to fail. Just because something appears unpredictable doesn't mean that it is so. If he doesn't have the time to devote to investigating this issue and acquiring any requisite knowledge that will help him to uncover the truth, then he probably shouldn't be squandering any of that precious time whining or worrying about things that are out of his control.

    7. Re: With spinning disks, you do not know either by 110010001000 · · Score: 1

      What "specialist area"? I am a specialist in all things. Any moron knows any manufactured thing isn't theoretically immune to manufacturing failures. Not sure why you are talking about "root cause". I never mentioned that.

    8. Re:With spinning disks, you do not know either by ctilsie242 · · Score: 5, Funny

      Could be worse. At a previous job, I've had someone demand "7200 RPM SSDs", and no amount of explaining could change the person's mind.

    9. Re:With spinning disks, you do not know either by Stonent1 · · Score: 4, Interesting

      Ok, I'm in IT and it unnerves me. I've had numerous computers have an SSD totally die and lose all data with no smart warnings in the last few years. (Not me personally, I mean people at our organization)

    10. Re:With spinning disks, you do not know either by 110010001000 · · Score: 1

      That sounds about right for HP in 2018.

    11. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 1

      To get a 7200 RPM SSD, just put it in a centrifuge. Tell your coworker the centrifuge will separate good data from bad.

    12. Re: With spinning disks, you do not know either by omnichad · · Score: 3, Interesting

      Older SSDs didn't even have a wear-leveling SMART attribute or total host writes attribute. Some of the cheaper ones probably still don't. So there is no way to see how close you're getting to the estimated upper limit. There is a pretty clear progression on the newer drives. With hard drives, mechanical failure is actually less predictable than SSD wear-out (defects aside).
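
      On drives that do expose a host-writes counter, the "how close to the estimated upper limit" arithmetic could be sketched roughly as below. This is only an illustration: `endurance_used` is a hypothetical helper, the 512-byte unit is an assumption (vendors report this counter in different units), and the rated TBW figure has to come from the drive's datasheet.

```python
def endurance_used(lbas_written: int, rated_tbw: float, bytes_per_lba: int = 512) -> float:
    """Fraction of the rated write endurance already consumed.

    lbas_written  -- raw value of a host-writes counter (assumed to count LBAs)
    rated_tbw     -- manufacturer's endurance rating in terabytes written (10**12 bytes)
    bytes_per_lba -- assumed unit of the counter; check the vendor's documentation
    """
    bytes_written = lbas_written * bytes_per_lba
    return bytes_written / (rated_tbw * 10**12)


# Made-up example: 58,593,750,000 LBAs of 512 bytes is 30 TB written.
used = endurance_used(58_593_750_000, rated_tbw=150)
print(f"{used:.0%} of rated endurance used")
```

      A number like this only tracks flash wear-out; it says nothing about controller or firmware failures.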

    13. Re:With spinning disks, you do not know either by Comboman · · Score: 3, Informative

      Mod parent up. The most common cause of a sudden, unexplained failure for both HDs and SSDs is a failure of the controller rather than the media.

      --
      Support Right To Repair Legislation.
    14. Re:With spinning disks, you do not know either by jellomizer · · Score: 2

      I find a lot of fear around new technology to be the same as the fear of flying.
      Where numbers all point to a better more robust product, there is just more anxiety for when something goes wrong, mostly because when it does, there is little to do to fix it.

      If an old spinning drive failed, you could sometimes put it in the freezer, power it up, and get the data off, or if you were more technical you could open it up and move the data disks to another drive.

      But for the most part, Standard best practices of keeping backups and/or having the correct RAID on your drives is the best option to keep the data safe. Solid State or mechanical, they can always fail. The solid state could fail from a power surge, or just excessive heat, or just a fault in the build.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    15. Re:With spinning disks, you do not know either by jellomizer · · Score: 4, Funny

      That is why I always stick to reel to reel 9 track paper tape. If you can't see the bits you just can't trust it.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    16. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 5, Insightful

      All for the SAME reason- the wrong type of cell failed, and the crappy software doesn't know how to recover. The software systems of the SSD and the OS driver side are written by idiots.

      A low level tool that knows your particular SSD driver chipset could trivially access the vast majority of flash cells on your SSD drive. But what good is that FACT if the tools are not readily available.

      And SMART warnings do NOT apply to SSD drives. SMART is for electro-mechanical systems with statistical models of gradual failure. SMART is FAKED for SSD.

      A catastrophic SSD failure is when the 'wrong' memory cell dies, and the software locks up. Since all memory cells are equally likely to die at some point, this is a terrible fault of many of these drives.

    17. Re:With spinning disks, you do not know either by I-am-a-Banana · · Score: 3, Interesting

      Seriously, you do not. You may know the end-result sometimes (head-crash), but the root-cause is usually not clear.

      So get over it. It is a new black-box replacing an older black-box.

      Well I need to partially disagree with you there. With a traditional drive when it fails and you take it apart carefully you can try and determine what happened. If it was a head crash you may be able to see what caused the head crash. In my case a Quantum or Maxtor drive that had 3 extra screws shipped in it loose where the inside control circuitry was. You could tell if it was a frozen motor, or if you are lucky find that the external board had a fried electrical component on it. For friends I desoldered the fried component and put a new one on and the drive worked perfectly. Obviously we copied the data off of it onto something new then we put the drive into storage for safe keeping. With the older drives there is the small chance of repair. Yes there are companies out there that will disassemble the drive, remove the platter, and put them into another working drive to recover data. Obviously with a head crash you may not be able to recover all but, in absolute necessity you could. Or you could just be a nerd that wants to do an investigation to find out why. With SSDs however there is no chance of fixing it, and no chance of knowing exactly why. However I don't know why he would say that SSDs shouldn't have manufacturing defects. They do. They are just not mechanical, but I would hope that because they are not mechanical they would hopefully be less likely to be defective.

    18. Re:With spinning disks, you do not know either by Luckyo · · Score: 1, Interesting

      You do actually. Many if not most disk failures have clearly predictable markers. This has been true for quite a long time at this point, to the level where my last two HDD failures in home machine were diagnosable with no tools beyond SMART reader. Better yet, they weren't "instant" failures, but signs of impending failure of the drive started appearing months in advance with clear cut warnings on SMART readout. This resulted in sufficient time to buy a new drive and migrate all the data with no problems.

      With SSDs, failure has a problem with being utterly opaque and sudden. This is likely more of a function of lack of expertise due to lack of time though, as it took us decades to get hard drive monitoring systems to where they are now.

    19. Re:With spinning disks, you do not know either by Sponge+Bath · · Score: 4, Funny

      Tell this person you could only find 7199 RPM SSDs, but if they spin in an office chair while using the system it will make up the difference.

    20. Re: With spinning disks, you do not know either by Type44Q · · Score: 1

      The word you were looking for is "trim."

      Cha-ching.

    21. Re: With spinning disks, you do not know either by omnichad · · Score: 1

      No, it's not. TRIM has nothing to do with wear leveling - and especially monitoring it over time, except that it might happen on a more efficient schedule.

    22. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 1

      Indeed. Although media failure will happen eventually. At a certain time there will be no more spare blocks to replace worn out blocks. A good controller will still offer the drive as a read only disk, so you can copy almost everything.

      SSD control software is incredibly complex, it is a binary of easily 5MB. Of course there are bugs in it.

    23. Re:With spinning disks, you do not know either by R3d+M3rcury · · Score: 2

      Exactly. I've had bad DRAM before which caused the occasional inexplicable crash. I don't see any reason why SSDs would somehow be immune from this.

      That said, most SMART codes are for mechanical hard drives. I wouldn't be surprised to discover that there isn't really a good way to test reliability for SSDs, so the SMART codes always come back as "A-OK!"

    24. Re:With spinning disks, you do not know either by Chewbacon · · Score: 1

      Yep. I've had a number of spinning drives just drop dead on me. Some advice: Western Digital makes returns pretty easy for their drives and, when it comes to all drives, backup regularly/often!

      --
      Chewbacon
      The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
    25. Re:With spinning disks, you do not know either by greenwow · · Score: 2

      I disagree that SMART data helps with diagnosing failures. I save the output of "smartctl -a /dev/?" every night for every drive on every server. I haven't seen anything that predicted the huge number of SSD failures that you have with heavy use. We started using them three years ago when we started buying servers with 2.5" drive bays. I think we've replaced the ~75 drives about 120 times. Yes, more than once. If someone could come up with a way of predicting failures, they would become rich.
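
      For anyone keeping similar nightly dumps, comparing two saved reports might look something like the sketch below. It assumes the ATA-style attribute table that `smartctl -a` prints; the two sample reports are invented, attribute names differ between vendors, and `parse_attributes` and `changed_attributes` are hypothetical helper names. Whether a rising counter actually predicts anything is exactly what is disputed here.

```python
import re

# One row of the ATA attribute table:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(report: str) -> dict:
    """Return {attribute_name: raw_value} for every table row in a saved report."""
    attrs = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(6))
    return attrs


def changed_attributes(old_report: str, new_report: str, watch: list) -> dict:
    """Return {name: (old, new)} for watched attributes whose raw value rose."""
    old, new = parse_attributes(old_report), parse_attributes(new_report)
    return {
        name: (old[name], new[name])
        for name in watch
        if name in old and name in new and new[name] > old[name]
    }


# Invented sample data in the shape of two consecutive nightly dumps.
MONDAY = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4021
"""
TUESDAY = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4045
"""

print(changed_attributes(MONDAY, TUESDAY, watch=["Reallocated_Sector_Ct"]))
```

      Only the attributes named in `watch` are reported, so counters that always climb (power-on hours, for instance) do not drown out the ones worth looking at.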

    26. Re:With spinning disks, you do not know either by ShanghaiBill · · Score: 1

      All for the SAME reason- the wrong type of cell failed, and the crappy software doesn't know how to recover.

      So it is basically bad software? Are there SSD brands with less crappy software than others?

      Is there data on reliability, like there is for HDDs?

      To be fair, I believe this is becoming less of a problem. I saw SSDs fail often in the early days of flash, but not recently.

    27. Re:With spinning disks, you do not know either by Headw1nd · · Score: 1

      The author mentions manufacturing errors as a possible source, but I think his question is an error in what, and if it's an error on silicon, why would it only show up after months of operation? Some people have more curiosity about the things they use, and want more of an explanation than "oh sometimes they just fail."

    28. Re:With spinning disks, you do not know either by 110010001000 · · Score: 2

      That's nice, but that isn't relevant to what I wrote. I commented that it is unnerving that he thinks that SSDs are theoretically immune to manufacturing failures. There are a lot of reasons why an SSD can fail. Soldered joints can fail. There are various bonds that can also fail.

    29. Re:With spinning disks, you do not know either by AmiMoJo · · Score: 1

      With hard drives a sure sign of imminent failure is the sector retry or reallocation count increasing.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    30. Re:With spinning disks, you do not know either by DigiShaman · · Score: 1

      It's usually because of the controller or RAM / Cache errors in processing that corrupts the firmware or dynamic LBA flash block allocation table (database). This renders the rest of the NAND flash partially or totally inaccessible. Quality "prosumer" drives are supposed to have extra hardware (capacitance) to prevent half-writes upon a dirty shutdown (abrupt loss of power). But regardless, any corruption on write-back can render the drive "bricked".

      I'm not sure if SSD cache uses ECC or not, but they should if they don't already. I know that they will throttle when the temps get too high, which should prevent corruption at the expense of performance. So that at least is a good thing.

      If I recall, Intel SSDs in the past (not sure now) are programmed to fail or crippled with read-only after so many writes. It's like an odometer where you reach a certain level of distance and then because it's programmed to do so, fail. As though somehow that's being pro-active? Whatever. I avoid Intel drives for that bullshittery.

      --
      Life is not for the lazy.
    31. Re:With spinning disks, you do not know either by Junta · · Score: 1

      This is why I shake my head when I see someone going to a lot of trouble to track SMART data to 'know' when a disk is going to fail. It just makes it all the more disappointing when a drive fails and all the early warning effort did nothing.

      It is a much more robust approach to be able to not *care* if you don't see the failure coming or not than to try to be able to plan for an outage. SMART has no idea that a component on the controller board is going to burn out suddenly. Yes it can track things with known duty cycles, but with drives nowadays you have probably retired the drive long before that threshold will be reached, and the failure modes likely to smack you in production are ones that SMART will not catch.

      --
      XML is like violence. If it doesn't solve the problem, use more.
    32. Re:With spinning disks, you do not know either by Junta · · Score: 2

      Old stereos from the 1970's are still in service

      Well, old stereos from the 1970s that are still working are still in service. No one talks about the old stereos that died in the 70s because that's boring.
      SSDs are going to be in the same boat. Like all other electronics, some have a ticking time bomb and will probably fail within the first 5 years or so. Those that have the perfect voltage regulation and capacitors and such will last until their NAND wears out and they could also seem long lived (except the capacity is going to be so pathetic that no one is going to want to hold on to those, while a 1970s stereo is still perfectly capable of putting out good sound).

      --
      XML is like violence. If it doesn't solve the problem, use more.
    33. Re: With spinning disks, you do not know either by Bengie · · Score: 1

      Wear leveling is much less effective without TRIM. TRIM reduces write amplification by letting the wear leveling algorithm know when some data is no longer referenced.

    34. Re:With spinning disks, you do not know either by SuperKendall · · Score: 2

      This is why I try to buy more expensive and higher performance SSD drives (like the Samsung EVO line) - but I have to admit I have absolutely zero idea if the chipset on the more expensive drives is really any better at all. It just seems likely the design would be better in some ways or a bit more fault tolerant.

      Even that strategy I know can fail though, a few years back one of the most expensive Sandisk Pro SD cards just died out of the blue. It happened while I was at a photography convention where Sandisk was actually present, including a tech that had a full suite of SD analysis tools with him - and even he could get absolutely nothing from the SD card...

      I still back up regularly, really the only thing you can do in a world where SSD drives may just fail whenever.

      --
      "There is more worth loving than we have strength to love." - Brian Jay Stanley
    35. Re:With spinning disks, you do not know either by Immerman · · Score: 2

      I believe Intel SSDs are programmed to "self brick" when they fail, or at least they used to be. I remember thinking that was a spectacularly stupid way to fail, and the read-only mode would be much preferable. Yes, your computer will likely crash hard in short order either way, but at least with read-only mode you could get (most of) your most recent data off it

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    36. Re: With spinning disks, you do not know either by datavirtue · · Score: 1

      Yeah. Quit your bitching and break out a microscope if you really want to know. No one ever really got to the root cause of mechanical drive failures either. Most of them "died" because the screws that hold them together lost enough torque to allow the body to warp enough to "fail,".... Click, click, click.

      Never forget watching that video where a guy used a cheap torque driver to "repair" dead drives.

      --
      I object to power without constructive purpose. --Spock
    37. Re: With spinning disks, you do not know either by omnichad · · Score: 1

      Yeah, it reduces it. Not by a huge amount. But the point was that there aren't SMART attributes on older SSDs to track wear-leveling. TRIM has nothing to do with that. If you read up-thread you'll see how non-sequitur the GP post is.

    38. Re: With spinning disks, you do not know either by guruevi · · Score: 1

      Perhaps you should invest in the data center SSD or even SLC if you have that many problems. I had the same problems with various brands but the Intel DC solved most of the problems.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    39. Re: With spinning disks, you do not know either by datavirtue · · Score: 2

      Yep. Pure and Nimble already did. They got rich.

      --
      I object to power without constructive purpose. --Spock
    40. Re: With spinning disks, you do not know either by schure · · Score: 1

      Guy grew up playing Minecraft. If he had only played Lego instead...

    41. Re:With spinning disks, you do not know either by Immerman · · Score: 1

      In addition to the survivor bias mentioned by Junta, there's also transistor size to consider. The smaller a transistor is made, the more sensitive it is to any manufacturing imperfections, and the faster electromigration and other forms of normal wear and tear take their toll. Squeeze a billion of them on to a postage stamp, and even the most reliable one won't compare to the reliability of a well made canister transistor from the 70s.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    42. Re:With spinning disks, you do not know either by Aighearach · · Score: 1

      I'm not surprised.

      His problem is philosophical, not technical. Why would a CS guy be good at philosophy? That would only be likely if he was also interested in philosophy, which is unpopular in CS echo chambers.

      Does the act of constructing a narrative tell you what happened? No. Should possession of a narrative be a basis for risk assessment? No.

      When some uncommon but expected event happens, whether or not you felt like you succeeded at constructing a narrative does not tell you anything about the frequency of the risk, and you shouldn't think you have that sort of information. Instead of admitting to feeling "unnerved," he should see this mistake and be embarrassed by it. Not because he's bad at philosophy and felt unnerved, but because he can't comprehend storage failure rates that are well-studied and have hard data available, and blathered about his bad philosophy instead of looking up the numbers and known causes.

      Ultimately he should stop putting value on this "unnerved" feeling. It isn't a real thing; it is a feeling you get when you stubbornly insist on pretending you already understand things that you've received information telling you that you don't understand. It is a type of cognitive dissonance. Dismissing the feeling, instead of assessing it as valuable, is the way to make it go away. Just accept the new information, and understand that feeling unnerved is maladaptive unless you're wandering in a dangerous forest trying not to get eaten by a Cave Bear.

    43. Re:With spinning disks, you do not know either by Aighearach · · Score: 1

      You should get one of those "1000 Electronic Projects" kids sets and learn youse some hardware.

      Sometimes the magic smoke comes out. Sometimes you don't even see the smoke come out. And yet, being encased in plastic so that you can't see the metal doesn't stop the ICs from letting out the magic.

      Fry a few transistors and you'll understand, there is nothing to be unnerved about. The plastic covering that hides the IC is not even the magic!

    44. Re:With spinning disks, you do not know either by gweihir · · Score: 4, Interesting

      Well, I originally bought OCZ. Today _all_ of 5 OCZ drives I got are stone-dead. After that I moved to Samsung, mostly "Pro". They are all still working fine and some are older now than the first OCZ when it died. So yes, it makes a difference. Incidentally, Samsung had excellent reliability in their spinning drives as well. It seems they just care more about quality and reputation.

      That said, I find it sad that you cannot get "high reliability" SSDs where you basically can forget about the risk of them dying. I am talking reliability levels like a typical CPU here. It seems the market for that is just not there.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    45. Re:With spinning disks, you do not know either by Aighearach · · Score: 1

      Nobody pays extra for drives that have built-in data forensics, so nobody wrote the feature.

      It isn't about crappy software, it is about software that only completes the assigned tasks in the most efficient way possible. That means they actually tear out most of the capabilities of the controllers in the process of making ASICs.

      SMART is "fake" in the sense you mean it for spinning drives, too. Duh. There isn't a magic elf from SMARTland inside the drive. It is simply that less of the data is useful.

      It isn't software in the sense that you talk about it, where you have a general purpose computer sitting there idle most of the time and you could easily just have it do some extra work. It is a tightly coupled collection of circuits that only do very narrow, specific things. Increasing the capabilities lowers performance, because that is how tight the timings already are.

      My advice, buy a bag of AVR microcontrollers and write some firmware. Then buy a cheap FPGA and try that. When you can do both, you'll be ready to understand what goes into the "software" on a HDD.

    46. Re:With spinning disks, you do not know either by viperidaenz · · Score: 5, Insightful

      SMART should be able to provide the number of remapped sectors. There should be manufacturer specific counters for the amount of over provisioning that is left for remapping too. That should tell you precisely when you should plan to replace an SSD due to age.
      How hard would it be to notify something that the drive can't handle any more dead cells, so should not be written to any more? Or that it is down to x% of spare nand?
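
      As a rough illustration of the "down to x% of spare" idea: NVMe drives report an available-spare percentage and a spare threshold in their health log, and a crude projection from periodic readings could look like the sketch below. The function name and the sample readings are made up, and the straight-line wear assumption is a simplification; it cannot see a controller that dies suddenly.

```python
from typing import Optional


def days_until_threshold(samples: list, threshold_pct: float) -> Optional[float]:
    """Project how many days remain until spare capacity hits the threshold.

    samples       -- list of (day_number, spare_percent), oldest first
    threshold_pct -- spare percentage at which the drive should be replaced
    Returns 0.0 if the threshold is already reached, None if no decline is
    measurable, otherwise a straight-line estimate in days.
    """
    (first_day, first_spare), (last_day, last_spare) = samples[0], samples[-1]
    if last_spare <= threshold_pct:
        return 0.0
    if last_day == first_day or last_spare >= first_spare:
        return None  # nothing to extrapolate from
    remaining = last_spare - threshold_pct
    # remaining / (decline per day), arranged to avoid an intermediate rate
    return remaining * (last_day - first_day) / (first_spare - last_spare)


# Made-up readings: spare fell from 100% to 90% over 100 days.
print(days_until_threshold([(0, 100), (100, 90)], threshold_pct=10))
```

      A `None` result just means the spare pool has not moved yet, which is the normal state for a young drive.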

    47. Re:With spinning disks, you do not know either by Aighearach · · Score: 2

      Nope. You're not paying for different control ICs, where you actually get something from paying more it would be higher speed or higher yield rates on the memory chips.

      Higher yield rates will translate into lower runtime failure rates.

      You're not going to learn much from the wrong side of the controller, because customers at all levels refuse to pay extra for built-in forensics. And you'd have to choose between extra silicon that normally isn't even used, or extra power use. It won't be free.

      You have to get at the pins of the memory chips and interface them to forensic tool. Usually it is probably simplest to unsolder them and put them on a breakout board. You could typically get most of the data back that way. If partial data is really that meaningful to you.

      Most people don't care; partial data is worthless to them. They either had a backup, or didn't. Probably only cops, criminals, and spies want people's data that bad.

    48. Re: With spinning disks, you do not know either by Aighearach · · Score: 1

      Anyone *teaching* a CS program should be embarrassed about this.

      He should spend a day standing in front of the EE department wearing a wizard hat and a sign, "Computers are Not Magic. I repent!"

    49. Re:With spinning disks, you do not know either by Kjella · · Score: 1

      If you can't see the bits you just can't trust it.

      My dad used to feel the same way about vacuum tubes and magnetic core memory. As long as you could use a scope and inspect the single bits you could always get to the bottom of it. Yes, it was a looong time ago.

      --
      Live today, because you never know what tomorrow brings
    50. Re:With spinning disks, you do not know either by Aighearach · · Score: 1

      Even when a spinning drive made crunching noises, it is usually because the controller IC was hosed!

      It isn't like a three phase low voltage BLDC motor operating at low load is likely to die; dead drives all come out with working motors. The drive may not spin when connected as a drive. But when I buy a box of salvaged HDD motors (by the pound) there are likely to be none that are actually bad. That's true even in a 25lb box, which is a few hundred motors, many of which came from dead drives.

      And the head driver is basically just a voice coil; how often does the voice coil in a speaker go bad? Basically never. All the other hardware around it is likely to fail first. Same here. But if the wrong transistor dies in the controller, then the feedback loops won't keep it from crashing into the end of the throw, or oscillating in a way that makes a crunching sound.

    51. Re: With spinning disks, you do not know either by Zero__Kelvin · · Score: 1

      I would have told them to go online and pick whichever one they wanted.

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
    52. Re:With spinning disks, you do not know either by SuperKendall · · Score: 1

      Yeah, I agree the recovery aspect is not as important, as you say people either have backups or do not... I am just hoping I get some extra longevity in components (the lower runtime failure rates you mentioned).

      The extra performance is nice as well but I am even more interested in something I can be pretty sure will last 3-4 years at least with moderate to heavy use.

      Probably only cops, criminals, and spies want people's data that bad.

      Well, I have run into a number of people over the years that lost pictures or important documents (like a whole book they had written), to them thousands of dollars would have been OK if it could actually get the data back (and these were people that did not have a lot of money).

      I think at least these days people do understand a little better how important a backup is after getting burned, but reliable backups, I feel, are still not an easy thing for most non-technical people to achieve (outside of mobile devices).

      --
      "There is more worth loving than we have strength to love." - Brian Jay Stanley
    53. Re:With spinning disks, you do not know either by prisoner-of-enigma · · Score: 2

      I think you may be missing his point. I've had SSD's die on me as well with absolutely no warning. What's unnerving about it is you have no idea why it failed. Good engineers like failure analysis; it helps determine if you're buying a crappy product, running your product out of spec, or any number of other metrics which can inform future purchases.

      Mechanical drives usually give you some indicator of why they failed in the form of horrible noises. SSD's don't give you much of anything. If neither SMART nor spare block allocation figures are out of spec you have nothing to go on. I've chalked these up to the controller on the drive itself failing but that's just a guess. I have no way to perform any additional diagnostics that might tell me more. As a result, I've simply avoided buying drives of that brand anymore. Crude, yes, but what other metrics can I use? I'm not talking about a single drive. It's happened to multiple drives of a similar make/model, all of which failed suddenly and gave no data afterwards I could use forensically.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    54. Re:With spinning disks, you do not know either by prisoner-of-enigma · · Score: 1

      Should possession of a narrative be a basis for risk assessment? No.

      Yes, it should, although only if the narrative doesn't involve a single case of failure. If you have multiple failures of a single brand or model, you should use that to inform future purchasing decisions. Knowing the cause of the failure could further inform. Random failures are to be expected, but if multiple failures are caused by the same defect occurring in a given product line or due to a specific environment (workload, temperature, etc.), then you have some useful data to make future product selection with.

      The OP is lamenting the paucity of any kind of failure data. Basically he's left with the decision to forego purchasing that model -- or even that entire brand -- hoping it will improve reliability. Hope is not a strategy for engineers. We prefer data.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    55. Re:With spinning disks, you do not know either by geekmux · · Score: 1

      Seriously, you do not. You may know the end-result sometimes (head-crash), but the root-cause is usually not clear.

      So get over it. It is a new black-box replacing an older black-box.

      It's a pain in the ass when any hardware fails, especially prematurely. Never truly knowing why something fails is very frustrating for anyone who actually gives a shit enough to not want to repeat history. You know, like buying the same "reputable" brand/solution/model again.

    56. Re: With spinning disks, you do not know either by prisoner-of-enigma · · Score: 2

      Doesn't help if the controller fails. SLC flash has better write longevity but none of that matters if the controller bombs.

      Further, a sudden, catastrophic failure is (by process of elimination) almost certainly a controller failure. No matter if you're using SLC/MLC/TLC/etc. flash, cells don't die en masse. They usually die a little at a time. The controller expects this and remaps bad blocks to the spare area. Keeping track of spare area usage is one of the best ways to predict impending failure. If the controller fails then all that is for nothing even though (theoretically) all your data is still perfectly preserved on the flash itself.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    57. Re:With spinning disks, you do not know either by Marxist+Hacker+42 · · Score: 1

      One way to improve his coping mechanism, would be to start publishing everything

      Including manufacturer names and his own mean time to failure numbers.

      Bet that will increase quality control real quick. Or at least tell us who not to buy from because they're cheap chinese crap.

      --
      SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
    58. Re:With spinning disks, you do not know either by thegarbz · · Score: 1

      Oh hey nice anecdote. Let me share some with you too. I have had RAM die unexpectedly without warning. I've had motherboards die unexpectedly without warning. I've had video cards die unexpectedly without warning. I've had CPUs go tits up in infant mortality. My last monitor just one day let smoke out for no reason whatsoever. I have a PSU on order right now as the motherboard is throwing warnings about the voltage rails.

      And ... wait for it ... you know it's coming .... I've had HDDs die without warning and sure as hell no SMART warnings losing all data in the process.

      Electronics die. Often at end of life, statistically quite randomly, and even scarier sometimes shortly after being put in service. SSDs aren't unique, amazing or unnerving. SMART is not there to give you early warning of random failures, it's there to give you an attempt to predict wearout / end of life related failures. No parts are immune, and they sure as hell aren't unnerving.

    59. Re:With spinning disks, you do not know either by thegarbz · · Score: 2

      So much wrong in so little post, where to start:

      The software systems of the SSD and the OS driver side are written by idiots.

      Hardly. The software systems of SSD are written by people who know SSDs well. That you bought an OCZ drive is just unlucky. Firmware related failures were only common in the early days of SSDs.

      A low level tool that knows your particular SSD driver chipset could trivially access the vast majority of flash cells on your SSD drive.

      And would know none of what to do with it because wear leveling is not something you can predict and decode later. You can only store it. If the component which stores this knowledge is dead then nothing can save you.

      And SMART warning do NOT apply to SSD drives. SMART is for electro-mechanical systems with statistical models of gradual failure. SMART is FAKED for SSD.

      SMART is a system for drive reporting metrics. Nothing is "faked" for SSDs and SMART sure as hell isn't for mechanical related issues only. There are several SMART values specifically created to report SSD related wearout mechanisms including 171 - flash program fail, 172 - erase fail, 173 - wear level count, 192 - unsafe shutdown, 194 - internal temperature, 226 - media wear, 233 - wearout indicator, 241, 242 - data written and read.

      A catastrophic SSD failure is when the 'wrong' memory cell dies, and the software locks up.

      You're good at writing words without any meaning whatsoever.

    60. Re:With spinning disks, you do not know either by thegarbz · · Score: 1

      Not exactly comparable.

      The issue with SSDs is that there really is only one wearout related failure mode, and that is reading and writing / life left. The problem with SSDs is that randomised failures dominate, which is perfectly expected given that the wear of a typical drive should see it run into the 10 year mark, which is well into the end of expected device life for consumer electronics. The exception to that is overheating, and that along with wearout can give you an indicator in SMART, but SMART does not typically show sudden and random failures.

      The difference from the classical HDD is that, despite the lack of mechanics, there is actually quite a lot of scope for random electrical failure of the components. They run hotter, harder, and are manufactured with cutting edge technology rather than tried and tested technology or technology with obviously accessible failure modes. This makes them far more likely than a typical HDD to just suddenly up and die.

      It also means that a well made drive should also outlive a HDD which is my own personal experience. I've not had anything other than first generation SSDs die on me, and for all the good SMART does in predicting failures, it doesn't do shit in preventing them.

    61. Re:With spinning disks, you do not know either by thegarbz · · Score: 1

      SSDs have a few wearout related metrics. HDDs have many. Both devices can suffer from randomised failures but these cannot be predicted by SMART.

    62. Re: With spinning disks, you do not know either by nigelo · · Score: 1

      +1 for the 70s stereo and a $20 optical-RCA converter - puts any soundbar I've tried to shame.

      --
      *Still* negative function...
    63. Re:With spinning disks, you do not know either by Miamicanes · · Score: 1

      The bigger problem is that "99.99999%" of SSDs encrypt EVERYTHING at the block level, using an encryption key known only to the drive itself. So EVEN IF you can easily rip the bits from the failed drive's flash using a JTAG reader, you'll be reading what's effectively random noise.

      The reason for encrypting the data itself is legit (it makes the bits look pseudorandom & improves wear), but IMHO, the fact that there's literally NO WAY to replace the drive's own encryption key with one known to the drive's owner is absolute, complete BULLSHIT.

      As far as I know, it's a descendant of CPRM DRM. It's technically been a mandatory part of the ATA spec since the early 2000s... the difference is, on non-SSDs, it's disabled by default (the only devices I'm aware of that might actually use it are things like TiVO DVRs and videogame consoles). With a SSD, it's always on & can't be disabled or made to use a key known to YOU. As a result, data-recovery companies can still do recovery on a drive suffering from FILESYSTEM corruption, but they're now completely helpless if the drive ITSELF fails (even if the failure doesn't directly involve the flash memory). And unlike the old days, if the logic board is fried, you can't even solder the chips onto a sacrificial board, because the encryption key is tied to the original logic board.

      Put another way, thanks to mandatory block-level black-box encryption, something that has always been a bad situation (drive failure) has NOW become insurmountably worse, even though the technical challenge of physically reading bits from failed media is arguably easier now than it has ever been in history.

    64. Re:With spinning disks, you do not know either by WhoBeDaPlaya · · Score: 2

      You must have missed how Samsung royally screwed up with the 840 and 840 EVO firmware. Or on the mechanical side of things, look up how they messed up the SpinPoint F4's firmware and tried to hide it ;)
      Not biased against Samsung or anything, as I still have several SpinPoint F3s in service, as well as a bunch of 840 Pros and 850 EVOs.

    65. Re:With spinning disks, you do not know either by dgatwood · · Score: 2

      It's usually because of the controller or RAM / Cache errors in processing that corrupt the firmware or dynamic LBA flash block allocation table (database). This renders the rest of the NAND flash partially or totally inaccessible. Quality "prosumer" drives are supposed to have extra hardware (capacitance) to prevent half-writes upon a dirty shutdown (abrupt loss of power). But regardless, any corruption on write-back can render the drive "bricked".

      And by this, you mean that some really bad SSD manufacturers still haven't learned the concept of log-structured storage. The problem of handling a partial write was solved a couple of decades ago. You roll back the partial transaction to the last checkpoint, then say, "whoops, that write never happened".

      Basically, in addition to a flat mapping table (as a cache), you store a copy of the mapping table (a checkpoint) with modifications in a log format. Each time you power on the drive, it ignores the cached flat mapping table (if it even bothers to persist it to disk), and reads the last checkpoint table, then replays the transaction log after that. When it reaches the last completed transaction in the log, it now has a valid mapping table that is up-to-date to the maximum extent possible. A write operation is considered committed as soon as the transaction is added to the log, and existing used space is not reclaimed until that log write has occurred, ensuring that every write is effectively an atomic operation. Periodically, you write out a new flat table as a checkpoint, and after ensuring that it has been fully written, you then mark the oldest checkpoint and associated log pages as free for reuse.
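
      The checkpoint-plus-log scheme described above can be sketched roughly as follows. This is a toy in-memory model, not any vendor's firmware: the class and method names (MappingStore, write, checkpoint, recover) are made up for illustration, and "persistent" state is just Python lists standing in for flash pages.

```python
import copy


class MappingStore:
    """Toy model: persisted checkpoints plus an append-only log (hypothetical names)."""

    def __init__(self):
        self.checkpoints = [{}]  # persisted flat tables, oldest first
        self.log = []            # committed (lba, page) entries since the last checkpoint
        self.table = {}          # volatile cached table, lost on power failure

    def write(self, lba, page, power_fails=False):
        if power_fails:
            # Power is lost before the log entry lands: nothing is committed,
            # and the volatile cache is gone.
            self.table = None
            return False
        self.log.append((lba, page))  # appending to the log is the commit point
        self.table[lba] = page        # the cache is updated only after the commit
        return True

    def checkpoint(self):
        # Persist a fresh flat table first, then free the older checkpoint and its log.
        self.checkpoints.append(copy.deepcopy(self.table))
        self.checkpoints = self.checkpoints[-1:]
        self.log = []

    def recover(self):
        # Ignore the cached table: load the last checkpoint and replay the log.
        table = copy.deepcopy(self.checkpoints[-1])
        for lba, page in self.log:
            table[lba] = page
        self.table = table
        return table


store = MappingStore()
store.write(0, 100)
store.write(1, 101)
store.checkpoint()
store.write(0, 102)                    # committed after the checkpoint
store.write(2, 103, power_fails=True)  # interrupted: never reaches the log
recovered = store.recover()
print(recovered)
```

      After the simulated power loss, recovery yields the checkpointed mappings plus the one committed update, and the interrupted write simply never happened. A real drive would also need checksums on log entries to detect a torn final record, which this sketch leaves out.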

      We were talking about this back when I was in grad school, around the turn of the century, precisely to prevent those sorts of failures. So IMO, if any SSD manufacturer still isn't doing a transactional/log-based mapping table between blocks and flash pages at this point, their hardware isn't good enough to use for storing system logs for a flush toilet, much less critical data. I mean, this is really *basic* stuff, and has been the norm for at least a decade.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    66. Re:With spinning disks, you do not know either by dgatwood · · Score: 5, Insightful

      I think you may be missing his point. I've had SSD's die on me as well with absolutely no warning. What's unnerving about it is you have no idea why it failed. Good engineers like failure analysis; it helps determine if you're buying a crappy product, running your product out of spec, or any number of other metrics which can inform future purchases.

      Statistically, without even knowing what the particular product was, I can tell you what caused it: RoHS.

      The change from lead-based solder to lead-free solder is one of the major causes of premature electronics failures — probably more common than all other causes put together. Between tin whiskers, cold solder joints, and stress fractures caused by thermal expansion of component packages, the RoHS lead-free solder rule is a clear example of environmentalism gone amok. Instead of improving our environment by reducing the amount of lead going out into the world, it has, IMO, made our environment worse by dramatically increasing the amount of hardware discarded as junk long before it otherwise would have been.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    67. Re:With spinning disks, you do not know either by Aighearach · · Score: 1

      That's just it, the idiots that want their data back that bad can only pay a few thousand. What a waste of time. And you know they'll cry about paying.

      I'd want more than they'd pay just to spend the time trying, plus a lot more if I recovered something.

      What about people who lost their pictures in a fire? They move on, they learn what is important.

    68. Re:With spinning disks, you do not know either by the_B0fh · · Score: 1

      you know this has been discussed on slashdot quite a few times, right?

    69. Re:With spinning disks, you do not know either by Waccoon · · Score: 1

      Remember to spin the right way. Righty tighty, lefty loosy!

    70. Re:With spinning disks, you do not know either by Ramze · · Score: 1

      I always go with Samsung EVO or PRO. Things may be different now, but when I was first in the market for SSDs, Samsung was the only one that designed every part of the device - not a cobbled together mess of components and software from various vendors made into a franken-device that might work ok most of the time. Now, I just buy Samsungs out of habit & the fact I've never had one fail on me. Samsung DID have a huge blunder with one or two specific lines of SSDs, but that was widespread with those specific models, not random deaths on random models.

      I've never had to use the Samsung software included other than a firmware update once, but it has lots of tools for diagnostics and recovery. I can't vouch for how well they work since I haven't had to use them.

      No drive will last forever, but considering I generally put my apps and OS on the C: Samsung and all my media and Windows profile on separate drives, my write/overwrite rate on the SSD is consistent with allowing it to last until sometime after our Sun turns into a red giant.

    71. Re:With spinning disks, you do not know either by gweihir · · Score: 1

      Thanks, that is way beyond the data I have!

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    72. Re:With spinning disks, you do not know either by gweihir · · Score: 1

      I have not missed that. The SSDs still work though, I have at least one 840. My claim was just that they seem to be significantly better than the competition, not that they are perfect.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    73. Re:With spinning disks, you do not know either by jellomizer · · Score: 1

      My experience is of far more drive failures with mechanical drives than with SSDs. The problem isn't necessarily with data retention of the disk, but the mechanical aspect that fails or, worse, crashes and scratches off the data.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    74. Re:With spinning disks, you do not know either by sbjornda · · Score: 1
      That's only true in the northern hemisphere. You hemispherist.

      --
      .nosig

    75. Re:With spinning disks, you do not know either by Agripa · · Score: 1

      That said, I find it sad that you cannot get "high reliability" SSDs where you basically can forget about the risk of them dying. I am talking reliability levels like a typical CPU here. It seems the market for that is just not there.

      They can be found in the enterprise market but they use lower density Flash so cost a lot more per gigabyte.

    76. Re:With spinning disks, you do not know either by Agripa · · Score: 1

      SMART should be able to provide the number of remapped sectors. There should be manufacturer specific counters for the amount of over provisioning that is left for remapping too. That should tell you precisely when you should plan to replace an SSD due to age.
      How hard would it be to notify something that the drive can't handle any more dead cells, so should not be written to any more? Or that it is down to x% of spare nand?

      Crucial does exactly this with their SSDs but it does not save you from spontaneous mysterious death.

    77. Re:With spinning disks, you do not know either by Agripa · · Score: 1

      I heard that outside of "wearing out", the biggest cause of failure by far is the controller, which fails at roughly the same rate as spinning rust. Overall, SSDs fail at roughly the same rate as mech drives, if you ignore the mechanical part of mech drives. The topic rant sounds like someone who would rather have a drive that is over 2x more likely to fail because he better understands that additional 100% failure modes.

      It is not the controller or ICs which fail. By themselves they are reliable.

      The problem is the way NAND Flash memory behaves when programming and perhaps erase operations are interrupted by, for example, loss of power. If a log type of file system is used, then you would expect that any interruption could at most corrupt the data being written, but interruption can cause the state machine controlling the write or erase to damage *other* locations. If those other locations include the Flash translation layer data structures, then for practical purposes the drive is destroyed.

      Multi-level Flash storage has an additional failure mode where interrupting a write to a partially programmed page destroys the existing data stored on that page.

      The solution is to have backup power sufficient to complete any possible write or erase operation. I laughed when SandForce advertised their controllers as not requiring any power backup for safe operation.

    78. Re:With spinning disks, you do not know either by Agripa · · Score: 1

      I believe Intel SSDs are programmed to "self brick" when they fail, or at least they used to be. I remember thinking that was a spectacularly stupid way to fail, and the read-only mode would be much preferable. Yes, your computer will likely crash hard in short order either way, but at least with read-only mode you could get (most of) your most recent data off it

      Intel did or still does this when the SSD endurance is exhausted which struck me as particularly skeezy. Why not force read only mode so that the data may be recovered?

    79. Re: With spinning disks, you do not know either by prisoner-of-enigma · · Score: 1

      Given the typical pathetic performance of most sound bars, this isn't exactly a big hurdle to crow about. With 5.1 setups being so ridiculously cheap, sound bars are only for people too lazy to run speaker cable.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    80. Re:With spinning disks, you do not know either by viperidaenz · · Score: 1

      Nothing saves you from spontaneous mysterious hardware failure. Hence the need for backups.
      Saying SMART is only applicable to electro-mechanical systems is just wrong. Just because an SSD is "solid state" doesn't mean there are no statistical models for failure. The very nature of the storage mechanism means it is guaranteed to fail, it's only a matter of when.

    81. Re:With spinning disks, you do not know either by Agripa · · Score: 1

      I think the big difference is that with a hard drive, failure is usually mechanical and preceded by signs reported in the SMART data like error rate and reallocation count. SSD failures tend to be data structure corruption which has no antecedent to watch for.

      Here is the data available from my Crucial SSDs. It includes remaining operating life. The unexpected power loss events are because it is installed into a Windows 10 test system which keeps crashing with a blue screen and incomplete diagnostic data. Yay Microsoft!

      ID   Attribute                                 Raw value   Units
      1    Raw Read Error Rate                       0           Errors/Page
      5    Reallocated NAND Block Count              0           NAND Blocks
      9    Power On Hours Count                      1176        Hours
      12   Power Cycle Count                         69          Power Cycles
      171  Program Fail Count                        0           NAND Page Program Failures
      172  Erase Fail Count                          0           NAND Block Erase Failures
      173  Block Wear-Leveling Count                 8           Erases
      174  Unexpected Power Loss Count               22          Unexpected Power Loss events
      180  Unused Reserved Block Count               100         Blocks
      183  SATA Interface Downshift                  0           Downshifts
      184  Error Correction Count                    0           Correction Events
      187  Reported Uncorrectable Errors             0           ECC Correction Failures
      194  Enclosure Temperature                     35          Current Temperature (C)
                                                     68          Highest Lifetime Temperature (C)
      196  Reallocation Event Count                  0           Events
      197  Current Pending ECC Count                 0           ECC Counts
      198  SMART Off-line Scan Uncorrectable Errors  0           Errors
      199  Ultra-DMA CRC Error Count                 0           Errors
      202  Percentage Lifetime Remaining             100         % Lifetime Remaining
      206  Write Error Rate                          0           Program Fails/MB
      210  RAIN Successful Recovery Page Count       0           TUs successfully recovered by RAIN
      246  Cumulative Host Sectors Written           1323897591  512 Byte Sectors
      247  Host Program Page Count                   41371799    NAND Page
      248  FTL Program Page Count                    21086208    NAND Page
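
      A couple of derived figures can be read off the raw values in the listing above. The sketch below converts attribute 246 into bytes written by the host and estimates write amplification from attributes 247 and 248. The write-amplification formula is an assumption based on the attribute names (total NAND page programs = host pages + FTL pages); check the vendor's own documentation before relying on it.

```python
# Raw values copied from the SMART listing above.
smart = {
    246: 1323897591,  # Cumulative Host Sectors Written (512-byte sectors)
    247: 41371799,    # Host Program Page Count (NAND pages)
    248: 21086208,    # FTL Program Page Count (NAND pages)
}

SECTOR_BYTES = 512  # sector size stated in the listing's units column


def host_bytes_written(attrs):
    """Total bytes the host has written, from the sector counter."""
    return attrs[246] * SECTOR_BYTES


def write_amplification(attrs):
    """Estimated write amplification.

    Assumption: every NAND page program is counted either as a host
    program (247) or as an FTL housekeeping program (248).
    """
    return (attrs[247] + attrs[248]) / attrs[247]


gb_written = host_bytes_written(smart) / 1e9
waf = write_amplification(smart)
print(f"host data written: {gb_written:.1f} GB")
print(f"estimated write amplification: {waf:.2f}")
```

      For this drive that works out to roughly 678 GB written by the host and a write amplification of about 1.5, which fits with attribute 202 still reporting 100% lifetime remaining.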

  2. Shit by maxbuzz · · Score: 1

    Happens

  3. Department of Computer Science by 110010001000 · · Score: 1

    Hey Chris from Department of Computer Science has a problem. Let's hear about it, Chris.

    1. Re:Department of Computer Science by 93+Escort+Wagon · · Score: 1

      Since they’ve now edited the summary (hooray for editing), I’ll note for the edification of future readers: The original quotes in the summary were attributed to “Chris from Department of Computer Science”.

      --
      #DeleteChrome
    2. Re:Department of Computer Science by mermeid007 · · Score: 1

      Hi Chris, I lost my keyboard. Is it behind your desk? Let me look. Yes! It is! Thank you! Yeah, sometimes they fall if someone bumps into them with their elbow or something. Next caller.

  4. Re:I can relate by 110010001000 · · Score: 1

    Uh, if a disk dies in 2 months you need to get a replacement, not a repair.

  5. This is why you have RAID and backups by froggyjojodaddy · · Score: 3, Informative

    *shrug* ?

    I mean, manufacturing defects, environment, and just old plain bad luck? SSDs have come a long way, but if I have anything of importance, I'm RAID'ing it and backing up. I feel anyone with an understanding of technology knows the importance of this.

  6. Re:Heading should be by 110010001000 · · Score: 3, Funny

    Waterboarding?

  7. Controller failure by macraig · · Score: 5, Insightful

    I've had two SSDs die utterly. It wasn't because there was a failure of any part of the actual storage pathways: it was irreparable failure of the embedded controller circuits. The Flash itself was still fine and safely storing all my data, but there was no means to access it. At least with a platter drive if the PCB fails, you can unscrew and detach it and replace it with a matching PCB from another drive; no way to do that with an SSD. Early on when manufacturers were spending all their time hyping the comparative robustness of the Flash medium, they conveniently forgot to mention how fragile and not-so-robust the embedded third-party controller circuits could be.

    1. Re:Controller failure by bobbied · · Score: 5, Informative

      Wow, that PCB substitution trick became very hit/miss a long time ago.

      Nowadays, there is a whole bunch of operational parameters which need to be set properly to get data on/off a drive. I understand that some of these "configuration" items are now stored in non-volatile memory on that PCB and set during the manufacturing process. Similar serial numbers may help, but it's still very hit or miss.

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  8. It's not that scary... by FrankSchwab · · Score: 4, Informative

    Infant failures are common in electronics ( https://www.weibull.com/hotwir... ). From a simple standpoint, imagine a poorly soldered junction on the PCB - soldered well enough to pass QC and work initially, but after a couple of heating cycles the solder joint fractures. The same kinds of problems occur inside chips - wire bonds between the package and die may be defective but initially conductive, and fracture due to thermal cycling.
    Similar problems can occur on the die. The gate oxide for a particular transistor might be too thin due to process issues. If it's way too thin, it'll fail immediately and the die will get sorted out at test. If it's just a bit thicker, it might pass all production tests but fail after an hour or two of operation, or 100 power cycles. If it's just a bit thicker (where it should be), it might last for 20 years and a million power cycles.
    Everyone in the semiconductor industry would love to figure out how to eliminate these early failures. No one has found a way to do it.
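
    The usual way to model this is the Weibull hazard rate, h(t) = (k/s) * (t/s)^(k-1), where a shape parameter k below 1 gives a failure rate that falls over time (infant mortality), k = 1 gives a constant rate (random failures), and k above 1 gives a rising rate (wear-out). A minimal sketch, with shape and scale values picked purely for illustration and not taken from any real drive data:

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hypothetical parameters, chosen only to show the three regimes.
SCALE_HOURS = 1000.0
regimes = {
    "infant mortality (shape 0.5)": 0.5,
    "random failures (shape 1.0)": 1.0,
    "wear-out (shape 3.0)": 3.0,
}

for label, shape in regimes.items():
    rates = [weibull_hazard(t, shape, SCALE_HOURS) for t in (1.0, 100.0, 10000.0)]
    print(label, ["%.6f" % r for r in rates])
```

    With a shape below 1 the hazard is highest in the first hours and drops steadily, which is the statistical version of "if it survives the first few power cycles it will probably run for years".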

    --
    And the worms ate into his brain.
    1. Re:It's not that scary... by bobbied · · Score: 1

      Which is why "burn in" operation, where you run the item through some thermal cycles, is often done. We are trying to find the stuff that's going to initially fail.

      I usually do 24 hour burn in of all hardware I build, 12 hours on, then 2 hour cycles on off. Or, (sarc on) just load windows and run all the updates. (sarc off) It's almost the same thing anyway.. :)

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
    2. Re:It's not that scary... by Andtalath · · Score: 1

      The more you use an SSD, the faster it goes bad.
      So it's not an ideal thing to do.

    3. Re:It's not that scary... by MrLogic17 · · Score: 1

      Has your burn in ever found something that worked fine at first power on, and was dead after 24hrs?

      The idea seems good, but I'm skeptical. I'd think that anything leaving a factory after their testing wouldn't benefit from anything more than a smoke test.

    4. Re:It's not that scary... by bobbied · · Score: 1

      Has your burn in ever found something that worked fine at first power on, and was dead after 24hrs?

      The idea seems good, but I'm skeptical. I'd think that anything leaving a factory after their testing wouldn't benefit from anything more than a smoke test.

      I've found some things, but rarely any of the major components actually suffered from infant mortality on my watch. However, I've done this professionally a bit too, where we needed to verify MilSpec operation. In these tests, you verify both the operating and storage temperature ranges to certify a product. We had environmental chambers that could heat, cool and shake systems both running and not. Even under those grueling conditions the failure rates weren't that high, though they were higher than you'd expect for less extreme temperature and vibration ranges.

      I personally consider it good practice to burn in stuff for a number of reasons. Infant mortality is but one. I also know that electrolytic capacitors like to drift up in value as they are powered on and after sitting idle may degrade over long periods. So the burn in is actually conditioning them over the few hours they are powered on, extending their lives a bit. It's not so much a thing anymore, but for large value filter capacitors or those under higher voltages (such as in vacuum tube power supplies) it can show significant differences in operations. These days though, the time from manufacture to my integration is pretty low so degradation of electrolytic capacitors may not be a huge issue anymore.

      These days, I don't know if burn in matters all that much, but I do it. It makes me feel better if nothing else.

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
    5. Re:It's not that scary... by bobbied · · Score: 1

      The more you use an SSD, the faster it goes bad. So it's not an ideal thing to do.

      There's power on and read/write cycles. Usually it's write that "uses up" a SSD, not power on time or read cycles.

      However, given the number of write cycles is huge per cell, unless you are putting an SSD into a high data rate service situation, using it up is hardly a problem as the rest of the system will go defunct before the SSD runs out of write cycles. Also 12 hours is hardly enough time to appreciably dent an SSD's number of cycles, when their expected life span is a decade or more.
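
      As a back-of-the-envelope check, the fraction of a drive's write endurance used up by a burn-in can be estimated as bytes written divided by (capacity x rated program/erase cycles / write amplification). Every number below is a made-up assumption for illustration (500 GB drive, 1000 P/E cycles, write amplification of 2, and a deliberately pessimistic burn-in that writes 100 MB/s for the whole 12 hours):

```python
def endurance_fraction_used(bytes_written, capacity_bytes, pe_cycles, write_amplification=1.0):
    """Fraction of total write endurance consumed by writing bytes_written."""
    total_endurance_bytes = capacity_bytes * pe_cycles / write_amplification
    return bytes_written / total_endurance_bytes


# Hypothetical drive and workload, not measurements from any real device.
CAPACITY_BYTES = 500e9
PE_CYCLES = 1000
WRITE_AMPLIFICATION = 2.0
burn_in_bytes = 100e6 * 12 * 3600  # 100 MB/s sustained for 12 hours

fraction = endurance_fraction_used(
    burn_in_bytes, CAPACITY_BYTES, PE_CYCLES, WRITE_AMPLIFICATION
)
print(f"endurance used by burn-in: {fraction:.2%}")
```

      Even with the drive written flat out for the entire burn-in, these assumed figures put the cost at under 2% of rated endurance, and a power-on-only burn-in writes next to nothing.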

      BUT... If you are worried about it, you don't have to write to the drive all that time. I'm really only a "power on" burn-in guy. I'm not a "hit the hardware with a performance benchmark" burn-in guy. For the most part, I just want to thermal cycle stuff, so I may do a performance run or two, but only to drive heat and cold cycles. I don't think it's a problem...

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  9. Why is it so hard to understand? by Anonymous Coward · · Score: 1

    The spinning parts of an hdd are not the only parts that can go bad. Just as the NAND flash chips are not the only parts of an ssd that can go bad. There are other components: controllers for the computer interface and the NAND chips, and the power to everything. One bad electronic component can take down either. One dead capacitor can stop a whole motherboard from running.

    1. Re:Why is it so hard to understand? by Immerman · · Score: 1

      True. However, in 30-odd years of computing I've had several hard drives fail for mechanical reasons - almost always spreading surface failure, and also a couple of head crashes. And only one drive that suffered a sudden catastrophic failure that might have been a controller failure.

      Anecdotal evidence to be sure, but in my experience mechanical failures are far more likely than controller failures on HDDs. From what I can tell, SSDs are the opposite, probably due in large part to the much more complex (and less mature) software they run.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
  10. Sudden stop vs small warnings by atrex · · Score: 1

    In my experience with HDDs you'll usually get some warning that your drive has issues before it completely calls it quits. Whether it's bad sectors turning up or noises from the drive itself. If you pay attention to that (and you're a little lucky), you can manage to salvage most of the drive's contents before it dies completely.

    With an SSD one minute it's working completely fine and the next it's completely gone. While most of the data itself is probably still perfectly intact on the flash memory, getting at it is completely impossible (afaik) without going to a professional recovery service.

    1. Re:Sudden stop vs small warnings by MightyYar · · Score: 1

      I agree, but this has no practical benefit to me. When the HDD starts to throw errors, I pull it out of the RAID and stick in a new one. If the SSD completely up and dies, I pull it out of the RAID and stick in a new one. If more drives die or start to throw errors than there is redundancy, I restore from backup. If I can't restore from backup, well, then maybe I'd appreciate the slowly-dying hard drive :)

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    2. Re:Sudden stop vs small warnings by fahrbot-bot · · Score: 1

      In my experience with HDDs you'll usually get some warning that your drive has issues before it completely calls it quits. Whether it's bad sectors turning up or noises from the drive itself. If you pay attention to that (and you're a little lucky), you can manage to salvage most of the drive's contents before it dies completely.

      In 2009, I had a 10-year-old 5 GB (yes, 5) enterprise SCSI disk (at home, not work) that failed to spin up after being off for over a year (before that it had been running almost continuously). After removing the PC case, I powered it up and tapped it (pretty hard) on the side with a screwdriver handle while it was "clicking". It slooowly spun up and worked fine. It had some bearing noise, but that went away after the drive warmed up. I pulled the data off and ran the drive for a couple of days w/o incident. Fun times...

      --
      It must have been something you assimilated. . . .
    3. Re:Sudden stop vs small warnings by Gilgaron · · Score: 1

      With an HDD I can envision how they can pull the platter and do forensics on it. Do you know how they take a peek in an SSD's memory at a professional service? It didn't occur to me until just now that I had no idea how they'd do it.

    4. Re:Sudden stop vs small warnings by Tablizer · · Score: 1

      do you know how they take a peek in an SSD's memory at a professional service? It didn't occur to me until just now that I had no idea how they'd do it.

      They call their buddy near the Red Square to restore the data from copies.

    5. Re:Sudden stop vs small warnings by saider · · Score: 1

      Connect to the controller board on the address and data lines for the flash chips, and manipulate them to access the chips. Then you would need to have a program that understands how this controller manages things and can reconstruct the sectors that it presents to the outside world.
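
      A toy sketch of that reconstruction step, assuming (hypothetically) that each raw flash page carries its logical block address and a write sequence number in the spare area. Real controllers use proprietary layouts, plus scrambling, ECC, and often encryption, none of which is modelled here; `RawPage` and `rebuild_logical_image` are made-up names.

```python
from typing import NamedTuple, Optional


class RawPage(NamedTuple):
    """One page as dumped from a flash chip (hypothetical layout)."""
    lba: Optional[int]   # logical block address from the spare area, None if erased
    sequence: int        # write counter; higher means newer
    data: bytes


def rebuild_logical_image(pages, num_blocks, block_size):
    """Return the logical disk image, keeping the newest copy of each block."""
    newest = {}
    for page in pages:
        if page.lba is None or not 0 <= page.lba < num_blocks:
            continue  # erased or implausible page, skip it
        current = newest.get(page.lba)
        if current is None or page.sequence > current.sequence:
            newest[page.lba] = page
    image = bytearray(num_blocks * block_size)  # unmapped blocks stay zero-filled
    for lba, page in newest.items():
        start = lba * block_size
        image[start:start + block_size] = page.data[:block_size].ljust(block_size, b"\0")
    return bytes(image)


# A tiny fake dump: block 1 was written twice, one page is erased, block 2 never written.
dump = [
    RawPage(lba=1, sequence=1, data=b"old!"),
    RawPage(lba=0, sequence=2, data=b"boot"),
    RawPage(lba=None, sequence=0, data=b"\xff\xff\xff\xff"),
    RawPage(lba=1, sequence=3, data=b"new!"),
]
image = rebuild_logical_image(dump, num_blocks=3, block_size=4)
print(image)
```

      The point is only that the chips hold stale and current copies in no particular order, so the recovery program has to know the controller's bookkeeping to pick the right ones.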

      --


      Remember, You are unique...just like everyone else.
    6. Re:Sudden stop vs small warnings by Mark+of+the+North · · Score: 1

      I learned a similar trick from one of my techs while working as the lead technology guy at a school authority.

      Our board chair, literally the highest ranking member of the organization, brought in his personal laptop and explained that it no longer booted. We plugged it in and hit the power button; it wouldn't boot off the hard disk. I started to explain that there was nothing we could do when my tech interrupted me. He removed the hard drive from the laptop, said "Watch this!" and without any hesitation, smartly whacked it against the desk. While my blood began to boil, he quickly placed the hard drive back in the laptop and powered it up. It booted. I was very nearly floored.

      At the time, I looked up the mechanism for why this worked, but have since forgotten. In any case, worth a try when you've tried everything else.

    7. Re:Sudden stop vs small warnings by pnutjam · · Score: 1

      I'll bet the ssd rebuilds faster too.

    8. Re:Sudden stop vs small warnings by MightyYar · · Score: 1

      Just a little! LOL

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
  11. It's the binary nature of it.. literally by Mysticalfruit · · Score: 1

    With a spinning disk, you'll usually get an indication of a problem with a plethora of S.M.A.R.T errors.

    It's been my experience that when an SSD dies... you just suddenly appear to have an empty drive cage. It's a really ugly binary failure.

    I've taken to building my boxes with mirrored SSD's combined with taking and validating my backups.

    --
    Yes Francis, the world has gone crazy.
    1. Re:It's the binary nature of it.. literally by azcoyote · · Score: 2

      I can see what you mean, but I think I won't really understand it until it happens to me (and I hope it never happens to me). I'm on my third SSD and none has ever failed; my previous one was showing some age and was SATA so I upgraded to M.2 NVMe on Cyber Monday. Perhaps they haven't failed on me because I keep most of my data on a HDD RAID array and use the SSDs only for OS, program files, and very limited caching.

      --
      Incipiamus, fratres, servire Domino Deo, quia hucusque vix vel parum in nullo profecimus.
    2. Re:It's the binary nature of it.. literally by Bengie · · Score: 1

      Spinning disks have about as many sudden deaths as SSDs; the only difference is that spinners have an additional set of failures that give warning. In other words, if you had a hard drive that never died from mechanical issues, its failure rate would be very similar to an SSD's.

    3. Re:It's the binary nature of it.. literally by Wolfrider · · Score: 1

      > I've taken to building my boxes with mirrored SSD's

      --This may not actually help, especially if both SSDs are the same brand and model - because they will be experiencing the EXACT SAME load and wear patterns. They will likely both fail at the same time.

      --Try putting in the mirror drive about a week after the initial drive, that should give you some leeway.

      --
      .
      == WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
  12. Re:Learn about the subject by Anonymous Coward · · Score: 2, Informative

    Electronics wear out slowly. In fact most will long exceed their usefulness before they die.
    More often electronics will die early due to manufacturing defects. It's why if your device lasts the first month it will probably keep working until you upgrade it. SSDs are a different beast though, which is why they have excess capacity to handle wear leveling. Still, a young drive that dies is usually, again, a sign of a manufacturing defect.

  13. Low Bidders by bill_mcgonigle · · Score: 2

    It's bad firmware. Some of the drives can supposedly be resuscitated by the factory or people who have reversed the private ATA commands.

    I mean, at a minimum, unless it's a PHY failure (and there's no reason to suspect those), the firmware could at least report missing storage (I've actually seen a 0MB drive failure once or twice), but their usual failure mode is to halt and catch fire, as the author notes.

    With the recent reports about the inexcusable security problems on Samsung and Crucial drives this is starting to feel like the old BIOS problems with Taiwanese mobo companies outsourcing to the lowest bidder and shipping bug-laden BIOS with reckless abandon. It's OK, all the world's servers only depend on this technology.

    To be fair, I have a batch of 20GB Intel SLC SSDs that have never done this, but those are notable exceptions. At this point only low-end laptops like Chromebooks don't get at least a mirror drive here.

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  14. Why does it matter? by CaptainDork · · Score: 4, Informative

    I'm a retired IT guy and there's no kind of something that didn't fucking break. I'm not a goddam engineer. My job was to locate the problem at a black-box level and get the shit running again. Contemplating the "why" of a hardware failure is wheel-spinning instead of pulling the stuff out of the ditch.

    For new purchases under warranty, I exchanged them and sent the dead one back to the vendor. Let them hook it up and do diagnostics over a cup of coffee.

    I had work to do.

    --
    It little behooves the best of us to comment on the rest of us.
    1. Re:Why does it matter? by dcw3 · · Score: 1

      Then you also know that if you've been seeing an unusual trend in some items breaking, it's probably cost effective for you to look for a root cause, and fix the problem, or find a suitable substitute to break the cycle. This is why we keep metrics on outages. It's not so much your job as the "IT guy", but whoever is managing the program/IT should be interested because it's costing them money.

      --
      Just another day in Paradise
    2. Re:Why does it matter? by DigiShaman · · Score: 1

      More to the point, understanding the "why" is more important than the "how". Why did it fail?? Specifically, was this something that could have been prevented at the IT side of things? If yes, time to change procedures. If not, then off to the vendor it goes, and research alternatives that are less error-prone.

      --
      Life is not for the lazy.
    3. Re:Why does it matter? by CaptainDork · · Score: 1

      I agree.

      About the only time I've gone there was funky voltage or a network wiring problem. Those are typically the last things I would suspect, and it drove me crazy.

      --
      It little behooves the best of us to comment on the rest of us.
    4. Re:Why does it matter? by CaptainDork · · Score: 1

      I didn't keep paperwork. I'm not a goddam analyst. If someone wanted to do that, fine. Just don't bother to tell me.

      Another non-car analogy (and off topic, I suppose).

      At Mobil Oil, our fractional T1 that connected Beaumont, Dallas, and Reston, Va. went down. I had people on it and we were balls to the walls trying to identify a broken box or maybe a problem with the telco.

      Management called me into the large conference room and there were a lot of pissed off suits in there.

      "Why is connectivity down?"

      "Dunno."

      "When will it be up?"

      "Dunno."

      "What are you doing to fix the problem?"

      "Nothing."

      "NOTHING?"

      "I'm in here talking to you guys."

      "Well, then when will it be back up?"

      "Sometime after this meeting is over."

      --
      It little behooves the best of us to comment on the rest of us.
    5. Re:Why does it matter? by prisoner-of-enigma · · Score: 1

      I didn't keep paperwork. I'm not a goddam analyst. If someone wanted to do that, fine. Just don't bother to tell me.

      Good God, do you realize what a walking, talking epitome you are of the worst aspects of someone in IT? Condescend much? I've been in this industry for almost three decades. The absolute worst IT people in existence are the ones who treat humans exactly as you are treating them. It doesn't matter one whit how much of a technical genius you might be if you can't understand you're not just working on machines for the sake of the machines. You're working on tools that people depend upon to do their jobs. Your dismissive attitude towards the human factors of this job is inexcusable. It creates a hostile environment between users and IT that doesn't need to exist. It's a damn good thing you don't work in my IT shop. I'd have fired you for something like this no matter how "good" you were with the gear.

      I guarantee that even after you fixed whatever was wrong with the WAN, the people you interacted with said "what a fucking asshole, I hope we never have to deal with him again" instead of "man, that guy did a fantastic job and I'm going to tell his boss how happy we are with his work!" But I'm guessing this is probably something you could care less about anyway. Good luck with your career. You'll need it.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    6. Re:Why does it matter? by thegarbz · · Score: 1

      I'm not a goddam engineer.

      I am an engineer, one that specialises in reliability analysis. Maybe if the author of TFA was too he'd understand how utterly stupid his comments are.

      Random failures happen; they happen on SSDs, and they happen on HDDs (and motherboards, monitors, VGA cards, CPUs, RAM, PSUs, etc.). If the writer is in any way "unnerved" then he should be looking at his own backup strategy and take a chill pill.

    7. Re:Why does it matter? by CaptainDork · · Score: 1

      Nah. I was the IT guy. When I hired on, I outsourced analytics back up to management. They picked some poor soul to do a spreadsheet and make slide decks.

      Then management sent the guy to me.

      I said, "No time. Not now, not ever. Meet with management with that stuff. I got work to do."

      I don't know if it turned out well or not.

      --
      It little behooves the best of us to comment on the rest of us.
    8. Re:Why does it matter? by CaptainDork · · Score: 1

      I agree. I embraced RAID of all levels throughout my career. Mostly, the hardness depended on risk/cost assessment by managers who were clueless. I always asked for the best.

      Despite that, I've had servers go sideways and there wasn't a goddam thing that was going to stop it.

      Failed backup was my worst nightmare. I've pulled all-nighters making sure I had good backups.

      --
      It little behooves the best of us to comment on the rest of us.
  15. Re:Department of Computer Science --- are you sure by bobbied · · Score: 2

    Doesn't know how SSD's work.

    No offense to CS majors, but this EE major tends to understand "How a computer works" at a lower level than most of you programmer types. While not universally true, in my experience a Computer Science major generally gets outside their comfort zone with hardware once you get past "Plug it in and turn it on." I don't blame them, there is a lot of stuff happening at lower levels than a CS major needs to know to do their job.

    That some CS major is concerned about how SSD's fail because he doesn't understand their failure modes is fine. We tend to fear what we don't understand and let's face it, there is a LOT of stuff going on inside a computer that high level users simply don't need to know. Heck, even I don't need to know some of that stuff and I've designed computing systems in the past. Fear not, if it works, it works, if it doesn't you just replace it anyway.

    --
    "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  16. Both are black-ish boxes by wbr1 · · Score: 1

    Yes, you can listen for mechanical issues, yes you can (sometimes) read bad block and other SMART data. But, ultimately, without millions in equipment and skills, you just do not know. It is a cheap data storage brick. Choose one appropriate for your capacity and I/O needs, have a good backup plan in place, and quit whining.

    --
    Silence is a state of mime.
  17. Shit happens.. by Rick+Schumann · · Score: 1

    ..and the more complex a machine is, the more that can go wrong with it.
    The controller PCB on a brand-new modern HDD can fail, rendering the entire device useless; any piece of silicon on a modern SSD can fail also, rendering the entire device useless. The only difference here is that with an HDD, if you happen to have another working drive of the exact same model and revision level, you could theoretically swap the controller PCB and be able to access the data on the platters again (I've done this). With an SSD it's all one PCB, and short of actually diagnosing the failure and replacing the failed component(s), you have a snowball's chance in hell of accessing the contents of the flash memory.

    There's no point in worrying about it, though. Back up your important data and forget about it. If the system in question is mission-critical and up-time is essential, then use two SSDs in a mirror set, and don't worry about it. If someone is going to get their head lopped off if there's any chance of the system in question failing due to SSD failure, then mirror your mirror-set to another mirror-set (i.e. use 4 SSDs) and back the whole mess up to an off-site location regularly. Sitting around biting your nails down to the quick isn't going to help anything.
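
    To put rough numbers on the mirror and mirror-of-mirrors advice, here is a small sketch. It assumes drive failures are independent (identical drives from one batch under identical load may well not be) and uses a made-up 2% annual failure rate; it also ignores rebuild windows, so treat the results as optimistic.

```python
def loss_probability(annual_failure_rate, copies):
    """Chance that every copy fails within the same year, assuming independence.

    This ignores rebuild windows and correlated failures, so it is optimistic.
    """
    return annual_failure_rate ** copies


AFR = 0.02  # hypothetical annual failure rate per SSD

for label, copies in [("single SSD", 1), ("2-way mirror", 2), ("mirror of mirrors", 4)]:
    print(f"{label}: {loss_probability(AFR, copies):.2e}")
```

    Each extra copy multiplies the odds of losing everything by the per-drive failure rate, which is why the off-site backup still matters: it covers the correlated failures this model leaves out.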

  18. Mod Parent Up by mykepredko · · Score: 1

    Maybe it helps the author to develop a narrative, but the long and short of it is, the author's non-volatile storage unit died, he needs to replace it to get the system back and he can send it back to where he bought it from because it died under warranty. Or, he might want to have it destroyed locally if it contains proprietary information.

    If you're in IT, I'm sure you'll see everything eventually break (including things like cases which don't make any sense at all) so why sweat it?

    1. Re:Mod Parent Up by CaptainDork · · Score: 1

      Victim blame much?

      --
      It little behooves the best of us to comment on the rest of us.
    2. Re:Mod Parent Up by CaptainDork · · Score: 1

      I'll be glad to. Thanks for the opportunity.

      ... because you tossed away a lot of hardware, and later you realized that a lot of it could have been prevented if only you had some minimal amount of curiosity.

      I wasn't a goddam hardware guy. I was a productivity guy. I'll give you an analogy but it isn't a car one, OK?

      My boss asked me one time if I could hack (ca. 1990). I said, not very well. He was surprised because he thought I could do everything.

      He asked me why I wasn't any good at it and I told him, "Look: I got just so many hours in the day. I live and breathe computer shit and I spend all my time studying and experimenting with the crap that supports your business. You're a law firm. You need to shuffle documents and you have no use for a propeller head using your equipment, on your dime, learning stuff that's not relative to the income stream."

      --
      It little behooves the best of us to comment on the rest of us.
    3. Re:Mod Parent Up by Immerman · · Score: 1

      How is it victim blaming - other than being able to tell if the victim actually *is* to blame? Unlike HDDs, SSDs wear out with use, and nobody on the planet sells an "unlimited use SSD".

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    4. Re:Mod Parent Up by CaptainDork · · Score: 1

      Whose fault is that?

      --
      It little behooves the best of us to comment on the rest of us.
    5. Re:Mod Parent Up by prisoner-of-enigma · · Score: 1

      "Look: I got just so many hours in the day. I live and breathe computer shit and I spend all my time studying and experimenting with the crap that supports your business. You're a law firm. You need to shuffle documents and you have no use for a propeller head using your equipment, on your dime, learning stuff that's not relative to the income stream."

      You must be all kinds of fun at parties, but I digress.

      For a small firm I agree it's usually pointless to do a failure analysis. However, if you're dealing with a larger company, failure analysis is crucial. Otherwise you could be replacing a failed unit with one that's just as failure prone because you don't know why it failed. Warranties are great but they don't replace lost data and/or downtime due to device failure. RAID and backups aren't magically 100% effective. I think what the OP is lamenting is there isn't even the slightest possibility of doing any analysis even if you have the time to do it.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    6. Re:Mod Parent Up by prisoner-of-enigma · · Score: 1

      Nobody sells one because such a device cannot be built. Entropy always wins in the end.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    7. Re:Mod Parent Up by Immerman · · Score: 1

      As they said - it can't be built, the technology just doesn't work that way. Flash memory cells wear out with usage. And the write-cycle limitation is generally displayed prominently on the packaging and marketing literature, as there is a large amount of variation depending on the exact technology, scale, and storage strategy used.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    8. Re:Mod Parent Up by CaptainDork · · Score: 1

      We discovered the same thing about flash drives long ago, remember?

      --
      It little behooves the best of us to comment on the rest of us.
    9. Re:Mod Parent Up by CaptainDork · · Score: 1

      Absolutely correct.

      We can deal with entropy in small blocks by replacing parts. SSD doesn't have the granularity.

      Good point.

      --
      It little behooves the best of us to comment on the rest of us.
    10. Re:Mod Parent Up by CaptainDork · · Score: 1

      I agree about small site vs big site mentality and architecture.

      As we approach enterprise level, like a Mobil Oil, we have to specialize. There's just no other way.

      I didn't give a flying fuck why something failed.

      I had lots of people (on 1350 desktops in one of the places) who really didn't want to know why the shit failed. They wanted to be up yesterday and I was always a user advocate.

      You manage your site however you see fit. I won't question your methods at your house, OK?

      --
      It little behooves the best of us to comment on the rest of us.
    11. Re:Mod Parent Up by Immerman · · Score: 1

      Right. And nothing has fundamentally changed, except that flash drive technology got fast and reliable enough that now we stick them inside our computers and call them SSDs.

      So where does the victim blaming come in?

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    12. Re:Mod Parent Up by CaptainDork · · Score: 1

      It comes from AC, above:

      There's a lot of value in something that will work 5 years past warranty, as opposed to working 1 year past warranty. Suppose that a bunch of SSDs failed 1 year past warranty and you know that it was due to the huge amount of small writes of your application, then armed with that knowledge you could potentially change just 1 parameter in the application and save your company millions of dollars in SSD replacements.

      I don't know about you, but I don't buy fragile hardware and program it using DaintyCode.

      --
      It little behooves the best of us to comment on the rest of us.
    13. Re:Mod Parent Up by Immerman · · Score: 1

      Oh, I agree. But in that case you would be the victim of your own foolishness - it's not the SSD manufacturer's fault that you didn't consider the impact of your program on the well-stated limitations of their hardware. And nobody was blaming the actual victim - the company - for the programmer's stupidity.

      As a non-coding example - if I build a shoddy set of stairs that collapse on me so I break my leg, telling me it's my own damned fault isn't really victim blaming, is it? It *is* absolutely and wholly my own fault. Not in any way like the quintessential blaming of a rape victim for the actions of their rapist.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    14. Re:Mod Parent Up by CaptainDork · · Score: 1

      ... if I build a shoddy set of stairs that collapse on me so I break my leg ...

      I understand your point, but mine is that I don't build SSDs.

      If the stairs I built don't work, that's on me. If the stairs you built for me don't work, I'm the victim and you wouldn't blame me, right?

      I think we're on the same page.

      --
      It little behooves the best of us to comment on the rest of us.
    15. Re:Mod Parent Up by Immerman · · Score: 1

      That depends. If I built you a standard set of stairs, and they collapsed when you tried to send a herd of elephants up them, I absolutely *would* blame you. You are the one using them in ways they were never designed for.

      Similarly, if you rapidly kill an SSD by abusing it with an incredibly write-intensive workload, when its limitations are clearly labeled, I would also blame you.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    16. Re:Mod Parent Up by CaptainDork · · Score: 1

      OK, enough. You jumped off the cliff. I'm staying here.

      I will say, I admire your commitment to your sig. :)

      Thanks.

      --
      It little behooves the best of us to comment on the rest of us.
    17. Re:Mod Parent Up by Immerman · · Score: 1

      I don't see it. X is designed for Y. If you use it for Z, when labeling and/or common sense clearly indicate it's not suited for such a use... that's your problem.

      Hehe. I decided on my sig to warn people where I was coming from. I usually sincerely mean what I'm saying, but have been known to change sides mid-argument if I start getting poorly-reasoned "support".

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
  19. Re:Heading should be by bobbied · · Score: 1

    Waterboarding?

    Well... Funny, but water mixed with electronics tends to produce situations where little communication takes place....

    --
    "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  20. Forward error correction by Strider- · · Score: 1

    Despite what others have said, this comes down to the brick wall nature of error correction codes. Every time you erase and rewrite a flash cell, you add wear to the transistors that make up the memory cell. Eventually (and probably immediately too) some of the bits won't read correctly. To compensate for this, the controller runs a mathematical function on your data, allowing it to recover from a certain percentage of bad bits. This is good, as that combined with wear leveling allows it to run a long time. However, once it hits that percentage, it's like hitting a wall and it can't recover.
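
    The "brick wall" can be seen even in a textbook code. The sketch below uses Hamming(7,4), chosen only because it is tiny; SSD controllers use far stronger codes (BCH or LDPC over whole pages), but the behaviour at the limit is the same in kind: up to the design limit every error is fixed, and one more bad bit yields wrong data.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the suspected bad bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def flip(codeword, positions):
    """Return a copy of codeword with the given 0-based bit positions flipped."""
    damaged = list(codeword)
    for p in positions:
        damaged[p] ^= 1
    return damaged


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
one_error = hamming74_decode(flip(codeword, [4]))
two_errors = hamming74_decode(flip(codeword, [0, 4]))
print("1 bad bit recovered:", one_error == data)
print("2 bad bits recovered:", two_errors == data)
```

    This code can fix exactly one bad bit per codeword; with two, the decoder confidently "corrects" the wrong position and hands back bad data, with no gradual degradation in between.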

    --
    ...si hoc legere nimium eruditionis habes...
    1. Re:Forward error correction by prisoner-of-enigma · · Score: 1

      True, but that doesn't explain sudden, catastrophic SSD failure. Modern controllers remap bad blocks to the "spare area" on all SSD's. Keeping track of said blocks can offer a modicum of failure prediction. Indeed, many high-quality drives -- and nearly all enterprise drive arrays -- do exactly that.

      None of it matters if the onboard controller itself dies, and such failures cannot be predicted nor can they be analyzed. That's what the OP was lamenting. The only possible remedy is to avoid that brand/model in the future, assuming that's even an option (and with Dell/HP/Lenovo it frequently isn't since the OEM's will only warranty and replace OEM equipment).

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
  21. Re:I can relate by tepples · · Score: 1

    Then read it as "Samsung would not ship the replacement until it received the returned unit." This still implies a week's downtime.

  22. Re:I still don't trust SSDs even now. by MightyYar · · Score: 1

    Uh, for the massive performance boost you get from an SSD, they are totally worth setting up a backup job. Image the disk, set periodic backups to a server or even iDrive/Crashplan/Dropbox/etc and carry on with life. Hell, even leave the spinning disk in place and backup to that. For $60 you can extend the life of an old PC by several years simply by swapping in an SSD.

    You should have backups anyway.

    --
    W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
  23. Good luck putting RAID in a laptop by tepples · · Score: 1

    I doubt that most home PC users have both the case space and the cash for a RAID. A user of a mainstream laptop sure doesn't.

    1. Re:Good luck putting RAID in a laptop by MightyYar · · Score: 1

      Yes, I'm in that position with my small notebook. In my case, I imaged the drive when I first got it. I have Windows Backup set to backup to an NAS and I have iDrive installed for offsite backup. Most people don't need to go so crazy - they can get away with running Dropbox, OneDrive, Google Drive, etc. as their primary "Documents" folder and then letting Geek Squad put in a new drive and reinstall Windows. But even "most people" need to have backups of some kind. If they can't image a disk, they certainly won't be savvy enough to rescue data from a dying spinning hard drive.

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    2. Re:Good luck putting RAID in a laptop by omnichad · · Score: 1

      A lot of mainstream consumer laptops come with an M.2 slot for configurations with SSD but still have the SATA port for models with an HDD. You can fill both slots and make a RAID - the disks will just be different shapes. Software RAID, sure, but it can definitely be done affordably.

    3. Re:Good luck putting RAID in a laptop by tepples · · Score: 1

      If the SSD is replaceable then you should simply just use your backup

      How many days old is your backup?

      restore to a new drive

      How many days of shipping away is the new drive?

    4. Re:Good luck putting RAID in a laptop by tepples · · Score: 1

      Even little shitty PC cases have a place to stick a drive.

      "a drive" singular != "drives" plural, one requirement of RAID. Or are you recommending RAID between an internal drive and an external drive?

      My Lenovo laptop has mirrored hard disks.

      How big is it in inches (diagonal visible image size)? Drive bays that are practical in a 17" might not be practical in an 11.6" or 13".

  24. it's worse in space.. by unfortunateson · · Score: 1

    Reports from the ISS are that 9 out of 24 SSD drives failed in an HP supercomputer they'd brought up there. Quite scary how fragile those things are from radiation.

    --
    Design for Use, not Construction!
    1. Re:it's worse in space.. by hcs_$reboot · · Score: 1

      9 out of 24 SSD drives failed

      Were they all the same brand / type?

      --
      Slashdot, fix the reply notifications... You won't get away with it...
  25. Also here by jf_moreira · · Score: 2

    That happened to me three or four times already. They die without warning. No SMART indication, nothing. It really pisses us off. Someone needs to technically give us some kind of anticipation. Maybe SMART is not supposed to work well with SSD after all.

    1. Re:Also here by jimbo · · Score: 1

      HDD can also die suddenly, it's just that they also, in addition, have a class of failures that can be detected early.

  26. The spin is in! by theendlessnow · · Score: 4, Insightful

    One thing I like about spinning disks is that a lot of times the failure is gradual. Bad sectors and such and you have the opportunity to grab data off the drive (noting, you really should have backups).

    With SSD, whatever the issue, it's more like losing a controller board on the drive, everything dies and ceases to operate.

    So... I'll go along and say SSD is "better" and more "reliable", but when it dies, it dies hard. Just the way it is. (not talking about performance degradation... speaking about failure)

    1. Re:The spin is in! by scamper_22 · · Score: 1

      Same. I've never had a spinning HD just die. They always 'act' funny for a while.

      Then again, I've kind of stopped worrying about harddrives dying. Ever since I started working, it's been RAID 1 with two harddrives.
      One starts going bad, I swap it out.

      Then I have a NAS with RAID as well. Same thing there.

      I've been running for a while without worries.

    2. Re:The spin is in! by thegarbz · · Score: 1

      Backups handle random failures just as well as wearout failures. I'm happy that SSDs have surpassed HDDs by removing the wearout related failures (to a large extent anyway).

    3. Re:The spin is in! by edis · · Score: 1

      My experience is this: of the roughly 10 HDDs that failed around me recently, none of them part of a RAID, I was able to salvage every single one. Restoring the system to a running state was as simple as getting a new drive and making a failsafe dd copy under some booted Linux. For RAID drives you don't have to care beyond pulling out the unit for replacement.

      I like the read speeds of SSDs, but combined with the factors explained above, I could only give them preference where dying is OK, something like the disk in a kiosk, which would not accumulate any specific setup or data.
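
      For readers who have not done this kind of salvage, here is a minimal Python sketch of the idea behind a failsafe dd copy: read block by block, and when a block cannot be read, write zeros in its place so everything after it keeps its offset. The names `rescue_copy` and `FlakySource` are made up for illustration, and the failing disk is simulated; on a real system you would more likely reach for `dd conv=noerror,sync` or a dedicated rescue tool.

```python
import io


def rescue_copy(src, dst, block_size=4096):
    """Copy src to dst block by block, zero-filling blocks that raise OSError.

    src must support seek(), tell() and read(); dst must support write().
    Returns the list of byte offsets whose blocks could not be read.
    """
    bad_offsets = []
    src.seek(0, io.SEEK_END)
    total = src.tell()
    offset = 0
    while offset < total:
        want = min(block_size, total - offset)
        try:
            src.seek(offset)
            data = src.read(want)
        except OSError:
            data = b""
            bad_offsets.append(offset)
        # Pad failed or short reads with zeros so later data keeps its offset.
        dst.write(data.ljust(want, b"\0"))
        offset += want
    return bad_offsets


class FlakySource(io.BytesIO):
    """Hypothetical stand-in for a dying disk: reads at bad offsets fail."""

    def __init__(self, data, bad_offsets):
        super().__init__(data)
        self.bad = set(bad_offsets)

    def read(self, size=-1):
        if self.tell() in self.bad:
            raise OSError("simulated unreadable sector")
        return super().read(size)


if __name__ == "__main__":
    disk = FlakySource(b"A" * 4 + b"B" * 4 + b"C" * 2, bad_offsets=[4])
    image = io.BytesIO()
    print("unreadable offsets:", rescue_copy(disk, image, block_size=4))
    print("image:", image.getvalue())
```

      The point of zero-filling instead of skipping is that the filesystem structures in the image stay where the filesystem expects them, so a later fsck or mount has a chance of working.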

      --
      Servant of karma
    4. Re:The spin is in! by justthinkit · · Score: 1

      Bias ply tires were replaced by radial tires.

      Thing is that bias ply tires failed (slipped, when going around a corner too fast) in a progressive way.

      Radial tires provided more grip than bias ply tires, right up until they failed completely at providing traction.

      --
      I come here for the love
  27. Re:Learn about the subject by freeze128 · · Score: 1

    Correction: PROPERLY DESIGNED electronics wear out slowly. Improperly designed electronics may not even last past the warranty period. Since there is a huge demand for SSDs in increasing capacity, I can't help but think that manufacturers are pushing the bounds of reliability in favor of capacity. The manufacturers may just be relying on the SSD's built-in correction capability to correct for the decrease in reliability, but that will only get you so far.

  28. Re:Blame the OS by tepples · · Score: 1

    had a 2gb memory card once. A Day One fault of one 512 mb block dead. Windows could not recognise this fault nor fix it. Instead writing to the card had corruption (obviously) when the faulty block was engaged.

    Then Microsoft messed up by not offering a "try writing to all unallocated clusters" mode in the surface scan in chkdsk.

  29. Damage from static electricity is a good bet by bdwoolman · · Score: 1

    Improper handling of ungrounded components really can mess them up. They work but are defective. Take a look at some micrographs of ESD damage sometime. ESD does not always kill a part; it maims, sometimes only slightly. Anti-static mats and wrist straps are no laughing matter. Okay, they are. But use them anyway.

    --
    "No fear. No envy. No meanness." Liam Clancy
    1. Re:Damage from static electricity is a good bet by dcw3 · · Score: 1

      "Static Zap makes Crap" - One of my favorite sayings from Computer Tech training in the USAF back in the 70s.

      --
      Just another day in Paradise
  30. Heat by Thelasko · · Score: 1

    Most of the time heat kills electronics. Either they get too hot and something fries, or they suffer thermal fatigue.

    --
    One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".
    1. Re:Heat by dcw3 · · Score: 1

      Heat, static, condensation, unstable power, radiation, magnetic fields, vibration...pick your poison. It all depends on the environment you're working in and how well the equipment was designed.

      --
      Just another day in Paradise
    2. Re:Heat by prisoner-of-enigma · · Score: 1

      All the more reason to have some way of doing failure analysis on the failed component. If, for example, it died after prolonged high temps, you know you've got a cooling issue and can perhaps do something about it. If you can't do an analysis, you have no way of knowing if there is something you can do to avoid -- or at least reduce the possibility of -- future failures of the same type.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    3. Re:Heat by dcw3 · · Score: 1

      I agree 100%. But only to a cost effective level. For example, if you have a thousand hard drives, it's probably worth your effort to track outage reasons. If you have five, not so much.

      --
      Just another day in Paradise
  31. Failure done right - Sandisk USB by Stonent1 · · Score: 2

    I had a Sandisk USB stick recently go read only. I had been using it as a hypervisor boot drive and the boot was crashing. When I inspected it, it was read only and any attempts to format it, diskpart it, fdisk it failed with some kind of error. I looked it up and apparently this is the designed failure route for these USB drives. When the controller detects an inconsistency or uncorrectable error, the drive is locked from writing so you can get data off of it.

  32. He's right. by GameboyRMH · · Score: 2

    SSDs really are unpredictable timebombs, so act appropriately - take frequent backups and use RAID if the downtime from a sudden SSD failure with zero warning is unacceptable. Any IT department that hasn't been prepared for the nature of SSD failures since long before they were available off the shelf was doing it wrong anyway.

    I'm most worried about what SSDs mean for the Average Joe, whose data is largely protected by the predictability and recoverability of most hard drive failures. SSDs throw all of that out the window and lure them in with the warm glow of performance like moths to a flame. Average Joes need a real wake-up call on the importance of backups with the switch to SSDs.

    --
    "When information is power, privacy is freedom" - Jah-Wren Ryel
    1. Re:He's right. by thegarbz · · Score: 1

      SSDs really are unpredictable timebombs

      So are HDDs. Just because you have wearout related failure modes that make their life even shorter doesn't mean controller failures don't happen.

      There's nothing magic about SSDs. Random failures happen. Have a backup / business continuity strategy.

    2. Re:He's right. by GameboyRMH · · Score: 1

      It's possible but very unusual for HDDs to fail irrecoverably and without warning, but that's the normal failure mode for SSDs, that's the difference.

      --
      "When information is power, privacy is freedom" - Jah-Wren Ryel
    3. Re:He's right. by thegarbz · · Score: 1

      That's horseshit. There are many quite common failure modes for HDDs that come without warning, including some of the mechanical wearout style ones which have incredibly soft and interpretive SMART statistics.

      I take it you've never had sudden head failure of a HDD, control board failure? Random failures happen and you should consider yourself lucky if your HDD is dying due to one of the very limited mechanical cases that are detectable by SMART.

    4. Re:He's right. by GameboyRMH · · Score: 1

      I'd had head crashes in the '90s and no clear control board failures so far. Since the new millennium, I haven't had any totally unexpected hard drive failures, and no unrecoverable ones. With the custom SMART reporting/alerting script (to work around the soft and interpretive SMART statistics and focus on the ones that matter) on my home server, I've been able to see them all coming far in advance. The firmware-level SMART alerting system on the servers at the office seems to catch them well in advance too.
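
      As an illustration of what such a script might look like (this is a sketch, not the poster's actual script), the Python below scans an attribute table laid out the way `smartctl -A` prints it for ATA drives and reports only a short watch list of attributes whose raw value is non-zero. The three watched attributes are an assumption about which ones "matter", and the sample table contains made-up values.

```python
# Assumed watch list: the post does not say which attributes it focuses on.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def failing_attributes(smart_table, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    alerts = {}
    for line in smart_table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in watched and raw.isdigit() and int(raw) > 0:
            alerts[name] = int(raw)
    return alerts


# Made-up sample in the column layout of `smartctl -A` output.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       24310
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
"""

if __name__ == "__main__":
    for name, raw in failing_attributes(SAMPLE).items():
        print(f"ALERT {name} raw={raw}")
```

      Looking at raw counts rather than the normalized VALUE/THRESH columns is one way to sidestep the vendor-specific scaling that makes the stock pass/fail verdict so soft.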

      --
      "When information is power, privacy is freedom" - Jah-Wren Ryel
    5. Re:He's right. by thegarbz · · Score: 1

      Well I've got a 50/50 success rate on both HDDs and SSDs, though admittedly I've only had 2 SSD failures to date rather than a larger number of HDD failures. That includes live monitoring of SMART parameters and excludes replacing risky drives at end of life (the last 2 drives I retired had no sign of failure but had over 7 years of head flying hours on them so it was time to go).

      Control board failures are common enough that it was a well known process to swap out control boards of identical drives back in the day in the hope that mechanically the drive is okay. This doesn't work these days due to boards being more "custom", which is to say that parameters unique to each drive are configured into its board at the factory.

  33. Re:I can relate by Anonymous Coward · · Score: 1

    You can almost always pay for advanced replacement. You get your money back when they receive the drive. Then it is usually just a day until you're done, assuming you don't have a spare around. If things are really that critical, though, then you've already failed: you should have set up RAID for your SSD and had backups.

  34. Re:Learn about the subject by omnichad · · Score: 2

    Not using TRIM doesn't have a huge effect on SSD life. Just performance. Write amplification adds some wear, but not enough to be drastic. And it won't cause sudden failure either - just normal wear on the wear-levelling curve. Sudden failure is by definition going to be something that's not related to routine depletion of a fixed lifespan.
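
    To put rough numbers on "some wear, but not enough to be drastic", here is a back-of-the-envelope Python sketch. Write amplification is taken as flash bytes written divided by host bytes written, and wear-out time is estimated as total rated flash writes divided by daily flash writes. The function names and every figure in the example (250 GB drive, 3000 program/erase cycles, 20 GB of host writes per day) are illustrative assumptions, not measurements.

```python
def write_amplification(nand_bytes_written, host_bytes_written):
    """Ratio of bytes the flash actually absorbed to bytes the host asked for."""
    return nand_bytes_written / host_bytes_written


def years_until_worn(capacity_gb, pe_cycles, host_gb_per_day, waf):
    """Rough wear-out estimate, assuming perfectly even wear leveling."""
    total_nand_gb = capacity_gb * pe_cycles      # lifetime flash write budget
    nand_gb_per_day = host_gb_per_day * waf      # what the flash sees per day
    return total_nand_gb / nand_gb_per_day / 365


if __name__ == "__main__":
    # Hypothetical drive and workload; only the ratio between the two
    # results is the point, not the absolute numbers.
    for waf in (1.5, 3.0):
        years = years_until_worn(250, 3000, 20, waf)
        print(f"WAF {waf}: about {years:.1f} years")
```

    Under these assumptions, doubling the write amplification halves the estimate, but both results are still measured in decades, which is consistent with the claim that routine wear is not what produces sudden early deaths.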

  35. Restore which version? by tepples · · Score: 2

    Who the hell cares? Replace it and restore your data.

    The data on a failing drive might be a newer version than the most recent weekly backup. I see value in backing up the newer version elsewhere as the first part of replacing the drive. But SSD failure modes allegedly make this newer version inaccessible sooner than HDD failure modes.

    1. Re:Restore which version? by brantondaveperson · · Score: 1

      Who uses weekly backups? Back up automatically to as many cloud providers as you can afford, and use something like Time Machine (there must be equivalents of this for other OSs, right... surely...) too. No problem.

    2. Re:Restore which version? by tepples · · Score: 1

      Back up automatically to as many cloud providers as you can afford

      Which isn't many if you have a lot of GB of data to back up, such as video or lossless audio, and your home ISP doesn't provide a lot of GB/mo. (Satellite ISPs tend to limit data, and cellular ISPs tend to limit hotspot data.) Or if you don't want yet another utility dipping into your checking account via your debit card every month.

      I looked for Time Machine equivalents on GNU/Linux, and Cronopete at least appears to have been worked on in the past year.

  36. Re:I can relate by 110010001000 · · Score: 1

    You should demand cross shipping for that. Any professional would.

  37. The Failure Modes by Sarusa · · Score: 1

    So you can have peace of mind:

    If it dies suddenly, without warning, it's 1) buggy firmware (I think this is by far the biggest culprit), or 2) bad components/soldering/cleaning on the PCB, or 3) a really dumb controller that isn't doing wear leveling on every single thing (think the master index), so when a critical flash cell dies the entire thing is dead even though there's plenty of good flash left (this was common with crappy little 'SSDs' that were just Compact Flash), or 4) a badly designed controller that leaves the drive in a bad state when power suddenly goes out and can't recover.

    If it sloooows down and starts getting more and more sluggish you've lost enough flash cells that the wear leveling is losing its capacity to cope. Take some stuff off the drive to give it some breathing room and prepare for its demise. I had this happen with one of the original Intel SSDs (the X-25M). It took ten years of continuous use, though - yes, just this year.
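
    Failure mode 3 is easy to see in a toy model. The Python sketch below is not how any real controller works; it assumes every host write costs one data-block write plus one index write, that each block survives a fixed number of writes, and that the device is dead as soon as a write would land on a worn-out block. The only difference between the two runs is whether the index lives in a fixed block or is rotated along with the data.

```python
def host_writes_until_failure(num_blocks, endurance, level_index):
    """Count host writes a toy flash device survives (needs num_blocks >= 2).

    level_index=False pins the index to block 0 and rotates data over the
    remaining blocks; level_index=True rotates both over all blocks.
    """
    wear = [0] * num_blocks
    cursor = 0            # round-robin pointer for wear-levelled writes
    host_writes = 0
    while True:
        if level_index:
            targets = [cursor % num_blocks, (cursor + 1) % num_blocks]
            cursor += 2
        else:
            targets = [0, 1 + cursor % (num_blocks - 1)]
            cursor += 1
        # The device fails when a write would hit an exhausted block.
        if any(wear[b] >= endurance for b in targets):
            return host_writes
        for b in targets:
            wear[b] += 1
        host_writes += 1


if __name__ == "__main__":
    pinned = host_writes_until_failure(100, 1000, level_index=False)
    levelled = host_writes_until_failure(100, 1000, level_index=True)
    print("index pinned to one block:", pinned)
    print("index wear-levelled:     ", levelled)
```

    With 100 blocks, the pinned-index device dies once that one block is spent while the data blocks have barely been touched, which is the "plenty of good flash left" situation described above.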

    1. Re:The Failure Modes by krray · · Score: 1

      I've had multiple OWC branded SSD's die on me. I usually like OWC branded items, but the SSD failure has me pulling any / all such branded ones out of service.

      It was my understanding that a failing SSD (can't write anymore properly) should flip itself over to READ ONLY mode. At least this would give you a chance to pull the existing data off the drive.

      The OWC failures were catastrophic (save that I had working backups :). When these SSD's failed they were just GONE. Nothing. The system wouldn't see them even when connected.

      At least with a hard drive they typically gave you some warning. Getting louder, clicking, ... I can really only think of one drive that just utterly died as I've seen SSD's do now.

      Moral: backup Backup BACKUP

    2. Re:The Failure Modes by Sarusa · · Score: 1

      Yes, I have full backups of everything nightly, so even though I have never had a SSD fail on me catastrophically (cross fingers), it's covered.

      Hard drives do just fall over dead too, but you're right, often there's warning signs.

      The OWC thing doesn't even sound like the 'drive' part (the flash) is failing, it sounds like the controller that talks SATA to the PC is failing, or the power circuit died so the thing doesn't have any power. Otherwise the system would at least see it. And it sounds systemic. So besides backup, backup, BACKUP, we have the moral 'OWC is trash'.

  38. Tiny wires, heat bad by HeckRuler · · Score: 1

    SSDs have a bunch of tiny wires. When you push electricity through wires they heat up, they're not perfect super-conductors. If you heat it up too much, it will of course burn, but they avoid that. Still, heating up a wire over and over will have some wear and tear. For big thick power-lines in houses, this doesn't have too much effect, but for tiny precision electronics, it builds up. And SSD's have a LOT of those wires with a little bit of manufacturing variance which makes some parts fail sooner.

    They burn out the same way lightbulbs burn out. They don't have moving parts, right?

    1. Re:Tiny wires, heat bad by FrankSchwab · · Score: 1

      Wires? Burn out the same way lightbulbs burn out?

      Your understanding of electronics is remarkably wrong.

      --
      And the worms ate into his brain.
    2. Re:Tiny wires, heat bad by HeckRuler · · Score: 1

      Ok, what's a simple word for the traces going into and out of transistors?

      Light bulbs are solid state, riiiiiiight?

  39. Re:I still don't trust SSDs even now. by tepples · · Score: 1

    What, you don't have at least 32GB of RAM?

    I see your point about prefetching most of your environment to disk cache. That's why Microsoft added the "SuperFetch" feature to Windows over a decade ago and Canonical added "ureadahead" to Ubuntu. But there are three problems:

    First, many tablet computers and compact laptops lack slots for 32 GB of RAM.
    Second, even on those machines that can take 32 GB, loading 32 GB when booting or when waking from hibernation takes a while before the prefetch stops being a source of read latency.
    Third, when a file is written and flushed, the application that you are using still needs to wait for the data to be written to spinning rust in case the power fails or the kernel panics. That adds several milliseconds of latency.
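
    The third point can be measured directly. This Python sketch times a buffered write separately from the `os.fsync` call that waits for the operating system to report the data as written to the device. The helper name `timed_durable_write` is made up, and the numbers depend entirely on the drive, filesystem, and OS, so no particular result is claimed.

```python
import os
import tempfile
import time


def timed_durable_write(path, payload):
    """Write payload to path; return (seconds buffering, seconds in fsync)."""
    with open(path, "wb") as f:
        t0 = time.perf_counter()
        f.write(payload)
        f.flush()               # hand Python's buffer to the OS page cache
        t1 = time.perf_counter()
        os.fsync(f.fileno())    # block until the OS says it reached the device
        t2 = time.perf_counter()
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "probe.bin")
        buffered, synced = timed_durable_write(target, b"x" * 4096)
        print(f"write+flush: {buffered * 1000:.3f} ms")
        print(f"fsync:       {synced * 1000:.3f} ms")
```

    Pointing the temporary directory at a spinning disk versus an SSD is a quick way to see where the latency in that third point comes from, since a large RAM cache does nothing for the fsync part.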

  40. Re:Blame the OS by omnichad · · Score: 1

    Don't blame the OS. Blame "no backups." Failure should be expected and accounted for with a backup plan.

  41. Re:Hmm by omnichad · · Score: 1

    Should have skipped Intel and OCZ and just waited for the Samsung EVO line. I've installed dozens over the last few years and not a single failure yet.

  42. Not so by bagofbeans · · Score: 1

    Metal migration limits the lifetime of the interconnect in ICs. Absolutely a wear mechanism.

  43. Mechanical vs Electronics by Shotgun · · Score: 1

    I'm going to disagree with the people saying that spinning disks don't give you a warning of imminent death. A bad spindle will start whirring, and steadily get louder, and my experience has been that most drives go that way. Hence, the old trick of sticking the drive in a freezer to get a few minutes more life out of it (because you didn't keep your backups updated... again :-( ).

    This is a phenomenon that should always be kept in mind when switching from mechanical to electronic systems. The electronics are usually MORE reliable, in the sense that they are less likely to go belly up, but WHEN they do, they won't give you any warning. I could arguably make my home-built airplane MORE reliable and feature rich by replacing the flight controls with a fly by wire system. But, one day a gate in one of the processors will fry itself, and the whole system will quit working at once. Woe unto me if I'm at altitude at that point. The mechanical system will require more maintenance, but it will slowly wear out over time, controls will get sloppy, and exhibit more play. That is the system telling me, "I'm getting kinda tired here. I'm getting old, y'all. Replace me. Screw it. I quit." It gives warnings to the operator that knows what to listen for.

    So, the article does have a point. . . sort of.

    --
    Aah, change is good. -- Rafiki
    Yeah, but it ain't easy. -- Simba
  44. Re:Department of Computer Science --- are you sure by BLToday · · Score: 1

    Doesn't know how SSD's work.

    No offense to CS majors, but this EE major tends to understand "How a computer works" at a lower level than most of you programmer types. While not universally true, in my experience a Computer Science major generally get's outside their comfort zone with hardware once you get past "Plug it in and turn it on." I don't blame them, there is a lot of stuff happening at lower levels than a CS major needs to know to do their job.

    That some CS major is concerned about how SSD's fail because he doesn't understand their failure modes is fine. We tend to fear what we don't understand and let's face it, there is a LOT of stuff going on inside a computer that high level users simply don't need to know. Heck, even I don't need to know some of that stuff and I've designed computing systems in the past. Fear not, if it works, it works, if it doesn't you just replace it anyway.

    This ^^^. I had a brilliant CS college roommate. But when he built his first computer himself, the motherboard was held to the case with one screw. He couldn’t figure out why it was crashing all the time. Everything in the machine was barely in their slots/socket. This is back in the Pentium days. Days of VLB and very early AGP. And sometimes IRQ switches.

  45. Re:Department of Computer Science --- are you sure by Anonymous Coward · · Score: 1

    " in my experience a Computer Science major generally get's outside"

    Yup, I can believe you're an EE. While you go on and on congratulating yourself almost as hard as a doctor, you can't even tell the difference between GET IS and GETS.

  46. For Chris's peace of mind. by Tjp($)pjT · · Score: 1

    On new SSDs, the failure could be a die bond failure, an intermittent defect that lets the part pass inspection and then fail. Or a ball bond to PCB failure that can be intermittent as the package, solder ball, and PCB change dimensions due to different thermal expansion coefficients. The tiny contacts on the PCB versus relatively huge contacts on the mechanical hard drive make these happen more often on SSDs.

    On older SSDs there could be degradation of the ability to hold or modify the stored charge that represents a bit. Not likely unless you are a heavy duty user. Or metal migration from the mask layer, or metal migration at the bonding level, where physical aluminum or gold wires are bonded to the actual chip. Less likely is a bonding failure to the underlying substrate, as the wire material used is chosen for high compatibility.

    Now Chris, feel better knowing just a few ways you can envision the failures?

    --
    - Tjp

    I am in wallow with my inner money grubbing capitalistic pig. ... Oink!

  47. Backup your data frequently by Solandri · · Score: 2, Insightful

    Backup your data frequently. Stop worrying. Is that so hard?

    1. Re:Backup your data frequently by Andtalath · · Score: 1

      Ever heard of defragmentation?
      This will ruin this little scheme.

      What MIGHT work is a dedicated partition.

      Still, this is, in fact, already done by all disks that don't give you exactly a power-of-2 capacity, since they leave some cells spare to be able to move data around...
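
      The spare-cell point can be put in numbers. The Python sketch below assumes a drive built from a power-of-two amount of raw flash (counted in GiB) that advertises a smaller decimal capacity (counted in GB) and keeps the difference as spare area. The function name and the 256 GiB / 240 GB pairing are illustrative assumptions, not the specification of any particular drive.

```python
def overprovisioning(raw_gib, advertised_gb):
    """Spare flash as a fraction of the advertised (user-visible) capacity."""
    raw_bytes = raw_gib * 2**30           # binary gibibytes of physical flash
    user_bytes = advertised_gb * 10**9    # decimal gigabytes on the label
    return (raw_bytes - user_bytes) / user_bytes


if __name__ == "__main__":
    # Hypothetical drives built on the same 256 GiB of raw flash.
    for label_gb in (256, 240):
        spare = overprovisioning(256, label_gb)
        print(f"{label_gb} GB label: {spare:.1%} spare")
```

      Even the drive labelled with the full 256 GB keeps a few percent back under these assumptions, simply because of the gap between binary and decimal units.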

    2. Re:Backup your data frequently by Cid+Highwind · · Score: 1

      You won't know the drive is running out of good cells until it's too late. One night everything is fine, the next morning you turn on the machine and "Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block"

      Only a backup on a separate device can save you from SSD failures.

      --
      0 1 - just my two bits
  48. What a dumbass by Anonymous Coward · · Score: 1

    Just imagine the unicorn in the drive died.
    It's about as accurate as what you imagine happened to the spinning disk.

  49. Spinning disks used to be more unnerving... by gosand · · Score: 1

    I had a 4 TB spinning drive fail after only 2 years. It was 75% full. That is what is scary to me. The only narrative I came up with to explain it was that it was in my system and powered on 24x7. Now my backup drives are external and I power them on when I need them.

    As drives get bigger, that is when I get nervous. I know, there are options to mitigate that, but I'm on a budget. I just migrated my OS to an SSD a couple of months ago, and still have spinning drives holding everything else.

    --

    My beliefs do not require that you agree with them.

  50. Re:I can relate by Junta · · Score: 1

    Realistically speaking, he almost certainly got a replacement, but return policy he had required it to be returned.

    However for electronics of this class, the manufacturer in all likelihood *could* repair it. The neat thing is if they do repair such a disk, it could come back with the data intact. In practice, I don't think any manufacturer would offer such a service or even try.

    --
    XML is like violence. If it doesn't solve the problem, use more.
  51. HDs were scary too at some point by foxalopex · · Score: 2

    I'm guessing the author never lived through the era when there were a lot more companies in existence for mechanical HDs than there are now. HDs can spontaneously die from a failed motor, electronics failure or catastrophic crash. Some small companies went completely under and were swallowed up by larger manufacturers due to massive defects. SSDs have gone through the same era as well with buggy firmware. Generally speaking though, if you stick to the big manufacturers like Samsung and Intel the chances of fatal issues go down a lot. That said, an SSD is not a guarantee of safe data. They're far more reliable, but circuit failure or static electricity can kill SSDs. Besides, SSDs won't save you from an accidental erase all.

    1. Re:HDs were scary too at some point by thegarbz · · Score: 1

      Some small companies went completely under and were swallowed up by larger manufacturers due to massive defects.

      Some large companies had their HDD division go completely under. Looking at you IBM, I owned two of your IBM "Death"star series HDDs and somehow went through the warranty process 7 times on them.

    2. Re:HDs were scary too at some point by toddestan · · Score: 1

      IBM's HDD division didn't go under. It was bought by Hitachi, and generally Hitachi's drives are very well regarded nowadays.

      As bad as the IBM Deathstars were, I never actually lost any data because of them because they always gave some sign of impending doom before they finally failed, allowing me to grab whatever I needed to get off of them. I also had one last over 10 years in a workstation that was almost never turned off. I'm not even sure how that happened.

    3. Re:HDs were scary too at some point by thegarbz · · Score: 1

      IBM's division definitely went under! They were effectively blacklisted and unable to sell hardware. When they were purchased by Hitachi they were bought at bargain basement prices, the same price that Maxtor went for in the Seagate acquisition, when Seagate bought a struggling company that had gone through half a decade of financial difficulties. When Hitachi bought it the only value left for IBM was in the commercial contracts. It took many years for Hitachi to turn the brand around, and after they did they sold drives at a fraction of the volume that IBM did and were subsequently bought by WD for more than double the original IBM acquisition cost.

      As bad as the IBM Deathstars were, I never actually lost any data because of them because they always gave some sign of impending doom before they finally failed

      You clearly never ran one in a RAID configuration. They were notorious for not making it through a full rebuild cycle once the dreaded click of death started. I never lost data either, backups and prioritising what data needed to be taken from the degraded arrays are the only reason though.

      I also had one last over 10 years in a workstation that was almost never turned off.

      Not all models had issues. I still have a working one here. At least I assume it's working, I'm not sure I've got any hardware with a PATA interface anymore so I can't test it.

  52. Did you have your covfefe this morning? by fyngyrz · · Score: 1

    real to real

    Donald? Is that you?

    --
    I've fallen off your lawn, and I can't get up.
    1. Re:Did you have your covfefe this morning? by Aighearach · · Score: 1

      Nice catch, I read right past that and didn't catch it; my parser rewrote it using the algorithm that corrects the "and and" mistakes, and I came away thinking he said "stick to real 9 track paper tape."

  53. I also had a failure recently by cloud.pt · · Score: 1

    Just chiming in: my Crucial M4 128GB (Micron) drive also died on me 2 months ago after very mild use since February 2013. It was my OS drive in a Windows 7-10 desktop which I mostly used for 3-5 multiplayer games through the years, or the odd media consumption. It was a machine that was on about 1/20 of the entire 5 years and 8 months.

  54. Post-Failure Support by nuckfuts · · Score: 1

    There's another problem I've found with SSDs in addition to their failures occurring with no previous warning signs. That is that the process of obtaining warranty replacements can be terrible.

    Perhaps because hard drives were expected to fail, manufacturers put procedures in place (such as "Advance" RMA) to ship a replacement very quickly. This is important when, for example, you have a single-drive failure in a RAID configuration that can only tolerate losing one drive.

    My experience with obtaining two warranty replacements on Intel M.2 SSDs has been really poor. In each case the replacement drive took so long to arrive I had to purchase a replacement drive in the meantime.

  55. Re: Total GARBAGE by datavirtue · · Score: 1

    Yeah, that's the answer...turning SSDs into WinModems.

    --
    I object to power without constructive purpose. --Spock
  56. Best of both worlds by MobyDisk · · Score: 1

    You can get the best of both worlds by setting up a RAID of both an SSD and a platter drive! :-P

    1. Re:Best of both worlds by prisoner-of-enigma · · Score: 1

      I'm not sure if you were being funny or not, but this is a horrible idea. Instead of the "best of both worlds" you're getting the worst of both. Read and write times will be gated by the speed of the mechanical drive, negating any SSD speed benefits. You'd be better off with two mechanical drives: same speed at much lower cost.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
  57. Re:Heading should be by Immerman · · Score: 1

    Quite so. As it happens I was just fixing a dentist's office computer yesterday, and used the dental air blower to get the dust-bison out of the heat sinks since I didn't have any compressed air on hand. Let me tell you I was *really* careful not to touch the water jet button. Clearly whoever designed the "two small identical buttons side by side" interface never intended it to be used in a setting where a stray jet of water could be a major problem.

    --
    --- Most topics have many sides worth arguing, allow me to take one opposite you.
  58. When an SSD dies by OrangeTide · · Score: 1

    The most likely reason is a firmware bug causing enough corruption that it can't even low-level format. If it were a prototype that a developer could diagnose, it would be easy for them to patch it and get it going again. But without that specialized environment your SSD and the data on it are trash.

    In many ways I think I would have preferred the raw NAND systems like SmartMedia (now obsolete), where the host had the real brains and the media was as primitive as possible. SmartMedia formatting was about conforming to a software standard on the host side and was managed by a driver. A real driver that could be debugged with ordinary tools, not some obscure firmware embedded in a device.

    --
    “Common sense is not so common.” — Voltaire
  59. The Storage Debacle.. by Xnet+Project · · Score: 1

    Our experience with mechanical, SSD, and NVMe drives is that there are points of failure that we can detect, and there are points of failure we can't. In our tests with bad drives of these 3 types, an unpredictable failure almost always traces back to the power source and is mostly indicative of voltage irregularity. While we'd like to think that new hardware will hold up to a degree of its certified life span, the voltage powering said hardware will almost certainly add an anomalous layer of error, with effects ranging from minimal to catastrophic.

  60. It's pretty simple by viperidaenz · · Score: 1

    The chips store data in a capacitor.
    The capacitor is connected to (or is the) the gate of a mosfet so the state can be read.
    To charge or discharge the capacitor, electrons must be forced over the insulation layer that stops the capacitor discharging on its own.
    Every time that happens the insulation breaks down a little. Once it's all gone, the cell can no longer store data.

    It's a gradual process that happens every time a cell is written to or erased. SSD's wear out as they're used, it's how they work. You should treat them as a consumable.

    Or something randomly broke, like a solder joint from thermal cycling or something.

  61. Re:Blame the OS by Immerman · · Score: 1

    Well, to be fair you *didn't* fix it - you just worked around it. Almost as good in many settings. I've "fixed" several hard drives in a similar manner - one section of the drive is clearly bad, and spreading when used? Fine, re-partition it so that that section, and a generous buffer zone, are never used. They typically work fine for years after that.

    Certainly not something I'd generally recommend given the nature of such HDD failures, but perhaps justifiable if you just want to buy some more time before an upgrade, or until a kid destroys the thing more permanently.

    --
    --- Most topics have many sides worth arguing, allow me to take one opposite you.
  62. Re:Heading should be by rogoshen1 · · Score: 1

    That's the long and short of it.

  63. literature by TheSync · · Score: 1

    See SSD Failures in Datacenters: What? When? and Why?.

    Failures include retention errors caused by leakage current, which worsen with time when not acted upon. SSDs also suffer from phenomena such as read disturb and program disturb errors, where reading or programming a row or block of cells affects the threshold voltage of untouched cells in its vicinity. In summary: data retention, program disturb, read disturb, endurance, and power faults.

    Flash controllers have proactive and reactive mechanisms in place, to prevent the flash error propagation to higher levels in the system stack. Consequently, not all of the above-mentioned failures propagate to upper layers. But, ones that do propagate can result in fail-stop failures.
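
    As a rough illustration of what a "proactive mechanism" could look like, here is a toy read-disturb counter: it counts reads per block and marks a block for rewriting once the count reaches a limit. The class, the threshold value, and the refresh bookkeeping are all assumptions for the sake of the example, not taken from the paper or from any real controller firmware.

```python
class ReadDisturbTracker:
    """Toy model: refresh a block after too many reads. Threshold is made up."""

    def __init__(self, refresh_threshold=100_000):
        self.refresh_threshold = refresh_threshold
        self.read_counts = {}   # block number -> reads since last refresh
        self.refreshed = []     # blocks that were scheduled for a rewrite

    def record_read(self, block):
        count = self.read_counts.get(block, 0) + 1
        if count >= self.refresh_threshold:
            # Rewriting the data elsewhere resets the disturbance it has seen.
            self.refreshed.append(block)
            count = 0
        self.read_counts[block] = count
```

    The point of the sketch is only that this kind of housekeeping happens inside the drive, out of sight of the host, which fits the observation that most errors never reach the upper layers.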

  64. Re:Department of Computer Science --- are you sure by Aighearach · · Score: 1

    I spent 3 years on a "deep dive" into EE basics, analog circuit design, then microcontrollers, and it really improved my software development a lot.

    I don't think this is a natural blind spot in CS, I think it is just manufactured ignorance by dividing the fields in an unrealistic way. Which seems to have happened during the rush to train workers during the .com boom, so maybe it wasn't even thought out at all.

  65. I feel the same way about light bulbs. by ripvlan · · Score: 1

    Does anyone really know why a spinning disk dies? Sure - maybe if the last operation was "dropped laptop down stairwell"

    A narrative over what went wrong?! Whenever a HDD failed a light came on the RAID array - and I'd find a package from FedEx on my desk at 9AM with a replacement disk in it. As for personal computers - the drive stops working and you lose data.

    What is there to think about?

    I do agree about the "timebomb" thought. I know that SSDs just give up the ghost. On a HDD many times "check disk" starts reporting a high number of failures and you can be prepared...except when the head falls off the arm. That's a rather rapid failure.

    SSDs have a write lifetime that I can't predict. A HDD goes until it doesn't work anymore. In both cases you break out the backup tapes.

  66. Re:Department of Computer Science --- are you sure by DickBreath · · Score: 1

    If you begin to notice vibration from the SSDs then you know they are near the end of their life.

    --

    I'll see your senator, and I'll raise you two judges.
  67. I don't care. by ruddk · · Score: 1

    We set them up in a RAID, EC, or other redundant configuration, so that the operations department can swap them out when they fail without downtime.
    Unless we start to see an unusually high number of failures, we don't care.
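
    The reason a failed drive can simply be swapped out is easiest to see with single-parity XOR, the idea behind RAID 5. The sketch below is a simplified illustration with hypothetical function names; real RAID and erasure-coding implementations add striping, rotation of parity, and stronger codes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Parity block stored on the extra drive."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Recover the one lost block from the survivors plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

    With three data blocks `b"abcd"`, `b"efgh"`, and `b"ijkl"`, losing the middle one and calling `rebuild_missing` on the other two plus the parity returns `b"efgh"` again.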

  68. Re:Luckyo is caught lying about plastic though by Luckyo · · Score: 1

    Just out of interest, how much time in your day is spent stalking me on slashdot after your anti-science drivel got exposed in that one argument?

  69. Re:Department of Computer Science --- are you sure by bobbied · · Score: 1

    Well, I do think it's natural for CS majors to be a bit farther away from hardware. Let's face it, much of their work these days doesn't really care what operating system they run on, much less the hardware it's actually running on. I don't blame them; really, the state of programming has evolved away from hardware dependence, and that's a good thing.

    While I understand the hardware details of what's happening behind the programming model seen by the CS guys and gals, and I believe that gives me a different perspective when doing software development, I'm not sure they would benefit all that much. Programming Java is pretty hardware agnostic anyway, C/C++ a bit more specific (assuming you have the libs and compiler), but still largely portable unless you are handling actual hardware or kernel level stuff. My hardware knowledge really only serves to make me more aware of performance implications of my choices perhaps, but the CS folks do just fine with most higher level languages.

    So I don't agree, CS folks really don't need to know all the same stuff I do to program. It used to be true; it used to be valuable to understand what the hardware had to go through, both to be able to optimize your code for performance and size and to get it to do what you wanted. However, with the advent of the higher level languages, most CS folks don't interact with the hardware anyway, but with abstract programming models like the JREs, which for all the world look identical regardless of the hardware being used.

    --
    "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  70. Re:Blame the OS by prisoner-of-enigma · · Score: 1

    Don't blame the OS. Blame "no backups." Failure should be expected and accounted for with a backup plan.

    While I am the first to agree to this at the enterprise server level, it's far more difficult for the consumer or typical desktop user, especially for laptop users. RAID isn't always an option for laptops (frequently it's impossible) so you're left with some sort of external (USB or Thunderbolt) backup device or cloud storage. The former is difficult for road warriors and is nearly impossible to schedule since it's manually attached. The latter depends on an always-on Internet connection to have current backups.

    My strategy was for laptop/desktop users to have their My Documents (and any other crucial directories) mirrored using OneDrive (included with Office365). It worked most of the time but nothing could be done if a drive failed while someone wasn't connected to the Internet. Any changes made since the last sync were irretrievably lost.

    --
    In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
  71. Re: Department of Computer Science --- are you sur by FuzzyDaddy2 · · Score: 1

    I blame the autocorrect software.

  72. Re:Heading should be by bobbied · · Score: 1

    Yea, with water, I see a reduction in resistance too... :)

    --
    "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  73. Why SSD failures are legitimately unnerving by jddimarco · · Score: 2

    Disclaimer: I've known Chris since we were CS undergraduates together in the 1980s, and we currently work together in the CS Department in Toronto.

    It may seem a bit odd to some that a hard disk failure isn't unnerving but an SSD failure is. That's because one of a good sysadmin's skills is properly focused anxiety, used to motivate a mental model of how things can fail, and what to do about it. Data storage is a key part of this mental model, since data access loss, or even worse, data loss, is a major risk. That's why it's helpful to know how disks work, how they behave when they fail, and how likely it is for such things to happen.

    Chris has a few decades of experience in dealing with disks. SSDs take the place of disks, and they store stuff just like disks do, but they work differently, and they behave very differently when they fail. In particular, SSDs often don't seem to give any indication that things may be wrong: one moment all is well, the next moment, all is dead. So instincts honed over a few decades of experience with hard drives don't apply. Of course Chris (and we all) will develop new instincts as we get more experience with SSDs. But in the meanwhile, it's indeed unnerving.

    And no, this isn't some sort of profound insight. It's merely an observation. Many experienced sysadmins, I think, will "get" this. People newer to the field might not. That's OK.

  74. Chris Siebenmann has anxiety issues. by Oligonicella · · Score: 1

    That's the crux of the article. I should care why? He's a technical guy, he knows about memory. He just refuses to apply his knowledge to get rid of his paranoia. This guy's nothing but a low-level conspiracy theorist.

    As I wrote in one of my books: "They're all alike. Conspiracy theorists. They'd rather live in a terrifying fantasy world than the real one."

    1. Re:Chris Siebenmann has anxiety issues. by jddimarco · · Score: 1

      This response is confused on so many levels. First, Chris doesn't "know about memory" (particularly flash memory and the corresponding control systems that are built into modern SSDs) in the same way and to the same degree as he knows about disks; that's the point. Second, he isn't refusing to apply his knowledge; he's using all the knowledge he has, which is less than what he has for disks. Third, he isn't being paranoid: paranoia requires high and ongoing anxiety about extremely unlikely things (i.e., delusional). Chris' anxiety here is neither high nor ongoing, nor is what he is anxious about (SSD failure) an extremely unlikely (delusional) thing. Fourth, there's no evidence in his posting that Chris believes any sort of conspiracy is going on here.

  75. Re:Think of SSD drives as RAM memory by prisoner-of-enigma · · Score: 1

    Do you get this anxious when a RAM module fails? There really is no difference between a RAM module failing and a SSD failing...

    Uhh...people don't usually store critical files in volatile RAM. Kind of a huge difference there. Further, RAM failures may crash the computer but they rarely destroy anything else in the process. A mass storage failure -- be it HDD or SSD -- virtually guarantees you'll lose whatever data you had on it. Your only recourse is RAID (which isn't an option on most laptops) or some sort of backup (which is difficult to enforce on mobile users).

    Yes, you can blame users all day long for not backing up their data. It doesn't help when you're still responsible for IT as a whole. The problem lands on your desk whether you want it or not.

    --
    In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
  76. Re:Self bricking by Anonymous Coward · · Score: 1

    Sandforce controllers self-brick at the first sign of trouble to prevent competitors from reverse engineering their controllers. Or at least that is the reason stated for their crappy design. IIRC, Intel developed a customized version that has better failure modes.

  77. Blackbox bricked by Reason by ElitistWhiner · · Score: 1

    This is the uncanny valley in which the world of REAL slowly sinks, sinking...sunk into the technological relative world of NOW.

    There is no bridge between. You stand stranded on the shores of reason while the world in which you live sinks away, out of sight and out of mind.

    Millennials know the futility of questioning the NOW; it's irrelevant to wonder 'why?' Just BE now!

    Enlightenment as to why, what went wrong - much less how to prevent bad things is not among possibles. Shit happens!

  78. Re:The solution is easy by brantondaveperson · · Score: 1

    Exactly this. I bought a shit SSD, it lasted three years. Not too bad, I suppose. When it died, which it did last week, I was back up and running in an afternoon - including the time taken to drive to the store and buy a new one.

    It's a really odd article in any case, why be so paranoid about the precise failure modes? Hardware is hardware, and it can break. Plan for it, and you won't have any problems.

  79. Re:Department of Computer Science --- are you sure by ceoyoyo · · Score: 1

    I'm not sure what they call Computer Science these days, but my bachelor's had a required digital design component. We started by wiring together transistors to build a gate. When you'd demonstrated that, you could use 74HC00s, and you had to build an adder. When your adder worked, you were allowed to use an ALU chip. You had to set the thing up with supporting logic and DIP switches and invent a machine code to demonstrate instruction processing and register transfers.
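
    The 74HC00 is a quad 2-input NAND gate, so the adder step of that exercise can be sketched using nothing but NAND. This is a software illustration only; the function names and the 4-bit default width are choices made for the example, not part of the original coursework.

```python
def nand(a, b):
    """2-input NAND on bits 0/1, the only primitive used below."""
    return 0 if (a and b) else 1


def xor(a, b):
    """XOR built from four NAND gates."""
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    partial = xor(a, b)
    sum_bit = xor(partial, carry_in)
    # (a AND b) OR (partial AND carry_in), expressed with NAND only.
    carry_out = nand(nand(a, b), nand(partial, carry_in))
    return sum_bit, carry_out


def ripple_add(x, y, bits=4):
    """Chain full adders bit by bit: returns (result, final_carry)."""
    carry = 0
    result = 0
    for i in range(bits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```

    `ripple_add(5, 6)` returns `(11, 0)`, and `ripple_add(15, 1)` returns `(0, 1)` because the 4-bit result overflows into the carry.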

    In the compiler class we started out by writing a simulator for that hardware, then an assembler, then a compiler.

  80. Yes. I am. by bdwoolman · · Score: 1

    Because yes I am your God, man.

    --
    "No fear. No envy. No meanness." Liam Clancy
  81. Re:Wow by jddimarco · · Score: 1

    I'm not sure if you're a troll and are trying to evoke annoyance, or if you suffer from severe reading comprehension difficulties and are trying to evoke pity. In me, you evoke both.

  82. Re:Department of Computer Science --- are you sure by Aighearach · · Score: 1

    I was actually thinking that if they had more understanding of the hardware, they'd have a better idea what the layers actually are, and they'd end up with more portable code not less portable code as you seem to imply. Knowing about how hardware works helps to be more hardware agnostic, because if you're using intermediate layers with no idea of the hardware and OS coupling that it creates then you'll do it more often.

  83. Re:Department of Computer Science --- are you sure by bobbied · · Score: 1

    I was actually thinking that if they had more understanding of the hardware, they'd have a better idea what the layers actually are, and they'd end up with more portable code not less portable code as you seem to imply. Knowing about how hardware works helps to be more hardware agnostic, because if you're using intermediate layers with no idea of the hardware and OS coupling that it creates then you'll do it more often.

    Yea, I see what you are saying, but remember they are stamping out CS degrees with little more than Java and database skills. The whole point of Java was to let you ignore all that hardware stuff through abstraction layers anyway. Most of them don't need to know how to dig through all those layers to do what they need, and with object-oriented concepts, hardware is becoming trivia to them.

    But I agree, a bit of understanding of hardware is a good thing, especially when you start talking recursion and how pointers/references are actually working. I've always been amused at the BSCS holders who didn't understand what the call stack was or how they were killing performance with all the objects going in and out of scope, or why the math was being done using integers when they wanted floating point (or vice versa). I just don't know if they have the scope in an undergraduate CS curriculum to throw that stuff in. Many won't need it, use it, or remember it anyway.

    --
    "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101