Slashdot Mirror


Why I'm Usually Unnerved When Modern SSDs Die on Us (utoronto.ca)

Chris Siebenmann, a Unix Systems Administrator at University of Toronto, writes about the inability to figure out the bottleneck when an SSD dies: What unnerves me about these sorts of abrupt SSD failures is how inscrutable they are and how I can't construct a story in my head of what went wrong. With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that; perhaps the spindle motor drive seized or the drive had some other gross mechanical failure that brought everything to a crashing halt (perhaps literally). SSDs are both solid state and opaque, so I'm left with no story for what went wrong, especially when a drive is young and isn't supposed to have come anywhere near wearing out its flash cells (as this SSD was).

(When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming. Probably there are potential undetected manufacturing flaws in the flash cells and so on, though.) When I have no story, my thoughts turn to unnerving possibilities, like that the drive was lying to us about how healthy it was in SMART data and that it was actually running through spare flash capacity and then just ran out, or that it had a firmware flaw that we triggered that bricked it in some way.

358 comments

  1. With spinning disks, you do not know either by gweihir · · Score: 5, Insightful

    Seriously, you do not. You may know the end-result sometimes (head-crash), but the root-cause is usually not clear.

    So get over it. It is a new black-box replacing an older black-box.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    1. Re:With spinning disks, you do not know either by 110010001000 · · Score: 5, Insightful

      What is unnerving is that a guy from the Department of Computer Science thinks that SSDs are theoretically immune to manufacturing failures.

    2. Re:With spinning disks, you do not know either by froggyjojodaddy · · Score: 5, Insightful

      From the article:

      "Further, when I have no narrative for what causes SSD failures, it feels like every SSD is an unpredictable time bomb. Are they healthy or are they going to die tomorrow? "

      Emphasis mine. I feel like this guy has opportunities to improve his coping mechanism. For someone in Computer Sciences, it seems like he's way too worried about this. I'm not trying to be mean, but it's like if I got into a car accident and then questioned the entire safety design of all vehicles rather than just taking a few steps back and understanding it's a freak event, but not a totally unexpected one. If you've been driving for 30 years, statistically, you're likely to get into at least one accident, even if it's not your fault

    3. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Do you ever stop talking nonsense? Why comment when you know nothing about the specialist area? If you had bothered to spend 30 seconds with a search engine before typing you would have already known that, but as usual you decide to spout nonsense about something you don't understand.

      The root cause is nearly always clear with modern HDD drives. The industry has invested huge amounts investigating HDD failures over the last 20 years and we fully understand all the major failure mechanisms, and it's pretty straightforward, in most cases, to identify the origin point for the failure.

    4. Re: With spinning disks, you do not know either by chaboud · · Score: 1

      Oh, to have mod points....

      There is no other correct take but this. Solid-state does not mean "immune to wear", and anyone in a CS program should be aware of it. Anyone *teaching* a CS program should be embarrassed about this.

    5. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 1

      Yes, this is why RAID exists and why it is still equally valid with SSDs. It is also why we have good backup systems otherwise we don't sleep well at night.

      It also greatly depends on the class of SSD. I've never had an enterprise SSD die on me after going into production. They are either DOA or chug along nicely. You get what you pay for as they are quite a bit more expensive. It is also why you run a few disk benchmarks before going into production to verify build quality.

      Of course in terms of SANs and Enterprise SSDs they often give me an expected lifetime with the drive. Most of them are happily working at over 270% of their expected life, but that is a risk people take. In my case there aren't a whole lot of writes happening so I'm not worried about it. Again, it is in a RAID, so if one should give out then I have time to fix the problem. If both fail then I have good backups, so while I may take a brief outage it won't be the end of the world.

    6. Re:With spinning disks, you do not know either by AmiMoJo · · Score: 4, Informative

      Often SSD failures can be predicted or at least diagnosed by looking at SMART data. That's what it's for, after all. Some manufacturers provide better data than others.

      Like HDDs, sometimes the electronics die too. Usually a power supply issue. Can be tricky to diagnose. SSDs are slightly worse, as with HDDs you can often replace the controller PCB and get them working again, whereas SSDs are a single PCB with the controller and memory.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    7. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Lol if you understand what you are talking about it's best to avoid gweihir's posts. Like most of Slashdot's pseudo techs he doesn't have a clue. Best way is to just view their posts as amusement, like when you used to ask the stupid kid stupid questions at school just to see how bat shit crazy the answer is

    8. Re:With spinning disks, you do not know either by alvinrod · · Score: 2

      Or to learn what causes SSDs to fail. Just because something appears unpredictable doesn't mean that it is so. If he doesn't have the time to devote to investigating this issue and acquiring any requisite knowledge that will help him to uncover the truth, then he probably shouldn't be squandering any of that precious time whining or worrying about things that are out of his control.

    9. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      To give him some slack: HP with their orbital super computer project at the ISS did not expect the SSDs to suffer the failures they did. All that radiation resistance may cover cells, but the controller boards are a different matter. People live and think in bubbles due to the complexity of the real world. Also the technology is still new, as the first few papers on real-world reliability of large-scale flash storage were published only a few years ago.

    10. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Funniest (sic) failure I've ever seen from a spinning HD was in the early days of the 80
      and 100 Megabyte hard drives -- the paint on the internal magnets peeled keeping the
      head's arm from moving! Couldn't believe it when I opened it!!

      But the author has a point. Solid state should last into the decades. Old stereos from
      the 1970's are still in service (some of them), and it's usually the switches / solder joints
      that fail, not the semiconductors (unless abused). Honestly, I think the mfgs don't care
      since they can sell 'em another if it stops working cause they realize: there's no competition,
      no requirement (market driven, statute, or otherwise) to increase the quality, and they
      know you're a captured market - it's (almost) a necessity to have those electronics.

      Also, products are not made in the U.S. of A.; there's a huge cultural difference in how
      the Chinese think about the value of human life and it definitely trickles down to how they
      do business. This is quite different than the Japanese who, after WWII copied things to
      "perfection" (which is how they inadvertently kill U.S. auto manufacturing) and became the
      nation they are today the "honest" way. Unlike their subordinates who have no problem
      stealing technology with little understanding of how it's supposed to work.

      CAP == 'partly'

    11. Re: With spinning disks, you do not know either by 110010001000 · · Score: 1

      What "specialist area"? I am a specialist in all things. Any moron knows any manufactured thing isn't theoretically immune to manufacturing failures. Not sure why you are talking about "root cause". I never mentioned that.

    12. Re:With spinning disks, you do not know either by ctilsie242 · · Score: 5, Funny

      Could be worse. At a previous job, I've had someone demand "7200 RPM SSDs", and no amount of explaining could change the person's mind.

    13. Re:With spinning disks, you do not know either by Stonent1 · · Score: 4, Interesting

      Ok, I'm in IT and it unnerves me. I've had numerous computers have an SSD totally die and lose all data with no SMART warnings in the last few years. (Not me personally, I mean people at our organization)

    14. Re:With spinning disks, you do not know either by 110010001000 · · Score: 1

      That sounds about right for HP in 2018.

    15. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 1

      To get a 7200 RPM SSD, just put it in a centrifuge. Tell your coworker the centrifuge will separate good data from bad.

    16. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      I use an SSD for primary drive, spinning for backups. Somehow that makes me feel better.

    17. Re: With spinning disks, you do not know either by omnichad · · Score: 3, Interesting

      Older SSDs didn't even have a wear-leveling SMART attribute or total host writes attribute. Some of the cheaper ones probably still don't. So there is no way to see how close you're getting to the estimated upper limit. There is a pretty clear progression on the newer drives. With hard drives, mechanical failure is actually less predictable than SSD wear-out (defects aside).
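
      For drives that do report total host writes, the "how close am I to the estimated upper limit" arithmetic can be sketched roughly as below. This is only an illustration: the function names are made up, the rated endurance (TBW) is whatever figure the vendor's datasheet gives, and the estimate ignores write amplification and says nothing about defects or controller failures.

```python
def endurance_used_pct(host_bytes_written, rated_tbw):
    """Percent of the rated write endurance already consumed (may exceed 100)."""
    rated_bytes = rated_tbw * 1e12  # assumes TBW is quoted in decimal terabytes
    return 100.0 * host_bytes_written / rated_bytes


def days_until_rated_limit(host_bytes_written, rated_tbw, power_on_hours):
    """Days left at the average write rate so far; None if no rate can be computed."""
    if host_bytes_written <= 0 or power_on_hours <= 0:
        return None
    bytes_per_day = host_bytes_written / (power_on_hours / 24.0)
    remaining = rated_tbw * 1e12 - host_bytes_written
    return max(remaining, 0.0) / bytes_per_day


if __name__ == "__main__":
    # Made-up numbers: 60 TB written over one year of power-on time, 150 TBW rating.
    print(endurance_used_pct(60e12, 150))
    print(days_until_rated_limit(60e12, 150, 8760))
```

      A drive past 100% of its rating is not necessarily about to die, and one far below it can still fail for other reasons; the number only tracks the wear-out part of the story.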

    18. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      a $10 grinder from harbor freight would suffice

    19. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0


      What is unnerving is that a guy from the Department of Computer Science thinks that SSDs are theoretically immune to manufacturing failures.

      He's just some random UNIX system admin, not a computer science professor. He's obviously not got a terribly good handle on hardware failure, which likely makes sense if he's a sys admin. I take the entry to be more of a "hey, I thought this wasn't supposed to happen based on my somewhat limited understanding of this" rather than "I'm an expert, and this SHOULDN'T happen!".

      The real question (maybe not that shocking though) is why did a not very enlightening article written by someone without a very good understanding of hardware wind up on Slashdot?

    20. Re:With spinning disks, you do not know either by Comboman · · Score: 3, Informative

      Mod parent up. The most common cause of a sudden, unexplained failure for both HDs and SSDs is a failure of the controller rather than the media.

      --
      Support Right To Repair Legislation.
    21. Re:With spinning disks, you do not know either by jellomizer · · Score: 2

      I find a lot of fear around new technology to be the same as the fear of flying.
      Where numbers all point to a better more robust product, there is just more anxiety for when something goes wrong, mostly because when it does, there is little to do to fix it.

      If an old spinning drive failed, you could sometimes put it in the freezer, power it up and get the data off, or if you are more technical you could open it up and move the data disks to another drive.

      But for the most part, Standard best practices of keeping backups and/or having the correct RAID on your drives is the best option to keep the data safe. Solid State or mechanical, they can always fail. The solid state could fail from a power surge, or just excessive heat, or just a fault in the build.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    22. Re:With spinning disks, you do not know either by jellomizer · · Score: 4, Funny

      That is why I always stick to real to real 9 track paper tape. If you can't see the bits you just can't trust it.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    23. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 5, Insightful

      All for the SAME reason- the wrong type of cell failed, and the crappy software doesn't know how to recover. The software systems of the SSD and the OS driver side are written by idiots.

      A low level tool that knows your particular SSD driver chipset could trivially access the vast majority of flash cells on your SSD drive. But what good is that FACT if the tools are not readily available?

      And SMART warnings do NOT apply to SSD drives. SMART is for electro-mechanical systems with statistical models of gradual failure. SMART is FAKED for SSD.

      A catastrophic SSD failure is when the 'wrong' memory cell dies, and the software locks up. Since all memory cells are equally likely to die at some point, this is a terrible fault of many of these drives.

    24. Re:With spinning disks, you do not know either by I-am-a-Banana · · Score: 3, Interesting

      Seriously, you do not. You may know the end-result sometimes (head-crash), but the root-cause is usually not clear.

      So get over it. It is a new black-box replacing an older black-box.

      Well I need to partially disagree with you there. With a traditional drive, when it fails and you take it apart carefully you can try and determine what happened. If it was a head crash you may be able to see what caused the head crash. In my case it was a Quantum or Maxtor drive that had 3 extra screws shipped in it, loose where the inside control circuitry was. You could tell if it was a frozen motor, or if you are lucky find that the external board had a fried electrical component on it. For friends I desoldered the fried component and put a new one on and the drive worked perfectly. Obviously we copied the data off of it onto something new, then we put the drive into storage for safe keeping. With the older drives there is the small chance of repair. Yes, there are companies out there that will disassemble the drive, remove the platters, and put them into another working drive to recover data. Obviously with a head crash you may not be able to recover everything, but in absolute necessity you could. Or you could just be a nerd that wants to do an investigation to find out why. With SSDs, however, there is no chance of fixing it, and no chance of knowing exactly why. However, I don't know why he would say that SSDs shouldn't have manufacturing defects. They do. They are just not mechanical, but I would hope that because they are not mechanical they would be less likely to be defective.

    25. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      *handwaves* COMIC RAYS. Like a stray bullet from the great void.

      My answer - just motherfuckin deal with it! Shit happens

    26. Re:With spinning disks, you do not know either by Luckyo · · Score: 1, Interesting

      You do actually. Many if not most disk failures have clearly predictable markers. This has been true for quite a long time at this point, to the level where my last two HDD failures in my home machine were diagnosable with no tools beyond a SMART reader. Better yet, they weren't "instant" failures, but signs of impending failure of the drive started appearing months in advance with clear cut warnings on the SMART readout. This resulted in sufficient time to buy a new drive and migrate all the data with no problems.

      With SSDs, failure has a problem with being utterly opaque and sudden. This is likely more of a function of lack of expertise due to lack of time though, as it took us decades to get hard drive monitoring systems to where they are now.

    27. Re:With spinning disks, you do not know either by Sponge+Bath · · Score: 4, Funny

      Tell this person you could only find 7199 RPM SSDs, but if they spin in an office chair while using the system it will make up the difference.

    28. Re: With spinning disks, you do not know either by Type44Q · · Score: 1

      The word you were looking for is "trim."

      Cha-ching.

    29. Re: With spinning disks, you do not know either by omnichad · · Score: 1

      No, it's not. TRIM has nothing to do with wear leveling - and especially monitoring it over time, except that it might happen on a more efficient schedule.

    30. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 1

      Indeed. Although media failure will happen eventually. At a certain time there will be no more spare blocks to replace worn out blocks. A good controller will still offer the drive as a read only disk, so you can copy almost everything.

      SSD control software is incredibly complex, it is a binary of easily 5MB. Of course there are bugs in it.

    31. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      There's more truth in this sentiment than some folks would like to admit. -PCP

    32. Re:With spinning disks, you do not know either by R3d+M3rcury · · Score: 2

      Exactly. I've had bad DRAM before which caused the occasional inexplicable crash. I don't see any reason why SSDs would somehow be immune from this.

      That said, most SMART codes are for mechanical hard drives. I wouldn't be surprised to discover that there isn't really a good way to test reliability for SSDs, so the SMART codes always come back as "A-OK!"

    33. Re:With spinning disks, you do not know either by Chewbacon · · Score: 1

      Yep. I've had a number of spinning drives just drop dead on me. Some advice: Western Digital makes returns pretty easy for their drives and, when it comes to all drives, backup regularly/often!

      --
      Chewbacon
      The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
    34. Re:With spinning disks, you do not know either by greenwow · · Score: 2

      I disagree that SMART data helps with diagnosing failures. I save the output of "smartctl -a /dev/?" every night for every drive on every server. I haven't seen anything that predicted the huge number of SSD failures that you have with heavy use. We started using them three years ago when we started buying servers with 2.5" drive bays. I think we've replaced the ~75 drives about 120 times. Yes, more than once. If someone could come up with a way of predicting failures, they would become rich.
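
      For anyone keeping nightly snapshots like this, one way to surface changes between two saved attribute tables is sketched below. It assumes the classic ATA attribute table layout that smartctl prints (ten whitespace-separated columns ending in RAW_VALUE); the sample text, function names, and watch list are made up for illustration, attribute names vary by vendor, and nothing here claims that a change actually predicts a failure.

```python
# Illustrative snapshots only; real output has many more rows and vendor-specific names.
YESTERDAY = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4312
"""

TODAY = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4336
"""

# Hypothetical watch list; adjust to whatever your drives actually report.
WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Reported_Uncorrect")


def parse_attributes(text):
    """Map attribute name -> integer raw value for rows that look like table rows."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Table rows start with a numeric ID and have a numeric tenth column.
        if len(parts) >= 10 and parts[0].isdigit() and parts[9].isdigit():
            attrs[parts[1]] = int(parts[9])
    return attrs


def increased(old_text, new_text, watch=WATCH):
    """Return {name: (old, new)} for watched attributes whose raw value went up."""
    old, new = parse_attributes(old_text), parse_attributes(new_text)
    return {
        name: (old[name], new[name])
        for name in watch
        if name in old and name in new and new[name] > old[name]
    }


if __name__ == "__main__":
    print(increased(YESTERDAY, TODAY))
```

      A sudden controller death will not show up in any such diff, which is consistent with the experience described above.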

    35. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      That is why I always stick to real to real 9 track paper tape. If you can't see the bits you just can't trust it.

      https://www.youtube.com/watch?v=am9BqZ6eA5c

    36. Re:With spinning disks, you do not know either by ShanghaiBill · · Score: 1

      All for the SAME reason- the wrong type of cell failed, and the crappy software doesn't know how to recover.

      So it is basically bad software? Are there SSD brands with less crappy software than others?

      Is there data on reliability, like there is for HDDs?

      To be fair, I believe this is becoming less of a problem. I saw SSDs fail often in the early days of flash, but not recently.

    37. Re:With spinning disks, you do not know either by Headw1nd · · Score: 1

      The author mentions manufacturing errors as a possible source, but I think his question is an error in what, and if it's an error on silicon, why would it only show up after months of operation? Some people have more curiosity about the things they use, and want more of an explanation than "oh sometimes they just fail."

    38. Re:With spinning disks, you do not know either by 110010001000 · · Score: 2

      That's nice, but that isn't relevant to what I wrote. I commented that it is unnerving that he thinks that SSDs are theoretically immune to manufacturing failures. There are a lot of reasons why an SSD can fail. Soldered joints can fail. There are various bonds that can also fail.

    39. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0


      I've had numerous computers have an SSD totally die and lose all data with no smart warnings in the last few years.

      The same things happen with spinning disks. They also have controllers that can suddenly, and without warning fail.

      The problem is that people were sold on this idea that spinning disks die because of physical problems, and SSD will never die because it doesn't have moving components. It's just not true, and it's never going to be true.

    40. Re:With spinning disks, you do not know either by AmiMoJo · · Score: 1

      With hard drives a sure sign of imminent failure is the sector retry or reallocation count increasing.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    41. Re:With spinning disks, you do not know either by DigiShaman · · Score: 1

      It's usually because of the controller or RAM / cache errors in processing that corrupt the firmware or the dynamic LBA flash block allocation table (database). This renders the rest of the NAND flash partially or totally inaccessible. Quality "prosumer" drives are supposed to have extra hardware (capacitance) to prevent half-writes upon a dirty shutdown (abrupt loss of power). But regardless, any corruption on write-back can render the drive "bricked".

      I'm not sure if SSD cache uses ECC or not, but they should if they don't already. I know that they will throttle when the temps get too high, which should prevent corruption at the expense of performance. So that at least is a good thing.

      If I recall, Intel SSDs in the past (not sure about now) were programmed to fail or drop to read-only after so many writes. It's like an odometer where you reach a certain distance and then, because it's programmed to do so, it fails. As though somehow that's being proactive? Whatever. I avoid Intel drives for that bullshittery.

      --
      Life is not for the lazy.
    42. Re:With spinning disks, you do not know either by Junta · · Score: 1

      This is why I shake my head when I see someone going to a lot of trouble to track SMART data to 'know' when a disk is going to fail. It just makes it all the more disappointing when a drive fails and all the early warning effort did nothing.

      It is a much more robust approach to be able to not *care* whether you see the failure coming than to try to plan for an outage. SMART has no idea that a component on the controller board is going to burn out suddenly. Yes, it can track things with known duty cycles, but with drives nowadays you have probably retired the drive long before that threshold will be reached, and the failure modes likely to smack you in production are ones that SMART will not catch.

      --
      XML is like violence. If it doesn't solve the problem, use more.
    43. Re:With spinning disks, you do not know either by Junta · · Score: 2

      Old stereos from
      the 1970's are still in service

      Well, old stereos from the 1970s that are still working are still in service. No one talks about the old stereos that died in the 70s because that's boring.
      SSDs are going to be in the same boat. Like all other electronics, some have a ticking time bomb and will probably fail within the first 5 years or so. Those that have the perfect voltage regulation and capacitors and such will last until their NAND wears out and they could also seem long lived (except the capacity is going to be so pathetic that no one is going to want to hold on to those, while a 1970s stereo is still perfectly capable of putting out good sound).

      --
      XML is like violence. If it doesn't solve the problem, use more.
    44. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      I heard that outside of "wearing out", the biggest cause of failure by far is the controller, which fails at roughly the same rate as in spinning rust. Overall, SSDs fail at roughly the same rate as mech drives, if you ignore the mechanical part of mech drives. The topic rant sounds like someone who would rather have a drive that is over 2x more likely to fail because he better understands the additional 100% of failure modes.

    45. Re: With spinning disks, you do not know either by Bengie · · Score: 1

      Wear leveling is much less effective without TRIM. TRIM reduces write amplification by letting the wear leveling algorithm know when some data is no longer referenced.
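
      Write amplification is usually expressed as the ratio of bytes the controller writes to NAND versus bytes the host asked it to write. A minimal sketch of that ratio follows; the function name is made up, the numbers are illustrative, and whether a given drive exposes a NAND-writes counter at all is vendor specific.

```python
def write_amplification(nand_bytes_written, host_bytes_written):
    """Ratio of physical NAND writes to host writes; 1.0 means no amplification."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return nand_bytes_written / host_bytes_written


if __name__ == "__main__":
    # Made-up counters: the drive wrote 300 GB to flash to store 100 GB of host data.
    print(write_amplification(300e9, 100e9))
```

      The sketch only defines the metric; it makes no claim about how much TRIM moves it on any particular drive.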

    46. Re:With spinning disks, you do not know either by SuperKendall · · Score: 2

      This is why I try to buy more expensive and higher performance SSD drives (like the Samsung EVO line) - but I have to admit I have absolutely zero idea if the chipset on the more expensive drives is really any better at all. It just seems likely the design would be better in some ways or a bit more fault tolerant.

      Even that strategy I know can fail though, a few years back one of the most expensive Sandisk Pro SD cards just died out of the blue. It happened while I was at a photography convention where Sandisk was actually present, including a tech that had a full suite of SD analysis tools with him - and even he could get absolutely nothing from the SD card...

      I still back up regularly, really the only thing you can do in a world where SSD drives may just fail whenever.

      --
      "There is more worth loving than we have strength to love." - Brian Jay Stanley
    47. Re:With spinning disks, you do not know either by Immerman · · Score: 2

      I believe Intel SSDs are programmed to "self brick" when they fail, or at least they used to be. I remember thinking that was a spectacularly stupid way to fail, and the read-only mode would be much preferable. Yes, your computer will likely crash hard in short order either way, but at least with read-only mode you could get (most of) your most recent data off it

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    48. Re: With spinning disks, you do not know either by datavirtue · · Score: 1

      Yeah. Quit your bitching and break out a microscope if you really want to know. No one ever really got to the root cause of mechanical drive failures either. Most of them "died" because the screws that hold them together lost enough torque to allow the body to warp enough to "fail"... Click, click, click.

      Never forget watching that video where a guy used a cheap torque driver to "repair" dead drives.

      --
      I object to power without constructive purpose. --Spock
    49. Re: With spinning disks, you do not know either by omnichad · · Score: 1

      Yeah, it reduces it. Not by a huge amount. But the point was that there aren't SMART attributes on older SSDs to track wear-leveling. TRIM has nothing to do with that. If you read up-thread you'll see how non-sequitur the GP post is.

    50. Re: With spinning disks, you do not know either by guruevi · · Score: 1

      Perhaps you should invest in the data center SSD or even SLC if you have that many problems. I had the same problems with various brands but the Intel DC solved most of the problems.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    51. Re: With spinning disks, you do not know either by datavirtue · · Score: 2

      Yep. Pure and Nimble already did. They got rich.

      --
      I object to power without constructive purpose. --Spock
    52. Re: With spinning disks, you do not know either by schure · · Score: 1

      Guy grew up playing Minecraft. If he had only played Lego instead...

    53. Re:With spinning disks, you do not know either by Immerman · · Score: 1

      In addition to the survivor bias mentioned by Junta, there's also transistor size to consider. The smaller a transistor is made, the more sensitive it is to any manufacturing imperfections, and the faster electromigration and other forms of normal wear and tear take their toll. Squeeze a billion of them on to a postage stamp, and even the most reliable one won't compare to the reliability of a well made canister transistor from the 70s.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    54. Re:With spinning disks, you do not know either by Aighearach · · Score: 1

      I'm not surprised.

      His problem is philosophical, not technical. Why would a CS guy be good at philosophy? That would only be likely if he was also interested in philosophy, which is unpopular in CS echo chambers.

      Does the act of constructing a narrative tell you what happened? No. Should possession of a narrative be a basis for risk assessment? No.

      When some uncommon but expected event happens, whether or not you felt like you succeeded at constructing a narrative does not tell you anything about the frequency of the risk, and you shouldn't think you have that sort of information. Instead of admitting to feeling "unnerved," he should see this mistake and be embarrassed by it. Not because he's bad at philosophy and felt unnerved, but because he can't comprehend storage failure rates that are well-studied and have hard data available, and blathered about his bad philosophy instead of looking up the numbers and known causes.

      Ultimately he should stop putting value on this "unnerved" feeling. It isn't a real thing; it is a feeling you get when you stubbornly insist on pretending you already understand things after receiving information that tells you that you don't. It is a type of cognitive dissonance. Dismissing the feeling, instead of assessing it as valuable, is the way to make it go away. Just accept the new information, and understand that feeling unnerved is maladaptive unless you're wandering in a dangerous forest trying not to get eaten by a Cave Bear.

    55. Re:With spinning disks, you do not know either by Aighearach · · Score: 1

      You should get one of those "1000 Electronic Projects" kids sets and learn youse some hardware.

      Sometimes the magic smoke comes out. Sometimes you don't even see the smoke come out. And yet, being encased in plastic so that you can't see the metal doesn't stop the ICs from letting out the magic.

      Fry a few transistors and you'll understand, there is nothing to be unnerved about. The plastic covering that hides the IC is not even the magic!

    56. Re:With spinning disks, you do not know either by gweihir · · Score: 4, Interesting

      Well, I originally bought OCZ. Today _all_ of 5 OCZ drives I got are stone-dead. After that I moved to Samsung, mostly "Pro". They are all still working fine and some are older now than the first OCZ when it died. So yes, it makes a difference. Incidentally, Samsung had excellent reliability in their spinning drives as well. It seems they just care more about quality and reputation.

      That said, I find it sad that you cannot get "high reliability" SSDs where you basically can forget about the risk of them dying. I am talking reliability levels like a typical CPU here. It seems the market for that is just not there.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    57. Re:With spinning disks, you do not know either by Aighearach · · Score: 1

      Nobody pays extra for drives that have built-in data forensics, so nobody wrote the feature.

      It isn't about crappy software, it is about software that only completes the assigned tasks in the most efficient way possible. That means they actually tear out most of the capabilities of the controllers in the process of making ASICs.

      SMART is "fake" in the sense you mean it for spinning drives, too. Duh. There isn't a magic elf from SMARTland inside the drive. It is simply that less of the data is useful.

      It isn't software in the sense that you talk about it, where you have a general purpose computer sitting there idle most of the time and you could easily just have it do some extra work. It is a tightly coupled collection of circuits that only do very narrow, specific things. Increasing the capabilities lowers performance, because that is how tight the timings already are.

      My advice, buy a bag of AVR microcontrollers and write some firmware. Then buy a cheap FPGA and try that. When you can do both, you'll be ready to understand what goes into the "software" on a HDD.

    58. Re:With spinning disks, you do not know either by viperidaenz · · Score: 5, Insightful

      SMART should be able to provide the number of remapped sectors. There should be manufacturer specific counters for the amount of over provisioning that is left for remapping too. That should tell you precisely when you should plan to replace an SSD due to age.
      How hard would it be to notify something that the drive can't handle any more dead cells, so should not be written to any more? Or that it is down to x% of spare nand?
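
      A minimal sketch of the kind of notification asked for above, in Python. The counters `spare_total` and `remapped` and the 20% warning threshold are hypothetical; real drives expose vendor-specific attributes, and nothing here reads an actual device.

```python
from dataclasses import dataclass


@dataclass
class SpareStatus:
    spare_total: int  # hypothetical: blocks reserved for remapping when the drive was new
    remapped: int     # hypothetical: blocks already remapped to the spare area

    @property
    def spare_left_pct(self) -> float:
        """Percentage of the spare area that is still unused."""
        if self.spare_total <= 0:
            return 0.0
        left = max(self.spare_total - self.remapped, 0)
        return 100.0 * left / self.spare_total


def advise(status: SpareStatus, warn_pct: float = 20.0) -> str:
    """Map remaining spare capacity to a coarse recommendation."""
    pct = status.spare_left_pct
    if pct <= 0.0:
        return "read-only"      # no spare left: stop writing to the drive
    if pct <= warn_pct:
        return "replace-soon"   # spare area is nearly used up
    return "ok"
```

      Note that this only covers wear-related decline; it says nothing about a controller that dies outright.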

    59. Re:With spinning disks, you do not know either by Aighearach · · Score: 2

      Nope. You're not paying for different control ICs; where you actually get something from paying more, it would be higher speed or higher yield rates on the memory chips.

      Higher yield rates will translate into lower runtime failure rates.

      You're not going to learn much from the wrong side of the controller, because customers at all levels refuse to pay extra for built-in forensics. And you'd have to choose between extra silicon that normally isn't even used, or extra power use. It won't be free.

      You have to get at the pins of the memory chips and interface them to a forensic tool. Usually it is probably simplest to unsolder them and put them on a breakout board. You could typically get most of the data back that way. If partial data is really that meaningful to you.

      Most people don't care; partial data is worthless to them. They either had a backup, or didn't. Probably only cops, criminals, and spies want people's data that bad.

    60. Re: With spinning disks, you do not know either by Aighearach · · Score: 1

      Anyone *teaching* a CS program should be embarrassed about this.

      He should spend a day standing in front of the EE department wearing a wizard hat and a sign, "Computers are Not Magic. I repent!"

    61. Re:With spinning disks, you do not know either by Kjella · · Score: 1

      If you can't see the bits you just can't trust it.

      My dad used to feel the same way about vacuum tubes and magnetic core memory. As long as you could use a scope and inspect the single bits you could always get to the bottom of it. Yes, it was a looong time ago.

      --
      Live today, because you never know what tomorrow brings
    62. Re:With spinning disks, you do not know either by Aighearach · · Score: 1

      Even when a spinning drive made crunching noises, it is usually because the controller IC was hosed!

      It isn't like a three phase low voltage BLDC motor operating at low load is likely to die; dead drives all come out with working motors. The drive may not spin when connected as a drive. But when I buy a box of salvaged HDD motors (by the pound) there are likely to be none that are actually bad. That's true even in a 25lb box, which is a few hundred motors, many of which came from dead drives.

      And the head driver is basically just a voice coil; how often does the voice coil in a speaker go bad? Basically never. All the other hardware around it is likely to fail first. Same here. But if the wrong transistor dies in the controller, then the feedback loops won't keep it from crashing into the end of the throw, or oscillating in a way that makes a crunching sound.

    63. Re: With spinning disks, you do not know either by Zero__Kelvin · · Score: 1

      I would have told them to go online and pick whichever one they wanted.

      --
      Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
    64. Re:With spinning disks, you do not know either by SuperKendall · · Score: 1

      Yeah, I agree that the recovery aspect is not as important; as you say, people either have backups or do not... I am just hoping I get some extra longevity in components (the lower runtime failure rates you mentioned).

      The extra performance is nice as well but I am even more interested in something I can be pretty sure will last 3-4 years at least with moderate to heavy use.

      Probably only cops, criminals, and spies want people's data that bad.

      Well, I have run into a number of people over the years that lost pictures or important documents (like a whole book they had written), to them thousands of dollars would have been OK if it could actually get the data back (and these were people that did not have a lot of money).

      I think at least these days people do understand a little better how important a backup is after getting burned, but reliable backups, I feel, are still not an easy thing for most non-technical people to achieve (outside of mobile devices).

      --
      "There is more worth loving than we have strength to love." - Brian Jay Stanley
    65. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      There's also the way he anthropomorphizes machines:

      Are they healthy or are they going to die tomorrow?

      Machines are not alive. They do not die.
      Added to his inability to recognize the failures as just freak events it leads me to believe that his thinking is sloppy.

    66. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      He probably sees a lot of SSD's that are healthy today and die tomorrow - instead of indicating that they are near their end of life.

    67. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      I find a lot of fear around new technology to be the same as the fear of flying.

      Where numbers all point to a better more robust product, there is just more anxiety for when something goes wrong, mostly because when it does, there is little to do to fix it.

      You are overselling. There are real well known tradeoffs.

      Read error rates of SSDs are higher.

      Bitrot when powered off for sustained periods is higher.

      Use of SSD for some workloads including virtual memory, continuous data recording/backup applications is out of the question due to inferior write endurance.

      Spinning rust also benefits from decades of predictive analysis baked into commercial storage arrays. Majority of the time impending disk failure can be detected and orderly migration to a hot spare completed before drive failure exerts any pressure on redundancy margins of an array.

    68. Re:With spinning disks, you do not know either by prisoner-of-enigma · · Score: 2

      I think you may be missing his point. I've had SSD's die on me as well with absolutely no warning. What's unnerving about it is you have no idea why it failed. Good engineers like failure analysis; it helps determine if you're buying a crappy product, running your product out of spec, or any number of other metrics which can inform future purchases.

      Mechanical drives usually give you some indicator of why they failed in the form of horrible noises. SSD's don't give you much of anything. If neither SMART nor spare block allocation figures are out of spec you have nothing to go on. I've chalked these up to the controller on the drive itself failing but that's just a guess. I have no way to perform any additional diagnostics that might tell me more. As a result, I've simply avoided buying drives of that brand anymore. Crude, yes, but what other metrics can I use? I'm not talking about a single drive. It's happened to multiple drives of a similar make/model, all of which failed suddenly and gave no data afterwards I could use forensically.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    69. Re:With spinning disks, you do not know either by prisoner-of-enigma · · Score: 1

      Should possession of a narrative be a basis for risk assessment? No.

      Yes, it should, although only if the narrative doesn't involve a single case of failure. If you have multiple failures of a single brand or model, you should use that to inform future purchasing decisions. Knowing the cause of the failure could further inform. Random failures are to be expected, but if multiple failures are caused by the same defect occurring in a given product line or by a specific environment (workload, temperature, etc.), then you have some useful data to guide future product selection.

      The OP is lamenting the paucity of any kind of failure data. Basically he's left with the decision to forego purchasing that model -- or even that entire brand -- hoping it will improve reliability. Hope is not a strategy for engineers. We prefer data.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    70. Re:With spinning disks, you do not know either by geekmux · · Score: 1

      Seriously, you do not. You may know the end-result sometimes (head-crash), but the root-cause is usually not clear.

      So get over it. It is a new black-box replacing an older black-box.

      It's a pain in the ass when any hardware fails, especially prematurely. Never truly knowing why something fails is very frustrating for anyone who actually gives a shit enough to not want to repeat history. You know, like buying the same "reputable" brand/solution/model again.

    71. Re: With spinning disks, you do not know either by prisoner-of-enigma · · Score: 2

      Doesn't help if the controller fails. SLC flash has better write longevity but none of that matters if the controller bombs.

      Further, a sudden, catastrophic failure is (by process of elimination) almost certainly a controller failure. No matter if you're using SLC/MLC/TLC/etc. flash, cells don't die en masse. They usually die a little at a time. The controller expects this and remaps bad blocks to the spare area. Keeping track of spare area usage is one of the best ways to predict impending failure. If the controller fails then all that is for nothing even though (theoretically) all your data is still perfectly preserved on the flash itself.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    72. Re:With spinning disks, you do not know either by Marxist+Hacker+42 · · Score: 1

      One way to improve his coping mechanism, would be to start publishing everything

      Including manufacturer names and his own mean time to failure numbers.

      Bet that will increase quality control real quick. Or at least tell us who not to buy from because they're cheap chinese crap.

      --
      SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
    73. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Same here, OCZ followed by Mushkin, and finally Crucial, all failed. Switched to Samsung, never looked back.

    74. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Knowing my luck they would summon the CEO to complain about the extra 1.158 microsecond latency. Stupid nepotism for putting that know-it-all in my life.

    75. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Because again that's a big big simplification.

      Firstly the SSD needs a table of which physical locations are mapped to which LBA, and which physical locations are bad and unusable.

      And where is that table stored? In flash.

      So where is the info on that table's location? In flash.

      And what happens if that part of flash has an uncorrectable error? Well, they have a backup in a second location.

      But if both die at the same time?

      Well, you just lose everything.

      Sometimes a drive will start over, erase and test every block, and reconstruct a blank state with known good cells. This is why drives can "come good" after leaving them powered on for a very long time.

      But most manufacturers prefer to just brick at that point.
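
      To make the two-copy idea concrete, here is a rough Python sketch, not how any particular controller does it. The on-flash layout (a CRC32 followed by a JSON payload) and the names `pack_table`, `unpack_table` and `load_mapping` are all invented for illustration.

```python
import json
import zlib
from typing import Dict, Optional


def pack_table(table: Dict[int, int]) -> bytes:
    """Serialize an LBA -> physical-page table with a CRC32 prefix (made-up layout)."""
    payload = json.dumps(sorted(table.items())).encode()
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unpack_table(blob: bytes) -> Optional[Dict[int, int]]:
    """Return the table, or None if the stored copy fails its checksum."""
    if len(blob) < 4:
        return None
    stored_crc, payload = int.from_bytes(blob[:4], "big"), blob[4:]
    if zlib.crc32(payload) != stored_crc:
        return None  # models an uncorrectable error in this copy
    return {int(lba): int(page) for lba, page in json.loads(payload)}


def load_mapping(primary: bytes, backup: bytes) -> Optional[Dict[int, int]]:
    """Try the primary copy, then the backup; None means both are gone."""
    for blob in (primary, backup):
        table = unpack_table(blob)
        if table is not None:
            return table
    return None  # both copies unreadable: the drive cannot find any data
```

      When `load_mapping` returns None, the user data may still sit in the flash cells, but nothing knows where any of it is.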

    76. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      what a crock... in rare instances drives die suddenly, unexpectedly, but RAID controllers have this ability to report when a drive is predicted to fail. If you listen to a drive, you can usually tell when it's on the way out.

      So, bottom line is that you're a fucking tool.

      FOAD, HTH, HAND.

      hugs and kisses,

      Juan Epstein

    77. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Computers are magic. Technology is magic. The root word "techne" was taken from Latin. The meaning of the word is "to change reality" and was used the same way we use the word "magic."

    78. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Yep. Traditional HDDs often give you a bit of prior warning. It sounds different - thrashing, clicking, delayed spin up, delays as drive spins down/up for no apparent reason - or you notice a bit of other weirdness that implies that file-system corruption might have started. With SSDs you never get that: one day it's working, the next day it's dead.

      Having said that I still like SSDs, and so long as you keep good backups (which you should anyway) it's really not an issue.

    79. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Don't forget, in the southern hemisphere the disks spin in the opposite direction as they do in the north!

    80. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      What is unnerving is that a guy from the Department of Computer Science thinks that SSDs are theoretically immune to manufacturing failures.

      It's unnerving that he does understand mechanical failure and not electronic failure.

    81. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      That's not surprising at all. Computer science is by its nature abstracted from the mechanical world.

      It would be odd for a engineer to be ignorant in this area, but not for a computer scientist.

    82. Re:With spinning disks, you do not know either by thegarbz · · Score: 1

      Oh hey nice anecdote. Let me share some with you too. I have had RAM die unexpectedly without warning. I've had motherboards die unexpectedly without warning. I've had video cards die unexpectedly without warning. I've had CPUs go tits up in infant mortality. My last monitor just one day let smoke out for no reason what so ever. I have a PSU on order right now as the motherboard is throwing warnings about the voltage rails.

      And ... wait for it ... you know it's coming .... I've had HDDs die without warning and sure as hell no SMART warnings losing all data in the process.

      Electronics die. Often at end of life, statistically quite randomly, and even scarier sometimes shortly after being put in service. SSDs aren't unique, amazing or unnerving. SMART is not there to give you early warning of random failures, it's there to give you an attempt to predict wearout / end of life related failures. No parts are immune, and they sure as hell aren't unnerving.

    83. Re:With spinning disks, you do not know either by thegarbz · · Score: 2

      So much wrong in so little post, where to start:

      The software systems of the SSD and the OS driver side are written by idiots.

      Hardly. The software systems of SSD are written by people who know SSDs well. That you bought an OCZ drive is just unlucky. Firmware related failures were only common in the early days of SSDs.

      A low level tool that knows your particular SSD driver chipset could trivially access the vast majority of flash cells on your SSD drive.

      And would know none of what to do with it because wear leveling is not something you can predict and decode later. You can only store it. If the component which stores this knowledge is dead then nothing can save you.

      And SMART warning do NOT apply to SSD drives. SMART is for electro-mechanical systems with statistical models of gradual failure. SMART is FAKED for SSD.

      SMART is a system for drive reporting metrics. Nothing is "faked" for SSDs and SMART sure as hell isn't for mechanical related issues only. There are several SMART values specifically created to report SSD related wearout mechanism including 171 - flash program fail, 172 - erase fail, 173 -wear level count, 192 - unsafe shutdown, 194 - internal temperature, 226 - media wear, 233 - wearout indicator, 241, 242 - read and written.
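
      For illustration, the IDs listed above can be kept in a small lookup table. This is only a Python sketch: attribute numbering is vendor-specific, so the names below follow the list in this post rather than any one drive's documentation, and `label_attributes` is a made-up helper that works on values you have already read by some other means.

```python
from typing import Dict

# SSD-related SMART attribute IDs as listed above (meanings vary by vendor).
SSD_ATTRS: Dict[int, str] = {
    171: "flash program fail count",
    172: "erase fail count",
    173: "wear level count",
    192: "unsafe shutdown count",
    194: "internal temperature",
    226: "media wear",
    233: "wearout indicator",
    241: "total LBAs written",
    242: "total LBAs read",
}


def label_attributes(raw: Dict[int, int]) -> Dict[str, int]:
    """Attach a readable name to each raw attribute value; unknown IDs are kept."""
    return {
        SSD_ATTRS.get(attr_id, f"unknown attribute {attr_id}"): value
        for attr_id, value in raw.items()
    }
```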

      A catastrophic SSD failure is when the 'wrong' memory cell dies, and the software locks up.

      You're good at writing words without any meaning what so ever.

    84. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Samsung EVO are the cheaper ones. Usually they use a cheaper NAND type, and/or are slower than the Pro version. Best they make right now is the 970 Pro NVMe SSDs.

    85. Re:With spinning disks, you do not know either by thegarbz · · Score: 1

      Not exactly comparable.

      The issue with SSDs is that there really is only one wearout related failure mode, and that is reading and writing / life left. The problem with SSDs is that randomised failures dominate, which is perfectly expected given that the wear of a typical drive should see it run to the 10 year mark, which is well past the end of expected device life for consumer electronics. The exception to that is overheating, and that along with wearout can give you an indicator in SMART, but SMART does not typically show sudden and random failures.

      The difference from the classical HDD is that, for all the lack of mechanics, there is actually quite a lot of scope for random electrical failure in the components. They run hotter and harder, and are manufactured with cutting edge technology rather than tried and tested technology or technology with obviously accessible failure modes. This makes them far more likely than a typical HDD to just suddenly up and die.

      It also means that a well made drive should also outlive a HDD which is my own personal experience. I've not had anything other than first generation SSDs die on me, and for all the good SMART does in predicting failures, it doesn't do shit in preventing them.

    86. Re:With spinning disks, you do not know either by thegarbz · · Score: 1

      SSDs have a few wearout related metrics. HDDs have many. Both devices can suffer from randomised failures but these cannot be predicted by SMART.

    87. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      That is not true at all. SMART implementation for SSDs rather depends on what the manufacturer implemented. In some cases, vendors have propagated nonsense data, and in other situations making use of the instrumentation requires special procedures beforehand. It really depends on the drive and vendor. It also requires documentation, in some cases, to understand what they have done. A good deal of my experience is with Samsung and Intel, but we had strong relations that gave our company insight and direct support. Point being, identifying drive failure causes as well as early signs of failure is not always impossible. However, if the interface fails completely then options are limited. One parting gift: if an SSD has failed, always attempt a secure erase. For several firmware bugs it is a magic pill.

    88. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      After purchasing close to 100 SSDs for classroom use, I have also found that Samsung SSDs work best. So far none have died. Google SSD Endurance Test and click on the images. Very often, the Samsung drives go the furthest by a long shot.

    89. Re: With spinning disks, you do not know either by nigelo · · Score: 1

      +1 for the 70s stereo and a $20 optical-RCA converter - puts any soundbar I've tried to shame.

      --
      *Still* negative function...
    90. Re:With spinning disks, you do not know either by Miamicanes · · Score: 1

      The bigger problem is that "99.99999%" of SSDs encrypt EVERYTHING at the block level, using an encryption key known only to the drive itself. So EVEN IF you can easily rip the bits from the failed drive's flash using a JTAG reader, you'll be reading what's effectively random noise.

      The reason for encrypting the data itself is legit (it makes the bits look pseudorandom & improves wear), but IMHO, the fact that there's literally NO WAY to replace the drive's own encryption key with one known to the drive's owner is absolute, complete BULLSHIT.

      As far as I know, it's a descendant of CPRM DRM. It's technically been a mandatory part of the ATA spec since the early 2000s... the difference is, on non-SSDs, it's disabled by default (the only devices I'm aware of that might actually use it are things like TiVO DVRs and videogame consoles). With a SSD, it's always on & can't be disabled or made to use a key known to YOU. As a result, data-recovery companies can still do recovery on a drive suffering from FILESYSTEM corruption, but they're now completely helpless if the drive ITSELF fails (even if the failure doesn't directly involve the flash memory). And unlike the old days, if the logic board is fried, you can't even solder the chips onto a sacrificial board, because the encryption key is tied to the original logic board.

      Put another way, thanks to mandatory block-level black-box encryption, something that has always been a bad situation (drive failure) has NOW become insurmountably worse, even though the technical challenge of physically reading bits from failed media is arguably easier now than it has ever been in history.

    91. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      That's par for the course. Most CS students know nothing about how hardware actually works.

    92. Re:With spinning disks, you do not know either by WhoBeDaPlaya · · Score: 2

      You must have missed how Samsung royally screwed up with the 840 and 840 EVO firmware. Or on the mechanical side of things, look up how they messed up the SpinPoint F4's firmware and tried to hide it ;)
      Not biased against Samsung or anything, as I still have several SpinPoint F3s in service, as well as a bunch of 840 Pros and 850 EVOs.

    93. Re:With spinning disks, you do not know either by dgatwood · · Score: 2

      It's usually because of the controller or RAM / Cache errors in processing that corrupts the firmware or dynamic LBA flash block allocation table (database). This renders the rest of the NAND flash partially or totally inaccessible. Quality "prosumer" drives are supposed to have extra hardware (capacitance) to prevent half-writes upon a dirty shutdown (abrupt loss of power). But regardless, any corruption on write-back can render the drive "bricked".

      And by this, you mean that some really bad SSD manufacturers still haven't learned the concept of log-structured storage. The problem of handling a partial write was solved a couple of decades ago. You roll back the partial transaction to the last checkpoint, then say, "whoops, that write never happened".

      Basically, in addition to a flat mapping table (as a cache), you store a copy of the mapping table (a checkpoint) with modifications in a log format. Each time you power on the drive, it ignores the cached flat mapping table (if it even bothers to persist it to disk), and reads the last checkpoint table, then replays the transaction log after that. When it reaches the last completed transaction in the log, it now has a valid mapping table that it is up-to-date to the maximum extent possible. A write operation is considered committed as soon as the transaction is added to the log, and existing used space is not reclaimed until that log write has occurred, ensuring that every write is effectively an atomic operation. Periodically, you write out a new flat table as a checkpoint, and after ensuring that it has been fully written, you then mark the oldest checkpoint and associated log pages as free for reuse.
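
      The recovery step described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not real firmware: the record shape `(lba, page, committed)` and the name `recover_mapping` are made up, and a record with `committed=False` stands in for a write torn by power loss.

```python
from typing import Dict, List, Tuple

# Hypothetical log record: (logical block address, physical page, fully committed?)
LogRecord = Tuple[int, int, bool]


def recover_mapping(checkpoint: Dict[int, int], log: List[LogRecord]) -> Dict[int, int]:
    """Rebuild the mapping table at power-on from the last checkpoint plus the log."""
    table = dict(checkpoint)  # start from the last fully written checkpoint
    for lba, page, committed in log:
        if not committed:
            break  # torn write: roll back here, "that write never happened"
        table[lba] = page  # replay a completed transaction
    return table
```

      Replay stops at the first incomplete record, so the table always reflects a state that actually existed on the drive, and the checkpoint itself is never modified.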

      We were talking about this back when I was in grad school, around the turn of the century, precisely to prevent those sorts of failures. So IMO, if any SSD manufacturer still isn't doing a transactional/log-based mapping table between blocks and flash pages at this point, their hardware isn't good enough to use for storing system logs for a flush toilet, much less critical data. I mean, this is really *basic* stuff, and has been the norm for at least a decade.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    94. Re:With spinning disks, you do not know either by dgatwood · · Score: 5, Insightful

      I think you may be missing his point. I've had SSD's die on me as well with absolutely no warning. What's unnerving about it is you have no idea why it failed. Good engineers like failure analysis; it helps determine if you're buying a crappy product, running your product out of spec, or any number of other metrics which can inform future purchases.

      Statistically, without even knowing what the particular product was, I can tell you what caused it: RoHS.

      The change from lead-based solder to lead-free solder is one of the major causes of premature electronics failures — probably more common than all other causes put together. Between tin whiskers, cold solder joints, and stress fractures caused by thermal expansion of component packages, the RoHS lead-free solder rule is a clear example of environmentalism gone amok. Instead of improving our environment by reducing the amount of lead going out into the world, it has, IMO, made our environment worse by dramatically increasing the amount of hardware discarded as junk long before it otherwise would have been.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    95. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      OCZ? God-damn, they were always such hot garbage. They were the "high end" equipment for people who knew jack-shit about computers. Ever see a "Sherwood" or "Soundesign" stereo system back in the 80s? Yeah, that was OCZ's department in the new millennium.

    96. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      You get what you pay for. If the manufacturer only warranties the drive to last 30 days, expect it to die 30 days and 1 minute after power is first applied without regard to the actual time spent operating. Divide this time by two every time you (or the magical mystery hardware destroyer aka power saving shit) power cycles the drive.

      Now you know.

      The manufacturer does not WANT to replace the drive during the warranty period. They want it to fail IMMEDIATELY AFTER the warranty expires (as close as possible, in fact) in order to maximize their profits.

      So, you should expect the drive to last for EXACTLY as long as it is warrantied and NO LONGER. As soon as you buy the drive, the warranty starts running, even if you never use the drive at all.

      It is quite common for shmoes to buy the cheapest shit they can find and then wonder why it only lasts like the cheapest shit they could have bought.

    97. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      But with spinning discs you normally get warning. With an SSD there is almost never any warning of any sort. They just die.

    98. Re:With spinning disks, you do not know either by Aighearach · · Score: 1

      That's just it, the idiots that want their data back that bad can only pay a few thousand. What a waste of time. And you know they'll cry about paying.

      I'd want more than they'd pay just to spend the time trying, plus a lot more if I recovered something.

      What about people who lost their pictures in a fire? They move on, they learn what is important.

    99. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Some don't, but a write/erase cut short by a power outage can cause physical issues with some flash technologies. Unstable bits that flip later, overcharged bits that wear out faster, things like this. Now ECC can help but not always, especially when the whole block goes. This is where a RAID on top of the flash can give real reliability. But most SSDs have their own internal bookkeeping to hide wear leveling and bad blocks. And if the manufacturer doesn't have some internal redundancy for this data, and one of those blocks dies at the wrong time, boom, you have a brick.

    100. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Probably only cops (trailer shows Jackie Chan beating up a thug), criminals (cut to Vin Diesel giving a 'Who,me?' expression), spies(Daniel Craig still photo--quickly replaced by Q, the man behind the tech)...but you get the picture.

    101. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      You're just being an unreasonable fool. Grow up ...

    102. Re: With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      He was bragging that the hardware he's got at home could fire 7200 rounds per minute. Talk about *unnerving* poolside chat

    103. Re:With spinning disks, you do not know either by the_B0fh · · Score: 1

      you know this has been discussed on slashdot quite a few times, right?

    104. Re:With spinning disks, you do not know either by Waccoon · · Score: 1

      Remember to spin the right way. Righty tighty, lefty loosy!

    105. Re:With spinning disks, you do not know either by Ramze · · Score: 1

      I always go with Samsung EVO or PRO. Things may be different now, but when I was first in the market for SSDs, Samsung was the only one that designed every part of the device - not a cobbled together mess of components and software from various vendors made into a franken-device that might work ok most of the time. Now, I just buy Samsungs out of habit & the fact I've never had one fail on me. Samsung DID have a huge blunder with one or two specific lines of SSDs, but that was widespread with those specific models, not random deaths on random models.

      I've never had to use the Samsung software included other than a firmware update once, but it has lots of tools for diagnostics and recovery. I can't vouch for how well they work since I haven't had to use them.

      No drive will last forever, but considering I generally put my apps and OS on the C: Samsung and all my media and Windows profile on separate drives, my write/overwrite rate on the SSD is consistent with allowing it to last until sometime after our Sun turns into a red giant.

    106. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      What would anyone do with 7200 SSDs of brand RPM anyway?

    107. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      I don't think they teach manufacturing techniques in CS courses

    108. Re:With spinning disks, you do not know either by gweihir · · Score: 1

      Thanks, that is way beyond the data I have!

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    109. Re:With spinning disks, you do not know either by gweihir · · Score: 1

      I have not missed that. The SSDs still work though, I have at least one 840. My claim was just that they seem to be significantly better than the competition, not that they are perfect.

      --
      Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
    110. Re:With spinning disks, you do not know either by jellomizer · · Score: 1

      In my experience, mechanical drives fail far more often than SSDs. The problem isn't necessarily data retention on the disk, but the mechanical parts that fail or, worse, crash and scratch off the data.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    111. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      SMART codes for SSDs have specifically assigned meanings; if an attribute is not in the smartctl database, you can look it up in the drive's data sheet.
      Still might not be able to trust them though.

    112. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Why do you care? Design for failure and just replace it if/when it dies.

    113. Re:With spinning disks, you do not know either by sbjornda · · Score: 1

      That's only true in the northern hemisphere. You hemispherist.

      --
      .nosig

    114. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      I am talking reliability levels like a typical CPU here. It seems the market for that is just not there.

      In a large enough datacenter you see CPUs dying roughly every month or so. RAM chips die at about the same rate.

      We're just used to the difference in manufacturing yields, Q/A and longevity between a small robot like a Hard drive and a computer expansion board like an SSD or RAM DIMMs. A lot of hard drive failures aren't the robot parts - drivers, spindle motors, arms - but controller failures because of poor PCB construction, chip death or thermal cycling on the solder joints.

      If we got rid of the complicated controller system on the SSD for the SATA interface we could drastically reduce the complexity and cost and improve reliability. And maybe go to higher bandwidth between RAM and these RAM-like devices. We could even give it some cool name, like NVMe.

    115. Re:With spinning disks, you do not know either by Agripa · · Score: 1

      That said, I find it sad that you cannot get "high reliability" SSDs where you basically can forget about the risk of them dying. I am talking reliability levels like a typical CPU here. It seems the market for that is just not there.

      They can be found in the enterprise market but they use lower density Flash so cost a lot more per gigabyte.

    116. Re:With spinning disks, you do not know either by Agripa · · Score: 1

      SMART should be able to provide the number of remapped sectors. There should be manufacturer specific counters for the amount of over provisioning that is left for remapping too. That should tell you precisely when you should plan to replace an SSD due to age.
      How hard would it be to notify something that the drive can't handle any more dead cells, so should not be written to any more? Or that it is down to x% of spare nand?

      Crucial does exactly this with their SSDs but it does not save you from spontaneous mysterious death.

    117. Re:With spinning disks, you do not know either by Agripa · · Score: 1

      I heard that outside of "wearing out", the biggest cause of failure by far is the controller, which fails at roughly the same rate as spinning rust. Overall, SSDs fail at roughly the same rate as mech drives, if you ignore the mechanical part of mech drives. The topic rant sounds like someone who would rather have a drive that is over 2x more likely to fail because he better understands that additional 100% failure modes.

      It is not the controller or ICs which fail. By themselves they are reliable.

      The problem is the way NAND Flash memory behaves when programming, and perhaps erase, operations are interrupted by, for example, loss of power. If a log type of file system is used, you would expect that any interruption could at most corrupt the data being written, but an interruption can cause the state machine controlling the write or erase to damage *other* locations. If those other locations include the Flash translation layer data structures, then for practical purposes the drive is destroyed.

      Multi-level Flash storage has an additional failure mode where interrupting a write to a partially programmed page destroys the existing data stored on that page.

      The solution is to have backup power sufficient to complete any possible write or erase operation. I laughed when SandForce advertised their controllers as not requiring any power backup for safe operation.

    118. Re:With spinning disks, you do not know either by Agripa · · Score: 1

      I believe Intel SSDs are programmed to "self brick" when they fail, or at least they used to be. I remember thinking that was a spectacularly stupid way to fail, and the read-only mode would be much preferable. Yes, your computer will likely crash hard in short order either way, but at least with read-only mode you could get (most of) your most recent data off it

      Intel did or still does this when the SSD endurance is exhausted which struck me as particularly skeezy. Why not force read only mode so that the data may be recovered?

    119. Re: With spinning disks, you do not know either by prisoner-of-enigma · · Score: 1

      Given the typical pathetic performance of most sound bars, this isn't exactly a big hurdle to crow about. With 5.1 setups being so ridiculously cheap, sound bars are only for people too lazy to run speaker cable.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    120. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      I still have an OCZ boot drive running in the machine I'm using now, and it's (checks) from approximately 2012 (!) on a heavily used machine. So, reliability is pretty variable.

      Come to think of it, of all the SSDs I've bought, among OCZ, Kingston, and Samsung, none of them have died or otherwise had any issues so far, and some of them are years old. I assume because of the way flash memory works that they'll die eventually, but overall they've been as reliable as any of the hard drives I've bought.

      By contrast I have heard stories of firmware bugs and other catastrophic software/hardware failures of SSDs causing them to be utterly unrecoverable. Those horror stories show up here in /. pretty regularly, especially when it's a fundamental firmware bug, so I guess I've been lucky.

    121. Re:With spinning disks, you do not know either by viperidaenz · · Score: 1

      Nothing saves you from spontaneous mysterious hardware failure. Hence the need for backups.
      Saying SMART is only applicable to electro-mechanical systems is just wrong. Just because an SSD is "solid state" doesn't mean there are no statistical models for failure. The very nature of the storage mechanism means it is guaranteed to fail; it's only a matter of when.

    122. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Or... Back up and only use drives while under warranty. Then you don't need to worry about it
      Captcha: reinvent

    123. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      I think you may be missing his point. I've had SSD's die on me as well with absolutely no warning. What's unnerving about it is you have no idea why it failed. Good engineers like failure analysis; it helps determine if you're buying a crappy product, running your product out of spec, or any number of other metrics which can inform future purchases.

      Statistically, without even knowing what the particular product was, I can tell you what caused it: RoHS.

      The change from lead-based solder to lead-free solder is one of the major causes of premature electronics failures — probably more common than all other causes put together. Between tin whiskers, cold solder joints, and stress fractures caused by thermal expansion of component packages, the RoHS lead-free solder rule is a clear example of environmentalism gone amok. Instead of improving our environment by reducing the amount of lead going out into the world, it has, IMO, made our environment worse by dramatically increasing the amount of hardware discarded as junk long before it otherwise would have been.

      Ironically, if tetra-ethyl lead had been removed from petrol a few decades earlier we probably wouldn't be having to listen to your retarded boomer bullshit now.

    124. Re:With spinning disks, you do not know either by Agripa · · Score: 1

      I think the big difference is that with a hard drive, failure is usually mechanical and preceded by signs reported in the SMART data like error rate and reallocation count. SSD failures tend to be data structure corruption which has no antecedent to watch for.

      Here is the data available from my Crucial SSDs. It includes remaining operating life. The unexpected power loss events are because it is installed into a Windows 10 test system which keeps crashing with a blue screen and incomplete diagnostic data. Yay Microsoft!

      ID   Attribute                                 Raw value   Unit
      1    Raw Read Error Rate                       0           Errors/Page
      5    Reallocated NAND Block Count              0           NAND Blocks
      9    Power On Hours Count                      1176        Hours
      12   Power Cycle Count                         69          Power Cycles
      171  Program Fail Count                        0           NAND Page Program Failures
      172  Erase Fail Count                          0           NAND Block Erase Failures
      173  Block Wear-Leveling Count                 8           Erases
      174  Unexpected Power Loss Count               22          Unexpected Power Loss events
      180  Unused Reserved Block Count               100         Blocks
      183  SATA Interface Downshift                  0           Downshifts
      184  Error Correction Count                    0           Correction Events
      187  Reported Uncorrectable Errors             0           ECC Correction Failures
      194  Enclosure Temperature                     35          Current Temperature (C)
                                                     68          Highest Lifetime Temperature (C)
      196  Reallocation Event Count                  0           Events
      197  Current Pending ECC Count                 0           ECC Counts
      198  SMART Off-line Scan Uncorrectable Errors  0           Errors
      199  Ultra-DMA CRC Error Count                 0           Errors
      202  Percentage Lifetime Remaining             100         % Lifetime Remaining
      206  Write Error Rate                          0           Program Fails/MB
      210  RAIN Successful Recovery Page Count       0           TUs successfully recovered by RAIN
      246  Cumulative Host Sectors Written           1323897591  512 Byte Sectors
      247  Host Program Page Count                   41371799    NAND Page
      248  FTL Program Page Count                    21086208    NAND Page
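
      For anyone who wants to turn the last three counters into something readable, here is a rough Python sketch. It assumes the commonly cited formula for these Crucial/Micron counters, write amplification = (attribute 247 + attribute 248) / attribute 247; that formula and the function names are my assumptions, so check the drive's data sheet before trusting the result.

```python
# Values copied from the SMART listing above (Crucial attribute IDs).
SECTOR_BYTES = 512
HOST_SECTORS_WRITTEN = 1_323_897_591  # attribute 246
HOST_PROGRAM_PAGES = 41_371_799       # attribute 247
FTL_PROGRAM_PAGES = 21_086_208        # attribute 248


def host_bytes_written(sectors: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Total bytes the host has written, from the 512-byte sector counter."""
    return sectors * sector_bytes


def write_amplification(host_pages: int, ftl_pages: int) -> float:
    """Assumed formula: all NAND page programs divided by host page programs."""
    if host_pages <= 0:
        raise ValueError("host page count must be positive")
    return (host_pages + ftl_pages) / host_pages


if __name__ == "__main__":
    gb = host_bytes_written(HOST_SECTORS_WRITTEN) / 1e9
    waf = write_amplification(HOST_PROGRAM_PAGES, FTL_PROGRAM_PAGES)
    print(f"host writes: {gb:.1f} GB, write amplification: {waf:.2f}")
```

      For this drive that works out to a bit under 700 GB of host writes with a write amplification of roughly 1.5, which fits the 100% lifetime remaining it reports.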

    125. Re:With spinning disks, you do not know either by Anonymous Coward · · Score: 0

      Every SSD I've looked at with a good SMART tool (that doesn't just say x% likely to keep working) does just that. The SSD in the machine I'm posting with reports "Percentage lifetime used: 7%" and "Unused Reserve NAND Blk 1036", along with the average block erase count.

      The rest of the stuff mostly stays at 0, but due to wear-levelling, etc... once bad sectors start getting re-allocated, you best order a new SSD, and/or set the disk to "read-only" if you can.
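
      As a rough illustration of that rule of thumb, here is a minimal Python sketch. The field names and the thresholds (80% life used, 100 reserve blocks) are made up for the example and are not taken from any vendor; it also does nothing about the sudden controller or firmware deaths discussed elsewhere in this thread.

```python
from dataclasses import dataclass


@dataclass
class SsdHealth:
    """Hypothetical snapshot of a few wear-related SMART values."""
    percent_life_used: int
    reserve_blocks_left: int
    reallocated_blocks: int


def replacement_advice(
    health: SsdHealth,
    max_life_used: int = 80,        # illustrative threshold, not a vendor value
    min_reserve_blocks: int = 100,  # illustrative threshold, not a vendor value
) -> str:
    """Return 'replace', 'plan', or 'ok' from wear indicators only."""
    # Once blocks start getting reallocated, order a new drive.
    if health.reallocated_blocks > 0:
        return "replace"
    # Otherwise warn when rated life or spare area is running low.
    if (health.percent_life_used >= max_life_used
            or health.reserve_blocks_left < min_reserve_blocks):
        return "plan"
    return "ok"


if __name__ == "__main__":
    # The drive described above: 7% used, 1036 reserve blocks, nothing remapped.
    print(replacement_advice(SsdHealth(7, 1036, 0)))
```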

  2. Shit by maxbuzz · · Score: 1

    Happens

  3. Department of Computer Science by 110010001000 · · Score: 1

    Hey Chris from Department of Computer Science has a problem. Let's hear about it, Chris.

    1. Re:Department of Computer Science by 93+Escort+Wagon · · Score: 1

      Since they’ve now edited the summary (hooray for editing), I’ll note for the edification of future readers: The original quotes in the summary were attributed to “Chris from Department of Computer Science”.

      --
      #DeleteChrome
    2. Re:Department of Computer Science by mermeid007 · · Score: 1

      Hi Chris, I lost my keyboard. Is it behind your desk? Let me look. Yes! It is! Thank you! Yeah, sometimes they fall if someone bumps into them with their elbow or something. Next caller.

    3. Re:Department of Computer Science by Anonymous Coward · · Score: 0

      You mean Chris who posted 100+ videos in a year?

  4. Department of Computer Science --- are you sure? by Anonymous Coward · · Score: 0

    Doesn't know how SSD's work.

  5. Learn about the subject by guruevi · · Score: 0

    If you want to know the details, learn about the subject at hand. The thing is, electronics wear out, there is a reason these wear out faster than other "solid state technology" like transistors and a lot of it has to do with scaling down the chip.

    Obviously there could be other issues at hand too, such as firmware failures, and to understand those you have to know why SSDs are so much more complex than a hard drive to begin with (hint: it has primarily to do with the above wearing).

    --
    Custom electronics and digital signage for your business: www.evcircuits.com
    1. Re:Learn about the subject by Anonymous Coward · · Score: 2, Informative

      Electronics wear out slowly. In fact most will long exceed their usefulness before they die.
      More often electronics will die early due to manufacturing defects. It's why if your device lasts the first month it will probably keep working until you upgrade it. SSDs are a different beast though, thus they have excess capacity to handle wear leveling. Still, a young drive that dies is usually, again, a sign of a manufacturing defect.

    2. Re:Learn about the subject by Anonymous Coward · · Score: 0

      It can also be a sign of using the drive wrong such as not using TRIM or running high rate of transactions against SSD geared more for read. As you already know they have excess capacity built-in to handle wear leveling. One optimized for write will have more excess capacity than those geared for read. When talking consumer SSD you're dealing with consumer mtbf which is significantly lower. A lot of businesses try to get away with using them in their servers which of course leads to early failure.

    3. Re:Learn about the subject by freeze128 · · Score: 1

      Correction: PROPERLY DESIGNED electronics wear out slowly. Improperly designed electronics may not even last past the warranty period. Since there is a huge demand for SSDs in increasing capacity, I can't help but think that manufacturers are pushing the bounds of reliability in favor of capacity. The manufacturers may just be relying on the SSD's built-in correction capability to correct for the decrease in reliability, but that will only get you so far.

    4. Re:Learn about the subject by omnichad · · Score: 2

      Not using TRIM doesn't have a huge effect on SSD life. Just performance. Write amplification adds some wear, but not enough to be drastic. And it won't cause sudden failure either - just normal wear on the wear-levelling curve. Sudden failure is by definition going to be something that's not related to routine depletion of a fixed lifespan.

  6. I can relate by Anonymous Coward · · Score: 0

    Mine died about 2 months after I got it. A Samsung 850 Pro. Must say that they did a quick turnaround fixing it. A total of 3 days on their end to repair and get it back to me, but I was annoyed that they required me to send the disk back for repair rather than sending out a replacement. At the time I was thinking I'd be dead in the water for a week or more. Still, two days to mail it to them and three more days is a long time to not be able to use a machine for work.

    The cause? A power outage. I know that manufacturers have been working on problems with power loss, but the fact that most don't mention whether or not their product is safe against a power outage is disconcerting at the least.

    I replaced that UPS that week. :(

    1. Re:I can relate by 110010001000 · · Score: 1

      Uh, if a disk dies in 2 months you need to get a replacement, not a repair.

    2. Re:I can relate by tepples · · Score: 1

      Then read it as "Samsung would not ship the replacement until it received the returned unit." This still implies a week's downtime.

    3. Re:I can relate by Anonymous Coward · · Score: 1

      You can almost always pay for advanced replacement. You get your money back when they receive the drive. Then it is usually just a day and you're done, provided you don't have a spare around. If things are really that critical though, then you've already failed and should have set up RAID for your SSD and had backups.

    4. Re:I can relate by 110010001000 · · Score: 1

      You should demand cross shipping for that. Any professional would.

    5. Re:I can relate by Junta · · Score: 1

      Realistically speaking, he almost certainly got a replacement, but the return policy required the failed drive to be returned first.

      However for electronics of this class, the manufacturer in all likelihood *could* repair it. The neat thing is if they do repair such a disk, it could come back with the data intact. In practice, I don't think any manufacturer would offer such a service or even try.

      --
      XML is like violence. If it doesn't solve the problem, use more.
  7. Heading should be by Anonymous Coward · · Score: 0

    "Chris Siebenmann of Department of Computer Science writes about HIS inability to figure out the bottleneck when an SSD dies"

    Just because he can't do it, doesn't mean it's not possible. There are ways of making the silicon talk.

    1. Re:Heading should be by 110010001000 · · Score: 3, Funny

      Waterboarding?

    2. Re:Heading should be by Anonymous Coward · · Score: 0

      A quick example:

      http://www.sector-technologies.com/solutions/electrical-failure-analysis.html

    3. Re:Heading should be by bobbied · · Score: 1

      Waterboarding?

      Well... Funny, but water mixed with electronics tends to produce situations where little communication takes place....

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
    4. Re:Heading should be by Immerman · · Score: 1

      Quite so. As it happens I was just fixing a dentist's office computer yesterday, and used the dental air blower to get the dust-bison out of the heat sinks since I didn't have any compressed air on hand. Let me tell you I was *really* careful not to touch the water jet button. Clearly whoever designed the "two small identical buttons side by side" interface never intended it to be used in a setting where a stray jet of water could be a major problem.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    5. Re:Heading should be by rogoshen1 · · Score: 1

      That's the long and short of it.

    6. Re:Heading should be by bobbied · · Score: 1

      Yea, with water, I see a reduction in resistance too... :)

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  8. This is why you have RAID and backups by froggyjojodaddy · · Score: 3, Informative

    *shrug* ?

    I mean, manufacturing defects, environment, and just plain old bad luck? SSDs have come a long way, but if I have anything of importance, I'm RAID'ing it and backing up. I feel anyone with an understanding of technology knows the importance of this.

    1. Re:This is why you have RAID and backups by Anonymous Coward · · Score: 0

      I RAIDed your mom's pooty last night. Then I backed up my thick wad of semen in her uterus.

  9. I bet he.... by Anonymous Coward · · Score: 0

    Did a bad firmware flash and just won't admit it for warranty purposes!

  10. Controller failure by macraig · · Score: 5, Insightful

    I've had two SSDs die utterly. It wasn't because there was a failure of any part of the actual storage pathways: it was irreparable failure of the embedded controller circuits. The Flash itself was still fine and safely storing all my data, but there was no means to access it. At least with a platter drive if the PCB fails, you can unscrew and detach it and replace it with a matching PCB from another drive; no way to do that with an SSD. Early on when manufacturers were spending all their time hyping the comparative robustness of the Flash medium, they conveniently forgot to mention how fragile and not-so-robust the embedded third-party controller circuits could be.

    1. Re:Controller failure by bobbied · · Score: 5, Informative

      Wow, that PCB substitution trick became very hit/miss a long time ago.

      Nowadays, there is a whole bunch of operational parameters which need to be set properly to get data on/off a drive. I understand that some of these "configuration" items are now stored in non-volatile memory on that PCB and set during the manufacturing process. Similar serial numbers may help, but it's still very hit or miss.

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  11. It's not that scary... by FrankSchwab · · Score: 4, Informative

    Infant failures are common in electronics ( https://www.weibull.com/hotwir... ) From a simple standpoint, imagine a poorly soldered junction on the PCB - soldered well enough to pass QC and work initially, but after a couple of heating cycles the solder joint fractures. The same kinds of problems occur inside chips - wire bonds between the package and die may be defective but initially conductive, and fracture due to thermal cycling.
    Similar problems can occur on the die. The gate oxide for a particular transistor might be too thin due to process issues. If it's way too thin, it'll fail immediately and the die will get sorted out at test. If it's just a bit thicker, it might pass all production tests but fail after an hour or two of operation, or 100 power cycles. If it's just a bit thicker (where it should be), it might last for 20 years and a million power cycles.
    Everyone in the semiconductor industry would love to figure out how to eliminate these early failures. No one has found a way to do it.
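
    The usual way to put numbers on infant mortality is a Weibull hazard (failure rate) function, h(t) = (k/s) * (t/s)^(k-1). Below is a small Python sketch of it; the shape and scale values in the usage note are purely illustrative and not fitted to any real drive data.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (k/s) * (t/s)**(k-1) for t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


if __name__ == "__main__":
    # shape < 1: failure rate falls with age (infant mortality)
    # shape = 1: constant failure rate (random failures)
    # shape > 1: failure rate rises with age (wear-out)
    for shape in (0.5, 1.0, 3.0):
        early = weibull_hazard(10.0, shape, 1000.0)
        late = weibull_hazard(1000.0, shape, 1000.0)
        print(f"shape={shape}: h(10)={early:.6f}, h(1000)={late:.6f}")
```

    A shape parameter below 1 is the marginal gate oxide or solder joint case: the parts that are going to fail tend to fail early, and the survivors look better with every hour they keep running.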

    --
    And the worms ate into his brain.
    1. Re:It's not that scary... by bobbied · · Score: 1

      Which is why "burn in" operation, where you run the item through some thermal cycles, is often done. We are trying to find the stuff that's going to initially fail.

      I usually do 24 hour burn in of all hardware I build, 12 hours on, then 2 hour cycles on off. Or, (sarc on) just load windows and run all the updates. (sarc off) It's almost the same thing anyway.. :)

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
    2. Re:It's not that scary... by Anonymous Coward · · Score: 0

      They know how to solve the issue. It's just expensive :)

      Extremely high reliability parts and equipment will have a proofing and conditioning phase during QC where EVERY part is individually tested and characterized. Power is cycled, temperature ranges are cycled, and the parts are run at or above spec for an extended period of time to age out any crib deaths.

      It's extremely effective, but also extremely pricey. The process is pure money. - Parts are kept longer, it's man-hour intensive, it requires a lot of floor space, it requires special equipment and specialized labor. The cost of the part is often dwarfed by the cost of the proofing.

      -The above is taken to absurd extremes when producing extremely high accuracy analog references for scientific test gear. It's a months long (or longer!) process that resembles arcane wizardry to the casual observer.

      The industry has decided that it's much cheaper to have an acceptable level of crib deaths, and just build systems as a whole fault tolerant.

    3. Re:It's not that scary... by Andtalath · · Score: 1

      The more you use an SSD, the faster it goes bad.
      So it's not an ideal thing to do.

    4. Re:It's not that scary... by MrLogic17 · · Score: 1

      Has your burn in ever found something that worked fine at first power on, and was dead after 24hrs?

      The idea seems good, but I'm skeptical. I'd think that anything leaving a factory after their testing wouldn't benefit from anything more than a smoke test.

    5. Re:It's not that scary... by bobbied · · Score: 1

      Has your burn in ever found something that worked fine at first power on, and was dead after 24hrs?

      The idea seems good, but I'm skeptical. I'd think that anything leaving a factory after their testing wouldn't benefit from anything more than a smoke test.

      I've found some things, but rarely any of the major components actually suffered from infant mortality on my watch. However, I've done this professionally a bit too, where we needed to verify MilSpec operation. In these tests, you verify both the operating and storage temperature ranges to certify a product. We had environmental chambers that could heat, cool and shake systems both running and not. Even under those grueling conditions the failure rate wasn't that high, though it was higher than you'd expect for less extreme temperature and vibration ranges.

      I personally consider it good practice to burn in stuff for a number of reasons. Infant mortality is but one. I also know that electrolytic capacitors like to drift up in value as they are powered on and after sitting idle may degrade over long periods. So the burn in is actually conditioning them over the few hours they are powered on, extending their lives a bit. It's not so much a thing anymore, but for large value filter capacitors or those under higher voltages (such as in vacuum tube power supplies) it can show significant differences in operations. These days though, the time from manufacture to my integration is pretty low so degradation of electrolytic capacitors may not be a huge issue anymore.

      These days, I don't know if burn in matters all that much, but I do it. It makes me feel better if nothing else.

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
    6. Re:It's not that scary... by bobbied · · Score: 1

      The more you use an SSD, the faster it goes bad. So it's not an ideal thing to do.

      There's power on and read/write cycles. Usually it's write that "uses up" a SSD, not power on time or read cycles.

      However, given the number of write cycles is huge per cell, unless you are putting an SSD into a high data rate service situation, using it up is hardly a problem as the rest of the system will go defunct before the SSD runs out of write cycles. Also 12 hours is hardly enough time to appreciably dent an SSD's number of cycles, when their expected life span is a decade or more.

      BUT... If you are worried about it, you don't have to write to the drive all that time. I'm really only "power on" burn in guy. I'm not "hit the hardware with a performance bench mark" burn in guy. For the most part, I just want to thermal cycle stuff, so I may do a performance run or two, but only to drive heat and cold cycles. I don't think it's a problem...
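
      To put rough numbers on the "you won't use it up" argument, here is a back-of-envelope Python sketch. The model (capacity times rated program/erase cycles, divided by daily NAND writes) is a simplification, and every figure in the usage note is a hypothetical input rather than a spec for any particular drive.

```python
def years_until_worn_out(
    capacity_gb: float,
    pe_cycles: float,
    host_gb_per_day: float,
    write_amplification: float = 1.0,
) -> float:
    """Rough wear-out estimate: total NAND write budget / daily NAND writes."""
    if min(capacity_gb, pe_cycles, host_gb_per_day, write_amplification) <= 0:
        raise ValueError("all inputs must be positive")
    total_nand_writes_gb = capacity_gb * pe_cycles
    nand_gb_per_day = host_gb_per_day * write_amplification
    return total_nand_writes_gb / nand_gb_per_day / 365.0


if __name__ == "__main__":
    # Hypothetical drive: 500 GB, 1000 P/E cycles, 20 GB of host writes per
    # day, write amplification of 1.5.
    print(f"{years_until_worn_out(500, 1000, 20, 1.5):.1f} years")
```

      With inputs like those the estimate comes out in decades, so a day of mostly idle burn-in costs essentially nothing against the write budget.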

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  12. Why is it so hard to understand? by Anonymous Coward · · Score: 1

    The spinning parts of an hdd are not the only parts that can go bad. Just as the NAND flash memory are not the only parts of an ssd that can go bad. There are other components: controllers for the computer interface and the NAND chips, and the power to everything. One bad electronic component can take down either. One dead capacitor can stop a whole motherboard from running.

    1. Re:Why is it so hard to understand? by Immerman · · Score: 1

      True. However, in 30-odd years of computing I've had several hard drives fail for mechanical reasons - almost always spreading surface failure, and also a couple head crashes. And only one drive that suffered a sudden catastrophic failure that might have been a controller failure.

      Anecdotal evidence to be sure, but in my experience mechanical failures are far more likely than controller failures on HDDs. From what I can tell, SSDs are the opposite, probably due in large part to the much more complex (and less mature) software they run.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
  13. Sudden stop vs small warnings by atrex · · Score: 1

    In my experience with HDDs you'll usually get some warning that your drive has issues before it completely calls it quits. Whether it's bad sectors turning up or noises from the drive itself. If you pay attention to that (and you're a little lucky), you can manage to salvage most of the drive's contents before it dies completely.

    With an SSD one minute it's working completely fine and the next it's completely gone. While most of the data itself is probably still perfectly intact on the flash memory, getting at it is completely impossible (afaik) without going to a professional recovery service.

    1. Re:Sudden stop vs small warnings by MightyYar · · Score: 1

      I agree, but this has no practical benefit to me. When the HDD starts to throw errors, I pull it out of the RAID and stick in a new one. If the SSD completely up and dies, I pull it out of the RAID and stick in a new one. If more drives die or start to throw errors than there is redundancy, I restore from backup. If I can't restore from backup, well, then maybe then I'd appreciate the slowly-dying hard drive :)

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    2. Re:Sudden stop vs small warnings by fahrbot-bot · · Score: 1

      In my experience with HDDs you'll usually get some warning that your drive has issues before it completely calls it quits. Whether it's bad sectors turning up or noises from the drive itself. If you pay attention to that (and you're a little lucky), you can manage to salvage most of the drive's contents before it dies completely.

      In 2009, I had a 10 year-old 5 GB (yes, 5) enterprise SCSI disk (at home, not work) that failed to spin up after being off for over a year. (before that it had been running almost continuously) I tapped it (pretty hard) on the side with a screwdriver handle while it was "clicking" when I powered it up after removing the PC case. I slooowly spun up and worked fine. It had some bearing noise, but that went away after the drive warmed up. I pulled the data off and ran the drive for a couple of days w/o incident. Fun times...

      --
      It must have been something you assimilated. . . .
    3. Re:Sudden stop vs small warnings by Gilgaron · · Score: 1

      With a HDD I can envision how they can pull the platter and do forensics on it. Do you know how they take a peek in an SSD's memory at a professional service? It didn't occur to me until just now that I had no idea how they'd do it.

    4. Re:Sudden stop vs small warnings by Tablizer · · Score: 1

      do you know how they take a peek in an SSD's memory at a professional service? It didn't occur to me until just now that I had no idea how they'd do it.

      They call their buddy near the Red Square to restore the data from copies.

    5. Re:Sudden stop vs small warnings by Anonymous Coward · · Score: 0

      Many SSD controllers have a JTAG interface. In some the flash array is memory mapped, so you can dump it through the JTAG. It still needs to be reassembled logically before the data makes sense. To get individual pages of useful data is quite simple. To get a disk image is very complex, but depending on what you are trying to recover this may not matter.

      Some other controllers have a register based interface to the flash, this requires reverse engineering the firmware to find how to read it. It is best to look at bootloader code, as here the read and write commands are usually quite easy to find (and they may use a simple PIO mode instead of DMA).

      Some other controllers encrypt the array. This requires that the data passes through the hardware crypto + SATA/PCIe chain. Some professional data recovery companies reverse engineered this but I think it would be a big job.

      Incredibly, some encrypted SSD's keys can be read just via the debug interface. A few register read commands over the UART will give you the key in some cases. I don't understand why you design in encryption and then do this.
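
      To make the "reassembled logically" step above a bit more concrete, here is a minimal sketch. It assumes a hypothetical dump format in which every raw page carries its logical page number and a write sequence number in its spare-area metadata. Real controllers use proprietary layouts, so the names and the page size here are assumptions for illustration only.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size; real devices vary


@dataclass
class RawPage:
    """One physical page as dumped from the flash array (hypothetical layout)."""
    logical_page: int   # logical page number read from spare-area metadata
    sequence: int       # write counter; higher means written more recently
    data: bytes         # user data for this page


def rebuild_image(pages, total_logical_pages):
    """Rebuild a logical image from physical pages given in arbitrary order.

    Flash is written out of place, so several physical pages may claim the
    same logical page; the copy with the highest sequence number wins.
    Returns (image_bytes, list_of_missing_logical_pages). Missing pages are
    left zero-filled in the image.
    """
    newest = {}
    for page in pages:
        if not 0 <= page.logical_page < total_logical_pages:
            continue  # metadata out of range, probably corrupt: skip it
        current = newest.get(page.logical_page)
        if current is None or page.sequence > current.sequence:
            newest[page.logical_page] = page

    image = bytearray(total_logical_pages * PAGE_SIZE)
    for lpn, page in newest.items():
        start = lpn * PAGE_SIZE
        chunk = page.data[:PAGE_SIZE].ljust(PAGE_SIZE, b"\x00")
        image[start:start + PAGE_SIZE] = chunk

    missing = [lpn for lpn in range(total_logical_pages) if lpn not in newest]
    return bytes(image), missing
```

      Pulling single pages out of such a dump is the easy part; the hard part on a real drive is working out where (and whether) the controller stores the equivalent of `logical_page` and `sequence`.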

    6. Re:Sudden stop vs small warnings by saider · · Score: 1

      Connect to the controller board on the address and data lines for the flash chips, and manipulate them to access the chips. Then you would need to have a program that understands how this controller manages things and can reconstruct the sectors that it presents to the outside world.

      --
      Remember, You are unique...just like everyone else.
    7. Re:Sudden stop vs small warnings by Mark+of+the+North · · Score: 1

      I learned a similar trick from one of my techs while working as the lead technology guy at a school authority.

      Our board chair, literally the highest ranking member of the organization, brought in his personal laptop and explained that it no longer booted. We plugged it in and hit the power button; it wouldn't boot off the hard disk. I started to explain that there was nothing we could do when my tech interrupted me. He removed the hard drive from the laptop, said "Watch this!" and without any hesitation, smartly whacked it against the desk. While my blood began to boil, he quickly placed the hard drive back in the laptop and powered it up. It booted. I was very nearly floored.

      At the time, I looked up the mechanism for why this worked, but have since forgotten. In any case, worth a try when you've tried everything else.

    8. Re:Sudden stop vs small warnings by pnutjam · · Score: 1

      I'll bet the ssd rebuilds faster too.

    9. Re:Sudden stop vs small warnings by MightyYar · · Score: 1

      Just a little! LOL

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
  14. It's the binary nature of it.. literally by Mysticalfruit · · Score: 1

    With a spinning disk, you'll usually get an indication of a problem with a plethora of S.M.A.R.T errors.

    It's been my experience that when an SSD dies... you just suddenly appear to have an empty drive cage. It's a really ugly binary failure.

    I've taken to building my boxes with mirrored SSD's combined with taking and validating my backups.

    --
    Yes Francis, the world has gone crazy.
    1. Re:It's the binary nature of it.. literally by azcoyote · · Score: 2

      I can see what you mean, but I think I won't really understand it until it happens to me (and I hope it never happens to me). I'm on my third SSD and none has ever failed; my previous one was showing some age and was SATA so I upgraded to M.2 NVMe on Cyber Monday. Perhaps they haven't failed on me because I keep most of my data on a HDD RAID array and use the SSDs only for OS, program files, and very limited caching.

      --
      Incipiamus, fratres, servire Domino Deo, quia hucusque vix vel parum in nullo profecimus.
    2. Re:It's the binary nature of it.. literally by Bengie · · Score: 1

      Spinning disks have about as many sudden deaths as SSDs; the only difference is that spinners have an additional set of failures that give warning. In other words, if you had a hard drive that never died from mechanical issues, its failure rate would be very similar to an SSD's.

    3. Re:It's the binary nature of it.. literally by Wolfrider · · Score: 1

      > I've taken to building my boxes with mirrored SSD's

      --This may not actually help, especially if both SSDs are the same brand and model - because they will be experiencing the EXACT SAME load and wear patterns. They will likely both fail at the same time.

      --Try putting in the mirror drive about a week after the initial drive, that should give you some leeway.

      --
      .
      == WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
  15. Blame the OS by Anonymous Coward · · Score: 0

    had a 2gb memory card once. A Day One fault of one 512 mb block dead. Windows could not recognise this fault nor fix it. Instead writing to the card had corruption (obviously) when the faulty block was engaged.

    I simply wrote a utility that marked the 'sectors' covered by the block as 'used'- and from that day on Windows played happy with a 1.5 GB card. But before my FAT hack, the card could lock up the PC, as the Windows stack just doesn't know how to handle failed Flash memory units.
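
    The workaround described above boils down to flagging the affected clusters in the file allocation table so the filesystem never hands them out again. A rough sketch of the idea on an in-memory FAT16 table follows. This is not the utility from the post: the helper name and the cluster range are made up, and it uses the standard FAT16 bad-cluster marker (0xFFF7) rather than marking the clusters as 'used'.

```python
import struct

FAT16_FREE = 0x0000
FAT16_BAD = 0xFFF7   # FAT16 "bad cluster" marker


def mark_bad_clusters(fat, first, last):
    """Mark free clusters in [first, last] as bad in a raw FAT16 table.

    `fat` is a bytearray holding one little-endian 16-bit entry per cluster.
    Clusters that are already allocated are left alone (their files would
    need to be copied off first) and are returned to the caller.
    """
    in_use = []
    for cluster in range(max(first, 2), last + 1):  # entries 0 and 1 are reserved
        offset = cluster * 2
        (value,) = struct.unpack_from("<H", fat, offset)
        if value == FAT16_FREE:
            struct.pack_into("<H", fat, offset, FAT16_BAD)
        elif value != FAT16_BAD:
            in_use.append(cluster)
    return in_use
```

    On a real volume the same change would have to be written to every copy of the FAT, and the cluster range would be derived from the dead flash block's offset and the cluster size in the boot sector.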

    So a 'dead' SSD drive is almost certainly recoverable with direct software access, but that's going to be a big pain-in-the-ass. The 'level' system that deals with real time cell failures and remaps data is going to need to be understood. All GOOD modern SSD drives write 'fast' cos they have a 1/10th size RAM cache on the drive (few know this). The REAL write speed of the flash memory is 1/2 - 1/4 of what is quoted.

    PS when my simple SDCard had that block fail, I tried all the so-called recovery utilities, and all failed. Yet the problem was trivial (for me) to identify and permanently fix.

    99% of so-called coders are grossly incompetent. Windows and the interface of SSD drives reflects this fact. So the point of the article is that a SSD drive is most unlikely to have the type of complete failure that renders a HDD 100% useless- and this is TRUE. But even so, what are the chances of finding tools written by good coders that can access the majority of still good cells on the SSD drive, and remap the data back to the desired file formats?

    More people code than ever before, but they are mostly the monkeys trying to rewrite the works of Shakespeare on their millions of typewriters. Which is why the GARBAGE computer languages are so popular these days.

    As a consequence, the first unpredicted system error of an SSD drive has the real possibility of rendering all the data inaccessible, even tho, as I said, the vast majority of cells are fine and can be read. What could be and what is are not usually the same thing.

    1. Re:Blame the OS by tepples · · Score: 1

      had a 2gb memory card once. A Day One fault of one 512 mb block dead. Windows could not recognise this fault nor fix it. Instead writing to the card had corruption (obviously) when the faulty block was engaged.

      Then Microsoft messed up by not offering a "try writing to all unallocated clusters" mode in the surface scan in chkdsk.

    2. Re:Blame the OS by omnichad · · Score: 1

      Don't blame the OS. Blame "no backups." Failure should be expected and accounted for with a backup plan.

    3. Re:Blame the OS by Immerman · · Score: 1

      Well, to be fair you *didn't* fix it - you just worked around it. Almost as good in many settings. I've "fixed" several hard drives in a similar manner - one section of the drive is clearly bad, and spreading when used? Fine, re-partition it so that that section, and a generous buffer zone, are never used. They typically work fine for years after that.

      Certainly not something I'd generally recommend given the nature of such HDD failures, but perhaps justifiable if you just want to buy some more time before an upgrade, or until a kid destroys the thing more permanently.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    4. Re:Blame the OS by prisoner-of-enigma · · Score: 1

      Don't blame the OS. Blame "no backups." Failure should be expected and accounted for with a backup plan.

      While I am the first to agree to this at the enterprise server level, it's far more difficult for the consumer or typical desktop user, especially for laptop users. RAID isn't always an option for laptops (frequently it's impossible) so you're left with some sort of external (USB or Thunderbolt) backup device or cloud storage. The former is difficult for road warriors and is nearly impossible to schedule since it's manually attached. The latter depends on an always-on Internet connection to have current backups.

      My strategy was for laptop/desktop users to have their My Documents (and any other crucial directories) mirrored using OneDrive (included with Office365). It worked most of the time but nothing could be done if a drive failed while someone wasn't connected to the Internet. Any changes made since the last sync were irretrievably lost.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
  16. Low Bidders by bill_mcgonigle · · Score: 2

    It's bad firmware. Some of the drives can supposedly be resuscitated by the factory or people who have reversed the private ATA commands.

    I mean, at a minimum unless it's a PHY failure (and there's no reason to suspect those) the firmware could at least report missing storage (I've actually seen a 0MB drive failure once or twice) but their usual failure mode is to halt and catch fire, as the author notes as their usual behavior.

    With the recent reports about the inexcusable security problems on Samsung and Crucial drives this is starting to feel like the old BIOS problems with Taiwanese mobo companies outsourcing to the lowest bidder and shipping bug-laden BIOS with reckless abandon. It's OK, all the world's servers only depend on this technology.

    To be fair, I have a batch of 20GB Intel SLC SSD's that have never done this, but those are notable exceptions. At this point only low-end laptops like Chromebooks don't get at least a mirror drive here.

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  17. Shit happens by Anonymous Coward · · Score: 0

    As said before, shit happens. All electronics can fail and there will always be defects. Doesn't matter that it's not mechanical, it can fail. You are being too paranoid.

  18. It is the technology and trade-offs... by ctilsie242 · · Score: 0

    One of the benefits of SSDs is that they have a significantly longer MTBF than HDDs. They can also stand worse environments. However, when the ECC fails and the gates stop keeping in the electrons, there is no way to recover an SSD. When they fail, they fail hard.

    This is what RAID and backups are for. It isn't if a drive fails; it is when. Don't ever count on drive recovery services. Especially with how relatively inexpensive backups are. Backups are not difficult. Veeam, Borg Backup, Arq, Time Machine, Windows Backup (wbadmin), and many more are available. At the minimum, CrashPlan.

    Overall, SSDs have more advantages than disadvantages, especially newer ones. I wouldn't want to go back to spinning disks on the desktop or active use.

  19. Cloud by Anonymous Coward · · Score: 0

    Just put your data in the cloud. Then, when it's gone or has been corrupted, you can ask the cloud provider what happened. No more troubleshooting unreliable hardware for this guy!

  20. The solution is easy by Dunbal · · Score: 0

    Stop buying shit SSD's. I've been using them for 7 years now and have not had a single failure. As for loss prevention, a good PC owner knows to always back up important files. You do this regularly, right? Oh you don't? Then it's your own fault.

    --
    Seven puppies were harmed during the making of this post.
    1. Re:The solution is easy by brantondaveperson · · Score: 1

      Exactly this. I bought a shit SSD, it lasted three years. Not too bad, I suppose. When it died, which it did last week, I was back up and running in an afternoon - including the time taken to drive to the store and buy a new one.

      It's a really odd article in any case, why be so paranoid about the precise failure modes? Hardware is hardware, and it can break. Plan for it, and you won't have any problems.

  21. "In theory" by Anonymous Coward · · Score: 0

    >When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen, so early death feels especially alarming.

    Chris's dismay comes from believing that solid state physical devices should "in theory" perform like a deterministic idealized model.
    Maybe because they're solid state? Even in solid state devices there is a great deal of movement and there always will be above 0K and especially with power cycling.
    Flaws can reveal themselves long after initial testing. Quality varies between brands and even between production runs for the same brand.

    Hard drives can have all sorts of bad shit happen to them too with the controller board, processor, and RAM, that have nothing to do with the spinning discs. And you're just as much in the dark without a narrative.

  22. Why does it matter? by CaptainDork · · Score: 4, Informative

    I'm a retired IT guy and there's no kind of something that didn't fucking break. I'm not a goddam engineer. My job was to locate the problem at a black-box level and get the shit running again. Contemplating the "why" of a hardware failure is wheel-spinning instead of pulling the stuff out of the ditch.

    For new purchases under warranty, I exchanged them and sent the dead one back to the vendor. Let them hook it up and do diagnostics over a cup of coffee.

    I had work to do.

    --
    It little behooves the best of us to comment on the rest of us.
    1. Re:Why does it matter? by Anonymous Coward · · Score: 0

      I'm a retired IT guy and there's no kind of something that didn't fucking break. I'm not a goddam engineer. My job was to locate the problem at a black-box level and get the shit running again. Contemplating the "why" of a hardware failure is wheel-spinning instead of pulling the stuff out of the ditch.

      For new purchases under warranty, I exchanged them and sent the dead one back to the vendor. Let them hook it up and do diagnostics over a cup of coffee.

      I had work to do.

      A-fucking-men!

    2. Re:Why does it matter? by dcw3 · · Score: 1

      Then you also know that if you've been seeing an unusual trend in some items breaking, it's probably cost effective for you to look for a root cause, and fix the problem, or find a suitable substitute to break the cycle. This is why we keep metrics on outages. It's not so much your job as the "IT guy", but whoever is managing the program/IT should be interested because it's costing them money.
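
      Spotting an "unusual trend" from outage metrics can be as simple as comparing failure rates per model, normalized by time in service. A minimal sketch, assuming a hypothetical record layout of (model, days in service, failed):

```python
from collections import defaultdict


def annualized_failure_rates(records):
    """Return {model: failures per drive-year}.

    Each record is a tuple (model, days_in_service, failed), where `failed`
    is a bool. Models with no accumulated service time are omitted.
    """
    drive_days = defaultdict(float)
    failures = defaultdict(int)
    for model, days_in_service, failed in records:
        drive_days[model] += days_in_service
        failures[model] += int(failed)
    return {
        model: failures[model] / (days / 365.0)
        for model, days in drive_days.items()
        if days > 0
    }
```

      A model whose rate sits well above the rest of the fleet is the one worth a root-cause look or a substitute.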

      --
      Just another day in Paradise
    3. Re:Why does it matter? by DigiShaman · · Score: 1

      More to the point, understanding the "why" is more important than the "how". Why did it fail?? Specifically, was this something that could have been prevented at the IT side of things? If yes, time to change procedures. If not, then off to the vendor it goes, and research alternatives that are less error-prone.

      --
      Life is not for the lazy.
    4. Re:Why does it matter? by CaptainDork · · Score: 1

      I agree.

      About the only time I've gone there was funky voltage or a network wiring problem. Those are typically the last things I would suspect, and it drove me crazy.

      --
      It little behooves the best of us to comment on the rest of us.
    5. Re:Why does it matter? by CaptainDork · · Score: 1

      I didn't keep paperwork. I'm not a goddam analyst. If someone wanted to do that, fine. Just don't bother to tell me.

      Another non-car analogy (and off topic, I suppose).

      At Mobil Oil, our fractional T1 that connected Beaumont, Dallas, and Reston, Va. went down. I had people on it and we were balls to the walls trying to identify a broken box or maybe a problem with the telco.

      Management called me into the large conference room and there were a lot of pissed off suits in there.

      "Why is connectivity down?"

      "Dunno."

      "When will it be up?"

      "Dunno."

      "What are you doing to fix the problem?"

      "Nothing."

      "NOTHING?"

      "I'm in here talking to you guys."

      "Well, then when will it be back up?"

      "Sometime after this meeting is over."

      --
      It little behooves the best of us to comment on the rest of us.
    6. Re:Why does it matter? by prisoner-of-enigma · · Score: 1

      I didn't keep paperwork. I'm not a goddam analyst. If someone wanted to do that, fine. Just don't bother to tell me.

      Good God, do you realize what a walking, talking epitome you are of the worst aspects of someone in IT? Condescend much? I've been in this industry for almost three decades. The absolute worst IT people in existence are the ones who treat humans exactly as you are treating them. It doesn't matter one whit how much of a technical genius you might be if you can't understand you're not just working on machines for the sake of the machines. You're working on tools that people depend upon to do their jobs. Your dismissive attitude towards the human factors of this job is inexcusable. It creates a hostile environment between users and IT that doesn't need to exist. It's a damn good thing you don't work in my IT shop. I'd have fired you for something like this no matter how "good" you were with the gear.

      I guarantee that even after you fixed whatever was wrong with the WAN, the people you interacted with said "what a fucking asshole, I hope we never have to deal with him again" instead of "man, that guy did a fantastic job and I'm going to tell his boss how happy we are with his work!" But I'm guessing this is probably something you could care less about anyway. Good luck with your career. You'll need it.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    7. Re:Why does it matter? by thegarbz · · Score: 1

      I'm not a goddam engineer.

      I am an engineer, one that specialises in reliability analysis. Maybe if the author of TFA was too he'd understand how utterly stupid his comments are.

      Random failures happen, they happen on SSDs, and they happen on HDDs (and motherboards, monitors, vga cards, cpus, ram, psus, etc.). If the writer is in any way "unnerved" then he should be looking at his own backup strategy and take a chill pill.

    8. Re:Why does it matter? by CaptainDork · · Score: 1

      Nah. I was the IT guy. When I hired on, I outsourced analytics back up to management. They picked some poor soul to do a spreadsheet and make slide decks.

      Then management sent the guy to me.

      I said, "No time. Not now, not ever. Meet with management with that stuff. I got work to do."

      I don't know if it turned out well or not.

      --
      It little behooves the best of us to comment on the rest of us.
    9. Re:Why does it matter? by CaptainDork · · Score: 1

      I agree. I embraced RAID of all levels throughout my career. Mostly, the hardness depended on risk/cost assessment by managers who were clueless. I always asked for the best.

      Despite that, I've had servers go sideways and there wasn't a goddam thing that was going to stop it.

      Failed backup was my worst nightmare. I've pulled all-nighters making sure I had good backups.

      --
      It little behooves the best of us to comment on the rest of us.
  23. Hmm by Anonymous Coward · · Score: 0

    In the earlier days, when 'Intel' drives were regarded as the best/most reliable - We purchased 12. We fitted them to laptops. The performance vs spinning disk was next level. Every single drive was dead within 4 months.

    This was followed by OCZ madness.

    I'm frankly not over it yet. The fact that a power cut-off is still enough to terminate some models is bad juju. In terms of performance, I am convinced. In terms of an all round good replacement to 'spinning rust', the jury is out for me.

    When we make the majority of drives so they are reliable, including in cases of a power outage, and where the drives are still not in the failure rate levels of spindle disks, then we can talk. Unless speed is required, I still have a leaning towards using spindle disks over SSD :/

    -- All drives get mitigated by actions like RAID, taking backups. But I've been around. Many times bad drives have been brought to a workable state by pulling data off a drive, not just sudden blackness. As the author states, he is not alone in disliking the sudden, without-warning nature of SSD failures.

    And far as I see this is being compounded by cheap'ness' being applied in flash with die shrinkage and production leaning ever more towards components that die faster, and have less life, mitigated by clever 'wear levelling' firmware.

    1. Re:Hmm by omnichad · · Score: 1

      Should have skipped Intel and OCZ and just waited for the Samsung EVO line. I've installed dozens over the last few years and not a single failure yet.

  24. Re:Department of Computer Science --- are you sure by bobbied · · Score: 2

    Doesn't know how SSD's work.

    No offense to CS majors, but this EE major tends to understand "How a computer works" at a lower level than most of you programmer types. While not universally true, in my experience a Computer Science major generally gets outside their comfort zone with hardware once you get past "Plug it in and turn it on." I don't blame them, there is a lot of stuff happening at lower levels than a CS major needs to know to do their job.

    That some CS major is concerned about how SSD's fail because he doesn't understand their failure modes is fine. We tend to fear what we don't understand and let's face it, there is a LOT of stuff going on inside a computer that high level users simply don't need to know. Heck, even I don't need to know some of that stuff and I've designed computing systems in the past. Fear not, if it works, it works, if it doesn't you just replace it anyway.

    --
    "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  25. Both are black-ish boxes by wbr1 · · Score: 1

    Yes, you can listen for mechanical issues, yes you can (sometimes) read bad block and other SMART data. But, ultimately, without millions in equipment and skills, you just do not know. It is a cheap data storage brick. Choose one appropriate for your capacity and I/O needs, have a good backup plan in place, and quit whining.

    --
    Silence is a state of mime.
  26. Shit happens.. by Rick+Schumann · · Score: 1

    ..and the more complex a machine is, the more that can go wrong with it.
    The controller PCB on a brand-new modern HDD can fail, rendering the entire device useless; any piece of silicon on a modern SSD can fail also, rendering the entire device useless. The only difference here is that with a HDD, if you happen to have another working drive of the exact same model and revision level, you could theoretically swap the controller PCB and be able to access the data on the platters again (I've done this). With an SSD it's all one PCB and short of actually diagnosing the failure and replacing failed component(s), the chance of accessing the contents of the flash memory is a snowball's chance in hell.

    There's no point in worrying about it, though. Back up your important data and forget about it. If the system in question is mission-critical and up-time is essential, then use two SSDs in a mirror set, and don't worry about it. If someone is going to get their head lopped off if there's any chance of the system in question failing due to SSD failure, then mirror your mirror-set to another mirror-set (i.e. use 4 SSDs) and back the whole mess up to an off-site location regularly. Sitting around biting your nails down to the quick isn't going to help anything.

  27. Mod Parent Up by mykepredko · · Score: 1

    Maybe it helps the author to develop a narrative, but the long and short of it is, the author's non-volatile storage unit died, he needs to replace it to get the system back and he can send it back to where he bought it from because it died under warranty. Or, he might want to have it destroyed locally if it contains proprietary information.

    If you're in IT, I'm sure you'll see everything eventually break (including things like cases which don't make any sense at all) so why sweat it?

    1. Re:Mod Parent Up by Anonymous Coward · · Score: 0

      why sweat it?

      There's a lot of value in something that will work 5 years past warranty, as opposed to working 1 year past warranty. Suppose that a bunch of SSDs failed 1 year past warranty and you know that it was due to the huge amount of small writes of your application, then armed with that knowledge you could potentially change just 1 parameter in the application and save your company millions of dollars in SSD replacements.
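
      As a back-of-the-envelope illustration of that point, here is a sketch of how write pattern feeds into expected drive life through write amplification. Every number in it (capacity, program/erase cycles, daily write volume, amplification factors) is invented for illustration and not taken from any real drive or datasheet.

```python
def years_until_worn_out(capacity_gb, pe_cycles, host_gb_per_day, write_amplification):
    """Rough wear-out estimate, ignoring every failure mode except cell wear.

    capacity_gb:          usable flash capacity
    pe_cycles:            rated program/erase cycles per cell
    host_gb_per_day:      data the application asks the drive to write
    write_amplification:  NAND bytes written per host byte written
    """
    nand_budget_gb = capacity_gb * pe_cycles
    nand_gb_per_day = host_gb_per_day * write_amplification
    return nand_budget_gb / nand_gb_per_day / 365.0


# Hypothetical 500 GB drive, 1000 P/E cycles, 200 GB/day of host writes.
small_random = years_until_worn_out(500, 1000, 200, write_amplification=10.0)
large_batched = years_until_worn_out(500, 1000, 200, write_amplification=1.5)
print(f"many small writes:   {small_random:.1f} years")
print(f"large batched writes: {large_batched:.1f} years")
```

      The host writes the same amount of data in both cases; only the assumed amplification differs, which is the kind of single application parameter (buffering, batch size, sync frequency) the scenario above has in mind.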

    2. Re:Mod Parent Up by CaptainDork · · Score: 1

      Victim blame much?

      --
      It little behooves the best of us to comment on the rest of us.
    3. Re:Mod Parent Up by Anonymous Coward · · Score: 0

      You sound like someone who is bitter because you tossed away a lot of hardware, and later you realized that a lot of it could have been prevented if only you had some minimal amount of curiosity. Care to share your story?

    4. Re:Mod Parent Up by CaptainDork · · Score: 1

      I'll be glad to. Thanks for the opportunity.

      ... because you tossed away a lot of hardware, and later you realized that a lot of it could have been prevented if only you had some minimal amount of curiosity.

      I wasn't a goddam hardware guy. I was a productivity guy. I'll give you an analogy but it isn't a car one, OK?

      My boss asked me one time if I could hack (ca. 1990). I said, not very well. He was surprised because he thought I could do everything.

      He asked me why I wasn't any good at it and I told him, "Look: I got just so many hours in the day. I live and breathe computer shit and I spend all my time studying and experimenting with the crap that supports your business. You're a law firm. You need to shuffle documents and you have no use for a propeller head using your equipment, on your dime, learning stuff that's not relative to the income stream."

      --
      It little behooves the best of us to comment on the rest of us.
    5. Re:Mod Parent Up by Immerman · · Score: 1

      How is it victim blaming - other than being able to tell if the victim actually *is* to blame? Unlike HDDs, SSDs wear out with use, and nobody on the planet sells an "unlimited use SSD".

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    6. Re:Mod Parent Up by CaptainDork · · Score: 1

      Whose fault is that?

      --
      It little behooves the best of us to comment on the rest of us.
    7. Re:Mod Parent Up by prisoner-of-enigma · · Score: 1

      "Look: I got just so many hours in the day. I live and breathe computer shit and I spend all my time studying and experimenting with the crap that supports your business. You're a law firm. You need to shuffle documents and you have no use for a propeller head using your equipment, on your dime, learning stuff that's not relative to the income stream."

      You must be all kinds of fun at parties, but I digress.

      For a small firm I agree it's usually pointless to do a failure analysis. However, if you're dealing with a larger company, failure analysis is crucial. Otherwise you could be replacing a failed unit with one that's just as failure prone because you don't know why it failed. Warranties are great but they don't replace lost data and/or downtime due to device failure. RAID and backups aren't magically 100% effective. I think what the OP is lamenting is there isn't even the slightest possibility of doing any analysis even if you have the time to do it.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    8. Re:Mod Parent Up by prisoner-of-enigma · · Score: 1

      Nobody sells one because such a device cannot be built. Entropy always wins in the end.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    9. Re: Mod Parent Up by Anonymous Coward · · Score: 0

      Because if you have thousands of drives there may be something in your local environment that is making failure more likely.

      Just because you

    10. Re:Mod Parent Up by Anonymous Coward · · Score: 0

      Yes, you're right. Suggesting that people keep in mind the limitations of their hardware while developing an application is definitely "victim-blaming". You sound like a faggot.

    11. Re:Mod Parent Up by Immerman · · Score: 1

      As they said - it can't be built, the technology just doesn't work that way. Flash memory cells wear out with usage. And the write-cycle limitation is generally displayed prominently on the packaging and marketing literature, as there is a large amount of variation depending on the exact technology, scale, and storage strategy used.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    12. Re:Mod Parent Up by CaptainDork · · Score: 1

      We discovered the same thing about flash drives long ago, remember?

      --
      It little behooves the best of us to comment on the rest of us.
    13. Re:Mod Parent Up by CaptainDork · · Score: 1

      Absolutely correct.

      We can deal with entropy in small blocks by replacing parts. SSD doesn't have the granularity.

      Good point.

      --
      It little behooves the best of us to comment on the rest of us.
    14. Re:Mod Parent Up by CaptainDork · · Score: 1

      I agree about small site vs big site mentality and architecture.

      As we approach enterprise level, like a Mobil Oil, we have to specialize. There's just no other way.

      I didn't give a flying fuck why something failed.

      I had lots of people (on 1350 desktops in one of the places) who really didn't want to know the why the shit failed. They wanted to be up yesterday and I was always a user advocate.

      You manage your site however you see fit. I won't question your methods at your house, OK?

      --
      It little behooves the best of us to comment on the rest of us.
    15. Re:Mod Parent Up by Immerman · · Score: 1

      Right. And nothing has fundamentally changed, except that flash drive technology got fast and reliable enough that now we stick them inside our computers and call them SSDs.

      So where does the victim blaming come in?

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    16. Re:Mod Parent Up by CaptainDork · · Score: 1

      It comes from AC, above:

      There's a lot of value in something that will work 5 years past warranty, as opposed to working 1 year past warranty. Suppose that a bunch of SSDs failed 1 year past warranty and you know that it was due to the huge amount of small writes of your application, then armed with that knowledge you could potentially change just 1 parameter in the application and save your company millions of dollars in SSD replacements.

      I don't know about you, but I don't buy fragile hardware and program it using DaintyCode.

      --
      It little behooves the best of us to comment on the rest of us.
    17. Re:Mod Parent Up by Immerman · · Score: 1

      Oh, I agree. But in that case you would be the victim of your own foolishness - it's not the SSD manufacturer's fault that you didn't consider the impact of your program on the well-stated limitations of their hardware. And nobody was blaming the actual victim - the company - for the programmer's stupidity.

      As a non-coding example - if I build a shoddy set of stairs that collapse on me so I break my leg, telling me it's my own damned fault isn't really victim blaming, is it? It *is* absolutely and wholly my own fault. Not in any way like the quintessential blaming of a rape victim for the actions of their rapist.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    18. Re:Mod Parent Up by CaptainDork · · Score: 1

      ... if I build a shoddy set of stairs that collapse on me so I break my leg ...

      I understand your point, but mine is that I don't build SSDs.

      If the stairs I built don't work, that's on me. If the stairs you built for me don't work, I'm the victim and you wouldn't blame me, right?

      I think we're on the same page.

      --
      It little behooves the best of us to comment on the rest of us.
    19. Re:Mod Parent Up by Immerman · · Score: 1

      That depends. If I built you a standard set of stairs, and they collapsed when you tried to send a herd of elephants up them, I absolutely *would* blame you. You are the one using them in ways they were never designed for.

      Similarly, if you rapidly kill an SSD by abusing it with an incredibly write-intensive workload, when its limitations are clearly labeled, I would also blame you.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
    20. Re:Mod Parent Up by CaptainDork · · Score: 1

      OK, enough. You jumped off the cliff. I'm staying here.

      I will say, I admire your commitment to your sig. :)

      Thanks.

      --
      It little behooves the best of us to comment on the rest of us.
    21. Re:Mod Parent Up by Immerman · · Score: 1

      I don't see it. X is designed for Y. If you use it for Z, when labeling and/or common sense clearly indicate it's not suited for such a use... that's your problem.

      Hehe. I decided on my sig to warn people where I was coming from. I usually sincerely mean what I'm saying, but have been known to change sides mid-argument if I start getting poorly-reasoned "support".

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
  28. I still don't trust SSDs even now. by Anonymous Coward · · Score: 0

    I still don't trust SSDs. I'd rather make a virtual RAM Drive over using them. It simply isn't worth the hassle when they fail. They've proven to still be dodgy even today, never mind the shitfest that was early SSDs. Holy corruption hell Batman.
    HDDs are good enough for most tasks. Demanding tasks can be done in RAM quite easily, and better, than SSDs.
    What, you don't have at least 32GB of RAM? Probably because you spent it wastefully on a shitty SSD listening to hype. All your data is a guinea pig to them.
    Sure hope you have backups.

    HDDs are trivial to repair if the logic failed. Replace it with another drives board. Harder to find the board separately, but this is why you always buy drives in pairs. They are less easy, but still accessible, if you are capable of removing the platters and putting it in another. Requires more effort and knowledge to do properly, but doable.
    Or, just have backups, restore backup to new drive. Cannibalize the dead drives components as spare parts, optionally sell them, ditch the rest.
    SSDs are a whole load of FUCK YOUR FILES if they fail. True even for specialists. It also requires way more effort to fix for said specialists.
    Disk platters are pretty simple to work with. SSD electronics and specs are all the fuck over the place.
    Until these idiots standardize everything, I'm going nowhere near them. I class them as more dangerous than IntelME in terms of destruction (to files and the mind) they can cause. Hell, even Spectre-class bugs.
    Again, just not worth the hassle.

    1. Re:I still don't trust SSDs even now. by MightyYar · · Score: 1

      Uh, for the massive performance boost you get from an SSD, they are totally worth setting up a backup job. Image the disk, set periodic backups to a server or even iDrive/Crashplan/Dropbox/etc and carry on with life. Hell, even leave the spinning disk in place and backup to that. For $60 you can extend the life of an old PC by several years simply by swapping in an SSD.

      You should have backups anyway.

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    2. Re:I still don't trust SSDs even now. by Anonymous Coward · · Score: 0

      Been using SSDs at home and in servers for 10 years and don't regret it for a second. We've had far more hard drives fail than flash drives. Yes, we have backups, and RAID, just like we did when all we used were HDDs.
      If you have to resort to repairing a hard drive, whatever you're doing is the wrong approach.

    3. Re:I still don't trust SSDs even now. by tepples · · Score: 1

      What, you don't have at least 32GB of RAM?

      I see your point about prefetching most of your environment to disk cache. That's why Microsoft added the "SuperFetch" feature to Windows over a decade ago and Canonical added "ureadahead" to Ubuntu. But there are three problems:

      First, many tablet computers and compact laptops lack slots for 32 GB of RAM.
      Second, even on those machines that can take 32 GB, loading 32 GB when booting or when waking from hibernation takes a while before the prefetch stops being a source of read latency.
      Third, when a file is written and flushed, the application that you are using still needs to wait for the data to be written to spinning rust in case the power fails or the kernel panics. That adds several milliseconds of latency.

    4. Re: I still don't trust SSDs even now. by Anonymous Coward · · Score: 0

      The Samsung evo line seems quite good. So far zero failures over a large number of disks for years

  29. Forward error correction by Strider- · · Score: 1

    Despite what others have said, this comes down to the brick wall nature of error correction codes. Every time you erase and rewrite a flash cell, you add wear to the transistors that make up the memory cell. Eventually (and probably immediately too) some of the bits won't read correctly. To compensate for this, the controller runs a mathematical function on your data, allowing it to recover from a certain percentage of bad bits. This is good, as that combined with wear leveling allows it to run a long time. However, once it hits that percentage, it's like hitting a wall and it can't recover.
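To put a rough number on the "brick wall" described above, here is a minimal sketch. It assumes a hypothetical code that corrects up to `t` bit errors per codeword and that bit errors are independent; the codeword size, `t`, and the bit error rates are illustrative values, not figures from any real controller.

```python
from math import comb


def p_uncorrectable(n_bits: int, t: int, ber: float) -> float:
    """Probability that a codeword of n_bits contains more than t bit errors.

    Assumes independent bit errors at raw bit error rate `ber` (a
    simplification; real flash errors are correlated).
    """
    # Probability that the number of errors is within what the code can fix.
    p_correctable = sum(
        comb(n_bits, k) * ber**k * (1.0 - ber) ** (n_bits - k)
        for k in range(t + 1)
    )
    return max(0.0, 1.0 - p_correctable)


if __name__ == "__main__":
    # Hypothetical 4096-bit codeword that can correct up to 8 bad bits.
    for ber in (1e-4, 5e-4, 1e-3, 2e-3, 5e-3):
        print(f"raw BER {ber:.0e}: P(uncorrectable) = {p_uncorrectable(4096, 8, ber):.3e}")
```

Under these assumptions the failure probability stays negligible over a wide range of wear and then climbs to near certainty over a comparatively small increase in raw error rate, which is the cliff the comment describes.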

    --
    ...si hoc legere nimium eruditionis habes...
    1. Re:Forward error correction by Anonymous Coward · · Score: 0

      True, but if that was the cause, it should show up in SMART as the level of ECC margin declined.

    2. Re:Forward error correction by Anonymous Coward · · Score: 0

      Despite what others have said, this comes down to the brick wall nature of error correction codes. Every time you erase and rewrite a flash cell, you add wear to the transistors that make up the memory cell. Eventually (and probably immediately too) some of the bits won't read correctly. To compensate for this, the controller runs a mathematical function on your data, allowing it to recover from a certain percentage of bad bits. This is good, as that combined with wear leveling allows it to run a long time. However, once it hits that percentage, it's like hitting a wall and it can't recover.

      Just like mechanical disks had a bad sector list, SSDs also have a bad block list.

      Those worn out blocks are placed on the bad blocks list and either not used or used with lower level (QLC to TLC to SLC etc).

      Heck, it wasn't even uncommon for SSDs to arrive with a big bad block list (due to manufacturing imperfections). I don't know if that is still the case.

    3. Re:Forward error correction by prisoner-of-enigma · · Score: 1

      True, but that doesn't explain sudden, catastrophic SSD failure. Modern controllers remap bad blocks to the "spare area" on all SSD's. Keeping track of said blocks can offer a modicum of failure prediction. Indeed, many high-quality drives -- and nearly all enterprise drive arrays -- do exactly that.

      None of it matters if the onboard controller itself dies, and such failures cannot be predicted nor can they be analyzed. That's what the OP was lamenting. The only possible remedy is to avoid that brand/model in the future, assuming that's even an option (and with Dell/HP/Lenovo it frequently isn't since the OEM's will only warranty and replace OEM equipment).
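The spare-area bookkeeping mentioned above can be sketched roughly as follows. This is a toy model, not how any particular controller works: the class name, the 10% warning threshold, and the block counts are all assumptions for illustration. As the comment notes, it says nothing about controller death.

```python
class SpareAreaTracker:
    """Toy model of remapping worn-out blocks into a fixed spare area."""

    def __init__(self, spare_blocks: int) -> None:
        self.spare_total = spare_blocks
        self.remap: dict[int, int] = {}  # bad block number -> spare slot

    def retire(self, block: int) -> bool:
        """Remap a bad block; return False once the spare area is exhausted."""
        if block in self.remap:
            return True
        if len(self.remap) >= self.spare_total:
            return False
        self.remap[block] = len(self.remap)
        return True

    def spare_remaining_pct(self) -> float:
        used = len(self.remap)
        return 100.0 * (self.spare_total - used) / self.spare_total

    def should_warn(self, threshold_pct: float = 10.0) -> bool:
        """Hypothetical alert rule: warn when little spare capacity is left."""
        return self.spare_remaining_pct() <= threshold_pct


if __name__ == "__main__":
    tracker = SpareAreaTracker(spare_blocks=10)
    for bad_block in range(9):
        tracker.retire(bad_block)
    print(tracker.spare_remaining_pct(), tracker.should_warn())
```

A drive that exposes a counter like this gives some warning of wear-out; it gives none when the failure is elsewhere on the board.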

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
  30. Re: It's fairly simple by Anonymous Coward · · Score: 0

    I still feel Flash is a flawed technology because it can wear out. With both computers and electronics in general, a "worn out" chip just doesn't happen. If a chip is dead, either it was exposed to excessive voltage or its supplied cooling apparatus failed.

  31. Good luck putting RAID in a laptop by tepples · · Score: 1

    I doubt that most home PC users have both the case space and the cash for a RAID. A user of a mainstream laptop sure doesn't.

    1. Re:Good luck putting RAID in a laptop by Anonymous Coward · · Score: 0

      Most cases absolutely have space for a second M.2, they also have software RAID built right into the motherboard so it only costs as much as the second drive.

      As for laptops, they travel, as such they are subject to shock which can break solder points causing all kinds of failures, usually screens are the first to go. SSDs are often soldered right on to the motherboard on cheaper laptops which makes replacement a much more difficult task. If the SSD is replaceable then you should simply just use your backup, restore to a new drive and away you go.

    2. Re:Good luck putting RAID in a laptop by MightyYar · · Score: 1

      Yes, I'm in that position with my small notebook. In my case, I imaged the drive when I first got it. I have Windows Backup set to backup to an NAS and I have iDrive installed for offsite backup. Most people don't need to go so crazy - they can get away with running Dropbox, OneDrive, Google Drive, etc. as their primary "Documents" folder and then letting Geek Squad put in a new drive and reinstall Windows. But even "most people" need to have backups of some kind. If they can't image a disk, they certainly won't be savvy enough to rescue data from a dying spinning hard drive.

      --
      W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
    3. Re:Good luck putting RAID in a laptop by omnichad · · Score: 1

      A lot of mainstream consumer laptops come with an M.2 slot for configurations with SSD but still have the SATA port for models with an HDD. You can fill both slots and make a RAID - the disks will just be different shapes. Software RAID, sure, but it can definitely be done affordably.

    4. Re:Good luck putting RAID in a laptop by tepples · · Score: 1

      If the SSD is replaceable then you should simply just use your backup

      How many days old is your backup?

      restore to a new drive

      How many days of shipping away is the new drive?

    5. Re:Good luck putting RAID in a laptop by Anonymous Coward · · Score: 0

      I doubt that most home PC users have both the case space and the cash for a RAID.

      Even little shitty PC cases have a place to stick a drive.

      A user of a mainstream laptop sure doesn't.

      My Lenovo laptop has mirrored hard disks.

    6. Re:Good luck putting RAID in a laptop by tepples · · Score: 1

      Even little shitty PC cases have a place to stick a drive.

      "a drive" singular != "drives" plural, one requirement of RAID. Or are you recommending RAID between an internal drive and an external drive?

      My Lenovo laptop has mirrored hard disks.

      How big is it in inches (diagonal visible image size)? Drive bays that are practical in a 17" might not be practical in a 11.6" or 13".

    7. Re:Good luck putting RAID in a laptop by Anonymous Coward · · Score: 0

      "a drive" singular != "drives" plural, one requirement of RAID. Or are you recommending RAID between an internal drive and an external drive?

      Seriously, "stick a drive" means add a drive to the existing computer. It doesn't mean there is only one slot for one drive in the system.

      How big is it in inches (diagonal visible image size)? Drive bays that are practical in a 17" might not be practical in a 11.6" or 13".

      14" T series and hot swappable. It's a normal sized laptop, not a monster workstation. More modern systems support M.2 cards and you have more options to stick them in laptops. There will often be free expansion slots, like ones reserved for cellular data cards, that can be used for an additional mini PCIe SSD.

  32. it's worse in space.. by unfortunateson · · Score: 1

    Reports from the ISS are that 9 out of 24 SSD drives failed in an HP supercomputer they'd brought up there. Quite scary how fragile those things are from radiation.

    --
    Design for Use, not Construction!
    1. Re:it's worse in space.. by hcs_$reboot · · Score: 1

      9 out of 24 SSD drives failed

      Were they all the same brand / type?

      --
      Slashdot, fix the reply notifications... You won't get away with it...
  33. Also here by jf_moreira · · Score: 2

    That happened to me three or four times already. They die without warning. No SMART indication, nothing. It really pisses us off. Someone needs to technically give us some kind of anticipation. Maybe SMART is not supposed to work well with SSD after all.

    1. Re:Also here by jimbo · · Score: 1

      HDD can also die suddenly, it's just that they also, in addition, have a class of failures that can be detected early.

  34. The spin is in! by theendlessnow · · Score: 4, Insightful

    One thing I like about spinning disks is that a lot of times the failure is gradual. Bad sectors and such and you have the opportunity to grab data off the drive (noting, you really should have backups).

    With SSD, whatever the issue, it's more like losing a controller board on the drive, everything dies and ceases to operate.

    So... I'll go along and say SSD is "better" and more "reliable", but when it dies, it dies hard. Just the way it is. (not talking about performance degradation... speaking about failure)

    1. Re:The spin is in! by scamper_22 · · Score: 1

      Same. I've never had a spinning HD just die. They always 'act' funny for a while.

      Then again, I've kind of stopped worrying about harddrives dying. Ever since I started working, it's been RAID 1 with two harddrives.
      One starts going bad, I swap it out.

      Then I have a NAS with RAID as well. Same thing there.

      I've been running for a while without worries.

    2. Re:The spin is in! by thegarbz · · Score: 1

      Backups handle random failures just as well as wearout failures. I'm happy that SSDs have surpassed HDDs by removing the wearout related failures (to a large extent anyway).

    3. Re:The spin is in! by edis · · Score: 1

      My experience is this: of about 10 HDDs that failed around here recently, none of them part of a RAID, I was able to salvage every single one. Restoring the system to a running state was as simple as taking the drive and making a failsafe dd copy under a booted Linux. For RAID drives you don't care beyond pulling the unit out for replacement.

      I like the reading speeds of SSD, but combined with the factors explained above, I could only give them preference where dying is OK, something like disk in kiosk, that would not accumulate specific setup or data.

      --
      Servant of karma
    4. Re:The spin is in! by justthinkit · · Score: 1

      Bias ply tires were replaced by radial tires.

      Thing is that bias ply tires failed (slipped, when going around a corner too fast) in a progressive way.

      Radial tires provided more grip than bias ply tires, right up until they failed completely at providing traction.

      --
      I come here for the love
  35. Think of SSD drives as RAM memory by David_Hart · · Score: 0

    Do you get this anxious when a RAM module fails? There really is no difference between a RAM module failing and a SSD failing...

    Just make sure that you have backups....

    1. Re:Think of SSD drives as RAM memory by Anonymous Coward · · Score: 0

      Do you get this anxious when a RAM module fails? There really is no difference between a RAM module failing and a SSD failing...

      There are obvious significant differences between persistent and non-persistent storage and the repercussions of failure. If you are confused by this you probably shouldn't be posting here.

    2. Re:Think of SSD drives as RAM memory by prisoner-of-enigma · · Score: 1

      Do you get this anxious when a RAM module fails? There really is no difference between a RAM module failing and a SSD failing...

      Uhh...people don't usually store critical files in volatile RAM. Kind of a huge difference there. Further, RAM failures may crash the computer but it rarely destroys anything else in the process. A mass storage failure -- be it HDD or SSD -- virtually guarantees you'll lose whatever data you had on it. Your only recourse is RAID (which isn't an option on most laptops) or some sort of backup (which is difficult to enforce on mobile users).

      Yes, you can blame users all day long for not backing up their data. It doesn't help when you're still responsible for IT as a whole. The problem lands on your desk whether you want it or not.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
  36. Damage from static electricity is a good bet by bdwoolman · · Score: 1

    Improper handling of ungrounded components really can mess them up. They work but are defective. Take a look at some micrographs of ESD damage sometime. ESD does not always kill a part; it maims, sometimes only slightly. Anti-static mats and wrist straps are no laughing matter. Okay, they are. But use them anyway.

    --
    "No fear. No envy. No meanness." Liam Clancy
    1. Re:Damage from static electricity is a good bet by dcw3 · · Score: 1

      "Static Zap makes Crap" - One of my favorite sayings from Computer Tech training in the USAF back in the 70s.

      --
      Just another day in Paradise
    2. Re:Damage from static electricity is a good bet by Anonymous Coward · · Score: 0

      are you disagreeing with the teenagers on /r/buildapc? my god man.

  37. Heat by Thelasko · · Score: 1

    Most of the time heat kills electronics. Either they get too hot and something fries, or they suffer thermal fatigue.

    --
    One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".
    1. Re:Heat by dcw3 · · Score: 1

      Heat, static, condensation, unstable power, radiation, magnetic fields, vibration...pick your poison. It all depends on the environment you're working in and how well the equipment was designed.

      --
      Just another day in Paradise
    2. Re:Heat by Anonymous Coward · · Score: 0

      I've read that M.2 SSDs get very hot, but common SATA ones don't get very warm.

    3. Re:Heat by prisoner-of-enigma · · Score: 1

      All the more reason to have some way of doing failure analysis on the failed component. If, for example, it died after prolonged high temps, you know you've got a cooling issue and can perhaps do something about it. If you can't do an analysis, you have no way of knowing if there is something you can do to avoid -- or at least reduce the possibility of -- future failures of the same type.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
    4. Re:Heat by dcw3 · · Score: 1

      I agree 100%. But only to a cost effective level. For example, if you have a thousand hard drives, it's probably worth your effort to track outage reasons. If you have five, not so much.

      --
      Just another day in Paradise
  38. Has This Guy by Anonymous Coward · · Score: 0

    Even touched a real computer rather than just read about them in textbooks? CPUs don't have moving parts but they fail often. Usually it's due to heat (whether too much heat or too much variation). What would boggle my mind is if they made a PC part that never broke...that would change my fundamental view on the universe.

  39. Failure done right - Sandisk USB by Stonent1 · · Score: 2

    I had a Sandisk USB stick recently go read only. I had been using it as a hypervisor boot drive and the boot was crashing. When I inspected it, it was read only and any attempts to format it, diskpart it, fdisk it failed with some kind of error. I looked it up and apparently this is the designed failure route for these USB drives. When the controller detects an inconsistency or uncorrectable error, the drive is locked from writing so you can get data off of it.

  40. He's right. by GameboyRMH · · Score: 2

    SSDs really are unpredictable timebombs, so act appropriately - take frequent backups and use RAID if the downtime from a sudden SSD failure with zero warning is unacceptable. Any IT department that hasn't been prepared for the nature of SSD failures since long before they were available off the shelf was doing it wrong anyway.

    I'm most worried about what SSDs mean for the Average Joe, whose data is largely protected by the predictability and recoverability of most hard drive failures. SSDs throw all of that out the window and lure them in with the warm glow of performance like moths to a flame. Average Joes need a real wake-up call on the importance of backups with the switch to SSDs.

    --
    "When information is power, privacy is freedom" - Jah-Wren Ryel
    1. Re:He's right. by thegarbz · · Score: 1

      SSDs really are unpredictable timebombs

      So are HDDs. Just because you have wearout related failure modes that make their life even shorter doesn't mean controller failures don't happen.

      There's nothing magic about SSDs. Random failures happen. Have a backup / business continuity strategy.

    2. Re:He's right. by GameboyRMH · · Score: 1

      It's possible but very unusual for HDDs to fail irrecoverably and without warning, but that's the normal failure mode for SSDs, that's the difference.

      --
      "When information is power, privacy is freedom" - Jah-Wren Ryel
    3. Re:He's right. by thegarbz · · Score: 1

      That's horseshit. There are many quite common failure modes for HDDs that come without warning, including some of the mechanical wearout style ones which have incredibly soft and interpretive SMART statistics.

      I take it you've never had sudden head failure of a HDD, control board failure? Random failures happen and you should consider yourself lucky if your HDD is dying due to one of the very limited mechanical cases that are detectable by SMART.

    4. Re:He's right. by GameboyRMH · · Score: 1

      I'd had head crashes in the '90s and no clear control board failures so far. Since the new millennium, I haven't had any totally unexpected hard drive failures, and no unrecoverable ones. With the custom SMART reporting/alerting script (to work around the soft and interpretive SMART statistics and focus on the ones that matter) on my home server, I've been able to see them all coming far in advance. The firmware-level SMART alerting system on the servers at the office seems to catch them well in advance too.
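The commenter's script isn't shown, so what follows is only a hypothetical sketch of the idea: parse the attribute table in the format printed by `smartctl -A` and alert on a short list of raw values. The choice of watched attributes (5, 197, 198) and the "any non-zero raw value" rule are assumptions, and the sample text is made up for illustration.

```python
# Attributes assumed here to be "the ones that matter"; adjust to taste.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

# Made-up sample in the column layout that `smartctl -A` prints.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21000
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""


def parse_smart_attributes(text: str) -> dict[int, int]:
    """Map attribute ID to raw value for rows with a plain integer raw value."""
    attrs: dict[int, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[int(fields[0])] = int(fields[9])
    return attrs


def alerts(text: str) -> list[str]:
    """Return one message per watched attribute whose raw value is non-zero."""
    attrs = parse_smart_attributes(text)
    return [
        f"{WATCHED[attr_id]} raw value is {attrs[attr_id]}"
        for attr_id in sorted(WATCHED)
        if attrs.get(attr_id, 0) > 0
    ]


if __name__ == "__main__":
    for message in alerts(SAMPLE):
        print("ALERT:", message)
```

In real use the text would come from running `smartctl -A` on the device, for example from a cron job; that part is left out here so the sketch stays self-contained.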

      --
      "When information is power, privacy is freedom" - Jah-Wren Ryel
    5. Re:He's right. by thegarbz · · Score: 1

      Well I've got a 50/50 success rate on both HDDs and SSDs, though admittedly I've only had 2 SSD failures to date rather than a larger number of HDD failures. That includes live monitoring of SMART parameters and excludes replacing risky drives at end of life (the last 2 drives I retired had no sign of failure but had over 7 years of head flying hours on them so it was time to go).

      Control board failures are common enough that it was a well-known process to swap out control boards of identical drives back in the day in the hope that mechanically the drive is okay. This doesn't work these days due to boards being more "custom", which is to say that parameters unique to each drive are configured into the board at the factory.

  41. Microns and Electrons by Anonymous Coward · · Score: 0

    the layer that holds the charge in an ssd is just a few microns thick.

    We rely on the fact that we can trap electrons in a floating gate with quantum tunnelling.

    we *know* it has a finite lifespan. the floating gate can withstand only so many programming cycles.

    We build-in redundancy.

    But obviously, every sector on the SSD can not be checked for a level of wear.

    Defects on the die, latent defects not detected, materials imperfections, fab variations, WILL cause SSDs to die.

    Always follow the 3-2-1 backup rule.

    3 copies
    2 different types of media [ think HD, tape, or DVD, Blu-ray, etc ]
    1 copy off-site
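As a rough illustration of the 3-2-1 rule listed above, here is a small checker. The `BackupCopy` type and the media labels are hypothetical; whether the original counts as one of the three copies is a matter of convention, and this sketch simply counts whatever is passed in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    media: str      # e.g. "ssd", "hdd", "tape", "cloud" (free-form label)
    offsite: bool   # True if the copy is stored at another location


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Check: at least 3 copies, on at least 2 media types, at least 1 off-site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


if __name__ == "__main__":
    plan = [
        BackupCopy(media="ssd", offsite=False),   # working copy
        BackupCopy(media="hdd", offsite=False),   # local backup drive
        BackupCopy(media="tape", offsite=True),   # copy stored elsewhere
    ]
    print(satisfies_3_2_1(plan))
```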

    the SSD will fail.
    the HD will fail.
    DVDs will become unreadable.
    Tapes can and will be demagnetized

    Add encryption, because let's face it, people will steal stuff.

    1. Re:Microns and Electrons by Anonymous Coward · · Score: 0

      I use SSDs and have had no failures yet (out of ten in the last about 7 years).
      In that time two rotating disks have died.

      So, I use each for what they are best at.

      SSDs are powered on (data retention when powered off is NOT unlimited). But no moving parts (except electrons!)

      SSDs are backed up to rotating drives that are powered off most of the time except for back-ups. Rotating disks have good data retention when powered off, but are at risk when spinning. That is why I power them off when not in use for the back-ups.

      These are switched internal drives, so safe from an external enclosure getting slammed or knocked off the desk, but powered off by switching open the 5 and 12V power lines; grounds stay connected.

  42. Restore which version? by tepples · · Score: 2

    Who the hell cares? Replace it and restore your data.

    The data on a failing drive might be a newer version than the most recent weekly backup. I see value in backing up the newer version elsewhere as the first part of replacing the drive. But SSD failure modes allegedly make this newer version inaccessible sooner than HDD failure modes.

    1. Re:Restore which version? by brantondaveperson · · Score: 1

      Who uses weekly backups? Back up automatically to as many cloud providers as you can afford, and use something like Time Machine (there must be equivalents of this for other OSs, right... surely...) too. No problem.

    2. Re:Restore which version? by tepples · · Score: 1

      Back up automatically to as many cloud providers as you can afford

      Which isn't many if you have a lot of GB of data to back up, such as video or lossless audio, and your home ISP doesn't provide a lot of GB/mo. (Satellite ISPs tend to limit data, and cellular ISPs tend to limit hotspot data.) Or if you don't want yet another utility dipping into your checking account via your debit card every month.

      I looked for Time Machine equivalents on GNU/Linux, and Cronopete at least appears to have been worked on in the past year.

  43. How to deal with those people..... by Anonymous Coward · · Score: 0

    Could be worse. At a previous job, I've had someone demand "7200 RPM SSDs", and no amount of explaining could change the person's mind.

    Nod your head, and say, "Yes! I'm on it!"

    I was called on something like it ONCE. I was asked, "Why did you bullshit the guy?"

    After explaining what happened and what he said, I asked, "So, next time I should refer someone like that to you?"

    I got a disgusted look and the boss walked away. And I got a great performance review 7 months later, too!

    Don't argue with ignorant people who refuse to listen. You get nowhere. Unfortunately, that's most people. - and I'm including myself. In this over-hyped marketing society, I've become cynical. I won't argue, I just think everyone is full of shit until proven otherwise.

  44. The Failure Modes by Sarusa · · Score: 1

    So you can have peace of mind:

    If it dies suddenly, without warning, it's 1) buggy firmware (I think this is by far the biggest culprit), or 2) bad components/soldering/cleaning on the PCB board, or 3) a really dumb controller that isn't doing wear leveling on every single thing (think the master index), so when a critical flash cell dies the entire thing is dead even though there's plenty of good flash left (this was common with crappy little 'SSDs' that were just Compact Flash), or 4) a badly designed controller that leaves the drive in bad state when power suddenly goes out and can't recover

    If it sloooows down and starts getting more and more sluggish you've lost enough flash cells that the wear leveling is losing its capacity to cope. Take some stuff off the drive to give it some breathing room and prepare for its demise. I had this happen with one of the original Intel SSDs (the X-25M). It took ten years of continuous use, though - yes, just this year.
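Failure mode 3 above (a critical index that is not wear leveled) can be illustrated with a toy simulation. The model is an assumption made for illustration only: every host write costs one erase for data and one for the index, every block tolerates the same number of erases, and the device is dead as soon as any block it needs is worn out.

```python
def writes_until_failure(num_blocks: int, endurance: int, level_index: bool) -> int:
    """Count host writes completed before a needed block is worn out.

    num_blocks must be at least 2. If level_index is False, the index is
    pinned to block 0 and only the data is rotated over the other blocks.
    """
    wear = [0] * num_blocks
    cursor = 0
    writes = 0
    while True:
        if level_index:
            # Both the data erase and the index erase rotate over all blocks.
            targets = [cursor % num_blocks, (cursor + 1) % num_blocks]
            cursor += 2
        else:
            # Index always hits block 0; data rotates over blocks 1..n-1.
            targets = [0, 1 + cursor % (num_blocks - 1)]
            cursor += 1
        if any(wear[b] >= endurance for b in targets):
            return writes
        for b in targets:
            wear[b] += 1
        writes += 1


if __name__ == "__main__":
    print("pinned index: ", writes_until_failure(8, 100, level_index=False))
    print("leveled index:", writes_until_failure(8, 100, level_index=True))
```

In this model the pinned-index device dies as soon as the one index block is worn out, with most of the data blocks barely used, which matches the "plenty of good flash left" description.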

    1. Re:The Failure Modes by krray · · Score: 1

      I've had multiple OWC branded SSD's die on me. I usually like OWC branded items, but the SSD failure has me pulling any / all such branded ones out of service.

      It was my understanding that a failing SSD (can't write anymore properly) should flip itself over to READ ONLY mode. At least this would give you a chance to pull the existing data off the drive.

      The OWC failures were catastrophic (sans I had working backups :). When these SSD's failed they were just GONE. Nothing. The system wouldn't see them even connected.

      At least with a hard drive they typically gave you some warning. Getting louder, clicking, ... I can really only think of one drive that just utterly died as I've seen SSD's do now.

      Moral: backup Backup BACKUP

    2. Re:The Failure Modes by Sarusa · · Score: 1

      Yes, I have full backups of everything nightly, so even though I have never had a SSD fail on me catastrophically (cross fingers), it's covered.

      Hard drives do just fall over dead too, but you're right, often there's warning signs.

      The OWC thing doesn't even sound like the 'drive' part (the flash) is failing, it sounds like the controller that talks SATA to the PC is failing, or the power circuit died so the thing doesn't have any power. Otherwise the system would at least see it. And it sounds systemic. So besides backup, backup, BACKUP, we have the moral 'OWC is trash'.

  45. Tiny wires, heat bad by HeckRuler · · Score: 1

    SSDs have a bunch of tiny wires. When you push electricity through wires they heat up, they're not perfect super-conductors. If you heat it up too much, it will of course burn, but they avoid that. Still, heating up a wire over and over will have some wear and tear. For big thick power-lines in houses, this doesn't have too much effect, but for tiny precision electronics, it builds up. And SSD's have a LOT of those wires with a little bit of manufacturing variance which makes some parts fail sooner.

    They burn out the same way lightbulbs burn out. They don't have moving parts, right?

    1. Re:Tiny wires, heat bad by FrankSchwab · · Score: 1

      Wires? Burn out the same way lightbulbs burn out?

      Your understanding of electronics is remarkably wrong.

      --
      And the worms ate into his brain.
    2. Re:Tiny wires, heat bad by HeckRuler · · Score: 1

      Ok, what's a simple word for the traces going into and out of transistors?

      Light bulbs are solid state, riiiiiiight?

  46. No way? Louis Rossman would strongly disagree. by Anonymous Coward · · Score: 0

    Of course you can still access it!

    You just transplant the controller from a working drive.
    I’ve seen him do that. Anyone who can solder chips, can do it.

    (It's only problematic, if the controller itself has internal permanent storage that keeps some state, like of wear leveling. But with the entire thing being a storage device, I don't think anyone is stupid enough to do that.)

    What we want to know, are the physical processes that make chips fail. I'm sure somebody with an actual clue, like from an actual manufacturer instead of a /. armchair "expert", could tell us quite a lot about that.
    Because humans (at least the nowadays rare self-thinker) don't like operating in the dark. I want to know what I can do to 1. avoid harming the drive, and 2. detect failure early.
    I'm pretty sure, you could detect it with a high-resolution heat camera microscope, pointing at the structures. If hardware fails, the best way is always to look for where it’s hot when it shouldn’t. But I want that built in.

  47. Total GARBAGE by Anonymous Coward · · Score: 0

    Garbage upvoted to 5- typical slashdot.

    With a HDD, one has an electro-mechanical mechanism with many points of total system failure. Motor goes, anything to do with the heads go, circuit board goes, and the entire thing is down.

    An SSD drive is closer to a SINGLE chip in concept. How many PCs fail due to CPU failure? 0.0001 percent. Yet the CPU is by far the most complex part.

    An SSD drive is not one chip, but the chips for the flash, RAM (yes, most SSD drives have a 1/10th RAM cache) and interface are very reliable compared to an electro-mechanical device, and most of the faults that can happen do not break the entire system.

    So SSD drive failures ARE 'mysterious' in a way most HDD failures can never be. Silent and 'puzzling'. But the answer lies with SOFTWARE.

    When an SSD drive suffers certain memory fails (including standard flash cell fails), the software that drives the SSD drive may become terminally confused. In reality the vast majority of the SSD drive is still fine, and the flash still accessible. But the OS and driver software is atrociously written for robustness, and so essentially writes the SSD drive off.

    The issue is the path to the reconstructed data on the SSD drive goes thru many software layers. The OS will give up at the slightest confusion. Specialist tools that know how that particular SSD drive works could trivially recover most of the data from the SAME computer. This is NEVER the case with an HDD, which needs to be stripped down and connected to specialist hardware tools- and even then the HDD data may be unrecoverable.

    But notice how many know nothing DRIBBLERS have their useless input voted up.

    A SOFTWARE fail is not the same as a hardware fail. And while the EXCUSE for an SSD fail is initially a trivial hardware fail (like the wrong cell failing), the real reason an SSD drive becomes useless is 99% down to bad software. For it is always possible to make a system robust to ANY cell fail.

    And NO- wear leveling and other nonsense does NOT address this issue. Only certain statistical cases are caught and mitigated by these mechanisms. There are MANY types of predictable cell fails that will brick many SSD drives.

    The BEST solution would be to drop ALL software mechanisms on the SSD drive itself, and allow the flash to be FLATLY addressed by the OS. The OS would then take full responsibility for detecting and remapping cells as they fail. This way the SSD drive could see increasing loss of capacity across usage WITHOUT catastrophic failure.

    Again, it is VANISHINGLY unlikely that a 'bricked' SSD drive does not allow access to most of its memory cells. But a BRICKED HDD can never be mitigated by software on the PC side- no PC software can 'repair' the broken motor, heads, or driving circuit board.

    PS I actually have first hand experience in this field. 'Bricked' memory cards that would actually crash windows, yet could, with low level code, be directly read, bypassing the problem- a simple faulty block of flash. How I wish you could do this with a bricked HDD.
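
A toy sketch of the host-side remapping idea in the comment above, assuming a hypothetical `FlatFlash` class in which the host keeps the logical-to-physical map and retires blocks as they fail. None of these names come from a real driver or SSD. It only illustrates capacity shrinking one block at a time instead of the whole device vanishing, and it says nothing about whether real bricked drives are recoverable this way.

```python
class FlatFlash:
    """Toy model of host-managed flash (hypothetical, for illustration only)."""

    def __init__(self, physical_blocks):
        self.free = list(range(physical_blocks))  # physical blocks not yet mapped
        self.mapping = {}                         # logical block -> physical block
        self.bad = set()                          # physical blocks retired by the host

    def capacity(self):
        """Usable blocks: everything that has not been retired."""
        return len(self.free) + len(self.mapping)

    def write(self, logical):
        """Map a logical block onto a free physical block if it has none yet."""
        if logical not in self.mapping:
            if not self.free:
                raise OSError("no free blocks left")
            self.mapping[logical] = self.free.pop(0)
        return self.mapping[logical]

    def retire(self, physical):
        """Mark a physical block bad and remap whatever logical block lived there."""
        self.bad.add(physical)
        if physical in self.free:
            self.free.remove(physical)
            return
        for logical, phys in list(self.mapping.items()):
            if phys == physical:
                del self.mapping[logical]
                # In a real system the data would have to be re-read or restored.
                self.write(logical)


flash = FlatFlash(8)
flash.write(0)      # logical block 0 lands on physical block 0
flash.retire(0)     # physical block 0 fails; logical 0 moves to the next free block
print(flash.mapping[0], flash.capacity())
```

The design choice here is that a failed block costs exactly one block of capacity; whether a real controller or OS could be built this way is the poster's argument, not something this sketch demonstrates.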

    1. Re: Total GARBAGE by datavirtue · · Score: 1

      Yeah, that's the answer...turning SSDs into WinModems.

      --
      I object to power without constructive purpose. --Spock
  48. Not so by bagofbeans · · Score: 1

    Metal migration limits the lifetime of the interconnect in ICs. Absolutely a wear mechanism.

  49. Mechanical vs Electronics by Shotgun · · Score: 1

    I'm going to disagree with the people saying that spinning disks don't give you a warning of imminent death. A bad spindle will start whirring, and steadily get louder, and my experience has been that most drives go that way. Hence, the old trick of sticking the drive in a freezer to get a few minutes more life out of it (because, you didn't keep your backups updated....again. :-(

    This is a phenomenon that should always be kept in mind when switching from mechanical to electronic systems. Electronic systems are usually MORE reliable, in the sense that they are less likely to go belly up, but WHEN they do, they won't give you any warning. I could arguably make my home-built airplane MORE reliable and feature rich by replacing the flight controls with a fly by wire system. But, one day a gate in one of the processors will fry itself, and the whole system will quit working at once. Woe unto me if I'm at altitude at that point. The mechanical system will require more maintenance, but it will slowly wear out over time, controls will get sloppy, and exhibit more play. That is the system telling me, "I'm getting kinda tired here. I'm getting old, y'all. Replace me. Screw it. I quit." It gives warnings to the operator that knows what to listen for.

    So, the article does have a point. . . sort of.

    --
    Aah, change is good. -- Rafiki
    Yeah, but it ain't easy. -- Simba
  50. WRONG- not this fault by Anonymous Coward · · Score: 0

    God, you think this person so thick he doesn't know his running shoes will eventually fall to pieces if he never changes them?

    IT has tools that SHOW the %loss of cells, so any SSD driven to disaster is no mystery to anyone when it fails.

    These are the fails when the drive does NOT have significant loss of capacity. And these fails happen cos there are cells, and timings for certain cells, when a cell fail spells total system failure. And this does NOT have to be the case- but is down to crap software. Crap file and error recovery that has states that cannot be handled.

    It would be TRIVIAL to improve the software and eliminate these fail modes, but would need software engineers that knew what they were doing. The 'race' to ever cheaper SSD storage is running ahead of excellence in DOABLE engineering.

  51. Re:Department of Computer Science --- are you sure by BLToday · · Score: 1

    Doesn't know how SSD's work.

    No offense to CS majors, but this EE major tends to understand "How a computer works" at a lower level than most of you programmer types. While not universally true, in my experience a Computer Science major generally get's outside their comfort zone with hardware once you get past "Plug it in and turn it on." I don't blame them, there is a lot of stuff happening at lower levels than a CS major needs to know to do their job.

    That some CS major is concerned about how SSD's fail because he doesn't understand their failure modes is fine. We tend to fear what we don't understand and let's face it, there is a LOT of stuff going on inside a computer that high level users simply don't need to know. Heck, even I don't need to know some of that stuff and I've designed computing systems in the past. Fear not, if it works, it works, if it doesn't you just replace it anyway.

    This ^^^. I had a brilliant CS college roommate. But when he built his first computer himself, the motherboard was held to the case with one screw. He couldn’t figure out why it was crashing all the time. Everything in the machine was barely in their slots/socket. This is back in the Pentium days. Days of VLB and very early AGP. And sometimes IRQ switches.

  52. Re:Department of Computer Science --- are you sure by Anonymous Coward · · Score: 1

    " in my experience a Computer Science major generally get's outside"

    Yup, I can believe you're an EE. While you go on and on congratulating yourself almost as hard as a doctor, you can't even tell the difference between GET IS and GETS.

  53. For Chris's peace of mind. by Tjp($)pjT · · Score: 1

    New SSDs, failure could be a die bond failure, a sometimes defect that allows it to pass inspection then fail. Or a ball bond to PC failure that can be intermittent as the package, solder ball, and PC change dimensions due to different thermal expansion coefficients. The tiny contacts on the PC versus relatively huge contacts on the mechanical hard drive make these happen more often on SSDs.

    On older SSDs there could be degradation of the ability to hold or modify the stored charge that represents a bit. Not likely unless you are a heavy duty user. Or metal migration from the mask layer, or metal migration at the bonding level, where physical aluminum or gold wires are bonded to the actual chip. Less likely is a bonding failure to the underlying substrate, as the wire material used is chosen for high compatibility.

    Now Chris, feel better knowing just a few ways you can envision the failures?

    --
    - Tjp

    I am in wallow with my inner money grubbing capitalistic pig. ... Oink!

  54. Backup your data frequently by Solandri · · Score: 2, Insightful

    Backup your data frequently. Stop worrying. Is that so hard?

    1. Re:Backup your data frequently by Anonymous Coward · · Score: 0

      Well, no. But it puts some of the responsibility on the user, and we don't want that do we. When I get a new ssd in a computer I create a nice big file full of random characters. Never touch it. As long as it's there those cells are not being written to and are a poor man's cache. If the drive runs out of good cells I intend to delete that file which should free up those pristine cells for use. Have no idea what to do for controller software cell death. Buy a new drive I guess, and make regular backups. For some reason this reminds me of my newbie days when I discovered running out of inodes.

    2. Re:Backup your data frequently by Andtalath · · Score: 1

      Ever heard of defragmentation?
      This will ruin this little scheme.

      What MIGHT work is a dedicated partition.

      Still, this is, in fact, already done by all disks which don't give you an exact power-of-2 capacity, since they leave some cells spare to be able to move data around...
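
A small worked example of the spare-area arithmetic hinted at above. It assumes the raw flash is a power-of-two number of bytes (256 GiB) while the advertised capacity is decimal, which is a common arrangement but not something every drive follows; `spare_fraction` is a name made up for this sketch.

```python
def spare_fraction(raw_bytes, advertised_bytes):
    """Spare flash as a fraction of the capacity the user can see."""
    return (raw_bytes - advertised_bytes) / advertised_bytes


raw = 256 * 2**30            # assumed raw flash: 256 GiB
for advertised_gb in (256, 240):
    advertised = advertised_gb * 10**9   # advertised capacity in decimal GB
    print(f"{advertised_gb} GB drive: {spare_fraction(raw, advertised):.2%} spare")
```

Under these assumptions the drive sold as 256 GB keeps about 7.37% in reserve and the one sold as 240 GB about 14.5%, which is the hidden space the controller uses to shuffle data around.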

    3. Re:Backup your data frequently by Anonymous Coward · · Score: 0

      Store all your data in RAM discs, which are the fastest anyhow, and back-up frequently. Maybe we can create a two-level file system using RAM discs, where the journal is written continuously to tape.

    4. Re:Backup your data frequently by Anonymous Coward · · Score: 0

      Backup your data frequently. Stop worrying. Is that so hard?

      I'll believe backing up data has become easier for the average idiot out there right after you convince me that organizing and deleting data has become any easier for the average idiot out there.

      If you think dealing with physical hoarders is bad enough, just wait until future generations have to deal with digital hoarders.

    5. Re:Backup your data frequently by Cid+Highwind · · Score: 1

      You won't know the drive is running out of good cells until it's too late. One night everything is fine, the next morning you turn on the machine and "Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block"

      Only a backup on a separate device can save you from SSD failures.

      --
      0 1 - just my two bits
    6. Re:Backup your data frequently by Anonymous Coward · · Score: 0

      Backup your data frequently.

      Backup is a noun. What you meant to say was "Back up your data frequently."

    7. Re:Backup your data frequently by Anonymous Coward · · Score: 0

      I've taken a new approach. Don't back up and wipe clean when wiped. Move on. Data hoarding is worse than figuring out most things again. Make new memories.

  55. What a dumbass by Anonymous Coward · · Score: 1

    Just imagine the unicorn in the drive died.
    It's about as accurate as what you imagine happened to the spinning disk.

  56. By "a working drive" I mean the *same model*. by Anonymous Coward · · Score: 0

    Obviously.

    Not that some vultures wouldn't jump on it anyway.

  57. Luckyo is caught lying about plastic though by Anonymous Coward · · Score: 0

    So it's no great surprise that he blathers bullshit about hard drives also, the lying faggot has zero integrity.

    1. Re:Luckyo is caught lying about plastic though by Luckyo · · Score: 1

      Just out of interest, how much time in your day is spent stalking me on slashdot after your anti-science drivel got exposed in that one argument?

  58. Spinning disks used to be more unnerving... by gosand · · Score: 1

    I had a 4 tb spinning drive fail, after only 2 years. It was 75% full. That is what is scary to me. The only narrative I came up with to explain it was that it was in my system, but powered on, 24x7. Now my backup drives are external and I power them on when I need them.

    As drives get bigger, that is when I get nervous. I know, there's options to mitigate that, but I'm on a budget. I just migrated my OS to an SSD a couple of months ago, and still have spinning drives holding everything else.

    --

    My beliefs do not require that you agree with them.

  59. Are you all living in the SSD 1980s? by Anonymous Coward · · Score: 0

    Wow. I think what the OP has behind the complaint is that SSDs don't have the difficult-to-instrument hardware failures of HDDs, so why do we suffer the consequences of "unknowable" SSD death? They're HW, but they are solid-state. Sure you might get tin whiskers or solder failures, but why the heck do we put up with "it died"? Heck, these things should be able to know intimately that the resistor at R63 is the likely culprit for mistimed read signaling or whatever the failure point is.
    It's as if you'd need a computer to attach to a special connector to be able to diagnose these or something.
    Duh, I want my computer to print out a label with the manufacturer's address on it with the exact stinking location of the failure on it so I can send it to be repaired and I get it back in three days. Problem solved and ditch the whiny IT staff.

  60. HDs were scary too at some point by foxalopex · · Score: 2

    I'm guessing the author never lived through the era when there were a lot more companies in existence for mechanical HDs than there are now. HD's can spontaneously die from a failed motor, electronics failure or catastrophic crash. Some small companies went completely under and were swallowed up by larger manufacturers due to massive defects. SSDs have gone through the same era as well with buggy firmware. Generally speaking though, if you stick to the big manufacturers like Samsung and Intel the chances of fatal issues go down a lot. That said an SSD is not a guarantee of safe data. They're far more reliable but circuit failure or static electricity can kill SSDs. Besides, SSDs won't save you from an accidental erase all.

    1. Re:HDs were scary too at some point by thegarbz · · Score: 1

      Some small companies went completely under and were swallowed up by larger manufacturers due to massive defects.

      Some large companies had their HDD division go completely under. Looking at you IBM, I owned two of your IBM "Death"star series HDDs and somehow went through the warranty process 7 times on them.

    2. Re:HDs were scary too at some point by toddestan · · Score: 1

      IBM's HDD division didn't go under. It was bought by Hitachi, and generally Hitachi's drives are very well regarded nowadays.

      As bad as the IBM Deathstars were, I never actually lost any data because of them because they always gave some sign of impending doom before they finally failed, allowing me to grab whatever I needed to get off of them. I also had one last over 10 years in a workstation that was almost never turned off. I'm not even sure how that happened.

    3. Re:HDs were scary too at some point by thegarbz · · Score: 1

      IBM's division definitely went under! They were effectively blacklisted and unable to sell hardware. When they were purchased by Hitachi they were bought at bargain basement prices, the same price that Maxtor went for in the Seagate acquisition when Seagate bought a struggling company that had gone through half a decade of financial difficulties. When Hitachi bought it the only value left for IBM was in the commercial contracts. It took many years for Hitachi to turn the brand around, and after they did they sold drives at a fraction of the volume that IBM did and were subsequently bought by WD for more than double the original IBM acquisition cost.

      As bad as the IBM Deathstars were, I never actually lost any data because of them because they always gave some sign of impending doom before they finally failed

      You clearly never ran one in a RAID configuration. They were notorious for not making it through a full rebuild cycle once the dreaded click of death started. I never lost data either, backups and prioritising what data needed to be taken from the degraded arrays are the only reason though.

      I also had one last over 10 years in a workstation that was almost never turned off.

      Not all models had issues. I still have a working one here. At least I assume it's working, I'm not sure I've got any hardware with a PATA interface anymore so I can't test it.

  61. Did you have your covfefe this morning? by fyngyrz · · Score: 1

    real to real

    Donald? Is that you?

    --
    I've fallen off your lawn, and I can't get up.
    1. Re:Did you have your covfefe this morning? by Aighearach · · Score: 1

      Nice catch, I read right past that and didn't catch it; my parser rewrote it using the algorithm that corrects the "and and" mistakes, and I came away thinking he said "stick to real 9 track paper tape."

    2. Re:Did you have your covfefe this morning? by Anonymous Coward · · Score: 0

      real to real

      Donald? Is that you?

      The 9 track tape was real. They just weren't in reels.

    3. Re: Did you have your covfefe this morning? by Anonymous Coward · · Score: 0

      So it's reels that are really the real deal. Reels rule!

  62. I also had a failure recently by cloud.pt · · Score: 1

    Just chiming in. My Crucial M4 128GB (Micron) drive also died on me 2 months ago after very mild use since February 2013. It was my OS drive in a Windows 7-10 desktop which I mostly used for 3-5 multiplayer games through the years, or the odd media consumption. It was a machine that was on about 1/20 of the entire 5 years and 8 months.

  63. Post-Failure Support by nuckfuts · · Score: 1

    There's another problem I've found with SSDs in addition to their failures occurring with no previous warning signs. That is that the process of obtaining warranty replacements can be terrible.

    Perhaps because hard drives were expected to fail, manufacturers put procedures in place (such as "Advance" RMA) to ship a replacement very quickly. This is important when, for example, you have a single-drive failure in a RAID configuration that can only tolerate losing one drive.

    My experience with obtaining two warranty replacements on Intel M.2 SSDs has been really poor. In each case the replacement drive took so long to arrive I had to purchase a replacement drive in the meantime.

  64. Best of both worlds by MobyDisk · · Score: 1

    You can get the best of both worlds by setting up a RAID of both an SSD and a platter drive! :-P

    1. Re:Best of both worlds by prisoner-of-enigma · · Score: 1

      I'm not sure if you were being funny or not, but this is a horrible idea. Instead of the "best of both worlds" you're getting the worst of both. Read and write times will be gated by the speed of the mechanical drive, negating any SSD speed benefits. You'd be better off with two mechanical drives: same speed at much lower cost.

      --
      In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
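
A minimal sketch of the write-side argument in the reply above, assuming a plain mirror in which a write counts as finished only once every member has acknowledged it. The latency numbers are made up purely for illustration, and `mirrored_write_latency` is a hypothetical helper, not part of any RAID tool. Real RAID implementations may treat reads differently, so this models writes only.

```python
def mirrored_write_latency(member_latencies_ms):
    """A mirrored write completes when the slowest member has acknowledged it."""
    return max(member_latencies_ms)


# Hypothetical per-write latencies in milliseconds, for illustration only.
ssd_ms, hdd_ms = 0.1, 8.0
print(mirrored_write_latency([ssd_ms, hdd_ms]))
```

With these assumed numbers the mirror writes at the platter drive's pace, which is the point the reply is making.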
  65. When an SSD dies by OrangeTide · · Score: 1

    Most likely reason is a firmware bug causing enough corruption that it can't even low-level format. If it were a prototype that a developer could diagnose, it would be easy for them to patch it and get it going again. But without that specialized environment your SSD and the data on it are trash.

    In many ways I think I would have preferred the raw NAND systems like SmartMedia (now obsolete), where the host had the real brains and the media was as primitive as possible. SmartMedia formatting was about conforming to a software standard on the host side and was managed by a driver. A real driver that could be debugged with ordinary tools, not some obscure firmware embedded in a device.

    --
    “Common sense is not so common.” — Voltaire
  66. The Storage Debacle.. by Xnet+Project · · Score: 1

    We have experienced from mechanical, SSD, and NVMe drives that there are points of failure that we can detect, and there are points of failure we can't. In most cases where an unpredictable failure occurs, the cause is almost always at the power source, and it is mostly indicative of voltage irregularity in our tests with bad drives from these 3 types. While we'd like to think that new hardware will hold up to a degree of its certified life span, voltage as a whole to power said hardware will almost certainly add the anomalous layer for a margin of error from minimal to catastrophic.

  67. It's pretty simple by viperidaenz · · Score: 1

    The chips store data in a capacitor.
    The capacitor is connected to (or is the) the gate of a mosfet so the state can be read.
    To charge or discharge the capacitor, electrons must be forced over the insulation layer that stops the capacitor discharging on its own.
    Every time that happens the insulation breaks down a little. Once it's all gone, the cell can no longer store data.

    It's a gradual process that happens every time a cell is written to or erased. SSD's wear out as they're used, it's how they work. You should treat them as a consumable.

    Or something randomly broke, like a solder joint from thermal cycling or something.
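
A toy simulation of the wear-out process described above, assuming every block survives a fixed number of program/erase cycles and the host keeps rewriting one hot piece of data. The function name, block count and endurance figure are assumptions for illustration; real cells do not fail at an exact count.

```python
def writes_before_wearout(blocks, endurance, wear_leveling):
    """Count writes accepted before the next write would exceed a block's endurance."""
    erase_counts = [0] * blocks
    writes = 0
    while True:
        # With wear leveling, use the least-worn block; without it, always block 0.
        target = erase_counts.index(min(erase_counts)) if wear_leveling else 0
        if erase_counts[target] >= endurance:
            return writes
        erase_counts[target] += 1
        writes += 1


print(writes_before_wearout(blocks=4, endurance=1000, wear_leveling=False))
print(writes_before_wearout(blocks=4, endurance=1000, wear_leveling=True))
```

In this model, spreading writes across four blocks lets the device absorb four times as many writes before anything wears out, which is why the drive as a whole behaves like a consumable with a finite write budget.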

  68. Re: It's fairly simple by Anonymous Coward · · Score: 0

    You are slowly perforating the gate oxide when erasing NAND flash, but most chips give an indication of the read quality on write, so you know when the cell starts to get risky to use.

  69. Having lost an SSD recently... by Anonymous Coward · · Score: 0

    ... I can say I never "abused" it. Never defragged it or otherwise thrashed it needlessly, so I'm a tad sad to lose the thing. And surprised by the suddenness of it, in the middle of playing Fallout.
    Dagnabbit.
    As with other posters, my SMART checks never disclosed any potential or actual errors in the SSD.
    They oughta make some sort of warning inbuilt. Make 'em scream if they hurt, like HDs do. That grinding unhappy-drive sound the mechanical ones produce is my suggestion.
    OTOH, it managed to survive long enough that I could replace it wif an EVO. It's all good, I 'spose. Hrm.

  70. literature by TheSync · · Score: 1

    See SSD Failures in Datacenters: What? When? and Why?.

    Failures include retention errors caused by leakage current, which worsens with time when not acted upon. Second, they also suffer from phenomena such as read disturb and program disturb errors, where a read or program of a row or block of cells affects the threshold voltage of untouched cells in its vicinity. The failure categories covered are data retention, program disturb, read disturb, endurance, and power faults.

    Flash controllers have proactive and reactive mechanisms in place, to prevent the flash error propagation to higher levels in the system stack. Consequently, not all of the above-mentioned failures propagate to upper layers. But, ones that do propagate can result in fail-stop failures.

  71. Re:Department of Computer Science --- are you sure by Aighearach · · Score: 1

    I spent 3 years on a "deep dive" into EE basics, analog circuit design, then microcontrollers, and it really improved my software development a lot.

    I don't think this is a natural blind spot in CS, I think it is just manufactured ignorance by dividing the fields in an unrealistic way. Which seems to have happened during the rush to train workers during the .com boom, so maybe it wasn't even thought out at all.

  72. I feel the same way about light bulbs. by ripvlan · · Score: 1

    Does anyone really know why a spinning disk dies? Sure - maybe if the last operation was "dropped laptop down stairwell"

    A narrative over what went wrong?! Whenever a HDD failed a light came on the RAID array - and I'd find a package from FedEx on my desk at 9AM with a replacement disk in it. As for personal computers - the drive stops working and you lose data.

    What is there to think about?

    I do agree about the "timebomb" thought. I know that SSD just give up the ghost. On a HDD many times "check disk" starts reporting a high number of failures and you can be prepared...except when the head falls off the arm. That's a rather rapid failure.

    SSD have a write-lifetime that I can't predict. HDD goes until it doesn't work anymore. In both cases you break out the backup tapes.

  73. Re:Department of Computer Science --- are you sure by DickBreath · · Score: 1

    If you begin to notice vibration from the SSDs then you know they are near the end of their life.

    --

    I'll see your senator, and I'll raise you two judges.
  74. I don't care. by ruddk · · Score: 1

    We set them up in either a RAID or EC configuration or other redundant configuration, so that the operations department can swap them out when they fail without downtime.
    Unless we start to see an unusually high number of failures, we don't care.

  75. Re:Department of Computer Science --- are you sure by bobbied · · Score: 1

    Well, I do think it's natural for CS majors to be a bit farther away from hardware. Let's face it, much of their work these days doesn't really care what operating system they run on much less the hardware it's actually running on. I don't blame them, really the state of programming has evolved away from hardware dependence, and that's a good thing..

    While I understand the hardware details of what's happening behind the programming model seen by the CS guys and gals, and I believe that I have a different perspective when doing software development, I'm not sure they would benefit all that much. Programming Java is pretty hardware agnostic anyway, C/C++ a bit more specific (assuming you have the libs and compiler), but still largely portable unless you are handling actual hardware or kernel level stuff. My hardware knowledge really only serves to make me more aware of performance implications of my choices perhaps, but the CS folks do just fine with most higher level languages.

    So I don't agree, CS folks really don't need to know all the same stuff I do to program. It used to be true, it used to be valuable to understand what the hardware had to go through, both to be able to optimize your code for performance and size and get it to do what you wanted. However, with the advent of the higher level languages, most CS folks don't interact with the hardware anyway, but with abstract programming models like the JREs which for all the world look identical regardless of the hardware being used.

    --
    "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  76. "A story"? by Anonymous Coward · · Score: 0

    Why does the OP need a "story" (he actually keeps repeating that word)...? Reminds me of millennials grocery shopping.

    I've been told (in all seriousness) that when millennials go grocery shopping, they want to know the backstory of everything they purchase - and not just what you'd think of asking: whether that chicken lived on a free-range farm, whether pesticides were used on those tomatoes, GMOs etc but rather, they want to read some story about the actual farm, the animals that live there, an anecdote from someone who works there--that sort of bullshit.

    WHY?

  77. Re: Department of Computer Science --- are you sur by FuzzyDaddy2 · · Score: 1

    I blame the autocorrect software.

  78. Plausible Theory by Anonymous Coward · · Score: 0

    This is theoretical, an attempt to understand why this sysadmin is "unnerved". Well, that's a psychological reaction, and it needs a psychological explanation.

    The key, I think, is that he is a Unix sysadmin. In that world everything is a story. I've commented on this in the past, the bizarre naming of Unix commands. The only way to explain commands like "ls" and "grep" and "sudo" is to invoke the culture of the story. I'm discounting the command line interface, keyboards, and the virtues of short commands in this. While that is true enough, it also fails to fully explain the quixotic commands themselves.

    So every Unix command has a story about how that command got its name. And people are very attuned to telling and learning stories. However the problem with that system is that you simply have to spend the time required to hear and learn all those stories. Guessing at commands is ineffective in Unix.

    Thus this guy learned Unix by stories, and he needs a story to explain each SSD failure. It's right in line with the Unix sysadmin culture.

    Now, there's a not-Unix specific aspect to this whole thing too. It's undesirable systems design, regardless of any other factor, to have sudden failures. Failures with a warning are a much better systems characteristic (assuming that failures are inevitable and not preventable). Traditional HDDs have a better track record of failing with signs that something was going wrong, preceding total failure.

  79. Wow by TimMD909 · · Score: 0

    If I wrote something like that, I would not attach my name to it. He's basically asking to be fired from his job. Hard drive failures are something everyone has dealt with for decades. What if a bus driver said that he/she had a hard time navigating in traffic, but other than that, they were great?

    After looking into his homepage and blog, I can't help but get the impression this guy slacks off a lot, while thinking a lot of himself. The following parts struck me as interesting:

    please do not send him unsolicited mail touting your good deals or your good cause; he will just become irate

    I take that to mean he loves to pontificate, but doesn't want to hear anyone else.

    His current amusement is to have as many home pages around the University of Toronto as possible; he will let finding them be your amusement.

    If I told any employer my favorite pastime was to waste their time (and thus money), I could not imagine having a job much longer.

    Then again, that home page was written 22 years ago. Maybe he's matured in the meantime, but I doubt it. Someone at the same job for 2 decades, who still doesn't understand basic stuff like hard drives, probably hasn't improved much. Those sorts of people tend to do the absolute minimum possible, at all times.

    1. Re:Wow by jddimarco · · Score: 1

      I'm not sure if you're a troll and are trying to evoke annoyance, or if you suffer from severe reading comprehension difficulties and are trying to evoke pity. In me, you evoke both.

  80. Why SSD failures are legitimately unnerving by jddimarco · · Score: 2

    Disclaimer: I've known Chris since we were CS undergraduates together in the 1980s, and we currently work together in the CS Department in Toronto. It may seem a bit odd to some that a hard disk failure isn't unnerving but an SSD failure is. That's because one of a good sysadmin's skills is properly focused anxiety, used to motivate a mental model of how things can fail, and what to do about it. Data storage is a key part of this mental model, since data access loss, or even worse, data loss, is a major risk. That's why it's helpful to know how disks work, how they behave when they fail, and how likely it is for such things to happen. Chris has a few decades of experience in dealing with disks. SSDs take the place of disks, and they store stuff just like disks do, but they work differently, and they behave very differently when they fail. In particular, SSDs often don't seem to give any indication that things may be wrong: one moment all is well, the next moment, all is dead. So instincts honed over a few decades of experience with hard drives don't apply. Of course Chris (and we all) will develop new instincts as we get more experience with SSDs. But in the meanwhile, it's indeed unnerving. And no, this isn't some sort of profound insight. It's merely an observation. Many experienced sysadmins, I think, will "get" this. People newer to the field might not. That's OK.

    1. Re:Why SSD failures are legitimately unnerving by Anonymous Coward · · Score: 0

      It may seem a bit odd to some that a hard disk failure isn't unnerving but an SSD failure is. That's because one of a good sysadmin's skills is properly focused anxiety, used to motivate a mental model of how things can fail, and what to do about it. Data storage is a key part of this mental model, since data access loss, or even worse, data loss, is a major risk. That's why it's helpful to know how disks work, how they behave when they fail, and how likely it is for such things to happen.

      To understand failure mechanisms in modern devices, you have to have a basic understanding of device physics. A course in IC Fabrication or Solid State Physics usually provides this.

      Some of the failure mechanisms are chemical (things like diffusion and migration within the device), and some require an understanding of very basic quantum mechanics (look up hot electron injection, traps, tunnelling). This has become more and more true as things get smaller, because quantum phenomena become more and more important as scale goes down. With quantum failure mechanisms, the accumulation of enough random events can be enough to cause the failure.

      Errors in the information provided by the fab to the chip designers can also cause failures. There are always errors, because nothing human beings do is perfect (and because modeling non-linear systems is really hard - and transistors are intrinsically non-linear, at best we can approximate them as linear). But not all the errors matter, which is why things work surprisingly well a lot of the time.

      As you move up to larger scales, you have additional failure mechanisms (such as bad solder bonds). Here you basically have to understand packaging, PCB layout, and a bunch of other things. It's a lot to know.

      Some devices will be more susceptible than others, due to random differences in fabrication due to a variety of mechanisms.

      Temperature, altitude, and humidity can play a role as well, especially for problems at larger scales. But even at smaller scales, the device "corners" (the range of possible production characteristics) can vary enormously with temperature, so the "wrong" temperature can put you into a mode of operation that wasn't modelled well (or where things are just more susceptible to failure through one or more of the various possible mechanisms).

      You can easily construct many plausible failure mechanisms if you understand this stuff, but even then knowing which one(s) is (are) correct is really hard (and predicting the probability of failure is even more difficult). The problem ultimately comes down to the combination of lack of observability and rapidly changing technology. Typically you need both very expensive test equipment, and very knowledgeable and experienced people to figure out the real cause of a problem.

      Failures in software can also play a big role here. As the software engineers get more and more divorced from the hardware, it becomes harder and harder to write good drivers. Similarly, as the hardware gets ever more specialized, you can end up with a very small subset of hardware people writing the drivers and doing a bad job because most of them suck at software engineering. If they wanted to be software engineers, they wouldn't have gone the EE route - very few people can live in both worlds today (and those that can are worth their weight in gold).

      The net effect of software failures can be unexpected, unanticipated, and surprising failures, because either the wrong people were involved in writing the drivers, or because the hardware and software teams didn't communicate at the right times in the process (early enough that you might have been able to make the changes needed to actually fix a potential issue or provide needed visibility). Often you end up with this kind of problem because of "stovepipe" organizations where both groups are busy doing their own thing (under their own management, with their own budget and goals) and they don't work together. I see this kind of thing happening a lot.

  81. Chris Siebenmann has anxiety issues. by Oligonicella · · Score: 1

    That's the crux of the article. I should care why? He's a technical guy, he knows about memory. He just refuses to apply his knowledge to get rid of his paranoia. This guy's nothing but a low-level conspiracy theorist.

    As I wrote in one of my books: "They're all alike. Conspiracy theorists. They'd rather live in a terrifying fantasy world than the real one."

    1. Re:Chris Siebenmann has anxiety issues. by jddimarco · · Score: 1

      This response is confused on so many levels. First, Chris doesn't "know about memory" (particularly flash memory and the corresponding control systems that are built into modern SSDs) in the same way and to the same degree as he knows about disks; that's the point. Secondly, he isn't refusing to apply his knowledge, he's using all the knowledge he has, which is less than what he has for disks. Thirdly, he isn't being paranoid -- paranoia requires high and ongoing anxiety about extremely unlikely things (i.e., delusional): Chris' anxiety here is neither high nor ongoing, nor is what he is anxious about (SSD failure) an extremely unlikely (delusional) thing. Fourth, there's no evidence in his posting that Chris believes any sort of conspiracy is going on here.

  82. SSD - snowflake storage device? by Anonymous Coward · · Score: 0

    With this ‘admins’ angst in full display for a piece of hardware, I worry about the ability to handle a malicious attack. Going to cry for Mother(board)?

  83. Re:Self bricking by Anonymous Coward · · Score: 1

    Sandforce controllers self-brick at the first sign of trouble to prevent competitors from reverse engineering their controllers. Or at least that is the reason stated for their crappy design. IIRC, Intel developed a customized version that has better failure modes.

  84. Blackbox bricked by Reason by ElitistWhiner · · Score: 1

    This is the uncanny valley in which the world of REAL slowly sinks, sinking...sunk into the technological relative world of NOW.

    There is no bridge between. You stand stranded on the shores of reason while the world in which you live sinks away, out of sight and out of mind.

    Millennials know the futility of questioning the NOW; it's irrelevant to wonder 'why?' Just BE now!

    Enlightenment as to why, what went wrong - much less how to prevent bad things is not among possibles. Shit happens!

  85. Some Rays.... by Anonymous Coward · · Score: 0

    ... are Cosmic.

  86. Re:Department of Computer Science --- are you sure by ceoyoyo · · Score: 1

    I'm not sure what they call Computer Science these days, but my bachelor's had a required digital design component. We started by wiring together transistors to build a gate. When you'd demonstrated that, you could use 74HC00s, and you had to build an adder. When your adder worked, you were allowed to use an ALU chip. You had to set the thing up with supporting logic and DIP switches and invent a machine code to demonstrate instruction processing and register transfers.

    In the compiler class we started out by writing a simulator for that hardware, then an assembler, then a compiler.
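
    For readers who never did that lab, the adder step can be sketched in software. The 74HC00 is a quad 2-input NAND package, so the sketch below builds a full adder from nothing but a NAND function and chains it into a ripple-carry adder. The function names and the 4-bit default width are illustrative assumptions, not details of the course described above.

```python
def nand(a: int, b: int) -> int:
    """2-input NAND on single bits (0 or 1), the only primitive used below."""
    return 0 if (a and b) else 1


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder built from nine NAND gates. Returns (sum, carry_out)."""
    # First XOR stage: x = a XOR b, from four NANDs.
    n1 = nand(a, b)
    n2 = nand(a, n1)
    n3 = nand(b, n1)
    x = nand(n2, n3)
    # Second XOR stage: s = x XOR cin, from four more NANDs.
    n4 = nand(x, cin)
    n5 = nand(x, n4)
    n6 = nand(cin, n4)
    s = nand(n5, n6)
    # Carry: (a AND b) OR (x AND cin), reusing n1 and n4.
    cout = nand(n1, n4)
    return s, cout


def ripple_add(x: int, y: int, width: int = 4) -> tuple[int, int]:
    """Add two unsigned integers bit by bit. Returns (result, final_carry)."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


if __name__ == "__main__":
    print(ripple_add(5, 6))   # (11, 0)
    print(ripple_add(15, 1))  # (0, 1): the sum overflows 4 bits
```

    Nine gates per bit means a 4-bit adder needs nine quad-NAND packages, which gives some idea of why being allowed an ALU chip felt like a promotion.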

  87. And any device with soldered SSDs, like Apple... by Anonymous Coward · · Score: 0

    Make it reason enough to not purchase them.

  88. Intel SSD's by Anonymous Coward · · Score: 0

    I've had plenty of Intel 5xx series "die" on me. I have one now that BSODs 5-6 times a day. Intel SSD Toolbox Full Diagnostic says it's A-OK! I've RMA'd so many of these, I already know what the techs are going to ask for during the RMA games. For SATA SSDs, so far no issues with Crucial MX300 or MX500.

  89. Yes. I am. by bdwoolman · · Score: 1

    Because yes I am your God, man.

    --
    "No fear. No envy. No meanness." Liam Clancy
  90. Re:Department of Computer Science --- are you sure by Aighearach · · Score: 1

    I was actually thinking that if they had more understanding of the hardware, they'd have a better idea what the layers actually are, and they'd end up with more portable code not less portable code as you seem to imply. Knowing about how hardware works helps to be more hardware agnostic, because if you're using intermediate layers with no idea of the hardware and OS coupling that it creates then you'll do it more often.

  91. Re:Department of Computer Science --- are you sure by bobbied · · Score: 1

    I was actually thinking that if they had more understanding of the hardware, they'd have a better idea what the layers actually are, and they'd end up with more portable code not less portable code as you seem to imply. Knowing about how hardware works helps to be more hardware agnostic, because if you're using intermediate layers with no idea of the hardware and OS coupling that it creates then you'll do it more often.

    Yea, I see what you are saying, but remember they are stamping out CS degrees with little more than Java and Database Skills. The whole point of Java was to let you ignore all that hardware stuff through abstraction layers anyway. Most of them don't need to know how to dig through all those layers to do what they need and with Object Oriented concepts, hardware is becoming trivia to them.

    But I agree, a bit of understanding of hardware is a good thing, especially when you start talking recursion and how pointers/references are actually working. I've always been amused at the BSCS holders who didn't understand what the call stack was or how they were killing performance with all the objects going in and out of scope, or why the math was being done using integers when they wanted floating point (or vice versa). I just don't know if they have the scope in an undergraduate CS curriculum to throw that stuff in. Many won't need it, use it or remember it anyway.
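
    The integer versus floating point confusion is easy to show in a few lines. This is only a sketch with made-up function names; Python is used because it makes both kinds of division explicit, whereas in C or Java the same mistake hides behind a single `/` operator.

```python
def mean_truncating(total: int, count: int) -> int:
    """Integer (floor) division: the fractional part is thrown away."""
    return total // count


def mean_exact(total: int, count: int) -> float:
    """True division: always produces a float in Python 3."""
    return total / count


def sum_tenths_as_floats(n: int) -> float:
    """Add 0.1 to itself n times in binary floating point."""
    total = 0.0
    for _ in range(n):
        total += 0.1
    return total


def sum_tenths_as_integers(n: int) -> int:
    """Do the same sum in integer tenths, which is exact."""
    total = 0
    for _ in range(n):
        total += 1
    return total


if __name__ == "__main__":
    # Integers when floating point was wanted: the .5 silently disappears.
    print(mean_truncating(7, 2), mean_exact(7, 2))
    # Floating point when integers were wanted: 0.1 has no exact binary
    # representation, so ten of them do not add up to exactly 1.0.
    print(sum_tenths_as_floats(10) == 1.0, sum_tenths_as_integers(10) == 10)
```

    Neither behaviour is a bug in the language. Both follow directly from how the hardware represents numbers, which is the point being made above.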

    --
    "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  92. Apple MacBook SSD works great by Anonymous Coward · · Score: 0

    I have been using my 2010 MacBook pretty hard as an over-the-air DVR and moving lots of files off of it to a backup RAID array, and its 256GB SSD is working just fine. I do have a Time Machine backup of it, and there is little data on the drive with the vast majority of it backed up.

    I worry more about backing up my backup data safely offsite, along with organizing it all. That is what I would want an off-line AI to do for me in a future OS. I don't want to put a lot of data in "the cloud" or move it on-line since my internet connection is my cell phone and is capped at 12 GB per month.

  93. There are warning signs... by Anonymous Coward · · Score: 0

    The issue is not that it's hard to know why an SSD died. After all, as others have said, the same is true with spinning rust. The real issue is how suddenly an SSD can die. It can be perfectly healthy one day, then completely read-only or even totally dead the next day. An HDD on the other hand usually (but not always) shows symptoms of dying. It starts making noise, or the number of I/O errors spikes. Maybe it stops working when it's moved on its side.

    An SSD, on the other hand, intelligently reallocates bad sectors until its dying breath. Because NAND cells were historically so fragile, the FTL is very paranoid about ensuring a sector does not die on it. That's good and all, but it means that it's harder to tell that the device is in its death throes. An HDD is much dumber. It won't reallocate something as soon as a sector becomes unhealthy. It'll be totally fine until reads start failing, in which case it will try a number of times to read it so it can relocate it, and as this starts happening more and more, it becomes very obvious.

    I'm not criticizing SSDs at all. Their failure mode is not worse, it's just different. People are not used to monitoring drive health, so they (possibly foolishly) tend to rely on physical symptoms appearing. That's not a very good idea for SSDs.
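
    To make "monitoring drive health" concrete, here is a rough sketch of comparing two successive health readings instead of waiting for physical symptoms. The field names are loosely modelled on counters that NVMe drives report in their health log (percentage used, available spare, spare threshold, media errors), but the `HealthSample` class, the `health_warnings` function, and the 80% wear limit are all assumptions made for illustration. How the readings get collected is left out on purpose, and a drive whose firmware bricks itself will sail past any check like this.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """One reading of drive health counters (hypothetical structure)."""
    percentage_used: int   # drive's own estimate of rated life consumed
    available_spare: int   # remaining spare capacity, in percent
    spare_threshold: int   # level at which the drive itself calls spare low
    media_errors: int      # cumulative count of unrecovered data errors


def health_warnings(prev: HealthSample, cur: HealthSample,
                    wear_limit: int = 80) -> list[str]:
    """Compare two samples and list anything that deserves attention."""
    found = []
    if cur.available_spare <= cur.spare_threshold:
        found.append("spare capacity at or below the drive's threshold")
    elif cur.available_spare < prev.available_spare:
        found.append("spare capacity is shrinking")
    if cur.percentage_used >= wear_limit:
        found.append("most of the rated endurance is used up")
    if cur.media_errors > prev.media_errors:
        found.append("media errors increased since the last sample")
    return found


if __name__ == "__main__":
    last_week = HealthSample(percentage_used=12, available_spare=100,
                             spare_threshold=10, media_errors=0)
    today = HealthSample(percentage_used=12, available_spare=97,
                         spare_threshold=10, media_errors=2)
    for line in health_warnings(last_week, today):
        print("WARNING:", line)
```

    Looking at the change between samples matters more than any single value. A drive quietly burning through spare blocks looks healthy in one snapshot and only shows itself as a trend.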