Slashdot Mirror


Samsung Finds, Fixes Bug In Linux Trim Code

New submitter Mokki writes: After many complaints that Samsung SSDs corrupted data when used with Linux, Samsung found out that the bug was in the Linux kernel and submitted a patch to fix it. It turns out that kernels without the final fix can corrupt data if the system is using linux md raid with raid0 or raid10 and issues trim/discard commands (either fstrim or by the filesystem itself). The vendor of the drive did not matter and the previous blacklisting of Samsung drives for broken queued trim support can be most likely lifted after further tests. According to this post the bug has been around for a long time.
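As a rough aid to the summary above, here is a minimal sketch for checking whether a machine uses one of the md RAID levels it names (raid0 or raid10). It parses text in the `/proc/mdstat` layout. The function name and the sample text are made up for illustration, and a match only means the configuration fits the description. It says nothing about whether the running kernel already carries the fix.

```python
# Levels named in the story summary as affected when combined with TRIM/discard.
AFFECTED_LEVELS = {"raid0", "raid10"}


def affected_md_arrays(mdstat_text):
    """Return names of md arrays whose RAID level is raid0 or raid10.

    Hypothetical helper: expects text laid out like /proc/mdstat, where
    each array line looks like "md0 : active raid0 sdb1[1] sda1[0]".
    """
    affected = []
    for line in mdstat_text.splitlines():
        fields = line.split()
        # Array lines start with the md device name followed by a lone ":".
        if len(fields) >= 3 and fields[0].startswith("md") and fields[1] == ":":
            if AFFECTED_LEVELS & set(fields[2:]):
                affected.append(fields[0])
    return affected


# Made-up sample in the /proc/mdstat layout; on a real system you would
# read the file instead: open("/proc/mdstat").read()
SAMPLE = """\
Personalities : [raid0] [raid1] [raid10]
md0 : active raid0 sdb1[1] sda1[0]
      1953260544 blocks super 1.2 512k chunks

md1 : active raid1 sdd1[1] sdc1[0]
      976630464 blocks super 1.2 [2/2] [UU]

md2 : active raid10 sdh1[3] sdg1[2] sdf1[1] sde1[0]
      1953260544 blocks super 1.2 512K chunks 2 near-copies [4/4] [UU]

unused devices: <none>
"""

print(affected_md_arrays(SAMPLE))  # ['md0', 'md2']
```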

40 of 184 comments

  1. awkward! by Anonymous Coward · · Score: 4, Insightful

    Well, that's gotta be embarrassing for everyone bashing Samsung over this. I remember reading some rather strong opinions about who was at fault.

    1. Re:awkward! by Anonymous Coward · · Score: 2, Interesting

      I'd be interested to see if anyone has apologized. Doing so is exceedingly rare on internet forums.

    2. Re:awkward! by mwvdlee · · Score: 2, Insightful

      Even more so for the kernel developers that blacklisted the Samsung drives.
      These developers should probably be banned from kernel development or at least banned from making decisions regarding functionality.
      Creating code with a bug is human, not doubting your own code and blaming somebody else is stupid.

      --
      Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
    3. Re:awkward! by Khyber · · Score: 2, Insightful

      If the kernel devs and Linus don't apologize, they're all a bunch of self-absorbed shitlords and should be smacked off the face of this planet.

      --
      Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    4. Re:awkward! by Anonymous Coward · · Score: 5, Insightful

      The firmware bug of Samsung drives, a very severe one actually, was confirmed by Samsung. The RAID 0 issue is a totally different one, hardly affecting anyone.

      So yes, the severe issue was a bug on Samsung's side, while the very rare RAID 0 bug is a Linux kernel one.

    5. Re:awkward! by GigaplexNZ · · Score: 5, Informative

      I've read the articles. There are two separate bugs here. One, Samsung drives advertise support for queued TRIM even though it's not properly supported, causing corruption. Two, the kernel had a TRIM bug that affected serial TRIM with mdadm RAID, which is the kernel bug Samsung found and fixed. The queued TRIM bug still exists in the Samsung firmware.

    6. Re:awkward! by GigaplexNZ · · Score: 3, Informative

      The queued TRIM blacklist on Samsung drives doesn't affect Windows because Windows doesn't support queued TRIM yet. This Linux kernel bug is a different issue, but many assumed it was the same, even though Algolia clearly stated in their blog post that they weren't using queued TRIM.

    7. Re:awkward! by sjames · · Score: 2

      The AC was sorta half right. It is not uncommon for hardware to break the standard so that it works with Windows. That sort of thing is becoming less common but it's hardly unknown.

    8. Re:awkward! by GigaplexNZ · · Score: 2

      Because there are two separate bugs.

  2. Thank You by Anonymous Coward · · Score: 2

    Thank You Samsung!
    While our company cad-workstations don't run Linux, all of them do run on Samsung SSD's.

  3. Bravo by Virtucon · · Score: 4, Interesting

    Nice to see vendors working together to improve Linux.

    --
    Harrison's Postulate - "For every action there is an equal and opposite criticism"
    1. Re:Bravo by gstoddart · · Score: 4, Insightful

      After many complaints that Samsung SSDs corrupted data when used with Linux

      There was definitely some self-interest there.

      Samsung can't have people saying their SSDs corrupt data when it's not them doing it.

      --
      Lost at C:>. Found at C.
    2. Re:Bravo by DarkOx · · Score: 5, Interesting

      Sure, there was self-interest. Still, I think they deserve a lot of credit here. Rather than the typical "It's not my code" response from a developer who is sure the problem is elsewhere (rightly or wrongly), they actually found and fixed the problem. That is good behavior!

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    3. Re:Bravo by Anonymous Coward · · Score: 4, Insightful

      Of course, this is only possible when the "other person's" code is Free Software. If this had been a problem in Windows/OSX that Microsoft/Apple was refusing to fix, there's little Samsung could have done about it.

    4. Re:Bravo by gstoddart · · Score: 2

      Sure it was good behavior.

      But it was borne entirely out of the Linux people saying "OMG, teh Samsung is teh sux0r".

      I do give them a lot of credit. More than the people who apparently insisted it was the fault of Samsung in the first place.

      --
      Lost at C:>. Found at C.
    5. Re: Bravo by bill_mcgonigle · · Score: 4, Interesting

      Yeah, the outcome is great. I just wonder why they waited more than a year to look into it. Maybe this will set a good example for the industry that with a little bit of effort you can take care of your customers and sell more product.

      If this were the 80's and a hard drive vendor had more than two reports of data loss under, say VMS, there would have been engineers on a plane to DEC by morning to get it solved by the coming weekend.

      Now we have thousands of users with reports and millions of units sold, and a wealthy vendor, and it's all crickets, leaving some kernel hackers to half-ass a blacklist. It's not like this is BeOS - there are millions of servers running in the target market. I don't mean to absolve the bad troubleshooting by kernel devs, but want to know what drove the apathy at Samsung (and other vendors behaving poorly). It's obviously not profit motive.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    6. Re: Bravo by bill_mcgonigle · · Score: 5, Informative

      I take some of that back. It seems the real credit for digging in goes to these guys. Samsung came in a month ago after they were provided a test suite and then gets credit for finding the kernel code path that caused the problem. An Oracle engineer provided a more-correct patch.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  4. Crying wolf by Sponge+Bath · · Score: 5, Informative

    When Apple updated OS X to allow TRIM on non-Apple supplied SSDs, forums were flooded with people claiming you should never use Samsung because they were fundamentally broken with regards to TRIM. Their "proof" was that corruption happened on Linux and they would not be swayed by the thought that maybe the problem was with Linux.

    1. Re:Crying wolf by beernutz · · Score: 4, Insightful

      The point however is that in a closed source system, Samsung could not have found and fixed the bug themselves.

      --
      (stolen from DaBum) I am dyslexia of borg - your ass will be laminated.
    2. Re:Crying wolf by Anonymous+Brave+Guy · · Score: 3, Insightful

      Is that really the point, though?

      Vendors of products affected by bugs in closed source software collaborate all the time. It's usually in their mutual interests, and it has been going on forever. Just look at the extraordinary lengths Microsoft used to go to in order to maintain compatibility of Windows with older applications.

      On the other hand, the existence of this issue in the first place, the fact that other vendors whose products may also have been affected did not act as Samsung did, and particularly the denial and active yet unjustified blacklisting of Samsung products by the people running the project with the real fault are indictments of that project, no matter how open it claims to be or how big and famous it is.

      This whole affair does not look good for Linux, and more importantly, it does not reflect well on the people currently running development of Linux.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    3. Re:Crying wolf by kaiser423 · · Score: 5, Informative

      What makes you think that? Samsung is one signature away (PIA -- Proprietary Information Agreement) from viewing the vendor's source code and advising them. It's pretty damn routine and uncontroversial. I don't understand why people think that just because something is not open source, no one outside of the company can ever, under any circumstances, see a hunk of the code. Just sign a PIA and hand over the code in a secure manner, or give them remote VPN access to the test box. Pretty damn simple and routine.

    4. Re:Crying wolf by GigaplexNZ · · Score: 4, Informative

      That really depends on whether OS X uses serial or queued TRIM. The Samsung drives work fine with serial TRIM, but are still broken with queued TRIM. The bug that Algolia reported and Samsung fixed in the kernel was a serial TRIM issue in the Linux kernel with RAID, which is unrelated to the queued TRIM firmware issues.

  5. Just another case.... by darkain · · Score: 4, Insightful

    This is just another case of "Not My Problem" syndrome that too many techs get into. They think their code/tools/systems/whatever must be perfect, and others are the ones fucking up. Samsung drives went on a blacklist for issuing the commands to them due to this bug? "WALP, LINUX IS PERFECT, MUST BE THE HARDWARE GUYS, even though their devices perform perfectly on other OSes" - and instead now we're left with a bug in Linux that corrupts data until the patch can make its way through the distro channels and get pushed out to end users.

    1. Re:Just another case.... by DRJlaw · · Score: 2

      Devices working perfectly in other OSes is no indicator that the device is not at fault. Witness the vast amount of crap laptop hardware, whose disastrous ACPI implementations only worked because their Windows drivers were chock-full of workarounds.

      It certainly is an indicator. I think you mean to say "is not conclusive evidence."

      But then again, disastrous ACPI implementations are not conclusive evidence that a whole different type of device is at fault.

      Your reasoning falls into the very trap GP was pointing out.

    2. Re:Just another case.... by 0123456 · · Score: 4, Interesting

      Devices working perfectly in other OSes is no indicator that the device is not at fault. Witness the vast amount of crap laptop hardware, whose disastrous ACPI implementations only worked because their Windows drivers were chock-full of workarounds.

      Back when I was writing Windows drivers for plugin cards, there were certain motherboards that we'd detect and switch the motherboard bus to the slowest possible speed, because the chipset was a heap of junk that didn't work properly at higher speeds. Anyone who said 'but it works on Windows!' clearly had no idea that it only worked because we'd intentionally turned off most of the features.

    3. Re:Just another case.... by Anonymous+Brave+Guy · · Score: 3, Interesting

      A pro-Linux bias on Slashdot is not exactly a surprise, but an equally accurate headline on another forum might have read "Critical bug in Linux corrupts data on SSDs", and the subtitle "Linux maintainers deny serious fault, blame innocent parties for data loss" would probably have been fair too.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    4. Re:Just another case.... by nojayuk · · Score: 4, Interesting

      We did workarounds on the ATA bus spec for known hardware bugs in older VIA chipsets. These were silicon bugs, not chipset firmware so they couldn't be fixed afterwards with patches and there were millions of these boards out there. Declaring our devices (CD-ROM and DVD-ROM drives) wouldn't work with these boards was not going to happen for sales reasons so our code included a lockup-recovery function that was invoked when the rare bug conditions were met and the IDE bus froze. The average user never noticed these lockups and we didn't tell them about them.

      Out-of-spec bugs like this were well-known in the industry and workarounds were easy to produce as long as you had access to a few million bucks worth of test equipment and a good team of professional engineers with decades of experience, not something that's common in the Linux world.

  6. not the case in my situation by nimbius · · Score: 3, Funny

    After many complaints that Samsung SSDs corrupted data when used with Linux

    I've used Samsung SSD's for years now and until today I've never heard of a 14e07c2ea4f[NO CARRIER]

    --
    Good people go to bed earlier.
    1. Re:not the case in my situation by edtice1559 · · Score: 3, Informative

      If you have 64GB of RAM, you can cache the entire SSD. Then you won't have to issue TRIM commands!

    2. Re:not the case in my situation by Rinikusu · · Score: 4, Funny

      "But.. what does my cell phone carrier have to do with anything?"

      --
      If you were me, you'd be good lookin'. - six string samurai
  7. Vote with your wallet by jwkane · · Score: 4, Interesting

    Vote with your wallet, my next SSD will be a samsung.

  8. Re:Why did it only happen on Samsung's SSDs? by Anonymous Coward · · Score: 5, Insightful

    Confirmation bias. It was happening with other brands, but for one reason or another, people focused in on Samsung as the culprit, and once that happened, there was no getting out of it.

  9. Good work by Kuruk · · Score: 2

    Hats off to Samsung for finding and even fixing the problem.

  10. Re:Why did it only happen on Samsung's SSDs? by DRJlaw · · Score: 2

    Excellent question. My first guesses would be that either the Samsung SSDs were doing something a bit out-of-specs, or the Samsung SSDs have something that's missing from other SSDs.

    From TFS: "The vendor of the drive did not matter and the previous blacklisting of Samsung drives for broken queued trim support can be most likely lifted after further tests."

    If the vendor of the drive does not matter in testing, then there is no relevant difference in specification compliance or other "somethings." It's purely a matter of which anecdotes gain what traction within a small population of users using md raid with multiple SSDs in a raid 0 or 10 configuration, and which of those users circumstantially has the best contacts within the development community.

    My first guess is the users trying that configuration were purchasing the fastest available SSDs, which tend to be Samsung drives (large market share) or boutique manufacturers (small market share).

  11. Re:Why did it only happen on Samsung's SSDs? by swb · · Score: 2

    Perhaps competitive prices coupled with perceived quality (and good experience on other platforms) led to these drives being selected by more knowledgeable or performance oriented people.

    These drives then got pushed harder or in ways more likely to expose the bugs, leading to a perception that they were unreliable under Linux.

  12. Apology by JustAnotherOldGuy · · Score: 2

    On behalf of all internet users everywhere, whether in this specific space-time continuum or not, I would like to formally apologize to Samsung for all of the totally unwarranted bashing they took over this issue. And I would also like to express my gratitude to them for finding a bug, fixing it, and posting a fix. Good job.

    --
    Just cruising through this digital world at 33 1/3 rpm...
  13. Re:fairly common to blacklist devices by Midnight+Thunder · · Score: 4, Insightful

    Hardware firmware is commonly buggy. Device drivers often have to work around buggy hardware, so blacklisting devices for various functionality is not at all unusual.

    If the code seems to work with other devices and breaks with a new device, then the first instinct is going to be to assume the new device is doing something wrong.

    Another way of seeing things is that even if the bug is in the kernel, blacklisting still prevents damage to data on said vendor's hardware. When it comes to data corruption, the first thing to do is limit damage, no matter who is at fault. Afterwards, you can work together to try to isolate the source of the problems. Having unhappy users and customers is never good, unless you are the competition.

    --
    Jumpstart the tartan drive.
  14. Re:a bit too harsh by AthanasiusKircher · · Score: 2

    Bugs happen. If you've got code that seems to work and then you investigate and it doesn't work on one particular brand of drive, it would be a reasonable suspicion that there is something funny with those drives.

    It's hard to evaluate exactly what went on here. If you read the original report of the discovery (which I did last month and is still the first link in TFS), you see this explanation:

    Poking around in the source code of the kernel looking for the trim related code, we came to the trim blacklist. This blacklist configures a specific behavior for certain SSD drives and identifies the drives based on the regexp of the model name. Our working SSDs were explicitly allowed full operation of the TRIM but some of the SSDs of our affected manufacturer were limited. Our affected drives did not match any pattern so they were implicitly allowed full operation.

    In other words, they didn't know what was going on. Then they happened upon some code in the Linux kernel that explicitly blacklisted certain model segments from certain manufacturers. So, at some point someone made the assumption that this must be related to certain models from certain manufacturers, based on code in the Linux kernel.

    This could easily have led to confirmation bias in a situation where errors were not occurring frequently. (Note the further explanation that when they first informed Samsung, Samsung was unable to reproduce the issue until they started using a custom "much more intensive script" to increase the error rate of the problem.)

    So, I don't claim to know the full situation, but my guess is that Samsung wouldn't have been blamed for this at all if this blacklisting code hadn't already been seen in the Linux kernel.

    I'm not trying to place the blame on anyone in particular. But in this case there were various reasons they probably started thinking manufacturers were the problem other than just simple logic, and the "aha" moment apparently was based on looking at code in the Linux kernel already, not on actual prior observation that certain brands of drives were failing. (Otherwise, they would have probably suspected a hardware problem earlier... but instead the post describes a lot of time searching for software issues before they discovered the blacklist.)
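
    To make the quoted blacklist mechanism concrete, here is a minimal sketch of a model-name blacklist that downgrades TRIM behavior per drive. Everything in it is an assumption for illustration: the vendor names, the patterns, the feature labels, and the `trim_mode` function are hypothetical, and shell-style patterns from Python's `fnmatch` stand in for whatever matching the kernel actually uses. It is not the kernel's table.

```python
from fnmatch import fnmatchcase

# Hypothetical blacklist: (model-name pattern, features to disable).
# Neither the vendor names nor the patterns come from the real kernel.
BLACKLIST = [
    ("ExampleVendor SSD 8*", {"queued_trim"}),
    ("OtherVendor X1*", {"queued_trim", "trim"}),
]


def trim_mode(model):
    """Return "queued", "serial" or "none" for a drive model string."""
    disabled = set()
    for pattern, features in BLACKLIST:
        # Case-sensitive shell-style match against the model name.
        if fnmatchcase(model, pattern):
            disabled |= features
    if "trim" in disabled:
        return "none"      # TRIM switched off entirely
    if "queued_trim" in disabled:
        return "serial"    # fall back to non-queued TRIM
    return "queued"        # no pattern matched: full operation is implied


for model in ("ExampleVendor SSD 840", "OtherVendor X100", "Unlisted SSD 1"):
    print(model, "->", trim_mode(model))
```

    The last case mirrors the situation in the quoted passage: a drive that matches no pattern is implicitly allowed full operation, so the list by itself says nothing about whether that drive is actually safe.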

  15. How was this recreated before the bug existed? by godamntheman · · Score: 4, Insightful

    Something doesn't add up ... The fix for this was an oversight in a relatively new "bio_split()" routine that merged in with the immutable bio vector patch set for Linux kernel 3.15. The Algolia blog referenced in the Samsung patch claims it was able to replicate the discard issue using kernels 3.2, 3.10, and 3.14, before the bug existed. What gives?
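
    For readers wondering why a request-splitting routine matters here at all: on a striped array, one discard range can cross chunk boundaries and has to be cut into per-disk pieces. The sketch below is a simplified model of that arithmetic, assuming equal-sized member disks and plain round-robin striping. The function `split_discard` and its parameters are hypothetical, and this is not the kernel's `bio_split()` or md code.

```python
def split_discard(start, length, chunk_sectors, n_disks):
    """Split a discard range on a striped array into per-disk pieces.

    Simplified model: chunk k of the array lives on disk (k % n_disks),
    at chunk index (k // n_disks) on that disk. Returns a list of
    (disk, disk_sector, n_sectors) tuples, one per chunk touched.
    """
    pieces = []
    end = start + length
    while start < end:
        chunk = start // chunk_sectors    # which array chunk we are in
        offset = start % chunk_sectors    # position inside that chunk
        # Never run past the end of the chunk or the end of the request.
        n = min(chunk_sectors - offset, end - start)
        disk = chunk % n_disks
        disk_sector = (chunk // n_disks) * chunk_sectors + offset
        pieces.append((disk, disk_sector, n))
        start += n
    return pieces


# A 12-sector discard starting at array sector 6, with 8-sector chunks
# striped over 2 disks, touches three chunks.
pieces = split_discard(start=6, length=12, chunk_sectors=8, n_disks=2)
print(pieces)  # [(0, 6, 2), (1, 0, 8), (0, 8, 2)]
```

    If the size or offset of any piece is computed wrongly, the discard lands on sectors that still hold live data, which is why a mistake in splitting shows up as corruption regardless of the drive vendor.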

  16. Re:fairly common to blacklist devices by Anonymous Coward · · Score: 5, Informative

    Sorry, that's incorrect.

    There's a bug on MD raid0 and raid10. In Linux.

    There is a data destroyer bug in SAMSUNG NCQ TRIM firmware. Which is *blacklisted*, so that it uses the non-ncq trim.

    See? You're an idiot and everyone but you actually knew what they were complaining about. The samsung firmware is buggy crap that destroys data on NCQ TRIM, and the Linux kernel had a data destroyer bug in RAID0/RAID10 + TRIM that was fixed by a samsung engineer.

    The samsung firmware is still broken, the linux kernel has been fixed, and you're still a useless idiot.