Samsung Finds, Fixes Bug In Linux Trim Code
New submitter Mokki writes: After many complaints that Samsung SSDs corrupted data when used with Linux, Samsung found out that the bug was in the Linux kernel and submitted a patch to fix it. It turns out that kernels without the final fix can corrupt data if the system is using linux md raid with raid0 or raid10 and issues trim/discard commands (either fstrim or by the filesystem itself). The vendor of the drive did not matter and the previous blacklisting of Samsung drives for broken queued trim support can be most likely lifted after further tests. According to this post the bug has been around for a long time.
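For anyone wondering whether a given box is even in the affected group, here is a rough sketch, not an official tool, that looks through the text of /proc/mdstat for active raid0 or raid10 arrays. The function name and the sample text are made up for illustration, and the parsing assumes the usual "mdX : active <level> ..." line layout.

```python
import re

# Array levels the summary names as affected when discards are issued.
AFFECTED_LEVELS = {"raid0", "raid10"}

# Matches lines such as "md0 : active raid0 sdb1[1] sda1[0]" and also
# "md2 : active (auto-read-only) raid10 ...".
_ARRAY_LINE = re.compile(r"^(md\S*) : active (?:\([^)]*\) )?(\S+)")


def striped_md_arrays(mdstat_text):
    """Return {array_name: level} for active raid0/raid10 arrays."""
    found = {}
    for line in mdstat_text.splitlines():
        match = _ARRAY_LINE.match(line)
        if match and match.group(2) in AFFECTED_LEVELS:
            found[match.group(1)] = match.group(2)
    return found


# Made-up sample in the usual /proc/mdstat layout (not from a real machine).
SAMPLE = """\
Personalities : [raid0] [raid1] [raid10]
md0 : active raid0 sdb1[1] sda1[0]
      1953260544 blocks super 1.2 512k chunks

md1 : active raid1 sdd1[1] sdc1[0]
      976630464 blocks super 1.2 [2/2] [UU]

md2 : active (auto-read-only) raid10 sdh1[3] sdg1[2] sdf1[1] sde1[0]
      1953260544 blocks super 1.2 512K chunks 2 near-copies [4/4] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(striped_md_arrays(SAMPLE))
```

On a real system you would pass in the contents of /proc/mdstat instead of SAMPLE. A hit only means the array level matches the summary. Whether discards are actually issued depends on your mount options and on whether fstrim is run.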
Well, that's gotta be embarrassing for everyone bashing Samsung over this. I remember reading some rather strong opinions about who was at fault.
Thank You Samsung!
While our company CAD workstations don't run Linux, all of them do run on Samsung SSDs.
Nice to see vendors working together to improve Linux.
Harrison's Postulate - "For every action there is an equal and opposite criticism"
When Apple updated OS X to allow TRIM on non-Apple supplied SSDs, forums were flooded with people claiming you should never use Samsung because they were fundamentally broken with regards to TRIM. Their "proof" was that corruption happened on Linux and they would not be swayed by the thought that maybe the problem was with Linux.
This is just another case of "Not My Problem" syndrome that too many techs get into. They think their code/tools/systems/whatever must be perfect, and others are the ones fucking up. Samsung drives went on a blacklist for issuing the commands to them due to this bug? "WALP, LINUX IS PERFECT, MUST BE THE HARDWARE GUYS, even though their devices perform perfectly on other OSes" - and instead now we're left with a bug in Linux that corrupts data until the patch can make its way through the distro channels and be pushed out to end users.
After many complaints that Samsung SSDs corrupted data when used with Linux
I've used Samsung SSDs for years now and until today I've never heard of a 14e07c2ea4f[NO CARRIER]
Good people go to bed earlier.
Vote with your wallet. My next SSD will be a Samsung.
Confirmation bias. It was happening with other brands, but for one reason or another, people focused in on Samsung as the culprit, and once that happened, there was no getting out of it.
Hats off to Samsung for finding and even fixing the problem.
From TFS: "The vendor of the drive did not matter and the previous blacklisting of Samsung drives for broken queued trim support can be most likely lifted after further tests."
If the vendor of the drive does not matter in testing, then there is no relevant difference in specification compliance or other "somethings." It's purely a matter of which anecdotes gain what traction within a small population of users using md raid with multiple SSDs in a raid 0 or 10 configuration, and which of those users circumstantially has the best contacts within the development community.
My first guess is the users trying that configuration were purchasing the fastest available SSDs, which tend to be Samsung drives (large market share) or boutique manufacturers (small market share).
Perhaps competitive prices coupled with perceived quality (and good experience on other platforms) led to these drives being selected by more knowledgeable or performance oriented people.
These drives then got pushed harder or in ways more likely to expose the bugs, leading to a perception that they were unreliable under Linux.
On behalf of all internet users everywhere, whether in this specific space-time continuum or not, I would like to formally apologize to Samsung for all of the totally unwarranted bashing they took over this issue. And I would also like to express my gratitude to them for finding a bug, fixing it, and posting a fix. Good job.
Just cruising through this digital world at 33 1/3 rpm...
Hardware firmware is commonly buggy. Device drivers often have to work around buggy hardware, so blacklisting devices for various functionality is not at all unusual.
If the code seems to work with other devices and breaks with a new device, then the first instinct is going to be to assume the new device is doing something wrong.
Another way of seeing things is that even if the bug is in the kernel, blacklisting still prevents damage to data on said vendor's hardware. When it comes to data corruption the first thing to do is limit damage, no matter who is at fault. Afterwards, you can work together to try to isolate the source of the problems. Having unhappy users and customers is never good, unless you are the competition.
Jumpstart the tartan drive.
Bugs happen. If you've got code that seems to work and then you investigate and it doesn't work on one particular brand of drive, it would be a reasonable suspicion that there is something funny with those drives.
It's hard to evaluate exactly what went on here. If you read the original report of the discovery (which I did last month and is still the first link in TFS), you see this explanation:
Poking around in the source code of the kernel looking for the trim related code, we came to the trim blacklist. This blacklist configures a specific behavior for certain SSD drives and identifies the drives based on the regexp of the model name. Our working SSDs were explicitly allowed full operation of the TRIM but some of the SSDs of our affected manufacturer were limited. Our affected drives did not match any pattern so they were implicitly allowed full operation.
In other words, they didn't know what was going on. Then they happened upon some code in the Linux kernel that explicitly blacklisted certain model segments from certain manufacturers. So, at some point someone made the assumption that this must be related to certain models from certain manufacturers, based on code in the Linux kernel.
This could easily have led to confirmation bias in a situation where errors were not occurring frequently. (Note the further explanation that when they first informed Samsung, Samsung was unable to reproduce the issue until they started using a custom "much more intensive script" to increase the error rate of the problem.)
So, I don't claim to know the full situation, but my guess is that Samsung wouldn't have been blamed for this at all if this blacklisting code hadn't already been seen in the Linux kernel.
I'm not trying to place the blame on anyone in particular. But in this case there were various reasons they probably started thinking manufacturers were the problem other than just simple logic, and the "aha" moment apparently was based on looking at code in the Linux kernel already, not on actual prior observation that certain brands of drives were failing. (Otherwise, they would have probably suspected a hardware problem earlier... but instead the post describes a lot of time searching for software issues before they discovered the blacklist.)
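For anyone who hasn't looked at that kind of blacklist, the mechanism described in the quoted post boils down to a table lookup on the drive's model string. Here is a rough sketch of the idea, not the kernel's code. The quirk name, the vendor strings, and the patterns are all invented for illustration. The post says the real list matches on a regexp of the model name. Shell-style wildcards are used here only to keep the example short.

```python
from fnmatch import fnmatchcase

# Hypothetical quirk name, used by this sketch only.
NO_QUEUED_TRIM = "no_queued_trim"

# Illustrative (model pattern, quirks) entries. NOT copied from the kernel.
QUIRK_TABLE = [
    ("ExampleVendor SSD 8*", {NO_QUEUED_TRIM}),
    ("OtherVendor_M5*", {NO_QUEUED_TRIM}),
]


def quirks_for(model):
    """Collect the quirks of every table entry whose pattern matches."""
    quirks = set()
    for pattern, flags in QUIRK_TABLE:
        if fnmatchcase(model, pattern):
            quirks |= flags
    return quirks


def trim_mode(model):
    """Listed models get limited TRIM; unlisted ones are allowed full TRIM."""
    return "unqueued" if NO_QUEUED_TRIM in quirks_for(model) else "queued"


if __name__ == "__main__":
    for name in ("ExampleVendor SSD 850 PRO", "Unlisted Drive 123"):
        print(name, "->", trim_mode(name))
```

The point to notice is the default: a model that matches no pattern is implicitly allowed full operation, which is the situation the post describes for its affected drives. So a drive being absent from the list says nothing either way about its firmware.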
Something doesn't add up ... The fix for this addressed an oversight in a relatively new "bio_split()" routine that was merged with the immutable bio vector patch set for Linux kernel 3.15. The Algolia blog referenced in the Samsung patch claims it was able to replicate the discard issue using kernels 3.2, 3.10, and 3.14, before the bug existed. What gives?
Sorry, that's incorrect.
There's a bug on MD raid0 and raid10. In Linux.
There is a data destroyer bug in SAMSUNG NCQ TRIM firmware. Which is *blacklisted*, so that it uses the non-NCQ trim.
See? You're an idiot and everyone but you actually knew what they were complaining about. The Samsung firmware is buggy crap that destroys data on NCQ TRIM, and the Linux kernel had a data destroyer bug in RAID0/RAID10 + TRIM that was fixed by a Samsung engineer.
The Samsung firmware is still broken, the Linux kernel has been fixed, and you're still a useless idiot.