Samsung Finds, Fixes Bug In Linux Trim Code
New submitter Mokki writes: After many complaints that Samsung SSDs corrupted data when used with Linux, Samsung found out that the bug was in the Linux kernel and submitted a patch to fix it. It turns out that kernels without the final fix can corrupt data if the system is using Linux md RAID with raid0 or raid10 and issues trim/discard commands (either via fstrim or by the filesystem itself). The vendor of the drive did not matter, and the previous blacklisting of Samsung drives for broken queued trim support can most likely be lifted after further tests. According to this post the bug has been around for a long time.
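For anyone wondering whether their own box matches the conditions in the summary (md RAID at level raid0 or raid10), here is a rough sketch that scans text in the format of /proc/mdstat and lists arrays at those levels. The function name and the sample text are made up for illustration, and the parsing assumes the usual "mdX : active raidN ..." line layout; it says nothing about whether a given kernel already carries the fix.

```python
import re

def affected_md_arrays(mdstat_text, levels=("raid0", "raid10")):
    """Return names of md arrays whose RAID level is in `levels`.

    Hypothetical helper: assumes array lines look like
    'md0 : active raid0 sdb1[1] sda1[0]'.
    """
    affected = []
    for line in mdstat_text.splitlines():
        # name : state [(flag)] level members...
        match = re.match(r"^(md\S*)\s*:\s*\S+\s+(?:\(\S+\)\s+)?(\S+)", line)
        if match and match.group(2) in levels:
            affected.append(match.group(1))
    return affected

# Made-up sample in the style of /proc/mdstat
SAMPLE_MDSTAT = """\
Personalities : [raid0] [raid1] [raid10]
md0 : active raid0 sdb1[1] sda1[0]
      1953260544 blocks super 1.2 512k chunks
md1 : active raid1 sdd1[1] sdc1[0]
      976630464 blocks super 1.2 [2/2] [UU]
md2 : active raid10 sdh1[3] sdg1[2] sdf1[1] sde1[0]
      1953260544 blocks super 1.2 512K chunks 2 near-copies [4/4] [UU]
unused devices: <none>
"""

print(affected_md_arrays(SAMPLE_MDSTAT))
```

On a real system you would pass in the contents of /proc/mdstat instead of the sample string.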
Well, that's gotta be embarrassing for everyone bashing Samsung over this. I remember reading some rather strong opinions about who was at fault.
Thank You Samsung!
While our company's CAD workstations don't run Linux, all of them do run on Samsung SSDs.
Nice to see vendors working together to improve Linux.
Harrison's Postulate - "For every action there is an equal and opposite criticism"
When Apple updated OS X to allow TRIM on non-Apple supplied SSDs, forums were flooded with people claiming you should never use Samsung because they were fundamentally broken with regards to TRIM. Their "proof" was that corruption happened on Linux and they would not be swayed by the thought that maybe the problem was with Linux.
This is just another case of "Not My Problem" syndrome that too many techs get into. They think their code/tools/systems/whatever must be perfect, and others are the ones fucking up. Samsung drives went on a blacklist for issuing the commands to them due to this bug? "WALP, LINUX IS PERFECT, MUST BE THE HARDWARE GUYS, even though their devices perform perfectly on other OSes" - and instead now we're left with a bug in Linux that corrupts data until the patch can make its way through the distro channels and be pushed out to end users.
After many complaints that Samsung SSDs corrupted data when used with Linux
I've used Samsung SSDs for years now and until today I've never heard of a 14e07c2ea4f[NO CARRIER]
Good people go to bed earlier.
Vote with your wallet, my next SSD will be a Samsung.
Why didn't other manufacturers' drives have this issue?
That's odd... It's been working flawlessly for me on my Windows machines for ages.
:thumbsup:
The first reply:
Thanks for tracking this down. Instead of explicitly coding around the
issue in raid0/raid10/linear I would prefer to fix bio_split(). It seems
like a deficiency in the interface that it does not handle this
transparently.
Do you have a reproducible test case? If so it would be great if you
could try the following patch and let us know the results.
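To picture why splitting matters on a striped array: a single discard range in the array's logical address space has to be cut at chunk boundaries and each piece sent to the right member disk at the right offset. Below is a toy model of plain RAID0 striping arithmetic. It is not the kernel's bio_split() or the md code, and it does not reproduce the actual bug; the function name and parameters are invented for illustration.

```python
def split_discard(start, length, chunk_sectors, n_disks):
    """Split a logical discard range into per-disk pieces.

    Toy RAID0 model (assumption): logical chunk k lives on disk k % n_disks,
    at disk offset (k // n_disks) * chunk_sectors.
    Returns a list of (disk, disk_sector, n_sectors) tuples.
    """
    pieces = []
    end = start + length
    while start < end:
        chunk = start // chunk_sectors
        offset_in_chunk = start % chunk_sectors
        # A piece must never cross a chunk boundary
        n = min(chunk_sectors - offset_in_chunk, end - start)
        disk = chunk % n_disks
        disk_sector = (chunk // n_disks) * chunk_sectors + offset_in_chunk
        pieces.append((disk, disk_sector, n))
        start += n
    return pieces

# Discard 10 sectors starting at logical sector 6, 8-sector chunks, 2 disks
print(split_discard(6, 10, 8, 2))  # -> [(0, 6, 2), (1, 0, 8)]
```

If any piece ends up with the wrong start or length, the discard lands on sectors that may still hold live data, which is the kind of failure described in the summary.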
Hats off to Samsung for finding and even fixing the problem.
On behalf of all internet users everywhere, whether in this specific space-time continuum or not, I would like to formally apologize to Samsung for all of the totally unwarranted bashing they took over this issue. And I would also like to express my gratitude to them for finding a bug, fixing it, and posting a fix. Good job.
Just cruising through this digital world at 33 1/3 rpm...
More like an assumption that the bug was in the driver because they hadn't noticed issues on other drives.
Hardware firmware is commonly buggy. Device drivers often have to work around buggy hardware, so blacklisting devices for various functionality is not at all unusual.
If the code seems to work with other devices and breaks with a new device, then the first instinct is going to be to assume the new device is doing something wrong.
When I worked on a military base a while back, there was a young female in the group whose last name was Trim. I never made a comment on it until the last couple days I was going to be there, and only in response to her making a remark like "some guys snicker" when hearing her name. I told her it was one of the first thoughts in my mind months earlier, but couldn't say anything.
Could be worse though. In World War II, there was an Admiral Kuntz. He has a road and access gate named after him at Pearl Harbor. Imagine being his daughters.
If you think I voted for Trump because of this post, you're wrong. I voted for Dr. Jill Stein of the Green Party. Again.
Bugs happen. If you've got code that seems to work and then you investigate and it doesn't work on one particular brand of drive, it would be a reasonable suspicion that there is something funny with those drives.
Given the fact that multiple Samsung drive models were failing but multiple Intel drive models were *not* failing under the same test (from the linked article), the developers could be forgiven for suspecting there was something wonky going on with the Samsung drives.
There were issues with the 840 EVO losing significant speed after it had been in use for a while. There was eventually (after much complaining from customers) a "fix" released that helped but didn't actually completely resolve the issue.
But it doesn't have to. If a drive were to implement TRIM by doing absolutely nothing (which is completely within spec) then it wouldn't show the problem, but it doesn't mean the drive is better than another or the other drive has a fault.
It's quite possible that the way Intel implements TRIM is just a little different. Perhaps they defer it for a few ms or something. So the bug is occurring over and over but it doesn't show itself with corruption.
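The argument above can be shown with a toy model: send the same misdirected discard to two pretend drives, one that actually drops trimmed sectors and one that treats TRIM as a no-op. Only the first shows corruption, even though both received the same bad command. The `Drive` class and `corrupted` helper are invented for this sketch and do not model any real firmware.

```python
class Drive:
    """Toy drive (assumption): a list of sectors; None means the data is gone."""

    def __init__(self, n_sectors, honors_trim):
        self.sectors = ["data"] * n_sectors
        self.honors_trim = honors_trim

    def trim(self, start, count):
        if not self.honors_trim:
            return  # this drive ignores TRIM entirely
        for i in range(start, start + count):
            self.sectors[i] = None

def corrupted(drive, live_sectors):
    """Return the live sectors whose data has been lost."""
    return [i for i in live_sectors if drive.sectors[i] is None]

live = range(4, 8)            # sectors 4..7 still hold live data
eager = Drive(8, honors_trim=True)
lazy = Drive(8, honors_trim=False)

# The intended discard was sectors 0..3; a buggy layer sends 4..7 instead
for drive in (eager, lazy):
    drive.trim(4, 4)

print(corrupted(eager, live))  # -> [4, 5, 6, 7]
print(corrupted(lazy, live))   # -> []
```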
Yes, assuming that because you can reproduce it on Samsung drives it must be a Samsung bug is confirmation bias.
http://lkml.org/lkml/2005/8/20/95
The linked article pointed out that five models of Samsung SSD were affected, three models of Intel SSD were not. So there were at least some drives that didn't seem to be affected by the bug. (Presumably just due to luck/usage-pattern/etc.)
I'm running Linux on a RAID-0 SSD array.
I guess I should turn off fstrim until there's a backport of the fix to Fedora?
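Besides periodic fstrim runs, the summary notes that the filesystem itself can issue discards, which on Linux typically means a mount with the `discard` option. Here is a small sketch that scans text in the format of /proc/mounts for that option. The function name and sample text are hypothetical, and it does not detect scheduled fstrim jobs.

```python
def mounts_with_discard(mounts_text):
    """Return (device, mountpoint) pairs mounted with the 'discard' option.

    Assumes the /proc/mounts layout: device mountpoint fstype options dump pass.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, _fstype, options = fields[:4]
        if "discard" in options.split(","):
            found.append((device, mountpoint))
    return found

# Made-up sample in the style of /proc/mounts
SAMPLE_MOUNTS = """\
/dev/md0 / ext4 rw,relatime,discard 0 0
/dev/sda1 /boot ext4 rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""

print(mounts_with_discard(SAMPLE_MOUNTS))
```

On a real system you would feed it the contents of /proc/mounts.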
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
Something doesn't add up ... The fix for this was an oversight in a relatively new "bio_split()" routine that merged in with the immutable bio vector patch set for Linux kernel 3.15. The Algolia blog referenced in the Samsung patch claims it was able to replicate the discard issue using kernels 3.2, 3.10, and 3.14, before the bug existed. What gives?
You got it, dude. Your moderation proved this even more. Morons!
While an apology is due, this sort of problem is inevitable given the nature of the technology. TRIM on NAND is a crutch for a technology that is poorly suited to data storage. Transforming NAND into a usable storage device requires heroic efforts on the part of the vendor, and it is hard to blame them for the bugs. Likewise, it is hard to blame Linux developers for their heroic efforts to work around the extensive deficiencies of NAND flash. Trusting in cheap commodity devices that don't even claim to protect against power loss is ill-advised.
Using TRIM as a band-aid for the performance woes of over-filled NAND devices is just asking for trouble. It has long been known that filling up filesystems leads to terrible performance, and the same applies to NAND drives. It is irresponsible of the vendors to provision the drives with insufficient reserved space, but one can compensate by setting aside an empty partition covering 5% of the space. It is much safer to disable TRIM and under-provision the drive, and it achieves the same effect of limiting write-amplification, without having to worry about bugs trimming away live data.
The only place where TRIM really makes sense is in the context of virtualization. Recovering space in sparse virtual disk images has real benefit, and operating system vendors have a lot more incentive and ability to make it work properly.
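The under-provisioning suggestion above (leave roughly 5% of the drive unpartitioned) comes down to simple arithmetic. A minimal sketch follows; the function name, the 1 MiB alignment default, and the example capacity are assumptions for illustration, not recommendations from any vendor.

```python
def partition_plan(total_bytes, reserve_percent=5, align_bytes=1024 * 1024):
    """Return (usable_bytes, unallocated_bytes) for an under-provisioned drive.

    Reserves at least `reserve_percent` of the drive, then rounds the usable
    size down to a multiple of `align_bytes` (alignment is an assumption).
    """
    reserved = total_bytes * reserve_percent // 100
    usable = total_bytes - reserved
    usable -= usable % align_bytes  # round down, so the reserve only grows
    return usable, total_bytes - usable

# Example: a hypothetical 1000 MiB device
usable, unallocated = partition_plan(1000 * 1024 * 1024)
print(usable // (1024 * 1024), "MiB usable,",
      unallocated // (1024 * 1024), "MiB left unpartitioned")
```

The space only helps if it stays unwritten, so the idea assumes the reserved region is never partitioned or filled.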
Doesn't matter much - this is why many Samsungs were mistakenly blacklisted, thinking it was a problem with the drive.
Unless you're running RAID0 or similar, it's not going to bite you. Not at all sure why anyone runs RAID0, to be honest, and certainly not with SSDs, but there you go. RAID10 is affected, I believe, but with 8 drives I'm not sure what you'd get from RAID10 that RAID 5 wouldn't have been better for you anyway.