SSD Failure Temporarily Halts Linux 3.12 Kernel Work

Posted by Soulskill on Wednesday September 11, 2013 @07:50AM from the must-be-nvidia's-fault dept.

jones_supa writes "The sudden death of a solid-state drive in Linus Torvalds' main workstation has led to the work on the 3.12 Linux kernel being temporarily suspended. Torvalds has not been able to recover anything from the drive. Subsystem maintainers who have outstanding pull requests may need to re-submit their requests in the coming days. If the SSD isn't recoverable he will finish out the Linux 3.12 merge window from a laptop."

Re:Really? by Anonymous Coward · 2013-09-11 08:01 · Score: 5, Informative

No backup?
http://lkml.indiana.edu/hypermail/linux/kernel/1309.1/01690.html
I long ago gave up on doing backups. I have actively moved to a model
where I use replacable machines instead. I've got the stuff I care
about generally on a couple of different machines, and then keys etc
backed up on a separate encrypted USB key.
So it's inconvenient. Mainly from a timing standpoint. But nothing more.
Linus
Re:Welcome to how SSDs fail. by RichMan · 2013-09-11 08:25 · Score: 3, Informative

A hard shutdown of a high-speed SSD is death. It takes really, really good firmware to recover without reinitializing the drive.
The basic SSD "format" is susceptible to damage on power failure in a way that hard drives are not. The mapping and setup tables of the SSD are critical and constantly in flux, unlike a hard drive, where the mapping is only updated when a failure occurs.
SSDs need internal power-fail control so they can shut down gracefully, and firmware that supports it.
Re:Really? by tlhIngan · 2013-09-11 08:27 · Score: 5, Informative

I found spinning rust to at least give some clues prior to a crash and burn. I would say a single SSD is not ready for anything critical, in my opinion. Worst case scenario, you can always get the platters transferred to a good drive and recover from there (pricey, but cheap if the data is valuable enough).
Sudden SSD failure is actually not really a failure that's detectable. Good SSDs have tons of metrics available through SMART including media wear indicators that tell you impending failure long before it happens.
But when an SSD suddenly dies, it's generally because the controller's FTL tables got corrupted. For high performance drives, it's remarkably easy to do as performance is #1, not data safety. There's nothing wrong with the disk or the electronics.
The FTL (flash translation layer) is what maps a sector the OS uses to the actual flash sector itself. If it gets corrupted, the controller has no way of accessing the right sectors anymore and things go tits up. It's even worse because a lot of metrics are tied to the FTL, including media wear, so losing that data means you can't simply erase and start over - you're completely hooped as the controller cannot access anything.
If you want to think of it another way, treat it like the super block on a filesystem, and the filesystem tables. Now imagine they get corrupt - the data is useless and recovery is difficult, even though the underlying media is perfectly fine. It's possible to hose it so badly that recovery is impossible.
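To make the mapping idea concrete, here is a toy sketch in Python. It is not how any real controller works: the class name ToyFTL, its methods, and the shuffled free list standing in for wear levelling are all invented for illustration. It only shows that the data survives on the flash while the table that gives it an order is gone.

```python
import random


class ToyFTL:
    """Hypothetical toy flash translation layer: logical sector -> physical page."""

    def __init__(self, num_pages, seed=0):
        self.pages = [None] * num_pages           # raw flash contents
        self.table = {}                           # logical sector -> physical page
        self._free = list(range(num_pages))
        random.Random(seed).shuffle(self._free)   # stand-in for wear levelling

    def write(self, sector, data):
        page = self._free.pop()                   # always write to a fresh page
        self.pages[page] = data
        self.table[sector] = page                 # remap; any old page is now stale

    def read(self, sector):
        return self.pages[self.table[sector]]


ftl = ToyFTL(num_pages=16)
for sector, word in enumerate(["the", "quick", "brown", "fox"]):
    ftl.write(sector, word)

in_order = [ftl.read(s) for s in range(4)]        # readable through the table

ftl.table.clear()                                 # simulate a corrupted table
raw = [p for p in ftl.pages if p is not None]     # same words, no known order
```

After the table is cleared, ftl.read(0) raises KeyError even though every word is still sitting in ftl.pages.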
For speed, FTL tables are cached - and modern SSDs can easily have 512MB-1GB of DDR memory just to hold the tables. Of course, you can't write-through changes since the tables themselves need to be wear-levelled on the flash media.
One of the iffiest times for this comes when an SSD is power cycled - pulling the power on an SSD can cause corruption because the tables may be in the middle of an update. But things like firmware bugs and other things can easily corrupt the table as well (think a stray pointer scribbling over the table RAM). A good SSD often has extra capacitance onboard to ensure that on sudden power failure, there is enough backup power to do an emergency commit to flash. This protects against power cycling, but firmware bugs can still destroy the data.
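A rough Python sketch of the cached-table problem, under heavy assumptions: CachedTable, its fields, and the has_capacitor flag are made up for illustration, and a real drive's commit logic is far more involved. The point is only that updates held in RAM are lost on a power cut unless something gets them to flash first.

```python
class CachedTable:
    """Hypothetical write-back cache for a mapping table."""

    def __init__(self):
        self.flash = {}     # persisted copy of the table
        self.ram = {}       # cached working copy
        self.dirty = set()  # sectors changed since the last commit

    def update(self, sector, page):
        self.ram[sector] = page
        self.dirty.add(sector)

    def commit(self):
        for sector in self.dirty:
            self.flash[sector] = self.ram[sector]
        self.dirty.clear()

    def power_cut(self, has_capacitor):
        if has_capacitor:
            self.commit()            # emergency commit on backup power
        self.dirty = set()           # RAM contents are gone either way
        self.ram = dict(self.flash)  # on reboot, reload whatever was persisted


plain = CachedTable()
plain.update(0, 7)
plain.commit()
plain.update(0, 9)                   # not yet committed
plain.power_cut(has_capacitor=False)

backed = CachedTable()
backed.update(0, 7)
backed.commit()
backed.update(0, 9)
backed.power_cut(has_capacitor=True)
```

Here plain.ram ends up as {0: 7}, a stale mapping, while backed.ram ends up as {0: 9}. Neither case models a cut in the middle of the commit itself, which is the window the comment above is warning about.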
Of course, SSDs without such features mean the firmware has to be extra careful. And sometimes, such precautions can miss a point in time where you cannot pull the power at all.
It's sort of reminiscent of that Seagate failure that resulted in a log file reaching a certain size disabling the drive - the data and media were perfectly fine, it's just that the firmware crapped out.
Re:None of that mattered, because by Zero__Kelvin · 2013-09-11 08:28 · Score: 4, Informative

That is correct. In fact, he wrote the code that is the industry standard and uses it every day. How else do you think he is going to continue working on the project from his laptop?

--
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
Re:Intel? by stkris · 2013-09-11 08:29 · Score: 3, Informative

More info here: http://goran.krampe.se/2013/01/02/ssd-nightmare/
"So power cycling can apparently trigger this - and the disk for some odd reason (self protection?) decides to decapitate itself and set accessible cylinders down to 16 instead of 16384."
Re:Really? by You're+All+Wrong · 2013-09-11 08:34 · Score: 5, Informative

Are you attempting to claim the prize for the person with the least understanding of the Distributed Source Code Control System in use?

There was absolutely no code on his system that wasn't on between dozens and thousands of other systems depending on its age.

Just read TFA: "I had pushed out _most_ of my pulls today". His "pulls" are code that is *elsewhere*. He's just a conduit (and gatekeeper) between a few dozen elsewheres and a server with a fat pipe. And by the construction of the system, it really shouldn't matter how those pulls are ordered. (If there'll be a merge conflict one way round, there'll be a merge conflict in other permutations too.)

--
Your head of state is a corrupt weasel, I hope you're happy.
Re:Really? by Guspaz · 2013-09-11 08:51 · Score: 4, Informative

What makes you think you can't take FLASH devices and access them in a similar way to platters?
Because on most SSDs, the data is encrypted, and on all SSDs, the pages are in an effectively random order. If you've lost the controller, you've lost both the encryption keys and the table that enables a logical platter-style presentation of the pages. No amount of soldering is going to fix those problems.
Re:Really? by rssrss · 2013-09-11 08:55 · Score: 1, Informative

I thought NSA backed up all our drives.

--
In the land of the blind, the one-eyed man is king.
Controller failure by ArchieBunker · 2013-09-11 09:07 · Score: 1, Informative

So buy a new drive with the same rev boards and swap them out. Problem solved.

--
Only the State obtains its revenue by coercion. - Murray Rothbard
Re:Really? by michrech · 2013-09-11 09:11 · Score: 5, Informative

That's just as easy as popping off the back of the HD, removing a couple of screws and pulling out the platter.
You do that outside of a cleanroom and your data is gone forever.
False -- I've done it on a number of occasions (to drives I didn't care about), and was able to run the drives for months without their covers. I'd still be using the drives if I had need for drives as small as they were (somewhere in the 80GB range)...
Would I use a drive in this state for something critical? No, but saying you immediately lose the data if you pull a drive cover is just flat wrong.

--
bork bork bork!
Re:RAID by Trogre · 2013-09-11 10:09 · Score: 5, Informative

You guys should really look at the --backup and --backup-dir options in rsync.
I use them in conjunction with --delete to always have a "current" copy of the data, along with any old files (i.e. ones that have been updated or deleted) in a separate backup folder, named after the current day of the month.
That way you get a directory structure as follows:
01
02
03
04 ...
31
Current
You can restore the up-to-date set from Current at any time, and if you want to retrieve a file you deleted or over-wrote five days ago, go look in folder 06.
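For what it's worth, here is a small Python sketch that builds that kind of command line. The function name rsync_command, the example paths, and the -a flag are my own assumptions and not part of the setup described above; --delete, --backup and --backup-dir are the real rsync options mentioned. It only builds the argument list and does not run anything.

```python
import datetime


def rsync_command(source, dest_root, today=None):
    """Build an rsync argument list (hypothetical helper, paths are examples)."""
    today = today or datetime.date.today()
    day = f"{today.day:02d}"                   # 01 .. 31
    return [
        "rsync",
        "-a",                                  # archive mode (an assumption)
        "--delete",                            # mirror deletions into Current
        "--backup",                            # keep replaced/deleted files...
        f"--backup-dir={dest_root}/{day}",     # ...in today's numbered folder
        f"{source}/",
        f"{dest_root}/Current/",
    ]


cmd = rsync_command("/home/me", "/mnt/backup", datetime.date(2013, 9, 6))
```

To actually run it you would pass cmd to something like subprocess.run(cmd, check=True), after trying it on data you don't care about.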

--
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
Will never work with modern drives by dutchwhizzman · 2013-09-11 10:20 · Score: 5, Informative

Modern drives, for the last five years at least, have calibration factors for the platter/head pack stored in the EEPROM on the controller board. If you swap boards, the new board most likely won't be able to read the data on the disk, since it's not calibrated to that head/platter kit.

--
I was promised a flying car. Where is my flying car?
Re:RAID by Solandri · 2013-09-11 11:39 · Score: 5, Informative

I stopped using RAID in any of my systems after I started using WHSv1. WHS2011 has the same feature -- live system backups. If a drive fails, I pop in a new one (of any type/size), boot a CD that came with WHS (essentially a WinPE environment with recovery software baked in), select my backup (I save 7-10 days -- I forget what it's set to), and in about an hour my system is back to the state of the last backup.
There's the operative phrase. RAID is for systems where you can't have or don't want an hour of downtime while restoring from a backup. The R in RAID stands for redundant. As in you can have a failure and keep going.

Note that this is the converse of "RAID is not a backup!" Just as RAID is not a replacement for a backup, a backup is not a replacement for RAID either. They do different things (and if you're smart, you will also back up your RAID). From your own description, you wanted a backup. RAID was never the correct solution for your needs.
Re:RAID by fnj · 2013-09-11 17:04 · Score: 3, Informative

Why not do it right?
Re:Really? by gagol · 2013-09-11 20:10 · Score: 5, Informative

This is more like an MS employee's workstation crashing. The Linux infrastructure is not hosted on Linux home machines, and it is replicated around the world. I was simply pointing out my favorable opinion of slow spinning disks... not blaming Linus or whatever, shit happens.

--
Tomorrow is another day...