SSD Failure Temporarily Halts Linux 3.12 Kernel Work
jones_supa writes "The sudden death of a solid-state drive in Linus Torvalds' main workstation has led to the work on the 3.12 Linux kernel being temporarily suspended. Torvalds has not been able to recover anything from the drive. Subsystem maintainers who have outstanding pull requests may need to re-submit their requests in the coming days. If the SSD isn't recoverable he will finish out the Linux 3.12 merge window from a laptop."
No backup?
"If any question why we died, Tell them because our fathers lied."
That's all that Ballmer needs to stop Linux? Just find Torvald's SSD?
"No freeman shall ever be debarred the use of arms." -- Thomas Jefferson
Maybe Linus needs to create a backup program like he did when he wanted a better version control system and created git? Also, why is the only copy of the changes on his local workstation and not a server with redundancy? This seems rather amateurish.
today is spelling optional day.
Linux said "So I don't want to necessarily blame the harddisk, since it's just ten
days since I upgraded the rest of my machine, after it worked years in
the previous one. That just makes me go "hmm". As far as I know, all
the fans etc were working fine, but.."
There's his problem: "after it worked years in the previous [machine]."
His SSD died a natural death of old age.
IMarv
Trusting software vendors is no smarter than trus
Was he too busy treating people horribly to audit his DR procedures?
Fuck Ajit Pai
Why is this news... is this our version of People magazine, where instead of hearing about all the details of the Kardashians' lives, we hear about every email or event that happens to Linus?
...for over an hour when Torvalds had to make an emergency run to Albertson's for some toilet paper and hostility medication.
I've owned several hundred hard drives over the last 30 years. I've never had an active hard drive drive just blank out. I have had drives that had not been powered for a couple of years refuse to ever come back. But if I did not feel the need to even power the thing on for years, you can imagine how little I cared for what was on it.
In the last four years, I've owned around 20 SSDs. I've had five failures. Every single one was the drive just instantly lost everything. Amazingly, in four of the five cases, the drive still worked fine! It had simply lost all the data on it and believed itself to be a blank drive.
That said, the speed of SSDs makes them worth the risk to me. But I take backups far more seriously than I used to. I need them far more often.
He has backups all over the world. But like with any backup, you can't actually restore from it until you replace the failed disk.
That is correct. In fact he wrote the code that is the industry standard and uses it every day. How else do you think he is going to continue completion of the project on his laptop.
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
Now there a new meaning for Kernel Panic!
I could write something witty for my sig, but instead wrote this...
You beat me to it. Anyone with a vague clue about unix would have thought of that. Obviously vague clues are a rare thing for the parent poster.
More info here: http://goran.krampe.se/2013/01/02/ssd-nightmare/
"So power cycling can apparently trigger this - and the disk for some odd reason (self protection?) decides to decapitate itself and set accessible cylinders down to 16 instead of 16384."
backintime. I am using it, works great, and the restore functions are quite easy to use as well.
I learned long ago after some close calls to back everything up. In my case for my desktop I store my data on a XFS partition stored on a RAID 5 hard drive array. I also am using Crashplan to back up all of my data, both to a removeable hard drive and to the cloud with over 3TB of data backed up. The nice thing about Crashplan is that it continually backs up, taking periodic snapshots so I can restore a previous version of a file if I wish. The main drawbacks of Crashplan are that it runs on Java and can be a memory pig. I pay $6/month for unlimited backup of up to 10 machines and have several computers backed up with them now. With the proper settings on my router I don't even notice all the backup traffic running in the background.
Since I have had sudden SSD failures in the past I also dump my root XFS filesystem weekly onto my RAID array (it takes under a minute to run xfsdump) and incremental backups nightly and those dumps get backed up on the cloud as well.
I have found the XFS tools to be quite good at recovery when things go really bad. When running software RAID 1 I had problems where drives would drop out of the array for apparently no reason and I have had several occasions where while rebuilding the other drive would pop out of the array. Switching to an Areca hardware raid controller with battery backed DRAM ended those problems (besides seeing a big performance improvement).
I have found the RAID controller to work well when drive failure occurs and it even recovered after human error (I accidentally disconnected one of the active drives while it was rebuilding and reconnected it).
I won't use btrfs yet. The last time I tried it about 6 months ago it was quite slow and I have a lot of concerns about the storage filling up due to COW that have not been adqeuately addressed as far as I could tell. I tried setting it up for a Cyrus IMAP server on an Intel SSD and it was unusably slow just untaring all the files so I ended up going back to XFS.
SSDs are still relatively new. I have had issues with some firmware versions and had one fail catastrophically after only 2 weeks of use. I have also had compact flash and SD devices suddenly fail. My experience is that usually mechanical hard drives give some warning (i.e. SMART) and they tend to last years. I have a server I just retired where the hard drive had 10 years on the clock according to SMART.
This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
I don't feel anything but shame for someone losing data in a hard drive crash who has or should have network backups available to them. If this happened to anyone but Linus the majority of the comments would be calling the coder a n00b. If it was Balmer there would be an absolute riot of anti-MS venom....
I guess the great Linus has fallen into shadow.
As someone who's taken over server administration from very talented developers a number of times, I've found that being a great developer doesn't mean that you're a great sysadmin. Developers may understand conceptually that RAID and backups are important (but sometimes think that RAID is a backup), but that doesn't mean that they actually set them up.
I'm not nearly as much of a believer in RAID for the home environment. If you (accidentally) delete something on one drive it's gone from both. Better to buy two drives and do a daily rsync. That way you have a window of opportunity to recover data. Personally, I use rsync without --delete until the 2d drive starts getting full, then I use the --delete flag to clean up.
Competition Good, Monopoly Bad.
I have a mirrored set of SSD's on all my important machines, and RAID 6 for bulk storage.
Unlike Linus, I can't afford to lose work.
"To those who are overly cautious, everything is impossible. "
According to a speech of his, that's how Linux got started. He accidentally wiped his MINIX partition.
As someone who's taken over server administration from very talented developers a number of times, I've found that being a great developer doesn't mean that you're a great sysadmin. Developers may understand conceptually that RAID and backups are important (but sometimes think that RAID is a backup), but that doesn't mean that they actually set them up.
And as a sysadmin, I'm tired of hearing that. RAID1,5,6,10,Z is a backup. It's not an archive. An archive is what you go to when you want the old version. A backup is generally one of two things:
1) Something that lets you keep chugging through a failure (raid5, a backup generator with automatic cut-over, etc)
2) A standby spare (tape, NAS/usb drive, secondary location with desks/computers/etc.
RAID (other than 0) is absolutely a backup. It's not the perfect backup but it is a backup. What it is NOT is an archive - last night's/week's/month's/quarter's data.
No, RAID is *not* a backup, RAID's only purpose is to improve reliability/uptime by letting you ride through hardware failures, but it does nothing to protect you from all of the rest of the things that can destroy your data, like file corruption, fat fingering a "rm -rf / home/someuser", a virus, a website hack attack, etc. That's what your backups are for, but you can call them archives if you like, but don't call RAID a "backup" because it's not. Depending on what the problem is and when you discover it, you may need to go back through several archives before you find the data you're looking for.
That is a ridiculous statement. Work is lost every time a drive fails unless it happens to fail immediately after a backup. Full backups take lots of time. If you understood git better you would realize that a lot less work is lost the git way than with old school backups. I'm sure that every time Linus does a successful merge he pushes it to a git repo elsewhere. All history is in the git logs. I am certain the work he lost is minimal, and is much less than if he was relying on nightly backups and the failure happened near the end of the work day. Just the effort of trying to determine what was done and what has been lost would be far more time consuming without git.
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
This might be [electrolytic] capacitor or some other component-level magic-smoke release. There is also the dreaded, much-discussed "wear" from re-writing flash memory -- worse than you think because blocks of 64 KB [typically] have to be erased and re-written to change any byte therein.
Linus, of all people, ought to know his kernel has options to minimize the re-writes, many of them developed to optimize laptops (like delaying writes). Another thing is to mount partitions (/etc/fstab anyone?) with `noatime` as an option (maybe 'nodiratime` too). Un*x and other Linux-like systems by default will re-write the access time for any disk inode read. Turning it off reduces disk write load (and seeks on slow disks). I've had it off for over ten years an not noticed any malperformance, althrough there are rumored to be some, somewhere.
trying to desolder 100 pins spaced 0.01" apart then resoldering them, unless you have a 0.1 mill precision soldering robot it is impossible, you can't even buy wire thin enough to do it by hand.
SMT rework by hand isint rocket science, but takes more tools than the average garage has.
Desoldering you use a custom tip for that socket/package type (one tip per package & they're not cheap). It's essentially a metal ring that heats the solder on all the pins at once. In the center of the assembly is a vacuum probe. You heat all the pins, melting all the solder & hit the button on the handpiece to suction the chip up off the board. Then clean up the pads on the board. Careful with the heat because you dont want to lift pads off the board, if you do then you have to either fix them, or make a new pads. And then if you manage to trash a via (conductivity path to a different board layer), then you've got to drill out a new one and you have to use a esd safe conductive drill with a resistance cutoff. You put a clip from the drill in contact with the layer you're trying to get to, drill down and when the drill tip makes contact with that layer the drill turns off because the circuit is complete. But it still sucks and if you don't know how all the board layers are put together you may end up trashing a trace a couple layers into the board and wrecking the whole thing.
Soldering it down you do this. Align all the chip legs on the pads. Then you can either run a small bead of solder paste across all the pins or use a wave soldering tip (small cup, uses surface tension to hold the solder in place) and drag the tip over all the pins. Heat on the pin & pad draws the solder down into the joints. If you put too much solder you might have to vac it back up and redo it if you've made bridges etc. Alignment is key, and keeping the part in position is key. I used to try and avoid using glue underneath because that made it difficult to get it back off if you needed to down the road.
Doing hand rework on that kind of stuff the hardest thing for me was dealing with smt chip caps, little bastards will crack if you heat em to fast, so you have to get a temp regulated hot plate, heat em up slow, then pick and place em quick with tweezers/needlenose & solder em down quick.
01:36AM up 426 days, 2:46, 1 user, load average: 0.14, 0.11, 0.05
You have a software feature in a server OS that supports certain client OSes to do backups to the server. RAID may be a software feature, but even if it's "software raid", you often have BIOS bootable raids that even work with one of the drives missing. This essentially means that you can work OS agnostic on a lower level than "I have a backup system that works". For Linux, you can have a backup system too that will restore from a LiveCD/USB stick and stores on a remote server. The same amount of time roughly will be needed to backup and restore, differential, incremental, full backups, the works. The solution you are providing is really nothing comparable to RAID. It's fundamentally different because it works on a totally different layer, doesn't prevent downtime and it's not OS agnostic. RAID should prevent downtime, making working backups should prevent data loss. Maybe WHS is the shizniz, you rock for making actual backups, but other than that, your post is totally offtopic in this context and doesn't even begin to solve a problem that Linus was facing with his desktop.
I'm not modding you down, even though I have mod-points, but I'm telling you exactly why I think you shouldn't have posted this. I hope you learned something from it and in the future will implement both backups and RAID when unscheduled downtime is important. Maybe you would even implement a system that works for all relevant OSes in the environment you have to do it for, without relying on a single vendor that offers a closed source product. It's a risk that means you'll have to support their product and licencing and other requirements until the data isn't relevant anymore, even after you have migrated to a competing product.
I was promised a flying car. Where is my flying car?
Modern drives for the last five years at least, have calibration factors for platter/head packs on the EEPROM on the controller board. If you swap boards, the board most likely won't be able to read the data on the disk, since it's not calibrated to the head/platter kit.
I was promised a flying car. Where is my flying car?
Do you work for me, because if you do, you're fired.
Wow. I mean seriously. Wow. You are just hell bent on being a moron. You offered "solution" after "solution" that is no solution at all, and seem to have missed the first line of his post: "I had pushed out _most_ of my pulls today, so realistically I didn't lose a lot of work." I have to assume you don't work in the real world, because if you did you would realize how truly asinine you sound.
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
"Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)" - Linus Torvalds[1]
Pfff... That's soooo last century!
Let me fix that for you, Mr. Torvalds
"Only wimps use tape backup: real men just upload their important stuff on git, and let the rest of the world clone it"
Now that sounds more typical for the current decade.
Oh, and for the MasterCard-Ads like finish:
"For everyone else, there's the NSA."
----
The funniest part is that he is the actual author of the git scm system which served him as backup this time.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
rsync is nothing at all like time machine.
Look a little harder into the versioning aspects of time machine and let me know how to make rsync do that.
And when you've done that, please let me know how to get rsync to manage space and remove the oldest backup first when a full backup cannot complete.
Rack Mounted Server. I just gotta know, why is all this mission-critical operational stuff taking place on a workstation with workstation grade hardware and no backups or raids? Everyone's talking about oh raid at home isn't good, just use backup drives. Look: This is LINUX. If there's need for additional hardware and compile farms, people will probably donate. To have a single SSD failure cause so much calamity for any project, least of all *THE* open source project, is just embarrassing. Worse than swearing at your devs on a mailing list read by the whole world.
Sadly, a Libertarian cannot force his views on another, and freedom cannot spread as does the cancer known as religion.