SSD Failure Temporarily Halts Linux 3.12 Kernel Work

Really? by koan · 2013-09-11 07:52 · Score: 5, Insightful

No backup?

--
"If any question why we died, Tell them because our fathers lied."

Re:Really? by gagol · 2013-09-11 07:56 · Score: 4, Insightful

I found spinning rust to at least give some clues prior to a crash and burn. I would say a single SSD is not ready for anything critical, in my opinion. Worst case scenario, you can always get the platters transferred to a good drive and recover from there (pricey, but cheap if the data is valuable enough).

--
Tomorrow is another day...
Re:Really? by Anonymous CowWord · 2013-09-11 07:59 · Score: 5, Funny

Haven't you heard?
"Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)" - Linus Torvalds[1]
1: https://groups.google.com/forum/#!msg/linux.dev.kernel/2OEgUvDbNbo/bTk-VE1zrnYJ

--

Disclaimer: My opinions are my own and do not, in any way, reflect the opinions of my employer or university.
Re:Really? by Anonymous Coward · 2013-09-11 07:59 · Score: 2, Funny

Ask Obama!
He's got a backup...
Re:Really? by Anonymous Coward · 2013-09-11 08:01 · Score: 5, Informative

No backup?
http://lkml.indiana.edu/hypermail/linux/kernel/1309.1/01690.html
I long ago gave up on doing backups. I have actively moved to a model
where I use replacable machines instead. I've got the stuff I care
about generally on a couple of different machines, and then keys etc
backed up on a separate encrypted USB key.
So it's inconvenient. Mainly from a timing standpoint. But nothing more.
Linus
Re:Really? by SJHillman · 2013-09-11 08:06 · Score: 5, Funny

Maybe Linus doesn't consider Linux to be critical...
Microsoft sure as hell doesn't seem to find Windows to be critical.
Re:Really? by Anonymous Coward · 2013-09-11 08:06 · Score: 5, Insightful

I used to think that too, until I had a mechanical hard drive experience a controller failure without warning. A single drive is not ready for anything critical, regardless of the storage mechanism.
Re:Really? by stewsters · 2013-09-11 08:14 · Score: 4, Funny

Yeah, I wonder if anyone has ever told him about git. Too bad he didn't back it up. Now we will have to start a new Linux kernel.

Sarcasm Intended.
Re:Really? by pubwvj · 2013-09-11 08:23 · Score: 5, Funny

Ah, even Jesus saves. ;-)
Re:Really? by chuckinator · 2013-09-11 08:27 · Score: 4, Interesting

Seconded. I've had a RAID1 mirror on my primary workstation at home for roughly... 4 years. I had one of those "oh, drat, my drive is starting to click, and we all know what that means..." moments and barely had time to back up the /home partition to an external machine while I went hardware shopping. Since that event window closed, that configuration has saved my butt twice. One time, the mirrored pair started to go after kinetic shock from moving to a new residence, and it didn't even stress me out to wait for a new pair from my online vendor of choice. I don't know what happened the second time, but I'm guessing that some bad components on the mobo were dirtying the 5V and 3.3V power rails into the drive connector, because the whole rig decided to go kaput shortly after in a way that forced an upgrade to the latest CPU socket du jour mobo. Thankfully, I was already budgeting for new guts for that rig due to performance demands.
Re:Really? by tlhIngan · 2013-09-11 08:27 · Score: 5, Informative

> I found spinning rust to at least give some clues prior to a crash and burn. I would say a single SSD is not ready for anything critical, in my opinion. Worst case scenario, you can always get the platters transferred to a good drive and recover from there (pricey, but cheap if the data is valuable enough).
Sudden SSD failure is not really the kind of failure you can detect in advance. Good SSDs have tons of metrics available through SMART, including media wear indicators that warn you of wear-out long before it happens.
But when an SSD suddenly dies, it's generally because the controller's FTL tables got corrupted. For high performance drives, it's remarkably easy to do as performance is #1, not data safety. There's nothing wrong with the disk or the electronics.
The FTL (flash translation layer) is what maps a sector the OS uses to the actual flash sector itself. If it gets corrupted, the controller has no way of accessing the right sectors anymore and things go tits up. It's even worse because a lot of metrics are tied to the FTL, including media wear, so losing that data means you can't simply erase and start over - you're completely hooped as the controller cannot access anything.
If you want to think of it another way, treat it like the super block on a filesystem, and the filesystem tables. Now imagine they get corrupt - the data is useless and recovery is difficult, even though the underlying media is perfectly fine. It's possible to hose it so badly that recovery is impossible.
For speed, FTL tables are cached - and modern SSDs can easily have 512MB-1GB of DDR memory just to hold the tables. Of course, you can't write-through changes since the tables themselves need to be wear-levelled on the flash media.
One of the iffiest times for this comes when an SSD is power cycled - pulling the power on an SSD can cause corruption because the tables may be in the middle of an update. But things like firmware bugs and other things can easily corrupt the table as well (think a stray pointer scribbling over the table RAM). A good SSD often has extra capacitance onboard to ensure that on sudden power failure, there is enough backup power to do an emergency commit to flash. This protects against power cycling, but firmware bugs can still destroy the data.
Of course, SSDs without such features mean the firmware has to be extra careful. And sometimes, such precautions can miss a point in time where you cannot pull the power at all.
It's sort of reminiscent of that Seagate failure that resulted in a log file reaching a certain size disabling the drive - the data and media were perfectly fine, it's just that the firmware crapped out.
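To make the FTL point above concrete, here is a minimal toy sketch in Python. It is not how any real controller works: `ToyFTL` and all of its methods are hypothetical names, and there is no wear levelling, garbage collection, or encryption. It only shows the idea that the data pages can be intact while the table that finds them is stale or gone after a power loss.

```python
class ToyFTL:
    """Toy flash translation layer: maps logical sectors to physical pages."""

    def __init__(self, num_pages):
        self.flash = [None] * num_pages  # the physical pages (the "media")
        self.ram_map = {}                # mapping table cached in DRAM
        self.saved_map = {}              # last copy of the table committed to flash
        self.next_free = 0               # next unused physical page

    def write(self, logical, data):
        # Flash is not overwritten in place: every write lands on a fresh
        # page, and only the cached table is updated to point at it.
        page = self.next_free
        self.next_free += 1
        self.flash[page] = data
        self.ram_map[logical] = page

    def read(self, logical):
        page = self.ram_map.get(logical)
        return None if page is None else self.flash[page]

    def commit(self):
        # Persist the cached table (what extra capacitance buys time for).
        self.saved_map = dict(self.ram_map)

    def power_loss(self):
        # DRAM contents vanish; only the committed table survives.
        self.ram_map = dict(self.saved_map)


ftl = ToyFTL(num_pages=8)
ftl.write(0, "old")
ftl.commit()
ftl.write(0, "new")   # table change not yet committed
ftl.write(1, "extra")
ftl.power_loss()

print(ftl.read(0))           # old   (stale mapping)
print(ftl.read(1))           # None  (mapping lost)
print("extra" in ftl.flash)  # True  (the data is still on the media)
```

The last line is the whole story: the page holding "extra" is physically fine, but nothing points at it any more.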
Re:Really? by jimbolauski · 2013-09-11 08:34 · Score: 2, Funny

So you've never had a hard disk controller failure then?

" Worst case scenario, you can always get the platters transfered in a good drive and recover from there"
What makes you think you can't take FLASH devices and access them in a similar way to platters? Just like with platters, you won't be able to access data on any damaged portions but unlike with platters it is unlikely that the platters will trash the read/write heads of the new drive.
I don't know what you're talking about. It's very easy to desolder a couple hundred pins on a board, then install a new chip and resolder the new chip back in. That's just as easy as popping off the back of the HD removing a couple a screws and pulling out the platter.

--
Knowledge = Power
P= W/t
t=Money
Money = Work/Knowledge so the less you know the more you make
Re:Really? by You're All Wrong · 2013-09-11 08:34 · Score: 5, Informative

Are you attempting to claim the prize for the person with the least understanding of the Distributed Source Code Control System in use?

There was absolutely no code on his system that wasn't on between dozens and thousands of other systems depending on its age.

Just read TFA: "I had pushed out _most_ of my pulls today". His "pulls" are code that is *elsewhere*. He's just a conduit (and gatekeeper) between a few dozen elsewheres and a server with a fat pipe. And by the construction of the system, it really shouldn't matter how those pulls are ordered. (If there'll be a merge conflict one way round, there'll be a merge conflict in other permutations too.)

--
Your head of state is a corrupt weasel, I hope you're happy.
Re:Really? by Talderas · 2013-09-11 08:47 · Score: 3, Funny

Apparently Linus.

--
"Lack of speed can be overcome. In the worst case by patience." --Znork
Re:Really? by Guspaz · 2013-09-11 08:51 · Score: 4, Informative

> What makes you think you can't take FLASH devices and access them in a similar way to platters?
Because on most SSDs, the data is encrypted, and on all SSDs, the pages are in an effectively random order. If you've lost the controller, you've lost both the encryption keys and the table that enables a logical platter-style presentation of the pages. No amount of soldering is going to fix those problems.
Re: Really? by MightyYar · 2013-09-11 08:53 · Score: 2

On a drive somewhere. It can be on an NAS or something locally attached.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:Really? by Anonymous Brave Guy · 2013-09-11 08:57 · Score: 2, Insightful

Only wimps use tape backup. Real deities just upload their important stuff on FTP and let the rest of the universe mirror it.

--
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
Re:Really? by Zero__Kelvin · 2013-09-11 09:00 · Score: 2

The new drive has a new controller. Where do you think the controller stores all the data it needs to decrypt? Hint: It is in the FLASH devices. I am not saying this will work 100% of the time, since the damaged part might be the component that stores the needed information, but again, that is no different than a platter scenario. There is a reason why data recovery services don't guarantee success with platter based media.

--
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
Re:Really? by djdanlib · 2013-09-11 09:07 · Score: 2

It would be great if they would mention stability features on the box, or at least in the marketing material. But they don't. It always looks like this: MEMORY! It's quiet! SATA-II maximum bandwidth of 3.0 Gbps! Speed up your desktop! Look at the rebate! Millions of hours MTBF! Low power usage!
Re:Really? by michrech · 2013-09-11 09:11 · Score: 5, Informative

> That's just as easy as popping off the back of the HD removing a couple a screws and pulling out the platter.
> You do that outside of a cleanroom and your data is gone forever.
False -- I've done it on a number of occasions (to drives I didn't care about), and was able to run the drives for months without their covers. I'd still be using the drives if I had need for drives as small as they were (somewhere in the 80GB range)...
Would I use a drive in this state for something critical? No, but saying you immediately lose the data if you pull a drive cover is just flat wrong.

--
bork bork bork!
Re: Really? by Cyberax · 2013-09-11 09:31 · Score: 5, Funny

You've misspelled 'NSA'...
Re:Really? by Anonymous Coward · 2013-09-11 09:34 · Score: 2, Insightful

Microsoft also sure as hell wouldn't have a single hard drive failure interrupt their patch submission process (yes, it is internal but they have a tree of lab builds, team builds, and "winmain" with a well defined RI - reverse integration process for moving patches in) and their build process. Actually - I don't think anyone would allow a single drive failure to do this. It seems, well, stupid. What was Linus smoking?
Re:Really? by Luckyo · 2013-09-11 10:07 · Score: 2

Urban legend. The clean environment inside drives, and in the lab that does data extraction from a damaged drive, is there to maximize performance and the chance of recovery.
The hard drive itself can run just fine in a dirty environment for a while. It will wear down much faster as it's not designed for such operation, but it will very likely remain operational for weeks at the very least.
Re:Really? by Guspaz · 2013-09-11 10:52 · Score: 2

Perhaps you're unaware that most modern SSDs these days do controller-level AES encryption of all data? Intel's drives do (as do any others based on Sandforce controllers), Samsung's newer ones do, Crucial's newer ones do... and the keys are stored in the controller, not the NAND.
It's kind of odd for you to say I'm on drugs for saying things that are on the spec sheets of the drives themselves...
Re:Really? by Miamicanes · 2013-09-11 10:55 · Score: 2

> What makes you think you can't take FLASH devices and access them in a similar way to platters?
Sandforce controllers enforce mandatory AES encryption that can't be disabled, using a key that can't be recovered or set to a known value. So if your controller decides to quit allowing you to access your data, unsoldering the chips won't do you any good, because the values you read from them might as well be random noise.
Re:Really? by Guspaz · 2013-09-11 12:18 · Score: 2

In the case of most drives, the key they ship with is randomly generated at the factory, unless you enable ATA passwords in your BIOS, which will prompt a new key to be generated, secured by that password. This is the typical behaviour; encrypt everything by default using the built-in key, and most support various external interfaces for securing that. Some even support eDrive, which integrates with BitLocker. Anandtech has a nice article about that:
http://www.anandtech.com/show/6891/hardware-accelerated-bitlocker-encryption-microsoft-windows-8-edrive-investigated-with-crucial-m500
Re:Really? by bmo · 2013-09-11 12:26 · Score: 2

And not only that, but the only "cleanroom" you need is a box with a lid, gloves, and a filtration system you can sometimes pull from a dead vacuum cleaner (the kind with a hepa filter).
If you're handy, you can build one of these for under 50 bux.
--
BMO
Re:Really? by citizenr · 2013-09-11 13:53 · Score: 2

Yes, the mythical read-only mode that NO ONE could trigger while doing wear leveling tests on SSDs - they ALWAYS DIE suddenly and without warning.

--
Who logs in to gdm? Not I, said the duck.
Re:Really? by gagol · 2013-09-11 20:10 · Score: 5, Informative

This is more like an MS employee workstation crash. The Linux infrastructure is not hosted on Linus's home machines, and it is replicated around the world. I was simply pointing out my favorable opinion of slow spinning disks... not blaming Linus or whatever, shit happens.

--
Tomorrow is another day...

Eggs, Basket by Sneakernets · 2013-09-11 07:55 · Score: 5, Funny

That's all that Ballmer needs to stop Linux? Just find Torvalds's SSD?

--
"No freeman shall ever be debarred the use of arms." -- Thomas Jefferson

Re:Eggs, Basket by CastrTroy · 2013-09-11 07:58 · Score: 2, Insightful

Makes me wonder what would happen to Linux development if Torvalds was to get hit by a bus, or be incapacitated in some way. Is kernel development that reliant on one person that a single laptop breaking brings everything to a halt?

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:Eggs, Basket by bobbied · 2013-09-11 08:20 · Score: 2

Who needs that? You can always take the last source release and start your own build.
This guy only controls the Linux Kernel by convention (and because it is convenient). Anytime he is unable or unwilling to keep the kernel development going, any number of others can step up and take over.
It will be interesting to watch when it happens though. I suspect that unless Torvalds appoints a successor and willingly hands over the keys the Linux Kernel will fracture into 3 or 4 major branches. Even if he does appoint someone or some organization to take over there is a risk of the kernel fracturing into multiple efforts.

--
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
Re:Eggs, Basket by You're All Wrong · 2013-09-11 08:41 · Score: 2

His laptop breaking brought about 0.0001% of the actual work on Linux to a halt, if that. Every Linux developer continued developing as normal. Every code reviewer continued reviewing code as normal. Every subsystem maintainer kept maintaining their subsystem as normal. Every automatic build-test robot kept automatically doing build tests as normal. People who desperately needed the patches that Linus was going to push out, if they really were that desperate, would have just pulled them from linux-next, or the relevant subsystem maintainer's tree, or, *most likely*, would already have them!

--
Your head of state is a corrupt weasel, I hope you're happy.

Next project - backups! by ruiner13 · 2013-09-11 07:55 · Score: 2

Maybe Linus needs to create a backup program like he did when he wanted a better version control system and created git? Also, why is the only copy of the changes on his local workstation and not a server with redundancy? This seems rather amateurish.

--

today is spelling optional day.

Re:Next project - backups! by geek · 2013-09-11 08:03 · Score: 5, Funny

Allow me to channel Linus Torvalds a minute:
"What do you mean there wasn't a backup disk? Fucking kill yourself with a pipe wrench. I hate you, your mother was a whore and your dad was the neighbors dog. People like you make me sick."
Re:Next project - backups! by PRMan · 2013-09-11 08:34 · Score: 4, Insightful

It's comments like these that make me wish Slashdot mods could go to 10 instead of 5. Nicely done.

--
Peter predicted that you would "deliberately forget" creation 2000 years ago...

Linus said something... by IMarvinTPA · 2013-09-11 07:57 · Score: 3, Interesting

Linus said "So I don't want to necessarily blame the harddisk, since it's just ten
days since I upgraded the rest of my machine, after it worked years in
the previous one. That just makes me go "hmm". As far as I know, all
the fans etc were working fine, but.."

There's his problem: "after it worked years in the previous [machine]."

His SSD died a natural death of old age.

IMarv

--
Trusting software vendors is no smarter than trus

Re:Linus said something... by kwalker · 2013-09-11 08:27 · Score: 2

That's not how drives die of old age. A sudden and permanent drive failure like what is described is almost always a controller failure. When mechanical drives die of old age, they generally develop bad sectors and read-errors accumulate on the platter, but you can still read from the un-damaged areas. When SSDs die, those worn-out sectors go read-only or begin throwing similar read/write errors, depending on the firmware.
After having a 40GB IBM Deathstar suddenly go down in flames, and dozens of "salvage my data!" calls from friends and family, I don't trust any single drive of any age or provenance. ALWAYS have backups.

--
... And so it comes to this.
Re:Linus said something... by citizenr · 2013-09-11 13:59 · Score: 2

> His SSD died a natural death of old age.
> IMarv
There is NOTHING natural about a drive that disappears without notice, along with all of your data.

--
Who logs in to gdm? Not I, said the duck.
Re:Linus said something... by greg1104 · 2013-09-12 04:27 · Score: 2

That SSDs go read-only as they wear down is a myth. I've never seen a single credible report of it happening. If you read real wear-down tests, what actually happens is that the drives stop retaining data when powered off as they get very old. Not a single drive tested there failed gracefully at the end.

No RAID? No backup? by Nick · 2013-09-11 08:00 · Score: 2, Funny

Was he too busy treating people horribly to audit his DR procedures?

--
Fuck Ajit Pai

Re:No RAID? No backup? by samjam · 2013-09-11 08:30 · Score: 4, Funny

His SSD gave up out of shame for all the threats and abuse it had been forced to witness

--
blog.sam.liddicott.com

why this news? by Laxori666 · 2013-09-11 08:04 · Score: 4, Insightful

Why is this news... is this our version of People magazine, where instead of hearing about all the details of the Kardashians' lives, we hear about every email or event that happens to Linus?

Re:why this news? by Princeofcups · 2013-09-11 10:26 · Score: 2

> Why is this news... is this our version of People magazine, where instead of hearing about all the details of the Kardashians' lives, we hear about every email or event that happens to Linus?
It shows that the best or at least most respected in the business can still be stupid when it comes to simple things like backups. Seriously, there is no reason in this day or age to lose more than a couple of transactions if you are careful. Someone kick Linus in the ass for being so sloppy and lazy.

--
The only thing worse than a Democrat is a Republican.

BREAKING: Development was also held up.... by musth · 2013-09-11 08:13 · Score: 2

...for over an hour when Torvalds had to make an emergency run to Albertson's for some toilet paper and hostility medication.

Welcome to how SSDs fail. by Mike_EE_U_of_I · 2013-09-11 08:13 · Score: 5, Interesting

I've owned several hundred hard drives over the last 30 years. I've never had an active hard drive just blank out. I have had drives that had not been powered for a couple of years refuse to ever come back. But if I did not feel the need to even power the thing on for years, you can imagine how little I cared for what was on it.

In the last four years, I've owned around 20 SSDs. I've had five failures. In every single one, the drive just instantly lost everything. Amazingly, in four of the five cases, the drive still worked fine! It had simply lost all the data on it and believed itself to be a blank drive.

That said, the speed of SSDs makes them worth the risk to me. But I take backups far more seriously than I used to. I need them far more often.

Re:Welcome to how SSDs fail. by RichMan · 2013-09-11 08:25 · Score: 3, Informative

A hard shutdown of a high-speed SSD is death. It takes really, really good firmware to recover without reinitializing the drive.
The basic SSD "format" is susceptible to damage on power failure in a way that hard drives are not. The mapping and setup tables of the SSD are critical and constantly in flux, unlike a hard drive, where the mapping is only updated when a failure occurs.
SSDs need internal power-fail control so they can shut down gracefully, and firmware that supports it.
Re:Welcome to how SSDs fail. by Anonymous Coward · 2013-09-11 08:45 · Score: 2, Interesting

My Crucial SSD has an issue where it craps out if power is unexpectedly removed. I discovered it has an undocumented "repair mode" where you connect it to power but not SATA for about twenty minutes and it repairs itself.
Still scares the crap out of me every time it happens, but I back up important stuff regularly; it's just there for the speed.
Re:Welcome to how SSDs fail. by Silvrmane · 2013-09-11 09:24 · Score: 2

This isn't likely to happen on a laptop. They sort of have a built-in battery backup. :)

--
planet texture maps and more
Re:Welcome to how SSDs fail. by Dracos · 2013-09-11 12:56 · Score: 2

This describes several of the reasons why I will not buy an SSD any time in the near future. Sketchy reliability, indeterminate longevity, inexplicable data loss. Mirroring a turd just means you have multiple turds. I have a few 10+ year old DeskStar drives that I still use and that have never given me problems.

Re:Someone flame him... by sjames · 2013-09-11 08:23 · Score: 5, Insightful

He has backups all over the world. But like with any backup, you can't actually restore from it until you replace the failed disk.

Re:None of that mattered, because by Zero__Kelvin · 2013-09-11 08:28 · Score: 4, Informative

That is correct. In fact he wrote the code that is the industry standard and uses it every day. How else do you think he is going to continue completion of the project on his laptop?

--
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun

Kernel Panic!!!! by Cmdrx · 2013-09-11 08:28 · Score: 4, Funny

Now there's a new meaning for Kernel Panic!

--
I could write something witty for my sig, but instead wrote this...

Re:Pathetic by Viol8 · 2013-09-11 08:29 · Score: 2

You beat me to it. Anyone with a vague clue about unix would have thought of that. Obviously vague clues are a rare thing for the parent poster.

Re:Intel? by stkris · 2013-09-11 08:29 · Score: 3, Informative

More info here: http://goran.krampe.se/2013/01/02/ssd-nightmare/
"So power cycling can apparently trigger this - and the disk for some odd reason (self protection?) decides to decapitate itself and set accessible cylinders down to 16 instead of 16384."

Re:Pathetic by TheBig1 · 2013-09-11 08:35 · Score: 2

backintime. I am using it, works great, and the restore functions are quite easy to use as well.

Backups are your friend by AaronW · 2013-09-11 08:41 · Score: 2

I learned long ago after some close calls to back everything up. In my case for my desktop I store my data on an XFS partition stored on a RAID 5 hard drive array. I also am using Crashplan to back up all of my data, both to a removable hard drive and to the cloud, with over 3TB of data backed up. The nice thing about Crashplan is that it continually backs up, taking periodic snapshots so I can restore a previous version of a file if I wish. The main drawbacks of Crashplan are that it runs on Java and can be a memory pig. I pay $6/month for unlimited backup of up to 10 machines and have several computers backed up with them now. With the proper settings on my router I don't even notice all the backup traffic running in the background.

Since I have had sudden SSD failures in the past I also dump my root XFS filesystem weekly onto my RAID array (it takes under a minute to run xfsdump) and run incremental backups nightly, and those dumps get backed up to the cloud as well.

I have found the XFS tools to be quite good at recovery when things go really bad. When running software RAID 1 I had problems where drives would drop out of the array for apparently no reason and I have had several occasions where while rebuilding the other drive would pop out of the array. Switching to an Areca hardware raid controller with battery backed DRAM ended those problems (besides seeing a big performance improvement).

I have found the RAID controller to work well when drive failure occurs and it even recovered after human error (I accidentally disconnected one of the active drives while it was rebuilding and reconnected it).

I won't use btrfs yet. The last time I tried it about 6 months ago it was quite slow and I have a lot of concerns about the storage filling up due to COW that have not been adequately addressed as far as I could tell. I tried setting it up for a Cyrus IMAP server on an Intel SSD and it was unusably slow just untarring all the files so I ended up going back to XFS.

SSDs are still relatively new. I have had issues with some firmware versions and had one fail catastrophically after only 2 weeks of use. I have also had compact flash and SD devices suddenly fail. My experience is that usually mechanical hard drives give some warning (i.e. SMART) and they tend to last years. I have a server I just retired where the hard drive had 10 years on the clock according to SMART.

--
This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.

Re:You trust Torvalds after this? by hawguy · 2013-09-11 08:44 · Score: 2

I don't feel anything but shame for someone losing data in a hard drive crash who has or should have network backups available to them. If this happened to anyone but Linus the majority of the comments would be calling the coder a n00b. If it was Ballmer there would be an absolute riot of anti-MS venom....

I guess the great Linus has fallen into shadow.

As someone who's taken over server administration from very talented developers a number of times, I've found that being a great developer doesn't mean that you're a great sysadmin. Developers may understand conceptually that RAID and backups are important (but sometimes think that RAID is a backup), but that doesn't mean that they actually set them up.

RAID by Larry_Dillon · 2013-09-11 08:57 · Score: 5, Interesting

I'm not nearly as much of a believer in RAID for the home environment. If you (accidentally) delete something on one drive it's gone from both. Better to buy two drives and do a daily rsync. That way you have a window of opportunity to recover data. Personally, I use rsync without --delete until the 2nd drive starts getting full, then I use the --delete flag to clean up.
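A rough sketch of that routine in Python, for anyone who wants to script it. The 90% threshold, the function names, and the example paths are all assumptions made for illustration; only `rsync -a` and `--delete` are real rsync options. The script only builds the argument list, so you can inspect it before handing it to `subprocess.run`.

```python
import shutil


def used_fraction(path):
    """Return how full the filesystem holding `path` is, from 0.0 to 1.0."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def rsync_args(source, dest, fraction_used, threshold=0.9):
    """Build a daily rsync command; add --delete only once dest is nearly full."""
    args = ["rsync", "-a"]
    if fraction_used >= threshold:
        # Second drive is filling up: let rsync remove files that no
        # longer exist on the source, which ends the recovery window.
        args.append("--delete")
    args += [source, dest]
    return args


# Example with made-up paths and a made-up fill level of 50%:
print(rsync_args("/home/", "/mnt/second/home/", 0.5))
```

In real use you would pass `used_fraction("/mnt/second")` as the third argument. Keeping the decision in a pure function makes it easy to check what will run before anything gets deleted.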

--
Competition Good, Monopoly Bad.

Re:RAID by michrech · 2013-09-11 09:09 · Score: 2, Interesting

I know I'll probably see negative moderation as a result of what I'm about to post (being as I'm about to talk up WHS2011 in a Linux related thread), however...
I stopped using RAID in any of my systems after I started using WHSv1. WHS2011 has the same feature -- live system backups. If a drive fails, I pop in a new one (of any type/size), boot a CD that came with WHS (essentially a WinPE environment with recovery software baked in), select my backup (I save 7-10 days -- I forget what it's set to), and in about an hour my system is back to the state of the last backup. WHS is set to perform the system backups between 00:00 - 02:00 every night. The very first system backup is a 'full' backup, the rest are 'diffs'. I've had to use this feature on two of my systems, so far, and both were because of crappy WD drives (OOOOHhhh, I hate that brand soooo much). It came in really handy when I switched both my primary desktop and my laptop from mechanical HDDs to SSDs. I forced a backup, swapped the drives, and then restored...
This way, either my WHS storage pool (based on StableBit's DrivePool product) or my workstation HDD's can fail, and I can easily recover. It's automagic, manageable via a single UI, and because of DrivePool, I can easily increase the storage space at any time (without interrupting other users of the storage pool). /me puts on his asbestos underpants

--
bork bork bork!
Re:RAID by Blackknight · 2013-09-11 09:11 · Score: 2

RAID 1 with a nightly rsync to an off-site server has worked for me for several years now. The remote server runs zfs so I also take weekly snapshots in case I need to restore something older than last night.
Re:RAID by jekewa · 2013-09-11 09:51 · Score: 2

Accidental deletion is a whole different beast. If you accidentally delete something created between rsync copies it's gone for good, too, and rsync can't save you.
Unless your tool does some incremental storage for you. For example, Eclipse saves each save in a local history, including deletions, so you can go back in time even if all you did is change the file (which would also have "not there" impact between rsync copies).
If you need that kind of assurance, you'll need more than rsync or RAID.

--
End the FUD
Re:RAID by Trogre · 2013-09-11 10:09 · Score: 5, Informative

You guys should really look at the --backup and --backup-dir options in rsync.
I use them in conjunction with --delete to always have a "current" copy of the data, along with any old files (i.e. ones that have been updated or deleted) in a separate backup folder, named after the current day of the month.
That way you get a directory structure as follows:
01
02
03
04 ...
31
Current
You can restore the up-to-date set from Current at any time, and if you want to retrieve a file you deleted or over-wrote five days ago, go look in folder 06.
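For the curious, the layout above can be produced with something like the following Python sketch. The function name and the `/home/me` and `/mnt/backup` paths are made up for illustration; the rsync options used (`-a`, `--delete`, `--backup`, `--backup-dir`) are the real ones. It only builds the command, so nothing is copied or deleted until you choose to run it.

```python
from datetime import date
from pathlib import PurePosixPath


def build_rsync_command(source, backup_root, today=None):
    """Build an rsync command that mirrors into Current/ and moves
    replaced or deleted files into a folder named after the day of the month."""
    today = today or date.today()
    root = PurePosixPath(backup_root)
    current = root / "Current"
    dated = root / f"{today.day:02d}"  # 01 .. 31
    return [
        "rsync",
        "-a",
        "--delete",                # keep Current/ an exact mirror of the source
        "--backup",                # keep old versions instead of discarding them
        f"--backup-dir={dated}",   # ...and put them in the dated folder
        f"{source.rstrip('/')}/",  # trailing slash: copy the contents of source
        f"{current}/",
    ]


cmd = build_rsync_command("/home/me", "/mnt/backup", date(2013, 9, 6))
print(" ".join(cmd))
# rsync -a --delete --backup --backup-dir=/mnt/backup/06 /home/me/ /mnt/backup/Current/
```

Pass the list to `subprocess.run(cmd, check=True)` to actually run it. Since day-of-month folders repeat, you will probably want to empty the dated folder before each run, or last month's leftovers get mixed in with this month's.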

--
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
Re:RAID by Nevo · 2013-09-11 10:16 · Score: 2

WHS = Windows Home Server
Re:RAID by Miamicanes · 2013-09-11 10:50 · Score: 5, Interesting

The thing that really sucks about SSDs (at least, Sandforce-based drives) is the fact that 99% of their failures are due to firmware bugs that can be simultaneously triggered on an entire array at once (especially the sleep-related bugs). It's a mode of failure the creators of RAID 1, 5, and 10 never anticipated.
IMHO, the worst thing about SSDs (at least, those with Sandforce controllers) is the fact that they have mandatory full-drive encryption that can't be disabled, using a key you aren't allowed to set or recover, and gets blown away whenever you reflash the firmware. This means, among other things, if the drive's controller gets itself confused:
* You can't reflash data-recovery firmware onto the drive. The act of flashing it would blow away the encryption key and render the data gone forever.
* If the drive decides you're trying "too hard" to systematically extract data from it while it's in a confused state, it'll go into "panic mode" by blowing away the encryption key. If this happens, your data is gone forever AND you have to send the drive back to OCZ or whomever you got it from in order to get it unlocked. For your protection, of course. And Hollywood's. Among other things, dd_rescue/ddrecover can trigger panic mode.
* You can't even do the equivalent of removing the platters from a conventional drive in a clean room and mounting them in another drive for reading, because the data on the flash chips is all encrypted, and the key is unrecoverable.
This is BULLSHIT, and it's why I refuse to buy any more SSDs. I, as an end user, should be able to download a utility from somewhere, reflash the drive to firmware that includes an offline recovery mode that simply dumps the flash chip content from start to finish, and either disable the encryption or set it to a key *I* control, so the 99.99999% of the data on the drive that's good when the embedded firmware freaks out can be dumped and recovered offline.
If there's a God, Linus will go NUCLEAR over this, get a few seconds on CNN & other networks to rant about the unreliability of SSDs, and scare enough consumers to hit the industry HARD where it'll hurt the most... their bank accounts.
It might not be possible to make SSDs reliable, but DAMMIT, they should at least be RECOVERABLE. There were goddamn hard drives with recoverable data pulled out of laptops left in safes in the Vistamark hotel when a tower sheared it in half and buried it under flaming rubble, yet an SSD that dies if you so much as look at it the wrong way due to firmware bugs ends up being fundamentally unrecoverable for no hard technical reason.
And yes, I'm bitter about having my hard drive commit suicide for no reason besides Sandforce Business Policy. As long as they keep making controllers that cause drives to self-destruct at the drop of a hat, I'll keep doing my best to talk people out of buying drives tainted by their controller chips. Sandforce sucks.
Re:RAID by Solandri · 2013-09-11 11:39 · Score: 5, Informative

I stopped using RAID in any of my systems after I started using WHSv1. WHS2011 has the same feature -- live system backups. If a drive fails, I pop in a new one (of any type/size), boot a CD that came with WHS (essentially a WinPE environment with a recovery software baked in), select my backup (I save 7-10 days -- I forget what it's set to), and in about an hour my system is back to the state of the last backup.
There's the operative phrase. RAID is for systems where you can't have or don't want an hour of downtime while restoring from a backup. The R in RAID stands for redundant. As in you can have a failure and keep going.

Note that this is the converse of "RAID is not a backup!" Just like RAID is not a replacement for a backup, a backup is not a replacement for RAID either. They do different things (and if you're smart, you will also back up your RAID). From your own description, you wanted a backup. RAID was never the correct solution for your needs.
Re:RAID by GigaplexNZ · 2013-09-11 13:27 · Score: 2, Insightful

That's just asinine. You should never rely on recovery of data from a broken drive to avoid data loss. Even if you do recover data from a broken HDD you shouldn't trust it hasn't had some form of corruption. Always have a backup. If you have backups, who cares if the drive is recoverable?

Also, don't buy Sandforce SSDs. There are plenty of alternatives that are faster and more reliable.
Re:RAID by drinkypoo · 2013-09-11 15:06 · Score: 3, Insightful

You're right in that you should never rely blah blah blah, but he's right in that you should be able to attempt recovery. And he's more right, because he never said you shouldn't make backups.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:RAID by fnj · 2013-09-11 17:04 · Score: 3, Informative

Why not do it right?
Re:RAID by bemymonkey · 2013-09-11 19:19 · Score: 4, Insightful

So... stay the fuck away from Sandforce controllers? This has been common knowledge for years...
Re:RAID by QBasicer · 2013-09-12 01:59 · Score: 2

I have two Linux machines and two NASes.
The first Linux machine, my laptop, rsyncs itself to the other Linux machine and to a QNAP NAS that's in RAID5.
The second Linux machine (desktop) backs itself up to the QNAP as well.
The DNS323 gets backed up to the QNAP NAS and to the desktop Linux machine.
The QNAP NAS gets backed up once a quarter to an offsite location.
I figure in my plan, I have enough redundancy and backup that I can recover from most failures.

--
x86, oh yes, I'm pro.

And that's why.... by stox · 2013-09-11 09:03 · Score: 2

I have a mirrored set of SSDs on all my important machines, and RAID 6 for bulk storage.

Unlike Linus, I can't afford to lose work.

--
"To those who are overly cautious, everything is impossible. "

Good ol' Linus and his aversion to backups by Anonymous Coward · 2013-09-11 09:18 · Score: 2, Insightful

According to a speech of his, that's how Linux got started. He accidentally wiped his MINIX partition.

Re:You trust Torvalds after this? by hawguy · 2013-09-11 09:23 · Score: 3, Insightful

As someone who's taken over server administration from very talented developers a number of times, I've found that being a great developer doesn't mean that you're a great sysadmin. Developers may understand conceptually that RAID and backups are important (but sometimes think that RAID is a backup), but that doesn't mean that they actually set them up.

And as a sysadmin, I'm tired of hearing that. RAID1,5,6,10,Z is a backup. It's not an archive. An archive is what you go to when you want the old version. A backup is generally one of two things:
1) Something that lets you keep chugging through a failure (raid5, a backup generator with automatic cut-over, etc)
2) A standby spare (tape, NAS/usb drive, secondary location with desks/computers/etc.)

RAID (other than 0) is absolutely a backup. It's not the perfect backup but it is a backup. What it is NOT is an archive - last night's/week's/month's/quarter's data.

No, RAID is *not* a backup, RAID's only purpose is to improve reliability/uptime by letting you ride through hardware failures, but it does nothing to protect you from all of the rest of the things that can destroy your data, like file corruption, fat fingering a "rm -rf / home/someuser", a virus, a website hack attack, etc. That's what your backups are for, but you can call them archives if you like, but don't call RAID a "backup" because it's not. Depending on what the problem is and when you discover it, you may need to go back through several archives before you find the data you're looking for.

Re:None of that mattered, because by Zero__Kelvin · 2013-09-11 09:24 · Score: 4, Insightful

". Now people have to redo a lot of effort, because he was too lazy or arrogant to install one of the many effortless backup systems available."

That is a ridiculous statement. Work is lost every time a drive fails unless it happens to fail immediately after a backup. Full backups take lots of time. If you understood git better you would realize that a lot less work is lost the git way than with old school backups. I'm sure that every time Linus does a successful merge he pushes it to a git repo elsewhere. All history is in the git logs. I am certain the work he lost is minimal, and is much less than if he was relying on nightly backups and the failure happened near the end of the work day. Just the effort of trying to determine what was done and what has been lost would be far more time consuming without git.

--
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun

Reduce flash rewrite wear with noatime by redelm · 2013-09-11 09:27 · Score: 2

This might be [electrolytic] capacitor or some other component-level magic-smoke release. There is also the dreaded, much-discussed "wear" from re-writing flash memory -- worse than you think because blocks of 64 KB [typically] have to be erased and re-written to change any byte therein.

Linus, of all people, ought to know his kernel has options to minimize the re-writes, many of them developed to optimize laptops (like delaying writes). Another thing is to mount partitions (/etc/fstab anyone?) with `noatime` as an option (maybe `nodiratime` too). Un*x and other Linux-like systems by default will re-write the access time for any disk inode read. Turning it off reduces disk write load (and seeks on slow disks). I've had it off for over ten years and not noticed any malperformance, although there are rumored to be some, somewhere.
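
For the fstab part, a rough sketch of what the edit amounts to: the mount options are the fourth whitespace-separated field of each /etc/fstab entry, and `noatime` is simply appended to that comma-separated list. The helper name and the sample device below are made up for illustration, and this only transforms a line of text, it does not touch the real file:

```python
def add_mount_option(fstab_line, option="noatime"):
    """Return the fstab entry with `option` added to its options field.

    Comments, blank lines and malformed entries are returned unchanged.
    Fields in the result are re-joined with tabs.
    """
    stripped = fstab_line.strip()
    if not stripped or stripped.startswith("#"):
        return fstab_line
    fields = stripped.split()
    if len(fields) < 4:
        return fstab_line  # not a full entry, leave it alone
    options = fields[3].split(",")
    if option not in options:
        options.append(option)
    fields[3] = ",".join(options)
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical entry; the device name is an assumption.
    print(add_mount_option("/dev/sda1 / ext4 defaults 0 1"))
```

The new option takes effect on the next mount (or a remount) of that filesystem.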

Re:Really? Naa by Psyko · 2013-09-11 10:07 · Score: 5, Interesting

trying to desolder 100 pins spaced 0.01" apart then resoldering them, unless you have a 0.1 mill precision soldering robot it is impossible, you can't even buy wire thin enough to do it by hand.

SMT rework by hand isn't rocket science, but takes more tools than the average garage has.

Desoldering you use a custom tip for that socket/package type (one tip per package & they're not cheap). It's essentially a metal ring that heats the solder on all the pins at once. In the center of the assembly is a vacuum probe. You heat all the pins, melting all the solder & hit the button on the handpiece to suction the chip up off the board. Then clean up the pads on the board. Careful with the heat because you don't want to lift pads off the board; if you do then you have to either fix them, or make new pads. And then if you manage to trash a via (conductivity path to a different board layer), then you've got to drill out a new one and you have to use an ESD-safe conductive drill with a resistance cutoff. You put a clip from the drill in contact with the layer you're trying to get to, drill down and when the drill tip makes contact with that layer the drill turns off because the circuit is complete. But it still sucks and if you don't know how all the board layers are put together you may end up trashing a trace a couple layers into the board and wrecking the whole thing.

Soldering it down you do this. Align all the chip legs on the pads. Then you can either run a small bead of solder paste across all the pins or use a wave soldering tip (small cup, uses surface tension to hold the solder in place) and drag the tip over all the pins. Heat on the pin & pad draws the solder down into the joints. If you put too much solder you might have to vac it back up and redo it if you've made bridges etc. Alignment is key, and keeping the part in position is key. I used to try and avoid using glue underneath because that made it difficult to get it back off if you needed to down the road.

Doing hand rework on that kind of stuff the hardest thing for me was dealing with smt chip caps, little bastards will crack if you heat em too fast, so you have to get a temp regulated hot plate, heat em up slow, then pick and place em quick with tweezers/needlenose & solder em down quick.

--
01:36AM up 426 days, 2:46, 1 user, load average: 0.14, 0.11, 0.05

RAID != Operating System by dutchwhizzman · 2013-09-11 10:17 · Score: 5, Interesting

You have a software feature in a server OS that supports certain client OSes to do backups to the server. RAID may be a software feature, but even if it's "software raid", you often have BIOS bootable raids that even work with one of the drives missing. This essentially means that you can work OS agnostic on a lower level than "I have a backup system that works". For Linux, you can have a backup system too that will restore from a LiveCD/USB stick and stores on a remote server. The same amount of time roughly will be needed to backup and restore, differential, incremental, full backups, the works. The solution you are providing is really nothing comparable to RAID. It's fundamentally different because it works on a totally different layer, doesn't prevent downtime and it's not OS agnostic. RAID should prevent downtime, making working backups should prevent data loss. Maybe WHS is the shizniz, you rock for making actual backups, but other than that, your post is totally offtopic in this context and doesn't even begin to solve a problem that Linus was facing with his desktop.

I'm not modding you down, even though I have mod-points, but I'm telling you exactly why I think you shouldn't have posted this. I hope you learned something from it and in the future will implement both backups and RAID when unscheduled downtime is important. Maybe you would even implement a system that works for all relevant OSes in the environment you have to do it for, without relying on a single vendor that offers a closed source product. It's a risk that means you'll have to support their product and licencing and other requirements until the data isn't relevant anymore, even after you have migrated to a competing product.

--
I was promised a flying car. Where is my flying car?

Re:RAID != Operating System by AmiMoJo · 2013-09-12 00:17 · Score: 2

RAID is not at all comparable or better than a network backup. It's not even a backup. If your PSU dies or there is a power surge you could easily lose both drives. If something hoses the filesystem you can't just roll back to yesterday's image.
A network backup onto a separate machine, preferably protected with a UPS, is far more secure and likely to save you from a variety of hard and soft disasters.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC

Will never work with modern drives by dutchwhizzman · 2013-09-11 10:20 · Score: 5, Informative

Modern drives for the last five years at least, have calibration factors for platter/head packs on the EEPROM on the controller board. If you swap boards, the board most likely won't be able to read the data on the disk, since it's not calibrated to the head/platter kit.

--
I was promised a flying car. Where is my flying car?

Re:Will never work with modern drives by mysidia · 2013-09-11 15:04 · Score: 2

Modern drives for the last five years at least, have calibration factors for platter/head packs on the EEPROM on the controller board.
If it's an EEPROM chip, then that means... in principle, you could capture a dump of the EEPROM content from the old board, and then erase the EEPROM on the replacement board, and use a programmer to reload the old content.
Re:Will never work with modern drives by GuB-42 · 2013-09-12 01:17 · Score: 2

If the EEPROM is not damaged and is a discrete component (usually an 8-pin SOIC) you can unsolder it from the old board and put it on the new board. It's not that difficult.
Re:Will never work with modern drives by petermgreen · 2013-09-12 10:08 · Score: 2

Which is why you take the eeprom off the original controller board and put it on the new board. In my experience the EEPROM is in an 8 pin package with a 1.27mm pin spacing (either a SOIC or a similar sized leadless package). Pretty easy to pop off with hot air (and yes I have done this several times).

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register

Re:None of that mattered, because by Zero__Kelvin · 2013-09-11 12:15 · Score: 2

"I don't think "running a backup application" is part of "software development""

Do you work for me, because if you do, you're fired.

"The man is clearly not as infallible as you seem to think he is. His mistake now requires work from other people to clean up."

Wow. I mean seriously. Wow. You are just hell bent on being a moron. You offered "solution" after "solution" that is no solution at all, and seem to have missed the first line of his post: "I had pushed out _most_ of my pulls today, so realistically I didn't lose a lot of work." I have to assume you don't work in the real world, because if you did you would realize how truly asinine you sound.

--
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun

Modernization needed by DrYak · 2013-09-11 22:27 · Score: 2

"Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)" - Linus Torvalds[1]

Pfff... That's soooo last century!

Let me fix that for you, Mr. Torvalds
"Only wimps use tape backup: real men just upload their important stuff on git, and let the rest of the world clone it"
Now that sounds more typical for the current decade.

Oh, and for the MasterCard-Ads like finish:
"For everyone else, there's the NSA."

----

The funniest part is that he is the actual author of the git scm system which served him as backup this time.

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]

Re:Pathetic by Builder · 2013-09-11 22:28 · Score: 2

rsync is nothing at all like time machine.

Look a little harder into the versioning aspects of time machine and let me know how to make rsync do that.

And when you've done that, please let me know how to get rsync to manage space and remove the oldest backup first when a full backup cannot complete.
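
For what it's worth, rsync itself doesn't do this, so the space management would have to live in a wrapper script around it. A bare-bones sketch of the "oldest first" selection only; the function name and the (name, size) tuples are assumptions, and it treats freed space as simply additive, which won't hold for hard-linked snapshots:

```python
def backups_to_prune(backups, free_bytes, needed_bytes):
    """Pick which old backups to delete so the next backup fits.

    backups: list of (name, size_bytes) tuples, oldest first.
    Returns the names to delete, oldest first. Raises ValueError if
    deleting every backup would still not free enough space.
    """
    to_delete = []
    for name, size in backups:
        if free_bytes >= needed_bytes:
            break
        to_delete.append(name)   # oldest remaining backup goes next
        free_bytes += size
    if free_bytes < needed_bytes:
        raise ValueError("not enough space even after pruning everything")
    return to_delete


if __name__ == "__main__":
    # Made-up sizes, for illustration only.
    print(backups_to_prune([("01", 10), ("02", 20), ("03", 30)], 5, 30))
```

The caller would then remove those directories before starting rsync. Whether that gets you anywhere near Time Machine's versioning is a separate question.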

Three Words to choke upon by Gallomimia · 2013-09-12 04:45 · Score: 2

Rack Mounted Server. I just gotta know, why is all this mission-critical operational stuff taking place on a workstation with workstation grade hardware and no backups or raids? Everyone's talking about oh raid at home isn't good, just use backup drives. Look: This is LINUX. If there's need for additional hardware and compile farms, people will probably donate. To have a single SSD failure cause so much calamity for any project, least of all *THE* open source project, is just embarrassing. Worse than swearing at your devs on a mailing list read by the whole world.

--
Sadly, a Libertarian cannot force his views on another, and freedom cannot spread as does the cancer known as religion.

Slashdot Mirror
