Data Recovery from ReiserFS RAID Array?
Ruatha asks: "We've recently had a problem with a ReiserFS RAID-5 array - two of the disks failed and, of course, some of the people using the array didn't have backups of their data...Ontrack have returned the disks because they can do nothing with them due to the FS we used on the array. Does anyone know of a company that can deal with data recovery from a ReiserFS RAID-5 array?"
Simultaneous disk failure is about as rare as winning the lotto while simultaneously having you and your friend on the other side of the planet get struck by lightning - unless there's a common larger problem (e.g. power surge to both drives or something).
As a practical example, 4 or 5 years ago I had a large number of disks attached to some large Oracle servers, on the order of 600 hard drives in several arrays taking up several racks, all the same manufacturer and model, with a handful of groupings by revision/lot/date.
This set of disks was seeing fairly constant and heavy activity for a few years while I was there. As you can imagine, with 600 disks and the usual MTBF numbers, we quite regularly had disk failure. We kept a few spares onsite and replaced them as they failed, then exchanged the dead drive for a new spare. As I roughly remember it, we probably averaged about one disk failure every 2-3 weeks. Two, perhaps three times, we had a double disk failure during a 24 hour period - but they were never close enough that we didn't have plenty of time to replace the first (and in any case, odds are slim that two failed out of 600 would happen to affect the same data).
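For anyone who wants to sanity-check those numbers, here is a rough sketch of the arithmetic. It assumes failures are independent and arrive at a steady rate (a Poisson model), which is exactly the assumption that breaks down with a common cause like a power surge. The function names and the 17.5-day midpoint of "every 2-3 weeks" are mine, not measured figures.

```python
import math

def implied_afr(n_disks: int, days_between_failures: float) -> float:
    """Annualized failure rate per disk implied by the fleet-wide failure interval."""
    failures_per_year = 365.0 / days_between_failures
    return failures_per_year / n_disks

def prob_at_least_two(rate_per_day: float, window_days: float = 1.0) -> float:
    """P(two or more failures in one window) under a Poisson model.

    Assumes independent failures at a constant rate; correlated failures
    (shared power, shared bus, same bad lot) will be more frequent than this.
    """
    lam = rate_per_day * window_days
    return 1.0 - math.exp(-lam) * (1.0 + lam)

if __name__ == "__main__":
    interval = 17.5  # assumed midpoint of "one failure every 2-3 weeks"
    afr = implied_afr(600, interval)
    p_day = prob_at_least_two(1.0 / interval)
    print(f"implied per-disk annual failure rate: {afr:.1%}")
    print(f"chance of 2+ failures on a given day: {p_day:.3%}")
    print(f"expected such days per year:          {365 * p_day:.2f}")
```

Under those assumptions a double failure within a day comes out at roughly one every couple of years across the whole fleet, which is in the same ballpark as the two or three I remember seeing.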
Of course, another point for the original poster with the failed disks - don't use RAID 5; shell out some more money (disks are cheap) and do proper mirroring - and if you stripe, use 1+0, not 0+1.
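To show why 1+0 beats 0+1, here is a small sketch of the odds that a second, randomly chosen disk failure kills a degraded array. It assumes the textbook layouts: 1+0 as a stripe over two-disk mirrors, 0+1 as a mirror of two stripe sets where one dead member takes its whole stripe set offline, and RAID 5 with single parity. The function name and layout labels are made up for illustration; real controllers may behave differently.

```python
from fractions import Fraction

def second_failure_fatal(layout: str, n_disks: int) -> Fraction:
    """Probability that a second random disk failure loses the array.

    Assumes one disk has already failed and the next failure is equally
    likely to hit any of the remaining n_disks - 1 drives.
    """
    remaining = n_disks - 1
    if layout == "raid5":
        # Single parity: any second failure is fatal.
        return Fraction(1)
    if n_disks < 4 or n_disks % 2:
        raise ValueError("striped mirror layouts need an even disk count >= 4")
    if layout == "raid10":
        # Only the mirror partner of the dead disk is fatal.
        return Fraction(1, remaining)
    if layout == "raid01":
        # The dead disk's stripe set is already offline, so any disk
        # in the surviving stripe set (n/2 of them) is fatal.
        return Fraction(n_disks // 2, remaining)
    raise ValueError(f"unknown layout: {layout}")

if __name__ == "__main__":
    for layout in ("raid5", "raid01", "raid10"):
        print(layout, second_failure_fatal(layout, 8))
```

With eight disks that works out to 1/7 for 1+0 against 4/7 for 0+1, and the gap widens as the array grows.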
This is certainly true, but you should consider the flipside of it. The typical way it works with IT departments is that they are given unfunded mandates right and left. There is no possible way they can do everything with the money they have. What should happen is that some stuff should be taken off their plate. But they rarely have the political pull needed to do that, so what actually happens is that either everything is done poorly or the IT guys work on what they think is important.
So before you go pointing fingers at the IT department's attitude, it would be good to ask, "Did they tell the management that they needed a way to back up those machines? And did the managers give them the necessary time and funds?"
Every IT person I know with a bad attitude didn't start that way; they acquired it through years of crappy management.
Next morning we powered everything up, and on all systems where the disks were more than 2.5 years old (and had been running for all that time), we lost about 20-25% of the disks. We've been told that it was because the disks actually went cold, and the reheating damaged the platters. On most of the disks you could hear, within 2 hours of power-up, the heads grinding against the platters, since the platters were wobbling.
As a result more than one system had multiple disk failures, but the backup was good so no problem (except for the 4-hour delivery time on the replacement disks).
I just heard a tip from a hardware guy. He says you can "freeze" a drive in a freezer to make it work for that one last spin. You'll have to be fast though and copy the stuff somewhere else quickly. After it warms up again it's gone for good.
He told me this because his laptop hard drive went bad a few days ago and he had successfully used the trick on it, saving all the data one would have thought lost.
According to him this should work on both drives with broken electronics or broken mechanics. He said he has many theories why it works, but has proven none. YMMV.
I work on the design of hardware RAID controllers and the original post is missing a key piece of information -- are the two drives really bad, or were they just marked as bad by the RAID driver? I ask because often (particularly with parallel SCSI drives) you have a bad connection or a bad drive which causes one drive to hang, which makes the RAID set go degraded; but since the bus is still corrupted, the next drive it tries to talk to will also appear to be bad, so the array is marked offline. This happens at least ten times as often as genuine double drive failures. How the RAID software reacts to double drive failures depends upon the author. You should have some kind of log or console printout -- the timestamp of the errors is the tell-tale clue: if the failures happened within a minute or so of each other, then you can be pretty sure that only the first failure is real.
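As a rough illustration of that timestamp check, here is a sketch that takes failure events you have already pulled out of the log by hand and flags anything that followed the first failure within a minute. The event format, the labels, and the one-minute window are assumptions for the example; no particular RAID driver logs in this shape.

```python
from datetime import datetime, timedelta

def classify_failures(events, window=timedelta(minutes=1)):
    """Label each drive's first failure event relative to the earliest one.

    events: iterable of (datetime, drive_id) pairs (hypothetical format).
    Returns {drive_id: label} where label is one of:
      "first"    - earliest failure, most likely the real one
      "suspect"  - failed within `window` of the first; may be a bus artifact
      "separate" - failed later; treat as its own event and investigate
    """
    ordered = sorted(events)
    labels = {}
    if not ordered:
        return labels
    first_time = ordered[0][0]
    for index, (stamp, drive) in enumerate(ordered):
        if drive in labels:
            continue  # keep only the first event seen for each drive
        if index == 0:
            labels[drive] = "first"
        elif stamp - first_time <= window:
            labels[drive] = "suspect"
        else:
            labels[drive] = "separate"
    return labels

if __name__ == "__main__":
    log = [
        (datetime(2001, 5, 3, 2, 14, 40), "disk1"),
        (datetime(2001, 5, 3, 2, 14, 5), "disk0"),
        (datetime(2001, 5, 3, 9, 30, 0), "disk2"),
    ]
    for drive, label in classify_failures(log).items():
        print(drive, label)
```

A "suspect" label only tells you where to look first; the drive still needs to be tested on a known-good bus before you trust it.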
First things first -- put in a set of scratch drives and see if the bus and HBA are working OK. Test it thoroughly! Then, using a read-only tool like IOMeter, check each of your original data drives individually to see if it can read reliably across the platters (if using NT for this test, do NOT let DiskAdmin write signatures on the drives!!). Hopefully one of the drives will be bad -- if so, set it aside. Perhaps you are certain you know which was the first drive to fail -- remove that one if you know it. Reboot, and see if the array metadata is recognized. If not, concentrate your efforts on fixing the second drive to fail. If there is a gap in time between the first and second drive failures, then the data on the first drive to fail is no longer of any use to you, as it is out of sync with the other drives.
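If you don't have a read-only test tool handy, the per-drive read check can be approximated with a few lines of script. This is a minimal sketch, assuming a Unix-like system where the drive shows up as a device node you can open read-only (the `/dev/sdX` path in the comment is a placeholder, and the function name is mine). It never writes to the device, but it is no substitute for a proper diagnostic, and hammering a mechanically failing drive with reads can make things worse.

```python
import os

def scan_read_only(path, chunk_size=1 << 20):
    """Read a device or file start to finish without writing to it.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the start
    offset of every chunk that raised an I/O error.
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)  # read-only: nothing is written
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                os.lseek(fd, offset, os.SEEK_SET)
                data = os.read(fd, length)
            except OSError:
                # Unreadable region: note it and skip to the next chunk.
                bad_offsets.append(offset)
                offset += length
                continue
            if not data:
                break  # unexpected end of device
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets

if __name__ == "__main__":
    import sys
    # Example: python scan.py /dev/sdX   (placeholder device name)
    total, bad = scan_read_only(sys.argv[1])
    print(f"read {total} bytes, {len(bad)} unreadable chunk(s)")
    for off in bad:
        print(f"  I/O error at byte offset {off}")
```

A drive that reads cleanly end to end is a good candidate for having been falsely marked bad; one that throws errors all over the platters is the one to set aside.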
If you have more details please post them here and I will try to give you more detailed advice.
One other soapbox comment -- people who sell RAID technology should always provide some kind of metadata debugger because, as they say, sh!t happens.