Compelling Alternatives to RAID Setups?
jabbadabbadoo asks: "Our software shop has about 30 Linux servers and 15 NT servers running enterprise applications for our customers. Since we have service level agreements with most of them, uptime is crucial. One of the things we've done is to use RAID setups extensively, using products from well-renowned disk and controller vendors. However, we have discovered the paradox that introducing RAID controllers actually reduces overall uptime! Not only does more 'steel' increase the probability of failure, but what fails first is usually the RAID controllers. What is your experience? Have we been having bad luck?"
"A related problem, especially on Linux, is that setting up RAIDs is actually quite a costly process. There seem to be endless problems with library versions, and upgrading existing servers simply takes too many hours. To keep the customers happy, we routinely have to create a 'shadow' server while upgrading, which in turn means we, at some point, have to synchronize data to the new server, which in turn means a bit of downtime. Ouch. Does anyone have a good solution to these problems? Of course, cost is a major issue, but so is uptime (which also means cost if we don't provide the uptime dictated in the SLA). What setup gives the best cost/uptime ratio? Thanks for any thoughts!"
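The "more steel" paradox in the question can be put into rough numbers. The sketch below assumes independent failures and uses made-up availability figures (they are not measurements from any vendor): a mirrored pair of disks is extremely available on its own, but the controller sits in series in front of it, so a controller that is less reliable than a single disk drags the whole setup below the single-disk case.

```python
def series(*availabilities):
    """Every component must work, so availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """At least one component must work: 1 minus the chance all are down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


# Hypothetical availabilities, chosen only to illustrate the effect.
DISK = 0.999
CONTROLLER = 0.998

single_disk = DISK
mirror_only = parallel(DISK, DISK)
mirror_behind_controller = series(CONTROLLER, mirror_only)

print(f"single disk:              {single_disk:.6f}")
print(f"mirror, ideal controller: {mirror_only:.6f}")
print(f"mirror behind controller: {mirror_behind_controller:.6f}")
```

With these assumed figures the mirrored setup ends up slightly less available than one bare disk, which is the effect the question describes. Real numbers will differ, and failures are rarely fully independent.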
In the past two years, none of the "downtime" that I've experienced has been attributed to the disk array or controller.
The biggies have been: a power outage that exceeded the capacity of the UPS (3 hours), planned upgrades, and an anonymous gremlin who bumped the reset button (since detached).
This is a boring sig
XSan can 'hide' the complexity of RAID, as well as providing management tools and 'intelligent' cascading failure... but that's just from reading the specs, not from actual experience. I hear XSan is based on CVFS? I should look at that too.
GPL Deconstructed
This is on a lower level than the RAID you are using, but we are having major problems with 10 Promise Technology TX2000 mirroring RAID controllers that we bought. The mirrors go critical for no detectable reason. Promise Technology technical support is unable to find the problem, and the company is unwilling to escalate the issue. The Promise Technology technicians escalate the issue, but 2nd level technical support never calls back.
Promise mirroring controllers on ECS (EliteGroup) L7VTA v 1.0 motherboards have the same problem. When we call ECS tech support, there is a recorded message saying they are busy and to call back later.
We've been supplying computers with Promise mirroring RAID controllers since the company began doing business, and we've had very few problems until now.
Possibly the problems are associated with newer, faster motherboards, or with AMD VIA chipset motherboards. We've never had problems with RAID controllers on Intel chipset motherboards.
Another possibility is that the RAID controllers are incompatible with DVD burner drivers that are installed with Roxio or Nero DVD burning software.
I spent the past week and a half trying to set up a 4x160 SATA RAID-5. It was a huge exercise in frustration because every time I'd try to build a volume, my machine would promptly freeze after a few percent. I changed out IDE emulation for SCSI emulation in the kernel... same thing. I changed SATA controllers, same thing. I changed SATA cables, same thing. I changed power supplies, same thing. I added four 80 mm case fans, same thing. In the end, it turned out that the culprit was raidtools. Nobody had ever bothered to post that RAID-5 + raidtools + kernel 2.6 locks up a computer. I changed to mdadm, and I had a working array 50 minutes later.
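For anyone making the same switch, here is a minimal sketch of what an mdadm array creation looks like, wrapped in Python so the command can be inspected before it is run. The device names and the helper `mdadm_create_command` are assumptions for illustration; the post does not say which devices or options were actually used. Only the standard `--create`, `--level` and `--raid-devices` options are relied on.

```python
import shlex


def mdadm_create_command(array, level, members):
    """Build (but do not run) an mdadm command that creates a new array."""
    if level == 5 and len(members) < 3:
        raise ValueError("RAID-5 needs at least three member devices")
    return [
        "mdadm",
        "--create", array,
        f"--level={level}",
        f"--raid-devices={len(members)}",
        *members,
    ]


# Hypothetical partitions; substitute the real member devices.
members = ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]
cmd = mdadm_create_command("/dev/md0", 5, members)

# Print the command for review; running it requires root and wipes the members.
print(shlex.join(cmd))
```

Once an array exists, the kernel reports build and resync progress in /proc/mdstat.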
If your bandwidth requirements are not too high, you may be able to use a distributed file system on many redundant nodes (cheap IDE disks and Gigabit Ethernet) and allow for replacements. Your uptime should be constant, given enough UPS capacity and node redundancy.
--
"we live in a post-ideological world..." - Billy Bragg.
Their systems are probably 80% auctioned desktops and such from busted dot-coms, and I suspect that many of them are not RAID at all. I have yet to hear of a redundant RAID controller either. Your best bet is just to replicate data on your backend servers and use something like a Cisco CSS or some other service-balancing device to keep live servers available while redirecting traffic away from dead ones.
You can still do RAID with this setup but you'd have the added security of 2 or more systems making up your entire functional system so if one is down the other can continue normally. Then it's trivial to repair the dead machine and bring it back into the cluster.
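The balancing idea above boils down to "hand each request to the next server that passes a health check". Below is a minimal sketch of that logic, not how a Cisco CSS is actually implemented. The class `Balancer`, the backend names, and the `is_alive` callback are all hypothetical; a real health check would probe a port or URL.

```python
class Balancer:
    """Round-robin over backends, skipping any that fail the health check."""

    def __init__(self, backends, is_alive):
        self.backends = list(backends)
        self.is_alive = is_alive  # callable: backend name -> bool
        self._next = 0            # index where the next search starts

    def pick(self):
        n = len(self.backends)
        for offset in range(n):
            index = (self._next + offset) % n
            candidate = self.backends[index]
            if self.is_alive(candidate):
                # Start the next search just after the server we chose.
                self._next = (index + 1) % n
                return candidate
        raise RuntimeError("no live backends")


# Hypothetical cluster in which app2 is currently dead.
down = {"app2"}
balancer = Balancer(["app1", "app2", "app3"], lambda name: name not in down)
picks = [balancer.pick() for _ in range(4)]
print(picks)

# After the dead machine is repaired it rejoins the rotation automatically.
down.clear()
after_repair = [balancer.pick() for _ in range(3)]
print(after_repair)
```

The point of the design is that repair is decoupled from service: the dead box is simply skipped until its health check passes again.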
I really don't understand some folks' fascination with SATA on "servers".
SATA is designed for desktops. SATA drives don't match the MTBF of the equivalent SCSI drive, nor its performance.
If you've chosen it because it's the cheaper of the solutions, ok... if you chose it for performance... well, make sure you have a good backup solution.
When that slow 250GB ATA class drive is dead, and while its fellow drives are chugging their little hearts out (and probably maxing out that 3ware controller), how long will it take to rebuild your array?
Have you tested how long it takes? Probably upwards of 24 hours if your system is moderately loaded.
Guess what you have now? The marvelous opportunity for a CASCADING FAILURE!
That's right kids! Because you just had a drive fail, and all the other drives are doing double the work to rebuild from parity data, you have a higher chance of getting a second drive failure.
Consider that you bought all of the drives in that array at the same time. They've all been running for the same amount of time. What if there was a minor manufacturing defect that caused that first drive to fail? How soon before it takes out the other 4?
A 'resume generating event' waiting to happen.
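To put rough numbers on the rebuild window, here is a back-of-the-envelope sketch. The rebuild rate and MTBF are assumptions picked for illustration, not measurements, and the failure model treats drives as independent with a constant failure rate. As argued above, drives from the same batch are likely correlated, so treat the result as a floor, not a prediction.

```python
import math


def rebuild_hours(capacity_gb, rebuild_mb_per_s):
    """Time to rewrite one drive's worth of data at a sustained rebuild rate."""
    seconds = capacity_gb * 1000 / rebuild_mb_per_s
    return seconds / 3600


def second_failure_probability(surviving_drives, window_hours, mtbf_hours):
    """Chance that at least one surviving drive fails during the window,
    assuming independent drives with an exponential failure distribution."""
    return 1.0 - math.exp(-surviving_drives * window_hours / mtbf_hours)


# Assumed figures: a 250 GB drive rebuilt at 3 MB/s on a loaded system,
# four surviving drives, and a 500,000 hour MTBF per drive.
hours = rebuild_hours(250, 3)
risk = second_failure_probability(4, hours, 500_000)

print(f"rebuild window: {hours:.1f} hours")
print(f"independent-failure risk during rebuild: {risk:.4%}")
```

At an assumed 3 MB/s the window comes out at about 23 hours, in line with the estimate above. Halving the rebuild rate doubles the window and roughly doubles the exposure.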
Best of luck.. and I agree with the comment upthread. SATA drives are for Workstations. Maybe for storing what we call 'reference data'.
Not much more.
There are a few choice terms for this in the industry: 'Economy Enterprise'
'Garbage RAID'
'Ghetto SAN'
Good luck
Striving to achieve a lower state of consciousness
If disk space and performance are not a problem (i.e. HD below 200GB, non-fancy single CPU), you could simply go with two (or three) cheap PC boxen instead of one "data center quality" RAID machine (for the same total price). If you mirror data and setup from "production" to "standby" daily, downtime due to any failure (HD, controller, mobo, OS, filesystem) can be kept to 1-2 minutes (the time to switch service over to the standby), continuing with yesterday's data, which should be sufficient in most cases.
Integrating a backup/backlog (e.g. 3 months data) into a mirror setup is possible in several ways - my company does offer such a solution (managed service, that is).
Continuing with current data instead of yesterday's status is quite a bit more challenging, though...
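A daily production-to-standby mirror of the kind described above can be as simple as an rsync run from cron. The sketch below only builds the command so it can be reviewed first. The host name, paths, and the helper `mirror_command` are hypothetical, and the post does not say which tool the poster's company actually uses. `-a` preserves permissions and timestamps, and `--delete` removes files from the standby that no longer exist on production.

```python
import shlex


def mirror_command(source_dir, standby_host, dest_dir):
    """Build (but do not run) an rsync command that mirrors a directory."""
    # A trailing slash on the source copies its contents, not the directory itself.
    source = source_dir.rstrip("/") + "/"
    return ["rsync", "-a", "--delete", source, f"{standby_host}:{dest_dir}"]


# Hypothetical production path and standby host.
cmd = mirror_command("/srv/data", "standby.example.com", "/srv/data/")
print(shlex.join(cmd))
```

Note that copying files out from under a running database may produce an inconsistent snapshot, which is part of why continuing with current data is the harder problem.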