Slashdot Mirror


Compelling Alternatives to RAID Setups?

jabbadabbadoo asks: "Our software shop has about 30 Linux servers and 15 NT servers running enterprise applications for our customers. Since we have service level agreements with most of them, uptime is crucial. One of the things we've done is to use RAID setups extensively, using products from well-renowned disk and controller vendors. However, we have discovered the paradox that introducing RAID controllers actually reduces overall uptime! Not only does more 'steel' increase the probability of failure, but what fails first is usually the RAID controllers. What is your experience? Have we been having bad luck?" "A related problem, especially on Linux, is that setting up RAIDs is actually quite a costly process. There seem to be endless problems with library versions, and upgrading existing servers simply takes too many hours. To keep the customers happy, we routinely have to create a 'shadow' server while upgrading, which in turn means we, at some point, have to synchronize data to the new server, which in turn means a bit of downtime. Ouch. Does anyone have a good solution to these problems? Of course, cost is a major issue, but so is uptime (which also means cost if we don't provide the uptime dictated in the SLA). What setup gives the best cost/uptime ratio? Thanks for any thoughts!"

3 of 113 comments

  1. A few tips by menscher · · Score: 5, Insightful
    First off, you're looking at the wrong "uptime" number. Don't look at how many days since your last reboot. Look at how many hours/year you are offline. If you're not doing RAID, a failed disk means restoring from backups. That's a time-consuming, and therefore costly, process. If your controller fails, just pop in your spare controller. You do have a spare in-house, don't you?

    I'll agree that setting it up is a nightmare. I'm currently helping test two 4TB arrays for use on a Linux box (16 SATA drives presented as a single SCSI device). Benchmarks under Linux are slower than under Windows. It's a mess figuring out why. Meanwhile, vendors (who I will not name) ship crappy software, and take months to act on bug reports.

    As for transitioning servers, I've been there too. And yes, copying a terabyte of disk in a single pass is a very long process. It'd have taken several days, which is of course unacceptable. This is where the magic of rsync comes in handy. Copy the data over several days in advance, sync it just before the scheduled downtime, and you'll have a fairly short downtime.
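
    A rough sketch of that two-pass approach, in Python wrapping the rsync command line. The helper name `rsync_command`, the paths, and the host name `newserver` are made up for illustration; only the `-a` (archive) and `--delete` flags are real rsync options. The idea is that the first pass does the bulk copy while the old server stays live, and the second pass, run during the downtime window, only has to move what changed.

```python
import subprocess


def rsync_command(src, dest, final=False):
    """Build the rsync argument list for one pass (hypothetical helper).

    -a        archive mode: recurse and preserve permissions, times, links
    --delete  only on the final pass: remove files on the destination that
              were deleted on the source since the first copy
    """
    cmd = ["rsync", "-a"]
    if final:
        cmd.append("--delete")
    cmd += [src, dest]
    return cmd


def run_pass(src, dest, final=False):
    """Run one rsync pass and raise if rsync exits with an error."""
    subprocess.run(rsync_command(src, dest, final=final), check=True)


# Example usage (paths and host are assumptions, not from the thread):
#   days ahead, old server still live:
#       run_pass("/srv/data/", "newserver:/srv/data/")
#   during the scheduled downtime, after stopping the services:
#       run_pass("/srv/data/", "newserver:/srv/data/", final=True)
```

    The trailing slash on the source matters to rsync: `/srv/data/` copies the directory's contents, while `/srv/data` would create a `data` directory inside the destination.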

  2. Multi-engine aircraft by wowbagger · · Score: 3, Insightful

    There is an old saw in the aviation industry: "A twin engine aircraft will have twice as many engine problems as a single engine aircraft."

    However, which would you rather be in, a twin engine aircraft that just lost one engine, or a single engine aircraft that just lost an engine?

    Yes, RAID cards die - I've been shocked at how often that happens. And a 5-disk RAID will have more failures than a 4-disk JBOD (just a bunch of disks) array.

    But the question is, are you seeing a reduction in UPTIME, or just in mean time to failure? Maybe the RAID system throws an error once a month and the JBOD system throws an error every two months, but if you can recover in 5 minutes by swapping cards or drives rather than 5 hours for restoring the JBOD from backup, you are better off.
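
    To put numbers on that, here is a small Python sketch of the arithmetic. The failure rates and recovery times are the illustrative ones from the paragraph above, not measurements, and the function names are just for this example.

```python
HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(failures_per_year, hours_to_recover):
    """Expected hours offline per year: failure count times recovery time."""
    return failures_per_year * hours_to_recover


def availability(downtime_hours):
    """Fraction of the year the system is up."""
    return 1 - downtime_hours / HOURS_PER_YEAR


# RAID: one error a month, 5 minutes to swap a card or drive.
raid_downtime = annual_downtime_hours(12, 5 / 60)

# JBOD: one error every two months, 5 hours to restore from backup.
jbod_downtime = annual_downtime_hours(6, 5)

print(f"RAID: {raid_downtime:.1f} h/year offline, {availability(raid_downtime):.4%} available")
print(f"JBOD: {jbod_downtime:.1f} h/year offline, {availability(jbod_downtime):.4%} available")
```

    With these made-up rates the RAID box fails twice as often but is offline about 1 hour a year, against about 30 hours for the JBOD.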

    Perhaps what you might look at would be using RAID software on the server's processor, coupled with Firewire drive bays, disks, and multiple Firewire cards. If a card dies, move the disks to another card until you can schedule downtime. If a disk dies, hot-swap it and rebuild in the background.

  3. RAID 10? by b!arg · · Score: 3, Insightful

    If uptime is so absolutely crucial, how about a duplexed mirror of RAID 5 arrays? Two controllers and a RAID 5. When in doubt, throw more money at the problem. :)

    --

    Everybody dies frustrated and sad and that is beautiful