Slashdot Mirror


Compelling Alternatives to RAID Setups?

jabbadabbadoo asks: "Our software shop has about 30 Linux servers and 15 NT servers running enterprise applications for our customers. Since we have service level agreements with most of them, uptime is crucial. One of the things we've done is to use RAID setups extensively, using products from well-renowned disk and controller vendors. However, we have discovered the paradox that introducing RAID controllers actually reduces overall uptime! Not only does more 'steel' increase the probability of failure, but what fails first is usually the RAID controllers. What is your experience? Have we been having bad luck?" "A related problem, especially on Linux, is that setting up RAIDs is actually quite a costly process. There seem to be endless problems with library versions, and upgrading existing servers simply takes too many hours. To keep the customers happy, we routinely have to create a 'shadow' server while upgrading, which in turn means we, at some point, have to synchronize data to the new server, which in turn means a bit of downtime. Ouch. Does anyone have a good solution to these problems? Of course, cost is a major issue, but so is uptime (which also means cost if we don't provide the uptime dictated in the SLA). What setup gives the best cost/uptime ratio? Thanks for any thoughts!"

11 of 113 comments

  1. RAID is good by kansei · · Score: 2, Informative

    I remember swapping quite a few Compaq RAID controllers in my day. They wouldn't outright fail, but get in a "compromised" mode, and you usually had enough time to schedule downtime to swap them out. This was much better than messing with software mirror or raid settings, because it's transparent to the OS - the OS just sees a single large disk.

  2. Brands? by JLester · · Score: 4, Informative

    You don't list what brand controllers you are using, but your problems are not typical in my experience. We are a 100% Compaq shop and use their SmartArray controllers with Novell Netware and Debian Linux. We've never had a controller failure and have only lost about 3 drives over the last six years or so.

    I'm a firm believer that you get what you pay for with enterprise-class servers. You shouldn't expect Tier-1 reliability from servers that are built with commodity hardware. There is a reason that Compaq/Dell/IBM servers are more expensive.

    We also haven't had any issues installing other than the default Debian boot disks not supporting the SmartArray controller. A custom set of disks took care of that though.

    Jason

    --
    "FORMAT C:" - Kills bugs dead!
  3. Sounds like a design problem on your end. by stienman · · Score: 4, Informative

    It's hard for me to believe that RAID causes more downtime than single drive setups, unless you have a really bad raid system and a really good backup system.

    The only time RAID should ever be down is during initial setup. Thereafter you should replace bad drives while it's running and you should never have cause to shut it down due to a RAID issue.

    If you are experiencing RAID hardware problems then take a good look into these areas:
    RAID Hardware --> Are you using cheap stuff? It honestly isn't worth it. Perhaps you're just discovering the 'real' value of 'cheap' hardware.
    RAID Software --> If you're using unsupported drivers (ie, vendor doesn't supply or support them) then ditch the hardware and get hardware with supported drivers - make sure they support them on your configuration. You've already proven that you can't support them yourself.
    System Hardware --> If the system is generally cheap (cheap power, bad airflow, cheap components, etc) then you simply can't expect the RAID card to work 24/7.
    Server Room --> Make certain your server room can handle the power and ventilation needs of the servers. This should go without saying, but all too often it is the problem.

    The reason people go with cheap components is the lower initial cost. They only work for a few thousand hours of heavy operation. You must get server rated components if you want them to operate for more than a year or two. There really is a difference.

    Lastly, I use 20+ Promise FastTrack ATA RAID cards in 20+ Novell networks. I use cheap components, and they work in harsh conditions. They are not set up for hot-swap, as that's not a need in this situation. I have to replace the cheap hardware every 2-4 years, power supplies every year, hard drives every 2-3 years. The only time the RAID cards have gone bad is when a power supply failure (usually due to a power outage/surge/brownout) fries the motherboard and usually most of the components in the case.

    I have never had a failure where both HDs completely failed simultaneously, though usually when the rest of the computer goes I replace the whole thing and get the data off one of the old hard drives. This is not an advertisement for Promise. They simply are the only ones with supported Novell 3.12 drivers. :-) Soon to go away... :-(

    I'd be surprised if you've covered all these bases and are still having problems.

    -Adam

  4. Hardware, Configs, Backups by duffbeer703 · · Score: 4, Informative

    The answer is SysAdmin 101 stuff.

    1. Buy quality hardware.

    IDE RAID for critical servers is a bad idea.

    In my experience, RAID hardware tends to be very picky and suffers from subtle and often bizarre hardware conflicts. In general, using a RAID solution that is packaged with the hardware is the best idea.

    If you cannot afford good RAID hardware, stick to conventional JBOD configurations.

    2. Configuration

    Design the configuration of your systems around consistency first, performance second.

    You need to document your procedures for building servers, allocating storage, etc. Create scripts whenever possible.

    If you are not confident that you could talk a marginally qualified technician through a server rebuild over the phone, your docs aren't good enough. If you don't have the time to write docs, make the time or work late.

    3. Backups

    You need documented, tested backup AND restore procedures. All of your oncall staff need to be able to restore a server.

    With 50 servers, disk controller or disk failures should not be a common event. We work with approximately 400 datacenter and 200 field servers (varying in age from 1-9 years), and replaced 3 controllers and 19 disks last year.
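
    To put those numbers next to a 50-server shop, here is a rough back-of-the-envelope sketch. It assumes replacements scale linearly with server count and ignores how many disks each server holds, which the figures above do not say. The function name is made up for illustration.

```python
# Figures quoted above: 400 datacenter + 200 field servers,
# with 3 controllers and 19 disks replaced in one year.
FLEET_SERVERS = 400 + 200
CONTROLLERS_REPLACED = 3
DISKS_REPLACED = 19


def expected_replacements(servers: int) -> dict:
    """Scale the fleet's per-server yearly replacement counts to `servers` machines.

    Assumption: a simple linear scaling, not a reliability model.
    """
    return {
        "controllers": servers * CONTROLLERS_REPLACED / FLEET_SERVERS,
        "disks": servers * DISKS_REPLACED / FLEET_SERVERS,
    }


if __name__ == "__main__":
    for part, count in expected_replacements(50).items():
        print(f"{part}: roughly {count:.2f} replacements per year across 50 servers")
```

    On that scaling, a controller swap would come up about once every four years for 50 servers, so controllers failing first and often points at something other than ordinary wear.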

    Look for electrical issues, you may have crappy electrical service.

    --
    Conformity is the jailer of freedom and enemy of growth. -JFK
  5. Re:Major problems with Promise RAID controllers. by GoRK · · Score: 3, Informative

    There is a very important thing that you have not realized...

    Those are not really true hardware RAID controllers. They are regular hacked up IDE controllers with a bit of BIOS firmware on them that handles software RAID via INT13 until the OS loads and the software RAID in the "driver" can take over.

    They offer nothing that a legitimate hardware raid setup should give you, such as cache RAM or CPU offloading. Mirrored setups on these types of pseudo-hardware RAID controllers HURT PERFORMANCE. Don't believe me? Benchmark it yourself versus software raid and hardware raid on a real controller such as Adaptec AAA or 3ware...

  6. RAID Alone != good design by photon317 · · Score: 4, Informative


    You can't slap a buzzword like RAID onto whatever you were doing before and expect results. Reliable systems have to be carefully engineered correctly.

    From the sound of your posting, I'm assuming when you say you're using RAID, you mean internal RAID cards inside a server with internal disks attached, and relatively small amounts of it. In these types of scenarios, the highest performing, most reliable, and most cost effective option is to put two separate scsi controllers in your boxes, buy twice as much storage as you need, and mirror between the controllers using the OS's software mirroring capabilities. You are now independent of controller failure, the controllers themselves are less likely to "fail" (which doesn't always mean hardware frying) than a complex raid controller by their simpler nature, and you're getting the performance benefit of full mirroring instead of that clunky raid5 business. If you have enough storage to warrant four or more internal disks of some size, use mirror+striping. Always mirror at the lowest level, and then stripe on top of that (in a 4 disk design actually it doesn't matter which way you layer them, but in 6+ disk designs it gives higher data availability in the unlikely event of multiple disk failures). Or in other words - raid5 and hardware cards = bad, mirroring/striping + software raid = good.
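
    The layering point for 6 disks can be checked by brute force. This is only a sketch under a simple model: in the mirror-of-stripes layout, a stripe that loses any member is treated as entirely offline, which is how a naive implementation behaves but not necessarily every real one. The function names are invented for the example.

```python
from itertools import combinations


def survives_stripe_of_mirrors(failed: set, disks: int) -> bool:
    """Mirror first, then stripe: data is lost only if both disks of one pair fail."""
    pairs = [(i, i + 1) for i in range(0, disks, 2)]
    return all(not (a in failed and b in failed) for a, b in pairs)


def survives_mirror_of_stripes(failed: set, disks: int) -> bool:
    """Stripe first, then mirror: at least one whole stripe must be untouched.

    Assumption: a stripe with any failed member counts as fully offline.
    """
    half = disks // 2
    stripe_a = set(range(half))
    stripe_b = set(range(half, disks))
    return not (failed & stripe_a) or not (failed & stripe_b)


def survival_fraction(survives, disks: int, failures: int = 2) -> float:
    """Fraction of all `failures`-disk failure combinations the layout survives."""
    cases = [set(combo) for combo in combinations(range(disks), failures)]
    return sum(survives(case, disks) for case in cases) / len(cases)


if __name__ == "__main__":
    for name, layout in [
        ("stripe over mirrors", survives_stripe_of_mirrors),
        ("mirror over stripes", survives_mirror_of_stripes),
    ]:
        share = survival_fraction(layout, disks=6)
        print(f"{name}: survives {share:.0%} of two-disk failures")
```

    With 6 disks there are 15 possible two-disk failures. Under this model the stripe over mirrors loses data in 3 of them, while the mirror over stripes loses data in 9.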

    Your goal is not to be buzzword compliant by slapping in a raid controller, your goal is to carefully analyze your systems, your options, your requirements, and your budget, and eliminate single points of failure everywhere that it's feasible and desirable to do so, starting with the lowest MTBF items in the system and working your way up. There are no magic bullet answers of course - change the situation and the "right" answer can change dramatically.

    --
    11*43+456^2
  7. Fibre Channel and the Xserve RAID by caseih · · Score: 4, Informative

    I don't see why setting up the RAIDs under any OS should be more time consuming than on other OSs. Certainly if you use the right hardware-based RAID things should be very simple and very fast.

    Bang for the buck, you can't beat the Apple Xserve RAID. They are IDE, but almost as fast as the fastest scsi arrays, and seem to be very reliable. The array can be easily partitioned into a variety of raid types with hot spares. The unit can then connect to Windows or Linux via standard fibre channel interface and look like simple scsi drives. The RAID is administered via an ethernet connection using a nice java gui tool.

    We set our Xserve RAIDs up such that each array (each Xserve RAID box has 2 arrays with separate controller logic for each) is RAID 5 plus a hot spare, and then the array is mirrored with the other one. This gives us .8 TB or so at a very reasonable price and is very reliable. So far it has worked well.
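
    For anyone sizing a similar layout, the usable space works out as in the sketch below. The drive count per array (7) and drive size (180 GB) are assumptions chosen for illustration; the post does not state them. The function name is made up as well.

```python
def usable_gb(drives_per_array: int, drive_gb: int,
              hot_spares: int = 1, mirrored: bool = True, arrays: int = 2) -> int:
    """Usable capacity of RAID 5 arrays with hot spares, optionally mirrored.

    RAID 5 spends one drive's worth of space on parity, and each hot spare
    sits idle until a drive fails. Mirroring the arrays against each other
    leaves only one array's worth of usable space.
    """
    data_drives = drives_per_array - hot_spares - 1  # minus spare(s), minus parity
    per_array = data_drives * drive_gb
    return per_array if mirrored else per_array * arrays


if __name__ == "__main__":
    # Assumed figures, not from the post: 7 drives per array, 180 GB each.
    print(usable_gb(drives_per_array=7, drive_gb=180), "GB usable when mirrored")
    print(usable_gb(drives_per_array=7, drive_gb=180, mirrored=False),
          "GB usable without the mirror")
```

    The gap between raw and usable space is the price of the spare, the parity, and the mirror combined.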

  8. Re:Look at Google by dubl-u · · Score: 2, Informative

    Google never writes to the filesystem -- it's all in memory and temporary. They only use the disk to boot the system.

    For their production service, I understand that they keep it all in memory. But it's hardly temporary.

    Hardly anyone is like Google.

    For now. Google was one of the first companies to take advantage of the fact that RAM and processing power have become ridiculously cheap. SQL databases arose in an era when 32k was a fair bit of RAM, and where a business computer was one or more refrigerator-sized units kept in a sacred temple.

    Now computers are cheap and disposable. I can fill a rack with cheap 1Us and get processing power that Sun can't match at 10 times the price. The only trick is to write your apps in such a way that you can tolerate hardware failures. That's a little hard, but it paid off handsomely for Google. Others will learn this trick.
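
    As a toy illustration of what "tolerate hardware failures" can mean at the application level, here is a sketch of a read that falls back to the next replica when one machine is down. It says nothing about how Google actually does it; every name in it is hypothetical.

```python
class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def read_with_failover(key, replicas):
    """Try each replica in order and return the first successful answer.

    `replicas` is a list of callables taking a key; a dead machine is
    modelled as a callable that raises OSError (an assumption of this sketch).
    """
    errors = []
    for fetch in replicas:
        try:
            return fetch(key)
        except OSError as exc:
            errors.append(exc)  # remember the failure, move on to the next box
    raise AllReplicasFailed(f"{len(errors)} replicas failed for key {key!r}")


if __name__ == "__main__":
    def dead_box(key):
        raise OSError("disk controller gone")

    def healthy_box(key):
        return f"value-for-{key}"

    print(read_with_failover("mail/42", [dead_box, healthy_box]))
```

    The application keeps answering as long as one copy is reachable, which is why the individual boxes can be cheap.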

    You can bet they'll be using RAID (etc) for the GMail service.

    You'd lose that bet. They already have built their own distributed network filesystem, GFS, that holds at least hundreds of terabytes. It has performance and reliability levels well above any RAID installation I've ever heard of, and it uses cheap commodity hardware to do it. I'd bet that GMail will be built on top of a variant of GFS or some other in-house technology.

  9. Re:A few tips by menscher · · Score: 2, Informative
    The 4TB arrays are units we're evaluating (one from Excel, the other from RaidKing). They're just rack-mountable boxes that have a scsi uplink. So, as far as the computer is concerned, you just have one massive scsi drive. (There's a catch, which is that these units can't seem to have more than 2TB per "device", so you really get two scsi devices presented to the computer.)

    Life is made a little annoying by the 2TB limit in the 2.4 kernel. But we're willing to live with that, for now. I'm told there are patches to fix this, but I prefer stability over features for this box.
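
    The 2TB figure is what a 32-bit sector number gives you with 512-byte sectors. Assuming that is where both the per-device limit on the arrays and the kernel limit come from (the post does not say so), the arithmetic looks like this. The constant and function names are illustrative only.

```python
SECTOR_BYTES = 512   # assumed sector size
ADDRESS_BITS = 32    # assumed width of the sector number


def max_device_bytes(address_bits: int = ADDRESS_BITS,
                     sector_bytes: int = SECTOR_BYTES) -> int:
    """Largest device addressable with the given sector-number width."""
    return (2 ** address_bits) * sector_bytes


def devices_needed(array_bytes: int) -> int:
    """How many devices an array must be split into to stay under the limit."""
    limit = max_device_bytes()
    return -(-array_bytes // limit)  # ceiling division on integers


if __name__ == "__main__":
    limit = max_device_bytes()
    print(f"limit per device: {limit} bytes ({limit / 2**40:.0f} TiB)")
    print("devices for a 4 TB array:", devices_needed(4 * 10**12))
```

    A 4 TB array split under that limit comes out as two devices, which matches what these units present to the host.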

    As for 3ware, I've got a box with an Escalade 7500-4LP running RedHat 9. It works by default (can boot off the raid, etc). 3ware has extra drivers, but I don't use them. It's a messy situation, since you have to simultaneously upgrade firmware, driver, and utility programs. I've been less-than-impressed with their support. When I reported that the md5sum on their website didn't match the file, they said "We know.... don't worry about it... it doesn't matter." Umm, yeah. Right.
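
    Checking a download against a published md5sum takes only a few lines, so there is little excuse for a mismatch being waved away. A minimal sketch using Python's standard hashlib; the file name and checksum in the usage part are placeholders, not real 3ware values.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str) -> bool:
    """Compare a file's MD5 with the checksum published by the vendor."""
    return md5_of_file(path) == published_hex.strip().lower()


if __name__ == "__main__":
    import sys

    # Usage (placeholder arguments): python check_md5.py <file> <published md5>
    if len(sys.argv) == 3:
        ok = matches_published(sys.argv[1], sys.argv[2])
        print("checksum OK" if ok else "checksum MISMATCH, do not flash this")
```

    A mismatch means the file is not the one the vendor hashed, whether through corruption, a stale web page, or tampering.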

  10. Re:So would XSan help? by Crypt0pimP · · Score: 2, Informative

    Don't believe the marketing.

    From what I read, the XSan software is first and foremost a distributed file system for shared volumes from the Xserve RAID.
    If you look at the applications, it's about multiple servers or workstations with concurrent access to a single volume - distributed file locking.

    Great stuff for the stated purpose, can't wait to get my hands on it!

    Hiding the complexity of RAID is the domain of storage 'virtualization' solutions. The ones that let you mix and match raid types across any number of spindles you throw at it.

    <Shameless_Plug>
    My product, the XIOtech Magnitude does that. Take up to 126 spindles, create RAID 0, 1, 5, 10 volumes and give 'em to your servers. Boot off 'em, mirror 'em, copy 'em. Stick 'em in your ear!
    </Shameless_Plug>

    direct flames or questions to slineyp at hotmail dot com

    --
    Striving to achieve a lower state of consciousness
  11. Re:Major problems with Promise RAID controllers. by GoRK · · Score: 2, Informative

    The IO overhead will (should, at least) be the same whether it's hardware RAID or software RAID.

    On a real hardware raid controller this overhead exists only on the controller CPU (normally an i960 or somesuch) and is further alleviated by the cache ram on the card.

    Compared to what, though ? OS-level software RAID is going to have to do precisely the same thing and IMHO the processing involved, taken in the context of modern, fast CPUs, is insignificant.

    Well, I wasn't trying to compare promise/hpt/et al. to software raid, but if the overhead of any kind of host-cpu based raid were really actually insignificant as you claim, then I guess we are all real suckers for plunking down hundreds or thousands of dollars for RAID cards.

    The point is that any extra overhead whatsoever on the CPU dealing with the disk is very often unacceptable. The disk subsystem is pretty well the slowest component in any system, and having the host CPU wait around on it all the time can be a real performance killer. Take an example of building a workstation to edit HD video. This will normally use RAID 0 if it is a capturing machine or sometimes RAID 1. Build one -- or better yet build three - one using software raid, one using 'hardware-assisted' raid, and one using a genuine hardware controller.

    The kind of thinking you are doing is the kind of thinking that leads to bloated software. The idea that the CPU is "fast enough" that efficiency doesn't matter might be fine for the desktop in most cases, but on a server could mean the difference between supporting 3000 and 6000 users.