Pros & Cons of Different RAID Solutions


Posted by Cliff on Thursday November 18, 1999 @07:13PM from the looking-for-solutions dept.

sp1n writes "Our mail server has hit a major disk bottleneck, and we're considering a RAID 5 solution. We're a local ISP with 13,000 users that's been around since 1993. The mail server is a Sun UltraSparc 2 w/ 768M ram, a 10000rpm 9 gig for mail storage, a 7200rpm 7 gig for spool, and another for system mountpoints, all on a u2w bus, and it's running exim. It reaches a load of 10 on a daily basis, and hits around 20 once a week. The spool and mail storage drives are cranking away constantly. Requirements are: external RAID controller, scalability, speed, and a rackmount case for the controller & drives." Sound intriguing? If you've ever wanted to learn a bit about RAID, then hit the link below.

sp1n continues: "We are currently considering 3 options:

(1) SCSI - EIDE controller with six 9G/7200 ATA drives (hadn't heard of this one until recently). This supposedly accesses the drives directly through DMA and bypasses all IDE, just using them as physical media. All are accessed in parallel. I'm a bit wary of the reliability of IDE drives under constant use.

(2) SCSI - SCSI controller with six 9G/7200 u2w drives. The controller currently at the top of my list is the Mylex DAC960SXi w/ 32MB cache. However, something that fits in a half-height bay, instead of hogging a full-height would be nice.

(3) SCSI - SCSI controller as above, running with 2 disk channels and 2 separate RAID 5 arrays for each mountpoint (spool/mail storage).

I'm looking for any experience with IDE/DMA raid setups (1), as well as the pros/cons of making 2 partitions, both of which are very active, on one array of 6 drives (2), as well as 2 separate level 5 arrays of 3 for each mountpoint (3). In addition, any suggestions for external controllers and rackmount enclosures would be greatly appreciated. I would like the controller to have an i960 or better processor."

--
"The glass is not half full, nor half empty. The glass is just too big."

Check to be 100% sure drives are the problem by vectro · 1999-11-18 14:36 · Score: 5

Before you go out and purchase an expensive RAID solution (of any kind), make sure this is really the problem. The vmstat command will make it quickly apparent what kind of i/o is happening, and further analysis might tell you more about what kind of hd accesses are happening.

In many cases, adding more memory or CPU can make a bigger difference than more/faster hard drives, if the problem is that the cache is too small or there is too much paging activity. Also check your CPU load and make sure it is nowhere near 100% - if it is, time to get a 2nd CPU.

Also, avoid software RAID implementations like the plague. They will slow down your system and provide questionable reliability. You should also try to find cards that have redundant SCSI controllers onboard, and support redundant cabling. This way if the cable, plug, or SCSI bus fails for some reason you will not be SOL.

Finally, be sure that the majority of your disk accesses are reads. RAID will slow down writes, sometimes drastically so. If the majority of your disk accesses are writes, then tuning your kernel to flush dirty buffers less often may make a good difference.
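
To put a rough number on how much RAID 5 can slow writes: a small RAID 5 write is usually modeled as four disk operations (read old data, read old parity, write new data, write new parity), while a read costs one. The following is a minimal sketch under that textbook assumption. The figure of 100 I/Os per second per disk is invented for illustration and does not come from the post, and a controller with write-back cache may do better than this model suggests.

```python
def raid5_effective_iops(disks, iops_per_disk, read_fraction, write_penalty=4):
    """Estimate host-visible I/Os per second for a RAID 5 array.

    Assumes every write is a small read-modify-write costing
    `write_penalty` back-end disk operations and every read costs one.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    raw_iops = disks * iops_per_disk
    write_fraction = 1.0 - read_fraction
    # Average number of back-end operations consumed by one host I/O.
    cost_per_io = read_fraction + write_fraction * write_penalty
    return raw_iops / cost_per_io


if __name__ == "__main__":
    # Hypothetical array: 6 disks at 100 I/Os per second each.
    for read_fraction in (0.9, 0.5, 0.1):
        iops = raid5_effective_iops(6, 100, read_fraction)
        print(f"{read_fraction:.0%} reads: about {iops:.0f} host I/Os per second")
```

The more write-heavy the mix, the further the estimate falls below the raw total of the disks, which is why the read/write ratio is worth measuring before buying.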
Couple things by Falsch+Freiheit · 1999-11-18 14:56 · Score: 5

First off, it's not clear from your post how heavily loaded the drives really are.

In particular: load is a measure of how many processes are using or waiting for a resource (such as disk I/O, CPU or network I/O). On a busy mail server that's completely adequate for the job, I'd expect to often see a high load average due to the number of processes that are waiting on the network. That is, due to the number of processes waiting for slow network connections to places halfway around the world.

All you mention is the load averages and a fairly non-specific measure of drives that are "cranking away constantly". If the drives were being used at a constant 10% of available I/O, they'd tend to "crank constantly" even if they could be hit much harder. (Still, given that losing email is considered bad by customers, a RAID 5 solution seems like a good idea anyways and leaves you room to grow and handle sudden increases in email from the holidays or spammers or gradual expansion of business.)

As to IDE vs. SCSI -- never go with straight IDE on a server. SCSI has the ability to lie to the OS and silently move data from sectors that have gone bad into sectors reserved for that purpose. Sure, it slows down access to that particular block of data, but it's a lot easier than the OS having to deal with failures directly. However, I'm completely unfamiliar with the strange SCSI - EIDE setup that you're describing -- if it treats them as just physical media and provides the SCSI interface itself, it may be able to do that particular SCSI trick, as well. Physically, SCSI drives and EIDE drives are identical -- as in, you can find the *exact* same drive from certain manufacturers, only one has SCSI and the other EIDE. Reliability of the physical media is the same, IOW. In a normal configuration, *apparent* physical reliability is higher for SCSI due to wonderfully useful trickery.

I don't recall the exact model numbers, but I've seen pretty good results with Mylex RAID controllers before. (more along the lines of database stuff than what you're talking about -- somewhat different needs, but not all *that* different, I suppose.)

I can't see putting two partitions on one RAID device as making a lot of sense -- since things are striped you'd end up running into contention issues.

IOW: I'd guess that option #3 would be the fastest -- it's also probably the most expensive.

If I were you, I'd check more carefully to determine how much of the currently available disk I/O is actually being used... If the budget allows it, the dual-channel RAID solution sounds pretty good. You might want to go with two single-channel RAID cards instead -- makes it easier to stock a backup card in case a card decides to die. Try and get something with hot-swappable drives, too. It makes the RAID stuff so much more useful.

Also, I don't know the details of your setup (of course), but seriously consider breaking the mail serving task into separate pieces and run it on separate machines.

You have:
1) incoming email
2) outgoing email
3) email from customers
4) email customers pick up (POP)

It sounds like you have one machine handling all of these. Breaking these tasks onto separate boxes would spread the load. (If you've made the mistake of telling customers the same thing for #3 and #4, i.e. mail.isp.net instead of mail.isp.net and pop.isp.net, it might be impossible to split those two tasks away from each other.)

You can have a setup such as:
outgoing1 through outgoingN all behind the single name of "outgoing" that internal machines are told to send email to that they don't know how to deal with
mail1 through mailN all behind "mail" that customers are told to have as their outgoing mail server. In particular, it should blindly send off email it doesn't know how to deal with to outgoing.
pop (harder to break into separate machines, but possible)
incoming1 through incomingN with MX records pointing at them for your domain.

Now, breaking into that many machines is probably silly. Moving outgoing to one machine and everything else to a second machine (and possibly mailing lists off to a third machine) may make a *lot* of sense though. Don't get tied into the idea of a monolithic machine to accomplish everything related to a particular task -- eventually it's much more expensive than many cheaper boxes to handle the same task.
1. Re:Couple things by kijiki · 1999-11-18 16:22 · Score: 3
  
  >In particular: load is a measure of how many processes are using or waiting for a resource (such as disk I/O, CPU or network I/O). On a busy mail server that's completely adequate for the job, I'd expect to often see a high load average due to the number of processes that are waiting on the network. That is, due to the number of processes waiting for slow network connections to places halfway around the world.
  
  Correct me if I'm wrong, but isn't the load the average number of processes in the run queue? This would mean that processes that are blocked on the network or disk would be in the sleep (wait) queue, and not counted in the load average.
  
  In this case, a load of 20 means 20 processes are ready to run, which is not so good.
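
For reference, Unix load averages are exponentially damped moving averages of a sampled process count. A minimal sketch of that calculation follows. It assumes the sample is the run-queue length, a 5-second sampling interval, and a 60-second window; the interval and the workload are illustrative assumptions, and which processes get counted varies by OS (Linux, for example, also counts processes in uninterruptible disk wait).

```python
import math


def update_load(load, runnable, interval_s=5.0, window_s=60.0):
    """Fold one run-queue sample into an exponentially damped average."""
    decay = math.exp(-interval_s / window_s)
    return load * decay + runnable * (1.0 - decay)


if __name__ == "__main__":
    load = 0.0
    # Hypothetical workload: 20 runnable processes, sampled every 5 seconds
    # for 5 minutes, starting from an idle machine.
    for sample in range(60):
        load = update_load(load, runnable=20)
        if (sample + 1) % 12 == 0:
            print(f"after {(sample + 1) * 5:3d}s: 1-minute load = {load:.2f}")
```

Under this model a sustained load of 20 means roughly 20 processes were counted at every sample, so what matters is whether they were waiting for CPU or for something else.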
Fibre Channel RAID by thesteveco · 1999-11-18 14:58 · Score: 5

We've just spent 2 weeks at my office researching the different solutions available to us for implementing the most reliable and scalable solution available today. Our needs differ a bit from yours as we're looking to put many machines on a network for load-distribution yet they all need to speak to the same data on a single repository. This holy grail is known as a SAN, or Storage Area Network.

Our solution is going to be a single cabinet RAID (level 5 for accessing smaller files) with a "hot spare" that will rebuild a crashed disk on the fly. This being a standard cabinet we'll have 8 disks, of which the capacity of 6 will be data (one parity (term used loosely as parity is striped on RAID-5), and one spare).

The disks are Seagate's 10,000 RPM Cheetahs, the most commonly recommended units among all the vendors we've talked to, and the controller is a multi-channel u2w with fibre interface to a Q-Logic PCI adapter.

The total system is going to run just over $15,000. This sounds like a lot, but pricing lower end systems isn't too much cheaper and you'll never get 24-hour turnaround on failed parts (if they're even available). This seems like overkill for a single system, but by adding a fibre hub later we can use the single system for many many machines once a file controller (dedicated machine) is put into place.

The beauty of SAN is that it operates much like FTP, with a control and a data connection. The control connection occurs over your existing LAN, and the data is transmitted directly over the fibre channel (max rate of 100 MB/s).

Other NAS (Network Attached Storage) models are somewhat cheaper to implement, but performance can never match the fibre as the "control" and "data" connections (NFS or SMB) both transmit across your network.

I apologize for digressing from the straight RAID topic, but I felt obligated to give the /. community something to chew on in return for all that I've learned here.

-Steve
try to do some benchmarking before you buy by troutman · 1999-11-18 15:11 · Score: 4

This is only mildly applicable to your question since it isn't for Solaris, but it is all I have to offer.
I spent a fair amount of time looking at RAID 5 solutions this past summer for a client. Both external and internal, for Linux. Tried several different controller card brands and drive configurations, did a lot of reading, and bugged a lot of vendors.
You really should try to test your options and all of the configuration combinations using something like Bonnie, on a machine with a similar configuration to your target server. Make sure that your Bonnie test file size is at least twice physical RAM, to eliminate the effects of RAM and controller caching on the results.
I found that using 6 drives in a RAID 5 config was a LOT faster than 5 drives, most of the time. In fact, 3 drives in an array was faster than 5 in some cases. I think it has to do with the way the controller cards were calculating the distributed parity, and perhaps also due to things the driver was doing. 4 drives usually wasn't much better than 3, either.
Stripe sizes for the array can also make a big difference. 32k vs 128k, etc. Larger stripe sizes are usually better for I/O speed, but you may find for email that having a higher number of random seek transactions per second is better than raw speed.
I did not get a chance to do any hard testing of multiple channel configurations with these cards. I suspect that splitting the I/O onto multiple channels would be a win.
IMHO, you definitely want an i960 based board or system, with the fastest CPU you can find on them. I noticed a significant difference between boards with the 33MHz part vs. the 66MHz part.
FYI for others: for controllers, the AMI MegaRAID (alias Dell's PERC2/SC) just blows chunks. Older non-LVD, non-raid SCSI systems can run rings around it, at least on write speed.
It has been my experience that the write speed on a RAID 5 system is generally only a fraction of the reading speed, like 1/4th to 1/2. For a quick and stupid test, do something like 'time cat /proc/kcore > /tmp/kcore' and do the math for MB/second.
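
For systems without /proc/kcore, the same quick and stupid test can be sketched in a few lines of Python. This is only an illustration, not a replacement for Bonnie: the function name and the default sizes are made up for this example, and for a meaningful number the file should be far larger than RAM and controller cache, as noted above.

```python
import os
import tempfile
import time


def measure_write_mb_per_s(path, total_mb=16, block_kb=64):
    """Write total_mb of zeros to path in block_kb chunks and return MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        # Force the data out of the OS cache before stopping the clock.
        os.fsync(f.fileno())
    elapsed = max(time.perf_counter() - start, 1e-9)
    written_mb = blocks * block_kb / 1024
    return written_mb / elapsed


if __name__ == "__main__":
    # Pass dir=... to mkstemp to place the file on the array under test.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        print(f"{measure_write_mb_per_s(path):.1f} MB/s sequential write")
    finally:
        os.remove(path)
```

Running it once against the RAID volume and once against a plain disk gives a crude write-speed comparison in the same spirit as the cat test.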
oh, and my current favorite card is the DPT Millenium V controller, using it in several systems in various places for the last 3 or 4 months. Here are some Bonnie results for a system with a DPT with 6x 7200 RPM drives, all on the same channel (internal) Linux kernel 2.2.10, dual P3 500Mhz:
     -------Sequential Output-------- ---Sequential Input-- --Random--
     -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
  MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
1024  7637 97.5 16743 15.2  9561 19.4  8384 98.3 52923 36.2 583.2  9.0
something isnt right by ZxCv · 1999-11-18 15:11 · Score: 3

Our setup has right about 31,000 users constantly checking and sending email and is running RH 6.1 on a dual PII/333 with 128mb ram and 9g UW SCSI. I haven't seen a load higher than 0.75 since that machine has been the mail server... maybe something about how your mail server is set up is creating a tremendous bog on it.

--

Perl - $Just @when->$you ${thought} s/yn/tax/ &couldn\'t %get $worse;
Re:More spindles, more simultanious reads by jemhddar · 1999-11-18 15:22 · Score: 5

My RAID experience comes from NT software RAID and from using AMI MegaRAID controllers. For performance, the following things are important:

PCI Bus-- The fastest controller/drives won't make a difference if the PCI bus can't get data to the drives fast enough. Look at what else you are running, consider upgrading memory/processor like another person said.

Stripe Size-- In a hardware raid setup the controller will write to one hard drive for xxx kb before switching to the next hard drive. You want to figure out what size 'chunks' of data the OS will send to the controller. Netware uses a 64k block size, which means large file reads/writes will be sent from the OS to controller in 64k pieces. If your stripe size is set to 8k, and you have 6 hard drives in a raid 5 array, look at the following situation.
drive1 - 8k total=8k
drive2 - 8k total=16k
drive3 - 8k total=24k
drive4 - 8k total=32k
drive5 - 8k total=40k
Now it is time to calculate parity. This requires the controller to read data from drives 1, 2, 3, 4 and 5, calculate the parity using an XOR algorithm, then write the parity.
drive6 - 8k parity
drive1 - 8k total=48k
drive2 - 8k total=56k
drive3 - 8k total=64k
Now it has to calculate and write parity again.

compare this to a stripe size of 64k
drive1 - 64k total=64k
calculate parity, write parity
drive6 - 64k parity

Having a poorly configured stripe size can cause a huge performance problem. NT and NetWare (current versions) both optimize their disk writes to 64k. YES! I know the block size in NT is 4k, but the OS still optimizes disk requests to 64k chunks for performance reasons. I'm not sure about various *nix, can someone else answer that? Some people have the notion that writing smaller amounts of data to multiple hard drives is somehow faster. Hard drive maximum transfer rates are based on controller->hdd cache. A 64k or 8k write isn't going to fill up the cache on the controller, and a single 64k write will take less time on the controller, fewer commands will need to be issued, and performance will be better overall.
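
The walk-through above can be turned into a small counting sketch. It follows the simplified model in this comment only: the request starts at the beginning of a stripe row, and parity is written once per row touched. The function name is made up for illustration, and a real controller with write-back cache may coalesce writes and behave differently.

```python
def count_writes(request_kb, stripe_unit_kb, drives):
    """Return (data_writes, parity_writes) for one host write on RAID 5.

    Simplified model: the request starts on a stripe-row boundary and
    parity is written once for every row the request touches.
    """
    if drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    data_drives = drives - 1
    # Ceiling division: number of stripe-unit-sized pieces in the request.
    data_writes = -(-request_kb // stripe_unit_kb)
    # Each row holds one piece per data drive plus one parity piece.
    parity_writes = -(-data_writes // data_drives)
    return data_writes, parity_writes


if __name__ == "__main__":
    for stripe_unit_kb in (8, 64):
        data, parity = count_writes(64, stripe_unit_kb, drives=6)
        print(f"{stripe_unit_kb}k stripe: {data} data writes, {parity} parity writes")
```

For a 64k request on six drives this reproduces the counts in the example: eight data writes and two parity writes at 8k, against one of each at 64k.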

An anecdote about this.
Copying a 1.5 gig file from a workstation to a server with the stripe size at 8k took about 40min, with the stripe size at 64k it took 6min

Another consideration is how much cache the controller has and what its use is. The AMI MegaRAID controller has 3 types of cache: Write, Read and IO. Write cache allows for Lazy Writes, which can improve performance. Read cache will allow the controller to read ahead, hopefully improving performance. IO cache (and I2O cards) allow the controller to take some of the work off of the processor, improving overall system performance.

Some controllers come with multiple channels. The AMI MegaRaid series 438 controller has 3 different SCSI channels on it. IIRC each channel can transfer up to 80MB/S. This is similar to the idea of putting hard drives on different SCSI controllers except that I've never seen an implementation that allows a raid array to span multiple controllers.

The above info IS NOT ACCURATE for RAID 0, RAID 1, or RAID 3, those levels have different rules. You should consult the OS vendor, documentation, and Database vendor for specific settings to optimize the controller.

mail configuration by lucky+luck · 1999-11-18 15:57 · Score: 5

Hi,
a couple of years ago we had the same problem till I discovered that all our mailboxes were in one mail spool directory. This was a huge bottleneck, and after adapting qpopper and configuring sendmail to use a split mailspool dir, load came down to 1. (Split mailspool is /mail/a /mail/b /mail/c, and all users whose names begin with an a are placed in /mail/a, etc.)
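
The mapping itself is tiny. A minimal sketch follows; the function name is hypothetical, lowercasing the first letter is an assumption, and this is not how qpopper or sendmail implement it internally.

```python
import posixpath


def spool_path(username, root="/mail"):
    """Map a user to a per-letter spool file, e.g. alice -> /mail/a/alice."""
    if not username:
        raise ValueError("username must not be empty")
    # Assumption: bucket names are lowercase first letters.
    bucket = username[0].lower()
    return posixpath.join(root, bucket, username)


if __name__ == "__main__":
    for user in ("alice", "bob", "Carol"):
        print(user, "->", spool_path(user))
```

With the users spread across the buckets, each directory holds a fraction of the mailboxes instead of all of them, so lookups in any one directory stay short.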

check above first before you buy hardware
I just did this... my experiences by abulafia · 1999-11-18 16:12 · Score: 5

Our mail server is currently handling about 1M messages a day. IO became a serious issue. We're still using sendmail, and I'm not going to give it up (we know it, we have custom builds for strange applications, it works). As others have noted, load average doesn't mean much here - I have some machines with a load average at 4 that are actually idle and fine, and others at .2 that need tuning. Ignore it and concentrate on what matters.

Assuming IO matters, I am putting my full faith (and job) on Mylex controllers. I love them. I only have one in production, but am about to deploy 5 more, and we'll come in at about 600G managed by them. They just work. The DAC960SXi I have in production (for 7 months now) has been flawless, delivering wire speed doing RAID 5 without any effort after initial config (which is a bit annoying, to be sure).

My production system using it is doing far too many things - mail, staging server, enterprise backup. This is changing - lack of time and historical accident made it that way. The point is that the Mylex handles it with no grief.

If you're building these, be aware that Mylex external controllers need to be mounted in a box with "internal" style connectors. For good RAID cases, check out http://www.storagepath.com/ - they are what I'm using. They look low rent, but the boxes are nice (if a bit expensive).

Down to specifics. For a mail only machine doing the sort of volume you're talking about, I'd deploy a dual processor box with three SCSI busses (one for spool, two for mbox/system access - system access is pretty cheap in comparison) attached to two hardware RAID setups. Granted volume allows, I'd go RAID 5 for spool (with 18G disks, that's ~65G spool) and hot spares. For mboxes, I'd do 0+1, for as much space as needed. Stripe disks on independent controllers, mirrored to each other. Striped mirrors can grow, as you need them to (RAID 5 can't, easily). You don't want to lose anyone's mail. Hot spares for each.

Assuming 100G of mboxes, that's a total of 17 18G disks. Add three Mylex DAC960SXis and (initially) 3 rack mount cases, and that's something around 24K.
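
The post does not show how the disk count is reached. One way to arrive at 17, sketched below, is to size a RAID 5 set for the ~65G spool and a 0+1 set for 100G of mboxes and leave hot spares out of the count. That breakdown is a guess, and the helper names are made up for this example.

```python
import math


def raid5_disks(usable_gb, disk_gb):
    """Disks needed for a RAID 5 set: enough data disks plus one for parity."""
    return math.ceil(usable_gb / disk_gb) + 1


def raid01_disks(usable_gb, disk_gb):
    """Disks needed for RAID 0+1: a stripe set large enough, then mirrored."""
    return 2 * math.ceil(usable_gb / disk_gb)


if __name__ == "__main__":
    disk_gb = 18
    spool = raid5_disks(65, disk_gb)    # ~65G spool from the post
    mboxes = raid01_disks(100, disk_gb)  # 100G of mboxes from the post
    print(f"spool: {spool} disks, mboxes: {mboxes} disks, total: {spool + mboxes}")
```

Under these assumptions the spool takes 5 disks and the mboxes take 12; any hot spares would be on top of that.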

Availability beyond disk is a different question, that gets platform specific. I do mainly Solaris now, so I can't talk much about Linux for this. Mylex controllers can do dual active/dual host configurations, but things get more complex, and a summary here doesn't make sense.

Other options like A1000s (Sun specific) and Netapps require different approaches - they're very different beasts. We have all of the above, and treat them very differently. We'll buy them all again - they're all decent - but are good at different things.

If you can, buy raw Mylex controllers through a reseller like TechData or similar - you'll save a lot.

Hope this helps some.

-j

--
I forget what 8 was for.
Evaluating RAIDs by Grimwiz · 1999-11-18 16:13 · Score: 4

The first thing about a hardware raid controller is that it hides failures from the operating system. With software RAID you have to manually carry out all sorts of tasks, and I'm sure we've all heard of the engineer who mirrored the new blank disk on top of the one remaining data disk of a mirror.
Units such as SUN A1000 and Baydel connect via SCSI and you just watch for an orange light; even the part-time cleaner could pull out the correct disk and replace it and have the system back and running without the OS noticing. Storageworks and Clariion (EMC) do the same but over Fiber Channel. SCSI units tend to top out at 40MB/s, Fiber Channel theoretically tops out at 200MB/s (they have two 100MB/s loops) but since I only had a max of 30x18GB disks to play with the disks were the bottleneck. Monster multi-scsi machines like EMC/IBM's can achieve whatever bandwidth you want by multiplexing SCSI connections.
We've evaluated software RAID, Hardware RAID over SCSI, Hardware RAID over Fiber channel from EMC, IBM, SUN, Compaq(storageworks) and in our opinion a good smart raid controller with two data channels and load balancing software is impossible to beat.
For speed, stripe (0) mirrors (1) together, in RAID 0+1. This allows reads at double speed because each mirrored disk can handle a request separately, and slightly sped-up writes because you can write to the RAID controller's NV cache and carry on doing your work whilst that takes care of putting the data to media.
This of course has only a 50% data efficiency.
Using RAID 3 or 5 you lose one disk in a rank for parity; RAID 6 (used by Network Appliances) uses two disks for parity but has wider ranks of disks. This often means that sequential reads are fast, because a request for data wakes up all the disks in the rank, but therefore the whole rank can only handle one request at a time. Writes are slower because you have to read a stripe of data, calculate parity and write the whole stripe back again.
RAID5 is really good for data which doesn't have to be the absolute fastest.
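
The parity mentioned above is a byte-wise XOR across the data blocks of a stripe, which is what lets any single lost block be rebuilt from the survivors. A minimal sketch, with three short byte strings standing in for the data blocks on three disks (the block contents and helper name are made up for illustration):

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same size")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    data = [b"mail", b"pool", b"data"]  # stand-ins for blocks on three data disks
    parity = xor_blocks(data)           # what the parity disk would hold

    # Simulate losing the second disk and rebuilding it from the rest.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print("rebuilt block matches original:", rebuilt == data[1])
```

The same property explains the write cost: changing any one data block means the parity block for that stripe has to be recomputed and rewritten as well.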
Whilst we were doing performance tests, we measured a linear increase in speed up to 20 disks (in transactions/second), and there is a definite art in making sure that you spread the load over all the disks available so that a single disk doesn't get thrashed to death.
In conclusion? well, that depends on your OS.
For me, for a PC-based system I would choose a hardware RAID system with SCSI connection which let me choose the LUN sizes. 5 disks in a RAID5 configuration will only waste 1 disk in capacity. If you're finding your mail spool is being thrashed then I would build a 10 disk 0+1 raid and stripe the mail area across them, using the rest of the area for home areas or web areas or something else which has large storage requirements but doesn't get hit hard.
Oops, this assumes that this REALLY is your problem, a lot of disk problems go away by adding more memory to the machine... I assume you have measured this by tracking the outstanding I/O queue.

--
-- Don't believe everything you read, hear or think
Wrong, wrong, wrong by jocks · 1999-11-18 16:48 · Score: 4

I accept that you will need to test to make sure that the disks are not the problem but you will need to do it the right way.

Firstly vmstat tells you very little about disk i/o. What it is good for is the processes. Look at the output from vmstat 5 for example. The first three columns are r b w: running, blocked and waiting. If there are blocked processes look at WHY processes are blocked. Use top to get the i/o wait information. If there is a lot of io wait then look at the disks. Use iostat -D to get percentage utilisation of the disks. If there is a lot of disk wait then you may need to either add more disks or spread the load.

It is interesting to note the relative speeds of devices:
If cpu takes 3 seconds to do a job then,
Level 1 cache takes 10 seconds
Level 2 cache takes 1 minute
Memory takes 10 minutes
Disk takes 7.7 months
Network takes 6.5 years
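
The analogy above does not state its scale. If one assumes that 3 seconds stands for about 3 nanoseconds (a scale factor of one billion, which is my assumption and not part of the post), the human-scale figures can be converted back to rough real latencies. A 30-day month and a 365-day year are also assumed.

```python
SCALE = 1e-9  # assumption: 1 second in the analogy = 1 nanosecond in reality

SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "month": 30 * 86400,   # assumes a 30-day month
    "year": 365 * 86400,   # assumes a 365-day year
}

# (device, amount, unit) exactly as listed in the analogy.
ANALOGY = [
    ("CPU", 3, "second"),
    ("Level 1 cache", 10, "second"),
    ("Level 2 cache", 1, "minute"),
    ("Memory", 10, "minute"),
    ("Disk", 7.7, "month"),
    ("Network", 6.5, "year"),
]


def real_latency_seconds(amount, unit, scale=SCALE):
    """Convert a human-scale duration from the analogy to real seconds."""
    return amount * SECONDS_PER_UNIT[unit] * scale


if __name__ == "__main__":
    for device, amount, unit in ANALOGY:
        latency = real_latency_seconds(amount, unit)
        print(f"{device:14s} {latency * 1e6:14.3f} microseconds")
```

On that reading the disk figure works out to roughly 20 milliseconds and the network figure to roughly 200 milliseconds, millions of times slower than the CPU step, which is the point of the comparison.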

Get stuff off your disks better! Monitor your cache hit rate to get information on efficiency. Use vmstat or sar or stuff from the se toolkit. Get the se toolkit from http://www.sun.com/sun-on/net/performance. Run zoom.se to monitor your system. Run virtual_adrian.se to tune your system. Use the right tools and don't just add more memory, identify the bottleneck, fix the bottleneck, re-test and repeat until the performance is satisfactory.
Re:Dell Powervault by Salamander · 1999-11-18 21:07 · Score: 3
>the whole point of the NetApps is to be faster than local storage. and they are, as long as your network is fast enough.

I think network-attached storage is a fine idea and the "right solution" for many things, but I just have to add a rebuttal here anyway.

Network-attached storage is faster than local storage if your network (including the protocol stack) is fast enough and your local-storage subsystem (including its own separate protocol stack) is slow enough. That's a totally useless claim. It's like saying that a train is faster than a car, leaving out the part about the train being an unloaded bullet-train engine on an empty track and the car being a Yugo stuck in New York traffic.

In actual fact, the raw bandwidth of modern storage interconnects (e.g. UW SCSI, FC) is higher than that of most network interconnects (e.g. 100baseT) for which the adapter cost is similar. In addition, the protocols used for storage (e.g. SCSI, the various layers of FC) are more suited toward that task - duh - than are the protocols used for networking (e.g. TCP/IP). There is no reason in hell that it should be faster to use network interconnects and protocols to access your storage than to use storage-specific interconnects and protocols.
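
To make the bandwidth comparison concrete, here is a minimal sketch using nominal peak rates for the interconnects named above. The rates are standard headline figures rather than measurements, the 1000 MB transfer size is arbitrary, and protocol overhead is ignored entirely.

```python
# Nominal peak rates in MB/s; real throughput is lower once protocol
# overhead and contention are included.
NOMINAL_MB_PER_S = {
    "100baseT Ethernet": 100 / 8,   # 100 Mbit/s
    "Ultra Wide SCSI": 40,
    "Ultra2 Wide SCSI": 80,
    "Fibre Channel (1 Gbit)": 100,
}


def transfer_seconds(size_mb, rate_mb_per_s):
    """Best-case time to move size_mb at a given nominal rate."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mb_per_s


if __name__ == "__main__":
    size_mb = 1000  # arbitrary example transfer
    for name, rate in NOMINAL_MB_PER_S.items():
        print(f"{name:24s} {transfer_seconds(size_mb, rate):6.1f} s")
```

Even in this best case the 100baseT link is several times slower than any of the storage interconnects, before the protocol differences are taken into account.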

Why might it appear that network-attached storage performs better? I can think of at least three reasons right off the top of my head:
- Many computers are "unbalanced". They are misdesigned or misconfigured so that they have a lack of direct-to-storage capability coupled with an excess of network capability. This may actually make NAS the correct solution for that environment but is irrelevant when considering the overall merits of the two approaches.
- Network-attached storage devices often benefit by having much more cache than direct-attached storage devices. If you took that same amount of cache and applied it to the direct-attach devices, the NAS boxes wouldn't look so good.
- The caching strategies used for NAS - i.e. those in NFSv3 - sacrifice consistency for speed, while direct-attach systems are held to a higher consistency standard. Everyone who has tried to use NFS for something where data consistency or up-to-date modification times matter - even something like "make" - has probably cursed NFS already over this. Some NFS vendors make things even worse by failing even to meet the NFS requirements. Sun's own Solaris NFS client, for example, doesn't always flush data when it's supposed to. If you added all the appropriate sync() operations and fixed the NFS implementations so that your NAS solution was really doing the same thing as your direct-attach solution, you might see some different performance comparisons. Note, though, that for many applications the NFS tradeoff and hence the NAS solution is pretty reasonable.
At this point I should disclose my own biases. First, I work for EMC. That's not by choice - the company I was working for got bought out - and I'm often not thrilled about it, but the pay is good. In particular, I don't buy in to all of EMC's arrogant "storage is the center of the universe and the Symmetrix is the ultimate storage device" attitude, and I heartily dislike our own Celerra NAS product even though it blows the doors off NetApp in terms of performance and scalability. Secondly, my professional areas of interest include distributed, cluster, and SAN filesystems, so I of course have some fairly strong opinions on such matters. That said...

I think that once we start seeing true, mature, multi-platform shared-storage filesystems, NAS will start to seem much less appealing. Why pay for NAS when you can just add software to your existing hardware investment and get all the sharing with almost all the performance of local access? Now all we need is a decent implementation of such a filesystem.
--
Slashdot - News for Herds. Stuff that Splatters.