Pros & Cons of Different RAID Solutions

Posted by Cliff on Thursday November 18, 1999 @07:13PM from the looking-for-solutions dept.

sp1n writes "Our mail server has hit a major disk bottleneck, and we're considering a RAID 5 solution. We're a local ISP with 13,000 users that's been around since 1993. The mail server is a Sun UltraSparc 2 w/ 768M ram, a 10000rpm 9 gig for mail storage, a 7200rpm 7 gig for spool, and another for system mountpoints, all on a u2w bus, and it's running exim. It reaches a load of 10 on a daily basis, and hits around 20 once a week. The spool and mail storage drives are cranking away constantly. Requirements are: external RAID controller, scalability, speed, and a rackmount case for the controller & drives." Sound intriguing? If you've ever wanted to learn a bit about RAID, then hit the link below.

sp1n continues: "We are currently considering 3 options:

(1) SCSI - EIDE controller with six 9G/7200 ATA drives (hadn't heard of this one until recently). This supposedly accesses the drives directly through DMA and bypasses all IDE, just using them as physical media. All are accessed in parallel. I'm a bit wary about the reliability of IDE drives under constant use.

(2) SCSI - SCSI controller with six 9G/7200 u2w drives. The controller currently at the top of my list is the Mylex DAC960SXi w/ 32MB cache. However, something that fits in a half-height bay, instead of hogging a full-height would be nice.

(3) SCSI - SCSI controller as above, running with 2 disk channels and 2 separate RAID 5 arrays for each mountpoint (spool/mail storage).

I'm looking for any experience with IDE/DMA raid setups (1), as well as the pros/cons of making 2 partitions, both of which are very active, on one array of 6 drives (2), as well as 2 separate level 5 arrays of 3 for each mountpoint (3). In addition, any suggestions for external controllers and rackmount enclosures would be greatly appreciated. I would like the controller to have an i960 or better processor."

--
"The glass is not half full, nor half empty. The glass is just too big."

Check to be 100% sure drives are the problem by vectro · 1999-11-18 14:36 · Score: 5

Before you go out and purchase an expensive RAID solution (of any kind), make sure this is really the problem. The vmstat command will make it quickly apparent what kind of I/O is happening, and further analysis might tell you more about what kind of disk accesses are happening.

In many cases, adding more memory or CPU can make a bigger difference than more/faster hard drives, if the problem is that the cache is too small, or paging activity too much. Also check your CPU load and make sure it is nowhere near 100% - if so, time to get a 2nd CPU.

Also, avoid software RAID implementations like the plague. They will slow down your system and provide questionable reliability. You should also try to find cards that have redundant SCSI controllers onboard, and support redundant cabling. This way if the cable, plug, or SCSI bus fails for some reason you will not be SOL.

Finally, be sure that the majority of your disk accesses are reads. RAID will slow down writes, sometimes drastically so. If the majority of your disk accesses are writes, then tuning your kernel to flush dirty buffers less often may make a good difference.
Pricey but attractive by synaptic · 1999-11-18 14:50 · Score: 2

The Network Appliance Filers are really sexy.

The beautiful thing is they use the WAFL filesystem so you can expand your array when you need to without adding big sets of drives.

Granted, I don't have one but I've submitted the proposals and am waiting on financing. The F720 scales to 464GB, is network attached, has journaling (rad), and can benefit your WHOLE network.

Of course, you have to use NFS or SMB though. I've heard they start as low as $17k but usually $30-40k with a bunch of drives but it's difficult to find general prices without hearing the sales pitch.

This paper discusses the testing that the Stanford Linear Accelerator Center performed while evaluating the NetApp filers. It's geared toward Usenet news but if it can handle that, it can surely handle your mail situation.

Does anyone here have first hand experience good or bad with NetApp Filers? And some word on the pricing?
Couple things by Falsch+Freiheit · 1999-11-18 14:56 · Score: 5

First off, it's not clear from your post how heavily loaded the drives really are.

In particular: load is a measure of how many processes are using or waiting for a resource (such as disk I/O, CPU or network I/O). On a busy mail server that's completely adequate for the job, I'd expect to often see a high load average due to the number of processes that are waiting on the network. That is, due to the number of processes waiting for slow network connections to places halfway around the world.

All you mention is the load averages and a fairly non-specific measure of drives that are "cranking away constantly". If the drives were being used at a current constant 10% of available I/O, they'd tend to "crank constantly" even if they could be hit much harder. (still, given that losing email is considered bad by customers, a RAID 5 solution seems like a good idea anyways and leaves you room to grow and handle sudden increases in email from the holidays or spammers or gradual expansion of business)

As to IDE vs. SCSI -- never go with straight IDE on a server. SCSI has the ability to lie to the OS and silently move data from sectors that have gone bad into sectors reserved for that purpose. Sure, it slows down access to that particular block of data, but it's a lot easier than the OS having to deal with failures directly. However, I'm completely unfamiliar with the strange SCSI - EIDE setup that you're describing -- if it treats them as just physical media and provides the SCSI interface itself, it may be able to do that particular SCSI trick, as well. Physically, SCSI drives and EIDE drives are identical -- as in, you can find the *exact* same drive from certain manufacturers, only one has SCSI and the other EIDE. Reliability of the physical media is the same, IOW. In a normal configuration, *apparent* physical reliability is higher for SCSI due to wonderfully useful trickery.

I don't recall the exact model numbers, but I've seen pretty good results with Mylex RAID controllers before. (more along the lines of database stuff than what you're talking about -- somewhat different needs, but not all *that* different, I suppose.)

I can't see putting two partitions on one RAID device as making a lot of sense -- since things are striped you'd end up running into contention issues.

IOW: I'd guess that option #3 would be the fastest -- it's also probably the most expensive.

If I were you, I'd check more carefully to determine how much of the currently available disk I/O is actually being used... If the budget allows it, the dual-channel RAID solution sounds pretty good. You might want to go with two single-channel RAID cards instead -- makes it easier to stock a backup card in case a card decides to die. Try and get something with hot-swappable drives, too. It makes the RAID stuff so much more useful.

Also, I don't know the details of your setup (of course), but seriously consider breaking the mail serving task into separate pieces and run it on separate machines.

You have:
1) incoming email
2) outgoing email
3) email from customers
4) email customers pick up (POP)

It sounds like you have one machine handling all of these. Consider breaking these tasks onto separate boxes. (If you've made the mistake of telling customers the same thing for #3 and #4, i.e. mail.isp.net instead of mail.isp.net and pop.isp.net, it might be impossible to split those two tasks away from each other.)

You can have a setup such as:
outgoing1 through outgoingN all behind the single name of "outgoing" that internal machines are told to send email to that they don't know how to deal with
mail1 through mailN all behind "mail" that customers are told to have as their outgoing mail server. In particular, it should blindly send off email it doesn't know how to deal with to outgoing.
pop (harder to break into separate machines, but possible)
incoming1 through incomingN with MX records pointing at them for your domain.

Now, breaking into that many machines is probably silly. Moving outgoing to one machine and everything else to a second machine (and possibly mailing lists off to a third machine) may make a *lot* of sense though. Don't get tied into the idea of a monolithic machine to accomplish everything related to a particular task -- eventually it's much more expensive than many cheaper boxes to handle the same task.
1. Re:Couple things by Forward+The+Light+Br · 1999-11-18 16:07 · Score: 2
  
  port forward to a dedicated POP server... its not so bad ;-)
  We are all in the gutter, but some of us are looking at the stars --Oscar Wilde
  
  --
  
  Grrr. my nick is "Forward the Light Brigade"...
2. Re:Couple things by kijiki · 1999-11-18 16:22 · Score: 3
  
  In particular: load is a measure of how many processes are using or waiting for a resource (such as disk I/O, CPU or network I/O). On a busy mail server that's completely adequate for the job, I'd expect to often see a high load average due to the number of processes that are waiting on the network. That is, due to the number of processes waiting for slow network connections to places halfway around the world.
  
  Correct me if I'm wrong, but isn't the load the average number of processes in the run queue? This would mean that processes that are blocked on the network or disk would be in the sleep (wait) queue, and not counted in the load average.
  
  In this case, a load of 20 means 20 processes are ready to run, which is not so good.
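  For anyone who wants to look at the number being argued about, here is a minimal Python sketch that reads the load averages and divides the 1-minute figure by the CPU count. The helper name load_per_cpu is made up for illustration. Exactly what gets counted differs between kernels (Linux, for instance, also counts processes in uninterruptible disk wait), so treat the result as a hint and check your platform's documentation.

```python
import os


def load_per_cpu(load_avg, cpu_count):
    """Normalize a load average by the number of CPUs (hypothetical helper)."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"1-min load {one_min:.2f} on {cpus} CPU(s): "
          f"{load_per_cpu(one_min, cpus):.2f} per CPU")
```

  A load of 20 on a two-CPU box works out to 10 per CPU, which says the machine is busy but not, by itself, what it is waiting on.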
Fibre Channel RAID by thesteveco · 1999-11-18 14:58 · Score: 5

We've just spent 2 weeks at my office researching the different solutions available to us for implementing the most reliable and scalable solution available today. Our needs differ a bit from yours as we're looking to put many machines on a network for load-distribution yet they all need to speak to the same data on a single repository. This holy grail is known as a SAN, or Storage Area Network.

Our solution is going to be a single cabinet RAID (level 5 for accessing smaller files) with a "hot spare" that will rebuild a crashed disk on the fly. This being a standard cabinet we'll have 8 disks, of which the capacity of 6 will be data (one parity (term used loosely as parity is striped on RAID-5), and one spare).

The disks are Seagate's 10,000 RPM Cheetahs, the most commonly recommended units among all the vendors we've talked to, and the controller is a multi-channel u2w with fibre interface to a Q-Logic PCI adapter.

The total system is going to run just over $15,000. This sounds like a lot, but pricing lower end systems isn't too much cheaper and you'll never get 24-hour turnaround on failed parts (if they're even available). This seems like overkill for a single system, but by adding a fibre hub later we can use the single system for many many machines once a file controller (dedicated machine) is put into place.

The beauty of SAN is that it operates much like FTP, with a control and a data connection. The control connection occurs over your existing LAN, and the data is transmitted directly over the fibre channel (max rate of 100 MB/s).

Other NAS (Network Attached Storage) models are somewhat cheaper to implement, but performance can never match the fibre as the "control" and "data" connections (NFS or SMB) both transmit across your network.

I apologize for digressing from the straight RAID topic, but I felt obligated to give the /. community something to chew on in return for all that I've learned here.

-Steve
What about the AMI MegaRaid cards? by Wakko+Warner · 1999-11-18 14:58 · Score: 2

I'm thinking of getting one myself. It's supported in Linux, does hardware raid 0, 1, 0/1, 3, 5, 30, and 50. Does anyone have one? Is it decent? Can I trust my data to it? It's $150 on pricewatch, which sounds like a damn good deal for something with its own CPU on board.

- A.P.
--

"One World, one Web, one Program" - Microsoft promotional ad

--
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
1. Re:What about the AMI MegaRaid cards? by jemhddar · 1999-11-18 15:42 · Score: 2
  
  The AMI MegaRaid cards are excellent IMO. Very clean setup of raid arrays, uses SIMMs for its cache (on some models) so you can upgrade the cache easily. Like you mentioned it has an Intel i960 processor and the newer ones are I2O devices. I2O is a standard where the card has a processor to offload work from the main system processor. Not all OS's support it though. I2O will also allow for 1 driver per card, instead of 1 driver per card, per OS, per OS version.
  
Suggestions by mosch · 1999-11-18 14:59 · Score: 2

On the IDE vs. SCSI question, be careful. With some drives the difference really is just a chip, but often drive manufacturers will use different actuators and such for SCSI drives (due to the fact that they're more likely to be dropped into a high-stress environment). The MTBF for a drive that's expecting to run grandma's recipe book is not relevant when used as a high-stress server.

I'd suggest a SCSI or Fibre Channel raid array, with some 10,000RPM drives, and lots of cache on the drives and the controller. If you are currently IO-bound, you want to make sure that you remove that bottleneck for at least a couple years. Some sort of external enclosure might be nice if only due to the fact that 10,000RPM hard drives make a LOT of heat, so it keeps things a little less critical. Oh, and of course I'd recommend using RAID-5 for obvious reasons. RAID-0 is faster, but clinically insane.
try to do some benchmarking before you buy by troutman · 1999-11-18 15:11 · Score: 4

This is only mildly applicable to your question since it isn't for Solaris, but it is all I have to offer.
I spent a fair amount of time looking at RAID 5 solutions this past summer for a client. Both external and internal, for Linux. Tried several different controller card brands and drive configurations, did a lot of reading, and bugged a lot of vendors.
You really should try to test your options and all of the configuration combinations using something like Bonnie, on a machine with a similar configuration to your target server. Make sure that your Bonnie test file size is at least twice physical RAM, to eliminate the effects of RAM and controller caching on the results.
I found that using 6 drives in a RAID 5 config was a LOT faster than 5 drives, most of the time. In fact, 3 drives in an array was faster than 5 in some cases. I think it has to do with the way the controller cards were calculating the distributed parity, and perhaps also due to things the driver was doing. 4 drives usually wasn't much better than 3, either.
Stripe sizes for the array can also make a big difference. 32k vs 128k, etc. Larger stripe sizes are usually better for I/O speed, but you may find for email that having a higher number of random seek transactions per second is better than raw speed.
I did not get a chance to do any hard testing of multiple channel configurations with these cards. I suspect that splitting the I/O onto multiple channels would be a win.
IMHO, you definitely want an i960 based board or system, with the fastest CPU you can find on them. I noticed a significant difference between boards with the 33MHz part vs. the 66MHz part.
FYI for others: for controllers, the AMI MegaRAID (alias Dell's PERC2/SC) just blows chunks. Older non-LVD, non-raid SCSI systems can run rings around it, at least on write speed.
It has been my experience that the write speed on a RAID 5 system is generally only a fraction of the reading speed, like 1/4th to 1/2. For a quick and stupid test, do something like 'time cat /proc/kcore > /tmp/kcore' and do the math for MB/second.
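The "do the math" step for a timed copy like that is just bytes divided by elapsed seconds. A minimal Python sketch, with a made-up helper name and made-up example numbers (nothing here was measured on the systems discussed):

```python
def throughput_mb_per_s(num_bytes, elapsed_seconds):
    """Return throughput in MB/s, taking 1 MB as 2**20 bytes."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_bytes / (1024 * 1024) / elapsed_seconds


# Hypothetical numbers: a 128 MB file written in 16 seconds of wall-clock time.
print(throughput_mb_per_s(128 * 1024 * 1024, 16.0))  # 8.0
```

Use the "real" (wall-clock) figure from time, and keep in mind that a file smaller than RAM mostly measures the cache.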
Oh, and my current favorite card is the DPT Millennium V controller; I've been using it in several systems in various places for the last 3 or 4 months. Here are some Bonnie results for a system with a DPT with 6x 7200 RPM drives, all on the same channel (internal), Linux kernel 2.2.10, dual P3 500MHz:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1024 7637 97.5 16743 15.2 9561 19.4 8384 98.3 52923 36.2 583.2 9.0
something isnt right by ZxCv · 1999-11-18 15:11 · Score: 3

Our setup has right about 31,000 users constantly checking and sending email and is running RH 6.1 on a dual PII/333 with 128MB RAM and 9G UW SCSI. I haven't seen a load higher than 0.75 since that machine has been the mail server... maybe something about how your mail server is set up is creating a tremendous bog on it.

--

Perl - $Just @when->$you ${thought} s/yn/tax/ &couldn\'t %get $worse;
Load Ave 10 need not mean an IO Bottleneck. by noidd · 1999-11-18 15:15 · Score: 2

Load average is defined as the number of processes sitting on the run queue. This need not indicate a disk IO bottleneck.

I would be surprised if any exim system was having more of a bottleneck to disk than it was to network. Your disks are faster than your network and exim is pretty light on un-required disk access.

The larger the bottleneck to the network (by network I mean end-to-end with your customer, not just your links), the longer processes are going to hang around.

More processes, more paging, less caching. Less caching, more IO. More paging, more IO.

Probably teaching granny to suck eggs - but you do have your swap space on a separate device don't you ;)

The more exim processes that hang around longer, the more processes for the CPU to switch around. The more switching, the more likely you are to see paging.

If the processes hang around longer, they take up more memory which reduces the cache-size available.

Exim has several files which it accesses frequently, mainly the retry databases and its configuration. These should permanently be in memory.

Bottom Line:

I do however suggest that you don't consider moving a single server to RAID. If you have a server that you want to move to RAID for efficiency purposes... your design is wrong and you should be building a scalable system.

Red
1. Re:Load Ave 10 need not mean an IO Bottleneck. by remande · 1999-11-18 20:56 · Score: 2
  
  Load average is defined as the number of processes sitting on the run queue. This need not indicate a disk IO bottleneck.
  Indeed, a high load average indicates that there is no I/O bottleneck, and a low load average may indicate an I/O bottleneck.
  The run queue holds only those processes that the kernel thinks can constructively use CPU cycles. Once a process asks the kernel to access an I/O device, the kernel decides whether the device is currently available. If not, the process gets kicked off the run queue until the device becomes available again.
  Thus, if you have a lot of processes hitting the same device, an I/O bottleneck would actually drop the load, as there are fewer processes able to use the processor.
  
  --
  --The basis of all love is respect
SCSI RAID by JSG · 1999-11-18 15:20 · Score: 2

Personally speaking for a load of this magnitude SCSI is the only solution.

Don't even think of software RAID.

For some background on SCSI itself try http://www.scsifaq.org

There are many types of RAID. Levels 0-5 are the "standard" ones, but there are several new ones, e.g. level 10, which attempts to address throughput issues. Your actual space requirements don't seem outrageous so level 5 would be reasonably cost effective.

Another thing you will probably want is hot swapping. Once you've had a box tell you a drive is dead, you've removed it and popped a new one in without taking the box down, you will not want anything else.

On the IDE vs SCSI debate, whilst IDE is fast it seems to me that under continuous load SCSI gives better throughput.

As others have pointed out - a 'designed' server, rather than a "roll your own" box would make sense. Compaq Proliants make excellent Linux machines. The SMART arrays are very good and support RAID to level 5. You can fit a lot of disks in the drive cages as well. They are a little pricey but of a good quality and reliability. We have rather a lot of them running NetWare. I get to use the older kit to run my funny Open Source stuff ...

A suggestion might be:
Proliant 1600, 2 x 600MHz processors, SMART 3200 with 64Mb cache, 5 drive slots - 81 Gb available after RAID 5 on 18Gb 1" drives (that's Ultra-2 SCSI), supports up to 1Gb RAM (has 128 by default). There is also an on-board SCSI interface for CDROM etc. This comes in at about GBP 9,000
Re:More spindles, more simultanious reads by jemhddar · 1999-11-18 15:22 · Score: 5

My raid experience comes from nt software raid and using AMI MegaRaid controllers. For performance the following things are important

PCI Bus-- The fastest controller/drives won't make a difference if the PCI bus can't get data to the drives fast enough. Look at what else you are running, consider upgrading memory/processor like another person said.

Stripe Size-- In a hardware raid setup the controller will write to one hard drive for xxx kb before switching to the next hard drive. You want to figure out what size 'chunks' of data the OS will send to the controller. Netware uses a 64k block size, which means large file reads/writes will be sent from the OS to controller in 64k pieces. If your stripe size is set to 8k, and you have 6 hard drives in a raid 5 array, look at the following situation.
drive1 - 8k total=8k
drive2 - 8k total=16k
drive3 - 8k total=24k
drive4 - 8k total=32k
drive5 - 8k total=40k
now time to calculate parity. this requires the controller to read data from drive1,2,3,4,5, calculate the parity using an XOR algorithm then write the parity
drive6 - 8k parity
drive1 - 8k total=48k
drive2 - 8k total=56k
drive3 - 8k total=64k
Now it has to calculate and write parity again.

compare this to a stripe size of 64k
drive1 - 64k total=64k
calculate parity, write parity
drive6 - 64k parity

Having a poorly configured stripe size can cause a huge performance problem. NT and NetWare (current versions) both optimize their disk writes to 64k. YES! I know the block size in NT is 4k, but the OS still optimizes disk requests to 64k chunks for performance reasons. I'm not sure about various *nix, can someone else answer that? Some people have the notion that writing smaller amounts of data to multiple hard drives is somehow faster. Hard drive maximum transfer rates are based on controller->hdd cache. A 64k or 8k write isn't going to fill up the cache on the controller, and a single 64k write will take less time on the controller, fewer commands will need to be issued, and performance will be better overall.

An anecdote about this.
Copying a 1.5 gig file from a workstation to a server with the stripe size at 8k took about 40min, with the stripe size at 64k it took 6min

Another consideration is how much cache the controller has and what its use is. The AMI MegaRaid controller has 3 types of cache: Write, Read and IO. Write cache allows for Lazy Writes, which can improve performance. Read cache will allow the controller to read ahead, hopefully improving performance. IO cache (and I2O cards) allow the controller to take some of the work off of the processor, improving overall system performance.

Some controllers come with multiple channels. The AMI MegaRaid series 438 controller has 3 different SCSI channels on it. IIRC each channel can transfer up to 80MB/s. This is similar to the idea of putting hard drives on different SCSI controllers except that I've never seen an implementation that allows a raid array to span multiple controllers.

The above info IS NOT ACCURATE for RAID 0, RAID 1, or RAID 3, those levels have different rules. You should consult the OS vendor, documentation, and Database vendor for specific settings to optimize the controller.

Procedure by jocks · 1999-11-18 15:34 · Score: 2

First try iostat -D -l (numberof disks+2) 5 to get percentage utilisation in 5 second intervals.

This is my favourite tool for disk analysis. Secondly go to http://www.sun.com/sun-on-net/performance read what you feel is important but download the se toolkit.

Run zoom.se to get a professional analysis of your system. Run virtual_adrian.se to get a virtual professional to tune your box.

I recommend you do this BEFORE spending any money. I have an E3000 with 2Gb RAM and 2% processor utilisation because nobody checked the system properly.

If it is your disks I recommend sun kit even though it is expensive and RAID 5. Don't worry about people telling you about it being slower, compared to a thrashing single spindle it is extremely fast and as importantly reliable. Tinker and learn!
mail configuration by lucky+luck · 1999-11-18 15:57 · Score: 5

Hi,
a couple of years ago we had the same problem till I discovered that all our mailboxes were in one mail spool directory. This was a huge bottleneck and after adapting qpopper and configuring sendmail to a split mailspool dir load came down to 1. (split mailspool is /mail/a /mail/b /mail/c and all users which will begin with an a will be placed in /mail/a ... etc ... )
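
A minimal Python sketch of the split-spool lookup described above. The function name, the lowercasing of the first letter, and the idea that the mailbox file carries the user's name inside the bucket directory are my assumptions for illustration; the actual qpopper and sendmail changes are not shown.

```python
import posixpath


def split_spool_path(username, spool_root="/mail"):
    """Return a mailbox path under a spool split by the username's first letter."""
    if not username:
        raise ValueError("username must not be empty")
    bucket = username[0].lower()  # assumption: buckets are lowercase letters
    return posixpath.join(spool_root, bucket, username)


print(split_spool_path("alice"))  # /mail/a/alice
print(split_spool_path("bob"))    # /mail/b/bob
```

The point of the split is that no single directory has to hold every mailbox, so directory lookups stay short.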

check above first before you buy hardware
I just did this... my experiences by abulafia · 1999-11-18 16:12 · Score: 5

Our mail server is currently handling about 1M messages a day. IO became a serious issue. We're still using sendmail, and I'm not going to give it up (we know it, we have custom builds for strange applications, it works). As others have noted, load average doesn't mean much here - I have some machines with a load average at 4 that are actually idle and fine, and others at .2 that need tuning. Ignore it and concentrate on what matters.

Assuming IO matters, I am putting my full faith (and job) on Mylex controllers. I love them. I only have one in production, but am about to deploy 5 more, and we'll come in at about 600G managed by them. They just work. The DAC960SXi I have in production (for 7months now) has been flawless, delivering wire speed doing RAID 5 without any effort after initial config (which is a bit annoying, to be sure).

My production system using it is doing far too many things - mail, staging server, enterprise backup. This is changing - lack of time and historical accident made it that way. The point is that the Mylex handles it with no grief.

If you're building these, be aware that Mylex external controllers need to be mounted in a box with "internal" style connectors. For good RAID cases, check out http://www.storagepath.com/ - they are what I'm using. They look low rent, but the boxes are nice (if a bit expensive).

Down to specifics. For a mail only machine doing the sort of volume you're talking about, I'd deploy a dual processor box with three SCSI busses (one for spool, two for mbox/system access - system access is pretty cheap in comparison) attached to two hardware RAID setups. Granted volume allows, I'd go RAID 5 for spool (with 18G disks, that's ~65G spool) and hot spares. For mboxes, I'd do 0+1, for as much space as needed. Stripe disks on independent controllers, mirrored to each other. Striped mirrors can grow, as you need them to (RAID 5 can't, easily). You don't want to lose anyone's mail. Hot spares for each.

Assuming 100G of mboxes, that's a total of 17 18G disks. Add three Mylex DAC960SXis and (initially) 3 rack mount cases, and that's something around ~24K.

Availability beyond disk is a different question, that gets platform specific. I do mainly Solaris now, so I can't talk much about Linux for this. Mylex controllers can do dual active/dual host configurations, but things get more complex, and a summary here doesn't make sense.

Other options like A1000s (Sun specific) and Netapps require different approaches - they're very different beasts. We have all of the above, and treat them very differently. We'll buy them all again - they're all decent - but are good at different things.

If you can, buy raw Mylex controllers through a reseller like TechData or similar - you'll save a lot.

Hope this helps some.

-j

--
I forget what 8 was for.
Evaluating RAIDs by Grimwiz · 1999-11-18 16:13 · Score: 4

The first thing about a hardware raid controller is that it hides failures from the operating system. With software RAID you have to manually carry out all sorts of tasks, and I'm sure we've all heard of the engineer who mirrored the new blank disk on top of the one remaining data disk of a mirror.
Units such as SUN A1000 and Baydel connect via SCSI and you just watch for an orange light; even the part-time cleaner could pull out the correct disk and replace it and have the system back and running without the OS noticing. Storageworks and Clariion (EMC) do the same but over Fiber Channel. SCSI units tend to top out at 40MB/s, Fiber Channel theoretically tops out at 200MB/s (they have two 100MB/s loops), but since I only had a max of 30x18GB disks to play with, the disks were the bottleneck. Monster multi-SCSI machines like EMC/IBM's can achieve whatever bandwidth you want by multiplexing SCSI connections.
We've evaluated software RAID, Hardware RAID over SCSI, Hardware RAID over Fiber channel from EMC, IBM, SUN, Compaq(storageworks) and in our opinion a good smart raid controller with two data channels and load balancing software is impossible to beat.
For speed, stripe (0) mirrors (1) together in RAID 0+1. This allows reads at double speed because each mirrored disk can handle a request separately, and slightly sped-up writes because you can write to the RAID controller's NV cache and carry on doing your work whilst that takes care of putting the data to media.
This of course has only a 50% data efficiency.
Using RAID 3 or 5 you lose one disk in a rank for parity; RAID 6 (used by Network Appliance) uses two disks for parity but has wider ranks of disks. This often means that sequential reads are fast, because a request for data wakes up all the disks in the rank, but therefore the whole rank can only handle one request at a time. Writes are slower because you have to read a stripe of data, calculate parity and write the whole stripe back again.
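To make the "calculate parity" step concrete, here is a minimal Python sketch of XOR parity over equal-sized blocks, including rebuilding a lost block from the survivors. The block contents and the function name are invented for illustration; a real controller does this in hardware on much larger chunks.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must be equal length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


data = [b"mail", b"spoo", b"l!!!"]  # three made-up data blocks
parity = xor_parity(data)

# Lose the second block, then rebuild it from the survivors plus parity.
rebuilt = xor_parity([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

The same property is why changing one data block means the parity block has to be updated as well.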
RAID5 is really good for data which doesn't have to be the absolute fastest.
Whilst we were doing performance tests, we measured a linear increase in speed up to 20 disks (in transactions/second), and there is a definite art in making sure that you spread the load over all the disks available so that a single disk doesn't get thrashed to death.
In conclusion? well, that depends on your OS.
For me, for a PC-based system I would choose a hardware RAID system with SCSI connection which let me choose the LUN sizes. 5 disks in a RAID5 configuration will only waste 1 disk in capacity. If you're finding your mail spool is being thrashed then I would build a 10 disk 0+1 raid and stripe the mail area across them, using the rest of the area for home areas or web areas or something else which has large storage requirements but doesn't get hit hard.
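The capacity trade-off is simple arithmetic. A minimal Python sketch, assuming identical disks and ignoring hot spares and formatting overhead; the function name is made up, and the 18 GB disk size is just the figure other posters in this thread have been quoting.

```python
def usable_capacity_gb(level, disks, disk_gb):
    """Usable capacity for RAID 5 or RAID 0+1 built from identical disks."""
    if level == "5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_gb  # one disk's worth goes to parity
    if level == "0+1":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 0+1 needs an even number of disks, at least 4")
        return (disks // 2) * disk_gb  # half the disks are mirror copies
    raise ValueError(f"unsupported RAID level: {level!r}")


print(usable_capacity_gb("5", 5, 18))     # 72
print(usable_capacity_gb("0+1", 10, 18))  # 90
```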
Oops, this assumes that this REALLY is your problem, a lot of disk problems go away by adding more memory to the machine... I assume you have measured this by tracking the outstanding I/O queue.

--
-- Don't believe everything you read, hear or think
1. Re:Evaluating RAIDs by otis+wildflower · 1999-11-19 01:09 · Score: 2
  
  Writes are slower because you have to read a stripe of data, calculate parity and write the whole stripe back again.
  
  Kinda why you want gobs of battery-backed RAID controller cache memory... (and a UPS, and clean power... ;)
  
  Your Working Boy,
Re:just a small note about scsi vs. ide by Holger · 1999-11-18 16:36 · Score: 2

There is really not so much that differentiates ATA from SCSI anymore. ATA (formerly known as IDE) drives have been remapping bad blocks transparently for years, they have been doing DMA for nearly as long, and some drives even came in ATA and SCSI versions (IBM DCAA/DCAS for one), where only the interface board was different and absolutely everything else was equal.

There is even a usable external ATA RAID subsystem out there, manufactured by Arena. They use the same i960 that is used on high end SCSI RAID controllers and deliver decent performance with cheap drives. (Remember: The I in RAID once meant inexpensive)

Of course, in a server, you want reliable drives. But that has next to nothing to do with the interface. UDMA is very reliable as far as the interface data transfer is concerned, I would rate it even higher than SCSI in this regard (proper CRC vs. ordinary parity). The quality of the disk mechanism is another thing, but with IDE drives being so cheap, you could afford to upgrade the things so quickly that they never get a chance to fail at work. Or you could just buy two big ATA drives for less than one SCSI drive and do RAID1.

For the record: Recent ATA drives really scream. Look at these bonnie results from my workstation (dual P2, 128M, Red Hat 6.1, 2.2.13, Test run on 2 GB / Partition 50% full):

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
          512 18196 97.7 23648 22.5 10807 19.1 19702 84.2 23128  6.9 129.9  2.0

The drive is a 20 GB Seagate ST320430A which sells for less than 400 DM around here. Remember: These are not artificial results on an empty filesystem. This is my real root partition which is used daily.
Wrong, wrong, wrong by jocks · 1999-11-18 16:48 · Score: 4

I accept that you will need to test to make sure that the disks are not the problem but you will need to do it the right way.

Firstly, vmstat tells you very little about disk i/o. What it is good for is the processes. Look at the output from vmstat 5 for example. The first three columns are r b w: running, blocked and waiting. If there are blocked processes, look at WHY processes are blocked. Use top to get the i/o wait information. If there is a lot of io wait then look at the disks. Use iostat -D to get percentage utilisation of the disks. If there is a lot of disk wait then you may need to either add more disks or spread the load.
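The order of those checks can be written down as a small decision function. This is only a sketch: it assumes the numbers have already been read off vmstat, top and iostat by hand (output formats differ between systems, so nothing is parsed here), and both thresholds are illustrative guesses rather than recommended values.

```python
def triage(blocked_procs: int, iowait_pct: float, disk_busy_pct: dict,
           iowait_limit: float = 20.0, busy_limit: float = 80.0) -> str:
    """Walk the checks in order: blocked processes, I/O wait, per-disk busy %.

    iowait_limit and busy_limit are hypothetical cut-offs for illustration.
    """
    if blocked_procs == 0:
        return "no blocked processes: look elsewhere (CPU, memory)"
    if iowait_pct < iowait_limit:
        return "processes blocked, but little I/O wait: find out why"
    hot = sorted(d for d, pct in disk_busy_pct.items() if pct >= busy_limit)
    if not hot:
        return "I/O wait is high but no single disk is saturated"
    if len(hot) < len(disk_busy_pct):
        return "spread the load: busy disks are " + ", ".join(hot)
    return "all disks saturated: add more disks"

# Hypothetical readings: spool and mail disks pegged, system disk idle.
verdict = triage(4, 45.0, {"sd0": 12.0, "sd1": 97.0, "sd2": 91.0})
```

The disk names and readings are invented; the useful part is the ordering, which stops you from blaming the disks before the earlier checks point at them.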

It is interesting to note the relative speeds of devices:
If cpu takes 3 seconds to do a job then,
Level 1 cache takes 10 seconds
Level 2 cache takes 1 minute
Memory takes 10 minutes
Disk takes 7.7 months
Network takes 6.5 years
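
Those figures are an analogy: access times stretched so that the CPU takes 3 seconds. Working backwards from the numbers above gives the implied slowdown factors. A quick sketch, assuming 30-day months and 365-day years (my assumption, the list does not say):

```python
SECOND = 1
MINUTE = 60 * SECOND
DAY = 24 * 60 * MINUTE
MONTH = 30 * DAY      # assumption: 30-day month
YEAR = 365 * DAY      # assumption: 365-day year

scaled = {
    "cpu": 3 * SECOND,
    "L1 cache": 10 * SECOND,
    "L2 cache": 1 * MINUTE,
    "memory": 10 * MINUTE,
    "disk": 7.7 * MONTH,
    "network": 6.5 * YEAR,
}

# How many times slower than the CPU each level is, per the list above.
slowdown = {name: t / scaled["cpu"] for name, t in scaled.items()}

for name, factor in slowdown.items():
    print(f"{name:9s} {factor:>14,.0f}x")
```

On these assumptions memory comes out 200 times slower than the CPU and disk several million times slower, which is the whole argument for keeping data in cache.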

Get stuff off your disks better! Monitor your cache hit rate to get information on efficiency. Use vmstat or sar or stuff from the se toolkit. Get the se toolkit from http://www.sun.com/sun-on/net/performance. Run zoom.se to monitor your system. Run virtual_adrian.se to tune your system. Use the right tools and don't just add more memory: identify the bottleneck, fix the bottleneck, re-test and repeat until the performance is satisfactory.
Re:RAID 0 + 1 would be faster than RAID 5 by Znork · 1999-11-18 18:43 · Score: 2

Not entirely true. RAID 0+1 is faster for writing, but RAID5 is usually, depending on configuration, faster to much faster for reading (you have more platters to simultaneously read from, and calculating checksums isn't necessary for reads).

Some array types (notably HP that I know of) will dynamically rearrange data storage between RAID 0+1 and RAID5 to optimize speed and space.
NetApp the good the bad and the ugly by Corvar · 1999-11-18 19:06 · Score: 2

Now from all of my research it seemed like NetApp was the way to go. So I pushed and pushed and pushed, and finally we got an F760. (Nothing like going from nothing to the top of the ladder.) And now it is 2.5 months into being a NetApp user. Both the 1 and 2 month anniversaries were marked with a MB dying. I must say it is fast, real fast, but right now the analogy is fast like a race car going towards a wall.
Now ease of use, maintenance, etc. on the UNIX side has been pretty carefree for me. The NetApp has been very easy to use, easy to monitor, and easy to set up. But the NT department, which paid for half of it, is hating life. The NetApp's quota system is straight out of Unix, which is not good for NT, i.e. you are putting quotas on users, groups, or qtrees (think root level directories which are made in a special way). According to the NT gurus, file ownership by individuals in NT is a bad idea, therefore all files are owned by an administrator equivalent. This means you lose user quotas. NT has a different group philosophy than Unix (multiple groups can have access to a single file) so I am guessing the group quotas are out as well. Leaving qtrees, which are sort of ugly. Right now our NT people are looking at taking the loss on the NetApp and giving it to UNIX (fine by me ;) and replacing it with a conventional NT file server.
Another downside for the NT side of things is that the NetApp is configured much like a UNIX box. It uses init and rc files etc. etc. Well, from NT land there is a carriage return/line feed issue. All of those files have Unix style line endings. I am not sure if they break if you start using DOS style, but I am leery to find out. Which means the Unix side is responsible for all configuration of the NetApp. This is both good and bad. They aren't going to break my stuff, but I have to take on additional labour.
Note: The hardware failures were quickly resolved by NetApp, but it still sucked hard.
The NT quota issues are supposed to be resolved in the next major version of the NetAppOS, codenamed Guiness or some such. The NT people IMO haven't fully explored the quota possibilities, instead taking the party line that it's too much work. And it is entirely possible that I have not uncovered all of the problems, and the solutions to those problems, in the time we have had it.
The trouble with NetApp by Electra · 1999-11-18 20:04 · Score: 2

I work for a Systems Integrator - nice word for RESELLER! We are a Sun reseller first and foremost, but we are very strong in the NetApp arena. Since I am a geek trapped in the hell of being a sales(wo)man, please forgive me if I sound salesy at all....
Anyway, NetApps are a great solution for multiprotocol storage. One of the drawbacks is that it is network attached and therefore only as fast as your network...which has been a problem for many of our customers. Another HUGE problem is backup. There is only one product that can do it well - a product called BudTool. BudTool is a little guy that some geeks in my company thought up and brought to market, then along came NetApp who asked us to figure out a way to b/u their filers. Out of that venture NDMP was born. BudTool is the only product that makes use of NDMP correctly. That division of my company was recently sold to Legato Systems, which plans to EOL that product. NetApp is now scrambling to find another solution, since they've been recommending BudTool from Jump Street....
Pricing is also an issue. And you were right in saying that they start at around $17K, but that is WITHOUT storage. A good sized storage solution, let's say 1 TB, is going to run you upwards of $100K. Yikes.
There is also a good resource for people who are thinking of deploying a NetApp solution, which is the toasters users group. You can send an e-mail to toasters@mathworks.com and ask to subscribe to the group. You'll get a lot of good feedback on what works, and what doesn't. You'll also get to see the downside to using it (and BudTool). I think there is info about the group at http://teaparty.mathworks.com but I haven't been able to get there in a few.....Check it out. It's definitely worth the trip.
And if you need any quotes I'd love to help you out!!! Just Joking

--
"Most of my heros won't appear on no stamps..." Chuck D from Fight the Power
Raid & NFS Systems for Sun Sparc & ISP's by cybrthng · 1999-11-18 20:35 · Score: 2

I've been in the ISP business for years. Ran an ISP with 2000 customers and was the Systems Admin for an ISP with 150,000 customers.
Reliability is the issue when it comes to email and RAID systems. Of course Sun has the edge, so why not stick with Sun software & hardware. The Sun StorEdge A1000 has a caching controller and usually 30-40 gigs per rack, it plugs into your SCSI bus, and you can simply add another dual channel SCSI card to split the load or add redundancy.
Network Appliances makes an Excellent Solution. NFS Toasters are the way to go in a distributed environment. Say you have customers on shell accounts; you can export the mail directory, mount it via NFS and access it from the shell servers without throwing more email load on them locally. NFS Toasters come in a great looking appliance rackmount case, and how much rackspace you need depends on how much storage you need.
And of course there is StorageTek, which will run you a pretty penny, but offers Fibre Channel or multiple SCSI channel connections, full redundancy, caching, hotswap and maintenance features.
I'd never stick an IDE solution on a production box. You need something that you can get support and service on, so I'd suggest that you stick with the Sun StorEdge A1000 drive systems for complete compatibility and put it under the same support contract as your UltraSparc.
AND
As far as email is concerned, you should set up an MX server to cache and forward incoming email. These work real nice since you can run RBL or pre-process out spam without killing the actual server that holds and processes email for incoming clients. You have to look at a distributed environment, as email is precious to a lot of people, and a single server machine is not gonna cut it when you're upwards of 20,000 customers doing that much email.
PS. Try out Qmail too :) smaller footprint!
Go with a professional solution by hey! · 1999-11-18 20:53 · Score: 2

My guess is that in this role, performance is not the paramount issue. You're not bopping the heads around like you would in a database application; and even 20MB/sec is going to be plenty of throughput unless you have banks of ADSL lines. The important issues are reliability and maintainability.

I'm as much of a tinkerer as anybody; for my own use I don't mind spending two bucks of labor to save one buck of investment, because I'm really investing in myself. That said, if I had 13K users depending on me for e-mail, I wouldn't mess around; two days of down time could be fatal for your business.

I'd invest $1.50-$2.00/user in a professional grade solution:

Hardware SCSI raid controller.
Drives on hot swap trays.
Same/next day on-site service contract.
External cabinet that can be swapped over to another computer.

It's been over two years since I spec'd a solution like this one (I'm doing software exclusively these days), so I can't make a specific recommendation for today's hardware. I know that some devices used to come in a separate cabinet and looked like a humungous SCSI drive; they even had their own RJ-11 to hook up to a phone line for remote diagnostics from the vendor's tech support.

If the money to swing this is impossible, then I'd recommend mirroring rather than RAID 5. All these kinds of things are compromises between reliability, cost, convenience and performance. RAID 5 is an excellent overall solution from a performance standpoint; but if you cannot afford this RAID 1 is a good choice. It offers fast reads at the cost of slow writes and survival from failure on either disk. In this application, users won't be affected by slightly slower write times. Since drives are so incredibly cheap these days, I'd say this is a pretty good choice if you are strapped for cash. You could even use IDE drives. If you could afford a second IDE controller, then you could use software mirroring across two different controllers for improved throughput.

One thing I haven't looked into is RAID-2; RAID-2 is like RAID-1 with additional error correction codes. It is seldom used in SCSI because SCSI does this for you, but it might be worth looking into for IDE raids.

Good luck.

Really what would be great is failover clustering.

--
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Some observations... by YuppieScum · 1999-11-18 20:55 · Score: 2

Much of this is probably repeated elsewhere, and much is common sense, but...

1. When was the last time you defragged the drives? Chances are this will reduce thrashing immediately.
2. Add more memory. More cache == less I/O. Double the RAM for a week and see how much better things are...
3. Hardware RAID is the only RAID. In most cases, the overhead of s/w RAID exceeds the I/O performance increase. Plus, the OS (whatever OS) need never know the boot drive is spread across 5 drives in three racks...
4. Hot Swap is a must for a production environment. Nothing beats the warm feeling of yanking a dead drive, slapping in a new one, and watching it get rebuilt on the fly - and the users never know...
5. Any amount of RAID will still fail badly if the PSU dies - always get redundant, hot swap power supplies.
6. The same goes for cabling.

--
This sig left unintentionally blank.
Re:Dell Powervault by Salamander · 1999-11-18 21:07 · Score: 3
>the whole point of the NetApps is to be faster than local storage. and they are, as long as your network is fast enough.

I think network-attached storage is a fine idea and the "right solution" for many things, but I just have to add a rebuttal here anyway.

Network-attached storage is faster than local storage if your network (including the protocol stack) is fast enough and your local-storage subsystem (including its own separate protocol stack) is slow enough. That's a totally useless claim. It's like saying that a train is faster than a car, leaving out the part about the train being an unloaded bullet-train engine on an empty track and the car being a Yugo stuck in New York traffic.

In actual fact, the raw bandwidth of modern storage interconnects (e.g. UW SCSI, FC) is higher than that of most network interconnects (e.g. 100baseT) for which the adapter cost is similar. In addition, the protocols used for storage (e.g. SCSI, the various layers of FC) are more suited toward that task - duh - than are the protocols used for networking (e.g. TCP/IP). There is no reason in hell that it should be faster to use network interconnects and protocols to access your storage than to use storage-specific interconnects and protocols.
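
For a sense of scale, here is the raw-bandwidth comparison as arithmetic. A small sketch using nominal peak rates only (100 Mbit/s for 100baseT, 40 MB/s for an Ultra Wide SCSI bus, roughly 100 MB/s for 1 Gbit Fibre Channel); it ignores all protocol overhead, so real throughput on every one of these is lower.

```python
# Nominal peak rates, ignoring all protocol overhead (real throughput is lower).
MBIT = 1_000_000 / 8          # bytes per second in one megabit per second
MB = 1_000_000                # bytes, decimal megabyte

links = {
    "100baseT Ethernet": 100 * MBIT,       # 100 Mbit/s
    "Ultra Wide SCSI": 40 * MB,            # 40 MB/s bus
    "Fibre Channel (1 Gbit)": 100 * MB,    # about 100 MB/s of payload
}

baseline = links["100baseT Ethernet"]
relative = {name: rate / baseline for name, rate in links.items()}

for name, ratio in relative.items():
    print(f"{name:24s} {links[name] / MB:6.1f} MB/s  {ratio:4.1f}x")
```

The Ethernet figure works out to 12.5 MB/s before TCP/IP and NFS take their share, so the storage buses start several times ahead.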

Why might it appear that network-attached storage performs better? I can think of at least three reasons right off the top of my head:
- Many computers are "unbalanced". They are misdesigned or misconfigured so that they have a lack of direct-to-storage capability coupled with an excess of network capability. This may actually make NAS the correct solution for that environment but is irrelevant when considering the overall merits of the two approaches.
- Network-attached storage devices often benefit by having much more cache than direct-attached storage devices. If you took that same amount of cache and applied it to the direct-attach devices, the NAS boxes wouldn't look so good.
- The caching strategies used for NAS - i.e. those in NFSv3 - sacrifice consistency for speed, while direct-attach systems are held to a higher consistency standard. Everyone who has tried to use NFS for something where data consistency or up-to-date modification times matter - even something like "make" - has probably cursed NFS already over this. Some NFS vendors make things even worse by failing even to meet the NFS requirements. Sun's own Solaris NFS client, for example, doesn't always flush data when it's supposed to. If you added all the appropriate sync() operations and fixed the NFS implementations so that your NAS solution was really doing the same thing as your direct-attach solution, you might see some different performance comparisons. Note, though, that for many applications the NFS tradeoff and hence the NAS solution is pretty reasonable.
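
As an illustration of what "adding the appropriate sync() operations" means on the application side, here is a minimal Python sketch of a write that does not return until the OS reports the data flushed. The helper name and the throwaway file are hypothetical, and whether an NFS client really pushes the data to the server on fsync is exactly the implementation question raised above.

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data and do not return until the OS reports it flushed."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push it to the device/server

# Usage with a throwaway file (the path is hypothetical).
target = os.path.join(tempfile.mkdtemp(), "mbox")
durable_write(target, b"From: test\n")
```

Benchmarks that skip the fsync step are measuring the cache, not the storage, which is why the two kinds of system are hard to compare fairly.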
At this point I should disclose my own biases. First, I work for EMC. That's not by choice - the company I was working for got bought out - and I'm often not thrilled about it, but the pay is good. In particular, I don't buy in to all of EMC's arrogant "storage is the center of the universe and the Symmetrix is the ultimate storage device" attitude, and I heartily dislike our own Celerra NAS product even though it blows the doors off NetApp in terms of performance and scalability. Secondly, my professional areas of interest include distributed, cluster, and SAN filesystems, so I of course have some fairly strong opinions on such matters. That said...

I think that once we start seeing true, mature, multi-platform shared-storage filesystems, NAS will start to seem much less appealing. Why pay for NAS when you can just add software to your existing hardware investment and get all the sharing with almost all the performance of local access? Now all we need is a decent implementation of such a filesystem.
--
Slashdot - News for Herds. Stuff that Splatters.
IDE still isn't SCSI by DragonHawk · 1999-11-19 00:09 · Score: 2

There is really not so much that differentiates ATA from SCSI anymore.

I wouldn't go that far.

Yes, IDE has finally caught on to such things as DMA and busmastering, and throughput on IDE devices is in the same arena as SCSI now. But.

IDE is limited to two devices per bus, and generally requires one IRQ per bus. IDE also has very strict and short cable length limits, and lacks an "external" connector -- you generally can't have an external IDE device (I know it is possible, but the cable restrictions make it very difficult).

There are more kinds of devices (scanners, printers, etc.) available for SCSI than IDE. SCSI is generally more capable in terms of what you can do with it.

IDE controllers tend to be very primitive compared to their SCSI counterparts. Things like bus disconnect, command queuing, scatter-gather, even busmastering are often not available or iffy on IDE controllers. This applies especially to the onboard controllers in many motherboards; the number of shortcuts taken there are incredible.

Likewise, the drive electronics and HDA components in IDE drives are often cheaper than those in SCSI drives. These are all design and engineering issues, not issues with the specification itself, but they exist. The problems stem from the fact that IDE is marketed to be cheap, cheap, cheap, and thus gets a higher incidence of cheap components. It isn't limited to IDE, either -- you can also find cheap SCSI hardware, it is just that there is less of it.

IDE often appears faster in benchmarks, because benchmarks typically try to do operations in bulk on a single device. IDE has a lower command overhead than SCSI, so for such things, IDE will be faster. But when you get into the real world, and have multiple processes trying to access multiple devices at once, that is when IDE stalls, while SCSI keeps on going.

I realize this started off as a discussion about RAID, and that IDE RAID devices are not your typical RAID devices. They usually have one drive per bus, connected to a custom controller that multiplexes them all and presents them to the host as a SCSI interface. But the topic has drifted to more general applications.

Just my 1/4 of a byte. ;-)

--

dragonhawk@iname.microsoft.com
I do not like Microsoft. Remove them from my email address.
Re:Go with a professional solution (RAID-1 vs 5) by ninjaz · 1999-11-19 03:27 · Score: 2

If the money to swing this is impossible, then I'd recommend mirroring rather than RAID 5. All these kinds of things are compromises between reliability, cost, convenience and performance. RAID 5 is an excellent overall solution from a performance standpoint; but if you cannot afford this RAID 1 is a good choice. It offers fast reads at the cost of slow writes and survival from failure on either disk. In this application, users won't be affected by slightly slower write times. Since drives are so incredibly cheap these days, I'd say this is a pretty good choice if you are strapped for cash.
Actually, RAID-1 is more expensive and faster for writes than RAID-5.
The reason for this is that RAID-1 uses 1:1 mirroring of a 2-drive set while RAID-5 uses rotating parity in which parity information is distributed across all drives.
With regard to space, using RAID-1, your usable yield (what shows up in df) is half of the total disk space put into it. With RAID-5, parity info is spread throughout all the drives. E.g., I have a RAID-5 using four 4GB drives, which gives me 12GB of usable space. With 0+1 on this configuration, it would be 8GB usable.
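The space arithmetic generalises easily. A short sketch, assuming equal-size drives and the textbook capacity rules (the function name is made up):

```python
def usable_gb(level: str, n_drives: int, drive_gb: float) -> float:
    """Usable capacity for a few common RAID levels (equal-size drives)."""
    if level == "0":
        return n_drives * drive_gb
    if level in ("1", "0+1"):
        if n_drives % 2:
            raise ValueError("mirroring needs an even number of drives")
        return n_drives * drive_gb / 2
    if level == "5":
        if n_drives < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return (n_drives - 1) * drive_gb
    raise ValueError(f"unsupported level: {level}")

raid5 = usable_gb("5", 4, 4)      # the four 4GB drives from the example
raid01 = usable_gb("0+1", 4, 4)   # same drives as a striped mirror
```

The RAID-5 overhead is always one drive's worth, so it shrinks as a fraction when drives are added, while mirroring stays at half.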
As for speed, both RAID-1 and RAID-5 allow you to read from multiple disks at once (which, of course, is a win). For writes, a drive pair in a RAID-1 will take as long as a write to a single drive. On RAID-5, however, it takes longer because (afaik) the RAID controller has to determine which drives to write the parity info to, which takes CPU time.
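The more commonly cited cost of a small RAID-5 write is extra disk I/O rather than CPU time: the old data and old parity have to be read before the new data and new parity can be written. A rough count, assuming a write that touches a single chunk and no controller write cache (both assumptions, and the function names are hypothetical):

```python
def small_write_ios(level: str) -> dict:
    """Disk operations needed for one small (single-chunk) write.

    Assumes no write cache and the usual read-modify-write parity update.
    """
    if level == "1":
        # Same block written to both halves of the mirror, in parallel.
        return {"reads": 0, "writes": 2}
    if level == "5":
        # Read old data + old parity, then write new data + new parity.
        return {"reads": 2, "writes": 2}
    raise ValueError(f"unsupported level: {level}")

def total(level: str) -> int:
    """Total disk operations for one small write at the given RAID level."""
    ios = small_write_ios(level)
    return ios["reads"] + ios["writes"]
```

A battery-backed cache can hide much of this by batching writes into full stripes, which is one reason controller cache size keeps coming up in this discussion.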
A decent little overview is at DPT's site (sadly, only in PDF) at http://www.dpt.com/pdf/understand_raid.pdf
Re:More spindles, more simultanious reads by jemhddar · 1999-11-19 08:07 · Score: 2

There is a great deal more information involved: part was the saturation of the PCI bus causing the slowdown, part was OS tuning, part was hardware configuration. We were using, IIRC, 7200 or 5400rpm Ultra SCSI drives (not Ultra 2). The point was to show it makes a big difference, though.

--
--
FOLLOWUP - Current Solution by sp1n · 1999-11-19 21:38 · Score: 2

It took quite some time for my original question to be posted, and we were on a critical schedule. We ended up buying a whole new server and internal RAID controller. Details follow:
After much shopping, questions, advice and temporary insanity, we decided to go for a new Linux box to handle the mail. Apparently, the load wasn't only coming from disk i/o wait; the kernel was using 70% cpu. We chose a Dual PIII/500 setup on an Asus P3B-DS, 512M ECC SDRAM (less than before, but prices are so high right now, and we figure processes should end sooner on this box), Intel Pro/100, Seagate Barracuda for system, six Seagate Cheetahs for spool and mail storage, and a Mylex eXtremeRAID 1100 (w/ the 233MHz i960).
It was configured with 5 spindles in RAID 5, with 1 as a hot spare, and then partitioned in half. I'm confident this badarse controller can keep up on the writes, with minimal performance hit. Preliminary results with bonnie are inconclusive, since it's working with one huge file, rather than thousands of small files. If write performance lags once it goes online (this Sunday am), we'll split it into 0+1.
Exim, QPOP, and IMAPD were hax0red to use a double-hashed directory structure. ie: "spin" would reside in /var/mail/s/.p/spin (the dot was required for those who have a single digit username). This should eliminate any overhead that ext2fs may have with large directories.
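For anyone wanting to try the same layout, the mapping can be sketched in a few lines of Python. This is a guess at the scheme from the one example given, not the actual patch: the function name is made up, and the handling of one-character usernames (second component becomes a bare ".") is an assumption.

```python
import posixpath

MAIL_ROOT = "/var/mail"   # root taken from the example path above

def mailbox_path(username: str) -> str:
    """Two-level hashed mailbox path, e.g. 'spin' -> /var/mail/s/.p/spin.

    First level: first character. Second level: a dot plus the second
    character, so a one-character name still gets a non-empty component
    ('.'). How the real patch handled that case is an assumption here.
    """
    if not username:
        raise ValueError("empty username")
    first = username[0]
    second = "." + username[1:2]
    return posixpath.join(MAIL_ROOT, first, second, username)
```

With 13,000 users and two levels of fan-out, each leaf directory ends up holding only a handful of mailboxes instead of one directory holding all of them.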
Thanks for all your advice, keep it coming. If you're a gamer, check out http://www.xmission.com/quake
-Kevin Blackham Xmission Internet Salt Lake City, UT
Actually... by YuppieScum · 1999-11-21 23:42 · Score: 2

My understanding was that some fs's will perform some actions to avoid some fragmentation.

A colleague of mine recommends doing a complete backup/reformat/restore cycle every 2 months or so on partitions that see a great deal of edit/extension to files - on a partition in use since '93 I expect this would give a radical reduction in thrashing . . .

It also gives you a chance to test your backup procedures :)

--
This sig left unintentionally blank.