SANs and Excessive Disk Utilization?

← Back to Stories (view on slashdot.org)

SANs and Excessive Disk Utilization?

Posted by Cliff on Thursday March 22, 2007 @12:25AM from the tweaks-and-performance-improvements dept.

pnutjam asks: "I work for a small to medium mental health company as the Network Administrator. While I think a SAN is a bit of overkill for our dozen servers, it was here when I got here. We currently boot 7 servers from our SAN, which houses all of their disks. Several of them have started to show excessive disk load, notably our SQL server, and our old domain controller (which is also the file/print server). I am in the process of separating our file/print server from our domain controller, but right now I get excessive disk load during the morning when people log on (we use roaming profiles). I think the disks need to be defragged, but should this be done on the servers, or on the SAN itself? When it comes to improving performance, I get conflicting answers when I inquire whether I would get better throughput from newer fibre-channel cards (ours are PCI-x, PCI-e is significantly faster), or mixing in some local disks, or using multiple fibre channel cards. Has anyone dealt with a similar situation or has some expertise in this area?"

9 of 83 comments (clear)

Min score:

Reason:

Sort:

Who's San Box is it? by haplo21112 · 2007-03-22 00:35 · Score: 4, Informative

You might want to consider calling the maker for technical support. Some SAN devices require defrag at the BOX level and doing it from the Server will adversely affect your data. Others its OK either way.

--
Power Corrupts,Absolute Power Corrupts Absolutely, leaving one person(group)in charge is absolutely corrupt.
That seems like an odd configuration by hattmoward · 2007-03-22 00:40 · Score: 3, Insightful

I've always kept the system disks local so the server isn't dependent on the SAN connection to boot. That said, do you have this SAN configured as a single shared filesystem or as a group of raid containers that are isolated from one another and provisioned to a single server? If it's shared, I'd say you need to take all but one server down and defragment from that. If it's not shared, they can all defragment their private filesystems at once (though I'd only do one or two at a time anyway).
Re:No Acronyms! by Anonymous Coward · 2007-03-22 00:44 · Score: 4, Informative

If you don't know what a SAN is, and are too lazy to consult Google, then why post? He's asking for someone who might be able to help, not trying to teach a lesson.

----

The whole SAN part is a red herring. He just has a storage area network (presumably Fibre Channel, as opposed to iSCSI), which just is a means of connecting servers to storage enclosures. The storage protocol is still SCSI, it's just over a different transport layer.

In other words, he has multiple servers connected to a single storage enclosure, and he's seeing capacity and performance issues.

The disks should be considered just like internal disks: defrag from the respective servers.

I would bet that his problem is simply having insufficient disks (spindles) to serve the morning peak workload... just like if you had a few internal disks.

In short:
- Defrag from each server, if you have a fragging issue
- Add more disks to spread the workload out
- Consider leaving the boot disks in each server, and just put data on the san. One main reason is that swapping to the SAN can be a problem by consuming storage enclosure cache (presuming there is any)
Not enough information by PapaZit · 2007-03-22 01:10 · Score: 5, Insightful

First, what do you mean by excessive disk load? I'm not being facetious here. Do you mean that the SAN unit is pegged. How do you know that? Are the servers spending a lot of time waiting for I/O? Is the unit making loud noises? Or are the machines that are connected to the server just slow without the processor being pegged?

Also, while "have you tried defragging?" is a common home troubleshooting tip, it's not clear how you came up with the idea that the SAN has to be defragged. If you have reasons and you're just simplifying to keep the post short, great. Defrag away according to the SAN manufacturer's recommendations. However, don't become obsessed with it unless you know that fragmentation's an issue.

You need to spend some time benchmarking the whole system. Figure out how much disk, processor, network IO, and SAN IO are being used. Know what percentage of the total that is. Figure out exactly which servers are causing performance problems at which times.

"Find the problem" is always the first step in "fix the problem."
Once you know what's going on, you can deal with the problem intelligently. Are all the servers booting at the same time? Give them different spindles to work from or stagger the boot times. Are all of the users logging in at once? Figure out why that's slow (network speed, SAN, data size, etc.) and split the data across multiple servers and SANS or improve the hardware.

If you can make the case with hard data that the SAN is swamped, you can probably pry money from management to fix the problem. However, guessing that it -might- be something won't get you very far. They don't want to spend $20k on a fix to be told, "Nope. It was something else."

--
Forward, retransmit, or republish anything I say here. Just don't misquote me.
1. Re:Not enough information by pnutjam · 2007-03-22 01:44 · Score: 3, Interesting
  
  I use Big Sister to monitor all my servers. I get nice graphs that show memory, CPU, network load, disk utilization, etc. I looked and looked at this trying to find the cause of my problems. People complained about slow login times, sometimes they would get temporary profiles because their roaming profile would time out. The also complained about slow access times in our SQL dependant EMR (Electronic Medical Records) system. All my graphs showed everything within an acceptable range.
  
  I finally found an SNMP query for "disk load". This purports to be a percentage, but I've seen it showing way over 100, sometimes as high as four or five hundred. If it gets above 50 or 60 people start to complain. My disk load spikes in the morning when people are logging in, it generally goes to about 80% or higher on my graphs. My SQL server doesn't have these problems and I have yet to find a suitable way of monitoring the SQL log where I think the problem is originating.
  
  --
  Cheap storage VM.
Re:What kind of SAN? by pnutjam · 2007-03-22 01:20 · Score: 3, Informative

Xiotech 3d 1000, w/ qlogic fibre channel cards.

--
Cheap storage VM.
hmm.. by Anonymous Coward · 2007-03-22 01:34 · Score: 5, Informative

I am the Sr. Storage Architect for a Fortune 100 company. If you gave the type of array you have specifically, I'd be able to give more specific advice. That said: 1. You should have at least two fibre cards in each box anyway, and it has nothing to do with throughput. 2. Generally, your bottleneck is the disks themselves. If you want to increase performance, You need to increase the number of spindles that the data is striped across. Depending on the type of array, this may be a non-disruptive operation. The other big thing to look at is the type of RAID being used. You can usually get better performance from something striped with RAID10 vs. Raid 5, especially for write intensive data, because RAID 5 incurs an IO write penalty in calculating parity. 3. If you are going to defrag, do it on the server. It could help. There are some defrag functions available in most mid tier storage arrays, but it isn't what you think. The defrag there typically refers to lining up LUNs in a raid group. So, if you have a raid group with 5 LUNs in it, then delete one, you end up with a big empty space in the middle of the group. Defragging that raid group lines up all the LUNs inside that raid group.
SANs by Sobrique · 2007-03-22 01:59 · Score: 4, Informative

The fact you're using a SAN is likely to be fairly irrelevant here. SANs are a way to move data between server and disks. They're not really much more complicated than that.
First question, is what's the symptoms of the problem - how do you know you're 'pegging your disks'? If you're seeing IO load to your HBAs being really high, then yes, you might find that you need to upgrade these. From experience though, HBAs are rarely your limiting factor.
Much more likely is that you're experiencing local disk fragmentation, as you correctly point out. I can't offer specific advise for your array, but in my experience, SANs are 'blind' to filesystems. They work on disks and LUNs. LUNs are the devices a host sees. This can be safely and easily be defragmented, in all the normal ways that you would do normally.
Are you accessing your SAN over fiber channel or iSCSI? IF it's fiber, then again, you _may_ have network contention, but it's unusual in my experience (especially on a 17 servre SAN). If it's network, then you have contention to worry about. Is it possible that your 'gimme profile' requests across your network are also contending with your iSCSI traffic?
You may find that your SAN has 'performance tools' built in. That's worth a look, to see how busy your spindles are. Because of the nature of a SAN, you may find that the LUNs are being shared on the same physical disks. This can be a real problem if you've done something scary like using windows dynamic disk to grow your filesystem - Imagine having two LUNS striped, when in acutality on the back end, they're on two different 'bits' of a RAID 5 set. This is bad, and is worth having a look at.
One place where SANs do sometimes have issues is in page files. Which is possibly a problem if you're SAN booting. SANs have latency, and windows doesn't like high latency on page files. If you really push it, it'll start bluescreening.
This is fixed by local disks for OS, or just moving swap file to local disk.
HBA expansion _might_ improve performance, assuming this is your bottleneck. However you'll need to ensure you are multipathing your HBAs. (Think of them like network cards, and you won't go far wrong - you need to 'cheat' a bit in order to share network bandwidth on multiple cards). But like I say, you probably want to check this is actually a problem. If they're not very old, then it's unlikely, although it might be worth checking which internal bus the HBAs are on. (Resilience and contention).
It's possible your SAN is fragmented, but it's unlikely this is your problem - SANs don't have the same problem with adding and deleting files (LUNs) so all your backend storage will be in contiguous lumps anyway.
And I apologise if I use terminology that you're not familiar with. Each SAN vendor seems to have their own nomenclature when it comes to the 'bits', but they all work in roughly the same way. You have disks, which are ... well disks. RAID groups, which are disks bundled together, with a RAID 1, RAID 1+0, RAID 5 (with variable numbers of parity ratio) and very occasionally RAID 0. You have LUNs. Logical Units. These are ... well, chunks of your bundles of disks. The first 100Mb of a 5 disk RAID 5 group, might be a LUN. The LUN is what the host 'sees', as a single atomic volume. Most disk groups can have multiple LUNs on them, which is why you do need to watch out for how volume management is operating. I have seen a case where a Windows 2000 server added a second LUN, and used dynamic disk to stripe. Not realising that on the back end, both those LUNs were on the same RAID 5 (4+1). Which cause the disks to seek back and forth continually, and really hurt performance.
Oh, and this is also probably a good excuse to be booking SAN training. IMO SANs are fun and interesting, not to mention in demand and well paid :)
A few answers by sirwired · 2007-03-22 02:42 · Score: 3, Informative

1) No, it isn't your Fibre cards. The PCI-whatever bus (or the line speed of the card, for that matter), usually only affect high-bandwidth operations like tape backup. One thing you must remember is that loads that can beat the crap out of disk (random operations spread all over the platters), do not affect the I/O bus of the Fibre adapters at all, which cares only about total throughput.
2) It is far more likely your OS needs defragging than your disk array. Your disk array CAN become fragmented if you add and delete LUNs often, though.
3) Yes, you need multiple fibre cards, but for redundancy, not for bandwidth.
4) Try and put your major workloads on their own RAID arrays on your disk controller.
5) Check to see if you have enough memory in those boxes. If you have one server that keeps swapping out to disk and you are booting from SAN, you are going to get very hosed, very quickly. If these boxes have any internal disk at all, put the swap there.
6) If it is possible with your arrays, max out the segment size. (Engenio/LSI - based arrays can do this.)

This should be enough to get you started.

SirWired