SANs and Excessive Disk Utilization?
pnutjam asks: "I work for a small to medium mental health company as the Network Administrator. While I think a SAN is a bit of overkill for our dozen servers, it was here when I got here. We currently boot 7 servers from our SAN, which houses all of their disks. Several of them have started to show excessive disk load, notably our SQL server, and our old domain controller (which is also the file/print server). I am in the process of separating our file/print server from our domain controller, but right now I get excessive disk load during the morning when people log on (we use roaming profiles). I think the disks need to be defragged, but should this be done on the servers, or on the SAN itself? When it comes to improving performance, I get conflicting answers when I inquire whether I would get better throughput from newer fibre-channel cards (ours are PCI-x, PCI-e is significantly faster), or mixing in some local disks, or using multiple fibre channel cards. Has anyone dealt with a similar situation or has some expertise in this area?"
First, what do you mean by excessive disk load? I'm not being facetious here. Do you mean that the SAN unit is pegged. How do you know that? Are the servers spending a lot of time waiting for I/O? Is the unit making loud noises? Or are the machines that are connected to the server just slow without the processor being pegged?
Also, while "have you tried defragging?" is a common home troubleshooting tip, it's not clear how you came up with the idea that the SAN has to be defragged. If you have reasons and you're just simplifying to keep the post short, great. Defrag away according to the SAN manufacturer's recommendations. However, don't become obsessed with it unless you know that fragmentation's an issue.
You need to spend some time benchmarking the whole system. Figure out how much disk, processor, network IO, and SAN IO are being used. Know what percentage of the total that is. Figure out exactly which servers are causing performance problems at which times.
"Find the problem" is always the first step in "fix the problem."
Once you know what's going on, you can deal with the problem intelligently. Are all the servers booting at the same time? Give them different spindles to work from or stagger the boot times. Are all of the users logging in at once? Figure out why that's slow (network speed, SAN, data size, etc.) and split the data across multiple servers and SANS or improve the hardware.
If you can make the case with hard data that the SAN is swamped, you can probably pry money from management to fix the problem. However, guessing that it -might- be something won't get you very far. They don't want to spend $20k on a fix to be told, "Nope. It was something else."
Forward, retransmit, or republish anything I say here. Just don't misquote me.
I am the Sr. Storage Architect for a Fortune 100 company. If you gave the type of array you have specifically, I'd be able to give more specific advice. That said: 1. You should have at least two fibre cards in each box anyway, and it has nothing to do with throughput. 2. Generally, your bottleneck is the disks themselves. If you want to increase performance, You need to increase the number of spindles that the data is striped across. Depending on the type of array, this may be a non-disruptive operation. The other big thing to look at is the type of RAID being used. You can usually get better performance from something striped with RAID10 vs. Raid 5, especially for write intensive data, because RAID 5 incurs an IO write penalty in calculating parity. 3. If you are going to defrag, do it on the server. It could help. There are some defrag functions available in most mid tier storage arrays, but it isn't what you think. The defrag there typically refers to lining up LUNs in a raid group. So, if you have a raid group with 5 LUNs in it, then delete one, you end up with a big empty space in the middle of the group. Defragging that raid group lines up all the LUNs inside that raid group.