Slashdot Mirror


What Sustained Disk Transfer Rates Do You Get?

Mr. Jackson asks: "What kind of disk transfer rates (MB/s) do people get in the real world when moving around large (100s MB) files? Either every machine in our building is mis-configured, or our notions about what we were getting are way off. I've tested half a dozen machines, mostly Win2k, some Linux, by just copying a large file and timing it with a watch. 8 MB/s seems to be about average for inter-disk copies. RAID 1 (striped) got as high as 12 MB/s after fiddling with cache settings. RAID 5 was as low as 2 MB/s. We all thought the numbers should have been around 30 MB/s."

5 of 94 comments

  1. Mega-what? by itwerx · · Score: 4, Insightful

    What you're describing sounds about right, actually.
    Be sure you're keeping Mega-Bytes and Mega-Bits straight!
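
    A minimal sketch of the stopwatch test from the question, with the byte/bit conversion made explicit. The function names and the 1 MiB chunk size are illustrative choices, not anything from the thread, and "MB" here means 1,000,000 bytes.

```python
import os
import shutil
import tempfile
import time


def copy_rate_mb_per_s(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst and return the observed rate in megabytes per second."""
    start = time.perf_counter()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
        fout.flush()
        os.fsync(fout.fileno())  # count the time needed to push data to disk
    elapsed = time.perf_counter() - start
    return os.path.getsize(src) / elapsed / 1_000_000


def megabytes_to_megabits(mb_per_s):
    """One byte is 8 bits, so 1 MB/s equals 8 Mbit/s."""
    return mb_per_s * 8


def megabits_to_megabytes(mbit_per_s):
    """Inverse conversion: 8 Mbit/s equals 1 MB/s."""
    return mbit_per_s / 8


if __name__ == "__main__":
    # Demo on a small temporary file; a real test needs a much larger file.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(8 * 1024 * 1024))
        rate = copy_rate_mb_per_s(src, dst)
        print(f"{rate:.1f} MB/s = {megabytes_to_megabits(rate):.1f} Mbit/s")
```

    A file that fits in RAM will mostly measure the cache, so for a meaningful number the test file should be larger than memory, as in the question's multi-hundred-MB copies.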

  2. Tuning Can Make a Big Difference by ScottG · · Score: 4, Informative
    At least on IDE drives, using the hdparm tool can greatly improve performance of modern drives. I found my throughput went from 3 MB/sec to 22 MB/sec with just a few tweaks.

    Most distros use very conservative settings for the IDE interfaces which will work with just about any old drives, but do not take advantage of more modern hardware. hdparm allows you to activate those advanced features.

    There is a nice write-up about using hdparm here: http://www.oreillynet.com/pub/a/linux/2000/06/29/hdparm.html

    Of course, all this only applies to Linux boxes.

    --
    Hey, who else could go for some flapjacks right now?
  3. Use more disks & the RAID that performs best ( by netringer · · Score: 4, Informative
    The bottleneck is the design of the mechanical disk. You can minimize the bottleneck by having more disk spindles handle the I/O.

    As you've found out it does matter which RAID scheme you use. RAID 0+1 will outperform RAID 5 substantially.

    Think spindles. Because each disk has only one spindle, the disk head can only be over one given track at any instant. If you want the heads to be nearer to where your data is stored, you want to have more heads. With RAID 1 your read or write request can be handled by more than one disk spindle. That gives you the best performance.

    To get more spindles, use as many disks as practical. I've had some long conversations with my co-workers, arguing that now that disks are really cheap it doesn't matter that RAID 1 "wastes" half the disks. It does matter that disk I/O is a bottleneck, and more disks will help ease that bottleneck.

    References:
    "In general, when cost is no object, RAID 1 or RAID 0/1 provides the best overall performance. Since striping spreads the I/O load across multiple disks, RAID 0/1 has the best overall performance characteristics of any RAID option. However, if you know ahead of time that the proportion of writes to disk is low, you can fall back on a less expensive RAID 5 configuration. In addition, if there is adequate battery-backed cache memory in the configuration, you may be able to support a moderate amount of disk writes under RAID 5. But even with large amounts of cache, a heavy write-oriented workload is likely to cause performance problems under RAID 5."
    http://www.oreillynet.com/pub/a/network/2002/01/18/diskperf.html

    "To optimize your file layout, follow these...rules:
    1. Use RAID
    2. The more disks, the better"
    http://www.swynk.com/friends/israel/optimaldisk.asp

    "If your SQL Server is experiencing I/O bottlenecks, consider these possible solutions: Add more physical drives to the current arrays. This helps to boost both read and write access times. But don't add more drives to the array than your I/O controller can support."
    http://www.sql-server-performance.com/fixing_bottlenecks.asp
    --
    Ever dream you could fly? Get up from the Flight Sim. I Fly
  4. RAID intricacies by photon317 · · Score: 5, Informative


    On RAID technologies, speaking in general terms assuming vendors do a good job of implementing it, here's a summary:

    RAID 0: Pure striping, maximum performance, no redundancy. Cost is the same as concatenating disks to get the space you need.

    RAID 1: Pure Mirroring, full redundancy - reads can be as fast as a stripe of the same width as the number of mirrors (2-way stripe, 2-way mirror, same read speed, etc) if they do round-robin reading. Writes happen in parallel, and can be slower unless you've got the headroom and the disk spindle is the only write bottleneck. Cost is double a simple concat or stripe.

    RAID 2-4: Sometimes used for very special purposes, but generally ignored by all because one of the other raid levels does the same thing better. I've seen RAID-3 recently, there are occasionally valid uses for like 0.01% of people out there.

    RAID 5: You get some data redundancy to survive a single disk failure, but you don't pay the double disk cost of full mirroring. It's an N+1 type of configuration. Speed is generally the slowest compared to everything else.
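
    The disk-cost side of this summary can be sketched as a small calculation. This is only an illustration under stated assumptions: RAID 1 is treated as 2-way mirroring across pairs of disks (matching the "cost is double" remark), RAID 5 as N data disks plus one disk's worth of parity, and the six 36 GB disks in the demo are made-up inputs. The function name is hypothetical.

```python
def usable_capacity_gb(level, disks, disk_gb):
    """Return usable space in GB for a simple RAID 0, 1 or 5 layout."""
    if level == 0:
        # Pure striping: every disk holds data, nothing is redundant.
        return disks * disk_gb
    if level == 1:
        # Assumed 2-way mirroring: half of the raw space is a copy.
        if disks < 2 or disks % 2:
            raise ValueError("2-way mirroring needs an even number of disks")
        return disks * disk_gb // 2
    if level == 5:
        # N+1: one disk's worth of space goes to parity.
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disks - 1) * disk_gb
    raise ValueError("only RAID levels 0, 1 and 5 are handled in this sketch")


if __name__ == "__main__":
    for level in (0, 1, 5):
        print(f"RAID {level}: {usable_capacity_gb(level, 6, 36)} GB usable")
```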

    Now on top of those very basic things, there are other factors. Because RAID-5 is cheapest disk-wise, and (IMHO) because it has the highest number of the well-standardized RAID levels, RAID-5 is very popular. To make up for RAID-5's abysmal performance, people use hardware RAID-5 accelerators with cache and whatnot. The problem there is that the controller can add significant cost (in some cases enough to have paid for a full mirror in plainly controlled disks), and that the RAID controller itself can become a single point of failure.

    At my office (where a lot of bad decisions get made every day and I have to eat it) they built a Veritas cluster of Sun machines around a SAN. The idea was that no node was a single point of failure because of clustering (with Veritas allowing all nodes to reach the SAN storage). However, the SAN storage was a big fat RAID-5 array with redundant controllers/disks/yadda/yadda. Of course, as much as the vendor tries to bury it in the fine print, the RAID-5 hardware is a single point of failure. Sure enough, our very reputable vendor's "redundant" hardware RAID-5 controller did fully fail once, knocking our data offline for hours.

    For the same cost as the expensive RAID-5 array and the disks in it, we could have bought two independent JBOD arrays (just a bunch of disks, no RAID controller), placed them on the redundant SAN, with the redundant clustered machines doing software mirroring to the disks, and been truly free from single points of failure (assuming we do all the details right - that the mirrors are always across separate arrays, and that the arrays are on separate power, etc).

    I've spent a lot of time on these problems, and it is my strong belief that the optimal solution for almost all normal situations where you want high availability is to do software mirror/stripe (1+0). Be careful that there is a difference between 1+0 and 0+1 when the 0 part's stripe is more than two disks wide... Consider two JBOD arrays of 5x 36G disks each...

    In 0+1, you first stripe each array into a 180G stripe, then mirror the two together. When your first disk fails, nothing so much as hiccups. However, of your remaining 9 disks, if any of the 5 disks in the array opposite the one with the first failed disk fails, you will lose data. Thus there's a 5/9 chance that the second disk failure causes data loss.

    In 1+0, you first mirror each disk from the first array with its partner in the second array. You then take your 5 36G mirrors and stripe them together for your 180G. Again, first failure, no hiccups. If a second disk fails, in order to cause data loss it must be the partner of the first failed disk - any of the other disks can fail and you still lose nothing. So the chances of data loss on a second disk failure are now 1/9 instead of 5/9.
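
    The 5/9 versus 1/9 figures can be checked by brute force. The sketch below enumerates every possible second failure for the two 5-disk arrays described above; the disk labels and function names are illustrative, and it assumes each of the nine surviving disks is equally likely to fail next.

```python
from fractions import Fraction

DISKS_PER_ARRAY = 5  # two JBOD arrays of 5 disks each, as in the example

# Label each disk (array, slot): array is "A" or "B", slot is 0..4.
DISKS = [(array, slot) for array in "AB" for slot in range(DISKS_PER_ARRAY)]


def loses_data_0_plus_1(failed):
    """0+1: each array is one stripe and the two stripes are mirrored.
    Data is lost once both stripes contain at least one failed disk."""
    return {array for array, _ in failed} == {"A", "B"}


def loses_data_1_plus_0(failed):
    """1+0: slot i of array A mirrors slot i of array B, pairs are striped.
    Data is lost once both disks of any one mirrored pair have failed."""
    return any(
        ("A", slot) in failed and ("B", slot) in failed
        for slot in range(DISKS_PER_ARRAY)
    )


def second_failure_loss_chance(loses_data):
    """Chance that a second failure loses data, given one disk already failed."""
    first = DISKS[0]  # by symmetry, which disk fails first does not matter
    survivors = [disk for disk in DISKS if disk != first]
    fatal = sum(1 for second in survivors if loses_data({first, second}))
    return Fraction(fatal, len(survivors))


if __name__ == "__main__":
    print("0+1:", second_failure_loss_chance(loses_data_0_plus_1))  # 5/9
    print("1+0:", second_failure_loss_chance(loses_data_1_plus_0))  # 1/9
```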

    --
    11*43+456^2
  5. Re:Disk transfer rates, my experience by gbnewby · · Score: 4, Informative
    This topic is near and dear to me....truly "news for nerds, stuff that matters."

    My application is for information retrieval, I'm using some software that utilizes BerkeleyDB files at the back end. I spent the last week trying to figure out why I wasn't getting better throughput, and eventually figured out it's related to BerkeleyDB's handling of lots of tree duplicate pages. But that's not why I wanted to post.

    One thing people didn't mention: The file system. The file system can make a big difference. For larger files, think about ext2 or XFS. For lots of small files, think ReiserFS. ext3 does journaling and is supposed to have comparable throughput. There's a lot of information out there about filesystems, including a filesystem HOWTO at ibiblio.org. Pick the right filesystem for your application.

    Here's what I found. I was copying an 8GB file back and forth (this was one of my DB files; yes, it was sparse, I used "cp --sparse=always"). This was on a Dell 530 with dual 1.7GHz Xeons, 2GB of PC800 RAM, an Adaptec 39160 controller (U160 SCSI) and JBOD (just a bunch of disks=no raid). Linux kernel is 2.4.18-64GB-SMP on a SUSE 8.0 distribution. The experiments were between different drives on separate channels on the same controller. The drives are 73GB 10KRPM Cheetahs.

    I copied the 8GB file and a few other multi-gig files, and used "vmstat" to track progress. This is NOT the way to benchmark for files of just a few meg or even a few hundred meg, because it only samples every few seconds. But for long-running processes, I would "vmstat 10 10000" (resample every 10 seconds; 10000 times) and watch as the files copied in the background on a quiescent system. The "bi" column is blocks in (typically 4KB blocks, but you can tune this on your system); "bo" is blocks out.

    I did XFS to ext2 and back again. I also copied off a ReiserFS drive.

    Both XFS and ext2 were comparable for reading & writing. They peaked at about 35,000 bi or bo. 35,000 * 4096 bytes per block =~ 143MB/second. In other words, I was getting close to the max transfer rate for the SCSI bus (160MB/sec per channel). Long-term average was closer to 25K blocks or ~100MB/second.

    With a ReiserFS, either for reading or writing, the pattern was that it could peak at ~18K blocks bi or bo, but generally was far lower, on the order of 3000-8000 (i.e., sustained rates of about 12-35MB/sec). What seemed to be happening was the other drive (XFS or ext2) would outpace the ReiserFS' ability to read or write, then wait. If you read the ReiserFS info, they admit this is part of the design (ReiserFS is *great* for loads of small files, really really great). For longer files, they end up needing to basically chain it across a lot of blocks in their B-tree.
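
    The block-count arithmetic in the last two paragraphs can be reproduced with a one-line conversion. This sketch takes the 4096-byte block size stated above as an assumption (the real unit depends on the system and the vmstat version) and counts a megabyte as 1,000,000 bytes; the function name is illustrative.

```python
def blocks_to_mb_per_s(blocks_per_s, block_size=4096):
    """Convert a vmstat-style blocks-per-second figure to MB/s."""
    return blocks_per_s * block_size / 1_000_000


if __name__ == "__main__":
    samples = [
        ("XFS/ext2 peak", 35_000),
        ("XFS/ext2 long-term average", 25_000),
        ("ReiserFS peak", 18_000),
        ("ReiserFS sustained, low end", 3_000),
        ("ReiserFS sustained, high end", 8_000),
    ]
    for label, blocks in samples:
        print(f"{label}: {blocks_to_mb_per_s(blocks):.1f} MB/s")
```

    Under these assumptions 8,000 blocks per second comes out nearer 33 MB/s than 35, which is still consistent with the rough range quoted.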

    I know the question was about IDE, not SCSI, but I'm sure that the filesystem matters for IDE as well, especially if very large files are involved. If you're working with large files and are willing to lose a percentage to block roundoffs, some filesystems let you choose a block size > 4096 (though I think Linux ends up chunking in 4K blocks anyway).