Are RAID Controllers the Next Data Center Bottleneck?

Wait. You mean my SAN is Dead? by mpapet · 2009-07-25 04:38 · Score: 4, Insightful

Hardware RAIDs are not exactly hopping off the shelf, and I think many shops are happy with Fibre Channel.

Let's do another reality check: this is enterprise class hardware. Are you telling me you can get SSD RAID/SAN in a COTS package that is cost approximate to whatever is available now? Didn't think so....

Let's face it, in this class of hardware things move much more slowly.

--
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html

Re:distibution by bschorr · 2009-07-25 04:46 · Score: 3, Insightful

That's fine for some things but I really don't want my confidential client work-product mirrored around the world. Despite all the cloud hype there is still a subset of data that I really do NOT want to let outside my corporate walls.

--
-B-

BAD MATH by adisakp · 2009-07-25 04:57 · Score: 5, Interesting

FTA Since a disk sector is 512 bytes, requests would translate to 26.9 MB/sec if 55,000 IOPS were done with this size. On the other end of testing for small block random is 8192 byte I/O requests, which are likely the largest request sizes that are considered small block I/O, which translates into 429.7 MB/sec with 55,000 requests

I'm not going to believe an article that assumes that because you can do 55K IOPS for 512-byte reads, you can do the same number of IOPS for 8K reads, which are 16x larger, and then just extrapolates from there. Especially since most SSDs (at least SATA ones) right now top out around 200MB/s and the SATA interface tops out at 300MB/s. Besides, there are already real-world articles out there where guys with simple RAID0 SSDs are getting 500-600 MB/s with 3-4 drives using motherboard RAID, much less dedicated hardware RAID.
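The arithmetic in the quoted passage is easy to reproduce. What follows is only a sketch of that multiplication, assuming the article's "MB" means 2**20 bytes (that assumption is what makes its figures come out); the function name is made up for illustration.

```python
MIB = 1024 * 1024  # assumption: the article's "MB" is 2**20 bytes

def throughput_mib_per_s(iops: int, request_bytes: int) -> float:
    """Bandwidth implied by an IOPS figure at a fixed request size."""
    return iops * request_bytes / MIB

for size in (512, 8192):
    rate = throughput_mib_per_s(55_000, size)
    print(f"{size:>5} B x 55,000 IOPS = {rate:.1f} MB/s")
```

This gives 26.9 and 429.7, matching the article. The second figure only holds if the IOPS rate stays constant as the request grows 16x, which is exactly the assumption being questioned here.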

Re:BAD MATH by fuzzyfuzzyfungus · 2009-07-25 05:47 · Score: 4, Insightful

"simple RAID0 SSDs are getting 500-600 MB/s with 3-4 drives using motherboard RAID, much less dedicated hardware RAID."

The last part of that sentence is particularly interesting in the context of this article. "Motherboard RAID" is, outside of the very highest end motherboards, usually just bog-standard software RAID with just enough BIOS goo to make it bootable. Hardware RAID, by contrast, actually has its own little processor and does the work itself. Of late, general purpose microprocessors have been getting faster, and cores in common systems have been getting more numerous, at a substantially greater rate than hardware RAID cards have been getting spec bumps (outside of the super high end stuff; I'm not talking about whatever EMC is connecting 256 fibre channel drives to, I'm talking about anything you could get for less than $1,500 and shove in a PCIe slot). Perhaps more importantly, the sophistication of OS support for nontrivial multi-disk configurations (software RAID, ZFS, storage pools, etc.) has been getting steadily greater and more mature, with a good deal of competition between OSes and vendors. RAID cards, by contrast, leave you stuck with whatever firmware updates the vendor deigns to give you.

I'd be inclined to suspect that, for a great many applications, dedicated hardware RAID will die (the performance and uptime of a $1,000 server with a $500 RAID card will be worse than a $1,500 server with software RAID, for instance) or be replaced by software RAID with coprocessor support (in the same way that encryption is generally handled by the OS, in software, but can be supplemented with crypto accelerator cards if desired).

Dedicated RAID of various flavors probably will hang on in high end applications (just as high end switches and routers typically still have loads of custom ASICs and secret sauce, while low end ones are typically just embedded *nix boxes on commodity architectures), but the low end seems increasingly hostile.

Re:BAD MATH by jon3k · 2009-07-25 06:35 · Score: 2, Interesting

You forgot about SSDs, consumer versions of which are already doing over 250MB/s reads for less than $3.00/GB. And we're still essentially talking about second generation products (Vertex switched from JMICRON to Indilinx controllers and Intel basically just shrunk down to 34nm for their new ones, although their old version did 250MB/s as well).

I'm using a 30GB OCZ Vertex for my main drive on my windows machine and it benchmarks around 230MB/s _AVERAGE_ read speed. It cost $130 ($4.30/GB) when I bought it a couple months ago, and prices are falling. The new Intel X25-M is $225 for 80GB ($2.81/GB).

enterprise storage by perlchild · 2009-07-25 04:57 · Score: 3, Insightful

Storage has been the performance bottleneck for so long, it's a happy problem if you actually must increase the bus speeds/CPU processors/get faster memory on RAID cards to keep up. Seems to me the article (or at least the summary) was written by someone who hadn't been following enterprise storage for very long...

Re:enterprise storage by ZosX · 2009-07-25 05:45 · Score: 2, Interesting

That's kind of what I was thinking too. When you really start pushing the 300MB/s SATA gives, it's hard to find something to complain about. Most of my hard drives max out at like 60-100MB a second, and even the 15,000 RPM drives are not a great deal faster. Low latency, fast speeds, increased reliability. This could get interesting in the next few years. Heck, why not just build a RAID 0 controller into the logic card with a SATA connection, break the SSD into a bunch of little chunks, and RAID 0 them all for max performance right out of the box? You'd get the performance advantages of RAID without the cost of a card and the waste of a slot. PCIe SSD is quite interesting too.

--
zosxavius photography

Re:enterprise storage by HockeyPuck · 2009-07-25 05:51 · Score: 4, Interesting

Ah... pointing the finger at the storage... My favorite activity. Listening to DBAs, application writers, etc. point the finger at the EMC DMX with 256GB of mirrored cache and 4Gb/s FC interfaces. You point your finger and say, "I need 8Gb FibreChannel!" Yet when I look at your HBA utilization over a 3-month period (including quarter end, month end, etc.) I see you averaging a paltry 100MB/s. Wow. Guess I could have saved thousands of dollars by going with 2Gb/s HBAs. Oh yeah, and you have a minimum of two HBAs per server. Running a Nagios application to poll our switchports for utilization, the average host is running maybe 20% utilization of the link speed, and as you beg, "Gimme 8Gb/s FC", I look forward to your 10% utilization.
We've taken whole databases and loaded them into dedicated cache drives on the array, and surprise, no performance increase. DBAs and application writers have gotten so used to yelling "Add hardware!" that they forgot how to optimize their applications and SQL queries.
If storage was the bottleneck, I wouldn't be loading up storage ports (FAs) with 10-15 servers. I find it funny that the only devices on my 10,000 port SAN that can sufficiently drive IO are media servers and the tape drives (LTO-4) that they push.
If storage was the bottleneck there would be no oversubscription in the SAN or disk array. Let me know when you demand a single storage port per HBA, and I'm sure my EMC will take us all out to lunch.
I have more data than you. :)

Re:enterprise storage by Anonymous Coward · 2009-07-25 06:14 · Score: 4, Insightful

Ah... pointing the finger at the storage... My favorite activity. Listening to DBAs, application writers, etc. point the finger at the EMC DMX with 256GB of mirrored cache and 4Gb/s FC interfaces. You point your finger and say, "I need 8Gb FibreChannel!" Yet when I look at your HBA utilization over a 3-month period (including quarter end, month end, etc.) I see you averaging a paltry 100MB/s. Wow. Guess I could have saved thousands of dollars by going with 2Gb/s HBAs. Oh yeah, and you have a minimum of two HBAs per server. Running a Nagios application to poll our switchports for utilization, the average host is running maybe 20% utilization of the link speed, and as you beg, "Gimme 8Gb/s FC", I look forward to your 10% utilization.
You do sound like you know what you're doing, but there is quite a difference between average utilization and peak utilization. I have some servers that average less than 5% usage on a daily basis, but will briefly max out the connection about 5-6 times per day. For some applications, more peak speed does matter.

Re:enterprise storage by Slippy. · 2009-07-25 06:58 · Score: 4, Insightful

Sort of true, but not entirely accurate.
Is the on-demand response slow? Stats lie. Stats mislead. Stats are only stats. The systems I'm monitoring would use more I/O if they could. Those basic read/write graphs are just the start. How's the latency? Any errors? Pathing setup good? Are the systems queuing i/o requests while waiting for i/o service response?
And traffic is almost always bursty unless the link is maxed - you're checking out a nice graph of the maximums too, I hope? That average looks mighty deceiving when long periods are compressed. At an extreme over months or years, data points can be days. Overnight + workday could = 50%. No big deal on the average.
I have a similar usage situation on many systems, but the limits are generally still storage-dependent issues like I/O latency (apps make a limited number of requests before requests start queuing), poorly grown storage (a few LUNs there, a few here, everything is suddenly slowing down due to striping in one over-subscribed drawer), and sometimes unexpected network latency on the SAN (switch bottlenecks on the path to the storage).
Those graphs of i/o may look pitiful, but perhaps that's only because the poor servers can't get the data any faster.
Older enterprise SAN units (even just 4 or 5 years ago) kinda suck performance wise. The specs are lies in the real world. A newer unit, newer drives, newer connects and just like a server, you'll be shocked. What'cha know, those 4Gb cards are good for 4Gb after all!
Every year, there's a few changes and growth, just like in every other tech sector.
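The average-versus-peak point running through this subthread can be shown with made-up numbers. This is a rough sketch only: the link speed, the sample interval and the traffic pattern are all assumptions chosen for illustration, not measurements from anyone's SAN.

```python
from statistics import mean

LINK_MB_S = 400  # assumption: a 4Gb/s FC link, taken as roughly 400 MB/s usable

def utilization(samples_mb_s, link_mb_s=LINK_MB_S):
    """Return (average, peak) utilization as fractions of link speed."""
    avg = mean(samples_mb_s) / link_mb_s
    peak = max(samples_mb_s) / link_mb_s
    return avg, peak

# Hypothetical day of 5-minute samples (288 of them): mostly idle at 10 MB/s,
# with six intervals where the link is saturated.
samples = [10] * 282 + [400] * 6
avg, peak = utilization(samples)
print(f"average {avg:.1%}, peak {peak:.1%}")
```

With this pattern the daily average comes out under 5% even though the link is pegged six times a day, so a graph of averages alone would say nothing about those bursts.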

--
-- Life is good. Tastes like chicken.

Re:distibution by Ex-MislTech · 2009-07-25 04:59 · Score: 2, Informative

This is correct: there are laws on the books in most countries that prohibit exposing medical and other data to risk by putting it out in the open. Some have even moved to private virtual circuits, and SANs with fast access via solid-state storage of active files work fine; they move less-accessed data to drive storage, which is nonetheless quite fast, and SAS technology is faster than SCSI tech in throughput.

--
google "32 trillion offshore needs IRS attention"

Re:distibution by Ex-MislTech · 2009-07-25 05:01 · Score: 2, Informative

An example of SAS throughput pushing out 6 Gbps.

http://www.pmc-sierra.com/sas6g/performance.php

--
google "32 trillion offshore needs IRS attention"

Re:Hardware RAID becoming less relevant every day. by Alain Williams · 2009-07-25 05:08 · Score: 2, Informative

The second question is, with processors coming with 8 cores, why have some separate specialized controller that handles RAID and not just do it in software?

I much prefer s/ware raid (Linux kernel dm_mirror), it removes a complicated piece of h/ware which is just another thing to go wrong. It also means that you can see the real disks that make up the mirror and so monitor it with the smart tools.

OK: if you do raid5 rather than mirroring (raid1) you might want a h/ware card to offload the work to, but for many systems a few terabyte disks are big and cheap enough to just mirror.
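For a feel of what "the work" is in the raid5 case: the parity block is the XOR of the data blocks in a stripe, and a lost block is rebuilt by XORing the survivors with the parity. The toy below is a sketch of that idea only, with invented block contents; it is not how any particular kernel driver or card implements it.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks from one stripe, plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second disk: rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

Mirroring needs none of this, which is why it is the cheap case for software RAID; with parity, every write means recomputing the XOR.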

Not quite by greg1104 · 2009-07-25 05:11 · Score: 3, Informative

There may need to be some minor rethinking of controller throughput for read applications on smaller data sets for SSD. But right now, I regularly saturate the controller or bus when running sequential RW tests against a large number of physical drives in a RAID{1}0 array, so it's not like that's anything new. Using SSD just makes it more likely that will happen even on random workloads.

There are two major problems with this analysis though. The first is that it presumes SSD will be large enough for the sorts of workloads people with RAID controllers encounter. While there are certainly people using such controllers to accelerate small data sets, you'll find just as many people who are using RAID to handle large amounts of data. Right now, if you've got terabytes of stuff, it's just not practical to use SSD yet. For example, I do database work for a living, and the only place we're using SSD right now is for holding indexes. None of the data can fit, and the data growth volume is such that I don't even expect SSDs to ever catch up--hard drives are just keeping up with the pace of data growth.

The second problem is that SSDs rely on volatile write caches in order to achieve their stated write performance, which is just plain not acceptable for enterprise applications where honoring fsync is important, like all database ones. You end up with disk corruption if there's a crash, and as you can see in that article once everything was switched to only relying on non-volatile cache the performance of the SSD wasn't that much better than the RAID 10 system under test. The write IOPS claims of Intel's SSD products are garbage if you care about honoring write guarantees, which means it's not that hard to keep up with them after all on the write side in a serious application.

Re:Not quite by A beautiful mind · 2009-07-25 07:20 · Score: 2, Insightful

The second problem is that SSDs rely on volatile write caches in order to achieve their stated write performance, which is just plain not acceptable for enterprise applications where honoring fsync is important, like all database ones. You end up with disk corruption if there's a crash, and as you can see in that article once everything was switched to only relying on non-volatile cache the performance of the SSD wasn't that much better than the RAID 10 system under test. The write IOPS claims of Intel's SSD products are garbage if you care about honoring write guarantees, which means it's not that hard to keep up with them after all on the write side in a serious application.

Most enterprise level SSDs have BBWC already for exactly that reason. On those systems fsync is a noop. I for one am looking forward to SSDs in enterprise level applications, we could easily consolidate current database servers that are IOPS bottlenecked, with very low levels of CPU and non-caching memory utilization. BBWC solves the "oh, but we need to honour fsync" kind of problems. We're looking at a performance increase of 10-20x (IOPS) easily if >500G enterprise level SSDs become available for database servers. Even if prices/GB stay way above SAN prices, it's still more than worth it to switch.

--
It takes a man to suffer ignorance and smile
Be yourself no matter what they say

Re:Not quite by greg1104 · 2009-07-25 07:56 · Score: 2, Insightful

You can't turn fsync into a complete noop just by putting a cache in the middle. A fsync call on the OS side that forces that write out to cache will block if the BBWC is full for example, and if the underlying device can't write fast enough without its own cache being turned on you'll still be in trouble.
While the cache in the middle will improve the situation by coalescing writes into the form the SSD can handle efficiently, the published SSD write IOPS numbers are still quite inflated relative to what you'll actually see. What I was trying to suggest is that the performance gap isn't nearly as large as suggested by TFA once you start building real-world systems around them. After all, regular discs benefit from the write combining to lower seeks you get out of a BBWC, too, even more than the SSDs do.
The other funny thing you discover if you benchmark enough of these things is that a regular hard drive confined to only use as much space as a SSD provides is quite a bit faster too. When you limit a 500GB SATA drive to only use 64GB (a standard bit of short stroking), there's a big improvement in sequential and seek speeds there. If you want to be fair, you should only compare your hard drive's IOPS when it's configured to only provide as much space as the SSD you're comparing against.

All wrong. by sirwired · 2009-07-25 05:16 · Score: 2, Informative

1) Most high-end RAID controllers aren't used for file serving. They are used to serve databases. Changes in filesystem technology don't affect them one bit, as most of the storage allocation decisions are made by the database.
2) Assuming that a SSD controller that can pump 55k IOPS w/ 512B I/O's can do the same w/ 4K I/O's is stupid and probably wrong. That is Cringely math; could this guy possibly be as lame?
3) The databases high-end RAID arrays get mostly used for do not now, and never have, used much bandwidth. They aren't going to magically do so just because the underlying disks (which the front-end server never even sees) can now handle more IOPS.

All SSDs do is flip the Capacity/IOPS equation on the back end. Before, you ran out of drive IOPS before you ran out of capacity. Now, you get to run out of capacity before you run out of IOPS on the drive side.

Even if you have sufficient capacity (due to the rapid increase in SSD capacity), you are still going to run out of IOPS capacity on the RAID controller before you run out of IOPS or bandwidth on the drives. The RAID controller still has a lot of work to do with each I/O, and that isn't going to change just because the back-end drives are now more capable.
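The capacity/IOPS flip can be put in numbers with a back-of-envelope sizing sketch. Everything here is an assumption for illustration: the workload is invented, the per-drive IOPS figures (200 for a hard drive, 6600 for an SSD) are the ones quoted elsewhere in this thread, and the drive sizes are just plausible values. It models the drive side only and says nothing about the controller limit described above.

```python
import math

def drives_needed(capacity_gb, iops, drive_gb, drive_iops):
    """Return (drives for capacity, drives for IOPS); the larger one binds."""
    return math.ceil(capacity_gb / drive_gb), math.ceil(iops / drive_iops)

# Hypothetical workload: 2 TB of data that must sustain 20,000 IOPS.
workload = dict(capacity_gb=2000, iops=20_000)

# Illustrative per-drive figures: 500 GB / 200 IOPS hard drive,
# 80 GB / 6600 IOPS SSD.
hdd = drives_needed(**workload, drive_gb=500, drive_iops=200)
ssd = drives_needed(**workload, drive_gb=80, drive_iops=6600)

print("HDD (for capacity, for IOPS):", hdd)  # (4, 100): IOPS-bound
print("SSD (for capacity, for IOPS):", ssd)  # (25, 4): capacity-bound
```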

SirWired

Re:All wrong. by AllynM · 2009-07-25 08:37 · Score: 2, Interesting

Well said. I've found using an ICH-10R kills that overhead, and I have seen excellent IOPS scaling with SSDs right on the motherboard controller. I've hit over 190k IOPS (single sector random read) with queue depth at 32, using only 4 X25-M G1 units. The only catch is the ICH-10R maxes out at about 650-700 MB/sec on throughput.
Allyn Malventano
Storage Editor, PC Perspective

--
this sig was brought to you by the letter /.

Re:distibution by lgw · 2009-07-25 05:38 · Score: 2, Informative

SAS technology is faster than SCSI tech in throughput

"SCSI" does not mean "parallel cable"!

Sorry, pet peeve, but obviously Serial Attached SCSI (SAS) is SCSI. All Fibre Channel storage speaks SCSI (the command set), and all USB storage too. And iSCSI? Take a wild guess. Solid state hard drives that plug directly into PCIe slots with no other data bus? Still SCSI command set. Fast SATA drives? The high end ones often have a SATA-to-SCSI bridge chip in front of SCSI internals (and SAS can use SATA cabling anyhow these days).

Pardon me, I'll just be over here grumbling about this.

--
Socialism: a lie told by totalitarians and believed by fools.

Re:iscsi, 10gig by Anonymous Coward · 2009-07-25 05:45 · Score: 2, Informative

Of course. NFS provides an easy to use concurrent shared filesystem that doesn't require any cluster overhead or complication like GFS or GPFS.

Re:I/O is random? What have you been smoking? by countertrolling · 2009-07-25 05:50 · Score: 3, Insightful

I think we need a mod option to mod down the article summary: -1, stupid editor.

You had your chance.

--
For justice, we must go to Don Corleone

Re:Hardware RAID becoming less relevant every day. by mysidia · 2009-07-25 05:51 · Score: 2, Informative

Well, ZFS is great, but don't get that mixed up with software RAID. It's not. The storage redundancy algorithms used by ZFS are not the RAID algorithms, such that using ZFS is much better than using EITHER hardware or software RAID.

ZFS provides performance and data integrity assurance that standard RAID does not. Primarily, because filesystem level data is checksummed, and it should be almost impossible for silent data corruption to occur at the storage device level, except cases where the data written actually matches the checksums, (a later 'zpool scrub' should detect it, if ZFS is implemented properly).
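The general idea of checksummed reads with a redundant copy to fall back on can be sketched in a few lines. To be clear, this is not ZFS code and makes no claim about its on-disk format; SHA-256 is used only because it ships with Python, and the function names and record contents are invented.

```python
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def read_verified(copies, expected):
    """Return the first copy whose checksum matches; raise if none does."""
    for copy in copies:
        if checksum(copy) == expected:
            return copy
    raise IOError("all copies failed checksum verification")

original = b"payroll record 42"
stored_sum = checksum(original)   # kept separately from the data block itself

bad = b"payroll record 43"        # silent corruption on one side of a mirror
good = original
print(read_verified([bad, good], stored_sum) == original)
```

A plain mirror without checksums has no way to tell which of the two copies is the wrong one; with the checksum stored apart from the data, the bad copy is simply skipped.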

But aside from ZFS, software RAID (and even fakeraid/hostraid hardware adapters that perform RAID in the driver) really, really sucks in terms of reliability, data integrity, and performance when you need to push things to the maximum, compared to a good hardware RAID controller; software RAID is measurably slower on the same CPU and memory.

SMART provides so little of what you need to be doing to keep a reliable array, it isn't even funny.

Good hardware controllers keep metadata and do frequent consistency checks / "scrubs" / surface scans, to ensure every bit of data is periodically read from every drive, so HDD firmware has an opportunity to fix errors before they become "unrecoverable read errors".

Hardware controllers will also detect when a hard drive is having a problem that cannot be easily identified by software. Hard drives are directly plugged into the controller; it can detect things such as abnormal command response latencies.

A software controller can't be sure the abnormal latency isn't due to other workload on the bus, or "not a drive failing", so the HW controller is more responsive to failure.

HW controllers also provide writethrough caching, and sometimes have a BBU with a full writeback cache, which drastically helps performance and reduces the RAID performance penalty, which software RAID doesn't mitigate but in fact makes worse.

Oh yes, and Good controllers also have monitoring and administration tools for various OSes, including Linux, Windows, and Solaris, produced by the manufacturer.

Many of the good controllers come equipped with audible alarms and terminals for you to plug drive failure LEDs into, so that anyone near the server can know a drive has failed, and which one.

Re:distibution by lgw · 2009-07-25 05:54 · Score: 2

For my own personal data, I'd consider that adequate. For data I'm legally required to keep secret - absolutely not. Your physical security design should force an attacker to steal both your keys and your data, each from a separate physical location, so that you can destroy one as soon as the other is stolen to prevent data loss. Electronic security of course focuses on compartmentalization and auditing, so that an inside attacker can only steal a small portion of the data, and can be caught and jailed afterwards. That's all pretty basic design.

Also, 256-bit symmetric encryption really is enough - it's firmly beyond the realm of what can be brute-forced, unless some fundamental understanding of physics is wrong. 256-bit AES is only vulnerable to weaknesses in the algorithm being discovered at some future point. If you're paranoid, you're far better off using 2 unrelated 256-bit symmetric algorithms than a symmetric key larger than 256 bits.
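To put a rough number on "beyond the realm of what can be brute-forced": the sketch below assumes an attacker testing 10**18 keys per second, a rate picked purely for illustration and far beyond anything real, and estimates the average time for an exhaustive search.

```python
KEYS = 2 ** 256
GUESSES_PER_SECOND = 10 ** 18          # assumption: a wildly optimistic attacker
SECONDS_PER_YEAR = 365 * 24 * 3600

# On average an exhaustive search succeeds after trying half the keyspace.
years = (KEYS // 2) // (GUESSES_PER_SECOND * SECONDS_PER_YEAR)
print(f"on the order of 10^{len(str(years)) - 1} years")
```

Under that assumption the answer is on the order of 10^51 years, which is why the realistic worry is a flaw in the algorithm and not the key length.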

--
Socialism: a lie told by totalitarians and believed by fools.

Re:I/O is random? What have you been smoking? by Anpheus · 2009-07-25 06:33 · Score: 5, Insightful

All the important operations tend to be random. For a file server, you may have twenty people accessing files simultaneously. Or a hundred, or a thousand. For a webserver, it'll be hitting dozens or hundreds of static pages and, if you have database backend, that's almost entirely random as well.

For people consolidating physical servers to virtual servers, you now have two, three, ten or twenty VMs running on one machine. If every one of those VMs tries to do a "sequential" IO, it gets interlaced by the hypervisor into all the other sequential IOs. No hypervisor would dare tell all the other VMs to sit back and wait so that every IO is sequential. That delay could be seconds or minutes or hours.

Now imagine all that, and take into account that the latest Intel SSD gets around 6600 IOPS read and write. A good, fast hard drive gets 200. So you could put thirty three hard drives in RAID 0 and have the same number of IOPS, and your latency would still be worse. All the RAID0 really does for you is give you a nice big queue pipeline, like in a CPU. Your IO doesn't really get done faster, but you can have many more running simultaneously.

Given that SSDs are easily three to four times faster on sequential IO and an order of magnitude faster on random IO, I don't think it's that implausible to believe that the industry isn't ready.
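The IOPS-versus-latency distinction above works out as follows. The two IOPS figures are the ones given in this comment; treating per-request service time as 1/IOPS assumes one outstanding request per device, which is a simplification made only for this sketch.

```python
import math

SSD_IOPS = 6600   # figure quoted in the comment above
HDD_IOPS = 200    # likewise

# Hard drives needed in RAID 0 to match one SSD's aggregate IOPS.
drives = math.ceil(SSD_IOPS / HDD_IOPS)

# Assumption: one outstanding request per device, so service time ~ 1 / IOPS.
hdd_latency_ms = 1000 / HDD_IOPS   # per request, however many drives you stripe
ssd_latency_ms = 1000 / SSD_IOPS

print(drives, hdd_latency_ms, round(ssd_latency_ms, 2))
```

Striping thirty-three drives multiplies how many requests are in flight at once, but each individual request still waits about 5 ms on whichever spindle it lands on.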

Re:I/O is random? What have you been smoking? by wagnerrp · 2009-07-25 09:07 · Score: 2

When you calculate IOPS, a good portion of small reads and writes get executed at random places on the disks. When you make one filesystem write on a raid0 set (depending on how smart the raid0 controller is), it will be locking up several or ALL the disk spindles for that individual read/write.

Actually, that's incorrect. Here's why:

When you make a RAID0 array, you stripe large blocks between all the disks, usually 64K-256K large. If your operation does not cross the block boundary, you only access a single drive. Assuming those random small files are evenly distributed, your IOPS scale almost linearly with drive count.
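The stripe mapping described here can be sketched as a small function. The simple round-robin layout and the function name are assumptions for illustration; real controllers and md layouts differ in detail.

```python
def disks_touched(offset: int, length: int, stripe_size: int, n_disks: int) -> set[int]:
    """Indices of the RAID 0 member disks hit by `length` bytes at `offset`."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    # Assumed layout: stripe unit k lives on disk k mod n_disks.
    return {stripe % n_disks for stripe in range(first, last + 1)}

STRIPE = 64 * 1024  # 64K, the low end of the range mentioned above

# A 4K request inside one stripe unit touches a single disk.
print(disks_touched(4096, 4096, STRIPE, 4))
# The same request straddling a stripe boundary touches two.
print(disks_touched(STRIPE - 2048, 4096, STRIPE, 4))
```

With small random requests and large stripe units, nearly every request falls in the first case, so the other spindles stay free to serve other requests.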

Re:iscsi, 10gig by drsmithy · 2009-07-25 10:15 · Score: 2, Informative

Does anyone actually still use NFS?

Of course. It's nearly always fast enough, trivially simple to setup, and doesn't need complicated and fragile clustering software so that multiple systems can access the same disk space.

Slashdot Mirror