Are RAID Controllers the Next Data Center Bottleneck?
storagedude writes "This article suggests that most RAID controllers are completely unprepared for solid state drives and parallel file systems, all but guaranteeing another I/O bottleneck in data centers and another round of fixes and upgrades. What's more, some unnamed RAID vendors don't seem to even want to hear about the problem. Quoting: 'Common wisdom has held until now that I/O is random. This may have been true for many applications and file system allocation methodologies in the recent past, but with new file system allocation methods, pNFS and most importantly SSDs, the world as we know it is changing fast. RAID storage vendors who say that IOPS are all that matters for their controllers will be wrong within the next 18 months, if they aren't already.'"
It is very common when doing disk benchmarks to have separate tests for small random reads/writes and large sequential reads/writes. The numbers are often different.
And while you can't always predict what disk sector is going to be read next, often you can, which is why predictive raid controllers with lots of memory are very useful.
I think we need a mod option to mod down the article summary: -1, stupid editor.
With things like Hadoop and Cloudstore, pNFS, Lustre, and others, storage will be distributed. There will no longer be the huge EMC, NetApp, Hitachi, etc. central storage devices. There's no reason to pay big bucks for a giant single point of failure when you can use the Linus method of upload to the internet and let it get mirrored around the world. (In a much more localized manner.)
Paying taxes to buy civilization is like paying a hooker to buy love.
Hardware RAIDs are not exactly hopping off the shelf and I think many shops are happy with Fibre Channel.
Let's do another reality check: this is enterprise class hardware. Are you telling me you can get SSD RAID/SAN in a COTS package that is cost approximate to whatever is available now? Didn't think so....
Let's face it, in this class of hardware things move much more slowly.
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html
This article suggests that most RAID controllers are completely unprepared for solid state drives and parallel file systems
Right. The point of a parallel file system is that you do not need RAID. Slashdot's editors must think very little of their readers.
FTA: "Since a disk sector is 512 bytes, requests would translate to 26.9 MB/sec if 55,000 IOPS were done with this size. On the other end of testing for small block random is 8192 byte I/O requests, which are likely the largest request sizes that are considered small block I/O, which translates into 429.7 MB/sec with 55,000 requests."
I'm not going to believe an article that assumes that because you can do 55K IOPS for 512-byte reads, you can do the same number of IOPS for 8K reads, which are 16x larger, and then just extrapolates from there. Especially since most SSDs (at least SATA ones) right now top out around 200MB/s and the SATA interface tops out at 300MB/s. Besides, there are already real-world articles out there where guys with simple RAID0 SSDs are getting 500-600 MB/s with 3-4 drives using motherboard RAID, much less dedicated hardware RAID.
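To make the arithmetic in that quote concrete, here is a rough sketch in Python. The function names are made up for illustration, and the 300MB/s figure is just the SATA ceiling mentioned above, taken as decimal megabytes. It reproduces TFA's numbers (which are really MiB/s) and shows how many 8K requests per second would already fill a 300MB/s link.

```python
def throughput_mib_per_s(iops: int, request_bytes: int) -> float:
    """Throughput in MiB/s if `iops` requests of `request_bytes` complete each second."""
    return iops * request_bytes / (1024 * 1024)


def max_iops_at_link(link_bytes_per_s: float, request_bytes: int) -> float:
    """Upper bound on request rate when the interface bandwidth is the only limit."""
    return link_bytes_per_s / request_bytes


if __name__ == "__main__":
    for size in (512, 8192):
        mib = throughput_mib_per_s(55_000, size)
        print(f"{size:>5}-byte requests at 55,000 IOPS = {mib:.1f} MiB/s")

    # Assumption: SATA ceiling of 300 MB/s (decimal), as quoted in the comment above.
    sata_limit = 300_000_000
    print(f"8192-byte requests that fit in 300 MB/s: {max_iops_at_link(sata_limit, 8192):,.0f} per second")
```

So a single SATA drive would hit the interface limit at roughly 36,600 8K requests per second, well short of 55,000, which is why carrying the 512-byte IOPS figure over to 8K doesn't hold for one drive.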
Storage has been the performance bottleneck for so long, it's a happy problem if you actually must increase the bus speeds/CPU processors/get faster memory on RAID cards to keep up. Seems to me the article (or at least the summary) was written by someone who hadn't been following enterprise storage for very long...
The first question is really, why RAID an SSD? It's already more reliable than a mechanical disk, so that argument goes out the window. You might get some increased performance, but that's often not a big factor.
The second question is, with processors coming with 8 cores, why have some separate specialized controller that handles RAID and not just do it in software?
AccountKiller
Run FiberChannel
Multiple interfaces and lots of block servers.
Does anyone actually still use NFS?
Deleted
There may need to be some minor rethinking of controller throughput for read applications on smaller data sets for SSD. But right now, I regularly saturate the controller or bus when running sequential RW tests against a large number of physical drives in a RAID{1}0 array, so it's not like that's anything new. Using SSD just makes it more likely that will happen even on random workloads.
There are two major problems with this analysis though. The first is that it presumes SSD will be large enough for the sorts of workloads people with RAID controllers encounter. While there are certainly people using such controllers to accelerate small data sets, you'll find just as many people who are using RAID to handle large amounts of data. Right now, if you've got terabytes of stuff, it's just not practical to use SSD yet. For example, I do database work for a living, and the only place we're using SSD right now is for holding indexes. None of the data can fit, and the data growth volume is such that I don't even expect SSDs to ever catch up--hard drives are just keeping up with the pace of data growth.
The second problem is that SSDs rely on volatile write caches in order to achieve their stated write performance, which is just plain not acceptable for enterprise applications where honoring fsync is important, like all database ones. You end up with disk corruption if there's a crash, and as you can see in that article, once everything was switched to only relying on non-volatile cache the performance of the SSD wasn't that much better than the RAID 10 system under test. The write IOPS claims of Intel's SSD products are garbage if you care about honoring write guarantees, which means it's not that hard to keep up with them after all on the write side in a serious application.
1) Most high-end RAID controllers aren't used for file serving. They are used to serve databases. Changes in filesystem technology don't affect them one bit, as most of the storage allocation decisions are made by the database.
2) Assuming that an SSD controller that can pump 55k IOPS w/ 512B I/O's can do the same w/ 4K I/O's is stupid and probably wrong. That is Cringely math; could this guy possibly be as lame?
3) The databases high-end RAID arrays get mostly used for do not now, and never have, used much bandwidth. They aren't going to magically do so just because the underlying disks (which the front-end server never even sees) can now handle more IOPS.
All SSDs do is flip the Capacity/IOPS equation on the back end. Before, you ran out of drive IOPS before you ran out of capacity. Now, you get to run out of capacity before you run out of IOPS on the drive side.
Even if you have sufficient capacity (due to the rapid increase in SSD capacity), you are still going to run out of IOPS capacity on the RAID controller before you run out of IOPS or bandwidth on the drives. The RAID controller still has a lot of work to do with each I/O, and that isn't going to change just because the back-end drives are now more capable.
SirWired
The fact that SSD perf drops like a rock when you actually need to be absolutely sure the data makes it to disk is a huge factor in enterprise storage. No enterprise storage customer is going to accept the possibility their data goes down the bit-bucket just because somebody tripped over the power cord. Enterprise databases are built around the idea that when the storage stack says data has been written, it has, in fact, been written. Storage vendors spend a great deal of money, effort, and complexity guaranteeing the non-volatility of write cache; for SSD to ignore that requirement when publishing performance data is fundamentally dishonest.
SirWired
The on-board chips are not built for high speed / using all the ports at the max at one time.
Somebody will find a way to clog it up.
Where's our "paperless" society?
For justice, we must go to Don Corleone
Anyone seriously into benchmarking or high performance applications would know that raid controllers have been a bigger bottleneck than the hard drives for ages already.
It's just in the last 2-3 years or so that you have gotten raid controllers fast enough to properly deal with the performance of the 6 to 8 15k rpm drives that a normal 2U server can hold, and even today, many of the server raid cards out there still cannot do this.
Raid card performance has easily been the biggest differentiator on server performance for anyone that needed a reasonable amount of I/O capacity on their servers. Most servers have been reasonably equal in terms of memory and network performance. After all, they are all built around a very limited number of CPU and chipset architectures, there is only so much that can differ there, and it has been a long time since gigabit network HW for servers could not fill a gigabit link.
Raid cards, on the other hand, have major differences in architecture, software and processors. Proper HW raids are basically small computers on a card. They have their own CPU, memory and OS. This isolates them from the host they are plugged into and protects the data even if the host crashes (the raid card will normally not crash and has all data in the battery-backed cache, which means a great deal for critical data and massively reduces the chances that you need to do consistency checks/validation and rebuilds after a crash on the host server, which is equally important for a server).
Unfortunately as a result of all that extra complexity, you also get the potential for large performance variations between different raid cards.
That said, a good quality raid card now definitely helps performance in most scenarios and easily outperforms software raid for most server usage that includes a reasonable amount of writing, as long as you have that battery backup on your cache so you can safely enable write-back caching.
Just do your homework and make sure you get a good card when you shop. The better cards can easily be 2-3x faster than the worst.
Cost per IOPS, yes; several vendors are selling SSD now. Cost per terabyte, no, SSD isn't even close. What we're seeing now is a Tier 0 storage using SSDs. It fits in between RAM cache in SAN controller nodes and on-line storage (super fast, typically fiber channel storage vs near line).
:)
So previously it looked like (slowest to fastest): SATA (near-line), Fiber Channel (online) -> RAM cache
Now we'll have: SATA -> FC -> SSD -> RAM
And in a few years after the technology gets better and much less expensive, we'll see: SATA -> SSD -> RAM
And hopefully eventually: SSD -> Memristors
SSD killed the Raid(io) star. Really, who needs the fuss of raid. Unless it's for backup, there is no need for raid as far as speed goes. SSDs are already bottlenecking the 3.0Gb/s SATA II. A single SSD can produce the same throughput as 4 raided Raptors (=fast drives). Plus anyone can install an SSD into an existing setup; Raid requires a lot of reinstalls and drivers etc.
Not sure what Datacenters you have been visiting but it sounds like you need to get out more. In a standard Colocation datacenter you see a lot of data that lives in midgrade x86 server raid subsystems. You also see a lot of bakers racks filled with crap white box systems.
In a real datacenter the only raid seen is a raid 1 for the boot drives to get the server up into the operating system. The data lives on the SAN. If the server suffers a hardware failure or other problem, the admin is able to assign the LUN to another server and get the application back up during the repairs. Clustered applications are even able to do this on their own and page the admin to let him know it's time for a service call.
And of course you mention nothing about ZFS which is even able to judge the read and write speed of its devices. A raidz configuration of a mix between regular spindles and ssd's would be able to balance between the two depending on the needs of the operations involved.
SSDs won't be in the datacenter for a long time. The 15k rpm fibre channel drives found in most EMC hardware are robust and scary fast on top of being extremely fault tolerant with BCVs and multiple LUNs. I wonder how well SSDs would do in a real world test of a multiple LUN configuration with 24 hour hammering on the other end of 5000 hosts on an 8gb fibre?
...Our EMC sales rep has been putting the hard sell on us to buy some SSD product. I think they are worried about their profit margins on conventional drives, and they want to move customers to a product with a higher margin - and along the way they can also try to get you to upgrade head units, etc.
This is a pretty simplistic view. As a senior storage engineer, I have conversations like this quite often. At the enterprise storage level, RAID controller hardware is not a component that figures much in things. In addition, and perhaps more pertinently, there is a reasonable chance that in the next few years the RAID paradigm may pass into history and that disk interface models that incorporate linear power/throughput growth in enterprise storage subsystems, such as IBM's XIV, will take over. It's certainly a quantum improvement in thinking, at least. It will also deal with all of these smug statistical analyses that talk about RAID rebuild times growing (in line with spindle size growth) to the point that a second disk failure before the rebuild from the original failure completes takes out an entire array.
Currently network bw is the limit in the data centers. I don't see this changing anytime soon.
Not dead, just ready to be overtaken by something else. The bizarre idea of a redundant array of fileservers with parallel NFS is already done by Panasas - but I'm too scared to get a price in case I have a heart attack.
There's also RAID6, but it just gives you a bit more leeway in the number of drives you can LOSE.
Some pages, however, will be rather consistent.
Common CSS files, a site's front page, etc. For scripted pages (php/perl/asp/whatever, as well as javascript) there will be a whack of commonly included modules inside the app, plus all modules or stuff that might be part of PHP/etc and pulled in on an as-needed basis.
Having worked in a company with some fairly high-traffic sites (maybe not as big as the giants, but big enough), caching of those makes a HUGE difference in performance.
I don't disagree that SSDs will make a big difference in all that other, random, IO, but there's plenty of consistent things that can be dealt with (see: cached in memory) that a lot of people simply overlook.
I've been hit by this before. New app rolls out, servers take a dive. Having some knowledge of DB's myself, I hit it with MTOP and find HUGE query tables, generally caused by extremely poorly written blocks of queries/code (doing things in code that should be done in the query, or vice versa) and shit-poor indexing or query structure.
Now I'm no DB expert, but when I add a few indexes/changes and suddenly that 45s query is going down to less than 1... then yeah poor, sloppy, or just lazy coding becomes a much bigger issue than lack of hardware or a poorly configured/performing server. Unfortunately there's often a big divide between the IT admins and the programmers, so collaboration in this regard gets lost as you get cowboys on both sides.
Unless someone designs an entire system, top to bottom, there will always be a slowest piece (aka a bottleneck). All this means is that RAID controllers will be the bottleneck until someone designs a better RAID controller. Then the bottleneck will be some other part of the system. Hard to see what the fuss is about.
linquendum tondere
The main speed up provided by hardware raid is reliable deep write buffering ... I don't see how parallel file systems will make that advantage go away.
Caching, and efficient wear-levelling algorithms incorporated into the drives are designed to prevent exactly that.
He may be on to something, but not in the form in which TFA is now.
pNFS: a file of size A striped over 3 servers becomes A/3 (smaller) which actually increases randomness.
Many vendors (e.g. NetApp, Sun and the whole bunch that's been focusing on large sequential I/O for many years (DDN, IBM)) already have RAID controllers that do a good job with non-random I/O.
True, but in the context of the article it would be more something like,
RAID_CONTROLLER -> SATA -> SSD -> RAM
I suspect that the hardware raid controller can easily be replaced by the network,
[Network/GIGE/10GE/etc] -> SATA -> [SSD -> RAM]
The way things stand right now there's no real benefit that I can see from sharing
SSD across the network, even though the network is certainly fast enough to compete
with latencies on the SSD.
Network shared block devices or more probably "object stores" are an interesting
option, especially for read only or read mostly applications like web provision.
Salut,
Jacques
"[Network/GIGE/10GE/etc] -> SATA -> [SSD -> RAM]"
Exactly, in the context of SANs, you typically don't have a RAID controller.
I think you underestimate performance requirements if you don't see a need for SSDs. Typical SANs operate in microsecond latencies across the cable plant, whereas mechanical disks have seek latencies in the milliseconds. Also, SSD throughput is already twice that of mechanical disks and (read) IOPS aren't even comparable; SSDs are an order of magnitude faster. And just wait until we see fiber channel SSDs or SATA 6G SSDs that will be doing over 500MB/s right out of the gate.
The problem isn't a matter of IOBs, it's bandwidth. A RAID controller doesn't care if you're using 512-byte blocks or 4k. What matters is the request rate, size of requests, and striping size. A 64kB read is a 64kB read, whether it's a 128-block request to the drive or a 16-block one. The larger block size is easier for some caching algorithms because there are fewer blocks to manage. Unless the databases change how they work, they're still going to be making a ton of 4k requests.
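A quick sanity check of the block counts in that comment, as a small Python sketch (the helper name is made up, and "64kB" is taken to mean 64 KiB):

```python
def blocks_per_request(request_bytes: int, block_bytes: int) -> int:
    """Number of device blocks covered by one request; the request must be block-aligned."""
    if request_bytes % block_bytes:
        raise ValueError("request size is not a multiple of the block size")
    return request_bytes // block_bytes


if __name__ == "__main__":
    request = 64 * 1024  # assumption: 64kB means 64 KiB
    for block in (512, 4096):
        print(f"{request} bytes = {blocks_per_request(request, block)} blocks of {block} bytes")
```

Either way the same 65,536 bytes cross the bus; only the bookkeeping (128 block entries vs 16) differs.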
There are a couple of things holding back RAID performance in the low-end hardware. First, the typical card is 8-lane PCIe or less, which means a limit of 2, 4, or 8 GB/s, depending on which generation of PCIe is used. That can be eaten by 10-40 SSDs. With R1, that doubles the drive I/Os, a R5 RMW quadruples it, and R6 sextuples it. An 8-lane PCIe 3 means a potential back end of 48GB/s. SAS 2 is 6Gb/s, so 600MB/s per channel means 80 ports are needed, but a typical backplane controller only has 8. This carries over whether it's a hardware RAID controller or host-based RAID/file system using a SAS controller. Adding to the hardware controller's problems, the memory for RAID5/6 or read/write cache can't support that back-end speed. PC3-12800 tops out at less than 13GB/s. Finally, the processors on the RAID controllers are embedded class, so there are one or two MIPS, PPC, or ARM CPUs. They range in speed from a couple hundred MHz to 1 GHz, which is pretty anemic for managing all those requests. Going to a faster processor means a lot more heat to manage on a daughter card.
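The back-end math above can be sketched like this. It is only a worst-case illustration: it assumes every host write is a small write that pays the full multiplier quoted in the comment (2x for R1, 4x for an R5 read-modify-write, 6x for R6), and it uses round figures of 8,000 MB/s for an 8-lane PCIe 3 slot and 600 MB/s per 6Gb/s SAS port. The names are made up for the example.

```python
# Drive I/Os generated per host write, as quoted in the comment above.
WRITE_MULTIPLIER = {"RAID1": 2, "RAID5": 4, "RAID6": 6}


def backend_mb_per_s(host_mb_per_s: int, raid_level: str) -> int:
    """Worst-case back-end bandwidth if every host write pays the full multiplier."""
    return host_mb_per_s * WRITE_MULTIPLIER[raid_level]


def ports_needed(backend_mb: int, port_mb_per_s: int = 600) -> int:
    """Ports required to carry the back-end load (integer ceiling division)."""
    return -(-backend_mb // port_mb_per_s)


if __name__ == "__main__":
    host = 8_000  # assumption: ~8 GB/s for an 8-lane PCIe 3 slot
    for level in WRITE_MULTIPLIER:
        backend = backend_mb_per_s(host, level)
        print(f"{level}: {backend / 1000:.0f} GB/s back end, {ports_needed(backend)} ports at 600 MB/s")
```

The R6 row gives the 48GB/s and 80 ports mentioned above. Real workloads mix reads with full-stripe writes, so actual back-end load would normally be lower than this worst case.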