Building a Massive Single Volume Storage Solution?
An anonymous reader asks: "I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget for the solution is fairly small. Some of the technologies that I've been scoping out are iSCSI, AoE and plain clustered/grid computers with JBOD (just a bunch of disks). Personally I'm more inclined toward a grid cluster with 1GB interface where each node will have about 1-2TB of disk space and each node is based on a 'low' power consumption architecture. The next issue to tackle is finding a file system that could span across all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority, however it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS) and I've looked into Lustre, xFS and PVFS. There are some interesting commercial products such as the File Director from NeoPath Networks and a few others; however the cost is astronomical.
I would like to know if any Slashdot readers have any experience in building out such a solution. Any help/idea(s) would be greatly appreciated!"
Have you checked out GFS from RedHat (formerly Sistina)?
Check out AFS.
How about the PetaBox, used by the Internet Archive?
Apple Xserve may be the cheapest of that kind of storage, but it's probably not fitting the original idea of commodity hardware.
Scaling to petabytes means spanning storage across multiple systems.
Suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means that the average disk is expected to have a failure about every 57 years. Sounds good, right? Now, suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!
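For anyone who wants to plug in their own numbers, here is a minimal Python sketch of that arithmetic. The function names are made up for illustration, and the last function assumes independent disks with exponentially distributed lifetimes, which is a simplification and not something a drive datasheet promises.

```python
import math

HOURS_PER_YEAR = 24 * 365


def years_between_failures(mtbf_hours):
    """Average time between failures for one disk, in years."""
    return mtbf_hours / HOURS_PER_YEAR


def hours_between_failures(mtbf_hours, disks):
    """Average time between failures across a fleet, assuming evenly spread failures."""
    return mtbf_hours / disks


def prob_failure_within(mtbf_hours, disks, window_hours):
    """Chance of at least one failure in the window (ASSUMPTION: exponential lifetimes)."""
    return 1.0 - math.exp(-disks * window_hours / mtbf_hours)


print(f"one disk: {years_between_failures(500_000):.1f} years between failures")
fleet_hours = hours_between_failures(500_000, 1000)
print(f"1000 disks: one failure every {fleet_hours:.0f} hours "
      f"({fleet_hours / (24 * 7):.1f} weeks)")
print(f"chance of a failure in any given week: "
      f"{prob_failure_within(500_000, 1000, 24 * 7):.0%}")
```

The even-spread figures match the 57 years and 3 weeks quoted above. The weekly probability is only as good as the exponential assumption behind it.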
Now, of course the failures won't be spread out evenly, which makes this even trickier. A lot of your disks will be dead on arrival, or fail within the first few hundred hours. A lot will go for a long time without failure. The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.
You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. This allows the system to keep working under X number of failures. Redundancy/coding allows you to plan on scheduled maintenance, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.
...what your management was thinking. I mean, I can't imagine a storage requirement that large that you can build in a distributed model that would beat on price per GB an EMC or Hitachi or IBM or whoever SAN solution. The administration and DR costs alone for something like this would be astronomical. There just isn't really a way to do something this big on the cheap. I mean, this is what SANs were developed for in the first place. It's cheaper per GB than distributed local storage ever could be.
What?
You may want to take a look at IBRIX systems. They do a pretty robust parallel file system that has redundancy and failover.
Apple has a solution for this. Xsan is a distributed filesystem that is based on ADIC's StorNext filesystem. Apple states on that page that it will scale into the range of petabytes.
There seem to be lots of SATA-RAID based iSCSI SAN devices available nowadays. Some links to products I have seen:
http://www.equallogic.com./ They make nice SATA-raid based iSCSI SAN devices with all the features you could expect (volumes, snapshots, array/volume-expansion, hotswap, redundant controllers, redundant fans, etc).
http://www.equallogic.com/pages/products_PS100E.h
14 250G sata disks, 3U, 3.5 TB of raw storage.
http://www.equallogic.com/pages/products_PS300E.h
14 500G sata disks, 3U, 7 TB of raw storage.
http://www.equallogic.com/pages/products_PS2400E.
56+ TB
Looks good. I have not yet used them myself.
Another iSCSI SATA SAN possibility:
http://www.mpccorp.com/smallbiz/store/servers/pro
16 sata disks, review:
http://www.infoworld.com/MPC_DataFrame_420/produc
This company also has SATA iSCSI SAN devices:
http://www.dynamicnetworkfactory.com/products.asp
iSCSI SAN comparison:
http://www.networkcomputing.com/story/singlePageF
There are also software iSCSI target solutions for use with your own/custom hardware.
http://iscsitarget.sourceforge.net/ for building linux-based iSCSI target/SAN.
If you are familiar with iSCSI targets / iSCSI SAN devices please post your comments!
I do remember some college building a nearline backup storage system using 1U servers with 2 or 3 RAID cards each, connected to something like 12 drives per machine in homemade brackets, but it was hardly ideal. But it did work. Anybody remember where that was?
Storage nodes: 7 x 2.8TB 2U RAID5+1 boxen with Serial ATA. The 2.8TB is logical, not physical. The OS for each of those machines is RAMDISK based (something we concocted based on what I read about the DNALounge awhile back) so it helps curb disk failures of the storage nodes themselves. We avoid disk failure by using RAID5. Of course that doesn't protect against multiple simultaneous disk failures, but read on for more. Each of the storage nodes is exported via NBD.
Then we have a head unit, a 64-bit machine. This machine does a software RAID5 across the storage nodes using an NBD client. Essentially each storage node is a "disk" and the head unit binds and manages the software RAID5. So let's say a whole storage node goes down (for whatever reason it does), all the data is still intact. RAID5 rebuild time over the gigabit network is about 18hrs, which is acceptable. We even have another storage box as a hot-spare.
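To see why losing a whole storage node leaves the data intact, here is a toy Python sketch of the XOR parity idea behind RAID5. It is an illustration only, not the poster's setup: it handles a single stripe with one dedicated parity block, whereas real RAID5 rotates parity across members, and the node contents and the name `xor_blocks` are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three hypothetical storage nodes, plus its parity block.
data_nodes = [b"node0-data", b"node1-data", b"node2-data"]
parity = xor_blocks(data_nodes)

# Pretend node 1 died: XOR the surviving blocks with the parity to rebuild it.
survivors = [data_nodes[0], data_nodes[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_nodes[1])
```

The same trick only covers one missing member per stripe, which is why a second node failing during the rebuild is the scenario to worry about.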
On top of that, we have the whole cluster mirrored to another identical cluster via DRBD in a different geographic location. This is linked by Gigabit WAN. So if we have a massive disaster and lose the entire primary cluster, then we have a 2ndary cluster ready to go. We needed to purchase the Enterprise version of DRBD ($2k US) but that's worth it because they're neato guys.
We use XFS as the filesystem. This system gives us 14TB of redundant "RAID-55 with a Mirror" space. Both clusters together? $85k.
When the cluster starts running out of space (about 70% or so), we add ANOTHER cluster of similar stats to the initial one and use LVM to join the two units together.
This has scaled us to 30TB and we're pretty happy with it. The read speed is very good (hdparm says Timing buffered disk reads: 200 MB in 3.01 seconds = 66.49 MB/sec) and the write speed is about 32 MB/sec. For what our application is doing, that's a fine speed.
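As a rough sanity check on the figures in this post, here is a short Python sketch of the throughput arithmetic. It assumes decimal units (1 TB = 1e12 bytes) and that a rebuild moves one full 2.8TB node, neither of which the post states explicitly. The helper names are made up.

```python
TB = 1e12  # ASSUMPTION: decimal terabytes
MB = 1e6


def throughput_mb_per_s(bytes_moved, seconds):
    """Average throughput in MB/sec for a transfer."""
    return bytes_moved / seconds / MB


def hours_to_transfer(bytes_moved, mb_per_s):
    """Hours needed to move the given number of bytes at a sustained rate."""
    return bytes_moved / (mb_per_s * MB) / 3600


# Theoretical ceiling of a gigabit link, ignoring protocol overhead.
GIGABIT_CEILING_MB_S = 1e9 / 8 / MB

rebuild_rate = throughput_mb_per_s(2.8 * TB, 18 * 3600)
fill_hours = hours_to_transfer(14 * TB, 32)

print(f"implied rebuild rate: {rebuild_rate:.1f} MB/sec "
      f"(gigabit ceiling is {GIGABIT_CEILING_MB_S:.0f} MB/sec)")
print(f"writing 14TB at 32 MB/sec takes about {fill_hours:.0f} hours")
```

Under those assumptions, the 18 hour rebuild works out to well under what a gigabit link can carry in theory, so the network is probably not the only limit.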
well, since all of the (high-end) PC's we were looking at for snort boxen had severe problems pushing even 5Gbit/s (not GByte) of traffic in/out over the PCI busses simultaneously, you hit a bottleneck pretty quickly there, even before you get to 25TB with your disk sizes. at 500GB disks you get pretty close, but you're at the ceiling already. while a decent (not even cutting-edge) machine could push a Gbit to the server pretty easily, the server, no matter how beefy, needs a ton of internal bandwidth to gather/process/serve the data timely-like. if he only needs 100mbit/s of data service then he's golden =)
or did you mean to specify a GBit switch in between the clients/big box?
also, agree with yours and others' proclamation that administration will not be trivial. be sure to spec at least 6 months of your time in writing/debugging scripts to automate the detection and RMA of dead drives, and find a vendor who will ship based on an automated mail you can send out about failed disks, rather than waiting on turnaround from you pulling the drive and the delivery making a round trip.
Just a comment about MTBF. It's often misunderstood, and it is one of my little pet peeves with tech producers because they don't try to correct it. MTBF is a reliability rating that applies over the warranty period.
You have a drive that is rated 500,000 hours MTBF. Suppose you bought a drive and let it run at rated duty. Drives are normally rated to run 100% of the time, but many other devices will have a duty cycle. Further, you run the drive until its warranty is up. You then throw this perfectly working drive out the window and replace it. If you keep up this pattern, then approximately once per 500,000 hours on average you should have a drive fail before the warranty period is up. This is why it is important to look not only at the MTBF but also at the warranty period.
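Read that way, the MTBF turns into a failure rate you can budget for. Here is a small Python sketch, assuming drives run 24x7 and using a 3-year warranty and a 1000-drive fleet purely as example numbers (neither comes from a datasheet). The function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def annualized_failure_rate(mtbf_hours):
    """Approximate fraction of in-warranty drives failing per year of 24x7 use."""
    return HOURS_PER_YEAR / mtbf_hours


def expected_failures(drives, mtbf_hours, warranty_years):
    """Expected in-warranty failures across a fleet over the warranty period."""
    return drives * warranty_years * HOURS_PER_YEAR / mtbf_hours


afr = annualized_failure_rate(500_000)
# ASSUMPTION: 1000 drives and a 3-year warranty, chosen only for illustration.
fleet = expected_failures(1000, 500_000, 3)

print(f"annualized failure rate: {afr:.2%}")
print(f"expected failures in a 1000-drive fleet over 3 years: {fleet:.1f}")
```

These are averages over drives still inside their warranty period. The sketch says nothing about what happens once drives are run past it.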
As a side note: In theory, you should be throwing drives out on a periodic basis. One way around this is to not buy all the same drive type and manufacturer. By having a pool of drive types, you distribute, and thus minimize, the risk of drive failures. Additionally, you may want to have a standard period of time for drive replacement so as to schedule your down time, as opposed to it all being unexpected.
Bel, the mostly sane.. "Of course I can't see anything! I'm standing on the shoulders of idiots." -- Me