Building a Massive Single Volume Storage Solution?

Posted by Cliff on Tuesday October 25, 2005 @07:21AM from the 15-zeros-is-a-lot-of-bytes dept.

An anonymous reader asks: "I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget for the solution is fairly small. Some of the technologies that I've been scoping out are iSCSI, AoE and plain clustered/grid computers with JBOD (just a bunch of disks). Personally I'm more inclined toward a grid cluster with 1GB interface where each node will have about 1-2TB of disk space and each node is based on a 'low' power consumption architecture. The next issue to tackle is finding a file system that could span across all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority, however it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS) and I've looked into Lustre, xFS and PVFS. There are some interesting commercial products such as the File Director from NeoPath Networks and a few others; however the cost is astronomical. I would like to know if any Slashdot readers have any experience in building out such a solution? Any help/idea(s) would be greatly appreciated!"

18 of 557 comments

GFS? by fifirebel · 2005-10-25 07:24 · Score: 4, Informative

Have you checked out GFS from Red Hat (formerly Sistina)?
1. Re:GFS? by N1ck0 · 2005-10-25 07:32 · Score: 3, Informative
  
  GFS over a FC SAN with some EMC CLARiiON CX700s as the hosts is the solution that I'm going to be looking at deploying next year, although there are still some thoughts on using iSCSI instead of FC. It all really depends on what your usage patterns and performance requirements are. I don't believe GFS supports ATAoE systems, but since there is Linux support I doubt it would be too far of a stretch.
2. Re:GFS? by LnxAddct · 2005-10-25 08:33 · Score: 3, Informative
  
  I second this parent post. GFS is exactly what he wants. Although I've never used it in the 1 PB range, I can vouch for it working excellently with TBs.
  Regards,
  Steve
Andrew File System by mroch · 2005-10-25 07:25 · Score: 4, Informative

Check out AFS.
1. Re:Andrew File System by miles31337 · 2005-10-25 08:40 · Score: 3, Informative
  
  No longer true, the OpenAFS 1.3.X (soon to be 1.4) has support for larger files.
PetaBox by Anonymous Coward · 2005-10-25 07:26 · Score: 4, Informative

How about the PetaBox, used by the Internet Archive?
1. Re:PetaBox by MikeFM · 2005-10-25 07:52 · Score: 3, Informative
  
  I priced one of those and decided I'd have to work my way up to that kind of toy. Instead I started with Buffalo's TeraStations, which are affordable and have built-in RAID support. You can mount them in Linux and use LVM to span a single filesystem across several of them, or just mount them normally, depending on your needs. $1-$2 per GB for external RAID storage isn't bad at all.
  
  --
  At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.
Re:Apple Xserve? by Jeff DeMaagd · 2005-10-25 07:29 · Score: 3, Informative

Apple Xserve may be the cheapest of that kind of storage, but it's probably not fitting the original idea of commodity hardware.

Scaling to petabytes means spanning storage across multiple systems.
Data redundancy REQUIRED by cheesedog · 2005-10-25 07:34 · Score: 5, Informative

One thing to think about when building such a system from a large number of hard disks is that disks will fail, all the time. The argument is fairly convincing:
Suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means that the average disk is expected to have a failure about every 57 years. Sounds good, right? Now, suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!
Now, of course the failures won't be spread out evenly, which makes this even trickier. A lot of your disks will be dead on arrival, or fail within the first few hundred hours. A lot will go for a long time without failure. The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.
You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. This allows the system to keep working under X number of failures. Redundancy/coding allows you to plan on scheduled maintenance, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.
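The arithmetic in this comment can be checked with a short sketch. It makes the same assumption the comment does (independent failures spread evenly over time); the function name and the hours-per-year constant are illustrative, not from the original post.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

def hours_between_failures(mtbf_hours, disk_count):
    """Expected hours between failures somewhere in a fleet of identical
    disks, assuming independent failures at a constant rate."""
    return mtbf_hours / disk_count

single = hours_between_failures(500_000, 1)
fleet = hours_between_failures(500_000, 1000)

print(f"one disk: {single / HOURS_PER_YEAR:.1f} years between failures")
print(f"1000 disks: {fleet:.0f} hours, about {fleet / (24 * 7):.1f} weeks")
```

Under these assumptions the fleet-wide interval shrinks in direct proportion to the number of disks, which is why redundancy matters more as the system grows.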
1. Re:Data redundancy REQUIRED by Alef · 2005-10-25 08:09 · Score: 3, Informative
  
  If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!
  For the sake of your argument I suppose that assumption could be considered fair. If one were to do a somewhat more sophisticated analysis, a better model for hard drive failures is the Bathtub curve. It represents the result of a combination of three types of failures: infant mortality (flaws in the manufacturing), random failures and wear-out failures.
  The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.
  I think what you are referring to is how multiple observations of a uniformly distributed stochastic variable generally look. It doesn't have anything to do with fractals, though.
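A small simulation can illustrate the clustering point: failure times drawn independently and uniformly over a period still tend to show quiet stretches and bursts. This is only a sketch; the failure count (17, roughly one year at one failure per 500 hours), the period, and the seed are arbitrary assumptions.

```python
import random

def failure_gaps(n_failures, period_hours, seed=1):
    """Draw failure times uniformly at random over the period and
    return the sorted gaps (in hours) between consecutive failures."""
    rng = random.Random(seed)
    times = sorted(rng.uniform(0, period_hours) for _ in range(n_failures))
    return sorted(b - a for a, b in zip(times, times[1:]))

gaps = failure_gaps(n_failures=17, period_hours=8760)
print(f"shortest gap: {gaps[0]:.0f} h, longest gap: {gaps[-1]:.0f} h")
print(f"mean gap: {sum(gaps) / len(gaps):.0f} h")
```

Trying a few different seeds usually shows shortest and longest gaps that differ by a large factor, with no fractal structure involved.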
I just have to ask... by jcdick1 · 2005-10-25 07:34 · Score: 5, Informative

...what your management was thinking. I mean, I can't imagine a storage requirement that large that you can build in a distributed model that would beat on price per GB an EMC or Hitachi or IBM or whoever's SAN solution. The administration and DR costs alone for something like this would be astronomical. There just isn't really a way to do something this big on the cheap. I mean, this is what SANs were developed for in the first place. It's cheaper per GB than distributed local storage ever could be.

--
What?
IBRIX by Wells2k · 2005-10-25 07:42 · Score: 3, Informative

You may want to take a look at IBRIX systems. They do a pretty robust parallel file system that has redundancy and failover.
Re:Apple Xserve? by stang7423 · 2005-10-25 07:55 · Score: 3, Informative

Apple has a solution for this. Xsan is a distributed filesystem that is based on ADIC's StorNext filesystem. Apple states on that page that it will scale into the range of petabytes.
iSCSI storage / san by pasikarkkainen · 2005-10-25 08:13 · Score: 3, Informative

There seem to be lots of SATA-RAID based iSCSI SAN devices available nowadays. Some links to products I have seen:

http://www.equallogic.com/ They make nice SATA-RAID based iSCSI SAN devices with all the features you could expect (volumes, snapshots, array/volume-expansion, hotswap, redundant controllers, redundant fans, etc).

http://www.equallogic.com/pages/products_PS100E.htm
14 250G sata disks, 3U, 3.5 TB of raw storage.

http://www.equallogic.com/pages/products_PS300E.htm
14 500G sata disks, 3U, 7 TB of raw storage.

http://www.equallogic.com/pages/products_PS2400E.htm
56+ TB

Looks good. I have not yet used them myself :)

Another iSCSI SATA SAN possibility:
http://www.mpccorp.com/smallbiz/store/servers/product_detail/dataframe_420.html
16 sata disks, review:
http://www.infoworld.com/MPC_DataFrame_420/product_53700.html?view=1&curNodeId=0

This company also has SATA iSCSI SAN devices:
http://www.dynamicnetworkfactory.com/products.asp/section/Product~Categories/category/iSCSI/options/IPBank/drivetype/L~Series/formfactor/Integrated/inface/SATA~-~Serial~ATA

iSCSI SAN comparison:
http://www.networkcomputing.com/story/singlePageFormat.jhtml?articleID=170702726

There are also software iSCSI target solutions for use with your own/custom hardware.
http://iscsitarget.sourceforge.net/ for building linux-based iSCSI target/SAN.

If you are familiar with iSCSI targets / iSCSI SAN devices please post your comments!
Re:Apple Xserve? by TRRosen · 2005-10-25 08:19 · Score: 4, Informative

To do this would cost around $50,000 with xRaids and xSan... $2000/TB is probably the best price you're going to get. You could do this with generic hardware, but the cost of assembling, the extra room, the extra power consumption and the maintenance and engineering costs will certainly wipe out what you might save. The xRaid solution could be up in a day and fit in one (actually 1/2) rack.
I do remember some college building a nearline backup storage system using 1U servers with 2 or 3 RAID cards each, connected to like 12 drives per machine in homemade brackets. It was hardly ideal, but it did work. Anybody remember where that was?
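To put the $2000/TB figure in context, here is a minimal sketch that applies it to both ends of the original question. It assumes the price per TB stays flat at every scale, which is an assumption made for illustration and not something the comment claims.

```python
def storage_cost(capacity_tb, dollars_per_tb=2000):
    """Total cost in dollars at a flat price per (decimal) terabyte."""
    return capacity_tb * dollars_per_tb

for tb in (25, 1000):  # 25TB initial target, 1PB = 1000TB final target
    print(f"{tb} TB -> ${storage_cost(tb):,}")
```

At that flat rate the $50,000 quoted above corresponds to the initial 25TB, not the full petabyte.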
What we've done (30TB so far) by bernz · 2005-10-25 08:21 · Score: 4, Informative

We've scaled this to 30TB so far. I'm not sure about 1PB, though. For us, redundancy and storage size is key, performance less so.
Storage nodes: 7 x 2.8TB 2U RAID5+1 boxen with Serial ATA. The 2.8TB is logical, not physical. The OS for each of those machines is RAMDISK based (something we concocted based on what I read about the DNALounge a while back) so it helps curb disk failures of the storage nodes themselves. We guard against disk failure by using RAID5. Of course that doesn't protect against multiple simultaneous disk failures, but read on for more. Each of the storage nodes is exported via NBD.
Then we have a head unit, a 64-bit machine. This machine does a software RAID5 across the storage nodes using an NBD client. Essentially each storage node is a "disk" and the head unit binds and manages the software RAID5. So let's say a whole storage node goes down (for whatever reason it does), all the data is still intact. RAID5 rebuild time over the gigabit network is about 18hrs, which is acceptable. We even have another storage box as a hot-spare.
On top of that, we have the whole cluster mirrored to another identical cluster via DRBD in a different geographic location. This is linked by Gigabit WAN. So if we have a massive disaster and lose the entire primary cluster, then we have a secondary cluster ready to go. We needed to purchase the Enterprise version of DRBD ($2k US) but that's worth it because they're neato guys.
We use XFS as the filesystem. This system gives us 14TB of redundant "RAID-55 with a Mirror" space. Both clusters together? $85k.
When the cluster starts running out of space (about 70% or so), we add ANOTHER cluster of similar stats to the initial one and use LVM to join the two units together.
This has scaled us to 30TB and we're pretty happy with it. The read speed is very good (hdparm says Timing buffered disk reads: 200 MB in 3.01 seconds = 66.49 MB/sec) and the write speed is about 32 MB/sec. For what our application is doing, that's a fine speed.
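The capacity and rebuild numbers in this comment can be sanity-checked with a short sketch. The post does not say exactly how many nodes take part in the head unit's RAID5; six active nodes with one hot spare is one reading that happens to reproduce the 14TB figure, and it is treated here as an assumption. Decimal units are also assumed.

```python
def raid5_usable_tb(members, member_tb):
    """RAID5 spends one member's worth of capacity on parity."""
    return (members - 1) * member_tb

NODE_TB = 2.8       # logical capacity of each storage node (from the post)
ACTIVE_NODES = 6    # assumption: 7 boxes, one held back as the hot spare

usable_tb = raid5_usable_tb(ACTIVE_NODES, NODE_TB)

# $85k covers both clusters; the mirror adds safety, not usable space.
cost_per_usable_tb = 85_000 / usable_tb

# Average rate needed to rebuild one 2.8TB node in 18 hours.
rebuild_mb_per_s = NODE_TB * 1_000_000 / (18 * 3600)

print(f"usable: {usable_tb:.1f} TB")
print(f"cost: ${cost_per_usable_tb:,.0f} per usable TB")
print(f"rebuild rate: {rebuild_mb_per_s:.1f} MB/s")
```

The implied rebuild rate comes out in the same range as the read and write speeds quoted above, so the 18-hour figure looks plausible for a gigabit link.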
Re:For the most part by fool · 2005-10-25 09:17 · Score: 3, Informative

well, since all of the (high-end) PC's we were looking at for snort boxen had severe problems pushing even 5Gbit/s (not GByte) of traffic in/out over the PCI busses simultaneously, you hit a bottleneck pretty quickly there, even before you get to 25TB with your disk sizes. at 500GB disks you get pretty close, but you're at the ceiling already. while a decent (not even cutting-edge) machine could push a Gbit to the server pretty easily, the server, no matter how beefy, needs a ton of internal bandwidth to gather/process/serve the data timely-like. if he only needs 100mbit/s of data service then he's golden =)

or did you mean to specify a GBit switch in between the clients/big box?

also, agree with yours and others' proclamation that administration will not be trivial. be sure to spec at least 6 months of your time in writing/debugging scripts to automate the detection and RMA of dead drives, and find a vendor who will ship based on an automated mail you can send out about failed disks, rather than waiting on turnaround from you pulling the drive and the delivery making a round trip.
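To give the bandwidth figures in this comment some scale, here is a rough sketch of how long it takes to move the initial 25TB at a few sustained link rates. It ignores protocol overhead and disk limits and uses decimal units; the function name is illustrative.

```python
def transfer_hours(capacity_tb, link_gbit_per_s):
    """Hours to move capacity_tb (decimal TB) at a sustained link rate,
    ignoring protocol overhead."""
    bits = capacity_tb * 1e12 * 8
    return bits / (link_gbit_per_s * 1e9) / 3600

for rate in (0.1, 1, 5):  # 100 Mbit/s, 1 Gbit/s, 5 Gbit/s
    print(f"{rate} Gbit/s: {transfer_hours(25, rate):.1f} hours for 25 TB")
```

Even at the 5Gbit/s ceiling mentioned above, a full pass over the data takes many hours, which is worth keeping in mind for backups and rebuilds.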
That's not MTBF, this is.. by beldraen · 2005-10-25 09:54 · Score: 4, Informative

Just a comment about MTBF. It's often not understood, and it is one of my little pet peeves with tech producers because they don't try to correct it. MTBF is a rating of how reliably a device will last through its warranty period.

You have a drive that is rated 500,000 hours MTBF. Suppose you bought a drive and let it run at rated duty. Drives are normally rated to run 100% of the time, but many other devices will have a duty period. Further, you run the drive until its warranty is up. You then throw this perfectly working drive out the window and replace it. If you keep up this pattern, then approximately once per 500,000 hours on average you should have a drive fail before the warranty period is up. This is why it is important to look not only at the MTBF but also at the warranty period.

As a side note: In theory, you should be throwing drives out on a periodic basis. One way around this is to not buy all the same drive type and manufacturer. By having a pool of drive types, you distribute, and thus reduce, the risk of drive failures. Additionally, you may want to have a standard period of time for drive replacement so as to schedule your down time, as opposed to it all being unexpected.
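One common way to turn an MTBF rating into something you can plan around is a per-drive annual failure probability. The sketch below assumes a constant failure rate during service life (exponential lifetimes) and drives powered 24 hours a day; that model is a standard approximation and not something this comment spells out.

```python
import math

HOURS_PER_YEAR = 24 * 365

def annual_failure_probability(mtbf_hours, duty=1.0):
    """Probability that a drive fails within one year, assuming a constant
    failure rate (exponential lifetimes) and the given duty cycle."""
    powered_hours = HOURS_PER_YEAR * duty
    return 1 - math.exp(-powered_hours / mtbf_hours)

afr = annual_failure_probability(500_000)
print(f"per-drive annual failure probability: {afr:.2%}")
print(f"expected failures per year in 1000 drives: {1000 * afr:.1f}")
```

Multiplying the per-drive figure by the warranty length and the fleet size gives a rough count of in-warranty replacements to budget for.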

--
Bel, the mostly sane.. "Of course I can't see anything! I'm standing on the shoulders of idiots." -- Me