Slashdot Mirror


Building a Massive Single Volume Storage Solution?

An anonymous reader asks: "I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution becomes cost prohibitive, and the budget for the solution is fairly small. Some the technologies that I've been scoping out are iSCSI, AoE and plain clustered/grid computers with JBOD (just a bunch of disks). Personally I'm more inclined on a grid cluster with 1GB interface where each node will have about 1-2TB of disk space and each node is based on a 'low' power consumption architecture. Next issue to tackle is finding a file system that could span across all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority, however it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS) and I've researched into Lustre, xFS and PVFS. There some interesting commercial products such as the File Director from NeoPath Networks and a few others; however the cost is astronomical. I would like to know if any Slashdot readers have any experience in build out such a solution? Any help/idea(s) would be greatly appreciated!"

8 of 557 comments (clear)

  1. GFS? by fifirebel · · Score: 4, Informative

    Have you checked out GFS from RedHat (formerly Sistina)?

  2. Andrew FIle System by mroch · · Score: 4, Informative

    Check out AFS.

  3. PetaBox by Anonymous Coward · · Score: 4, Informative

    Howabout the PetaBox, used by the Internet Archive ?

  4. Data redundancy REQUIRED by cheesedog · · Score: 5, Informative
    One thing to think about when building such a system from a large number of hard disks is that disks will fail, all the time. The argument is fairly convincing:

    Suppose each disk has a MTBF (mean time before failure) of 500,000 hours. That means that the average disk is expected to have a failure about every 57 years. Sounds good, right? Now, suppose you have 1000 disks. How long before the first one fails? Chances, are, not 57 years. If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!

    Now, of course the failures won't be spread out evenly, which makes this even trickier. A lot of your disks will be dead on arrival, or fail within the first few hundred hours. A lot will go for a long time without failure. The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.

    You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. This allows the system to keep working under X number of failures. Redundancy/coding allows you to plan on scheduled maintanence, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.

  5. I just have to ask... by jcdick1 · · Score: 5, Informative

    ...what your management was thinking. I mean, I can't imagine a storage requirement that large that you can build in a distributed model that would beat on price per GB an EMC or Hitachi or IBM or whomever SAN solution. The administration and DR costs alone for something like this would be astronomical. There just isn't really a way to do something this big on the cheap. I mean, this is what SANs were developed for in the first place. Its cheaper per GB than distributed local storage ever could be.

    --
    What?
  6. Re:Apple Xserve? by TRRosen · · Score: 4, Informative
    To do this would cost around $50,000 with xRaids and xSan...$2000/TB is probably the best price your going to get. You could do this with generic hardware but the cost of assembling, the extra room, extra power consumption and the maintaince and enginnering costs will cetainly wipe out what you might save. The xRaid solution could be up in a day and fit in one (actually 1/2) rack.

    I do remember some college buiding a nearline backup storage system using 1U servers with 2 or 3raid cards each connected to like 12 drives per machine in homemade brackets but it was hardly ideal. But It did work. Anybody remember where that was?

  7. What we've done (30TB so far) by bernz · · Score: 4, Informative
    We've scaled this to 30TB so far. I'm not sure about 1PB, though. For us, redundancy and storage size is key, performance less so.

    Storage nodes: 7 x 2.8TB 2U RAID5+1 boxen with Serial ATA. The 2.8TB is logical, not physical. The OS for each of those machines is RAMDISK based (something we concocted based on what I read about the DNALounge awhile back) so it helps curb disk failures of the storage nodes themselves. We avoid disk failure by using RAID5. Of course that doesn't protect against mutiple simultaneous disk failure, but read on for more. Each of the storage nodes is exported via NBD.

    Then we have a head unit, a 64-bit machine. This machine does a software RAID5 across the storage nodes using an NBD client. Essentially each storage node is a "disk" and the head unit binds and manages the sofware raid5. So let's say a whole storage node goes down (for whatever reason it does), all the data is still intact. RAID5 rebuild time over the gigabit network is about 18hrs, which is acceptable. We even have another storage box as a hot-spare.

    On top of that, we have the whole cluster mirrored to another identical cluster via DRBD in a different geographic location. This is linked by Gigabit WAN. So if we have a massive disaster and lose the entire primary cluster, then we have a 2ndary cluster ready to go. We needed to purchase the Enterprise version of DRBD ($2k US) but that's worth it because they're neato guys.

    We use XFS as the filesystem. This system gives us 14TB of redundant "RAID-55 with a Mirror" space. Both clusters together? $85k.

    When the cluster starts running out of space (about 70% or so), we add ANOTHER cluster of similar stats to the initial one and use LVM to join the two units together.

    This has scaled us to 30TB and we're pretty happy with it. The read speed is very good (hdparm says Timing buffered disk reads: 200 MB in 3.01 seconds = 66.49 MB/sec) and the write speed is about 32 MB/sec. For what our application is doing, that's a fine speed.

  8. That's not MTBF, this is.. by beldraen · · Score: 4, Informative

    Just a comment about MTBF. It's often not understood, and it is one of my little pet peaves with tech producers because they don't try to correct it. MTBF is a rating for reliability to achieve lasting the warrenty period.

    You have a drive that is rated 500,000 hours MTBF. Suppose you bought a drive and let it run at rated duty. Driver are normally rated to run 100% of the time, but many other devices will have duty period. Further, you run the drive until its warrenty is up. You then throw this perfectly working drive out the window and replace it. If you keep the up this pattern, then approximately once per 500,000 hours on average you should have a drive fail before the warrenty period is up. This is why it is important to not only look at the MTBF but also its warrenty period.

    As a side note: In theory, you should be throwing drives out on a periodic basic. One way around this is to not buy all the same drive type and manufacturer. By having a pool of drive types, you distribute, thus minimize, risk of drive failures. Additionally, you may want to have a standard period of time for drive replacement so as to shedule your down time, as opposed to it all being unexpected.

    --
    Bel, the mostly sane.. "Of course I can't see anything! I'm standing on the shoulders of idiots." -- Me