Slashdot Mirror


Building a Massive Single Volume Storage Solution?

An anonymous reader asks: "I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget for the solution is fairly small. Some of the technologies that I've been scoping out are iSCSI, AoE and plain clustered/grid computers with JBOD (just a bunch of disks). Personally I'm more inclined toward a grid cluster with a 1Gb interface, where each node will have about 1-2TB of disk space and is based on a 'low' power consumption architecture. The next issue to tackle is finding a file system that could span all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority; however, it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS), and I've looked into Lustre, xFS and PVFS. There are some interesting commercial products such as the File Director from NeoPath Networks and a few others; however, the cost is astronomical. I would like to know if any Slashdot readers have any experience in building out such a solution? Any help/idea(s) would be greatly appreciated!"

19 of 557 comments

  1. GPFS from IBM by LuckyStarr · · Score: 5, Interesting

    May or may not be what you are searching for. Quite expensive, but an impressive feature list.

    http://www-03.ibm.com/servers/eserver/clusters/software/gpfs.html

    --
    Meme of the day: I browse "Disable Sigs: Checked". So should you.
    1. Re:GPFS from IBM by Zombie · · Score: 2, Interesting

      My wife's building a 4 petabyte array (starting with 600 terabytes by the end of this year) for real-time multiple-access high-speed video streaming on GPFS. All GNU/Linux and commodity hardware. The switch fabric of the network is the hard bit. It's a bitch on fibre channel, but iSCSI should deliver higher performance at less than half the price. That's when you can get the hardware, and if you have the right Ethernet switch fabric again...

  2. Re:Apple Xserve? by medazinol · · Score: 5, Interesting

    My first thought as well. However, he is asking for a single volume solution. So XSAN from Apple would have to be implemented. Good thing that it's compatible with ADIC's solution for cross-platform support.
    Probably would be the least expensive option overall and the simplest to implement. Don't take my word for it, go look for yourself.

  3. Scale by LLuthor · · Score: 3, Interesting

    If you know the scale of the problem, you should consult with a company like EMC to provide the support for this thing - you WILL need it.

    Clustering the disks with iSCSI or ATAoE is trivial - you can do that very easily, but the filesystem to run on top of it is where you will have problems.

    PVFS - has no redundancy - lose one node, lose them all.
    GFS - does not scale well to those sizes or a large number of nodes - lots of hassle with the dlm.
    GoogleFS - Essentially write-once only - no small (50GB) files - little or no locking.
    xFS - Way too easy to lose your data.

    It seems that you only have one option:
    Lustre - VERY Expensive - lots of hassle with meta-data servers and lock servers.

    Go with a company to take care of all this hassle - you do not have the resources of Google to deal with this kind of thing yourself.

    --
    LL
  4. Re:Go Virtual by krbvroc1 · · Score: 2, Interesting

    He asked for low cost commodity hardware. The fact that no price is mentioned and you need to contact a sales droid for a quote is an instant red flag. I hate vendors who do not put price lists, even 'retail' prices, on their product pages. I realize they may have different price levels based on quantity, but there is value in seeing that a product is in the '$1000-$1500' range versus the '$120000-$150000' range. Having to contact sales droids who will put your name/phone number on a sales list and harass you, just to find out the price range, turns me off a lot of these outfits. I do a lot of product research and selection using the Internet. I favor outfits who allow me to get all the info online without contacting a sales rep. Many times if I cannot get the info on the web and I cannot get a price on the first phone call without providing sales lead information, I skip them.

  5. Re:gmail by Anonymous Coward · · Score: 2, Interesting

    Gmail? Why bother when you can just use a few hundred million Tinydisks instead?

    I wonder if tinyurl can handle 25TB...

  6. Been there done that by CommanderC · · Score: 2, Interesting

    I wrote a web application and a client in C# that use gmail accounts as a sort of file system. A set of email accounts act as "index" accounts: the gmail search functionality finds what you are looking for, and the attachment on the index entry is then used to grab the parts of the file, which were spread across multiple gmail accounts in 500K chunks. It works really well. I did it for fun to see if I could. It uses smtp to post the file chunks to a given set of accounts, and users can donate accounts to the hive at will, increasing the overall storage size. It is all hosted, maintained and indexed by gmail or any other free mail service as one big file system.
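
    A minimal sketch of the chunk-and-index idea described above, written in Python rather than the poster's C#. It does no mail traffic at all: the round-robin placement, the index layout and the `fetch` callback are assumptions made for illustration, not the poster's actual design.

```python
CHUNK_SIZE = 500 * 1024  # 500K chunks, the size mentioned in the comment


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut the payload into fixed-size pieces; the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def build_index(filename: str, chunks: list[bytes], accounts: list[str]) -> list[dict]:
    """Assign chunks to accounts round-robin (an assumed policy) and record
    where each chunk went, so the file can be found and rebuilt later."""
    return [
        {
            "file": filename,
            "seq": seq,
            "account": accounts[seq % len(accounts)],
            "size": len(chunk),
        }
        for seq, chunk in enumerate(chunks)
    ]


def reassemble(index: list[dict], fetch) -> bytes:
    """Rebuild the file by fetching chunks in sequence order.

    `fetch(account, filename, seq)` is a hypothetical callback standing in
    for whatever retrieves one attachment from one mailbox.
    """
    ordered = sorted(index, key=lambda entry: entry["seq"])
    return b"".join(fetch(e["account"], e["file"], e["seq"]) for e in ordered)
```

    Donating an account to the hive then amounts to appending it to `accounts`; the index is the only piece that has to be searchable.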

  7. Just wait 5 years ... by tomhudson · · Score: 3, Interesting

    Hard disk space is doubling every 6 months - wait 5 years and you'll be able to buy a 25TB disk for $125.00.

    A single raid50 of them will then give you your petabyte of storage, for around $6,000.
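
    The arithmetic behind those figures can be checked with a short sketch. The prices, disk size and doubling period are taken from the comment; the RAID 50 layout (spans of 6 disks) is an assumption chosen for illustration, since the comment does not give one.

```python
def raid50_usable_tb(disks: int, disk_tb: float, span_size: int) -> float:
    """Usable capacity of RAID 50: several RAID 5 spans striped together.

    Each RAID 5 span gives up one disk's worth of capacity to parity.
    """
    if span_size < 3 or disks % span_size != 0:
        raise ValueError("disks must divide evenly into spans of at least 3")
    spans = disks // span_size
    return spans * (span_size - 1) * disk_tb


budget = 6000.0      # dollars, from the comment
disk_price = 125.0   # dollars per disk, from the comment
disk_tb = 25.0       # TB per disk, from the comment

disks = int(budget // disk_price)                        # 48 disks
usable = raid50_usable_tb(disks, disk_tb, span_size=6)   # span size is assumed

doublings = (5 * 12) // 6    # doubling every 6 months for 5 years
growth = 2 ** doublings      # capacity growth factor implied by that rate

print(disks, usable, growth)
```

    With 8 spans of 6 disks, 40 disks hold data and 8 hold parity, which is where the petabyte comes from.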

  8. Re:Er... be careful by CommanderC · · Score: 1, Interesting

    If they can find the accounts. The usage of the service would not indicate anything unusual if it is done right, and even then you can implement parity or redundancy for data integrity.
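
    As a rough illustration of the parity idea mentioned above, here is a minimal XOR parity sketch in Python. It assumes equal-length chunks (the last chunk of a file would need padding) and can rebuild only one missing chunk per parity group; nothing here reflects the poster's actual implementation.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(chunks: list[bytes]) -> bytes:
    """XOR all chunks of a group into a single parity chunk."""
    if len({len(c) for c in chunks}) != 1:
        raise ValueError("all chunks in a parity group must have the same length")
    return reduce(xor_bytes, chunks)


def recover_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing chunk from the survivors and the parity chunk."""
    return reduce(xor_bytes, surviving, parity)
```

    Storing the parity chunk in an account separate from the data chunks means the loss of any single account in the group is recoverable.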

  9. Hell, BUY it from EMC! by Genady · · Score: 5, Interesting

    As a VERY satisfied customer, I say, just buy the damned thing from EMC. There's few enough warm fuzzy feelings that SysAdmins have in this day and age, like your CE calling at 7:00am saying: "Hey, you had a few hard SCSI errors on Disk 3 Enclosure 0 Tray 0 last night, that's your production LUNs isn't it? There should be a courier there with a disk by 10, and I'll stop by to make sure things are hotsparing back properly after you replace the disk okay?" And *THIS* is just because my CE knows I can handle replacing a disk. Normally he'd come out and do that, and sit around while it re-built the Raid Group.

    Yeah, EMC costs. THIS is why. The support, when needed, is top top top notch. Which would you rather have in a DR situation?

    --
    What if it is just turtles all the way down?
  10. Re:Petabox by rpresser · · Score: 2, Interesting

    Depending on latency requirements, perhaps most of the cluster can stay in sleep mode until it is needed.

  11. How about a PetaBox? by McSpew · · Score: 4, Interesting

    The folks at the Internet Archive have already done the hard work of figuring out how to create a petabyte storage system using commodity hardware. The system works so well they started a company to sell PetaBoxes to others. Why reinvent the wheel?

  12. Here's my solution by Anonymous Coward · · Score: 2, Interesting

    I manage a small (29 dual-xeon nodes) linux cluster in a lab for my local college. A while ago I had the same problem when we ran out of storage space on the main file server.

    My solution was to use the nodes' hard disks (each one has a 120GB Ultra320 10000rpm disk) combined in a network RAID1+0 solution (we use gigabit ethernet) to get more space. With that approach you can get as much redundancy as you need.

    Here's what I did:

    1. After installing the network block device server (nbd-server) on each one of the nodes, I created a 100GB partition on the HD and exported it directly using the raw mode;

    2. On the master node (using the nbd-client) I created a block device for each one of the nodes' partitions;

    3. After that I installed the linux software raid tools (mdadm) and created a small RAID1 array for each pair of nodes. I ended up with 14 100GB network RAID1 arrays, each one with its very own /dev/md# block device;

    4. I created a big 1.4TB (14 * 100GB) RAID0 array with the 14 RAID1 ones and attached it to the /dev/md0 device;

    5. The final step was to create a large ReiserFS filesystem on the /dev/md0 array, and I was done.
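
    The layout in steps 3 and 4 can be sketched as a small Python planner. It only builds mdadm command lines as strings and computes the usable capacity; nothing is executed. The /dev/nbdN device names and the /dev/md1../dev/md14 numbering of the mirrors are assumptions, and the poster's own start-up and shutdown scripts are not reproduced here.

```python
def plan_network_raid10(nbd_devices: list[str], size_gb: int) -> dict:
    """Pair exported devices into RAID1 mirrors, then stripe the mirrors.

    Returns the mdadm command lines (as strings only) and the usable
    capacity in GB. Each mirror contributes the size of one device.
    """
    if len(nbd_devices) < 4 or len(nbd_devices) % 2:
        raise ValueError("need an even number of devices, at least 4")

    commands, mirrors = [], []
    for n, i in enumerate(range(0, len(nbd_devices), 2), start=1):
        md = f"/dev/md{n}"  # assumed naming for the RAID1 arrays
        first, second = nbd_devices[i], nbd_devices[i + 1]
        commands.append(
            f"mdadm --create {md} --level=1 --raid-devices=2 {first} {second}"
        )
        mirrors.append(md)

    # Stripe all mirrors together on /dev/md0, as in step 4.
    commands.append(
        f"mdadm --create /dev/md0 --level=0 --raid-devices={len(mirrors)} "
        + " ".join(mirrors)
    )
    return {"commands": commands, "usable_gb": len(mirrors) * size_gb}


# 28 exported 100GB partitions give 14 mirrors and 1400GB usable.
plan = plan_network_raid10([f"/dev/nbd{i}" for i in range(28)], size_gb=100)
```

    Pairing adjacent devices is the simplest choice; a real deployment might prefer to pair nodes on different switches or power circuits.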

    You have to pay special attention to the array shutdown and startup procedures. I wrote my own scripts to take care of that for me.

    Our array may seem small compared to what you are looking for, but I am pretty sure that the approach will scale well to arrays much larger than ours.

    Good luck.

  13. Small companies and short-sighted management by Anonymous Coward · · Score: 1, Interesting

    There are some smart proprietors of small businesses that think cheap, like this. Used to work for one -- the guy was smart, but not smart enough. A four-person company, and he asked me to build something similar. I tried to explain why storage solutions from IBM were so expensive, but he would have none of that, and insisted on building this from Intel white-box parts. The project failed.

    1/5 boxes arrived DOA. The ethernet cards didn't work. The cables to the hard drives weren't long enough. The hot-pluggable disk trays were flaky. The BIOS had to be flashed. The proprietary hard drive controller drivers sucked, had to buy new controllers. 1/10 disk drives were DOA.

    Three months later, and $40K poorer, we had a system that couldn't pass 24 hours of stress testing without failing in some wacky way. For the $40K and my salary time, we could have bought a usable system from IBM or HP or whomever, and it would have worked. Engineering big systems is non-trivial.

  14. Re:Data redundancy REQUIRED by cheesedog · · Score: 2, Interesting

    Not to nitpick back at you or anything, but have you ever sat in front of a system with 100s of cheap-off-the-shelf drives and recorded the failure times? I'll be a monkey's uncle if they aren't self-similar.

  15. 4.4TB on raid5 per node at $0.32 per GB by Hackeron · · Score: 2, Interesting

    1) ~$100 - nforce4 motherboard with 8 onboard sata ports,
    2) ~$40 - an additional PCI sata controller with 4 ports,
    3) ~$100 - the cheapest AMD64 CPU you can buy,
    4) ~$150 - coolermaster stacker case,
    5) ~$1020 - 12 WD 400GB drives,
    6) $0 - your favorite Linux distribution.

    TOTAL: $1410

    Each drive eats about 15W meaning around 180W with an additional 60W for motherboard/cpu consumption which makes it a comparable solution to an efficient scsi solution in terms of power consumption at a small fraction of the cost.
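
    The headline numbers follow directly from the parts list, as this short Python check shows. All prices and wattages are the poster's own figures; the only assumption is that all 12 drives sit in a single raid5 array, which costs one drive's worth of capacity for parity.

```python
parts = {
    "nforce4 motherboard": 100,
    "PCI sata controller": 40,
    "AMD64 CPU": 100,
    "coolermaster stacker case": 150,
    "12 x 400GB drives": 1020,
    "Linux distribution": 0,
}
drives, drive_gb = 12, 400

total_cost = sum(parts.values())          # dollars for one node
usable_gb = (drives - 1) * drive_gb       # raid5 loses one drive to parity
cost_per_gb = total_cost / usable_gb      # dollars per usable GB
power_w = drives * 15 + 60                # 15W per drive plus 60W for board/cpu

print(total_cost, usable_gb, round(cost_per_gb, 2), power_w)
```

    That gives $1410 for 4400GB usable, or roughly $0.32 per GB at about 240W per node.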

    Personally, I created a raid1 array of 2 37GB 10krpm raptor drives for critical stuff and the OS, and 2 raid5 arrays of 5 300GB drives each for an even better cost per GB while increasing redundancy by a factor of 2. But that only gives you 2.4TB per node in that case.

    The configuration can be done with evms or lvm2. Rebuilding and replacing drives on the fly should work just fine in theory (never tried on the fly), but if not, a scheduled 5 minute downtime is just fine also. My previous 0.5TB raid5 has been up >3 years so far, and a hard drive failure just required running mdadm /dev/md0 --add /dev/sda5 to rebuild the array.

    Increasing the array size becomes tricky (although an available option), and fiddling with various distributed network filesystems doesn't really seem worth it for me personally, but openmosix and other clustering solutions offer distributed filesystems.

    Just remember, the SATA architecture is nice; SCSI isn't really a requirement for this kind of solution.

  16. Re:Petabox by faragon · · Score: 2, Interesting

    But you do not have random access to your own data (to say nothing of reliability).

  17. terrascale is cool... by anon+mouse-cow-aard · · Score: 2, Interesting

    http://www.terrascale.com/prod_e.html Run a client on linux boxes with user-mode drivers that provide a logical abstraction for a whole network of backend linux boxes over any networking transport you want.

  18. Petabox from Capricorn by Ty_Berg · · Score: 2, Interesting

    I ran across this a while back at linuxdevices. It is supposed to scale to petabytes and is the main technology used for the Internet Archive.

    Capricorn Technologies Petabox
    http://www.capricorn-tech.com/

    Linux Devices Review
    http://linuxdevices.com/news/NS2659179152.html