Building a Massive Single Volume Storage Solution?

Posted by Cliff on Tuesday October 25, 2005 @07:21AM from the 15-zeros-is-a-lot-of-bytes dept.

An anonymous reader asks: "I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget for the solution is fairly small. Some of the technologies that I've been scoping out are iSCSI, AoE and plain clustered/grid computers with JBOD (just a bunch of disks). Personally I'm more inclined toward a grid cluster with 1GB interface where each node will have about 1-2TB of disk space and each node is based on a 'low' power consumption architecture. The next issue to tackle is finding a file system that could span all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority, however it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS) and I've looked into Lustre, xFS and PVFS. There are some interesting commercial products such as the File Director from NeoPath Networks and a few others; however the cost is astronomical. I would like to know if any Slashdot readers have any experience in building out such a solution? Any help/idea(s) would be greatly appreciated!"

26 of 557 comments

gmail by Adult+film+producer · 2005-10-25 07:24 · Score: 4, Funny

register a few thousand gmail accounts and write the interface that will make writing of data to gmail inboxes invisible to the app.
1. Re:gmail by Stuart+Gibson · 2005-10-25 07:44 · Score: 4, Funny
  
  That would have been my second answer.
  
  The first, and presumably the reason this was posted to /. is simple...
  
  Imagine a Beowulf cluster...
  
  Stuart
  
  --
  It's all fun and games until a 200' robot dinosaur shows up and trashes Neo-Tokyo... Again
GFS? by fifirebel · 2005-10-25 07:24 · Score: 4, Informative

Have you checked out GFS from RedHat (formerly Sistina)?
Andrew File System by mroch · 2005-10-25 07:25 · Score: 4, Informative

Check out AFS.
PetaBox by Anonymous Coward · 2005-10-25 07:26 · Score: 4, Informative

How about the PetaBox, used by the Internet Archive?
1. Re:PetaBox by sycodon · 2005-10-25 07:45 · Score: 5, Funny
  
  Just don't call it PetaFile.
  
  --
  When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
2. Re:Petabox by afidel · 2005-10-25 07:56 · Score: 4, Insightful
  
  This guy is worried about budget, yet even with the "low power" usage of the petabox it would still use 50kW for one petabyte of storage! When you combine the cooling for that with the cost of electricity you are talking some serious money. If you have trouble getting the capital funds for something like this how are you ever going to pay the operating costs?
  
  --
  There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
GPFS from IBM by LuckyStarr · 2005-10-25 07:29 · Score: 5, Interesting

May or may not be what you are looking for. Quite expensive, but an impressive feature list.

http://www-03.ibm.com/servers/eserver/clusters/software/gpfs.html

--
Meme of the day: I browse "Disable Sigs: Checked". So should you.
Re:Apple Xserve? by medazinol · 2005-10-25 07:30 · Score: 5, Interesting

My first thought as well. However, he is asking for a single volume solution. So XSAN from Apple would have to be implemented. Good thing that it's compatible with ADIC's solution for cross-platform support.
Probably would be the least expensive option overall and the simplest to implement. Don't take my word for it, go look for yourself.
Wow by DingerX · 2005-10-25 07:33 · Score: 5, Funny

I never thought I'd see the day when sites were boasting a petabyte of porn.
That's over 3 million hours of .avis -- if you sat down and watched them end-to-end, you'd have 348 years of "backdoor sliders", "dribblers to short", "pop flies", and "long balls". We live in an enlightened age.
1. Re:Wow by spuke4000 · 2005-10-25 07:57 · Score: 5, Funny
  
  I'm not really sure I need 348 years of porn. I usually find porn really interesting for the first 3 minutes or so, then for some reason it's not so interesting anymore. But maybe that's just me.
  
  --
  This post cannot be rebroadcast without the express written consent of Major League Baseball.
Data redundancy REQUIRED by cheesedog · 2005-10-25 07:34 · Score: 5, Informative

One thing to think about when building such a system from a large number of hard disks is that disks will fail, all the time. The argument is fairly convincing:
Suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means that the average disk is expected to have a failure about every 57 years. Sounds good, right? Now, suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!
Now, of course the failures won't be spread out evenly, which makes this even trickier. A lot of your disks will be dead on arrival, or fail within the first few hundred hours. A lot will go for a long time without failure. The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.
You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. This allows the system to keep working under X number of failures. Redundancy/coding allows you to plan on scheduled maintenance, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.
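The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only a sketch under the same simplifying assumption that failures are spread evenly over time; the function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def years_between_failures(mtbf_hours: float) -> float:
    """Convert an MTBF rating in hours to years for a single disk."""
    return mtbf_hours / HOURS_PER_YEAR


def fleet_failure_interval_hours(mtbf_hours: float, disk_count: int) -> float:
    """Average time between failures across a fleet of identical disks,
    assuming failures are spread evenly over time."""
    return mtbf_hours / disk_count


if __name__ == "__main__":
    mtbf = 500_000  # hours, the rating used in the example above
    disks = 1000
    interval = fleet_failure_interval_hours(mtbf, disks)
    print(f"One disk: a failure about every {years_between_failures(mtbf):.0f} years")
    print(f"{disks} disks: a failure about every {interval:.0f} hours "
          f"(~{interval / (24 * 7):.1f} weeks)")
```

Plugging in a different disk count shows how quickly the interval shrinks as the array grows.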
1. Re:Data redundancy REQUIRED by OrangeSpyderMan · 2005-10-25 07:42 · Score: 4, Insightful
  
  Agreed. We have around 50 TByte of data in one of our datacenters and it's great, but the number of disks that fail when you have to restart the systems (SAN fabric firmware install) is just scary. Even on the system disks of the Wintel servers (around 400) which are DAS, around 10% fail on datacenter powerdowns. That's where you pray that statistics are kind and you have no more failures on any one box than you have hot spares+tolerance :-) Last time one server didn't make it back up because of this.... though it was actually strictly speaking the PSUs that let go, it would appear.
  
  --
  Try NetBSD... safe,straightforward,useful.
I just have to ask... by jcdick1 · 2005-10-25 07:34 · Score: 5, Informative

...what your management was thinking. I mean, I can't imagine a storage requirement that large that you can build in a distributed model that would beat on price per GB an EMC or Hitachi or IBM or whoever's SAN solution. The administration and DR costs alone for something like this would be astronomical. There just isn't really a way to do something this big on the cheap. I mean, this is what SANs were developed for in the first place. It's cheaper per GB than distributed local storage ever could be.

--
What?
For the most part by retinaburn · 2005-10-25 07:39 · Score: 4, Insightful

the reason you can't find a cheap way to do this is because it just isn't cheap.

I would look at some lessons learned from Google. If you decide to go with some sort of homebrew solution based on a bunch of standard consumer disks you will run into other problems besides money. The more disks you have running, the more failures you will encounter. So any system you set up has to be able to have drives fail all day, and not require human intervention to stay up and running (unless you can get humans for cheap too).
Do It Right by moehoward · 2005-10-25 07:41 · Score: 5, Insightful

Look. Everyone wants a Lamborghini for the price of a Chevy. Cute. Yawn. Half of the Ask Slashdot questions are people who didn't find what they want at Walmart. Despite the amazing Slashdot advice, Ask Slashdot answers have somehow failed to put EMC, IBM, HP, etc. out of business. There is no free lunch.

Just call EMC, get a rep out, and give the paperwork to your boss. Do it today instead of 5 months from now and you will have a much better holiday season.

Note to moderators and other finger pointers: I did not say to BUY from EMC, I just said to show his boss how and why to do things the right way. It does not hurt to get quotes from the big vendors, mainly because the quote also comes with good, solid info that you can share with the PHBs. Despite what you think about "evil" tech sales persons and sales engineers, you actually can learn from them.

--
"If you want to improve, be content to be thought foolish and stupid." - Epictetus
Yup, time to pick up the phone. by Kadin2048 · 2005-10-25 07:48 · Score: 5, Insightful

Exactly. This seems like somebody is trying to figure out a way to do something in-house which really ought to be left to either an outside contractor, or at least set up as a turnkey solution by a consultant. Given that he knows little enough about it that he's asking for help on Slashdot, I think this is yet another problem best solved using the telephone and a fat checkbook, and enough negotiating skills to convince management to pony up the cash up front instead of piddling it out over time on an in-house solution that's going to be a hole into which money and time are poured.

I know people get tired of hearing "call IBM" as a solution to these questions, but in general if you have some massive IT infrastructure development task and are so lost on it that you're asking the /. crowd for help, calling in professionals to take over for you probably isn't a bad idea.

It's not even a question of whether you could do it in-house or not; given enough resources you probably could. It comes down to why you want to do something like this yourselves instead of finding people who do it all the time, week after week, for a living, telling them what you want, getting a price quote, and getting it done. Sure seems like a better way to go to me.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
No Redundancy? by Giggles+Of+Doom · 2005-10-25 07:57 · Score: 4, Insightful

A PETABYTE without redundancy? I can't imagine having that much data I didn't care about.

--
"A coward dies a thousand deaths, the brave but one."
1. Re:No Redundancy? by digidave · 2005-10-25 08:12 · Score: 4, Funny
  
  "I can't imagine having that much data I didn't care about."
  
  Hollywood script archive.
  
  --
  The global economy is a great thing until you feel it locally.
Hell, BUY it from EMC! by Genady · 2005-10-25 08:08 · Score: 5, Interesting

As a VERY satisfied customer, I say, just buy the damned thing from EMC. There's few enough warm fuzzy feelings that SysAdmins have in this day and age, like your CE calling at 7:00am saying: "Hey, you had a few hard SCSI errors on Disk 3 Enclosure 0 Tray 0 last night, that's your production LUNs isn't it? There should be a courier there with a disk by 10, and I'll stop by to make sure things are hotsparing back properly after you replace the disk okay?" And *THIS* is just because my CE knows I can handle replacing a disk. Normally he'd come out and do that, and sit around while it re-built the Raid Group.

Yeah, EMC costs. THIS is why. The support, when needed, is top top top notch. Which would you rather have in a DR situation?

--

What if it is just turtles all the way down?
Re:Apple Xserve? by TRRosen · 2005-10-25 08:19 · Score: 4, Informative

To do this would cost around $50,000 with xRaids and xSan...$2000/TB is probably the best price you're going to get. You could do this with generic hardware but the cost of assembling, the extra room, extra power consumption and the maintenance and engineering costs will certainly wipe out what you might save. The xRaid solution could be up in a day and fit in one (actually 1/2) rack.
I do remember some college building a nearline backup storage system using 1U servers with 2 or 3 RAID cards each connected to like 12 drives per machine in homemade brackets but it was hardly ideal. But it did work. Anybody remember where that was?
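For what it's worth, the $50,000 figure lines up with the asker's initial 25TB at $2000/TB. A rough sketch of what the same rate would mean at the 1PB target, assuming (and these are assumptions, not anything quoted by a vendor) that the price scales linearly and that 1PB is counted as 1000TB:

```python
COST_PER_TB_USD = 2000  # per-TB figure quoted in the comment above


def storage_cost_usd(capacity_tb: float, cost_per_tb: float = COST_PER_TB_USD) -> float:
    """Hardware cost at a flat per-TB rate (ASSUMPTION: no volume discount,
    no price drops over time, no extra infrastructure at scale)."""
    return capacity_tb * cost_per_tb


if __name__ == "__main__":
    # 1PB taken as 1000TB (decimal units), which is an assumption.
    for label, capacity_tb in (("initial 25TB", 25), ("full 1PB", 1000)):
        print(f"{label}: ${storage_cost_usd(capacity_tb):,.0f}")
```

In practice the per-TB price would almost certainly change between the first purchase and the last, so treat the larger number as an upper-end guess.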
What we've done (30TB so far) by bernz · 2005-10-25 08:21 · Score: 4, Informative

We've scaled this to 30TB so far. I'm not sure about 1PB, though. For us, redundancy and storage size is key, performance less so.
Storage nodes: 7 x 2.8TB 2U RAID5+1 boxen with Serial ATA. The 2.8TB is logical, not physical. The OS for each of those machines is RAMDISK based (something we concocted based on what I read about the DNALounge awhile back) so it helps curb disk failures of the storage nodes themselves. We guard against disk failure by using RAID5. Of course that doesn't protect against multiple simultaneous disk failures, but read on for more. Each of the storage nodes is exported via NBD.
Then we have a head unit, a 64-bit machine. This machine does a software RAID5 across the storage nodes using an NBD client. Essentially each storage node is a "disk" and the head unit binds and manages the software RAID5. So let's say a whole storage node goes down (for whatever reason it does), all the data is still intact. RAID5 rebuild time over the gigabit network is about 18hrs, which is acceptable. We even have another storage box as a hot-spare.
On top of that, we have the whole cluster mirrored to another identical cluster via DRBD in a different geographic location. This is linked by Gigabit WAN. So if we have a massive disaster and lose the entire primary cluster, then we have a 2ndary cluster ready to go. We needed to purchase the Enterprise version of DRBD ($2k US) but that's worth it because they're neato guys.
We use XFS as the filesystem. This system gives us 14TB of redundant "RAID-55 with a Mirror" space. Both clusters together? $85k.
When the cluster starts running out of space (about 70% or so), we add ANOTHER cluster of similar stats to the initial one and use LVM to join the two units together.
This has scaled us to 30TB and we're pretty happy with it. The read speed is very good (hdparm says Timing buffered disk reads: 200 MB in 3.01 seconds = 66.49 MB/sec) and the write speed is about 32 MB/sec. For what our application is doing, that's a fine speed.
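The capacity figures above can be sanity-checked with a short calculation. This is a sketch, not a description of the actual configuration: the split of six active nodes plus one hot spare is an assumption, chosen because it is one reading that reproduces the 14TB figure. The 2.8TB per node is already the logical size after each node's internal RAID5.

```python
def raid5_usable_tb(member_count: int, member_size_tb: float) -> float:
    """Usable capacity of a RAID5 set: one member's worth of space holds parity."""
    if member_count < 3:
        raise ValueError("RAID5 needs at least three members")
    return (member_count - 1) * member_size_tb


NODES_PER_CLUSTER = 7        # from the description above
NODE_SIZE_TB = 2.8           # logical size per storage node, from the description
HOT_SPARE_NODES = 1          # ASSUMPTION: one of the seven nodes is the hot spare
BOTH_CLUSTERS_USD = 85_000   # from the description above

if __name__ == "__main__":
    active_nodes = NODES_PER_CLUSTER - HOT_SPARE_NODES
    usable_tb = raid5_usable_tb(active_nodes, NODE_SIZE_TB)
    print(f"{active_nodes} active nodes -> {usable_tb:.1f} TB usable per cluster")
    # The second cluster is a mirror, so it adds cost but no extra usable space.
    print(f"Mirrored pair: about ${BOTH_CLUSTERS_USD / usable_tb:,.0f} per usable TB")
```

Under those assumptions the mirrored pair works out to roughly $6,000 per usable TB, which is the price of carrying two layers of RAID5 plus a full remote copy.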
Ask Slashdot Formula: by jlarocco · 2005-10-25 08:22 · Score: 5, Funny

Dear Slashdot,
I have been tasked with (insert very difficult, very important job). This is very important to my company. I have (insert number much lower than it should be) dollars to do this. I do not want to use (insert company name specializing in this exact thing) because management thinks they are too expensive. I think I can do this (insert better/faster/cheaper/...) than said company, even though they have vastly more experience and have invested much more time and research than I have. My continued and future employment probably rests on this project. Please advise.

--
Maybe not
How about a PetaBox? by McSpew · 2005-10-25 08:44 · Score: 4, Interesting

The folks at the Internet Archive have already done the hard work of figuring out how to create a petabyte storage system using commodity hardware. The system works so well they started a company to sell PetaBoxes to others. Why reinvent the wheel?
AFS Rocks- Now stop by sirket · 2005-10-25 09:01 · Score: 5, Insightful

Stop what you are doing right now. If your architecture requires you to have one huge volume then you have architected things wrong. Imagine trying to fsck this damned thing! What about file system corruption- What the hell are you going to do when you lose a Petabyte of data because of some file system corruption? Small, sensible, easily managed smaller partitions are the way to go. Use a database to organize where given files are stored. Do something that makes sense. I have a client now who just lost a bunch of data because they used a system like this.
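A minimal sketch of the "database that tracks where files live" idea, using Python's built-in sqlite3 module. The table layout, the function names, the volume paths and the "most free space" placement rule are all hypothetical choices for illustration; a real deployment would need its own policy and a database that is itself backed up.

```python
import sqlite3


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the catalog that maps each file to one volume."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " name TEXT PRIMARY KEY,"
        " volume TEXT NOT NULL,"
        " size_bytes INTEGER NOT NULL)"
    )
    return conn


def place_file(conn: sqlite3.Connection, name: str, size_bytes: int,
               volumes: dict) -> str:
    """Record a new file on the volume with the most free space.

    `volumes` maps a volume name to its capacity in bytes.
    """
    used = dict(conn.execute(
        "SELECT volume, SUM(size_bytes) FROM files GROUP BY volume"))
    best = max(volumes, key=lambda v: volumes[v] - used.get(v, 0))
    if volumes[best] - used.get(best, 0) < size_bytes:
        raise RuntimeError("no volume has enough free space")
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO files VALUES (?, ?, ?)",
                     (name, best, size_bytes))
    return best


def locate_file(conn: sqlite3.Connection, name: str):
    """Return the volume holding `name`, or None if it is not in the catalog."""
    row = conn.execute(
        "SELECT volume FROM files WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    catalog = open_catalog()
    volumes = {"/vol/a": 100, "/vol/b": 200}  # hypothetical volumes, tiny sizes
    print(place_file(catalog, "scan-0001.tif", 50, volumes))
    print(locate_file(catalog, "scan-0001.tif"))
```

The point of the design is that losing or corrupting one small volume costs you only the files listed against it, and each volume can be checked or restored on its own.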

Having said all this- If you are still intent on finding a good file system then use AFS. It's probably your best free solution. If you want to sleep at night call EMC.

-sirket
That's not MTBF, this is.. by beldraen · 2005-10-25 09:54 · Score: 4, Informative

Just a comment about MTBF. It's often not understood, and it is one of my little pet peeves with tech producers because they don't try to correct it. MTBF rates how reliably a drive will last through its warranty period.

You have a drive that is rated 500,000 hours MTBF. Suppose you bought a drive and let it run at rated duty. Drives are normally rated to run 100% of the time, but many other devices will have a duty period. Further, you run the drive until its warranty is up. You then throw this perfectly working drive out the window and replace it. If you keep up this pattern, then approximately once per 500,000 hours on average you should have a drive fail before the warranty period is up. This is why it is important to look not only at the MTBF but also at the warranty period.
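To put a rough number on that, here is a sketch of the chance that a single drive fails before its warranty runs out. It assumes a constant failure rate of 1/MTBF (the usual exponential model, which real drives only approximate) and a three-year warranty; both are assumptions for illustration, not figures from any datasheet.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def failure_probability(mtbf_hours: float, service_years: float) -> float:
    """Chance that one drive fails within the service period, assuming a
    constant failure rate of 1/MTBF (exponential model, an assumption)."""
    hours_in_service = service_years * HOURS_PER_YEAR
    return 1.0 - math.exp(-hours_in_service / mtbf_hours)


if __name__ == "__main__":
    mtbf = 500_000        # hours, the rating used in the example above
    warranty_years = 3    # ASSUMPTION: hypothetical warranty length
    p = failure_probability(mtbf, warranty_years)
    print(f"Chance one drive fails within warranty: {p:.1%}")
    print(f"Expected failures among 1000 drives: {1000 * p:.0f}")
```

A longer warranty at the same MTBF means more drives expected to fail inside it, which is why the two numbers have to be read together.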

As a side note: In theory, you should be throwing drives out on a periodic basis. One way around this is to not buy all the same drive type and manufacturer. By having a pool of drive types, you distribute, and thus minimize, the risk of drive failures. Additionally, you may want to have a standard period of time for drive replacement so as to schedule your down time, as opposed to it all being unexpected.

--
Bel, the mostly sane.. "Of course I can't see anything! I'm standing on the shoulders of idiots." -- Me