Building a Massive Single Volume Storage Solution?



Posted by Cliff on Tuesday October 25, 2005 @07:21AM from the 15-zeros-is-a-lot-of-bytes dept.

An anonymous reader asks: "I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget for the solution is fairly small. Some of the technologies that I've been scoping out are iSCSI, AoE and plain clustered/grid computers with JBOD (just a bunch of disks). Personally I'm more inclined toward a grid cluster with 1GB interface where each node will have about 1-2TB of disk space and each node is based on a 'low' power consumption architecture. The next issue to tackle is finding a file system that could span across all the nodes and yet appear as a single volume to the application servers. At this point data redundancy is not a priority, however it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS) and I've looked into Lustre, xFS and PVFS. There are some interesting commercial products such as the File Director from NeoPath Networks and a few others; however the cost is astronomical. I would like to know if any Slashdot readers have any experience in building out such a solution? Any help/idea(s) would be greatly appreciated!"


GPFS from IBM by LuckyStarr · 2005-10-25 07:29 · Score: 5, Interesting

May or may not be what you are looking for. Quite expensive, but an impressive feature list.

http://www-03.ibm.com/servers/eserver/clusters/software/gpfs.html

--
Meme of the day: I browse "Disable Sigs: Checked". So should you.
Re:Apple Xserve? by medazinol · 2005-10-25 07:30 · Score: 5, Interesting

My first thought as well. However, he is asking for a single volume solution. So XSAN from Apple would have to be implemented. Good thing that it's compatible with ADIC's solution for cross-platform support.
Probably would be the least expensive option overall and the simplest to implement. Don't take my word for it, go look for yourself.
Wow by DingerX · 2005-10-25 07:33 · Score: 5, Funny

I never thought I'd see the day when sites were boasting a petabyte of porn.
That's over 3 million hours of .avis -- if you sat down and watched them end-to-end, you'd have 348 years of "backdoor sliders", "dribblers to short", "pop flies", and "long balls". We live in an enlightened age.
1. Re:Wow by spuke4000 · 2005-10-25 07:57 · Score: 5, Funny
  
  I'm not really sure I need 348 years of porn. I usually find porn really interesting for the first 3 minutes or so, then for some reason it's not so interesting anymore. But maybe that's just me.
  
  --
  This post cannot be rebroadcast without the express written consent of Major League Baseball.
Data redundancy REQUIRED by cheesedog · 2005-10-25 07:34 · Score: 5, Informative

One thing to think about when building such a system from a large number of hard disks is that disks will fail, all the time. The argument is fairly convincing:
Suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means that the average disk is expected to have a failure about every 57 years. Sounds good, right? Now, suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!
Now, of course the failures won't be spread out evenly, which makes this even trickier. A lot of your disks will be dead on arrival, or fail within the first few hundred hours. A lot will go for a long time without failure. The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.
You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. This allows the system to keep working under X number of failures. Redundancy/coding allows you to plan on scheduled maintenance, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.
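The arithmetic above can be checked with a rough sketch. The 500,000-hour MTBF and the 1000-disk count come from the comment; the constant-rate (exponential) failure model and the function names are assumptions made only for illustration, and as noted above, real failures tend to cluster rather than follow this model.

```python
import math

HOURS_PER_YEAR = 24 * 365  # ignores leap years; close enough for an estimate


def years_between_failures(mtbf_hours: float) -> float:
    """Convert a single disk's MTBF from hours to years."""
    return mtbf_hours / HOURS_PER_YEAR


def fleet_hours_between_failures(mtbf_hours: float, disks: int) -> float:
    """Expected hours between failures somewhere in a fleet of identical disks."""
    return mtbf_hours / disks


def prob_at_least_one_failure(mtbf_hours: float, disks: int, window_hours: float) -> float:
    """Chance that at least one disk fails within the window.

    Assumes independent disks with a constant failure rate, which is a
    simplification of real-world behavior.
    """
    return 1.0 - math.exp(-disks * window_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf, disks = 500_000, 1000
    print(f"one disk: a failure about every {years_between_failures(mtbf):.0f} years")
    print(f"{disks} disks: a failure about every {fleet_hours_between_failures(mtbf, disks):.0f} hours")
    print(f"chance of a failure in any given week: {prob_at_least_one_failure(mtbf, disks, 168):.0%}")
```

Under these assumptions the fleet-wide interval scales as MTBF divided by the number of disks, which is why a 57-year figure per disk turns into roughly three weeks for the whole system.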
I just have to ask... by jcdick1 · 2005-10-25 07:34 · Score: 5, Informative

...what your management was thinking. I mean, I can't imagine a storage requirement that large that you can build in a distributed model that would beat on price per GB an EMC or Hitachi or IBM or whoever SAN solution. The administration and DR costs alone for something like this would be astronomical. There just isn't really a way to do something this big on the cheap. I mean, this is what SANs were developed for in the first place. It's cheaper per GB than distributed local storage ever could be.

--
What?
Do It Right by moehoward · 2005-10-25 07:41 · Score: 5, Insightful

Look. Everyone wants a Lamborghini for the price of a Chevy. Cute. Yawn. Half of the Ask Slashdot questions are people who didn't find what they want at Walmart. Despite the amazing Slashdot advice, Ask Slashdot answers have somehow failed to put EMC, IBM, HP, etc. out of business. There is no free lunch.

Just call EMC, get a rep out, and give the paperwork to your boss. Do it today instead of 5 months from now and you will have a much better holiday season.

Note to moderators and other finger pointers: I did not say to BUY from EMC, I just said to show his boss how and why to do things the right way. It does not hurt to get quotes from the big vendors, mainly because the quote also comes with good, solid info that you can share with the PHBs. Despite what you think about "evil" tech sales persons and sales engineers, you actually can learn from them.

--
"If you want to improve, be content to be thought foolish and stupid." - Epictetus
Re:PetaBox by sycodon · 2005-10-25 07:45 · Score: 5, Funny

Just don't call it PetaFile.

--
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
Yup, time to pick up the phone. by Kadin2048 · 2005-10-25 07:48 · Score: 5, Insightful

Exactly. This seems like somebody is trying to figure out a way to do something in-house which really ought to be left to either an outside contractor, or at least set up as a turnkey solution by a consultant. Given that he knows little enough about it that he's asking for help on Slashdot, I think this is yet another problem best solved using the telephone and a fat checkbook, and enough negotiating skills to convince management to pony up the cash up front instead of piddling it out over time on an in-house solution that's going to be a hole into which money and time are poured.

I know people get tired of hearing "call IBM" as a solution to these questions, but in general if you have some massive IT infrastructure development task and are so lost on it that you're asking the /. crowd for help, calling in professionals to take over for you probably isn't a bad idea.

It's not even a question of whether you could do it in-house or not; given enough resources you probably could. It comes down to why you want to do something like this yourselves instead of finding people who do it all the time, week after week, for a living, telling them what you want, getting a price quote, and getting it done. Sure seems like a better way to go to me.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Hell, BUY it from EMC! by Genady · 2005-10-25 08:08 · Score: 5, Interesting

As a VERY satisfied customer, I say, just buy the damned thing from EMC. There are few enough warm fuzzy feelings that SysAdmins have in this day and age, like your CE calling at 7:00am saying: "Hey, you had a few hard SCSI errors on Disk 3 Enclosure 0 Tray 0 last night, that's your production LUNs isn't it? There should be a courier there with a disk by 10, and I'll stop by to make sure things are hotsparing back properly after you replace the disk, okay?" And *THIS* is just because my CE knows I can handle replacing a disk. Normally he'd come out and do that, and sit around while it rebuilt the RAID Group.

Yeah, EMC costs. THIS is why. The support, when needed, is top top top notch. Which would you rather have in a DR situation?

--

What if it is just turtles all the way down?
Ask Slashdot Formula: by jlarocco · 2005-10-25 08:22 · Score: 5, Funny

Dear Slashdot,
I have been tasked with (insert very difficult, very important job). This is very important to my company. I have (insert number much lower than it should be) dollars to do this. I do not want to use (insert company name specializing in this exact thing) because management thinks they are too expensive. I think I can do this (insert better/faster/cheaper/...) than said company, even though they have vastly more experience and have invested much more time and research than I have. My continued and future employment probably rests on this project. Please advise.

--
Maybe not
AFS Rocks- Now stop by sirket · 2005-10-25 09:01 · Score: 5, Insightful

Stop what you are doing right now. If your architecture requires you to have one huge volume then you have architected things wrong. Imagine trying to fsck this damned thing! What about file system corruption- What the hell are you going to do when you lose a Petabyte of data because of some file system corruption? Small, sensible, easily managed smaller partitions are the way to go. Use a database to organize where given files are stored. Do something that makes sense. I have a client now who just lost a bunch of data because they used a system like this.
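One way to picture the "database to organize where given files are stored" idea is the minimal sketch below. Everything in it is hypothetical: the mount points in VOLUMES, the table layout, and the hash-based placement rule are illustrative assumptions, not anything the comment specifies.

```python
import sqlite3
import zlib

# Hypothetical mount points for the smaller, separately managed partitions.
VOLUMES = ["/vol/a", "/vol/b", "/vol/c"]


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the catalog that maps file names to volumes."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "name TEXT PRIMARY KEY, volume TEXT NOT NULL)"
    )
    return db


def place(db: sqlite3.Connection, name: str) -> str:
    """Pick a volume by hashing the file name, record it, and return it."""
    volume = VOLUMES[zlib.crc32(name.encode("utf-8")) % len(VOLUMES)]
    db.execute(
        "INSERT OR REPLACE INTO files (name, volume) VALUES (?, ?)",
        (name, volume),
    )
    db.commit()
    return volume


def locate(db: sqlite3.Connection, name: str):
    """Return the volume holding the file, or None if it is not cataloged."""
    row = db.execute("SELECT volume FROM files WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    catalog = open_catalog()
    print(place(catalog, "reports/2005/q3.dat"))
    print(locate(catalog, "reports/2005/q3.dat"))
```

With a layout like this, corruption on one partition affects only the files cataloged against it, and each partition can be checked or restored on its own. The catalog itself then becomes something that needs backing up.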

Having said all this- If you are still intent on finding a good file system then use AFS. It's probably your best free solution. If you want to sleep at night call EMC.

-sirket