Building a Massive Single Volume Storage Solution?

gmail by Adult film producer · 2005-10-25 07:24 · Score: 4, Funny

Register a few thousand Gmail accounts and write an interface that makes the writing of data to Gmail inboxes invisible to the app.

Re:gmail by Anonymous Coward · 2005-10-25 07:43 · Score: 2, Interesting

Gmail? Why bother when you can just use a few hundred million Tinydisks instead?

I wonder if tinyurl can handle 25TB...
Re:gmail by Stuart Gibson · 2005-10-25 07:44 · Score: 4, Funny

That would have been my second answer.

The first, and presumably the reason this was posted to /. is simple...

Imagine a Beowulf cluster...

Stuart

--
It's all fun and games until a 200' robot dinosaur shows up and trashes Neo-Tokyo... Again

GFS? by fifirebel · 2005-10-25 07:24 · Score: 4, Informative

Have you checked out GFS from RedHat (formerly Sistina)?

Re:GFS? by N1ck0 · 2005-10-25 07:32 · Score: 3, Informative

GFS over an FC SAN with some EMC CLARiiON CX700s as the hosts is the solution that I'm going to be looking at deploying next year, although there are still some thoughts on using iSCSI instead of FC. It all really depends on what your usage patterns and performance requirements are. I don't believe GFS supports ATAoE systems, but since there is Linux support I doubt it would be too far of a stretch.
Re:GFS? by LnxAddct · 2005-10-25 08:33 · Score: 3, Informative

I second this parent post. GFS is exactly what he wants. Although I've never used it in the 1 PB range, I can vouch for it working excellently with TBs.
Regards,
Steve

Apple Xserve? by mozumder · 2005-10-25 07:24 · Score: 2, Informative

Can't you hook up 4x 7TB Xserve RAIDs to a PowerMac and use that?

Re:Apple Xserve? by Jeff DeMaagd · 2005-10-25 07:29 · Score: 3, Informative

Apple Xserve may be the cheapest of that kind of storage, but it's probably not fitting the original idea of commodity hardware.

Scaling to petabytes means spanning storage across multiple systems.
Re:Apple Xserve? by medazinol · 2005-10-25 07:30 · Score: 5, Interesting

My first thought as well. However, he is asking for a single volume solution. So XSAN from Apple would have to be implemented. Good thing that it's compatible with ADIC's solution for cross-platform support.
Probably would be the least expensive option overall and the simplest to implement. Don't take my word for it, go look for yourself.
Re:Apple Xserve? by stang7423 · 2005-10-25 07:55 · Score: 3, Informative

Apple has a solution for this. Xsan is a distributed filesystem that is based on ADIC's StorNext filesystem. Apple states on that page that it will scale into the range of petabytes.
Re:Apple Xserve? by TRRosen · 2005-10-25 08:19 · Score: 4, Informative

To do this would cost around $50,000 with xRaids and xSan... $2000/TB is probably the best price you're going to get. You could do this with generic hardware, but the cost of assembling, the extra room, the extra power consumption and the maintenance and engineering costs will certainly wipe out what you might save. The xRaid solution could be up in a day and fit in one (actually 1/2) rack.
I do remember some college building a nearline backup storage system using 1U servers with 2 or 3 RAID cards each, connected to something like 12 drives per machine in homemade brackets. It was hardly ideal, but it did work. Anybody remember where that was?
Re:Apple Xserve? by Anonymous Coward · 2005-10-25 10:48 · Score: 3, Insightful

"This product is tangentially related to a product which, five years ago, I had unspecified bad experiences with. Ergo, this product sucks."

Only on fucking Slashdot.

Andrew File System by mroch · 2005-10-25 07:25 · Score: 4, Informative

Check out AFS.

Re:Andrew File System by Simon Lyngshede · 2005-10-25 07:27 · Score: 2, Informative

Agreed. AFS is exceptionally nice. However, I think it still has a max file size of 2GB.
Re:Andrew File System by Anonymous Coward · 2005-10-25 07:46 · Score: 2, Informative

http://www.openafs.org/
Re:Andrew File System by Trepalium · 2005-10-25 08:11 · Score: 2, Informative

Transarc was acquired by IBM in 1998, and released OpenAFS in 2000. This used to be IBM's site for Transarc technologies, but it looks like it doesn't exist anymore, and instead just redirects to IBM's software page.

--
I used up all my sick days, so I'm calling in dead.
Re:Andrew File System by miles31337 · 2005-10-25 08:40 · Score: 3, Informative

No longer true: OpenAFS 1.3.x (soon to be 1.4) has support for larger files.

PetaBox by Anonymous Coward · 2005-10-25 07:26 · Score: 4, Informative

How about the PetaBox, used by the Internet Archive?

Re:PetaBox by sycodon · 2005-10-25 07:45 · Score: 5, Funny

Just don't call it PetaFile.

--
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
Re:PetaBox by MikeFM · 2005-10-25 07:52 · Score: 3, Informative

I priced one of those and decided I'd have to work my way up to that kind of toy. Instead I started with Buffalo's TeraStations, which are affordable and have built-in RAID support. You can mount them in Linux and use LVM to span a single filesystem across several of them, or just mount them normally, depending on your needs. $1-$2 per GB for external RAID storage isn't bad at all.

--
At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.
Re:Petabox by afidel · 2005-10-25 07:56 · Score: 4, Insightful

This guy is worried about budget, yet even with the "low power" usage of the petabox it would still use 50kW for one petabyte of storage! When you combine the cooling for that with the cost of electricity you are talking some serious money. If you have trouble getting the capital funds for something like this how are you ever going to pay the operating costs?

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:Petabox by rpresser · 2005-10-25 08:40 · Score: 2, Interesting

Depending on latency requirements, perhaps most of the cluster can stay in sleep mode until it is needed.
Re:Petabox by Databass · 2005-10-25 10:22 · Score: 3, Insightful

This guy is worried about budget, yet even with the "low power" usage of the petabox it would still use 50kW for one petabyte of storage!

Interesting to think about. My brain probably holds about a petabyte of memories and it uses 20-60 watts. Mostly from sugar.
Re:Petabox by russ_allegro · 2005-10-25 10:38 · Score: 2, Informative

They claim ~40 watts per terabyte. That is pretty darn low; if you are going to try to come up with your own solution from off-the-shelf parts, it'll be hard to match that. If they can't pay for 40 watts per terabyte for a petabyte, maybe they should reconsider whether they need the petabyte for now.

Let's say $0.07 per kWh.
Then the 50kW as you said would be:
50*24*31*$0.07 = $2,604/month

So it isn't super cheap; I guess that is why you don't hear about everyday people buying a petabyte of storage. I think if you try to save more on electricity (like coming up with some other device besides hard drives), you will end up paying a huge amount for whatever saves you that electricity, well beyond the electricity costs.
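
If anyone wants to rerun that arithmetic with their own rates, here is a minimal Python sketch. The 50kW, ~40 W/TB and $0.07 figures come from the posts above; treating a petabyte as 1000 TB, the 31-day month and the function name are assumptions for illustration, and cooling is not included.

```python
def monthly_power_cost(load_kw, dollars_per_kwh, days=31):
    """Electricity cost of a constant load over one month."""
    hours = 24 * days
    return load_kw * hours * dollars_per_kwh

RATE = 0.07  # dollars per kWh, the rate assumed in the post

# The 50 kW figure quoted upthread
cost_50kw = monthly_power_cost(50, RATE)

# The claimed ~40 W per terabyte, scaled to 1000 TB (assumed decimal petabyte)
watts_per_tb = 40
terabytes = 1000
load_from_claim_kw = watts_per_tb * terabytes / 1000  # watts -> kW
cost_from_claim = monthly_power_cost(load_from_claim_kw, RATE)

print(f"50 kW load: ${cost_50kw:,.2f} per month")
print(f"{load_from_claim_kw:.0f} kW load (40 W/TB x 1 PB): ${cost_from_claim:,.2f} per month")
```

Under those assumptions the 40 W/TB claim works out to about 40 kW for the full petabyte, a bit under the 50kW quoted upthread.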
Re:Petabox by faragon · 2005-10-25 20:19 · Score: 2, Interesting

But you do not have random access to your own data (to say nothing of reliability).

MogileFS from livejournal by mikeee · 2005-10-25 07:27 · Score: 2, Informative

Livejournal developed their own distributed filesystem:

http://www.danga.com/mogilefs/

It's scalable and has nice reliability features, but is all userspace and doesn't have all the features/operations of a true POSIX filesystem, so it may not suit your needs.

Go the Easy Route by Evil W1zard · 2005-10-25 07:27 · Score: 3, Funny

I know a certain recent Zombie network that was discovered which collectively had quite a few PBs of storage... Of course I wouldn't recommend going down that road, as it leads to, you know... jail.

--
News Reporters Make Tasty Polar Bear Treats!

Petabox by russ_allegro · 2005-10-25 07:28 · Score: 2, Insightful

archive.org made a petabox

http://www.archive.org/web/petabox.php

There is now a company that seems to make the same design:

http://www.capricorn-tech.com/products.html

I don't know what FS they use, but apparently it is redundant.

GPFS from IBM by LuckyStarr · 2005-10-25 07:29 · Score: 5, Interesting

May or may not be what you're looking for. Quite expensive, but an impressive feature list.

http://www-03.ibm.com/servers/eserver/clusters/software/gpfs.html

--
Meme of the day: I browse "Disable Sigs: Checked". So should you.

Re:GPFS from IBM by Zombie · 2005-10-25 09:20 · Score: 2, Interesting

My wife's building a 4 petabyte array (starting with 600 terabytes by the end of this year) for real-time multiple-access high-speed video streaming on GPFS. All GNU/Linux and commodity hardware. The switch fabric of the network is the hard bit. It's a bitch on fibre channel, but iSCSI should deliver higher performance at less than half the price. That's when you can get the hardware, and if you have the right Ethernet switch fabric again...
Re:GPFS from IBM by Obasan · 2005-10-25 09:23 · Score: 2, Insightful

Having implemented GPFS I feel qualified to say it kicks butt. As the poster mentions, it's not cheap, but if you want reliability and support it may be well worth it. That's where you need to decide the level of risk you are willing to expose your data to. One limitation of GPFS is that it does (or did last I looked) only run on IBM hardware, either pSeries or xSeries with FAStT Fibre Channel at the back end.

From what I've heard, definitely give GFS a thorough shakedown before you decide to implement it; I've heard some horror stories.
Re:GPFS from IBM by icehawk55 · 2005-10-25 12:34 · Score: 2, Informative

I've implemented multiple GPFS file systems in the multi-terabyte range. It's a pretty robust file system. With full redundancy at the disk/controller/Brocade/server level per file system I can still write more than 3 gb/s and read better than 3.5 gb/s. This was a design for redundancy and not performance.

20+ terabytes of FAStT fibre-attached storage. After four "SURPRISE" power outages following Katrina, which caused the loss of 12 disks, I still did not lose a single byte of data for the customer. GPFS can be pretty robust if implemented correctly.

I'd have no qualms about putting together a petabyte of gpfs file systems.
Icehawk55

Why? by Anonymous Coward · 2005-10-25 07:29 · Score: 2, Insightful

What are you doing on a limited budget trying to build a 1PB solution? And why are you on a budget?

Just because you are starting at 25TB doesn't mean you aren't building a 1PB solution.

You also need to figure out what kind of bandwidth you need. It's very seldom that people have 1PB of data that is accessed by one person occasionally. If some sort of USB or 1394 connection will work, you are much better off than if you require InfiniBand.

Like many "ask Slashdot" questions this is the last place you should be looking for help...

Scale by LLuthor · 2005-10-25 07:33 · Score: 3, Interesting

If you know the scale of the problem, you should consult with a company like EMC to provide the support for this thing - you WILL need it.

Clustering the disks with iSCSI or ATAoE is trivial - you can do that very easily, but the filesystem to run on top of it is where you will have problems.

PVFS - has no redundancy - Lose one node lose them all.
GFS - does not scale well to those sizes or a large number of nodes - lots of hassle with the dlm.
GoogleFS - Essentially write-once only - no small (50GB) files - little or no locking.
xFS - Way too easy to lose your data.

It seems that you only have one option:
Lustre - VERY Expensive - lots of hassle with meta-data servers and lock servers.

Go with a company to take care of all this hassle - you do not have the resources of Google to deal with this kind of thing yourself.

--
LL

Re:Scale by Wesley Felter · 2005-10-25 08:19 · Score: 3, Insightful

Why do people keep talking about GoogleFS, given that it doesn't exist outside Google?

Here's a couple to look at by Anonymous Coward · 2005-10-25 07:33 · Score: 2, Informative

Compete File System at http://www.python.org/pycon/2005/papers/46/CompeteFileSystem.pdf

MogileFS at http://www.danga.com/mogilefs/

Wow by DingerX · 2005-10-25 07:33 · Score: 5, Funny

I never thought I'd see the day when sites were boasting a petabyte of porn.
That's over 3 million hours of .avis -- if you sat down and watched them end-to-end, you'd have 348 years of "backdoor sliders", "dribblers to short", "pop flies", and "long balls". We live in an enlightened age.

Re:Wow by spuke4000 · 2005-10-25 07:57 · Score: 5, Funny

I'm not really sure I need 348 years of porn. I usually find porn really interesting for the first 3 minutes or so, then for some reason it's not so interesting anymore. But maybe that's just me.

--
This post cannot be rebroadcast without the express written consent of Major League Baseball.
Re:Wow by rco3 · 2005-10-25 08:13 · Score: 3, Funny

Three minutes? You wish!

Come to think of it, so do I.

--

Ce n'est pas un vrai mouvement de robot!

Data redundancy REQUIRED by cheesedog · 2005-10-25 07:34 · Score: 5, Informative

One thing to think about when building such a system from a large number of hard disks is that disks will fail, all the time. The argument is fairly convincing:

Suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means that the average disk is expected to have a failure about every 57 years. Sounds good, right? Now, suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!

Now, of course the failures won't be spread out evenly, which makes this even trickier. A lot of your disks will be dead on arrival, or fail within the first few hundred hours. A lot will go for a long time without failure. The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.

You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. This allows the system to keep working under X number of failures. Redundancy/coding allows you to plan on scheduled maintenance, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.
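
To put numbers on the MTBF argument, here is a rough Python sketch. It takes the 500,000-hour MTBF and the 1000-disk fleet from above and assumes independent disks with a constant failure rate, which is exactly the "spread out evenly" simplification and ignores the early deaths and clustering described above. The function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365.25

def hours_between_failures(mtbf_hours, n_disks):
    """Average gap between failures anywhere in the fleet,
    assuming independent disks with a constant failure rate."""
    return mtbf_hours / n_disks

def expected_failures_per_year(mtbf_hours, n_disks):
    """Average number of disk replacements per year under the same assumption."""
    return n_disks * HOURS_PER_YEAR / mtbf_hours

mtbf = 500_000   # hours, the figure used in the post
disks = 1000

single_disk_years = mtbf / HOURS_PER_YEAR
fleet_gap_hours = hours_between_failures(mtbf, disks)
fleet_gap_days = fleet_gap_hours / 24
failures_per_year = expected_failures_per_year(mtbf, disks)

print(f"one disk: a failure every {single_disk_years:.0f} years on average")
print(f"{disks} disks: a failure every {fleet_gap_hours:.0f} hours ({fleet_gap_days:.1f} days)")
print(f"{disks} disks: about {failures_per_year:.1f} failures per year")
```

Plugging in the drive count for a full petabyte instead of 1000 shrinks the gap between failures proportionally.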

Re:Data redundancy REQUIRED by OrangeSpyderMan · 2005-10-25 07:42 · Score: 4, Insightful

Agreed. We have around 50 TByte of data in one of our datacenters and it's great, but the number of disks that fail when you have to restart the systems (SAN fabric firmware install) is just scary. Even on the system disks of the Wintel servers (around 400), which are DAS, around 10% fail on datacenter powerdowns. That's where you pray that statistics are kind and you have no more failures on any one box than you have hot spares+tolerance :-) Last time one server didn't make it back up because of this.... though it was actually, strictly speaking, the PSUs that let go, it would appear.

--
Try NetBSD... safe,straightforward,useful.
Re:Data redundancy REQUIRED by Alef · 2005-10-25 08:09 · Score: 3, Informative

"If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!"
For the sake of your argument I suppose that assumption could be considered fair. If one were to do a somewhat more sophisticated analysis, a better model for hard drive failures is the bathtub curve. It represents the result of a combination of three types of failures: infant mortality (flaws in the manufacturing), random failures and wear-out failures.
"The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once."
I think what you are referring to is how multiple observations of a uniformly distributed stochastic variable generally look. It doesn't have anything to do with fractals, though.
Re:Data redundancy REQUIRED by cheesedog · 2005-10-25 09:56 · Score: 2, Interesting

Not to nitpick back at you or anything, but have you ever sat in front of a system with 100s of cheap-off-the-shelf drives and recorded the failure times? I'll be a monkey's uncle if they aren't self-similar.

I just have to ask... by jcdick1 · 2005-10-25 07:34 · Score: 5, Informative

...what your management was thinking. I mean, I can't imagine a storage requirement that large that you can build in a distributed model that would beat, on price per GB, an EMC or Hitachi or IBM or whoever SAN solution. The administration and DR costs alone for something like this would be astronomical. There just isn't really a way to do something this big on the cheap. I mean, this is what SANs were developed for in the first place. It's cheaper per GB than distributed local storage ever could be.

--
What?

Re:I just have to ask... by temojen · 2005-10-25 07:42 · Score: 3, Funny

With a project this large, they may be able to do it in-house and still take advantage of economies of scale. They can buy HDDs, motherboards, rackmount cases, etc. by the pallet or container load and temporarily up-hire some of their part-timers to do the assembly.

With a network bootable bios, the nodes could just be plugged in and install an image off a server, then customize it based on their MAC.
Re:I just have to ask... by richie2000 · 2005-10-25 23:36 · Score: 2, Funny

Yeah, but apart from that, did it work out for you? Don't hold back, I can take the truth.

--
Money for nothing, pix for free

Stress the importance .... by gstoddart · 2005-10-25 07:38 · Score: 3, Insightful

I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB ... Based on my past experience and research, the commercial offerings for such a solution becomes cost prohibitive, and the budget for the solution is fairly small.

Unfortunately, I should think needing a solution which can scale up to a Petabyte (!) of disk-space and a "fairly small" budget are at odds with one another.

Maybe you need to make a stronger case to someone that if such a mammoth storage system is required, it needs to be a higher priority item with better funding?

Heck, the loss of such large volumes of data would be devastating (I assume it's not your pr0n collection) to any organization. Building it on the cheap and having no backup (*)/redundancy systems would be just waiting to lose the whole thing.

(*) I truly have no idea how one backs up a petabyte

--
Lost at C:>. Found at C.

Re:Google Releases OSS? by ggvaidya · 2005-10-25 07:39 · Score: 2, Informative

A while ago

For the most part by retinaburn · 2005-10-25 07:39 · Score: 4, Insightful

the reason you can't find a cheap way to do this is because it just isn't cheap.

I would look at some lessons learned from Google. If you decide to go with some sort of homebrew solution based on a bunch of standard consumer disks, you will run into other problems besides money. The more disks you have running, the more failures you will encounter. So any system you set up has to be able to have drives fail all day, and not require human intervention to stay up and running (unless you can get humans for cheap too).

Re:For the most part by epiphani · 2005-10-25 08:10 · Score: 2, Informative

It won't be cheap - but how about this idea. You'll get plenty of data redundancy out of it, however you may need to spend some extra bucks on stability and maintainability.

eWare 12x S-ATA raid5 card
12x 300GB raid5
linux machine
iscsi software - share out 1 LUN.

Duplicate this machine until you have enough storage.

One big box with a number of trunked/bonded gigE ports
iSCSI initiator software - mount all the LUNs.
software raid them together - striping if you aren't too worried - raid5 if you are.

tada - big storage, one volume, all accessible from one machine.

the maintenance will suck though.
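
For a feel of how many of these boxes "enough storage" means, a quick Python sketch follows. The 12x 300GB RAID5 box and the stripe-or-RAID5 choice on the head node come from the recipe above; decimal units, a single top-level array and the function names are assumptions for illustration only.

```python
import math

def raid5_usable_gb(n_disks, disk_gb):
    """RAID5 gives up one disk's worth of space to parity."""
    return (n_disks - 1) * disk_gb

def nodes_needed(target_gb, node_gb, top_level="stripe"):
    """Boxes needed when the head node stripes or RAID5s the LUNs together."""
    data_nodes = math.ceil(target_gb / node_gb)
    if top_level == "raid5":
        return data_nodes + 1  # one extra LUN's worth of parity
    return data_nodes

node_gb = raid5_usable_gb(12, 300)  # usable space exported by one box

for label, target_gb in (("25 TB", 25_000), ("1 PB", 1_000_000)):
    striped = nodes_needed(target_gb, node_gb)
    raid5 = nodes_needed(target_gb, node_gb, top_level="raid5")
    print(f"{label}: {striped} boxes striped, {raid5} boxes with RAID5 on top")
```

At the petabyte end that is a few hundred LUNs hanging off one head node, which is where the maintenance remark really bites.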

--
.
Re:For the most part by fool · 2005-10-25 09:17 · Score: 3, Informative

well, since all of the (high-end) PC's we were looking at for snort boxen had severe problems pushing even 5Gbit/s (not GByte) of traffic in/out over the PCI busses simultaneously, you hit a bottleneck pretty quickly there, even before you get to 25TB with your disk sizes. at 500GB disks you get pretty close, but you're at the ceiling already. while a decent (not even cutting-edge) machine could push a Gbit to the server pretty easily, the server, no matter how beefy, needs a ton of internal bandwidth to gather/process/serve the data timely-like. if he only needs 100mbit/s of data service then he's golden =)

or did you mean to specify a GBit switch in between the clients/big box?

also, agree with yours and others' proclamation that administration will not be trivial. be sure to spec at least 6 months of your time in writing/debugging scripts to automate the detection and RMA of dead drives, and find a vendor who will ship based on an automated mail you can send out about failed disks, rather than waiting on turnaround from you pulling the drive and the delivery making a round trip.

Re:Go Virtual by krbvroc1 · 2005-10-25 07:39 · Score: 2, Interesting

He asked for low cost commodity hardware. The fact that no price is mentioned and you need to contact a sales droid for a quote is an instant red flag. I hate vendors who do not put price lists, even 'retail' prices, on their product pages. I realize they may have different price levels based on quantity, but there is a value to seeing that a product is in the '$1000-$1500' range versus the '$120000-$150000' range. Having to contact sales droids who will put your name/phone number on a sales list and harass you just to find out the price range turns me off of a lot of these outfits. I do a lot of product research and selection using the Internet. I favor outfits who allow me to get all the info online without contacting a sales rep. Many times if I cannot get the info on the web and I cannot get a price on the first phone call without providing sales lead information, I skip them.

Do It Right by moehoward · 2005-10-25 07:41 · Score: 5, Insightful

Look. Everyone wants a Lamborghini for the price of a Chevy. Cute. Yawn. Half of the Ask Slashdot questions are people who didn't find what they want at Walmart. Despite the amazing Slashdot advice, Ask Slashdot answers have somehow failed to put EMC, IBM, HP, etc. out of business. There is no free lunch.

Just call EMC, get a rep out, and give the paperwork to your boss. Do it today instead of 5 months from now and you will have a much better holiday season.

Note to moderators and other finger pointers: I did not say to BUY from EMC, I just said to show his boss how and why to do things the right way. It does not hurt to get quotes from the big vendors, mainly because the quote also comes with good, solid info that you can share with the PHBs. Despite what you think about "evil" tech sales persons and sales engineers, you actually can learn from them.

--
"If you want to improve, be content to be thought foolish and stupid." - Epictetus

Re:Do It Right by stanmann · 2005-10-25 08:20 · Score: 2, Insightful

Yes, there are lots of things that can be done by an open source team on the cheap... Massive hardware components aren't currently one of them. And aren't likely to be in the future.

--
Food not Bombs is a nice platitude but it breaks down when you notice that the Bombees are usually well fed

IBRIX by Wells2k · 2005-10-25 07:42 · Score: 3, Informative

You may want to take a look at IBRIX systems. They do a pretty robust parallel file system that has redundancy and failover.

Er... be careful by LeonGeeste · 2005-10-25 07:43 · Score: 2, Informative

That violates their terms of use pretty severely. I don't know what they would do (Google's not the "suing-for-the-hell-of-it" type), but that wouldn't last very long when they found out. And they would find out. +5 Interesting? Well, curiosity killed the cat.

--
Rank my idea: http://www.sinceslicedbread.com/node/531

Yup, time to pick up the phone. by Kadin2048 · 2005-10-25 07:48 · Score: 5, Insightful

Exactly. This seems like somebody is trying to figure out a way to do something in-house which really ought to be left to either an outside contractor, or at least set up as a turnkey solution by a consultant. Given that he knows little enough about it that he's asking for help on Slashdot, I think this is yet another problem best solved using the telephone and a fat checkbook, and enough negotiating skills to convince management to pony up the cash up front instead of piddling it out over time on an in-house solution that's going to be a hole into which money and time are poured.

I know people get tired of hearing "call IBM" as a solution to these questions, but in general if you have some massive IT infrastructure development task and are so lost on it that you're asking the /. crowd for help, calling in professionals to take over for you probably isn't a bad idea.

It's not even a question of whether you could do it in-house or not; given enough resources you probably could. It comes down to why you want to do something like this yourselves instead of finding people who do it all the time, week after week, for a living, telling them what you want, getting a price quote, and getting it done. Sure seems like a better way to go to me.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Been there done that by CommanderC · 2005-10-25 07:52 · Score: 2, Interesting

I wrote a web application and a client in C# that uses Gmail accounts as a sort of file system. It uses a set of email accounts as "index" accounts, relying on the Gmail search functionality to find what you are looking for, then pulls the attachment on the index to grab the parts of the file that were spread across multiple Gmail accounts in 500K chunks. It works really well. I did it for fun to see if I could. It uses SMTP to post the file chunks to a given set of accounts, and users can donate accounts to the hive at will, increasing the overall storage size. All hosted, maintained and indexed by Gmail or any other free mail service as one big file system.
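
For the curious, the chunk-and-index idea can be sketched in a few lines of Python. To be clear, this is not the C# code described above: the mail accounts are replaced by an in-memory dict, "500K" is read as 500 KiB, the round-robin placement is a guess, and the account names and function names are all hypothetical.

```python
CHUNK_SIZE = 500 * 1024  # "500K" read as 500 KiB; an assumption

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Cut a byte string into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def build_index(filename, chunks, accounts):
    """Assign chunks to accounts round-robin and record where each part went."""
    return [
        {"file": filename, "part": n,
         "account": accounts[n % len(accounts)], "size": len(chunk)}
        for n, chunk in enumerate(chunks)
    ]

def reassemble(index, fetch):
    """fetch(account, file, part) -> bytes stands in for pulling an attachment."""
    ordered = sorted(index, key=lambda entry: entry["part"])
    return b"".join(fetch(e["account"], e["file"], e["part"]) for e in ordered)

# In-memory stand-in for the donated mailboxes
accounts = ["acct-a", "acct-b", "acct-c"]   # hypothetical account names
data = bytes(range(256)) * 5000             # 1,280,000 bytes of test data
chunks = split_into_chunks(data)
index = build_index("demo.bin", chunks, accounts)
store = {(e["account"], e["file"], e["part"]): c for e, c in zip(index, chunks)}

restored = reassemble(index, lambda acct, name, part: store[(acct, name, part)])
print(len(chunks), "chunks, round trip ok:", restored == data)
```

In a real version the `store` dict would be replaced by sending and fetching mail attachments, and the index entries would live in the "index" accounts.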

Just wait 5 years ... by tomhudson · 2005-10-25 07:52 · Score: 3, Interesting

Hard disk space is doubling every 6 months - wait 5 years and you'll be able to buy a 25TB disk for $125.00.

A single raid50 of them will then give you your petabyte of storage, for around $6,000.
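
A quick Python sanity check of that last line, taking the $125, 25TB and $6,000 figures at face value. The RAID5 group size of 8 is an assumption picked for illustration; nothing above specifies a layout.

```python
def raid50_usable_tb(n_disks, disk_tb, group_size):
    """RAID50 stripes across RAID5 groups; each group loses one disk to parity."""
    if n_disks % group_size:
        raise ValueError("disks must divide evenly into groups")
    groups = n_disks // group_size
    return groups * (group_size - 1) * disk_tb

budget = 6000          # dollars, from the post
price_per_disk = 125   # dollars, from the post
disk_tb = 25           # TB per disk, from the post

n_disks = budget // price_per_disk
raw_tb = n_disks * disk_tb
usable_tb = raid50_usable_tb(n_disks, disk_tb, group_size=8)  # group size assumed

print(f"{n_disks} disks, {raw_tb} TB raw, {usable_tb} TB usable as RAID50")
```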

Re:Petabox.... by timeOday · 2005-10-25 07:55 · Score: 2, Funny

Then I suggest using *nothing*. It's free, and will work with the appropriate hardware and software add-ons.

No Redundancy? by Giggles Of Doom · 2005-10-25 07:57 · Score: 4, Insightful

A PETABYTE without redundancy? I can't imagine having that much data I didn't care about.

--
"A coward dies a thousand deaths, the brave but one."

Re:No Redundancy? by digidave · 2005-10-25 08:12 · Score: 4, Funny

"I can't imagine having that much data I didn't care about."

Hollywood script archive.

--
The global economy is a great thing until you feel it locally.
Re:No Redundancy? by ichigo 2.0 · 2005-10-25 09:15 · Score: 3, Funny

Maybe he needs it for a swap file? I heard Microsoft upped the memory requirements in the next version of Windows.

Re:IBRIX by Wells2k · 2005-10-25 07:57 · Score: 2, Informative

Something else I forgot about is the actual hardware... you may want to take a look at the nStor products. Their hardware RAID systems are relatively economical, and you can go to fibrechannel drives with fibre connected boxes quite easily with their equipment.

Hell, BUY it from EMC! by Genady · 2005-10-25 08:08 · Score: 5, Interesting

As a VERY satisfied customer, I say, just buy the damned thing from EMC. There's few enough warm fuzzy feelings that SysAdmins have in this day and age, like your CE calling at 7:00am saying: "Hey, you had a few hard SCSI errors on Disk 3 Enclosure 0 Tray 0 last night, that's your production LUNs isn't it? There should be a courier there with a disk by 10, and I'll stop by to make sure things are hotsparing back properly after you replace the disk okay?" And *THIS* is just because my CE knows I can handle replacing a disk. Normally he'd come out and do that, and sit around while it re-built the Raid Group.

Yeah, EMC costs. THIS is why. The support, when needed, is top top top notch. Which would you rather have in a DR situation?

--

What if it is just turtles all the way down?

Re:Hell, BUY it from EMC! by lukewarmfusion · 2005-10-25 08:24 · Score: 3, Funny

And you'll probably get at least one nice lunch out of the sales deal. I recommend saving your lunch money and asking for sales visits from all of the major players.

Another, There Is. by LifesABeach · 2005-10-25 08:09 · Score: 2, Insightful

If designing for speed, NOT cost:
given 2PB = 1 Human Brain, non interlaced
1024 TB == 1 PB
1 TB == 1 PC Computer with 1200GB H/D, 2Gig RAM, Networking

If designing for cost, NOT speed:
1 DVD = 4.5GB
1 PB = 1024 TB = 1,048,576 GB
1 PC Computer, with a DVD like the one mentioned above.
1 Robotic CNC Arm, with DVD Gripper(tm)
1 Very Huge Wire Cage to hold DVD's like a Juke Box.
(This has been done before, but with Tapes)
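
How big would that wire cage have to be? A tiny Python sketch using the 4.5GB-per-DVD and 1024-based figures from the list above, with no allowance for redundancy or filesystem overhead (both are simplifying assumptions):

```python
import math

DVD_GB = 4.5                 # per-disc capacity used in the post
PB_IN_GB = 1024 * 1024       # 1 PB = 1024 TB = 1,048,576 GB, as in the post
START_GB = 25 * 1024         # the 25 TB starting point from the original question

discs_for_pb = math.ceil(PB_IN_GB / DVD_GB)
discs_for_start = math.ceil(START_GB / DVD_GB)

print(f"25 TB: {discs_for_start:,} DVDs")
print(f"1 PB:  {discs_for_pb:,} DVDs")
```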

Red Hat GFS != Google FS by Anonymous Coward · 2005-10-25 08:11 · Score: 2, Informative

Read the post that you're replying to more carefully next time.

iSCSI storage / san by pasikarkkainen · 2005-10-25 08:13 · Score: 3, Informative

There seems to be lots of SATA-RAID based iSCSI SAN devices available nowadays.. Some links to products I have seen:

http://www.equallogic.com/ They make nice SATA-RAID based iSCSI SAN devices with all the features you could expect (volumes, snapshots, array/volume-expansion, hotswap, redundant controllers, redundant fans, etc).

http://www.equallogic.com/pages/products_PS100E.htm
14 250G sata disks, 3U, 3.5 TB of raw storage.

http://www.equallogic.com/pages/products_PS300E.htm
14 500G sata disks, 3U, 7 TB of raw storage.

http://www.equallogic.com/pages/products_PS2400E.htm
56+ TB

Looks good. I have not yet used them myself :)

Another iSCSI SATA SAN possibility:
http://www.mpccorp.com/smallbiz/store/servers/product_detail/dataframe_420.html
16 sata disks, review:
http://www.infoworld.com/MPC_DataFrame_420/product_53700.html?view=1&curNodeId=0

This company also has SATA iSCSI SAN devices:
http://www.dynamicnetworkfactory.com/products.asp/section/Product~Categories/category/iSCSI/options/IPBank/drivetype/L~Series/formfactor/Integrated/inface/SATA~-~Serial~ATA

iSCSI SAN comparison:
http://www.networkcomputing.com/story/singlePageFormat.jhtml?articleID=170702726

There are also software iSCSI target solutions for use with your own/custom hardware.
http://iscsitarget.sourceforge.net/ for building linux-based iSCSI target/SAN.

If you are familiar with iSCSI targets / iSCSI SAN devices please post your comments!

I built a 1.7 TB for about $2000 by composer777 · 2005-10-25 08:16 · Score: 2, Insightful

but I'm just a linux hobbyist and programmer, so take any advice I give with a grain of salt, but here's what I did for my setup at home. To start, you're looking a little over $1000 per TB. And, that's about as cheap as it gets with redudundancy. I have 8 drives in one machine, it's in a RAID 5 config, and I have a hot spare. However, if I were doing this for a mission critical application, I would have it in a RAID 6 configuration with a hot spare, and buy a hot swap cage, which would further add to the costs. Then, I would simply export the RAID 5 volume using ISCSI, and then see if there is a way to RAID all of the ISCSI volumes using a master server. I imagine that if you do it right, you could scale up such a system to a fairly large number of machines. You would probably want something faster than gigabit eithernet, probably 10,000 MB/s connecting everything together, otherwise, things could get a bit congested at the head node.

Where all this could get terribly expensive is in power requirements: it takes less power to run a cage of hard drives than it does to run a network of PCs. I'd imagine that any money you save on hardware, you would spend on your power bill. Either way, you're looking at a bare minimum of about $30K to start for 25TB, and I would add another $10K of padding just to be safe, to pay for stuff like a UPS (which you want), a high-end switch (which you'll also need), cabling, etc. In other words, it's not cheap, and like my parent just said, it will probably be cheaper in the long run to have someone like IBM do it for you. Do you really want to be responsible for 25-1000TB of data?
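
To show where the $30K figure comes from, here is a rough Python sketch that scales the home build linearly. It assumes the cost per TB stays flat as the system grows (it probably won't, given the extra chassis, switches and power), and the function name is made up for illustration.

```python
def cost_per_tb(total_cost_usd: float, capacity_tb: float) -> float:
    """Dollars per usable terabyte."""
    return total_cost_usd / capacity_tb


# The home build described above: about $2000 for 1.7 TB.
per_tb = cost_per_tb(2000, 1.7)

# Naive linear scaling to the 25 TB starting point.
hardware_25tb = per_tb * 25

# The extra $10K suggested for UPS, switch, cabling, etc.
with_padding = hardware_25tb + 10_000

print(f"${per_tb:,.0f} per TB")
print(f"${hardware_25tb:,.0f} of hardware for 25 TB")
print(f"${with_padding:,.0f} including padding")
```

The per-TB number lands a little over $1000, and 25 TB of it comes out just under $30K, which matches the estimates in the post.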

What we've done (30TB so far) by bernz · 2005-10-25 08:21 · Score: 4, Informative

We've scaled this to 30TB so far. I'm not sure about 1PB, though. For us, redundancy and storage size are key, performance less so.

Storage nodes: 7 x 2.8TB 2U RAID5+1 boxen with Serial ATA. The 2.8TB is logical, not physical. The OS for each of those machines is RAMDISK based (something we concocted based on what I read about the DNALounge a while back), so it helps curb disk failures of the storage nodes themselves. We guard against disk failure by using RAID5. Of course that doesn't protect against multiple simultaneous disk failures, but read on for more. Each of the storage nodes is exported via NBD.

Then we have a head unit, a 64-bit machine. This machine does a software RAID5 across the storage nodes using an NBD client. Essentially each storage node is a "disk" and the head unit binds and manages the software RAID5. So let's say a whole storage node goes down (for whatever reason it does), all the data is still intact. RAID5 rebuild time over the gigabit network is about 18hrs, which is acceptable. We even have another storage box as a hot-spare.

On top of that, we have the whole cluster mirrored to another identical cluster via DRBD in a different geographic location. This is linked by Gigabit WAN. So if we have a massive disaster and lose the entire primary cluster, then we have a 2ndary cluster ready to go. We needed to purchase the Enterprise version of DRBD ($2k US) but that's worth it because they're neato guys.

We use XFS as the filesystem. This system gives us 14TB of redundant "RAID-55 with a Mirror" space. Both clusters together? $85k.
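
The 14TB figure can be reproduced with a few lines of Python. This is only a sketch: it assumes the hot-spare box is one of the seven storage nodes, so six nodes are active members of the top-level RAID5. The post does not spell that out, but it is the reading that matches the number.

```python
def raid5_usable(members: int, member_size_tb: float) -> float:
    """Usable capacity of a RAID5 set: one member's worth goes to parity."""
    if members < 3:
        raise ValueError("RAID5 needs at least 3 members")
    return (members - 1) * member_size_tb


NODE_LOGICAL_TB = 2.8   # per-node logical size, already net of the node's own RAID5
NODES_TOTAL = 7
HOT_SPARE_NODES = 1     # assumption: the spare is counted among the seven

active_nodes = NODES_TOTAL - HOT_SPARE_NODES
usable_tb = raid5_usable(active_nodes, NODE_LOGICAL_TB)

print(f"{active_nodes} active nodes -> {usable_tb:.1f} TB usable per cluster")
```

The mirror cluster on the other end of the DRBD link holds a second copy of the same data, so it adds no usable space.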

When the cluster starts running out of space (about 70% or so), we add ANOTHER cluster of similar stats to the initial one and use LVM to join the two units together.

This has scaled us to 30TB and we're pretty happy with it. The read speed is very good (hdparm says Timing buffered disk reads: 200 MB in 3.01 seconds = 66.49 MB/sec) and the write speed is about 32 MB/sec. For what our application is doing, that's a fine speed.

Re:What we've done (30TB so far) by bernz · 2005-10-25 09:27 · Score: 2, Informative

I'll put this out as a side point since I'm the OP: If we had to do more than 50TB, I think we'd go to a "real" solution like EMC or something like that. This has been very good for us, but given the need for that amount of storage, we also now have the money to spend on a superduper storage machine. Homebrew has been wonderful to get to this point, but unless we get the kind of employees necessary to really write our own FS a-la GoogleFS, I can't see us taking this solution that much further past where it is now, only because I can't see myself putting THAT much scalable trust into something like NBD or software RAID5. At least not without really, really close inspection of the limitations of that code.

Proprietary FS, commodity disk enclosures by SteveOU · 2005-10-25 08:22 · Score: 2, Informative

The filesystem's going to be the hardest component of this. I know of no open-source fs that could handle this. I'm assuming this is all online storage, and there is no desire to nearline it to tape. Ideally, you'd want something that could concatenate multiple LUNs (of RAIDed storage) without having to run through a volume manager. Nothing against volume managers, but it'd be another component to support. Looking at proprietary FSs, you've got CXFS from SGI, which could easily handle the PB requirement and plays nice on Linux. Sun's got QFS, which would max out at 1PB and could do the volume management bit easily. Linux support was a little flaky last time I used it, but it's a free download and evaluation, so you could go get it right now.

IBM's SAN-FS would also meet the capacity needs and would have the advantage of providing nearline capability, if you're into that. Sun's SAM-FS is basically the QFS product with nearline-to-tape capability. Linux is only supported as a client OS there. Of course, if you buy the mantra that Solaris is 'open-source,' then that might not be an issue.

As for hardware with any of the above solutions, you're going to be looking at using multiple RAIDing disk enclosures of some kind. At a budget, probably SATA disks talking to the controller, and iSCSI to the host. FibreChannel to the host would be a little more costly, but might be worth it since iSCSI is just getting mature enough to be usable in production.

Ask Slashdot Formula: by jlarocco · 2005-10-25 08:22 · Score: 5, Funny

Dear Slashdot,
I have been tasked with (insert very difficult, very important job). This is very important to my company. I have (insert number much lower than it should be) dollars to do this. I do not want to use (insert company name specializing in this exact thing) because management thinks they are too expensive. I think I can do this (insert better/faster/cheaper/...) than said company, even though they have vastly more experience and have invested much more time and research than I have. My continued and future employment probably rests on this project. Please advise.

--
Maybe not

Re:Ask Slashdot Formula: by lakin · 2005-10-25 10:32 · Score: 3, Funny

Dear Sir,

Use Linux.

Regards,
Slashdot

--
Paul

Have you looked at.... by Farfromlosin · 2005-10-25 08:24 · Score: 2, Informative

Capricorn Tech? They power the Internet Archive. "Capricorn Technologies was founded in 2004 and provides petabyte-class storage solutions for organizations worldwide. Capricorn's PetaBox technology grew out of a search for high density, low cost, low power storage systems for the world's largest data collections. Capricorn Technologies is proud to be a leader in the next data storage revolution."

--
...because what good is power unless you can abuse it?

Fibre Channel 30TB in 7 RU by Ironsides · 2005-10-25 08:30 · Score: 2

Nexsan has a box called ATA Beast
RAID, Fibre Channel, 42 ATA drives per 7 RU chassis. Throw in 500GB drives and 1 parity drive for every 6 data drives and you have ~30 TB per chassis.

--
Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars

Can the company funding this really afford this? by @madeus · 2005-10-25 08:33 · Score: 2, Insightful

I appreciate this might not seem like helpful advice, but...

If you've been asked to do something like this by a company that can afford to buy one of the commercial off-the-shelf high-volume storage solutions, then I honestly can't imagine any solution they try and knock up will actually work (as I'm not aware of any free software solution that's currently up to the task).

If your company doesn't have / can't raise the capital to buy a commercial system for a project of this scale, I can't possibly see how they could afford to screw up on this and go with an untested idea that could very well end up being a huge money sink they wouldn't be able to dig themselves out of, one that could doom the entire company and all its investors given the cost it could run to.

And of course, for such a big project, they should hire people who would already know how to do something like this (which is not a dig, it's just crazy to skimp on staff when you have an ambitious project which requires large amounts of capital investment).

That said...

If I were going to do large-scale storage on the cheap, depending on the design of the software and the specific requirements (particularly if I was also developing the software we were going to use, or was able to set feature requirements and/or was able to make the modifications myself), I would build the largest standard file shares I could with SATA disks (using commodity hardware, hot swappable, running Linux, with front-loading drive bays).

The specifics of handling the load balancing (via multiple front ends, multiple mount points, predetermined hashing to balance things out, proxies/caches, hooks in the file system calls, hooks in the application to talk to a controller, etc.) depend entirely on the sort of application, however.

It's definitely likely to be far easier (and more cost effective) to have the software take care of knowing where the data is stored, rather than trying to build a single really large file share. I know of at least one very well-known large company that went down this route (with essentially elaborately hacked-up versions of common OS software).

The downside is you have to support whatever hack you come up with to do this, but that shouldn't be an enormous amount of work (and you can probably afford to hire someone to support it full time for significantly less than the cost of a support contract for a commercial solution).

Good point, bad data by fm6 · 2005-10-25 08:34 · Score: 2, Insightful

If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!

Not a sound assumption. Things don't fail uniformly over time. Suppose 70 babies are born with a life expectancy of 70 years. Is one of them guaranteed to die every year for the next 70 years? Obviously not. If they avoid some joint disaster (like they all take a trip on the Titanic), most of them will die within a decade or so of the 70-year mark.

Same with disk drives: most failures will be clustered around the 57-year mark. Not that your attitude towards redundancy is wrong. Just as people sometimes die in infancy, some disk drives break down quickly. So there's a chance that you'll lose some drives from your thousand-disk system in the first year.

How big a chance? To answer that question, you need more statistics about drive failure — and a much better grasp of probability theory.
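
For reference, here is how the quoted "every 500 hours" figure and the 57-year figure fall out of a 500,000-hour MTBF, as a small Python sketch. The second half assumes a constant failure rate (an exponential lifetime model), which is exactly the assumption this post is questioning, so treat those numbers as what the simple model predicts rather than as a forecast.

```python
import math

MTBF_HOURS = 500_000          # rating implied by the quoted 500-hour figure
DISKS = 1000
HOURS_PER_YEAR = 24 * 365

# The quoted reasoning: spread failures evenly across the whole population.
hours_between_failures = MTBF_HOURS / DISKS
days_between_failures = hours_between_failures / 24

# The "57-year mark": the MTBF expressed in years.
mtbf_years = MTBF_HOURS / HOURS_PER_YEAR

# Constant-failure-rate model (assumption): chance a given drive dies in year one.
p_drive_first_year = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)
expected_failures_first_year = DISKS * p_drive_first_year

print(f"one failure every {hours_between_failures:.0f} h (~{days_between_failures:.1f} days)")
print(f"MTBF is about {mtbf_years:.0f} years")
print(f"per-drive first-year failure chance: {p_drive_first_year:.2%}")
print(f"expected first-year failures in {DISKS} drives: {expected_failures_first_year:.1f}")
```

A model with infant mortality and wear-out (a bathtub curve) would shift these numbers, which is why real failure statistics are needed.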

Why one volume? by photon317 · 2005-10-25 08:34 · Score: 2, Informative

What's making your question hard is the "make it like one volume" restriction. The problem is trivial otherwise. If I were you, I'd be asking whoever tasked you with this to *really* justify on a technical level why they need it to appear as a single volume, since that makes all the possible solutions slower, more costly, and more difficult to maintain.

Chances are extremely high that what they really want is a "/bigfatfs" directory visible everywhere in which they will store many discrete items in subdirectories by project or by dataset or by user. You should convince them to let you build it from commodity machines serving a few TB each, mounted as separate filesystems underneath that umbrella directory. Then your only challenges are coherent management of the namespace of mountpoints for consistency across the environment (which there are longstanding tools for, like autofs + (ldap, nis, nis+, whatever)), and administration/assignment of new space requests within your cluster (that could be scripted to automatically allocate from the least-used volume which can satisfy the request (where least used could mean space or could mean activity hotness based on the metrics you're logging)).
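
The allocation script mentioned above could be as simple as the following Python sketch. Everything here is hypothetical: the Volume record, the mountpoint names and the sizes are invented for illustration, and "least used" is taken to mean lowest space utilization rather than activity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Volume:
    mountpoint: str
    capacity_gb: int
    used_gb: int

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - self.used_gb

    @property
    def utilization(self) -> float:
        return self.used_gb / self.capacity_gb


def pick_volume(volumes: List[Volume], request_gb: int) -> Optional[Volume]:
    """Return the least-utilized volume with room for the request, or None."""
    candidates = [v for v in volumes if v.free_gb >= request_gb]
    if not candidates:
        return None
    return min(candidates, key=lambda v: v.utilization)


# Made-up inventory of filesystems mounted under the umbrella directory.
volumes = [
    Volume("/bigfatfs/vol01", capacity_gb=2000, used_gb=1800),
    Volume("/bigfatfs/vol02", capacity_gb=2000, used_gb=900),
    Volume("/bigfatfs/vol03", capacity_gb=4000, used_gb=1000),
]

chosen = pick_volume(volumes, request_gb=500)
print(chosen.mountpoint if chosen else "no volume can satisfy the request")
```

Swapping the `key` function for one based on logged I/O metrics would give the "activity hotness" variant.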

--
11*43+456^2

How about a PetaBox? by McSpew · 2005-10-25 08:44 · Score: 4, Interesting

The folks at the Internet Archive have already done the hard work of figuring out how to create a petabyte storage system using commodity hardware. The system works so well they started a company to sell PetaBoxes to others. Why reinvent the wheel?

Re:How about a PetaBox? by yppiz · 2005-10-25 13:42 · Score: 2, Informative

You beat me to this link.

I will add that the Archive has particular design and performance goals, namely:

- keep the cost / GB as low as possible
- keep cooling and power requirements low
- use the filesystem and bundle objects into large chunks (~100MB ARC files, last I checked)
- assume streaming writes affecting an edge of the system -- previously written data isn't modified
- assume random reads
- read latency is less important than cost / GB

I worked on the Archive ~5 years ago, and these are based on my understanding of the Archive from that period, so some of these may have changed.

But essentially, these are instantiated as: off-the-shelf SATA disks in fairly standard cases with either normal or special low-power motherboards, running a free OS (the Archive has used both Linux and FreeBSD), with off-the-shelf networking equipment.

--Pat

Re:call EMC. i am sure their clarion line will han by aminorex · 2005-10-25 08:44 · Score: 2, Insightful

What once required talent and brilliance today only requires reading a how-to file, configuring,
and rebooting.

EMC is obsolete. Their customers just haven't discovered it yet.

--
-I like my women like I like my tea: green-

LeftHand Networks storage does it by smartsaga · 2005-10-25 08:50 · Score: 2, Informative

http://www.lefthandnetworks.com/ supports all of what the person is talking about in the article. As you add more of these units, the volumes are spread over the units you add. This means that you can add storage as you go and still have redundancy. You can configure each individual unit to use RAID 0, 1, or 5, and still get to have a volume, or many, across multiple storage units that in turn hold parts of a whole volume or set of volumes. It's like having double mirroring: once within each individual storage unit (which has many IDE drives in RAID 1 or 5) and then again at the storage unit level. Of course this assumes that you have at least two storage units. And, yes, this means that to have redundancy you have to add them in pairs (I think) and have some storage units in one physical location and the pairs of each of those in another location for disaster recovery (fire, earthquakes, you know things can happen.)

I have worked with these units and they kick ass. You can do snapshots of entire servers quickly, given that you have the right infrastructure, set thresholds for volumes that can be increased or reduced on the fly, brick level restoration of files!!!, etc. And of course, my respect goes to their engineers. I saw them working on one unit 'cause we had a really bad power failure that killed one HD. Man, those guys know their stuff up and down, and I've never seen anybody type commands so complex and so freaking long at that speed! They fixed the damn thing and got 99.99999% back from limbo!

I guess their storage boxes follow the model of LVM which is pretty cool and the storage boxes run Linux!!!

Don't take my word for it, go to their website and take a look 'cause I tend to confuse people with my posts rather than pass info efficiently.

Have a good one.

--
===== "Every head is a different world so don't invade mine you FREAK!" smartSAGA said

What about MatrixStore? by Steve.Murray · 2005-10-25 08:54 · Score: 2, Informative

MatrixStore from Object Matrix http://www.object-matrix.com/ uses commodity hardware and clusters it together to create a highly expandable, reliable and secure storage environment.

iSCSI/AoE + LVM + Software RAID? by someguysomewhere · 2005-10-25 08:57 · Score: 2

How about this:
- Use LVM on every node to make the 2TB seem like a single disk ( Assume 4 x 500GB disks )
- Use iSCSI/AoE to make the LVM volumes available on the network
- Use LVM again to merge exported volumes
- For redundancy use software raid 5 on the lvm volumes

I suspect there will be a lot of problems with efficiency but I think you should be relatively safe from hardware failures as the software raid will detect and repair them.

Anyone have any idea whether what I mention is possible/recommended?

Xsan has volume size limits by Rhys · 2005-10-25 08:59 · Score: 2, Informative

I want to say it is 16 Tbyte offhand, but I'm not sure on that.

Short research indicates this was a limitation in 10.3, but I haven't found anything confirming or denying that 10.4 still has it.

Not that we've been looking into large amounts of Xsan storage here, but our requirements are a bit different. You can't hook >600 nodes up to the storage via fibre. Our problem is scaling out the NFS servers to be able to push all this data around.

--
Slashdot Patriotism: We Support our Dupes!

AFS Rocks- Now stop by sirket · 2005-10-25 09:01 · Score: 5, Insightful

Stop what you are doing right now. If your architecture requires you to have one huge volume then you have architected things wrong. Imagine trying to fsck this damned thing! What about file system corruption- What the hell are you going to do when you lose a Petabyte of data because of some file system corruption? Small, sensible, easily managed smaller partitions are the way to go. Use a database to organize where given files are stored. Do something that makes sense. I have a client now who just lost a bunch of data because they used a system like this.

Having said all this- If you are still intent on finding a good file system then use AFS. It's probably your best free solution. If you want to sleep at night call EMC.

-sirket

GPFS - performance and stability by painehope · 2005-10-25 09:02 · Score: 2

GPFS
Take it from someone who's messed with nearly every storage product on the market, if you want something that works fairly simply, performs at approaching spindle speed ( meaning the file system is not the bottleneck - if you have 10 GB/sec. storage bandwidth, expect to see near that with proper tuning ), is very stable ( compared to most storage solutions on the market - bear in mind that most storage products are aimed at large-block sequential I/O, and fall down - either performance-wise or stability-wise - when you throw other I/O patterns or combinations of patterns at them ), and is portable across nearly any Linux distribution ( with varying amounts of difficulty, I have had to hack their kernel patches before when using an unsupported kernel ), GPFS is the one. Of course, the problem there is I believe it's pretty expensive to run on non-IBM hardware. But if you have IBM hardware ( even if it's not the hardware you're running the FS on ) or some sort of in with IBM, they'll let you have it for a song and a dance.

Having said that, Lustre is getting there. I'd say it's the equal of GPFS ( as a parallel filesystem - I believe it is even more flexible as a distributed filesystem ) in performance, probably scales roughly the same ( haven't played with it in a large installation, so can't tell you beyond looking at the architecture ), and is going to be the biggest player on the market in the future. It's also free ( IIRC Cluster File Systems sells support, but the code is freely available ) and not tied to IBM and whatnot, like GPFS is. Of course, HP has a big connection with Lustre, but not ownership thereof.

Those are really the only two that I would consider for a serious high-performance storage project. If you don't need great performance, that's when you can start looking at things like GFS, ADIC's StorNext, Ibrix, etc.

Oh, Gautham Sastri ( of former Maximum Throughput fame ) has a newer company called Terrascale, I recall them putting on a presentation at the 2003 or 2004 ( can't remember ) Supercomputing conference ( SC2005 is coming up in a few weeks, yeah!!! ) which showed pretty good performance ( relative to the small system they were using ), not sure how they're coming along...

Anyways, good luck...and don't forget to use Iozone to benchmark the damn thing!

--
PC moderators can suck my White pierced, tattooed dick. If you think pride == hate, s/dick/Aryan meat mallet/g.

I'm no storage expert but... by Mars+Ultor · 2005-10-25 09:03 · Score: 3, Funny

Why not store the data randomly in a dilithium matrix with asynchronous data transfer and AJAX? Maybe some RUBY on RAILS too - I hear that's hot right now. Of course, you'd have to make use of a couple of Heisenberg compensators configured in parallel to account for any memory addressing issues, but no need to state the obvious there.

--
"Nokia is not a country, it's the capital of Finland!" -Moderated "Informative". Yeesh.

The SCIENTIFIC Answer by MightyMartian · 2005-10-25 09:04 · Score: 2, Funny

We at Vap-o-tech 2003 Inc. (not associated with Vap-o-tech 2001 Inc. which has closed its doors due to allegations of investor fraud) have developed ToastFS 2003. Using patented CRUMB technology and high capacity BUTTER read/write caching, we are able to turn your average loaf of Wunderbread into a 200gb storage media. Simply buy a loaf of our own specially tested Wunderbread ($250 USD) along with a USB-to-Popup Toaster interface (don't worry, USB 2.0 is more than capable of handling 120amp wall sockets without a problem, except in California). Then take our Vap-o-bake ToastFS drive and pop two pieces in. For doubled capacity, buy our Vap-o-bake ToastFSx2 drive, which takes four pieces. From a command prompt, simply type FORMAT C: and answer yes. Your new ToastFS drive will be formatted in minutes. Please note that we have 24 hour technical support via 1-900-842-8524 ext 241. Please don't hang up. Our operators in the Dutch Antilles are very busy and could take upwards of an hour to get to you.

--
The world's burning. Moped Jesus spotted on I50. Details at 11.

Controllers! by man_of_mr_e · 2005-10-25 09:06 · Score: 2, Informative

You could get a bunch of Broadcom 8 port SATA controllers, which equals about 4TB per controller. 4 or 5 controllers = 16-20TB per box, then you can run the cables into an outside drive bay enclosure and one box can control 40 500GB hard drives.

If you're not doing any processing on this, a good CPU should be able to handle the load.

--
If you need web hosting, you could do worse than here

Re:Controllers! by sr180 · 2005-10-25 13:14 · Score: 3, Insightful

The CPU might be able to handle this load easily, but my question is will the bus (PCI or otherwise) be able to handle this load?
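
A back-of-the-envelope check of the bus question, as a Python sketch. The per-drive throughput is an assumption picked for illustration, not a measured figure; the bus numbers are the usual theoretical peaks for classic 32-bit/33 MHz PCI and 64-bit/133 MHz PCI-X, and a shared bus delivers less than that in practice.

```python
DRIVES = 40
PER_DRIVE_MB_S = 60      # assumption: sustained sequential rate of one SATA drive

# Approximate theoretical peak bandwidth of the whole (shared) bus, in MB/s.
BUS_PEAK_MB_S = {
    "PCI 32-bit/33 MHz": 133,
    "PCI-X 64-bit/133 MHz": 1066,
}

aggregate_mb_s = DRIVES * PER_DRIVE_MB_S

for bus, peak in BUS_PEAK_MB_S.items():
    ratio = aggregate_mb_s / peak
    verdict = "bus is the bottleneck" if ratio > 1 else "bus keeps up"
    print(f"{bus}: drives could push {ratio:.1f}x the bus peak -> {verdict}")
```

Under these assumptions, 40 drives streaming at once could oversubscribe even a fast shared bus, so the controllers would need to be spread across independent buses if full aggregate throughput matters.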

--
In Soviet Russia the insensitive clod is YOU!

Here's my solution by Anonymous Coward · 2005-10-25 09:07 · Score: 2, Interesting

I manage a small (29 dual-xeon nodes) linux cluster in a lab for my local college. A while ago I had the same problem when we ran out of storage space on the main file server.

My solution was to use the nodes' hard disks (each one has a 120GB Ultra320 10000rpm disk) combined in a network RAID1+0 solution (we use gigabit ethernet) to get more space. With that approach you can get as much redundancy as you need.

Here's what I did:

1. After installing the network block device server (nbd-server) on each one of the nodes, I created a 100GB partition on the HD and exported them directly using the raw mode;

2. On the master node (using the nbd-client) I created a block device for each one of the nodes' partitions;

3. After that I installed the Linux software RAID tools (mdadm) and created a small RAID1 array for each pair of nodes. I ended up with 14 100GB network RAID1 arrays, each one with its very own /dev/md# block device;

4. I created a big 1.4TB (14 * 100GB) RAID0 array with the 14 RAID1 ones and attached it to the /dev/md0 device;

5. The final step was to create a large ReiserFS filesystem on the /dev/md0 array, and I was done.
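
Steps 2 to 5 can be sketched as a small Python script that only prints the commands for the master node and runs nothing. The host names, the port number and the md numbering (md1 to md14 for the mirrors, md0 for the stripe) are assumptions for illustration; `nbd-client` is shown in its classic `host port device` form, which newer releases have replaced with named exports.

```python
from typing import List


def build_commands(node_hosts: List[str], port: int = 2000) -> List[str]:
    """Return the shell commands for the master node; nothing is executed."""
    usable = len(node_hosts) - len(node_hosts) % 2   # drop an unpaired node
    cmds = []

    # Step 2: attach each node's exported partition as /dev/nbdN.
    for i, host in enumerate(node_hosts[:usable]):
        cmds.append(f"nbd-client {host} {port} /dev/nbd{i}")

    # Step 3: one RAID1 mirror per pair of nodes, /dev/md1 upward.
    mirrors = []
    for pair, i in enumerate(range(0, usable, 2), start=1):
        md = f"/dev/md{pair}"
        cmds.append(
            f"mdadm --create {md} --level=1 --raid-devices=2 "
            f"/dev/nbd{i} /dev/nbd{i + 1}"
        )
        mirrors.append(md)

    # Step 4: stripe (RAID0) across all the mirrors as /dev/md0.
    cmds.append(
        f"mdadm --create /dev/md0 --level=0 --raid-devices={len(mirrors)} "
        + " ".join(mirrors)
    )

    # Step 5: put a ReiserFS filesystem on the stripe.
    cmds.append("mkfs.reiserfs /dev/md0")
    return cmds


# Hypothetical host names for 28 of the cluster's nodes.
hosts = [f"node{n:02d}" for n in range(1, 29)]
for command in build_commands(hosts):
    print(command)
```

With 28 nodes this yields 14 mirrors of 100GB each, which is where the 1.4TB stripe comes from.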

You have to pay special attention to the array shutdown and startup procedures. I wrote my own scripts to take care of that for me.

Our array may seem small compared to what you are looking for, but I am pretty sure that it will scale well for arrays much larger than ours.

Good luck.

Get out now!! by egriebel · 2005-10-25 09:36 · Score: 2, Insightful

Really, go now before your company's stinginess brings you down too.

There's a reason why Terabyte storage arrays for commercial applications cost a lot of money, and why consulting services from IBM, EMC, Hitachi, etc. have the huge per-hour cost. If you/your management can't see that, you really have no business being there. Sure, anyone can throw a JBOD RAID together for a thousand bucks, but I wouldn't trust anything more important than MP3s to it.

--
ACHTUNG! Das computermachine ist nicht fuer gefingerpoken und mittengrabben. Ist nicht fuer gewerken bei das dumpkopfen.

Who let the PHBs out? by buss_error · 2005-10-25 09:48 · Score: 2, Insightful

Sounds like the PHBs have been at this. First, *why* does it have to be a single file system? With Oracle, MySQL, and MS-SQL you can do partitioning, if your need is databases. If your need is really a monolithic file, then I'll bet that the single file size won't be multi-hundreds of gigs.

In short, your stated objective smells. Not enough data.

WHAT is going to be done (database, file storage?)

HOW will it be accessed? (One large file, many smaller files)

WHEN will it be accessed? (During business hours, distributed over the day?)

AVERAGE TRANSFERS - will the whole schmear come over, selected parts?

SECURITY a concern? (Sensitive data, protected network)

BACKUP - a petabyte of tape storage is expensive, and takes quite a while to do.

POWER - do you have enough?

COOLING - ditto

SPACE - ditto - my $DAYJOB computer room is about 3000 sq ft... and we're going to be using all of it within 12 months.

That said, if you go with big drives over a lot of systems, use lots-o-nics to keep the nic from being the bottleneck. A single gig connection sounds fine, but wait until you have 100's of people going for files at once. It'll get swamped. And swear off V-SAN from Cisco. Not worth it at all.

--
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.

Re:Oracle, also by Spudley · 2005-10-25 09:52 · Score: 2, Insightful

It sounds like a seriously ambitious project to approach...

I second that.

Starting at 25TB to scale 1PB? And you want it cheap? If it was cheap to do that sort of thing, we'd all be lining up to get one of our own(*).

Seriously, though, you don't really specify how cheap you are expecting to get it for. What are your expectations, and just how far over-budget are the options you've looked at already? Do you really need 25TB/1PB in one volume, or could it be achieved by splitting it into smaller chunks and working out some sort of load-sharing system?

And in any case, what on Earth kind of data do they anticipate will take a petabyte of contiguous storage????

[(*) Yes, I'm aware that in X years, someone's going to be looking back at this in the /. archive, and laughing about how tiny our disc storage space was back in 2005]

--
(Spudley Strikes Again!)

That's not MTBF, this is.. by beldraen · 2005-10-25 09:54 · Score: 4, Informative

Just a comment about MTBF. It's often not understood, and it is one of my little pet peeves with tech producers because they don't try to correct it. MTBF is a rating for the reliability of lasting through the warranty period.

You have a drive that is rated 500,000 hours MTBF. Suppose you bought a drive and let it run at rated duty. Drives are normally rated to run 100% of the time, but many other devices will have a duty period. Further, you run the drive until its warranty is up. You then throw this perfectly working drive out the window and replace it. If you keep up this pattern, then approximately once per 500,000 hours on average you should have a drive fail before the warranty period is up. This is why it is important to look not only at the MTBF but also at the warranty period.

As a side note: In theory, you should be throwing drives out on a periodic basis. One way around this is to not buy all the same drive type and manufacturer. By having a pool of drive types, you distribute, and thus minimize, the risk of drive failures. Additionally, you may want to have a standard period of time for drive replacement so as to schedule your down time, as opposed to it all being unexpected.

--
Bel, the mostly sane.. "Of course I can't see anything! I'm standing on the shoulders of idiots." -- Me

Re:Oracle, also by Catbeller · 2005-10-25 10:44 · Score: 2, Insightful

And you don't have an answer to the question.

If you don't want to participate, don't. Stop stuffing the threads with posts about how lame everyone's questions, knowledge and motivations are.

I'm actually interested in what people have thought about this very topic, AND I'm not a petabyte database expert. So it's news to me. And probably is to you as well.

PetaBox? by mr_zorg · 2005-10-25 10:46 · Score: 2, Informative

The PetaBox, as previously discussed on Slashdot sounds like just what you want...

Re:Oracle, also by menkhaura · 2005-10-25 11:20 · Score: 3, Funny

what on Earth kind of data do they anticipate will take a petabyte of contiguous storage?

I know. They don't know I know, but I do. It's data gathered by the black helicopters, by Echelon, by Carnivore, by our very own printers, by RFID, about every movement of every single one of us... *They* do it. They.

--
Stupidity is an equal opportunity striker.
Fellow slashdotter Bill Dog

Re:Oracle, also by bezgin · 2005-10-25 12:06 · Score: 2, Funny

Wow! This is a real conspiracy theory. All I could think of was Porn. :)

--
exit();

That's a lot of disks there son! by giberti · 2005-10-25 15:31 · Score: 2, Informative

File systems aside, you're talking about a whole lot of hardware here! Is it really necessary to have all this data online at the same time? Is it possible to store it in some other way (i.e. tapes)? Because it would probably be a whole lot cheaper!

Well, let's see... using 300GB SCSI disks (assuming you can find RAID 0 hardware to support enough disks) you can build out a 1PB storage system with about 3,334 disks. That would set you back about $3.3 million, assuming you paid retail prices for the disks (~$1,000 / disk @ CDW today). Of course, if you ordered 3,000+ disks, I'm sure they would cut you a deal on the price.
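
The disk count and price above can be reproduced with a short Python sketch. It assumes decimal units (1PB = 1,000,000GB) and plain striping with no redundancy, as in the RAID 0 case described.

```python
import math

TARGET_GB = 1_000_000     # 1 PB in decimal gigabytes
DISK_GB = 300
PRICE_PER_DISK_USD = 1000

# Round up: a fraction of a disk still means buying a whole one.
disks_needed = math.ceil(TARGET_GB / DISK_GB)
total_cost_usd = disks_needed * PRICE_PER_DISK_USD

print(f"{disks_needed:,} disks, about ${total_cost_usd / 1e6:.1f} million")
```

Any parity or mirroring on top of this pushes the disk count higher.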

Any hope of daisy-chaining together a few dozen direct-attached storage devices to a NAS server? Something like a Dell PowerVault 220 with 14 300GB SCSI drives will set you back about $21K and give you 3.4TB per 3U of space (RAID-5), so you would have some safety net built in (albeit not much). Slap 10 of these on a PowerVault 6000 series and you should have a ballpark of 34TB (which, while shy of what you're looking for, gets you in the right direction). Total cost around $250K - do it four times and spread the work out over four logical volumes and you should get in the neighborhood of 1 petabyte. You could then set up a redundant server structure, and for $2 million you have a redundant mirrored architecture ready for one to fail and be brought up online quickly.

--

AF-Design, web development.

Google FS is not a real FS! by 1tsm3 · 2005-10-25 19:48 · Score: 2, Informative

Google FS is not a real file system. It's just a bunch of API's that the program calls. The GoogleFS is not integrated into the Linux VFS. So you can't mount a GoogleFS. All programs need to be modified to use the GoogleFS API.

Also, the GoogleFS has very narrow requirements/goals. It works best for programs that only append to the files.

--
-ItsME

4.4TB on raid5 per node at $0.32 per GB by Hackeron · 2005-10-25 19:50 · Score: 2, Interesting

1) ~$100 - nforce4 motherboard with 8 onboard SATA ports,
2) ~$40 - an additional PCI SATA controller with 4 ports,
3) ~$100 - the cheapest AMD64 CPU you can buy,
4) ~$150 - coolermaster stacker case
5) ~$1020 - 12 WD 400GB drives
6) $0 - your favorite Linux distribution.

TOTAL: $1410

Each drive eats about 15W meaning around 180W with an additional 60W for motherboard/cpu consumption which makes it a comparable solution to an efficient scsi solution in terms of power consumption at a small fraction of the cost.
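
The totals in this post can be checked with a few lines of Python. This is just the arithmetic from the parts list, assuming all 12 drives sit in a single RAID5 array so that one drive's worth of space goes to parity.

```python
# Prices in USD, straight from the parts list above.
parts = {
    "nforce4 motherboard (8 onboard SATA)": 100,
    "4-port PCI SATA controller": 40,
    "AMD64 CPU": 100,
    "coolermaster stacker case": 150,
    "12 x WD 400GB drives": 1020,
    "Linux distribution": 0,
}

DRIVES = 12
DRIVE_GB = 400
WATTS_PER_DRIVE = 15
WATTS_BOARD_AND_CPU = 60

total_usd = sum(parts.values())
usable_gb = (DRIVES - 1) * DRIVE_GB           # RAID5: one drive's worth of parity
usd_per_gb = total_usd / usable_gb
total_watts = DRIVES * WATTS_PER_DRIVE + WATTS_BOARD_AND_CPU

print(f"total ${total_usd}, {usable_gb / 1000:.1f} TB usable, "
      f"${usd_per_gb:.2f}/GB, about {total_watts} W")
```

Note that the list does not price RAM or a power supply, so the real per-GB cost would be slightly higher.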

Personally, I created a raid1 array of 2 37GB 10krpm raptor drives for critical stuff and OS, and 2 raid5 arrays of 5 300GB drives for even superior cost per GB while increasing redundancy by a factor of 2. But that only gives you 2.4TB per node in that case.

The configuration can be done with evms or lvm2. Rebuilding on the fly and replacing drives on the fly should work just fine in theory (never tried on the fly), but if not, a scheduled 5 minute downtime is just fine also. My previous 0.5TB raid5 has been up >3 years so far, and a hard drive failure just required running mdadm /dev/md0 --add /dev/sda5 to rebuild the array.

Increasing the array size becomes tricky (although an available option) and fiddling with various distributed network filesystems doesn't really seem worth it for me personally, but openmosix and other clustering solutions offer distributed filesystems.

Just remember, the SATA architecture is nice; SCSI isn't really a requirement for this kind of solution.

terrascale is cool... by anon+mouse-cow-aard · 2005-10-25 22:26 · Score: 2, Interesting

http://www.terrascale.com/prod_e.html Run a client on linux boxes with user-mode drivers that provide a logical abstraction for a whole network of backend linux boxes over any networking transport you want.

Petabox from Capricorn by Ty_Berg · 2005-10-25 22:35 · Score: 2, Interesting

I ran across this a while back at linuxdevices. It is supposed to scale to petabytes and is the main technology used for the Internet Archive.

Capricorn Technologies Petabox
http://www.capricorn-tech.com/

Linux Devices Review
http://linuxdevices.com/news/NS2659179152.html
