Building a Massive Single Volume Storage Solution?

gmail by Adult+film+producer · 2005-10-25 07:24 · Score: 4, Funny

register a few thousand gmail accounts and write the interface that will make writing of data to gmail inboxes invisible to the app.

Re:gmail by Anonymous Coward · 2005-10-25 07:43 · Score: 2, Interesting

Gmail? Why bother when you can just use a few hundred million Tinydisks instead?

I wonder if tinyurl can handle 25TB...
Re:gmail by Stuart+Gibson · 2005-10-25 07:44 · Score: 4, Funny

That would have been my second answer.

The first, and presumably the reason this was posted to /. is simple...

Imagine a Beowulf cluster...

Stuart

--
It's all fun and games until a 200' robot dinosaur shows up and trashes Neo-Tokyo... Again
Re:gmail by JoshRosenbaum · 2005-10-25 11:25 · Score: 1

Better register a few thousand yahoo accounts as well for redundancy! (RAID 1 baby! ;) ) Then if the plug was pulled by one of them, at least you'd have the other to work with! Not only would you get redundancy, but you'd get faster speeds since you could access portions of files from both "disks". :)
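
A toy sketch of the mirroring idea in the comment above, in Python. Everything in it is hypothetical: the `Mirror` class, the in-memory dictionaries standing in for the two pools of accounts, and the chunk size are for illustration only and do not correspond to any real Gmail or Yahoo API. It writes every chunk to both replicas (RAID 1 style) and alternates chunk reads between whichever replicas are still alive.

```python
class Mirror:
    """Hypothetical RAID 1 style mirror over in-memory 'disks'."""

    def __init__(self, n_replicas=2, chunk_size=4):
        # Each replica maps a file name to its list of chunks.
        self.replicas = [dict() for _ in range(n_replicas)]
        self.chunk_size = chunk_size
        self.reads = [0] * n_replicas  # chunk reads served per replica

    def write(self, name, data):
        """Split data into chunks and store a full copy on every live replica."""
        size = self.chunk_size
        chunks = [data[i:i + size] for i in range(0, len(data), size)]
        for replica in self.replicas:
            if replica is not None:
                replica[name] = list(chunks)

    def fail(self, index):
        """Simulate one provider pulling the plug."""
        self.replicas[index] = None

    def read(self, name):
        """Read chunks round-robin from all replicas that still hold the file."""
        alive = [i for i, r in enumerate(self.replicas)
                 if r is not None and name in r]
        if not alive:
            raise IOError("no surviving replica holds " + name)
        n_chunks = len(self.replicas[alive[0]][name])
        out = []
        for c in range(n_chunks):
            i = alive[c % len(alive)]
            out.append(self.replicas[i][name][c])
            self.reads[i] += 1
        return b"".join(out)


if __name__ == "__main__":
    m = Mirror()
    m.write("backup.tar", b"hello world!")
    print(m.read("backup.tar"), m.reads)
    m.fail(0)
    print(m.read("backup.tar"), m.reads)
```

With both replicas alive the reads are spread across them, which is where the "faster speeds" would come from; after `fail(0)` every chunk is served by the surviving replica.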
Re:gmail by mogwai7 · 2005-10-25 17:21 · Score: 1

There is already an app called Gmail Filesystem that lets you use a gmail account as a drive. It's implemented in python so it shouldn't be difficult to modify it to do what you suggest.

GFS? by fifirebel · 2005-10-25 07:24 · Score: 4, Informative

Have you checked out GFS from RedHat (formerly Sistina)?

Re:GFS? by N1ck0 · 2005-10-25 07:32 · Score: 3, Informative

GFS over a FC SAN with some EMC CLARiiON CX700s as the hosts is the solution that I'm going to be looking at deploying next year, although there are still some thoughts on using iSCSI instead of FC. It all really depends on what your usage patterns and performance requirements are. I don't believe GFS supports ATAoE systems, but since there is Linux support I doubt it would be too far of a stretch.
Re:GFS? by LnxAddct · 2005-10-25 08:33 · Score: 3, Informative

I second this parent post. GFS is exactly what he wants. Although I've never used it in the 1 PB range, I can vouch for it working excellently with TBs.
Regards,
Steve
Re:GFS? by ralf1 · 2005-10-25 08:36 · Score: 1

Or Polyserve Matrix Server, which unlike GFS actually works...?

--
"Would you, could you, with a goat?" Dr Seuss
Re:GFS? by InsaneGeek · 2005-10-25 10:20 · Score: 1

I'll let polyserve login to my SAN switches and "automatically" down ports for fencing when you pry the storage from my cold dead hands.
Re:GFS? by Anonymous Coward · 2005-10-25 13:51 · Score: 0

See the FP. The Google File System has already been suggested.
Re:GFS? by mathrock · 2005-10-25 16:05 · Score: 1

One of the issues I have with RedHat GFS is that the maximum file system size is 8TB per clustered file system....
Re:GFS? by asamad · 2005-10-25 18:32 · Score: 1

Having tried both, I have found both to work, although the PolyServe was about 10x the cost of the GFS solution. Plus, with GFS you can use iSCSI.
Re:GFS? by Anonymous Coward · 2005-10-26 01:25 · Score: 0

> See the FP. The Google File System has already been suggested.

See the link. The filesystem you think he is suggesting is not the filesystem he is suggesting.

Apple Xserve? by mozumder · 2005-10-25 07:24 · Score: 2, Informative

Can't you hook up 4x 7TB Xserve RAIDs to a PowerMac and use that?

Re:Apple Xserve? by Jeff+DeMaagd · 2005-10-25 07:29 · Score: 3, Informative

Apple Xserve may be the cheapest of that kind of storage, but it's probably not fitting the original idea of commodity hardware.

Scaling to petabytes means spanning storage across multiple systems.
Re:Apple Xserve? by medazinol · 2005-10-25 07:30 · Score: 5, Interesting

My first thought as well. However, he is asking for a single volume solution. So XSAN from Apple would have to be implemented. Good thing that it's compatible with ADIC's solution for cross-platform support.
Probably would be the least expensive option overall and the simplest to implement. Don't take my word for it, go look for yourself.
Re:Apple Xserve? by stang7423 · 2005-10-25 07:55 · Score: 3, Informative

Apple has a solution for this. Xsan is a distributed filesystem that is based on ADIC's StorNext filesystem. Apple states on that page that it will scale into the range of petabytes.
Re:Apple Xserve? by Anonymous Coward · 2005-10-25 07:59 · Score: 0

This guy needs Xsan. No other solution will be as cost effective. Trying to build a clustered file system with such massive storage on commodity hardware is not worth the effort. You'll be pulling your hair out pretty quickly. You NEED Xsan.
Re:Apple Xserve? by SWroclawski · 2005-10-25 08:13 · Score: 1

Ah ADIC's SNFS... I remember when it was CFS (Cluster File System), and I remember hours of problems...

Driver memory leaks... Nodes that disappear. Boxes stopping for no apparent reason.

And the performance wasn't great.
Re:Apple Xserve? by TRRosen · 2005-10-25 08:19 · Score: 4, Informative

To do this would cost around $50,000 with xRaids and xSan...$2000/TB is probably the best price you're going to get. You could do this with generic hardware but the cost of assembling, the extra room, extra power consumption and the maintenance and engineering costs will certainly wipe out what you might save. The xRaid solution could be up in a day and fit in one (actually 1/2) rack.
I do remember some college building a nearline backup storage system using 1U servers with 2 or 3 RAID cards each, connected to like 12 drives per machine in homemade brackets, but it was hardly ideal. But it did work. Anybody remember where that was?
Re:Apple Xserve? by 76chyquem · 2005-10-25 10:02 · Score: 1

XSAN scales to petabytes but you're still stuck with RAID5 data protection only. Take a look at MatrixStore - www.object-matrix.com - works on the Apple kit, adds automatic data protection over and above XSAN, still on commodity hardware.
Re:Apple Xserve? by Anonymous Coward · 2005-10-25 10:48 · Score: 3, Insightful

"This product is tangentially related to a product which, five years ago, I had unspecified bad experiences with. Ergo, this product sucks."

Only on fucking Slashdot.
Re:Apple Xserve? by nanophilia · 2005-10-25 11:06 · Score: 1

The ADIC cross platform product is called StorNext. It's very fast; the only caveat outside the 2K/per node price :) is that they have been slow to release a driver for 2.6 kernels. The latest supported kernel is 2.4.21 RH-ES I believe.
Re:Apple Xserve? by Anonymous Coward · 2005-10-25 11:39 · Score: 0

Scaling to petabytes means spanning storage across multiple systems.

no shit, sherlock :p

my guess is that if you really want petabyte storage, you would be likely to think that you'd have to span a few volumes to make it work.
Re:Apple Xserve? by Anonymous Coward · 2005-10-25 11:46 · Score: 0

17 years ago Microsoft Word 2 ate a paper I'd written right before the deadline. Microsoft sucks.
Re:Apple Xserve? by labratuk · 2005-10-25 13:00 · Score: 1

Did I miss the bit where he specifies that he wants to be locked in to a proprietary hardware and software system?

--
Malike Bamiyi wanted my assistance.
Re:Apple Xserve? by rhaig · 2005-10-25 16:48 · Score: 1

it's not even the cheapest. check out nexsan atabeast

--
"We are not tolerant people. We prefer drastically effective solutions"
Re:Apple Xserve? by SWroclawski · 2005-10-26 00:44 · Score: 1

It wasn't five years ago- it was ONE year ago, and I saw the transition of product names. "The names change, but the binaries remain the same."

The product certainly didn't fit my criteria for "good" or "stable", but I didn't say it sucked.
Re:Apple Xserve? by Anonymous Coward · 2005-10-26 01:18 · Score: 0

A better planning value is around $8/GB. And take that to $12-14/GB if you're going to be implementing any sort of backup system. So 25TB would clock in at between $200k and $350k.
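
The per-gigabyte planning figures in this subthread are easy to put side by side. A minimal sketch in Python, assuming decimal units (1 TB = 1000 GB); the function name `storage_cost` and the labels are made up for illustration, and the rates are simply the ones quoted by posters above.

```python
GB_PER_TB = 1000  # assumption: decimal units, as disk vendors count


def storage_cost(capacity_tb, dollars_per_gb):
    """Total cost in dollars for a capacity given in TB at a $/GB rate."""
    return capacity_tb * GB_PER_TB * dollars_per_gb


if __name__ == "__main__":
    # Rates quoted in the thread; $2000/TB is the same as $2/GB.
    rates = [
        ("hardware only, $2000/TB", 2),
        ("planning value, $8/GB", 8),
        ("with backup, low end, $12/GB", 12),
        ("with backup, high end, $14/GB", 14),
    ]
    for label, rate in rates:
        print(f"25 TB, {label}: ${storage_cost(25, rate):,}")
```

At 25 TB, the $2000/TB figure gives $50,000, while the $8/GB and $14/GB figures give the $200k and $350k endpoints quoted above.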
Re:Apple Xserve? by Anonymous Coward · 2005-10-26 15:48 · Score: 0

"This product was released and supposed to do something where reliability was important but it wasn't reliable, once bitten, twice shy."

Oh, I'm sorry, was I mocking you?

How about this? Apple is FINALLY making ECC available on the G5's. About. Damn. Time. No, Apple is working on getting it but they Just Don't Grok the Enterprise space.

Here's a nickel, go buy yourself a real computer.

Veritas Filesystem by Anonymous Coward · 2005-10-25 07:25 · Score: 0

Go check out veritas.com (now Symantec) for a commercially available filesystem...

Andrew File System by mroch · 2005-10-25 07:25 · Score: 4, Informative

Check out AFS.

Re:Andrew File System by Simon+Lyngshede · 2005-10-25 07:27 · Score: 2, Informative

Agreed. AFS is exceptionally nice. However I think it still has a max file size of 2GB.
Re:Andrew File System by ashpool7 · 2005-10-25 07:29 · Score: 1

Additionally, Coda, but I'm not sure if it's as stable.
Re:Andrew File System by JAZ · 2005-10-25 07:37 · Score: 1

I was about to recommend this but when I googled afs and found a faq it said:

Subject: 1.02 Who supplies AFS?

Transarc Corporation phone: +1 (412) 338-4400
The Gulf Tower
707 Grant Street fax: +1 (412) 338-4404
Pittsburgh
PA 15219 email: information@transarc.com
United States of America afs-sales@transarc.com

WWW: http://www.transarc.com/

BUT....
I clicked that transarc.com link and found the only porn site that my company proxies don't block. EEK!

--

"Karma can only be portioned out by the cosmos." -- Homer Simpson
Re:Andrew File System by Anonymous Coward · 2005-10-25 07:46 · Score: 2, Informative

http://www.openafs.org/
Re:Andrew File System by Anonymous Coward · 2005-10-25 08:04 · Score: 0

Thanks for the link, I actually *do* want a new girlfriend!
Re:Andrew File System by Trepalium · 2005-10-25 08:11 · Score: 2, Informative

Transarc was acquired by IBM in 1998, and IBM released OpenAFS in 2000. This used to be IBM's site for Transarc technologies, but it looks like it doesn't exist anymore, and instead just redirects to IBM's software page.

--
I used up all my sick days, so I'm calling in dead.
Re:Andrew File System by miles31337 · 2005-10-25 08:40 · Score: 3, Informative

No longer true, the OpenAFS 1.3.X (soon to be 1.4) has support for larger files.
Re:Andrew File System by finkployd · 2005-10-25 09:19 · Score: 1

Coda is long dead, never lived up to anything. Most Coda people have gone to Intermezzo, which is still not really usable yet.

Go use OpenAFS

Finkployd
Re:Andrew File System by finkployd · 2005-10-25 09:23 · Score: 1

However I think it still has a max file size of 2GB.

This has not been true for quite a while.

Finkployd
Re:Andrew File System by kfhickel · 2005-10-25 09:31 · Score: 1

Sort of true. Transarc was always funded and supported by IBM. It's just that in 1999 Transarcians started getting "blue" paychecks.....
Re:Andrew File System by Anonymous Coward · 2005-10-25 13:19 · Score: 0

There was just a new release in September it would seem. That doesn't sound too dead to me. At least not dead enough to start rummaging the pockets for spare change.
Re:Andrew File System by Anonymous Coward · 2005-10-25 15:52 · Score: 0

Thanks for the link!

UFS by Anonymous Coward · 2005-10-25 07:25 · Score: 0

UFS (universal file system) seems to be what you want here, any other thoughts?

PetaBox by Anonymous Coward · 2005-10-25 07:26 · Score: 4, Informative

How about the PetaBox, used by the Internet Archive?

Re:PetaBox by sycodon · 2005-10-25 07:45 · Score: 5, Funny

Just don't call it PetaFile.

--
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
Re:PetaBox by MikeFM · 2005-10-25 07:52 · Score: 3, Informative

I priced one of those and decided I'd have to work my way up to that kind of toy. Instead I started with Buffalo's TeraStations which are affordable and have built-in RAID support. You can mount them in Linux and use LVM to span a single filesystem across several of them or just mount them normally depending on your needs. $1-$2 per GB for external, RAID, storage isn't bad at all.
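
To make "span a single filesystem across several of them" concrete, here is a conceptual sketch in Python of how a linear (concatenated) volume maps one logical address space onto several devices. It is not LVM's implementation or API; the function names `build_span` and `locate` and the sizes are hypothetical, chosen only to show the address arithmetic.

```python
from bisect import bisect_right
from itertools import accumulate


def build_span(device_sizes):
    """Return cumulative end offsets for devices concatenated in order."""
    return list(accumulate(device_sizes))


def locate(ends, logical_offset):
    """Map an offset in the spanned volume to (device index, offset on device)."""
    if logical_offset < 0 or logical_offset >= ends[-1]:
        raise ValueError("offset outside the spanned volume")
    # First device whose end offset lies beyond the requested position.
    index = bisect_right(ends, logical_offset)
    start = ends[index - 1] if index else 0
    return index, logical_offset - start


if __name__ == "__main__":
    # Three hypothetical devices of unequal size (arbitrary units).
    ends = build_span([100, 200, 50])
    for offset in (0, 99, 100, 299, 300):
        print(offset, "->", locate(ends, offset))
```

The point of the sketch is that a linear span adds capacity but no redundancy: each logical block lives on exactly one device, so any fault tolerance has to come from the RAID inside each box.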

--
At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.
Re:Petabox by afidel · 2005-10-25 07:56 · Score: 4, Insightful

This guy is worried about budget, yet even with the "low power" usage of the petabox it would still use 50kW for one petabyte of storage! When you combine the cooling for that with the cost of electricity you are talking some serious money. If you have trouble getting the capital funds for something like this how are you ever going to pay the operating costs?

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:Petabox by OrangeTide · 2005-10-25 08:03 · Score: 1

How about 1PiB of CompactFlash? Lower power (you pretty much only use power when you're accessing it, otherwise it just uses a tiny bit of power). Sure you pay a lot up front for it. I think a TB of CF would run about $100,000. So not really cheap I suppose:)

--
“Common sense is not so common.” — Voltaire
Re:Petabox by digidave · 2005-10-25 08:08 · Score: 1

"I don't know what FS they use, but apprently it is redudent."

I think they use FAT32 FAT32.

--
The global economy is a great thing until you feel it locally.
Re:Petabox by twiddlingbits · 2005-10-25 08:21 · Score: 1

Could be the ops budget and the dev budget are different piles of money. I've seen that happen a LOT, but your point is valid as energy costs are high and getting higher. I also noticed the PetaBox only supports up to Gigabit Ethernet, and there are other technologies (like Fibre Channel and InfiniBand) that are faster (if speed is an issue), but I haven't seen open-source drivers.
Re:Petabox by rpresser · 2005-10-25 08:40 · Score: 2, Interesting

Depending on latency requirements, perhaps most of the cluster can stay in sleep mode until it is needed.
Re:PetaBox by Anonymous Coward · 2005-10-25 08:52 · Score: 0

Yeah.
YOU FAIL IT
Re:Petabox by Darby · 2005-10-25 09:38 · Score: 1

there are other technologies (like Fibre Channel and InfiniBand) that are faster (if speed is an issue), but I haven't seen open-source drivers.

I'm pretty sure Fibre Channel is supported, and I know there is an option for InfiniBand in the Linux kernel config, but I have zero experience using these so I can't say how good the drivers are.

Since these technologies are used a lot in large clusters, I'd imagine the support is pretty decent at least.
Re:Petabox by Databass · 2005-10-25 10:22 · Score: 3, Insightful

This guy is worried about budget, yet even with the "low power" usage of the petabox it would still use 50kW for one petabyte of storage!

Interesting to think about. My brain probably holds about a petabyte of memories and it uses 20-60 watts. Mostly from sugar.
Re:Petabox by holloway · 2005-10-25 10:24 · Score: 1

CompactFlash can only have about 2000 read/writes before failing. It's good for photos, and that's it. A ramdrive is better, but expensive.

--
-Docvert converts MSWord to OpenDocument, clean HTML
Re:Petabox by russ_allegro · 2005-10-25 10:38 · Score: 2, Informative

They claim ~40 watts per terabyte. That is pretty darn low; if you are going to try to come up with your own solution with off-the-shelf parts it'll be hard to match that. If they can't pay for 40 watts per terabyte for a petabyte, maybe they should reconsider whether they need the petabyte for now.

Let's say $0.07 per kWh.
Then the 50kW as you said would be:
50*24*31*$0.07 = $2,604/month

So it isn't super cheap; I guess that is why you don't hear about everyday people buying a petabyte of storage. I think if you try to save more on electricity (like coming up with some other device besides hard drives) you will end up paying a huge amount for whatever saves that electricity, beyond the electricity costs themselves.
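
The electricity arithmetic above can be written out as a small Python sketch. The function names `cluster_kw` and `monthly_power_cost` are made up for illustration; the inputs are the figures quoted in this subthread ($0.07 per kWh, a 31-day month, 40 W per TB, 50 kW), and cooling overhead is deliberately left out because no number for it is given here.

```python
HOURS_PER_DAY = 24


def cluster_kw(capacity_tb, watts_per_tb):
    """Continuous electrical load in kW for a given capacity."""
    return capacity_tb * watts_per_tb / 1000


def monthly_power_cost(load_kw, dollars_per_kwh, days=31):
    """Energy cost in dollars for a constant load over one month."""
    return load_kw * HOURS_PER_DAY * days * dollars_per_kwh


if __name__ == "__main__":
    loads = [
        ("50 kW, as quoted in the parent post", 50),
        ("1000 TB at 40 W/TB", cluster_kw(1000, 40)),
    ]
    for label, kw in loads:
        cost = monthly_power_cost(kw, 0.07)
        print(f"{label}: {kw:g} kW, ${cost:,.0f} per month")
```

The 50 kW case reproduces the $2,604/month figure; the claimed 40 W per TB works out to 40 kW for a petabyte, so somewhat less.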
Re:Petabox by TTK+Ciar · 2005-10-25 10:49 · Score: 1

Capricorn pre-installs our setup (archive.org) on the redboxes they sell us (slightly customized Debian, reiserfs3 on each disk, no filesystem abstraction, wrapped in rsync modules for access and Alexa UDP locator for indexing), but last I heard they will negotiate with customers to install whatever OS they want, and I presume with any abstraction solution they want (RAID at the per-node level, or not; OpenAFS for cluster-wide abstraction, whatever). So if you want FreeBSD/OpenAFS, that's doable, or if you want Windows, I'm sure they'll accommodate you there too.

The disks in the redboxes may appear to miss the optimal space-per-unit-price ratio, but really when you factor in physical compactness, data-per-unit-heat, these disks' robustness, and data-per-complete-system, they're very very good for a mass data warehousing solution, where physical space and power/cooling requirements count. The VIA C3's and VIA motherboards are extremely low-power systems, but iirc the disks make up the majority of each node's wattage appetite. It's a good system, and CR (the chief hardware engineer, and founder of Capricorn) has reason to be proud.

BTW, if someone has the desire and money to build a data warehousing cluster, but not the expertise, I'm available for contracting gigs.

-- TTK
Re:Petabox by buck_wild · 2005-10-25 13:19 · Score: 1

I've been out of the RamDisk arena for quite a while. Which would you recommend, if you don't mind my asking?

--
If all you have is a hammer, everything looks like a nail.
Re:Petabox by holloway · 2005-10-25 15:17 · Score: 1

I only casually read about this kind of stuff, but this one seems to make the forum goers happy. Expensive as anything though.

--
-Docvert converts MSWord to OpenDocument, clean HTML
Re:PetaBox by walrusx · 2005-10-25 16:51 · Score: 1

Whatever you do, don't try to use their software interface - it looks (and behaves) like it was built in Visual Basic 3.0 by a High Schooler.
Re:PetaBox by Amouth · 2005-10-25 17:24 · Score: 1

last time i looked they just did jdob - without fault tolerance what good is a large amount of storage?

--
'...if only "Jumping to a Conclusion" was an event in the Olympics.'
Re:PetaBox by MikeFM · 2005-10-25 18:11 · Score: 1

What is jdob? They offer a range of RAID options so you have your choice of fault tolerance levels.

--
At what price learning? At what cost wisdom? The price is a man's peace of mind, and the cost is his life.
Re:Petabox by faragon · 2005-10-25 20:19 · Score: 2, Interesting

But you do not have random access to your own data (to say nothing of reliability).
Re:Petabox by WuphonsReach · 2005-10-26 01:29 · Score: 1

Aye, the C3 is something like a 10W to 15W CPU, if I remember my specs correctly. Disk drives (3.5" 7200rpm) are typically 10-12W when spinning/seeking/active (and 5-6W when idle). Startup wattage can be about double the normal active wattage (some makers say 28W for startup).

Not sure if the smaller 2.5" SCSI drives end up at the same power draw due to the increased RPM.

--
Wolde you bothe eate your cake, and have your cake?
Re:PetaBox by Anonymous Coward · 2005-10-26 05:37 · Score: 0

Just don't call it PetaFile.
Don't laugh. Sony unintentionally did this, not understanding the English connotation. Product was renamed 'Petasite'.
Re:Petabox by OrangeTide · 2005-10-26 10:03 · Score: 1

CF supports an infinite number of read cycles (unlike your hard drive, which will eventually fail to seek on numerous reads, but it will usually recover ... eventually). And most manufacturers offer 100,000 write cycles for their low-end product line. And 300,000 to 500,000 for their higher end.

If you have a PiB of flash, you could do some pretty impressive write leveling. If you are only writing a fraction of a PiB (which is normal) the left over space multiplies the life of your flash. It's a good deal.
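
A back-of-the-envelope sketch of that wear-leveling argument, in Python. It assumes ideal leveling (every cell is written equally often) and no write amplification, which real flash controllers do not achieve; the function name `lifetime_days` and the 1 TiB/day write volume are hypothetical, while the 100,000-cycle figure is the low-end number quoted above.

```python
def lifetime_days(capacity_tib, write_cycles, written_tib_per_day):
    """Days until every cell has used up its rated write cycles.

    Assumes perfect wear leveling and a write amplification of 1.
    """
    total_writable_tib = capacity_tib * write_cycles
    return total_writable_tib / written_tib_per_day


if __name__ == "__main__":
    # 1 PiB = 1024 TiB of flash rated for 100,000 write cycles,
    # with a hypothetical workload writing 1 TiB per day.
    days = lifetime_days(1024, 100_000, 1)
    print(f"about {days / 365:,.0f} years under ideal leveling")
```

Under these assumptions lifetime scales linearly with total capacity, which is the sense in which left-over space "multiplies the life" of the array.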

A ramdrive is a lot faster, but it is not reliable for long-term storage either. Battery death, power glitch, stray high-energy photon, etc.

I don't know where you got that 2000 number, I suspect you just pulled it out of your ass.

--
“Common sense is not so common.” — Voltaire
Re:Petabox by buck_wild · 2005-10-26 12:16 · Score: 1

Wow. $698 for the board, then ~ $2k to fill it. Ouch. Thanks for the information though! :)

--
If all you have is a hammer, everything looks like a nail.
Re:Petabox by OrangeTide · 2005-10-27 10:17 · Score: 1

Micromemory makes them for a bit cheaper.

--
“Common sense is not so common.” — Voltaire
Re:Petabox by holloway · 2005-10-27 10:29 · Score: 1

Out of someone else's arse actually, a post on SA. I should check my facts.

--
-Docvert converts MSWord to OpenDocument, clean HTML
Re:PetaBox by dublin · 2005-10-28 10:44 · Score: 1

Instead I started with Buffalo's TeraStations which are affordable and have built-in RAID support.

While the Buffalo product isn't bad, it's really not industrial strength, either.

Check out Infrant's ReadyNAS servers as an alternative (http://infrant.com/) - lots of really nice features (including NFS support, if you want/need it, unlike Buffalo), much better performance, and a rack-mount option, if you're into that kinky rack thing...

Seriously, if you're looking for a killer low-cost, high-performance NAS server appliance, check it out. I haven't found anything that better balances cost and performance, although let me say I've not yet seen one in action up close, so I can't offer a real recommendation just yet.

--
"The future's good and the present is nothing to sneeze at." - Roblimo's last ./ post

MogileFS from livejournal by mikeee · 2005-10-25 07:27 · Score: 2, Informative

Livejournal developed their own distributed filesystem:

http://www.danga.com/mogilefs/

It's scalable and has nice reliability features, but is all userspace and doesn't have all the features/operations of a true POSIX filesystem, so it may not suit your needs.

Go the Easy Route by Evil+W1zard · 2005-10-25 07:27 · Score: 3, Funny

I know a certain recent Zombie network that was discovered which collectively had quite a few PBs of storage... Of course I wouldn't recommend going down that road as it leads to you know ... jail.

--
News Reporters Make Tasty Polar Bear Treats!

Petabox by treerex · 2005-10-25 07:27 · Score: 0, Redundant

Check out the Internet Archive's Petabox. They have a 100 TB rack running in Europe right now.

call EMC. i am sure their clarion line will handle by Anonymous Coward · 2005-10-25 07:28 · Score: 0

call EMC. i am sure their CLARiiON line will handle it.

i am unsure how you plan to do this with open source
software. It seems to me, you will want mgmt software
to go along with it. That is the real value, me thinks.

Oracle, also by PCM2 · 2005-10-25 07:28 · Score: 1

The Oracle Cluster filesystem is also available under the GPL. Dunno if that fits the bill; the description here is sort of vague. It sounds like a seriously ambitious project to approach for someone who doesn't even know what can be done, let alone what's within his budget.

--
Breakfast served all day!

Re:Oracle, also by PCM2 · 2005-10-25 07:32 · Score: 1

Er, sorry, version 2 is what I meant.

--
Breakfast served all day!
Re:Oracle, also by N1ck0 · 2005-10-25 07:38 · Score: 1

From what I've heard OCFS2 can be a bit...finicky like most oracle systems, and it hasn't really taken off like they really hoped.
Re:Oracle, also by rebelcan · 2005-10-25 09:37 · Score: 1

I'm not saying that he shouldn't have posted a question to Slashdot, it would just be nice if he had done some more research/planning/forethought.

It just seems from the question that he went straight from being asked by his boss about doing this to posting this question on Slashdot.

--
God is dead -- Nietzsche
Nietzsche is dead -- God
Zombie Nietzsche lives! -- Zombie Nietzsche
Re:Oracle, also by Spudley · 2005-10-25 09:52 · Score: 2, Insightful

It sounds like a seriously ambitious project to approach...

I second that.

Starting at 25TB to scale to 1PB? And you want it cheap? If it was cheap to do that sort of thing, we'd all be lining up to get one of our own(*).

Seriously, though, you don't really specify how cheap you are expecting to get it for. What are your expectations, and just how far over-budget are the options you've looked at already? Do you really need 25TB/1PB in one volume, or could it be achieved by splitting it into smaller chunks and working out some sort of load-sharing system?

And in any case, what on Earth kind of data do they anticipate will take a petabyte of contiguous storage????

[(*) Yes, I'm aware that in X years, someone's going to be looking back at this in the /. archive, and laughing about how tiny our disc storage space was back in 2005]

--
(Spudley Strikes Again!)
Re:Oracle, also by Anonymous Coward · 2005-10-25 09:52 · Score: 0

How about offering a solution (which you can't) instead of just being another tired (yawn) arm-chair critic.
Re:Oracle, also by catprog · 2005-10-25 10:28 · Score: 1

By one volume I think he means it appears as 1 volume (e.g. a RAID array is only 1 logical drive but many physical drives).

--
My Transformation Website
Kindle Books http://www.catprog.org/rev
Interactive CYOA http://www.catprog.org/st
Re:Oracle, also by Catbeller · 2005-10-25 10:44 · Score: 2, Insightful

And you don't have an answer to the question.

If you don't want to participate, don't. Stop stuffing the threads with posts about how lame everyone's questions, knowledge and motivations are.

I'm actually interested in what people have thought about this very topic, AND I'm not a petabyte database expert. So it's news to me. And probably is to you as well.
Re:Oracle, also by menkhaura · 2005-10-25 11:20 · Score: 3, Funny

what on Earth kind of data do they anticipate will take a petabyte of contiguous storage?

I know. They don't know I know, but I do. It's data gathered by the black helicopters, by Echelon, by Carnivore, by our very own printers, by RFID, about every movement of every single one of us... *They* do it. They.

--
Stupidity is an equal opportunity striker.
Fellow slashdotter Bill Dog
Re:Oracle, also by bezgin · 2005-10-25 12:06 · Score: 2, Funny

Wow! This is a real conspiracy theory. All I could think of was Porn. :)

--
exit();
Re:Oracle, also by pboulang · 2005-10-25 12:14 · Score: 1

They certainly know now. And they probably already knew.

--
This comment is guaranteed*
*not guaranteed
Re:Oracle, also by jericho4.0 · 2005-10-25 16:53 · Score: 1

Please. Witness the 'Ask Slashdot' before this one.
Shit, I thought of posting and thanking him for all the links and research he had done....

--
"A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis
Re:Oracle, also by HAMgeek · 2005-10-26 01:03 · Score: 1

The real question is who was the first to know that you knew that they knew. Did you know first or did they? If they knew first what did they know? If you knew first, what did they know about what you knew and how did they find it out?

--
"Just because you do not take an interest in politics doesn't mean politics won't take an interest in you." --Pericles
Re:Oracle, also by Ben+Hutchings · 2005-10-26 04:03 · Score: 1

what on Earth kind of data do they anticipate will take a petabyte of contiguous storage????

The applications I can think of are (1) uncompressed or losslessly compressed digital video or film (2) large compressed video archive (think Internet Archive) (3) 3D medical images.

Petabox.... by HotNeedleOfInquiry · 2005-10-25 07:28 · Score: 1

Does not appear to be a single volume..

--
"Eve of Destruction", it's not just for old hippies anymore...

Re:Petabox.... by treerex · 2005-10-25 07:32 · Score: 1

Does not appear to be a single volume..

That depends entirely on what software you run on top of the hardware, doesn't it.
Re:Petabox.... by timeOday · 2005-10-25 07:55 · Score: 2, Funny

Then I suggest using *nothing*. It's free, and will work with the appropriate hardware and software add-ons.

Petabox by russ_allegro · 2005-10-25 07:28 · Score: 2, Insightful

archive.org made a petabox

http://www.archive.org/web/petabox.php

There is now a company that seems to make the same design:

http://www.capricorn-tech.com/products.html

I don't know what FS they use, but apparently it is redundant.

GPFS from IBM by LuckyStarr · 2005-10-25 07:29 · Score: 5, Interesting

May or may not be what you're searching for. Quite expensive, but an impressive feature list.

http://www-03.ibm.com/servers/eserver/clusters/software/gpfs.html

--
Meme of the day: I browse "Disable Sigs: Checked". So should you.

Re:GPFS from IBM by Zombie · 2005-10-25 09:20 · Score: 2, Interesting

My wife's building a 4 petabyte array (starting with 600 terabyte by the end of this year) for real-time multiple-access high-speed video streaming on GPFS. All GNU/Linux and commodity hardware. The switch fabric of the network is the hard bit. It's a bitch on fibre channel, but iSCSI should deliver higher performance at less than half the price. That's when you can get the hardware, and if you have the right Ethernet switch fabric again...
Re:GPFS from IBM by cartoon · 2005-10-25 09:20 · Score: 1

If I'm not mistaken, GPFS is free as in beer for Linux. Download it and give it a spin...

GPFS download page...

--
//Cartoon
Re:GPFS from IBM by Obasan · 2005-10-25 09:23 · Score: 2, Insightful

Having implemented GPFS I feel qualified to say it kicks butt. As the poster mentions, it's not cheap, but if you want reliability and support it may be well worth it. That's where you need to decide the level of risk you are willing to expose your data to. One limitation of GPFS is that it does (or did last I looked) only run on IBM hardware, either pSeries or xSeries with FAStT Fibre Channel at the back end.

From what I've heard, definitely give GFS a thorough shakedown before you decide to implement it, I've heard some horror stories.
Re:GPFS from IBM by cartoon · 2005-10-25 09:24 · Score: 1

Oh, and some pointers on installing it... install RSCT first (Reliable Scalable Cluster Technology). That is the cluster framework for AIX and Linux from IBM, and provides the node-to-node communications and some low-level management tools and APIs. Then put GPFS on top. While you're at it, check out CSM too. That only supports a limited distro set (SLES/RHEL), but automates node installation and management on a higher level... it uses the RSCT layer and is quite neat.

But get the manuals too and keep them close while installing :)

--
//Cartoon
Re:GPFS from IBM by icehawk55 · 2005-10-25 12:34 · Score: 2, Informative

I've implemented multiple gpfs file systems in the multi terabyte range. It's a pretty robust file system. With full redundancy at the disk/controller/brocade/server level per file system I can still write more than 3 gb/s and read better than 3.5 gb/s. This was a design for redundancy and not performance.

20+ terabytes of FAStT fibre-attached storage. After four "SURPRISE" power outages after Katrina, which caused the loss of 12 disks, I still did not lose a single byte of data for the customer. GPFS can be pretty robust if implemented correctly.

I'd have no qualms about putting together a petabyte of gpfs file systems.
Icehawk55
Re:GPFS from IBM by MilenCent · 2005-10-25 20:16 · Score: 1

My wife's building a 4 petabyte array (starting with 600 terabyte by the end of this year) for real-time multiple-access high-speed video streaming on GPFS.

Wow! WHERE did you FIND HER??
Re:GPFS from IBM by Zombie · 2005-10-25 21:58 · Score: 1

>> My wife's building a 4 petabyte array (starting with 600 terabyte by the end of this year) for real-time multiple-access high-speed video streaming on GPFS.
> Wow! WHERE did you FIND HER??
She found me. What can I say, she's a sucker for disk size.

Go Virtual by furry_wookie · 2005-10-25 07:29 · Score: 1

Check this out to do what you want.

This is one of the coolest companies out there and their product is better than anything EMC has for storage.

http://www.falconstor.com/

--
-- Given enough time and money, Microsoft will eventually invent UNIX.

Re:Go Virtual by krbvroc1 · 2005-10-25 07:39 · Score: 2, Interesting

He asked for low cost commodity hardware. The fact that no price is mentioned and you need to contact a sales droid for a quote is an instant red-flag. I hate vendors who do not put price lists, even 'retail' prices on their product pages. I realize they may have different price levels based on quantity, but there is a value to seeing that a product is in the '$1000-$1500' range versus the '$120000-$150000' range. Having to contact sales droids, who will put your name/phone number on a sales list and harass you, just to find out the price range turns me off of a lot of these outfits. I do a lot of product research and selection using the Internet. I favor outfits who allow me to get all the info online without contacting a sales rep. Many times if I cannot get the info on the web and I cannot get a price on the first phone call without providing sales lead information, I skip them.
Re:Go Virtual by gstoddart · 2005-10-25 08:06 · Score: 0, Troll

He asked for low cost commodity hardware. The fact that no price is mentioned and you need to contact a sales droid for a quote is an instant red-flag. I hate vendors who do not put price lists, even 'retail' prices on their product pages.

Well, in the event of scaling between a few TB and >= 1 PB, you'd be talking about a rather large price range.

Combine that with the fact that the prices are probably changing all the time with market factors, and the likelihood of someone selling a petabyte storage system with anything meaningfully called a retail price is absurd -- these are big, custom pieces not something you have the clerk run to the warehouse and see if there are any in stock.

This isn't like an Nvidia card where the manufacturer says "recommended retail" is x and you go to the price comparison sites.

--
Lost at C:>. Found at C.
Re:Go Virtual by krbvroc1 · 2005-10-25 10:29 · Score: 1

I called it retail price, but you can call it whatever you want. The sales people have a price list and it would be helpful to have that info available (in a non-binding/ballpark way) on their website when evaluating products. One of the benefits of the web is being able to do the research and evaluate products. Having to phone a sales person and wait 2 days for a reply, be forced onto a sales lead list, and finding out the product price was far outside your budget is very common with these products. It negates the benefit of the website!
As far as how custom these are, I am not so sure. Most likely just a combination of COTS products ganged together. The web is a dynamic medium, there is no reason they cannot update their prices (like other vendors do) regularly.
The original poster asked for a low cost commodity solution. Someone posted this 'falconstor' link. I went to their site looking for cost range and I see ' For more information about how FalconStor appliances are right for your business and for purchasing information, click here. A FalconStor representative will contact you.' How is that using the web in a helpful way to prospective consumers? And how come my accurate critic of using the web to evaluate products (which I do a lot) is considered a troll?
Re:Go Virtual by gstoddart · 2005-10-25 13:26 · Score: 1

I called it retail price, but you can call it whatever you want. The sales people have a price list and it would be helpful to have that info available (in a non-binding/ballpark way) on their website when evaluating products. One of the benefits of the web is being able to do the research and evaluate products. Having to phone a sales person and wait 2 days for a reply, be forced onto a sales lead list, and finding out the product price was far outside your budget is very common with these products. It negates the benefit of the website!

Well, in a lot of cases, they don't want to be fielding calls from everyone who gets it in their head they need a Freakin Huge Array(tm) but can't really afford it.

In some industries, price can be a little malleable depending on who you are and what you've already bought, so they don't broadcast even a guestimate price unless you're in the sales machine.
The web is a dynamic medium, there is no reason they cannot update their prices (like other vendors do) regularly.

That doesn't mean they're obliged to. If they simply don't wish to, or because of pricing structure they're unable to, they don't need to. You are equally free to take offense at that position. :-P
The original poster asked for a low cost commodity solution. Someone posted this 'falconstor' link. I went to their site looking for cost range and I see ' For more information about how FalconStor appliances are right for your business and for purchasing information, click here.

And I was questioning whether or not a low-cost solution to a petabyte storage array made from commodity hardware is even possible. This beast takes 60kW of power and is "Shipping container friendly-- Able to be run in a 20' by 8' by 8' shipping container" (it can't be that big can it?)

In much the same way, I didn't think it entirely unreasonable that people selling things this big don't post a price on their web-site. These puppies are probably really expensive ... moreso the closer you get to a petabyte I should think.

Things that cost a quarter of a million (WAG) don't have their prices listed on the web-site usually. Sun doesn't list the prices of their top-end servers on their web-site -- because they're really expensive and only sold in the channels.
How is that using the web in a helpful way to prospective consumers?

Well 'finding' prospective customers, and weeding out those that will choke when they hear the price are two different things. Maybe they figure if you're not serious enough to talk with a salesman, you're not serious enough to require their assistance.
And how come my accurate critic (sic) of using the web to evaluate products (which I do a lot) is considered a troll?

This is Slashdot, where my reasonable, if contrarian, post also gets a troll. Don't take it personally, there's a randomness factor to it. :-P

--
Lost at C:>. Found at C.
Re:Go Virtual by Kaenneth · 2005-10-25 13:52 · Score: 1

I think the statement, "If you have to ask, you can't afford it" applies.
Re:Go Virtual by krbvroc1 · 2005-10-25 13:56 · Score: 1

Well, in a lot of cases, they don't want to be fielding calls from everyone who gets it in their head they need a Freakin Huge Array(tm) but can't really afford it.
And that is exactly my point. When I am researching how to engineer something I don't start by thinking I need a Freaking Huge Array. I start by searching for what technologies are available, what their costs are, how I can use them, and whether the design makes sense. If they put price info (ballpark) I can avoid wasting their time and my time. If the price is too high or the product doesn't do what I need it's going to get crossed off my list for that particular application. The real issue here is that the sales guys want to be secretive for their pricing for two reasons. 1) They think that if only they could talk to me and convince me that I need their product no matter what, I'll buy it. 99.9% of the time they don't understand how I am using the product and add no value - they just want to make the sale and will say anything to get the sale. 2) They sometimes think if they can figure out my budget they can charge me more than someone else (deeper pockets).
Of course all this back and forth just cost me X number of days per vendor when I could have figured out that they were too expensive in the first place. A lot of times in engineering you can accomplish the same thing many different ways. Knowing the product price is outside a range of what I'm looking for just means I'll accomplish it another way or need to change a design parameter.
And I was questioning whether or not a low-cost solution to a petabyte storage array made from commodity hardware is even possible.
BTW, the product mentioned doesn't approach this. The largest turnkey item they offer is a 9TB (24 bay raid box).
Well 'finding' prospective customers, and weeding out those that will choke when they hear the price are two different things. Maybe they figure if you're not serious enough to talk with a salesman, you're not serious enough to require their assistance.
See I thought that in a market driven economy that my needs as a customer is important. Give me the information so that I can be the one to make judgement calls about my needs. You make it sound like I should be grateful that the company is willing to even speak with me.
Re:Go Virtual by gstoddart · 2005-10-25 15:17 · Score: 1

The real issue here is that the sales guys want to be secretive for their pricing for two reasons.

Just one really, they want your money. They're salespeople after all.
See I thought that in a market driven economy that my needs as a customer is important. Give me the information so that I can be the one to make judgement calls about my needs. You make it sound like I should be grateful that the company is willing to even speak with me.

I'm just pointing out that some companies feel exactly that. Or at least for certain products.

Sales people of high-end stuff will be happy to give you the information -- if they think you might actually buy. But they'll treat you like a chump if they think you're browsing beyond your means.

It is an unfortunate fact of life, but "the customer is always right" can be prefixed with a Disney-like disclaimer -- "you must be this tall to get on this ride". Not all companies are as accommodating when it comes to doing your initial price-shopping on the web.

Cheers

--
Lost at C:>. Found at C.
Re:Go Virtual by electroniceric · 2005-10-26 02:55 · Score: 1

The real issue here is that the sales guys want to be secretive for their pricing for two reasons. 1) They think that if only they could talk to me and convince me that I need their product no matter what, I'll buy it. 99.9% of the time they don't understand how I am using the product and add no value - they just want to make the sale and will say anything to get the sale. 2) They sometimes think if they can figure out my budget they can charge me more than someone else (deeper pockets).
For commodity products, I agree that that kind of pricing is annoying, but I also agree with the parent's sentiments. At half a million dollars, you're not talking a simple sign-the-credit-card purchase, you're negotiating a deal. And putting a reference price up there doesn't help the vendor in deal-making. The other problem is that putting your price on the internet exposes your pricing structure to your competitors and potential competitors, which is in the consumer's interest, but not in the company's interest (especially in the newer stages of product development and sales). Why should a business provide its competitors with a roadmap?

And yes, the sales guys want you on their lead lists - so they can show the CEO how many sales calls they've made and leads they generated. That's how they make their living. When you're talking 10's or 100's of thousands of dollars, that's legit. When you're talking $25, it's pretty aggravating.
Re:Go Virtual by krbvroc1 · 2005-10-26 03:35 · Score: 1

For commodity products, I agree that that kind of pricing is annoying, but I also agree with the parent's sentiments. At half a million dollars, you're not talking a simple sign-the-credit-card purchase, you're negotiating a deal. And putting a reference price up there doesn't help the vendor in deal-making. The other problem is that putting your price on the internet exposes your pricing structure to your competitors and potential competitors, which is in the consumer's interest, but not in the company's interest (especially in the newer stages of product development and sales). Why should a business provide its competitors with a roadmap?
A few comments on this. First, while a complete PB system is probably very expensive, the website I was talking about was selling 'raid boxes'. These are not that expensive; without a price it's tough to know, but perhaps $10-15k? Second, the biggest problem is that I am NOT NEGOTIATING A DEAL! I am an engineer and I'm looking to filter which products are potential candidates. I'm not a procurement person or a 'buyer'. I don't do money and am not usually authorized to negotiate anything on a company's (or my client's) behalf. Third, I don't buy the security by obscurity argument that posting a price list (or even ballpark figures) exposes your secret pricing structure to your competitors. Your competitors know your prices and will find out. Finally, as far as the sales guys wanting me on their lead lists - well, I don't appreciate phone calls from these guys. As I said, I'm the engineer, not someone who can negotiate contracts -- I'm not the guy to call. Secondly, if I call 10 vendors to get price info (or even tech details) and I rule out 8 of them, I still get calls from all 10. Multiply this by all the various projects (or even brainstorms) I have and it gets out of control. Some will stop calling when I ask, but many are pushy and persistent at wasting my time.

Why? by Anonymous Coward · 2005-10-25 07:29 · Score: 2, Insightful

What are you doing on a limited budget trying to build a 1PB solution? And why are you on a budget?

Just because you are starting at 25TB doesn't mean you aren't building a 1PB solution.

You also need to figure out what kind of bandwidth you need. It's very seldom that people have 1PB of data that is accessed by one person occasionally. If some sort of USB or 1394 connection will work you are much better off than requiring infiniband.

Like many "ask Slashdot" questions this is the last place you should be looking for help...

Re:Why? by temojen · 2005-10-25 07:50 · Score: 1, Insightful

Unless you are the mint, every budget is limited.

Network Block Device by drightler · 2005-10-25 07:30 · Score: 1

LVM/Software RAID over Linux NBD.... ok it might suck, but I think it would work.

--

blah blah blah....
drightler@technicalogic.com

Re:Network Block Device by Anonymous Coward · 2005-10-25 07:35 · Score: 0

FreeBSD has some nice tools to address this problem, but I'm not very familiar with them. I would use GEOM to export devices to the network. It would use software LVM/RAID too. And would be centralized in a server for that. That's a suggestion...

what gall by Anonymous Coward · 2005-10-25 07:31 · Score: 0, Troll

Considering that there are billion dollar companies whose only job it is to provide secure and redundant storage of the type that you describe, what makes you believe that someone on Slashdot would give you a solution for free?

The kind of thing you are talking about is non-trivial. If people have ideas concerning these matters you should pay them for them.

What a lot of gall!
Also, if you are being paid to do this by someone, then they obviously hired the wrong person to do the work.

Google Releases OSS? by TheoMurpse · 2005-10-25 07:31 · Score: 1

My research has not yielded any viable open source alternative (unless Google releases GoogleFS)

Since when has Google released any open source software?

Re:Google Releases OSS? by Evangelion · 2005-10-25 07:37 · Score: 1

Since google's massive infrastructure is built on Linux, chances are any kernel-space filesystem they release is going to have to be GPL compatible.
Re:Google Releases OSS? by ggvaidya · 2005-10-25 07:39 · Score: 2, Informative

A while ago
Re:Google Releases OSS? by LLuthor · 2005-10-25 07:39 · Score: 1

See http://code.google.com/

--
LL
Re:Google Releases OSS? by Bananatree3 · 2005-10-25 07:46 · Score: 0, Redundant

I have a feeling you haven't seen http://code.google.com yet. This site just so happens to release code, written by Google employees, available for free. 100% open-source free.
Re:Google Releases OSS? by Anonymous Coward · 2005-10-25 07:48 · Score: 0

Yeah, we definitely need more opensource-free software.
Re:Google Releases OSS? by adrianbaugh · 2005-10-25 07:59 · Score: 1

Only if they release it. They can use a proprietary fs in-house as much as they like. It's only if they were selling some kind of google-in-a-box which ran their filesystem: then they would need to provide the source, as they would be distributing it. Running a filesystem on their own machines does not count as distributing it, regardless of how many people are accessing the data on it; therefore they do not need to release the code as GPL (indeed, at all).

--
"'I pass the test,' she said. 'I will diminish, and go into the West, and remain Galadriel.'"
- JRR Tolkien.
Re:Google Releases OSS? by morcego · 2005-10-25 08:12 · Score: 1

As long as they are not releasing the software, they don't need to release the code.
GPL controls distribution, not usage. They can modify the software as much as they want under the GPL, and don't need to provide the code for anyone.

--
morcego
Re:Google Releases OSS? by i+wanted+another+nam · 2005-10-25 13:43 · Score: 1

Well I'll be darned.

http://www.google.com/enterprise/gsa/

--
The image is a dream, the beauty is real. Can you see the difference?
Re:Google Releases OSS? by isometrick · 2005-10-25 17:59 · Score: 1

That appliance doesn't use their cluster file system. It's one standalone machine that uses their crawling/indexing/searching system, not the 10,000s they have back at home.
Re:Google Releases OSS? by Evangelion · 2005-10-26 06:09 · Score: 1

Only if they release it. .... which was the assumption that was underlying this whole thread. It started with an offhand statement that ended "(unless Google releases GoogleFS)".

So yes, if they release GoogleFS, and if it requires kernel modifications, and if they don't sequester the binary module away behind a kernel-space loader like nVidia does with their graphics drivers, then yes they will need to make it GPL-compatible.

previous story: Distributed storage by Anonymous Coward · 2005-10-25 07:32 · Score: 0

You might want to do a google search for "Linux Distributed Storage" or otherwise look at this old post on Slashdot which covered your question.

Otherwise there are various solutions already available for free under Linux, but none will offer you a system that is easily implemented cheaply with fully redundant data storage.

http://ask.slashdot.org/article.pl?sid=05/05/04/1522247&tid=198&tid=230&tid=4&tid=106

Good luck,
-eks

Scale by LLuthor · 2005-10-25 07:33 · Score: 3, Interesting

If you know the scale of the problem, you should consult with a company like EMC to provide the support for this thing - you WILL need it.

Clustering the disks with iSCSI or ATAoE is trivial - you can do that very easily, but the filesystem to run on top of it is where you will have problems.

PVFS - has no redundancy - Lose one node lose them all.
GFS - does not scale well to those sizes or a large number of nodes - lots of hassle with the dlm.
GoogleFS - Essentially one write only - no small (50GB) files - little or no locking.
xFS - Way too easy to lose your data.

It seems that you only have one option:
Lustre - VERY Expensive - lots of hassle with meta-data servers and lock servers.

Go with a company to take care of all this hassle - you do not have the resources of Google to deal with this kind of thing yourself.

--
LL

Re:Scale by LLuthor · 2005-10-25 07:35 · Score: 1

I forgot to mention OCFS2 - It does not scale well to large numbers of nodes, but it does handle PB volumes better than Lustre 1.2 (I have never used 1.4).

--
LL
Re:Scale by c_woolley · 2005-10-25 07:48 · Score: 1, Funny

I concur. Everything I have researched matches what you have stated. It is not likely this will be a very easy task to perform on a budget (depending on what he is calling a "budget"). I would guess that GoogleFS is the only viable solution other than Lustre, depending on what he is attempting to use this storage for. If large file storage is what he desires, this may be the answer once it is released to the public.
Re:Scale by Wesley+Felter · 2005-10-25 08:19 · Score: 3, Insightful

Why do people keep talking about GoogleFS, given that it doesn't exist outside Google?
Re:Scale by Karcaw · 2005-10-25 08:47 · Score: 1

Lustre is not VERY expensive. It is still a GPL'ed product. Cluster File Systems sells support for the file system, and along with that you can get the very latest release. If you want free you are 6 months to a year behind. You can get the latest code and a few hours of design consultation for $5k. I run a filesystem with over 300TB right now on Lustre, and we do pay a little for support. But it can be done completely free.
Re:Scale by LLuthor · 2005-10-25 09:59 · Score: 1

I only said very expensive because I meant it relative to the other options. The cost of running your own SAN + support from Lustre is very much more expensive than getting the same thing from EMC or Symantec. Of course, the custom option is better suited to certain tasks, but since the OP was not specific as to what the purpose of this storage cluster was, I am inclined to suggest avoiding running their own. I agree that Lustre is not very expensive as such, but it's not the only expense when running your own cluster.

--
LL
Re:Scale by fatcatman · 2005-10-25 11:20 · Score: 1

I only said very expensive because I meant it relative to the other options. The cost of running your own SAN + support from Lustre is very much more expensive than getting the same thing from EMC or Symantec.

No, it isn't. Exactly how much experience do you have with Lustre? Do you run a 300TB Lustre array, like the OP? I'm guessing not.

Price out a 1PB disk array from EMC or Symantec and get back to me. I can do it for under $2M including several hundred thousand dollars worth of support from CFS. I guarantee you EMC can't touch that price.
Re:Scale by Anonymous Coward · 2005-10-25 11:22 · Score: 0

Don't forget recoverability as well. That much storage will be quite a task to backup/restore if things go awry. Even solutions that seem to scale well often don't take into account disaster recovery/business continuity issues.

As for the NeoPath solution mentioned, the company I work for is looking at it. The biggest problem I see with it is in the DR realm. Appears so far that it's easy to lose files since they didn't understand the need for consistency checking of their meta data database. When asked about it, they literally had no idea why I thought that was necessary!

We had a very interesting meeting with them and I'm now concerned about GPL issues as well. They claimed initially that their OS "started out Linux, but has been extended so much it's not really Linux anymore". When asked about GPL issues with that statement, they changed the tune to "the OS is still Linux, but our application is proprietary". Still no answer on the source code for the base OS though. And based on limitations on number of processors supported - it doesn't sound like a standard Linux kernel. Since I don't want to toss accusations around, I won't go further than saying I have questions on this point - I'll also remain AC due to the litigious nature of modern society.
Re:Scale by ErikZ · 2005-10-29 12:41 · Score: 1

Really? I would have thought that the Google servers that they sell would have it.

--
Democrats or Republicans. They are both taking us to the same place and they are not afraid of us anymore.

Here's a couple to look at by Anonymous Coward · 2005-10-25 07:33 · Score: 2, Informative

Compete File System at http://www.python.org/pycon/2005/papers/46/CompeteFileSystem.pdf.

MogileFS at http://www.danga.com/mogilefs/

Wow by DingerX · 2005-10-25 07:33 · Score: 5, Funny

I never thought I'd see the day when sites were boasting a petabyte of porn.
That's over 3 million hours of .avis -- if you sat down and watched them end-to-end, you'd have 348 years of "backdoor sliders", "dribblers to short", "pop flies", and "long balls". We live in an enlightened age.

Re:Wow by Anonymous Coward · 2005-10-25 07:41 · Score: 0

We live in an enlightened age.
Maybe you do. I'm pleased to be unenlightened about the meaning of "backdoor sliders", "dribblers to short", "pop flies", and "long balls".
Re:Wow by Anonymous Coward · 2005-10-25 07:42 · Score: 0
Don't forget some of my personal favorites:
- Anal Invaders 6
- MIWLF: Moms I Wouldn't Like to Fuck
- Face Shots with Wheelchair Bound Midgets
- Midget Defication: Shittin' on the Little Guy
Re:Wow by Surt · 2005-10-25 07:44 · Score: 1

You're not thinking far enough ahead. The porn industry is always on the leading edge of technology, so of course they're going to be storing high definition porn on those petabytes, so that brings you down to a few paltry years worth of porn. And of course you have to factor in fast forwarding through whatever parts don't interest you.

--
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
Re:Wow by spuke4000 · 2005-10-25 07:57 · Score: 5, Funny

I'm not really sure I need 348 years of porn. I usually find porn really interesting for the first 3 minutes or so, then for some reason it's not so interesting anymore. But maybe that's just me.

--
This post cannot be rebroadcast without the express written constent of Major League Baseball.
Re:Wow by rco3 · 2005-10-25 08:13 · Score: 3, Funny

Three minutes? You wish!

Come to think of it, so do I.

--

Ce n'est pas un vrai mouvement de robot!
Re:Wow by smithmc · 2005-10-25 08:26 · Score: 1

That's over 3 million hours of .avis -- if you sat down and watched them end-to-end
You'd end up like that guy in Brainstorm who played the sex scene on infinite loop.

--
Downmodding is the refuge of the weak. Don't downmod, make a better argument!
Re:Wow by shut_up_man · 2005-10-25 09:05 · Score: 1

It is for this reason alone that we need to invent a cybernetic implant that allows the human nervous system to be sped up by 348 times.
Re:Wow by Anonymous Coward · 2005-10-25 09:06 · Score: 0

I usually find porn really interesting for the first 3 minutes or so, then for some reason it's not so interesting anymore. But maybe that's just me.
Well, it's maybe not just you, but it's not me... I could watch porn all night long. Sadly, I have to get a little bit of sleep because I need this stupid job thing to pay for my high-speed internet connection... oh, and support the wife and kids, I guess... whatever...
Re:Wow by Anonymous Coward · 2005-10-25 09:43 · Score: 0

I'm sure Interpol would get involved if porn sites were boasting pedabytes...
Re:Wow by bchernicoff · 2005-10-25 09:56 · Score: 1

I discovered something interesting recently. You know your favorite video? Well, it has an ending. I know you've only ever seen the first 3 minutes. For a kick, skip ahead to 3 minutes from the end. Mind blowing.
Re:Wow by Anonymous Coward · 2005-10-25 09:58 · Score: 0

Damn, you guys can last 3 minutes?

My best is 1:39.4, but I keep trying to improve.

Practice makes perfect, right?
Re:Wow by natefanaro · 2005-10-25 10:06 · Score: 1

ME TOO! You do have ADD, right?
Re:Wow by Anonymous Coward · 2005-10-25 11:10 · Score: 0

It doesn't surprise me. An EMC engineer once told me that the company's original storage solutions came from Holland (or somewhere around there) and were....you guessed it.... used for storing porn
Re:Wow by SillySnake · 2005-10-25 12:32 · Score: 1

Sounds like you need to work on your arm/hand stamina.
Re:Wow by Anonymous Coward · 2005-10-25 12:37 · Score: 0

Actually the porn industry hates HD. They actively oppose it and lobby against it.

They claim it makes the imperfections of the "stars" harder to disguise, and freckles and pimples are much more visible.
Re:Wow by jred · 2005-10-25 13:17 · Score: 1

The trick is finding the *right* 3 minutes....

--

jred
I'm not a mechanic but I play one in my garage...
Re:Wow by buck_wild · 2005-10-25 13:24 · Score: 1

ADD? No, not that I Look! Shiny things!

--
If all you have is a hammer, everything looks like a nail.
Re:Wow by Anonymous Coward · 2005-10-25 16:17 · Score: 0

Not 348 years of porn, around seven and a half years. Which is still a lot.

1 Gigabyte for ~4 minutes of uncompressed avi video.
1 Terabyte for ~4000 minutes or 67 hours.
1 Petabyte for ~67000 hours or 2791 days or 7.6 years of porn.
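The arithmetic above is easy to check with a few lines of Python. This is only a rough sketch: it assumes decimal units (1 PB = 1,000,000 GB), takes the poster's figure of about 4 minutes of video per gigabyte as given, and the function name is made up for illustration.

```python
# Rough check of the estimate above. Assumptions: decimal units
# (1 PB = 1,000,000 GB) and the poster's ~4 minutes of video per GB.

def playback_time(capacity_gb, minutes_per_gb=4.0):
    """Return (hours, days, years) of video that fit in capacity_gb."""
    minutes = capacity_gb * minutes_per_gb
    hours = minutes / 60
    days = hours / 24
    years = days / 365
    return hours, days, years

hours, days, years = playback_time(1_000_000)  # 1 PB
print(f"{hours:,.0f} hours, {days:,.0f} days, {years:.1f} years")
```

Without the intermediate rounding to 67000 hours this comes out at about 66,667 hours, 2,778 days, or 7.6 years. Plugging in roughly 183 minutes per GB instead (about 5.5 MB per minute, i.e. a heavily compressed stream) gives about 348 years, so the two estimates in this thread differ only in the bitrate they assume.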
Re:Wow by coaxial · 2005-10-25 19:27 · Score: 1

I usually find porn really interesting for the first 3 minutes or so, then for some reason it's not so interesting anymore. But maybe that's just me.

You are not alone.
Re:Wow by MilenCent · 2005-10-25 20:19 · Score: 1

Gah....

You know, I'm really sure, if you have THAT many images of the same parts and activities over and over, that you can invent a much better compression algorithm for it than MPEG....

Data redundancy REQUIRED by cheesedog · 2005-10-25 07:34 · Score: 5, Informative

One thing to think about when building such a system from a large number of hard disks is that disks will fail, all the time. The argument is fairly convincing:

Suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means that the average disk is expected to have a failure about every 57 years. Sounds good, right? Now, suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!
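The fleet arithmetic can be written out as a short Python sketch. It assumes independent disks with a constant failure rate (exponential lifetimes), which is a simplification, and the function names are invented for illustration.

```python
# Sketch of the fleet arithmetic above. Assumes independent disks with a
# constant failure rate (exponential lifetimes), which is a simplification.
import math

HOURS_PER_YEAR = 24 * 365

def fleet_failure_interval(mtbf_hours, n_disks):
    """Expected hours between failures somewhere in a fleet of n_disks."""
    return mtbf_hours / n_disks

def prob_any_failure(mtbf_hours, n_disks, window_hours):
    """Probability of at least one failure in the fleet within the window."""
    return 1.0 - math.exp(-n_disks * window_hours / mtbf_hours)

single_years = 500_000 / HOURS_PER_YEAR            # about 57 years per disk
interval = fleet_failure_interval(500_000, 1000)   # 500 hours
week_risk = prob_any_failure(500_000, 1000, 24 * 7)

print(f"one disk: {single_years:.0f} years; "
      f"1000 disks: {interval:.0f} h ({interval / 24 / 7:.1f} weeks)")
print(f"P(at least one failure in any given week) = {week_risk:.2f}")
```

Under the same assumptions, the chance of at least one failure in any given week works out to a bit under 30%, which is another way of saying "about every 3 weeks".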

Now, of course the failures won't be spread out evenly, which makes this even trickier. A lot of your disks will be dead on arrival, or fail within the first few hundred hours. A lot will go for a long time without failure. The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.

You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. This allows the system to keep working under X number of failures. Redundancy/coding allows you to plan on scheduled maintenance, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.
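To get a feel for what "keep working under X failures" buys you, here is a minimal Python sketch. The group size (10 disks) and the 72-hour gap until the next scheduled swap are hypothetical numbers picked for illustration, and the model assumes independent failures, which the rest of this post argues is optimistic.

```python
# Minimal sketch: chance that a redundancy group loses data before the next
# maintenance visit. Assumes independent disks with exponential lifetimes;
# the group size and 72-hour window below are hypothetical.
import math

def prob_data_loss(n_disks, tolerated, mtbf_hours, window_hours):
    """P(more than `tolerated` of n_disks fail within window_hours)."""
    p = 1.0 - math.exp(-window_hours / mtbf_hours)  # per-disk failure chance
    # Sum the binomial tail directly to avoid losing precision near 1.0.
    return sum(
        math.comb(n_disks, k) * p**k * (1.0 - p) ** (n_disks - k)
        for k in range(tolerated + 1, n_disks + 1)
    )

for tolerated in (0, 1, 2):
    loss = prob_data_loss(10, tolerated, 500_000, 72)
    print(f"tolerating {tolerated} failure(s): P(loss) = {loss:.2e}")
```

Each extra tolerated failure cuts the loss probability by several orders of magnitude in this model. Correlated failures (power events, heat, a bad batch of drives) will make the real numbers worse.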

Re:Data redundancy REQUIRED by OrangeSpyderMan · 2005-10-25 07:42 · Score: 4, Insightful

Agreed. We have around 50 TByte of data in one of our datacenters and it's great, but the number of disks that fail when you have to restart the systems (SAN fabric firmware install) is just scary. Even on the system disks of the Wintel servers (around 400) which are DAS, around 10% fail on Datacenter powerdowns. That's where you pray that statistics are kind and you have no more failures on any one box than you have hot spares+tolerance :-) Last time one server didn't make it back up because of this.... though it was actually strictly speaking the PSUs that let go, it would appear.

--
Try NetBSD... safe,straightforward,useful.
Re:Data redundancy REQUIRED by CuteVlogger · 2005-10-25 08:03 · Score: 1

We have a couple HP MSA20s (12x250GB SATA, went on recommendation from a friend at Yahoo), and they're really good, except they'll burn a disk almost every time you have to restart the unit. It's kind of annoying, and the firmware updates don't seem to be helping much.

--
My Video Blog!
Re:Data redundancy REQUIRED by Alef · 2005-10-25 08:09 · Score: 3, Informative

> If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!
For the sake of your argument I suppose that assumption could be considered fair. If one were to do a somewhat more sophisticated analysis, a better model for hard drive failures is the Bathtub curve. It represents the result of a combination of three types of failures: infant mortality (flaws in the manufacturing), random failures and wear-out failures.
> The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.
I think what you are referring to is how multiple observations of a uniformly distributed stochastic variable generally look. It doesn't have anything to do with fractals, though.
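
The quoted "every 500 hours" figure can be reproduced with a quick sketch. It assumes a per-drive MTBF of 500,000 hours (a value implied by the quoted numbers, not stated here) and independent drives with a constant failure rate, which is exactly the simplification the bathtub curve argues against.

```python
# Assumption: 500,000-hour MTBF per drive, implied by "1000 disks -> 500 hours".
MTBF_HOURS = 500_000
DISKS = 1_000

def hours_between_failures(mtbf_hours: float, disks: int) -> float:
    """Mean time between failures across the whole array, assuming
    independent drives that each fail at a constant rate."""
    return mtbf_hours / disks

interval = hours_between_failures(MTBF_HOURS, DISKS)
weeks = interval / 24 / 7
print(f"{interval:.0f} hours between failures, about {weeks:.1f} weeks")
```

Under those assumptions the array-wide interval scales inversely with the number of drives, so doubling the drive count halves the time between replacements.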
Re:Data redundancy REQUIRED by Sparohok · 2005-10-25 08:13 · Score: 1

> The failure rates, in fact, will likely be fractal -- you'll have long periods without failures, or with few failures, and then a bunch of failures will occur in a short period of time, seemingly all at once.

Not to nitpick or anything, but drive failure rates aren't fractal.

Martin
Re:Data redundancy REQUIRED by Feyr · 2005-10-25 08:35 · Score: 1

you, sir, are scaring me.

10% failure on powerdown? let's just hope you don't have to do that too often, and that the ups and generators are REALLY good!
Re:Data redundancy REQUIRED by Retric · 2005-10-25 08:38 · Score: 1

When you look at failures there is something of a "fractal" nature to it. Failures tend to cluster around specific events like power up / down, excess heat, excess usage, etc. So after a month of burn-in you see things tend to cluster around events like "summer", high usage / heat in the afternoon and evening, and backups. So over the average day there are specific high-failure times, and over the year there are also specific high-failure times, with some random events thrown in like major power outages or expanded capacity > next burn-in.
Re:Data redundancy REQUIRED by OrangeSpyderMan · 2005-10-25 08:39 · Score: 1

> 10% failure on powerdown? let's just hope you don't have to do that too often, and that the ups and generators are REALLY good!

To be honest - yes they are. Redundant electricity providers (separate cables to separate substations) and UPS / generators mean this really doesn't happen unless we want it to. Unfortunately, we do this from time to time essentially for upgrades of the SAN itself. :-/

--
Try NetBSD... safe,straightforward,useful.
Re:Data redundancy REQUIRED by Anonymous Coward · 2005-10-25 08:50 · Score: 0

> A lot of your disks will be dead on arrival, or fail within the first few hundred hours.
If that's really an issue, stop buying Fujitsu.
Hey, it even rhymed!
Re:Data redundancy REQUIRED by photon317 · 2005-10-25 09:34 · Score: 1

But what does that have to do with a fractal?

--
11*43+456^2
Re:Data redundancy REQUIRED by cheesedog · 2005-10-25 09:44 · Score: 1

> But what does that have to do with a fractal?
"self-similarity" and "fractal" are often used interchangeably in the literature. Do a literature search on either term -- it applies generally to a lot of things: NFS traffic, web site traffic, cars passing under an overpass, number of people in a bathroom stall at any moment, number of disks that fail in a massive storage array, etc...
Re:Data redundancy REQUIRED by cheesedog · 2005-10-25 09:56 · Score: 2, Interesting

Not to nitpick back at you or anything, but have you ever sat in front of a system with 100s of cheap-off-the-shelf drives and recorded the failure times? I'll be a monkey's uncle if they aren't self-similar.
Re:Data redundancy REQUIRED by photon317 · 2005-10-25 10:03 · Score: 1

I can see how one can equate certain forms of self-similarity with fractals, but I don't think net/car traffic or disk failure rates qualify just because they tend to clump up in spots. Now if, for instance, there was an "event" every 5 days that caused a ~5% higher rate of failure than the average, and every 25 days (or every 5th of those events), it was 25% higher than normal (and every 125 days it was 125% higher than normal, and the pattern aggregated indefinitely like that), then I could see calling that pattern "fractal" in nature. But clumping around events like hot days doesn't sound like a fractal pattern to me.

Then again, I've never read anything about fractal patterns in failure analysis, so what do I know. Just sounds like fishy use of a cool-sounding term to me.

--
11*43+456^2
Re:Data redundancy REQUIRED by cheesedog · 2005-10-25 10:07 · Score: 1

> If one were to do a somewhat more sophisticated analysis, a better model for hard drive failures is the Bathtub curve.
If one were to do a somewhat more sophisticated build-the-danged-thing-and-watch-the-failures, one would be hard pressed to find a bathtub curve anywhere at all in the data. Your analysis appears sound, but analysis is often misleading.
> I think what you are referring to is how multiple observations of a uniformly distributed stochastic variable generally look. It doesn't have anything to do with fractals, though.
You statistical guys always want to avoid self-similarity! :) I really do think, though, that you'd find a self-similar model to be a more accurate representation of the combined failure rates of such a system than a multiple-observations-of-a-uniformly-distributed-stochastic-variable model.
Re:Data redundancy REQUIRED by GoodOmens · 2005-10-25 11:24 · Score: 1

Wow these are some interesting reads. I learned a lot about hard drives today lol. Who would have thought .... only on slashdot!
Re:Data redundancy REQUIRED by toddestan · 2005-10-26 04:15 · Score: 1

It does seem a bit high, but when a hard drive's motor is failing, it can still keep spinning for weeks, if not months or longer. But turn it off and it will never be able to spin up again. So I could imagine a system that has been going continuously for years having a huge number of failures once the power is cycled.
Re:Data redundancy REQUIRED by Sparohok · 2005-10-26 13:01 · Score: 1

Then I guess you'll be a monkey's uncle.
The phenomenon you've discovered is the brain's ability to find patterns in randomness.
The definitional characteristic of a fractal is that it is self similar on all scales. A little clustering does not a fractal make.
Martin
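
The "patterns in randomness" point can be illustrated with a small simulation. This is only a sketch under stated assumptions: 50 failures scattered uniformly at random over 25,000 hours (an average of one per 500 hours), a fixed seed, and a hypothetical helper named failure_gaps. It does not settle whether real drive failures are self-similar; it only shows that purely random timing already produces quiet stretches and clumps.

```python
import random

def failure_gaps(n_failures: int, period_hours: float, seed: int = 0) -> list[float]:
    """Gaps (in hours) between consecutive failures when failure times are
    drawn independently and uniformly over the period."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    times = sorted(rng.uniform(0, period_hours) for _ in range(n_failures))
    return [later - earlier for earlier, later in zip(times, times[1:])]

gaps = failure_gaps(n_failures=50, period_hours=25_000)
mean_gap = sum(gaps) / len(gaps)
print(f"mean gap {mean_gap:.0f} h, shortest {min(gaps):.0f} h, longest {max(gaps):.0f} h")
```

Comparing the shortest and longest gap against the mean gives a feel for how uneven a memoryless process looks to someone watching the drive trays.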
Re:Data redundancy REQUIRED by Anonymous Coward · 2005-10-27 05:54 · Score: 0

Oh, don't remind me of my time doing a data center MOVE from a past employer's privately owned DC to SC4. (Those of you on the west coast should know SC3 & SC4.) We had servers that hadn't been powered down in more than 4 years being shutdown for the first time. Too bad EVERYONE (other than myself) thought all the servers would come back up. I think we lost 20 servers out of roughly 400 servers and probably as many HDs between the servers, RAID and NetApp units. All in all, drive loss is less of an issue than power and cooling failures. It's just another thing which needs to be considered in a production environment. But then again, we did pay the price for enterprise class storage in all but the web serving farm which pulled its content from a NetApp anyway.
Re:Data redundancy REQUIRED by Retric · 2005-10-28 15:10 · Score: 1

Knowing something has self-similarity lets you design a system to deal with it. EX:

The self-similarity issue with line noise is part of how low-level TCP/IP works. Basically, most errors tend to clump together, so while there might be 5000 errors in 1,000,000 bits of data, most of those will be in a small number of packets. So sending only a small checksum, which takes up less than 1% of the packet, lets you drop those packets and get more total packets through than spending a lot of space sending complex error correction codes. Basically, if errors are truly random then trying to correct them has value, but if they clump together then complex error correcting codes are a waste of time. You might correct 1/3 of the errors by using an error correcting system that takes up 20% of your bandwidth, but you lose out on sending 20% of your packets by doing that, which is worse than sending a larger number of packets which have a higher chance of being dropped, as long as that chance is less than 20%.

On the other hand, other systems like collections of HDDs work best when you can recover from random independent total failures vs. worrying about the extremely rare errors in the data itself. You might get 1TB of data and have one bit wrong, but at some point you're not going to be able to get any data at all, so it's an escalating probability of drastic failure vs. a network connection, which is self-similar and prone to lots of intermittent failures but where you can expect to get the data through at some point.
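
The line-noise argument above can be put into rough numbers. The sketch below uses the post's 5000 errors per 1,000,000 bits; the 1000-bit packet size and the idea that bursts confine every error to 50 packets are assumptions made only for illustration.

```python
TOTAL_BITS = 1_000_000
BIT_ERRORS = 5_000
PACKET_BITS = 1_000           # assumed packet size
BAD_PACKETS_IF_CLUMPED = 50   # assumed: bursts confine all errors to 50 packets

packets = TOTAL_BITS // PACKET_BITS

# Independent errors: a packet survives only if every one of its bits is clean.
bit_error_rate = BIT_ERRORS / TOTAL_BITS
clean_if_random = (1 - bit_error_rate) ** PACKET_BITS

# Clumped errors: only the packets hit by a burst are lost.
clean_if_clumped = 1 - BAD_PACKETS_IF_CLUMPED / packets

print(f"clean packets, independent errors: {clean_if_random:.1%}")
print(f"clean packets, clumped errors:     {clean_if_clumped:.1%}")
```

With independent errors almost no packet arrives clean, so checksum-and-drop would be hopeless and correction earns its overhead. With the same error count clumped into bursts, the vast majority of packets arrive clean and a cheap checksum is enough.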

Oooo... by temojen · 2005-10-25 07:34 · Score: 1

I was going to suggest Reiser4 on LVM over a bunch of 4-disk RAID-5 arrays, but it seems that his definition of massive is more massive than mine.

NFS on Reiser4 on RAID-5 on AoE (multipath) on LVM on RAID-5?

What kind of availability do you need? Does all data need to be up all the time (like a bank/telco), or most of the data need to be up all the time (like google), or all the data need to be up most of the time (like a movie studio)?

I just have to ask... by jcdick1 · 2005-10-25 07:34 · Score: 5, Informative

...what your management was thinking. I mean, I can't imagine a storage requirement that large that you can build in a distributed model that would beat on price per GB an EMC or Hitachi or IBM or whomever SAN solution. The administration and DR costs alone for something like this would be astronomical. There just isn't really a way to do something this big on the cheap. I mean, this is what SANs were developed for in the first place. It's cheaper per GB than distributed local storage ever could be.

--
What?

Re:I just have to ask... by temojen · 2005-10-25 07:42 · Score: 3, Funny

With a project this large, they may be able to do it in-house and still take advantage of economies of scale. They can buy HDDs, motherboards, rackmount cases, etc. by the pallet or container load and temporarily up-hire some of their part-timers to do the assembly.

With a network bootable bios, the nodes could just be plugged in and install an image off a server, then customize it based on their MAC.
Re:I just have to ask... by lysander · 2005-10-25 07:50 · Score: 1

I entirely agree with the parent post. If there were an easy, cheap way to do this with the required redundancy and speed you need, the big SAN companies would not be around.
If there is more data than disks you can shove in a computer, data that your company considers important: buy a SAN. If you have speed requirements, you'll need caching: buy a SAN. If you haven't worked with anything this big before, are you willing to risk your company's data while you learn the ropes?
If you're still intent on doing this, at least look at how the SAN companies pull it off.

--
GET YOUR WEAPONS READY! --DR.LIGHT
Re:I just have to ask... by Gverig · 2005-10-25 08:01 · Score: 1

I will not argue for/against SANs (never worked with them, not even sure what it stands for) but I don't really like/accept the logic of "if there are people that charge a lot of money for it, it is impossible to do it yourself". Did not work for Linux or MySQL...
Re:I just have to ask... by supabeast! · 2005-10-25 08:03 · Score: 1

"...what your management was thinking."

What makes you think that the manager behind this nutty idea is thinking? I'm guessing that any manager cranking out this kind of cracked idea is both clueless and stupid.
Re:I just have to ask... by sconeu · 2005-10-25 08:09 · Score: 1

Storage Area Network. Usually built on top of FibreChannel.

--
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
Re:I just have to ask... by kpharmer · 2005-10-25 08:15 · Score: 1

> I mean, I can't imagine a storage requirement that large that you can build in a distributed model that
> would beat on price per GB an EMC or Hitachi or IBM or whomever SAN solution. The administration and DR
> costs alone for something like this would be astronomical.

This is exactly what I was thinking.

Look at it this way - let's say you spend $100k on a small but extensible storage array. Probably 90% of this cash would be paying for hard drives - a cost you'll have to spend anyway. Not sure of current prices, but $100k is something like 15 TB of fast disk in a raid5 configuration. Side benefits:
1. this is a capital investment - that can be written off by your company. Assume that this is the same as a 30% discount, which ends up costing you $70k.
2. the cost will probably be paid in a lease stream across five years. Now, you're down to about $15k / year.
3. the cost includes support and replacement of defective equipment
4. it also includes a variety of interfaces, the ability to share disks between multiple hosts, dual-controllers to each array, etc, etc.
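
The arithmetic in points 1 and 2 can be checked with a short sketch. It takes the post's own assumptions at face value (a write-off worth a 30% discount, five equal lease payments) and ignores lease interest, which is a simplification added here.

```python
LIST_PRICE = 100_000      # dollars, from the post
WRITE_OFF_PERCENT = 30    # the post's assumed effective discount
LEASE_YEARS = 5

# Integer percent arithmetic keeps the dollar figures exact.
effective_cost = LIST_PRICE * (100 - WRITE_OFF_PERCENT) // 100
per_year = effective_cost / LEASE_YEARS

print(f"effective cost ${effective_cost:,}, about ${per_year:,.0f} per year")
```

That works out to $14,000 a year before any financing charge, which the post rounds up to "about $15k".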

Ok, now compare this against assembling this yourself. Dealing with various vendors, working out firmware issues, working out performance-tuning strategies, etc, etc, etc. It could easily take you months to provide the same amount of well-tested, out-of-the-box functionality that this product has.

And you still have to buy the disk.

I'm not a storage expert, but I'd take a very close look at a commercial solution. It's probably cheaper in the long-term.
Re:I just have to ask... by AnotherBlackHat · 2005-10-25 09:11 · Score: 1

> ...what your management was thinking. I mean, I can't imagine a storage requirement that large that you can build in a distributed model that would beat on price per GB an EMC or Hitachi or IBM or whomever SAN solution. The administration and DR costs alone for something like this would be astronomical. There just isn't really a way to do something this big on the cheap. I mean, this is what SANs were developed for in the first place. Its cheaper per GB than distributed local storage ever could be.

Checking prices at outpost.com, a 300 gig SATA drive is $150, a pc that can hold 6 of those drives is $250, and 3, 2 port SATA controller cards is $150 ($50 each).
The parts cost is therefore $1,300 per 1.8 terabyte node of networked storage.
You could easily get 15 systems assembled and tested for under $25,000, which would be over 25 terabytes, with some redundancy.
A comparable EMC system would be about $100,000.

Management is probably thinking that $75,000 is more than enough to make those parts work, and even if it isn't, the potential $3,000,000 in savings for the 1 PB version is.
Now they could be wrong, but you're going to need a hell of a lot better argument than "it's really hard to do" or "those EMC guys know what they're doing and we don't" to overcome $3,000,000 in savings.
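
The figures in this post can be reproduced with a few lines. Prices and counts are taken from the post; the 40x scale factor from 25 TB to 1 PB assumes decimal units and a linear price, which is an assumption of this sketch.

```python
DRIVE_PRICE, DRIVE_GB, DRIVES_PER_NODE = 150, 300, 6
PC_PRICE = 250
CONTROLLER_PRICE, CONTROLLERS_PER_NODE = 50, 3
NODES = 15
EMC_PRICE_PER_25TB = 100_000   # the post's estimate for a comparable EMC system
DIY_BUDGET_PER_25TB = 25_000   # the post's ceiling for assembled, tested nodes

node_cost = (DRIVE_PRICE * DRIVES_PER_NODE + PC_PRICE
             + CONTROLLER_PRICE * CONTROLLERS_PER_NODE)
node_gb = DRIVE_GB * DRIVES_PER_NODE
cluster_parts_cost = node_cost * NODES
cluster_gb = node_gb * NODES

savings_per_25tb = EMC_PRICE_PER_25TB - DIY_BUDGET_PER_25TB
savings_at_1pb = savings_per_25tb * (1_000 // 25)  # 1 PB = 40 x 25 TB (decimal)

print(f"node: ${node_cost:,} for {node_gb / 1000:.1f} TB")
print(f"cluster parts: ${cluster_parts_cost:,} for {cluster_gb / 1000:.0f} TB raw")
print(f"savings at 1 PB: ${savings_at_1pb:,}")
```

The parts alone come to $19,500 for 27 TB raw, which leaves about $5,500 of the $25,000 ceiling for assembly, testing and whatever capacity is given up to redundancy.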

-- Should you believe authority without question?
Re:I just have to ask... by Anonymous Coward · 2005-10-25 09:34 · Score: 1, Informative

> ...what your management was thinking. I mean, I can't imagine a storage requirement that large that you can build in a distributed model that would beat on price per GB an EMC or Hitachi or IBM or whomever SAN solution. The administration and DR costs alone for something like this would be astronomical. There just isn't really a way to do something this big on the cheap. I mean, this is what SANs were developed for in the first place. Its cheaper per GB than distributed local storage ever could be.

Yes, this is what SANs were developed for. Let me clue you in on our SAN, and what it cost us in the long run:

We bought a Xiotech Magnitude 1.4T SAN, with everything redundant, except the backplane (not an option in that line at the time). Well, what part dies? The backplane. Twice!
So, after converting all of our critical servers over to this SAN, it takes down our entire datacenter. We paid tons of money for this POS too.

Now, we're trying to sell it. We don't need it, have since moved to Infortrend for our storage requirements (way better product), and would like to get rid of it.

We tried eBay. Every time a customer inquired about re-licensing the crappy software on it from Xiotech, whadda ya know? It costs more to relicense the software than just buying a whole new unit from Xiotech. We call Xiotech and complain to high hell about this.

We try to donate it to a local university... SAME DEAL! Xiotech tries to relicense the software, EVEN AFTER KNOWING IT'S A CHARITABLE DONATION, for more than a new unit!

So, in our closet sits a @@#@!(%$ pile of @($*!@ called a Xiotech. I will speak badly about this company until the day I die. I will butt into other people's conversations that are unrelated to massive amounts of storage, and tell them to curse the name Xiotech and stay bloody clear of that company. I will sacrifice small animals in Xiotech's name, so hopefully something horribly evil will happen to that company, and the damn CIO that tells us to deal with it, and hangs up on us.

My ire towards Xiotech is strong and will never die. I wish a worse fate for Xiotech than I do SCO.
Re:I just have to ask... by np_bernstein · 2005-10-25 09:45 · Score: 1

Yeah, I'd agree with the above poster. You're going to have a high rate of disk failure. Either you're going to have to worry about maintaining those disks yourself (setting up some kind of hardware monitoring for each disk, figuring out when one has failed, and replacing it within as short a window as you possibly can), or you can go with a commercial vendor and get a support contract, then come in in the morning and see three emails that say "disk failed", followed by tracking messages of the tech being sent out and the drive being replaced within 4 hours. You can go cheaper at the outset, sure, but unless you're going to have a dedicated staff for this, your cost savings come from lowering your complexity and not needing to dedicate resources to maintaining the thing.

--
RandomAndInteresting.com defending the world from stupidity since 1979
Re:I just have to ask... by Kent+Recal · 2005-10-25 10:47 · Score: 1

> The parts costs is therefore $1,300 per 1.8 terabyte node of networked storage.

Yeah, right. You *may* want to triple this figure for adequate hardware, though.

> Now they could be wrong, but you're going to need a hell of a lot better argument than "it's really hard to do" or "those EMC guys know what they're doing and we don't" to overcome $3,000,000 in savings.

Well, "it's really hard to do" pretty much sums it up.
If "create a large storage device from scratch" is not an essential part of your business-plan you should probably back the hell off and buy something off-the-shelf. You obviously underestimate the effort it takes to create a system like this, much less the effort to make it perform reasonably and maintain it. SATA, my ass...

If your business really plans with and depends on 1PB of storage but at the same time a 3mio investment makes you pee then there's something seriously wrong with your business plan.

And, last but not least, if you have to "ask slashdot" about it then you're most certainly not in a position to pull it off either way.
Re:I just have to ask... by Ayende+Rahien · 2005-10-25 12:58 · Score: 1

You forgot one thing that you get that you wouldn't get if you do it in house.
If there is a problem, you call IBM and they _fix_ it. Now try to do the same with something you brew up by yourself.

--
Two witches watched two watches.
Which witch watched which watch?
Re:I just have to ask... by TClevenger · 2005-10-25 15:01 · Score: 1

Well, and what happens if you start having multiple disk failures within the warranty period? You're going to have to hire a full-time person just to fill out the RMA paperwork, send the old drives in and receive the replacements. Of course, most manufacturers will then replace RMA'd drives with "remanufactured" drives, so there's another unknown.
Re:I just have to ask... by Wonko · 2005-10-25 18:56 · Score: 1

> Yeah, right. You *may* want to triple this figure for adequate hardware, though.

He may also want to do it "right" as well :p. 500 GB SATA drives seem to be around 400 bucks a piece. I have seen 4u cases with 16 hot swap SATA bays for about 1200 bucks. A pair of 8 port 3ware SATA controllers would be about 700 bucks, and figure 1000 bucks for motherboard/cpu/nic/etc. 16 drives will set you back 6400 bucks. Assume you go RAID5 with a hot spare on each controller, that works out to 6 TB for well under $10,000.

I don't think any SAN can compete with that price. I also don't think this machine can compete with a SAN in any scenario where you should actually be using a SAN :). However, 10 of these machines could be squeezed into a single rack for a total of 60 TB for $100,000. I have no idea what someone would use a rack of these machines for... Online backups or something, maybe?
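
The per-machine estimate above can be tallied in a short sketch. Prices are the post's; the capacity calculation assumes each 8-port controller runs one RAID 5 set of seven drives plus one hot spare, so one drive's worth of each set goes to parity.

```python
DRIVE_PRICE, DRIVE_GB, DRIVES = 400, 500, 16
CASE_PRICE = 1_200           # 4U case with 16 hot-swap SATA bays
CONTROLLER_PAIR_PRICE = 700  # two 8-port SATA RAID controllers
BASE_SYSTEM_PRICE = 1_000    # motherboard, CPU, NIC, etc.
CONTROLLERS = 2
MACHINES_PER_RACK = 10

machine_cost = (DRIVE_PRICE * DRIVES + CASE_PRICE
                + CONTROLLER_PAIR_PRICE + BASE_SYSTEM_PRICE)

drives_per_controller = DRIVES // CONTROLLERS
# One hot spare and one drive's worth of RAID 5 parity per controller.
data_drives_per_controller = drives_per_controller - 1 - 1
usable_gb = CONTROLLERS * data_drives_per_controller * DRIVE_GB

rack_parts_cost = machine_cost * MACHINES_PER_RACK
rack_usable_tb = usable_gb * MACHINES_PER_RACK // 1000

print(f"machine: ${machine_cost:,} for {usable_gb / 1000:.0f} TB usable")
print(f"rack: ${rack_parts_cost:,} for {rack_usable_tb} TB usable")
```

The parts total is $9,300 per machine, so the $100,000 per rack quoted above includes a few thousand dollars of headroom.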

> You obviously underestimate the effort it takes to create a system like this, much less the effort to make it perform reasonably and maintain it.

The problem only becomes difficult when you want this to appear as a single volume. If you need a reasonably priced 25 TB in volumes of less than about 3 TB each it becomes quite easy.

> If your business really plans with and depends on 1PB of storage but at the same time a 3mio investment makes you pee then there's something seriously wrong with your business plan.

Maybe their business plan involves selling hardware to store up to 1 PB on a single volume? :p

> And, last but not least, if you have to "ask slashdot" about it then you're most certainly not in a position to pull it off either way.

I absolutely agree with you. I don't think this fellow has any idea what he is looking at getting himself into... He wants to make it to 25 TB with 1-2 GB per node and scale up to 40x as much from there, and he thinks it will survive with no redundancy. :)
Re:I just have to ask... by brain007 · 2005-10-25 20:56 · Score: 1

Perhaps you could rent a room at the university in its IT department for 99 years for $1 and allow them to rent 1.4TB in storage for the same amount of time for $1 as well. You still own it, but they get to use it for way more time than it will even matter (in 2104 I hope my grandkids have a cell phone or whatever with 1.4TB in it).
Re:I just have to ask... by richie2000 · 2005-10-25 23:36 · Score: 2, Funny

Yeah, but apart from that, did it work out for you? Don't hold back, I can take the truth.

--
Money for nothing, pix for free
Re:I just have to ask... by stanmann · 2005-10-26 02:19 · Score: 1

If you want any sort of operational reliability you'll want RAID 51/15 or something similar, so double your minimum estimate. Don't forget power solutions, rack costs and power utilization in your estimate either.

--
Food not Bombs is a nice platitude but it breaks down when you notice that the Bombees are usually well fed
Re:I just have to ask... by Wonko · 2005-10-26 05:59 · Score: 1

> If you want any sort of operational reliability you'll want Raid 51/15 or something similar so Double your minimum estimate

I suppose it depends on your level of paranoia, and how much you want to spend. If I wanted more redundancy than RAID 5 I would probably go with RAID 10 and eliminate the write penalty you get from having to compute the parity. That would only require 50% more hardware, not double. Read performance would suffer a bit, but writes would be drastically better. That might not matter if the network turns out to be the bottleneck.

> Don't forget power solutions, rack costs and power utilization in your estimate either.

Racks are cheap enough they would easily be covered in my quick cost estimate (I inflated my math by nearly $2000 per 4u server). A SAN from EMC would also use quite a bit of power, since the biggest drain in either case will be the drives. The SAN will have the advantage that it doesn't need a server for every 16 drives. Since you won't need top of the line machines for this, I would guess that the extra power requirements would be quite a bit less than 200 watts per server, or 2000 watts per 40 - 60 TB of usable space (80 TB raw). How much more than 100k would I have to pay for a low end 80 TB SAN?

My cheap solution would be nearly as dense as the old EMC Clariion I used to work with. The 4U chassis I am thinking of was almost nothing but hard drives on the front side.

I am not saying a solution like this fits all, or even most, situations. I can recall a number of situations in the past where something like this would have been a very good fit.
Re:I just have to ask... by stanmann · 2005-10-26 06:08 · Score: 1

For power I was referring to a rack UPS or the like to sustain power, not provide it so much.

--
Food not Bombs is a nice platitude but it breaks down when you notice that the Bombees are usually well fed
Re:I just have to ask... by Wonko · 2005-10-27 00:08 · Score: 1

> For power I was referring to a rack UPS or the like to sustain power, not provide it so much.

You are going to have pretty similar UPS requirements for a home grown solution or a big commercial SAN. It won't have much impact on the overall price.

Check out Isilon by elan · 2005-10-25 07:35 · Score: 1

We liked what we saw when we were looking for a similar thing. It's not cheap, but it's much cheaper than comparable stuff, and it runs well. We had an eval cluster and they worked like a champ.

Re:Check out Isilon by Anonymous Coward · 2005-10-25 07:50 · Score: 0

I would second this. Their clustered file system is very robust, fault tolerant and scales well, and is much more cost effective than some 1st tier solution providers' systems.

If you think that you can do this with off the shelf commodity hardware and maintain your sanity, more power to you!! This is a non-trivial task. Just try building a 24x7 available 100 TB system out of Fry's disks without having a full-time support person managing it..
Re:Check out Isilon by Anonymous Coward · 2005-10-25 08:19 · Score: 0

Agreed...they have essentially done what the poster's attempting to do. Question is how many hours does the poster have to research and setup the solution versus the moderate cost from someone like Isilon...
Re:Check out Isilon by Scott · 2005-10-25 12:26 · Score: 1

My company is in a similar situation as the submitter of the question, though not with the same capacity requirements. Originally I was looking at traditional filers from the usual suspects, along with building out our own. For about $10k you can get something around 3.5TB of storage if you really scrape the bottom for the absolute best deal and are going to rely entirely on your own skills to make the software work. However our redundancy requirements were such that it really cut that capacity number down to about 1.6TB, far from enough to cover future growth for very long at our present rate.

At the same time I was looking at that gear from EMC and Sun. It's a step up in that they give you some expansion options, but a lot of what they are offering seems to come from a time when the idea of dealing with as many disks and nodes made possible now by cheap hardware seemed utterly improbable. Then a couple weeks back further research led to clustered storage companies, namely Isilon but also Ibrix and a few others. HP seems to be getting into this arena as well. This really looks like the future as far as creating highly scalable, huge capacity data storage systems go.

I guess it comes down to what you define as 'low budget' in your case. Does that mean $10k or $100k or $1000k? I know the MSRP on a baseline Isilon three node cluster (disk failure and node failure protection, pretty sweet) gets you a little over 4TB for $50k. Considering that a bottom line NetApp runs the same, a commercial clustered system rather than an old-school filer is probably the better way to go, and leasing gear is almost always an option. If however you are absolutely limited to $25k or less, you're basically screwed and looking at multi-RAID and filer replication of some sort, and a lot of late nights trying to figure out how to make it all work reliably.

15 zeros are no bytes at all by caluml · 2005-10-25 07:36 · Score: 1

15-zeros-is-a-lot-of-bytes

15 zeros is no bytes at all... :)

--
Get your own free personal location tracker

Re:15 zeros are no bytes at all by Anonymous Coward · 2005-10-25 07:43 · Score: 0

Actually it's 15 bytes... don't diss the big 0!!*

* Unless of course we are in pure binary in which case it could be 15 bits... take that pedantic reply boy!
Re:15 zeros are no bytes at all by rk · 2005-10-25 07:43 · Score: 1

Oh, sure it is! It's almost two!
Re:15 zeros are no bytes at all by Anonymous Coward · 2005-10-25 08:03 · Score: 0

1.875 bytes, to be precise.
Re:15 zeros are no bytes at all by Anonymous Coward · 2005-10-25 08:12 · Score: 0

Actually it's just under two bytes:
0000 0000 0000 000
or
0x00 0x00

Have you looked at OneFS by Anonymous Coward · 2005-10-25 07:38 · Score: 0

You really ought to look at Isilon Systems OneFS solutions for this?

This problem is a very real one in film production and we are moving in this direction for future productions after numerous facilities got back to me with rave reviews of the speed, scalability and reliability of these units.

The nice thing is that they do scale: as the number of nodes grows so does the performance of the cluster, and as you add storage you add bandwidth to your core filesystem. They are a great option for just this type of application.

The main issue you may run into at that size is having enough CPU horsepower to handle all the potential requests. This is where conventional network connected filesystem appliance solutions fall flat on their face. These seem to not have that issue at all. The only drawback is the obscene price Cisco charges for the infiniband switches they use as the backplane for the clusters. If you look at this as a potential solution you may want to pressure them to find another infiniband switch provider instead of paying the extortionate pricing Cisco's invented on their 48 port units.

Stress the importance .... by gstoddart · 2005-10-25 07:38 · Score: 3, Insightful

> I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB ... Based on my past experience and research, the commercial offerings for such a solution becomes cost prohibitive, and the budget for the solution is fairly small.

Unfortunately, I should think needing a solution which can scale up to a Petabyte (!) of disk-space and a "fairly small" budget are at odds with one another.

Maybe you need to make a stronger case to someone that if such a mammoth storage system is required, it needs to be a higher priority item with better funding?

Heck, the loss of such large volumes of data would be devastating (I assume it's not your pr0n collection) to any organization. Building it on the cheap and having no backup (*)/redundancy systems would be just waiting to lose the whole thing.

(*) I truly have no idea how one backs up a petabyte

--
Lost at C:>. Found at C.

Re:Stress the importance .... by Anonymous Coward · 2005-10-25 07:43 · Score: 0

How?

http://bssc.sel.sony.com/BroadcastandBusiness/DisplayModel?id=20114

Like that...
Re:Stress the importance .... by dustinbarbour · 2005-10-25 07:48 · Score: 1

With another petabyte of storage.. DUH!

--
What is your penile percentile?
Re:Stress the importance .... by afidel · 2005-10-25 08:11 · Score: 1

You back up a Petabyte with a StorageTek Powderhorn or Timberwolf storage silo. The maximum configuration for the Powderhorn is 28.8PB using LTO2 drives (a more modern version with LTO3 would double that, or halve the number of carts used). Of course a PB equipped Powderhorn with a decent number of drives is going to cost over a million dollars without cartridges.
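
For a sense of scale, here is a rough cartridge count for one petabyte. It is a sketch that assumes native (uncompressed) capacities of 200 GB per LTO-2 cartridge and 400 GB per LTO-3 cartridge, decimal units, and a single copy of the data with no versions kept.

```python
import math

PETABYTE_GB = 1_000_000  # decimal units

# Native (uncompressed) cartridge capacities in GB.
CARTRIDGE_GB = {"LTO-2": 200, "LTO-3": 400}

def cartridges_needed(total_gb: int, cartridge_gb: int) -> int:
    """Whole cartridges required to hold total_gb with no compression."""
    return math.ceil(total_gb / cartridge_gb)

for generation, capacity in CARTRIDGE_GB.items():
    count = cartridges_needed(PETABYTE_GB, capacity)
    print(f"{generation}: {count:,} cartridges for 1 PB")
```

Real counts depend heavily on how compressible the data is and how many generations of backups are retained.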

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:Stress the importance .... by aero2600-5 · 2005-10-25 09:20 · Score: 1

I have the funny feeling that what he really needs to build is something that can handle a few TB, and that needing a solution which can scale up to a Petabyte is one of those nice bonuses he would like to have. The difference between need and want. If he actually needs the scalability of 1 Petabyte, he should build/design the system to handle 1 Petabyte and then scale it down to whatever his immediate or near immediate needs are.

On a side note, this idea sounds like it came from a PHB that someone is trying to please, which is rarely worth the effort.

Aero

--
Please stop hurting America -- Jon Stewart
Re:Stress the importance .... by rndmtxt · 2005-10-25 09:51 · Score: 1

(*) I truly have no idea how one backs up a petabyte

Presumably with wheels?
Re:Stress the importance .... by Anonymous Coward · 2005-10-25 10:49 · Score: 0

Lots of other options besides a StorageTek library.

(FWIW, I *do* actually manage backup environments for a living, with several petabytes of data)

StorageTek makes good devices. So does IBM. Storing a petabyte of data doesn't really require that large of a library - assuming you don't have multiple versions of a large portion of that data. I'm guessing that this is probably a data-storehouse (images, video, DNA sequences, whatever) that has a pretty low rate of change. Assuming LTO3, you could do this with an IBM 3584 with one additional frame - barely registering on the "enterprise" scale of things.

More troublesome is how long it will take you to get that first backup. This is the kind of place where a tool like Tivoli Storage Manager makes sense as a backup tool... never having to take another "full" backup has its advantages when you're talking about a petabyte of data in a single filesystem.
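To put a number on "how long", here is a back-of-the-envelope sketch. It assumes each drive sustains 80 MB/s (the LTO-3 native rate), that the filesystem can feed every drive at full speed, and it ignores tape changes; the drive counts are hypothetical.

```python
def full_backup_days(data_tb, drives, mb_per_s=80.0):
    """Days needed to stream data_tb terabytes to `drives` tape drives in parallel."""
    total_mb = data_tb * 1_000_000            # decimal TB -> MB
    seconds = total_mb / (drives * mb_per_s)  # all drives streaming flat out
    return seconds / 86_400

for n in (1, 8, 32):
    print(n, "drives:", round(full_backup_days(1000, n), 1), "days")
```

Even with dozens of drives the first full pass over a petabyte takes days, which is the argument for never having to repeat it.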
Re:Stress the importance .... by Sam+Nitzberg · 2005-10-25 11:28 · Score: 1

"(*) I truly have no idea how one backs up a petabyte"

I honestly don't know how to either, but part of the solution may involve trucks and tapes and an off-site storage facility. In the event of a disaster, reloading the data set will likely be non-trivial, and not cheap either. Of course, there is some "latency" in this model.

I doubt that sending this data off over a network pipe will be a viable approach...
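A quick sketch of why the network pipe looks doubtful. The link speeds are hypothetical, and the `efficiency` parameter is an assumed fudge factor for protocol overhead.

```python
def transfer_days(data_tb, link_gbit_s, efficiency=1.0):
    """Days to push data_tb terabytes through a link of link_gbit_s gigabits/s."""
    bits = data_tb * 8e12                                # decimal TB -> bits
    seconds = bits / (link_gbit_s * 1e9 * efficiency)
    return seconds / 86_400

print(round(transfer_days(1000, 1), 1))    # 1 PB over a saturated 1 Gbit/s link
print(round(transfer_days(1000, 10), 1))   # 1 PB over a saturated 10 Gbit/s link
```

A truck full of tapes has terrible latency but, on these numbers, very respectable bandwidth.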

I'd find organizations that have backed up huge amounts of data - and investigate with them an appropriate solution for you that covers both routine data loss, as well as disasters.

Your backup management solution may in and of itself be quite costly...

- Sam
Re:Stress the importance .... by gstoddart · 2005-10-25 15:35 · Score: 1

You backup a Petabyte with a StorageTek Powderhorn or Timberwolf storage silo.

Jesus! A backup solution with 144,000 tape cartridges. That's freakin' monstrous. My brain hurts.

I still can't fathom WTF kind of data you have that is that big. :-P

--
Lost at C:>. Found at C.
Re:Stress the importance .... by ShakaUVM · 2005-10-25 17:15 · Score: 1

Do you really think a petabyte is a lot of space? Each of my PCs has a pair of 500GB drives in it, which we picked up for next to nothing due to a sweet mail-in rebate offer from Maxtor. (Well, it WILL be sweet if they ever send us the rebate, but I digress...)

It kind of blew my mind when I realized I had a terabyte of storage in my computer. Then I checked my disk space and realized it was half full already. I've had it for less than one month, WTF? =)
Re:Stress the importance .... by Anonymous Coward · 2005-10-25 23:29 · Score: 0

I think you mean "with reverse gear"
Re:Stress the importance .... by toker95 · 2005-10-26 01:17 · Score: 1

One backs up a Petabyte with something like this...

http://www-132.ibm.com/webapp/wcs/stores/servlet/CategoryDisplay?storeId=1&catalogId=-840&langId=-1&dualCurrId=73&categoryId=2058932&x=6&y=11

Priced anywhere from $150k to $275k depending on the drive types and configuration. We have a total of 70 or so of these frames, as they are referred to, spread out in 3 datacenters in Delaware alone. But then again, while our budget is small... it's an enterprise level system, so it gets an enterprise level budget. A weekend's worth of full backups exceeds 50 TB at my site alone.

--
~~~ SCO sued me because I printed this t-shirt with a Linux driven printer...

Used by gov't labs and universities.... by Anonymous Coward · 2005-10-25 07:39 · Score: 0

You may want to check out Panasas. From what I can discern, they are used by some high-end entities, but it does have the advantage of being high performance. You probably won't find anything this good for reading/writing out data from your app server to your data storage network.

BTW one of the founders wrote the original paper on RAID drives...

www.panasas.com

IBRIX by Anonymous Coward · 2005-10-25 07:39 · Score: 1, Insightful

Check out the IBRIX Clustered Filesystem. http://www.ibrix.com/

Re:IBRIX by Wells2k · 2005-10-25 07:57 · Score: 2, Informative

Something else I forgot about is the actual hardware... you may want to take a look at the nStor products. Their hardware RAID systems are relatively economical, and you can go to fibrechannel drives with fibre connected boxes quite easily with their equipment.
Re:IBRIX by Anonymous Coward · 2005-10-25 08:02 · Score: 1, Informative

IBRIX is an excellent product. Especially if used in conjunction with HPCC or a renderfarm that requires very low access times, high availability, and works in conjunction with storage systems that will scale up to Petabytes.

For the most part by retinaburn · 2005-10-25 07:39 · Score: 4, Insightful

the reason you can't find a cheap way to do this is because it just isn't cheap.

I would look at some lessons learned from Google. If you decide to go with some sort of homebrew solution based on a bunch of standard consumer disks you will run into other problems besides money. The more disks you have running, the more failures you will encounter. So any system you set up has to be able to have drives fail all day, and not require human intervention to stay up and running (unless you can get humans for cheap too).
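To make "the more disks, the more failures" concrete, a small sketch. The 3% annual failure rate and the 5,000-drive fleet are assumptions picked for illustration, not measured figures, and drives are treated as failing independently.

```python
def expected_failures_per_year(n_drives, afr):
    """Mean number of drive failures per year for a fleet of n_drives."""
    return n_drives * afr

def prob_at_least_one_failure(n_drives, afr, days):
    """Chance that one or more drives fail within `days` (independent failures)."""
    p_survive_one = (1 - afr) ** (days / 365)   # one drive surviving the window
    return 1 - p_survive_one ** n_drives

print(expected_failures_per_year(5000, 0.03))
print(round(prob_at_least_one_failure(5000, 0.03, 7), 3))
```

At that scale a dead drive in any given week is close to a certainty, so replacement has to be routine rather than an emergency.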

Re:For the most part by epiphani · 2005-10-25 08:10 · Score: 2, Informative

It won't be cheap - but how about this idea. You'll get plenty of data redundancy out of it, however you may need to spend some extra bucks on stability and maintainability.

3ware 12x S-ATA raid5 card
12x 300GB raid5
linux machine
iscsi software - share out 1 LUN.

Duplicate this machine until you have enough storage.

One big box with a number of trunked/bonded gigE ports
iSCSI initiator software - mount all the LUNs.
software raid them together - striping if you aren't too worried - raid5 if you are.

tada - big storage, one volume, all accessible from one machine.

the maintenance will suck though.

--
.
Re:For the most part by fool · 2005-10-25 09:17 · Score: 3, Informative

well, since all of the (high-end) PC's we were looking at for snort boxen had severe problems pushing even 5Gbit/s (not GByte) of traffic in/out over the PCI busses simultaneously, you hit a bottleneck pretty quickly there, even before you get to 25TB with your disk sizes. at 500GB disks you get pretty close, but you're at the ceiling already. while a decent (not even cutting-edge) machine could push a Gbit to the server pretty easily, the server, no matter how beefy, needs a ton of internal bandwidth to gather/process/serve the data timely-like. if he only needs 100mbit/s of data service then he's golden =)

or did you mean to specify a GBit switch in between the clients/big box?

also, agree with yours and others' proclamation that administration will not be trivial. be sure to spec at least 6 months of your time in writing/debugging scripts to automate the detection and RMA of dead drives, and find a vendor who will ship based on an automated mail you can send out about failed disks, rather than waiting on turnaround from you pulling the drive and the delivery making a round trip.
Re:For the most part by SatanicPuppy · 2005-10-25 09:35 · Score: 1

The maintenance would be a freaking nightmare, you're totally right there...I'm assuming no hotswap here, because adding hotswap for that many drives on otc hardware would be expensive as hell.

You'd be getting about 2.4Tb off each of those machines, (assuming raid5) so that would be 417 of those machines for a Pb. With 12 harddrives per machine, and 300Gb (serial ATA) harddrives costing about 135 on Newegg, you're talking 675,540 dollars in harddrives alone, and with 5,004 harddrives the odds of one going south are pretty high, but we're not even going there.

So we're at 675,540 in harddrives. I'd say double that for the whole systems, what the hell it's only money, right? So we're at 1.3 million, and we have 417 computers sitting in a field somewhere. We could put 'em in a warehouse...If we had all the output fans pointing in one direction, we could sell it as the world's most sweltering windtunnel. You could probably run a rotisserie turkey business back there...
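The arithmetic above can be reproduced with a few lines. This is only a sketch: the first call uses the 2.4 TB-per-machine figure exactly as posted; the second is a variation assuming one RAID5 set of twelve 300 GB drives, which would nominally give 11 x 300 GB = 3.3 TB usable. The function name is made up.

```python
import math

def build_estimate(target_tb, usable_tb_per_box, drives_per_box, drive_price):
    """Return (machines, drives, drive cost in dollars) for a target capacity."""
    boxes = math.ceil(target_tb / usable_tb_per_box)
    drives = boxes * drives_per_box
    return boxes, drives, drives * drive_price

print(build_estimate(1000, 2.4, 12, 135))   # figures as posted
print(build_estimate(1000, 3.3, 12, 135))   # variation: 11 data drives per box
```

Either way it is several hundred machines and several thousand drives before counting chassis, controllers, power or floor space.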

I think this is one of those situations where the only good way to go about it is to talk to Sun or IBM, because building this with over the counter hardware is a joke, and not a funny one. Any system that has 417*13 or so points of failure is scary enough, but then you think about the failure rates associated with retail hardware, AND the fact that it's going to cost millions and still suck...There is no need to go on there.

Though 2.4 Terabytes of storage is clearly pretty feasible for the home, if you can get a box big enough to hold all the drives...Pretty cool.

--
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
Re:For the most part by jabuzz · 2005-10-25 10:43 · Score: 1

There are such things as 10GbE cards and switches you know. As you are using ATA drives I would go with AoE and save a shed load of CPU power to boot.

XServe RAID + XSan by Anonymous Coward · 2005-10-25 07:39 · Score: 0

Oracle endorses, uses Xserve RAID:
http://alienraid.org/article.php?story=oracle

http://www.apple.com/xserve/raid/
http://www.apple.com/xsan/

Homebrew HSM by benow · 2005-10-25 07:40 · Score: 1

A buddy and I were talking about something similar. We were looking for expandable large-scale storage with good performance and cost, plus high-level support for data management tasks (metadata management, media shuffling, accommodating technology advancements, etc). We decided on single nodes, each controlling data storage to fit the data use case. Each node presents a span over RAM, JBOD, optical and tape, in increasing size: a RAM fs for whatever is always in use, a couple of TB of JBOD/IDE RAID holding the more frequently used data, everything in an optical jukebox, and safety backups to a tape jukebox. Ideally the entire tower would be modular and auto-discovering, so when more long-term storage is required a jukebox module could be added and the system could autodiscover capacities, alignment, etc.

Data presentation would be done via a cascading HSM aware of each of the components and with optimized use cases (ie burning to blank optical when sufficient new data arrives on RAID, maintaining indexing for each of the modules, etc). On top of the cascading storage would be a metadata VFS/presentation layer, to allow data navigation at a high level (cd /video/nature/banff/) via NFS or an HTTP web app over gig eth. A delegation/peering layer could allow for grouping such towers to grow in size. Though quite theoretical, and with the software being somewhat tricky, it could scale to the tens of TB/node size easily. We'd originally thought of an open hardware design with modules being added by the community and open HSM software supporting it. Neither of us has yet had the time to do much more than basic dev, but it's planned.
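The cascading idea can be sketched as a toy model. Everything here is hypothetical (class name, tier names, capacities counted in items rather than bytes), and it simplifies the design described above: each item lives in exactly one tier instead of always having an archive copy on optical.

```python
from collections import OrderedDict

class CascadingStore:
    """Toy tiered store: fastest tier first, last tier unbounded (capacity None)."""

    def __init__(self, tiers):
        # tiers: list of (name, capacity) pairs, e.g. [("ram", 2), ("tape", None)]
        if tiers[-1][1] is not None:
            raise ValueError("last tier must be unbounded")
        self.names = [name for name, _ in tiers]
        self.caps = [cap for _, cap in tiers]
        self.data = [OrderedDict() for _ in tiers]   # oldest entry first

    def _insert(self, level, key, value):
        tier = self.data[level]
        tier[key] = value
        tier.move_to_end(key)                        # mark as most recently used
        cap = self.caps[level]
        if cap is not None and len(tier) > cap:
            old_key, old_val = tier.popitem(last=False)   # evict least recent
            self._insert(level + 1, old_key, old_val)     # cascade one tier down

    def put(self, key, value):
        self._insert(0, key, value)

    def get(self, key):
        """Return (tier name the item was found in, value) and promote it."""
        for level, tier in enumerate(self.data):
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)
                return self.names[level], value
        raise KeyError(key)

store = CascadingStore([("ram", 1), ("disk", 1), ("tape", None)])
for name in ("a", "b", "c"):
    store.put(name, name.upper())
print(store.get("a"))   # found on the slowest tier, then promoted
print(store.get("a"))   # now served from the fastest tier
```

A real HSM would also have to handle sizes in bytes, media that is offline, and the indexing mentioned above; the sketch only shows the promote/demote cascade.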

AtomChip Corp by OctoberSky · 2005-10-25 07:40 · Score: 1

Why not just wait until those Atom Chip Laptops.

I mean, yeah the portability is not what the customer wanted but the 6.8GHz CPU, coupled with the 1TB of RAM should easily make up for the limited 2TB of HDD space.

Courtesy Link.

raid-nfs-raid by eatjello · 2005-10-25 07:41 · Score: 1

How about this: set up a bunch of mini-itx or similar low-power machines with 9 250GB PATA drives (and a single smaller drive for OS install), then use software raid under Linux to configure them as a single RAID-5 array (roughly 2TB) and set up an NFS server on the machine to share the array. Then set up a couple of controllers (for redundancy) that mount all the NFS shares and turn them into a linear or striped array (or RAID-1 for added data security). Then the controllers would be able to present a share using NFS, SMB, or whatever you need it to, with a capacity that scales seamlessly... all you have to do is add more 2TB nodes to it. Obviously the details are flexible, like making the nodes RAID-1 instead of RAID-5 (and dropping them to 8 250GB drives for a round 1TB per node), but this should give you exactly what you're looking for. Your cost per node would simply be the mini-itx mobo, memory, 4 channel IDE controller card, and hard drives, and cost per controller would be practically any computer with high speed network interfaces (I'd recommend something with 2x Gb LAN at least, maybe Gb out and 10Gb to the HDD farm).
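A quick sketch of the per-node capacities mentioned above and how many nodes that implies. The helper names are made up, units are decimal, and RAID-1 is taken to mean mirrored pairs.

```python
import math

def raid5_usable_gb(drives, drive_gb):
    """RAID-5 gives up one drive's worth of space to parity."""
    return (drives - 1) * drive_gb

def raid1_usable_gb(drives, drive_gb):
    """Mirrored pairs: half of the raw capacity is usable."""
    return (drives // 2) * drive_gb

def nodes_needed(target_tb, node_gb):
    return math.ceil(target_tb * 1000 / node_gb)

node5 = raid5_usable_gb(9, 250)   # the RAID-5 node
node1 = raid1_usable_gb(8, 250)   # the RAID-1 variant
print(node5, node1)
print(nodes_needed(25, node5), nodes_needed(1000, node5))
```

If the controllers mirror the nodes (RAID-1 on top) instead of striping them, the node counts double.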

Re:raid-nfs-raid by Eunuchswear · 2005-10-25 09:00 · Score: 1

So how do you "set up a couple of controllers (for redundancy) that mount all the NFS shares and turn them into a linear or striped array"?

--
Watch this Heartland Institute video

Do It Right by moehoward · 2005-10-25 07:41 · Score: 5, Insightful

Look. Everyone wants a Lamborghini for the price of a Chevy. Cute. Yawn. Half of the Ask Slashdot questions are people who didn't find what they want at Walmart. Despite the amazing Slashdot advice, Ask Slashdot answers have somehow failed to put EMC, IBM, HP, etc. out of business. There is no free lunch.

Just call EMC, get a rep out, and give the paperwork to your boss. Do it today instead of 5 months from now and you will have a much better holiday season.

Note to moderators and other finger pointers: I did not say to BUY from EMC, I just said to show his boss how and why to do things the right way. It does not hurt to get quotes from the big vendors, mainly because the quote also comes with good, solid info that you can share with the PHBs. Despite what you think about "evil" tech sales persons and sales engineers, you actually can learn from them.

--
"If you want to improve, be content to be thought foolish and stupid." - Epictetus

Re:Do It Right by Tankko · 2005-10-25 07:45 · Score: 1

This is damn good advice! I always try to get bids from people even when I have very little interest in using them (they have changed my mind a few times). It has always provided me with invaluable information.
Re:Do It Right by Anonymous Coward · 2005-10-25 08:01 · Score: 0

What about the Veritas file system http://www.veritas.com/Products/www?c=subcategory&refId=109&categoryId=120
Re:Do It Right by shmlco · 2005-10-25 08:07 · Score: 1

Not to mention opportunity costs incurred while you're dinking around, development costs, and maintenance costs.
While solutions from some of the big boys may seem expensive, it's entirely possible that's because you haven't figured out all of the costs involved in doing it "cheap"...

--
Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
Re:Do It Right by greg_barton · 2005-10-25 08:09 · Score: 1, Funny

Ask Slashdot answers have somehow failed to put EMC, IBM, HP, etc. out of business.

Yes, that's right... You can't do it yourself... There's never an open source solution... You must buy an expensive solution.... From a big company.... You're getting very....very...sleepy...

MMMMWWWUUUUUUHAHAHAHAHAHAHAHAHAHAHAHAHAHA!!!!!!!!!!!!!!!!!!!
Re:Do It Right by stanmann · 2005-10-25 08:20 · Score: 2, Insightful

Yes, there are lots of things that can be done by an open source team on the cheap... Massive hardware components aren't currently one of them. And aren't likely to be in the future.

--
Food not Bombs is a nice platitude but it breaks down when you notice that the Bombees are usually well fed

IBRIX by Wells2k · 2005-10-25 07:42 · Score: 3, Informative

You may want to take a look at IBRIX systems. They do a pretty robust parallel file system that has redundancy and failover.

Er... be careful by LeonGeeste · 2005-10-25 07:43 · Score: 2, Informative

That violates their terms of use pretty severely. I don't know what they would do (Google's not the "suing-for-the-hell-of-it" type), but that wouldn't last very long when they found out. And they would find out. +5 Interesting? Well, curiosity killed the cat.

--
Rank my idea: http://www.sinceslicedbread.com/node/531

Re:Er... be careful by Anonymous Coward · 2005-10-25 07:59 · Score: 0

All they can do is disable the accounts.

I don't care, I have 4,500+ account-creation-hashed left over.
Re:Er... be careful by CommanderC · 2005-10-25 08:05 · Score: 1, Interesting

If they can find the accounts. The usage of the service would not indicate anything unusual if it is done right, and even then you can implement parity or redundancy for data integrity.
Re:Er... be careful by metzjtm · 2005-10-25 21:50 · Score: 1

You all crack me up.

Just a Thought... by Anonymous Coward · 2005-10-25 07:43 · Score: 0

And this is a little insane, but you MIGHT look at a combination of Oracle offerings. Specifically, Collaboration Suite for the end-user presentation (which gives you a web interface, and FTP/FTPS, and WebDAV, and they have a little desktop app that'll let you mount the WebDAV volume like a traditional SMB share, drive letter and all) and ASM for the disk management. ASM is an Oracle database (that can be a bunch of RAC instances for redundancy) that can take a bunch of disk, doesn't even need to be the same type or anything, just as long as you can present it to the ASM server somehow like NFS, and creates a sort of virtual data pool that could then be used for another, regular Oracle database like the one used as a datastore for the Collaboration Suite instance above.

Yes, I realize this is probably needlessly complicated, and since we don't have very specific information about what the disk is actually FOR it's also likely to be inappropriate for some other reason, but it could work, and Oracle Collaboration Suite (which is the only part you'd actually have to license, I believe, the rest just sort of comes with it) is licensed on a per-user basis. I'm not sure what the minimum number of users is, but for only the files-based part we're talking about here, I think the list price is something like $15/user.

UnionFS by MaskedSlacker · 2005-10-25 07:43 · Score: 1

UnionFS ought to do the trick.

Are you working for Facebook? by fitchmicah · 2005-10-25 07:43 · Score: 1

I hope this is for facebook! Maybe to expand the new photo galleries?

Re:Are you working for Facebook? by Anonymous Coward · 2005-10-25 08:00 · Score: 0

all of your comments are fucking stupid

www.pillardata.com by bvoth · 2005-10-25 07:44 · Score: 1

check it out! have no experience with this company but looks very cool

--
perl -e 'print pack("H*", "6272616440766f74682e6e616d65")'

Nice Engrish by suckass · 2005-10-25 07:45 · Score: 0

After reading that post and seeing the level of your english you should probably let someone else handle a project so complex.

--
blah, blah, blah

Re:Nice Engrish by Eunuchswear · 2005-10-25 09:07 · Score: 1

This from the guy who seems unable to write more than one sentence at a time.

And how about "[...] let someone else handle a project so complex", way to go with the fluid use of the English language.

Ah shit, I'm bored with the reasoned, witty replies: Fuck off and die.

--
Watch this Heartland Institute video

storagetek... by blackcoot · 2005-10-25 07:45 · Score: 1

...can probably solve this problem for you. whether or not they can do so on the sort of budget you're willing to spend is a totally different story, however....

Don't forget Coda.. by Sir+Pallas · 2005-10-25 07:46 · Score: 1

Coda works even when nodes disconnect, for instance with network outages or mobile computing. Plus, there is a Windows client, if that's the way your shop swings.

Re:call EMC. i am sure their clarion line will han by Anonymous Coward · 2005-10-25 07:46 · Score: 0

Anonymous marketingdroid?

Stripped by zephris · 2005-10-25 07:46 · Score: 1

Well, I don't know a lot about this kind of thing, but if XP doesn't have an upper drive size limit, could you just throw it in a big server case, throw in some scsi drives, stripe them (I *think* that's what it's called) and have it appear as one big volume?

Re:Stripped by Anonymous Coward · 2005-10-25 08:11 · Score: 0

He wants to be able to get up to 1PB...NTFS will only get you up to 256TB. But even then you need controllers for all of the disks. AND, if 1 disk fails, the entire volume is gone. You have to do this in a Raid5 array. It's the only way to go. Oh, and it's obvious that you don't know a lot about this kind of thing....just saying.
Re:Stripped by quazee · 2005-10-25 10:17 · Score: 1

NTFS design allows for 2^64 clusters.
Thus, with a 4096 byte cluster size, it can handle 2^26 Petabyte volumes.

The problem is that the current NTFS driver uses 32-bit integers to manipulate the cluster IDs - hence the 256 TB limit, using the largest supported cluster size (64KB).

I don't know about the x64 versions - probably the limit is still there because Microsoft doesn't care much about it at the moment.
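Those limits are easy to check. A small sketch using binary units (1 TB = 2^40 bytes, 1 PB = 2^50 bytes); the function name is made up.

```python
TIB = 2 ** 40
PIB = 2 ** 50

def max_volume_bytes(cluster_index_bits, cluster_size_bytes):
    """Largest volume addressable with the given cluster index width."""
    return (2 ** cluster_index_bits) * cluster_size_bytes

# On-disk design: 64-bit cluster numbers, 4096-byte clusters.
print(max_volume_bytes(64, 4096) // PIB == 2 ** 26)

# Driver limit described above: 32-bit cluster numbers.
print(max_volume_bytes(32, 64 * 1024) // TIB)   # largest (64KB) cluster size
print(max_volume_bytes(32, 4096) // TIB)        # 4096-byte cluster size
```

So with 32-bit cluster numbers, a petabyte volume is out of reach at any supported cluster size.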

--
throw new SuccessException("Sig read successfully");

what about JFS and ATAoE? by imsmith · 2005-10-25 07:48 · Score: 1

I don't know what the limits of JFS are, but it sounds like a nice set up.

This article in Linux Journal ( http://www.linuxjournal.com/article/8149 ) talks about doing just that. The hardware costs ring up and don't scale as you get into your capacity ranges unless you can get a deal buying bulk HDDs - something like $10K per 7.5 terabytes

Yup, time to pick up the phone. by Kadin2048 · 2005-10-25 07:48 · Score: 5, Insightful

Exactly. This seems like somebody is trying to figure out a way to do something in-house which really ought to be left to either an outside contractor, or at least set up as a turnkey solution by a consultant. Given that he knows little enough about it that he's asking for help on Slashdot, I think this is yet another problem best solved using the telephone and a fat checkbook, and enough negotiating skills to convince management to pony up the cash up front instead of piddling it out over time on an in-house solution that's going to be a hole into which money and time are poured.

I know people get tired of hearing "call IBM" as a solution to these questions, but in general if you have some massive IT infrastructure development task and are so lost on it that you're asking the /. crowd for help, calling in professionals to take over for you probably isn't a bad idea.

It's not even a question of whether you could do it in-house or not; given enough resources you probably could. It comes down to why you want to do something like this yourselves instead of finding people who do it all the time, week after week, for a living, telling them what you want, getting a price quote, and getting it done. Sure seems like a better way to go to me.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Re:Yup, time to pick up the phone. by lonesome+phreak · 2005-10-25 08:09 · Score: 1

I work for IBM. Yes, call us...join us...your requirements are merely a speck on the windshield of Big Blue!

--
Maybe we DID take the blue pill. You wouldn't remember anyway.
Re:Yup, time to pick up the phone. by Anonymous Coward · 2005-10-25 08:27 · Score: 0

Ahhh. An MCSE.

Depends on the content by behrman · 2005-10-25 07:50 · Score: 1

If you're talking about building some sort of archival-type repository (like, keeping years worth of satellite imagery, for example), then you should probably look at the Centera from EMC. They scale into the petabyte range.

Providing you can find some sort of filesystem to support it (good luck), you could stash multiple arrays behind your host, or you could put in a TagmaStore from HDS with several arrays behind it. I'm not entirely sure how large the Tagma will scale, but the number 32 petabytes sticks in my head from a whitepaper somewhere.

I'd also question the perceived need to create one big filesystem to hold your whole petabyte of data. I'm a storage geek for a living, and I've found that usually after you start drilling into the application requirements, you find out that the app folks are either trying to use a data warehouse solution that's too small for the environment, or they're simply not aware of other alternatives available in their chosen app. No offense, but it sounds like you've had snowshoes strapped to your feet and been directed to take a stroll through a minefield.

im really not a redhat fanboy but.... hahahah by frankm_slashdot · 2005-10-25 07:51 · Score: 1

no really... http://www.redhat.com/software/rha/gfs/ from redhat seems to be what you might want.... take a look.

Watch out for File System too by Anonymous Coward · 2005-10-25 07:52 · Score: 0

You've done your research so you may already know, but it's worth mentioning that EXT3 and ReiserFS will not cut it for your system. Most file systems (not to be confused with storage sub-systems) have a maximum volume size.

http://en.wikipedia.org/wiki/Comparison_of_file_systems

Also - consider the problem of expanding any array.

As for how to do it, if I were to go super-cheap:
- Use software RAID-5 on each node (2 sub-systems, 8 disks each, hot-swap SATA 200GB, RAID-0 them and export as NBD)
- Use NBD to concatenate each node into a single block device
+ you DO NOT want to rebuild any parity info across nodes with this poor-man's setup
- Use as many GigE links as you can to interconnect with the nodes (limit use of switches)

Note:
- This setup will write data at multi 100MBit, but not GBit. It will read at close to GBit. I have this setup at home (1 node) and I'm impressed with software RAID and SATA.
- Contact the maintainers of any userland / kernel stuff you'll be needing and ask if they support the sizes you're looking for. I ran into trouble with dm-crypt (unsigned 32bit integer overflow) relating to file system size and mode of operation. All fixable.
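The post doesn't say which counter overflowed in dm-crypt, so the following is only an illustration of the general class of problem. It assumes a sector count held in an unsigned 32-bit integer with 512-byte sectors; the helper names are made up.

```python
SECTOR_BYTES = 512        # assumed sector size
U32_MAX = 2 ** 32 - 1     # largest value an unsigned 32-bit integer can hold
TIB = 2 ** 40

def sector_count(volume_bytes):
    return volume_bytes // SECTOR_BYTES

def fits_u32(volume_bytes):
    """True if the sector count fits in an unsigned 32-bit integer."""
    return sector_count(volume_bytes) <= U32_MAX

def wrapped_sector_count(volume_bytes):
    """What a 32-bit counter would actually hold after silently wrapping."""
    return sector_count(volume_bytes) & U32_MAX

print(fits_u32(2 * TIB - SECTOR_BYTES))   # just under 2 TiB
print(fits_u32(25 * 10 ** 12))            # a 25 TB volume
print(wrapped_sector_count(2 * TIB))      # exactly 2 TiB wraps around
```

Under those assumptions anything at or above 2 TiB wraps, which is why it pays to ask maintainers about sizes well before reaching them.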

Been there done that by CommanderC · 2005-10-25 07:52 · Score: 2, Interesting

I wrote a web application and a client in C# that uses gmail accounts as a sort of file system. It uses a set of email accounts as "index" accounts, which rely on the gmail search functionality to find what you are looking for, then pulls the attachment on the index to grab the parts of the file that were spread across multiple gmail accounts in 500K chunks. It works really well. I did it for fun to see if I could. It uses smtp to post the file chunks to a given set of accounts, and users can donate accounts to the hive at will, increasing the overall storage size. All hosted, maintained and indexed by gmail or any other free mail service as one big file system.
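The chunk-and-index part of that scheme can be sketched without touching any mail service. This is a Python illustration, not the C# code described above: the account names, the round-robin placement and the SHA-1 check are all assumptions, and a plain dict stands in for the mailboxes.

```python
import hashlib

CHUNK_BYTES = 500 * 1024   # the "500K chunks" from the post

def split_into_chunks(data, accounts):
    """Return (index, chunks). The index records which account holds each chunk."""
    index, chunks = [], {}
    for seq, start in enumerate(range(0, len(data), CHUNK_BYTES)):
        piece = data[start:start + CHUNK_BYTES]
        account = accounts[seq % len(accounts)]       # round-robin placement
        digest = hashlib.sha1(piece).hexdigest()      # lets us verify on read
        index.append({"seq": seq, "account": account, "sha1": digest})
        chunks[(account, seq)] = piece                # stand-in for "mail it"
    return index, chunks

def reassemble(index, chunks):
    """Rebuild the original bytes, checking each chunk against the index."""
    out = bytearray()
    for entry in sorted(index, key=lambda e: e["seq"]):
        piece = chunks[(entry["account"], entry["seq"])]
        if hashlib.sha1(piece).hexdigest() != entry["sha1"]:
            raise ValueError("chunk %d is corrupt" % entry["seq"])
        out += piece
    return bytes(out)

payload = bytes(range(256)) * 5000
idx, store = split_into_chunks(payload, ["acct-a", "acct-b"])
print(len(idx), reassemble(idx, store) == payload)
```

The index list plays the role of the "index" accounts: it is the only thing needed to find and reorder the pieces.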

Just wait 5 years ... by tomhudson · 2005-10-25 07:52 · Score: 3, Interesting

Hard disk space is doubling every 6 months - wait 5 years and you'll be able to buy a 25TB disk for $125.00.

A single raid50 of them will then give you your petabyte of storage, for around $6,000.
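For what it's worth, the numbers are self-consistent: $6,000 buys 48 disks at $125. The RAID 50 layout below (8 RAID-5 groups of 6 disks) is a hypothetical one chosen to fit that count, not something stated in the post.

```python
def raid50_usable_tb(groups, disks_per_group, disk_tb):
    """RAID 50: a stripe across RAID-5 groups, each losing one disk to parity."""
    return groups * (disks_per_group - 1) * disk_tb

disks = 6000 // 125                   # disks the quoted budget buys
print(disks)
print(raid50_usable_tb(8, 6, 25))     # 8 groups x 6 disks of 25 TB each
```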

Re:Just wait 5 years ... by dr2chase · 2005-10-25 08:00 · Score: 1

But hard disk bandwidth is growing more slowly; density arguments alone suggest doubling only once every 12 months (reading/writing is linear; density is quadratic). Another way to look at that is that every two years, it will take twice as long to read your entire disk (twice as long to back up, twice as long to bring online after a hot-swap, whatever).
Re:Just wait 5 years ... by tomhudson · 2005-10-25 08:55 · Score: 1

They'll combat that by installing multiple head mechanisms (will also reduce warranty claims because then if one head support mechanism no longer works, the other ones will) and/or monolithic heads.
Re:Just wait 5 years ... by jmorris42 · 2005-10-25 09:28 · Score: 1

> Hard disk space is doubling every 6 months - wait 5 years and you'll be able to buy...

Which is of course something to factor into any plans. The original question said they need 25TB now and 1PB later. So one big part of the plan should be upsizing the drives in the existing arrays in the future instead of just growing more racks.

--
Democrat delenda est
Re:Just wait 5 years ... by tomhudson · 2005-10-25 09:45 · Score: 1

This is one of the problems everyone is having a hard time conceptualizing - that you don't overbuy so that you can expand existing systems - oftentimes, it'll be more expensive to buy those 3-year-old now obsolete parts than to buy bigger, better, and faster parts.
What I'd do is, as each new increase of storage is due to come on-line, grab the larger sizes and relegate the older systems to backup and semi-stable storage, or to store the stuff that doesn't need high availability, rather than expanding them. Expanding a system with 4 250gig hard drives 3 years from now is going to be as silly as expanding a system with 2 40-gig hard drives today.
Re:Just wait 5 years ... by FS · 2005-10-25 13:42 · Score: 1

Maybe the technology to produce a certain size drive is doubling every 6 months (don't know, never checked), but drives aren't doubling in size every 6 months. Let's see, we have 500GB drives out now, which means that three years ago the best we could do would have been no more than an 8 GB drive. Nah, I don't think it is quite that good, but it would be nice.
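The back-projection is easy to verify. A sketch, taking 500GB as today's largest drive; the function names are made up.

```python
import math

def projected_gb(current_gb, years, doubling_months):
    """Capacity after `years` (negative = in the past) at a fixed doubling period."""
    return current_gb * 2 ** (years * 12 / doubling_months)

def implied_doubling_months(start_gb, end_gb, years):
    """Doubling period needed to grow from start_gb to end_gb in `years`."""
    return years * 12 / math.log2(end_gb / start_gb)

print(projected_gb(500, -3, 6))      # three years ago, doubling every 6 months
print(projected_gb(500, 5, 6))       # five years out, doubling every 6 months
print(round(implied_doubling_months(500, 25_000, 5), 1))   # 500GB -> 25TB in 5 years
```

Taken literally, six-month doubling overshoots the 25TB-in-five-years figure by a wide margin; 25TB in five years works out to a doubling period of a bit under a year.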
Re:Just wait 5 years ... by toddestan · 2005-10-26 04:56 · Score: 1

Harddrive space is not doubling every six months. Right now, the biggest single drive you can buy (that I'm aware of) is 500GB. Those came out a few months ago. Before that, the 400GB was the biggest for over a year. Even when harddrive technology was advancing by leaps and bounds a few years ago, I don't think it was doubling quite that fast.
Re:Just wait 5 years ... by tomhudson · 2005-10-26 05:17 · Score: 1

Before that, the 400GB was the biggest for over a year.

You weren't able to walk into a store and buy a 400-gig hard drive off the shelf last year. Heck, most places STILL don't carry them in inventory.
The "price point" - the most bang for the buck, has been dropping like a stone. You can get a 320 gig hd today for the same price you paid for a 160 gig earlier this year. And this will continue. Expect to see terabyte drives becoming commodity items over the next year or so.

iSCSI by wasabii · 2005-10-25 07:53 · Score: 1

I have been searching for a solution for this as well. My current thought is that iSCSI is most appropriate. I plan to set up a number of small linux boxes, with as much storage space as a single system can accommodate, MD them so that each system is itself redundant. Each system will export an iSCSI target of the MD device. A single large node will then mount all the iSCSI devices and add another layer of raid (so that a single node failure doesn't result in down time), and export the file system as NFS to clients. I plan to just start with XFS for the on disk structures with an out-of-band journal.

HPSS, used by LARGE databases by metb · 2005-10-25 07:54 · Score: 1

Is http://www.hpss-collaboration.org/hpss/index.jsp something to try ?

The ECMWF uses this for their extremely large dataset. http://www.ecmwf.int/

Space dosent matter IO's do by silas_moeckel · 2005-10-25 07:55 · Score: 1

OK, first things first: figure out the IO's you need to do and how they need to scale. If you're looking for just bulk storage, look into some nice big SATA drives. 4RU cases can get you 24 500 gig drives with 22 usable in raid 5 on a pair of 3ware 12 port or similar raid controllers, or 11TB per unit. Serve these rather large chunks up with iSCSI. Take a HA cluster and merge those chunks together with software raid. The end servers just need to be fast enough to handle your interconnect speeds (gig or better I would hope); the HA pair needs a good deal of computational ability to do raid calcs. All of them can use as much ram as you can shove in them if performance is a goal.

This isn't the fastest config by far but it's cheap and reliable.

Now, with this being said, there generally isn't any good reason to make a disk that big. Split things up if at all possible; you do not want to deal with a PB of data in one huge volume.
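
The capacity numbers in this post are easy to sanity-check. What follows is only a rough sketch: it assumes each of the two 12-port controllers runs its own RAID 5 set (so one drive per controller goes to parity) and uses decimal units. The function names are made up for illustration.

```python
# Rough capacity arithmetic for the 4RU chassis described above.
# Assumption: two 12-port controllers, each with its own RAID 5 array,
# so one drive's worth of space per controller is spent on parity.

DRIVE_GB = 500               # per-drive capacity from the post
DRIVES_PER_CONTROLLER = 12
CONTROLLERS = 2


def usable_gb(drive_gb, drives_per_controller, controllers):
    """Usable space when each controller holds one RAID 5 array."""
    data_drives = (drives_per_controller - 1) * controllers
    return data_drives * drive_gb


def chassis_needed(target_tb, per_chassis_gb):
    """Whole chassis required to reach target_tb (decimal TB)."""
    target_gb = target_tb * 1000
    return -(-target_gb // per_chassis_gb)  # ceiling division on integers


per_chassis = usable_gb(DRIVE_GB, DRIVES_PER_CONTROLLER, CONTROLLERS)
print(per_chassis)                        # 11000 GB, i.e. 11 TB per unit
print(chassis_needed(25, per_chassis))    # chassis for the 25 TB start
print(chassis_needed(1000, per_chassis))  # chassis for a full PB
```

This ignores whatever the second RAID layer on the HA pair costs you in space, so treat the chassis counts as a floor.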

--
No sir I dont like it.

Re:Space dosent matter IO's do by geekoid · 2005-10-25 08:01 · Score: 1

Be aware that SATA uses a dirty buffer, so you cannot reliably know when a write to media has actually occurred.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
Re:Space dosent matter IO's do by silas_moeckel · 2005-10-25 09:30 · Score: 1

I'm fairly sure 3ware has already taken care of this: when running with a BBU it keeps a copy of all data in the SATA drives' write cache and writes it to disk again if the power is lost.

--
No sir I dont like it.

Just buy a filer by OrangeTide · 2005-10-25 07:56 · Score: 1

Agami sells fairly large and very fast filers for cheap. Your 25TB could be reached in under $170k with them. That's pretty cheap, in my experience.

--
“Common sense is not so common.” — Voltaire

Panasas anyone? by Chyeburashka · 2005-10-25 07:56 · Score: 1

We have one of their products a few miles down the road from where I work.

From a Panasas press release:

The Panasas storage cluster can scale from Gigabytes to Petabytes in capacity and still be managed as a single system. Key attributes include: a single virtualized global namespace, dynamic load balancing and Quality of Service attributes. These combine to simplify ongoing operation of the system while maintaining peak effectiveness.

No Redundancy? by Giggles+Of+Doom · 2005-10-25 07:57 · Score: 4, Insightful

A PETABYTE without redundancy? I can't imagine having that much data I didn't care about.

--
"A coward dies a thousand deaths, the brave but one."

Re:No Redundancy? by Anonymous Coward · 2005-10-25 08:10 · Score: 1, Funny
I can think of lots of data that falls into that category. Any inherently lossy high-volume data would tolerate the loss of a few more percent as disks fail.
- Archives of spam, for developing smarter spam filters.
- Archives of the internet.
- Sensor data from wide-networks of sensors
- Archives of consumer info (addresses, demographics, etc) used by spam companies on low-margin viagra-ad-spamming.
- But most probably large porn archives.

I have about 2.5TB of it here; and I'm an amateur. If I were a porn business, I could easily desire acquiring 20X what I have. And in those quantities, 90+% of it isn't worth backing up.
Re:No Redundancy? by digidave · 2005-10-25 08:12 · Score: 4, Funny

"I can't imagine having that much data I didn't care about."

Hollywood script archive.

--
The global economy is a great thing until you feel it locally.
Re:No Redundancy? by guygo · 2005-10-25 08:47 · Score: 1

The Congressional Record
Re:No Redundancy? by Anonymous Coward · 2005-10-25 08:48 · Score: 0

" A PETABYTE without redundancy? I can't imagine having that much data I didn't care about."

Slashdot posts maybe? Your email, usage statistics, maybe the next big MMORPG world data. These are all things that are not business vital in the short term, and he did say initially, so it would only be 25TB that was not redundant.
Re:No Redundancy? by Anonymous Coward · 2005-10-25 08:58 · Score: 0

It's a cache to speed up access to non-locally-stored porn. If you lose it, it just increases latency times for a while.
Re:No Redundancy? by ichigo+2.0 · 2005-10-25 09:15 · Score: 3, Funny

Maybe he needs it for a swap file? I heard Microsoft upped the memory requirements in the next version of windows.
Re:No Redundancy? by JohnsonWax · 2005-10-25 09:56 · Score: 1

You don't need that much storage.

Just store the first script and then the diffs for all subsequent scripts. I think you'll find that the incremental changes from one Hollywood script to the next are astonishingly small.
Re:No Redundancy? by mobby_6kl · 2005-10-25 09:59 · Score: 1

"I can't imagine having that much data I didn't care about."

Hollywood script archive.

Slashdot comments.
Re:No Redundancy? by Frodo+Crockett · 2005-10-25 10:10 · Score: 1

"I can't imagine having that much data I didn't care about."

Hollywood film archive.

Fixed it for you.

--
"The newly born animals are then whisked off for a quick run through a giant baking oven." --heard on Food Network
Re:No Redundancy? by Anonymous Coward · 2005-10-25 16:33 · Score: 1, Funny

Ah, but the grandparent post said no redundancy.
Re:No Redundancy? by Giggles+Of+Doom · 2005-10-26 02:29 · Score: 1

Perhaps, but if you're spending the cash to get a PB of storage, I would think spending the additional $5 to get RAID5 or some other form of protection would be something the company could swing for.

--
"A coward dies a thousand deaths, the brave but one."
Re:No Redundancy? by ichigo+2.0 · 2005-10-26 20:34 · Score: 1

But RAID 5 isn't redundant. And if they'd use RAID 1, they'd need to double the amount of HDD's.

Or better yet use Slashdot by pebs · 2005-10-25 07:58 · Score: 0

register a few thousand gmail accounts and write the interface that will make writing of data to gmail inboxes invisible to the app.

Or you could use a system that posts to Slashdot. It will use signatures to ensure authenticity so Slashdot trolls can't mess up your data (or by Slashdot user, but I'd imagine they'd get banned pretty quick so you better use AC). To make changes to existing data it will write diffs (like CVS does). You will need a large number of IP addresses not in the same subnet (they can ban subnets). Getting past the lameness filter will be the only real challenge here, and that's not that hard.

Or you could go the easy way and use user journal postings to store the data.

--
#!/

Re:Or better yet use Slashdot by Anonymous Coward · 2005-10-25 09:09 · Score: 0

And everybody knows that your data would be backed up using Slashdot's redundancy engine!
Re:Or better yet use Slashdot by Anonymous Coward · 2005-10-25 10:35 · Score: 0

Or you could use a system that posts to Slashdot. It will use signatures to ensure authenticity so Slashdot trolls can't mess up your data (or by Slashdot user, but I'd imagine they'd get banned pretty quick so you better use AC). To make changes to existing data it will write diffs (like CVS does). You will need a large number of IP addresses not in the same subnet (they can ban subnets). Getting past the lameness filter will be the only real challenge here, and that's not that hard.

I think someone's already doing that here.

Coraid EtherDrive by enderak · 2005-10-25 07:58 · Score: 1

http://www.coraid.com/products.htm

I haven't used it, but it caught my eye a while back and looks promising. 500GB per disk, 15 disks per 3U shelf, and up to 65,536 shelves per network means it's expandable from 7.5 TB for one shelf up to (theoretically) 480 PB or so.
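
The "480 PB or so" figure checks out if you count 1024 TB to the PB. A quick sketch of the arithmetic, using only the numbers quoted above (the variable names are just for illustration):

```python
# Theoretical EtherDrive capacity from the figures quoted in the post.
GB_PER_DISK = 500
DISKS_PER_SHELF = 15
MAX_SHELVES = 65_536

shelf_tb = GB_PER_DISK * DISKS_PER_SHELF / 1000  # decimal TB per shelf
max_tb = shelf_tb * MAX_SHELVES                  # TB with every shelf address used

max_pb_binary = max_tb / 1024    # counting 1024 TB per PB
max_pb_decimal = max_tb / 1000   # counting 1000 TB per PB

print(shelf_tb)        # 7.5
print(max_pb_binary)   # 480.0
print(max_pb_decimal)
```

With 1000 TB to the PB the ceiling is a bit higher, around 490 PB. Either way it is a limit of the addressing, not something anyone would actually build.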

If you have to do this yourself... Use Solaris by forq · 2005-10-25 07:58 · Score: 1

You can wait for Sun to release ZFS, install Solaris 10 on an x86 box (or buy a new Sun X4100), purchase as many Promise VTrak 15200s as you require, configure them as iSCSI targets, and then use the Solaris 10 iSCSI initiator to mount them. Then put them in your ZFS pool.

Use your head when configuring redundancy, and glory in your new found storage availability and capacity.

Good luck!

company suggestion by Anonymous Coward · 2005-10-25 07:59 · Score: 0

The company I work for uses LeftHand (http://www.lefthandnetworks.com/). They are an iSCSI solution. Totally scalable: if you want more storage, just buy a few more units. You can also carve it up to be one gigantic volume if you wish. Pretty darn cheap also. Beats the heck out of EMC on price.

Why not NFS? by Chuck+Messenger · 2005-10-25 07:59 · Score: 0, Redundant

Why not just a bunch of PCs, each with 6x400GB drives? That's 2.4 TB per PC. 25TB is only 10 PCs -- 60 drives. What's the big deal just using NFS? Seems like not a very difficult target to hit.

Now, going to 1 PB -- that would be 400 PCs. At this point you've left the domain of something which is all that simple. Even so, if you use wake-on-LAN, you could no doubt get away with having 400 PCs, without special power, heating/cooling, etc -- as long as you were able to control the flow of information, so that no more than a tiny fraction of the PCs would be on at a given time. And, of course, you'd have significant latency -- waiting for a node to wake. Since I don't know your requirements, I can't say how significant a barrier this would be.

You would get into a reliability issue with 400 PC's. Again, it would be useful to understand something about your problem space. It's possible that a butt-simple method would work for you.

Really, the difficulty/cost of the implementation is a direct function of data flow. If your data flow requirements are tiny, then the solution could be quite cheap.

Of course, even with low outgoing data flow, you'll still have your hands full just filling up the disks in the first place! Consider this: a typical cheap hard drive on a cheap PC can sustain, say, 5MB per second. If one PC is handling 2.4 TB, that would take 5-6 days to fill up. I suppose it would be reasonable to fill up, say, 10-20 PCs at a time. For the 25 TB system, you could fill it up in a week, easy. But for the 1 PB system, filling 20 PCs at a time, it would take you 3 months! Still, that might be OK, depending on your requirements...
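
The fill-time estimate can be redone with a few lines of Python. This is only a sketch under the post's own assumptions (decimal units, a sustained write rate per PC, PCs filled in fixed-size batches); `fill_days` and `total_days` are names made up here.

```python
# Back-of-the-envelope fill times for the NFS-on-commodity-PCs idea.
SECONDS_PER_DAY = 86_400


def fill_days(capacity_tb, mb_per_s):
    """Days to write capacity_tb (decimal TB) at a sustained mb_per_s."""
    seconds = capacity_tb * 1_000_000 / mb_per_s
    return seconds / SECONDS_PER_DAY


def total_days(n_pcs, batch_size, days_per_pc):
    """Days to fill n_pcs when only batch_size of them are written at once."""
    batches = -(-n_pcs // batch_size)  # ceiling division
    return batches * days_per_pc


per_pc = fill_days(2.4, 5)            # one 2.4 TB PC at 5 MB/s
small = total_days(10, 20, per_pc)    # the 25 TB system, one batch
large = total_days(400, 20, per_pc)   # the 1 PB system, 20 batches

print(round(per_pc, 1))   # days per PC
print(round(small, 1))    # days for 25 TB
print(round(large))       # days for 1 PB
```

Under these assumptions one PC takes about 5.6 days, and the 1 PB case comes out at roughly 111 days, so somewhat more than the 3 months quoted. Raising `mb_per_s` scales everything down proportionally.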

Re:Why not NFS? by Anonymous Coward · 2005-10-25 08:01 · Score: 0

You have got to be kidding right?
Re:Why not NFS? by pe1chl · 2005-10-25 08:31 · Score: 1

Typical cheap harddisks in cheap PCs sustain 50 MB/s these days...
(the times, they are changing)

Archivas? by Anonymous Coward · 2005-10-25 07:59 · Score: 0

Have you looked into the GFS from Redhat - or something like the Archivas (http://www.archivas.com/) ArC system - the latter is commercial, but sounds like it fits the bill for doing what you need, and it supports TPOF configurations, commodity hardware, etc.

easy, use a RAID! by matt+me · 2005-10-25 08:00 · Score: 1

Use a RAID.

(I'm not entirely sure what that is, but I know that it's storage-related, and this is Slashdot, and there's a small chance that even if my comment isn't at all helpful, it might just be incredibly funny)

Easy, cheap, commodity solution by Scoth · 2005-10-25 08:00 · Score: 1

One thing came to mind when reading this:

http://ohlssonvox.8k.com/fdd_raid.htm

Cheap hardware, commodity interface and storage media, dirt cheap... Now, you'd need over 18 million of the things for the low-end capacity, but they'd be easily replaceable, probably hot-swappable, and might actually be somewhat durable ;)

I'd pay good money to get a tour of a company with rows and rows of iMacs with 127 floppy drives hanging off each one... :)

Talk to the right people .. oh and some pointers. by MarkTina · 2005-10-25 08:01 · Score: 1

Go have a chat with your local EMC, HDS, HP or IBM rep .. this is bread and butter stuff for them and they'll give you heaps of whitepapers.

Personally I think you'd be insane to put it all in as one volume and then sharing it to multiple computers ... but I'm sure you've a valid reason for doing it .. right ?

Software you might want to look at :

Transoft FibreNet (gobbled up by HP, may still exist as one of their products)
Tivoli SANergy (not bad actually)

Wow by thesnarky1 · 2005-10-25 08:01 · Score: 1

Only time I've ever needed anything this massive is for my porn collection which is only in the terabytes, not petabytes, sorry.

--
Want to find other gamers to play board and role playing game

Sun! by Anonymous Coward · 2005-10-25 08:03 · Score: 0

They need you as much as you need them!

http://www.theregister.co.uk/2005/10/25/sun_grid_slip/

Three choices, you've made one by flying_monkies · 2005-10-25 08:03 · Score: 1

The three choices are:

1) Big
2) Cheap
3) Easy

You've already chosen Big, I'd recommend making your other choice Option 3. As a couple of other people have said call EMC, Hitachi or IBM and price a SAN. After you've got your hardware, drop a copy of Veritas VxFS/VxVM on your server(s). The first time you'll find out about a hard disk failure is when the wrench monkey from whichever hardware vendor you decide on shows up to replace the disk because the box called for help and Veritas makes FS/Volume management so easy even a paper MCSE can puzzle it out.

--
I disagree with what you say, but I'll defend your right to say it to the death - Voltaire

Nowhere near enough information by kingsqueak · 2005-10-25 08:03 · Score: 1

There isn't enough detail here at all to begin with recommendations.

Small budget...

There's small ($10k) and there's small ($1M). Small? Quantify that.

What is the nature of the data?

There is a massive difference between a PB of static files that are rarely accessed or indexed and a PB that is a highly transactional database.

What size are the files themselves?

What SLA has to be met with regard to the throughput?

Most of the arguments already made in favor of redundancy are also true. How could you have a PB of anything that isn't worth protecting? Just the cost of assembling a PB of data or having to restore it from a single disk fault would pretty much demand having a volume management system in place with redundancy.

If none of this has occurred to you, hire someone with the background needed to architect this for you.

This is easy by stanmann · 2005-10-25 08:04 · Score: 1

1. Price out 25 TB of storage using 250 GB drives.
2. Multiply by 4 to account for media failure
3. Multiply by 5 to account for support hardware
4. Multiply by 2 for support and maintenance
5. Multiply by 2 for unexpected changes
6. Note that this solution will NOT SCALE to 1PB and multiply by 10
7. Compare numbers with EMC/TERAData/IBM/etc

End up buying the turnkey solution anyway

--
Food not Bombs is a nice platitude but it breaks down when you notice that the Bombees are usually well fed

Re:This is easy by stanmann · 2005-10-25 08:09 · Score: 1

OOPS forgot to put the totals and running totals in:
1. 10K
2. 40K
3. 200K
4. 400K
5. 800k
6. 8M
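
The running totals above are just the base price pushed through each multiplier in turn. A small sketch that reproduces them, assuming the $10K starting figure from step 1 (the per-drive price is only what that figure implies, not a quote):

```python
# Reproduce the running totals from the multiplier list.
from itertools import accumulate
from operator import mul

BASE_COST = 10_000               # step 1: 25 TB of 250 GB drives, per the post
MULTIPLIERS = [4, 5, 2, 2, 10]   # steps 2 through 6

drives = 25_000 // 250                 # 250 GB drives needed for 25 TB
implied_per_drive = BASE_COST / drives

running = list(accumulate([BASE_COST] + MULTIPLIERS, mul))

print(drives, implied_per_drive)
for step, total in enumerate(running, start=1):
    print(f"{step}. ${total:,}")
```

Swap in your own base price or multipliers to see how quickly the overheads swamp the raw disk cost.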

--
Food not Bombs is a nice platitude but it breaks down when you notice that the Bombees are usually well fed

Intermezzo... by JoeLinux · 2005-10-25 08:04 · Score: 1

Check it out...fits the bill nicely.

Another route by Sir_Ace · 2005-10-25 08:05 · Score: 1

I would shy away from software... There are Raid controllers out there that will span a raid across multiple controllers, and multiple machines. LSI has some nice ones.
You can put together a ton of NO-O/S servers with the raid controllers, and interlink them all doing a hardware raid across the machines. Only one machine {preferably the main node} needs an O/S and then you can attach it to the network through that O/S.

It costs more, but is FAR more reliable.

Re:call EMC. i am sure their clarion line will han by Datamonstar · 2005-10-25 08:07 · Score: 1

Sadly, he's right. It's funny to see marketing guys clamoring about like worker ants from rival colonies going after the same crumb, the crumb in this instance being the contract for storing our data. But in the end, it's a hell of a lot more stress-free and *perhaps* cheaper in the long run to get an all-in-one solution like EMC. Unless you've got a very talented and devoted group to create some brilliant software solutions, that is.

--
The eternal struggle of good vs. evil begins within one's self.

2 PB and redundancy doesn't matter? by Steepe · 2005-10-25 08:08 · Score: 1

Sounds VERY strange to me. We only have 2TB of storage and it's RAID 10.

To answer your question.. you get what you pay for. The last sysadmin here was a lets do it on the cheap kinda guy, and it took me 2 years to fix that shit. The way you save the cash is you play vendors off each other. I just bought a fully redundant high speed storage array that does everything for 82k. (list price is around 180k) Took a few weeks, and I played vendors off each other like mad puppies, but I got more than I wanted for what I wanted to pay.

If you try to do storage on the cheap, 75 tb of storage for what I'm paying for 2 is cheap, so if your budget is under 150k or so, then look for another job. You are never going to have anything but problems.

Everything on one drive is a huge problem too... seek time will be out of this world.

--
Just three more hours seapeople and you can finally take me away from this crappy God Damned planet full of hippies

Re:2 PB and redundancy doesn't matter? by Anonymous Coward · 2005-10-25 08:31 · Score: 0

You paid 82k for a redundant 2 TB RAID 10 server? What were you thinking? I built a 1 TB RAID 5 Array out of 300 gig disks (with a hot spare to boot) for under two thousand dollars, and it can scale to 4 TB by just adding additional disks and controllers. The fact of the matter is, you can build rather large storage systems (10+ TB), with great redundancy provided you know what you are doing that will undercut anything any large data vendor has out there.

Take your car into the shop, spend $1000, 80% of which goes towards labor. Do it yourself will always be cheaper but there are times when you still should bring in the experts (like building a new engine from scratch, or replacing the one you had, unless you are REALLY talented).

Also, see:

http://www.xav.com/scripts/misc/1016.html
http://www.accs.com/p_and_p/TeraByte/
http://vtwug.w2k.vt.edu/pdf/chubtoad.pdf
http://www.phy.olemiss.edu/HEP/sanders_chep03.pdf
http://staff.chess.cornell.edu/~schuller/raid.html

Note: Many of these were done years ago and costs and software have only gone down and improved. I would strongly suggest to go the cheap way, which if done right would result in cost savings in the hundreds of thousands if not millions for this project.
Re:2 PB and redundancy doesn't matter? by Steepe · 2005-10-25 09:19 · Score: 1

I'll take mine over yours any day. Mine does iSCSI, Fibre Channel, 175,000 NFS operations per second, will mount windows machines, boot blade servers, and all that, with 5 9's uptime. (completely redundant including all heads and drive storage.)

Mine is expandable to 80TB, 2 is all we need at the moment.

--
Just three more hours seapeople and you can finally take me away from this crappy God Damned planet full of hippies
Re:2 PB and redundancy doesn't matter? by Anonymous Coward · 2005-10-25 09:47 · Score: 0

To each their own - I store a bunch of DVDs on mine - no real point in anything that fancy or that expensive to stream DVDs across a gigabit network. I'd rather keep the 80k in my pocket because it does what I need. I still think that your system could have been built for 30k or less.

Hell, BUY it from EMC! by Genady · 2005-10-25 08:08 · Score: 5, Interesting

As a VERY satisfied customer, I say, just buy the damned thing from EMC. There's few enough warm fuzzy feelings that SysAdmins have in this day and age, like your CE calling at 7:00am saying: "Hey, you had a few hard SCSI errors on Disk 3 Enclosure 0 Tray 0 last night, that's your production LUNs isn't it? There should be a courier there with a disk by 10, and I'll stop by to make sure things are hotsparing back properly after you replace the disk okay?" And *THIS* is just because my CE knows I can handle replacing a disk. Normally he'd come out and do that, and sit around while it re-built the Raid Group.

Yeah, EMC costs. THIS is why. The support, when needed, is top top top notch. Which would you rather have in a DR situation?

--

What if it is just turtles all the way down?

Re:Hell, BUY it from EMC! by lukewarmfusion · 2005-10-25 08:24 · Score: 3, Funny

And you'll probably get at least one nice lunch out of the sales deal. I recommend saving your lunch money and asking for sales visits from all of the major players.
Re:Hell, BUY it from EMC! by twiddlingbits · 2005-10-25 08:31 · Score: 1

Agreed. Sun now owns StorageTek. With a StorageTek SAN and some Sun AMD 1U server boxes using Linux, and Fibre Channel HBAs you get a lot of capability and performance for less than EMC. IBM SHARC is also very good as are the boxes from Fujitsu, Hitachi and Ingenia but none of these are "low cost".
Re:Hell, BUY it from EMC! by Anonymous Coward · 2005-10-25 08:41 · Score: 0, Troll

A certain very large brick and mortar bookstore had a multi-day mainframe outage as a result of EMC bungling. The outage shut down loading docks across the country, etc.

They were running EMC storage on their IBM in a raid 1 config. Yes, EMC prefers that you run the system drives on their storage. They tried to get me to move my system drives onto their storage years ago (crashola when you can't swap due to EMC performance issues/cache full).

After upgrading one of the redundant nodes ("source load", whatever that is), it wouldn't come up. No problem, they thought.. They shutdown and replicated the contents of the other node's system volume. Then, neither would come up.

It was very, very bad. Don't always believe EMC hype. They were clueless on this and the damage was in the many millions (just due to the docks being shutdown).

I could tell you stories about performance on EMC vs. their hype.. But you'd die of boredom. Instead, get your boss to kick down the cash for the 'real thing' and go eat free food with them, see major sporting events and concerts. They have suites in all major baseball, football, hockey and basketball arenas. Your boss sounds cheap, party with EMC.
Re:Hell, BUY it from EMC! by shokk · 2005-10-25 09:02 · Score: 1

Go Network Appliance and you get the same treatment. Their hardware is rock solid and completely hands off for many months at a time. I only ever touch them to add filesystems. You get what you pay for, so remember that if you cheap out YOU have to live with it, not your boss.

--
"Beware of he who would deny you access to information, for in his heart, he dreams himself your master."
Re:Hell, BUY it from EMC! by Anonymous Coward · 2005-10-25 09:03 · Score: 0, Troll

Have you ever seen a single drive failure cause a 30TB SAN to grind to a halt? I have, thanks to EMC.

Needless to say, we are keeping our future storage options open.

Give IBM a call, the DS4000 series sounds like just the thing you are looking for. IBM doesn't screw you over in software license renewals either (unlike some other companies).

But hey, EMC did buy us lunch after our 4 hour outage.
Re:Hell, BUY it from EMC! by Anonymous Coward · 2005-10-25 10:00 · Score: 1, Informative

Have you ever seen a single drive failure cause a 30TB SAN to grind to a halt?

Sure have, it happened when some dumbass striped their database across multiple hypers on the same physical spindle. There's a reason they give you those device layout maps, you should probably pay attention to them.
Re:Hell, BUY it from EMC! by Anonymous Coward · 2005-10-25 11:36 · Score: 0

Yeah, IBM. Great idea.

We have dozens of IBM servers with IBM RAID controllers that keep eating disks. And what's even better, sometimes it takes a few days for a RAID controller to decide that a disk is bad so it just hangs from time to time until it marks the disk as defunct.
Re:Hell, BUY it from EMC! by Anonymous Coward · 2005-10-25 12:34 · Score: 0

They are expensive because they pay their sales folk enough to drive porsches and bribe your coworkers with questionable if not illegal "gifts". And then don't pay their engineers enough to make something that actually works.

I suggest you look into Hitachi or IBM.
Re:Hell, BUY it from EMC! by mathrock · 2005-10-25 16:18 · Score: 1

Why does everyone here seem to have such a hard-on for EMC?!

My dealings with them have been absolutely horrible, plus this person's problem (25TB - 1PB) in a shared/clustered file system isn't a problem that EMC has a solution for....

Centera you say, BULLSHIT. You're not going to be able to scale the Centera very well since it does NFS and NFS pretty much blows if you need really good performance/bandwidth. Ok, you don't have to use NFS you can write your application to the Centera API, but that doesn't work well with legacy applications where you cannot change them...

So as far as I know, EMC isn't going to be able to solve this person's LARGE shared/clustered/global/distributed file system problem...
Re:Hell, BUY it from EMC! by Anonymous Coward · 2005-10-26 01:13 · Score: 0

You got modded a troll, but I'll respond anyhow. I assume you were booting from SAN ("EMC prefers that you to run the system drives on their storage"). If you're having "cache full" issues, I can only assume that you were running on Clariion storage. The top of the line Clariion, though the marketspeak says 8GB of cache, only gives you ~3GB that's actually usable. If you're booting systems from Clariion that are responsible for "many millions" of real dollars, you were asking for trouble to begin with.
Sounds to me like you tried to do it on the cheap and got burned. Much like the guy who's Asking Slashdot... If you're using EMC gear and your business depends on it, buy a Symmetrix.
Re:Hell, BUY it from EMC! by Phishcast · 2005-10-26 01:20 · Score: 1

Without knowing what the storage is to be used for I don't think you can reasonably assume that Centera would fit the bill. It's not simple disk you just lay a filesystem on top of, it's an object-based archival solution. You can put about 100TB of ATA storage into a single Clariion. The new DMX-3 line will scale much larger than that.
Admittedly, these solutions aren't cheap, but as other people have pointed out the cheap route has enormous pitfalls when you start talking about this much storage.
Re:Hell, BUY it from EMC! by Anonymous Coward · 2005-10-27 03:02 · Score: 0

It was a top of the line Symmetrix. It was a fairly small company and we had $8M worth of them.

I could write a book on EMC BS. I also spent time in Hopkinton working with their engineers on performance issues, so don't try and blame the 'field'.

Open Enterprise server? by isotope23 · 2005-10-25 08:08 · Score: 1

http://www.novell.com/products/openenterpriseserver/iscsi.html

NSS 3.0 does up to 8TB I believe. XFS does 9PB?

--
Service guarantees Citizenship! Questions Guarantee GITMO.... Amerika Uber Alles!

Another, There Is. by LifesABeach · 2005-10-25 08:09 · Score: 2, Insightful

If designing for speed, NOT cost:
given 2PB = 1 Human Brain, non interlaced
1024 TB == 1 PB
1 TB == 1 PC Computer with 1200GB H/D, 2Gig RAM, Networking

If designing for cost, NOT speed:
1 DVD = 4.5GB
1 PB = 1024 TB = 1,048,576 GB
1 PC Computer, with a DVD like the one mentioned above.
1 Robotic CNC Arm, with DVD Gripper(tm)
1 Very Huge Wire Cage to hold DVD's like a Juke Box.
(This has been done before, but with Tapes)

Re:Another, There Is. by ErikZ · 2005-10-29 12:21 · Score: 1

Uh, your solution would require a quarter of a million blank DVDs.

At that point, I would use hot swappable hard drives. 4,200 250GB drives is far more manageable.
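
Both counts are easy to verify. A minimal sketch, taking the 1 PB = 1,048,576 GB figure and the 4.5 GB per DVD from the parent post as given:

```python
# Media counts for one petabyte, using the parent post's unit conventions.
import math

PB_IN_GB = 1024 * 1024   # 1,048,576 GB, as in the parent post
DVD_GB = 4.5
DRIVE_GB = 250

dvds = math.ceil(PB_IN_GB / DVD_GB)
drives = math.ceil(PB_IN_GB / DRIVE_GB)

print(dvds)    # 233017, close to a quarter of a million discs
print(drives)  # 4195, in line with the roughly 4,200 drives quoted
```

Neither number includes spares or any redundancy, so the real counts would be higher.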

--
Democrats or Republicans. They are both taking us to the same place and they are not afraid of us anymore.

Nutch: NDFS by otisg · 2005-10-25 08:09 · Score: 1

See http://lucene.apache.org/nutch/ and look for Nutch NDFS (something similar to Google's FS you mentioned). I use Nutch over at Simpy (think Web 2.0) and am very happy with it.

--
Simpy

/dev/null by sho222 · 2005-10-25 08:09 · Score: 1

It seems like I found a magic disk on my system that will take unlimited data! I've been storing huge amounts of stuff there for the past few months, and still haven't run out of disk space. Try it out: /dev/null

Retrieving the data, on the other hand, has been problematic... I'm figuring that when I really need it, I'll just post to Ask Slashdot and somebody will help me out.

Re:/dev/null by joe_bruin · 2005-10-25 09:21 · Score: 1

Ah, you're confused. The data that you stored in /dev/null can be retrieved from /dev/random. However, it may take an unbounded number of read operations to get the data you want out of it.

Backups by killermookie · 2005-10-25 08:11 · Score: 1

I'm sorry, but if you're going to build a massive disk system like this and plan on filling a good portion of it up then you need to have backup plans in place. In the event that you lose data, how will you recover?

So this isn't about how to build 1 PB of disk space but two: One is the main and the other is the backup.

Can you afford that?
Can you afford to lose the data?

Re:Backups by rhaig · 2005-10-25 16:58 · Score: 1

depending on how the data behaves this answer varies greatly. if the data never changes, but more of it is added, then perpetual incrementals is the way to go. if the data changes, then block level incrementals will help you get your backups done, but restores will be a bitch.

The question is really about the restore. once you have a PB on disk, it's going to take weeks to restore. no matter how you back it up. disk to disk is the only option then with a forklift restore.

--
"We are not tolerant people. We prefer drastically effective solutions"

Red Hat GFS != Google FS by Anonymous Coward · 2005-10-25 08:11 · Score: 2, Informative

Read the post that you're replying to more carefully next time.

iSCSI storage / san by pasikarkkainen · 2005-10-25 08:13 · Score: 3, Informative

There seems to be lots of SATA-RAID based iSCSI SAN devices available nowadays.. Some links to products I have seen:

http://www.equallogic.com/ They make nice SATA-RAID based iSCSI SAN devices with all the features you could expect (volumes, snapshots, array/volume-expansion, hotswap, redundant controllers, redundant fans, etc).

http://www.equallogic.com/pages/products_PS100E.htm
14 250G sata disks, 3U, 3.5 TB of raw storage.

http://www.equallogic.com/pages/products_PS300E.htm
14 500G sata disks, 3U, 7 TB of raw storage.

http://www.equallogic.com/pages/products_PS2400E.htm
56+ TB

Looks good. I have not yet used them myself :)

Another iSCSI SATA SAN possibility:
http://www.mpccorp.com/smallbiz/store/servers/product_detail/dataframe_420.html
16 SATA disks, review:
http://www.infoworld.com/MPC_DataFrame_420/product_53700.html?view=1&curNodeId=0

This company also has SATA iSCSI SAN devices:
http://www.dynamicnetworkfactory.com/products.asp/section/Product~Categories/category/iSCSI/options/IPBank/drivetype/L~Series/formfactor/Integrated/inface/SATA~-~Serial~ATA

iSCSI SAN comparison:
http://www.networkcomputing.com/story/singlePageFormat.jhtml?articleID=170702726

There are also software iSCSI target solutions for use with your own/custom hardware.
http://iscsitarget.sourceforge.net/ for building linux-based iSCSI target/SAN.

If you are familiar with iSCSI targets / iSCSI SAN devices please post your comments!

Re:iSCSI storage / san by nzin · 2005-10-25 08:36 · Score: 1

There are also some vendors like NetApp (http://www.netapp.com/), well known for this kind of stuff.

You should certainly test the iSCSI implementation on Linux.
Re:iSCSI storage / san by pasikarkkainen · 2005-10-25 08:56 · Score: 1

The big question is do you need block-level or filesystem-level access to your data?

Netapp devices can export volumes as block-devices via iscsi and fc or as filesystems via nfs and cifs.
Of course you cannot access block-level volume via nfs, or filesystem volume via iscsi.. but both can be served separately from the same netapp box.

Why? by Crouty · 2005-10-25 08:13 · Score: 1

Why would you need one single volume for 25 TB of pr0n?
Or is there anything else that eats up these amounts of space?

--
On se Internetz nobody noes your German.

I built a 1.7 TB for about $2000 by composer777 · 2005-10-25 08:16 · Score: 2, Insightful

but I'm just a linux hobbyist and programmer, so take any advice I give with a grain of salt, but here's what I did for my setup at home. To start, you're looking at a little over $1000 per TB. And that's about as cheap as it gets with redundancy. I have 8 drives in one machine, it's in a RAID 5 config, and I have a hot spare. However, if I were doing this for a mission critical application, I would have it in a RAID 6 configuration with a hot spare, and buy a hot swap cage, which would further add to the costs. Then, I would simply export the RAID volume using iSCSI, and then see if there is a way to RAID all of the iSCSI volumes using a master server. I imagine that if you do it right, you could scale up such a system to a fairly large number of machines. You would probably want something faster than gigabit ethernet, probably 10,000 Mb/s (10 gigabit) connecting everything together, otherwise things could get a bit congested at the head node.

Where all this could get terribly expensive is in power requirements: it requires less power to run a cage of hard drives than it does to run a network of PCs. I'd imagine that any money you save on hardware, you would spend on your power bill. Either way, you're looking at, bare minimum, about $30K to start for 25 TB, and I would add another $10K padding just to be safe, to pay for stuff like a UPS (which you want), a high-end switch (which you'll also need), cabling, etc. In other words, it's not cheap, and like my parent just said, it will probably be cheaper in the long run to have someone like IBM do it for you. Do you really want to be responsible for 25-1000 TB of data?
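
As a rough check on those numbers, here is a back-of-the-envelope sketch in Python. It only scales the figures quoted in this post ($2000 for 1.7 TB, 25 TB target, $10K padding); the linear scaling is an assumption, and real prices would not scale this cleanly.

```python
# Rough cost scaling from the figures quoted in the post.
# Assumption: cost grows linearly with capacity, which is optimistic.
BUILD_COST_USD = 2000      # cost of the home-built array
BUILD_CAPACITY_TB = 1.7    # its usable capacity
TARGET_TB = 25             # initial capacity asked for
PADDING_USD = 10_000       # extra for UPS, switch, cabling

cost_per_tb = BUILD_COST_USD / BUILD_CAPACITY_TB
hardware_usd = cost_per_tb * TARGET_TB
total_usd = hardware_usd + PADDING_USD

print(f"cost per TB:    ${cost_per_tb:,.0f}")
print(f"25 TB hardware: ${hardware_usd:,.0f}")
print(f"with padding:   ${total_usd:,.0f}")
```

The hardware line comes out just under the $30K mentioned above, so the estimate is at least self-consistent.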

Re:I built a 1.7 TB for about $2000 by iamlucky13 · 2005-10-25 08:58 · Score: 1

Beyond just considering the cost of building an array as big as he's talking about, you only touched lightly on scalability issues. While it possibly would take a 10 gigabit connection to serve that much data (how regularly it's accessed would come into play there), I would expect you'd come out ahead with a distributed system, with all requests going first through an index server, which forwards you to the node with the data you're looking for.

As important or more so than the connection, is going to be finding and serving the data. You've got to have a pretty good index of where things are or searching will be even worse than grandma trying to find a long lost Word document using the Windows search feature. I don't know what your 1.7 TB array does for file management, but how well do you think it will scale to 15 times the data, much less a petabyte, and still provide reasonable response times on queries? I'm definitely no expert on this sort of thing, but my gut again tells me an index server is critical.
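
To make the index-server idea a bit more concrete, here is a toy sketch in Python. The class, node names and paths are all made up for illustration; a real index would need persistence, replication and a network protocol, none of which is shown.

```python
# Toy index server: maps each file path to the storage node holding it.
# Node names and paths below are hypothetical.

class IndexServer:
    def __init__(self):
        self._location = {}  # file path -> node name

    def register(self, path, node):
        """Record that `node` stores `path`."""
        self._location[path] = node

    def lookup(self, path):
        """Return the node to forward the request to, or None if unknown."""
        return self._location.get(path)


index = IndexServer()
index.register("/projects/a/report.doc", "node03")
index.register("/projects/b/data.bin", "node11")

print(index.lookup("/projects/a/report.doc"))
print(index.lookup("/missing"))
```

A client asks the index first, then talks directly to the node it names, so the index only carries metadata traffic and not the data itself.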
Re:I built a 1.7 TB for about $2000 by composer777 · 2005-10-25 11:56 · Score: 1

I agree. In fact, I have a bad feeling about this guy's situation. His whole question reminds me of the time I built a website for a friend of my brother who was launching a dot com in the late 90's. I was in my last year of college at the time, and figured that the experience would be a good thing to add to my resume. I found out the hard way why you never want to work for someone that wants to do things as inexpensively as possible. Not only did they want it done unreasonably cheap ($200), but they also wanted it done unreasonably fast (in a few days). Now, html isn't a problem, but graphic design is not my thing, and while I was willing to learn and do the best I could, they just weren't willing to wait.

In the end they ended up getting what they paid for, and I was so embarrassed by how terrible it was that I never even bothered putting it on my resume. The lesson I learned is first, I may know how to program but I'm not a graphics designer, and second, is that if people aren't willing to pay, send them somewhere else. Let them be someone else's headache.

ZFS: Zetabyte File System by Zemplar · 2005-10-25 08:16 · Score: 1

Although ZFS is not immediately available, it should be before long. Though this does not address your hardware concerns, choosing hardware compatible with either Solaris 10 or OpenSolaris would be beneficial, in my opinion.

A good ZFS introduction.

No Beowulf Cluster jokes!? by mister_llah · 2005-10-25 08:16 · Score: 1

I am amazed, this topic just screamed for one...

In Soviet Russian, Beowulf cluster jokes make you! ... ...

Ugh I feel so dirty.

--
MoM++ - A Classic Expanded - [Master of Magic 1.5]
http://mompp.sourceforge.net/

Not possible with a small budget by Anonymous Coward · 2005-10-25 08:18 · Score: 0

You are being asked to do the impossible. The simple answer is get more money and get something like a NetApp solution that has complete redundancy and protection.

There are so many issues that you are completely unaware of:

1) Data redundancy. 25 TB is a huge amount of data and a hell of a lot of disk drives. 1 PB is more than you can imagine. What is the likelihood that 1 of your disks will fail? What about 2 disks failing? If you lose two disks in a single-parity array, you will lose everything; can you handle that risk? NetApp has RAID-DP which, like RAID 6, offers protection from 2 disks going down, reportedly without the performance hits of typical RAID 6.

2) How are you going to allocate the space?
3) What about when you add more space? How will you handle this?
4) What are the physical limitations that you will encounter?
5) What about network card redundancy? Can your company handle the downtime if a network card fails and you need to go in and exchange it?
6) What about power failures?
7) Off-site storage?
8) How are you going to manage all these drives and be aware of what their status is?
9) What about when volumes get full?

The problem is that you have spent so much time thinking about technology when technology isn't the main issue; it's how to handle something this large.

Do yourself a favor and go with a reputable vendor like NetApp and save yourself a whole bunch of trouble.

Netware by Havokmon · 2005-10-25 08:18 · Score: 1

I believe Netware will do what you're looking for. Netware volumes can grow dynamically as you add disk space, and Netware does support iSCSI, so theoretically it should work. Then you export the whole thing as NFS, and you're set to go. (Plus you get kick-ass Netware management as a bonus).

If you really wanted to be cheap, just NFS mount that cluster(fuck?) to a Linux box and do all your user management from there. That way you'll only use 1 connection to the Netware box. Download a copy of Netware and check it out.

--
"I can't give you a brain, so I'll give you a diploma" - The Great Oz (blatently stolen sig)

Where are your specs? by metoc · 2005-10-25 08:19 · Score: 1

These are getting monotonous.

I want a terabyte storage solution, I want a networking solution, I want a cheap computer solution.

Want.Want.Want. Sounds like Bill Connolly's 'I want' rant.

Where are your specs!! and by the way the answer is 42.

I truly have no idea how one backs up a petabyte by aywwts4 · 2005-10-25 08:19 · Score: 1

Duh, With two petabytes. ;)

--
Web Developers: Celebrate to our roots! Animated Gifs and Tiled Backgrounds, dont let our history die!

Oracle's CFS2 by Anonymous Coward · 2005-10-25 08:20 · Score: 0

I think Oracle plans to release the second version of its cluster filesystem (CFS2?) as stand-alone, no Oracle DB needed. I guess it's not ready for primetime yet, but it might be interesting in a year.

file counts by bort13 · 2005-10-25 08:21 · Score: 1

You might also want to consider that the number of files on a petabyte-sized file system will not play nicely with things like Windows Explorer or various backup programs. But before all that, I'd check that the requirement that you have a file system this big is tied to a project that will make money for your business. I think the request for this implementation is based on some unsound logic -- e.g. how are you going to index all that? -- and should be reviewed before you go off half-cocked and start implementing.

If the logic were actually sound and I really thought this were a good idea (I don't), I'd talk to EMC/Hitachi/IBM just so you can get a price tag on what actually implementing this will cost. Then cobble together your homebrew solution and ask the company to pay you the difference in cash. Step 2: move offshore.

MogileFS by Anonymous Coward · 2005-10-25 08:21 · Score: 0

Check out MogileFS http://www.danga.com/mogilefs/ . It is open source and might meet your requirements.

What we've done (30TB so far) by bernz · 2005-10-25 08:21 · Score: 4, Informative

We've scaled this to 30TB so far. I'm not sure about 1PB, though. For us, redundancy and storage size is key, performance less so.

Storage nodes: 7 x 2.8TB 2U RAID5+1 boxen with Serial ATA. The 2.8TB is logical, not physical. The OS for each of those machines is RAMDISK based (something we concocted based on what I read about the DNALounge awhile back) so it helps curb disk failures of the storage nodes themselves. We avoid disk failure by using RAID5. Of course that doesn't protect against multiple simultaneous disk failures, but read on for more. Each of the storage nodes is exported via NBD.

Then we have a head unit, a 64-bit machine. This machine does a software RAID5 across the storage nodes using an NBD client. Essentially each storage node is a "disk" and the head unit binds and manages the software RAID5. So let's say a whole storage node goes down (for whatever reason it does), all the data is still intact. RAID5 rebuild time over the gigabit network is about 18hrs, which is acceptable. We even have another storage box as a hot-spare.

On top of that, we have the whole cluster mirrored to another identical cluster via DRBD in a different geographic location. This is linked by Gigabit WAN. So if we have a massive disaster and lose the entire primary cluster, then we have a secondary cluster ready to go. We needed to purchase the Enterprise version of DRBD ($2k US) but that's worth it because they're neato guys.

We use XFS as the filesystem. This system gives us 14TB of redundant "RAID-55 with a Mirror" space. Both clusters together? $85k.
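
The 14TB figure and the 18-hour rebuild can be sanity-checked with a few lines of Python. This is only a sketch: it assumes the hot-spare box is one of the seven storage nodes and that a rebuild rewrites one full 2.8TB member, neither of which is spelled out above.

```python
# Capacity and rebuild-rate check for a RAID5 built across storage nodes.
NODES_TOTAL = 7        # storage boxes per cluster
HOT_SPARE_NODES = 1    # assumption: the hot spare is one of the seven
NODE_TB = 2.8          # logical capacity per box
REBUILD_HOURS = 18     # quoted RAID5 rebuild time

active_nodes = NODES_TOTAL - HOT_SPARE_NODES
# RAID5 gives up one member's worth of space to parity.
usable_tb = (active_nodes - 1) * NODE_TB

# Rebuilding one member means writing NODE_TB in REBUILD_HOURS (decimal units).
rebuild_mb_per_s = NODE_TB * 1_000_000 / (REBUILD_HOURS * 3600)

print(f"usable capacity: {usable_tb:.1f} TB")
print(f"rebuild rate:    {rebuild_mb_per_s:.0f} MB/s")
```

Under those assumptions the usable space matches the 14TB quoted, and the implied rebuild rate sits comfortably below what a gigabit link can carry.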

When the cluster starts running out of space (about 70% or so), we add ANOTHER cluster of similar stats to the initial one and use LVM to join the two units together.

This has scaled us to 30TB and we're pretty happy with it. The read speed is very good (hdparm says Timing buffered disk reads: 200 MB in 3.01 seconds = 66.49 MB/sec) and the write speed is about 32 MB/sec. For what our application is doing, that's a fine speed.

Re:What we've done (30TB so far) by bernz · 2005-10-25 09:27 · Score: 2, Informative

I'll put this out as a side point since I'm the OP: If we had to do more than 50TB, I think we'd go to a "real" solution like EMC or something like that. This has been very good for us, but given the need for that amount of storage, we also now have the money to spend on a superduper storage machine. Homebrew has been wonderful to get to this point, but unless we get the kind of employees necessary to really write our own FS a la GoogleFS, I can't see us taking this solution that much further past where it is now, only because I can't see myself putting THAT much scalable trust into something like NBD or software RAID5. At least not without really, really close inspection of the limitations of that code.
Re:What we've done (30TB so far) by Anonymous Coward · 2005-10-25 09:48 · Score: 0

jesus.
Re:What we've done (30TB so far) by Cheeze · 2005-10-25 09:49 · Score: 1

That's pretty awesome. I've messed around with a lot of different file systems (my latest being IBM's GPFS). I evaluated DRBD and NBD but found they did not perform as I expected. I was using all fiber channel disks and 2Gb fiber interconnects and was getting like 50MB/sec transfers. When we moved to GPFS, I clocked it at around 225MB/sec, which is just about right where it should be.

--
Why read the article when I can just make up a snap judgement?
Re:What we've done (30TB so far) by bernz · 2005-10-25 09:57 · Score: 1

Given the poster's issue of VERY low cost massive storage, I like my solution. If he has the money for Fiber between all nodes, then rest assured, I agree very much with your idea. As I said, for us, size and price and redundancy were really important and performance much less so. But, as I also said, if we needed to scale beyond what we have now, your way is a much better idea.
Re:What we've done (30TB so far) by Anonymous Coward · 2005-10-25 18:14 · Score: 0

We've scaled this to 30TB so far. I'm not sure about 1PB, though. For us, redundancy and storage size is key, performance less so.

Storage nodes: 7 x 2.8TB 2U RAID5+1 boxen

Boxen?

You prancy, dancy little ass pirate. Join the Greek Navy already, twink!

Proprietary FS, commodity disk enclosures by SteveOU · 2005-10-25 08:22 · Score: 2, Informative

The filesystem's going to be the hardest component of this. I know of no open-source FS that could handle this. I'm assuming this is all online storage, and there is no desire to nearline it to tape. Ideally, you'd want something that could concatenate multiple LUNs (of RAIDed storage) without having to run through a volume manager. Nothing against volume managers, but it'd be another component to support. Looking at proprietary FSs, you've got CXFS from SGI, which could easily handle the PB requirement and plays nice on Linux. Sun's got QFS, which would max out at 1PB and could do the volume management bit easily. Linux support was a little flaky last time I used it, but it's a free download and evaluation; you could go get it right now.

IBM's SAN-FS would also meet the capacity needs and would have the advantage of providing nearline capability, if you're into that. Sun's SAM-FS is basically the QFS product with nearline-to-tape capability. Linux is only supported as a client OS there. Of course, if you buy the mantra that Solaris is 'open-source,' then that might not be an issue.

As for hardware with any of the above solutions, you're going to be looking at using multiple RAIDing disk enclosures of some kind. At a budget, probably SATA disks talking to the controller, and iSCSI to the host. FibreChannel to the host would be a little more costly, but might be worth it since iSCSI is just getting mature enough to be usable in production.

Re:Proprietary FS, commodity disk enclosures by fatcatman · 2005-10-25 11:10 · Score: 1

The filesystem's going to be the hardest component of this. I know of no open-source FS that could handle this.

Be enlightened: http://www.lustre.org/

"The latest version of Lustre is always available from Cluster File Systems, Inc. Public Open Source releases of Lustre are made under the GNU General Public License. These releases are found here, and are suitable for clusters with thousands of nodes and hundreds of terabytes of storage."

Ask Slashdot Formula: by jlarocco · 2005-10-25 08:22 · Score: 5, Funny

Dear Slashdot,
I have been tasked with (insert very difficult, very important job). This is very important to my company. I have (insert number much lower than it should be) dollars to do this. I do not want to use (insert company name specializing in this exact thing) because management thinks they are too expensive. I think I can do this (insert better/faster/cheaper/...) than said company, even though they have vastly more experience and have invested much more time and research than I have. My continued and future employment probably rests on this project. Please advise.

--
Maybe not

Re:Ask Slashdot Formula: by lakin · 2005-10-25 10:32 · Score: 3, Funny

Dear Sir,

Use Linux.

Regards,
Slashdot

--
Paul
Re:Ask Slashdot Formula: by Anonymous Coward · 2005-10-25 15:34 · Score: 0

And, of course, the entire solution must be Open Source. Because OSS rules!
Re:Ask Slashdot Formula: by porttikivi · 2005-10-25 23:39 · Score: 1

Well, being a Slashdotter and a hacker/geek is supposed to mean passion for new computer technology. New is supposed to be better than old. Implementing standard service with standard money is not good enough. The whole point (or the one rational point) of Slashdot is to discuss ways to do better with less money.

I think bernz put up a very good example there in a previous comment http://ask.slashdot.org/comments.pl?sid=166332&cid=13874861

--
Anssi Porttikivi / app@iki.fi

my gosh, just how big... by Anonymous Coward · 2005-10-25 08:22 · Score: 0

...of a porn collection do you have?

TinyDisk ? by Anonymous Coward · 2005-10-25 08:24 · Score: 0

It's distributed I believe.

Have you looked at.... by Farfromlosin · 2005-10-25 08:24 · Score: 2, Informative

Capricorn Tech? They power the Internet Archive. "Capricorn Technologies was founded in 2004 and provides petabyte-class storage solutions for organizations worldwide. Capricorn's PetaBox technology grew out of a search for high density, low cost, low power storage systems for the world's largest data collections. Capricorn Technologies is proud to be a leader in the next data storage revolution."

--
...because what good is power unless you can abuse it?

ugh by hpavc · 2005-10-25 08:24 · Score: 1

As always these questions lack an actual budget figure; even a ballpark number would be nice. Not to mention the specification of the data being laid down on the disk.

--
members are seeing something, your seeing an ad

You can make this cheap, at least for 25TB by Anonymous Coward · 2005-10-25 08:28 · Score: 0

I've been setting up linux based SAN (used to be NAS) for a few years now, and I know for a fact you can make a 25TB~100TB SAN with commodity hardware and low end drives...

There are many ways...
-LVM over several 5-10TB RAID5/6 servers (through FC, iSCSI or AoE)
-Software RAID 5/6 over several 5-10TB RAID5 servers (through FC, iSCSI or AoE)

This is the cheapest yet still reliable way to work with low end hardware. Expect your servers to crash, your disks to fail... AoE is not CPU intensive and doesn't require an expensive HBA; this is a good solution if you want to build a custom (no one will ever understand it), cheap SAN. iSCSI is easy to deploy but very CPU intensive (without an HBA). FC is extremely expensive.

Anyway, with this kind of setup, you'll only have "Mass" Storage, performance will be bad, security as well and you'll probably have a very high & frequent system downtime...

10 TB system by sanjacguy · 2005-10-25 08:30 · Score: 1

I'd be more tempted to just not worry about the power supply issue and go whole hog. My company recently purchased a NAS device called a Terastorus from Aberdeen solutions. Thing runs like a champ. Comes with an 80 gb internal HD for Storage Server 2003 and 24 500 GB HDs. We raid 5'd it and added a hot spare at the total cost of two terabytes. Total cost: maybe 7 grand.

But it's a heck of a lot better in both performance and cost than our EMC AX-100. That's just a big turkey!!!

Of course it weighs like 75 lbs and cranks out heat like a banshee, but it was cheaper than our two TB AX-100 and is a heck of a lot more reliable!

Fibre Channel 30TB in 7 RU by Ironsides · 2005-10-25 08:30 · Score: 2

Nexsan has a box called ATA Beast
RAID, Fibre Channel, 42 ATA drives per 7 RU chassis. Throw in 500GB drives and 1 parity drive for every 6 data drives and you have ~30 TB per chassis.

--
Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars

Re:Fibre Channel 30TB in 7 RU by himself · 2005-10-26 04:12 · Score: 1

We use a Beast or two for disk-based backups (under a VTL, virtual tape library), and we likes them. They're pretty cheap, and pretty easy. Mind you, I think it has to be unracked to add a disk, but them's the breaks. Oh, and they do SATA now, I hear.
Re:Fibre Channel 30TB in 7 RU by Ironsides · 2005-10-26 04:51 · Score: 1

Oh, and they do SATA now, I hear.

Even better. I've only had experience with the ATA Boys before, which have all 14 disks accessible from the front. With 42 drives in 7RU I'm not surprised it has to be unracked. Do you have to shut it down and open it up or are the drives just mounted from the top instead of the front?

--
Fly me to the moon Let me sing among those stars Let me see what spring is like On jupiter and mars
Re:Fibre Channel 30TB in 7 RU by himself · 2005-10-26 05:44 · Score: 1

I believe that the ATABeast's cover has to come off before the drives can come out, as they pull straight up.
Tangentially related, in the box we found a fearsomely sharp, translucent red plastic thing that looks like it's a tool for gutting deer, or like it fell off a superhero costume or something. The manual didn't mention it, but I guessed it's for grabbing hold of drives and yanking them up-and-out when they need to be replaced.
Anyway, go, Beast, go!

Asked to do VS. what needs to be done by Anonymous+Custard · 2005-10-25 08:31 · Score: 1

I've been asked to build a massive storage solution to scale from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software... ...At this point data redundancy is not a priority, however it will have to be addressed.

What you're asked to do isn't always what needs to be done. You're making a huge mistake if data redundancy for this enormous project is just an afterthought.

I don't know what role you play in your organization, but try to get the business-minded folks to tell you what they want to accomplish, and then YOU and your architecture people will decide what needs to be done to accomplish it.

With such vague requirements, how can they already know that you should build it from scratch instead of choosing a turnkey solution?

--
$8.95/mo web hosting

Maybe a distributed-DB-based FS? by davidwr · 2005-10-25 08:32 · Score: 1

I can't think of any specific versions, but if you don't need performance, a database-driven filesystem that sits on top of a networked file system should be doable, but costly.

A toy version:
DB simply keeps track of filenames and reasonably-sized chunks of files. Chunks of files are spread far and wide across the LAN.
The local filesystems store each chunk as a local file.
Hardware RAID provides protection against disk outages.
Performance sucks, and like any fake file system, file semantics may not equal a true local file system.
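
The toy version above could be sketched roughly like this in Python. Everything here is an assumption for illustration: an in-memory SQLite table stands in for the DB, plain dicts stand in for the machines on the LAN, and chunks are placed round-robin with a tiny chunk size so the example stays readable.

```python
import sqlite3

CHUNK_SIZE = 4  # bytes; tiny on purpose (real chunks would be megabytes)
NODES = {"nodeA": {}, "nodeB": {}, "nodeC": {}}  # stand-ins for LAN machines

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE chunks (filename TEXT, seq INTEGER, node TEXT, chunk_id TEXT)"
)

def put(filename, data):
    """Split data into chunks, spread them round-robin, record each in the DB."""
    names = sorted(NODES)
    for seq, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        node = names[seq % len(names)]
        chunk_id = f"{filename}.{seq}"
        NODES[node][chunk_id] = data[start:start + CHUNK_SIZE]
        db.execute(
            "INSERT INTO chunks VALUES (?, ?, ?, ?)",
            (filename, seq, node, chunk_id),
        )

def get(filename):
    """Look up the chunk list in the DB and reassemble the file in order."""
    rows = db.execute(
        "SELECT node, chunk_id FROM chunks WHERE filename = ? ORDER BY seq",
        (filename,),
    ).fetchall()
    return b"".join(NODES[node][chunk_id] for node, chunk_id in rows)

put("movie.avi", b"0123456789")
print(get("movie.avi"))
```

Every read goes through the DB and then out to several nodes, which is why performance is the weak point of this design.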

Per node hardware configuration:
Fast server stuffed with let's say 5 SCSI raid cards, dual power supplies, each with 15-disk RAID5 using 0.5TB drives with redundant power supplies: well under $30,000 for 7TB.
Power consumption: 3KW I'm guessing - hopefully that's a huge overestimate.
1PB is 140 of these give or take. That's $4.2M and 420KW of power.

Even if the power is 1/10th that, 42KW is still a big power bill.
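
For anyone who wants to play with the numbers, here is the same estimate as a small Python sketch. It takes the per-node figures above at face value (7TB, $30,000 and a guessed 3KW per node), so the results are only as good as those guesses.

```python
import math

TARGET_TB = 1000         # 1 PB in decimal terabytes
NODE_TB = 7              # usable capacity per node, as quoted
NODE_COST_USD = 30_000   # upper bound per node, as quoted
NODE_KW = 3              # guessed power draw per node
HOURS_PER_YEAR = 24 * 365

nodes = math.ceil(TARGET_TB / NODE_TB)
total_cost_usd = nodes * NODE_COST_USD
total_kw = nodes * NODE_KW
kwh_per_year = total_kw * HOURS_PER_YEAR

print(f"nodes needed: {nodes}")
print(f"hardware:     ${total_cost_usd:,}")
print(f"power draw:   {total_kw} KW")
print(f"energy/year:  {kwh_per_year:,} KWh")
```

Rounding up to whole nodes gives slightly more than the "140 give or take" above, but the totals land in the same range.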

--
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.

Can the company funding this really afford this? by @madeus · 2005-10-25 08:33 · Score: 2, Insightful

I appreciate this might not seem like helpful advice, but...

If you've been asked to do something like this by a company that can afford to buy one of the commercial off-the-shelf high volume storage solutions, then I honestly can't imagine any solution they try and knock up will actually work (as I'm not aware of any free software solution that's currently up to the task).

If your company doesn't have / can't raise the capital to buy a commercial system for a project of this scale, I can't possibly see how they could afford to screw up on this and go with an untested idea that could very well end up being a huge money sink they wouldn't be able to dig themselves out of - one that could doom the entire company and all its investors given the cost it could run to.

And of course, for such a big project, they should hire people who would already know how to do something like this (which is not a dig, it's just crazy to skimp on staff when you have an ambitious project which requires large amounts of capital investment).

That said...

If I were going to do large scale storage on the cheap, depending on the design of the software and the specific requirements (particularly if I was also developing the software we were going to use, or was able to set feature requirements and/or was able to make the modifications myself), I would build the largest standard file shares I could with SATA disks (using commodity hardware, hot swappable, running Linux, with front loading drive bays).

The specifics of handling the load balancing (via multiple front ends, multiple mount points, predetermined hashing to balance things out, proxies/caches, hooks in the file system calls, hooks in the application to talk to a controller, etc) depend entirely on the sort of application, however.

It's definitely likely to be far easier (and more cost effective) to have the software take care of knowing where the data is stored, rather than trying to build a single really large file share. I know of at least one very well-known large company that went down this route (with essentially elaborately hacked up versions of common OS software).
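
As an illustration of the "predetermined hashing" option, here is a minimal Python sketch in which the application itself works out which file share holds a given path. The share names are hypothetical, and plain modulo hashing is just the simplest choice, not necessarily what any real deployment uses.

```python
import hashlib
from collections import Counter

SHARES = ["share01", "share02", "share03", "share04"]  # hypothetical file servers

def share_for(path):
    """Pick a share deterministically from a hash of the path."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return SHARES[int.from_bytes(digest[:8], "big") % len(SHARES)]

# Show how evenly a batch of made-up paths spreads across the shares.
counts = Counter(share_for(f"/data/file{i:05d}.bin") for i in range(10_000))
for share in SHARES:
    print(share, counts[share])
```

The catch with modulo hashing is that adding a share changes the mapping for most existing paths, so growing the system means either moving data or switching to a lookup table or consistent hashing.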

The downside is you have to support whatever hack you come up with to do this, but that shouldn't be an enormous amount of work (and you can probably afford to hire someone to support it full time for significantly less than the cost of a support contract for a commercial solution).

Buy An EMC Symmetrix..Seriously by haplo21112 · 2005-10-25 08:34 · Score: 1

For the amount of space that you're talking about needing, it's going to save you TIME (and remember TIME=MONEY^2 in the IT field), money and headaches. You will never build anything that scales into that kind of territory without replicating the kind of technology that EMC has already created (or any other storage company; however, EMC is the clear leader in the field: a little more $ upfront, but HUGE ROI over the life of the product). The ease of use of the product and the tools available for it are going to save you and your company money in the end. It's a little more upfront, but the money you will save over the next 5+ years in labor costs, hardware maintenance, and POWER costs is worth it. A huge cluster full of JBOD is going to chew electrons like nobody's business. Never mind your time building and maintaining it... I will bet you that it will never be fully stable day to day; some part of it is always going to need poking at.

--
Power Corrupts,Absolute Power Corrupts Absolutely, leaving one person(group)in charge is absolutely corrupt.

Good point, bad data by fm6 · 2005-10-25 08:34 · Score: 2, Insightful

If you assume that the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks!

Not a sound assumption. Things don't fail uniformly over time. Suppose 70 babies are born with a life expectancy of 70 years. Is one of them guaranteed to die every year for the next 70 years? Obviously not. If they avoid some joint disaster (like they all take a trip on the Titanic), most of them will die within a decade or so of the 70-year mark.

Same with disk drives: most failures will be clustered around the 57-year mark. Not that your attitude towards redundancy is wrong. Just as people sometimes die in infancy, some disk drives break down quickly. So there's a chance that you'll lose some drives from your thousand-disk system in the first year.

How big a chance? To answer that question, you need more statistics about drive failure — and a much better grasp of probability theory.

Re:Good point, bad data by forand · 2005-10-25 08:46 · Score: 1

Your logic doesn't make any sense. No one has tested ANY modern hard drive in a situation as is being discussed for anywhere near 57 years. What they do is almost exactly what the parent posted: put 1000 drives running in a room and then get the failure rate. So while one might hope that the drives will last that long, what is actually measured is the failure rate, and the average failure time is then derived from it. Thus the parent poster had it right.
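
To put numbers on that reading of MTBF, here is a small Python sketch. It assumes a 500,000-hour MTBF per disk (roughly the 57 years being argued about) and a constant failure rate, which is the simplest model and not a claim about how real drives age.

```python
import math

DISKS = 1000
MTBF_HOURS = 500_000        # assumed per-disk MTBF, about 57 years
HOURS_PER_YEAR = 24 * 365

# With a constant failure rate, a population of N disks fails N times as often.
hours_between_failures = MTBF_HOURS / DISKS
failures_per_year = DISKS * HOURS_PER_YEAR / MTBF_HOURS

# Chance of seeing at least one failure in the first year (Poisson model).
p_any_failure_first_year = 1 - math.exp(-failures_per_year)

print(f"hours between failures: {hours_between_failures:.0f}")
print(f"expected failures/year: {failures_per_year:.2f}")
print(f"P(at least one in year 1): {p_any_failure_first_year:.6f}")
```

Under this model the array loses a disk about every three weeks, which is the figure from the original post; early-life and wear-out failures would shift when they happen but not the need to plan for them.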
Re:Good point, bad data by Syberghost · 2005-10-25 09:15 · Score: 1

I just want to know where I can get those disks with a 57-year MTBF. I'll take 10,000.
Re:Good point, bad data by Intron · 2005-10-25 09:30 · Score: 1

"Suppose 70 babies are born with a life expectancy of 70 years."

OK, they're born in 1900. I see some infant mortality, a cluster of "failures" during WWI, the flu epidemic and WWII. Then a tailing failure rate as the generic old-age diseases come into play. What was your point again?

--
Intron: the portion of DNA which expresses nothing useful.
Re:Good point, bad data by ohmypolarbear · 2005-10-25 16:20 · Score: 1

It looks like we've moved beyond RTFA to RTFP. Cheesedog goes on to say that the part you quoted is a bad assumption. A lot of trouble could be spared if people would listen to somebody's entire argument before calling them stupid...
Re:Good point, bad data by fm6 · 2005-10-25 17:06 · Score: 1

Sorry, I forgot to hop in my time machine and check for posts that hadn't been written yet.
Re:Good point, bad data by ohmypolarbear · 2005-10-25 17:56 · Score: 1

I was talking about the post you replied to. Read the sentence after the one you quoted:
Now, of course the failures won't be spread out evenly, which makes this even trickier.
He knows the even spreading isn't a very accurate description; he's just providing it to give a sense of the scale of the problem. He goes on to describe failure patterns that would emerge in a real-world setting.
Re:Good point, bad data by Eivind · 2005-10-25 22:30 · Score: 1

Actually, with hardware generally the failures follow a bath-tub curve. Quite a few disks are dead on arrival, or die in the first few hundred hours.
Then follows a period with relatively few failures, the bad production ones have been rooted out, and wear and tear aren't yet starting to show up.
Then, later, the failure-rate climbs again as old age starts to be a problem.
Actually, babies follow a similar curve, A lot more babies die between age 0 and 5 years than do between 5 and 50 years of age.
Re:Good point, bad data by smyle · 2005-10-26 03:34 · Score: 1

I just want to know where I can get those disks with a 57-year MTBF. I'll take 10,000.
Here ya go (just the first link I found).
Obviously, MTBF doesn't mean what you think it does.

--
Sleep is just a poor substitute for caffeine, anyway. -Bob Lehmann
Re:Good point, bad data by fm6 · 2005-10-26 05:14 · Score: 1

Actually, with hardware generally the failures follow a bath-tub curve. Quite a few disks are dead on arrival, or die in the first few hundred hours.
So to go back to the 70-babies analogy, there's a lot of infant mortality.
A long time ago, I worked for a pre-PC workstation/server manufacturer. To avoid shipping systems that would die soon after arrival, we'd "burn in" our systems: put them in a special overheated room and leave them running for a couple of days. I guess with commodity hardware it's more economic to let your customers do the burn-in. Which is a pain in the ass — the first time I installed a home router, I wasted a couple of hours figuring out that the thing was DOA, and another hour taking it back to the store. Lucky I didn't order it online!
Re:Good point, bad data by Syberghost · 2005-10-26 05:30 · Score: 1

Or, perhaps you're talking about consumer MTBFs like they have some kind of meaning in large installations. Divide that number by four to get the MTBF for adults, Fortune 500 companies, and everybody else who isn't duped by the sales pitch on the retail box.
Re:Good point, bad data by smyle · 2005-10-26 05:41 · Score: 1

Did you even read the link? The ONLY place they mean anything is in large installations. And yes, they use a "theoretical MTBF" until they have enough data (what would be called "beta testing" in software-land) to show otherwise.

--
Sleep is just a poor substitute for caffeine, anyway. -Bob Lehmann

Why one volume? by photon317 · 2005-10-25 08:34 · Score: 2, Informative

What's making your question hard is the "make it like one volume" restriction. The problem is trivial otherwise. If I were you, I'd be asking whoever tasked you with this to *really* justify on a technical level why they need it to appear as a single volume, since that makes all the possible solutions slower, more costly, and more difficult to maintain.

Chances are extremely high that what they really want is a "/bigfatfs" directory visible everywhere in which they will store many discrete items in subdirectories by project or by dataset or by user. You should convince them to let you build it from commodity machines serving a few TB each mounted as separate filesystems underneath that umbrella directory. Then your only challenge is coherent management of the namespace of mountpoints for consistency across the environment (which there are longstanding tools for, like autofs + (ldap, nis, nis+, whatever)), and administration/assignment of new space requests within your cluster (that could be scripted to automatically allocate from the least-used volume which can satisfy the request (where least used could mean space or could mean activity hotness based on the metrics you're logging)).
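
The allocation script mentioned above could look something like this Python sketch. The volume names and sizes are invented, and "least used" is taken to mean the lowest fraction of space used; swapping in an activity metric would only change the sort key.

```python
# Hypothetical inventory: mount point -> (capacity_gb, used_gb)
volumes = {
    "/bigfatfs/vol01": (3000, 2700),
    "/bigfatfs/vol02": (3000, 1200),
    "/bigfatfs/vol03": (2000, 500),
}

def allocate(request_gb):
    """Reserve space on the least-used volume that can fit the request."""
    candidates = [
        (used / capacity, name)
        for name, (capacity, used) in volumes.items()
        if capacity - used >= request_gb
    ]
    if not candidates:
        raise RuntimeError(f"no volume can satisfy {request_gb} GB")
    _, name = min(candidates)
    capacity, used = volumes[name]
    volumes[name] = (capacity, used + request_gb)
    return name

first = allocate(400)    # lands on the emptiest volume
second = allocate(1200)  # only one volume still has this much free
print(first, second)
```

The returned mount point is what would then be published through the automounter map so every client sees the new directory in the same place.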

--
11*43+456^2

What about a AoE and c-jbdc and maybe Mysql? by bubulubugoth · 2005-10-25 08:36 · Score: 0

Storage with ATA over Ethernet. The cheapest medium-size storage...

And write a filesystem wrapper to access the clustered JDBC project...

And mysql as file repository?

Ata Over Ethernet
http://freshmeat.net/projects/aoelinux/

Ata Over Ethernet tools
http://freshmeat.net/projects/aoetools/

c-jdbc
http://c-jdbc.objectweb.org/

Mysql
http://mysql.org/

--
Â_Â

Try this (just joking) by totallygeek · 2005-10-25 08:38 · Score: 1

Get a box with four 250GB IDE drives and find an old copy of Stacker! 1TB should become 25TB after installation.

--
Click here or here.

Howabout five Sun StorEdge 3511s? by Anonymous Coward · 2005-10-25 08:39 · Score: 0

Specs: 10u rackspace, 1500w power, $115k cash (medium selection + 35 BYO drives)

Apple. by neuroklinik · 2005-10-25 08:41 · Score: 1

XServe RAID and XSan.

http://www.apple.com/xserve/raid/
http://www.apple.com/xsan/

'nuff said.

Just use NFS... by Anonymous Coward · 2005-10-25 08:41 · Score: 0

Perhaps - and I don't know your requirements - NFS will work, with sub-servers for each portion.

The real problem here is you want commodity hardware - but really the 99.999% uptime that
expensive hardware gives you. Otherwise any solution you come up with will have reliability
problems - you don't want those problems with a system that big.

Think about it - you are talking about hundreds of commodity IDE drives.... all working
in unison - one goes down they all go down. Unless you have a raid-array controller.

NFS - sort of solves that - in effect - NFS servers come and go... and [can, if properly
configured] be automatically re-mounted [an auto-mounter]

This would also require that you segment your data into segments - perhaps you can do that
or - perhaps you cannot. This also assumes that (a) no one file is huge, (b)you do not
require "hard-links" across sub-sections (c) speed is not totally critical, but capacity is.

Easy... by Anonymous Coward · 2005-10-25 08:43 · Score: 0

You just need:

One hundred thousand computers, networked
Two hundred thousand trained monkeys
Two billion floppy disks (for redundancy)
An infinite supply of bananas
Enough shelves for all the above

There you have it: Distributed storage that scales by simply adding more computers, monkeys and floppy disks.

How about a PetaBox? by McSpew · 2005-10-25 08:44 · Score: 4, Interesting

The folks at the Internet Archive have already done the hard work of figuring out how to create a petabyte storage system using commodity hardware. The system works so well they started a company to sell PetaBoxes to others. Why reinvent the wheel?

Re:How about a PetaBox? by owlstead · 2005-10-25 12:43 · Score: 1

* The first 100TB Rack is operational in Amsterdam http://www.eu.archive.org/
* The second 80TB rack is operational in San Francisco
* Loaded with movies and music

My MP3 collection suddenly pales....
Re:How about a PetaBox? by yppiz · 2005-10-25 13:42 · Score: 2, Informative

You beat me to this link.

I will add that the Archive has particular design and performance goals, namely:

- keep the cost / GB as low as possible
- keep cooling and power requirements low
- use the filesystem and bundle objects into large chunks (~100MB ARC files, last I checked)
- assume streaming writes affecting an edge of the system -- previously written data isn't modified
- assume random reads
- read latency is less important than cost / GB

I worked on the Archive ~5 years ago, and these are based on my understanding of the Archive from that period, so some of these may have changed.

But essentially, these are instantiated as: off-the-shelf SATA disks in fairly standard cases with either normal or special low-power motherboards, running a free OS (the Archive has used both Linux and FreeBSD), with off-the-shelf networking equipment.

--Pat
Re:How about a PetaBox? by AndreySeven · 2005-10-25 17:05 · Score: 1

according to a CNET news story, buying from these guys, you would pay about $2million for a Petabyte. This is 10 times cheaper than EMC.

--
University of Washington
Student

Re:call EMC. i am sure their clarion line will han by aminorex · 2005-10-25 08:44 · Score: 2, Insightful

What once required talent and brilliance today only requires reading a how-to file, configuring,
and rebooting.

EMC is obsolete. Their customers just haven't discovered it yet.

--
-I like my women like I like my tea: green-

Obligatory Bash Quote. by e.loser · 2005-10-25 08:45 · Score: 1

Dude, you just told us you come in three minutes.

Re:Obligatory Bash Quote. by Anonymous Coward · 2005-10-25 15:42 · Score: 0

lol I'd admit I cum in 3 minutes... but I have a woman that'll back up my claims that my dining skills more than make up for it.

It's not what you do wrong that sets you apart... everyone is bad at something... it's how you compensate for your deficiencies that makes you a man or a mouse

Thanks for speaking and removing all doubt by tfiedler · 2005-10-25 08:50 · Score: 1

This guy's full of crap, and it's further testimony that slashdot's editors are also full of it.

No one, and I mean, no one, builds multi-petabyte storage silos without actually understanding storage. It's obvious from the questions in this posting that this guy is so far out of his league that he should quit his job and go live on a beach in a cardboard box somewhere or he's another dumbass trying to sound intelligent.

Thanks for speaking and removing all doubt.

Did I get a troll rating????? Did I? Cool.

--
Democrats and Republicans are like AIDS and Cancer, I want neither!

yeah, we're all jaded by fade · 2005-10-25 08:50 · Score: 1

there's no free lunch. everybody wants a lamborghini for the cost of an impala. good, fast or cheap -- pick two. nobody puts symmetrix/shark blah blah blah out of business with duct tape and baling wire. Jesus. if you're all so bored, give up the keyboards and get real jobs. :)

anyhow, the heads over at archive.org spun out a company to develop storage systems closely matching the brief you just laid down. Check this out: Petabox

you can buy the nodes and their stuff is proven in the archive.org infrastructure. :) not free, but then again, not as ludicrously expensive as the EMC/Hitachi/IBM/NetApp alternatives.

LeftHand Networks storage does it by smartsaga · 2005-10-25 08:50 · Score: 2, Informative

http://www.lefthandnetworks.com/ supports all of what the person is talking about in the article. As you add more of these units, the volumes are spread over the units you add. This means that you can add storage as you go and still have redundancy. You can configure each individual unit to use RAID 0, 1, or 5, and still get to have a volume, or many, across multiple storage units that in turn have parts of a whole volume or set of volumes. It's like having double mirroring, once within each individual storage unit (which has many IDE drives in RAID 1 or 5) and then a second time at the storage unit level. Of course this assumes that you have at least two storage units. And, yes, this means that to have redundancy you have to add them in pairs (I think) and have some storage units in one physical location and the pairs of each of those in another location for disaster recovery (fire, earthquakes, you know things can happen.)

I have worked with these units and they kick ass. You can do snapshots of entire servers quickly, given that you have the right infrastructure, set thresholds for volumes that can be increased or reduced on the fly, brick level restoration of files!!!, etc. And of course, my respect goes to their engineers. I saw them working on one unit cause we had a really bad power failure that killed one HD. Man those guys know their stuff up and down, and I've never seen anybody type commands so complex and so freaking long at that speed! They fixed the damn thing and got 99.99999% back from limbo!

I guess their storage boxes follow the model of LVM which is pretty cool and the storage boxes run Linux!!!

Don't take my word for it, go to their website and take a look 'cause I tend to confuse people with my posts rather than pass info efficiently.

Have a good one.

--
===== "Every head is a different world so don't invade mine you FREAK!" smartSAGA said

What about MatrixStore? by Steve.Murray · 2005-10-25 08:54 · Score: 2, Informative

MatrixStore from Object Matrix http://www.object-matrix.com/ uses commodity hardware and clusters it together to create a highly expandable, reliable and secure storage environment.

Re:What about MatrixStore? by Anonymous Coward · 2005-10-25 09:57 · Score: 0

MatrixStore scales as required, runs on commodity hardware but does enforce data redundancy to raid level or above - could be a good option depending on the application, and if you intend to scale to a petabyte today, might be best to think about what the data requirements are going to be like if a HD v2 comes along!!
Re:What about MatrixStore? by Anonymous Coward · 2005-10-26 05:56 · Score: 0

Is this running on the Xserve platform?

A SAN and archive solution based on Apple hardware would be excellent as our organisation is looking to get rid of tape (for the obvious reasons!).

TinyPETA! by mjeppsen · 2005-10-25 08:54 · Score: 1

How about TinyDisk?
;-)

-MJ

iSCSI/AoE + LVM + Software RAID? by someguysomewhere · 2005-10-25 08:57 · Score: 2

How about this:
- Use LVM on every node to make the 2TB seem like a single disk ( Assume 4 x 500GB disks )
- Use iSCSI/AoE to make the LVM volumes available on the network
- Use LVM again to merge exported volumes
- For redundancy use software raid 5 on the lvm volumes

I suspect there will be a lot of problems with efficiency but I think you should be relatively safe from hardware failures as the software raid will detect and repair them.
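
To put rough numbers on the RAID 5 part: a single RAID 5 set gives you the capacity of all members minus one, and survives only one failed member. Here's a quick sketch, assuming each exported node is one 2TB member and using a hypothetical 25TB target (neither the target nor the function names come from anything real, they're just for illustration):

```python
import math


def raid5_usable_tb(members: int, member_tb: float) -> float:
    """Usable capacity of one RAID 5 set: one member's worth goes to parity."""
    if members < 3:
        raise ValueError("RAID 5 needs at least three members")
    return (members - 1) * member_tb


def members_needed(target_tb: float, member_tb: float) -> int:
    """Smallest member count whose RAID 5 usable capacity reaches target_tb."""
    return max(3, math.ceil(target_tb / member_tb) + 1)


node_tb = 4 * 0.5                       # 4 x 500GB disks merged by LVM
nodes = members_needed(25, node_tb)     # hypothetical 25TB target
print(nodes, raid5_usable_tb(nodes, node_tb))
```

With a lot of nodes you'd probably want several smaller RAID 5 sets rather than one huge one, since one set only tolerates a single node being down at a time.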

Anyone have any idea whether what I mention is possible/recommended?

Re:iSCSI/AoE + LVM + Software RAID? by bigredradio · 2005-10-25 11:09 · Score: 1

I wondered when someone was going to mention LVM. This is the way to go.

--
Flexible bare-metal recovery for Linux/UNIX

No simple answer for the complex by Anonymous Coward · 2005-10-25 08:58 · Score: 0

The link I quickly googled examines the hardware side of your project and provides a little insight into the complexity of reliable, large-scale storage solutions. (http://ssrc.cse.ucsc.edu/Papers/hospodor-mss04.pdf) This could probably make a nice research project for a college or university if you have a large grant, but if you are talking about a business solution - I'd go with a commercial vendor with a proven track record and a verifiable list of satisfied customers. You don't want YOU to be the single point of failure if a petabyte of valuable data is lost, compromised, or even unavailable for any length of time. Downtime for a large database spells the loss of big bucks for most businesses and/or short employment for the responsible IT personnel.

Xsan has volume size limits by Rhys · 2005-10-25 08:59 · Score: 2, Informative

I want to say it is 16 Tbyte offhand, but I'm not sure on that.

Short research indicates this was a limitation in 10.3, but I haven't found anything confirming or denying that 10.4 still has it.

Not that we've been looking into large amounts of Xsan storage here, but our requirements are a bit different. You can't hook >600 nodes up to the storage via fibre. Our problem is scaling out the NFS servers to be able to push all this data around.

--
Slashdot Patriotism: We Support our Dupes!

Re:Xsan has volume size limits by Anonymous Coward · 2005-10-25 10:07 · Score: 0

The limit is now 16 PB (I could be off by a PB or two). The limit was caused by the OS. You will need the ADIC clients if you want to use a non Apple OS to connect to this size volume. You could split up the volumes into smaller sizes (a size that the Windows OS natively supports) and use SAMBA to share out the storage for free.

You will need to run OS X 10.4 with XSan v 1.1 or greater. Nothing will come closer when you start to look at cost. Very low cost and Apple will work directly with you to make it work.

When was the last time M$ was willing to call people all around Redmond to solve your problem?
Re:Xsan has volume size limits by blofeld42 · 2005-10-25 11:32 · Score: 1

The XSAN manual says 1 petabyte per volume is the max under 10.4.
Re:Xsan has volume size limits by Rhys · 2005-10-27 07:32 · Score: 1

Oh I'm aware Apple is very eager in this field. But we had some hardware issues with their Xserve RAIDs in the past (which caused Linux kernel bugs to corrupt the filesystem, but that's an aside) that keep those who have the pockets asking questions.

Time will tell what occurs.

--
Slashdot Patriotism: We Support our Dupes!

AFS Rocks- Now stop by sirket · 2005-10-25 09:01 · Score: 5, Insightful

Stop what you are doing right now. If your architecture requires you to have one huge volume then you have architected things wrong. Imagine trying to fsck this damned thing! What about file system corruption- What the hell are you going to do when you lose a Petabyte of data because of some file system corruption? Small, sensible, easily managed smaller partitions are the way to go. Use a database to organize where given files are stored. Do something that makes sense. I have a client now who just lost a bunch of data because they used a system like this.
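
The "use a database to organize where given files are stored" idea can be as simple as one table mapping a logical path to the volume that holds it. A minimal sketch using SQLite; the table layout, volume names and paths are all made up for illustration, and a real deployment would want a proper database server with its own backups:

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) the catalog that maps logical paths to volumes."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " logical_path TEXT PRIMARY KEY,"
        " volume TEXT NOT NULL,"
        " size_bytes INTEGER NOT NULL)"
    )
    return db


def record_file(db, logical_path, volume, size_bytes):
    """Remember which volume a file was written to."""
    db.execute(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
        (logical_path, volume, size_bytes),
    )
    db.commit()


def locate(db, logical_path):
    """Return the volume holding logical_path, or None if unknown."""
    row = db.execute(
        "SELECT volume FROM files WHERE logical_path = ?", (logical_path,)
    ).fetchone()
    return row[0] if row else None


db = open_catalog()
record_file(db, "/projects/alpha/run1.dat", "vol07", 123456789)
print(locate(db, "/projects/alpha/run1.dat"))
```

Lose one small volume and you lose (and can list) only the files the catalog says were on it.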

Having said all this- If you are still intent on finding a good file system then use AFS. It's probably your best free solution. If you want to sleep at night call EMC.

-sirket

Re:AFS Rocks- Now stop by Anonymous Coward · 2005-10-25 09:58 · Score: 0

I just called EMC. Their low-end Clariion systems max out at about $600k with 240 500GB SATA drives (availability is Nov 1) for about 120TB of non-redundant storage. Their high-end Symmetrix systems max out at 960 drives (287TB with 300GB Fibre drives) and cost $millions.

Neither of these will scale up to the 1000TB needed (at least not yet), and you're probably looking at a minimum of $4k/TB. That's probably 2-3 times what it would cost with a cheap PC-based grid alternative.

dom
Re:AFS Rocks- Now stop by Anonymous Coward · 2005-10-25 10:48 · Score: 0

You get what you pay for. I am sure you can get couple storage arrays from EMC and have a Multi-PetaByte SAN like my company does, their ILM data cycle is the way to go.
Re:AFS Rocks- Now stop by Anonymous Coward · 2005-10-25 11:16 · Score: 0

Sure, you can have as many DMX boxes as can fit in your datacenter, but the poster needs a single filesystem image. An EMC NAS image can only grow to 287TB at the moment.

dom
Re:AFS Rocks- Now stop by Anonymous Coward · 2005-10-25 12:24 · Score: 0

You were fine until you mentioned EMC. I've used all the high-end storage vendors, and EMC is the only one where I consistently lose data, including multiple instances of an entire array going TU, with no resolution from EMC other than to provide a second array as a spare.
Re:AFS Rocks- Now stop by tweakt · 2005-10-25 14:02 · Score: 1

If you want to sleep at night call EMC.
Um... did you miss this part?
"...commercial offerings for such a solution becomes cost prohibitive, and the budget for the solution is fairly small"
Re:AFS Rocks- Now stop by Anonymous Coward · 2005-10-25 14:14 · Score: 0

Never call EMC. I have three and have had many, many problems.
Never spending that much money on their junk again.
Re:AFS Rocks- Now stop by rhaig · 2005-10-25 16:53 · Score: 1

you have EMC hardware for your storage and you can sleep?

since when should replacing one drive that is marked bad cause 18 others to mark themselves bad. Oh, btw, it turns out, the drive that caused the problems (throwing 18000 FC errors in about 10 minutes) was next to the first one that was replaced (which wasn't really bad) and never got marked as bad.

that's what I call quality.

we don't do hot drive replacements on our clariion's anymore (either of them).

--
"We are not tolerant people. We prefer drastically effective solutions"

GPFS - performance and stability by painehope · 2005-10-25 09:02 · Score: 2

GPFS
Take it from someone who's messed with nearly every storage product on the market, if you want something that works fairly simply, performs at approaching spindle speed ( meaning the file system is not the bottleneck - if you have 10 GB/sec. storage bandwidth, expect to see near that with proper tuning ), is very stable ( compared to most storage solutions on the market - bear in mind that most storage products are aimed at large-block sequential I/O, and fall down - either performance-wise or stability-wise - when you throw other I/O patterns or combinations of patterns at them ), and is portable across nearly any Linux distribution ( with varying amounts of difficulty, I have had to hack their kernel patches before when using an unsupported kernel ), GPFS is the one. Of course, the problem there is I believe it's pretty expensive to run on non-IBM hardware. But if you have IBM hardware ( even if it's not the hardware you're running the FS on ) or some sort of in with IBM, they'll let you have it for a song and a dance.

Having said that, Lustre is getting there. I'd say it's the equal of GPFS ( as a parallel filesystem - I believe it is even more flexible as a distributed filesystem ) in performance, probably scales roughly the same ( haven't played with it in a large installation, so can't tell you beyond looking at the architecture ), and is going to be the biggest player on the market in the future. It's also free ( IIRC Cluster File Systems sells support, but the code is freely available ) and not tied to IBM and whatnot, like GPFS is. Of course, HP has a big connection with Lustre, but not ownership thereof.

Those are really the only two that I would consider for a serious high-performance storage project. If you don't need great performance, that's when you can start looking at things like GFS, ADIC's StorNext, Ibrix, etc.

Oh, Gautham Sastri ( of former Maximum Throughput fame ) has a newer company called Terrascale, I recall them putting on a presentation at the 2003 or 2004 ( can't remember ) Supercomputing conference ( SC2005 is coming up in a few weeks, yeah!!! ) which showed pretty good performance ( relative to the small system they were using ), not sure how they're coming along...

Anyways, good luck...and don't forget to use Iozone to benchmark the damn thing!

--
PC moderators can suck my White pierced, tattooed dick. If you think pride == hate, s/dick/Aryan meat mallet/g.

Backup a petabyte? by Anonymous Coward · 2005-10-25 09:02 · Score: 0

Approximately 764 million floppies would do the trick, give or take a few, although you'd probably want some kind of ultra-efficient volume catalog system...I'd say a good Jet Database configuration could work.

Re:Backup a petabyte? by Anonymous Coward · 2005-10-25 09:15 · Score: 0

You're assuming they have high density disks available? Or was that double sided?

Just to be safe, let's dig out the 8 inch floppies.

I'm no storage expert but... by Mars+Ultor · 2005-10-25 09:03 · Score: 3, Funny

Why not store the data randomly in a dilithium matrix with asynchronous data transfer and AJAX? Maybe some RUBY on RAILS too - I hear that's hot right now. Of course, you'd have to make use of a couple of Heisenberg compensators configured in parallel to account for any memory addressing issues, but no need to state the obvious there.

--
"Nokia is not a country, it's the capital of Finland!" -Moderated "Informative". Yeesh.

The SCIENTIFIC Answer by MightyMartian · 2005-10-25 09:04 · Score: 2, Funny

We at Vap-o-tech 2003 Inc. (not associated with Vap-o-tech 2001 Inc. which has closed its doors due to allegations of investor fraud) have developed ToastFS 2003. Using patented CRUMB technology and high capacity BUTTER read/write caching, we are able to turn your average loaf of Wunderbread into a 200gb storage media. Simply buy a loaf of our own specially tested Wunderbread ($250 USD) along with a USB-to-Popup Toaster interface (don't worry, USB 2.0 is more than capable of handling 120amp wall sockets without a problem, except in California). Then take our Vap-o-bake ToastFS drive and pop two pieces in. For doubled capacity, buy our Vap-o-bake ToastFSx2 drive, which takes four pieces. From a command prompt, simply type FORMAT C: and answer yes. Your new ToastFS drive will be formatted in minutes. Please note that we have 24 hour technical support via 1-900-842-8524 ext 241. Please don't hang up. Our operators in the Dutch Antilles are very busy and could take upwards of an hour to get to you.

--
The world's burning. Moped Jesus spotted on I50. Details at 11.

Re:The SCIENTIFIC Answer by noamsml · 2005-10-25 09:13 · Score: 1

Does it happen to be running NetBSD?

--
My new blog
Re:The SCIENTIFIC Answer by Anonymous Coward · 2005-10-25 15:20 · Score: 0

Netcraft confirms it.

Not to mention wasted time, etc...try 3par, OnStor by Tmack · 2005-10-25 09:05 · Score: 1

Companies shouldn't waste time/money developing a solution for a problem from the ground up where adequate solutions already exist. It will cost less and waste a lot less time (and since time=$$ ...) to simply buy a storage solution that's been around, been tested, already has tools and utilities built specifically for it to monitor/configure/report/etc, and since they are being mass produced can actually be cheaper than just the hardware for your custom system. Not to mention that there will be support for it outside your company. I would rather spend a little more now to have a solution than spend several months developing something to save a minimal amount of money. Unless you plan on marketing the solution itself, there is no need great enough to justify developing it from scratch.

<shamelessplug> As a customer, the 3par solution has been very impressive to me and the company I work for. We have EMC arrays, Netapps, etc, but the 3par blows them all away in performance/size/just about every aspect including price, and we are currently migrating as much as we can off the other solutions onto the 3par. To make it more flexible (the unit itself is designed for fiberchannel), we got a set of onStor NAS gateways, and they make NFS actually faster than local disk (using Gig-e). The 3par is highly modular, and the software to use it makes it simple to reconfigure the volumes/raid type/whatever. It also does snapshots for you.

Tm

--
Support TBI Research: http://www.raisinhope.org

Controllers! by man_of_mr_e · 2005-10-25 09:06 · Score: 2, Informative

You could get a bunch of Broadcom 8 port SATA controllers, which equals about 4TB per controller. 4 or 5 controllers = 16-20TB per box, then you can run the cables into an outside drive bay enclosure and one box can control 40 500GB hard drives.

If you're not doing any processing on this, a good CPU should be able to handle the load.

--
If you need web hosting, you could do worse than here

Re:Controllers! by sr180 · 2005-10-25 13:14 · Score: 3, Insightful

The CPU might be able to handle this load easily, but my question is will the bus (PCI or otherwise) be able to handle this load?
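
A back-of-envelope check on that, as a sketch only. Both figures below are assumptions rather than anything measured in this thread: about 60MB/sec sustained per drive, and a classic shared 32-bit/33MHz PCI bus with a theoretical peak of about 133MB/sec (PCI-X and PCI Express slots would raise the ceiling a lot).

```python
# Assumed figures, not measurements from the thread:
PCI_32BIT_33MHZ_MB_S = 133   # theoretical peak of a shared 32-bit/33MHz PCI bus
DRIVE_SUSTAINED_MB_S = 60    # rough guess for one SATA drive of the period


def aggregate_drive_mb_s(drives, per_drive_mb_s=DRIVE_SUSTAINED_MB_S):
    """What the drives could stream if nothing else were in the way."""
    return drives * per_drive_mb_s


def deliverable_mb_s(drives, bus_mb_s=PCI_32BIT_33MHZ_MB_S):
    """Throughput is capped by whichever is slower: the drives or the bus."""
    return min(aggregate_drive_mb_s(drives), bus_mb_s)


drives = 5 * 8  # five 8-port controllers, as in the parent post
print(aggregate_drive_mb_s(drives))  # what the disks could supply
print(deliverable_mb_s(drives))      # what a single shared PCI bus could carry
```

Under those assumptions the bus, not the CPU, is the first thing to saturate.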

--
In Soviet Russia the insensitive clod is YOU!

Here's my solution by Anonymous Coward · 2005-10-25 09:07 · Score: 2, Interesting

I manage a small (29 dual-xeon nodes) linux cluster in a lab for my local college. A while ago I had the same problem when we ran out of storage space on the main file server.

My solution was to use the nodes' hard disks (each one has a 120GB Ultra320 10000rpm disk) combined in a network RAID1+0 solution (we use gigabit ethernet) to get more space. With that approach you can get as much redundancy as you need.

Here's what I did:

1. After installing the network block device server (nbd-server) on each one of the nodes, I created a 100GB partition on the HD and exported them directly using the raw mode;

2. On the master node (using the nbd-client) I created a block device for each one of the nodes partitions;

3. After that I installed the linux software raid tools (mdadm) and created a small RAID1 array for each pair of nodes. I ended up with 14 100GB network RAID1 arrays each one with its very own /dev/md# block device;

4. I created a big 1.4TB (14 * 100GB) RAID0 array with the 14 RAID1 ones and attached it to the /dev/md0 device;

5. The final step was to create a large ReiserFS filesystem on the /dev/md0 array, and I was done.
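
The layout in steps 3 and 4 can be sketched like this. It only plans the pairing and works out the capacity; it does not touch nbd or mdadm, and the node names are made up:

```python
def plan_raid10(nodes, partition_gb):
    """Pair nodes into RAID1 mirrors, then stripe (RAID0) across the mirrors.

    Returns (pairs, spare, usable_gb). With an odd node count the last
    node is left over as a spare.
    """
    pairs = [(nodes[i], nodes[i + 1]) for i in range(0, len(nodes) - 1, 2)]
    spare = nodes[-1] if len(nodes) % 2 else None
    # Each mirror contributes one partition's worth of usable space.
    usable_gb = len(pairs) * partition_gb
    return pairs, spare, usable_gb


nodes = [f"node{n:02d}" for n in range(1, 30)]   # 29 hypothetical node names
pairs, spare, usable_gb = plan_raid10(nodes, 100)
print(len(pairs), spare, usable_gb)
```

Half the raw space goes to mirroring, which is the price of surviving a dead node in any pair.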

You have to pay special attention to the array shutdown and startup procedures. I wrote my own scripts to take care of that for me.

Our array may seem small compared to what you are looking for, but I am pretty sure that it will scale well for arrays much larger than ours.

Good luck.

Three letters for ya... by Anonymous Coward · 2005-10-25 09:11 · Score: 0

SAN

Scalable, fault tolerant, etc...

Small companies and short-sighted management by Anonymous Coward · 2005-10-25 09:18 · Score: 1, Interesting

There are some smart proprietors of small businesses that think cheap, like this. Used to work for one -- the guy was smart, but not smart enough. A four-person company, and he asked me to build something similar. I tried to explain why storage solutions from IBM were so expensive; but he would have none of that, and insisted on building this from Intel white-box parts. The project failed.

1/5 boxes arrived DOA. The ethernet cards didn't work. The cables to the hard drives weren't long enough. The hot-pluggable disk trays were flaky. The BIOS had to be flashed. The proprietary hard drive controller drivers sucked, had to buy new controllers. 1/10 disk drives were DOA.

Three months later, and $40K poorer, we had a system that couldn't pass 24 hours of stress testing without failing in some wacky way. For the $40K and my salary time, we could have bought a usable system from IBM or HP or whomever, and it would have worked. Engineering big systems is non-trivial.

You want Isilon by Anonymous Coward · 2005-10-25 09:22 · Score: 0

http://www.isilon.com/

Change providers. by Anonymous Coward · 2005-10-25 09:25 · Score: 0

You are being screwed.

10% of failures is completely unacceptable, I can only assume your solution is crap or you are making the numbers up.

Re:Change providers. by OrangeSpyderMan · 2005-10-25 23:09 · Score: 1

No - it's not unacceptable - and while the figures are "ball park" they're not made up, and certainly correct to within a couple of percent. It just happens that of our total failures very many happen on power downs. As a previous poster mentioned MTBF is very much just that, mean. Oh, and we don't do this every other weekend we do full restarts on the whole Datacenter *VERY* rarely. Believe me with the amount of money this means, any failures due to incorrect installations would very quickly mean that our hardware providers would start asking questions. They don't.

--
Try NetBSD... safe,straightforward,useful.

Ask Slashdot Formula: Outsourcing! by Anonymous Coward · 2005-10-25 09:27 · Score: 1, Funny

Dear Poster.

This is Slashdot India and Slashdot China. With over a Billion people combined we can do it by using PeopleRAID, which is hot-swappable once every generation, and redundant once we eliminate our "one child per a couple" restrictions, and increase our Viagra imports.

Google File system by Anonymous Coward · 2005-10-25 09:27 · Score: 0

What you're looking for is the Google file system. It uses desktop grade hard drives to support larger than petabyte storage requirements with fast file access. This year at Sigmetrics 2005 the invited talk was:

Google - Or how I learned to love Terabytes
Urs Hoelzle, VP of Operations and Engineering, Google Inc

He certainly made it sound like the google file system was available for use.

Know when you are out of your league. by Eric_Cartman_South_P · 2005-10-25 09:35 · Score: 1

You were given a task that can't be done cheap or simple. The best path to success IMO with the specs you described, is to pick up the phone and call IBM or SUN and say "Help!". Chances are you will end up with something where the cost per gig is going to be very close to what you would pay had you done it yourself, but without all of the wasted time and wrong choices.

Want to save lots of money? Go lean on the support by learning what you have, how to take care of it, and SUPPORT the system yourself.

Just for kicks, call up a sales rep at Iomega and let them know you want to do this with ZIP disks. Record it and blog the MP3 of their head exploding.

Good luck.

Pay money or do it yourself. by tmortn · 2005-10-25 09:35 · Score: 1

The real catch to your problem seems to be the single volume issue.

Would be fairly easy to network a bunch of boxes that added up to the requisite storage amount... but accessing them as a single drive with redundancy would be an issue. As mentioned, the number of disks involved pretty much means you can't go into it without being worried about redundancy unless you can deal with data loss pretty much from day one.

If you have to do it on your own my guess is you are going to have to take the largest easily available solution and then do your own work to scale it up by gluing those together. And there are any number of ways to do that... I'd suggest tackling it with the tools you are most comfortable with... or if you're not up for that level of development it is time to tell those who must be obeyed that they can't get there from here if they want a non-professional HUGE single volume storage solution.

--
I don't ask you to be me. I only ask you not expect me to be you.

Get out now!! by egriebel · 2005-10-25 09:36 · Score: 2, Insightful

Really, go now before your company's stinginess brings you down too.

There's a reason why Terabyte storage arrays for commercial applications cost a lot of money, and why consulting services from IBM, EMC, Hitachi, etc. have the huge per-hour cost. If you/your management can't see that, you really have no business being there. Sure, anyone can throw a JBOD RAID together for a thousand bucks, but I wouldn't trust anything more important than MP3s to it.

--
ACHTUNG! Das computermachine ist nicht fuer gefingerpoken und mittengrabben. Ist nicht fuer gewerken bei das dumpkopfen.

Get a demo of the Compellent Storage Center by roj3 · 2005-10-25 09:44 · Score: 1

Their founding engineers are old-school storage guys who built a new system from scratch in 2002. The SAN supports FC and iSCSI front end, FC and SATA backend and natively virtualizes all storage into a single pool. When you want more storage, just add more controllers, enclosures, etc.. - the array supports n-way dynamic controllers. They don't have a ton of information on their website because they sell entirely through partners. The demo is worth your time b/c you'll see where SANs are headed.

dCache by Dan+Yocum · 2005-10-25 09:45 · Score: 1

dCache

Who let the PHBs out? by buss_error · 2005-10-25 09:48 · Score: 2, Insightful

Sounds like the PHBs have been at this. First, *why* does it have to be a single file system? With Oracle, MySQL, and MS-SQL you can do partitioning, if your need is databases. If your need is really a monolithic file, then I'll bet that the single file size won't be multi-hundreds of gigs.

In short, your stated objective smells. Not enough data.

WHAT is going to be done (database, file storage?)

HOW will it be accessed? (One large file, many smaller files)

WHEN will it be accessed? (During business hours, distributed over the day?)

AVERAGE TRANSFERS - will the whole schmear come over, selected parts?

SECURITY a concern? (Sensitive data, protected network)

BACKUP - a petabyte of tape storage is expensive, and takes quite a while to do.

POWER - do you have enough?

COOLING - ditto

SPACE - ditto - my $DAYJOB computer room is about 3000 sq ft... and we're going to be using all of it within 12 months.

That said, if you go with big drives over a lot of systems, use lots-o-nics to keep the nic from being the bottleneck. A single gig connection sounds fine, but wait until you have 100's of people going for files at once. It'll get swamped. And swear off V-SAN from Cisco. Not worth it at all.

--
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.

Re:Who let the PHBs out? by rhaig · 2005-10-25 17:09 · Score: 1

restoring one PB at 60MB/sec takes about 5000 hours.

let's assume best case. you're allowed to spend money on backups. let's say your solution is going to have 100 servers each with about 10TB on them. and you have 10 LTO3 tape drives. if you get 80MB/sec out of each tape drive (that's about the realistic max of 1Gb ethernet) then it's still 373 hours with all drives spinning at 100%. that's a little over 2 weeks. assuming everything goes right. and depending on how well the data compresses, more than 1000 tapes which hopefully haven't gone bad.
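The restore-time figures above can be reproduced with a quick back-of-the-envelope script. This is only a sketch: it assumes binary units (1 PB = 2**50 bytes, 1 MB = 2**20 bytes) and perfectly sustained throughput with no tape changes or retries, and the name restore_hours is made up for illustration.

```python
PB = 2**50  # bytes in one petabyte (binary units, an assumption)
MB = 2**20  # bytes in one megabyte (binary units, an assumption)

def restore_hours(total_bytes, mb_per_sec, streams=1):
    """Hours needed to move total_bytes at a sustained per-stream rate."""
    seconds = total_bytes / (mb_per_sec * MB * streams)
    return seconds / 3600

single = restore_hours(PB, 60)           # one stream at 60MB/sec
ten_drives = restore_hours(PB, 80, 10)   # ten LTO3 drives at 80MB/sec each

print(f"one 60MB/sec stream: {single:.0f} hours")
print(f"ten 80MB/sec drives: {ten_drives:.0f} hours ({ten_drives / 24:.1f} days)")
```

Under those assumptions the single stream lands just under 5000 hours and the ten-drive case at roughly 373 hours, which is where the "little over 2 weeks" comes from.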

how long can you be without the data....

--
"We are not tolerant people. We prefer drastically effective solutions"
Re:Who let the PHBs out? by buss_error · 2005-10-31 17:38 · Score: 1

restoring one PB at 60MB/sec takes about 5000 hours.
See the SpectraLogic 950 series tape silos. up to 120 tape drives and fiber channel. Shouldn't take too long with that.

--
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
Re:Who let the PHBs out? by rhaig · 2005-11-01 06:02 · Score: 1

how fast can you write to storage on any given system? if you put 4 dual port FC cards in a host, and you MAX them all out to the theoretical limit of 256MB/sec (2Gb/sec) then you get 2GB/sec spread across all 8 interfaces. then it doesn't take that long. BUT. while you can add up enough tape drives to get that throughput. getting the backplane to write that fast isn't likely. if you're restoring the data to multiple machines, then you have network issues to deal with.
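For what it's worth, the interface math above works out like this. A rough sketch only: it takes the quoted 256MB/sec per port as given, assumes binary units, and ignores the backplane and network limits that are the real point of the post.

```python
MB = 2**20                 # bytes per megabyte (binary units, an assumption)
PB = 2**50                 # bytes per petabyte (binary units, an assumption)
PORT_MB_PER_SEC = 256      # quoted ceiling for one 2Gb/sec FC port
ports = 4 * 2              # four dual-port cards

aggregate = ports * PORT_MB_PER_SEC        # MB/sec across all eight ports
hours = PB / (aggregate * MB) / 3600       # time to move 1 PB at that rate

print(aggregate, "MB/sec aggregate")
print(round(hours, 1), "hours for one petabyte")
```

So even at the theoretical limit a petabyte is still several days of continuous writing to a single host.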

--
"We are not tolerant people. We prefer drastically effective solutions"
Re:Who let the PHBs out? by buss_error · 2005-11-04 16:30 · Score: 1

if you're restoring the data to multiple machines, then you have network issues to deal with.
Restores can go to many machines. However, I find that normally my restores go to one machine. Add in that, for the most part, restores are done in off-peak hours, and you have more bandwidth to play with.

--
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
Re:Who let the PHBs out? by rhaig · 2005-11-07 04:00 · Score: 1

Most restores are the result of hardware or user error. Both of which may end up being emergency restores. No off-hours work there. Also, in a solution as large as this, there will be a dedicated backup network anyway, so you will have 1Gb to work with. That is still a really big limiting factor.

Backup/Restore is what I do. It's my job. I'm happy to share my experiences and architectural opinions on any project.

--
"We are not tolerant people. We prefer drastically effective solutions"

Why not GFS by keepper · 2005-10-25 09:49 · Score: 1

Why not have those 7 boxes all export their storage via GFS and create one large volume like that?

But .... my budget is only $10,000! by wsanders · 2005-10-25 09:52 · Score: 1

Somehow every time an article like this gets posted, the poster forgets to mention - OBTW, my budget is only .....

If you can afford a 25TB sized storage system, that will scale by a factor of 40, and still be recoverable, and you are asking slashdot for suggestions . . . really just give up and call EMC or their ilk and be prepared to write some zeros in your check.

OTOH it can be done - 6 years ago I interviewed at Up and Coming Photo Site that planned on archiving every photo uploaded by every user for all time for free. They had standardized on a really cheap and no doubt hideously dodgy open source software raid-5 white box design that they could churn out for a few kilobucks, and were building them by the hundreds and connecting them with NFS. With SATA RAID-5 controllers, ReiserFS, and the Linux VM in much better shape now than then, a project like that might actually be fun!

--
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"

Re:But .... my budget is only $10,000! by Kent+Recal · 2005-10-25 10:52 · Score: 1

a project like that might actually be fun!

Until you get fired when it breaks down...
OTOH, the learning experience may actually be worth it ;-)

That's not MTBF, this is.. by beldraen · 2005-10-25 09:54 · Score: 4, Informative

Just a comment about MTBF. It's often not understood, and it is one of my little pet peeves with tech producers because they don't try to correct it. MTBF is a rating of how reliably a device will last through its warranty period.

You have a drive that is rated 500,000 hours MTBF. Suppose you bought a drive and let it run at rated duty. Drives are normally rated to run 100% of the time, but many other devices will have a duty period. Further, you run the drive until its warranty is up. You then throw this perfectly working drive out the window and replace it. If you keep up this pattern, then approximately once per 500,000 hours on average you should have a drive fail before the warranty period is up. This is why it is important to look not only at the MTBF but also at the warranty period.
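One common way to put numbers on this is to treat the MTBF as a constant failure rate across a population of in-service drives. That model is an assumption here (real drives do not fail at a constant rate over their life), and the fleet size and 5-year warranty below are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365

def expected_failures(drives, mtbf_hours, years):
    """Expected failures in a fleet, assuming a constant failure rate."""
    return drives * years * HOURS_PER_YEAR / mtbf_hours

def prob_drive_fails(mtbf_hours, years):
    """Chance a single drive fails within the period (exponential model)."""
    return 1 - math.exp(-years * HOURS_PER_YEAR / mtbf_hours)

# Hypothetical fleet: 1000 drives rated 500,000 hours MTBF, 5-year warranty.
print(expected_failures(1000, 500_000, 5))   # failures expected inside warranty
print(prob_drive_fails(500_000, 5))          # per-drive failure probability
```

With those made-up inputs you would plan for somewhere around 88 in-warranty failures, or a bit over 8% of the fleet.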

As a side note: In theory, you should be throwing drives out on a periodic basis. One way around this is to not buy all the same drive type and manufacturer. By having a pool of drive types, you distribute, and thus minimize, the risk of drive failures. Additionally, you may want to have a standard period of time for drive replacement so as to schedule your down time, as opposed to it all being unexpected.

--
Bel, the mostly sane.. "Of course I can't see anything! I'm standing on the shoulders of idiots." -- Me

Re:That's not MTBF, this is.. by cheesedog · 2005-10-25 10:13 · Score: 1

I'd rate you up if I were allowed. I didn't realize that, and I think it is a fascinating nuance.
Doesn't change the argument -- just adjust the MTBF I quoted to whatever the real MTBF of the drive is and go on.
Or, do as you say and toss good drives during scheduled maintenance (or at least eBay them :) )
Re:That's not MTBF, this is.. by Blackforge · 2005-10-25 13:02 · Score: 1

There is a problem when mixing manufacturers and types:

Different manufacturers' actual drive sizes vary. They can vary through their various product lines and drive generations too. That 160GB drive you just put in to rebuild your RAID1/RAID5 is too small; sorry, try a different disk! Of course you can always replace them using bigger drives to ensure this won't be a problem, but then you'll lose usable disk space. (This is in a typical ATA/SCSI RAID environment.)
Re:That's not MTBF, this is.. by beldraen · 2005-10-25 15:19 · Score: 1

Yes and no. You answered your own question by recognizing the raid will work with largest capacity as the smallest disc. So, you will waste space on the other drives. Question: How much does it cost to pay someone to reconstruct a raid that has multiple drives go bad? Raids are created to solve the business issue of lowering the cost of down time. Losing a few gigs of space (what? A few bucks??) to saving an employee a few hours worth of work (what? $100/hr given all benefits and costs to company??) is a very fair trade. This goes for throwing drives out early as well. You never pay money to have space or equipment or computers, you pay money so your company can make money.

--
Bel, the mostly sane.. "Of course I can't see anything! I'm standing on the shoulders of idiots." -- Me
Re:That's not MTBF, this is.. by Anonymous Coward · 2005-10-25 18:45 · Score: 1, Insightful

Your ideas are wrong. First off, If you have redundancy set up correctly for your arrays, drive failure will not be an issue. You just replace drives as they fail, and let them rebuild the array. Hell, set up hotspares, so the array rebuilds automatically when there is a failure. Then you just replace the bad drive at your leisure, and set it up as a new hotspare.

Secondly, you generally can't mix drive types, as they tend not to be exactly the same size. This will really mess up any attempts to rebuild a failed drive, or redundancy in general. Additionally, most "hot-swap" array solutions require drives of a specific mounting type and form-factor, which is going to throw that idea out the window.
Re:That's not MTBF, this is.. by Hydroksyde · 2005-10-25 21:09 · Score: 1

I'd be happy to take the drives off your hands in this case, to save you disposal costs...
Re:That's not MTBF, this is.. by petermgreen · 2005-10-27 04:36 · Score: 1

afaict if you have mixed sized drives in a raid array the larger ones are just left with some unused space at the end hardly a disaster.

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register

Welcome to the world of Parallel filesystem by a3217055 · 2005-10-25 09:54 · Score: 1

The best way to have access to all that data is to use a parallel filesystem. There are many types of parallel filesystems. Some are designed for certain types of files, others for other sorts.

PVFS and GPFS can be said to be the two most useful filesystems of this type in use. I would recommend that you use something like attached storage through fibre or SCSI, because if a part of the array goes down then you can jump start it by replacing the SCSI array or the storage node.

Make sure your data storage drives are raid 5 so if you have one disk that goes bad then you don't have to call the whole file system to do a recreate.

Making a large filesystem is expensive. But go with the parallel filesystem option and you will see how fast and easy it is. Also try to use a 64-bit system so you can have larger amounts of metadata, etc.

And PS with such a large filesystem say bye bye to the inode problem.

Slashdot: Recent Intelligence Solutions For Nerds by Cyburbia · 2005-10-25 09:59 · Score: 1

Sad. That post had more jargon than a 1999s dot-com press release.

It looks like marketing-speak has crossed over to the everyday vocabulary of the Slashdot crowd. Didn't a "massive single storage solution" used to be called a "hard drive bank?"

Wait... I know you... by MessageDrivenBean · 2005-10-25 10:02 · Score: 1

Aren't you the people from our local university in Gr.n.ng.n building that giant distributed satellite?

--
Quisque verborum suorum optimus interpres...

Only Coda by Anonymous Coward · 2005-10-25 10:04 · Score: 0

Why not just use the best solution -- Coda !?!

Time to move out of the basement by Anonymous Coward · 2005-10-25 10:04 · Score: 0

You know, Mom & Pop will probably be happy once you go out of there, despite your wet dreams.

Message from the future... by eyebits · 2005-10-25 10:05 · Score: 1

It is the year 2015. I have a petabyte in my iPod.

Try http://www.object-matrix.com/ by Anonymous Coward · 2005-10-25 10:05 · Score: 0

the FS can cope, but it does insist on redundancy

We've come full circle. . . by lisnter · 2005-10-25 10:08 · Score: 1

Funny. Isn't this the same quandary that those "old" computers faced. . .100,000 tubes with a MTBF of 1000 hours so one would fail every 5 minutes. Not having been there I can't say from first hand experience but I read it someplace. . .

Cheapest and easyest to maintain (the crazy way) by Qbertino · 2005-10-25 10:27 · Score: 1

It may sound crazy but the cheapest solution with relatively low power consumption and high redundancy I can think of would be some bizarrely large PC (Quad Opteron w 16 GB RAM or something) and 30 to 50 external USB harddrives attached. Add in Linux and some virtual Software RAID thingy set up to make good use of the horsepower and the only problem you have left might be IO speed. Ext3 is a slowpoke, but it's free, stable and safe - and probably fast enough.
The biggest problem is finding USB adapters that can handle the load and enough sockets to plug them into. I don't know the USB specs, but from what I can tell USB 2 is far more powerful than people usually expect. You'll need a little scripting to keep track of all those drives and their state, but it should work on the software side.
Your power consumption would be extremely low for 'homebrew', and redundancy and inexpensiveness would be best. And you can get everything for that at your local PC shop. Except the Quad Opteron board maybe.

Let's see:
Biiiiiig PC + 5 heavyweight USB 2 Cards => 9000$
60 external 0,4 TB USB HDDs => 18000$
12 USB Switches => 1000$
Backup HDDs, USB stuff and spare parts => 5000$

Sum: 33000$
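A quick sanity check of that shopping list, as a sketch. The line items and prices are the ones quoted above; the raw-capacity figure ignores any RAID or spare overhead, which is an assumption on my part.

```python
# Line items from the list above: (description, cost in dollars)
parts = [
    ("big PC + 5 heavyweight USB 2 cards", 9000),
    ("60 external 0.4 TB USB HDDs", 18000),
    ("12 USB switches", 1000),
    ("backup HDDs, USB stuff and spare parts", 5000),
]

total = sum(cost for _, cost in parts)
raw_tb = 60 * 0.4                  # raw capacity before any RAID overhead

print(total, "dollars total")
print(round(raw_tb, 1), "TB raw")
print(round(total / raw_tb), "dollars per raw TB")
```

That comes to 24 TB raw, so the setup just misses the 25TB target before any redundancy is taken out.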

That's extremely cheap. If it works with all those
USB drives hooked to one Box this is your ticket.
Crazy but feasible nonetheless.

--
We suffer more in our imagination than in reality. - Seneca

SIOS with PVFS2 by lroland · 2005-10-25 10:41 · Score: 1

If all you want is SIOS (Single IO Space) then a PVFS2 setup with multiple data servers would be the way to go. As long as you do not care about concurrent writes to the same data, PVFS2 would be the easiest and cheapest solution. You can even set up a simple stripe so your writes will be distributed (in round-robin manner) among the available data servers. Note however that parallel file systems such as PVFS2 are distinguished from the more general term, distributed file systems, in that they are designed for multiple clients accessing the filesystem data in parallel from a pool of machines - so if what you really want is just to have one machine access a large storage pool, then the lack of metadata caching in PVFS2 may prove too high a price to accept, in which case you may want to give GFS a try.
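The round-robin striping mentioned above can be sketched roughly as follows. This is a generic illustration of the idea, not PVFS2's actual code or on-disk layout; the function name locate and the 64 KiB stripe size are made up for the example.

```python
def locate(offset, stripe_size, num_servers):
    """Map a file byte offset to (server index, offset within that server's data)."""
    stripe = offset // stripe_size          # which stripe unit overall
    server = stripe % num_servers           # round robin across data servers
    local_stripe = stripe // num_servers    # stripe units already on that server
    return server, local_stripe * stripe_size + offset % stripe_size

STRIPE = 64 * 1024   # hypothetical stripe size
SERVERS = 4          # hypothetical number of data servers

for off in (0, STRIPE, 4 * STRIPE + 10):
    print(off, "->", locate(off, STRIPE, SERVERS))
```

Consecutive stripe units land on consecutive servers, so a large sequential write keeps all the data servers busy at once.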

--
"Politics is for the moment, equations are forever" -Albert Einstein

PetaBox? by mr_zorg · 2005-10-25 10:46 · Score: 2, Informative

The PetaBox, as previously discussed on Slashdot sounds like just what you want...

Internet Archive by guerby · 2005-10-25 10:51 · Score: 1

Make a phone call to those guys: http://www.archive.org/

No stamina, eh? :) by Anonymous Coward · 2005-10-25 10:53 · Score: 0

Better go practice... 3 minutes... hmm...

boo hoo. poor you. by Anonymous Coward · 2005-10-25 11:04 · Score: 0

Yeah, I'd like three Ferraris and my budget is $26.00

What kind of crap Ask Slashdot question is this????

Go spend the $x Million required for a handful of Hitachi XP1024's and stop looking to Slashdot to provide your cheap-ass low-budget IT "solutions".

Internet Backplane Protocol [IBP] by mosel-saar-ruwer · 2005-10-25 11:19 · Score: 1

You might also check out the Internet Backplane Protocol, or "IBP", which was designed to store massive amounts of data in a generic "cloud".

For instance, more than 18 months ago, it was already moving 1TB per week on Internet2, and this past week was at 1.896TB.

iBrix and others ... by Anonymous Coward · 2005-10-25 11:22 · Score: 0

There are a number of commercial vendors out there which have very interesting but not necessarily cheap solutions. Have a look at iBrix and iSilon.

QFS by Anonymous Coward · 2005-10-25 11:37 · Score: 0

I am currently in the process of deploying a Sun QFS/SAMFS based filesystem. This environment is being built out to manage about 300TB of data. QFS is a shared, SAN based filesystem, that as far as I am aware, has some kind of obscene maximum filesystem / name space size

The latest version of QFS (4.4 I think) supports Solaris 10 x86 as a full peer which would allow you to install on commodity servers, switches and disk (I'm using a McData switch with IBM and Nexsan disk) for a ~$5000 a box software license, each of which could export the shared QFS over NFS/Samba/FTP whatever for non Solaris clients.

Sun ships a Linux client, but so far it's kind of crappy.

Re:QFS by BobFillmore · 2005-10-25 15:13 · Score: 1

According to this:
http://www.sun.com/storage/software/data_mgmt/qfs/features.xml
QFS supports:
"Scales up to a petabyte with support for 16TB LUNs"

Let's see... 1000/16 = 62.5, so 63 LUNs ... not too bad.
Runs on Solaris servers and supports Solaris and Linux clients @ $150 each.
Not cheap on the server side... $8K to $100K++.

Re:GPFS from IBM - GPL VIOLATOR by Anonymous Coward · 2005-10-25 11:42 · Score: 0

There's been a lot of discussion about whether the closed source modules gpfs requires violate the kernel GPL license. But who would be crazy enough to sue IBM?

Re:What about MatrixStore? - We're evaluating it.. by Anonymous Coward · 2005-10-25 11:46 · Score: 0

Yeah, I bumped into them at the Broadcast show in Amsterdam a while back, we are looking into their stuff in detail to archive all the video we have on tape right now. Not free but they are being pretty aggressive with the pricing right now.

Yep, I said *might* by wsanders · 2005-10-25 11:49 · Score: 1

Well, at least the development and test process would be fun.

And based on what I have heard, most SATA raid-5 controllers are not quite ready for prime time, although I did recently interview with one outfit that was running their whole enterprise on Pogo Linux StorageWare boxes:

http://pogolinux.com/storage/sata/storagewaresata.html

I dunno. I would want to pound the crap out of them for a few months before I committed. Worse, in my last job we had zero budget and a bunch of ancient DL380s with hardware raid; nothing bad ever happened but it kept me awake at night, mostly thinking of new scripts I had to install to make sure every box was at least rsync'ed somewhere else. Now, I'm fortunate to work at a place that can afford EMC and Sun, SCSI and FC stuff. Nothing ever breaks, ever, and I sleep like a baby.

--
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"

ObSimon by sharkey · 2005-10-25 11:52 · Score: 1

Hell, you could just buy several miles of ethernet cable and keep all the bits moving in the network. Imagine how fast retrieval would be if the data was already being transmitted before you requested it? Make sure you use Cat6 though, don't skimp on data integrity!

--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.

Backup? by kbahey · 2005-10-25 11:56 · Score: 1

How are you going to backup that monster?

You said you do not need data redundancy, since backup is nearly impossible, how are you going to survive a disk crash (they *will* crash!)

--
2bits.com, Inc: Drupal, WordPress, and LAMP performance tuning.

How about 0.5 cents/MB? by vrmlguy · 2005-10-25 12:01 · Score: 1

I work for EMC. If anyone needs a petabyte of storage and is willing to buy it all at once, I can get you under the price mentioned above. It will be our largest, slowest drives with minimal host connectivity, but it does include 7+1 RAID protection. If you are serious, drop me a line. Yes, a salesman will call.

--
Nothing for 6-digit uids?

Re:How about 0.5 cents/MB? by ErikZ · 2005-10-29 12:30 · Score: 1

hmmmm .5 cents a MB is $5.12 a GB
which is $5,242 for a TB
which is 5.4 million dollars for a PB...
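Spelled out as a few lines of code, in case anyone wants to plug in a different price. This is a sketch that assumes binary multiples (1024 MB per GB and so on), which is what the figures above use.

```python
CENTS_PER_MB = 0.5

per_gb = CENTS_PER_MB / 100 * 1024   # dollars per GB
per_tb = per_gb * 1024               # dollars per TB
per_pb = per_tb * 1024               # dollars per PB

print(f"${per_gb:,.2f} per GB")
print(f"${per_tb:,.0f} per TB")
print(f"${per_pb / 1e6:.1f} million per PB")
```

With decimal multiples (1000 per step) the same price is a flat $5.0 million per PB, so the unit convention alone moves the total by a few hundred thousand dollars.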

Gah.

--
Democrats or Republicans. They are both taking us to the same place and they are not afraid of us anymore.

openfiler by brianjcain · 2005-10-25 12:30 · Score: 1

Use openfiler. It is good.

Our Environment by evilninja · 2005-10-25 13:21 · Score: 1

I work at a mid-size aerospace company where we're faced with similar problems. Management doesn't grasp the size of the test images/video/etc. that are coming off our satellites, so our budgets generally fall somewhere short of adequate.

In the past, our management splurged and bought a Network Appliance. Two, actually; we have an F880 and an F740. The 740 is pretty much defunct right now; it's only got about 300GB of disk and we use it to house static application installs. The 880 is more robust and therefore more efficiently used, but only has 2.2TB of (insanely reliable) disk. Disk that costs about $20,000 per additional TB.

Last year, we learned that Xyratex (the company that makes the disk shelves for NetApp) has started selling SATA-based disk arrays. Right now, I believe they only support 400GB SATA drives in a 16-drive chassis, but support for 500GB drives is supposedly right around the corner. A fully-populated chassis with 400GB drives will yield about 4.8TB of usable space. We have purchased five head units (about $20,000 full of disk) and one shelf (about $15,000 full of disk). Each unit is expandable to 7 shelves (including the head), which yields over 32TB of usable disk. I don't know what your budget is, but $110,000 is pretty reasonable for over 32TB. Admittedly, you could buy 32 stripped-down Dell Dimension 4700's (to get SATA) with two 500GB hard drives; install a slim OS and you could get approximately the same amount of usable space. But the reliability of the Xyratex has been far greater than the reliability of the Dell machines we've purchased in recent years.
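The cost arithmetic for one fully expanded unit can be checked with a short script. This is a sketch using only the prices and capacities quoted above; it assumes every expansion shelf yields the same 4.8TB of usable space as the head.

```python
HEAD_COST = 20_000     # head unit full of disk, in dollars
SHELF_COST = 15_000    # expansion shelf full of disk, in dollars
USABLE_TB = 4.8        # usable space per 16-drive chassis (assumed equal for shelves)
UNITS = 7              # head plus six shelves

cost = HEAD_COST + (UNITS - 1) * SHELF_COST
capacity = UNITS * USABLE_TB

print(cost, "dollars")
print(round(capacity, 1), "TB usable")
print(round(cost / capacity), "dollars per usable TB")
```

That works out to $110,000 for about 33.6TB, or a little under $3,300 per usable TB.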

It's ironic that you brought up the NeoPath File Director. We're going through the trials and tribulations of installing a clustered pair of them right now. We've had some difficulty in getting them set up, but it seems like they'll do the trick when we get them going. The MSRP on the cluster is kind of high, but talk to their sales guys if you're interested - we got almost 1/3 off the listed price. We plan on using the File Director to migrate old files from the NetApps to our Xyratex, thereby expanding our storage at $3,300 per TB instead of $20,000 per TB. I can see it working well, though, for aggregating a large number of file servers into a single virtual server.

I don't know how the File Director will interface with your operating systems. We use Veritas's Storage Foundation (~$500 per license - you should only need one) on our systems because we're primarily limited to Windows and Solaris, which have difficulty with large filesystems. Storage Foundation breaks the size limitations as well as enabling easier management of your volumes.

I hope this helps. Good luck.

A petabyte of ... by loossy · 2005-10-25 13:58 · Score: 1

pr0n, even i think a tb is enough...

Scaling it? by LinuxATA · 2005-10-25 14:00 · Score: 1

This is a trivial task if you have a few key tools.

One (1) management/virtualization box/head.
Six (6) 5TB iscsi targets.

Why six, because RAID is nice and all hardware fails regardless.

In the first box, you need to import all six targets for raw storage. Using a robust raid/scsi stack to glue all the segments together should yield a raw 30TB base under RAID 0, or a 25TB base under RAID 5.
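The capacity numbers follow from the usual RAID rules: RAID 0 uses every member, RAID 5 gives up one member's worth of space to parity. A minimal sketch, with a made-up helper name and equal-sized targets assumed:

```python
def usable_tb(targets, tb_each, level):
    """Usable capacity for equal-sized members under RAID 0 or RAID 5."""
    if level == 0:
        return targets * tb_each          # pure striping, no redundancy
    if level == 5:
        return (targets - 1) * tb_each    # one member's worth goes to parity
    raise ValueError("only RAID 0 and RAID 5 are handled in this sketch")

print(usable_tb(6, 5, 0))   # six 5TB iSCSI targets, RAID 0
print(usable_tb(6, 5, 5))   # same targets, RAID 5
```

That is why six targets rather than five: the sixth pays for the parity and still leaves 25TB.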

Now to provision the bucket of storage.

I am sure this is boring and trivial, so I will stop.

Whoever you are, feel free to email me offline if more details are desired, or skip it and read the next message.

Cheers,

Andre

Check this out! by oldlefthander · 2005-10-25 14:20 · Score: 1

Look at this..... http://www.archive.org/web/petabox.php

Here's how I did it: by Anonymous Coward · 2005-10-25 14:21 · Score: 0

I've created enormous filesystems (500TB and up) with the following recipe:

As many ATABeasts as you can afford (they'll currently do about 20TB each)...

As many QLogic 2GB fibrechannel cards as you have ATABeasts

1 or more Sun V880s (Just keep buying them as you fill PCI slots)

Veritas Volume Manager/Veritas Filesystem 4.1 or higher.

NFS/Samba to your heart's content

Now, I realize that there are two glaring problems with the above recipe. The first is that only Samba is open source, and I understand there's something of an issue about slashdotters using software that they actually have to pay for. Fear not--in this case the result will be software that actually works.

The second is that Sun will only look at 2TB LUNS and no bigger---and Veritas will only see ~1 TB "drives" or smaller. This actually isn't a problem--simply configure your ATABeasts intelligently, throw a fuckload of LUNS at Solaris (It can take it) and smoosh them all together into one enormo-volume with Veritas.

you might find some useful info at SC|05 by Anonymous Coward · 2005-10-25 15:02 · Score: 0

The SC|05 (SuperComputing) conference in November will have some tracks on high performance storage capabilities as well as a special initiative called StorCloud, the goal of which is to build a petabyte scale storage farm on the show floor (http://sc05.supercomp.org/initiatives/storcloud.php). A lot of the major storage vendors will be at this conference too.

While not all of this may fit in with your organization's goals and requirements exactly, you might find it a useful forum just to check out what these folks are up to and take an opportunity to chat with them.

Just a thought. Good luck!

Re:Andrew File System - I Like It, But Try SAM-FS by draggin_fly · 2005-10-25 15:08 · Score: 1

A better solution for petabytes of storage is a commercial product developed by some former Cray engineers, SAM-FS. This is really an HSM solution -- very scalable, which is the entire point.

The samfs filesystem allows you to browse to a file as if it's online on a disk even if samfs has cached it to some other media, such as DVDs or tapes. The nice thing about this HSM style solution is that it combines completeness (all files available, even if they are on tape and rarely used) with the need for data integrity. SAM-FS allows duplicate tapes or dupe writes to DVD or CD-ROM; it can even write the duplicates in another city. Why not? You can have your 2 PB in NYC and your backup 2 PB in LA. I think that's a worthwhile feature.

The hitch for some folks here is that SAM-FS is written for Solaris only, so far as I know. It is darned fast for what it does, although retrievals of rarely-used files tend to be limited by the HSM media type(s) used in the particular storage system.

Hmm ... disclaimer: I don't sell SAM-FS or Solaris. This is just a good, scalable solution that I've used at a large government facility and can be bought OTS from Sun or StorageTek for fairly serious bucks.

iSCSI + LVM? by psyon1 · 2005-10-25 15:14 · Score: 1

This is totally out of my ass, but I wouldn't mind finding out if it's possible/feasible. Could the person set up a bunch of standard PCs each stuffed with disks, and run software iSCSI servers on those PCs to share each disk? Have one main master server using software or hardware iSCSI add the disks to a Logical Volume. When more storage is needed, a new iSCSI-serving PC could be added to the volume. I'm not sure what file system would work, but then again, I am not sure the idea would work at all.

Re:Cheapest and easyest to maintain (the crazy way by Anonymous Coward · 2005-10-25 15:26 · Score: 0

You don't know what you're talking about.
The fact that you would call ext3 stable and safe is a dead giveaway; the ext file system family is an abomination when it comes to safety.

Re:Backing-up a petabyte by uvajed_ekil · 2005-10-25 15:27 · Score: 1

>(*) I truly have no idea how one backs up a petabyte
Okay, you know how those flash keystick thingies are starting to look really cheap? You just get about....

--
This is a hacked account, for which the owner can not be held responsible.

That's a lot of disks there son! by giberti · 2005-10-25 15:31 · Score: 2, Informative

File systems aside, you're talking about a whole lot of hardware here! Is it really necessary to have all this data online at the same time? Is it possible to store it in some other way (i.e. tapes)? Because it would probably be a whole lot cheaper!

Well, let's see... using 300GB SCSI disks (assuming you can find raid 0 hardware to support enough disks) you can build out a 1PB storage system with about 3,334 disks. That would set you back about $3.3 million, assuming you paid retail prices for the disks (~$1,000 / disk @ CDW today). Of course, if you ordered 3,000+ disks, I'm sure they would cut you a deal on the price.
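The disk count is just a ceiling division. A sketch, assuming decimal units (1PB = 10**15 bytes, 300GB = 300 * 10**9 bytes) and the ~$1,000 retail price quoted above:

```python
import math

PB = 10**15               # bytes, decimal units (an assumption)
DISK = 300 * 10**9        # one 300GB disk, decimal units
PRICE_PER_DISK = 1000     # dollars, the retail figure quoted above

disks = math.ceil(PB / DISK)     # round up, you cannot buy a third of a disk
cost = disks * PRICE_PER_DISK

print(disks, "disks")
print(f"${cost:,}")
```

Note this is raw striped capacity with no parity or spares, so any real build would need more disks than this.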

Any hope of daisy chaining together a few dozen direct attached storage devices to a NAS server? Something like a Dell PowerVault 220 with 14 300GB SCSI drives will set you back about $21K and give you 3.4TB / 3U of space (RAID-5) so you would have some safety net built in (albeit not much). Slap 10 of these on a PowerVault 6000 series and you should have a ballpark of 34TB (while shy of what you're looking for, it gets you in the right direction). Total cost around $250K - do it four times and spread the work out over four logical volumes and you should get in the neighborhood of 1 Petabyte. You could then set up a redundant server structure and for $2 million you have a redundant mirrored architecture ready for one to fail and be brought up online quickly.

--

AF-Design, web development.

Re:Andrew File System - I Like It, But Try SAM-FS by mathrock · 2005-10-25 15:46 · Score: 1

More info on SAM.... SAM-FS and SAM-QFS are both closed source commercial products developed by LSC, which was bought by Sun a few years back. At work recently I had a chance to test SAM-QFS in my lab. To do a shared SAM-FS (QFS) file system the metadata server must be a Solaris machine. Clients of the file system can be Linux; I forget which kernel versions are currently supported. The shared file system part of SAM-QFS scales better than RedHat GFS (largest file system = 8TB) by supporting 252 disk LUNs per filesystem @ 2TB/disk LUN. The HSM archive part of SAM-QFS seemed to have a few issues with very large numbers of files (> 40 million), such as file system backup time and containerizing MANY small files into a container to archive. Also SAM-QFS isn't a journaled file system...there wasn't any consistency check done by the file system dump / restore utilities. I could do a dump of the file system, change a block in the dump file, restore the dump, and have corrupted data -- in my case I hex edited the file names and restored it and it didn't complain one bit. I'd be interested in what other people might suggest for a good archive/HSM product. I've looked at ADIC AMASS, ADIC StorNext, FileTek StorHouse RFS, Sun SAM-QFS, HPSS, IBM SAN FS and haven't found one that was good enough and scalable to the 1 billion files/FS and 1+ PB of data level....

PetaBox by MercMan · 2005-10-25 16:03 · Score: 1

It sounds like your requirements are good (scalable) , fast (or large in this case) and cheap. I'm pretty sure that only 2 out of 3 are possible.

How important is the data? More importantly how much will it cost if you lose the data? Without redundancy if you lose one disk you are likely to lose the whole volume... and cheap disks WILL die. If your budget won't allow you to purchase a real enterprise storage system will it allow you to buy an adequate backup solution?

Check this out it might be your best bet http://www-03.ibm.com/servers/storage/disk/ds4000/

I just sat through a demo of one of these units. I was impressed with their goodness and storage capabilities, and the ability to use SATA drives makes large amounts of storage cheaper than your typical SCSI enterprise storage solution. These boxes are also fully RAID capable... but you still need a backup... but everyone who reads /. knows that :)

RAIDCore S-ATA by Waerloga · 2005-10-25 16:24 · Score: 1

Build a system based on the Broadcom RAIDCore BC4852 controller.
http://www.broadcom.com/products/Enterprise-Small-Office/Storage-Solutions

Tomshardware successfully ran 32 S-ATA drives in a single box in RAID5 mode (2x 16 drive arrays because of software limitations). With the current generation of 500GB drives that should yield you close to 15TB. Add several more boxes along with some clustering software.
http://www.tomshardware.com/storage/20041006/index.html

In a motherboard with 4x PCI-X slots you will get very good performance for your money. S-ATA drives may not be as reliable as SCSI, but they (along with the controllers) are cheap so you can always have a few spares around in case a drive dies (you would with SCSI anyway). Linux drivers are available, of course.

More than one .. by TTK+Ciar · 2005-10-25 17:27 · Score: 1

We have a few redbox racks running in SF now. The Archive grows by about 25TB or 30TB a month, and all of our new storage is Petabox racks of redboxes. We are also retiring some of our aging whitebox systems and replacing them with redboxes, copying their contents over onto the newer media (and md5-doublechecking the contents before we let go of the old box).

-- TTK

coraid by thodu · 2005-10-25 17:48 · Score: 1

Check out www.coraid.com - they use ATA commands over Ethernet for a cheap and scalable storage solution.

Petabytes aren't that much storage by David+Rolfe · 2005-10-25 17:50 · Score: 1

Do you really think a petabyte is a lot of space?

Yes. At your rate, assuming you don't accelerate your [movie downloading, film-less video production/processing, simulation/model data acquisition, account/ticketing for your million-client monthly expansion, vhosting aggregation, etc.]:

It will take you two months to fill a terabyte... so 2000 months to fill a petabyte. If you keep buying new drives every couple months it will only take you 166 years to reach your petabyte!

(This ignores practical problems :-), I mean even boxes with 30 drives in them have a facilities impact at home.)

"Honey, I'm going to need a new addition to the house to hold my 1000 hard drives."*

"Um, no."

(*) Assumes storage densities improve a little over the next 150 years

--
Read Heinlein's 1953 Revolt in 2100, now more than ever.

Re:Petabytes aren't that much storage by ShakaUVM · 2005-10-25 23:50 · Score: 1

No, the point is more that lil old me (and I don't really bother with file sharing/torrents) could fill have a terabyte rather incidentally. For example, I have many gigs of WOW screenshots, which I suppose I could run a batch compressor on... or not -- I have a terabyte of storage!

I picked up 1TB for next to nothing. I'd have bought less hard drive space, but less would have cost more. So if 1TB is sort of the ho-hum consumer standard, 1PB doesn't excite me all that much.

I'm used to working with scientific datasets up to 13TB in size, so, lesse, I could hold a whopping 76 runs on this 1PB server. Eh, that's about half a year's worth of data.
Re:Petabytes aren't that much storage by David+Rolfe · 2005-10-26 06:55 · Score: 1

You said, "No, the point is more that lil old me [...] could fill have [sic] a terabyte rather incidentally."

If that was your point, why did you lead off with the rhetorical regarding a quantity one order of magnitude greater? ("Do you really think a petabyte is a lot of space?")

You then said, "I picked up 1TB for next to nothing. [...] So if 1TB is sort of the ho-hum consumer standard, 1PB doesn't excite me all that much."

But this of course is a complete fallacy. For one, "next to nothing" times 1000 is ... greater than next to nothing! I can't see how you make it out to be some sort of equivalence. Further, in my light-hearted response I alluded to the problem of owning (let alone buying) a petabyte of storage: the non-trivial facilities required. FWIW, as non sequiturs go, 1PB doesn't excite me either.

I'm used to working with scientific datasets up to 13TB in size, so, lesse, I could hold a whopping 76 runs on this 1PB server. Eh, thats about half a year's worth of data.

Well this raises the question, while doing your work/research how do you store a year's worth of data (if you even do)? Tape? Tape/Cartridge library (like Sony's PetaSite)? Online?

With today's storage densities a ho-hum consumer hard drive is 80 cubic inches per terabyte (roughly 4x2x5 inches per half-terabyte drive, and that's on the small side; I don't know if any .5 TB drives meet those dimensions). So, you'll need a thousand of those or so: 80,000 cubic inches. That's a big drive: roughly 43 inches on each side (and damn heavy). Optionally, just take any 500 GB drive you have and imagine it twelve and a half times longer, wider and higher.
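
For anyone who wants to check the cube, here is a small Python sketch of the arithmetic. It assumes the 4x2x5 inch figure describes one 500 GB drive and that a petabyte is 1000 terabytes; the variable names are only for illustration.

```python
# Rough drive dimensions in inches, taken from the paragraph above.
drive_dims_in = (4, 2, 5)
drive_tb = 0.5          # assumption: one 500 GB drive
petabyte_tb = 1000      # assumption: decimal petabyte

drive_volume_in3 = drive_dims_in[0] * drive_dims_in[1] * drive_dims_in[2]
volume_per_tb_in3 = drive_volume_in3 / drive_tb
drives_needed = petabyte_tb / drive_tb
total_volume_in3 = drives_needed * drive_volume_in3

# Side of a cube with that volume, and the linear scale factor for one drive.
cube_side_in = total_volume_in3 ** (1 / 3)
linear_scale = drives_needed ** (1 / 3)

print(f"{volume_per_tb_in3:.0f} cubic inches per TB")
print(f"{drives_needed:.0f} drives, {total_volume_in3:.0f} cubic inches total")
print(f"cube side: {cube_side_in:.1f} in, linear scale: {linear_scale:.1f}x")
```

The cube comes out at roughly 43 inches on a side, and the scale factor at about 12.6, so the "twelve and a half times" estimate holds up.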

"Have you seen my buddy Pete? He's a petabyte drive a little over 5'2", blonde hair, blue eyes. I don't know how I lost him; He's impossible to move."

Either way, I guess it's all just personal opinion. :-\ Using the same scaling factor to my apartment... I can imagine that some folks feel a 12 story mansion with 200,000 sq.ft. per floor is "not that much space".

It just struck me as odd that a storage capacity conveniently measured in hundreds of thousands of dollars could not be "a lot of space" to some folks. Cheers.

--
Read Heinlein's 1953 Revolt in 2100, now more than ever.
Re:Petabytes aren't that much storage by ShakaUVM · 2005-10-26 11:03 · Score: 1

>If that was your point, why did you lead off with the rhetorical regarding a
> quantity one order of magnitude greater? ("Do you really think a petabyte is a
> lot of space?")

It used to be. My point I guess is that a petabyte used to be an unimaginable amount of space. Now it's possible for a prosumer to buy 1PB of disk space for about $20k or so. That doesn't address the issue of infrastructure, but anything that your local video editing dilettante can afford doesn't color me impressed any more.

>Well this raises the question, while doing your work/research how do you store
> a year's worth of data (if you even do)? Tape? Tape/Cartridge library (like
> Sony's PetaSite)? Online?

My master's thesis was on making really large scientific visualization datasets manageable. Imagine the hassle of trying to load such a dataset into a workstation to visualize it. How many hours over gigabit ethernet do you need? How many DVDs would you need to burn? Instead, what I developed was a system where the full dataset was first compressed, but in such a manner that the dataset was still randomly accessible. The uncompressed dataset was never written to disk, since my code was a layer in the I/O component of the simulation software. The compressed dataset remained on SDSC's file store. Next we ran an algorithm over it, developing a map of the dataset by finding interframe correlations. That map was sent to the workstation, which would then asynchronously pull in chunks of the dataset by iterating across the map. It let us work with these massive datasets in close to real time.
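
For readers wondering what "compressed but still randomly accessible" can look like, here is a minimal Python sketch of one common way to get that property: compress fixed-size chunks independently and keep an index of where each compressed chunk lives. This is an assumption for illustration, not the system described above; zlib, the 64 KiB chunk size and the function names are all hypothetical choices, and the correlation map and asynchronous fetching are not shown.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size, chosen only for the example


def compress_chunked(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Compress `data` in independent chunks.

    Returns (blob, index) where index[i] is the (offset, length) of
    compressed chunk i inside blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), chunk_size):
        packed = zlib.compress(data[start:start + chunk_size])
        index.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), index


def read_range(blob: bytes, index, offset: int, length: int,
               chunk_size: int = CHUNK_SIZE) -> bytes:
    """Return data[offset:offset + length], inflating only the chunks it touches."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if length <= 0:
        return b""
    first = offset // chunk_size
    last = min((offset + length - 1) // chunk_size, len(index) - 1)
    out = bytearray()
    for i in range(first, last + 1):
        pos, size = index[i]
        out += zlib.decompress(blob[pos:pos + size])
    skip = offset - first * chunk_size
    return bytes(out[skip:skip + length])


data = bytes(range(256)) * 4096          # 1 MiB of sample data
blob, index = compress_chunked(data)
piece = read_range(blob, index, 100_000, 50)
print(len(data), len(blob), len(index), piece == data[100_000:100_050])
```

The trade-off is that independent chunks compress a little worse than one continuous stream, in exchange for never having to inflate more than the chunks a read actually touches.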
Re:Petabytes aren't that much storage by David+Rolfe · 2005-10-29 17:31 · Score: 1

Instead, what I developed was a system where the full dataset was first compressed, but in such a manner that the dataset was still randomly accessible. The uncompressed dataset was never written to disk, since my code was a layer in the I/O component of the simulation software.

That sounds neat. Isn't this kind of what's happening with non-linear video editing? It's not exactly analogous, I can see. (This is also how, in the bad old days [of early 90s consumer PCM audio editing], Cooledit's "peak files" would give you low latency access to long pieces of audio: the peak files were just variable resolution maps to the audio on disk, requiring one-time delays on initial/final i/o.)

--
Read Heinlein's 1953 Revolt in 2100, now more than ever.

http://www.exanet.com by Anonymous Coward · 2005-10-25 18:18 · Score: 0

I was interviewed for a job there:

http://www.exanet.com/

University of Tübingen by Anonymous Coward · 2005-10-25 18:24 · Score: 0

Tom's Hardware has an article about the 70 TB online backup system of the University of Tübingen:

http://www.tomshardware.com/storage/20030425/index.html

The afrotech solution by zqad · 2005-10-25 18:26 · Score: 1

Just do like they do on Scrapheap Challenge: send a couple of maniacs to a junkyard and let them build a disk array from, for example, parts from an old Dell powerserver and a desktop case. The circuit cards may be too large, though, so you may have to cut them off. Example/tutorial (although in Swedish):

http://www.acc.umu.se/images/archive/20050517-Plasto2000/

Re:The afrotech solution by Hannes+Eriksson · 2005-10-25 21:08 · Score: 1

Best part of it is that it's actually being used for something as productive as serving free software to the world :-)

--
Geek rants since like... 2000 or something.

1000mph by Anonymous Coward · 2005-10-25 18:45 · Score: 0

I need a jet that goes 1000mph. I looked at the commercial offerings, but they are price-prohibitive. For the time being I don't need landing gear. I was hoping that the Slashdot folks could help me.

ZFS by Anonymous Coward · 2005-10-25 18:52 · Score: 0

I think you're waiting for ZFS, the last word in filesystems. Interesting stuff here.

Google FS is not a real FS! by 1tsm3 · 2005-10-25 19:48 · Score: 2, Informative

Google FS is not a real file system. It's just a set of APIs that the program calls. GoogleFS is not integrated into the Linux VFS, so you can't mount a GoogleFS. All programs need to be modified to use the GoogleFS API.

Also, the GoogleFS has very narrow requirements/goals. It works best for programs that only append to the files.

--
-ItsME

4.4TB on raid5 per node at $0.32 per GB by Hackeron · 2005-10-25 19:50 · Score: 2, Interesting

1) ~$100 - nforce4 motherboard with 8 onboard sata ports,
2) ~$40 - an additional PCI sata controller with 4 ports,
3) ~$100 - the cheapest AMD64 CPU you can buy,
4) ~$150 - coolermaster stacker case,
5) ~$1020 - 12 WD 400GB drives,
6) $0 - your favorite Linux distribution.

TOTAL: $1410

Each drive eats about 15W meaning around 180W with an additional 60W for motherboard/cpu consumption which makes it a comparable solution to an efficient scsi solution in terms of power consumption at a small fraction of the cost.

Personally, I created a raid1 array of 2 37GB 10krpm raptor drives for critical stuff and the OS, and 2 raid5 arrays of 5 300GB drives for even superior cost per GB while increasing redundancy by a factor of 2. But that only gives you 2.4TB per node in that case.
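
The numbers above can be sanity-checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes RAID5 loses one drive's capacity per array, takes the prices and wattages as quoted, and uses variable names invented for the example.

```python
# Prices in USD as quoted in the parts list.
parts_usd = {
    "motherboard": 100,
    "sata_controller": 40,
    "cpu": 100,
    "case": 150,
    "drives": 1020,
    "linux": 0,
}
total_usd = sum(parts_usd.values())

# One 12-drive RAID5 array of 400GB drives: one drive's worth goes to parity.
usable_gb = (12 - 1) * 400
usd_per_gb = total_usd / usable_gb

# Power estimate: 15W per drive plus 60W for motherboard and CPU.
power_w = 12 * 15 + 60

# The alternative layout: two RAID5 arrays of five 300GB drives each.
alt_usable_gb = 2 * (5 - 1) * 300

print(f"total ${total_usd}, {usable_gb} GB usable, ${usd_per_gb:.2f}/GB")
print(f"estimated draw: {power_w} W")
print(f"two 5x300GB arrays: {alt_usable_gb} GB usable")
```

The cost per GB here covers hardware only; it leaves out the raid1 system drives and any spares kept on the shelf.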

The configuration can be done with evms or lvm2. Rebuilding and replacing drives on the fly should work just fine in theory (I have never tried it on the fly), but if not, a scheduled 5 minute downtime is just fine too. My previous 0.5TB raid5 has been up >3 years so far, and a hard drive failure just required running mdadm /dev/md0 --add /dev/sda5 to rebuild the array.

Increasing the array size becomes tricky (although it is an available option), and fiddling with various distributed network filesystems doesn't really seem worth it for me personally, but openmosix and other clustering solutions offer distributed filesystems.

Just remember, the SATA architecture is nice; SCSI isn't really a requirement for this kind of solution.

terrascale is cool... by anon+mouse-cow-aard · 2005-10-25 22:26 · Score: 2, Interesting

http://www.terrascale.com/prod_e.html Run a client on linux boxes with user-mode drivers that provide a logical abstraction for a whole network of backend linux boxes over any networking transport you want.

Re:terrascale is cool... by Anonymous Coward · 2005-10-26 05:14 · Score: 0

It is cool... but the implementation currently has a limit of 16TB for a single file system.

Petabox from Capricorn by Ty_Berg · 2005-10-25 22:35 · Score: 2, Interesting

I ran across this a while back at linuxdevices. It is supposed to scale to petabytes and is the main technology used for the Internet Archive.

Capricorn Technologies Petabox
http://www.capricorn-tech.com/

Linux Devices Review
http://linuxdevices.com/news/NS2659179152.html

beowulf cluster by Anonymous Coward · 2005-10-26 01:32 · Score: 0

That is what I currently have: a cluster with 4 servers and about 10TB of distributed data... Now what about the single name-space thing? The only way I have been able to solve the name-space thing is to automount the file servers and then symbolically link various directory trees... Not very desirable.

Suggestions?

Re:call EMC. i am sure their clarion line will han by Phishcast · 2005-10-26 01:34 · Score: 1

EMC is obsolete. Their customers just haven't discovered it yet.

Great! Please build me a 100TB storage array with 128GB of cache in your garage. Not shared cache, please allocate it intelligently per logical device I create. Also, please make sure I have 64 front end fibre channel ports so I can attach this storage to my server farm. Oh, and also have it dial your house when a drive fails and be at my site within 4 hours to replace it.

I may want to connect my mainframe to it as well. While you're at it, build me a second one so I can synchronously or asynchronously mirror logical devices to my datacenter across town. I may want to do cloning and snapshotting to make copies of my production database, so throw in those capabilities too.

Please let me know when it's ready so I can drop my obsolete storage vendor.

prior art. by Anonymous Coward · 2005-10-26 02:39 · Score: 0

I cannot speak for the poster, but when I played with this problem myself I spent *weeks* trying to get AFS and Coda running on a modern 2.6 kernel... just for starters. I thank the wonderful person for the GFS link (I'll try that too). Instead of just criticizing, can you post a link or two yourself?

PetaBox. by kdriedge · 2005-10-26 02:54 · Score: 1

Nerd TV did a show on the archive.org founder, Brewster Kahle.

He talked about doing storage on the cheap.

Here is a link to the system they are using in production.

This solution is GPL'ed. It also appears you can buy it as well.

Umm that's a lot of porn by mustangsal66 · 2005-10-26 03:09 · Score: 1

Just out of curiosity... How long does it take to watch 25 TB of porn?

--
Why worry? Each of us is wearing an unlicensed "nucular" accelerator on his back.
Sig changed for readability by G.W.

Petabyte storage on commodity hardware? by dud83 · 2005-10-26 03:18 · Score: 1

Sure, it quite possibly *can* be done. In the same way that you could theoretically build a space shuttle from a T-Ford and lots of old spray cans!
When you're aiming for extreme solutions, you have to use quite extreme components. Think http://www.sgi.com/products/storage/, http://h18006.www1.hp.com/products/storageworks/xp12000/index.html or even http://www.sun.com/storage/highend/9990/index.xml.

The only imaginable way to get a petabyte on commodity hardware that I can think of is to build a seriously huge Beowulf cluster. But putting a single FS on about ~500 separate computers in a cluster is rather madness...
Take the good advice many fellow slashdotters have given: do NOT use a single filesystem that spans 1 petabyte or more...

Ocean Store? by Anonymous Coward · 2005-10-26 04:13 · Score: 0

http://oceanstore.cs.berkeley.edu/
At least, they have pond.

NetApp does SATA too by Miniluv · 2005-10-26 05:01 · Score: 1

I know they're late to the party, but the FAS3000 line supports the new shelves with SATA drives. This drops their cost down dramatically (over 50% reduction), with the same unusually high NetApp standards. We're upgrading our F820c cluster with FAS3050c heads, and adding some SATA storage for new projects. I'd have to agree with the parent that NetApp is definitely a good solution, though none of their individual filer heads, or even two head clusters, can scale this far up. Their single volume limitation is still at 17.6TB.

Dumb Question by Arandir · 2005-10-26 06:32 · Score: 1

Pardon me for asking a dumb question, but why the fsck do you need all of this in a *single* volume? I can understand the need for single volumes, and the need for large volumes, and even the need for single large volumes up to a point. But your request is taxing my understanding.

--
A Government Is a Body of People, Usually Notably Ungoverned

SDSNM by Anonymous Coward · 2005-10-26 07:27 · Score: 0

I realize that I will most likely never be seen by the other Anonymous Coward but would like to add a possibility. I know that some laboratory in this country was faced with the same problem and eventually had this company SDSNM do the work. Their web page is scant on details but I vaguely recall that specification required 100MB/s for a year using off the shelf hard drives. Worth contacting. If anything happens because of this mention LB.

Massive Solution: by TauntingElf · 2005-10-26 15:39 · Score: 1

Try filetek.com. It's a massive solution but comes with a far cheaper bill at the end of the month. Startup may be steep, but it looks like a hard drive to a Windows box, just like that 100GB one sitting in your computer.

xrootd by gowdy · 2005-10-27 16:53 · Score: 1

You might want to check out xrootd if this is read-only data. It does work for read-write but it isn't as performant as you might want without serious application work. This is a server that uses a redirector to send clients to the machine with the actual data. The web site is http://xrootd.slac.stanford.edu/.

Late followup: 1 petabyte on 10.4 with xsan 1.1 by Rhys · 2005-11-08 09:50 · Score: 1

It's in the admin guide buried a ways down into it in a table of other limits. You'd think they'd have that on the web page, a petabyte is a big number!

--
Slashdot Patriotism: We Support our Dupes!


557 comments