Does ZFS Obsolete Expensive NAS/SANs?

ZFS by Anonymous Coward · 2007-05-30 00:05 · Score: 5, Informative

Also should be noted that FreeBSD has added ZFS support to Current (v7). It's built on top of GEOM too so if you know what that is you can leverage that underneath zfs.

Re:ZFS by ggendel · 2007-05-30 00:08 · Score: 4, Interesting

I think you're a bit high. I put together a 5 x 500GB SATA II disk setup with RAID-Z in a 5-disk enclosure for under $1000. I run it off my Sunfire v20z. That's 2 TB for under 1k USD!
Re:ZFS by Anonymous Coward · 2007-05-30 00:33 · Score: 0

The best part is that when FreeNAS starts using the FreeBSD v7.0 CORE... everyone will be able to utilize it. Not that _I_ like FreeNAS; but people on /., like people in general, can't handle the console. Preferably I run BSD and Solaris barebones, but FreeNAS is probably a good idea for those who can only use a mouse.
Re:ZFS by FST777 · 2007-05-30 02:01 · Score: 1

Why use ZFS? As far as I can see, I find no reason not to use GEOM and UFS2 or something like that...

(This question is serious. I just wonder why ZFS is chosen for this solution. Anyone knows?)

--
Free beer is never free as in speech. Free speech is always free as in beer.
Re:ZFS by mortonda · 2007-05-30 02:31 · Score: 1

I agree... $2000 for 1.4 TB is a bit expensive. I just put together a box with 1.4 TB RAID5 for just under $800 using decent parts from newegg. OTOH, this doesn't include hot-swap bays and such, but it's a great storage solution on a small budget.
Re:ZFS by ePhil_One · 2007-05-30 02:41 · Score: 1, Interesting

I don't get what is so special about ZFS that makes this magically possible. Why not IBM's JFS? Or ReiserFS? Or the old reliable EXT2/3/4? or XFS? FreeNAS has been out there for a while, why not use it? Or a CentOS based OS in place of Solaris?
I'm actually very interested in such a project, but I see nothing compelling here.

--
You are in a maze of twisted little posts, all alike.
Re:ZFS by Penguin's+Advocate · 2007-05-30 02:58 · Score: 1, Insightful

Then you don't know what ZFS is.

--
Frag 'em all...
Re:ZFS by chegosaurus · 2007-05-30 03:19 · Score: 4, Insightful

> Why not IBM's JFS? Or ReiserFS?

Because they are just filesystems. ZFS is also a volume manager.

> Or a CentOS based OS in place of Solaris?

Because CentOS doesn't have ZFS.

+4 Interesting. Awesome.
Re:ZFS by darrylo · 2007-05-30 03:23 · Score: 3, Informative

Increased reliability (all data is checksummed, even in non-raid configurations), near brainless management (e.g., newfs is not needed, raid configurations are trivial to set up, etc.), built-in optional compression (even for swap, if you're feeling masochistic), etc. Encryption is in development.
See my other posts here for links.
Re:ZFS by Penguin's+Advocate · 2007-05-30 03:43 · Score: 1

I used the same case and put together a 3.5TB RAID5 array with 12 drives almost a year ago. I spent $2700 total, and almost $900 of that was on an Areca 12-port SATA-II PCIe RAID5 board. The machine is running gentoo linux and the array filesystem is XFS. It works beautifully. I played with Solaris 10 and ZFS a bit on my AMD64 desktop a couple months ago, but for now I'll stick with hardware RAID. I've also got a few Sun boxes around (a few Ultra2's and an E4500). None of them have much local storage, so I haven't played with ZFS on them. I mostly use them for experimenting with parallel programming stuff (the Ultra2's have 2x400Mhz UltraSPARC II's with 4MB of cache apiece and 2GB of ram, the E4500 has 8x400Mhz UltraSPARC II's with 8MB of cache apiece and 8GB of ram).

Back to the point, I like where ZFS is going, and I can see it working for personal storage and even small to medium business storage, especially if the rumors of it being supported by OS X Leopard are true. For the moment, for most people, I'd still recommend a hardware RAID5 for smaller arrays (12 or fewer discs). Your odds of losing more than one disc at a time are pretty low, you can easily hot-swap a new drive in and keep going, and an array of that size isn't too bad to back up (or even replicate with an entire hot-spare array). If you're a small company, or a startup, or for any other reason have a very limited IT budget, this is a very feasible solution that can in some cases appear as reliable as a much more expensive solution for a while.

--
Frag 'em all...
Re:ZFS by perbu · 2007-05-30 03:44 · Score: 2, Informative

See for yourself.
Re:ZFS by ChrisA90278 · 2007-05-30 03:58 · Score: 4, Insightful

ZFS will do some things the other file systems can't. First off it is "copy on write" and keeps a complete backwards version history so if a file is damaged, deleted or you just need to back out a change to a word processing document you can do that. Also ZFS moves both volume management and raid into the file system. You can add and remove physical drives without stopping the system. And of course it is huge. A 128 bit file system can't ever be filled. (yes "never" do the math) It's also fast and maintains end to end checksum. Sun really has raised the bar here. That said this is not what a typical home user with only a handful of disk drives and users needs.

Back to the question. Can it replace a SAN? Depends on the required performance. If you have 25 or 30 video edit workstations or a corporation with 5,000 desktops it's hard to see how one Solaris server is going to work. You need something that can deliver a lot of I/O bandwidth.
Re:ZFS by WeAreAllDoomed · 2007-05-30 04:16 · Score: 1, Flamebait

'Alan Cox [interview] suggested, "the real test of whether Sun were serious about ZFS being anywhere but Solaris is what they do to license it - they've patented everything they can, and made the code available only under licenses incompatible with other OS products. Their intent is quite clear, and quite sad. Compare it to what the old Sun company did with NFS, which is now a standard used everywhere."' http://kerneltrap.org/node/8066

--
free software, open standards, open file formats, no software patents.
Re:ZFS by Kymermosst · 2007-05-30 04:57 · Score: 3, Informative

Why use ZFS? As far as I can see, I find no reason not to use GEOM and UFS2 or something like that...

Simple administration and data integrity. This is all it takes to make a 6-disk RAID at home:

zpool create sun711 raidz c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t0d0

That gets you a ZFS storage pool mounted at /sun711. 'raidz' specifies a ZFS RAID that is similar to RAID 5 but always does full-stripe writes. Every block is checksummed. All operations are copy-on-write, so no journal or fsck is needed.

Of course, once you have a storage pool, you can then create additional file systems from there. Here's how you create some NFS storage:

zfs create sun711/storage
zfs set sharenfs=rw sun711/storage

When I had a disk start getting flaky (it started reporting high raw read error rates - that's what I get for buying drives on ebay...), I simply did the following:

zpool offline sun711 c0t5d0

zpool replace sun711 c0t5d0

There you have it... it can't get much simpler than that.

--
"Alcohol, Tobacco, Firearms, and Explosives" should be a convenience store, not a government agency.
Re:ZFS by Kymermosst · 2007-05-30 04:59 · Score: 1

Slashdot ate a line (stupid html filter). That should be:

zpool offline sun711 c0t5d0
(physically replace disk)
zpool replace sun711 c0t5d0

--
"Alcohol, Tobacco, Firearms, and Explosives" should be a convenience store, not a government agency.
Re:ZFS by beezly · 2007-05-30 05:09 · Score: 4, Informative

A correction:

First off it is "copy on write"

Copy-on-write is quite a misnomer here (even if Sun use that term). It is a transactional filesystem. Blocks are not copied upon write, they are only written and then the transaction log is updated. It's far more clever than old-fashioned COW schemes. It can be compared with NetApp's WAFL filesystem.
Re:ZFS by bl8n8r · 2007-05-30 05:26 · Score: 2, Informative

You can get ZFS with FreeBSD also. http://lists.freebsd.org/pipermail/freebsd-current/2007-April/070544.html

--
boycott slashdot February 10th - 17th check out: altSlashdot.org
Re:ZFS by synthespian · 2007-05-30 05:35 · Score: 2, Interesting

FUD.
FUD.
FUD.

Interesting times. These days, we not only get the typical Redmond FUD, but FUD from the Linux people.

Available on OpenSolaris and FreeBSD (and being ported at least to NetBSD, AFAIK). Those are free software operating systems.

Your problem is that not everything fits in your little GNU/Linux box.

--
Main difference between the BSD license and the GPL license: one is from California and the other is from Massachusetts
Re:ZFS by slashthedot · 2007-05-30 05:45 · Score: 3, Informative

From ZFS link in Opensolaris.org :

"ZFS is a new kind of filesystem that provides simple administration, transactional semantics, end-to-end data integrity, and immense scalability. ZFS is not an incremental improvement to existing technology; it is a fundamentally new approach to data management. We've blown away 20 years of obsolete assumptions, eliminated complexity at the source, and created a storage system that's actually a pleasure to use.

ZFS presents a pooled storage model that completely eliminates the concept of volumes and the associated problems of partitions, provisioning, wasted bandwidth and stranded storage. Thousands of filesystems can draw from a common storage pool, each one consuming only as much space as it actually needs. The combined I/O bandwidth of all devices in the pool is available to all filesystems at all times.

All operations are copy-on-write transactions, so the on-disk state is always valid. There is no need to fsck(1M) a ZFS filesystem, ever. Every block is checksummed to prevent silent data corruption, and the data is self-healing in replicated (mirrored or RAID) configurations. If one copy is damaged, ZFS will detect it and use another copy to repair it."
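The self-healing idea in that last paragraph can be sketched in a few lines. This is only a toy model, not ZFS's actual on-disk format or code: the `ToyMirror` class, its method names, and the use of SHA-256 as the checksum are all assumptions made for illustration. The point it shows is that a checksum stored apart from the data lets a read tell which mirror copy is bad and rewrite it from a good one.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Stand-in checksum (SHA-256) for this toy model."""
    return hashlib.sha256(data).hexdigest()


class ToyMirror:
    """Hypothetical N-way mirror that verifies every block on read."""

    def __init__(self, copies: int = 2):
        # One dict per simulated disk: block id -> bytes.
        self.disks = [{} for _ in range(copies)]
        # Expected checksums are kept apart from the data they describe.
        self.sums = {}

    def write(self, block_id: int, data: bytes) -> None:
        self.sums[block_id] = checksum(data)
        for disk in self.disks:
            disk[block_id] = data

    def read(self, block_id: int):
        """Return (data, list of disk indexes that had to be repaired)."""
        expected = self.sums[block_id]
        good = None
        damaged = []
        for index, disk in enumerate(self.disks):
            if checksum(disk[block_id]) == expected:
                if good is None:
                    good = disk[block_id]
            else:
                damaged.append(index)
        if good is None:
            raise IOError(f"every copy of block {block_id} failed its checksum")
        # Self-heal: overwrite each damaged copy with a verified one.
        for index in damaged:
            self.disks[index][block_id] = good
        return good, damaged


mirror = ToyMirror(copies=2)
mirror.write(0, b"important data")
mirror.disks[0][0] = b"important dat\x00"  # simulate silent corruption on disk 0
data, repaired = mirror.read(0)
print(data, repaired)
```

With a single copy there would be nothing to repair from, which matches the quote's wording that healing applies to replicated configurations; detection alone still works.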
Re:ZFS by Anonymous Coward · 2007-05-30 05:52 · Score: 0

>> Why not IBM's JFS? Or ReiserFS?
> Because they are just filesystems. ZFS is also a volume manager.

So why not a volume manager (say, LVM) in addition to JFS.

It's not like it's a big deal to have the volume manager as a separate component.

> +4 Interesting. Awesome.

Yeah, I'm amazed you got "Insightful" instead of "Funny".
Re:ZFS by Anonymous Coward · 2007-05-30 06:13 · Score: 0

zfs allows sort of a 'water mark' in the file system that makes it so you don't have to physically access the entire disk to do a snapshot. It takes a fraction of a second to take a snapshot regardless of the size of the storage. So you can roll back to previous points in time painlessly. It is a 128 bit filesystem, I believe it supports encryption too. (anyone?) In Solaris land, it lets you do changes to the size of your array without having to drop and remount devices. Lots of nice features (not saying ufs2 or others aren't nice too), so lots of reasons to like it.
Re:ZFS by WeAreAllDoomed · 2007-05-30 06:20 · Score: 1

Available on OpenSolaris and FreeBSD (and being ported at least to NetBSD, AFAIK). Those are free software operating systems.
Your problem is that not everything fits in your little GNU/Linux box.

why, yes, actually, that is my problem. that's why i posted it - in anticipation of the inevitable "does (will) it run on linux?" questions. apparently, as a result of patents and licensing, the answer is "No".

--
free software, open standards, open file formats, no software patents.
Re:ZFS by Anonymous Coward · 2007-05-30 06:39 · Score: 0

Where's the uncertainty? Sun fears Linux, and their programmers have already admitted this is why they deliberately made a GPL-incompatible license. Using their patent minefield to prohibit GPL implementations would be incredibly foolish if widespread use of ZFS were actually their goal.
Re:ZFS by Anonymous Coward · 2007-05-30 06:59 · Score: 0

Because you want something that comes with support and is designed for fault tolerance, that's why.
Re:ZFS by agallagh42 · 2007-05-30 07:06 · Score: 3, Interesting

"A 128 bit file system can't ever be filled. (yes "never" do the math)"

I did the math. That would handle 3.4x10^38 Bytes, or 340 trillion YottaBytes (1 YottaByte = 1 billion PetaBytes, 1 PetaByte = 1 million GigaBytes). That's a very large number of Bytes, but I still wouldn't use the word never. I usually even try to avoid the phrase "never in my lifetime", but in this case that's probably a safe bet. :)

Note: I'm using the hard drive manufacturer's definition of *bytes here.
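For anyone who wants to redo the math, here is a quick sketch. It assumes, as the post does, that a 128-bit filesystem means 2^128 addressable bytes and that the prefixes are decimal (the hard drive manufacturer's definition); the variable names are made up for the example.

```python
# Decimal (SI) prefixes, i.e. the hard drive manufacturer's definition.
GIGA = 10**9
PETA = 10**15
YOTTA = 10**24

capacity_bytes = 2**128  # assumption: one byte per 128-bit address

# Whole yottabytes, then expressed in trillions of yottabytes.
yottabytes = capacity_bytes // YOTTA
trillions_of_yottabytes = yottabytes / 10**12

print(f"{capacity_bytes:.3e} bytes")
print(f"{trillions_of_yottabytes:.1f} trillion yottabytes")

# The unit relations quoted in the post.
print(YOTTA == 10**9 * PETA)  # 1 YottaByte = 1 billion PetaBytes
print(PETA == 10**6 * GIGA)   # 1 PetaByte = 1 million GigaBytes
```

Python integers are arbitrary precision, so 2**128 is exact here and only the final division into trillions is done in floating point.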

--
Carpe Cerevisi - Seize the Beer
Re:ZFS by Frank+T.+Lofaro+Jr. · 2007-05-30 07:06 · Score: 2, Interesting

A 128 bit file system can't ever be filled.

640K is enough for anyone! :)

Our government will find a way. :)

--
Just because it CAN be done, doesn't mean it should!
Re:ZFS by Anonymous Coward · 2007-05-30 07:15 · Score: 1, Insightful

Actually it is exactly what a home user needs:

ZFS loves cheap disks.

And with only two commands (zpool and zfs) and integrated web frontend, it's very simple to use.
Re:ZFS by tehdaemon · 2007-05-30 07:28 · Score: 4, Interesting

Mass of the earth = 5.9742 × 10^27 grams

Make the drives out of the earth, and you need a drive density of 57 GB/gram

A drive with a density of 1 bit per carbon atom: 5.4 × 10^10 metric tons

Size of said nanotech drive: a cube 2.88 km tall (at the standard density of carbon)

Never in your lifetime is a really safe bet.
T
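The figures above can be reproduced with a short sketch. The assumptions are mine and are labeled in the comments: 2^128 bytes of capacity, one bit stored per carbon atom, and graphite's density of about 2.26 g/cm^3 as the "standard density of carbon".

```python
EARTH_MASS_G = 5.9742e27        # grams, as given in the post
AVOGADRO = 6.02214076e23        # atoms per mole
CARBON_MOLAR_MASS_G = 12.011    # grams per mole
CARBON_DENSITY_G_CM3 = 2.26     # assumption: graphite

capacity_bytes = 2**128         # assumption: 128-bit means 2^128 bytes

# Density needed if the whole earth were turned into drives.
bytes_per_gram = capacity_bytes / EARTH_MASS_G

# Mass of a drive that stores one bit per carbon atom.
bits = capacity_bytes * 8
mass_g = bits / AVOGADRO * CARBON_MOLAR_MASS_G
mass_tonnes = mass_g / 1e6

# Side of a solid carbon cube with that mass.
volume_cm3 = mass_g / CARBON_DENSITY_G_CM3
side_km = volume_cm3 ** (1 / 3) / 1e5   # 1 km = 1e5 cm

print(f"{bytes_per_gram / 1e9:.0f} GB per gram of earth")
print(f"{mass_tonnes:.2e} metric tons of carbon")
print(f"cube side: {side_km:.2f} km")
```

A different form of carbon (diamond is denser) would shrink the cube somewhat, but not enough to change the conclusion.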

--
Laws are horrible moral guides, moral guides make even worse laws.
Re:ZFS by phoenix_rizzen · 2007-05-30 07:30 · Score: 1

It's also available in MacOS X.

Linux is the only OS that is not interested in it, instead (as per normal) doing their own thing.
Re:ZFS by slonkak · 2007-05-30 07:51 · Score: 2, Insightful

Thanks for being a dick. Maybe a little explanation instead of your put-downs would have been more helpful.
Re:ZFS by Anonymous Coward · 2007-05-30 08:19 · Score: 1, Informative

*shrug*, I have Centos in a Solaris Zone. It's living in a ZFS volume. I zfs snapshot it and the ZFS home directory within it just like anything else. Of course I do that from OpenSolaris [technically Solaris Express Community Edition] (I could do it from Nexenta if I wanted, but I'm not).
Re:ZFS by Penguin's+Advocate · 2007-05-30 09:22 · Score: 1

Maybe if you did a little reading of readily available information before posting you could have avoided posting altogether.

--
Frag 'em all...
Re:ZFS by Anonymous Coward · 2007-05-30 10:05 · Score: 0

near brainless management
Reminds me of my job...
Re:ZFS by Perky_Goth · 2007-05-30 10:35 · Score: 1

I don't know, interesting because they're questions that he wanted answered and so did others? Is there an actual harm in not knowing?
Re:ZFS by vux984 · 2007-05-30 11:08 · Score: 1

I did the math. That would handle 3.4x10^38 Bytes, or 340 trillion YottaBytes (1 YottaByte = 1 billion PetaBytes, 1 PetaByte = 1 million GigaBytes). That's a very large number of Bytes, but I still wouldn't use the word never.

If you start running out of space just up the block size. ;)

On the other hand if they ever make a 512 bit filesystem, I think we'll finally be covered. You'll use up all the atoms in the universe making the disk, and still have media for less than 1% of the addressable space, even if your block size is 1 byte.
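A rough check of that claim, as a sketch only: it assumes the commonly cited estimate of about 10^80 atoms in the observable universe and, generously, one byte stored per atom with a 1-byte block size.

```python
from fractions import Fraction

ATOMS_IN_UNIVERSE = 10**80   # assumption: common order-of-magnitude estimate
ADDRESSABLE_BLOCKS = 2**512  # 512-bit addresses, 1-byte blocks

# Fraction of the address space covered if every atom held one byte.
covered = Fraction(ATOMS_IN_UNIVERSE, ADDRESSABLE_BLOCKS)
print(f"fraction of address space covered: {float(covered):.2e}")
print(covered < Fraction(1, 100))

# Smallest address width whose space exceeds the atom count.
min_bits = ATOMS_IN_UNIVERSE.bit_length()
print(f"address bits needed to out-number the atoms: {min_bits}")
```

Under these assumptions the covered fraction is far below 1%, and anything wider than a couple of hundred bits already has more addresses than there are atoms.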
Re:ZFS by retiarius · 2007-05-30 11:22 · Score: 1

One answer, directly from Sun (a non-trivial ZFS subset is already GPL2):

http://blogs.sun.com/darren/entry/zfs_under_gplv2_already_exists

Another answer direct from the May 14, 2007 blog of Sun CTO Greg Papadopoulos:
_____

"We will *never* (yes, I said *never*) sue anyone who uses our ZFS codebase and follows the terms of the license: they publish their improvements, propagate the license, and not sue anyone else who uses the ZFS codebase. And look at the innovation not only with ZFS in OpenSolaris, but its adoption by Mac OS X and BSD.

But under what conditions would we enforce our patents? How would we feel if someone did a cleanroom version of ZFS and kept the resulting code proprietary?

We wouldn't feel good, to be sure. But I'd put the burden back on us (certainly as a large company) that if such a thing were to happen it was because we were failing to *continue to* innovate around our original code. Being sanguine about patent protection as an exclusive right would result in less innovation, not more."
_____

GPL3 will make this all moot: Stallman's GNU/Linux will merge with Solaris, which will disintermediate the Linus GPL2 version, leaving it behind as the Unix clone it was originally intended to be. Ultimately, such a "me too" Unix will have little reason to exist when one can get the real thing under the new laws of free software, courtesy of BSD/GPL3 and Sun's patent peace.
Re:ZFS by jp10558 · 2007-05-30 12:18 · Score: 1

One thing I've read is you cannot increase a RAID Z "drive" easily by adding another disk. Is this true?

--
Opera, Proxomitron-Grypen,GPG 0x0A1C6EE3
Re:ZFS by Anonymous Coward · 2007-05-30 12:32 · Score: 0

Dick status confirmed.
Re:ZFS by Kymermosst · 2007-05-30 12:40 · Score: 2, Insightful

I haven't tried it personally, but it should work just fine with "zpool add " (might need a -f after add). RAID Z even copes well with different-sized disks, a configuration I was running for a short time while I waited for bigger disks to arrive. What you cannot do easily is decrease the size of a RAID Z. Currently there is no easy way to permanently reduce the size of a RAID Z by removing a disk. Support for that is planned, however.

--
"Alcohol, Tobacco, Firearms, and Explosives" should be a convenience store, not a government agency.
Re:ZFS by larry+bagina · 2007-05-30 13:11 · Score: 1

There are no patent or licensing problems. There are social/technical problems -- namely ZFS doesn't fit into the linux approach to file systems and device drivers.

--
Do you even lift?
These aren't the 'roids you're looking for.
Re:ZFS by nuzak · 2007-05-30 15:05 · Score: 3, Informative

To say nothing of the energy requirements of populating that drive. Quoth Jeff Bonwick:
Although we'd all like Moore's Law to continue forever, quantum mechanics imposes some fundamental limits on the computation rate and information capacity of any physical device. In particular, it has been shown that 1 kilogramme of matter confined to 1 litre of space can perform at most 10^51 operations per second on at most 10^31 bits of information [see Seth Lloyd, "Ultimate physical limits to computation." Nature 406, 1047-1054 (2000)]. A fully populated 128-bit storage pool would contain 2^128 blocks = 2^137 bytes = 2^140 bits; therefore the minimum mass required to hold the bits would be (2^140 bits) / (10^31 bits/kg) = 136 billion kg.

To operate at the 10^31 bits/kg limit, however, the entire mass of the computer must be in the form of pure energy. By E=mc^2, the rest energy of 136 billion kg is 1.2x10^28 J. The mass of the oceans is about 1.4x10^21 kg. It takes about 4,000 J to raise the temperature of 1 kg of water by 1 degree Celsius, and thus about 400,000 J to heat 1 kg of water from freezing to boiling. The latent heat of vaporization adds another 2 million J/kg. Thus the energy required to boil the oceans is about 2.4x10^6 J/kg * 1.4x10^21 kg = 3.4x10^27 J. Thus, fully populating a 128-bit storage pool would, literally, require more energy than boiling the oceans.
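The arithmetic in the quote can be checked with a short sketch, reading its figures as powers (a limit of 10^31 bits per kg, a pool of 2^128 blocks) and using the same rounded constants for water. The 512-byte block size is inferred from the quote's 2^128 blocks = 2^137 bytes; the variable names are made up for the example.

```python
BITS_PER_KG_LIMIT = 1e31     # Lloyd's limit as quoted
SPEED_OF_LIGHT = 2.998e8     # m/s
OCEAN_MASS_KG = 1.4e21       # as quoted
HEAT_0_TO_100C = 4.0e5       # J/kg, rounded as in the quote
LATENT_HEAT_VAPOR = 2.0e6    # J/kg, rounded as in the quote

# 2^128 blocks of 512 bytes (2^9), 8 bits (2^3) per byte.
pool_bits = 2**128 * 512 * 8

# Minimum mass able to hold that many bits, and its rest energy E = m * c^2.
mass_kg = pool_bits / BITS_PER_KG_LIMIT
rest_energy_j = mass_kg * SPEED_OF_LIGHT**2

# Energy to heat the oceans to boiling and then vaporize them.
boil_oceans_j = (HEAT_0_TO_100C + LATENT_HEAT_VAPOR) * OCEAN_MASS_KG

print(f"minimum mass: {mass_kg:.2e} kg")
print(f"rest energy:  {rest_energy_j:.2e} J")
print(f"boil oceans:  {boil_oceans_j:.2e} J")
print(f"ratio:        {rest_energy_j / boil_oceans_j:.1f}x")
```

The mass comes out at roughly 1.4x10^11 kg, in the same ballpark as the quoted 136 billion kg, and the rest energy is a few times the energy needed to boil the oceans.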

--
Done with slashdot, done with nerds, getting a life.
Re:ZFS by Anonymous Coward · 2007-05-30 15:37 · Score: 0

> Never in your lifetime is a really safe bet.
How is it a safe bet if I'm dead?
Re:ZFS by jhol13 · 2007-05-30 16:55 · Score: 1

That said this is not what a typical home user with only a handful of disk drives and users needs.

Perhaps does not *need*, but the advantage over e.g. ext3 is huge.

If you do not want redundancy you can stripe the disks and the performance will increase almost linearly (depending on ATA/SATA/motherboard/...). And you've got end-to-end checksums, simple administration, trivial snapshots, etc.

Yes, ZFS does have its limitations, it is no silver bullet for every system (biggest IMHO is inability to remove a disk from a stripe even if there would be enough space).
Re:ZFS by drsmithy · 2007-05-30 17:02 · Score: 1

One thing I've read is you cannot increase a RAID Z "drive" easily by adding another disk. Is this true?
Yes, although they're working on it because it's pretty high on many people's wishlists. It's a non-trivial problem, however, since all the data and parity has to be redistributed across the entire set of disks (including the new one).
ZFS supports RAID5 (RAIDZ) and RAID6 (RAIDZ2), but it's pretty much assumed that in production you'll use RAID1 (mirrors). Growing an existing mirrored array is easy - just add another pair of disks.
Re:ZFS by Anonymous Coward · 2007-05-30 19:19 · Score: 0

What are the patent problems preventing ZFS inclusion in Linux but not in BSD or Mac OS X?
Re:ZFS by Anonymous Coward · 2007-05-30 19:44 · Score: 0

your management has brains?
Re:ZFS by Anonymous Coward · 2007-05-30 20:20 · Score: 0

I'm sorry 1992 called and wants its operating system back. Nobody cares about your legacy crap.
Re:ZFS by Flodis · 2007-05-30 20:46 · Score: 1

I guess it's time to start uncurling all those hidden dimensions hinted at in string theory. Maybe they'll hold the data.
Re:ZFS by Gekke+Eekhoorn · 2007-05-30 21:30 · Score: 1

Actually, the problem is that you can't grow the number of columns in your RAIDZ set. This means that if you have a 4-disk RAIDZ storage, you can't make it a 5-disk RAIDZ storage.

What you can do however, is replace the disks one by one by bigger disks. This will give you a bigger RAIDZ pool with the same number of disks.

See a thread discussing this and other options here:
http://www.opensolaris.org/jive/message.jspa?messageID=118614#118614

Wout.
Re:ZFS by jsoderba · 2007-05-31 02:36 · Score: 2, Insightful

In the development branch, yes. If you want a well-tested ZFS, Solaris is it.
Re:ZFS by Anarke_Incarnate · 2007-05-31 03:13 · Score: 2, Informative

Depending on the needs and configuration, however, I have found UFS to be up to 5x faster than ZFS for reading small blocks of data. One more thing: ZFS will try to use almost all memory as a cache. This is bad for databases or performance applications. ZFS will release memory when an application, which should have a higher priority, needs it, but there is a bug in some versions of ZFS that, due to the bad memory accounting, releases too many pages of memory, and too fast, causing thrashing and a performance hit. For a file server, ZFS is great, but it still has a ways to go for some applications.
Also, if you are using a controller (as on a SAN) that has read/write order battery backed cache, you will want to disable fsync() for ZIL in the /etc/system file. There is also a nice script that can be used to limit the amount of memory ZFS uses, though, again, due to the poor memory accounting, it doesn't work that great. I have limited ZFS to 512MB of RAM to use and it grabs 3GB. Solaris 10 u4 is supposed to address some of this. Because of ZIL (part of the integrity check/scrubber) performance for multiple writes suffers and it can make using it as an NFS server less than ideal. It has a LOT of potential, and with enough spindles it can overcome some of these issues temporarily. It just has not matured to the level at which they are selling ZFS.
Re:ZFS by slonkak · 2007-05-31 06:03 · Score: 1

You could use this:

How to be nice
Re:ZFS by lukas84 · 2007-06-01 21:43 · Score: 1

It's also painfully slow. Performance comes with a high number of disk arms.
Re:ZFS by mortonda · 2007-06-02 02:26 · Score: 1

Which is why I have 4 disks in the array... :)
Re:ZFS by darkuncle · 2007-06-02 06:10 · Score: 1

340 trillion billion million gigabytes. yeah, I think if you go out on a limb and say "never", you can rest assured that you will at least not be proven wrong in your lifetime.

--
illum oportet crescere me autem minui
Re:ZFS by insignis · 2007-06-04 15:37 · Score: 1

From http://physics.princeton.edu/~mcdonald/examples/QM/lloyd_nature_406_1047_00.pdf [PDF]:

"The ultimate laptop performs 2mc^2/(pi)(Planck's reduced constant) = 5.4258 x 10^50 logical operations per second on ~10^31 bits."

(so the figures are about 10^51 and 10^31, not the literal 1051 and 1031 that the quote reads as once the superscripts are stripped)

Specifics please. by PowerEdge · 2007-05-30 00:06 · Score: 5, Insightful

Not enough specifics here. I am going to say do your thing. If it works, you're a hero and saved 47k. If it doesn't, obfuscate and negotiate the 50k of storage down to 47k. Win for all.

Unless you would like to give more specifics. Cause I am going to say in 99% of cases where you want fast, reliable, and cheap storage you only get to pick two.

Re:Specifics please. by Anonymous Coward · 2007-05-30 00:16 · Score: 0

fast, reliable, and cheap storage you only get to pick two.

Well, in this case they went with cheap consumer-grade hardware, meaning that the drives will probably not be the part most likely to fail.
Re:Specifics please. by tgatliff · 2007-05-30 00:42 · Score: 4, Insightful

"consumer-grade hardware"???

Do you honestly believe the slogan of "business-grade"? Come on, let the marketing jargon go. Hardware designs are expensive, so rarely are there multiple designs. Sales guys are selling you additional support, but the hardware is rarely different. If it is, then the volume is not there, so the reliability is actually worse. Volume is the king of reliability. Reliability is always more dependent on the age of the design and its volume rather than the intended customer...
Re:Specifics please. by Ngarrang · 2007-05-30 01:02 · Score: 5, Insightful

Unless you would like to give more specifics. Cause I am going to say in 99% of cases where you want fast, reliable, and cheap storage you only get to pick two.
I disagree completely. Computer hardware is a commodity. The big box makers are afraid of this very kind of configuration which would blow them out of business if more people caught on to it. No, they use FUD to convince PHBs that because of the low cost, it cannot possibly be as good. Hot-swap and hot-spare are commodity technologies. But, please, feel free to continue the FUD, because it helps the bottom line.

--
Bearded Dragon
Re:Specifics please. by toleraen · 2007-05-30 01:57 · Score: 3, Insightful

So "Mission Critical" is just a myth too, right?
Re:Specifics please. by Score+Whore · 2007-05-30 02:13 · Score: 5, Insightful

No kidding. Without specific details there is no way to answer whether this is a good solution to his particular situation. However, even in the absence of details I can say this:

1) That case has twelve spindles. You aren't going to get the same performance from a dozen drives as you get from a hundred.

2) That system includes a small Celeron D processor with 512 MB RAM. You aren't going to get the same performance as you'll get from multiple dedicated RAID controllers with twenty+ gig of dedicated disk cache.

3) Your single gigabit ethernet interface won't even come close to the performance of the three or four (or ten) 2 gigabit fibre channel adapters involved in most SAN arrays.

4) Software iSCSI initiators and targets aren't a replacement for dedicated adapters.

5) The Hitachi array at work never goes down. Ever. Not for expansions. Not for drive replacements. Not for relayouts. The reliability of a PC and the opportunity to do online maintenance won't approach that of a real array.
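Point 3 can be put in rough numbers. This is a back-of-the-envelope sketch using nominal line rates only; it ignores encoding and protocol overhead, so real throughput on both sides is lower. The helper name is made up for the example.

```python
def nominal_mb_per_s(gbit_per_s: float, links: int = 1) -> float:
    """Nominal aggregate rate in MB/s (decimal) for a number of identical links."""
    return gbit_per_s * 1e9 / 8 / 1e6 * links


# One gigabit ethernet interface on the PC-based box.
gige = nominal_mb_per_s(1)

# Three, four or ten 2 gigabit fibre channel adapters on a SAN array.
fibre_channel = {links: nominal_mb_per_s(2, links) for links in (3, 4, 10)}

print(f"1 x GbE:        {gige:.0f} MB/s")
for links, rate in fibre_channel.items():
    print(f"{links:2d} x 2Gb FC:   {rate:.0f} MB/s ({rate / gige:.0f}x)")
```

Even before counting spindles or cache, the single network link caps the box at a fraction of what the multi-adapter array can move.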

Don't get me wrong. That case makes me all tingly inside -- for personal use. But as a SAN replacement, fuck no. It's not the same thing. The original question just shows ignorance of what SANs are and the roles they fill in the data center.

As a workgroup storage solution for a handful of end users on their desktops, that solution may well be a good fit. As a storage solution for ten (or two hundred) business critical server systems, no way.
Re:Specifics please. by sixoh1 · 2007-05-30 02:22 · Score: 4, Informative

Designs are expensive, but components are not. My PCB designs can support several different Bill-Of-Materials loads during manufacturing, and when the boards are destined for industrial or military use we can use 'screened' parts which have been pre-selected and tested at high temperature to ensure correct operation. Parts that are marginal at higher temps may be fine for consumer boxes (i.e. the ones on your desktop), but in a server box that has to run 24-7-365 they may not be a good idea. I've been frustrated with the exact scenario quizzed in the original topic: I'm using Maxtor SATAII 500GB disks as a drop-zone for my DLT backup machine, I've had the HDDs for less than a month, and already 3 of 4 have failed with bad sectors because they all sit in a PC case. I'm going to have to rig out the box with extra fans, and the hassle of pulling and replacing the disks is driving me crazy too, so now I'm adding removable disk bays. Not as easy or as cheap as I had anticipated (labor costs mostly).
Re:Specifics please. by Bandman · 2007-05-30 02:22 · Score: 1

What kind of Hitachi array do you have?

We're looking at getting a Qlogic 2340 later this year for ~ $50k.

--
Check out my sysadmin blog!
Re:Specifics please. by Ngarrang · 2007-05-30 02:28 · Score: 5, Informative

Toleraen wrote, "So "Mission Critical" is just a myth too, right?"

No system can compensate for bad management by people, but I digress.

All data is critical. But, to say that your data is less safe with a system that cost $4700 than a system that cost $50,000 is fallacious without some heavy proof behind it. For now, I am going to ignore that a functional backup is part of "mission critical" and just address the online storage portion of the argument.

Let's start with a server white box. Something with redundant power supplies, ethernet, etc. Put a mirrored boot drive in it. Install Linux. So far, the cost is fairly low. Add an external disk array, at least 15 slots, the ones with hot-swap, hot-spare, RAID 5, redundant power supplies and fill it with inexpensive (but large) SATA drives. Promise sells one, as do others. Attach to server, voila, a cheaper solution than EMC for serving up large amounts of disk space.

What if a drive fails? The system recreates the data (it is RAID5, after all) onto a hot-spare. You remove the bad drive, insert new, run the administration. The users won't even notice their MP3's and Elf Bowling game were ever in danger.

For the people who believe strongly in really expensive storage solutions, please explain why. I would like to know if you also hold the same theory for your desktop PCs, because surely, a more expensive PC has to be better. Right?

--
Bearded Dragon
Re:Specifics please. by Score+Whore · 2007-05-30 02:29 · Score: 1

http://www.sun.com/storagetek/disk_systems/data_center/9990/
Re:Specifics please. by hackstraw · 2007-05-30 02:32 · Score: 3, Informative

Hardware designs are expensive, so rarely are there multiple designs. Sales guys are selling you additional support, but the hardware is rarely different.

True, but there is a difference. The difference is in QA.

The "consumer-grade" and "business-grade" are the same off the shelf stuff, but if you are getting business-grade stuff from a reputable vendor they QA the consumer-grade parts, throw out the bad ones, and stamp "business-grade" on the ones that survive. This is why the business-grade level of products often are a generation or so behind the consumer-grade level.

Yes, you can get lucky and get consumer-grade stuff that works great. But if it doesn't, then you are the QA guy, and the downtime is on you. If the time for you to do the QA and the associated downtime is cheaper than the cost of business-grade, then by all means do it. Otherwise, you have to pay the extra bucks.

Now, regarding NASs, I think these things are overpriced, especially the maintenance on them. The maintenance goes through the roof once the equipment is beyond the MTBF of the drives, which is where a high dollar NAS should shine right? Any piece of crap RAID box will work when all the drives are new and functioning well. What you are paying for is the redundancy and availability, which is redundant to pay for when all of the equipment is new.
Re:Specifics please. by lymond01 · 2007-05-30 03:00 · Score: 3, Informative

If you're buying from EMC or another large storage company, you do pay a premium. Generally, it's for simple configuration of the NAS or SAN using their proprietary software. You're also paying for warranty and support, something you don't get through NewEgg (you get it, but it's limited). If you're either a large company not wanting to pay a yearly salary for 3 or 4 admins to run your storage system, or a smaller company that doesn't have the technical know-how to do it yourself reliably (not everyone reads "Ask Slashdot" regularly), then the premium stores are a good way to go (if you have the money).

It's the same reason we buy Dell. We could buy white boxes or parts from Newegg for all of our systems, but talk about a hassle when it comes to them needing hardware maintenance or just assembly. With the support Dell offers, we get a complete box that's been tested, we just need to reformat and install our own stuff. Something breaks, we make a 10 minute phone call and get a replacement the next day, with or without assisted installation. But we pay probably 30% more per box for that.
Re:Specifics please. by zerocool^ · 2007-05-30 03:07 · Score: 3, Insightful

Solly Chollie.

We have one of those "$50,000 SANs". Actually, with the expansion, ours cost closer to $110,000, but whatever. For one, it's got more drives (I think we're up to 30). For two, they're 10k RPM SCSI drives. For three, it's fibrechannel, and all the servers that it's connected to get full speed access to the drives. For four, we have a service level agreement with Dell/EMC (It's an EMC array, rebranded under Dell's name) that says if we have a dead drive, we get a dell service hobbit on site within 4 hours with a new drive. We also get high level support from them for things like firmware upgrades (we had to go through that recently to install the expansion).

The grandparent is exactly correct:
fast, reliable, and cheap storage - you only get to pick two.

We picked fast and reliable. You picked reliable and cheap. There's nothing wrong with either decision, but please don't tell me that a $3000 solution would have done everything that we needed.

You can't have all three.

~Wx

--
sig?
Re:Specifics please. by BosstonesOwn · 2007-05-30 03:17 · Score: 1

As a Sun employee, we love those arrays; we barely ever get calls on them. And when we do, it's mostly service on disks that went bad that they didn't notice, sometimes for weeks (inattentive SA?).

--
This package Does Not Contain a Winner
Re:Specifics please. by toleraen · 2007-05-30 03:21 · Score: 1

Storage space, hot swapping, and hot spares aren't the only advantage the higher end storage systems come with. Compare the certifications, mechanical specs, and levels of support the cheaper options have versus the higher end options. Those are the things that are important to me, and are required by my customers.
Re:Specifics please. by PowerEdge · 2007-05-30 03:39 · Score: 1

Things like snapshotting, mirrorview, replication manager, disaster recovery, 4Gb ports, HW iSCSI, switched redundant fabrics, active/active pathing, and tons of cache are reasons to go for higher end storage systems.

Another option that fits between DIY and EMC is something like the PowerVault MD arrays, with or without HW iSCSI. iSCSI won't really compete with SAN until 10Gbit, and even then Fiber will be on par as far as per-port costs go (that, and it is an established protocol). If going the MD route, investigate RHEL and GFS (formerly Sistina); it is a good product, and GFS2 will be ready for prime time come Update 1.
Re:Specifics please. by p!ssa · 2007-05-30 03:43 · Score: 1

I had the same experience with Maxtor SATA drives (MaxLine Plus II 250GB). Previously we ran all SCSI but purchased 50 IBM xSeries 226's each loaded out w/4 of these drives in RAID 5. Of the 50 servers, 36 failed in the first 3 months, all due to drive failures (many losing 2 at a time killing the array). The servers themselves were fine but Maxtor (Seagate?) drives are junk. We still use some SATA drives but only the WD Raptors, they have been very reliable. We've only lost 3-4 out of hundreds of Raptors in the last year and a half. I think SATA drives are OK for data/content that is infrequently accessed but I would never try using them in any database or high load content server.
Re:Specifics please. by hercubus · 2007-05-30 03:44 · Score: 0, Flamebait

i can tell you they don't "throw out" the bad ones,
they sell them to the US government. then overcharge
to replace them, using of course more substandard gear

here at the Fed, we're so lazy we even outsource
wasting your tax dollars. i'm smiling, are you?

--
-- How I want a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.
Re:Specifics please. by TopSpin · 2007-05-30 03:44 · Score: 2, Informative

Not enough specifics here. I am going to say do your thing. If it works, you're a hero and saved 47k. Not really. The assertion that a 12 spindle NAS box with iSCSI costs 50k is the issue. That level of NAS/iSCSI hardware does not cost 50k. It may have, years ago, from Netapp or someone, but not today. Today such a box will cost around 10k with equivalent storage.

A Netapp S500 with 12 disks and NAS/iSCSI features is a good example. Roughly 10k and you get Netapp's SMB/CIFS implementation (considered excellent), NFS, iSCSI, snapshots, etc. Slightly lower price points can be had through Adaptec Snap Servers. They have a nice SAS JBOD expansion unit for their systems. HP just released new "storage servers" based on Microsoft's storage server OS; heck of a lot of value in those systems.

The delta between 3-4k and 10k isn't trivial, and if your budget is tight perhaps you should roll your own. But 10k for supported NAS/iSCSI that functions a few minutes after you get it in the rack isn't a ripoff either. Not by a long shot.

--
Lurking at the bottom of the gravity well, getting old
Re:Specifics please. by swordgeek · 2007-05-30 03:46 · Score: 5, Interesting

It's all a question of scale, and your scale is a bit skewed.

The premium paid for higher-end storage is decidedly nonlinear. For marginally more reliable or faster storage, you pay about a factor of ten. One example I'm familiar with is Hitachi. We had a 64TB HDS array a few years ago that was worth roughly $2M. We could have purchased an equivalent amount of commodity storage for probably $200k at the time, but didn't. Why would we spend the extra money? Speed, configurability, expandability, and reliability.

First of all, speed. That thing was loaded with 73GB 15k FCAL drives. RAID was in sets of four disks, with no two disks in a set sharing the same controller, backplane, or cache segment. Speaking of cache, the rule was 1GB/TB, so we had 64GB of fast, low-latency, fully mirrored cache on the thing. It was insanely fast, and (most importantly) didn't slow down under point load. One tool automatically ran on the array itself, looking for hotspots and reallocating data on the fly.

Configurability: We could mirror data synchronously or asynchronously to our DR site, by filesystem, file, block, LUN, or byte. We could dynamically (re)allocate storage to multiple systems, and moving databases between machines was a breeze. Disk could be allocated from different pools (i.e. different performing drives could be installed), depending on requirements. Quality-of-Service restrictions could be put in place as well, although we never used them.

Expandability: The beast had 32 pairs of FC connections, could support 96GB of internal mirrored cache, and I can't remember how much actual disk. The key wasn't the amount of disk we could put on it, so much as how well the bandwidth scaled--and it scaled well.

Finally, the real key - Reliability. All connections were dual-pathed, with storage presented to a pair of smart FC switches which were zoned to present storage to various systems. We could lose three of the four power cables to the main unit (auxiliary disk cabinets only had two power connections each), and still run. We could lose any entire rack, and still run. We could lose any switch in our environment, and still run. We could lose two disks from the same RAID set and still run. When we lost a disk, the system would automatically suck up some cache to use for remirroring the data to multiple disks as fast as possible, and then after protecting it, would remirror back to a single logical device. In the event that we lost the entire device, we could run from our DR site synchronous mirror with less than a ten second failover.

This sort of thing is massive overkill for most people and companies, but when someone is doing realtime commodities trading, (or banking, or stock exchanges, etc.) the protection and support are worth the extra money. You just can't build that sort of thing on your own for any less money, at the end of the day.

--

"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
Re:Specifics please. by twiddlingbits · 2007-05-30 04:08 · Score: 1

As a former Sun Sales support guy (PreSales Architect), I can say we loved selling those. About 500K+ per unit, configured modestly. However, the EMC Symmetrix wins in cost and also has very good performance.
Re:Specifics please. by BosstonesOwn · 2007-05-30 04:12 · Score: 1

And I wouldn't disagree with that. Our line has always been that you pay for the support. My experience has been that other companies don't have quite as much "talent" in the center and on the streets as Sun has.

That is a major reason why I chose to work here.

--
This package Does Not Contain a Winner
Re:Specifics please. by Anonymous Coward · 2007-05-30 04:21 · Score: 0

There's a big difference between paying 1.3x and paying 10x.
Re:Specifics please. by hpavc · 2007-05-30 04:25 · Score: 1

The problem I see in your solution is raw access to the disk. Even a minimal network of 50 clients clawing at that $2000 home brew setup is going to be painful, much less 300 or so clients. An EMC/Filer isn't going to blink. There is a lot more than a controller, drives and a NIC going on.

--
members are seeing something, your seeing an ad
Re:Specifics please. by mcrbids · 2007-05-30 04:26 · Score: 2, Interesting

We have one of those "$50,000 SANs". Actually, with the expansion, ours cost closer to $110,000, but whatever. For one, it's got more drives (I think we're up to 30). For two, they're 10k RPM SCSI drives. For three, it's fibrechannel, and all the servers that it's connected to get full speed access to the drives. For four, we have a service level agreement with Dell/EMC (It's an EMC array, rebranded under Dell's name) that says if we have a dead drive, we get a dell service hobbit on site within 4 hours with a new drive. We also get high level support from them for things like firmware upgrades (we had to go through that recently to install the expansion).

While I agree with your "Solly Cholly" comment, I've steered clear of the SAN/NAS solution for one simple reason: it's a single point. Maybe I'm just paranoid or unreasonable, but having ANY single point of failure creeps me out, even if it has excellent numbers, etc. Any SINGLE point will have the following problems:

1) Uptime: if it goes down, so does everything else. Is there more to add here?

2) Scalability: Any single point limits the performance of the system at large. Sure, your SAN/NAS might be mighty quick, and mighty fast. But given its current workload, can you sustain 100x growth annually for a few years without re-architecting your entire information infrastructure?

3) Price: It costs lots of cash upfront. Yes, expensive. But it gets even more expensive as your system grows beyond the performance of a single NAS.

4) Locality: Less of an issue than it used to be, but still an issue. If your IT infrastructure depends on the existence of a single-point SAN/NAS, your ability to spread to other geographical regions is limited. Sure, there's the WAN. But WANs are slow, unreliable, riddled with hiccups, and anybody who has to move very much data becomes acutely aware of these inconveniences!

I choose to be expensive in another way - intelligent system design. When architecting a software stack, it should immediately be able to scale up or down to virtually any size, from a $100 second-hand P3 to a 100-node cluster of servers, and handle it with minimal fuss. Thus, I've designed my software to scale up linearly. We started our hosted web-based product on a single P4 1U server. We're now running on a 4-node, 16-core Opteron cluster.

Maybe you're not expecting to handle 100x growth annually. But I AM anticipating this kind of growth. Growth over the past 3-4 years has been between 40% and 100% annually, and there's every reason to believe that the next year will see a severe hockey-stick as a few potentially landmark-scale agreements have just been made.
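For a sense of scale between those two figures, here is a small Python sketch of the compound-growth arithmetic. It only assumes steady year-over-year growth; the function name is made up for illustration, and the rates are the ones quoted above.

```python
import math

def years_to_multiply(factor, annual_growth_rate):
    """Years of steady compound growth needed to grow by `factor`.
    annual_growth_rate is a fraction, e.g. 0.40 for 40% per year."""
    return math.log(factor) / math.log(1.0 + annual_growth_rate)

for rate in (0.40, 1.00):
    years = years_to_multiply(100, rate)
    print(f"{rate:.0%} per year -> 100x after {years:.1f} years")
```

At 40% a year it takes roughly 13.7 years to reach 100x, and at 100% a year roughly 6.6 years, so 100x in a single year really is a different regime from the historical numbers.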

We're ready to grow FAST to basically any size, modeling after the /. supreme being, Google. Are you?

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Re:Specifics please. by dfn_deux · 2007-05-30 04:52 · Score: 1

Sure, just show me how I'm going to get a 4 hour repair/replace onsite technician response contract for this $3000 beige box NAS device. I'll buy 12 of them. Until then I'm gonna stick to my Netapps ;)

--
-*The above statement is printed entirely on recycled electrons*-
Re:Specifics please. by Vancorps · 2007-05-30 04:54 · Score: 2, Interesting

Wow, talk about not understanding what a SAN actually is. A $4700 unit is no comparison to what you get for 50k from an HP or IBM SAN solution. The biggest factor, I started with 30tb now I need 100tb! Oh noes, I need to buy more units and create more volumes. With a SAN you fill up your fabric or iSCSI switch with more storage racks allowing you to mix and match, SATA, SCSI, FCA or anything else you like. Then replicate the whole deal off-site at the hardware level.

Now mirrored OS drives? Are you mad? You've got a SAN, the suckers will boot straight from the SAN meaning you'll never have to worry about RAID array failure or hard drive failure causing downtime. This also means there is no need for hard drives in your blade which will reduce the internal temperature under full load. Of course since the OS drive is connecting through a fiber-channel SAN it is a hell of a lot faster than two mirrored drives which means faster and more reliable servers and increased uptime. A motherboard fries? No problem, either remap the LUN to another server with identical hardware which could be configured automatically or just change which port the server is plugged into the fabric.

There are hundreds more advantages here, ZFS does address a lot of them but it is not at that level and not near the performance. How about automatic archiving of files that haven't been accessed in say six months? Either archive to near-line storage or to tape, or do a tiered approach and archive to near-line after six months and tape after a year. The management is there for SANs and is very mature. It's just getting started for ZFS. ZFS is turning out to be a great low cost alternative meaning less risks starting out, but once you're into SAN country and not just DASD land then the rules are different.
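The tiered archiving idea can be sketched in a few lines of Python. This is only a planning pass that decides where each file would go, not a mover. The tier names, the thresholds in seconds, and the function names are assumptions for illustration, and it relies on the filesystem actually recording access times (a volume mounted with noatime will give misleading answers).

```python
import os
import time

SIX_MONTHS = 182 * 24 * 3600   # seconds, approximate
ONE_YEAR = 365 * 24 * 3600     # seconds

def pick_tier(last_access_epoch, now=None):
    """Return the tier a file belongs on, based on its last access time."""
    now = time.time() if now is None else now
    age = now - last_access_epoch
    if age >= ONE_YEAR:
        return "tape"
    if age >= SIX_MONTHS:
        return "near-line"
    return "primary"

def plan_archive(root, now=None):
    """Walk `root` and map each file path to the tier it would move to."""
    plan = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            plan[path] = pick_tier(os.stat(path).st_atime, now)
    return plan
```

What the SAN vendors add on top of something like this is the mature management around it: policies, reporting, and transparent recall when someone asks for an archived file.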
Re:Specifics please. by Ngarrang · 2007-05-30 05:16 · Score: 1

It's all a question of scale, and your scale is a bit skewed.
... This sort of thing is massive overkill for most people and companies, but when someone is doing realtime commodities trading, (or banking, or stock exchanges, etc.) the protection and support are worth the extra money. You just can't build that sort of thing on your own for any less money, at the end of the day.
Thank you for a most intelligent and well-written response. This is the sort of response I was hoping for. In a case like this, the proprietary and expensive features are certainly worth their price.

--
Bearded Dragon
Re:Specifics please. by trisweb · 2007-05-30 05:21 · Score: 1

I think that pretty much blew every argument for cheaper solutions out of the water... if you need all of those 4 points above then it makes perfect sense to pay for them. If you don't, well duh, build your own system for cheaper, but you get what you pay for. Nice.

--
"!"
Re:Specifics please. by Anonymous Coward · 2007-05-30 05:34 · Score: 1, Informative

There is no single point that can fail on a 100K+ SAN. You have multiple controllers on each shelf, multiple power supplies on each shelf, multiple paths to each server. Take something like a SAN from EMC: it is basically two completely separate SANs running in parallel with each other. The only thing in common between the two is the disks and parts of the enclosure they are installed in (each enclosure can hold 14 drives and there are multiple enclosures), which can be in any RAID configuration that you want with multiple spares across multiple enclosures. So basically, everything is at least doubly redundant and data paths can automatically switch between any available path. I can have a SAN storage processor fail, a fiber switch with 14 servers attached fail, and two hard drives fail all at one time and not lose a single bit of data and keep running along fine like nothing happened. I know the name SAN seems like a single thing that can fail and you might be concerned, but you really need to look at what they can do and what redundancy they offer before forming your opinion.
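A rough way to see why doubling everything matters is the textbook availability arithmetic, sketched here in Python. It assumes component failures are independent, which real hardware only approximates, and the per-component availabilities are hypothetical numbers, not vendor figures.

```python
from math import prod

def series(*availabilities):
    """All components must work (a chain of single points of failure)."""
    return prod(availabilities)

def parallel(*availabilities):
    """At least one of the redundant components must work."""
    return 1.0 - prod(1.0 - a for a in availabilities)

# Hypothetical availabilities for one controller, one switch, one power supply.
controller, switch, psu = 0.999, 0.999, 0.995

single_path = series(controller, switch, psu)
dual_path = series(parallel(controller, controller),
                   parallel(switch, switch),
                   parallel(psu, psu))
print(f"single path: {single_path:.6f}  dual path: {dual_path:.6f}")
```

The shared pieces (the disks and parts of the enclosure) still sit in the series part of the chain, which is where RAID and spares come in.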
Re:Specifics please. by prgrmr · 2007-05-30 05:40 · Score: 2, Informative

I've steered clear of the SAN/NAS solution for one simple reason: it's a single point.

If you honestly think a SAN, particularly an EMC SAN, is a single point of failure, then you don't understand SANs. Take a look at the high-end Clariion and Symmetrix models for redundancy and scalability. Then take a look at who some of EMC's bigger customers are.

No, I don't work for EMC, but my employer has several mid-level EMC storage systems. They work. Reliably, quietly, and yes, they scale nicely too.
Re:Specifics please. by Anonymous Coward · 2007-05-30 05:58 · Score: 0

Mod +1, Scary.
Re:Specifics please. by Anonymous Coward · 2007-05-30 06:11 · Score: 0

The difference is enterprise level. I mean I want to be able to hook 300 drives to it. I want to be able to use fiber HBA2 or better. I want to hook up 20 different sets of clustered sql servers to it. I want to have snap view.

It is not there yet. There is a reason that you pay for it. If you don't work in an environment where you see the value of a 50k system with all that it brings to the table over a 5k system that you cobble together, then you should be using that 5k system. It is really that simple.

They cost that much because when you run 9 billion dollars worth of transactions over a system you pay for them to work right, not hope that a white box can hold its weight vs. the big boys.

1 tb of mp3 - yea white box sounds great.

240 tb of financial data, your job..your boss's job..your cube mate's job...everyone on the floor's job all depends on that data.... well now I don't think 50k is all that much do you?
Re:Specifics please. by femtoguy · 2007-05-30 06:14 · Score: 1

Having just done a thing like this, I have renewed appreciation for the issues involved. When you go to management saying that you can get the same capabilities from a homegrown solution for cheaper, they are already on the defensive. You have to prove that you are right that it will work. When you buy EMC, you can curse them for messing up, and demand that they fix it. If your own box fails, you have nobody to blame but yourself.

Now I am not saying that this is a good thing. We can get our systems up and running faster than EMC can fix them. We can get more hardware than we could afford from EMC. And I think that in the IT world there is far too much passing of the buck to vendors (people should care more that their computers are virus infected than that it is all Microsoft's fault, but that's the world that we live in) and not enough holding people responsible for getting services working.
Re:Specifics please. by celtic_hackr · 2007-05-30 06:15 · Score: 1

I disagree completely. Computer hardware is a commodity. The big box makers are afraid of this very kind of configuration which would blow them out of business if more people caught on to it.

And I completely disagree with you.

While this configuration looks tempting and plausible, it will not be able to compete with a SAN. Now if he were to replace the SATAs with SCSIs, then maybe he'd have something. I'd have to analyze this further, but in any case you'd have a nice storage array that'd outperform any equivalent IDE array. So this could be labeled an average man's SAN or AMSAN (which by the way, I've patented, trademarked, and is my trade secret - so don't you go and try to implement one or I'll sue you and Fujitsu for Billions and Billions!).

SATA cannot do the same kind of bandwidth as a SCSI can. A SCSI array will bury a SATA array on throughput in any disk intensive operation. Any sales guy can and probably will FUD you if he thinks he can. So the answer is to know your hardware before talking to the sales guys.

Ok, so maybe I don't totally disagree with you.
Re:Specifics please. by Nutria · 2007-05-30 06:24 · Score: 1

If you honestly think a SAN, particularly an EMC SAN, is a single point of failure,

Unless you are mirroring it off-site, it is a single point of failure.

http://ask.slashdot.org/comments.pl?sid=236627&mode=nested&cid=19324827

--
"I don't know, therefore Aliens" Wafflebox1
Re:Specifics please. by j35ter · 2007-05-30 06:44 · Score: 1

Uh, I'm running a mission critical application server on a x226 with 2 Maxtor SATA disks in a RAID1 configuration. Due to budget cuts, this was a "temporary solution until the ordered hardware arrives"(Jan 1. 2007) :-)
This is our company's billing system, with 250k transactions/week distributed over 400 locations.
Oh, and there are no backup servers. Even the database is single, running oracle on a SCSI RAID 5 -- temporarily of course.

Any message for the suits?

P.S. got a job for me? :D

--
Delta-Mike November Bravo Tango
Re:Specifics please. by MrSenile · 2007-05-30 06:48 · Score: 1

Which is why for any large corporate system, particularly trade information or other valuable data, it's usually spread over multiple data centers over multiple SANs.

Where we worked we had two massive data centers that had data replication between the SANs using specific zoning.

One at our location. One about 10 states away.

A San, for all intents and purposes, IS fully redundant in every conceivable way. You cannot, however, protect against three things.

1) Murphy's Law
2) Human Stupidity
3) Act of God

That way, short of having a nationwide catastrophe, you protect against #2 and #3 by having multiple data centers with different sets of people maintaining each location.

There's no current method of protection from #1.

Any person who has their entire reliance set upon a single San not only doesn't understand how a San works, but doesn't understand System Architecture and should look to find a new career.
Re:Specifics please. by Nutria · 2007-05-30 07:15 · Score: 1

Any person who has their entire reliance set upon a single San not only doesn't understand how a San works, but doesn't understand System Architecture and should look to find a new career.

The CIO who pushed SANs is gone, and the guy who defended SANs against my "all the eggs" arguments was fired for making a database change which crashed the database in the middle of the day, when he was explicitly told not to.

And we've never had SAN replication (though we religiously do backups and test them semi-annually, which is why/how we successfully got that system completely restored without any hassle).

--
"I don't know, therefore Aliens" Wafflebox1
Re:Specifics please. by Anonymous Coward · 2007-05-30 07:21 · Score: 0

So what is your plan for data spread across multiple geographical regions that a SAN would not work in?
What method do you suggest that works so much better with local attached disk server storage instead of a SAN? I know spreading a SAN across regions is relatively easy, we do it at my company. Our SANs are attached to each other over ethernet via a fiber->eth switch. Works great.
Re:Specifics please. by Anonymous Coward · 2007-05-30 07:29 · Score: 0

You're also paying for "sue-ability". Management needs to have someone to blame when things go pear-shaped, as they all too frequently do, and building a cheap whitebox solution isn't going to get you that.
Re:Specifics please. by MrSenile · 2007-05-30 07:37 · Score: 1

Most companies can't afford the 36-72 hour downtime it requires to pull the multi-terabyte database from tape back onto the SAN. And while backup to disk greatly speeds this up, it still takes a bit of time to restore multiple terabytes of data, even disk to disk.
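The back-of-the-envelope math behind restore windows like that is just size divided by sustained throughput. A minimal Python sketch follows; the database size and throughput below are hypothetical, and it ignores tape loads, verification, and database recovery time, all of which make the real number worse.

```python
def restore_hours(data_terabytes, throughput_mb_per_s):
    """Hours needed to stream `data_terabytes` at a sustained throughput.
    Uses decimal units: 1 TB = 1,000,000 MB."""
    total_mb = data_terabytes * 1_000_000
    return total_mb / throughput_mb_per_s / 3600

# Hypothetical: a 10 TB database restored at a sustained 80 MB/s.
print(f"{restore_hours(10, 80):.1f} hours")
```

That hypothetical case works out to roughly 35 hours of pure streaming, before anyone has even started replaying logs.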

That's why there's usually San replication on high availability environments.

The cost of even 1 minute of downtime is devastating.
Re:Specifics please. by myowntrueself · 2007-05-30 08:30 · Score: 1

4) Software iscsi initiators and targets aren't a replacement for dedicated adapters.

Tell me about it...

Software-only iscsi on Linux is a recipe for massive data corruption.

You need a nice expensive HBA on every machine involved.

The openiscsi and iscsi-target projects are *NOT* production-ready. Use at your own peril.

--
In the free world the media isn't government run; the government is media run.
Re:Specifics please. by sf_basilix · 2007-05-30 08:50 · Score: 1

Let's not forget about Hitachi's 100% guaranteed reliability contract. On their top-of-the-line storage units (9900s - also named Tagma) they will sign a contract with their customer that guarantees 100% uptime or they will pay the customer for any repercussions involving any downtime.

It's no joke - I've watched customers sign it myself. Not even EMC can do that...
Re:Specifics please. by mcrbids · 2007-05-30 08:51 · Score: 1

3 degrees of separation from Vladimir Putin!

OMG you are so KOOL! Try this:

2 Degrees of separation from George W. Bush. (No, I voted for the other guy both times)

2 Degrees of separation from "Govornator" Arnold. (Voted once)

3 Degrees of separation from Johnny Cash.

As if it matters. Which it doesn't. Who are YOU again?

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Re:Specifics please. by ddent · 2007-05-30 09:00 · Score: 1

Is your failure rate high enough that a 30% premium is really worth it? It seems like it would be cheaper just to buy extra computers. Don't have to wait for the spare parts to come in either, they are just sitting there ready to go.
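To put rough numbers on it, here is a small Python sketch of the spares-versus-premium comparison. It assumes a spare costs the same base price as a production box, and the fleet size and failure rate are made-up inputs; only the 30% figure comes from the parent post.

```python
def spares_for_premium(fleet_size, premium_percent):
    """Whole spare boxes you could buy with what the support premium costs,
    assuming a spare costs the same base price as a production box."""
    return fleet_size * premium_percent // 100

def spares_cover_failures(fleet_size, premium_percent, annual_failure_rate):
    """True if those spares outnumber the failures expected in one year."""
    expected_failures = fleet_size * annual_failure_rate
    return spares_for_premium(fleet_size, premium_percent) >= expected_failures

# Hypothetical fleet of 50 boxes with a 10% annual failure rate.
print(spares_for_premium(50, 30))           # 15 spare boxes
print(spares_cover_failures(50, 30, 0.10))  # True: 15 spares vs 5 expected failures
```

What this leaves out is the labor to swap and rebuild boxes yourself, which is the other thing the premium is buying.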

--
SSL Certificate
Re:Specifics please. by Allnighterking · 2007-05-30 09:36 · Score: 1

Have you ever done a full cost analysis on your Dell systems? We have gone away from Dell for a number of reasons. Cost of electricity. Dells suck power like a hoover on 220. Cost of OS. Since Dell can't actually support an OS (they are a hardware company) I have to hire the admins anyway. Cost of repair. If I lose a motherboard on a Penguin or PSSC box, for example, that is out of "warranty" I can get a replacement at NewEgg. Can't do that with proprietary hardware. I have to in fact spend even more money getting rid of old Dell hardware (I'm in California, I can't throw it in a dumpster) that can't be fixed by Dell. With the old pIII 1U systems I was able to buy some cheap motherboards, with dual Xeons ... swap out parts and I had a new test lab cluster in place in 2 days. The broken Dell systems won't take standard motherboards, and they cost me money to get rid of, so now I have a closet of "parts".
No I'm sorry if you are going non commodity hardware it's either for show or for pencil pushers. It's just too expensive to buy equipment that cannot be re-provisioned into the future.

--
I'm sorry, I'm to tired to be witty at the moment so this message will have to do.
Re:Specifics please. by Anonymous Coward · 2007-05-30 09:53 · Score: 0

You will never be able to successfully sue the big names in the business. The contracts are all spelled out in their favour. They have no accountability whatsoever.

So you pay for hardware replacements, technical support, etc and all is fine. But if something seriously goes wrong and you want some sort of compensation for lost data, you might as well be shouting into the wind.
Re:Specifics please. by killjoe · 2007-05-30 10:06 · Score: 1

"if we have a dead drive, we get a dell service hobbit on site within 4 hours with a new drive. "

You must live in a big city. We were going to go with HP for some servers because they promised quick turnaround, and then they told us we didn't live in a major metropolitan area and that the best they could do is to promise a technician visit within 24 hours and overnight shipping for parts.

I called around to other vendors and got the same story.

Called a local guy who promised us he would give us his cell number and had plenty of spare parts in stock at all times. Since his hardware cost less than half as much as the HP hardware we bought two machines from the local guy and kept one as a spare in the store room and he gave us his mobile number.

We never did have to call him. The machine ran for three years without a problem and we upgraded to a new machine at that time demoting the two old machines to lesser tasks.

--
evil is as evil does
Re:Specifics please. by twiddlingbits · 2007-05-30 10:21 · Score: 1

Well, I work for EMC now and the talent is not as wide in the service organizations as Sun was but the ones who are sharp are REALLY sharp. Sun does servers really well and Storage about half-assed, EMC does Storage very very well and VMWare is very very good too. I have an ex-Sun Tech support guy who works for me and he's pretty decent..much better than many of the other people I have from EMC.
Re:Specifics please. by zerocool^ · 2007-05-30 10:26 · Score: 1

I live in Blacksburg, VA. I work for Virginia Tech's CS Department.

Which is why, above poster, that we don't have to worry about things like 400% increases in 30 minutes, etc. But, yeah, I guess the people are coming from Roanoke.

It costs us to get that SLA, though.

--
sig?
Re:Specifics please. by itwerx · 2007-05-30 10:52 · Score: 1

Maxtor (Seagate?) drives are junk.

Seagate's purchase of Maxtor was purely for intellectual property reasons. Maxtor owned a couple of patents on technology for exceptionally high speed data transfer within the drive. DoveBid Auctions has had decommissioned Maxtor assembly line equipment auctions constantly for several months now - Seagate is decommissioning them all. (But in the next year or two look for Seagate drives to get even faster! :)
Re:Specifics please. by jabuzz · 2007-05-30 10:56 · Score: 1

Well here in good old blighty the Dell service hobbit does arrive in four hours and I don't live or work in a major metropolitan area either. I know because I have had to call them out. The total incident time from actual drive failure, to failure being picked up, to replacement drive in the array was about two and a half hours.

You won't get a four hour service if you live in the highlands and islands of Scotland, but pretty much everywhere else you will.
Re:Specifics please. by Anonymous Coward · 2007-05-30 11:05 · Score: 0

Still comes back to the problem of what it is for. Yeah sure, if you have a large number of files you want to serve, and most requests are for a small percentage of that, say the most recent, then a cheap build-it-yourself box like this may do the job. Most requests will hit cache, drives won't sweat it and command queues will be low. SCSI drives and enclosures are more expensive, so the $4700 has just gone up.

But if you are doing something more demanding, like running a mid/large size DB or media editing, then from my experience this just won't cut it. For these applications, I've always found that SCSI outperforms SATA by at least 50%, and under load that gap increases. Then there is the overhead of the controlling OS. Sure I could compile a custom kernel and strip down the OS so that it ran leaner and had better throughput, but that would take some time and experimenting to achieve, so let us assume it takes 2 weeks of playing around. That means that the cost has just gone up by whatever they pay you for those 2 weeks.

Then there are the remote management tools, pre-emptive failure reporting etc.. Sure you can have that with zfs and open source tools, but you will need to find out what there is, what works best for you, how to configure it, test that it works, test for failures, test for recovery, test the hot spares etc..

And if something fails, you have no-one to blame, sue or make fix it. Well except you of course.

And lastly there is what is good for the business. You might set up the best system. It might be cheap, fast, reliable. And the only one who will know how it all works is you. So that one day when you move on, the company will all of a sudden have to support a system without having anyone to call on.

So yeah, the EMC system may be $45000 more, but for that you get support, the research and testing behind the system, better performance, and from a business perspective, better peace of mind.

And no I don't work for a hardware vendor, I've just had my share of experience with kludged systems versus proprietary. I love building them, I hate supporting them, and I loathe explaining what went wrong.
Re:Specifics please. by rrohbeck · 2007-05-30 12:28 · Score: 1

>Add an external disk array

That's where it gets interesting.

Add a cheap one where the Taiwanese PCB manufacturer didn't have their process under control and the backplane smokes, taking 3 drives with it (and triggering the end user's automatic fire suppression system.)

Oops! I saw it happen.

--
thegodmovie.com - watch it
Re:Specifics please. by Anonymous Coward · 2007-05-30 13:28 · Score: 0

This sort of thing might someday be possible on a ZFS-based storage device. Sun has released its replication software (SNDR) into the OpenSolaris project. OpenSolaris already has an iSCSI target, an FC target is possible (Sun used to have one for Solaris).

As many big storage frames use standard processors, and some run UNIX or UNIX-like operating systems, this is certainly possible. For example, NetApp now has most of the high-end features once only belonging to EMC Symmetrix and HDS Lightning arrays. IBM's "Shark" and current 8000 series arrays are actually just pSeries servers running AIX in front of a bunch of JBOD arrays.

And of course such a device could serve out NFS, as well as NFS over RDMA, over a variety of connections (anything you can put in a PCIe slot, like GigE, 10GigE, and InfiniBand).

Add the simple management and checksumming of ZFS to a universal RAID controller, and you get something quite interesting. You could easily balance cost vs. capacity by changing out the back end arrays. SATA RAID-Z for cheap storage, SAS RAID-Z2 for intermediate storage, and FC RAID-1 for mission critical.

Also, if Sun integrated its SAM hierarchical storage manager, you could easily move old data to tape or to tiered storage.

Regardless of what Sun does, expect storage to move this way over the next five years.
Re:Specifics please. by Dare+nMc · 2007-05-30 15:05 · Score: 1

OK, I'll bite. Which (current?) Dell systems are you talking about? The parent is talking about desktops; clearly you must be talking about their servers, since all the Dimension, etc. boxes I have seen use all the standard mounting, plates, connections, etc. Granted they have in the past kicked out some non-standard stuff, but who hasn't (that was around when you bought those Pentiums?)
I can get a replacement at NewEgg. Can't do that with proprietary hardware.

You are correct that the PowerEdge server I am running at work has a dell specific motherboard, but everything plugged into it can be purchased at newegg (IE they will all plug into many newegg motherboards.) The Dell system is really pretty, it wouldn't be pretty with the newegg board swapped in, but I would be surprised if it wouldn't mount (if it doesn't, then I need a $20 case along with a motherboard.)

I am not a huge Dell fan, I do buy a few Dell PCs at home/work when they are within $20-30. I have had better luck with them than any other brand (well the 1 Penguin Computing box I bought in '00 has been impressive, but a motherboard replacement is no longer available anywhere if it fails, but what do I care, just save the data.)

Your saving the PIII system sounds like fun (I would have done the same), but I doubt it truly saved your company much over swapping the HD into a modern box and tossing the rest (assuming a supported OS.)
Re:Specifics please. by Anonymous Coward · 2007-05-30 15:21 · Score: 0

*** I disagree completely. Computer hardware is a commodity. The big box makers are afraid of this very kind of configuration which would blow them out of business if more people caught on to it. ***

Except current SAN tech has connectivity up to 4Gb/sec/channel....new stuff is up to 10Gb/sec....

Ethernet is 1Gb/sec....Can boost it up a bit by going fiber...

But again, if you just need cheap storage, this will work. if you need performance, it won't work worth a shit.
Re:Specifics please. by drsmithy · 2007-05-30 17:34 · Score: 1

Software-only iscsi on Linux is a recipe for massive data corruption.
Details ? I've shuffled a lot of data through some machines using IET and the software initiator in RHEL without any problems.
Re:Specifics please. by Anonymous Coward · 2007-05-30 18:33 · Score: 0

Good point. As an illustration... take a screwdriver to some generic white box and then to something like an RS/6000. Notice that there is just that bit more force needed to open it. A few more cents are spent on better connectors than in a white box, translating to fewer service calls.
Re:Specifics please. by Anonymous Coward · 2007-05-30 20:39 · Score: 0

Try serving 50Tb to 200 hosts each pushing 1000iops simultaneously on your commodity hack. Oh and btw that's a mid tier array configuration.
Re:Specifics please. by Allnighterking · 2007-05-30 21:15 · Score: 1

Actually I've got a number of Dimension towers with non-standard (meaning I can't go to the store and buy a replacement) heat riser/fan mounts, not to mention the non-standard placement of screw holes on the motherboard (standard ATX mobos from Tyan or Asus don't match up.) So I can't buy a 100 - 150 dollar motherboard, I have to buy the more expensive one from Dell. And as for the swap: our out of pocket was under 2.5K, and most of that was for the HDDs (the one thing the PIIIs had that was real junk were the 6GB SCSI drives.) 6 1Us were brought back to life in total. Compare that to 2.5K each from PSSC, Rackable or Penguin for the closest comparable system.

--
I'm sorry, I'm to tired to be witty at the moment so this message will have to do.
Re:Specifics please. by tgatliff · 2007-05-31 01:53 · Score: 1

I don't know if it is that bad.... You're exaggerating a little. Yes, all of these types of things have occurred, but I would not say they occur every day. Also, there are severe consequences when people get caught doing it.

For example... Many manufacturers will take a sample from a component shipment and test. If so many fail, they will send back the entire shipment. In one instance for me, we suspected that the vendor was just shipping it back to us, so we tagged the box. When we noticed the box coming back, we axed the vendor forever... Toyota is known very well for this type of approach to things as well.. A zero tolerance for defects...
Re:Specifics please. by tgatliff · 2007-05-31 02:09 · Score: 1

Another marketing ploy here as well... :-)

Just so you know, there are pros and cons to testing. When you test something, you actually increase its risk of failure. Why? Because you are using it... Because of this, we use what is called the threshold number.. Meaning, we determine where we want the reasonable reliability, such as 90%, and then we test to that range, which is a calculated number by the way. Anything beyond the 90% range traditionally will be groomed out by either a design change or through volume... Once you bring up the volume on a manufacturing line, it gives you a chance to see some really odd errors. Once you correct them, the entire line gets more reliable over time. There is very little mystery about this and it certainly is not consumer or business related.. Reliability is something everyone wants..

Also, as far as labeling the ones that survive as "business grade": this is just marketing hype to allow vendors to justify making business products more expensive. Any returns, including high volume consumer lines, just kill margins because as a manufacturer you traditionally have to eat these costs. No vendor in their right mind would ever ship out a product knowing that it will fail prematurely.

Also, I have never heard of "business-grade" being one generation behind anything. This would make a lot of sense, but I just have not seen much evidence of this because competition would just kill this approach. In fact, it is my experience that business equipment is more unreliable than consumer equipment that is at volume. Meaning, take a traditional HP business server and compare the same arrangement in a spec colo center and see which one fails first. You will pretty much always find that it is the one with the newest design or the least volume at the manufacturing level.....

Finally, the requirement of large throughput in NASes, in my opinion, is a side effect of poor design or patching of an existing poor design. Meaning, you should never design a network where huge throughput is in one area. Spread it out across a network so that this never becomes a problem. Some approaches do this through multiple NICs, by the way. HA Linux and its directoring principles are ideal for this. Oh, did I mention that you can use "cheap" consumer grade hardware for HA Linux? The best of both worlds... :-)
Re:Specifics please. by swordgeek · 2007-05-31 02:59 · Score: 1

True enough, but you have to be very careful in reading the terms of the contract. "Idiot admins" aren't covered, which is fair enough, but when the product is sold (rebranded) and supported through Sun, idiot Sun admins are exempted as well.

We had a 36-hour outage on a 9960 that was the joint fault of bad documentation and bad support. No compensation.

--

"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
Re:Specifics please. by Thundersnatch · 2007-05-31 07:51 · Score: 1

Unless you are mirroring it off-site, it is a single point of failure.

We do block-level replication off-site with our SAN, which is not an EMC, but a LeftHand Networks cluster. The reason we bought in was scalability. Our LeftHand units come in as small as 2 TB chunks, each with its own mobo and two Gigabit Ethernet adapters. As you buy more chunks, your storage pool automatically grows (and volumes automatically re-balance themselves amongst spindles). But even better, the aggregate throughput also increases, as each chunk adds its 2 Gb/sec to the total SAN bandwidth available. Using iSCSI and multipath I/O, there are no single points of failure in the SAN at this site. In case of fire, we have our replicated data off-site (also handled by the SAN in a bandwidth-efficient manner), and also traditional backup media off-site.
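
To put rough numbers on the scaling claim, here is a minimal Python sketch. It assumes each node contributes 2 TB and two 1 Gb/s ports, spreads load perfectly evenly, and ignores protocol overhead and replication traffic, so treat the output as an upper bound. The function name is made up for illustration and has nothing to do with any vendor's software.

```python
def cluster_totals(nodes, tb_per_node=2, ports_per_node=2, gbit_per_port=1):
    """Return (raw_capacity_tb, aggregate_gbit_per_sec) for a scale-out cluster.

    Assumes perfectly even load across nodes and no protocol overhead,
    which real iSCSI traffic will not achieve.
    """
    capacity_tb = nodes * tb_per_node
    aggregate_gbit = nodes * ports_per_node * gbit_per_port
    return capacity_tb, aggregate_gbit


for n in (1, 4, 8):
    cap, bw = cluster_totals(n)
    print(f"{n} nodes: {cap} TB raw, up to {bw} Gb/s aggregate")
```

The point is only that capacity and bandwidth grow together, which is not the case when you hang more shelves off a single controller.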

You just can't do that sort of stuff with a roll-your-own solution. If you don't have those sorts of needs, by all means, roll your own cheap solution.

What this should illustrate is that the majority of the value of a SAN is in the management software, not just the hardware. Which is probably why LeftHand has now partnered with HP to sell their iSCSI SAN software running on DL300-series machines.

Note, I do not work for or invest in LeftHand, I am just a satisfied customer. LeftHand's SAN/IQ operating system is Linux-based, which is neat. We were also thrilled with demos of EqualLogic's gear, which is quite similar.
Re:Specifics please. by killjoe · 2007-05-31 15:20 · Score: 1

That guy didn't work for dell. Dell got in touch with a local computer company who sent the tech out. You are lucky it was a drive, anybody can go down to the store and pick up a drive. If it was a server motherboard or a PERC raid controller you would have been shit out of luck.

--
evil is as evil does
Re:Specifics please. by duffbeer703 · 2007-06-01 02:28 · Score: 1

I've seen instances where "enterprise" mid or high end SAN/NAS systems go down during firmware upgrades, cable cuts, etc.

It doesn't happen often, but when a SAN that is the storage backend for 500 servers goes down for a few minutes, you're in deep shit.

--
Conformity is the jailer of freedom and enemy of growth. -JFK
Re:Specifics please. by aminorex · 2007-06-01 03:26 · Score: 1

> As a storage solution for ten (or two hundred) business critical server systems, no way.

Now two of them, on the other hand, would do that nicely. No downtime, ever.

--
-I like my women like I like my tea: green-
Re:Specifics please. by NateTech · 2007-06-01 20:50 · Score: 1

They also offer SLAs and business contracts that would allow a company to sue their asses off if their storage solutions fail. This is the oft-overlooked reason that large proprietary systems are used by businesses. They want "one throat to choke" if something bad happens.

Saying that your $2K "developed by some guy down the hall" storage solution lost your most important data to your customers, is far more painful PR than saying "the EMC went down and we've requested that they come fix it and honor their SLA by refunding our money."

In many cases it's not about the technology -- It's about the realities of a company's fiscal liability if something goes wrong.

I love the cheap route, and I think ZFS is niftier than sliced bread, from a purely technological standpoint.

But I don't think Sun has started writing SLAs and offering free data recovery support on-site for the $2K SATA Terabyte "cheap-ass-engineering" solution the guy down the hallway built with OpenSolaris... yet.

Would I build one or two and replicate my data across them for redundancy to save $50K? Probably. Would I put something mission-critical that pulls down $50,000 an hour on it? Probably not... but maybe.

That's the kind of decision big corporate management is making when they buy a brand-name high-end product that lives in their data center... "Can my people make a phone call and expect the problem will be fixed within the allotted down-time limitations without hassle or having to keep hideously expensive people on staff to babysit the thing? And if the thing blows up, burns down the data center, whatever... does the vendor have deep enough pockets for me to pay lawyers to go recoup some/most/all of my losses?"

There's a reason there's always a "top-three" in any industry, even if it seems the technology is simple enough "anyone" could build one. Sure they can... but can they SUPPORT it and back up that talk and put their money where their mouths are when the excrement hits the rotating air oscillator?

--
+++OK ATH
Re:Specifics please. by Alex · 2007-06-02 02:56 · Score: 1

That's an HBA - how about I sell you 2 for $25k ?
Re:Specifics please. by Alex · 2007-06-02 02:58 · Score: 1

We're ready to grow FAST to basically any size, modeling after the /. supreme being, Google. Are you?

Does he need to be ?
Re:Specifics please. by rfc1394 · 2007-06-03 05:55 · Score: 1

The premium paid for higher-end storage is decidedly nonlinear. For marginally more reliable or faster storage, you pay about a factor of ten. One example I'm familiar with is Hitachi. We had a 64TB HDS array a few years ago that was worth roughly $2M. We could have purchased an equivalent amount of commodity storage for probably $200k at the time, but didn't. Why would we spend the extra money? Speed, configurability, expandability, and reliability.

Well, let's see, on the other hand, you could have spent $800K and bought 4 of the "commodity" systems for less than 1/2 of the cost of the proprietary system. Given this, you can either have 4 machines each handling 1/4 of all the requests, and maybe get at least as fast a response time, or have a setup which, unless you have 4 consecutive failures, gives you quad reliability. You talk about every system being dual reliable, but it still looks like for less than 1/2 the price of the high-priced equipment you can install quad reliability and get twice as much reliability.

Is the $2 million system 4 times as reliable as commodity hardware? Or should I ask, is it 10 times as reliable as commodity hardware since it costs 10 times as much? Or are the capabilities so far superior that they justify a 900% increase in price?
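
One back-of-the-envelope way to frame that question, as a Python sketch. The dollar figures are the ones quoted in this thread; the 5% failure probability is purely hypothetical, and the math assumes the copies fail independently, which shared power, firmware or admins can easily break.

```python
def all_copies_fail(p_single, copies):
    """Probability that every one of `copies` independent systems fails,
    given each fails with probability p_single over the same period."""
    return p_single ** copies


commodity_cost = 200_000       # figure quoted in the thread
proprietary_cost = 2_000_000   # figure quoted in the thread
p = 0.05                       # hypothetical per-system failure probability

for copies in (1, 2, 4):
    print(f"{copies} copies: ${copies * commodity_cost:,} "
          f"P(all fail) = {all_copies_fail(p, copies):.6f}")
print("price ratio:", proprietary_cost / commodity_cost)
```

This says nothing about how long it takes to recover from a single failure or who does the recovering, which is what the other side of the thread is arguing about.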

Now, maybe there are other figures and maybe there is a serious justification for such a huge price premium, but again, it still sounds like people are paying a lot more money because of FUD from disk drive manufacturers.
-- Paul Robinson - My Blog

--
The lessons of history teach us - if they teach us anything - that nobody learns the lessons that history teaches us.
Re:Specifics please. by Sobrique · 2007-06-06 01:37 · Score: 1

From experience EMC kit is very expensive, but it is worth it. In a high end business environment, the level of tolerable downtime is zero. I mean, I'm working supporting a big name UK bank. Our 'support obligations' involve a 45 minute 'time to fix' on a critical incident. Yes, that's including the time it takes me to get out of bed and have my coffee.
Our contract was negotiated with that in mind. It was Really Expensive, because ... well in IT anything's possible, but the price tag can get extreme. So we run EMC storage arrays, with very expensive support. It's not that I _couldn't_ build the same amount of storage out of JBODs or other SAN configuration, it's quite simply that we _need_ to be able to pull things back together very quickly, or face very significant penalties.
We run multiple arrays, replicating between them, to remote sites, and aggressively replace any components that 'might fail soon'. It's expensive, and intensive, but compared to the cost of 'outage' (or worse, data loss) it's nothing. When you stand to be fined multiple millions for 'problems' by the regulators, it's in your interest to invest heavily in your toys.
Personally, I couldn't afford EMC's support (or any of the other vendors, but I've only really used EMC extensively) but having dealt with them on a lot of occasions, they're one of the few that are actively useful to be talking to.
And for my money, ZFS isn't really a replacement for SAN or NAS. They do similar things, true, but the overlap isn't total. *shrug* I'd say pick out the right tool for the job, but keep in mind that these things can be used in a complementary fashion.

Everyman? by iknownuttin · 2007-05-30 00:08 · Score: 1, Redundant

As a common everyman who needs big, fast, reliable storage without a big budget, ...

Porn jokes aside, what in the World does a common "everyman" need with that kind of storage?

I have a 40 gig OEM drive on this machine that I've had since 2003, and I still haven't approached the half way mark. And I run a couple of businesses.

--
I prefer Flambe as apposed flamebait.

Re:Everyman? by apathy+maybe · 2007-05-30 00:18 · Score: 3, Insightful

Porn jokes indeed aside. I may not be an "everyman", but I think I'm close enough. My desire for storage (though not yet in the terabyte range) comes from my photography (no not porn...). I take a bunch of pictures, and well, because storage is cheap I leave them all at the original file size (which in this case is about 2-5 MB depending).

I don't have a proper video camera, but I'm sure that people who do, have even bigger storage requirements.

Not only that, what with all the music you can copy off a friend's HD now, your storage just jumps a bit more! (I've got literally more than 10 gigabytes of music on my desktop HD. And I know people who have hundreds of CDs, so if they ripped all those, they would have much more...)

Added to all those movies you can either rip or download...
Chuck in a decent network, family and/or friends, and you can now stream all this stuff around to wherever you want it.

I'd say then, that the most common use of all this space is multimedia. Not sure who has terabytes of multimedia though.

--
I wank in the shower.
Re:Everyman? by eldepeche · 2007-05-30 00:20 · Score: 1

His business probably needs hundreds of DVD rips and 30000 mp3s.
Re:Everyman? by theStorminMormon · 2007-05-30 00:26 · Score: 1

I have a 40 gig OEM drive on this machine that I've had since 2003, and I still haven't approached the half way mark.

And you're obviously not storing any substantial media. I have more than 40GB just of MP3s.

--
The Southern Baptist Convention has creationism. On Slashdot, we have porn.
Re:Everyman? by Baddas · 2007-05-30 00:29 · Score: 3, Insightful

150GB mp3s
80GB DVDs
120GB games
14GB/hr for DV editing
1 whole drive for OSes
RAID-5ed (1 parity drive)

So I'm up to 4 200GB drives right now, without even trying hard.
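
For anyone checking the math, a quick Python sketch of the capacity side. It assumes all four 200GB drives sit in one RAID-5 set, which the list above doesn't actually spell out, and the helper name is just for illustration.

```python
def raid5_usable_gb(drive_gb, drives):
    """Usable capacity of a RAID-5 set: one drive's worth goes to parity."""
    if drives < 3:
        raise ValueError("RAID-5 needs at least 3 drives")
    return drive_gb * (drives - 1)


media_gb = {"mp3s": 150, "DVDs": 80, "games": 120}  # sizes from the list above
used = sum(media_gb.values())

print("4 x 200GB in RAID-5:", raid5_usable_gb(200, 4), "GB usable")
print("media so far:", used, "GB, before any DV footage at 14GB/hr")
```

At 14GB per hour of DV, the remaining headroom on a set like that goes quickly.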

Soon I'm going to jump to 500GB drives, and I expect to be hitting their limits in a year or so.

Also, how the hell am I supposed to back up all this?! Incrementals would be 10GB+ / week
Re:Everyman? by simong · 2007-05-30 00:34 · Score: 2, Insightful

You're not trying hard enough ;)

I've got just over a terabyte of live storage around the house and I probably use about half of it - I have a couple of hundred gigs of video and about 60 gigs of music. I know of someone who is currently buying seven of Hitachi's new terabyte HDs for an in-home video streaming system. There's always someone who has a use for it.
Re:Everyman? by Max+von+H. · 2007-05-30 00:35 · Score: 4, Informative

I'm a photographer and my RAW image files are 15MB each. At every shooting, I come back with 1 to 8GB worth of data to be processed. My workflow involves working on 16-bit TIFFs that weigh in excess of 40MB/file and I'm not even counting the photoshop work files. 40GB would last less than a week here.

Not being rich, I have a couple of external HDs totalling a little less than 1TB, and it's nearly full. The rest is archived on DVD or transferred to HD for storage (cheaper, faster and more reliable than DVD).

So yeah, I can easily imagine why any organisation dealing with huge media files would be interested. Heck, I'd be a client for a safe, multi-TB storage system if I could afford it... Not everybody only deals with text files for a living :P

--
-- It's always darker before it goes pitch black.
Re:Everyman? by Anonymous Coward · 2007-05-30 00:45 · Score: 1, Funny

Arrrr, matey! That's a scurrrrilous rumor, ye salty sea dog!
Re:Everyman? by miller701 · 2007-05-30 00:48 · Score: 1

150 GB of MP3s? That's on the order of 3,600 albums.
Re:Everyman? by mulvane · 2007-05-30 00:53 · Score: 1

I can easily see the need for mass storage in the HOME with multimedia and other types of storage. Sure the web makes things easy to acquire, but that is still never gonna compare to instant over a home network.

At home, I have an 8TB movie server that's using all 200GB drives. The drive enclosure is custom built and is using multiple 5 DC power supply bricks and also has SATAII RAID 5 capable expanders to create the first level of RAID 5. From there I go to an 8 port SATAII card and create 2 RAID 5 arrays using 2 4-port groups. At that point, the whole thing is brought into a RAID 0. The ONLY thing this is used for is storage of video (movies/TV eps).

I also have a 4TB video capture server with 6 TV capture cards. Each capture card can be controlled from a remote media center PC for live stream and recording at the same time. I also have 2 2TB file servers with an archive of ISOs from quite literally every CD that has ever entered my possession (minus AOL) since 1998 and various other things of personal business like pictures and videos I shoot of the family. I have a couple of 500GB machines floating around as servers for other things, but those don't count for anything important here.

In conclusion to the article, I have built a fast, reliable, and easy to manage solution with Samba with web frontends to manage all the file serving needs I have. I look forward to the day I can eBay all these 200GB drives though, after I replace them with 750GB-1TB in the near future.
Re:Everyman? by d3ac0n · 2007-05-30 01:01 · Score: 2, Interesting

Two words: High Bitrate.

If you like your music to actually SOUND good, 128kbps sucks. I personally rip my music using a Variable bitrate between 224 and 320 kbps. Unfortunately, this makes for VERY large files. But my music sounds FANTASTIC!

--
Official Heretic from the "Church of Global Warming". Proven right thanks to whistle blowers. AGW = Flat Earth Theory
Re:Everyman? by WhoBeDaPlaya · 2007-05-30 01:07 · Score: 1

Just beefed up the 'ol media server / HTPC with 8x 750GB Seagates (got great deals on 'em). Holds vid caps, FLACs (~18K songs), family photos and other misc vids (game / movie trailers, etc.) Still have room for another 6 drives.
Re:Everyman? by Znork · 2007-05-30 01:13 · Score: 2, Interesting

"what in the World does a common "everyman" need with that kind of storage?"
Consolidate your multimedia and run MythTV for a while. Once you rip and encode several TV series, all your DVD films, and have the Myth recording your favourite shows, a terabyte doesn't seem that much. If you want an idea for future examples of massive storage consumption, imagine having MythTV recording all channels all the time, so you'd basically be able to decide post-transmission what you want to view and save...
Of course, while I agree most NAS and SAN solutions are grotesquely overpriced and mainly useful for separating fools from their money, I can't really see why one would bring up ZFS and OpenSolaris for this purpose. Something like Openfiler would be vastly more appropriate, proven and easy to manage.
Re:Everyman? by bkr1_2k · 2007-05-30 01:15 · Score: 1

I don't know about you, but I have about 700 CDs ripped to about 50GB. At that rate 150GB is about 2100 albums, which really isn't that much these days, especially for music collectors. I don't even have music that is hugely dynamic, for the most part, so my compression is reasonably good with VBR set with a minimum of 190 or something like that. For music that has a lot of dynamic sound, bitrates are going to be higher. 150 GB is nowhere near 3600 albums if you actually want the music to sound close to correct.
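
The two album counts in this sub-thread really just come down to what bitrate you assume. A small Python sketch of the arithmetic, using decimal GB/MB and a hypothetical 45-minute album at constant bitrate (real VBR rips will vary):

```python
def mb_per_album(total_gb, albums):
    """Average album size in MB for a collection of total_gb."""
    return total_gb * 1000 / albums


def album_size_mb(kbps, minutes):
    """Approximate size in MB of `minutes` of audio at a constant bitrate."""
    bytes_per_second = kbps * 1000 / 8
    return bytes_per_second * minutes * 60 / 1_000_000


print(mb_per_album(50, 700))     # this post: 700 CDs in 50GB
print(mb_per_album(150, 3600))   # implied by the 3,600-album estimate
print(album_size_mb(128, 45))    # hypothetical 45-minute album at 128kbps
print(album_size_mb(192, 45))    # same album at 192kbps
```

So the 3,600 figure is roughly what you get if you assume 128kbps, and the 2,100 figure is what you get at around 190 and up.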

--
"Growing old is inevitable; growing up is optional."
Re:Everyman? by kannibul · 2007-05-30 01:20 · Score: 1

Maybe not everyman, but...

With the recent changes to the laws for business, having to retain data and versions of documents for YEARS has become a reality (otherwise, one can be found liable for "virtual shredding"). With that, cheap, fast, and large storage is a must.
Re:Everyman? by iknownuttin · 2007-05-30 01:25 · Score: 2, Funny

Not everybody only deals with text files for a living
Well, I'll have to buy a digital camera that shoots in ASCII. Oh wait.......

--
I prefer Flambe as apposed flamebait.
Re:Everyman? by Mr+Z · 2007-05-30 01:29 · Score: 3, Interesting

I actually have two 48GB databases full of minimal instruction sequences for generating boolean functions. Do I win the obscure use of disk space prize?

--
Program Intellivision!
Re:Everyman? by miller701 · 2007-05-30 01:31 · Score: 0, Flamebait

I ripped most of my collection back in the OS9 pre-iPod iTunes days. Now that I have a kid and a house to take care of, I don't think I'll be re-ripping unless the computer dies. No judgement, just different priorities for me now. Peace
Re:Everyman? by WarwickRyan · 2007-05-30 01:36 · Score: 1

Be careful with using DVDs for backup, they don't last that long. Personally I'm now sticking to top quality DVD+Rs as they seem to last the longest.

Tape also isn't made for long term storage.

Hope you offsite anything you can't lose; photos are irreplaceable so it'd suck to lose them..
Re:Everyman? by eMbry00s · 2007-05-30 01:38 · Score: 1

[b]Everyman[/b] needs porn.
Re:Everyman? by Afrosheen · 2007-05-30 01:44 · Score: 2, Interesting

They made the Buffalo Terastation for guys like you. Google it, it's not too expensive and is pretty much hands-off.
Re:Everyman? by Score+Whore · 2007-05-30 02:23 · Score: 1

I can't imagine how to take this in a good way, but you're an idiot. Or lying. You've got a raid-5 group behind a raid-5 group behind a raid-0. You're adding mondo latency, wasting space, and adding unnecessary complexity. Put all your disks in a big, flat, raid-5 with two parity disks (eg. "raid-6".) You sound like those guys who make shit extra complicated just so they can sound smart/impressive/skilled.
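
To put a number on the "wasting space" part, here is a rough Python sketch comparing usable capacity for the nested layout against one flat RAID-6. The parent never said how many disks hang off each port multiplier, so the 5-per-multiplier figure is an assumption (picked because 40 x 200GB matches the 8TB mentioned). This is capacity only; it says nothing about latency, rebuild times, or whether a single 40-disk RAID-6 is a good idea.

```python
def raid5(members_gb):
    """Usable size of a RAID-5 set: one member's worth lost to parity."""
    return min(members_gb) * (len(members_gb) - 1)


def raid6(members_gb):
    """Usable size of a RAID-6 set: two members' worth lost to parity."""
    return min(members_gb) * (len(members_gb) - 2)


def raid0(members_gb):
    """Usable size of a stripe: nothing lost, nothing protected."""
    return min(members_gb) * len(members_gb)


DISK_GB = 200
DISKS_PER_MULTIPLIER = 5   # assumption, not stated in the thread
MULTIPLIERS = 8            # one per port on the 8-port card

# Level 1: RAID-5 inside each port multiplier.
level1 = [raid5([DISK_GB] * DISKS_PER_MULTIPLIER) for _ in range(MULTIPLIERS)]
# Level 2: two RAID-5 sets, each across four level-1 volumes.
level2 = [raid5(level1[:4]), raid5(level1[4:])]
# Level 3: RAID-0 across the two level-2 sets.
nested = raid0(level2)

flat = raid6([DISK_GB] * (MULTIPLIERS * DISKS_PER_MULTIPLIER))

print("nested RAID-5/5/0 usable GB:", nested)
print("flat RAID-6 usable GB:", flat)
```

Under those assumptions the nested layout gives up a lot more raw space to parity than the flat one does.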
Re:Everyman? by cayenne8 · 2007-05-30 02:25 · Score: 1

"If you like your music to actually SOUND good, 128kbps sucks. I personally rip my music using a Variable bitrate between 224 and 320 kbps. Unfortunately, this makes for VERY large files. But my music sounds FANTASTIC!"
If you're really concerned about sound...do what I do. Rip them to FLAC, for listening on the home stereo, the one that you put $$ into to sound good. Then, rip those down to mp3 or whatever, for portables, cars or poorer listening environments.

--
Light travels faster than sound. This is why some people appear bright until you hear them speak.........
Re:Everyman? by endianx · 2007-05-30 02:26 · Score: 1

My CD collection alone would take up more than 40 gigs, once I finally get around to ripping them all to FLAC.

The general answer to your question is "media". CDs, DVDs, and photos.
Re:Everyman? by hoggoth · 2007-05-30 02:35 · Score: 4, Informative

I am the original poster, and I am not actually a typical user.
I routinely work with files that are 100 GB - 300 GB each.
Just copying one file from drive to drive takes hours.
I have about 4 Terabytes in use, with another 4 Terabytes for backup.

My usage is the exact opposite of database usage (which most storage is optimized for).
I need to copy huge sequential files. I rarely need many small reads or writes.

Because of the long times it takes to move these files around, I think NFS or CIFS would be too slow. That's why I am interested in the ability of ZFS to easily export iSCSI targets. Some tests I read showed that ZFS exporting iSCSI is about 4 times faster than ZFS exporting NFS or CIFS.

I am comparing to drives directly attached via eSATA, so it's got to be fast to come anywhere close to what I get with eSATA.
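
For scale, a quick Python sketch of how long one of those files takes to move at a given sustained rate. The rates are illustrative guesses, not measurements from my setup, and the function name is made up.

```python
def copy_hours(file_gb, mbytes_per_sec):
    """Hours to move file_gb (decimal GB) at a sustained rate in MB/s."""
    return file_gb * 1000 / mbytes_per_sec / 3600


# Sustained rates below are illustrative guesses, not measurements.
rates = {
    "GigE at wire speed (125 MB/s)": 125,
    "a single disk at 60 MB/s": 60,
    "a slow network share at 30 MB/s": 30,
}
for label, rate in rates.items():
    print(f"300 GB over {label}: {copy_hours(300, rate):.2f} h")
```

Even at full gigabit wire speed a 300 GB file is a long coffee break, so any protocol overhead on top of that hurts.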

--
- For the complete works of Shakespeare: cat /dev/random (may take some time)
Re:Everyman? by mulvane · 2007-05-30 02:47 · Score: 2, Informative

That's why I am doing an upgrade to larger disks soon. Hopefully I can get a deal on 1TB drives. The data has all been offloaded to a myriad of machines 3 times so I could upgrade the arrays and stay consistent with disk sizes.

Latency isn't as noticeable as one would think. The array is mounted as read only except for scheduled uploads of new content (usually only 2-3 times a month). Once the reads start, they play without any problem (never played more than 2 HD and 4 SD streams at once), and writes are slow (read that as EXTREMELY slow) but not a problem as I only sync, like I said, 2-3 times a month.

I'm looking to do a single RAID6 using 2 PCI-E 16x cards and up to 12 drives. My initial storage requirements have been met and I only add a few movies and eps a month now, so a gain of 4TB over my current would keep me going for a couple of years considering I am only right now using a little over 6TB.

The reason I have the first RAID5 is that the SATAII port multipliers I am using support JBOD, RAID 0, 1 and 5. Backup is not CRITICAL as I have legit copies of all the movies, and most of the TV shows have been bought as season bundles. Redundancy of data is important though, so I can suffer a pretty massive crash of a number of drives in this setup. It seems like a reasonable trade off to use RAID as I did to ensure I could recover without having to rip everything all over again.

At the time I built this (200GB drives were new at the inception), RAID6 was not an option, so please don't call me stupid for building a setup on a much smaller scale and growing with it until a large enough disk capacity became available at a price point to make it worth building a system from scratch. It's served my needs, and now with the 750s at a good price point and 1TBs coming out, the next few months will see my capacity grow and complexity diminish.
Re:Everyman? by hoggoth · 2007-05-30 02:48 · Score: 4, Interesting

> I can't really see why one would bring up ZFS and OpenSolaris for this purpose

Here's why:
1) Snapshots. ZFS lets me make lots of snapshots to protect myself from user error, viruses, etc destroying my data. ZFS snapshots are so lightweight that I can make them hourly at nearly no cost in time and disk space.
2) Data integrity. Even RAID-5 can allow some errors to creep into my data (google: bit rot). ZFS has a much higher level of data integrity protection.
3) Cost/Performance. ZFS RAID-Z appears to be much faster than software RAID-5. It appears to be as fast as or faster than hardware RAID-5. Hardware RAID-5 is much more expensive than software.
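
Point 2 is easier to see with a toy model. The sketch below is not ZFS code; it assumes a hypothetical two-disk mirror that keeps a SHA-256 checksum apart from each data block, which is roughly the idea behind ZFS checksumming, and shows a flipped bit being caught on read and repaired from the good copy. The names `ChecksummedMirror`, `write` and `read` are made up for illustration.

```python
import hashlib


class ChecksummedMirror:
    """Toy two-way mirror that stores a checksum apart from the data (illustrative only)."""

    def __init__(self):
        self.copies = [{}, {}]   # two "disks": block id -> bytearray
        self.checksums = {}      # block id -> hex digest, kept separately from the data

    def write(self, block_id, data):
        for disk in self.copies:
            disk[block_id] = bytearray(data)
        self.checksums[block_id] = hashlib.sha256(data).hexdigest()

    def read(self, block_id):
        expected = self.checksums[block_id]
        for disk in self.copies:
            data = bytes(disk[block_id])
            if hashlib.sha256(data).hexdigest() == expected:
                # Self-heal: overwrite any copy that no longer matches the good one.
                for other in self.copies:
                    if bytes(other[block_id]) != data:
                        other[block_id] = bytearray(data)
                return data
        raise IOError("all copies of block %r are corrupt" % (block_id,))


pool = ChecksummedMirror()
pool.write("b0", b"holiday photos")
pool.copies[0]["b0"][0] ^= 0x01      # simulate silent bit rot on disk 0
recovered = pool.read("b0")          # disk 0 fails its checksum, disk 1 is used
print(recovered)
```

A plain RAID-5 or mirror without checksums has no way to tell which copy is the good one in this situation, which is the gap the parent post is pointing at.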

--
- For the complete works of Shakespeare: cat /dev/random (may take some time)
Re:Everyman? by Anonymous Coward · 2007-05-30 03:24 · Score: 0

Video editing, mostly. I think his solution is more appropriate for underfunded organizations. Not so much small businesses -- they don't have a lot of data -- but small overlooked underfunded orgs in a big business.

We have a 4TB SNAP server in my org, and it falls over every week. No redundancy, just parity that it doesn't seem to build reliably. Piece of garbage, those things are. I'd love to put something reliable together for less than a hundred benjamins.
Re:Everyman? by multipartmixed · 2007-05-30 03:24 · Score: 1

Have you thought about getting some Sun T3+s?

You can connect multiple computers to each T3+ via FCAL, you just can't mount the same F/S at the same time.

If you were particularly clever (AND your access pattern happened to be right), you could write the data to a mirror on the T3, unmount it from whatever is generating the data, split the mirror, and each computer could now mount it. And maybe even start making another mirror if you're so inclined.

I suggest this not because I think it's a good SAN solution, I'm thinking maybe giving you direct disk access would solve some of your problems.

Oh, and don't buy the T3s from Sun. They are cheap and plentiful on the refurb market. Typical configuration per unit is 9 73GB disks, dual power supplies, built-in batteries (to run disks until cache is flushed) and 1GB write-behind cache. You can also chain them together, run them in parallel, whatever. Administration is serial or 10Mbps telnet.

--

Do daemons dream of electric sleep()?
Re:Everyman? by SkunkPussy · 2007-05-30 03:34 · Score: 1

Although he said MP3s, it's not uncommon for people to use lossless compression these days, so 150GB isn't unreasonable. Also, what if you have a lot of dance music mixes? Those typically last 1-2 hours, which is generally longer than a CD album.

--
SURELY NOT!!!!!
Re:Everyman? by SkunkPussy · 2007-05-30 03:38 · Score: 1

pwned :)

--
SURELY NOT!!!!!
Re:Everyman? by hoggoth · 2007-05-30 03:43 · Score: 1

Oh, and
4) ZFS is safe without any battery backed up cache. RAID isn't really safe without it.

--
- For the complete works of Shakespeare: cat /dev/random (may take some time)
Re:Everyman? by billcopc · 2007-05-30 03:44 · Score: 2, Interesting

I'd never claim to be an everyman, but I broke 2TB on my desktop three years ago with a huge pile of SATA drives and a couple extra controller cards. Besides, chicks dig the little side-cart full of hard drives :) I just took a couple of four-slot drive cages from cheap PC cases and built them into an enclosure, complete with its own ATX power supply.

Of course that was before I jumped onto the NAS storage cash cow, doing pretty much exactly what the article poster wrote, only I turned around and sold my PC-based NAS boxes for about half the price of "enterprise" solutions, which still represents a 400% markup for me :)

You have to realize, the companies and people building these overpriced RAID arrays are just your average greedy bastard, usually no smarter or more skilled than any other geek. Most of the computer-attached devices today are little more than an XScale processor, a tiny bit of RAM and Linux. Broadband routers, NAS boxes, KVM/IP switches, "smart" network adapters, heck I wouldn't be surprised to find home entertainment devices running Linux. We're in the age of mashups, where any idiot with a marketing budget can slap various I/O ports on a board and "invent" appliances.

--
-Billco, Fnarg.com
Re:Everyman? by Kryai · 2007-05-30 03:56 · Score: 1

What is a long time for you? I've copied via NFS and CIFS (samba) at fairly impressive levels. 40+ gigs in as fast as my disks could write. Perhaps I've not done transfers on the terabyte scale but have you tested your transfer speeds with NFS/CIFS to get a reference point? I imagine that when you transfer 4 terabytes it's going to take a while no matter what protocol you use to transfer the data.
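
A rough back-of-the-envelope sketch of that last point, assuming decimal terabytes and a sustained rate that never dips. The 60 MB/s figure is purely an assumed disk-limited rate for illustration, not a measurement from anyone in this thread.

```python
def transfer_hours(size_bytes, rate_bytes_per_s):
    """Hours needed to move size_bytes at a constant sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600.0


TB = 10**12
GIGABIT_WIRE_RATE = 1_000_000_000 / 8   # 125 MB/s, ignoring all protocol overhead
ASSUMED_DISK_RATE = 60e6                # hypothetical 60 MB/s sustained

for label, rate in [("gigabit wire speed", GIGABIT_WIRE_RATE),
                    ("assumed 60 MB/s", ASSUMED_DISK_RATE)]:
    print("%s: %.1f hours for 4 TB" % (label, transfer_hours(4 * TB, rate)))
```

Even at full gigabit wire speed that is the better part of a working day, whatever protocol sits on top.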
Re:Everyman? by maxume · 2007-05-30 03:56 · Score: 1

Very large? Only 3x at the most, so assuming the parent estimate was perfect for 128, you still have on the order of 1200 albums, which is still a lot.

--
Nerd rage is the funniest rage.
Re:Everyman? by hoggoth · 2007-05-30 04:08 · Score: 1

Oh, and
5) Easy to expand pool with larger drives (remove old drive, insert new larger drive, add it and let it rebuild)

--
- For the complete works of Shakespeare: cat /dev/random (may take some time)
Re:Everyman? by sco_robinso · 2007-05-30 04:20 · Score: 1

I can see where you're coming from. I do some IT consulting work for a professional photographer who explains it exactly as you do. She literally has about 8 hard drives inside her computer and another 5 external, and she's still running out of space.

What I did for her was basically what this guy is doing - building a cheap NAS box. I just got an off-the-shelf Thecus N-5200R NAS appliance and threw in 5x 500GB hard drives. For about $2100CAD ($1850US), we had a 2TB RAID-5 enclosure set up and running within an hour.
Re:Everyman? by ChrisA90278 · 2007-05-30 04:40 · Score: 1

For your use, copying large files, ZFS would be ideal because with copy on write no data would move and in theory you could "copy" a 300GB file instantly. Only when you start making changes to the data would there be need to move data to/from the disk drives. In your case ZFS on a server is ideal.

Where SAN is needed is if you have many users of the data. Let's say you have a dozen people who need to access these files on a dozen computers. Or 100 users on 100 computers. This is where SAN is needed and where the single server using 1000BaseT falls apart. The PCI (or whatever other kind of) bus on the server becomes the bottleneck.
Re:Everyman? by Score+Whore · 2007-05-30 05:13 · Score: 1

FYI. Some days I'm a total dick. Your setup sounds interesting. But I'd never build it.
Re:Everyman? by kestasjk · 2007-05-30 05:16 · Score: 1

I bet you couldn't tell the difference between a 192 kbps MP3 file and a 320 kbps MP3 file. It's one of those "fine wine" things; you think a 320 kbps sounds better, but if you did a blind test on a typical piece of music you wouldn't be able to tell.

Some experts think that MP3 reaches transparency (i.e. you are unable to tell the difference between it and the original) at 128 kbps when using a good encoder. Others think it's 192 kbps. But 320 kbps is a waste of space.
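
For the "waste of space" part, the arithmetic is simple enough to sketch. This assumes constant-bitrate files and a hypothetical 60-minute album; both are assumptions for illustration only.

```python
def mp3_megabytes(bitrate_kbps, minutes):
    """Size in MB (10**6 bytes) of a constant-bitrate stream of the given length."""
    bits = bitrate_kbps * 1000 * minutes * 60
    return bits / 8 / 1e6


ALBUM_MINUTES = 60   # assumed album length

for kbps in (128, 192, 320):
    print("%d kbps: %.1f MB per album" % (kbps, mp3_megabytes(kbps, ALBUM_MINUTES)))
```

So going from 128 kbps to 320 kbps costs 2.5 times the disk space for the same music.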

--
// MD_Update(&m,buf,j);
Re:Everyman? by mulvane · 2007-05-30 05:20 · Score: 1

If my end goal when I originally started had been 8TB, I would never have built it this way either. Upgrading what I had, drives plus additional support hardware, was cheaper than building from scratch as 200GB Seagate SATA drives kept going down in price. The hardware as it is can still be upgraded to support 16TB if I wanted, but now I am getting into bus bottlenecks and other obstacles that make a from-scratch solution cheaper for more capacity. I tell ya what though, I'm gonna make a small fortune selling 200GB SATA drives on eBay soon :-)
Re:Everyman? by Anonymous Coward · 2007-05-30 05:35 · Score: 0

Heck, I'd be a client for a safe, multi-TB storage system if I could afford it

Check out Drobo. (Not a shill, just thought it was cool.)
Re:Everyman? by jtwronski · 2007-05-30 06:00 · Score: 1

TeraStation seconded here.

I just picked up a 2TB kit for video storage and have it set up as RAID5, so I get 1.5TB of storage out of it. It came with an eSATA install kit for the old fileserver it's attached to, and ought to last me about 2 years until I fork out for a larger setup.

Total cost: 1200USD including shipping, and I went from boxcutter to fsck in 15 minutes flat.
Re:Everyman? by Nutria · 2007-05-30 06:36 · Score: 1

But 320 kbps is a waste of space.

Not if it's still significantly smaller than flac.

--
"I don't know, therefore Aliens" Wafflebox1
Re:Everyman? by hoggoth · 2007-05-30 09:08 · Score: 1

> Where SAN is needed is if you have many users of the data.

I thought SANs could only be 'mounted' by one user at a time. I was under the impression that as a raw block-device a SAN couldn't arbitrate multiple users at a time. How do they handle that?

--
- For the complete works of Shakespeare: cat /dev/random (may take some time)
Re:Everyman? by locokamil · 2007-05-30 09:39 · Score: 1

"It is not possible to add a disk to a raidz or raidz2 vdev. This feature appears very difficult to implement. It should also be noted that adding disk to a raidz would degrade the data protection by reducing the proportion of parity to data bits." -- Wikipedia

I don't know if RAID-Z should be used as a selling point for ZFS. If my reading of the situation is correct, you need to add another RAID-Z array to the storage pool for all of your data to remain RAID-ed. To grow your protected storage capability, you therefore have to add at least three (?) drives, as opposed to the one single drive required for XFS based software RAID-5 solutions. For SOHO users, the larger granularity is a bit of a deterrent... but then again, I doubt Sun had this demographic in mind when designing ZFS.

RAID lameness aside, ZFS is an infinitely safer filesystem: even I can see how checksums and transactions will promote data integrity where it matters.
Re:Everyman? by hoggoth · 2007-05-30 10:45 · Score: 1

Then someone explain this (from the Wikipedia entry):

"It is also possible to expand the array by iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to heal itself"

Does this imply you CAN swap out a drive and replace it with a bigger one in a RAID-Z vdev? If so, that's all I need. I will max out my drive bays with the cheapest drives, add them all to a RAID-Z pool and upgrade them as I need to.
If it does NOT imply that, then what are they talking about? How can it heal itself without parity information to rebuild your data?

--
- For the complete works of Shakespeare: cat /dev/random (may take some time)
Re:Everyman? by Anonymous Coward · 2007-05-30 12:40 · Score: 1, Funny

Quick answer is: you are correct(!) ZFS obviates the $47k expense, and you should ignore the FUD-ing mutants that suggest that it's better to ignore software advances and instead roll back the clock, open your wallet, and replace your 12 SATA drives with hundreds of dinky 73GB SCSI dinosaurs.
Re:Everyman? by locokamil · 2007-05-30 12:58 · Score: 1

I'm a little fuzzy on the details. I think you can take out one drive and put in another (i.e. the system is hot swap capable). However, the number of drives that can be in the array is fixed.

I'm going to be playing around a bit with ZFS over the next few weeks in an effort to build (how relevant) a cheapo redundant storage solution. If you're interested, I can keep you posted.
Re:Everyman? by drsmithy · 2007-05-30 17:51 · Score: 1

If it does NOT imply that, then what are they talking about? How can it heal itself without parity information to rebuild your data?
They are talking about replacing *all* the drives in the array with larger ones.
For example, you have a 4*500G, 2TB (raw) RAIDZ array.
You replace the first 500G disk with a 750G disk and wait for it to rebuild ("resilver").
You replace the second 500G disk with a 750G disk and wait for it to rebuild.
You replace the third 500G disk with a 750G disk and wait for it to rebuild.
You replace the fourth 500G disk with a 750G disk and wait for it to rebuild.
At the end of this process, you have a 4*750G = 3TB (raw) RAIDZ array (ie: you've gained ca. 750G of usable space).
In other words, you can't *extend* an existing RAIDZ/RAIDZ2 array with an additional disk, but you can *replace* all the existing disks with bigger ones and eventually (once they're all replaced) be able to access the additional space. Then you can create another RAIDZ with the old drives.
Do note, however, that you can extend a "RAID10", simply by adding more pairs of mirrored disks. Ie: 4*500G (2TB raw) RAID10, add two 750G disks, 3.5TB (raw) RAID10.
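
The numbers in that walkthrough can be checked with a short sketch. It assumes the usual rule of thumb that every member of a RAIDZ array counts as the smallest disk and that one disk's worth of space goes to parity; it ignores metadata overhead, and the function names are made up for illustration.

```python
def raidz_usable_gb(disks_gb, parity=1):
    """Approximate usable space: each member counts as the smallest disk,
    and `parity` disks' worth of space is spent on redundancy."""
    return (len(disks_gb) - parity) * min(disks_gb)


def mirror_pairs_raw_gb(pairs):
    """Raw capacity of a stripe of mirrored pairs (sum of every disk)."""
    return sum(a + b for a, b in pairs)


array = [500, 500, 500, 500]
print("start:", raidz_usable_gb(array), "GB usable")
for i in range(len(array)):
    array[i] = 750   # swap one disk, then wait for the resilver to finish
    print("after replacing disk %d:" % (i + 1), raidz_usable_gb(array), "GB usable")

print("RAID10 raw:", mirror_pairs_raw_gb([(500, 500), (500, 500), (750, 750)]), "GB")
```

Usable space stays at 1500 GB until the fourth swap and only then jumps to 2250 GB, which is the ca. 750G gain quoted above. The mirrored case comes to 3500 GB raw as soon as the extra pair is added.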
Re:Everyman? by hoggoth · 2007-05-30 23:39 · Score: 1

Ah, so I should fill all my drive bays with the cheapest drives I can get and add them all to the RAID-Z array. Then I have maximum expansion ability later on.

--
- For the complete works of Shakespeare: cat /dev/random (may take some time)
Re:Everyman? by drsmithy · 2007-05-31 00:14 · Score: 1

> Ah, so I should fill all my drive bays with the cheapest drives I can get and add them all to the RAID-Z array. Then I have maximum expansion ability later on.
Well, for best performance your RAIDZ array shouldn't be more than about 8 drives (and generally speaking, RAID5 arrays shouldn't be much larger than that). But that's probably not your primary interest in a home server.
Also remember that you effectively need to replace all the drives in one upgrade cycle - you won't be able to take advantage of any bigger drives until *all* the smaller ones have been removed from the array. This might be important to keep in mind from a financial perspective when sizing your arrays - replacing 12 - 16 drives in one hit could be expensive.
Re:Everyman? by barronVonBackstabber · 2007-05-31 03:01 · Score: 1

stick a global file system on top of it.

Congratulations, you discovered the "File Server" by BigBuckHunter · 2007-05-30 00:08 · Score: 4, Informative

For quite a while now, it has been less expensive to build a DIY file server than to purchase NAS equipment. I personally build gateway/NAS products using VIA C7/8 boards as they are low power, have hardware encryption, and are easy to work with under Linux. Accessory companies even make backplane drive cages for this purpose that fit nicely into commodity PCs.

BBH

ok for low end, not for high by alen · 2007-05-30 00:09 · Score: 3, Informative

The place where I work looked at one of these things from another company. We did the math and it's too slow, even over gigabit, for database and Exchange servers. OK for regular file storage, but not for heavy I/O needs.

Re:ok for low end, not for high by morgan_greywolf · 2007-05-30 00:38 · Score: 4, Insightful

Precisely. The question in the title is a little bit like asking "Will large PC clusters obsolete mainframes?" or "Will Web applications obsolete traditional GUI applications?" The answer is, as always, "It depends on what you use it for." For high-performance databases or a high-traffic Exchange server, these things may not work well.

I've seen plenty of iSCSI solutions coupled with NAS servers that get pretty good throughput in this price range that are already integrated and ready to go, but the bottom line is that if you want high-performance, high-availability storage for I/O-intensive applications, you need a fiber SAN/NAS solution.

--
My blog
Re:ok for low end, not for high by Jim+Hall · 2007-05-30 00:49 · Score: 4, Insightful

I agree. At my work, we have a SAN ... low-end frames (SATA) to mid-range (FC+SATA) to high-end frames (FC.) We put a front-end on the low-end and mid-range storage using a NAS, so you can still access using the storage fabric or over IP delivery. Having a SAN was a good idea for us, as it allowed us to centralize our storage provisioning.

I'm familiar with ZFS and the many cool features laid out in this Ask Slashdot. The simple answer is: ZFS isn't a good fit to replace expensive SAN/NASs. However, ZFS on a good server with good storage might be a way to replace an inexpensive SAN/NAS. Depending on your definition of "inexpensive." And if you don't mind the server being your single point of failure.
Re:ok for low end, not for high by flakier · 2007-05-30 01:35 · Score: 3, Interesting

I wonder if the Coraid ATA-over-Ethernet would be good enough? It ditches TCP/IP in favor of raw Ethernet frames so has much lower overhead than iSCSI and only major loss is no routing. http://www.coraid.com/

BTW, I read recently that where 4Gb FC really excels is in large block sequential transfers and that small random access transfers are actually better over gigabit iSCSI. Check it out: http://searchstorage.techtarget.com/columnItem/0,294698,sid5_gci1161824,00.html

Plus you really have to think about other bottlenecks. How many disks need to be striped to consistently saturate the bandwidth of 1Gb Ethernet? 10Gb ethernet? What about the bus that the host adapter/NIC is on? Precious few boxes have 4x PCIe and then what about CPU overhead, managing all this streaming data? Just food for thought...
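
To put rough numbers on the striping question, here is a minimal sketch. The 60 MB/s sustained rate per disk is an assumption picked for illustration, not a figure from this thread, and the link rates ignore protocol overhead.

```python
import math


def disks_to_saturate(link_gbps, disk_mb_per_s):
    """Smallest number of striped disks whose combined sequential
    throughput meets or exceeds the raw link rate."""
    link_mb_per_s = link_gbps * 1000 / 8
    return math.ceil(link_mb_per_s / disk_mb_per_s)


ASSUMED_DISK_MB_S = 60   # hypothetical sustained sequential rate per disk

for gbps in (1, 10):
    print("%d Gb Ethernet: %d disks" % (gbps, disks_to_saturate(gbps, ASSUMED_DISK_MB_S)))
```

Under that assumption a handful of disks fills a 1Gb link for sequential work, while 10Gb needs around twenty. Random I/O would change the picture completely.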

--
--
Re:ok for low end, not for high by Ryan+Amos · 2007-05-30 01:53 · Score: 1

Agreed 100%. With a decent iSCSI offload card, a single server can saturate one of these arrays without breaking a sweat.

It's OK for a file server, or for storing backups, or for storing large files in a small environment. But it's still SATA; you see almost exclusively Fibre Channel in this space for a reason.
Re:ok for low end, not for high by Colin+Smith · 2007-05-30 02:02 · Score: 1

> And if you don't mind the server being your single point of failure.

HA Linux is pretty trivial to set up, but it does require an experienced admin, which isn't needed so much with your typical NAS.

--
Deleted
Re:ok for low end, not for high by Retric · 2007-05-30 02:22 · Score: 1

How much bandwidth do you need? Redundant GB network card per box = ~750Mbits/s. 5+1 x 12TB DB at 6k a pop = ~4.25Gbits/s on a 12TB DB for 36k. Or you can use more than 1 network card per box and double that...
Re:ok for low end, not for high by Anonymous Coward · 2007-05-30 04:55 · Score: 0

> I've seen plenty of iSCSI solutions coupled with NAS servers that get pretty good throughput in this price range that are already integrated and ready to go, but the bottom line is that if you want high-performance, high-availability storage for I/O-intensive applications, you need a fiber SAN/NAS solution.

I would disagree. We're running a high I/O application backending a Top 50 website using EqualLogic iSCSI arrays. Some of our Linux boxes have TOE cards (HBAs for you fiber guys) and some use software initiators. Because our I/O is random we can't use huge array-side caches. Expensive FCAL hardware really turned us off too. I mean, it was great, but the learning curve sucked and we realized we'd never use enough of the features Fibre Channel offers to make it ROI-positive.

In the last 12 months we've had better than five nines reliability. We've even doubled the size of our storage on the fly with no service interruption. All in all we've probably saved over $1MM doing it this way.
Re:ok for low end, not for high by Anonymous Coward · 2007-05-30 07:19 · Score: 0

I'm interested in putting something like this together.
What are you using for chassis/MB...is there a good "barebone" system you like?

And more importantly. by Spazntwich · 2007-05-30 00:10 · Score: 4, Funny

Does the overuse of TLAs obfuscate the meaning of SDS?

Re:And more importantly. by xappax · 2007-05-30 01:57 · Score: 1

IDK, WTF is SDS?
Re:And more importantly. by weilawei · 2007-05-30 03:04 · Score: 1

/. syndrome?

Current issues by packetmon · 2007-05-30 00:12 · Score: 4, Informative

I've snipped out the worst reasons as per Wiki entry:

A file "fsync" will commit to disk all pending modifications on the filesystem. That is, an "fsync" on a file will flush out all deferred (cached) operations to the filesystem (not the pool) in which the file is located. This can make some fsync() slow when running alongside a workload which writes a lot of data to filesystem cache.
ZFS encourages creation of many filesystems inside the pool (for example, for quota control), but importing a pool with thousands of filesystems is a slow operation (can take minutes).
ZFS filesystem on-the-fly compression/decompression is single-threaded. So, only one CPU per zpool is used.
ZFS eats a lot of CPU when doing small writes (for example, a single byte). There are two root causes, currently being solved: a) Translating from znode to dnode is slower than necessary because ZFS doesn't use translation information it already has, and b) Current partial-block update code is very inefficient.
ZFS Copy-on-Write operation can degrade on-disk file layout (file fragmentation) when files are modified, decreasing performance.
ZFS blocksize is configurable per filesystem, currently 128KB by default. If your workload reads/writes data in fixed sizes (blocks), for example a database, you should (manually) configure ZFS blocksize equal to the application blocksize, for better performance and to conserve cache memory and disk bandwidth.
ZFS only offlines a faulty harddisk if it can't be opened. Read/write errors or slow/timeouted operations are not currently used in the faulty/spare logic.
When listing ZFS space usage, the "used" column only shows non-shared usage. So if some of your data is shared (for example, between snapshots), you don't know how much is there. You don't know, for example, which snapshot deletion would give you more free space.
Current ZFS compression/decompression code is very fast, but the compression ratio is not comparable to gzip or similar algorithms.
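
The "used" column item is easier to follow with a toy model. This is only a sketch of why shared blocks make the accounting awkward, assuming each dataset or snapshot is just a set of block numbers; it is not how ZFS computes its figures, and the names are invented.

```python
def unique_blocks(name, datasets):
    """Blocks referenced only by `name`; destroying it would free exactly these."""
    others = set()
    for other, blocks in datasets.items():
        if other != name:
            others |= blocks
    return datasets[name] - others


datasets = {
    "live":         {1, 2, 3, 7, 8},
    "snap-monday":  {1, 2, 3, 4},
    "snap-tuesday": {1, 2, 3, 4, 5, 6},
}

for name in sorted(datasets):
    print("%s: references %d blocks, %d unique"
          % (name, len(datasets[name]), len(unique_blocks(name, datasets))))
```

Here destroying snap-monday alone frees nothing, because block 4 is still held by snap-tuesday, yet destroying both snapshots frees three blocks. A per-snapshot unique figure cannot show that, which is the complaint in the list.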

--
Infiltrated dot Net

Re:Current issues by ZorinLynx · 2007-05-30 00:20 · Score: 1

ZFS does not do user quotas. If you want to do user quotas you need to create a filesystem per user. Filesystems are easy to create but a filesystem per user gets cumbersome if you have thousands of users, not to mention having to have thousands of NFS exports and making backups a greater headache.

Really, Sun, you gotta fix this. At least give users a choice as to what to use: separate filesystems or user quotas (or a combination of both).
Re:Current issues by allenw · 2007-05-30 01:20 · Score: 1

I'm surprised your list doesn't include the inability to evacuate a drive. This is one of the biggest problems with ZFS right now. Makes it a real pain to upgrade drives.
Re:Current issues by segfaultcoredump · 2007-05-30 03:29 · Score: 1

zpool replace [-f] pool old_device [new_device]

Of course, if you are using raid-z/mirroring, trust the system, and don't have anywhere to plug a new drive in, you could get away with pulling the old drive, dumping in a new drive and running `zpool replace pool old_device` on the updated drive (new_device defaults to old_device if it is not specified).
Re:Current issues by Kymermosst · 2007-05-30 06:58 · Score: 1

I can address some of them:

* ZFS encourages creation of many filesystems inside the pool (for example, for quota control), but importing a pool with thousands of filesystems is a slow operation (can take minutes).

Pools are only imported when the system boots or when you specifically move a ZFS pool from one machine to another. It's a non-issue unless you reboot all the time or make a habit of carrying your external SCSI enclosure back and forth between machines.

* ZFS filesystem on-the-fly compression/decompression is single-threaded. So, only one CPU per zpool is used.

So, don't use compression if that's a problem for you.

* ZFS eats a lot of CPU when doing small writes (for example, a single byte). There are two root causes, currently being solved: a) Translating from znode to dnode is slower than necessary because ZFS doesn't use translation information it already has, and b) Current partial-block update code is very inefficient.

This will be very workload-dependent. If it's a problem, don't use ZFS for now.

* ZFS Copy-on-Write operation can degrade on-disk file layout (file fragmentation) when files are modified, decreasing performance.

You make up for it in reliability.

* ZFS blocksize is configurable per filesystem, currently 128KB by default. If your workload reads/writes data in fixed sizes (blocks), for example a database, you should (manually) configure ZFS blocksize equal to the application blocksize, for better performance and to conserve cache memory and disk bandwidth.

I'm not quite sure why this is a reason to not use ZFS. Are people afraid of doing a little tuning?

* ZFS only offlines a faulty harddisk if it can't be opened. Read/write errors or slow/timeouted operations are not currently used in the faulty/spare logic.

I had this happen, but I noticed the errors in the system log and manually offlined and replaced the disk. It wasn't that big of a deal and ZFS recovers gracefully from read/write errors. It would only be a problem if two disks failed a read/write on the same block. A fairly rare scenario.

* When listing ZFS space usage, the "used" column only shows non-shared usage. So if some of your data is shared (for example, between snapshots), you don't know how much is there. You don't know, for example, which snapshot deletion would give you more free space.

This is hardly a big deal.

* Current ZFS compression/decompression code is very fast, but the compression ratio is not comparable to gzip or similar algorithms.

Again, don't use compression if that's a problem for you.

I currently have a RAID-Z ZFS pool with some filesystems exported via Samba and NFS with compression enabled on one filesystem. I haven't really found any of the above to be a problem. For most of them, the fixes are on the way. None of them were enough to detract from ZFS. The only other option I would have considered to manage this storage would have been Veritas Storage Foundation, which costs more.

I'm not saying ZFS is without some issues, such as the ones you pasted from Wikipedia, but it should be fine for cheap storage and everyday workloads.

--
"Alcohol, Tobacco, Firearms, and Explosives" should be a convenience store, not a government agency.
Re:Current issues by Anonymous Coward · 2007-05-30 14:37 · Score: 0

Note: I work on ZFS.

For the fsync() issue, we actually have code going through final code review comments which will allow you to specify separate devices for the "ZIL" (ZFS intent log). The ZIL is responsible for pushing out the synchronous writes. With a separate log device in place, the asynchronous writes won't interfere with the synchronous ones. See:
http://bugs.opensolaris.org/view_bug.do?bug_id=6339640

I know the compression is no longer single-threaded as I fixed that back in February (build snv_59). See:
http://bugs.opensolaris.org/view_bug.do?bug_id=6460622

The configurable blocksize is actually an advantage, not a disadvantage. ZFS's blocksize is dynamic (it can grow on demand from 512B to 128KB). This is really nice for very large files (fewer indirect blocks). But yes, it causes problems for workloads such as databases that do random aligned I/O (typically 8KB). If your blocksize was 128KB and the block was not cached, then it's silly to have to read in 128KB to modify 8KB, when all you wanted to do was just a single 8KB write. Allowing the user to specify the maximum blocksize is a win for such workloads (such as Oracle). I see it as a disadvantage that other filesystems can't do this on the fly (no recompiling, ZFS can do this online).
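
The 128KB-versus-8KB point can be sketched as a small calculation. This is a simplified model, assuming uncached writes that start on a record boundary and a filesystem that can only rewrite whole records; the function name is made up for illustration.

```python
def bytes_read_for_update(write_size, record_size):
    """Bytes that must be read from disk before applying an uncached write
    that starts on a record boundary, when only whole records can be rewritten."""
    if write_size % record_size == 0:
        return 0   # whole records are overwritten, nothing to read first
    records_touched = -(-write_size // record_size)   # ceiling division
    return records_touched * record_size


KB = 1024
DB_WRITE = 8 * KB   # typical database page size mentioned above

for record in (128 * KB, 8 * KB):
    extra = bytes_read_for_update(DB_WRITE, record)
    print("record size %3d KB: read %3d KB to write 8 KB" % (record // KB, extra // KB))
```

With a 128KB record every 8KB update drags in 16 times as much data as it changes; matching the record size to the database page removes the extra read entirely.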

Note, back in March we added gzip as a compression algorithm (snv_62). See:
http://bugs.opensolaris.org/view_bug.do?bug_id=6536606

I'm not sure what "ZFS eats a lot of CPU when doing small writes" really refers to. If someone has done or can do some initial performance analysis, feel free to send your results to zfs-discuss@opensolaris.org. If it's a real bug, we'd like to fix it.

The issue of "ZFS only offlines a faulty harddisk if it can't be opened" is true and the fix is going through final development.

I encourage everyone to take a look at our documentation (pdf overview, admin guide, on-disk format guide, and source tour) at:
http://opensolaris.org/os/community/zfs/

The filesystem space is really at an interesting point in time (not just inside Sun), and hopefully we will see more innovation (besides just ZFS). I know the Linux community is very active in this space. I look forward to seeing what future filesystem(s) they come up with.

eric

Real SANs do more by PIPBoy3000 · 2007-05-30 00:13 · Score: 4, Informative

For starters, our SAN uses extremely fast connectivity. It sounds like you're moving your disk I/O over the network, which is a fairly significant bottleneck (even Gb). We also have the flexibility of multiple tiers - 1st tier being expensive, fast disks, and 2nd tier being cheaper IDE drives. I imagine you can fake that a variety of ways, but it's built in. Finally, there's the enclosure itself, with redundant power and such.

Still, I bet you could do what you want on the cheap. Being in health care, response time and availability really are life-and-death, but many other industries don't need to spend the extra. Best of luck.

Re:Real SANs do more by Colin+Smith · 2007-05-30 02:24 · Score: 1

You can buy 10Gb ethernet to run iSCSI for less than FC. Or, you can easily run quad Gb cards for much less. It makes huge sense to use the same technology for your LAN & SAN. In terms of reliability & performance it's fairly simple to design a system which is both fault tolerant and fast.

--
Deleted
Re:Real SANs do more by ocbwilg · 2007-05-30 02:52 · Score: 2, Insightful

You can buy 10Gb ethernet to run iSCSI for less than FC.

I'm not sure where you're shopping, but I've only seen a handful of 10Gb ethernet switches, and they have all been dreadfully expensive. So factor in a pair of those, plus multiple 10Gb TOE cards, and I bet that your price isn't any cheaper than fiber. In fact, it's probably more expensive.

It makes huge sense to use the same technology for your LAN & SAN.

Not always, and I think that this tends to be overstated by the iSCSI proponents. This statement makes perfect sense on the surface, but when you look into the details it comes up short. It makes sense from a knowledge transfer standpoint to use the same technology. You can get by with a single person who understands TCP/IP networking instead of a person who does networking and a person who does SAN networking. But that person who handles the networking for the SAN usually also handles all of the zoning, space allocation, LUN carving, and everything else that comes with managing a SAN. Who does that now?

Then there's the whole issue of network segregation. Do you really want to put your SAN on the same switches as your desktop PCs or servers? Probably not. It doesn't make a lot of sense. Sure, you can VLAN it off, but those switches still have a finite amount of bandwidth available, and you don't want I/O intensive applications (like iSCSI SANs) eating up all of the bandwidth that users are accustomed to having. Then there's the speed issue. Do you buy a ton of 10Gb switches for all of your datacenter network, or just a few for the iSCSI SAN? Do you run iSCSI on 1Gb switches instead? What you're going to end up doing eventually is just buying dedicated networking hardware for the SAN and physically separating it from the rest of your network. So at that point, what advantages do you get with iSCSI over fiber?
Re:Real SANs do more by Monkeybaister · 2007-05-30 06:52 · Score: 1

Sure, you can VLAN it off, but those switches still have a finite amount of bandwidth available

Not to take too much from your thunder (where I work, we use Fibre Channel for SAN), but all of our core ethernet switches are non-blocking; when we give them more ports, they get more bandwidth. We could not only segment the traffic onto separate VLANs, but we could also put that traffic onto separate trunks. Ethernet's poor multi-pathing support is what deters us the most from iSCSI.
Re:Real SANs do more by WuphonsReach · 2007-05-30 13:18 · Score: 1

For a smaller business, iSCSI makes a lot of sense. You get to use basic Ethernet equipment rather than single-use Fibre Channel hardware. At the start, you can VLAN the iSCSI traffic for long enough to get up and running. A good idea for the long term? Not at all, but it allows you to get started without lots of cash.

Even better, later on, the old iSCSI equipment (NICs, switches) can be used in other parts of your network. So when you upgrade from 1GigE to 10GigE, you can use all the old 1GigE equipment to bulk up other areas of the network. (There's a lot of small businesses that have yet to make the switch from 10/100 to gigabit.) Or if you decide to switch over to FC, you can reuse the old iSCSI equipment for regular network duties.

Hell, if nothing else, iSCSI pushes the cost of SAN technology down.

--
Wolde you bothe eate your cake, and have your cake?
Re:Real SANs do more by this+great+guy · 2007-05-30 18:19 · Score: 1

Huh ? ZFS or SAN, no matter what you choose, in both cases you can decide to put the file server on, say, a GbE network, so in both cases the 1 Gbps bottleneck will be the same.

Regarding disk connectivity, a typical ZFS setup (like the one described in the article) uses local SATA 1.5G/3.0G disks (so 1.5 or 3.0 Gbps per *disk*). Therefore the max aggregate throughput can be theoretically much higher than your typical SAN with one or two 4-Gbps FC links. For a fair comparison, don't forget to make sure that the SATA or FC controllers do not hit another bottleneck (PCI-X or PCIe bus...).

I have seen some ZFS benchmarks where 48 disks in the same zpool were able to provide 20+ Gbps of sequential read/write throughput.
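
A rough way to see the point about aggregate link rates is to add up the nominal figures. The sketch below assumes 48 disks, each on its own SATA 3.0G link (the disk count is borrowed from the benchmark mentioned above), against two 4-Gbps FC links. These are line rates only and the names are made up for illustration; actual disks deliver far less than their link rate, so treat the result as an upper bound, not a measurement.

```python
# Nominal link rates in Gbps; assumptions for illustration, not measured throughput.
SATA_LINK_GBPS = 3.0   # SATA 3.0G, one link per disk
FC_LINK_GBPS = 4.0     # one 4-Gbps Fibre Channel link


def aggregate_gbps(links: int, gbps_per_link: float) -> float:
    """Sum of nominal link rates; ignores encoding overhead and bus limits."""
    return links * gbps_per_link


sata_total = aggregate_gbps(48, SATA_LINK_GBPS)  # 48 local disks
fc_total = aggregate_gbps(2, FC_LINK_GBPS)       # SAN reached over two FC links

print(f"48 SATA links: {sata_total:.0f} Gbps nominal")
print(f"2 FC links:    {fc_total:.0f} Gbps nominal")
```

Whether the controllers and the PCI-X or PCIe bus can carry that much is a separate question, as noted above.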

Its just not the same thing. by Tester · 2007-05-30 00:14 · Score: 4, Informative

A good 20k$ RAID array does much more. First, it doesn't use cheap SATA drives, but Fibre Channel drives or even SAS drives which are tested to a higher level of quality (each disk costs like 500$ or more..). And those cheap SATA drives also react much more poorly to non-sequential access (like when you have multiple users). They are unusable for serious file serving. You can never compare RAID arrays that use SATA/IDE to ones that use enterprise drives like FC/SCSI/etc, because the drives are quite different.

Then you have the other features like dual redundant everything: controllers, power supplies, etc. Then you have thermal capabilities of rack-mount solutions that often are different from SATA, etc, etc.

Re:Its just not the same thing. by ZorinLynx · 2007-05-30 00:25 · Score: 5, Informative

These overpriced drives aren't all that much different from SATA drives. They're a bit faster, but a HELL of a lot more expensive, and not worth paying more than double per gig.

We have a Sun X4500 which uses 48 500GB SATA drives and ZFS to produce about 20TB of redundant storage. The performance we have seen from this machine is amazing. We're talking hundreds of gigabytes per second and no noticeable stalling on concurrent accesses.

Google has found that SATA drives don't fail noticeably more often than SAS/SCSI drives, but even if they did, having several hot spares means it doesn't matter that much.

SATA is a great disk standard. You get a lot more bang for your buck overall.
Re:Its just not the same thing. by Firethorn · 2007-05-30 00:29 · Score: 2, Insightful

Sure, a good $20k NAS RAID does more. Question is, is it really needed?

I mean, you could deploy 10 times as many of these as you could your array, giving me 10 times the storage.

Depending on what you do, it has to hit the network sometime. For example - we have a big expensive SAN solution. What's it used for? As a massive shared drive.

For a fraction of the price we could have put a similarly sized one of these in every organization. Sure, it wouldn't be able to serve quite as much data, but most of our stuff isn't accessed extremely often anyways.

I've noted before that in many cases it'd be cheaper for us to backup to IDE hard drives than to tape. The tape alone costs more per megabyte than a HD, and has slower transfer rates to boot. Our backup solution could be several of these, connected by fiber to another building(so we don't have to move them).

--
I don't read AC A human right
Re:Its just not the same thing. by tgatliff · 2007-05-30 00:32 · Score: 5, Informative

It is not my intention to offend, but I always love it when I hear the dreaded marketing phrase of hardware "tested to a higher level of quality".

I work in the world of hardware manufacturing, and I can tell you that this "magical" more testing process simply does not exist. Hardware failures are always expensive, and we do anything we can to prevent them. To do this, we build burn in procedures based on what most call the 90% rule, but you really cannot guarantee more reliability beyond that. Better device design at that point is what will determine reliability beyond that point. Any person who says differently either does not completely understand individual test harness processes or does not understand how burn in procedures work.

In short, more money is not necessarily better. More volume designs typically are, though...
Re:Its just not the same thing. by llZENll · 2007-05-30 00:37 · Score: 1

I guess you missed the Google disk report stating that expensive fiber drives have the same reliability as SATA and IDE drives. The only benefit of a fiber drive in reality is they tend to have a higher RPM which translates to more IOPS, and even that can be had with the SATA Raptor. There is little reason to use fiber other than wasting your IT department's money and falsely inflating someone's ego.
Re:Its just not the same thing. by Anonymous Coward · 2007-05-30 01:02 · Score: 0

> which are tested to a higher level of quality

Oh gawd!

Can I interest you in a bridge? It's a famous historical landmark I'm selling on behalf of a wealthy client and I'm prepared to cut you a great deal.
Re:Its just not the same thing. by Lumpy · 2007-05-30 01:05 · Score: 2, Insightful

Show me how you build a Raid 50 of 32 sata or ide drives.

also show me a SINGLE sata or ide drive that can touch the data io rates of a u320 scsi drive with 15K spindle speeds.

Low end consumer drives can't do the high end stuff. Don't even try to convince anyone of this. Guess what, those uses are not anywhere near strange for big companies. With a giant SQL db you want... no, you NEED the fastest drives you can get your hands on, and that is SCSI or Fibre Channel.

--
Do not look at laser with remaining good eye.
Re:Its just not the same thing. by terminal.dk · 2007-05-30 01:14 · Score: 1, Insightful

You did not read the reports out recently on drive reliability.

SCSI vs IDE is no issue. Newer technology drives are better than older drives is a bigger factor. So SATA lasts as long as SCSI. Performance wise for sequential access, a 500GB 7200RPM disk beats a 10000 RPM 72GB or 144 GB SCSI any day. For my laptop, my 160GB 5400 RPM disk is faster than the 7200 RPM 100GB disk.

The advantage of the SCSI disk is native command queuing. So that it can stop on the way from sector A to C and read sector B. But, this is also implemented in SATA-300 drives, so this advantage is also gone from SCSI.

I have been a big SCSI fanboy myself, but the magic has gone. SATA has flown past SCSI.
Re:Its just not the same thing. by RedHat+Rocky · 2007-05-30 01:26 · Score: 2, Insightful

Comparing drive to drive, I agree with you; 15k wins.

However, the price point on 15k drives is such a comparison for a single drive vs multiple drives is reasonable. The basis is $$/GB, not drive A vs drive B.

Ask Google how they get their throughput on their terabyte datasets. Hint: it's not due to 15k drives.

--
Anything is possible given time and money.
Re:Its just not the same thing. by drsmithy · 2007-05-30 01:32 · Score: 2, Interesting

Show me how you build a Raid 50 of 32 sata or ide drives.
Get yourself a nice big rackmount case and some 8 or 16 port SATA controllers.
also show me a SINGLE sata or ide drive that can touch the data io rates of a u320 scsi drive with 15K spindle speeds.
Of course, the number of single drives you can buy for the cost of that single 15k drive will likely make a reasonable showing...
Low end consumer drives can't do the high end stuff. Don't even try to convince anyone of this. Guess what, those uses are not anywhere near strange for big companies. With a giant SQL db you want... no, you NEED the fastest drives you can get your hands on, and that is SCSI or Fibre Channel.
This is technically true, however, "low end consumer" drives can certainly beat the performance of what *used* to be "high end" until relatively recently, and are quite adequate for significant proportions of the market.
Individually, 7.2k SATA drives are quite a bit slower than 15k SAS/FC drives. But get a dozen or two of those SATA spindles in an array, and you've got some (relative to cost) serious performance.
Re:Its just not the same thing. by Draconian · 2007-05-30 01:34 · Score: 1

48 disks and hundreds of GB/s? That leads to over 2 GB/s per disk. A good disk gives about 100MB/s sustained, or less. Come to think of it, memory speeds are rarely that fast; I think only the fastest graphics cards come to 100GB/s.
Re:Its just not the same thing. by NSIM · 2007-05-30 01:35 · Score: 2, Informative

We're talking hundreds of gigabytes per second and no noticeable stalling on concurrent accesses.

In which case you're talking complete rubbish. "Hundreds of gigabytes per second"? Just one GB/sec would need 4x2Gbit FC links all exceeding their peak theoretical throughput :-) Hundreds of MB/sec I can believe (just about, assuming the right access patterns).
Re:Its just not the same thing. by Anonymous Coward · 2007-05-30 01:35 · Score: 0

No, you don't "need" the expensive drive, Google are clear proof of that. You buy the expensive drives because it's cheaper to do that when storage isn't critical to your core business.

The writing is on the wall, SATA will keep improving and home storage will be driving the market (until flash catches up).
Re:Its just not the same thing. by silas_moeckel · 2007-05-30 01:42 · Score: 1

I think you mean to say you need 15k SAS drives. If you really want speed, though, look at SSD drives; as long as your dataset fits on them, few things are faster. SCSI is going the way of the dodo, replaced by SAS; really it's just the same protocol on a different interface. Fibre Channel is great for interconnecting devices, not as nice as a RAID controller to disk interface, but it's the current speed king for common interfaces at 4Gb.

--
No sir I dont like it.
Re:Its just not the same thing. by Anonymous Coward · 2007-05-30 01:42 · Score: 0

You can't be talking hundreds of gigabytes per second for an x4500; it's just impossible. Perhaps you meant to say hundreds of megabytes per second. Even if each one of the 48 drives in the x4500 was capable of providing a continuous 100 megabytes/second of throughput (which they can't), that multiplied by 48 drives is only 4800 megabytes, or 4.8 gigabytes/second of throughput. Real world benchmarks have shown the x4500 to be able to do a sustained 2.1GB/sec for certain workloads. Now when you throw an iSCSI stack (the x4500 does not have iSCSI TOE, so the Opterons are going to be busy doing the work) on top of the OS (the OpenSolaris iSCSI target works well on the x4500), there is just no way, even if the drives could keep up, that the 4 CPU cores in the x4500 could do hundreds of gigabytes per second. Then there is the fact that there isn't enough bandwidth in the x4500 CPU/memory/IO busses to do that kind of throughput either. Not to rant, but I think you meant hundreds of megabytes.
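
The back-of-the-envelope bound above can be written out as a quick sketch. It assumes decimal units (1 GB = 1000 MB) and the generous 100 MB/s per drive figure; the function names are hypothetical, and 100 GB/s is taken as the low end of "hundreds of gigabytes per second".

```python
def max_aggregate_mb_s(drives: int, mb_s_per_drive: float) -> float:
    """Upper bound on array throughput if every drive streams at full rate."""
    return drives * mb_s_per_drive


def required_mb_s_per_drive(target_gb_s: float, drives: int) -> float:
    """Per-drive rate needed to reach a target aggregate (1 GB = 1000 MB)."""
    return target_gb_s * 1000 / drives


upper_bound = max_aggregate_mb_s(48, 100)       # MB/s for 48 drives at 100 MB/s
needed = required_mb_s_per_drive(100, 48)       # MB/s per drive for 100 GB/s total

print(f"Upper bound: {upper_bound / 1000:.1f} GB/s")
print(f"Needed per drive for 100 GB/s: {needed:.0f} MB/s")
```

Even the optimistic bound is well over an order of magnitude short of the claimed figure.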
Re:Its just not the same thing. by Anonymous Coward · 2007-05-30 01:43 · Score: 0

I agree, but it's not necessarily the drives that are the advantage. The large corporate SAN's advantages are that, in addition to the redundancy (multiple controllers, power, battery backup, etc.), they offload the I/O to the array. These units have their own controllers and memory, so that for the server, writing to disk is often almost as quick as writing to memory, since the data goes into the memory on the SAN and the actual write to disk is handled by the hardware. Reading from disk gets the same advantage if the data is in the cache on the SAN.
Re:Its just not the same thing. by dfghjk · 2007-05-30 01:43 · Score: 1

"Show me how you build a Raid 50 of 32 sata or ide drives."

Is that a trick question? Since when does a specific RAID configuration matter anyway?

"also show me a SINGLE sata or ide drive that can touch the data io rates of a u320 scsi drive with 15K spindle speeds."

Yep, the fastest single spindles are generally not offered in IDE. Of course, that's what RAID is for...to replace a single, fast, expensive disk with multiple, less expensive ones.

"Low end consumer drives can't do the high end stuff. Don't even try to convince anyone of this."

Of course they can. You've been sold a bill of goods from the drive manufacturers.

"With a giant SQL db you want... no, you NEED the fastest drives you can get your hands on, and that is SCSI or Fibre Channel."

The fastest drives are determined by the HDA, not the interface, and drive manufacturers insist on using alternative interfaces for their server drives in order to charge substantially higher margins. The attitude that somehow SCSI is better than IDE is what enables them to continue this charade and is responsible for the creation of SAS, a SCSI standard that runs an alternate software stack over the SATA physical layer. There can be no more substantial proof that SCSI is no better for disks than SAS/SATA itself. The best disks use SAS and the controllers can support either interface because, in reality, SAS and SATA are physically the same.
Re:Its just not the same thing. by ZorinLynx · 2007-05-30 01:52 · Score: 1

Yep, I meant hundreds of megabytes/sec. I had just woken up and was typing in a sleepy haze still. :)
Re:Its just not the same thing. by Spirilis · 2007-05-30 01:55 · Score: 1

"but a HELL of a lot more expensive, and not worth paying more than double per gig."

If capacity is what you're after, then you are right--bulky 7200 rpm drives are the way to go, whether it's SATA or low-cost FC or whatnot.
If you need "access density"--a blanket term referring to the available random IOPS per unit of disk space--then you need to shell out the bucks for 10+K RPM disks, eg SCSI/SAS/FC (which also provide features like Tagged Command Queueing to improve IOPS). It's great that you have 20TB of storage, but what if you have dozens of database servers concurrently slamming different parts of that 20TB with wildly random I/O?

A sensible storage design will consider the priorities of access density vs. storage capacity, and integrate cost into the whole equation.

On another note, though, there are the WD Raptor drives-- 10K RPM SATA-II disks which utilize SATA-II's NCQ (similar to TCQ). That is a nice option, shame they're the only player in town for 10K RPM SATA...

--
the real at&t mix
Re:Its just not the same thing. by ElecCham · 2007-05-30 02:06 · Score: 1

Disclaimer: I work for a company that sells disks, sounds like "fee rate" :) That said, there are some noticeable differences between SAS and SCSI. What Google was looking at in that study, IIRC, was hard failures (read, "dead drive") - not simply marginal sectors and the like. One big difference, off-hand, is that SAS/FC disks are usually written at a lower areal density ("bits per inch") than a SATA drive. Consumer-grade drives are all about the capacity, so the design team will push the areal density as high as is still manufacturable. For a SAS or FC drive, reliability is the key... so doing things like writing at a lower bit density, adding extra ECC, and such are okay. Advertised error rates on a desktop drive are usually on the order of 1 bit in 10^12; the latest SAS drive I saw a press release on advertises less than 1 in 10^16 bits. There's more, but those are the big ones I can think of off-hand.
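
To put those advertised error rates in perspective, here is a minimal sketch that estimates how many bit errors to expect when reading an entire drive once. It reads the quoted rates as one error per 10^12 bits and one per 10^16 bits, and it borrows the 500GB drive size used elsewhere in this thread; both are assumptions for illustration, as is the function name.

```python
def expected_errors_full_read(capacity_gb: int, bits_per_error: int) -> float:
    """Expected bit errors when reading the whole drive once (1 GB = 10**9 bytes)."""
    total_bits = capacity_gb * 10**9 * 8
    return total_bits / bits_per_error


desktop = expected_errors_full_read(500, 10**12)      # desktop-class rate as quoted
enterprise = expected_errors_full_read(500, 10**16)   # SAS-class rate as quoted

print(f"Desktop rate:    {desktop} expected errors per full read")
print(f"Enterprise rate: {enterprise} expected errors per full read")
```

The absolute numbers depend entirely on the quoted rates; the point is the four orders of magnitude between them.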

--
Sig broken, watch for .finger
Re:Its just not the same thing. by StarfishOne · 2007-05-30 02:09 · Score: 2, Informative

Direct linkage for those who are interested:

Google Releases Paper on Disk Reliability:
http://hardware.slashdot.org/article.pl?sid=07/02/18/0420247
Re:Its just not the same thing. by GiMP · 2007-05-30 02:17 · Score: 1

Those expensive drive arrays support SAS not only for pure SAS drives, but in order to support SATA devices. SAS controllers and backplanes are compatible with SATA drives.
Re:Its just not the same thing. by SanityInAnarchy · 2007-05-30 02:27 · Score: 1

It's great that you have 20TB of storage, but what if you have dozens of database servers concurrently slamming different parts of that 20TB with wildly random I/O?

I would think that in that situation, more disks would be better than fewer, individually faster disks. Kind of like how if you're doing much concurrently, a dual-core machine with 2ghz/core is faster than one 3ghz machine (even if it was all about ghz).

That said, I'm no expert, and I haven't done the benchmarks. My biggest array is half a terabyte.

--
Don't thank God, thank a doctor!
Re:Its just not the same thing. by ortholattice · 2007-05-30 02:31 · Score: 1

Google has found that SATA drives don't fail noticeably more often than SAS/SCSI drives, but even if they did, having several hot spares means it doesn't matter that much.
While I'm not saying that SATA drives can't, in principle, be adapted for redundancy and reliability, I don't think Google is a good model for NAS/SAN solutions.
Google is in a different category from something like a corporate financial database or workstation backups, and I don't think it makes sense to compare them. Google's information is loosey-goosey and approximate. If some of its servers go down, no big deal, the others will catch up, and with sufficient redundancy it's likely no one will notice. It would probably be indistinguishable from the inconsistency it has already, where two searches for the same thing often end up with different results depending on which servers it happens to hit and how up-to-date they are. (I find this latter behavior annoying, sometimes missing the search results I want that I'll find on a second try, but that's for another discussion.)
This just won't do for, say, customer financial records or a medical database. You can't have some of the information missing, inconsistent, or not up-to-date. I would bet that Google itself does not use its redundant "cheap server" farm setup for its corporate and customer financial records.
Re:Its just not the same thing. by Anonymous Coward · 2007-05-30 02:43 · Score: 0

Show me how you build a Raid 50 of 32 sata or ide drives.

Well, the Dell MD1000 is a 15-disk SATA/SAS array that can do that. It has redundant power & redundant controllers. It can be combined with 2 additional MD1000 arrays for a 45-disk system. Not bad.

also show me a SINGLE sata or ide drive that can touch the data io rates of a u320 scsi drive with 15K spindle speeds.

You are correct, u320 will be faster, but not much faster. And if you short-stroke the big sata disks (only use the fastest part of the disk, the outside tracks), you can get really big speeds, very close to 15k scsi disks.

witha giant SQL db you want... no you NEED the fastest drives you can get your hands on and that is SCSI or Fiberchannel.

In the real world, companies have budgets. And the budget doesn't allow for what you want.
Re:Its just not the same thing. by Anonymous Coward · 2007-05-30 02:44 · Score: 0

Sure, if you have an army of engineers, more bandwidth than god, and a small army of people pulling dead computers and drives, oh yeah, and no real uptime requirements.
Re:Its just not the same thing. by sootman · 2007-05-30 02:49 · Score: 1

A good 20k$ RAID array does much more. First, it doesn't use cheap SATA drives...

The 'I' in RAID stands for Inexpensive. Why pay 3x more per disc to go from 99.5 to 99.9% reliable? The whole point of RAID is to get them cheap and have bunches of them. I'd rather have a whole second server than just one server with an array of unobtanium disks. Of course, space, heat, and power may become issues, but for a lot of people, simpler-cheaper-more is a good way to go.
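
The "cheap and lots of them" argument can be sketched with a little probability. The example below takes the 99.5% and 99.9% figures above as the chance a single disk survives some fixed period, and assumes disk failures are independent (real failures often are not, since drives share power, heat, and batch defects). The names are made up for illustration.

```python
def mirror_loss_probability(disk_reliability: float, copies: int) -> float:
    """Chance that every copy fails in the period, assuming independent failures."""
    return (1.0 - disk_reliability) ** copies


single_premium = mirror_loss_probability(0.999, 1)  # one expensive disk
cheap_mirror = mirror_loss_probability(0.995, 2)    # two cheap disks, mirrored

print(f"One 99.9% disk:        {single_premium:.6f} chance of data loss")
print(f"Two mirrored 99.5% disks: {cheap_mirror:.6f} chance of data loss")
```

Under those assumptions, two of the cheaper disks lose data far less often than one of the pricier ones, which is the whole idea behind the "I".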

And let's not forget that EMC etc are not perfect magic bullets by any means. We've gone through 3 SAN vendors in the last 5 years. A recent quote to add 0.5 TB to 2 or 3 servers came in at $2-3000 per server. For that kind of money, I'd rather just give each department a couple Mac Pros or XServes. Personally, I'd rather have my eggs in several baskets, rather than trying to build the one holy storage system, forever and ever amen, which WILL eventually fail spectacularly with dramatic results. (Unless you've got buckets and buckets of money to throw at perfect reliability, like banks and airlines--but for most of us in the real world, with managers trying to cut corners here and there, I'd rather spend $10,000 on two $5,000 boxes than one $10,000, supposedly better box.)

--
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
Re:Its just not the same thing. by Spirilis · 2007-05-30 02:52 · Score: 1

You wouldn't use 'fewer' disks that are faster, you'd use more disks since the faster disks would be smaller than the SATA disks. They don't make 10,000 RPM 500GB disks yet. They do, however, make 300GB disks at 10,000 RPM.

A 48-disk SATA-II (500GB) solution may give you 20TB. A similar amount of capacity (20TB) using 300GB 10K-rpm FC disks would require around 80 disks. The cost would obviously be much higher--each disk is much pricier than the SATA-II disks, on top of the fact that you need 66% more spindles. But the access density would be around 2.5-3x higher for the 300GB FC solution compared to the SATA solution. So for the same amount of capacity, you can slam it ~2-3x harder.

(the math, using some 'generous' numbers for the IOPS: assuming a 500GB SATA-II 7200rpm disk can do ~75 IOPS, a 300GB FC 10K RPM drive can do ~120 IOPS... 48*75 = 3600 IOPS in total for the SATA solution, 80*120 = 9600 IOPS in total for the FC solution)
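
The same math as a small sketch, for anyone who wants to plug in their own numbers. The per-disk IOPS figures are the 'generous' assumptions from above, the 20TB capacity is the round number used for both arrays, and the function names are hypothetical.

```python
def array_iops(spindles: int, iops_per_spindle: int) -> int:
    """Aggregate random IOPS, assuming load spreads evenly over all spindles."""
    return spindles * iops_per_spindle


def access_density(total_iops: int, capacity_tb: float) -> float:
    """Random IOPS available per TB of capacity."""
    return total_iops / capacity_tb


sata_iops = array_iops(48, 75)    # 500GB 7200rpm SATA-II
fc_iops = array_iops(80, 120)     # 300GB 10K RPM FC
ratio = fc_iops / sata_iops

print(f"SATA: {sata_iops} IOPS, {access_density(sata_iops, 20):.0f} IOPS/TB")
print(f"FC:   {fc_iops} IOPS, {access_density(fc_iops, 20):.0f} IOPS/TB")
print(f"FC / SATA ratio: {ratio:.2f}x")
```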

Now, is it worth it? Depends on the application. For lots of databases using the same storage concurrently, you bet it'll be worth it--might even want to step things up to the 146GB 15K disks, requiring double the spindles of the 300GB FC solution (160 spindles) but possibly 8x the throughput of the SATA solution.

For backups, or streaming media... the SATA solution may very well be good enough.

--
the real at&t mix
Re:Its just not the same thing. by Anonymous Coward · 2007-05-30 02:55 · Score: 0

I've noted before that in many cases it'd be cheaper for us to backup to IDE hard drives than to tape. The tape alone costs more per megabyte than a HD, and has slower transfer rates to boot.

You need to get a modern LTO tape drive. The transfer rates are awesome - faster than most hard disks - 60 megabytes/sec is not unusual. $40 will buy me 400 GB of tape - IDE hard disks aren't that cheap yet.
Re:Its just not the same thing. by biftek · 2007-05-30 02:58 · Score: 1

"Hundreds of gigabytes per second" sounds like it's either from RAM or you're talking out of your arse.

Say a standard SATA disk might do 60meg/sec (could be a bit more, but I reckon that's probably ballpark), then 48*60 = 2.8G/sec. So around 2 orders of magnitude less.
Re:Its just not the same thing. by psbrogna · 2007-05-30 03:20 · Score: 1

I also used to believe that SATA wasn't appropriate for the applications being discussed in this post for the reasons you cite. However, after having a couple of Apple Xservers in place for 3+ years, I can state from experience that the large SATA drives do just fine in a multiuser environment rife with random access.

After having a good experience with the 250GBx3 in two Xservers, I deployed a DIY Box-O-Drives with an Opteron mb; 750GBx4, including an 8ch SATA2 (3Gb/sec) controller (grand total of $3,500). The experience since the deployment 6 mo. ago has been consistent with that of the Xservers: terrific. The DIY box provides a storage buffer which enabled us to cut our to-tape backup time by a third.

While MTBF is still open for discussion, I can say that 6 250 GB drives in the two Xservers haven't stopped spinning in almost 4 years and drives in the DIY box have the highest MTBF of any drive I've seen (at 6 mo. it's too early to confirm in earnest).

(Additional details: Xservers & DIY box run Opensuse w/reiser & user/server environment is mixed windows/unix.)
Re:Its just not the same thing. by cayenne8 · 2007-05-30 03:22 · Score: 1

"Get yourself a nice big rackmount case and some 8 or 16 port SATA controllers. "
Do you know of a website or HOWTO out there that can explain how to build your own fileserver with SATA drives like you describe?

--
Light travels faster than sound. This is why some people appear bright until you hear them speak.........
Re:Its just not the same thing. by ocbwilg · 2007-05-30 03:29 · Score: 1

These overpriced drives aren't all that much different from SATA drives. They're a bit faster, but a HELL of a lot more expensive, and not worth paying more than double per gig.

That's not necessarily the right way to look at it. Enterprise class drives (those with 10k and 15k spindle speeds) are much faster than consumer class drives (with 7.2k spindle speeds). And if what you are after is speed rather than price, then you're better off with enterprise-class drives. Which means that paying more than double per gigabyte is worth it. If all you need is a ton of cheap disk space, and performance isn't much of a concern, then you use the consumer-class drives.

For example, if I have a database server hosting multiple DBs, drive performance is probably more important than drive capacity. But let's say that I need 3TB of disk space. I have two options, a consumer-level 7.2k RPM SATA solution, and an enterprise grade 15k RPM SCSI or FC solution. Knowing that performance is key (and we are talking about a database here), I know that I need RAID 10.

The SATA solution works out to being a single shelf enclosure filled with 15 500 GB drives, giving me 3.5TB of RAID 10 storage across 14 7.2k RPM spindles, and one hot spare. The SCSI or FC solution works out to being a three-shelf enclosure filled with 45 146 GB drives, giving me 3TB of RAID 10 storage across 42 15k RPM spindles, and 3 hot spares.

Both arrays will meet my space requirements, but only one of them is likely to meet my I/O requirements. The SCSI/FC solution will have 3 times the spindles as the SATA solution, and each spindle will be twice as fast. Think about the speed difference here, there's no way the SATA solution could compete. Cost-wise, you could probably build the SATA array for $5000, and the SCSI/FC array for $15,000-$20,000, depending on the vendor. But if you wanted a SATA array that could come close to matching the performance of the SCSI/FC array you would probably need 5-6 shelves worth of SATA drives, driving the cost up to $25,000-$30,000 range. Then there's the consideration of the added heat, power draw, and space requirements.
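
The capacity and spindle figures above can be checked with a short sketch. It assumes plain RAID 10 (half of the non-spare drives hold mirror copies) and takes "each spindle will be twice as fast" at face value; the function name and the simple speed multiplier are illustrative assumptions, not a benchmark.

```python
def raid10_usable_gb(total_drives: int, hot_spares: int, drive_gb: int) -> int:
    """Usable RAID 10 capacity: half of the non-spare drives hold mirror copies."""
    data_drives = total_drives - hot_spares
    if data_drives % 2:
        raise ValueError("RAID 10 needs an even number of active drives")
    return data_drives // 2 * drive_gb


sata_gb = raid10_usable_gb(15, 1, 500)   # one shelf of 500 GB SATA
fc_gb = raid10_usable_gb(45, 3, 146)     # three shelves of 146 GB SCSI/FC

spindle_ratio = (45 - 3) / (15 - 1)      # active spindles, FC vs SATA
speed_ratio = spindle_ratio * 2          # if each 15k spindle is ~2x a 7.2k one

print(f"SATA usable: {sata_gb} GB, SCSI/FC usable: {fc_gb} GB")
print(f"Spindle ratio: {spindle_ratio:.0f}x, rough speed ratio: {speed_ratio:.0f}x")
```

A rough 6x gap is consistent with the estimate that it would take 5-6 shelves of SATA to get close.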

Of course, you would also end up with 15+ TB of disk space in your array. That might be a selling point for some people, but if your databases only need 3 TB of space then you've got a lot of wasted space. The other 12+ TB is useless. Unless of course you want to put more data on all of that slack space, which will undoubtedly hurt your databases' performance.

SATA is a great disk standard. You get a lot more bang for your buck overall.

That's not quite right. SATA is a great disk standard because you get a lot more space for your buck overall. But if you're looking for more bang, you need something much faster. These days you can buy a single drive with 1TB of disk space. Disk capacity is no longer an issue, but in many cases speed is. Choosing the correct solution for the situation is key in storage decisions.
Re:Its just not the same thing. by Anonymous Coward · 2007-05-30 04:15 · Score: 0

Actually, it is about CPU offloading.

For IDE (and SATA to some extent), the CPU of the system sends the commands to the drives individually, taking some CPU time and tying up the devices.

In SCSI, the whole instruction gets handed off to the SCSI controller, which then handles the commands, as well, the drives each have a smaller controller which allows them to talk to each other with minimal direction from the controller.

eg: IDE disk 1 to IDE disk 2: CPU gets data from disk 1, puts it in system RAM, CPU writes data from RAM to disk 2.

SCSI to SCSI: CPU tells SCSI controller to move X data from disk 1 to disk 2, SCSI controller tells drive 1 "give this to disk 2", drive 1 transfers the data to disk 2.

Overall, there is much less CPU usage, any RAM needed would be onboard the SCSI controller, leaving your system RAM alone, and most of the proccessing is done by the drives so your SCSI controller can do other things too.
Re:Its just not the same thing. by Rakishi · 2007-05-30 04:20 · Score: 1

Google is based on a clustered infrastructure in everything as I understand it. If a system dies they put a new one in, no problems. Costs less to hire the monkey to do that and there is no uptime hit. A lot of companies don't have 200 servers and applications designed for clusters.
Re:Its just not the same thing. by drinkypoo · 2007-05-30 04:33 · Score: 1

I would think that in that situation, more disks would be better than fewer, individually faster disks.

Your limiting factors are price, space, and power consumption. More disks means more places to plug them in means more money spent; it also means higher power consumption.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Its just not the same thing. by LWATCDR · 2007-05-30 05:00 · Score: 1

"Google is in a different category from something like a corporate financial database or workstation backups, and I don't think it makes sense to compare them. Google's information is loosey-goosey and approximate. If some of its servers go down, no big deal, the others will catch up, and with sufficient redundancy it's likely no one will notice. "
Okay why?
For workstation backups I think a Google-like system sounds ideal. Having multiple backups spread across multiple servers could give you a very reliable system. For the financial data the problem would be keeping the data synchronized across the cluster. Since that could be very transaction heavy compared to, say, workstation backups, it would be rather tricky at best.
The holy grail would be a system as fault tolerant as Google's but with a low enough latency for financial transactions. Two things always "bothered" me about SANs. One, it seems like a huge single point of failure, and the second is performance. I just can't see how, for something like a database, moving the mass storage out of the server and onto a network was a good idea. Even with a dedicated link between the NAS and the server it seems like one more layer.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:Its just not the same thing. by flaming-opus · 2007-05-30 06:52 · Score: 2, Interesting

I'm sure the original writer meant hundreds of megs per second, which is a reasonably good rate for 24 sata drives. Most 3/4U raid enclosures can get this level of performance, particularly for reads. I can get about 330MB/s out of a 14 drive Apple Xraid, and almost 400M/s out of a 24 drive engenio, but that's with a lot of careful tuning.

Hundreds of gigs per second is limited to the very largest GPFS or Lustre configurations. The fastest filesystem throughput I'm aware of, anywhere in the world, is about 100 gig/second at Sandia national labs. That requires dozens of high-end raid cabinets.
Re:Its just not the same thing. by rabtech · 2007-05-30 08:47 · Score: 1

The reliability fallacy has already been proven false and you can read about that in other posts, but I wanted to chime in regarding NCQ support on SATA:

If your controller and drives both support it, SATA-II drives can do NCQ (Native Command Queuing), which buffers and reorders outstanding disk commands to maximize speed and minimize power/heat/wear. This is similar to the TCQ that SCSI drives have been doing for ages.
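
A toy illustration of why reordering helps, assuming a one-dimensional model where cost is just the distance the head travels between block addresses. This is not how any real drive's firmware works (real NCQ also accounts for rotational position), and the function names are made up for the example.

```python
def head_travel(start, order):
    """Total distance the head moves when servicing requests in this order."""
    position, total = start, 0
    for lba in order:
        total += abs(lba - position)
        position = lba
    return total


def elevator_order(start, pending):
    """Service requests ahead of the head first, then sweep back for the rest."""
    ahead = sorted(lba for lba in pending if lba >= start)
    behind = sorted((lba for lba in pending if lba < start), reverse=True)
    return ahead + behind


queue = [95, 10, 60, 20, 90]            # arrival order of queued requests
fifo_cost = head_travel(50, queue)      # no reordering
reordered = elevator_order(50, queue)
ncq_cost = head_travel(50, reordered)   # after reordering
```

For this queue the arrival order costs 290 units of travel, while the reordered sweep costs 130.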

So no, SATA drives don't react worse to server loads than SCSI drives.

--
Natural != (nontoxic || beneficial)
Re:Its just not the same thing. by blueskies · 2007-05-30 09:11 · Score: 1

Try: http://pcsforeveryone.com/ They sell 3U storage servers with space for 12 drives, I think.
Re:Its just not the same thing. by dreddnott · 2007-05-30 11:29 · Score: 1

What are you on? Spindle speed is far more important for sequential transfer rates than areal density. It's also the biggest factor for seek time, which can impact a server with lots of concurrent I/O significantly more than raw transfer rate.

500GB/7200RPM SATA drives have a minimum sequential transfer rate of around 35-45 megabytes per second, with the maximum possible somewhere around 75 megabytes per second. 73-147GB 10,000RPM SCSI drives have comparable minimum transfer rate performance, but with maximum rates of 80 to nearly 100 megabytes per second. For new 15,000 RPM SCSI drives, the *minimum* sequential transfer rate is 70-80 megabytes per second, topping off at between just under 100 and 135 megabytes per second at the edge of the platters. You can't buy 15,000 RPM SATA or IDE drives - that kind of performance is only available in SCSI, SAS, and FC drives.
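
The part of access time that depends purely on spindle speed is rotational latency: on average the platter has to turn half a revolution before the wanted sector comes around. A quick sketch of that arithmetic (the function name is just for illustration; head seek time is separate and not modelled here):

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the target sector: half of one revolution, in ms."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    ms_per_revolution = 60_000 / rpm
    return 0.5 * ms_per_revolution


for rpm in (7200, 10_000, 15_000):
    print(rpm, round(avg_rotational_latency_ms(rpm), 2))
```

That works out to roughly 4.17 ms at 7200 RPM, 3.0 ms at 10,000 RPM and 2.0 ms at 15,000 RPM.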

--
I may make you feel, but I can't make you think.
Re:Its just not the same thing. by jsoderba · 2007-05-31 03:05 · Score: 1

That's redundancy, not backups. A backup protects not only against hardware error, but also against software and user error. If your app (or admin) goes crazy, having the same erroneous data saved on multiple machines won't do you any good.
Re:Its just not the same thing. by LWATCDR · 2007-05-31 04:53 · Score: 1

I was assuming that such a backup system would have journaling so that you could restore to any point in time. Otherwise you are correct that it isn't a real backup.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:Its just not the same thing. by ortholattice · 2007-05-31 05:01 · Score: 1

For workstation backups I think a Google-like system sounds ideal.
I disagree - it would be horrible and completely useless for that purpose. For restoring a workstation backup, you want to know that your data is consistent, complete, and up-to-date. With a Google-like system you'd get random snapshots of the data stored at different times, and if you try again you'll get a different set of random snapshots. Good luck getting pieces of different versions of your app to play together. Google doesn't have a good handle on being able to make their searches repeatable and reliable (in the sense of consistency). I don't know if it is a practical problem due to the amount of data and users they serve, but their algorithm is unsuitable for workstation backups.
I doubt very seriously that Google uses its search-engine setup for their internal workstation backups.
Re:Its just not the same thing. by LWATCDR · 2007-05-31 05:30 · Score: 1

"I doubt very seriously that Google uses its search-engine setup for their internal workstation backups."
I would agree. I was speaking more about the hardware setup of multiple redundant data stores than their search software.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.

But what drives do you use? by Puchku · 2007-05-30 00:14 · Score: 1

Do you use consumer level drives, or enterprise level drives? You have not specified that. The cost varies.

Re:But what drives do you use? by bemenaker · 2007-05-30 00:31 · Score: 1

Use consumer grade drives. Google's drive failure report shows basically no difference between them.
Re:But what drives do you use? by Sancho · 2007-05-30 00:36 · Score: 1

Had to be consumer level, since the drive cage he wrote about is $1300 by itself.
Re:But what drives do you use? by finkployd · 2007-05-30 02:41 · Score: 1

The cost varies.

Too bad the reliability doesn't

Finkployd

HW Raid vs ZFS by packetmon · 2007-05-30 00:15 · Score: 4, Informative

http://milek.blogspot.com/2007/04/hw-raid-vs-zfs-software-raid-part-iii.html

--
Infiltrated dot Net

123 Incorporate ! by bytesex · 2007-05-30 00:19 · Score: 1

Well I must say it's true: a company is born on /. every minute.

--
Religion is what happens when nature strikes and groupthink goes wrong.

Re:123 Incorporate ! by Anonymous Coward · 2007-05-30 00:33 · Score: 0

Reminds me of hearing this yesterday.
Re:123 Incorporate ! by aminorex · 2007-06-01 03:38 · Score: 1

> If a bear...

You misspelled "the pope".

--
-I like my women like I like my tea: green-

No by iamredjazz · 2007-05-30 00:19 · Score: 5, Informative

Speaking from personal experience: this file system is far from ready. It can kernel panic and reboot after minor I/O errors; we were hosed by it, and probably won't ever revisit it. This phenomenon can be reproduced with a USB device; you might want to try it before you hype it. Try a Google search on it and see what you think... there is no fsck or repair; once it's hosed, it's hosed, and the recovery is to go to tape. http://www.google.com/search?hl=en&q=zfs+io+error+kernel+panic&btnG=Google+Search

Re:No by Anonymous Coward · 2007-05-30 00:32 · Score: 0

Plus, how long do you think an fsck would take on a really large ZFS?

I don't fully understand the lust people have for a filesystem. Sure, 128-bit is bigger than 64-bit, but in all but the most extreme situations even 64-bit is practically future proof.
I don't know, I've been doing this a really long time and only a couple of times have I ever noticed a great difference changing filesystems. Seems like you shouldn't even notice any more, it's just a slightly different configuration process.
Re:No by darrylo · 2007-05-30 02:45 · Score: 2, Informative

It's generally not about the 64- vs 128- vs whatever.
It's about the additional reliability (current bugs aside), and the ease of filesystem/pool management. For example, a Sun developer was developing on a workstation with bad hardware, which occasionally caused incorrect data to be written to disk. After setting up raidz, ZFS automatically detected and corrected the error: http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta
Scary, yes. Doing that definitely isn't something I'd recommend, but it does show one of the powerful features of ZFS.
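
The general idea behind that detect-and-correct behaviour can be sketched in a few lines: keep a checksum apart from the data, verify every copy on read, and rewrite any copy that doesn't match from one that does. This is only an illustration under simplifying assumptions (a plain two-way mirror of in-memory blocks, SHA-256, hypothetical function names); it is not ZFS's actual code or on-disk layout, and the linked post used raidz rather than a mirror.

```python
import hashlib


def checksum(data):
    """Checksum kept separately from the data block it describes."""
    return hashlib.sha256(data).hexdigest()


def read_with_self_heal(copies, expected):
    """Return verified data and repair bad copies in place.

    copies is a list of bytearray blocks that should all hold the same data.
    """
    good = None
    for block in copies:
        if checksum(bytes(block)) == expected:
            good = bytes(block)
            break
    if good is None:
        raise IOError("no copy matches the stored checksum")
    # Overwrite any copy that silently went bad with the verified data.
    for index, block in enumerate(copies):
        if bytes(block) != good:
            copies[index] = bytearray(good)
    return good
```

A plain mirror without checksums can tell that two copies differ, but not which one is right; the separately stored checksum is what breaks the tie.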
Re:No by flaming-opus · 2007-05-30 02:53 · Score: 1

It could replace low end NAS, but what the poster is describing basically IS low end NAS. An AX150i is a cheap box, crammed full of drives, running an iSCSI target and an NFS/SMB server. It's more expensive than a beige box, but not by a lot (after you account for the real cost of the drives and redundant gigE, etc.). Somebody already had this idea, it was EMC (or NetApp, CommVault, etc.), and they built it, marketed it, and are selling it to you with a 32% gross margin. I'll bet that 32% is probably worth it.

How much is your time worth? How much testing do you have to do before your manager is comfortable putting this thing in production, and how much time do you spend pushing security updates onto your new Solaris box? Adminning a NAS box is not without some effort, but it's generally fairly little. What do you do when you want another one? What if you want a bigger one? What if you leave the company, and the guy who replaces you doesn't know much about a roll-your-own NAS box?

This seems like a great solution for a dorm room, or maybe a college research lab.
Re:No by saleenS281 · 2007-05-30 03:11 · Score: 1

It's always good to post 6 month old bugs that have been fixed as FUD for why you shouldn't use something.
Re:No by iamredjazz · 2007-05-30 04:15 · Score: 1

It's also good to know the quality you can expect from a company's production releases: ZFS was not ready for the real world. The I/O error hardening (bug id #6386910) looks like it may have been fixed in March of 2007, but Sun released this FS to the world with much hype in May of 2006. That's somewhere around 9 months for something flaunted and mercilessly promoted as the world's most advanced file system.

Maybe it's great now, maybe it's not; I can only say that based on what we've seen, I wouldn't touch it. There are much more tried and true filesystems out there for critical data. Maybe once it has the years and the track record that some of the other filesystems have (like ext and ufs) it might be comparable in stability... but at 9 months for a bug fix, one has to wonder when it will actually even come close.
Re:No by saleenS281 · 2007-05-30 07:45 · Score: 1

Except the reason it took 9 months was because it wasn't a critical bug for any major customers... Money talks in this world. Something that is *critical* to you doesn't really mean anything to Sun if you aren't dropping serious dollars in their pocket. If you were eBay, that bug would've been fixed in a week or less... let's not pretend that Sun doesn't get the job done when they need to.
Re:No by KonoWatakushi · 2007-05-30 10:33 · Score: 2, Informative

Yes, it is easy to panic a system with ZFS, but such situations are also easily avoided. Furthermore, they will not lead to data corruption. If you are aware of the causes, it is no more than a minor inconvenience.

For example, ZFS will panic when you lose enough data or devices that the pool is no longer functional. If you take care and use a replicated pool though, this is unlikely to ever happen. Even if it does, all it requires is that you reattach the devices and reboot. If the disks truly are dead, then you are going to backups anyway. You do have backups, right?

ZFS has some rough edges yet, but to call it "far from ready" is mere FUD. Most of the problems with it are a matter of convenience, and nearly all that have been mentioned in the comments are actively being worked on. With a little bit of care, and proper backups, ZFS is rock solid. Meanwhile, it is improving every day, and if you choose not to revisit it, it is your loss.
Re:No by drsmithy · 2007-05-30 12:33 · Score: 1

It could replace low end nas, but what the poster is describing, basically IS low end nas.
Well, his DIY frankenbox certainly is, but I'd argue that a US$50k off-the-shelf solution qualifies as "mid range". For that sort of dough you should be into the territory of multiple, redundant everything, FC backend (even if it's only iSCSI on the front), significant expandability potential (50 - 100TB+) and that inescapable feeling of violation from paying $thousands for trivial software features that should be standard, like LUN masking.

That's the idea by complete+loony · 2007-05-30 00:19 · Score: 1

ZFS has a lot of potential. However the current implementation of ZFS has its limits, and you should know what they are before you commit to maintaining a server running it.

--
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.

It depends by Anonymous Coward · 2007-05-30 00:20 · Score: 0

I'd say this highly depends on your usage of the NAS/SAN. I, for example, have set up a FreeBSD NAS (Samba share) on a software RAID-3 configuration. For the most part this is mainly a storage setup (write once, read often). This coupled with my Gb network works well enough for me. It's fast enough that I'm able to burn DVDs over the network and look at files in real time (and watch videos without any lag). If you're looking to implement a high volume of reads/writes using your setup, you should look into using a hardware-based RAID configuration (RAID-Z is nice and all, but the parity is still calculated in software, which can take up more cycles than necessary). I see absolutely no point in buying a pre-made NAS/SAN, as all they've done is pre-set everything up.
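
For anyone wondering what "parity calculated in software" actually involves, here is a minimal single-parity sketch: the parity block is the byte-wise XOR of the data blocks, and any one missing block can be rebuilt by XORing everything that survived. It assumes equal-sized blocks and hypothetical function names, and it ignores how RAID-3 or RAID-Z really lay out stripes on disk.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks; used for parity and rebuild."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

Every write has to touch parity and every degraded read has to run the rebuild, which is where the CPU cycles go.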

Again, determine your situation: if it's mainly storage, then your raidz1 setup should be fine. But if you're going for very high I/O, I would recommend using a hardware-based RAID setup. In either case Gb Ethernet should be fast enough (make sure you use Cat 6 and not Cat 5e).

Reliable? by Jjeff1 · 2007-05-30 00:21 · Score: 4, Informative

Businesses buy SANs to consolidate storage, placing all their eggs in one basket. They need redundant everything, which this doesn't have. Additionally, SATA drives are not as reliable long term as SCSI. Compare the data sheets for Seagate drives, they don't even mention MTBF on the SATA sheet.
Businesses also want service and support. They want the system to phone home when a drive starts getting errors, so a tech shows up at their door with a new drive before they even notice there are problems. They want to have highly trained tech support available 24/7 and parts available within 4 hours for as long as they own the SAN.
Finally, the performance of this solution almost certainly pales compared to a real SAN. These are all things that a home grown solution doesn't offer. Saving 47K on a SAN is great, unless it breaks 3 years from now and your company is out of business for 3 days waiting for a replacement motherboard off eBay.
That being said, everything has a cost associated with it. If management is ok with saving actual money in the short term by giving up long term reliability and performance, then go for it. But by all means, get a rep from EMC or HP in so the decision makers completely understand what they're buying.

Re:Reliable? by Wdomburg · 2007-05-30 00:29 · Score: 1

Not that I think that home grown storage is necessarily a good fit for
To be fair, Seagate does list MTBF if you look at the data sheet for SATA drives actually sold for enterprise applications.

I'll agree, however, that home grown solutions are only appropriate for a limited number of applications where outlay costs are more important than reliability and support.
Re:Reliable? by Anonymous Coward · 2007-05-30 00:45 · Score: 0

They want the system to phone home when a drive starts getting errors, so a tech shows up at their door with a new drive before they even notice there are problems.
We had a SAN where I work phone home to the vendor to report that drives were failing. The support guy went out to the datacenter, went into our cage, and saw that the roof was leaking onto the SAN unit. Apparently the moisture was killing the drives off. Who would have thought that water was bad for powered-on electronics?
Re:Reliable? by Ruie · 2007-05-30 01:10 · Score: 2, Insightful
A few points:
- Reliability: if your solution costs $3K instead of $47K, just buy two or three. Nothing beats system redundancy, provided the components are decent.
- Saving money - in 3 years just buy a new system. You will need more storage anyway.
- SAS versus SATA: the only company that I know of that makes 300G 15K rpm drives is Seagate. They cost $1000 a piece. Compare to $129-$169 per Western Digital 400G enterprise drives. For multi-terabyte arrays it makes a lot of sense to get SATA disks and put the money into RAM. Lots of RAM - like 32GB or 64GB.
- The big advantage of a Linux system (or BSD if you know it better) is flexibility. Configure Samba the way you like, mount shares from other machines the way you like. Use chattr, setfacl, or whatever else. Run the database on the same machine the RAID controller is on. Use multiple Ethernet adapters.
- To people above - if bandwidth is a problem, 10Gb adapter now costs $1000. Though I doubt your RAID controller can saturate even one gigabit line under moderate seek load.
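
The SAS versus SATA point above is easy to check with the prices quoted in that list ($1000 for a 300G 15K drive, $129 to $169 for a 400G SATA drive). A small sketch; the 4000 GB target is a made-up example, and it counts raw capacity only, with no parity or hot-spare overhead.

```python
import math


def cost_per_gb(price_usd, capacity_gb):
    """Dollars per gigabyte for a single drive."""
    return price_usd / capacity_gb


def array_cost(target_gb, capacity_gb, price_usd):
    """Cost of enough whole drives to reach the target raw capacity."""
    drives = math.ceil(target_gb / capacity_gb)
    return drives * price_usd


print(round(cost_per_gb(1000, 300), 2))   # 15K SAS, price from the post
print(round(cost_per_gb(169, 400), 2))    # SATA, upper price from the post
print(array_cost(4000, 300, 1000), array_cost(4000, 400, 169))
```

On those numbers the SAS array needs 14 drives ($14,000) against 10 SATA drives ($1,690 at the higher price), which leaves a lot of budget for RAM or a second box.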
Re:Reliable? by LWATCDR · 2007-05-30 01:55 · Score: 2, Insightful

Again it depends.
At $3000 for everything it would be logical to stock a spare motherboard, power supply, NIC, controller cards, and a few hard drives plus a few hot spares.
With SMART and other sensor packages it wouldn't be hard to build in monitoring.
It could be done, but it would be a project. So yes, it could be done, but for a business buying a SAN will be more logical if for no other reason than time. It would take time to set up a system. It could be cheaper and have just as little downtime as a commercial SAN, but it could also be a huge time sink. For a small tech company it might be a great solution and a potential product. For a HUGE tech company it might be a good solution and a potential product. For anybody else it is a risk probably not worth taking.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:Reliable? by hoggoth · 2007-05-30 01:59 · Score: 1

> Saving 47K on a SAN is great, unless it breaks 3 years from now and your company is out of business 3 days waiting for a replacement motherboard off Ebay

Well that just doesn't make sense. If I am saving $47K I can certainly buy a couple of spare sets of the entire $1200 system. If the motherboard croaks I just move the drives to a spare system. 30 minutes, not 3 days.

--
- For the complete works of Shakespeare: cat /dev/random (may take some time)
Re:Reliable? by Znork · 2007-05-30 02:14 · Score: 4, Informative

"Additionally, SATA drives are not as reliable long term as SCSI."

The CMU study by Bianca Schroeder and Garth A. Gibson would suggest otherwise. In fact, there was no significant difference in replacement rates between SCSI, FC, or SATA drives.

"They want the system to phone home when a drive starts getting errors"

Of course, the other recent study by Google showed that predictive health checks may be of limited value as an indicator of impending catastrophic disk failure.

Basically, empirical research has shown that the SAN storage vendors are screwing their customers every day of the week.

"Saving 47K on a SAN is great, unless it breaks 3 years from now"

Of course, saving 47K on a SAN means you can easily triple your redundancy, still save money, and when it breaks, you have two more copies.

At the same time, the guy spending the extra 47k on an 'Enterprise Class Ultra Reliable SAN' will get the same breakage 3 years from now, he won't have been able to afford all those redundant copies, and as he examines his contract with the SAN vendor, he notes that they actually don't promise anything.

"But by all means, get a rep from EMC or HP in so the decision makers completely understand what they're buying."

Premium grade bovine manure with (fake) gold flakes?

Really, handing the decision makers several scientific papers and a few google search strings would leave them much better equipped to make a rational decision.

Having several years' experience with the kind of systems you're talking about, I can just say I've experienced several situations where, if we didn't have system-level redundancy, we would have suffered not only system downtime but actual data loss on expensive 'enterprise grade' SANs. That experience, as well as the research, has left me somewhat sceptical towards the claims of SAN vendors.
Re:Reliable? by Anonymous Coward · 2007-05-30 02:35 · Score: 0

. Additionally, SATA drives are not as reliable long term as SCSI.

I think that's a myth. The mechanisms are exactly the same; they come off the same product lines. The only differences historically have been the controllers (or rather the controller microcode), possibly some kind of QA process that is done to SCSI, perhaps some kind of CPU-like rate validation that may be done to SCSI to verify that they can spin at 15k RPM (at a substantially reduced lifetime, I might add) and the pricing. The actual drive itself is exactly the same. All the moving parts are identical. The platters are identical.

Now SCSI usage patterns might be substantially different, such that it affects product life.

That's why SAS is where SCSI is going, they are exactly the same in SAS, it's just a pricing and branding exercise if they choose to make it one.
Re:Reliable? by darrylo · 2007-05-30 03:00 · Score: 1

Of course, the other recent study by Google showed that predictive health checks may be of limited value as an indicator of impending catastrophic disk failure.

The google study did find that the sudden appearance of an error, recoverable or otherwise, did correlate well with increased disk failure probability.
Re:Reliable? by flaming-opus · 2007-05-30 07:07 · Score: 1

I've certainly seen my fair share of very expensive SANs melt down, both as a buyer and as a vendor. I'm not convinced, however, that rolling your own is going to make that any better. It seems to me that the number one cause of storage-system failure is user error. (EBCAK - Error Between Chair and Keyboard) This is true for the million dollar SAN, and it is true for the four thousand dollar beige box. Rolling your own just increases the number of places for you to screw it up, and increases your employer's dependence on you to keep it running. (Good for job security, not so good for long term data retention)

Absolutely, high-end SAN storage breaks. That's the reason that the tape industry just won't die. No matter what your primary storage is, you need a backup and disaster recovery plan. Note: more RAID levels, or even mirrored servers, are NOT a backup plan. "rm -rf" can wipe out both halves of a mirror.

Having been on the vendor side of this equation, I'm not so sure that the SAN story is so much bullflop. After you get hauled across the carpet and then kicked out of a customer's machine room once, you bust your ass to make sure things work right. I remember overnighting 96 drives to On-Track to get data off of failed drives, and then stitching RAID stripes together to get back customer data. I'm sure that one weekend cost more than all of the storage they had, ten times over. On the other hand, we fixed the bug, got the data back, the customer was happy, and kept buying our stuff.

Caveat Emptor, but also: you get what you pay for.
Re:Reliable? by probabilitist · 2007-05-31 03:02 · Score: 1

Well, I read the academic research by Schroeder and Gibson. While it suggests (even states) there is no significant difference between disk types, their conclusions are not supported by any scientific methodology or even base consistency.

They did not collect their own results, rather they accepted information collected by different sources over widely different timescales. Of the '100,000' drives, around 20% were SATA drives but they had been in place less than 12 months, compared to various ages for other drives (5 and 8 years in two instances).

Additionally for some reason they ignored the fact that in one complete implementation, 11000 (around half of the SATA drives in the study) were replaced completely under warranty within the first year. This was due to unacceptable media error rates, but not considered to be disk failures. This is written off as a 'bad batch', and covered over by stating anecdotal evidence suggests other disk types have bad batches too.

The various applications are described briefly: The HPC apps contain mainly internal disks, with only one having external storage (the ill-fated SATA storage that was replaced completely, and another FC array that wasn't replaced). The COM apps do not detail the storage at all, but since they are ISPs with huge numbers of servers, this would appear to be internal storage too. There is no description of the environmental factors, application loading, or other relevant factors.

The inconsistency in the raw data makes drawing conclusions futile, yet the authors present an impressive array of graphs, charts, and even conclusions. If you presented this to senior management they would probably be blinded by the academic gloss, accept the conclusions and base their decisions on a misleading artificial construction. This makes this paper far more dangerous than marketing material, where at least you can go back to the vendor for some accountability.

We do need an independent report on disk failures, but there needs to be controls in the data collection.

Others have already commented on the Google report, which is of only slightly more use to companies that are not Google.
Re:Reliable? by aminorex · 2007-06-01 03:42 · Score: 1

> They need redundant everything, which this doesn't have.

But two of them do.

> SATA drives are not as reliable long term as SCSI.

False. In fact, your 15k FC drives will fail significantly more often than my 10k SATA drives. Cf. Google's report.

> Businesses also want service and support.

That's why it's such a great world in which to live and work, innit?

--
-I like my women like I like my tea: green-
Re:Reliable? by Znork · 2007-06-01 22:13 · Score: 1

"While it suggests (even states) there is no significant difference between disk types their conclusions are not supported by any scientific methodology or even base consistancy."

IMO, the methodology is useful, as long as it's clearly stated. Running a statistically significant test under controlled environmental and operating conditions would be prohibitively expensive and difficult to accomplish. While the operating conditions may vary, as long as they are within reasonable parameters (ie, operating ranges stated by vendor), and the statistical distribution is fairly even, it serves the purpose for the test (and, I'd suggest that conditions would tend to be skewed towards expensive disks being better taken care of).

"This makes this paper far more dangerous than marketing material, where at least you can go back to the vendor for some accountability."

I'd be much more confident in disk vendors' accountability if it weren't for the IBM Deathstar incident.

And, you know, the vendors do have the data themselves. Still, I have yet to see one actually publish it. Which rather makes me less inclined to trust them.

Still, I wouldn't present the paper to execs without making sure they understood the method with which the raw data was obtained.

"use to companies that are not google."

Yes, the problem with Google's report is that it's geared towards a highly redundant infrastructure where they can live with disks crapping out regularly. However, that infrastructure is also built to provide the best price/performance ratio, and more companies would be better off going with cheap high redundancy rather than expensive, exclusive and at best only slightly more reliable hardware.

Support by vaderhelmet · 2007-05-30 00:21 · Score: 1

In a standard corporate environment where the thought of spending $50,000 has come to the plate, you're probably in deeper than you think. A lot of what you pay for with a big name SAN from EMC or the like is serious support and reliable equipment from a well established company. Your homebrew method will almost undoubtedly work... but all equipment fails, and when this solution fails, unless you're picky about getting manufacturer warranties, you're going to be losing money to fix the problem. With our EMC/Dell solution, if something goes wrong we'll have a tech with a replacement on site within 4 hours. If your thing fails, can you get an RMA replacement in even under a week?

I'm not trying to discourage you from building this system, in fact I think DIY is a great way to go. However, you do need to take into account how downtime will affect the cost of this device. It is always important to have a failover/replacement plan for when your system goes down because most systems DO go down. (Which is why many of us are even employed.)

Good luck to you, sir!

Re:Support by multipartmixed · 2007-05-30 00:42 · Score: 1

> I'm not trying to discourage you from building this system, in fact
> I think DIY is a great way to go. However, you do need to take into
> account how downtime will affect the cost of this device.

We do some roll-your-own stuff at my company. The downtime and spares issue is solved simply by stocking spares *ahead of time*. For example, if I am needing three RAID enclosures for a project, I buy four and leave the fourth one on the shelf. If we need to dip into the spares, THEN we can start shopping for replacements.

We rarely use manufacturer's warranties, simply because it is such a pain. Although, I was dismayed once to call up Seagate with 48 bad disks, only to find out that their warranty had expired the month before. Guess I should've been more proactive..

--

Do daemons dream of electric sleep()?
Re:Support by 19thNervousBreakdown · 2007-05-30 00:51 · Score: 1

With the amount of money that's saved, he could have a complete offline backup system. Go to SCSI and he could multi-host two systems, and have a third to implement an active/passive cluster, and it's STILL less than 1/4 of the cost.

--
<xml><I><am><so><damn>Web 2.0</damn></so></am></I></xml>
Re:Support by Anonymous Coward · 2007-05-30 00:55 · Score: 0

I agree - an inexpensive Dell/EMC solution is by far the best way to go. We have 25PetaBytes of storage from various vendors (HDS/Compaq/IBM/NetApp/EMC) - and EMC's service and quality have won us over - so that now we are only buying EMC.
Re:Support by Anonymous Coward · 2007-05-30 01:50 · Score: 0

We have 25PetaBytes

I have that much media here, my storage/file system is called BitTorrent.
Re:Support by saleenS281 · 2007-05-30 04:53 · Score: 1

If he's building this for 3k, he should be able to have two spares of every pertinent part on hand...

No but... by Junta · 2007-05-30 00:22 · Score: 3, Informative

ZFS does not obsolete NAS/SAN. However, for many many many instances, DIY fileservers have been more appropriate than SAN or NAS situations for many concepts long before ZFS came along, and ZFS has done little to change that situation (though adminning ZFS is more straightforward and in some ways more efficient than the traditional, disparate strategies to achieve the same thing).

I haven't gotten the point of standalone NAS boxes. They always were not fundamentally different from a traditional server, but with a premium price attached. I may not have seen the high-end stuff, however.

SAN is an entirely different situation altogether. You could have ZFS implemented on top of a SAN-backed block device (though I don't know if ZFS has any provisions to make this desirable). SAN is about solid performance to a number of nodes with extreme availability in mind. Most of the time in a SAN, every hard drive would be a member of a RAID, with each drive having two paths to power and to two RAID controllers in the chassis, each RAID controller having two uplinks to either two hosts or two FC switches, and each host either having two uplinks to the two different controllers or to two FC switches. Obviously, this gets pricey for good reason which may or may not be applicable to your purposes (frequently not), but the point of most SAN situations is no single point of failure. For simple operation of multiple nodes on a common block device, HA is used to decide which single node owns/mounts any FS at a given time. Other times, a SAN filesystem like GPFS is used to mount the block device concurrently among many nodes, for active-active behavior.
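
The "no single point of failure" property can be checked mechanically for a given wiring diagram: knock out each component in turn and see whether the host can still reach the drive. A rough sketch, with component names (host, switch_a, ctrl_a and so on) invented for the example; it only models whole components failing, not individual cables or power feeds.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping one failed component."""
    adjacency = {}
    for a, b in links:
        if removed in (a, b):
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(links, start, goal):
    """Components whose loss alone cuts start off from goal."""
    components = {node for link in links for node in link} - {start, goal}
    return sorted(c for c in components
                  if not reachable(links, start, goal, removed=c))


dual_path = [("host", "switch_a"), ("host", "switch_b"),
             ("switch_a", "ctrl_a"), ("switch_b", "ctrl_b"),
             ("ctrl_a", "drive"), ("ctrl_b", "drive")]
one_switch = [("host", "switch_a"), ("switch_a", "ctrl_a"),
              ("switch_a", "ctrl_b"), ("ctrl_a", "drive"),
              ("ctrl_b", "drive")]
```

The dual-path layout comes back with an empty list, while the one-switch layout reports switch_a even though it has two controllers.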

For the common case of 'decently' available storage, a robust server with RAID arrays has for a long time been more appropriate for the majority of uses.

--
XML is like violence. If it doesn't solve the problem, use more.

Re:No but... by Anonymous Coward · 2007-05-30 03:13 · Score: 0

I haven't gotten the point of standalone NAS boxes. They always were not fundamentally different from a traditional server, but with a premium price attached. I may not have seen the high-end stuff, however.

Well, for some server OSes (like MS), you need to pay a licence cost per user. NASes generally don't. NASes often have nifty features for point-in-time snapshots, copy-on-write for automatic file versioning, replication, cross-platform access (SMB, NFS, webdav, etc), easy growing of the file system when you add disks, etc.
Re:No but... by Junta · 2007-05-30 05:34 · Score: 1

Hm, guess I should have more specifically said didn't see the point for me. Most all of those bullet points can be implemented at the OS layer in a cross-platform manner using an OS without draconian licensing for client access count. However, admittedly, you must have a perhaps more competent admin as implementing some of these features is not as trivial as what the NAS vendors may do for you.

The one thing that stands out as lacking would be a true versioned copy-on-write system. Currently everything I mess with is contingent on point-in-time snapshotting; if you make two changes to a file between snapshots, one of those changes is lost.

--
XML is like violence. If it doesn't solve the problem, use more.

Home-grown storage solutions by GPSguy · 2007-05-30 00:23 · Score: 1

It's been a couple of years since we built our first multi-terabyte storage (~1.6TB for ~$5k US) and started the grand experiment with XFS. Of all we've learned, I think the biggest lesson has been that XFS has problems in this realm. We are now spinning over 30TB and likely to double soon. To SAN or not to SAN has become the question, as finding our data and metadata are both important to us, more important than slinging the files off and accidentally discovering them later.

Some form of indexing system is valuable, especially if you're looking at multiple volumes to span.

We're now looking at Lustre and possibly ZFS to support our solutions. For hardware, we've gone to ATA-over-Ethernet with CoRAID hardware and been pretty satisfied. It's not iSCSI but it works well and we get adequate transfer rates. We do a caching process when we're anticipating high user access periods and can predict the patterns.

I'd say, for the money, try it, benchmark it, and report back.

--
Never ascribe to malice that which can adequately be explained by tenure.

NAS and SAN are two very different technologies wi by Anonymous Coward · 2007-05-30 00:25 · Score: 0

NAS and SAN are two very different technologies with different goals.

iSCSI is a block level protocol that requires an iSCSI soft initiator or iSCSI HBA ToE for other computers to access it. It is a similar but slower, less secure, and cheaper solution than a normal fibre channel SAN.

To do it right you really should have an iSCSI HBA ToE in the target and initiator, as well as a dedicated router that is made to handle iSCSI - because of its inherent "bursty" nature a lot of routers choke on it.

NAS is file level transfer that uses NFS and CIFS, which are already built into every OS.

Just did the same.. by Padrino121 · 2007-05-30 00:26 · Score: 0, Redundant

I have experience with a number of NAS solutions, and if cost weren't an issue or reliability/throughput were paramount I would continue to purchase them (e.g., Netapp). Depending on the environment they are being installed in, the (perceived) liability and additional complexity can be challenging to overcome.

With that said, for places where rolling your own is an option I would keep your eye out for a good deal on drives, and you will be able to build one for much less. I put together a new Myth backend with the following:

Antec Sonata II - $65 (rebate)
Asus M2N32-Vista edition (it's running Linux but the Vista edition has an LIRC supported IR receiver) - $210
AMD 4200+ X2 - $96
2GB RAM - $55
Nvidia 7600 with HDMI out - $110
6 x 500GB Maxtor SATA II HDDs - $600

It's not RAID-Z, but with a standard RAID-5 I have 2.5TB usable storage with HDTV output and ATA/iSCSI targets for $1136. Not bad, and Linux SW RAID-5 write speed actually screams these days; with this setup I expect 200MB/s write throughput.
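For anyone checking the numbers, here is a minimal sketch of the arithmetic behind the build above. The prices and disk counts come straight from the parts list; the dictionary keys and the helper name `raid5_usable_gb` are made up for illustration, and the capacity formula is just the usual "one disk's worth of parity" rule for RAID-5.

```python
# Prices as listed in the post above (after rebate where noted).
PARTS_USD = {
    "Antec Sonata II case": 65,
    "Asus M2N32 motherboard": 210,
    "AMD 4200+ X2 CPU": 96,
    "2GB RAM": 55,
    "Nvidia 7600 graphics": 110,
    "6 x 500GB SATA II disks": 600,
}


def raid5_usable_gb(disk_count, disk_size_gb):
    """RAID-5 spends one disk's worth of space on parity."""
    if disk_count < 3:
        raise ValueError("RAID-5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb


total_usd = sum(PARTS_USD.values())
usable_gb = raid5_usable_gb(6, 500)
print(f"total: ${total_usd}, usable: {usable_gb / 1000:.1f} TB, "
      f"${total_usd / usable_gb:.2f} per usable GB")
```

The same helper gives a rough RAID-Z figure as well, since single-parity RAID-Z also gives up roughly one disk of capacity.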

One word of caution with RAID-Z: although writes are extremely fast, there is a performance issue around reads if they are small and random, because there will be a lot of cache misses. Relatively speaking it's not that bad, but something to keep in mind when looking at the workload you will be supporting.

At $20k? by Anonymous Coward · 2007-05-30 00:36 · Score: 0

It uses SATA drives (assuming it's big enough to be called an array, rather than being five disks shoved in a 1U box). If you want FC or SAS, you're looking at $50K on up -- probably up.

DIY has always been an economical option by dogsbreath · 2007-05-30 00:41 · Score: 1

There is nothing new here; it has always been cheaper to work out your own storage solution than to buy a commercial unit. Same goes for servers/desktops etc. Why buy a Sun (or Dell, or IBM or whatever) when you can get your local guy to put a system together for half the price?

If you are an individual or even a small business with limited capital then DIY is often the best (only?) way to go but you also get to deal with flakey controllers, incompatible drivers, and warranty returns all on your own. The integration of components, performance management, and the harmony of the complete system is all yours.

At some point, either because of the scale or the criticality of the system, it is worth the bucks to pay someone who has researched the issues and built a solid product to provide you with a solution that you can (hopefully) trust. Your sysadmins and techies can spend their time on ROI generating projects instead of figuring out why a component does wild and whacky crap every full moon. Tech support can be a very good thing.

Even open source has its commercial providers. Personally, I have always liked Slackware but if we are deploying servers it's going to be Red Hat.

I think homebrew is super: put your system together and do some benchmarking, then publish it for the rest of us to benefit!

ZFS is great, but... by Etherized · 2007-05-30 00:42 · Score: 4, Informative

It's no NetApp - yet. One thing to realize is that iSCSI target isn't even in Solaris proper yet - you have to run Solaris Express or OpenSolaris for the functionality. That may be fine for some people, but it's a deal-breaker for most companies - you're really going to place all those TB of data on a system that's basically unsupported? I'm sure Sun would lend you a hand for enough money, but running essentially a pre-release version of Solaris is a non-starter where real business is concerned. Even when iSCSI target makes it into Solaris 10 - which should be in the next release - are you really comfortable running critical services off of essentially the first release of the technology?

Furthermore, while ZFS is amazingly simple to manage in comparison to any other UNIX filesystem/volume manager, it still requires you to know how to properly administer a Solaris box in order to use it. Even GUI-centric sysadmins are generally able to muddle through the interface on a Filer, but ZFS comes with a full-fledged OS that requires proper maintenance. Your Windows admins may be fine with a NetApp - especially with all that marvelous support you get from them - but ask them to maintain a Solaris box and you're asking for trouble. Not to mention, since it's a real, general purpose server OS, you'll have to maintain patches just like you do on the rest of your servers - and the supported method for patching Solaris is *still* to drop to single user mode and reboot afterwards (yes, I know that's not necessarily *required*).

Also, "zfs send" is no real replacement for snapmirrors. And while ZFS snapshots are functionally equivalent to NetApp snapshots, there is no method for automatic creation and management of them - it's up to the admin to create any snapshotting scheme you want to implement.

Don't get me wrong - I love ZFS and I use it wherever it makes sense to do so. It may even be acceptable as a "poor man's Filer" right now, assuming you don't need iSCSI or any of the more advanced features of a NetApp. In fact, it's a really great solution for home or small office fileservers, where you just need a bunch of network storage on the cheap - assuming, of course, that you already have a Solaris sysadmin at your home or small office. Just don't fool yourself, Filer it ain't - at least not yet.

Re:ZFS is great, but... by Anonymous Coward · 2007-05-30 02:26 · Score: 1, Informative

"patching Solaris is *still* to drop to single user mode and reboot afterwards (yes, I know that's not necessarily *required*)"

Now seriously, you must be a windows "admin"... In the rare case a reboot is necessary, you should know that you can patch an offline boot partition (via lu) and boot into it afterwards (whenever). The downtime is measured in seconds and you can boot back to the original un-patched partition if anything goes wrong.

That's unix administration for you.
Re:ZFS is great, but... by Etherized · 2007-05-30 04:44 · Score: 2, Interesting

Hence my note that the single-user/reboot process isn't *really* required, but if you look at Sun's patch documentation that's what it almost invariably says.

I don't want to sound critical of ZFS or Solaris - that's not my intent. Solaris is a wonderful operating environment, and ZFS is an incredibly powerful filesystem and volume manager. However, the fact of the matter is that implementing and administering an appliance-based solution is almost invariably going to be simpler and better supported, especially since ZFS is a relatively immature technology - there are some things you just flat out can't do yet, and some things you can't do as easily.

If the question is "Does ZFS Obsolete Expensive NAS/SANs?" the answer is clearly "not yet." At some point that may be a more difficult question to answer - ZFS is amazingly powerful despite being very young, and along with Solaris itself it continues to rapidly improve in important ways.
Re:ZFS is great, but... by saleenS281 · 2007-05-30 05:01 · Score: 1

My question is, why do you need to patch and continually upgrade the solaris box? You allow nothing but console access to the global zone. In the zone you'll be sharing out from, you only enable cifs/nfs/iscsi, and perhaps SSH (if you don't have kvm over IP). When was the last time solaris had a serious bug in cifs/nfs or ssh? And by serious I mean remote root?

This may be a *full OS*, but when you only allow people the access they need, which would be to the shares they're using, it should be just as secure as a filer, no matter who is *administrating* it. You *MAY* require a consultant to come in and do the initial setup for you if you don't have the expertise in house, but that is trivial compared to the premium on a filer.
Re:ZFS is great, but... by Etherized · 2007-05-30 05:35 · Score: 1

You don't need to "continually upgrade," but there are several reasons that this setup in particular is going to be fairly volatile.

Most notably, if you're going with OpenSolaris or Solaris Express now, you'll at the very least want to move to Solaris 10 (or 11) when the actual product's feature set catches up with the "beta" version.

The key thing to remember here is that this isn't a typical Solaris install, where you're running something that's rock solid and proven that you can run largely unmodified for 5-10 years - this is a bleeding edge setup using the absolutely latest (largely untested) features that are still likely to evolve over time. You'll find new bugs, and you'll probably want to have them fixed. New features will appear, and you'll probably want to take advantage of them.

Basically, OpenSolaris is to Solaris 10 as Fedora Core is to RHEL. By running the OS, you acknowledge that your system is not a final product, and that certain things may not be in their final state. For something like home use or small office use, that's probably acceptable - but if you want to be able to maintain it, and you have any hopes for commercial support, you're going to need to upgrade.
Re:ZFS is great, but... by saleenS281 · 2007-05-30 07:49 · Score: 1

So why do you have to run solaris express instead of Solaris 10 again? As for features, yes, that would require an upgrade. News flash, it requires an upgrade when I want to move to a new version of ONTAP as well... If anything Solaris does a better job of allowing you to patch on the fly than Netapp.

I've yet to see a talking point that explains WHY I would need to upgrade the system constantly. *WANTING* new features does not mean I *HAVE* to upgrade. As for bugs, again, the CIFS and NFS stack are completely stable *TODAY*. iSCSI target does need work, but then again, iSCSI isn't NAS, and I wouldn't recommend ANYONE use Solaris's iSCSI in production at this point.
Re:ZFS is great, but... by Etherized · 2007-05-30 10:27 · Score: 1

The original post specifically mentions iSCSI and OpenSolaris, presumably because iSCSI is a required feature and OpenSolaris (or Solaris Express) is where you need to look to get it.

More generally speaking, iSCSI is a pretty integral feature in the whole network storage game, and in that area Solaris (and as an extension ZFS) isn't even competing yet.

If you can get by without iSCSI, you're in a much better position as you can run Solaris 10 11/06 instead, which is a supportable OS and in my experience generally rock solid. But again, if you're looking for feature parity with the big appliance players, it's just not there yet.
Re:ZFS is great, but... by saleenS281 · 2007-05-30 14:46 · Score: 1

I would say lack of FC is a far bigger *issue* than iSCSI. Also, last I checked, you can get full sun support on opensolaris... and it's also completely rock solid in my experience. They usually do some pretty extensive testing before allowing commits.
Re:ZFS is great, but... by this+great+guy · 2007-05-30 18:24 · Score: 1

Solaris Express, Developer Edition is supported (and tested) by Sun.
You may be confusing it with the Community Edition, which is the bleeding-edge unsupported version.

Terabyte by palladiate · 2007-05-30 00:42 · Score: 1

I have a terabyte RAID in my main computer, and a 2 TB fileserver. With my video editing, I'm starting to look at a full fileserver (still have most of my main box empty). And, I'm not even a pirate, and I've just been a computer and video hobbyist for 2 decades now. I'm not even that serious about my hobbies.

I can imagine we'll all need terabytes of capacity in the next few years. Some games are already into the 14-20GB sizes.

hard drives will be that big soon by 192939495969798999 · 2007-05-30 00:44 · Score: 1

Based on the progression of drive sizes, a 1.3+ terabyte drive should be available within a few years, and a 7.0+TB drive shortly thereafter. If you want cheap storage, just use large single drives. They are super cheap and while not super large, they are far less complicated than this setup.

--
stuff |

Either pay a lot, or roll your own by siddesu · 2007-05-30 00:46 · Score: 1

Whether you buy or build depends a lot on whom you're buying from. Buying from people who are not in the storage business, even if it is a big corporation like Dell, gives you about the same level of support as rolling your own when the shit hits the fan. Don't believe me? See this one, and notice how long the problem kept on and on (me is one of the happy users there):

http://www.dellcommunity.com/supportforums/board/message?board.id=pv_raid&message.id=214&view=by_date_ascending&page=1

Buying from EMC or the likes (even EMC from Dell) tends to work better. The hardware is expensive, the consulting fee is expensive, and the support is expensive, but at least for that kind of money you are sure someone tries a bit harder to help.

All in all, it depends on your business. If you are making a zillion a month from that hardware working flawlessly, _not_ paying $200k for the storage is dumb. If you are making little enough so that $50k makes you think about it, rolling your own could be the way to go.

Re:Either pay a lot, or roll your own by Anonymous Coward · 2007-05-30 00:53 · Score: 0

Wow, that's f-ing relevant....4 year old stuff. Pssh.

Re:Congradulations, you discovered the "File Serve by tokul · 2007-05-30 00:46 · Score: 1

For quite a while now, it has been less expensive to build a DIY file server than to purchase NAS equipment.

Depends on needed storage space. If you need more than 1-3 TB, you can't use generic components, price goes up and hardware starts taking more space than dedicated NAS box. Or Tyan 2U-4U boxes are DIY in your country.

This hardly depends on ZFS... by Wdomburg · 2007-05-30 00:46 · Score: 4, Informative

This doesn't strike me as having much to do with ZFS at all. You've been able to do a home grown NAS / SAN box for years on the cheap using commodity equipment. Take ZFS out of the picture and you just need to use a hardware raid controller or a block level RAID (like dmraid on Linux or geom on FreeBSD). There are even canned solutions for this, like OpenFiler.

That being said, this sort of solution may or may not be appropriate, depending on site needs. Sometimes support is worth it.

You're also grossly overestimating the cost of an entry-level iSCSI SAN solution. Even going with EMC, hardly the cheapest of vendors, you can pick up a 6TB solution for about $15k, not $50k. Go with a second tier vendor and you can cut that number in half.

Re:This hardly depends on ZFS... by swillden · 2007-05-30 02:15 · Score: 1

This doesn't strike me as having much to do with ZFS at all. You've been able to do a home grown NAS / SAN box for years on the cheap using commodity equipment. Take ZFS out of the picture and you just need to use a hardware raid controller or a block level RAID (like dmraid on Linux or geom on FreeBSD).
The main difference between ZFS and what you can build on Linux without ZFS is good snapshotting. Yes, LVM supports snapshots, but they don't work very well unless all you're doing is freezing the state of the file system for a short period of time while you make a backup. With ZFS you can have a cron job that automatically creates snapshots hourly, keeps them for a few days, then removes all but a daily snapshot, etc. I tried to set up a system that gave me hourly snapshots for one day, daily snapshots for a week, weekly snapshots for a month and monthly snapshots for a year with LVM and it went wrong in all sorts of weird and hard-to-debug ways. And performance of such a solution would be abysmal, since maintaining 24 + 7 + 4 + 12 = 47 snapshots means that every disk block write causes 47 writes (each of which must be sent to each disk in the RAID array -- 47 * 6 = 282 actual disk writes triggered by one FS-level write!).
Even if it worked perfectly as designed, and performance was somehow acceptable, there's the issue that LVM requires you to specify the size of a snapshot volume. That's hard to do accurately, and hard to maintain because when a snapshot gets full it simply stops tracking changes. So, in practice, you have to waste space by allocating excessive amounts to each snapshot, and you also have to have a cron job that monitors them and expands the snapshot volumes when they're close to getting full.
Finally, besides the space wasted by preallocating snapshot volumes, LVM snapshots also waste space by tracking changes in free blocks.
ZFS addresses all of these problems, which puts it considerably closer to competing with NAS solutions. Snapshotting is something that really has to be done at the file system level to be done well.
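The hourly/daily/weekly/monthly scheme described above boils down to a retention rule that a cron job could apply before deleting anything. What follows is only a sketch of that rule, not anyone's actual script: the names `TIERS` and `snapshots_to_keep` are invented here, snapshots are represented by their creation times, and the real work of creating and removing them (on ZFS, via `zfs snapshot` and `zfs destroy`) is deliberately left out.

```python
from datetime import datetime, timedelta

# (maximum age, bucketing function). Within each tier the newest snapshot
# in every bucket is kept. The tier boundaries are assumptions chosen to
# match "hourly for a day, daily for a week, weekly for a month,
# monthly for a year".
TIERS = [
    (timedelta(days=1), lambda t: (t.year, t.month, t.day, t.hour)),  # hourly
    (timedelta(days=7), lambda t: t.date()),                          # daily
    (timedelta(weeks=4), lambda t: tuple(t.isocalendar())[:2]),       # weekly
    (timedelta(days=365), lambda t: (t.year, t.month)),               # monthly
]


def snapshots_to_keep(snapshots, now):
    """Return the set of snapshot timestamps the policy retains."""
    keep = set()
    for max_age, bucket_of in TIERS:
        newest = {}
        for ts in snapshots:
            if timedelta(0) <= now - ts <= max_age:
                key = bucket_of(ts)
                if key not in newest or ts > newest[key]:
                    newest[key] = ts
        keep.update(newest.values())
    return keep


if __name__ == "__main__":
    now = datetime(2007, 5, 30, 12, 0)
    hourly = [now - timedelta(hours=h) for h in range(72)]
    kept = snapshots_to_keep(hourly, now)
    print(f"{len(hourly)} snapshots taken, {len(kept)} retained")
```

Anything not in the returned set is a candidate for deletion. Keeping the decision in a pure function like this makes it easy to dry-run against a list of timestamps before letting cron destroy anything.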

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.
Re:This hardly depends on ZFS... by darrylo · 2007-05-30 02:49 · Score: 1

ZFS also has disk compression, and encryption is under development (IIRC).
Re:This hardly depends on ZFS... by swillden · 2007-05-30 04:28 · Score: 1

ZFS also has disk compression, and encryption is under development (IIRC).

You can also get both of those on Linux without ZFS.

--
Note to ACs: I usually delete AC replies without reading them. If you want to talk to me, log in.

Ah, digital photography. by iknownuttin · 2007-05-30 00:53 · Score: 1

My desire for storage (though not yet in the terabyte range) comes from my photography (no not porn...). I take a bunch of pictures, and well, because storage is cheap I leave them all at the original file size (which in this case is about 2-5 MB depending).

Ah yes, digital photography. It's a good thing I asked because I'm in the (gradual) process of moving away from film. Which means, I'll be having a similar problem as yourself. If you do it, I hope you post your results. I will be really interested!

BTW, a 5 MB photo file sounds very small - even if it's jpg. I'm assuming you made a typo and you're storing a 2 - 5 megapixel image, which would be, what, at least 15 megabytes? Even more reason for the setup you're talking about!

--
I prefer Flambe as apposed flamebait.

Re:Ah, digital photography. by maxume · 2007-05-30 04:06 · Score: 1

??

1500*2000=3,000,000=3 megapixels

3000000*3(that's for 8 bit/channel color;) = 9,000,000 bytes. Jpeg does 10 to 1 compression fairly well, leaving 900,000 bytes to store. Double the color channels and you still only have 1,800,000 bytes to store. Your intuition is off.
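The estimate above is easy to replay for other resolutions. This is just a sketch of the same arithmetic; the function name `image_sizes` is made up, and the 10:1 JPEG ratio is the rule of thumb from the post, not a guarantee for any particular photo.

```python
def image_sizes(width, height, channels=3, bytes_per_channel=1, jpeg_ratio=10):
    """Return (pixels, raw_bytes, estimated_jpeg_bytes) for an image."""
    pixels = width * height
    raw_bytes = pixels * channels * bytes_per_channel
    return pixels, raw_bytes, raw_bytes // jpeg_ratio


for depth in (1, 2):  # 8 and 16 bits per channel
    pixels, raw, jpeg = image_sizes(1500, 2000, bytes_per_channel=depth)
    print(f"{pixels:,} px, {raw:,} bytes raw, ~{jpeg:,} bytes as JPEG")
```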

--
Nerd rage is the funniest rage.

Redundancy by fredr1k · 2007-05-30 00:53 · Score: 1

How is it with redundancy? Have you got redundant PSUs, redundant controllers, redundant network paths to and from the server? Obviously the tech itself has some LARGE advantages. But working with enterprise technology makes you think redundancy*3. (No one wants the SQL cluster to fail due to a failed PSU that turned out to be a single point of failure.)

--
"Never EVER mess with a jumper you don't know about, even if it's labeled 'sex and free beer'." - Dave Haynie

vs Reiser4 (someday, maybe) by SanityInAnarchy · 2007-05-30 00:56 · Score: 5, Informative

Some of these issues looked familiar, so I thought I'd do a basic comparison:

Reiser4 had the same problems with fsync -- basically, fsync called sync. This was because their sync is actually a damned good idea -- wait till you have to (memory pressure, sync call, whatever), then shove the entire tree that you're about to write as far left as it can go before writing. This meant awesome small-file performance -- as long as you have enough RAM, it's like working off a ramdisk, and when you flush, it packs them just about as tightly as you can with a filesystem. It also meant far less fragmentation -- allocate-on-flush, like XFS, but given a gig or two of RAM, a flush wasn't often.

The downside: Packing files that tightly is going to fragment more in the long run. This is why it's common practice for defragmenters to insert "air holes". Also, the complexity of the sync process is probably why fsync sucked so much. (I wouldn't mind so much if it was smarter -- maybe sync a single file, but add any small files to make sure you fill up a block -- but syncing EVERYTHING was a mistake, or just plain lazy.) Worse, it causes reliability problems -- unless you sync (or fsync), you have no idea if your data will be written now, or two hours from now, or never (given enough RAM).

(ZFS probably isn't as bad, given it's probably much easier to slice your storage up into smaller filesystems, one per task. But it's a valid gotcha -- without knowing that, I'd have just thrown most things into the same huge filesystem.)

There's another problem with reliability: Basically, every fast journalling filesystem nowadays is going to do out-of-order write operations. Entirely too many hacks depend on ordered writes (ext3 default, I think) for reliability, because they use a simple scheme for file updating: Write to a new temporary file, then rename it on top of the old file. The problem is, with out-of-order writes, it could do the rename before writing the data, giving you a corrupt temporary file in place of the "real" one, and no way to go back, even if the rename is atomic. The only way to get around this with traditional UNIX semantics is to stick to ordered writes, or do an fsync before each rename, killing performance.
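For reference, the "write a temp file, then rename over the original" scheme with the defensive fsync looks roughly like this. It is a minimal sketch: `atomic_write` is a made-up name, while `os.fsync` and `os.replace` are the standard-library wrappers around fsync() and rename(). It does not fsync the containing directory, which a stricter version would also do.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the contents of `path` with `data` (bytes) via temp file + rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Force the data out before the rename. This is the step that
            # costs performance, and that ordered writes make unnecessary.
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original file untouched and clean up the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Dropping the `os.fsync` call gives the fast variant that is only safe when the filesystem will not reorder the data write after the rename.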

I think the POSIX filesystem API is too simplistic and low-level to do this properly. On ordered filesystems, tempfile-then-rename does the Right Thing -- either everything gets written to disk properly, or not enough to hurt anything. Renames are generally atomic on journalled filesystems, so either you have the new file there after a crash, or you simply delete the tempfile. And there's no need to sync, especially if you're doing hundreds or thousands of these at once, as part of some larger operation. Often, it's not like this is crucial data that you need to be flushed out to disk RIGHT NOW, you just need to make sure that when it does get flushed, it's in the right order. You can do a sync call after the last of them is done.

Problem is, there are tons of other write operations for which it makes a lot of sense to reorder things. In fact, some disks do that on a hardware level, intentionally -- nvidia calls it "native command queuing". Using "ordered mode" is just another hack, and its drawback is slowing down absolutely every operation just so the critical ones will work. But so many are critical, when you think about it -- doesn't vim use the same trick?

What's needed is a transaction API -- yet another good idea that was planned for someday, maybe, in Reiser4. After all, single filesystem-metadata-level operations are generally guaranteed atomic, so I would guess most filesystems are able to handle complex transactions -- we just need a way for the program to specify it.

The fragmentation issue I see as a simple tradeoff: Packing stuff tightly saves you space and gives you performance, but increases fragmentation. Running a defragger (or "repacker") every once in awhile would have been nice. Problem is, they never got one written. Common UNIX (and Mac) philosoph

--
Don't thank God, thank a doctor!

Re:vs Reiser4 (someday, maybe) by slamb · 2007-06-01 10:57 · Score: 1

I think the POSIX filesystem API is too simplistic and low-level to do this properly. On ordered filesystems, tempfile-then-rename does the Right Thing -- either everything gets written to disk properly, or not enough to hurt anything. Renames are generally atomic on journalled filesystems, so either you have the new file there after a crash, or you simply delete the tempfile. And there's no need to sync, especially if you're doing hundreds or thousands of these at once, as part of some larger operation. Often, it's not like this is crucial data that you need to be flushed out to disk RIGHT NOW, you just need to make sure that when it does get flushed, it's in the right order. You can do a sync call after the last of them is done.
Problem is, there are tons of other write operations for which it makes a lot of sense to reorder things. In fact, some disks do that on a hardware level, intentionally -- nvidia calls it "native command queuing". Using "ordered mode" is just another hack, and its drawback is slowing down absolutely every operation just so the critical ones will work. But so many are critical, when you think about it -- doesn't vim use the same trick?
What's needed is a transaction API -- yet another good idea that was planned for someday, maybe, in Reiser4. After all, single filesystem-metadata-level operations are generally guaranteed atomic, so I would guess most filesystems are able to handle complex transactions -- we just need a way for the program to specify it.

I was with you in the first paragraph. I expected you to lead up to "high-performance applications need a barrier - ensure operations this process performed before the barrier hit disk before operations it performs after the barrier()". Or for even higher performance, a more precise barrier that takes inodes or file descriptors: "the operations performed before on inode A hit disk before the operations performed after the barrier on inode B if a call to". I don't quite understand your leap to a transaction API. It seems useful (it would be cool to have full ACID on filesystem operations, even ones that span multiple files) but I'm not sure it's the way to get the best performance.
By the way, you seem knowledgeable on filesystem semantics. I've been wondering something for a while, and I haven't had luck googling the answer. If I call write() on a section of a file and then lose power before calling fdatasync(), what's the worst that can happen on...say...Linux/ext3? Is there a standard that specifies this behavior more generally? I don't see anything like it in SUS.
More precise questions: Does my write necessarily happen in byte order (I'm guessing not)? Does each block have either all old bytes or all new bytes (unsure of this), and if so is there a way (preferably portable) to get the filesystem's blocksize (something pathconf()-like)? Does each byte have either all old bits or all new bits (I hope so)? Can bits get flipped that I didn't even change?
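On the narrow question of finding the block size, POSIX statvfs() reports it, and Python exposes that as `os.statvfs` on Unix-like systems. A small sketch follows (the helper name `fs_block_sizes` is invented here). It only reports what the filesystem advertises; it says nothing about whether a partially written block can reach disk, which is the harder part of the question.

```python
import os


def fs_block_sizes(path="."):
    """Return (f_bsize, f_frsize) for the filesystem containing `path`.

    f_bsize is the filesystem's preferred I/O block size and f_frsize is
    the fundamental block (fragment) size, as reported by statvfs().
    """
    st = os.statvfs(path)
    return st.f_bsize, st.f_frsize


if __name__ == "__main__":
    bsize, frsize = fs_block_sizes(".")
    print(f"preferred block size: {bsize}, fundamental block size: {frsize}")
```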
Re:vs Reiser4 (someday, maybe) by SanityInAnarchy · 2007-06-03 05:48 · Score: 1

I expected you to lead up to "high-performance applications need a barrier - ensure operations this process performed before the barrier hit disk before operations it performs after the barrier()". Or for even higher performance, a more precise barrier that takes inodes or file descriptors: "the operations performed before on inode A hit disk before the operations performed after the barrier on inode B if a call to".
That might work, and I think some filesystems might have implemented it. It's also closer to working with existing POSIX stuff than what I might've come up with. But it also requires the app to do quite a lot that I don't think it has to, and requires some duplication of code.
For the simple, small-file example, it works reasonably well. Write to the new file, barrier, rename to the old, barrier. Essentially have barrier calls in most places you'd have sync calls -- and yes, maybe fbarrier or fdatabarrier, like fsync/fdatasync. Note that sync-ing already does basically the same thing, just slower. And you might also want to have a call which explicitly says you're going to be using barriers, telling the FS it can reorder as much as it wants.
There are a couple of problems with that approach, I think. The most obvious one is that essentially, your application is trying to implement a transaction, and your filesystem will, if it's smart, group these into a transaction. In other words, if I read from file a, write a new version to file tmp, barrier, and then rename that tmp to a, then barrier again, all before the FS writes it out to disk, a smart enough FS might figure out that I want to atomically update a, and skip the tempfile altogether. Skipping the tmpfile means skipping dealing with allocating a new inode, setting a dozen timestamps, then unlinking it, when it might be faster to simply use the FS's own journal. (Or it might use the tmpfile anyway, if it thinks that's faster.)
At the same time, your application, if it's more than just vim, might have more than one file it wants updated atomically. For instance, it might want to go through a whole directory tree, doing some operation to, say, half the files in the tree. This starts to make it more complicated. The simplest way I can think of with barriers would be something like: Copy everything to a temporary tree, do all of your update operations there. Barrier. Rename the original tree to something like orig.bak, and move the new tree to the original tree. Barrier.
The above may not always work, of course. It ignores open filehandles or working directories within the tree by other processes. It also means that, even if you use hardlink tricks, you're probably copying far more than you need to.
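The copy-the-tree-then-swap approach can be sketched as below. There is no portable barrier call, so this sketch uses `os.sync()` as a slow stand-in, which is an assumption made here rather than anything from the posts above. The function name `replace_tree` and the `.new`/`.bak` suffixes are also made up. It has exactly the weaknesses just listed: it copies everything, and there is a moment between the two renames when the original path does not exist.

```python
import os
import shutil


def replace_tree(root, update):
    """Apply `update(staged_dir)` to a copy of `root`, then swap it into place.

    Returns the path of the backup holding the old tree.
    """
    root = os.path.abspath(root)
    staged = root + ".new"
    backup = root + ".bak"

    shutil.copytree(root, staged)   # copy everything to a temporary tree
    update(staged)                  # do all update operations there
    os.sync()                       # "barrier": staged tree is on disk

    os.rename(root, backup)         # move the original out of the way
    os.rename(staged, root)         # move the new tree into place
    os.sync()                       # "barrier": the swap is on disk
    return backup
```

After a crash, the recovery rule would be: if the original path is missing but the `.bak` tree exists, rename the backup into place; any leftover `.new` tree can simply be deleted.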
And consider a database, or any large file that you might update random chunks in the middle of. For that to work, you're either going to need to copy the database around a LOT -- not feasible if it's a few gigs in size -- or you're going to have to maintain some sort of journal yourself.
So, again -- the FS already has a journal and a concept of an atomic operation. Your app is trying to implement an atomic operation. It seems to me that you should communicate via atomic operations. It also doesn't even seem that hard -- consider that ZFS supports writable copy-on-write snapshots of the entire filesystem. If I were implementing a filesystem, that's about how I'd do it -- provide a new API that resembles POSIX (and maybe is backwards-compatible), but allows each operation to be tagged as belonging to a certain transaction. It would look like this (excuse the pseudo-C code, I don't do a lot of C):
foo = begin_transaction();
write(some_file, "some data", 9, foo);
write(some_other_file, .... foo);
rename("a", "b", foo);
unlink("c", foo);
commit_transaction(foo); // or you could do sync_transaction(foo);
Roughly, of course. The idea is, anything read or written in the context of that transactio

--
Don't thank God, thank a doctor!
Re:vs Reiser4 (someday, maybe) by slamb · 2007-06-03 09:57 · Score: 1

Essentially have barrier calls in most places you'd have sync calls -- and yes, maybe fbarrier or fdatabarrier, like fsync/fdatasync. Note that sync-ing already does basically the same thing, just slower.
Yes, and since fsync() is a working but suboptimal fbarrier(), you can write portable code easily:
#ifndef HAVE_FBARRIER
#define fbarrier(fd) fsync(fd)
#endif

There are a couple of problems with that approach, I think. The most obvious one is that essentially, your application is trying to implement a transaction, and your filesystem will, if it's smart, group these into a transaction. In other words, if I read from file a, write a new version to file tmp, barrier, and then rename that tmp to a, then barrier again, all before the FS writes it out to disk, a smart enough FS might figure out that I want to atomically update a, and skip the tempfile altogether. Skipping the tmpfile means skipping dealing with allocating a new inode, setting a dozen timestamps, then unlinking it, when it might be faster to simply use the FS's own journal. (Or it might use the tmpfile anyway, if it thinks that's faster.)

Yeah, directly making an atomic change to the file would probably be faster. And there are many places that full ACID would be convenient. But there's also a lot of stuff running that doesn't need transactional semantics, and I'd expect they'd be a lot slower for it. Also databases that have their own transactional systems with different performance characteristics. This seems like a global change, and I'm not sure that's desirable.
Anyway, the transactional semantics you're describing exist, in Microsoft Windows of all places. I don't know much about them, what with hating both Microsoft and Windows.
It's my understanding that this is also approaching problems at the hardware level. For instance, tagged command queuing and friends, disk write buffers, and the general nature of the media (and the fact that you don't actually know what the physical media is) all means that your only guarantee is that when your fdatasync returns, the data is safely on disk. Maybe. Hopefully.

Well, what hardware guarantees may or may not be available is the filesystem implementer's problem. I want to know what the filesystem implementation can guarantee to me. I'd suggest reading this, this, and this. I suspect similar techniques can be used on a large file inside of an existing filesystem -- which would also blatantly make my point about duplicated effort. But beyond that, I'm really not sure.

I've read those, but they don't really help. They're focused on metadata, and I want to know what guarantees I have for data.

At that price he can keep spares by HighOrbit · 2007-05-30 00:56 · Score: 1

You have a good point about the support with higher value equipment, but at this price he can afford to keep a few spares in the closet, or even have a few other complete units as a failover 'live' backup.

Replace NAS? Sure. SAN? No way. by pyite · 2007-05-30 00:57 · Score: 3, Informative

I guess this setup could replace some people's need for a turnkey NAS solution. But your thinking it could replace SAN solutions shows you haven't looked into SAN too much. To start, there's a reason Fibre Channel is way more popular than iSCSI. The financial services company I work for has about 3 petabytes of SAN storage, and not a drop of it is iSCSI. Storage Area Networks are purpose-built. They typically have multiple fabrics for redundancy, special purpose hardware (we use Cisco Andiamo, i.e., the 9500 series), and a special purpose layer 2 protocol (Fibre Channel). iSCSI adds the overhead of TCP/IP. TCP does a really nice job of making sure you don't drop packets, i.e. layer 3 chunks of data, but at the expense of possibly dropping frames, i.e. layer 2 data. The nature of TCP just does this, as it basically ramps up data sending until it breaks, then slows down, rinse and repeat. This also has the effect of increasing latency. Sometimes this is okay; people use FCIP (Fibre Channel over IP), for example. But sometimes it's not. Fibre Channel does not drop frames. In addition, Fibre Channel supports cool things like SRDF which can provide atomic writes in two physically separate arrays. (We have arrays 100 km away from each other that get written basically simultaneously and the host doesn't think its write is good until both arrays have written it.) So, like I said, this might be good for some uses, but not for any sort of significant SAN deployment.

--

"Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman

Re:Replace NAS? Sure. SAN? No way. by Phishcast · 2007-05-30 02:20 · Score: 1

I agree with just about everything you've said. Maybe the big storage vendors are gouging people, but what they deliver simply isn't available yet as a DIY project (in any reliable and supportable way, anyhow). SRDF is a good example.
Are you really doing synchronous replication over a 100km distance with SRDF? The latency over that distance would probably make many applications choke if you were doing truly synchronous replication. There are various channel extension products you can place between the two sites which would trick your EMC Symmetrix storage arrays into thinking they were doing synchronous replication, but in reality your remote site would be somewhat behind your local site. Or maybe you're using asynchronous SRDF (SRDF/A)?
Nice nick, by the way. I'm going to have that song in my head all day now.
Re:Replace NAS? Sure. SAN? No way. by pyite · 2007-05-30 03:07 · Score: 1

Are you really doing synchronous replication over a 100km distance with SRDF?

I think one of our longest legs is currently about 100 fiber km. That's pretty much the accepted limit for synchronous as far as we know. We've begun deploying Cisco Storage Services Modules to make use of Fibre Channel Write Acceleration. You may have heard of it before. We've begun using it in areas where there are applications with a lot of small block size writes. In addition, we're currently testing SRDF/A over FCIP to use from New York to London.

Thanks for the nick compliment. I began using this one when I forgot the password to my four digit ID :-)

--
"Nature doesn't care how smart you are. You can still be wrong." - Richard Feynman
Re:Replace NAS? Sure. SAN? No way. by Phishcast · 2007-05-30 04:46 · Score: 1

Yes, we're an all Cisco Fibre Channel shop and I've heard of FC write acceleration. It should be quite helpful over those long runs, it'll remove a whole round trip. We don't have any SSM blades yet, but I've been hearing that people are really wanting the Gen2 SSM blades to come out so they're not stuck with the Gen1 limit for port count in a director. The rumor is they'll need to wait until at least Q1 2008. Are you using your Cisco gear to do FCIP? You must be using Inter-VSAN routing then, right?
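As a back-of-envelope check on the 100 km figure in this sub-thread, here is a small Python sketch of propagation delay. The numbers are assumptions, not measurements from either poster: light in fiber is taken as roughly 5 microseconds per km, an unaccelerated write is assumed to need two round trips (one for the transfer-ready handshake, one for data and status), and write acceleration is assumed to remove one of them, as the comment above suggests. Switch, array, and protocol overheads are ignored.

```python
FIBER_US_PER_KM = 5.0  # assumption: about 5 microseconds per km of fiber


def write_latency_ms(distance_km, round_trips):
    """Propagation delay only, for a write needing `round_trips` round trips."""
    one_way_us = distance_km * FIBER_US_PER_KM
    return round_trips * 2 * one_way_us / 1000.0


plain = write_latency_ms(100, round_trips=2)        # 2.0 ms added per write
accelerated = write_latency_ms(100, round_trips=1)  # 1.0 ms added per write
print(plain, accelerated)
```

Under these assumptions every synchronous write over a 100 km leg waits at least 1 to 2 ms for the remote array, which is why small-block, write-heavy applications are the ones that feel it first.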

ZFS FUD YAY by toby · 2007-05-30 00:58 · Score: 1

It's a pity your summary fails to emphasise: Most of these are known bugs; some are likely already fixed; and a couple are simply wrong.

But it's all beside the point. ZFS still offers data integrity that no other hardware or software system does (some other systems do provide copy on write, pool-like manageability, and so on). One day we'll all be using it (or a clone).

--
you had me at #!

Cheap, redundant, and performant storage. by Lethyos · 2007-05-30 00:58 · Score: 4, Interesting

Google have a great solution that focuses on the “cheap” part without much compromising the other two. If you have not read up on the Google Filesystem, definitely take the time to. At the very least, it seems to call into question the need to shell out tens of thousands for high-end storage solutions that promise reliability in proportion to the dollar.

--
Why bother.

Re:Cheap, redundant, and performant storage. by TheSunborn · 2007-05-30 01:30 · Score: 3, Informative

But Google Filesystem is not available for purchase, which is a shame.

And hiring a team to develop something similar to Google Filesystem is not cheap. Even high-end SANs will be cheaper.
Re:Cheap, redundant, and performant storage. by Coward+Anonymous · 2007-05-30 01:49 · Score: 1

And yet Google uses their vaunted GFS only for your web data. Google's internal data as well as revenue impacting data is stored on more traditional NAS/SAN solutions. Oops.
Re:Cheap, redundant, and performant storage. by flaming-opus · 2007-05-30 02:46 · Score: 1

And Google can get away with it because they have VERY specialized needs, but on a very large scale. GoogleFS is not POSIX compliant. You can't run third party software apps on top of it, and expect them to behave well. It supports a very high level of read performance, with limited write performance, and depends on very highly ordered machine environments. It would be wonderful if we could all afford millions of dollars of custom engineering for our specific needs, but there's a reason most of us pay the big bucks to the storage companies of the world: it's often cheaper, in sum, than rolling your own.

EMC Dell solution by Leadmagnet · 2007-05-30 00:59 · Score: 1

I would go with the Dell EMC AX150i SP Array for an iSCSI solution of that size - it can do iSCSI up to 6TB.

--
http://www.leadmagnet.50megs.com

No by drsmithy · 2007-05-30 01:07 · Score: 2, Insightful

Potentially it will obsolete low-end NAS/SAN hardware (eg: Dell/EMC AX150i, StoreVault S500) in the next couple of years, for companies who are prepared to expend the additional people time in rolling their own and managing it (a not insignificant cost - easily making up $thousands or more a year). There's a lot of value in being able to take an array out of a box, plug it in, go to a web interface and click a few buttons, then forget it exists.

However, your DIY project isn't going to come close to the performance, reliability and scalability of even an off the shelf mid-range SAN/NAS device using FC drives, multiple redundant controllers and power supplies - even if the front end is iSCSI.

Not to mention the manageability and support aspects. When you're in a position to drop $50k on a storage solution, you're in a position to be losing major money when something breaks, which is where that 24x7x2hr support contract comes into play, and hunting around on forums or running down to the corner store for some hardware components just isn't an option.

ZFS also still has some reliability aspects to work out - eg: hot spares. Plus there isn't a non-OpenSolaris release that offers iSCSI target support yet AFAIK.

I've looked into this sort of thing myself, for both home and work - and while it's quite sufficient for my needs at home, IMHO it needs 1 - 2 years to mature before it's going to be a serious alternative in the low-end NAS space.

CORAID and ATA over Ethernet by backtick · 2007-05-30 01:08 · Score: 1

Buy one of CORAID's 1521 disk shelves w/ their CLN20 front end for $6600 and drop in 15 500 GB SATA drives (they're a whopping $100 each these days) for a quick 7TB of raw storage for ~$8K or ~9K. Need more storage? Go w/ 750GB drives (They're validating the 1 TB raw drives now, but the price isn't worth it, per GB). Want to add storage later? Buy another 1521 and plug it in. Oh, and it's AoE, with less overhead than iSCSI.
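The arithmetic behind those figures, as a short Python sketch. It uses only the prices quoted in the comment ($6600 for the shelf plus front end, $100 per 500 GB drive, 15 drives) and ignores tax, shipping, and any capacity lost to RAID, so treat the result as a rough estimate.

```python
SHELF_AND_FRONT_END_USD = 6600  # 1521 shelf with CLN20, as quoted above
DRIVE_USD = 100                 # per 500 GB SATA drive, as quoted above
DRIVE_GB = 500
DRIVE_COUNT = 15

total_usd = SHELF_AND_FRONT_END_USD + DRIVE_COUNT * DRIVE_USD
raw_gb = DRIVE_COUNT * DRIVE_GB
usd_per_gb = total_usd / raw_gb

print(total_usd, raw_gb, round(usd_per_gb, 2))
```

That works out to $8,100 for 7,500 GB raw (decimal), or about $1.08 per raw GB, consistent with the "~$8K" and "7TB" in the comment.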

Why not good ol' trusted Linux? by Britz · 2007-05-30 01:10 · Score: 5, Informative

Linux has more performance testing on x86 than OpenSolaris (so you are not as likely to run into a bad bottleneck). On Linux you can create a RAID-1, -4, -5 or -6 array under Multiple Device Driver Support in the kernel. You can then use mkraid to include all the drives you want. This code is not new at all. It was stable in 2.4, maybe even in 2.2.

After that you just create a filesystem on top of the raid. If you don't like ext3 or don't trust it, there is always xfs. I had some rough times with reiserfs, xfs, and ext3 and for all the experience I had I would go xfs for long running server environments (and now get flamed for this little bit, use ext3 all you want).

The advantage is that you use very well tested code.

The problem comes with hotswapping. I don't know if the drivers are up to that yet. But I also highly doubt that OpenSolaris SATA drivers for some low price chip in a low price storage box can deal with hotswapping. So Linux might be faster on that one.

That is a setup I would compare to a plug'n play SAN solution. And it totally depends on the environment. If the Linux box goes down for some reason for a couple hours/days, how much will that cost you? If it is more than twice the SAN-solution, you might just buy the SAN and if it fails just pull the disks and put them in the new one. I dunno if that would work on Linux.

Re:Why not good ol' trusted Linux? by this+great+guy · 2007-05-30 18:40 · Score: 1

The problem comes with hotswapping. I don't know if the drivers are up to that yet
Only 4 (out of ~12) Linux SATA drivers support hotswap: ahci, sata_nv, sata_sil and sata_sil24. Fortunately the first three support the majority of SATA chips on the market.
But I also highly doubt that OpenSolaris SATA drivers for some low price chip in a low price storage box can deal with hotswapping.
Meh, you are so wrong :-) About 3 months ago, I assembled a cheap AMD64 2.5 TB ZFS fileserver running OpenSolaris Nevada B55 for $950, that's $0.38/GB (!) and it supports hotswap (Sil3124 controller).
I switched to ZFS from my previous 1 TB Linux MD raid5 setup precisely because the ZFS feature set is far superior to what the Linux MD setup offered me (end-to-end checksumming, consistency of data guaranteed after an unexpected reboot, no "raid 5" write hole, flexibility of managing multiple fs on a single zpool, compression, etc).
Re:Why not good ol' trusted Linux? by Anonymous Coward · 2007-06-01 11:02 · Score: 0

The advantage is that you use very well tested code.

Ha ha ha! Oh, that's a good one.
I recently lost a Linux SATA+md/RAID5+LVM+ext3 filesystem to corruption. I don't know which layer was at fault, but I learned the combination is not trustworthy.
Re:Why not good ol' trusted Linux? by Anonymous Coward · 2007-06-01 11:49 · Score: 0

no "raid 5" write hole

Could you please elaborate? What is the "raid 5" write hole? I see it mentioned in the wikipedia article on RAID 5, but it doesn't actually explain what it is.
Re:Why not good ol' trusted Linux? by this+great+guy · 2007-06-01 12:17 · Score: 1

The raid 5 "write hole" problem (I should have used the double quotes like this, instead of around the term raid 5) is mentioned here [1] and on pages 16-17 of this presentation [2].
[1] http://opensolaris.org/os/community/zfs/whatis/
[2] http://opensolaris.org/os/community/zfs/docs/zfs_last.pdf
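Since the question was what the write hole actually is, here is a toy Python illustration of the commonly described problem: in traditional RAID 5 a data block and its parity block are updated as two separate writes, so a crash between them leaves the stripe inconsistent. The four-byte "blocks" and the `xor_blocks` helper are made up for the example and do not model any real RAID implementation.

```python
def xor_blocks(*blocks):
    """XOR equally sized byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# One stripe: three data blocks plus their parity.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Consistent stripe: if the disk holding d1 dies, it can be rebuilt.
assert xor_blocks(d0, d2, parity) == d1

# Crash scenario: the new d0 reaches disk, the matching parity write does not.
d0_new = b"ZZZZ"
stale_parity = parity

# Later the disk holding d1 fails and is rebuilt from the stale parity.
rebuilt_d1 = xor_blocks(d0_new, d2, stale_parity)
print(rebuilt_d1 == d1)  # the rebuild silently returns wrong data
```

The rebuild damages a block that was never being written, and nothing reports an error. The ZFS material linked above describes how RAID-Z avoids this.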
Re:Why not good ol' trusted Linux? by Britz · 2007-06-02 22:10 · Score: 1

That's why I use xfs in those environments. I have no idea why ext3 keeps crapping out on those.
Re:Why not good ol' trusted Linux? by Anonymous Coward · 2007-06-05 07:03 · Score: 0

That's why I use xfs in those environments. I have no idea why ext3 keeps crapping out on those.

Thing is, I don't know ext3 was the problem. Each layer was supposedly well tested, including ext3...when they failed together, that makes me distrust every layer...clearly they weren't tested as well as people thought they were.

Have you tried out Starfish? by msporny · 2007-05-30 01:10 · Score: 3, Informative

Ever heard of Starfish? It's a new distributed clustered file system:

Starfish Distributed Filesystem

From the website:

Starfish is a highly-available, fully decentralized, clustered storage file system. It provides a distributed POSIX-compliant storage device that can be mounted like any other drive under Linux or Mac OS X. The resulting fault-tolerant storage network can store files and directories, like a normal file system - but unlike a normal file system, it can handle multiple catastrophic disk and machine failures.

And you can build clusters at relatively low cost:

For a 2-way redundant, RAID-1 protected, 1.0 Terabyte cluster: $2,000 (Jan 2007 prices). Per server, that breaks down into around $400 for a AMD 2.6Ghz CPU, 1GB of memory, and a motherboard with integrated 100 megabit LAN connection, SATA support, 350 watt power supply and a commodity server enclosure. Four SATA 500GB hard drives will run you around $600. The cluster would ensure proper file system operation even in the catastrophic failure of a single machine. Hard drive failure rates could even approach 50% without affecting the Starfish file system.

(warning: I work for the company that created Starfish)

-- manu

--
Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
Founder/CEO - Digital Bazaar, Inc.

Re:Have you tried out Starfish? by hirschma · 2007-05-30 01:28 · Score: 1

Why is Starfish better than pNFS? How much does the software cost?
Re:Have you tried out Starfish? by msporny · 2007-05-30 02:44 · Score: 2, Interesting

Why is Starfish better than pNFS?
The biggest reason right now is that there is a working, stable implementation of Starfish - there isn't one for pNFS. Data redundancy and high-availability is another strong reason to choose Starfish over pNFS. That being said, it is an unfair comparison - pNFS was not designed for highly-available clustered environments. Starfish is also a POSIX-compliant file system, it supports extended attributes and we provide all of the source code.
How much does the software cost?
The software is free for up to 1TB of storage or up to 10 nodes in a cluster (which is most of the website clusters in operation today). We also give very generous licenses (usually free) to academic and research institutions. -- manu

--
Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
Founder/CEO - Digital Bazaar, Inc.
Re:Have you tried out Starfish? by Anonymous Coward · 2007-05-30 03:56 · Score: 0

> The biggest reason right now is that there is a working, stable implementation of Starfish

Really? Cause, your download page warns that it's a beta product and should not be used in production.

On the other hand, there is stable ZFS support in OpenSolaris, (non-open) Solaris, and in a few months in FreeBSD, and since it's available fully free and open source, there are no necessary licensing costs. Are there any advantages of Starfish over ZFS?
Re:Have you tried out Starfish? by gfilion · 2007-05-30 05:34 · Score: 1

Ever heard of Starfish? It's a new distributed clustered file system:

Starfish Distributed Filesystem

From the website:

Starfish is a highly-available, fully decentralized, clustered storage file system. It provides a distributed POSIX-compliant storage device that can be mounted like any other drive under Linux or Mac OS X. The resulting fault-tolerant storage network can store files and directories, like a normal file system - but unlike a normal file system, it can handle multiple catastrophic disk and machine failures.

I read on the web site that mirroring should only be available in August 2007, is that true?

Also, is it possible to have StarPeers of different size? I have a server with 80 GB in RAID1 and one with 500 GB in RAID1. If I set both of them as StarPeers will both have access to 580 GB of storage?

Thanks,
GFK's
Re:Have you tried out Starfish? by msporny · 2007-05-30 06:19 · Score: 2, Interesting

I read on the web site that mirroring should only be available in August 2007, is that true?

Yes, at the present speed of development and testing, mirroring should be available by the end of August 2007.

Also, is it possible to have StarPeers of different size? I have a server with 80 GB in RAID1 and one with 500 GB in RAID1. If I set both of them as StarPeers will both have access to 580 GB of storage?

Yes, absolutely. Keep in mind that preference will be given to the 500GB storage node until it fills up to around 420GBs. Starfish tries to load-balance storage across all nodes, thus it always picks the nodes with the most amount of free space remaining.

There will be different storage node selection strategies in the future. Currently there is round-robin (which load-balances based on number of files per storage node) and largest-free-storage (which load-balances files to the storage node with the largest amount of available storage).
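To make the two selection strategies concrete, here is a small Python simulation using the 80 GB and 500 GB nodes from the question. It is only a sketch of the behaviour as described in this comment, not Starfish's actual code: the function names, the dictionary layout, and the fixed 1 GB file size are all assumptions.

```python
def pick_largest_free(nodes):
    """Largest-free-storage: choose the node with the most free space."""
    return max(nodes, key=lambda n: n["capacity_gb"] - n["used_gb"])


def pick_fewest_files(nodes):
    """Round-robin as described above: balance on file count per node."""
    return min(nodes, key=lambda n: n["files"])


def simulate(pick, file_gb, count):
    """Store `count` files; report the big node's usage when the small
    node is chosen for the first time (None if it never is)."""
    nodes = [
        {"name": "small", "capacity_gb": 80, "used_gb": 0, "files": 0},
        {"name": "big", "capacity_gb": 500, "used_gb": 0, "files": 0},
    ]
    big_used_at_first_small_pick = None
    for _ in range(count):
        node = pick(nodes)
        if node["name"] == "small" and big_used_at_first_small_pick is None:
            big_used_at_first_small_pick = nodes[1]["used_gb"]
        node["used_gb"] += file_gb
        node["files"] += 1
    return nodes, big_used_at_first_small_pick


nodes, first_small = simulate(pick_largest_free, file_gb=1, count=450)
print(first_small)  # 420
```

With largest-free-storage the 80 GB node is first chosen once the 500 GB node holds 420 GB, matching the "around 420GBs" figure; with the file-count strategy the two nodes simply alternate regardless of size.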

--
Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
Founder/CEO - Digital Bazaar, Inc.
Re:Have you tried out Starfish? by msporny · 2007-05-30 07:47 · Score: 3, Insightful
Really? Cause, your download page warns that it's a beta product and should not be used in production.

We do that as a preventative measure for people that don't get support through us. We don't want anybody to assume that the file system is ready for a highly-available cluster without talking to us first. As with all file systems, there are trade-offs to using Starfish. We are being honest - the software is stable as far as we can tell, but it doesn't have a great deal of field use (it was released to the public in March 2006).

We must be especially careful with file systems - data is very important to people. If somebody uses our system and loses data, we can't fix that - not that it has ever happened. We put the beta message as a warning that people should talk to us before thinking about putting our system into production. After all - it doesn't have nearly the amount of testing behind it that EXT3 does. It is common for a file system to remain as a beta product for the first year or two.

On the other hand, there is stable ZFS support in OpenSolaris, (non-open) Solaris, and in a few months in FreeBSD, and since it's available fully free and open source, there are no necessary licensing costs. Are there any advantages of Starfish over ZFS?

Hmm... ZFS and Starfish aren't really meant to address the same storage problem. Take a bit of time and read through what ZFS does and what Starfish does. ZFS is a block-level file system. Starfish is a file-level distributed clustered storage system. Those are two very different things - at the end of the day they store files, but in very different ways and for very different purposes. Starfish can use ZFS as its block-level file system... it can also use Reiser and EXT3.

Here are a couple of reasons to use Starfish (even though we think that ZFS is a fantastic solution for block-level file system problems):
- Starfish is a decentralized, multi-node fault tolerant file storage solution that provides N-way redundancy.
- Starfish is useful when you have a large number of nodes that need to access data in parallel. The ability to perform asynchronous parallel throughput is one of Starfish's biggest advantages.
- Starfish runs in userspace - which means that it is capable of using much smarter algorithms and databases to manage metadata and file placement. For example, a SQL database is used for file system metadata, which means that a variety of optimizations (such as creating metadata indices) to the file system can be performed at runtime.
However, this really isn't a "what is better, Starfish or ZFS?" discussion. You can have the best of both worlds: Starfish as the file-level network storage cloud using ZFS as the block-level file system.
-- manu
--
Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
Founder/CEO - Digital Bazaar, Inc.
Re:Have you tried out Starfish? by bughouse26 · 2007-05-30 20:23 · Score: 1

How does Starfish compare to GFS, Lustre, or IBRIX?
Re:Have you tried out Starfish? by thommym · 2007-05-30 23:58 · Score: 1

"The system is free for use if your storage requirements are less than 1 Terabyte." Not really what we in the free world is looking for...

--
Don't feed the penguins
Re:Have you tried out Starfish? by Krishnoid · 2007-05-31 04:07 · Score: 1

Really? Cause, your download page warns that it's a beta product and should not be used in production.
We do that as a preventative measure for people that don't get support through us. We don't want anybody to assume that the file system is ready for a highly-available cluster without talking to us first.
That word, 'beta', I do not think it means what you think it means.
Re:Have you tried out Starfish? by msporny · 2007-05-31 15:40 · Score: 1

GFS

You can't really compare GFS and Starfish. GFS is a shared-disk clustered filesystem. For example, GFS is useful when you have 10 machines sharing a cabinet of hard drives via FibreChannel or shared SCSI. The disks must be situated very close to the machines.

Starfish is a distributed clustered file storage network. Starfish is capable of using any commodity grade hardware and uses software to ensure high availability and node-level fault tolerance. Hard drives do not have to be shared between machines (a requirement for GFS) and it is meant to scale to 100s if not 1000s of nodes (GFS doesn't scale to that size easily). Starfish is capable of using GFS as its block-level file storage mechanism. Starfish is a file-level network file system - GFS is a block-level local file system. I hope that clears things up... in short, you don't have to pick between GFS and Starfish - they are complementary.

Lustre

The Lustre team is a great bunch of people - we ran their clustered file system software for two years before we needed to create our own. Lustre is run on most of the highest performing supercomputers in the world. The biggest reason we had to create Starfish was because Lustre does not distribute its metadata. There is a comparison between Lustre and Starfish on our website. You should check them out if you're trying to decide on a good distributed clustered file system - Lustre focuses on ultra high performance, Starfish focuses on data redundancy and high availability.

IBRIX

No idea - we haven't had a chance to get some real-world benchmarks from their clusters. As far as bottlenecks in their system, we don't have access to their source code, so we can't do a thorough analysis. It seems that their system design is close to ours, I would expect that they currently perform better than Starfish due to the maturity of their project. You tend to have to export IBRIX filesystems via NFS, which limits the fault-tolerant aspects. Starfish is fully POSIX compliant and can be mounted just like any Linux filesystem. Wish I could tell you more - in short, they probably perform better, have more features and cost far more than Starfish.

GFS and Lustre have free downloads available via their respective websites - it's worth taking them for a spin. You can also download and play around with Starfish, we even provide a quick start tutorial. You can even use it for up to 1TB of storage or 10 machines at no cost.
-- manu

--
Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
Founder/CEO - Digital Bazaar, Inc.

Re:Congradulations, you discovered the "File Serve by Wdomburg · 2007-05-30 01:13 · Score: 1

Depends on what you mean by "generic" components. There are plenty of generic servers cases out there designed for storage applications. A 2U rack enclosure with 12 drive bays can be had for around a grand. Add a twelve port raid controller for six or seven hundred and you've got capacity to expand to 6TB raw even if you stick to 500GB drives (currently best price per GB).

Hardware RAID by ehfortin · 2007-05-30 01:17 · Score: 1

I was reading the various comments and I was surprised that nobody suggested a hardware RAID controller. There are some from well-known vendors (Adaptec, Promise and so on) that support 4, 8 or more SATA drives and do RAID 1/5/1+0. They usually support hot spares, RAID level changes, volume growth, etc. These are available from about $300, with a median for 8 drives of about $500. Using this kind of solution would be faster and more robust (make sure you get a full RAID chipset, not just a RAID accelerator where a combination of software and hardware is necessary) and should be easier to manage than a ZFS setup on OpenSolaris. These are also often available for multiple OSes. I'm looking at this kind of solution for my personal LAN. I haven't had time yet to order the hardware and the disks, so I have no experience doing it, but on paper it looks good. At this point I'm looking at an Adaptec 2820SA, which supports 8 drives and offers a lot of interesting features. Does anybody have comments on taking that road, or on the Adaptec 2x20SA in particular?

Re:Hardware RAID by Anonymous Coward · 2007-05-30 02:26 · Score: 1, Insightful

Unless you *need* the performance or it is part of a system that $VENDOR is supporting, I would *strongly* advise against hardware RAID cards, especially the low-end ones. If you use a pure software RAID in an operating system that makes the source code available, then if it all goes wrong you can read the source and work out how to put the pieces back together. I've done it. It's not fun. At all. But it is possible. If the same thing happens with a hardware RAID (or heaven forbid, some of the modern win-RAID, proprietary software + undocumented hardware, low-end offerings), then unless the 'recovery' tools you are given can do it, you're stuffed. Do not underestimate the value of having a recovery option for worst-case scenarios.
Re:Hardware RAID by drsmithy · 2007-05-30 04:00 · Score: 1

Assuming a reasonably modern machine, software RAID will be at least as fast - if not faster - than hardware RAID.
I cannot conceive of any reason whatsoever why anyone would spend money for a hardware RAID controller in a home server context.
Re:Hardware RAID by j-turkey · 2007-05-30 05:16 · Score: 1

Assuming a reasonably modern machine, software RAID will be at least as fast - if not faster - than hardware RAID.

I disagree. Do you have any numbers to back this up? Depending on the RAID type, I believe that your comment is pretty far off the mark. For things like RAID 5, hardware is far better than software. First of all, RAID 5 is very heavily dependent on bitwise operations (specifically XOR). Modern general-purpose CPUs are indeed very fast, but have never been as good at certain specific tasks (like XOR) as purpose-built chips. This is exactly why we have things like RAID controllers and cryptographic chips.

I cannot conceive of any reason whatsoever why anyone would spend money for a hardware RAID controller in a home server context.

Again, when using RAID 5, software is far slower than hardware, especially for transaction access. Further, if you're doing any kind of caching, you won't get battery-backed cache without a hardware solution.

Then again, if someone is just trying to do RAID 1, you may be right. However, have you considered that the home server might want to boot to a RAID array? Does your OS support that in software?

--

-Turkey
Re:Hardware RAID by Anonymous Coward · 2007-05-30 07:49 · Score: 0

Jesus, more of this FUD? Hardware RAID5 is useful when you want a battery backed cache and portability between machines. Software RAID5 is faster in every other way. I've run the gamut of Adaptec, 3ware, Areca etc and software wins every time. Then again, my home fileserver is a 4x500GB zfs raidz, so I put my money where my mouth is.
Re:Hardware RAID by drsmithy · 2007-05-30 13:12 · Score: 1

I disagree. Do you have any numbers to back this up?
Yes. I've benchmarked with our own hardware fairly extensively and found software RAID to be faster in every case - which is why we have a bunch of "storage boxes" with expensive 8 - 16 port RAID controllers doing nothing more than exporting individual drives (although were I to do it again today this wouldn't happen, since there are now a few "dumb" 8-port SATA controllers out there). There are also numerous hardware vs software RAID5 benchmarks on the 'net.
For things like RAID 5, hardware is far better than software. First of all, RAID 5 is very heavily dependent on bitwise operations (specifically XOR). Modern general purpose CPUs are indeed very fast, but have never been as good at certain specific tasks (like XOR) as purpose-built chips.
Even a lowly x00MHz P3 can do RAID5 checksumming at 1000MB/sec or more - far faster than even a high-end, double-digit-disk-count 15k SAS drive array is going to sustain during writes (and especially random writes). Modern machines (eg: Core-based Xeons) are up around the 5000 - 6000MB/sec range. The processing "overhead" of RAID5 is insignificant on any remotely modern machine and the RAID5 bottleneck isn't caused by it anyway. It's caused by having to do roughly twice as much physical I/O (operations orders of magnitude slower than any checksumming calculations, even on machines a decade old) as other types of RAID. This is one of the reasons ZFS's RAIDZ is much faster than traditional RAID5 - it avoids much of the additional physical I/O.
Again, when using RAID 5, software is far slower than hardware, especially for transaction access.
No, it's not. There are situations where it might be, but they are generally related to overall hardware limitations (eg: hanging a dozen disks off a single 32 bit PCI bus), rather than anything inherent to the concept.
Further, if you're doing any kind of caching, you won't get battery-backed cache without a hardware solution.
Better to spend the money on a UPS to protect the whole system.
Then again, if someone is just trying to do RAID 1, you may be right. However, have you considered that the home server might want to boot to a RAID array? Does your OS support that in software?
Most OSes will boot from at least a RAID1 array, some from more exotic types. Linux is probably the most difficult to set up to do so, but it's not difficult once you know how.
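
For anyone following the XOR argument above, here is a minimal sketch of what RAID5-style parity amounts to. It is a toy illustration only: the `xor_blocks` helper and the 4-byte blocks are made up for the example, and no real controller, md, or ZFS code looks like this.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# One stripe across three data disks plus one parity disk (toy 4-byte blocks).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(*data)

# Lose any one data block: rebuild it from the survivors plus parity.
rebuilt = xor_blocks(data[0], data[2], parity)
assert rebuilt == data[1]

# Small write to disk 1: read old data and old parity, then write new data
# and new parity. That is 2 reads and 2 writes for one logical write, which
# is where the extra physical I/O of traditional RAID5 comes from.
new_block = b"\xff\x00\xff\x00"
new_parity = xor_blocks(parity, data[1], new_block)
assert new_parity == xor_blocks(data[0], new_block, data[2])
```

The XOR itself is a handful of cheap instructions per byte; the read-modify-write cycle around it is the expensive part.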

Comment removed by account_deleted · 2007-05-30 01:22 · Score: 2, Interesting

Comment removed based on user account deletion

FUD alert! by Anonymous Coward · 2007-05-30 01:23 · Score: 0

Once upon a time a new outside broadcast truck was being built. The new boss made changes to the design to use domestic televisions instead of $5000 D1 monitors. Guess what? The teevee sets fell apart and the boss became a laughing stock. The proper monitors had to be put in, at great expense.
Computer gear is a bit different. Six years ago a 100GB real-time disk array cost $100000. A cheaper 100GB of storage could then have been built with a Pentium PC and four 33GB IDE drives, raided together to set you back $3000 or so. A spare could have been built for another $3000 and the systems rsync-ed on a daily basis. Perhaps new disks could have been bought in the years between, maybe another $3000 worth. Assuming the 'linux' box did the job and delivered the 1's and 0's quick enough, what would you prefer, $90000 in the bank or a very expensive and not so powerful real-time disk?
As it worked out we paid for the $100000 box. It did mess up a few times, even needing new drives. It also sounded like a rocket, not indicative of being energy efficient. Not one client gave a damn about what we were using for storage, however, we did have to get them to pay lots of money for such toys and we did have to manage the financial risk.
Looking back I think the homebrew solution could have been more fun, perhaps to give a business edge and an opportunity to gain useful skills instead of niche knowledge.
Clearly the solution has to suit the application, and in this case some budget could be spent on cache RAM, hopefully to provide a really interesting setup. If things are DB intensive then some of the indexes could stay in the cache-RAM, maybe to give better performance than with the $50000 SAN and with no application re-writes.

This will work by gweihir · 2007-05-30 01:23 · Score: 1

I have one Linux fileserver with 5TBs for some time now, the only issue ever was a dead PSU. Fixed that by replacing Fortron with Enermax.

Only thing I would do differently now is to use PCI-E SATA controllers for bandwidth (my server has PCI). Linux software RAID is perfectly up to the task. ZFS should also be.

One thing: Do temperature and SMART monitoring on all drives and run a long SMART self-test every two weeks or so. Also have the RAID and SMART monitoring software send email on problems and have at least one spare disk ready to use.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Another option by doubledjd · 2007-05-30 01:26 · Score: 1

I'm not that well versed in this stuff so make sure to do your own research. (the disclaimer is now out of the way).
You can't compare zfs to SAN. I didn't need a san but looked at nas. Netapp is great but still had a few limitations.

* The discs were proprietary.
* Redundancy: going with the redundant head meant bumping up to a new level in cost.
* Chassis size was set.
* The cost ran up fast for the software to do the things I wanted, all proprietary (which was alright considering).

Still, netapp is quite a quality product.
I wanted a cheaper solution that had more to it so I looked at coraid.

* They use an AoE (ATA over Ethernet) protocol that is lighter than TCP/IP.
* It can run with off-the-shelf SATA drives.
* A gateway can be used if you want to add more shelves.
* They have a redundant gateway solution as well.

We'll be putting it in this quarter.
Having said this, I'm unsure of the state of lvm2 for block level snapshots. This was one thing that netapp did well. Also, be selective on the drives you choose. This is only another option you can look into.

Re:Another option by Anonymous Coward · 2007-05-30 02:38 · Score: 0

I think you're mistaken on one issue regarding NetApp. I've worked with them for years, and they've never had proprietary disks in their systems. The high-end has been FCAL (and more recently SATA as well), and the low-end has been ATA (which is now being replaced by SATA). Nothing proprietary about it.
Re:Another option by doubledjd · 2007-05-30 04:38 · Score: 1

Ah, my sincere apologies. You are correct.
I'm a bit embarrassed :). I confused that with another company I was reviewing around that same time period. Their name escapes me...but I will find out. (I'd hate to say the wrong company)
thanks for the correction

Openfiler.com by Anonymous Coward · 2007-05-30 01:35 · Score: 0

Take a look at this solution: www.openfiler.com. Supports CIFS, NFS, FTP, iSCSI, and WebDAV protocols. Provides snapshots, quotas, and active directory integration. It can use builtin storage or act as a gateway to existing fibre channel or iSCSI storage. Plus, you can receive commercial support for it.

Re:Congradulations, you discovered the "File Serve by pla · 2007-05-30 01:36 · Score: 2, Informative

If you need more than 1-3 TB, you can't use generic components

Why not?

Sure, a 16-channel SATA controller with RAID 0/1/5 will cost you $400. But that will handle, using 750GB drives that have recently entered the "affordable" range, a total of 12TB (or more practically, a 10.5TB RAID5 with one hot spare). Find an OEM that can set you up with that for under $5000 total.

Now, that uses a PC chassis and wouldn't look "nice" in a rack. So what? If you need 10TB and don't want to blow $50k on it, you don't have a lot of choices... So if you insist on all racked equipment, buy a rack shelf kit and lay it on its side (and hide it with blanks if you care that much) ;-)
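
The capacity figures above are easy to sanity-check. A quick sketch, assuming a single RAID5 set and ignoring filesystem overhead and the decimal vs binary TB difference; the function name is made up for the example:

```python
def raid5_usable_tb(total_drives: int, drive_tb: float, hot_spares: int = 0) -> float:
    """Usable capacity of one RAID5 set: active drives minus one drive of parity."""
    active = total_drives - hot_spares
    if active < 3:
        raise ValueError("RAID5 needs at least 3 active drives")
    return (active - 1) * drive_tb

raw_tb = 16 * 0.75                                     # 16 ports of 750GB drives
usable_tb = raid5_usable_tb(16, 0.75, hot_spares=1)    # 15 active, 14 hold data
print(raw_tb, usable_tb)                               # 12.0 10.5
```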

The *other* Google solution by dkegel · 2007-05-30 01:37 · Score: 1

Don't forget Zumastor (http://code.google.com/p/zumastor/).

It's kind of like a homegrown subset of ZFS, but
it already has online remote replication, which
is one of the key features of Netapp NAS boxes.
(People are starting to add that to zfs; see also
http://milek.blogspot.com/2007/03/zfs-online-replication.html )

ZFS current drawbacks by Anonymous Coward · 2007-05-30 01:42 · Score: 0

Does Vista run ZFS yet? I know Apple isn't supporting ZFS until 10.5. I have a NAS that uses the ZFS file system, and I have had limited use of it because I cannot access it on half of my hardware due to the OS.

ZFS ftw! by GuyverDH · 2007-05-30 01:42 · Score: 2, Interesting

I've been using ZFS for quite some time, and have yet to have any form of failure or data corruption. I've used it with simple JBOD, SAN attached, even USB attached drives - it just plain works. With Solaris 10 U3, and the latest revs of OpenSolaris, you have access to RAIDZ2 - which gives you double parity, even more protection. Snapshotting can be scripted to run as often as you like. I keep 2 months' worth of snapshots, taken every 5 minutes, day in and day out. You can disassemble the system, scramble the drives, reattach and bring the system back up and all will be well. ZFS will just re-assemble the pool and continue. You can replace drives on the fly. Let's say you make your 12 disk enclosure with 750GB drives. Now, 2 years later you want to replace them with 2TB drives. Simply use the zpool replace command to replace each drive, wait for it to resilver and rebuild the data on the replaced drive, then move on to the next drive. Once every drive has been replaced, the pool grows to the new size. This would grow your (assuming 12x750GB in a RAIDZ2) 7.5TB pool to a 20TB pool, without downtime. With OpenSolaris on x86, you can even boot off of ZFS now, so use ZFS to mirror your boot disk with a like drive and you should be good for quite some time... Using something like Sun's Thumper system, you can get a 12TB system for less than 30k (for those who have something akin to a budget). ZFS, it's fast, safe, secure - and I enjoy working with it (as I don't have to do much with it).

--
Who is general failure, and why is he reading my hard drive?
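
A rough sketch of the 7.5TB to 20TB arithmetic in the drive-replacement scenario above. It assumes a single raidz2 vdev in which every member is treated as being the size of the smallest disk, and it ignores metadata overhead; `raidz_usable_tb` is a made-up helper, not a ZFS command or API.

```python
def raidz_usable_tb(disk_sizes_tb: list[float], parity: int = 2) -> float:
    """Rough usable space of one raidz vdev: every member counts as the
    smallest disk, and `parity` disks' worth of space goes to parity."""
    if len(disk_sizes_tb) <= parity:
        raise ValueError("need more disks than parity devices")
    return min(disk_sizes_tb) * (len(disk_sizes_tb) - parity)

disks = [0.75] * 12
print(raidz_usable_tb(disks))                  # 7.5

history = []
for i in range(len(disks)):                    # swap in one 2TB disk at a time
    disks[i] = 2.0
    history.append(raidz_usable_tb(disks))

# Capacity stays put until the last 750GB disk is gone, then jumps.
print(history[0], history[-2], history[-1])    # 7.5 7.5 20.0
```

Under that assumption the extra space only shows up after the final replacement, since the smallest remaining disk sets the size for the whole vdev.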

Re:Congradulations, you discovered the "File Serve by BigBuckHunter · 2007-05-30 01:46 · Score: 1

Depends on needed storage space. If you need more than 1-3 TB, you can't use generic components, price goes up and hardware starts taking more space than dedicated NAS box. Or Tyan 2U-4U boxes are DIY in your country.

With the Via config and the Supermicro drive cages I outlined in my post, you can effectively have 15 drives in a medium tower case. The case and basic components will run you around $250 - 300 US. All that remains is your distro of choice and the HDDs. Grab FreeNAS and 15 1TB HGST drives and you now have a 14TB fileserver.

My main complaint about the Via reference boards is that few of them come with gigabit ethernet. I usually add a gigabit ethernet card to the single PCI slot. I'd go with an AMD or Intel, but the lack of features and price tend to put me off. The use of SATA port multipliers alleviates the need for HW RAID.

BBH

References
http://www.cooldrives.com/cosapomubrso.html
http://www.supermicro.com/products/accessories/mobilerack/CSE-M35T-1.cfm

Google Search Appliance by PIPBoy3000 · 2007-05-30 01:47 · Score: 1

Ironically, just a few weeks ago our Google Search Appliance lost two drives in a few days (resulting in a new server being shipped to us after some hassle with customer service). One theory is that the drives don't handle losing power very well (we had a power outage). My guess is that under ideal conditions, the failure rates are about the same. When things go wrong (e.g. hotter than normal, power issues, etc), perhaps the SAS drives have an advantage.

For size, maybe... by Ryan+Amos · 2007-05-30 01:49 · Score: 1

On sheer size, yeah, you can cram a lot of drives into it and SAN them together, but I think you're going to find the performance of a low-end iSCSI SAN to be lacking. I use one as the disk storage component of a disk-to-disk-to-tape backup system, and even then it can be a performance bottleneck. It was a $3000 array loaded with $1200 worth of drives, so I'm not complaining, but if you're looking for blazing fast storage, DAS USCSI360 or SAS is gonna whoop its ass.

In other words: You get what you pay for. Capacity, reliability, speed. Pick any two or be prepared to pay 10 times as much.

Re:For size, maybe... by FlyingGuy · 2007-05-30 02:27 · Score: 1

The phrase you're looking for is...

Speed - Quality - Price

Pick any TWO

--
Hey KID! Yeah you, get the fuck off my lawn!

ZFS by slashthedot · 2007-05-30 01:50 · Score: 1

could be on the way to make NetApp solutions irrelevant in the future. Why pay $$$ when you can get something that does the same thing and much more and that is free? ZFS is showing a lot of promise.

Use ZFS by Slashcrap · 2007-05-30 01:50 · Score: 2, Funny

Why? Because it's 128 bits! One hundred and twenty eight fucking bits! That's 64 bits more than any other FS, so any fool can see that it's twice as good as the alternatives.

It may be new and untested, but that's hardly important in the face of 128 fucking bits is it? Besides it's designed by Sun engineers and nobody has more experience in FS design and implementation. That's why all the previous Solaris filesystems rocked so hard. Nothing can beat UFS in terms of stability and performance after all.

Oh, and I nearly forgot - because it's made by Sun it's going to be 10 times as Enterprisey as any half-baked, so-called "tried and tested" RAID/SAN solution that those other suppliers are going to come up with.

Quite frankly, the fact that you're even asking this question suggests you are guilty of criminal hype evasion.

Paper by fuliginous · 2007-05-30 01:51 · Score: 1

I know there are lots of downsides (thinking loss of the digital advantages) but you make it sound like when it comes to image storage old 35mm negatives seem to be competitive.

Re:Paper by maxume · 2007-05-30 04:19 · Score: 1

The big downside there is that storing a couple thousand images ends up costing a couple of thousand dollars (to actually get the slides) and taking up several cubic feet of storage. Spending that money on ever larger redundant storage means that you have fewer problems as time goes on, because each physical unit of storage holds more and more of your information (of course, that also means that losing one unit gets more and more painful).

--
Nerd rage is the funniest rage.
Re:Paper by fuliginous · 2007-05-30 22:20 · Score: 1

I didn't mean adding the bulk of slides just the negatives as little strips.

Sadly. by Lethyos · 2007-05-30 01:53 · Score: 1

After re-reading the article and your comment, I was reminded of this unfortunate fact. Speculating on the reasons, I am not so convinced it is a matter of competitive advantage (as suggested by StorageMojo). Even if someone else set up shop with massive storage using GFS, they still would not offer the services Google provides. And with the exception of Gmail (whose limitless storage is not the most compelling feature), end-users (typically?) do not consider how Google stores data. If I am right, it may only be a matter of time before Google makes GFS available to the masses.

--
Why bother.

Re:Sadly. by Catbeller · 2007-05-30 04:27 · Score: 1

Q: Does Google claim to own the rights to access the data, or even to own the data? And they do hand it over to warrantless searches by secret police types, not so? Not asking idly; I tend not to use "free" services as they claim to own the data I upload.
Re:Sadly. by Anonymous Coward · 2007-05-30 04:55 · Score: 0

Q: Do you have to upload the key that decrypts the data? Does Google steal your data to use as an entropy pool for their random number generator?
Re:Sadly. by Catbeller · 2007-05-30 14:08 · Score: 1

Emp. Encryption has a shelf life, not that I'm terribly worried about the RNC reading my archives. Encrypted files will be decrypted using future technology much faster than we think the breakthroughs will come. I'm also not worried about my data being "stolen", as it is on a foreign server and is therefore theirs. What I wonder, rhetorically, is if they claim ownership over the bits on their server. It's a trick question, as the answer is "yes", firstly, and even if it were not, the terms of your contract with Google gives them the option of changing the terms at their pleasure. So, good place to stash your artwork, not so good your email archives and chat logs. Young Americans grew up with no idea of privacy at home or in school, so it's hard to make the point about the lack of it.

Re:Congradulations, you discovered the "File Serve by numbski · 2007-05-30 01:58 · Score: 2, Informative

What amazes me is all the talk of iSCSI, but almost no mention of AoE (ATA over Ethernet).

What you have is a box that exports block devices out over layer 2. Another device loads it as a block device, and can now treat it in whatever fashion it could deal with any other block device, so for example I have 2 "shelves" of Serial ATA drives going. I have a third box that I could either load Linux on, using md to create raid sets, or what I've actually done is used the hardware on each of the two shelves, created a raid5 set on each, then used md to create a raid1 set out of the two raid5's. I then take my spankin' new md0 device which is huge for my needs (7.5TB), use LVM to create a volume group (called 'office' for me) and that creates /dev/office. Then I create several lv's (logical volumes) of arbitrary size beneath *that*. So I have /dev/office/home, /dev/office/mp3, /dev/office/blah, etc.

Now you can format those lv's like any other partition/slice. I've used xfs on all of mine, but you could use ext2/3 if you really wanted.

--

Karma: Chameleon (mostly due to the fact that you come and go).

You run a couple of business? What kind? by porky_pig_jr · 2007-05-30 01:59 · Score: 1

Is "Psychiatric help - 5 cents" one of them?

It's about the admin, not the hardware by Colin+Smith · 2007-05-30 02:00 · Score: 1

NAS stuff tends to be plug and play. No admins required. Or at least, minimal admins.

--
Deleted

Re:It's about the admin, not the hardware by brunson · 2007-05-30 03:24 · Score: 1

Yeah... riiiiight.

--
09F911029D74E35BD84156C5635688C0
Jesus loves you, I think you suck
Re:It's about the admin, not the hardware by BigBuckHunter · 2007-05-30 20:20 · Score: 1

NAS stuff tends to be plug and play. No admins required. Or at least, minimal admins.

I agree, but this is a function/feature of the software, and not the underlying hardware. Distributions like FreeNAS are purpose built to afford users "plug and play" NAS features on commodity hardware.

BBH

ZFS vs AIX storage management? by porky_pig_jr · 2007-05-30 02:03 · Score: 1

I remember working with AIX many years ago (that was the time when IBM started exploring Unix and the RISC architecture) and remember very well how with SMIT (their management system) you join all the hard drives into a common pool, and then define the partitions out of that pool. Isn't that something similar to ZFS? If so, then IBM was the first, at least in terms of concept.

Re:ZFS vs AIX storage management? by raftpeople · 2007-05-30 03:11 · Score: 1

IBM has been doing this in their operating systems for as long as I have used them (late 70's), and probably longer.

1-Gig Ethernet will also be your bottleneck... by Anonymous Coward · 2007-05-30 02:03 · Score: 0

...unless your actual net average I/O load profile is fairly light. iSCSI over gigabit ethernet fell on its face under load for us. Trying to mount an MS SQL Server database that lives on an iSCSI target over the network may be alright for a trivially small number of users hitting a trivially small database with a trivially small I/O load (e.g. fewer than 25 users hitting an 8GB database with 5-10 queries per minute per user and a fair amount of full table scans due to a stupidly written app that's beyond my control) before the SQL server is brought to its knees waiting for iSCSI I/O over the network. That same database when run on a locally attached Ultra-160 SCSI drive (yesteryear's technology) absolutely smokes the iSCSI setup we tested.

After finding this dismal performance, we knew we could not even think of mounting a much larger Oracle database via iSCSI over gigabit ethernet. That would be a stupid joke.

Simple SMB filesharing (user's Word & Excel docs, PDF's, etc) of our users' network home folders over iSCSI over gig ethernet does work acceptably well, however, since that doesn't really generate all that much network traffic.

In conclusion, I've come to the determination that iSCSI is basically an academic curiosity that was created just because somebody thought it was cool to encapsulate the SCSI protocol in IP packets, and with only a light traffic load, such as a home network or a (very) small business network, and over only gigabit ethernet, the performance is adequate, but why not just use simple and cheap attached big hard drives to your server instead? That is much less complicated.

Maybe with 10 gig ethernet, the performance bottleneck might be less of an impact, but I have no 10 gig hardware to play with.

Re:1-Gig Ethernet will also be your bottleneck... by drsmithy · 2007-05-30 02:52 · Score: 1

That same database when run on a locally attached Ultra-160 SCSI drive (yesteryear's technology) absolutely smokes the iSCSI setup we tested.
It's unfortunate you've had a bad experience with iSCSI, but IMHO your experience isn't inherent to, or indicative of, the technology.
In conclusion, I've come to the determination that iSCSI is basically an academic curiosity that was created just because somebody thought it was cool to encapsulate the SCSI protocol in IP packets, and with only a light traffic load, such as a home network or a (very) small business network, [...]
iSCSI - even a single GbE link - has loads of bandwidth for typical usage patterns. 100MB/sec is a *lot* of data to be reading and writing consistently and you need a fairly beefy disk array to exceed it, outside of sequential data streaming applications.
For a database, 100MB/sec is a *lot* of work. Heck, even for typical fileserver usage, it's a lot. This is before getting into bonded links - something like an x4100 has 4 GbE links onboard - bonded together you're into 4Gb FC territory (although it won't be as fast, it's in the ballpark).
[...] and over only gigabit ethernet, the performance is adequate, but why not just use simple and cheap attached big hard drives to your server instead? That is much less complicated.
Off the top of my head:
* Unless you're in a streaming situation, you'll need a lot of fast disks attached to exceed even a single GbE iSCSI link.
* Your data is tied to that server.
* Moving to some sort of clustering configuration with a shared storage architecture is significantly more difficult.
* Upgrading the IO capabilities of your single server can be difficult, if not impossible (eg: 2U server already stacked out with 6 15k drives, or 1U server that maxes out at two drives). Upgrading the IO capabilities of something on the other end of an iSCSI link is significantly easier. This principle applies to both performance and raw space (eg: we have an "archiving" server with ~10TB of iSCSI-attached storage. Increasing that by, say, another 6TB via iSCSI is trivial, because it's just a matter of plugging another array into the LAN. If all that space was internal, increasing it would be much more difficult.)
iSCSI is going to own the low-end and mid-range SAN infrastructure market within 5 years. Even at single GbE link speeds, performance is adequate for most applications and gig ethernet infrastructure is _vastly_ cheaper than FC infrastructure - especially since it lets you piggyback your non-storage traffic over the same physical connections.
Maybe with 10 gig ethernet, the performance bottleneck might be less of an impact, but I have no 10 gig hardware to play with.
Unless you're in a relatively uncommon scenario, it's unlikely the single-GbE iSCSI link was the bottleneck.
Re:1-Gig Ethernet will also be your bottleneck... by PilotDvr · 2007-05-30 06:28 · Score: 1

My experience is the same as yours... I would bet that the parent poster may have had a problem with his CPU binding on the TCP/IP processing overhead... that was also my experience. When I moved to a dedicated TCP/IP offload iSCSI adapter, my performance problems went away. We have both MSSQL and Oracle apps supporting several hundred users each. Granted it is not 'extreme' usage, but definitely a higher volume than what he indicated.

For most businesses it's not just the storage by Coward+Anonymous · 2007-05-30 02:05 · Score: 1

Your ZFS system could work for SOHO where things such as uptime, disaster recovery, flexible provisioning of disks, speed and support, for instance, are not as critical. Let's look at the list:
1. uptime - so you've raided the box. What happens when the power supply fries or the CPU fails? Redundancy is typically dealt with through multi-pathing in SAN, clustering in NAS.
2. disaster recovery - your box is lost in a flood/earthquake/burglary. Did you have an efficient data remoting solution to a DR site? Putting all this data on DVDs is not practical.
3. flexible provisioning of disks - both SAN and NAS offerings have loads of features to allow provisioning and re-provisioning of storage for different usage. In the NAS world it even goes as far as integration with specific Windows apps (Exchange, SQL, etc.)
4. Speed - Think about a box that can push 1GBytes/sec of data over multiple links from a single filesystem.
5. Support - who's gonna fix the box when it fails or has bugs? Most businesses don't care to do that themselves.

There are many other reasons why ZFS+COTS ain't there yet. Where ZFS+COTS is potent is for upstart NAS vendors looking for an easy entry point where a lot of the heavy lifting has already been done. It can be used as a base onto which many features/functionality still need to be added.
In any case, at the low end, you can buy something from Buffalo, Infrant/Netgear or even StoreVault/Netapp if you want a "high-end low-end" box.

Re:For most businesses it's not just the storage by HairyCanary · 2007-05-30 06:06 · Score: 1

Well said. You just described all the reasons that we currently use NetApp filers where I work. Transparent realtime mirroring of data to another filer a couple thousand miles away, plus having NetApp automatically respond to failures by sending out hardware and a tech in just a few hours, that is what makes a real storage solution priceless. We used to have to do it the cheap way -- and while rewarding in its own way, there is a certain bliss to being able to simply ignore the storage for the most part and focus energy on other concerns.

I'm a Hardware Guy, Not a ZFS Guy by OS24Ever · 2007-05-30 02:06 · Score: 2, Interesting

I'd like the following scenarios explained.

RAID0 = bunch of hard drives strung together, look like one big drive. In the implementation I'm referring to, the data is striped to a block size and written across each disk simultaneously (or nearly). This is the fastest disk subsystem available but the most susceptible to failure. If one disk fails, you're toast.

Does ZFS do anything in this situation? I have 'one big drive' presented to the operating system, the striping is abstracted at the hardware layer, and I have a semi-expensive ($300 - 800) RAID card running this. In a random I/O workload I get about 150 iops per drive, streaming is another ball of wax typically more interface limited/block size limited than head movement.

To save money, I'd drop the RAID card and put ZFS down. I now have 12 drives (SAS attached), can I get better performance with ZFS like I could with the RAID Card? Think Log Drives for a big DB, or scratch storage space while manipulating a metric assload of video files. Gbit / sec transfer rates for real-time storage of HD Video. In a random I/O workload I get about 150 iops per drive, streaming is another ball of wax.

RAID 0+1. All the perf benefits of RAID0, but 2x the drives. Typically two cabinets RAID 0'd then RAID 1 the two RAID 0s. I get redundancy at a slight penalty of performance due to 2x the writes happening, but no degradation on reads.

What can ZFS do for me here? Again, performance improvements/changes?

RAID5

50% penalty in performance even with the best card because in a high drive count you have to read in data, calc, then do the write. However one drive gives me full redundancy; if I lose a second, though, I'm toast. RAID6 sometimes is used to describe distributing the hot spare into the array, so no more disk space but can take two simultaneous drive failures and keep running.

What does ZFS do for us here?

This isn't a troll, I really know nothing about ZFS and I'm really curious how I could not have to do the above to protect my photography / video data for my photography business. Would be cool if I could do it on my Mac Pro

--

As a rock-in-roll Physicist once said, No matter where you go, there you are.

Re:I'm a Hardware Guy, Not a ZFS Guy by tbuskey · 2007-05-30 02:52 · Score: 1

ZFS can do RAID1 (mirror). I'm not sure if it does a concat or RAID0. It does RAIDZ which is like RAID5 w/o the write penalty. And RAIDZ2 which is like RAID6 (2 parity disks). You can layer them; A mirror of 2 RAIDZ pools for example.

ZFS does error correction. If you have a firmware issue in your controller or a bad cable, ZFS will detect it and correct on a RAID. It will do a bit of protection on non-RAID too. Most other filesystems will silently pass the corrupted data along. IMO this is the most important reason to use ZFS. Yes, I've lost a filesystem due to a firmware bug.

I have a system and get 60MB/s write with 4 500GB SATA drives in RAIDZ. Gigabit ethernet is about 20MB/s. I haven't done any tuning and am using cheap hardware (ethernet $20 card, $20 switch, motherboard SATA). My setup was $2k. If you need a more robust, fast platform, look to Sun hardware (the Thumper) that's been designed with this in mind

I'm running Solaris 10u3 which has been tested more extensively than Solaris Express/OpenSolaris. Many of the extra drivers & features from those will end up in 10u4 (iSCSI target).

I access the fileserver from my MacOSX server via NFS or Samba. MacOSX 10.5 has been seen with ZFS in it also.
Re:I'm a Hardware Guy, Not a ZFS Guy by darrylo · 2007-05-30 03:13 · Score: 5, Informative
OK:
- ZFS w/flaky hardware (scary): http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta
- Self Healing with ZFS: http://www.opensolaris.org/os/community/zfs/demos/selfheal/
- 100 Mirrored Filesystems in 5 minutes: http://www.opensolaris.org/os/community/zfs/demos/basics/
And, for more than you wanted to know about ZFS: http://en.wikipedia.org/wiki/ZFS
Re:I'm a Hardware Guy, Not a ZFS Guy by MauriceV · 2007-05-30 03:18 · Score: 1

>Gigabit ethernet is about 20MB/s.

Gigabit ethernet is about 120 MB/s.
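
For reference, a back-of-the-envelope sketch of where a figure near 120 MB/s comes from. It assumes standard 1500-byte frames, IPv4 and TCP with no options, and a link that is otherwise idle; real-world results depend on NICs, switches, and drivers.

```python
LINE_RATE_BITS = 1_000_000_000          # gigabit Ethernet signalling rate
MTU = 1500                              # standard (non-jumbo) payload size
WIRE_OVERHEAD = 8 + 14 + 4 + 12         # preamble + header + FCS + inter-frame gap
IP_TCP_HEADERS = 20 + 20                # IPv4 + TCP, no options

raw_mb_s = LINE_RATE_BITS / 8 / 1_000_000
efficiency = (MTU - IP_TCP_HEADERS) / (MTU + WIRE_OVERHEAD)
tcp_payload_mb_s = raw_mb_s * efficiency

print(f"raw: {raw_mb_s:.0f} MB/s, TCP payload: {tcp_payload_mb_s:.1f} MB/s")
# raw: 125 MB/s, TCP payload: 118.7 MB/s
```

Anything far below that, like 20MB/s, points at the endpoints or the protocol on top rather than the wire itself.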
Re:I'm a Hardware Guy, Not a ZFS Guy by Deadplant · 2007-05-30 04:47 · Score: 1

Sure, gig-e is 120MB/s in theory but you missed the part about his gigabit switch costing $20 ;)
Re:I'm a Hardware Guy, Not a ZFS Guy by drew · 2007-05-30 04:56 · Score: 1

RAID 0+1. All the perf benefits of RAID0, but 2x the drives. Typically two cabinets RAID 0'd then RAID 1 the two RAID 0s. I get redundancy at a slight penalty of performance due to 2x the writes happening, but no degradation on reads.

ZFS aside, wouldn't you be safer doing this the other way? Using n disks, make n/2 RAID 1 arrays, and stripe those arrays together using RAID 0. Should give the same performance as your solution, but with slightly better reliability. Using your version, a second disk failure would bring down the whole array if the disk is anywhere in the opposite RAID 0 setup as the first disk (slightly higher than a 1:2 probability), while in this setup, a second disk failure would only cause data loss if it is in the same RAID 1 array, which would be a 1:n-1 probability.

--
If I don't put anything here, will anyone recognize me anymore?
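
The 1:2 versus 1:n-1 odds above can be worked out exactly. A small sketch, assuming the second failure hits one of the surviving disks uniformly at random and that in RAID 0+1 the first failure has already taken its whole stripe offline; the function name is made up for the example.

```python
from fractions import Fraction

def second_failure_fatal(n_disks: int) -> tuple[Fraction, Fraction]:
    """Chance that a second, randomly chosen disk failure loses the array,
    for (RAID 0+1, RAID 1+0) built from the same n disks."""
    if n_disks < 4 or n_disks % 2:
        raise ValueError("need an even number of disks, at least 4")
    survivors = n_disks - 1
    raid01 = Fraction(n_disks // 2, survivors)   # any disk in the other stripe
    raid10 = Fraction(1, survivors)              # only the failed disk's mirror partner
    return raid01, raid10

for n in (4, 8, 12):
    r01, r10 = second_failure_fatal(n)
    print(n, r01, r10)
# 4 2/3 1/3
# 8 4/7 1/7
# 12 6/11 1/11
```

The gap widens as the array grows: RAID 0+1 stays a bit above one in two, while RAID 1+0 keeps shrinking.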
Re:I'm a Hardware Guy, Not a ZFS Guy by OS24Ever · 2007-05-30 05:16 · Score: 1

Safer - heck yeah.

Cheaper? Heck no :)

I do the safer route now, but when the 2TB of space I use runs out, I'm gonna need more. If I could add more later at a lower cost, that'd be nice.

--
As a rock-in-roll Physicist once said, No matter where you go, there you are.
Re:I'm a Hardware Guy, Not a ZFS Guy by SteveOU · 2007-05-30 13:32 · Score: 1

RAID0: yes, effectively. You can have multiple vdevs in the zpool. ZFS stripes data in the top-level vdevs, so there's your performance gain.
RAID0+1: I don't believe so (you cannot nest vdevs)
RAID1+0: yes. multiple mirror sets in the zpool, with striping among the top-level vdevs.
RAID5: yes, effectively. Called raidz or raidz2 (for double parity). Unlike RAID5, the stripe width is variable-sized, based on the size of the write and number of vdevs. Has the additional benefit of eliminating the RAID5 write-hole and eliminating readbacks for parity calculations. That alone is a big performance improvement in our environment.
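
The layouts above can be compared with a back-of-the-envelope capacity sketch in Python. This is only an upper bound under simple assumptions: every disk in a vdev is the same size, and metadata, padding, and reserved space are ignored, so a real pool will report somewhat less. The layout names and function names are illustrative and are not ZFS commands.

```python
PARITY_DISKS = {"stripe": 0, "raidz": 1, "raidz2": 2}


def usable_gb(layout, disks, disk_gb):
    """Rough usable capacity of one top-level vdev, in GB."""
    if layout == "mirror":
        if disks < 2:
            raise ValueError("a mirror needs at least 2 disks")
        return disk_gb  # every disk holds the same data
    if layout not in PARITY_DISKS:
        raise ValueError("unknown layout: %s" % layout)
    parity = PARITY_DISKS[layout]
    if disks <= parity:
        raise ValueError("not enough disks for %s" % layout)
    return (disks - parity) * disk_gb


def pool_gb(vdevs):
    """Capacity of a pool given (layout, disks, disk_gb) tuples.

    Data is striped across the top-level vdevs, so their
    capacities simply add up.
    """
    return sum(usable_gb(*v) for v in vdevs)


if __name__ == "__main__":
    print(usable_gb("raidz", 5, 500))        # single raidz vdev
    print(pool_gb([("mirror", 2, 500)] * 3))  # RAID1+0-style pool
```

Five 500 GB disks in one raidz vdev come out at 2000 GB by this estimate, and three 2-way mirrors of the same disks at 1500 GB.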
Re:I'm a Hardware Guy, Not a ZFS Guy by duffbeer703 · 2007-06-01 03:19 · Score: 1

Ben Rockwood is THE OpenSolaris/ZFS guy... read his blog to learn more about it:

http://www.cuddletech.com/blog/pivot/entry.php?id=775
http://www.cuddletech.com/blog/pivot/entry.php?id=729

Also the OpenSolaris page:
http://opensolaris.org/os/community/zfs/

ZFS is a kick-ass filesystem and a volume manager, and a raid controller. It's hard to categorize. You get things like snapshots, thin provisioning, etc without shelling out for a NetApp filer.

To me one of the best things about ZFS is error correction. You can very easily lose data with RAID-5 if you're getting silent block level corruption that the controller doesn't know about. Because it's silent, you end up backing up corrupt data without even knowing about it. I got burned pretty badly by this a few years back when we ordered a batch of disks that turned out to be junk. ZFS would have detected that problem, because it checks checksums on blocks.
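
A toy illustration of why per-block checksums catch this kind of thing, in Python. It is not how ZFS is implemented: the record layout and SHA-256 here are stand-ins chosen for the sketch. The point is only that a checksum stored apart from the data lets the reader notice a flipped bit that the disk or controller never reported.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when a block no longer matches its stored checksum."""


def store(block):
    """Keep the data and its checksum as separate fields (toy layout)."""
    return {"data": bytearray(block), "sum": hashlib.sha256(block).digest()}


def load(record):
    """Return the data only if it still matches the stored checksum."""
    if hashlib.sha256(bytes(record["data"])).digest() != record["sum"]:
        raise ChecksumError("block failed verification")
    return bytes(record["data"])


if __name__ == "__main__":
    rec = store(b"quarterly numbers")
    rec["data"][3] ^= 0x01  # simulate a silent single-bit flip on disk
    try:
        load(rec)
    except ChecksumError as err:
        print("caught:", err)
```

A plain RAID-5 controller would hand the flipped block back as good data, and it would go straight onto the backup tapes.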

Sun is using ZFS to move storage back to the server. You can spend $50k on a 4U Thumper with 24TB of disk on 48 spindles, and use ZFS to move data at Gigabit line speeds with about 10 minutes of work. I think that in the long run as the fault tolerance gets better on the server (there's a single point of failure in the Thumper) they will have viable alternatives to NAS devices.

--
Conformity is the jailer of freedom and enemy of growth. -JFK

backups by PurPaBOO · 2007-05-30 02:10 · Score: 1

How do you back this lot up? :-)

--
If it weren't for the rocks in its bed, the stream would have no songs.

Re:ZFS and Sun boxes by cayenne8 · 2007-05-30 02:15 · Score: 1

"I think you'll a bit high. I put together a 5-500Gb Sata II disk setup with Raid-Z in a 5 disk enclosure for under $1000. I run it off my Sunfire v20z. That's 2 TBs for under 1k USD!"

I just recently acquired a sunfire 280R...and am looking for storage on it. I got a good deal, and have always wanted a Solaris box to play/learn from.

But, from reading about it...it mentions wanting to use FC (Fiber Channel) harddrives...I'm familiar with SATA and IDE...but, the FC ones are new to me..

I've been looking to get a T3 or Storedge array on eBay to hook to it...but, if there were a way to hook a new enclosure like you described, I'd like to go that way. Do you know much about hooking HD's to something like a 280R, or could you point me to links where I can learn more about this?

I got the machine for a steal, but, found it only has 2 HD's in it....and I want to use it as a home server, and want to really add more storage space to it....

Thanks in advance!

--
Light travels faster than sound. This is why some people appear bright until you hear them speak.........

It will only do so if... by csoto · 2007-05-30 02:16 · Score: 1

...everyone stops implementing NAS/SAN and instead implements ZFS. Duh!

--
There exists no way of exchanging information without making judgments. --Bene Gesserit Axiom

Let's pretend you're right for a moment... by SanityInAnarchy · 2007-05-30 02:16 · Score: 5, Interesting

It seems to me that even if the entire setup is prone to failure, all you really need is a gigabit crossover or two running to an identical setup. I don't know if ZFS does anything like this, but I can think of at least one way to make it work on Linux: DRBD + OCFS2 + heartbeat. If you're smart, you can even do some load balancing, at least until one of them fails -- and when that happens, the other should be able to take over very quickly, if not instantly -- Linux heartbeat means it would simply take over the other machine's IP and start its services.

So, that's $6k total instead of $3k.

The one problem I have with OCFS2 is that when it fences a system, it tends to either bring the whole thing down (kernel panic), or in newer versions, give you the option of forcibly rebooting instead. This killed it for a project I was working on, where one of the machines had other mission-critical systems running that were not on the OCFS2, and thus, it seemed retarded to panic and bring down everything else too.

So if that's your problem, you can always build a third, identical system to run the other stuff on. $9k.

Even if you figure another $1k for random stuff, like maybe a LOT of gigabit crossovers, or 10gig fiber, or something, that's still a fifth of the cost of the "business-grade" or whatever else he was considering. Even assuming the worst-case scenario, where the homebrew system costs a lot more to maintain (even electricity and cooling, maybe), how long will it take for it to cost another $40k? And this way, you have an ENTIRELY redundant system -- the only way you lose it is if, say, the whole building blows up.

I mean, I sort of agree that you get what you pay for. But when the difference in price is that much, the only way it's ever worth it is if there's really great support with the high-end package. And is it $40k worth of support? If not, I imagine this guy could put together a company selling little $3k, $6k, and $10k systems for $20k each (including support), shaving off $30k even for the most paranoid.

And all of that is pretending you're right about the cheap consumer-grade hardware actually being less reliable.

--
Don't thank God, thank a doctor!

Re:Let's pretend you're right for a moment... by AVee · 2007-05-30 22:56 · Score: 1

True, but then add the cost of the admin setting the whole thing up. Add the cost of the needed spare parts and the cost of replacement (both in labor and parts) of failing components. In some companies it may even mean the difference between having to hire a sysadmin and being able to do without. At home it's absolutely the way to go; in a business environment I'd think twice.
Re:Let's pretend you're right for a moment... by SanityInAnarchy · 2007-05-31 01:46 · Score: 1

True, but then add the cost of the admin setting the whole thing up. Add the cost of the needed spare parts and the cost of replacement (both in labor and parts) of failing components.

I'm very, very skeptical that you could get away without either, in a company of any size which can afford to spend $50k on a single box. Certainly, I'd think twice about building a business environment around a single box with no hot spare which probably costs at least half of a year's salary of an admin anyway.

--
Don't thank God, thank a doctor!
Re:Let's pretend you're right for a moment... by mink · 2007-05-31 05:06 · Score: 1

Sounds like OCFS2 took a page from IBM's HACMP.

Machines that get their resources taken are usually sent a "FOAD" signal from the living nodes so as to make sure there are no network conflicts and the resources are no longer in use when it takes over (I've dumbed the process down quite a bit since I don't feel like writing 15 pages on HACMP).

--
Well I've wrestled with reality for thirty five years doctor, and I'm happy to say I finally won out over it.

is NFS reliable? by WeAreAllDoomed · 2007-05-30 02:20 · Score: 1

i've just recently started looking at centralizing my data at home. i've set up NFS to share a 3ware RAID array over gigabit ethernet, but every time i try to copy some serious volume to the server, i end up with some corrupted files on the receiving end and a dead NFS connection.

currently the server is some older possibly substandard hardware - that could be it - but i'm asking the more general question about NFS.

what does it take to get bullet-proof file sharing for linux clients? is opensolaris the answer? is it more reliable?

--
free software, open standards, open file formats, no software patents.

Re:is NFS reliable? by pyite69 · 2007-05-30 02:53 · Score: 1

I have an NFS server (3ware 7810) that holds the disks for my hi def MythTV setup; it beats on the disks continuously over NFS and I have never seen a corrupt file.

I am using an Athlon MP server with Ubuntu Breezy, and the other boxes are dual core AMD with Ubuntu.

I'm not sure what you could change; maybe use tcp,nfsvers=3 in the NFS options. Also, make sure you can write large amounts of data to the server locally to 100% prove it is NFS that is causing your woes.
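
For the local-write check, something along these lines would do. It is a minimal Python sketch, not a proper burn-in tool: it writes random data, hashes it on the way out, reads the file back, and compares. The function name and default sizes are made up. One caveat: the read-back may be served from the page cache rather than the disks, so use a file larger than the server's RAM (or drop caches first) if you want to exercise the array itself.

```python
import hashlib
import os


def write_and_verify(path, total_mb=1024, chunk_mb=4):
    """Write total_mb of random data to path, read it back, compare hashes.

    Returns True when the data read back matches what was written.
    """
    chunk = chunk_mb * 1024 * 1024
    written = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            buf = os.urandom(chunk)
            written.update(buf)
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # push the data out to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            read_back.update(buf)
    return written.hexdigest() == read_back.hexdigest()
```

Run it once directly on the server against the RAID volume, then again from a client against the NFS mount. If only the second run fails, the network or NFS setup is the suspect; if both fail, look at the hardware.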

RAID? by Poromenos1 · 2007-05-30 02:21 · Score: 1

I have been trying to convert an old PC to a fileserver, and I haven't yet found a good RAID solution. Some motherboards do provide RAID, but that's software RAID with hardware support. I need a pure RAID 5 solution, does anyone have anything to propose (motherboard/PCI/whatever?). It would help me very much.

--
Send email from the afterlife! Write your e-will at Dead Man's Switch.

Re:RAID? by WuphonsReach · 2007-05-30 13:40 · Score: 1

I have been trying to convert an old PC to a fileserver, and I haven't yet found a good RAID solution.

For Linux, just use Software RAID and hook N drives up to any interface that the Linux kernel can see. You should avoid RAID5 though (too easy to lose the entire array when a 2nd drive fails) which means RAID1, RAID6 or RAID10 are your best bets. (You can even do a 3-disk active RAID1 instead of a 2-disk active RAID1 with hot-spare. After all, if you're going to have that 3rd disk sitting there drawing power, why not put it to work?)

Good, inexpensive motherboards are the Asus M2N-E or M2N-SLI Deluxe. Both come with a set of NVIDIA 6-port SATA-II connectors. They also have enough PCIe slots to let you add dual-port NICs (Intel PRO/1000) or larger RAID cards (8-16 port SATA). The Asus boards don't have any moving parts (cooling of the chipset is a combination of heat pipe + heat sinks). Alternately, try a Tyan Thunder K8WE board (dual-Opteron Socket F) which also has the 6 on-board SATA-II connections and PCIe slots.

For cases, Lian Li PC-A16 or a SuperMicro 4U 742i. Both have (9) 5.25" bays up front with no obstructions. The SuperMicro case can be purchased with a triple-redundant 760W PSU. In the front 5.25" bays, you can install 5:3 SATA backplanes, which fit (5) SATA drives into the space of (3) 5.25" bays. So you could cram (15) drives into that 4U rack case. Which is a heck of a lot of storage. Just use a USB DVD drive for the times when you need an optical drive.

If you want real hardware RAID, I think the only choices are either Areca or 3Ware. Not sure if the Promise EXnn350 (EX12350, EX16350) series cards are true hardware RAID (given their price, I suspect they are).

--
Wolde you bothe eate your cake, and have your cake?
Re:RAID? by Poromenos1 · 2007-05-30 22:02 · Score: 1

Ah, thank you very much for this information. I was actually thinking of RAID5 because it's cheaper than the mirrored solutions, and I haven't had a drive fail yet (let alone two drives in quick succession). By the way, do you know offhand if it's easy to see when a drive has failed (Command/logfile/email alert)? I wouldn't want to find that out when my data is inaccessible :)

Software RAID looks quite good, thanks for the clarification.

--
Send email from the afterlife! Write your e-will at Dead Man's Switch.
Re:RAID? by WuphonsReach · 2007-05-31 07:02 · Score: 1

By the way, do you know offhand if it's easy to see when a drive has failed (Command/logfile/email alert)? I wouldn't want to find that out when my data is inaccessible :)

Interactively: # cat /proc/mdstat

md6 : active raid1 sdc8[2] sdb8[1] sda8[0]
267257216 blocks [3/3] [UUU]

The "mdadm" tool also has a monitor mode, where you can specify e-mail addresses, or to run a program, or dump stuff into syslog, or you can cron a one-shot monitor event (man mdadm and look at the monitoring section). There are probably plug-ins for Nagios and other system monitoring tools. Not to say that there aren't ways to hook 3ware's drivers into Nagios or other alert software, but then you'd probably have to learn a different solution for using an Areca card.

(One of these days I'll get around to configuring mdadm to alert me via e-mail. For now, with hot-spares in the units, checking it once a week interactively is often enough. Biggest array I have at the moment is a 6-disk RAID10, but when the 10-disk RAID10 goes online in a few weeks I'm going to be a little more concerned with real-time notification of events.)
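
If you want something to run from cron in the meantime, here is a small Python sketch that scans /proc/mdstat text for arrays with a missing member. It only looks at the "[n/m] [UU_]" status field shown above and is no substitute for mdadm's own monitor mode; the function name is made up, and array names other than mdN are not handled.

```python
import re

ARRAY_RE = re.compile(r"^(md\d+)\s*:")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that show a missing member.

    An array counts as degraded when its status map contains '_'
    or when fewer devices are active than the array expects.
    """
    bad = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if "_" in status.group(3) or active < expected:
                bad.append(current)
            current = None
    return bad


if __name__ == "__main__":
    with open("/proc/mdstat") as f:
        failed = degraded_arrays(f.read())
    if failed:
        print("degraded:", ", ".join(failed))
```

Cron mails any output to the job's owner by default, so printing only when something is wrong gives you a crude e-mail alert.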

--
Wolde you bothe eate your cake, and have your cake?

Google. by SanityInAnarchy · 2007-05-30 02:21 · Score: 1

Or, in other words, hell yes, large PC clusters can obsolete mainframes. And yes, it depends how you use it -- or, in other words, if your application actually can support a large PC cluster instead of a mainframe.

--
Don't thank God, thank a doctor!

netgfs? by Lethyos · 2007-05-30 02:25 · Score: 1

Maybe something interesting will happen here: http://code.google.com/p/netgfs/

--
Why bother.

What is the Difference... by eno2001 · 2007-05-30 02:31 · Score: 1

...between doing this or running Linux instead and using:

1. Either hardware or software RAID for redundancy and performance
2. Using LVM2 to slice up the disks any way that is desired
3. Using NBD or GNBD (network block device or global network block device) to export the LVMs to other systems as block devices or just using Linux's iSCSI implementation or even (yuck) ATAoE
4. Using Samba and/or NFS to share volumes for temporary mappings (as opposed to the permanent mappings with (g)nbd)

This is what I was planning on doing. What makes ZFS and RAID-Z better than the above options, which I find to be very easy to implement?

--
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o

Re:What is the Difference... by jimcooncat · 2007-05-30 08:18 · Score: 1

I'm on the same track, only plan to use DRBD for redundancy instead of RAID.

Check out:
http://www.gridvm.org/drbd-lvm-gnbd-and-xen-for-free-and-reliable-san.html

Comments welcome!
Re:What is the Difference... by myowntrueself · 2007-05-30 08:34 · Score: 1

I don't know the 'difference' but Linux-based, software-only iSCSI *will* crap all over your data sooner or later. It's not a matter of if, it's a matter of when. It is horrifically unreliable, especially under any kind of load.

--
In the free world the media isn't government run; the government is media run.
Re:What is the Difference... by eno2001 · 2007-05-31 07:27 · Score: 1

I'm not much of a fan of iSCSI myself. I prefer NBD which I use liberally at home. It's great for playing DVDs via WiFi or even watching full MPEG program stream files (recordings of TV shows). I've also used it for video editing with really huge, high volume access (we're talking gigabit network as well). So I think my approach will likely be NBD or GNBD, but... the link above might be interesting too.

--
-"...bad old ideas look confusingly fresh when they are packaged as technology" - Jaron Lanier (Digital Maoism on Edge.o

Competing with the low end by PilotDvr · 2007-05-30 02:33 · Score: 1

Take a look at http://www.promise.com/product/product_detail_eng.asp?product_id=149 Pretty much what you want...from a decent manufacturer for just $8K (with 15 decent SATA II 250GB drives)

Ethernet bonding and IP multipathing by Anonymous Coward · 2007-05-30 02:34 · Score: 0

each RAID controller having two uplinks to either two hosts or two FC switches, and each host either having two uplinks to the two different controllers or to two FC switches.

How is this different from Ethernet interface bonding and IP multipathing on Solaris? You can have redundant network paths on IP as much as on FC.

With ZFS you also get the added benefit of checksumming (a Merkle tree). No NAS/SAN vendor or other file system gives you that.
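
For anyone who hasn't met the term, a Merkle tree is just a tree of hashes: each parent hashes its children, so one root hash vouches for every block underneath it. Below is a simplified binary version in Python. ZFS itself keeps each block's checksum in the parent's block pointer rather than building a separate binary tree like this, so treat the sketch as an illustration of the idea and not of the on-disk format. SHA-256 and the duplicate-the-last-node rule are choices made for the example.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Compute a single root hash covering every block in the list."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [_h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pair up an odd node with itself
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    blocks = [b"block-0", b"block-1", b"block-2"]
    before = merkle_root(blocks)
    blocks[1] = b"block-1 with one bad byte"
    print(before != merkle_root(blocks))
```

Change any block and the root changes, which is what lets corruption anywhere in the tree be noticed from the top down.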

Lots of disks in an enclosure for lowend by tbuskey · 2007-05-30 02:34 · Score: 1

At home I was running Linux w/ 3 120 GB SATA drives in software RAID-5 and LVM on top. I switched to Solaris 10u3 x86 with ZFS RAIDZ on 4 500GB SATA disks. Much easier to admin the files/resizing plus I get compression and error correction.

My main issues are:
1) An inexpensive ($100) 4 port SATA controller that is supported by Solaris 10u3
I had a $20 card working in Linux. I needed to flash the bios to JBOD (no "RAID" in its bios). I bricked it. I ended up getting a motherboard w/ 4 ports that someone else had already been using/testing. When I add more disk, I'll need to find a controller board. Hopefully 10u4 will be out with support for more boards.

2) Getting lots of disks in an enclosure with power and cooling
The Solaris box got a new case (Antec p180) that has power/cooling for 4 drives in the case. How do I add 4 more? For the other box I built something out of plywood/cardboard to hold the drives and duct the air over them. I added an old PC power supply and a 120 mm fan. 42" SATA cables can run out the back of the PC to it with no problem. I'd like to find something premade that holds 4 + drives and closer to $100.

Any suggestions?

SDS = System Design Specifications? by Cassini2 · 2007-05-30 02:37 · Score: 1

I am wondering what SDS means too. This is the wikipedia entry: http://en.wikipedia.org/wiki/SDS

Re:SDS = System Design Specifications? by RubberChainsaw · 2007-05-30 04:59 · Score: 1

I assumed it meant Slash Dot Summary.

--
I welcome our new 99% overlords.
Re:SDS = System Design Specifications? by Spazntwich · 2007-05-30 07:48 · Score: 1

Yessir!

I'm suprised to see no mention of FreeNAS by suteny0r · 2007-05-30 02:38 · Score: 1

FreeNAS is a free NAS (Network-Attached Storage) server, supporting: CIFS (samba), FTP, NFS, AFP, RSYNC, iSCSI protocols, S.M.A.R.T., local user authentication, Software RAID (0,1,5) with a Full WEB configuration interface. FreeNAS takes less than 32MB once installed on Compact Flash, hard drive or USB key.

The minimal FreeBSD distribution, Web interface, PHP scripts and documentation are based on M0n0wall.

http://www.freenas.org

Re:I'm suprised to see no mention of FreeNAS by Anonymous Coward · 2007-05-30 06:12 · Score: 0

It's a piece of crap, that's why no one mentioned it.
Re:I'm suprised to see no mention of FreeNAS by Anonymous Coward · 2007-05-30 14:54 · Score: 0

Crap? It installed perfectly on some old hardware. Gets used every day. And has never given me trouble. Hmmmm. That's crap eh? You must be a Windows user.

ZFS isn't SAN capable by clawsoon · 2007-05-30 02:40 · Score: 2, Insightful

I'm surprised no-one has pointed this out yet: ZFS can't be used as a SAN filesystem. For that you need something like Redhat's GFS (formerly from Sistina), PVFS, or one of a number of commercial products.

From the ZFS FAQ:

Q: Suppose I have three E450s. Would ZFS allow me to integrate storage across all three boxes into one big "poor man's SAN"?

A: No, ZFS is a local filesystem (for the time being). To access storage attached to a different host, use NFS.

Re:ZFS isn't SAN capable by Mistah+Blue · 2007-05-30 14:49 · Score: 1

He isn't advocating using ZFS as a clustered file system. It will be the filesystem used to build "LUNs" on his storage box. These will be presented out as CIFS/NFS shares, or iSCSI LUNs.

so many other factors by rgaginol · 2007-05-30 02:43 · Score: 1

Like administrator skills - the very fact that you're asking Slashdot about this stuff means that you're inquisitive about this, and that puts you far ahead of most of the dum dums I see in the skill-shortage world which is IT in Australia at the moment. The number one cause of data loss is a badly skilled administrator. If you understand the system you've deployed, have looked through the different scenarios of data loss, and have recovery strategies in place, then you've probably covered most of your bases already. And that's the kicker, isn't it; it's not so much about what you purchase as it is about having a well thought out process in place when you make one of these decisions. When I was a system admin, the best book I ever read was "The Practice of System and Network Administration" http://www.amazon.com/Practice-System-Network-Administration/dp/0201702711/. It's probably a few years out of date now, but should still have a heap on the process of making purchases like this. That used to be one of my favourite reads... but then someone borrowed it at work (thanks whoever took it yeh bastard... hehe).

Dont' forget about performance by shirpa_kewl · 2007-05-30 02:49 · Score: 1

One problem with your setup is it does not provide enough disks for performance or caching to provide large amounts of I/O. ZFS might be good on big hardware, but I do not know that it would be practical on your rig.

Big vendors like EMC use excellent caching hardware and connectivity hardware that create a mesh of hundreds of disks using large amounts of memory within their storage arrays. Vendors like EMC spread data across tens and sometimes hundreds of disks along with the use of their caching mechanisms to prevent I/O bottlenecks depending on the application. While they have started using SATA disks, if you were to use disks that are too large in a rig like yours, you end up with lots of storage and too few disks across which to spread your I/O.

So, in summary, 12 sata disks might be good for a lab or development setup, but would not provide the performance of the "$50K" storage solutions. But then in most cases, if your lab does not match your production setup, what good is it?

Re:ZFS and Sun boxes by ggendel · 2007-05-30 02:50 · Score: 1

FC actually pre-dates SATA. FC drives are blindingly fast, but Fibre Channel boxes are usually pretty pricey. You can do the same thing as I did, but you need a PCI controller board (not mine, which is PCI-X). I would check with the OpenSolaris forum, since they would have more experience with which Sparc drivers are available for you and which cards are best. Good luck, that looks like a nice machine to play with.

ZFS Learning Centre / Tutorial by Anonymous Coward · 2007-05-30 02:50 · Score: 0

A good video (in Real(tm) format) describing how ZFS is constructed and works is available from Sun's web site:

http://www.sun.com/software/solaris/zfs_learning_center.jsp

The slides that Bill Moore is showing are also available online:

http://www.sun.com/software/solaris/zfs_lc_preso.pdf

Intended use matters quite a bit by codethug · 2007-05-30 02:52 · Score: 1

It's difficult to advise without details of intended use, but generally I think it through like this:

Your solution is $5K, alternative $50K.
Do you have $50k to spend? If all you can spend is $5K then this is the right choice.
How much time are you going to spend on-disk? Not much? Great, then use those savings to buy a poopload of RAM for your memcached servers.
Already have RAM maxed out? Oh well, use some of that money on more efficient power supplies for your perlbal/pound/nginx load balancers.
Load balancers? Look dude, I just need to store a lot of media.
Oh! Why didn't you say so, just use Amazon S3.
What's that? You need in-house reliability and speed? Oh ok.
Buy the $50k SAN.

Why build your own - just get an appliance... by Anonymous Coward · 2007-05-30 02:55 · Score: 0

Or, you could just buy this off eBay with 3TB of storage for under $2000. Its RAIDiator OS is Linux-based... http://www.infrant.com/products/products_details.php?name=ReadyNAS%20NVPlus

Filesystem vs Hardware by Builder · 2007-05-30 02:57 · Score: 1

ZFS is a filesystem. It's a very impressive filesystem and does work very well in SAN environments.

What it won't do though is take your 3k toy and turn it into a SAN. Your box could maybe feed 2 applications in an enterprise environment. This isn't an issue with capacity but with performance.

A single SAN device serves 50 applications in my current environment and the plans for the new DC have a bigger HDS model serving around 100 apps.

Disks are cheap. The interconnect, processing, management tools and other bits and pieces are why we pay for a SAN. If you don't need that, then you are probably ok with a homebrew solution.

SATA-II performance if you use half the drive.. by Spirilis · 2007-05-30 02:57 · Score: 1

Another curious idea I have, though... is what happens if you use SATA-II disks, but only use 1/2 of the disk (the half closest to the outer tracks). A 500GB SATA-II disk would yield 250GB if you only used half the disk, but since your seek times are now cut in half, how does the performance stack up against a similarly-sized ~300GB 10K SCSI/FC/SAS disk? If they're similar (would really like to see this proven out...), one could make the argument that you can get by with lots of inexpensive SATA disks, but only half of them in use, at least during peak usage times. The other half of the disk could be reserved for off-peak bulk I/O, such as backups.

--
the real at&t mix

What about performance? by SleezyG · 2007-05-30 03:00 · Score: 1

It's a decent idea, but gigabit ethernet isn't a fat enough pipe to keep up with SATA. I think you need to examine how many users it will serve and what the throughput requirements are for each of the users. If you have only a few users on a relatively small LAN, I would purchase fibre channel cards instead of gigabit ethernet. Then again, if it is for a large office environment, you're stuck with gigabit or 100 megabit ethernet like the rest of us cubicle-inhabiting suckers.

Re:Congradulations, you discovered the "File Serve by hoggoth · 2007-05-30 03:15 · Score: 1

Forgive my ignorance, but how do I use AoE to provide high speed block devices to Windows systems?

I don't want to export them by CIFS or NFS because it is too slow. I want direct access to the devices from Windows boxes. iSCSI gives me this because ZFS (or Linux for that matter) can export iSCSI and Windows can "import" it as a block device.

The description of AoE sounds like it may be faster and easier than iSCSI, but how do I access it from a Windows machine?

--
- For the complete works of Shakespeare: cat /dev/random (may take some time)

backup to disk by reversible+physicist · 2007-05-30 03:17 · Score: 1

Using ZFS or even a big cheap RAID array as a backup or archiving target addresses short term issues but not long-term safety of archival data.

For the long term you have the issues of scaling and data migration of your archive. You need to minimize the chance of losing data due to hardware, software or human failures even as your storage scales up exponentially with time. You need to be able to incrementally/non-disruptively replace your backup disks and servers and networks as hardware becomes old and/or obsolete. You need to maintain accessibility and robustness of old data by migrating old data to new media. Businesses also need to worry about regulatory and best-practices enforcement of retention policies: some data you're required to guarantee won't get modified or deleted for mandated periods of time (and you must be able to prove that it hasn't been modified). There are commercial "archiving on disk" solutions that "solve" all of these problems (including eliminating risks due to the corruptible and failure-prone human components), but they involve a hell of a lot more than ZFS and HA Linux!

Re:backup to disk by Firethorn · 2007-05-30 03:48 · Score: 1

but they involve a hell of a lot more than ZFS and HA Linux!

I wasn't so much worrying about the operating and file system, instead concentrating on hardware.

As for your other concerns - when I did the research, the complete backup solution came up cheaper than the tape drives we'd need alone.

As for the other things - we don't do long term archiving of electronic data. If it's from much more than three months ago you're out of luck.

--
I don't read AC A human right

No competition for enterprise SANs by Anonymous Coward · 2007-05-30 03:23 · Score: 0

The most obvious reason why this is no threat to real SANs is that you have a single CPU and single motherboard. If your board dies so does your access to the data.

It's simplistic to think about SAN and NAS as simply capacity. You pay out the nose for the big SANs because you want high availability, not just redundancy. As cute as this device is, it's not that.

Re:Congradulations, you discovered the "File Serve by Anonymous Coward · 2007-05-30 03:24 · Score: 0

I've always wondered why some people think that using 1U or 2U chassis instead of normal PC cases is the only way to go if you want to look professional. Even places like ProCurve have stacks of average looking PCs.

Here's a picture just to prove it.

Re:ZFS and Sun boxes by linzeal · 2007-05-30 03:25 · Score: 1

Add a cheap PCI IDE card and forget FC. It is highly unlikely you need the throughput.

--
An Education is the Font of All Liberty

RAID controller failure by wurp · 2007-05-30 03:28 · Score: 4, Informative

And what happens when the RAID controller fails and corrupts all of your drives?

Because I've seen that happen more than once.

I'm not saying the more expensive solution is better. I'm just saying that in my personal experience I've seen *more* data destroyed from RAID controller failure than from hard drive failure. I would love to find out the solution to that one.

I do not claim to be a hardware expert or system administrator, so there may be a well known solution (don't buy 'brand X' RAID controllers). I just don't happen to know it.

Re:RAID controller failure by hpavc · 2007-05-30 04:20 · Score: 2, Interesting

Agreed, the controller doesn't just 'stop'; it sort of goes on a rampage for a bit, making you wish you had gone through another layer of abstraction or redundancy.

--
members are seeing something, your seeing an ad
Re:RAID controller failure by Wolfrider · 2007-05-30 06:25 · Score: 1

Would you please post (somehow) which brand and model of controller did this + date (year at least)? TIA

--
.
== WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??
Re:RAID controller failure by wurp · 2007-05-30 06:29 · Score: 1

I'm sorry; that was years ago and it was a RAID controller that was purchased by the sysadmin at my workplace at the time. Those controllers failed, then the replacements eventually failed too.

I should have made it a point to find out, but I didn't.
Re:RAID controller failure by Anonymous Coward · 2007-05-30 07:13 · Score: 0

Happened last year to one of our users who insisted on getting mirrored RAID drives on his new Dell workstation. Both of them ended up corrupt; one had data, but couldn't boot, while the other would boot, but the data was corrupt.
Re:RAID controller failure by Phantom+Gremlin · 2007-05-30 08:21 · Score: 1

Would you please post (somehow) which brand and model of controller did this + date (year at least)? TIA

Given how often I've heard stories like that, I think it's harder to find controllers that don't malfunction.
Re:RAID controller failure by Anonymous Coward · 2007-05-30 08:48 · Score: 0

(don't buy 'brand X' RAID controllers)... whatever you do, don't go with the Dell. Their PowerVault stuff has had a horrible failure rate with complete corruption of data when a disk drops offline. (200 server data center setting)
Re: RAID controller failure by Dolda2000 · 2007-05-30 09:07 · Score: 2, Interesting

I have two things to say about that particular case: 1) ZFS maintains checksums of all data, so that will at least be noticed fast and 2) just use two controller cards and mirror them -- if one fails, you'll have valid data in the other half of the mirror.
The more important question may be whether there are other, unknown sources of errors. Not many people would probably even think of a hard drive controller card failure when building a storage solution, and that's probably one of the major advantages of the more expensive vendors; they have the experience to know a lot more error sources. I'm not sure which weighs the heaviest: The enormous difference in price, or a chance to avoid a source of major errors that one would never have thought of on one's own (even if there's a 1 ppm risk of it happening)?
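
The checksum-plus-mirror idea in point 1) and 2) can be sketched roughly as follows. This is a toy illustration only: the function names are made up, SHA-256 is an assumption chosen for simplicity, and nothing here reflects ZFS's actual code paths.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is an assumption for this sketch, not a claim about ZFS defaults.
    return hashlib.sha256(data).hexdigest()


def mirrored_read(copies: list, expected: str) -> bytes:
    """Return a copy whose checksum matches, repairing bad copies in place."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("all mirror copies failed checksum verification")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # rewrite the corrupt side from the good one
    return good


block = b"important data"
stored_sum = checksum(block)
# Pretend the first controller silently corrupted its half of the mirror.
mirror = [b"important dat\x00", block]
assert mirrored_read(mirror, stored_sum) == block
assert mirror[0] == block  # the bad half was repaired
```

The point of the design is that the checksum is stored apart from the data it covers, so a controller that scribbles on one half of the mirror cannot also make its garbage look valid.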
Re: RAID controller failure by MadMorf · 2007-05-30 11:29 · Score: 1

2) just use two controller cards and mirror them -- if one fails, you'll have valid data in the other half of the mirror.

And by doing that you've bumped the price of the system by double...

So now your cheap SAN/NAS box is $6k instead of $3k, for the same usable space...

--
Goofy, Geeky Gifts and More!
Re:RAID controller failure by kraut · 2007-05-30 11:35 · Score: 2, Informative

Software RAID. Been running it for years for exactly that reason.

Maybe not suitable for high performance situations, but I've not found it slow.

--
no taxation without representation!
Re:RAID controller failure by jp10558 · 2007-05-30 12:28 · Score: 0

This is why RAID doesn't mean BACKUP. You still need backups. Or something like the above Starfish filesystem with mirroring.

--
Opera, Proxomitron-Grypen,GPG 0x0A1C6EE3
Re:RAID controller failure by Anonymous Coward · 2007-05-30 17:01 · Score: 0

Don't use HW RAID. Use ZFS with JBOD and create a raidz (or raidz2) pool. You will probably get better performance and certainly much better reliability.
Re:RAID controller failure by this+great+guy · 2007-05-30 17:48 · Score: 3, Interesting

And what happens when the RAID controller fails and corrupts all of your drives?
That's the whole point of ZFS: you don't need (and don't want) to use ZFS with a hardware RAID controller. Sun designed ZFS so that you just feed it with *disks*. ZFS takes care of the rest: volume management, corruption detection, software RAID, etc.
Look at the high-end Sun Fire X4500 server ("Thumper") they released a few months ago: 48 drives in 4U with *no* hardware RAID controller. Sun designed this server as the perfect machine to run ZFS.
Re:RAID controller failure by mrdogi · 2007-05-31 01:43 · Score: 1

As I understand it, in the case of ZFS, which the question specifically mentioned, nothing. You get a new RAID controller, attach the drives, and let ZFS figure out what's where. In ZFS, you don't (or at least it's highly recommended that you don't) use RAID on the controller(s). Give the disks to ZFS and let it figure it out.

A bonus of using ZFS (RAID-Z) is that the 'write hole' that traditionally shows up in RAID configurations disappears. It writes all new data to a fresh location, leaving the previous data alone, and the last thing it does is change a pointer from the old data to the new. That is also a reason the snapshots are so easy.
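
A minimal sketch of that write-then-flip-the-pointer idea, assuming a toy in-memory "disk". The class and method names are hypothetical and are not ZFS's real on-disk structures:

```python
# Toy copy-on-write store; hypothetical names, not ZFS internals.
class CowStore:
    def __init__(self, data: bytes):
        self.blocks = {0: data}  # block id -> contents
        self.root = 0            # pointer to the live block
        self.next_id = 1

    def write(self, data: bytes, crash_before_commit: bool = False) -> None:
        new_id = self.next_id
        self.next_id += 1
        self.blocks[new_id] = data  # 1. new data goes to a fresh block
        if crash_before_commit:
            return                  # root still points at the old, intact data
        self.root = new_id          # 2. last step: flip the pointer

    def read(self) -> bytes:
        return self.blocks[self.root]

    def snapshot(self) -> int:
        # A snapshot is just a remembered pointer; old blocks are never overwritten.
        return self.root


store = CowStore(b"old")
snap = store.snapshot()
store.write(b"half-written", crash_before_commit=True)
assert store.read() == b"old"        # interrupted write left no damage
store.write(b"new")
assert store.read() == b"new"
assert store.blocks[snap] == b"old"  # snapshot still sees the old data
```

Because the old block is never modified in place, an interrupted write leaves the previous state readable, and keeping a snapshot costs nothing more than holding on to the old pointer.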
Re:RAID controller failure by aminorex · 2007-06-01 03:15 · Score: 1

> a well known solution

software raid.

--
-I like my women like I like my tea: green-
Re:RAID controller failure by NateTech · 2007-06-01 20:54 · Score: 1

Always run more than one controller, is the lesson you missed somewhere. In truly mission-critical systems, a single controller (or anything else single) is NEVER used.

Multiple servers (clustering of whatever form you like), multiple controllers, multiple cables, multiple arrays, multiple power supplies, multiple disks.

Home and low-end users tend to only focus on the latter... multiple disks, and ooh and ahh over boring things like RAID.

Super high-end systems might even have multiple internal backplanes in all of the above.

How much redundancy you want/need is balanced by available budget. Systems with enough revenue flowing always pay for themselves.

--
+++OK ATH

Quality - industrial vs. commercial by FuzzyDaddy · 2007-05-30 03:29 · Score: 4, Insightful

The "consumer-grade" and "business-grade" are the same off the shelf stuff, but if you are getting business-grade stuff from a reputable vendor they QA the consumer-grade parts, throw out the bad ones, and stamp "business-grade" on the ones that survive.

I worked at ATMEL many years ago in their EPROM division. I had an up close and personal view of the screening flows, both Military and otherwise. Let's put aside the issue of Military screening, which is extensive and costly. You can't make very much out of Military grade ICs, because there are not very many available.

The difference between commercial and industrial parts is one of operating temperature, not quality. (In point of fact, there was no actual difference in the screening or handling.) The quality standards for both parts were the same - the goal was always zero defects. I spent weeks weeding out a problem with a 50 ppm failure rate that was slipping through our screening, and everyone was damned happy when I fixed it.

There's no reason to expect a correlation between maximum operating temperature and quality. A part might run too slow at elevated temperature to pass, but this will usually happen for process variation reasons that do not affect the expected lifetime of the part.

Any part coming from a reputable IC manufacturer should have the same level of quality, regardless of the rating.

Now, that being said, there is a very serious quality issue that an OEM does need to address, and that's counterfeit parts. If an OEM is not careful about where their parts come from, or buys them cheap and looks the other way, then their quality will obviously suffer. But this isn't so much a matter of commercial versus industrial quality; it's about honest versus dishonest business practices.

--
It's not wasting time, I'm educating myself.

Check out Coraid by Rsriram · 2007-05-30 03:29 · Score: 1

www.coraid.com

Coraid Storage Products connect to the network using standard Ethernet. With EtherDrive Storage you can create a shared storage system from a few Terabytes to multiple Petabytes for under $0.70 per Gigabyte.

--
O this learning! What a thing it is - William Shakespeare

Re:Check out Coraid by DaveCar · 2007-05-30 03:52 · Score: 1

I've just been playing with the (Coraid contributed) vblade and aoe tools and module in Debian & Ubuntu.

Just managed to get a stock Ubuntu install to boot diskless from a virtual ATAoE blade (vblade) exported from a machine. Mmmmm.

Surprisingly quick once it's up and running too, and that's only on 100Mb/s.

Just had to write one initramfs-tools script to set up the aoe device from the pxe/tftpboot command line parameters - I can post once finished with it if anyone is interested.

Thank you Coraid! I might just buy some of your hardware :)

Re:vs Reiser4 - plugin compression by tbuskey · 2007-05-30 03:30 · Score: 1

gzip compression is available in ZFS in addition to the standard.

I've found that compression sometimes makes for faster writes.

Re:ZFS and Sun boxes by tricorn · 2007-05-30 03:33 · Score: 2, Informative

You can pick up those 750GB Seagate SATA drives for about $200 each now...

Re:Congradulations, you discovered the "File Serve by ch-chuck · 2007-05-30 03:33 · Score: 1

Well, there's this.

--
try { do() || do_not(); } catch (JediException err) { yoda(err); }

apple xsan + eonstors by t35t0r · 2007-05-30 03:34 · Score: 1

We're going with an Apple Xsan + Infortrend EonStor solution: 2 x Apple Xserves running Xsan, one QLogic SANbox (with SFPs), 2 x 4Gb FC HBAs for the Xserves, and 2 x 16-bay Infortrend EonStors supporting 4Gb FC with 750GB SATA II enterprise-class Seagates (single RAID controller, redundant PSUs), all for $38K. Unfortunately Infortrend has not yet qualified the 1TB drives, otherwise we would have saved even more money. If you have a bigger budget you can get EonStors with redundant RAID controllers.

The nice thing about Xsan is that we can keep adding EonStors to the racks or use larger drives in the EonStors and increase our storage. This is great since we're growing at about 2-3TB yearly.

The answer to the question is... by teflaime · 2007-05-30 03:35 · Score: 1

not really. For the most part, these solutions are going to be used by different users. SAN/NAS is for enterprise users who have specific needs that are best met by fiber attached SAN on caching arrays: major enterprise entities, i.e. very large corporations with complicated storage needs and thousands of servers. The sort of solution we are discussing here is for the person who isn't reliant on throughput or the kind of caching and redundancy that an EMC or other brand name high end SAN solution might offer: consumer level or small business level purchasers with straightforward storage needs that reside on one or a few computers. There's a middle group that has to more carefully balance the cost/benefit analysis when choosing a storage solution. Do they require fiber throughput? Is price a more important factor? The large SAN products will likely lose some of these mid-tier clients, but giant corporations require support and indemnification that low cost systems such as the one in the article will never be able to provide.

All storage is not created equal by marian · 2007-05-30 03:41 · Score: 1

Whether the thing you've described will meet your needs really depends on what you're looking for. Unfortunately, you've described the gizmo, but not what you need it to do.

That being said, here are a few things you need to be aware of:

1st problem:
From the Lime Technology web site, on their MD-12000: "Unlike other RAID systems, however, user data is not striped across the data disk drives. Instead, each data disk is formatted normally with it's own file system." What this means is that it's going to be slow. SATA/IDE drives are slow to begin with, but without the striping it's going to be individual-drive-with-an-external-connection slow. This also raises the question of how they're doing RAID5, since that requires multiple drives in order to hold the parity so that if one fails the data can be recreated when it is replaced.

2nd problem:
There is no mention of data cache anywhere in the technical information about the MD-12000. Without that, the seek/write times are going to be awful.

3rd problem:
The Lime Technology web site says that both the IDE and SATA versions are "temporarily out of stock". You can't even get one.

4th problem:
No spare drives. If you lose a drive, you need to get a replacement before you have protection again. RAID5 doesn't give you any protection at all from multiple drive failures, and yes it does happen. More frequently than you know. Especially with IDE/SATA drives. They don't have a particularly long lifespan, and years of experience with big storage has also shown me that you lose drives at spinup/spin down. Which means the age of your drives will be essentially the same and the chance to lose more than one is pretty good.

That all being said, if what you want is something inexpensive to attach to your desktop that will hold lots of data that isn't really important, isn't needed by lots of people, and doesn't need really fast access, it sounds like a good deal. But once you start having to share that data with others from the same storage box, and need to get faster access, this thing isn't going to give you performance. It's never going to give you a really high amount of reliability. Not high enough to be putting critical data on it.

--
"Suppose you were an idiot..... And suppose you were a member of Congress... But I repeate myself."

Adding new _bigger_ disks. Does this exist ? by dargaud · 2007-05-30 03:51 · Score: 1

One solution I would like to see is the following:

In all the RAIDs I've seen, all the disks must be the same size, which is a big expense when it's just for your personal photo storage needs. I'd like to see some kind of RAID that can take anything, so I can just add to it regularly to increase either storage size or redundancy or both in a controlled manner.

The reason is that I purchase a new drive every year or so, so I currently have 60GB, 120GB, 200GB, 250GB, 400GB, 500GB... Currently they are all mounted in different ways, but I'd love to have an enclosure where I can just add a new bigger disk and remove the oldest and smallest and keep going. Bonus points if I can plug it into the RJ45 of my ADSL/WiFi router or the USB of my laptop.

Does such a thing exist? It's exactly what I want. Thanks

--
Non-Linux Penguins ?

Absolutely true by Flying+pig · 2007-05-30 04:01 · Score: 2, Interesting

You failed to mention that all stress testing increases the probability of failure. To a considerable extent, you can only design in robustness and quality and hope. It used to depress me quite a lot (when I was involved in such things) that many American companies (and others, I hasten to add) just fundamentally did not get ISO 9000 and statistical process control, which is quite different from post-production testing. As a German engineer once remarked to me, (some years ago) "Rolls-Royce cars do not have quality. BMW does not have quality. They just throw away everything that is defective. Toyota has quality. They aim to eliminate defects."

Old guys like me may remember when National Semiconductor was, if I recall rightly, fined for faking test records in the same year they won an award for the reliability in the field of their military products. Or the discovery in the early 90s that volume produced Japanese semiconductors were far more reliable than many JAN devices. There is just no substitute for having to manufacture in volume in a competitive world.

In semiconductors, the downside is that things that produce higher reliability, like thicker oxide and bigger anti-static diodes, also slow down clocks. You would think that, for a really reliable disk array, you would want a less than state of the art system running conservatively. I guess that this is a case where having a great deal of practical and experimental experience is the best recipe for success, and perhaps this is where SAN manufacturers shine.

--
Pining for the fjords

Re:Congradulations, you discovered the "File Serve by Anonymous Coward · 2007-05-30 04:03 · Score: 0

AOL has data centers full of mid-tower PCs. Say what you will about AOL itself, but they never go down.

Nit-picky wanker post by Bohnanza · 2007-05-30 04:06 · Score: 1

"Obsolete" isn't a verb. Thank you.

--

-----

Sorry, I'm only a 1336 h4x0r.

Re:Nit-picky wanker post by Anonymous Coward · 2007-05-30 04:16 · Score: 0

obsolete (verb): to make obsolete by replacing with something newer or better; antiquate: Automation has obsoleted many factory workers.

Source: Random House Unabridged Dictionary.
Re:Nit-picky wanker post by rrhal · 2007-05-30 06:03 · Score: 1

The OED has it as a verb as well. It's been in use as such since the 1600's.

--
All generalizations are false, including this one. Mark Twain

Re:ZFS and Sun boxes by Anonymous Coward · 2007-05-30 04:18 · Score: 2, Informative

I'm familiar with SATA and IDE...but, the FC ones are new to me..

Just a brief summary:
-- SATA refers to the new Serial ATA.
-- ATA or PATA refers to the older "Parallel" ATA. (ATA dates back to IBM PC AT and refers to that machine's AT Attachment interface.)
-- IDE refers to any drive (ATA, SCSI...) with integrated drive electronics, that is, everything that has come after the ancient dumb drives that required a model-specific controller on the motherboard. In other words, not a very useful term anymore.
-- SCSI refers to Small Computer System Interface; funny how it's the one used in the bigger iron. Beats the pants out of ATA when handling multiple daisy-chained drives; SATA is catching up in handling multiple drives. SCSI also has parallel interface and cabling.
-- SAS refers to Serial Attached SCSI (some inspiration from SATA perhaps?).
-- FC refers to Fibre Channel, a SCSI-like very fast interconnect type and interface protocol; often (but not always) uses optical cabling.
-- iSCSI refers to SCSI over TCP/IP, usually carried on Ethernet (thus it could be "SCSIoIP"...).

But I never understood the difference between a SAN and a NAS when the configuration gains any complexity beyond a textbook example. You can have a SAN with many NAS boxes, or you can have NAS with multiple SANs, sooo... ;-)

Re:Congradulations, you discovered the "File Serve by drinkypoo · 2007-05-30 04:20 · Score: 1

1TB is available in a single disk now, for something like $350. You can get desktop or rack cases that will accommodate more than four drives for less than $250. (Many years ago now, I bought a 4U aluminum tower case that could be loaded - tightly - with as many as nine hard disks if you had no optical drives, for $169 with no power supply. They are now cheaper.) You can also put disks in firewire enclosures and attach them to a hub, which will provide you some additional redundancy at the cost of adding clutter. Or you could buy Apple's RAID enclosures or something, and build your own filer to put them near on your rack, if you need a larger number of disks.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

Ah, you're right. by iknownuttin · 2007-05-30 04:25 · Score: 1

Your intuition is off.

Actually, I don't quite understand the whole digital thing - JPEG, RAW, storage, etc....

I was curious about what you said so I looked in my folders where I store my digital photos. I have a 5 mp KODAK C533 and the typical jpeg file it makes is anywhere from 1.2 to 1.25 MB - right along with what you said. I guess the discrepancy from what you said is other stuff that the camera puts in the file?

Oh well, I'm still learning. Thanks for the lesson! You may have saved me some $$$ down the line!

--
I prefer Flambe as apposed flamebait.

Re:Ah, you're right. by maxume · 2007-05-30 05:27 · Score: 1

The discrepancy is in the compression. Jpeg works a little better on some images than others.
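
As rough arithmetic only, here is what those file sizes imply. The 3 bytes per pixel (24-bit colour) figure is an assumption about the uncompressed image, and MB is taken as a million bytes:

```python
# Back-of-the-envelope numbers; 24-bit colour before compression is an assumption.
megapixels = 5.0
bytes_per_pixel = 3
uncompressed_mb = megapixels * bytes_per_pixel  # size with no compression at all

for jpeg_mb in (1.2, 1.25):
    ratio = uncompressed_mb / jpeg_mb
    print(f"{jpeg_mb} MB file -> roughly {ratio:.1f}:1 compression")
```

A busier scene compresses less well and produces a bigger file, which is why two shots from the same camera come out at slightly different sizes.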

--
Nerd rage is the funniest rage.
Re:Ah, you're right. by iknownuttin · 2007-05-30 05:56 · Score: 1

The discrepancy is in the compression. Jpeg works a little better on some images than others.
Ah! Which explains the differences between image sizes from the same camera.
Thanks! Eventually, it gets through the concrete! - [taps on head].

--
I prefer Flambe as apposed flamebait.
Re:Ah, you're right. by maxume · 2007-05-30 08:43 · Score: 1

Check out:

http://www.clarkvision.com/imagedetail/does.pixel.size.matter/

The server seems a little flakey at the moment, but it is well worth it, even to just read the conclusions.

--
Nerd rage is the funniest rage.

Huh? by nanosquid · 2007-05-30 04:49 · Score: 1

You can put a few TB of storage into a single box with ZFS, but then you have a single box with a few TB of drives in it. ZFS replaces RAID.

SAN and NAS are about using networks for interconnecting storage and CPUs. I don't see what ZFS has to do with that at all, and ZFS probably can't even be used at all over some network-attached drives.

Don't forget about the SMB market by sco_robinso · 2007-05-30 04:57 · Score: 1

Don't forget that 80% of businesses in N. America are small businesses - under 500 people. And the majority of those businesses are under 75 people.

The whole point of this whole discussion is that relatively inexpensive NAS solutions are becoming a huge competitor to traditional SAN setups. Yes, FC/SCSI are likely more reliable than a plain-jane SATA setup, but often the cost just isn't justified in a smaller business.

I do IT consulting for the SMB market in Canada - companies ranging from 50 to 500 employees, and in many cases (not all) $20k - $50k is a lot of $$ to shell out for a couple TB of storage. Granted, the performance of a SATA NAS product isn't on par with a FC SAN product, but all things considered, for 10% the cost, you can have a decently reliable NAS product in place, and considering how cheap SATA drives are, you can have half a dozen spares sitting there, ready to go, for only a couple hundred bucks.

It all depends on the business's needs at the end of the day. If a business can afford to spend $50,000 on a FC/SAN product, they're probably in a situation where they can't afford not to.

The nonmagic of ZFS' magic by Anonymous Coward · 2007-05-30 05:03 · Score: 0

You've got a point; all this is possible w/out ZFS. But ZFS is very cool and makes it easier. ZFS is funny in that it is both amazingly revolutionary, while simultaneously an unimpressive incremental advance. It all depends on how you look at it.

My suggestion... by Junta · 2007-05-30 05:12 · Score: 1

Depending on the scale and application of the fileserver, generic OS-provided software RAID will probably be more robust/recoverable than hardware RAID, without an unacceptable degradation in performance. I have, for example, an Adaptec 2410SA RAID adapter. The performance is not compelling, but the deal breaker for me is that the card has, on occasion, lost its mind on an unclean system reset. The drives were out of the array and that adapter offered no 'put together and assume it was a RAID-5 with this layout before, just sync the parity' option, so the data was non-trivially unavailable. With Linux software RAID in particular (and the one time with Windows software RAID that I had to maintain as well), the tools are pretty smart and flexible in terms of what conditions can be recovered from and how easy it is to do. On both platforms I saw systems that (wrongly) put multiple drives on a single IDE channel and wrapped them in software RAID. A failing drive would effectively knock out accessibility for the other drive on the channel, and the array would lose two members. For many hardware controllers, that would be 'the end'. With software RAID I was able to say 'assemble this array and assume I know what I'm talking about', and recover data without resorting to other backups, which were inconvenient or non-existent.

In some situations you have particular hard throughput requirements you can quantify, in which case you may have little choice. Keep in mind that if you are just doing simple file storage sharing (NFS/SMB) primarily over a single gigabit link or worse, you have really nothing much to gain from a high-performance internal subsystem. If you have local storage for a high TPC database system or some disk intensive modeling or something, that's where you really start thinking about hardware controllers, IMHO.

--
XML is like violence. If it doesn't solve the problem, use more.

Re:My suggestion... by Poromenos1 · 2007-05-30 08:58 · Score: 1

Hmm, really? I just want this thing for home/office storage, so throughput is of little importance, but reliability is paramount. I had no idea software RAID was this good, I was under the impression that a bug in the program or a cold reboot could mess with the RAID in the disks. I have never tried it, but I had read an article on installing linux on a software RAID (fakeraid) which basically bashed it. I will be sure to have my next motherboard include RAID, since it is good.

Basically I was worried about the reporting capabilities, since the machine would be a fileserver (I would have to check the logs vs the card beeping or giving off other notifications). Interesting comment, thank you very much for giving me some insight on this. I wish I had mod points.

--
Send email from the afterlife! Write your e-will at Dead Man's Switch.
Re:My suggestion... by Bishop · 2007-05-30 10:26 · Score: 1

I think you are confusing two types of raid.

There is fake hardware raid. This type of raid is cheap and most of the raid calculations are done by the driver in the host OS. Some of the calculations are handled by the hardware. This is the type of raid found on most motherboards. Adaptec calls this "HostRaid." This type of raid is poor. The drivers tend to be buggy. Despite the driver running in the OS, the kernel typically does not know that the device is software raid and cannot optimize accordingly. The raid array is also typically chipset dependent and you cannot physically move the array to another computer.

There is also pure software raid. This is the type of raid offered by the Linux "md" driver. FreeBSD, Windows, and MacOS all have something similar. This is the type of raid the parent post discussed. Software raid of this type is often the best choice for home and small office use. The OS knows and understands that the device is a software raid. The drivers and tools are full featured and mature software. In the case of Linux, and probably the others, the raid array is hardware independent. The drives could be moved to any system with sufficient disk controllers. If the server is lightly loaded the performance impact is negligible. In my experience Linux software raid (md) is resilient to hardware and power failures.

I suspect that the article you read was on fake hardware raid.
Re:My suggestion... by WuphonsReach · 2007-05-30 13:26 · Score: 1

Software raid of this type is often the best choice for home and small office use.

I'd go farther and say that Software RAID is also good for situations where you have excess CPU power and aren't limited by a system bus. Back in the 32bit PCI days (ignoring PCI-X and 64bit PCI), the bus bandwidth was somewhere south of 100MB/s. So trying to do Software RAID for RAID1 would cut your throughput (due to writing to multiple disks at the same time, instead of sending a single data packet to a RAID card).
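
To put rough numbers on the bus argument above: the sketch below assumes the bus is the only bottleneck and takes the 100MB/s figure from the paragraph as a round number rather than a measurement. The function names are made up for illustration.

```python
def software_mirror_write_mb_s(bus_mb_s: float, mirrors: int) -> float:
    # Software RAID1 pushes every block across the bus once per mirror member.
    return bus_mb_s / mirrors


def hardware_mirror_write_mb_s(bus_mb_s: float) -> float:
    # A RAID card receives the block once and duplicates it on its own side.
    return bus_mb_s


bus = 100.0  # MB/s, a round-number assumption for an old 32-bit PCI bus
print(software_mirror_write_mb_s(bus, 2))  # two-way mirror in software
print(hardware_mirror_write_mb_s(bus))     # same mirror behind a RAID card
```

Real systems share the bus with network cards and everything else, so actual numbers would be lower on both sides; the ratio is the interesting part.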

For the moment, with dual-CPU / dual/quad-core systems with PCIe x8/x16 connections, I feel that it's harder to make a decision one way or the other. But the big advantage of Software RAID is that it's hardware agnostic. Give it a disk connected to anything that the Linux kernel can see and you can use it in a Software RAID. (That and you don't have to learn a whole different set of tools for each different make of RAID card that you use.)

--
Wolde you bothe eate your cake, and have your cake?
Re:My suggestion... by Poromenos1 · 2007-05-30 21:58 · Score: 1

Yes, the article I read talked about the hardware/software RAID found on motherboards. I didn't know there was a pure software RAID solution, so I assumed that was it, thank you for clearing that up for me. I'll be sure to look more into linux's md, since this enables me to have a home fileserver without even needing extra hardware. Does anyone know of a good software RAID solution for Windows, by the way?

--
Send email from the afterlife! Write your e-will at Dead Man's Switch.

Re:ZFS and Sun boxes by Anonymous Coward · 2007-05-30 05:21 · Score: 3, Informative

A SAN is not a host. It presents itself to a host machine as native storage in the form of RAID groups/LUNs and/or raw storage. Access controls related to end users are handled by the host OS, not the SAN. The SAN has no concept of file locking either; that is accomplished at the OS level on the host as well. The SAN does, however, provide access controls for which hosts can connect to it. A NAS is the storage plus some type of OS supplying network shares to a host. There are many tools that can make a NAS appear to a host as a native file system as well, which kind of blurs the lines.

In really really simple terms, a SAN provides configurable disk space to a host, a NAS supplies file space and file serving to a host(s). Many storage solutions offer various functionality and can provide both NAS and SAN functionality at some level.

Need for continuity by linear+a · 2007-05-30 05:23 · Score: 1

Don't forget to consider the need to maintain continuity of expertise for your solution. If you expect the same people to be around for years then a home-brew is more tenable. If you will have a lot of turnover of your admin/technical staff then factor in the bother of either continually retraining people or (more likely) them having to puzzle out the "old" system when they inherit it.

Yes, but... by Junta · 2007-05-30 05:28 · Score: 1

Avoiding a single point of failure is about more than the cables between systems. Most architected SAN solutions are paranoid about even solid-state component failure, e.g. a RAID controller determining that it has bad cache memory. In a typical SAN architecture, the controller notes the presence of a peer controller, throws up its hands, and says 'replace me'; the peer takes over while the failed controller is presumably hot swapped. If your controller with direct connected drives fails, well, that was your single point of failure. If your server suffers a catastrophic failure (PCI fault, processor fault, uncorrectable ECC), you have lost access to your storage. If you implement anything fancier (i.e. with multiple systems, talking to multiple controllers each with multiple paths to each individual drive), then by definition it becomes a SAN. Feel free to replace the FC fabric with iSCSI, or whatever commonly supported block-device-oriented strategy you want, it's still a SAN. Feel free to run ZFS on top of a SAN, though perhaps only in an active-passive way.

BTW, in many many ungodly amounts of explicit storage testing for data miscompare situations (why you would need above-block-layer checksumming), I have yet to see a single instance not caught by individual hard drive ECC mechanisms. I consider ZFS' checksumming nifty and good for demos involving dd and random data with an array member, but in real-world situations, it is largely redundant. Checksumming is done in lower levels transparently to the user; it's more of an assumed must-have than a 'hey look, we checksum!'. The checksumming would only catch errors that probably are obvious and caught before they ever reach ZFS. Now one thing I would like to see something do is, upon seeing a bad block reported from a lower level, automatically go back and attempt to write back that one block of data based on other array member data. Bad block reallocation is nice for writes, but I have seen one array where by the time one bad, unrecoverable block was detected, other members of the array also had unrecoverable blocks in other places. All the data was still there, but the array wanted N-1 100% good members for the entire data set, instead of N-1 100% good members on a block-by-block basis. If, instead of kicking members out of the array, it had rewritten the data it could reconstruct from the other members, a disaster would have been averted (just monitor the SMART-reported bad block reallocation count).
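
The block-by-block repair wished for above could look something like this toy sketch. It assumes simple single-parity XOR with the parity held on one member, and the function names are hypothetical; it is not how any particular controller or md actually lays out data.

```python
from functools import reduce


def xor(blocks):
    # Byte-wise XOR of equal-length blocks, the basis of single-parity rebuilds.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def repair_stripes(members):
    """members[d][s] is block s on disk d, or None if that block is unreadable.

    Each stripe is rebuilt independently; the repair fails only if a single
    stripe has two or more unreadable blocks.
    """
    for s in range(len(members[0])):
        bad = [d for d in range(len(members)) if members[d][s] is None]
        if len(bad) > 1:
            raise IOError(f"stripe {s}: too many unreadable blocks")
        if bad:
            survivors = [m[s] for m in members if m[s] is not None]
            members[bad[0]][s] = xor(survivors)  # write the block back
    return members


d0 = [b"AA", b"BB"]
d1 = [b"CC", b"DD"]
parity = [xor([d0[0], d1[0]]), xor([d0[1], d1[1]])]
# Two different disks each have one bad block, but in different stripes.
array = [[None, b"BB"], [b"CC", None], parity]
repair_stripes(array)
assert array[0][0] == b"AA" and array[1][1] == b"DD"
```

An array that demands N-1 fully healthy members would have given up on this data set, even though every individual stripe still has enough good blocks to be rebuilt.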

--
XML is like violence. If it doesn't solve the problem, use more.

Great, I love it! by Paracelcus · 2007-05-30 05:35 · Score: 1

But it has been my experience that selling management types on anything where the word "cheap" is used or anything that is built in house is nearly impossible.

--
I killed da wabbit -Elmer Fudd

Why is ZFS not good on a SAN? by emil · 2007-05-30 05:49 · Score: 1

No, ZFS doesn't replace a SAN, but ZFS does checksums on every block in-kernel, so you're sure that what you wrote is indeed what you've read.

The performance will be slightly degraded (as opposed to other filesystems), but why wouldn't you want this feature on a SAN to guard against silent firmware defects or media errors?

Re:Why is ZFS not good on a SAN? by Jim+Hall · 2007-05-30 10:15 · Score: 1

No, ZFS doesn't replace a SAN, but ZFS does checksums on every block in-kernel, so you're sure that what you wrote is indeed what you've read. The performance will be slightly degraded (as opposed to other filesystems), but why wouldn't you want this feature on a SAN to guard against silent firmware defects or media errors?

Yours is a different question. I was answering the original topic of "why not use ZFS on a server to do all the duties an expensive SAN/NAS normally provides?" In my answer, you can use ZFS (with locally-attached disk) to do normal SAN/NAS stuff, but you would only want to do that to replace an inexpensive SAN/NAS. The expensive SAN/NAS systems will do it better than a server sharing its storage, where its storage is on ZFS.

But you asked a different question: "why not use ZFS on a SAN?" I think you would want to use ZFS on SAN-provided storage. In fact, in our work environment, we are planning to use our SAN to provide the storage that we will put under ZFS. Over time, we predict our filesystems will grow to be very large. While the SAN provides good protection at the disk layer, we want easy expansion (ZFS is good at this) and protection against bit rot (again, ZFS has a good implementation for this).

You mention that ZFS access may be slightly slower than under, say, UFS. But the stability improvement on very large filesystems is worth the slight impact on performance. At least, that's how I see it.

Re:ZFS and Sun boxes by Wolfrider · 2007-05-30 05:50 · Score: 1

o That is only a SINGLE CHANNEL IDE card (2 drives max)

o It's a bit pricey compared to a Silicon Image IDE-133 RAID-capable card:

http://www.siliconimage.com/products/product.aspx?id=31
http://computers.pricegrabber.com/storage-device-controllers/m/19271025

o IDE is $dying. SATA and other tech is the way to plan for $future, and believe me I am currently fairly heavily invested in IDE drives.

--
.
== WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??

Short answer: no by Salamander · 2007-05-30 05:52 · Score: 1

That's an order of magnitude less expensive than other solutions.

Sure, and a Hyundai is cheaper than a Porsche, too. What's the difference in performance? In reliability? Don't compare GigE to 4Gb/s FC, don't compare a cheap box with multiple SPOFs (Single Points Of Failure) to a fully redundant disk array, and by the way don't include the price of switches and cables for one alternative while omitting them from the other. Make a genuine apples-to-apples comparison, show us numbers to prove that it's apples to apples, and you'll have something. Until then, blathering about ZFS's supposed data-integrity benefits while proposing an inherently unreliable hardware platform just looks like more of the astroturf Sun has become famous for.

--
Slashdot - News for Herds. Stuff that Splatters.

Multiple-disk failures? Why?! by mi · 2007-05-30 06:03 · Score: 1

it can handle multiple catastrophic disk and machine failures.

I tried to crudely calculate a RAID5's MTTF recently, and even if I assume the drive manufacturers are exaggerating their drives' MTTFs tenfold, I'm still getting an array MTTF counted in centuries, assuming a failed drive is replaced within a couple of days.

Here are my attempts (the text is the Gnuplot script, which produces the graphics), what do your company's experts say?

Even if my calculations are wrong, I suspect that a failure of another disk while the RAID is recovering from an earlier disk failure is so improbable (even if the RAID spans dozens of drives) that no effort to reduce that already minuscule risk can possibly be justified. The companies peddling such reductions (RAID6 is what some solutions are called) are preying on their customers' being bad at statistics...

Now, for RAIDs spanning many hundreds of drives — maybe...
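
For reference, this kind of back-of-the-envelope estimate can be sketched with the textbook RAID5 approximation MTTDL = MTTF^2 / (N * (N-1) * MTTR). This is not the Gnuplot script mentioned above. It assumes independent disks with a constant failure rate and ignores unrecoverable read errors during rebuild, and the input numbers are purely illustrative.

```python
HOURS_PER_YEAR = 24 * 365

def raid5_mttdl_hours(disk_mttf_hours, n_disks, repair_hours):
    """Textbook estimate: data is lost when a second disk fails during repair."""
    return disk_mttf_hours ** 2 / (n_disks * (n_disks - 1) * repair_hours)

# Illustrative inputs only: a 1,000,000-hour datasheet MTTF derated tenfold,
# an 8-disk array, and two days to replace and rebuild a failed disk.
derated_mttf = 1_000_000 / 10
mttdl = raid5_mttdl_hours(derated_mttf, n_disks=8, repair_hours=48)
print(f"{mttdl / HOURS_PER_YEAR:.0f} years")
```

Under those assumptions the result does come out in the hundreds of years; whether the assumptions hold for real drives is a separate question.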

--
In Soviet Washington the swamp drains you.

Re:Multiple-disk failures? Why?! by msporny · 2007-05-30 06:48 · Score: 2, Informative

Here are my attempts (the text is the Gnuplot script, which produces the graphics), what do your company's experts say?

The first problem with your gnuplot script is that you're assuming a Poisson distribution for HDD failures (which is incorrect). Statistical failure distribution follows a Weibull distribution with k roughly equivalent to 7.5. Unfortunately, because you build your argument off of a Poisson distribution approximation, the rest of the analysis doesn't make much sense.
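
To see why the choice of distribution matters, here is a small sketch of the Weibull hazard (instantaneous failure rate), h(t) = (k/lambda) * (t/lambda)^(k-1). With shape k = 1 the rate is constant, which is the exponential-lifetime case behind a Poisson failure count; with k > 1 the rate climbs as the drive ages. The scale value is made up for illustration, and the 7.5 is simply the shape quoted above, not something verified here.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (k/lambda) * (t/lambda)**(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

scale_hours = 100_000.0   # illustrative characteristic life, not a measured value
for years in (1, 3, 5):
    t = years * 24 * 365
    constant = weibull_hazard(t, shape=1.0, scale=scale_hours)
    rising = weibull_hazard(t, shape=7.5, scale=scale_hours)
    print(years, constant, rising)
```

Any MTTF formula derived for the constant-rate case stops being valid once the rate depends on drive age.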

If you are interested in HDD failure rates and failure prediction, there is a fantastic paper done by Bianca Schroeder and Garth Gibson of CMU. I think this is the link to their main research website.

Even if my calculations are wrong, I suspect that a failure of another disk while the RAID is recovering from an earlier disk failure is so improbable (even if the RAID spans dozens of drives) that no effort to reduce that already minuscule risk can possibly be justified.

I think you miss the point of systems such as Starfish and other distributed clustered file systems. You have many other points of failure in a system: memory, CPU, power supply, power outage, motherboard, network switch, OS kernel, router, network cable, and the all important "oops, I tripped over the power cord". There are also times that you want to take down nodes in a highly-available cluster for maintenance without affecting your applications - to do this, you need a file system that assumes and can work around node-level failure.

There is much more to highly-available clustering than just making sure your disk sub-systems are bulletproof.
-- manu

--
Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
Founder/CEO - Digital Bazaar, Inc.
Re:Multiple-disk failures? Why?! by mi · 2007-05-30 07:03 · Score: 1

The first problem with your gnuplot script is that you're assuming a Poisson distribution for HDD failures (which is incorrect). Statistical failure distribution follows a Weibull distribution with k roughly equivalent to 7.5. Unfortunately, because you build your argument off of a Poisson distribution approximation, the rest of the analysis doesn't make much sense.

Actually, I assumed evenly distributed failures, so there may still be some sense left in the analysis :-)
I did read the CMU paper just yesterday (could not update my own little script) and was most frustrated that they stopped short of a formula for a RAID5's MTTF. No, they don't owe me anything, but this would've been the single most practical result of their research. With all the rest of their findings, coming up with this last one would've been easy for them...
you need a file system that assumes and can work around node-level failure.

Absolutely, a system like that should be ready for one node going down for a short time. Yes. But handling multiple-disk failures will not give you that if, for example, all of those disks happen to be inside that single node :-) I'm sure you (your company) have thought of that...
Now, are you gaining much from working at a file-system level, rather than offering a device (SCSI, FC, or SATA)? It seems like a lot more OS-specific drivers need to be written using a file-system approach? Is it worth it?

--
In Soviet Washington the swamp drains you.
Re:Multiple-disk failures? Why?! by msporny · 2007-05-30 08:11 · Score: 1

Absolutely -- a system like that should be ready for one node going down for a short time. Yes. But handling multiple-disk failures will not give you that if, for example, all of those disks happen to be inside that single node :-) I'm sure you (your company) have thought of that...

That is why there is forthcoming mirroring support for Starfish - which would allow you to mirror files across nodes to guarantee node-level redundancy. So, if a power supply, cooling fan, or OS kernel fails in a storage node and takes it down - you are still guaranteed to be able to retrieve the file from a different storage node.

Now, are you gaining much from working at a file-system level, rather than offering a device (SCSI, FC, or SATA)? It seems like a lot more OS-specific drivers need to be written using a file-system approach? Is it worth it?

It is actually the other way around: you reduce the amount of code that you have to write severalfold when authoring file systems as POSIX-compliant user-level programs. Starfish is built on top of FUSE, which means that it works without modification on Linux and Mac OS X; a Windows port is forthcoming. Even without a Windows port, Starfish file systems can be mounted via Samba or NFS.

Due to the way Starfish is designed, we didn't have to implement any OS-specific drivers... in fact, there is hardly anything that is OS specific in Starfish.
-- manu

--
Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
Founder/CEO - Digital Bazaar, Inc.
Re:Multiple-disk failures? Why?! by Anonymous Coward · 2007-05-30 10:55 · Score: 1, Informative

A double disk failure may be very unlikely, but a disk failure combined with a read-error during rebuild isn't...

Nice little blog about that (from a manufacturer):
http://blogs.netapp.com/dave/TechTalk/2006/03/21/Expect-Double-Disk-Failures-With-ATA-Drives.html
Re:Multiple-disk failures? Why?! by mi · 2007-05-30 12:45 · Score: 1

A double disk failure may be very unlikely, but a disk failure combined with a read-error during rebuild isn't...

If the read-error is recoverable, then there is no problem. And if it is not, then you contradict yourself, for such an unrecoverable read-error would be a disk-failure, which you agreed is very unlikely to overlap the replacement process for an earlier failed disk.
Your link (and I'm rather hesitant to trust a storage vendor's advice to buy more storage, BTW) is not explicit — could it imply an undetected read error? These could happen any time (RAID or not, array-rebuild or not), and there ought to be means of detecting them...

--
In Soviet Washington the swamp drains you.
Re:Multiple-disk failures? Why?! by FoolishBluntman · 2007-05-30 14:06 · Score: 2, Informative

Hello, I work at a 3-letter company whose name starts with "E"; a friend of mine works at another 3-letter company starting with "I" as a field service supervisor. Not a month goes by without seeing a double disk failure in a RAID-5 system from a customer site for either of these companies, and that's on SCSI, SAS and Fibre Channel drives. ATA and SATA drive MTBF values given by drive manufacturers can be off by as much as a factor of 1000 depending on the lot. Most consumer-level drives are listed as "Nonrecoverable Read Errors per Bits Read: 1 per 10^14"; a 1TB drive contains 8*10^12 bits. 10^14 / (8*10^12) = 100/8 = 12.5. So the manufacturer says that if you read the entire drive 12.5 times, you will get 1 nonrecoverable read error. So if you think the manufacturer is off by a factor of 100, then every 0.125 times you read the entire drive you get a nonrecoverable read error. Do you still feel your data is safe?
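
The arithmetic above is easy to check with a few lines. This is a sketch using only the datasheet figure quoted in this comment; the rebuild estimate additionally assumes bit errors are independent, which real drives may not honour, and the 5-disk array is an invented example.

```python
import math

URE_RATE_BITS = 10**14     # datasheet: 1 nonrecoverable error per 10^14 bits read
DRIVE_BITS = 8 * 10**12    # 1 TB = 10^12 bytes = 8 * 10^12 bits

def full_reads_per_error(bits_per_error, drive_bits):
    """How many complete passes over the drive per expected read error."""
    return bits_per_error / drive_bits

def p_error_during_read(bits_read, bits_per_error):
    """Chance of at least one nonrecoverable error, assuming independent bit errors."""
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))

print(full_reads_per_error(URE_RATE_BITS, DRIVE_BITS))        # 12.5
print(full_reads_per_error(URE_RATE_BITS / 100, DRIVE_BITS))  # 0.125

# Rebuilding a hypothetical 5-disk RAID-5 of 1 TB drives reads the 4 survivors in full.
rebuild_bits = 4 * DRIVE_BITS
print(p_error_during_read(rebuild_bits, URE_RATE_BITS))
```

Even at the datasheet rate, the chance of hitting a read error somewhere during that rebuild is far from negligible under this model.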
Re:Multiple-disk failures? Why?! by Anonymous Coward · 2007-05-30 14:44 · Score: 0

A RAID setup is not a substitute for a backup. It is a live and active file system. It is used for a combination of reliability, availability, and speed, and the balance of those varies greatly with how much you want to spend. I repeat, no RAID setup is a suitable substitute for a backup. Not in the enterprise using a SAN, not in the medium business using a simple 1+0, and not for the home user messing with that new SATA RAID controller.
Re:Multiple-disk failures? Why?! by afidel · 2007-05-30 16:44 · Score: 1

Do you have anyone mounting multiple Starfish presentation nodes via Samba using DFS? This would seem to be the ideal way to access the redundant data from a Windows host without introducing a single point of failure with the Samba presentation server.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:Multiple-disk failures? Why?! by mi · 2007-05-30 17:56 · Score: 1

Not a month goes by without seeing a double disk failure in a RAID-5 system from a customer site for either of these companies.

People die on highways every day too, you know... Yet most of us drive to work, however insane this may seem to some county coroner.
Most likely the failures you are seeing are correlated due to something like overheating. But with that correlation no amount of redundancy will help much — you'll see triple and quadruple disk failures too...
So the manufacturer says if you read the entire drive 12.5 times you will get 1 Nonrecoverable Read error.

This is an alarming problem, but it has nothing to do with disk-replacement. Even during routine operations, when everything is fine, a RAID5, as far as I know, does not verify the parity during reads the way registered RAM would — it is too expensive to do so. This means, even without disk-failures, there are periodic read-errors — often undetected.
No, I don't think my data is safe, but dedicating a drive as "hot-spare" so as to reduce the array-vulnerability window from 4 hours — 2 hours for disk replacement by your employer plus 2 hours for the array to rebuild itself — to just 2 hours for the latter makes no sense. That EMC recommends doing so is nothing short of a deceptive practice designed to sell more hardware...
It takes guts to challenge "vendor recommendations" and convince the boss that he takes a much higher risk of a deadly accident on the way to work every morning than the risk of losing data by simply using a (well-ventilated) RAID5. Most sysadmins would have neither the guts nor the math to do that, and EMC is milking that for all they can...
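
The 4-hour versus 2-hour comparison can be put into numbers with a small sketch. It assumes independent disks with a constant failure rate, which is exactly the assumption that correlated failures such as overheating break, and the array size and MTTF are illustrative only.

```python
import math

def p_second_failure(n_disks, disk_mttf_hours, window_hours):
    """Chance that any surviving disk fails inside the window (independent, exponential)."""
    rate = (n_disks - 1) / disk_mttf_hours
    return -math.expm1(-rate * window_hours)

# Illustrative numbers only: 8 disks, 100,000-hour per-disk MTTF.
without_spare = p_second_failure(8, 100_000, window_hours=4)  # replace + rebuild
with_spare = p_second_failure(8, 100_000, window_hours=2)     # rebuild only
print(without_spare, with_spare)
```

Under this model the hot spare roughly halves an already small probability; the model says nothing about read errors during the rebuild itself.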

--
In Soviet Washington the swamp drains you.
Re:Multiple-disk failures? Why?! by msporny · 2007-05-31 14:58 · Score: 1

Do you have anyone mounting multiple Starfish presentation nodes via Samba using DFS? This would seem to be the ideal way to access the redundant data from a Windows host without introducing a single point of failure with the Samba presentation server.

Not really, but it's a good idea. We actually don't see a great deal of Windows usage in heavy-duty computing clusters. I wonder why that is... =)
-- manu

--
Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
Founder/CEO - Digital Bazaar, Inc.

Show me a real world test by fredr1k · 2007-05-30 06:04 · Score: 1

I would like to see some real-world tests: comparing an HP EVA8000 with, say, 32*146GB FC disks against a ZFS box with decent hardware, like SCSI disks, a gigabit network, and everything as redundant as it can be.

Run 10-15 servers against it with an "enterprise-like load", such as Exchange 2003 plus front-end, 2 DCs, 1-2 app servers, some file servers and a clustered MSSQL 2005 solution, maybe a CRM system and an ERP system. We can even add some spice by adding a VMware ESX server.

Let it run for some months. Which one causes the most problems? Which one needs the least hands-on administration? How many severity 1 errors were there where the direct root cause was failed disk communication? (Now don't forget that 90% of enterprises use Windows servers, so this is relevant; please don't begin to moan about GNU/Linux this and BSD that.)

What you people all seem to forget is the financial side of a failure. If a server breaks down for an e-commerce vendor, they lose real hard money. If the databases used by a research laboratory fail and hundreds of researchers can't work, how much does that cost? The money you saved on the homebrewed storage system is long gone. That's why enterprises tend to go for the more expensive, proven, reliable stuff. (And when it comes to FDA-regulated industry, well, don't get me started ;)

But on the other hand, people who buy proven safety tend to get out unharmed more often. Just why do you think people buy Volvo and not an obscure Chinese brand?

--
"Never EVER mess with a jumper you don't know about, even if it's labeled 'sex and free beer'." - Dave Haynie

Excellent! by ms1234 · 2007-05-30 06:09 · Score: 1

Now you have 47k to blow on a team/office party! Remember, what was budgeted must be spent :)

The best laid plans of mice and men... by Nutria · 2007-05-30 06:12 · Score: 2, Interesting

Finally, the real key - Reliability. All connections were dual-pathed, with storage presented to a pair of smart FC switches which were zoned to present storage to various systems. We could lose three of the four power cables ...

True story:

Two years ago next month, a clumsy plumber got a propane torch too close to a sprinkler head with the expected consequences: LOTS of water took the path of least resistance, where it finally filtered its way into the basement data center, coming out right on top of our SAN.

Obviously, it didn't survive.

--
"I don't know, therefore Aliens" Wafflebox1

Re:The best laid plans of mice and men... by PowerEdge · 2007-05-30 08:06 · Score: 2, Insightful

This is why you replicate to another site :).
Re:The best laid plans of mice and men... by Nutria · 2007-05-30 13:37 · Score: 1

This is why you replicate to another site

It's all cost-benefit, and what we agreed to do in the contract.

--
"I don't know, therefore Aliens" Wafflebox1
Re:The best laid plans of mice and men... by swordgeek · 2007-05-31 03:02 · Score: 1

That's quite funny--only because we had an almost identical situation right around the same time. A plumber opened a valve on a cooling water line that was supposed to be dry, because it wasn't hooked up yet. It wasn't dry, and our near-line storage got soaked, along with all of the sub-floor cabling. No data loss, though.

--

"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban

It's all about intelligent decisions by drwho · 2007-05-30 06:21 · Score: 1

Everyone thinks their data is important, but it's up to management to decide what the value of each type of data is. Certainly, the critical financial data mentioned in one post here has a high value. However, how long does it have to remain online? Other data has less value, but some tangible value, such as old email. How long does it have to remain online vs. the convenience of having it at your fingertips?

Managers pull apart the poor sysadmin, but in the end you have to tell them that they have to decide the value of each type of data, and segregate it appropriately. I did this in a crappy company I used to work for that made codecs for cellphones and video games. Management had its head up its collective ass, so I just took control (something I shouldn't have to do on a sysadmin's salary with no stock options) and told them the cost of data on each storage solution. The idea was to have important financial data and current software in development on the expensive commercial NAS, and the audio files, email, and miscellaneous junk on the cheap Linux RAID box I built out of white-box parts. The only time the Linux box had a problem was when the AC failed in the data closet (the NAS was located right near it, and I wonder if it would have puked a few moments after the Linux box panicked). Note that having reliable environmental controls is probably more important than choosing the expensive hardware. I left before the division of data was fully implemented, but it seemed like it was going well. Every manager's ass seemed to be fully covered, and at a reasonable cost.

No, ATA over Ethernet obsoletes expensive SANs by Tracy+Reed · 2007-05-30 06:33 · Score: 1

I have been using AoE (ATA over Ethernet) in heavy production use for 9 months and it ROCKS. Block device semantics over commodity gigabit ethernet is just so useful. I love being able to decouple my disk storage from my cpu power. Especially handy when using something like Xen so you can do live migration of domains. Linux really is getting a lot of high-end features that you used to only be able to find on mainframes.

Re:No, ATA over Ethernet obsoletes expensive SANs by Tracy+Reed · 2007-05-30 06:47 · Score: 1

Oh, and allow me to add that AoE has FAR less protocol overhead than iSCSI (since iSCSI is basically application layer and AoE is network layer). The AoE specification is 8 pages compared with iSCSI's 257 pages.

My SAN is pretty much as the guy describes except I use AoE. I use the AoE driver which comes in the Linux kernel and the vblade daemon which I can download from sourceforge. Works great.

Dell is not so good by Anonymous Coward · 2007-05-30 06:33 · Score: 0

So far I've heard four stories where Dell just weaseled themselves out of the "warranty". I don't know if this is a common pattern, but I will not buy Dell and do not recommend it any more.

So you pay premium and possibly get nothing in return. Not a good deal.

Re:ZFS and Sun boxes by mollog · 2007-05-30 06:40 · Score: 2, Informative

Anonymous Coward writes IDE refers to any drive (ATA, SCSI...) with integrated drive electronics, that is, everything that has come after the ancient dumb drives that required a model-specific controller on the motherboard. In other words, not a very useful term anymore.

Well, actually, IDE's history is a bit different than that. IDE requires a host bus interface, but, yes, the drives do have their disk controllers built into the PCA attached to the disk mechanism.

Before Compaq and others developed the first IDE systems, hard drives usually had external controller boards that used low-level commands. IDE standardized the host interface to disk storage at the driver level, and standardized the host bus/drive command set at the bus level.

And it's not just disk drives that use the IDE stack. Other devices can be attached to the IDE bus, too.

SCSI drives require a SCSI host bus adapter with a dedicated processor, and that adapter does the heavy lifting for disk access. IDE requires the host CPU to do a lot of processing, whereas a SCSI adapter does the majority of the work. This model was used for the FC technology. It, too, offloads the processing from the CPU.

SCSI/FC are preferred in 'big iron' types of installations. IDE/ATA/SATA are fine on a dedicated NAS system. In effect, the CPU of the NAS motherboard is doing the work that is done on the host bus adapters in SCSI/FC.

At the drive mech level, FC is a copper interface. The design of the connector on the disk mech allows it to be plugged. This provides the ability to quickly replace failed drives. The drive mechs are aggregated into some type of array to provide protection from data loss. This array of drives is then attached to systems via fiber optic cabling.

You can simulate some, but not all, of the benefits of a FC/SCSI array using SATA technology. I don't know if the IDE drivers are being rewritten to use the multi-core processors yet, but that would help reduce some of the latency.

Short answer, if what the OP was aiming for is to get into a large disk array for cheap, trading some reliability and performance for low cost, the idea is a good one. I would be looking for a multi-core cpu in the motherboard and an OS that has parallel processing drivers for the IDE channel. Be sure that all the drives have plenty of cooling. Have a backup solution. Some day, this lash-up will give you heartache, but till then, you've saved money.

Can you tell I used to work in the disk storage business?

Good luck.

--
Best regards.

Re:ZFS and Sun boxes by fluffy99 · 2007-05-30 06:43 · Score: 2, Informative

I would avoid that card. It's limited to striping or mirroring, for starters. It's also not true hardware raid and depends on the drivers to do all the raid work. You really do get what you pay for here. You also get very little notice when one drive starts going bad. You just start getting random system hangs.

Re:Congradulations, you discovered the "File Serve by numbski · 2007-05-30 06:45 · Score: 1

He could. He wouldn't want to.

The reason is that by exporting directly into Windows, you lose the #1 biggest advantage of this setup, which is LVM. In the install docs, it says to create your initial LVs only slightly larger than you need them, so I have actually only used about 200GB of my 7.5TB array right now. The reason is that you can always easily grow an LV, but shrinking, though it may work, runs the risk of data loss. Under Windows, you would have to format to the size you're going to use and that's that. I guess you *could* try using Partition Magic or similar, but...um...no. :P If you use Gigabit Ethernet, it's going to be plenty fast either way.

--

Karma: Chameleon (mostly due to the fact that you come and go).

Re:ZFS and Sun boxes by Forseti · 2007-05-30 06:46 · Score: 1

But, from reading about it...it mentions wanting to use FC (Fiber Channel) harddrives...I'm familiar with SATA and IDE...but, the FC ones are new to me.. I've been looking to get a T3 or Storedge array on eBay to hook to it...

The 280R's FC hard drives don't have to be hooked up using an external Fibrechannel enclosure. FC-AL drives are just SCSI hard drives that use an internal, arbitrated loop FC connector (usually fiber) to increase transfer speeds as opposed to a SCSI cable. I can't remember if the drives come with a FC-AL enclosure, or if they're just regular SCSI drives and the FC-AL is on the Sun's backplane...

BTW, if you get an external array, be sure to check SunSolve to see if it's compatible with the 280R, and which OBP firmware version you need to support it; they can be fickle...

--
Delay is preferable to error. (Thomas Jefferson)

NFS is fast enough by bill_mcgonigle · 2007-05-30 06:57 · Score: 1

Because of the long times it takes to move these files around, I think NFS or CIFS would be too slow. That's why I am interested in the ability of ZFS to easily export iSCSI targets. Some tests I read showed that ZFS exporting iSCSI is about 4 times faster than ZFS exporting NFS or CIFS.

A friend of mine runs a big system with ZFS and quad-bonded gigabit between the Solaris backend and Linux frontends. ZFS with compression *on* (faster than off) can saturate his network.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)

enter GPL3 by bill_mcgonigle · 2007-05-30 07:14 · Score: 3, Informative

Where's the uncertainty? Sun fears Linux, and their programmers have already admitted this is why they deliberately made a GPL-incompatible license. Using their patent minefield to prohibit GPL implementations would be incredibly foolish if widespread use of ZFS were actually their goal.

That's nice except Jonathan Schwartz has indicated that OpenSolaris will go GPL3, assuming the final version of the license is OK.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)

To obsolete has obsolesced. by mbius · 2007-05-30 07:34 · Score: 1

The online OED allegedly has the use as a transitive verb marked (obs).

Word nerds:
http://itre.cis.upenn.edu/~myl/languagelog/archives/002222.html

Sadly, the primary source (oed.com) is not available through bugmenot.

--
you can have my violent video games when you pry them from my cold, dead hands.
Prime UID Club

Document your concerns... by mollog · 2007-05-30 07:37 · Score: 1

Document your concerns to your 'higher ups', and then freshen up your resume'. Your boss isn't going to want to take the heat when this system goes down, so BOHICA, you're going to get the blame. Start your job search now.

And, ask them how much will it cost the company when the system fails and data gets lost.

The good news; the job market got better lately, so you won't be out of work for long.

--
Best regards.

You forgot ... by dJOEK · 2007-05-30 07:51 · Score: 1

Backups? Add a small autoloader (like http://www.sun.com/storagetek/tape_storage/tape_libraries/c2/) + software like bacula. Relying on RAID disks alone is stupid.
Multipathing? You want multiple links to your storage especially if you're using cheap unreliable off-the-shelf parts
Scalability? How do you grow this contraption once your greedy users start sucking up every byte you have. And yes, they will.

--
Exercise caution when modding this message up: the author acts like a jerk when his karma is excellent.

Re:RAID controller failure vs Software RAID by seawall · 2007-05-30 08:03 · Score: 1

I too have seen RAID controller failure (5 year old raid controllers) and had the unpleasant experience that: 1) At least some RAID controllers do a proprietary low-level format and 2) For at least some companies this low level format may change over time.

For this reason it's a good idea to buy spare controllers and, if you can manage it:

Have a warm-spare machine using software RAID. It's slower but it also means you can get your data back even if all you have is a sufficient number of working disks, a box to hold them, and a standard PC.

Which problem are you trying to solve? by Chris+Snook · 2007-05-30 08:10 · Score: 1

The poster's proposal sounds like a nice NAS, and a lousy SAN. A NAS saves you money by keeping your important data in a centrally-managed system with automated backups so that lay users cannot easily lose it, and so that it can be shared easily between multiple systems. A SAN saves you money by packing a big chunk of high-performance storage behind extremely-high-redundancy hardware and provisioning dedicated slices out to your servers so that you don't need dual redundant RAID controllers with mirrored write caches in each box that's dealing with critical data. They solve different problems.

Since a NAS centrally manages all of its storage, it can take advantage of fancy filesystem tricks to boost performance on SATA drives with good sequential access speeds. When you're physically provisioning chunks of disk on a SAN, you just can't do that, and you really need the 15k RPM drives with the 3 ms seek times, since you'll have numerous independent operations on physically discontiguous sectors on the platter, if performance matters at all. This is why SANs have enormous write caches.

Of course, since power failures or controller failures can knock out a write cache, high-end SANs have dual controllers with mirrored battery-backed write caches. This is specialized hardware, not something you can hack together from parts on Newegg.

The poster's idea is interesting for solving certain problems, but I don't think it's a panacea.

--
There's no failure quite as dissatisfying as a complete and total solution to the wrong problem.

It is feasible... by frank_adrian314159 · 2007-05-30 08:26 · Score: 1

... because this is essentially what the manufacturers of this type of system do (modulo file system, volume manager, specialized networking (perhaps), choice of base OS, and nifty cases). The main differences are this:

(1) The manufacturer can get hardware much cheaper than you. This doesn't make much difference in cost, because all of it goes to the profit margin. What it does buy you is that the manufacturer probably has a stock of these things that he can shove in if something breaks. At worst, he can have an entire working system delivered to your door before you could even start rustling up the parts for a replacement.

(2) The manufacturer has a service staff. What happens if you leave the company? What happens if you want to take a vacation? Someone else has to fix your cobbled-up server. It's more likely that they'll screw up the configuration or break the thing. Unless you want to give 24/7 support.

(3) The manufacturer has better management software than your set of cobbled up scripts or web pages. At least it's been tested more thoroughly (probably).

(4) The hardware and software configurations are actually tested. Not that you won't do a few tests, but it's fairly likely that you won't do the burn-in testing that these manufacturers do. Nor can you guarantee appropriate SLAs.

So, yeah. You do pay more to these people. There's a reason. If I were your boss, I wouldn't let you build some wonk box. Maybe yours will. In the final analysis, you're just making more work for yourself. Besides, if you're so good at putting together these kinds of systems, why aren't you out there in the marketplace competing with these high-priced suppliers? Don't tell me - it's just altruism for your boss...

--
That is all.

Thanks! by iknownuttin · 2007-05-30 08:57 · Score: 1

Thanks!

--
I prefer Flambe as apposed flamebait.

recipe for a storage solution by OrangeTide · 2007-05-30 10:03 · Score: 1

you have a few choices to make before buying a storage solution, here are your basic options:

storage capacity - how many terabytes do you want in your volume
reliability - you can get this through a few very reliable drives, or through many cheap drives
performance - ops/rate and streaming i/o performance
storage density - do you want a rack full of equipment? or a single small chassis
price - how much do you want to spend. (total cost of ownership)

most of these things compete with one another. you can maximize the first four and the price will be astronomically high. you can give up a few things (usually the density) and lower the price.

Anyone can build a cheap multi-terabyte array. but it will likely not be high performance or extremely reliable.

things like ZFS solve a problem I didn't list: filesystem features that improve scalability, add capabilities (like snapshots and replication), and improve the administration and maintainability of the system. the flexibility grants you some limited reliability improvements, but good drives and lots of them is still the only real solution for reliability.

--
“Common sense is not so common.” — Voltaire

Cheap Hardware Alternatives? by __aajwxe560 · 2007-05-30 10:14 · Score: 1

I smiled when I read this question as to the viability of the lower cost storage solution and how it compares to an enterprise-level solution such as that offered by EMC, NetApp, Hitachi, etc. It reminded me of a sysadmin consulting gig I was introduced to a few years ago. An energy consulting company of approximately 50 employees had a small data center that they wanted to contract out for IT support. Up until this point, the CFO had been designing and building the systems for everyone. When I walked into the data closet, he had a stack of various Dell laptops in various precarious positions in a single 2-post telecom rack. This was his server farm that they were using to perform careful data analysis of power plants, as well as store business critical data. This was the server farm. They had found that they ran out of storage rather quickly on each "laptop server", so of course there were various USB hard drives hanging off all over the place. When I inquired as to the rationale of this type of setup, he demonstrated a sense of proud accomplishment that he had solved a server consolidation issue on the cheap by just using people's old laptops and re-deploying them as servers. He didn't want the burden of those large 2U+ servers. Sure, many of the laptops had cracked screens, keyboards that didn't work locally, or just looked severely depressed, but I quickly gained the sense that it was not worth arguing with this person as to why HP or hell, even cheap Dell servers might be better.

While I am sure the long term laptop "server" maintenance would have kept me quite busy, I passed on the consulting gig.

except for end-to-end checksumming by toby · 2007-05-30 10:15 · Score: 1

Take ZFS out of the picture and you just need to use a hardware raid controller or a block level RAID

It's odd that so few posters have mentioned what, to me, is the #1 advantage of ZFS: end-to-end checksumming (combined with COW, redundancy, etc) guarantees your data in ways that no other RAID solution can.
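For anyone who hasn't seen the idea spelled out, here is a toy sketch of what "end-to-end" means. It is not ZFS's on-disk logic: the class name, the SHA-256 choice, and the two-dictionary "mirror" are all assumptions made for illustration. The point it tries to show is that the checksum is handed back to the caller (standing in for the parent block pointer) instead of being stored next to the data, so a copy that silently rots can be detected and repaired from a good copy.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Checksum kept by the parent, away from the data it protects."""
    return hashlib.sha256(data).digest()


class MirroredStore:
    """Hypothetical two-way mirror; each dict stands in for one disk."""

    def __init__(self):
        self.copies = [{}, {}]

    def write(self, block_id, data):
        for copy in self.copies:
            copy[block_id] = data
        # The caller (the "parent pointer") remembers this value.
        return checksum(data)

    def read(self, block_id, expected):
        for copy in self.copies:
            data = copy[block_id]
            if checksum(data) == expected:
                # Self-heal: overwrite any sibling copy that disagrees.
                for other in self.copies:
                    if other[block_id] != data:
                        other[block_id] = data
                return data
        raise OSError("every copy failed its checksum")
```

A plain mirror without the parent-held checksum can notice that two copies differ, but it has no way to tell which one is right; that is the gap this sketch is meant to illustrate.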

--
you had me at #!

...because it's not the same thing at all? by toby · 2007-05-30 10:18 · Score: 1

Google "end to end checksumming zfs" and READ ON.

--
you had me at #!

re: Does ZFS Obsolete Expensive NAS/SANs? by mythbuster · 2007-05-30 11:05 · Score: 1

It is a common error for people to think that storage is storage, but there are quite a few differences between the configuration you spec'd and a decent modular fibre-channel array. Before I ramble on I would like to add that I have that exact case, stuffed full of 400GB SATA drives at home and it's *great* (although I am currently using LVM and ext3). I will also add that all of my friends and colleagues who use, or are still testing ZFS have pretty much come to the same conclusion. It's fast and virtually indestructible. Everyone seems to love it so far.

I am a storage admin by trade and manage nearly 4000 disks and 2 dozen arrays. Your low cost solution is a good one for non-mission critical applications and low to medium I/O demands. Here is what the expensive arrays offer in addition to what your solution provides:

Hot swappable, redundant (RHS) power.
RHS controllers, cache and NVRAM.
4 or even 8 loops (used in pairs) to provide two active/active 200MB/s paths to each disk, or 300MB/s for SAS.
Large write cache. Most drives have 8MB or maybe 16MB cache but the array may provide anywhere from 2GB to 32GB (or more) of cache, which makes writes a *lot* faster.
Intelligent pre-fetch algorithms that, combined with cache, can significantly improve read response times.
Configuration tuning options. Try changing your LUN layout without downtime. Try changing your cache block size on the fly. Try turning read caching off for write intensive LUNs, turning read caching on for a few more LUNs that do sequential I/O, and bumping up the number of blocks prefetched on those LUNs...all on the fly without interruption of service. Try updating the firmware on your controllers without disruption to service. Try doing parity scrubs on your drives without disruption. Try getting more statistics from your storage than you can get with iostat or sar.

I'll leave support out of the equation but it should be a consideration when designing a solution.

And now for a few notes on the CoolerMaster appliance model. How many controllers will you need to accommodate 7TB raw? And how many PCI slots will that require? Be aware that if you decide to go for maximum density (probably 750GB 7200RPM) you may be asking for quite a few IOPS per spindle. Now how about the network side of the equation. How many gig ports are you going to need? Doesn't take much to snarf a 12MB/s pipe in the disk world. And how many PCI slots will be required for those NICs? You may have a tough time finding a motherboard that can accommodate your slot requirements *and* have enough bus/processor power to handle all that I/O, logical device overhead and filesystem overhead. While it is valid to suggest distributing load across many of these units, you may be introducing significant management overhead. Probably better off with a small number of scalable boxes that can handle a lot of disks per box.

Enough rambling. ZFS is showing significant promise. Your question is a fair one, and your configuration is a good one for certain applications. But it is *very* important to compile all of your business requirements before designing a solution. And those of us who have been around for a while have learned that sometimes the least expensive solution ends up being quite a bit more expensive than anticipated (in downtime, manhours and sometimes, actual data loss).

Re:ZFS and Sun boxes by billcopc · 2007-05-30 11:37 · Score: 1

I'd skip hardware RAID altogether and just get the most basic SATA controllers I could find. Even fake RAID controllers use vendor-specific disk signatures, which means when the cheap controller dies and you find out the company went out of business or the model is discontinued after 18 months, your data is locked away in a weird illogical format.

On the other hand, with software RAID and a dumb controller, you can move your drive arrays anywhere and they will work. I personally use Promise SATA2 cards because I was able to score them cheaply from my supplier, but go with whatever's available. They all pretty much use the same garbage Silicon Image chips so there's hardly any difference between the OEM stuff and the pricey brand-name adapters.

--
-Billco, Fnarg.com

Here's what you get by buying name brand by MadMorf · 2007-05-30 11:46 · Score: 1

Disclaimer: I AM a Storage Engineer at one of the big 3 SAN/NAS houses.

Here's what you get:
Double Parity RAID instead of RAID-5. Thousands of times more resilient. Read the latest literature on disk failure rates on SATA drives and then try to sleep at night knowing you only have RAID-5.

Speed: Dedicated storage OSes are optimized for raw speed. For more info see: Standard Performance Evaluation Corporation

Expandability: How much downtime do you require to expand your homebrewed system? Most commercial storage systems require none.

Support: What happens WHEN you do have a multi-disk failure (notice I said WHEN and not IF)? We can recover most multi-disk failures without data loss. Can you do that?

Disaster recovery and business continuity: Big box storage systems are built around this. What are you going to tell your CIO/Insurance company?

Sure, we cost more.
But how much is your data worth?

--
Goofy, Geeky Gifts and More!

Re:Here's what you get by buying name brand by this+great+guy · 2007-05-30 19:53 · Score: 1

Double parity: ZFS implements it (raidz2).
Speed: ZFS is fast because it is typically used with local disks (high-throughput & low-latency), for example: the sequential read/write throughput on a Sun Fire X4500 server with 48 SATA disks in a 4U chassis has been measured as reaching ~2-3 GBytes/s (yes, bytes not bits).
Expandability: ZFS requires no downtime when expanding the size of a zpool.
Support: Sun offers commercial support for Solaris & ZFS.

My advice to NAS/SAN vendors: keep an eye on ZFS :)

It depends on what you need by Anonymous Coward · 2007-05-30 12:23 · Score: 0

There are a number of areas where ZFS might not be sufficient:

Feature Set:
First off, every serious NAS/SAN vendor is going to have a snapshot solution. Here are a couple other features you might need: Automatic Replication, High-Availability / Failover, Integrated Virus Scanning, Clustering. Many of the "exciting" features listed in Sun's press releases are not even vaguely revolutionary.
Manageability:
As the name "NetApp" may imply, NAS/SAN vendors often sell "appliances" that attempt to simplify many of the management concerns (e.g. monitoring, automation of backups, etc). How well they do this varies from vendor to vendor and based on what features you need.
Compatibility:
Large NAS/SAN vendors have already verified that their product works with a number of 3rd-party apps and hardware. Will your old tape hardware work well with ZFS? Is SQL certified to run on ZFS? Will Sun's customer support help you if you do get things working? Likely not.
Performance:
For all the business I've heard about ZFS being the "last word in file systems", the amount of actual performance data has been incredibly lacking. For example, most NFS products have published their SPEC numbers. Although these performance results are often gamed a bit, they're the current standard for NFS performance.
The performance numbers I have seen with ZFS so far are useless (e.g. see here for someone measuring how fast ZFS can write to RAM or here for someone getting the blazing throughput of 45 mb/s). Filesystem-based iSCSI solutions (as opposed to SANs) tend to have terrible performance (this includes some of NetApp's products), so I'm a bit dubious about claims that ZFS does iSCSI faster than it does NFS.
Reliability:
In addition to reliability features such as High-Availability and Disaster Recovery, how many enterprise production environments is ZFS actually running in? How many data corruption bugs are waiting to be ironed out? How mature are the repair tools (e.g. fsck)?
Also, enterprise NAS/SANs (e.g. those of NetApp) often have a nice feature where an operation is stable once the client receives an ACK. They get this by logging pending operations to non-volatile RAM. As far as I can tell this is not possible with ZFS, which means that your applications need to be aware that operations may need to be resent to the server after a crash.

The submitter needs to check which of these things are important to him/her, and then decide if ZFS is suitable. For homes and small offices NetApp, EMC, and most other large storage vendors are likely overkill. For others, enterprise NAS/SAN may be the only option.

[posted anonymously as my employer might not be happy with my post]

Finally a storage appliance with brains! by frankShook · 2007-05-30 14:22 · Score: 1

At home I favor a dedicated machine over SAN/NAS due to lack of programmability and auto backup. With LINUX, I expect easy crontab backup schemes!

ZIL and sync by aphor · 2007-05-30 14:49 · Score: 1

Try to NFS export a ZFS volume and see what happens when multiple apps accessing the volume perform what they believe to be relatively cheap NFS sync RPCs to get a guarantee that the data has been committed.

The ZFS volume will not return from the sync nfsd generates to implement the RPC on the host storage until the ZIL (ZFS Intent Log) has been flushed. This synchronizes operations that should be parallel and makes the ZIL a bottleneck.
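To make the cost concrete, here is a minimal local sketch of the two write patterns. It is only an analogy: it uses a plain local file and os.fsync, not NFS or the ZIL, and the function name is made up for illustration. The "sync after every record" path is the shape of what the NFS sync RPCs above force on the server; the "sync once at the end" path is what you would get if the commits could be batched.

```python
import os


def write_records(path, records, sync_each):
    """Write records to path; return how many times data was forced to stable storage."""
    syncs = 0
    with open(path, "wb") as f:
        for rec in records:
            f.write(rec)
            if sync_each:
                # Each record waits for the device before the next one starts.
                f.flush()
                os.fsync(f.fileno())
                syncs += 1
        if not sync_each:
            # One commit covers the whole batch.
            f.flush()
            os.fsync(f.fileno())
            syncs += 1
    return syncs
```

Timing both variants on your own hardware (for example with time.perf_counter around each call) gives a rough feel for how much of the penalty is simply the number of forced commits.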

There are evil hacks to make ZFS lie about the state of files on disk returning from sync when the IO has been received (written to the ZIL). This apparently creates a race as subsequent reads can return data that has not been affected by the data written to the ZIL.

Please tell me I am wrong. I'd love to try this again, but I just cannot suck up the performance penalty.

--
--- Nothing clever here: move along now...

Re:ZFS and Sun boxes by drsmithy · 2007-05-30 16:46 · Score: 1

But I never understood the difference between a SAN and a NAS when the configuration gains any complexity beyond a textbook example.

A NAS is something you access via the network (CIFS/SMB, NFS, etc).

A SAN is something you access as a block device (/dev/sda, etc).
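If it helps, the difference can be sketched from the client's point of view. This is a simplified illustration, not a real NFS or Fibre Channel client: the function names and the 512-byte block size are assumptions, and an ordinary file can stand in for the block device.

```python
import os

BLOCK_SIZE = 512  # assumed sector size for the sketch


def read_block(device_path, block_number):
    """SAN-style access: the client addresses raw blocks by number and
    has to bring its own filesystem to make sense of them."""
    with open(device_path, "rb") as dev:
        dev.seek(block_number * BLOCK_SIZE)
        return dev.read(BLOCK_SIZE)


def read_file(share_root, relative_name):
    """NAS-style access: the client asks for a file by name and the
    server owns the filesystem behind it."""
    with open(os.path.join(share_root, relative_name), "rb") as f:
        return f.read()
```

In the first case the storage box never knows what a "file" is; in the second the client never sees a block.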

Hot spares are in ZFS by Anonymous Coward · 2007-05-30 17:24 · Score: 0

As of June last year, when Eric Schrock added this entry to his blog: ZFS Hot Spares

Hitachi _can_ go down! by Terje+Mathisen · 2007-05-30 18:24 · Score: 2, Interesting

We have a large (geographically replicated) Hitachi disk array (as well as many NetApp boxes), mostly it works very well indeed.

However 2-3 years ago we stumbled (very painfully!) across a firmware bug which took the primary Hitachi array down:

As we (i.e. the Hitachi service reps) were upgrading the mirrored cache, an error hit the active half, and it turned out that the firmware would always check the mirror (a very good idea, right?) before falling back on re-reading the disk(s). However, the firmware error handler which could have handled an error on the mirror copy as well (as long as the data wasn't dirty, of course), did not know how to handle a _missing_ copy, instead it blew away the entire array while crashing.

It took us three days to get everything back up, even though most of the critical systems were running off of the WAN backup copy after 2-3 hours.

Terje

PS. That particular firmware bug has of course been extinguished, but there's bound to be some more lurking around. Getting totally non-stop operation is a _hard_ problem!

--
"almost all programming can be viewed as an exercise in caching"

ZFS to replace SANs/NAS? No. by kenoshi · 2007-05-30 19:19 · Score: 1

If all you want is cheap NAS, you can do the same thing using any number of the free NAS solutions out there. Just add lots of cheap SATA disks in whatever cheap JBODs you can get your hands on, and you are done.

To replace a modest percentage of the functionality of a mature NAS solution like NetApp's ONTAP + WAFL, you would need to:

1. Install Solaris 10 on an extremely reliable platform
2. Configure your zpools/filesystems
3. Set your NFS attributes
4. Install and compile all dependencies for Samba to support LDAP, Winbind, Kerberos (SFW stuff doesn't cut it)
5. Configure winbind/kerberos/LDAP
6. Configure LDAP backend in Samba (hint: NFS)
7. Either extend AD schema/turn on POSIX attributes or use a metadirectory to store idmaps

Now throw manageability into this equation...considering half of the admins out there have problems getting past step 5 above with any kind of consistency, NetApp suddenly makes a lot of sense, especially to companies that are contractor-happy and not willing to maintain that level of in-house expertise.

What about scalability as a NAS (keyword NAS)? We can always throw SC 3.2 into the mix right? Unfortunately, NFS/Samba can only be configured as failover, and not scalable services in a Solaris Cluster. And ZFS? it can't be globally mounted and can only be used locally or for failover through HAStorage+...by the time you get to this point, you won't be using SATA drives with cheap JBODs anyway.

Now what about using ZFS to replace a SAN? Again, no.

A SAN's main function is to provide hosts with uniform access to storage devices in your environment, be it through FC/Infiniband fabrics or iSCSI. This in turn allows you to better manage many aspects in your storage environment, such as but not limited to storage allocation, host access control/security, availability, and virtualization.

ZFS, at its core, is a file system. Yes, it's well integrated with volume management, NFS, and self-healing functions. But it is still just a file system. A file system's job is to provide access to files (yes I know you can set up raw volumes with ZFS).

If you really look at the big picture, a file system is just another layer of virtualization in the storage game...by their very nature, filesystems and SANs sit at different layers in the OSI model. Comparing ZFS to SANs, is somewhat akin to comparing FTP to IP.

One cannot replace the other, they serve different functions in your storage environment.

Lucy in the sky with diamonds! by jotaeleemeese · 2007-05-31 00:07 · Score: 1

What are you going to offer us next? World peace?

With EMC I get full support for the storage solution. If a drive fails I don't have to diagnose it and I don't have to change it, all is done as part of the *service* they offer.

With your solution I have to waste my time diagnosing and replacing disks. With SAN solutions I do system administration and planning.

I do not want to be a disk replacing monkey (and when you have hundreds or thousands of computers, each with several disks, EMC is worth the price, you will have disks failing daily and the amount of time to deal with disk failures becomes substantial).

People that have not worked in big IT estates forget that what is bought is not a machine but a service.

In other words, you are comparing apples and oranges.

--
IANAL but write like a drunk one.

It depends what you want to be by jotaeleemeese · 2007-05-31 00:11 · Score: 1

System administrator or disk changer (or machine changer, your pick) monkey?

When your estate comprises more than a few servers you want everything standardized and rationalized. This saves time and money.

--
IANAL but write like a drunk one.

Re:It depends what you want to be by lymond01 · 2007-05-31 03:29 · Score: 1

That's it exactly. 400 computers, one sys admin. If I built my own, I'd be a computer-builder, not a sys admin. That extra 30% over 3 years balances the cost to my department for my time in avoiding trifling matters like replacing motherboards with bulged capacitors, etc.

SANs are not single POF by jotaeleemeese · 2007-05-31 00:18 · Score: 1

You can replicate across data centres....

--
IANAL but write like a drunk one.

Delete files. by jotaeleemeese · 2007-05-31 00:33 · Score: 1

You will never revisit 99% of that information.

I keep my best (or more meaningful) pictures and throw all the rest away.

Since I am not into design or something that needs access to a big archive of pictures, my bad photographs do not need to be stored forever.

As for music, today's software makes it easy to check what one is actually listening to; stuff I have never heard but that was downloaded or ripped because I could gets regularly removed. I only keep my fav music or stuff I go back to regularly.

As for movies, I have let's say 2 hours/day to watch TV and/or movies. That is no more than 8 movies a month or around 100 a year, and this without counting going to the cinema, etc. So I keep 100 movies at all times and once in a while I get rid of some. I just don't need them and most likely they will never be watched again.

I happily survive with 160GB with plenty of space to spare.

--
IANAL but write like a drunk one.

Re:Delete files. by Baddas · 2007-06-01 08:41 · Score: 1

I find it far less expensive in time and effort to just keep it all.

It exists... by Junta · 2007-05-31 00:50 · Score: 1

It's been a long time, but the Windows 2000 server I used to admin back in the day, software raid was integrated into the 'disk management' MMC plugin. I believe last I checked on a non-'server' MS install (2000 workstation), software raid beyond simple mirroring was not available (at least I couldn't find the same way to enable it I did on the server OS).

This was rather vague, but I know MS does this as well (that's one of the installs where I had to rebuild the array from it thinking two drives went bad concurrently). I'm sure google can help in more detail (it's obviously been years since I had to be a Windows admin, never even touched Server 2003).

--
XML is like violence. If it doesn't solve the problem, use more.

Re:It exists... by Poromenos1 · 2007-05-31 04:42 · Score: 1

Hmm, yes, it's called "Dynamic disks" in XP. So THAT's what that was. Very useful bit of information...

--
Send email from the afterlife! Write your e-will at Dead Man's Switch.

What I find amusing.. by Junta · 2007-05-31 00:59 · Score: 1

Is people insisting on high-performance RAID knowing full well that the system will only ever do NFS/SMB serving, not running any more complex processes locally, and accessing the disks via a single gigabit link. Or assuming absolutely that a RAID controller that is true hardware is guaranteed to be faster. I have seen RAID controllers where, true, the processor usage/throughput was technically lower (meaning the already largely idle cpus were more idle), the hardware engine on the card would hit its cap in performance well before the processor would hit any cap. Basically, if your criteria is 'must be hardware RAID', without any specific metrics and without researching the quantified performance numbers for the cards being looked into, then I think software RAID is probably really the answer.

Adding to your point about differing sets of tools, if you have an 8-year old server the raid controller goes out on, your chances of finding a compatible controller that can understand and import that array is slim, so you could have to rely unnecessarily on your backups just because your hardware vendor doesn't support you anymore. Hardware RAID has a lot of disadvantages.

--
XML is like violence. If it doesn't solve the problem, use more.

From the POV of the consumer.... by jotaeleemeese · 2007-05-31 02:05 · Score: 1

.... that does not matter. What I don't want is something that breaks on me easily. Both methods achieve this.

--
IANAL but write like a drunk one.

FUD. by jotaeleemeese · 2007-05-31 02:31 · Score: 1

All technology has hiccups when it is first introduced.

It is completely irrational to abandon a new technology that clearly has advantages on the strength of *one* bad experience.

--
IANAL but write like a drunk one.

It *really* depends. by jotaeleemeese · 2007-05-31 02:55 · Score: 1

Is that your only machine and all your business plan relies on it being up and running? Then yes, maybe you should plan for redundancy as you are saying.

Do you have 500 machines (or more) providing different services? Then you would be mad not to "outsource" the disk management to a specialist via SAN.

--
IANAL but write like a drunk one.

Empirically scientific study..... by jotaeleemeese · 2007-05-31 03:15 · Score: 1

I will not point out all the juicy ironies there; that is an exercise for the reader.

Anyway, conceding that there are no substantial differences between the different kind of disks when it comes to quality (yeah, right) you talk about predictive failure. Well I talk about a disk fucked.

Once it is fucked, I have hundreds of machines in which I need to do multiple activities.

Do I need to waste my time to change that disk, ensure it is mirrored, etc? No, I am a systems administrator, not a DRM (Disk Replacement Monkey). Big companies with hundreds of computers actually save money using a SAN, since specialist time can be focussed in higher level activities.

I do not want to have 3 copies of each important machine just to save $47000 per system (as if). That would just multiply the chances of something going wrong.

You truly believe that SAN providers are screwing over their customers (banks! the ones stingiest with money when you come to think about it) but miserably fail to appreciate the environments and demands in which a SAN is a perfectly justifiable, cost efficient solution.

That Google developed their own SAN software (or equivalent) does not mean that SAN companies specialized in those services are redundant.

--
IANAL but write like a drunk one.

Re:Empirically scientific study..... by Znork · 2007-06-01 21:32 · Score: 1

"I do not want to have 3 copies of each important machine just to save $47000 per system (as if)."

I'd suggest you're reading things into what I'm saying. Like the article we're talking about, I'm not saying you should stick local disk and triple your machine park.

I'm saying you'd be better off with COTS SATA arrays shared over iSCSI over dedicated redundant gigabit (or 10GB) ethernet. With mirroring, backups and/or rsynced/revisioned data.

"fail to appreciate the environments and demands in which a SAN is a perfectly justifiable, cost efficient solution."

I've worked in such environments for ten years. I've worked with such SANs. In my experience their failure rate is vastly higher than COTS hardware. Not because they get disk or hardware failures, but because their complexity and relative rarity has the whole infrastructure riddled with buggy firmwares, crap drivers, random incompatibility and inexperienced 'certified' service technicians.

Which means you cant rely on the storage anyway and have to use host level mirroring to cope with relatively frequent SAN failures and/or maintenance (to be fair, the last few years have seen improvements, but it still aint there, and the consumer hardware just keeps getting "there" faster). Which means you could just as well be using less expensive COTS solutions.

Migrating from local disk to SAN promised to save time on storage maintenance work. It hasnt. The sysadmins may not be replacing disks anymore (which the vendors or "disk-monkeys" usually did anyway) but now they have to work on SAN storage migrations, remirroring systems after maintenance, remirroring systems after SAN failures, dealing with substandard drivers, etc. As a whole I'd say far more time is spent on storage with SAN attached systems than the non-SAN attached systems. And to top that off, you usually get storage delivery times that are several times more than what it'd take to buy COTS disk and shove it in yourself. And prices that are orders of magnitude higher than the COTS disk.

SAN vendor costs are not justifiable, but not because they're high priced. They're not justifiable because they dont deliver what they promise. That means if you replace them with a solution that delivers what a SAN _actually_ delivers (as opposed to what they promise), you'd get a far cheaper solution.

And his time? by jotaeleemeese · 2007-05-31 03:19 · Score: 1

You are not factoring his time and lack of service.

Risk is a cost that has to be covered, most of the people defending this simple solution are not factoring that hidden cost that has to be covered somehow if you are professional (spare parts, duplicate systems, downtime costs, etc).

Get all that factored and what looked like a good idea begins to stink a bit fishy. Unless you need a departmental server with data that is not vital for your business.

--
IANAL but write like a drunk one.

big, fast, reliable - pick two by wsanders · 2007-05-31 08:41 · Score: 1

The weak link is going to be the enclosure. Does it have dual interfaces? Dual Power? Is the disk tray to chassis connection reliable? Does hot-swap really work, every time? Can you get environmentals from the enclosure? What about firmware upgrades? There is no Hell quite like cheap-ass disk enclosure Hell.

And someone get back to me in 7 years with the track record of these 1TB SATA disks, when subjected to continuous, database-style random access activity.

From the Line Tech web page: "Dual 300W power supplies. If you have 6 or fewer hard drives installed, then only one power supply needs to be on." - Not really dual-power IMHO.

And ZFS is no panacea, although it's a great step forward in Linux-land, and as far as I know, you still can't create a bootable ZFS volume. (It's been a while since I messed with it, so this may be wrong.)

Should be fine for your home system, though. Buy a cheap AIT drive on Ebay and keep it backed up every once in a while, you should be fine.

--
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"

Re:ZFS and Sun boxes by Wolfrider · 2007-06-01 02:51 · Score: 1

I don't use the RAID "feature" on the card; but it's a decent cheap IDE expander. Pretty much standardized on it because there is Linux support for it out of the box, and if I need 4 more IDE drives or an extra DVD burner in a system, it fits the bill.

I've even started stepping away from LVM with drive capacities getting larger these days. One disk in an LVM goes bad and you're fscked.

--
.
== WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??

Details? by RandyOo · 2007-06-03 19:01 · Score: 1

I'd love to hear more hardware details on that file server setup. I'm looking into doing the exact same thing, except I have *zero* experience with Solaris. So any pointers you have would be great, but mostly I'd like to know what motherboard/chipset you ended up with that was cheap and supported.

Re:Details? by this+great+guy · 2007-06-09 23:04 · Score: 1

The best reference would be the official Solaris Hardware Compatibility List (HCL). But here is a piece of advice: the whole family of nVidia nForce chipsets is generally well supported by Solaris: any motherboard based on the nForce 4, nForce 500 (and maybe nForce 600) chipset should work flawlessly with Solaris, that includes probably more than half of the market of entry-level and mid-range motherboards. In my case I wanted a cheap, low-power, GbE-enabled fileserver capable of serving files over NFS at a throughput of 70-80% of the bandwidth of a GbE link. So I bought the most inexpensive nForce 4 mobo I found on newegg, with on-board GbE (even entry-level GbE controllers are easily capable of saturating a GbE link nowadays), and with a socket 754 (so I could use it with a low-power 25W Turion processor).
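For reference, the 70-80% target works out as follows. This is a back-of-the-envelope sketch using the raw GbE line rate and decimal megabytes; it ignores Ethernet, IP and NFS overhead, and the function names are made up for illustration.

```python
GBE_BITS_PER_SECOND = 1_000_000_000  # nominal gigabit Ethernet line rate


def link_mb_per_s(bits_per_second):
    """Raw line rate in decimal megabytes per second."""
    return bits_per_second / 8 / 1_000_000


def target_range_mb_per_s(bits_per_second, low_percent=70, high_percent=80):
    """Throughput band corresponding to a percentage of the raw line rate."""
    raw = link_mb_per_s(bits_per_second)
    return raw * low_percent / 100, raw * high_percent / 100
```

With those assumptions, target_range_mb_per_s(GBE_BITS_PER_SECOND) gives roughly 87.5 to 100 MB/s out of a raw 125 MB/s, which is the ballpark the disks and controller need to sustain.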

Regarding the SATA controller to use, I would recommend you either the Marvell 88SXxxxx family (such as the 88SX6081: 8-port, PCI-X, about $100), or the Silicon Image 3124 (4 ports, PCI-X, about $60), or an AHCI compatible controller (such as the built-in SATA controller found in modern Intel chipsets: ICH6, ICH7, etc, but you will need to use recent OpenSolaris builds: "Nevada B56" and up). Solaris supports SATA hotplug for these 3 families of SATA controllers.

I kept the list of what I bought 3 months ago:
Coolermaster RC-330-KKN1-GP Elite 330 Mid Tower Case (Black) Retail
$45 http://www.zipzoomfly.com/jsp/ProductDetail.jsp?ProductCode=141815
ECS NFORCE4-A754 Socket 754 NVIDIA nForce4 4X ATX AMD Motherboard
$46 http://www.newegg.com/Product/Product.asp?Item=N82E16813135190
AMD Turion 64 MT37 Lancaster 2.0GHz (25W)
$69 http://www.newegg.com/Product/Product.asp?Item=N82E16819103521
COOLER MASTER DK8-8ID2A-0L 80mm Rifle CPU Cooler - Retail
$5 http://www.newegg.com/Product/Product.asp?Item=N82E16835103166
CORSAIR ValueSelect 512MB 184-Pin DDR SDRAM DDR 400
$38 http://www.newegg.com/Product/Product.asp?Item=N82E16820145026
Thermaltake W0070RUC TR2 Series 430W
$40 http://www.zipzoomfly.com/jsp/ProductDetail.jsp?ProductCode=370565
Western Digital Caviar SE 16 WD5000AAKS 500GB
$625 (125*5) http://www.zipzoomfly.com/jsp/ProductDetail.jsp?ProductCode=101259
Silicon Image 3124
$70 http://cooldrives.com/saii3gra4p64.html
Total: $938
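A quick sanity check of that total, plus a rough cost per usable terabyte. The prices are the ones listed above; the pool layout is an assumption (the five disks in one single-parity raidz group, or raidz2 for comparison), and the usable-space figure is the simple "disks minus parity" estimate that ignores filesystem overhead.

```python
# (label, price in USD) taken from the list above
PARTS = [
    ("case", 45),
    ("motherboard", 46),
    ("cpu", 69),
    ("cpu cooler", 5),
    ("ram", 38),
    ("power supply", 40),
    ("5 x 500GB disks", 125 * 5),
    ("sata controller", 70),
]


def total_cost(parts):
    return sum(price for _, price in parts)


def usable_tb(disks, size_gb, parity_disks):
    """Rough usable space: data disks times disk size, in decimal TB."""
    return (disks - parity_disks) * size_gb / 1000


def cost_per_usable_tb(parts, disks, size_gb, parity_disks):
    return total_cost(parts) / usable_tb(disks, size_gb, parity_disks)
```

Under the single-parity assumption that is 2 TB usable for $938, or $469 per TB; with two parity disks it is 1.5 TB usable and the per-TB figure rises accordingly.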

If you buy this today, prices would be even lower ! I would feel jealous of you having a setup cheaper than mine :-)

Re:ZFS and Sun boxes - not RAID by FreeBSD+evangelist · 2007-06-04 10:13 · Score: 1

It's a bit pricey compared to a Silicon Image IDE-133 RAID-capable card

You don't want RAID capable controllers. That keeps ZFS from seeing the individual drives and prevents much of the magic (like self-healing). This all works best with JBOD (Just a Bunch Of Disks).

YEAH WHATEVER! by MilesNaismith · 2007-06-04 18:05 · Score: 0, Flamebait

Another n00b who thinks they can build a data-center capable 5-9's setup using cheap stuff they strung together. I love you guys. Let me take you on a tour of my email-server setup where I can randomly yank out fiber or even flip off one of the 3510FC arrays in my group, and watch it all keep purring along with 70,000 users NOT EVEN NOTICING. You simply don't understand that saving a few bucks on a lashed-together setup can be MUCH less important than having it "JUST WORK". Can your solution survive a mid-plane board or a controller or PSU going PHTHTHT! Can it survive an admin oops? Mine can. It's massively inefficient to do a RAID 5+1+0 as far as money, but it's VERY efficient as far as not ever having to say "yeah this service will be down for 3 days while we figure out how to unscramble things and restore the data." Professionalism means never having to say you're sorry. That's why there will always be work for guys like me. Because one day your "miraculously cheap" setup will go FIZZLE and people will look around for a home for their services that is first and foremost RELIABLE not cheap.
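For scale, this is the arithmetic behind "5-9's" and the three-day outage mentioned above. It is a sketch that assumes a 365-day year and counts all downtime equally; the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year


def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability of N nines (5 means 99.999%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


def availability_after_outage(outage_minutes):
    """Yearly availability if the only downtime is one outage of the given length."""
    return 1 - outage_minutes / MINUTES_PER_YEAR
```

Five nines allows a little over five minutes of downtime a year, while a single three-day outage (4,320 minutes) leaves a year at roughly 99.2% availability, which is barely two nines.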

578 comments