Does ZFS Obsolete Expensive NAS/SANs?

← Back to Stories (view on slashdot.org)

Does ZFS Obsolete Expensive NAS/SANs?

Posted by kdawson on Wednesday May 30, 2007 @12:02AM from the fast-secure-reliable-cheap dept.

hoggoth writes "As a common everyman who needs big, fast, reliable storage without a big budget, I have been following a number of emerging technologies and I think they have finally become usable in combination. Specifically, it appears to me that I can put together the little brother of a $50,000 NAS/SAN solution for under $3,000. Storage experts: please tell me why this is or isn't feasible." Read on for the details of this cheap storage solution.

Get a CoolerMaster Stacker enclosure like this one (just the hardware not the software) that can hold up to 12 SATA drives. Install OpenSolaris and create ZFS pools with RAID-Z for redundancy. Export some pools with Samba for use as a NAS. Export some pools with iSCSI for use as a SAN. Run it over Gigabit Ethernet. Fast, secure, reliable, easy to administer, and cheap. Usable from Windows, Mac, and Linux. As a bonus ZFS let's me create daily or hourly snapshots at almost no cost in disk space or time.

Total cost: 1.4 Terabytes: $2,000. 7.7 Terabytes: $4,200 (Just the cost of the enclosure and the drives). That's an order of magnitude less expensive than other solutions.

Add redundant power supplies, NIC cards, SATA cards, etc as your needs require.

22 of 578 comments (clear)

Min score:

Reason:

Sort:

ZFS by Anonymous Coward · 2007-05-30 00:05 · Score: 5, Informative

Also should be noted that FreeBSD has added ZFS support to Current (v7). It's built on top of GEOM too so if you know what that is you can leverage that underneath zfs.
1. Re:ZFS by beezly · 2007-05-30 05:09 · Score: 4, Informative
  
  A correction:
  
  First off it is "copy on write"
  
  Copy-on-write is quite a misnomer here (even if Sun use that term). It is a Transactional filesystem. Blocks are not copied upon write, they are only written and then the transaction log is updated. It's far more clever than old-fashined COW schemes. It can be compared with NetApp's WAFL filesystem.
Congradulations, you discovered the "File Server" by BigBuckHunter · 2007-05-30 00:08 · Score: 4, Informative

For quite a while now, it has been less expensive to build a DIY file server then to purchase NAS equipment. I personally build gateway/NAS products using Via c7/8 boards as they are low power, have hardware encryption, and are easy to work with under linux. Accessory companies even make back plane drive cages for this purpose that fit nicely into commodity PCs.

BBH
Current issues by packetmon · 2007-05-30 00:12 · Score: 4, Informative
I've snipped out the worst reasons as per Wiki entry:
- A file "fsync" will commit to disk all pending modifications on the filesystem. That is, an "fsync" on a file will flush out all deferred (cached) operations to the filesystem (not the pool) in which the file is located. This can make some fsync() slow when running alongside a workload which writes a lot of data to filesystem cache.
- ZFS encourages creation of many filesystems inside the pool (for example, for quota control), but importing a pool with thousands of filesystems is a slow operation (can take minutes).
- ZFS filesystem on-the-fly compression/decompression is single-threaded. So, only one CPU per zpool is used.
- ZFS eats a lot of CPU when doing small writes (for example, a single byte). There are two root causes, currently being solved: a) Translating from znode to dnode is slower than necessary because ZFS doesn't use translation information it already has, and b) Current partial-block update code is very inefficient.
- ZFS Copy-on-Write operation can degrade on-disk file layout (file fragmentation) when files are modified, decreasing performance.
- ZFS blocksize is configurable per filesystem, currently 128KB by default. If your workload reads/writes data in fixed sizes (blocks), for example a database, you should (manually) configure ZFS blocksize equal to the application blocksize, for better performance and to conserve cache memory and disk bandwidth.
- ZFS only offlines a faulty harddisk if it can't be opened. Read/write errors or slow/timeouted operations are not currently used in the faulty/spare logic.
- When listing ZFS space usage, the "used" column only shows non-shared usage. So if some of your data is shared (for example, between snapshots), you don't know how much is there. You don't know, for example, which snapshot deletion would give you more free space.
- Current ZFS compression/decompression code is very fast, but the compression ratio is not comparable to gzip or similar algorithms.
--
Infiltrated dot Net
Real SANs do more by PIPBoy3000 · 2007-05-30 00:13 · Score: 4, Informative

For starters, our SAN uses extremely fast connectivity. It sounds like you're moving your disk I/O over the network, which is a fairly significant bottleneck (even Gb). We also have the flexibility of multiple tiers - 1st tier being expensive, fast disks, and 2nd tier being cheaper IDE drives. I imagine you can fake that a variety of ways, but it's built in. Finally, there's the enclosure itself, with redundant power and such.

Still, I bet you could do what you want on the cheap. Being in health care, response time and availability really are life-and-death, but many other industries don't need to spend the extra. Best of luck.
Its just not the same thing. by Tester · 2007-05-30 00:14 · Score: 4, Informative

A good 20k$ RAID array does much more. First, it doesn't use cheap SATA drives, but Fiberchannel Drivers or even SAS drives which are tested to a higher level of quality (each disk costs like 500$ or more..). And those cheap SATA drives also react much more poorly to non-sequential access (like when you have multiple users). They are unusable for serious file serving. You can never compare RAID arrays that use SATA/IDE to ones that use enterprise drives like FC/SCSI/etc, because the drives are quite different.

Then you have the other features like dual redundant everything: controllers, power supplies, etc. Then you have thermal capabilities of rack-mount solutions that often are different from SATA, etc, etc.
1. Re:Its just not the same thing. by ZorinLynx · 2007-05-30 00:25 · Score: 5, Informative
  
  These overpriced drives aren't all that much different from SATA drives. They're a bit faster, but a HELL of a lot more expensive, and not worth paying more than double per gig.
  
  We have a Sun X4500 which uses 48 500GB SATA drives and ZFS to produce about 20TB of redundant storage. The performance we have seen from this machine is amazing. We're talking hundreds of gigabytes per second and no noticeable stalling on concurrent accesses.
  
  Google has found that SATA drives don't fail noticeably more often than SAS/SCSI drives, but even if they did, having several hot spares means it doesn't matter that much.
  
  SATA is a great disk standard. You get a lot more bang for your buck overall.
2. Re:Its just not the same thing. by tgatliff · 2007-05-30 00:32 · Score: 5, Informative
  
  It is not my intention to offend, but I alway love it when I hear the dreaded marketing phrase of hardware "tested to a higher level of quality".
  
  I work in the world of hardware manufacturing, and I can tell you that this "magical" more testing process simply does not exist. Hardware failures are always expensive, and we do anything we can to prevent them. To do this, we build burn in procedures based on what most call the 90% rule, but you really cannot guarantee more reliability beyond that. Better device design at that point is what will determine reliability beyond that point. Any person who says differently either does not completely understand individual test harness processes or does not understand how burn in procedures work.
  
  In short, more money is not nessesarily better. More volume designs typically are, though...
HW Raid vs ZFS by packetmon · 2007-05-30 00:15 · Score: 4, Informative

http://milek.blogspot.com/2007/04/hw-raid-vs-zfs-s oftware-raid-part-iii.html

--
Infiltrated dot Net
No by iamredjazz · 2007-05-30 00:19 · Score: 5, Informative

Speaking from personal experience - This file system is far from ready. It can kernel panic and reboot after minor IO errors, we were hosed by it, and probably won't ever revisit it. This phenomenon can be repeated with a usb device, you might want to try it before you hype it. Try a google search on it and see what you think...there is no fsck or repair, once it's hosed, it's hosed, the recovery is to go to tape. http://www.google.com/search?hl=en&q=zfs+io+error+ kernel+panic&btnG=Google+Search
Reliable? by Jjeff1 · 2007-05-30 00:21 · Score: 4, Informative

Businesses buy SANs to consolidate storage, placing all their eggs in one basket. They need redundant everything, which this doesn't have. Additionally, SATA drives are not as reliable long term as SCSI. Compare the data sheets for Seagate drives, they don't even mention MTBF on the SATA sheet.
Businesses also want service and support. They want the system to phone home when a drive starts getting errors, so a tech shows up at their door with a new drive before they even notice there are problems. They want to have highly trained tech support available 24/7 and parts available within 4 hours for as long as they own the SAN.
Finally, the performance of this solution almost certainly pales as compared to a real SAN. These are all things that a home grown solution doesn't offer. Saving 47K on a SAN is great, unless it breaks 3 years from now and your company is out of business 3 days waiting for a replacement motherboard off Ebay.
That being said, everything has a cost associated with it. If management is ok with saving actual money in the short term by giving up long term reliability and performance, then go for it. But by all means, get a rep from EMC or HP in so the decision makers completely understand what they're buying.
1. Re:Reliable? by Znork · 2007-05-30 02:14 · Score: 4, Informative
  
  "Additionally, SATA drives are not as reliable long term as SCSI."
  
  The CMU study by Bianca Schroeder and Garth A. Gibson would suggest otherwise. In fact, there was no significant difference in replacement rates between SCSI, FC, or SATA drives.
  
  "They want the system to phone home when a drive starts getting errors"
  
  Of course, the other recent study by Google showed that predictive health checks may be of limited value as an indicator of impending catastrophic disk failure.
  
  Basically, empirical research has shown that the SAN storage vendors are screwing their customers every day of the week.
  
  "Saving 47K on a SAN is great, unless it breaks 3 years from now"
  
  Of course, saving 47K on a SAN means you can easily triple your redundancy, still save money, and when it breaks, you have two more copies.
  
  At the same time, the guy spending the extra 47k on an 'Enterprise Class Ultra Reliable SAN', will get the same breakage 3 years from now, he wont have been able to afford all those redundant copies, and as he examines his contract with the SAN vendor, he notes that they actually dont promise anything.
  
  "But by all means, get a rep from EMC or HP in so the decision makers completely understand what they're buying."
  
  Premium grade bovine manure with (fake) gold flakes?
  
  Really, handing the decision makers several scientific papers and a few google search strings would leave them much better equipped to make a rational decision.
  
  Having several years experience with the kind of systems you're talking about I can just say, I've experienced several situations where, if we didnt have system-level redundancy, we would have suffered not only system downtime but actual data loss on expensive 'enterprise grade' SANs. That experience, as well as the research, has left me somewhat sceptical towards the claims of SAN vendors.
Re:Everyman? by Max+von+H. · 2007-05-30 00:35 · Score: 4, Informative

I'm a photographer and my RAW image files are 15MB each. At every shooting, I come back with 1 to 8GB worth of data to be processed. My workflow involves working on 16-bit TIFFs that weigh in excess of 40MB/file and I'm not even counting the photoshop work files. 40GB would last less than a week here.

Not being rich, I have a couple of external HDs totalling a little less than 1TB, and it's nearly full. The rest is archived on DVD or transfered to HD for storage (cheaper, faster and more reliable than DVD).

So yeah, I can easily imagine why any organisation dealing with huge media files would be interested. Heck, I'd be a client for a safe, multi-TB storage system if I could afford it... Not everybody only deals with text files for a living :P

--
-- It's always darker before it goes pitch black.
ZFS is great, but... by Etherized · 2007-05-30 00:42 · Score: 4, Informative

It's no NetApp - yet. One thing to realize is that iSCSI target isn't even in Solaris proper yet - you have to run Solaris Express or OpenSolaris for the functionality. That may be fine for some people, but it's a deal-breaker for most companies - you're really going to place all those TB of data on a system that's basically unsupported? I'm sure Sun would lend you a hand for enough money, but running essentially a pre-release version of Solaris is a non-starter where real business is concerned. Even when iSCSI target makes it into Solaris 10 - which should be in the next release - are you really comfortable running critical services off of essentially the first release of the technology? Furthermore, while ZFS is amazingly simple to manage in comparison to any other UNIX filesystem/volume manager, it still requires you to know how to properly administer a Solaris box in order to use it. Even GUI-centric sysadmins are generally able to muddle through the interface on a Filer, but ZFS comes with a full-fledged OS that requires proper maintenance. Your Windows admins may be fine with a NetApp - especially with all that marvelous support you get from them - but ask them to maintain a Solaris box and you're asking for trouble. Not to mention, since it's a real, general purpose server OS, you'll have to maintain patches just like you do on the rest of your servers - and the supported method for patching Solaris is *still* to drop to single user mode and reboot afterwards (yes, I know that's not necessarily *required*). Also, "zfs send" is no real replacement for snapmirrors. And while ZFS snapshots are functionally equivalent to NetApp snapshots, there is no method for automatic creation and management of them - it's up to the admin to create any snapshotting scheme you want to implement. Don't get me wrong - I love ZFS and I use it wherever it makes sense to do so. It may even be acceptable as a "poor man's Filer" right now, assuming you don't need iSCSI or any of the more advanced features of a NetApp. In fact, it's a really great solution for home or small office fileservers, where you just need a bunch of network storage on the cheap - assuming, of course, that you already have a Solaris sysadmin at your home or small office. Just don't fool yourself, Filer it ain't - at least not yet.
This hardly depends on ZFS... by Wdomburg · 2007-05-30 00:46 · Score: 4, Informative

This doesn't strike me as having much to do with ZFS at all. You've been able to do a home grown NAS / SAN box for years on the cheap using commodity equipment. Take ZFS out of the picture and you just need to use a hardware raid controller or a block level RAID (like dmraid on Linux or geom on FreeBSD). There are even canned solutions for this, like OpenFiler.

That being said, this sort of solution may or may not be appropriate, depending on site needs. Sometimes support is worth it.

You're also grossly overestimating the cost of an entry-level iSCSI SAN solution. Even going with EMC, hardly the cheapest of vendors, you can pick up a 6TB solution for about $15k, not $50k. Go with a second tier vendor and you can cut that number in half.
vs Reiser4 (someday, maybe) by SanityInAnarchy · 2007-05-30 00:56 · Score: 5, Informative

Some of these issues looked familiar, so I thought I'd do a basic comparison:

Reiser4 had the same problems with fsync -- basically, fsync called sync. This was because their sync is actually a damned good idea -- wait till you have to (memory pressure, sync call, whatever), then shove the entire tree that you're about to write as far left as it can go before writing. This meant awesome small-file performance -- as long as you have enough RAM, it's like working off a ramdisk, and when you flush, it packs them just about as tightly as you can with a filesystem. It also meant far less fragmentation -- allocate-on-flush, like XFS, but given a gig or two of RAM, a flush wasn't often.

The downside: Packing files that tightly is going to fragment more in the long run. This is why it's common practice for defragmenters to insert "air holes". Also, the complexity of the sync process is probably why fsync sucked so much. (I wouldn't mind so much if it was smarter -- maybe sync a single file, but add any small files to make sure you fill up a block -- but syncing EVERYTHING was a mistake, or just plain lazy.) Worse, it causes reliability problems -- unless you sync (or fsync), you have no idea if your data will be written now, or two hours from now, or never (given enough RAM).

(ZFS probably isn't as bad, given it's probably much easier to slice your storage up into smaller filesystems, one per task. But it's a valid gotcha -- without knowing that, I'd have just thrown most things into the same huge filesystem.)

There's another problem with reliability: Basically, every fast journalling filesystem nowadays is going to do out-of-order write operations. Entirely too many hacks depend on ordered writes (ext3 default, I think) for reliability, because they use a simple scheme for file updating: Write to a new temporary file, then rename it on top of the old file. The problem is, with out-of-order writes, it could do the rename before writing the data, giving you a corrupt temporary file in place of the "real" one, and no way to go back, even if the rename is atomic. The only way to get around this with traditional UNIX semantics is to stick to ordered writes, or do an fsync before each rename, killing performance.

I think the POSIX filesystem API is too simplistic and low-level to do this properly. On ordered filesystems, tempfile-then-rename does the Right Thing -- either everything gets written to disk properly, or not enough to hurt anything. Renames are generally atomic on journalled filesystems, so either you have the new file there after a crash, or you simply delete the tempfile. And there's no need to sync, especially if you're doing hundreds or thousands of these at once, as part of some larger operation. Often, it's not like this is crucial data that you need to be flushed out to disk RIGHT NOW, you just need to make sure that when it does get flushed, it's in the right order. You can do a sync call after the last of them is done.

Problem is, there are tons of other write operations for which it makes a lot of sense to reorder things. In fact, some disks do that on a hardware level, intentionally -- nvidia calls it "native command queuing". Using "ordered mode" is just another hack, and its drawback is slowing down absolutely every operation just so the critical ones will work. But so many are critical, when you think about it -- doesn't vim use the same trick?

What's needed is a transaction API -- yet another good idea that was planned for someday, maybe, in Reiser4. After all, single filesystem-metadata-level operations are generally guaranteed atomic, so I would guess most filesystems are able to handle complex transactions -- we just need a way for the program to specify it.

The fragmentation issue I see as a simple tradeoff: Packing stuff tightly saves you space and gives you performance, but increases fragmentation. Running a defragger (or "repacker") every once in awhile would have been nice. Problem is, they never got one written. Common UNIX (and Mac) philosoph

--
Don't thank God, thank a doctor!
Why not good ol' trusted Linux? by Britz · 2007-05-30 01:10 · Score: 5, Informative

Linux has more perfomance testing on x86 than OpenSolaris (so you are not as likely to run into a bad bottleneck). On Linux you can create a RAID-1,-4,-5 and -6 under Multiple Device Driver Support in the kernel. You can then use mkraid to include all the drives you want. This code in not new at all. It was stable in 2.4, maybe even in 2.2

After that you just create a filesystem on top of the raid. If you don't like ext3 or don't trust it, there is always xfs. I had some rough times with reiserfs, xfs, and ext3 and for all the experience I had I would go xfs for long running server environments (and now get flamed for this little bit, use ext3 all you want).

The advantage is that you use very well tested code.

The problem comes with hotswapping. I don't know if the drivers are up to that yet. But I also highly doubt that OpenSolaris SATA drivers for some low price chip in a low price storage box can deal with hotswapping. So Linux might be faster on that one.

That is a setup I would compare to a plug'n play SAN solution. And it totally depends on the environment. If the Linux box goes down for some reason for a couple hours/days, how much will that cost you? If it is more than twice the SAN-solution, you might just buy the SAN and if it fails just pull the disks and put them in the new one. I dunno if that would work on Linux.
Re:Specifics please. by sixoh1 · 2007-05-30 02:22 · Score: 4, Informative

Designs are expensive, but components are not. My PCB designs can support several different Bill-Of-Materials loads during manufacturing and when the boards are destined for industrial or military use we can use 'screened' parts which have been pre-selected and tested at high-temperature to ensure correct operation. Marginal parts at higher temps may be fine for consumer boxes (ie the ones on your desktop) but in a server box run 24-7-365 it-has-to-work-all-the-time may not be a good idea. I've been fustrated with the exact scenario quizzed in the original topic, using Maxtor SATAII 500GB disks as a drop-zone for my DLT backup machine I've had the HDDs for less than a month and already 3 of 4 failed with bad sectors because they all sit in a PC case. I'm going to have to rig out the box with extra fans, and the hastle of pulling and replacing the disks is driving me crazy too so now I'm adding removable disk bays. Not as easy or as cheap as I had anticipated (labor costs mostly).
Re:Specifics please. by Ngarrang · 2007-05-30 02:28 · Score: 5, Informative

Toleraen wrote, "So "Mission Critical" is just a myth too, right?"

No system can compensate for bad management by people, but I digress.

All data is critical. But, to say that your data is less safe with a system that cost $4700 than a system that cost $50,000 is fallacious without some heavy proof behind it. For now, I am going to ignore that a functional backup is part of "mission critical" and just address the online storage portion of the argument.

Let's start with a server white box. Something with redundant power supplies, ethernet, etc. Put a mirrored boot drive in it. Install Linux. So far, the cost is fairly low. Add an external disk array, at least 15 slots, the ones with hot-swap, hot-spare, RAID 5, redundant power supplies and fill it with inexpensive (but large) SATA drives. Promise sells one, as do others. Attach to server, voila, a cheaper solution than EMC for serving up large amounts of disk space.

What if a drive fails? The system recreates the data (it is RAID5, after all) onto a hot-spare. You remove the bad drive, insert new, run the administration. The uses won't even notice their MP3's and Elf Bowling game were ever in danger.

For the people who believe strongly in really expensive storage solutions, please explain why. I would like to know if you also hold the same theory for your desktop PCs, because surely, a more expensive PC has to be better. Right?

--
Bearded Dragon
Re:Everyman? by hoggoth · 2007-05-30 02:35 · Score: 4, Informative

I am the original poster, and I am not actually a typical user.
I routinely work with files that are 100 GB - 300 GB each.
Just copying one file from drive to drive takes hours.
I have about 4 Terabytes in use, with another 4 Terabytes for backup.

My usage is the exact opposite of database usage (which most storage is optimized for).
I need to copy huge sequential files. I rarely need many small reads or writes.

Because of the long times it takes to move these files around, I think NFS or CIFS would be too slow. That's why I am interested in the ability of ZFS to easily export iSCSI targets. Some tests I read showed that ZFS exporting iSCSI is about 4 times faster than ZFS exporting NFS or CIFS.

I am comparing to drives directly attached via eSATA, so it's got to be fast to come anywhere close to what I get with eSATA.

--
- For the complete works of Shakespeare: cat /dev/random (may take some time)
Re:I'm a Hardware Guy, Not a ZFS Guy by darrylo · 2007-05-30 03:13 · Score: 5, Informative
OK:
- ZFS w/flaky hardware (scary): http://blogs.sun.com/elowe/entry/zfs_saves_the_day _ta
- Self Healing with ZFS: http://www.opensolaris.org/os/community/zfs/demos/ selfheal/
- 100 Mirrored Filesystems in 5 minutes: http://www.opensolaris.org/os/community/zfs/demos/ basics/
And, for more than you wanted to know about ZFS: http://en.wikipedia.org/wiki/ZFS
RAID controller failure by wurp · 2007-05-30 03:28 · Score: 4, Informative

And what happens when the RAID controller fails and corrupts all of your drives?

Because I've seen that happen more than once.

I'm not saying the more expensive solution is better. I'm just saying that in my personal experience I've seen *more* data destroyed from RAID controller failure than from hard drive failure. I would love to find out the solution to that one.

I do not claim to be a hardware expert or system administrator, so there may be a well known solution (don't buy 'brand X' RAID controllers). I just don't happen to know it.