Does ZFS Obsolete Expensive NAS/SANs?

← Back to Stories (view on slashdot.org)

Does ZFS Obsolete Expensive NAS/SANs?

Posted by kdawson on Wednesday May 30, 2007 @12:02AM from the fast-secure-reliable-cheap dept.

hoggoth writes "As a common everyman who needs big, fast, reliable storage without a big budget, I have been following a number of emerging technologies and I think they have finally become usable in combination. Specifically, it appears to me that I can put together the little brother of a $50,000 NAS/SAN solution for under $3,000. Storage experts: please tell me why this is or isn't feasible." Read on for the details of this cheap storage solution.

Get a CoolerMaster Stacker enclosure like this one (just the hardware not the software) that can hold up to 12 SATA drives. Install OpenSolaris and create ZFS pools with RAID-Z for redundancy. Export some pools with Samba for use as a NAS. Export some pools with iSCSI for use as a SAN. Run it over Gigabit Ethernet. Fast, secure, reliable, easy to administer, and cheap. Usable from Windows, Mac, and Linux. As a bonus ZFS let's me create daily or hourly snapshots at almost no cost in disk space or time.

Total cost: 1.4 Terabytes: $2,000. 7.7 Terabytes: $4,200 (Just the cost of the enclosure and the drives). That's an order of magnitude less expensive than other solutions.

Add redundant power supplies, NIC cards, SATA cards, etc as your needs require.

13 of 578 comments (clear)

Min score:

Reason:

Sort:

ZFS by Anonymous Coward · 2007-05-30 00:05 · Score: 5, Informative

Also should be noted that FreeBSD has added ZFS support to Current (v7). It's built on top of GEOM too so if you know what that is you can leverage that underneath zfs.
Specifics please. by PowerEdge · 2007-05-30 00:06 · Score: 5, Insightful

Not enough specifics here. I am going to say do your thing. If it works, you're a hero and saved 47k. If it doesn't obfuscate and negotiate the 50k of storage down to 47k. Win for all.

Unless you would like to give more specifics. Cause I am going to say in 99% of cases where you want fast, reliable, and cheap storage you only get to pick two.
1. Re:Specifics please. by Ngarrang · 2007-05-30 01:02 · Score: 5, Insightful
  
  Unless you would like to give more specifics. Cause I am going to say in 99% of cases where you want fast, reliable, and cheap storage you only get to pick two.
  I disagree completely. Computer hardware is a commodity. The big box makers are afraid of this very kind of configuration which would blow them out of business if more people caught on to it. No, they use FUD to convince PHBs that because of the low cost, it cannot possibly be as good. Hot-swap and hot-spare are commodity technologies. But, please, feel free to continue the FUD, because it helps the bottom line.
  
  --
  Bearded Dragon
2. Re:Specifics please. by Score+Whore · 2007-05-30 02:13 · Score: 5, Insightful
  
  No kidding. Without specific details there is no way to answer whether this is a good solution to his particular situation. However, even in the absence of details I can say this:
  
  1) That case has twelve spindles. You aren't going to get the same performance from a dozen drives as you get from an hundred.
  
  2) That system includes a small Celeron D processor with 512 MB RAM. You aren't going to get the same performance as you'll get from multiple dedicated RAID controllers with twenty+ gig of dedicated disk cache.
  
  3) Your single gigabit ethernet interface won't even come close to the performance of the three or four (or ten) 2 gigabit fibre channel adapters involved in most SAN arrays.
  
  4) Software iscsi initiators and targets aren't a replacement for dedicated adapters.
  
  5) The Hitachi array at work never goes down. Ever. Not for expansions. Not for drive replacements. Not for relayouts. The reliability of a PC and the opportunity to do online maintenance won't approach that of a real array.
  
  Don't get me wrong. That case makes me all tingly inside -- for personal use. But as a SAN replacement, fuck no. It's not the same thing. The original question just shows ignorance of what SANs are and the roles they fill in the data center.
  
  As a workgroup storage solution for a handful of end users on their desktops, that solution probably may be a good fit. As a storage solution for ten (or two hundred) business critical server systems, no way.
3. Re:Specifics please. by Ngarrang · 2007-05-30 02:28 · Score: 5, Informative
  
  Toleraen wrote, "So "Mission Critical" is just a myth too, right?"
  
  No system can compensate for bad management by people, but I digress.
  
  All data is critical. But, to say that your data is less safe with a system that cost $4700 than a system that cost $50,000 is fallacious without some heavy proof behind it. For now, I am going to ignore that a functional backup is part of "mission critical" and just address the online storage portion of the argument.
  
  Let's start with a server white box. Something with redundant power supplies, ethernet, etc. Put a mirrored boot drive in it. Install Linux. So far, the cost is fairly low. Add an external disk array, at least 15 slots, the ones with hot-swap, hot-spare, RAID 5, redundant power supplies and fill it with inexpensive (but large) SATA drives. Promise sells one, as do others. Attach to server, voila, a cheaper solution than EMC for serving up large amounts of disk space.
  
  What if a drive fails? The system recreates the data (it is RAID5, after all) onto a hot-spare. You remove the bad drive, insert new, run the administration. The uses won't even notice their MP3's and Elf Bowling game were ever in danger.
  
  For the people who believe strongly in really expensive storage solutions, please explain why. I would like to know if you also hold the same theory for your desktop PCs, because surely, a more expensive PC has to be better. Right?
  
  --
  Bearded Dragon
4. Re:Specifics please. by swordgeek · 2007-05-30 03:46 · Score: 5, Interesting
  
  It's all a question of scale, and your scale is a bit skewed.
  
  The premium paid for higher-end storage is decidedly nonlinear. For marginally more reliable or faster storage, you pay about a factor of ten. One example I'm familiar with is Hitachi. We had a 64TB HDS array a few years ago that was worth roughly $2M. We could have purchased an equivalent amount of commodity storage for probably $200k at the time, but didn't. Why would we spend the extra money? Speed, configurability, expandability, and reliability.
  
  First of all, speed. That thing was loaded with 73GB 15k FCAL drives. RAID was in sets of four disks, with no two disks in a set sharing the same controller, backplane, or cache segment. Speaking of cache, the rule was 1GB/TB. so we had 64GB of fast, low-latency, fully mirrored cache on the thing. It was insanely fast, and (most importantly) didn't slow down under point load. One tool automatically ran on the array itself, looking for hotspots and reallocating data on the fly.
  
  Configurability: We could mirror data synchronously or asynchronously to our DR site, by filesystem, file, block, LUN, or byte. We could dynamically (re)allocate storage to multiple systems, and moving databases between machines was a breeze. Disk could be allocated from different pools (i.e. different performing drives could be installed), depending on requirements. Quality-of-Service restrictions could be put in place as well, although we never used them.
  
  Expandability: The beast had 32 pairs of FC connections, could support 96GB of internal mirrored cache, and I can't remember how much actual disk. The key wasn't the amount of disk we could put on it, so much as how well the bandwidth scaled--and it scaled well.
  
  Finally, the real key - Reliability. All connections were dual-pathed, with storage presented to a pair of smart FC switches which were zoned to present storage to various systems. We could lose three of the four power cables to the main unit (auxiliary disk cabinets only had two power connections each), and still run. We could lose any entire rack, and still run. We could lose any switch in our environment, and still run. We could lose two disks from the same RAID set and still run. When we lost a disk, the system would automatically suck up some cache to use for remirroring the data to multiple disks as fast as possible, and then after protecting it, would remirror back to a single logical device. In the event that we lost the entire device, we could run from our DR site synchronous mirror with less than a ten second failover.
  
  This sort of thing is massive overkill for most people and companies, but when someone is doing realtime commodities trading, (or banking, or stock exchanges, etc.) the protection and support are worth the extra money. You just can't build that sort of thing on your own for any less money, at the end of the day.
  
  --
  
  "People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
No by iamredjazz · 2007-05-30 00:19 · Score: 5, Informative

Speaking from personal experience - This file system is far from ready. It can kernel panic and reboot after minor IO errors, we were hosed by it, and probably won't ever revisit it. This phenomenon can be repeated with a usb device, you might want to try it before you hype it. Try a google search on it and see what you think...there is no fsck or repair, once it's hosed, it's hosed, the recovery is to go to tape. http://www.google.com/search?hl=en&q=zfs+io+error+ kernel+panic&btnG=Google+Search
Re:Its just not the same thing. by ZorinLynx · 2007-05-30 00:25 · Score: 5, Informative

These overpriced drives aren't all that much different from SATA drives. They're a bit faster, but a HELL of a lot more expensive, and not worth paying more than double per gig.

We have a Sun X4500 which uses 48 500GB SATA drives and ZFS to produce about 20TB of redundant storage. The performance we have seen from this machine is amazing. We're talking hundreds of gigabytes per second and no noticeable stalling on concurrent accesses.

Google has found that SATA drives don't fail noticeably more often than SAS/SCSI drives, but even if they did, having several hot spares means it doesn't matter that much.

SATA is a great disk standard. You get a lot more bang for your buck overall.
Re:Its just not the same thing. by tgatliff · 2007-05-30 00:32 · Score: 5, Informative

It is not my intention to offend, but I alway love it when I hear the dreaded marketing phrase of hardware "tested to a higher level of quality".

I work in the world of hardware manufacturing, and I can tell you that this "magical" more testing process simply does not exist. Hardware failures are always expensive, and we do anything we can to prevent them. To do this, we build burn in procedures based on what most call the 90% rule, but you really cannot guarantee more reliability beyond that. Better device design at that point is what will determine reliability beyond that point. Any person who says differently either does not completely understand individual test harness processes or does not understand how burn in procedures work.

In short, more money is not nessesarily better. More volume designs typically are, though...
vs Reiser4 (someday, maybe) by SanityInAnarchy · 2007-05-30 00:56 · Score: 5, Informative

Some of these issues looked familiar, so I thought I'd do a basic comparison:

Reiser4 had the same problems with fsync -- basically, fsync called sync. This was because their sync is actually a damned good idea -- wait till you have to (memory pressure, sync call, whatever), then shove the entire tree that you're about to write as far left as it can go before writing. This meant awesome small-file performance -- as long as you have enough RAM, it's like working off a ramdisk, and when you flush, it packs them just about as tightly as you can with a filesystem. It also meant far less fragmentation -- allocate-on-flush, like XFS, but given a gig or two of RAM, a flush wasn't often.

The downside: Packing files that tightly is going to fragment more in the long run. This is why it's common practice for defragmenters to insert "air holes". Also, the complexity of the sync process is probably why fsync sucked so much. (I wouldn't mind so much if it was smarter -- maybe sync a single file, but add any small files to make sure you fill up a block -- but syncing EVERYTHING was a mistake, or just plain lazy.) Worse, it causes reliability problems -- unless you sync (or fsync), you have no idea if your data will be written now, or two hours from now, or never (given enough RAM).

(ZFS probably isn't as bad, given it's probably much easier to slice your storage up into smaller filesystems, one per task. But it's a valid gotcha -- without knowing that, I'd have just thrown most things into the same huge filesystem.)

There's another problem with reliability: Basically, every fast journalling filesystem nowadays is going to do out-of-order write operations. Entirely too many hacks depend on ordered writes (ext3 default, I think) for reliability, because they use a simple scheme for file updating: Write to a new temporary file, then rename it on top of the old file. The problem is, with out-of-order writes, it could do the rename before writing the data, giving you a corrupt temporary file in place of the "real" one, and no way to go back, even if the rename is atomic. The only way to get around this with traditional UNIX semantics is to stick to ordered writes, or do an fsync before each rename, killing performance.

I think the POSIX filesystem API is too simplistic and low-level to do this properly. On ordered filesystems, tempfile-then-rename does the Right Thing -- either everything gets written to disk properly, or not enough to hurt anything. Renames are generally atomic on journalled filesystems, so either you have the new file there after a crash, or you simply delete the tempfile. And there's no need to sync, especially if you're doing hundreds or thousands of these at once, as part of some larger operation. Often, it's not like this is crucial data that you need to be flushed out to disk RIGHT NOW, you just need to make sure that when it does get flushed, it's in the right order. You can do a sync call after the last of them is done.

Problem is, there are tons of other write operations for which it makes a lot of sense to reorder things. In fact, some disks do that on a hardware level, intentionally -- nvidia calls it "native command queuing". Using "ordered mode" is just another hack, and its drawback is slowing down absolutely every operation just so the critical ones will work. But so many are critical, when you think about it -- doesn't vim use the same trick?

What's needed is a transaction API -- yet another good idea that was planned for someday, maybe, in Reiser4. After all, single filesystem-metadata-level operations are generally guaranteed atomic, so I would guess most filesystems are able to handle complex transactions -- we just need a way for the program to specify it.

The fragmentation issue I see as a simple tradeoff: Packing stuff tightly saves you space and gives you performance, but increases fragmentation. Running a defragger (or "repacker") every once in awhile would have been nice. Problem is, they never got one written. Common UNIX (and Mac) philosoph

--
Don't thank God, thank a doctor!
Why not good ol' trusted Linux? by Britz · 2007-05-30 01:10 · Score: 5, Informative

Linux has more perfomance testing on x86 than OpenSolaris (so you are not as likely to run into a bad bottleneck). On Linux you can create a RAID-1,-4,-5 and -6 under Multiple Device Driver Support in the kernel. You can then use mkraid to include all the drives you want. This code in not new at all. It was stable in 2.4, maybe even in 2.2

After that you just create a filesystem on top of the raid. If you don't like ext3 or don't trust it, there is always xfs. I had some rough times with reiserfs, xfs, and ext3 and for all the experience I had I would go xfs for long running server environments (and now get flamed for this little bit, use ext3 all you want).

The advantage is that you use very well tested code.

The problem comes with hotswapping. I don't know if the drivers are up to that yet. But I also highly doubt that OpenSolaris SATA drivers for some low price chip in a low price storage box can deal with hotswapping. So Linux might be faster on that one.

That is a setup I would compare to a plug'n play SAN solution. And it totally depends on the environment. If the Linux box goes down for some reason for a couple hours/days, how much will that cost you? If it is more than twice the SAN-solution, you might just buy the SAN and if it fails just pull the disks and put them in the new one. I dunno if that would work on Linux.
Let's pretend you're right for a moment... by SanityInAnarchy · 2007-05-30 02:16 · Score: 5, Interesting

It seems to me that even if the entire setup is prone to failure, all you really need is a gigabit crossover or two running to an identical setup. I don't know if ZFS does anything like this, but I can think of at least one way to make it work on Linux: DRBD + OCFS2 + heartbeat. If you're smart, you can even do some load balancing, at least until one of them fails -- and when that happens, the other should be able to take over very quickly, if not instantly -- Linux heartbeat means it would simply takeover the other machine's IP and start its services.

So, that's $6k total instead of $3k.

The one problem I have with OCFS2 is that when it fences a system, it tends to either bring the whole thing down (kernel panic), or in newer versions, give you the option of forceably rebooting instead. This killed it for a project I was working on, where one of the machines had other mission-critical systems running that were not on the OCFS2, and thus, it seemed retarded to panic and bring down everything else too.

So if that's your problem, you can always build a third, identical system to run the other stuff on. $9k.

Even if you figure another $1k for random stuff, like maybe a LOT of gigabit crossovers, or 10gig fiber, or something, that's still a fifth of the cost of the "business-grade" or whatever else he was considering. Even assuming the worst-case scenario, where the homebrew system costs a lot more to maintain (even electricity and cooling, maybe), how long will it take for it to cost another $40k? And this way, you have an ENTIRELY redundant system -- the only way you lose it is if, say, the whole building blows up.

I mean, I sort of agree that you get what you pay for. But when the difference in price is that much, the only way it's ever worth it is if there's really great support with the high-end package. And is it $40k worth of support? If not, I imagine this guy could put together a company selling little $3k, $6k, and $10k systems for $20k each (including support), shaving off $30k even for the most paranoid.

And all of that is pretending you're right about the cheap consumer-grade hardware actually being less reliable.

--
Don't thank God, thank a doctor!
Re:I'm a Hardware Guy, Not a ZFS Guy by darrylo · 2007-05-30 03:13 · Score: 5, Informative
OK:
- ZFS w/flaky hardware (scary): http://blogs.sun.com/elowe/entry/zfs_saves_the_day _ta
- Self Healing with ZFS: http://www.opensolaris.org/os/community/zfs/demos/ selfheal/
- 100 Mirrored Filesystems in 5 minutes: http://www.opensolaris.org/os/community/zfs/demos/ basics/
And, for more than you wanted to know about ZFS: http://en.wikipedia.org/wiki/ZFS