The State of ZFS On Linux

Posted by Soulskill on Thursday September 11, 2014 @02:55AM from the ready-for-the-big-show dept.

An anonymous reader writes: Richard Yao, one of the most prolific contributors to the ZFSOnLinux project, has put up a post explaining why he thinks the filesystem is definitely production-ready. He says, "ZFS provides strong guarantees for the integrity of [data] from the moment that fsync() returns on a file, an operation on a synchronous file handle is returned or dirty writeback occurs (by default every 5 seconds). These guarantees are enabled by ZFS' disk format, which places all data into a Merkle tree that stores 256-bit checksums and is changed atomically via a two-stage transaction commit. ... Sharing a common code base with other Open ZFS platforms has given ZFS on Linux the opportunity to rapidly implement features available on other Open ZFS platforms. At present, Illumos is the reference platform in the Open ZFS community and despite its ZFS driver having hundreds of features, ZoL is only behind on about 18 of them."
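To make the "Merkle tree that stores 256-bit checksums" idea concrete, here is a minimal sketch in Python. It is not ZFS's on-disk format: the function names are hypothetical, and SHA-256 is used only as a convenient 256-bit hash (an assumption for illustration). The point is that each parent checksum covers its children, so corruption in any data block changes the root.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Return a 256-bit (32-byte) checksum. SHA-256 is an illustrative choice."""
    return hashlib.sha256(data).digest()


def merkle_root(blocks: list[bytes]) -> bytes:
    """Compute a root checksum over data blocks (hypothetical helper).

    Leaves are checksums of the data blocks. Each parent is the checksum
    of its children's checksums joined together. An unpaired node at the
    end of a level is hashed on its own.
    """
    if not blocks:
        raise ValueError("need at least one block")
    level = [checksum(b) for b in blocks]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]  # one or two child checksums
            parents.append(checksum(b"".join(pair)))
        level = parents
    return level[0]


if __name__ == "__main__":
    good = [b"block-0", b"block-1", b"block-2"]
    bad = [b"block-0", b"block-X", b"block-2"]  # simulated silent corruption
    print(merkle_root(good).hex())
    print(merkle_root(good) == merkle_root(bad))  # False: the root detects it
```

Verifying a single block only requires the checksums along its path to the root, which is why a tree is used rather than one checksum over everything.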

7 of 370 comments

Re:Unfamiliar by Anrego · 2014-09-11 03:14 · Score: 5, Interesting

I too have kinda been watching passively with a kinda "I'll look into this once it's ready" attitude.
The gist as far as I understand it is (again, take with huge helping of salt (it's not that bad for your health any more!), I'm posting these partly to be told I'm wrong):
Pros:
- data integrity (checksums and more rigorous checks that something is actually written to the disk)
Cons:
- cpu and ram overhead (even by current standards, uses a tonne of resources)
- doesn't like hardware raid (apparently a lot of the pros rely on talking to an actual disk)
- expandability sucks (can be done, but weird rules based on pool sizes and such) compared to most raid levels where you can easily toss a new disk in there and expand.
Re: Unfamiliar by zeigerpuppy · 2014-09-11 03:26 · Score: 3, Interesting

Actually it's pretty friendly on resources but likes lots of RAM to perform well (1GB per TB of storage is a good minimum). One of my servers runs on an Atom processor (8x 3TB drives in the equivalent of RAID 6 gets throughput of about 200MB/sec). Adding disks is also a strength. You can grow data sets quite easily, but naturally performance degrades until you update the whole drive set. A lot of RAID controllers can be put in HBA mode, so you may be lucky. However, the Adaptec controllers go cheap second-hand ($100).
Re:Be sure to use ECC RAM on home set-ups by Anonymous Coward · 2014-09-11 03:44 · Score: 3, Interesting

No... their numbers are about right.
And the numbers go back to times before Google existed.
Even on the old Cray Y systems, there was roughly one single bit error every day, corrected by ECC. Every week or so there would be roughly 1 double bit error, recovered by data reload...
The only times the memory got disabled was when double bit errors were NOT recovered OR the error rate exceeded 10 (from my memory, number could be higher) in a day. The hardware itself would remap memory so that the system would keep running until the CE could run diagnostics on it and either replace it or restore it to use as an identified transient error.
above, below, and at the same level. ZFS is everyt by raymorris · 2014-09-11 03:59 · Score: 4, Interesting

> ZFS is a layer below LVM.
Typically you'd layer raid, then LVM, then the filesystem. ZFS tries to be all three. It's raid, and it's a volume manager, and it's a filesystem. There are some benefits to integration, and some drawbacks. With the raid>lvm>filesystem approach, it's trivial to add dm-cache, bcache, iscsi, or any other piece of storage technology. With ZFS, anything you want to add has to be specifically supported within ZFS.
The Unix tradition is small, single purpose tools that do one thing well. Witness sort, grep, wc, etc. Want to count the log entries that mention Slashdot? You don't need a special tool for that, just grep slashdot | wc -l . Tools like mdadm and lvm are building blocks that can be combined to suit your need, the Unix way. ZFS is a big monolithic package that does everything, much like Microsoft Word or Outlook. ZFS is more in the Microsoft tradition.
Re:above, below, and at the same level. ZFS is eve by Vesvvi · 2014-09-11 04:14 · Score: 3, Interesting

I think you're giving the wrong idea here. I have yet to find a format of storage capacity that zfs won't support, with one exception: you can't create a zvol on a zpool, then attach that zvol as back-end storage for the same zpool. That is specifically disallowed, and I'm guessing that you can't use a zvol from one zpool to back-end another zpool either. This is a very bizarre (also, probably dumb) thing to do, but even this can be overridden if you're really desperate. For more practical applications, everything else just works: at least in FreeBSD, you can "hide" the block devices behind all different kinds of abstractions to provide 4k writes, encryption, whatever, and zfs will consume those virtual block devices just fine.
Re:Unfamiliar by Guspaz · 2014-09-11 04:34 · Score: 5, Interesting

Adding additional drives to a raidz vdev is not supported, no. Apparently it's a use case that is extremely rare in enterprise, which is what zfs was intended for. Adding additional capacity is easy if you have no redundancy (12x2TB drives in a pool? Just add 2x2TB more drives to the pool and boom, more space), but not as easy if you want redundancy.
So you can't expand an existing vdev, but you can add a new vdev to the zpool. For example, say your current configuration is 12x2TB in raidz2 (the zfs equivalent of raid6). That's giving you 20TB of capacity, after redundancy. You need to add 4TB of additional usable capacity...
There are a few options. ZFS doesn't enforce redundancy, so there's nothing stopping you from adding two bare 2TB drives to the zpool. You'd get your extra 4TB, but data on those drives would be unprotected. Instead, you'd probably have to take 4x2TB, put them in a new raidz2 vdev, and then add that to your zpool. Then you'd have 12x2TB & 4x2TB, giving you 24TB of usable capacity, and every disk in the array has dual redundancy.
My home file server currently has 7x4TB & 8x2TB. They're both raidz2 arrays, in the same zpool, for 32TB of usable capacity on 44TB of raw storage. I started out with 5x2TB in raidz1 and migrated the data between various configurations. The iterations looked like this:
Configuration 1: 5x2TB (raidz1)
Configuration 2: 5x2TB (raidz1) + 5x2TB (raidz1)
Configuration 3: 7x4TB (raidz2) + 8x2TB (raidz2)
The migration process was:
1 to 2: Add the new 5x2TB (raidz1) vdev to the existing storage pool
2 to 3: Add the new 7x4TB (raidz2) vdev to a new storage pool, zfs send the file system from the old pool to the new pool, wipe the old 2TB drives, add back 8 of them in a new raidz2 vdev, add that new vdev to the existing new pool
The server only has 15 hotswap bays (the 2-to-3 migration required opening the case to get some of the drives hooked up directly), so my next migration will involve replacing the 2TB drives with something larger (probably 8TB by the time I need to expand). To do that, the process in zfs is that you replace a drive, resilver the array, replace a drive, resilver the array, etc. When you have replaced the last drive, zfs automatically will expand the vdev to use the new capacity. Resilvering a completely empty drive is not fast, so I expect the process will probably take me about a week, since I'd probably start a new resilver each night before bed. But since I run raidz2, at no point would I be without redundancy, so it should be safe.
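The capacity figures in the comment above can be checked with a short calculation. This is a rough sketch with hypothetical function names: it assumes usable space is simply (disks minus parity disks) times disk size, and it ignores metadata overhead, padding, and the TB versus TiB difference, so real pools will report somewhat less.

```python
def raidz_usable_tb(disks: int, disk_tb: int, parity: int) -> int:
    """Approximate usable TB of one raidz vdev (parity=1 for raidz1, 2 for raidz2)."""
    if disks <= parity:
        raise ValueError("a raidz vdev needs more disks than parity devices")
    return (disks - parity) * disk_tb


def pool_usable_tb(vdevs: list[tuple[int, int, int]]) -> int:
    """Sum usable TB over vdevs given as (disks, disk_tb, parity) tuples."""
    return sum(raidz_usable_tb(d, size, p) for d, size, p in vdevs)


def pool_raw_tb(vdevs: list[tuple[int, int, int]]) -> int:
    """Sum raw TB over the same vdev description."""
    return sum(d * size for d, size, _ in vdevs)


if __name__ == "__main__":
    print(pool_usable_tb([(12, 2, 2)]))             # 20: 12x2TB raidz2
    print(pool_usable_tb([(12, 2, 2), (4, 2, 2)]))  # 24: plus a 4x2TB raidz2 vdev
    home = [(7, 4, 2), (8, 2, 2)]                   # 7x4TB raidz2 + 8x2TB raidz2
    print(pool_usable_tb(home), pool_raw_tb(home))  # 32 44
```

Note how the small 4x2TB raidz2 vdev spends half its disks on parity, which is why growing a pool in small redundant steps is expensive.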
Re:rsync causes lockups? by wagnerrp · 2014-09-11 04:49 · Score: 3, Interesting

If you intend to send the snapshots over the network, as is often the case with rsync, you need to pair it with some independent communication tool, and since the output of "zfs send" tends to be very bursty, you need a sizable memory buffer.
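The role of that memory buffer can be sketched in Python. This is only an illustration of the idea, not a replacement for a dedicated buffering tool: `buffered_copy` is a hypothetical helper, and the chunk size and buffer depth are arbitrary assumptions. A reader thread drains the bursty source into a bounded queue while the main thread writes to the slower destination at its own pace.

```python
import io
import queue
import threading


def buffered_copy(src, dst, chunk_size=1 << 20, max_chunks=256):
    """Copy src to dst through a bounded in-memory buffer.

    Holds at most about max_chunks * chunk_size bytes. Returns bytes copied.
    """
    buf = queue.Queue(maxsize=max_chunks)

    def reader():
        try:
            while True:
                chunk = src.read(chunk_size)
                buf.put(chunk)  # blocks while the buffer is full
                if not chunk:   # empty bytes marks end of stream
                    return
        except Exception as exc:
            buf.put(exc)        # hand read errors to the writer side

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    total = 0
    while True:
        item = buf.get()
        if isinstance(item, Exception):
            thread.join()
            raise item
        if not item:
            break
        dst.write(item)
        total += len(item)
    thread.join()
    return total


if __name__ == "__main__":
    source = io.BytesIO(b"x" * 10_000)  # stands in for a snapshot stream
    sink = io.BytesIO()                 # stands in for a network connection
    print(buffered_copy(source, sink, chunk_size=1024, max_chunks=4))
```

In a real setup the source would be the standard output of a "zfs send" process and the destination a network socket; both are stand-ins here.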