The State of ZFS On Linux
An anonymous reader writes: Richard Yao, one of the most prolific contributors to the ZFSOnLinux project, has put up a post explaining why he thinks the filesystem is definitely production-ready. He says, "ZFS provides strong guarantees for the integrity of [data] from the moment that fsync() returns on a file, an operation on a synchronous file handle is returned or dirty writeback occurs (by default every 5 seconds). These guarantees are enabled by ZFS' disk format, which places all data into a Merkle tree that stores 256-bit checksums and is changed atomically via a two-stage transaction commit. ... Sharing a common code base with other Open ZFS platforms has given ZFS on Linux the opportunity to rapidly implement features available on other Open ZFS platforms. At present, Illumos is the reference platform in the Open ZFS community and despite its ZFS driver having hundreds of features, ZoL is only behind on about 18 of them."
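For anyone wondering what "a Merkle tree that stores 256-bit checksums" buys you, here is a minimal toy sketch in Python. It is not ZFS's on-disk format: the `Node` class, `checksum` and `verify` are hypothetical names made up for illustration, and SHA-256 is used only as a convenient 256-bit hash, not as a claim about which checksum any given pool uses. The idea it shows is that each parent keeps the checksum of its children, so corruption below a trusted root is detectable.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Return a 256-bit checksum (SHA-256 here, purely as a stand-in)."""
    return hashlib.sha256(data).digest()


class Node:
    """A toy tree node: a parent keeps the checksum of each child it points to."""

    def __init__(self, payload: bytes, children=None):
        self.payload = payload
        self.children = list(children or [])
        # Checksums are stored in the parent, not alongside the child data.
        self.child_sums = [checksum(c.serialize()) for c in self.children]

    def serialize(self) -> bytes:
        # A node's bytes cover its payload and its children's checksums,
        # so a change anywhere below alters every checksum up to the root.
        return self.payload + b"".join(self.child_sums)


def verify(node: Node) -> bool:
    """Walk down from a trusted node, checking each child against the stored sum."""
    for child, expected in zip(node.children, node.child_sums):
        if checksum(child.serialize()) != expected or not verify(child):
            return False
    return True


leaf_a = Node(b"block A")
leaf_b = Node(b"block B")
root = Node(b"root", [leaf_a, leaf_b])
intact = verify(root)

leaf_b.payload = b"block X"  # simulate silent corruption on disk
corrupted_detected = not verify(root)
```

Because the checksum lives in the parent rather than next to the data, a block that gets corrupted cannot vouch for itself with an equally corrupted checksum.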
I've been using ZFSonLinux for a year in production. No problems at all. It's my storage back end for Xen Virtual machines. Just make sure you use ECC RAM and a decent hard disk controller. Instant snapshots and ZFS send/receive functions are awesome, have reduced my backup times by an order of magnitude. I use a Debian Wheezy/Unstable hybrid.
Is the target not a zfs filesystem as well? If so zfs send/recv allows for replication and handles deltas at the filesystem level. It should be more efficient.
ZFS is a layer below LVM. It's best to give it direct control over your drives (no hardware RAID). The reason for this is to allow it to do data integrity checks on the actual data being written. It's similarly fast compared to hardware RAID but guarantees data integrity in a much more complete fashion. I use a striped mirrored setup, which is similar to RAID 10 (over 4x 3TB drives with caches on a pair of SSDs). If you cache like this, frequent reads don't need to go to the spindles. It also has built-in compression and deduplication. The best thing IMO is instant snapshots though; that's one feature I can't believe I lived without.
The CPU and RAM overhead is relatively minimal. You can get away with very few resources, even after enabling compression.
I have a ZFS server that's ~5 years old right now, serving over 100 NFS and a handful of Samba/Netatalk connections simultaneously (home directories mounted on NFS, SMB and AFP for other mounts). There is a fairly steady 1000-2000 IOPS with spikes up to 100k IOPS, the machine has an uptime of over 300 days, and the CPU load (8 2.4GHz Xeon CPUs) hovers around 5-10%. It holds 100TB of data in 8 RAIDZ2 stripes of 8 disks (2 and 4TB), with 800GB of SSD read cache and 120GB of mirrored SSD write cache, directly attached with SAS.
It will of course eat as much RAM as you give it, but for the amount you'd spend on a halfway decent SAS RAID controller, you can easily buy 100GB of RAM and a set of SSDs. You don't WANT a RAID controller. Regular SAS controllers with ZFS are so much faster; RAID controllers are limited by their on-board chips, which are typically sub-GHz RISC (ARM, Intel, MIPS) processors. An external SAS RAID controller will cost you about $2,000-5,000 extra and have a throughput of a few 100MBps and a few hundred IOPS. In contrast, my setup (36 disks, 4 6G SAS channels) can give a whopping 20Gbps and 1M IOPS.
The point of ZFS is that hardware raid sucks.
With hardware raid you're trusting a small, underpowered embedded computer to manage data at a block level.
1. That computer is purposefully kept in the dark about the data being stored, as it's designed to be agnostic. Thus it has no way to gracefully recover from errors. Either your whole volume is consistent, or it's in an unknown state of corruption. This is bad.
2. RAID schemes are mathematically unable to deal with large modern hard drives. The unavoidable error rates for 4TB+ drives (and their interconnects) mean that you are all but guaranteed to have corruption within the useful lifetime of the drive. This means even if everything works perfectly with 0 hardware failures, your RAID array will have to rebuild sometime in its lifetime. This is bad. It's why you're stupid to go with RAID5 with large hard drives.
3. RAID controllers are pretty much all unique and their volumes are non-portable. They are also not documented well. Your drives are useless without the controller, and even recovering with a new controller of the same type is a crapshoot.
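To put a rough number on point 2, here is a back-of-envelope sketch. The error rate of one unrecoverable read error per 1e14 bits is an assumption (a figure commonly quoted on consumer drive spec sheets, not something stated in this thread), the 4-drive RAID5 of 4TB disks is a made-up example, and the model treats errors as independent, which real drives may not obey.

```python
import math


def p_read_error(bytes_read: float, ure_per_bit: float) -> float:
    """Probability of at least one unrecoverable read error over bytes_read."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way for tiny p.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


TB = 1e12
ASSUMED_URE = 1e-14  # assumption: one error per 1e14 bits read

# Rebuilding a 4-drive RAID5 of 4 TB disks means reading the 3 survivors in full.
p_rebuild = p_read_error(3 * 4 * TB, ASSUMED_URE)
print(f"chance of hitting a read error during rebuild: {p_rebuild:.0%}")
```

Under those assumptions the odds of a clean rebuild are worse than a coin flip, and a traditional controller that hits such an error mid-rebuild has no checksum to tell it which data is bad.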
ZFS throws the above model away because:
1. Your computer is fast, has lots of processors, and lots of cheap ram. Why ignore all that and use a small, embedded computer that's slower and costs extra?
2. Being part of the filesystem, it's aware of everything on both the block and the file level. It's aware of every file, the blocks it uses, the checksum of the file, and the checksum of every block. You can give yourself as many or as few redundant blocks as you want for some or all of your files.
3. Your volume can be imported on to any other computer that supports ZFS. It's a standard and is portable.
4. Because of all of the above you can implement a whole list of amazing features you can't even begin to dream of in RAID. Look up what you can do with copy-on-write filesystems and you'll wonder how you ever lived without them. (Basically free versioning/snapshotting that almost paradoxically improves performance at the same time.)
Anything that can be represented as a block device can be added to a zpool. This also includes files, which is handy when you're trying to understand complicated interactions: you can mock up a small zpool based on files instead of devices for testing.
On the other side of the abstraction, ZFS can also expose block devices called zvols that are backed by the zpool. So if you wanted to run a dm-crypt encrypted EXT4 filesystem backed by a zpool, you can certainly do that using a zvol and still get all the benefits of ZFS integrity protection and snapshotting.
Plenty of layering can be done with ZFS.
So you can't expand an existing vdev
While you cannot add new drives to a vdev, you can expand a vdev by incrementally replacing all of its drives with larger versions. Replace a drive, resilver, replace a drive, resilver... and when you're all done, just export the pool, import it back, and you have the full capacity of the new drives available.
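Why the extra space only shows up after the last drive is swapped: a vdev can only use as much of each member as its smallest disk provides. A rough sketch, where `raidz_usable_tb` is a hypothetical helper and the 8-disk RAIDZ2 going from 2TB to 4TB drives is just an example; it ignores metadata, padding and other overhead.

```python
def raidz_usable_tb(disk_sizes_tb, parity):
    """Rough usable capacity of one RAIDZ vdev, ignoring metadata and padding."""
    # Every member contributes only as much as the smallest disk in the vdev.
    return (len(disk_sizes_tb) - parity) * min(disk_sizes_tb)


vdev = [2, 2, 2, 2, 2, 2, 2, 2]  # hypothetical 8-disk RAIDZ2 of 2 TB drives
history = [raidz_usable_tb(vdev, parity=2)]
for i in range(len(vdev)):
    vdev[i] = 4  # replace one drive with a 4 TB model, then resilver
    history.append(raidz_usable_tb(vdev, parity=2))
```

In this example `history` stays flat through the first seven replacements and only jumps once the final 2TB drive is gone.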
Back when I did OpenSolaris work, we used a tool called mbuffer, which is basically netcat with a buffer on each end. It wouldn't be suitable for internet backups (no encryption) but it works pretty well for cross-campus backups and the like.
IIRC it works like this on the sending side: 'zfs send pool/fs@snap | mbuffer -s 128k -m 4G -O 10.0.0.1:9090'
And on the receive side: 'mbuffer -s 128k -m 4G -I 9090 | zfs receive pool/fs'
It can still be pretty bursty but it smooths out a lot of it.
Dedup easily needs 5GB of RAM per TB.
For general usage (no dedup), 1GB per TB is a good rule of thumb.
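Those two rules of thumb as quick arithmetic. This is only a sketch of the figures quoted in this thread (1GB per TB, 5GB per TB with dedup), not official sizing guidance; `suggested_ram_gb` is a hypothetical helper and the 100TB pool is an example size.

```python
def suggested_ram_gb(pool_tb, dedup=False):
    """Thread's rule of thumb: 1 GB RAM per TB, or 5 GB per TB with dedup."""
    return pool_tb * (5 if dedup else 1)


plain = suggested_ram_gb(100)                   # general usage, no dedup
with_dedup = suggested_ram_gb(100, dedup=True)  # same pool with dedup enabled
print(f"100 TB pool: {plain} GB without dedup, {with_dedup} GB with dedup")
```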
You can kludge on encryption in the pipeline:
http://sourceforge.net/project...