The State of ZFS On Linux
An anonymous reader writes: Richard Yao, one of the most prolific contributors to the ZFSOnLinux project, has put up a post explaining why he thinks the filesystem is definitely production-ready. He says, "ZFS provides strong guarantees for the integrity of [data] from the moment that fsync() returns on a file, an operation on a synchronous file handle is returned or dirty writeback occurs (by default every 5 seconds). These guarantees are enabled by ZFS' disk format, which places all data into a Merkle tree that stores 256-bit checksums and is changed atomically via a two-stage transaction commit. ... Sharing a common code base with other Open ZFS platforms has given ZFS on Linux the opportunity to rapidly implement features available on other Open ZFS platforms. At present, Illumos is the reference platform in the Open ZFS community and despite its ZFS driver having hundreds of features, ZoL is only behind on about 18 of them."
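To make the "Merkle tree that stores 256-bit checksums" idea concrete, here is a toy sketch in Python. It is not ZFS's on-disk format: the use of SHA-256, the pairwise tree layout, and the names `leaf_hash`, `node_hash` and `merkle_root` are all assumptions picked for illustration. The only point is that each parent holds checksums of its children, so a flipped bit anywhere shows up as a different root.

```python
import hashlib
from typing import List

def leaf_hash(block: bytes) -> bytes:
    """256-bit checksum of one data block (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(b"leaf:" + block).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    """Checksum of an interior node, computed over its children's checksums."""
    return hashlib.sha256(b"node:" + left + right).digest()

def merkle_root(blocks: List[bytes]) -> bytes:
    """Fold a non-empty list of blocks up to a single root checksum."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [leaf_hash(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:  # odd count: pair the last checksum with itself
            level.append(level[-1])
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
good_root = merkle_root(blocks)

# Simulate silent corruption of one block on disk.
corrupted = list(blocks)
corrupted[2] = b"block-2 with a flipped bit"
print("corruption detected:", merkle_root(corrupted) != good_root)
```

Because the root changes whenever any block changes, checking one small value is enough to know whether the whole tree below it is intact.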
I've been using ZFSonLinux for a year in production. No problems at all. It's my storage back end for Xen Virtual machines. Just make sure you use ECC RAM and a decent hard disk controller. Instant snapshots and ZFS send/receive functions are awesome, have reduced my backup times by an order of magnitude. I use a Debian Wheezy/Unstable hybrid.
Is the target not a zfs filesystem as well? If so zfs send/recv allows for replication and handles deltas at the filesystem level. It should be more efficient.
ZFS is a layer below LVM. It's best to give it direct control over your drives (no hardware RAID), so that it can run data integrity checks on the data actually being written. It's about as fast as hardware RAID but guarantees data integrity in a much more complete fashion. I use a striped mirrored setup, which is similar to RAID 10 (over 4x 3TB drives, with caches on a pair of SSDs). If you cache like this, frequent reads don't need to go to the spindles. It also has built-in compression and deduplication. The best thing IMO is instant snapshots though; that's one feature I can't believe I lived without.
I too have kinda been watching passively with a kinda "I'll look into this once it's ready" attitude.
The gist as far as I understand it is (again, take with huge helping of salt (it's not that bad for your health any more!), I'm posting these partly to be told I'm wrong):
Pros:
- data integrity (checksums and more rigorous checks that something is actually written to the disk)
Cons:
- cpu and ram overhead (even by current standards, uses a tonne of resources)
- doesn't like hardware raid (apparently a lot of the pros rely on talking to an actual disk)
- expandability sucks (can be done, but weird rules based on pool sizes and such) compared to most raid levels where you can easily toss a new disk in there and expand.
Actually it's pretty friendly on resources but likes lots of RAM to perform well (1GB per TB of storage is a good minimum). One of my servers runs on an Atom processor (8x 3TB drives in the equivalent of RAID 6 gets throughput of about 200MB/sec). Adding disks is also a strength. You can grow data sets quite easily, but naturally performance degrades until you update the whole drive set. A lot of RAID controllers can be put in HBA mode, so you may be lucky. However, the Adaptec controllers go cheap second-hand ($100).
I've been using ZFS on linux for years with nightly backup jobs that rely on rsync. I've never had a problem.
There are so many pros for ZFS that I don't even. Until you try it, you won't "get it" - it's more like trying to describe purple to a life long blind guy. But, I'd adjust your list to at least include:
Pros:
- Data integrity
- Effortless handling of failure scenarios (RAIDZ makes normal RAID look like a child's crayon drawing)
- Snapshots.
- Replication. Imagine being able to DD a drive partition without taking it offline, and with perfect data integrity.
- Clones. Imagine being able to remount an rsync backup from last tuesday, and make changes to it, in seconds, without affecting your backup?
- Scrub. Do an fsck mid-day without affecting any end users. Not only "fix" errors, but actually guarantee the accuracy of the "fix" so that no data is lost or corrupted.
- Expandable. Add capacity at any time with no downtime. Replace every disk in your array with no downtime, and it can automatically use the extra space.
- Redundancy, even on a single device! Can't provide multiple disks, but want to defend against having a block failure corrupting your data?
- Flexible. Imagine having several partitions in your array, and be able to resize them at any time. In seconds. Or, don't bother to specify a size and have each partition use whatever space they need.
- Native compression. Double your disk space, while (sometimes) improving performance! We compressed our database backup filesystem and not only do we see some 70% reduction in disk space usage, we saw a net reduction in system load as IO overhead was significantly reduced.
- Sharp cost savings. ZFS obviates the need for exotic RAID hardware to do all the above. It brings back the "Inexpensive" in RAID. (Remember: "Redundant Array of Inexpensive Disks"?)
Cons:
- CPU and RAM overhead comparable to Software RAID 5.
- Requires you to be competent and know how it operates, particularly when adding capacity to an existing pool.
- ECC RAM strongly recommended if using scrub.
- Strongly recommended for data partitions, YMMV for native O/S partitions. (EG:
The CPU and RAM overhead is relatively minimal. You can get away with very few resources, even after enabling compression.
I have a ZFS server ~5 years old right now, serving over 100 NFS and a handful of Samba/Netatalk connections simultaneously (home directories mounted on NFS, SMB and AFP for other mounts). There is a fairly steady 1000-2000 IOPS with spikes up to 100k IOPS, the machine has an uptime of over 300 days, and the CPU load (8 2.4GHz Xeon CPUs) hovers around 5-10%. It holds 100TB of data in 8 RAIDZ2 stripes of 8 disks (2 and 4TB), with 800GB of SSD read cache and 120GB of mirrored SSD write cache, directly attached with SAS.
It will of course eat as much RAM as you will give it, but for the amount you spend on a halfway decent SAS RAID controller, you can easily buy 100GB of RAM and a set of SSDs. You don't WANT a RAID controller. Regular SAS controllers with ZFS are so much faster; RAID controllers are limited by their on-board chips, which are typically sub-GHz RISC (ARM, Intel, MIPS) processors. An external SAS RAID controller will cost you about $2-5000 extra and have a throughput of a few 100MBps and a few hundred IOPS. In contrast, my setup (36 disks, 4 6G SAS channels) can give a whopping 20Gbps and 1M IOPS.
No... their numbers are about right.
And the numbers go back to times before Google existed.
Even on the old Cray Y systems, there was roughly one single bit error every day, corrected by ECC. Every week or so there would be roughly 1 double bit error, recovered by data reload...
The only times the memory got disabled was when double bit errors were NOT recovered OR the error rate exceeded 10 (from my memory, number could be higher) in a day. The hardware itself would remap memory so that the system would keep running until the CE could run diagnostics on it and either replace it or restore it to use as an identified transient error.
> ZFS is a layer below LVM.
Typically you'd layer raid, then LVM, then the filesystem. ZFS tries to be all three. It's raid, and it's a volume manager, and it's a filesystem. There are some benefits to integration, and some drawbacks. With the raid>lvm>filesystem approach, it's trivial to add dm-cache, bcache, iscsi, or any other piece of storage technology. With ZFS, anything you want to add has to be specifically supported within ZFS.
The Unix tradition is small, single purpose tools that do one thing well. Witness sort, grep, wc, etc. Want to count the log entries that mention Slashdot? You don't need a special tool for that, just grep slashdot | wc -l . Tools like mdadm and lvm are building blocks that can be combined to suit your need, the Unix way. ZFS is a big monolithic package that does everything, much like Microsoft Word or Outlook. ZFS is more in the Microsoft tradition.
The point of ZFS is that hardware raid sucks.
With hardware raid you're trusting a small, underpowered embedded computer to manage data at a block level.
1. That computer is purposefully kept in the dark about the data being stored as it's designed to be agnostic. Thus it has no way to gracefully recover from errors. It's either your whole volume is consistent, or an unknown state of corruption. This is bad.
2. RAID schemes are mathematically unable to deal with large modern hard drives. The unavoidable error rates for 4TB+ drives (and their interconnects) mean that you are guaranteed to have corruption within the useful lifetime of the drive. This means that even if everything works perfectly with 0 hardware failures, your raid array will have to rebuild sometime in its lifetime. This is bad. It's why you're stupid to go with RAID5 with large hard drives.
3. RAID controllers are pretty much all unique and their volumes are non portable. They are also not documented well. Your drives are useless without the controller, and even recovering with a new controller of the same type is a crapshoot.
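The arithmetic behind point 2 can be sketched roughly as follows. This is a back-of-the-envelope model, not a measurement: the rate of one unrecoverable read error (URE) per 10^14 bits is an assumed spec-sheet style figure that does not come from this thread, errors are treated as independent, and `p_read_error` is a name made up for the example.

```python
import math

def p_read_error(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error (URE).

    Assumes independent errors at a fixed per-bit rate; 1e-14 is an assumed
    spec-sheet figure, not a number from this thread.
    """
    bits = bytes_read * 8
    # 1 - (1 - rate)^bits, computed in a way that stays accurate for tiny rates
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 10**12
# RAID 5 rebuild with four 4 TB drives: the three survivors are read in full.
p = p_read_error(3 * 4 * TB)
print(f"chance of hitting a URE during the rebuild: {p:.0%}")
```

Under those assumptions a single-parity rebuild across large drives is more likely than not to hit at least one unreadable sector, and with the parity already used up there is nothing left to repair it from.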
ZFS throws the above model away because:
1. Your computer is fast, has lots of processors, and lots of cheap ram. Why ignore all that and use a small, embedded computer that's slower and costs extra?
2. Being part of the filesystem, it's aware of everything on both the block and the file level. It's aware of every file, the blocks it uses, the checksum of the file, and the checksum of every block. You can give yourself as many or as few redundant blocks as you want for some or all of your files.
3. Your volume can be imported on to any other computer that supports ZFS. It's a standard and is portable.
4. Because of all of the above you can implement a whole list of amazing features you can't even begin to dream of in RAID. Look up what you can do with copy-on-write filesystems and you'll wonder how you ever lived without them. (Basically free versioning/snapshotting that almost paradoxically improves performance at the same time)
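Why copy-on-write makes snapshots nearly free can be shown with a toy model. This is only a sketch under stated assumptions: `CowVolume` and its methods are invented for illustration, the block store is a plain Python list, and real ZFS does this with block pointers on disk rather than dictionaries in memory.

```python
from typing import Dict, List, Optional

class CowVolume:
    """Toy copy-on-write volume: blocks are never overwritten in place."""

    def __init__(self) -> None:
        self._blocks: List[bytes] = []                 # append-only block store
        self._live: Dict[str, int] = {}                # file name -> block index
        self._snapshots: Dict[str, Dict[str, int]] = {}

    def write(self, name: str, data: bytes) -> None:
        self._blocks.append(data)            # new data always goes to a new block
        table = dict(self._live)             # build a new table, leave the old one alone
        table[name] = len(self._blocks) - 1
        self._live = table                   # switch to the new table in one step

    def snapshot(self, snap: str) -> None:
        # No data is copied: the snapshot just keeps the current table alive.
        self._snapshots[snap] = self._live

    def read(self, name: str, snap: Optional[str] = None) -> bytes:
        table = self._live if snap is None else self._snapshots[snap]
        return self._blocks[table[name]]

vol = CowVolume()
vol.write("db", b"v1")
vol.snapshot("tuesday")
vol.write("db", b"v2")
print(vol.read("db"), vol.read("db", snap="tuesday"))
```

The snapshot costs one reference regardless of how much data the volume holds, and the old block stays readable because later writes never touch it.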
I think you're giving the wrong idea here. I have yet to find a format of storage capacity that zfs won't support, with one exception: you can't create a zvol on a zpool, then attach that zvol as back-end storage for the same zpool. That is specifically disallowed, and I'm guessing that you can't use a zvol from one zpool to back-end another zpool either. This is a very bizarre (also, probably dumb) thing to do, but even this can be overridden if you're really desperate. For more practical applications, everything else just works: at least in FreeBSD, you can "hide" the block devices behind all different kinds of abstractions to provide 4k writes, encryption, whatever, and zfs will consume those virtual block devices just fine.
Anything that can be represented as a block device can be added to a zpool. This also includes files, which is handy when you're trying to understand complicated interactions: you can mock up a small zpool based on files instead of devices for testing.
On the other side of the abstraction, ZFS can also expose block devices called zvols that are backed by the zpool. So if you wanted to run a dm-crypted ext4 filesystem backed by a zpool, you can certainly do that using a zvol and still get all the benefits of ZFS integrity protection and snapshotting.
Plenty of layering can be done with ZFS.
Adding additional drives to a raidz vdev is not supported, no. Apparently it's a use case that is extremely rare in enterprise, which is what zfs was intended for. Adding additional capacity is easy if you have no redundancy (12x2TB drives in a pool? Just add 2x2TB more drives to the pool and boom, more space), but not as easy if you want redundancy.
So you can't expand an existing vdev, but you can add a new vdev to the zpool. For example, say your current configuration is 12x2TB in raidz2 (the zfs equivalent of raid6). That's giving you 20TB of capacity, after redundancy. You need to add 4TB of additional usable capacity...
There are a few options. ZFS doesn't enforce redundancy, so there's nothing stopping you from adding two bare 2TB drives to the zpool. You'd get your extra 4TB, but data on those drives would be unprotected. Instead, you'd probably have to take 4x2TB, put them in a new raidz2 vdev, and then add that to your zpool. Then you'd have 12x2TB & 4x2TB, giving you 24TB of usable capacity (20TB plus the 4TB you needed), and every disk in the array has dual redundancy.
My home file server currently has 7x4TB & 8x2TB. They're both raidz2 arrays, in the same zpool, for 32TB of usable capacity on 44TB of raw storage. I started out with 5x2TB in raidz1 and migrated the data between various configurations. The iterations looked like this:
Configuration 1: 5x2TB (raidz1)
Configuration 2: 5x2TB (raidz1) + 5x2TB (raidz1)
Configuration 3: 7x4TB (raidz2) + 8x2TB (raidz2)
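The capacity figures in this post follow from simple arithmetic, sketched below. The helper names `vdev_usable_tb` and `pool_usable_tb` are made up for the example, and the model is deliberately rough: it assumes equal-sized drives within a vdev and ignores metadata overhead, padding, and the TB versus TiB difference.

```python
PARITY = {"stripe": 0, "raidz1": 1, "raidz2": 2, "raidz3": 3}

def vdev_usable_tb(drives: int, drive_tb: float, level: str) -> float:
    """Approximate usable capacity of one vdev built from equal-sized drives."""
    parity = PARITY[level]
    if drives <= parity:
        raise ValueError("not enough drives for this redundancy level")
    return (drives - parity) * drive_tb

def pool_usable_tb(vdevs) -> float:
    """A pool's capacity is the sum of its vdevs' capacities."""
    return sum(vdev_usable_tb(*v) for v in vdevs)

# 12x2TB in raidz2, then the same pool with a 4x2TB raidz2 vdev added.
print(pool_usable_tb([(12, 2, "raidz2")]))
print(pool_usable_tb([(12, 2, "raidz2"), (4, 2, "raidz2")]))

# The home file server: 7x4TB raidz2 + 8x2TB raidz2.
home = [(7, 4, "raidz2"), (8, 2, "raidz2")]
raw_tb = sum(n * size for n, size, _ in home)
print(pool_usable_tb(home), "usable of", raw_tb, "raw")
```

Note how the small 4-drive raidz2 vdev gives up half of its raw space to parity, which is the price of keeping dual redundancy on every disk.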
The migration process was:
1 to 2: Add the new 5x2TB (raidz1) vdev to the existing storage pool
2 to 3: Add the new 7x4TB (raidz2) vdev to a new storage pool, zfs send the file system from the old pool to the new pool, wipe the old 2TB drives, add back 8 of them in a new raidz2 vdev, add that new vdev to the existing new pool
The server only has 15 hotswap bays (the 2-to-3 migration required opening the case to get some of the drives hooked up directly), so my next migration will involve replacing the 2TB drives with something larger (probably 8TB by the time I need to expand). To do that, the process in zfs is that you replace a drive, resilver the array, replace a drive, resilver the array, etc. When you have replaced the last drive, zfs will automatically expand the vdev to use the new capacity. Resilvering a completely empty drive is not fast, so I expect the process will probably take me about a week, since I'd probably start a new resilver each night before bed. But since I run raidz2, at no point would I be without redundancy, so it should be safe.
If you intend to send the snapshots over the network, as is often the case with rsync, you need to pair it with some independent communication tool, and since the output of "zfs send" tends to be very bursty, you need a sizable memory buffer.
> So you can't expand an existing vdev
While you cannot add new drives to a vdev, you can expand a vdev by incrementally replacing all of its drives with larger versions. Replace a drive, resilver, replace a drive, resilver... and when you're all done, just export the pool, import it back, and you have the full capacity of the new drives available.
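A small simulation shows why the extra space only appears after the last swap. It assumes the usual rule that every member of a raidz vdev contributes only as much as the smallest drive; the function names and the 8x2TB to 8TB raidz2 example are illustrative, and metadata overhead is ignored.

```python
def raidz_usable_tb(drive_sizes_tb, parity: int) -> float:
    """Every member contributes only as much as the smallest drive."""
    return (len(drive_sizes_tb) - parity) * min(drive_sizes_tb)

def replace_one_by_one(drive_sizes_tb, new_size_tb, parity):
    """Yield usable capacity after each replace-and-resilver step."""
    drives = list(drive_sizes_tb)
    for i in range(len(drives)):
        drives[i] = new_size_tb      # swap one drive, then wait for the resilver
        yield raidz_usable_tb(drives, parity)

# Eight 2TB drives in raidz2, replaced one at a time with 8TB drives.
steps = list(replace_one_by_one([2] * 8, new_size_tb=8, parity=2))
print(steps)
```

Capacity stays flat for the first seven swaps because one old 2TB drive is still setting the limit, then jumps once the eighth drive has been replaced.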
Back when I did OpenSolaris work, we used a tool called mbuffer, which is basically netcat with a buffer on each end. It wouldn't be suitable for internet backups (no encryption) but it works pretty well for cross-campus backups and the like.
IIRC it works like this on the sending side: 'zfs send pool/fs@snap | mbuffer -s 128k -m 4G -O 10.0.0.1:9090'
And on the receive side: 'mbuffer -s 128k -m 4G -I 9090 | zfs receive pool/fs'
It can still be pretty bursty but it smoothes out a lot of it.
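If you script this, it helps to build both halves of the pipeline from the same parameters so the port and buffer size can't drift apart. The sketch below only assembles and prints the command lines quoted above; it does not run anything. The helper names are invented for the example, and the flags are taken as-is from the post (which was itself written from memory), so check them against your mbuffer version.

```python
import shlex

def sender_argv(snapshot: str, host: str, port: int, buf: str = "4G"):
    """argv lists for: zfs send SNAP | mbuffer -s 128k -m BUF -O HOST:PORT"""
    return (["zfs", "send", snapshot],
            ["mbuffer", "-s", "128k", "-m", buf, "-O", f"{host}:{port}"])

def receiver_argv(dataset: str, port: int, buf: str = "4G"):
    """argv lists for: mbuffer -s 128k -m BUF -I PORT | zfs receive DATASET"""
    return (["mbuffer", "-s", "128k", "-m", buf, "-I", str(port)],
            ["zfs", "receive", dataset])

def as_shell(pipeline) -> str:
    """Render a pipeline for logging or review; nothing is executed here."""
    return " | ".join(shlex.join(argv) for argv in pipeline)

print(as_shell(sender_argv("pool/fs@snap", "10.0.0.1", 9090)))
print(as_shell(receiver_argv("pool/fs", 9090)))
```

Keeping each stage as an argv list rather than one string means the same data could later be handed to `subprocess.Popen` without going through a shell.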
Dedup easily needs 5GB of RAM per TB.
For general usage (no dedup), 1GB per TB is a good rule of thumb.
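As a quick sanity check for your own pool, the two rules of thumb above work out like this. It is only a sketch of the thread's numbers (1GB per TB, 5GB per TB with dedup), not an official sizing formula, and `suggested_ram_gb` is a made-up name.

```python
def suggested_ram_gb(pool_tb: float, dedup: bool = False) -> float:
    """Rule-of-thumb RAM sizing from the thread: 1 GB/TB, or 5 GB/TB with dedup."""
    per_tb = 5 if dedup else 1
    return pool_tb * per_tb

for pool_tb in (12, 32, 100):
    plain = suggested_ram_gb(pool_tb)
    with_dedup = suggested_ram_gb(pool_tb, dedup=True)
    print(f"{pool_tb}TB pool: ~{plain}GB RAM, ~{with_dedup}GB with dedup")
```

The gap between the two columns is the reason dedup is usually only worth enabling when you know the data really does repeat.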
> Dedup easily needs 5GB of RAM per TB.
> For general usage (no dedup), 1GB per TB is a good rule of thumb.
This. Don't starve the ARC. You wouldn't like it when it's angry.
The sky won't fall but the walls might.
-Shaka
You can kludge on encryption in the pipeline:
http://sourceforge.net/project...