The State of ZFS On Linux

Re:rsync causes lockups? by gbkersey · 2014-09-11 03:01 · Score: 2

We use ZFS to store backups, and we backup with rsync. No problems so far.

Unfamiliar by Anonymous Coward · 2014-09-11 03:02 · Score: 1

I'm still quite unfamiliar with all the concept of ZFS. How would it compare to a LVMed RAID-5 with EXT4?

Re: Unfamiliar by zeigerpuppy · 2014-09-11 03:12 · Score: 5, Informative

ZFS is a layer below LVM. It's best to give it direct control over your drives (no hardware RAID). The reason for this is to allow it to do data integrity checks on the actual data being written. It's similarly fast compared to hardware RAID but guarantees data integrity in a much more complete fashion. I use a striped mirrored setup which is similar to RAID 10 (over 4x 3TB drives with caches on a pair of SSDs). If you cache like this, frequent reads don't need to go to the spindles. It also has built-in compression and deduplication. The best thing IMO is instant snapshots though, that's one feature I can't believe I lived without.
Re:Unfamiliar by Anrego · 2014-09-11 03:14 · Score: 5, Interesting

I too have kinda been watching passively with a kinda "I'll look into this once it's ready" attitude.
The gist as far as I understand it is (again, take with huge helping of salt (it's not that bad for your health any more!), I'm posting these partly to be told I'm wrong):
Pros:
- data integrity (checksums and more rigorous checks that something is actually written to the disk)
Cons:
- cpu and ram overhead (even by current standards, uses a tonne of resources)
- doesn't like hardware raid (apparently a lot of the pros rely on talking to an actual disk)
- expandability sucks (can be done, but weird rules based on pool sizes and such) compared to most raid levels where you can easily toss a new disk in there and expand.
Re: Unfamiliar by the_humeister · 2014-09-11 03:26 · Score: 1

The main huge feature of filesystems like zfs and btrfs is checksumming of the filesystem for enhanced data integrity. Snapshots, data deduplication, etc. are also nice features, but without the checksumming, any file system issues will be multiplied (e.g. wrong bits would be propagated through the snapshots).
Re: Unfamiliar by zeigerpuppy · 2014-09-11 03:26 · Score: 3, Interesting

Actually it's pretty friendly on resources but likes lots of RAM to perform well (1GB per TB of storage is a good minimum). One of my servers runs on an Atom processor (8x 3TB drives in the equivalent of RAID 6 get throughput of about 200MB/sec). Adding disks is also a strength. You can grow data sets quite easily but naturally performance degrades until you update the whole drive set. A lot of RAID controllers can be put in HBA mode so you may be lucky. However the Adaptec controllers go cheap 2nd hand ($100).
Re: Unfamiliar by jabuzz · 2014-09-11 03:34 · Score: 1

Why would I prefer ZFS over DIF on FC and now SAS2 which does the checksumming in a far more comprehensive manner for me and is agnostic to the file system it is on?
Re:Unfamiliar by mcrbids · 2014-09-11 03:39 · Score: 5, Insightful

There are so many pros for ZFS that I don't even. Until you try it, you won't "get it" - it's more like trying to describe purple to a life long blind guy. But, I'd adjust your list to at least include:
Pros:
- Data integrity
- Effortless handling of failure scenarios (RAIDZ makes normal RAID look like a child's crayon drawing)
- Snapshots.
- Replication. Imagine being able to DD a drive partition without taking it offline, and with perfect data integrity.
- Clones. Imagine being able to remount an rsync backup from last Tuesday, and make changes to it, in seconds, without affecting your backup?
- Scrub. Do an fsck mid-day without affecting any end users. Not only "fix" errors, but actually guarantee the accuracy of the "fix" so that no data is lost or corrupted.
- Expandable. Add capacity at any time with no downtime. Replace every disk in your array with no downtime, and it can automatically use the extra space.
- Redundancy, even on a single device! Can't provide multiple disks, but want to defend against having a block failure corrupting your data?
- Flexible. Imagine having several partitions in your array, and be able to resize them at any time. In seconds. Or, don't bother to specify a size and have each partition use whatever space they need.
- Native compression. Double your disk space, while (sometimes) improving performance! We compressed our database backup filesystem and not only do we see some 70% reduction in disk space usage, we saw a net reduction in system load as IO overhead was significantly reduced.
- Sharp cost savings. ZFS obviates the need for exotic RAID hardware to do all the above. It brings back the "Inexpensive" in RAID. (Remember: "Redundant Array of Inexpensive Disks"?)
Cons:
- CPU and RAM overhead comparable to Software RAID 5.
- Requires you to be competent and know how it operates, particularly when adding capacity to an existing pool.
- ECC RAM strongly recommended if using scrub.
- Strongly recommended for data partitions, YMMV for native O/S partitions. (EG: /)

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
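A toy sketch of why snapshots and clones can be near-instant on a copy-on-write filesystem, as the pros list above describes. This is not how ZFS is implemented: the `Dataset` class and its methods are hypothetical names for illustration, and a real filesystem shares a block tree rather than copying a dictionary. The point is only that a snapshot copies references, never data, and that later writes go to new blocks.

```python
class Dataset:
    """Toy copy-on-write dataset: maps block numbers to immutable block contents."""

    def __init__(self, blocks=None):
        # The map holds references; the byte strings themselves are never modified in place.
        self.blocks = dict(blocks or {})

    def write(self, block_no, data):
        # Copy-on-write: point the map at new data instead of overwriting the old block.
        self.blocks[block_no] = data

    def snapshot(self):
        # A snapshot is a frozen copy of the block map; block contents are shared, not copied.
        return dict(self.blocks)

    @classmethod
    def clone(cls, snap):
        # A clone is a writable dataset that starts from a snapshot's block map.
        return cls(snap)


backup = Dataset({0: b"monday", 1: b"tuesday"})
snap = backup.snapshot()        # cheap: copies only references
backup.write(1, b"wednesday")   # a later backup run changes block 1

scratch = Dataset.clone(snap)   # writable view of the snapshotted state
scratch.write(0, b"edited")     # the change stays inside the clone

print(snap[1], backup.blocks[1], scratch.blocks[0], backup.blocks[0])
```

The snapshot still sees the old block 1, the live dataset sees the new one, and the edit made in the clone never reaches the backup.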
Re:Unfamiliar by QuietLagoon · 2014-09-11 03:40 · Score: 1

...expandability sucks...
Expansion is different with ZFS. Different does not mean sucks. Different means you need to learn something new.
In my experience, it does not suck, but is rather easy to do. I added a couple of disks, ran a couple of commands, and doubled the size of my ZFS pool.
Easy as pie.
Re:Unfamiliar by guruevi · 2014-09-11 03:43 · Score: 5, Informative

The CPU and RAM overhead is relatively minimal. You can get away with very few resources, even after enabling compression.
I have a ZFS server ~5 years old right now, serving over 100 NFS and a handful of Samba/Netatalk connections simultaneously (home directories mounted on NFS, SMB and AFP for other mounts). There is a fairly steady 1000-2000 IOPS with spikes up to 100k IOPS, the machine has an uptime over 300 days, the CPU load (8 2.4GHz Xeon CPU's) hovers around 5-10% (100TB of data in 8 RAIDZ2 stripes of 8 disks (2 and 4TB), 800GB in SSD read cache, 120GB in mirrored SSD write cache, directly attached with SAS).
It will of course eat as much RAM as you will give it but for the amount you spend on a halfway decent SAS RAID controller, you can easily buy 100GB of RAM and a set of SSD's. You don't WANT a RAID controller. Regular SAS controllers with ZFS are so much faster; RAID controllers are limited by their on-board chips which are typically sub-GHz RISC (ARM, Intel, MIPS) processors - an external SAS RAID controller will cost you about $2-5000 extra and have a throughput of a few 100MBps and a few 100's of IOPS. In contrast, my setup (36 disks, 4 6G SAS channels) can give a whopping 20Gbps and 1M IOPS.

--
Custom electronics and digital signage for your business: www.evcircuits.com
Re: Unfamiliar by zeigerpuppy · 2014-09-11 03:46 · Score: 1

Because ZFS allows you to flexibly allocate, administer, snapshot and backup data in a coherent way. You can even use other file systems on top of it if you want (zvols). Hardware level checksumming on drives is not aware of your file structure and a look at SMART info on a drive will reveal quite a few Unrecovered errors (density is a bitch). So I wouldn't rely on your drive to correct its own errors.
Re: Unfamiliar by csirac · 2014-09-11 03:58 · Score: 2

For the same reasons your package manager bothers with shasums on the software you install even though the several network layers responsible for delivering it already faithfully checksummed each little packet as it flew past: the filesystem is the earliest and only point which knows exactly what files are supposed to actually look like in their entirety. That ZFS/BTRFS scrubs turn up errors on large pools with otherwise perfectly fine hardware means those block/packet level validations are at too low a level to make assurances for the higher level data structures using them.
Re:Unfamiliar by Anonymous Coward · 2014-09-11 04:03 · Score: 5, Informative

The point of ZFS is that hardware raid sucks.
With hardware raid you're trusting a small, underpowered embedded computer to manage data at a block level.
1. That computer is purposefully kept in the dark about the data being stored as it's designed to be agnostic. Thus it has no way to gracefully recover from errors. It's either your whole volume is consistent, or an unknown state of corruption. This is bad.
2. RAID schemes are mathematically unable to deal with large modern hard drives. The unavoidable error rates for 4TB+ drives (and their interconnects) mean that you are guaranteed to have corruption within the useful lifetime of the drive. This means even if everything works perfectly with 0 hardware failures, your raid array will have to rebuild sometime in its lifetime. This is bad. It's why you're stupid to go with RAID5 with large hard drives.
3. RAID controllers are pretty much all unique and their volumes are non portable. They are also not documented well. Your drives are useless without the controller, and even recovering with a new controller of the same type is a crapshoot.
ZFS throws the above model away because:
1. Your computer is fast, has lots of processors, and lots of cheap ram. Why ignore all that and use a small, embedded computer that's slower and costs extra?
2. Being part of the filesystem, it's aware of everything on both the block and the file level. It's aware of every file, the blocks it uses, the checksum of the file, and the checksum of every block. You can give yourself as many or as few redundant blocks as you want for some or all of your files.
3. Your volume can be imported on to any other computer that supports ZFS. It's a standard and is portable.
4. Because of all of the above you can implement a whole list of amazing features you can't even begin to dream of in RAID. Look up what you can do with copy-on-write filesystems and you'll wonder how you ever lived without them. (Basically free versioning/snapshotting that almost paradoxically improves performance at the same time)
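The claim in point 2 of the RAID list above, that large drives make a read error during a rebuild likely, can be put in rough numbers. This is a back-of-the-envelope sketch, not a measurement: it assumes an unrecoverable read error rate of 1 per 10^14 bits (a figure often printed on consumer drive spec sheets, used here as an assumption) and treats errors as independent. The function name and the 4-drive example are hypothetical.

```python
import math

def p_read_error(bytes_read, ure_per_bit=1e-14):
    """Probability of at least one unrecoverable read error over bytes_read bytes."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 10**12
# Rebuilding a 4-drive RAID5 of 4TB disks means reading the 3 surviving disks in full.
print(round(p_read_error(3 * 4 * TB), 2))  # roughly 0.62 under these assumptions
```

Under these assumptions an error during the rebuild is more likely than not, though not strictly guaranteed. Drives with a better rated error rate, or a second parity disk, change the picture considerably.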
Re:Unfamiliar by Anrego · 2014-09-11 04:04 · Score: 1

Way back when I looked into it (which again, was a while ago and quite brief, so I may/probably am totally wrong) the big problem seemed to be adding small amounts of storage to a large array.
In my particular use case, I have a 20TB file server (raid6, 12x 2TB drives). Let's say I fill that up and want to add 4 more TB. With my current RAID6/dm-crypt/lvm/xfs setup, this is fairly easy. Add 2 drives and expand everything. With ZFS it seemed hard to add arbitrary amounts of storage like this in most configurations.
I'd add that even if this and the other stuff I listed was legitimate, I'll probably end up using it at some point once it's more mainstream. I really like the data integrity stuff, and all the clone/snapshot stuff sounds excessively useful.
Re:Unfamiliar by MightyYar · 2014-09-11 04:07 · Score: 2, Insightful

I would add to your "cons" list that it requires* ECC RAM, though you should probably be using that anyway.
* It's not technically a requirement, but you'll probably be sorry if you don't use it.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:Unfamiliar by sribe · 2014-09-11 04:08 · Score: 1

One correction, the RAM overhead is only intense if you use deduplication.
One different perspective, it doesn't like hardware RAID, and neither should anybody else at this point. (Yeah, I still have hardware RAID in the field.) With ZFS, you will never have the experience of the replacement RAID controller having a different firmware version and not recognizing your disks. With ZFS, you will never get data corruption from a "write hole". With ZFS, it's actually documented as to wtf the RAID is doing in terms of disk layout.
One nitpick, expandability does kind of suck compared to some other RAID schemes, but most RAID levels you CANNOT just "easily toss a new disk in there and expand"--that ability is limited in most RAID schemes. ZFS is in the middle, more easily expandable than some, but definitely not as good as the easiest.
Re:Unfamiliar by Vesvvi · 2014-09-11 04:21 · Score: 1

If you want to add 4 more TB, then you attach a new mirror set, and you're left with RAID6(12x)+RAID1(2x). There is zero rebalancing (for better or worse): it's available immediately and transparently. The only catch is that you can't remove it again, but you can replace it with any combination of storage that provides equal or greater capacity to your RAID1(2x).
You could also grow your RAID6, and it's more efficient than it would be on most normal hardware RAID. But please don't do that: RAID5/6 really should be phased out, and it's not a good idea to create huge RAIDZ groups, even as RAIDZ2+. If you really want to stick with RAID5/6, it's better to just make a new group: leave your RAID6(12x) and add another RAID6(n
Re: Unfamiliar by zeigerpuppy · 2014-09-11 04:22 · Score: 1

You explained that so much better than I did!
Re: Unfamiliar by devman · 2014-09-11 04:28 · Score: 2

1GB RAM per TB Storage is only needed if you require dedupe. Dedupe is honestly more trouble than it's worth anyway and it isn't enabled by default. Without dedupe RAM requirements are closer to a standard fileserver.
Re:Unfamiliar by Guspaz · 2014-09-11 04:34 · Score: 5, Interesting

Adding additional drives to a raidz vdev is not supported, no. Apparently it's a use case that is extremely rare in enterprise, which is where zfs was intended for. Adding additional capacity is easy if you have no redundancy (12x2TB drives in a pool? Just add 2x2TB more drives to the pool and boom, more space), but not as easy if you want redundancy.
So you can't expand an existing vdev, but you can add a new vdev to the zpool. For example, say your current configuration is 12x2TB in raidz2 (the zfs equivalent of raid6). That's giving you 20TB of capacity, after redundancy. You need to add 4TB of additional usable capacity...
There are a few options. ZFS doesn't enforce redundancy, so there's nothing stopping you from adding two bare 2TB drives to the zpool. You'd get your extra 4TB, but data on those drives would be unprotected. Instead, you'd probably have to take 4x2TB, put them in a new raidz2 vdev, and then add that to your zpool. Then you'd have 12x2TB & 4x2TB, giving you 24TB of usable capacity, and every disk in the array has dual redundancy.
My home file server currently has 7x4TB & 8x2TB. They're both raidz2 arrays, in the same zpool, for 32TB of usable capacity on 44TB of raw storage. I started out with 5x2TB in raidz1 and migrated the data between various configurations. The iterations looked like this:
Configuration 1: 5x2TB (raidz1)
Configuration 2: 5x2TB (raidz1) + 5x2TB (raidz1)
Configuration 3: 7x4TB (raidz2) + 8x2TB (raidz2)
The migration process was:
1 to 2: Add the new 5x2TB (raidz1) vdev to the existing storage pool
2 to 3: Add the new 7x4TB (raidz2) vdev to a new storage pool, zfs send the file system from the old pool to the new pool, wipe the old 2TB drives, add back 8 of them in a new raidz2 vdev, add that new vdev to the existing new pool
The server only has 15 hotswap bays (the 2-to-3 migration required opening the case to get some of the drives hooked up directly), so my next migration will involve replacing the 2TB drives with something larger (probably 8TB by the time I need to expand). To do that, the process in zfs is that you replace a drive, re-silver the array, replace a drive, resilver the array, etc. When you have replaced the last drive, zfs automatically will expand the vdev to use the new capacity. Resilvering a completely empty drive is not fast, so I expect the process will probably take me about a week, since I'd probably start a new resilver each night before bed. But since I run raidz2, at no point would I be without redundancy, so it should be safe.
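The capacity figures in this comment follow from simple arithmetic: each raidz vdev gives up its parity disks, and the pool is the sum of its vdevs. A minimal sketch, assuming equal-sized disks per vdev and ignoring metadata and padding overhead (so real pools report somewhat less). The function names are made up for illustration.

```python
def raidz_usable_tb(disks, disk_tb, parity):
    """Approximate usable TB of one raidz vdev: (disks - parity) * disk size."""
    if disks <= parity:
        raise ValueError("a raidz vdev needs more disks than parity devices")
    return (disks - parity) * disk_tb

def pool_usable_tb(vdevs):
    """Sum usable capacity over (disks, disk_tb, parity) tuples; vdevs are striped together."""
    return sum(raidz_usable_tb(*v) for v in vdevs)

print(pool_usable_tb([(12, 2, 2)]))             # 12x2TB raidz2 -> 20
print(pool_usable_tb([(12, 2, 2), (4, 2, 2)]))  # plus a 4x2TB raidz2 vdev -> 24
print(pool_usable_tb([(7, 4, 2), (8, 2, 2)]))   # 7x4TB + 8x2TB, both raidz2 -> 32
```

The last line matches the 32TB usable on 44TB raw quoted for the home server, and the middle line shows that a 4-disk raidz2 vdev spends half its disks on parity to deliver 4TB.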
Re:Unfamiliar by JBMcB · 2014-09-11 04:37 · Score: 1

- CPU and RAM overhead comparable to Software RAID 5.
In my experience it needs a lot more memory than software RAID5. Something like 1GB per TB of disk space if running RAIDZ. Scrubbing can thrash your CPU pretty good, too.
I ran ZFS for a while on a dedicated file server with a fair amount of disk space (16TB) but switched over to btrfs RAID1 as my hardware wasn't up to ZFS requirements, and I needed the capability to add new drives to the pool which ZFS doesn't handle gracefully.

--
My Other Computer Is A Data General Nova III.
Re:Unfamiliar by MightyYar · 2014-09-11 04:54 · Score: 1

ZFS is in the middle, more easily expandable than some, but definitely not as good as the easiest.
Yes, ZFS is not a Drobo. You need to plan out your disk usage from the beginning, because you are kind of stuck with it.
For instance, if you have 5 disks and they are all the same size and you want 2 disk redundancy, it is almost a no-brainer to set up a raidz2. The downside is that if you ever want to make the vdev larger by replacing disks, you need to replace all 5 disks with the new larger size... a vdev is limited by the smallest disk. You can mitigate this by putting the same 5 disks into a pair of mirrors plus a hot spare. You will lose some initial capacity, but then later on you can add capacity by swapping out just two disks or by adding another pair to the pool.
And once you've added a vdev to the pool, you can never remove it... that's probably the biggest irritation for me personally. Even that isn't such a big deal, since it is so easy to clone the whole pool to another one.

Re:Unfamiliar by Rich0 · 2014-09-11 05:01 · Score: 1

Adding additional drives to a raidz vdev is not supported, no. Apparently it's a use case that is extremely rare in enterprise, which is where zfs was intended for.
This is largely what has kept me away from zfs, besides it not being in the mainline kernel.
If I had 300 disks then being forced to add/remove them in groups of 5 or so wouldn't be a big deal. When you have just a few disks at 90% capacity, being unable to add/remove them 1 at a time while keeping everything redundant is a much bigger problem (using an n+1 redundancy solution, not an n*2 solution).
One of the things I like about btrfs is that the design is more dynamic in this regard - you can have disks of varying sizes and add/remove them at will. Of course, anything more than raid1 on btrfs isn't production-ready yet (actually, anything on btrfs is only production-ready if you're willing to make some tradeoffs).
Re:Unfamiliar by mdmkolbe · 2014-09-11 05:08 · Score: 1

you can easily buy 100GB of RAM and a set of SSD
It sounds like you are focusing on a server setting. Would the CPU and RAM overheads be enough to be a concern in a desktop setting?
Re:Unfamiliar by pr0nbot · 2014-09-11 05:14 · Score: 1

It's also not great with disks of mixed size. For example, you can't create a 4TB mirror using a 4TB drive and 2x2TB drives (spanned). For home users, who will have a collection of mixed-size disks, you can do things, but it involves partitioning. I ended up doing something along the lines of this guy: http://tentacles666.wordpress....
Re:Unfamiliar by wagnerrp · 2014-09-11 05:21 · Score: 1

In my experience it needs a lot more memory than software RAID5. Something like 1GB per TB of disk space if running RAIDZ.
It appears to use a lot of memory because it replaces the standard kernel disk cache with its own ARC, and as unused memory is wasted memory, the ARC will eat up every last bit of memory you allow it.

Scrubbing can thrash your CPU pretty good, too.
It's performing a checksum of your entire system. That's going to be a CPU hog. BTRFS will be no different in this regard. Still, the default algorithm is fairly lightweight, and on a modern multi-core multi-GHz system, you should be bottlenecked on disk long before you "thrash" your CPU. If you're trying to run ZFS on an old low end Atom, well... don't do that!

and I needed the capability to add new drives to the pool which ZFS doesn't handle gracefully.
Of course it does. It just has some limitations. You cannot remove devices from a pool, and you cannot reshape a Z/2/3 vdev. You can add a new disk to a mirror vdev. You can replace all the disks in a Z vdev with larger ones, and then expand the vdev to use the new space. You can add a new disk to a pool. You can add a new mirror or Z vdev to a pool.
Re:Unfamiliar by mcrbids · 2014-09-11 05:21 · Score: 1

Scrubbing doesn't thrash your CPU as much as it thrashes I/O. Remember that both I/O and CPU are part of your "load average". This would be expected; it's reading every block on every device in your system.
You're right about the memory; I've forgotten that detail since RAM is cheap. 1 GB per TB is the recommended amount, though I've worked with far less in practice in low/medium write load environments.

Re:Unfamiliar by Vesvvi · 2014-09-11 05:24 · Score: 1

That's a nice writeup.
I'm sure you've chosen that configuration for a reason, but I think it's a good example for why stripes over mirrors can be a better choice for some applications.
You are running raidz2(7x4TB)+raidz2(8x2TB). Let's say that instead it was 3x(mirror(2x4TB))+4x(mirror(2x2TB)). Your capacity is 32TB as-is, or 20TB as mirrors: obviously that's a huge loss, and factoring in heat/electricity/performance/reliability it's likely that the raidz is a good choice for a home setup. Bandwidth would also be more than sufficient for home use.
But as you mention, the upgrades either take forever (one drive at a time) or require ridiculous free ports (add 7x at once?!). Even if you were to do them all at once, it would still be a fairly slow process with a massive performance hit.
On the other hand, with mirrors you can increase capacity 2 drives at a time, and at that level it's reasonable to leave both drives active as part of the "mirror" (now, 4-way) for some time. This is my preferred approach: new drives get added to a mirror set and run along with the system for a month or two. This stress-tests them, and if at any point there are warning signs the drives can be dropped out immediately. If all is good after the test period, the old 2x of the mirror are removed and the space is immediately available (autoexpand=on). The process can then be repeated. Overall it takes as much or more time than your approach, but the system is completely usable during that time with no real performance hits, and of course the overall system performance is substantially improved with the equivalent of 7 devices running in parallel instead of 2.
There are definitely situations in which raidz2/3 makes more sense than mirrors, but if you're regularly expanding or looking for performance, I think the balance favors mirrors.
Re:Unfamiliar by wagnerrp · 2014-09-11 05:25 · Score: 3, Informative

So you can't expand an existing vdev
While you cannot add new drives to a vdev, you can expand a vdev by incrementally replacing all of its drives with larger versions. Replace a drive, resilver, replace a drive, resilver... and when you're all done, just export the pool, import it back, and you have the full capacity of the new drives available.
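The replace-and-resilver route only pays off at the very end, because a raidz vdev treats every member as if it were the size of its smallest disk. A small sketch of that effect, using a hypothetical 8x2TB raidz2 vdev upgraded one drive at a time to 8TB models. The helper name is made up, and metadata overhead is ignored.

```python
def vdev_usable_tb(disk_sizes_tb, parity):
    """Usable TB of a raidz vdev: every member counts as the smallest disk."""
    return (len(disk_sizes_tb) - parity) * min(disk_sizes_tb)

vdev = [2] * 8   # 8x2TB in raidz2
history = []
for i in range(len(vdev)):
    vdev[i] = 8  # replace one drive with an 8TB model, then resilver
    history.append(vdev_usable_tb(vdev, parity=2))

print(history)  # [12, 12, 12, 12, 12, 12, 12, 48]
```

Capacity stays flat through the first seven replacements and jumps only once the last 2TB drive is gone.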
Re:Unfamiliar by wagnerrp · 2014-09-11 05:32 · Score: 1

The same can be said for any other filesystem as well. If you have a bad bit in memory, and you write it to disk, that data is corrupted. The only penalty under ZFS is that if you gave it redundancy, and leave checksums enabled, it will detect that fault, try to correct it, and in doing so crush the whole block instead of just one bit.
If you aren't going to use ECC memory, don't use checksums either.
Re: Unfamiliar by wagnerrp · 2014-09-11 05:36 · Score: 1

Why would I prefer DIF on FC and SAS2 over the drive's internal ECC mechanisms? A single layer of protection means a single layer that has to fail before data loss.
Re: Unfamiliar by ericloewe · 2014-09-11 05:40 · Score: 3, Informative

Dedup easily needs 5GB of RAM per TB.
For general usage (no dedup), 1GB per TB is a good rule of thumb.
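The two rules of thumb quoted here are easy to turn into a quick sizing estimate. A sketch only: the per-TB figures are the ones given in this comment (others in the thread report running with far less), and the function name is hypothetical.

```python
def suggested_ram_gb(pool_tb, dedup=False):
    """Rule-of-thumb RAM sizing from this comment: 1GB/TB normally, 5GB/TB with dedup."""
    per_tb = 5 if dedup else 1
    return pool_tb * per_tb

for tb in (4, 16, 32):
    print(tb, suggested_ram_gb(tb), suggested_ram_gb(tb, dedup=True))
```

For a 16TB pool this suggests 16GB without dedup and 80GB with it, which illustrates why several posters consider dedup more trouble than it is worth.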
Re:Unfamiliar by ericloewe · 2014-09-11 05:44 · Score: 1

By the time btrfs is production-ready, ZFS' block pointer rewrite will be done and hell will have frozen over.
Re:Unfamiliar by DoomSprinkles · 2014-09-11 05:47 · Score: 1

ZFS not liking raid is not a con but more an alternative. ZFS does what raid controllers do but in software. Plenty of advantages, not the least of which are that it does raid faster, uses your ram as a very large cache, and does not have the raid 5 write-hole bug. Would you want to run a hardware raid off of another hardware raid controller? ZFS has fantastic performance. It does love memory but it uses it well. I've been using ZFS on Archlinux for over two years and have had 0 zfs failures. The only expandability issue people run into is when you want to increase your zvol device it has to match the size of the others in the same volume. Btrfs has advantages in this area as it's been designed with releveling in mind but zfs generally outperforms everywhere else.
Re:Unfamiliar by ericloewe · 2014-09-11 05:48 · Score: 1

CPU overhead is minimal on a modern desktop.
RAM depends on what the desktop is used for. More RAM will definitely be better for ZFS. I wouldn't use less than 8GB and I'd be prepared to add more.
Re:Unfamiliar by nabsltd · 2014-09-11 05:53 · Score: 1

RAID controllers are limited by their on-board chips which are typically sub-GHz RISC (ARM, Intel, MIPS) processors - an external SAS RAID controller will cost you about $2-5000 extra and have a throughput of a few 100MBps and a few 100's of IOPS.
There are sub-$1000 LSI RAID controllers that have no problem providing 500MB/sec even with 10x 5400rpm drives in RAID-6. Faster drives and more spindles plus SSD cache (handled by that same LSI controller, so it's OS-agnostic) can give apparent throughput of around 150MB/sec per spindle. For your 36 spindle scenario, that would be around 5GB/sec, which is nearly double your throughput.
Right now, I'm in the process of building a cluster where each node will use two groups of hardware RAID-6 over 10 drives, and then use a zpool to combine the storage and give me all the other ZFS features (snapshots, scrub, etc.). The whole cluster will be combined using Lustre to give around 350TB of usable storage. Based on similar build-outs that our software vendor has seen, our biggest issue will be that we only have 20Gbps total network capacity per node, and under heavy load will likely be saturated.
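The throughput comparison in this comment mixes megabytes per second with gigabits per second, so it helps to put both on one scale. A small sketch using only the figures quoted in the thread (150MB/sec per spindle, 36 spindles, 20Gbps); the helper name is made up.

```python
def gbps_to_gbytes_per_s(gbps):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8

spindles = 36
per_spindle_mb_s = 150                          # figure quoted for the RAID controller
raid_gb_s = spindles * per_spindle_mb_s / 1000  # aggregate in GB/sec
zfs_gb_s = gbps_to_gbytes_per_s(20)             # the 20Gbps ZFS setup in GB/sec

print(raid_gb_s, zfs_gb_s, round(raid_gb_s / zfs_gb_s, 2))
```

On these numbers the hardware RAID estimate comes out at a little over twice the quoted ZFS figure, and the same conversion shows that a 20Gbps network link carries about 2.5GB/sec per node.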
Re:Unfamiliar by Aaden42 · 2014-09-11 05:55 · Score: 1

This is exactly what I see: much bigger hit on I/O than CPU for scrub.
I use maybe 30% CPU on a slow old quad Core2 vintage CPU while scrubbing about 14TB on three pools (all three pools running concurrently). I/O is flattened while it’s running, but CPU isn’t that bad all things considered.
Re:Unfamiliar by nabsltd · 2014-09-11 05:56 · Score: 1

3. RAID controllers are pretty much all unique and their volumes are non portable. They are also not documented well. Your drives are useless without the controller, and even recovering with a new controller of the same type is a crapshoot.
Modern LSI RAID controllers use a completely portable format. You can move the array to any other controller that supports that same level of RAID and it can be imported into the config. I have done this successfully even when the controller chip was a completely different model.
Re: Unfamiliar by nabsltd · 2014-09-11 06:00 · Score: 1

Why would I prefer ZFS over DIF on FC and now SAS2 which does the checksumming in a far more comprehensive manner for me and is agnostic to the file system it is on?
ZFS checksums the data in situ and will prevent bit rot, while the methods you reference are for verifying the data hasn't changed during the transfer across a wire.
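A minimal sketch of what checksumming data at rest buys you when there is redundancy: on read, each copy is verified against a checksum stored separately from the data, and a copy that fails is rewritten from one that passes. This is an illustration rather than ZFS code. SHA-256 from the standard library stands in for whatever checksum the pool is configured with, and `self_healing_read` is a hypothetical name.

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).digest()

def self_healing_read(copies, expected):
    """Return the first copy matching the stored checksum and repair the others in place."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("all copies fail the checksum; data is unrecoverable")
    for i in range(len(copies)):
        if checksum(copies[i]) != expected:
            copies[i] = good  # rewrite the bad copy from the good one
    return good

block = b"payroll records"
stored_sum = checksum(block)          # stored apart from the data it describes
mirror = [b"payroll recorcs", block]  # the first copy has silently rotted on disk

data = self_healing_read(mirror, stored_sum)
print(data == block, mirror[0] == block)  # True True
```

A transfer-level check would never see this error, because the bad bytes were read back from the disk faithfully. Only a checksum recorded when the data was written can tell which copy is the right one.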
Re: Unfamiliar by stoploss · 2014-09-11 06:17 · Score: 3, Funny

Dedup easily needs 5GB of RAM per TB.
For general usage (no dedup), 1GB per TB is a good rule of thumb.
This. Don't starve the ARC. You wouldn't like it when it's angry.
Re:Unfamiliar by smash · 2014-09-11 06:19 · Score: 1

CPU and RAM overhead is not "required". If you want to do things like in-line de-dupe, sure. If you use retarded ways of setting up your pools, then sure, expandability sucks.
The rules aren't "weird", they are just different. The big mistake people make with ZFS is diving into it without reading any of the documentation on the assumption that they know what they're doing because they've used other filesystems before.
Don't do that. Have run ZFS for years, it's awesome.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re: Unfamiliar by smash · 2014-09-11 06:21 · Score: 2

1 GB of RAM is worth about $20 these days anyhow (less?).
And yes, de-dup is expensive. Most of the time in my experience you get far better benefits from compression anyhow (source: real world enterprise datasets at work).

Re:Unfamiliar by smash · 2014-09-11 06:25 · Score: 1

Think about why you want to do that. Normally, it's due to fuck up from not setting your pool up in a sensible manner in the first place. Don't do that.

Re:Unfamiliar by smash · 2014-09-11 06:30 · Score: 1

Yup. Most people's expansion difficulties are due to retarded pool configurations. If you accept that 1. disk is cheap and 2. mirrors, whilst expensive in terms of disk capacity are way better performance and more flexible, zfs rocks.
People seem to have it stuck in their head that bigger RAID numbers are better, but RAIDZ/RAIDZ2/RAIDZ3 are only really useful when you're dealing with HUGE numbers of disks and performance is not so important. Normally you're far better off creating a larger number of VDEV mirrors, both in terms of performance and in terms of flexibility.
Which brings up another point - those not used to dealing with enterprise storage may not realize that you can/should/maybe want an array with more than one RAID group in it. They end up putting all their disks in one big VDEV which sucks for performance and flexibility, then blame ZFS for not being flexible.
Read how it works, don't make retarded choices based on ignorance, and you'll be fine.

Re:Unfamiliar by smash · 2014-09-11 06:35 · Score: 1

Just accept that you need to add (or replace) disks 2 at a time (mirror VDEVs), and move on. Unless you're dealing with > 20-30 drives, I'd suggest that RAIDZn is a poor choice. Also, the way writes work, making massive raid groups with large numbers of drives in them (i.e., adding another drive to a RAID5, like you would with BTRFS) is a bad idea. Parity RAID in general is a bad idea. Capacity is cheap, performance is not. Parity raid sucks for performance.

Re:Unfamiliar by smash · 2014-09-11 06:39 · Score: 1

Bullshit. You can add different size VDEVs to a pool, it works fine. It auto balances load across them, and no partitioning is required. My current home setup is 2x1 TB and 2x 512 GB mirrors (soon to be replaced with bigger drives, when it is full).

Re:Unfamiliar by smash · 2014-09-11 06:41 · Score: 1

I ran a home ZFS box with 2 GB in it (1.5 TB of mirrored storage) for 6 months with zero issues using FreeNAS. It's now got 10 GB; for home media streaming use I have noticed basically zero difference. Saturated gig-e with both setups.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Unfamiliar by RobbieCrash · 2014-09-11 07:09 · Score: 1

I've got a dedicated file server running Ubuntu 14.04 at home with a small group of ZFS pools (4x1TB, 4x2TB, 4x4TB each in RAIDZ). The FS used for backups has dedup turned on, everything else uses compression, but not dedup. It serves out 15 SMB shares, 10 netatalk shares and ~20 NFS shares to 3 Windows clients, 6 OSX clients all using TimeMachine and normal file access, 10 Linux VMs doing rsync backups and other normal file server stuff. Data is primarily HD video being streamed, as well as backups for everything.
The server is built on a low end i3 with 32GB ECC RAM, all disks are consumer grade (WD Blacks for the 1TB and Greens for everything else) attached to an LSI 9211-8i doing JBOD only. Writes over the network sustain 2x1Gbps and internal writes sustain over 3Gbps. I have 30GB of RAM dedicated to the pool, but really could get away with using about 6; dedup tables are currently ~3GB. I did a bunch of stress testing when I first set things up (scripted boatloads of fake read/write requests from each client) and CPU usage topped out at ~80%.
Home usage on an older desktop is not out of the question, assuming you can host the drives and maybe put in a total of 8GB RAM.

--
Keep on knockin'
https://robbiecrash.me
Re: Unfamiliar by suutar · 2014-09-11 07:16 · Score: 1

If I recall, there's on-the-fly dedupe and batch dedupe. Is the 5GB/TB estimate for both, or just on the fly? (I like dedupe as a concept, and I've gotten worthwhile results from it, but I can live with batch to avoid getting enough ram to require a new motherboard :)
Re:Unfamiliar by UnknownSoldier · 2014-09-11 08:04 · Score: 1

> - doesn't like hardware raid (apparently a lot of the pros rely on talking to an actual disk)
There is ZERO point of using hardware RAID with ZFS.
You can NOT _trust_ RAID in the first place -- it doesn't matter how fast your RAID is if it can silently fail!
ZFS RAIDZ1, RAIDZ2, or RAIDZ3 gives you the advantages of RAID without its disadvantages.
You can have it fast & insecure OR slow & safe. Pick one.
Re:Unfamiliar by flux · 2014-09-11 08:13 · Score: 1

It's performing a checksum of your entire system. That's going to be a CPU hog. BTRFS will be no different in this regard.
Well, my btrfs scrubs 1.3 gigabytes per second and chances are it's IO-bound on the RAID10 SSDs, so if ZFS scrubbing performance is comparable, I would say it's not a CPU hog.

You can add a new disk to a pool. You can add a new mirror or Z vdev to a pool.
You make it sound like such a petty limitation. But if you have a 5-device raidz in a pool, you are not going to add a single device to the system without risking data durability; you need to add at least two so you can mirror them, and then you're wasting space compared to parity RAID.
Re:Unfamiliar by Guspaz · 2014-09-11 08:14 · Score: 1

I like raidz2 over mirrors because it allows any two drives to fail without data loss. In a mirror configuration (even the one you specified above), the wrong two drives failing can cause data loss. More specifically, if any one drive fails in your listed setup, you've lost redundancy, and any read error on the other drive in the troubled pair would cause data loss.
Mirrors will be faster, while raidz2 will be safer and less wasteful of space. It's all about tradeoffs, and for home use, I prefer the extra reliability and the cost savings of needing fewer drives for equivalent capacity. The downside is upgrades are bigger and less frequent, but that's a tradeoff I'm willing to accept.
It should be pointed out that with the approach of replace-resilver-replace-resilver-etc, the entire process is done online. No downtime, and the resilver doesn't kill the performance too badly (you can configure how aggressively it goes if you care to do so). So even though I need to replace 8 drives for my next upgrade, and even though it will probably take me a week, my array will be up and available, and I need never reboot. Of course, one of my two HBAs only supports 2TB drives, so I'll need to shut down to replace the controller :P
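The "wrong two drives" argument above can be checked by brute force. This is a minimal sketch, assuming four drives arranged either as two striped mirrored pairs or as one raidz2 vdev; the drive names and layout are hypothetical and the model only counts whole-drive failures.

```python
from itertools import combinations

def mirror_pool_survives(failed, mirrors):
    """Striped mirrors survive only if every pair keeps at least one drive."""
    return all(any(d not in failed for d in pair) for pair in mirrors)

def raidz2_survives(failed):
    """A single raidz2 vdev tolerates any two failed drives."""
    return len(failed) <= 2

drives = ["d0", "d1", "d2", "d3"]          # hypothetical drive names
mirrors = [("d0", "d1"), ("d2", "d3")]     # hypothetical pairing

# Every possible way for exactly two of the four drives to fail.
two_drive_failures = [set(c) for c in combinations(drives, 2)]
mirror_losses = [f for f in two_drive_failures
                 if not mirror_pool_survives(f, mirrors)]
raidz2_losses = [f for f in two_drive_failures if not raidz2_survives(f)]

print(len(two_drive_failures), len(mirror_losses), len(raidz2_losses))
# prints: 6 2 0
```

Two of the six possible double failures take out the mirrored pool (both halves of the same pair), while none take out the raidz2 vdev. It says nothing about speed, which is the other half of the tradeoff.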
Re: Unfamiliar by rrohbeck · 2014-09-11 08:25 · Score: 1

Hmm, does it get faster on a bigger CPU? 8x3TB gives you 500-600 MB/s with a HW RAID6.

--
thegodmovie.com - watch it
Re:Unfamiliar by JBMcB · 2014-09-11 08:25 · Score: 1

It appears to use a lot of memory because it replaces the standard kernel disk cache with its own ARC, and as unused memory is wasted memory, the ARC will eat up every last bit of memory you allow it.
Well, I had 4GB of RAM, the cache ate up every bit of it and didn't run particularly well.

It's performing a checksum of your entire system. That's going to be a CPU hog. BTRFS will be no different in this regard.
Very true, but if CPU usage is a factor, on an app server say, then choosing ZFS is hardly a "no-brainer" as the OP stated.

Of course it does. It just has some limitations.
Right - what I was looking for is the ability to simply add a drive to a pool and get more drive space. With btrfs RAID1, which is what I'm using, you throw a drive in, hit rebalance, and you now have more storage, properly mirrored with distributed metadata.

--
My Other Computer Is A Data General Nova III.
Re:Unfamiliar by rrohbeck · 2014-09-11 08:27 · Score: 1

The same can be said for any file server and workstation. I still don't understand why ECC is so unpopular.

--
thegodmovie.com - watch it
Re: Unfamiliar by Guspaz · 2014-09-11 08:40 · Score: 2

ZFS only supports on-the-fly dedupe. For batch dedupe, you're probably thinking of HAMMER in DragonFly BSD.
BSD consumes insane amounts of RAM and has a massive performance penalty. It's almost never worth it, because the cost of extra RAM will be more than if you had just bought more disks in the first place.
Compression, on the other hand, requires very little RAM or CPU resources, gives a tangible performance improvement, and saves space. Once ZFS implemented LZ4 (which is extremely fast) it began making sense to simply always enable compression globally on every filesystem. They should probably make it enabled by default.
Re:Unfamiliar by Guspaz · 2014-09-11 08:46 · Score: 1

Since you shouldn't have more than 8 to 10 disks in any one raidz vdev, the suggestion that raidz is only for huge numbers of disks is absurd. If you're using more drives than that, you're going to be adding multiple vdevs to a pool anyhow, which is striping, so roughly equivalent to raid 5+0.
Re:Unfamiliar by Guspaz · 2014-09-11 08:52 · Score: 1

Why would you run ZFS on top of two raid6 arrays instead of building a storage pool consisting of two 10-drive raidz2 vdevs? By doing what you're doing, you're effectively running with no redundancy despite being on top of raid6. If ZFS finds a checksum error, it thinks it's running on two big drives in a stripe with no redundancy, and it will be unable to recover the lost data.
What you're doing is highly inadvisable.
Re:Unfamiliar by wagnerrp · 2014-09-11 08:55 · Score: 1

With btrfs RAID1, which is what I'm using, you throw a drive in, hit rebalance, and you now have more storage, properly mirrored with distributed metadata.
If you have RAID1 and add a drive, you still have RAID1, and just as much storage as you started with. You only add redundancy, unless you're saying it converted the mirror into a parity array.
Re:Unfamiliar by Guspaz · 2014-09-11 08:56 · Score: 1

It works if and only if the target system is also using LSI RAID controllers.
Meanwhile, I created my storage pool on Solaris UNIX, used it for years, then switched to Linux without having to do anything to the pool except "zpool export tank" on the old OS and "zpool import tank" on the new one.
Re:Unfamiliar by wagnerrp · 2014-09-11 09:00 · Score: 1

The argument against allowing expansion of parity arrays is that if you find yourself wanting to add a single disk to a parity array, you didn't properly plan for expansion when you designed the system. ZFS was originally designed for enterprise customers, for whom that was a feature that would rarely ever see use. It was not intended for the home user piecing together spare parts for a file server.
Re:Unfamiliar by sribe · 2014-09-11 09:09 · Score: 1

Even that isn't such a big deal, since it is so easy to clone the whole pool to another one.
And that right there, I think, sums up nicely the limitation with ZFS re expansion. ZFS is not intended for users who can't buy a pile of new disks in order to expand; users who want to expand by adding a single disk are just not its intended audience.
I very much appreciated that ability with Synology's "Hybrid" RAID while I was using that device, but in the end I'll gladly trade it for ZFS's attention to data integrity.
Re:Unfamiliar by h4ck7h3p14n37 · 2014-09-11 09:12 · Score: 1

Don't forget the excellent portability of ZFS filesystems!
With ZFS you can move a drive between FreeBSD, Linux and MacOS systems. AFAIK the only other filesystem you can do this with is FAT32 and you lose a lot of features if you go that route. This is a big deal when you've got terabytes of data on drives and a heterogeneous computing environment.
Re: Unfamiliar by ericloewe · 2014-09-11 09:16 · Score: 1

Some OSes (FreeNAS) already default to compression being enabled.
Re: Unfamiliar by Guspaz · 2014-09-11 09:29 · Score: 1

Sorry, brain fart. I meant "DEDUPE consumes insane amounts of RAM", not "BSD".
Re:Unfamiliar by ultranova · 2014-09-11 12:09 · Score: 1

Which brings up another point - those not used to dealing with enterprise storage may not realize that you can/should/maybe want an array with more than one RAID group in it. They end up putting all their disks in one big VDEV which sucks for performance and flexibility, then blame ZFS for not being flexible.
Read how it works, don't make retarded choices based on ignorance, and you'll be fine.

Right. So let's take a practical example: I have 4 2TB drives, and one 120GB SSD. What are the "non-retarded" ways to configure that under ZFS, and why?

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:Unfamiliar by nabsltd · 2014-09-11 13:43 · Score: 1

If ZFS finds a checksum error, it thinks it's running on two big drives in a stripe with no redundancy, and it will be unable to recover the lost data.
There will be yet another level of redundancy through the cluster, as each file will be available on at least two different nodes. ZFS will help in finding discrepancies that are found at the cluster level.
Re:Unfamiliar by nabsltd · 2014-09-11 13:45 · Score: 1

It works if and only if the target system is also using LSI RAID controllers.
In the business world where you don't change the underlying OS on a critical system just because you feel like it, it's pretty easy to make sure the target hardware meets the spec.
Re:Unfamiliar by dbIII · 2014-09-11 13:53 · Score: 1

cpu and ram overhead (even by current standards, uses a tonne of resources)
It will use what it can get, but will get by with what it has. I've run it on a 32 bit "netburst" Xeon system with 4GB of ram and old IDE drives and was still pushing stuff out at 60Mb/s. Any modern machine with 8GB or more (preferably a lot more) and a few SATA disks should be able to saturate gigabit twice over.

doesn't like hardware raid
It's normally just better not to use it since your CPU/s are going to be able to do more than the little things on the RAID card plus those things don't have much memory. I've run it OK on top of RAID with cards that cannot do JBOD.

expandability sucks
Adding a new "volume" to a pool is trivial; with mirrors it's usually just another two disks, but with raidz2 that may be another 6 disks. I've done that a few times. The only thing that sucks is you are stuck with the existing volume configuration for the existing volumes, so you can't just pop in one more disk and go from raidz1 to raidz2. With ad hoc expansion you can add a two disk mirror to a pool with raidz1 since the volumes in the pool are independent. It's messy but it works as a way to add space to existing filesystems without changing anything else.
Re:Unfamiliar by dbIII · 2014-09-11 14:00 · Score: 1

I'd say the answer is do that, but be prepared to do it again properly later :) ZFS send to something set up properly is a good cure for early fuckups.
Re:Unfamiliar by dbIII · 2014-09-11 14:14 · Score: 1

Raidz of 4 drives for space or two mirrors in the pool for speed, with the SSD either as an OS drive outside the pool or caching for the pool. ZFS speed relates to the number of volumes, hence the comment above about large virtual devices (eg. a huge raidz array) versus smaller ones (several smaller raidz arrays or a bunch of mirrored pairs).
For a single user the SSD probably won't make a lot of difference to the pool. Using an SSD as cache wins when there are a lot of people hitting the same files over and over and the files won't fit in main memory. In a lot of cases you are better off booting from it than putting it in the pool.
Re:Unfamiliar by Guspaz · 2014-09-11 14:21 · Score: 1

Only if ZFS is communicating with that higher level. A simpler solution is to just use ZFS's native RAID instead of treating a RAID array as a block device. I can't think of a single benefit to doing that, but I can think of lots of reasons why it's a bad idea.
Re:Unfamiliar by mynamestolen · 2014-09-11 14:25 · Score: 1

Re your sig block:
what of depriving yourself of the truth and voting in extremists thus fucking the world?

--
work in progress
Re:Unfamiliar by guruevi · 2014-09-11 15:04 · Score: 1

a) File systems are not OS agnostic; in servers it doesn't matter much
b) You're talking about internal RAID controllers; in an HA situation, you need external RAID controllers
c) Even with those internal RAID controllers, I've tested the LSI MegaRAID and ~1000 IOPS is all I get out of it on regular spindles. The latest and greatest from LSI still gets stuck at ~200k IOPS with SSDs (~400k IOPS if you use some proprietary software) while individual SSDs get ~50k IOPS.

--
Custom electronics and digital signage for your business: www.evcircuits.com
Re:Unfamiliar by fnj · 2014-09-11 18:26 · Score: 1

You can look at this as nitpicking and ZFS ass-covering if you want, but it's meant to be constructive.
Twelve good-size drives is too many for a single-level raidz2 (or RAID6 for that matter). Any guru will tell you that. The design on zfs would be far better with a zpool built on two raidz2 vdevs, 6 drives each. Six drives is the sweet spot for double parity. OK, that's now 16 TB instead of 20. A tradeoff I would make (and did in fact make, with twelve 3 TB drives) in a heartbeat.
Now when you want to grow your pool you can logically concatenate another vdev of 6 more drives. That won't involve any data rebuilding at all, like a RAID setup does. With ZFS the operation is essentially instantaneous. OK, that's an 8 TB increment instead of 4, but a 50% increment makes a whole lot more sense than a lousy 20% anyway.
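The capacity figures in the post above can be reproduced with a few lines of arithmetic. A rough sketch, assuming 2 TB drives (which is what the 20 TB and 16 TB figures imply) and ignoring filesystem overhead; the function name is made up for illustration.

```python
def raidz2_usable_tb(drives_per_vdev, vdevs, drive_tb):
    """Usable space of a pool of identical raidz2 vdevs.

    Each raidz2 vdev gives up two drives' worth of space to parity.
    """
    return (drives_per_vdev - 2) * vdevs * drive_tb

DRIVE_TB = 2  # assumption inferred from the 20 TB / 16 TB figures

single_wide = raidz2_usable_tb(12, 1, DRIVE_TB)   # one 12-drive raidz2
two_vdevs = raidz2_usable_tb(6, 2, DRIVE_TB)      # two 6-drive raidz2 vdevs
three_vdevs = raidz2_usable_tb(6, 3, DRIVE_TB)    # after adding a third vdev

increment = three_vdevs - two_vdevs
growth_pct = 100 * increment / two_vdevs

print(single_wide, two_vdevs, increment, growth_pct)
# prints: 20 16 8 50.0
```

The 4 TB the split layout gives up buys a second vdev's worth of parity, and each later expansion step adds 8 TB without rebuilding existing data.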
Re:Unfamiliar by fnj · 2014-09-11 18:38 · Score: 1

While you cannot add new drives to a vdev ...
Yes you can, if the vdev is a concatenation of drives or other vdevs. And yes you can (in the form of additional copies) if it is a mirror.
And, for example, if you have a 6 drive raidz2, you can change each component - on the fly - from a single drive to a logical concatenation of 2-3 drives. Yes, the data safety will be reduced because you only have to simultaneously lose 3 drives out of 12 or 18 instead of 3 out of 6, to get catastrophic data loss, but you've still got double parity.
Re:Unfamiliar by fnj · 2014-09-11 18:51 · Score: 1

No problem, if you're talking about a reasonable size ZFS store. Not 64 SAS drives, but say 4-12 SATA. Forget dedupe, which is useless in normal settings anyway, and 16 GB would do it. At or below the lower end of 4-12 drives, you could probably get away half decently with 8 GB depending on how many GB you gobble up in user processes (damn that Firefox!).
I'm running 12 SATA 3 TB in a 16 GB server, where there is nada RAM usage since there is no local user, and even no GUI running at all. It works just dandy for my usage pattern - the great preponderance of the files are 2-30 GB in size, not a whole lot of tiny files - and there is only one user, me.
If you have less than 8 GB, you have to be kidding me.
Re:Unfamiliar by fnj · 2014-09-11 18:58 · Score: 1

Your reference is a lot of utter bullcrap mixed in with a few posters who have a clue. The consequence of RAM errors is EXACTLY the same using ZFS as it is with any other filesystem. You can corrupt your data, or even metadata, either coming from or going to storage. So what. Unreliable SATA cables or bad drive electronics can do EXACTLY the same thing. Even ECC RAM has a finite undetected bit error rate.
Obviously ECC RAM is a Good Idea when you have Important Data, no matter what the file system is, but there is absolutely nothing magic about ZFS that makes magically higher demands on RAM.
Re:Unfamiliar by fnj · 2014-09-11 19:05 · Score: 1

Sounds like a fairy tale to me. Yes, one can conceive of scenarios where this MIGHT happen, but in general no. There are other scenarios where a RAM error leads to writing wrong data which will in fact be FIXED by ZFS checksums. And still other scenarios where a RAM error has absolutely no effect on ZFS checksumming. Probably the third group of scenarios is most common, followed by the second group.
Generally, with non-ECC RAM, either you can't find a single bit error in years of runtime, or else the first onset of errors will be severe enough to crash the system and the user will run memtest and fix it before any significant damage is done.
Bad advice.
Re: Unfamiliar by zeigerpuppy · 2014-09-11 19:23 · Score: 1

Actually it has 32GB of reg ECC RAM but that's not all for ZFS!
Re:Unfamiliar by fnj · 2014-09-11 19:42 · Score: 1

Sounds (revising this) like there's a bit more to it than that, but not much. I believe either of those two methods will work. I suspect the original export/import method may still work too.
Re:Unfamiliar by Mirar · 2014-09-11 21:52 · Score: 1

The big difference from raid+filesystem to zfs or btrfs is that the new ones have a checksum on the raid blocks.
That means that if you get bit errors (or more than bit errors) on one of the blocks on the raid, you can rebuild that block from the others.
With a normal RAID there is no way of telling which bit is wrong, just that the blocks don't match up anymore. You're protected against one disk failing, but not the random errors.
You can also add deduplication, compression and snapshots if you want. Don't know how LVM+raid works with that.
Re:Unfamiliar by AaronW · 2014-09-11 22:27 · Score: 1

I can do most of those things using my old Areca hardware RAID controller and XFS.
Data integrity is maintained in my RAID array which has its own battery-backed ECC memory. I can grow and shrink logical volumes on the fly. I can change the striping or even the RAID level without any downtime. I replaced all of the drives in my RAID array (one at a time) with larger drives with zero downtime.
Running XFS makes it easy to do incremental backups or doing the equivalent of DD on a mounted filesystem using xfsdump. It also supports defragmentation while mounted.
The RAID array also does data scrubbing and runs all of the SMART checks.
I can easily add more capacity without downtime, just drop another disk in the array and add it.
While I can't do snapshots or native compression, I can do most other things. Compression would do nothing for me since most of my data is already compressed. I run continuous backup software to back up onto removable SATA drives as well as to a cloud backup service (Crashplan) which encrypts everything. It maintains snapshots of everything and I have several TB backed up that way.
While I haven't played with ZFS, I did try out BTRFS but had to throw it out. Performance was abysmal on the SSD I was using, and the lack of a clear way of knowing how much space is free is a major issue. If everything is snapshotted, how do you deal with deleted files when you run low on space? The performance when I tried to put my IMAP server on it was unusable. I gave up after a couple of hours trying to write all of my emails to it on an SSD. On XFS or EXT4 it takes a fraction of the time, despite it being hundreds of thousands of small files.
Also, I can still run bcache or some other method of using an SSD to cache my data.
I have been using XFS for years and always found it to be reliable, more so than my experience with EXT2/3/4 though I'm also one of the rare people who never had a problem with the killer Reiserfs. My IMAP server ran for 10 years on Reiserfs with the same hard drive with uptimes on the order of years before I finally retired the machine (the hard drive has over 10 years of uptime according to SMART). The only major problem I had was that the Linux kernel had a bug where the uptime would wrap after 497 days. After that happened a few times I finally had to reboot the computer when the UPS died and it loaded an updated kernel.

--
This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
Re:Unfamiliar by Eunuchswear · 2014-09-11 22:29 · Score: 1

The point of ZFS is that hardware raid sucks.
Yes, that's the point about mdadm too.
What does ZFS get you that a layered EXTx/LVM/Mdadm stack doesn't?

--
Watch this Heartland Institute video
Re:Unfamiliar by Eunuchswear · 2014-09-11 22:30 · Score: 1

That's not a "con". Sanity requires ECC RAM.

--
Watch this Heartland Institute video
Re:Unfamiliar by Rich0 · 2014-09-11 23:31 · Score: 1

What you propose requires doubling the cost of storage. I realize that CPU is cheap, RAM is cheap, storage is cheap, and so on. But, for whatever reason it seems like everybody selling all this "cheap" stuff still wants my credit card number.
With RAID5 the dollar cost of parity isn't nearly as significant. Sure, it might not perform quite as well, but until you start getting into the 20-30 drive scenario performance of spinning disks is always going to be lousy. Unless you're running VMs on the disks, it probably doesn't matter, since most of the stuff that takes lots of space (video) tends to be rarely accessed, and only sequentially.
Re:Unfamiliar by cerberusss · 2014-09-11 23:46 · Score: 1

You had me at data integrity :)

--
8 of 13 people found this answer helpful. Did you?
Re:Unfamiliar by MightyYar · 2014-09-11 23:56 · Score: 1

Unreliable SATA cables or bad drive electronics can do EXACTLY the same thing.
No, because the corruption will be caught. I - and many others - have had controller failures and bad hard drives cause corruption on the drives, but this corruption was caught during the scrub. If the data is bad in-memory and then hashed and written to disk in that condition, the corruption will be silent.

Even ECC RAM has a finite undetected bit error rate.
ECC RAM will only correct 1 bad bit, but the system is supposed to halt on 2 bad bits. A halt is better than operating in an unknown state, IMHO.

Obviously ECC RAM is a Good Idea when you have Important Data, no matter what the file system is, but there is absolutely nothing magic about ZFS that makes magically higher demands on RAM.
Even if you aren't worried about the specific scenario where the whole pool goes down, taking all the trouble to run a filesystem with parity seems silly if you can't trust the error detection/correction to actually work. And since you are writing a hash as well as a file, you are doubling the opportunities to corrupt any particular file vs a regular file system.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:Unfamiliar by smash · 2014-09-12 00:27 · Score: 1

You are aware that the write IOPS of your RAIDZ VDEV is the performance of a SINGLE DISK, right?

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:Unfamiliar by Guspaz · 2014-09-12 01:06 · Score: 1

Yes, but write throughput is still increased, and not everybody needs more write IOPS. Furthermore, even with 8 disks you can build two raidz2 arrays and put them in a pool, at which point you've got the IOPS of two disks. And on top of that, you can use fast SSDs as ZIL cache devices.
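The rule of thumb traded in the two posts above (a vdev delivers roughly one disk's worth of write IOPS, and a pool stripes across its vdevs) can be put into numbers. This is only an illustration: the per-disk IOPS figure is an assumed round number, not a measurement, and throughput, caching and log devices are ignored.

```python
def pool_write_iops(vdev_count, disk_iops):
    """Rule of thumb: each vdev contributes about one disk's write IOPS."""
    return vdev_count * disk_iops

DISK_IOPS = 100  # assumed figure for a single spinning disk, illustration only

# Three ways to arrange the same 8 disks, mapped to their vdev count.
layouts = {
    "1 x 8-disk raidz2": 1,
    "2 x 4-disk raidz2": 2,
    "4 x 2-disk mirror": 4,
}

estimates = {name: pool_write_iops(n, DISK_IOPS) for name, n in layouts.items()}
for name, iops in estimates.items():
    print(f"{name}: ~{iops} write IOPS")
```

Under this model the same eight disks give 1x, 2x or 4x a single disk's write IOPS depending on layout, which is the whole disagreement in a nutshell: the wide raidz2 keeps the most space, the mirrors keep the most IOPS.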
Re:Unfamiliar by fnj · 2014-09-12 01:09 · Score: 1

Repeating falsehoods does not make them true. ZFS does not mysteriously "require" ECC RAM, end of goddam story. Plenty of people are using ZFS without ECC RAM.
Re:Unfamiliar by MightyYar · 2014-09-12 01:14 · Score: 1

I'd love for you to be right, but you haven't added any information to the discussion so it is hard to believe you.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:Unfamiliar by wagnerrp · 2014-09-12 01:44 · Score: 1

Are you sure about that? ZFS has a "copies" parameter as well. It just means it stores two copies of the block somewhere in the pool. If you give it multiple disks in a pool, it will try to place those copies on different disks, but it will not guarantee it. It's a measure to prevent data loss when you have a damaged sector, not a full disk failure.
If you have disks of different sizes with copies=2, will it refuse to write if you only have one disk with free space remaining?
Re:Unfamiliar by tinker_taylor · 2014-09-12 03:13 · Score: 1

Cons:
- cpu and ram overhead (even by current standards, uses a tonne of resources)
You can tune the size of the ZFS ARC, thereby optimizing RAM use. In the Solaris world, up to 25% of RAM is used by ZFS by default, unless we throttle it. If you have SAN LUNs as the underlying block storage for ZFS, it is better to reduce the ARC size. I suspect the same thing is possible in the Linux port as well.
- doesn't like hardware raid (apparently a lot of the pros rely on talking to an actual disk)
It is not recommended to use hardware RAID, but that's because ZFS has superior fault-tolerance mechanisms (RAIDZ2/Z3 etc). And if you use a JBOD, you can leverage things like L2ARC (using flash devices) and SSD-based ZFS write-ahead logs (what's called the Logzilla in the ZFS appliance world).
- expandability sucks (can be done, but weird rules based on pool sizes and such) compared to most raid levels where you can easily toss a new disk in there and expand.
This is incorrect. Expanding ZFS pools is as simple as adding additional devices to the pool. Depending on your underlying striping strategy, you would have to add storage in commensurate manner of course. It's literally -- "zpool add "
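On the ARC tuning point in the post above: in the Linux port the cap is, to the best of my knowledge, the zfs_arc_max module parameter, given in bytes and usually set through a modprobe options file such as /etc/modprobe.d/zfs.conf. The helper below is a hypothetical convenience that only formats that line; the 4 GiB figure is an arbitrary example, not a recommendation.

```python
GIB = 1024 ** 3

def arc_max_option(limit_gib):
    """Return a modprobe options line capping the ARC at limit_gib GiB.

    zfs_arc_max takes a byte count, so the GiB value is converted here.
    """
    if limit_gib <= 0:
        raise ValueError("ARC limit must be positive")
    return f"options zfs zfs_arc_max={int(limit_gib * GIB)}"

line = arc_max_option(4)  # example value only
print(line)
# prints: options zfs zfs_arc_max=4294967296
```

The line would typically take effect the next time the zfs module is loaded; check the documentation for your distribution's packaging before relying on it.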
Re:Unfamiliar by Bengie · 2014-09-12 04:31 · Score: 1

You only *think* you can do all that because you don't understand how it's being done. ZFS fixes a lot of corner failure/corruption cases.
Re:Unfamiliar by ejasons · 2014-09-12 06:04 · Score: 1

Unless you want encryption, where each OS seems to have its own solution (cryptsetup for Linux, geli for FreeBSD, native ZFS for Solaris, etc.)...
Re:Unfamiliar by bhiestand · 2014-09-12 06:09 · Score: 1

It works if and only if the target system is also using LSI RAID controllers.
In the business world where you don't change the underlying OS on a critical system just because you feel like it, it's pretty easy to make sure the target hardware meets the spec.
In the business world, if you don't have the scale and expertise to build your own cluster, you use real enterprise gear in redundant configurations. Whether NetApp/EMC or ZFS on qualified hardware.
If availability isn't important to you, and you can afford to keep spare controllers on hand so you don't have to wait days to source a compatible controller 5 years from now... fine, use LSI. But don't pretend it's somehow smarter to use HW RAID on a critical system.

--
SWM seeks new sig for a brief fling
Re:Unfamiliar by Vesvvi · 2014-09-12 07:15 · Score: 1

So "p" is the probability of a drive being down at any given time. A hard drive takes a day to replace, and has a 5% chance of going dead in a year. A given hard drive has a "p" of ~1.4e-4.
For RAID6 with 8 drives, you can drop 2 independent drives: failure = 1.4e-10. It's out in the 6+ nines.
It would take 6x sets of mirrors to get the same space. Each mirror has a failure probability of (p^2), 1.9e-8. Striped over the mirrors, all sets have to stay active: success = (1-p^2)^6, failure = 1.1e-7. Way easier to calculate without the binomial coefficient, by the way.
Technically, the mirrors are 3 orders of magnitude more likely to fail, but the odds are still ridiculously good. Fill a 4U with 22 drives (leave some bays for hot-swap) as mirrors and it's failure = 2e-7. Statistically, neither of these is going to happen: you just won't see two drives happen to go down together by random chance.
People already know this. There are much more advanced models that account for the what-happens-next situation after you've already lost a single drive, and of course it is non-linearly worse. But just to keep it simple, going back to the naive model, for the RAID6 with 7 remaining drives, the failure probability is now up to 4e-7 during the resilver time. The mirror model stays at a "huge" failure = 1.4e-4 during a resilver, but it's brief, predictable, and with low system impact. It's my stance that that kind of probability keeps it in the category of less-important compared to many other factors for a risk analysis.
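The naive model in the post above is easy to reproduce. A sketch under the same assumptions the post states (independent drives, 5% annual failure rate, one day to replace); the helper name is made up, and the binomial sum is used for the RAID6 cases where the post quotes only the result.

```python
from math import comb

def at_least_k_down(n, k, p):
    """Probability that k or more of n independent drives are down at once."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance a given drive is down at any moment: 5% per year, 1 day to replace.
p = 0.05 / 365

raid6_8 = at_least_k_down(8, 3, p)         # 8-drive RAID6 dies on a 3rd failure
mirror_pair = p ** 2                       # both halves of one mirror down
mirrors_6 = 1 - (1 - mirror_pair) ** 6     # 6 striped mirrors, same usable space
mirrors_11 = 1 - (1 - mirror_pair) ** 11   # 22 drives as mirrors
raid6_degraded = at_least_k_down(7, 2, p)  # RAID6 with one drive already gone

for label, value in [("p", p), ("raid6 x8", raid6_8), ("6 mirrors", mirrors_6),
                     ("11 mirrors", mirrors_11),
                     ("raid6 degraded", raid6_degraded)]:
    print(f"{label}: {value:.1e}")
```

The results land on the figures quoted in the post (about 1.4e-10, 1.1e-7, 2e-7 and 4e-7). Like the post, this counts only whole-drive failures and ignores read errors during a rebuild.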
Re:Unfamiliar by Guspaz · 2014-09-12 10:10 · Score: 1

You're accounting only for full-drive failures. IIRC BackBlaze indicates failure rates are higher than 5% per year, but that's not really relevant. The bigger problem is a read error during a resilver. That's something that the drive specs indicate should be expected at least once during any resilver, although in practice I find it less likely than that.
If you're using mirrored pairs, any resilver is (by spec) highly likely to result in corruption due to unrecoverable read errors due to lack of redundancy. Resilvering a single drive in a raidz2 array, however, still provides you with redundancy to recover from any read errors.
Re:Unfamiliar by MightyYar · 2014-09-13 01:56 · Score: 1

zfs also keeps more data (a lot more) in memory than a "regular" filesystem, so you are more likely to encounter flaky memory in the first place. If I weren't going to use ECC RAM, I would probably forgo these fancy hashing filesystems and instead run something more mundane and then do a separate data integrity check with my backup. I use Unison for my data that is impractical to keep on ZFS. It is slow but has saved my butt on data that is important to me (family photos with some corruption on the master).

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:Unfamiliar by bhiestand · 2014-09-14 20:45 · Score: 1

Read the other comments in this article that point out all the pros. I love md and lvm, but they are little league compared to ZFS.
Hell, just the snapshotting alone. User accessible previous copies of files!

--
SWM seeks new sig for a brief fling
Re:Unfamiliar by jwhitener · 2014-09-15 12:47 · Score: 1

I am not the sys admin. I am an application developer / analyst who works closely with sys admin.
"expandability sucks"
Whoa... no way. It is by far the most flexible expansion I've ever seen. I have yet to make a request of any of the sys admins that wasn't instantly fulfilled on a zfs system. On other systems I'll often get a "well.. we can't do it that way, but I can move this mount point over here, and rename this, then add a disk, then name it back, etc....".
ZFS has pools. You can add anything that can present as a block device (file, hard disk, virtual disk from a storage device, usb keychain, etc.) to a pool. Then you can carve that pool in many different ways and attach it to zones (Solaris's lightweight virtual machines). And, of course, all this can be done live, in production. No reboots required. Space is added or removed from my live servers all the time.
I think one of the home NAS manufacturers uses ZFS. You can mix and match drives, hot swap them, and the raid will rebuild itself on the fly.
ZFS
Snapshot of 500GB, instant.
Rollback when I mess something up, instant or like a minute.
VMWare
Snapshot of 500GB, 10 minutes.
Rollback when I mess something up, 30 minutes.
Run zfs snapshot on "myvolume", then zfs send "myvolume-snapshot" to the other zfs system. On the other system, boot "myvolume", log in, change the IP and system name, done. A second new server is up and running. Or a new backup system created from production, etc.
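The snapshot-and-send workflow described above can be sketched as a small command builder. This is only an illustration: the dataset name `tank/myvolume`, the snapshot name `golden`, and the host `newhost` are made up, and the helper just prints the commands rather than running them.

```python
import shlex

def clone_commands(dataset: str, snap: str, remote_host: str, target: str) -> list:
    """Return the shell commands for snapshotting `dataset` and streaming
    the snapshot to `target` on `remote_host`. Nothing is executed here."""
    q = shlex.quote
    full_snap = f"{dataset}@{snap}"
    return [
        # Take the (near-instant) snapshot
        f"zfs snapshot {q(full_snap)}",
        # Stream it to the other machine and recreate the dataset there
        f"zfs send {q(full_snap)} | ssh {q(remote_host)} zfs receive {q(target)}",
    ]

for cmd in clone_commands("tank/myvolume", "golden", "newhost", "tank/myvolume"):
    print(cmd)
```

Keeping the commands as data makes it easy to review them before anything touches a production pool.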

License mismatch by Anonymous Coward · 2014-09-11 03:03 · Score: 1

It's unfortunate that the code is being ported to Linux, not rewritten. This means there will never be native Debian support for it. As a result, unsurprisingly, packages are only available for the Debian amd64 architecture.

Re:License mismatch by jedidiah · 2014-09-11 03:10 · Score: 1, Insightful

The GPL only hates Mad Max post apocalyptic style "freedom".
The FSF rightfully understands that the complete absence of the rule of law simply enables the person with the biggest pile of guns to control things.
Sun seemed like a benign enough master but Oracle far less so.

--
A Pirate and a Puritan look the same on a balance sheet.
Re:License mismatch by Anrego · 2014-09-11 03:30 · Score: 1, Insightful

They are strongly against one set of freedoms in support of the subset of freedoms they deem more important.
Which is fine, but I've always found their choice in terminology and strong focus around the word "free" to be annoying. Consequently I try to avoid using the term "free software" and instead usually opt for "open source", which while it doesn't convey the idea that it's restrictively licensed to ensure it and any derivatives remain open source, it also doesn't falsely convey that it is entirely free (as in do whatever you want with it free of restrictions).
Re: License mismatch by zeigerpuppy · 2014-09-11 03:30 · Score: 1

Because ZFS has far better features than BtrFS http://rudd-o.com/linux-and-fr...
Re:License mismatch by armanox · 2014-09-11 03:38 · Score: 1

Sun used their own Open Source license, which they've had for quite a while (and released quite a bit of software over the years using). The issue is "Free" vs "Free" vs "Open Source"

--
I'm starting to think GNU is the problem with "GNU/Linux" these days.
Re:License mismatch by armanox · 2014-09-11 03:42 · Score: 1

If Sun wanted to hate freedom, would they have released it under an open source license, as approved by the OSI?

--
I'm starting to think GNU is the problem with "GNU/Linux" these days.
Re: License mismatch by Rich0 · 2014-09-11 03:45 · Score: 1

Because ZFS has far better features than BtrFS
http://rudd-o.com/linux-and-fr...

It has SOME features which btrfs has not yet implemented. Btrfs also has some features which ZFS has not yet implemented, including support for dynamically resizing a RAID (not adding/removing a RAID from a zpool).
Re:License mismatch by meta-monkey · 2014-09-11 04:04 · Score: 1

Well, they're two different things. "Open source" is a design methodology. "Free software" is a social movement. I try my hardest to use only FOSS. I say that from my windows computer at work. But hey, I take my ultrabook with debian on it just about everywhere I go.

--
We don't have a state-run media we have a media-run state.
Re:License mismatch by MightyYar · 2014-09-11 04:15 · Score: 1, Flamebait

There is "free as in beer" (usually both GPL and BSD). There is "free as in freedom" (BSD). And then there is "free as in free-range chickens" (GPL).

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:License mismatch by brambus · 2014-09-11 04:30 · Score: 1

Sun still had closed bits in the OS kernel when OpenSolaris was CDDL'd, so GPL was a non-starter because of GPL's "infectious" nature of spreading to all source that makes up a project. CDDL is more permissive in this - it simply forces you to keep the free bits that you got free, but doesn't expand to anything outside of the source files you got.
Re:License mismatch by NotSanguine · 2014-09-11 05:27 · Score: 1

There is "free as in beer" (usually both GPL and BSD). There is "free as in freedom" (BSD). And then there is "free as in free-range chickens" (GPL).
It amazes me that Milton was already involved in this conversation in 1649:

None can love freedom heartily, but good men; the rest love not freedom, but license.
--John Milton
I guess the question is, to which license was he referring?

--
No, no, you're not thinking; you're just being logical. --Niels Bohr
Re: License mismatch by wagnerrp · 2014-09-11 05:49 · Score: 1

Chances are if you're an enterprise user running enterprise Linux with a support contract, you're going to engineer in immediate needs, future expansion, and replacement when you purchase your server equipment. You're not going to throw something together and get yourself into a situation where you would need dynamic resizing of a stripe. That's more a concern for the home and small business user who may not have the funds to plan for expansion when building a server, but then that's not the market Sun was shooting for when designing ZFS.
Re: License mismatch by sjames · 2014-09-11 07:02 · Score: 1

I recently evaluated both btrfs and zfs for a new server. I like the features of btrfs and the overall design, but chose zfs in the end.
The problem is that btrfs has very poor behavior in the face of a failed disk in the array. It actually papered over the problem rather than doing the right thing and kicking it out of the pool.
Ideally, I would like for it to improve in that and grow the ability to evacuate a disk in an orderly manner.
Re: License mismatch by fnj · 2014-09-11 18:01 · Score: 1

Does btrfs' idea of RAID include the enormously improved features of raidz[23], or is it clearly on the roadmap?
Does btrfs support nested filesets, or is it clearly on the roadmap?
They're questions, not a challenge.
Re: License mismatch by Rich0 · 2014-09-11 23:16 · Score: 1

I'm not intimately familiar with all the details of raidz, but I believe what btrfs calls raid5/6 is roughly equivalent to raidz/2. It involves n+1/2 redundancy but with chunk-level allocation/striping instead of having it at the physical layer (so no raid hole, can have multiple raid modes on the same physical partitions, can operate with mixed-size media, etc). If anything it should be more flexible than raidz because you can add/remove individual drives at will, and convert a raid1 to raid5 in-place without rewriting anything (existing chunks stay raid1, new ones are written as raid5, unless you rebalance).
Of course, this is all roadmap. Raid5/6 is still experimental, and in particular it doesn't actually recover from a disk loss at this point, so I really do MEAN experimental.
From a VERY brief perusal of "datasets" on zfs they appear to be similar to btrfs subvolumes. In btrfs these certainly can be nested, however if you snapshot a subvolume the snapshot does NOT contain the nested subvolumes. That can be useful for some things, but I imagine it could also be less than useful for others - I had to design my use of subvolumes around this. For example, if /home is a subvolume and every user is a subvolume, then you could have per-user-dir quotas easily enough and per-user snapshots, but if you do a snapshot of /home chances are you won't have anything in it but a bunch of empty user directories.
Re: License mismatch by Rich0 · 2014-09-11 23:18 · Score: 1

Agree 100%. I had a similar conversation with a zfs proponent in my local linux user group. The zfs design makes a lot more sense when you think about enterprise storage. Btrfs is targeted as being more of an ext4 replacement, where you aren't managing groups of 5 disks in a unit with 100 disks.
Re: License mismatch by Rich0 · 2014-09-11 23:34 · Score: 1

Yup - disk failure isn't really implemented yet in btrfs. You get the behavior you describe for raid1 mode, and for the parity raid modes it doesn't handle recovery at all yet. I'd be reluctant to run it in a server I didn't babysit for that reason.
With mdadm if a drive fails I get messages to dmesg, and it has a daemon that will happily send mail/logs/etc wherever I want it to go. I think btrfs might output a line to dmesg but it would be VERY easy to miss a degraded array, and that is a problem.
But, I chalk that up to btrfs not being done yet. When it will be done is very much a matter of speculation.
Re: License mismatch by sjames · 2014-09-12 09:13 · Score: 1

Yes. That's the thing. I needed it then, not some future point.
I'll keep watching btrfs and see how it goes. It shows promise but it also shows immaturity right now.

Working well for me by zeigerpuppy · 2014-09-11 03:04 · Score: 4, Informative

I've been using ZFSonLinux for a year in production. No problems at all. It's my storage back end for Xen Virtual machines. Just make sure you use ECC RAM and a decent hard disk controller. Instant snapshots and ZFS send/receive functions are awesome, have reduced my backup times by an order of magnitude. I use a Debian Wheezy/Unstable hybrid.

Re: Working well for me by zeigerpuppy · 2014-09-11 03:16 · Score: 2

The technical descriptions I've read say that you absolutely should use ECC because ZFS will eventually hit a checksum mismatch. This could result in valid data being flagged as corrupt. ECC RAM is not much more expensive these days but you do need a mobo that supports it.
Re: Working well for me by zeigerpuppy · 2014-09-11 03:19 · Score: 1

One should also use SSDs with capacitors if they are backing a ZIL but the only things that are absolutely essential are decent HDD controllers (I've had good results with Adaptec SAS controllers, even using a mixed SAS/SATA drive set on separate channels)
Re: Working well for me by zeigerpuppy · 2014-09-11 03:32 · Score: 2

Good sane description here: http://ianhowson.com/do-you-re...
Re: Working well for me by zeigerpuppy · 2014-09-11 04:03 · Score: 1

But isn't that the whole point? ZFS is designed to avoid the most common failure modes but it relies on reducing the errors in the data it is using for checksumming. That's why ECC is important: RAM errors are the 2nd most common bit flip error and if that's your comparator, it needs to be accurate.
Re: Working well for me by Vesvvi · 2014-09-11 04:26 · Score: 1

Not all Adaptec controllers are supported by FreeBSD. It would be a "safer" choice to use LSI, since they work great in Linux and FreeBSD: that gives you the option to migrate your host OS should you desire.
Admittedly, if you're changing over that much then buying new controllers isn't a big deal, but I like to have the option of having the "reference" implementation of ZFS just a few minutes away.
Re:Working well for me by MightyYar · 2014-09-11 04:36 · Score: 1

If you decide to chance it, make sure you don't use the "scrub" functionality on ZFS. Scrub can cause memory errors to eat your pool like a cancer.
Or, just use ECC :)

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re: Working well for me by Rich0 · 2014-09-11 04:41 · Score: 1

But isn't that the whole point? ZFS is designed to avoid the most common failure modes but it relies on reducing the errors in the data it is using for checksumming. That's why ECC is important: RAM errors are the 2nd most common bit flip error and if that's your comparator, it needs to be accurate.
Sure, but what good is having the RAM store the checksum correctly if the CPU calculated it incorrectly, or if the CPU compares the correct checksum to the identical correct checksum and determines that they don't match?
This is just about diminishing returns and where to draw the line. There are lots of things that can go wrong. Believe it or not people use systems that actually detect and correct CPU logic errors, and I'm sure the people selling them could tell you how often they detect errors.
I'd seriously consider ECC RAM in my next system in any case, but you can't just compare the cost of the RAM - if it requires buying a different CPU or motherboard you have to look at all the costs and tradeoffs.
Re: Working well for me by Aaden42 · 2014-09-11 08:06 · Score: 1

ZFS on BSD, then Linux (same hardware, new OS) since about 2008. 19TB raw, about 14TB w/ RAID taken into account. Not using ECC since day one.
Nothing catastrophic has happened, and any issues with repairs made during ZFS scrubs have eventually proven to be failing drives, flaky cables, or cheap/lousy SATA backplanes. IE replacement of offending hardware caused rare intermittent CRC errors to cease completely (until the next non-RAM thing started acting up). Even with those errors, zero dataloss over the time frame.
ECC’s a nice feature, but your data won’t catch fire without it.
Re: Working well for me by Guspaz · 2014-09-11 09:12 · Score: 1

Intel's best kept secret is that many of Intel's cheapest processors support ECC (including most of the i3 series), and as such enable you to build some surprisingly low-cost low-power file servers.
Here's the list of Intel desktop CPUs that support ECC:
http://ark.intel.com/search/ad...
Looks like the MSRP starts at $64 or so. The downside is that you need a chipset that supports ECC too, and those are only server chipsets. Luckily, motherboards with one of those (like the Intel C222 chipset) start at ~$140 or so.
Slapping together a low-end server motherboard with an i3, some 8-drive HBAs, and a bunch of ECC RAM is a popular way to make a low-end file server.
Re:Working well for me by MightyYar · 2014-09-11 10:10 · Score: 1

I'd suggest some further reading.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re: Working well for me by Rich0 · 2014-09-11 13:35 · Score: 1

Well, you can believe whatever you want.
Real world experience has shown me one of those things is at least 3 orders of magnitude more likely than the other.
Guess which.
I think you missed the point. There will ALWAYS be some situation where you can gain another 9 in reliability by adding another zero or two to the price tag. I don't debate that ECC RAM is a more cost-effective solution than a hardware architecture that tolerates CPU errors.
Re: Working well for me by dbIII · 2014-09-11 14:30 · Score: 1

There's different "support" - for instance there's some LSI controllers that FreeBSD is not happy to boot from but has no problems with anything that shows up on the adapter after boot.
Re: Working well for me by allo · 2014-09-13 06:58 · Score: 1

why is there no "repair file" command, which calculates a correct checksum to the broken data?

Re:rsync causes lockups? by QuietLagoon · 2014-09-11 03:06 · Score: 2

Been using rsync on ZFS for many months (FreeBSD 10.0). No issues whatsoever.

I agree... by sarkeizen · 2014-09-11 03:08 · Score: 1

I've been using this for a production fileserver for about a year and a half. Prior to that I was using ZFS on FUSE for about a year.

The only minor negative things I can say are that when you do have some odd kind of failure, ZFS (and this may be the case on BSD and Solaris) gives you some pretty scary messages like "Please recover from backup", but usually exporting and importing the FS brings it back, at least in a degraded state. My other caveat might just be my linux distro, but I've often had problems with older versions of the libraries hanging around and causing the command line tools to break.

Re:I agree... by Vesvvi · 2014-09-11 04:31 · Score: 1

Maybe your ZIL comments are specific to Linux? It used to be the case in FreeBSD that you had to have the ZIL present to import, and a dead ZIL was a very big problem, but that was many versions ago (~3-4 years?). I personally went through this when I had a ZIL die and the pool was present but unusable. I was able to successfully perform a zpool version upgrade on the "dead" pool, after which I was able to export it and re-import it as functional without the ZIL.
Note that this was NOT a recommended sequence of operations, and I wouldn't suggest it unless you have no choice.
Re:I agree... by Aaden42 · 2014-09-11 09:12 · Score: 1

Inability to import after a damaged/missing ZIL is an old problem of ZoL (and ZFS in general) that’s (long) since been fixed. ZFS on-disk version 19 and newer support removal of a faulted / missing secondary log device. ZFSv19 was released in OpenSolaris b125 (sometime in 2009 - can’t pin down an exact date). I can’t find an easy reference of ZoL versions versus their supported ZFS versions, but the earliest release on the zfsonlinux.org home page (0.6.1 released 27-MAR-2013) supports SLOG removal.
It takes some debugging steps, but it’s doable without loss of a pool where missing ZIL was the only problem. Additional problems may of course prevent import, but if the ZIL is your only issue, it’s recoverable with current ZoL. Likewise for missing L2ARC devices.
As you mentioned you run the risk of losing data that was “guaranteed” written but only in the ZIL at the time the system lost power or crashed. If the pool was exported cleanly, there would be no uncommitted data in the ZIL, so this would only be a possibility after power loss or system crash.

Re:rsync causes lockups? by NFN_NLN · 2014-09-11 03:09 · Score: 3, Informative

Is the target not a zfs filesystem as well? If so zfs send/recv allows for replication and handles deltas at the filesystem level. It should be more efficient.
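As a sketch of what "handles deltas" looks like in practice: an incremental send names an older snapshot that both sides already have and a newer one, and only the difference is streamed. The dataset, snapshot, and host names here are hypothetical, and the helper only builds the command string.

```python
import shlex

def incremental_send(dataset: str, prev_snap: str, new_snap: str,
                     remote_host: str, target: str) -> str:
    """Build a pipeline that sends only the changes between two snapshots
    of `dataset` to `target` on `remote_host`. Nothing is executed here."""
    q = shlex.quote
    older = f"{dataset}@{prev_snap}"
    newer = f"{dataset}@{new_snap}"
    # `zfs send -i <older> <newer>` emits just the delta between the snapshots
    return (f"zfs send -i {q(older)} {q(newer)} "
            f"| ssh {q(remote_host)} zfs receive {q(target)}")

print(incremental_send("tank/backups", "monday", "tuesday", "backuphost", "tank/backups"))
```

Unlike rsync, nothing has to walk the directory tree and compare files; the filesystem already knows which blocks changed between the two snapshots.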

Re:rsync causes lockups? by NFN_NLN · 2014-09-11 03:10 · Score: 2

http://docs.oracle.com/cd/E192...

No thanks, I'll stick with ReiserFS by Anonymous Coward · 2014-09-11 03:15 · Score: 2, Funny

It's a killer file system. Once you've used it, you won't be able to leave it.

Re:No thanks, I'll stick with ReiserFS by rssrss · 2014-09-11 09:29 · Score: 2

Groan

--
In the land of the blind, the one-eyed man is king.
Re:No thanks, I'll stick with ReiserFS by Tyler+Durden · 2014-09-12 07:57 · Score: 1

Perhaps for his encore he'll move on to posting jokes about airplanes and twin towers for real giggles.
"I have to leave early tonight, I have to fly out to L.A. I couldn't get a direct flight, I have to make a stop at the Empire State Building." -Gilbert Gottfried

--
Happy people make bad consumers.

Proformance by wisnoskij · 2014-09-11 03:15 · Score: 1

So how much space do the checksums take up? How much does all this behind the scenes work slow down the data retrieval/writing?
Is this something that a normal consumer would use for their main storage?

--
Troll is not a replacement for I disagree.

Re:Proformance by McKing · 2014-09-11 03:45 · Score: 1

The checksums don't really take up more physical overhead than a more traditional RAID + LVM setup, and performance is equivalent in my experience (albeit on Solaris 10 and not Linux). There is also the ability to turn on compression, which trades a little bit of CPU overhead for increased disk I/O performance. On a lot of workloads the difference can be dramatic.
If you are already comfortable with RAID + LVM, then I would wholeheartedly recommend ZFS for your main workstation. I would also recommend taking a look at FreeNAS if you are looking at building a network storage device. Snapshots, replication, ease of management are all ZFS strong points and ZFS is one of the things that I miss most about Solaris before Oracle bought them and priced themselves out of our datacenter.

--
If only "common" sense was actually that common...
Re:Proformance by brambus · 2014-09-11 04:00 · Score: 1

The overhead is barely there at all. I've measured the performance of the default fletcher4 checksum on a modest 2GHz Core 2 CPU and it comes to around 4GB/s/core. Now given that most CPUs now come with 4 or more cores, in order to get the checksum to be 10% of CPU overhead, you'd have to be doing around 1.2GB/s of I/O. Needless to say, you're not ever going to get that even for fairly high-performance boxes.
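The back-of-envelope estimate above can be written out as a tiny calculation. This is a sketch using only the figures quoted in the comment (4 GB/s per core, a 10% CPU budget); the function name is made up for illustration.

```python
def io_rate_for_cpu_share(per_core_gb_s: float, cores: int, cpu_share: float) -> float:
    """I/O rate in GB/s at which checksumming would consume `cpu_share`
    of the machine's total CPU, given a measured per-core checksum rate."""
    return per_core_gb_s * cores * cpu_share

for cores in (3, 4):
    rate = io_rate_for_cpu_share(4.0, cores, 0.10)
    print(f"{cores} cores: {rate:.1f} GB/s")
```

With exactly 4 cores this gives 1.6 GB/s, and with 3 cores' worth it gives 1.2 GB/s, so the quoted "around 1.2GB/s" is in the same ballpark either way.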
Re:Proformance by tlhIngan · 2014-09-11 04:14 · Score: 1

The overhead is barely there at all. I've measured the performance of the default fletcher4 checksum on a modest 2GHz Core 2 CPU and it comes to around 4GB/s/core. Now given that most CPUs now come with 4 or more cores, in order to get the checksum to be 10% of CPU overhead, you'd have to be doing around 1.2GB/s of I/O. Needless to say, you're not ever going to get that even for fairly high-performance boxes.
Not really. A SATA3 SSD can push 550MB/sec both ways (limited by SATA3 itself) nowadays - just your standard Samsung EVO or later revision drive. Consumer level PCIe SSDs like what Apple provides already do 750MB/sec (and it isn't the fastest - just consumer level). Stick two of those (and ZFS might not be a bad idea for SSDs) and I'm assuming ZFS can access data on both drives (or more) to achieve high throughput. Two SATA3 drives is 1.1GB/sec full bore, while two PCIe ones full bore are 1.5GB/sec.
I believe the Fusion IO ones (Wozniak) do 1.2GB/sec on a bad day already.
Granted, spinning rust drives will need way more since most can't pull faster than about 200MB/sec off the platters.
Re:Proformance by brambus · 2014-09-11 05:05 · Score: 1

Keep perspective. Are you really going to build a box like that with just one 2 GHz quad-core CPU?
I have pushed 4GB/s through a SAS SSD array on ZFS, but even so I maxed out on other stuff way before the CPU, much less checksumming, ever began to be an issue (e.g. had to go through two LSI SAS 9200-8e HBAs, because one maxes out the PCI-e 2.0 x8 lanes; with two HBAs I maxed out on the two 6G SAS links to my JBOD). That's the point of my post. I've yet to see a system which is constrained by the checksumming in any meaningful way.

Be sure to use ECC RAM on home set-ups by Anonymous Coward · 2014-09-11 03:17 · Score: 1

https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/

It's really too bad laptops still don't offer ECC RAM. I'm willing to pay a little more for an ECC-capable motherboard.

Re:Be sure to use ECC RAM on home set-ups by rubycodez · 2014-09-11 03:31 · Score: 1

Your linked article doesn't really prove that one should use ECC, it speaks of studies showing wide range of errors, from "roughly one bit error, per hour, per gigabyte of memory to one bit error, per millennium, per gigabyte of memory"
Then it takes google's study as gospel truth, "25,000 to 70,000 errors per billion device hours per megabit (about 2.5–7 x 10^(-11) error/bit-hour) (i.e. about 5 single bit errors in 8 Gigabytes of RAM per hour using the top-end error rate), and more than 8% of DIMM memory modules affected by errors per year."
Maybe google has a problem the rest of the world doesn't....
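For anyone who wants to check the quoted figures, here is a small sketch of the unit conversion. It takes only the numbers from the quote (25,000 to 70,000 errors per billion device hours per megabit, 8 gigabytes of RAM); treating a megabit as 10^6 bits and a gigabyte as 2^30 bytes is an assumption, and the function names are illustrative.

```python
def per_bit_hour(errors_per_billion_hours_per_megabit: float) -> float:
    """Convert 'errors per 10^9 device hours per megabit' to errors per bit-hour."""
    return errors_per_billion_hours_per_megabit / 1e9 / 1e6

def errors_per_hour(rate_per_bit_hour: float, gib: float) -> float:
    """Expected errors per hour across `gib` gibibytes of RAM."""
    bits = gib * 2**30 * 8
    return rate_per_bit_hour * bits

low, high = per_bit_hour(25_000), per_bit_hour(70_000)
print(f"{low:.1e} to {high:.1e} errors per bit-hour")
print(f"{errors_per_hour(low, 8):.1f} to {errors_per_hour(high, 8):.1f} errors per hour in 8 GiB")
```

The top-end rate lands just under 5 errors per hour for 8 GiB, which matches the "about 5 single bit errors" in the quote. Whether that rate applies to a typical home machine is exactly what is being disputed here.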
Re:Be sure to use ECC RAM on home set-ups by Anonymous Coward · 2014-09-11 03:44 · Score: 3, Interesting

No... their numbers are about right.
And the numbers go back to times before Google existed.
Even on the old Cray Y systems, there was roughly one single bit error every day, corrected by ECC. Every week or so there would be roughly 1 double bit error, recovered by data reload...
The only times the memory got disabled was when double bit errors were NOT recovered OR the error rate exceeded 10 (from my memory, number could be higher) in a day. The hardware itself would remap memory so that the system would keep running until the CE could run diagnostics on it and either replace it or restore it to use as an identified transient error.
Re:Be sure to use ECC RAM on home set-ups by brm · 2014-09-12 02:49 · Score: 1

What prevents ZFS's "parity, mirroring, checksums and other mechanisms to protect your data" from being applied to blocks stored in RAM as well as on disk? Sure, ECC in hardware seems "free" in performance but there is some small latency overhead as well as a huge price overhead. At the system level it may be more effective in price/performance to be able to use more standard hardware and throw a bit of software at the problem.
Re:Be sure to use ECC RAM on home set-ups by rubycodez · 2014-09-12 09:54 · Score: 1

The numbers are about right according to which source? I will point out the article itself mentions studies that found the error rate a thousand times less than Google's.
The Cray Y system was ECC SRAM with error rate of 6E-13 upset/bit-hour, but probably not relevant to discussion about today's typical memory.

Re:Technobabble... by rubycodez · 2014-09-11 03:20 · Score: 1

or you could spend 20 minutes reading about it on their web page instead of relying on slashdot summaries, you lazy git

Magic by N7DR · 2014-09-11 03:20 · Score: 2

I've been using ZFS on Linux for about a year. I can summarise my position on the experience with two words: it's magic.

It is still tricky to run one's root system off ZFS (at least on Debian). That, I think, is for those who are brave and have the time to deal with issues that might arise following updates. But for non-root filesystems, ZFS is, as I said, magic. It's fast, reliable, caches intelligently, adaptable to a large variety of mirror/striping/RAID configurations, snapshots with incredible efficiency, and simply works as advertised.

Someone once (before the port to other OSes) said that ZFS was Solaris' "killer app". Having used it in production for a year, I can understand why they said that.

Re:Magic by TheGratefulNet · 2014-09-11 03:35 · Score: 1

I ran zfs on freebsd for a few years but gave up on it. at one time, I did a cvsup (like an apt-get update, sort of, on bsd) and it updated zfs code, updated a disk format encoding but you could not revert it! if I had to boot an older version of the o/s (like, before the cvsup) the disk was not readable! that was a showstopper for me and a design style that I object to, VERY MUCH. makes support a nightmare.
I've never seen this in linux with jfs, xfs, ext*fs, even reiser (remember that?) never screwed me like this before.
the system also was very ram hungry and cpu hungry.
I'm still not convinced its good for anything but serious users who have a GOOD backup/restore plan. updating a disk image format and not allowing n-1 version of o/s to read it is a huge design mistake and I'm not sure I understand the reasoning behind it, but until that is changed, I won't run zfs.

--
"It is now safe to switch off your computer."
Re:Magic by Rich0 · 2014-09-11 03:59 · Score: 1

adaptable to a large variety of mirror/striping/RAID configurations
"Adaptable" is a bit of a stretch here. If you set up a RAID on ZFS, you can't change it, you can only replace individual disks within it, or destroy the entire array.
That isn't a big deal if you're talking about a ZFS filesystem with a very large number of drives, but it is a big limitation for a small ZFS filesystem. That is, if I have 300 disks in 60 arrays of 5 1TB disks each, and I want to move to 3TB disks, then I just need to add 5 3TB disks, turn them into an array, add them to the filesystem, then remove 5 1TB disks, and then keep repeating this. On the other hand, if I have a motherboard with 5 SATA ports and I have 5 1TB drives in an array, then there is no easy way to replace those with 5 3TB drives one at a time and actually get use out of the extra space. You can do that with mdadm in any of its raid modes, and with btrfs in any of the raid modes that actually work (basically raid 0/1 now - it will work with raid5/6/etc but those aren't production ready).
Granted, a lot of people who are interested in ZFS are interested in using it for SAN/NAS/etc and this isn't likely to be an issue for them. For the average slashdotter with a few TB of drives in a RAID, it could be a problem.
Re: Magic by zeigerpuppy · 2014-09-11 04:09 · Score: 1

Have another look at the docs, you can upgrade capacity one disk at a time but, of course, throughput suffers because you now have asymmetric storage. You can also change level to some degree by grouping sets (e.g. turning a mirror into a striped mirror)
Re:Magic by brambus · 2014-09-11 04:18 · Score: 2

it updated zfs code, updated a disk format encoding but you could not revert it
You can thank your package maintainer for this. ZFS never ever ever upgrades the on-disk format silently. You always have to do a manual "zpool upgrade" to do it. It'll tell you when a pool's format is out of date in "zpool status", but it'll never do the upgrade by itself.

updating a disk image format and not allowing n-1 version of o/s to read it is a huge design mistake and I'm not sure I understand the reasoning behind it, but until that is changed, I won't run zfs
Again, this is not ZFS' fault, it's your package maintainer for auto-upgrading all your imported zpools. ZFS never does this by itself.
Re: Magic by Rich0 · 2014-09-11 04:32 · Score: 1

Do you have a link? The last time I looked into this, you could not add a disk to a raid-z. You could add disks to a zpool, or add another raid-z to a zpool. However, a raid-z was basically immutable. This is in contrast to mdadm where you can add/remove individual disks from a raid5.
Google seems to suggest that this has not changed, however I'd certainly be interested in whether this is the case. The last time I chatted with somebody who was using ZFS in a big way they indicated that this was a limitation. He was using it for very large storage systems, and I could see how many of the ZFS features made it much more appropriate in these kinds of situations, especially with things like write intent log on separate media, having many independent storage units which are individually redundant but otherwise behaving like a big array of disks (which helps to distribute IO which reduces some of the penalties with RAID), etc. I'm more familiar with btrfs and it seems to be evolving more towards being an ext4 replacement, where smaller arrays are the norm, etc. That isn't to say that many of the features on either aren't potentially useful for both.
Re:Magic by MightyYar · 2014-09-11 05:18 · Score: 1

then there is no easy way to replace those with 5 3TB drives one at a time and actually get use out of the extra space.
It's not THAT bad. You do this:
1. Put new disk in usb cradle.
2. Run 'zpool replace', swapping new disk for old disk.
3. Take the new disk and physically replace the old disk.
4. Repeat 1-3 for each new disk until you have the whole array running at the new capacity.
5. If autoexpand is not enabled, run the 'zpool online' command with the '-e' flag to use the new capacity.
I've only used FreeBSD, not Linux - but I presume this would work so long as you are giving ZFS the whole disk. ZFS does not care which interface disks are attached to... you can take them all out and shuffle them around and it will map them correctly.
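The steps above can be sketched as a command plan. This is a hedged illustration only: the pool name `tank` and the device names are hypothetical, the helper just returns argument lists without running anything, and in real use each replace has to finish resilvering (check `zpool status`) before the next disk is swapped.

```python
def upgrade_plan(pool: str, swaps: list, autoexpand: bool = False) -> list:
    """Return zpool invocations for replacing disks one at a time.

    `swaps` is a list of (old_device, new_device) pairs. If the pool's
    autoexpand property is off, an explicit expand step is added per disk.
    """
    cmds = []
    for old, new in swaps:
        # Step 2: resilver onto the new disk in place of the old one
        cmds.append(["zpool", "replace", pool, old, new])
    if not autoexpand:
        for _, new in swaps:
            # Step 5: grow onto the larger disk once all members are replaced
            cmds.append(["zpool", "online", "-e", pool, new])
    return cmds

for cmd in upgrade_plan("tank", [("sda", "sdf"), ("sdb", "sdg")]):
    print(" ".join(cmd))
```

The extra capacity only appears after the last small disk is gone, since the array is sized by its smallest member.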

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:Magic by Rich0 · 2014-09-11 05:22 · Score: 1

True. Now, what if you wanted to add 1 3TB drive to the existing 3x1TB RAID-Z, so that instead of having 2TB of usable space it now has 3TB of usable space?
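The capacity arithmetic behind that question can be sketched as follows. It uses the usual rule of thumb that every member of a parity array counts as the smallest disk and that the parity disks' worth of space is lost; it ignores metadata and padding overhead, and the function name is made up for illustration.

```python
def raidz_usable_tb(disk_sizes_tb: list, parity: int = 1) -> float:
    """Approximate usable space of a single-parity-style array:
    (number of disks - parity disks) * size of the smallest disk."""
    if len(disk_sizes_tb) <= parity:
        raise ValueError("need more disks than parity devices")
    return (len(disk_sizes_tb) - parity) * min(disk_sizes_tb)

print(raidz_usable_tb([1, 1, 1]))     # 2 (TB usable today)
print(raidz_usable_tb([1, 1, 1, 3]))  # 3 (TB if the array could be widened by one disk)
```

The second figure is what widening the array would give; the point of the question is that a raid-z cannot be reshaped this way, while mdadm can.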
Re:Magic by MobyDisk · 2014-09-11 05:30 · Score: 1

And what is the difference to the end user?

Knowing who to blame is important for the end user. Don't call Apple when your cell phone reception is bad. Don't blame ZFS because the vendor you chose for your distro pre-configured it in stupid way. The end-user needs to understand who to blame so they know which vendor to avoid next time. In my reception example, the end-user must decide if they want to switch from Apple to Samsung, or if they want to switch from AT&T to Verizon. This is one of the reasons non-techies tend to make the same technology mistake over-and-over again. I have heard people say things like: "I can never get my email to work in this damn Dell! But it works fine on my HP at work!" I try to explain to them that the problem is their email client software is configured wrong at home, not that Dell can't make PCs that support email.
AC: Can you state what distro you were using when this happened?
Brambus: Can you recommend a distro that does not do this?
Re: Magic by ByTor-2112 · 2014-09-11 07:55 · Score: 1

OP is not talking about adding to a pool but replacing drives (zpool replace is a one-for-one replacement). You are correct in that one cannot add to an existing vdev, only replace drives. But you can add additional vdevs to a pool without a problem. They don't even have to have the same redundancy level (stupid idea).
Re:Magic by ByTor-2112 · 2014-09-11 07:59 · Score: 1

You could add the 3TB drive to the pool, it just wouldn't be a part of the RAID-Z vdev and thus have limited redundancy. It would be RAID-0. But you could survive one failure in the original vdev. Any additional failure or loss of the new drive would result in total loss of data.
So... Backup, destroy pool, restore. Think ahead better the next time. Really, a good SAS JBOD controller is only a couple hundred bucks. Spend the money and plan your storage accordingly.
Re:Magic by Aaden42 · 2014-09-11 09:20 · Score: 1

ZFS on Gentoo does not automatically update pool or filesystem versions after a ZFS driver update.
I suspect this is also true of the vast majority of distros, but there’s one example for you anyway.
Re:Magic by h4ck7h3p14n37 · 2014-09-11 09:33 · Score: 1

The parent did this to themselves. You use cvsup to track code changes when you're building a system from source. It is not used to perform upgrades of installed ports; that's what portupgrade is for.
It sounds like he was doing an in-place build and nuked his drive. If you care about your data you shouldn't take the risk of attempting this.
Re: Magic by dnavid · 2014-09-11 09:40 · Score: 1

Do you have a link? The last time I looked into this, you could not add a disk to a raid-z. You could add disks to a zpool, or add another raid-z to a zpool. However, a raid-z was basically immutable. This is in contrast to mdadm where you can add/remove individual disks from a raid5.
Google seems to suggest that this has not changed, however I'd certainly be interested in whether this is the case. The last time I chatted with somebody who was using ZFS in a big way they indicated that this was a limitation. He was using it for very large storage systems, and I could see how many of the ZFS features made it much more appropriate in these kinds of situations, especially with things like write intent log on separate media, having many independent storage units which are individually redundant but otherwise behaving like a big array of disks (which helps to distribute IO which reduces some of the penalties with RAID), etc. I'm more familiar with btrfs and it seems to be evolving more towards being an ext4 replacement, where smaller arrays are the norm, etc. That isn't to say that many of the features on either aren't potentially useful for both.
MightyYar's process isn't adding a disk to a RAID-Z, it's addressing your original question of how to replace 1TB drives with 3TB drives. His process uses an external USB drive to kickstart the process, adding a USB drive, telling ZFS to logically replace one of the older drives with the newer (bigger) USB drive, letting it rebuild with that USB drive, and when the old drive has been replaced in the array with the USB drive, removing that old drive and replacing it physically with a new 3TB drive, then asking the array to rebuild again. You don't even need the USB drive; you could replace the disks in the array, but unless you are at RAIDZ2 or higher, for a long time during the process you would not have drive redundancy. MightyYar's process prevents having a case where you are running without at least n+1 redundancy in the array. Once all the original drives are replaced physically with 3TB drives, you can ask ZFS to expand the array to use all the space.
Re:Magic by MightyYar · 2014-09-11 10:06 · Score: 1

Not something I would attempt. Personally, I accept this limitation and always add drives in pairs. Upgrading capacity then becomes a 2-drive cost instead of a number_of_disks_in_raid cost.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:Magic by Carnildo · 2014-09-11 11:09 · Score: 1

It's fast, reliable, caches intelligently, adaptable to a large variety of mirror/striping/RAID configurations, snapshots with incredible efficiency, and simply works as advertised.
Can I:
1) Add a disk to a RAID array (or whatever ZFS calls it) and reshape the array to take advantage of the space?
2) Run with less than 1 GB of RAM per TB of disk space?
3) Pull a disk that's suffered a transient failure, check it, plug it back in, and have the array write only the portions of the disk that changed, rather than doing a full rebuild?
The last time I looked at using ZFS for my storage server, #1 and #2 were deal-breakers. #3 was added when I expanded the server with a bunch of Seagate hard drives -- md's write-intent bitmaps reduced typical rebuild times from around a week to less than half an hour.

--
"They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
Re:Magic by fnj · 2014-09-11 16:56 · Score: 1

I ran zfs on freebsd for a few years but gave up on it. at one time, I did a cvsup (like an apt-get update, sort of, on bsd) and it updated zfs code, updated a disk format encoding but you could not revert it!
That must have been a while ago, because cvsup has been obsolete for years. Furthermore, cvsup never touched system code; only userland ports. The way you update system code is freebsd-update. I believe there was a lot more to your disaster than you are saying. And I bet ZFS on Linux either didn't exist, or was very primitive and immature, at that time. Sort of like btrfs is now (btrfs is the only native linux filesystem that is even remotely comparable to ZFS in some, but far from all, ways).
That said, it's true that updating FreeBSD should never be a heedless operation. You need a lot more insight and attention to detail to work with FreeBSD than linux. That's the tradeoff for getting a true Unix system.
BTW, most claims that ZFS is extremely RAM hungry stem from users that don't know what they are doing. For example, deduplication virtually never should be enabled for most use. Dedupe can eat RAM like a blue whale. But without dedup, any realistic system with 16 GB+ is fine. Skimping on RAM is just silly, anyway.
Re: Magic by fnj · 2014-09-11 17:46 · Score: 1

You're exactly right as far as I know. You would have to build a new raidz with larger drives or more drives, and with both old and new pools online, zfs send -> zfs receive all the data from old to new; then you could remove the old raidz and throw away the ridiculous tiny obsolete drives.
BTW, a bit of terminology. Zpool is the top level (root) ZFS structure for *any* use of ZFS, even one which only uses a single drive (degenerate case - which actually "works" just fine and dandy, and is a great improvement over ext4fs because of numerous ZFS features such as snapshots and block checksumming). You can have any number of independent zpools. Within a zpool you have vdev components.
From the bottom up, a vdev can be a single drive or partition, or multiple sub vdevs in the form of a logical concatenation of vdevs (like RAID0), a mirror (sort of like RAID1 but better), a raidz (single parity; sort of like RAID5 but considerably better), a raidz2 (double parity; sort of like RAID6 but considerably better), or a raidz3 (triple parity, beyond RAID6).
A zpool is then either a single vdev (possibly nothing but a single drive or partition), or multiple vdevs combined in exactly the same way.
Thus you can have a tree of arbitrary complexity, for example a mirror of mirrors of mirrors of mirrors of ...
Or a raidz of raidzs. Or a mirror of raidz2s. Or a raidz3 of mirrors. Or, you name it.
Finally, within a zpool you can create, in a completely ad hoc manner, any combination of zfs filesets (recursively if you wish). This is orthogonal to structure of vdevs and sub vdevs. Each such fileset can independently grow to any size, limited only by the size of the zpool it lies within.
And the stark reality. Once you create a raidz[23], the number of components and the utilized size of the components are forever set in stone for that particular raidz[23]. Mirrors, on the other hand, can have extra elements added (from 2-way to 3-way to arbitrarily many copies). And, if you know how, even the sizes of the individual elements can be grown.
Re: Magic by Rich0 · 2014-09-11 23:40 · Score: 1

I understand that. If you have 25 drives that is probably all the flexibility you need. On the other hand, if you have 3x1TB drives (2TB usable, 1TB parity), and you'd like to add 1 more 1TB drive so as to have 3TB usable, 1TB parity, then being able to add additional vdevs doesn't buy you anything, unless you want to make that 1TB drive non-redundant. You could add 2x1TB in a mirrored configuration and get 1TB space, but if you could reshape the array you'd get that 1TB space for the cost of a single drive.
For the more casual user, that flexibility matters more. If you're running a SAN with 50 drives in it, then you're not going to have one big raid-z across all of them for performance reasons if nothing else. You could have 10x5 disk arrays, and adding another 5-disk array is easy.
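To put numbers on the 3x1TB example, here is a small Python sketch. It assumes a simplified capacity model (single-parity raidz gives (disks - 1) times the smallest disk, a mirror gives the size of its smallest disk) and the helper names are hypothetical. The "reshape" row is the wished-for operation, not something raidz offered at the time of this thread.

```python
def raidz1_usable_tb(sizes_tb):
    """Single-parity raidz: one disk's worth of space goes to parity."""
    return (len(sizes_tb) - 1) * min(sizes_tb)


def mirror_usable_tb(sizes_tb):
    """A mirror holds one copy's worth of data, limited by its smallest disk."""
    return min(sizes_tb)


existing = raidz1_usable_tb([1, 1, 1])  # the current 3x1TB vdev

options = {
    # Hypothetical: grow the same vdev from 3 to 4 disks.
    "reshape raidz": {"new_drives": 1,
                      "usable_tb": raidz1_usable_tb([1, 1, 1, 1])},
    # What the pool does allow: add a second, mirrored vdev.
    "add mirrored vdev": {"new_drives": 2,
                          "usable_tb": existing + mirror_usable_tb([1, 1])},
}

for name, result in options.items():
    print(name, result)
```

Both routes end at 3TB usable with redundancy, but the mirrored-vdev route needs two new drives to get there instead of one.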
Re:Magic by Rich0 · 2014-09-11 23:46 · Score: 1

A couple hundred bucks is more than the cost of all the hard drives I own. :)
I think that is where the gap lies. The designers of ZFS are marketing towards people who manage at least dozens of disks, and they really don't need the features I'm talking about. On the other hand, if you own a grand total of 3x1TB drives and you want to add 1TB of storage, and you don't own a tape drive or any other spare media, then backup, destroy, restore isn't a viable option.
I'm running full backups only because I'm experimenting with btrfs and don't trust the filesystem to not eat my data. However, I don't plan to do that for the long-term and at some point I'll probably move my backup drives into my storage array. At that point most of my data won't have backups at all. Everything I REALLY care about is backed up offsite daily, but it isn't more than a few GB of storage. If I lose a few hundred hours of MythTV video it isn't the end of the world, but that doesn't mean that I want to wipe my system every time I want to add a disk.
The whole point of btrfs (or mdadm for that matter) is that I don't have to plan better next time.
Re: Magic by Rich0 · 2014-09-11 23:48 · Score: 1

Agree - I phrased my original question poorly. My point was that raidz was not as flexible as the roadmap raid5 support for btrfs (which behaves like raidz, not like raid5 in zfs). I'm interested in being able to add/remove individual drives to a parity array.
Re:Magic by Rich0 · 2014-09-11 23:51 · Score: 1

Not something I would attempt. Personally, I accept this limitation and always add drives in pairs. Upgrading capacity then becomes a 2-drive cost instead of a number_of_disks_in_raid cost.
I've never replaced all my disks in a raid at the same time. When I ran mdadm I would just partition the disks so that I was running multiple raid5s on each drive (so if I had 3x1TB and 2x3TB drives I'd have 6TB of usable space - 4x1TB+1x2TB). With btrfs I'd just put them all in an array and let btrfs figure out how to use it once raid5 mode was mature (actually, I should look into how btrfs will handle mixed sizes in raid5 mode - it generally does the right thing in raid1, but I suspect it will need help in parity modes).
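The 6TB figure can be reproduced with a short Python sketch of that layered partitioning scheme. The assumptions are mine: each "layer" spans every drive that is at least that large, a layer with three or more members becomes a raid5, a layer with exactly two becomes a mirror, and a layer with a single member is left out because it has no redundancy. The function name is hypothetical.

```python
def layered_usable_tb(disk_sizes_tb):
    """Usable space when mixed-size drives are sliced into equal-size layers."""
    sizes = sorted(disk_sizes_tb)
    usable = 0
    previous = 0
    for index, size in enumerate(sizes):
        thickness = size - previous
        if thickness == 0:
            continue  # same size as the drive before it, already counted
        members = len(sizes) - index  # drives large enough to hold this layer
        if members >= 3:
            usable += (members - 1) * thickness  # raid5 across the layer
        elif members == 2:
            usable += thickness  # mirror across the layer
        # A single remaining drive contributes no redundant space.
        previous = size
    return usable


print(layered_usable_tb([1, 1, 1, 3, 3]))
```

For 3x1TB plus 2x3TB this gives a 5-way raid5 of 1TB slices (4TB) plus a mirror of the two leftover 2TB slices (2TB).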
Re:Magic by MightyYar · 2014-09-12 01:18 · Score: 1

When I ran mdadm I would just partition the disks so that I was running multiple raid5s on each drive (so if I had 3x1TB and 2x3TB drives I'd have 6TB of usable space - 4x1TB+1x2TB).
Yes, you can do this with zfs as well, but you need to be very, very careful or you won't have the redundancy that you think you do. There are crazy partition schemes that can let you do Drobo-ish things - but they get so complicated that you need to keep track of them in something like an Excel spreadsheet. :)
Besides, zfs seems to like having the entire drive.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:Magic by Rich0 · 2014-09-12 01:24 · Score: 1

With btrfs I can do this while letting it have the entire drive - there are no partitions involved. You just feed it drives, and it maximizes space use. Btrfs does mirroring at the chunk level (I forget how large a chunk is but it is hundreds of MB to GB size), and not at the drive/etc level. This is why it is a lot more flexible - imagine that vdevs aren't a collection of drives, but that each vdev is maybe a few GB of data that is spread across multiple drives, and drives are just collections of hundreds/thousands of parts of vdevs. You could do this on zfs if you divided each drive into 1000 partitions and made 1000 vdevs out of each drive, and then you'd probably have a lot more flexibility about moving data around.
At the enterprise level that granularity probably isn't needed. On the other hand, at the workstation level it is a lot more useful.
Re:Magic by MightyYar · 2014-09-12 01:44 · Score: 1

btrfs sounds very interesting. It was not ready for prime time when I setup my current box, which is why I chose zfs instead. I'll have to try to murder it in a VM :)
Can you set btrfs to use arbitrary block devices or files? One of the things that made it easy to screw with zfs was its ability to do so. I was able to set up a VM and do random writes to the "drives" it was using to see how it would respond. Anyway, to my surprise btrfs seems production-ready at this time so I'll have to play with it.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:Magic by Rich0 · 2014-09-12 05:36 · Score: 1

Well, I'd take "production-ready" with a heavy grain of salt. If you stick to the features that aren't too new it is pretty decent, but the 3.15 kernel series had issues with deadlocks when using compress=lzo (just fixed in 3.16.2), and while that sort of thing isn't too common it happens. The checksumming does provide a pretty high level of data integrity - if you have issues most likely it will be more about hangs/crashes/etc, and not so much about major loss of data. The one big gaping hole in btrfs is that it does not behave well when space gets tight - it can be tricky to recover if the filesystem runs out of space to allocate metadata (since it is COW, even deleting files requires empty metadata space to function).
However, it is certainly well-worth playing around with, especially if you're interested in storage (and I'd certainly say the same about zfs). The box I'm typing this on uses a non-redundant btrfs root partition on an SSD (btrfs handles SSD well), and a 5-disk RAID1 on btrfs for everything else. I rsnapshot all of that stuff to ext4 daily on another drive, so I'm not too concerned about issues. I'm doing this half for the learning, but it is REALLY nice to be able to just snapshot a container/etc before I do an update on it (as with zfs snapshots on btrfs take almost zero time), and it is also nice to have my cp aliased with --reflink=auto - I'm not sure if zfs on linux supports that yet but reflinks on linux are basically a file-level COW snapshot (it isn't quite as fast as a regular snapshot since all the file metadata for the file needs to be copied if it has multiple extents - with a subvolume there is a pointer into the b-tree that allows a snapshot to be made by copying one record and incrementing the reference count of the record it points to - kind of like making two tags for the same commit in git).
In theory I'd get much better performance if I did my backups with send/receive, but I use rsnapshot so that my backups don't depend on any btrfs magic to work.
You can definitely point btrfs at a loopback device. That is an easy way to get into it. Ages ago I actually got a kernel panic when trying to do this with an ext4 conversion, but I think those issues are ancient history.
Re: Magic by dnavid · 2014-09-12 09:30 · Score: 1

Agree - I phrased my original question poorly. My point was that raidz was not as flexible as the roadmap raid5 support for btrfs (which behaves like raidz, not like raid5 in zfs). I'm interested in being able to add/remove individual drives to a parity array.
Although people ask for it often, I'm unaware of that feature being on anyone's implementation roadmap. Part of the problem I think is that there's a difference between supporting a feature, and having that feature be problematic to use. It's unclear to me how useful btrfs rebalance is in RAID5/6 arrays with large drives. It could impact performance enough to make it annoying to use in the general case. That's one of the reasons why even hardware RAID5/6 is not used as much in servers with high performance requirements. The cost to rebuild makes it prohibitive to recover from a drive failure.
Even for a home array, I'm more inclined to use mirroring than RAIDX, and ZFS does allow you to add mirror pairs to pools composed of mirrored vdevs. In other words, you can make a pool with four drives set up as two pairs of mirrored vdevs, and then later add drives in pairs of mirrors and dynamically resize the entire array across the new drives. That's not what you're looking for, but it's what ZFS users typically do instead, which is why the pressure to support dynamically adding disks to RAIDZ vdevs is not as high as you might expect.
Re: Magic by Rich0 · 2014-09-12 09:42 · Score: 1

Sure, but even in a mirrored btrfs configuration you don't have to add drives in pairs. Btrfs doesn't do mirroring at the drive level - it does it at the chunk level. So, chunk A might be mirrored across drives 1 and 2, and chunk B might be mirrored across drives 2 and 3. For the most part you can add a single n GB drive at a time and expand your usable storage capacity by n/2 GB. You don't have to rebalance anything when you add a new drive - it will just be used for new chunks in that case. However, in most cases you'll want to force a rebalance.
Another way to look at it is a zfs analogy. Let's set up 5 drives in a raid-z configuration in zfs. However, instead of creating a vdev with 5 drives, instead create 1000 partitions on each of those drives. Then create a vdev using the first partition on each of those drives, and add that to the zpool. Leave the rest of the drives unused. Then if space starts getting full, create another vdev using the second partition on each of those drives, and add that to the zpool. Continue operating in this manner. Then if the drives are half allocated to vdevs you can add another drive with 1000 partitions, and now whenever the zpool gets full you create a vdev that goes across all 6 drives. If the new drive is half the capacity of the others then you'd only put 500 partitions on it. If you decide you want to try mirroring, then the next time you need space pick the two drives with the most free partitions and create a vdev using one partition on each in a mirrored configuration.
This is basically how btrfs does multiple devices - the drives are allocated into chunks when needed, and the raid-like structures operate on top of those. This allows for more flexibility, since rebuilding/adding/removing/etc can be done a few chunks at a time, instead of a few drives at a time.
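The "add n GB, gain n/2 GB" claim for chunk-level mirroring can be checked with a back-of-the-envelope Python sketch. It assumes the usual two-copy bound: usable space is the smaller of half the total and the total minus the largest drive, since every chunk needs a partner on a different drive. Metadata and chunk granularity are ignored, and the function name is hypothetical.

```python
def two_copy_usable_gb(drive_sizes_gb):
    """Approximate usable space when every chunk is stored on two different drives."""
    total = sum(drive_sizes_gb)
    largest = max(drive_sizes_gb)
    # The largest drive can only be filled as far as the others can mirror it.
    return min(total / 2, total - largest)


before = two_copy_usable_gb([1000, 1000, 1000])
after = two_copy_usable_gb([1000, 1000, 1000, 500])
print(before, after, after - before)

# The exception behind "for the most part": one drive much larger than the rest.
lopsided = two_copy_usable_gb([1000, 1000, 4000])
print(lopsided)
```

In the first case the new 500 GB drive adds 250 GB of usable space. In the lopsided case the 4000 GB drive can only be mirrored against 2000 GB of other drives, so the gain falls short of n/2.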
Re: Magic by dnavid · 2014-09-12 11:10 · Score: 1

Sure, but even in a mirrored btrfs configuration you don't have to add drives in pairs. Btrfs doesn't do mirroring at the drive level - it does it at the chunk level. So, chunk A might be mirrored across drives 1 and 2, and chunk B might be mirrored across drives 2 and 3. For the most part you can add a single n GB drive at a time and expand your usable storage capacity by n/2 GB. You don't have to rebalance anything when you add a new drive - it will just be used for new chunks in that case. However, in most cases you'll want to force a rebalance.
That's a good thing/bad thing. It's what allows BTRFS to gain n+1 redundancy per data chunk on odd numbers of drives or in arrays where the number of drives changes, because mirroring isn't geometry-specific. But with disk mirrored vdevs you have the case where you can lose either half of the mirror for any vdev with no data loss. In other words, with four drives organized as two sets of mirrors, I can lose any two drives as long as they are not both members of the same mirror. With chunk-based mirroring once you lose a single drive you can't be certain the next drive failure anywhere won't fail the array without knowing exactly how the chunks are mirrored. That's not what people generally expect when they use "mirroring" and that mismatch can cause problems in maintaining arrays. Honestly, if ZFS could do that kind of mirroring with metaslabs, I would personally turn it off.
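A small Python enumeration of the four-drive case, as a sketch. The drive names are placeholders, and the chunk layout is an assumed worst case in which chunks have ended up mirrored across every possible pair of drives; a real chunk layout may be less spread out.

```python
from itertools import combinations


def surviving_pairs(drives, mirrored_sets):
    """List the two-drive failures that leave a copy of every mirrored set."""
    survivors = []
    for failed in combinations(drives, 2):
        failed = set(failed)
        # Data is lost if some mirrored set sits entirely on failed drives.
        if all(not set(members) <= failed for members in mirrored_sets):
            survivors.append(tuple(sorted(failed)))
    return survivors


drives = ["a", "b", "c", "d"]

# Two fixed mirror vdevs: a+b and c+d.
fixed = surviving_pairs(drives, [("a", "b"), ("c", "d")])

# Assumed worst case for chunk mirroring: some chunk lives on every pair.
chunked = surviving_pairs(drives, list(combinations(drives, 2)))

print(len(fixed), fixed)
print(len(chunked), chunked)
```

With fixed mirrors, four of the six possible double failures are survivable. In the assumed worst-case chunk layout, none are.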
If I was going to look at more advanced device redundancy and management storage, I would probably jump past btrfs and go to Ceph. There I can not only add storage devices however I want, I can also scale up the number of storage servers any way I want. It's still a developing system, but then again so is btrfs.
Re: Magic by Rich0 · 2014-09-12 13:40 · Score: 1

I tend to agree that Ceph is much more promising at large scale, but btrfs is more targeted at being a replacement for the likes of ext4. I haven't actually looked at it in a while, so thanks for reminding me about it. Perhaps I'll be crazy enough to run it at home one of these days. :)
Re: Magic by Rich0 · 2014-09-13 05:48 · Score: 1

I was just reading up on Ceph a bit. One thing that does have me concerned is that it does not appear to do any kind of content checksumming. Of course, if you store the underlying data on btrfs or zfs you'll benefit from that checksumming at the level of a single storage node. However, if for whatever reason one node in a cluster decides to store something different than all the other nodes in a cluster before handing it over to the filesystem, then you're going to have inconsistency. I'm shocked that they don't do some kind of checksumming at some level on the client and store that with the metadata.
Apparently if an inconsistency is detected one copy is treated as the primary copy by default and it just overwrites the others, regardless of whether the one primary copy differs from 75 replicate copies of it, etc.
It just seems a bit odd - if you're going to suffer all the performance penalties of a distributed storage system, you'd think that you'd do EVERYTHING you could to ensure data integrity. I'd consider having one definitive version of what a file should contain a priority. Plus, if you hash everything at some level I'd think that this would make stuff like dedup easier anyway.
Re:Magic by Rich0 · 2014-09-13 05:57 · Score: 1

Ugh - if I wasn't running btrfs I'd be running lvm+mdadm. The last thing I want is to have to have a ton of mounts everywhere simply because the underlying storage management system doesn't support dynamic resizing. Sure, there are lots of good reasons to use different filesystems for certain things, but I don't want to have to do it just to juggle hard drives.
Right now I'm running btrfs. If I want to add/remove drives I can just do it, and the space gets used. If I want to I could set up multiple subvolumes and directly mount them, but I don't have to do that just to juggle drives.
As far as rearranging raid geometry goes - I've done this just about every time I've increased my drive sizes. Sure, we probably wouldn't do it at work since we can just have the shareholders sacrifice a few hundred bucks for spare drives to hold everything while we rebuild arrays from scratch, or more likely we'd just have them pay $1000/TB for some big-name SAN. At home I don't have an extra 5 hard drives just sitting in a box to play with, so I'll pick a storage system that affords me flexibility without having to start over.
As I've said elsewhere in this thread - zfs is targeted mostly at people who are managing dozens of hard drives, not individual workstations, and it is looking more at things sold by EMC/etc as competitors. Btrfs is more targeted at being an ext4/lvm/mdadm replacement.
Re: Magic by dnavid · 2014-09-15 08:06 · Score: 1

I was just reading up on Ceph a bit. One thing that does have me concerned is that it does not appear to do any kind of content checksumming. Of course, if you store the underlying data on btrfs or zfs you'll benefit from that checksumming at the level of a single storage node. However, if for whatever reason one node in a cluster decides to store something different than all the other nodes in a cluster before handing it over to the filesystem, then you're going to have inconsistency.
The problem you're describing is a problem that neither ZFS nor BTRFS is capable of handling either. Both checksum data on disk, but are vulnerable to errors that occur anywhere else in the write path starting from network clients through the OS. That's why ECC or fault tolerant memory is explicitly recommended for ZFS enterprise servers; a bit flip in memory is impossible for ZFS to correct for or detect in most cases.
Re: Magic by Rich0 · 2014-09-15 13:34 · Score: 1

No argument, but the thing is that there is one definitive checksum of what should be on-disk when you're dealing with zfs/btrfs. When you're dealing with Ceph each storage server keeps its own checksum at the filesystem level (if you're running it on zfs/btrfs - otherwise no protection at all).
On the other hand, it does sound like the network traffic is checksummed, so the only real risk is to bugs, memory/cpu errors, or manipulation of the on-disk files.
People go on about ECC, but as far as I can tell neither btrfs nor zfs are any more vulnerable to memory errors than anything else. It is just a matter of RAM being the next largest source of risk once you've eliminated the disks, so it is the next logical thing to fix. Using ECC with these filesystems probably provides no more or less protection than using ECC with any other filesystem. Of course, if you can add it economically to your system, you probably should consider it.

Re:Technobabble... by nine-times · 2014-09-11 03:24 · Score: 1

20 minutes would make for a long elevator pitch.

Re:Completely Broken (At least for me) by rubycodez · 2014-09-11 03:24 · Score: 1

So you suck as a sys admin. Try another hobby.

Re:Technobabble... by ISoldat53 · 2014-09-11 03:25 · Score: 1

Bingo!

Re:Technobabble... by BenLutgens · 2014-09-11 03:26 · Score: 1

burn.

--
"If you love someone, set them free. If they come home, set them on fire." - George Carlin

Still no SELinux support by Kahenraz · 2014-09-11 03:27 · Score: 2

How can it be production-ready if it still lacks SELinux support? The ZOL FAQ suggests either permissive mode or disabling it entirely.

Re:Still no SELinux support by rubycodez · 2014-09-11 03:38 · Score: 1

other security systems exist, many believe that SELinux causes more problems than it solves
Re:Still no SELinux support by wonkey_monkey · 2014-09-11 04:09 · Score: 1

How many of the others are now integrated into the Linux kernel?

--
systemd is Roko's Basilisk.
Re:Still no SELinux support by devman · 2014-09-11 04:37 · Score: 1

The FAQ is outdated. SELinux support was added in the last release. I run a ZFS system on CentOS 6 with SELinux set to Enforcing and it works fine.
Re:Still no SELinux support by Rob+Riggs · 2014-09-11 04:37 · Score: 1

many believe that SELinux causes more problems than it solves
I've met those people. Not impressed.

--
the growth in cynicism and rebellion has not been without cause
Re:Still no SELinux support by rubycodez · 2014-09-11 04:56 · Score: 1

and I've watched SELinux heads waste days trying to figure out why it's killing standard apps that used to work for years

Re:rsync causes lockups? by yup2000 · 2014-09-11 03:32 · Score: 3, Insightful

I've been using ZFS on linux for years with nightly backup jobs that rely on rsync. I've never had a problem.

Re:Technobabble... by the_humeister · 2014-09-11 03:34 · Score: 1

The main feature is data checksumming. All the other features are just icing on the cake (snapshots, data dedup, etc.). Ars has a good article with illustrations.

Re: Completely Broken (At least for me) by zeigerpuppy · 2014-09-11 03:36 · Score: 1

ZFS-fuse is not ZFSonLinux! It uses an older version and is much buggier. Also which HDD controller are you using? This really matters cos ZFS should be talking to your drives directly and bad controllers sometimes do nasty things.

Re:Technobabble... by Kjella · 2014-09-11 03:37 · Score: 1

I think "a filesystem with ECC (like memory) and ACID (like a good database)" is as close as you come for an elevator pitch. I've had bits flip, whether it's in memory or on disk or in transit I'm not really sure but it happens. Like for example I create "known good" PARs for a 5GB video and later it fails CRC. If you use a hex editor compare tool it'll show a single bit difference. Backups are neat, but you really want control over bit rot - real bit rot - so you don't end up with slow corruption. ZFS pretty much makes sure that what you take backup of is good.

--
Live today, because you never know what tomorrow brings

production ready? by inerlogic · 2014-09-11 03:38 · Score: 1

clearly doesn't know what "production ready" means.......
you have to add devices by-id because the /dev/sd** designation changes with every reboot
and don't try to do a dist-upgrade without exporting the pool and praying.... prayer isn't even going to work, ask me how i f****** know....

Re:production ready? by Vesvvi · 2014-09-11 04:40 · Score: 1

Is it a problem to add them by ID? I intentionally use partition IDs because they're stable, and it works well in both Linux and FreeBSD, but the FreeBSD people seem to like labels or raw device names.
Regardless, every import should bring the same pool online in the same way, regardless of the device names.
Re:production ready? by fnj · 2014-09-11 19:20 · Score: 1

Yes, on linux you can specify drives to ZFS using any of the mechanisms in /dev/disk: by-id , by-label, by-path and by-uuid. As well as /dev/sd* of course. This is one aspect where linux is way ahead of FreeBSD.

Re:Technobabble... by rayvd · 2014-09-11 03:39 · Score: 1

We run a lot of ZFS on OpenIndiana/Nexenta, but also have some ZoL.

My favorite things about ZFS:
- Simpler volume management -- there's no more LVM layer! A little weird at first, but it really grows on you. Just zpool create, zfs create and you're off and running.
- Huge volumes -- we have a couple in production near 800TB
- Writable snapshots (think FlexClone on NetApp) -- no performance penalty. We have systems with hundreds of snaps and clones.
- Really stable (in our experience, ZFS on *Solaris has been rock solid -- the management pieces on top is where we occasionally run into issues). ZoL has already been quite stable.
- ZIL/L2ARC -- Use SSD's to accelerate reads/writes.
- Performs great with minimal tuning, but there are plenty of hidden knobs if you need them.
- Triple parity RAID options. Essential for larger drives.

Cons and Caveats:
- Memory hungry. Really memory hungry. Fortunately, RAM is cheap these days.
- Does require CPU as it wants to do all the "RAID" itself. Processors are so fast that this has never been an issue for us. Also you probably want to use disks that speak real SAS, not SATA to ensure graceful failure.
- For the *Solaris versions, picking the right hardware tends to be important. ZoL opens a lot of doors here.
- Deduplication sucks (or sucked last time we tried it). Required a ton of memory, especially if you want to use smaller block sizes to get better space savings. Very challenging to move away from deduplication once you turn it on.

Re: Ready for primetime... by zeigerpuppy · 2014-09-11 03:39 · Score: 1

You're talking about different levels. I use a zpool on SSDs for my databases AND give the databases lots of RAM. The two are not mutually exclusive, but of course your RAM should be utilised as close to your query as possible to increase throughput. ZFS is for integrity first; squeezing speed out of it is certainly possible but not its primary purpose.

Re: no support for posix acls by zeigerpuppy · 2014-09-11 03:56 · Score: 1

You could still use ZVOLs. I use ext4 on top of ZFS for some virtual machines. If you make sure to use the same block size in both, the performance hit is small (approx 10%) and you will have ext4 with proper checksumming and snapshot capabilities under it.

above, below, and at the same level. ZFS is everyt by raymorris · 2014-09-11 03:59 · Score: 4, Interesting

> ZFS is a layer below LVM.

Typically you'd layer raid, then LVM, then the filesystem. ZFS tries to be all three. It's raid, and it's a volume manager, and it's a filesystem. There are some benefits to integration, and some drawbacks. With the raid>lvm>filesystem approach, it's trivial to add dm-cache, bcache, iscsi, or any other piece of storage technology. With ZFS, anything you want to add has to be specifically supported within ZFS.

The Unix tradition is small, single purpose tools that do one thing well. Witness sort, grep, wc, etc. Want to count the log entries that mention Slashdot? You don't need a special tool for that, just grep slashdot | wc -l . Tools like mdadm and lvm are building blocks that can be combined to suit your need, the Unix way. ZFS is a big monolithic package that does everything, much like Microsoft Word or Outlook. ZFS is more in the Microsoft tradition.

Re:Technobabble... by Anonymous Coward · 2014-09-11 04:04 · Score: 1

> It is basically the non-BSD, non-GPL version of btrfs. :)

You have that backwards.

btrfs is the non-BSD, GPL version of ZFS.

Re:Technobabble... by meta-monkey · 2014-09-11 04:09 · Score: 1

Is there a good way to calculate how much RAM you need? I'm considering ZFS for my next server build. It'll be around 10TB.

--
We don't have a state-run media we have a media-run state.

Yay for me! by sribe · 2014-09-11 04:12 · Score: 1

Hey, I'm the guy who got modded +5 funny for replying to the 8/10TB disk announcement with "of course they did, I ordered 6TB drives 2 hours ago". Well, I switched my home NAS over to ZFS last month. So, yay for me, for once I'm ahead in at least some minimal sense or other!

Seriously though, I have found ZFS to be a damned good solution so far. (FYI, CentOS, Core i5, 4GB, 6x4TB with 2-disk parity, 2 eSATA -> port multipliers...) I really don't think I will ever deploy hardware RAID again.

Re: Little Baby Linux by zeigerpuppy · 2014-09-11 04:13 · Score: 1

Except FreeBSD can't do Xen + ZFS. BSD is good for a lot of things and so is Linux. A good sysadmin picks the right tool for the job. I'd like to think the BSD project benefits from more people using ZFS.

Re:above, below, and at the same level. ZFS is eve by Vesvvi · 2014-09-11 04:14 · Score: 3, Interesting

I think you're giving the wrong idea here. I have yet to find a format of storage capacity that zfs won't support, with one exception: you can't create a zvol on a zpool, then attach that zvol as back-end storage for the same zpool. That is specifically disallowed, and I'm guessing that you can't use a zvol from one zpool to back-end another zpool either. This is a very bizarre (also, probably dumb) thing to do, but even this can be overridden if you're really desperate. For more practical applications, everything else just works: at least in FreeBSD, you can "hide" the block devices behind all different kinds of abstractions to provide 4k writes, encryption, whatever, and zfs will consume those virtual block devices just fine.

Time Slider by PaulHarper · 2014-09-11 04:18 · Score: 1

ZFS on Linux would be cooler if they could port Time Slider to Linux from Open Solaris. http://java.dzone.com/news/kil...

Re: Time Slider by zeigerpuppy · 2014-09-11 04:35 · Score: 1

It's not too hard to do with a script. My snapshots are daily and they rotate every month leaving the first of every month as a longer term backup. It's granular enough for my data but hourly or even every minute would work fine too.
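
A minimal sketch of the rotation described above, written as a dry run so it never calls zfs itself. The snapshot naming scheme (`<dataset>@daily-YYYY-MM-DD`), the `tank/data` dataset, and the function name `snapshots_to_prune` are all assumptions for illustration, not the poster's actual script.

```shell
#!/bin/sh
# Hypothetical naming scheme: <dataset>@daily-YYYY-MM-DD (an assumption).

# Print the snapshots a monthly rotation would destroy: everything from
# before the given month, except snapshots taken on the 1st of a month.
# Usage: snapshots_to_prune YYYY-MM < list-of-snapshot-names
snapshots_to_prune() {
    current_month=$1
    while IFS= read -r snap; do
        date_part=${snap##*@daily-}    # e.g. 2014-08-17
        month=${date_part%-*}          # e.g. 2014-08
        day=${date_part##*-}           # e.g. 17
        [ "$month" = "$current_month" ] && continue   # keep this month's dailies
        [ "$day" = "01" ] && continue                 # keep the first of each month
        printf '%s\n' "$snap"
    done
}

# Dry run on a made-up list of snapshot names.
printf '%s\n' \
    'tank/data@daily-2014-08-01' \
    'tank/data@daily-2014-08-17' \
    'tank/data@daily-2014-09-01' \
    'tank/data@daily-2014-09-10' |
    snapshots_to_prune 2014-09
```

On a real system the input list could come from `zfs list -H -t snapshot -o name`, and each printed name could then be handed to `zfs destroy` once the output looks right.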
Re:Time Slider by Rich0 · 2014-09-11 05:05 · Score: 1

> ZFS on Linux would be cooler if they could port Time Slider to Linux from Open Solaris. http://java.dzone.com/news/kil...

That's what snapper is for.
Re:Time Slider by linuxguy · 2014-09-11 11:50 · Score: 1

> That's what snapper is for.
Does snapper work with ZFS?
Re:Time Slider by Rich0 · 2014-09-11 23:52 · Score: 1

> That's what snapper is for.
> Does snapper work with ZFS?
No idea, but I'm sure you could make it work. I use it all the time with btrfs.

Re:rsync causes lockups? by joel48 · 2014-09-11 04:21 · Score: 1

With ZoL 0.6.2 there were issues with metadata-heavy workloads (rsync being one of the prime examples); with 0.6.3, however, those issues have been significantly improved by many accounts.

Re:Little Baby Linux by TangoMargarine · 2014-09-11 04:21 · Score: 2

> FreeBSD has had ZFS for what, over five years now? They are the reason it exists in any actual use (OpenSolaris/Illumos don't count) on any non-Sun/Oracle platform.

God forbid it take the Linux guys longer to get it up and running when Sun purposely licensed it to be difficult to do so on Linux.

> And Linux's wannabee ZFS competitor BTRFS (oooh, look at us) sucks so bad it can't get off the ground.

So, this being Linux, some guys* also designed Btrfs to do the same things in the meantime. How dare they!? Sun released ZFS after 4 years of work; Btrfs, 2. Presumably they were working under more of an "agile" setup? Which doesn't really make sense for an FS but hey.

> So what does Linux do.... import (steal) ZFS from OpenZFS/FreeBSD

It's called porting, and I don't see how you can call it "stealing" in any honest way.

> and start posting about how great all their work with ZFS is, and how Linux bloggers now say 'oh yeah, ZFS is actually solid, so we can use it'. As if they are the only/first ones to certify ZFS.

If you actually skim the article he is saying ZFS On Linux is ready, not ZFS itself.

> Thing is, ZFS was always solid. When bashing ZFS Linux was really just babbling about ZFS's more open and free BSD License and their own failure of BTRFS.

Was there bashing of it? Being on Slashdot only since 2007/8 I thought it was more Linux people being irked that they couldn't play with it due to the licensing rather than saying it was crap.

Also, I really hope you're aware that the CDDL and the BSD License(s) are not the same thing. ZFS is CDDL.

> If you want an integrated system that just works, try FreeBSD.

You're using "just works" and ZFS in the same argument? With a straight face? The intersection of "Just Works" and people who use ZFS has to be pretty small. If you want Just Works just slap an ext3 or ext4 partition on your desktop and be done with it.

* Interestingly, Wikipedia says Btrfs is (was?) actually an Oracle project. Oracle, of course, bought Sun, which made ZFS. So maybe "competitor" isn't entirely accurate?

--
Unity? Screw that: XFCE. Slashdot Beta? Screw that: SoylentNews. Australis? Screw that: Pale Moon. UX developers DIAF

Re:above, below, and at the same level. ZFS is eve by brambus · 2014-09-11 04:23 · Score: 2

iSCSI doesn't need to be baked into ZFS, in fact, even on Illumos it isn't. It's in a completely different subsystem and will happily work with any block device as its backend storage (be it a physical drive, a ZFS zvol, a loopback block device or anything else, really).

Re:above, below, and at the same level. ZFS is eve by devman · 2014-09-11 04:25 · Score: 5, Informative

Anything that can be represented as a block device can be added to a zpool. This also includes files, which is handy when you're trying to understand complicated interactions: you can mock up a small zpool based on files instead of devices for testing.

On the other side of the abstraction, ZFS can also expose block devices called zvols that are backed by the zpool. So if you wanted to run a dmcrypted EXT4 filesystem backed by a zpool, you can certainly do that using a zvol and still get all the benefits of ZFS integrity protection and snapshotting.

Plenty of layering can be done with ZFS.
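
As a rough illustration of the file-backed test pool mentioned above, here is a sketch that creates two sparse backing files and only prints the zpool commands rather than running them. The pool name `testpool`, the file names, and the 128M size are made up for the example; as far as I know ZFS refuses vdevs smaller than 64 MiB, so the files are kept above that.

```shell
#!/bin/sh
# Create sparse backing files for a throwaway mirrored pool.
# File names and sizes are hypothetical.
make_backing_files() {
    dir=$1
    for n in 0 1; do
        truncate -s 128M "$dir/disk$n.img"   # sparse file, uses almost no real space
    done
}

workdir=$(mktemp -d)
make_backing_files "$workdir"

# Left as echo so nothing touches a real system. File vdevs need absolute paths.
# Drop the echo (and run as root on a box with ZFS installed) to actually try it.
echo zpool create testpool mirror "$workdir/disk0.img" "$workdir/disk1.img"
echo zpool status testpool
echo zpool destroy testpool
```

Once such a pool exists you can pull a backing file out from under it, scrub, and watch how ZFS reacts, all without risking real disks.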

Re: Technobabble... by zeigerpuppy · 2014-09-11 04:27 · Score: 1

The recommendations vary a lot, mostly because it also depends on whether you're accessing a subset of your files more frequently than others. 1 GB of RAM per TB of storage seems good for most loads, but ZFS will use as much as you throw at it.

Re:Little Baby Linux by bigfinger76 · 2014-09-11 04:27 · Score: 1

More like Big Baby BSD, amirite?

Re:Technobabble... by Rich0 · 2014-09-11 04:42 · Score: 1

Are you suggesting that if A = B, then B = A? :)

Re:Technobabble... by Daniel_Staal · 2014-09-11 04:45 · Score: 1

It depends partly on what features of ZFS you'll be using, and what types of performance you need. In general, you can run ZFS for an arbitrarily-large disk set with about 2GB of RAM - but you won't be using the memory cache features of ZFS much at all. The more RAM you have available, the more it'll assign to the ARC (read cache). If you are running a media fileserver, where every read is a large file and is unique, then the ARC doesn't make much difference. If it's a webserver, where you read the same small files over and over, it's a huge difference. Things like compression and larger checksums also can take slightly more RAM.

The one real computable is if you try to turn on deduplication - you need something like 5GB of RAM per TB of data to be deduped, or performance goes to hell. This is to store the dedup lookup tables (which are put in the ARC) - if you can't fit them into RAM, every read/write adds having to read them into RAM, lookup where the data is, and then load the data. (Which can mean several reads per IO op.) Note that you don't have to dedup the entire dataset - it's on a per-filesystem basis. (And ZFS makes creating filesystems trivial.) Still, it's best to leave it off unless you have ungodly amounts of RAM to throw at it, and know you are storing heavily duplicated data.
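
To put numbers on the rules of thumb floating around this thread, here is a back-of-the-envelope sketch. It combines the roughly 1 GB per TB guideline from another reply with the 2 GB floor and the 5 GB per deduped TB figure above; the function name `zfs_ram_estimate_gb` is made up, and the result is a ballpark, not a sizing guarantee.

```shell
#!/bin/sh
# Rough RAM estimate in GB from the rules of thumb quoted in this thread.
# $1 = pool size in TB, $2 = TB of data with dedup enabled (default 0).
zfs_ram_estimate_gb() {
    pool_tb=$1
    dedup_tb=${2:-0}
    base=$pool_tb                     # ~1 GB of RAM per TB of pool
    [ "$base" -lt 2 ] && base=2       # ~2 GB is the stated floor
    echo $(( base + 5 * dedup_tb ))   # plus ~5 GB per TB of deduplicated data
}

zfs_ram_estimate_gb 10      # 10 TB pool, no dedup
zfs_ram_estimate_gb 10 2    # same pool, 2 TB of it deduplicated
```

For the 10 TB build asked about earlier, this lands at about 10 GB without dedup and about 20 GB if 2 TB of it is deduplicated.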

--
'Sensible' is a curse word.

Re:rsync causes lockups? by wagnerrp · 2014-09-11 04:49 · Score: 3, Interesting

If you intend to send the snapshots over the network, as is often the case with rsync, you need to pair it with some independent communication tool, and since the output of "zfs send" tends to be very bursty, you need a sizable memory buffer.

Re:Technobabble... by Voyager529 · 2014-09-11 04:58 · Score: 1

> For all the technobabble in that summary, I still don't know what ZFS offers me over other filesystems. Maybe the guys working on the system should do a little marketing course, or work on their 'elevator pitch'...

Here's my attempt...

1.) ZFS does software RAID as its normal mode of existence. It's naturally contested as to whether this is a good thing, but it depends on context. ZFS doing software RAID on a busy MySQL server? Not great. ZFS doing software RAID on a FreeNAS box whose lot in life is to shuffle data to and from a bank of hard disks? Better.
2.) Datasets. These are best described as the lovechild of folders and partitions. Like partitions, they can have their own mount points, their own permissions, storage quotas, and their own compression settings. Like folders, it's possible to have dozens of datasets on a volume, and then let the dataset use as much of the volume's storage capacity as needed, and dynamically expand or contract them as necessary.
3.) Snapshots. If you're used to Windows, think "Shadow Copies", but easier to work with.
4.) Deduplication. This *can* be dangerous, but deduplication can be enabled on a per-dataset level, so if you have a known set of data that has massive duplication (e.g. a dozen Windows VM disks for a test environment), it can save a whole lot of hard disk space.
5.) ZFS brings a lot of the functionality of the more expensive SCSI cards to commodity hardware with basic drives, and can do its thing with a hodgepodge of disks. This is useful if you're like me and think it's useful to have a RAID-6 array with drives from several vendors to help mitigate the risk of a homogenous manufacturing run.
6.) Not a feature of ZFS directly, but ZFS and FreeNAS/Nas4Free/Nexenta have a rather symbiotic relationship. If a NAS is built running a BSD distribution explicitly designed for storage, these distros make it extremely easy to manage the storage array and use the data transfer protocols best suited for the task at hand - all support FTP, SMB, iSCSI, and NFS, with some more exotic stuff generally available as well.

Re:above, below, and at the same level. ZFS is eve by Vesvvi · 2014-09-11 04:58 · Score: 1

Have you confirmed using a zvol underneath a zpool, and if so was it a different zpool?
I've wanted to do that in the past, but it was specifically blocked. It's a pretty ugly thing to do, but it does give you a "new" block device that could be imported as a mirror on-demand. With enough drives in the zpool, that new device is nearly independent from its mirror, from a failure perspective.

Re:rsync causes lockups? by TechyImmigrant · 2014-09-11 05:01 · Score: 2

Does the sky fall in if your buffer isn't 'sizable'? Or does it just run a bit slower?

--
I should use this sig to advertise my book ISBN-13 : 978-1501515132.

Re:Technobabble... by meta-monkey · 2014-09-11 05:02 · Score: 1

Thank you so much, that's very informative!

--
We don't have a state-run media we have a media-run state.

Re:Technobabble... by MightyYar · 2014-09-11 05:06 · Score: 1

I've been a member of the Church of Parity ever since I discovered that some of my dutifully backed-up family photos had not only gotten corrupted, but the backup dutifully copied the corruption as well. Ever since, I use backup tools which do a parity check (e.g. Unison) and I try to store important things on ZFS if I can.

In my case I was lucky and I had an older backup without the corruption. But lesson learned... Also, have more than one backup :)

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.

This is great news... by Anonymous Coward · 2014-09-11 05:16 · Score: 1

But it still won't stem the flow of Linux refugees to *BSD due to the trash that is being put into the Linux ecosystem such as systemd.

Re:above, below, and at the same level. ZFS is eve by sjames · 2014-09-11 05:19 · Score: 1

Not really. Fundamentally, a filesystem's job is to store data in a structured manner on an unstructured array of blocks. For everything ZFS does, it still comes down to that.

There are a great many advantages to having that structure include duplicate blocks and checksums.

If you really prefer, you can reasonably build a non-redundant ZFS pool on top of a RAID volume though you will lose a few advantages that way.

Re:rsync causes lockups? by bragr · 2014-09-11 05:35 · Score: 3, Informative

Back when I did OpenSolaris work, we used a tool called mbuffer which is basically netcat with a buffer on each end. It wouldn't be suitable for internet backups (no encryption) but it works pretty well for cross campus backups and the like.

IIRC it works like this on the sending side: 'zfs send pool/fs@snap | mbuffer -s 128k -m 4G -O 10.0.0.1:9090'

And on the receive side: 'mbuffer -s 128k -m 4G -I 9090 | zfs receive pool/fs'

It can still be pretty bursty but it smoothes out a lot of it.

Several others posting say that risks corruption by raymorris · 2014-09-11 05:44 · Score: 1

Several others posting here, fans of ZFS, are saying on this very page that ZFS really needs to be accessing bare disks. It will allow you to use another block device, they say, but data corruption is highly likely.

Re:above, below, and at the same level. ZFS is eve by brambus · 2014-09-11 05:44 · Score: 1

I've confirmed it by actually reading the source code. Even a quick look at the manpage will tell you the same:

# touch /export/lun/0
# sbdadm create-lu -s 10g /export/lun/0    # file backing
# sbdadm create-lu /dev/rdsk/c1t1d0s0      # raw disk backing

Missing features by DanielOom · 2014-09-11 05:47 · Score: 1

The Linux implementation lacks support for the Archive, Hidden, ReadOnly, and System file attributes; those are needed for MS-DOS support:-]

Re:Several others posting say that risks corruptio by brambus · 2014-09-11 05:48 · Score: 1

I was talking about iSCSI support, not how the ZFS pools themselves are built. Yes, ZFS *does* prefer raw disk backing (due to certain management simplifications), but it does not need it. In fact, it's quite possible, and people do frequently run ZFS on top of RAID arrays.

Linux Distributions by dutchwhizzman · 2014-09-11 05:58 · Score: 1

You must have an enormous collection of Linux Distributions at home to need that much storage.

--
I was promised a flying car. Where is my flying car?

Re:above, below, and at the same level. ZFS is eve by smash · 2014-09-11 06:15 · Score: 1

You know you can lay a zfs filesystem on files, right?

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

I used it for about a year by Solandri · 2014-09-11 06:32 · Score: 2

And was very impressed. It was a new 4-drive system I'd put together to operate as both a NAS/fileserver and a host for virtual machines. I had originally intended to use RAID 5, but decided to give ZFS a try after reading about it. My initial config had it booting Ubuntu (maybe Mint? I don't recall), with ZFS for Linux installed as the main non-boot filesystem with one-drive redundancy. I had all sorts of problems with drives dropping out of the array, which I eventually tracked down to the motherboard shipping with bad SATA cables. ZFS handled this admirably. At first I didn't notice one of the drives had dropped, and continued using the system for about a day. When I got the drive working again, as I understand it RAID 5 would have had to do a complete array rebuild because of the changed files. ZFS noticed most of my old data was on the "new" drive and simply validated the checksums as still accurate, then noticed I had written new files and automatically created new redundancy files for them on the "new" drive. The entire "rebuild" only took a little over an hour instead of the 20+ hours I was expecting (how long it takes me to backup the data over eSATA).

If you're wondering why ZFS trusts the checksums on the "new" drive instead of reading the entire file, it will read the entire file and compare it to the checksum every time you access it. Once a month by default, it runs a "scrub" where it reads every file and verifies they haven't suffered bit rot and still match the checksums. Apparently the strategy after a dropped drive is to get the redundant filesystem up and running again ASAP, then do the file integrity scrub afterwards at its leisure. (You can manually force this check at any time with zpool scrub.)

The other main advantage I'd say is that it's incredibly flexible when you're putting together redundant arrays. RAID 5 normally requires 3+ drives or partitions of the same size. ZFS lets you mix together drives, partitions, files (yes, one of your ZFS "drives" can be a file on another filesystem), other devices like SAS drives, etc. You can even put the 3+ "drives" needed for redundancy onto a single drive if you just want to play around with it for testing.

The only problem I ran into was with deduplication. Dedup was part of the reason I decided to try ZFS, and is one of the features frequently mentioned by ZFS advocates. While dedup does work, it is an incredible memory and performance hog. Writes to the ZFS array went from 65+ MB/s (bunch of mixed random files) down to about 8 MB/s with dedup turned on, and memory use climbed to where I ordered more RAM to bump the system up to 16 GB. In the end I decided the approx 2% disk space I was saving with dedup wasn't worth it and disabled it.

I eventually switched to FreeNAS (based on FreeBSD, which has a native port of ZFS) because it was annoying having to reinstall ZFS for Linux after an Ubuntu/Mint update, and I couldn't see myself doing that after every new release because I wanted features which were added to the core OS. (And if you're wondering, dedup performance is just as bad under FreeNAS.)

Re:I used it for about a year by Chewbacon · 2014-09-11 08:46 · Score: 1

I'm using Ubuntu server with ZFS and haven't had such issues after updating it, other than it changing /dev/disk/by-id. I had to seek a little help to fix it, and it turned out to be 2 commands and my ZFS array was up and running again. I love its flexibility as far as drives go, like you described: you can make things work in the fashion of duct tape and extension cords. Performance and reliability are great for a home server when you compare the price of a high-end true RAID card. Haven't had a real issue with ZFS yet (knock knock).

--
Chewbacon
The Bible is like Wikipedia: written by a bunch of people and verifiable by questionable sources.
Re:I used it for about a year by Guspaz · 2014-09-11 09:22 · Score: 1

You shouldn't have to reinstall ZFS after any updates (apart from maybe a distro release upgrade, which on a file server running Ubuntu are probably being done every two years, or six months if you live on the edge), as it uses DKMS and will recompile the modules when you update the kernel or zfs.
Re:I used it for about a year by jwhitener · 2014-09-15 12:17 · Score: 1

> Writes to the ZFS array went from 65+ MB/s (bunch of mixed random files) down to about 8 MB/s with dedup turned on, and memory use climbed to where I ordered more RAM to bump the system up to 16 GB. In the end I decided the approx 2% disk space I was saving with dedup wasn't worth it and disabled it.

I was always curious how well it scaled down (like for home use). At work we have multiple 100+ disk storage systems using ZFS, and notice zero performance hits using de-dupe features (mainly through mirroring).

Re:above, below, and at the same level. ZFS is eve by mi · 2014-09-11 06:33 · Score: 1

> The Unix tradition is small, single purpose tools that do one thing well. Witness sort, grep, wc, etc.

The cost of this approach has always been performance. It is faster, for example, to use grep's -c switch than to pipe grep's output into wc -l (as is commonly done in poorly-written scripts).

When it comes to storage, the performance penalty of using separate layers, which aren't well-aware of each other, becomes big enough to justify integration...
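
To make the grep comparison above concrete, here is a small self-contained sketch. The log contents and the temporary file are invented for the example; both forms return the same count, the second just does it in one process.

```shell
#!/bin/sh
# Build a small throwaway log so the comparison is self-contained.
log=$(mktemp)
printf '%s\n' \
    'GET /index.html referer=slashdot.org' \
    'GET /about.html referer=example.com' \
    'GET /zfs.html referer=slashdot.org' > "$log"

count_pipe=$(grep slashdot "$log" | wc -l)   # two processes and a pipe
count_flag=$(grep -c slashdot "$log")        # one process, same answer

echo "pipe: $count_pipe  flag: $count_flag"
rm -f "$log"
```

On a three-line file the difference is invisible; it only starts to matter when the match set is large or the command runs in a tight loop.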

--
In Soviet Washington the swamp drains you.

Re:rsync causes lockups? by kingramon0 · 2014-09-11 06:54 · Score: 3, Funny

The sky won't fall but the walls might.

-Shaka

Works great by just_common_sense · 2014-09-11 06:55 · Score: 1

I've used ZFS on Linux for years now and it's been fantastic for my long-term storage. One 2 TB drive runs all the time and I power another one up periodically to auto-sync with the first one. That saves power and drive wear vs an always-on RAID 1 setup.

Re: rsync causes lockups? by CAIMLAS · 2014-09-11 07:07 · Score: 1

Rsync causes a lot of metadata lookups, which will fill ARC metadata in a hurry. If the ARC max isn't set, you'll OOM the box (or crash ZoL if it is, I think). I'm not sure how to monitor or control it on ZoL, because ZoL's memory management is still kernel independent...
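
For what it's worth, ZoL exposes the ARC cap as the `zfs_arc_max` module parameter (in bytes) and ARC statistics under /proc/spl/kstat/zfs/arcstats. The sketch below only does the unit conversion and prints the option line; the 4 GiB value and the helper name `arc_max_option` are arbitrary choices for the example, not a recommendation.

```shell
#!/bin/sh
# Convert a cap in GiB to bytes and print the matching module option line.
arc_max_option() {
    gib=$1
    echo "options zfs zfs_arc_max=$(( gib * 1024 * 1024 * 1024 ))"
}

arc_max_option 4    # a line like this would go in /etc/modprobe.d/zfs.conf

# On a running ZoL system (as root) the same limit can be changed live and inspected:
#   echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
#   grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats
```

Whether a given cap actually prevents the rsync lockups described here depends on the ZoL version, so treat it as something to experiment with rather than a fix.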

--
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers

Re:Several others posting say that risks corruptio by devman · 2014-09-11 07:09 · Score: 1

Data corruption isn't likely. The reason to use whole (unpartitioned) disks is performance: ZFS can then control the disk caches directly.

Re:Technobabble... by stoploss · 2014-09-11 07:11 · Score: 1

No, the insinuation is that ZFS is approximately a decade more mature and proven than that upstart knockoff, btrfs.

Re:above, below, and at the same level. ZFS is eve by 0123456 · 2014-09-11 07:12 · Score: 1

> The cost of this approach has always been performance.

Uh, no.

Sure, in the specific case you cite, that's correct, because grep can easily count the number of lines it outputs. In the general case, however, you'll find a pipe is probably faster, because the two processes can run on different cores.

Re:above, below, and at the same level. ZFS is eve by Bengie · 2014-09-11 07:30 · Score: 1

Certain required features could not be done if they were separate features. In order to properly do certain things, the 3 layers must understand each other. I'm not talking about "fun" features, I'm talking about problems that have been plaguing data centers and there was no other way.

Re:Little Baby Linux by TangoMargarine · 2014-09-11 07:30 · Score: 1

> No, sun purposefully licensed it as BSD to be easy to use anywhere.

No, ZFS is CDDL licensed. ZFS is not BSD licensed. If it were BSD, Linux could pull it in trivially. That's the whole point.

> hes using just works with the word freebsd. and yes it does, freebsd is not a distribution slapped together with blah blah blah

Well then this is offtopic. The whole rest of the conversation is about ZFS, not BSD vs. Linux in general.

--
Unity? Screw that: XFCE. Slashdot Beta? Screw that: SoylentNews. Australis? Screw that: Pale Moon. UX developers DIAF

Example? by raymorris · 2014-09-11 07:51 · Score: 1

Do you have an example? The storage system I'm using provides every important feature I'm aware of in ZFS, and it keeps the layers separate. As ZFS has matured, it seems to be a way of getting all of those features out-of-the-box, without needing to think about how to put it together. LVM is one volume manager that provides most of the same features, though. Then put your choice of filesystem on top of LVM. Can you think of any feature that actually requires the volume manager to be stirred together with the filesystem?

Re:Example? by Sloppy · 2014-09-11 09:19 · Score: 2

(I still do things the classic way: filesystem on lvm on luks on mdadm. not using ZFS yet.) I'm not sure it's exactly about what's required.
Consider wear leveling on SSDs. Only the filesystem really understands which blocks need to preserve data and which ones are don't-care. So to do SSDs right, it needs to pass info about unallocated storage down to the volume manager, which then passes it to the encryption, which then passes it to the RAID, which then gives it to old-school "real" block device (which then passes it to the wear-leveling firmware, I guess). Sure, that can work. But when the filesystem can talk to the physical block device, it's easier. If you're writing block devices that implement things like volumes and encryption and RAID, from your PoV, things that are allocated vs not-allocated are totally different than how the filesystem sees it. To you, a block is just a block and a whole bunch of ioctls are totally irrelevant and not related to what you're working on. You're going to find this type of information to be pesky and you might not handle it right (or more likely, it takes a long time before you handle it at all). And in fact that has happened a few times, where certain block devices' feature set lagged a bit, behind what people with SSDs needed.
I suppose another easily-contrived example would be if you have a few gigabytes of data on a few terabytes of RAID, and need to [re]build the RAID. If your RAID doesn't know which blocks actually have data, then it'll need to copy/xor a few terabytes. If it's a unified system, then it can be complete after copying/xoring a few gigabytes.

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
Re:Example? by Carnildo · 2014-09-11 11:49 · Score: 1

> Can you think of any feature that actually requires the volume manager to be stirred together with the filesystem?
Smart array (re)builds. In the typical layered approach, the redundancy layer doesn't know what parts of the filesystem are in use, so it spends a great deal of time synchronizing empty space.

--
"They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
Re:Example? by dbIII · 2014-09-11 13:29 · Score: 1

That's the killer feature especially as 10TB drives hit the market this week. A resilver puts just the used data on the drive while a rebuild fills the entire thing.
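
A quick bit of arithmetic shows why that matters. The numbers below (a 10 TB drive holding 500 GB of data, 100 MB/s sustained) are invented for illustration, and the helper name `copy_hours` is made up; a real resilver is not a purely sequential copy, so treat this as an order-of-magnitude sketch.

```shell
#!/bin/sh
# Hours to copy a given number of GB at a sustained rate in MB/s
# (integer math, rounded up to the next whole hour).
copy_hours() {
    gb=$1
    mb_per_s=$2
    seconds=$(( gb * 1000 / mb_per_s ))
    echo $(( (seconds + 3599) / 3600 ))
}

# Made-up numbers: 10 TB drive, 500 GB actually in use, 100 MB/s.
echo "full rebuild:   $(copy_hours 10000 100) h"
echo "used-data only: $(copy_hours 500 100) h"
```

With these assumptions the whole-disk rebuild takes about 28 hours while copying only the used data takes about 2.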

Re:rsync causes lockups? by wagnerrp · 2014-09-11 08:17 · Score: 1

On a gigabit network with the standard 64K pipe buffer, you'll trickle along at just a few megabits.

Re:Several others posting say that risks corruptio by fwarren · 2014-09-11 08:20 · Score: 1

To get all the features of data integrity and error correction you need to avoid hardware raid with ZFS.

If ZFS controls the drives and you have striping or mirroring and one of the drives has corrupt data, ZFS can log an error and fix the data corruption. If hardware RAID controls the drive, it may realize a copy of the data is bad, and pass back the good copy, but RAID won't log an error nor fix the bad block. ZFS won't fix it either because it won't know since RAID handled returning the correct data to ZFS.

ZFS is self healing if it is NOT on top of hardware RAID. To reap all of the benefits of ZFS you need lots of RAM, that RAM should be ECC and two or more disks without hardware RAID.
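
Here is a toy model of that self-healing read path, using two plain files as the mirror halves. It is only meant to show the idea: `cksum` (a CRC) stands in for the much stronger checksums ZFS uses, the function names are made up, and none of this reflects how ZFS is actually implemented.

```shell
#!/bin/sh
# Checksum of a file's contents (CRC only, as a stand-in).
checksum_of() { cksum < "$1" | cut -d' ' -f1; }

# Print the block from whichever copy matches the stored checksum,
# and overwrite a non-matching copy with the good one.
read_block() {
    copy_a=$1; copy_b=$2; expected=$3
    if [ "$(checksum_of "$copy_a")" = "$expected" ]; then
        good=$copy_a; other=$copy_b
    elif [ "$(checksum_of "$copy_b")" = "$expected" ]; then
        good=$copy_b; other=$copy_a
    else
        echo "both copies are corrupt" >&2
        return 1
    fi
    if [ "$(checksum_of "$other")" != "$expected" ]; then
        echo "repairing $other" >&2     # the "log an error and fix it" step
        cp "$good" "$other"
    fi
    cat "$good"
}

dir=$(mktemp -d)
printf 'family photo bytes\n' > "$dir/a"
cp "$dir/a" "$dir/b"
sum=$(checksum_of "$dir/a")             # checksum recorded at write time

printf 'bit rot\n' > "$dir/a"           # silently corrupt one side of the mirror
read_block "$dir/a" "$dir/b" "$sum"     # returns the good data and repairs copy a
```

A hardware RAID controller sitting underneath would hide the two copies from this logic, which is the point being made above: the layer holding the checksum also needs to see both copies.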

--
vi + /etc over regedit any day of the week.

Re:No defrag! by ewhac · 2014-09-11 08:25 · Score: 1

Yes. Alas, this is a consequence of ZFS's COW (copy on write) design.

In a filesystem like EXT3, if you open a file, seek to some offset, and write new data, EXT3 will write the new data to the existing disk block in place. ZFS, however, will allocate a new block for that offset (copy on write), write the modified data to it, and update the block chain. The result is that it's apparently very easy to badly fragment a ZFS file (do a Google search for "ZFS fragmentation" to see various stories and tests people have written).

You can apparently mitigate the problem by occasionally copying the entire affected file -- Oracle's own whitepaper on the subject apparently reads, "Periodically copying data files reorganizes the file location on disk and gives better full scan response time."

Bottom line: ZFS is not a panacea, nor is it simple. There are myriad options, and trade-offs to all of them.

--
Editor, A1-AAA AmeriCaptions

Re:above, below, and at the same level. ZFS is eve by Guspaz · 2014-09-11 08:29 · Score: 1

More than that, since you're effectively virtualizing your EXT4 filesystem, you can expand it pretty easily too. You're backed by a storage pool, which means you can expand that pool by adding or replacing drives, and then simply resize the EXT4 filesystem live. EXT4 need not know about the fact that you've added a new raid array to the storage pool.

Re:rsync causes lockups? by Guspaz · 2014-09-11 08:30 · Score: 2

They're working on fixing that, but in the meantime you can pipe it through mbuffer or something similar to resolve the issue.

Re:above, below, and at the same level. ZFS is eve by Guspaz · 2014-09-11 08:32 · Score: 1

You're assuming that the single process isn't multithreaded. ZFS is multithreaded.

Re:above, below, and at the same level. ZFS is eve by RabidReindeer · 2014-09-11 08:58 · Score: 1

> Want to count the log entries that mention Slashdot? You don't need a special tool for that, just grep slashdot | wc -l .

Not since journalctl took over.

Re:above, below, and at the same level. ZFS is eve by BitZtream · 2014-09-11 09:12 · Score: 1

You're not really understanding how ZFS does and can work. It already has hooks to provide 'features' such as you talk about. It does require crossing several traditional Unix boundaries, thats true, but its an accepted trade off to get the benefits that go with ZFS, but the hooks to include such features at the typical boundary points still exist in the ZFS code. Pretending that ZFS has to be totally and completely aware of what you hook in isn't really fair. What you hook in has to integrate with the API, which is well defined, and that really isn't any different than with the approach you seem to prefer.

And for reference: dm-cache and bcache are not needed with ZFS, l2arc already covers them, and it does it better because it knows what's going on across all 3 layers. I seem to have no problem doing iscsi sharing of ZFS storage space nor do I seem to have any problem using iscsi targets as part of vdevs. Hell, technically you can still use dm-cache and bcache with ZFS, if you're ignorant enough to do so. You can even run whatever file system you want on top of zvols. You'd be stupid to do it in most cases, but the ability is there if need be.

Since you want to use the word Unix, let's get a few things clear. Linux does not have and likely never will have a Unix certification. Sun on the other hand had two operating systems that were certified Unix and they were doing it before Linus had a computer to start Linux on. Drop the 'my OS does it right' bullshit because your OS isn't what you're claiming it to be, and the system you're arguing against was written by people who did make something you're claiming it isn't.

I don't disagree with the Unix tradition in the least, compartmentalized code with strong boundaries and good interoperability wherever possible ... and occasionally you tear down the walls for specific reasons. Graphical performance is an example where your philosophy sucks, which is why Windows kicks the ever living shit out of Linux performance. Note: Linux, NOT Unix. SGI had a terrific graphics stack as an example, and Sun's wasn't too horrible.

--
Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager

Re:above, below, and at the same level. ZFS is eve by raymorris · 2014-09-11 10:43 · Score: 1

Yeah, that's the thing about systemd- it's not Unixy. It might be great, but it's not designed according to Unix principles.

I heard of that. Wrote it, actually by raymorris · 2014-09-11 10:52 · Score: 1

> Drop the 'my OS does it right' bullshit because your OS isn't what you're claiming it to be,

Where did I say one approach was right and the other wrong? In fact, I said each approach has its advantages and disadvantages. What I said is that ZFS is not designed according to the Unix tradition of "do one small thing, and do it right". Apparently you agree that's the case:

> don't disagree with the Unix tradition in the least, compartmentalized code with strong boundaries and good interoperability wherever possible

That's why some who appreciate the Unix approach hate systemd. It would be more at home on Windows.

Re Sun, if you look at an old Sun Solaris box, you'll find some of it was written by a guy named Ray Morris. Coincidentally, this post was also written by Ray Morris.

Re:above, below, and at the same level. ZFS is eve by ultranova · 2014-09-11 11:48 · Score: 1

> The Unix tradition is small, single purpose tools that do one thing well.

Yet the kernel does scheduling, memory management, user access control, filesystems, device management, TCP/IP, power management, etc.

--

Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

above, below, and at the same level. ZFS is everyt by tbuskey3553 · 2014-09-11 13:05 · Score: 1

> ZFS is a layer below LVM.

> Typically you'd layer raid, then LVM, then the filesystem. ZFS tries to be all three. It's raid, and it's a volume manager, and it's a filesystem. There are some benefits to integration, and some drawbacks. With the raid>lvm>filesystem approach, it's trivial to add dm-cache, bcache, iscsi, or any other piece of storage technology.

ZFS can have things added on top just as trivially. I've never done smb in ZFS, I used Samba. On solaris, I did NFS with ZFS, but on Linux, I've kept it in /etc/exports.

With RAID+LVM+filesystem+network share, I need to unshare and umount, then resize LVM, resize the filesystem partition (if it is allowed), mount and export. With ZFS I do zfs set quota= zpool/directory. On the fly with no downtime. That alone is worth the difference.

ZFS does compression also. And you can toggle back & forth while running. zfs set compression=on zpool/directory and every new write is compressed. compression=off and new items are not compressed.
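To make the two property changes above concrete, here is a minimal Python sketch that only builds the `zfs set` command lines. The dataset name `tank/projects` and the `500G` value are made up for illustration, and nothing is executed; on a real system the argv list could be handed to `subprocess.run`.

```python
def zfs_set(prop, value, dataset):
    """Build the argv for `zfs set property=value dataset` (not executed)."""
    return ["zfs", "set", f"{prop}={value}", dataset]

# Hypothetical dataset and quota value, chosen only for the example.
quota_cmd = zfs_set("quota", "500G", "tank/projects")
compress_cmd = zfs_set("compression", "on", "tank/projects")

print(" ".join(quota_cmd))
print(" ".join(compress_cmd))
```

Both changes apply to a mounted dataset, which is the "no downtime" point made above.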

ZFS also does checksums to verify the data. If you have a dodgy sata cable or controller firmware that corrupts your data, it will be detected. If ZFS is doing RAID, it has a 2nd copy of that data and can self correct. If it doesn't have a 2nd copy, it will shut the filesystem down so nothing else gets corrupted. You can't do that with hardware RAID.
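The detect-and-repair idea can be sketched as a toy model in Python. This is not how ZFS is implemented; the `ToyMirror` class, its two in-memory "disks" and the choice of SHA-256 are all assumptions made for the example. It only shows why keeping a checksum apart from the data lets a reader tell which copy is the good one.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class ToyMirror:
    """Two copies of every block, plus a checksum stored apart from the data."""

    def __init__(self):
        self.copies = [{}, {}]   # two hypothetical "disks"
        self.sums = {}           # block id -> expected checksum

    def write(self, block_id, data: bytes):
        for disk in self.copies:
            disk[block_id] = data
        self.sums[block_id] = checksum(data)

    def read(self, block_id) -> bytes:
        expected = self.sums[block_id]
        for disk in self.copies:
            data = disk[block_id]
            if checksum(data) == expected:
                # Good copy found: rewrite any copy that fails verification.
                for other in self.copies:
                    if checksum(other[block_id]) != expected:
                        other[block_id] = data
                return data
        raise IOError(f"block {block_id}: no copy matches its checksum")

mirror = ToyMirror()
mirror.write(0, b"important data")
mirror.copies[0][0] = b"important dat\x00"   # simulate a dodgy cable
print(mirror.read(0))                         # served from the good copy
print(mirror.copies[0][0])                    # the bad copy was rewritten
```

A plain mirror without checksums would see two different copies and have no way to know which one to trust.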

> With ZFS, anything you want to add has to be specifically supported within ZFS.

Not true at all.

> The Unix tradition is small, single purpose tools that do one thing well. Witness sort, grep, wc, etc. Want to count the log entries that mention Slashdot? You don't need a special tool for that, just grep slashdot | wc -l . Tools like mdadm and lvm are building blocks that can be combined to suit your need, the Unix way. ZFS is a big monolithic package that does everything, much like Microsoft Word or Outlook. ZFS is more in the Microsoft tradition.
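For readers who have not met the building-block idea before, the grep-then-count pipeline quoted above can be mimicked with two small composable functions. This is only an illustration in Python; the function names and the sample log lines are invented for the example.

```python
def grep(pattern, lines):
    """Yield the lines that contain `pattern`, like a bare `grep`."""
    return (line for line in lines if pattern in line)

def wc_l(lines):
    """Count lines, like `wc -l`."""
    return sum(1 for _ in lines)

# Made-up log lines standing in for a real log file.
log = [
    "GET /index.html referer=slashdot.org",
    "GET /about.html referer=example.com",
    "GET /zfs.html referer=slashdot.org",
]

print(wc_l(grep("slashdot", log)))   # 2
```

Neither function knows about the other; they only agree on "a stream of lines", which is the point being argued.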

Some of the things that ZFS does would be difficult if it wasn't the whole system. Newer systems like btrfs and ceph take a similar approach.

Re:no support for posix acls by ewhac · 2014-09-11 13:07 · Score: 1

I dunno about ZFS for Linux, but FreeNAS's ZFS has NFSv4 ACLs. Are these not sufficient?

--
Editor, A1-AAA AmeriCaptions

Re:rsync causes lockups? by dbIII · 2014-09-11 13:08 · Score: 1

It just runs slower. Only 4GB for a machine with a bit of traffic on ZFS is a bad idea but it gets things done eventually - even sending and receiving snapshots of a few TB.
Yes the example is a crap machine, and a 32 bit "netburst" Xeon as well, but crap machines are things that can get used to test things to destruction before going near production machines. I was using it with FreeBSD9 since ZFS was more advanced on that platform.

Re:My missing feature by ewhac · 2014-09-11 13:19 · Score: 1

As far as I'm aware, you don't need 'dump' with ZFS. You create a snapshot, then 'zfs send' that snapshot off to your backup storage. Can be done on a live filesystem. Delete the local snapshot when you're done copying it off. ( http://docs.oracle.com/cd/E187... )
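The snapshot, send, delete sequence described above could look roughly like the following. This Python sketch only assembles the command strings for review and does not run them; the dataset names, the snapshot name and the `backuphost` target are hypothetical, and sending over ssh is just one common way to reach the backup storage.

```python
import shlex

def backup_commands(dataset, snap, target_host, target_dataset):
    """Return the shell commands for a snapshot-based backup, in order."""
    full = f"{dataset}@{snap}"
    return [
        f"zfs snapshot {shlex.quote(full)}",
        f"zfs send {shlex.quote(full)} | ssh {shlex.quote(target_host)} "
        f"zfs receive {shlex.quote(target_dataset)}",
        f"zfs destroy {shlex.quote(full)}",
    ]

# Hypothetical names, printed for review rather than executed.
for cmd in backup_commands("tank/home", "nightly", "backuphost", "backup/home"):
    print(cmd)
```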

--
Editor, A1-AAA AmeriCaptions

They are not really a single layer by dbIII · 2014-09-11 13:25 · Score: 1

The zpool and zfs layers could be considered separate in the same way - especially with things like zfs send to move filesystem snapshots to other pools which are usually on other machines. The filesystem does not get influenced by the nature of the pool and vice versa, so long as it's big enough to fit. Global options on pools (eg. no atime) are really just passed down to the filesystems instead of it being a zpool operation.
It's really more like if LVM and ext4 were done by the same development team than a totally different and totally monolithic approach.

Re:Technobabble... by Rich0 · 2014-09-11 13:32 · Score: 1

No, the insinuation is that ZFS is approximately a decade more mature and proven than that upstart knockoff, btrfs.

If they had only BSD licensed it, the latter might never have been built. :) Just another case of qt syndrome I guess.

Btrfs does have a few advantages - it is a bit more flexible in some respects, mostly in areas where the more enterprise-y ZFS audience isn't going to care.

Re:No defrag! by zixxt · 2014-09-11 14:02 · Score: 1

I used ZFS under FreeBSD, it was good for a few months until it got slow and I needed to defrag it, oh no ZFS is too good for a defrag tool so I zapped it and installed Debian with XFS, much, much faster and it comes with an online defrag tool.

One of the few reasons I stick to ext4 and XFS under Linux too! I got burned by having no way to defrag JFS, ReiserFS or UFS/FFS under the BSD's.

--
---- GENERATION 26: The first time you see this, copy it into your sig on any forum and add 1 to the generation.

What, fsync() is STILL broken? by gweihir · 2014-09-11 14:42 · Score: 1

If I call fsync() I legitimately expect it to not ever return before the data is on disk or to return with an error. Any other behavior is just completely unacceptable and a rather severe fault.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

great example. Removeable, interchangeable schedul by raymorris · 2014-09-11 14:44 · Score: 1

You listed some great examples, examples of the opposite of what you probably meant to show.
Take scheduling - there are what, six different interchangeable, removable kernel modules to do scheduling in different ways, including the option to not do it at all. The scheduler only does scheduling, and nothing else. The rest of the kernel doesn't know or care about the scheduling. You mentioned filesystems as well. Yep, you can choose from dozens of different filesystems. The rest of the kernel doesn't care which filesystem you're using, because those other modules do their job and nothing more. You can use any scheduler with any filesystem.

Enter zfs, a popular volume manager similar to LVM. It just manages volumes, so you choose whichever filesystem to lay on top. Er, no. If you want to use the ZFS volume manager, you probably need to use the ZFS filesystem. That's cool, it'll also provide an extra level of resiliency on top of that great hardware raid you have. Actually, not so much. It doesn't play nicely with most enterprise storage hardware. You need to use dumb hardware and use ZFS raid to avoid problems. Wait, what? ZFS, a filesystem, is telling you which hardware to use? That's not like the interchangeable kernel modules at all.

Re:Technobabble... by stoploss · 2014-09-11 15:58 · Score: 1

Haha, and conversely ZFS' "rampant layering violations" approach confers advantages that btrfs cannot match. However, most Linux users will just go with what their distro rolls out.

It is somewhat ironic that Oracle is primarily responsible for both ZFS and btrfs, given that btrfs was developed because Oracle wasn't willing to dual license ZFS.

Re:above, below, and at the same level. ZFS is eve by Vesvvi · 2014-09-11 16:59 · Score: 1

Sorry, I'm not that familiar with OpenSolaris.
Don't the first and second commands create a zpool backed by a file? That's not what's at question here, I want to know if you can back a zpool with a zvol created on that same zpool.

A quick test showed that it does work on FreeBSD to create a zpool upon a zvol from a different zpool. The circular version has made it hang for a not-insignificant amount of time...

Re:Little Baby Linux by fnj · 2014-09-11 19:15 · Score: 1

I really hope you're aware that the CDDL and the BSD License(s) are not the same thing. ZFS is CDDL.

But at least they are plainly *compatible*, aren't they? I mean, since FreeBSD has ZFS in the base system.

And CDDL is *incompatible* with GPL, isn't it? I mean, since no distro includes it and you have to graft the two together from separate sources.

P.S. - I actually USE ZFS on Linux (as well as on FreeBSD), and I love it.

Re:above, below, and at the same level. ZFS is eve by brambus · 2014-09-11 19:28 · Score: 1

I think you misunderstood my reply. I was replying to the poster talking about iSCSI having to be implemented in ZFS. That's what I was addressing.
You talk about a different thing altogether - ZFS backing. The zvol-on-another-zpool solution should work, although performance will suck. The zvol-on-the-same-zpool solution can and will hang for obvious reasons.

Re:above, below, and at the same level. ZFS is eve by rev0lt · 2014-09-11 21:49 · Score: 1

> ZFS is a big monolithic package that does everything, much like Microsoft Word or Outlook. ZFS is more in the Microsoft tradition.

Well, that is well within the Unix tradition. ZFS is a *kernel* module, not a userland application. Just because the cli interface is comprised of 2 commands, it doesn't mean it's monolithic. It's as monolithic as ifconfig and other complex utilities.
And I'd take the zfs/zpool command format over the ugly lvm mess any day.

well.. by Mirar · 2014-09-11 21:57 · Score: 1

I tried to run backups to ZFS on a crypted USB disk. It worked for a while, but if something fails (like the backup disk going to sleep), the entire chain hangs. I can't disconnect the crypt device, and I can't disconnect the ZFS pool; zpool and zfs hang. What I do with the USB cable and hardware no longer has any impact. I stopped doing that. (I didn't have better luck with btrfs.) Although I don't really blame ZFS that much, other than that it can't handle hung devices. USB on Linux is still flaky.

The other problem I have is that after a while it happily uses up 30GB of the 32GB on the computer, and gives it back extremely reluctantly. I can't seem to control how much ZFS will use. And the rest of the system isn't really happy with just 2GB to run programs in (several virtual machines of 8GB RAM each, for instance).
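One knob that may help with the memory use described above: ZFS on Linux has a `zfs_arc_max` module parameter that caps the ARC size in bytes. The Python sketch below just works out the byte value and prints an option line; the 8 GiB figure is an arbitrary example, and placing the line in a file such as /etc/modprobe.d/zfs.conf is an assumption about the local setup.

```python
def arc_max_line(gib):
    """Build a modprobe option line capping the ARC at `gib` GiB."""
    return f"options zfs zfs_arc_max={gib * 1024**3}"

# 8 GiB is only an example; pick a value that leaves room for the VMs.
print(arc_max_line(8))   # options zfs zfs_arc_max=8589934592
```

Whether the cache then shrinks as quickly as hoped under memory pressure is a separate question that this does not answer.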

Re:No defrag! by Mirar · 2014-09-11 21:59 · Score: 1

is this something I should think of for the mysql databases?

Re:Technobabble... by RLiegh · 2014-09-11 22:07 · Score: 1

the btrfs project was started in 2007, before Oracle purchased SUN.

Re:Technobabble... by Rich0 · 2014-09-11 23:36 · Score: 1

> Haha, and conversely ZFS' "rampant layering violations" approach confers advantages that btrfs cannot match.

What advantage does ZFS have over btrfs, other than maturity? You do realize that both are virtually identical as far as "rampant layering violations" go, right?

Re:Technobabble... by stoploss · 2014-09-11 23:49 · Score: 1

Ah, so there is an equivalent to RAID-Z (1, 2, etc, and no, MD is not an acceptable substitute), the ability to use a ZIL type log device (and bring it on/take it offline without dismount), dedupe, and probably other features I'm forgetting but haven't bothered to check again since I last compared them about two years ago?

Re:Technobabble... by Rich0 · 2014-09-12 00:01 · Score: 1

RAID-Z, yes (but more flexible, though it is experimental and definitely not usable yet), RAID1 is production-ready (and it works like RAID-Z, not like traditional RAID1). ZIL does not exist at all yet. Dedupe does exist.

It is a COW filesystem that does checksumming of all blocks/etc, manages storage pools, doesn't have the RAID write hole, etc, just like ZFS. It tends to be more flexible as you don't have the limitation of being unable to add drives to a "vdev." However, it is far less mature and many features are still just on the roadmap.

Right now my one RAID1 has 4x1TB drives and 1x3TB drive, with 3.5TB of usable space. This is because RAID is implemented at the chunk level, not the drive level, so that 3TB drive can be mirrored across all the other drives (of course, that makes it a bit of a bottleneck for writes). I could convert that to a RAID5 (which is basically the equivalent of RAIDZ in zfs) without having to rewrite all the data - the already-written chunks would just stay in raid1 mode and new chunks would be allocated in raid5 mode unless I told the filesystem to migrate everything.
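The 3.5TB figure can be checked with a little arithmetic. The sketch below assumes the usual rule of thumb for two-copy, chunk-level mirroring: every chunk needs a partner on a different device, so usable space is half the total unless one device is larger than all the others combined. The function name is made up and sizes are in GB to keep the numbers whole.

```python
def raid1_usable(device_sizes):
    """Usable space when every chunk is stored on two different devices."""
    total = sum(device_sizes)
    largest = max(device_sizes)
    # The largest device can only be filled as far as the rest of the
    # pool can hold the second copy of its chunks.
    return min(total // 2, total - largest)

drives_gb = [1000, 1000, 1000, 1000, 3000]
print(raid1_usable(drives_gb))   # 3500
```

With a drive-level mirror the same five drives would have to be paired up by size, which is where the chunk-level approach gains its flexibility.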

In general I think the design of btrfs is a bit more flexible. However, it is far less mature and in particular the more "enterprise-oriented" features tend to be lacking as they aren't a priority. They're really focused on different audiences.

Re:rsync causes lockups? by Syberghost · 2014-09-12 00:58 · Score: 3, Informative

You can kludge on encryption in the pipeline:

http://sourceforge.net/project...

Re:Technobabble... by stoploss · 2014-09-12 01:05 · Score: 1

Thanks for the info. FYI, RAID-5 or 6 is *not* equivalent to RAID-Z or Z2 due to the issue of the write hole.

Despite your assertion to the contrary, a cursory google showed no proof that raid with btrfs can circumvent the write hole issue, because it is a problem beneath the filesystem. The "layering violations" ZFS uses to effect the fix are what made raid-z(n) possible. I would be interested if you can provide a citation showing the write hole has been precluded in btrfs raid.

Also, it doesn't look like btrfs includes support for cache devices.

Re:Little Baby Linux by TangoMargarine · 2014-09-12 02:18 · Score: 1

Exactly. AC keeps claiming "Sun licensed it as BSD" as a reason why "BSD is The Way"...which could be true philosophically, but it's not factual as it's CDDL.

--
Unity? Screw that: XFCE. Slashdot Beta? Screw that: SoylentNews. Australis? Screw that: Pale Moon. UX developers DIAF

Re:above, below, and at the same level. ZFS is eve by tinker_taylor · 2014-09-12 03:03 · Score: 1

The RAID, LVM, Filesystem approach is defunct in the modern world. Also, ZFS already incorporates multi-protocol support, ability to turn any host with local storage into a target (via the COMSTAR framework). Not sure how much of this is in the linux port, but I suspect that if it's close enough to Illumos, it should have these features.

ZFS is not in the microsoft tradition, it is a departure from 20th century storage design/architecture. The very idea that there has to be a RAID/LVM/FS is archaic and has been thoroughly disproven. In my previous shop we had petabytes of storage in ZFS pools and hardly ever lost data.
The Pool-based model that eliminates the layers of RAID/LVM/FS results in better performance, easier supportability and superior diagnostics capabilities.

Do you realize that almost every major storage vendor first bashed ZFS and then about 3-4 years later started building architecture that was eerily like ZFS?

My shop was one of the early adopters of ZFS back in 2007. There were a few bugs then, but over the years I have been absolutely impressed with the efficiency and stability of ZFS.

Re:great example. Removeable, interchangeable sche by Bengie · 2014-09-12 04:22 · Score: 1

ZFS plays just fine, the problem is in order to fully benefit from ZFS, ZFS must manage its own redundancy. You can still use RAID5 on your SAN, but you'll still want RAID5 with ZFS, which is just that much more wasted space. You also get the disadvantage that when a drive dies in the SAN RAID, performance will take a bigger hit than it needs to be, because the hardware RAID has no idea how the file system works.

In most situations, you're better off having each layer completely independent, but in the case of ZFS, it seems that when you don't make the layers entirely generic, but make them specialized to each other, the end product is much greater than the sum of the parts.

Re:ZFS - faster IO on larger pools by ledow · 2014-09-12 05:18 · Score: 1

1) Yes, it's a general feature of RAIDs. Multiple devices are reading the data, the "fastest finger first" wins.

2) File server only dependent on your disk format, you mean? I happen to agree here but, if you're doing it at the FS level, then just a standardised RAID layout (such as Linux md / LVM) is the same thing. The non-standard formats that tie you into hardware do so for a reason - the hardware RAID provides things that no software RAID can, sheer speed. (Though, please note, I've happily run Linux software RAID on server-end hardware in production systems without any performance problems).

3) 3 disks dying out of 11? RAID6+1 will actually do better (I think... I can't do the maths just now).

ZFS is cool, don't get me wrong, but it's basically just a RAID fs. The Merkle tree journalling trick just saves having to have battery backup, but whether it works like that in real life failures is another matter entirely.

Re:Technobabble... by Rich0 · 2014-09-12 05:22 · Score: 1

Btrfs RAID5 won't have a write hole. It is fairly equivalent to zfs RAID-Z.

Right now the btrfs raid5 code doesn't recover from any kind of disk failure at all - it is experimental (that is, it writes parity to disk, but cannot use it to recover missing data). All data written to each disk are checksummed, so in the event of a problem during write the filesystem can determine which disks are correct and which ones aren't. Also, like zfs, btrfs does not overwrite data in-place, so in the event of a failure files should end up in either the pre-write or post-write state.
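Since the write hole keeps coming up, here is a toy Python illustration of it. It models a single stripe of two data blocks plus XOR parity and is not a description of md, btrfs or ZFS internals; the block contents and the `stripes`/`live` bookkeeping are invented for the example.

```python
from functools import reduce

def xor(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A stripe of two data blocks plus XOR parity, as in classic RAID 5.
d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)

# In-place update interrupted by a power loss: the new d0 reached the
# disk, but the matching parity write did not.
d0 = b"CCCC"

# The disk holding d1 now fails. Rebuilding it from d0 and the stale
# parity silently returns the wrong bytes. This is the write hole.
rebuilt_d1 = xor(d0, parity)
print(rebuilt_d1 == b"BBBB")   # False

# Copy-on-write sketch: the new stripe goes to fresh space and only
# becomes live when a single pointer is switched afterwards.
stripes = {"old": (b"AAAA", b"BBBB", xor(b"AAAA", b"BBBB"))}
live = "old"
stripes["new"] = (b"CCCC", b"BBBB", xor(b"CCCC", b"BBBB"))
# A crash at this point leaves `live` on the old, self-consistent stripe.
s0, s1, sp = stripes[live]
print(xor(s0, sp) == s1)       # True
```

The checksums mentioned above add a second line of defence: even if a rebuild did produce wrong bytes, a mismatch would be detected rather than returned as good data.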

I couldn't find any official docs about raid5 on btrfs at all - it is still experimental. Mailing list posts seem to suggest that it will not have the write hole when complete (such as http://comments.gmane.org/gman... ).

Now, I'm talking about btrfs directly writing to disks. If you stick mdadm in-between btrfs and the disks (avoiding the layering violation), then the result will be the same as if you stick mdadm between zfs and the disks - you'll have the write hole. The same layering violation concerns that apply to zfs also apply to btrfs as it also does volume management, raid, disk management, etc. If you run a scrub on btrfs or rebuild a disk it will only scrub/rebuild the blocks that are actually in use, etc.

I won't argue that btrfs on the whole is nowhere near as stable as zfs is currently, but most of the key features are very similar. They are both COW filesystems that are designed to run directly against disks, they both checksum all data to protect against on-disk modification and the write-hole, they both utilize their knowledge of what is on the disk to optimize how things like striping/parity are done, they both support very cheap snapshots, etc.

Anybody interested in storage at a low level would do well to have at least a general familiarity of both, regardless of preference.

It's not a big, multifunctional package? Or grep i by raymorris · 2014-09-12 05:31 · Score: 1

> ZFS is not in the microsoft tradition

Balsa (Gnome email client): 2.5 MB, reads email. Optionally use libgtkhtml (315kb) to render HTML email.

Microsoft Outlook: (Microsoft email client): Several GBs. Reads email, handles calendar, embedded mail server, task list, weather reports(?!?!), fax, rss, html templates, _sharing_ calendars. Loads MS Word (several GB) to partially display HTML messages.

Let's break this down into three statements and see where we disagree:

1. It appears that the Microsoft tradition is big monolithic packages that do everything. Including weather reports embedded in their email client.
Do you disagree with that?

2. Do you disagree with the statement that the Unix tradition (from ed to grep to elm and balsa) is small, focused tools?

3. Do you disagree with the statement that ZFS is a volume manager, a filesystem, a raid-like redundancy system, and a few other things as well? In other words, that it's a big, monolithic package that does many things. Do you disagree with that?

You LIKE ZFS. I understand that. It does a lot of cool things. It does a lot of boring things. It does a lot of things. Just like Microsoft Office.

Re:above, below, and at the same level. ZFS is eve by mi · 2014-09-12 07:08 · Score: 1

> In the general case, however, you'll find a pipe is probably faster, because the two processes can run on different cores.

The poster I was responding to referred to "Unix tradition". The tradition started on single-CPU systems...

Even on modern multi-core computers, piping data from stdout to stdin is inefficient. Very convenient, but inefficient nonetheless. When the cost of developing (such as shell scripts written to either be one-offs or rarely executed) exceeds the costs of the inefficiency, it is justified.

But with storage, code that is used thousands (millions!) of times per day, it makes all the sense to invest in developing the subsystem.

Indeed, various OS-vendors (free and otherwise) all spend a lot of effort (and money) on improving their offerings. ZFS is just an example of something better than all (or most?) of the competition.

--
In Soviet Washington the swamp drains you.

Re:above, below, and at the same level. ZFS is eve by badkarmadayaccount · 2014-09-13 20:07 · Score: 1

Except that LVM is a PITA, mixing with RAID makes it even more so, and the RAID is unaware of the actual used space, making RAID 5 or 6 very expensive, not to mention it can't assist FS level checksumming with restoring individual blocks, you need to fail the whole drive. Implementing network transparency at the block level is inefficient, but no other FS has ZFS connect functionality.

--
I know tobacco is bad for you, so I smoke weed with crack.

Re:It's not a big, multifunctional package? Or gre by tinker_taylor · 2014-09-24 08:14 · Score: 1

> 3. Do you disagree with the statement that ZFS is a volume manager, a filesystem, a raid-like redundancy system, and a few other things as well? In other words, that it's a big, monolithic package that does many things. Do you disagree with that?

I'm suggesting that concepts such as "volume manager", "filesystem" and "raid-like redundancy system" don't need to be separate entities. The concepts such as "filesystem" and "volume" etc exist to conform to a 20th century vocabulary. And it's not that revolutionary any more. Companies like EMC, Netapp etc went that route too...thereby simplifying things like HSM etc at a "pool of disks" level.

Re:It's not a big, multifunctional package? Or gre by raymorris · 2014-09-24 08:37 · Score: 1

> > 3. Do you disagree with the statement that ZFS is a volume manager, a filesystem, a raid-like redundancy system, and a few other things as well? In other words, that it's a big, monolithic package that does many things. Do you disagree with that?

> I'm suggesting that concepts such as "volume manager", "filesystem" and "raid-like redundancy system" don't need to be separate entities.

Absolutely they don't NEED to be.
I'm suggesting that concepts such as "mail client, calendar, fax, RSS reader and weather reports" don't need to be separate entities.
In fact, if you smash them all together into one big entity, you can sell it for $109.99 and LOTS of people will buy it.

So we agree that because those things don't HAVE to be separate, systemd combines all of them together, into one package that does everything. We also agree that:
> 1. It appears that the Microsoft tradition is big monolithic packages that do everything.

Ergo, system fits the Microsoft tradition, the Microsoft way of doing things. That way certainly isn't impossible - Microsoft has made billions of dollars doing it that way. Unix traditionally does things a different way.

Re:It's not a big, multifunctional package? Or gre by tinker_taylor · 2014-09-25 07:58 · Score: 1

> Absolutely they don't NEED to be.
> I'm suggesting that concepts such as "mail client, calendar, fax, RSS reader and weather reports" don't need to be separate entities.
> In fact, if you smash them all together into one big entity, you can sell it for $109.99 and LOTS of people will buy it.

> So we agree that because those things don't HAVE to be separate, systemd combines all of them together, into one package that does everything. We also agree that:
> > 1. It appears that the Microsoft tradition is big monolithic packages that do everything.

> Ergo, system fits the Microsoft tradition, the Microsoft way of doing things. That way certainly isn't impossible - Microsoft has made billions of dollars doing it that way. Unix traditionally does things a different way.

There is a difference between a low-level tool and a high-level product. You could say that it is not "UNIX-like" to put together a bunch of individual components into a single program (monolith?) - why doesn't everyone use awk, sed, grep etc to do their text processing? The reason why people build higher level products out of low level components is to make life easier.

I don't particularly see any problem with Word, Excel, Powerpoint etc. See, even google does it with Google Docs...so it must be right :)

I don't know if you have ever used ZFS -- if you have, you would know what I mean. Having had to deal with migrating thousands of LUNs from one storage array to another (via host-side migration - Veritas Volume Manager), I can tell you that a product that is not "Disk and LVM and Filesystem" (aka ZFS, which is pool, vdev and volume) is a lot simpler to use. I just have to replace disks at the pool level, and don't have to worry about timing the mirroring and detachment of the mirrors (of disks from 2 separate storage arrays) such that it doesn't kill performance of my 20TB DWH running on the box. The same goes for wanting to accelerate things using the L2ARC capabilities, etc.

Good/bad != Unix/Windows by raymorris · 2014-09-25 08:26 · Score: 1

Your last two posts seem to be getting at the idea that being monolithic isn't bad. I never said it was.
I said that monolithic packages are the way Microsoft and other Windows developers traditionally do things,
and that small, single purpose tools are the Unix tradition.

> I don't particularly see any problem with Word, Excel, Powerpoint etc.

I didn't say there was a problem with those. I said Microsoft builds software like that. Do you disagree?
