The State of ZFS On Linux
An anonymous reader writes: Richard Yao, one of the most prolific contributors to the ZFSOnLinux project, has put up a post explaining why he thinks the filesystem is definitely production-ready. He says, "ZFS provides strong guarantees for the integrity of [data] from the moment that fsync() returns on a file, an operation on a synchronous file handle is returned or dirty writeback occurs (by default every 5 seconds). These guarantees are enabled by ZFS' disk format, which places all data into a Merkle tree that stores 256-bit checksums and is changed atomically via a two-stage transaction commit. ... Sharing a common code base with other Open ZFS platforms has given ZFS on Linux the opportunity to rapidly implement features available on other Open ZFS platforms. At present, Illumos is the reference platform in the Open ZFS community and despite its ZFS driver having hundreds of features, ZoL is only behind on about 18 of them."
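For anyone wondering what "a Merkle tree that stores 256-bit checksums" buys you, here is a toy sketch in Python. It is not ZFS's on-disk format: the names `checksum` and `merkle_root` are made up for illustration, SHA-256 is used only because it is a convenient 256-bit hash, and a simple binary tree stands in for the real block-pointer tree. The idea it shows is that every parent stores a checksum of its children, so a change to any block changes the root.

```python
import hashlib
from typing import List


def checksum(data: bytes) -> bytes:
    """256-bit checksum of a block (SHA-256 chosen purely for illustration)."""
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> bytes:
    """Fold block checksums pairwise until a single root checksum remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [checksum(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # odd count: pair the last checksum with itself
        # Each parent is the checksum of its two children's checksums.
        level = [checksum(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
good_root = merkle_root(blocks)

# Silently corrupt one block: the root no longer matches the known-good one.
blocks[2] = b"block-2 with a flipped bit"
corruption_detected = merkle_root(blocks) != good_root
```

Because only the root has to be trusted, verifying it is enough to notice corruption anywhere underneath it.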
We use ZFS to store backups, and we back up with rsync. No problems so far.
I'm still quite unfamiliar with the concepts behind ZFS. How would it compare to RAID-5 under LVM with EXT4?
It's unfortunate that the code is being ported to Linux, not rewritten. This means there will never be native Debian support for it. As a result, unsurprisingly, packages are only available for the Debian amd64 architecture.
I've been using ZFSonLinux for a year in production. No problems at all. It's my storage back end for Xen virtual machines. Just make sure you use ECC RAM and a decent hard disk controller. Instant snapshots and the ZFS send/receive functions are awesome and have reduced my backup times by an order of magnitude. I use a Debian Wheezy/Unstable hybrid.
Been using rsync on ZFS for many months (FreeBSD 10.0). No issues whatsoever.
I've been using this for a production fileserver for about a year and a half. Prior to that I was using ZFS on FUSE for about a year.
The only minor negative I can report is that when you do have some odd kind of failure, ZFS (and this may be the case on BSD and Solaris too) gives you some pretty scary messages like "Please recover from backup", but usually exporting and importing the FS brings it back, at least in a degraded state. My other caveat might just be my Linux distro, but I've often had problems with older versions of the libraries hanging around and causing the command-line tools to break.
Is the target not a zfs filesystem as well? If so zfs send/recv allows for replication and handles deltas at the filesystem level. It should be more efficient.
http://docs.oracle.com/cd/E192...
It's a killer file system. Once you've used it, you won't be able to leave it.
So how much space do the checksums take up? How much does all this behind-the-scenes work slow down reading and writing data?
Is this something that a normal consumer would use for their main storage?
Troll is not a replacement for I disagree.
https://pthree.org/2013/12/10/zfs-administration-appendix-c-why-you-should-use-ecc-ram/
It's really too bad laptops still don't offer ECC RAM. I'd be willing to pay a little more for an ECC-capable motherboard.
or you could spend 20 minutes reading about it on their web page instead of relying on slashdot summaries, you lazy git
I've been using ZFS on Linux for about a year. I can summarise my position on the experience with two words: it's magic.
It is still tricky to run one's root system off ZFS (at least on Debian). That, I think, is for those who are brave and have the time to deal with issues that might arise following updates. But for non-root filesystems, ZFS is, as I said, magic. It's fast, reliable, caches intelligently, adapts to a large variety of mirror/striping/RAID configurations, snapshots with incredible efficiency, and simply works as advertised.
Someone once (before the port to other OSes) said that ZFS was Solaris' "killer app". Having used it in production for a year, I can understand why they said that.
20 minutes would make for a long elevator pitch.
So you suck as a sys admin. Try another hobby.
Bingo!
burn.
"If you love someone, set them free. If they come home, set them on fire." - George Carlin
How can it be production-ready if it still lacks SELinux support? The ZoL FAQ suggests either running SELinux in permissive mode or disabling it entirely.
I've been using ZFS on linux for years with nightly backup jobs that rely on rsync. I've never had a problem.
The main feature is data checksumming. All the other features are just icing on the cake (snapshots, data dedup, etc.). Ars has a good article with illustrations.
ZFS-fuse is not ZFSonLinux! It uses an older version and is much buggier. Also which HDD controller are you using? This really matters cos ZFS should be talking to your drives directly and bad controllers sometimes do nasty things.
I think "a filesystem with ECC (like memory) and ACID (like a good database)" is as close as you come for an elevator pitch. I've had bits flip; whether it's in memory, on disk or in transit I'm not really sure, but it happens. For example, I create "known good" PARs for a 5GB video and later it fails CRC. If you use a hex editor's compare tool, it'll show a single-bit difference. Backups are neat, but you really want control over bit rot - real bit rot - so you don't end up with slow corruption. ZFS pretty much makes sure that what you take a backup of is good.
Live today, because you never know what tomorrow brings
clearly doesn't know what "production ready" means.......
you have to add devices by-id because the /dev/sd** designation changes with every reboot
and don't try to do a dist-upgrade without exporting the pool and praying.... prayer isn't even going to work, ask me how i f****** know....
We run a lot of ZFS on OpenIndiana/Nexenta, but also have some ZoL.
My favorite things about ZFS:
- Simpler volume management -- there's no more LVM layer! A little weird at first, but it really grows on you. Just zpool create, zfs create and you're off and running.
- Huge volumes -- we have a couple in production near 800TB
- Writable snapshots (think FlexClone on NetApp) -- no performance penalty. We have systems with hundreds of snaps and clones.
- Really stable (in our experience, ZFS on *Solaris has been rock solid -- the management pieces on top are where we occasionally run into issues). ZoL has already been quite stable.
- ZIL/L2ARC -- Use SSDs to accelerate reads/writes.
- Performs great with minimal tuning, but there are plenty of hidden knobs if you need them.
- Triple parity RAID options. Essential for larger drives.
Cons and Caveats:
- Memory hungry. Really memory hungry. Fortunately, RAM is cheap these days.
- Does require CPU as it wants to do all the "RAID" itself. Processors are so fast that this has never been an issue for us. Also you probably want to use disks that speak real SAS, not SATA to ensure graceful failure.
- For the *Solaris versions, picking the right hardware tends to be important. ZoL opens a lot of doors here.
- Deduplication sucks (or sucked last time we tried it). Required a ton of memory, especially if you want to use smaller block sizes to get better space savings. Very challenging to move away from deduplication once you turn it on.
You're talking about different levels. I use a zpool on SSDs for my databases AND give the databases lots of RAM. The two are not mutually exclusive, but of course your RAM should be utilised as close to your query as possible to increase throughput. ZFS is for integrity first; squeezing speed out of it is certainly possible but not its primary purpose.
You could still use ZVOLs. I use ext4 on top of ZFS for some virtual machines. If you make sure to use the same block size in both, the performance hit is small (approx 10%) and you will have ext4 with proper checksumming and snapshot capabilities under it.
> ZFS is a layer below LVM.
Typically you'd layer raid, then LVM, then the filesystem. ZFS tries to be all three. It's raid, and it's a volume manager, and it's a filesystem. There are some benefits to integration, and some drawbacks. With the raid>lvm>filesystem approach, it's trivial to add dm-cache, bcache, iscsi, or any other piece of storage technology. With ZFS, anything you want to add has to be specifically supported within ZFS.
The Unix tradition is small, single purpose tools that do one thing well. Witness sort, grep, wc, etc. Want to count the log entries that mention Slashdot? You don't need a special tool for that, just grep slashdot | wc -l . Tools like mdadm and lvm are building blocks that can be combined to suit your need, the Unix way. ZFS is a big monolithic package that does everything, much like Microsoft Word or Outlook. ZFS is more in the Microsoft tradition.
It is basically the non-BSD, non-GPL version of btrfs. :)
You have that backwards.
btrfs is the non-BSD, GPL version of ZFS.
Is there a good way to calculate how much RAM you need? I'm considering ZFS for my next server build. It'll be around 10TB.
We don't have a state-run media we have a media-run state.
Hey, I'm the guy who got modded +5 funny for replying to the 8/10TB disk announcement with "of course they did, I ordered 6TB drives 2 hours ago". Well, I switched my home NAS over to ZFS last month. So, yay for me, for once I'm ahead in at least some minimal sense or other!
Seriously though, I have found ZFS to be a damned good solution so far. (FYI, CentOS, Core i5, 4GB, 6x4TB with 2-disk parity, 2 eSATA -> port multipliers...) I really don't think I will ever deploy hardware RAID again.
Except FreeBSD can't do Xen + ZFS. BSD is good for a lot of things and so is Linux. A good sysadmin picks the right tool for the job. I'd like to think the BSD project benefits from more people using ZFS.
I think you're giving the wrong idea here. I have yet to find a format of storage capacity that zfs won't support, with one exception: you can't create a zvol on a zpool, then attach that zvol as back-end storage for the same zpool. That is specifically disallowed, and I'm guessing that you can't use a zvol from one zpool to back-end another zpool either. This is a very bizarre (also, probably dumb) thing to do, but even this can be overridden if you're really desperate. For more practical applications, everything else just works: at least in FreeBSD, you can "hide" the block devices behind all different kinds of abstractions to provide 4k writes, encryption, whatever, and zfs will consume those virtual block devices just fine.
ZFS on Linux would be cooler if they could port Time Slider to Linux from Open Solaris. http://java.dzone.com/news/kil...
With ZoL there were issues with metadata heavy workloads (rsync being one of the prime examples) with 0.6.2, however with 0.6.3 those issues have been significantly improved by many accounts.
FreeBSD has had ZFS for what, over five years now? They are the reason it exists in any actual use (OpenSolaris/Illumos don't count) on any non-Sun/Oracle platform.
God forbid it take the Linux guys longer to get it up and running when Sun purposely licensed it to be difficult to do so on Linux.
And Linux's wannabee ZFS competitor BTRFS (oooh, look at us) sucks so bad it can't get off the ground.
So, this being Linux, some guys* also designed Btrfs to do the same things in the meantime. How dare they!? Sun released ZFS after 4 years of work; Btrfs, 2. Presumably they were working under more of an "agile" setup? Which doesn't really make sense for an FS but hey.
So what does Linux do.... import (steal) ZFS from OpenZFS/FreeBSD
It's called porting, and I don't see how you can call it "stealing" in any honest way.
and start posting about how great all their work with ZFS is, and how Linux bloggers now say 'oh yeah, ZFS is actually solid, so we can use it'. As if they are the only/first ones to certify ZFS.
If you actually skim the article he is saying ZFS On Linux is ready, not ZFS itself.
Thing is, ZFS was always solid. When bashing ZFS Linux was really just babbling about ZFS's more open and free BSD License and their own failure of BTRFS.
Was there bashing of it? Being on Slashdot only since 2007/8 I thought it was more Linux people being irked that they couldn't play with it due to the licensing rather than saying it was crap.
Also, I really hope you're aware that the CDDL and the BSD License(s) are not the same thing. ZFS is CDDL.
If you want an integrated system that just works, try FreeBSD.
You're using "just works" and ZFS in the same argument? With a straight face? The intersection of "Just Works" and people who use ZFS has to be pretty small. If you want Just Works just slap an ext3 or ext4 partition on your desktop and be done with it.
* Interestingly, Wikipedia says Btrfs is (was?) actually an Oracle project. Oracle, of course, bought Sun, which made ZFS. So maybe "competitor" isn't entirely accurate?
Unity? Screw that: XFCE. Slashdot Beta? Screw that: SoylentNews. Australis? Screw that: Pale Moon. UX developers DIAF
iSCSI doesn't need to be baked into ZFS, in fact, even on Illumos it isn't. It's in a completely different subsystem and will happily work with any block device as its backend storage (be it a physical drive, a ZFS zvol, a loopback block device or anything else, really).
Anything that can be represented as a block device can be added to a zpool. This also includes files, which is handy when you're trying to understand complicated interactions: you can mock up a small zpool based on files instead of devices for testing.
On the other side of the abstraction, ZFS can also expose block devices called zvols that are backed by the zpool. So if you wanted to run a dm-crypted EXT4 filesystem backed by a zpool, you can certainly do that using a zvol and still get all the benefits of ZFS integrity protection and snapshotting.
Plenty of layering can be done with ZFS.
The recommendations vary a lot, mostly because it also depends on whether you're accessing a subset of your files more frequently than others. 1 GB of RAM per TB of storage seems good for most loads, but ZFS will use as much as you throw at it.
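To make the "depends on your access pattern" point concrete, here is a toy cache simulation in Python. Loud caveat: this is a plain LRU cache, not the ARC (which is an adaptive replacement cache and considerably smarter), and `hit_rate`, `hot_set` and `streaming` are names invented for this sketch. It only illustrates why extra cache memory helps some workloads and not others.

```python
from collections import OrderedDict


def hit_rate(accesses, cache_size):
    """Fraction of accesses served from a plain LRU cache of `cache_size` entries."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(accesses) if accesses else 0.0


hot_set = [i % 10 for i in range(1000)]    # the same 10 files read over and over
streaming = list(range(1000))              # every read touches a new file

hot_rate = hit_rate(hot_set, cache_size=16)       # working set fits in the cache
stream_rate = hit_rate(streaming, cache_size=16)  # nothing is ever read twice
```

When the working set fits in the cache, nearly every read is a hit; when every read is unique, no amount of cache produces hits, which is why RAM recommendations differ so much between workloads.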
More like Big Baby BSD, amirite?
Are you suggesting that if A = B, then B = A? :)
It depends partly on what features of ZFS you'll be using, and what types of performance you need. In general, you can run ZFS for an arbitrarily-large disk set with about 2GB of RAM - but you won't be using the memory cache features of ZFS much at all. The more ram you have available, the more it'll assign to the ARC (read cache). If you are running a media fileserver, where every read is a large file and is unique, then the ARC doesn't make much difference. If it's a webserver, where you read the same small files over and over, it's a huge difference. Things like compression and larger checksums also can take slightly more RAM.
The one thing you really can compute is deduplication: if you turn it on, you need something like 5GB of RAM per TB of data to be deduped, or performance goes to hell. This is to store the dedup lookup tables (which are put in the ARC) - if you can't fit them into RAM, every read/write adds having to read them into RAM, look up where the data is, and then load the data. (Which can mean several reads per IO op.) Note that you don't have to dedup the entire dataset - it's on a per-filesystem basis. (And ZFS makes creating filesystems trivial.) Still, it's best to leave it off unless you have ungodly amounts of RAM to throw at it, and know you are storing heavily duplicated data.
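The rules of thumb quoted in this thread (about 2GB as a workable minimum, roughly 1 GB per TB for general loads, and something like 5GB per TB of deduplicated data) can be turned into a back-of-the-envelope calculator. This is only a sketch: the function name `estimate_ram_gb` and the way the three figures are combined are my assumptions, not an official sizing formula.

```python
def estimate_ram_gb(pool_tb, dedup_tb=0.0,
                    per_tb_gb=1.0, dedup_per_tb_gb=5.0, floor_gb=2.0):
    """Rough RAM estimate from the thread's rules of thumb (assumed combination).

    pool_tb   -- total pool size in TB
    dedup_tb  -- portion of the pool (in TB) stored in deduplicated filesystems
    """
    if pool_tb < 0 or dedup_tb < 0 or dedup_tb > pool_tb:
        raise ValueError("need 0 <= dedup_tb <= pool_tb")
    base = max(floor_gb, per_tb_gb * pool_tb)   # general caching, never below the floor
    return base + dedup_per_tb_gb * dedup_tb    # extra room for the dedup tables


no_dedup = estimate_ram_gb(10)                  # the 10TB server asked about above
some_dedup = estimate_ram_gb(10, dedup_tb=2)    # same pool, 2TB of it deduplicated
```

For the 10TB build this gives 10 GB without dedup and 20 GB with 2TB of deduplicated data, which lines up with the advice to leave dedup off unless you have RAM to spare.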
'Sensible' is a curse word.
If you intend to send the snapshots over the network, as is often the case with rsync, you need to pair it with some independent communication tool, and since the output of "zfs send" tends to be very bursty, you need a sizable memory buffer.
For all the technobabble in that summary, I still don't know what ZFS offers me over other filesystems. Maybe the guys working on the system should do a little marketing course, or work on their 'elevator pitch'...
Here's my attempt...
1.) ZFS does software RAID as its normal mode of existence. It's naturally contested as to whether this is a good thing, but it depends on context. ZFS doing software RAID on a busy MySQL server? Not great. ZFS doing software RAID on a FreeNAS box whose lot in life is to shuffle data to and from a bank of hard disks? Better.
2.) Datasets. These are best described as the lovechild of folders and partitions. Like partitions, they can have their own mount points, their own permissions, storage quotas, and their own compression settings. Like folders, it's possible to have dozens of datasets on a volume, and then let the dataset use as much of the volume's storage capacity as needed, and dynamically expand or contract them as necessary.
3.) Snapshots. If you're used to Windows, think "Shadow Copies", but easier to work with.
4.) Deduplication. This *can* be dangerous, but deduplication can be enabled on a per-dataset level, so if you have a known set of data that has massive duplication (e.g. a dozen Windows VM disks for a test environment), it can save a whole lot of hard disk space.
5.) ZFS brings a lot of the functionality of the more expensive SCSI cards to commodity hardware with basic drives, and can do its thing with a hodgepodge of disks. This is useful if you're like me and think it's useful to have a RAID-6 array with drives from several vendors to help mitigate the risk of a homogenous manufacturing run.
6.) Not a feature of ZFS directly, but ZFS and FreeNAS/Nas4Free/Nexenta have a rather symbiotic relationship. If a NAS is built running a BSD distribution explicitly designed for storage, these distros make it extremely easy to manage the storage array and use the data transfer protocols best suited for the task at hand - all support FTP, SMB, iSCSI, and NFS, with some more exotic stuff generally available as well.
Have you confirmed using a zvol underneath a zpool, and if so was it a different zpool?
I've wanted to do that in the past, but it was specifically blocked. It's a pretty ugly thing to do, but it does give you a "new" block device that could be imported as a mirror on-demand. With enough drives in the zpool, that new device is nearly independent from its mirror, from a failure perspective.
Does the sky fall in if your buffer isn't 'sizable'? Or does it just run a bit slower?
Thank you so much, that's very informative!
I've been a member of the Church of Parity ever since I discovered that some of my dutifully backed-up family photos had not only gotten corrupted, but the backup dutifully copied the corruption as well. Ever since, I use backup tools which do a parity check (e.g. Unison) and I try to store important things on ZFS if I can.
In my case I was lucky and I had an older backup without the corruption. But lesson learned... Also, have more than one backup :)
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
But it still won't stem the flow of Linux refugees to *BSD due to the trash that is being put into the Linux ecosystem such as systemd.
Not really. Fundamentally, a filesystem's job is to store data in a structured manner on an unstructured array of blocks. For everything ZFS does, it still comes down to that.
There are a great many advantages to having that structure include duplicate blocks and checksums.
If you really prefer, you can reasonably build a non-redundant ZFS pool on top of a RAID volume though you will lose a few advantages that way.
Back when I did OpenSolaris work, we used a tool called mbuffer, which is basically netcat with a buffer on each end. It wouldn't be suitable for internet backups (no encryption), but it works pretty well for cross-campus backups and the like.
IIRC it works like this on the sending side: 'zfs send pool/fs@snap | mbuffer -s 128k -m 4G -O 10.0.0.1:9090'
And on the receive side: 'mbuffer -s 128k -m 4G -I 9090 | zfs receive pool/fs'
It can still be pretty bursty but it smoothes out a lot of it.
Several others posting here, fans of ZFS, are saying on this very page that ZFS really needs to be accessing bare disks. It will allow you to use another block device, they say, but data corruption is highly likely.
# touch /export/lun/0
# sbdadm create-lu -s 10g /export/lun/0 # file backing
# sbdadm create-lu /dev/rdsk/c1t1d0s0 # raw disk backing
The Linux implementation lacks support for the Archive, Hidden, ReadOnly, and System file attributes; those are needed for MS-DOS support:-]
I was talking about iSCSI support, not how the ZFS pools themselves are built. Yes, ZFS *does* prefer raw disk backing (due to certain management simplifications), but it does not need it. In fact, it's quite possible to run ZFS on top of RAID arrays, and people frequently do.
You must have an enormous collection of Linux Distributions at home to need that much storage.
I was promised a flying car. Where is my flying car?
You know you can lay a zfs filesystem on files, right?
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
And was very impressed. It was a new 4-drive system I'd put together to operate as both a NAS/fileserver and a host for virtual machines. I had originally intended to use RAID 5, but decided to give ZFS a try after reading about it. My initial config had it booting Ubuntu (maybe Mint? I don't recall), with ZFS for Linux installed as the main non-boot filesystem with one-drive redundancy. I had all sorts of problems with drives dropping out of the array, which I eventually tracked down to the motherboard shipping with bad SATA cables. ZFS handled this admirably. At first I didn't notice one of the drives had dropped, and continued using the system for about a day. When I got the drive working again, as I understand it RAID 5 would have had to do a complete array rebuild because of the changed files. ZFS noticed most of my old data was on the "new" drive and simply validated the checksums as still accurate, then noticed I had written new files and automatically created new redundancy files for them on the "new" drive. The entire "rebuild" only took a little over an hour instead of the 20+ hours I was expecting (how long it takes me to backup the data over eSATA).
If you're wondering why ZFS trusts the checksums on the "new" drive instead of reading the entire file: it reads the entire file and compares it to the checksum every time you access it anyway. Once a month by default, it runs a "scrub" where it reads every file and verifies they haven't suffered bit rot and still match the checksums. Apparently the strategy after a dropped drive is to get the redundant filesystem up and running again ASAP, then do the file integrity scrub afterwards at its leisure. (You can manually force this check at any time with zpool scrub.)
The other main advantage I'd say is that it's incredibly flexible when you're putting together redundant arrays. RAID 5 normally requires 3+ drives or partitions of the same size. ZFS lets you mix together drives, partitions, files (yes, one of your ZFS "drives" can be a file on another filesystem), other devices like SAS drives, etc. You can even put the 3+ "drives" needed for redundancy onto a single drive if you just want to play around with it for testing.
The only problem I ran into was with deduplication. Dedup was part of the reason I decided to try ZFS, and is one of the features frequently mentioned by ZFS advocates. While dedup does work, it is an incredible memory and performance hog. Writes to the ZFS array went from 65+ MB/s (bunch of mixed random files) down to about 8 MB/s with dedup turned on, and memory use climbed to where I ordered more RAM to bump the system up to 16 GB. In the end I decided the approx 2% disk space I was saving with dedup wasn't worth it and disabled it.
I eventually switched to FreeNAS (based on FreeBSD, which has a native port of ZFS) because it was annoying having to reinstall ZFS for Linux after an Ubuntu/Mint update, and I couldn't see myself doing that after every new release because I wanted features which were added to the core OS. (And if you're wondering, dedup performance is just as bad under FreeNAS.)
The cost of this approach has always been performance. It is faster, for example, to use grep's -c switch than to pipe grep's output into wc -l (as is commonly done in poorly-written scripts).
When it comes to storage, the performance penalty of using separate layers, which aren't well-aware of each other, becomes big enough to justify integration...
In Soviet Washington the swamp drains you.
The sky won't fall but the walls might.
-Shaka
I've used ZFS on Linux for years now and it's been fantastic for my long-term storage. One 2 TB drive runs all the time and I power another one up periodically to auto-sync with the first one. That saves power and drive wear vs an always-on RAID1 setup.
Rsync causes a lot of metadata lookups, which will fill the ARC metadata in a hurry. If the ARC max isn't set, you'll OOM the box (or crash ZoL if it is, I think). I'm not sure how to monitor or control it on ZoL, because ZoL's memory management is still independent of the kernel's....
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Data corruption isn't likely. The reason to use whole, unpartitioned disks is performance: it lets ZFS control the disk caches directly.
No, the insinuation is that ZFS is approximately a decade more mature and proven than that upstart knockoff, btrfs.
> The cost of this approach has always been performance.
Uh, no.
Sure, in the specific case you cite, that's correct, because grep can easily count the number of lines it outputs. In the general case, however, you'll find a pipe is probably faster, because the two processes can run on different cores.
Certain required features could not be done if they were separate features. In order to properly do certain things, the 3 layers must understand each other. I'm not talking about "fun" features, I'm talking about problems that have been plaguing data centers and there was no other way.
No, sun purposefully licensed it as BSD to be easy to use anywhere.
No, ZFS is CDDL licensed. ZFS is not BSD licensed. If it were BSD, Linux could pull it in trivially. That's the whole point.
hes using just works with the word freebsd. and yes it does, freebsd is not a distribution slapped together with blah blah blah
Well then this is offtopic. The whole rest of the conversation is about ZFS, not BSD vs. Linux in general.
Do you have an example? The storage system I'm using provides every important feature I'm aware of in ZFS, and it keeps the layers separate. As ZFS has matured, it seems to be a way of getting all of those features out-of-the-box, without needing to think about how to put it together. LVM is one volume manager that provides most of the same features, though. Then put your choice of filesystem on top of LVM. Can you think of any feature that actually requires the volume manager to be stirred together with the filesystem?
On a gigabit network with the standard 64K pipe buffer, you'll trickle along at just a few megabits per second.
To get all the features of data integrity and error correction you need to avoid hardware raid with ZFS.
If ZFS controls the drives and you have striping or mirroring and one of the drives has corrupt data, ZFS can log an error and fix the data corruption. If hardware RAID controls the drive, it may realize a copy of the data is bad, and pass back the good copy, but RAID won't log an error nor fix the bad block. ZFS won't fix it either because it won't know since RAID handled returning the correct data to ZFS.
ZFS is self healing if it is NOT on top of hardware RAID. To reap all of the benefits of ZFS you need lots of RAM, that RAM should be ECC and two or more disks without hardware RAID.
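Here is a toy Python sketch of the self-healing idea described above, for a two-way mirror. It is an illustration under stated assumptions, not how ZFS is implemented: `checksum` and `read_with_self_heal` are invented names, SHA-256 stands in for whatever checksum the pool uses, and each side of the mirror is just a bytes value in a list.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Stand-in block checksum (SHA-256 chosen only for illustration)."""
    return hashlib.sha256(data).digest()


def read_with_self_heal(copies, expected):
    """Return a copy matching `expected` and overwrite any bad copies with it.

    `copies` is a mutable list holding one bytes value per mirror side.
    Returns (data, repaired_indexes); raises IOError if no copy is intact.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("all copies fail their checksum; restore from backup")
    repaired = []
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good        # rewrite the damaged side from the good one
            repaired.append(i)      # ...and remember it so the error can be logged
    return good, repaired


block = b"family photo"
expected = checksum(block)           # stored separately from the data itself
mirror = [b"family phoXo", block]    # side 0 has silently rotted
data, repaired = read_with_self_heal(mirror, expected)
```

The filesystem can only do this because it sees both copies and holds the checksum. If a hardware RAID controller picks a copy on its own, the layer holding the checksum never gets the chance to choose or repair.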
vi +
In a filesystem like EXT3, if you open a file, seek to some offset, and write new data, EXT3 will write the new data to the existing disk block in place. ZFS, however, will allocate a new block for that offset (copy on write), write the modified data to it, and update the block chain. The result is that it's apparently very easy to badly fragment a ZFS file (do a Google search for "ZFS fragmentation" to see various stories and tests people have written).
You can apparently mitigate the problem by occasionally copying the entire affected file -- Oracle's own whitepaper on the subject apparently reads, "Periodically copying data files reorganizes the file location on disk and gives better full scan response time."
Bottom line: ZFS is not a panacea, nor is it simple. There are myriad options, and trade-offs to all of them.
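The fragmentation effect is easy to see in a toy model. The Python sketch below is a deliberately naive illustration, not ZFS's allocator (real allocators are far smarter than "always append past the end"); `count_extents`, `overwrite` and `rewrite_whole_file` are names invented here. A file is represented as a list mapping each logical block to a physical block number.

```python
def count_extents(layout):
    """Number of contiguous physical runs met when reading the file front to back."""
    if not layout:
        return 0
    return 1 + sum(1 for prev, cur in zip(layout, layout[1:]) if cur != prev + 1)


def overwrite(layout, logical_blocks, copy_on_write):
    """Return the new logical->physical layout after rewriting the given blocks."""
    layout = list(layout)
    next_free = max(layout) + 1        # toy allocator: always append past the end
    for lb in logical_blocks:
        if copy_on_write:
            layout[lb] = next_free     # new data lands in a freshly allocated block
            next_free += 1
        # in-place update: the block keeps its physical address, nothing moves
    return layout


def rewrite_whole_file(layout):
    """Copy the file: every block is reallocated contiguously, in logical order."""
    start = max(layout) + 1
    return list(range(start, start + len(layout)))


original = list(range(8))                          # 8 blocks, laid out contiguously
in_place = overwrite(original, [5, 1, 6], copy_on_write=False)
cow = overwrite(original, [5, 1, 6], copy_on_write=True)

extents_in_place = count_extents(in_place)                 # 1
extents_cow = count_extents(cow)                           # 6
extents_after_copy = count_extents(rewrite_whole_file(cow))  # 1
```

Three small random overwrites turn one contiguous run into six under this model, and copying the whole file brings it back to one, which is the mitigation the whitepaper quote describes.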
More than that, since you're effectively virtualizing your EXT4 filesystem, you can expand it pretty easily too. You're backed by a storage pool, which means you can expand that pool by adding or replacing drives, and then simply resize the EXT4 filesystem live. EXT4 need not know about the fact that you've added a new raid array to the storage pool.
They're working on fixing that, but in the mean time you can pipe it through mbuffer or something similar to resolve the issue.
You're assuming that the single process isn't multithreaded. ZFS is multithreaded.
> Want to count the log entries that mention Slashdot? You don't need a special tool for that, just grep slashdot | wc -l .
Not since journalctl took over.
You're not really understanding how ZFS does and can work. It already has hooks to provide 'features' such as you talk about. It does require crossing several traditional Unix boundaries, that's true, but it's an accepted trade-off to get the benefits that go with ZFS, and the hooks to include such features at the typical boundary points still exist in the ZFS code. Pretending that ZFS has to be totally and completely aware of what you hook in isn't really fair. What you hook in has to integrate with the API, which is well defined, and that really isn't any different than with the approach you seem to prefer.
And for reference: dm-cache and bcache are not needed with ZFS; L2ARC already covers them, and it does it better because it knows what's going on across all 3 layers. I seem to have no problem doing iSCSI sharing of ZFS storage space, nor do I seem to have any problem using iSCSI targets as part of vdevs. Hell, technically you can still use dm-cache and bcache with ZFS, if you're ignorant enough to do so. You can even run whatever file system you want on top of zvols. You'd be stupid to do it in most cases, but the ability is there if need be.
Since you want to use the word Unix, let's get a few things clear. Linux does not have, and likely never will have, a Unix certification. Sun on the other hand had two operating systems that were certified Unix, and they were doing it before Linus had a computer to start Linux on. Drop the 'my OS does it right' bullshit because your OS isn't what you're claiming it to be, and the system you're arguing against was written by people who did make something you're claiming it isn't.
I don't disagree with the Unix tradition in the least, compartmentalized code with strong boundaries and good interoperability wherever possible ... and occasionally you tear down the walls for specific reasons. Graphical performance is an example where your philosophy sucks, which is why Windows kicks the ever living shit out of Linux performance. Note: Linux, NOT Unix. SGI had a terrific graphics stack as an example, and Sun's wasn't too horrible.
Yeah, that's the thing about systemd- it's not Unixy. It might be great, but it's not designed according to Unix principles.
> Drop the 'my OS does it right' bullshit because your OS isn't what you're claiming it to be,
Where did I say one approach was right and the other wrong? In fact, I said each approach has its advantages and disadvantages. What I said is that ZFS is not designed according to the Unix tradition of "do one small thing, and do it right". Apparently you agree that's the case:
> don't disagree with the Unix tradition in the least, compartmentalized code with strong boundaries and good interoperability wherever possible
That's why some who appreciate the Unix approach hate systemd. It would be more at home on Windows.
Re Sun, if you look at an old Sun Solaris box, you'll find some of it was written by a guy named Ray Morris. Coincidentally, this post was also written by Ray Morris.
Yet the kernel does scheduling, memory management, user access control, filesystems, device management, TCP/IP, power management, etc.
> ZFS is a layer below LVM.
Typically you'd layer raid, then LVM, then the filesystem. ZFS tries to be all three. It's raid, and it's a volume manager, and it's a filesystem. There are some benefits to integration, and some drawbacks. With the raid>lvm>filesystem approach, it's trivial to add dm-cache, bcache, iscsi, or any other piece of storage technology.
ZFS can have things added on top just as trivially. I've never done SMB in ZFS; I used Samba. On Solaris, I did NFS with ZFS, but on Linux, I've kept it in /etc/exports.
With RAID+LVM+filesystem+network share, I need to unshare and umount, then resize the LVM volume, resize the filesystem (if it is allowed), mount and export. With ZFS I do zfs set quota= zpool/directory. On the fly with no downtime. That alone is worth the difference.
ZFS does compression also, and you can toggle it back and forth while running: zfs set compression=on zpool/directory and every new write is compressed; zfs set compression=off and new writes are not compressed.
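To make the "one command, no downtime" point concrete, here is a minimal sketch that only builds the `zfs set` command lines described above and prints them; it does not run anything. The dataset name `tank/data` and the `500G` value are hypothetical placeholders, and the helper `zfs_set` is made up for illustration.

```python
import shlex

def zfs_set(dataset, **props):
    """Build one `zfs set property=value dataset` argument list per property.

    Nothing is executed; the caller decides whether to run the commands.
    """
    return [["zfs", "set", f"{name}={value}", dataset]
            for name, value in props.items()]

# "tank/data" and "500G" are hypothetical placeholders.
for argv in zfs_set("tank/data", quota="500G", compression="on"):
    print(shlex.join(argv))
```

Running the printed commands as root against a real dataset would change the properties on the live filesystem, which is the whole point being made above.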
ZFS also does checksums to verify the data. If you have a dodgy sata cable or controller firmware that corrupts your data, it will be detected. If ZFS is doing RAID, it has a 2nd copy of that data and can self correct. If it doesn't have a 2nd copy, it will shut the filesystem down so nothing else gets corrupted. You can't do that with hardware RAID.
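The detect-and-self-correct behaviour can be sketched in a few lines. This is a toy model, not ZFS code: it assumes a two-way mirror, uses SHA-256 as a stand-in checksum, and the class name `MirroredBlock` is made up for illustration.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # SHA-256 is a stand-in here; which checksum a real pool uses is configurable.
    return hashlib.sha256(data).digest()

class MirroredBlock:
    """Toy model: two copies of one block, plus a checksum kept apart from the data."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = checksum(data)

    def read(self) -> bytes:
        # Find any copy whose contents still match the stored checksum.
        good = next((bytes(c) for c in self.copies
                     if checksum(bytes(c)) == self.expected), None)
        if good is None:
            raise IOError("every copy fails its checksum; nothing to repair from")
        # Self-heal: overwrite any copy that differs from the verified one.
        for i, copy in enumerate(self.copies):
            if bytes(copy) != good:
                self.copies[i] = bytearray(good)
        return good

block = MirroredBlock(b"important data")
block.copies[0][0] ^= 0xFF          # simulate a dodgy cable flipping bits on one disk
assert block.read() == b"important data"
assert block.copies[0] == block.copies[1]
```

The key design point is that the checksum is stored separately from the data it covers, so a disk that returns wrong bytes cannot also vouch for them.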
> With ZFS, anything you want to add has to be specifically supported within ZFS.
Not true at all.
The Unix tradition is small, single purpose tools that do one thing well. Witness sort, grep, wc, etc. Want to count the log entries that mention Slashdot? You don't need a special tool for that, just grep slashdot | wc -l . Tools like mdadm and lvm are building blocks that can be combined to suit your need, the Unix way. ZFS is a big monolithic package that does everything, much like Microsoft Word or Outlook. ZFS is more in the Microsoft tradition.
Some of the things that ZFS does would be difficult if it wasn't the whole system. Newer systems like btrfs and ceph take a similar approach.
I dunno about ZFS for Linux, but FreeNAS's ZFS has NFSv4 ACLs. Are these not sufficient?
Editor, A1-AAA AmeriCaptions
It just runs slower. Only 4GB for a machine with a bit of traffic on ZFS is a bad idea but it gets things done eventually - even sending and receiving snapshots of a few TB.
Yes, the example is a crap machine, and a 32-bit "netburst" Xeon as well, but crap machines are things that can get used to test things to destruction before going near production machines. I was using it with FreeBSD 9 since ZFS was more advanced on that platform.
As far as I'm aware, you don't need 'dump' with ZFS. You create a snapshot, then 'zfs send' that snapshot off to your backup storage. Can be done on a live filesystem. Delete the local snapshot when you're done copying it off. ( http://docs.oracle.com/cd/E187... )
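A minimal sketch of that snapshot, send, delete sequence follows. It only builds the shell command strings and prints them. The dataset names, the host `backuphost`, and the snapshot tag are hypothetical, and piping `zfs send` through `ssh` into `zfs receive` is just one common way to get the stream onto the backup storage.

```python
import shlex

def backup_commands(dataset, remote_host, remote_dataset, tag):
    """Return the three shell commands for a one-off snapshot backup (not executed)."""
    snap = f"{dataset}@{tag}"
    q = shlex.quote
    return [
        f"zfs snapshot {q(snap)}",                       # instant, works on a live filesystem
        f"zfs send {q(snap)} | ssh {q(remote_host)} zfs receive {q(remote_dataset)}",
        f"zfs destroy {q(snap)}",                        # drop the local snapshot afterwards
    ]

# All names below are hypothetical placeholders.
for cmd in backup_commands("tank/home", "backuphost", "backup/home", "nightly"):
    print(cmd)
```

One caveat on the design: an incremental send (`zfs send -i`) needs the previous snapshot to still exist, so deleting the local snapshot every time means every backup is a full one.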
The zpool and zfs layers could be considered separate in the same way - especially with things like zfs send to move filesystem snapshots to other pools, which are usually on other machines. The filesystem does not get influenced by the nature of the pool and vice versa, so long as it's big enough to fit. Global options on pools (e.g. no atime) are really just passed down to the filesystems instead of being a zpool operation.
It's really more like if LVM and ext4 were done by the same development team than a totally different and totally monolithic approach.
No, the insinuation is that ZFS is approximately a decade more mature and proven than that upstart knockoff, btrfs.
If they had only BSD licensed it, the latter might never have been built. :) Just another case of qt syndrome I guess.
Btrfs does have a few advantages - it is a bit more flexible in some respects, mostly in areas where the more enterprise-y ZFS audience isn't going to care.
I used ZFS under FreeBSD. It was good for a few months until it got slow and I needed to defrag it. Oh no, ZFS is too good for a defrag tool, so I zapped it and installed Debian with XFS, which is much faster and comes with an online defrag tool.
One of the few reasons I stick to ext4 and XFS under Linux too! I got burned by having no way to defrag JFS, ReiserFS or UFS/FFS under the BSDs.
If I call fsync(), I legitimately expect it to never return before the data is on disk, or else to return with an error. Any other behavior is just completely unacceptable and a rather severe fault.
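From the application side, that contract looks roughly like this. It is a sketch using Python's standard `os` module, with a made-up helper name `durable_write`. Either the function returns after `fsync` has succeeded, or it raises `OSError`.

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data to path and return only after fsync() has reported success."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                      # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # raises OSError if the flush fails
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "example.bin")
durable_write(path, b"must survive a power cut")
```

Note that this only covers the file's contents. Making a newly created directory entry durable usually also requires an fsync on the containing directory, which is left out here.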
You listed some great examples, examples of the opposite of what you probably meant to show.
Take scheduling: there are what, six different interchangeable, removable kernel modules to do scheduling in different ways, including the option to not do it at all. The scheduler only does scheduling, and nothing else. The rest of the kernel doesn't know or care about the scheduling. You mentioned filesystems as well. Yep, you can choose from dozens of different filesystems. The rest of the kernel doesn't care which filesystem you're using, because those other modules do their job and nothing more. You can use any scheduler with any filesystem.
Enter zfs, a popular volume manager similar to LVM. It just manages volumes, so you choose whichever filesystem to lay on top. Er, no. If you want to use the ZFS volume manager, you probably need to use the ZFS filesystem. That's cool, it'll also provide an extra level of resiliency on top of that great hardware raid you have. Actually, not so much. It doesn't play nicely with most enterprise storage hardware. You need to use dumb hardware and use ZFS raid to avoid problems. Wait, what? ZFS, a filesystem, is telling you which hardware to use? That's not like the interchangeable kernel modules at all.
Haha, and conversely ZFS' "rampant layering violations" approach confers advantages that btrfs cannot match. However, most Linux users will just go with what their distro rolls out.
It is somewhat ironic that Oracle is primarily responsible for both ZFS and btrfs, given that btrfs was developed because Oracle wasn't willing to dual license ZFS.
Sorry, I'm not that familiar with OpenSolaris.
Don't the first and second commands create a zpool backed by a file? That's not what's at question here, I want to know if you can back a zpool with a zvol created on that same zpool.
A quick test showed that it does work on FreeBSD to create a zpool upon a zvol from a different zpool. The circular version has made it hang for a not-insignificant amount of time...
But at least they are plainly *compatible*, aren't they? I mean, since FreeBSD has ZFS in the base system.
And CDDL is *incompatible* with GPL, isn't it? I mean, since no distro includes it and you have to graft the two together from separate sources.
P.S. - I actually USE ZFS on Linux (as well as on FreeBSD), and I love it.
I think you misunderstood my reply. I was replying to the poster talking about iSCSI having to be implemented in ZFS. That's what I was addressing.
You talk about a different thing altogether - ZFS backing. The zvol-on-another-zpool solution should work, although performance will suck. The zvol-on-the-same-zpool solution can and will hang for obvious reasons.
> ZFS is a big monolithic package that does everything, much like Microsoft Word or Outlook. ZFS is more in the Microsoft tradition.
Well, that is well within the Unix tradition. ZFS is a *kernel* module, not a userland application. Just because the CLI consists of 2 commands doesn't mean it's monolithic. It's as monolithic as ifconfig and other complex utilities.
And I'd take the zfs/zpool command format over the ugly LVM mess any day.
I tried to run backups to ZFS on an encrypted USB disk. It worked for a while, but if something fails (like the backup disk going to sleep), the entire chain hangs. I can't disconnect the crypt device, and I can't disconnect the ZFS pool; zpool and zfs hang. What I do with the USB cable and hardware no longer has any impact. I stopped doing that. (I didn't have better luck with btrfs.) I don't really blame ZFS that much, though, other than that it can't handle hung devices. USB on Linux is still flaky.
The other problem I have is that after a while it happily uses up 30GB of the 32GB in the computer, and gives it back extremely reluctantly. I can't seem to control how much ZFS will use. And the rest of the system isn't really happy with just 2GB to run programs in (several virtual machines of 8GB RAM each, for instance).
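For what it's worth, ZFS on Linux has a module parameter, `zfs_arc_max`, that caps the ARC size in bytes. A small sketch for producing the modprobe option line is below. The 8 GiB figure is only an example, the helper name is made up, and the file it would go in (commonly something like /etc/modprobe.d/zfs.conf) depends on the distro.

```python
def arc_max_option(gib: int) -> str:
    """Return a modprobe option line capping the ZFS ARC at `gib` GiB."""
    if gib <= 0:
        raise ValueError("ARC cap must be a positive number of GiB")
    return f"options zfs zfs_arc_max={gib * 1024**3}"

# 8 GiB is an arbitrary example value, not a recommendation.
print(arc_max_option(8))   # options zfs zfs_arc_max=8589934592
```

The option takes effect when the module is next loaded, so on a box like the one described it would leave the remaining RAM for the virtual machines.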
Is this something I should consider for MySQL databases?
The btrfs project was started in 2007, before Oracle purchased Sun.
> Haha, and conversely ZFS' "rampant layering violations" approach confers advantages that btrfs cannot match.
What advantage does ZFS have over btrfs, other than maturity? You do realize that both are virtually identical as far as "rampant layering violations" go, right?
Ah, so there is an equivalent to RAID-Z (1, 2, etc, and no, MD is not an acceptable substitute), the ability to use a ZIL type log device (and bring it on/take it offline without dismount), dedupe, and probably other features I'm forgetting but haven't bothered to check again since I last compared them about two years ago?
RAID-Z, yes (but more flexible, though it is experimental and definitely not usable yet). RAID1 is production-ready (and it works like RAID-Z, not like traditional RAID1). ZIL does not exist at all yet. Dedupe does exist.
It is a COW filesystem that does checksumming of all blocks/etc, manages storage pools, doesn't have the RAID write hole, etc, just like ZFS. It tends to be more flexible as you don't have the limitation of being unable to add drives to a "vdev." However, it is far less mature and many features are still just on the roadmap.
Right now my one RAID1 has 4x1TB drives and 1x3TB drive, with 3.5TB of usable space. This is because RAID is implemented at the chunk level, not the drive level, so that 3TB drive can be mirrored across all the other drives (of course, that makes it a bit of a bottleneck for writes). I could convert that to a RAID5 (which is basically the equivalent of RAIDZ in zfs) without having to rewrite all the data - the already-written chunks would just stay in raid1 mode and new chunks would be allocated in raid5 mode unless I told the filesystem to migrate everything.
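The 3.5TB figure falls out of a simple rule for two-copy, chunk-level mirroring: every chunk needs its two copies on two different drives, so usable space is the smaller of "half the total" and "everything except the largest drive". A quick sketch, with a function name made up for illustration and ignoring metadata overhead:

```python
def raid1_usable(sizes_tb):
    """Usable capacity of chunk-level two-copy mirroring over unequal drives."""
    total = sum(sizes_tb)
    # Each chunk lives on two different drives, so the largest drive can only
    # be mirrored against the combined space of all the others.
    return min(total / 2, total - max(sizes_tb))

assert raid1_usable([1, 1, 1, 1, 3]) == 3.5   # the 4x1TB + 1x3TB setup above
assert raid1_usable([1, 3]) == 1              # one big drive mostly goes to waste
```

With drive-level mirroring the odd 3TB drive would be much harder to use fully, which is the flexibility being described.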
In general I think the design of btrfs is a bit more flexible. However, it is far less mature and in particular the more "enterprise-oriented" features tend to be lacking as they aren't a priority. They're really focused on different audiences.
You can kludge on encryption in the pipeline:
http://sourceforge.net/project...
Thanks for the info. FYI, RAID-5 or 6 is *not* equivalent to RAID-Z or Z2 due to the issue of the write hole.
Despite your assertion to the contrary, a cursory google showed no proof that raid with btrfs can circumvent the write hole issue, because it is a problem beneath the filesystem. The "layering violations" ZFS uses to effect the fix are what made raid-z(n) possible. I would be interested if you can provide a citation showing the write hole has been precluded in btrfs raid.
Also, it doesn't look like btrfs includes support for cache devices.
Exactly. AC keeps claiming "Sun licensed it as BSD" as a reason why "BSD is The Way"...which could be true philosophically, but it's not factual as it's CDDL.
The RAID, LVM, Filesystem approach is defunct in the modern world. Also, ZFS already incorporates multi-protocol support and the ability to turn any host with local storage into a target (via the COMSTAR framework). Not sure how much of this is in the Linux port, but I suspect that if it's close enough to Illumos, it should have these features.
ZFS is not in the Microsoft tradition; it is a departure from 20th century storage design/architecture. The very idea that there has to be a RAID/LVM/FS is archaic and has been thoroughly disproven. In my previous shop we had petabytes of storage in ZFS pools and hardly ever lost data.
The pool-based model that eliminates the layers of RAID/LVM/FS results in better performance, easier supportability and superior diagnostics capabilities.
Do you realize that almost every major storage vendor first bashed ZFS and then about 3-4 years later started building architecture that was eerily like ZFS?
My shop was one of the early adopters of ZFS, back in 2007. There were a few bugs then, but over the years I have been absolutely impressed with the efficiency and stability of ZFS.
ZFS plays just fine; the problem is that in order to fully benefit from ZFS, ZFS must manage its own redundancy. You can still use RAID5 on your SAN, but you'll still want RAID5 with ZFS, which is just that much more wasted space. You also get the disadvantage that when a drive dies in the SAN RAID, performance will take a bigger hit than it needs to, because the hardware RAID has no idea how the file system works.
In most situations, you're better off having each layer completely independent, but in the case of ZFS, it seems that when you don't make the layers entirely generic, but make them specialized to each other, the end product is much greater than the sum of the parts.
1) Yes, it's a general feature of RAID. Multiple devices are reading the data, and the "fastest finger first" wins.
2) File server only dependent on your disk format, you mean? I happen to agree here but, if you're doing it at the FS level, then just a standardised RAID layout (such as Linux md / LVM) is the same thing. The non-standard formats that tie you into hardware do so for a reason - the hardware RAID provides something that no software RAID can: sheer speed. (Though, please note, I've happily run Linux software RAID on server-end hardware in production systems without any performance problems).
3) 3 disks dying out of 11? RAID6+1 will actually do better (I think... I can't do the maths just now).
ZFS is cool, don't get me wrong, but it's basically just a RAID fs. The Merkle tree journalling trick just saves having to have battery backup, but whether it works like that in real life failures is another matter entirely.
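For readers who haven't met the term, a Merkle tree is just a tree of hashes in which each parent hashes its children, so one root value vouches for every block underneath it. The sketch below is a generic binary Merkle tree using SHA-256. It is an illustration of the idea only and makes no claim to match the ZFS on-disk layout.

```python
import hashlib

def merkle_root(blocks):
    """Return the root hash of a simple binary Merkle tree over `blocks`."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [hashlib.sha256(b).digest() for b in blocks]
    while len(level) > 1:
        if len(level) % 2:               # odd count: pair the last hash with itself
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"block 0", b"block 1", b"block 2"])
# Silent corruption anywhere below changes the root, so it cannot go unnoticed.
assert merkle_root([b"block 0", b"block X", b"block 2"]) != root
```

Because only the root has to be swapped to publish a new consistent tree, the structure pairs naturally with the atomic commit described in the summary.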
Btrfs RAID5 won't have a write hole. It is fairly equivalent to zfs RAID-Z.
Right now the btrfs raid5 code doesn't recover from any kind of disk failure at all - it is experimental (that is, it writes parity to disk, but cannot use it to recover missing data). All data written to each disk is checksummed, so in the event of a problem during write the filesystem can determine which disks are correct and which ones aren't. Also, like zfs, btrfs does not overwrite data in-place, so in the event of a failure files should end up in either the pre-write or post-write state.
I couldn't find any official docs about raid5 on btrfs at all - it is still experimental. Mailing list posts seem to suggest that it will not have the write hole when complete (such as http://comments.gmane.org/gman... ).
Now, I'm talking about btrfs directly writing to disks. If you stick mdadm in-between btrfs and the disks (avoiding the layering violation), then the result will be the same as if you stick mdadm between zfs and the disks - you'll have the write hole. The same layering violation concerns that apply to zfs also apply to btrfs as it also does volume management, raid, disk management, etc. If you run a scrub on btrfs or rebuild a disk it will only scrub/rebuild the blocks that are actually in use, etc.
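Since the write hole keeps coming up, here is a toy illustration of it. It assumes a three-disk stripe (two data blocks plus XOR parity) and models the crash simply by never updating the parity. The function names are made up, and the second function is only a cartoon of the copy-on-write idea, not either filesystem's real allocator.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def in_place_update_with_crash() -> bytes:
    """Classic RAID5: overwrite d0 in place, crash before parity is rewritten."""
    d0, d1 = b"AAAA", b"BBBB"
    parity = xor(d0, d1)
    d0 = b"CCCC"                 # new data reaches disk ...
    # ... power fails here, so parity still describes the OLD d0.
    # Later the disk holding d1 dies and we rebuild it from d0 and parity:
    return xor(d0, parity)

def cow_update_with_crash() -> bytes:
    """Copy-on-write: new stripe goes elsewhere; the live pointer flips last."""
    stripes = {0: (b"AAAA", b"BBBB", xor(b"AAAA", b"BBBB"))}
    live = 0
    stripes[1] = (b"CCCC", b"BBBB", None)   # crash before parity and pointer flip
    d0, _d1, parity = stripes[live]         # the old stripe is still self-consistent
    return xor(d0, parity)

assert in_place_update_with_crash() != b"BBBB"   # silently wrong rebuild: the write hole
assert cow_update_with_crash() == b"BBBB"        # pre-write state survives intact
```

This is why the fix needs knowledge from the filesystem layer: something has to know which stripe is live, and a block-level RAID sitting underneath has no such notion.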
I won't dispute that btrfs on the whole is nowhere near as stable as zfs currently is, but most of the key features are very similar. They are both COW filesystems that are designed to run directly against disks, they both checksum all data to protect against on-disk modification and the write hole, they both utilize their knowledge of what is on the disk to optimize how things like striping/parity are done, they both support very cheap snapshots, etc.
Anybody interested in storage at a low level would do well to have at least a general familiarity with both, regardless of preference.
> ZFS is not in the microsoft tradition
Balsa (Gnome email client): 2.5 MB, reads email. Optionally use libgtkhtml (315kb) to render HTML email.
Microsoft Outlook (Microsoft email client): Several GBs. Reads email, handles calendar, embedded mail server, task list, weather reports(?!?!), fax, RSS, HTML templates, _sharing_ calendars. Loads MS Word (several GB) to partially display HTML messages.
Let's break this down into three statements and see where we disagree:
1. It appears that the Microsoft tradition is big monolithic packages that do everything. Including weather reports embedded in their email client.
Do you disagree with that?
2. Do you disagree with the statement that the Unix tradition (from ed to grep to elm and balsa) is small, focused tools?
3. Do you disagree with the statement that ZFS is a volume manager, a filesystem, a raid-like redundancy system, and a few other things as well? In other words, that it's a big, monolithic package that does many things. Do you disagree with that?
You LIKE ZFS. I understand that. It does a lot of cool things. It does a lot of boring things. It does a lot of things. Just like Microsoft Office.
The poster I was responding to referred to "Unix tradition". The tradition started on single-CPU systems...
Even on modern multi-core computers, piping data from stdout to stdin is inefficient. Very convenient, but inefficient nonetheless. When the cost of developing something better (for shell scripts written as one-offs or rarely executed, say) exceeds the cost of the inefficiency, the inefficiency is justified.
But with storage, where the code is used thousands (millions!) of times per day, it makes complete sense to invest in developing the subsystem.
Indeed, various OS-vendors (free and otherwise) all spend a lot of effort (and money) on improving their offerings. ZFS is just an example of something better than all (or most?) of the competition.
Except that LVM is a PITA, mixing it with RAID makes it even more so, and the RAID is unaware of the actual used space, making RAID 5 or 6 very expensive. Not to mention it can't assist FS-level checksumming by restoring individual blocks; you need to fail the whole drive. Implementing network transparency at the block level is inefficient, but no other FS has ZFS connect functionality.
> 3. Do you disagree with the statement that ZFS is a volume manager, a filesystem, a raid-like redundancy system, and a few other things as well? In other words, that it's a big, monolithic package that does many things. Do you disagree with that?
I'm suggesting that concepts such as "volume manager", "filesystem" and "raid-like redundancy system" don't need to be separate entities. The concepts such as "filesystem" and "volume" etc. exist to conform to a 20th century vocabulary. And it's not that revolutionary any more. Companies like EMC, NetApp etc. went that route too... thereby simplifying things like HSM etc. at a "pool of disks" level.
> > 3. Do you disagree with the statement that ZFS is a volume manager, a filesystem, a raid-like redundancy system, and a few other things as well? In other words, that it's a big, monolithic package that does many things. Do you disagree with that?
> I'm suggesting that concepts such as "volume manager", "filesystem" and "raid-like redundancy system" don't need to be separate entities.
Absolutely they don't NEED to be.
I'm suggesting that concepts such as "mail client, calendar, fax, RSS reader and weather reports" don't need to be separate entities.
In fact, if you smash them all together into one big entity, you can sell it for $109.99 and LOTS of people will buy it.
So we agree that because those things don't HAVE to be separate, systemd combines all of them together, into one package that does everything. We also agree that:
> 1. It appears that the Microsoft tradition is big monolithic packages that do everything.
Ergo, systemd fits the Microsoft tradition, the Microsoft way of doing things. That way certainly isn't impossible - Microsoft has made billions of dollars doing it that way. Unix traditionally does things a different way.
There is a difference between a low-level tool and a high-level product. You could say that it is not "UNIX-like" to put together a bunch of individual components into a single program (monolith?) - why doesn't everyone use awk, sed, grep etc. to do their text processing? The reason why people build higher level products out of low level components is to make life easier.
I don't particularly see any problem with Word, Excel, Powerpoint etc. See, even Google does it with Google Docs... so it must be right :)
I don't know if you have ever used ZFS -- if you have, you would know what I mean. Having had to deal with migrating thousands of LUNs from one storage array to another (via host-side migration - Veritas Volume Manager), I can tell you that a product that is not "Disk and LVM and Filesystem" (aka ZFS, which is pool, vdev and volume) is a lot simpler and easier to use. I just have to replace disks at the pool level, and don't have to worry about timing the mirroring and detachment of the mirrors (of disks from 2 separate storage arrays) so that it doesn't kill the performance of my 20TB DWH running on the box. Or wanting to accelerate things using the L2ARC capabilities, etc.
Your last two posts seem to be getting at the idea that being monolithic isn't bad. I never said it was.
I said that monolithic packages are the way Microsoft and other Windows developers traditionally do things, and that small, single purpose tools are the Unix tradition.
> I don't particularly see any problem with Word, Excel, Powerpoint etc.
I didn't say there was a problem with those. I said Microsoft builds software like that. Do you disagree?