Oracle Engineer Talks of ZFS File System Possibly Still Being Upstreamed On Linux (phoronix.com)
New submitter fstack writes: Senior software architect Mark Maybee, who has been working at Oracle/Sun since '98, says maybe we "could" still see ZFS be a first-class upstream Linux file-system. He spoke at the annual OpenZFS Developer Summit about how Oracle's focus has shifted to the cloud and how they have reduced investment in Solaris. He admits that Linux rules the cloud. Among the Oracle engineer's hopes is that ZFS becomes a "first class citizen in Linux"; to get there, Oracle should port their ZFS code to Oracle Linux and then upstream the file-system to the Linux kernel, which would involve relicensing the ZFS code.
One nice thing about ZFS not being in upstream is that it is currently maintained and updated separate from the Linux kernel.
Now, it would be nice to relicense ZFS under GPL so that it can be included in the kernel. But this should wait until the port is a bit more mature. Right now development is very active on ZFS and we have new versions coming out every few weeks; having to coordinate this with kernel releases will complicate things.
All this said, relicensing ZFS would definitely help Oracle redeem themselves a bit. After mercilessly slaughtering Sun after acquiring them, they have a long way to go to get from the "evil" side back to the forces of good.
Some folks don't like the particular set of tradeoffs, but for a filesystem (as opposed to an object store, one of which I'm testing right now), it's a very good offering. I definitely want it on my Fedora dev laptop, along with a write cache on flash.
davecb@spamcop.net
Holy shit are you serious? Like SERIOUS? OMG why don't we all switch to BSD! Everyone stop! I know Linux is *everywhere* but BSD has ZFS! Did you guyz know this????
*cough*Java*cough*
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
ZFS wants to live in a fairly specific configuration. It wants a bunch of drives, a bunch of memory, and not much competition for system resources. It's really a NAS filesystem, which is why there are no recovery utilities for it. If your filesystem takes a dump, you're SOL, hope you have a backup.
You can run it on a single drive on a desktop machine, but you are incurring a bunch of overhead and not getting the benefits of a properly set up ZFS configuration.
My Other Computer Is A Data General Nova III.
The version in BSD is an older version derived from when Solaris was open-source, in 2007. It is independently maintained and a part of OpenZFS. In fact the ZFS stacks in IllumOS (a fork of open-source Solaris), FreeBSD, Linux and OS/X share a lot of code and are compatible, in the sense that if you create a ZFS filesystem on one of these OSes, it will work on the others.
OpenZFS has made enormous progress. I have been using it on my FreeBSD, Linux and OS X (macOS) boxes for over 3 years now.
> The problem with ZFS on Linux is that some aspects of it are redundant with the kernel.
Probably ALL aspects of it. Linux already has a raid implementation in-kernel. It already has filesystems. It already has multiple volume managers, which handle whichever type of snapshots you prefer. It already has IO schedulers. ZFS, or rather something that looks just like it, can be implemented as a few configuration lines for pre-existing Linux components.
Because Linux normally lets you use your choice of file system on top of your choice of volume manager, on top of whichever RAID implementation you choose, with your choice of IO scheduling options, ZFS isn't exactly the best fit. ZFS mashes all those different things into one big blob. That's not really how Linux is designed.
That's the same issue as systemd - it may (or may not) be a good init system. It may or may not be a good logging system. It may possibly be a good DNS server (probably not). But it can't seem to decide wtf it is.
Are you doing Z+1? Or just striping with an L2ARC, which is nearly pointless? What's the areal density of the drives? 'Cause if you are using anything above 2TB the odds of getting uncorrectable errors on both drives becomes non-trivial.
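To put a rough number on the uncorrectable-error worry above, here is a back-of-the-envelope sketch. It assumes the 1-in-10^14-bits unrecoverable read error rate often quoted on consumer drive datasheets and assumes errors are independent of each other; both are assumptions for illustration, not figures from this thread, and the function name is made up.

```python
import math

def p_at_least_one_ure(capacity_tb: float, ure_per_bit: float = 1e-14) -> float:
    """Chance of hitting at least one unrecoverable read error (URE)
    while reading the whole drive once.

    ure_per_bit is an ASSUMED datasheet-style figure, not a measurement.
    """
    bits = capacity_tb * 1e12 * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

for tb in (1, 2, 4, 8):
    one = p_at_least_one_ure(tb)
    # Under the independence assumption, "both drives of a pair" is just p squared
    print(f"{tb} TB: one drive {one:.1%}, both drives of a pair {one * one:.2%}")
```

Real drives often do better than their datasheet figure, and errors on a failing drive tend to cluster, so treat this as an upper-level intuition pump rather than a prediction.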
At this point you are better off using XFS with a really good backup strategy.
My Other Computer Is A Data General Nova III.
As you may know, RedHat has deprecated BTRFS in RHEL7.4 whereas many distributions like Ubuntu fully support ZFS.
I would say that the status of BTRFS is worse than that of OpenZFS on Linux. See also here for an interesting article.
Yes, there is a lot of duplicated code in ZFS for Linux, such as an SHA256 implementation, RAID parity, compression, and lately a whole crypto library.
The reason is either the kernel doesn't reliably support this natively or the implementation isn't usable. Linux doesn't allow non-GPL modules to access a lot of features (eg: the crypto library) or some features are version-specific (eg: LZ4 (de)compressor). The simplest solution is to import the Solaris versions.
But they've improved. SSE and AVX instructions are available for many of the above. And if ZFS does get re-licensed to GPL, then sure maybe we can make use of some of that stuff natively. Until then, ZFS on Linux has to deal with the reality of a non-GPL non-Linux driver on a GPL Linux kernel.
I played with zfs-fuse on KDE Neon a couple years ago after reading from its acolytes that it was "more advanced" and "better" than EXT4 or Btrfs. It wasn't. A lot of it is missing in the fuse rendition.
I switched to Btrfs. I have three 750GB HDs in my laptop. I use one as a receiver of @ and @home backup snapshots. I've configured the other two as a 2 HD pool and then as a RAID1, and then back to a pool again. In 2 1/2 years of using Btrfs I've never had a single hiccup with it.
There are some excellent posts on the KubuntuForums.net website by Oshunluver which describe how to use Btrfs to install many different distros to a single Btrfs installation, and how to use Btrfs in general.
Running with Linux for over 20 years!
I would say you are wrong.
That RH has not retained qualified Btrfs programmers is their business decision and has little to nothing to do with Btrfs or its usability.
https://www.itwire.com/open-sa...
KDE Neon User Edition has zfs-fuse and a version of OpenZFS in its repository. I've played with the fuse version and was unimpressed.
After I tried zfs-fuse I tried Btrfs. I've been using it without a single fault or problem for 2 1/2 years.
Running with Linux for over 20 years!
Whatever the reason, btrfs is not supported in production on RHEL. It has never been, it's always been in "preview" and will soon be out of the picture completely.
It's been going on for years so I would agree with the above that OpenZFS would have a brighter future.
lucm, indeed.
Just as this article popped up I was assembling a JBOD array (twelve 4TB drives) for a new data center project, my first in quite a while. Also self funded so I don't have to defer to anyone in decisions.
When I started I did a bit of reading trying to decide what RAID hardware to get. To make a long story short once I read the architecture of ZFS and several somewhat-polemic-but-well-reasoned blog entries I decided that is what I wanted.
Only two months ago I had an aged Dell RAID array let me down. I have no idea what actually happened, but it appears some error crept in one of the drives and it got faithfully spread across the array and there was just no recovering it. If I didn't have good backups that would have been about 12 years of the company's IP up in smoke. I just thought I'd share.
So I ended up as a prime candidate (with new found distrust for hardware RAID) to be a new ZFS-as-my-main-storage user. I've just recently learned stuff that was well established five years ago and I can't understand why everybody doesn't do it this way.
Wow. Snapshots? I can do routine low-cost snapshots? Data compression? Sane volume management? (I consider LVM to be the crazy aunt in the attic. Part of the family but ...) Old Solaris hands are probably rolling their eyes but this is like manna from heaven to me.
Given the plethora of benefits I am sure the incentive is high enough to keep ZFS on Linux going onward. ZFS root file system would be nice but I am more than willing to work around that now.
The issue is they wanted to own all the code so instead of donating fixes or adding hooks as needed they encapsulated the whole thing into something they could own and control.
Considering ICMP traffic is the lowest priority and frequently dropped in congestion situations, they would have been better off implementing it via their own protocol rather than piggy-backing off of that. I don't suppose I need to explain why the theory is bad, and just how DMCA violations would just stack up using this filesystem..
Why not just have a small partition with a stable system installed on it for recovery purposes? Or a live cd?
> Until mdadm and hardware RAID controllers allow you to issue a "read, but try to give a different result" operation you can't do this. (Said operation would attempt to use parity even on a healthy array in an attempt to give a different block content by pretending a disk is dead).
So until the late 1980s? That's called RAID scrubbing and I believe it was mentioned toward the end of the original RAID paper in 1987 or 1988. Certainly 10 years ago I had a "mdadm check" command in my crontab. I know this for sure because I still have a copy of my 2007 server image.
The "mdadm repair" command was also in use by then.
Cool "new feature" you've got there.
I'll respond to your other two gross misunderstandings about raid by replying to your other post.
> ZFS has checksums to figure out which is right. MDADM doesn't.
You have no idea how RAID works, do you? Neither through the mdadm UI or any other.
RAID level 2 uses Hamming error correction codes.
Levels 3 through 5 use checksums much like ZFS does. Level 6 uses two independent sets of checksums, so even if you lose half your checksums, you're still okay.
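For anyone following the parity-versus-checksum argument, here is a toy sketch of single XOR parity as used in RAID 5-style layouts. It is an illustration only, not md's or any controller's actual code; the block contents and variable names are made up. It shows two cases: a disk that is known to be dead, and a disk that silently returns wrong data.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks plus one parity block
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Case 1: disk 1 is KNOWN dead, so XOR of the survivors and parity rebuilds it
rebuilt = xor_blocks([data[0], data[2], parity])

# Case 2: disk 1 silently returns a wrong byte; nobody reported a failure
on_disk = [data[0], b"BXBB", data[2]]
mismatch = xor_blocks(on_disk) != parity  # the stripe is visibly inconsistent

# Parity alone gives a different "repair" for each suspect disk,
# with nothing in the stripe saying which suspect is the right one
candidates = [
    xor_blocks([b for j, b in enumerate(on_disk) if j != i] + [parity])
    for i in range(len(on_disk))
]
print(rebuilt, mismatch, candidates)
```

A second, independent parity (RAID 6 style) or a checksum stored apart from the data is what lets a system go from "this stripe is inconsistent" to "this block is the bad one".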
>. if there is an API to allow you to ask for data from a specific disk rather than letting the RAID driver pick one, I'm interested.
An API to read from sda? Uhm, it's called read(). You very simply read from sda or whichever drive rather than reading from md0. That's how you can boot from a RAID 1 partition without the BIOS or bootloader knowing anything about RAID - it just reads from any of the member disks.
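A minimal sketch of what "just read() the member disk" looks like, comparing the same byte range on each mirror member. The device paths in the usage note are placeholders, reading them needs root, and whether a given offset on a member lines up with the same offset on md0 depends on the array's metadata layout, which this sketch simply assumes. The function names are made up for illustration.

```python
import os

def read_at(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` straight from a file or block device."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)

def copies_agree(member_paths, offset: int, length: int) -> bool:
    """True if every mirror member returns identical bytes for the range."""
    blocks = [read_at(p, offset, length) for p in member_paths]
    return all(b == blocks[0] for b in blocks)
```

Usage would be something like copies_agree(["/dev/sda1", "/dev/sdb1"], 0, 4096). Note that a False result only says the members disagree; it does not say which one holds the correct data.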
He hopes that... but he has no decision power, i bet. Maybe he's on the next firing list.
This is Oracle that we're talking about, it's more likely they'll let you license ZFS for a couple thousand per month...
I apologize for the lack of a signature.
It's probably more accurate to say that the version in Solaris is a fork of an older version. Most of the ZFS developers left Oracle quite early on after they bought Sun and most of the rest left when Oracle decided to stop releasing CDDL versions of Solaris. The version that ended up in OpenZFS has been actively developed by the same people who created ZFS for the last 10 years. The version that Oracle owns has had a few incremental changes. This also means that it would be difficult for Oracle to GPL ZFS in a useful way: the version that's had all of the work done on it for the last decade contains a load of CDDL code that Oracle doesn't own.
I am TheRaven on Soylent News
Add to that, a lot of the most active ZFS developers work on the various OpenSolaris forks. A GPL'd version is completely useless to them. It's also not clear if Oracle even could release a GPL'd version. If they've taken any code from OpenZFS, then their version will include CDDL'd code that they don't own the copyright to, which would make relicensing impossible without replacing all of that code.
I am TheRaven on Soylent News
> One nice thing about ZFS not being in upstream is that it is currently maintained and updated separate from the Linux kernel.
And that's actually a huge problem that makes it a major obstacle to its upstream adoption.
Mainly due to code duplication.
ZFS (and its competitor BTRFS) is peculiar, because it's not just a filesystem. It's a whole integrated stack that includes a filesystem layer on the top, but also a volume management and replication layer underneath (ZFS and BTRFS on their own are the equivalent of a full EXT4 + LVM + MDADM stack).
That is a necessity, due to some features in these: e.g. the checksumming going on in the filesystem layer is also useful to determine correct copies in case of bitrot in the replication layer.
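The point about checksums picking the correct copy can be sketched in a few lines. This is a toy model, not ZFS or BTRFS code: it assumes the filesystem layer keeps a SHA-256 of each block somewhere separate from the block itself, and the function names are hypothetical.

```python
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def read_with_self_heal(copies: list, expected: str) -> bytes:
    """Return the first copy matching the stored checksum and
    overwrite any copy that does not match (the self-heal step)."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("every copy fails its checksum; nothing to heal from")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # rewrite the rotten copy from the good one
    return good

block = b"important data"
expected = checksum(block)                # kept by the filesystem layer
mirror = [b"important dat\x00", block]    # copy 0 has silently rotted
result = read_with_self_heal(mirror, expected)
print(result, mirror[0] == mirror[1])
```

A replication layer that sits below the filesystem and never sees the stored checksum can tell that the two copies differ, but has no basis for choosing between them, which is the argument for integrating the layers.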
But how this is handled is the big difference between ZFS and BTRFS.
ZFS on Linux just packs all the needed bits together with it.
It comes with its own volume management and replication code.
That is a duplicate of functionality existing elsewhere in the kernel.
And duplication is always bad for maintenance.
BTRFS, being developed on Linux, tries to leverage existing kernel facilities as much as possible:
- the Zstd compression currently being introduced to BTRFS, uses the same routines as the Zstd compression being introduced into the kernel loader : both leverage the in-kernel compression facilities of the crypto modules
- the device mapper facilities are used by lvm, mdadm and dmraid but also by btrfs. There was a plan to develop code to support more than 2 parity blocks (more than RAID6), that would have been beneficial to both btrfs and mdadm.
That's why developers complain of boundaries/layers violation with ZFS but not about BTRFS.
ZFS comes with its own tangled mess of layers, BTRFS is just a wrapper around facilities already existing in-kernel.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
In contradistinction ZFS takes a holistic, unified approach:
* Volume Management <--> File Management <--> Block
{...}
ZFS works because it intentionally "Flattened the stack" -- Yes, this runs counter to the layered Unix approach
The problem is that ZFS implements this by rolling everything into the same "rampant layering violation" package.
It is one single "flattened stack".
On the other hand, BTRFS shares as much code as possible with in-kernel facilities (it leverages "device mapper" routines that are also used by lvm, mdadm, mdraid, etc.; it leverages in-kernel compression routines that are also used by the kernel loader and the crypto module, etc.)
It's not so much a "rampant layering violation" as a wrapper around layer facilities already existing in the kernel.
-- but sometimes that is NOT the best design decision.
So basically, the problem isn't the overall design, but the actual code re-use vs. re-write.
Meanwhile Oracle keeps flailing about with Btrfs.
Btrfs works. It's in the kernel, it's a first class filesystem in openSUSE, and its copy-on-write facilities are extensively used for versioning with snapper.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
OpenZFS is not even in the kernel; btrfs is in the upstream kernel. It's silly to think that OpenZFS will ever get better RHEL support than btrfs.
I use btrfs on openSuSE and it works great.
Cheap storage VM.
Right, btrfs has proven itself as a stable filesystem which is not prone to corrupting itself on single or mirrored drive configurations.
Except it isn't, and does so regularly.
The practical implications of zfs over btrfs far outweigh the architectural encapsulation of zfs. This limitation primarily relates to arc, a situation which has plagued freebsd, illumos, and even Solaris since the changeover from sparc to x86. It is drastically better now across the board, particularly on linux, where the native memory mapping has been taught to play nice.
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Please remind me not to let you administer my filesystems.
http://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/
https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/
https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=26303271#p26303271
https://www.csparks.com/ZFS%20Without%20Tears.html
Unfortunately, it's not quite there. Very close though.
https://antergos.com/wiki/miscellaneous/zfs-under-antergos/
Just ask SUSE:
Just learn to read the docs if you insist on having esoteric options turned on.
In 2017, RAID56 are still marked incomplete.
Modern filesystems are a huge pile of diverse features; some are fully stable and used in production (e.g.: RAID0 and 1), others are still in development (e.g.: RAID56).
Complaining that BTRFS is complete crap because RAID5/6 isn't fully functional and production ready is like complaining that the venerable XFS is utter crap because its copy-on-write and snapshotting doesn't work yet.
(and BTW, in-band deduplication doesn't even exist yet in BTRFS. ZFS is supposed to have it, but is an ultra-massive performance killer from what I've heard)
(auto-defrag works, but is a write-performance killer. The alternatives are no defrag at all, which is a read-performance killer, or out-of-band defrag, which requires maintenance and kills snapshot correlation.
These are all problems specific to copy-on-write (ZFS, BTRFS) and log-structured (UDF, F2FS) filesystems.)
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
How about these?
Friends don't let friends use BtrFS for OLTP.
Use the right tool for the right job. If you care about your data and you want to use ZFS, it makes sense to use FreeBSD for its top-notch ZFS implementation. Use it when it's the best tool for the job. When I set up a home NAS I specifically went with FreeBSD because this is a genuinely excellent feature, and ZFS on Linux is not yet up to scratch in comparison. We also use it on FreeBSD VMs at work for CI and testing work where the snapshot support is worth having. But at home and work everything else is still Linux on Ubuntu, Ubuntu LTS and CentOS as appropriate; we use them rather than FreeBSD because they have features and usability which FreeBSD does not. We use ZFS on Linux where appropriate as well.
A drive can correct for errors if a block is bad. Problem is, as areal densities increase, the odds of data changing randomly increases. This is mainly due to cosmic rays or other natural sources of radiation, but there can be other factors. The drive doesn't know anything about the data itself, it only knows if it can read a block or not, and that's really the way you want it. You want the drive to be structure and data agnostic. Otherwise you would need a specific drive for a specific file system, which would be a nightmare.
My Other Computer Is A Data General Nova III.
BTRFS is the future. ZFS is an incredible memory hog.