OpenZFS Project Launches, Uniting ZFS Developers
Damek writes "The OpenZFS project launched today, the truly open source successor to the ZFS project. ZFS is an advanced filesystem in active development for over a decade. Recent development has continued in the open, and OpenZFS is the new formal name for this community of developers, users, and companies improving, using, and building on ZFS. Founded by members of the Linux, FreeBSD, Mac OS X, and illumos communities, including Matt Ahrens, one of the two original authors of ZFS, the OpenZFS community brings together over a hundred software developers from these platforms."
I love ZFS, if one can love a file system. Even for home use. It requires a little bit nicer hardware than a typical NAS, but the data integrity is worth it. I'm old enough to have been burned by random disk corruption, flaky disk controllers, and bad cables.
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
If this gets us BP-rewrite, the holy grail of ZFS, I'll be a happy man.
For those who don't know what it is - BP-rewrite is block pointer rewrite, a feature that has been promised for many years but has never arrived. It's a lot like cold fusion in that it's always X years away.
BP-rewrite would allow implementation of the following features
- Defrag
- Shrinking vdevs
- Removing vdevs from pools
- Evacuating data from a vdev (say you wanted to destroy your old 10-disk vdev and add it back to the pool as a vdev with a different number of disks)
Oh well. I'd somehow hoped "truly open source" meant BSD license, or LGPL.
what John Siracusa will be doing this weekend.
Not to rain on anybody's parade, but will the commercial holders of ZFS allow this? Or will they unleash some unholy patent suit to keep it from happening?
As long as Oracle's patents are valid, can anyone seriously believe this will go anywhere?
His fleet of boats isn't going to pay for itself.
http://lkml.org/lkml/2005/8/20/95
I wish them best of luck. ZFS is the best FS out there.
Does this mean we might finally get ZFS for Windows?
It's about time ZFS went open. I feel like the only reason btrfs got any traction was ZFS licensing issues.
I'm sure I'll be corrected if I'm wrong, but does it offer any advantage over BTRFS? I'm not trying to start a flame war; I'm honestly asking.
Everything else is already handled with LVM and software RAID.
You have a great sense of humor, keep it up.
That. Those who don't understand ZFS are condemned to reinvent it, poorly.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
It would be nice if the posted article or even the OpenZFS project home page provided some sort of summary of the benefits and objectives of this effort.
What is the difference between this ZFS and Oracle's ZFS? If I have to patch the kernel either way, why should I choose one or the other?
Using a small, fast SSD as a cache for large, slow disks can be awesome for some workloads, mostly servers with many concurrent users.
To do that with ANY filesystem, bcache is now part of the mainline kernel. dm-cache does the same thing, and there is another one that Facebook uses.
I wish they had encryption... *sigh*
No, I don't want workarounds, I want it to be built in to ZFS like in Solaris 11.
Not sure what you mean. You certainly can set up a mirrored pair (or triplet or quadruplet), but you can also set up what's referred to as raidz, where it stripes the redundancy across multiple disks. You can configure how much redundancy... 1, 2, or more disks if you like. You can also tell ZFS to keep multiple copies of blocks, and it will spread those copies out among the disks. You can set that policy per sub-volume (file system in zfs-speak), so that if you decide that some of your data deserves more redundancy, you can set up a folder that will keep 2 copies of everything, but leave all the other folders at 1 copy. It's super geeky. I've had it detect (and correct) corruption in a failing disk, detect corruption because of a flaky disk controller that would otherwise pretend to work fine, and detect corruption when a SATA cable came loose. Combined with the ECC RAM in the server, I feel more comfortable about the integrity of my data than I ever have. I've lost family photos before to random drive corruption, so I'm sensitive to this stuff :)
Hope they'll implement encryption feature ASAP, fuck Oracle for locking in the good stuff from v29 and on.
I am sick and tired of these clueless greedy fucks who kept fighting against the tide at the cost of the community, it's not like they can keep it out of the open for ever, eventually someone will make something better for free, all they did was piss people off.
Geek card please.
One point to be extremely clear on however - when you set copies = 2 on a folder level, it does NOT guarantee those copies end up on different physical spindles. Early on there were many people who lost files because they skipped RAID thinking that copies=X would protect their data. It is NOT meant as a means to protect against hardware failures.
How does ZFS compare to btrfs? Several intentionally unnamed and unlinked commentaries on ZFS's apparent current omission from Mac OS X refer to btrfs as the more GPL-compliant alternative to ZFS. I need more information, as I do not think btrfs has the same aggressive checksumming and automatic volume sizing that ZFS does.
Thanks.
Kriston
Raid with 1/2/3/etc (distributed) parity sectors is good for protecting against 1/2/3/etc entire drive failures, but it offers no protection against 2/3/4/etc [minus one with each full disk failure] corrupt sectors in a row, which render the whole group corrupt. Complete data copies are a horrible waste of space to combat random data corruption, considering a typical HDD has somewhere around 1 read error per ~12TB read. Even assuming a dying drive with 100 sector errors per GB [0.038% corruption with 4KB sectors], the space needed to be able to completely fix it is a mere 800KB (with Reed-Solomon) instead of needing 1GB (with repetition).
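For anyone who wants to check those numbers, here is a rough Python sketch of the arithmetic. It rests on assumptions that are not in the comment: "1GB" is taken as 2**30 bytes, one Reed-Solomon symbol is idealised as one whole sector, and fixing an error at an unknown position is taken to cost two parity symbols (the usual Reed-Solomon bound). The function names are made up for illustration.

```python
SECTOR_BYTES = 4096      # 4KB sectors, as in the comment
GIB = 1024 ** 3          # assumption: "1GB" means 2**30 bytes

def corruption_fraction(bad_sectors, region_bytes=GIB, sector_bytes=SECTOR_BYTES):
    """Fraction of the sectors in the region that are bad."""
    total_sectors = region_bytes // sector_bytes
    return bad_sectors / total_sectors

def rs_parity_bytes(bad_sectors, sector_bytes=SECTOR_BYTES):
    """Parity needed to fix errors at unknown positions: 2 symbols per error.

    Idealisation: one Reed-Solomon symbol is one whole sector.
    """
    return 2 * bad_sectors * sector_bytes

def repetition_bytes(region_bytes=GIB):
    """A full second copy costs as much space as the data itself."""
    return region_bytes

if __name__ == "__main__":
    print(f"corrupt fraction: {corruption_fraction(100):.3%}")
    print(f"Reed-Solomon parity: {rs_parity_bytes(100) // 1024} KiB")
    print(f"full copy: {repetition_bytes() // 1024} KiB")
```

Under those assumptions the 100-bad-sector case comes out at 819200 bytes of parity, which is the 800KB figure quoted above.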
Going and reimplementing it so it can be available under another license loses you the patent protection of the original code.
This is the ingenious and evil usage of 'copyleft' licenses. You don't have to stop redistribution of the original code, or even derivatives, you just have to ensure it's unusable to anybody else under alternative terms (especially if theirs are radically different than yours.)
I'd always wondered when we'd start seeing the licensing arms-race, and the first skirmishes by those wishing ill are already upon us, many with the money to back them up.
What are the chances of the exact same sector being corrupt on at least three disks in a raidz2 vdev? This doesn't seem like a plausible scenario.
ECC RAM is an important part here, due to how scrubbing works in ZFS. The background disk scrubbing can check every block on the filesystem to see if it still matches its checksum, and it tries to repair issues found too. But if your memory is prone to flipping a bit, that can result in scrubbing actually destroying data that was perfectly fine until then. The worst case impact could even destroy the whole pool like that. It's a controversial issue; the odds of a massive pool failure and associated doom and gloom are seen as overblown by many people too. There's a quick summary of a community opinion survey at ZFS and ECC RAM, but sadly the mailing list links are broken and only lead to Oracle's crap now.
That's what you have backups for.
What might change related to ZFS on FreeBSD?
BTW the "ZFS On Disk Specification" document is a very interesting read. It inspires confidence.
Uhm... rerun failing checksum calculations one or more times. If they stop failing, it was likely a memory issue. In that case, rerun 10 or 20 more times to ensure that it's not a question of the harddisk failing only some times. Problem solved?
ZFS doesn't have ECC, but it does checksum each block, so it can detect per-block errors. If you have valuable data, you can set the copies property to some value greater than 1 for that dataset and it will ensure that each block is duplicated on the disk, so if one copy fails a checksum then the other will be used to recover. If you have three disks, you can use RAID-Z, which loses you 1/3 of the space (not 1/2) and allows any single-disk failure to be recovered. Running zpool scrub will make it validate all of the data and, whenever a read fails its checksum, recover the data from the other two disks.
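To make the checksum-plus-copies idea concrete, here is a toy Python sketch. It is not how ZFS is implemented (ZFS keeps the checksum in the parent block pointer, and the Block class and scrub function here are invented for illustration); it only shows the principle that a checksum tells you which copy is bad, and a second copy lets you repair it.

```python
import hashlib

class Block:
    """One logical block kept as several copies plus a checksum of the good data."""

    def __init__(self, data: bytes, copies: int = 2):
        self.checksum = hashlib.sha256(data).digest()
        self.copies = [bytearray(data) for _ in range(copies)]

    def _is_good(self, copy) -> bool:
        return hashlib.sha256(bytes(copy)).digest() == self.checksum

    def read(self) -> bytes:
        """Return the first copy that matches the checksum and repair the others."""
        for copy in self.copies:
            if self._is_good(copy):
                good = bytes(copy)
                for i in range(len(self.copies)):
                    if not self._is_good(self.copies[i]):
                        self.copies[i] = bytearray(good)   # self-heal
                return good
        raise IOError("every copy fails its checksum; block is unrecoverable")

def scrub(blocks):
    """Read every block so bad copies get repaired; return the unrecoverable count."""
    unrecoverable = 0
    for block in blocks:
        try:
            block.read()
        except IOError:
            unrecoverable += 1
    return unrecoverable

if __name__ == "__main__":
    photo = Block(b"family photo", copies=2)
    photo.copies[0][0] ^= 0x01            # simulate a flipped bit in one copy
    print(photo.read())                   # served from the intact copy
    print(scrub([photo]))                 # nothing left to repair
```

With copies=1 the same corruption is still detected, but there is nothing to repair from, which is why the checksum alone is detection and not correction.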
The reason it doesn't use ECC is that ECC doesn't mesh well with the failure modes of disks. ECC is used in RAM because when it gets hot, hit by a solar ray, or whatever, it is common for a single bit to flip (in a single direction, which makes the error correction easier). In a disk, you typically have an entire block fail, not a single bit. Modern disks use multiple levels, so the smallest failure that is even theoretically possible might be a single byte (or nibble) in a block. And since the failure isn't biased, you'd need a fairly large amount of space. A better approach would be for the filesystem to generate something like Reed–Solomon code blocks for every n blocks that are written. This would allow single-block errors to be recovered, as long as the other blocks are okay. The down side of this approach is that the error correcting block would need to be rewritten whenever any of the other blocks is modified. This might be relatively easy to add to ZFS, as it uses a CoW structure, so block-overwrites are relatively rare (although erasing a lot of data would require a lot of checksums to be recalculated). This would mean that a single-block write would end up triggering a lot of reads and that would hurt performance. For ZFS, this might actually be easier to implement, as blocks are written out in transaction groups and so including an error correction block at the end might be a fairly simple modification.
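A minimal Python sketch of the "one correction block per n data blocks" idea follows. To keep it short it uses plain XOR parity instead of Reed-Solomon, so it can rebuild exactly one lost block per group, and it assumes something else (such as the per-block checksum) has already identified which block is bad. The function names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_parity(data_blocks):
    """Parity block for a group of n data blocks."""
    return xor_blocks(data_blocks)

def recover(data_blocks, parity, lost_index):
    """Rebuild the block at lost_index from the parity block and the survivors."""
    survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
    return xor_blocks(survivors + [parity])

if __name__ == "__main__":
    group = [b"aaaa", b"bbbb", b"cccc"]
    parity = make_parity(group)
    damaged = list(group)
    damaged[1] = b"\x00\x00\x00\x00"      # block 1 failed its checksum
    print(recover(damaged, parity, 1))    # rebuilt from the other blocks
```

It also shows the cost mentioned above: changing any one data block means the parity block for the whole group has to be recomputed.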
I am TheRaven on Soylent News
That depends on the reason for the failure. If it's because there's a little bit of dust on the platter, or a manufacturing defect in the substrate, then it's very unlikely. If it's because of a bug in the controller or a poor design on the head manipulation arm, then it's very likely.
This is why the recommendation if you care about reliability more than performance is to use drives from different manufacturers in the array. It's also why it costs a lot more if you buy disks from NetApp than if you buy them directly: they're the same commodity drives, but NetApp tests batches, discards the least reliable ones, and ensures that you don't have two disks from the same production run in the same array. You're still getting the same drives you can buy elsewhere for a fraction of the price, but you're getting more diversity.
You clearly have not been paying attention to the news, have you?
After the leaks of Snowden regarding general malfeasance from security agencies against the encryption standards that we require to communicate safely and securely (like with your bank, just saying) you can't trust any software that you can't build (or know other people more capable can't build) from scratch.
The GPL guarantees that no stupid institution or individual has free rein to corrupt the computational resources you are using.
Other licenses wax lyrical on this, and the consequence is that your precious Apple OS and applications are now tainted, because you have no way to know if they have backdoors or not.
What does this have to do with ZFS you ask?
Well, encryption. ZFS has the capability to encrypt the datasets you are using, but unfortunately its license would not make it suitable for truly secure encryption in the cases where the company or individual implementing it (Oracle, ahem, ahem) chose not to make the source code available.
At that point you have no way to know if backdoors have been added to your implementation of ZFS.
So again, how is GPL, a license that is protecting your security, the problem?
IANAL but write like a drunk one.
Or just run ECC memory! :)
If you are hell bent on this, just partition the drive into however many parts you feel is sufficient and run raidz across them. With 5 partitions and a single redundant partition, you would only use up 1/5 of your drive on parity. It's a hack, but I'm not aware of any production-worthy filesystem that can do what you want.
Thanks for the clarification, my text is misleading. It will spread the blocks out, but randomly - there is no guarantee that the copies will end up on separate disks (unless you are using mirrored vdevs).
Bcache does read caching and your choice of write-through or write-back. I believe that's the same thing ZFS offers. If you know of some difference in the caching, please specify what you are referring to.
Obviously ZFS is a volume manager, a filesystem, a file server AND a caching solution. Bcache does one thing and does it well - caching. Volume management is a separate thing handled by a volume manager such as LVM, though LVM can serve as a front end to bcache, allowing the user to manage both with one set of tools.
I don't know why everyone's freaked out by a silly little murder.
Futurist Traditionalism
If you have valuable data, you can set the copies property to some value greater than 1 for that data set and it will ensure that each block is duplicated on the disk so if one fails a checksum then the other will be used to recover.
The sector numbers have long been decoupled from the physical location on disk of the data, for good reason, see: Swapping in a spare sector for one that is going bad.
I'm wondering how anything but a device driver ensures that writing the same exact block of 1's and 0's to a device doesn't result in the same exact hash, and is thus NOT automatically de-duplicated by the hardware and stored in a single location?
In other words: In a post-physical-addressing world (and post-NSA world) you should use an encrypted file system. One can almost create whole drive encryption using ZFS, but it's a bit of a kludge using wrappers and what-not. Further, ZFS supports compression and de-duplication. Ensure the de-duplication option is off. Otherwise, if you want to ensure duplicate sectors, do it at the drive level with RAID.
I thought dealing with single-bit RAM failures was a little more complicated than that?
As I understand it, a failure is caused by a change of voltage in a stored bit. If the voltage change places the stored value between the 0 and 1 thresholds, the state becomes fuzzy. The failed bit can then easily be detected and its original state calculated using the ECC data. However, if the change in voltage is enough to produce a valid-looking bit flip, the ECC data can detect there has been an error in the block but not which bit has been changed.
Why would a RAM failure be in any particular direction?
there is nothing controversial. they say very early on: garbage in, garbage out, and ECC is a must if you value your data. the only thing controversial is why ECC isn't standard in ALL computers.
you're risking corruption too with any other filesystem and bad memory during ANY operation which can lead to a write to the file system.
ZFS at least guarantees your data integrity
OpenZFS' creation as an organization was announced today.
OpenZFS, the software stack, has been part of FreeBSD (9.2, since July) and FreeNAS (9.1, since August).
Does the open non-Oracle filesystem stack predate the organization?
Welcome to the Panopticon. Used to be a prison, now it's your home.
I've had 2 drives die on me at once in a RAID before, and all it takes then is a single read error during rebuilding to cause uncorrectable errors.
I have been looking into ZFS as a replacement to my Linux + mdadm server (backups scripted to AWS account). May I ask what your current setup is OS and hardware wise?
Hardware:
- Old HP Core 2 Duo workstation from eBay with 4GB ECC RAM
- Extra SATA controller, both for performance and to give me a 5th plug for when I'm replacing drives.
- A 5-drive caddy that replaces the old drive cage.
- 2x 2TB drives and 2x 3TB drives
Software:
- FreeBSD 8.x
Configuration:
- Boot from USB2 Thumb Drive (which I periodically clone to a second, identical thumb drive for instant recovery)
- Drives are mirrored in pairs, for a total capacity of 5TB
I put this together a couple of years ago. If I did it today, I might make some different choices.
For instance, the HP Microserver sells new for about what I paid for the workstation used. It supports ECC RAM and apparently runs FreeBSD well. I would probably choose that, as it would be a cleaner build. This was not available when I put my machine together.
I might consider the Linux version of ZFS, but probably not, since keeping the kernel up to date would be a pain. I would also consider FreeNAS. I tried it back when I decided to use FreeBSD instead, and it was not ready for prime time. It seems to have improved a whole lot, and makes setup and maintenance easier. Not that FreeBSD is hard to use, but it is different from Linux and so you need to learn the new set of tools (like the ports system). I would go with the 9.x branch if I built today - 8.x was the stable branch when I did my build, and FreeBSD is really good about supporting the older branch.
I started with a motley collection of dissimilar drives, which is why I went with pairs. I would be able to get more usable space from the drives by running raidz instead of mirrored pairs - but only if you upgrade drives all at once. My setup lets me replace them in pairs. If I was buying everything brand new, I would probably choose raidz with 2 redundant drives... but I'm on a budget!
Finally, if you use WD Green drives, they work fine but make sure you disable the stupid head parking feature! One of mine beat itself to an early death by parking its heads every 10 seconds for about a year :) There used to be 4096-byte sector size issues with some drives, but I think they have been sorted out.
Thank you for the very informative response. I was thinking of FreeNAS for the sake of simplicity but I haven't looked into it enough to see if I can manually tweak the ZFS settings. I started my *nix tinkering on a pentium 166 running FreeBSD 4 then moved on to Linux once I discovered Debian around 2002. Haven't used BSD since but I am sure I can dive back into it.
One problem is the cost of ECC machines. Intel only offers it on their more expensive processors (despite it being cheap to implement). I haven't seen very many ARM chips with ECC either. AMD has it on all of their processors, but AMD isn't really very competitive with Intel.
Back when I set this up, I spent a couple of weeks playing with various solutions in VirtualBox. It is especially easy to play with ZFS, since you can "yank" and add drives so easily inside VirtualBox. You can even simulate corruption by writing to the disk images. FreeNAS was very tempting at the time, but still had some things I couldn't work around. They seem to have put a lot of work into it since then.
Which is why raidz3 might be a better option for people with super critical data than raidz2 :) I do get your point, though. It was for that reason that I migrated from raidz to raidz2; I didn't like the idea that if a single disk failed, I had no protection from read errors during a rebuild.
It's also worth commenting on what a few other people have said, which is that copies=2 doesn't put the data on different disks (it's not the same as mirroring). This is partially true; ZFS tries hard to get the data onto different disks and different vdevs, it's just not guaranteed (it depends on available space).
Err, what? there's $60 Pentium Gs and even Atoms with ECC support...
Single-bit errors in DRAM are caused by the capacitor that stores the data being discharged. This means that the transitions happen in one direction: from charged to discharged. With parity RAM, you can tell that an error has occurred, but you can't tell what the error is. The parity and ECC checks happen in the digital circuitry and so have no knowledge of the analogue state. Since ECC uses Hamming codes, it can detect more than single-bit errors, but it can only fix one bit flip (the bias isn't actually required, but it does make the code shorter).
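For the curious, here is a small Python sketch of a Hamming code fixing a single flipped bit. It uses the textbook Hamming(7,4) layout purely for illustration; as far as I know real ECC DIMMs use a wider code over 64-bit words with an extra overall parity bit, so treat this as the principle and not the hardware. The function names are made up.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] as a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4      # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Fix at most one flipped bit; return (data_bits, error_position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3    # 1-based index of the bad bit, 0 if clean
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position

if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                       # flip the bit at position 5
    print(hamming74_decode(word))      # original data plus the repaired position
```

The syndrome bits spell out the position of the flipped bit in binary, which is why exactly one flip can be located and fixed. Two flips in this plain 7-bit code would be "corrected" to the wrong word, which is what the extra parity bit in wider codes is there to catch.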
Copies=X does offer protection against some kinds of failures, particularly bad blocks. The other copies ARE guaranteed to be on different blocks. And the algorithm WILL put copies on different drives if it finds drives with free blocks.
In a raidz setup or a stripe where all the disks were added at the same time (eg at pool creation) and are of the same size, then typically the drives will have the same amount of free blocks.
I would be able to get more usable space from the drives by running raidz instead of mirrored pairs - but only if you upgrade drives all at once
I thought the "zpool replace" is for that purpose? I am planning a ZFS setup, but all I can afford is a motley collection of dissimilar drives.
Can't I replace one disk at a time when upgrading?
Bingo Dictionary - Pragmatist, n. A myopic idealist.
You can replace a disk any time, but the pool won't use the entire capacity of the disk if it is larger than the others.
I chose a mirror because it wasted the least amount of space:
mirror1 - 500GB drive and 750GB drive
mirror2 - 2x 2TB drives
So my mirror gives me 2TB + 500GB = 2.5TB with 2-drive redundancy. 250GB of my 750GB drive was not used. Later I swapped both smaller drives for 3TB drives when they went on sale on Black Friday for $89. When zfs saw the new space, it increased the pool size to 5TB. So now my array doesn't waste any space with 2-drive redundancy.
Had I chosen to do a raidz originally, I would have wasted a lot of space because each drive would be limited to the smallest individual drive's space. So with 1-drive redundancy I would have had 3x500GB = 1.5TB, or with 2-drive redundancy I would have had 2x500GB = 1TB. That obviously wasn't going to be my strategy :) If I switched to raidz with 1-drive redundancy today, I would get 3x2TB = 6TB. 2-drive redundancy would get me 4TB. My mirrors give me sort-of 2-drive redundancy. Obviously, it depends which drives :) Since this is mostly a backup server with no unique data on it except for TV media center storage, I've judged this an acceptable risk :)
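The capacity comparison above can be written down as a rough Python sketch. The two helper functions are hypothetical and ignore metadata, padding and reserved space; they only encode the rule of thumb used in this thread, namely that a mirror holds what its smaller disk holds and a raidz vdev treats every disk as the smallest one. Sizes are in GB.

```python
def mirror_capacity(pairs):
    """Usable space of a pool of 2-way mirrors (each mirror = its smaller disk)."""
    return sum(min(pair) for pair in pairs)

def raidz_capacity(disks, parity):
    """Rough usable space of one raidz vdev (every disk counts as the smallest)."""
    return (len(disks) - parity) * min(disks)

if __name__ == "__main__":
    before = [500, 750, 2000, 2000]
    after = [3000, 3000, 2000, 2000]

    print(mirror_capacity([(500, 750), (2000, 2000)]))    # mirrors, old drives
    print(raidz_capacity(before, parity=1))               # raidz1, old drives
    print(raidz_capacity(before, parity=2))               # raidz2, old drives

    print(mirror_capacity([(3000, 3000), (2000, 2000)]))  # mirrors, new drives
    print(raidz_capacity(after, parity=1))                # raidz1, new drives
    print(raidz_capacity(after, parity=2))                # raidz2, new drives
```

With the drive sizes from the comment this gives 2500 / 1500 / 1000 for the original drives and 5000 / 6000 / 4000 after the upgrade, matching the figures quoted.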
ZFS is a filesystem, so it can not have ECC. RAM memory sticks have ECC.
ZFS has checksums, which is better than ECC, because most ECC RAM can only detect and correct a single bit flip, and detect double bit flips. But ZFS uses SHA256 or fletcher checksums, which can detect and correct much more than single or double bit flips.
In a raid-z setup you can rebuild the block from parity, and if you've got raid the point is moot. I would not consider a bad block a hardware failure, that's a hardware error - bad blocks are expected behavior out of spinning rust. The allocator makes no guarantees about copies being on different spindles - it simply prefers to split blocks across vdevs if possible (but there are no guarantees it will do so, and there are many instances where it won't). If you've got multiple partitions on a single disk, it will not delineate those vdevs from vdevs on another physical drive. Furthermore if you have an entire drive die and don't have raid, your pool is going to be toast anyways, regardless of your copies setting. What matters is: COPIES IS NOT A REPLACEMENT FOR RAID. That's the gist of the matter and the point I was making (and thought I was rather clear about).
That's an utterly idiotic means of getting error correction plus it's still worthless against 2 sequential blocks being damaged. Raid5 isn't at all suited for data recovery on a single drive and the read/write performance would be beyond shit (a RAID4 without striping would be a slightly less bad idea).
Writing a small script that periodically runs a parchive derivative on recently modified files/directories is a far better (in space required and robustness) "hack" but a file system implementing it would still be far superior.
That's an utterly idiotic means of getting error correction
Agreed.
plus it's still worthless against 2 sequential blocks being damaged
Why? You could wipe out an entire partition and still have data integrity.
Raid5 isn't at all suited for data recovery on a single drive
Agreed. Though to be pedantic I was suggesting raidz, not RAID5.
a RAID4 without striping would be a slightly less bad idea
I don't think zfs offers anything like RAID4.
Writing a small script that periodically runs a parchive derivative on recently modified files/directories is a far better
I was thinking something like taking snapshots and then running parchive against the snapshot? I haven't put much thought into this - drives are cheap.
but a file system implementing it would still be far superior.
Yes, but it is kind of an edge case... yeah, you sometimes get some SMART warning about a failing drive, but who wants to image a failing drive and then apply parity tools? At that point you might as well just put in a new drive and restore from backup.
I was proposing this "solution" facetiously - no sane person would live with such a setup.
I was thinking of "upgrading" btrfs to zfs, but it looks like it will be a downgrade given my variety of disks. Btrfs, while not supporting RAID 5, handles a variety of disks very efficiently. IIRC there is already the problem with zfs that you can't shrink a dataset.
Thanks, I need to do more research.
SHA256 is not an error correcting code. It can not correct even single-bit flips. If it could, it would be useless as a cryptographic hash. If you could take a hash and some data that was close to the data for which the hash was computed, and find the single-bit flip that would allow the data to match the hash, then you'd have a very easy way of creating SHA256 collisions. And if you had such an algorithm, you wouldn't use it in a filesystem, you'd use it to break all of the systems that rely on SHA256 collisions being difficult to create. If you want error correcting codes in a filesystem, then you'd use an error correcting code, not a cryptographic hash.
Please read a book on coding theory before saying something highly retarded. There are many Error Detecting and Correcting codes other than the distance-4 Hamming codes hardwired into RAM chips. Also HASHING is NOT a DATA RECOVERY nor a DATA COMPRESSION algorithm. Even if you hash just a single data sector back when they were 512 bytes, it will still take 256^512 different combinations before finding the correct contents in the worst case. 256^512 is more time than is needed to brute force EVERY ENCRYPTED MESSAGE THAT HAS EVER BEEN SENT OR WILL BE SENT WITH AES.
btrfs looks very promising. The fact that SUSE is considering making it the default is heartening. When I was setting up my server, it was not a real option. Even now, I might be a little uneasy until someone is using the multi-disk stuff in production. I'll keep playing with it :)
http://open-zfs.org/wiki/Developer_resources#Implementation_documentation notes that the ZFS On-Disk Format document is "a good overview, but sorely outdated". Of possible interest: Max Bruning's weblog: ZFS Raidz Data Walk (2009)
https://github.com/zfsrogue/zfs-crypto
In the ZFSonLinux area at https://github.com/zfsonlinux/zfs/issues/494#issuecomment-23652335 it's noted that the zfsrogue code is encumbered and so will not be used.
There's an earlier comment https://github.com/zfsonlinux/zfs/issues/494#issuecomment-7158618 and a corresponding note in the OpenZFS wiki: The early ZFS encryption code published in the zfs-crypto repository of OpenSolaris.org could be a starting point
hope that this nice umbrella will evolve into a single point of access
If you find time, I recommend listening to bsdtalk227 (listed under Publications).
https://twitter.com/grahamperrin/status/380395699734466560 quoting plus hashtag from a Delphix blog: "To some degree, #OpenZFS is just putting a name to what we have already been doing as a community".
Sometimes these things turn into just another layer of non-information.
I understand your concern.
The first of the three goals of OpenZFS is to raise awareness of the quality, utility, and availability of open source implementations of ZFS. As an end user, very much into awareness-raising of ZFS, I'll occasionally edit (and/or discuss in IRC) wherever I feel that the value of something that's in the wiki is not immediately clear. But I'm neither a developer nor a typical end user, so there'll be large areas that are beyond me. Maximising the value of contributions to the wiki should be very much a collaborative effort...