ZFS, the Last Word in File Systems?

Open source by Splinton · 2004-09-16 04:07 · Score: 4, Informative

And it looks like it's going to be opensourced along with most of Solaris 10!

Presumably a 32 bit machine will be able to handle a 128 bit file system, in the same way as Solaris 10 is currently destined for (at most) 64 bits.

Re:Hmf. by Kenja · 2004-09-16 04:10 · Score: 5, Informative

"So, what was the point of creating a 128-bit filesystem?

Getting rid of file/drive size limitations for the foreseeable future?

--

"Have you ever thought about just turning off the TV, sitting down with your kids, and hitting them?"

What is their disk allocation scheme? by grunt107 · 2004-09-16 04:12 · Score: 3, Informative

Having a global pool does lessen maintenance/support, but what method are they using to place data on the disks?

Frequently accessed data needs to be spread out on all the disks for the fastest access, so does that mean Sun has FS files/tables that track usage and reposition data based on that?

Re:What is their disk allocation scheme? by dTb · 2004-09-16 05:29 · Score: 3, Informative

According to the information given in this blog it is possible to "show how much space is used in each disk. If you want to reduce the amount of space in a pool by removing a disk, you could use this to choose the least-full disk, thus minimizing the time it will take to migrate that data to other disks".

There already is a ZFS. by TheLoneGundam · 2004-09-16 04:13 · Score: 5, Informative

IBM has ZFS on their z/OS Unix Systems Services (POSIX interfaces on z/OS) component. ZFS was developed to provide improvements over the HFS (Hierarchical File System) that they ship with the OS.

Re:billion billion? by michael+path · 2004-09-16 04:15 · Score: 4, Informative

How about quintillion?

Re:Oh wow! by elmegil · 2004-09-16 04:24 · Score: 2, Informative

Until Veritas makes their product free, there's going to have to be SOMETHING that operates in that space that is under Sun's control, don't you think? Not to mention VxVM has plenty of warts all its own.

--
7 November 2006: The day Americans realized corruption and incompetence weren't addressing 11 September 2001

Sounds really nice by mveloso · 2004-09-16 04:25 · Score: 5, Informative

Looks like Sun went out and redid their filesystem based on the performance characteristics of machines today, instead of machines of yesteryear.

Some highlights, for those that don't (or won't) RTA:

* Data integrity. Apparently it uses file checksums to error-correct files, so files will never be corrupted. About time someone did this.

* Snapshots, like netapp?

* Transactional nature/copy-on-write

* Auto-striping

* Really, Really Large volume support

All of this leads to speed and reliability. There's a lot of other stuff (varying block sizes, write queueing, stride stuff which I haven't heard about in years), but all of it leads to the above.

Oh, and they simplified their admin too.

It's hard to make a filesystem look exciting. Most of the time it just works, until it fails. The data checksum stuff looks interesting, in that they built error correction into the FS (like CDs and RAID but better hopefully).
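
To make the checksum idea concrete, here is a toy sketch in Python. It is not ZFS's actual on-disk format or algorithm: the two-copy mirror, the SHA-256 checksum, and every name in it (MirroredBlock, read) are assumptions made up for illustration. The only point is that a checksum kept apart from the data lets a read detect a bad copy and repair it from a good one.

```python
import hashlib


class MirroredBlock:
    """Toy block stored as two copies plus a checksum kept separately."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        self.checksum = hashlib.sha256(data).digest()

    def _is_valid(self, copy: bytearray) -> bool:
        return hashlib.sha256(bytes(copy)).digest() == self.checksum

    def read(self) -> bytes:
        """Return the first copy that matches the checksum, healing the rest."""
        for good in self.copies:
            if self._is_valid(good):
                # Overwrite any copy that fails verification with the good one.
                for i, other in enumerate(self.copies):
                    if not self._is_valid(other):
                        self.copies[i] = bytearray(good)
                return bytes(good)
        raise IOError("all copies failed checksum verification")


if __name__ == "__main__":
    block = MirroredBlock(b"important data")
    block.copies[0][0] ^= 0xFF  # simulate silent corruption of one copy
    print(block.read())  # served from the intact copy
    print(block.copies[0] == block.copies[1])  # corrupted copy was repaired
```

If every copy is damaged, the read fails loudly instead of handing back bad data, which is the other half of what a checksum buys you.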

It might also do away with the idea of "space free on a volume," since the marketing implies that each FS grows/shrinks dynamically, pulling storage out of the pool as needed.

Any users want to chime in?

Re:Sounds really nice by the+melon · 2004-09-16 06:21 · Score: 2, Informative

All I can really say is if you have ever used a volume manager before
you will rejoice at the ease of zfs.

I have been using it on my main nfs server in my Solaris lab at Sun
for quite a while now and it is great.

I have a 1.6 TB disk array that is allocated to a single zpool on the
system. I can add/subtract drives/arrays to this pool at any time to
increase or decrease the amount of storage available to the pool.

I can then create, format and mount a zfs filesystem with one single
command to the zpool. The filesystem will only consume as much of the
zpool as it is actually using.
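
For anyone who hasn't used pooled storage, the model described above can be sketched as a toy in Python. This only illustrates the space accounting. It is not the ZFS implementation or its command-line interface, and the StoragePool class and its methods are hypothetical names.

```python
class StoragePool:
    """Toy pool: devices add capacity, filesystems draw space only as used."""

    def __init__(self):
        self.devices = {}      # device name -> capacity in GB
        self.filesystems = {}  # filesystem name -> GB actually used

    def add_device(self, name, capacity_gb):
        self.devices[name] = capacity_gb

    def create_fs(self, name):
        # No size is fixed up front; a new filesystem consumes nothing.
        self.filesystems[name] = 0

    def write(self, fs, gb):
        if gb > self.free_gb():
            raise OSError("pool is out of space")
        self.filesystems[fs] += gb

    def capacity_gb(self):
        return sum(self.devices.values())

    def free_gb(self):
        return self.capacity_gb() - sum(self.filesystems.values())


if __name__ == "__main__":
    pool = StoragePool()
    pool.add_device("array0", 1600)
    pool.create_fs("home")
    pool.create_fs("builds")
    pool.write("home", 200)
    print(pool.free_gb())  # free space is shared by every filesystem
    pool.add_device("array1", 400)
    print(pool.free_gb())  # growing the pool grows it for all of them
```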

It really is a great system.

Re:What I really want to see in a file system... by FullMetalAlchemist · 2004-09-16 04:30 · Score: 2, Informative

There are several FS like this, but you don't know of them because they require a completely new FS API to work with.
With UFS2/SU we have snapshots, which is a compromise; it does not require any changes in the original UNIX API, and all current apps therefore work. On the other hand, it requires either a daemon or a competent user.

So, either you have UNIX or you have something else. Plan9 has many advantages, still, we use BSD, Solaris or whatever.

64 bits is awfully big already by pslam · 2004-09-16 04:30 · Score: 4, Informative

Getting rid of file/drive size limitations for the foreseeable future?

It would take over 500 years to fill a 64 bit filesystem written at 1GB/sec (and of course 500 years to read it back again). 64 bits is already an impossibly large figure. There's absolutely nothing special or clever whatsoever about doubling the size of your pointers aside from using up more disk space for all the metadata.
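
The 500-year figure is easy to check. A quick sketch, assuming "1GB/sec" means 10^9 bytes per second and a 365.25-day year (both are assumptions; the claim doesn't depend on them much):

```python
# Time to fill a 2**64-byte filesystem at a sustained 1 GB/s.
# Assumptions: 1 GB = 10**9 bytes, 1 year = 365.25 days.
FS_BYTES = 2 ** 64
RATE_BYTES_PER_SEC = 10 ** 9
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_fill(fs_bytes=FS_BYTES, rate=RATE_BYTES_PER_SEC):
    """Return the number of years needed to write fs_bytes at rate bytes/sec."""
    return fs_bytes / rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{years_to_fill():.1f} years")
```

Taking 1 GB as 2^30 bytes instead shortens the result somewhat, but it stays above 500 years either way.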

64 bits is enough for today's filesystems in much the same way that 256 bit AES is enough for today's encryption - there are far bigger things that will require complete system changes than that so called "limit". I suspect a better filesystem will come along well before those 500 years are up... I agree with grandparent:

-1, Marketing Hype.

Re:64 bits is awfully big already by dTb · 2004-09-16 05:45 · Score: 2, Informative

The filesystem has compression built in as an option to make storage more efficient. They currently use LZJB (fast but little reduction) compression but plan to add more powerful but slower compression at a later date.

Re:Out of letters. by badriram · 2004-09-16 04:34 · Score: 4, Informative

I just wonder how many people on slashdot would even understand that....

To those who don't know.. [ comes after Z in ASCII and Unicode Latin
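
A two-line check in Python, for the curious (next_char is just a helper name made up here):

```python
def next_char(c):
    """Return the character whose code point follows c's."""
    return chr(ord(c) + 1)


if __name__ == "__main__":
    print(ord("Z"), next_char("Z"))
```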

Re:What I really want to see in a file system... by dominator · 2004-09-16 04:39 · Score: 4, Informative

Reiserfs will apparently soon have what you're looking for. Already, all primitive operations are atomic, but they plan on exporting a user-space transaction interface soon.

http://www.namesys.com/benchmarks.html

"V4 is a fully atomic filesystem, keep in mind that these performance numbers are with every FS operation performed as a fully atomic transaction. We are the first to make that performance effective to do. Look for a user space transactions interface to come out soon....

Finally, remember that reiser4 is more space efficient than V3, the df measurements are there for looking at....;-) "

Some snippets from the article by ChrisRijk · 2004-09-16 04:41 · Score: 2, Informative

ZFS achieves its impressive performance through a number of techniques:
* Dynamic striping across all devices to maximize throughput
* Copy-on-write design makes most disk writes sequential
* Multiple block sizes, automatically chosen to match workload
* Explicit I/O priority with deadline scheduling
* Globally optimal I/O sorting and aggregation
* Multiple independent prefetch streams with automatic length and stride detection
* Unlimited, instantaneous read/write snapshots
* Parallel, constant-time directory operations

ZFS has some similarities to NetApp's WAFL in that it uses "copy on write".

One of the fun things with ZFS is that it automatically stripes across all the storage in your pool. Disk size doesn't matter - it's all used. This even works across SCSI and IDE.

One of the important things is that volume management isn't a separate feature. Effectively, all the current limitations of volume managers are blown away:

Just as it dramatically eases the suffering of system administrators, ZFS offers relief for your company's bottom line. Because ZFS is built on top of virtual storage pools (unlike traditional file systems that require a separate volume manager), creating and deleting file systems is much less complex. Not only does this eliminate the need to pay for volume manager licenses and allow for single support contracts, it lowers administration costs and increases storage utilization.

ZFS appears to applications as a standard POSIX file system--no porting is required. But to administrators, it presents a pooled storage model that eliminates the antique concept of volumes, as well as all of the related partition management, provisioning, and file system sizing problems. Thousands--even millions--of file systems can all draw from ZFS' common storage pool, each one consuming only as much space as it needs. The combined I/O bandwidth of all of the devices in that storage pool is always available to each file system.

This is also part of the stuff making admin and configuration far far simpler. The thing I like is that it should be far harder to go wrong with ZFS (not available in Solaris Express yet so I haven't seen this for myself).

The very high degree of reliability as standard is very welcome too:

Data can be corrupted in a number of ways, such as a system error or an unexpected power outage, but ZFS removes this fear of the unknown. ZFS prevents data corruption by keeping data self-consistent at all times. All operations are transactional. This not only maintains consistency but also removes almost all of the constraints on I/O order and allows changes to succeed or fail as a whole.

All operations are also copy-on-write. Live data is never overwritten. ZFS writes data to a new block before changing the data pointers and committing the write. Copy-on-write provides several benefits:

* Always-valid on-disk state
* Consistent, reliable backups
* Data rollback to known point in time
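
The copy-on-write description above can be sketched as a toy in Python. This is an illustration of the general technique only, under made-up names (CowStore, write, snapshot, read); it says nothing about how ZFS actually lays out blocks or pointers.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = []  # append-only list of written blocks
        self.root = {}    # live mapping: name -> index into self.blocks

    def write(self, name, data):
        # 1. Write the new data to a fresh block; the old block is untouched.
        self.blocks.append(data)
        # 2. Build a new root that points at it, then switch over in one step.
        new_root = dict(self.root)
        new_root[name] = len(self.blocks) - 1
        self.root = new_root

    def snapshot(self):
        # A snapshot is just a reference to the root as it is right now.
        return self.root

    def read(self, name, root=None):
        root = self.root if root is None else root
        return self.blocks[root[name]]


if __name__ == "__main__":
    store = CowStore()
    store.write("report", b"version 1")
    snap = store.snapshot()
    store.write("report", b"version 2")
    print(store.read("report"))        # current data
    print(store.read("report", snap))  # the snapshot still sees the old block
```

Because the old root and its blocks are never modified, a crash before the root switch leaves the previous state intact, and taking a snapshot costs nothing more than keeping the old root around.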

"We validate the entire I/O stack, start to finish, no guesswork involved. It's all provable data integrity," says Bonwick.

Administrators will never again have to run laborious recovery procedures, such as fsck, even if the system is shut down in an unclean fashion. In fact, Solaris Kernel engineers Bill Moore and Matt Ahrens have subjected ZFS to more than a million forced, violent crashes in the course of their testing. Not once has ZFS lost data integrity or leaked a single block.

For more technical info see Matt Ahrens's and Val Henson's blogs - since they're among the engineers who worked on it.

Re:Unlimited scalability by dynamo · 2004-09-16 04:44 · Score: 2, Informative

Just because not all worlds are inhabited doesn't mean there aren't an infinite number. If you allow yourself to presume infinite space and infinite worlds, suppose 9% of them turn out to be inhabited, no matter how many you keep examining.

Infinity is relative.

Actually, Novell already made ZFS... by thehunger · 2004-09-16 04:45 · Score: 3, Informative

The codename for the first generation of Novell's current filesystem was ZFS. Why? Because it was supposed to be "the last, or final word" in file systems.

It's now called Novell Storage System (I think it used to be NetWare Storage System).

Apart from the obvious fact that SUN didn't manage to be very original in naming their filesystem, it's noteworthy that Novell is porting their ZFS - now NSS - to Linux. It'll be part of Novell Open Enterprise Server - on both Linux and NetWare kernels.

Off the top of my head, here are some features of NSS that SUN needs to exceed to qualify for a new "final word..":

- Background compression
- Fast on-demand decompression
- Transactions
- Pluggable Name spaces
- Pluggable protocols (ie. http, nfs, etc)
- Advanced Access control model with inheritance, rights filters, etc. integrated with directory service (duh!)
- Quotas on user, group, directory level
- 64-bit (ok, SUN obviously got that one)
- mini-volumes
- journaled
- etc.

Oh well, I won't bother continuing, but it's worth looking out for NSS. Hopefully Novell will open source it and not make it exclusive to their distros.

There are a lot of cluster file systems by anzha · 2004-09-16 04:50 · Score: 5, Informative

Right now there are a lot of file systems that do something not all that different from what Sun is proposing. The project I am on is evaluating them as we speak for a center-wide filesystem. I've had the fun (no sarcasm, honestly) of setting up a number of different ones and helping to run benchmarks and tests against each. All of them have strengths. Every single one of them has some nasty weaknesses.

If you are looking for an open source based cluster file system, Lustre is what you want. It's supported by LLNL, PNNL, and the main writers at ClusterFS Inc. It's a network based cluster FS. We've been using it over GigE. However, we've found that there needs to be a data server:client ratio of 3:1. We have only used one metadata server. Failover isn't the greatest. Quotas don't exist. It also makes kernel mods (some good and bad) that amount to a mild fork of the Linux kernel (they put them into the newer kernels every so often). It only runs on Linux. Getting it to run on anything else looks...scary.

GPFS runs on AIX and Linux. Even sharing the same storage. It runs and is pretty stable. It has the option to run in a SAN mode or as a network based FS. In the latter form, it even does local discovery of disks via labels so that if a client can see the disks locally it will read and write to them via FC rather than to the server. It, however, is a balkanized mess. It requires a lot more work to bring up and run: there is an awful lot of software to configure to get it to run (re: RSCT. If you haven't had the joys of HATS and HAGS, count yourself very, very lucky).

ADIC's StorNext software is another option. This one is good if you are interested in ease of installation, maintenance, and very, very fast speeds (damn near line speed on Fibre Channel). I have set this one up for sharing disks in less than two hours from first install to getting numerous assorted nodes of different OS's to play together (Solaris, AIX, Linux). It freakin' runs on virtually everything from Crays to Linux to Windows. Its issues seem to be scaling (right now it doesn't go past 256 clients) and it has some nontrivial locking issues (writing to the same block from multiple clients, and parallel I/O to the same file from multiple clients if you change the file size).

There are some others that are not as mature. Among them are Ibrix, Panasas, GFS, and IBM's SANFS. All of them are interesting or promising. Only SANFS looks like it runs on more than Linux though at this point. Our requirements for the project I am on are to share the same FS and storage instance among disparate client OSes simultaneously. This might not be the same for others though and these might be worth a look. Lustre dodges this because it's open source and they're interested in porting.

--
Do you know why the road less traveled by is littered with the bones of the unwary?

Re:There are a lot of cluster file systems by Plugh · 2004-09-16 06:56 · Score: 3, Informative

You forgot to mention the GPLed Cluster Filesystem that Oracle released some time ago.
You also may want to check out the ASM (Automated Storage Manager). It only works for disks that Oracle manages, but it does some pretty cool automatic load-balancing and RAIDing.
Disclaimer:
Yes, I do work for ORCL.
No, I do not work on either OCFS or ASM (but I have partied with those guys :-)

--
Part of the Second American Revolution!

Re:Just better than the old stuff from Sun by Zapman · 2004-09-16 04:53 · Score: 2, Informative

Well, I'm not 100% sure that's fair. AIX and HP still have their old school 'format -> mkfs' path, and that is what Sun is comparing their 'new world order' to. Now, if you want to do cool things like RAID, then you need to either do the hardware based stuff, or you play with Disksuite or Veritas Volume Manager[1].

Both have more interesting and pretty ways of playing with volumes. Disksuite is a free, add on package, and Veritas charges an arm and a leg for their Volume Manager.

In addition to the other cool features, ZFS is just a way to deepen the abstraction away from physical volumes.

As to its inherent coolness, or lack thereof, I'll let y'all know when I've actually been able to play with it.

[1]Had Sun been wise years ago, they would have just bought Veritas, and the world would be very different. Now however, Veritas is one of the largest software companies in the world.

--
Zapman

The proof is in the pudding by melted · 2004-09-16 04:55 · Score: 3, Informative

As someone who's been involved with performance/stress optimizations I can tell you that for each situation you can carefully put together two types of tests: one which proves that there's a problem, another that proves the problem doesn't exist.

The proof is in the pudding. Let Sun release it and administrators use it for a year or two, then we'll see if it's good enough. Right now I'm having doubts it's as good as they want you to believe.

Re:Just better than the old stuff from Sun by drinkypoo · 2004-09-16 05:01 · Score: 2, Informative

How is this actually different from JFS on top of a LVM? Either way it's made up of blocks, which can be added to the filesystem later, located on any physical medium available, using RAID... The only measurable difference seems to be the 128-bitness, which as described elsewhere seems like a big fat waste of time for the next hundred years or so.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

ZFS by BJH · 2004-09-16 05:05 · Score: 2, Informative

Two words:

"Patent burdened"

Re:Just better than the old stuff from Sun by sysadmn · 2004-09-16 05:13 · Score: 2, Informative

With AIX and HP-UX, there's still 28 steps. It's just that the manuals say: 1) Run smit (IBM version) or 1) Run SAM (HP-UX version). and you're supposed to read the menus to figure out the other 27 steps.

--
Envy my 5 digit Slashdot User ID!

Re:Oh wow! by Wakko+Warner · 2004-09-16 05:17 · Score: 2, Informative

Oh, I have no problem with Sun offering a VM of its own. It's the lack of functionality that's always concerned me. It always seemed silly to pay $25k for the kind of volume management on Solaris that you get for free in AIX and HP/UX.

Also, I'm tired of running a volume manager simply to mirror root, and a separate, expensive volume manager (with a different level of support from a different vendor) simply to manage my data volumes, and I'm distressed that this is the "standard" way to do it in Solaris.

Hopefully, this changes things significantly.

- A.P.

--
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"

White Papers by dTb · 2004-09-16 05:22 · Score: 2, Informative

If anyone wants to read more details on the "Zettabyte File System" they can view the white papers on ZFS self-tuning and QOS as they contain far more detail than the marketing article given.

Re:Another quote to cherish by mdmarkus · 2004-09-16 05:23 · Score: 5, Informative

From Bruce Schneier in Applied Cryptography:

Thermodynamic Limitations

One of the consequences of the second law of thermodynamics is that a certain amount of energy is necessary to represent information. To record a single bit by changing the state of a system requires an amount of energy no less than kT where T is the absolute temperature of the system and k is the Boltzmann constant. (Stick with me; the physics lesson is almost over.) Given that k = 1.38x10^-16 erg/Kelvin, and that the ambient temperature of the universe is 3.2K, an ideal computer running at 3.2K would consume 4.4x10^-16 ergs every time it set or cleared a bit. To run a computer any colder than the cosmic background radiation would require extra energy to run a heat pump.

Now, the annual energy output of our sun is about 1.21x10^41 ergs. This is enough to power about 2.7x10^56 single bit changes on our ideal computer; enough changes to put a 187-bit counter through all of its values. If we built a Dyson sphere around the sun and captured all of its energy for 32 years, without any loss, we could power a computer to count up to 2^192. Of course it wouldn't have the energy left over to perform any useful calculations with this counter.

But that's just one star, and a measly one at that. A typical supernova releases something like 10^51 ergs. (About a hundred times as much energy would be released in the form of neutrinos, but let them go for now.) If all of the energy could be channeled into a single orgy of computation, a 219-bit counter could be cycled through all of its states.

These numbers have nothing to do with the technology of the devices; they are the maximums that thermodynamics will allow. And they strongly imply that brute-force attacks against 256-bit keys will be infeasible until computers are built from something other than matter and occupy something other than space.
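
The sun figures in that passage can be reproduced in a few lines of Python. This sketch uses only the constants quoted above and follows the passage's simplification of one bit change per counter step; the function names are made up for illustration.

```python
import math

K_BOLTZMANN_ERG_PER_K = 1.38e-16  # as quoted in the passage
TEMPERATURE_K = 3.2               # as quoted in the passage
SUN_ANNUAL_OUTPUT_ERG = 1.21e41   # as quoted in the passage


def energy_per_bit_erg():
    """Minimum energy kT to set or clear one bit at the given temperature."""
    return K_BOLTZMANN_ERG_PER_K * TEMPERATURE_K


def counter_bits(energy_erg):
    """Largest n such that 2**n bit changes fit in the energy budget."""
    return math.floor(math.log2(energy_erg / energy_per_bit_erg()))


if __name__ == "__main__":
    print(energy_per_bit_erg())
    print(counter_bits(SUN_ANNUAL_OUTPUT_ERG))       # one year of the sun
    print(counter_bits(32 * SUN_ANNUAL_OUTPUT_ERG))  # 32 years of the sun
```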

Re:Why don't they just describe the capacity in by Insightfill · 2004-09-16 05:50 · Score: 3, Informative

Here's a good source.

"Johnny Carson, America's popular talk-show host, loved to affectionately mimic Carl - one of his favorite guests - by saying "billions and billions," until everyone associated it with Carl. Yet Carl never said that precise phrase in public until years later.
He grew quite tired of it. I remember a concert for Planetfest, a Planetary Society celebration of space exploration in 1981. He spoke about space exploration while accompanied by music conducted by John Williams, and inevitably had to use the word "billions." As soon as he did, tittering broke out in the audience. He glared at the offenders and continued."

Seriously, I would LOVE to use "Sagan" as a unit of counting "billions" or something.

Re:billion billion? by Dazza · 2004-09-16 05:58 · Score: 5, Informative

Hmm... another one who doesn't know that there's a fair amount of land outside the US borders.

Nope. He said he'd never been outside the UK, so I'd be fairly certain he's aware of land outside the US.

Also living in the UK, I can attest that whenever you hear '1 billion', '1000 million' is meant. The UK converted to this for accounting purposes during the 70's.

The same I suspect is true for most of previously Europe-dominated countries (say India for example).

India, in particular, is totally different. They don't rely on millions and billions but 'crore' and 'lakh', which are 10 million and 100k respectively.

--
-- "I know that this is vitriol, no solution, spleen-venting, but I feel better having screamed, don't you ?"

Re:billion billion? by escher · 2004-09-16 06:17 · Score: 1, Informative

No, nobody can really visualize a billion (seriously, try!)

Okay!

Lesse, let's define a millimeter as 1000. That means a million is one meter and a billion is one kilometer. I, for one, can visualize a little over half a mile quite easily.

Re:billion billion? by mikael · 2004-09-16 06:25 · Score: 2, Informative

Our newspapers regularly like to have front page headlines like "Chancellor raids nine billion pounds from company pension schemes". In this sense it means 9 thousand million pounds. At the same time we frequently have news reports from the USA, especially with regard to budget deficits in states like California.

--
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads

Re:Oh wow! by elmegil · 2004-09-16 06:50 · Score: 2, Informative

If someone from Sun has convinced you that this is "standard" or "necessary", you need to talk to their management. While many people do it that way, there's absolutely no reason, since you're already paying for Veritas, not to just use Veritas and be done with it.

You're right, it'd be nice to see some regularization.

--
7 November 2006: The day Americans realized corruption and incompetence weren't addressing 11 September 2001

Re:billion billion? by Just+Some+Guy · 2004-09-16 06:59 · Score: 2, Informative

I'm pretty film-ignorant, but let's say that you're talking about the equivalent of a 10000x10000 image with 64 bits of color (because you clearly want to maintain all of the information possible). That's 800,000,000 bytes (10000*10000*8) per image. Impressive, but at 24 frames per second a 64-bit filesystem will still yield 960,767,920 seconds (30.4 years) of uncompressed footage.
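
For anyone who wants to check the arithmetic, here is the same calculation in Python. The 365.25-day year is an assumption; everything else comes straight from the numbers above.

```python
WIDTH = HEIGHT = 10_000
BYTES_PER_PIXEL = 8   # 64 bits of color
FPS = 24
FS_BYTES = 2 ** 64
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: 365.25-day year


def frame_bytes():
    """Size in bytes of one uncompressed frame."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL


def footage_seconds():
    """Whole seconds of footage that fit in a 2**64-byte filesystem."""
    return FS_BYTES // frame_bytes() // FPS


if __name__ == "__main__":
    print(frame_bytes())
    print(footage_seconds())
    print(f"{footage_seconds() / SECONDS_PER_YEAR:.1f} years")
```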

Again, what exactly are you planning to film? :)

--
Dewey, what part of this looks like authorities should be involved?

Re:Patents and other Bad Signs. by gimpboy · 2004-09-16 08:39 · Score: 2, Informative

There are many opensource licenses. All opensource means is that the code is available for inspection and modification. Opensource is more a copyright issue and has nothing to do with patents. The gpl --- which is not the same as opensource --- addresses both copyright and patent issues.

--
-- john

Re:billion billion? by david.given · 2004-09-16 09:11 · Score: 3, Informative

I dunno, man. I've got a lot of porn...

Hmm.

If you had a filesystem 2^64 bytes wide, and your average porn jpeg was 100kB, then this means that you could store 1x10^14 images on it. That's 100'000'000'000'000 of them.

Assuming you're male and heterosexual, this means that every woman on the planet would have to take 30'000 compromising pictures of herself to fill it up; or about 60'000 assuming you're not into the weird stuff.

You're right, that's a lot of porn.

Re:billion billion? by lee7guy · 2004-09-16 12:21 · Score: 2, Informative

What part of "let's define a millimeter as 1000" don't you get?

--
Ceterum censeo Microsoftem esse delendam

More technical information on ZFS by ahrens · 2004-09-16 17:52 · Score: 2, Informative

You can find some more technical information about ZFS in my weblog. Check out the comments to my first entry about ZFS, there are a few juicy details there and I'll do my best to answer any questions posted to my blog.

Disclaimer: I work on ZFS at Sun.

Re:billion billion? by lee7guy · 2004-09-22 11:12 · Score: 2, Informative

define: 1 mm = 1000.

1 m = 1000 mm, per definition.

1000x1000 = ?

--
Ceterum censeo Microsoftem esse delendam
