ZFS, the Last Word in File Systems?

billion billion? by michael+path · 2004-09-16 04:07 · Score: 5, Funny

From the article:

Unlimited scalability
As the world's first 128-bit file system, ZFS offers 16 billion billion times the capacity of 32- or 64-bit systems.

Microsoft immediately countered by saying WinFS will now support "twelveteen million billion times" as much storage as Sun's ZFS, and is "a bazillion times" more secure.

When reached for comment, Sun CEO Scott McNealy replied "neener neener". Microsoft CEO Steve Ballmer responded by putting gum in Sun President Jonathan Schwartz's hair.

Re:billion billion? by hackwrench · 2004-09-16 04:27 · Score: 5, Funny

In fact, this page is just one big number.

Re:billion billion? by Shadowlion · 2004-09-16 04:47 · Score: 5, Funny

Even including all the world's porn.

I dunno, man. I've got a lot of porn...

Re:billion billion? by Dazza · 2004-09-16 05:58 · Score: 5, Informative

Hmm... another one who doesn't know that there's a fair amount of land outside the US borders.

Nope. He said he'd never been outside the UK, so I'd be fairly certain he's aware of land outside the US.

Also living in the UK, I can attest that whenever you hear '1 billion', '1000 million' is meant. The UK converted to this for accounting purposes during the 70's.

The same I suspect is true for most of previously Europe-dominated countries (say India for example).

India, in particular, is totally different. They don't rely on millions and billions but on 'crore' and 'lakh', which are 10 million and 100k respectively.

--
-- "I know that this is vitriol, no solution, spleen-venting, but I feel better having screamed, don't you ?"

Two things... by rincebrain · 2004-09-16 04:08 · Score: 5, Insightful

1) Even Sun has succumbed to recursive acronyms, now.

2) Is it just me, or is the post surprisingly bereft of unique details? I mean, integration with all existing applications is rather assumed, given that it's a file system and all...

--
It's only an insult if it's not true.

Hmf. by BJH · 2004-09-16 04:09 · Score: 5, Insightful

Logically, the next question is if ZFS' 128 bits is enough. According to Bonwick, it has to be. "Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans."

So, what was the point of creating a 128-bit filesystem?

-1, Marketing Hype.

*Yawn*

Re:Hmf. by Kenja · 2004-09-16 04:10 · Score: 5, Informative

"So, what was the point of creating a 128-bit filesystem?
Getting rid of file/drive size limitations for the foreseeable future?

--

"Have you ever thought about just turning off the TV, sitting down with your kids, and hitting them?"

Re:Out of letters. by Saint+Stephen · 2004-09-16 04:12 · Score: 5, Funny

[fs, natch

There already is a ZFS. by TheLoneGundam · 2004-09-16 04:13 · Score: 5, Informative

IBM has ZFS on their z/OS Unix Systems Services (POSIX interfaces on z/OS) component. ZFS was developed to provide improvements over the HFS (Hierarchical File System) that they ship with the OS.

Re:not alphabetically by laird · 2004-09-16 04:13 · Score: 5, Funny

Nah, the ultimate filesystem has to be xyzzyfs! Your data magically appears... :-)

--
Enable 3D printed prosthetics!

Why don't they just describe the capacity in by wiredog · 2004-09-16 04:14 · Score: 5, Funny

Sagans?

--

Best Slashdot Co

Just better than the old stuff from Sun by Ewan · 2004-09-16 04:16 · Score: 5, Insightful

Reading the article, all I see is Sun saying how bad their old stuff was, e.g.:

Consider this case: To create a pool, to create three file systems, and then to grow the pool--5 logical steps--5 simple ZFS commands are required, as opposed to 28 steps with a traditional file system and volume manager.

and

Moreover, these commands are all constant-time and complete in just a few seconds. Traditional file systems and volumes often take hours to configure. In the case above, ZFS reduces the time required to complete the tasks from 40 minutes to under 10 seconds.

Compared to AIX or HP-UX, 28 steps is shockingly bad; both have had much simpler logical volume management for several versions now (AIX for 5 years or more? Certainly as long as I have used it). The existing Solaris 9 logical volume infrastructure is years behind the competition. This brings it up to date, but doesn't put it far ahead.

Ewan

That's a lot of storage by Gentoo+Fan · 2004-09-16 04:17 · Score: 5, Funny

But of course you'll still have to have your boot image within the first 1024 cylinders.

Re:Open source by balster+neb · 2004-09-16 04:18 · Score: 5, Interesting

Yes, it does look like it would be open-sourced as part of Solaris 10 (it was mentioned as one of the major new features).

Assuming Solaris 10 will be true open source (not like Microsoft's "shared source"), as well as GPL-compatible, would I be able to use ZFS on my GNU/Linux desktop? Will ZFS be a viable alternative to ext3 and ReiserFS? Or is the overhead too big?

What I really want to see in a file system... by kcbrown · 2004-09-16 04:19 · Score: 5, Insightful

...and that I haven't seen in any file system announced to date, is a way of bundling multiple filesystem operations into a single atomic transaction that can be rolled back. This would clearly require an addition of four system calls (one to begin a transaction, one to commit it, one to roll it back, and one to set the default action, commit or rollback, on exit).

Such a feature would rock, because it would be possible to make things like installers completely atomic: interrupt the installer process and the whole thing rolls back.
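For what it's worth, here is a rough user-space sketch of that four-call interface in Python. Everything in it (the FsTransaction class and its method names) is hypothetical; no real filesystem or kernel exposes this API, and an in-memory undo log is not crash-safe the way a filesystem-level transaction would be. It only shows the shape: begin, commit, rollback, and a default action on exit.

```python
import os
import shutil
import tempfile


class FsTransaction:
    """Hypothetical user-space stand-in for the four proposed calls.

    begin    -> entering the `with` block
    commit   -> commit()
    rollback -> rollback()
    default  -> on_exit="commit" or on_exit="rollback"
    Not crash-safe: the undo log lives in memory only.
    """

    def __init__(self, on_exit="rollback"):
        if on_exit not in ("commit", "rollback"):
            raise ValueError("on_exit must be 'commit' or 'rollback'")
        self.on_exit = on_exit
        self._undo = []        # callables that reverse each operation
        self._backups = None   # scratch dir holding overwritten files

    def __enter__(self):
        self._backups = tempfile.mkdtemp(prefix="fstxn-")
        return self

    def mkdir(self, path):
        os.mkdir(path)
        self._undo.append(lambda: os.rmdir(path))

    def write_file(self, path, data):
        if os.path.exists(path):
            # Keep the old contents so rollback can restore them.
            backup = os.path.join(self._backups, str(len(self._undo)))
            shutil.copy2(path, backup)
            self._undo.append(lambda: shutil.copy2(backup, path))
        else:
            self._undo.append(lambda: os.remove(path))
        with open(path, "wb") as f:
            f.write(data)

    def commit(self):
        # Forget the undo log; the changes stay.
        self._undo.clear()

    def rollback(self):
        # Undo in reverse order so files vanish before their directories.
        while self._undo:
            self._undo.pop()()

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None or self.on_exit == "rollback":
            self.rollback()
        else:
            self.commit()
        shutil.rmtree(self._backups, ignore_errors=True)
        return False  # never swallow the exception
```

An installer would wrap its work in `with FsTransaction() as txn:` and call `txn.commit()` as its last step; an exception or an early exit from the block undoes everything written so far.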

--
Use 'slashdot stuff' in the subject line in any email you send me if you want to get past the spam filter.

Apparently... by qtone42 · 2004-09-16 04:20 · Score: 5, Funny

... ZFS will also make you forget everything you knew about English grammar.

"We've rethought everything and rearchitected it," says Jeff Bonwick

Rearchitected? WTF? Howsaboot "Redesigned?"

I'm still wrapping my brain around "adaptive endian-ness" as well.

--QTone

Re:Open source by CrkHead · 2004-09-16 04:22 · Score: 5, Funny

It looks like Microsoft may have its new WinFS after all...

Sounds really nice by mveloso · 2004-09-16 04:25 · Score: 5, Informative

Looks like Sun went out and redid their filesystem based on the performance characteristics of machines today, instead of machines of yesteryear.

Some highlights, for those that don't (or won't) RTA:

* Data integrity. Apparently it uses file checksums to error-correct files, so files will never be corrupted. About time someone did this.

* Snapshots, like netapp?

* Transactional nature/copy-on-write

* Auto-striping

* Really, Really Large volume support

All of this leads to speed and reliability. There's a lot of other stuff (varying block sizes, write queueing, stride stuff which I haven't heard about in years), but all of it leads to the above.

Oh, and they simplified their admin too.

It's hard to make a filesystem look exciting. Most of the time it just works, until it fails. The data checksum stuff looks interesting, in that they built error correction into the FS (like CDs and RAID but better hopefully).
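A toy sketch of the checksum idea, in Python. This is not ZFS's on-disk format or algorithm; the MirroredStore class, the two-copy layout, and the choice of SHA-256 are all assumptions for illustration. Note that a checksum on its own only detects corruption; the repair here comes from having a second copy to fall back on.

```python
import hashlib


class MirroredStore:
    """Toy two-way mirror: every block is stored twice, checksum kept apart."""

    def __init__(self):
        self.copies = [{}, {}]   # two "disks": block id -> bytes
        self.checksums = {}      # block id -> hex digest, stored away from the data

    def write(self, block_id, data):
        self.checksums[block_id] = hashlib.sha256(data).hexdigest()
        for disk in self.copies:
            disk[block_id] = data

    def read(self, block_id):
        expected = self.checksums[block_id]
        for disk in self.copies:
            data = disk[block_id]
            if hashlib.sha256(data).hexdigest() == expected:
                # Good copy found: rewrite any copy that no longer matches it.
                for other in self.copies:
                    if other[block_id] != data:
                        other[block_id] = data
                return data
        raise IOError(f"block {block_id}: all copies fail checksum")
```

If one copy is silently flipped on "disk", the next read returns the good data and repairs the bad copy; if both copies are bad, the read fails loudly instead of handing back garbage.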

It might also do away with the idea of "space free on a volume," since the marketing implies that each FS grows/shrinks dynamically, pulling storage out of the pool as needed.

Any users want to chime in?

There are a lot of cluster file systems by anzha · 2004-09-16 04:50 · Score: 5, Informative

Right now there are a lot of file systems that do something not all that different from what Sun is proposing. The project I am on is evaluating them as we speak for a center-wide filesystem. I've had the fun (no sarcasm, honestly) of setting up a number of different ones and helping to run benchmarks and tests against each. All of them have strengths. Every single one of them has some nasty weaknesses.

If you are looking for an open source based cluster file system, Lustre is what you want. It's supported by LLNL, PNNL, and the main writers at ClusterFS Inc. It's a network based cluster FS. We've been using it over GigE. However, we've found that there needs to be a 3:1 ratio of data servers to clients. We have only used one metadata server. Failover isn't the greatest. Quotas don't exist. It also makes kernel mods (some good and bad) to do a mild fork of the Linux kernel (they put them into the newer kernels every so often). It only runs on Linux. Getting it to run on anything else looks...scary.

GPFS runs on AIX and Linux, even sharing the same storage. It runs and is pretty stable. It has the option to run in a SAN mode or as a network based FS. In the latter form, it even does local discovery of disks via labels so that if a client can see the disks locally it will read and write to them via FC rather than through the server. It is, however, a balkanized mess. It requires a lot more work to bring up and run: there is an awful lot of software to configure to get it to run (re: RSCT. If you haven't had the joys of HATS and HAGS, count yourself very, very lucky).

ADIC's StorNext software is another option. This one is good if you are interested in ease of installation, maintenance, and very, very fast speeds (damn near line speed on Fibre Channel). I have set this one up for sharing disks in less than two hours from first install to getting numerous assorted nodes of different OSes to play together (Solaris, AIX, Linux). It freakin' runs on virtually everything from Crays to Linux to Windows. Its issues seem to be scaling (right now it doesn't go past 256 clients) and some nontrivial locking issues (writing to the same block from multiple clients, and parallel I/O to the same file from multiple clients if you change the file size).

There are some others that are not as mature. Among them are Ibrix, Panasas, GFS, and IBM's SANFS. All of them are interesting or promising. Only SANFS looks like it runs on more than Linux at this point, though. Our requirements for the project I am on are to share the same FS and storage instance among disparate client OSes simultaneously. This might not be the same for others, though, and these might be worth a look. Lustre dodges this because it's open source and they're interested in porting.

--
Do you know why the road less traveled by is littered with the bones of the unwary?

Technically.... by jolyonr · 2004-09-16 04:51 · Score: 5, Funny

The last word in file systems is "systems".

Thank you.

--

Please read my Canon EOS tech blog at http://www.everyothershot.com

Re:64 bits is awfully big already by pslam · 2004-09-16 04:54 · Score: 5, Insightful

Yeah, it's probably marketing hype now, but in 5 years, what about 10? Just because we can't do it now doesn't mean that we should stop progress.

No, precisely because we can't do it now, and for the very predictable future, we shouldn't be wasting all that disk space, access and CPU time for a boundary that no production system is likely to ever reach before they get upgraded. That's just practicality.

Seagate apparently sold 18.3 million desktop drives last year. Assuming they're all about 120GB (which is generous of me), that would be about 17.6*10^18 bits. Guess what, that's 2^64 bits. Yes, you would have to buy every single desktop hard drive Seagate shipped in the last year to have the capacity to fill a 64 bit filesystem. And find space for 18 million drives. And a power station to deliver the several hundred megawatts you'd need.

Even at 2 times drive capacity growth per year that's still a ridiculously unattainable figure. In 14 years time you'd only need to buy 1000 drives (which are now 2000TB each). But 14 years is a geological time scale when it comes to computers. You'd have wasted 14 years of CPU time and disk space devoted to those extra 64 bits.

If you still think 64 bits isn't enough, how about 96 bits? It would take 46 years before hard disks were big and cheap enough so you could fill the filesystem by buying 1000 of them. But no, they chose 128 bits because it sounded good.
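The arithmetic above is easy to check. A quick Python sketch, using the post's own figures and its convention of counting 2^N bits of capacity; the decimal gigabytes and the clean doubling every year are assumptions, and the function name is made up for this example.

```python
import math

DRIVES_SOLD = 18.3e6   # Seagate desktop drives in a year, figure from the post
DRIVE_BYTES = 120e9    # 120 GB per drive, decimal gigabytes assumed

total_bits = DRIVES_SOLD * DRIVE_BYTES * 8
print(f"total bits: {total_bits:.3g}  (2^64 = {2**64:.3g})")


def years_until_fillable(address_bits, drives=1000, growth_per_year=2.0):
    """Years of capacity growth until `drives` drives hold 2**address_bits bits."""
    bits_needed_per_drive = 2**address_bits / drives
    bits_per_drive_now = DRIVE_BYTES * 8
    return math.log(bits_needed_per_drive / bits_per_drive_now, growth_per_year)


for width in (64, 96, 128):
    print(f"{width}-bit: about {years_until_fillable(width):.0f} years")
```

Under those assumptions the 64-bit and 96-bit cases come out at roughly 14 and 46 years, matching the figures above, and every extra 32 bits adds another 32 years of doubling.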

Re:Another quote to cherish by mdmarkus · 2004-09-16 05:23 · Score: 5, Informative

From Bruce Schneier in Applied Cryptography:

Thermodynamic Limitations

One of the consequences of the second law of thermodynamics is that a certain amount of energy is necessary to represent information. To record a single bit by changing the state of a system requires an amount of energy no less than kT, where T is the absolute temperature of the system and k is the Boltzmann constant. (Stick with me; the physics lesson is almost over.)

Given that k = 1.38x10^-16 erg/Kelvin, and that the ambient temperature of the universe is 3.2K, an ideal computer running at 3.2K would consume 4.4x10^-16 ergs every time it set or cleared a bit. To run a computer any colder than the cosmic background radiation would require extra energy to run a heat pump.

Now, the annual energy output of our sun is about 1.21x10^41 ergs. This is enough to power about 2.7x10^56 single bit changes on our ideal computer; enough changes to put a 187-bit counter through all of its values. If we built a Dyson sphere around the sun and captured all of its energy for 32 years, without any loss, we could power a computer to count up to 2^192. Of course it wouldn't have the energy left over to perform any useful calculations with this counter.

But that's just one star, and a measly one at that. A typical supernova releases something like 10^51 ergs. (About a hundred times as much energy would be released in the form of neutrinos, but let them go for now.) If all of the energy could be channeled into a single orgy of computation, a 219-bit counter could be cycled through all of its states.

These numbers have nothing to do with the technology of the devices; they are the maximums that thermodynamics will allow. And they strongly imply that brute-force attacks against 256-bit keys will be infeasible until computers are built from something other than matter and occupy something other than space.
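The numbers in that passage can be reproduced in a few lines of Python. This is only a back-of-the-envelope sketch using the constants as quoted; the function name is made up for this example, and the results land within a bit or so of the counter widths given in the quote, depending on rounding.

```python
import math

K_BOLTZMANN = 1.38e-16   # erg/K, value as quoted
T_AMBIENT = 3.2          # K, value as quoted

energy_per_bit = K_BOLTZMANN * T_AMBIENT   # erg per bit change (the kT bound)


def max_counter_bits(energy_erg):
    """log2 of how many bit changes `energy_erg` can pay for at kT each."""
    return math.log2(energy_erg / energy_per_bit)


SUN_ANNUAL_ERG = 1.21e41   # annual output of the sun, as quoted
SUPERNOVA_ERG = 1e51       # typical supernova, as quoted

print(f"energy per bit change: {energy_per_bit:.2e} erg")
print(f"sun, 1 year:   2^{max_counter_bits(SUN_ANNUAL_ERG):.1f} bit changes")
print(f"sun, 32 years: 2^{max_counter_bits(32 * SUN_ANNUAL_ERG):.1f} bit changes")
print(f"supernova:     2^{max_counter_bits(SUPERNOVA_ERG):.1f} bit changes")
```

Since 32 is 2^5, capturing the sun's output for 32 years adds exactly five bits to the one-year figure, which is how 187 becomes 192.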

Re:What is their disk allocation scheme? by majid · 2004-09-16 05:30 · Score: 5, Interesting

I was in a chat session with their engineers yesterday. It looks like they have adaptive disk scheduling algorithms to balance the load across the drives (e.g. if a drive is faster than others, it will get correspondingly more I/O). The scheduler also tries to balance I/O among processes and filesystems sharing the data pool.

This is a good thing - queueing theory shows a single unified pool has better performance than several smaller ones. People who try to tune databases by dedicating drives to redo logs don't usually realize what they are doing is counterproductive - they optimize locally for one area, at the expense of global throughput for the entire system.
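To put a number on the queueing claim, here is a small Python sketch comparing two separate single-server queues (M/M/1) against one pooled two-server queue (M/M/c, via the Erlang C formula). The model is an assumption: Poisson arrivals, exponential service times, identical drives. Real disks behave differently, so treat it as an illustration of the principle, not a prediction. The rates used are made up.

```python
import math


def mm1_wait(lam, mu):
    """Mean time spent waiting in queue for M/M/1 (arrival rate lam, service rate mu)."""
    assert lam < mu, "queue is unstable"
    return lam / (mu * (mu - lam))


def mmc_wait(lam, mu, c):
    """Mean time spent waiting in queue for M/M/c; lam is the total arrival rate."""
    a = lam / mu    # offered load
    rho = a / c     # per-server utilization
    assert rho < 1, "queue is unstable"
    tail = a**c / math.factorial(c) / (1 - rho)
    # Erlang C: probability that an arriving request has to wait at all.
    p_wait = tail / (sum(a**k / math.factorial(k) for k in range(c)) + tail)
    return p_wait / (c * mu - lam)


# Two dedicated drives, each 80% busy, versus the same two drives pooled.
separate = mm1_wait(0.8, 1.0)
pooled = mmc_wait(1.6, 1.0, 2)
print(f"separate queues: {separate:.2f}  pooled: {pooled:.2f}")
```

With the same total load and the same hardware, the pooled queue waits less than half as long in this example, because a request never sits behind a busy drive while the other one is idle.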

ZFS uses copy-on-write (a modified block is written wherever the disk head happens to be, not where the old one used to be). This means writes are sequential (as with all journaled filesystems) and also since the old block is still on disk (until it is garbage collected) this gives the ability to take snapshots, something that is vital for making coherent backups now that nightly maintenance windows are mostly history. This also leads to file fragmentation so enough RAM to have a good buffer cache helps.
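A toy Python model of why copy-on-write makes snapshots nearly free. This is a sketch of the general technique only; the CowStore class and its flat block map are assumptions for illustration and say nothing about how ZFS actually lays out blocks.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.log = []         # append-only "disk"; index = physical location
        self.live = {}        # logical block number -> physical location
        self.snapshots = {}   # snapshot name -> frozen copy of the block map

    def write(self, block_no, data):
        self.log.append(data)                  # always append at the "head"
        self.live[block_no] = len(self.log) - 1

    def read(self, block_no, snapshot=None):
        table = self.live if snapshot is None else self.snapshots[snapshot]
        return self.log[table[block_no]]

    def snapshot(self, name):
        # Cheap: copies only the map, never the data blocks.
        self.snapshots[name] = dict(self.live)

    def garbage(self):
        """Physical locations that no live map or snapshot refers to."""
        used = set(self.live.values())
        for table in self.snapshots.values():
            used.update(table.values())
        return [i for i in range(len(self.log)) if i not in used]
```

After a snapshot, rewriting a block leaves the old version readable through the snapshot, and only versions that neither the live map nor any snapshot points at become candidates for garbage collection. The scattered physical locations in `live` are also the fragmentation mentioned above.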

Because the scheduler works best if it has full visibility of every physical disk, rather than dealing with an abstract LUN on a hardware RAID, they actually recommend ZFS be hosted on a JBOD array (just a bunch of disks, no RAID) and have the RAID be done in software by ZFS. Since the RAID is integrated with the filesystem, they have scope for optimizations that are not available if you have a filesystem trying to optimize on one side and a RAID controller (or separate LVM software) on the other side. Network Appliance does something like this with their WAFL network filesystem to offer decent performance despite the overhead of NFS.

With modern, fast CPUs, software RAID can easily outperform hardware RAID. It is quite common for optimizations like hardware RAID made at a certain time to become counterproductive as technology advances and the assumptions behind the premature optimization are no longer valid. A long time ago, IBM offloaded some database access code in its mainframe disk controllers. It used to be a speed boost, but as the mainframe CPU speeds improved (and the feature was retained for backward compatibility), it ended up being 10 times slower than the alternative approach.
