State Of The Filesystem
Skeme writes "Have you heard of Plan 9 or Reiser4 but don't know much about them? Are you curious about the improvements free software is making to its filesystems in general? Read my summary of the current developments in the filesystem: namely, what improvements we can expect (a lot), and what Linux and the BSDs can do to improve on the filesystem."
No, of course not. You'd use ssh to transfer an intermediate format (like tar) which DOES retain metadata.
It wouldn't be hard to extend the scp command to send metadata across if a compatible (i.e. up-to-date) ssh server is running that accepts metadata descriptions for files. The remote machine can then do as it pleases with the metadata.
What is needed is a POSIX specification for handling file metadata. Then applications, servers *and* filesystems can be gradually upgraded to include the metadata features in a portable and consistent way.
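The tar point above is easy to check for yourself. Here's a minimal sketch, assuming a POSIX system and Python's standard tarfile module; the file name, mode and timestamp are made up for the demo. It packs a file into an in-memory archive (the thing you would pipe through ssh) and reads back the metadata that travelled with it.

```python
import io
import os
import tarfile
import tempfile


def pack(path):
    """Return an in-memory tar archive containing the file at `path`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(path, arcname=os.path.basename(path))
    return buf.getvalue()


def list_metadata(archive_bytes):
    """Map each member name to the (mode, mtime, size) stored in the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return {m.name: (m.mode, m.mtime, m.size) for m in tar.getmembers()}


def demo():
    """Create a throwaway file with known metadata and round-trip it through tar."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "note.txt")  # hypothetical file name
        with open(src, "w") as f:
            f.write("hello")
        os.chmod(src, 0o640)
        os.utime(src, (1_000_000_000, 1_000_000_000))
        return list_metadata(pack(src))
```

The archive is just a byte stream, so any transport that can move bytes carries the mode, owner and timestamps along with it. Anything tar has no field for still gets dropped, which is the case a POSIX metadata spec would have to cover.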
--
Blue SSL
It's really nice. But what does it bring that is new enough that we should rewrite 90% of all system tools to use these new features? I find "cp /a/..uid /b/..uid" the same as "chown --reference=/a /b"...
...and not very general. Interesting for its comments on what's being tried out in R-FS & Plan9 but certainly doesn't manage to be a general summary of what's going on.
How about the changes coming in 2.6 (like xfs support built in)?
The article makes some good points, but for me it could have done with rewriting to make it more general, to separate the analysis of filesystem implementation problems from technical detail, and to include more examples from other file systems.
"we demand rigidly defined areas of doubt and uncertainty!"
Generally you lose the data, unless you wrap it in another format to encapsulate all the information. This is what Macheads did on Classic MacOS: they .hqx'd or .bin'd their files before transferring them to another system. It's not ideal. The alternative, flat streams-of-bytes, is not ideal either (and not true: even in Unix, files have some metadata that doesn't translate very well).
Hopefully in the future our filesystems and transfer protocols will evolve to have some reasonably broad common ground where metadata is concerned (a development similar to the diminishing need to accommodate DOS 8+3 filenames).
Often that layer is a DB - a database. I suggest you try ZODB, the database in Zope; it's very good at handling files as documents, with a lot of unified metadata about the files.
Another good example to study is Subversion, which is a revisioning/versioning metadata-management layer on top of a regular FS.
You may research and find some software implementing a layer (on top of a regular FS) specially designed to handle MP3 playlists. But again, that would be a layer on top of the FS, not a filesystem by itself.
Less is more !
Really, I think someone should get on with finishing the NTFS filesystem access in the kernel. With people migrating to XP it's really becoming more important that this driver is fixed (how long has it been declared "dangerous" for write use now?!).
I'd really like to know why this driver has taken so long to complete - is there some information that the developers don't have access to? Some technical reason? What?!
Code, Hardware, stuff like that.
the os should have a 'list' of what's supported by each fs. when you copy a file from fsa to fsb the os (or program) should compare features and let you know (somehow) that something is not (or may not be) going to work right. if you copy something from the regular ext2 system that is case sensitive to a ms-dos floppy disk, something should try to remind you. or the program checks this and looks for problems.
remember that not all problems can be detected, so are you willing to live with 'this may not work correctly' messages?
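a rough sketch of what such a 'list' and check could look like, in python. everything here is made up for illustration: the feature names, the table entries and the function names are assumptions, not something any os actually ships.

```python
# Hypothetical capability table: which features each filesystem supports.
# The entries are illustrative only, not an authoritative description.
FS_FEATURES = {
    "ext2": {"case_sensitive", "long_names", "symlinks", "unix_permissions"},
    "msdos": set(),
}


def lost_features(src_fs, dst_fs, table=FS_FEATURES):
    """Features the source filesystem has and the destination lacks."""
    return sorted(table[src_fs] - table[dst_fs])


def case_collisions(filenames):
    """Pairs of names that differ only by case."""
    seen = {}
    clashes = []
    for name in filenames:
        key = name.lower()
        if key in seen and seen[key] != name:
            clashes.append((seen[key], name))
        else:
            seen.setdefault(key, name)
    return clashes


def copy_warnings(src_fs, dst_fs, filenames, table=FS_FEATURES):
    """Build the 'this may not work correctly' messages for a planned copy."""
    lost = lost_features(src_fs, dst_fs, table)
    warnings = [f"{dst_fs} lacks {feature}" for feature in lost]
    if "case_sensitive" in lost:
        for first, second in case_collisions(filenames):
            warnings.append(f"names would collide: {first} / {second}")
    return warnings
```

copy_warnings("ext2", "msdos", ["README", "readme"]) would list the four missing features plus the README/readme clash. the generic 'lacks X' lines are the ones you'd have to live with, since the check can't tell whether you actually depend on the feature.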
eric
if that crock, that bag-on-the-side, that mess is what we have to look forward to, I think i'll switch to BSD.
/directory/..owner is a big ugly crock.
I mean, accessing owner data by travelling into a directory then backwards out of it again like: vi
Feel that power? That's mah MOUSING FINGER
I see it happening more and more that people present their summaries, articles and technical papers on the net as pdfs. This is very inconvenient.
Pdfs are nice for printing and publishing on traditional media, because you can be sure they will be in the correct layout etcetera for the printer. But on the web, where people browse between lightweight, easy html-documents, they're just a nuisance.
Please, if you must publish a pdf, publish an html version next to it.
Before adopting any of these ideas, one must consider the security implications of doing so.
If we assume that the filesystem is decoupled from the access control layer in the kernel, then one must ensure that any operation that potentially affects security is adequately controlled.
For example, on systems with POSIX_RESTRICTED_CHOWN, the following ought to be illegal:
cp foo/..uid bar/..uid
This can be accomplished by making the UIDs mode 444. Without POSIX_RESTRICTED_CHOWN, the UID is 644. However, we have now moved a systemwide security feature into the filesystem. If multiple filesystems are configured into one kernel, then they ought to be consistent; otherwise the security model will be flawed.
As for things such as allowing access to an environment, doesn't that break encapsulation? It means for a certain filename, the filesystem must grovel through a user-space process to find the environment. Also, if a change in some external environment immediately affects some partially-related processes (e.g. daemons started from that shell), then a whole new raft of security holes will come up based on a process' environment or filesystem layout changing unexpectedly.
Cool ideas, but let's be careful lest we make a steaming pile of Swiss cheese.
I'm really getting tired of the ever-creeping assertion that transactions are required for [x]. At first x was ACID-compliant relational databases, and such was true because ACID was defined as such. However, then I started to see assertions that relational databases had to be ACID-compliant (mostly from the anti-MySQL camps who were ignoring the long history of highly valuable, non-ACID relational databases).
Now, in this article, I see the assertion that databases in general require transactions, and thus cannot be supported by a filesystem.
Worse, the logic is self-refuting, as the article previously states that a filesystem is a database, just a limited one. As it happens, POSIX-type filesystems are quite powerful, and let's not kid ourselves into thinking that they have not served us well for 20-30 years! Yes, changes are coming and I'm frankly quite impressed by Hans Reiser's accomplishment in finally coming up with a balanced-tree-based filesystem. Many have tried and failed where he succeeded.
That's because his was a great step forward, not because the old UNIX filesystems weren't also. Let's stop trying to re-define terms so that we can explain why the last 20 years were the dark-ages. They simply were not.
This article seems to just be the author brainstorming or feeling excited about reiserfs. It's hardly a "summary of developments in the filesystem". Now if he was asking about opinions on his article it'd be fine, but he's not, so I'll just discard this as another non-news.
People keep trying to use file hierarchies as data bases. You can do a lot of stuff, but arrays and m to n forward and reverse mappings aren't among the things you can do with filesystems. That's why you have databases and XML.
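To make the m-to-n point concrete, here is a small sketch using Python's built-in sqlite3 module. The tables and names (track, playlist, entry) are invented for the example. One link table gives you both the forward and the reverse lookup, which a plain directory tree can only fake with duplicated links in both directions.

```python
import sqlite3


def build():
    """Create an in-memory database with an m-to-n track/playlist mapping."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE track (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE playlist (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE entry (
            playlist_id INTEGER REFERENCES playlist(id),
            track_id INTEGER REFERENCES track(id),
            PRIMARY KEY (playlist_id, track_id)
        );
        """
    )
    db.executemany("INSERT INTO track VALUES (?, ?)",
                   [(1, "alpha"), (2, "beta"), (3, "gamma")])
    db.executemany("INSERT INTO playlist VALUES (?, ?)",
                   [(1, "morning"), (2, "evening")])
    db.executemany("INSERT INTO entry VALUES (?, ?)",
                   [(1, 1), (1, 2), (2, 2), (2, 3)])
    return db


def tracks_in(db, playlist_name):
    """Forward mapping: playlist -> tracks."""
    rows = db.execute(
        "SELECT t.title FROM track t "
        "JOIN entry e ON e.track_id = t.id "
        "JOIN playlist p ON p.id = e.playlist_id "
        "WHERE p.name = ? ORDER BY t.title",
        (playlist_name,),
    )
    return [title for (title,) in rows]


def playlists_with(db, track_title):
    """Reverse mapping: track -> playlists."""
    rows = db.execute(
        "SELECT p.name FROM playlist p "
        "JOIN entry e ON e.playlist_id = p.id "
        "JOIN track t ON t.id = e.track_id "
        "WHERE t.title = ? ORDER BY p.name",
        (track_title,),
    )
    return [name for (name,) in rows]
```

Both directions come out of the same three tables, and removing a track from a playlist is one row deleted rather than two links to keep in sync.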
Evidently you haven't used ReiserFS. It already does this.
ReiserFS only journals filesystem metadata. Because it uses a B+ tree balanced allocation scheme for file blocks, when the system crashes the last pair of blocks written will often be swapped with respect to their files. For example (this has happened to me and separately to a friend), if you modify /etc/passwd just before the system crashes, you'll be unable to log in again when the system comes up.
What happened? syslogd wrote a panic message to /var/log/messages as the system went down, and you'll find this message (as well as the rest of its 4k block) in /etc/passwd, while your changed password file may be found at the end of /var/log/messages. This is a feature of ReiserFS, not a bug.
I still miss the raw speed of ReiserFS, to be sure, but EXT3 has kept every last one of my hundred-odd filesystems rock solid for two years now, which is really what you want a journalled FS to do.
ObOnTopicComment: Miner's examples are clumsy and ill-considered. BBN's Dave Mankins put a relational database into the 4.1BSD filesystem back in 1984, and Plan 9 took a more rational approach with its namespace algebra. This is not a new idea, so there's absolutely no excuse for breathless exposition based merely on coolness factor.
At best, Miner's descriptions obscure any true value Reiser's proposal might have. Organize my MP3 collection with FS metadata and lose it all when I try to move to another FS? What is he thinking? Is he thinking at all?
Sheesh.
This is said by someone who obviously hasn't done any real world application profiling. It's quite the opposite -- CPU is relatively rarely a limiting factor in desktop applications, dealing with the HDD very often is.
This is very often why adding more memory to a system makes it seem more responsive -- larger disk buffers, less need for disk based virtual memory.
Basically hard disks are very often *the* limitation; CPUs are fast.
With the appropriate software support (in the filesystem api and browser), this seems like a major advance in usability to me. Certainly people liked the BeOS filesystem. Since Microsoft is claiming to have something like this in the works, it's probably necessary on the Linux side just to keep up...but my guess is that ReiserFS will turn out to be cooler than Longhorn.
Translators should not be restricted to mappings from flat file to directory structure, but should also allow for mappings like dir->file, dir->dir or file->file.
The mapping dir->file would make it possible to implement different access rights for different parts of a file.
I wish people with clever ideas to redesign POSIX namespaces would spend ten years in system administration first so they realise what's involved with managing REAL WORKING SYSTEMS.
Some of the ideas might well lead in useful directions, but some (at least as described in the paper) are plain silly. viz:
1) with overlayed mounts:
suppose my home dir is mounted read-write over a read-only system root, and I do not have a "/bin/prog" in my home dir. Consider:
cp /bin/prog /bin/prog
First time, it copies the system /bin/prog into my home fs - Counter-intuitive to the path semantics. If I run this a second time it copies my copy of /bin/prog over itself - Inconsistent.
2) Attributes in the namespace
We have a rather carefully written setuid chown/chgrp/chmod replacement which can be run by users in an "admin" group, and allows devolution of 1st-line support tasks to nominated users. It won't touch files whose uid/gid is 100, so they can only touch non-system files.
If attributes (file uid) are file/..uid and cp is supposed to handle what chown does, the above breaks big-time. We now need a custom cp replacement. Either that or we have to add an ACL for the admin group to every file we want them to manage, which is a great deal of effort, and likely to end up inconsistent.
Contrary to the paper, setuid and PARTICULARLY setgid are NOT going to go away in the real world any time soon, as far as files are concerned. Ports less than 1024 are a different matter and I agree with the document.
3) Consider the number of file descriptors involved if /etc/passwd becomes a hierarchy of files. Just logging in one user will involve multiple open()-read()-close() operations. Whilst these might be efficiently implementable at fs-level, it is still very inefficient in user space, or will at least require a dramatic rethink of unix tools.
Yes they do, if they're seen by tar as ordinary files. That was one of the main points of the article, which not many people here seem to have read (as per usual).
Female Prison Rape in NY
A filesystem is nothing like a relational database. I wish people would stop making this comparison, because it's completely misleading and unhelpful. A filesystem is not a set of user-defined tables, each of which contains an unordered set of rows. Queries and joins are not possible. Constraints and null values are not supported. Files within a directory have an inherent order. Files are variable-length and byte-addressable. Duplicate "rows" are not permitted. The principal relationship modeled is hierarchy... ever heard of a hierarchical database?
Java: the COBOL of the new millenium.
Why not do both? It would seem the easiest solution would be to implement common header files, like Dvorak suggests, that then get mirrored into the file system. This could easily be done by the file system whenever it writes a file. That way, the fs could have a relational database for searching and all, but the files retain control over the metadata. Transferring the file would be no problem. The metadata would get transferred in the header of the file, and then written to the database by the filesystem. (And yes, there would be a little overhead for checking and writing the metadata to the fs every time a file is written, but this is being done anyway by any fs that uses a metadata database, yes/no?)
Windows support for metadata has always sucked, recognised by every Mac user who moved to a PC and discovered that you had to tell the system what a file did by appending a clumsy tla to the end, and passing gently over the inconsistencies of the support for long and short filenames.
Actually, NTFS support for "metadata" is impressive; you can have 255 streams per file. A stream in NTFS is what Mac users call a fork, but Macs are limited to 2, data and resource. You can happily make every file on NTFS an OLE server too and do away with file extensions altogether, if you want to. Oh, and NTFS has reparse points too - think like a trigger on a database table, but attached to a file. And NTFS has journalled from day 1, whereas Linux filesystems are only just discovering this.
So why are file extensions still in common use? Largely because people who don't know NTFS come out with statements like "windows support for metadata has always sucked" without bothering to read the documentation, so few apps take advantage of this NTFS feature.
I'm delighted with the prospect of metadata-as-files and files-as-directories (ergo, metadata-as-directories?) -- but here we have another problem to address: Insufficiently escaped data. Human-readable data fields (including filename, if the user can read/write it) should be able to contain any human-readable characters. Filenames should be free to contain normal punctuation; path separators -- again, if the user can read/write paths -- should be selected from outside the normal punctuation set, or else the stuff between the separators should be escaped. Or the user-accessible file and path names should be stored as metadata.
Can't tell you how much frustration I've endured over other people's improperly escaped data. This just looks like one more case.
(Mac OS <=9 used a colon as a path separator, making it the only forbidden character for file/folder names, which could have been avoided so easily: How about a pipe, guys? or (shiver) a backslash? Or, even better, some control character unique to the Mac, akin to the Option-Shift-K Apple logo? Programmers. Sheesh.)
While there are problems to be solved (backup, for instance), there are many benefits as well.
Many detractors of the UNIX security model point to the lack of ACLs and fine-grained security.
The U.S. DoD has contracted Namesys (the ReiserFS guys, led by Hans Reiser) to develop a filesystem upon which security can be built. Reiser's vision is a filesystem which allows users of the filesystem to define what security means in an extensible manner, with plugins.
The future of ReiserFS includes a plugin architecture which makes it easy to implement NT permissions, for instance, without breaking existing programs or requiring new semantics which wouldn't work across, say, NFS.
I'm not sure why the author chose to throw Plan 9 into the mix. Plan 9 has some interesting features, but I don't feel that either Plan 9 or ReiserFS was given sufficient attention to allow the reader to understand just _why_ such things are interesting.
I tried to sum up some interesting Plan 9isms here, but I'd rather not go into it -- reexporting modified views of the filesystems is a complicated thing, and it's hard to justify such complexity in limited space.
As far as ReiserFS goes, the current and future benefits are many.
First, speed is king, and ReiserFS takes the crown. Anyone who has waited more than 10 seconds for "ls /usr/share/doc" on ext2 can attest to the fact that ext2 craps out on large directories. Reiser V3 does quite well, and V4 is rumored to do even better.
Second, space efficiency is nice, and ReiserFS does it better than anything out there. While storage sizes are increasing dramatically, it "feels" wrong to waste a whole block on a trivial file (only a few bytes long, for instance). With ReiserFS, developers don't need to waste time trying to work around the need for efficient small file access -- it's efficient to have many small files, and it's efficient to have many of them in one directory if needed.
I believe it's been said that in the future, ReiserFS may compress the filenames in some way to try to eliminate even that overhead.
ReiserFS V4 implements wandering journals, and support for transactions. What does that mean to the layperson? It means that an application which handles important data (for instance, an online purchasing system) can promise that, in the event of a crash, a group of related filesystem operations is guaranteed either to have all completed, or to have all failed.
For instance, a purchasing system makes a note to charge the person's credit card, and to ship the items that were ordered. You can tell ReiserFS programmatically to "create a file in /var/orders/ and place this text in it, as well as a file in /var/ccbilling with the customer's credit info in it, and if the system crashes halfway through, throw it all away, because I don't want to charge for products I'm not shipping and I don't want to ship products for which I'm not billing."
If the system crashes, you don't have to work to make sure that you have a one-to-one mapping of orders to sets of billing information -- ReiserFS can guarantee you that either all of it got recorded, or none of it got recorded.
So now you've got a fast, reliable filesystem. I'll even let you ignore the fact that ReiserFS will let you implement per-file compression and encryption plugins. Still not impressed? Wondering what all of this crap about namespaces is about?
The author of this article basically ripped off Hans Reiser's examples from his V4 draft document, such as the /etc/passwd thing. I'm not sure where he got the idea of subusers and I don't much care, as it's not really relevant to the namespace thing.
What is interesting, but not really mentioned, is that file filters can become part of the filesystem.
The specific implementation details in this example are products of my imagination,
Somebody get that guy an ambulance!
Right, but if files-as-directories aren't supported on the destination FS, you've lost your ability to play any of the files, because they are now stored as directories.
It's true that NTFS has all that capability. But very little of it is exploited in applications or even the standard Explorer Shell.
This goes back to the same problem they've had since OS/2 --> Because the UI has to be 100% functional on FAT and dumb network filesystems, nobody takes advantage of neat filesystem features in NTFS.
Linux GUIs will likely have the same problem with Reiser4 -> Very few users, therefore all features will be designed around the Lowest Common Denominator POSIX-type filesystem.
The way it would work is a program would look for a piece of metadata. If it was not there it would run a standard program using popen that would then create the metadata and return it, and also set the metadata on the file if it had permission to. This program would probably work something like "file" does now and would use text configuration files to actually determine how to extract interesting information from all known file types. Most likely this actual implementation of looking up metadata would be wrapped in a libc-level function so it just looks like you ask for the metadata.
The obvious advantages:
Files can be transferred by protocols that don't understand metadata.
Files can be stored on filesystems that don't understand metadata.
Metadata is never "lost" as long as the real file data is intact.
Metadata can be stored locally for remote files.
Metadata can be modified without permission to modify the file.
Probably much faster because metadata is not calculated until first needed.
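Here is roughly what I mean, as a minimal Python sketch. It is not a real libc-level implementation: instead of running an external "file"-like program through popen, it calls a stand-in function based on Python's mimetypes module, and the JSON cache file is just one assumed way of storing the metadata locally. All names are made up.

```python
import json
import mimetypes
import os
import tempfile


def compute_metadata(path):
    """Stand-in for the external 'file'-like helper: derive metadata from the file."""
    mime, _encoding = mimetypes.guess_type(path)
    return {"mime": mime or "application/octet-stream",
            "size": os.path.getsize(path)}


def _load(cache_path):
    """Read the local metadata cache, treating a missing or corrupt file as empty."""
    try:
        with open(cache_path) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}


def get_metadata(path, cache_path):
    """Return cached metadata if still current, else compute and try to store it."""
    cache = _load(cache_path)
    key = os.path.abspath(path)
    stamp = os.stat(path).st_mtime_ns
    entry = cache.get(key)
    if entry is not None and entry["stamp"] == stamp:
        return entry["meta"]
    meta = compute_metadata(path)
    cache[key] = {"stamp": stamp, "meta": meta}
    try:
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    except OSError:
        pass  # no permission to store it: the caller still gets the answer
    return meta


def demo():
    """Look up metadata twice; the second call is served from the cache."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "notes.txt")  # hypothetical file
        with open(path, "w") as f:
            f.write("hello")
        cache_path = os.path.join(d, "meta-cache.json")
        first = get_metadata(path, cache_path)
        cached = os.path.exists(cache_path)
        second = get_metadata(path, cache_path)
        return first, second, cached
```

Because the cache lives outside the file, deleting it or copying the file to a dumb filesystem loses nothing; the metadata is simply recomputed on the next lookup.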
I also think that all files should be treatable as directories, and that the metadata should be presented as being inside this directory (ie the metadata "blah" for file "foo" is foo/blah). But this does not mean it has to be stored this way and does not mean it has to exist when copied.
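The foo/blah presentation can be faked in user space to see how it would feel. A small sketch, assuming a POSIX system; the attribute names (uid, gid, size, mode) and the function name are my own choices, and a real version would live in the filesystem or VFS rather than in a helper like this.

```python
import os
import stat
import tempfile


def read_attr(pseudo_path):
    """Resolve 'file/attr' by looking the attribute up on the regular file 'file'."""
    base, attr = os.path.split(pseudo_path)
    if not os.path.isfile(base):
        raise FileNotFoundError(f"{base} is not a regular file")
    st = os.stat(base)
    attrs = {
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,
        "mode": stat.S_IMODE(st.st_mode),
    }
    return attrs[attr]  # raises KeyError for an unknown attribute


def demo():
    """Create a file and read two of its attributes through pseudo-paths."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "foo")
        with open(path, "w") as f:
            f.write("hello")
        os.chmod(path, 0o600)
        return read_attr(path + "/size"), read_attr(path + "/mode")
```

Nothing is stored under foo/ here; the values come straight from stat() at lookup time, so there is nothing extra to copy or lose.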
The one thing I see with plugins like the ms-word plugin is that, while cool, they seem like they would load the kernel down doing things best left to userspace. How separate is that ms-word (or anything: acrobat, html, whatever) plugin from the rest of kernel land, where things can go VeryWrong if something fails?
If a buffer over- or underflows in the FS plugin, is the kernel going to panic, or simply segfault that FS module, core it, and move on? I'm just worried about my system going down because of a mostly-tested plugin.
A cool kernel land example I can see for this is something like cat /dev/dc0/ip or /dev/eth0/subnet (whatever your ethernet iface is), because these could be dynamically updated as life goes on for the ethernet interface. (Your server has been up long enough to suffer a complete network overhaul, right? :)
Many detractors of the UNIX security model point to the lack of ACLs and fine-grained security.
You don't throw away the door just because it doesn't have a lock. You simply put a lock on it!
FreeBSD now has ACLs, and it did it without throwing away UFS. It didn't need to replace the "everything is a file" model to do it. It just expanded the available extended attributes. To get ACLs you will certainly need to extend the filesystem, but you don't need to replace it completely.
p.s. FreeBSD also got soft updates without throwing away UFS. In short, you don't need to throw away the old to get the new, something a lot of developers don't seem to understand.
A Government Is a Body of People, Usually Notably Ungoverned
Yes, but this isn't throwing anything away. It's just using the "files-as-directories" paradigm to add new stuff.
Hence the whole point of implementing this. It will work with everything that uses the OLD style of doing things, while giving us room to grow and expand and enhance.
More important, it extends the "everything-as-a-file" paradigm to meta-data, something that we have never had before. What with procfs and devfs and the like, this is only a good thing.
Ideally, this would be implemented above the filesystems themselves, in the VFS layer, such that *ALL* filesystems could take advantage of the features, and the feature could be controlled via a mount option. Unfortunately, the performance penalty might be too high since the meta-data is so small, and will need to be platform/filesystem agnostic.
I can conceive of a few ways to make this work, although I'm pretty sure most (if not all) of them will not be high-performance.