State Of The Filesystem
Skeme writes "Have you heard of Plan 9 or Reiser4 but don't know much about them? Are you curious about the improvements free software is making to its filesystems in general? Read my summary of the current developments in the filesystem: namely, what improvements we can expect (a lot), and what Linux and the BSDs can do to improve on the filesystem."
I've always wondered how these filesystems with metadata handle transferring files between different systems. It would suck to have all your MP3 info in filesystem metadata and then lose it all when you transferred to a system without fs metadata. Anyone have any insight?
...and not very general. Interesting for its comments on what's being tried out in R-FS & Plan9 but certainly doesn't manage to be a general summary of what's going on.
How about the changes coming in 2.6 (like xfs support built in)?
The article makes some good points but for me it could have done with rewriting to make it more general, separate the analysis of filesystem implementation problems from technical detail, and included more examples from other file systems.
The concept of reducing primitives is a good one, and based in sound mathematical theory. As already pointed out though, you need some container format that can handle some of these new ideas, things like very small files, files as directories and so on. This is needed, because when you transfer files through lossy mediums like the internet, or older filing systems, you don't want to lose the structure.
As far as I know, there isn't a container format that can do this. Tar is showing its age already, and I wouldn't like to see it hacked yet again. Zip is all right, but you'd need to break compatibility to add in all those extra features, and then it wouldn't be zip anymore. It'd also be inefficient.
So, what I propose is rather than reinvent the wheel to solve this problem, we add support for "boxing" to the Linux kernel.
A box is a filing system in a file. We already use them, to some extent - it's been possible to mount ISO images using the loopback filing system for a while. What's needed is to take this to the next level. The first thing is that we need the ability to use files as mount points, not just directories. When files and directories are the same, well, I guess that should be easier.
The .box file format simply contains a short header with some useful metadata, like maybe a checksum, and the filing system it contains (maybe that isn't needed). The fun part is the UI. What you need is the ability to right click on any dirfile (for want of a better term) and choose the "Box it" option. You'd need a better label. What essentially happens then is that the hierarchy below this point is sucked up and transformed into an ISO containing perhaps a "Reiser4-Lite" filing system. You can forgo the journal and other things that are redundant purely for storage.
The user has then converted their file or directory into something that can be transferred across the net, on Windows compatible CDs and so on, without losing the inherent structure of the original.
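To make the header idea a bit more concrete, here's a rough sketch in Python. Everything in it is made up for illustration: the magic bytes, the field sizes, the choice of SHA-256 and the "reiser4-lite" name are all assumptions, not an existing format.

```python
import hashlib
import struct

# Hypothetical layout, big-endian, no padding:
#   4-byte magic, 16-byte filesystem name, 32-byte SHA-256 of the payload,
#   8-byte payload length. None of this is an existing format.
BOX_MAGIC = b"BOX1"
HEADER = struct.Struct(">4s16s32sQ")


def box(payload: bytes, fs_name: str = "reiser4-lite") -> bytes:
    """Wrap a filesystem image in the hypothetical .box header."""
    digest = hashlib.sha256(payload).digest()
    header = HEADER.pack(BOX_MAGIC, fs_name.encode("ascii"), digest, len(payload))
    return header + payload


def unbox(blob: bytes) -> tuple[str, bytes]:
    """Validate the header and return (filesystem name, image bytes)."""
    magic, raw_name, digest, length = HEADER.unpack_from(blob)
    if magic != BOX_MAGIC:
        raise ValueError("not a .box file")
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length or hashlib.sha256(payload).digest() != digest:
        raise ValueError("payload is truncated or corrupted")
    # struct pads the name with NUL bytes, so strip them before decoding.
    return raw_name.rstrip(b"\0").decode("ascii"), payload
```

The checksum is what lets the receiving end notice that a lossy transfer damaged the image before anything tries to mount it. The mounting itself would still be the kernel's job and isn't shown here.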
At the other end, choosing the "Unbox" option mounts the contents of the box using the loopback FS, mounted at the point of the file. To the user, it is seamless, far easier than zips or tarballs.
Of course, there are lots of complications. You have to agree on the format to use inside the box, for one, because the need to have kernel mods and so on makes it more complex than just installing tar.
I think MacOS has something a little bit similar with mountable disk image (.dmg) files, but the MacOS filing system is rather poor, and I don't know how easy it is for users to create them. Also the OS unfortunately applies some magic to them - for instance Safari will automatically extract the contents of the DMG file then destroy it when you download one (but other stuff does not, oops).
Anyway. That's one way to prevent loss of vital structure when transferring across lossy mediums, that I can think of. There are probably others.
You're missing the point. chmod would still exist as a userland program; it is the kernel call which would be removed.
To the user, there would be no change; to the userland programmer, there would be no change; to the C library developer, there would be a change (to implement chmod in terms of the existing filesystem operations); and to the kernel developer there would be a change (mostly in the direction of reduced complexity due to a smaller number of necessary functions).
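A rough sketch of what the C library side might look like, written in Python for brevity. It assumes a filesystem that exposes a per-file "..mode" pseudo-file (a name I'm making up by analogy with the "..uid" files discussed elsewhere in this thread) holding the mode as octal text. On an ordinary filesystem it only runs against directories, where "..mode" is just a plain file standing in for the real thing.

```python
import os


def chmod(path: str, mode: int) -> None:
    """Userland chmod expressed as an ordinary write to path/..mode."""
    # Assumption: the filesystem interprets writes to ..mode as a mode change.
    with open(os.path.join(path, "..mode"), "w") as attr:
        attr.write(format(mode, "o"))


def stat_mode(path: str) -> int:
    """Read the mode back by reading the same hypothetical pseudo-file."""
    with open(os.path.join(path, "..mode")) as attr:
        return int(attr.read(), 8)
```

Callers still just call chmod(path, 0o644); only the body changes from a dedicated syscall to open/write/close.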
Before adopting any of these ideas, one must consider the security implications of doing so.
If we assume that the filesystem is decoupled from the access control layer in the kernel, then one must ensure that any operation that potentially affects security is adequately controlled.
For example, on systems with _POSIX_CHOWN_RESTRICTED, the following ought to be illegal:
cp foo/..uid bar/..uid
This can be accomplished by making the ..uid files mode 444. Without _POSIX_CHOWN_RESTRICTED, they would be mode 644. However, we have now moved a systemwide security feature into the filesystem. If multiple filesystems are configured into one kernel, then they ought to be consistent; otherwise the security model will be flawed.
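Here's a toy model of that check in Python, just to show where the policy ends up living. It is a simplification under my own assumptions: only the owner-write bit and root are modelled (no group or other bits), and AttrFile, may_write and cp are names I invented for the example.

```python
from dataclasses import dataclass


@dataclass
class AttrFile:
    """A hypothetical per-file attribute such as bar/..uid."""
    value: str
    owner: int
    mode: int  # 0o444 with restricted chown, 0o644 without


def may_write(attr: AttrFile, caller_uid: int) -> bool:
    """Root may always write; the owner needs the owner-write bit."""
    if caller_uid == 0:
        return True
    return caller_uid == attr.owner and bool(attr.mode & 0o200)


def cp(src: AttrFile, dst: AttrFile, caller_uid: int) -> None:
    """Model of `cp foo/..uid bar/..uid`: an ordinary write to dst."""
    if not may_write(dst, caller_uid):
        raise PermissionError("destination attribute is read-only")
    dst.value = src.value
```

Note that the restricted-chown policy is now nothing more than the mode each filesystem happens to give its ..uid files, which is exactly the consistency problem described above.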
As for things such as allowing access to an environment, doesn't that break encapsulation? It means for a certain filename, the filesystem must grovel through a user-space process to find the environment. Also, if a change in some external environment immediately affects some partially-related processes (e.g. daemons started from that shell), then a whole new raft of security holes will come up based on a process' environment or filesystem layout changing unexpectedly.
Cool ideas, but let's be careful lest we make a steaming pile of Swiss cheese.
GConf was a better example. ATM using GConf is, well, not hard, but you have a lot of extra machinery involved, new APIs to learn and so on. Basically all that machinery does is control the backends and give change notification (it does stuff like schema validation as well).
It'd be *much* easier to use GConf if in order to read a value, you didn't have to load up the GConf libs (which in turn depend on CORBA), or parse XML files. At the moment that's really the only way to do it, but in most environments/languages it's far easier to manipulate files and directories than it is to talk to a CORBA server or bind APIs into them.
You also get an increase in efficiency. Parsing XML is kind of kludgy - XML is not a particularly efficient format to store stuff in. It's a good compromise between humans and machines, but both of us have to do lots more work to meet in the middle. The reason it's used, rather than lots of small files, is that otherwise GConf would be too slow. In fact, they are already talking about removing yet more of the files/directories to speed things up, and sticking them all in the same XML file.
Being able to have a configuration system that truly leveraged the filing system would make a lot of stuff easier, more reliable, and faster (because you can take advantage of filing systems that are really really tuned to take advantage of advanced data structures).
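For what it's worth, here is roughly how little code a "one small file per key" config store needs. This is a sketch in Python, not GConf's API: the function names and the example key are invented, and it leaves out change notification and schema validation entirely.

```python
from pathlib import Path


def set_key(root: Path, key: str, value: str) -> None:
    """Store a setting as one small file, e.g. root/desktop/background/color."""
    path = root / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(value + "\n")


def get_key(root: Path, key: str) -> str:
    """Reading a value is just reading a file; no library or XML parser."""
    return (root / key).read_text().strip()


def list_keys(root: Path, prefix: str = "") -> list[str]:
    """Enumerate keys under a prefix by walking the directory tree."""
    base = root / prefix
    return sorted(p.relative_to(root).as_posix() for p in base.rglob("*") if p.is_file())
```

Whether this is fast enough depends entirely on how well the underlying filing system handles lots of tiny files, which is the point being made above.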
It won't really impact the way you do things like set file attributes today. Most of the changes would be under the hood. But used well, everything would become easier for the developers, and so more advanced and slicker for the user.
I'm really getting tired of the ever-creeping assertion that transactions are required for [x]. At first x was ACID-compliant relational databases, and such was true because ACID was defined as such. However, then I started to see assertions that relational databases had to be ACID-compliant (mostly from the anti-MySQL camps who were ignoring the long history of highly valuable, non-ACID relational databases).
Now, in this article, I see the assertion that databases in general require transactions, and thus cannot be supported by a filesystem.
Worse, the logic is self-refuting, as the article previously states that a filesystem is a database, just a limited one. As it happens, POSIX-type filesystems are quite powerful, and let's not kid ourselves into thinking that they have not served us well for 20-30 years! Yes, changes are coming and I'm frankly quite impressed by Hans Reiser's accomplishment in finally coming up with a balanced-tree-based filesystem. Many have tried and failed where he succeeded.
That's because his was a great step forward, not because the old UNIX filesystems weren't also. Let's stop trying to re-define terms so that we can explain why the last 20 years were the dark-ages. They simply were not.
Yes. They don't have access to the NTFS specs. Also, NTFS is a very complex filing system, with many different versions. You don't want to get that wrong. Resizing was a more important goal, and that has been working for many months now.
Of course, it might not be included in all distros when completed anyway, due to patents MS hold on the technology.
I agree that we need a revolution in how filesystems work inside an operating system, but it seems that the arguments placed in this paper had a lot of holes.
For one thing, the case for changing a filesystem should not rest solely on space or metadata. I think security, speed of data retrieval, and self-correcting error handling should be central to the new systems.
The reason speed of data retrieval matters more than data size is that hard drives are growing in capacity much faster than they are growing in speed. In five years, we may have 20 terabyte drives, but the access speeds will still be horrible.
Security and error correction are obvious points that should be implemented on a systemwide level. When these features are system wide, then management becomes much easier for all system users.
This article seems to just be the author brainstorming or feeling excited about reiserfs. It's hardly a "summary of developments in the filesystem". Now if he was asking about opinions on his article it'd be fine, but he's not, so I'll just discard this as another non-news.
Nobody (apart from perhaps this guy) has ever claimed that this syntax will actually ever be used, or needed. There are other possible syntaxes available, and in fact one long term blue sky plan for RFS is to allow many different types of syntax within the same file path, including for instance things that vaguely resemble database queries.
So, don't get hung up on the syntax given in this article.
Are you so sure that you would hate it? After reading the PDF, I was thinking of two things:
I don't think it's any more radical than treating network sockets as files. Sure, it might feel a little weird at first, but once you got used to it, the simplicity would outweigh the clumsiness of existing implementations.
It's also very easy to wrap together a shell script that imitates the existing implementations and put it to /bin/chown or whatever you wish to replace.
It seems that the author presumed that the only use of LDAP is to provide passwords for user authentication. While that is a common use of LDAP it is not the only use.
It would seem that having a file system that is LDAP aware could be extremely useful. Imagine if your LDAP tree were reflected as a tree in your file system. You wouldn't need to embed LDAP calls in your application, it would just be data in your file system. So looking up an attribute for the current user, or a user, would be as simple as reading a file that holds the value of the attribute.
What I want is for someone to come up with a real unlimited snapshotting filesystem for Linux. I don't want to use user mode hacks (as nice as they are, rsync style snapshotting isn't reliable enough), or snapshotting that only allows a shadow copy of the entire volume. I want to be able to tell the users that they can just go into ~/.snapshot/time (where time can be hours, days, or weeks in the past) and copy the file they messed up back into their home directory. Basically I want the most useful feature of NetApps without the HUGE markup =) The saving in admin time, both in user interaction and in reduced need to do tape retrieval and file restores, is immense.
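From the user's side, the restore would be nothing more than a copy. A minimal sketch in Python, assuming the filesystem exposes read-only snapshots under ~/.snapshot/<name>/ (the snapshot name and the restore helper are hypothetical):

```python
import shutil
from pathlib import Path


def restore(home: Path, snapshot: str, relative: str) -> Path:
    """Copy home/.snapshot/<snapshot>/<relative> back to home/<relative>."""
    source = home / ".snapshot" / snapshot / relative
    target = home / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    return target
```

Something like restore(Path.home(), "hourly.1", "docs/report.txt") is the whole job, with no tape and no admin involved.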
The only thing I'm concerned about is backward compatibility - if someone accidentally tries to open a file with a trailing slash, and gets an error because now it's a directory, then it's a Bad Thing.
3.8 How was the Linux NTFS Driver written?
Microsoft haven't released any documentation about the internals of NTFS, so we
had to reverse engineer the filesystem from scratch. The method was roughly:
- Look at the volume with a hex editor
- Perform some operation, e.g. create a file
- Use the hex editor to look for changes
- Classify and document the changes
- Repeat steps 1-4 forever
If this sounds like a lot of work, then you probably understand how hard the task has been. We now understand pretty much everything about NTFS and we have documented it for the benefit of others: http://linux-ntfs.sourceforge.net/ntfs/index.html
Actually writing the driver was far simpler than gathering the information.
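The "look for changes" step in that loop is easy to automate. A small sketch in Python, assuming you have two raw images of the same volume taken before and after the operation and that they are the same length (the function name is mine, not anything from the linux-ntfs project):

```python
def changed_ranges(before: bytes, after: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets, end exclusive, where the images differ."""
    ranges = []
    start = None
    for offset, (a, b) in enumerate(zip(before, after)):
        if a != b and start is None:
            start = offset  # a run of differing bytes begins here
        elif a == b and start is not None:
            ranges.append((start, offset))  # the run ended just before offset
            start = None
    if start is not None:
        # The images differ right up to the end of the compared region.
        ranges.append((start, min(len(before), len(after))))
    return ranges
```

The offsets it returns are where you'd point the hex editor. Classifying and documenting what changed is still the hard, manual part.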
This is said by someone who obviously hasn't done any real world application profiling. It's quite the opposite -- CPU is relatively rarely a limiting factor in desktop applications, dealing with the HDD very often is.
This is very often why adding more memory to a system makes it seem more responsive -- larger disk buffers, less need for disk based virtual memory.
Basically hard disks are very often *the* limitation; CPUs are fast.
I wish people with clever ideas to redesign POSIX namespaces would spend ten years in system administration first so they realise what's involved with managing REAL WORKING SYSTEMS.
Some of the ideas might well lead in useful directions, but some (at least as described in the paper) are plain silly. viz:
1) with overlayed mounts:
suppose my home dir is mounted read-write over a read-only system root, and I do not have a "/bin/prog" in my home dir. Consider:
cp /bin/prog /bin/prog
First time, it copies the system /bin/prog into my home fs - Counter-intuitive to the path semantics. If I run this a second time it copies my copy of /bin/prog over itself - Inconsistent.
2) Attributes in the namespace
We have a rather carefully written setuid chown/chgrp/chmod replacement which can be run by users in an "admin" group, and allows devolution of 1st-line support tasks to nominated users. It won't touch files whose uid/gid is 100, so they can only touch non-system files.
If attributes (file uid) is file/..uid and cp is supposed to handle what chown does, the above breaks big-time. We now need a custom cp replacement. Either that or we have to add an ACL for the admin group to every file we want them to manage, which is a great deal of effort, and likely end up inconsistent.
Contrary to the paper, setuid and PARTICULARLY setgid is NOT going to go away in the real world any time soon, as far as files are concerned. Ports less than 1024 are a different matter and I agree with the document.
3) Consider the number of file descriptors involved if /etc/passwd becomes a hierarchy of files. Just logging in one user will involve multiple open()-read()-close() operations. Whilst these might be efficiently implementable at fs-level, it is still very inefficient in user space, or will at least require a dramatic rethink of unix tools.