State Of The Filesystem
Skeme writes "Have you heard of Plan 9 or Reiser4 but don't know much about them? Are you curious about the improvements free software is making to its filesystems in general? Read my summary of the current developments in the filesystem: namely, what improvements we can expect (a lot), and what Linux and the BSDs can do to improve on the filesystem."
I've always wondered how these filesystems with metadata handle transferring files between different systems. It would suck to have all your MP3 info in filesystem metadata and then lose it all when you transferred to a system without fs metadata. Anyone have any insight?
Will ReiserFS be ready for the 2.6 kernel? Just curious.
Any sufficiently advanced libertarian utopia is indistinguishable from government.
It's really nice. But what does it bring that is new enough that we should rewrite 90% of all system tools to use these new features? I find "cp /a/..uid /b/..uid" the same as "chown --reference=/a /b"...
There has been talk in the kernel mail list of implementing 9p, the plan 9 distributed filesystem.
Now that it is Open Source, I hope it will happen.
...and not very general. Interesting for its comments on what's being tried out in R-FS & Plan9 but certainly doesn't manage to be a general summary of what's going on.
How about the changes coming in 2.6 (like xfs support built in)?
The article makes some good points, but for me it could have done with rewriting to make it more general, separate the analysis of filesystem implementation problems from technical detail, and include more examples from other file systems.
"we demand rigidly defined areas of doubt and uncertainty!"
The concept of reducing primitives is a good one, and based in sound mathematical theory. As already pointed out though, you need some container format that can handle some of these new ideas, things like very small files, files as directories and so on. This is needed, because when you transfer files through lossy mediums like the internet, or older filing systems, you don't want to lose the structure.
As far as I know, there isn't a container format that can do this. Tar is showing its age already, I wouldn't like to see it hacked yet again. Zip is alright, but you'd need to break compatibility to add in all those extra features, and then it wouldn't be zip anymore. It'd also be inefficient.
So, what I propose is rather than reinvent the wheel to solve this problem, we add support for "boxing" to the Linux kernel.
A box is a filing system in a file. We already use them, to some extent - it's been possible to mount ISO images using the loopback filing system for a while. What's needed is to take this to the next level. The first thing is that we need the ability to use files as mount points, not just directories. When files and directories are the same, well, I guess that should be easier.
The .box file format simply contains a short header with some useful metadata, like maybe a checksum, and the filing system it contains (maybe that isn't needed). The fun part is the UI. What you need is the ability to right click on any dirfile (for want of a better term) and choose the "Box it" option. You'd need a better label. What essentially happens then is that the hierarchy below this point is sucked up and transformed into an ISO containing perhaps a "Reiser4-Lite" filing system. You can forgo the journal and other things that are redundant purely for storage.
The user has then converted their file or directory into something that can be transferred across the net, on Windows compatible CDs and so on, without losing the inherent structure of the original.
At the other end, choosing the "Unbox" option mounts the contents of the box using the loopback FS, mounted at the point of the file. To the user, it is seamless, far easier than zips or tarballs.
Of course, there are lots of complications. You have to agree on the format to use inside the box, for one, because the need to have kernel mods and so on makes it more complex than just installing tar.
I think MacOS has something a little bit similar with mountable disk image (.dmg) files, but the MacOS filing system is rather poor, and I don't know how easy it is for users to create them. Also the OS unfortunately applies some magic to them - for instance Safari will automatically extract the contents of the DMG file then destroy it when you download one (but other stuff does not, oops).
Anyway. That's one way to prevent loss of vital structure when transferring across lossy mediums, that I can think of. There are probably others.
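To make the "short header plus checksum" idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the `BOX1` magic string, the JSON header, and the names `box_it` and `unbox` are made up, and the payload is treated as opaque bytes standing in for whatever filesystem image the box would really carry.

```python
import hashlib
import json
import struct

MAGIC = b"BOX1"  # hypothetical magic number, not a real format


def box_it(payload: bytes, fs_format: str) -> bytes:
    """Wrap a filesystem image in a short header with a checksum."""
    header = json.dumps({
        "format": fs_format,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }).encode("utf-8")
    # Layout: magic, 4-byte big-endian header length, header, payload.
    return MAGIC + struct.pack(">I", len(header)) + header + payload


def unbox(blob: bytes) -> tuple[dict, bytes]:
    """Split a box into (header, payload), verifying the checksum."""
    if blob[:4] != MAGIC:
        raise ValueError("not a box file")
    (header_len,) = struct.unpack(">I", blob[4:8])
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    payload = blob[8 + header_len:]
    if hashlib.sha256(payload).hexdigest() != header["sha256"]:
        raise ValueError("checksum mismatch: box was damaged in transit")
    return header, payload
```

The checksum is what lets the receiving end notice that a lossy medium mangled the box before anything gets mounted. The mounting itself (the loopback part) is the bit that needs kernel support and is not shown.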
If Linux and related systems move to filesystems with really powerful metadata support, presumably the lock-in would be much stronger. Moving a directory from Linux to a Windows system may be possible but the programming to do it will become increasingly painful and the risk of data loss will rise. And with mainframe integrity, why would you want to, Mr. customer?
Apart from the CS issues, is this an attempt to use the embrace, extend weapons of Microsoft against it by turning the Linux filesystem into a full mainframe system, effectively squeezing out Windows servers by a convergence between big tin and small boxes? I guess this is pretty pie in the sky but I'd like to think so.
Panurge has posted for the last time. Thanks for the positive moderations.
Really, I think someone should get on with finishing the NTFS filesystem access in the kernel. With people migrating to XP it's really becoming more important that this driver is fixed (how long has it been declared "dangerous" for write use now?!).
I'd really like to know why this driver has taken so long to complete - is there some information that the developers don't have access to? Some technical reason? What?!
Code, Hardware, stuff like that.
the os should have a 'list' of what's supported by each fs. when you copy a file from fsa to fsb, the os (or program) should compare features and let you know (somehow) that something is not (or may not be) going to work right. if you copy something from the regular ext2 system, which is case sensitive, to a ms-dos floppy disk, something should try to remind you. or the program checks this and looks for problems.
remember that not all problems can be detected, so are you willing to live with: 'this may not work correctly' messages?
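a rough sketch of that 'list' idea in Python, assuming a hand-written capability table. the table entries and the `copy_warnings` name are made up for illustration; a real os would have to ask each filesystem driver what it supports.

```python
# Hypothetical capability table; the entries are illustrative only.
FEATURES = {
    "ext2": {"case_sensitive", "long_names", "permissions", "symlinks"},
    "msdos": set(),
}


def copy_warnings(src_fs: str, dst_fs: str) -> list[str]:
    """List features the source filesystem has that the destination lacks."""
    missing = FEATURES[src_fs] - FEATURES[dst_fs]
    return [f"this may not work correctly: {dst_fs} lacks {name}"
            for name in sorted(missing)]
```

copying from ext2 to the ms-dos floppy gives one warning per missing feature, and copying the other way gives none. it only catches what is in the table, which is the 'not all problems can be detected' caveat.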
eric
if that crock, that bag-on-the-side, that mess is what we have to look forward to, I think i'll switch to BSD.
/directory/..owner is a big ugly crock.
I mean, accessing owner data by travelling into a directory then backwards out of it again like: vi
Feel that power? That's mah MOUSING FINGER
I see it happening more and more that people present their summaries, articles and technical papers on the net as pdfs. This is very inconvenient.
Pdfs are nice for printing and publishing on traditional media, because you can be sure they will be in the correct layout etcetera for the printer. But on the web, where people browse between lightweight, easy html-documents, they're just a nuisance.
Please, if you must publish a pdf, publish an html version next to it.
I do not know much about file systems, so I have a few questions.
Once we see the GConf example, other possibilities immediately spring to mind. In almost all multimedia formats--such as MP3, MPEG, and OGG--there is a tagging system for storing things such as the author and title. Instead of storing those attributes in the tag--yet another namespace--store it in the file-as-a-directory. I have a file /music/Millencolin/PennybridgePioneers/RightAboutNow.ogg. I let RightAboutNow.ogg/title be "Right About Now", /artist be "Millencolin". I could add a file RightAboutNow.ogg/genre and put "punk" in it.
While this is a nice example, I wonder if there really is an advantage to this kind of file system, because it seems like it takes more effort to keep track of all those sub-files instead of keeping all the info in a single file. Anyone can shed some light on this?
Also, what if I format my partitions with different file systems, say Reiserfs 4 and ext3, could there be any incompatibility issues? Imagine I kept mp3's on both partitions, would my mp3 player know how to handle both formats, since the tag info is dealt with differently on both systems?
Before adopting any of these ideas, one must consider the security implications of doing so.
If we assume that the filesystem is decoupled from the access control layer in the kernel, then one must ensure that any operation that potentially affects security is adequately controlled.
For example, on systems with POSIX_RESTRICTED_CHOWN, the following ought to be illegal:
cp foo/..uid bar/..uid
This can be accomplished by making the UIDs mode 444. Without POSIX_RESTRICTED_CHOWN, the UID is 644. However, we have now moved a systemwide security feature into the filesystem. If multiple filesystems are configured into one kernel, then they ought to be consistent; otherwise the security model will be flawed.
As for things such as allowing access to an environment, doesn't that break encapsulation? It means for a certain filename, the filesystem must grovel through a user-space process to find the environment. Also, if a change in some external environment immediately affects some partially-related processes (e.g. daemons started from that shell), then a whole new raft of security holes will come up based on a process' environment or filesystem layout changing unexpectedly.
Cool ideas, but let's be careful lest we make a steaming pile of Swiss cheese.
I'm really getting tired of the ever-creeping assertion that transactions are required for [x]. At first x was ACID-compliant relational databases, and such was true because ACID was defined as such. However, then I started to see assertions that relational databases had to be ACID-compliant (mostly from the anti-MySQL camps who were ignoring the long history of highly valuable, non-ACID relational databases).
Now, in this article, I see the assertion that databases in general require transactions, and thus cannot be supported by a filesystem.
Worse, the logic is self-refuting, as the article previously states that a filesystem is a database, just a limited one. As it happens, POSIX-type filesystems are quite powerful, and let's not kid ourselves into thinking that they have not served us well for 20-30 years! Yes, changes are coming and I'm frankly quite impressed by Hans Reiser's accomplishment in finally coming up with a balanced-tree-based filesystem. Many have tried and failed where he succeeded.
That's because his was a great step forward, not because the old UNIX filesystems weren't also. Let's stop trying to re-define terms so that we can explain why the last 20 years were the dark-ages. They simply were not.
Umm... That isn't reiser4.
I agree that we need a revolution in how filesystems work inside an operating system, but it seems that the arguments placed in this paper had a lot of holes.
For one thing, the need for changing a filesystem should not really be solely concerned on space or metadata. I think security, speed of data retrieval, and self correcting error engines should be centered on the new systems.
The reason speed of data retrieval matters more than data size is that hard drives are getting bigger much more quickly than they are getting faster. In five years, we may have 20 terabyte drives, but the access speeds will still be horrible.
Security and error correction are obvious points that should be implemented on a systemwide level. When these features are system wide, then management becomes much easier for all system users.
I agree that metadata in the filesystem is a risky proposition. Just on general principle, I prefer my data inside the file and not left with the filesystem. The MP3 metadata example, to me, is like Windows file extensions on HGH. I remember John Dvorak wrote a piece about Windows file extensions a long while back, and he argued that file types, etc. should be inside the file. A header of sorts. I tended to agree then, and I see filesystem metadata as a bad trend.
This article seems to just be the author brainstorming or feeling excited about reiserfs. It's hardly a "summary of developments in the filesystem". Now if he was asking about opinions on his article it'd be fine, but he's not, so I'll just discard this as another non-news.
I'm surprised at the negativity of some of the comments here, moaning that POSIX semantics are perfect and nothing else can possibly be countenanced...
Plan9 namespaces and Reiser4 really do bring a lot more to the table in terms of useful expanded semantics and utility than all the posix filesystems. Posix extended attributes are very limited, and some filesystem implementors seem to be keen to implement them in the most restricted way possible (e.g. size limitations in ext3).
The annoying thing with Reiserfs is that the VFS will lag it by a few versions, and very very few apps will make any use of its special system call. Sigh. We'll be stuck with databases in a file for a long while yet.
One thing I would like to know about reiserfs is how attributes are attached to directories? If they are just small files in the "directory" bit of a file, what distinguishes them from children of the directory? Or are attributes just banned from dirs? Seems limiting.
Nobody (apart from perhaps this guy) has ever claimed that this syntax will actually ever be used, or needed. There are other possible syntaxes available, and in fact one long term blue sky plan for RFS is to allow many different types of syntax within the same file path, including for instance things that vaguely resemble database queries.
So, don't get hung up on the syntax given in this article.
We had plan9 machines here 10 years ago...
I don't think any exist anymore, in fact I don't even think the inferno install works anymore.
But anyway, it isn't a "new advance" anymore.
I may be way off base, but I see the need to have an existing non-journalling file system. Someone stop me if I'm wrong, but in my mind things like audio recording, video capturing and the like would suffer performance-wise from being run on journalled file systems.
The stars that shine and the stars that shrink
in the face of stagnation the water runs before your eyes
It seems that the author presumed that the only use of LDAP is to provide passwords for user authentication. While that is a common use of LDAP it is not the only use.
It would seem that having a file system that is LDAP aware could be extremely useful. Imagine if your LDAP tree were reflected as a tree in your file system. You wouldn't need to embed LDAP calls in your application, it would just be data in your file system. So looking up an attribute for the current user, or a user, would be as simple as reading a file that holds the value of the attribute.
Is for someone to come up with a real unlimited snapshotting filesystem for linux. I don't want to use user mode hacks (as nice as they are, rsync style snapshotting isn't reliable enough), or snapshotting that only allows a shadow copy of the entire volume, I want to be able to tell the users that they can just go into ~/.snapshot/time (where time can be hours, days, or weeks in the past) and copy the file they messed up back into their home directory. Basically I want the most useful feature of netapps without the HUGE markup =) The saving in admin time, both in user interaction and in reduced need to do tape retrieval and file restores, is immense.
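For what the user-facing side of that could look like, here is a small Python sketch. It assumes the snapshots show up as ordinary read-only directories under ~/.snapshot/<time>/, as described above; the `restore` function and the snapshot name used in the usage note are hypothetical, and the snapshotting itself is the filesystem's job and is not shown.

```python
import shutil
from pathlib import Path


def restore(home: Path, when: str, relative: str) -> Path:
    """Copy one file from home/.snapshot/<when>/ back into home."""
    source = home / ".snapshot" / when / relative
    target = home / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)  # copy2 also tries to preserve timestamps
    return target
```

With that in place a restore is a single call such as restore(Path.home(), "hourly.0", "docs/report.txt"), or equally just a plain cp from the shell, which is the whole appeal: no tape, no admin.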
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
[100% ISO 646 Compliant]
SVM, ERGO MONSTRO.
Cool ideas, but let's be careful lest we make a steaming pile of Swiss cheese.
Evidently you haven't used ReiserFS. It already does this.
ReiserFS only journals filesystem metadata. Because it uses a B+ tree balanced allocation scheme for file blocks, when the system crashes the last pair of blocks written will often be swapped with respect to their files. For example (this has happened to me and separately to a friend) if you modify /etc/passwd just before the system crashes, you'll be unable to log in again when the system comes up.
What happened? syslogd wrote a panic message to /var/log/messages as the system went down, and you'll find this message (as well as the rest of its 4k block) in /etc/passwd, while your changed password file may be found at the end of /var/log/messages. This is a feature of ReiserFS, not a bug.
I still miss the raw speed of ReiserFS, to be sure, but EXT3 has kept every last one of my hundred-odd filesystems rock solid for two years now, which is really what you want a journalled FS to do.
ObOnTopicComment: Miner's examples are clumsy and ill-considered. BBN's Dave Mankins put a relational database into the 4.1BSD filesystem back in 1984, and Plan 9 took a more rational approach with its namespace algebra. This is not a new idea, so there's absolutely no excuse for breathless exposition based merely on coolness factor.
At best, Miner's descriptions obscure any true value Reiser's proposal might have. Organize my MP3 collection with FS metadata and lose it all when I try to move to another FS? What is he thinking? Is he thinking at all?
Sheesh.
I find this very interesting.
:-)
For example I like the current devfs and procfs (although not perfect). Those help me a lot to "debug" new hardware connections. Simple and coherent.
I also imagine some RDB support. Imagine a "select * from account" where account would be a simple directory. It would be nice to use "find" for the where clause.
Imagine also implementing OO classes with inheritance using symlinks... And more...
Yes would be very nice!
D.
The only thing I'm concerned about is backward compatibility - if someone accidentally tries to open a file with a trailing slash, and gets an error because now it's a directory, then it's a Bad Thing.
A filesystem that goes wrong properly...
Imagine your filesystem is a library...
Imagine you drop a bomb on it...
Books and pages scattered all over the place... Yet you can still work out which page belongs to which book, and where on the shelves they used to sit..
Having been caught short by LVM and reiser before (which just couldn't deal with a 46GB gap in the filesystem where a disk used to be) it seems to me that no-one's made a filesystem that breaks properly...
For me.. speed is not an issue.. nor is CPU usage... I'm quite happy to throw a dedicated box at handling the filesystem... I just want something that is written with the thought in mind... "Our hardware is unreliable.. It will die.. it will lose bytes... How do we deal with it"... not the usual thing that people do of "Our disk is 100% reliable... how can we most efficiently organise it... Hrm.. now what do we do when we encounter problem X" There should be no fsck... That should be handled by 'mount'
Sure.. there's always RAID... but I reckon I've got about 28MB of data in total that's critically important... the rest is just crap I downloaded from the net and can always get again...
Continuing my analogy further...
If your filesystem's a library, then your physical disks are wings of the library.. (east wing.. west wing.. etc)...
I want a librarian whose job is to work out the best way to organise the data in the different parts of the library... When you're looking for a book, it's often easiest just to ask the librarian.. who magically holds much of that information in his/her head.
The librarian is also responsible for making sure that multiple copies are maintained for popular or important works...
It'd also let me do the thing I've always wanted...
chattr +mirror filename
(i.e. This file is important and must be mirrored on 2 or more disks)
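Until someone writes it, here's a rough user-space sketch in Python of what that flag would have to guarantee.. it assumes each "disk" is just a directory, and the `mirror` and `sha256_of` names are made up.. a real version would live in the filesystem and keep the copies in sync on every write, which this doesn't attempt...

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum used to confirm each copy matches the original."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def mirror(source: Path, disks: list[Path]) -> list[Path]:
    """Copy source onto each 'disk' (here just a directory) and verify it."""
    if len(disks) < 2:
        raise ValueError("a mirrored file needs at least two disks")
    wanted = sha256_of(source)
    copies = []
    for disk in disks:
        copy = disk / source.name
        shutil.copy2(source, copy)
        if sha256_of(copy) != wanted:
            raise OSError(f"copy on {disk} does not match the original")
        copies.append(copy)
    return copies
```

The checksum pass is the "our hardware is unreliable" part.. never assume a write landed just because the copy call returned...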
Just my two pennies worth... I'll write it one day if someone else doesn't get there first.
This is said by someone who obviously hasn't done any real world application profiling. It's quite the opposite -- CPU is relatively rarely a limiting factor in desktop applications, dealing with the HDD very often is.
This is very often why adding more memory to a system makes it seem more responsive -- larger disk buffers, less need for disk based virtual memory.
Basically hard disks are very often *the* limitation; CPUs are fast.
The more I read this, the more it reminded me of the marketing version of how Apple would like us to think of Resource Forks.
Truthfully, there isn't exactly a lot of difference in the concept or the idea. Implementation is vastly different but the idea remains very similar.
Why do I want to accept this sort of idea any more than I want to accept resource forks? If I copy a file with resource forks from one of my macs to nearly any other OS on the market that's not specifically configured to support them, I lose that information. Why do I want to continue this?
I use HFS+ because I have to. To get all the functionality I want out of my macs, it's the only real option I have. But for anything other than system level files that are never likely to be copied to another machine, this is just a waste of time to me.
Next question. Say I do run this file system on my machines. I build up a heap of data and I'm using "files as directories" to store metadata about those files. How do I back it up? Don't even try to tell me "rebuild tar". Haven't we put tar through enough to try and extend its capabilities? I wouldn't touch a file system with these capabilities without a guaranteed way of being able to back up ALL the data. Otherwise it's just truly not worth the effort.
Good old Acorn RISC OS already supported the use of directories as files back in the eighties. E.g.: Click on a file to open it, shift click to show a directory of sub-files (recurse at will).
You're thinking of application directories. It could not be done with ordinary files - They had to be a directory.
The only difference was that if a directory had an exclamation mark as its first character, RISC OS's default action would be to execute a file called !Run inside that directory.
Fantastic idea - It meant that you could keep all the libraries and files required for an application in one folder. It also meant that you could move the program and its associated files wherever you liked on the hard disk - There was only ever one icon to move.
If you ever decided to move stuff around like that in Windows, prepare for your programs to stop working.
I wish people with clever ideas to redesign POSIX namespaces would spend ten years in system administration first so they realise what's involved with managing REAL WORKING SYSTEMS.
Some of the ideas might well lead in useful directions, but some (at least as described in the paper) are plain silly. viz:
1) with overlaid mounts:
suppose my home dir is mounted read-write over a read-only system root, and I do not have a "/bin/prog" in my home dir. Consider:
cp /bin/prog /bin/prog
First time, it copies the system /bin/prog into my home fs - Counter-intuitive to the path semantics. If I run this a second time it copies my copy of /bin/prog over itself - Inconsistent.
2) Attributes in the namespace
We have a rather carefully written setuid chown/chgrp/chmod replacement which can be run by users in an "admin" group, and allows devolution of 1st-line support tasks to nominated users. It won't touch files whose uid/gid is below 100, so they can only touch non-system files.
If attributes (file uid) is file/..uid and cp is supposed to handle what chown does, the above breaks big-time. We now need a custom cp replacement. Either that or we have to add an ACL for the admin group to every file we want them to manage, which is a great deal of effort, and likely to end up inconsistent.
Contrary to the paper, setuid and PARTICULARLY setgid are NOT going to go away in the real world any time soon, as far as files are concerned. Ports less than 1024 are a different matter and I agree with the document.
3) Consider the number of file descriptors involved if /etc/passwd becomes a hierarchy of files. Just logging in one user will involve multiple open()-read()-close() operations. Whilst these might be efficiently implementable at fs-level, it is still very inefficient in user space, or will at least require a dramatic rethink of unix tools.
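The overlay surprise in point 1 (cp /bin/prog /bin/prog behaving differently the first and second time) can be modelled in a few lines of Python. This is only a toy under my own assumptions: `collections.ChainMap` stands in for the union mount, plain dicts stand in for the two filesystems, and the `cp` helper is hypothetical.

```python
from collections import ChainMap

system_root = {"/bin/prog": "system build"}  # read-only layer (by convention)
home_layer = {}                              # read-write layer mounted on top

# Lookups search home_layer first, then system_root; writes go to home_layer.
namespace = ChainMap(home_layer, system_root)


def cp(src: str, dst: str) -> None:
    """Copy within the merged namespace, as a user of the overlay would."""
    namespace[dst] = namespace[src]


cp("/bin/prog", "/bin/prog")  # first run: reads the system copy, writes to home
first = dict(home_layer)
cp("/bin/prog", "/bin/prog")  # second run: reads and writes the home copy
```

After the first call the home layer has silently gained its own /bin/prog, and from then on the system copy is shadowed. The same command line has had two different meanings, which is the inconsistency being complained about.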
I assume Plan9 is an ironic nod to the "worst film ever". When I develop my new filing system, which will only allow numeric characters in filenames, will delete the MFT every time the computer is rebooted, and will require a new directory for each file added to the system - that FAT16 limit of 512 was FAR too generous - I'm going to call it BattlefieldEarthFS.
When I am king, you will be first against the wall.
Already one such implementation exists, such that FreeBSD can expose its file system to Plan 9 machines (as you would expect, it gets imported into your namespace), which can be a different place depending on the namespace of the current process. You can even (temporarily) "replace" your local files with versions on the FreeBSD box, if that's what you want.
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
I used to live like that, then I took my big hard drive, slapped it into a linux box and shared it out with NFS, Samba, and NetAtalk. Now I can access all my files, which automagically get backed-up, from any machine on my LAN. Stop waiting for the 'universal' FS to show up, it'll never happen.
"Sometimes, I think Trent just needs a cup of hot chocolate and a blankie." -Tori Amos on Nine Inch Nails
I'm delighted with the prospect of metadata-as-files and files-as-directories (ergo, metadata-as-directories?) -- but here we have another problem to address: Insufficiently escaped data. Human-readable data fields (including filename, if the user can read/write it) should be able to contain any human-readable characters. Filenames should be free to contain normal punctuation; path separators -- again, if the user can read/write paths -- should be selected from outside the normal punctuation set, or else the stuff between the separators should be escaped. Or the user-accessible file and path names should be stored as metadata.
Can't tell you how much frustration I've endured over other people's improperly escaped data. This just looks like one more case.
(Mac OS <=9 used a colon as a path separator, making it the only forbidden character for file/folder names, which could have been avoided so easily: How about a pipe, guys? or (shiver) a backslash? Or, even better, some control character unique to the Mac, akin to the Option-Shift-K Apple logo? Programmers. Sheesh.)
Maybe we should replace filesystems with persistent hashmaps that have O(1) lookup?
This way, you can add any attribute to any object you like. See Python's hashmaps.
So for /etc/passwd access, something like:
passwdhash = disk0["passwords"]
roothash = passwdhash["root"]
rootshell = roothash["shell"]
rootgid = roothash["gid"]
For chmod of /bin:
disk0["/bin"]["perms"] = 0o755
To check for available attribs, do:
print(disk0["/bin"].keys())
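For anyone who wants to play with this today, Python's standard `shelve` module gives a persistent dict-like object that is close enough for a toy. This is a sketch under my own assumptions: `disk0` here is a shelf file in a temporary directory, not a real disk, and whether lookups are truly O(1) depends on the dbm backend that `shelve` picks.

```python
import shelve
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    store = str(Path(tmp) / "disk0")  # hypothetical stand-in for a disk

    # First "boot": populate the persistent mapping.
    with shelve.open(store) as disk0:
        disk0["passwords"] = {"root": {"shell": "/bin/sh", "gid": 0}}
        disk0["/bin"] = {"perms": 0o755}

    # Second "boot": reopen the store and read the attributes back.
    with shelve.open(store) as disk0:
        rootshell = disk0["passwords"]["root"]["shell"]
        bin_attribs = sorted(disk0["/bin"].keys())
```

One catch: a nested update like disk0["/bin"]["perms"] = 0o700 changes only an in-memory copy unless the shelf is opened with writeback=True, so whole values are assigned above. A real hashmap filesystem would need to settle that kind of question at the design stage.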
Bram
Bram Stolk http://stolk.org/tlctc/
He lost me as soon as he held up GConf as an example of what was to be accomplished. Have you ever LOOKED at the "xml" files that GConf generates? Ever tried to climb the ~/.gconf (and /etc/gconf) trees? I put GConf (and anything that aspires to be like it) in the same category with the Windows Registry. GConf is, by far, the thing I like least about GNOME (and, on the whole, I like GNOME).
Why do people keep adding needless complexity to fix systems that aren't even broken? If I can't edit my configs with vi, I'd rather use something else.
I want all of the power and none of the responsibility.
Just on general principle, I prefer my data inside the file and not left with the filesystem. The MP3 metadata example, to me, is like Windows file extensions on HGH.
I'm with you -- I like self-contained file formats.
But I don't think he was proposing that you not use Ogg tags or MP3 tags; he was talking about the filesystem abstracting the tags. If you changed "Stagnation.ogg/album" to the string "Trespass", then the filesystem abstraction layer should update the Ogg "album" tag inside the file to be "Trespass".
The key benefit here is that you would not need some wacky command-line utility program to let you view and change tags on Ogg files. You could just use the shell. In bash:
for ii in *.ogg; do echo "Trespass" > "$ii/album"; done
Note that this same one-liner would work if you were in a directory with MP3 files, and you changed "*.ogg" to "*.mp3". Currently, you need to run vorbiscomment for your Ogg files, and mp3info for MP3 files. (I just checked, and sure enough, they take different arguments to do the same operations.)
Personally, I'd like to see a standard metadata portable XML format for legacy systems. People talk about copying a file from a rich metadata filesystem and having new files like .attributes created on the target legacy system; I'd be happier if just one big XML file could be created with the same name as the original file.
Suppose you back up server //rich onto server //legacy, and then you want to restore some files from //legacy to //rich. If all the metadata was stored in a big XML file, then when you copy the file from //legacy to //rich you restore all the metadata; you wouldn't accidentally slice off attributes by forgetting to copy one or more rich attributes files.
You could do most of the fancy tricks of the rich metadata filesystem on a legacy filesystem that used the big XML file to store the rich metadata. And as long as the legacy system is just smart enough to look at the main data part of the XML and leave the metadata tags alone, you could still modify the file with sed, awk, perl or whatever, and then copy the big XML file onto your rich metadata filesystem and still not lose any rich metadata.
Note also that the big XML file could be used to deal with existing rich metadata systems, like resource forks from Macintosh filesystems, or multiple data streams from NTFS files.
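Here is a rough Python sketch of what the big XML file could look like. The element names (`richfile`, `metadata`, `attribute`, `data`) and the `pack`/`unpack` functions are my own inventions, not any existing standard, and base64 for the data part is an assumption made so that binary files survive the trip.

```python
import base64
import xml.etree.ElementTree as ET


def pack(name: str, data: bytes, attributes: dict[str, str]) -> bytes:
    """Bundle file data and its rich attributes into one XML document."""
    root = ET.Element("richfile", name=name)
    meta = ET.SubElement(root, "metadata")
    for key, value in attributes.items():
        ET.SubElement(meta, "attribute", name=key).text = value
    # Base64 keeps arbitrary binary data legal inside XML.
    ET.SubElement(root, "data", encoding="base64").text = (
        base64.b64encode(data).decode("ascii"))
    return ET.tostring(root, encoding="utf-8")


def unpack(document: bytes) -> tuple[str, bytes, dict[str, str]]:
    """Recover (name, data, attributes) from a packed document."""
    root = ET.fromstring(document)
    attributes = {a.get("name"): a.text or ""
                  for a in root.find("metadata")}
    data = base64.b64decode(root.find("data").text or "")
    return root.get("name"), data, attributes
```

The base64 choice does cost the sed/awk/perl editability mentioned above. For plain text files the data could be stored directly as element text instead, which would keep that property.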
steveha
lf(1): it's like ls(1) but sorts filenames by extension, tersely