The Linux Filesystem Challenge
Joe Barr writes "Mark Stone has thrown down the gauntlet for Linux filesystem developers in his thoughtful essay on Linux.com. The basic premise is that Linux must find a next-generation filesystem to keep pace with Microsoft and Apple, both of whom are promising new filesystems in a year or two. Never mind that Microsoft has been promising its "innovative" native database/filesystem (copying an idea from IBM's hugely successful OS/400) for more than ten years now. Anybody remember Cairo?"
Hans Reiser has written a white paper containing his thoughts on the design of the next major version of ReiserFS.
...wrote "Open Sources", which you can read/buy here. He's a fairly savvy fellow...
The Army reading list
Hans Reiser has some interesting ideas about the role of a modern file system. Here's a recent USENET post describing some of the immediately visible features of reiserfs v3. Some people have said that there was corruption in the past, but I think there are no longer any problems in recent 2.4 kernels. Namesys is now developing Reiser4, which appears to be more flexible (still needs time to stabilize though). If I had to put down my money on a future filesystem though, it would be ReiserFS.
This explains everything you need to know. But basically FFS is a compatibility thing. Apple still recommends its HFS+.
Can somebody explain to me WHY we should put things like databases, indexing/previewing etc. into the filesystem => KERNEL SPACE!!!! What advantage does it bring?
Any good filesystem (XFS, JFS, ext3) now has a nice feature called Extended Attributes, which is intended for STORING such data (like previews etc.). And with a user-space server it's much easier to add plug-ins for various file formats, "search" plugins etc.
Who needs a filesystem in a database when you have a database that lives on your filesystem (updatedb). Get that updating in realtime, with more things (like permissions, access times etc.) and a lot of the work is done.
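To make the updatedb idea concrete, here is a minimal sketch of a database that lives on the filesystem and records more than just names. The table layout and the function names (build_index, find) are made up for illustration; this is not how updatedb, Spotlight or WinFS actually store things, and it does a one-shot scan rather than real-time updates.

```python
import os
import sqlite3
import tempfile


def build_index(root, db_path=":memory:"):
    """Walk `root` and record path, size, mode and mtime for every file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size INTEGER, mode INTEGER, mtime REAL)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (full, st.st_size, st.st_mode, st.st_mtime),
            )
    conn.commit()
    return conn


def find(conn, pattern):
    """Return indexed paths matching a SQL LIKE pattern, sorted."""
    rows = conn.execute(
        "SELECT path FROM files WHERE path LIKE ? ORDER BY path", (pattern,)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("notes.txt", "photo.jpg"):
            with open(os.path.join(tmp, name), "w") as fh:
                fh.write("x")
        conn = build_index(tmp)
        print(find(conn, "%.txt"))
        conn.close()
```

Getting this to update in real time would mean hooking a change-notification mechanism instead of rescanning, which is the part the sketch leaves out.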
PR & tech journalists to the contrary, that is all that is involved in Spotlight & WinFS. Spotlight runs on HFS+. WinFS runs on NTFS. Both are databases stored as files on existing filesystems. The only difference between those databases & updatedb is that they may be using better database design (dunno) and that they update in real time via background processes.
I wrote a journal entry guessing as much about Spotlight, but since then more evidence has arrived, and I'm convinced that both WinFS & Spotlight are implemented that way. The features & implementation details are quite different, but not the filesystem.
We'll probably eventually start calling these databases a part of the "filesystem" much like right now some people will call mspaint.exe & bash a part of the operating system.
There are no trails. There are no trees out here.
Anyone? ... Bueller?
But seriously, even though he mentions Reiser, he doesn't seem to consider its future direction, which is to allow varying degrees of structure, that could include attributes, as the user sees fit. At least that's how I understand it.
dtrace, due with Solaris 10, does that. So it's not quite a top equivalent, but it does let you answer your questions ("What processes are kicking the shit out of the disk", and "By how much"), along with the also useful "In what way", i.e. many small writes, a huge seek-to-read ratio, or what have you.
It is, however, expert driven, unlike top, which is simple to use. Still, I think that dtrace shows the future of performance monitoring apps.
Note that dtrace lives partially in the kernel - it's not portable to Linux.
Actually, both have journaling filesystems.
hfs+ supports a journal (starting with macos 10.2.2 server and 10.3 panther), and ntfs5 supports a journal (starting with win2k)
BeOS was a great technology demo, but it had a huge way to go to become suitable for general-purpose, everyday use. NeXT had Be beaten in many of its strong spots (real-time scheduling? Mach already does that. well-designed, object-oriented system APIs? Openstep^WCocoa creams Be's APIs.), and was already a mature, field-tested operating system to boot. If Apple had bought Be, (a) they wouldn't have got Steve Jobs back to save the company, and (b) they would have a *lot* more "reinventing the wheel" to do to the BeOS base than they've had to do to Nextstep.
Notice the plugin feature. This will create endless possibilities for what you can do with the file system. Want to tie a DB/SQL search function in to it? Write a plugin. Want special security? Write a plugin. Tons of possibilities with ReiserFS4 and it is _very_ fast. This is hands down better than the MS "a filesystem as a DB" approach. ReiserFS4 will be like Firebird, lean-n-mean-n-fast. Want more features, grab _your_ favorite plugins!
If Tyranny and Oppression come to this land,
it will be in the guise of fighting a foreign enemy. -James Madison
Here's instructions for RH8.0:
http://bob.plankers.com/other/linux/loopback_efs.html
The Linux Doc Project also has a HOWTO in their archive:
http://www.tldp.org/HOWTO/Loopback-Encrypted-Filesystem-HOWTO.html
You will want to check around though, a lot of the information appears to be very old. Also, the 2.6 kernel has a lot more encryption routines built into it, so using 2.6 changes how it's done (but it is still basically mounting an encrypted file using a loop-back mount point).
www.rdex.net
The way Linux does encrypted storage is with encrypted devices rather than encrypting the filesystem on the device. This is good because it means you can encrypt any (device-backed) filesystem.
And neither of whom have a journaled filesystem yet, while Linux has many to choose from.
... you get the point.
What are you talking about? NTFS has had journalling for over a decade. And Unicode. And ACLs. And streams. And reparse points (these are amazingly cool). And compression. And encryption. And
Now, MS doesn't use most of this good stuff, but it's all in there. Even three-letter file extensions on Windows are obsolete, since everything on NTFS can be an OLE server. There's nothing on Linux that comes close to the capabilities of NTFS. About the only major thing NTFS is missing is versioning, which VMS has.
Never mind that Microsoft has been promising its "innovative" native database/filesystem (copying an idea from IBM's hugely successful OS/400) for more than ten years now. Anybody remember Cairo?"
The seamless filesystem-in-a-database was created in the Multi-Valued DB structure in the mid-60's and released as the Pick OS. It is still sold by Raining Data and runs on Windows, Unix, and Linux.
If you load everything on the filesystem to memory on boot, you end up wasting a lot of memory, since you typically use only a very small subset of your filesystem at any given time.
The solution would be to load things "on demand," as you've suggested.
Linux already does this, and it does more.
If you've ever looked at the output of free(1) after your system has been running for an hour or so, it will appear as if almost all your memory is in use. See those last two columns, "buffers" and "cached"? That's your "on-demand ramdisk" at work.
Linux will use memory that applications aren't using to cache filesystem data (including executables and metadata) to speed future accesses. If your applications need more memory than is currently free, the kernel will drop cached data rather than swap out application memory to disk. That way, you get the benefits of having your executables on a ramdisk, with the flexibility of not having to sacrifice running application performance in the process.
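If you'd rather see those numbers programmatically than squint at free(1), here is a small sketch that pulls the "Buffers" and "Cached" fields out of /proc/meminfo. The helper names (parse_meminfo, cache_summary) are invented for this example, and it only works on Linux where /proc/meminfo exists.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def cache_summary(info):
    """Return (buffers + cached, total) in kB; missing fields count as 0."""
    reclaimable = info.get("Buffers", 0) + info.get("Cached", 0)
    return reclaimable, info.get("MemTotal", 0)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:  # Linux only
            cached_kb, total_kb = cache_summary(parse_meminfo(fh.read()))
        print(f"buffers + cached: {cached_kb} kB of {total_kb} kB")
    except OSError:
        print("/proc/meminfo not available on this system")
```

On a box that has been up for a while, the first number is typically a large share of the second, which is the "on-demand ramdisk" described above.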
There are some disadvantages to this approach.
First, it's minimally supported by distros. I can't just set up a Fedora system out of the box, check "use encryption", and have it do an NTFS-style decryption of the file encryption key using the password entered at login for each user to decrypt that user's files. It requires hacking around pam and maybe initscripts.
Second, if that *was* done, it would take a different filesystem per user (per key), which is a pain to maintain.
Third, it can't be enabled by users (would require root dicking around with pam and filesystems) as NTFS encryption can be.
Fourth, it can't be enabled on-the-fly (requires creating new filesystems and copying the contents over, unlike NTFS).
Fifth, it's a pain to maintain -- on NTFS, it's easy for a user to just say "I want the contents of this directory and below to be encrypted" and choose to have things encrypted on a per-directory basis. The equivalent on Linux would be having the root user be creating new filesystems (knowing the appropriate sizes in advance and wasting any excess space allocated) copying over the contents and adding mount points for every filesystem mounted.
Sixth, NTFS supports key recovery with a backup, emergency passphrase (it can maintain two copies of the encryption key, one encrypted with, say, the administrator's password). Dunno about the Linux status of this.
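The "two copies of the encryption key" idea in the sixth point can be sketched roughly as below. This follows the description in this post, not the actual NTFS implementation, and every name here (wrap_key, unwrap_key, make_slots) is hypothetical. The XOR "wrap" is a toy with no integrity check; a real system would use a proper authenticated key-wrap cipher.

```python
import hashlib
import os


def _kek(passphrase, salt):
    """Derive a 32-byte key-encryption key from a passphrase with PBKDF2."""
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, 100_000)


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def wrap_key(file_key, passphrase):
    """Return (salt, wrapped) for a 32-byte file key. Toy XOR wrap only."""
    if len(file_key) != 32:
        raise ValueError("file key must be 32 bytes")
    salt = os.urandom(16)
    return salt, _xor(file_key, _kek(passphrase, salt))


def unwrap_key(slot, passphrase):
    """Recover the file key from one slot; a wrong passphrase yields garbage."""
    salt, wrapped = slot
    return _xor(wrapped, _kek(passphrase, salt))


def make_slots(file_key, user_passphrase, recovery_passphrase):
    """Store the same file key twice, once per passphrase."""
    return {
        "user": wrap_key(file_key, user_passphrase),
        "recovery": wrap_key(file_key, recovery_passphrase),
    }
```

The file data is encrypted once with the file key; either passphrase unlocks its own slot, so losing the user's passphrase doesn't lose the data.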
Having an encryption layer above the block layer is a nice idea, but it's not a drop-in substitute for encryption support in the filesystem.
It would be possible to add a layer in which an encryption layer could be *added* (preprocess file/directory contents -- if one *only* wanted encrypted files and not directories, this could already be done with an LUFS or fuse module). Space for such a layer does not currently exist in Linux.
May we never see th
Man, you totally miss the point. NFS is not a file system (don't be fooled by the name), it's a network protocol. The files provided by an NFS server have to be physically stored on some (real) filesystem, like ext3 or reiserfs.
This is very much like saying "the future of filesystems is apache2, local filesystems are already good, now we have to concentrate on apache2".
Offsite backups are your friend. No matter what your filesystem's software, or the coolness of your RAID array, or your battery-backed redo-logs; if a fire or a burglar takes the disks holding your filesystem you're hosed.
Personally, instead of a RAID, I do a nightly "rsync" to a "yesterday" drive on a separate server (hence protecting myself from stupid-user failures as well as filesystem/disk failures); an "every time I did something significant" rsync to an encrypted filesystem on a removable drive kept in my car; and a "once in a blue moon" copy to DVDs in a safe.
An added benefit - upgrading an OS, or a computer is trivial, because the live backups are just that - live, and tested every night.
(Back to the filesystem topic, Reiser's whole naming idea is so much cooler than a hierarchy or a relational system I really hope this is the next big advance for Linux).
No. You may not see them very often, but there are a lot of AS/400 systems out there.
Mea navis aericumbens anguillis abundat
Moderators call that post interesting? WTF!
Where I've worked for 15 years, I solved that problem the first day at work. It's called yellow pages from Sun. I've also used it on Linux since 1995 without problem.
Quick googling got these. They may provide you with some leads. If you find more, please post :)
http://sourceforge.net/projects/e2compr/ (Ext2 Compression)
http://squashfs.sourceforge.net/ (Squashed - Read Only, don't know what that means)
Some might say this is a flame but... I hate reiser in its current release. It's fairly unstable, and when it does go, you're pretty much fucked. I've used reiser twice, and of those two times, the filesystems died due to power outage or other similar effect within one or two events. That's not even approaching acceptable.
On the flip side, I've had multiple such snafus with XFS, but no filesystem failure. I've never even had to approach having to deal with trying to fix the system, as there's been no events which have resulted in fs corruption. Sure, power has gone out, but the machines have come back up again without a hitch.
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
Apple is simply adding functionality to HFS+. Everything you've read about Spotlight describes a service that sits above the file system. It takes advantage of HFS+, but there is NO database driven FS coming out from Apple.
Their solution is to build a service that can interact with individual files, including their native metadata (ID3 tags, pdf metadata, MS Office metadata, email headers, etc.) through metadata importers and to store the metadata indexes in a separate database. This is relatively similar to how iTunes does its thing. The services will have lots of APIs open to apps to incorporate the functionality locally.
The obvious clue that HFS+ isn't going away is that Apple is finally pushing full HFS+ support back up to the command line utils like cp to support resource forks and whatnot in 10.4, so hopefully we can stop needing OS X specific tools like ditto.
They've been adding improvements steadily over the years, such as journaling and most recently case sensitivity. The more obvious question to me is why doesn't the Linux community just jump all over HFS+ and build off of Apple's work since they seem more than willing to give the HFS+ support back anyway?
It doesn't sound as though "compressed streaming format[s]" are what you're really looking for, and AVI isn't a streaming format in any case. However, there are archival-type video codecs that may suit your needs:
In a perfect world, you'd have one of these working behind the scenes in some sort of network storage device in a manner similar to the dpsVelocity VTFS. If you haven't worked with an editing system that uses VTFS, I recommend getting a demo.
"Be Happy or Die." -- AoN
I suspect that a lot of the differences between Windows and UNIX are due to their respective histories.
UNIX has traditionally been about big systems with multiple users. Networks have been a standard feature for decades. In this sort of environment, you'd naturally use some network-oriented naming service, be it NIS or LDAP.
Windows has grown from a PC background where everything is traditionally local. In a networked environment there is little need for the MACHINEA/user when there is a DOMAIN/user (some exceptions obviously exist).
I am responsible for a network consisting of an NT domain and a number of Solaris, AIX, Linux and Unixware servers. All the *NIX boxes have the same UID/GID schema because we use NIS; not the most secure solution, but suitable for our environment. We interface those users easily with Windows (and Samba) because we only administer two sets of login credentials - NT and NIS (we could do just one using winbind, but that doesn't seem right...).
The UNIX filesystem permissions schema is easy to understand and it works extremely well. Commercial UNIX has had access control lists for years (part of the POSIX standard), but I'm not aware of anyone who uses them in the real world. They are potentially useful, but most people find the UNIX UID/GID does the job well enough for 99% of the time.
I'd say lzo would be a better choice here. It compresses like 10-15% worse than zip but does it ~10 times faster (even faster for decompression). People even reported speed-ups in reads due to having to read less off the disk (and decompression being fast enough to not hinder the effect).
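The read speed-up claim is just arithmetic, so here is a back-of-the-envelope sketch. It assumes disk reads and decompression overlap perfectly (a pipeline), and all the figures in the usage note are made-up illustrations, not benchmarks of lzo or zip. The function name is hypothetical.

```python
def effective_read_mb_s(disk_mb_s, ratio, decompress_mb_s):
    """Uncompressed MB/s delivered when reading compressed data.

    disk_mb_s:       raw sequential read rate of the disk
    ratio:           compressed size / original size, 0 < ratio <= 1
    decompress_mb_s: decompressor output rate, in uncompressed MB/s
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    # Each MB read from disk expands to 1/ratio MB of real data...
    from_disk = disk_mb_s / ratio
    # ...but the decompressor caps how fast that data can be produced.
    return min(from_disk, decompress_mb_s)
```

With a hypothetical 50 MB/s disk and data that compresses to half its size, a decompressor that can emit 200 MB/s gives effective_read_mb_s(50, 0.5, 200) == 100.0, twice the raw disk rate. Swap in a slow decompressor at 40 MB/s and you get 40, worse than not compressing at all, which is why decompression speed matters more than the last few percent of ratio.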
It doesn't sound as though "compressed streaming format[s]" are what you're really looking for, and AVI isn't a streaming format in any case.
If you consider "streaming" to mean something like RealMedia or other web-based streaming codecs, you are correct. However, working in the DVD/Digital Video/Multimedia fields, we do refer to MPEG-2, AVI, and so forth as a "streaming" format because it is composed of one or more "streams" of content. Basically, the difference between what we have now (tens of thousands of individual files, each one representing a single frame of video) and a "streaming" file is that it compresses all those individual files into one big file.
However, there are archival-type video codecs that may suit your needs:
Thanks for the listing! I will check into these.
In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
Well, filesystems are more or less some kind of database.
Especially the Reiser (3/4) filesystems come very close to being a database.
The database is one big tree. In SQL terms, you can see it as a single table where the primary key is indexed and the actual data (the objects) can be of different types.
These types are:
The root directory (/) has a known key, where it can be looked up. There you'll find a "directory" item. It contains a list of names, each name also has a key. Using this key you can find the stat data for that file or directory list or the actual file data.
This data can be located anywhere in the tree, even small parts of file content (like the end of files that don't fill up a block so it would be a waste of space to store it in a full block).
Using this approach everything becomes dynamic. And also very fast, because if you have a lot of files, you can write all the data into a contiguous region on the disk and don't have to update some fixed positions on the disk.
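The lookup described above (start at the root's known key, read the directory item, follow the name's key to the next item) can be sketched like this. A plain dict stands in for the on-disk tree, and the item kinds ("directory", "data"), the key numbers and the function names are all assumptions made for this example, not the actual reiserfs format.

```python
ROOT_KEY = 1  # assumed well-known key of "/"


def make_tree():
    """A toy 'single table': key -> (item kind, payload)."""
    return {
        1: ("directory", {"etc": 2, "motd": 4}),   # names map to keys
        2: ("directory", {"hosts": 3}),
        3: ("data", b"127.0.0.1 localhost\n"),
        4: ("data", b"hello\n"),
    }


def lookup(tree, path):
    """Resolve a path by following keys from the root, one name at a time."""
    key = ROOT_KEY
    for name in [part for part in path.split("/") if part]:
        kind, payload = tree[key]
        if kind != "directory":
            raise NotADirectoryError(path)
        if name not in payload:
            raise FileNotFoundError(path)
        key = payload[name]
    return tree[key]


if __name__ == "__main__":
    print(lookup(make_tree(), "/etc/hosts"))
```

A real filesystem would keep the items in a balanced tree sorted by key, so that related items end up near each other on disk; the dict only shows the key-following part.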
Now, reiser4 takes this approach to an extreme:
The clue is:
These default plugins make a filesystem from the database. It's just like reiserfs3 now, just faster.
BUT: You can now add plugins if you want. Plugins to store compressed or encrypted files. Plugins to store additional metadata alongside the files. It's basically the file system of the future, because it's extensible without changing the disk format.
Sorry, but you are wrong here. Reiser4 is atomic and you can pack as many operations into one transaction as you like; you just have to use the reiser4 system call. This is because there is no standard system call for atomic filesystem transactions. Modern filesystems are databases, built to store files and query them through filenames. reiser4 is the first filesystem where the search path can be done through plugins, therefore you can index everything you want.
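As a rough picture of what "plugins to store compressed files" means from the application's side, here is a toy in-memory store where each file carries a plugin that transforms its bytes on the way in and out. This is not the reiser4 plugin API; the class and method names are invented, and zlib stands in for whatever compressor a real plugin would use.

```python
import zlib


class IdentityPlugin:
    """Default behaviour: store bytes unchanged."""

    def encode(self, data):
        return data

    def decode(self, blob):
        return blob


class ZlibPlugin:
    """Hypothetical compression plugin; callers never see compressed bytes."""

    def encode(self, data):
        return zlib.compress(data)

    def decode(self, blob):
        return zlib.decompress(blob)


class ToyStore:
    def __init__(self):
        self._blobs = {}  # name -> (plugin, stored bytes)

    def write(self, name, data, plugin=None):
        plugin = plugin or IdentityPlugin()
        self._blobs[name] = (plugin, plugin.encode(data))

    def read(self, name):
        plugin, blob = self._blobs[name]
        return plugin.decode(blob)

    def stored_size(self, name):
        return len(self._blobs[name][1])
```

The point is that read() returns the same bytes whichever plugin was used at write time, so adding a new plugin changes what is stored without changing how files are accessed.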
kindly regards daniel
If you're concerned about compression speed, you may want to take a look at LZO. It's got incredibly fast compression, and even faster decompression. I think it was even used on the Mars Rovers.
You can compress OR encrypt a file with NTFS, and not BOTH.
Reiser4 has a compression plugin coming. We got gzip to work, but it consumes too much cpu, so now we are doing lzo which can compress at disk drive speed. The lzo plugin has a bug, maybe next week....
Hans
(You can email edward@namesys.com for details).
I think he means something more along the lines of how it does a 'fork' of sorts in the file and the mail program sees multiple objects in 1 file.
Humans are slow, inaccurate, and brilliant; computers are fast, accurate, and dumb; together they are unbeatable
By that logic I should have gone back to Windows the first time my first Redhat installation died.
I didn't, I looked around until I found something more stable. It turned out to be Slackware using ReiserFS on an AMD900 with "cheap RAM" and I have had 0 problems in the year and a half it's been running (and it's my main desktop PC/workhorse).
Sometimes it is just the hardware.
It's a pity no one has ever fully finished the NTFS filesystem module for Linux. I understand that payware solutions are available, but the "EXPERIMENTAL" read-write NTFS module has been around for years and years with nobody finishing it up. Right now you can write to an NTFS volume so long as you don't change much of anything...yeah, uh huh. That's real useful. And if you do accidentally change something, you can totally fsck up the volume, or at the very least you have to do a CHKDSK before anything else can access the volume. That's about as useful as a square bowling ball.
In the end they will lay their freedom at our feet and say to us, Make us your slaves, but feed us. - Fyodor Dostoyevsky
>Thanks but no thanks. I'd rather my /etc/fstab not be corrupted in the event I edit /etc/hosts, try to save and sh*t happens.
Relax, this is old news! The Novell Netware filesystem did this 10 years ago, they called it "sub-block allocation". I never had a problem with it on my servers. I've never heard of anyone else having problems with it, either.
uhhhh... what?
if the filesystem does the compression, the apps (or you) can't see it happen. that's the POINT. your suggestion, above, is ridiculous. If you had a tar.gz file, you could extract it to the FS, but it would actually be equally compressed (cause it's a gzip compressed FS), and then you could play with the files to your heart's content, without worrying about the compression, cause it's transparent. You wouldn't need or want some kinda plugin or something...
Unless the FS wasn't compressed, and you wanted a transparent way to access tar.gz files. That idea would make sense.
As I understand it, bad block checking is obsolete; I thought the hard drive firmware takes care of this nowadays and re-maps bad blocks when it detects them.
"I think it would be a good idea" Gandhi, on Western Civilisation
I don't think it's fair to blame the failings of ReiserFS on hardware, especially since some of the versions of ReiserFS that shipped with early 2.4 kernels were known to have problems.
If you've had no problems, you're either using a version of ReiserFS which has had the problems fixed, or you're lucky.
And it's not just hardware that changes the picture. It can be delicate kernel interactions. It can be the actual data of what's on disk exposing bugs in the code that handles it. Such bugs aren't necessarily exposed on ALL running instances of the code, but for some users, it might just be triggered.