SGI open-sourcing XFS
Yun Ye writes "Finally, a journaling FS for Linux! Get the full story." Excellent - we'll have to ask the SGI people about it tomorrow. And who came up with the name change. Whatever the case, this will help Linux continue to crack the high-end market.
From the white paper:
7.1.2 Future Directions
The ever evolving future of XFS shall include items such as:
Access Control Lists
Disk Quotas
It looks like the white paper might not be completely up-to-date. Have Quotas and ACLs been implemented yet?
--
Business. Numbers. Money. People. Computer World.
This is internet gift economy at its best. I really hope that no-one will forget SGI for this generous move!
;-)
Short overview of XFS at http://www.sgi.com/products/remanufactured/challenge/ti_xfs.html
All we need to get Linux into the data centers is a decent clustering/failover package...
Compaq, are you listening? How about GPL'ing your old failover/clustering package (ASE)? It would be a proper gift to match SGI...
(not that I don't appreciate or notice the release of the Compaq 64-bit math libraries... I really do... but SGI just raised the bar)
/ Henning
Well, my nominations for the other two would be:
1. No large file support (files over 2GB). This is required for any work with medium-large databases or digital video editing.
2. Poor support for RAID - hot spares, hot swap, etc.
3. Poor NFS performance (speed and locking)
4. Poor desktop environment
5. No high availability clustering (Beowulf is cool but completely unrelated to this)
All these except (4), and possibly (5), are more or less requirements for use in medium to large enterprise situations.
But while these are significant, let's not forget the cool things Linux has, like the /proc filesystem, which Sun have only just introduced in Solaris 7.
The article says that the lack of a journaling filesystem is one of linux's three major weaknesses. What are the other two?
Yeah. As I understand it, in many journaling systems the filesystem itself works like a database. Just as a database uses a journal to record transactions in preparation for commits or rollbacks, so does the journaling filesystem. The journal holds the transactions (block writes) until they are written, then the journal entries for them are cleared. If there is corruption in the filesystem, the journal can be used to bring the filesystem back to sanity.
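For anyone who wants to see the idea in code, here is a toy sketch of that journal-then-write cycle in Python. It is not how XFS or any real filesystem does it; the class, the method names and the simulated crash are all made up for illustration, and the "disk" is just a dict.

```python
class JournaledStore:
    """Toy block store with a write-ahead journal (illustrative only)."""

    def __init__(self):
        self.blocks = {}    # stands in for the on-disk blocks
        self.journal = []   # pending (block_no, data) transactions

    def log_write(self, block_no, data):
        # Record the intended write in the journal before touching the blocks.
        self.journal.append((block_no, data))

    def commit(self, crash_after=None):
        # Apply journal entries in order; optionally "crash" partway through.
        applied = 0
        while self.journal:
            if crash_after is not None and applied == crash_after:
                return False  # simulated crash: the rest stays in the journal
            block_no, data = self.journal[0]
            self.blocks[block_no] = data
            self.journal.pop(0)  # clear the entry only once it is written
            applied += 1
        return True

    def recover(self):
        # After a crash, replay whatever is still sitting in the journal.
        return self.commit()


store = JournaledStore()
store.log_write(0, b"inode table")
store.log_write(1, b"directory entry")
store.commit(crash_after=1)   # crash after the first block reaches "disk"
store.recover()               # replay finishes the interrupted work
```

The point is only that recovery reads the short journal instead of scanning every block, which is where the fsck time savings come from.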
:)
I've never lost an ext2 filesystem; it does, however, take some time to fsck... a journaling capability would be nice to have.
Codifex Maximus ~ In search of... a shorter sig.
It's slightly worrisome that SGI haven't decided on a licence, even though the piece says it'll certainly be Open Source.
If we need a journalled FS (I guess we do), then we need a GPLd journalled FS.
How's this going to be implemented? If it's a kernel patch, then correct me if I'm wrong, but doesn't it *have* to be GPL?
I guess if it's only a module, it can be any licence, right?
--
For real high-end stuff:
1. Poor large memory support. I'm not sure how the most recent kernels fare, but last I checked Linux only supports a maximum of 2GB of ram. This is probably one of the FIRST things which need to be fixed. I have heard that SGI is working on a patch to give 3.8GB on Intel machines... that sounds promising!
2. No raw I/O support. This is used for large RDBMS's for example. AFAIK Linus hates the idea, so it will probably never get this. Mind you this is minor because there are ways around it.
3. "fsync()" on large files is extremely inefficient. Again, this affects RDBMS to the extreme. It is so bad that in some cases Windows NT is 30 or 40 times faster than Linux on equivalent hardware doing database inserts and updates. To witness this for yourself, write a quick program that opens a file and loops appending a line and doing an fsync on it. Notice the slowdown as the file grows. That shouldn't happen.
4. Poor performance under load. When the load on a Linux box goes up, context switch time increases a LOT. This is bad.
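The fsync test from point 3 could look something like the sketch below. This is just one way to write it, in Python; the batch sizes and the line contents are arbitrary choices, and what the timings look like will depend entirely on your kernel, filesystem and disk.

```python
import os
import tempfile
import time


def timed_fsync_appends(path, batches=5, lines_per_batch=200):
    """Append lines, fsync after each one, and time every batch."""
    timings = []
    with open(path, "ab") as f:
        for _ in range(batches):
            start = time.perf_counter()
            for _ in range(lines_per_batch):
                f.write(b"one more line of data\n")
                f.flush()              # push Python's buffer to the OS
                os.fsync(f.fileno())   # ask the OS to push it to the disk
            timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        for i, seconds in enumerate(timed_fsync_appends(path)):
            print(f"batch {i}: {seconds:.4f}s")
    finally:
        os.remove(path)
```

If the per-batch time climbs steadily as the file grows, you are seeing the behaviour described in point 3; if it stays flat, your setup does not have the problem. Raise `batches` to grow the file further.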
And these are just off the top of my head...
Mind you, I am not diminishing Linux, it is still my primary development platform of choice, however as it stands it doesn't make a good platform for really big single-computer tasks. I'm sure this will change in the future, but right now as much as I hate to say it, NT and commercial Unix have the lead. Also I'd love to hear corrections to my points above, they would be good news to tired eyes.
Thanks
Some time ago Silicon Graphics asked for comments from users regarding what we wanted, and I replied that opensourcing xfs was probably the most relevant thing they could do. Since apparently SGI is moving away from their Irix95, and they have a huge investment in xfs development, this appears to be a natural step towards having xfs on their Linux platform. This is really a good thing, and the only OpenSource initiative SGI has done so far which will actually matter to most users. Now 4dwm would be nice too - that is really a nice window manager.
--
Thorbjørn Ravn Andersen "...and...Tubular Bells!"
It looks like free software has reached a critical mass. There is enough usable and successful source out there to make it more profitable to add to it than develop your own proprietary code.
This could explode.
This is one of the advantages of the GPL, with BSDish licenses they could just port it and include it and keep it as proprietary as before. If it should be possible to boot from XFS it would have to be in the kernel which forces it to be GPL.
This is unfortunately also one of the disadvantages. SGI don't want Sun or MS to be able to use it and probably don't care at all for *BSD. So they will probably release it under GPL only which means *BSD can't use it.
Actually there is a limitation in the VFS (virtual file system) layer that means right now no FS can have more than 2GB files.
It could well be that SGI have patches to address that, but that would be separate to the XFS code as such.
This restriction only applies to 32 bit platforms such as x86, non-Ultra Sparc and PowerPC of course. On Alpha, Ultrapenguin and Merced there is/will be no such limitation.
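To put a number on the 2GB figure: the comment does not say where in the VFS the limit comes from, but the usual explanation is a file offset stored as a signed 32-bit integer. Assuming that, a short Python sketch shows where the boundary falls (the helper name is made up for illustration).

```python
import struct

# Largest value a signed 32-bit file offset can hold.
MAX_OFFSET_32 = 2**31 - 1


def fits_in_signed_32(offset):
    """Return True if the offset can be packed as a signed 32-bit integer."""
    try:
        struct.pack("<i", offset)
        return True
    except struct.error:
        return False


print(MAX_OFFSET_32)                          # 2147483647 bytes, just under 2 GiB
print(fits_in_signed_32(MAX_OFFSET_32))       # True
print(fits_in_signed_32(MAX_OFFSET_32 + 1))   # False
```

On a 64-bit platform the same reasoning gives a ceiling of 2**63 - 1 bytes, which is why the limit is not felt there.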
okay...
OpenVault - opensource . check
XFS - opensource . check
DMF - opensource . Well, not yet
Open Source Ronin
It's out there.
http://linux.msede.com/lvm/
It is not considered production code yet, but I haven't had a problem with it yet.
Very similar to the HP-UX implementation of the Veritas volume manager, though IMO the Linux implementation is shaping up to have better tools.
> SGI still is deciding how to structure the open
> source license, the company said, though it is
> sure to meet the requirements of the Open
> Source Definition, a spokesman said.
They need more than OSD compliance. If they want it to go into the Linux kernel, they need to use a GPL-compliant license. The main question is whether they want (or will accept) that the other proprietary Unixen can use it. If not, then the obvious choice is to GPL it. It will cause problems for the BSDs, but why should SGI care? They don't have the hype (momentum) of Linux.
If they do want their file system to become a standard, they could LGPL it, or even use an X-like license. That would also make it easier to backport changes to Irix. Using the LGPL would make it more problematic for competitors to keep their changes proprietary.
Not to put a wet blanket on the party, but if it already doesn't include mentioned things like quota support, etc.. and they aren't releasing all of the features, and they don't even have a licence decided on yet...
Looks like it will just be a part of the code that might be incorporated into ext2, and will help some, if there are enough talented people to actually do it (I sure know I won't be doing it). And, given it will take time for them to decide on the licence, it will take time to deal with the licence, and time to incorporate the code.. It might be a long time before we even see the effects of this _part_ of xfs incorporated into an Open Source OS (and who says Linux will be the first to use it?).
Apparently, this file system does not have quota or ACL support? As per the SGI white-paper on the XFS filesystem, those features are on the TODO list.
Is this something we will be able to put in given the source, and call the whole thing ext3fs, and release it in Linux 2.4?
Good filesystems are some of the most difficult code in an operating system, so having an excellent base like XFS will certainly help. Thank you SGI!
"PS - Slashdot could use a full dictionary of terms from around the site... That'd rule..."
... Then you may want to check out What is it - a really good web dictionary.
http://www.whatis.com/default.htm
Not to mention Hans Reiser's fs called (guess what) Reiserfs.
It wouldn't have journaling at first, but it is planned.
>...one thing that hasn't been pointed out yet: journaling file systems don't (immediately) overwrite a file when it is changed.
>
>To elaborate, imagine a long tape that represents your hard drive...
It hasn't been pointed out because it's not true. What you're talking about is a _log structured_ file system. That's a whole different thing. A _journaling_ file system looks after the metadata using a (duh) journal that records changes to directories, attributes, allocation maps etc. so they can be either rolled forward or rolled back upon reboot. However, a JFS (not necessarily IBM's JFS, though they were pioneers in this area) has pretty much the same ways of handling the actual file data as any other FS, including deferred writes etc.
Any FS including a JFS or LSFS may support features for increased synchrony providing greater data integrity, such as fsync() or O_SYNC, but that's really a separate matter.
Slashdot - News for Herds. Stuff that Splatters.
If you want grio, you'll have to get IRIX.
They are only releasing the journaling part,
and it's limited to 64-bit.
This is the perfect move. They give Linux something great, get extremely good PR, establish XFS as an industry-standard, and still manage to keep a proprietary advantage to make you want to buy their machines for technical reasons. Really smart.
Even such a `limited' version will be better than NTFS.
If they only release it as GPL, it can't go into the BSD kernel. But nothing stops them from releasing it under multiple licences. But I highly doubt they'll go with anything BSD-like. That would be like telling Sun 'here's our XFS file system, please adopt and expand it for your own proprietary use'.
Having "just" journaling alone would not actually
be *that* important; logical volume management
is more important -- fortunately XFS gives
both.
Journaling gives you "just" faster startup times
after crashes because the filesystem should by
definition never be in an inconsistent state
(no fsck or equivalent required). However,
for big disk farms the flexibility given
by a logical volume manager is really important.
(the flexibility is not bad for small setups,
either...)
Here's a nice white paper on XFS:
http://www.sgi.com/Technology/xfs-whitepaper.html
Just a thought - it hasn't crashed yet on my amiga, and I'm using an early beta from ages ago, and I deliberately tested it by power cycling in the middle of lots of writes on several occasions.
It's great to never have to use l:disk-validator (amiga fsck) again ( o.k, ok. it's in ROM, not l: on all Amigas above 1.3, but hey...)
The website has an exceptionally clear description of how the filesystem has been implemented. It's 64-bit, using the NSD (New Style Device) API.
It's also free.
here's the site :
www.xs4all.nl/~hjohn/SFS/
In the interest of /. effect minimisation, the feature list is duplicated here:
(Note that some of the things described are amiga-specific. The dos.library limitation, in particular, is irrelevant to linux, and probably to future amigas, too. The 2GB max single file size limitation arises from the amiga's incomplete transition to 64-bit APIs - CBM went bust as the NSD spec was released, and, once again, may not be relevant to a linux implementation)
This page gives you an overview of what SFS is capable of. It will also give you an idea what features we are planning to add in the near future in planned features, and what features we are considering later on.
Features
Below you'll find a list of features which are already implemented in SFS.
Fast reading of directories.
Fast seeking, even in extremely large files.
Blocksizes of 512 bytes up to 32768 bytes (32 kB) are supported.
Supports large partitions. The limit is about 2000 GB, but it can be more depending on the blocksize.
Support for partitions larger than 4 GB or partitions located (partially) beyond the 4 GB barrier on your drive. There is support for New Style Devices (NSD) which support 64 bit access, the 64-bit trackdisk commands and SCSI direct.
The length of file and directory names is internally limited only by blocksize. Limitations in the dos.library however will reduce the effective length of file and directory names to about 100 characters.
The size of a file in bytes is limited to slightly less than 4 GB. Because of limitations in dos.library we will however probably not allow files larger than 2 GB, to avoid potential problems.
Modifying data on your disk is very safe. Even if your system is reset, crashes or loses power, your disk will not be corrupted and will not require long validation procedures before you will be able to use it again. In the worst case you will only lose the last few modifications made to the disk. See Safe writing for detailed information on how this works.
To be able to ensure that your disk never gets corrupted we use an internal caching system which keeps track of modifications before writing them to disk. This cache has the additional benefit that creating and copying files can be a lot faster, especially if the drive used isn't very fast (ZIP & floppy drives for example).
There is a built-in low-level read-ahead cache system which tries to speed up small disk accesses. This cache has as a primary purpose to speed up directory reading but also works very well to speed up files read by applications which use small buffers.
Disk space is used very efficiently. See the Space efficiency page for a comparison between a few filesystems.
Supports notification and Examine All.
Supports Soft links (hard links are not supported for now).
Using the SFSformat command you can format your SFS partition with case sensitive or case insensitive file and directory names. Default is case insensitive (like FFS).
There is a special directory which contains the last few files which were deleted. See deldir.
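(A note on the "about 2000 GB" partition limit in the list above: the page does not say where the figure comes from. One guess that matches it is a 32-bit block number, and the sketch below works the arithmetic under that assumption only. The constant and function names are made up for illustration.)

```python
# Assumption, not stated on the SFS page: blocks are addressed by a 32-bit number.
ASSUMED_BLOCK_NUMBER_BITS = 32


def max_partition_bytes(block_size):
    """Largest partition addressable with the assumed block-number width."""
    return block_size * 2**ASSUMED_BLOCK_NUMBER_BITS


for block_size in (512, 4096, 32768):
    gib = max_partition_bytes(block_size) / 2**30
    print(f"{block_size:>5}-byte blocks: {gib:.0f} GiB")
```

With 512-byte blocks this works out to 2048 GiB, in line with "about 2000 GB", and it grows with the block size, as the feature list says.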
Planned features
The list of planned features below are features which are either already in development or are very likely to be added to the filesystem in the near future.
Multiuser support.
Built-in background file and free space defragmenter. Already the filesystem is set up in such a way to allow for easy implementation of this feature without having to do extensive scanning of the disk before the defragmenter can begin. This means defragmenting can be done in the background and can be interrupted at any time (even by a reset, crash or power failure) without loss of data.
Mirroring of important filesystem administration blocks to make the filesystem more robust.
Features we are considering
The features below are either features which are very application specific or not used very often. If there is enough demand for some of these features we will consider implementing them in the filesystem.
Mirroring of complete partitions. Such a feature would not only ensure that all your data is very safe since everything is stored twice on two different drives, but it will also speed up multiple concurrent read accesses since both drives can be used to deliver data. This feature however normally is only used on mission critical systems (like file servers) and would be of little use on systems not equipped with high speed SCSI controllers.
Support for striping. To put it simply, striping can be used to distribute data to multiple drives which increases the total available bandwidth as each disk will be used simultaneously to access part of the data. If you for example have 2 drives then with striping all odd blocks of 64 kB would be stored on drive 1, and all even blocks of 64 kB would be stored on drive 2. A similar scheme is used with more than 2 drives. With striping there is also an option to use one of the drives as a parity drive. If one of the drives crashes or becomes unusable then the data on that drive can be reconstructed using the remaining drives which ensures that your data is very safe. However, although it may seem that striping could speed up disk accesses by a factor of 2 or more, this is usually only the case when working with very large video streams or multi user systems. Under normal conditions you will be hard pressed to find any speed gains at all.
Support for hard links (soft links are already implemented).
The ability to extend a partition without having to copy all your data and format the partition.
New DOS packets. There are lots of ways to exploit the ability of a filesystem better than is possible at the moment. New packets are the key to this. For example, support for paths larger than 255 characters, live directories (directories which are updated in realtime), enforcing recordlocking and many more. This however must be a team effort and we'll need support from writers of important applications and people willing to build new interfaces to access these new abilities.
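(The striping-with-parity item in the list above is easy to try out in a few lines. The sketch below is Python, not anything from SFS: it alternates 64 kB units across two drives and keeps an XOR parity drive, which is one common way to do parity and an assumption here, since the page does not say how SFS would compute it. Function names are made up for illustration.)

```python
STRIPE = 64 * 1024  # 64 kB stripe unit, as in the striping example


def xor_bytes(x, y):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))


def stripe(data, unit=STRIPE):
    """Split data across two drives by alternating units, plus a parity drive."""
    chunks = [data[i:i + unit].ljust(unit, b"\0") for i in range(0, len(data), unit)]
    if len(chunks) % 2:
        chunks.append(bytes(unit))  # pad so both drives hold the same count
    drive1 = chunks[0::2]  # 1st, 3rd, 5th ... units (the "odd" blocks)
    drive2 = chunks[1::2]  # 2nd, 4th, 6th ... units (the "even" blocks)
    parity = [xor_bytes(x, y) for x, y in zip(drive1, drive2)]
    return drive1, drive2, parity


def rebuild(surviving, parity):
    """Reconstruct a lost drive from the surviving drive and the parity drive."""
    return [xor_bytes(x, p) for x, p in zip(surviving, parity)]


d1, d2, par = stripe(b"some file contents " * 10000)
lost_drive2 = rebuild(d1, par)  # pretend drive 2 died
```

Because XOR is its own inverse, either data drive can be rebuilt from the other one plus parity.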
There have been a few good answers to this question in this thread, but there's one thing that hasn't been pointed out yet: journaling file systems don't (immediately) overwrite a file when it is changed.
To elaborate, imagine a long tape that represents your hard drive. The tape is written from left to right. When a file is changed, the new version is appended to the end of the existing data, while the old version remains "untouched" farther to the left. When the kernel has finished updating all pending file writes, it can write a "checkpoint" at the end of the existing data. Essentially the checkpoint says "everything up to the point is kosher." If the disk gets really full, then the filesystem can double-back and overwrite the really old data at the beginning of the tape.
Now, let's see how this works to help recover from crashes; say the computer crashes as it's writing out a file to disk. In a conventional filesystem, a lot of things could go wrong: it could have been overwriting the old data, but finished only half of the job. Then, at best, you've got a corrupted file with a hybrid of new data and old data. The file allocation table may not have been updated, so the file may be completely lost. It's a bad situation.
Conversely, if the crash happened with a JFS, the computer would run an "fsck" and look for the last checkpoint. It's guaranteed that all data preceding the checkpoint has integrity. Then the filesystem would just work from that checkpoint and ignore any non-checkpointed data. This can still lead to some data loss, but never to filesystem corruption.
Of course, this is a simplified account, and there are implementation details. But that's the gist of it.
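The tape picture can be sketched as an append-only list with checkpoint markers. Note that a reply elsewhere in this thread says this describes a log-structured file system and not metadata journaling, so take the sketch as an illustration of the tape-and-checkpoint idea only. It is written in Python, and the record layout and names are invented for the example.

```python
CHECKPOINT = object()  # marker appended once pending writes are safely on the tape


def recover(tape):
    """Return the file versions visible after a crash: everything up to the last checkpoint."""
    last = max((i for i, rec in enumerate(tape) if rec is CHECKPOINT), default=-1)
    files = {}
    for rec in tape[:last + 1]:
        if rec is not CHECKPOINT:
            name, contents = rec
            files[name] = contents  # later versions supersede earlier ones
    return files


tape = [
    ("notes.txt", "v1"),
    CHECKPOINT,
    ("notes.txt", "v2"),
    CHECKPOINT,
    ("notes.txt", "v3 (half writ"),  # crash before the next checkpoint
]
print(recover(tape))  # {'notes.txt': 'v2'}
```

The half-written v3 is ignored, so the last few changes are lost but nothing on the tape is left in a corrupt state.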
SGI could choose to use the GPL plus alternate licensing, and still get back improvements from others. This can be done either by getting assignments from contributors (if they are willing) or recoding the changes in a different way. Regardless of what licensing is chosen, getting assignments (as FSF and the egcs project do) is probably the safest thing to do, as I expect that within the next year, Microsoft or someone they put up to it will attempt to sue some high-profile open source project for theft of code (just find some contributor who didn't have the right to contribute, because of an employment contract or something), thus spreading a piracy taint over the whole movement.
It should be emphasized that at the least, SGI would be shooting themselves in the foot if they choose a GPL-incompatible license such as an NPL-like license. The reason is that this would force all Linux distributors to use their filesystem only as a module, which would be inconvenient. If SGI's work requires changes in the kernel itself, then it wouldn't even be valid to use it as a Linux kernel module if it's not under a GPL-compatible license.
SGI could use a BSD-like license (without the advertising clause), which would permit both BSD and Linux to use the code. They might not want to give that much away, though. You'll never beat Microsoft if you write code for them (yes, Microsoft networking has tons of Berkeley code in there, you can tell from the bug-compatibility).
I hope that they either go in the GPL or the BSD direction, and don't try to do one of those one-sided NPL-like licenses that is becoming popular with companies (e.g. we can take your changes proprietary, but you have to distribute source).
I wonder whether we'll ever see the day when billionaire philanthropists buy the rights to successful commercial software and then turn around and release it under the GPL in the interests of humanity.
Sheesh, evil *and* a jerk. -- Jade
I'd love to be able to run XFS on our apartment FreeBSD fileserver/firewall, as well as my Linux desktops. I am really looking forward to playing with this! Thank you SGI!
--
Jake
NetBSD 1.4 has a log-structured filesystem called LFS, though it may still have a bug here or there. And it really wants a cleaner that also consolidates (defragments) files as it cleans.
cjs
The world's most portable OS: http://www.netbsd.org.
Journaling filesystems keep a "redo-log" of all activity (changes) to the filesystem. If the system dumps (crashes) the redo-log is re-run at "fsck" time so the filesystem will be complete again and the fsck takes relatively no time. I have a very large Sun machine at work that has a terabyte of an Oracle table space that would take almost an hour to boot (due to the basic fsck of the Oracle tablespace filesystems) unless it crashed, then it was almost 2 hours or more. I moved the Oracle tablespace filesystems to a journaling filesystem and now it takes about 12-15 minutes to boot, maybe 20 minutes if I crash the box. Before the journaling filesystems, whenever it crashed (or I should say almost always when it crashed) I had to manually fix filesystems in maintenance mode. Once I moved the filesystems to a journaling fs, I have not had to do that again. If the journaling filesystem is stable and works like it is supposed to, I would move *all* my machines (including laptops, desktops, and servers) to it.
Scott
C{E,F,O,T}O
sboss dot net
email: scott@sboss.net
Scott
janitor
sdn website family
email: scott at sboss dot net
If SGI so desired, they could actually release a binary-only module which implements XFS (complete with GRIO (*drool*) if they really wanted to), and do so without GPL or anything that resembled an Open Source license. (Ick.) With that in mind, it stands to reason that a filesystem module could be distributed under any license, since it could be built separately from the kernel and modprobe'd in.
A different question is whether or not code that is part of the kernel source tree proper (eg. /usr/src/linux, also known as "everything in the linux-X.Y.ZZZ tarball") must be GPL'd, and I believe that said code does inherit the GPL from the rest of the kernel's GPL-ness. (Any license lawyers out there care to expound on this point for us?) If this is the case, then if SGI wants this to be part of the Linux kernel source tree, they'll have to jump on the GPL bandwagon.
I think "kernel patches", if they're distributed separately from the official kernel and must be applied manually by the users of the patch, are also exempt from being GPL'd, although this is a significantly greyer area. Kernel patches distributed in this fashion act very similarly to programs that #include GPL'd header files. eg. If I #include a GPL'd foo.h in my program, but I don't distribute my code with said foo.h, I don't believe my code becomes GPL'd -- even if foo.h is under GPL rather than LGPL -- although I'm not 100% certain. Anyone care to clarify on that particular grey area?
--Joe--
Program Intellivision!
Excerpt from http://www.OpenBSD.ORG/policy.html
The GNU Public License and licenses modeled on it impose the restriction that source code must be distributed or made available for all works that are derivatives of the GNU copyrighted code.
While this may be a noble strategy in terms of software sharing, it is a condition that is typically unacceptable for commercial use of software. As a consequence, software bound by the GPL terms can not be included in the kernel or "runtime" of OpenBSD, though software subject to GPL terms may be included as development tools or as part of the system that are "optional" as long as such use does not result in OpenBSD as a whole becoming subject to the GPL terms.
As an example, some ports include GNU Floating Point Emulation - this is optional and the system can be built without it or with an alternative emulation package. Another example is the use of GCC and other GNU tools in the OpenBSD tool chain - it is quite possible to distribute a system for many applications without a tool chain, or the distributor can choose to include a tool chain as an optional bundle which conforms to the GPL terms.
So a GPL'd part only has to be optional, and XFS qualifies.
There have been some excellent White Papers on
:(
XFS over the years I recall at both USENIX and
the LISA mtgs. Might want to browse them.
There was one in 95 about XFS and guaranteed IO;
I KNOW there was one in Jan.96 in San Diego as I
have a copy still. Titled "Scalability in the XFS File System" sorry no link
Try this, looks recent:
http://www.sgi.com/Technology/xfs-whitepaper.html
Check em out!
wahl@sgi.com
Yet SGI/IRIX is considered an Enterprise OS:
2. Poor support for RAID - hot spares, hot swap, etc.
I manage several Origin 2000's with Fibre Channel and SCSI RAIDs. While the IRIX OS does support powering down a SCSI chain for swapping out a drive, the OS itself has no software RAID facility.
SGI/IRIX achieves HA RAID using HARDWARE. Their external RAIDs are made by Clariion and simply look like a SCSI device to the OS; you can do the same with Linux today.
3. Poor NFS performance (speed and locking)
While SGI (almost) properly implements NFS3, good luck if you are serving/sharing files with anything other than a Genuine SGI. They've screwed with the Sun implementation enough that it really pays to just stick with SGI for all your workstation/server needs.
5. No high availability clustering (Beowulf is cool but completely unrelated to this)
While there is an expensive, fun to set up FailSafe IRIX available, it is limited to TWO nodes and is not really true clustering (kind of like the original NT implementation).
4. Poor desktop environment
I use 4Dwm daily; its features pale in comparison to the linux offerings. While 4Dwm is "ok" it has a major flaw in that many of the features available on the desktop are not available to a non-SGI X-Server (such as Linux/eXceed). As the servers such as the Origin 2000/Origin 200 ship headless (no display) and we interact with the servers using Linux/NT-eXceed this is a major shortcoming. I wonder if the SGINT-eXceed combo is any better.
and several million hours saved worldwide for
fs non-gurus like myself. When can we burn the
rescue disks?
Let's hope it all comes together in an acceptable way. If/when it does, what a gift for humanity!! (or at least the short fat flippered version thereof)
-- open source? sounds like the real book --
If the license for XFS is at all sensible (i.e. a true Open Source license), this is the single most intelligent thing SGI could have done to score with the Open Source movement. Linux is in dire need of a Journalling File System and XFS is one of the very best of this flock.
Their white paper on XFS explains how XFS is different from conventional file systems and what they did to it to make it fast with very large files as well as with many, many small files (SGI is not Open Sourcing their GRIO capabilities, which together with RT scheduling would make Linux a serious multimedia contender).
If you are a USENIX member, you will be able to download the Sweeney paper Scalability in the XFS File System from the USENIX server. It was published in the Spring 1996 proceedings of the USENIX, so you may also read it in your university's library.
Forgive my ignorance here for I haven't learned much about file systems, but what is the difference between journaling and non-journaling file systems?
8Complex
PS - Slashdot could use a full dictionary of terms from around the site... That'd rule...
Well, hardware RAID is important, if only because there's tons of x86 server boxes out there that have hardware RAID cards in them, including many 486/586 Compaq boxes that are being decommissioned. These would make perfect Linux boxes.
Note that people go for hardware RAID on x86 even though WinNT has workable software RAID. So both are important.
--
Business. Numbers. Money. People. Computer World.