Ask Slashdot: Distributed Filesystems for Linux?

Posted by Cliff on Sunday September 19, 1999 @07:00AM from the using-all-that-unused-space dept.

Ledge Kindred asks: "I am looking for a distributed filesystem to run on my Linux boxes at home. I have several and most of the "extra" space on each one is "going to waste" - I'd like to be able to combine it all into a single network-able filesystem. How?" Click below for more.

"So far the two (three?) solutions that had the most promise are: AFS or Arla, and Coda.

The reasons against: AFS is commercial and I don't want to pay $15,000 in licenses just for a convenience to me. Arla still appears to be extremely alpha quality, even for a Linux hacker used to seeing major parts of his kernel labeled "alpha" or "beta". I had Coda up and running for a couple of days before I ran into a fairly severe flaw in the fundamental design that showed it to be inappropriate for what I want it to do. (But Coda is still the coolest thing since individually-wrapped cheese slices, and if you don't need to worry about that little problem, it's cooler than sex.)

I've found lots of references to the "GFS" project which is not at all what I want, and here and there mentions of other projects such as "DFS", "xFS" and a distributed filesystem for Beowulf clusters but no further details, URLs or most importantly - code - could I dig up.

I don't need RAID, redundancy, failover, or anything like that. I only need to take these extra machines on my home network and make all their extra disk space look like a single volume on the network. Support for Linux as a client is, obviously, essential, but I also have Windows, BeOS, *BSD and Solaris machines on my network, so clients for those would be appreciated but not necessary. Since this is just for me at home, (yes, I've got all that crap on my network at home - so I'm a little crazy) I'd rather stick with free software. Is there anything that can do this? "

If not, then it sounds like it would be an interesting project to work on. The ability to harness the spare disk space across a private network can only be a good thing.

PVFS by jabbo · 1999-09-18 23:08 · Score: 3

If you're not interested in using NFS behind a firewall (just because it's slow, insecure, ugly, unreliable, fault-intolerant, and buggy doesn't make it bad, right?) you might be interested in PVFS. It sounds like what you are looking for, and under the "Files" link are the sources.

Personally, I like the "adventure" of Coda, but haven't tried setting it up in a few months. Now that my roommates have agreed to be guinea pigs for the Windows client, I figure I'll set it up behind my NAT box and play with it again. It's overkill for everything but a big installation, but I still think it's kind of fun. The thought that terrifies me is working with a multi-GB datafile or such over Coda -- but since my roommates will probably be more interested in playing Dopewars and moving around small files on a FE network, I'm going ahead with the grand master plan anyways. Besides, I have a laser printer and a burning desire to experience the frustrations of Samba...

--
Remember that what's inside of you doesn't matter because nobody can see it.
For your setup by aheitner · 1999-09-18 22:42 · Score: 5

Coda would be overkill -- the depot-style requirements are intended for a distributed environment like a university, in which all the clients constantly accessing the servers to do anything would kill the system. Actually, afs (which is currently used at CMU) is intended to work similarly -- clients build updated caches of appropriate application directories for their architectures, with the result that machines running a constantly out-of-date minimal core OS are served a centralized set of applications ...

For home network purposes, where a few users are unlikely to overwhelm the server, use NFS. It's easy, it's well supported across OSs, its performance may not be incredible but nothing you're likely to do will strain it. Even if you're moving huge files around, you're not going to have 10 people moving huge files around simultaneously.

Actually, there's one more fun option to consider: InterMezzo, a distributed fs written in Perl in a few weeks by the creator of Coda, Peter Braam. It's small, it's pretty quick (the speed-critical parts are in C :), it's cool ... it can do a lot of the things Coda can, but it's much more lightweight and doesn't require its own fstype. I don't have the link handy, and I don't know if it has the same caching requirements as Coda ...
Your requirement doesn't sound too useful by Morgaine · 1999-09-19 00:31 · Score: 3

Have you actually thought this requirement of yours through? It sounds fairly dodgy to me.

For a start, you can't seriously be advocating that spare blocks from a variety of machines be used to provide unique bits and pieces of storage for virtual files distributed across those machines, I hope. This would make the availability and reliability of those files extremely low, ie. as low as the weakest link in the system.

Secondly, what happens when one of the contributing filestores requires more space, but can't use it because it's been allocated to one of those distributed files? You could no longer just delete something from the machine concerned without going through that hypothetical distributed filestore manager, because it would be the only party that would know whether the item in question is part of a distributed file and hence whether it can be deleted. (This assumes that it creates real files in the local filespace for allocating to distributed files, which it would have to do otherwise the space it allocates would evaporate if the distributing daemon died.) In other words, *all* of your storage becomes dependent on this new manager, slows to a crawl, and probably loses a lot of the reliability of your native filesystem to boot. No, no, no ...

If the new distributed filesystem manager actually *does* make space on one machine as requested, it would clearly have to push out the data onto some other machine to compensate. If you think about it, the policy issues in this area are "interesting". (Aka "horrid".)

Finally, since the first point (unavailability caused by one machine going down) makes the idea completely untenable in most cases, you'd have to be talking about a system in which blocks are allocated in multiple places for each virtual file block. That's great, but notice that such a scheme is *not* storage-efficient, yet your requirement is based on not wanting to waste storage space!!!
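
To put rough numbers on the availability and storage-efficiency points above, here is a small Python sketch. It assumes machines fail independently and that every block is stored the same number of times; the function names and the figures in the usage lines are illustrative only and do not come from any real filesystem.

```python
import math


def striped_availability(machine_availabilities):
    """Chance that an unreplicated file spread over ALL the given machines
    is readable, assuming each machine is up independently of the others."""
    return math.prod(machine_availabilities)


def usable_capacity(spare_gb, copies):
    """Usable space (in GB) when every block is stored `copies` times
    across the pooled spare space."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return sum(spare_gb) / copies


# Three boxes that are up 99%, 95% and 90% of the time.
print(striped_availability([0.99, 0.95, 0.90]))
# Pooling 4 + 6 + 10 GB of spare space, with and without a second copy.
print(usable_capacity([4, 6, 10], copies=1))
print(usable_capacity([4, 6, 10], copies=2))
```

Under these assumptions the striped file is less available than the least reliable machine, and keeping two copies of everything halves the space that was supposed to be reclaimed.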

No, I don't think you've thought this requirement through.

--
"The question of whether machines can think is no more interesting than [] whether submarines can swim" - Dijkstra
xFS, Frangipani by Lazy+Jones · 1999-09-18 22:17 · Score: 3

xFS is here (with source). An interesting project is Frangipani, but it is not available to the unwashed Linux masses. :-/

--
"I love my job, but I hate talking to people like you" (Freddie Mercury)
My kludgy solution by dizco · 1999-09-18 23:14 · Score: 4

I have the same problem. I've got 5 boxes running linux & os/2, and want all the "spare" space to transparently appear as a single volume on all boxes. I couldn't find anything that would do this effectively for me, so i brewed my own. Unfortunately (for most of you) this requires an os/2 box.
Here's the details:
1) all the boxes export their spare space as nfs mounts.
2) a nifty IFS (installable file system) from IBM's EWS (employee-written software) program called Toronto Virtual File System is installed on one of the os2 boxes (we'll call this box os2tvfs)
3) os2tvfs mounts all the exported drives
4) with tvfs, all the mounted NFS drives are mounted into a tvfs drive (z: in my setup)
5) os2tvfs exports z:
6) any box that wants to access the big-virtual-volume mounts os2tvfs:/z:/

So how's it work? Let's go through an example:

box1 exports d:\, a 10 gig ide drive on an os2 system
d:\ contains a bunch of stuff, for this example we'll focus on "d:\mp3s\foo.mp3"

box2 exports /s1/ a 6.4gig scsi drive on a linux system
box2 has a file on it located at "/s1/mp3s/bar.mp3"

box1 then mounts os2tvfs:/z:/ as v:\
on box1, a directory listing of v:\mp3s\ contains both foo.mp3 and bar.mp3. if i copy baz.mp3 to v:\mp3s, it ends up as box2:/s1/mp3s/baz.mp3, as long as there is enough free space on box2:/s1/ for it, because i assigned a higher write priority to that volume when i mounted it with TVFS (it's a scsi drive- might as well use it up first). It shows up as os2tvfs:/z:/mp3s/baz.mp3.
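
The behaviour in the example above (merged directory listings, plus writes going to the highest-priority volume that still has room) can be sketched in a few lines of Python. This is only an in-memory model of the idea, not how TVFS itself is implemented; the `Volume` and `UnionVolume` names, the priority numbers, and the sizes are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    name: str          # e.g. "box2:/s1"
    priority: int      # higher number = preferred target for new files
    free_bytes: int
    files: dict = field(default_factory=dict)  # directory -> set of file names


class UnionVolume:
    """Presents several volumes as one namespace (hypothetical model)."""

    def __init__(self, volumes):
        # Keep the volumes ordered by write priority, highest first.
        self.volumes = sorted(volumes, key=lambda v: v.priority, reverse=True)

    def listdir(self, directory):
        # A listing is the union of that directory on every member volume.
        names = set()
        for vol in self.volumes:
            names |= vol.files.get(directory, set())
        return sorted(names)

    def write(self, directory, filename, size):
        # Place the file on the highest-priority volume with enough free space.
        for vol in self.volumes:
            if vol.free_bytes >= size:
                vol.files.setdefault(directory, set()).add(filename)
                vol.free_bytes -= size
                return vol.name
        raise OSError("no member volume has enough free space")


box1 = Volume("box1:d:", priority=1, free_bytes=10 * 10**9, files={"mp3s": {"foo.mp3"}})
box2 = Volume("box2:/s1", priority=2, free_bytes=6 * 10**9, files={"mp3s": {"bar.mp3"}})
union = UnionVolume([box1, box2])
print(union.write("mp3s", "baz.mp3", 5 * 10**6))
print(union.listdir("mp3s"))
```

Once box2 fills up, the same `write` call falls through to box1, which matches the "use the scsi drive first" policy described above.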

Of course, this solution is kinda bad because it creates a ton of extra network traffic, but it was the only one i could find that did what i wanted.

--sean
LDAP/CODA rather than Re:NIS+automount+NFS by tenchiken · 1999-09-18 23:12 · Score: 3

You run into some severe problems with NIS+ and NFS. Don't use them if you can get away with it. To wit:

Linux's NFS still has problems. If you need NFS use BSD (BTW, before someone mods me down for that comment, I use linux. NFS is just not a good idea in general).

NIS is a nice idea, poorly implemented, with a lot of problems with security.

You said that there was a problem w/ CODA, and that would be what I normally use rather than NFS. There are a lot of good suggestions posted here.

For distributing information ala NIS, try taking a look at LDAP instead. I have been implementing it at a few client sites, and it works much better than NIS. (There are plugins that let GLIBC and PAM use LDAP transparently, and you can even emulate NIS).

I would definitely kill for something that could transparently create a single large namespace/disk space over a network, but with disk space so cheap, you are probably better off going and buying a 16gb IDE drive. cheep cheep....
Re:The Charon Filesystem by Salamander · 1999-09-19 04:45 · Score: 4

>but is already faster than Ext2fs, and way ahead of XFS and NTFS.

By what measures and for what workloads? Such claims are meaningless without describing the environment, and are the realm of marketroids (particularly the MS kind) not scientists or engineers.

I find it most odd that you would tout the system's distributed nature and then compare it only against local FSes. How well does it perform in sharing situations, either locally or through slow WAN links? What level of coherency does it guarantee? How is failure recovery (a very tricky issue for a DFS) handled? How about disconnected operation?

To be perfectly blunt, the lack of even an attempt to address these sorts of crucial issues makes me wonder whether the part about Charon being distributed is "part of the plan" that hasn't actually been implemented (or even designed) yet. The DFS literature is littered with papers about systems that would supposedly blow everything else away, but that never actually got implemented. I've been there, I've done that, and the sad truth is that the realities of implementing a usable DFS - i.e. one that isn't pathologically ill-behaved in at least one of the areas alluded to above - generally shred naive ideals of superfast coolness.

>Only changed disk blocks and metadata are replicated, as opposed to entire files (and only on close)

If this is really what you meant to say, it's great performance but has dire implications for recoverability. This only strengthens my suspicion that you haven't really climbed into the mud pit in earnest yet.

--
Slashdot - News for Herds. Stuff that Splatters.
As with everything, "it depends" by Salamander · 1999-09-19 05:05 · Score: 4
Many people seem to be jumping in with suggested solutions, but it seems to me that the problem has not yet been adequately described. For example:
1. What kind of sharing do you anticipate? Some systems that handle read-only sharing well fall down when even a single writer enters the picture.
2. Aside from efficiency, what kind of coherency guarantees are you comfortable with? Some systems will behave exactly like a local filesystem in terms of what happens when one client writes and another reads, and can be used for "network-unaware" applications. Others, notably NFS, play "fast and loose" so you have to do explicit performance-robbing flushes to have any sort of guarantees. If you put build trees on a distributed/network filesystem you have to worry about when file modification times get updated, as well as about the file contents themselves; having "make" mistakenly tell you nothing needs to be done can be more than a little annoying.
3. What kind of failure-recovery guarantees do you need? Is it OK to lose the odd unflushed write every now and then after a failure?
4. Do you need support for either advisory or mandatory byte-range locks?
I'm also curious what you found lacking in GFS. I have my own different ideas about "how things should be done" but perhaps explaining why you consider it inappropriate will shed some light on your needs.
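
To make the build-tree concern in question 2 concrete: a make-style tool decides whether to rebuild purely by comparing modification times, so a client that sees a stale timestamp gets the wrong answer. The Python sketch below mimics that comparison on a local temporary directory; `needs_rebuild` and `demo` are illustrative names, and the timestamps are set by hand with `os.utime` to stand in for what a loosely coherent client might observe.

```python
import os
import tempfile


def needs_rebuild(target, sources):
    """Make-style check: rebuild if the target is missing or any source
    has a newer modification time than the target."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "main.c")
        obj = os.path.join(tmp, "main.o")
        for path in (src, obj):
            open(path, "w").close()

        # What a client with a stale view might see: the source still
        # carries its old timestamp, so nothing appears to need doing.
        os.utime(src, (100, 100))
        os.utime(obj, (200, 200))
        stale_view = needs_rebuild(obj, [src])

        # What it should see once the writer's new mtime is visible.
        os.utime(src, (300, 300))
        fresh_view = needs_rebuild(obj, [src])
        return stale_view, fresh_view


print(demo())
```

The file contents never enter into the decision, which is why delayed attribute updates alone are enough to make "make" report that nothing needs to be done.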

As far as practical advice goes, I think most of the relevant products and approaches have been mentioned; I don't promise to have secret knowledge of any "magic bullets". DFS technology is an area where I feel we're still looking for the right answers (sometimes even the right questions). That's why I enjoy working on DFSes, but it does mean that there's a large element of "choose your poison" in evaluating current offerings.
--
Slashdot - News for Herds. Stuff that Splatters.
Here's a plan by jdike · 1999-09-19 00:55 · Score: 3

This is something I've been thinking about for a while. I might give it a go once my current project (the user-mode kernel port) settles down.

This is my current thinking on cfs (cluster fs):
All members of the cluster share a filesystem, which potentially uses all the available storage on the cluster (although you might want to keep stuff like your home directory on a separate device that you don't share with the cluster).
Files are duplicated on multiple machines for speed and redundancy. Files will tend to be located on the machines that are accessing them, so most I/O is local.
cfs will just be the networking part. Local storage will be handled by a local fs (like ext2). cfs metadata will be stored in local files with funky names (which are made invisible by cfs anyway)
There are multiple levels of membership in a cluster. Primary members can read and write everything. Secondary members can only read. They can have read copies of files locally, but they can't hand those out to other machines. Machines wanting to read a file have to go to a primary member for a copy. This is for sysadmins who don't necessarily trust their users to prevent them from becoming root and modifying files (like /etc/passwd) behind the back of cfs and then handing the new /etc/passwd out to everybody else.
Machines can be members of multiple clusters. /etc might come from a cluster that everyone is a member of, /bin might come from a cluster of machines of the same architecture, /projects might come from a third cluster, etc.
Files can be marked "local" which means that they permanently live on that machine, override whatever file comes from the cluster, and aren't shared with the cluster. This would be useful for config files which are only relevant to your machine, or your email directory.
A machine's /dev would be mapped into the cluster filesystem as /dev/aa.bb.cc.dd/ rather than being marked local. This gives transparent access to every device in the cluster.
A machine which is writing a file is designated the file's owner. While writes are in progress, all reads have to go to that machine. Once the writes have stopped, the machine remains the owner, but it can start spreading the new data around the cluster. It can also designate secondary owners, who would come into play if the primary owner crashes. One of them would become the new owner. If it turns out that the old owner had changes which it didn't manage to propagate and the new owner made changes, then my current thinking is that this is brought to the attention of a human, who straightens things out. If this is not acceptable for a particular file for some reason, then that file can be marked in such a way that accesses to it hang or fail until the owner comes back.
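
The ownership rules in the last paragraph can be modelled as a small state machine. The Python below is a rough sketch under simplifying assumptions (one file, no real networking, the first secondary owner always takes over); `ClusterFile` and all of its fields are hypothetical names, not part of any existing cfs code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ClusterFile:
    owner: str
    secondary_owners: List[str] = field(default_factory=list)
    replicas: Set[str] = field(default_factory=set)  # machines holding current data
    writing: bool = False
    dirty: bool = False                # owner has changes it has not spread yet
    stranded_on: Optional[str] = None  # crashed owner holding unpropagated changes
    needs_human: bool = False

    def begin_write(self, machine: str) -> None:
        if self.writing and machine != self.owner:
            raise RuntimeError("another machine is already writing")
        self.owner = machine
        self.writing = True
        self.dirty = True
        self.replicas = {machine}  # every other copy is now out of date
        if self.stranded_on is not None:
            # The old owner's lost changes and these new ones have diverged.
            self.needs_human = True

    def end_write(self) -> None:
        self.writing = False

    def propagate(self, machines) -> None:
        if self.writing:
            raise RuntimeError("cannot spread data while writes are in progress")
        self.replicas |= set(machines)
        self.dirty = False

    def read_source(self, reader: str) -> str:
        # During writes, or without an up-to-date copy, reads go to the owner.
        if self.writing or reader not in self.replicas:
            return self.owner
        return reader

    def owner_crashed(self) -> None:
        if not self.secondary_owners:
            raise RuntimeError("no secondary owner: accesses wait for the owner")
        if self.dirty:
            self.stranded_on = self.owner
        self.replicas.discard(self.owner)
        self.owner = self.secondary_owners.pop(0)
        self.replicas.add(self.owner)
        self.writing = False
        self.dirty = False


f = ClusterFile(owner="a", secondary_owners=["b", "c"], replicas={"a", "b", "c"})
f.begin_write("a")
f.end_write()
f.owner_crashed()   # "a" dies before spreading its changes
f.begin_write("b")  # the new owner writes as well
print(f.owner, f.stranded_on, f.needs_human)
```

A file marked "hang or fail until the owner comes back" would correspond to giving it no secondary owners, in which case `owner_crashed` refuses to pick a replacement.
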
NIS+automount+NFS by Somnus · 1999-09-18 22:17 · Score: 3
We use NIS+automount+NFS on my research group network. Summary:
- NIS -- share various maps, in your context most notably NFS maps so every machine knows where every export is on every machine
- automount/autofs -- on demand mounting of various shares
- NFS -- obviously, for the various exports
Two caveats:
- This is not secure! If you must run a network with authentication+encryption, look elsewhere, like Coda+Kerberos.
- This method works best with kernel 2.2.x+glibc, but 2.0.x+glibc is okay too; libc5 is problematic. Make sure to have the latest revision of NIS, but more importantly have consistent versions of NIS across your network (we use YP, which seems to run reasonably well).
  
  *** Proven iconoclast, aspiring bohemian. ***
The Charon Filesystem by 1010011010 · 1999-09-19 02:47 · Score: 5

I dislike letting the cat out of the bag this early, but those of you who pay attention to linux kernel-development lists already know some of this.

We're writing a new distributed filesystem called Charon. It will be patented ("patent pending"), copyrighted, and GPLed. It's a true 64-bit, journaled filesystem that supports exabyte-plus file and volume sizes, sophisticated access control lists, per-directory quotas, distributed zero-knowledge protocol authentication, encryption, replication, named streams and indices (see BeFS, ReiserFS -- although we don't use B-trees of any type). It's in alpha stage right now, and full of debug code, but is already faster than Ext2fs, and way ahead of XFS and NTFS. We will be porting it to Solaris and NT after development on Linux is complete.

Unlike Coda, AFS, DFS, etc. replication, every Charon server is a full read-write replicant. Only changed disk blocks and metadata are replicated, as opposed to entire files (and only on close) as in Coda. Charon clients are partial replicants -- they use the local file system as cache and rely on their home server(s) for token management and authentication. The system also supports hierarchical failover and replication.
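
For readers unfamiliar with the difference between replicating whole files and replicating only changed blocks, here is a generic Python sketch of the latter. It is not Charon's code or protocol, which the post does not describe in detail; the 4096-byte block size and the function names are assumptions chosen for illustration.

```python
BLOCK_SIZE = 4096  # assumed block size, for illustration only


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks (the last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Return {block_index: new_block} for every block that differs,
    including blocks that only exist in the new version."""
    old_blocks = split_blocks(old, block_size)
    delta = {}
    for index, block in enumerate(split_blocks(new, block_size)):
        if index >= len(old_blocks) or old_blocks[index] != block:
            delta[index] = block
    return delta


def apply_delta(old, delta, new_length, block_size=BLOCK_SIZE):
    """Rebuild the new version on a replica from its old copy plus the delta."""
    blocks = split_blocks(old, block_size)
    for index, block in sorted(delta.items()):
        if index < len(blocks):
            blocks[index] = block
        else:
            blocks.append(block)
    # Truncate in case the new version is shorter than the old one.
    return b"".join(blocks)[:new_length]


old = bytes(10000)
new = bytearray(old)
new[5000] = 1
new = bytes(new) + b"x" * 100
delta = changed_blocks(old, new)
print(sorted(delta), sum(len(b) for b in delta.values()), len(new))
```

Only the touched blocks travel over the wire, which is where the bandwidth saving comes from; what the receiving side does if it crashes halfway through applying a delta is a separate question that this sketch does not address.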

Because of the way it is designed, it also supports a very nice feature for GUIs and web servers -- a very fast built-in file types database that provides a single repository for mime type, friendly name, icon(s), description, extension, and other information. Sort of like the Windows registry, but much less stupid and much higher-performance.

Stay tuned! This isn't vaporware.

--
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
Mango's Medley could be ported by RickyRay · 1999-09-19 00:08 · Score: 3

A buddy of mine (John Carter, professor at the University of Utah) was the chief architect/programmer of Mango's Medley (www.mango.com). It would be ideal. It lets you take spare space on a bunch of Win machines and use them as a virtual fileserver. The really cool thing is that it does transparent mirroring, plus if you're using a file it automatically turns your box into one of the mirrors so that it will be even faster than the fastest file server. And the company has a really cool name ;-)

Problems:
(1) It has only been written for Windows. But not that hard to port.
(2) More serious: they initially did it on Windows because that's where they saw a larger potential customer base. But my friend, last we spoke, said that despite the practicality of the product (and winning best of show 2 or 3 years ago at Comdex) they still haven't had any substantial sales. So a port isn't likely to happen. The best would be if they opened up the source for Linux (they still have a patent on the Windows version, so it probably wouldn't be a problem), but I have no clue if they would ever consider that. Regardless, somebody needs to write such a system for Linux/BSD. Probably wouldn't even be that hard.