Organizing Data Across a Heterogeneous Net?

Posted by Cliff on Friday May 31, 2002 @05:20AM from the interoperability-is-good dept.

angst_ridden_hipster asks: "Like many people, I have a bunch of machines I use regularly. These include Linux machines, BSD machines, a Mac OS X machine, and a Windows machine. These machines are on a number of networks. All have internet connectivity. Some of them are always powered on. A few of them are not. Obviously, I have a bunch of accounts. And, it goes without saying, I have a bunch of data. What are the best approaches to sharing data? I want to be able to securely access my home data while at work, and from one machine to another, etc. Opening ssh terminals is the approach I have traditionally used, but I'm beginning to wonder if some mirroring software (e.g., Unison) might be in order. It'd provide the function of backups, as well as guaranteeing availability. Would it be wiser to tunnel nfs over ssh? Or is there some better option? Assuming I actually start mirroring data across multiple machines, I'll need to organize it in a portable taxonomy. This is almost easy, since I use cygwin on the Windows machines, so I can assume a standard Unix-ish directory structure. But this gets more complicated when there are scripts or other code involved. What about application/platform-specific data? How do other people organize their data, anyway? Are there any useful standards? I'm hoping people will describe their approaches, and why they think they're (not) the best."

Database and rsync+ssh by ddkilzer · 2002-05-31 05:26 · Score: 3, Informative

Without knowing more about the type of data you're storing, I would recommend putting it in a database. I like PostgreSQL 7.x myself.

For the software, I would organize it in a directory structure and use rsync+ssh to mirror it as needed.

For backup software, use Amanda.

For file sharing, use Samba.

'Nuff said.
1. Re:Database and rsync+ssh by anwnn · 2002-05-31 05:29 · Score: 3, Informative
  
  The one problem with rsync/ssh, for Mac OS X at least, is that it will munge the resource forks of most files. For some this isn't a problem, but if you do a lot of Mac work on files with forks, then it most definitely is. setfileinfo can help sometimes, but that can be rather tedious.
2. Re:Database and rsync+ssh by Oculus+Habent · 2002-05-31 09:06 · Score: 4, Informative
  
  Well, there are several angles to look at. I'm going to hazard a few guesses at the situation, and hopefully I won't be too far off.
  
  Accounts: You mentioned many accounts, so part of the problem could be different users on different boxes (not saying that you don't know, just that I don't). It's initially easier to use groups to clear up these issues, and tackle account changes later. Create some extra users to make usernames match from box to box, and then group them together so they all can access the appropriate files. This still leaves room for account name matching later.
  
  File System Uniformity: Some people will probably think this is an awful solution, but if you use a single directory (like /mnt) and mount/link everything to identical naming on each box, you won't have the location problems. Sure, it's cyclical to have / linked to /mnt/mylinuxbox on your linux box, but you will always know that your MP3s are in /mnt/mylinuxbox/mp3 (or wherever the hell they are).
  
  Remote Access to your Filesystems: I'm not really qualified for this one, but the NFS/SSH combo is secure and tried. If you don't mind the at-home network traffic, you can make life easier by mounting everything on one computer, and then mounting that one computer remotely. Not recommended for heavy use, but it's easier than managing four connections.
  
  Mirroring is OK if you have specific, regular downtime that the computers can spend, or you have an OC-3 from home to work and great drive access times. The problem mirroring can present is synchronization lag. Unless you specifically set up your mirroring to sync ASAP, what will you do if you make it home before your data does? Live access does two things: you only transfer the files you need, and you don't have to worry about sync'ing. Plus, what's the point of the Internet if not to make information available? : )
  
  Organization: I've been re-organizing my files for years now, and the best thing I've done for most files is to just simplify. I used to make subdirectories for everything. Just recently I have realized the real intent of the "filing cabinet" metaphor...
  
  Filing cabinets are only ever four layers deep: Department (what the cabinet is for; cabinets and drawers are physical limitations, not part of the concept), Group (Hanging Folders), Project (Manila Folders) and then files. Sure, you may end up with a lot of "Groups", but that is what alphabetization is for.
  
  Mind you, I haven't managed to change over all of my filing systems to this format. It takes time to sit down and think about what should be where. But it seems (at least to me) like a good thought for personal file organization.
  
  Good Luck.
  
  --
  That what was all this school was for... to teach us how to solve our own problems. -- janeowit
AFS? by alsta · 2002-05-31 05:28 · Score: 5, Informative

IBM has released Transarc's AFS as OpenAFS (http://www.openafs.org). Don't know if that is what you're looking for, but it is pretty nice. It's also portable, so it runs on various unices as well as Windows. Most can be found as binaries if you don't want to roll your own.

AFS is an NFS style implementation though, so you would have to save your files onto a special mount.

--
Wealth is the product of man's capacity to think. -Ayn Rand
1. Re:AFS? by C+Joe+V · 2002-05-31 07:23 · Score: 2, Informative
  
  I use OpenAFS between work/school and home. It is very convenient to access at work, where a fast Ethernet connects me to the AFS server, but quite slow from home over DSL. Examples: when Emacs auto-saves to AFS, I have to stop typing for a while; I try hard to avoid compiling things (or running TeX) where the code is in AFS; when I kept my email in AFS, sylpheed took a really freakin' long time to scan my inbox and was much slower to incorporate new messages than it ought to have been.
  
  Also, I was frustrated by the process of compiling OpenAFS for my Mandrake 8 box (GCC version crap), and if I ever try to mount AFS when anything is wrong with the network, I know I am in for a serious crash later on. Perhaps these are just my fault, of course.
  
  Hope this helps.
I use WebDAV by marick · 2002-05-31 05:32 · Score: 4, Informative

I'd say what you need is an internet-enabled file system. Some might say NFS, and that seems like a fine solution.

On the other hand, if you have a computer that is always on, that can run Apache, you can have your own personal WebDAV server instead. Simply install mod_dav, and access it through mod_ssl, and have a secure web-based filesystem.

Better than NFS, you can mount it on Windows (through web folders), Linux (through davfs) and Mac OSX (through the native DAV file system client that is designed to run with iDisk).

NOTE: I work for Xythos software, and we make an enterprise-level WebDAV server called the Xythos WebFile Server. It's significantly more expensive than free, and we run in-house copies of the product (y'know eat your own dogfood), so that's where I keep my shared data, but if I didn't, I'd have mod_dav running right now.
Try CVS. by Anonymous Coward · 2002-05-31 05:32 · Score: 1, Informative

http://www.cvshome.org/
Mac n' Windows by peatbakke · 2002-05-31 05:38 · Score: 2, Informative

I'm not sure if this is entirely applicable to your situation, but here's what I do, and it works reasonably well.

I have a server on a public IP address that runs SAMBA, but only accepts connections from 'localhost'. From my Windows box and iBook (running OS X), I just do a bit of SSH tunneling, and I'm able to mount the machine from anywhere I happen to be.

As far as I can tell, it's reasonably secure, and it works just fine for general files.
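
For anyone wanting to try the same thing, the tunnel described above could be set up roughly as follows. This is a minimal sketch, not the poster's actual setup: the helper name `smb_tunnel_cmd`, the local port 1139, and the remote port 139 are all assumptions for illustration (a server offering direct-hosted SMB would use 445 instead). It only prints the ssh command so it can be inspected before running.

```shell
# Hypothetical helper: print the ssh command that forwards a local port
# to the Samba port on the server's own loopback interface.
# Arguments: user@host, local port (default 1139), remote SMB port (default 139).
smb_tunnel_cmd() {
    target=$1
    local_port=${2:-1139}
    remote_port=${3:-139}
    # -N: do not run a remote command, only forward ports.
    # -L: listen on local_port and forward to localhost:remote_port,
    #     where "localhost" is resolved on the server side of the tunnel.
    printf 'ssh -N -L %s:localhost:%s %s\n' "$local_port" "$remote_port" "$target"
}
```

Running the printed command and then pointing the SMB client at the forwarded local port is the general idea; since Samba only listens on 'localhost' on the server, nothing is exposed except sshd.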

I also have a CVS repository on the server for my development projects, but that doesn't work so well for binary files like images and Word documents. :P

One of my friends keeps his files synchronized via an htaccess protected website which allows him to download and upload files. If you're interested, I'll see what I can do to track down his PHP script ...
don't use NFS by Kunta+Kinte · 2002-05-31 05:49 · Score: 5, Informative

Unless you want to share your data with lots of 'friends' you just haven't met yet.

NFS is used very often to mount home directories. But what is stopping someone from unplugging the workstation, plugging in a Linux laptop with the IP of the legitimate workstation, mounting the share, and running "su - user"? Voila, they now have all the user's files.

That's just the simplest way. The problem is that most NFS implementations don't have *any* authentication except for IP authentication. So DNS attacks would work as well.

I am surprised that the most widely used network file system implementation for Linux and most POSIX OSes has no real authentication. There *has* been authentication built into the protocol since version 3, but last time I checked, it was not supported on Linux. I was told by one guy working on the project that the problem was that there's no crypto in the kernel.

I used secure NFS on Solaris 8 for a while but I constantly lost the mounts. That may be fixed now, I don't know.

Use AFS, CVS, rsync, intermezzo, or something. But I would stay away from NFS.

--
Based on upvotes, Ageism is the only "-ism" Slashdotters care about and think isn't SJW
How I do it by MaxVlast · 2002-05-31 05:50 · Score: 3, Informative

Yep. Unified access to e-mail via IMAP is definitely the linchpin of a good arrangement.

I've been trying to deal with the same problems as you for several years. I have a Mac running Mac OS X, Windows PC, Linux server, and a NeXT around my desk. I have two large hard drives. One is in the Mac and that holds my home directory, and the Linux machine has all my MP3s. My home is exported via NFS and is mounted on the Linux box and on the NeXT so I always have live access to my files. The Windows box only does my TV program and Kazaa, so I'm content to simply have it use FTP to copy files back and forth (I haven't found a decent Windows NFS program.)

It all gets the job done, and it all works smoothly. Printing is done by IP printing to my big 'ol LaserJet. All the mail is kept either on my server at school, or on the cyrus server on the Linux box. It's a delight =)

--
There should be a moratorium on the use of the apostrophe.
Max V.
NeXTMail/MIME Mail welcome
Can there be only one? by rwa2 · 2002-05-31 06:00 · Score: 5, Informative

Well, here's my approach...
First, I try to adhere loosely to the FHS for ideas on overall organization. Even though it's mostly intended for POSIX systems, following their philosophy will really help you separate your data from your platform-dependent program files and libraries.
Most of my important stuff goes on the Linux server in /home (on an IDE software RAID1). However, I try to limit files in here to stuff that's absolutely essential to keep the size down. I occasionally mirror this offsite to my friends' servers with rsync (with the private stuff pgp encrypted). I try to make browser caches, etc. symlinks to dirs in /tmp . Try to keep only the stuff you created yourself in here.
I keep media and downloads on a plain partition under /home/ftp/pub (which is also symlinked from the http document root). That way, all my computers can easily get access to music and installers and junk.
Samba helps win32 boxes access the /home and /tmp directories.
NFS exports /home to the other UNIXen, as well as /usr for the other machines with the same CPU arch. It should be acceptable to export /usr/share to other UNIXen with different architectures.
I'd like to set up CODA, since it seems to support more kinds of clients than Intermezzo. Both support disconnected operation and are good for laptops. For the meantime, I just use rsync to mirror home dirs onto my laptop, though (and just keep track of stuff that I change on the road manually :/ )
No thoughts on how to combine everything into a distributedFS so you could have parts of, say, a music archive living over several machines. There are several projects for Linux-only (PVFS) or Win32-only (more advanced network-neighborhoods). I'd say your best bet for convenience is just to make sure everything is visible from your one server and reexport it from there (invest in a switch so it doesn't deadlock your network). Until better DFSes exist, though, I think you'll get better performance and less confusion from running everything from one beefed-up server with a RAID (or two if you want failover).
centralize and distribute. by rusty0101 · 2002-05-31 06:04 · Score: 2, Informative

For those systems that are on all the time, select one system to be a common server. I personally recommend a Linux box, though xBSD or OS X may provide the features you need as well.

In your home directory, create a folder you are going to put your mount points in to mount the data stores you need.

On all the other systems, create a share that will contain the data you want to access "anywhere". On the central server, mount all of these shares in that mount-point folder. This may be NFS or CIFS as the architecture of the servers dictates.

As this is all mounted in your home directory, you can go to just about any system in the network and remotely mount all of your folders by mounting your home folder from your primary server.

To remotely access this storage center, use either NFS over ssh, or build appropriate links into your web pages and run a secure variant of Apache.

I also recommend keeping your work data in a separate storage area from your personal/home data. You may recall that Northwest Airlines successfully sued to get the personal computers of flight attendants who they believed co-operatively negotiated a sick-out strike. Keeping your personal data completely separate would reduce the likelihood of losing your entire computer setup if someone at work files a complaint that they believe you are doing something wrong.

There are other advantages to this kind of setup. By centralizing your data storage tree, it is easier to perform backups: you will only need to back up the one server's home directory, tracing into the peripheral servers. If you wish to set up a thin client in a bedroom, or someplace where you don't want to have a lot of fans going, this gives you a platform ready made for your storage needs, as well as a reasonable terminal server. I think you get the idea.

-Rusty

--
You never know...
RSYNC by dbarclay10 · 2002-05-31 06:21 · Score: 3, Informative

Aha, been through this myself ;)

Okay, you *could* use some form of networked file system, but a) your laptop and other machines would need to be connected to use it, and b) I hope you are willing to fight to get a good implementation to work, and c) I hope you aren't playing with big files :)

I use rsync. I have ~/Makefile, 'make sync' works wonders. Here's the contents:

On the laptop:

get:
	rsync -avuz --exclude "*~" willow:/home/david/data /home/david

put:
	rsync -avuz --exclude "*~" /home/david/data willow:/home/david

sync: get put

Works like a charm :)

--

Barclay family motto:
Aut agere aut mori.
(Either action or death.)
Re:AFS? Not suitable by sparcv9 · 2002-05-31 06:23 · Score: 3, Informative

angst_ridden_hipster asked for something that runs on OS X
OpenAFS *does* run on OS X.

--

This is not a Fugazi .sig
Apple's new Xserve by alchemist68 · 2002-05-31 06:25 · Score: 1, Informative

If you've got the extra cash, buy the low end Xserve from Apple, set up a local area network connecting all the computers, and store all your files on the Xserve. You already have a Mac OS X computer, so you can control the Xserve with that. And if Apple's Xserve is really as easy to use and administer as they claim it to be, it would probably be worth the money for you. My suggestion only applies if you don't know enough about setting up servers and you want the quick and easy way out of your dilemma. Hope this helps.
Web Design by Lando · 2002-05-31 06:40 · Score: 3, Informative

I'm not really sure what type of work you're doing where you need access to your files... I can relate my knowledge on dealing with unison over the past year though.

I do a lot of back end web development. As such I usually like to copy the entire site down to a local machine, work on the system, upload to a test machine, test, and then move to a development machine. Unison has made my job a lot easier than using a bunch of ssh scripts, since unison automatically checks for changes and only copies over files with changes.

A sample script is as follows:

From my local file system $HOME/web/(website) I execute the following script

unison -auto -batch include ssh://user@somehost.com//www/(website)/include

unison -auto -batch www ssh://user@somehost.com//www/(website)/www

This script pulls all my programming work in include and the website-accessible files in www to my local system... I then work on the files and upload using the following script

unison -auto -batch include ssh://user@testhost.com//www/(website)/include

unison -auto -batch www ssh://user@testhost.com//www/(website)/www

I then check the code on the test host. When I get it to the point I want, I upload it to the production machine...

If I have problems on the test host, I can go in and remove all files on my development system and pull a fresh copy of files from the live site...

Since I don't need to program and compile on different systems, just upload to the test and production machines, it works well.

Recently I took a trip and did not have access to my local system. I was able to borrow a Windows system, and after installing PuTTY, WinSCP and unison I was up and running within 10-15 minutes at the remote site, which allowed me to get back to work.

The problem with using a remote mounting system is that you have to maintain network connectivity while working on files, not always an option, plus you are working with the live production files...

So basically I use unison just like a cp command except that it does not copy files that already are synced between systems and it automatically keeps my permissions sync'd as well.
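
The paired commands quoted earlier in this comment differ only in the host, so they could be folded into one small helper. This is a sketch under assumptions, not the poster's script: the function name `sync_site` and the `UNISON` override variable are hypothetical, and the remote layout (/www/SITE/include and /www/SITE/www) is copied from the commands above. It is meant to be run from the local $HOME/web/SITE directory, as in the original.

```shell
# Hypothetical wrapper around the unison commands quoted above.
# Usage: sync_site user@host site-name
# Set UNISON="echo unison" to preview the commands without running them.
sync_site() {
    host=$1
    site=$2
    for root in include www; do
        # Same flags as the original commands: -auto accepts non-conflicting
        # changes, -batch asks no questions. Stop at the first failure.
        ${UNISON:-unison} -auto -batch "$root" "ssh://$host//www/$site/$root" || return 1
    done
}
```

With that in place, pulling from the live host and pushing to the test host become two calls to the same function with different host arguments.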

Hope that helps

--
/* TODO: Spawn child process, interest child in technology, have child write a new sig */
Segregate the data, manage each. by jmanning2k · 2002-05-31 06:56 · Score: 4, Informative

I agree with you. Your question, though, was overly general.
There are really three (or more) separate data issues that you have to deal with.

Like most, I have many accounts, and just manage them on the fly. My data is retrieved manually when I need it. SSH (and scp), VNC, etc. This usually does the job.

Not the easiest way to do it. Especially when I recently changed jobs and had to set up new data and profiles, I thought there must be a better way to do it.

So, here's a breakdown of the problems, and suggested fixes.

Break it down into 3 separate sets of data:
1. Profile data - Your shell scripts, .bashrc, environment, ssh directory, pgp keys, etc.
2. Daily Documents - My Documents folder, data directory. Limit this to stuff you need in ALL locations (though you could have a personal and a work version...) and on a regular basis.
3. Archived files - Infrequently used, but you occasionally need to access them from various places.

Then, the problem becomes much simpler. Instead of a grand scheme to manage all three of these at once, you have three smaller, simpler problems.

Here's my suggestions:
1. Profile info - Wasn't originally my idea, but the best thing I've found is to use CVS to manage the files. You'll also have to set up your shell scripts to detect the OS / machine you are on and run OS / machine specific versions.
For example: .bashrc
Detects OS, runs ~/.profile.d/linux, ~/.profile.d/win32, ~/.profile.d/macosx, etc.
Detects hostname, runs ~/.profile.d/hostname.
Put core stuff in the .bashrc, put specific things in the separate files.

The rest, usually doesn't change.

Add it all to CVS on a personal server. Then just checkout to each account you have. cvs update will keep it up to date if you change the master copy. You might need a special .cvsignore to make sure it only manages the files you want it to.
Then, you have the same profile files on all of your machines. Got a new .emacs macro, or shell prompt tweak? Edit one account, cvs commit, cvs update the rest.

2. Daily use Documents. This is a mix. Perhaps you could use a separate CVS repository. Or, use rsync and rdiff type backup sync programs. The key here is to keep this to a minimum. How much do you really need, and how much *must* be in sync between all your machines at all times? Again, this is fairly easy for a small number of documents, so don't let it get out of hand. If you don't use the file all the time, and don't need to maintain changes, then push it to archives.
This is the issue that most other posts address, so I won't get into too much detail. All those solutions are much easier with a small number of documents.

3. Archived files. This is probably what you were really asking about with regards to NFS and sharing files. These are the files you need every so often, stuff like your mp3 collection, downloaded software, extended (non category 2) documents, and the like.
For these, it depends on your setup and level of network access (the speed is important too). rsync might work if you need a locally cached copy, but this is much easier if you leave it in one place. Set up a gateway on your home network with IPSec or PPTP. Or, find WebDAV or some internet accessible filesystem you can use (NFS or SMB even, depends on your security needs). Then, connect to the central repository when you need these files.
This can be large, but keep it so that you don't need to synchronize frequently, and preferably only in one direction. You listen to your mp3's, but you don't change them frequently. Same with your downloaded tar/zip files of software you've collected. (Face it, having a single directory with cygwin, mozilla, etc - all the software you have installed at each location - is much easier than finding and downloading them all from their various sites each time.)
Or, for these files, if you really don't need them all the time, leave them on the central server, and scp them when you need them.
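
The OS and hostname dispatch described under item 1 (profile info) might look roughly like this in a .bashrc. A minimal sketch only: the function names are made up for illustration, the `bsd` tag is an addition that is not in the list above, and the `uname -s` patterns are assumptions about what each platform reports.

```shell
# Map the kernel name reported by `uname -s` to a short OS tag
# matching the file names under ~/.profile.d/.
profile_os_tag() {
    case "$1" in
        Linux*)  echo linux ;;
        Darwin*) echo macosx ;;
        CYGWIN*) echo win32 ;;
        *BSD)    echo bsd ;;      # assumption: not part of the original list
        *)       echo unknown ;;
    esac
}

# Source a file only if it exists and is readable; otherwise do nothing.
source_if_present() {
    if [ -r "$1" ]; then
        . "$1"
    fi
}

# Load per-OS settings first, then per-host overrides.
load_profile_d() {
    dir=${1:-$HOME/.profile.d}
    source_if_present "$dir/$(profile_os_tag "$(uname -s)")"
    source_if_present "$dir/$(hostname)"
}
```

Calling `load_profile_d` near the end of ~/.bashrc would then pull in ~/.profile.d/linux on a Linux box, followed by a file named after the machine's hostname if one exists. Keeping the mapping in its own function means the same checked-out .bashrc works unchanged on every account.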

--

So, that pretty much covers it. I hope these suggestions are useful. There comes a time where managing it on the fly just gets too cumbersome. (You'll know that time - it usually happens right after you wipe out some vitally important data because you didn't synchronize the files.)

Beyond this, you can always add all kinds of stuff. Some examples: ACAP (a configuration file server, I use it with mulberry, my IMAP client. It lets me set preferences), Kerberos for common authentication, LDAP for an address book or netscape roaming profiles, the list goes on and on.

What would be nice is a set of scripts to help manage this.
Imagine, getting a new account and typing "pullprofile", and having your environment and data all retrieved, pulled from your central server. Then you could have login and logout scripts to synchronize the data, or just manually (possibly remotely if you forgot to sync before you left work) run them. A cron job to synchronize the big data store overnight.

I'll keep dreaming, and keep looking on freshmeat and sourceforge for a project like this. Maybe one day I'll get up the energy to start it myself, but don't count on it.

;-)

~Jonathan
AFS + kerberos by iocc · 2002-05-31 07:03 · Score: 2, Informative

Use AFS and kerberos. Works for mit.edu, Ericsson, kth.se and MANY others so it should work for you too.

http://www.openafs.org
http://www.pdc.kth.se/heimdal
Samba + VPN by AntiChristX · 2002-05-31 07:22 · Score: 2, Informative

My mom's office had the same types of problems so here's what I did:

1. Set up samba on the reliable (linux) machine, with proper tape backup, etc.
2. Firewalled the segment (which included their desktops) with a WatchGuard SOHO router (about $500 for 25 user support, runs linux :)
3. Set up Mobile User VPN on the firewall, and any laptops that might travel out of the office.

Samba and SMB are not the world's fastest solutions, but it is nice to be able to have the directory browsing in winders and macos. Samba is easy to set up, my first install of a samba PDC only taking about 3-4 hours (and never touch it again). If you need real speed for transferring over large files, you can always use SSH and SCP (putty and pscp for windows, niftytelnet for mac). Just always attempt to maintain a central data server, back it up as needed, and you'll be successful in clearing the data clutter.

--
AntiChristX
Daring to remain below 5 karma indefinitely
Server yes! And NetInfo vs. LDAP by plsuh · 2002-05-31 07:34 · Score: 4, Informative

This response is dead on. The original asker needs a file server that speaks multiple protocols. Once you have a server, it is much easier to create the necessary ssh or ssl tunnels that you need for total security.

Trying to maintain coherency of data via replication across multiple machines is begging for trouble -- this is a hard problem that to my knowledge has not been solved in a clean, cheap way.

If you want to use NetInfo for Mac OS X, create a new port from the Open Darwin sources. There's a port of an old NetInfo server module for Linux floating around, but it's not what I'd call up to date.

A better choice would be to use OpenLDAP, as Mac OS X is designed to pull directory service info from an LDAP data source. Windows systems can also pull from LDAP, as can Linux and *BSD and Solaris and so on.

--Paul
Re:It's called a server by mprinkey · 2002-05-31 09:17 · Score: 3, Informative

I also agree that a server makes the most sense. I would amplify these recommended transport mechanisms to include a few others that will allow remote connectivity.

First is a secure IMAP server for centralized email. This will allow any SSL-enabled IMAP client to access your mailbox. Also, Squirrelmail running on an SSL web server can give you access to your centralized mail repository from any web browser.

SMB and NFS are the obvious choices for LAN-based access, but WAN access needs more care. I think that a VPN setup using CIPE is a good approach. Once the CIPE links are built, you can use most services as if you were located on your wired LAN.

The other need might be for file access from "arbitrary" locations. In addition to the normal scp and sftp apps in OpenSSH, there is a nice SCP client for Windows, WinSCP. Lastly, if you have an SSL web server there already, Web-FTP will give you access to your files via https.

This sounds like a lot. In the end, you would need to expose SSH, SSL IMAP, SSL Apache, and CIPE servers. I am midway through this deployment myself, but it has stalled a bit because one of my primary Internet access points started disallowing outgoing SSH.