Slashdot Mirror


Sharing a Subset of Data Between 2 Sites?

eldrich asks: "We have two labs: a main lab (lab 1) has 1.2TB of on-line data storage -- two machines with 600GB RAID-5s hung off of them. These happily service about 30 Linux machines via NFS over fast ethernet. There are 5-6 WinXP machines that connect via SMB and Samba. The lab is on a private network with a single firewall between it and the world, and we use LDAP for practically everything (hostnames, usernames, passwords, autofs, etc). The students' lab (lab 2) is 40 miles away, with 8 workstations and 2 WinXP machines. This lab also has a small RAID-5 Linux server with 180GB space which serves via NFS and Samba. Sometimes we have people from lab 2 at lab 1 and while they are at the main lab, they need their files. What I want to do is make lab 2's 180GB RAID a subset cache of the 1.2TB one in lab 1. This puts everyone's main storage at lab 1 (which is backed up weekly) but a local copy can be cached on the lab 2 RAID system. This gives the students a local copy for fast access, but all the safety of the backups made from our system. Does anyone know of a filesystem or programs that can help with this?"

"Some people spend 95% of their time in lab 2, so that is their 'home' server, but when they come to lab 1 for a week's stay or so, they scp/rsync their files to the lab 1 server, and at the end of the week push the changes back to lab 2. When people log in to a workstation, they usually remain logged in for days at a time and xlock the screen. [If we can get this caching system working], it would mean that people moving between the labs would not need to copy files around since there would always be a 'local' copy.

The network between the labs is not fast enough for direct automounting of lab 1's server on the lab 2 workstations, especially since some files can be over 300MB in size. We have a VPN (via freeswan) between the different labs, so all data transmitted is encrypted. Also, because lab 2 has 1/6 the capacity of lab 1's RAID, it can only hold cached copies of data that is in use or likely to be in use.

Crontab entries that copy files nightly are not useful because people often turn up at either lab on any given day.

The 3 servers currently run 2.4.18 with XFS so any solution should be compatible with XFS but at a real push we could consider changing the filesystem to another one."
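
One way to picture the "subset cache" described in the question is as a selection problem: given sizes and last-access times for files on the lab 1 servers, keep the most recently used ones that fit within lab 2's 180GB. The Python sketch below illustrates only that idea. `FileInfo`, `choose_cache_set`, and the greedy newest-first policy are hypothetical names and choices made for illustration, not features of any filesystem mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    path: str
    size_bytes: int
    last_access: float  # seconds since the epoch (assumed to be available)


def choose_cache_set(files, budget_bytes):
    """Pick the most recently used files that fit within budget_bytes.

    Returns (list of chosen paths, total bytes used).
    """
    chosen = []
    used = 0
    # Newest access first, so in-use data is considered before stale data.
    for info in sorted(files, key=lambda f: f.last_access, reverse=True):
        if used + info.size_bytes <= budget_bytes:
            chosen.append(info.path)
            used += info.size_bytes
    return chosen, used
```

A real cache would also have to fetch misses on demand and write changes back to lab 1; this shows the selection step only.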

23 comments

  1. Here's a thought. by Anonymous Coward · · Score: 0

    Stop trying to use the wrong tool for the job. Win2k DFS will do this job nicely. That and you also don't have to pay $699.

    -Linus

    1. Re:Here's a thought. by borgboy · · Score: 0, Offtopic

      Now, about that job.....how's $50M in stock options and $2M/year in salary? I'll even let you run Linux at work (in a Virtual PC, of course)
      All you have to do...is nothing!

      --Bill

      --
      meh.
  2. CODA and AFS by Lomby · · Score: 3, Informative

    If you have a very reliable connection you may want to go for AFS.

    In case the connection is not reliable (or not fast enough), you may want to try CODA, which is a distributed filesystem that supports disconnected operation. Beware: AFS is a mature project, while CODA may still be a work-in-progress.

    1. Re:CODA and AFS by Anonymous Coward · · Score: 0

      According to your links, AFS and CODA are the same thing.

  3. Keep it simple use CVS or rsync by Bob+Bitchen · · Score: 3, Informative

    Don't over-engineer; keep it simple and use CVS or rsync.

    --
    http://tinyurl.com/3t236
    1. Re:Keep it simple use CVS or rsync by majorero · · Score: 1

      rsync will definitely do this job nicely. Take the 180GB offline for a couple of hours and do a LAN rsync to the 1.2TB. Bring the 180GB back online, and have rsync do differentials after that. Simple!
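
The two-step approach above (one full copy, then differentials) could be scripted roughly as follows. This is a sketch that only builds an rsync command line and does not run it. The paths and the host name `lab1server` are placeholders, and `rsync_command` is a hypothetical helper. The flags used (`--archive`, `--compress`, `--partial`, `--delete`, `-e ssh`) are standard rsync options.

```python
import shlex


def rsync_command(src_dir, dest, use_ssh=True, delete=False):
    """Build (but do not run) an rsync argument list."""
    cmd = ["rsync", "--archive", "--compress", "--partial"]
    if delete:
        # Remove files at the destination that no longer exist at the source.
        cmd.append("--delete")
    if use_ssh:
        cmd += ["-e", "ssh"]
    # A trailing slash makes rsync copy the directory's contents
    # rather than the directory itself.
    cmd += [src_dir.rstrip("/") + "/", dest]
    return cmd


if __name__ == "__main__":
    # Placeholder paths and host name; adjust before use.
    print(shlex.join(rsync_command("/export/home", "lab1server:/export/lab2-home/")))
```

The same command serves for both the initial copy and the later runs, since rsync skips files that have not changed.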

  4. ssh by TheSHAD0W · · Score: 2, Interesting

    I'm not sure you'd find caching a subset of your file base to work very well. You might wish to consider instead installing some additional machines at the main location and allowing your researchers to log onto them remotely, using X or VNC if necessary. This should work much better than trying to maintain a local partial cache if you think you're going to experience many cache misses, especially since some of those files are so large.

  5. unison by martin · · Score: 3, Informative

    http://www.cis.upenn.edu/~bcpierce/unison/

    works very well and is designed for this kind of thing.

    BTW - weekly backups!!!! daily surely?

    1. Re:unison by PD · · Score: 1

      Mod that up. Unison is in my opinion one of the most underrated pieces of open source software. If you don't know what it is, look at http://www.cis.upenn.edu/~bcpierce/unison/

  6. Me too! by G4from128k · · Score: 0, Redundant

    I too would like such a capability. We don't have terabytes of data, but my wife and I find it frustrating to co-create documents and manage who has which version on which machine while ensuring the portability of my wife's laptop and providing the speed of accessing files locally. Ideally, we would like all of our 12,000 shared files to be in at least two or three places at once (cached on my machine, cached on her laptop, and stored on a central file server).

    I'm envisioning some type of write-through file caching and distributed access control system that maintains near real-time synchronization between a local copy of a directory and an ostensibly identical copy of that directory on a remote server and any other machines that "share" that directory. I suspect that a relatively soft access control system would be OK in the sense that you could open your local copy of the file and propagate a lock afterward. Also, in the event of a network disconnect (e.g., using the laptop on an airplane), the local system would journal any changes to the cached/shared file set and transmit/reconcile those changes when the network was reconnected.

    BTW, being one of those silly Mac users, I want a system that is totally transparent without extra steps (like a CVS check-out/check-in process), nasty batch processes, etc. When I open a file or close a file, I expect the system to appropriately handle the ugly details of caching, propagating changes to other machines, alerting me that the file is in use by someone else, etc.

    --
    Two wrongs don't make a right, but three lefts do.
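
The "journal while disconnected, reconcile on reconnect" idea in the comment above can be sketched in a few lines of Python. Everything here is hypothetical: `ChangeJournal`, its fields, and the conflict rule (a path conflicts if the server copy changed after the disconnect) are assumptions made for illustration, not a description of any existing tool.

```python
from dataclasses import dataclass, field
import time


@dataclass
class ChangeJournal:
    disconnected_at: float              # when the network link was lost
    entries: list = field(default_factory=list)

    def record(self, path, action, when=None):
        """Append one local change, e.g. action = "write" or "delete"."""
        stamp = time.time() if when is None else when
        self.entries.append((stamp, path, action))

    def reconcile(self, remote_changed_at):
        """Split journaled paths into changes to push and conflicts.

        remote_changed_at maps path -> time the server copy last changed.
        """
        latest = {}
        for _, path, action in self.entries:
            latest[path] = action       # the last local action per path wins
        to_push, conflicts = {}, []
        for path, action in latest.items():
            if remote_changed_at.get(path, 0.0) > self.disconnected_at:
                conflicts.append(path)  # both sides changed while offline
            else:
                to_push[path] = action
        return to_push, conflicts
```

Conflicting paths would still need a human (or a merge policy) to resolve them, which is the hard part such a system has to get right.
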
  7. good work by cheezus · · Score: 0, Troll

    All the script kiddies out there thank you for profiling your system for them

    --
    /bin/fortune | slashdotsig.sh
  8. FolderShare by RandomCoil · · Score: 1

    FolderShare.com offers a small application that allows for various ways of sharing files between Windows systems. While it may not be sufficiently robust for your needs, it does a wonderful job of syncing my home and office files.

    For your situation, I would imagine that the server machines would run the FolderShare app, simply mirroring in more-or-less real time the lab2 data at lab1.

    RC

  9. Intermezzo might be a solution by narensankar · · Score: 3, Informative
    http://www.inter-mezzo.org/

    Similar to afs and coda suggested before, but with local caching to allow much higher performance. Also works in disconnected mode.

    1. Re:Intermezzo might be a solution by laursen · · Score: 1

      Neither Intermezzo nor Coda is production stable. Have a look at this previous post.

  10. SUN's CacheFS by Asmodeus · · Score: 1

    ..is probably a good place to start. It is a caching file system that keeps local copies of files from a backing NFS file system.

  11. Uhhh by Anonymous Coward · · Score: 0

    RSync.

    or

    Windows Directory Replication Service.

  12. or use rdiff-backup or cvsup by cornice · · Score: 1
    Or use rdiff-backup or cvsup.

    rdiff-backup is:
    rdiff-backup backs up one directory to another, possibly over a network. The target directory ends up a copy of the source directory, but extra reverse diffs are stored in a special subdirectory of that target directory, so you can still recover files lost some time ago. The idea is to combine the best features of a mirror and an incremental backup. rdiff-backup also preserves subdirectories, hard links, dev files, permissions, uid/gid ownership, and modification times. Also, rdiff-backup can operate in a bandwidth efficient manner over a pipe, like rsync. Thus you can use rdiff-backup and ssh to securely back a hard drive up to a remote location, and only the differences will be transmitted. Finally, rdiff-backup is easy to use and settings have sensical defaults.

    cvsup is:
    CVSup is a software package for distributing and updating collections of files across a network. It can efficiently and accurately mirror all types of files, including sources, binaries, hard links, symbolic links, and even device nodes. CVSup's streaming communication protocol and multithreaded architecture make it most likely the fastest mirroring tool in existence today.
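
The "reverse diff" idea in the rdiff-backup description (keep the newest copy whole, plus enough information to rebuild older versions from it) can be illustrated with Python's standard `difflib`. This is a line-based sketch of the concept only; `make_reverse_diff` and `apply_reverse_diff` are hypothetical names, and rdiff-backup's actual on-disk format is not shown here.

```python
from difflib import SequenceMatcher


def make_reverse_diff(old_lines, new_lines):
    """Describe how to rebuild old_lines from new_lines."""
    ops = []
    matcher = SequenceMatcher(a=new_lines, b=old_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # These lines can be copied from the current mirror.
            ops.append(("copy", i1, i2))
        elif j2 > j1:
            # Lines that exist only in the old version must be stored.
            ops.append(("insert", old_lines[j1:j2]))
        # Lines that exist only in the new version need no stored data.
    return ops


def apply_reverse_diff(new_lines, ops):
    """Rebuild the old version from the mirror plus the reverse diff."""
    old = []
    for op in ops:
        if op[0] == "copy":
            old.extend(new_lines[op[1]:op[2]])
        else:
            old.extend(op[1])
    return old
```

Only the lines that differ are stored alongside the mirror, which is why the scheme stays cheap when files change a little between backups.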

  13. AFS has been doing this for years by bolverk · · Score: 1

    There are servers and clients for tons of operating systems, including every one you mentioned.

  14. rsync by Anonymous Coward · · Score: 0

    rsync

  15. WebDAV + HTTP proxy server by plsuh · · Score: 1

    How about doing network file access via WebDAV, and placing a caching HTTP proxy server, set to work with only the specified domain, at each end? This caches a local copy of the data for quick reads, has good properties for wide-area networking, is cross-platform compatible, and can be configured with variable timeouts for different people. Writes may take a while, but for data consistency reasons going directly back to the home storage facility is probably a good thing. You can also easily limit the proxy cache to some fraction of the total space, e.g. 120 GB out of 180 GB in lab 2.

    For instance, user A normally works at lab 1 but sometimes works at lab 2 for a day or so. She can connect to a file server via webdav_fs using the URL http://lab1server.example.com/~A. The machines at lab 2 are configured so that access to domain lab1server.example.com is via the proxy, and is set to cache her data for 12 hours. The machines in lab 1 are set so that access to the domain lab1server.example.com does not go through the proxy, and thus get direct access.

    Users can still use scp/sftp for out of band access if they need to have data that persists longer than their normal caching period, or is going to be subject to lots of writes so that they want to manually control the writing process.

    --Paul
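
The two proxy settings described above, a per-entry timeout (12 hours in the example) and a cap on total cache size, can be sketched as a small in-memory cache in Python. `ExpiringCache` and its methods are hypothetical and stand in for whatever a real caching proxy does internally; callers pass the current time explicitly so the behaviour is easy to follow.

```python
from collections import OrderedDict


class ExpiringCache:
    def __init__(self, max_bytes, ttl_seconds):
        self.max_bytes = max_bytes
        self.ttl_seconds = ttl_seconds
        self._items = OrderedDict()     # url -> (stored_at, body)
        self._used = 0

    def get(self, url, now):
        entry = self._items.get(url)
        if entry is None:
            return None
        stored_at, body = entry
        if now - stored_at > self.ttl_seconds:
            self._remove(url)           # too old: force a fetch from the origin
            return None
        self._items.move_to_end(url)    # mark as recently used
        return body

    def put(self, url, body, now):
        if url in self._items:
            self._remove(url)
        if len(body) > self.max_bytes:
            return                      # larger than the whole cache: skip it
        while self._used + len(body) > self.max_bytes:
            self._remove(next(iter(self._items)))  # evict least recently used
        self._items[url] = (now, body)
        self._used += len(body)

    def _remove(self, url):
        _, body = self._items.pop(url)
        self._used -= len(body)
```

With `ttl_seconds=12 * 3600`, an entry is served locally for 12 hours after it was stored and then refetched, which matches the timeout in the example.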

  16. Intermezzo _might_ be a waste of time. by jefu · · Score: 1
    I spent some time working on making intermezzo work on my machines a few months back. Eventually gave it up as a very bad idea. I don't remember right now what was the last straw, but I did spend a week or so working on it.

    A shame too, it looked pretty good and like it could have quite a bit of promise.

    Building a reliable, easy to install, distributed filesystem that allows for disconnected operation, updates and similar kinds of things would be very, very useful. (Notice the recent post on using CVS to maintain a distributed home directory.)

  17. FTP by Anonymous Coward · · Score: 0

    Maybe I'm missing something, but what's wrong with plain ol' FTP?