Distributed Data Storage on a LAN?

Win2k by SuiteSisterMary · 2003-10-29 09:16 · Score: 4, Informative

I believe that Windows 2000's Distributed File System allows you to do just this.

--
Vintage computer games and RPG books available. Email me if you're interested.

Re:Win2k by Anonymous Coward · 2003-10-29 11:21 · Score: 1, Informative

The distributed feature would be quite worthless if there wasn't some synchronization taking place to make sure the data was synched across all servers in the DFS namespace.

DFS uses the File Replication Service (FRS) to ensure that all DFS replicas are synchronized. Clients connect to the closest available server (based on Active Directory Site information) and will automatically fall back to another server if one goes down.

It's actually very easy to configure. Just fire up the DFS admin tool and add a new share. When you add a second replica the admin tool will ask you if you want to synchronize the replicas. Just click yes and everything will be configured automatically. The same is true if you add more replicas.

rdist would work... by ZenShadow · 2003-10-29 09:17 · Score: 4, Informative

The obvious answer for this is nbd, as pointed out in another post -- but I would have concerns about speed with that kind of setup. I'd be interested in hearing reports on that.

But if you don't want to get into nbd, you can tolerate delayed writes to your virtualized disks, and all you want is the network equivalent of RAID level 1, then you could always just set up an rdist script that synchronizes your local data disk with a remote repository (or eight) every so often...

--ZS

--
-- sigs cause cancer.

InterMezzo by Anonymous Coward · 2003-10-29 09:18 · Score: 1, Informative

Sounds like Coda or InterMezzo would fit the bill, but they won't address non-linux systems directly. You'd have to export the InterMezzo file systems with Samba and mount them on the MS Win boxes.

Re:Intermezzo by laursen · 2003-10-29 10:12 · Score: 5, Informative

Intermezzo is designed for this and a bit more - if one of the machines is a laptop you can take it away and work on it, and it'll resync when you get back.
We have looked at various distributed filesystems for use in a clustered setup of webservers. We wanted to remove the single point of failure from a central NFS server - Intermezzo was one of the filesystems we had a look at.
The idea behind Intermezzo is fairly simple and the documentation is good. The Intermezzo system looked like an ideal solution for our setup (Coda and OpenAFS are far too complex for use in a distributed filesystem on a closed internal net).
We tested the system but sadly it's not really production stable and I can't advise that you use it.
If you are looking for a SAFE solution then Intermezzo is not for you - you will just end up with garbled data, deadlocks and tons of wasted time ...
My 2 cents.

Re:Intermezzo by laursen · 2003-10-29 10:21 · Score: 2, Informative

We bought a large StorageTek RAID (2 x RAID 5) and used NFS.

NFS is a proven filesystem and it has been tested for years. It's compatible with all major UNIX flavors and BSD/Linux systems.

AFS by Reeses · 2003-10-29 09:18 · Score: 4, Informative

It's called the Andrew File System.

http://www.psc.edu/general/filesys/afs/afs.html

There's another alternative with a different name, but I forget what it's called.

--
Reeses

Re:AFS by Strange+Ranger · 2003-10-29 10:24 · Score: 4, Informative

from karmak.org

AFS is based on a distributed file system originally developed under a different name in the mid-1980's at the Information Technology Center of Carnegie-Mellon University (CMU). It was first publicly described in a paper in 1985, and soon afterwards was renamed to the "Andrew File System" in honor of the patrons of CMU, Andrew Carnegie and Andrew Mellon. As interest in AFS grew, CMU spawned the Transarc Company to develop and market AFS. Once Transarc was formed and AFS became a product, the "Andrew" was dropped to indicate that AFS had gone beyond the Andrew research project and had become a supported, product quality filesystem. However, there were a number of existing cells that rooted their filesystem as /afs. At the time, changing the root of the filesystem was a non-trivial undertaking. So, to save the early AFS sites from having to rename their filesystem, AFS remained as the name and filesystem root. In the late 1990's Transarc was acquired by IBM, who subsequently re-released AFS under an open source license. This code became the foundation for OpenAFS, which is currently under active development.
It's still running and running well at CMU (AFAIK - as of late 90's). Every student gets an "Andrew" ID. Actually the very first networked computer I ever logged into (other than dialing a bbs) was a 'node' on Andrew, in 1988. Very very cool at the time, and still is.

--

Operator, give me the number for 911!

Re:AFS by Umrick · 2003-10-29 10:31 · Score: 3, Informative

Never mind that AFS has been in production for literally years, serving terabytes of data for 10 thousand + clients (in several installations of AFS).

The Windows client did have some notable slowness issues, but performance with Linux is excellent, and it scales much better than NFS. Clients are available for a large number of OSs. The machines do need to agree on the time. Doesn't matter if it's the right time, just A time. So set up NTP on one machine as a primary, and the others can use ntpdate to set time once a day.

AFS started around 1986 as a commercial offering, IBM made it open source in 2001. It can be a serious pain to set up at first, documents are indeed very outdated. Other limitations are no support for >2gig files. You can have readonly duplicates of data on multiple machines. Administration can be a dream once it's running.

You will need to have ext2 partitions available for storage (OpenAFS uses its own transaction system, and you WILL have race conditions if you put it on a journalling filesystem).

Also note that as of right now, 2.6 kernels are not supported, though 2.4/2.2 are fine.

www.openafs.org

Coda, which was a start at an open source answer to AFS way back when, has even more out-of-date documentation, has never been used in production (that I know of), and basically is not nearly as ready for prime time as OpenAFS.

www.coda.org

Re:NBD Does this - NBD server for windows by flok · 2003-10-29 09:19 · Score: 5, Informative

And since the guy is also using windows-boxes, an NBD-server for windows can be found here:
http://www.vanheusden.com/Loose/nbdsrvr/
This version enables you to also export partitions/disks.

--

www.vanheusden.com - home of Multitail, HTTPing, CoffeeSaint, EntropyBroker, rsstail, bsod, listener, nagcon, nagi

Intermezzo by mikeee · 2003-10-29 09:19 · Score: 5, Informative

Intermezzo is designed for this and a bit more - if one of the machines is a laptop you can take it away and work on it, and it'll resync when you get back.

It isn't particularly high-performance, from what I know, and may be more complexity than you need.

Speed would be an issue... by Trolling4Dollars · 2003-10-29 09:21 · Score: 4, Informative

I imagine you'll need gigabit ethernet or multiple NICs in bonded mode. Then you have the performance of each individual system to take into account. Especially if one of the systems is heavily used. I would recommend getting one BIG HONKIN' SERVER and putting it in a central location. Give it gigabit and let everything else connect to it at 100. Then, make sure it has a hardware RAID controller. Use SAMBA for the cross platform connectivity you desire, and voilà! protected data with redundancy and high speed performance. If you go with remote display (RDP with Windows Terminal Server or X with *nix) then you have an even better approach as all the data will exist on the secure RAID box.

I get what you mean though... it's a nice idea, but it would be costly to implement vs. what I suggested above.

When I went to see a presentation on HP's SAN solutions last year, I was very impressed with the ideas they had. One big hardware box with multiple disks that are controlled by the hardware. They are then presented to any systems over a fiber link as any number of drives you wish for any OS. Finally, their "snapshot" ability was pretty impressive. (Also called Business Copy) All they would do is quiesce the data bus, then create a bunch of pointers to the original data. As data is altered on the "copy" (just the pointers, not a real copy), the real data is then copied to the "copy" with changes put in place. I imagine something similar could be accomplished with CVS...

--
Un-news

Re:Speed would be an issue... by LookSharp · 2003-10-29 10:19 · Score: 2, Informative

...as much as I dislike replying to T4D, he brings up an interesting scenario to counter your suggestion of using multiple machines.

I took a spare machine, added a 3ware 6800 ATA RAID controller ($130 on eBay), and installed eight 120GB Maxtor hard drives ($1200 when I bought them last year) and put them in eight Genica hot-swap trays ($60). For about $1500, I now have an 800GB formatted RAID5 array. (Had to throw in a dedicated 400W Antec power supply for HDs.) In a year, two of the drives have flunked, and the replacement drives have rebuilt beautifully.

If you're going to lose the site, you're going to lose your data in either case. All you protect against with the network situation is the complete loss of one machine. Protect your server as much as possible and put your data on it.

Just make sure you keep the "most precious" data offsite on tape or a sneaker-net external hard drive, in case the pop-tart that got stuck in your toaster burns down your house. (This apparently happens about 30 times a year, by the way, including one of my co-workers :)

Distributed Network Block Device by JumboMessiah · 2003-10-29 09:22 · Score: 2, Informative

A perfect solution would be a form of network block device that mounts distributed NBD shares. The Linux DRBD Project has this capability. From their website, "You could see it as a network raid-1".

Try Rsync or DRBD by oscarm · 2003-10-29 09:23 · Score: 4, Informative

see http://drbd.cubit.at/ DRBD is described as RAID1 over a network.

"Drbd takes over the data, writes it to the local disk and sends it to the other host. On the other host, it takes it to the disk there."

Rsync with a cron script would work too. I think there is a recipe in the linux hacks books to do something like what you are looking for: #292.

Venti needs a mention by DrSkwid · 2003-10-29 09:24 · Score: 3, Informative

http://plan9.bell-labs.com/sys/doc/venti/venti.html

Abstract

This paper describes a network storage system, called Venti, intended for archival data. In this system, a unique hash of a block's contents acts as the block identifier for read and write operations. This approach enforces a write-once policy, preventing accidental or malicious destruction of data. In addition, duplicate copies of a block can be coalesced, reducing the consumption of storage and simplifying the implementation of clients. Venti is a building block for constructing a variety of storage applications such as logical backup, physical backup, and snapshot file systems.
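
The core idea in that abstract (a block's hash is its address, blocks are written once, duplicates collapse into one copy) can be sketched in a few lines of shell. This is only a toy illustration and not Venti's actual interface: the function names `block_hash`, `store_put`, and `store_get` and the one-file-per-block directory layout are made up here, and SHA-1 is assumed as the hash.

```shell
#!/bin/sh
# Toy content-addressed block store. All names here are hypothetical.

# block_hash FILE: print the SHA-1 of FILE's contents.
# Uses sha1sum where available, otherwise falls back to shasum.
block_hash() {
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum < "$1" | cut -d' ' -f1
    else
        shasum -a 1 < "$1" | cut -d' ' -f1
    fi
}

# store_put STORE FILE: save FILE under its hash and print that hash.
store_put() {
    [ -f "$2" ] || return 1
    mkdir -p "$1" || return 1
    h=$(block_hash "$2") || return 1
    # Write-once: an existing block is never overwritten, so identical
    # content is kept a single time however often it is stored.
    [ -e "$1/$h" ] || cp "$2" "$1/$h" || return 1
    echo "$h"
}

# store_get STORE HASH: print the contents of the block named HASH.
store_get() {
    cat "$1/$2"
}
```

Storing the same content twice prints the same identifier both times and leaves one file in the store directory, which is the coalescing the abstract mentions.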

--
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter

hyper scsi by blaze-x · 2003-10-29 09:26 · Score: 2, Informative

from the website:

HyperSCSI is a networking protocol designed for the transmission of SCSI commands and data across a network. To put this in "ordinary" terms, it can allow one to connect to and use SCSI and SCSI-based devices (like IDE, USB, Fibre Channel) over a network as if it was directly attached locally.

http://nst.dsi.a-star.edu.sg/mcsa/hyperscsi/

Re:Standard Linux kernel maybe? by backtick · 2003-10-29 09:26 · Score: 3, Informative

NBD *is* standard Linux kernel. It's built right in: /usr/src/linux-2.4/Documentation/nbd.txt

If you're curious about using the enhanced NBD w/ failover and HA, you can read about it at:

http://www.it.uc3m.es/~ptb/nbd/#How_to_make_ENBD_work_with_heartbeat

Rsync and Ssh by PureFiction · 2003-10-29 09:32 · Score: 4, Informative

This is the way I do it, and although a little clunky, it allows me to keep remote backups of certain directories on three different servers.

First, set up ssh to use pubkey authentication instead of interactive password. You can read the man pages for details but it basically boils down to running keygen on the trusted source:

ssh-keygen -b 2048 -t rsa -f ~/.ssh/identity

Then copy|append the newly created ~/.ssh/identity.pub to the remote hosts into their /home/user/.ssh/authorized_keys file.

Now you can run rsync with ssh as the transport (instead of rsh) by exporting:

export RSYNC_RSH=ssh

or by passing --rsh=ssh on the command line.

So to sync directories you could use a find command to update regularly:

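# Without a marker file, find fails and nothing ever syncs; create an old one first.
[ -e .last-sync ] || touch -t 200001010000 .last-sync
while true; do
    # grep -q . succeeds only if find printed at least one changed path
    if find . -follow -cnewer .last-sync | grep -q .; then
        # Only move the marker forward when the transfer actually worked
        rsync -rz --delete . destination:/some/path/ && touch .last-sync
    fi
    sleep 60
done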

Obviously this is pretty hackish and could be improved. But the point is that with ssh and rsync you could do automatic mirroring of specific filesystems or directories to remote locations securely.

Re:Rsync and Ssh by adamfranco · 2003-10-29 10:52 · Score: 4, Informative

Here is a nice page that explains how to do this. Even better, it shows how to do nice incremental backups using only slightly more space than the source (for the differing file versions). This makes for a pretty cheap and easy backup solution.
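
The linked page is not reproduced here, so the following is a guess at the general technique rather than a summary of it. One common way to get dated snapshots that cost little more than the changed files is rsync's --link-dest option, which hard-links unchanged files against the previous snapshot. The sketch only prints the command it would run; the function name `snapshot_cmd`, the host, and the paths are all hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: print (do not run) an rsync command that makes a new dated
# snapshot, hard-linking unchanged files to the previous one.
# snapshot_cmd SRC DEST_ROOT NEW PREV
snapshot_cmd() {
    # A relative --link-dest is resolved against the destination directory,
    # so ../PREV points at the sibling snapshot under DEST_ROOT.
    echo "rsync -a --delete --link-dest=../$4 $1/ $2/$3/"
}

snapshot_cmd /home/me backuphost:/backups 2003-10-30 2003-10-29
```

Each snapshot directory then looks like a full copy, and deleting an old one frees only the files no other snapshot still links to. Paths containing spaces would need quoting before the printed line could be run as-is.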

--
"When ideology and theology couple, their offspring are not always bad but they are always blind." -- Bill Moyers

Re:Rsync and Ssh by strudeau · 2003-10-29 10:58 · Score: 2, Informative

the original poster I think wants something that also works in Windows.

Rsync and ssh can work with Windows using Cygwin. See this document for example.

Unison? by Anonymous Coward · 2003-10-29 09:33 · Score: 1, Informative

Not yet seen reference to unison:

http://www.cis.upenn.edu/~bcpierce/unison/

They say: "Unison is a file-synchronization tool for Unix and Windows. (It also works on OSX to some extent, but it does not yet deal with 'resource forks' correctly; more information on OSX usage can be found on the unison-users mailing list archives.) It allows two replicas of a collection of files and directories to be stored on different hosts (or different disks on the same host), modified separately, and then brought up to date by propagating the changes in each replica to the other."

Re:NBD Does this by dbarclay10 · 2003-10-29 09:37 · Score: 5, Informative

Just to clarify what this guy is saying:

1) Make all your machines NBD servers. NBD for Linux, NBD for Windows. NBD stands for "network block device" and allows a client to use a server's block device.
2) Set up a master client/server (using Linux or something else with a decent software RAID stack). This machine will be the only NBD *client*, and it will use all the NBD block devices exported by the rest of your network.
3) On the master set up in 2), create a Linux MD RAID array overtop all the NBD devices that are available.
4) Create a filesystem on the brand-spanking-new multi-machine RAID array.
5) Export it back to the other machines via Samba or NFS or AFS or what have you.

Why does only one machine (the "master server") access the NBD devices, you ask? Because for a given block device, there can only be one client accessing it safely. Thus, if you want to make the RAID array available to anything other than the machine which is *running* the array off the NBD devices, you need to use something which allows concurrent access; something like NFS, Samba, or AFS.
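
Steps 2 through 4 above might look roughly like this on the master. It is a dry run that only prints commands, with assumptions stated plainly: the host names and port are invented, the positional `host port device` form of nbd-client is the older style (newer releases favour named exports), and ext3 is an arbitrary filesystem choice. The helper name `plan_nbd_raid` is made up for this sketch.

```shell
#!/bin/sh
# Dry-run sketch: print the commands the master would run to build one RAID
# array over NBD exports. Nothing is executed.
# plan_nbd_raid LEVEL PORT HOST...
plan_nbd_raid() {
    level=$1
    port=$2
    shift 2
    i=0
    devices=""
    for host in "$@"; do
        # Step 2: attach each server's export as a local block device.
        echo "nbd-client $host $port /dev/nbd$i"
        devices="$devices /dev/nbd$i"
        i=$((i + 1))
    done
    # Step 3: one MD array across every attached NBD device.
    echo "mdadm --create /dev/md0 --level=$level --raid-devices=$i$devices"
    # Step 4: a filesystem on top of the array.
    echo "mkfs.ext3 /dev/md0"
    # Step 5 (Samba/NFS export) is ordinary server configuration, not shown.
}

plan_nbd_raid 5 2000 alpha beta gamma
```

Printing the plan first makes it easy to check the device ordering before anything touches a disk; the order of the NBD devices on the mdadm line is what the array will record.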

Hope that clears it up a bit.

--

Barclay family motto:
Aut agere aut mori.
(Either action or death.)

Re:You aren't gonna get a real RAID. by Cranston+Snord · 2003-10-29 09:46 · Score: 4, Informative

Instead of xcopy, try RoboCopy, included in the windows NT/2k/xp/2k3 resource kit available here. It gives you almost as much control as rsync, including directory synchronization, touch control, ageing, network failure support, and others. I use this at work to move around copies of live production data to backup servers located offsite via vpn without any issues. More information on syntax can be found here.

--
And now for something completely different...a man with three buttocks.

Parallel Virtual File System by richoid · 2003-10-29 09:51 · Score: 4, Informative

http://www.parl.clemson.edu/pvfs/

"The goal of the Parallel Virtual File System (PVFS) Project is to explore the design, implementation, and uses of parallel I/O. PVFS serves as both a platform for parallel I/O research as well as a production file system for the cluster computing community. PVFS is currently targeted at clusters of workstations, or Beowulfs."

"In order to provide high-performance access to data stored on the file system by many clients, PVFS spreads data out across multiple cluster nodes, which we call I/O nodes. By spreading data across multiple I/O nodes, applications have multiple paths to data through the network and multiple disks on which data is stored. This eliminates single bottlenecks in the I/O path and thus increases the total potential bandwidth for multiple clients, or aggregate bandwidth."

Or there are many others to choose from, google for clustered filesystems:

http://www.yolinux.com/TUTORIALS/LinuxClustersAndFileSystems.html

Slow? by cerebralsugar · 2003-10-29 09:54 · Score: 2, Informative

I certainly would attest that this is a cool idea. I have a few systems at my place and it would be neat to make a single filesystem spanning all the storage on the network.

However, while small files would be fine, I would think the speed of the network would make for some fairly slow storage on a 100mbit network.

Add more users saving files across the network to the equation and things would get out of hand fast.

I guess I would just buy a serial ata raid motherboard (the intel D865GBFLK is one I have been thinking about), and just do 1:1 mirroring. Cheaper than scsi, and pretty darn fast.

--
Easy guys, I put my pants on one leg at a time. The difference is after I put on my pants I make gold records!

Raid != Backup by Alan · 2003-10-29 09:55 · Score: 2, Informative

Don't forget that RAID only protects you from hardware failures, it doesn't prevent you from doing an "rm -rf important_file" :)

Personally I have a server with a RAID 5 array that is shared via SAMBA to windows and linux clients, which works fine, though I may adjust this if good suggestions are made here. The only real issue would be disk space, and all my computers now have 120G+ hard drives or RAID array....

Or try Groove workspace for Windows by AllDigital · 2003-10-29 09:58 · Score: 2, Informative

Groove workspace is a collaborative environment, but it does have a component that allows you to share an archive of files.

Worth considering because:
- Files are encrypted and sent in an encrypted format.
- Files placed in the shared space are mirrored on all systems that are members of the workspace.
- The software is free for non-commercial use.
- Lots of other interesting features to play with.
- You can even mirror with a machine across the Internet.

Limited by:
- The speed of your connection.
- Windows users only.

Go check it out at http://groove.net/

Does anyone know if there are efforts in the open source community similar to...or designed to enhance this product?

DRBD does it as well... by Ron+Harwood · 2003-10-29 09:58 · Score: 2, Informative

Obvious link.

--
BlackNova Traders

Re:Most common form of data loss? by steveha · 2003-10-29 10:18 · Score: 3, Informative

0) Mirroring (RAID 1) takes double the disk space; but you could use RAID 5 instead. A 4 disk RAID 5 would take 4/3 as much disk space as you get to use.

1) You could make a partition that is 10% of your disk, make another identical one on another disk, and mirror those. Then put your 10% critical data in there.

2) Do what I do: set up a RAID server, and keep all critical data on that. This is good if you have a home network with multiple computers. It also makes data sharing easy among the computers.
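
The arithmetic behind point 0 is easy to check. A minimal sketch, assuming equal-sized disks and ignoring filesystem overhead; the function name `raid_usable` is made up for illustration.

```shell
#!/bin/sh
# raid_usable LEVEL DISKS SIZE_GB: print usable capacity in GB.
raid_usable() {
    case $1 in
        1) echo "$3" ;;                   # mirror: one disk's worth is usable
        5) echo $(( ($2 - 1) * $3 )) ;;   # one disk's worth goes to parity
        *) echo "unsupported RAID level: $1" >&2; return 1 ;;
    esac
}

raid_usable 1 2 120   # prints 120: 240 GB bought, so double the space
raid_usable 5 4 120   # prints 360: 480 GB bought, so 4/3 the space
```

The same formula gives 840 GB for eight 120GB disks in RAID 5, before any filesystem overhead is taken off.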

steveha

--
lf(1): it's like ls(1) but sorts filenames by extension, tersely

Re:NBD Does this by caluml · 2003-10-29 10:23 · Score: 2, Informative

Hmm. How stable is it? From /usr/src/linux/Documentation/nbd.txt:

Note: Network Block Device is now experimental, which approximately
means, that it works on my computer, and it worked on one of school
computers.

That doesn't sound very promising to me. Usually stuff that's been in the kernel since 2.1 days is rock solid.

Isn't AFS/Coda more like what the guy wants (excluding Windows-ability, although I seem to remember there being something for Andrew for Windows)?

--
Get your own free personal location tracker

Yes. by Ayanami+Rei · 2003-10-29 10:38 · Score: 2, Informative

Software RAID/LVM can detect which volumes go where by magic numbers written to them when you format them. But you still have to set up all the remote NBDs correctly on a new machine, and you need the old setup file from the old machine that tells it what block devices/partitions to use.

NOTE!

You shouldn't leave any NBD-exported volumes on the new master. Make it into a physical, local volume, but reference it in the "same place" in your RAID configuration.

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON

Re:You aren't gonna get a real RAID. by steveha · 2003-10-29 10:46 · Score: 2, Informative

No need for an "honest-to-dog hardware RAID". Linux software RAID is simply great.

Set up a server with multiple hard disks in a Linux software RAID, and run Samba and NFS on that. The Linux software RAID HOWTO explains all you need to know.

steveha

--
lf(1): it's like ls(1) but sorts filenames by extension, tersely

Re:You aren't gonna get a real RAID. by dbarclay10 · 2003-10-29 11:00 · Score: 2, Informative

First off, you aren't going to be able to use this like a real RAID array (a drive can die and you keep on working). The latency and bandwidth of any network that could be reasonably implemented in your home is going to prevent your system from acting like a real RAID array.

I'm currently running some benchmarks on an XFS filesystem built upon a Linux MD RAID1 array, which is in turn built upon a local disk and a remote disk (which is at the end of a switched 100mbit network, the NBD server itself having an 8-year-old drive and a controller which doesn't do DMA).

[ dbharris@willow: ~/ ]$ cat /proc/mdstat
Personalities : [raid0] [raid1]
md1 : active raid1 nbd0[1] dm-5[0]
1888192 blocks [2/2] [UU]

It takes approximately 10 minutes for a 1.8G array to sync. That's respectable. It's not blazing fast, but it's respectable.
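
For a rough sense of what that sync time means, here is the division spelled out. It assumes the block count in /proc/mdstat is in 1 KiB units and takes the 10 minutes at face value, so treat the result as a ballpark. Both function names are made up for this sketch.

```shell
#!/bin/sh
# sync_rate_kib BLOCKS_1K SECONDS: average resync rate in KiB/s (integer).
sync_rate_kib() {
    echo $(( $1 / $2 ))
}

# link_rate_kib MBIT: raw line rate of a link in KiB/s, ignoring protocol overhead.
link_rate_kib() {
    echo $(( $1 * 1000000 / 8 / 1024 ))
}

sync_rate_kib 1888192 600   # prints 3146, about 3 MB/s
link_rate_kib 100           # prints 12207, the ceiling for 100mbit
```

So the resync ran at roughly a quarter of what the wire could carry in theory, which fits the description of the slow remote disk and controller.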

The bonnie++ scores are:

willow,1G,5086,31,4766,2,2873,1,6377,27,8655,2,158.7,1,16,878,18,+++++,+++,766,14,880,18,+++++,+++,595,13

Which isn't amazing, but quite respectable, especially given that this type of thing wouldn't be used for mass storage of ISOs or whatever, but used for people's "My Documents" folders and their $HOMEs. Notable that a fully local array I have which is made up with an old SCSI controller and some old SCSI disks is about half this speed as far as the filesystem goes, and about a tenth the speed as far as syncing goes.

So, I believe that your assertion of "you aren't going to be able to use this like a real RAID array" is quite incorrect. Especially given that my network isn't particularly fast, my NICs aren't particularly fast, and the remote disk I'm using is dog slow. Replace the NICs with parts that aren't pieces of crap, use Gig-E, and use controllers/drives that aren't 7-8 years old, and you'll get very respectable performance - ESPECIALLY given that the intention isn't to store everything on it, just people's individual files.

P.S.: Yes. I'm repeating myself. I know this. It's deliberate :)

--

Barclay family motto:
Aut agere aut mori.
(Either action or death.)

Check out HiveCache by Jim+McCoy · 2003-10-29 11:17 · Score: 2, Informative

HiveCache is a distributed RAID system similar to what you are asking for, albeit one that is pitched to more of the enterprise backup environment than the home user. Strong security, error-correction and data replication, and multi-source data publication and retrieval to eliminate the network hotspots that might otherwise occur.

While a pure linux solution seems to score the most points here, this particular one lets you combine your windows, OS X, and linux systems into a single distributed storage mesh. There is safety in numbers, and the more systems you can add to these sort of distributed storage systems the more reliable they become.

HiveCache is more of a backup solution, but I do know that it is possible to use this with a WebDAV front-end for archival storage and other interesting storage possibilities.

Re:Most common form of data loss? by angst_ridden_hipster · 2003-10-29 11:26 · Score: 4, Informative

As I always chime in at this point:

Use rdiff-backup!

http://rdiff-backup.stanford.edu/

Configurable, secure, distributed, versioning incremental backups.

It's not a replacement for RAID, but is good for nightly inter-machine backups.

There's also a related project where the far-end repository is encrypted, so you can have it on any public server without fear of having your data read by the wrong people.

Very cool. It's saved my ass a few times.

--
Eloi, Eloi, lema sabachtani?
www.fogbound.net

Rsync & Rdiff-backup by hrath · 2003-10-29 11:47 · Score: 2, Informative

Check out http://rdiff-backup.stanford.edu/ for the wonderful rdiff-backup.

With the combination of rsync, ssh & rdiff-backup I have setup a very reliable incremental network backup infrastructure, allowing me to go back to any previous version of a file.

regards,

Heiko

HyperSCSI by Nicson · 2003-10-29 13:24 · Score: 2, Informative

I'm surprised to see nobody has yet mentioned HyperSCSI, which is:
- opensource
- based on raw ethernet (supposedly faster than iSCSI or other TCP/IP-based schemes)
- has a Win2K client

Check it out, I've tested and used it for about a year and it works quite well!

--
Nicson

Distributed Internet Backup System by trawg · 2003-10-30 00:35 · Score: 2, Informative

not really relevant, but may still be of interest to some (just sounds so neat): "Since disk drives are cheap, backup should be cheap too. Of course it does not help to mirror your data by adding more disks to your own computer because a fire, flood, power surge, etc. could still wipe out your local data center. Instead, you should give your files to peers (and in return store their files) so that if a catastrophe strikes your area, you can recover data from surviving peers. The Distributed Internet Backup System (DIBS) is designed to implement this vision. "

http://www.csua.berkeley.edu/~emin/source_code/dibs/

EtherDrive Storage by web_guy1000 · 2003-10-30 02:56 · Score: 2, Informative

You might consider EtherDrive storage from www.coraid.com. I use it on Linux with software raid. Works like a champ.
