Distributed Data Storage on a LAN?
AgentSmith2 asks: "I have 8 computers at my house on a LAN. I make backups of important files, but not very often. If I could create a virtual RAID by storing data on multiple disks on my network I could protect myself from the most common form on data failure - a disk crash. I am looking for a solution that will let me mount the distributed storage as a shared drive on my Windows and Linux computers. Then when data is written, it is redundantly stored on all the machines that I have designated as my virtual RAID. And if I loose one of the disks that comprise the raid, the image would automatically reconstruct itself when I add a replacement system to the virtual RAID. Basically, I'm looking to emulate the features of hi-end RAIDS, but with multiple PCs instead of multiple disks within a single RAID subsystem. Is there any existing technologies that will let me do this?"
http://nbd.sourceforge.net/
"Network Block Device (TCP version)
What is it: With this thing compiled into your kernel, Linux can use a remote server as one of its block devices. Every time the client computer wants to read /dev/nd0, it will send a request to the server via TCP, which will reply with the data requested. This can be used for stations with low disk space (or even diskless - if you boot from floppy) to borrow disk space from other computers. Unlike NFS, it is possible to put any file system on it. But (also unlike NFS), if someone has mounted NBD read/write, you must assure that no one else will have it mounted.
Limitations:It is impossible to use NBD as root file system, as an user-land program is required to start (but you could get away with initrd; I never tried that). (Patches to change this are welcome.) It also allows you to run read-only block-device in user-land (making server and client physically the same computer, communicating using loopback). Please notice that read-write nbd with client and server on the same machine is bad idea: expect deadlock within seconds (this may vary between kernel versions, maybe on one sunny day it will be even safe?). More generally, it is bad idea to create loop in 'rw mounts graph'. I.e., if machineA is using device from machineB readwrite, it is bad idea to use device on machineB from machineA.
Read-write nbd with client and server on some machine has rather fundamental problem: when system is short of memory, it tries to write back dirty page. So nbd client asks nbd server to write back data, but as nbd-server is userland process, it may require memory to fullfill the request. That way lies the deadlock.
Current state: It currently works. Network block device seems to be pretty stable. I originaly thought that it is impossible to swap over TCP. It turned out not to be true - swapping over TCP now works and seems to be deadlock-free.
If you want swapping to work, first make nbd working. (You'll have to mkswap on server; mkswap tries to fsync which will fail.) Now, you have version which mostly works. Ask me for kreclaimd if you see deadlocks.
Network block device has been included into standard (Linus') kernel tree in 2.1.101.
I've successfully ran raid5 and md over nbd. (Pretty recent version is required to do so, however.) "
it's called rsync
vodka, straight up, thank you!
Does anyone else find it funny that this was modded redundant?
RAID != Backups.
If you don't understand why, just put your Packard Bell back in the box and ship it back.
Tell them you're too stupid to own a computer.
I believe that Windows 2000's Distributed File System allows you to do just this.
AFAIK, rsync is not really suitable for a realtime scenario. A nbd raid-5 device would be virtually realtime, no?
I've been looking into this too. Most workstations today have large harddisks (40GB+) while on a network maybe 2-4 GB is used... Any windows software out there?
Kind of like a Beowulf of hard-discs then?
When anger rises, think of the consequences.
Confucius (551 BC - 479 BC)
The obvious answer for this is nbd, as pointed out in another post -- but I would have concerns about speed with that kind of setup. I'd be interested in hearing reports on that.
But if you don't want to get into nbd, you can tolerate delayed writes to your virtualized disks, and all you want is the network equivalent of RAID level 1, then you could always just set up an rdist script that synchronizes your local data disk with a remote repository (or eight) every so often...
--ZS
-- sigs cause cancer.
Please use the alternatives then, support is so much better and stability and security are 'features' found in abundance :)
I fail to see why this was allowed to post to the front page. How many fricking times do we have to tell you retards, it's LOSE, NOT LOOSE?!?!
Perhaps multiple files over different networking protocols (SMB for Windows machines, NFS for the Linux machines) mapped to built-in loopback devices (/dev/loX) accessed through built-in md utilizing software RAID5? Heh. It might not be pretty or fast, but it would probably work just fine. It may just give the kernel absolute fits though.
Anyone tried this?
just nfs mount the disks and use a backup utility to backup across the network nightly.
Sounds like Coda or InterMezzo would fit the bill, but they won't address non-linux systems directly. You'd have to export the InterMezzo file systems with Samba and mount them on the MS Win boxes.
It's called the Andrew File System.
http://www.psc.edu/general/filesys/afs/afs.html
There's another alternative with a different name, but I forget what it's called.
Reeses
what has occurred?
And since the guy is also using windows-boxes, an NBD-server for windows can be found here:
http://www.vanheusden.com/Loose/nbdsrvr/
This version enables you to also export partitions/disks.
I have 8 computers at my house on a LAN. I make backups of important files, but not very often
I mean, let's be honest here. We are all dorks, but this guy is king dorkus dweedius maximus. Don't fool yourself about the "important data" - it is just pr0n and pirated MP3s.
If it was real work, there would be a real IT guy with real RAID and real backup tapes working on the problem. But we know it isn't real work, because if this guy had a real IT job, he couldn't stand coming home and dealing with 8 friggin computers.
We realize you think you are cool because you have a few KVMs, a couple of Linksys routers, and a bunch of old PIIs running Lunix with one Windows machine, but come on, man. Stop spanking yourself over your elite NAT-ed network and just get one computer with hardware RAID. Install Cygwin if you feel the need to type configure && make && make install a whole bunch of times and watch the pretty text lines scroll.
I'd argue the point that the most common form of data loss is a crashed hard disk.
In my 14 years as a Network Administrator I think I've restored backups due to failed hard disks about twice (RAID catches the rest).
But I restore data accidentally deleted or changed by a user at least weekly! A distributed storage system won't help you there.
However, I will grant that the average /. user knows what they're doing with their data far more than my average user does and is less likely to cause self-inflicted damage.
Intermezzo is designed for this and a bit more - if one of the machines is a laptop you can take it away and work on it, and it'll resync when you get back.
It isn't particularly high-performance, from what I know, and may be more complexity than you need.
Red Hat has very good software RAID that is easy to set up with only two disks. Of course, with only two disks they are mirrored. But it is very easy to set up a cron entry that emails you the status of that mirror every day.
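A status check along those lines could be sketched as below. It assumes Linux software RAID (md), where /proc/mdstat shows a missing mirror half as an underscore in the status brackets (for example [U_] instead of [UU]). The function name, the script path and the mail address are placeholders I made up, not anything from the post above.

```shell
#!/bin/bash
# Sketch: report whether any md array in an mdstat-style file is degraded.
# md_status, md-report.sh and the mail address are hypothetical names.

md_status() {
    # $1: file to inspect (defaults to /proc/mdstat)
    local mdstat="${1:-/proc/mdstat}"
    # Status brackets look like [UU]; an underscore means a member is missing.
    if grep -Eq '\[[U_]*_[U_]*\]' "$mdstat"; then
        echo "DEGRADED"
    else
        echo "OK"
    fi
}

# Example crontab entry (assumes a working local `mail` command):
#   0 7 * * * /usr/local/bin/md-report.sh
# where md-report.sh sources this file and runs:
#   { md_status; cat /proc/mdstat; } | mail -s "RAID status" you@example.com
```

Mailing the full mdstat text along with the one-word summary keeps the report useful even if the pattern above misses some state.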
I hope you're looking at some fast lines to put between those boxen. Even at 100Mb/sec, doing RAID across a LAN could get slow.
I'm against picketing, but I don't know how to show it.
I have often wanted the same thing, kind of like RAID on files, call it RARF (Redundant Array of Remote Files). I was thinking along the line of a device driver that presents an ATA/IDE interface to the file system on one side and passes the requests to multiple copies of virtual disks. The virtual disks would be like VMWare disks, and potentially each on a different machine/location. Each virtual disk could even be encrypted differently.
This would be really useful for SOHO type places to allow me to have a hot offsite backup at multiple friends (and vice versa).
I haven't checked into it much, but I remembered the DIBS (Distributed Internet Backup System -- Slashdot article here). I would imagine that it could be modified (maybe not trivially) to support real-time disk operations, since it is open-source. However, although I don't know much about Python, I have a feeling this may suffer in performance from being written in a (semi-)interpreted language. Python lovers want to flame me for incriminating their programming language?
Hmmmm, what happens if your house catches fire ?
8 copies of the same document all nicely toasted!
As opposed to a tight one?
I imagine you'll need gigabit ethernet or multiple NICs in bonded mode. Then you have the performance of each individual system to take into account, especially if one of the systems is heavily used. I would recommend getting one BIG HONKIN' SERVER and putting it in a central location. Give it gigabit and let everything else connect to it at 100. Then, make sure it has a hardware RAID controller. Use SAMBA for the cross platform connectivity you desire, and voilà! Protected data with redundancy and high speed performance. If you go with remote display (RDP with Windows Terminal Server or X with *nix) then you have an even better approach as all the data will exist on the secure RAID box.
I get what you mean though... it's a nice idea, but it would be costly to implement vs. what I suggested above.
When I went to see a presentation on HP's SAN solutions last year, I was very impressed with the ideas they had. One big hardware box with multiple disks that are controlled by the hardware. They are then presented to any systems over a fiber link as any number of drives you wish for any OS. Finally, their "snapshot" ability was pretty impressive. (Also called Business Copy) All they would do is quiesce the data bus, then create a bunch of pointers to the original data. As data is altered on the "copy" (just the pointers, not a real copy), the real data is then copied to the "copy" with changes put in place. I imagine something similar could be accomplished with CVS...
Un-news
that should have read: MSI > OSS
A perfect solution would be a form of network block device that mounts distributed NBD shares. The Linux DRBD Project has this capability. From their website, "You could see it as a network raid-1".
...I could protect myself from the most common form on data failure - a disk crash.
In my experience, the most common form of data loss is not hardware failure, but user error. RAID is great for protecting against hardware failure, but be sure to still make backups to protect against accidental deletion.
47% of all statistics are made up on the spot.
Samba RAID0.
What you are asking for sounds pretty damn complicated. My home has about 10 machines in it, and I just use Samba on two mirrored disks for network storage.
Hey, but it's a free world. Feel free to ratchet up the technology till you bleed....
see http://drbd.cubit.at/ DRBD is described as RAID1 over a network.
Rsync with a cron script would work too. I think there is a recipe in the linux hacks books to do something like what you are looking for: #292.
http://plan9.bell-labs.com/sys/doc/venti/venti.
Abstract
This paper describes a network storage system, called Venti, intended for archival data. In this system, a unique hash of a block's contents acts as the block identifier for read and write operations. This approach enforces a write-once policy, preventing accidental or malicious destruction of data. In addition, duplicate copies of a block can be coalesced, reducing the consumption of storage and simplifying the implementation of clients. Venti is a building block for constructing a variety of storage applications such as logical backup, physical backup, and snapshot file systems.
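To make the idea in that abstract concrete, here is a toy content-addressed store in shell. It is only a sketch of the principle (hash of the contents as the block name, write-once, duplicates coalesced), not Venti's actual protocol or on-disk format. The function names and directory layout are invented for illustration, and it assumes GNU coreutils' sha1sum is available.

```shell
#!/bin/bash
# Sketch of a content-addressed, write-once block store in the spirit of the
# abstract above. store_put/store_get and the flat directory layout are
# hypothetical; this is not how Venti itself is implemented.

# store_put STORE FILE: store FILE under the SHA-1 of its contents, print the hash.
store_put() {
    local store="$1" file="$2" hash
    hash=$(sha1sum < "$file" | cut -d' ' -f1)
    mkdir -p "$store"
    # Write-once: an existing block is never overwritten, and identical
    # content coalesces into a single stored copy.
    if [ ! -e "$store/$hash" ]; then
        cp "$file" "$store/$hash"
        chmod a-w "$store/$hash"
    fi
    echo "$hash"
}

# store_get STORE HASH: print the block's contents to stdout.
store_get() {
    cat "$1/$2"
}
```

Because the name of a block is derived from its contents, storing the same file twice returns the same hash and takes no extra space, which is the coalescing the abstract describes.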
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
I've been looking into something like this for a little while. What I'd like to do when I have the fundage is get a fileserver/backup box. The ideal is to run 4 160 GB IDE drives in RAID 5. This will give me a bit over 450 GB in usable network storage. I then want to add a pair of 250 GB 5400 drives for backup. I can then set up the server to back up the data from the raid drives to the backup drives on a daily basis.
According to pricewatch the 4 160's could be had for around $400 total with about another $400 for the backup. Add a 3ware RAID controller for another $245 and you're looking at about $1045 to convert a system into supporting 450 GB of usable network storage and backup.
From all indications IDE hard drives are now the cheapest form of backup there is. I've looked at CD, DVD, Tape, but it keeps coming back to IDE hard drives. This is far cheaper than similar storage and backup would be on tape.
from the website:
HyperSCSI is a networking protocol designed for the transmission of SCSI commands and data across a network. To put this in "ordinary" terms, it can allow one to connect to and use SCSI and SCSI-based devices (like IDE, USB, Fibre Channel) over a network as if it was directly attached locally.
http://nst.dsi.a-star.edu.sg/mcsa/hyperscsi/
You can share iSCSI devices, if you do it the right way, between many different hosts. NBD sounds good, but for what you're asking, iSCSI or FCIP or some derivative sounds more correct. i.e. virtual block devices, or "real" block devices on a network that can be accessed by windows or *nix. you could RAID (md) iSCSI devices, or just use a system which "owns" all the iSCSI devices in an MD, and present it up using CIFS or SMB.
--SuperBug
http://www6.tomshardware.com/storage/20031028/index.html
Not as a solution in and of itself, but it is a good idea considering that you more than likely have a box to burn...also try to grab some old PolyServe software. It will do the same thing over a network, though not without resource loss.
WAR TUX!
What about data integrity when the network fails? Or when a single host fails? You could create ACLs for hosts that would be responsible for certain data upon certain failures, but then you're adding to an already overwhelming management nightmare.
Why not consider a shared storage system? You're not realistically going to have a failproof plan in your home, so just narrow it down to a few things. External JBOD with software RAID, presented as NAS to the rest of your computers. If a drive fails, just replace it. If the NAS head fails, just hook up the JBOD to another host.
Just thought I'd point this out, a typo in the article:
"Is there any existing technologies that will let me do this?"
--Should read "are..."
The highest performance is probably from Lustre, although it is designed for slightly larger clusters. Haven't tried it yet though.
http://www.vanheusden.com/Loose/nbdsrvr/
(I haven't used this, but it exists)
This is the way I do it, and although a little clunky, it allows me to keep remote backups of certain directories on three different servers.
First, set up ssh to use pubkey authentication instead of interactive password. You can read the man pages for details but it basically boils down to running keygen on the trusted source:
ssh-keygen -b 2048 -t rsa -f ~/.ssh/identity
Then copy|append the newly created ~/.ssh/identity.pub to the remote hosts into their /home/user/.ssh/authorized_keys file.
Now you can run rsync with ssh as the transport (instead of rsh) by exporting:
export RSYNC_RSH=ssh or also passing --rsh=ssh on the command line.
So to sync directories you could use a find command to update regularly:
while true; do
    # grep succeeds only if find lists something changed since the last sync
    find . -follow -cnewer .last-sync | grep '.' 1>/dev/null 2>/dev/null
    if (( $? == 0 )) ; then
        rsync -rz --delete . destination:/some/path/
        touch .last-sync
    fi
    sleep 60
done
Obviously this is pretty hackish and could be improved. But the point is that with ssh and rsync you could do automatic mirroring of specific filesystems or directories to remote locations securely.
What you seek is the holy grail of high-availability environments.
So far, I've not seen anything that exists that does what you are asking for. Several technologies come somewhat close.
What I've been hopeful of is the recent donations by Oracle for database clustering, but I haven't seen any decent fallout from that... yet.
For now, on my home-based work network, I have two network drives (both IDE 120 GB) and do a nightly rsync from one to the other.
(sigh)
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Not yet seen reference to unison:
http://www.cis.upenn.edu/~bcpierce/unison/
They say: "Unison is a file-synchronization tool for Unix and Windows. (It also works on OSX to some extent, but it does not yet deal with 'resource forks' correctly; more information on OSX usage can be found on the unison-users mailing list archives.) It allows two replicas of a collection of files and directories to be stored on different hosts (or different disks on the same host), modified separately, and then brought up to date by propagating the changes in each replica to the other."
If you "loose" your drive it might not come back.
Wouldn't that be slower than just setting up a dedicated file server using some RAID hardware? If you did that over the network, wouldn't it slow down your network tremendously? Besides that, I don't see too much advantage in it. If you have 8 computers at home, just set a new one up as a dedicated file server! Put some 250GB WD 8MB cache drives on SATA with RAID 0+1... and boom, file server with RAID! More effective!! But I guess that's just this techie's opinion.
Using a pair of Intel EEPro 100's w/ trunking (using both links at the same time on one IP, works w/ a cisco switch), I've gotten over 100 Mb/sec of actual throughput (I think I hit 137 Mbit/sec, peak) out of a box using NBD to create a mirrored RAID volume over the trunked ports. Now, my actual 'real' data speeds to the file system were about half that (call it 50-65 Mbit, or 6 to 7.5 MByte/sec), due to mirroring == writing it twice. Still not bad. Yes, the target disks were themselves part of other RAID volumes, for speed :)
Instead of trying to implement a shoestring SAN, go the simple route: throw up a Linux box running Samba for your "backup server;" it doesn't need much horsepower, just fairly fast drives and a network connection. Then schedule copies of your documents and home directories (using a cron-type tool on Linux and XCOPY called by the Task Scheduler on Windows, you should be able to hack something together that copies only changed files) every night at midnight, or some other time when you aren't using your computers. Although you might lose a bit of work if the system goes down, you won't ever lose more than 24 hours' worth.
If you have more money to blow, then I would suggest that you invest in an honest-to-dog hardware RAID card and some good drives and put them into a server, then do everything across the network (put the /home tree and My Documents folders on the server). You can of course mount the /home directory in Linux via NFS or smbmount, and Group Policy in Windows 2K/XP will allow you to change the location of the My Documents folder to whatever you choose. You might be able to do the same via the System Policy Editor on 9x; it's been a while and I can't find the information after a brief Google.
To sum up:
That's it. I'm no longer part of Team Sanity.
Really. If you're on a 100-megabit LAN, that gives you a max of about 10 megaBYTES per second. So, if you have to transmit information to two other computers for every disk write, you're effectively limiting yourself to a maximum of about 5 megabytes/second disk transfer. And that's under GOOD situations. If you're doing random I/O, where the latency will be the determining factor, then take the latency of the hard drives, add in the latency of the networking, and the latency of the software layers, and you're looking at some pretty abysmal performance.
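For anyone who wants to plug in their own numbers, the arithmetic can be sketched like this. The 80% link efficiency is my assumption, picked so that 100 Mbit comes out at the "about 10 megabytes" figure; the function name is made up, and latency is ignored entirely.

```shell
#!/bin/bash
# Back-of-the-envelope write throughput when every write is sent to several
# machines over one network link. The 80% default efficiency is an assumption.

# effective_write_kbps LINK_MBIT COPIES [EFFICIENCY_PERCENT]
# Prints the usable write rate in kilobytes per second (integer math).
effective_write_kbps() {
    local link_mbit="$1" copies="$2" eff="${3:-80}"
    # Mbit/s -> KB/s (x1000 / 8), scale by efficiency, then share the link
    # among the copies that each write must send.
    echo $(( link_mbit * 1000 / 8 * eff / 100 / copies ))
}

effective_write_kbps 100 1   # prints 10000 (one copy over 100 Mbit)
effective_write_kbps 100 2   # prints 5000  (each write sent to two machines)
```

This only covers the streaming case; for random I/O the per-request latency dominates and the real figure will be lower.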
Using rsync in a cron job will solve your backup problems. In fact, your script can use rsync to do the synchronization, and tar/gzip to archive the backup - giving you "point in time" snapshots for when someone says "I deleted this file 4 days ago, can you get it back?"
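A rough sketch of the snapshot half of that, assuming the rsync mirror already lands in some directory on the backup box. The function names, paths, host name and retention count are all placeholders.

```shell
#!/bin/bash
# Sketch: dated tar/gzip snapshots of a mirrored directory, so an old copy of
# a deleted file can still be pulled back. Function names, paths and the
# retention count are hypothetical.

# make_snapshot SRC_DIR SNAP_DIR [STAMP]: archive SRC_DIR into SNAP_DIR
# and print the path of the new archive.
make_snapshot() {
    local src="$1" snaps="$2" stamp="${3:-$(date +%Y-%m-%d)}"
    mkdir -p "$snaps"
    tar -czf "$snaps/backup-$stamp.tar.gz" -C "$src" .
    echo "$snaps/backup-$stamp.tar.gz"
}

# prune_snapshots SNAP_DIR KEEP: delete all but the newest KEEP archives.
# Relies on the YYYY-MM-DD stamp sorting lexically; paths must not contain
# spaces or newlines.
prune_snapshots() {
    local snaps="$1" keep="$2"
    ls "$snaps"/backup-*.tar.gz 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
        while read -r old; do rm -f "$old"; done
}

# A nightly cron job could then run something like:
#   rsync -a --delete /home/ backuphost:/srv/mirror/home/      (on each client)
#   make_snapshot /srv/mirror/home /srv/snapshots && prune_snapshots /srv/snapshots 14
```

Full tarballs every night are wasteful on disk, but they are simple, and a 4-day-old file is just one `tar -xzf` away.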
steve
Oh, you're not stuck, you're just unable to let go of the onion rings.
nbd + evms2 = networked software raid.
Be forewarned: This will be slower than snot on a cold Sunday.
The fastest and maybe even the cheapest setup to do this with would be to have a bunch of NAS drives on their own switch, with the host machine attached to the same switch. Host has multiple NICs, all channel bonded to this switch, and then has another NIC to the outside network. This would give you a big setup.. but again, SLOW! You're looking at 5MB/s tops with overhead.. IDE does this stuff all the time at up to 40MB/s+. SCSI and Fibrechannel, even faster.
Good luck!
...this question even got asked. Ok, if you *need* to share the same device across machine, something like the network block device can be a real help.
....
If all you're worried about is disk failures, mirror each disk locally. Disks are cheap, and real operating systems don't have any trouble with software mirroring.
Why would you want to make all of your machines suddenly non-functional, just because one of them lost a network card? Or the switch failed? Or
If you're not living on the edge, you're just taking up space!
What you're proposing is probably a poor solution to your needs. Using RAID-like disk storage across the network will require several high-latency transfers across the network for every write operation. Very slow.
Furthermore, every time one of the computers is powered off the system will wait for that machine to come back, or will treat it like a dead disk. Even with high performance raid devices, degraded mode is mighty slow. Then when the device comes back you will have to rebuild the raid. A long/slow/agonizing process even with fast hardware.
I think rsync in a cron tab is a much better idea.
Another DSF
http://www.coda.cs.cmu.edu/
Why is Coda promising and potentially very important?
Coda is a distributed filesystem with its origin in AFS2. It has many features that are very desirable for network filesystems. Currently, Coda has several features not found elsewhere.
1. disconnected operation for mobile computing
2. is freely available under a liberal license
3. high performance through client side persistent caching
4. server replication
5. security model for authentication, encryption and access control
6. continued operation during partial network failures in server network
7. network bandwidth adaptation
8. good scalability
9. well defined semantics of sharing, even in the presence of network failures
That also means that whenever even one of the machines is down ('hw maintenance', new kernel boot, system crash, unplugged...) all the others will lose access to the data too.
I suppose it could work well in a server room, but if your home setup is anything like mine - open cases and cat5 crisscrossing the house - or you have a screwdriver on your desk, you might experience a lot of downtime...
My wife would have me by the curlies.
yes, I'm a soldering iron wielding programmer
I hate to point this out, but my daughter's house in Scripp's Ranch in San Diego just narrowly escaped completely burning down. She evacuated with her hard disk (smart thinking there, kid!). The place is uninhabitable with smoke damage. How the fire went around that cul de sac is just amazing.
The point is: 8 computers in the house won't help diddly in a real disaster. That's a lot of work just to see it burn up. (I know it will never happen to you; it was 2,000 other houses that burned to the foundation.)
And further, I've had two RAID systems go TU in the last few years. For me RAID doesn't cut it at all. Distributed File System works pretty cool--but so does a fire safe.
How about a moderation of -1 pedantic.
Intermezzo and Coda both do this, but I don't think there's any windows versions available. There are some Microsoft things available too, but obviously those aren't for linux. NBD (which everyone else has mentioned) isn't distributed, so that's not really what you're looking for.
What you might be able to do is put together a microcosm of Freenet or something like it, running on just your home computers. There may be other Peer-to-Peer solutions available that are faster/more stable. Do some searching on peer-to-peer distributed storage networks. I know of two researchy ones: OceanStore and Chord. Good luck!
-3Suns
~~~~
The Revolution will be Slashdotted
Though not real time like a true RAID, I think what you're really after is something like rsync, as many other posters have mentioned. When this came up in an earlier story I found a link to Unison, which seems to be better for my needs at least.
http://www.cis.upenn.edu/~bcpierce/unison/
Might be interesting to combine this with FSRaid (Parity Archive or PAR files) to get some extra redundancy.
I have done this with linux software raid, the loopback device, and smbmount. Performance was horrible, but it worked. Here's an overview of my setup:
I mounted every windows PC's filesystem (~800 of them) with smbmount on my host machine. I then proceeded to make a 2gig file on every system. I used the loopback device (losetup /dev/loop0 filename) to make the files accessible as block devices. I then made the /etc/raidtab from hell, using raid 0+1 with LOTS of spare "disks". Brought up the raid, made a filesystem, and whamo, a place to keep my mp3 archive.
I actually had to hack more loop devices into the kernel as 255 is the maximum, but this is left as an exercise for the reader...
Windows Server 2003 has a feature called Volume Shadow Copy (VSC) which does exactly what you are asking. We use it to sync with files offsite automatically when changes are made. Like anything using a network, you need to watch your bandwidth, as every time a file changes, it must be sent over the wire.
Netapp filers have been doing snapshots for at _least_ 7 years now. This is not a new concept...
That way every machine will have a copy of all the files!
Wasteful, yes! But simple and effective!
There is no patch for stupidity
I do this everynight to thousands of machines...
The software I use is Kazaa-lite.
Oh, you mean files other than my MP3s/jpegs/mpegs? Sorry, I can't help you there.
Are you sure your data is so critical and important that a single daily or weekly backup won't take care of it? Considering the complexity of the offered solutions it may be easier to reconstruct a dead disk than implement these things. Also remember that you'd have 8 extra points of failure. If (when) one of your machines dies you're going to have to replace it and integrate the replacement into the system.
Here's how
I've since stopped using OS/2, and haven't found any replacement that works as good as that did. (Honestly, I haven't looked too hard in recent years.) Sometimes I think about bringing an os/2 box back just for tvfs.
I forget who was doing this..
but you set up a network that's insanely fast, but large enough and with enough hops that there's enough latency to get data both in and out..
then you route the data around in a circular fashion and grab it as it goes by.. I think someone in ca.NET was trying to do it.. used DWDM and the fibre as the storage medium.. interesting concept..
I don't think you could do this with your linksys router though..
Many responses, even highly-rated ones, seem to be talking about simple replication via NBD (worst-written code I've ever seen) or DRBD. That's not the same as what the original poster was asking about. Neither are fully-distributed but non-transparent file stores such as HiveCache. AFS/DFS/Coda/Intermezzo are probably the closest in the sense of being both transparent and resistant to failures. There have also been a couple of very closely related projects at Microsoft (Farsite and Pastiche) but I'm not sure if there's anything you can actually download and use.
Slashdot - News for Herds. Stuff that Splatters.
1. Trash 7 of the computers (the ones you don't really use but keep around for the geek factor). Replace with a nice little external 300 gig Firewire or USB2.0 drive. You can take it with you wherever you go.
2. With 8 computers on your LAN, you definitely need to get out and get some. Get a girlfriend!
This is a test. This is a test of the emergency sig system. This has been only a test.
I've been using this concept for my mp3 collections and some ripped DVD's for some time now. The overall storage space is astronomical but the latency is kind of high, I think it is called KaZaa.
http://www.parl.clemson.edu/pvfs/
"The goal of the Parallel Virtual File System (PVFS) Project is to explore the design, implementation, and uses of parallel I/O. PVFS serves as both a platform for parallel I/O research as well as a production file system for the cluster computing community. PVFS is currently targeted at clusters of workstations, or Beowulfs."
"In order to provide high-performance access to data stored on the file system by many clients, PVFS spreads data out across multiple cluster nodes, which we call I/O nodes. By spreading data across multiple I/O nodes, applications have multiple paths to data through the network and multiple disks on which data is stored. This eliminates single bottlenecks in the I/O path and thus increases the total potential bandwidth for multiple clients, or aggregate bandwidth."
Or there are many others to choose from, google for clustered filesystems:
http://www.yolinux.com/TUTORIALS/LinuxClustersAndFileSystems.html
Yeah it would be a geek wet dream to set up some weird distributed data storage system, and you'll be ready when the FBI comes for one of the computers, but it's probably overkill.
I work at home and have a relatively small number of computers (5, if you count the Zaurus handheld too). All Linux, FreeBSD, or OS X.
I back up every machine to "fileserver1" (on my linux desktop), then I back up fileserver1 to fileserver2 (my linux server/database machine). I use rsync, except on the Mac I use Retrospect Express FTP mode because of funny Mac files. I also keep a list of the "schg" files on BSD but usually I would just reinstall from scratch and then copy just my data files back, if there was ever a problem.
Each file is in roughly three places. Each machine has big hard drives, and I still have space left over.
Hard drives are so cheap, you could conceivably back up each machine to each other machine in a rotating schedule and be done with it. If you want to geek out, write a restore program that lets you browse and choose from each machine.
Personally, I've NEVER had a hard drive failure. NEVER. From my Centris 610 to my 10-year old 486 running Gentoo to my Red Hat P4, I've never had a crash.
I do however, fuck up at least once a week and delete important files. So having non-RAID backups is great.
So before trying to set up a SAN in your bathroom, check the price of putting 500GB in each machine to back up all the others with rsync.
I certainly would attest that this is a cool idea. I have a few systems at my place and it would be neat to make a single filesystem spanning all the storage on the network.
However, while small files would be fine, I would think the speed of the network would make for some fairly slow storage on a 100mbit network.
Add more users saving files across the network to the equation and things would get out of hand fast.
I guess I would just buy a serial ata raid motherboard (the intel D865GBFLK is one I have been thinking about), and just do 1:1 mirroring. Cheaper than scsi, and pretty darn fast.
Easy guys, I put my pants on one leg at a time. The difference is after I put on my pants I make gold records!
Don't forget that RAID only protects you from hardware failures, it doesn't prevent you from doing an "rm -rf important_file" :)
Personally I have a server with a RAID 5 array that is shared via SAMBA to windows and linux clients, which works fine, though I may adjust this if good suggestions are made here. The only real issue would be disk space, and all my computers now have 120G+ hard drives or RAID array....
ghettobackup.bat
copy c:\porncollection\*.* \\backup1\bak
copy c:\porncollection\*.* \\backup2\bak
.
.
.
copy c:\porncollection\*.* \\backup8\bak
-- Having a Creationist Museum is like having an Atheist place of worship
You really don't want to lose all that porn, huh?
It is "lose" and not "loose." It is "losing" and not "loosing."
I don't think the RAID algorithm is the right way to synchronize all your data when applied on the larger scale. I imagine that what a person really wants to do is to unify all his accounts, on slow and fast links all over the world, to look like a huge synchronized partition which stores the data throughout the accounts with sufficient redundancy (meaning something like 'keep copies of all data in at least three different locations'). I think using RAID for this would give horrible performance and not be nearly flexible enough in how data is distributed through the different locations.
A new networked file system is needed. I am working on such a solution in my spare time (but it is still in the design phase).
The main idea is to unify cache and storage. This means that the least used files are deleted when an account is running out of storage, but under the constraint that a minimum number of copies of the files are kept online. (Hence, data will propagate to the nodes that actually use it.) Upon a data request the filesystem goes out and fetches the data, preferably in some P2P-like way where it is fetched simultaneously from all locations that have copies of that data.
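To make the eviction rule concrete, here is a rough sketch of "delete the least used files, but never drop below a minimum number of copies". The `Node` class, the `make_room` helper and the value of `MIN_COPIES` are all assumptions made for illustration; the design described above is not published anywhere I know of.

```python
from dataclasses import dataclass, field

MIN_COPIES = 3  # assumption, borrowed from the "three different locations" example

@dataclass
class Node:
    name: str
    capacity: int
    files: dict = field(default_factory=dict)  # file name -> (size, last_used)

    def used(self):
        return sum(size for size, _ in self.files.values())

def copies(nodes, fname):
    """Count how many nodes currently hold a copy of fname."""
    return sum(1 for n in nodes if fname in n.files)

def make_room(node, nodes, needed):
    """Evict least recently used files from `node` until `needed` units are free.

    A file is skipped if removing it would leave fewer than MIN_COPIES
    copies across all nodes. Returns the names of the evicted files.
    """
    evicted = []
    oldest_first = sorted(node.files, key=lambda f: node.files[f][1])
    for fname in oldest_first:
        if node.capacity - node.used() >= needed:
            break
        if copies(nodes, fname) > MIN_COPIES:
            del node.files[fname]
            evicted.append(fname)
    return evicted
```

Note that the function can return without freeing enough space: if every remaining file is already at the minimum copy count, the node is simply full and the caller has to store the new data somewhere else.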
If someone knows a solution that already works something like this, please tell me.
Open Materials Database
Groove Workspace is a collaborative environment, but it does have a component that allows you to share an archive of files.
Worth considering because:
- Files are encrypted and sent in an encrypted format.
- Files placed in the shared space are mirrored on all systems that are members of the workspace.
- The software is free for non-commercial use.
- Lots of other interesting features to play with.
- You can even mirror with a machine across the Internet.
Limited by:
- The speed of your connection.
- Windows users only.
Go check it out at http://groove.net/
Does anyone know if there are efforts in the open source community similar to...or designed to enhance this product?
Obvious link.
BlackNova Traders
For that matter....what if the sun goes nova?
I have a large number of hard drives packed into a couple of computers in my home, so I am speaking from experience here.
/.ers have a few old boxen lying around and want to put them to use. It would be much nicer to set up a cluster of computers which would have a lot of redundant components than to set up one big server.
Drives these days tend to get very hot. I just put a 40Gig Maxtor in a co-worker's computer a few hours ago and it gets hot to the touch pretty soon after spinning up. I have had 3 out of my dozen or so drives start reporting bad sectors within several months of getting them. Thankfully I didn't lose any data, but I was sure scared every time it happened. I have since put a few more fans blowing over the drives to cool them better and the problems have since stopped. I have found that disk failures are becoming more common now as drives run hotter.
A distributed file system like the author is looking for would be most helpful not only for redundancy, but also to allow for greater capacities than can reasonably be put in a single machine alone.
A large RAIDed fileserver would be nice but would potentially be a cooling nightmare. Not to mention the drone of all those fans! (I have to turn the computer off at night it's so loud.) Also since I just blew a power supply I would be worried about a single key component going bad and keeping me from my files until I got it replaced.
Most
Just my thoughts.
is to use IP over Carrier Pigeon.
Then the only remaining issue is number of pigeons.
-... ---
An interesting set of recommendations ... with absolutely zero in the way of justification. Why, why, why, and why?
While Slashdot may not be the epitome of scientific correctness, some of us do try, given half a chance. Pure opinion without facts is of no use to anybody.
(Must try harder.)
http://www.hovell.com/ifolder - Linux client is coming (or use the web interface).
http://www.microsoft.com/NTServer/nts/downloads/winfeatures/NTSDistrFile/AdminGuide.asp
At least, that's what I do.
This was covered once before: slashdot.org
Integrated application integration with synergistic synergized synergy
Can't you just use rsync?
I'd rather be a conservative nutjob than a liberal with no nuts and no job.
Alternately you could engrave the data onto coconuts and use migratory swallows to transport them. But then that would raise the matter of using an African swallow compared to a European swallow.
Nobody expects the Spanish Inquisition!!
Don't play around with something "cool" like a distributed RAID disk. Just spend the money on a decent tape drive and tapes, design a tape backup rotation strategy, get a safety deposit box at a local (or not-so-local) bank for off-site storage, and set up Amanda to do the backups.
could you easily make another of the set into the master? would it pick up the raid and understand how to work with it?
How is this news?
Novell provides a sync solution that keeps files synchronized across multiple clients. It encrypts the files, replicates only file deltas, and also provides clientless Web access to iFolder files.
The server can run on Linux although there's no Linux client (yet) but it's coming soon. Several commercial implementations exist - see www.efoldering.com.
You can also try it free online at http://ifolderdemo.novell.com/, but you'll only get 10MB of space.
More info:
http://www.novell.com/products/ifolder/
too much about fire.
It's my wife and her need to open any email she gets using outlook on her windows box. She's just enough of a geek to be dangerous and "enjoys" the preview feature.
And she wonders why her 'puter can't log into the LAN without being Virus checked first.
-Goran
Carpe Scrotum - The only way to deal with your competition.
http://www.cis.upenn.edu/~bcpierce/unison/ :(
I was considering this, but I cannot afford enough bandwidth to do it over a WAN.
woo hoo!!
Warning, I am a support guy for FirstBackup an online backup service, www.firstbackup.com
:) The only downside to our software that I can see is that there is no linux/mac client as of yet, but mapped network drives work great.
If data protection/security at a cheap price is what you need, most online backup services will fit the bill.
I mean, at where I work all of our stuff is 448bit encrypted before it goes on the wire, and then when it goes on the wire it goes to our server farm, then gets mirrored 30 miles away in a secure location. And, you can tell how many problems we have with the software, I am one of the support people and I am posting on slashdot I am so busy.
Here are just a couple of our bigger competitors links and ours in case you are interested, I really do think online backup is what everyone will eventually go towards.
http://www.atbackup.com
http://www.connected.c
http://www.firstbackup.com
It seems to be a great problem solver for what you're trying to do. First off, on initial start it only connects to computers it knows, or downloads info about a couple of nodes from the main website, but if you were to export your noderef and import it into all of your other systems instead of the default noderefs, then you could have a distributed storage network set up among all of your computers.
Granted, you'd have to have a bit more storage dedicated than you'll be storing, but if you want every file to have a decent backup, then that's one of the prices you'll have to pay. Also, it's self cleaning when it comes to backups, because it automatically pushes out the old, less requested files in favor of the newer, more requested files.
Another solution, should your systems be using Linux, is maybe something like GNUnet, which is built upon the sharing of files in both a distributed and an anonymous manner.
Damn, I *almost* hit preview. :) Oh well. Sorry about that.
-- The world is watching America, and America is watching TV.
it deserves a quick mention, even if it's not quite what you had in mind. it assumes that you have all your disk in one SAN pool, though, and this isn't really the way it is in your case; however, it provides for multiple NFS or SMB servers with one GFS backend that they all access, which is what you wanted...
Check out:
http://www.drbd.org/
What is DRBD
Drbd is a block device which is designed to build high availability clusters. This is done by mirroring a whole block device via (a dedicated) network. You could see it as a network raid-1.
Software RAID/LVM can detect which volumes go where by magic numbers written to them when you format them. But you still have to set up all the remote NBDs correctly on a new machine, and you need the old setup file from the old machine that tells it what block devices/partitions to use.
NOTE!
You shouldn't leave any NBD-exported volumes on the new master. Make it into a physical, local volume, but reference it in the "same place" in your RAID configuration.
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
I take it you've thought about speed issues? RAID over a 100mbit link doesn't sound like great fun - leastways I wouldn't put my swap on such a drive :)
Gigabit might work though.
"'I pass the test,' she said. 'I will diminish, and go into the West, and remain Galadriel.'"
- JRR Tolkien.
-matthew
"THERE IS NO JUSTICE, THERE IS ONLY ME." -Death
Set the wayback machine to 5 or 6 years ago - there was a product for Windows (98 at the time) called Medley from Mangosoft.
It used DCOM to allow all computers on your network to see the "pool volume". Each computer donated a certain amount of space to the pool, which then became available to the whole. The software made sure that any file was written at least two places on the network, so you didn't have redundancy until there were at least 3 computers with this installed. You couldn't read files directly from the "pool space" your computer contributed - my recollection is that the pool donation became a massive file much like modern Virtual PC.
At any rate, it was a cool idea that made perfect sense for small offices or homes with lots of W98 computers. I tried to set it up for a friend in the company he owned. It worked well until I went to one particular machine where it never worked. Tech support was no help.
The product disappeared within a year, so it must not have worked for anyone else either.
Just use Windows 2000 server with DFS. Geez. Some geeks.
I looked at your pricing. Since I have nearly 7GB of just digital photos BESIDES my data it could get quite pricey.
"If you are on fire you can just stop, drop, and roll. If you fall into Lava you are just dead." - my 5yr old daughter
I understand many of the comments here which say "put in a big honkin' server and hardware RAID". That would be a better solution from a purely 'let's serve files and protect data' standpoint if you can accommodate a single, large server and want the best performance.
However, I see a use for a network LAN storage system. Every machine these days comes with a 72G drive or larger installed locally, yet we are trained as IT personnel to say 'don't store anything locally, it's not secure or safe, put it on one of our nice big honkin' servers'. Unfortunately, those big servers cost a lot of money, often require specific admins (eg SAN experts to deal with the management software, dividing up LUNs, etc), and may involve a lot of red tape to justify additional storage allocation for your project.
What to do with all that local disk space that, if unused as most centralized IT would rather have you do it, would be a vast untapped storage resource?
The concerns regarding latency are well understood, but this might not be a factor if this LAN storage array was used for 'archive' storage where real-time high speed access isn't the driving factor. A RAID 5 system would be far too fragile, as if two nodes were offline/rebooting the entire network storage LAN would be unavailable. You'd need to have more redundancy than that.
I could see an interesting application using multiple nodes each contributing disk space to a LAN archive storage array which would be 'written to' and retrieved with similar expectations as writing to a tape drive. The bonus would be that you could work on files in realtime over such a network, just quite slowly (many vendors used to offer archive file systems which worked this way using tape or optical drives as the storage medium - AMASS was one such vendor).
Oracle 10g kind of does this and a heck of a lot more... but I think you have to use applications designed for it in order to work.
This person's asking for Transparent Redundant Data Backup, which doesn't seem so unusual that no one's asked or implemented it before.
$8.95/mo web hosting
Post a few articles on the net professing your undying loyalty to Usama Bin Laden. The FBI will back up everything for you.
The Lustre project (www.lustre.org) is supposedly going to be the end all/be all of distributed parallel file systems, but I believe it is still fairly unstable and not ready for production use. In the meanwhile, the best one out there is PVFS (www.parl.clemson.edu/pvfs/). Fat chance trying to find Windows clients, but you can always re-export it with Samba.
Their Farsite File System is a serverless, distributed file system that would provide backups, as well as encryption to prevent unwanted access to your files.
What if you reboot one of the NBD servers? While you'll still have access to the data since it's a raid, I would well imagine that you would have to rebuild the entire "disk" once it comes back online.
Assuming a Raid5 with three nodes, and two go down not at the same moment, will all your data be lost?
I would think very carefully about these issues before putting all your valuable data on it. RAID isn't really designed for frequently unreliable connections like this. It's meant to prevent data loss if a hard drive crashes, which should be a fairly uncommon thing within a single system.
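For anyone wondering why the second failure is the dangerous one, here is a toy illustration of single-parity redundancy using XOR. It is a simplified sketch, not how the Linux md driver lays out data: real RAID 5 rotates parity across members, and the function names here are invented.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks plus one parity block (one block per node)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild(stripe):
    """Rebuild a stripe in which missing members are marked as None.

    One missing member can be recomputed from the survivors.
    Two or more missing members cannot.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("more than one member missing: stripe is unrecoverable")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired
```

With three nodes, any single node can vanish and `rebuild` recomputes its block. If a second node drops out before the first has been rebuilt, the stripe is gone, regardless of whether the two failures happened at the same moment.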
- It's not the Macs I hate. It's Digg users. -
Why would you want to "loose" one of the disks? Don't you know they're supposed to stay tightly enclosed in their little boxes?
And why do you think that "loosing" the disk would help the image "automatically reconstruct itself?"
Actually, if you did that the disk would carom around the room like a very fast, very lethal Frisbee and you would be too busy trying to survive to worry about where your data went!
Just a thought
Otherwise, your plan sounds peachy.
Any technology distinguishable from magic is insufficiently advanced.
Just rename all your important documents to porn and new movie titles, and EVERYONE will back them up for you!
Karma: It's all a bunch of tree-huggin' hippy crap!
If you have n computers each writing all their information to n-1 computers over an IP network, you are going to have some really slow access.
You'll be wanting a distributed cluster filesystem. There are several available, with their pros and cons. They are also all aimed at enterprise / HPTC installations. For home use you'll be better off with a set of RAID disks.
GPFS from IBM. This is free for academic use, but you pay for commercial use. Linux or AIX only.
GFS from Sistina. Commercial offering. Linux only.
Lustre. This is beta quality code, but is freely available. It might work wonderfully, or it might eat your files.
(open)AFS. Works, but has limitations. It does not support large files and clients aren't available for all OSes.
http://www.drbd.org/
"Drbd is a block device which is designed to build high availability clusters. This is done by mirroring a whole block device via (a dedicated) network. You could see it as a network raid-1."
While a pure linux solution seems to score the most points here, this particular one lets you combine your windows, OS X, and linux systems into a single distributed storage mesh. There is safety in numbers, and the more systems you can add to these sort of distributed storage systems the more reliable they become.
HiveCache is more of a backup solution, but I do know that it is possible to use this with a webDAV front-end for archival storage and other interesting storage possibilities.
This is a great project for doing exactly what you want to do, cross platform, and encrypted. http://sourceforge.net/projects/dibs
How to do this is spelled out in the book Linux Server Hacks by Rob Flickenger. See tips #41 and #42.
Or see online:
http://www.mikerubel.org/computers/rsync_snapshot
The beauty part: export the snapshots back to the users with NFS. When they lose a file they can get it back without asking the sysadmin to do it!
steveha
lf(1): it's like ls(1) but sorts filenames by extension, tersely
Are there any operating systems at all that have this functionality now?
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
I have managed hundreds of servers over the last decade. RAID helps with UPTIME, and high availability, it sometimes (rarely) helps with reducing data loss. Most of the time data loss is NOT BECAUSE OF DISK FAILURE. It is because of an idiot who accidentally deletes the files or whole directory structures, or the logical volume.... 'nuf said. What you need to do is create OFFLINE copies of your work periodically. So, read up on rsync and write yourself a cron job. You can set up SSH/SCP on your windows box and you can then use rsync from the Linux boxes to back up your "Documents and Settings" dir on your Windows box. RSYNC even has command line options for creating snapshot backup directories.... There is a HOWTO at the samba site (where rsync comes from) that details scripts for how to create rotating backup scripts with RSYNC.
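One way to get dated snapshot directories out of rsync is its `--link-dest` option, which hard-links unchanged files against an earlier snapshot so each new snapshot only costs the space of what changed. This is a sketch under my own assumptions (one directory per day, absolute paths, hypothetical locations), not the script from the HOWTO mentioned above.

```python
from datetime import date, timedelta

def snapshot_command(source, backup_root, today, previous=None):
    """Build the rsync command for one dated snapshot directory.

    backup_root should be an absolute path: rsync resolves a relative
    --link-dest directory against the destination, not the current directory.
    """
    dest = f"{backup_root}/{today.isoformat()}/"
    cmd = ["rsync", "-a", "--delete"]
    if previous is not None:
        # Files unchanged since the previous snapshot become hard links.
        cmd.append(f"--link-dest={backup_root}/{previous.isoformat()}/")
    cmd += [source, dest]
    return cmd

if __name__ == "__main__":
    today = date.today()
    yesterday = today - timedelta(days=1)
    # Hypothetical source and backup locations.
    print(" ".join(snapshot_command("/home/", "/backup/snapshots", today, yesterday)))
```

Run from cron once a day, this leaves a browsable tree per date. Expiring old snapshots is just removing old directories, since hard-linked files survive as long as any snapshot still references them.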
Something that caught my eye a while ago in this area was HiveCache. Never used it, don't know anyone that has, but it looks like a pretty cool system.
Check out http://rdiff-backup.stanford.edu/ for the wonderful rdiff-backup.
With the combination of rsync, ssh & rdiff-backup I have setup a very reliable incremental network backup infrastructure, allowing me to go back to any previous version of a file.
regards,
Heiko
Any post with the word "'puter" is automatically ignored due to lameness.
I want to delete my account but Slashdot doesn't allow it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 go!
The self-certifying filesystem or the cooperative filesystem might do what you want, though I believe they only run on unix platforms. The code is considered to be in the alpha stage, but apparently the maintainers have been using it for a while without losing files. On some platforms SFS (on which CFS is based) has the nasty habit of deadlocking the kernel from time to time. You might want to read their documentation, since this might not be a problem for what you're running on.
SFS
CFS
WARNING: there is a trojan on your
The concept of being able to see the previous version sounds good. But on VMS, file versions didn't really achieve this all that well. Classic example: how do you delete a file?
Try #1:
DELETE FOO.TXT
This is really the wrong answer. If you have FOO.TXT;1 and FOO.TXT;2, then this command deletes FOO.TXT;2 and any attempt to access FOO.TXT will get you FOO.TXT;1.
Try #2:
DELETE FOO.TXT;*
This is the common recommendation, but you've now lost the ability to see any of the old versions.
The GNU file utilities (and emacs and some other GNU programs) have a file versioning scheme which is somewhat similar to VMS but somewhat better. Look at commands like "VERSION_CONTROL=numbered cp -b foo bar" (the -b is what actually turns backups on).
Personally, I usually put things which matter in CVS. With the CVS server in a distant city (at an ISP which provides ssh shell accounts). That gives me off-site backups.
CMD does not support UNC paths as current directories. =(
You might want to do RAID 1+0 (a.k.a. RAID 10) instead. That is, a stripe of mirrored disks.
Why? If you have ten disks in a mirrored stripe and lose one disk in each stripe, you lose. If you have a stripe over mirrors, you can lose a disk in each mirror and still access all data (but it's time to check those backups...) With four disks, it would be the same. Add more disks and a stripe over mirrors is safer.
Someone could calculate the probabilities of losing data in the two setups, but you have a better chance with striped mirrors. The performance should be the same.
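Since someone asked for the probabilities: here is a brute-force sketch that enumerates every way a given number of disks can fail and counts the cases where data is lost. The assumptions are mine: failures are independent and equally likely, the disk numbering is arbitrary, and nothing is rebuilt between failures.

```python
from fractions import Fraction
from itertools import combinations

def loss_probability(n_disks, failures, is_lost):
    """Fraction of equally likely failure sets of the given size that lose data."""
    cases = list(combinations(range(n_disks), failures))
    lost = sum(1 for failed in cases if is_lost(set(failed), n_disks))
    return Fraction(lost, len(cases))

def stripe_over_mirrors_lost(failed, n_disks):
    # Disks (0,1), (2,3), ... are mirror pairs.
    # Data is lost only when both halves of some pair have failed.
    return any({d, d + 1} <= failed for d in range(0, n_disks, 2))

def mirror_of_stripes_lost(failed, n_disks):
    # Disks 0..n/2-1 form stripe A, the rest form stripe B.
    # Data is lost when each stripe has at least one failed disk.
    half = n_disks // 2
    return any(d < half for d in failed) and any(d >= half for d in failed)

if __name__ == "__main__":
    for name, rule in (("stripe over mirrors", stripe_over_mirrors_lost),
                       ("mirror of stripes", mirror_of_stripes_lost)):
        print(name, loss_probability(10, 2, rule))
```

For ten disks and two failed disks this gives 1/9 for the stripe over mirrors and 5/9 for the mirror of stripes, which backs up the "better chance with striped mirrors" claim under these assumptions.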
Still confused? You're not alone...
..
how about if you used nbd and exported a small image on each machine on a network of about 100 machines, then used all of the drives in a software raid0? sure networks can be slow and 100mbit is just 12MBytes per second. but in theory you could get 12MBytes/sec * 100 = 1200MBytes/sec (1.17GB/s) or more realistically 1/2 that. OR if you were on Gigabit ethernet then you could be looking at a very high throughput but prob high access time with tcp/ip overhead.
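A quick back-of-the-envelope version of that arithmetic, as a sketch. The one thing it adds is an assumption the estimate above leaves out: a single client reading the stripe can only take in what its own network link carries, so the sum over all servers is an upper bound for the whole network, not for one machine. Protocol overhead is ignored entirely.

```python
def mbit_to_mbyte(mbit_per_s):
    """Convert megabits per second to megabytes per second."""
    return mbit_per_s / 8

def aggregate_read_rate(servers, server_link_mbit, client_link_mbit):
    """Upper bound in MB/s on what one client can read from a striped set.

    Assumes each server sits on its own link and the client has one link.
    """
    offered = servers * mbit_to_mbyte(server_link_mbit)
    return min(offered, mbit_to_mbyte(client_link_mbit))

if __name__ == "__main__":
    print(aggregate_read_rate(100, 100, 100))   # client on 100 Mbit
    print(aggregate_read_rate(100, 100, 1000))  # client on gigabit
```

So under this model the 100 servers can offer roughly 1250 MB/s between them, but a client on 100 Mbit still tops out around 12.5 MB/s, and one on gigabit around 125 MB/s.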
What I need is a system that can cope with very low reliability of the computers in the mesh. Also, they're not 24/7 so the system (MC-ed by a server I assume) needs to unmount at 10:30pm when the PCs shut themselves down then boot all the PCs and (re)mount at 7:30am. It needs to cope with an entire lab of PCs being swapped out at the end of lease.
Anything like this?
I think I'll have to make a flowchart to make sense of it.
A lot of the solutions posted so far have you mount some esoteric devices, but since you didn't specify, I'll assume you didn't ask for filesystem-level file storage. That leaves us with application-level file storage: a system for storing application data rather than general files:
1 - Groove Networks [ www.groove.net ] have software that is meant for organizations to share documents in a P2P way. I think there's a shareware/evaluation version available, and if I remember correctly, it's written in Java.
2 - Nullsoft's WASTE was rumoured to provide a closed network of file sharing of some sort. I didn't get to read up too much on it because it was taken down by Justin's employer. (consult /. for the resulting "scandal")
3 - Any content-management system out there. There's plenty, some even open-source. (I think phpgroupware is one of them) Mostly web-based, but you're bound to find a few that integrate with existing applications.
4 - Some people suggested manual/automated synch, but then you are limited to the size of the smallest storage device on your network (if you want all files accessible from all machines). You are probably better off with central storage that uses RAID and/or a clever distributed backup. (think of targzipping your entire central file tree, then splitting that up in chunks and sending those chunks to clients in such a way that the server and even one client going down would still leave you with enough chunks to put it all back together on a machine that has enough storage space)
I'm sure some Computer Science student out there will take idea #4 (clever distributed backup chunks) and write their honours project based on it.
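Not an honours project, but idea #4 can be sketched in a few lines. This version just stores every chunk on two different clients in round-robin order, so the archive survives the server plus any one client going down. The function names and the two-copy choice are my own assumptions; a smarter scheme would use parity or erasure coding instead of plain duplication.

```python
def split(data, chunk_size):
    """Cut a byte string (e.g. a tar.gz of the file tree) into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_chunks(n_chunks, clients, copies=2):
    """Assign chunk i to `copies` consecutive clients, round-robin.

    Returns a dict: client name -> list of chunk indexes it holds.
    """
    if copies > len(clients):
        raise ValueError("need at least as many clients as copies")
    placement = {client: [] for client in clients}
    for i in range(n_chunks):
        for k in range(copies):
            placement[clients[(i + k) % len(clients)]].append(i)
    return placement

def recoverable(placement, n_chunks, down):
    """True if every chunk is still held by at least one client that is up."""
    available = {i for client, held in placement.items()
                 if client not in down for i in held}
    return available == set(range(n_chunks))
```

The server itself holds no chunk in this layout, so losing it costs nothing as long as the placement map is also kept on the clients. With two copies per chunk, two clients failing together can still lose data.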
Oops, I forgot to give the most obvious answer: use CVS or some other source control system. Again, you have central storage, which is easy enough to back up, but even if this fails, you most likely have fairly recent versions of the files on various machines, assuming you use them frequently, which you say you do, so it should be no problem. :)
I know it's not what you're asking for, but I'd recommend setting up one of those 8 boxes as a file server with regular RAID. It's a simple and proven way to get the end result you're after. It doesn't have to be expensive either, a pair of IDE drives (each alone on their own IDE port) and Linux with its built-in software RAID, exported to the network via NFS and/or SAMBA and/or whatever else.
I was just thinking that this is just up iSCSI's street.
.... my wallet doesn't open far enough at the moment :-)
Have multiple iSCSI targets and then use a software RAID-5 implementation (it wouldn't care because as far as it's concerned the iSCSI device is a "local" device) on the initiator machine.
Would be neat to see in action, anyone got enough kit at home/work to try it out and report back ?
Mark
I'm surprised to see nobody has yet mentioned HyperSCSI, which is:
- opensource
- based on raw ethernet (supposedly faster than iSCSI or other TCP/IP-based schemes)
- has a Win2K client
Check it out, I've tested and used it for about a year and it works quite well!
--
Nicson
Others have already pointed out the NBD solution, so I will not repeat it here. What I would like to ask is: why?
It would be much easier for you to put four 120Gig IDE drives in one of your eight computers and use a real RAID setup. I have a dual 60gig that holds all of my home directories and all of my mp3's, and that in itself is enough for me.
Set yourself up a RAID-5 so that the performance is not dismal and you will be set. Hardware IDE RAID is really cheap now, or you could go with a software solution if you don't have very intensive data writes.
George II -- Spreading Freedom and American values, one bomb at a time.
I've been looking into this as part of a project for one of my seminar classes this semester. Perhaps I can do my thesis on this as well, I'll have to see what happens.
It all seems to depend on what it is needed for. For a "normal" RAID, this won't work. For a cheap backup on a WFGM (Wide Family/Group Network) this has a lot of potential.
speed will take a hit if u'r not using gigabit ethernet.
also there's supposed to be a filesystem based on the PAR2 thingie....and I'm sure someone has mixed the two (or at least has a net-RAID5 type thing).
once I upgrade my networking schtuff to 1000Mb, then I'll do it too....but for the time being, I use my lil' linux toaster (Shuttle SS51G) as my central file storage location (ty samba!)
It's not windows, and it's not linux, either. It's human error.
I'm using one of Ximeta's ethernet-connected 160gb drives. It also has a usb2 connection, you can only use one or the other at a time. And only one client can have R/W access at a time - the others get RO access.
Mostly, I back up each machine's personal data and config to it periodically. I'm just talking about a home lan here, this is not an office-scaled solution.
I'm still looking for a better solution. Sooner or later, some smart guy will make a shoebox-sized server, with redundant drives, and basic file sharing/locking support (nfs/smb). The drives themselves are so cheap (on the order of $1 per GB now) that the hardware and management required to make RAID5 aren't economical for most personal users. I think it needs to retail for under $300 to be a real winner, and I know that's a tough barrier.
The Glade partition control system has been doing this and more for ages. It's used in mission critical military applications. And, oh yes, it's free.
Check out http://www.act-europe.fr/ and click on the Glade link.
It looks to me like a waste of resources. Why not set up a cron job that copies the content of the partition you want to back up to n other systems?
That's the way I am set up at home: one linux box (my server) has an 80GB hdd. That's where I put everything valuable I have (mp3, pr0n, cvs, db...). Every night, at 1:53AM, a cron job starts, stops every service that could change the data (Tomcat, Mysql, cvs...) and backs the HDD up over the network to my second PC. Then all services are restarted and everything is up and running again. Incremental backups let this operation finish in a few minutes. The downtime is usually not a problem since it's my personal home system.
Write boring code, not shiny code!
This confusion dates back to the days of "Loose it or Lose it" when long bows were the new high tech weapon.
What you really need is a distributed, serverless filesystem - one which lets you store files on all your disk drives on the LAN, with automatic redundancy of data (so if a machine goes down or its storage becomes unavailable, you still have a copy of your data blocks on one or more of the other machines) and ability to access those files from any machine on the LAN. A serverless filesystem is one in which the participating machines act as peers - i.e. no master server. Distributed and serverless filesystems are a hot research area right now but I'm sorry to say that they're not yet ready for the mainstream.
I went through the "is CODA right for me?" phase, and also "is InterMezzo right for me?" and also spent tens of hours researching distributed filesystems and cluster filesystems online ... my conclusion is that the area is still immature, I will let the pot simmer for a few more years (hopefully not many), and use NFS with one or two servers in the meantime.
My situation: desire for scalable and fault-tolerant distributed filesystem for home use with minimal maintenance or balancing effort. Emphasis on scalable - I want to be able to grow the filesystem essentially without limit. I also don't want to spend much time moving data between partitions. And last but not least, the bigger the filesystem grows, the less able I will be to back it up properly. I want redundancy so that if a disk dies the data is mirrored onto another disk, or if a server dies then the clients can continue to access the filesystem through another server.
All that seems to be quite a tall order. I checked out CODA, afs, PVFS, sgi's xfs, frangipani, petal, NFS, InterMezzo, berkeley's xfs, jfs, Sistina's gfs and some project Microsoft is doing to build a serverless filesystem based on a no-trust paradigm (that's quite unusual for Microsoft!).
Berkeley's xFS (now.cs.berkeley.edu/Xfs) sounded the most promising but it appears to be a defunct project. The source code is online however, so maybe somebody can resurrect it. Frangipani sounds interesting also, and maybe a little more alive than xFS.
On the other hand CODA, afs, intermezzo and Lustre are all in active development. afs IMHO suffered from kerberitis, i.e. once you start using kerberos it invades everything and it has lots of problems (which I read about on the openAFS list every day). AFS doesn't support live replication either - replication is done in a batch sense.
CODA doesn't scale and doesn't have expected filesystem semantics. For 80 gigs of server space I would require 3.2 gigs of virtual memory, and there's a limit to the size of a CODA directory (256k) which isn't seen in ordinary filesystems. There's also the full-file-download "feature". CODA is good for serving small filesystems to frequently disconnected clients but it is not good for serving the gigabyte AVIs which I want to share with my family.
InterMezzo is a lot more lightweight than CODA and will scale a lot better, but it's still a mirroring system rather than a network filesystem. I might use that to mirror my remote server where I just want to keep the data replicated and have write access on both the server and the client, but it's again not a solution for my situation.
The best thing about intermezzo is that it sits on top of a regular filesystem, so if you lose intermezzo the data is still safe in the underlying filesystem. CODA creates its own filesystem within files on a regular filesystem, and if you lose CODA then the data is trapped.
Frangipani is based on sharing data blocks, so like NFS it should be suitable for distributing files of arbitrary size. I need to look at it in a lot more detail; this is probably the right way to build a cluster filesystem for the long haul. For the short term, Intermezzo is probably the right way for a lot of people: it copies files from place to place on top of existing filesystems.
I got motivated to look at Frangipani again. No sour
NBD-server for windows
I'd be hesitant to put my stuff on Windows boxes if they were also used for other purposes. Most people set Windows up so they have administrative privileges. That means they could probably see all the files you are distributing - at least the filenames even if the data was only 1/5th the entire file or whatever. What about the issue of files becoming corrupt because someone's computer catches a virus which taints your data? Any checksumming?
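On the checksumming question, here is a minimal sketch of what a check could look like, assuming you keep a manifest of SHA-256 digests somewhere you trust and re-verify the copies held on other machines. The function names (sha256_of, build_manifest, changed_files) are made up for illustration and do not come from any of the tools mentioned in this thread.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """List manifest entries that are missing or no longer match."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

This only detects damage. It does nothing to stop an administrator on the storing machine from reading filenames or contents.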
So AFS is the oldest and probably the most robust of the choices. (Ok, so AFS is, but you probably don't want to buy AFS from Transarc, so just use OpenAFS) It is a distributed file system that allows for replication of data across servers and all of that. It is in use at MIT, NCSU, CMU and other good CS places. And you can use it on *nix and W32. It isn't the easiest choice to get running, but if you actually want the thing closest to Raid-5 across machines, this is definitely the choice for you.
I've only seen one answer thus far that even comes close to solving the problem as the user attempted to describe it. But I think the problem was that the person didn't know exactly what they really wanted, and therefore worded the question poorly.
The correct answer to this question is a mixture of solutions... as it makes no sense to completely mirror a filesystem across multiple workstations. You'll never need to carry that entire filesystem with you at all times unless it carries your booting operating system.
Therefore I present my solution:
For the home user... dedicate two machines (your servers) to the redundant raid of your choice and means. RAID 5 could be the answer, RAID 1 could be the answer... RAID 5+1 could be the answer... not enough information is given to know just how much and what CRITICAL data you could possibly have at home. However this does give you a level of redundancy at the drive level. I would highly suggest making use of LVM in servers with more space to add drives later down the line.
Next step is to mirror the data across the two servers. I suggest CODA. Not terribly difficult to install, RPMs available if that's the way you bend, lots of time under its belt and because of what we are about to do, Windows is not required.
So how do my Linux and Windows clients get to the data? Well. There are a bunch of ways to accomplish this. You could install multiple types of network filesystems to support multiple operating systems. Which to me has always seemed rather crappy. Who wants to match all those user ids one might use? Or, horror of horrors, allow SMB or NFS (or Appletalk) out of the local network? Not me. BUT... what about WebDAV? Still somewhat in its infancy - and it's already had a rather significant remote hole - it is fairly elegant. Linux, Windows 2000+, and MacOS X all support it... it's web based (so you're going to be running a web server too)... and you can run the whole thing under SSL. This makes it available to you from just about anywhere, and using just about anyone's computer (though there are certainly security issues when authenticating if you want to do this). And it will natively pass through just about any firewall (including Application Proxy firewalls).
BUT... and this does suck, you cannot manipulate files directly on the WebDAV share. Files must be copied to local storage, edited, then copied back over.
So... you're looking at Linux, LVM, RAID (hardware preferably), CODA, LVS (if you so desire), Apache, and WebDAV. Reading between the lines this really sounds more like what you are really looking for.
Of course, that's just my opinion. I could be wrong.
http://windows.scares.us
I tried something like this back in 97-98.
Set up nfs servers on all the computers that would store the data (servers), and set up loopback and software raid on the systems that would access it (clients). There was overlap between the two groups.
Created a couple hundred meg file on each of the computers in the exported directory. dd, yadda yadda..
Wrote a short script to mount all the exported trees, slap the files it found on loopback, and copied it around the clients. Made sure it would look for a lockfile, don't want more than one client accessing them at a time. Was a simple touch and exists affair.
Used one machine to make a raid, FS, etc, on the loopbacked devices.
Wrote a second script that would take the loopbacked devices and mount the raid.
Never quite got it to run right tho, just bought a tape drive instead. Guess you could play with it. The significant logical prob seemed to be that until you unmounted the raid and the NFS tree, you couldn't rely on data actually being written. Course, the raid code sucked donkey back then, and the NFS code was just erratic, so.. Mebbe things have improved.
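For anyone who wants to play with it: a plain "touch and exists" lockfile has a race, since two clients can both see the file as missing and both go ahead. A minimal sketch of a tighter variant, assuming the lockfile lives somewhere every client can see, is to create it with O_CREAT | O_EXCL so the check and the create are one step. The name exclusive_lock is invented here, and whether the exclusive create is really atomic on an NFS mount depends on the NFS version and kernel, so treat that part as an assumption.

```python
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold a lockfile for the duration of the with-block.

    os.open with O_CREAT | O_EXCL fails with FileExistsError if the file
    is already there, so only one caller can get past this line.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        # Record who holds the lock, which helps when cleaning up stale ones.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

The mount-and-assemble script would then run inside `with exclusive_lock(...)`, and a second client would get FileExistsError instead of mounting the same devices.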
.sig: Now legally binding!
While all of these ideas are "really cool", let's operate on the KISS principle here. With the low cost of IDE RAID these days, why not just create a RAID 1 mirror set and NFS export it. You could do 200 GB for probably $300 or so, based on a quick check of the prices on Comp USA. And if you really want distributed redundancy, set up a second system with another RAID 1 array. Then rsync the two with cron.
Using nbd or afs is pretty cool, technically speaking, but way overkill for a home network and way more trouble, both to set up and to maintain, than it's worth. Instead, for the cost of one more PC you can set up a very redundant system. And in the extremely unlikely case that you lost both hard drives supporting your primary nfs simultaneously you could redirect yourself to your secondary and keep right on working. In a more likely scenario, where you lose just one drive, you immediately rsync to the secondary, repoint all clients to the secondary, and keep going. Replace the failed drive in your primary, then you could either fail back, or demote it to be the new secondary.
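A rough sketch of the "rsync the two with cron" step, assuming rsync is installed on both boxes and the primary can reach the secondary without a password prompt. The host name "secondary" and the path /export/data are placeholders, and rsync_command and mirror are names made up for this example.

```python
import subprocess


def rsync_command(source_dir, dest_host, dest_dir, delete=True):
    """Build the rsync argument list for mirroring source_dir to another host."""
    cmd = ["rsync", "-a"]  # -a: archive mode (recursive, keeps perms and times)
    if delete:
        cmd.append("--delete")  # drop files on the mirror that are gone locally
    # Trailing slash: copy the directory's contents, not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", f"{dest_host}:{dest_dir}"]
    return cmd


def mirror(source_dir, dest_host, dest_dir):
    """Run the mirror; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(rsync_command(source_dir, dest_host, dest_dir), check=True)
```

A cron job on the primary would call mirror("/export/data", "secondary", "/export/data") as often as you can stand to lose work. Note that --delete also mirrors accidental deletions, so this is redundancy, not history.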
Why make it so hard?
In my universe I'm perfectly normal, it's not my fault you don't live in my universe.
Eight computers????
(deep breath) NERD!!!!!!!!!!
I'd considered the problem from the perspective of grouping up many small hard drives in various boxes to get more and more secure storage for archiving, not something active like compiling your kernel.
As for the floppy example, you should note how good the performance was. He moved a 3.6MB file to it in 32 seconds, that might sound slow to you and me, but 112KB/s, close to the USB maximum throughput. The RAID software used the interface. If I've decided I want to archive something via my network, I've already decided that the delay is worth it. If a net RAID sucks down my data as fast as I can send it, but also gives me error correction, I've done myself a favor by using it. This might not work so well for kernel compiling, but it would be just fine for tar files of images.
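The 112KB/s figure is just the file size divided by the time. A quick check, assuming 1 MB is counted as 1000 KB (the function name is made up):

```python
def throughput_kb_per_s(size_mb, seconds):
    """Average transfer rate in KB/s, taking 1 MB = 1000 KB."""
    return size_mb * 1000 / seconds


rate = throughput_kb_per_s(3.6, 32)
print(f"{rate:.1f} KB/s")  # prints "112.5 KB/s"
```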
Friends don't help friends install M$ junk.
Too many of these threads are focused on Linux/Unix running some kind of experimental File System or some form of file replication tool like rsync. The reality is that most "Common Folk" don't have nor want to run any of this complicated infrastructure. They simply want to install a small little app on their Windows 2000/XP machines on their home network (which is maybe 3 boxes) and have them backup the data between them automatically balancing out free disk space with redundancy. Think of it like RAID 5 over the network.
Now, that all being said, let's think outside of the box a bit. Nearly every one of us that has more than one machine at home can benefit from this type of application. If it's difficult to set up, it simply won't be used by the masses. A good example of a "throw hardware at the problem" type solution is the Mirra (http://www.mirra.com) which should be coming out at the end of the month. If there was some way to set up something similar to the type of thing Mirra provides, but using the distributed resources of the existing computers on the network, then we really have a killer app!
I know this is all dream conjecture at this point, because we all know that something this good just simply doesn't exist, but it certainly sounds like the start of a good open source project. So, here's the challenge: Build something that will run as a service/daemon on Windows and Linux which will share free disk space transparently to the collective for automatic backups of information on other systems. Using things like WMI event triggers, you should be able to update files on other machines as soon as they are altered. The system should be able to broadcast and self configure, be secure, and allow for network interruptions.
Perhaps I'm dreaming here...but this would be the best thing to happen to home networks since cheap ethernet. Mirra+RAID5+AFS=???
Who's game?
"Failure is not an option. It comes bundled with any Microsoft Product."
Novell has a product called iFolder which, if what you really want is file sync between those machines, does it great. I have iFolder on my laptop, home machine and work machine. Unless all three of those die (plus the iFolder server) at the same time all my data in my iFolder is safe. Plus you can restore previous versions of documents out of the conflict resolution bin if you need to (much more common than disk failures...)
It's a blast from the past! Tape, ahh, it is now cheap and boring and will fall apart, but how many of us can say we have a tape backup every Sunday night on our network? I think that is pretty darn cool!
TheOne [ECO] http://www.TeamECO.com Team Leader
I remember there being a program I once tried out, where you ran it on a windows system, and what ever you put on that drive could be seen by everyone else ala mounting a windows share, but the data was not just stored on one machine, it was stored on two. And when you shut down one of those machines, it propagated the survivor to another machine, so it kinda grew and used harddisks over the network. Kinda like a peer-to-peer file server. But I can't remember what it's called... Anybody? I think it was called mango or something, but can't find it! You could set the usage down to 0Mb if you wanted that machine not to contribute to the "collective" and I think there was even a *nix version. Guys? Ideas?
Do the reverse of Fight Club...
Include a few frames of your family photos!
or, you could put the info into the jpg header...
You'll probably want to at least zip it and embed it into another file...
I wonder how many terrist messages I have on my computer?
(George Bush to Donald Rumpfelt, cc: Homeland Infultration department: Lets get the news media to play more clips of 9/11 so I can grab even more power!)
Please use [ informative / summarizing ] SUBJECT LINES
Flame me here
does this.. but it's expensive..
www.neverfailgroup.com
its a bit more than just network backup.
its kind of the dogs...
There
lose
infinite
It's spelled "etc.", not "ect.".
It's an abbreviation for "et cetera".
One of my friends and some other people from the University of Ulm (in Germany) once coded an operating system called "plurix", which does even more than that. The disadvantage is: it is in Java.
Sure, it sounds good on paper, in practice it introduces massive complexity which introduces loads of opportunity for failure.
A better solution:
Move as much of the storage as possible onto a system designated as the server, add the backup device to the server and mirror the data on more than one drive. It'll be more secure, much simpler, better availability and faster.
Government of the people, by corporate executives, for corporate profits.
GPFS (General Parallel File System) from IBM can do this, runs on AIX and Linux. No Windows and quite expensive and not suitable for the casual home user.
//TheToon
Keep It Simple Stupid
Use rsync every 4, 8, 12, or 24 hours, depending on your need. Automate it. Personally, I rsync every week, but my machines are just for my hobby.
Why only weekly? I've been hacked (old BIND issue in '98) and having a snapshot of a week old version of my system helped me determine exactly what happened. Back then, I ran COPS, so that plus the 15k "root attempt failed" emails got my attention.
My solution covers DR and hackers, but may not cover your disk failure impact requirement. Perhaps some mixture of these techniques would provide the best of both worlds? At least for the highly critical files.
Great inventions of Man:
- Fire
- Wheel
- rsync!
If you want raid emulation, NBD works well, but if you are looking for transparent, secure, distributed file storage, I would prefer to use OpenAFS (http://www.openafs.org). It is cross-platform, powerful, and secure, though it does have a learning curve.
I guess it depends what exactly you are trying to do.
LedgerSMB: Open source Accounting/ERP
not really relevant, but may still be of interest to some (just sounds so neat): "Since disk drives are cheap, backup should be cheap too. Of course it does not help to mirror your data by adding more disks to your own computer because a fire, flood, power surge, etc. could still wipe out your local data center. Instead, you should give your files to peers (and in return store their files) so that if a catastrophe strikes your area, you can recover data from surviving peers. The Distributed Internet Backup System (DIBS) is designed to implement this vision. "
http://www.csua.berkeley.edu/~emin/source_code/dibs/
I have two devices in my soho development machine, one hdd is for backup only to cover media failure of my main drive. Each day, faubackup runs against my home dir and makes a new replica over on drive 2. The really neat part is that it uses links in the filesystem, so the new dir *looks* like it should, but only new files actually take up space on the filesystem.
Out of the box, it keeps two yearly images, twelve monthlies, 4 weeklies and 7 daily copies. I have more choices to recover fresh stuff than older stuff (obviously). Old stuff falls off automatically each day too.
Available for Debian. I can *not* over-emphasize how cool this little utility is. It's a real 'set-it-and-forget-it' backup solution.
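To show the link trick, here is a minimal sketch of a hard-link snapshot. This is not faubackup's code, just the general idea as I understand it: a file that is identical to the one in the previous snapshot gets hard-linked to it, and only new or changed files are copied. The function name snapshot and its arguments are invented, both snapshots are assumed to sit on the same filesystem, and symlinks and special files are ignored.

```python
import filecmp
import os
import shutil
from pathlib import Path


def snapshot(source, previous, target):
    """Create target as a complete-looking copy of source.

    Files identical to the copy in `previous` (which may be None) are
    hard-linked to it, so they take up no additional data blocks.
    """
    source, target = Path(source), Path(target)
    previous = Path(previous) if previous is not None else None
    target.mkdir(parents=True, exist_ok=True)
    for src in sorted(source.rglob("*")):
        rel = src.relative_to(source)
        dst = target / rel
        if src.is_dir():
            dst.mkdir(parents=True, exist_ok=True)
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        old = previous / rel if previous is not None else None
        if old is not None and old.is_file() and filecmp.cmp(src, old, shallow=False):
            os.link(old, dst)       # unchanged: share the existing inode
        else:
            shutil.copy2(src, dst)  # new or modified: store a fresh copy
```

Deleting an old snapshot directory is then safe, because a file's data only goes away once the last link to it is removed. The retention schedule (yearly, monthly, weekly, daily) would be a separate step that decides which snapshot directories to delete.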
You might consider EtherDrive storage from www.coraid.com. I use it on Linux with software raid. Works like a champ.
Since no one else has said it... You could export virtual iSCSI disks from all of your hosts using software like Intel iSCSI reference and then remount the disks and RAID the result. Depending on your machine config you could just leave it at that. If you're running a bunch of different platforms your best bet might be to then reexport the RAID as a CIFS or NFS file system from one of the machines.
"HiveCache's revolutionary SwarmBackup and SwarmStorage technology give you high-reliability backup/restore and data storage services by storing data in the free disk space on desktop PCs in your enterprise. HiveCache technology uses peer-to-peer technology to build a reliable and fault-tolerant distributed storage mesh for your backup data, eliminating the network bottlenecks usually associated with network backup systems and without forcing you to purchase costly server storage that will quickly become obsolete."
A really well designed system for backups (not a RAID replacement, but that did not seem to be the question).
There are no trolls. There are no trees out here.
Ain't no DFS on the planet gonna help if (like me) your girlfriend asynchronously decides that "all electrical devices are bad for me, so turn off everything when you're not using it".
The Russians have won. They have made the world a cesspool of distrust, greed, fear and hate.
a fireproof insulated vault, safe and sound from your primary storage copies burning up at the house. separate copies are better than ones in the same physical location
We seem to have a communication hole here. I don't see how your answer relates to my rebuttal of your claims. I'll try to clarify...
Compiling? Why? Why not just log into the box and do your compiling there?
Which box, the one with the left or the right part of the mirror of the RAID system? Remember, this was about a network file system configured in a way, that the overlying RAID system would give redundancy by storing the two parts of a mirror configuration on different computers. In other words: no matter on which machine you are logged in, some disk of the RAID is not on the local computer.
But to come back to your question of why not just log into the box and do your compiling there: Because the idea of the network file system was to connect several (8, IIRC) computers in the LAN. If he could do all his work with only 1 computer, he would have no reason to have a LAN to begin with. E.g. different OSes. I don't know if it matters to the original poster, but he explicitly said it was a mixed environment, and I would rather not try to cross-compile Windows to Linux or vice versa if I don't absolutely have to.
As for the floppy example, you should note how good the performance was. He moved a 3.6MB file to it in 32 seconds, that might sound slow to you and me,
I didn't argue the speed of the floppy RAID at all, but the speed of USB.
but 112KB/s, close to the USB maximum throughput.
This was my point. I argued that saying "if you can do RAID over USB..." is bogus, when USB was only good enough, because it was a floppy RAID and as you just said yourself, even then USB barely managed to keep up.
If I've decided I want to archive something via my network, I've already decided that the delay is worth it.
But this was not only about archiving, but about replacing the complete local data storage by a network storage (in order to have a global redundancy).
If a net RAID sucks down my data as fast as I can send it, but also gives me error correction, I've done myself a favor by using it.
I completely agree. But I never argued about that point. What I argued was your claim, that one would not notice the speed loss [at least on Windows].
This might not work so well for kernel compiling, but it would be just fine for tar files of images.
Nope. I already anticipated that argument in my previous post and answered it there: "Ah, and if compiling does not fall into the "data storage" category: Well, simply copy that 50MB log file around, and some seconds become minutes (regarding the nobody would notice a "10 MBit" link)." (if you do log files or images doesn't make a big difference, as long as the file size is counted in MB).
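To put a number on the 50MB example, here is a back-of-envelope sketch. It assumes the link speed is in megabits per second and the file size in megabytes. The efficiency parameter is a made-up fudge factor for protocol overhead and a shared wire, not a measured value.

```python
def transfer_seconds(size_mb, link_mbit_per_s, efficiency=1.0):
    """Seconds to move size_mb megabytes over a link of the given speed."""
    megabits = size_mb * 8  # 8 bits per byte
    return megabits / (link_mbit_per_s * efficiency)


for link in (10, 100):
    print(f"{link} Mbit/s: {transfer_seconds(50, link):.0f} s at best")
```

So even at the theoretical maximum, 50MB needs 40 seconds on a 10 Mbit link against 4 seconds on a 100 Mbit one. If only half the nominal bandwidth is achieved (efficiency=0.5), the 10 Mbit case is 80 seconds.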
Keep an eye on which arguments are silently dropped in replies. Not always, but often times it's very telling.
Build your filesystem using LVM. Then you can shut down your services for two seconds while you take a snapshot of the partition(s). Then restart the services and sync the snapshot. I use this method for my company's backup server(s). It works quite well.
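The ordering is the important part, so here is a small sketch of it. The four callables are placeholders for whatever stops your services, takes the snapshot, restarts the services and copies the snapshot elsewhere. The volume name /dev/vg0/data, the snapshot name and the 1G size are assumptions for illustration, not the poster's actual setup.

```python
def lvcreate_snapshot_command(origin, name, size="1G"):
    """Argument list for an LVM snapshot of the logical volume `origin`."""
    return ["lvcreate", "--snapshot", "--size", size, "--name", name, origin]


def snapshot_backup(stop_services, take_snapshot, start_services, sync_snapshot):
    """Stop, snapshot, restart, then sync while services are already back up."""
    stop_services()
    try:
        snapshot = take_snapshot()
    finally:
        # Restart even if the snapshot failed, so the outage stays short.
        start_services()
    # The slow copy runs against the frozen snapshot, not the live volume.
    sync_snapshot(snapshot)
```

Here take_snapshot would typically run lvcreate_snapshot_command("/dev/vg0/data", "data_snap") through subprocess and return the snapshot's device path. The snapshot should be removed once the sync is done, since it consumes space as the origin volume changes.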
The point is that this is supposed to be for a home network, so even a 10-minute downtime around 2AM shouldn't be a problem! But thanks for the info, I'll have a look!
Write boring code, not shiny code!
www.openafs.org
Conformity is the jailer of freedom and enemy of growth. -JFK
Why don't you consider a pre-owned high-end RAID system?
If you're willing to pay even a couple thousand dollars, you can get a very highly redundant RAID 6 subsystem with high throughput. (or two, if you want to spend more.)
admin@jkoebel.net if you're interested in them. It may be more than you're looking to spend (free software...>=$2000+ RAID cabinets) but if you're interested, I'll work with you on it.
Even if you lose a disk in each mirror, you can still access all your data too.
if you're not a native english speaker, you can be excused for your ignorance. if you are, you're a fucking idiot that needs to go back to grade school.
note the difference:
loose:
1. Not fastened, restrained, or contained: loose bricks.
2. Not taut, fixed, or rigid: a loose anchor line; a loose chair leg.
3. Free from confinement or imprisonment; unfettered: criminals loose in the neighborhood; dogs that are loose on the streets.
4. Not tight-fitting or tightly fitted: loose shoes.
5. Not bound, bundled, stapled, or gathered together: loose papers.
6. Not compact or dense in arrangement or structure: loose gravel.
7. Lacking a sense of restraint or responsibility; idle: loose talk.
8. Not formal; relaxed: a loose atmosphere at the club.
9. Lacking conventional moral restraint in sexual behavior.
10. Not literal or exact: a loose translation.
11. Characterized by a free movement of fluids in the body: a loose cough; loose bowels.
lose:
1. To be unsuccessful in retaining possession of; mislay: He's always losing his car keys.
2.
1. To be deprived of (something one has had): lost her art collection in the fire; lost her job.
2. To be left alone or desolate because of the death of: lost his wife.
3. To be unable to keep alive: a doctor who has lost very few patients.
3. To be unable to keep control or allegiance of: lost his temper at the meeting; is losing supporters by changing his mind.
4. To fail to win; fail in: lost the game; lost the court case.
5. To fail to use or take advantage of: Don't lose a chance to improve your position.
6. To fail to hear, see, or understand: We lost the plane in the fog. I lost her when she started speaking about thermodynamics.
7.
1. To let (oneself) become unable to find the way.
2. To remove (oneself), as from everyday reality into a fantasy world.
8. To rid oneself of: lost five pounds.
9. To consume aimlessly; waste: lost a week in idle occupations.
10. To wander from or become ignorant of: lose one's way.
11.
1. To elude or outdistance: lost their pursuers.
2. To be outdistanced by: chased the thieves but lost them.
12. To become slow by (a specified amount of time). Used of a timepiece.
13. To cause or result in the loss of: Failure to reply to the advertisement lost her the job.
14. To cause to be destroyed. Usually used in the passive: Both planes were lost in the crash.
15. To cause to be damned.
the fact you're having problems with RPMs demonstrates more than half your problem - you're using the wrong distro. you should be using debian, or at least using the apt tools on RedHat/SuSE/whatever the fuck distro you're running that's using RPMs. Me, I use windows.
maybe he was abbreviating ectoplasmic, for some reason?
*shrugs*
erm, Sir Haxalot, perhaps?
c'mon, you stay up till 2AM more often than you do 3AM... so, why not change the time? But, if you weren't thorough about that, you probably weren't thorough about anything else, either. Best not to touch anything.
If you reboot the NBD server, the connection is lost. You'll have to restart the nbd-client process to make it work again.
Obviously, if two servers go down, you'll start losing data. Of course, that's a property of RAID5, not of NBD...
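To illustrate why one lost server is survivable and two are not, here is a toy sketch of single-parity (RAID5-style) recovery using XOR. It works on small in-memory byte strings of equal length. Real RAID5 also rotates the parity across disks, which is left out here, and the function names are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


data = [b"disk", b"fail", b"test"]  # one block per server
parity = xor_blocks(data)           # stored on a fourth server

# One server lost: XOR of everything that is left gives the block back.
assert rebuild_missing([data[0], data[2]], parity) == data[1]
```

With two blocks gone, the same XOR only yields the two missing blocks XORed together, which is not enough to separate them.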
Yes, the connection would be lost, but since they are talking about RAID on this thing, when the node comes back online you'd have to rebuild the entire "virtual disk" which would be fairly time consuming and network intensive I'd imagine. Even if there was some mechanism in place that did a consistency check on the "disk" to see what data needs to be updated to be in sync with the raid volume, it would be slow. And most RAIDs just rebuild the volume, I don't think the Linux software raid is any different. (could be wrong, never used Linux raid before.)
And yes, it's a property of RAID, of course, obviously.
I sure wouldn't trust my data to this type of system. Too many points of failure. Like I said, your average RAID wasn't designed for this type of application. A mirror set might work okay, but I can think of a lot better ways to accomplish redundant data on a network - plus, I don't think duplicating data is exactly what the guy had in mind on his question.
- It's not the Macs I hate. It's Digg users. -
He's reading /. and is therefore probably a computer geek.
What makes you think he goes to sleep before 2am?
What do people who have no set schedule do?