Distributed Internet Backup System
deadfx writes "Since disk drives are cheap, backup should be cheap too. Of course it does not help to mirror your data by adding more disks to your own computer because a fire, flood, power surge, etc. could still wipe out your local data center. Instead, you should give your files to peers (and in return store their files) so that if a catastrophe strikes your area, you can recover data from surviving peers. The Distributed Internet Backup System (DIBS) is designed to implement this vision."
We do this with neighboring school districts. We also back up all buildings, over the WAN and at night, to a file on the hard drive of another building. We do this in two places, so backups criss-cross. Because of the size and the time it takes, this can only happen at night and only one building per night, so there is a downside. But if a building goes down, I know I have a secondary (besides the tape in that building) to fall back on.
Why in the world would I ever put my data on someone else's machine? I spend my life keeping people out of my network.....
What if it is sensitive data? Do you think that, even with all that cryptography and secure computing blabla, people will trust storing their important files on other people's computers? I think not. There are companies who put their backups into safes ... ask *them* to put it online on a Slashdot reader's PC. See what they answer.
Freenet and similar networks are only good for general [public] domain data
With this system all other P2P networks will go bye-bye
Why bother searching for files when I have my friend's 200GB movie and MP3 collection backed up on my machine!
It's not copying, it's a back-up! 8)
__Syo
I grant that personal backup is time-consuming and it is tough to find a good method without resorting to expensive tape or hundreds of CDs. But as intriguing as this approach is, there seem to be a lot of problems with it.
What if the reason you need to do a recovery is because your system with internet access is toast? How long does it take to restore several hundred thousand files? What about peers that drop off the network, or that are only on sporadically (no, that never happens in peer to peer filesharing networks!).
Even aside from the issues of speed of restoration, I can't imagine too many circumstances in which you want to rely on an internet connection as a prerequisite for a successful restore... Although perhaps as a way of complementing existing backup methodologies (i.e. backup root and critical config information to tape or CD, and the rest of your schiznit to DIBS) this might have a place.
This should work a little differently.
Why not stripe your data across many hosts, with parity data being stored on several? A central server would maintain a list of servers containing your data. In the event of a failure, you would simply fire up the client, which would contact this server for a list of your backup "devices" and start pulling in, reconstructing and decrypting the data.
This would have a couple bonuses...
1) You could stripe it across 100 machines, and have another 100 with parity data, so that any 50% of the machines can be unavailable and you can still get your data back.
2) Security - Rather than having a full copy of your data on their machine, each node only has a small subset of your data and does not know where to find the rest, making reconstruction nearly impossible for the storage node. GPG would be used on top of this.
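To make the striping idea concrete, here is a minimal Python sketch of the simplest case: N data stripes plus one XOR parity stripe, RAID-style. The function names are made up for illustration and nothing here is taken from DIBS. Note that a single XOR parity stripe only survives the loss of one node; tolerating the loss of any 50% of the machines would need a stronger erasure code such as Reed-Solomon.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def stripe_with_parity(data: bytes, n_data: int) -> list:
    """Split data into n_data equal stripes plus one XOR parity stripe."""
    stripe_len = -(-len(data) // n_data)  # ceiling division
    padded = data.ljust(stripe_len * n_data, b"\0")
    stripes = [padded[i * stripe_len:(i + 1) * stripe_len] for i in range(n_data)]
    parity = reduce(xor_bytes, stripes)
    return stripes + [parity]

def recover(stripes: list, original_len: int) -> bytes:
    """Rebuild the data when at most one stripe (marked None) is missing."""
    missing = [i for i, s in enumerate(stripes) if s is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can only repair one lost stripe")
    stripes = list(stripes)
    if missing:
        # XOR of all surviving stripes (data and parity) equals the lost one.
        present = [s for s in stripes if s is not None]
        stripes[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity stripe and the zero padding.
    return b"".join(stripes[:-1])[:original_len]
```

Each stripe would go to a different host; in a real system every stripe would also be encrypted before it leaves your machine.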
And I don't want anyone else to have mine.
What if you back up something illegal?
I can keep all my files on CD-Rs, CD-RWs, or DVD-Rs.
(not including MP3s, movies, etc., stuff I can always get again)
Hell, I could keep them on Zips if it weren't for some graphics I want to save.
Just back up your data; you can reinstall your programs and OS later. Tarball your project files and burn them to a CD. Most projects will fit on a CD, assuming you're not a photographer.
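For anyone who wants to script the tarball step, a rough Python sketch using the standard tarfile module follows. The 700 MB CD capacity is an assumption (nominal CD-R size), and the function names are only illustrative.

```python
import tarfile
from pathlib import Path

CD_CAPACITY = 700 * 1024 * 1024  # assumed nominal CD-R capacity in bytes

def make_tarball(project_dir: str, archive_path: str) -> int:
    """Pack project_dir into a gzip-compressed tarball; return its size in bytes."""
    src = Path(project_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the directory name, not the full absolute path.
        tar.add(str(src), arcname=src.name)
    return Path(archive_path).stat().st_size

def fits_on_cd(archive_path: str) -> bool:
    """Check whether the archive is small enough for a single CD."""
    return Path(archive_path).stat().st_size <= CD_CAPACITY
```

Write the archive somewhere outside the project directory, otherwise the tarball will try to include itself.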
This requires a lot of trust, which is OK because I'm the sysadmin at both places.
Without trust, you need DIBS-like encryption, which (probably) means no rsync-like differential backups, and you need a "safe" way to find partners.
How about "DIBS-raid" where your data is spread over many peers? If a peer blows up, you can still recover, and no one peer should have a recognizable piece of your data.
-Martin
This .sig donated to Poets Against the War.
Fiat Lux.
I have about half a terabyte of sensitive, important data that needs to be backed up and stored securely offsite every day (This data is just the important stuff. No OS files, etc.) and archives of records stored on several CD-Rs that also need to be stored offsite. The only dependable(?) solution we can commit to is tape backup. We use an Exabyte EZ17 autoloader and Veritas Backup Exec.
You guys wouldn't believe the nightmares I've gone through to get it running smoothly and keep it there. 5 or so replaced EZ17s, 50 $80 tapes replaced, hours upon hours spent on the phone with Veritas because their software is buggy as hell and their open file option is a piece of shit written by another company (Veritas support was the one to tell me that!). My boss seems to think that we're the only ones that have issues with backups (He's the type that has no opinions. He KNOWS everything.), but I've talked with other administrators with a lot of servers and data using a plethora (Three Amigos vocabulary) of various backup products. We all agreed that backups are a pain in the ass.
I think that people who worry about "putting their files on other people's machines" should go over the docs once more.
There are no trolls. There are no trees out here.
I put up all my pictures on the net and let google, the wayback search engines, and everyone else in the world archive it all for me.
Been a pretty good backup plan so far.
Except, what happens when you need to do a complete restore?
You might try to counter this by saying, how often do you need to do a complete restore? Well, we are talking about offsite backup. Usually when you have to go to offsite backup to restore something, it is because you had some sort of catastrophic failure and need to completely restore your environment.
The Economics of Website Security
A lot of people have pointed out issues related to security, bandwidth, efficiency, etc. My vision is that DIBS will be designed to take these things into account.
For example, DIBS uses GPG to encrypt and sign all communications so that peers can't read the data they are storing for you and so that other people can't pretend to be you and store their files with your peers.
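As a rough illustration of the encrypt-and-sign step, the Python sketch below builds a gpg command line that signs a file with your key and encrypts it to that same key, so that only you can read it back. This is not DIBS's actual invocation; the function names and the idea of encrypting to your own key are assumptions, and the flags used are standard GnuPG options.

```python
import subprocess

def gpg_encrypt_sign_cmd(src: str, dst: str, key_id: str) -> list:
    """Build a gpg command that signs src with key_id and encrypts it to key_id."""
    return [
        "gpg", "--batch", "--yes",
        "--local-user", key_id,   # key used to sign
        "--recipient", key_id,    # key allowed to decrypt (yourself)
        "--sign", "--encrypt",
        "--output", dst,
        src,
    ]

def encrypt_for_peer_storage(src: str, dst: str, key_id: str) -> None:
    """Run gpg on the file; raises CalledProcessError if gpg fails."""
    subprocess.run(gpg_encrypt_sign_cmd(src, dst, key_id), check=True)
```

The resulting file is what would be handed to a peer; the peer sees only ciphertext, and the signature lets you check on restore that nobody swapped the file.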
Also, my vision is to include state-of-the-art erasure correction codes so DIBS uses redundancy efficiently. (Erasure correction codes are a generalization of the parity checks used by RAID.) In fact, I have already written a Python implementation of Reed-Solomon codes, available at www.csua.berkeley.edu/~emin/source_code/py_ecc. I haven't had time to put this into DIBS yet since I'm currently working on my PhD at MIT and that keeps me pretty busy.
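To show what "generalization of parity" means, here is a toy Reed-Solomon-style erasure code in Python. It is not the py_ecc implementation; it is a minimal sketch that works over the integers mod 257 (an assumption chosen so every byte value is a field element) and uses plain Lagrange interpolation. The k data bytes are treated as values of a polynomial at x = 0..k-1, extra shares are the same polynomial evaluated at further points, and any k of the n shares recover the data.

```python
P = 257  # smallest prime above 255, so every byte value is a field element

def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial through the (xi, yi) points, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data: bytes, n: int):
    """Turn k = len(data) symbols into n shares (x, y); the first k are the data."""
    k = len(data)
    if not k <= n <= P:
        raise ValueError("need len(data) <= n <= 257")
    points = list(enumerate(data))
    return points + [(x, lagrange_eval(points, x)) for x in range(k, n)]

def decode(shares, k: int) -> bytes:
    """Recover the k data symbols from any k surviving shares."""
    if len(shares) < k:
        raise ValueError("need at least k shares")
    subset = shares[:k]
    return bytes(lagrange_eval(subset, x) for x in range(k))
```

With k = 4 and n = 8, for instance, any four of the eight shares are enough, which is the "lose half the machines and still restore" property. Redundant share values can equal 256, so they are kept as integers here rather than bytes; a production code would typically work in GF(256) instead.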
Incremental backup is another feature I'm planning to add. There are some issues with how incremental backup interacts with encryption and erasure correction. I think resolving these issues may take a little more thought so I might have to wait until I graduate, become a professor and get some grad students of my own to help me.
A Slashdot post isn't the place to go into all the arguments for or against DIBS. However, I think distributed backup is a viable idea. While there are some serious issues, I believe that through clever engineering, we can solve them and create a cheap, simple, efficient, and secure backup system usable by anyone with a network connection.
I decided to start writing a distributed backup prototype like DIBS in order to find out what the major issues are and how to address them. Sure, currently DIBS has some flaws, but it is a prototype written by a grad student. With more feedback from the community and some more development effort I believe DIBS can become a valuable tool. If you agree, I invite you to join the development effort, or try it out and tell me how you think it could be improved, or even take whatever parts you find useful and make something better. The project page is at SourceForge.
I live in Indiana. My mother lives in Georgia. My father lives in Arizona. My grandmother lives in Quebec. My aunt lives in Brazil. My brother lives in France. I have put together a datacenter in a closet in each of their houses. Each datacenter consists of two OpenBSD boxes serving as a multihost firewall and six FreeBSD boxes running the services I require. All of my data is mirrored daily to all of these centers. Most of my files are managed with CVS, too. Thus, I am confident that even in a disaster of biblical proportions, such as my toilet overflowing and damaging the hard drive, my data will be safe.