Distributed Internet Backup System

Posted by CmdrTaco on Friday January 31, 2003 @04:05AM from the here-have-my-drive dept.

deadfx writes "Since disk drives are cheap, backup should be cheap too. Of course it does not help to mirror your data by adding more disks to your own computer because a fire, flood, power surge, etc. could still wipe out your local data center. Instead, you should give your files to peers (and in return store their files) so that if a catastrophe strikes your area, you can recover data from surviving peers. The Distributed Internet Backup System (DIBS) is designed to implement this vision."

do this with schools by octalgirl · 2003-01-31 04:09 · Score: 5, Interesting

We do this with neighboring school districts. We also back up all buildings, over the WAN and at night, to a file on the hard drive of another building. We do this in two places, so backups criss-cross. Because of the size and time it takes, this can only happen at night and only one building per night, so there is a downside. But if a building goes down, I know I have a secondary (besides the tape in that building) to fall back on.
Security? by vano2001 · 2003-01-31 04:12 · Score: 5, Interesting

What if it is sensitive data? Do you think even with all that cryptography and secure computing blabla people will trust storing their important files on other people's computers? I think not. There are companies who put their backups into safes ... ask *them* to put it online on a slashdot reader's PC. See what they answer. Freenet and similar networks are only good for general [public] domain data.
And what if by Apparition-X · 2003-01-31 04:17 · Score: 4, Interesting

I grant that personal backup is time consuming and it is tough to find a good method without resorting to expensive tape or hundreds of CDs. But as intriguing as this approach is, there seem to be a lot of problems with it.

What if the reason you need to do a recovery is because your system with internet access is toast? How long does it take to restore several hundred thousand files? What about peers that drop off the network, or that are only on sporadically (no, that never happens in peer to peer filesharing networks!).

Even aside from the issues of speed of restoration, I can't imagine too many circumstances in which you want to rely on an internet connection as a prerequisite for a successful restore... Although perhaps as a way of complementing existing backup methodologies (i.e. backup root and critical config information to tape or CD, and the rest of your schiznit to DIBS) this might have a place.
Distributed RAID Like Backups by angry_beaver · 2003-01-31 04:29 · Score: 5, Interesting

This should work a little differently.
Why not stripe your data across many hosts, with parity data being stored on several? A central server would maintain a list of servers containing your data. In the event of a failure, you would simply fire up the client, which would contact this server for a list of your backup "devices" and start pulling in, reconstructing and decrypting the data.
This would have a couple of bonuses...

1) You could stripe it across 100 machines, and have another 100 with parity data so that any 50% of the machines can be unavailable and you can still get your data back.

2) Security - Rather than having a full copy of your data on their machine, each node only has a small subset of your data, and does not know where to find the rest of the data making reconstruction nearly impossible for the storage node. GPG would be used on top of this.
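
To make the striping idea concrete, here is a minimal sketch that assumes the simplest possible scheme: equal-size blocks plus a single XOR parity block, as in RAID 4/5. The function names (`stripe`, `rebuild`, `xor_blocks`) are hypothetical and not part of DIBS. Note that one XOR parity block only survives the loss of one host; tolerating the loss of any 50% of the machines would need a stronger erasure code such as Reed-Solomon.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe(data: bytes, hosts: int):
    """Split data into `hosts` equal-size blocks (zero-padded) plus one parity block."""
    size = -(-len(data) // hosts)  # ceiling division
    padded = data.ljust(size * hosts, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(hosts)]
    return blocks, xor_blocks(blocks)


def rebuild(blocks, parity):
    """Recover the single block marked None from the survivors and the parity block."""
    missing = blocks.index(None)
    survivors = [b for b in blocks if b is not None]
    restored = list(blocks)
    restored[missing] = xor_blocks(survivors + [parity])
    return restored


# Usage: lose one of four hosts and still get the data back.
data = b"offsite backup payload"
blocks, parity = stripe(data, 4)
blocks[2] = None  # simulate one unreachable host
recovered = b"".join(rebuild(blocks, parity))[:len(data)]
assert recovered == data
```

Each host sees only its own block, which matches the "small subset" point above, though the blocks would still need encrypting before they leave your machine.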
dibs vs rsync by bromoseltzer · 2003-01-31 04:37 · Score: 4, Interesting

I peer with another system at another institution using rsync. They rsync their files to a folder on my disk, and I rsync to a folder on theirs. No encryption, but very good performance - 128 kbps DSL upload is fine, running overnight.
This requires a lot of trust, which is OK because I'm the sysadmin at both places.
Without trust, you need DIBS-like encryption, which (probably) means no rsync-like differential backups, and you need a "safe" way to find partners.
How about "DIBS-raid" where your data is spread over many peers? If a peer blows up, you can still recover, and no one peer should have a recognizable piece of your data.
-Martin
This .sig donated to Poets Against the War.

--
Fiat Lux.
Hivecache by Glass+of+Water · 2003-01-31 04:52 · Score: 3, Interesting

This is similar to hivecache. I believe hivecache is in use in the wild. The difference is that hivecache seems to be specifically oriented to large enterprises.
I think that people who worry about "putting their files on other people's machines" should go over the docs once more.

--
There are no trolls. There are no trees out here.
Critical analysis. This is a bad idea. by almaw · 2003-01-31 05:18 · Score: 3, Interesting
Reasons why this is a truly impressively bad idea:
- Poor availability: If you're storing it on home-type machines, typical availability is probably <50%. Assuming no hardware failure, if you store your data across four machines, you have at least a 6.25% chance (0.5^4) that all four machines will be down at once and you can't get the data back when you want it.
- Slow networks cause slow backup retrieval.
- Most people want to back up all their data, as sifting through it to find the bits you do or don't want to back up is difficult. Now, once you've performed the initial backup, you can do incremental backups, which cuts bandwidth requirements, but you still have to initially transfer up to multiple gigabytes over a slow internet connection.
- If a peer drops off the network, you must transfer all the data across to a new machine to maintain the same level of availability.
- If it's properly distributed, you can place no guarantees on the quality of service (i.e. the speed/reliability). Peers can go away and never come back without warning. Data would have to be massively replicated (1000 to 1 or more) for it to be considered vaguely secure. If there is implied trust between peers (i.e. two people know each other and authorise the data movement), this problem is mitigated.
- Massively prone to poor cryptography. If you use very strong cryptography, the system becomes very slow. You really need physical data separation for this.
- Requires an internet connection. Won't work from behind firewalls, etc. This is pretty obvious, but is still a factor.
- Bugs are difficult to fix, as you have to maintain backwards compatibility between versions. Hardware solutions (or simple software ones like mirroring) aren't so prone to bugs. Because this is a complex software solution, there are bound to be bugs. Anything that can go wrong will. :)
- Due to the lower reliability of this system per node compared to say a RAID array, it's more expensive per megabyte. Note that it *has* to be lower - you're comparing the reliability of a HDD/tape in a normal backup scenario to a HDD+network+supporting computers.
- Prolly lots of other stuff I've missed that other people have covered.
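
The availability arithmetic in the first point can be checked with a short sketch. It assumes every peer is up independently with the same probability, which is a simplification (home machines tend to be off at the same hours). The function names are hypothetical. The second function covers the erasure-coded case where any `k` of `n` peers are enough to restore.

```python
from math import comb


def all_replicas_down(availability: float, replicas: int) -> float:
    """Probability that every one of `replicas` full copies is offline at once."""
    return (1.0 - availability) ** replicas


def k_of_n_available(availability: float, n: int, k: int) -> float:
    """Probability that at least k of n independent peers are online (binomial tail)."""
    return sum(
        comb(n, i) * availability ** i * (1.0 - availability) ** (n - i)
        for i in range(k, n + 1)
    )


# Four full copies on peers that are each up half the time.
print(all_replicas_down(0.5, 4))  # 0.0625

# Data spread over 200 peers where any 100 suffice to restore.
print(k_of_n_available(0.5, 200, 100))
```

With peers that are up only half the time, needing half of them online is close to a coin flip, so a scheme like this would want either more redundancy or more reliable peers.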
Re:Problem = bandwidth. by regen · 2003-01-31 05:48 · Score: 3, Interesting

Except, what happens when you need to do a complete restore?

You might try to counter this by saying, how often do you need to do a complete restore? Well, we are talking about offsite backup. Usually when you have to go to offsite backup to restore something it is because you had some sort of catastrophic failure and need to completely restore your environment.

--
The Economics of Website Security
response from DIBS author by emin · 2003-01-31 06:58 · Score: 3, Interesting

A lot of people have pointed out issues related to security, bandwidth, efficiency, etc. My vision is that DIBS will be designed to take these things into account.

For example, DIBS uses GPG to encrypt and sign all communications so that peers can't read the data they are storing for you and so that other people can't pretend to be you and store their files with your peers.
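
As a rough illustration of encrypting and signing a file before handing it to a peer, the sketch below drives the `gpg` command line from Python. This is not DIBS code; the helper names, file names and key IDs are hypothetical, and which keys DIBS actually uses for each step is an assumption here. Only standard `gpg` options are used.

```python
import subprocess


def gpg_encrypt_sign_command(path, recipient, signer, output):
    """Build a gpg command that signs `path` as `signer` and encrypts it to `recipient`."""
    return [
        "gpg", "--batch", "--yes",
        "--local-user", signer,    # key used to sign
        "--recipient", recipient,  # key allowed to decrypt
        "--sign", "--encrypt",
        "--output", output,
        path,
    ]


def encrypt_and_sign(path, recipient, signer, output):
    """Run gpg; raises CalledProcessError if gpg exits with an error."""
    subprocess.run(gpg_encrypt_sign_command(path, recipient, signer, output), check=True)


# Usage (hypothetical key ID): encrypt to your own key so only you can restore.
cmd = gpg_encrypt_sign_command("home.tar", "me@example.org", "me@example.org", "home.tar.gpg")
print(" ".join(cmd))
```

Encrypting to your own key means the storing peer holds only ciphertext, while the signature lets a peer check that a store request really came from you.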

Also, my vision is to include state-of-the-art erasure correction codes so DIBS uses redundancy efficiently. (Erasure correction codes are a generalization of the parity checks used by RAID.) In fact, I have already written a python implementation of Reed-Solomon codes available at www.csua.berkeley.edu/~emin/source_code/py_ecc. I haven't had time to put this into DIBS yet since I'm currently working on my PhD at MIT and that keeps me pretty busy.
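
To show what "generalization of parity" means in practice, here is a toy Reed-Solomon-style erasure code. It is a sketch under stated assumptions and not the py_ecc API: it works over the prime field GF(257) instead of the GF(2^8) used by typical implementations, and the names `encode`, `decode` and `lagrange_eval` are hypothetical. The `k` data bytes define a polynomial of degree below `k`; each of the `n` shares is one evaluation of it, so any `k` surviving shares recover the data.

```python
from itertools import combinations

P = 257  # smallest prime above 255, so every byte value is a field element


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        # Modular inverse via Fermat's little theorem (den is nonzero mod P).
        total = (total + yj * num * pow(den, P - 2, P)) % P
    return total


def encode(data: bytes, n: int):
    """Return n shares (x, y); the first len(data) shares are the data itself."""
    k = len(data)
    assert k <= n <= P
    points = list(enumerate(data))
    return [(x, data[x] if x < k else lagrange_eval(points, x)) for x in range(n)]


def decode(shares, k):
    """Rebuild the k data bytes from any k distinct shares."""
    points = shares[:k]
    return bytes(lagrange_eval(points, x) for x in range(k))


# Usage: 4 data bytes, 8 shares, and every choice of 4 survivors restores the data.
data = b"DIBS"
shares = encode(data, 8)
assert all(decode(list(subset), 4) == data for subset in combinations(shares, 4))
```

A redundancy share can take the value 256, which does not fit in one byte; real implementations avoid this by working in GF(2^8).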

Incremental backup is another feature I'm planning to add. There are some issues with how incremental backup interacts with encryption and erasure correction. I think resolving these issues may take a little more thought so I might have to wait until I graduate, become a professor and get some grad students of my own to help me.

A Slashdot post isn't the place to go into all the arguments for or against DIBS. However, I think distributed backup is a viable idea. While there are some serious issues, I believe that through clever engineering, we can solve them and create a cheap, simple, efficient, and secure backup system usable by anyone with a network connection.

I decided to start writing a distributed backup prototype like DIBS in order to find out what the major issues are and how to address them. Sure, currently DIBS has some flaws, but it is a prototype written by a grad student. With more feedback from the community and some more development effort I believe DIBS can become a valuable tool. If you agree, I invite you to join the development effort, or try it out and tell me how you think it could be improved, or even take whatever parts you find useful and make something better. The project page is at sourceforge.