Slashdot Mirror


Distributed Internet Backup System

deadfx writes "Since disk drives are cheap, backup should be cheap too. Of course it does not help to mirror your data by adding more disks to your own computer because a fire, flood, power surge, etc. could still wipe out your local data center. Instead, you should give your files to peers (and in return store their files) so that if a catastrophe strikes your area, you can recover data from surviving peers. The Distributed Internet Backup System (DIBS) is designed to implement this vision."

31 of 303 comments

  1. Problem = bandwidth. by caluml · · Score: 5, Insightful

    The main problem with this approach (and for that matter Freenet) is that it is slow for all but the smallest files.

    Bandwidth is still the most precious commodity in computing. Once we get fibre to every house, then distributed storage will make sense.

    1. Re:Problem = bandwidth. by nano2nd · · Score: 5, Insightful

      You're right in that today's infrastructure isn't made for chuffing massive, hard-drive-sized hunks of data back and forth.

      But what about incremental backups?

      OK so you've got to get your base image uploaded -somehow- but after that, data changes very little on a daily basis and this level of data transfer to some secure backup repository won't be a problem at all with current bandwidth.
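
      The incremental step described above can be sketched as follows. This is only an illustration, not how DIBS or any particular tool works: the `build_manifest` and `changed_files` names are hypothetical, and files are modelled as in-memory byte strings so the example stays self-contained.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """Content hash used to detect whether a file changed since the last backup."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: Dict[str, bytes]) -> Dict[str, str]:
    """Record one hash per file name; this is what you keep after the base upload."""
    return {name: digest(data) for name, data in files.items()}


def changed_files(files: Dict[str, bytes], manifest: Dict[str, str]) -> List[str]:
    """Return the names that are new or whose content differs from the manifest."""
    return sorted(
        name for name, data in files.items() if manifest.get(name) != digest(data)
    )


# Base image: everything is uploaded once and the manifest is saved.
base = {"a.txt": b"hello", "b.txt": b"world"}
manifest = build_manifest(base)

# Next day: one file edited, one file added, one untouched.
today = {"a.txt": b"hello", "b.txt": b"world!", "c.txt": b"new"}
to_upload = changed_files(today, manifest)
```

      Only the names in `to_upload` need to cross the network on the daily run; the base image is still the expensive part.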

    2. Re:Problem = bandwidth. by gmuslera · · Score: 4, Insightful

      For internal networks where you have a lot of fast, well-connected servers, sparing a bit of bandwidth and disk space to have a distributed backup across the LAN could be useful, especially when you can back up server data to workstations and so on.

    3. Re:Problem = bandwidth. by mark_lybarger · · Score: 3, Funny

      writing a disaster recovery plan ... $1000

      implementing procedures corporate wide ... $10,000

      having that plan be effective during an actual disaster recovery ... priceless

      everyone has a plan. tests it and everything. but when the email server crashes, and the backup tapes cannot be recovered and the VP stores all their email on the server (it's backed up right?), the fan starts blowing little brown chunks all around.

    4. Re:Problem = bandwidth. by regen · · Score: 3, Interesting

      Except, what happens when you need to do a complete restore?

      You might try to counter this by saying, how often do you need to do a complete restore? Well, we are talking about offsite backup. Usually when you have to go to offsite backup to restore something it is because you had some sort of catastrophic failure and need to completely restore your environment.

  2. Ok, start sending me your code, Blizzard by Quarters · · Score: 4, Funny

    I've got my terabyte array set up. Your "World of Warcraft" data will be completely secure on my backup node.

    Go ahead, send it.

    I'm waiting....

    1. Re:Ok, start sending me your code, Blizzard by Gojira+Shipi-Taro · · Score: 3, Funny

      I think there is already a system like this. Something like Kazoo or Cuzaa... The RIAA uses it to back up their data...

      --
      "Oh my God. This is terrible. This is the end of my Presidency. I'm fucked."; ~ Donald J. Trump
  3. All my data and software... by ackthpt · · Score: 5, Funny
    All my data and software are backed up on crackers' computers.

    I'm not worried. %-)

    --

    A feeling of having made the same mistake before: Deja Foobar
    1. Re:All my data and software... by BillFarber · · Score: 5, Funny

      Are you saying you only use white-peoples' computers for backup?

  4. do this with schools by octalgirl · · Score: 5, Interesting

    We do this with neighbor school districts. We also back up all buildings, over the WAN and at night, to a file on the hard drive of another building. We do this in two places, so backups criss-cross. Because of the size and time it takes, this can only happen at night and only one building per night, so there is a downside. But if a building goes down, I know I have a secondary (besides the tape in that building) to fall back on.

  5. Security? by vano2001 · · Score: 5, Interesting

    What if it is sensitive data? Do you think even with all that cryptography and secure computing blabla people will trust storing their important files on other people's computers? I think not. There are companies who put their backups into safes ... ask *them* to put it online on a slashdot reader's PC. See what they answer. Freenet and similar networks are only good for general [public] domain data.

    1. Re:Security? by Guido69 · · Score: 3, Insightful

      I agree. This may be a perfectly fine way to back up your terabyte ogg/mp3/pr0n archive, but no way will any major corps take it seriously. That has nothing to do with how secure it really is, and more to do with executive perception.

      --
      - If we aren't supposed to eat animals, then why are they made out of meat? - Steven Wright
  6. I can't see this being a go, any time soon. by saskboy · · Score: 4, Insightful

    As has been mentioned already, [no this is not redundant, because I am writing this myself] the potential for data being stolen is too great an issue to overlook. This is not a viable option because the potential for theft is too great, and no amount of encryption will make a difference. Encryption will always be broken.

    --
    Saskboy's blog is good. 9 out of 10 dentists agree.
  7. Would this work in the current [US] legal climate? by Michalson · · Score: 3, Insightful

    What is to say that the FBI/RIAA won't come to your house, claiming you have terrorist information/stolen music stored on your hard drive? And assuming it was true, would you be legally/criminally liable for it? This gives a whole new meaning to the excuse "well I was just holding it for a friend".

  8. Don't trust them to return your files by PepperedApple · · Score: 4, Insightful

    It's not so much that I wouldn't trust someone not to break the encryption, but what if the person who's holding your backup copies gets tired of giving up disk storage and just deletes the software from his/her computer? Or what if their computer happens to be off when you want to retrieve the backup?

  9. And what if by Apparition-X · · Score: 4, Interesting

    I grant that personal backup is time consuming and it is tough to find a good method without resorting to expensive tape or hundreds of CDs. But as intriguing as this approach is, there seem to be a lot of problems with it.

    What if the reason you need to do a recovery is because your system with internet access is toast? How long does it take to restore several hundred thousand files? What about peers that drop off the network, or that are only on sporadically (no, that never happens in peer to peer filesharing networks!).

    Even aside from the issues of speed of restoration, I can't imagine too many circumstances in which you want to rely on an internet connection as a prerequisite for a successful restore... Although perhaps as a way of complementing existing backup methodologies (i.e. backup root and critical config information to tape or CD, and the rest of your schiznit to DIBS) this might have a place.

  10. First rate idea by mao+che+minh · · Score: 3, Funny
    I hereby volunteer to aid in the storage and backup duties of everyone's data that has at least three instances of the letter "x" or the string "britney" within its file name. This is because my backup scripts only save files that satisfy these requirements. In return, could someone please help me store my vast collection of Star Trek bloopers? It's just funny to hear Patrick Stewart cuss.

    Additionally, I extend a warm hand of support to Microsoft. I will accept any request by chairman Bill Gates to store sensitive files.

  11. Re:Would this work in the current [US] legal clima by kryzx · · Score: 3, Insightful
    This is actually a good question. If I back up my music files on your computer, does that fall under "fair use"? Would whether you access them or not affect the legal position? Is it possible to build something like this so my files can only be accessed, or at least can only be decrypted, by me, and hence are not usable to the person providing the disk space? If so, would that change the legal implications?

    This raises all sorts of interesting questions. Unfortunately the answer to all of these questions is most likely "we won't know until it goes to court and there is a ruling to establish precedent."

    --
    "I don't know half of you half as well as I should like, and I like less than half of you half as well as you deserve."
  12. So the truth is out. by nlinecomputers · · Score: 3, Funny

    So THAT is what happened to Duke Nukem Forever!

    --
    Slashdot, home of supporters of free software, free music, and free speech. Except for Moderators that disagree with you.
  13. Private Peer to Peer (PP2P) by 4/3PI*R^3 · · Score: 4, Informative

    This is just the next evolutionary change in P2P. Encrypting data and exchanging the encryption key so that only those "in the know" can exchange files and the *AA groups don't know what you are trading.

    In the "Perfect Example of Talking Out of Both Sides Of Your Mouth" Department:

    This is posted on the home page:
    Note that DIBS is a backup system not a file sharing system like Napster, Gnutella, Kazaa, etc. In fact, DIBS encrypts all data transmissions so that the peers you trade files with can not access your data.[emphasis mine]

    This is posted on the documentation page:
    Make sure you give your gpg public key to any peers you want to trade files with.[emphasis mine]

  14. Also compare rdiff-backup and duplicity by wfrp01 · · Score: 4, Informative

    Some nice folks at Stanford are also creating a different flavor of network backup called rdiff-backup. I'll just plagiarize the description from the homepage:

    rdiff-backup backs up one directory to another, possibly over a network. The target directory ends up a copy of the source directory, but extra reverse diffs are stored in a special subdirectory of that target directory, so you can still recover files lost some time ago. The idea is to combine the best features of a mirror and an incremental backup. rdiff-backup also preserves subdirectories, hard links, dev files, permissions, uid/gid ownership (if it is running as root), and modification times. Finally, rdiff-backup can operate in a bandwidth efficient manner over a pipe, like rsync. Thus you can use rdiff-backup and ssh to securely back a hard drive up to a remote location, and only the differences will be transmitted.

    The homepage also links to a project called duplicity, which operates on a similar principle, but uses GnuPG to encrypt data to prevent spying/modification.

    --

    --Lawrence Lessig for Congress!
  15. This idea is not new by fudgefactor7 · · Score: 4, Insightful

    It's been discussed (and even tried) before; the problems were many, namely security, speed, and availability. One cannot guarantee any of those three very important variables. As a result it (the idea) died a horrible death--let's hope it dies again.

  16. Re:Would this work in the current [US] legal clima by Michalson · · Score: 3, Insightful

    Unfortunately I think it would be bad *either* way. Now since "stolen music" is somewhat debatable here on /., and most people aren't too worried about being charged with terrorism, I'll try something more clear cut: Kiddie pron.

    Ruling 1: You are responsible for what is on your HD. Result: Someone backs up their illegal pics to your hard drive (you don't know this because it's encrypted), and you (innocent) get charged for it and sent to jail.

    Ruling 2: You are not responsible for encrypted content that appears to have been generated by this netbackup program. Result: Every pedophile's dream has come true. They simply encrypt their stuff and spoof it to look like someone else's backup file. They are now immune from prosecution because "it's someone else's". The same applies to anyone else who wants to store something illegal on a computer system.

    Obviously there needs to be a way to positively identify who "owns" what content on your hard drive before a system like this could become [legally] safe.

  17. Distributed RAID Like Backups by angry_beaver · · Score: 5, Interesting

    This should work a little differently.
    Why not stripe your data across many hosts, with parity data being stored on several? A central server would maintain a list of servers containing your data. In the event of a failure, you would simply fire up the client, which would contact this server for a list of your backup "devices" and start pulling in, reconstructing and decrypting the data.
    This would have a couple bonuses...

    1) You could stripe it across 100 machines, and have another 100 with parity data so that any 50% of the machines can be unavailable and you can still get your data back.

    2) Security - Rather than having a full copy of your data on their machine, each node only has a small subset of your data, and does not know where to find the rest of the data making reconstruction nearly impossible for the storage node. GPG would be used on top of this.
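
    A minimal sketch of the striping idea, under loud assumptions: it uses a single XOR parity stripe (RAID-5 style), which only survives the loss of one stripe. Surviving the loss of any half of the hosts, as in point 1, would need a stronger erasure code such as Reed-Solomon. The `make_stripes` and `recover` names are hypothetical, and encryption and the central server are left out entirely.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_stripes(data: bytes, n: int) -> List[bytes]:
    """Split data into n zero-padded stripes and append one XOR parity stripe."""
    size = -(-len(data) // n)  # ceiling division
    padded = data.ljust(size * n, b"\0")
    stripes = [padded[i * size:(i + 1) * size] for i in range(n)]
    return stripes + [reduce(xor_bytes, stripes)]


def recover(stripes: List[Optional[bytes]], length: int) -> bytes:
    """Rebuild the original data when at most one stripe (None) is missing."""
    missing = [i for i, s in enumerate(stripes) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one missing stripe")
    repaired = list(stripes)
    if missing:
        present = [s for s in stripes if s is not None]
        # XOR of all surviving stripes (data + parity) equals the missing one.
        repaired[missing[0]] = reduce(xor_bytes, present)
    return b"".join(repaired[:-1])[:length]


data = b"distributed backup"
shares = make_stripes(data, 4)   # 4 data stripes + 1 parity, one per host
shares[2] = None                 # pretend the host holding stripe 2 is gone
restored = recover(shares, len(data))
```

    Each host sees only its own stripe, which matches point 2, but a stripe of plaintext is still readable, so encrypting before striping remains necessary.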

  18. Why not just use OpenAFS? by rindeee · · Score: 4, Informative

    It was designed for use in low-bandwidth environments. Not only do you get the benefit of a distributed backup system, but you get inherent fault-tolerance, load-balancing, etc. Yes, over a low-bandwidth connection a file still takes a long time to copy, but OpenAFS is designed to accommodate this (not going into detail here, go to the OpenAFS site if you're curious). I am a fanatic OpenAFS user so I am somewhat biased. We have however implemented OpenAFS on a 1.4TB datastore at one of our customer sites (medical market) that has key data (a couple hundred Gig) distributed to 3 slave RO cells (again, read up on OpenAFS for answers). Rock solid reliability is an understatement.

  19. dibs vs rsync by bromoseltzer · · Score: 4, Interesting
    I peer with another system at another institution using rsync. They rsync their files to a folder on my disk, and I rsync to a folder on theirs. No encryption, but very good performance - 128 kbs DSL upload is fine, running overnight.
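
    A rough back-of-the-envelope check of what an overnight run can move, assuming "128 kbs" means 128 kilobits per second and ignoring protocol overhead and rsync's delta compression. The function names are made up for this sketch.

```python
def megabytes_transferred(rate_kbit_per_s: float, hours: float) -> float:
    """Data moved at a steady rate, in megabytes (10**6 bytes)."""
    bits = rate_kbit_per_s * 1000 * hours * 3600
    return bits / 8 / 1_000_000


def hours_needed(size_gb: float, rate_kbit_per_s: float) -> float:
    """Hours to move size_gb gigabytes (10**9 bytes) at a steady rate."""
    bits = size_gb * 1e9 * 8
    return bits / (rate_kbit_per_s * 1000) / 3600


overnight_mb = megabytes_transferred(128, 8)   # an 8-hour overnight window
one_gb_hours = hours_needed(1, 128)            # a single gigabyte of changes
```

    Under those assumptions an 8-hour window moves roughly 460 MB, and one gigabyte takes a bit over 17 hours, so this works for nightly changes but not for a full initial copy.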

    This requires a lot of trust, which is OK because I'm the sysadmin at both places.

    Without trust, you need DIBS-like encryption, which (probably) means no rsync-like differential backups, and you need a "safe" way to find partners.

    How about "DIBS-raid" where your data is spread over many peers? If a peer blows up, you can still recover, and no one peer should have a recognizable piece of your data.

    -Martin

    This .sig donated to Poets Against the War.

    --
    Fiat Lux.
  20. Hivecache by Glass+of+Water · · Score: 3, Interesting
    This is similar to hivecache. I believe hivecache is in use in the wild. The difference is that hivecache seems to be specifically oriented to large enterprises.

    I think that people who worry about "putting their files on other people's machines" should go over the docs once more.

    --
    There are no trolls. There are no trees out here.
  21. Who would take Pete Townshend's files? by someguyintoronto · · Score: 4, Funny

    Seriously, what would be the legal ramifications if illegal data was stored on someone else's computer?

    Would this backup system be an easy way to hide illegal content?

    What if the RIAA went after someone for keeping a bunch of legal MP3s?

    Too many cans... Too many worms...

  22. Critical analysis. This is a bad idea. by almaw · · Score: 3, Interesting
    Reasons why this is a truly impressively bad idea:
    • Poor availability: If you're storing it on home-type machines, typical availability is probably <50%. Assuming no hardware failure and independent downtime, if you store your data across four machines that are each up 50% of the time, there is a 6.25% chance (0.5^4) that all four will be down at once and you can't get the data back when you want it. Lower availability makes this worse.
    • Slow networks cause slow backup retrieval.
    • Most people want to back up all their data, as sifting through it to find the bits you do or don't want to backup is difficult. Now, once you've performed the initial backup, you can do incremental backups, which cuts bandwidth requirements, but you still have to initially transfer up to multiple gigabytes over a slow internet connection.
    • If a peer drops off the network, you must transfer all the data across to a new machine to maintain the same level of availability.
    • If it's properly distributed, you can place no guarantees on the quality of service (i.e. the speed/reliability). Peers can go away and never come back without warning. Data would have to be massively replicated (1000 to 1 or more) for it to be considered vaguely secure. If there is implied trust between peers (i.e. two people know each other and authorise the data movement), this problem is mitigated.
    • Massively prone to poor cryptography. If you use very strong cryptography, the system becomes very slow. You really need physical data separation for this.
    • Requires an internet connection. Won't work from behind firewalls, etc. This is pretty obvious, but is still a factor.
    • Bugs are difficult to fix, as you have to maintain backwards compatibility between versions. Hardware solutions (or simple software ones like mirroring) aren't so prone to bugs. Because this is a complex software solution, there are bound to be bugs. Anything that can go wrong will. :)
    • Due to the lower reliability of this system per node compared to say a RAID array, it's more expensive per megabyte. Note that it *has* to be lower - you're comparing the reliability of a HDD/tape in a normal backup scenario to a HDD+network+supporting computers.
    • Prolly lots of other stuff I've missed that other people have covered.
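
    The availability arithmetic in the first bullet can be checked with a short sketch. It assumes each peer is up or down independently with the same probability, which real home machines will not strictly satisfy, and the function names are hypothetical.

```python
def p_all_down(availability: float, copies: int) -> float:
    """Chance that every one of `copies` independent peers is offline at once."""
    return (1 - availability) ** copies


def copies_needed(availability: float, max_unavailable: float) -> int:
    """Smallest number of full copies keeping p_all_down at or below the target."""
    if not 0 < availability <= 1:
        raise ValueError("availability must be in (0, 1]")
    if max_unavailable <= 0 and availability < 1:
        raise ValueError("a zero target is unreachable with imperfect peers")
    copies = 1
    while p_all_down(availability, copies) > max_unavailable:
        copies += 1
    return copies


four_peers = p_all_down(0.5, 4)        # the four-machine case from the list
for_999 = copies_needed(0.5, 0.001)    # copies for a 0.1% chance of no access
```

    With 50%-available peers, four full copies leave a 6.25% chance of being locked out, and it takes ten copies to get under 0.1%, far short of the 1000-to-1 replication suggested above but still expensive without erasure coding.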
  23. response from DIBS author by emin · · Score: 3, Interesting

    A lot of people have pointed out issues related to security, bandwidth, efficiency, etc. My vision is that DIBS will be designed to take these things into account.

    For example, DIBS uses GPG to encrypt and sign all communications so that peers can't read the data they are storing for you and so that other people can't pretend to be you and store their files with your peers.

    Also, my vision is to include state-of-the-art erasure correction codes so DIBS uses redundancy efficiently. (Erasure correction codes are a generalization of the parity checks used by RAID.) In fact, I have already written a python implementation of Reed-Solomon codes available at www.csua.berkeley.edu/~emin/source_code/py_ecc. I haven't had time to put this into DIBS yet since I'm currently working on my PhD at MIT and that keeps me pretty busy.

    Incremental backup is another feature I'm planning to add. There are some issues with how incremental backup interacts with encryption and erasure correction. I think resolving these issues may take a little more thought so I might have to wait until I graduate, become a professor and get some grad students of my own to help me.

    A Slashdot post isn't the place to go into all the arguments for or against DIBS. However, I think distributed backup is a viable idea. While there are some serious issues, I believe that through clever engineering, we can solve them and create a cheap, simple, efficient, and secure backup system usable by anyone with a network connection.

    I decided to start writing a distributed backup prototype like DIBS in order to find out what the major issues are and how to address them. Sure, currently DIBS has some flaws, but it is a prototype written by a grad student. With more feedback from the community and some more development effort I believe DIBS can become a valuable tool. If you agree, I invite you to join the development effort, or try it out and tell me how you think it could be improved, or even take whatever parts you find useful and make something better. The project page is at sourceforge.

  24. Re:Problem = bandwidth. (solution?) by racermd · · Score: 3, Insightful

    Ideally, you should be able to make your computer fail *COMPLETELY* and still be able to recover completely. The distributed backup plan seems to have different specific advantages for two specific groups of home users, but has the same overall beneficial results.

    For the average Joe with only one computer running that ancient copy of Windows98 on a P133, the massive amount of data-cruft is bound to be the weakest point of upgrading or even backing up. I've found that most families only have that one computer, and only have the option of backing up onto floppies. Usually their data can fit on one or two CDR/CDRW discs, but their system is also usually too old to get a cd burner to work reliably. In addition, they're just too stingy with the purse-strings to shell out the $100 or so for a decent, middle-of-the-pack drive, anyway. Sending critical data over the internet might be a better option, if a bit more time-consuming (no broadband, only 56k modem). Frequent backups like this have the potential to be substantially more reliable, not to mention scores easier, than a pile of floppies as you're ideally only sending the new data. I can't tell you how often I wished for something like this when working on a friend's/family's system across town and away from my own network.

    And that brings me to my second group that can really take advantage of something like this: Power-users with a small network running at home. My network has a file-server that stores *EVERYTHING* on it for backup purposes. It's got ISO's of all my software and OS's, drivers, stand-alone programs, documents, and media files. Currently, there's about 80GB of data on there. Backing up that data is a Travan-5 drive (10GB/tape, native) and 9 cartridges. At about 3 hours per tape, backing up to 9 TR-5 tapes takes days, not hours. There's two additional tapes for backup of the server's OS and configuration and it easily fits on one tape. But if there are any significant changes to the system, I rotate the tape so that there's always a working copy in case things go terribly wrong. That's a total of 11 tapes. They're not exactly cheap, but it's probably the least expensive backup I can find right now without going to removable HDs (I'm avoiding that solution as HDs are, in my opinion, less reliable and durable than tapes). Using this distributed backup plan would allow me to recover my server's OS from the single tape and retrieve the data from the network when I have time.

    The 2 desktops and 2 laptops can be fully recovered with an OS or system recovery cd and the rest is available on the server. In fact, I usually have one of each type of computer down at any given time for something-or-other. Having the data on the server allows me to blow away any of the systems I run at any time and completely recover the system to a working state in just over an hour.

    Actually, I had been setting up a distributed backup plan for my own server with some of my friends so we'd all have each others' server's backup. More accurately, the plan was to merge the changes between all the servers' data and share it between all of us in a manner similar to CVS. There's only 3 of us, but we're located all over the state and we all have broadband. 80GB of data is a large amount to initially transfer. Really, though, all we'd be transmitting is the changes we've made which would limit the total bandwidth used. We'd probably only set it up for once per week in automatic mode to further decrease the load with an option to manually update. In the event of a complete failure of one of the systems, there should be a copy from one of the other two servers that's no older than 1 week. As the storage requirements grow, each server can be updated with additional storage in sequence so that it recovers in a manner similar to how a RAID5 array rebuilds the data on a replaced drive.

    Unfortunately, neither of my two friends in question has the resources to afford the hardware and set up their own server to the reliability standards that I'm requiring, so it kind of fell through for now. I'm working with them on how to get everything running, and I may just maintain it for them from a remote console. They'll still host the server on their network and have access to it, of course. But the responsibility of maintaining the system may just have to lie with me.

    In short, it's not terribly difficult to implement a solution like this, but there are serious bandwidth concerns. If you're only doing this amongst your friends/peers, it's possible to mitigate the bandwidth issue by using a single removable hard disk to sneakernet the data to a fresh server. This allows for a much more reliable home network for power-users, and gives some peace-of-mind to the average user (and their power-user friends who fix their computer for them).

    --
    My sources are unreliable, but their information is fascinating. -- Ashleigh Brilliant