Cringely's P2P Backup Idea
gewg_ writes "If Napster and BitTorrent had a baby, would it be Baxter? As a follow-on to Cringely's last column, where he talked about having a backup strategy in the wake of Hurricane Frances, this week he proposes a distributed RAID notion as a solution."
Baxter is, of course, the famous IRC client for BeOS. (Hi, Seth!)
Get off my launchpad!
Depending on exactly what you have stored, millions of people may want to help you back up as soon as possible.
The coolest voice ever.
I think this is old news. Some people have been backing up the source code for viruses that they wrote on Kazaa for months now.
Buy Steampunk Clothing Online!
Well, we leave the data where it belongs: in the proxy network where the processes live too. Still a bit incomplete, but maturing WebDAV and mountable slices forthcoming...
Just insert a bunch of data into the network.. record the keys and retrieve once a week then delete. That should keep the data retrievable from the network for a good while. Using two nodes would help. Plus everything is encrypted with some heavy shit.
:(
Or, just make a local-freenet on the company lan.. everything is encrypted and unretrievable without the proper keys, so it's very secure and it's distributed.. + FEC encoding.
That assumes freenet works, AFAIK it's still fucking broken. Ian Clarke is playing too much politics with the project and the only coder that really understands freenet (Matthew Toseland) is swamped with ideas, day after day.. it just gets worse and worse... The donations seemed like a good idea, but after watching the DEV list for the last 18 months, I realize it's a failed project.
Skype Me! username: john_allen_mohammed
In case they missed it.
"Backups are for wimps. Real men upload their data to an FTP site and have everyone else mirror it."
But on the serious side, the claim of using encryption to store data on someone's hard drive worries me. Let's say the encryption gets broken. Now you might get Aunt Nedda's cookie recipes, but then again, you might get BobCo's strategic investment plan for the next 6 months as well. I can see people signing up just for the chance to hunt through people's data.
Cringely's not the first with this kind of idea. In fact, the Freenet Project already implements something to this effect. Although not specifically designed for reliable backups, the distributed caching algorithms essentially replicate data towards where it's most often needed, helping to improve network performance and creating copies of important data along the way so that it won't be destroyed if a central server fails. Obviously not a commercial solution, but very interesting.
From the article:
It sounds from every description like the solution is Linux-specific, but I'm sure it can be made to work with other UNIX variants, especially since Gmail, itself, runs on Apple xServe 1u boxes. Windows compatibility is unknown, but I'm sure someone will solve that soon.
I know, it's a little childish, but I get a good feeling when I see something small...even this little thing here...that thinks of other OS's first and Windows compatibility will be "real soon now" or something like that.
"Leo Fender was in a 'state of grace' when he designed the Stratocaster." -- Paul Reed Smith
Well, I lived through this storm, checking my PC upstairs to make sure nothing was going to damage it. If the storm was risking the roof flying off and my room becoming flooded, I would have taken out my hdd. This sounds like a brilliant idea.
;)
Hey, it beats trying to store data to gmail accounts!
mysql>SELECT * FROM users WHERE clue > 0
0 Rows Returned
Ideas like Cringely's will be impossible if the INDUCE Act passes.
Save Betamax is a national Congress call-in day this tuesday to oppose the INDUCE Act. It might be our last chance to stop this bill.
I had this idea in about '97 or '98. I looked around to see if anyone else had done anything like this (remember, this is kinda pre-mass-P2P) and found that someone had done so, but on a business scale solution. I think it was called Mango, and is still in production today. It essentially made a portion of your drive available for a drive letter, then whatever was copied onto it could be seen by all. The data was stored in at least 2 places, so if one went down, there was still one copy, and the remaining copy would duplicate, so that there was always at least 2 copies. In the end, I think nobody went for it because it was too expensive... But this is EXACTLY what a lot of Small-Medium businesses need atm. Bring on the Mangos!
As a bonus, you can use it to transport data (eg. your mp3 collection) between places, or even use it to boot linux anywhere with much more space and document storage capability than Knoppix.
It's a neat idea. In a nutshell, he suggests a Peer to Peer encrypted storage network. You get exactly as much storage room as you are willing to offer yourself for others to use. When you store anything, it's encrypted and automatically spread to other systems.
It doesn't make for a very safe backup, though: What happens if somebody decides to stop the service and just deletes his local storage? You've got no more backup at least for a while, and you might not even know it. And of course, other people have head crashes, too, which would also obliterate your backup at least for the time it takes to recreate it from your own data. Of course, by that time, you might have deleted it yourself, either by accident or knowingly, since you have a backup after all. A viable solution would be to store every file multiple times on different remote servers, although that'd lower the storage capacity you get. It's still the right step, though.
The crucial problem is that the service provider can't really give any guarantees that you will be able to regain your lost data. With three or more independent copies in different locations, it's very unlikely that the backup won't work for some reason, but a backup that's not 100% is not a very useful one, especially in those situations where backups are really crucial.
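To put rough numbers on the multiple-copies argument above, here is a minimal sketch. It assumes every remote copy is lost independently with the same probability during the time it takes to notice a loss and re-replicate. Both that independence assumption and the example probability of 0.1 are illustrative; neither comes from Cringely's column.

```python
def loss_probability(p_single: float, copies: int) -> float:
    """Chance that all `copies` independent replicas are gone at the same time."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if copies < 1:
        raise ValueError("need at least one copy")
    return p_single ** copies


def copies_needed(p_single: float, target: float) -> int:
    """Smallest replica count whose joint loss probability is <= target."""
    if not 0.0 <= p_single < 1.0:
        raise ValueError("p_single must be in [0, 1)")
    if target <= 0.0:
        raise ValueError("target must be positive; zero risk is unreachable")
    copies = 1
    while loss_probability(p_single, copies) > target:
        copies += 1
    return copies


if __name__ == "__main__":
    # With a 10% chance of losing any one copy, three copies fail together
    # roughly one time in a thousand under the independence assumption.
    print(loss_probability(0.1, 3))
    print(copies_needed(0.1, 0.005))
```

The point the sketch makes is the same one the comment makes: more copies push the risk down quickly, but it never reaches zero, and every extra copy costs donated space.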
It's still a neat idea, and to my knowledge has not been done to that degree of sophistication. Of course, as others suggest, nobody is stopping you from inserting encrypted data into Freenet, but that's nowhere near as fast and secure as this could be. And while it's not a true backup, it's better than no backup at all, and most likely enough security for many persons.
Switch back to Slashdot's D1 system.
http://www.csua.berkeley.edu/~emin/source_code/dibs/
which is open source and also
http://www.hivecache.com/ which will be commercial 'real soon now'
Peer Pressure
If your character data was stored on everyone else's computer, it would act like a virtual server, where if a few data sets get hacked, they'd be corrected by the whole.
P2P can work in wild ways we haven't even tapped.
too bad orrin hatch is trying to outlaw p2p:
www.geocities.com/James_Sager_PA
God spoke to me.
Foldershare
We use foldershare for peer-to-peer backup, but the catch is that you invite people that you trust to your libraries.
For backup purposes, I only invite myself and just connect another computer to the account.
Thank you Mario! But our princess is in another castle!
How many times would you have to duplicate the data to ensure that no corruption (both intentional and unintentional) occurred? You would have to compare copies of the data to each other to make sure it matched. I wouldn't want my backup corrupt because some joker wrote Goatse.cx pictures to it a few thousand times. You would also have to store additional data in the event that people ran the program and then quit, taking your backup along with them. So maybe you would have 1gb backed up over the network, and 10gb of other people's crap on your computer. And that's assuming it ran on some sort of credit system where you only got to backup a percentage of what you allowed people to store. Otherwise hoarders would run rampant and take over the system.
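One way to picture the "compare copies" step is a hash check. The sketch below is an assumption about how such a system could work, not a description of any product mentioned in this thread. If the owner kept a SHA-256 digest locally, any single matching copy is enough. Without a digest, it falls back to a strict majority vote among the copies that came back.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Hex SHA-256 of one retrieved copy."""
    return hashlib.sha256(data).hexdigest()


def pick_intact_copy(copies, expected_digest=None):
    """Return a copy believed to be intact, or None if none can be trusted.

    copies: list of bytes objects fetched from different peers.
    expected_digest: digest the owner recorded before uploading, if any.
    """
    if expected_digest is not None:
        for copy in copies:
            if digest(copy) == expected_digest:
                return copy
        return None
    if not copies:
        return None
    counts = Counter(digest(copy) for copy in copies)
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(copies):
        return None  # no strict majority, so nothing can be trusted
    return next(copy for copy in copies if digest(copy) == winner)
```

Keeping the digest at home means one honest peer is enough, which is cheaper than storing enough copies to win a vote against jokers.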
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
I just went through Hurricane Ivan in Grenada. If you have been watching the coverage you should know that our island was completely destroyed. There is no water, no electricity, and no security. The university I attend (St. George's) lied to the students' parents about our situation. There were looters with guns and machetes threatening students. The first two nights we fended for ourselves with a large bonfire and homemade weapons, knives, pipes, etc. The third night we had 10 minutes to pack up and leave since we could see the looters setting fire to apartment buildings on the road we were on. I quickly took the hard drives out of my two laptops (and the external drive I have), picked up a GSM roaming phone, any cash I had, a passport and two pairs of clothes. We ran to campus. Campus had about 200 male students lighting bonfires and running security teams to monitor the area. We chartered our own jet out of Grenada yesterday to Barbados which is where I am writing this from. My point is this: no one cares about data in this situation. No one wants to know about RAID or tape backups. If it came down to it, I would have run with only a passport, a phone, and cash. We were worried for our lives and whether we had water or not; data was not our concern. People need a reality check. How many of you can claim that you went through a Category III or IV hurricane on an isolated island fending for your lives? Not many, so quite frankly Cringely can go to hell.
That depends upon what you consider 'better'.
Large businesses have a scheduling process and hire people to swap tapes, move tapes in and out of the various facilities, rotate tapes, and replace tapes that are no longer reliable. This process is done on a 24x7x365 (plus leap days) basis. Most of the data is actually being backed up via tape silos and 'robots' to handle the actual tapes while the various backups are happening, but it is still a significant investment in people.
A small business may be able to get away with burning a CD-R or CD-RW every night with that day's transactions, and a small stack of CD-R (or RW) every weekend which they take home and store in a CD spindle in their freezer, or something. Though I think you would be hard pressed to find a small business that actually does that. (I am sure there are some that do.) Monthly or quarterly they should be taking a spindle of archived data to a remote relative's place to provide further archival of data.
Mid sized businesses are in a bit of a quandary. The number of tapes needed for a good backup is more than anyone really wants to haul around, handle and store at home, but they are not sure it is worth the expense of using a commercial off-site backup service either.
A project like this may be just what they are looking for. No tapes or disks to try to keep track of. Everything compressed and encrypted, so it is reasonably secure. Retrieval can start as soon as the replacement system is ready to start retrieving it.
I personally think it should be trialed only as a supplement to some other backup strategy, but even then, someone would decide it was either too much of a hassle, or not reliable enough.
There are even people here who think it is 'reasonable' to haul around 160 or 250 Gig hard drives to backup their critical data.
-Rusty
You never know...
This idea is poorly thought out. It has a couple of *major* flaws, imo.
#1) It doesn't recognize the reality of the complexity of backup software. Kinda easy to gloss over 'automated' backups without ever describing it. Pretty hard to imagine some piece of software that can universally back stuff up on everyone's hard drive and at the same time be very easy to use. Imagining mom/dad trying to use software with capabilities similar to Veritas BackupExec isn't easy. And.. imagine the wide variety of live files and databases that it would have to handle.
#2) Data integrity. He suggests a 1:1 ratio for backup space. Not hardly. How is he going to have any kind of redundancy with that? Crashes and people unsubscribing will happen all the time. The data would have to have a *lot* of tolerance to that.
A parity solution wouldn't be nearly enough. That assumes that only 1 failure at a time happens (using RAID 5 as my basis here). It would be easy to imagine that one person unsubscribed with part of your data and another had a crash or corruption problem.
So.. complete mirroring would be necessary. Again, it's easy to imagine 2 people's system going offline at the same time.. so, you'd probably need more than 2x Mirror. At this point... how much is enough to ensure reliability? 3x 4x 5x ? ? ? How much do you trust your average netizen?
So.. pick your number and then divide your backup space by it. Like 5x? Add 10GB and you have 2GB usable storage. Not very good.
I'll just skip over the 'auto backup' of people's 40GB storage over a 128K up line for now.. already typed too much...
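The two back-of-envelope figures above (5x mirroring, and 40GB over a 128K uplink) can be worked through in a few lines. This is a sketch under stated assumptions: "128K" is read as 128 kbit/s, gigabytes are decimal, and the line is fully saturated with no protocol overhead, so the real time would be longer.

```python
def usable_storage_gb(donated_gb: float, replication_factor: int) -> float:
    """Space you can back up if every byte is mirrored `replication_factor` times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return donated_gb / replication_factor


def upload_days(size_gb: float, uplink_kbit_per_s: float) -> float:
    """Days needed to push `size_gb` through an uplink running flat out."""
    bits = size_gb * 1e9 * 8                     # decimal gigabytes to bits
    seconds = bits / (uplink_kbit_per_s * 1e3)   # kbit/s to bit/s
    return seconds / 86_400                      # seconds per day


if __name__ == "__main__":
    print(usable_storage_gb(10, 5))          # 10GB donated at 5x mirroring
    print(round(upload_days(40, 128), 1))    # 40GB over 128 kbit/s
```

Under those assumptions the initial 40GB upload takes about 29 days of continuous transfer for a single copy, before any mirroring multiplies the traffic.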
A company called 312, Inc. already has a commercial product for P2P backups called Lean On Me.
I don't work for them, etc.
Cringely is adding nothing new here. We've all already seen this on Slashdot. Hell, the website even mentions how it's like P2P but not.
I lost interest in what this guy has to say when I read this:
"But while it might be easy to use Gmail for offsite backup, I couldn't bring myself to do that just because of the intrusive nature of Gmail. Remember this is a system that is by invitation only, which means that Google can quickly map a social network establishing who knows who. And since Gmail actually analyzes the content of your e-mail and can automatically group it by subject (how creepy is that?), Google not only knows who your friends are, but what do you talk about with those friends."
I nominate this to the prestigious "Fud of the week" award.
I did some research into this on my B.Sc. thesis, in essence it's a solution looking for a problem.
The thing is, you want backups because you want to be able to get it back, with this (and my idea) you have little control over the backup; in short words, it's not a backup.
FreeNet may at a first ignorant glance be a solution to this dilemma, however, you still have the same terror of doubt. Because you're not in control!
To summarize, there is a difference between not wanting to lose something, and wanting something.
If you don't get something you want, it hurts, if you lose something you need, it kills.
Control is everything, even if you have a 50% success rate and you know it you'll be quite happy. You will not like a 60% +/-40% success rate.
I've also been suggesting this for years. I'm too lazy to search for the older posts, but here is one from July:
http://slashdot.org/comments.pl?sid=115027&cid=9743518
Of course what matters, though, is not talking about ideas, but *doing* them.
The company I work for (banking) sells storage for 120 euro per gigabyte per year to our internal clients. That's storage on RAID-disks (think StorageTek and the like), including backup (on tape) and all necessary services (people doing maintenance, restoring backups, etc). 120 euro / gigabyte / year comes to 1,22 dollar / month / 100 megabytes (compare to 8 $ per month with Apple). Considering our 1,22 $ plus some network costs, plus maintaining a billing system for a couple of million clients, and a bit of profit margin, maybe 8 $ per month is not a rip-off.
I have a photographic memory for numbers. I know almost a hundred of them.
For larger, business-driven uses, you probably want something like DataSafe. They will keep media for you in a very safe place. Or better yet, keep your whole business disaster protected -have more than one live site for IT operations.
I would have moderated you into oblivion given the chance.
I genuinely feel for you and your struggle for safety given the recent events, and you have my deepest sincere sympathy...
But that is not what this article is about. And how about this, given the chance to either leave my data behind or fend for myself given those circumstances...I'd stay with my data.
Perhaps your data isn't a life or death matter to you, but my stacks of CDs, DVDs and hard drives with the past 15 years of my writing, graphics, and (most importantly) my recording sessions....over 500gb by now probably...it is indeed worth it for me to ensure it is safe. Even under such circumstances. The very thought of that data no longer existing is sickening to me...
Not to undervalue your experiences at all. I mean that genuinely. But this article was about data backup--a form of backup that would have saved you even more time in your race to protect your neck.
I fail to see how this is informative to the topic at hand when all I see is someone poo-pooing a genuine concern with a slightly related story.
I'm willing to bet far more slashdotters than just myself value their data as much, if not more...risk life and limb for it? I probably would...it is just that important to me....which is why I would want to back it up in the first place.
When I was working at a factory last year, I was part of an IT team supporting 1000+ PCs. An idea I thought of, but haven't had much time or chance to flesh out, was a "peer-redundant file system," in which all those computers could have background hosts serving up a specified amount of space for use by anyone on the same network. The space would be treated like a block of sectors on a network-based drive, allocated by a master server, and made redundant through a desired number of hosts (anytime data gets posted, it should go to at least one random host, plus any more needed for redundancy). As people leave systems on, or turn them off, their shares could be updated by peers or the master server, and be able to sustain the desired space with as few as 1/3 hosts. Using the space would be easy: all client systems would have the same mount or drive letter, with the background software managing the behavior of the drive.
This situation solves two problems: one, having a network file share run out of space; two, a need for redundant backup. I suspect it could be done using existing peer-sharing software as a core.
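The master-server allocation described above could be sketched roughly as follows. Everything here is hypothetical: the function names, the host names, and the choice to track free space as a simple per-host block count are illustrative assumptions, not a design the poster spelled out.

```python
import random


def place_block(free_blocks, replicas, rng=random):
    """Reserve one block on `replicas` distinct random hosts; return their names.

    free_blocks maps host name -> number of free blocks that host still
    offers. It is updated in place, the way a master server's table would be.
    """
    candidates = sorted(host for host, free in free_blocks.items() if free > 0)
    if len(candidates) < replicas:
        raise RuntimeError("not enough hosts with free space")
    chosen = rng.sample(candidates, replicas)
    for host in chosen:
        free_blocks[host] -= 1
    return chosen


def is_readable(holders, online_hosts):
    """A block can be read while at least one host holding it is switched on."""
    return any(host in online_hosts for host in holders)


def needs_repair(holders, online_hosts, replicas):
    """True when fewer than `replicas` holders are online, so a re-copy is due."""
    live = sum(1 for host in holders if host in online_hosts)
    return live < replicas
```

With three replicas per block, a block stays readable as long as any one of its three holders is on. Sustaining the whole drive with only a third of the hosts online, as the poster hopes, would also depend on how the replicas happen to be spread.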
Life is irony, and nothing ever goes as planned.
I was thinking about something like this for video activists who frequently have their tapes/discs confiscated by the cops. It'd be great if they had PocketPCs with webcams that were operating in a baxterian sort of way such that the video they were taking was simultaneously being recorded to the storage of other activists/media within wifi range. You could have wifi NAS (network storage) in vehicles and apartments surrounding the demonstration area, as well as on ipod-level storage in future wifi enabled pocketpcs. 3G cameraphones with hard drives might provide another simpler option, if they could be networked together in a p2p fashion. The cops might be able to confiscate my webcam and pocketpc, but my recordings (and proof) would be elsewhere in the aether.
geeks are cats who dig a certain kind of cool
There are several research groups doing work on distributed P2P backup systems. I know there's a group at MS doing this, as well as a group at MIT (http://catfish.csail.mit.edu/~kbarr/pstore/), and several others that don't come to mind offhand. I did a project on this in grad school, so I'm familiar with the research.
:)
There are a lot of issues here, mostly centering around the fact that you can't trust people in an open P2P network.
1) They might look at your data.
2) They might not be online when you want your data.
3) They might delete your data, or do other malicious things to it (insert viruses, etc.).
4) They might freeload by using space on other hosts and then deleting all the data they receive.
5) If a host leaves the system permanently, you need to detect that and replicate its data somewhere else. Also, how do you know whether it's leaving permanently or just logging off for a while?
#1 is easy, just encrypt the data. #2, #3, #4, and #5 are hard because data integrity is really important in a backup solution. You end up having to replicate the data all over the place to "ensure" that it'll be available when you need it, but then you've got the problem of having to donate more space than you receive to use the system. Plus, it's still not certain that your data will be available when you need it.
Basically what I'm trying to say is that it's a hard problem.
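For problems #3 and #4, one commonly discussed building block is a spot check: before uploading, the owner precomputes a few one-time challenges, and later asks the peer to prove it still holds the bytes. The sketch below uses HMAC-SHA256 from the Python standard library. It is an illustrative assumption, not something the research projects named above are stated to use, and all function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_challenges(data: bytes, count: int):
    """Owner side, run before upload: build one-time (nonce, expected) pairs."""
    challenges = []
    for _ in range(count):
        nonce = secrets.token_bytes(16)
        expected = hmac.new(nonce, data, hashlib.sha256).digest()
        challenges.append((nonce, expected))
    return challenges


def peer_response(stored: bytes, nonce: bytes) -> bytes:
    """Peer side: the answer can only be computed from the full stored data."""
    return hmac.new(nonce, stored, hashlib.sha256).digest()


def peer_still_has_data(challenge, response: bytes) -> bool:
    """Owner side: compare the peer's answer with the precomputed one."""
    _nonce, expected = challenge
    return hmac.compare_digest(expected, response)
```

Each challenge is good for a single use, since a peer could cache an answer it has already given. The check also says nothing about whether the peer will be online when a restore is needed, so #2 and #5 remain open.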
Nothing new here. Check out Berkeley's OceanStore project for an idea of a global storage solution impervious to local disasters.
Looks like he might like Pastiche.
An equivalent idea was proposed in about 1982, at the dawn of the internet. Simply tar your filesystem, then email the tar to yourself along a lengthy old-style routing chain. If you need your data back, just wait for the email to arrive and untar it. You could tune the recovery latency by adjusting the routing chain. Of course, over dialup uucp, even a one-node out-and-back path could result in a two day latency.
Man, those were the days.
When all you have is a hammer, everything looks like a skull.
Error correction gets a lot more sophisticated than checksums, you know. You can make a Reed-Solomon codec for 8-bit code words with 255-byte encoded blocks having any even number of parity bytes, and the way optimal RS codes work is that you can recover the original data as long as the number of missing code words plus twice the number of corrupted code words is no more than the number of parity code words you chose.
So, you divide your data into chunks 225 bytes long. Each byte in a chunk goes to a different peer, and each of the 30 parity bytes also goes to a different peer. Then, even if a dozen peers have simultaneously unsubscribed or crashed and their shares haven't been replicated on new peers yet, you can still recover all your data from the shares that remain.
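The striping idea above can be sketched with a small erasure-only code over GF(2^8). This is a simplified stand-in for a production Reed-Solomon codec. It treats the data bytes as values of a polynomial at points 0..k-1 and the parity bytes as its values at further points, so any k surviving shares rebuild the chunk. It handles only missing shares. Detecting silently corrupted shares would need a full RS decoder or a per-share hash. The field polynomial 0x11D and all function names are choices made for this illustration.

```python
# Log/antilog tables for GF(2^8) with primitive polynomial 0x11D, generator 2.
EXP = [0] * 510
LOG = [0] * 256
_value = 1
for _power in range(255):
    EXP[_power] = EXP[_power + 255] = _value
    LOG[_value] = _power
    _value <<= 1
    if _value & 0x100:
        _value ^= 0x11D


def gf_mul(a: int, b: int) -> int:
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_div(a: int, b: int) -> int:
    # b is never zero here because interpolation points are distinct.
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]


def evaluate_at(points, x: int) -> int:
    """Lagrange-interpolate through `points` and evaluate the result at x."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = gf_mul(num, x ^ xj)    # subtraction in GF(2^8) is XOR
                den = gf_mul(den, xi ^ xj)
        total ^= gf_mul(yi, gf_div(num, den))
    return total


def encode(chunk: bytes, parity: int) -> list:
    """Return len(chunk) data shares followed by `parity` parity shares."""
    k = len(chunk)
    if k == 0 or k + parity > 255:
        raise ValueError("need 1 <= data + parity <= 255 shares")
    points = list(enumerate(chunk))
    return list(chunk) + [evaluate_at(points, x) for x in range(k, k + parity)]


def recover(shares: dict, k: int) -> bytes:
    """Rebuild the k data bytes from any k surviving {share_index: byte} pairs."""
    if len(shares) < k:
        raise ValueError("too many shares missing")
    points = sorted(shares.items())[:k]
    return bytes(shares[x] if x in shares else evaluate_at(points, x)
                 for x in range(k))
```

With the poster's parameters, `encode(chunk, 30)` on a 225-byte chunk yields 255 one-byte shares, one per peer, and `recover` succeeds with up to 30 of them missing. This pure-Python version is slow and meant only to show the mechanics.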
What if the martians invade the moon? WHERES YOUR DATA NOW! PUNK!