A Move to Secure Data by Scattering the Pieces
uler writes "The NY Times has an article about an interesting new open source storage project. Unlike data storage mechanisms today that work 'by making multiple copies of data,' the Cleversafe software takes an 'approach based on dispersing data in encrypted slices.' It's an elegant solution and one that's been a long time coming: the software uses algorithmic techniques known by mathematicians since the 70's. Adi Shamir (of RSA) first wrote of information dispersal in his 1979 paper 'How to Share a Secret (pdf).'"
PAR? PAR2?
I've been out of the freenet loop for a long time, but I thought I remembered reading in its documentation a few years ago that it did this same kind of encrypting and dispersing chunks of data.
With all of this encryption technology, people still need to remember basic security tips. Use good passwords ("password" could be cracked very quickly even with 128 bit AES), maintain physical security (hardware keyloggers can find out about the manifesto you're writing before you even save the file) and use common sense.
Before you all ask, yes it does run Linux. The company was actually at Linuxworld.
Information wants a fueled airplane waiting at the hangar and no one gets hurt.
From the article: Cleversafe is significant because it is an open-source project -- that is, the technology will be freely licensed, enabling others to adopt the design to build commercial products. This could be a very important OSS tool.
Think about it: The storage provider doesn't know what he's storing and the user doesn't need any incriminating data on his machine. It's a DRM nightmare...
Storing data in random locations, often garbled beyond all recognition?
Clearly Windows ME's memory -l-e-a-k-s- management made it the most secure OS ever. If only they had some way of reconstructing that data when you wanted it back again.
This concept just adds another layer
"Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
When I was mobilized by the Army to develop a database for Personnel/HR use in the mid 90s, I thought of something similar for data backup. Was not really thinking of it as a security system, more like an 'insurance' system.
Problem was, I did not know enough about developing systems like that, nor did I know enough about getting the idea in front of the people who could make it happen.
The basics were when users in the field made queries the returned data would be stored for some period of time and a separate server would record who had what and be able to retrieve the data in case the backups were destroyed or inaccessible.
The main thing was that if it were recently downloaded data then it was more relevant than older data, which could wait to be reconstructed but newly queried records were more important to current operations.
Also, since the data was scattered about, it would be of less interest to a party wanting to grab info about soldiers.
Obviously the idea needed more thought by more brains than mine.
Eve Fairbanks says I drive a hybrid!LOL
Isn't this basically what freenet does? It encrypts the data into chunks and spreads it around all over the place.
:-)
I was working on a p2p system that worked in a similar manner. I was even thinking of repurposing it for the sake of doing online backups - but frankly the bandwidth just doesn't seem to be there yet to do that sort of thing in a practical manner. That, and I got bored with the project... (but nevermind that).
Hexy - a strategy game for iPhone/iPod Touch
It's '70s not 70's.
After RTFA, it occurs that this is mostly a research project. The goals (and downloadables) include libraries that allow PCs to mount a distributed encrypted filesystem and others.
In a business example where you know that you can ultimately control the sites where you're storing your partial data, this would be a very good thing.
For the single user attempting to secure his information by using the existing network, there are some downsides. 6 of 11 slices of the data are needed to reconstruct the whole. Therefore if a party intent on obtaining secret data obtains the majority of the servers, he has the data.
Also, if a disaster wipes out the majority of the servers, leaving five or fewer of the eleven, the data is gone.
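To put rough numbers on the 6-of-11 threshold, here is a minimal sketch. It assumes each of the 11 servers is up independently with the same probability, which real hosting providers will not match exactly; the function name `survival_probability` is made up for this illustration and is not part of Cleversafe.

```python
from math import comb


def survival_probability(n: int, k: int, p_up: float) -> float:
    """Probability that at least k of n servers are up.

    Assumes every server is up independently with probability p_up
    (an assumption of this sketch, not a measured property of any grid).
    """
    return sum(
        comb(n, i) * p_up**i * (1 - p_up) ** (n - i)
        for i in range(k, n + 1)
    )


if __name__ == "__main__":
    # 11 slices, any 6 needed, as described in the comment above.
    for p_up in (0.50, 0.90, 0.99):
        print(f"p_up={p_up:.2f} -> data recoverable with probability "
              f"{survival_probability(11, 6, p_up):.6f}")
```

The same threshold cuts both ways: the data survives as long as any 6 servers do, and an attacker who controls any 6 servers holds enough slices to reconstruct it.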
This is a very, very important concept for business storage, but I have to wonder if it scratches any geek itches not already soothed by Truecrypt and Par2.
See Comment 15948676
Of complexity, but also adds
In this paper we show how to divide data D into n pieces in such a way that D is easily reconstructable
from any k pieces, but even complete knowledge of k - 1 pieces reveals absolutely no information about D
I use this approach in my sex life, however, rather than obscuring information about D, even knowing one "piece" p reveals way more information than I'd like to have out there. Hell, ever since k-1 got a page on myspace, every potential n+1 knows about me before we even get started.
--- What?
I'm ...
I can only hope that this scheme includes distributed storage of the pointers to all the fragments, too. Distributed data is only as reliable as the metadata that record where the data fragments are located. If the user of the system loses their only copy of the map to their fragments, the data is lost. If, on the other hand, each fragment also includes encrypted pointers to a few other fragments, then decrypting any fragment lets one bootstrap recovery of the entire network of fragments (a good thing if you want reliability, maybe less desirable for those seeking security).
Two wrongs don't make a right, but three lefts do.
See Comment 15948695
Another layer of inefficiency and
From what I remember they split up data into multiple pieces, encrypted it and distributed it over a number of nodes, with some redundancy in it. If you know Python and are interested in p2p I'm sure there's a lot to be learned from that project.
You are not entitled to your opinion. You are entitled to your informed opinion. -- Harlan Ellison
... unsure ...
Secure Data by Scattering the Pieces
;-)
You mean to tell me that all those hours of defragging my HD's on Windows 98 were actually a waste of time??
Sure, this will work until someone comes up with an Average White Band exploit. Then it's useless.
-Peter
See comment 15948718
an increased risk of loss of data.
Burma Shave.
... of ...
... the ...
I thought about a system to do this a few years ago, but with a little twist: distribution of the pieces would be via computer virus. The pieces would be stored in users' computers, but more importantly in intrusion logs of "secure" systems as well. Retrieval would be a social act, kind of like a treasure hunt. "Hey, geeks of the world, there's this important information out there. Go figure out how to get it!"
This system could be used for high profile secrets, like government whistle-blower data and the like. Storage would be secret and nearly undetectable because of all the other virus noise. Retrieval would be highly public by necessity, both to make retrieval possible and to publicize the contents of the data.
... novelty.
Lotus Notes doesn't really do this on purpose, but an artifact of the system architecture and user behavior (they want local copies of everything) seems to combine to provide a rudimentary capability of data recovery from widely distributed stores such as you describe. I observed a client once restore all the data on a Lotus Notes server following a catastrophic data loss (a short chain of events meant no backups could be recovered following the loss of a RAID filesystem, many GB of data gone). They put a call out to their user community for the replicated copies of the various "database" thingies which existed on user laptops and desktop systems. They were able to recover all of the data that anyone cared about, anyway, if not actually all of it.
If you mod me down, I shall become more powerful than you could possibly imagine.
This only works if the distance between the moved elements is greater than the attacker can cross. Not much different than sending reset passwds unencrypted through emails.
Ross Anderson of the Computer Security Group at Cambridge University wrote a paper called the Eternity Service. It has had a few different attempts at implementation, as well as some reworks in terms of design. The primary difference is in the Eternity Service - you had no idea what data you had, nor did you have access to the keys. This new concept/design seems to provide more control/granularity for the user. Given the new proposed encryption laws in the UK, I'm not sure this is a good idea.
"Omnis tuus capsa sunt inesse nos"
Man, that sucks. 1 minute later on the initiation of the joke, 7 minutes later on the completion... and you get a redundant mod. I guess that's the way the cookie crumbles.
But he had been reading histories of early encryption research, and he saw a germ of an idea in the work of cryptographers who kept information secure by dividing it into pieces and dispersing it.
Germ?
I've been doing something like this for years.
First I would encrypt the original file, split it up into 10-100 pieces, encrypt those, hide them in other files, encrypt those, then store them in random locations around the internet either by emailing a piece to a webmail or uploading to a server somewhere, posting the binary or hex sequence to a forum, things like that.
Heck sometimes I'd repeat the encrypt/split/hide process several times, or even put the last step as hidden. Yes I realize anyone with any computer talent could find a file hidden in another one, but it keeps it out of plain sight.
I also remove any identifiable information on what order the pieces go in, I rely on myself to remember. Or leave clues elsewhere.
I'll admit sometimes it takes like 3 days to gather and assemble them if I need them, though.
I use it for things that are better off gone forever than being leaked.
Ignorantia legis neminem excusamus That's Latin for "ignorance of the law is no excuse": a principle recognized by the U.S. Courts.
Now consider what happens when RIAA figures out that every Linux user may store copyrighted tunes in their /dev/random
Homework: test how long it takes for your /dev/urandom to come up with the beginning of Star Wars' music score: DDDGDCBA
(Put a million computers to cat /dev/random for 100 years, and who knows what they'll come up with...)
Obama likes poor people so much, he wants to make more of them.
Copy
The problem with this idea is bandwidth and speed. You think your broadband is fast, but if you have to download the 27 gigabytes of photos, music and stuff, it won't be exactly fast on an 8 Mbps DSL, not to talk about 1 Mbps or less. You might wait a couple of hours, but you won't wait a couple of days.
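For a sense of scale, the arithmetic behind those figures can be sketched as below. It assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second) and a link that sustains its full rated speed with no protocol overhead, so real transfers would take longer; `transfer_hours` is just an illustrative name.

```python
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Hours needed to move size_gb gigabytes over a link_mbps link.

    Assumes decimal units and a fully saturated link with no overhead,
    which is optimistic for a real DSL line.
    """
    bits = size_gb * 1e9 * 8             # gigabytes -> bits
    seconds = bits / (link_mbps * 1e6)   # bits / (bits per second)
    return seconds / 3600


if __name__ == "__main__":
    print(transfer_hours(27, 8))  # 7.5 hours at 8 Mbps
    print(transfer_hours(27, 1))  # 60.0 hours (2.5 days) at 1 Mbps
```

So the 8 Mbps case is a long evening of waiting, while 1 Mbps is already in "a couple of days" territory.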
Okay. So you tell me that amount of available bandwidth will increase? But so will the amount of data that needs to be backed up. And it will grow faster than the bandwidth. Think of homemade movies. You can already fill up your average drive in no-time. What do you then do, when you get a HD camera?
Although the idea isn't a new one, I think it is still neat. It might work for some stuff, but I don't see this becoming mainstream with technologies like Time Machine coming to the end-users.
I demand the Cone of Silence!
Oh come on, a paper?
Everyone knows that if you want to share a secret, you just tell it to a -- eh, never mind. :P
FATMOUSE + YOU = FATMOUSE
Not quite, but the coding scheme that makes CDs and DVDs resistant to dust and scratches works much like that. Big blocks have an error correcting code appended, and then the bits of the data plus error correcting code are rearranged and spread widely across the block. So when you lose a contiguous set of bits, you can replace it by using data distributed across the block.
It's a good error correction scheme, but it's not exactly new. Every CD player in the world has this. CDs aren't encrypted (there's no key, just a well-known algorithm), but you could mix encryption in if you wanted. This wouldn't help the error recovery.
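The "rearranged and spread widely" part can be illustrated with a plain block interleaver. This is only a sketch of the interleaving idea: real CDs use a cross-interleaved Reed-Solomon code (CIRC), and the error-correcting code itself is left out here. The function names and the 4x6 block size are arbitrary choices for the example.

```python
def interleave(data: bytes, rows: int, cols: int) -> bytes:
    """Write data row by row into a rows x cols grid, read it out column by column."""
    if len(data) != rows * cols:
        raise ValueError("data must fill the grid exactly")
    return bytes(data[r * cols + c] for c in range(cols) for r in range(rows))


def deinterleave(data: bytes, rows: int, cols: int) -> bytes:
    """Inverse of interleave: restore the original row-major order."""
    if len(data) != rows * cols:
        raise ValueError("data must fill the grid exactly")
    return bytes(data[c * rows + r] for r in range(rows) for c in range(cols))


def burst_damage_per_row(rows: int, cols: int, start: int, length: int) -> list[int]:
    """Count how many symbols each original row loses to one contiguous burst."""
    stream = bytearray(interleave(bytes(rows * cols), rows, cols))
    for i in range(start, start + length):
        stream[i] = 1  # mark the damaged positions on the "disc"
    restored = deinterleave(bytes(stream), rows, cols)
    return [sum(restored[r * cols:(r + 1) * cols]) for r in range(rows)]


if __name__ == "__main__":
    # A scratch wiping 4 consecutive symbols costs each of the 4 rows one symbol,
    # which a per-row code able to fix a single error could then repair.
    print(burst_damage_per_row(rows=4, cols=6, start=6, length=4))
```

Without interleaving, the same 4-symbol scratch would land in one row and overwhelm a code that can only fix one error per row.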
This is so not-new it's not even funny. I've already seen FreeNet and MNet mentioned as precursors, which is appropriate. Dozens of other P2P "filesystems" (in quotes because I don't believe it's truly a filesystem unless it's fully integrated into the OS) and block-level data stores have done this. Probably the one that most thoroughly examined the inherent tradeoffs, and that's most directly based on Shamir's IDA work, is PASIS at CMU. Presenting Cleversafe as the first to move in this direction is an insult to those who have gone before.
Slashdot - News for Herds. Stuff that Splatters.
While secret sharing is cool, one of its primary drawbacks is that it's usually built using asymmetric crypto (as in, based on number theoretic assumptions and the like). That means it's potentially quite slow. Ross Anderson wrote a paper on a cool alternative which uses only symmetric primitives to achieve the same result. (In fact, he's able to build a lot of different things by combining symmetric primitives in the right way.)
I'm a little reminded of the Judge from Buffy. Pieces scattered around the world. For security. This seems like a better application of the technique.
A friend taught me this. The secret in his case was a proprietary industrial process.
You take the secret and divide it into 3 pieces. You have a team of 3 people to each carry or memorize two of the 3 pieces.
Amy carries pieces 1 and 2
Bob carries pieces 2 and 3
Charlie carries pieces 3 and 1
If any one of them is compromised by bribery or other means, 1) the information is not lost and 2) the enemy has only an incomplete picture of what is going on.
This can be extended to more people to achieve greater redundancy or less exposure:
More redundancy: 4 people with 4 pieces, each person knows 3 elements. Any 2 of 4 people needed to put the pieces together.
Less exposure: 4 people with 4 pieces, each knows 2 elements. Any 3 of 4 people needed to put the pieces together. Loss of 1 person exposes 1/2 of the total secret.
There's no reason to stop with 4 people and 4 pieces.
Think of this as RAID for human-knowledge.
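The people-and-pieces arrangements above can be checked mechanically. The sketch below assumes the cyclic layout shown for Amy, Bob and Charlie (person i carries piece i and the next few pieces, wrapping around); the comment only spells that layout out for the 3-person case, so extending it to 4 people is an assumption, and the function names are made up for illustration.

```python
from itertools import combinations


def cyclic_assignment(people: int, per_person: int) -> list[set[int]]:
    """Person i carries pieces i, i+1, ..., i+per_person-1 (mod people).

    Assumes as many pieces as people, as in the examples above.
    """
    return [{(i + j) % people for j in range(per_person)} for i in range(people)]


def smallest_guaranteed_group(holdings: list[set[int]], total_pieces: int):
    """Smallest g such that EVERY group of g people jointly holds all pieces."""
    everything = set(range(total_pieces))
    for g in range(1, len(holdings) + 1):
        if all(set().union(*group) == everything
               for group in combinations(holdings, g)):
            return g
    return None


if __name__ == "__main__":
    for people, per_person in ((3, 2), (4, 3), (4, 2)):
        holdings = cyclic_assignment(people, per_person)
        need = smallest_guaranteed_group(holdings, people)
        exposure = per_person / people
        print(f"{people} people, {per_person} pieces each: any {need} suffice, "
              f"one compromised person exposes {exposure:.0%}")
```

One caveat the check brings out: in the 4-people, 2-pieces layout, any 3 people are always enough, but some particular pairs (for example the first and third person) already cover all 4 pieces between them.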
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
This will only move forward if they can make it perform as well as traditional disk access, and what do they plan for a backup strategy?
~Teh Def1c4t05S~
The Information Dispersal Algorithm is due to Michael Rabin.
Shamir's secret-sharing algorithm uses a similar idea (it's
essentially the same as Rabin's algorithm, except that the
data is padded with random gibberish).
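For readers who want to see the k-of-n idea concretely, here is a minimal sketch of Shamir-style threshold sharing for a single integer secret. It illustrates the scheme from the 1979 paper quoted earlier in the thread, not Rabin's IDA and not Cleversafe's own algorithm. The prime modulus, the function names, and the 6-of-11 parameters are choices made for this example, and it needs Python 3.8+ for the modular inverse via `pow`.

```python
import secrets

# A Mersenne prime; every secret and share lives in the integers mod PRIME.
PRIME = 2**127 - 1


def split(secret: int, k: int, n: int) -> list[tuple[int, int]]:
    """Split secret into n shares so that any k of them reconstruct it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    if not 1 <= k <= n < PRIME:
        raise ValueError("need 1 <= k <= n < PRIME")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, n + 1)]


def combine(shares: list[tuple[int, int]]) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    shares = split(123456789, k=6, n=11)
    print(combine(shares[:6]) == 123456789)   # any 6 shares work
    print(combine(shares[5:]) == 123456789)   # a different 6 shares work too
```

Each share here is as large as the secret itself, which is the price of the random padding mentioned above; dispersal schemes aimed at storage efficiency give up that property in exchange for smaller slices.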
Am I part of the core demographic for Swedish Fish?
This was several years ago, but I read a paper, I believe on Slashdot, about a crypto system intended for people like human rights observers working in the field. Basically you would write up your report, call up this program, pass your report to it, and the program would write it in crypto to uninitialized blocks of the file system so that it appeared to be random noise.
The concept was that the watcher's laptop was likely to be inspected when they left the country. The inspectors wouldn't find anything since they wouldn't know how this program was started, much less the keys required to make it work.
Does anyone else remember this? I've searched Slashdot with zero success, I even emailed Bruce Schneier but he hadn't heard of this.
When you sympathize with stupidity, you start thinking like an idiot.
Oops, need to update my literature review in that case. Thanks!
for...our...overloards
Here's a reg-free link.
Courtesy of the New York Times Link Generator.
In reading through a lot of the posts, I thought it might be useful to elaborate on how Cleversafe compares to current copy-based data storage systems as well as previous projects using similar techniques for data storage and communications...
Effectively all digital data storage in use today works by making copies of data and redundant copies of data with the use of parity bits when stored on a RAID array. Cleversafe does not store a copy of the original data and definitely does not store copies of data. Cleversafe 'disperses' data which is different than copying data. Original data files are turned into a set of 'dispersed files' or 'slices' -- each of which contains too little information to be useful on its own. These slices are then stored in different locations. On the current Cleversafe test grid, each file is dispersed into 11 slices which are each stored by a separate storage hosting provider in separate geographic locations as shown at http://www.cleversafe.org/wiki/Cleversafe_Research_Storage_Grid.
In order to ensure ultra-high availability, the dispersal algorithms are designed in such a manner that any majority of these slices can be used to perfectly recreate all the original data. This technique is similar to methods often employed in data communications where data is broken up into some number of packets by the sender in such a manner that the receiver does not need all the packets to recreate all the original data.
Over the past 25 years, a number of projects have looked at storing data using information dispersal or similar techniques. Many of these projects have used Reed Solomon or similar encoding / decoding techniques, including OceanStore, PAR/PAR2 and others. The Cleversafe project is not only developing algorithms for information dispersal, but is also creating a complete system to enable the benefits of Dispersed Storage to be practically used on a generally-available-to-everyone scale. So in addition to creating new, computationally-efficient algorithms for information dispersal, the Cleversafe project includes:
- a metadata management system for managing files stored on the grid
- grid management tools, including 'rebuilder' processes that enable the grid to 'self-heal'
- interfaces to enable dispersed storage to work in various existing environments, including a general API, a command line interface, a file system interface (which we began demonstrating at Linux World last week) and an upcoming GUI interface
- integration with existing methods for encryption
- live dispersed storage grids running on nodes operated by various storage hosting companies in various locations
- etc.
So, the focus of Cleversafe is to build on the previous work in dispersed storage (which has mainly been academic research) to create a practical and complete open source system to better store the world's data.
Chris Gladwin
Cleversafe is improving how the world stores its data. Join us at www.cleversafe.org.
"Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it." So this idea is similar to some degree.
(Fyi: this link to the New York Times article bypasses any need to login/register with the nytimes.com website.)
I'm the Cleversafe Dispersed Storage software-development project leader. I work with Chris Gladwin (mentioned in the New York Times article) as a fellow manager at Cleversafe.
I offer some comments below to help outline some of the unique aspects of the Cleversafe technology.
Encryption is not dispersal. Cleversafe provides both, and then some. The Cleversafe Dispersed Storage software disperses any "datasource" (typically a file) into several slices (our current software uses 11 slices in an 11-lose-any-5 scheme; future versions may use additional schemes with "wider" slice sets). Additionally, our software also encrypts, compresses, scrambles, and signs the datasource content, but we are not trying to reinvent the wheel: other software technologies exist to do these things, and we leverage them extensively.
We found that a bigger challenge than creating or managing dispersal algorithms was to make the entire storage system work regardless of the dispersal algorithm used (and we design the system to be dispersal-scheme agnostic). The meta-data management system and many other things took us far longer to implement than the Cleversafe IDA. It's not hard to use Reed-Solomon, or some other algorithm, on a single file or a small set of files and disperse the slices by hand onto several different systems (or use variants of this like the 3-piece secret story with Amy, Bob, and Charlie mentioned above). It's much harder to manage this across an entire file system (with hundreds of thousands of files--or many more depending on the file system) for an unlimited number of file systems from all the various users, to be stored on a heterogeneous set of an unlimited number of geographically-dispersed, commodity-storage nodes in a completely-decentralized way with no dependence on the original source of the data (eg, you could sledgehammer your laptop and not lose any data that's stored on our grid/storage service). (I apologize for that run-on sentence.)
Further, dispersed-storage systems do not require replication. (Dispersal systems may replicate data for performance purposes, if at all, depending on the application/configuration/installation/context.) If a system replicates entire copies of the data (be they encrypted or not) then it, by (our) definition, is not a dispersed-storage system. So a continual question I have when I evaluate other systems: do they replicate the data in whole or not? Most systems replicate.
Cleversafe is not the first to present a dispersal system, but we like to think we are the first to make it broadly usable by people and inter-operable with other systems. See our cmdline client (which will soon have continuous-backup and XML-programmable policy management), our Dispersed Storage API, our dsgfs file system, a soon-to-be released GUI client, and future "connectors" (what we call the applications that leverage our technology) to come, all available at http://www.cleversafe.org.
A side note: "revision management" is built into the Cleversafe system to address what I call "soft" failures (accidental deletes, application failures, etc) vs. "hard" failures (hard disk crashes) as well as archival requirements.
I believe that the concept of "dispersed storage" will eventually change how the world thinks about storage systems--regardless of whether or not these are Cleversafe-based systems (I think Cleversafe presents the best such system, but I of course am biased).
There is prior art:
"Blondie, what did he tell you? I know which graveyard the money is buried in. Don't die on me Blondie. What did he tell you?"
"A name... a name on a gravestone..."
"Ah! We are partners! I know the graveyard, you know the name! Partners just like good old times, eh?!"
- For the complete works of Shakespeare: cat
(Didn't bother to RTFA but..)
Why not just get K random sequences and XOR them together to get a 1 time pad. Then encrypt the data and store it in public view. You will need ALL the pads to unlock it.
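The XOR idea above is easy to sketch. This is an all-or-nothing split: every one of the K pads is required, so unlike the 6-of-11 scheme there is no tolerance for a lost piece. The function names are made up for illustration, and the pads must be as long as the data and never reused.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_pads(data: bytes, k: int) -> tuple[bytes, list[bytes]]:
    """Return (ciphertext, pads): k random pads are XORed into one one-time pad."""
    if k < 1:
        raise ValueError("need at least one pad")
    pads = [secrets.token_bytes(len(data)) for _ in range(k)]
    one_time_pad = reduce(xor_bytes, pads)
    return xor_bytes(data, one_time_pad), pads


def recover(ciphertext: bytes, pads: list[bytes]) -> bytes:
    """Undo split_with_pads; only correct when ALL pads are supplied."""
    return xor_bytes(ciphertext, reduce(xor_bytes, pads))


if __name__ == "__main__":
    message = b"meet at the graveyard"
    ciphertext, pads = split_with_pads(message, k=3)
    print(recover(ciphertext, pads) == message)  # True with all 3 pads
```

The ciphertext can sit in public view, as the comment suggests, but losing any single pad makes the data unrecoverable, which is the trade-off against threshold schemes.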
bash-2.04$
bash-2.04$yes "Don't you hate dialup connections?"| write USERNAME
So this is why there were so many of those "scattered items" type quests in console RPGs.
Elder: "We need the sacred information of Pr0n!"
Elder: "Unfortunately, the dastardly Cleversafe has scattered this information into 12 parts."
Elder: "You must go to each of the 12 ancient ruins and collect the sacred information for us!"
Player: "This quest sucks."
Makes sense now....
Brian "Psychochild" Green
MMO developer's blog
Way back when the earth was flat, Monty Python had a skit where there was a joke so funny that you'd die laughing, literally. The military wanted to use it against their enemies, so they had to translate it into the enemy's language (probably German in the skit, I don't recall), but if a single translator was given the whole joke, they would die laughing before they could write down the translation. So they cut the printed copy of the joke into smaller (non-lethal) sections and had a group of translators translate sections of the joke, which was later reassembled into final form. Same principle.
I'm scared of world leaders who think locally and act globally.
I do see a possible implementation flaw. The data is encrypted and distributed across multiple locations. It seems that relying on a single or small number of bandwidth providers removes some of the benefit from this technology. You've got redundant encrypted storage, but a single provider (or a group working together) could aggregate all of the pieces. How do you get all of the pieces distributed without moving this data over a compromised channel? Isn't this a geo-transposition cypher over an unsecured channel wrapped around
Any ideas?
I've often wondered when someone would get around to perfecting a dispersed backup system for LANs. With the average workstation toting 100GB drives, and the average use of a handful of GBs, there seems to be a surplus of cheap disk space on the LAN... at least compared to backup tapes or other media. Though, in hindsight, I guess a single fire or building disaster would still be catastrophic...
The focus was on security rather than data storage, but I'm sure I read about it on slashdot...