Open Source Moving in on the Data Storage World
pararox writes "The data storage and backup world is one of stagnant technologies and cronyism. A neat little open source project, called Cleversafe, is trying to dispell of that notion. Using the information dispersal algorithm originally conceived of by Michael Rabin (of RSA fame), the software splits every file you backup into small slices, any majority of which can be used to perfectly recreate all the original data. The software is also very scalable, allowing you to run your own backup grid on a single desktop or across thousands of machines."
Editors please note!
Editors, please note that there is some incorrect information in this post. Firstly, the original concept of the IDA was designed by Shamir of RSA fame, not Rabin.
Also note that the Cleversafe IDA is a custom algorithm, and is only similar to Shamir's initial concept.
While Michael Rabin was inventor of the Rabin cryptosystem in 1979, it was Ronald Rivest, Adi Shamir and Len Adleman behind RSA two years earlier.
Meh, it sounds like it's just par2 integrated into a distributed filesystem.
so it's sort of like parchive http://parchive.sourceforge.net/ which is software splits every file you backup into small slices, any majority of which can be used to perfectly recreate all the original data
Well, their webserver seems like it's been smoked, here's a link to their sourceforge page, where you can grab the actual software:
http://it.slashdot.org/it/06/04/26/2039224.shtml
Related companies/projects happened in this order: MojoNation .. MNet .. HiveCache .. AllMyData
good luck!
i haven't remembered the name yet, but the company was bought by novell shortly before NDS came out. i always thought it was how NDS replicated itself around w/o eating up the network while trying to take care of itself.
I just hope they don't patent it!
Am I part of the core demographic for Swedish Fish?
I think you mean linealy independent, not mutually orthogonal. Infact, the word orthogonal isn't even in Rabin's paper. Thus, what Rabin has done is shown how to generate n vectors such that any m are linearly independent .
It's an l dimensional space though. The PDF of the paper is http://portal.acm.org/ft_gateway.cfm?id=62050&type =pdf&coll=GUIDE&dl=GUIDE&CFID=70220506&CFTOKEN=528 80553, and is accessible to anyone who's had an undergraduate course in linear algebra. The crux of the argument is on page 4.
Hello-
I am the chief designer of the Cleversafe dispersed-storage system (aka a grid-storage software system) and am one of the project's co-founders. The Cleversafe system never stores a complete copy of the data in any one place (or "grid node" in our terminology). At most 1/11th of the file data--we call it a file "slices"--is stored at any one grid node in a "scrambled" (i.e., non-contiguous), compressed, and encrypted/signed fashion. The grid _never_ stores more than one copy of the data on the grid, and that one copy is never stored all in the same place--it's dispersed using an optimized information-dispersal algorithm that we created but has similar properties to the previously-published info-dispersal algorithms (IDAs).
If a grid node and its associated content--i.e., the user's file slices on that node--are ever completely compromised (firewall comes down, all encryption and scrambling is cracked, etc), then the cracker acquires at most 1/11th (one-eleventh) of the data users data.
Further, if any half (or at least 5 out of any 11) of the grid nodes are for any reason destroyed or otherwise unavailable, all of the user's data is still accessible. This is done by generating a "coded" file slice for every data slice that we store on the node, and regenerating missing file slices from down nodes by pumping the available data and coded slices through our info-dispersal algorithms (which are all open-sourced, by the way) that are executed on the client side or when the grid "self heals" for destroyed nodes.
The system can also be implemented in a cost-effective fashion. The grid system can sustain so many concurrent, per-node outages that the availability/uptime requirements for each node are minimal. Also, the grid-node servers need not support much processing capability, for the client offloads much of the work from the servers.
We feel this system provides a powerful combination of reliability, scalability, economy, and security.
The hardest part of the design, imo, is to be able to reliably track all of these file slices across a large and heterogeneous set of grid-node machines housing these info-dispersed file slices. We designed the grid meta-data system from the ground up to do this and to be capacity-expandable, performance-scalable, and easily serviceable. More details for the open-source flavor of the grid-software design can be found here:
http://wiki.cleversafe.org/Grid_Design
There's much more that I can say about this system; I plan to add additional comments to this thread as more questions and comments arise. I'm sure there are new comments I have yet to read, for they're coming in pretty quickly...
I also encourage further discussion at our newly-created web forums: http://forums.cleversafe.org/
Mailing lists (that will be synchronized with the web forums) will also be available at cleverafe.org in the near future.
-Matt
(This is a repost from an earlier part of the thread so that I can get these comments on the toplevel.)
Hello-
I am the lead designer of the first Cleversafe dispersed-storage system (aka a grid-storage software system) and am one of the project's co-founders. The Cleversafe system never stores a complete copy of the data in any one place (or "grid node" in our terminology). At most 1/11th of the file data--we call it a file "slices"--is stored at any one grid node in a "scrambled" (i.e., non-contiguous), compressed, and encrypted/signed fashion. The grid _never_ stores more than one copy of the data on the grid, and that one copy is never stored all in the same place--it's dispersed using an optimized information-dispersal algorithm that we created but has similar properties to the previously-published info-dispersal algorithms (IDAs).
If a grid node and its associated content--i.e., the user's file slices on that node--are ever completely compromised (firewall comes down, all encryption and scrambling is cracked, etc), then the cracker acquires at most 1/11th (one-eleventh) of the data users data.
Further, if any half (or at least 5 out of any 11) of the grid nodes are for any reason destroyed or otherwise unavailable, all of the user's data is still accessible. This is done by generating a "coded" file slice for every data slice that we store on the node, and regenerating missing file slices from down nodes by pumping the available data and coded slices through our info-dispersal algorithms (which are all open-sourced, by the way) that are executed on the client side or when the grid "self heals" for destroyed nodes.
The system can also be implemented in a cost-effective fashion. The grid system can sustain so many concurrent, per-node outages that the availability/uptime requirements for each node are minimal. Also, the grid-node servers need not support much processing capability, for the client offloads much of the work from the servers.
We feel this system provides a powerful combination of reliability, scalability, economy, and security.
The hardest part of the design, imo, is to be able to reliably track all of these file slices across a large and heterogeneous set of grid-node machines housing these info-dispersed file slices. We designed the grid meta-data system from the ground up to do this and to be capacity-expandable, performance-scalable, and easily serviceable. More details for the open-source flavor of the grid-software design can be found here:
http://wiki.cleversafe.org/Grid_Design [cleversafe.org]
There's much more that I can say about this system; I plan to add additional comments to this thread as more questions and comments arise. I'm sure there are new comments I have yet to read, for they're coming in pretty quickly...
I also encourage further discussion at our newly-created web forums: http://forums.cleversafe.org/ [cleversafe.org]
Mailing lists (that will be synchronized with the web forums) will also be available at cleverafe.org in the near future.
-Matt
Cleversafe project lead
More notes on our IDAs compared with others:
s
The Cleversafe information dispersal algorithms (IDAs) were designed to provide real-time performance with large amounts of data storage and retrieval (gigabytes, petabytes and above). Previous algorithms, like Rabin, Shamir and Reed-Solomon, are very effective at storing smaller amounts of data (kilobytes), but their computational overhead which is proportional to the square of the data block size or greater arent well suited for quickly dispersing/restoring larger amounts of data. The Cleversafe algorithms encode AND decode data with a computational overhead that is linearly proportional to the size of the data blocks. Specifically, the Cleversafe encoding algorithms for an 11 node grid with a threshold level of 6, required 5 operations per byte to encode data. For decoding on this dispersed storage grid, the Cleversafe algorithms require 4 operations per byte to decode data greater than 99% of the time and no more than 13 operations per byte in rare cases.
Another Cleversafe contributor, Chris Gladwin, developed our IDAs. For more info:
http://wiki.cleversafe.org/Turbo_IDA_Technology
On can also read an Excel spreadsheet (found in the above wiki page) and C++ source code that represents the "guts" of our 11-Pillar IDA code module.
For more info about Cleversafe contributors:
http://wiki.cleversafe.org/Cleversafe_Contributor
You can see Chris and I at the bottom of the page which is ordered with the most-recent contributor listed first.
-Matt England
ps: We are finishing up our project announcement at this week's MySQL User's Conference where we drew significant interest. We have engaged some MySQL core developers regarding integrating the their technology with ours.
BTW, I am a research assistant for Professor Michael Rabin at Harvard. I think Slashdot editors have the responsibility for making the correction!
Most people seem to know RSA names well but the IDA algorithm in this article is not related to RSA. So their comments on what R in RSA stands for can misguide the readers!