Archiving Digital History at the NARA
val1s writes "This article illustrates how difficult archiving is vs. just 'backing up' data. From the 38 million email messages created by the Clinton administration to proprietary data sets created by NASA, the National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a "digital dark age"?"
Perhaps the answer is compression.
Does anyone know whether there is an upper limit to text compression?
In signal processing, there is a limit called the Shannon Capacity theorem, which gives the maximum amount of information that can be transmitted on a channel. In text compression, is there a similar limit?
Note that the Shannon Capacity theorem does not tell you how to reach that limit. The theorem merely tells you what the limit is. For decades, we knew that the maximum rate on a normal twisted-pair telephone line is about 56,000 bits per second, according to the theorem. However, we did not know how to reach it until trellis coding was discovered, according to an electronic communications colleague at the institute where I work.
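For reference, the capacity formula for a band-limited channel with Gaussian noise (the Shannon-Hartley form) is C = B * log2(1 + S/N). Here is a minimal Python sketch of that calculation. The 3,100 Hz bandwidth and the SNR values are illustrative assumptions, not measurements of any real phone line, and the result moves a lot depending on what you plug in.

```python
import math


def shannon_capacity_bps(bandwidth_hz, snr_db):
    """Capacity in bits/second from C = B * log2(1 + S/N).

    snr_db is the signal-to-noise ratio in decibels; it is converted
    to a linear power ratio before being used in the formula.
    """
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)


if __name__ == "__main__":
    # Assumed values for illustration only: a 3,100 Hz voice band
    # and a few plausible-looking SNR figures.
    for snr_db in (30, 35, 40):
        capacity = shannon_capacity_bps(3100, snr_db)
        print(f"SNR {snr_db} dB -> about {capacity:,.0f} bits/s")
```

The formula only gives the ceiling for the assumed bandwidth and noise level; it says nothing about which coding scheme gets you there.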
If we can calculate a similar limit for text compression, then we can know whether further research to find better text compression algorithms would be potentially fruitful. If we are already at the limit, then we should spend the money on finding denser storage media.
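On the question of a limit: the usual counterpart is Shannon's source coding theorem, which says that no lossless code can average fewer bits per symbol than the entropy of the source. The true entropy of real-world text isn't known exactly, so in practice you can only estimate it under some model. Below is a rough sketch using the simplest model, where each character is treated as independent and the frequencies come from the sample itself. The function name and the sample string are made up for illustration. Compressors that model context can beat this order-0 figure, so treat it as a bound for that model only, not for text compression in general.

```python
import math
from collections import Counter


def order0_entropy_bits_per_char(text):
    """Empirical order-0 entropy of `text`, in bits per character.

    Treats every character as an independent draw whose probability
    is its relative frequency in the sample.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum of p * log2(1/p) over all distinct characters.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"  # hypothetical sample
    bits = order0_entropy_bits_per_char(sample)
    print(f"{bits:.3f} bits/char under an order-0 model")
    print(f"estimated floor for this sample: {bits * len(sample) / 8:.1f} bytes")
```

Comparing that estimate against what an existing compressor achieves on the same data gives a crude idea of how much headroom is left under a given model.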
I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone that wants to find a piece of data in that haystack,
Haven't you? Have you ever worked with real archiving before? IBM have some nice solutions that allow us to stock on disk and a WORM library (Tivoli Storage Manager) and index in a (large) Oracle DB - they work and scale just fine (our experience over a couple of hundred terabytes). You probably wouldn't want all that data in a single archive anyway, but I'd guess you'd know that if you'd ever archived anything...
The most common structure used to index large amounts of data stored on magnetic or other large-capacity media is the B-Tree and its variants. The article linked here explains the basic idea of the balanced multiway tree, or B-Tree. The advantage of this type of index is that the index can be stored entirely on the collection of tapes, cartridges, disks or whatever else, while only the portion of the tree that is currently being operated on needs to be read into volatile or main memory. The B-Tree allows efficient access to massive amounts of data while minimizing disk reads and writes. Theoretically, the B-Tree and its variants could be scaled up to address an unlimited amount of data, with lookups taking logarithmic time.
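To make the "only read the part you're working on" point concrete, here is a small Python sketch of a B-Tree lookup that counts how many nodes it touches. The `Node` class, the `search` function and the tiny hand-built tree are all hypothetical and kept in memory for simplicity. In a real system each node would be a block fetched from disk or tape, and insertion and rebalancing (not shown) would keep the tree balanced.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list                                     # sorted keys in this node
    children: list = field(default_factory=list)   # empty for a leaf


def search(root, key):
    """Look up `key`; return (found, nodes_read).

    Only one node per level is examined, so nodes_read is bounded
    by the height of the tree.
    """
    node, nodes_read = root, 0
    while True:
        nodes_read += 1  # stands in for one block read from storage
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True, nodes_read
        if not node.children:
            return False, nodes_read
        node = node.children[i]  # descend into the i-th subtree


# Hand-built two-level tree, for illustration only.
root = Node(
    keys=[20, 40],
    children=[
        Node(keys=[5, 10, 15]),
        Node(keys=[25, 30, 35]),
        Node(keys=[45, 50, 55]),
    ],
)

if __name__ == "__main__":
    for k in (30, 40, 42):
        print(k, search(root, k))
```

With a few hundred keys per node instead of two or three, the same loop reaches billions of records in a handful of node reads, which is where the logarithmic scaling comes from.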