Archiving Digital History at the NARA

16000 formats?!? by gardyloo · 2005-06-26 09:30 · Score: 3, Funny

Hm. This sounds like a job for OpenOffice...

Re:16000 formats?!? by parasonic · 2005-06-26 10:26 · Score: 1, Funny

Or this.
Re:16000 formats?!? by gomadtroll · 2005-06-26 12:17 · Score: 1

No, that would be something like what OASIS proposes, i.e. open document file format,
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office

This would eliminate the 'what/whose/free/proprietary ...' application dialog and focus on the data.
Re:16000 formats?!? by nazsco · 2005-06-26 14:15 · Score: 1

"According to NARA's specifications, the system must ultimately be able to absorb any of the 16,000 other software formats believed to be in use throughout the federal bureaucracy"

Some versions of Word are probably in there, so they will need to run Office macros to avoid violating the DMCA while extracting keywords from those files.

347 petabytes? by ravenspear · 2005-06-26 09:33 · Score: 4, Insightful

Ok, I was tempted to make a pr0n joke about this, but I think the bigger question is what kind of indexing system will this use?

I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone that wants to find a piece of data in that haystack, err. haybarn.

Re:347 petabytes? by sneakyrussiian · 2005-06-26 09:38 · Score: 1

Google desktop search of course. ;)
Re:347 petabytes? by Anonymous Coward · 2005-06-26 09:38 · Score: 0

Hey just use Spotlight.

Can just imagine typing the first few letters of 'Clinton' and Spotlight going through its petabyte index and delivering results 'in real time'.

Not.

It would probably kernel panic at that point and wipe out the 347 petabytes of storage with one crash (especially if Firewire drives with an Oxford chipset were used somewhere). Oh well, that was that then.
Re:347 petabytes? by OrangeSpyderMan · 2005-06-26 09:54 · Score: 3, Informative

I haven't seen any software system that can reliably scale to that level and still make any kind of sense for someone that wants to find a piece of data in that haystack,

Haven't you? Have you ever worked with real archiving before? IBM have some nice solutions that allow us to store on disk and in a WORM library (Tivoli Storage Manager) and index in a (large) Oracle DB. They work and scale just fine (our experience covers a couple of hundred terabytes). You probably wouldn't want all that data in a single archive anyway, but I'd guess you'd know that if you'd ever archived anything....

--
Try NetBSD... safe,straightforward,useful.
Re:347 petabytes? by ravenspear · 2005-06-26 09:56 · Score: 2, Informative

Well, considering that Spotlight took about 2 hours to index my 120 GB drive, that would be (347 * 1024^2) * 2 = 727,711,744 hours = 83,000 years to index that much data.

Now I'm sure the gov would use a faster system than my laptop, but still!
Re:347 petabytes? by CodeBuster · 2005-06-26 10:00 · Score: 3, Informative

The most common structure used to index large amounts of data stored on magnetic or other large capacity media is the B-Tree and its variants. The article linked here explains the basic idea of the balanced multiway tree or B-Tree. The advantage of this type of index is that the index can be stored entirely on the collection of tapes, cartridges, disks or whatever else, while only the portion of the tree which is currently being operated on need be read into volatile or main memory. The B-Tree allows for efficient access to massive amounts of data while minimizing disk reads and writes. Theoretically, the B-Tree and its variants could be scaled up to address an unlimited amount of data in logarithmic time.
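To put a rough number on "logarithmic time", here is a minimal sketch of how many levels a B-Tree would need to address 347 PB. The fanout of 1,000 keys per node and the 1 MB average record size are hypothetical figures chosen for illustration; neither comes from the article.

```python
def btree_levels(num_keys: int, fanout: int) -> int:
    """Smallest number of levels h such that fanout**h >= num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels

TOTAL_BYTES = 347 * 10**15   # 347 PB, decimal
RECORD_BYTES = 10**6         # assumed average record size (hypothetical)
FANOUT = 1000                # assumed keys per node (hypothetical)

records = TOTAL_BYTES // RECORD_BYTES
print(records, "records need", btree_levels(records, FANOUT), "levels")
```

Under those assumptions a lookup walks about four nodes, so each query costs only a handful of reads from slow media.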
Re:347 petabytes? by Anonymous Coward · 2005-06-26 10:25 · Score: 0

Theoretically, the B-Tree and its variants could be scaled up to address an unlimited amount of data in logarithmic time.

That so didn't make sense to me.

But maybe it's just me. :-)
Re:347 petabytes? by Anonymous Coward · 2005-06-26 10:30 · Score: 1, Interesting

Wow. I didn't know one could mess up such simple math so badly... That's a simple rule of three - basic high school math!

120GB/2Hr = 60GB/h indexing speed.
347PB = 347 000TB = 347 000 000GB (or use 347 x 1048576 - but HD manufacturers never use that - they like to inflate numbers)
347 000 000GB / (60GB/h) = 5783333 (and 1/3) h.
at 24h/day, 365d/yr, we get 660 years.

You were off by a factor of a little over 120. Let's just hop you don't write code for a living :p

Still bloody too much, but it's not like the indexing is going to be done by a single processor across a single bus. Anything like that has got to be done by means of distributed computing (duh), so this math is completely irrelevant anyways :)

And it's not like spotlight is much of a reference either, perhaps make comparisons with big commercial indexing solutions, or open source implementations that could be scaled...

Comparing distributed indexing of redundant network storage of some sort with Spotlight indexing a local IDE disk is just laughable. Apples and oranges.
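For anyone who wants to re-run the back-of-envelope estimate above, a minimal sketch follows. It assumes the single-laptop Spotlight rate quoted in the thread (120 GB in 2 hours) and strictly linear scaling, which is exactly the simplification being criticized here.

```python
GB_PER_HOUR = 120 / 2               # Spotlight figure quoted in the thread
TOTAL_GB_DECIMAL = 347 * 1000**2    # 347 PB with 1 PB = 1,000,000 GB
TOTAL_GB_BINARY = 347 * 1024**2     # 347 PB with 1 PB = 1,048,576 GB
HOURS_PER_YEAR = 24 * 365

def years_to_index(total_gb: float, gb_per_hour: float = GB_PER_HOUR) -> float:
    """Years needed at a constant indexing rate, running around the clock."""
    return total_gb / gb_per_hour / HOURS_PER_YEAR

print(round(years_to_index(TOTAL_GB_DECIMAL)))
print(round(years_to_index(TOTAL_GB_BINARY)))
```

The decimal figure comes out at about 660 years and the binary one at about 692, so the choice of unit matters far less than the factor-of-120 slip.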
Re:347 petabytes? by Anonymous Coward · 2005-06-26 10:35 · Score: 0

unlimited != infinite
Re:347 petabytes? by Anonymous Coward · 2005-06-26 10:44 · Score: 0

Google.
Re:347 petabytes? by ravenspear · 2005-06-26 10:47 · Score: 1

Wow, thanks for catching that. I had it right up to the point where I stopped, but I forgot the last step. I calculated a time of 2 hours for each GB instead of 2 hours for each block of 120 GB. 83,000 / 120 is about 690, which is in the same ballpark as your 660 (the gap is the 1024-versus-1000 conversion).

The funny thing is I got an A in Calc III last semester. ;)
Re:347 petabytes? by gipsy+boy · 2005-06-26 11:02 · Score: 1

Let's hope you didn't get one in Algorithmic Complexity :)
Just because an algorithm takes n time for m inputs doesn't mean it takes 2*n time for twice that amount (especially when it comes to indexing, where things usually scale logarithmically rather than linearly).
Re:347 petabytes? by Anonymous Coward · 2005-06-26 11:07 · Score: 0

Probably Reiser9 in the year 2022.
Re:347 petabytes? by ravenspear · 2005-06-26 11:34 · Score: 1

Actually I haven't taken any algorithms classes yet, but that's a good thing to remember.

One thing though, wouldn't it still be linear for the entire process? I mean I understand what you are saying as far as the algorithm goes. It's not necessarily going to take twice as long for the algorithm that creates the index to run createIndex(a,b,c,d) compared to createIndex(a,b).

But you still have to scan twice as many files to derive the inputs. How could that part not be linear?
Re:347 petabytes? by gipsy+boy · 2005-06-26 11:36 · Score: 1

Because scanning the files takes less time than actually indexing them. :)
Re:347 petabytes? by Anonymous Coward · 2005-06-26 11:54 · Score: 0

One word, Google.
Re:347 petabytes? by Aeiri · 2005-06-26 12:20 · Score: 1

Of course, those couple of hundred terabytes still don't compare to 347 petabytes.

(347*1024)/200 = 1776.64 times bigger than what you worked with. Quite a big difference.
Re:347 petabytes? by commodoresloat · 2005-06-26 13:00 · Score: 1

How many Libraries of Congress is that?
Re:347 petabytes? by awtbfb · 2005-06-26 13:59 · Score: 1

Ok, I was tempted to make a pr0n joke about this

Note that they don't say which mailbox in the Clinton administration...
Re:347 petabytes? by iamhassi · 2005-06-26 16:25 · Score: 1

Still, that's the original indexing time; if you start small and work up to the size it won't be so bad.
Also, you're assuming modern processors and hard drives. By 2022, 347 petabytes won't be anything when we all have terabyte hard drives... think about it, that's 17 years. How big/fast was your hard drive 17 years ago? Let's see... 1988... I didn't even have a hard drive, still all floppy.

By 2022 we'll all have drives holding hundreds of terabytes and be measuring transfer rates in GB/sec, if not larger/faster. Sorry, 347 petabytes is not impressive when you're talking 17 years from now.

--
my karma will be here long after I'm gone
Re:347 petabytes? by The_Wilschon · 2005-06-26 17:18 · Score: 1

It is entirely possible that it is an extremely large amount of data, but not nearly as many files as you might expect. I work at Fermilab, and I know that the event records we generate are somewhere on the order of a fifth of a gigabyte per event (after significant statistical data reduction). Then, we are having collisions at a rate of one every 396ns. Of course, most of those are uninteresting, and our trigger system throws them out. So, we finally have an event acceptance rate of ~0.08 kHz. Which leaves us producing data at a rate of 16 GB/s. In other words, we could have 347 PB produced in about 6000 hours (250 days). However, it would only contain ~1.75 billion different records. In which case archiving it would not in fact be a phenomenally difficult task.

Anyway, my point was that the records could be really large, and therefore there would not be that many of them. Turns out our event records are smaller on average than I thought they were, so my example didn't work as well as I had hoped... but the point stands.
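The arithmetic behind those figures can be checked with a short sketch. It uses only the numbers quoted above (a fifth of a gigabyte per event, ~0.08 kHz acceptance) and assumes decimal petabytes.

```python
EVENT_GB = 0.2            # "a fifth of a gigabyte per event"
EVENTS_PER_SEC = 80       # ~0.08 kHz acceptance rate
TOTAL_GB = 347 * 10**6    # 347 PB, decimal (assumption)

rate_gb_per_sec = EVENT_GB * EVENTS_PER_SEC
hours = TOTAL_GB / rate_gb_per_sec / 3600
records = TOTAL_GB / EVENT_GB

print(rate_gb_per_sec, "GB/s")
print(round(hours), "hours, about", round(hours / 24), "days")
print(records, "records")
```

That reproduces roughly 16 GB/s, about 6,000 hours, and on the order of 1.7 billion records.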

Afterthought: Sorry to be unable to provide references... you can go to www-cdf.fnal.gov if you like, but you will not be able to get to the internal sites that this information came off of.

--
SIGSEGV caught, terminating

wait... not that kind of sig.
Re:347 petabytes? by HyperChicken · 2005-06-26 17:20 · Score: 1

No clue. But it is one NARA.

--
Free of Flash! Free of Flash!
Re:347 petabytes? by Anonymous Coward · 2005-06-27 00:16 · Score: 0

"Let's just hop you don't write code for a living"

Let's hop you don't proofread for a living.

Compression and moderation? by moz25 · 2005-06-26 09:36 · Score: 1

Can't they get more storage performance out of their system by (more) aggressively compressing old information? That shouldn't matter too much to the indexing mechanism. Also, it might make sense to tag the importance of different documents so that their compression/archiving treatment can depend on that.

--
see a Text Widget

Re:Compression and moderation? by LiquidCoooled · 2005-06-26 10:11 · Score: 1

The best way to compress the data is to automatically put the national security black lines over the documents now rather than in 25 years time.
This way, not only will security be maintained (assuming they remove the data rather than just paint it black in a pdf), but it will take up less space.

Most released documents I have seen have most lines blacked out, so after compressing the remaining text, you could fit the entire Clinton admin documents onto a single floppy disk.

--
liqbase :: faster than paper
Re:Compression and moderation? by oakgrove · 2005-06-26 12:37 · Score: 1

That's some funny shit. Thanks for my first good laugh of the day.

--
The soylentnews experiment has been a dismal failure.

Compression by Anonymous Coward · 2005-06-26 09:37 · Score: 0

Surely data of this sort lends itself well to compression?

Data loss will always be a possibility by divide+overflow · 2005-06-26 09:37 · Score: 4, Insightful

It happened with the Great Library of Alexandria, with pagan libraries throughout the Christian era, and more recently has happened with antiquities in Afghanistan and Iraq. The only thing that can reliably preserve data is large scale, geographically widespread distribution of copies.

Re:Data loss will always be a possibility by Anonymous Coward · 2005-06-26 09:55 · Score: 0

they should use the google file system.
Re:Data loss will always be a possibility by tabdelgawad · 2005-06-26 10:03 · Score: 3, Insightful

Actually, it's more like 'inevitable'. I'll bet almost everyone has unintentionally lost digital data permanently and will do so again in the future.

The key, I think, is prioritization. We all do it individually (important stuff gets backed up many times and often, unimportant stuff perhaps never backed up), and NARA will have to do it too. I don't think backing up a president's email and backing up some minor whitehouse aide's email should have equal importance. The trick will be to come up with a reasonable prioritization scheme that will make the probability of losing the most important stuff very small.

--
Imposing Libertarian views on everyone online since 1992.
Re:Data loss will always be a possibility by Anonymous Coward · 2005-06-26 10:08 · Score: 0

and more recently has happened with antiquities in Afghanistan and Iraq

Yeah. Starting with Moslem fuckers blowing up statues of the Buddha because in their skewed eyes they were heathen images. Assholes.
Re:Data loss will always be a possibility by Anonymous Coward · 2005-06-26 10:35 · Score: 0

Did you even know what Buddha was until the report of the destruction of the statues on Fox News? Or are you one of those godless non-Christians who does not believe in Jesus? Or perhaps, even worse, one of these Christians that do not believe in the Ten Commandments?
Exodus 20:4 "You shall not make for yourself a carved image--any likeness of anything that is in heaven above, or that is in the earth beneath, or that is in the water under the earth;
5 "you shall not bow down to them nor serve them. For I, the LORD your God, am a jealous God, visiting the iniquity of the fathers upon the children to the third and fourth generations of those who hate Me,
6 "but showing mercy to thousands, to those who love Me and keep My commandments."
Re:Data loss will always be a possibility by doshell · 2005-06-26 11:37 · Score: 1

I think it also has to do with the fact that the media in which we store information are increasingly less durable (compare stone engraved millennia ago, writings on paper from past centuries still readable today, and the relatively short life expectancy of magnetic and optical media).

Now I'm not saying we should all go back to Stone Age, but it does make you think about the irony of progress...

--
Score: i, Imaginary
Re:Data loss will always be a possibility by writermike · 2005-06-26 12:04 · Score: 2, Funny

It happened with the Great Library of Alexandria, with pagan libraries throughout the Christian era, and more recently has happened with antiquities in Afghanistan and Iraq. The only thing that can reliably preserve data is large scale, geographically widespread distribution of copies.

True. But I hardly think Alexandria was lost to the tap of the Y key, a pregnant pause, then an "oops."

--
If Nalgene water bottles are outlawed, only outlaws will have Nalgene water bottles.
Re:Data loss will always be a possibility by FLEB · 2005-06-26 13:34 · Score: 1

Exodus 20:4 "You shall not...

Right. You. Obviously, this does not apply to them or me.

--
Information wants to be free.
Entertainment wants to be paid.
You just want to be cheap.
Re:Data loss will always be a possibility by It's+Impossible · 2005-06-26 15:04 · Score: 1

"The trick will be to come up with a reasonable prioritization scheme that will make the probability of losing the most important stuff very small."

Heh. Yes, that's the "trick".

Every archivist faces this problem; there's always more material than can possibly be catalogued. And there's really no practical way to differentiate between material that seems important now but will lose significance over time, and material that seems insignificant but will gain importance over time, or act as a small but key part of somebody's story someday, or derive cultural resonance from some future event, or be highly important to one person but nobody else... You get the picture.

Retention, in other words, is probably one of the defining characteristics of the archivist's craft. They spend all day thinking about "reasonable prioritization schemes", losing some battles while winning others.

The honest ones know that they lose more often than they win. Even in small archives, you might be astonished at just how much stuff languishes in a warehouse someplace, and probably won't even be seen in the lifetime of the current staff, if it survives at all through rodents or floods, or mistakes, or sheer inability to provide storage.

This particular article is interesting, and on Slashdot, because of the nature of the material and the technical challenges it may present, but issues of retention are not appreciably different from those faced by people working on "traditional" non-digital collections now or at any given time in the past.
Re:Data loss will always be a possibility by Tristor · 2005-06-26 15:17 · Score: 3, Funny

No, but it could have been lost to the strike of flint, a pregnant pause, then a "glukús theométôr" (Sweet Mother of God, for you people that suck). (Note: I spent like 20 minutes transliterating that to Latin just so I could post it on /. because it hated the Greek charset. I have no life.)

--
"I just karma whore to everyone." -garcia (6573)
Re:Data loss will always be a possibility by Anonymous Coward · 2005-06-26 15:33 · Score: 0

This God guy sure sounds like a friendly, caring sort of person.
Re:Data loss will always be a possibility by cazbar · 2005-06-26 17:19 · Score: 1

Ah yes, priorities.
The President's Email:

"Hey man, nice golf game yesterday. See you on the green on Friday."

The Aide's Email:

"Hey we need to get this budget worked out for the president or congress might stop ignoring his daily routine. Here's the data the IRS sent us."

Now we know who really runs the country.
Re:Data loss will always be a possibility by LiquidRaptor · 2005-06-26 18:07 · Score: 0, Offtopic

Heh, I always thought the whole point of christianity was you didn't need to worry about the 10 commandments as long as you accepted Jesus, but then again, that woulda ruined your troll so feel free to ignore.
Re:Data loss will always be a possibility by divide+overflow · 2005-06-26 20:19 · Score: 1

> Note: I spent like 20 minutes transliterating that to Latin just so I could post it on /. because it hated the Greek charset. I have no life.

Thanks for your selflessness. Or perhaps your Obsessive-Compulsive Disorder.
Re:Data loss will always be a possibility by onepoint · 2005-06-27 01:06 · Score: 1

Funny, 4 years ago someone asked me how to protect an entire warehouse of documents. I told him this:

1) Buy the required scanner systems for the documents, and have one spare machine set aside (sealed tightly in a steamship container).

2) Scan everything to a system with PCs.

3) Keep a duplicate of the PC in the steamship container.

4) Make laminated copies of all instruction manuals and keep them in the steamship container.

5) Store all the tapes somewhere and the backup copies somewhere else.

They went ahead and did everything: scanned, backed up to tape (and just recently copied the backup tapes to CD) ...

They cleared the warehouse of everything.

Now somewhere sit some containers, with everything.

They asked me why the copies of each hardware system. I told them that backup systems get forgotten, so having a fresh system sitting there saves them time and reduces worries.

--
if you see me, smile and say hello.
Re:Data loss will always be a possibility by Anonymous Coward · 2005-07-01 01:55 · Score: 0

What can I say, I am OCD.

Answer is Compression? by reporter · 2005-06-26 09:39 · Score: 4, Informative

National Archives and Records Administration is expecting to have as much as 347 petabytes to deal with by 2022. Are we destined for a "digital dark age"?

Perhaps, the answer is compression.

Does anyone know whether there is an upper limit to text compression?

In signal processing, there is a limit called the Shannon Capacity theorem, which gives the maximum amount of information that can be transmitted on a channel. In text compression, is there a similar limit?

Note that the Shannon Capacity theorem does not tell you how to reach that limit. The theorem merely tells you what the limit is. For decades, we knew that the maximum limit on a normal telephone twisted pair is about 56,000 bits per second, according to the theorem. However, we did not know how to reach it until trellis coding was discovered, according to an electronic communications colleague at the institute where I work.

If we can calculate a similar limit for text compression, then we can know whether further research to find better text compression algorithms would be potentially fruitful. If we are already at the limit, then we should spend the money on finding denser storage media.

Re:Answer is Compression? by slavemowgli · 2005-06-26 09:46 · Score: 1

There is no theoretical upper limit on text compression as far as I know (and I'd be rather surprised if there was [1]), but there *is* a (very basic) theorem from Kolmogorov complexity that says that, for any compression algorithm you devise, there's always data that it can't compress (for a proof, simply consider the number of strings of length <= n for a given n).

1. Well, I'd be surprised as long as you don't make any assumptions about the statistical distribution of bits in the text you want to compress. In other words, if you define certain properties and conjecture that all texts satisfy these properties, you may well be able to prove certain things (I'm not sure about this), but I don't think it'd really be very practical, as I'd say it's relatively likely that among those 347 PB there'll be data which does not match these properties.
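The counting argument can be spelled out in a few lines. This is only a sketch of the pigeonhole step over bit strings; the helper names are made up for illustration. There are 2^n strings of exactly n bits but only 2^n - 1 strictly shorter strings, so no lossless scheme can shorten every n-bit input.

```python
def strings_of_length(n: int) -> int:
    """Number of distinct bit strings of exactly n bits."""
    return 2 ** n

def strings_shorter_than(n: int) -> int:
    """Number of bit strings of length 0 .. n-1: 2^0 + ... + 2^(n-1)."""
    return sum(2 ** k for k in range(n))

for n in (1, 8, 64):
    # Always one output short: at least one n-bit input cannot shrink.
    print(n, strings_of_length(n) - strings_shorter_than(n))
```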

--
quidquid latine dictum sit altum videtur.
Re:Answer is Compression? by Indianwells · 2005-06-26 09:51 · Score: 1

But the article was really focused on archiving as opposed to backup. Compression would work for backup, but archiving is an attempt to make the data searchable as well as permanent. Some type of indexed compression would save space, but is it really that important? If they simply started chaining SANs, there would be no issue with storage. If they used a flash-memory-based SAN, there would be no data loss -- although it would be quite expensive to build. Tie that into a highly indexed and hierarchical database with smart data management, and this problem seems surmountable ...
Re:Answer is Compression? by Biogenesis · 2005-06-26 10:25 · Score: 1

Sounds a bit like 42, it'll tell us the answer, but we need something else to find the question.

--
...just so Google finds it.
Re:Answer is Compression? by MasterC · 2005-06-26 10:27 · Score: 3, Interesting

The only thing that comes to mind is information entropy. If you're given a text document, you can determine the probability distribution for each letter, letter combinations, for words, or whatever you can think of. Then given the probability distribution, you can determine the information entropy. If, in the sum, you use log with base 2 then H(x) (see formal definitions) gives you the entropy in bits.

For example, if you have a text file with letters of equal probability (all letters have a probability of 1/27) then the bits required to represent a single letter turns out to be ~4.7549 bits. (Indeed, 2^4.7549 = 27)

This is the upper limit of compression. Methods such as the now 50-year-old Huffman coding do decent work at approaching this limit (used in JPEG, for one).

So the answer to your question is: it's not broadly definable for "text" or "information" but is based on the patterns of the English language or a specific document.
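A minimal sketch of that entropy calculation, assuming the simplest possible model (each character treated independently, so letter combinations and words are ignored); the function name is made up for illustration:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(text: str) -> float:
    """Shannon entropy H = -sum(p * log2(p)) over single-character frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

uniform = "abcdefghijklmnopqrstuvwxyz "   # 27 symbols, each appearing once
sample = "the quick brown fox jumps over the lazy dog"

print(entropy_bits_per_symbol(uniform))   # equals log2(27), about 4.75 bits
print(entropy_bits_per_symbol(sample))    # lower, since letters repeat unevenly
```

A model that also looks at letter pairs or whole words would report a lower figure for real English, which is why the limit depends on the model and the document.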

--
:wq
Re:Answer is Compression? by mechsoph · 2005-06-26 10:44 · Score: 1

For any given set of data, there is a lower limit beyond which it cannot be compressed. This is called the "entropy" of the data. This is essentially how much actual information the data contains. We talked about it in one of my sophomore CS courses. Lempel-Ziv (gzip, I think) compression approaches the entropy of the data as the size of the data approaches infinity.
Re:Answer is Compression? by mcrbids · 2005-06-26 10:48 · Score: 1

There is no theoretical upper limit on text compression as far as I know

Which is obviously some hot gas coming from your posterior. Otherwise: 1 (the Holy bible, heavily compressed)

The amount of compression possible in a given string of numbers is inversely proportional to the amount of randomness in the input.

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Re:Answer is Compression? by Anonymous Coward · 2005-06-26 10:48 · Score: 0

In text compression, is there a similar limit?

Yeah, it's called entropy.
The basic model is that in the end you're transmitting bits of pure information encoded in some not-too-efficient method (such as English). And according to the pigeonhole principle, there's no way to compress the data further on average than sending those nonpredictable equiprobable bits directly.
Re:Answer is Compression? by Anonymous Coward · 2005-06-26 11:07 · Score: 0

signal channel applies to text. Doesn't the paper even have examples coding texts?

How is this an answer? It's not about the disk space.
Re:Answer is Compression? by Anonymous Coward · 2005-06-26 11:17 · Score: 0

Lempel-Ziv (gzip, I think) compression approaches the entropy of the data as the size of the data approaches infinity.

BZZT. That is only if the data can be described accurately with a Markov model. For instance, LZ compresses digits of pi roughly as badly as random digits even though the entropy in that sequence is close to nil.
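This is easy to try at small scale. The sketch below uses zlib's DEFLATE (LZ77 plus Huffman coding) as a stand-in for "LZ", generates pi with Machin's formula in integer arithmetic, and compares it with pseudo-random digits; the sample size of 10,000 digits and the helper names are arbitrary choices for illustration.

```python
import random
import zlib

def arctan_inv(x: int, unity: int) -> int:
    """arctan(1/x) scaled by unity, using integer arithmetic only."""
    total = term = unity // x
    x_squared = x * x
    divisor, sign = 3, -1
    while term:
        term //= x_squared
        total += sign * (term // divisor)
        sign, divisor = -sign, divisor + 2
    return total

def pi_digits(n: int) -> str:
    """First n+1 decimal digits of pi ("31415...") via Machin's formula."""
    guard = 10 ** 10                      # extra digits absorb rounding error
    unity = 10 ** n * guard
    return str(4 * (4 * arctan_inv(5, unity) - arctan_inv(239, unity)) // guard)

def ratio(digits: str) -> float:
    """Compressed size divided by original size."""
    raw = digits.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)

N = 10000
rng = random.Random(0)
random_digits = "".join(rng.choice("0123456789") for _ in range(N + 1))
print(ratio(pi_digits(N)), ratio(random_digits))
```

The two ratios should land close together, even though the short program above is itself a far more compact description of the pi digits.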
Re:Answer is Compression? by zysus · 2005-06-26 12:40 · Score: 2, Informative

Actually there is an upper limit...
It is some of Shannon's work on Information Theory.
Basically, information has entropy associated with it. Entropy being the randomness of information. Truly 100% random information cannot be compressed.
The central idea has to do with the probability of something occurring.
Text compresses quite well because certain letters are more common than others and there are a limited number of symbols. (e for example)
If I encode e using 1 bit instead of 8 that saves 7 bits.

This is the idea behind Huffman Coding.

Binary data... well, depends on the data.
I ran into this at work... basically, I was trying to reformat some data to save space on disk and eventually figured out that bzip would accomplish the same thing.
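For the curious, the Huffman idea mentioned above fits in a short sketch. This is an illustrative implementation rather than what any particular tool uses: it builds a prefix code from character counts so that frequent characters get shorter bit strings.

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    """Return {symbol: bitstring} for a Huffman code built from text."""
    counts = Counter(text)
    if len(counts) == 1:                       # degenerate one-symbol input
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tiebreak, {symbol: code so far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)      # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "the entropy of english text is well below eight bits per letter"
code = huffman_code(text)
encoded_bits = sum(len(code[ch]) for ch in text)
print(encoded_bits, "bits versus", 8 * len(text), "bits uncompressed")
```

The saving shown ignores the cost of storing the code table itself, which a real format has to ship alongside the data.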
Re:Answer is Compression? by commodoresloat · 2005-06-26 13:10 · Score: 1

Just make sure you have a portable and open compression format that you will be able to dig up in 50 years. I have a ton of old data that I backed up in MacOS System 7ish using an old version of Stuffit that did automatic compression in the background (I think it was called Space Saver or the like). Well, it was a really dumb idea of me to install that because that data is inaccessible to me without running an old copy of the MacOS (though perhaps classic would work) and digging up that particular version of Stuffit Expander (the versions they have produced since OSX came out do not recognize the format at all, even though it is a format they created a decade or so ago). The point is not just to be able to store the data in as little space as possible but also to reliably access it 10, 50, 100 years from now.
Re:Answer is Compression? by ComputerSlicer23 · 2005-06-26 14:32 · Score: 1

The problem with your glib answer is that "1" being the Holy Bible as compressed is completely legitimate. It'd be incredibly useful, assuming the entire contents of the Bible occur often in whatever it is you are compressing. It's essentially the concept behind Huffman encoding (it's not exactly the same, but picking the most common letters from your symbol set and encoding them as very short binary strings is the basic principle).
Depending on how specialized your data is, it might be a net win to do exactly that. However, for compressing arbitrary sets of data that obviously won't work. For most general compression methods, what you are saying about the amount of randomness (which I'm more used to hearing called entropy) is accurate (that's true of gzip and bzip2, for example).

Read up on history of the American Revolution (I'm from the US, I have no idea if you are). Sometimes, 1 means "They are attacking by land", and 2 means "They are attacking by sea". Seems like a fine compression algorithm given the dataset they had to compress.

The answer to the poster's original question is that, assuming all combinations of the alphabet symbols are equally likely, or that all valid input possibilities are equally likely, you can't compress the data at all. Compression depends upon the non-uniform nature of the input (which is really what entropy means). However, your concept that it depends upon a singular input is wrong-headed (there are lots of compression methods for which that is true; however, he's speaking strictly in the theoretical, not in the practical). It depends on the nature of all inputs and arranging that the most common inputs get re-encoded as shorter outputs. For lots of generic compression methods, that has a lot to do with entropy. I know you can compress English text much more than either gzip or bzip2 can. It's the basis of Andrew Tridgell's Ph.D. thesis; I believe you can download it off his site. I believe this is a link to it.

Kirby
Re:Answer is Compression? by Nefarious+Wheel · 2005-06-26 15:41 · Score: 1

1 if by land... that's not a compression scheme, that's an indexing scheme.
Speaking of which, don't we have to consider indexing this megalith? And if things haven't changed *that* much since I was a DBA, you can easily have indexing that takes ten times the storage of the raw data itself. Better factor that in, too.

--
Do not mock my vision of impractical footwear
Re:Answer is Compression? by ComputerSlicer23 · 2005-06-26 16:04 · Score: 1

Toe-mato, Tha-mato (that works better when spoken). It is in fact a compression scheme, and an indexing scheme.
I can easily think of it as a compression scheme. If they wanted to have it communicate all of that information they could have devised "Morse Code", and actually spelled it out. This is obviously much shorter. The code they specially designed for this single use was exactly as described.

You can think of it as an indexing scheme if you feel like it, but that doesn't mean it's any less legitimate as a compression scheme. If I were to send you files where every time you saw "1", you should replace it with "By Land", and "2" with "By Sea", it's a compression scheme. We'd have to come up with an escape sequence for "1" and "2", but that's not a big deal. This only makes sense if we are using "By Land" and "By Sea" a lot.

Yes, I have indexes that are larger than the dataset (actually, as a general rule, I don't; I have several indexes on a single table which total more than the dataset, but I don't believe any single index is larger than the dataset). With Oracle, if you have only one index and it encompasses all the fields (probably the only easy way to have it take more space than the original), you can just create an IOT (Index Organized Table). Then the data is the index, and the index is the data. I'd do it, but the version of Oracle we run is fairly buggy with respect to IOT (it's a legacy Oracle8i DB).

Kirby
Re:Answer is Compression? by Wordsmith · 2005-06-26 17:09 · Score: 1

That's perfectly legitimate compression, if in your scheme "1" is actually equivalent to the Bible. 11 would then be a nice shorthand, highly compressed way of writing the Bible twice, back to back. 12 might mean the Bible, then the Koran.

Such a scheme wouldn't be very useful for general use, of course ...
Re:Answer is Compression? by Anonymous Coward · 2005-06-26 17:13 · Score: 0

Does anyone know whether there is an upper limit to text compression?

Lossless or lossy?
Re:Answer is Compression? by Pingla · 2005-06-26 19:08 · Score: 1

I do not see how there can be any upper limit on text compression, as it will depend on the contents of the text. As many have posted, it will depend on the frequency of the characters, and this will vary depending on the contents (i.e. the language it is written in).
Take the banal example of a language with only ten characters where three characters are heavily used. The compression of such a text would be very high, whereas a highly complex language might have lower compression possibilities.
Re:Answer is Compression? by JustKidding · 2005-06-26 22:07 · Score: 1

Does anyone know whether there is an upper limit to text compression?

That, of course, strongly depends on the entropy of the text to be compressed. When you're talking about the current president's email, well, there can't possibly be a whole lot of entropy in there, so it should be really easy to compress.
Re:Answer is Compression? by Butterspoon · 2005-06-27 00:50 · Score: 1

The current POTUS doesn't use email, it is claimed. He was shocked at the potential that a FOIA request would have to reveal his supposedly private comments. Ironic, really, considering the current topic!

--
pi = 2*|arg(God)|
Re:Answer is Compression? by some+guy+I+know · 2005-06-27 02:34 · Score: 1

Does anyone know whether there is an upper limit to text compression?
I remember that some company came up with a great compression utility back in the 1980's that could compress any data down to 1% of its original size.
These compressed files could themselves be compressed, resulting in even further compression.
By repeated compression, any size file could be compressed down to less than 1024 bytes.
The problem was that the company offered no decompression utility to decompress the file.
My understanding is that the company went out of business shortly thereafter.

A long, long time ago, a friend of mine in college came up with a great compression scheme while watching a file being punched on paper tape: simply remove all of the zero bits from the file being compressed.
This will typically reduce a file size by 50%.
Since the upper bit in bytes in ASCII text files is always 0, a text file will typically be compressed by over 50%.
In addition, once all of the zero bits are removed, you can compress the file further by counting the one bits left over and using that number as the compressed file.
This number can be further compressed, etc.
By repeated compression, all files can be compressed down to one bit, and since that bit is always 1 (except in the case of an initially empty file), you can treat that 1 bit as implied and remove it as well, leaving you with a totally empty file.
This means that my friend's scheme offers infinite compression!
Unfortunately, as with the scheme mentioned above, decompressing such a file is problematic.
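
Why decompression is "problematic" can be shown with a small sketch. The function names are hypothetical; the first function implements the "remove all of the zero bits" step literally, and the second counts how many shorter outputs exist for inputs of a given length.

```python
def strip_zero_bits(data: bytes) -> str:
    """Write the input as a bit string and drop every 0 bit."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return bits.replace("0", "")


def shorter_files(n_bits: int) -> int:
    """Number of distinct bit strings strictly shorter than n_bits."""
    return 2 ** n_bits - 1


# "A" is 01000001 and "B" is 01000010: different inputs, identical output.
print(strip_zero_bits(b"A"), strip_zero_bits(b"B"))

# There are 2**8 possible one-byte files but fewer shorter bit strings,
# so no lossless scheme can shrink all of them.
print(2 ** 8, shorter_files(8))
```

Once two inputs map to the same output, no decompressor can tell them apart. The counting argument applies equally to the 1980s utility mentioned above: a scheme that shrinks every file cannot be reversible.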

--
Those who sacrifice security to condemn liberty deserve to repeat history or something. - Benjamin Santayana
Re:Answer is Compression? by Anonymous Coward · 2005-06-27 06:32 · Score: 0

Shannon's limit is the limit on entropy that can be transmitted. Such transmission can be from a compressor (at "now") to a decompressor (at "future"). Several posters have said, "calculate the entropy...", which is mildly humorous as there is no such thing as absolute entropy. There *is* entropy relative to a model.

Compressors pick a model and attempt to maximize the entropy density (relative to the model). One can always arrange for a particular text to have a minimal entropy (one bit) by having a model that is: if "0", then the special document, else /* "1" */ decompress the remainder of the document the normal way. This is one of the canonical examples of how entropy is relative.
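
The canonical example above can be sketched as follows. Two simplifications are assumptions of this sketch and not part of the post: the flag is a whole byte instead of a single bit, and zlib stands in for "the normal way" of compressing. SPECIAL is a placeholder for whichever document the model is built around.

```python
import zlib

# Placeholder for the one document this model already "knows".
SPECIAL = b"We the People of the United States, in Order to form a more perfect Union"


def compress(doc: bytes) -> bytes:
    if doc == SPECIAL:
        return b"\x00"                   # flag 0: the special document, nothing else to send
    return b"\x01" + zlib.compress(doc)  # flag 1: ordinary compression follows


def decompress(blob: bytes) -> bytes:
    if blob[:1] == b"\x00":
        return SPECIAL
    return zlib.decompress(blob[1:])


print(len(compress(SPECIAL)))            # the special document costs only the flag
print(len(compress(b"any other memo")))  # everything else pays for the flag on top
```

The special document shrinks to almost nothing only because its content lives inside the decompressor, which is the sense in which entropy is relative to the model.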

It is highly likely that Markov chains could be constructed for the various authors cataloged by the NARA. This would have the added benefit that one could produce a "generic" document in the style of the author by following the chains...
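
A rough sketch of "following the chains", assuming a first-order, word-level chain. The sample sentence, the function names, and the fixed random seed are all made up for illustration; a real author model would be trained on that author's archived text.

```python
import random
from collections import defaultdict


def build_chain(text: str) -> dict:
    """Map each word to the list of words seen immediately after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain: dict, start: str, length: int, seed: int = 0) -> str:
    """Walk the chain from `start`, picking each next word at random."""
    rng = random.Random(seed)
    word = start
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out)


sample = "the archive keeps the records and the archive loses the formats"
print(generate(build_chain(sample), "the", 8))
```

The same table of follower frequencies that drives the generator could also serve as the model for a compressor, since it predicts which word is likely to come next.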

ha by The+Big+Ugly · 2005-06-26 09:40 · Score: 3, Funny

"Archiving Digital History at the NARA"

You'll have to pry it from my cold, dead hands!

Ohhhh, NARA, not NRA....

Re:ha by grokMeNow · 2005-06-26 23:36 · Score: 1

Perhaps we should ask Moore for suggestions.

--
"Is man merely a mistake of God's? Or God merely a mistake of man's?"--Friedrich Nietzsche

Retain it all. by d3m057h3n35 · 2005-06-26 09:42 · Score: 2, Insightful

Perhaps it would be best to keep it all, even the stuff that now may seem totally useless, like Clinton administration emails from Janet Reno to Madeleine Albright asking what she thinks about Norman Mineta and his "hot Asian vibe." With search technology improving constantly, it would probably be better than throwing stuff away which could potentially be of interest, or spending time developing the AI to make the task less time-consuming. And besides, we can't make future historians' jobs too easy. They've gotta earn their pay, reminding us of the banalities of this age.

age of clutter... by Anonymous Coward · 2005-06-26 09:44 · Score: 0

Doesn't matter. We can't absorb the information available at any moment in real time. So we certainly cannot go back and absorb it later.
The abandonment of the notion that information should be evaluated and only the best archived -- as in traditional libraries -- is indeed likely to lead to a dark age. But it will be symmetric to the old ones: can't find the target in the clutter instead of being unable to find it in the desert.

Google to the rescue!!! by feloneous+cat · 2005-06-26 09:45 · Score: 3, Funny

With the new GoogleNARA...

nara.google.com

Oh, wait... I'm getting ahead of myself...

--
IANAL, but I've seen actors play them on TV

All that for things like... by suitepotato · 2005-06-26 09:46 · Score: 1

Dear Monica,
I did what last night? Man, I must have been smashed. You sure? ROTFLMAO...
Yours truly,
Bill

Seriously, we're archiving every little tiny 1 and 0 for what reason? There's some things that can just go in a zip file and be put on a CD and that's it. Want them to stick around forever? Have files put out every so often in leftover space on AOL CDs. They'll never be gone forever.

--
If my grammar and spelling are off, I am [distracted/tired/careless] (take your pick)

Difference between data and trash by HermanAB · 2005-06-26 09:47 · Score: 4, Insightful

In the age of pen and paper, only important stuff was written down. Nowadays all crap is preserved. This is useless. There is a big difference between data and information.

--
Oh well, what the hell...

Re:Difference between data and trash by Anonymous Coward · 2005-06-26 10:30 · Score: 1, Insightful

The problem is that it can be hard to know where the boundary between important and useless is...

Things that previous generations considered unworthy of preservation are things that are greatly treasured in today's age - look at all the old manuscripts of which we only have a few pages (because scribes reused the parchment). Look at the masterpieces that were painted over to save canvas.

As soon as you start to put hard limits down on what to preserve, and what to leave alone, we risk losing information that our next generations will value.

Besides - in many cases, it could just be easier to save everything. It seems that trying to enforce standards and judging what should and shouldn't be preserved might be more labour-intensive than the alternative. Considering the rate at which information is generated it might make sense to have a trade-off between conserving storage versus conserving labour... storage is easier/cheaper/more available :)
Re:Difference between data and trash by jacksonj04 · 2005-06-26 10:47 · Score: 1

The trick is to get your data infrastructure organised to start with. Because I have a predetermined system for organising my class notes (Microsoft OneNote, so shoot me) I can reliably pick out notes from a specific class based on date, or topic based on exam questions, or I can take the Google approach and just go "Find me anything to do with this".

The information I need is preserved in an easily accessible form because I made a decision to make all my class notes organised, and as a result I've replaced 8 ringbinders of poorly organised content with a tablet PC and searchable, editable content.

Good planned structure to start with helps organisation later. Google has made gMail easily organised with tags, the world is getting closer to the idea that *everything* needs to be categorised by date, subject, relevance, people involved etc. but it's a long way yet.

--
How many people can read hex if only you and dead people can read hex?
Re:Difference between data and trash by Prof.Phreak · 2005-06-26 23:13 · Score: 1

Hmm... Assuming that google/yahoo save all of the queries anyone ever does (over the years), just index the -entire- NARA database using google, and then run it against -all- the queries anyone has bothered to run in the last 5 years. Whatever files do -not- come up in the first 1000 results, can be safely deleted :-)

Just an idea...

--
"If anything can go wrong, it will." - Murphy
Re:Difference between data and trash by evilviper · 2005-06-27 02:53 · Score: 1

Of course there is a difference between data and information, but it seems quite clear that only important information is being preserved.

In the story they talk about multiple revisions of word documents written by leaders, and photos of the effects of agent orange. Do you consider those things "crap"?

The fact is, the government is huge, and there is a hell of a lot of important information to be saved over the years.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Re:Difference between data and trash by HermanAB · 2005-06-27 03:23 · Score: 1

for example: Multiple revisions...

--
Oh well, what the hell...
Re:Difference between data and trash by evilviper · 2005-06-27 03:41 · Score: 1

The changes made in the process of writing a document are almost as important as the end product. Just look up the drafts of the founding documents, and see all that changed from the start to the final draft. A significant amount of historical information would have been lost if we did not have those revisions.

Besides that, revisions are very, very small, so it's not as if storage is a real problem. When your 500GB hard drive is full, you don't go through and delete all your unneeded text files first, do you? It's merely the complexity of extracting the revision information from a Word document, and outputting it to an XML document that makes it challenging.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Re:Difference between data and trash by Anonymous Coward · 2005-06-29 08:50 · Score: 0

Thank you. The above made-up letter would not be considered a record. NARA only intends to preserve permanent records in their system. That is a small percentage of what the federal government generates. It is still a large amount, in sheer volume, but a very small percentage nonetheless.

Maybe it should have been 45 million e-mails by drsmack1 · 2005-06-26 09:48 · Score: 1

Deleting e-mails seems to be a good way to avoid archiving issues.

http://archives.cnn.com/2000/ALLPOLITICS/stories/03/23/whitehouse.email/

--

Humor from a Genetically Molested Mind

Why do we need to archive everything? by night_sky_nsci · 2005-06-26 09:48 · Score: 1

I'm a little skeptical about why we have to archive all that information in the first place. History as we know it is established through researching bits and pieces of evidence and putting them together; we know quite a bit about what happened 200, 300 years ago, but I am sure we don't have an equivalent of 200 petabytes of, say, parchment from which to study our recent history.

It'd be crazy to suggest the NARA audit every single bit (no pun intended) of archival data to determine whether it's worth archiving or not -- not only is it impossible, it flies in the face of the whole idea of archiving. However, the estimate of 347 petabytes may perhaps be too pessimistic, as surely not every kind of information they have is worth archiving. Just my two cents.

Re:Why do we need to archive everything? by felix71 · 2005-06-26 10:24 · Score: 4, Insightful

Actually, one of the main complaints Historians have is incomplete information about the past. Not having every little tidbit makes it impossible to figure out how people actually lived. History _should_ be more than just names, dates, and events. If we can properly preserve and index items that seem really mundane to us, future generations have a _much_ better chance of having some real understanding of how we developed as a society.

--
Never attribute to malice that which can be adequately explained by incompetence. -- Jerry Pournelle
Re:Why do we need to archive everything? by night_sky_nsci · 2005-06-26 10:32 · Score: 0

Are you familiar with paleopathology? Anthropologists dig up bones and study traces of diseases and injury in human skeletons. They use this to gain further insight on the lifestyle Dark Ages people led; for example, a typical Dark Ages skeleton would have bones that suggest they have been broken a few times moreso than other ages, suggesting to historians Dark Ages people engaged in physically risky and demanding activities. Furthermore, by studying the growth on these broken bone sites, they can figure out how they would have treated them.
Re:Why do we need to archive everything? by Anonymous Coward · 2005-06-26 17:17 · Score: 0

Never attribute to malice that which can be adequately explained by incompetence. -- Jerry Pournelle

I'm afraid that in Jerry's case it's probably 50/50.

Dark Ages by TimeTraveler1884 · 2005-06-26 09:50 · Score: 5, Insightful

Are we destined for a "digital dark age"?

If by "dark age" you mean a time in human history where more information is recorded than ever, yes I suppose we are.

I think more accurately, we are headed towards an age of super-saturation of information. I have no doubt we can store all the data we are currently and will be generating. The question is how do we process it into something meaningful? Just because we have the ability to archive everything, does not mean it will be useful to the [insert personally welcomed overlord] of the future.

Maybe historians of the future will be fascinated that Clinton's instant-message signoff was "l8ter d00d", but I doubt it. We'll want to save everything now of course, because we can. But the majority of the information I suspect will just be filtered out when actually searched.

Personally, I take the "you never know" ideology and save everything.

Re:Dark Ages by Blahbooboo3 · 2005-06-26 10:09 · Score: 1

Personally, I take the "you never know" ideology and save everything. You must be a clutter hound! :)
Re:Dark Ages by BlackMesaLabs · 2005-06-26 13:59 · Score: 0

I agree with the "you never know [so save everything]" ideology. One day we'll all kill each other somehow, and there'll be a nice cache of info for aliens to find a thousand years later. That'd be about all it's useful for though ;) ...oh, and in Soviet Russia, data stores YOU!
Re:Dark Ages by d474 · 2005-06-26 17:37 · Score: 1

"Personally, I take the "you never know" ideology and save everything."
That's a good ideology, because I'm sure we'll develop an AI that would be more than happy to deep search this data someday and shed light on some history we never knew about. It could be very interesting.

--
Authority questions you. Return the favor.

Not a dark age... was the past so bright? by G4from128k · 2005-06-26 09:51 · Score: 4, Insightful

Digital technologies mean that archivists now enjoy orders of magnitude more information than they had in the past. Consider all the hallway and phone conversations or jotted notes lost in a paper-based organization versus having an archives of e-mail, IM, and sticky-note digital files.

Digital technologies mean that archivists now enjoy orders of magnitude more potential accessibility than in the past. Even if paper has greater innate archival lifespan, its physical form makes it inaccessible to all but a select monkish class of archivists colocated with their paper archives. Even the select few archivists who are allowed access to paper archives can only effectively process at best a dozen documents per minute (and only a dozen per hour if they must wander the files to find randomly dispersed documents).

By contrast, digital technologies radically expand access on two dimensions. First, technology expands the number of people that can access an archive in terms of distance -- a remote researcher can have full access, including access to documents in use by other archivists. A low cost to copy documents means a wealth of information. Second, search tools provide prodigious access to the files -- searching/accessing/reading thousands or millions of documents per second.

To say we face a dark age is to presume that paper documents provided far more enlightenment and comprehensiveness of documentation than paper ever actually did.

--
Two wrongs don't make a right, but three lefts do.

Re:Not a dark age... was the past so bright? by Nasarius · 2005-06-26 11:12 · Score: 1

I think you're missing the point, which is that all that data is now much easier to lose, especially in the short term, if it's not taken care of properly.

--
LOAD "SIG",8,1

They use TIFF? by Anonymous Coward · 2005-06-26 09:53 · Score: 0

From the article:
and the 2000 census returns were converted into more than 600 million TIFF-format image files, some 40 terabytes of data

Why TIFF!? PNG (or any other lossless format) would reduce that considerably.

Re:They use TIFF? by Fear+the+Clam · 2005-06-26 09:57 · Score: 1

Why not just convert it to text? If a picture is worth 1000 words, they can knock the data down to 4 gigs right there.
Re:They use TIFF? by Murphy+Murph · 2005-06-26 10:02 · Score: 1

From the article:
and the 2000 census returns were converted into more than 600 million TIFF-format image files, some 40 terabytes of data

Why TIFF!? PNG (or any other lossless format) would reduce that considerably.

Uhh, maybe because TIFF does support compression.
Both lossy and lossless.

--
I dub thee... Sir Phobos, Knight of Mars, Beater of Ass.
Re:They use TIFF? by Anonymous Coward · 2005-06-26 13:24 · Score: 0

because 100 years from now, someone might want to look at the actual forms, how they were filled out, oh, i don't know, maybe do some kind of study comparing handwriting styles between people from the northeast to those in alaska, whatever.

point is, you lose that kind of data if you just convert it to text.
Re:They use TIFF? by Anonymous Coward · 2005-06-26 13:37 · Score: 0

Woooooooosh!
Re:They use TIFF? by hurfy · 2005-06-27 12:30 · Score: 1

hehe, that sounds about right.

We process medicaid for the state of Oregon.
We have to write the data (less than 1k of names/numbers) over a PDF file of the form. It then saves both! Only uses 1,500,000 bytes of disk space for 500 bytes of data :(

Used more space than the other 3 states and all private pay people and companies plus Medicare combined. Sometimes the criteria seem to have zero to do with end result :/

Answer is not compression, it's less data. by gus+goose · 2005-06-26 09:54 · Score: 2, Insightful

People should think outside the box.

The answer to archiving the required volumes is producing less volumes. Case in point... we recently spent a week or so at work optimising a process that was I/O bound. The bugger took 10 hours to run. Although purchasing faster disks, converting to RAID0, and other techniques did whittle down the execution time to about 5 hours, the final solution was to redefine the process to reduce the actual IO (removed a COBOL sorting stage in the process), and the process is now 2 hours.

Bottom line: with the 100 + 38 million dollars (FTFA) assigned to the project I am sure I could eliminate a number of redundant positions, optimise some communication channels, retire voluminous individuals, replace inefficient protocols/people, and basically reduce the sources of data. Hell, if the US were to actually have peace instead of demand it, there would be a much reduced need for military intelligence, political rhetoric, and other civil responsibilities. The military could be half the size, and what do you know, we could not only reduce the requirement for archiving, but could actually save money in the process.

Remember, government is a self-supporting process.

Go ahead, mark me a troll.

gus

--
.. if only.

Re:Answer is not compression, it's less data. by MasterC · 2005-06-26 10:40 · Score: 2, Insightful

...other techniques did whittle down the execution time to about 5 hours, the final solution ...is now 2 hours.

That's only a 60% reduction. A 60% reduction of 347 PB is still 138.8 PB...still a huge archival task.

Keeping 1% of the data still leaves you with 3.47 PB. Not impossible, but still a daunting task.
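
The figures above can be checked with a few lines of Python. The helper name is made up for this sketch; it simply scales the 347 PB estimate by the same ratio as the run-time improvement.

```python
def remaining_pb(total_pb: float, hours_before: float, hours_after: float) -> float:
    """Scale a data volume by the same ratio as a run-time improvement."""
    return total_pb * hours_after / hours_before


# 5 hours down to 2 hours is a 60% reduction, the figure used in the post.
print(round(remaining_pb(347, 5, 2), 1))   # petabytes left after a 60% cut

# Measured from the original 10 hours, the same change is an 80% reduction.
print(round(remaining_pb(347, 10, 2), 1))  # petabytes left after an 80% cut

# Keeping only 1% of the data.
print(round(347 * 0.01, 2))
```

Even the more generous 80% reading leaves tens of petabytes, so the conclusion holds either way.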

--
:wq
Re:Answer is not compression, it's less data. by rbarreira · 2005-06-26 10:41 · Score: 1

The answer to archiving the required volumes is producing less volumes. Case in point... we recently spent a week or so at work optimising a process that was I/O bound. The bugger took 10 hours to run. Although purchasing faster disks, converting to RAID0, and other techniques did whittle down the execution time to about 5 hours, the final solution was to redefine the process to reduce the actual IO (removed a COBOL sorting stage in the process), and the process is now 2 hours.

I'm sure I could do that in about 1 hour...

--

The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F

burn, knowledge, burn by Leontes · 2005-06-26 09:54 · Score: 2, Interesting

The ancient, esteemed great library of Alexandria was burned to the ground as knowledge literally turned to smoke, lost to mankind forever. Was it barbarians? Motivated by political revenge? Demanded by religious zealots? Accidental byproduct of an act of war?

Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that lead up to the horrible decisions that lead our politicians to make the mistake of the daily business? We want records of this?

Perhaps the easiest way of keeping this knowledge at all interesting or inspiring is to burn it regularly, let people imagine what happened to allow such blunders or let apologists spin tales of delight explaining elegant solutions to how stupid people stumbled upon genius decisions. Conspiracy theorists or intellectual artistry can probably generate far greater truths than the truth will ever reveal.

It would save a great deal of money too, just having a delete key. If we are going to care so little for the decisions in the here and now, why preserve the information to be twisted by people in the future with their own biases and projects? We seem to care so little for truth nowadays, why should that change in the future?

Re:burn, knowledge, burn by Anonymous Coward · 2005-06-26 09:57 · Score: 0

Those who forgot history are destined to repeat it.
Re:burn, knowledge, burn by Leontes · 2005-06-26 09:59 · Score: 1

those that do not look up quotations before posting are destined to misquote them.
Re:burn, knowledge, burn by mrogers · 2005-06-26 10:30 · Score: 2, Interesting

Doesn't it diminish the aura of a great work of art if you know that it can always be restored from a backup?
Re:burn, knowledge, burn by Anonymous Coward · 2005-06-26 10:32 · Score: 0

The ancient, esteemed great library of Alexandria was burned to the ground as knowledge literally turned to smoke, lost to mankind forever. Was it barbarians? Motivated by political revenge? Demanded by religious zealots? Accidental byproduct of an act of war?

league of shadows.
Re:burn, knowledge, burn by Anonymous Coward · 2005-06-26 10:40 · Score: 0

Reminds me of the Canadian Broadcasting Corporation (CBC), who did throw a lot of old footage in dumpsters, and then lost some more due to a fire.

And all this time I had been wondering why they weren't coming out with some of the older shows [from my youth] on DVDs :(
Re:burn, knowledge, burn by mcrbids · 2005-06-26 10:57 · Score: 4, Insightful

Really, it's only the great works of artistry that need to be retained and remained, sustained and maintained. Historically, it's interesting to catalogue art, but politics? The everyday communications that lead up to the horrible decisions that lead our politicians to make the mistake of the daily business? We want records of this?

Absolutely, yes!

History is often taught as "Charlemagne took over Constantinople in the year 12xx" as though military feats really mattered to the average Joe. But, the truth is, America was colonized by people who thought that, however bad it might be in a virgin land, it was BETTER than their lives in Europe.

One of the key failures in public education today is to communicate the understanding that history is comprised mostly of PEOPLE doing ORDINARY things in their time to make life better for themselves and their families. They loved, worked, got bored, and cracked jokes at the expense of their leaders, just like we do today.

History doesn't consist of battles, any more than history consists of artworks. Capturing more detail in the average, everyday lives of people gives a much better understanding of the cultural norms, and the ideals to which people aspired.

The pyramids of ancient Egypt provide a clear, artistic monument to their culture, yet we have only a modest understanding of their day-to-day cultures. Similarly, we have Stonehenge as a clear monument to the grooved-ware people of the English isles, but almost NO understanding of who they were and what they felt was important. How much would a true historian give to understand the day-to-day culture of these mysterious "grooved-ware" people of ancient times?

Those memos and IMs comprise that understanding of people today.

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Re:burn, knowledge, burn by Anonymous Coward · 2005-06-26 11:04 · Score: 0

Those WHO do not match pronouns are destined to be grammar flamed.
Re:burn, knowledge, burn by Leontes · 2005-06-26 11:56 · Score: 1

those who grammar flame are destined to live pleasant, peaceful existences, with much delight. Bastards.
Re:burn, knowledge, burn by Leontes · 2005-06-26 12:45 · Score: 1

I'm not convinced that people want to know the truth of what happened if it doesn't speak to their own specific zeitgeist. Why does it matter what some joe w. schmuck does in office when there is a john q. public living the prototypical existence outside the hallowed halls of policy?

It is only through examining the artistry, the great works, the monuments that withstand the test of time possibly, for those are the things which were the attempts of that culture to enrich themselves. Perhaps instead of recording the email of politicians, certain random people's lives should be recorded, without their knowledge, for posterity's sake. This is how citizen-31234x lived their life: what does it tell us about the time they existed in?

If we start talking about politics, the action is already so many iterations divorced from reality, what does it matter for its record, except to document the folly of man?

It is the successes of humanity that interests me, not the constant, ever-increasing, astounding downfall of mankind. Give me the life and times of a random person striving for a good life, rather than a famous person 'striving' for a 'good' life.
Re:burn, knowledge, burn by commodoresloat · 2005-06-26 13:23 · Score: 1

Hello, History? You are going to judge these people, aren't you?
Re:burn, knowledge, burn by coopex · 2005-06-26 18:08 · Score: 1

The only problem is that when some historian from 2425 comes upon slashdot and goatse, there'll be an immediate call to delete it all.

--
The road to hell is paved with good intentions.

Re:Usually when I archive... by ArchAngel21x · 2005-06-26 09:56 · Score: 2, Interesting

I guess you didn't see how Mr. Ebbers or the founder of Aldephia are facing prison time. Quit trying to spread that liberal lie that white collar crime pays off. By the way, it is inappropiate to refer to blacks as niggers. Grow up and learn to be a little more tolerant of diversity.

So? by ArchAngel21x · 2005-06-26 09:58 · Score: 3, Insightful

By the time the government comes up with a half ass solution, archive.org will already have it all organized, online, indexed, and backed up.

contract for archiving system by 1nv4d3r · 2005-06-26 10:03 · Score: 1

anybody know what the government has spec'ed TFA's archiving system to do? It says it will need to read 16,000 file formats, and be impervious to terrorist attack (?), but not much else...

I wonder what kind of searches and cross-linking will be done, for instance. What kinds of access control there will be? I'd also just like to see what the 16,000 formats are, out of curiosity. Sounds like a project waaaay larger than the $136 million they've allotted for it so far.

Stupid name.... i'm guessing they were chuckling about 'the ERA of NARA'...

where's our real measurements? by Toxygen · 2005-06-26 10:05 · Score: 1

How many libraries of congress is that?

Have a look at the Fedora Project by pangloss · 2005-06-26 10:06 · Score: 3, Funny

http://www.fedora.info/
(Not to be confused with the Linux distribution)

From the website, Fedora is "a general purpose repository service...devoted to...providing open-source repository software that can serve as the foundation for many types of information management systems".

Problem for some is that Fedora can be a little hard to grok. It's not an out-of-the-box repository to install and run, like the repository application mentioned in the article (DSpace). It's an architecture for building repository software. Once you understand the potential for building applications on top of Fedora, you start to see some light at the end of the tunnel for just the sort of issues the article raises.

Re:Have a look at the Fedora Project by ragnar · 2005-06-27 01:08 · Score: 1

I work on a digital humanities project (and I also work down the hall from the Fedora folks). We are in the process of ingesting our 20,000+ object repository into Fedora. Most of it involves XSL acrobatics, but I'll spare the details.

Fedora is oriented toward digital library work, which I suspect has some carry over with archival work at NARA. They would be wise to look at it, but I'll say from our personal experience, it is a major task to get our materials into Fedora. I don't mean this in any way to speak ill of the project or its staff. The system is designed primarily to work with XML source material, but as NARA has stated, they are faced with 16,000 different file formats.

I think NARA is basically screwed if the goal is to save everything. I'll be interested to see how they pare down the list and set priorities.

I used to work at the Library of Congress, where we built software for metadata capture in a project to digitize analog films and audio recordings. Many of these source objects were films that were literally turning to acid on the shelf. Our estimates said it would take 70 years at peak production to digitize everything, yet many films would be ruined in less than 20 years. If the experience at the Library bears any resemblance to NARA, the first job is to prioritize.

--
-- Solaris Central - http://w
Re:Have a look at the Fedora Project by pangloss · 2005-06-27 04:17 · Score: 1

The system is designed primarily to work with XML source material

The digital object model is represented in XML, but there's certainly no bias favoring source material in XML. Datastreams by reference can point to any mime-typed material. For example, see the Spoken Word collection at Northwestern (audio recordings of Supreme Court arguments) or the University of Virginia Digital Image collection.
Re:Have a look at the Fedora Project by ragnar · 2005-06-27 04:24 · Score: 1

Thanks for the clarification. I hesitated when I wrote that, and should have double checked my facts. All the examples I've worked with so far have been with XML content.

--
-- Solaris Central - http://w

Re:Dark Ages are ahead! All aboard by screwthemoderators · 2005-06-26 10:07 · Score: 2, Funny

I think it may be worse than that- that there will be a huge proliferation of false information, sensationalistic 'infotainment,' advertising, propaganda, etc... Why, historians of the future may be depending on /. as their main source of information! Think of what a tragedy that would be!

--
http://en.wikipedia.org/wiki/Signature_bloc

In related news by AutopsyReport · 2005-06-26 10:08 · Score: 0

I heard Monica Lewinsky slurped up 1 gigabyte of this digital history.

--

For he today that sheds his blood with me shall be my brother.

Relevant, interesting post by Council · 2005-06-26 10:11 · Score: 4, Funny

Here is a relevant post by Ralph Spoilsport on an earlier article, which can be found here. I am reproducing it here in full because it is very interesting and highly relevant.

this is actually a BIG question

And one that I have railed about for many years.
I have been in the same position the Author discussed, and I have come to ONLY negative conclusions. In a few words, and I hate to say this, but buddy:

WE'RE FUCKED.

Digital is a loser's proposition. Backing up to analogue, or even to digital data on analogic substrates (such as DV tape), fails. Simply and purely.

The *only* thing that comes close is some kind of RAID, and those, even with the plummeting price of storage, are still too expensive given the needs.

Also, a RAID assumes a continuity of several things that are not likely to be continuous:

With Video:
Framerate, number of lines, colour depth, aspect ratio, file format, compression format, Operating system compatibility, etc etc etc. All of these things are variables.

With Audio:
sample rate, compression format, bit depth, file format, etc.

Basically all of it points to very bad places.

I am fairly well convinced that our age will simply disappear. They will find our garbage, the few books not pressed on acidic paper, our paintings (fat lot of good the abstract stuff will mean to them) and drawings, that's about it. the rest will just be shiny little bits of crap in the landfill.

Since we will have used up all the dense energy forms, they will be appalled at the energy requirements just to get the few remaining museum piece devices to work. Archiving the 21st century will be impossible. To the 25th century, the 21st century will be seen as a dark age - not only for the holocaust of the die-off caused by the failure of the petroleum based economy, but from the simple fact that very little of the information formats we are totally geared into will survive, including this note on /.

His problem of saving personal video is just the tip of the iceberg. His problem is the problem of our very civilisation, writ small.

That's why I am abandoning video, and going back to painting. In 500 years, my painting CAN survive. the video simply won't.

RS

And don't give me shit about my karma or whatever. My karma's fine, I don't care about it. I'm copying this because it's interesting and contributes to the discussion.

What do you think about Ralph's thoughts?

--
xkcd.com - a webcomic of mathematics, love, and language.

Re:Relevant, interesting post by fmaxwell · 2005-06-26 11:16 · Score: 2, Insightful

I think that he's being absurdly pessimistic. Those future historians will have no more difficulty reading our media than we have playing the sounds from a wax cylinder for an Edison phonograph. Running archived computer software will be no different than any of us using a software emulator to run a game for a long-dead gaming console.

Sure, optical and magnetic media decay, but there's nothing stopping people from "refreshing" the media before it decays too far. If you have a stack of CD-R discs that are starting to show an increase in correctable errors, then you back them up to new CD-R discs or to DVD-R. You don't have to sit idly by and watch them decay. I've got a CP/M computer that's over 20 years old and it can still boot from its 10MB (yes, megabyte) hard drive. So it's not like data just disappears five years after it's recorded.

It's also a problem which is being addressed by the industry. There are companies offering long-life CD-R media designed for archival. Other companies offer data storage for archival data, much like the climate-controlled vaults where countless audio master tapes and films have been stored for decades.

In closing, I think that 95%+ of archived data will still be able to be accessed in a century -- provided that it is properly stored and cared for.
Re:Relevant, interesting post by Anonymous Coward · 2005-06-26 11:34 · Score: 0

Sure, optical and magnetic media decay, but there's nothing stopping people from "refreshing" the media before it decays too far. If you have a stack of CD-R discs that are starting to show an increase in correctable errors, then you back them up to new CD-R discs or to DVD-R.

Yes, but with current OS's, correctable errors are completely hidden from me. Only when the error is uncorrectable do I get a warning. There is no equivalent of S.M.A.R.T for CDs.
Re:Relevant, interesting post by Council · 2005-06-26 11:49 · Score: 1

Dammit. Why do I keep getting modded "funny" when I don't expect it!?

I think that must be a bad sign.

And the last stupid joke I tried to make (a goddamn PUN) got modded "interesting".

Sigh.

--
xkcd.com - a webcomic of mathematics, love, and language.
Re:Relevant, interesting post by Council · 2005-06-26 11:52 · Score: 1

Goddammit, why do I keep getting modded "funny" when I'm being serious!?

And the last joke I tried to make got modded "interesting". It was a goddamn PUN, people! There was nothing interesting about it!

Sigh.

Watch this be modded "anti-semitic" or something.

--
xkcd.com - a webcomic of mathematics, love, and language.
Re:Relevant, interesting post by Thing+1 · 2005-06-26 11:57 · Score: 1

What do you think about Ralph's thoughts?

"My cat's breath smells like cat food." - Ralph

I think he means it.

--
I feel fantastic, and I'm still alive.
Re:Relevant, interesting post by fmaxwell · 2005-06-28 05:33 · Score: 1

Yes, but with current OS's, correctable errors are completely hidden from me. Only when the error is uncorrectable do I get a warning. There is no equivalent of S.M.A.R.T for CDs.

There are programs which talk to the drives at a low level and report BLER and other error indicators (just one example: KPROBE). If you're doing data archiving, then get hold of these programs. If you're burning something to transport from home to work, don't worry about it.

Slightly overdramatic? by mrogers · 2005-06-26 10:12 · Score: 2, Insightful

Are we currently experiencing a dark age because we don't have access to every letter, memo, bank statement and laundry ticket created in the 20th century? Archiving everything is an attractively simple approach, but if it turns out to be impractical we can always fall back on common sense and restrict ourselves to archiving the maybe 10% of things that have even a remote chance of being interesting in 100 years' time.

Tanks for the Memories by Doc+Ruby · 2005-06-26 10:17 · Score: 2, Interesting

We need to imprint holographic storage on synthetic diamonds. Even if they're slow and expensive, they'll last even longer than the paper records they replace. We'll have to spend a fortune redigitizing all the polymer (CD/DVD, floppy, tape), celluloid (microfilm/fiche) and rotating (disc) media that will age to illegibility within our lifetimes. Until we get holographic gems, we need to archive everything on paper, including those expiring media, in a format easily digitized to a more permanent medium. But of course the government, and barely accountable bosses, want the public record to disappear down the memory hole. If they could accelerate the process, including newspapers, they'd spend everything we've got (and more) to make it happen.

--
make install -not war

Re:Tanks for the Memories by StupidKatz · 2005-06-26 11:52 · Score: 1

Sure thing, I'll just run over to K-Mart and pick up a few diamo-discs and an -RW drive!

As for the idea to "archive everything on paper", I'd like to see you "archive" the contents of a typical 80GiB hard drive by using paper. Google tells me that it would take roughly 500 trees to "archive" a single 80GiB drive. (I'd need about 2,000 trees just for myself!)

The real solution is to do what everyone else does: backup to your medium of choice (tape, optical, hard disk, etc.), make a duplicate copy, and re-create (and test) your archive as often as you deem necessary. When some magic, permanent archival medium does make it out into the market, then we can all start using the wonderful holographical goodness. But, of course, that wonderful goodness won't help you today, nor next week.
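
For the "make a duplicate copy and test it" step, here is a minimal sketch of one way to do the testing: record a checksum for every file in the original, then check the copy against that list. The function names and the choice of SHA-256 are my own assumptions for illustration, not anything from the posts above.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(manifest: dict[str, str], copy_root: Path) -> list[str]:
    """Return the relative paths that are missing or altered in the copy."""
    bad = []
    for rel_path, expected in manifest.items():
        candidate = copy_root / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            bad.append(rel_path)
    return bad
```

Build the manifest once when the archive is created, keep it alongside each copy, and rerun `verify_copy` whenever the media gets refreshed. An empty list means every file still matches.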
Re:Tanks for the Memories by Doc+Ruby · 2005-06-26 13:25 · Score: 1

A quick search finds barcodes storing 26.3KB per square inch, much denser than the "small novel" claimed in your source (which would be 38 pages at 1MB). No surprise, the average word is 6 letters long, yet their numbers claim it takes 10 bytes to represent them. Of course, compressing a novel, or a government archive, takes a lot less than 6 bytes per word. And barcodes are far from the densest paper archive that I can think of, and I'm not even in the business. Archiving images of typed paper documents typically compresses 1MB to 35KB, without even using "codebook" encoding to reduce letters to 6 bits - the standard 250 word page reduces to about 1KB of text, not 1MB, in binary, which is then compressed, up to 40:1. So we're talking about many orders of magnitude smaller media requirements than those numbers you cited guesstimate. Let's call it 100K:1, and pages have two sides. Now we're talking about 100B pages. Through 2022, which is 17 years. 50K trees, your numbers say, paper 1M "short novels", which would be about 100 pages long, but not A4 sized - let's say 0.5 A4 sheets: 50B sheets. Since 500 sheet reams weigh 6lbs, and 17 trees make a ton of paper, that's 166.7K sheets per tree. Which is 300K trees. New Hampshire alone accounts for about 100K trees harvested per year, which is 1.7M trees, about six times the trees needed for the maximum archive estimate during its 17 years. And the pages don't have to be trees: US recycling recovers about 50M tons of paper per year, which is 40% of the paper needed.

It's still a lot of paper. But we'll get it back, recycled, when we switch to the online archival storage, rather than the near-line optical scanning. It's also expensive. But the US economy will produce over $600T in the next 17 years. That's almost $2500:MB to be archived. Seems like we can afford it.

--
make install -not war

Records by Big+Sean+O · 2005-06-26 10:17 · Score: 2, Informative

NARA makes a distinction between a document and a record. Any old piece of paper or email is a document, but a record is something which shows how the US government did business.

For example, the email to my supervisor asking when I can take a week's vacation isn't a record. The leave request form I get him to sign is a record. An email about lunch plans: not a record. An email to a coworker about a grant application probably is.

Besides obvious records (eg: financial and legal records), there are many documents that may or may not be records. For the most part, it's up to each program to decide which documents are records and archive them appropriately.

--
My father is a blogger.

Every mail is sacred by kfg · 2005-06-26 10:21 · Score: 3, Insightful

Every mail is great
If a mail is wasted
The gods get quite irate

Every mail is wanted
Every mail is good
Every mail is needed
In your network neighborhood

Really, the idea of not being able to record and save every post-it note being equated with those times and places where writing itself was denigrated into virtual nonexistence is a bit silly.

KFG

Re:Every mail is sacred by TripleE78 · 2005-06-27 04:52 · Score: 1

Oddly enough, just after reading this, I noticed the Slashdot fortune was:

And now for something completely different.

Eerie.

~EEE~

the more I think about it... by 1nv4d3r · 2005-06-26 10:30 · Score: 2, Insightful

I'm not sure most of this stuff is worth preserving digitally enough to justify the cost. Just print 'em out, and put them in a Raiders of the Lost Ark-style warehouse. The few people who want to see all of the Clinton administration's emails can travel to it and search.

I'd much rather see those hundreds of millions of dollars invested in, for instance, making all out of print recordings and books available on-line. It's a smaller problem (sounds like), but would benefit the world much more than online copies of every government employee's timecard records.

.

Re:the more I think about it... by NekoXP · 2005-06-26 12:24 · Score: 1

I'd much rather see those hundreds of millions of dollars invested in, for instance, making all out of print recordings and books available on-line. It's a smaller problem (sounds like), but would benefit the world much more than online copies of every government employee's timecard records

They already invest hundreds of millions of dollars in that. It's called the Library Of Congress.

http://www.loc.gov/
http://www.digitalpreservation.gov/

distributed model by nikkatsu · 2005-06-26 10:30 · Score: 1

the only way to start the process of really archiving is to break out of expecting single institutions to do it and distribute the task -- distributed archiving can start with bloggers, since they seem to have time on their hands: http://www.mcgeek.com/mainsite/tech/123,37.html/

strip MS HTML from Outlook mails by rduke15 · 2005-06-26 10:31 · Score: 4, Funny

I don't know about the NASA data sets, but they could certainly save a few petabytes by stripping the stupid HTML part of all Outlook emails...

Moore's Law saves the day by G4from128k · 2005-06-26 10:40 · Score: 2, Interesting

In 1987, a Mac II came with a 40 MB drive. 17 years later, a PowerMac G5 came with a 160 GB drive. This was at least a 4000X improvement in storage density and price (and 1987's drive was both physically larger and more expensive than 2004's drive).

Assuming we continue the current rate of advance in storage density and price, future archivists should be able to buy a 0.64 PB drive for under $500 in 2021. A mere quarter of a million dollars will provide enough space for a copy of all that stuff.
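
For anyone who wants to check that arithmetic, here is a rough sketch. It assumes decimal units (1 PB = 10^15 bytes), takes the $500 price point at face value, and simply repeats the 1987-to-2004 growth factor for one more 17-year period. The helper names are made up for illustration.

```python
MB, GB, PB = 10**6, 10**9, 10**15  # decimal units (an assumption)


def projected_drive_bytes(old_bytes: int, new_bytes: int) -> tuple[int, int]:
    """Repeat the observed growth ratio for one more period of equal length."""
    growth = new_bytes // old_bytes
    return growth, new_bytes * growth


def drives_needed(archive_bytes: int, drive_bytes: int) -> int:
    """Whole drives needed to hold the archive (ceiling division)."""
    return -(-archive_bytes // drive_bytes)


growth, drive_2021 = projected_drive_bytes(40 * MB, 160 * GB)
count = drives_needed(347 * PB, drive_2021)
cost = count * 500  # dollars, at the $500-per-drive figure above

print(f"growth factor over 17 years: {growth}x")
print(f"projected 2021 drive: {drive_2021 / PB:.2f} PB")
print(f"drives for 347 PB: {count}, cost: ${cost:,}")
```

That works out to 543 drives and $271,500 for a single copy, with no redundancy, servers, or power included.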

--
Two wrongs don't make a right, but three lefts do.

Re:Moore's Law saves the day by Nasarius · 2005-06-26 11:31 · Score: 1

First, Moore's Law is about transistor density, which has nothing to do with hard drives. Secondly, hard drives haven't been getting any more reliable. That means all these hard drives have to be replaced every few years. It's a nightmare for long-term storage.

--
LOAD "SIG",8,1

I'm guessing... steady state. by dpbsmith · 2005-06-26 10:41 · Score: 3, Interesting

The Zapruder film was the beginning. In recent years, I've been dumbfounded by the vast extension in recording and documentation of things like crimes in progress, natural disasters, America's Funniest Home Videos, you name it. A plane crashes, and the next day there are ten different home videos from people in the vicinity who had camcorders.

I believe the cost of traditional photography in constant dollars dropped enormously between my parents' time and mine. I know we took about ten times as many silver-on-paper and Kodacolor dye-on-paper snapshots as my parents did. Then we got a camcorder. My parents captured about three hours total of 8 mm silent home movies. I have about forty hours of 8mm and digital-8 camcorder tape.

And since my wife and I got digital cameras, we've been taking five to ten times as many pictures as we did when we used film cameras.

Now, YES, I'm on the format treadmill. Got most of the old 8mm movies transferred to VHS. Got most of the VHS transferred to DVD. Got a lot of the old slides scanned. Got most of my digital images burned to CD. In the last five years, I've probably spent a hundred hours, or 0.2% of my life, on nothing but struggling to copy from old formats to new. I've spent a small fortune getting Shutterfly to print pictures, because to tell the truth I have much more faith in the prints surviving than the CD's.

So, I don't see a digital dark age. I see a bizarre situation in which the quantity of material recorded in digital form continues to increase exponentially for quite some time. _Most_ of it will get lost, and the percentage that survives, say, a hundred years will keep going DOWN exponentially with time.

But I'm guessing the total quantity of 21st century material available to historians of the 23rd century will, in absolute numbers, be just about the same as the total quantity of 20th century material.

It's one of those mind-boggling things like personal death that one can never quite come to grips with. The future is unknown, and we can accept that. But the fact that most of the past is unknown is equally true--and very hard to accept.

--

"How to Do Nothing," kids activities, back in print!

Yeah, but that's 17 years away. by MacDork · 2005-06-26 10:43 · Score: 1

In 2022, we'll probably have terabyte capacity in our mobile phones. Seriously. In the early 90s, 80 GB of drive space ran about $80,000 according to this archived historical document. Nowadays, I can get an 80 GB drive for about $65 according to froogle, and that's without considering inflation. Sure, at a conservative $1/GB we're looking at $347 million today, but in 17 years' time that'll probably look more like two or three hundred thousand bucks. No biggie for our bloated government.

True but... by BlightThePower · 2005-06-26 10:58 · Score: 1

I don't think backing up a president's email and backing up some minor whitehouse aide's email should have equal importance.

I agree really, but I also find the problem with data is you never know until it's too late. The aide's email could be an international "smoking gun" lost forever vs. an eternally archived Presidential request for diet soda on Air Force One.

I feel that if you can't completely automate backups then the best thing is to give users easy access to backup resources for their own material so they can judge what's most important and what isn't. This happens in some organisations at the moment but not in all; I used to work in a place where I had to make a special appointment with a tech just to burn a CD of stuff on my HD. Guess how much data we regularly lost as an organisation...

--
Plays violent online games as: Nerfherder76

Re:True but... by some+guy+I+know · 2005-06-27 02:11 · Score: 1

The aide's email could be an international "smoking gun" lost forever vs. an eternally archived Presidential request for diet soda on Air Force One.
I agree with this completely.
The article mentioned the selective retention of information as one possibility for coping with the massive amounts of data that need to be preserved.
I think that it would be a mistake to do this.
IMO, all data should be archived in bulk as soon as possible, and then scholars can work on indexing those portions that they deem important, in order to allow more efficient access to such data.
However, the "less interesting" information should still be available, so that if a mid-level manager's lunch plan email (mentioned in the article as being something not worth keeping) turns out to be important, then it can be retrieved.

The other thing to consider is that, as AI progresses, it may become possible to have intelligent agents sift through the data, and find subtle relationships not seen by human researchers.
If some data are lost, however, such automated research will be crippled to some extent.

I remember reading an essay by Isaac Asimov some years back, where he was responding to criticism of his "Foundation" novels, particularly the fact that nobody in the Empire knows on which planet the human race originated.
He mentioned the burning of the ancient library at Alexandria and the destruction of all video recordings of the first Super Bowl (an event popular in the U.S. involving American football) as examples of how knowledge can be lost.
In addition to Asimov's examples, many old movies are rotting away in studio warehouses, centuries-old books are decaying or being eaten by insects, etc., etc.
So even non-digital data can be and are being lost.

The solution is to make many copies of data available at many different locations (including off-planet, once that option becomes viable).
Do this first; worry about classification/categorization/etc. later.

--
Those who sacrifice security to condemn liberty deserve to repeat history or something. - Benjamin Santayana

AOL CD's by Anonymous Coward · 2005-06-26 10:59 · Score: 0

AOL CDs are closed. Believe me, I've tried...

The Solution: by DarkEdgeX · 2005-06-26 11:20 · Score: 2, Funny

NARA needs to open up tons and tons of GMail accounts. Where do I send my invites so I can contribute?

--
All I know about Bush is I had a good job when Clinton was president.

Re:The Solution: by gnu-sucks · 2005-06-27 05:09 · Score: 1

Perhaps if you knew a little more, you'd have a better job...

Try to help correct other's math sans sarcasm. by jbn-o · 2005-06-26 11:29 · Score: 5, Insightful

You were just a little over 12 times too much. Let's just hope you don't write code for a living :p [...]

To you and the countless others on /. who offer their corrections in a similar tone: Yes, we get it, the parent poster goofed and you supplied a correction. Given the trivial context here, it's hardly a big deal and doesn't warrant sarcasm. Everyone make mistakes and plenty of people make mistakes in their work every day, including people who do work where lives are at stake. That's one reason why it is good to work with other people. In life it's far more important to be forgiving, keep things in perspective, and help other people without the wiseacre commentary and then move on.

--
Digital Citizen

Re:Try to help correct other's math sans sarcasm. by Anonymous Coward · 2005-06-26 14:16 · Score: 0

You know what? I completely agree with you. And I wish more people had that attitude towards people and life in general. But they don't. And this is Slashdot where proving your knowledge by disproving another's is a way of life. May you continue through your journey with your outlook on things. I will also, and maybe we shall meet as friends one day.
Re:Try to help correct other's math sans sarcasm. by seti · 2005-06-26 18:52 · Score: 1

i wish i had mod points left..

--
Coca-Cola, sometimes War.
Re:Try to help correct other's math sans sarcasm. by Anonymous Coward · 2005-06-27 01:37 · Score: 0

Everyone make mistakes

"makes".

Elementary My Dear Dewey by chadpnet · 2005-06-26 11:34 · Score: 1

Who says you have to archive all data digitally? The system that's been working for years at our local public and univ. libraries is storing meta information digitally that references a tangible location.

Cost-of-copy and modes of failure by G4from128k · 2005-06-26 11:34 · Score: 2, Interesting

I think you're missing the point, which is that all that data is now much easier to lose, especially in the short term, if it's not taken care of properly.

Perhaps, perhaps not. Sure, digital data can be lost easily, but it can also be copied/backed-up more easily. Assuming $0.01/page for paper copy (a gross underestimate of the cost of paper, toner, and labor for copies) and assuming 10 kB data/page (an overestimate), $10/GB (for high-end maintained storage), then the cost ratio is at least 100:1 in favor of digital (and probably 1000:1). Inaccessible formats are a concern, but an automated batch process at the time of initial archiving can, at least, convert the data to some data format standard with a longer likely lifespan (e.g., plain ASCII, RTF, PDF, HTML, etc.)
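
The 100:1 figure can be reproduced in a few lines. This sketch uses only the three numbers stated above and assumes decimal gigabytes (10^9 bytes); the function and parameter names are just illustrative.

```python
def paper_to_digital_cost_ratio(
    paper_cost_per_page: float = 0.01,   # dollars per copied page
    bytes_per_page: int = 10_000,        # 10 kB of data per page
    digital_cost_per_gb: float = 10.0,   # dollars per GB of maintained storage
) -> float:
    """Return how many times more a paper copy costs than a digital one."""
    pages_per_gb = 10**9 / bytes_per_page            # 100,000 pages hold 1 GB
    paper_cost_per_gb = pages_per_gb * paper_cost_per_page
    return paper_cost_per_gb / digital_cost_per_gb


print(round(paper_to_digital_cost_ratio()))
```

With the defaults, paper comes to about $1,000 per GB against $10 per GB for disk. Since the page cost is described as an underestimate and the bytes per page as an overestimate, the true ratio would be larger.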

Paper has its own single-point-of-failure concerns, and the huge cost of copying makes those concerns real. Digital does add some new modes of failure (e.g., format obsolescence), but I think those are not as burdensome as the physical costs of copies.

--
Two wrongs don't make a right, but three lefts do.

Oh, give me a break... by Anonymous Coward · 2005-06-26 11:35 · Score: 0

The guy conflates integrity preservation solutions (RAID) with data format issues.

Major formats will be figured out after the apocalypse, don't worry about that. (Sir, we found over 100 million 4 3/4" plastic discs with digital data on them! Should we try to decode them?) Data will be lost, that's true. But some of it will be figured out, just as when we look back at current histories.

In the past, when societies used paper or papyrus instead of parchment, the recoverability of their information went down because they didn't survive as well. At other points, changes in inks (due to convenience of manufacture or cost) also led to lower data survivability.

So this isn't a new thing at all.

But most importantly, don't get too excited about it. You should be no more worried about whether your pr0n collection will survive than the average Greek or Roman was about their inventory/accounting records, or indeed their pr0n collections.

Reiser4 by r_jensen11 · 2005-06-26 11:45 · Score: 0

Haven't you heard? There's nothing Reiser4 can't do!

That's a lot of...! by Seumas · 2005-06-26 11:52 · Score: 1

347 petabytes? Why not store it all as petafiles?

*duck*

Re:That's a lot of...! by hostyle · 2005-06-26 20:17 · Score: 1

Were you attempting to make some sort of PDF-file joke there?

--
Caesar si viveret, ad remum dareris.
Re:That's a lot of...! by Seumas · 2005-06-26 20:33 · Score: 1

No, I was trying to make a child-molestation joke. Sheesh.

If bad humor can't be appreciated at Slashdot, where CAN it be -- oooh lookie, FARK!

Agreed by shpoffo · 2005-06-26 12:13 · Score: 1

All too relevant.... Recording every minute detail of communication is not the way our brains work now, and doesn't even seem to be on the horizon for how our brains are going. Why in the world would we want to archive every little detail?

Governmental psychosis is costly.

.
-shpoffo

Re:Agreed by mattpalmer1086 · 2005-06-27 01:39 · Score: 1

The problem is not recording, storing, migrating and managing all this stuff. The problem is locating the good stuff in the midst of all the boring stuff. This problem has two parts: deciding what's interesting (historians often find commonplace stuff, like an old bill of sale, as interesting as, if not more interesting than, the record it was accidentally left in), and actually having the manpower to appraise it in the first place.

It is actually cheaper to archive more stuff digitally, knowing that some of it won't be very interesting, than it is to micro-appraise the records and only take the "good" bits, and then rely on the increasing sophistication of search engines to help mine it all.

A Job For Google? by InfoTechnologist80 · 2005-06-26 12:16 · Score: 1

Has not Google already figured out this problem with GMail? Google, maybe you should bid on the job? Imagine being able to use google to search the national archive. Hmmm.... :)

make them archive ECHELON too by hilaryduff · 2005-06-26 12:28 · Score: 1

if they aren't already

Money by Detritus · 2005-06-26 12:35 · Score: 1

It doesn't matter whether it is on paper or digital media. If someone isn't willing to spend the money to preserve it, it will be lost. I've seen decades worth of project records and file libraries end up in the land-fill because there was no budget or requirement for preserving them. It's sad to see the products of many years of work by talented people discarded like so much trash.

To add insult to injury, slime-sucking lawyers now advise their clients to destroy records, like email, as soon as possible to prevent them from being the subject of discovery in a future lawsuit. At a previous employer, company policy was to nuke all email older than 30 days. Due to the drive to eliminate paper shuffling, email messages were the only record of many policy decisions.

--
Mea navis aericumbens anguillis abundat

Re:Money by detritus` · 2005-06-26 20:28 · Score: 1

heh, its funny cuz its true... i've "burned" many an email message once a contract was complete and the money was in my pocket... its safer for me to put the money in my pocket and then deny all contact with a company than to actually archive all transactions if someone decides they do not like me any more. IE. i remember one client who threatened legal action if i did not update his website after actually making said website for a certain fee... after asking him to prove it and due to a HD crash it was better for me to delete any correspondence than to actually update said site.

--
drunk chemists

entropy by YesIAmAScript · 2005-06-26 13:01 · Score: 2, Informative

You can calculate the amount of entropy in a document (text or no) and that is a limit to how small you could possibly make it.

I don't recall how close modern methods like arithmetic encoding make it to that limit, but I know it's close enough that we couldn't double the compression ratio of text documents from the current state of the art.
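
As a rough illustration of the entropy calculation, here is a sketch that measures the simplest version: the Shannon entropy of the byte frequencies. Be aware that this is only the bound for a coder that treats every byte as independent. Real text has structure between characters, so a context-aware compressor can go below this number, and the true limit is lower than what this computes. The function names are my own.

```python
import math
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Zeroth-order Shannon entropy: H = -sum(p * log2(p)) over byte values."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def order0_size_floor(data: bytes) -> float:
    """Smallest size in bytes for a coder that models bytes independently."""
    return entropy_bits_per_byte(data) * len(data) / 8


sample = b"the quick brown fox jumps over the lazy dog " * 100
print(f"{entropy_bits_per_byte(sample):.3f} bits per byte")
print(f"{len(sample)} bytes in, order-0 floor about {order0_size_floor(sample):.0f} bytes")
```

An arithmetic coder driven by the same byte frequencies gets very close to that order-0 floor. Getting further means using a better model of the text, not a better coder.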

Trellis coding is a system for dealing with induced errors in modem signalling. It allows you to cancel some of them out. It doesn't actually increase the throughput in an ideal situation.

The thing that allowed us to reach the limit for a phone line is combined amplitude-phase coding, or the creation of the "constellation diagram" for modem encoding.

The constellation defines certain combinations of phase and amplitude that represent groups of bits (a baud). Trellis coding simply defines additional combinations that are not sent. If you see any of these on the receiving end, then you realize that the constellation is either being twisted (phase error) or shrunk/grown (amplitude error) and you can try to compensate for it.

The name comes from a trellis, like you grow plants on. The legal signals sent should go through the holes in the trellis. If you receive a signal that falls on the trellis (hits the trellis) you adjust it so that it goes through the trellis and assume this adjustment factor can be used to adjust other, valid hits too to more accurately determine the data that was sent.

--
http://lkml.org/lkml/2005/8/20/95

Moore's Law and storage by G4from128k · 2005-06-26 13:12 · Score: 1

First, Moore's Law is about transistor density, which has nothing to do with hard drives. Secondly, hard drives haven't been getting any more reliable. That means all these hard drives have to be replaced every few years. It's a nightmare for long-term storage.

You are right -- Gordon Moore spoke only of trends in the number of transistors/IC. Yet his law was, if anything, about advances in the technologies of miniaturization. This miniaturization has had profound, indirect effects on storage. The same technologies that enabled semiconductor engineers to make smaller transistors have helped disk drive designers make denser drives. Smaller heads, faster electronics, and a better understanding of materials led to advancements in both ICs and HDs.

I'm not sure what you mean about reliability. Perhaps reliability on a per-drive basis remains constant. But reliability on a per-bit basis has improved. How long would a cluster of 4,000 40 MB drives go without a failure in 1987? The reliability of 160 GB of storage has improved.

Yes, storage systems need periodic drive replacement, but by the time a drive needs to be replaced, the indirect effects of Moore's Law will have made that replacement about 1/4 the price of the original drive. Thus, if storage is $1/GB now, securing about $1.33/GB is sufficient to buy both today's storage and have the money needed to buy all subsequent replacements every 3 years in perpetuity. By 2022, a storage array of 100 servers with 6 drives each (an installation only 4 times larger on a device-count basis than the new Wikipedia installation) would provide the needed storage of 347 PB.
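
The $1.33 figure is the sum of a geometric series: 1 + 1/4 + 1/16 + ... = 1 / (1 - 1/4) = 4/3. A small sketch of that sum, assuming each replacement costs a fixed fraction of the one before it (interest and inflation are ignored, and the function names are just for illustration):

```python
def perpetual_cost_factor(price_ratio: float) -> float:
    """Total of 1 + r + r^2 + ... for a replacement price ratio 0 <= r < 1."""
    if not 0 <= price_ratio < 1:
        raise ValueError("price_ratio must be in [0, 1) for the sum to converge")
    return 1 / (1 - price_ratio)


def cost_after_replacements(price_ratio: float, replacements: int) -> float:
    """Total spent on the original purchase plus a finite number of replacements."""
    return sum(price_ratio**k for k in range(replacements + 1))


print(f"in perpetuity: ${perpetual_cost_factor(0.25):.4f} per $1 of initial storage")
print(f"after 5 replacements (15 years): ${cost_after_replacements(0.25, 5):.4f}")
```

The series converges quickly: after five replacements the running total is already within a tenth of a cent of the perpetual figure.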

--
Two wrongs don't make a right, but three lefts do.

WANTED: Digital Librarian Archive Asst., by catmistake · 2005-06-26 13:18 · Score: 1

WANTED: Digital Librarian Archive Asst., Digital Salvage Director, UNIX Admin., eMail Archiver, Information Architect, Mathematician, Psychological Counselor...

All positions require Computer Science Bachelor's and Master's degree and 18 years experience or 2 year Mathematics degree and 10 years experience, except for the Counselor, which requires 6 months Hooters waiting experience and a PRN

send resume and salary reqs. to address_empty at potmail.c

--
The Admin and the Engineer

347 petabytes = ? Libraries of Congress by pentalive · 2005-06-26 13:38 · Score: 1

or how many Volkswagen Beetles filled with DAT tapes?

or how many beowulf clusters are needed to search it? sort it? :^)

Re:347 petabytes = ? Libraries of Congress by helfen · 2005-06-27 17:49 · Score: 1

347 petabytes = ? Libraries of Congress

The Library of Congress (20 million books, not counting pictures) - 20 terabytes - (http://web.archive.org/web/20001101043610/http://archive.org/14terabytes.html)

347 petabytes = 355 328 terabytes

355 328 / 20 = 17 766.4 Libraries of Congress
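For anyone who wants to redo the conversion, here is a small Python sketch. It assumes 1 PB = 1024 TB (as in the line above) and takes the 20 TB Library of Congress estimate at face value; the names are illustrative only.

```python
TB_PER_PB = 1024          # binary convention, matching 347 PB = 355 328 TB
LOC_TERABYTES = 20        # estimate quoted above for 20 million books


def libraries_of_congress(petabytes: float,
                          loc_terabytes: float = LOC_TERABYTES) -> float:
    """Convert a size in petabytes into 'Libraries of Congress'."""
    terabytes = petabytes * TB_PER_PB
    return terabytes / loc_terabytes


if __name__ == "__main__":
    print(f"347 PB = {347 * TB_PER_PB} TB")
    print(f"       = {libraries_of_congress(347):.1f} Libraries of Congress")
```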

it serves a statistical purpose by mbius · 2005-06-26 13:44 · Score: 1

Think of the crap as padding. When you only save important information, and then you lose information, it was important.

346 petabytes of padding might be overdoing it though.

--
you can have my violent video games when you pry them from my cold, dead hands.
Prime UID Club

like John Lithgow! by mbius · 2005-06-26 14:01 · Score: 1

Are we currently experiencing a dark age because we don't have access to every letter, memo, bank statement and laundry ticket created in the 20th century?

It is a matter beyond impeachment that future generations can expect substantial volumes of washie to go unclaimed, forgotten to the sonorous march of history.

--
you can have my violent video games when you pry them from my cold, dead hands.
Prime UID Club

industrial espionage would be sillier by mbius · 2005-06-26 14:05 · Score: 2, Funny

"Give them to me."

"What do you want??"

"That Gem...and the Holograms."

--
you can have my violent video games when you pry them from my cold, dead hands.
Prime UID Club

Re:Dark Ages are ahead! All aboard by identity0 · 2005-06-26 14:21 · Score: 1

You jest, but it's possible something like Wikipedia or (shudder) everything2 will be on some future historian's list of sources.

So historians in 2100 will have to wade through various trolls and defacement attempts to try to get what people thought about in 2005 - but at least they'll know not to click on Goatse links.

Answer is DNA? by Anonymous Coward · 2005-06-26 14:44 · Score: 0

DNA combined with the cellular mechanisms to protect and propagate the information.

Very Large Storage Array by hernick · 2005-06-26 14:50 · Score: 1

How expensive is data storage, really? I'll design a ten petabyte (10PB) storage system. You'll see how much it costs. To build this monster machine, I'll be using commercial off-the-shelf hardware organised as a massive Linux cluster.

You may ask "why do you want to build the most powerful Beowulf cluster on Earth when storage companies have all these amazing storage systems?" Well, this system needs to be an open solution. The system will need to grow and evolve as the needs change. Vendor lock-in is simply unacceptable.

"Current storage software for Linux is useless for this massive an archive", you retort. Let me make it clear that there is no software anywhere that can fulfill the needs of the NARA. No matter what solution they choose, they will have to commission the creation of custom software. This software must be open-source, for the sake of the future growth and survival of the system.

Our basic storage unit will be a 300GB SATA drive. It's inexpensive, fairly reliable, and available in bulk quantities. We're going to need a lot of these little drives.

I'm going to create a very reliable system. First, it'll be completely redundant, with two identical systems, kept in sync, at two different geographical locations. I'll first consider the cost of a single location.

So, I need to store 10PB in a single building. I want reliability, and I'll get that by using two different levels of ECC: RAID-5 and distributed data. All disk drives will be part of 4-drive RAID-5 arrays. All these arrays will be part of larger 4-array RAID-5 arrays that are distributed across storage nodes.

Thus, each drive is grouped with 15 others to form a 16-drive array with two layers of redundancy, with 7 drives' worth of capacity dedicated to parity. This leaves us 9/16 of the drives usable, or 2.7TB.

We therefore have 4-node storage subunits, with 5.4TB usable storage space. Inter-node communications go through gigabit ethernet with jumbo frames. The second gigabit ethernet interface of each node is also connected to a secondary network, for outside access to the node. Here, storage groups of 4-subunits are channeled through a single GBe port. That's 21.6TB which can be accessed through GBe.

Now, we have 512 such storage groups, each addressable at a speed of 1Gbps. At this speed, the complete data store of each group can be transferred in or out in less than 72 hours.
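The capacity and transfer numbers so far can be checked with a short Python sketch. It assumes decimal units (1TB = 1000GB), a gigabit link running at its full 10^9 bits per second with no protocol overhead, and two 16-drive sets per 4-node subunit (inferred from the 5.4TB figure); all names are made up for illustration.

```python
from fractions import Fraction

DRIVE_GB = 300
DRIVES_PER_ARRAY = 4       # inner RAID-5
ARRAYS_PER_SET = 4         # outer RAID-5 across nodes
SETS_PER_SUBUNIT = 2       # assumption: inferred from 5.4TB per subunit
SUBUNITS_PER_GROUP = 4
GROUPS = 512
LINK_BITS_PER_SEC = 10**9  # gigabit ethernet, ignoring overhead


def raid5_usable_fraction(members: int) -> Fraction:
    """RAID-5 gives up one member's worth of capacity to parity."""
    return Fraction(members - 1, members)


def usable_gb_per_set() -> Fraction:
    """Usable GB of one 16-drive set after both RAID-5 layers."""
    raw_gb = DRIVE_GB * DRIVES_PER_ARRAY * ARRAYS_PER_SET
    return (raw_gb
            * raid5_usable_fraction(DRIVES_PER_ARRAY)
            * raid5_usable_fraction(ARRAYS_PER_SET))


def usable_gb_per_group() -> Fraction:
    """Usable GB behind one gigabit port."""
    return usable_gb_per_set() * SETS_PER_SUBUNIT * SUBUNITS_PER_GROUP


def transfer_hours(gigabytes, bits_per_sec: int = LINK_BITS_PER_SEC) -> Fraction:
    """Hours needed to move `gigabytes` over the link at full speed."""
    bits = Fraction(gigabytes) * 10**9 * 8
    return bits / bits_per_sec / 3600


if __name__ == "__main__":
    group_gb = usable_gb_per_group()
    print(f"per 16-drive set: {float(usable_gb_per_set()) / 1000} TB")
    print(f"per storage group: {float(group_gb) / 1000} TB")
    print(f"whole site: {float(group_gb * GROUPS) / 10**6} PB")
    print(f"full group transfer: {float(transfer_hours(group_gb))} hours")
```

At full line rate a group empties in about two days, so the 72-hour figure leaves room for real-world overhead.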

We're using 2U cases for each system. This means that each subgroup uses 32U for servers, and we're using 42U racks. We have free space in every rack to allow for our high power density and communications and service equipment.

Let us consider the cost of a single such 21.6TB storage unit. We have 16 servers and 128 drives. Each server, with storage, costs around 2.5K$. Each storage unit will be contained in a single rack with independent power conditioning, cooling and communications, at a cost of 15K$/rack, if we use high-end COTS equipment. So, we have a cost of 55K$/unit.

We need to make a cluster out of all this. We need extra computational power. We need a network backbone. We need tape drives to put data in the system. Front-end systems. A way to communicate with the outside world. A network operations center. That's going to add 5M$ to the system.

This means that the total equipment cost for a single location, including 5% of spare units, is around 35M$, including power and cooling equipment.

We need around 35000 sq.ft of floor space to host all this equipment and a 5000 sq.ft operations center. At a cost of 30$/sq.ft, we're talking about 1M$/year for the building.

Let's consider power costs, and do so very conservatively, with a generous overhead on all figures. Each drive uses 15W of power. Each server uses a total of 250W of power. Each storage group uses 4kW of power. Each location therefore uses 2MW of power. Over the course of a year, we're talking about 17.5GWh of power.

Now, we still need more power for cooling, which comes in at about 60% of total power costs. T
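The equipment and power figures can be reproduced in a few lines of Python. This is only a sketch of the arithmetic using the numbers quoted in the post (the variable and function names are invented here), and it assumes the site draws its full load 24 hours a day, 365 days a year. On that assumption, 2MW works out to about 17.5GWh per year (2MW x 8,760h), before cooling.

```python
SERVERS_PER_GROUP = 16
SERVER_COST = 2_500          # dollars, server with its drives
RACK_COST = 15_000           # dollars, power/cooling/comms per rack
GROUPS = 512
SPARE_PERCENT = 5
CLUSTER_EXTRAS = 5_000_000   # backbone, tape, front ends, NOC
SERVER_WATTS = 250
HOURS_PER_YEAR = 24 * 365    # assumption: constant full load


def group_cost() -> int:
    """Dollars for one 21.6TB storage unit (16 servers plus one rack)."""
    return SERVERS_PER_GROUP * SERVER_COST + RACK_COST


def site_equipment_cost() -> int:
    """Dollars for one location, including spares and cluster extras."""
    storage = GROUPS * group_cost()
    with_spares = storage * (100 + SPARE_PERCENT) // 100
    return with_spares + CLUSTER_EXTRAS


def site_watts() -> int:
    """Server power draw for one location, before cooling."""
    return GROUPS * SERVERS_PER_GROUP * SERVER_WATTS


def annual_gwh(watts: float) -> float:
    """Energy used in a year at a constant draw of `watts`."""
    return watts * HOURS_PER_YEAR / 1e9


if __name__ == "__main__":
    print(f"cost per group:   ${group_cost():,}")
    print(f"site equipment:   ${site_equipment_cost():,}")
    print(f"site power draw:  {site_watts() / 1e6:.3f} MW")
    print(f"energy at 2 MW:   {annual_gwh(2_000_000):.2f} GWh/year")
```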

Re:Very Large Storage Array by Anonymous Coward · 2005-06-26 17:54 · Score: 0

congrats, you have preserved the 'bits'. That is the easiest part of the equation.

Honestly, this brute force, simplistic understanding of the problem is why we are in trouble. If you can, imagine how many (proprietary) file formats become obsolete in a given year, and next, imagine converting petabytes of that in a secure fashion.

Now, do that for the rest of time. Then you start to appreciate the true problem. Preserving bits, that's easy.

I was talking about that in another arena: by Ralph+Spoilsport · 2005-06-26 15:13 · Score: 1

here:

http://slashdot.org/comments.pl?sid=154005&cid=12917603

My opinion?

The 21st century will disappear from history. In 500 years' time they will know more about Italy of 1505 than the USA of 2005. Why? The records of Italy will still exist.

The entire digital info system is based on the free ride of petroleum. Petroleum will basically disappear from society fairly soon (either it will simply deplete, or it will become too expensive to drill out), and everything made of plastic and anything requiring high energy density to acquire (like digging up precious metals) will be largely (but not completely) curtailed. The result is most of what we call "our culture" will be lost soon after the Collapse ( http://slashdot.org/comments.pl?sid=154005&cid=12917603 ) and will be largely ignored from a lack of manpower ( http://www.dieoff.org/ ).

It's a truly stunning prospect - our civilisation, with the possible exception of a few basic texts that can be copied to paper, will simply disappear. It will be seen as a dark age - not from a lack of people writing (as it was in 490 CE) but from a lack of putting things into a survivable substrate.

RS

--
Shoes for Industry. Shoes for the Dead.

Re:I was talking about that in another arena: oops by Ralph+Spoilsport · 2005-06-26 15:17 · Score: 1

after the Collapse

the link should have been:

http://www.amazon.com/exec/obidos/tg/sim-explorer/explore-items/-/0670033375/0/101/1/none/purchase/ref%3Dpd_sxp_r0/103-5019446-5179842

RS

--
Shoes for Industry. Shoes for the Dead.

Please contact CERN by Lawrence_Bird · 2005-06-26 16:27 · Score: 1

The people at the LHC have been planning for large data rates and storage requirements for quite a few years.

The computational and data-storage requirements for the LHC experiments will be staggering, according to Jamie Shiers, leader of the Database Group in CERN's IT division. "We project 5 to 8 petabytes [PB] of data will be generated each year, the analysis of which will require some 100PB of storage [of which a large fraction will hopefully be online] and more computing power than that supplied by the world's largest supercomputers," he says.

The Goal is Data Loss by grolaw · 2005-06-26 16:50 · Score: 1

If the current administration has its way, we have no business archiving anything.

One of GWB's first acts was to lock down the Reagan administration's (and all subsequent administrations') data forever. The 12 year release cycle that the Ford Administration approved was revoked within weeks of Jan 2001 (some cynics say, to prevent data about Iran-Contra and GHWB's involvement becoming public - but that's just crazy talk).

The only data less available than old parchment in a vault is random magnetic domains and / or the lack thereof.

You can't prosecute what didn't happen... Ask Oliver North about those PROFs backup tapes.

In ten years there will be no "official" record.

Bush will have achieved what countless computer marketing schemes promised: a paperless office.

The corrupt politician's wet-dream - no records.

It all started as "a matter of national security" - but the first victim (*target*) was a cartographer mapping caribou trails through ANWR.

Now, states like Missouri have eliminated publishing certain rules, laws and regulations on the Internet - as too costly. Yep, if you want to read the regs to Chap. 213 R.S.Mo. you have to go to Jefferson City and ask nicely at the Mo. Commission on Human Rights to look at a copy of the new rules. One per Commission....Damn electrons and ink are dangerous to 'merican republicans. Ration them - then burn-bag and deep six 'em.

Electronic Presidency by d474 · 2005-06-26 17:26 · Score: 1

FTFA:

"A new avalanche of records from the Bush administration--the most electronic presidency yet--will descend in three and a half years, when the president leaves office."

The Bush Administration is also the most secretive presidency yet. It would certainly be interesting to be on the IT staff "archiving" that set of data. The IT boss would be amazed at how much free overtime his staffers were willing to do in the middle of the night...

--
Authority questions you. Return the favor.

Redundant by Keith+McClary · 2005-06-26 18:15 · Score: 1

I suspect 99.99% of this information is multiply redundant. With a good compression algorithm, it would fit onto a DVD or a CD or perhaps even a floppy.

Use clever ways to eliminate unnecessary data by Frit+Mock · 2005-06-27 00:21 · Score: 1

"From the 38 million email messages created by the Clinton administration ..."

I am pretty sure that 90% of those emails could be deleted. (Not saying that the administration writes /dev/null stuff ... certainly there are many "cantina" messages, but that's not the point.)

- If I take a look at my emails, I have quite a lot of threads in there, and all later messages include the whole conversation of previous messages.

- If I take a look at my (outbound) emails, there are lots of mails to multiple recipients. It is only necessary to store each one *once*!

- Since most of the emails are internal, i.e. one administration member writes to another one, there is no reason to save both the outbox of the sender and the inbox of the receiver.
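The three points above amount to single-instance storage. A rough Python sketch of the idea, assuming replies quote earlier messages with a leading ">" and that identical text means an identical message (the class and function names are invented for illustration, not taken from any real archiving system):

```python
import hashlib
from typing import Dict, Set


def strip_quoted(body: str) -> str:
    """Drop lines that merely quote an earlier message ('>' prefix assumed)."""
    kept = [line for line in body.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()


class DedupStore:
    """Keeps each distinct message text once and remembers who held a copy."""

    def __init__(self) -> None:
        self.bodies: Dict[str, str] = {}       # digest -> stored text
        self.holders: Dict[str, Set[str]] = {}  # digest -> mailboxes

    def add(self, mailbox: str, body: str) -> str:
        text = strip_quoted(body)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self.bodies.setdefault(digest, text)            # store text only once
        self.holders.setdefault(digest, set()).add(mailbox)
        return digest


if __name__ == "__main__":
    store = DedupStore()
    original = "Lunch at noon?"
    for box in ("alice/outbox", "bob/inbox", "carol/inbox"):
        store.add(box, original)
    store.add("bob/outbox", "> Lunch at noon?\nYes, see you there.")
    print(len(store.bodies), "distinct messages stored for 4 mailbox copies")
```

Whether an archive is actually allowed to throw away the duplicate copies is a policy question the sketch does not try to answer.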

too many attachments by Karoshi · 2005-06-27 00:58 · Score: 1

I guess they sent lots of attachments, too. And compression of those binary files isn't as effective as the compression of text.

--
Don't answer me. Moderate. Slashdot is about moderation, not discussion.

mod parent +1 anti-semitic by Jamu · 2005-06-27 01:15 · Score: 1

nt.

--
Who ordered that?

Who really needs all of this data? by tjstork · 2005-06-27 01:19 · Score: 1

I guess the first question is, why are we even keeping this data around? Give the historians something to argue about and delete some stuff.

--
This is my sig.

Re:Usually when I archive... by falsified · 2005-06-27 01:30 · Score: 1

Keep in mind, though, that the prison sentences of Ebbers and the prison sentence of the guy who knocks over a BP are going to be about the same, even though Ebbers stole millions while the gas station thief would have got about $200.

--
HI, MY NAME IS ISAAC.

distributed grid storage by Anonymous Coward · 2005-06-27 02:10 · Score: 0

http://archives.gov/electronic_records_archives/papers/thic_04.html
Digital Archiving and Long Term Preservation: An Early Experience with Grid and Digital Library Technologies

The first project is the Persistent Digital Archives project [1], which is a joint effort between the San Diego Supercomputer Center (SDSC), the University of Maryland (UMD), and the National Archives and Records Administration (NARA), and is supported by the National Science Foundation under the Partnership for Advanced Computing (PACI) Program. The main goal of this project is to develop a technology framework for digital archiving and preservation based on data grid and digital library technologies, and to demonstrate these technologies on a pilot persistent archive. We have already built a significant prototype using commodity platforms with significant disk caches coupled with heterogeneous tape libraries for back-ups.

300 years from now: Spotted in the NARA Archives by Anonymous Coward · 2005-06-27 02:55 · Score: 0

Re: Refrigerator use
PLEASE PLEASE PLEASE To all our West Wing office staff we remind you once again that refrigerator cleaning day is FRIDAY and if you leave any foodstuffs they will be THROWN AWAY by 6 pm. Thanks!!!!

FORMATS, FORMATS, FORMATS!!! by AirDave · 2005-06-27 06:03 · Score: 1

As Mr. Ballmer would say.

Hardware and media types can be migrated and validated in a straightforward process. It is the format and representation of the data that is the *hard* problem. Understanding how the information is represented in the digital record is the only way one can conceive of a process to migrate it to a more current representation. Unfortunately, many of the representations are proprietary, e.g., MS Office documents. Open standards for the data representations are the only way forward.

Slashdot Mirror

202 comments