Digital Future of the Library of Congress

Posted by ryuzaki0 on Friday March 25, 2005 @04:15AM from the yay-for-learning dept.

lesinator writes "On Monday the 28th the US Library of Congress is holding the eighth lecture in its series on Managing Knowledge and Creativity in a Digital Context. Previous speakers include David Weinberger on blogging, Brewster Kahle - founding member of archive.org and the wayback machine, and Lawrence Lessig on intellectual property and the creative commons. After the lecture, questions will be taken from the audience and the internet. C-Span will be broadcasting the lecture live at 6:30 PM EST, and also has archives of previous lectures. Audio archives of previous lectures are available at Audible.com in the Selected Free Media section."

32 of 141 comments

At last! by Shadow+Wrought · 2005-03-25 04:18 · Score: 3, Funny

We'll know just how much storage really is required to hold the Library of Congress.

--
If brevity is the soul of wit, then how does one explain Twitter?
1. Re:At last! by cmburns69 · 2005-03-25 04:28 · Score: 4, Insightful
  
  While it's an interesting question, it really depends on how you want to store the contents of each book.
  
  Would you store each page of each book as an image? As flat ASCII text (except for pictures and diagrams, of course!)? What kind of indexing would you do? Basic indexing of book names? Full-text indexing of the contents? All that storage adds up!
  
  In summary, the library of congress (depending on the method used) could probably fit into something ranging from a couple of gigabytes to a couple of petabytes.
  
  --
  Online Starcraft RPG? At
  Dietary fiber is like asynchronous IO-- Non-blocking!
2. Re:At last! by Shadow+Wrought · 2005-03-25 04:34 · Score: 4, Interesting
  
  Well, I would think that they would have to start with an image first. Once they OCR'd it and generated ascii text files, they could save a tremendous amount of space by simply deleting the images. However, after that much effort in imaging all those pages, I just can't see them doing that. The best bet is probably two databases, one of ascii text and one of images.
  They might even be able to generate revenue by having the ascii text freely available and searchable, while the images would cost money. That way folks just interested in the text can find it easily, while scholars and others who need to see the source material can have access at a moderate price.
  
  --
  If brevity is the soul of wit, then how does one explain Twitter?
3. Re:At last! by WillAdams · 2005-03-25 04:39 · Score: 2, Interesting
  
  There's a cue for a question I've been wondering about for a while.
  
  What was the first reference / usage of ``LoC'' as a unit of knowledge measurement?
  
  The first time I recall seeing it was in Michael Gear's novels, _The Artifact_ if memory serves, ~1976.
  
  Anyone have an earlier instance?
  
  William
  
  --
  Sphinx of black quartz, judge my vow.
4. Re:At last! by caseydk · 2005-03-25 06:53 · Score: 2, Interesting
  
  I was working on this project just a few years back (2001-2002).
  
  Our estimates projected that by 2005, it would take about 4 TB of digitization EACH day to keep pace.
  
  The first storage phase called for a 180 TB server.
5. Re:At last! by aboyko · 2005-03-25 06:55 · Score: 2, Insightful
  
  A couple of gigabytes?! Only if you burn it first. There's something like 10^8 books, never mind the other stuff. How do you compress any given book into 100 bytes?
  
  The "20 TB" figure comes from the smallest possible measure, treating the flat books as ASCII text. Even just considering current digital content, it's also inaccurately small by >1 order of magnitude.
  
  It's a really really really big library.
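
A quick sanity check of the arithmetic in this exchange, as a sketch only. The book count (~10^8) and the totals ("a couple of gigabytes", "20 TB") come from the comments above; the use of decimal units (1 GB = 10^9 bytes) and the function name `bytes_per_book` are assumptions made for illustration.

```python
# Back-of-the-envelope check of the per-book storage budget.
# BOOKS is the ~10^8 figure quoted in the thread; decimal units are assumed.

BOOKS = 10**8
GB = 10**9
TB = 10**12


def bytes_per_book(total_bytes: int, books: int = BOOKS) -> float:
    """Average number of bytes available per book for a given total size."""
    return total_bytes / books


if __name__ == "__main__":
    for label, total in [("2 GB", 2 * GB), ("10 GB", 10 * GB), ("20 TB", 20 * TB)]:
        print(f"{label:>6}: {bytes_per_book(total):,.0f} bytes per book")
```

Under these assumptions, a few gigabytes leaves only tens of bytes per book (100 bytes at 10 GB), while 20 TB leaves about 200 KB per book, which is why the plain-text figure is the only one in the thread that is even the right shape.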
Here's an idea related to audio archiving by filmmaker · 2005-03-25 04:19 · Score: 4, Insightful

Maybe the fine folks at Audible.com might consider making their audio clips available by means other than the Real or MS media players?

--
I Want To Believe
Dammit! by dteichman2 · 2005-03-25 04:21 · Score: 2, Insightful

What are they thinking! Airing this at 6:30 PM EST! C-SPAN has just ensured that nobody on the west coast will see this. Or, is that what they are aiming for?

--

Silence is golden... and duct tape is silver.
1. Re:Dammit! by lukewarmfusion · 2005-03-25 04:31 · Score: 5, Funny
  
  C-SPAN is clearly concerned with ratings. Didn't you see the stuff they pulled out for Sweeps week? I think it was something like "old guy reading boring text to empty room."
Nice, but how long? by Anonymous Coward · 2005-03-25 04:22 · Score: 2, Funny

How long is it going to take to digitize the entire library?

Anyone have a good approximation? I'd like to know in Burning Libraries of Congress (BLC) please.

I'm guessing somewhere around 10-200 BLC.
1. Re:Nice, but how long? by yuriismaster · 2005-03-25 04:37 · Score: 3, Interesting
  
  Well, I would imagine that unless they have a massive staff and many OCR scanners or automation with REALLY good OCR, this may take a LOONNNG time.
  
  I'm not quite sure about the length of a BLOC, but this is a job for not-quite-manual labor. Each book requires a simple task: Scan page 1, flip page, scan page 2, page 3, flip, ad infinitum.
  
  One way to save on time would be to contact the publishers of any book made after 1985-ish, where you can get electronic copies from the author. Some older books may have been already digitized, but it's still going to take more than 25 years unless there's a massive army working on this.
2. Re:Nice, but how long? by Blue-Footed+Boobie · 2005-03-25 04:55 · Score: 4, Informative
  
  Nonsense. I put together solutions with high-speed scanners all the time. Some of our highest-end average 118ipm (Duplex) and have 1000pg ADFs.
  Also, you would generally split the load between 4-6 of these scanners for a job this big. The software is automated, and will OCR/Convert/Archive the file in one step.
  As a general rule, you can fit 10,000 b/w text pages in 1GB of storage.
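
The figures in this comment can be turned into a rough estimate, sketched below. The 118 ipm rate, the 4-6 scanner split, and the 10,000 pages per GB rule of thumb are from the comment; treating one scanned image as one page, assuming the scanners run continuously, and the function name `scan_estimate` are assumptions for illustration.

```python
# Rough scanning-time and storage estimate from the rule-of-thumb figures above.
# Assumes one image == one page and no downtime for loading or jams.

def scan_estimate(pages: int, scanners: int = 4, ipm: int = 118,
                  pages_per_gb: int = 10_000) -> tuple[float, float]:
    """Return (hours of scanning, gigabytes of storage) for a page count."""
    minutes = pages / (scanners * ipm)
    return minutes / 60, pages / pages_per_gb


if __name__ == "__main__":
    hours, gigabytes = scan_estimate(1_000_000)
    print(f"{hours:.1f} hours, {gigabytes:.0f} GB")
```

With four scanners, a hypothetical million pages works out to roughly 35 hours of scanning and 100 GB of storage; real throughput would be lower once paper handling is counted.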
  
  --
  DAMN YOU OCTODOG! DAMN YOU TO HELL!
3. Re:Nice, but how long? by Blue-Footed+Boobie · 2005-03-25 06:21 · Score: 2, Informative
  
  Nope, Canon (and others) make Book Scanners that actually flip and scan each page automatically. They can handle all sizes too.
  They are very expensive, but cool as hell.
  
  --
  DAMN YOU OCTODOG! DAMN YOU TO HELL!
Some ideas by gowen · 2005-03-25 04:24 · Score: 5, Insightful

Here are some interesting talks they might give:

i) What if the Apostles had had technological means to prevent the reproduction of the New Testament?

ii) Would our culture be diminished if the people who rediscovered Beowulf had been unable to decrypt the manuscript?

iii) Is the continual repetition and reworking of myth and fable through the Oral Tradition disrespectful of the content creators who first recorded these stories?

--
Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
1. Re:Some ideas by Scrameustache · 2005-03-25 04:29 · Score: 3, Insightful
  
  i) What if the Apostles had had technological means to prevent the reproduction of the New Testament?
  
  Main Entry: apostle
  Pronunciation: &-'pä-s&l
  Function: noun
  Etymology: Middle English, from Old French & Old English; Old French apostle & Old English apostol, both from Late Latin apostolus, from Greek apostolos, from apostellein to send away, from apo- + stellein to send
  1 : one sent on a mission: as a : one of an authoritative New Testament group sent out to preach the gospel and made up especially of Christ's 12 original disciples and Paul b : the first prominent Christian missionary to a region or group
  
  They wouldn't have prevented the distribution of the story it was their mission to distribute, that's for sure.
  
  --
  You can't take the sky from me...
2. Re:Some ideas by Anonymous Coward · 2005-03-25 05:25 · Score: 4, Interesting
  
  It's been continually re-written. For example, until 1954 Jesus never actually said "I am the Son of God"; when Pontius Pilate accused him of claiming to be the Jewish Messiah, he cryptically responded "It is you who said it." The fact Jesus didn't claim to be the Son of God but was surrounded by intense believers was one of the essential "mysteries" of Christianity that you were supposed to accept as a Christian.
  
  In 1954, the American "New International" edition just edited the trial dialog and "re-interpreted" "it is you who said it" into "I am the Son of God." I don't think the European and Catholic churches have edited that part yet.
That's the right idea .. carry it further by Anonymous Coward · 2005-03-25 04:30 · Score: 5, Insightful

It is amusing that this story follows directly after a story about Microsoft proprietary file formats.

The Library of Congress should insist that all 'publications' be submitted to it in open formats. What good is it if they have something on file that nobody can read! The extreme is that they have to have a licensed copy of every piece of software that ever created a file. If all the formats have to be open then at least historians can cobble together something that can read a file of interest.

With the ip laws as stupid as they are now, we run the real risk of losing the record of our age.
1. Re:That's the right idea .. carry it further by John+Seminal · 2005-03-25 06:25 · Score: 2, Insightful
  
  It is amusing that this story follows directly after a story about Microsoft proprietary file formats. The Library of Congress should insist that all 'publications' be submitted to it in open formats. What good is it if they have something on file that nobody can read!
  Why even have it on any digital media? I want the original records. Screw having computerized copies. This is the nation's library, where a copy of everything in its original form must be.
  I have no problem with the card catalogue system. Some things should not change. If someone wants to open the "Digital Library of Congress" then go for it. But leave the original as-is. I can only imagine that someone wanting to digitize the Great Library in Alexandria 2000 years ago is what resulted in the great fire. HA! We screw ourselves again.
  
  --
  Rosco: "If brains were gunpowder, Enos couldn't blow his nose."
Next series by E+IS+mC(Square) · 2005-03-25 04:34 · Score: 4, Funny

"Managing Knowledge and Creativity with DRM"...

Sponsored by Apple and Microsoft!
Hello, Project Gutenberg?!? by Infosquawk · 2005-03-25 04:40 · Score: 5, Interesting

I can never understand why there isn't more acknowledgment of our debt to Project Gutenberg on these issues.

Michael Hart was digitizing books before digitizing books was cool, as far back as 1971, and the Project's efforts have been hugely successful on very little money. Nevertheless, I rarely see any official or media acknowledgment of the Project's efforts. If anyone should be on that panel for their ability to give advice from practical experience and performance in this field, while on a shoestring budget, it would be Hart!

--

OoO

Please do not publish outside of /.
Outsource parts of LOC to Google or Amazon? by G4from128k · 2005-03-25 04:44 · Score: 4, Insightful

With the current wave of outsourcing, privatization, and government use of commercial contractors, I wonder if Amazon or Google don't have a major role to play in the process of cataloging/archiving/serving digital content in the future.

Although LOC could never be replaced by a Google or Amazon, these private companies could provide services that augment or reduce the cost of LOC-like services. For example, if Amazon scans a book, why should LOC scan it too?

--
Two wrongs don't make a right, but three lefts do.
1. Re:Outsource parts of LOC to Google or Amazon? by HeedlessYouth · 2005-03-25 04:48 · Score: 2, Interesting
  
  You mean like this?
Publication of New Testament by dpilot · 2005-03-25 05:09 · Score: 3, Interesting

Authorship of the New Testament is not a simple question at all. First off, the Apostles didn't sit down and start collecting the New Testament. That was done hundreds of years later by some chaps in Rome or Turkey who also had political axes to grind. Every few decades or centuries, there's also Yet Another Translation, and in the foreword they talk about the prayer, consideration, and attempts to divine the True Word of God that went into it. Common belief is that over the centuries there has been so much prayer, consideration, and attempts to divine the True Word of God that today's bibles MUST be correct. Yet in spite of all that, I have this feeling that precedent is even stronger in the Bible than in the US legal system, and that we're still carrying the weight of perhaps improper decisions made over a thousand years ago, plus trying to justify them.

Then you also get to the issue of what is and isn't in the Bible. Consider "The suppressed Gospels and Epistles of the original New Testament of Jesus the Christ, Complete" http://www.gutenberg.org/etext/6516 for an example. Would the Apostles have wanted them published, or not? What about "The Forgotten Books of Eden"? Or less/more controversial, how about Maccabees, Sirach, Tobit, and company - the ones in the Catholic, but not the Protestant Bible? (Perhaps Maccabees is the most historically verifiable book IN the Bible, too.)

By the way, most of the Bible ended up being written down much later - after even US copyrights would have expired. Good thing Steamboat Willie doesn't date back to BC.

--
The living have better things to do than to continue hating the dead.
What about a backup copy? by voss · 2005-03-25 05:10 · Score: 3, Interesting

It would seem if the LOC is going to have X number of Petabytes on computers...why not have a second copy stored AWAY from DC. If something were to happen to DC at least we would have backup copies of everything...and we probably should have a separate backup location at a third site.
DRM and archiving are so diametrically opposed... by PornMaster · 2005-03-25 05:15 · Score: 3, Insightful

DRM and archiving are quite conflicting. But then again, how do you make available information on which you want to retain technical methods of copyright protection?

I think the obvious solution is to archive it in a non-DRM, non-proprietary format, but transcode to a DRM/proprietary format when retrieved, if the content is not in the public domain.
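
The retrieval policy described here can be sketched in a few lines. Everything named below (`ArchivedItem`, `apply_drm`, `retrieve`) is hypothetical, and `apply_drm` is only a placeholder standing in for whatever real protection or transcoding step would be used; the point is just that the archive holds one open copy and the restriction is applied at delivery time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivedItem:
    """One archived work, stored once in an open, non-DRM format."""
    title: str
    open_payload: bytes
    public_domain: bool


def apply_drm(payload: bytes) -> bytes:
    """Placeholder for a real DRM/transcoding step (hypothetical)."""
    return b"DRM:" + payload


def retrieve(item: ArchivedItem) -> bytes:
    """Serve the open copy for public-domain works, a wrapped copy otherwise."""
    if item.public_domain:
        return item.open_payload
    return apply_drm(item.open_payload)


if __name__ == "__main__":
    old = ArchivedItem("Public-domain text", b"plain text", public_domain=True)
    new = ArchivedItem("In-copyright text", b"plain text", public_domain=False)
    print(retrieve(old), retrieve(new))
```

Because the stored copy never changes, a work that later enters the public domain only needs its `public_domain` flag flipped; nothing has to be re-archived.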

--
500GB of disk, 5TB of transfer, $5.95/mo
Re:it has to be said... by Clay+Pigeon+-TPF-VS- · 2005-03-25 05:22 · Score: 2, Funny

But how are we going to measure asteroids and meteors now that the larger imperial unit (Libraries of Congress) is going to get smaller? Will we have to fall back to the smaller unit (VW Beetles) for all of them now?

--
Viral software licensing is not freedom, it is in fact GNU/Socialism.
Small representations. by Grendel+Drago · 2005-03-25 05:31 · Score: 2, Interesting

Have you ever seen someone's hundred and fifty page thesis, diagrams and all, fit onto a 3.5" floppy? People who wrote their theses in TeX or LaTeX, with a few postscript diagrams. I was impressed by how tiny the code for a real, well-produced book could be.

'Course, the problem is that these representations work if you're entering in the content with that method in the first place.

--grendel drago

--
Laws do not persuade just because they threaten. --Seneca
This just in... by SmokeHalo · 2005-03-25 05:54 · Score: 5, Funny

The LOC has announced that they are accepting volunteers to digitize texts. Their first volunteer is Earl the night janitor, who has been busily keying in the last 20 years of New York City phone books. He hopes to move on to Chicago soon.

--
I'm not good in groups. It's difficult to work in a group when you're omnipotent. - Q
1. Re:This just in... by superpulpsicle · 2005-03-25 07:03 · Score: 2, Funny
  
  Don't worry, Earl will soon have the assistance of hundreds of non-English speaking Iraqi prisoners to help him.
Re:The problem I see with Project Gutenberg... by Baldur_of_Asgard · 2005-03-25 06:02 · Score: 2, Informative

(1) Under the old US law, content had to be marked "Copyright" to be copyrighted. Under the present US law, all work is automatically copyrighted the moment it is created, UNLESS the author specifies otherwise. I think this holds true for works since, was it 1987? I forget exactly - but it's been a little while now.

(2) A person who transcribes a book that is in the public domain can CLAIM a copyright on it, but this is not enforceable unless they have changed the text significantly enough for it to be a new work - in which case you probably don't want it anyhow, except possibly as a work of satire or fiction.

Baldur of Asgard
Are they requiring publishers to submit PDF files? by melted · 2005-03-25 06:05 · Score: 4, Interesting

Are they requiring publishers to submit PDF files for new entries yet? Or files in another open format? Man, I'd hate to see taxpayers' money wasted on doing work that they could avoid doing by simply mandating PDF submissions from publishers.

I can see that some publishers may just say, "oh, my book isn't gonna be in libraries if I don't submit PDF, so much the better, I'll sell more copies". I hope these fellas realize how badly they're shooting themselves in the foot.
Yes, and yet...no. by oneiros27 · 2005-03-25 07:22 · Score: 2, Insightful
You're making a large number of assumptions in your first paragraph:
1. The OCR is always correct.
2. The documents could be represented in ASCII
3. The text is the only part of the document with any value
Of course, your second paragraph shows that clearly those assumptions can't be true -- why would someone pay more for something without an additional benefit?

And you wouldn't maintain separate databases -- pictures aren't searchable. You'd want to use any OCR'd text (preferably vetted afterwards) as the basis for indexing the images, so that you could help people find more images that might be of interest to them (which you mentioned in the second paragraph). However, I'm not sure what requirements the LOC operates under, so I don't know whether they're even allowed to do cost recovery or otherwise charge fees.
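
As a minimal sketch of what "OCR text as the basis for indexing the images" could look like: an inverted index from words to page-image identifiers. The function names, the image IDs, and the simple lowercase word tokenizer are all assumptions for illustration; a real system would also need to cope with OCR errors and non-ASCII text, as noted above.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(ocr_pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word found by OCR to the set of page images containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for image_id, text in ocr_pages.items():
        for word in tokenize(text):
            index[word].add(image_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the page images whose OCR text contains every query word."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


if __name__ == "__main__":
    pages = {"page-001.tif": "Sphinx of black quartz", "page-002.tif": "A black cat"}
    print(search(build_index(pages), "black quartz"))
```

The images stay the authoritative record and the text only points at them, so an OCR mistake costs a missed search hit rather than a corrupted document.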
--
Build it, and they will come^Hplain.