Human-Powered Internet Archive Book Project


Posted by Zonk on Friday November 11, 2005 @06:39PM from the hope-she-likes-the-way-books-smell dept.

Carl Bialik from the WSJ writes "A group led by the Internet Archive is planning a massive, ambitious effort to scan millions of old books and make them available for Web searching early next year. Behind that effort are about a dozen scanners, employees making about $10 an hour to manually scan volumes -- some more than a century old -- one page at a time, on special contraptions. The Wall Street Journal Online visits a University of Toronto library to watch one of the scanners in action: 25-year-old Liz Ridolfo."


It's lighter! by HolyCrapSCOsux · 2005-11-11 18:46 · Score: 4, Interesting

Last time I moved, it took many VERY HEAVY boxes to move all my books. Maybe I'll scan them all...

Although anything useful has to be illegal... :(

--
0xB315AA8D852DCD3F3DCA578FD2E0BF88
1. Re:It's lighter! by Hosiah · 2005-11-11 19:11 · Score: 3, Insightful
  
  Ahem: years ago, I made up the "moving time" rule that books *must* be packed in the smallest available boxes. Anything of dimensions around 2x1x1 feet. After straining on the book boxes previously, it occurred to me that it's human nature to (a) pack books first, reasoning that you're not going to be doing much reading in the next couple days anyway... and (b) upon first beginning to pack, grab the biggest box to start with.
Re:Diffrent? by way2trivial · 2005-11-11 18:47 · Score: 3, Insightful

Stories over 75 years old don't have the same copyright protections..

Anyone can do 'A Christmas Carol' because its copyright has expired..

Using someone's PRECISE arrangement of the text is not permissible, however; that has its own copyright...
So if I buy a current-day copy from Amazon, I can't scan it in... but if I buy a copy whose last edition/print was more than 75 years ago, it is fair game.

--
every day http://en.wikipedia.org/wiki/Special:Random
Sorta. by Grendel+Drago · 2005-11-11 18:52 · Score: 4, Informative

Project Gutenberg frequently uses page scans as source material. What PG does is run the images through OCR, then proofread and post-process the resulting text. That's more useful than a stack of page images, but considerably more work.

If you look at the current books on Distributed Proofreaders, you'll see that some of them credit the Million Books Project for the page scans.

--
Laws do not persuade just because they threaten. --Seneca
Hey there... by stev3 · 2005-11-11 18:56 · Score: 3, Funny

Why hello, Ms. Liz Ridolfo. I'm happy to see you are into computers (at least I'll tell myself that) and you like to put your pictures online.

Please email me at superdesperateteengeek@needtogetlaid.net
Re:Why not join the Gutenberg Project by flimnap · 2005-11-11 19:00 · Score: 3, Informative

Project Gutenberg and the Open Content Alliance are working on two slightly different things:

The OCA is making available the images of scanned pages. That's fine for reading an entire book, but you can't search it, nor copy a section of text into a document of your own.

Project Gutenberg makes available plain text, usually illustrated HTML, and occasionally other versions, of public domain books, which can be used by anyone for no cost.

If you'd like to help prepare public domain ebooks, visit Distributed Proofreaders and proofread a page a day (or more!).
Good Bad Ugly by mpapet · 2005-11-11 19:01 · Score: 4, Insightful

The good:
Old books prior to copyright laws are being scanned.

The bad:
Pay is roughly $10/hr. Now, I happen to be concerned that someone being paid so little should be handling rare books. Not to mention the college graduate getting paid so little.

The ugly:
The digital camera contraption costs $30,000!! There's a few scanner manufacturers left in the world and none of them have exploited this niche. Shame on them.

--
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html
1. Re:Good Bad Ugly by Dave114 · 2005-11-11 21:41 · Score: 3, Informative
  
  There's a few scanner manufacturers left in the world and none of them have exploited this niche.
  Actually, you can buy a robotic book scanner (there's a demo video of it). No doubt it costs an arm and a leg although it may be worth it if you're scanning a large enough volume of books.
Re:Diffrent? by arrrrg · 2005-11-11 19:05 · Score: 4, Informative

From the Wikipedia article on the Open Content Alliance:

The Open Content Alliance is a consortium of non-profit and for-profit groups which is dedicated to building a free archive of digital text and multimedia. It was conceived in 2005 by Yahoo and the Internet Archive. It was conceived in response to Google Print's closed nature, and aims to keep public domain works in the public domain on-line. These results will then be used in the search results of participating search engines. You can see a sample of the open content at openlibrary.org

A large difference between the OCA's approach and that of Google Print is that the OCA intends to ask a copyright holder before digitising a work that is still under copyright, while Google Print will digitise any book unless explicitly told not to do so by November 1, 2005.

So, Google Print will almost certainly be better when searching for copyrighted material. For public domain works, we'll have to wait and see.

IMHO, it seems like a little cooperation here would make a lot of sense for both parties - they could save money trading digital copies 1-for-1 while remaining in (healthy) competition.
Re:Diffrent? by Dave114 · 2005-11-11 19:11 · Score: 4, Informative

It's different. Take a look at the Open Content Alliance's FAQ. Below are a few excerpts from it:

What can people do with materials contained in the OCA archive?
The OCA will encourage the greatest possible degree of access to and reuse of collections in the archive, while respecting the rights of content owners and contributors. Generally, textual material will be free to read, and in most cases, available for saving or printing using formats such as PDF. Contributors to the OCA will determine the appropriate level of access to their content.
How will the OCA deal with copyrighted content?
The OCA is committed to respecting the copyrights of content owners. All content providers who contribute to the OCA must agree with the founding principles of the OCA, contained in the OCA Call for Participation, which describes how their materials and associated metadata will be accessed and used. Further, all contributors of collections can specify use restrictions on material that they contribute.
Will copyrighted content be digitized or placed in the OCA archive without explicit permission from rights-holders?
No. OCA contributors must secure the permission of all concerned copyright holders prior to submitting materials to the OCA for digitization or inclusion in the archive.
Re:Why not join the Gutenberg Project by flimnap · 2005-11-11 19:28 · Score: 3, Insightful

So, as the summary states, "make them available for Web searching" does not mean that there will be a complete text index available (that is, full-text search), but instead that you can only search for specific works?

That probably means that the search index will be uncorrected OCR, which leads to some inaccurate searches. The problem with using raw OCR is scannos: words that get recognised as a different word that "looks" the same, for example modem for modern, or an i recognised as a slash.
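To make the scanno problem concrete, here is a rough Python sketch of how a proofreading tool might flag suspect words. The confusion table, the function name scanno_candidates, and the tiny vocabulary are all made up for illustration; this is not how Distributed Proofreaders or any particular OCR package actually does it.

```python
# Hypothetical confusion table: glyph sequences that raw OCR tends to mix up.
# These pairs are illustrative assumptions, not taken from any real OCR engine.
CONFUSIONS = [("rn", "m"), ("m", "rn"), ("/", "i"), ("i", "/")]


def scanno_candidates(word, vocabulary):
    """Return vocabulary words reachable from `word` by one OCR confusion."""
    candidates = set()
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            # Swap one occurrence of the confusable glyph sequence.
            variant = word[:start] + right + word[start + len(wrong):]
            if variant != word and variant in vocabulary:
                candidates.add(variant)
            start = word.find(wrong, start + 1)
    return sorted(candidates)


# Toy vocabulary, assumed purely for the example.
vocabulary = {"modem", "modern", "this", "print"}

for token in ["modem", "modern", "th/s", "print"]:
    print(token, "->", scanno_candidates(token, vocabulary))
```

A word that comes back with candidates is not necessarily wrong (both modem and modern are real words), which is exactly why a spellchecker alone misses scannos and a human proofreader still has to look at the page image.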

I do that every once in a while on their German counterpart: GaGa

Your time might be better spent at the real Distributed Proofreaders, or DP-Europe, since Projekt Gutenberg-DE is not an official branch of PG, and actually copyrights its output (unlike the real PG).
Re:Contributing to Gutenberg by jonathan_ingram · 2005-11-11 20:17 · Score: 4, Informative

The scans won't be added to Project Gutenberg, but it's very likely that the scans will be used by Project Gutenberg's Distributed Proofreading project, which I'm involved in. We're already 'harvesting' images from quite a few sites, as well as all the images our volunteers scan. Now that there are several large and relatively well-funded scanning operations getting off the ground, I imagine that over time an ever-increasing proportion of the works that go through DP will be based on harvested images.

I maintain several lists that show the DP harvesting status of several image collections, including The Internet Archive's Canadian Libraries collection, Google Print, and Early Canadiana Online. As you can see, we will not be running short of material to work on for a very long time, even without any of these recently announced initiatives. That said, it's always great to see more material be made freely available, rather than locked up behind expensive subscription services like Jstor and EEBO.

--
-- Help Digitise the Public Domain at DP.
best commet ever! by CannibalSmith · 2005-11-11 20:18 · Score: 4, Funny

> bullshit

I too want to be modded Insightful!

--
being smart is exausting
Over a century old... by Cow+Jones · 2005-11-11 21:30 · Score: 3, Funny

employees making about $10 an hour to manually scan volumes -- some more than a century old
I think that if they hired younger people to scan the books, it might go a little faster.
Imagine a 100 year old at this job...
"...(mumble mumble) in my day we used priests to copy books (mumble mumble) oh dear, I tore another page, darn Parkinson (mumble mumble)"

--

Ah, arrogance and stupidity, all in the same package. How efficient of you. -- Londo Mollari
Re:RTFA? by commbat · 2005-11-12 03:39 · Score: 3, Interesting

I'd RTFA if the black text didn't overlap a black image. IE-only web designers should be shot.

This is when the 'remove this object' firefox extension comes in handy. Just remove the image and the text is readable. 'Undo last remove' to get the image back.

I don't think you should have been modded down.

--
'Intellectual Properties' are uncontrollable in the wild. To base an economy on them is just stupid.