Human-Powered Internet Archive Book Project

Posted by Zonk on Friday November 11, 2005 @06:39PM from the hope-she-likes-the-way-books-smell dept.

Carl Bialik from the WSJ writes "A group led by the Internet Archive is planning a massive, ambitious effort to scan millions of old books and make them available for Web searching early next year. Behind that effort are about a dozen scanners, employees making about $10 an hour to manually scan volumes -- some more than a century old -- one page at a time, on special contraptions. The Wall Street Journal Online visits a University of Toronto library to watch one of the scanners in action: 25-year-old Liz Ridolfo."

Re:Diffrent? by way2trivial · 2005-11-11 18:47 · Score: 3, Insightful

Stories over 75 years old don't have the same copyright protections.

Anyone can do 'A Christmas Carol' because its copyright has expired.

However, using someone's PRECISE arrangement of the text is not permissible; that has its own copyright...
So if I buy a current-day copy from Amazon, I can't scan it in... but if I buy a copy whose last edition/printing was more than 75 years ago, it is fair game.

--
every day http://en.wikipedia.org/wiki/Special:Random
Good Bad Ugly by mpapet · 2005-11-11 19:01 · Score: 4, Insightful

The good:
Old books prior to copyright laws are being scanned.

The bad:
Pay is roughly $10/hr. Now, I happen to be concerned that someone being paid so little should be handling rare books. Not to mention the college graduate getting paid so little.

The ugly:
The digital camera contraption costs $30,000!! There are a few scanner manufacturers left in the world and none of them has exploited this niche. Shame on them.

--
http://www.maxineudall.com/2010/02/should-economists-be-sued-for-malpractice.html
Re:It's lighter! by Hosiah · 2005-11-11 19:11 · Score: 3, Insightful

Ahem: years ago, I made up the "moving time" rule that books *must* be packed in the smallest available boxes. Anything of dimensions around 2x1x1 feet. After straining on the book boxes previously, it occurred to me that it's human nature to (a) pack books first, reasoning that you're not going to be doing much reading in the next couple days anyway... and (b) upon first beginning to pack, grab the biggest box to start with.
Re:Why not join the Gutenberg Project by flimnap · 2005-11-11 19:28 · Score: 3, Insightful

So, as the summary states, "make them available for Web searching" does not mean that there will be a complete text index available (that is, full-text search), but instead that you can only search for specific works?

That probably means that the search index will be uncorrected OCR, which leads to some inaccurate searches. The problem with using raw OCR is scannos: words that are recognised as a different word that "looks" the same (for example, modem for modern), or characters misread as a similar shape (an i might be recognised as a slash).
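
As a rough illustration of the scanno problem, here is a minimal Python sketch. It assumes a tiny hypothetical confusion table (`CONFUSIONS`) and word list (`LEXICON`); neither comes from Distributed Proofreaders or any real OCR tool. It only flags words where swapping a look-alike character sequence yields another valid word, so that a human can check them against the page image.

```python
# Hypothetical confusion pairs: character sequences that OCR often mixes up.
CONFUSIONS = [("rn", "m"), ("m", "rn"), ("cl", "d"), ("d", "cl"), ("/", "i"), ("1", "l")]

# Hypothetical mini word list; a real check would use a full dictionary.
LEXICON = {"modem", "modern", "clear", "dear", "this", "burn", "bum"}


def scanno_candidates(word):
    """Return lexicon words reachable from `word` by one confusion swap."""
    found = set()
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate != word and candidate in LEXICON:
                found.add(candidate)
            start = word.find(wrong, start + 1)
    return sorted(found)


def flag_scannos(text):
    """Map each suspicious token in `text` to its look-alike alternatives."""
    flagged = {}
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'")
        alternatives = scanno_candidates(token)
        if alternatives:
            flagged[token] = alternatives
    return flagged
```

Because "modem" and "modern" are both real words, a spell checker alone would not catch the mix-up; a sketch like this can only point at the suspicious spots, not decide which reading is right.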

I do that every once in a while on their German counterpart: GaGa

Your time might be better spent at the real Distributed Proofreaders, or DP-Europe, since Projekt Gutenberg-DE is not an official branch of PG, and actually copyrights its output (unlike the real PG).
Re:Sorta. by spxero · 2005-11-11 19:57 · Score: 1, Insightful

But at the same time, wouldn't it be better if this outfit did the scanning for PG and PG edited and finalized? What is the point of racing against another organization to provide the same works without making a profit?*

*Until advertisement factors in. Advertisement ALWAYS factors in...
Re:Hey there... by prezkennedy.org · 2005-11-11 20:46 · Score: 2, Insightful

This had to be about as funny as the US PATRIOT Act.
No, I think that actually has a leg up on this comment.

--
It started back in Team Fortress Classic
Re:Diffrent? by Kadin2048 · 2005-11-12 08:09 · Score: 2, Insightful

Commercial or non-commercial use doesn't enter into it.

If the work in question is under copyright, you can't copy and redistribute it; if it's not, then you can. The only exceptions would be the fair use provisions, and I don't think that they would cover you reproducing an entire book, even if it was for non-commercial use: if you're a university professor you can't copy an entire textbook and give the copies out to your students. That's a non-commercial use, but it's still illegal. There might be some exceptions for purely personal use -- some type of "format shifting" perhaps, like OCRing it and running it through text-to-speech and putting the result on your iPod -- that you could make a good case for if you already owned the book in print form, but non-commercial use normally isn't an excuse for infringement. Despite public opinion to the contrary, there is no exception to a copyright holder's exclusive rights for "non-commercial" uses.

Also, if you made a derivative work from something that was out of copyright, and then went and tried to sell it, only the portions that you contributed anew would be protected; the status of the existing material doesn't change. That's not to say that you couldn't sell it (if it's out of copyright you don't even have to change anything to sell it; you can go and print out anything you want from Project Gutenberg and try to sell it, if you think you'll get any buyers), but you wouldn't have any recourse against someone taking your changed version, editing out all of your changes, and selling it themselves.

There is a fairly good introduction to these concepts here. Or read it straight from the U.S. Code.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
OCA and PG scratching each others' backs by TTK+Ciar · 2005-11-12 08:23 · Score: 2, Insightful

The focuses of OCA and PG are really quite different: PG is most interested in preserving the essential information of a book (ie, its text), while OCA's interest is in preserving the form of the book (ie, its fonts, page format, coloration, even down to the yellowing of the pages). That having been said, there's a lot each can do for the other (and has!).
The Archive has archived most of PG's material, because even though the Books department of The Archive is focussed mostly on preserving books, The Archive as a whole is interested in preserving just about any information it can, and the PG data is definitely of interest.
When The Archive's Scribe software processes the book images into its various formats (jpg, djvu, pdf, flippy, et al), it OCRs the book's text. This text then becomes part of generating some of the other formats. It will be really trivial for PG to obtain this text for any book it wants to incorporate into its dataset.
qv: intlepisode00jamearch. The interesting files here are intlepisode00jamearch.txt which is just the OCR'd text, and intlepisode00jamearch_djvu.xml which is the OCR'd text with layout information (which has been useful to me in developing software which auto-corrects some OCR errors -- where the text is on the page often offers valuable hints for choosing the right heuristic for guessing the right text).
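
To make the "layout as a hint" idea concrete, here is a minimal Python sketch. It is not the software mentioned above, and the sample markup is a hypothetical, simplified stand-in: the element names (`PAGE`, `LINE`, `WORD`) and the `coords` order (left, top, right, bottom, in pixels from the top-left corner) are assumptions that should be checked against a real `_djvu.xml` file. The heuristic treats a token in the top strip of the page as a likely page number and swaps digit look-alikes there, while leaving the same token alone in the body text.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; real *_djvu.xml files are larger and
# their exact schema should be verified before relying on this layout.
SAMPLE = """
<PAGE height="1000" width="600">
  <LINE>
    <WORD coords="280,40,320,60">l2</WORD>
  </LINE>
  <LINE>
    <WORD coords="50,200,120,220">Chapter</WORD>
    <WORD coords="130,200,160,220">l2</WORD>
  </LINE>
</PAGE>
"""

# Letters that OCR commonly confuses with digits.
DIGIT_LOOKALIKES = str.maketrans({"l": "1", "I": "1", "O": "0"})


def correct_page(xml_text, header_fraction=0.1):
    """Return the page's words, fixing digit look-alikes in the header strip."""
    page = ET.fromstring(xml_text)
    header_limit = int(page.get("height")) * header_fraction
    words = []
    for word in page.iter("WORD"):
        # Assumed order: left, top, right, bottom.
        left, top, right, bottom = (int(v) for v in word.get("coords").split(","))
        text = word.text or ""
        if bottom <= header_limit:
            swapped = text.translate(DIGIT_LOOKALIKES)
            # Only accept the swap if the result is a plain number.
            if swapped.isdigit():
                text = swapped
        words.append(text)
    return words
```

The same string "l2" is corrected in the header but kept in the body, which is the point: position on the page decides which heuristic is safe to apply.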
A quick side note on the differences between Google's and OCA's efforts that I haven't seen talked about much -- Google's main advantages in their bookscanning efforts are their wealth and fame, while The Archive's main advantages are experience, familiarity, and scanning technology.
Traditional book-scanning technologies are expensive and slow (which makes doing a lot of books, fast, that much more expensive, because you have to hire more people to do more books in parallel), but Google has enough money to throw at the problem that this is less of an issue. Google's fame means they can bring powerful partners onboard with a smile and a handshake, including some of the most prestigious libraries in the nation.
The Archive has been involved in scanning books and making them available online for several years now (qv The Million Book Project). This experience has shaped the processes used in the acquisition and scanning of books, as well as the technology used in their storage, indexing, and presentation. Furthermore, libraries around the world have grown familiar with The Archive over the years. That, and The Archive's good track record, make it a powerful rallying point for partnerships and alliances, and have given it more experience in facilitating such relationships. Finally, partially due to the limits of existing book-scanning solutions, and partially due to The Archive's limited budget, it has facilitated the development of two independent low-cost, reliable, high-quality book-scanning systems: The Scribe (developed in-house at The Archive) and the Kirtas Robot (developed at Kirtas, a Canadian company).
Many of the scans made for the Million Book Project using traditional scanning methods are really lousy, sometimes to the point of being unreadable. These new scanning systems dramatically improve the quality of the end product, while equally dramatically reducing the cost-per-page. This means that more scanning systems can be purchased for more libraries (avoiding the per-library capital outlay problem), and more books can be scanned more quickly within a given budget.
Obviously, Google and OCA can benefit from co-operation, as each has a lot to offer the other. I'd be surprised if Google didn't join the OCA, eventually, if for no other reason than to gain access to the books of the >100 OCA