Just One Page a Day

Posted by ryuzaki0 on Friday November 8, 2002 @02:36AM from the stuff-to-read dept.

Charles Franks writes "Two years ago I started building an online proofreading system as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP). The concept is simple: we scan books and load the image and OCR output for each page into the online system. Next, proofreaders compare the OCR text to the image, making any corrections as necessary; each page gets looked at twice. Finally, the output from the site is massaged into a PG e-text and submitted to PG for posting to the archive. Now, nearly 600 books and a lot of PHP code later, we have snuggled into our new home, which is graciously provided by the Internet Archive and Project Gutenberg. Now that we have 'real' resources available to us (the original site ran on a Pentium 200 over my 128kbps upstream cable modem), I would like to invite the online community at large to help us put even more books online. To this end, I would like to ask everyone to do 'Just One Page a Day'. Thank you, Charles Franks"

Copyright is not an issue by ardmhacha · 2002-11-08 02:43 · Score: 5, Informative

Project Gutenberg only publishes books that are out of copyright. That means Dickens is okay, but you won't find the latest Stephen King.
Re:Which books are getting converted? by teeker · 2002-11-08 02:50 · Score: 5, Informative

The books that are being converted are whatever people feel like contributing.

Don't think your favorite authors are being represented? Can you demonstrate that the work is out of copyright? Make the conversion yourself!

Doing the hard work yourself is the best way to guarantee your interests are represented.

--
teeker
Re:OCR Software -- Clara, perhaps? by timothy · 2002-11-08 03:13 · Score: 5, Informative

Though the web page was last updated in July, I find several happy references (and some less happy) to "Clara," a GPL'd OCR program.

Here's the web page: http://www.claraocr.org/index.html

timothy

--
jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5
Re:ASCII Only? by Robotech_Master · 2002-11-08 03:42 · Score: 5, Informative

Check out Black Mask for a lot of nicely-formatted pubdom e-books, including many from Gutenberg but also some that Gutenberg doesn't have.

--
Editor Emeritus and Senior Writer, TeleRead.org
Re:And you ask the /. community.. by CaseyB · 2002-11-08 04:03 · Score: 5, Informative

I know you're joking, but in reality it doesn't matter how good your spelling is. In fact, I would imagine that any spelling errors found in the text should be reproduced intact, in the interest of accurately representing the original work. This project is about correcting OCR errors, not spelling / grammar.
Re:Scanning without damaging the book? by jpetts · 2002-11-08 04:19 · Score: 5, Informative

Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?

Yes indeed! *Any* decent academic library should have a photocopier which can do this. Older models tend to have a glass platen which extends right to the edge of the photocopier, and the side slopes away at around 60 degrees rather than dropping at a right angle. Newer models, such as the Minolta PS3000, will support the book in a cradle, face up, so that contact with the pages is minimised. They also tend to have a host of features, such as automagically erasing the gutter shadow that one gets with such a system.

--
Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
Re:Some PG books ARE copyrighted... by dpbsmith · 2002-11-08 05:17 · Score: 5, Informative

...Not many, but there are some Project Gutenberg books that are copyrighted and distributed with the author's permission.

Also, Project Gutenberg of Australia publishes a number of works that are out of copyright in Australia, but still under copyright in the U.S. It is a copyright infringement for readers in the U.S. to download these works, which include, among others, Hervey Allen's _Anthony Adverse_ (1933), F. Scott Fitzgerald's _The Great Gatsby_ (1925), Khalil Gibran's _The Prophet_ (1923), D. H. Lawrence's _Lady Chatterley's Lover_ (1928), all of George Orwell's novels, most of Virginia Woolf's, etc. etc.

Not exactly "the latest Stephen King" but a lot newer than Dickens.

--
"How to Do Nothing," kids activities, back in print!