Slashdot Mirror


Just One Page a Day

Charles Franks writes "Two years ago I started building an online proofreading system as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP). The concept is simple: we scan books and load the image and OCR output for each page into the online system. Next, proofreaders compare the OCR text to the image, making corrections as necessary; each page gets looked at twice. Finally, the output from the site is massaged into a PG e-text and submitted to PG for posting to the archive. Now, nearly 600 books and a lot of PHP code later, we have snuggled into our new home, which is graciously provided by the Internet Archive and Project Gutenberg. Now that we have 'real' resources available to us (the original site ran on a Pentium 200 over my 128kbps upstream cablemodem), I would like to invite the online community at large to help us put even more books online. To this end I would like to ask everyone to do 'Just One Page a Day'. Thank you, Charles Franks"

11 of 373 comments

  1. OCR Software by Zach Garner · · Score: 4, Interesting

    Is there any worthwhile open source OCR software? How about reasonably priced closed source OCR software for *BSD or Linux?

  2. Distributed OCR? by edwilli · · Score: 4, Interesting

    Have each client do the OCR (if you can find GPL software). Or maybe there's a company willing to donate it. That way you could farm out most of the processing too.

  3. Re:Legal Implications by stinky wizzleteats · · Score: 5, Interesting

    While publishers sell dead-tree copies still, they have no copyright over the original text contained within.

    What? You mean to suggest that you have an actual example of a publisher making money without tyranny over the content?

    Gasp!

  4. Graphics by mallfouf · · Score: 4, Interesting

    Very good idea.
    Will there be any support for proofing in other languages (French, Spanish, Arabic, etc.)?
    What about books published in other countries? Would we be able to post those books if they're not copyrighted in the US but copyrighted in other countries? Or vice versa?

  5. use proofreading meta-data to improve OCR! by tomlouie · · Score: 5, Interesting

    What if they kept track of every time the human reader finds an OCR-error. Couldn't you then build a profile of what words/phrases/letters the OCR software has the most problems with?

    Then, couldn't you just selectively have the humans review the sections of a book most likely to contain errors, instead of every single word of every single page?

    What do you think?
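
    A minimal sketch of the bookkeeping this idea would need, assuming the site keeps both the raw OCR text and the proofread text for each page. The function names and the sample pages are made up for illustration; nothing here reflects how Distributed Proofreaders actually stores its data.

```python
from collections import Counter
from difflib import SequenceMatcher


def correction_pairs(ocr_text, proofed_text):
    """Yield (ocr_fragment, corrected_fragment) for each span the proofer changed."""
    matcher = SequenceMatcher(None, ocr_text, proofed_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield ocr_text[i1:i2], proofed_text[j1:j2]


def build_error_profile(pages):
    """Count how often each OCR fragment was corrected across many pages."""
    profile = Counter()
    for ocr_text, proofed_text in pages:
        profile.update(correction_pairs(ocr_text, proofed_text))
    return profile


# Made-up sample pages: (raw OCR output, text after human proofreading).
sample_pages = [
    ("Tbe cat sat on tbe mat.", "The cat sat on the mat."),
    ("He saw tbe dog.", "He saw the dog."),
]

profile = build_error_profile(sample_pages)

# The most frequent confusions come first.
for (wrong, right), count in profile.most_common(3):
    print(f"{wrong!r} -> {right!r}: corrected {count} time(s)")
```

    A profile like this only records which confusions are common. Deciding which sections of a new page are risky enough to send to a human would still need some scoring rule on top of it.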

  6. A better way - have computers do more work. by lawpoop · · Score: 5, Interesting

    I was thinking -

    In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.

    We use omnipro here at work, and I'm surprised at how well it works, even recreating page formats.

    Of course, it doesn't work 100%, but it sure does get about 95%. If you were to OCR a document 2-3 or more times, and most of it was identical, it would save a lot of time if you had humans going over only the parts that the different OCRs didn't agree on.
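
    The comparison step could be sketched roughly like this, assuming each OCR pass has already produced plain text for the same page. The two sample outputs are invented, and `disagreements` is a hypothetical helper, not part of any OCR package.

```python
from difflib import SequenceMatcher


def disagreements(words_a, words_b):
    """Return (a_span, b_span) word lists wherever two OCR passes differ."""
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (words_a[i1:i2], words_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up outputs from two hypothetical OCR passes over the same line.
pass_a = "It was the best of tirnes, it was the worst of times".split()
pass_b = "It was the best of times, it was the vvorst of times".split()

# Only the spans where the passes disagree are handed to a human.
for span_a, span_b in disagreements(pass_a, pass_b):
    print(f"needs review: {span_a} vs {span_b}")
```

    One limitation of this approach: if every pass makes the same mistake in the same place, the passes agree and the error is never shown to a proofreader.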

    Steve Lefevre

    --
    Computers are useless. They can only give you answers.
    -- Pablo Picasso

  7. Re:Umm... by jandrese · · Score: 5, Interesting

    Someone needs to do a Google search on "Public Domain". Public domain is there for a reason. Copyright exists to give the artist a means of supporting himself, but it was never meant to last his entire life. The purpose is to give the artist an incentive to work; current copyright law fails in this respect because an artist only needs to create one successful work and can immediately switch to being a leech on society for the rest of his (and his children's, and his children's children's) life. Having works pass into the public domain is a good idea for two reasons:
    1. It is for the greater good of society, as other people build on earlier works.
    2. It keeps the artist busy: they are supposed to keep releasing work to feed themselves as their early work passes into the public domain, just like any other job.

    --

    I read the internet for the articles.

  8. Books read to you while commuting by dudemaster · · Score: 3, Interesting

    How about this.... use an open source speech synthesis tool/API that can play these text books (especially as more get added) over a PDA, laptop, etc while cruising in on the way to work and home. Something like:

    http://www.cstr.ed.ac.uk/projects/festival/
    (no plug, just did a quick freshmeat search)

    would be pretty cool to get some good novels read to you w/o buying the tapes.

  9. What books need to be done? by Alethes · · Score: 3, Interesting

    Is there a list of books that are out of copyright and perhaps the status of those books on the Gutenberg Project website or anywhere else?

  10. Possible Enhancements by Niles_Stonne · · Score: 5, Interesting

    This is a great project... But after doing my first page I found a couple of possible enhancements.

    Add a "Quality" stat for each person. Base it on the number of things that were missed (in other words, the number of things that the second-string proofer finds).

    Use more than just two proofers. Have one "First String" proofer, who could be anybody, but have two second string proofers (who both get the output of the first string proofer). If the second string proofers have any differences in their output (with the exception of white space), then another second string proofer should be used. Only proofers with a certain quality rating (slightly higher than what a newbie's would be) should be able to do the second string proofing.

    The "User rating" should be a combination of the number of pages done and the quality rating of those pages. Note that quality rating would only be increased by doing first string proofing. Page count would go up for any proofing.

    Quality could be a float, starting at 1.0 for newbies. Every page that is completed and has a second-string person check would then go into a calculation like:

    _new_quality_ = _old_quality_ + (0.01 - (_num_differences_between_their_proof_and_final_proof_ / 1000))

    Thus, for every page proofed that requires NO corrections by the second string the user's quality would go up by 0.01. ( 0.01 - 0/1000 = 0.01 )

    If there were more than ten errors in the proofing, their quality would go down. Ten errors is the break-even point ( 0.01 - 10/1000 = 0.00 ), and twenty errors costs a full step ( 0.01 - 20/1000 = -0.01 ).

    Have a threshold of 1.10 or some such for second string proofers... That way it would require the user to do at least 10 perfect pages, or 20 pages with 5 errors, etc, before they could do the second string proofing.

    Obviously, make sure that the second string proofer can't see who the first string proofer is.

    The "User Rating" (mentioned above) could just be a multiplication of the Quality and Page Counts...
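
    As a rough sketch, the scoring rules above could look like the following. It uses Python's Decimal instead of a float, which is a substitution on my part so that the comparison against the 1.10 threshold is not thrown off by rounding. The function and constant names are made up for illustration.

```python
from decimal import Decimal

START_QUALITY = Decimal("1.0")             # starting value for newbies
SECOND_STRING_THRESHOLD = Decimal("1.10")  # the "1.10 or some such" threshold


def update_quality(old_quality, num_differences):
    """Add 0.01 per page, minus 0.001 per difference the second string found."""
    return old_quality + (Decimal("0.01") - Decimal(num_differences) / 1000)


def quality_after(differences_per_page, start=START_QUALITY):
    """Apply the update once for each first-string page, in order."""
    quality = start
    for num_differences in differences_per_page:
        quality = update_quality(quality, num_differences)
    return quality


def can_second_string(quality, threshold=SECOND_STRING_THRESHOLD):
    """True once a proofer's quality has reached the second-string threshold."""
    return quality >= threshold


def user_rating(quality, page_count):
    """The simple combined rating: quality multiplied by pages proofed."""
    return quality * page_count


# The two cases from the comment: 10 perfect pages, or 20 pages with 5 errors each.
for pages in ([0] * 10, [5] * 20):
    quality = quality_after(pages)
    print(len(pages), "pages ->", quality, can_second_string(quality))
```

    Both cases land exactly on the threshold, so whether the check is "at least 1.10" or "above 1.10" decides if those proofers qualify; the sketch assumes "at least".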

    --
    Sticks and Stones may break my bones, but copyright will always protect me.

  11. Scanning without damaging the book? by mttlg · · Score: 3, Interesting

    I have a few books that are old enough to be well out of copyright (and obscure enough not to be found online already), and for a while I have been considering typing them in. OCR would be a lot easier, but getting a good image from a flatbed scanner would seriously damage most of these books. Even a handheld scanner would be impractical in some cases, and a digital camera seems even less likely to work. Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?