Slashdot Mirror


Just One Page a Day

Charles Franks writes "Two years ago I started building an online proofreading system as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP). The concept is simple: we scan books and load the image and OCR output for each page into the online system. Next, proofreaders compare the OCR text to the image, making any corrections as necessary; each page gets looked at twice. Finally, the output from the site is massaged into a PG e-text and submitted to PG for posting to the archive. Now, nearly 600 books and a lot of PHP code later, we have snuggled into our new home, which is graciously provided by the Internet Archive and Project Gutenberg. Now that we have 'real' resources available to us (the original site ran on a Pentium 200 over my 128kbps upstream cablemodem), I would like to invite the online community at large to help us put even more books online. To this end I would like to ask everyone to do 'Just One Page a Day'. Thank you, Charles Franks"

28 of 373 comments

  1. Stop reading this by XiC · · Score: 5, Insightful

    And start reading a page!
    After that come back and you may continue();

  2. And you ask the /. community.. by Harald74 · · Score: 5, Funny

    ... which is renowned for it's spelling prowess? ;)

    --
    A)bort, R)etry or S)elf-destruct?
    1. Re:And you ask the /. community.. by Textbook+Error · · Score: 5, Funny

      for it's spelling

      Or grammer... :-)

      ("it's" == "it is", "its" == possessive form)

      --

      Nae bother
    2. Re:And you ask the /. community.. by tswinzig · · Score: 5, Funny

      ... which is renowned for it's spelling prowess? ;)

      Are you kidding? With the number of people bitching about grammar and spelling in the comments, you just know there's a pool of talent here!

      (BTW, there's no apostrophe in the possessive form of "its.")

      --

      "And like that ... he's gone."
    3. Re:And you ask the /. community.. by tswinzig · · Score: 5, Funny

      for it's spelling

      Or grammer...


      Or spelling?

      --

      "And like that ... he's gone."
    4. Re:And you ask the /. community.. by Anonymous Coward · · Score: 5, Funny

      Or sense of humour?

    5. Re:And you ask the /. community.. by CaseyB · · Score: 5, Informative

      I know you're joking, but in reality it doesn't matter how good your spelling is. In fact, I would imagine that any spelling errors found in the text should be reproduced intact, in the interest of accurately representing the original work. This project is about correcting OCR errors, not spelling / grammar.

  3. Just one page a day? by Adam+Rightmann · · Score: 5, Funny

    Sounds like Gary Condit's plan for extramarital affairs.

    --
    A. Rightmann
  4. Obvious... by OrangeSpyderMan · · Score: 5, Funny

    I'm shure that buy askin teh Salshdot crowd (esp. the editturs) to help, yule improove jamatically teh kwality off you're output.

    :-)

    --
    Try NetBSD... safe,straightforward,useful.
  5. Copyright is not an issue by ardmhacha · · Score: 5, Informative

    Project Gutenberg only publishes books that are out of copyright. That means Dickens is okay, but you won't find the latest Stephen King.

  6. Wow, what a scary thought by TheConfusedOne · · Score: 5, Funny

    Imagine the kids 200 years from now reading |-|uc||_3b3rry F1|\||\|.

    (That hurts my brain just trying to type it in...)

    --
    --- I wish I could hear the soundtrack to my life. That way I'd know when to duck.
  7. A better use of time by Apreche · · Score: 5, Insightful

    I think a better use of time would be to have all these programmers here develop a better OCR. Then you wouldn't need the proofreading and could just feed books into the scanner. I mean, there are lots of things wrong with OCR and reasons why it can't be absolutely perfect, but it CAN be better. If we just write one line of code a day each, we'll have better OCR in no time.

    --
    The GeekNights podcast is going strong. Listen!
  8. Re:Legal Implications by stinky+wizzleteats · · Score: 5, Interesting

    While publishers sell dead-tree copies still, they have no copyright over the original text contained within.

    What? You mean to suggest that you have an actual example of a publisher making money without tyranny over the content?

    Gasp!

  9. Re:Which books are getting converted? by teeker · · Score: 5, Informative

    The books that are being converted are whatever people feel like contributing.

    Don't think your favorite authors are being represented? Can you demonstrate that the work is out of copyright? Make the conversion yourself!

    Doing the hard work yourself is the best way to guarantee your interests are represented.

    --
    teeker
  10. use proofreading meta-data to improve OCR! by tomlouie · · Score: 5, Interesting

    What if they kept track of every time the human reader finds an OCR error? Couldn't you then build a profile of what words/phrases/letters the OCR software has the most problems with?

    Then, couldn't you just selectively have the humans review the sections of a book most likely to contain errors, instead of every single word of every single page?

    What do you think?
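
A rough sketch of the error-profile idea above, assuming the system keeps both the raw OCR text and the proofed text for each page. The sample pages and the `confusion_counts` helper are made up for illustration; nothing here is taken from the actual DP code.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(ocr_text, proofed_text):
    """Count (ocr_span, corrected_span) pairs where a proofer changed the OCR text."""
    counts = Counter()
    matcher = SequenceMatcher(None, ocr_text, proofed_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            counts[(ocr_text[i1:i2], proofed_text[j1:j2])] += 1
    return counts


# Hypothetical pages: (raw OCR output, text after a human proofed it).
pages = [
    ("Tbe cat sat on tbe mat.", "The cat sat on the mat."),
    ("He bad a modem.", "He had a modem."),
    ("A modem poem.", "A modern poem."),
]

# Accumulate one profile across all pages.
profile = Counter()
for ocr, proofed in pages:
    profile.update(confusion_counts(ocr, proofed))

# Most frequent confusions first; these are the patterns worth flagging for review.
for (wrong, right), count in profile.most_common():
    print(f"{wrong!r} -> {right!r}: {count}")
```

With a profile like this, pages containing many of the high-count patterns could be sent to humans first. Whether that is reliable enough to skip the other pages is an open question.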

  11. Read? by uneek · · Score: 5, Funny

    Don't you mean run a compare tool in the background using CPU idle time?

    You don't actually want us to read a
    page of literature do you?

  12. A better way - have computers do more work. by lawpoop · · Score: 5, Interesting
    I was thinking -

    In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.

    We use omnipro here at work, and I'm surprised at how well it works, even recreating page formats.

    Of course, it doesn't work 100%, but it sure does get about 95%. If you were to OCR a document 2-3 or more times, and most of it was identical, it would save a lot of time if you had humans going over only the parts that the different OCRs didn't agree on.
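
A minimal sketch of the "only proofread the disagreements" idea, assuming the OCR runs have already been aligned word for word (real OCR output can drop or merge words, so a real system would need an alignment step first). The function name and sample text are hypothetical.

```python
from collections import Counter


def merge_ocr_runs(runs):
    """Majority-vote across aligned OCR runs.

    Returns (merged_words, flagged_positions), where flagged positions are
    the word indexes on which the runs did not all agree.
    """
    token_lists = [run.split() for run in runs]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("this sketch needs runs with the same number of words")

    merged, flagged = [], []
    for position, candidates in enumerate(zip(*token_lists)):
        word, votes = Counter(candidates).most_common(1)[0]
        merged.append(word)
        if votes < len(runs):  # at least one run disagreed
            flagged.append(position)
    return merged, flagged


# Hypothetical output from three OCR passes over the same line.
runs = [
    "It was the best of tirnes",
    "It was the best of times",
    "It was tbe best of times",
]
merged, flagged = merge_ocr_runs(runs)
print(" ".join(merged))
print("words for a human to check:", flagged)
```

Note the catch: if every pass makes the same mistake, nothing gets flagged, so this reduces the human workload without replacing the proofreading.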

    Steve Lefevre

    --
    Computers are useless. They can only give you answers.
    -- Pablo Picasso
  13. Re:OCR Software -- Clara, perhaps? by timothy · · Score: 5, Informative

    Though the web page was last updated in July, I find several happy references (and some less happy) to "Clara," a GPL'd OCR program.

    Here's the web page: http://www.claraocr.org/index.html

    timothy

    --
    jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5
  14. Re:Umm... by jandrese · · Score: 5, Interesting

    Someone needs to do a Google search on "Public Domain". Public domain is there for a reason. Copyright is available to give the artist a means of supporting himself, but it was never meant to last his entire life. The purpose is to give the artist an incentive to work; current copyright law fails in this respect because an artist only needs to create one successful work and can immediately switch to being a leech on society for the rest of his (and his children's, and children's children's) life. Having the works pass into the public domain is a good idea for two reasons:
    1. It is for the greater good of society as other people build on earlier works.
    2. It keeps the artist busy as they were supposed to have to keep releasing work to feed themselves as their early work passed into the public domain, just like any other job.

    --

    I read the internet for the articles.
  15. Just one page a day, huh? by WIAKywbfatw · · Score: 5, Funny

    Sure, it starts as just one a day. But, before you know it, you're doing two, then five, then ten.

    You stop going out with friends or even returning their calls, personal hygiene takes a back seat and even Counter Strike and Warcraft III become unappealing. And, finally, after countless chapters and hundreds of pages you realise that your friends were right: you're an addict.

    Just one page a day, huh? Yeah, right.

    Opium. Pot. Cocaine. Now pages.

    It might not be your older brother's drug, or your Daddy's or your grandfather's, but, trust me, this stuff can be dangerous.

    Do what I do. Just say no.

    --

    "Accept that some days you are the pigeon, and some days you are the statue." - David Brent, Wernham Hogg
  16. Possible Enhancements by Niles_Stonne · · Score: 5, Interesting

    This a great project... But after doing my first page I found a couple of possible enhancements.

    Add a "Quality" stat for each person. Base it on the number of things that were missed (in other words, the number of things that the second-string proofer finds).

    Use more than just two proofers. Have one "First String" proofer, who could be anybody, but have two second string proofers (who both get the output of the first string proofer). If the second string proofers have any differences in their output (with the exception of white space), then another second string proofer should be used. Only proofers with a certain quality rating (slightly higher than what a newbie's would be) should be able to do the second string proofing.

    The "User rating" should be a combination of the number of pages done and the quality rating of those pages. Note that quality rating would only be increased by doing first string proofing. Page count would go up for any proofing.

    Quality could be a float, starting at 1.0 for newbies. Every page that is completed and has a second-string person check would then go into a calculation like:

    _new_quality_ = _old_quality_ + (0.01 - (_num_differences_between_their_proof_and_final_proof_ / 1000))

    Thus, for every page proofed that requires NO corrections by the second string the user's quality would go up by 0.01. ( 0.01 - 0/1000 = 0.01 )

    If there were more than ten errors in the proofing, their quality would go down ( 0.01 - 10/1000 = 0.00 ), ( 0.01 - 20/1000 = -0.01 )

    Have a threshold of 1.10 or some such for second string proofers... That way it would require the user to do at least 10 perfect pages, or 20 pages with 5 errors, etc, before they could do the second string proofing.

    Obviously, make sure that the second string proofer can't see who the first string proofer is.

    The "User Rating" (mentioned above) could just be a multiplication of the Quality and Page Counts...
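
The scoring scheme above could be sketched roughly like this. All names are made up for illustration. One deliberate change from the post: quality is stored as an integer number of thousandths instead of a float, because adding 0.01 ten times in floating point can land just under 1.10 and miss the threshold.

```python
START_QUALITY = 1000            # 1.0, stored in thousandths (assumption of this sketch)
SECOND_STRING_THRESHOLD = 1100  # 1.10, the suggested cutoff for second string proofing


def update_quality(quality, num_differences):
    """Apply new = old + (0.01 - differences / 1000), expressed in thousandths."""
    return quality + (10 - num_differences)


def can_second_string(quality):
    """True once a proofer has reached the second string threshold."""
    return quality >= SECOND_STRING_THRESHOLD


def user_rating(quality, page_count):
    """Overall rating as quality (converted back to a float) times pages proofed."""
    return quality * page_count / 1000


# Ten perfect pages: 1.0 -> 1.10.
perfect = START_QUALITY
for _ in range(10):
    perfect = update_quality(perfect, 0)

# Twenty pages with five differences each: also 1.0 -> 1.10.
sloppy = START_QUALITY
for _ in range(20):
    sloppy = update_quality(sloppy, 5)

print(perfect, can_second_string(perfect))
print(sloppy, can_second_string(sloppy))
```

Both cases reach the threshold exactly, which matches the "10 perfect pages, or 20 pages with 5 errors" figures in the post.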

    --
    Sticks and Stones may break my bones, but copyright will always protect me.
  17. ASCII Only? by vondo · · Score: 5, Insightful

    Reading the blurb at the page-a-day site, it says ASCII only where bold is converted to ALL CAPS, the English pound symbol is rendered as "L," etc. No preservation of figures, drawings, or photos.

    This seems very short-sighted to me. Devices that can only display ASCII are becoming rarer and rarer. Why not, instead, store docs in some sort of SGML format to handle the special markup (which must be rare) and then down-convert to ASCII when needed?
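
A tiny sketch of what "store the markup, down-convert when needed" might look like. The `<b>` tag stands in for whatever SGML-style markup would actually be chosen, and the two conversion rules are the ones mentioned above (bold to ALL CAPS, pound sign to "L"). Everything else is an assumption for illustration.

```python
import re


def to_plain_ascii(marked_up):
    """Down-convert lightly marked-up text to plain ASCII."""
    # Bold spans become ALL CAPS.
    text = re.sub(
        r"<b>(.*?)</b>",
        lambda match: match.group(1).upper(),
        marked_up,
        flags=re.DOTALL,
    )
    # The pound sign (U+00A3) is rendered as "L".
    text = text.replace("\u00a3", "L")
    # Anything else outside ASCII is replaced with "?" in this sketch.
    return text.encode("ascii", errors="replace").decode("ascii")


sample = "It cost <b>five</b> pounds (\u00a35) in all."
print(to_plain_ascii(sample))
```

The point is that the conversion only runs one way: the ASCII version can always be generated from the marked-up master, but the bold and the pound sign cannot be recovered from the ASCII.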

    I've tried reading these things on my Palm. Very difficult. But if I could get a nice typeset PDF version, that would be a whole different story (no pun intended).

    1. Re:ASCII Only? by Robotech_Master · · Score: 5, Informative

      Check out Black Mask for a lot of nicely-formatted pubdom e-books, including many from Gutenberg but also some that Gutenberg doesn't have.

      --
      Editor Emeritus and Senior Writer, TeleRead.org
  18. Re:A better use of time (OK, here's mine) by gosand · · Score: 5, Funny
    If we just write one line of code a day each we'll have better OCR in no time.

    OK, here's mine:

    #include <stdio.h>

    next...

    --

    My beliefs do not require that you agree with them.

  19. Re:Scanning without damaging the book? by jpetts · · Score: 5, Informative

    Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?

    Yes indeed! *Any* decent academic library should have a photocopier which can do this. Older models tend to have a glass platen which extends right to the edge of the photocopier, and the side slopes away at around 60 degrees rather than dropping at a right angle. Newer models, such as the Minolta PS3000, will support the book in a cradle, face up, so that contact with the pages is minimised. They also tend to have a host of features, such as automagically erasing the gutter shadow that one gets with such a system.

    --
    Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
  20. Re:And you ask the /. community... by Binestar · · Score: 5, Funny

    MY GOD! A story where nitpicking grammar and spelling is *ON* topic.

    This'll be a fun one to read through.

    --
    Do you Gentoo!?
  21. Re:Some PG books ARE copyrighted... by dpbsmith · · Score: 5, Informative

    ...Not many, but there are some Project Gutenberg books that are copyrighted and distributed with the author's permission.

    Also, Project Gutenberg of Australia publishes a number of works that are out of copyright in Australia, but still under copyright in the U.S. It is a copyright infringement for readers in the U.S. to download these works, which include, among others, Hervey Allen's _Anthony Adverse_ (1933), F. Scott Fitzgerald's _The Great Gatsby_ (1925), Khalil Gibran's _The Prophet_ (1923), D. H. Lawrence's _Lady Chatterley's Lover_ (1928), all of George Orwell's novels, most of Virginia Woolf's, etc. etc.

    Not exactly "the latest Stephen King" but a lot newer than Dickens.

  22. Looking for proofreaders on slashdot !! by tadas · · Score: 5, Funny

    If they're looking for proofreaders here, the project is in deep trouble...

    --
    This page accidentally left blank