Slashdot Mirror


Just One Page a Day

Charles Franks writes "Two years ago I started building an online proofreading system as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP). The concept is simple: we scan books and load the image and OCR output for each page into the online system. Next, proofreaders compare the OCR text to the image, making any corrections as necessary; each page gets looked at twice. Finally, the output from the site is massaged into a PG e-text and submitted to PG for posting to the archive. Now, nearly 600 books and a lot of PHP code later, we have snuggled into our new home, which is graciously provided by the Internet Archive and Project Gutenberg. Now that we have 'real' resources available to us (the original site ran on a Pentium 200 over my 128kbps upstream cablemodem), I would like to invite the online community at large to help us put even more books online. To this end I would like to ask everyone to do 'Just One Page a Day'. Thank you, Charles Franks"

25 of 373 comments

  1. Re:Legal Implications by Junta · · Score: 4, Informative

    The only works that go into PG are works in the public domain. While publishers still sell dead-tree copies, they have no copyright over the original text contained within. (Which is why these works are typically available through multiple publishers.)

    --
    XML is like violence. If it doesn't solve the problem, use more.
  2. Copyright is not an issue by ardmhacha · · Score: 5, Informative

    Project Gutenberg only publishes books that are out of copyright. That means Dickens is okay, but you won't find the latest Stephen King.

    1. Re:Copyright is not an issue by Twylite · · Score: 3, Informative

      Sadly, copyright is an issue in this sort of work. Just because Dickens' works are no longer under copyright doesn't mean you can go and pull a Dickens novel off the library/bookstore shelf and OCR it. Publishers tend to be careful to make slight alterations to the text here and there (formatting, spelling, some clarifications and corrections), which turn a copyright-expired work into a derived work over which they own the copyright. Shitty, isn't it?

      --
      i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
  3. Re:Legal Implications by seizer · · Score: 4, Informative

    It helps if you read the FAQ list.

    Due to copyright laws, it is only legal to do this with older books (copyrighted 75 or more years ago). As a result, Project Gutenberg is mostly comprised of the "Classics."

  4. Re:copyrights? by Jeremy Erwin · · Score: 4, Informative

    Copyrights aren't perpetual. The Gutenberg project aims to publish books that are no longer, or have never been, under copyright.

  5. Re:Book Pirating? by raju1kabir · · Score: 4, Informative
    So are the books they are digitizing all in the public domain? It doesn't seem like there would be that many books in the public domain that haven't already been made available on the net. Of course I could be wrong.

    And you probably are. The best efforts of our duly elected Congressional representatives notwithstanding, copyright still does expire. After that, a work passes automatically into the public domain. That means there are hundreds of thousands of books available.

    In fact, if you've previously seen the classics online, they probably came from this project, which has been around for almost as long as I can remember.

    --
    "Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS
  6. Re:Which books are getting converted? by teeker · · Score: 5, Informative

    The books that are being converted are whatever people feel like contributing.

    Don't think your favorite authors are being represented? Can you demonstrate that the work is out of copyright? Make the conversion yourself!

    Doing the hard work yourself is the best way to guarantee your interests are represented.

    --
    teeker
  7. Re:OCR Software by Anonymous Coward · · Score: 4, Informative
    Generally not used at DP, which mostly uses ABBYY FineReader (www.abbyy.com), a commercial product.


    gocr (http://jocr.sourceforge.net/) is open-source, and includes interesting bits like deskewing.


    As a proofreader, I really appreciate the best OCR, and the free guys are not the best.

  8. OCR errors mostly caused by poor scan quality by oob · · Score: 4, Informative

    I've just proofed four pages, a mix of modern English, quoted Cockney and religious babble (Jonah 4:13, 9 etc.)

    OK it's only four pages, but the errors I've corrected so far have been when the scan has been poor and the OCR software has had to make a guess.

  9. Re:OCR Software -- Clara, perhaps? by timothy · · Score: 5, Informative

    Though the web page was last updated in July, I find several happy references (and some less happy) to "Clara," a GPL'd OCR program.

    Here's the web page: http://www.claraocr.org/index.html

    timothy

    --
    jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5
  10. Re:Graphics by dvdeug · · Score: 4, Informative

    Will there be any support for proofing in other languages (french, spanish, arabic, etc...)?

    DP has had books in Dutch, French, Spanish and German. No Arabic - no one has mentioned being able to do it, for one thing.

    Would we be able to post those books if they're not copyrighted in the US but copyrighted in other countries?

    Project Gutenberg only worries about the US copyright. If it's not copyrighted in the US, they'll do it.

  11. Re:will this work? by clonebarkins · · Score: 4, Informative
    who does the final proof reading, and if there is someone doing the final proof reading that kinda eliminates the need for the distributed part.

    charlz has a workflow diagram for the works that go through his site. As you see, each book has a project manager, who has final processing/proofing responsibilities.

    Also, I'm not sure you get the idea of two rounds of proofing. They don't see different versions of a corrected page -- the first one sees the straight OCR output (or, sometimes the project manager will do some automated corrections on it first) and then the first round proofer edits the text. Then, when all the pages have gone through the first round, the second round proofer reads the text as it was edited by the first round proofer. This helps because it builds off the edits of the first round proofer and allows the second round proofer to perhaps catch things not caught in the first round.
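
    A rough sketch of that two-round flow in Python. The `Page` class, `run_round`, and the two toy proofer functions are made up for illustration and are not DP's actual PHP code; the point is only that round two starts from round one's output, and that a round finishes for every page before the next one begins.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """One scanned page: the raw OCR text plus each round's corrected text."""
    ocr_text: str
    history: List[str] = field(default_factory=list)

    @property
    def current_text(self) -> str:
        # The latest proofed version, or the raw OCR if nobody has proofed it yet.
        return self.history[-1] if self.history else self.ocr_text


def run_round(pages: List[Page], proofer: Callable[[str], str]) -> None:
    """Run one proofing round over every page, building on the previous round."""
    for page in pages:
        page.history.append(proofer(page.current_text))


# Hypothetical proofers: each one catches a different kind of OCR slip.
def first_round(text: str) -> str:
    return text.replace("tbe", "the")


def second_round(text: str) -> str:
    return text.replace("arid", "and")


pages = [Page("tbe cat arid tbe dog"), Page("arid so on")]
run_round(pages, first_round)    # every page goes through round one first
run_round(pages, second_round)   # round two sees round one's edits, not the raw OCR
final = [page.current_text for page in pages]
```

    Keeping the per-round history around is a design choice of this sketch; it lets a project manager see what each round changed during final processing.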

    When proofreading, you're never going to capture all the mistakes with one pair of eyes. A distributed proofreading effort is very beneficial to the goals and efforts of Project Gutenberg, and I applaud the efforts of all those who have proofed even one page.

    Having said that, I've done over 300 (under a different name).

    --

    "The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand

  12. Re:What books need to be done? by clonebarkins · · Score: 3, Informative

    Check out the following for a start:

    --

    "The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand

  13. Re:OCR Software -- Clara, perhaps? by Zach Garner · · Score: 3, Informative

    I've used both Clara and gOCR. Neither works well enough yet to actually use for scanning books.

  14. Re:ASCII Only? by Robotech_Master · · Score: 5, Informative

    Check out Black Mask for a lot of nicely-formatted pubdom e-books, including many from Gutenberg but also some that Gutenberg doesn't have.

    --
    Editor Emeritus and Senior Writer, TeleRead.org
  15. Re:And you ask the /. community.. by CaseyB · · Score: 5, Informative

    I know you're joking, but in reality it doesn't matter how good your spelling is. In fact, I would imagine that any spelling errors found in the text should be reproduced intact, in the interest of accurately representing the original work. This project is about correcting OCR errors, not spelling / grammar.

  16. Re:A better way - have computers do more work. by noodlez84 · · Score: 4, Informative

    Although your method of "proofreading" is actually useful for most documents, it is _not_ a good method for Project Gutenberg (as a contributor to DP, I can attest to this).

    The works put out by Project Gutenberg are going to be around for decades, if not centuries. 95% accuracy is shit for those purposes. An issue that comes up on the PG mailing list (gutvol-d) every once in a while is whether or not to correct spelling mistakes that appear in the real, dead-tree versions of the books. What if, for example, it's obvious to almost any reader that the author meant the word "by" instead of "bye"? Surprisingly (or not, depending on the way you look at it), the general response is *not* to correct those kinds of "mistakes". The rationale being that PG is -not- an editor, but simply a library (which is actually its legal status).

    So, in short, for works with millions of characters that are going to be around for many decades, 95% accuracy is not good enough. The "bar" might be high, but when proofreading for DP, I strive for 100%.
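
    For a sense of scale, here is the arithmetic as a small Python sketch. The one-million-character book size and the 99.99% comparison figure are assumptions picked for illustration, not numbers from PG or DP.

```python
def expected_errors(total_chars: int, accuracy: float) -> int:
    """Number of wrong characters left at a given per-character accuracy."""
    return round(total_chars * (1.0 - accuracy))


BOOK_CHARS = 1_000_000  # assumed size of a long book, for illustration only

errors_at_95 = expected_errors(BOOK_CHARS, 0.95)
errors_at_9999 = expected_errors(BOOK_CHARS, 0.9999)
print(f"95% accurate:    {errors_at_95} wrong characters")
print(f"99.99% accurate: {errors_at_9999} wrong characters")
```

    Even the second figure still leaves errors behind in a book of that size, which is why a second pair of eyes helps.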

  17. Re:Scanning without damaging the book? by jpetts · · Score: 5, Informative

    Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?

    Yes indeed! *Any* decent academic library should have a photocopier which can do this. Older models tend to have a glass platen which extends right to the edge of the photocopier, and the side slopes away at around 60 degrees rather than dropping at a right angle. Newer models, such as the Minolta PS3000 will support the book in a cradle, face up, so that contact with the pages is minimised. They also tend to have a host of features, such as automagically erasing the gutter shadow that one gets with such a system.

    --
    Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
  18. Re:ASCII Only? by rusty0101 · · Score: 4, Informative

    When the project was started, SGML variants were not widely used, and the option of including images was a concern for storage space.

    Using things like BOLD and L for the British pound were workarounds to have a common way of presenting the data. I suspect that it would be trivial to build a formatting filter in Perl, or another language, that would convert BOLD to bold, though it would require a bit more work to recognize that it really should be Bold or even that it should be BOLD.

    Converting monetary symbols would require a bit more work, but would also not be impossible.
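
    A minimal sketch of such a filter, in Python rather than Perl. It assumes emphasis was typed as runs of all-caps words and that a pound amount was typed as a capital L directly before a digit; those conventions, the regexes, and the `restore_markup` name are guesses for illustration, not Gutenberg's documented rules.

```python
import re

# Assumed convention: emphasis appears as one or more all-caps words in a row.
CAPS_RUN = re.compile(r"\b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b")
# Assumed convention: a pound amount is a capital L directly before a digit.
POUND = re.compile(r"\bL(?=\d)")


def restore_markup(text: str) -> str:
    """Wrap all-caps runs in <b> tags (lowercased) and turn L5 into £5."""
    text = CAPS_RUN.sub(lambda m: "<b>" + m.group(0).lower() + "</b>", text)
    return POUND.sub("£", text)


sample = restore_markup("He was VERY angry about the L5 fee.")
print(sample)
```

    As written, it lowercases everything it finds, so it cannot tell a word that should come back as Bold from one that was meant to stay in capitals, and it would also mangle real acronyms such as OCR. That is the "bit more work" part.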

    Re-inserting any diagrams, figures, illustrations or other graphics would require more work. If the original scanned pages are still available, as this part of the project suggests, even that would not be impossible.

    One variation is the free bookmobile project that is out there. They use scans of the original book to build a new book for kids. Preparation for printing involves downloading the book over the internet, via a DSL-speed satellite link. I am not sure, however, if the working material is suitable for e-book reading.

    -Rusty

    --
    You never know...
  19. Re:ASCII Only? by quinto2000 · · Score: 3, Informative

    From actually proofing a few pages, this depends entirely on the particular project and when it was started. Some of the newer ones allow special characters.

    --
    Ceci n'est pas un post
  20. Can't get through? Try ibiblio by gbnewby · · Score: 3, Informative
    The main Gutenberg page is slashdotted right now, but you can get nearly the same access to the books via the main ibiblio page at ibiblio.org/gutenberg, which is the main distribution site for the collection.

    It looks like the texts01.archive.org/dp site is holding up fairly well! If you cannot get through today, though, please check back later. Slashdot effect aside, it's usually quite speedy and has a decent 'net connection. If you want to keep informed of current events, get on one of our mailing lists via (when it's not slashdotted) our subscriptions page.

    Dr. Gregory B. Newby
    Chief Executive and Director
    Project Gutenberg Literary Archive Foundation http://gutenberg.net
    A 501(c)(3) not-for-profit organization with EIN 64-6221541
    gbnewby@ils.unc.edu // 919-962-8064

  21. Re:Some PG books ARE copyrighted... by dpbsmith · · Score: 5, Informative

    ...Not many, but there are some Project Gutenberg books that are copyrighted and distributed with the author's permission.

    Also, Project Gutenberg of Australia publishes a number of works that are out of copyright in Australia, but still under copyright in the U.S. It is a copyright infringement for readers in the U.S. to download these works, which include, among others, Hervey Allen's _Anthony Adverse_ (1933), F. Scott Fitzgerald's _The Great Gatsby_ (1925), Khalil Gibran's _The Prophet_ (1923), D. H. Lawrence's _Lady Chatterley's Lover_ (1928), all of George Orwell's novels, most of Virginia Woolf's, etc.

    Not exactly "the latest Stephen King" but a lot newer than Dickens.

  22. Re:use proofreading meta-data to improve OCR! by dmoynihan · · Score: 3, Informative

    Actually, they're working on that.

    The program is Gutcheck, developed by PG's Jim Tinsley.

    Catches a lot!

  23. Non-native proofers by Sangui5 · · Score: 3, Informative

    are actually the preferred way to proof text. A project to create "The Collected Works of Edmund Spenser" is headquartered here, and the English-types were looking for people to work on some software for them. The current most accurate way to create an electronic copy is to hire people without even a passing familiarity with the alphabet you are targeting, train them to identify the letters themselves (using the font you're targeting, which may be very much non-standard, esp. for work as old as Spenser's), and have them enter it in character by character. You then have another illiterate person do the same, and have 1 editor (English graduate student) check both copies. Then any differences have to be handled by another editor (English PhD), and the final copy signed off by yet another editor (PhD).

    A very very expensive way to do it.

    See, an illiterate person won't introduce any bias into the text. They will faithfully duplicate any spelling mistakes that they find. In the case of an English scholarly collection, the mistakes are among the most important part, since they can identify different print runs, and how language shifts over time.
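
    The step where an editor checks the two keyed copies against each other is the easy part to automate. A minimal sketch using Python's standard `difflib`; the `keying_differences` helper and the sample strings are invented for illustration and have nothing to do with the Spenser project's real tooling.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def keying_differences(copy_a: str, copy_b: str) -> List[Tuple[int, str, str]]:
    """List (offset in copy_a, text in copy_a, text in copy_b) where the copies disagree."""
    matcher = SequenceMatcher(None, copy_a, copy_b, autojunk=False)
    return [
        (a_start, copy_a[a_start:a_end], copy_b[b_start:b_end])
        for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes()
        if tag != "equal"
    ]


# Two typists key the same line; the second one misreads the final letter.
first_typist = "faerie queene"
second_typist = "faerie queenc"
disagreements = keying_differences(first_typist, second_typist)
for offset, a_text, b_text in disagreements:
    print(f"offset {offset}: {a_text!r} vs {b_text!r}")
```

    Anything the two typists agree on passes through untouched, original spelling mistakes included; only the disagreements go to an editor, who settles them against the page.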

    As a side note, the software project is hopeless. The best that can be managed is to automate the administration of their current systems--no OCR will ever meet the level of accuracy that their current system provides.

  24. Proofing FAQ by Wanker · · Score: 3, Informative
    Stop reading this
    And start reading a page!
    After that come back and you may continue();

    ...but first read the Proofing FAQ on the site and save yourself some confusion:

    http://texts01.archive.org/dp/faq/ProoferFAQ.html

    Especially read section 5 for some of their typesetting-to-ASCII conventions which would be non-obvious otherwise.