Slashdot Mirror


Just One Page a Day

Charles Franks writes "Two years ago I started building an online proofreading system as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP). The concept is simple: we scan books and load the image and OCR output for each page into the online system. Next, proofreaders compare the OCR text to the image, making any corrections as necessary; each page gets looked at twice. Finally, the output from the site is massaged into a PG e-text and submitted to PG for posting to the archive. Now, nearly 600 books and a lot of PHP code later, we have snuggled into our new home, which is graciously provided by the Internet Archive and Project Gutenberg. Now that we have 'real' resources available to us (the original site ran on a Pentium 200 over my 128kbps upstream cablemodem), I would like to invite the online community at large to help us put even more books online. To this end I would like to ask everyone to do 'Just One Page a Day'. Thank you, Charles Franks"

38 of 373 comments

  1. Re:Legal Implications by phil reed · · Score: 2, Informative

    I can't decide if this is a joke or not.

    You do know about Project Gutenberg, right?

    --

    ...phil
    "For a list of the ways which technology has failed to improve our quality of life, press 3."
  2. Re:Legal Implications by Junta · · Score: 4, Informative

    The only works that go into PG are works in the public domain. While publishers still sell dead-tree copies, they have no copyright over the original text contained within. (Which is why these works are typically available through multiple publishers.)

    --
    XML is like violence. If it doesn't solve the problem, use more.
  3. Copyright is not an issue by ardmhacha · · Score: 5, Informative

    Project Gutenberg only publishes books that are out of copyright. That means Dickens is okay but you won't find the latest Stephen King.

    1. Re:Copyright is not an issue by Twylite · · Score: 3, Informative

      Sadly, copyright is an issue in this sort of work. Just because Dickens' works are no longer under copyright doesn't mean you can go and pull a Dickens novel off the library/bookstore shelf and OCR it. Publishers tend to be careful to make slight alterations to the text here and there (formatting, spelling, some clarifications and corrections), which turn a copyright-expired work into a derived work over which they own the copyright. Shitty, isn't it?

      --
      i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
  4. Re:Legal Implications by seizer · · Score: 4, Informative

    It helps if you read the FAQ list.

    Due to copyright laws, it is only legal to do this with older books (copyrighted 75 or more years ago). As a result, Project Gutenberg is mostly comprised of the "Classics."

  5. Re:Book Pirating? by phil reed · · Score: 2, Informative
    >So are the books they are digitizing all in the public domain?

    Yup.

    >It doesn't seem like there would be that many books in the public domain that haven't already been made available on the net.

    How do you suppose they make it to the net? Most of the public domain books were written before word processors, so there's no electronic text around.

    >Of course I could be wrong.

    Yeah. Go look at Project Gutenberg's site - think of it as your homework assignment for the weekend.

    --

    ...phil
    "For a list of the ways which technology has failed to improve our quality of life, press 3."
  6. Re:copyrights? by Jeremy Erwin · · Score: 4, Informative

    Copyrights aren't perpetual. The Gutenberg project aims to publish books that are no longer, or have never been, under copyright.

  7. Re:Book Pirating? by raju1kabir · · Score: 4, Informative
    >So are the books they are digitizing all in the public domain? It doesn't seem like there would be that many books in the public domain that haven't already been made available on the net. Of course I could be wrong.

    And you probably are. The best efforts of our duly elected Congressional representatives notwithstanding, copyright still does expire. After that, a work passes automatically into the public domain. That means there are hundreds of thousands of books available.

    In fact, if you've previously seen the classics online, they probably came from this project, which has been around for almost as long as I can remember.

    --
    "Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS
  8. Re:Which books are getting converted? by teeker · · Score: 5, Informative

    The books that are being converted are whatever people feel like contributing.

    Don't think your favorite authors are being represented? Can you demonstrate that the work is out of copyright? Make the conversion yourself!

    Doing the hard work yourself is the best way to guarantee your interests are represented.

    --
    teeker
  9. Re:OCR Software by Anonymous Coward · · Score: 4, Informative
    Generally not used at dp. Mostly uses Abbyy Fine Reader (www.abbyy.com) which is commercial.


    gocr (http://jocr.sourceforge.net/) is open-source, and includes interesting bits like deskewing.


    As a proofreader, I really appreciate the best ocr, and the free guys are not the best.

  10. Re:public domain books? by teeker · · Score: 2, Informative

    True, but Project Gutenberg is a repository for digital copies of literature that are public domain. To remain a legitimate entity, they can't publish copyrighted works (without the author's consent).

    So, the answer to your question is no. But that's what p2p is for ;-)

    --
    teeker
  11. Re:Legal Implications by Anonymous Coward · · Score: 2, Informative

    >But the publishers still have copyright on their specific printing.

    Nope. Copyright holders (not necessarily the publisher) would have copyright on editorial corrections and (for music: a weird case) some on appearance, but not on the original text.

    Publishers often claim copyright on the entire contents of 300 year old works, but they have no legal basis for this.

  12. Re:public domain books? by SamTheButcher · · Score: 2, Informative
    Also, if you read about the project, its goal is to put all of the works into XML to create a searchable repository, not just to have all of these .txt documents floating around. Well, that's the newest goal, anyway.

    $.02. Like it or leave it.

  13. Re:A better use of time by Anonymous Coward · · Score: 1, Informative

    >I think a better use of time would be to have all these programmers here develop a better OCR.

    Maybe. OCR has improved to the level that is better than re-typing. Still averages more than an error a page, 'though. And is a hard problem.

    The most successful recent hacks on dp have been further exploiting the output of existing OCR (thanks Aldorando) to do things like handle end-of-line dashes (mostly) automatically.
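    A minimal sketch of what end-of-line dash handling could look like, assuming the OCR output is plain text with one printed line per text line. The function name `rejoin_hyphenated` and the `keep_hyphen` set are made up for illustration; this is not DP's actual code, which the thread does not show.

```python
import re

# Word fragment, hyphen, line break, rest of the word, any trailing
# punctuation, and at most one following space.
_SPLIT_WORD = re.compile(r"([A-Za-z]+)-\n([A-Za-z]+)(\S*) ?")

def rejoin_hyphenated(text, keep_hyphen=frozenset()):
    """Rejoin words that were split across a line break by a hyphen.

    `keep_hyphen` is a hypothetical set of lowercase words (such as
    "well-known") whose hyphen is real and should survive the join.
    """
    def fix(m):
        head, tail, punct = m.group(1), m.group(2), m.group(3)
        hyphenated = head + "-" + tail
        word = hyphenated if hyphenated.lower() in keep_hyphen else head + tail
        # Keep the line break unless the match already ends at one.
        end = m.end()
        ends_line = end == len(m.string) or m.string[end] == "\n"
        return word + punct + ("" if ends_line else "\n")
    return _SPLIT_WORD.sub(fix, text)
```

    The "(mostly)" in the comment is the hard part: without a word list, the code cannot tell a line-break hyphen from a real one, which is why the sketch takes `keep_hyphen` as an input rather than guessing.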

  14. Re:A better use of time by scottcain · · Score: 2, Informative

    Perhaps, but the page I just proofed was from a book published in the 1850s, so it was not the best image quality, and still the OCR did a great job. The most common mistake I corrected was the OCR converting I's to !'s. It got right things that I had to look at pretty closely to make sure they were right.

  15. OCR errors mostly caused by poor scan quality by oob · · Score: 4, Informative

    I've just proofed four pages, a mix of modern English, quoted Cockney and religious babble (Jonah 4:13, 9 etc.)

    OK it's only four pages, but the errors I've corrected so far have been when the scan has been poor and the OCR software has had to make a guess.

  16. Re:OCR Software -- Clara, perhaps? by timothy · · Score: 5, Informative

    Though the web page was last updated in July, I find several happy references (and some less happy) to "Clara," a GPL'd OCR program.

    Here's the web page: http://www.claraocr.org/index.html

    timothy

    --
    jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5
  17. Re:Graphics by dvdeug · · Score: 4, Informative

    >Will there be any support for proofing in other languages (french, spanish, arabic, etc...)?

    DP has had books in Dutch, French, Spanish and German. No Arabic - no one has mentioned being able to do it, for one thing.

    >Would we be able to post those books if they're not copyrighted in the US but copyrighted in other countries?

    Project Gutenberg only worries about the US copyright. If it's not copyrighted in the US, they'll do it.

  18. Re:will this work? by clonebarkins · · Score: 4, Informative
    >who does the final proof reading, and if there is someone doing the final proof reading that kinda eliminates the need for the distributed part.

    charlz has a workflow diagram for the works that go through his site. As you see, each book has a project manager, who has final processing/proofing responsibilities.

    Also, I'm not sure you get the idea of two rounds of proofing. They don't see different versions of a corrected page -- the first one sees the straight OCR output (or, sometimes the project manager will do some automated corrections on it first) and then the first round proofer edits the text. Then, when all the pages have gone through the first round, the second round proofer reads the text as it was edited by the first round proofer. This helps because it builds off the edits of the first round proofer and allows the second round proofer to perhaps catch things not caught in the first round.
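    The two-round flow described above can be sketched as a small data model. This is only an illustration under assumptions: the `Page` class, its field names, and `text_for_round` are hypothetical and do not come from the DP code base.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Page:
    ocr_text: str                       # straight OCR output
    round1_text: Optional[str] = None   # text after the first-round proofer
    round2_text: Optional[str] = None   # text after the second-round proofer

def text_for_round(pages: List[Page], index: int, round_no: int) -> str:
    """Return the text a proofer in the given round starts from."""
    page = pages[index]
    if round_no == 1:
        # First-round proofers see the OCR output.
        return page.ocr_text
    if round_no == 2:
        # Second round opens only once every page has been through round one,
        # and it builds on the first proofer's edits.
        if any(p.round1_text is None for p in pages):
            raise ValueError("round 2 opens only after every page finishes round 1")
        return page.round1_text
    raise ValueError("only rounds 1 and 2 exist in this sketch")
```

    The point the sketch makes is the one in the comment: the second proofer never sees the raw OCR, only the first proofer's result.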

    When proofreading, you're never going to capture all the mistakes with one pair of eyes. A distributed proofreading effort is very beneficial to the goals and efforts of Project Gutenberg, and I applaud the efforts of all those who have proofed even one page.

    Having said that, I've done over 300 (under a different name).

    --

    "The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand

  19. Re:What books need to be done? by clonebarkins · · Score: 3, Informative

    Check out the following for a start:

    --

    "The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand

  20. Re:OCR Software -- Clara, perhaps? by Zach Garner · · Score: 3, Informative

    I've used both clara and gOCR. Neither works well enough yet to actually use for scanning books.

  21. Re:ASCII Only? by Robotech_Master · · Score: 5, Informative

    Check out Black Mask for a lot of nicely-formatted pubdom e-books, including many from Gutenberg but also some that Gutenberg doesn't have.

    --
    Editor Emeritus and Senior Writer, TeleRead.org
  22. Distributed Everything by Mostly Harmless · · Score: 2, Informative

    Something I posted on 10/24...

    Go here. Now. It's the most complete listing of distributed computing I've ever found. Has the usual, like folding and SETI, but also neat things like Distributed Proofreading and finding as-of-yet unknown comets.

    --
    "`Ford, you're turning into a penguin. Stop it.'" -Douglas Adams, THHGTTG
  23. Re:And you ask the /. community.. by CaseyB · · Score: 5, Informative

    I know you're joking, but in reality it doesn't matter how good your spelling is. In fact, I would imagine that any spelling errors found in the text should be reproduced intact, in the interest of accurately representing the original work. This project is about correcting OCR errors, not spelling / grammar.

  24. Re:A better way - have computers do more work. by noodlez84 · · Score: 4, Informative

    Although your method of "proofreading" is actually useful for most documents, it is _not_ a good method for Project Gutenberg (as a contributor to DP, I can attest to this).

    The works put out by Project Gutenberg are going to be around for decades, if not centuries. 95% accuracy is shit for those purposes. An issue that comes up on the PG mailing list (gutvol-d) every once in a while is whether or not to correct spelling mistakes that appear in the real, dead-tree versions of the books. What if, for example, it's obvious to almost any reader that the author meant the word "by" instead of "bye"? Surprisingly (or not, depending on the way you look at it), the general response is *not* to correct those kinds of "mistakes". The rationale being that PG is -not- an editor, but simply a library (which is actually its legal status).

    So, in short, for works with millions of characters that are going to be around for many decades, 95% accuracy isn't good enough. The "bar" might be high, but when proofreading for DP, I strive for 100%.
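    To put a number on the 95% figure, here is a small worked calculation. The book size (one million characters) and page size (2,000 characters) are illustrative assumptions, not figures from PG or DP.

```python
def expected_errors(total_chars, accuracy):
    """Expected number of wrong characters at a given per-character accuracy."""
    return round(total_chars * (1.0 - accuracy))

BOOK_CHARS = 1_000_000   # assumed size of a long book
PAGE_CHARS = 2_000       # assumed size of one page

errors_per_book = expected_errors(BOOK_CHARS, 0.95)   # 50,000 wrong characters
errors_per_page = expected_errors(PAGE_CHARS, 0.95)   # 100 wrong characters
errors_at_9999 = expected_errors(BOOK_CHARS, 0.9999)  # still 100 per book
```

    Under those assumptions, 95% per-character accuracy leaves about a hundred wrong characters on every page, and even 99.99% leaves about a hundred in the whole book.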

  25. Re:Scanning without damaging the book? by jpetts · · Score: 5, Informative

    Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?

    Yes indeed! *Any* decent academic library should have a photocopier which can do this. Older models tend to have a glass platen which extends right to the edge of the photocopier, and the side slopes away at around 60 degrees rather than dropping at a right angle. Newer models, such as the Minolta PS3000 will support the book in a cradle, face up, so that contact with the pages is minimised. They also tend to have a host of features, such as automagically erasing the gutter shadow that one gets with such a system.

    --
    Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
  26. Re:ASCII Only? by rusty0101 · · Score: 4, Informative

    When the project was started, SGML variants were not widely used, and the option of including images was a concern for storage space.

    Using things like BOLD and L for British pound were workarounds to have a common way of presenting the data. I suspect that it would be trivial to build a formatting filter in perl, or another language, that would convert BOLD to bold, though it would require a bit more work to recognize that it really should be Bold or even that it should be BOLD.
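    A rough sketch of such a filter, in Python rather than perl. It assumes that emphasis is marked by runs of all-caps words of two or more letters and that the target markup is an HTML `<b>` tag; both are assumptions for illustration, and `caps_to_bold` and its `keep` parameter are made-up names.

```python
import re

# One all-caps word of two or more letters, optionally followed by more.
_CAPS_RUN = re.compile(r"\b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b")

def caps_to_bold(text, keep=frozenset()):
    """Wrap all-caps runs in <b> tags and lowercase them.

    `keep` is a hypothetical set of strings (acronyms and the like)
    that really are meant to stay in capitals and are left untouched.
    """
    def fix(m):
        run = m.group(0)
        if run in keep:
            return run
        return "<b>" + run.lower() + "</b>"
    return _CAPS_RUN.sub(fix, text)
```

    As the comment says, the filter itself is the easy part. Deciding whether a run should come out as "bold", "Bold" or stay "BOLD" needs information the plain text no longer carries, which is why the sketch pushes that decision into the `keep` set.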

    Converting monetary symbols would require a bit more work, but would also not be impossible.

    Re-inserting any diagrams, figures, illustrations or other graphics would require more work. If the original scanned pages are still available, as this part of the project suggests, even that would not be impossible.

    One variation is the free bookmobile project that is out there. They use scans of the original book to build a new book for kids. Preparation for printing involves downloading the book over the internet, via a DSL-speed satellite link. I am not sure, however, if the working material is suitable for e-book reading.

    -Rusty

    --
    You never know...
  27. Re:ASCII Only? by quinto2000 · · Score: 3, Informative

    From actually proofing a few pages, this depends entirely on the particular project and when it was started. Some of the newer ones allow special characters.

    --
    Ceci n'est pas un post
  28. Can't get through? Try ibiblio by gbnewby · · Score: 3, Informative
    The main Gutenberg page is slashdotted right now, but you can get nearly the same access to the books via the main ibiblio page at ibiblio.org/gutenberg, which is the main distribution site for the collection.

    It looks like the texts01.archive.org/dp site is holding up fairly well! If you cannot get through today, though, please check back later. Slashdot effect aside, it's usually quite speedy and has a decent 'net connection. If you want to keep informed of current events, get on one of our mailing lists via (when it's not slashdotted) our subscriptions page.

    Dr. Gregory B. Newby
    Chief Executive and Director
    Project Gutenberg Literary Archive Foundation http://gutenberg.net
    A 501(c)(3) not-for-profit organization with EIN 64-6221541
    gbnewby@ils.unc.edu // 919-962-8064

  29. Re:Some PG books ARE copyrighted... by dpbsmith · · Score: 5, Informative

    ...Not many, but there are some Project Gutenberg books that are copyrighted and distributed with the author's permission.

    Also, Project Gutenberg of Australia publishes a number of works that are out of copyright in Australia, but still under copyright in the U.S. It is a copyright infringement for readers in the U.S. to download these works, which include, among others, Hervey Allen's _Anthony Adverse_ (1933), F. Scott Fitzgerald's _The Great Gatsby_ (1925), Khalil Gibran's _The Prophet_ (1923), D. H. Lawrence's _Lady Chatterley's Lover_ (1928), all of George Orwell's novels, most of Virginia Woolf's, etc. etc.

    Not exactly "the latest Stephen King" but a lot newer than Dickens.

  30. Re:use proofreading meta-data to improve OCR! by dmoynihan · · Score: 3, Informative

    Actually, they're working on that.

    The program is Gutcheck, developed by PG's Jim Tinsley.

    Catches a lot!

  31. Re:Are any of these resources distributed? by Anonymous Coward · · Score: 2, Informative

    DP submits to Project Gutenberg. This is a Gutenberg FAQ.

    1. Michael Hart (gutenberg's leader) is very much in favor of massive replication. My favourite is when disk drive makers start putting the entire gutenberg collection on their drives before selling them (to fill up space/differentiator)

    2. PG has been around 20 years, and never been shut down. Judges actually understand and defend the public domain, within limits that PG understands.

    3. Nothing goes through DP without copyright approval from MHart. And if he makes a mistake, it is likely to be fixed by withdrawing the offending *book*, as far as possible.

  32. Re:A better way - have computers do more work. by Anonymous Coward · · Score: 1, Informative

    Ya got two approaches to preserving old text.

    1. Scan it.
    Pros:
    Automates well. Susceptible to massive implementation.
    Cons:
    Output is bulky/slow to view/not searchable/not editable (by comparison to ascii)

    2. Make it text.
    Pros:
    OCR can (now) really save you time
    Susceptible to massive implementation.
    Small/quick to view/searchable/editable
    Cons:
    Not as automatable. Loses formatting.

    Now you can mix these up. (Add TEI or Docbook tags to the text. Simulate columns with spaces. OCR the images and search on the OCR.)

    Paperofrecord is OCRing the images, which has been a known successful method of allowing adequate searching for a decade or so (the OCR does not even have to be very good by modern standards).

    Microfiche folk have been preserving images for decades now, so the economics and technology are well understood.

    Gutenberg was new (20 years ago) in actually caring about the public domain.

    DP is new since we can now do massive scale 'clickworking', which allows for greater voluntarism.

  33. Re:Scanning without damaging the book? by ChaosDiscord · · Score: 2, Informative
    >...but getting a good image from a flatbed scanner would seriously damage most of these books. ...a digital camera seems even less likely to work.

    Actually given a nice digital camera with a high resolution, you can generate perfectly fine images for OCRing. I've known a few people who have done exactly this to take images of rare books that they have access to but would never be allowed to put on a scanner.

  34. Non-native proofers by Sangui5 · · Score: 3, Informative

    are actually the preferred way to proof text. A project to create "The Collected Works of Edmund Spenser" is headquartered here, and the English-types were looking for people to work on some software for them. The current most accurate way to create an electronic copy is to hire people without even a passing familiarity with the alphabet you are targeting, train them to identify the letters themselves (using the font you're targeting, which may be very much non-standard, esp. for work as old as Spenser's), and have them enter it in character by character. You then have another illiterate person do the same, and have 1 editor (English graduate student) check both copies. Then any differences have to be handled by another editor (English PhD), and the final copy signed off by yet another editor (PhD).

    A very very expensive way to do it.

    See, an illiterate person won't introduce any bias into the text. They will faithfully duplicate any spelling mistakes that they find. In the case of an English scholarly collection, the mistakes are among the most important part, since they can identify different print runs, and how language shifts over time.
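    The comparison step in this double-keying process, where an editor looks only at the places the two typists disagree, can be sketched with Python's standard `difflib`. The function name `disagreements` is made up for illustration, and nothing here is taken from the Spenser project's own software.

```python
from difflib import SequenceMatcher

def disagreements(first, second):
    """List the spans where two independent transcriptions differ.

    Each entry is (position_in_first, text_in_first, text_in_second),
    which is what an editor would check against the printed page.
    """
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (i1, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

    The method relies on the two typists making independent errors: a spot where both typed the same thing is accepted without review, so a mistake they share would slip through, while original spellings they both copied faithfully are preserved.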

    As a side note, the software project is hopeless. The best that can be managed is to automate the administration of their current systems--no OCR will ever meet the level of accuracy that their current system provides.

  35. Maybe not for long -- still good by the grace of R'hllor · · Score: 2, Informative

    It's on Slashdot, so everyone does a few pages, finds out it's actually fairly tedious, and only a few will remain of the initial burst. They're at about 7000 for today right now, which is about 1000 more than what they've done so far this month. Don't build your site based on these estimates.

    Check back there in a few weeks to see how the site is doing. Hopefully quite well, since it is a splendid and worthwhile[1] effort.

    [1]: And only in the preview did I realize I sounded like that woman in the HHGTTG.

  36. Proofing FAQ by Wanker · · Score: 3, Informative
    Stop reading this
    And start reading a page!
    After that come back and you may continue();

    ...but first read the Proofing FAQ on the site and save yourself some confusion:

    http://texts01.archive.org/dp/faq/ProoferFAQ.html

    Especially read section 5 for some of their typesetting-to-ASCII conventions which would be non-obvious otherwise.

  37. Re:Scanning without damaging the book? by Anonymous Coward · · Score: 1, Informative

    Contact Project Gutenberg (http://promo.net/pg). They have use of an orbital scanner, I believe it's called, in San Francisco which can do non-destructive scanning of fragile bindings. (I'm not an AC, I'm just so technologically challenged that I don't see a reason to create an account here. I type in weird-font books on browned paper for PG. OCR distinctly has its limits in the face of the creativity of font designers.)