Slashdot Mirror


Distributed Proofreaders Posts 5,000th E-book

bbc writes "Distributed Proofreaders has posted its 5,000th ebook to Project Gutenberg. The book, A Short Biographical Dictionary of English Literature by John W. Cousin, was proofed for this special occasion by over 500 volunteers. Distributed Proofreaders is a project that distributes the otherwise gargantuan task of correcting scanning and recognition errors in an OCR'ed text. The project has thousands of volunteers, of whom many hundreds are active on any given day. It is currently the main supplier of etexts for Project Gutenberg."

6 of 144 comments

  1. Wonderful by Chasuk · Score: 4, Informative

    As I get older, reading texts on-screen gets easier. My vision is still 20/20, but I now require reading glasses, which are generally out of reach when I need them. Project Gutenberg has come in as a real lifesaver (well, sanity-saver) now that I'm turning into a geezer. That, and the price is perfect!

  2. Rsync your own Gutenberg library by gtoomey · Score: 4, Informative
    You can rsync your own copy of the Gutenberg library. I used the Aarnet mirror as it's the closest to me and fast.

    Just be aware that the Gutenberg collection is some 135GB, and much of it is GIF, JPG and MP3 files (spoken-word books). So I just used --include in rsync to download only the .txt, .htm and .html files. It's a more manageable 10GB download.
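
    For anyone wanting to try the same thing, here is a rough sketch in Python that assembles such an rsync command. The mirror URL, the destination directory and the function name build_rsync_command are placeholders, not taken from the post; substitute the rsync address of your nearest mirror. The filter order is the usual rsync idiom: include every directory so rsync can descend, include the wanted extensions, then exclude everything else.

```python
import shlex
from typing import List, Sequence

# Hypothetical mirror address and destination; replace with a real mirror.
MIRROR = "rsync://mirror.example.org/gutenberg/"
DEST = "./gutenberg/"


def build_rsync_command(
    source: str,
    dest: str,
    extensions: Sequence[str] = (".txt", ".htm", ".html"),
) -> List[str]:
    """Build an rsync argument list that copies only the given file types."""
    cmd = [
        "rsync",
        "-av",                 # archive mode, verbose output
        "--prune-empty-dirs",  # skip directories that end up with no files
        "--include=*/",        # let rsync descend into every directory
    ]
    # One include rule per wanted extension, e.g. --include=*.txt
    cmd += [f"--include=*{ext}" for ext in extensions]
    # Everything not matched by an include rule above is skipped.
    cmd += ["--exclude=*", source, dest]
    return cmd


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    print(shlex.join(build_rsync_command(MIRROR, DEST)))
```

    Adding --dry-run to the printed command before running it for real is a cheap way to check what would be transferred.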

  3. Re:law of averages? by jonathan_ingram · Score: 5, Informative
    However, I am curious as to just how accurate the proofreading is.

    The answer is: surprisingly accurate. We proof one page at a time, working from the original scanned images, and emphasise that people should try as hard as they can to stick to the source material. As counter-intuitive as it may appear, this type of proofreading is actually hardest to do with material from the late 18th/19th century -- subtle changes in spelling (and small changes in accent systems for the non-English languages) make errors much harder for human proofreaders to spot than in earlier material, where spelling consistency was completely optional!

    Each page is OCRed (and the abilities of modern OCR programs are a major improvement over those of even a couple of years ago), proofread twice, and then the whole document is reviewed twice before being posted. We've also recently become much more aware of the need to make texts which can be used for scholarly purposes in the future, leading to such improvements as retention of all page numbers.

  4. Re:good books? by jonathan_ingram · Score: 4, Informative

    There are many sites which have taken some of the more popular works from Project Gutenberg and put a more user-friendly, directory-style front end on them. One of the best is Blackmask.com, which also contains works from non-Gutenberg free book providers. There are 312 works in the 'Science Fiction' section alone.

  5. Re:Hm! by jonathan_ingram · Score: 5, Informative
    It's an interesting idea, but at the moment we're concentrating on providing proofreading services for Project Gutenberg. Every book which goes through the site has been scanned by one of our unpaid volunteers (except for those which have been, to use a slightly emotive term, 'raided' from sites that provide page images) -- and we already have enough books in our queue to keep us going for a year, even if we all stopped scanning immediately!

    Also, we are very comfortable with being a provider of *public domain* material, and I think many members wouldn't feel comfortable moving into the copy-restricted domain.

  6. Re:How strange by jonathan_ingram · Score: 4, Informative

    I'll let you in on a secret -- this isn't really our 5000th book! Some larger works are split into multiple projects, so while this is our 5000th *project*, it's around 10% off being our 5000th *book*. The text we chose for *this* 5000 was supposed to be appropriate for an internal celebration, rather than one which would be announced to the world -- it's a great example of the sort of text which would be very unlikely to get into PG if DP didn't exist, and it gives us useful biographical information to use in the 'blurb' for future projects. It's hard to stop people from submitting stories to Slashdot, though :).