Distributed Proofreaders Posts 5,000th E-book
bbc writes "Distributed Proofreaders has posted its 5,000th ebook to Project Gutenberg. The book, A Short Biographical Dictionary of English Literature, by John W. Cousin, was proofed for this special occasion by over 500 volunteers.
Distributed Proofreaders is a project that distributes the otherwise gargantuan task of correcting scanning and recognition errors in an OCR'ed text. The project has thousands of volunteers, of whom many hundreds are active on any given day. It is currently the main supplier of etexts for Project Gutenberg."
Rather than setting up a complicated system to make micro-payments that only some people would follow anyway, do what I do: determine a fair value for yourself and make a donation. Not for one book, but estimate a year or two's worth so you don't 'nickel and dime' the value of your donation with transaction fees.
I think it's really a shame that current copyright laws (and retroactive extensions) have limited Project Gutenberg to texts from a little after the turn of the century and before.
I just don't understand the point of retroactive copyright extensions. The idea behind copyrights, like patents, is to encourage innovation by allowing the creator an exclusive right for a limited time. If people believe copyright terms need to be extended to achieve this goal, fine. I disagree, but whatever. However, I think it's ludicrous that terms should be extended on works that have already been created, unless maybe they think that extending terms retroactively will lead to more works being produced in the past?
It's so Disney can keep milking Mickey Mouse.
Here's what I want to see:
You get automatic copyright for 25 years. After that, you must pay $1 per year to keep something in copyright. If you can't be bothered to keep track of your stuff and pay the $1, it lapses into the public domain.
Disney will pay the $1 for Mickey ($1 for Steamboat Willie, $1 for each other cartoon, $1 for each book, etc.). But forgotten gems, like ancient Apple ][ games, will become legal public domain items.
I'd actually like to see a hard limit of 50 years or so for copyright, but even if you can't get that, at least the above scheme makes a lot of stuff lapse into the public domain.
A cool feature: if the legal trail is tangled and murky, and no one knows who owns it anymore, no one will pay the $1 and it will fall into public domain. Let's say LSD Software wrote a fun game for the Commodore 64. Then ABC Games bought the game from LSD (who kept the rights to use the music in future games). Then ABC Games went under, but its assets were bought by PDQ Games, which later split into PDQ Software and Foo Bar Games. After that it gets REALLY complicated... anyway, after all that, who exactly owns that fun game? No one knows. It would take a court case to decide, but no one will bother so no one will ever know. Under the current system, you are technically a pirate if you keep the game, but there is no one you can pay a license fee to in order to have the game legally! Catch-22.
Heck, Disney should want this. They make big bucks by Disney-ifying public domain stuff, so they should make sure things will actually go into the public domain in the future.
At the risk of going over very old and well-trodden ground, if PG wanted to be useful for "scholarly purposes" it should long ago have corrected the original mistake of using plain text, and used a markup that could have kept page numbers and other meta-information for scholars, while giving the common reader a clean text with a suitable style sheet. But even today on the PG website is a "justification" for sticking to plain text making it clear that scholars don't even figure in the intended audience for PG texts.
Offer their services? It's 99% volunteer work. Why would someone volunteer to proofread some magazine? Gutenberg works because the books that it generates are for non-commercial/academic use - that's why volunteers feel they're doing something good when they're contributing.
It seems to me that this project could have a large impact on OCR readers.
Think about it. You have thousands of volunteers poring over images, and then providing the corrected text (if necessary). Couldn't this also be used to "train" the OCR software to become better at identifying text?
If you log the image, the original OCR'd text, and the manually verified text, you could use them as a test case for future OCR software.
I do this all the time when I write data validation/cleanup software. I run my input data through a program, capture the output, and manually verify that it is correct, making changes if necessary. I then use the two pieces of information in my test cases as a benchmark. If I introduce a bug in my code that causes something I already wrote to suddenly break, or output incorrect results, I know about it instantly. Works great with database correction code.
Maybe I'm simplifying this too much, but I sure hope someone is capturing all this great data. It could come in handy.
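To make the benchmark idea above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the record keys (name, image, ocr_text, verified_text), the ocr_engine callable, and the use of difflib's similarity ratio as a rough error measure. It is not how Distributed Proofreaders or any particular OCR package actually stores or scores its data.

```python
import difflib


def char_error_rate(candidate: str, verified: str) -> float:
    """Rough dissimilarity between two texts: 0.0 means identical.

    Uses difflib's similarity ratio as a stand-in for a proper
    character error rate (an assumption, chosen for simplicity).
    """
    matcher = difflib.SequenceMatcher(None, candidate, verified, autojunk=False)
    return 1.0 - matcher.ratio()


def regressions(cases, ocr_engine, tolerance=0.0):
    """Return the names of pages where ocr_engine scores worse than the logged OCR.

    Each case is a dict with hypothetical keys:
      name, image, ocr_text (original OCR output), verified_text (proofread).
    ocr_engine is any callable that maps an image to a string.
    """
    failed = []
    for case in cases:
        baseline = char_error_rate(case["ocr_text"], case["verified_text"])
        current = char_error_rate(ocr_engine(case["image"]), case["verified_text"])
        if current > baseline + tolerance:
            failed.append(case["name"])
    return failed


if __name__ == "__main__":
    # One made-up page record; the "image" is just a placeholder value here.
    pages = [{
        "name": "page-001",
        "image": "page-001.png",
        "ocr_text": "Tbe quick brown fox",
        "verified_text": "The quick brown fox",
    }]
    # A pretend engine that reads the page perfectly, so nothing regresses.
    print(regressions(pages, lambda image: "The quick brown fox"))
```

The point of keeping the original OCR text alongside the verified text is that the old output becomes the baseline: a new engine version only has to be compared against what was already logged, page by page.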
Yes, we do -- although as I mention in an earlier post, we have a year's worth of material as it is, without going back and re-doing the older material already in PG. However, as you say, some of PG's content is below the standards we expect of newly produced text. Hopefully we can go back and correct *all* of PG's content over time. The main factor stopping us is that we need page scans of any project before it can go through DP. If you know of any page images of a clearable edition of Ulysses, or indeed if you have a clearable edition which you are willing to scan, then we would gladly put it through the site.
-- Help Digitise the Public Domain at DP.
Yes, the long term plan is to make the page images we use in proofreading available for end users. There are several logistical problems with this (mainly to do with bandwidth and disk space), but all the images are archived for the time when we can make them available.
It's possible that we might interface with something like the Million Book Project, which makes page images, but no text, available.
-- Help Digitise the Public Domain at DP.