Distributed Proofreaders Posts 5,000th E-book

Posted by timothy on Tuesday August 24, 2004 @06:41PM from the error-checking-and-correcting dept.

bbc writes "Distributed Proofreaders has posted its 5,000th ebook to Project Gutenberg. The book, A Short Biographical Dictionary of English Literature, by John W. Cousin, was proofed for this special occasion by over 500 volunteers. Distributed Proofreaders is a project that distributes the otherwise gargantuan task of correcting scanning and recognition errors in an OCR'ed text. The project has thousands of volunteers, of whom many hundreds are active on any given day. It is currently the main supplier of etexts for Project Gutenberg."

10 of 144 comments

Re:500 people read it? by wolfdvh · 2004-08-24 19:09 · Score: 5, Insightful

I like Gutenberg, I hope they start a system where you can download copyrighted books for a micropayment, I would pay good money for text ebooks.
Rather than setting up a complicated system for micropayments that only some people would use anyway, do what I do: determine a fair value for yourself and make a donation. Not for one book, but estimate a year or two's worth so you don't 'nickel and dime' the value of your donation with transaction fees.
A shame by iamdrscience · 2004-08-24 19:10 · Score: 4, Insightful

I think it's really a shame that current copyright laws (and retroactive extensions) have limited Project Gutenberg to texts from a little after the turn of the century and before.

I just don't understand the point of retroactive copyright extensions. The idea behind copyrights, like patents, is to encourage innovation by allowing the creator an exclusive right for a limited time. If people believe copyright terms need to be extended to achieve this goal, fine. I disagree, but whatever. However, I think it's ludicrous that terms should be extended on works that have already been created, unless maybe they think that extending terms retroactively will lead to more works being produced in the past?
1. Re:A shame by MikeCapone · 2004-08-24 19:17 · Score: 5, Insightful
  
  I just don't understand the point of retroactive copyright extensions. The idea behind copyrights, like patents, is to encourage innovation by allowing the creator an exclusive right for a limited time. If people believe copyright terms need to be extended to achieve this goal, fine. I disagree, but whatever. However, I think it's ludicrous that terms should be extended on works that have already been created, unless maybe they think that extending terms retroactively will lead to more works being produced in the past?
  
  There's nothing to understand. Everything's about money now. Nobody cares about books, art or people. If you can make money - especially on the work of authors usually living near poverty - long after they are dead, then you are the winner of this big capitalistic orgy!
  
  --
  Treehugger? Treehugger... Treehugger!
2. Re:A shame by 16K+Ram+Pack · 2004-08-24 21:58 · Score: 2, Insightful
  
  The change has of course happened because of the industrialisation of reproduction. At one time, if you wanted to hear music, you went to a show or bought the sheet music. Performance was expensive and did not scale up.
  You could fill a music hall with people and pay the performers. You want to open another music hall? You need another set of performers.
  Recorded music meant that each copy scaled the initial costs down. Over time, this has become even more exaggerated, though. At one time, record production and promotion were quite amateurish, which meant that records cost very little to make but quite a lot in terms of pressing and sleeve production.
  Now, the situation is that CDs cost very little to record and manufacture, but the music costs a huge amount to produce in terms of promotion, PR, grooming, etc. The cost of CD number 1 is huge, but by the time you reach CD number 2 million, it costs very little.
  This means that the people involved are not small entrepreneurs of the Fred Karno kind, but major corporations with everything that comes with that.
Make them renew each year by Anonymous Coward · 2004-08-24 19:57 · Score: 4, Insightful

It's so Disney can keep milking Mickey Mouse.

Here's what I want to see:

You get automatic copyright for 25 years. After that, you must pay $1 per year to keep something in copyright. If you can't be bothered to keep track of your stuff and pay the $1, it lapses into the public domain.

Disney will pay the $1 for Mickey ($1 for Steamboat Willie, $1 for each other cartoon, $1 for each book, etc.). But forgotten gems, like ancient Apple ][ games, will become legal public domain items.

I'd actually like to see a hard limit of 50 years or so for copyright, but even if you can't get that, at least the above scheme makes a lot of stuff lapse into the public domain.

A cool feature: if the legal trail is tangled and murky, and no one knows who owns it anymore, no one will pay the $1 and it will fall into the public domain. Let's say LSD Software wrote a fun game for the Commodore 64. Then ABC Games bought the game from LSD (who kept the rights to use the music in future games). Then ABC Games went under, but its assets were bought by PDQ Games, which later split into PDQ Software and Foo Bar Games. After that it gets REALLY complicated... anyway, after all that, who exactly owns that fun game? No one knows. It would take a court case to decide, but no one will bother so no one will ever know. Under the current system, you are technically a pirate if you keep the game, but there is no one you can pay a license fee to in order to have the game legally! Catch-22.

Heck, Disney should want this. They make big bucks by Disney-ifying public domain stuff, so they should make sure things will actually go into the public domain in the future.
Re:law of averages? by littlem · 2004-08-24 20:26 · Score: 5, Insightful

We've also recently become much more aware of the need to make useful texts which can be used for scholarly purposes in the future, leading to such improvements as retention of all page numbers.

At the risk of going over very old and well-trodden ground, if PG wanted to be useful for "scholarly purposes" it should long ago have corrected the original mistake of using plain text, and used a markup that could have kept page numbers and other meta-information for scholars, while giving the common reader a clean text with a suitable style sheet. But even today on the PG website is a "justification" for sticking to plain text making it clear that scholars don't even figure in the intended audience for PG texts.
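
To make the markup point concrete, here is a minimal Python sketch of one source text feeding two views. The `[[page N]]` marker syntax and the `split_views` name are made up for illustration; this is not a format PG or DP uses, just an assumption to show that page numbers can be kept for scholars and stripped for everyone else.

```python
import re

# Hypothetical page-marker syntax, invented for this example.
PAGE_MARK = re.compile(r"\[\[page (\d+)\]\]")

def split_views(marked_up):
    """Return (clean_text, page_offsets) from text containing page markers.

    clean_text is what a casual reader sees; page_offsets maps each page
    number to the character offset in clean_text where that page begins.
    """
    clean_parts = []
    page_offsets = {}
    pos = 0       # read position in the marked-up text
    out_len = 0   # length of the clean text built so far
    for match in PAGE_MARK.finditer(marked_up):
        chunk = marked_up[pos:match.start()]
        clean_parts.append(chunk)
        out_len += len(chunk)
        page_offsets[int(match.group(1))] = out_len
        pos = match.end()
    clean_parts.append(marked_up[pos:])
    return "".join(clean_parts), page_offsets

source = "[[page 1]]First page text. [[page 2]]Second page text."
clean, pages = split_views(source)
print(clean)   # the reader's view, markers removed
print(pages)   # the scholar's view: where each page starts
```

The point is only that the plain text can always be derived from the marked-up source, while the reverse is not possible once the page information has been thrown away.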
Re:Hm! by Anonymous Coward · 2004-08-24 20:29 · Score: 1, Insightful

Offer their services? It's 99% volunteer work. Why would someone volunteer to proofread some magazine? Gutenberg works because the books that it generates are for non-commercial/academic use - that's why volunteers feel they're doing something good when they're contributing.
Helping improve OCR software? by Anonymous Coward · 2004-08-24 21:35 · Score: 3, Insightful

It seems to me that this project could have a large impact on OCR readers.

Think about it. You have thousands of volunteers poring over images, and then providing the corrected text (if necessary). Couldn't this also be used to "train" the OCR software to become better at identifying text?

If you log the image, the original OCR'd text, and the manually verified text you could use it in a test case for future OCR software.

I do this all the time when I write data validation/cleanup software. I run my input data through a program, capture the output, and manually verify that it is correct, making changes if necessary. I then use the two pieces of information in my test cases as a benchmark. If I introduce a bug in my code that causes something I already wrote to suddenly break or output incorrect results, I know about it instantly. Works great with database correction code.
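
A rough Python sketch of that workflow, assuming each logged case is just a (name, raw OCR text, manually verified text) triple. The cleanup rules, the function names and the sample data are all hypothetical, and `difflib` stands in for whatever accuracy measure a real OCR project would use.

```python
import difflib

def ocr_accuracy(ocr_text, verified_text):
    """Similarity ratio in [0, 1] between raw OCR output and proofread text."""
    matcher = difflib.SequenceMatcher(None, ocr_text, verified_text, autojunk=False)
    return matcher.ratio()

def fix_common_ocr_errors(text):
    """Hypothetical cleanup pass: swap a few words OCR often gets wrong."""
    replacements = {"tbe": "the", "arid": "and"}
    return " ".join(replacements.get(word, word) for word in text.split(" "))

def run_regression(cleanup, cases):
    """Return (name, actual, expected) for every case the cleanup gets wrong."""
    failures = []
    for name, raw, verified in cases:
        actual = cleanup(raw)
        if actual != verified:
            failures.append((name, actual, verified))
    return failures

# Each case pairs captured OCR output with its manually verified text.
CASES = [
    ("page_001", "tbe cat arid tbe dog", "the cat and the dog"),
    ("page_002", "an arid plain", "an arid plain"),
]

for name, actual, expected in run_regression(fix_common_ocr_errors, CASES):
    print(f"{name}: got {actual!r}, expected {expected!r}")
```

Here the second case fails, because the "arid" rule mangles a page where "arid" was the right word all along. That is exactly the kind of breakage the verified text is there to catch.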

Maybe I'm simplifying this too much, but I sure hope someone is capturing all this great data. It could come in handy.
Re:good books? by jonathan_ingram · 2004-08-25 00:42 · Score: 3, Insightful

Does DP take on new versions of existing PG books?

Yes, we do -- although as I mention in an earlier post, we have a year's worth of material as it is, without going back and re-doing the older material already in PG. However, as you say, some of PG's content is below the standards we expect of newly produced text. Hopefully we can go back and correct *all* of PG's content over time. The main factor stopping us is that we need page scans of any project before it can go through DP. If you know of any page images of a clearable edition of Ulysses, or indeed if you have a clearable edition which you are willing to scan, then we would gladly put it through the site.

--
-- Help Digitise the Public Domain at DP.
Re:Any chance images could be made available? by jonathan_ingram · 2004-08-25 02:04 · Score: 3, Insightful

Yes, the long term plan is to make the page images we use in proofreading available for end users. There are several logistical problems with this (mainly to do with bandwidth and disk space), but all the images are archived for the time when we can make them available.

It's possible that we might interface with something like the Million Book Project, which makes page images, but no text, available.

--
-- Help Digitise the Public Domain at DP.