Distributed Proofreaders Posts 5,000th E-book

Posted by timothy on Tuesday August 24, 2004 @06:41PM from the error-checking-and-correcting dept.

bbc writes "Distributed Proofreaders has posted its 5,000th ebook to Project Gutenberg. The book, a Short Biographical Dictionary of English Literature, by John W. Cousin, was proofed for this special occasion by over 500 volunteers. Distributed Proofreaders is a project that distributes the otherwise gargantuan task of correcting scanning and recognition errors in an OCR'ed text. The project has thousands of volunteers, of which many hundreds are active on any given day. It is currently the main supplier of etexts for Project Gutenberg."

Wonderful by Chasuk · 2004-08-24 18:45 · Score: 4, Informative

As I get older, reading texts on-screen gets easier. My vision is still 20/20, but I now require reading glasses, which are generally out of reach when I need them. Project Gutenberg has come in as a real lifesaver (well, sanity-saver) now that I'm turning into a geezer. That, and the price is perfect!

--
Neopets - the best free game on the Int
1. Re:Wonderful by Anonymous Coward · 2004-08-25 01:24 · Score: 3, Informative
  
  The price may be right, but donating is good too.
5052 by squidinkcalligraphy · 2004-08-24 18:46 · Score: 1, Informative

And since then they seem to have proofed another 52 books - that's not a bad rate considering...

--
"I think it would be a good idea" Gandhi, on Western Civilisation
Re:good books? by mrchaotica · 2004-08-24 19:28 · Score: 2, Informative

Sherlock Holmes mysteries, old sci-fi (Jules Verne, H.G. Wells, etc), Edgar Allan Poe's short stories... there's lots of good stuff.

--
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Rsync your own Gutenberg library by gtoomey · 2004-08-24 19:47 · Score: 4, Informative

You can rsync your own copy of the Gutenberg library. I used the Aarnet mirror as it's closest to me and fast.
Just be aware that the Gutenberg library is some 135GB, and much of it is gif, jpg and mp3 (spoken word books). So I just used --include in rsync to download the .txt, .htm and .html files. It's a more manageable 10GB download.
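
A minimal sketch of that kind of filter, written as a Python helper that only builds the rsync argument list. The mirror address and destination directory are placeholders, not the real Aarnet path, and the exact options used in the comment above are not known. rsync applies the first filter rule that matches, so directories are included first and everything else is excluded last.

```python
def build_rsync_args(source, dest, extensions=("txt", "htm", "html")):
    """Build an rsync command that copies only files with the given extensions."""
    # "--include=*/" lets rsync descend into every directory.
    args = ["rsync", "-av", "--include=*/"]
    # One include rule per wanted file extension.
    args += [f"--include=*.{ext}" for ext in extensions]
    # Anything not matched by an earlier rule is skipped.
    args.append("--exclude=*")
    args += [source, dest]
    return args

# Hypothetical mirror module and target directory, for illustration only.
cmd = build_rsync_args("mirror.example.org::gutenberg/", "./gutenberg/")
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd) avoids shell quoting problems with the * patterns; if the command is typed into a shell instead, the patterns should be quoted.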
1. Re:Rsync your own Gutenberg library by Black+Acid · 2004-08-24 21:26 · Score: 2, Informative
  
  I use --exclude \*.zip --exclude \*.iso --exclude \*.mp3 with wget to achieve similar results. The advantage of this is you get all the images and indexes, without wasting space on computer synthesized spoken books (yech), zipped files which you already downloaded the contents of, and 4.7GB/700MB DVD or CD ISOs. On the other hand, the Project Gutenberg CD and DVD Project is worth looking into for "best of" collections if you don't want the whole library.
  
  --
  Tired of free ipod spam sigs? Opt ou
Re:law of averages? by jonathan_ingram · 2004-08-24 19:53 · Score: 5, Informative

However, I am curious as to just how accurate the proofreading is.

The answer is: surprisingly accurate. We proof one page at a time, working from the original scanned images, and emphasise that people should try as hard as they can to stick to the source material. As counter-intuitive as it may appear, this type of proofreading is actually hardest to do with material from the late 18th/19th century -- subtle changes in spelling (and small changes in accent systems for the non-English languages) make errors much harder for human proofreaders to correct than the earlier material, where spelling consistency was completely optional!
Each page is OCRed (and the ability of modern OCR programs is a major improvement over those of even a couple of years ago), proofread twice, and then the whole document is reviewed twice before being posted. We've also recently become much more aware of the need to make useful texts which can be used for scholarly purposes in the future, leading to such improvements as retention of all page numbers.

--
-- Help Digitise the Public Domain at DP.
Re:good books? by jonathan_ingram · 2004-08-24 19:58 · Score: 4, Informative

There are many sites which have taken some of the more popular works from Project Gutenberg, and put a more user-friendly directory style front end to them. One of the best is Blackmask.com, which also contains works from non-Gutenberg free book providers. There are 312 works in the 'Science Fiction' section alone.

--
-- Help Digitise the Public Domain at DP.
Re:Hm! by jonathan_ingram · 2004-08-24 20:02 · Score: 5, Informative

It's an interesting idea, but at the moment we're concentrating on providing proofreading services for Project Gutenberg. Every book which goes through the site has been scanned by one of our unpaid volunteers (except for those which have been, to use a slightly emotive term, 'raided' from sites that provide page images) -- and we already have enough books in our queue to keep us going for a year, even if we all stopped scanning immediately!
Also, we are very comfortable with being a provider of *public domain* material, and I think many members wouldn't feel comfortable moving into the copy-restricted domain.

--
-- Help Digitise the Public Domain at DP.
Re:law of averages? by jonathan_ingram · 2004-08-24 20:39 · Score: 3, Informative

DP is 'semi attached' to PG -- I think you'll find that we are much more concerned both with keeping page and edition information, and with marking such information up in an appropriate way, than some of the traditionalists inside PG are.

For example, many of us make sure that we produce a valid XHTML edition of each project, and that the page numbers and edition information of the source are preserved. For an example text, see Graham Wallas -- Human Nature In Politics. We are currently working on a markup and stylesheet which will improve the end-user experience in several ways (and then, sigh, we will have to go back and move all the books we've already done to this new system. This may take a while :) ).

--
-- Help Digitise the Public Domain at DP.
Re:How strange by jonathan_ingram · 2004-08-24 20:48 · Score: 4, Informative

I'll let you in on a secret -- this isn't really our 5000th book! Some larger works are split into multiple projects, so while this is our 5000th *project*, it's around 10% off being our 5000th *book*. The text we chose for *this* 5000 was supposed to be appropriate for an internal celebration, rather than one which would be announced to the world -- it's a great example of the sort of text which would be very unlikely to get into PG if DP didn't exist, and it gives us useful biographical information to use in the 'blurb' for future projects. It's hard to stop people from submitting stories to Slashdot, though :).

--
-- Help Digitise the Public Domain at DP.
Re:Who picks this stuff? by jonathan_ingram · 2004-08-24 21:24 · Score: 3, Informative

If they were published before 1923, then they're public domain, and we'd love to have them in PG! All you need is a scanner, and some spare time :).

Until the middle of last year, we focused almost exclusively on books. Since then, we've been putting some very interesting periodicals through the site (Punch, The Strand Magazine, Scientific American, Notes & Queries, to name but a few). Magazines aimed specifically at boys (or, indeed, girls) would be a great addition to the pile!

--
-- Help Digitise the Public Domain at DP.
Re:law of averages? by bbc · 2004-08-25 00:38 · Score: 2, Informative

"However, I am curious as to just how accurate the proofreading is."

That's very hard to tell, as there is no gold standard for accuracy. We have two sometimes conflicting goals with regard to accuracy: one is to preserve the author's intent, the other to preserve the actual printed text. These conflict, for instance, when we would like to normalize spelling to increase readability.

There is currently some talk going on at the DP forums as to which system would be best to eliminate the common errors that everybody tends to overlook.

We already have several systems in place to help us with these. For instance, we use a specially modified font that helps to highlight differences between letters. It's dog ugly, but that's intentional; because it grates, you see errors much more quickly.

Also, once common errors are identified as such, we write software that can help us find such errors.

Finally, we use these new-found methods to look at books we posted to Project Gutenberg in the past, to measure the increase in accuracy.
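
Purely as an illustration of the error-finding software mentioned above (this is not DP's actual tooling), here is a small Python sketch that flags words often produced by OCR misreads. The suspect table is an assumed example list, not one taken from the project.

```python
import re

# Assumed example table: OCR misread -> word that was probably intended.
SUSPECTS = {"tbe": "the", "arid": "and", "modem": "modern"}

def find_suspects(text, suspects=SUSPECTS):
    """Return (line_number, word, suggestion) for each suspect word found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Look at each alphabetic word on the line, ignoring case.
        for word in re.findall(r"[A-Za-z]+", line):
            suggestion = suspects.get(word.lower())
            if suggestion:
                hits.append((lineno, word, suggestion))
    return hits

sample = "He walked to tbe door\narid opened it."
for hit in find_suspects(sample):
    print(hit)
```

Because "arid" and "modem" are real words, a checker like this can only flag candidates for a human to compare against the page image; it should not correct anything automatically.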
Re:Hm! by jonathan_ingram · 2004-08-25 00:48 · Score: 3, Informative

Yes, Australia is currently 'Life+50', which means that a work becomes copyright free 50 years after the death of the author (sadly, this will be changing to 'Life+70' soon). I live in the EU, which is 'Life+70'. There's a significant amount of material which is copyright free in the EU and Australia, but still copy restricted in the USA -- basically, anything published after 1922 by an author who died before 1934. We recently started a 'DPEU' to focus on these works. At the moment the focus is on Eastern European languages, but there's a wide variety of content (including some English material).
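
The arithmetic behind "published after 1922 by an author who died before 1934" can be sketched in Python. This is a deliberate simplification that uses only the rules quoted in this comment (a 'Life+N' term and the pre-1923 US cutoff as of 2004); the function names are made up for illustration, and real copyright law has many exceptions that are ignored here.

```python
def out_of_copyright_life_plus(death_year, term, current_year):
    """Life+N rule: protection runs through the end of death_year + term."""
    return death_year + term < current_year

def us_public_domain_2004(publication_year):
    """The pre-1923 publication cutoff quoted in this thread."""
    return publication_year < 1923

def dpeu_candidate(publication_year, death_year, current_year=2004):
    """Free under Life+70 but still copy restricted under the US cutoff."""
    free_in_eu = out_of_copyright_life_plus(death_year, 70, current_year)
    return free_in_eu and not us_public_domain_2004(publication_year)

print(dpeu_candidate(1925, 1930))  # True: author died 1930, published 1925
print(dpeu_candidate(1925, 1940))  # False: still protected under Life+70
print(dpeu_candidate(1910, 1930))  # False: already public domain in the USA
```

With current_year=2004, an author who died in 1933 passes the Life+70 test and one who died in 1934 does not, which matches the "died before 1934" boundary above.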

--
-- Help Digitise the Public Domain at DP.
Re:formatting by ragnar · 2004-08-25 00:53 · Score: 2, Informative

The problem you raise is not so easy to solve. While it sounds nice to separate content from presentation, in many cases the presentation is part of the content. Take the indentation of poetry for example, or for a more specific example, e. e. cummings. Once you wade into these areas you start talking about marking the text, which is a tricky issue. The Text Encoding Initiative has been hammering out a solution for a decade, but the learning curve is steep.

As much as I think the project is digging itself into a hole with hard formatting, I can understand why they do it. The alternative is a nasty can of worms.

--
-- Solaris Central - http://w
Re:What books to read by bbc · 2004-08-25 00:53 · Score: 2, Informative

There are several websites that offer free ebooks, and that allow people to review them.

Of the authors I got to know through Project Gutenberg, Stephen Leacock and Theodor Storm stick out in my mind the most. Oh, and Hendrik Conscience turned out to be less boring than I thought after proofing the first of his books to go through DP (but so far he's only available in Dutch).
Public apology by bbc · 2004-08-25 01:04 · Score: 3, Informative

I would like to apologize to TPTB (The Powers That Be) at Distributed Proofreaders for messing up by posting this story to Slashdot.

The 5000th Posted celebrations were supposed to be internal. There is a discrepancy between works posted and books posted: sometimes a book gets split up. The big celebrations were intended for 5000 actual books posted.

I am afraid I got a little carried away, and hope Slashdot will still carry the real story of 5000 books posted to Project Gutenberg.
Re:What ever happened to Project Gutenberg 2? by jhutch2000 · 2004-08-25 02:50 · Score: 2, Informative

There was a lot of internal contention about that "pay" site using the Gutenberg trademark. The furor has largely died down, and as I understand it, the World E-book library thing has for the most part given up use of the Gutenberg trademark, and some checks and balances have been put in place to prevent the kind of unilateral decision that led to that controversy.
Re:Request for MATH experts by jhutch2000 · 2004-08-25 06:17 · Score: 2, Informative

Only one MATH book is ever in the first round at any one time. Hilbert's book is that one right now.
The logic behind this is simple. Most of our volunteers avoid these books like the plague and if we kept releasing new ones, pretty soon the entire first round would be only MATH books.
To see what's waiting in the queue for English language math books, see here. For Languages Other Than English (LOTE) math books, see here.
Re:Once again by jonathan_ingram · 2004-08-25 17:56 · Score: 2, Informative

(this story is off the front page now, so I doubt you will be looking for an answer, but I'll answer you anyway :) )

I have to question why humans are doing the bulk of the editing for project gutenberg.

There are several reasons. Firstly, there are lots of people around who can spare five minutes to proofread a page -- particularly when it has already been OCRed. Secondly, we are a completely volunteer organisation, with no 'plan' as to the books we scan, and so having to find and scan two separate copies of a text would reduce the amount of material on the site considerably. In particular, it would almost certainly stop us from proofing some of the older and/or harder material.

I can only suggest that you join DP, and test the process out. I think you'll find that it works surprisingly well.

--
-- Help Digitise the Public Domain at DP.