Just One Page a Day
Charles Franks writes "Two years ago I started building an online proofreading system as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP). The concept is simple: we scan books and load the image and OCR output for each page into the online system. Next, proofreaders compare the OCR text to the image, making any corrections as necessary; each page gets looked at twice. Finally the output from the site is massaged into a PG e-text and submitted to PG for posting to the archive. Now, nearly 600 books and a lot of PHP code later, we have snuggled into our new home, which is graciously provided by the Internet Archive and Project Gutenberg. Now that we have 'real' resources available to us (the original site ran on a Pentium 200 over my 128kbps upstream cablemodem) I would like to invite the online community at large to help us put even more books online. To this end I would like to ask everyone to do 'Just One Page a Day'. Thank you, Charles Franks"
The only works that go into PG are works in the public domain. While publishers still sell dead-tree copies, they have no copyright over the original text contained within. (Which is why these works are typically available through multiple publishers.)
XML is like violence. If it doesn't solve the problem, use more.
Project Gutenberg only publishes books that are out of copyright. That means Dickens is okay but you won't find the latest Stephen King.
It helps if you read the FAQ list.
Due to copyright laws, it is only legal to do this with older books (copyrighted 75 or more years ago). As a result, Project Gutenberg consists mostly of the "Classics."
Copyrights aren't perpetual. The Gutenberg project aims to publish books that are no longer, or have never been, under copyright.
And you probably are. The best efforts of our duly elected Congressional representatives notwithstanding, copyright still does expire. After that, a work passes automatically into the public domain. That means there are hundreds of thousands of books available.
In fact, if you've previously seen the classics online, they probably came from this project, which has been around for almost as long as I can remember.
"Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS
The books that are being converted are whatever people feel like contributing.
Don't think your favorite authors are being represented? Can you demonstrate that the work is out of copyright? Make the conversion yourself!
Doing the hard work yourself is the best way to guarantee your interests are represented.
teeker
gocr (http://jocr.sourceforge.net/) is open-source, and includes interesting bits like deskewing.
As a proofreader, I really appreciate the best OCR, and the free guys are not the best.
I've just proofed four pages, a mix of modern English, quoted Cockney and religious babble (Jonah 4:13, 9 etc.)
OK, it's only four pages, but the errors I've corrected so far have all been where the scan was poor and the OCR software had to make a guess.
Though the web page was last updated in July, I find several happy references (and some less happy) to "Clara," a GPL'd OCR program.
Here's the web page: http://www.claraocr.org/index.html
timothy
jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5
Will there be any support for proofing in other languages (French, Spanish, Arabic, etc.)?
DP has had books in Dutch, French, Spanish and German. No Arabic - no one has mentioned being able to do it, for one thing.
Would we be able to post those books if they're not copyrighted in the US but copyrighted in other countries?
Project Gutenberg only worries about the US copyright. If it's not copyrighted in the US, they'll do it.
charlz has a workflow diagram for the works that go through his site. As you see, each book has a project manager, who has final processing/proofing responsibilities.
Also, I'm not sure you get the idea of two rounds of proofing. They don't see different versions of a corrected page -- the first one sees the straight OCR output (or, sometimes the project manager will do some automated corrections on it first) and then the first round proofer edits the text. Then, when all the pages have gone through the first round, the second round proofer reads the text as it was edited by the first round proofer. This helps because it builds off the edits of the first round proofer and allows the second round proofer to perhaps catch things not caught in the first round.
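For anyone who thinks better in code, the two-round flow described above can be sketched roughly as follows. This is only an illustration, not the DP site's actual code (which is PHP): the `Page` class, `run_round`, and the two toy proofer functions are made-up names, and real proofers are people, not string replacements.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    ocr_text: str
    # Text as it stood after each completed proofing round (hypothetical field).
    history: list = field(default_factory=list)

    def current_text(self) -> str:
        # The latest edited text, or the raw OCR output if nobody has proofed it yet.
        return self.history[-1] if self.history else self.ocr_text


def run_round(pages, proofer):
    """Run one proofing round over every page, building on the previous round."""
    for page in pages:
        page.history.append(proofer(page.current_text()))


# Toy stand-ins for human proofers: each one catches a different OCR error.
def first_round(text):
    return text.replace("tbe", "the")


def second_round(text):
    return text.replace("rn0rning", "morning")


pages = [Page("In tbe rn0rning"), Page("tbe end")]
run_round(pages, first_round)   # every page finishes round one...
run_round(pages, second_round)  # ...before round two starts on the edited text
print([p.current_text() for p in pages])
```

The point of keeping `history` is that the second proofer never sees the raw OCR output, only the first proofer's edits, which is what lets round two catch what round one missed.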
When proofreading, you're never going to capture all the mistakes with one pair of eyes. A distributed proofreading effort is very beneficial to the goals and efforts of Project Gutenberg, and I applaud the efforts of all those who have proofed even one page.
Having said that, I've done over 300 (under a different name).
"The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand
Check out the following for a start:
"The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand
I've used both Clara and gOCR. Neither works well enough yet to actually use for scanning books.
Check out Black Mask for a lot of nicely-formatted pubdom e-books, including many from Gutenberg but also some that Gutenberg doesn't have.
Editor Emeritus and Senior Writer, TeleRead.org
I know you're joking, but in reality it doesn't matter how good your spelling is. In fact, I would imagine that any spelling errors found in the text should be reproduced intact, in the interest of accurately representing the original work. This project is about correcting OCR errors, not spelling / grammar.
Although your method of "proofreading" is actually useful for most documents, it is _not_ a good method for Project Gutenberg (as a contributor to DP, I can attest to this).
The works put out by Project Gutenberg are going to be around for decades, if not centuries. 95% accuracy is shit for those purposes. An issue that comes up on the PG mailing list (gutvol-d) every once in a while is whether or not to correct spelling mistakes that appear in the real, dead-tree versions of the books. What if, for example, it's obvious to almost any reader that the author meant the word "by" instead of "bye"? Surprisingly (or not, depending on the way you look at it), the general response is *not* to correct those kinds of "mistakes". The rationale is that PG is -not- an editor, but simply a library (which is actually its legal status).
So, in short, for works with millions of characters that are going to be around for many decades, 95% accuracy isn't good enough. The "bar" might be high, but when proofreading for DP, I strive for 100%.
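To put a number on that, here is a quick back-of-the-envelope calculation. It assumes the percentage means per-character accuracy and uses a one-million-character book as a round figure; both are assumptions for illustration, and `expected_errors` is a made-up helper name.

```python
def expected_errors(num_chars: int, accuracy: float) -> int:
    """Wrong characters left in a text of num_chars at a given per-character accuracy."""
    return round(num_chars * (1.0 - accuracy))


BOOK_CHARS = 1_000_000  # assumed book size, chosen only to make the arithmetic easy

for accuracy in (0.95, 0.99, 0.999):
    print(f"{accuracy:.1%} accurate -> {expected_errors(BOOK_CHARS, accuracy)} wrong characters")
```

At 95%, that is 50,000 wrong characters in a single book under these assumptions, which is why anything short of near-perfect proofing adds up fast.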
Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?
Yes indeed! *Any* decent academic library should have a photocopier which can do this. Older models tend to have a glass platen which extends right to the edge of the photocopier, and the side slopes away at around 60 degrees rather than dropping at a right angle. Newer models, such as the Minolta PS3000, will support the book in a cradle, face up, so that contact with the pages is minimised. They also tend to have a host of features, such as automagically erasing the gutter shadow that one gets with such a system.
Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
When the project was started, SGML variants were not widely used, and the option of including images was a concern for storage space.
Using things like BOLD and L for British pound were workarounds to have a common way of presenting the data. I suspect that it would be trivial to build a formatting filter in Perl, or another language, that would convert BOLD to bold, though it would require a bit more work to recognize that it really should be Bold or even that it should be BOLD.
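A filter along those lines might look something like the sketch below (in Python rather than Perl). It assumes that emphasis in the e-text is marked simply by a word in all capitals and that the target is HTML-style `<b>` tags; neither is stated in the PG conventions quoted here, and `caps_to_bold` is a made-up name.

```python
import re

# Assumption: any standalone word of two or more capital letters marks emphasis.
CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")


def caps_to_bold(line: str) -> str:
    """Rewrite all-caps words as lowercase text wrapped in <b> tags."""
    return CAPS_WORD.sub(lambda m: f"<b>{m.group(0).lower()}</b>", line)


print(caps_to_bold("He was VERY angry."))
```

The naive version shows exactly where the "bit more work" comes in: it lowercases everything, so it cannot tell whether the printed book had "bold", "Bold" or a genuine "BOLD", and it would also mangle acronyms and Roman numerals.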
Converting monetary symbols would require a bit more work, but would also not be impossible.
Re-inserting any diagrams, figures, illustrations or other graphics would require more work. If the original scanned pages are still available, as this part of the project suggests, even that would not be impossible.
One variation is the free bookmobile project that is out there. They use scans of the original book to build a new book for kids. Preparation for printing involves downloading the book over the internet, via a DSL-speed satellite link. I am not sure, however, whether the working material is suitable for e-book reading.
-Rusty
You never know...
From actually proofing a few pages, this depends entirely on the particular project and when it was started. Some of the newer ones allow special characters.
Ceci n'est pas un post
It looks like the texts01.archive.org/dp site is holding up fairly well! If you cannot get through today, though, please check back later. Slashdot effect aside, it's usually quite speedy and has a decent 'net connection. If you want to keep informed of current events, get on one of our mailing lists via (when it's not slashdotted) our subscriptions page.
Dr. Gregory B. Newby // 919-962-8064
Chief Executive and Director
Project Gutenberg Literary Archive Foundation http://gutenberg.net
A 501(c)(3) not-for-profit organization with EIN 64-6221541
gbnewby@ils.unc.edu
...Not many, but there are some Project Gutenberg books that are copyrighted and distributed with the author's permission.
Also, Project Gutenberg of Australia publishes a number of works that are out of copyright in Australia, but still under copyright in the U.S. It is a copyright infringement for readers in the U.S. to download these works, which include, among others, Hervey Allen's _Anthony Adverse_ (1933), F. Scott Fitzgerald's _The Great Gatsby_ (1925), Khalil Gibran's _The Prophet_ (1923), D. H. Lawrence's _Lady Chatterley's Lover_ (1928), all of George Orwell's novels, most of Virginia Woolf's, etc.
Not exactly "the latest Stephen King" but a lot newer than Dickens.
"How to Do Nothing," kids activities, back in print!
Actually, they're working on that.
The program is Gutcheck, developed by PG's Jim Tinsley.
Catches a lot!
Illiterate people are actually the preferred way to proof text. A project to create "The Collected Works of Edmund Spenser" is headquartered here, and the English-types were looking for people to work on some software for them. The current most accurate way to create an electronic copy is to hire people without even a passing familiarity with the alphabet you are targeting, train them to identify the letters themselves (using the font you're targeting, which may be very much non-standard, esp. for work as old as Spenser's), and have them enter it in character by character. You then have another illiterate person do the same, and have one editor (English graduate student) check both copies. Then any differences have to be handled by another editor (English PhD), and the final copy signed off by yet another editor (PhD).
A very very expensive way to do it.
See, an illiterate person won't introduce any bias into the text. They will faithfully duplicate any spelling mistakes that they find. In the case of an English scholarly collection, the mistakes are among the most important parts, since they can identify different print runs, and how language shifts over time.
As a side note, the software project is hopeless. The best that can be managed is to automate the administration of their current systems--no OCR will ever meet the level of accuracy that their current system provides.
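The mechanical half of that process, finding where two independently keyed copies disagree so an editor can rule on them, is easy to sketch. This is just an illustration of the comparison step, not the Spenser project's software; `disagreements` is a made-up name, the sample lines are invented, and it assumes both typists keep the same line breaks as the printed page.

```python
from itertools import zip_longest


def disagreements(copy_a: str, copy_b: str):
    """Return (line_number, a_line, b_line) for each line where the two copies differ."""
    a_lines = copy_a.splitlines()
    b_lines = copy_b.splitlines()
    return [
        (n, a, b)
        for n, (a, b) in enumerate(zip_longest(a_lines, b_lines, fillvalue=""), start=1)
        if a != b
    ]


# Invented sample text: the second typist "fixed" an old spelling without meaning to.
typist_1 = "a gentle knight\nin mightie armes"
typist_2 = "a gentle knight\nin mightie arms"

for n, a, b in disagreements(typist_1, typist_2):
    print(f"line {n}: {a!r} vs {b!r}")
```

Lines where both copies agree are taken as correct; only the flagged ones go to the editor, which is where the expensive human time gets spent.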
...but first read the Proofing FAQ on the site and save yourself some confusion:
http://texts01.archive.org/dp/faq/ProoferFAQ.html
Especially read section 5 for some of their typesetting-to-ASCII conventions which would be non-obvious otherwise.