Just One Page a Day
Charles Franks writes "Two years ago I started building an online proofreading system as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP). The concept is simple: we scan books and load the image and OCR output for each page into the online system. Next, proofreaders compare the OCR text to the image, making any corrections as necessary; each page gets looked at twice. Finally, the output from the site is massaged into a PG e-text and submitted to PG for posting to the archive. Now, nearly 600 books and a lot of PHP code later, we have snuggled into our new home, which is graciously provided by the Internet Archive and Project Gutenberg. Now that we have 'real' resources available to us (the original site ran on a Pentium 200 over my 128kbps upstream cablemodem) I would like to invite the online community at large to help us put even more books online. To this end I would like to ask everyone to do 'Just One Page a Day'. Thank you, Charles Franks"
Is there any worthwhile open source OCR software? How about reasonably priced closed source OCR software for *BSD or Linux?
Have each client do the OCR (if you can find GPL software). Or maybe there's a company willing to donate it. That way you could farm out most of the processing too.
While publishers still sell dead-tree copies, they have no copyright over the original text contained within.
What? You mean to suggest that you have an actual example of a publisher making money without tyranny over the content?
Gasp!
Very good idea.
Will there be any support for proofing in other languages (French, Spanish, Arabic, etc.)?
What about books published in other countries? Would we be able to post those books if they're not copyrighted in the US but are copyrighted in other countries, or vice versa?
What if they kept track of every time the human reader finds an OCR error? Couldn't you then build a profile of what words/phrases/letters the OCR software has the most problems with?
Then, couldn't you selectively have the humans review only the sections of a book most likely to contain errors, instead of every single word of every single page?
What do you think?
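The error-profile idea above could be sketched roughly as follows. This is a minimal illustration, not how Distributed Proofreaders works: the function names `confusion_counts` and `suspicion_score` and the sample strings are made up for the example, and it assumes the site keeps both the raw OCR text and the proofread text for each page.

```python
from collections import Counter
from difflib import SequenceMatcher

def confusion_counts(pairs):
    """Count how often each OCR fragment was changed by a proofreader.

    `pairs` is an iterable of (ocr_text, corrected_text) tuples (hypothetical input).
    """
    counts = Counter()
    for ocr_text, corrected in pairs:
        matcher = SequenceMatcher(None, ocr_text, corrected, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # Key is (what the OCR produced, what the human changed it to).
                counts[(ocr_text[i1:i2], corrected[j1:j2])] += 1
    return counts

def suspicion_score(line, counts):
    """Sum the past error counts of every known-bad fragment found in a line."""
    # Skip empty fragments (pure insertions), which would match everywhere.
    return sum(n * line.count(bad) for (bad, _good), n in counts.items() if bad)

# Made-up proofreading history, for illustration only.
history = [
    ("page !2", "page 12"),
    ("vol. !", "vol. 1"),
    ("well=known", "well-known"),
]
profile = confusion_counts(history)
```

Lines with a high score would go to a human first. The obvious weakness is that a profile only knows about mistakes it has already seen, so low-scoring lines are not guaranteed to be clean.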
It's surprising that so many people are either trolling or are unaware of the concept of "public domain." I personally fear the latter more because it shows the ideological degradation of America. The Slashdot community is much more likely to be aware of copyright issues than most Americans. If so many of us are so naive then I genuinely fear for the survival of our country as a free nation. Perhaps that is the reason why the media corporations can encroach upon our rights by pushing inferior products and getting unanimous approval of the DMCA in the Senate.
In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.
We use omnipro here at work, and I'm surprised at how well it works, even recreating page formats.
Of course, it doesn't work 100%, but it sure does get about 95%. If you were to OCR a document 2-3 or more times, and most of it was identical, it would save a lot of time if you had humans going over only the parts that the different OCRs didn't agree on.
Steve Lefevre
Computers are useless. They can only give you answers.
-- Pablo Picasso
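The multi-pass OCR suggestion above could look something like this in practice. It is only a sketch under assumptions: the function name `disagreements` and the sample strings are invented for the example, and it compares just two passes word by word using Python's standard `difflib`.

```python
from difflib import SequenceMatcher

def disagreements(pass_a, pass_b):
    """Return the word spans where two OCR passes of the same page differ.

    Each result is (word index in pass_a, text from pass_a, text from pass_b).
    """
    words_a, words_b = pass_a.split(), pass_b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (i1, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Made-up OCR output from two hypothetical passes over the same line.
first_pass = "It was the best of tirnes"
second_pass = "It was the best of times"
to_review = disagreements(first_pass, second_pass)
```

Only the spans in `to_review` would be shown to a human. Note that if both passes make the same mistake, the error slips through unnoticed, so this reduces the proofreading load rather than replacing it.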
I have a little problem with the logistics here. I can understand why every page is sent to 2 people for proofreading in an effort to eliminate errors, but these aren't 2 computers doing simple computations. If the two people produce different versions of a corrected page, as I'm sure they will, what happens then? Who does the final proofreading? And if someone is doing the final proofreading, that kind of eliminates the need for the distributed part. I could almost guarantee that any 2 people checking the same full page of data in their free time will find/create different errors. I hope I'm missing some large concept here, because I do love PG; they keep my Palm stacked with good reading for free.
Someone needs to do a Google search on "Public Domain". Public domain is there for a reason. Copyright is available to give the artist a means of supporting himself, but it was never meant to last his entire life. The purpose is to give the artist an incentive to work; current copyright law fails in this respect because an artist only needs to create one successful work and can immediately switch to being a leech on society for the rest of his (and his children's, and children's children's) life. Having works pass into the public domain is a good idea for two reasons:
1. It is for the greater good of society as other people build on earlier works.
2. It keeps the artist busy as they were supposed to have to keep releasing work to feed themselves as their early work passed into the public domain, just like any other job.
I read the internet for the articles.
Lots of books aren't copyrighted anymore because the copyright expired. You see, back before Disney bought legislation from people like Sonny Bono, copyrights would be allowed to expire after about 50 years or so.
Beowulf, Moby Dick, Shakespeare's plays, etc. are all free as in speech and beer. Edited versions of the original text can be copyrighted. Examples of that are editions of Shakespeare's plays with "translations" next to the original text. You can buy his complete works, unedited, for very little money these days. The only cost for the publisher is printing and typesetting.
How about this.... use an open source speech synthesis tool/API that can play these text books (especially as more get added) over a PDA, laptop, etc while cruising in on the way to work and home. Something like:
http://www.cstr.ed.ac.uk/projects/festival/
(no plug, just did a quick freshmeat search)
It would be pretty cool to have some good novels read to you without buying the tapes.
Is there a list of books that are out of copyright and perhaps the status of those books on the Gutenberg Project website or anywhere else?
This is a great project... But after doing my first page I found a couple of possible enhancements.
Add a "Quality" stat for each person. Base it on the number of things that were missed (in other words, the number of things that the second-string proofer finds).
Use more than just two proofers. Have one "First String" proofer, who could be anybody, but have two second-string proofers (who both get the output of the first-string proofer). If the second-string proofers have any differences in their output (with the exception of white space), then another second-string proofer should be used. Only proofers with a certain quality rating (slightly higher than what a newbie's would be) should be able to do the second-string proofing.
The "User Rating" should be a combination of the number of pages done and the quality rating of those pages. Note that the quality rating would only be changed by doing first-string proofing. Page count would go up for any proofing.
Quality could be a float, starting at 1.0 for newbies. Every page that is completed and checked by a second-string person would then go into a calculation like:
_new_quality_ = _old_quality_ + (0.01 - (_num_differences_between_their_proof_and_final_proof_ / 1000))
Thus, for every page proofed that requires NO corrections by the second string, the user's quality would go up by 0.01 (0.01 - 0/1000 = 0.01).
With exactly ten errors in the proofing, their quality would stay the same (0.01 - 10/1000 = 0.00); with more than ten, it would go down (0.01 - 20/1000 = -0.01).
Have a threshold of 1.10 or some such for second-string proofers... That way it would require the user to do at least 10 perfect pages, or 20 pages with 5 errors each, etc., before they could do the second-string proofing.
Obviously, make sure that the second-string proofer can't see who the first-string proofer is.
The "User Rating" (mentioned above) could just be the Quality multiplied by the Page Count...
Sticks and Stones may break my bones, but copyright will always protect me.
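The quality-rating proposal above is simple enough to sketch in a few lines. The names here (`update_quality`, `can_second_string`, `user_rating`) are invented for illustration. One deliberate departure from the comment: instead of a float, this sketch stores quality as an integer number of thousandths (1000 means 1.000), because repeatedly adding 0.01 to a float can land a hair under the 1.10 threshold.

```python
START_QUALITY = 1000             # 1.000 for newbies, stored in thousandths
PER_PAGE_BONUS = 10              # the +0.01 each checked page earns
SECOND_STRING_THRESHOLD = 1100   # the proposed 1.10 cutoff

def update_quality(quality, num_differences):
    """Apply new = old + (0.01 - differences/1000), in thousandths."""
    return quality + PER_PAGE_BONUS - num_differences

def can_second_string(quality):
    """True once a proofer has reached the second-string threshold."""
    return quality >= SECOND_STRING_THRESHOLD

def user_rating(quality, page_count):
    """Quality (converted back to a float) multiplied by pages proofed."""
    return quality * page_count / 1000

# A newbie who proofs ten pages that need no corrections.
quality = START_QUALITY
for _ in range(10):
    quality = update_quality(quality, 0)
```

With these numbers, ten perfect pages or twenty pages with five differences each both reach the threshold exactly, matching the arithmetic in the comment.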
I have a few books that are old enough to be well out of copyright (and obscure enough not to be found online already), and for a while I have been considering typing them in. OCR would be a lot easier, but getting a good image from a flatbed scanner would seriously damage most of these books. Even a handheld scanner would be impractical in some cases, and a digital camera seems even less likely to work. Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?
The response from the Slashdot community is impressive. Already they have hit their mark for the day as far as 'pages processed'. They have over 1400 pages processed (at 10:13am CST). When I visited their site at 8:45am CST they had only 615 pages. I predict that the project will hit the 3000 mark fairly quickly for today.
I just proofread 2 pages of some Greek philosophy book. The system works really nicely! Quick database, and the pages are not too large to read. Except I would like to have the source image and the text next to each other, not one above the other.
I signed up for an account, and did a bit of proofing. One page was a bibliography with lots of numbers -- the OCR software made a few errors here and there, sometimes confusing "1" with "!". Another page was in old German. Since many old German characters look so different from their modern-day counterparts, I was quite impressed when it translated them flawlessly into their proper ASCII counterparts. The OCR software even got the umlauts right. The only problem was that it sometimes mistook an end-of-line "-" for a "=". One problem I did have was that most of the scans seemed to be pretty low resolution. This makes it harder for the proofreader to compare the OCR text to the original image. The software also had trouble translating the low-res blocks.
http://cltracker.net -- powerful craigslist multi-city search
Their approach to solving this reminds me of how the Oxford English Dictionary was started -- by compiling submissions and references from thousands of volunteers. A really enjoyable recounting of this (and of one particular person who contributed thousands of words while in an insane asylum) is The Professor and the Madman.