Just One Page a Day
Charles Franks writes "Two years ago I started building an online proofreading system as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP). The concept is simple, we scan books and load the image and OCR output for each page into the online system. Next, proofreaders compare the OCR text to the image making any corrections as necessary, each page gets looked at twice. Finally the output from the site is massaged into a PG e-text and submitted to PG for posting to the archive. Now, nearly 600 books and a lot of PHP code later, we have snuggled into our new home which is graciously provided by the Internet Archive and Project Gutenberg. Now that we have 'real' resources available to us (the original site ran on a Pentium 200 over my 128kbps upstream cablemodem) I would like to invite the online community at large to help us put even more books online. To this end I would like to ask everyone to do 'Just One Page a Day'. Thank you, Charles Franks"
And start reading a page!
After that come back and you may continue();
I think a better use of time would be to have all these programmers here develop a better OCR. Then you wouldn't need the proofreading and could just feed books into the scanner. I mean there are lots of things wrong with OCR and reasons why it can't be absolutely perfect, but it CAN be better. If we just write one line of code a day each we'll have better OCR in no time.
The GeekNights podcast is going strong. Listen!
a wonderful resource for poor areas.
And where do the poor get online? In libraries.
D'oh!
Copyrights aren't perpetual In Theory. But aren't Disney and Microsoft (MS wrt printed works esp) working hard to ensure they're perpetual In Practice?
Bollocks.
Technology is a human endeavour and as with all human work it is subject to ethical and moral considerations.
It's a disgrace that moral philosophy is not a required course in most tech. degree programs.
Different book - different font - different problems.
It might help a bit, but most OCR programs already tag letters that they are unsure about. They don't mention in the article whether the distributed system uses OCR ambiguity to prioritise proofreading.
As an aside, why not just store the raw image for any ambiguous text within the documents in the PG archive (think of an HTML sort of thing)? As people read the document, just poll them as to what they think the letters in the bitmap are.
I guess a lot of the strategy rests on how frequently the OCR software makes an error or finds ambiguity.
>Just get just about any scanner - it'll almost certainly come with free OCR software.
Generally not nearly as good as the top two: Scansoft (http://www.scansoft.com/sdk/, which seems to have engulfed the Xerox/Textbridge and Caere/Omnipage technologies) and ABBYY.
When you scan for public use, think about the time of *other people* you waste if your OCR is not optimal or your scans are off-register/ skewed etc.
These are humans comparing identical books to text.. if they have the IDENTICAL book they won't have this problem.
Gutenberg has often published the same 'book' from different publications because of slight variations in the text.
Well, copyrights weren't perpetual. Whether they will be or not remains to be seen.
Liberty uber alles.
> In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.
Maybe worth a try, but could well die if they get the same words wrong. For dp, the extra time for scanning could well eat up the time saved by the proofreaders. Not to mention extra development to support this (with extra GUI/ more chances for confused newbies).
This may eliminate some of the OCR errors, but it won't speed up the process because a good editor reads every word. You are asking for more errors when you ask your editors to become lazy and skip words.
Most OCR will probably misread the same character the same way every time (read 'B' as '13', for example). That kind of error will not be flagged, and will be overlooked by editors who are used to looking only for flagged errors.
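To make the two objections above concrete, here is a rough sketch of the "proofread only the disagreements" idea. Everything in it is an assumption for illustration: the three pass strings are made up, `flag_disagreements` is a hypothetical helper, and it lines words up by position, which only works when every pass yields the same number of words.

```python
from itertools import zip_longest

def flag_disagreements(*passes):
    """Return (word_index, candidates) wherever the OCR passes differ."""
    tokenized = [p.split() for p in passes]
    flagged = []
    # Naive positional alignment: assumes each pass produced the same word count.
    for i, words in enumerate(zip_longest(*tokenized, fillvalue="")):
        if len(set(words)) > 1:
            flagged.append((i, words))
    return flagged

# Hypothetical output of three OCR passes over one printed line.
# The printed word is "Bread"; every pass misreads 'B' as '13'.
pass_a = "13read and hutter for the table"
pass_b = "13read and butter for the table"
pass_c = "13read and butter for tbe table"

flagged = flag_disagreements(pass_a, pass_b, pass_c)
for index, candidates in flagged:
    print(index, candidates)
```

Words 2 and 4 get flagged because the passes disagree, but word 0 ("13read") sails through unflagged since all three passes made the same mistake, which is exactly the case a proofreader who only looks at flagged words would miss.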
Reading the blurb at the page-a-day site, it says ASCII only: bold is converted to ALL CAPS, the English pound symbol is rendered as "L," etc. No preservation of figures, drawings, or photos.
This seems very shortsighted to me. Devices that can only display ASCII are becoming rarer and rarer. Why not, instead, store docs in some sort of SGML format to handle the special markup (which must be rare) and then down-convert to ASCII when needed?
I've tried reading these things on my Palm. Very difficult. But if I could get a nice typeset PDF version, that would be a whole different story (no pun intended).
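A minimal sketch of that down-conversion, assuming a made-up markup with `<b>` for bold (this is not PG's or DP's actual format, and `to_plain_ascii` is a hypothetical name). It applies the two rules quoted from the blurb, bold to ALL CAPS and the pound sign to "L", then drops whatever markup is left.

```python
import re

def to_plain_ascii(marked_up):
    """Down-convert a marked-up string to plain ASCII (illustrative rules only)."""
    # Bold spans become ALL CAPS, as the blurb describes.
    text = re.sub(r"<b>(.*?)</b>", lambda m: m.group(1).upper(),
                  marked_up, flags=re.DOTALL)
    # The pound symbol is rendered as "L".
    text = text.replace("\u00a3", "L")
    # Any remaining tags are simply dropped.
    text = re.sub(r"<[^>]+>", "", text)
    # Anything still outside ASCII is replaced with "?".
    return text.encode("ascii", "replace").decode("ascii")

sample = "<p>Take <b>two</b> Lemmons, price \u00a31</p>"
plain = to_plain_ascii(sample)
print(plain)
```

The point of keeping the marked-up version as the master copy is that this conversion only runs one way: the ASCII can always be regenerated, but the bold and the pound sign cannot be recovered from it.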
How long before someone writes a script to hit "Save and get another Page" and they shoot to the top of the ladder claiming to have proofread 13,450,213 pages per day...
OCR Engines are not email programs. You can't just add a line of code and all of a sudden it works better. Usually you have to spend time developing a complicated algorithm. Usually this is more than a line of code. Then you have to test it against known text (ground truth) to make sure it's a benefit, rather than a problem over a broad selection of pages. It's quite often the case that something that improves one page makes another worse.
Actually, having people make verifications against the OCR results establishes the ground truth, which someone could use to improve the OCR engine. So by doing a Page a Day, you are helping to make future Open Source OCR engines better.
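For anyone wondering what "test it against ground truth" might look like, here is one possible sketch using Python's standard difflib. The metric (matched characters divided by ground-truth length) and the name `char_accuracy` are my own choices for illustration, not what any particular OCR engine uses, and the sample strings are invented.

```python
from difflib import SequenceMatcher

def char_accuracy(ocr_text, ground_truth):
    """Fraction of ground-truth characters that also appear, in order, in the OCR text."""
    if not ground_truth:
        return 1.0
    matcher = SequenceMatcher(None, ground_truth, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ground_truth)

# Hypothetical proofread text and raw OCR output for the same line.
truth = "Bread and butter"
ocr = "13read and butter"

score = char_accuracy(ocr, truth)
print(f"{score:.4f}")
```

Run over a broad selection of proofread pages, before and after an algorithm change, a score like this would show whether the change helps overall or just fixes one page while breaking another.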
I am not a number! I am a man! And don't you
Well, actually only the alterations would be copyrighted, not the entire work. Only the original author can create a derivative work that is fully covered by copyright. Usually the publishers add a new foreword of absolutely no worth. If you take out that foreword and copy only the original text, it would be hard for them to prove otherwise. The only sticking point is translations of foreign work. You won't find a lot of Kafka in there (I found only Metamorphosis) because a lot of his stuff was translated only after WW II. The translations are basically new works and are copyrighted as of the date of translation.
Copyright law is supposed to give incentive to create, for the betterment of society, and allow the creator to derive direct benefits as a reward. An artist who has created a work so successful that (s)he can live on it indefinitely has arguably provided a suitable level of betterment to society.
Saying that copyright law is an incentive to "work" is accepting mediocrity. Artists who produce works that society values more highly should (have the opportunity to) receive more benefits.
On the other hand, I don't necessarily agree that copyright should last the lifetime of the creator (although there are strong arguments for this in the case of a natural person). But what is a "fair" limit?
Is 5 years enough? Almost certainly not. Many authors only achieve popularity after 10 or more years, and then make a fair amount of money off increased sales of their older works. A good number accept this as a risk, and plan to use this phenomenon to their benefit - work up a good number of titles with varied content, and you'll pull more readers, who are then likely to try some of your other titles.
Is 20 years enough? Maybe. But some of our best-loved authors were 15-20 years ahead of their time in terms of what readers wanted.
Is life enough? Strangely, no. If an aging star has just completed his/her autobiography, concludes the publishing deal, and dies ... well, the family could well be screwed.
Maybe the answer lies in a compromise, rather than an all-or-nothing approach. Copyright over a work lasts for the greater of 10 years or the creator's natural life (which gets very interesting when we get eternal life medications ...). But some rights fall away after the LESSER of those two times, such as exclusivity over derivative works (but not translations).
This allows society to (culturally) enrich itself by building on a work after a shorter amount of time, while the creator (and/or family) can still derive value from the original work for a longer time.
In the case of books this is easily understood: author writes book; 10 years later other people can write prequels and sequels, extend the world and characters, etc; 30 years later author dies and original book falls into public domain.
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
It seems like every few years I turn around and notice that some massive archive collection gets sued, goes out of business, has funding pulled, gets tangled in legal action, has a university board go into panic mode, etc. and suddenly it disappears without warning or notice to the frustration of many. I'm certain you also can name a number of services, collections, and resources that spontaneously vanished when hosted at friendly sites. History has proven that despite best intentions, nothing lasts forever unless we go out of our way to protect it.
So that work isn't lost or destroyed, are any of the mega-sized projects replicated elsewhere in the event that an "it'll never happen" situation crops up to this unsuspecting resource?
> it doesn't work 100%, but it sure does get about 95%
That is 2000/20 = 100 errors per page. (That is the way OCR works: even if it is 99% OK, it is still 20 errors per page.)
And that doesn't include "strange" formatting like things scribbled in margins, headings above pages, italics and extra spaces.
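The arithmetic above, spelled out as a tiny sketch. The 2000 characters per page is an assumption taken from the 2000/20 figure, and `errors_per_page` is just an illustrative name.

```python
def errors_per_page(chars_per_page, accuracy):
    """Expected number of wrong characters on a page at a given per-character accuracy."""
    return round(chars_per_page * (1 - accuracy))

CHARS_PER_PAGE = 2000  # assumption implied by the 2000/20 figure

errors_95 = errors_per_page(CHARS_PER_PAGE, 0.95)
errors_99 = errors_per_page(CHARS_PER_PAGE, 0.99)
print(errors_95, errors_99)
```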
By the way, you are not supposed to correct spelling errors made in the original page, especially since this is often "old" English.
Since Project Gutenberg can only publish books whose copyright has expired, it's quite likely that a spelling "error" may instead reflect language evolution, that is, a change in the way words are spelled over time.
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
For one example, my current project is a cookbook published in the 1730's, and so far I've corrected Apricocr to Apricock and Lemon to Lemmon; in both cases the form I corrected it to was overwhelmingly used in the text.
My personal opinion -- and yes, everyone on /. did ask for it -- would be to leave the spelling and typos intact, if the goal is to preserve literary creations. You are potentially losing information by changing it.
"Apricocr" I can see being a legitimate typo, but perhaps in converting "Lemon" to "Lemmon", you are eradicating one of the earliest uses (intentional or not) of the now-current spelling.
Ask anyone who has studied the First Folio of Shakespeare about the importance of spelling.
(And just in case you don't have a Shakespeare scholar handy: since Shakespeare's plays were almost always written down after they were first performed (and written down by someone else), there are many clues to the original performance in how certain words are spelled and capitalized and how sentences are punctuated. Hamlet's "What a piece of worke is a man" is a good example of this.)
Tuus crepidae innexilis sunt.