Just One Page a Day

Posted by ryuzaki0 on Friday November 8, 2002 @02:36AM from the stuff-to-read dept.

Charles Franks writes "Two years ago I started building an online proofreading system as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP). The concept is simple: we scan books and load the image and OCR output for each page into the online system. Next, proofreaders compare the OCR text to the image, making any corrections as necessary; each page gets looked at twice. Finally the output from the site is massaged into a PG e-text and submitted to PG for posting to the archive. Now, nearly 600 books and a lot of PHP code later, we have snuggled into our new home, which is graciously provided by the Internet Archive and Project Gutenberg. Now that we have 'real' resources available to us (the original site ran on a Pentium 200 over my 128kbps upstream cablemodem) I would like to invite the online community at large to help us put even more books online. To this end I would like to ask everyone to do 'Just One Page a Day'. Thank you, Charles Franks"

Stop reading this by XiC · 2002-11-08 02:38 · Score: 5, Insightful

And start reading a page!
After that come back and you may continue();
1. Re:Stop reading this by H0ek · 2002-11-08 07:01 · Score: 3, Insightful
  
  In fact, I feel it would be a Good Thing(tm) for our friendly Slashdot host to stick the link to this project into their Quick Link section on the main page.
  
  Of course, I've already bookmarked the page, but that's on one machine. What happens six months down the line when I need to rebuild my bookmarks? Search for the article on Slashdot? Ick.
  
  --
  H0ek
  Think you're smart? Prove you've got brains!
A better use of time by Apreche · 2002-11-08 02:45 · Score: 5, Insightful

I think a better use of time would be to have all these programmers here develop a better OCR. Then you wouldn't need the proofreading and could just feed books into the scanner. I mean there are lots of things wrong with OCR and reasons why it can't be absolutely perfect, but it CAN get better. If we just write one line of code a day each we'll have better OCR in no time.

--
The GeekNights podcast is going strong. Listen!
1. Re:A better use of time by Anonymous Coward · 2002-11-08 03:04 · Score: 1, Insightful
  
  I think a better use of time would be to have all these programmers here develop a better OCR. Then you wouldn't need the proofreading and could just feed books into the scanner. I mean there are lots of things wrong with OCR and reasons why it can't be absolutely perfect, but it CAN get better. If we just write one line of code a day each we'll have better OCR in no time.
  
  Great idea! Allow me to offer this line today:
  
  $legible_book_copy = getPerfectOCR($famous_book);
  
  Now someone just needs to implement the simple function, getPerfectOCR().
2. Re:A better use of time by SteakJerky.com · 2002-11-08 03:22 · Score: 2, Insightful
  
  Even with fantastic OCR, there will be some small errors out there, so a human double check is a great idea. If Project Gutenberg isn't a great reason to buy a PDA, I don't know what is. It's a huge library of great books ready to be read in the lunch line, on the bus, in the john...
Re:How do I get to plug my online website? by Anonymous Coward · 2002-11-08 02:46 · Score: 2, Insightful

a wonderful resource for poor areas.

And where do the poor get online? In libraries.
D'oh!
Re:Legal Implications by astrosmurf · 2002-11-08 02:52 · Score: 2, Insightful

The only works that go into PG are works in the public domain. While publishers sell dead-tree copies still, they have no copyright over the original text contained within.
But the publishers still have copyright on their specific printing. Distributing scanned copies of pages probably still violates their copyright, even if distributing the OCR output does not.
Re:copyrights? by Anonymous Coward · 2002-11-08 02:55 · Score: 2, Insightful

Copyrights aren't perpetual In Theory. But aren't Disney and Microsoft (MS wrt printed works esp) working hard to ensure they're perpetual In Practice?
Technology without morality is an atrocity by Anonymous Coward · 2002-11-08 03:06 · Score: 1, Insightful

This site is about technology for technology's sake.
Bollocks.
Technology is a human endeavour and as with all human work it is subject to ethical and moral considerations.
It's a disgrace that moral philosophy is not a required course in most tech. degree programs.
Re:use proofreading meta-data to improve OCR! by Big_Breaker · 2002-11-08 03:09 · Score: 4, Insightful

Different book - different font - different problems.

It might help a bit, but most OCR programs already tag letters that they are unsure about. They don't mention in the article if the distributed system incorporates OCR ambiguity in prioritising proofreading.

As an aside, why not just store the raw image for any ambiguous text within the documents in the PG archive? (Think of an HTML sort of thing.) As people read the document, just poll them as to what they think the letters in the bitmap are.

I guess a lot of the strategy rests on how frequently the OCR software makes an error or finds ambiguity.
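
As a rough illustration of using OCR ambiguity to prioritise proofreading, here is a minimal Python sketch. Everything in it is assumed for illustration: the `Page` record, the per-character confidence values, and the 0.8 threshold are hypothetical, and nothing here reflects how the Distributed Proofreaders site or any particular OCR engine actually works.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    number: int
    # Hypothetical data: one confidence value in [0, 1] per recognised character.
    confidences: List[float]


def uncertain_fraction(page: Page, threshold: float = 0.8) -> float:
    """Fraction of characters whose confidence falls below the threshold."""
    if not page.confidences:
        return 0.0
    low = sum(1 for c in page.confidences if c < threshold)
    return low / len(page.confidences)


def proofreading_order(pages: List[Page], threshold: float = 0.8) -> List[int]:
    """Page numbers sorted so the most ambiguous pages come first."""
    ranked = sorted(
        pages,
        key=lambda p: uncertain_fraction(p, threshold),
        reverse=True,
    )
    return [p.number for p in ranked]


pages = [
    Page(1, [0.99, 0.98, 0.97, 0.99]),  # no uncertain characters
    Page(2, [0.99, 0.40, 0.55, 0.99]),  # half the characters are uncertain
    Page(3, [0.99, 0.70, 0.95, 0.99]),  # a quarter are uncertain
]
print(proofreading_order(pages))
```

How useful such an ordering is would depend on how well the engine's confidence tracks its real errors, which is the open question raised above.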
Re:OCR Software by Anonymous Coward · 2002-11-08 03:11 · Score: 2, Insightful

>Just get just about any scanner - it'll almost certainly come with free OCR software.

Generally not nearly as good as the top two (Scansoft (http://www.scansoft.com/sdk/: seems to have engulfed the Xerox/Textbridge and Caere/Omnipage technologies), ABBYY).

When you scan for public use, think about the time of *other people* you waste if your OCR is not optimal or your scans are off-register/ skewed etc.
Re:will this work? by GiMP · 2002-11-08 03:11 · Score: 3, Insightful

These are humans comparing identical books to text.. if they have the IDENTICAL book they won't have this problem.

Gutenberg has often published the same 'book' in different editions because of slight variations in the text.
Re:copyrights? by msouth · 2002-11-08 03:14 · Score: 3, Insightful

Copyrights aren't perpetual. The Gutenberg project aims to publish books that are no longer, or have never been, under copyright.

Well, copyrights weren't perpetual. Whether they will be or not remains to be seen.

--
Liberty uber alles.
Re:A better way - have computers do more work. by Anonymous Coward · 2002-11-08 03:16 · Score: 1, Insightful

> In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.

Maybe worth a try, but could well die if they get the same words wrong. For dp, the extra time for scanning could well eat up the time saved by the proofreaders. Not to mention extra development to support this (with extra GUI/ more chances for confused newbies).
Re:A better way - have computers do more work. by hands · 2002-11-08 03:18 · Score: 2, Insightful

In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.

This may eliminate some of the OCR errors, but it won't speed up the process because a good editor reads every word. You are asking for more errors when you ask your editors to become lazy and skip words.
Most OCR will probably misread the same character the same way every time (read 'B' as '13', for example). That kind of error will not be flagged, and will be overlooked by editors who are used to only looking for flagged errors.
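
To make the objection concrete, here is a small Python sketch of the "only proofread the disagreements" idea, using the standard library's `difflib`. The two OCR passes are invented sample strings and `disagreements` is a hypothetical helper, not part of any real proofreading tool. The pass outputs differ on one word, which gets flagged, but both read 'B' as '13', so that error slips through unflagged.

```python
import difflib
from typing import List, Tuple


def disagreements(pass_a: str, pass_b: str) -> List[Tuple[str, str]]:
    """Return (words_from_a, words_from_b) for every span where two passes differ."""
    a, b = pass_a.split(), pass_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            flagged.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return flagged


# Invented outputs for the line "Bring the book to the Baron".
pass_a = "Bring tbe book to the 13aron"
pass_b = "Bring the book to the 13aron"

for left, right in disagreements(pass_a, pass_b):
    print(f"check: {left!r} vs {right!r}")
```

Only the "tbe"/"the" difference is reported; "13aron" appears in both passes, so a proofreader who looks only at flagged words would never see it.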
ASCII Only? by vondo · 2002-11-08 03:34 · Score: 5, Insightful

Reading the blurb at the page-a-day site, it says ASCII only where bold is converted to ALL CAPS, the English pound symbol is rendered as "L," etc. No preservation of figures, drawings, or photos.

This seems very short-sighted to me. Devices that can only display ASCII are becoming rarer and rarer. Why not, instead, store docs in some sort of SGML format to handle the special markup (which must be rare) and then down-convert to ASCII when needed?
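
A minimal Python sketch of that down-conversion, assuming a tiny hypothetical markup in which `<b>` marks bold text. It applies the two rules mentioned above (bold becomes ALL CAPS, the pound sign becomes a bare "L") and uses the standard library's `html.parser` as a stand-in for a real SGML toolchain; it is not the Project Gutenberg process.

```python
from html.parser import HTMLParser


class AsciiDownConverter(HTMLParser):
    """Flatten a tiny, hypothetical markup subset to plain ASCII."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.bold_depth = 0  # > 0 while inside one or more <b> elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        text = data.replace("\u00a3", "L")  # pound sign -> "L"
        self.parts.append(text.upper() if self.bold_depth else text)


def to_ascii(marked_up: str) -> str:
    parser = AsciiDownConverter()
    parser.feed(marked_up)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(to_ascii("He paid <b>five</b> pounds (\u00a35) for it."))
# He paid FIVE pounds (L5) for it.
```

The conversion only goes one way: the ASCII output cannot tell emphasised text from text that was capitalised in the original, which is the argument for keeping the marked-up version as the master copy.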

I've tried reading these things on my Palm. Very difficult. But if I could get a nice typeset PDF version, that would be a whole different story (no pun intended).
1. Re:ASCII Only? by sagwalla · 2002-11-08 05:51 · Score: 2, Insightful
  
  The beauty of this is that it is in the public domain. If you want a PDF version, or an HTML version, feel free to make one. The Gutenberg standards put the material out in a lowest-common-denominator format so anyone has the same freedom.
Distributed Proofreading has a "high score" table. by Lovepump · 2002-11-08 03:37 · Score: 3, Insightful

How long before someone writes a script to hit "Save and get another Page" and they shoot to the top of the ladder claiming to have proofread 13,450,213 pages per day...
No, not really by Codex+The+Sloth · 2002-11-08 03:46 · Score: 4, Insightful

OCR Engines are not email programs. You can't just add a line of code and all of a sudden it works better. Usually you have to spend time developing a complicated algorithm. Usually this is more than a line of code. Then you have to test it against known text (ground truth) to make sure it's a benefit, rather than a problem over a broad selection of pages. It's quite often the case that something that improves one page makes another worse.

Actually, having people make verifications against the OCR results establishes the ground truth which someone could use to improve the OCR engine so by doing a Page a Day, you are helping to make future Open Source OCR engines better.
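
As a sketch of what testing against ground truth can look like, the Python below scores OCR output by character error rate (edit distance divided by the length of the proofread text). The sample pages and the "old" and "new" engine outputs are invented for illustration, and character error rate is only one possible metric, not necessarily what any OCR project uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


# Invented pages: (proofread ground truth, old engine output, new engine output)
pages = [
    ("the quick brown fox", "tbe quick brovvn fox", "the quick brown fox"),
    ("1 lb. of flour", "1 lb. of flour", "l lb. of flour"),
]
for truth, old, new in pages:
    before = character_error_rate(old, truth)
    after = character_error_rate(new, truth)
    print(f"{before:.3f} -> {after:.3f}")
```

In this made-up sample the new engine fixes the first page and breaks the second, which is why a change has to be judged over a broad selection of pages rather than one.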

--
I am not a number! I am a man! And don't you ... oh wait, I'm #93427. Ha ha! In your face #93428!
Re:Copyright is not an issue by Anonymous Coward · 2002-11-08 03:51 · Score: 2, Insightful

Well, actually only the alterations would be copyrighted, not the entire work. Only the original author can create a derivative work that is fully covered by copyright. Usually the publishers add a new foreword of absolutely no worth. If you take out that foreword and copy only the original text it would be hard for them to prove otherwise. The only sticking point is translations of foreign work. You won't find a lot of Kafka in there (I found only Metamorphosis) because a lot of his stuff was translated only after WW II. The translations are basically new works and are copyrighted as of the date of translation.
Re:Umm... by Twylite · 2002-11-08 04:04 · Score: 4, Insightful

Copyright law is supposed to give incentive to create, for the betterment of society, and allow the creator to derive direct benefits as a reward. An artist who has created a work so successful that (s)he can live on it indefinitely has arguably provided a suitable level of betterment to society.

Saying that copyright law is an incentive to "work" is accepting mediocrity. Artists who produce works that society values more highly should (have the opportunity to) receive more benefits.

On the other hand, I don't necessarily agree that copyright should last the lifetime of the creator (although there are strong arguments for this in the case of a natural person). But what is a "fair" limit?

Is 5 years enough? Almost certainly not. Many authors only achieve popularity after 10 or more years, and then make a fair amount of money off increased sales of their older works. A good number accept this as a risk, and plan to use this phenomenon to their benefit - work up a good number of titles with varied content, and you'll pull more readers, who are then likely to try some of your other titles.

Is 20 years enough? Maybe. But some of our best-loved authors were 15-20 years ahead of their time in terms of what readers wanted.

Is life enough? Strangely, no. If an aging star has just completed his/her autobiography, concludes the publishing deal, and dies ... well, the family could well be screwed.

Maybe the answer lies in a compromise, rather than an all-or-nothing approach. Copyright over a work lasts for the greater of 10 years or the creator's natural life (which gets very interesting when we get eternal life medications ...). But some rights fall away after the LESSER of those two times, such as exclusivity over derivative works (but not translations).

This allows society to (culturally) enrich itself by building on a work after a shorter amount of time, while the creator (and/or family) can still derive value from the original work for a longer time.

In the case of books this is easily understood: author writes book; 10 years later other people can write preludes and sequels, extend the world and characters, etc; 30 years later author dies and original book falls into public domain.

--
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
Are any of these resources distributed? by wls · 2002-11-08 04:21 · Score: 3, Insightful

It seems like every few years I turn around and notice that some massive archive collection gets sued, goes out of business, has funding pulled, gets tangled in legal action, has a university board go into panic mode, etc. and suddenly it disappears without warning or notice to the frustration of many. I'm certain you also can name a number of services, collections, and resources that spontaneously vanished when hosted at friendly sites. History has proven that despite best intentions, nothing lasts forever unless we go out of our way to protect it.

So that work isn't lost or destroyed, are any of the mega-sized projects replicated elsewhere in the event that an "it'll never happen" situation happens to this unsuspecting resource?
Re:A better way - have computers do more work. by leuk_he · 2002-11-08 04:41 · Score: 3, Insightful

> it doesn't work 100%, but it sure does get about 95%

THAT IS 2000/20 = 100 errors per page. (That is the way OCR works: if it is 99% OK, it is still 20 errors per page.)
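
The arithmetic can be spelled out in a few lines of Python. The 2000 characters per page is the figure implied by the division above and is an assumption about a typical page, not a measured value; `expected_errors` is just an illustrative helper.

```python
def expected_errors(chars_per_page: int, accuracy: float) -> float:
    """Expected number of wrong characters on a page at a given per-character accuracy."""
    return chars_per_page * (1.0 - accuracy)


CHARS_PER_PAGE = 2000  # assumed page size, as implied by the 2000/20 above

for accuracy in (0.95, 0.99, 0.999):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.1%} accurate: about {errors:.0f} errors per page")
```

Even at 99.9% per-character accuracy, this works out to roughly two errors on every page.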

And that doesn't include "strange" formatting like things scribbled in margins, headings above pages, italics and extra spaces.

By the way, you are not supposed to correct spelling errors made in the original page, especially since this is often "old" English.
Re:And you ask the /. community.. by JoeBuck · 2002-11-08 05:42 · Score: 4, Insightful

Since Project Gutenberg can only publish books whose copyright has expired, it's quite likely that a spelling "error" may instead reflect language evolution, that is, a change in the way words are spelled over time.
Re:I am programmer, let's automate this by Sloppy · 2002-11-08 05:52 · Score: 3, Insightful

There has to be a better way, a programming way, to get this done without having to look at all of the files with human eyes.
It's not human eyes that are needed, it's human brains. If it is possible to automate, then the OCR doesn't need checking; it just needs to be upgraded to include whatever algorithm that you're about to invent.

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
Re:And you ask the /. community.. by Greedo · 2002-11-08 06:49 · Score: 3, Insightful

For one example, my current project is a cookbook published in the 1730's, and so far I've corrected Apricocr to Apricock and Lemon to Lemmon; in both cases the form I corrected it to was overwhelmingly used in the text.

"Apricocr" I can see being a legitimate typo, but perhaps in converting "Lemon" to "Lemmon", you are eradicating one of the earliest uses (intentional or not) of the now-current spelling.

My personal opinion -- and yes, everyone on /. did ask for it -- would be to leave the spelling and typos intact, if the goal is to preserve literary creations. You are potentially losing information by changing it.

Ask anyone who has studied the First Folio of Shakespeare about the importance of spelling.

(And just in case you don't have a Shakespeare scholar handy: since Shakespeare's plays were almost always written down after they were first performed (and written down by someone else), there are many clues to the original performance in how certain words are spelled, capitalized and how sentences are punctuated. Hamlet's "What a piece of worke is a man" is a good example of this.)

--
Tuus crepidae innexilis sunt.