Accurate OCR?

Posted by Cliff on Thursday September 19, 2002 @04:30AM from the that's-an-'a'-not-an-'o' dept.

theBrownfury asks: "I work at a lab on a university campus that provides services for disabled students. One of the main functions of this lab is to convert printed materials such as books, reading packets, etc. into electronic text (RTF or Word) that is either going to be fed to a text-to-speech synthesizer or going to be further processed for use in braille devices. Ideally we'd like to be able to process 1000 pages a week. However our current solution (a Bell&Howell 4040D scanner coupled to a mid-level PC workstation with OmniPage Pro 11 and 2-3 proofing stations) is limited to an average of 10-11 (16 on a good day) pages per hour because of the constant hand holding the OCR process requires. We've already made sure we're feeding the OCR engine good quality scans. Also it should be clarified that the materials we deal with are so varied that a majority of them cannot be covered by any type of 'general' scanning or OCR templates."

"Do any of you know of a solution which can exploit our current scanner, which we're rather happy with, but bring in a better OCR method to improve our efficiency? It should be noted that the solution should be financially reasonable (as in less than US$10K).

Our biggest bottlenecks:
- software's terrific inability to accurately pick up the areas of text on the scanned page to OCR
- marking words as possibly erroneous without checking against dictionary elongating the proofing process
- stability of OCR software

Bonuses:
- dealing with multiple languages such as Spanish and French
- capability to OCR mathematical texts and papers. Currently we hand type math textbooks for students."

There is no perfect OCR software by tchuladdiass · 2002-09-19 04:36 · Score: 2, Insightful

I've heard that oftentimes it is cheaper to send the material to a data entry company (which uses overseas labour) than it is to use OCR software, since you have to spend so much time correcting and proofreading. I've always thought that OmniPage was one of the most accurate packages out there, so since that's what you already use, I don't think you're gonna get much better. Of course, it's been several years since I've worked with any OCR, so the state of the art may have changed since then.
US Postal Service by crazymennonite · 2002-09-19 05:06 · Score: 3, Insightful

Perhaps some research into the US Postal Service OCR developers would be useful. Their systems are obviously huge, well funded, and exceptionally accurate considering the volume of mail. I don't know how they maintain it, if it's an internal group, or a contract with external developers, but whoever has it, has got a good thing.
Here's a few suggestions by dbrutus · 2002-09-19 05:08 · Score: 4, Insightful

For longer texts, it might be worth it to call the publisher and ask if they have an electronic version available. Why reinvent the wheel if you don't have to?

Another solution might be stretching your budget by doing your proof-reading offshore.
How handwriting recognition is easier by yerricde · 2002-09-19 06:03 · Score: 2, Insightful

one would imagine that recognizing much more readable printed words would be easier than my inconsistent and messy handwriting.

However, with handwriting on a PDA screen, the software gets additional information: the order of the strokes. For instance, if you always write one letter clockwise and another counterclockwise, the software can use that to help distinguish the letters. Print can't do that.

--
Will I retire or break 10K?
Google does this. by adolf · 2002-09-19 06:06 · Score: 3, Insightful

Why not ask Google how they do it?

They've got a number of image-based paper catalogs online and searchable, and thus OCR'd.

Talk about varied formatting. It seems to be reasonably accurate, and I'm sure that the process is pretty streamlined -- everything else they do seems to be...

Here is an example.

--
Kid-proof tablet..
1. Re:Google does this. by Bald Wookie · 2002-09-19 10:50 · Score: 2, Insightful
  
  This used to impress the hell out of me. Then I realized how they can appear to deliver perfect OCR:
  
  You don't know what you're missing.
  
  If the OCR fails, you don't get the hit. So long as you never see any false positives, the OCR appears to be batting 1000. In reality there might be a few catalogs that it misses because the OCR didn't work. You just never know.
  
  Compare this to OCRing a document. Every error stands out.
  
  Don't get me wrong, I'm still impressed by Google. They are just solving the 'easier' side of the problem.
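To make the point above concrete, here is a minimal sketch of a word index built over OCR output. The page texts and the `build_index` helper are hypothetical, invented only for illustration; nothing here describes how Google actually works. A misread word simply drops out of the results, so the searcher only ever sees correct hits.

```python
def build_index(pages):
    """Map each lowercased word to the set of page ids whose OCR text contains it."""
    index = {}
    for page_id, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(page_id)
    return index

# Hypothetical OCR output: page 2 really says "widget" but was misread as "w1dget".
ocr_pages = {
    1: "blue widget on sale",
    2: "red w1dget clearance",
}

index = build_index(ocr_pages)
hits = index.get("widget", set())

# Only page 1 comes back. The hit is genuine, and the miss on page 2 is invisible.
print(hits)
```

Someone proofreading page 2 as a document would see "w1dget" immediately, which is why the same error rate feels so much worse in the lab's workflow.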
Do not lock yourself with .doc by InodoroPereyra · 2002-09-19 06:23 · Score: 3, Insightful

A bit off your question, but I think you may want to consider this. If you have the choice:
... reading packets, etc. into electronic text (RTF or Word) ...

you will do yourself and your lab a big favor if you choose RTF. RTF is documented, so you do not lock yourself in with a single vendor (Microsoft) for further processing of the electronic data. It may not matter now, but it could be very important for you guys at some point in the future ...
Correct Scanning: double-entry method. by DancingSword · 2002-09-23 16:39 · Score: 2, Insightful

Get 2 different OCR-engine programs

Scan the same text in - to plain text - with program 1, scan it in to a second plain-text file with OCR program 2, and 'diff' 'em, ignoring white-space.

That means the individual person running the scanning doesn't have anywhere near the amount of work to do.

Small errors may get through, but it is drastically faster than the way it normally is done, eh?
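A rough sketch of the comparison step, assuming both engines have already written plain text. It uses Python's standard `difflib` rather than the `diff` command; the sample strings and the `disagreements` helper are made up for illustration. Splitting on whitespace first is what makes line breaks and spacing irrelevant.

```python
import difflib

def tokens(text):
    # Split on any whitespace so line breaks and spacing differences are ignored.
    return text.split()

def disagreements(ocr_a, ocr_b):
    """Return (words_from_a, words_from_b) pairs where the two OCR outputs differ."""
    a, b = tokens(ocr_a), tokens(ocr_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            flagged.append((a[i1:i2], b[j1:j2]))
    return flagged

# Hypothetical output of two different engines for the same scanned line.
engine_one = "The quick brovvn fox\njumps over the lazy dog."
engine_two = "The quick brown fox jumps  over the 1azy dog."

# Only the spots where the engines disagree need a human look.
for from_a, from_b in disagreements(engine_one, engine_two):
    print(from_a, "<->", from_b)
```

The proofreader then checks only the flagged words against the page image. Anything both engines misread the same way will not be flagged, which is the "small errors may get through" caveat.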

--
Messages to/for me ( in me journal )