Accurate OCR?
theBrownfury asks: "I work at a lab on a university campus that provides services for disabled students. One of the main functions of this lab is to convert printed materials such as books, reading packets, etc. into electronic text (RTF or Word) that is either going to be fed to a text-to-speech synthesizer or going to be further processed for use in braille devices. Ideally we'd like to be able to process 1000 pages a week. However, our current solution (a Bell&Howell 4040D scanner coupled to a mid-level PC workstation with OmniPage Pro 11 and 2-3 proofing stations) is limited to an average of 10-11 (16 on a good day) pages per hour because of the constant hand-holding the OCR process requires. We've already made sure we're feeding the OCR engine good quality scans. It should also be clarified that the materials we deal with are so varied that most of them cannot be covered by any kind of 'general' scanning or OCR template."
"Do any of you know of a solution which can exploit our current scanner, which we're rather happy with, but bring in a better OCR method to improve our efficiency? It should be noted that the solution should be financially reasonable (as in less than US$10K).
Our biggest bottlenecks:
- software's terrific inability to accurately pick up the areas of text on the scanned page to OCR
- marking words as possibly erroneous without checking them against a dictionary, which lengthens the proofing process
- stability of OCR software
Bonuses:
- dealing with multiple languages such as Spanish and French
- capability to OCR mathematical texts and papers. Currently we hand type math textbooks for students."
I pity the kids who are going to have to listen to "fluid dynamics on tape":
"Partial rho partial t plus rho times left parenthesis partial u partial x plus partial v partial y plus partial w partial z right parenthesis equals zero".GMD
Perhaps some research into the US Postal Service OCR developers would be useful. Their systems are obviously huge, well funded, and exceptionally accurate considering the volume of mail. I don't know how they maintain it, whether it's an internal group or a contract with external developers, but whoever has it has got a good thing.
For longer texts, it might be worth it to call the publisher and ask if they have an electronic version available. Why reinvent the wheel if you don't have to?
Another solution might be stretching your budget by doing your proof-reading offshore.
--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
Most OCR systems can only give you 98% accurcy, but we've foung that by running the output through cmdr_taco's spelling and gramer checker, that the accurcy is bumped up to 100%.
Just like this post!
Moneyed corporations, non-working 'poor' and criminal prisoners are turning productive citizens into tax-slaves.
With regard to accuracy: I've tested and compared OmniPage Pro to ABBYY FineReader, and ABBYY is much, much better at text recognition. It doesn't offer as many export formats as OmniPage Pro does, but it does include an SDK, so if you can get your hands on some programmers you might be able to fiddle with it some. ABBYY is definitely a step up from OmniPage.
I'm pretty sure that ABBYY FineReader has language modules, so you can scan works in many languages.
Karma: Chevy Kavalierma.
Why not ask Google how they do it?
They've got a number of image-based paper catalogs online and searchable, and thus OCR'd.
Talk about varied formatting. It seems to be reasonably accurate, and I'm sure that the process is pretty streamlined -- everything else they do seems to be...
Here is an example.
You will do yourself and your lab a big favor if you choose RTF. RTF is documented, so you do not lock yourself in to a single vendor (Microsoft) for further processing of the electronic data. It may not matter now, but it could be very important for you guys at some point in the future ...
I started a small document scanning service a few years ago (I am no longer in that business). The biggest issue in OCR accuracy is pre-processing, in particular de-skewing and grayscale removal. If the page is skewed by even a couple of degrees, OCR will fail miserably. I have had superb results using TMSSequoia Scanfix software, which automatically cleans up and straightens the page nicely. It's expensive but worth it if you have a lot to scan. I believe that they still have a demo available.
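To make the de-skew step concrete, here is a minimal sketch of one generic way to estimate page skew, the projection-profile method. This is not how Scanfix works internally (I have no idea what it does); the function names, the 0.5 degree search step, and the synthetic test page are all made up for illustration. The idea: when the trial angle matches the real skew, the ink on each text line collapses onto the same row, so the row counts become sharply peaked.

```python
import math

def synthetic_line(width, height, angle_deg, top=20):
    """Build a test image (rows of 0/1) with one text-like line drawn at angle_deg.
    Image y grows downward, so a positive angle drops to the right."""
    slope = math.tan(math.radians(angle_deg))
    pixels = [[0] * width for _ in range(height)]
    for x in range(width):
        pixels[top + round(x * slope)][x] = 1
    return pixels

def profile_score(pixels, angle_deg):
    """Sum of squared row counts after shifting each ink pixel back by the trial skew.
    The score is highest when the ink piles up on as few rows as possible."""
    slope = math.tan(math.radians(angle_deg))
    counts = {}
    for y, row in enumerate(pixels):
        for x, ink in enumerate(row):
            if ink:
                shifted = y - round(x * slope)
                counts[shifted] = counts.get(shifted, 0) + 1
    return sum(c * c for c in counts.values())

def estimate_skew(pixels, max_deg=5.0, step=0.5):
    """Try every angle in [-max_deg, +max_deg] and keep the best-scoring one."""
    steps = int(round(2 * max_deg / step))
    candidates = [i * step - max_deg for i in range(steps + 1)]
    return max(candidates, key=lambda a: profile_score(pixels, a))

# Hypothetical page: a single line skewed by 2 degrees.
page = synthetic_line(width=200, height=40, angle_deg=2.0)
print(estimate_skew(page))
```

On this synthetic page the search should recover the 2 degrees that were put in. A real page would first need to be thresholded to black and white, and the page would then be rotated by the negative of the estimate before going to the OCR engine.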
My experience has been that consumer OCR software is considerably MORE accurate than industrial versions that cost 20X as much. I obtained excellent OCR accuracy using Scansoft's Textbridge software, which used the Xerox Textbridge engine. Scansoft appears to have purchased Omnipage OCR and discontinued the Textbridge OCR line. I achieved much higher accuracy with Textbridge than with Omnipage after the document was processed by Scanfix. Textbridge lacked some of Omnipage's features, but it was faster and better at OCR. I would definitely download the Textbridge 98 demo that is still floating around on the web.
Both Textbridge and Omnipage OCR were vastly superior to anything else I previewed, including Adobe's OCR engine. OCR can be surprisingly accurate, but the source image needs to be free of distortion. Sometimes you will need to break up the page into several images using photo-editing software, since no OCR can interpret the structure of a document very well.
I suspect that you will be better off just typing the mathematics in by hand. Maybe a visual LaTeX editor like Scientific WorkPlace would be helpful. The LaTeX output could be manipulated using a parser to put the equations into the simpler forms that you need, while keeping the raw equation in a form that could be used for other purposes later on.
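As a rough illustration of the "manipulate the LaTeX with a parser" idea, here is a minimal sketch that turns a flat LaTeX expression into words for a speech synthesizer. The word table and the `speak` function are my own invention, not part of any product mentioned here, and it only handles a tiny subset: braces, fractions, and superscripts are simply dropped or passed through, so a real tool would need a proper parser.

```python
import re

# Hypothetical lookup table: LaTeX token -> spoken form.
SPOKEN = {
    r"\partial": "partial",
    r"\rho": "rho",
    "+": "plus",
    "-": "minus",
    "=": "equals",
    "(": "left parenthesis",
    ")": "right parenthesis",
    "0": "zero",
}

# A token is a command (\name), a single letter, a number, or one of a few symbols.
TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]|\d+|[-+=()]")

def speak(latex):
    """Translate a flat LaTeX expression into a string of words."""
    words = []
    for tok in TOKEN.findall(latex):
        # Unknown commands are read by name, without the backslash.
        words.append(SPOKEN.get(tok, tok.lstrip("\\")))
    return " ".join(words)

print(speak(r"\partial \rho \partial t + \rho ( \partial u \partial x ) = 0"))
```

The raw LaTeX stays untouched as the master copy; the spoken form (or a braille math encoding) is just one more output generated from it.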
Honestly, 10 pages/hour is pretty good, so it doesn't sound like you are doing all that much touch-up. I suspect that using Scanfix will provide the greatest boost in productivity.
I've been working off and on with OCR packages since 1991, and have seen little improvement in the accuracy over that time. 98% or 99% accuracy sounds great; but, as you already know, you have to have someone go over the entire text and check it. If you consider that you don't know where the errors are likely to be, then you begin to realize the extent of the issue. I have generally found that, in cases where 100% accuracy is necessary (and there are some cases where 99% might be good enough), it's just as cost-effective to use a professional typing service.
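To put rough numbers on that, here is a small sketch of the arithmetic. The 2,000 characters per page is an assumed figure (it is not from this thread, and real pages vary a lot); the 1,000 pages per week is the target from the original question.

```python
def expected_errors(chars, accuracy_percent):
    """Expected number of wrong characters at a given character accuracy."""
    return chars * (100 - accuracy_percent) / 100

CHARS_PER_PAGE = 2000   # assumption for illustration only
PAGES_PER_WEEK = 1000   # target stated in the question

for accuracy in (98, 99):
    per_page = expected_errors(CHARS_PER_PAGE, accuracy)
    per_week = per_page * PAGES_PER_WEEK
    print(f"{accuracy}% accurate: about {per_page:.0f} errors per page, "
          f"{per_week:.0f} per week")
```

Under those assumptions, 98% accuracy still leaves about 40 wrong characters on every page, scattered where the proofreader cannot predict them, which is why the whole text has to be read.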
The scanner you have is hard to beat. As for the software, I found that the Caere engine was a little better than the OmniPage engine when I first started working with OCR, but over time OmniPage has gotten that little bit extra oomph into it.
Having said that, there are some posts that recommend ABBYY FineReader, a product with which I'm unfamiliar, and state that a trial version is available, so it's probably worth a check.
Finally, one small factor that sometimes is overlooked: what resolution do you scan at? You may want to try lowering the resolution and seeing if that gives any better results. Lowering the resolution can have the effect of smoothing out some of the noise that can confuse OCR engines. Try going all the way down to 200 dpi.
Finally (part two), I've found you can also sometimes tweak the results by playing with the depth -- instead of scanning in b&w, try gray scale (I suggest 4 bit).
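If you'd rather experiment on scans you already have than rescan everything, both tweaks can be approximated in software. This is only a sketch on a plain list-of-rows image with made-up function names; a scanner driver or imaging library would do the same job, and the factor of 2 assumes a hypothetical 400 dpi original taken down to 200 dpi.

```python
def downsample(pixels, factor):
    """Average each factor x factor block of an 8-bit grayscale image.
    Averaging is what smooths out isolated speckles."""
    height = len(pixels) // factor
    width = len(pixels[0]) // factor
    out = []
    for by in range(height):
        row = []
        for bx in range(width):
            block = [
                pixels[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) // len(block))
        out.append(row)
    return out

def to_4bit(pixels):
    """Reduce 8-bit gray values (0-255) to 4-bit values (0-15)."""
    return [[value >> 4 for value in row] for row in pixels]

# Tiny made-up scan: white paper (255) with one dark speckle (0) and a black area.
scan = [
    [255, 255, 0, 0],
    [255, 0, 0, 0],
]
half = downsample(scan, 2)
print(half)
print(to_4bit(half))
```

Note that software downsampling is not identical to scanning at a lower resolution, so treat it as a quick way to see whether the OCR engine reacts before changing the scanner settings.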
Finally (part three), I'm dubious that you'll find anything to handle formulae. For those readers who may be surprised to learn OCR accuracy is not quite up to scratch, just wait until you encounter OCR format preservation.
Good luck -- and if you do get better results, by all means let us all know!
Cheers