Accurate OCR?


Posted by Cliff on Thursday September 19, 2002 @04:30AM from the that's-an-'a'-not-an-'o' dept.

theBrownfury asks: "I work at a lab on a university campus that provides services for disabled students. One of the main functions of this lab is to convert printed materials such as books, reading packets, etc. into electronic text (RTF or Word) that is either going to be fed to a text-to-speech synthesizer or further processed for use in braille devices. Ideally we'd like to be able to process 1000 pages a week. However, our current solution (a Bell&Howell 4040D scanner coupled to a mid-level PC workstation with OmniPage Pro 11 and 2-3 proofing stations) is limited to an average of 10-11 (16 on a good day) pages per hour because of the constant hand-holding the OCR process requires. We've already made sure we're feeding the OCR engine good quality scans. It should also be clarified that the materials we deal with are so varied that the majority of them cannot be covered by any 'general' scanning or OCR templates."

"Do any of you know of a solution which can exploit our current scanner, which we're rather happy with, but bring in a better OCR method to improve our efficiency? It should be noted that the solution should be financially reasonable (as in less than US$10K).

Our biggest bottlenecks:
- software's terrific inability to accurately pick up the areas of text on the scanned page to OCR
- marking words as possibly erroneous without checking them against a dictionary, which lengthens the proofing process
- stability of OCR software

Bonuses:
- dealing with multiple languages such as Spanish and French
- capability to OCR mathematical texts and papers. Currently we hand type math textbooks for students."
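
The second bottleneck in the question (words flagged as suspect without a dictionary check) could in principle be narrowed with a post-filter like the one below. This is a minimal sketch, not a feature of OmniPage or of any other product named in this thread; the function name `words_needing_review` and the sample word lists are hypothetical.

```python
import re

def words_needing_review(flagged_words, dictionary):
    """Return only the flagged words that are absent from the dictionary."""
    known = {w.lower() for w in dictionary}
    review = []
    for word in flagged_words:
        # Strip surrounding punctuation so "scanner," matches "scanner".
        core = re.sub(r"^\W+|\W+$", "", word).lower()
        if core and core not in known:
            review.append(word)
    return review

# Hypothetical sample data: two real words and two typical OCR misreads.
dictionary = ["the", "scanner", "reads", "page"]
flagged = ["scanner,", "rcads", "page", "tbe"]
print(words_needing_review(flagged, dictionary))  # ['rcads', 'tbe']
```

With a filter of this kind, a proofer would only see the flagged words that are not ordinary dictionary words, although misreads that happen to form valid words would still slip through.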

Re:US Postal Service by duffbeer703 · 2002-09-19 05:18 · Score: 4, Informative

The USPS has a very tightly defined set of data that it needs to scan (i.e., ZIP codes).

If there is more than a slight chance of a misread, then the machines automatically send the envelope to a human reader, who keys in the zip.
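
The routing rule described above can be sketched as a simple confidence threshold. This illustrates the idea only and is not the USPS's actual system; the function name `route_reads`, the 0.95 threshold, and the sample reads are all assumptions.

```python
def route_reads(reads, threshold=0.95):
    """Split (text, confidence) pairs into accepted and human-review lists.

    The threshold is a hypothetical value; a real system would tune it.
    """
    accepted, needs_human = [], []
    for text, confidence in reads:
        if confidence >= threshold:
            accepted.append(text)
        else:
            # Anything the engine is unsure about goes to a person.
            needs_human.append(text)
    return accepted, needs_human

# Hypothetical engine output: one confident read, one doubtful read.
accepted, needs_human = route_reads([("90210", 0.99), ("1O001", 0.60)])
print(accepted, needs_human)
```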

--
Conformity is the jailer of freedom and enemy of growth. -JFK
Abby FineReader... by greenhide · 2002-09-19 05:34 · Score: 4, Informative

In regards to accuracy: I've tested and compared OmniPage Pro to Abby FineReader and Abby is much, much better at text recognition. It doesn't offer as many export formats as OmniPage Pro does, but it does include an SDK, so if you can get your hands on some programmers you might be able to fiddle with it some. Abby is definitely a step up from OmniPage.

dealing with multiple languages such as Spanish and French

I'm pretty sure that Abby FineReader has language modules, so you can scan works in many languages.

--
Karma: Chevy Kavalierma.
1. Re:Abby FineReader... by bootprom · 2002-09-19 05:49 · Score: 2, Informative
  
  I'd have to agree. I work for a document management software company and we sometimes work with a third-party company called Kofax. They provide scanning and OCR. It just so happens that they license their OCR engine from the same people who make OmniPage (ScanSoft?). We have some clients that are using that engine to scan and OCR 100,000 documents a day. While people do report problems, it works very well for the most part, and there will always be some problems with OCR, at least for the foreseeable future.
  
  Dan
2. Re:Abby FineReader... by Cy+Guy · 2002-09-19 06:28 · Score: 2, Informative
  
  I've tested and compared OmniPage Pro to Abby FineReader
  
  You can also download a fully functional demo version that will run 15 times. So it couldn't hurt to give it a try.
  
  I'm pretty sure that Abby FineReader has language modules, so you can scan works in many languages
  
  I'll say, in fact it supports the following: Armenian (Eastern), Armenian (Grabar), Armenian (Western), Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Dutch (Belgian), Estonian, Finnish, French, German, German (new spelling), Greek, Hungarian, Italian, Latvian, Lithuanian, Norwegian (Bokmal), Norwegian (Nynorsk), Polish, Portuguese, Portuguese (Brazilian), Romanian, Russian, Slovak, Spanish, Swedish, Tatar, Turkish, and Ukrainian. Only European languages, but still impressive.
  
  --
  Work for Change & GET PAID!
OpenBook Ruby by edbarrett · 2002-09-19 05:45 · Score: 2, Informative

OpenBook Ruby from Freedom Scientific has served us pretty well. It's a combo scanner/screen reader program. We have it set up for use on a public workstation and it's very accurate. We're still using the 4.0 version, but it appears to be up to version 6.0 now (with built-in scan-to-MP3 conversion!)
GOCR by bmomjian · 2002-09-19 06:30 · Score: 2, Informative

I actually use gocr with great success. It doesn't have a user interface; it's strictly command-line, but it works well.
My experiences with OCR... Scanfix+Textbridge by tiohero · 2002-09-19 07:26 · Score: 3, Informative

I started a small document scanning service a few years ago. (I am no longer in that business.) The biggest issue in OCR accuracy is pre-processing (in particular de-skew and grayscale removal). If the page is skewed even a couple of degrees, OCR will fail miserably. I have had superb results using TMSSequoia Scanfix software, which automatically cleans up and straightens the page nicely. It's expensive but worth it if you have a lot to scan. I believe that they still have a demo available.
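
To make the de-skew step concrete, here is a minimal sketch of one common way to estimate page skew, a projection-profile search. It is not how Scanfix works (the post does not say), and it approximates rotation with a vertical shear, which is reasonable only for small angles. The function name `estimate_skew`, its parameters, and the synthetic test page are all assumptions.

```python
import math
from collections import Counter

def estimate_skew(ink_pixels, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) that best aligns ink into rows.

    ink_pixels is an iterable of (x, y) coordinates of dark pixels.
    """
    ink_pixels = list(ink_pixels)
    best_angle, best_score = 0.0, -1
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        slope = math.tan(math.radians(angle))
        # Undo a skew of `angle` by shifting each pixel along its column.
        rows = Counter(round(y - x * slope) for x, y in ink_pixels)
        # Straight text lines pile ink into a few very full rows,
        # which maximizes the sum of squared row counts.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Synthetic page: three one-pixel-thick "text lines" skewed by 2 degrees.
t = math.tan(math.radians(2.0))
page = [(x, round(base + x * t)) for base in (10, 30, 50) for x in range(200)]
print(estimate_skew(page))  # 2.0
```

Once the angle is known, the page would be rotated back by that amount before being handed to the OCR engine.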

My experience has been that the consumer OCR software is considerably MORE accurate than industrial versions that cost 20X as much. I obtained excellent OCR accuracy using Scansoft's Textbridge software, which utilized the Xerox Textbridge engine. Scansoft appears to have purchased Omnipage OCR and discontinued the Textbridge OCR line. I found that I achieved much higher accuracy with Textbridge than with Omnipage after the document was processed by Scanfix. Textbridge did not have some of the features of Omnipage, but Textbridge was faster and better at OCR. I would definitely download the Textbridge 98 demo that is still floating around on the web.

Both Textbridge and Omnipage OCR were vastly superior to anything else I previewed, including Adobe's OCR engine. OCR can be surprisingly accurate, but the source image needs to be free of distortion. Sometimes you will need to break up the page into several images using photo-editing software, since no OCR can interpret the structure of a document very well.

I suspect that you will be better off just typing the mathematics in by hand. Maybe a visual LaTeX editor like Scientific Workplace would be helpful. The LaTeX output could be manipulated using a parser to put the equations into the simpler forms that you need while keeping the raw equation in a form that could be used for other purposes later on.

Honestly, 10 pages/hour is pretty good, so it doesn't sound like you are doing all that much touch-up. I suspect that using Scanfix will provide the greatest boost in productivity.
My experience, for what it's worth by kiwimate · 2002-09-20 03:08 · Score: 3, Informative

I've been working off and on with OCR packages since 1991, and have seen little improvement in the accuracy over that time. 98% or 99% accuracy sounds great; but, as you already know, you have to have someone go over the entire text and check it. If you consider that you don't know where the errors are likely to be, then you begin to realize the extent of the issue. I have generally found that, in cases where 100% accuracy is necessary (and there are some cases where 99% might be good enough), it's just as cost-effective to use a professional typing service.
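
To put the 98% and 99% figures in perspective, here is a rough back-of-the-envelope calculation. It assumes per-character accuracy and about 2000 characters per page, neither of which is stated in the thread; the 1000 pages per week is the target from the original question.

```python
def expected_errors(chars, accuracy):
    """Expected number of misread characters at a given per-character accuracy."""
    return chars * (1.0 - accuracy)

CHARS_PER_PAGE = 2000   # assumption: a typical page of body text
PAGES_PER_WEEK = 1000   # target volume from the original question

for accuracy in (0.98, 0.99):
    per_page = expected_errors(CHARS_PER_PAGE, accuracy)
    per_week = per_page * PAGES_PER_WEEK
    print(f"{accuracy:.0%}: about {per_page:.0f} errors/page, {per_week:.0f} errors/week")
```

Under those assumptions, even 99% accuracy leaves roughly 20 errors on every page, none of which announce where they are.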

The scanner you have is hard to beat. As for the software, I found that the Caere engine was a little better than the OmniPage engine when I first started working with OCR, but over time OmniPage has gotten that little bit extra oomph into it.

Having said that, there are some posts that recommend Abby, a product with which I'm unfamiliar, and state that a trial version is available, so it's probably worth a check.

Finally, one small factor that sometimes is overlooked: what resolution do you scan at? You may want to try lowering the resolution and seeing if that gives any better results. Lowering the resolution can have the effect of smoothing out some of the noise that can confuse OCR engines. Try going all the way down to 200 dpi.

Finally (part two), I've found you can also sometimes tweak the results by playing with the depth -- instead of scanning in b&w, try gray scale (I suggest 4 bit).
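
For readers unsure what 4-bit gray scale means in practice: it keeps 16 gray levels instead of the 2 levels of b&w or the 256 levels of 8-bit gray. A minimal sketch of reducing 8-bit values to 4 bits follows; the function name `quantize_to_4bit` and the sample row are hypothetical, and a scanner driver would normally do this itself.

```python
def quantize_to_4bit(pixels):
    """Map 8-bit gray values (0-255) to 16 levels (0-15) by dropping the low 4 bits."""
    return [[value >> 4 for value in row] for row in pixels]

# Hypothetical single row of 8-bit gray values, from black to white.
row = [[0, 15, 16, 128, 255]]
print(quantize_to_4bit(row))  # [[0, 0, 1, 8, 15]]
```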

Finally (part three), I'm dubious that you'll find anything to handle formulae. For those readers who may be surprised to learn OCR accuracy is not quite up to scratch, just wait until you encounter OCR format preservation.

Good luck -- and if you do get better results, by all means let us all know!

Cheers
Contact BBN - bbn.com by NoSlack · 2002-09-20 06:06 · Score: 2, Informative

I work in the speech recognition field and work with the researchers at BBN a lot (they might sound familiar: can you say @ sign inventors?), and they are always bragging about their OCR, especially for foreign languages. They take a very different approach to OCR (they don't do character recognition at all, but rather word recognition), and thus their accuracy is very, very high. Plus, they are government funded for this sort of thing (can you say NSA?). I would recommend contacting them directly, not just through the website; as you are an educational institution, they will probably offer you a very good price.
Good Luck.