Slashdot Mirror


Accurate OCR?

theBrownfury asks: "I work at a lab on a university campus that provides services for disabled students. One of the main functions of this lab is to convert printed materials such as books, reading packets, etc. into electronic text (RTF or Word) that is either going to be fed to a text-to-speech synthesizer or going to be further processed for use in braille devices. Ideally we'd like to be able to process 1000 pages a week. However, our current solution (a Bell&Howell 4040D scanner coupled to a mid-level PC workstation with OmniPage Pro 11 and 2-3 proofing stations) is limited to an average of 10-11 (16 on a good day) pages per hour because of the constant hand-holding the OCR process requires. We've already made sure we're feeding the OCR engine good quality scans. Also, the materials we deal with are so varied that a majority of them cannot be covered by any kind of 'general' scanning or OCR template."

"Do any of you know of a solution which can exploit our current scanner, which we're rather happy with, but bring in a better OCR method to improve our efficiency? It should be noted that the solution should be financially reasonable (as in less than US$10K).

Our biggest bottlenecks:
- software's terrific inability to accurately pick up the areas of text on the scanned page to OCR
- marking words as possibly erroneous without checking them against a dictionary, which lengthens the proofing process
- stability of OCR software

Bonuses:
- dealing with multiple languages such as Spanish and French
- capability to OCR mathematical texts and papers. Currently we hand type math textbooks for students."
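
The second bottleneck in the question (words flagged as possibly erroneous without a dictionary check) could in principle be reduced with a post-processing pass before proofing. The following is only a minimal Python sketch, assuming the OCR step can export its list of flagged words and that a plain word list is available. The function `filter_suspects` and the sample data are hypothetical and are not part of OmniPage or any other OCR product.

```python
import re


def filter_suspects(suspects, dictionary):
    """Return only the flagged words that are NOT found in the dictionary."""
    known = {w.lower() for w in dictionary}
    still_suspect = []
    for word in suspects:
        # Strip leading/trailing punctuation so "lab," is looked up as "lab".
        core = re.sub(r"^\W+|\W+$", "", word).lower()
        if core and core not in known:
            still_suspect.append(word)
    return still_suspect


# Hypothetical sample data, not output from any real OCR engine.
dictionary = ["the", "lab", "converts", "printed", "books"]
suspects = ["lab,", "c0nverts", "Printed", "bo0ks"]

remaining = filter_suspects(suspects, dictionary)
print(remaining)  # only the words a proofreader still has to look at
```

For the Spanish and French material mentioned under "Bonuses", the same function could be called with a word list for the relevant language.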

1 of 59 comments

  1. Xerox TextBridge Pro by John Sokol · Score: 2, Interesting

    I once had to recover a lost book manuscript from old printouts. The hard drive had crashed. After several iterations I found a good combination and proper settings.

    The scanner I used is a $99 Canon CanoScan FB620P that is several years old.
    I am very impressed with it. For OCR I used Xerox TextBridge Pro; the interface is awkward, but the OCR part works. The biggest problem was the way the Windows TWAIN drivers were set up: I had to go through several windows and mouse clicks to start and finish each scan.

    I can do over 30 pages per hour, and I get about 99.8% accuracy on clean copy. The trick was to use a grayscale scan or text mode. I also scan at 300 DPI; I find it's important to give the OCR as much information as possible to work from.

    You still want to run this past a human proofreader, but overall I am very impressed with the setup and its results.
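
To put the figures in this thread side by side, here is a rough back-of-the-envelope calculation in Python. It uses the numbers quoted above (1000 pages a week, 10-11 versus 30 pages per hour, 99.8% accuracy). Two things are assumptions and not from the posts: that the 99.8% is per character, and that a page holds about 2000 characters. The helper names are made up for illustration.

```python
def hours_for(pages, pages_per_hour):
    """Hours of work needed to process `pages` at a given hourly rate."""
    return pages / pages_per_hour


def expected_errors_per_page(accuracy, chars_per_page):
    """Expected wrong characters per page, treating accuracy as per-character."""
    return (1.0 - accuracy) * chars_per_page


WEEKLY_PAGES = 1000      # target stated in the question
CHARS_PER_PAGE = 2000    # ASSUMPTION: rough size of a page of text

rates = [
    ("10.5 pages/hour (midpoint of 10-11)", 10.5),
    ("30 pages/hour", 30.0),
]
for label, rate in rates:
    hours = hours_for(WEEKLY_PAGES, rate)
    print(f"{label}: {hours:.1f} hours per {WEEKLY_PAGES} pages")

errors = expected_errors_per_page(0.998, CHARS_PER_PAGE)
print(f"about {errors:.0f} expected errors per page at 99.8% accuracy")
```

Under those assumptions the weekly target takes roughly 95 hours at the slower rate and roughly 33 hours at the faster one, and a few errors per page remain, which is why a human proofreading pass is still needed.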

    --
    I am always doing that which I can not do, in order that I may learn how to do it. - Pablo Picasso