Slashdot Mirror


Best OCR for Technical Texts?

An anonymous reader asks: "I'm scanning in user manuals for older lab equipment. I've never used OCR before today, so I installed the Caere Omnipage 9.0 that came with the scanner. I was pretty happy except for a few things. It doesn't seem to want to recognize engineering symbols such as the single-character +/-, square root, and omega, or simple equations; it has trouble with super- and subscripts; and it outputs funky Word files. For example, from an 8.5 x 11 inch original page scanned in at 1 bit and 300 dpi, the output Word file was 10 inches wide, used tons of Omnipage text styles, and didn't match the original text's flow. It did do a good job of italicizing headers and recognizing the various sections in a two-column page. Googling the news and the net just backs up my claims but provides no real solution. Here is a Google search for the best OCR for engineering that turns up nothing useful."

10 of 28 comments

  1. Clara OCR by aster_ken · · Score: 5, Informative

    Have you looked at the open-source Clara OCR? I've used it for some very unusual texts recently. Its accuracy is quite good. Besides that, the proofing mechanisms are great!

    Go here: http://www.claraocr.org/.

    It has very recently been ported to win32, and the community support (via e-mail lists) is excellent.

    1. Re:Clara OCR by PerlGuru · · Score: 2, Informative

      Though I haven't used Clara OCR, I went to that page and it looks like it might work for you. It appears to learn the font for the page: once you tell it what a symbol is, it remembers that and reuses it whenever you tell it to use that font. It looks like something I am definitely going to try. What could it hurt? It's open source, so no money out of pocket.
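
The "tell it once, reuse it for that font" idea described above can be sketched as a toy trainable glyph matcher. This is only an illustration under stated assumptions: it is not Clara OCR's actual algorithm or API, and the `GlyphBook` class, the 5x5 bitmaps, and the `max_distance` tolerance are all hypothetical.

```python
# Toy illustration only: NOT Clara OCR's actual algorithm or API.
# Glyphs are tiny bitmaps written as tuples of strings ('#' = ink, '.' = paper).

def hamming(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


class GlyphBook:
    """Remembers operator-labelled glyphs for one font (hypothetical helper)."""

    def __init__(self, max_distance=2):
        self.known = []  # (bitmap, label) pairs taught so far
        self.max_distance = max_distance  # assumed noise tolerance, in pixels

    def teach(self, bitmap, label):
        self.known.append((bitmap, label))

    def recognize(self, bitmap):
        """Return the label of the closest taught glyph, or None if nothing is close."""
        best_label, best_dist = None, None
        for known_bitmap, label in self.known:
            dist = hamming(bitmap, known_bitmap)
            if best_dist is None or dist < best_dist:
                best_label, best_dist = label, dist
        if best_dist is not None and best_dist <= self.max_distance:
            return best_label
        return None


PLUS_MINUS = (
    "..#..",
    ".###.",
    "..#..",
    ".....",
    ".###.",
)
# The same symbol scanned again with one speck of noise in the top-left corner.
NOISY_PLUS_MINUS = (
    "#.#..",
    ".###.",
    "..#..",
    ".....",
    ".###.",
)

book = GlyphBook()
unknown_before = book.recognize(NOISY_PLUS_MINUS)  # nothing taught yet
book.teach(PLUS_MINUS, "±")
label_after = book.recognize(NOISY_PLUS_MINUS)     # one pixel away from the taught glyph
```

Before any teaching the matcher has nothing to compare against, so an engineering symbol stays unrecognized; after a single labelled example, a slightly noisy copy of the same glyph resolves to the taught label.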

  2. Good Luck! by Asprin · · Score: 3, Interesting


    Good luck!

    I've used a few different versions of OmniPage Pro, and it works OK if the layout is not complicated, it uses standard fonts, the text is clean and clear, and it doesn't have too many weird logos or symbols. You still have to proofread everything and correct it by hand, though, so I'm not convinced it's a time saver as much as it is a typing saver.

    OmniPage Pro does do a MUCH better job of identifying words than the free version they throw in with scanners because it uses spelling and grammar checkers to help ID words from context. The free version is as close to useless as you can get in the software world - it's really just an ad for Pro.

    Engineering and math symbols are right out.

    --
    "Lawyers are for sucks."
    - Doug McKenzie
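
The point above about spelling checkers helping to identify words can be illustrated with a small dictionary-based correction pass. This is a rough sketch, not how OmniPage Pro works internally: the word list, the `correct` helper, and the 0.8 similarity cutoff are assumptions made for the example, and only Python's standard `difflib` module is used.

```python
import difflib

# Hypothetical mini word list; a real checker would use a full dictionary.
WORDS = ["calibrate", "voltage", "resistor", "oscilloscope", "terminal"]


def correct(token, words=WORDS, cutoff=0.8):
    """Return the closest dictionary word, or the token unchanged if none is close."""
    matches = difflib.get_close_matches(token.lower(), words, n=1, cutoff=cutoff)
    return matches[0] if matches else token


# Typical OCR confusions: '1' for 'l', 'c' for 'o', 'rn' for 'm'.
fixed = [correct(t) for t in ["vo1tage", "resistcr", "terrninal", "±5"]]
```

Ordinary words snap back to their dictionary spelling, but a token such as "±5" has no dictionary neighbour and passes through untouched, which fits the observation that symbols get no help from this kind of checking.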
  3. Try Different (tm) by coyote4til7 · · Score: 2, Informative

    Have you tried other combinations of settings (e.g. dpi, bit depth)? That won't solve all of the problems you talk about, but it is important to play with those settings in each package you look at _before_ rating how good it is.

    --

    the clock on the wall says 4 til 7
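
One way to make the "try the settings before rating the package" advice concrete is to score each dpi and bit-depth combination against a hand-typed reference line. The sketch below assumes you already have the OCR output for each combination; the `SAMPLE_OUTPUT` strings are invented purely for illustration and are not measurements from any real package.

```python
import difflib
from itertools import product


def accuracy(ocr_text, truth):
    """Similarity between OCR output and a hand-typed reference, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, ocr_text, truth).ratio()


TRUTH = "Set R1 to 4.7 kΩ ±5%"

# Invented OCR outputs, keyed by (dpi, bit depth); real ones would come
# from running the package under test with each combination of settings.
SAMPLE_OUTPUT = {
    (300, 1): "Set Rl to 4.7 kQ +5%",
    (300, 8): "Set R1 to 4.7 kΩ +5%",
    (600, 1): "Set R1 to 4.7 kQ +5%",
    (600, 8): "Set R1 to 4.7 kΩ ±5%",
}

scores = {
    (dpi, bits): accuracy(SAMPLE_OUTPUT[(dpi, bits)], TRUTH)
    for dpi, bits in product([300, 600], [1, 8])
}
best = max(scores, key=scores.get)
```

Running the same comparison for each package gives a like-for-like number per setting instead of an impression from a single scan.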
  4. Finereader by Marc Boucher · · Score: 3, Informative

    You can try FineReader from ABBYY

  5. Use Greyscale by jayrtfm · · Score: 4, Informative

    Use 8 bit, NOT 1 bit. When I switched from 1 to 8 bit on a page of normal text, the dozen or so errors vanished.

    Since Omnipage is up to version 12, perhaps there's been an improvement since your version.

    Your google skills are sorely lacking; the "Hacking Google" book would be a good investment for you. Dropping the quotes and the word "best" from your search string would help.

    Here are two different free web-based OCR services; just upload a 300 dpi b/w (8-bit greyscale) file:
    http://www.expervision.com/webtr6.htm
    http://docmorph.nlm.nih.gov/docmorph/

    Here are some OCR programs:

    http://www.scansoft.com/omnipage/

    http://www.abbyy.com/

    http://www.newsoftinc.com/redir/digitaloffice_all.asp?category=ocr4

    More OCR links than you really want:
    http://web3.humboldt1.com/~jiva/ocr/_ocr_resource.htm
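
The 1-bit versus 8-bit advice above comes down to how much information each pixel keeps. The arithmetic below uses the 8.5 x 11 inch page at 300 dpi from the question; the thresholding step is a simplified assumption about how a bilevel scan is produced, and the grey values and the threshold of 128 are made up for illustration.

```python
# Letter page from the question: 8.5 x 11 inches at 300 dpi.
WIDTH_IN, HEIGHT_IN, DPI = 8.5, 11, 300

width_px = round(WIDTH_IN * DPI)    # 2550
height_px = round(HEIGHT_IN * DPI)  # 3300
pixels = width_px * height_px

bytes_1bit = pixels // 8   # 1 bit per pixel, uncompressed
bytes_8bit = pixels        # 1 byte per pixel, uncompressed


def to_1bit(grey_row, threshold=128):
    """Threshold 8-bit grey values (0 = black, 255 = white) down to ink / no ink."""
    return [1 if value < threshold else 0 for value in grey_row]


# Made-up grey values across a thin, faint stroke such as a subscript.
faint_stroke = [250, 200, 140, 135, 190, 245]
bilevel = to_1bit(faint_stroke)  # every value is >= 128, so the stroke vanishes
```

The greyscale page is eight times larger uncompressed (8,415,000 bytes against 1,051,875), but it keeps the faint strokes that a fixed threshold throws away before the OCR engine ever sees them.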

    1. Re:Use Greyscale by SeanAhern · · Score: 2, Insightful

      Your google skills are sorely lacking

      No joke! The link in the post doesn't even connect to Google - it's a Yahoo link.

  6. Re:Use Greyscale: With links by 2sleep2type · · Score: 2, Informative
  7. ICR, Google, etc by Strange Ranger · · Score: 2, Informative

    What you really need is ICR, Intelligent Character Recognition. There is a free trial version of one such product here.

    Better Google searching makes the difference.

    --

    Operator, give me the number for 911!
  8. The Best! by FortKnox · · Score: 3, Funny

    The Best OCR scanner is an intern with a pencil. ;-)

    --
    Good quote, too many chars. Seriously, the slashdot 120 char limit sucks!