Slashdot Mirror


Best OCR for Technical Texts?

An anonymous reader asks: "I'm scanning in user manuals for older lab equipment. I've never used OCR before today, so I installed the Caere Omnipage 9.0 that came with the scanner. I was pretty happy except for a few things. It doesn't seem to want to recognize engineering symbols like the one-character +/-, square root, omega, or simple equations, it has trouble with super- and subscripts, and it outputs funky Word files. For example, from an 8.5 x 11 original page scanned in at 1 bit at 300 dpi, the output Word file was 10 inches wide, used tons of Omnipage text styles and didn't match the original text's flow. It did do a good job of italicizing headers and recognizing the various sections in a two column page. Googling the news and net just backs up my claims but provides no real solution. A Google search for the best OCR for engineering turns up nothing useful."

1 of 28 comments

  1. Hacking Google on the Cheap by fm6 · Score: 1, Offtopic
    Your google skills are sorely lacking, the "Hacking Google" book would be a good investment for you. Eliminating the quotes and word "best" in your search string would help.
    I don't think you need to read a book to understand that too many keywords eliminate all useful results. Also, the Yahoo engine is not quite the same as the Google engine, even though it's licensed from Google. Which is why it didn't catch the fact that "superscipts" is not the correct spelling!

    I got a lot of interesting results Googling for "ocr superscripts symbols".

    Here's my (non-copyrighted) strategy for doing a Google search. Google is fiendishly fast (which I find mind-boggling, given the size of the database!), so there's no reason not to play around. Start with an absolute minimum of keywords. If your results are too broad, add one or two keywords and search again. Iterate until you have useful results or you reach a dead end. If you do reach a dead end, the browser's "back" button is a convenient way to back out to a broader search.
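
    For anyone who thinks better in code, that loop can be sketched roughly as below. This is only an illustration of the strategy, not anything Google provides: `refine_search`, `count_hits`, `DOCS`, and the `max_hits` cutoff are all made-up names, and the "search engine" is a five-line toy corpus standing in for the real thing.

```python
from typing import Callable, List, Sequence

# Toy stand-in for a search index (hypothetical data, for illustration only).
DOCS = [
    "ocr software review",
    "ocr superscripts symbols engineering",
    "ocr symbols recognition",
    "ocr superscripts symbols math",
    "best ocr",
]


def count_hits(query: List[str]) -> int:
    """Count toy documents that contain every keyword in the query."""
    return sum(all(word in doc.split() for word in query) for doc in DOCS)


def refine_search(
    keywords: Sequence[str],
    hits: Callable[[List[str]], int],
    start: int = 1,
    max_hits: int = 1,
) -> List[str]:
    """Start with a minimal query and add one keyword at a time.

    Stops as soon as the result count is small enough. A keyword that
    leads to zero results is a dead end, so it is dropped (the "back
    button" step) and the next candidate is tried instead.
    """
    query = list(keywords[:start])
    for extra in keywords[start:]:
        if hits(query) <= max_hits:
            break  # results are already narrow enough
        narrower = query + [extra]
        if hits(narrower) == 0:
            continue  # dead end: back out to the broader query
        query = narrower
    return query


candidates = ["ocr", "superscripts", "symbols", "manuals", "engineering"]
final_query = refine_search(candidates, count_hits)
print(" ".join(final_query))
```

    In the toy run, "manuals" matches nothing, so it gets backed out and the search carries on with the remaining keywords. The order of the candidate keywords matters, which is the same judgment call you make by hand.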

    I find the Google Toolbar indispensable. It has a lot of features, but only three that I ever use:

    • A handy search text/list box. Not only does this save steps while entering a search string, it automatically syncs itself with any Google search you enter, even if you do it just by back-buttoning out to a previous Google page.
    • A "search this site only" button.
    • Automatically generated buttons that search the current page for your search terms. These are real time-and-aggravation savers on a lengthy search.
    I also use the uplevel button, but that's really a patch for a missing Internet Explorer feature.

    If you're a die-hard Netscape/Mozilla person, there's a Sidebar with most of these features. Notably missing are the automatic term buttons -- the main reason I still use Internet Explorer.