Best OCR for Technical Texts?

Posted by Cliff on Thursday May 8, 2003 @02:14AM from the picking-up-on-the-odd-symbology dept.

An anonymous reader asks: "I'm scanning in user manuals for older lab equipment. I've never used OCR before today, so I installed the Caere Omnipage 9.0 that came with the scanner. I was pretty happy except for a few things. It doesn't seem to want to recognize engineering symbols like the single-character +/-, square root, omega, or simple equations; it has trouble with super- and subscripts; and it outputs funky Word files. For example, from an 8.5 x 11 original page scanned in at 1 bit at 300 dpi, the output Word file was 10 inches wide, used tons of Omnipage text styles and didn't match the original text's flow. It did do a good job of italicizing headers and recognizing the various sections in a two column page. Googling the news and net just backs up my claims but provides no real solution. A Google search for "best OCR for engineering" provides nothing useful."

28 comments

Clara OCR by aster_ken · 2003-05-08 02:22 · Score: 5, Informative

Have you looked at the open-source Clara OCR? I've used it for some very unusual texts in the recent past. Its accuracy is quite good. Besides that, the proofing mechanisms are great!

Go here: http://www.claraocr.org/.

It has very recently been ported to win32, and the community support (via e-mail lists) is excellent.
1. Re:Clara OCR by PerlGuru · 2003-05-08 02:52 · Score: 2, Informative
  
  Though I haven't used Clara OCR, I went to that page and it looks like it might work for you. It looks like it learns the font for the page: once you tell it what a symbol is, it remembers that and reuses it whenever you tell it to use that font. Looks like something I am definitely going to try. What could it hurt? It's open source, so no money out of pocket.
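The "tell it once, reuse it everywhere" idea described above can be sketched in a few lines of Python. This is only a toy illustration, not Clara OCR's code or API: the class name `GlyphTrainer`, its methods, and the tiny bitmaps are all made up for the example, and real OCR needs fuzzy matching rather than exact bitmap lookup.

```python
class GlyphTrainer:
    """Toy train-once symbol table (hypothetical; not Clara OCR's API)."""

    def __init__(self):
        self.known = {}  # glyph bitmap -> character the operator assigned

    def teach(self, bitmap, char):
        """Record the operator's label for one glyph bitmap."""
        self.known[bitmap] = char

    def recognize(self, bitmaps):
        """Return text for a sequence of glyphs; unknown glyphs become '?'."""
        return "".join(self.known.get(bitmap, "?") for bitmap in bitmaps)


# Made-up 3x5 bitmaps; real glyphs are larger and never match exactly.
PLUS_MINUS = (" # ", "###", " # ", "   ", "###")
FIVE = ("###", "#  ", "###", "  #", "###")

trainer = GlyphTrainer()
page = [PLUS_MINUS, FIVE]
before = trainer.recognize(page)  # nothing taught yet, so every glyph is '?'
trainer.teach(PLUS_MINUS, "±")
trainer.teach(FIVE, "5")
after = trainer.recognize(page)   # both glyphs are now known
print(before, after)
```

The point is only that a symbol the engine does not ship with (like the plus-minus sign) costs one manual correction per font, not one per occurrence.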
Try spelling superscripts correctly by keesh · 2003-05-08 02:28 · Score: 0, Insightful

That might help slightly...
Good Luck! by Asprin · 2003-05-08 02:32 · Score: 3, Interesting

Good luck!

I've used a few different versions of Omnipage PRO, and it works OK if the layout is not complicated, it uses standard fonts, the text is clean and clear and it doesn't have too many weird logos or symbols. You still have to proofread everything and correct it by hand, though, so I'm not convinced it's a time saver as much as it is a typing saver.

OmniPage Pro does do a MUCH better job of identifying words than the free version they throw in with scanners because it uses spelling and grammar checkers to help ID words from context. The free version is as close to useless as you can get in the software world - it's really just an ad for Pro.
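As a rough illustration of how a spelling dictionary can rescue misread words, here is a minimal Python sketch using the standard library's `difflib`. It is not how OmniPage Pro works internally (that is not documented here); the vocabulary, the `correct` helper, and the sample misreads are assumptions chosen for the example.

```python
import difflib

# Hypothetical vocabulary; a real checker would use a full dictionary.
VOCABULARY = ["manual", "omega", "scanner", "equation"]


def correct(word, vocabulary=VOCABULARY, cutoff=0.6):
    """Replace an OCR'd word with its closest dictionary entry, if any."""
    if word in vocabulary:
        return word  # already a known word, leave it alone
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word  # no close match: keep the raw text


# Classic OCR confusion: the letter "m" read as "rn".
print(correct("rnanual"))
print(correct("ornega"))
```

This also shows why symbols and equations fare badly: there is no dictionary entry to pull a misread symbol back toward.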

Engineering and math symbols are right out.

--
"Lawyers are for sucks."
- Doug McKenzie
Try Different (tm) by coyote4til7 · 2003-05-08 02:33 · Score: 2, Informative

Have you tried other combinations of settings (e.g. dpi, bit depth)? That won't solve all of the problems you talk about, but playing with those settings in each package you look at _before_ rating how good it is is important.

--

the clock on the wall says 4 til 7
Finereader by Marc+Boucher · 2003-05-08 02:37 · Score: 3, Informative

You can try FineReader from ABBYY
Use Greyscale by jayrtfm · 2003-05-08 02:40 · Score: 4, Informative

Use 8 bit, NOT 1 bit. When I switched from 1 to 8 bit on a page of normal text, the dozen or so errors vanished.
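One plausible reason, sketched in Python: a 1-bit scan applies a fixed threshold in the scanner, so faint strokes vanish before the OCR engine ever sees them, while an 8-bit scan keeps the gray levels and lets the software pick its own cutoff. The scanline values and both thresholds below are invented for illustration.

```python
def binarize(row, threshold=128):
    """Map 8-bit gray values (0 = black, 255 = white) to 1-bit ink flags."""
    return [1 if value < threshold else 0 for value in row]


# Hypothetical scanline: a faint thin stroke (150, 140), then a bold one.
scanline = [255, 250, 150, 140, 250, 255, 40, 10, 35, 255]

fixed = binarize(scanline)                 # what a 1-bit scan would store
tuned = binarize(scanline, threshold=200)  # cutoff chosen from the gray data

print(fixed)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  faint stroke is gone
print(tuned)  # [0, 0, 1, 1, 0, 0, 1, 1, 1, 0]  faint stroke survives
```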

Since Omnipage is up to version 12, perhaps there's been an improvement since your version.

Your google skills are sorely lacking, the "Hacking Google" book would be a good investment for you. Eliminating the quotes and word "best" in your search string would help.

2 different free web-based OCR services, just upload a 300 dpi b/w (8-bit greyscale) file
http://www.expervision.com/webtr6.htm
http://docmorph.nlm.nih.gov/docmorph/

here are some OCR programs

http://www.scansoft.com/omnipage/

http://www.abbyy.com/

http://www.newsoftinc.com/redir/digitaloffice_all.asp?category=ocr4

more ocr links than you really want
http://web3.humboldt1.com/~jiva/ocr/_ocr_resource.htm
1. Re:Use Greyscale by SeanAhern · 2003-05-08 10:38 · Score: 2, Insightful
  
  Your google skills are sorely lacking
  
  No joke! The link in the post doesn't even connect to Google - it's a Yahoo link.
Re:Use Greyscale: With links by 2sleep2type · 2003-05-08 03:52 · Score: 2, Informative

All links that work as links
http://www.expervision.com/webtr6.htm
http://docmorph.nlm.nih.gov/docmorph/
here are some OCR programs
http://www.scansoft.com/omnipage/
http://www.abbyy.com/
more ocr links than you really want
http://web3.humboldt1.com/~jiva/ocr/_ocr_resource.htm
Abby by gmiller123456 · 2003-05-08 04:39 · Score: 1

I can't speak for the rest of the programs you mention, but Abbyy doesn't recognize equations very well at all. (Based on the last time I tried which was about a year ago).
ICR, Google, etc by Strange+Ranger · 2003-05-08 05:10 · Score: 2, Informative

What you really need is ICR, Intelligent Character Recognition. There is a free trial version of one such product here.

Better Google searching makes the difference.

--

Operator, give me the number for 911!
1. Re:ICR, Google, etc by g4dget · 2003-05-09 11:09 · Score: 1
  
  "ICR" is a meaningless marketing buzzword, not a specific feature of an OCR package.
Good all'round scanner? by evilad · 2003-05-08 05:47 · Score: 1

Anyone know of a decent home-use scanner with a letter-size sheetfeeder?

Can it also take a stack of 4x5 photos?
The Best! by FortKnox · 2003-05-08 05:56 · Score: 3, Funny

The Best OCR scanner is an intern with a pencil. ;-)

--
Good quote, too many chars. Seriously, the slashdot 120 char limit sucks!
1. Re:The Best! by Anonymous Coward · 2003-05-08 09:21 · Score: 0
  
  Why would they need a pencil? Surely a keyboard.
better OCR... by sribe · 2003-05-08 06:26 · Score: 1

Oh boy, it's been a long time, but when I evaluated OCR packages I found that each of the 3 major choices at the time (OmniPage, TypeReader, and one whose name I no longer remember and which is no longer available) could do better than the others on certain types of my test documents. So you might want to get a trial of TypeReader (www.expervision.com) and try it out, in addition to FineReader as suggested by another poster.
1. Re:better OCR... by pgf · 2003-05-11 16:20 · Score: 1
  
  I recently completed a project to OCR many 10s of thousands of technical documents. I ended up choosing TypeReader. I have to say that TypeReader beat the hell out of every other OCR program I found. It was extremely fast, extremely accurate, had a bunch of output formats to choose from, and offers a batch mode that worked wonders for what I had to work with.
  
  I HIGHLY recommend TypeReader!
  
  Paul
OCR Software by heathenor · 2003-05-08 07:49 · Score: 1

ABBYY FineReader
Hacking Google on the Cheap by fm6 · 2003-05-08 09:35 · Score: 1, Offtopic
Your google skills are sorely lacking, the "Hacking Google" book would be a good investment for you. Eliminating the quotes and word "best" in your search string would help.
I don't think you need to read a book to understand that too many keywords eliminate all useful results. Also, the Yahoo engine is not quite the same as the Google engine, even though it's licensed from Google. Which is why it didn't catch the fact that "superscipts" is not the correct spelling!
I got a lot of interesting results Googling for "ocr superscripts symbols".
Here's my (non-copyrighted) strategy for doing a Google search. Google is fiendishly fast (which I find mind-boggling, given the size of the database!), so there's no reason not to play around. Start with an absolute minimum of keywords. If your results are too broad, add one or two keywords and search again. Iterate until you have useful results or you reach a dead end. If you do reach a dead end, the browser's "back" button is a convenient way to back out to a broader search.
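The strategy above can be sketched as a small Python loop. This is a toy model, not anything Google exposes: the five-title "index", the `search` and `refine` helpers, and the `max_results` cutoff are all made up to show the start-narrow, add-a-keyword, back-out-on-a-dead-end cycle.

```python
def search(corpus, keywords):
    """Return the documents that contain every keyword (toy AND search)."""
    return [doc for doc in corpus if all(k in doc.lower() for k in keywords)]


def refine(corpus, keywords, max_results=1):
    """Start with one keyword; add more only while the results are too broad."""
    active = keywords[:1]
    results = search(corpus, active)
    for extra in keywords[1:]:
        if len(results) <= max_results:
            break  # results are already narrow enough
        narrower = search(corpus, active + [extra])
        if not narrower:
            continue  # dead end: back out and keep the broader search
        active, results = active + [extra], narrower
    return active, results


# Hypothetical page titles standing in for a search index.
corpus = [
    "ocr software review",
    "ocr superscripts and symbols in equations",
    "ocr symbols for engineering manuals",
    "ocr for handwriting",
    "scanner buying guide",
]

used, hits = refine(corpus, ["ocr", "symbols", "best", "superscripts"])
print(used)  # "best" is dropped because it leads to a dead end
print(hits)
```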
I find the Google Toolbar indispensable. It has a lot of features, but only three that I ever use:
- A handy search text/list box. Not only does it save steps while entering a search string, it automatically syncs itself with any Google search you enter, even if you do it just by back-buttoning out to a previous Google page.
- A "search this site only" button.
- Automatically generated buttons that search the current page for your search terms. These are real time-and-aggravation savers on a lengthy search.
I also use the uplevel button, but that's really a patch for a missing Internet Explorer feature.
If you're a die-hard Netscape/Mozilla person, there's a Sidebar with most of these features. Notably missing are the automatic term buttons -- main reason I still use Internet Explorer.
Keep as image by etn991 · 2003-05-08 13:19 · Score: 1

I've found that the best is to leave the scans as image files and bundle them together as a multipage TIFF or PDF file. It takes more space to store them, but you don't have to mess around with OCG (Optical Character Guessing).

Easy to access and read. The only loss is you can't do cut and paste or text searching.
1. Re:Keep as image by Anonymous Coward · 2003-05-09 04:00 · Score: 0
  
  That's not a terrible idea... in some cases it might make sense to do both, like you could distribute the OCRed version but then also keep the PDF file around in case of a fuck-up in the OCRed text.
2. Re:Keep as image by g4dget · 2003-05-09 11:15 · Score: 1
  
  Software like Acrobat will keep scans as images but still do OCR to let you do searching, cutting, and pasting.
do not store in OCR'ed format by g4dget · 2003-05-09 08:38 · Score: 1

Scan your manuals in 300dpi or 600dpi grayscale and archive them that way. This will allow you to go back and use better OCR as OCR technology improves.
Then, use something like Adobe Acrobat to put them on-line: Acrobat uses OCR internally to make the text searchable, but it still displays the original page image. That means that formulas and appearance will be preserved even if the OCR screws up.
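For a feel of what archiving grayscale costs in disk space, here is a back-of-the-envelope Python calculation for a letter-size page. The helper name `raw_scan_bytes` is made up, and the figures are uncompressed pixel data only; actual TIFF or PDF files will usually be smaller once compressed.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page, ignoring headers and padding."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Letter-size page (8.5 x 11 inches) at the settings mentioned in this thread.
one_bit_300 = raw_scan_bytes(8.5, 11, 300, 1)  # 1,051,875 bytes
gray_300 = raw_scan_bytes(8.5, 11, 300, 8)     # 8,415,000 bytes
gray_600 = raw_scan_bytes(8.5, 11, 600, 8)     # 33,660,000 bytes

for label, size in [("300 dpi, 1 bit", one_bit_300),
                    ("300 dpi, 8 bit", gray_300),
                    ("600 dpi, 8 bit", gray_600)]:
    print(f"{label}: {size / 1_000_000:.1f} MB per page")
```

So 8-bit grayscale is 8 times the raw size of 1-bit at the same resolution, and doubling the resolution quadruples it again.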
You need to use Omnipage 12... by Anonymous Coward · 2003-05-09 12:34 · Score: 0

the newest version is much improved.

I have not tried to use it on technical articles, but it is very good at scanning normal documents and keeping both the fonts and styles intact.
Primal Instincts by metalligoth · 2003-05-09 13:18 · Score: 1

Prime OCR is what I have always seen in corporate sectors. At both Ford Motor Company and Pfizer Drugs, it is used for OCR on very complex technical documents.
http://www.totalsol.com/products/doc_process/prime recognition/product_prime.html
1. Re:Primal Instincts by metalligoth · 2003-05-09 13:29 · Score: 1
  
  Oops! Take out the space in that URL... Sorry!
Gamera? by mvidal01 · 2003-05-17 04:45 · Score: 1

How about Gamera?
Gamera: A Python-based Toolkit for Structured Document Recognition
http://dkc.mse.jhu.edu/gamera/papers/gamera_python_2002/gamera.html