Accurate OCR?


Posted by Cliff on Thursday September 19, 2002 @04:30AM from the that's-an-'a'-not-an-'o' dept.

theBrownfury asks: "I work at a lab on a university campus that provides services for disabled students. One of the main functions of this lab is to convert printed materials such as books, reading packets, etc. into electronic text (RTF or Word) that is either going to be fed to a text-to-speech synthesizer or going to be further processed for use in braille devices. Ideally we'd like to be able to process 1000 pages a week. However, our current solution (a Bell&Howell 4040D scanner coupled to a mid-level PC workstation with OmniPage Pro 11 and 2-3 proofing stations) is limited to an average of 10-11 (16 on a good day) pages per hour because of the constant hand-holding the OCR process requires. We've already made sure we're feeding the OCR engine good-quality scans. I should also clarify that the materials we deal with are so varied that most of them cannot be covered by any 'general' scanning or OCR template."

"Do any of you know of a solution which can exploit our current scanner, which we're rather happy with, but bring in a better OCR method to improve our efficiency? The solution should also be financially reasonable (as in less than US$10K).

Our biggest bottlenecks:
- software's terrific inability to accurately pick up the areas of text on the scanned page to OCR
- marking words as possibly erroneous without checking them against a dictionary, which lengthens the proofing process
- stability of OCR software

Bonuses:
- dealing with multiple languages such as Spanish and French
- capability to OCR mathematical texts and papers. Currently we hand type math textbooks for students."
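
The second bottleneck in the list above (words flagged as suspect without a dictionary check) can be illustrated with a rough Python sketch. It is not how OmniPage or any other package works internally; the tiny `KNOWN_WORDS` set and the `words_to_proof` helper are made up for illustration. The idea is just that a proofer only needs to look at words that fail a dictionary lookup.

```python
import re

# Hypothetical word list; a real setup would load a full dictionary file.
KNOWN_WORDS = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def words_to_proof(ocr_text, known_words=KNOWN_WORDS):
    """Return OCR'd words that are not in the dictionary, in order of appearance."""
    tokens = re.findall(r"[A-Za-z0-9']+", ocr_text)
    return [t for t in tokens if t.lower() not in known_words]

# Only the two damaged words are handed to the human proofer.
print(words_to_proof("The qu1ck brown f0x jumps over the lazy dog"))
```

A check like this misses misreads that happen to be real words, so it narrows the proofing work rather than replacing it.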

59 comments

for those with vision problems... by Anonymous Coward · 2002-09-19 04:33 · Score: 0

for those students that require large-print - why not just give them a decent magnifying lens?
There is no perfect OCR software by tchuladdiass · 2002-09-19 04:36 · Score: 2, Insightful

I've heard that it is often cheaper to send the material to a data entry company (which uses overseas labour) than to use OCR software, since you have to spend so much time proofreading and correcting. I've always thought that OmniPage was one of the most accurate packages out there, so since that's what you already use, I don't think you're going to get much better. Of course, it's been several years since I've worked with any OCR, so the state of the art may have changed since then.
1. Re:There is no perfect OCR software by LauFu · 2002-09-19 06:08 · Score: 1
  
  Yeah, but if you send it overseas to be OCR'd and proof-read, it could come back with phrases like: "All your base are belong to us." or "Someone set us up the bomb!" ;-)
  
  --
  LauFu http://www.everythinggeek.com
2. Re:There is no perfect OCR software by mwolff · 2002-09-19 08:28 · Score: 0
  
  I thought the 99% accurate OCR technology my new scanner's box was advertising would be great. After it came home and was installed I realized what that really meant: 1 out of 100 characters was wrong. With the documents I was intending to OCR, which were really long, well, I got sad...
3. Re:There is no perfect OCR software by david+duncan+scott · 2002-09-19 10:08 · Score: 2
  
  And just imagine what it's like for people who want to scan in financial records and such, where a "0" instead of a "O" doesn't jump right out at you or get caught by a spell-checker.
  
  --
  This next song is very sad. Please clap along. -- Robin Zander
I have the solution ... wait ... by twoflower · 2002-09-19 04:39 · Score: 1, Troll

Perfect or near-perfect OCR is one of the holy grails of information technology. Various companies are therefore constantly coming up with the "next big thing" and applying the latest buzzwords to the problem. I can remember when perfect OCR was just around the corner due to "fuzzy logic", then it was just around the corner due to "neural nets", then it was coming soon because of "heuristic analysis", then ... ad infinitum, ad nauseam.

I don't think we'll ever have near-perfect OCR. 90% is as good as it gets.

--
Twoflower
1. Re:I have the solution ... wait ... by RevAaron · 2002-09-19 05:35 · Score: 2
  
  It's gotta get better than 90%! The handwriting recognition system of the Newton OS (now found in OS X as "Inkwell") managed around 99% for my messy handwriting. I know that HWR != OCR, but one would imagine that recognizing much more readable printed words would be easier than my inconsistent and messy handwriting. (and yes, Newton OS HWR does rely on a neural net that learns your handwriting style. :) )
  
  --
  
  Working toward a usable PDA environment in the spirit of Newton OS: Dynapad
2. Re:I have the solution ... wait ... by afidel · 2002-09-19 08:43 · Score: 2
  
  OCR vendors have claimed 99.9% accuracy for printed material for some time. The only problem is that this still averages out to about 1 error per page, and the reality is closer to 95%, way too low to be useful, as the original poster shows with the pages/hour count.
  
  --
  There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
text-to-speech math stuff? by GuyMannDude · 2002-09-19 05:04 · Score: 5, Funny

reading packets, etc. into electronic text(RTF or Word) that is either going to be fed to a text-to-speech synthesizer or going to be further processed for use in braille devices.

- capability to OCR mathematical texts and papers. Currently we hand type math textbooks for students."

I pity the kids who are going to have to listen to "fluid dynamics on tape":
"Partial rho partial t plus rho times left parenthesis partial u partial x plus partial v partial y plus partial w partial z right parenthesis equals zero".
GMD

--
watch this
1. Re:text-to-speech math stuff? by Anonymous Coward · 2002-09-20 08:11 · Score: 0
  
  We don't need your pity. What you describe
  is how we learn.
US Postal Service by crazymennonite · 2002-09-19 05:06 · Score: 3, Insightful

Perhaps some research into the US Postal Service OCR developers would be useful. Their systems are obviously huge, well funded, and exceptionally accurate considering the volume of mail. I don't know how they maintain it, if it's an internal group, or a contract with external developers, but whoever has it, has got a good thing.
1. Re:US Postal Service by duffbeer703 · 2002-09-19 05:18 · Score: 4, Informative
  
  The USPS has a very tightly defined set of data that it needs to scan. (ie zipcodes)
  
  If there is more than a slight chance of a misread, then the machines automatically send the envelope to a human reader, who keys in the zip.
  
  --
  Conformity is the jailer of freedom and enemy of growth. -JFK
2. Re:US Postal Service by perljon · 2002-09-19 05:19 · Score: 1
  
  My dad maintains it. :-)
  
  But he just replaces components when they break. All the research, etc. is done by exterior companies.
  
  --
  This isn't the sig you are looking for... Carry on...
3. Re:US Postal Service by Anonymous Coward · 2002-09-19 08:41 · Score: 0
  
  Oh, you mean the cluster of linux boxes next to me.....
  
  As pointed out, these generally sort zip codes, with manual sorting of street number etc.
  Accuracy is subjective, as a bad guess may send an envelope to Topeka instead of Tulsa, where your user probably would not be fazed by a simple misread.
  I don't know the ins and outs of the OCR software, but they use a lot of test images, hours of testing, and a boatload of money to write custom software. I'm sure LMCO would love to write an app for you ;)
4. Re:US Postal Service by thogard · 2002-09-19 18:45 · Score: 1
  
  Their current system will read the entire address and do some sanity checks on the zip code vs the street address. It will then barcode it using a delivery point barcode. The system is quite impressive considering it can read handwriting that I can't.
  
  I think that the US post office has 9 digit post codes for cities in other countries and I am looking for someone in the US to verify this by sending me a properly bar coded letter. If you're game, email me.
5. Re:US Postal Service by gordie · 2002-09-20 06:06 · Score: 1
  
  Even with the very large amount of money that USPS has spent developing its OCR system, just to read a single string of numbers, they still employ thousands of workers to hand enter the zip codes that the system cannot read. Those it can read are bar coded and sent on, but a very non-trivial number of letters are "kicked" to the long rows of human operators.
Here's a few suggestions by dbrutus · 2002-09-19 05:08 · Score: 4, Insightful

For longer texts, it might be worth it to call the publisher and ask if they have an electronic version available. Why reinvent the wheel if you don't have to?

Another solution might be stretching your budget by doing your proof-reading offshore.
1. Re:Here's a few suggestions by sql*kitten · 2002-09-20 00:43 · Score: 2
  
  For longer texts, it might be worth it to call the publisher and ask if they have an electronic version available.
  
  An electronic copy definitely will exist - no-one typesets books by hand any more. It is highly likely in fact that the book was written with a mainstream word processing program like Word and a final draft exists in this format. You only need a) to make a strong enough case for them to let you have it and b) a technique for converting it into a format that you can use. The latter probably already exists too.
2. Re:Here's a few suggestions by dbrutus · 2002-09-20 07:28 · Score: 3
  
  My guess is that creating a preferred publisher list made up of publishers willing to be "good corporate citizens" by helping with ADA reasonable accommodations might do the trick, especially if you made it a national list for all universities.
  
  Nobody wants to be viewed as being nasty to the disabled.
No Google link? by Numeric · 2002-09-19 05:09 · Score: 0, Flamebait

Hmmm...this is one of those rarities that a response hasn't said "Did you do a search on Google?"

--
-- ladies and gentlemen we are floating in space!
I have more. by sharkey · 2002-09-19 05:18 · Score: 3, Funny
Accurate OCR
- Military Intelligence
- Bureaucratic Efficiency
- Microsoft Works
- Political Integrity
--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
1. Re:I have more. by Anonymous Coward · 2002-09-20 03:31 · Score: 0
  
  Microsoft Innovation
  
  Duh
Achieve 100% Accurecy by zulux · 2002-09-19 05:31 · Score: 5, Funny

Most OCR systems can only give you 98% accurcy, but we've foung that by running the output through cmdr_taco's spelling and gramer checker, that the accurcy is bumped up to 100%.

Just like this post!

--
Moneyed corporations, non-working 'poor' and criminal prisoners are turning productive citizens into tax-slaves.
Abby FineReader... by greenhide · 2002-09-19 05:34 · Score: 4, Informative

In regards to accuracy: I've tested and compared OmniPage Pro to Abby FineReader and Abby is much, much better at text recognition. It doesn't offer as many export formats as OmniPage Pro does, but it does include an SDK, so if you can get your hands on some programmers you might be able to fiddle with it some. Abby is definitely a step up from OmniPage.

dealing with multiple languages such as Spanish and French

I'm pretty sure that Abby FineReader has language modules, so you can scan works in many languages.

--
Karma: Chevy Kavalierma.
1. Re:Abby FineReader... by bootprom · 2002-09-19 05:49 · Score: 2, Informative
  
  I'd have to agree. I work for a document management software company and we sometimes work with a third party company called Kofax. They provide scanning and OCR. It just so happens that they license their OCR engine from the same people who make Omnipage (scansoft?). We have some clients that are using that engine to scan and OCR 100,000 documents a day. While people do report problems, it works very well for the most part, and there will always be some problems with ocr - at least for the foreseeable future.
  
  Dan
2. Re:Abby FineReader... by Cy+Guy · 2002-09-19 06:28 · Score: 2, Informative
  
  I've tested and compared OmniPage Pro to Abby FineReader
  
  You can also download a fully functional demo version that will run 15 times. So it couldn't hurt to give it a try.
  
  I'm pretty sure that Abby FineReader has language modules, so you can scan works in many languages
  
  I'll say, in fact it supports the following: Armenian (Eastern), Armenian (Grabar), Armenian (Western), Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Dutch (Belgian), Estonian, Finnish, French, German, German (new spelling), Greek, Hungarian, Italian, Latvian, Lithuanian, Norwegian (Bokmal), Norwegian (Nynorsk), Polish, Portuguese, Portuguese (Brazilian), Romanian, Russian, Slovak, Spanish, Swedish, Tatar, Turkish, and Ukrainian. Only European languages, but still impressive.
  
  --
  Work for Change & GET PAID!
3. Re:Abby FineReader... by Quixote · 2002-09-19 14:53 · Score: 2
  
  IIRC, you can download a demo of Abbyy's OCR program and try it out yourself, on your own docs.
  If you want free, head on over to the National Library of Medicine's DocMorph page. You can upload TIFF files, and have them converted to plain text in about 15 seconds. Not bad for 'free', I think.
4. Re:Abby FineReader... by technos · 2002-09-20 07:07 · Score: 2
  
  Definitely agree. I evaluated a bunch of them, and despite having much more experience with OmniPage (not to mention a free copy) I went out and bought a copy of Finereader to use on books I scanned for Project Gutenberg. It was faster, more reliable in terms of output, and it wouldn't balk at taking 1000 page .tiff files.
  
  --
  .sig: Now legally binding!
5. Re:Abby FineReader... by Thornae · 2002-09-20 07:14 · Score: 2
  
  Just adding my own vote of confidence. I used a free older version of Abby from PC Plus (uk mag) coverdisc to scan in a bunch of stuff on my gf's PC for her literature review, and it was pretty damn good.
  I'll normally go for OSS solutions if I can, but there wasn't anything I could find in the realm of OCR that compared in terms of ease of use and accuracy. If I ever need to do lots of scanning, I'll be investing in a copy of Abby...
  
  --
  |>
  Here be Dragons
OpenBook Ruby by edbarrett · 2002-09-19 05:45 · Score: 2, Informative

OpenBook Ruby from Freedom Scientific has served us pretty well. It's a combo scanner/screen reader program. We have it set up for use on a public workstation and it's very accurate. We're still using the 4.0 version, but it appears to be up to version 6.0 now (with built in scan to MP3 conversion!)
How handwriting recognition is easier by yerricde · 2002-09-19 06:03 · Score: 2, Insightful

one would imagine that recognizing much more readable printed words would be easier than my inconsistent and messy handwriting.

However, with handwriting on a PDA screen, the software gets additional information: the order of the strokes. For instance, if you always write one letter clockwise and another counterclockwise, the software can use that to help distinguish the letters. Print can't do that.

--
Will I retire or break 10K?
Aphex Twin? by yerricde · 2002-09-19 06:06 · Score: 1

with built in scan to MP3 conversion!

Do you mean through text-to-speech, or through bitmap-to-Aphex-Twin's-face?
Yes, I know it's the former, but I was making a joke.

--
Will I retire or break 10K?
Google does this. by adolf · 2002-09-19 06:06 · Score: 3, Insightful

Why not ask Google how they do it?

They've got a number of image-based paper catalogs online and searchable, and thus OCR'd.

Talk about varied formatting. It seems to be reasonably accurate, and I'm sure that the process is pretty streamlined -- everything else they do seems to be...

Here is an example.

--
Kid-proof tablet..
1. Re:Google does this. by Bald+Wookie · 2002-09-19 10:50 · Score: 2, Insightful
  
  This used to impress the hell out of me. Then I realized how they can appear to deliver perfect OCR:
  
  You don't know what you're missing.
  
  If the OCR fails, you don't get the hit. So long as you never see any false positives, the OCR appears to be batting 1000. In reality there might be a few catalogs that it misses because the OCR didn't work. You just never know.
  
  Compare this to OCRing a document. Every error stands out.
  
  Don't get me wrong, I'm still impressed by Google. They are just solving the 'easier' side of the problem.
Clara OCR by aster_ken · 2002-09-19 06:13 · Score: 1

Clara OCR, though developed for *nix, is a decent OCR program. You should try it. We've used it to OCR old inventory sheets (some handwritten) with fairly accurate results. It now has Win32 binaries available, too. Here is their homepage.
just curious by rodentia · 2002-09-19 06:14 · Score: 1

Is anyone not turning hard-working americans into tax slaves?

--
illegitimii non ingravare
Do not lock yourself with .doc by InodoroPereyra · 2002-09-19 06:23 · Score: 3, Insightful

A bit off your question, but I think you may want to consider this. If you have the choice:
... reading packets, etc. into electronic text(RTF or Word) ...

you will do yourself and your lab a big favor if you choose RTF. RTF is documented, so you do not lock yourself in with a single vendor (Microsoft) for further processing of the electronic data. It may not matter now, but it could be very important for you guys at some point in the future ...
1. Re:Do not lock yourself with .doc by greenhide · 2002-09-19 07:04 · Score: 1
  
  Or how about XML?
  
  My guess is that if this content is geared toward the vision impaired, niceties like hanging indents, styles, etc. aren't that important. In such a case, using XML with perhaps a few style-based tags (such as <B>, <I>, <U>, etc.) should do the trick.
  
  The nice thing about XML data is it's relatively clean, it's easy to convert to other formats, and can be made to store data or metadata as well.
  
  --
  Karma: Chevy Kavalierma.
2. Re:Do not lock yourself with .doc by MrBoombasticfantasti · 2002-09-20 00:18 · Score: 1
  
  XML sucks, for more than one reason:
  
  The format is horrendously inefficient;
  
  Without a description of the tags used and the relation between them they are just as unreadable as anything else;
  
  You can't easily display or print your stuff without digging into the technical side of XML.
  
  The XML-hype will be gone soon, don't worry.
  --
  !ERR: Signature not found.
3. Re:Do not lock yourself with .doc by greenhide · 2002-09-20 02:06 · Score: 1
  
  The format is horrendously inefficient
  
  It's true that the XML format is "bulky", but with processor speeds increasing and memory increasingly cheap, that is less of an issue. XML can be processed and cataloged pretty quickly, even if it isn't the most "efficient" way of doing it. Besides, we're talking document access in a specialized library for the visually impaired. We're not talking about many concurrent accesses all going at the same time.
  
  Without a description of the tags used and the relation between them they are just as unreadable as anything else
  
  Wrong. This is the nicest feature of XML! You can actually look at it, and assuming that the tags are intuitively named, you can pretty much figure out what information is in the file. Think about this: which would you rather try to decipher from within a plain text editor: voluminous RTF code, the binary content of a .doc file, or sections of text wrapped in tags like <title>, <author> and <loc_number>? And for more "efficient" data formats, it gets even worse. Imagine trying to open up an Access database in a binary editor, for example.
  
  You can't easily display or print your stuff without digging into the technical side of XML.
  
  Yeah, that is true. Without a decent application capable of opening and interpreting an XML file, you just can't print or display it. That also applies to every other existing data format, except *maybe* plain text, which is not really a useful format to store information in.
  
  XML is not hype. It makes sense as a data storage format for all of the opposite reasons you cited: it can be processed fairly quickly, it is easily readable, even in text format, and there are many tools which allow you to very easily display or print XML formatted information.
  
  --
  Karma: Chevy Kavalierma.
4. Re:Do not lock yourself with .doc by Anonymous Coward · 2002-09-20 06:16 · Score: 0
  
  Why the fuck are you warring about XML? Shit, he asked about OCR, not for you to prove how much you know so you can feel better about yourself as you sit home alone this weekend.
GOCR by bmomjian · 2002-09-19 06:30 · Score: 2, Informative

I actually use gocr with great success. It doesn't have a user interface; it's strictly command-line, but it works well.
Xerox TextBridge Pro by John+Sokol · 2002-09-19 06:36 · Score: 2, Interesting

I once had to recover a lost book manuscript from old printouts. The hard drive had crashed. After several iterations I found a good combination and proper settings.

The scanner I used is a $99 scanner that is several years old, Canon CanoScan FB620P.
I am very impressed with it. For OCR I used Xerox TextBridge Pro; the interface is awkward, but the OCR part works. The biggest problem was the way the Windows TWAIN drivers were set up, such that I had to go through several windows and mouse clicks to scan, and finish scanning.

I can do over 30 pages per hour, and I get about 99.8% on clean copy. The trick was to use a gray scale scan or text mode. I also scan at 300 DPI; I find it's important to give the OCR as much info as possible to work from.

You still want to run this past a human proofreader, but overall I am very impressed with the setup and its results.

--
I am always doing that which I can not do, in order that I may learn how to do it. - Pablo Picasso
Sweet irony! by Longinus · 2002-09-19 07:25 · Score: 1

You misspelled grammar.
1. Re:Sweet irony! by Anonymous Coward · 2002-09-19 08:09 · Score: 0
  
  And accuracy!
My experiences with OCR... Scanfix+Textbridge by tiohero · 2002-09-19 07:26 · Score: 3, Informative

I started a small document scanning service a few years ago. (I am no longer in that business.) The biggest issue in OCR accuracy is pre-processing (in particular de-skew and grayscale removal). If the page is skewed even a couple of degrees, OCR will fail miserably. I have had superb results using TMSSequoia Scanfix software, which automatically cleans up and straightens the page nicely. It's expensive but worth it if you have a lot to scan. I believe that they still have a demo available.

My experience has been that the consumer OCR software is considerably MORE accurate than industrial versions that cost 20X as much. I obtained excellent OCR accuracy using Scansoft's Textbridge software which utilized the Xerox Textbridge engine. Scansoft appears to have purchased Omnipage OCR and discontinued the Textbridge OCR line. I found that I achieved much higher accuracy with Textbridge than with Omnipage after the document was processed by Scanfix. Textbridge did not have some of the features of Omnipage but Textbridge was faster and better at OCR. I would definitely download the Textbridge 98 demo that is still floating around on the web.

Both Textbridge and Omnipage OCR were vastly superior to anything else I previewed, including Adobe's OCR engine. OCR can be surprisingly accurate but the source image needs to be free of distortion. Sometimes you will need to break up the page into several pieces using photo-editing software, since no OCR can interpret the structure of a document very well.

I suspect that you will be better off just typing the mathematics in by hand. Maybe a visual LaTeX editor like Scientific Workplace would be helpful. The LaTeX output could be manipulated using a parser to put the equations into the simpler forms that you need while keeping the raw equation in a form that could be used for other purposes later on.
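
To make the parser idea a bit more concrete, here is a toy Python sketch that turns a fragment of LaTeX into words a speech synthesizer could read. The `SPOKEN` table and the `latex_to_speech` function are invented for this example and cover only a handful of tokens; real textbook math would need a far larger vocabulary and proper handling of fractions, subscripts and so on.

```python
import re

# Hypothetical, deliberately tiny vocabulary; real math needs far more.
SPOKEN = {
    r"\partial": "partial",
    r"\rho": "rho",
    r"\left(": "left parenthesis",
    r"\right)": "right parenthesis",
    "(": "left parenthesis",
    ")": "right parenthesis",
    "+": "plus",
    "=": "equals",
    "0": "zero",
}

# Longer LaTeX commands are tried before single characters.
TOKEN = re.compile(r"\\left\(|\\right\)|\\[A-Za-z]+|[A-Za-z0-9]|[()+=]")

def latex_to_speech(latex):
    """Map each recognized token to a spoken word; pass unknown tokens through."""
    return " ".join(SPOKEN.get(tok, tok) for tok in TOKEN.findall(latex))

print(latex_to_speech(r"\rho \left( u + v \right) = 0"))
```

Because the raw LaTeX is kept as the source, the same file could still be typeset or converted for braille later.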

Honestly, 10 pages/hour is pretty good, so it doesn't sound like you are doing all that much touch-up. I suspect that using Scanfix will provide the greatest boost in productivity.
PrimeOCR - primerecognition. Voting Engines by tweedlebait · 2002-09-19 07:28 · Score: 1

They use several engines and vote on the best: Caere, FineReader, TypeReader, etc., 6 or 8 of them. The results are very nice. The package is expensive. It is pretty programmer friendly. If you ask nicely and give them some cash, they'll get you a demo. You usually talk to the owner over there too, which is cool. The products are simple and pretty neat. Their verifier is well designed too. I don't work for them, but a while back I demo'd their suite. Impressive (it crashed on occasion, but what OCR pack doesn't?). If you get Prime though, overclock like heck: they charged a lot extra per processor last I checked. --erics
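
For anyone wondering what "voting" between engines might look like, here is a minimal Python sketch. It is not PrimeOCR's actual algorithm, which isn't described here; the `vote` function and the sample engine outputs are made up, and it assumes every engine returned the same number of words, which real output would not guarantee without an alignment step.

```python
from collections import Counter

def vote(readings):
    """Pick the most common reading at each word position across engines.

    Assumes every engine returned the same number of words; real output
    would need to be aligned first.
    """
    lengths = {len(r) for r in readings}
    if len(lengths) != 1:
        raise ValueError("engine outputs must be aligned word-for-word")
    result = []
    for candidates in zip(*readings):
        word, _count = Counter(candidates).most_common(1)[0]
        result.append(word)
    return result

# Three hypothetical engines, each wrong in a different place.
engines = [
    "the cat sat on the rnat".split(),
    "the cat sat on the mat".split(),
    "the cot sat on the mat".split(),
]
print(" ".join(vote(engines)))
```

The appeal is that different engines tend to make different mistakes, so a majority can outvote any single misread. Errors that most engines share still get through.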

--
Firefox & /. ? Use this often:
I'll second that by Mr.Intel · 2002-09-19 07:34 · Score: 2

I first delved into the world of OCR back in '98 with this product and haven't turned back. The current version is made by ScanSoft (the same makers as OmniPage), but this product is much better. Even PCWorld has a review of it (March 2000).

It achieves 98% accuracy on typed text and can handle graphics, bullets and tables. These were big plusses for me. I still use the 98 version and have very few complaints. Dirty pages can be a problem, but it has frequently amazed me in how it catches characters in the midst of goop.

the trick was to use a gray scale scan or text mode. I also scan at 300 DPI; I find it's important to give the OCR as much info as possible to work from.

I agree. The right settings are very important. I recommend some serious tweaking before you get too hard and heavy with it. For plain text, the above works great, although I sometimes prefer 150dpi. For anything with tables and graphics, 300dpi is a must. A good scanner can make a big difference too. My work uses network ready scanners that copy the file to a network share and the software picks up the files automatically. Very efficient.

--
ASCII tastes bad dude.
Binary it is then.
Open Source has no OCR by 0x0d0a · 2002-09-19 12:51 · Score: 2

...or no decent OCR, anyway. There are a couple of abandoned useless research projects.

Kind of surprising.

--
May we never see th
Help GOCR by Jebediah21 · 2002-09-19 13:23 · Score: 2

Why not use some of the money and resources to further GOCR? Hell, you could probably even convince the CS department to make it an assignment to send a patch or implement a new feature into the app.

--

Everytime you look at porn a devil gets their horns.
My experience, for what it's worth by kiwimate · 2002-09-20 03:08 · Score: 3, Informative

I've been working off and on with OCR packages since 1991, and have seen little improvement in the accuracy over that time. 98% or 99% accuracy sounds great; but, as you already know, you have to have someone go over the entire text and check it. If you consider that you don't know where the errors are likely to be, then you begin to realize the extent of the issue. I have generally found that, in cases where 100% accuracy is necessary (and there are some cases where 99% might be good enough), it's just as cost-effective to use a professional typing service.
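
To put the percentages quoted in this thread into perspective, here is a quick back-of-the-envelope calculation in Python. The figure of 2000 characters per page is my own assumption (dense textbook pages can run higher), and `expected_errors_per_page` is just a name picked for the example.

```python
def expected_errors_per_page(accuracy, chars_per_page=2000):
    """Expected number of wrong characters on a page at a given per-character accuracy.

    chars_per_page=2000 is an assumed typical value, not a measured one.
    """
    return (1.0 - accuracy) * chars_per_page

for accuracy in (0.95, 0.98, 0.99, 0.999):
    errors = expected_errors_per_page(accuracy)
    print(f"{accuracy:.1%} accurate -> about {errors:.0f} errors per page")
```

Under that assumption even 99% leaves roughly 20 mistakes on every page, scattered where you can't predict them, which is why the whole text still has to be proofed.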

The scanner you have is hard to beat. As for the software, I found that the Caere engine was a little better than the OmniPage engine when I first started working with OCR, but over time OmniPage has gotten that little bit extra oomph into it.

Having said that, there are some posts that recommend Abby, a product with which I'm unfamiliar, and state that a trial version is available, so it's probably worth a check.

Finally, one small factor that sometimes is overlooked: what resolution do you scan at? You may want to try lowering the resolution and seeing if that gives any better results. Lowering the resolution can have the effect of smoothing out some of the noise that can confuse OCR engines. Try going all the way down to 200 dpi.

Finally (part two), I've found you can also sometimes tweak the results by playing with the depth -- instead of scanning in b&w, try gray scale (I suggest 4 bit).

Finally (part three), I'm dubious that you'll find anything to handle formulae. For those readers who may be surprised to learn OCR accuracy is not quite up to scratch, just wait until you encounter OCR format preservation.

Good luck -- and if you do get better results, by all means let us all know!

Cheers
There are good frontends by marm · 2002-09-20 03:50 · Score: 2

Well, in the best traditions of UNIX software, the gocr authors decided to keep the user interface and the actual OCR code separate. Sensible.

Actually it comes with Tcl/Tk (gocr.tcl) and GTK+ (gtk-gocr) frontends. My personal favourite, though, is the KDE scanning program, Kooka, which includes support for gocr amongst its many very useful talents.
Contact BBN - bbn.com by NoSlack · 2002-09-20 06:06 · Score: 2, Informative

I work in the speech recognition field and work with the researchers at BBN a lot (they might sound familiar: can you say @ sign inventors?) and they are always bragging about their OCR, especially for foreign languages. They have a very different approach to OCR (they don't do character recognition at all, but rather word recognition) and thus their accuracy is very, very high. Plus they are government funded for this sort of thing (can you say NSA?). I would recommend contacting them directly, not just through the website; as you are an educational institution, they will probably give you a very good price.
Good Luck.
1. Re:Contact BBN - bbn.com by NoSlack · 2002-09-20 06:09 · Score: 1
  
  http://www.bbn.com/speech/ocr.html is the link, sorry, forgot to add it. It makes for an interesting read. Enjoy
Be Careful -- Might Not Be Fair Use by stinkenstein · 2002-09-20 06:53 · Score: 1

Be careful about contacting the publishers. They may decide that they could cash in on this and try to hassle you for unauthorized copying of their work. Sounds evil, I know, but you know lawyers. Often the best bet is to stay under the radar.

IAAL, just not a very good one so get a good one's advice.

--
Where do you get *your* entropy?
ABBYY FineReader... by alienmole · 2002-09-20 07:59 · Score: 2

Another vote for ABBYY FineReader. For us, it made the difference between not using OCR at all, and using it. All our previous attempts to use OCR seriously had failed, until we came across FineReader. We use FineReader Pro, but there's a whole product line of versions for different requirements.
Correct Scanning: double-entry method. by DancingSword · 2002-09-23 16:39 · Score: 2, Insightful

Get 2 different OCR-engine programs

Scan the same text in - to plain text - with program 1, scan it in to a second plain-text file with OCR program 2, and 'diff' 'em, ignoring white-space.

That means the individual person running the scanning doesn't have anywhere near the amount of work to do.

Small errors may get through, but it is drastically faster than the way it is normally done, eh?
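
If you'd rather script the comparison than run 'diff' by hand, something along these lines should work in Python using the standard `difflib` module. The two sample strings stand in for the output of two hypothetical OCR programs, and `disagreements` is a helper name made up for this sketch.

```python
import difflib

def disagreements(text_a, text_b):
    """Yield (words_from_a, words_from_b) wherever two OCR outputs differ.

    Splitting on whitespace makes the comparison ignore spacing and line breaks.
    """
    a, b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield a[i1:i2], b[j1:j2]

# Stand-ins for the plain-text output of two different OCR programs.
engine_1 = "The quick brovvn fox\njumps over the lazy dog"
engine_2 = "The quick brown  fox jumps over the 1azy dog"
for left, right in disagreements(engine_1, engine_2):
    print(left, "vs", right)
```

The proofer then only looks at the spots where the two programs disagree. Anything both programs misread in the same way will not show up, which is the "small errors" caveat above.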

--
Messages to/for me ( in me journal )