Batch Cataloging of Scanned Documents via OCR?

Posted by Cliff on Monday November 21, 2005 @02:03PM from the information-retrieval-and-processing dept.

munwin99 asks: "I am looking for some software to process a batch of images (scanned forms). We want to use Gallery to view the images, and be able to search them by 3 or 4 attributes. We want to get these attributes from the form (date, name, etc). We want it to check a section of the scanned form, read the info from that section(s), and dump the retrieved info into Gallery (using OCR / ICR). Is there any (preferably) free or open source software that can do this? Supported OSes should include either Windows, Linux or Mac OS X. Even Gallery is optional, if someone has a better suggestion."

31 comments

Custom Layout by tonsofpcs · 2005-11-21 14:08 · Score: 4, Informative

Many pieces of OCR software allow you to create a 'layout' for OCRing, that is, specify where images and textual data are. If your forms all follow the same layout, or you have just a few [relatively], you can set up these layouts and, in many pieces of software, reuse them. The only caveat is that you need to be sure that the forms are scanned the same way; if your forms have prepunched holes or markings in specific points on the edge, you can use animation software [like Bauhaus Software's Mirage] on a batch to 'pixel-track' the pages and align them based upon these marks, then export no-/low-loss TGAs, TIFFs, PNGs, or similar for OCRing.
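
To make the reusable 'layout' idea concrete, here is a minimal Python sketch. It assumes each page has one registration mark (such as a punched hole) whose pixel position has already been located, and that pages are only shifted, not rotated. The field names and coordinates are invented for illustration and do not come from any particular OCR product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Zone:
    """A named rectangle on the form, in pixels relative to the registration mark."""
    name: str
    left: int
    top: int
    width: int
    height: int

# Hypothetical layout: field names and coordinates are made up for illustration.
FORM_LAYOUT = [
    Zone("date", left=1200, top=80, width=400, height=60),
    Zone("name", left=150, top=220, width=900, height=60),
]

def crop_boxes(layout, mark_xy):
    """Return {field: (left, upper, right, lower)} boxes for one scanned page.

    mark_xy is where the registration mark was found on this particular scan.
    Adding it shifts every zone, so pages scanned slightly off-position still
    line up. Rotation is not corrected here.
    """
    mx, my = mark_xy
    return {
        z.name: (mx + z.left, my + z.top, mx + z.left + z.width, my + z.top + z.height)
        for z in layout
    }

if __name__ == "__main__":
    # The same layout applied to two scans whose marks landed in different places.
    print(crop_boxes(FORM_LAYOUT, (40, 35)))
    print(crop_boxes(FORM_LAYOUT, (52, 31)))
```

Each box could then be cropped out of the page image and handed to whatever OCR engine is in use; the (left, upper, right, lower) order matches what Pillow's Image.crop expects, if that happens to be the imaging library.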

--
Video Production Support
I doubt it by Anonymous Coward · 2005-11-21 14:48 · Score: 0

Forms processing is really hard. Expensive software takes training (on the part of the software and the user) to become reasonably accurate. I would lump this in as similar to the previously covered free or open source software for engineering CAD work.
Forget Gallery by barzok · 2005-11-21 14:52 · Score: 0, Flamebait

Unless Gallery 2 uses a database back-end, skip it and use Coppermine. Gallery bogs down once you get a lot of images into the system as it's all flat-file data storage. Coppermine is a similar application from the user perspective, but uses MySQL as a back-end and actually allows you to associate keywords with images.
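
As a rough illustration of why a database back-end helps with attribute searches, here is a small Python sketch of a keyword table. It uses SQLite from the standard library as a stand-in for MySQL, and the table and column names are hypothetical; it is not Coppermine's or Gallery's actual schema.

```python
import sqlite3

def build_index(rows):
    """Create an in-memory table of (image, attribute, value) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE attrs (image TEXT, attr TEXT, val TEXT)")
    # The index lets lookups by attribute/value avoid scanning every row.
    db.execute("CREATE INDEX attrs_lookup ON attrs (attr, val)")
    db.executemany("INSERT INTO attrs VALUES (?, ?, ?)", rows)
    return db

def find_images(db, attr, val):
    """Return the sorted image names whose attribute `attr` equals `val`."""
    cur = db.execute(
        "SELECT image FROM attrs WHERE attr = ? AND val = ? ORDER BY image",
        (attr, val),
    )
    return [row[0] for row in cur]

if __name__ == "__main__":
    db = build_index([
        ("form_0001.tif", "name", "Smith"),
        ("form_0001.tif", "date", "2005-11-21"),
        ("form_0002.tif", "name", "Jones"),
    ])
    print(find_images(db, "name", "Smith"))
```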
1. Re:Forget Gallery by douggmc · 2005-11-21 15:38 · Score: 0
  
  Umm ... me thinks Gallery uses a DB backend. Namely ... MySQL.
2. Re:Forget Gallery by Directrix1 · 2005-11-21 16:37 · Score: 1
  
  Gallery 2 does use an RDBMS. I have set it up with Postgres but I think it can do MySQL also. BTW, I have run Gallery 1 with thousands of images, and while I do agree an RDBMS would have been optimal, it didn't really slow down to excruciating levels.
  
  --
  Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF
Hire a kid from high school by phoenix.bam! · 2005-11-21 14:55 · Score: 2, Insightful

Pay a kid $8 an hour to scan the forms and process them. You will get much better accuracy and you can put them to other uses as well. A kid from high school will cost you... $5,000 a year or something depending on how many forms you've got. Hell, the kid could do the work from home if you set up the computer to post the images in an accessible way. Hire a kid and give him a future in data processing!
1. Re:Hire a kid from high school by dascandy · 2005-11-21 20:22 · Score: 1
  
  mod parent up - he's got a very good point. Why use machines to do OCR if there are people that can do it both better and cheaper?
  
  Plus, it creates jobs for those who have trouble finding something to fit in with studying.
Hack your value/key pairs into EXIF data by jgaynor · 2005-11-21 15:10 · Score: 1, Offtopic

While I'm unsure if Gallery allows you to create, edit and query 'meta' fields with each image, I do know that it reads, stores and can query the EXIF fields of all imported images. One way to be able to store (once)/query (many) your custom data fields would be to initially fudge those values into the EXIF fields of each scanned image. Yes, it would be weird to search for 'last name' with a 'camera model' query, but it would work.

Anyway this is probably how you'd want to go about this:

1. Scan doc to file
2. use an app or library to OCR the fields you want
3. Add EXIF fields/data to the image with perl (CPAN EXIF modules)
4. dump image into gallery. Gallery parses out and stores your crap in query-able EXIF fields.
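
Step 3 above could be sketched as follows. This is only an illustration in Python rather than Perl: it assumes the exiftool command-line program (the front end to the Image::ExifTool CPAN module) is installed, and it packs all fields into the ImageDescription tag. Whether Gallery actually indexes that tag is an assumption that would need checking. The function only builds the command; it does not run it.

```python
import shlex

def pack_fields(fields):
    """Flatten {key: value} into one 'key=value; key=value' string, sorted by key."""
    return "; ".join(f"{k}={v}" for k, v in sorted(fields.items()))

def exiftool_command(image_path, fields):
    """Build (but do not run) an exiftool call that stores the fields in ImageDescription."""
    return [
        "exiftool",
        "-overwrite_original",  # do not leave a *_original backup copy behind
        f"-ImageDescription={pack_fields(fields)}",
        image_path,
    ]

if __name__ == "__main__":
    # Hypothetical file name and OCR results, for illustration only.
    cmd = exiftool_command("form_0001.jpg", {"name": "Smith", "date": "2005-11-21"})
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run() to actually write the tag
```

Values that themselves contain ';' or '=' would need escaping before being packed this way.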

This is all conjecture though - good luck. Seems like a pretty shitty task if you ask me.
Open Source OCR by SpinningAround · 2005-11-21 15:14 · Score: 2, Informative

I've been looking into OCR packages as part of a custom data capture work-flow desired by one of my customers.

The OCR / document image layout analysis world is dominated by a handful of commercial companies. There is a dearth of OCR and document analysis code available in the open source community. That which is available on any sort of 'free' basis is not going to be of a lot of use other than as a starting point for some serious development of your own, I would suggest.

The big names commercially are:

Abbyy's Finereader
Nuance's (formerly Scansoft) Omnipage
and then a number of smaller players like SimpleOCR

In the open source world, some places to start looking are:

GOCR

and GNU's OCRAD

Both Nuance and Abbyy offer an SDK for OCR integration at a code level which might suit depending on your budget. Certainly the price (probably between $500 and $5000 for a license) represents a good deal if you look at the costs and time it would take to write anything that does serious OCR work yourself.

BTW, if anyone out there knows of any good document layout analysis code available to have a look at, I would be particularly interested. I am looking into document layout analysis for a personal project and although there is a fair bit of academic research available at Citeseer, I actually haven't found much in the way of good sample code that I can use as a starting point for some of my own ideas.
1. Re:Open Source OCR by Directrix1 · 2005-11-21 16:32 · Score: 1
  
  Wow, gocr will scan barcodes in images. Neato!
  
  --
  Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF
Open Source Solutions to Breathing. by Anonymous Coward · 2005-11-21 15:57 · Score: 0

[For the original poster]
"The ScanSoft Capture Development System 12 allows software developers to significantly reduce the cost of development and speed time-to-market by providing highly accurate OCR, ICR, OMR and Barcode recognition engines, pre-made user interfaces for controlling scanning devices, advanced image enhancement tools, document processing capabilities and support for a wide range of input/output filters, including PDF, XML and Open eBook standards."

http://www.ocr-systeme.de/englisch/dk2000.htm

"Formulator 2.5 is one of the fastest selling personal form management software solutions available. Designed for both personal and business use, Formulator 2.5 enables anyone to quickly turn any paper form into a digital form fast. In an age where practically everything is done by computers, have you ever asked yourself from time to time why most forms still have to be completed by hand? The truth is they no longer have to be - thanks to Formulator! When using Formulator 2.5 you can scan in forms and create a digital template, complete your data on screen, and print them out again in no time. The completed entries can either be printed on the original form (this being fed into the printer), or alternatively, the program can print out a copy of your newly created digital form, and include your entries on it. All you need is a pc, a scanner, and a printer. "

http://www.ocr-systeme.de/englisch/formulat.htm

"BTW, if anyone out there knows of any good document layout analysis code available to have a look at, I would be particularly interested. I am looking into document layout analysis for a personal project and although there is a fair bit of academic research available at Citeseer, I actually haven't found much in the way of good sample code that I can use as a starting point for some of my own ideas."

Reminds me of that "Ask Slashdot" where the OP asked about OSS engineering software. There's a reason why. Engineering and OCR (image analysis) are HARD.
1. Re:Open Source Solutions to Breathing. by Anonymous Coward · 2005-11-21 17:29 · Score: 0
  
  Being hard doesn't exactly disqualify anything from being available in the open source world... after all the open source world seems to have managed to develop Operating Systems, file systems, a plethora of server software including databases, mail servers and application servers and a whole bunch of other stuff which we might consider to be quite 'hard'.
  
  A more plausible explanation is that there isn't much demand for an open-source OCR package in the world of emailed soft copies...
Here's what we did by Anonymous Coward · 2005-11-21 17:55 · Score: 0

Scanned a bunch (BIG BUNCH) of documents into PDFs. Then we ran them through Adobe Acrobat 7's Batch Processing to OCR them. Now we have PDFs with pretty accurate text ready for Google to index.
1. Re:Here's what we did by rot26 · 2005-11-22 02:12 · Score: 1
  
  Now we have PDFs with pretty accurate text ready for Google to index
  
  What OCR software did you use? I haven't had real good luck with this. (The documents are already scanned into PDFs when I receive them so I have no control over the quality.)
  
  --
  
  To ensure perfect aim, shoot first and call whatever you hit the target
2. Re:Here's what we did by Anonymous Coward · 2005-11-22 09:31 · Score: 0
  
  Adobe Acrobat comes with built-in OCR. It took the PDFs our scanner made (basically just images), and OCR'd the text in the PDF, creating a dual layer PDF with text and the scanned graphic image of the text.
3. Re:Here's what we did by rot26 · 2005-11-23 01:49 · Score: 1
  
  Thanks. I have Acrobat Pro 6.0. I'll see if that has the OCR feature. (If it does I've never noticed it). I was using Readiris Pro, which did a horrible job, at least on the scans I get.
  
  --
  
  To ensure perfect aim, shoot first and call whatever you hit the target
4. Re:Here's what we did by Anonymous Coward · 2005-11-23 03:49 · Score: 0
  
  I believe it does. In version 7, it's under the document menu -> "Recognize text using OCR"
Not a troll, a job application by tengu1sd · 2005-11-21 17:56 · Score: 1

Most commercial scanning engines allow a human to touch up the results, comparing the output against the originals. This is where your intern or entry level temp can be useful. Or you could just outsource the hole ting to summ one overseas.
You Get What You Pay For by Flwyd · 2005-11-21 18:36 · Score: 2, Informative

My job involves integrating with OCR, and we've looked at quite a few options. Though there are some bargains, you get what you pay for.

The big players are Abbyy and Scansoft. Both have extensive feature lists, from handy GUIs to form/document layout to Asian language support. They also come with a hefty price tag. Their Windows support is best, but they have software for others. Single user applications are reasonably priced in the two-digit figure range. However, we decided not to integrate with either of them in part because of the price tag for high volume server-side processing. If you only have a hundred or two forms to do at a time, a workstation solution may be your best bet.

We chose to integrate with Transym, a cheap but pretty good engine. It does a good job at what it tries to do, which is recognize standard printed text. We then take that text and extract meaningful data, like dates and names, from the output text + position information. Pretty much every other cheap/free package we looked at had pretty lousy performance on our straight-forward documents (primarily typed paragraphs).
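
A minimal Python sketch of the "text + position" extraction step is below. It assumes the OCR engine returns each word with the pixel position of its top-left corner. The tuple format, the region coordinates and the single US-style date pattern are all assumptions for illustration, not Transym's actual output format.

```python
import re
from datetime import date

# Matches US-style dates such as 11/21/2005; other formats would need more patterns.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def find_date(words, region):
    """Return the first valid date whose word lies inside region, else None.

    words  -- iterable of (text, x, y) tuples, x/y being the word's top-left corner
    region -- (left, top, right, bottom) area of the page where the date is expected
    """
    left, top, right, bottom = region
    for text, x, y in words:
        if not (left <= x <= right and top <= y <= bottom):
            continue  # ignore date-like strings elsewhere on the page
        m = DATE_RE.search(text)
        if m:
            month, day, year = map(int, m.groups())
            try:
                return date(year, month, day)
            except ValueError:
                continue  # e.g. 13/45/2005 produced by a misread digit
    return None

if __name__ == "__main__":
    ocr_words = [
        ("Invoice", 100, 50),
        ("01/02/2004", 100, 900),   # a date in the body text, outside the region
        ("11/21/2005", 1250, 60),   # the date in the header zone
    ]
    print(find_date(ocr_words, region=(1000, 0, 1700, 200)))
```

Restricting the search to a region is what keeps a date mentioned in the body text from being mistaken for the form's date field.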

ICR (recognizing handwriting) and IMR (mark recognition) is another bag. There are very few players in this arena. They work best when the domain is well-defined (the U.S. Postal Service, for instance, does pretty well at recognizing zip codes). If you're trying to recognize dates and check boxes, the form definition software that Abbyy and Scansoft provide probably fits your needs best.

Finally, you need to consider how reliable you want your process to be and how much quality control you want. Even the best OCR engine makes errors, and ICR is quite a bit behind that. You can't blindly trust OCR output unless you're willing to deal with incomplete data. If you're going to have a human verify the computer's work for only a few fields, you may not be gaining significant efficiency.
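
One common way to limit the human verification load is to review only the fields the engine was unsure about. The Python sketch below assumes the OCR engine reports a per-field confidence between 0 and 1; the 0.90 threshold is a hypothetical starting point, not a recommended value.

```python
def split_for_review(fields, threshold=0.90):
    """Separate OCR results into auto-accepted values and ones a person should check.

    fields    -- {field_name: (text, confidence)} with confidence in 0.0..1.0
    threshold -- hypothetical cut-off; tune it against forms verified by hand
    """
    accepted, needs_review = {}, {}
    for name, (text, confidence) in fields.items():
        # Empty strings are always reviewed: a missing value is the failure
        # mode that silently produces incomplete records.
        if text.strip() and confidence >= threshold:
            accepted[name] = text
        else:
            needs_review[name] = text
    return accepted, needs_review

if __name__ == "__main__":
    ok, check = split_for_review({
        "name": ("Smith", 0.97),
        "date": ("11/2l/2005", 0.62),  # 'l' misread for '1', low confidence
    })
    print("accepted:", ok)
    print("needs review:", check)
```

How much this saves depends on how well the engine's confidence scores track its real error rate, which has to be measured on your own forms.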

(I don't claim to have evaluated every potential option. There may be software we missed, software we didn't evaluate because it didn't meet our integration needs, and software that's come to light after we did our search.)

--
Ceci n'est pas une signature.
Abbyy Finereader by jukervin · 2005-11-21 19:32 · Score: 2, Informative

I would recommend taking a look at Abbyy's offerings, particularly the FormReader 6.5 family, which is intended for OCRing forms and semi-structured documents.

http://www.abbyy.com/formreader/
scansnap and spotlight by austad · 2005-11-21 19:34 · Score: 2, Informative

Not sure what you're doing, but this might give you some ideas. I scan all of my paper docs in using a Fujitsu ScanSnap in OSX. It can automatically pipe them to ReadIris Pro for OCR, and dump them in a save directory. I can search for whatever I want in Spotlight, and it pops right up. "Hennepin county property tax 2002" brings up one document, and putting in my address and the words "purchase agreement" comes up with the purchase agreement for my house.

It's pretty insane how much time it saves me.

--
Need Free Juniper/NetScreen Support? JuniperForum
OT: meta-moderation needed in this thread by Anonymous Coward · 2005-11-21 20:37 · Score: 0

Please review the moderations in this thread and meta-moderate as necessary. It's clear to me as a casual reader that someone got points and just decided to mod everyone down for no good reason. If it was my site, I'd probably just revoke moderation privileges for the offending moderators.
Zylab by martin · 2005-11-21 20:47 · Score: 1

Not cheap, but good...can OCR and match/highlight the search terms in the scanned document.
MSR Eurostore by Anonymous Coward · 2005-11-21 23:09 · Score: 0

http://www.msr.co.uk/

(Not that I used to work for one of their partner companies or anything! ;)
Contact Amazon by alta · 2005-11-22 02:28 · Score: 1

Get them to put it on mturk.com at .03 per page ;)

--
Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
Re:Open Source OCR - Scantron by Anonymous Coward · 2005-11-22 05:34 · Score: 0

Ask the folks at scantron. They have forms that accept both bubbles and (fill-ins and/or essays), and their machines seem to do a good job at not grading the fill-in/essay regions, but only recognizing the bubble regions.
Try Kofax or Captovation by Anonymous Coward · 2005-11-22 07:16 · Score: 0

I have used both Kofax and Captovation in various projects. Captovation even used to allow you to download a sample copy of their product.
Some Kodak scanners come with some basic OCR software but I am not sure how good these are.
Good luck
Docubase by petree · 2005-11-22 08:37 · Score: 1

Docubase does everything you are looking for. (For real, call them and ask) Too bad it'll probably cost you mid 5 figures to do it. Disclosure: I work for a company that resells docubase integrated into our product.
cheap way to buy Abbyy by rjnagle · 2005-11-22 09:39 · Score: 1

I am an open source guy, but even I ended up buying Abbyy for my mass scanning needs.

Here's what you do: Buy a 5.0 license of Abbyy FineReader off eBay. You can buy it for about $10.

Buy the upgrade version of the latest version of Abbyy FineReader for $150.

It's still $160, but that's still considerably cheaper than paying the new price of $500-600. The Abbyy FineReader docs say specifically that the upgrade software will work successfully on ALL prior versions of FineReader.

As far as feeding into a database, I'm afraid I can't be any help, but if any software has this functionality, finereader would.

--
Robert Nagle, Idiotprogrammer, Houston