Slashdot Mirror

Batch Cataloging of Scanned Documents via OCR?

Posted by Cliff on Monday November 21, 2005 @02:03PM from the information-retrieval-and-processing dept.

munwin99 asks: "I am looking for some software to process a batch of images (scanned forms). We want to use Gallery to view the images, and be able to search them by 3 or 4 attributes. We want to get these attributes from the form (date, name, etc). We want it to check a section of the scanned form, read the info from that section(s), and dump the retrieved info into Gallery (using OCR / ICR). Is there any (preferably) free or open source software that can do this? Supported OSes should include either Windows, Linux or Mac OS X. Even Gallery is optional, if someone has a better suggestion."

Hack your value/key pairs into EXIF data by jgaynor · 2005-11-21 15:10 · Score: 1, Offtopic

While I'm unsure whether Gallery lets you create, edit and query custom 'meta' fields for each image, I do know that it reads, stores and can query the EXIF fields of all imported images. One way to store your custom data fields once and query them many times would be to fudge those values into the EXIF fields of each scanned image up front. Yes, it would be weird to search for 'last name' with a 'camera model' query, but it would work.

Anyway, this is probably how you'd want to go about it:

1. Scan the doc to a file
2. Use an app or library to OCR the fields you want
3. Add EXIF fields/data to the image with Perl (CPAN EXIF modules)
4. Dump the image into Gallery. Gallery parses out and stores your crap in queryable EXIF fields.
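
Step 3 could look something like the rough sketch below. It is in Python rather than Perl, and it only builds (and optionally runs) a command line for the exiftool utility, which is the front end to the CPAN Image::ExifTool module. The field names (last_name, form_date, and so on), the choice of which EXIF tag holds which attribute, and the file name are all made up for illustration. The values are assumed to have already come out of whatever OCR step you use, and whether Gallery actually indexes these particular tags is something you'd have to check.

```python
import shlex
import subprocess

# Hypothetical mapping from form attributes to EXIF tags. The tag names
# are standard EXIF tags; which attribute lands in which tag is an
# arbitrary choice made for this sketch.
FIELD_TO_EXIF_TAG = {
    "last_name": "Artist",
    "form_date": "DateTimeOriginal",
    "form_type": "Model",
    "notes": "ImageDescription",
}


def build_exiftool_command(image_path, fields):
    """Return the exiftool argument list that writes `fields` into the image."""
    # -overwrite_original stops exiftool from leaving a *_original backup copy.
    args = ["exiftool", "-overwrite_original"]
    for name, value in fields.items():
        tag = FIELD_TO_EXIF_TAG.get(name)
        if tag is None:
            raise KeyError(f"no EXIF tag chosen for field {name!r}")
        # exiftool's write syntax is -TAG=VALUE
        args.append(f"-{tag}={value}")
    args.append(image_path)
    return args


def tag_image(image_path, fields):
    """Actually run exiftool (assumes it is installed and on PATH)."""
    subprocess.run(build_exiftool_command(image_path, fields), check=True)


if __name__ == "__main__":
    # Values that would normally come from the OCR step (made up here).
    ocr_fields = {"last_name": "Smith", "form_date": "2005:11:21 00:00:00"}
    command = build_exiftool_command("scan_0001.jpg", ocr_fields)
    # Print the command instead of running it, so you can eyeball it first.
    print(shlex.join(command))
```

Keeping the mapping in one dictionary means the same table can be used later to translate a 'last name' search into the EXIF field it was stashed in.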

This is all conjecture though - good luck. Seems like a pretty shitty task if you ask me.