Slashdot Mirror


Automated OCR for Forms Processing?

Oscar Carrillo asks: "We are working on a large NIH grant that collects tons of data, much of it in the form of questionnaires. The forms will be available on the web, but it's mostly not feasible to have the subjects sit in front of the computer all day (not to mention that people get annoyed sitting in front of a computer all day). The study is being conducted at several universities and institutions around the country. Using Linux/JSP/Struts/PostgreSQL will take care of most of our needs. But it would save a lot of data entry if all forms could be scanned at each site, the images uploaded to the website, and then automatically put through OCR (Optical Character Recognition) to extract only the relevant raw data that subjects wrote. Does anyone know of something that can handle this? Are there any open source projects that can handle this? Any good commercial alternatives?"

3 of 30 comments

  1. A few commercial packages: by j-turkey · · Score: 3, Informative

    I am not aware of any open-source automated ICR for forms processing. I can, however, offer a few commercial alternatives.

    I have used TELEform, by Cardiff, and was somewhat impressed. It can take multiple different inputs (scanner, fax, email, web POST, etc.), run ICR on them, and store the data in a number of different RDBMSs. I believe that this is a Windows-only package.

    Another package is 170 MarkView, from 170 Systems, which basically does the same thing. I have no experience with it, so I cannot recommend it either way.

    I have used TELEform in a medical/clinical setting, where doctors fill out prescriptions and fax them in, character recognition is run server-side, and the results are verified by my data staff. It works pretty well, but you need to keep in mind that handwriting recognition is not infallible, and if you are interested in any level of accuracy, I would recommend that a human verify each comb-box where there is handwritten text. Most verifiers that I've seen are pretty good, and you can glance at each one and just pound the tab-key to scream through the fields.
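
    The verification step described above can be sketched roughly as follows. This is only an illustration, not how TELEform works internally: the `Field` record, the `needs_review` helper, and the sample values are all hypothetical, and it assumes the recognition engine can report whether a field was handwritten.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """One recognized form field (hypothetical structure, not a TELEform type)."""
    name: str          # form field identifier
    value: str         # text returned by the recognition engine
    handwritten: bool  # True if the field was filled in by hand


def needs_review(fields):
    """Return the fields a human should verify: every handwritten one."""
    return [f for f in fields if f.handwritten]


# Made-up sample form, for illustration only.
form = [
    Field("patient_id", "10482", handwritten=True),
    Field("form_version", "B2", handwritten=False),
    Field("dosage", "25 mg", handwritten=True),
]

review_queue = needs_review(form)
for f in review_queue:
    print(f"{f.name}: {f.value!r} -> check against the scanned image")
```

    Machine-printed fields skip the queue here; a stricter setup could route everything through a verifier at the cost of more staff time.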

    As far as statistics for accuracy with TELEform, the numbers that I reported are as follows (the numbers represent the percentage of fields with ICR errors of the specified type):

    ICR Error type:
    ==========
    Handwriting 3.7%
    Combination Handwriting/OCR 3.2%
    OCR 2.9%
    Total 9.9%

    You can take the first three numbers with a grain of salt (the breakdown by kind of ICR error is subjective and somewhat anecdotal), but the total is accurate -- expect to have approximately 9.9% of your fields come back with errors, and around 6% if you are really careful in designing your forms and train your users on how to write on those forms. These numbers are consistent for all of the ICR systems I have used.
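
    To turn those rates into a workload estimate, a back-of-the-envelope calculation like the one below may help. The 9.9% and 6% rates are the ones quoted above; the form count and fields per form are placeholder assumptions, so substitute your own study's numbers.

```python
def expected_error_fields(total_fields, error_rate):
    """Expected number of fields that come back with a recognition error."""
    return round(total_fields * error_rate)


# Placeholder study size (assumptions, not from the original post).
forms = 1000
fields_per_form = 40
total_fields = forms * fields_per_form

typical = expected_error_fields(total_fields, 0.099)  # rate quoted above
careful = expected_error_fields(total_fields, 0.06)   # well-designed forms, trained users

print(f"Total fields:               {total_fields}")
print(f"Errors at 9.9%:             {typical}")
print(f"Errors at 6%:               {careful}")
print(f"Corrections saved by care:  {typical - careful}")
```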

    I hope this helps...


    -Turkey


  2. Two Words by 4/3PI*R^3 · · Score: 3, Informative
    ...WORK STUDIES
    ...GRADUATE STUDENTS
    ...SLAVE LABOR

    You will spend less time and get better results hiring starving, desperate-for-money college students to do data entry.

  3. I have worked in OCR/forms-processing, etc. by krinsh · · Score: 2, Informative

    and know an outstanding programmer who works with a number of OS platforms and whom I would call an expert on OCR, forms recognition, etc. Check out http://www.microimagesys.com and contact Mr. Lunglhofer. Also, look at Kofax for your image and OCR retrieval from scanned documents. I am not 100% sure Adobe has a *nix version, but I create a considerable number of e-forms in Adobe (and learned this from Mr. Lunglhofer). These forms are used in an enormous variety of electronic, web-based, and non-web applications. Ask him what he would suggest and see what kind of product he could provide for you.

    --
    I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.