Slashdot Mirror


Automated OCR for Forms Processing?

Oscar Carrillo asks: "We are working on a large NIH grant that collects tons of data, much of it in the form of questionnaires. The forms will be available on the web, but it's mostly not feasible to have the subjects sit in front of the computer all day (not to mention that people get annoyed sitting in front of a computer all day). The study is being conducted at several universities and institutions around the country. Using Linux/JSP/Struts/PostgreSQL will take care of most of our needs. But it would save a lot of data entry if all forms could be scanned at each site, the images uploaded to the website, and then automatically put through OCR (Optical Character Recognition) to extract only the relevant raw data that subjects wrote. Does anyone know of something that can handle this? Are there any open source projects that can handle this? Any good commercial alternatives?"

8 of 30 comments

  1. Bubbles by adamy · · Score: 3, Insightful

    Remember those old annoying CTBS tests and SAT stuff? If these surveys are multiple choice, use the old #2 lead pencil and scan 'em in that way. Your data will already be entered. Most universities have the facilities for this already.

    Do not count on handwriting recognition to be successful for the people who fill out the surveys. While it works fine for typeset and computer-generated print, it won't work across many different handwriting styles and many different idiomatic expressions.

    --
    Open Source Identity Management: FreeIPA.org
  2. NIH has an OCR website by Hee+Hee+Hee · · Score: 3, Interesting

    Check out NLM's DocMorph at docmorph.nlm.nih.gov/docmorph/default.htm. It's a site put up by the NIH (coincidentally) where you post your scanned image and they post an OCR'ed document, in the format you choose, for a short period. It does a fairly good job for the price (free).

    --
    - Bill
  3. another lowly subject by tps12 · · Score: 4, Funny

    it's mostly not feasible to have the subjects sit in front of the computer all day

    Then I guess somebody forgot to tell my boss.

    --

    Karma: Good (despite my invention of the Karma: sig)
  4. Solutions by Wrexen · · Score: 3, Interesting

    Does anyone know of something that can handle this?

    High school students? Technology isn't the answer to everything, and if these are handwritten you're not going to have very much success trying to automate the recognition. My name is Fod Na1oyyy, etc

  5. Fla. by Strange+Ranger · · Score: 3, Funny


    Doesn't the State of Florida have a forms tallying system they're looking to unload?

    --

    Operator, give me the number for 911!
  6. OCR forms processing is problematic at best by Pauly · · Score: 4, Insightful

    I have worked at one of the world's largest OCR/forms processing vendors, so take it from me: don't do this.

    OCR forms processing does:

    • waste money and time
    • create unnecessary pain
    • require high-quality and expensive printed forms
    • require high-quality and expensive scanning equipment
    • introduce more human error

    OCR forms processing does NOT:

    • "save a lot of data entry"
    • do anything automatically (unless your forms are all checkboxes)
    • save money or time

    That said, if you have a lot of questions to be answered, a well-designed form that uses as few handwritten responses as possible (all checkboxes is best) may be viable.

    Frankly, most of the large projects I worked on could have gotten the task done more easily and cheaply by writing an app to run on low-end Palms given to each interviewee. Seriously.

    If you would like more concrete advice or contacts with people in the industry, email me.

  7. A few commercial packages: by j-turkey · · Score: 3, Informative

    I am not aware of any open-source automated ICR for forms processing. I can, however, offer a few commercial alternatives.

    I have used TELEform, by Cardiff, and was somewhat impressed. It can take multiple different inputs (scanner, fax, email, web POST, etc.), run ICR on them, and store the data in a number of different RDBMSs. I believe that this is a Windows-only package.

    Another package, which I cannot recommend either way because I have no experience with it, is 170 MarkView from 170 Systems. It basically does the same thing.

    I have used TELEform in a medical/clinical setting, where doctors fill out prescriptions and fax them in, character recognition is run server-side, and the results are verified by my data staff. It works pretty well, but you need to keep in mind that handwriting recognition is not infallible, and if you are interested in any level of accuracy, I would recommend that a human verify each comb-box where there is handwritten text. Most verifiers that I've seen are pretty good: you can glance at each field and just pound the tab key to scream through them.
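The "human verifies every handwritten field" rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the `RecognizedField` record and the `split_for_verification` helper are hypothetical names, not part of TELEform or any other ICR product, and a real system would get the handwritten/printed flag from its own form definition.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RecognizedField:
    """One field read off a scanned form (hypothetical record layout)."""
    name: str
    value: str
    handwritten: bool  # True if the subject wrote the answer by hand


def split_for_verification(
    fields: List[RecognizedField],
) -> Tuple[List[RecognizedField], List[RecognizedField]]:
    """Return (auto_accepted, needs_review).

    Every handwritten field goes to the human review queue;
    everything else (checkboxes, printed text) is accepted as read.
    """
    auto_accepted: List[RecognizedField] = []
    needs_review: List[RecognizedField] = []
    for field in fields:
        if field.handwritten:
            needs_review.append(field)
        else:
            auto_accepted.append(field)
    return auto_accepted, needs_review


# Made-up example form with one checkbox and two handwritten answers.
form = [
    RecognizedField("smoker", "no", handwritten=False),
    RecognizedField("last_name", "Fod Na1oyyy", handwritten=True),
    RecognizedField("age", "4Z", handwritten=True),
]
accepted, review_queue = split_for_verification(form)
```

The fewer handwritten fields a form has, the shorter the review queue gets, which is why all-checkbox forms come out so far ahead.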

    As far as statistics for accuracy with TELEform, the numbers that I reported are as follows (the numbers represent the percentage of fields with ICR errors of the specified type):

    ICR error type                  Fields with errors
    ==============================  ==================
    Handwriting                     3.7%
    Combination handwriting/OCR     3.2%
    OCR                             2.9%
    Total                           9.9%

    You can take the first three numbers with a grain of salt (the breakdown by type of ICR error is subjective and somewhat anecdotal), but the total is accurate: expect approximately 9.9% of your fields to come back with errors, and around 6% if you are really careful in designing your forms and train your users on how to write on those forms. These numbers are consistent for all of the ICR systems I have used.
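As a rough illustration of what those rates mean for a data-entry budget, here is a back-of-the-envelope sketch in Python. Only the 9.9% and 6% rates come from the figures above; the study size (200 subjects, 50 fields per form) and the function name are made up for the example.

```python
def expected_error_fields(total_fields: int, error_rate: float) -> int:
    """Estimate how many fields will come back with ICR errors.

    error_rate is a fraction between 0 and 1 (e.g. 0.099 for 9.9%).
    The result is rounded to the nearest whole field.
    """
    if total_fields < 0:
        raise ValueError("total_fields must be non-negative")
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return round(total_fields * error_rate)


TYPICAL_RATE = 0.099  # total error rate reported above
CAREFUL_RATE = 0.06   # with carefully designed forms and trained users

# Hypothetical study size, for illustration only.
subjects = 200
fields_per_form = 50
total = subjects * fields_per_form

typical_errors = expected_error_fields(total, TYPICAL_RATE)
careful_errors = expected_error_fields(total, CAREFUL_RATE)
print(f"{total} fields: about {typical_errors} errors typically, "
      f"about {careful_errors} with careful form design")
```

Under these assumptions, roughly one field in ten still needs a person to fix it, which is worth pricing against plain manual data entry.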

    I hope this helps...


    -Turkey

    --

    -Turkey

  8. Two Words by 4/3PI*R^3 · · Score: 3, Informative
    ...WORK STUDIES
    ...GRADUATE STUDENTS
    ...SLAVE LABOR

    You will spend less time and get better results by hiring starving, desperate-for-money college students to do data entry.