Slashdot Mirror


Automated OCR for Forms Processing?

Oscar Carrillo asks: "We have to do a large NIH grant which collects tons of data. And much of that data is in the form of questionnaires. The forms will be available on the web, but it's mostly not feasible to have the subjects sit in front of the computer all day (not to mention that people get annoyed sitting in front of a computer all day). The study is being conducted at several universities and institutions around the country. Using Linux/JSP/Struts/PostgreSQL will take care of most of our needs. But it would save a lot of data entry, if all forms could be scanned at each site, images uploaded to the website, and then automatically put through OCR (Optical Character Recognition) to get only the relevant raw data that subjects wrote. Does anyone know of something that can handle this? Are there any open source projects that can handle this? Any good commercial alternatives?"

4 of 30 comments

  1. Bubbles by adamy · · Score: 3, Insightful

    Remember those old annoying CTBS tests and SAT stuff? If these surveys are multiple choice, use the old #2 lead pencil and scan 'em in that way. Your data will already be entered. Most universities have the facilities for this already.

    Do not count on handwriting recognition to be successful for the people who fill out the surveys. While it works fine for typeset and computer-generated print, it won't work for many different handwritings and many different idiomatic expressions.

    --
    Open Source Identity Management: FreeIPA.org
  2. OCR forms processing is problematic at best by Pauly · · Score: 4, Insightful
    Having worked at one of the world's largest OCR/Forms processing vendors, take it from me: don't do this.

    OCR forms processing does:

    • waste money and time
    • create unnecessary pain
    • require high-quality and expensive printed forms
    • require high-quality and expensive scanning equipment
    • introduce more human error

    OCR forms processing does NOT:

    • "save a lot of data entry"
    • do anything automatically (unless your forms are all checkboxes)
    • save money or time
    That said, if you have a lot of questions to be answered, a well-designed form using as few handwritten responses as possible (all checkboxes are best) may be viable.

    Frankly, most of the large projects I worked on could have gotten the task done more easily and cheaply by writing an app to run on low-end Palms given to each interviewee. Seriously.

    If you would like more concrete advice or contacts with people in the industry, email me.

  3. Captiva by HockeyP9 · · Score: 2, Insightful

    I'm not sure, but I believe a company called Captiva can do the type of capturing you're talking about. I know many companies use it to process tax forms and such. It's definitely not cheap or open source, though.

  4. Try using an email-based form instead by Chip42 · · Score: 2, Insightful
    I would agree with other posters that OCR will not solve your problem. I would also agree with you that having people sit in front of a computer for long periods is not a good idea.

    I have created a system to help in reviewing proposed changes to the Convention on International Trade in Endangered Species (CITES), which sends out a formatted email form to reviewers (many of whom are in developing countries). The reviewers reply with their answers and a simple RegEx sucks the answers into a database.

    Benefits of this approach:
    • Iterative- users can return to their email program several times before completing and sending the form. You don't have to complete it in one sitting (our questionnaire is about 150 detailed questions long)
    • Offline- for users that don't have a dedicated net connection (yes, they exist)
    • Transaction based- you know that an email has been successfully sent
    • Low-tech- a text-based email is a kind of lowest common denominator. You don't have problems with plug-ins or JRE versions with this solution
    This kind of thing is easy to build. The tricky part is predicting what users will do with it. In our forms the questions look like-
    ||Please tell us your shoe size::
    ||What color are your shoes?::
    The directions originally told users to enter data between the :: after a question and the || before the next question. Result: many of them answered like
    ||Please tell us your shoe size:My size is twelve:
    and caused the RegEx to miss the answer.
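
    A minimal sketch of a more forgiving parser for replies like the ones above, in Python. This is not the code used for the CITES system; the function name `parse_reply` and both patterns are assumptions made for illustration. It splits the reply on `||`, tries the intended `question::answer` layout first, then falls back to the `question:answer:` mistake, and collects anything else for manual review.

```python
import re

# Intended layout: "question::answer" (answer typed after the double colon).
STRICT = re.compile(r"^(?P<question>[^:]+)::(?P<answer>.*)$", re.DOTALL)
# Observed mistake: "question:answer:" (answer typed between the two colons).
BETWEEN = re.compile(r"^(?P<question>[^:]+):(?P<answer>[^:]+):$", re.DOTALL)


def parse_reply(body):
    """Return (answers, unparsed) for the text of one email reply.

    answers maps question text to answer text; unparsed lists the chunks
    that matched neither layout so a person can look at them.
    """
    answers = {}
    unparsed = []
    # Every field starts with "||"; text before the first "||" is ignored.
    for chunk in body.split("||")[1:]:
        chunk = chunk.strip()
        if not chunk:
            continue
        match = STRICT.match(chunk) or BETWEEN.match(chunk)
        if match:
            question = match.group("question").strip()
            answers[question] = match.group("answer").strip()
        else:
            unparsed.append(chunk)
    return answers, unparsed


reply = (
    "||Please tell us your shoe size:My size is twelve:\n"
    "||What color are your shoes?::brown\n"
)
answers, unparsed = parse_reply(reply)
```

    The sketch assumes that question text never contains a colon and that answers never contain `||`. Keeping the unparsed chunks rather than dropping them means a reply that misses both patterns still gets seen by someone.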

    Best of luck