Automated OCR for Forms Processing?
Oscar Carrillo asks: "We have to do a large NIH grant which collects tons of data. And much of that data is in the form of questionnaires. The forms will be available on the web, but it's mostly not feasible to have the subjects sit in front of the computer all day (not to mention that people get annoyed sitting in front of a computer all day). The study is being conducted at several universities and institutions around the country. Using Linux/JSP/Struts/PostgreSQL will take care of most of our needs. But it would save a lot of data entry, if all forms could be scanned at each site, images uploaded to the website, and then automatically put through OCR (Optical Character Recognition) to get only the relevant raw data that subjects wrote. Does anyone know of something that can handle this? Are there any open source projects that can handle this? Any good commercial alternatives?"
Remember those old annoying CTBS tests and SAT stuff? If these surveys are multiple choice, use the old #2 lead pencil and scan 'em in that way. Your data will already be entered. Most universities have the facilities for this already.
Do not count on handwriting recognition to be successful for the people who fill out the surveys. While it works fine for typeset and computer-generated print, it won't work for many different handwritings and many different idiomatic expressions.
Open Source Identity Management: FreeIPA.org
Check out NLM's DocMorph at docmorph.nlm.nih.gov/docmorph/default.htm. It's a site put up by the NIH (coincidentally) where you post your scanned image and they post an OCR'ed document, in the format you choose, for a short period. It does a fairly good job for the price (free).
- Bill
For one of our clients, OCR forms made some sense, but the problem was that a computer form was vastly easier to use for our purposes. If someone typed in a vendor name, the computer form made an educated guess after the first few characters; if it was incorrect, you just kept on typing, and if it was correct you just tabbed over. An OCR form system would have to be able to correlate "Sqishy Soft", "Squish-Soft", "SS LLC", "SquishySofware" and every permutation and bad spelling with vendor #1212, whereas the computer form would just let you pick vendor "Squishy Software INC" from the list.
For offsite form entry, Psion Revos worked wonderfully, but we've moved everything to the Sharp Zaurus due to the uncertain future of Psion. We've kept the Psions, but just replace them when they break.
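To give a feel for the correlation problem described above, here is a minimal sketch using Python's standard difflib module. Only vendor #1212 and the misspellings come from the post; the other vendors, the normalize and guess_vendor helpers, and the 0.5 cutoff are assumptions made up for illustration, not anything the poster's system did.

```python
import difflib

# Hypothetical vendor table; only vendor #1212 appears in the post.
VENDORS = {
    1212: "Squishy Software INC",
    1300: "Acme Tooling Co",
    1417: "Northwind Traders",
}

def normalize(name):
    """Lowercase the name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def guess_vendor(ocr_text, cutoff=0.5):
    """Return (vendor_id, score) for the closest vendor name,
    or None when no name reaches the cutoff (assumed value)."""
    target = normalize(ocr_text)
    best = None
    for vendor_id, name in VENDORS.items():
        score = difflib.SequenceMatcher(None, target, normalize(name)).ratio()
        if score >= cutoff and (best is None or score > best[1]):
            best = (vendor_id, score)
    return best

if __name__ == "__main__":
    for text in ["Sqishy Soft", "Squish-Soft", "SquishySofware", "SS LLC"]:
        print(text, "->", guess_vendor(text))
```

Plain misspellings land on vendor 1212, but an abbreviation like "SS LLC" shares too few letters with the full name and falls through to None, so it would still need a person or a hand-maintained alias list.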
Sorry for the ramble, It's lunch time.
Moneyed corporations, non-working 'poor' and criminal prisoners are turning productive citizens into tax-slaves.
it's mostly not feasible to have the subjects sit in front of the computer all day
Then I guess somebody forgot to tell my boss.
Karma: Good (despite my invention of the Karma: sig)
Does anyone know of something that can handle this?
High school students? Technology isn't the answer to everything, and if these are handwritten you're not going to have very much success trying to automate the recognition. My name is Fod Na1oyyy, etc
Doesn't the State of Florida have a forms tallying system they're looking to unload?
Operator, give me the number for 911!
OCR forms processing does:
OCR forms processing does NOT:
- "save a lot of data entry"
- do anything automatically (unless your forms are all checkboxes)
- save money or time
That said, if you have a lot of questions to be answered, a well designed form using as few handwritten responses as possible (all checkboxes are best) may be viable. Frankly, most of the large projects I worked on could have gotten the task done easier and cheaper by writing an app to run on low-end Palms given to each interviewee. Seriously.
If you would like more concrete advice or contacts with people in the industry, email me.
I am not aware of any open-source automated ICR for forms processing. I can, however, offer a few commercial alternatives.
I have used TELEform, by Cardiff and was somewhat impressed. It can take multiple different inputs (scanner, fax, email, web POST, etc), run ICR on them, and store the data in a number of different RDBMS. I believe that this is a Windows-only package.
Another package, which I cannot recommend only because I have no experience with it, is 170 MarkView from 170 Systems; it basically does the same thing.
I have used TELEform in a medical/clinical setting, where doctors fill out prescriptions and fax them in, and character recognition is run server-side, where it is verified by my data staff. It works pretty well, but you need to keep in mind that handwriting recognition is not infallible, and if you are interested in any level of accuracy, I would recommend that a human verify each comb-box where there is handwritten text. Most verifiers that I've seen are pretty good, and you can glance at each one and just pound the tab key to scream through the fields.
As far as statistics for accuracy with TELEform, the numbers that I reported are as follows (the numbers represent the percentage of fields with ICR errors of the specified type):
ICR error type                 Fields with errors
=============================  ==================
Handwriting                    3.7%
Combination Handwriting/OCR    3.2%
OCR                            2.9%
Total                          9.9%
You can take the first three numbers with a grain of salt (the numbers based on what kind of ICR error occurred are subjective and somewhat anecdotal), but the total is accurate: expect to have approximately 9.9% of your fields come back with errors, and around 6% if you are really careful in designing your forms and train your users on how to write on those forms. These numbers are consistent for all of the ICR systems I have used.
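To put those per-field rates in terms of whole forms, here is a small back-of-the-envelope sketch. The 20-field form is a made-up example, and treating field errors as independent is an assumption of mine, not something measured above.

```python
def expected_field_errors(num_fields, field_error_rate):
    """Average number of fields per form that come back wrong."""
    return num_fields * field_error_rate

def chance_form_has_error(num_fields, field_error_rate):
    """Probability that at least one field on a form is wrong,
    assuming errors in different fields are independent."""
    return 1.0 - (1.0 - field_error_rate) ** num_fields

if __name__ == "__main__":
    fields = 20  # hypothetical form size
    for rate in (0.099, 0.06):
        print(
            f"rate {rate:.1%}: "
            f"{expected_field_errors(fields, rate):.2f} bad fields per form, "
            f"{chance_form_has_error(fields, rate):.0%} of forms need a fix"
        )
```

Under those assumptions, roughly 88% of 20-field forms would contain at least one error at the 9.9% rate, and about 71% at the 6% rate, which is why human verification of every form is hard to avoid.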
I hope this helps...
-Turkey
You will spend less time and get better results hiring starving, desperate for money, college students to do data entry.
If it's handwritten, just forget it. You'll have enough problems getting people able to read it, much less computers. The postal service does do some of this, but they have a secret: they know all the valid addresses and can do cross-referencing between different parts if they really have to.
If it's typed you might be able to OCR it, but don't count on it being truly reliable and plan on saving the image as well - you're going to have to be able to go back to it.
If it's filled and printed you might be able to put something together, but if you have that why not have them send the data electronically instead?
If it's fill and print but you can't communicate it electronically, see if you can generate barcodes. If this is the case, I assume it's generated by an application instead of an HTML form, since an HTML form could be communicated back to the server.
The main situation I know of where OCRing worked well was for an imaging system - the company wanted to store images of all the work order pages for each customer (to include signatures & handwritten notes), tied back to the database. Since the initial work orders were printed, all that needed to be OCRed was the work order number, which both included check digits and needed to match against a known work order already in the database. Even then, there were provisions for dealing with the ones that weren't recognized. Barcodes weren't used because the imaging system was separate from the creation system, which didn't have the capability to generate them.
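The two safeguards described above (check digits plus a match against known work orders) can be sketched as follows. The post does not say which check-digit scheme was used, so the Luhn algorithm here is a stand-in, and accept_work_order and the sample numbers are hypothetical.

```python
def luhn_valid(number):
    """Return True if a string of digits passes the Luhn check.
    (Assumed scheme; the original system's scheme is not stated.)"""
    if not number.isdigit():
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, ch in enumerate(reversed(number)):
        digit = int(ch)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def accept_work_order(ocr_text, known_orders):
    """Accept an OCRed number only if it passes the check digit
    and matches a work order already on file; otherwise return
    None so the page can be routed to a person."""
    candidate = ocr_text.strip()
    if luhn_valid(candidate) and candidate in known_orders:
        return candidate
    return None

if __name__ == "__main__":
    known = {"79927398713"}  # stand-in for a database lookup
    print(accept_work_order(" 79927398713\n", known))  # clean read
    print(accept_work_order("79927398773", known))     # a 1 misread as a 7
```

The misread number fails the check digit before the lookup is even tried, and anything that returns None goes to the manual queue, matching the "provisions for dealing with the ones that weren't recognized".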
fencepost
just a little off
I'm not sure, but I believe a company called Captiva can do the type of capturing you're talking about. I know many companies use it to process tax forms and such. It's definitely not cheap or open source, though.
I know an outstanding programmer who works with a number of OS platforms and whom I would call an expert on OCR, forms recognition, etc. Check out http://www.microimagesys.com and contact Mr. Lunglhofer. Also, look at Kofax for your image and OCR retrieval from scanned documents. I am not 100% sure Adobe has a *nix version, but I create a considerable number of e-forms in Adobe (and learned this from Mr. Lunglhofer). These forms are used in an enormous variety of electronic, web-based, and non-web applications. Ask him what he would suggest and see what kind of product he could provide for you.
I think with the interesting people, their lives can't possibly be wrapped up into a nice little package.
I have created a system to help in reviewing proposed changes to the Convention on International Trade in Endangered Species (CITES), which sends out a formatted email form to reviewers (many of whom are in developing countries). The reviewers reply with their answers and a simple RegEx sucks the answers into a database.
Benefits of this approach:
- Iterative- users can return to their email program several times before completing and sending the form. You don't have to complete in one sitting (our questionnaire is about 150 detailed questions long)
- Offline- for users that don't have a dedicated net connection (yes they exist)
- Transaction based- you know that an email has been successfully sent
- Lowtech- a text based email is a kind of lowest common denominator. You don't have problems with plug-ins or JRE versions with this solution
This kind of thing is easy to build. The tricky part is predicting what users will do with it. In our forms the questions look like:
||Please tell us your shoe size::
||What color are your shoes?::
The directions originally told users to enter data between the
||Please tell us your show size:My size is twelve:
and caused the RegEx to miss the answer.
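For anyone wanting to try this, here is a minimal sketch of pulling answers out of a reply in the ||question:answer: layout shown above. The pattern is my own guess at a forgiving version, not the RegEx actually used; in particular, tolerating "> " quote prefixes and colons inside an answer are assumptions about what replies might contain.

```python
import re

# One field per line: ||question:answer:
# Leading "> " quoting and trailing whitespace are tolerated (assumption).
# The answer is matched greedily up to the last colon on the line,
# so an answer such as "10:30" survives intact.
FIELD = re.compile(
    r"^[> \t]*\|\|(?P<question>[^:\n]+):(?P<answer>.*):[ \t\r]*$",
    re.MULTILINE,
)

def parse_reply(body):
    """Return a dict mapping each question to its (possibly empty) answer."""
    return {
        m.group("question").strip(): m.group("answer").strip()
        for m in FIELD.finditer(body)
    }

if __name__ == "__main__":
    reply = (
        "Here are my answers.\n"
        "> ||Please tell us your shoe size:My size is twelve:\n"
        "> ||What color are your shoes?::\n"
    )
    for question, answer in parse_reply(reply).items():
        print(repr(question), "->", repr(answer))
```

Empty answers come back as empty strings, which makes it easy to list the questions a reviewer skipped before loading anything into the database.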
Best of luck
If the forms are mostly checkboxes, you can probably scan it as a picture, then look in the right areas for crud in checkboxes. Might need some alignment with known markings in corners. If there is some writing or text (serial number, name), enter that manually while displaying the picture on screen. This is also a good time to ask about questions which seem to have no or multiple boxes checked. "Please clarify question 3C."
Sometimes simple brute force does wonders.
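The brute-force checkbox idea can be sketched in a few lines. This assumes the scan is already aligned and loaded as rows of grayscale values (0 is black, 255 is white); the box coordinates, the 128 darkness threshold, and the 20% fill cutoff are all made-up values that would need tuning against real forms.

```python
def dark_fraction(image, box, threshold=128):
    """Fraction of pixels inside box that are darker than threshold.
    image is a list of rows of grayscale ints (0..255);
    box is (top, left, bottom, right) with exclusive bottom/right."""
    top, left, bottom, right = box
    pixels = [image[r][c] for r in range(top, bottom) for c in range(left, right)]
    dark = sum(1 for p in pixels if p < threshold)
    return dark / len(pixels)

def read_question(image, boxes, checked_at=0.2):
    """Return the label of the single marked box, or None when no box
    or several boxes look marked (a human should look at those)."""
    marked = [
        label for label, box in boxes.items()
        if dark_fraction(image, box) >= checked_at
    ]
    if len(marked) == 1:
        return marked[0]
    return None

if __name__ == "__main__":
    # A tiny all-white "scan" with pencil crud inside the "yes" box.
    page = [[255] * 20 for _ in range(10)]
    for r in range(3, 7):
        for c in range(3, 7):
            page[r][c] = 0
    boxes = {"yes": (2, 2, 8, 8), "no": (2, 12, 8, 18)}
    print(read_question(page, boxes))
```

Anything that comes back as None is exactly the case for showing the picture on screen and asking someone to clarify.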
Infuriate left and right