Slashdot Mirror


Optical Character Recognition Still Struggling With Handwriting

Ian Lamont recently asked Google if they planned to extend their transcription of books and other printed media to include public records, many of which were handwritten before word processors became ubiquitous. Google wouldn't talk about any potential plans, but Lamont found out a bit more about the limits of optical character recognition in the process: "Even though some CAPTCHA schemes have been cracked in the past year, a far more difficult challenge lies in using software to recognize handwritten text. Optical character recognition has been used for years to convert printed documents into text data, but the enormous variation in handwriting styles has thwarted large-scale OCR imports of handwritten public documents and historical records. Ancestry.com took a surprising approach to digitizing and converting all publicly released US census records from 1790 to 1930: It contracted the job to Chinese firms whose staff manually transcribed the names and other information. The Chinese staff are specially trained to read the cursive and other handwriting styles from digitized paper records and microfilm. The task is ongoing with other handwritten records, at a cost of approximately $10 million per year, the company's CEO says."

5 of 150 comments

  1. Half the time.. by Miststlkr · Score: 5, Insightful

    I can't even read people's handwriting, I hardly expect a computer to.

    1. Re:Half the time.. by glwtta · Score: 4, Insightful

      Hell, I can't even read my own handwriting. Yeah, this is probably not going to happen.

      --
      sic transit gloria mundi
  2. general OCR harder than CAPTCHA OCR by Anonymous Coward · Score: 5, Insightful

    There is a simple reason that general OCR is much harder than cracking a CAPTCHA. General OCR has to recognize text *reliably*. CAPTCHA breakers are thrilled with a 10% success rate, because they use distributed systems created by worms to do the hard work a million times over. If you got 10% of the words right when scanning historical records you might as well not bother.
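
The asymmetry described here can be put in rough numbers. What follows is a minimal sketch, assuming every attempt is independent and using the 10% figure from the comment; the function names and the 1,000-word record size are hypothetical, chosen only for illustration.

```python
import math


def p_at_least_one_success(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p) ** attempts


def attempts_needed(p: float, target: float) -> int:
    """Smallest number of independent tries giving at least `target` overall success."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


def expected_correct_words(p: float, words: int) -> float:
    """Expected number of correctly read words when each word is read once."""
    return p * words


# A CAPTCHA breaker can simply retry, so a weak recognizer is good enough.
retry_success = p_at_least_one_success(0.10, 50)
tries_for_99_percent = attempts_needed(0.10, 0.99)

# A records scanner reads each word once, so the per-word rate is the final rate.
# The 1,000-word record is a hypothetical size.
words_right = expected_correct_words(0.10, 1000)

print(f"{retry_success:.4f} {tries_for_99_percent} {words_right:.0f}")
```

Retrying pushes the breaker's overall success rate toward certainty, while the scanner is stuck at its per-word rate because a historical record offers no second, differently rendered copy of the same word.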

  3. Re:Too variable, less reference by Kickersny.com · Score: 4, Insightful

    While handheld technology is indeed getting better, it's not directly applicable to the problem at hand. Real-time handwriting analysis uses stroke analysis as well as shape analysis to determine the letter(s). That is, the order in which you construct your letters matters very much. For example, if you crossed your T before drawing the vertical bar, the engine may have a difficult time figuring out what you intended.

    When OCRing documents, all of that 'meta-information' is lost.
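
The lost 'meta-information' can be shown with a toy example. This is only a sketch under stated assumptions: strokes are modeled as ordered lists of already-sampled pixel coordinates, and `rasterize` is a hypothetical stand-in for scanning a page, not the API of any real handwriting engine.

```python
from typing import FrozenSet, List, Tuple

Point = Tuple[int, int]
Stroke = List[Point]  # points in the order the pen visited them


def rasterize(strokes: List[Stroke]) -> FrozenSet[Point]:
    """Collapse pen strokes into the set of inked pixels.

    Stroke order and pen direction are discarded, which is all a
    scanned page can offer.
    """
    return frozenset(point for stroke in strokes for point in stroke)


# A capital T on a 5x5 grid: a horizontal bar and a vertical stem.
bar: Stroke = [(x, 0) for x in range(5)]
stem: Stroke = [(2, y) for y in range(5)]

stem_first = [stem, bar]           # the order a stroke-based engine might expect
bar_first = [bar, stem]            # crossing the T before drawing the stem
bar_reversed = [stem, bar[::-1]]   # bar drawn right to left

# The pen input differs, so a real-time recognizer can tell these apart...
assert stem_first != bar_first

# ...but the scanned images are identical.
assert rasterize(stem_first) == rasterize(bar_first) == rasterize(bar_reversed)
```

An offline recognizer only ever sees the output of something like `rasterize`, so it has to work from shape alone.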

  4. Now you have a training dataset. by bigattichouse · Score: 4, Insightful

    Now you take the human transcriptions, and use them to train your genetic algo or neural net against the original images.

    --
    meh
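
The idea in the last comment, pairing each scanned image with its human transcription and fitting a model to the pairs, can be sketched in a few lines. Note the swap: this uses a nearest-centroid classifier in place of the neural net or genetic algorithm the comment mentions, purely to keep the example short. The 3x3 glyphs are made-up data, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Image = Sequence[float]  # flattened pixel intensities, row by row


def train_centroids(pairs: List[Tuple[Image, str]]) -> Dict[str, List[float]]:
    """Average all images that share a human-supplied label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for image, label in pairs:
        if label not in sums:
            sums[label] = [0.0] * len(image)
        for i, value in enumerate(image):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(image: Image, centroids: Dict[str, List[float]]) -> str:
    """Return the label whose average image is closest in squared distance."""
    def sq_dist(a: Image, b: Image) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(image, centroids[label]))


# Made-up 3x3 "scans" paired with the label a human transcriber would supply.
transcribed_pairs = [
    ([0, 1, 0, 0, 1, 0, 0, 1, 0], "I"),
    ([0, 1, 0, 0, 1, 0, 0, 1, 1], "I"),
    ([1, 0, 0, 1, 0, 0, 1, 1, 1], "L"),
    ([1, 0, 0, 1, 0, 0, 1, 1, 0], "L"),
]
centroids = train_centroids(transcribed_pairs)

# An unseen, slightly damaged "I" with its bottom pixel missing.
smudged_i = [0, 1, 0, 0, 1, 0, 0, 0, 0]
assert classify(smudged_i, centroids) == "I"
```

A real system would need far more examples per writer and per letterform than this, but the structure is the same: the manual transcriptions serve as labels for the original images.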