Optical Character Recognition Still Struggling With Handwriting
Ian Lamont recently asked Google if they planned to extend their transcription of books and other printed media to include public records, many of which were handwritten before word processors became ubiquitous. Google wouldn't talk about any potential plans, but Lamont found out a bit more about the limits of optical character recognition in the process:
"Even though some CAPTCHA schemes have been cracked in the past year, a far more difficult challenge lies in using software to recognize handwritten text. Optical character recognition has been used for years to convert printed documents into text data, but the enormous variation in handwriting styles has thwarted large-scale OCR imports of handwritten public documents and historical records. Ancestry.com took a surprising approach to digitizing and converting all publicly released US census records from 1790 to 1930: It contracted the job to Chinese firms whose staff manually transcribed the names and other information. The Chinese staff are specially trained to read the cursive and other handwriting styles from digitized paper records and microfilm. The task is ongoing with other handwritten records, at a cost of approximately $10 million per year, the company's CEO says."
I can't even read people's handwriting; I hardly expect a computer to.
There is a simple reason that general OCR is much harder than cracking a CAPTCHA. General OCR has to recognize text *reliably*. CAPTCHA breakers are thrilled with a 10% success rate, because they use distributed systems created by worms to do the hard work a million times over. If you got 10% of the words right when scanning historical records you might as well not bother.
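To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The function names, the 50-attempt figure and the 1000-word page are made up for illustration, and it assumes every attempt is independent, which is a simplification.

```python
def p_any_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


def expected_words_right(p_single: float, words: int) -> float:
    """Expected number of correctly recognized words in a document."""
    return p_single * words


if __name__ == "__main__":
    # A CAPTCHA breaker only needs one hit and can retry as often as it likes.
    print(f"CAPTCHA, 50 tries at 10%: {p_any_success(0.10, 50):.3f}")
    # A transcription needs every word; at 10% most of the page is wrong.
    print(f"OCR, 1000 words at 10%: {expected_words_right(0.10, 1000):.0f} words right")
```

Retrying turns a 10% hit rate into near-certainty for the attacker, but there is no retry that helps a transcription: the same 10% just means nine words in ten are wrong.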
There is an online archive of everyone who passed through Ellis Island (http://www.ellisisland.org/search/passSearch.asp). It consists of retyped (OCR-ed?) ship manifests. Manifests are lists of passengers, with names, places of birth and similar information. The originals are written by hand, in cursive script (as expected for the late 19th and early 20th century).
The problem is not with the script, but with the lack of appropriate context. Whoever retyped this did not know what to expect in these forms.
My great-grandfather's place of origin was written as "Lipovqani, Slovenia". The pair "lj" was recognized as "q". To a native English speaker, "lj" side by side does not make much sense. But to anyone of Slavic origin, "q" does not make sense (it appears only in foreign words), while "lj" does, since it is the way to write the "soft l" sound, as in "Richelieu".
OK, maybe that was not an easy part to guess. But "Slovenia" was a serious error. At that time, Slovenia did not exist. It was part of Austria-Hungary, and it did not exist as a single entity inside it. What was actually written was "Slavonia". That's an area in eastern Croatia, and it *was* an entity inside Austria-Hungary.
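One way to supply that missing context is to check each transcribed place name against a list of names that could plausibly appear on the form. The sketch below uses Python's standard difflib for the fuzzy match. The two lists are tiny, hypothetical stand-ins for a real period gazetteer, and snap_to_known is a made-up helper, not anything the archive is known to use.

```python
import difflib
from typing import List

# Hypothetical, deliberately tiny lists for illustration only;
# a real gazetteer for the period would be far larger.
REGIONS_1900 = ["Slavonia", "Dalmatia", "Bohemia", "Galicia", "Carniola"]
PLACES = ["Lipovljani", "Novska", "Kutina"]


def snap_to_known(name: str, known: List[str], cutoff: float = 0.6) -> str:
    """Return the closest known name, or the input unchanged if none is close."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else name


if __name__ == "__main__":
    print(snap_to_known("Slovenia", REGIONS_1900))
    print(snap_to_known("Lipovqani", PLACES))
```

This only works if the list matches the period: include a modern "Slovenia" in it and the wrong reading would be confirmed instead of corrected.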
Should I mention that I was not able to track down my great-grandmother and my other great-grandfather?
No sig today.
No.
I own a microfilm digitization / OCR shop. We work with tons of old records such as the ones referenced in this story, as well as old HR docs, check stubs, time cards, architectural drawings, you name it. If you OCR cursive, you don't get back 80%, or 70%, or even 30% accuracy . . . you get back a bunch of pseudo-random (to our eyes) characters which are in NO WAY related to what the actual text is. About the only handwriting recognizable using today's tech is block-print, like you find on engineering diagrams. The technique in this article is pretty standard operating procedure, and has been for some time -- much easier to put a few hundred people on the project and grind through it (and cheaper too compared to data entry rates here in the US -- about 1/3 the price). That usually includes double-keying to check everything and a 99.99999% accuracy guarantee.
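For anyone unfamiliar with double-keying: two operators type the same record independently, and only the fields where they disagree go to a reviewer. A minimal sketch in Python, where the record layout, field names and sample values are all made up for illustration:

```python
from typing import Dict, List


def double_key_mismatches(first: Dict[str, str], second: Dict[str, str]) -> List[str]:
    """Return the field names on which two independent keyings disagree."""
    fields = sorted(set(first) | set(second))
    return [f for f in fields
            if first.get(f, "").strip() != second.get(f, "").strip()]


if __name__ == "__main__":
    # Hypothetical record typed by two different operators.
    keyer_a = {"name": "Ivan Horvat", "origin": "Lipovljani, Slavonia"}
    keyer_b = {"name": "Ivan Horvat", "origin": "Lipovqani, Slovenia"}
    # Only the disputed fields need a third pair of eyes.
    print(double_key_mismatches(keyer_a, keyer_b))
```

The check catches an error only when the two operators make different mistakes; if both misread a word the same way, it passes unnoticed.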
Just FYI, there are only a few OCR engines out there. Probably the most commonly used is the ABBYY engine, which is both OEMed and sold directly as desktop- and server-based products by ABBYY. There are a few others as well, and despite their differences, most have pretty much the same capabilities and accuracy. But OCR of cursive, especially of the docs cited in the article where you don't have someone sit down and "train" the machine first with handwriting samples, is still one of the great "unsolved" computing problems. I expect we'll have the capability in the next decade or so as processor core density, memory, and storage continue to increase at their current rate -- eventually, the machine will be able to "brute-force" through the docs just like the Chinese data entry folks in this article.
For a moment there, I was picturing some new technology that could distinguish between C, Perl and Java written on scratch paper.
In pseudocode:
// undecipherable
IF LooksLikeC THEN "This must be C code"
ELSE IF LooksLikeJava THEN "This must be Java code"
ELSE "Must be Perl code"
i've outsourced all of my computer applications and software needs to India.
instead of using PowerPoint at meetings, i just have two Indian women in bikinis hold up large displays with my bullet points written on them--they even do slide transitions.
instead of an e-mail client, i use an Indian courier. it takes a while for me to communicate with international clients, but i receive practically no spam.
and rather than a word processor i have a guy with a notepad that i dictate to. he also offers me helpful tips when he notices that i'm trying to write a letter.
then there's the 17-year-old i have doing my taxes. i don't even think he's out of high school yet, but he beats Turbo Tax any day.
but you should really see the guy i have simulating Windows Vista for me. he wears this really slick suit, moves really slow, and every once in a while he comes up to me and kicks me in the balls.