Optical Character Recognition Still Struggling With Handwriting
Ian Lamont recently asked Google if they planned to extend their transcription of books and other printed media to include public records, many of which were handwritten before word processors became ubiquitous. Google wouldn't talk about any potential plans, but Lamont found out a bit more about the limits of optical character recognition in the process:
"Even though some CAPTCHA schemes have been cracked in the past year, a far more difficult challenge lies in using software to recognize handwritten text. Optical character recognition has been used for years to convert printed documents into text data, but the enormous variation in handwriting styles has thwarted large-scale OCR imports of handwritten public documents and historical records. Ancestry.com took a surprising approach to digitizing and converting all publicly released US census records from 1790 to 1930: It contracted the job to Chinese firms whose staff manually transcribed the names and other information. The Chinese staff are specially trained to read the cursive and other handwriting styles from digitized paper records and microfilm. The task is ongoing with other handwritten records, at a cost of approximately $10 million per year, the company's CEO says."
Beat up Martin = Eat up Martha
Now that's a name that'll be remembered
It seems to me that it would be better to OCR everything and contract the proof-reading to the Chinese firm. The wide variation of writing styles and letter forms may make 100% accuracy of OCR impossible for this task, but starting from OCR should reduce the task, shouldn't it?
"Please describe the scientific nature of the 'whammy'" - Agent Scully
I can't even read people's handwriting, I hardly expect a computer to.
1. Use the handwritten words as CAPTCHAs ...
2. Wait for the bad guys to come up with programs to break them.
3.
4. Profit!
There is a simple reason that general OCR is much harder than cracking a CAPTCHA. General OCR has to recognize text *reliably*. CAPTCHA breakers are thrilled with a 10% success rate, because they use distributed systems created by worms to do the hard work a million times over. If you got 10% of the words right when scanning historical records you might as well not bother.
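The arithmetic behind this point can be sketched in a few lines of Python. This is only an illustration: the 10% rate comes from the comment above, while the function name and the retry count of 50 are made up for the example, and it assumes every attempt is independent.

```python
def chance_of_any_success(per_try_rate: float, tries: int) -> float:
    """Probability that at least one of `tries` independent attempts succeeds."""
    return 1.0 - (1.0 - per_try_rate) ** tries


# A CAPTCHA breaker can keep requesting fresh challenges (50 is a hypothetical count).
print(f"CAPTCHA breaker, 50 tries: {chance_of_any_success(0.10, 50):.3f}")

# A transcription pass gets exactly one try per word, so 10% stays 10%.
print(f"Transcription, 1 try:      {chance_of_any_success(0.10, 1):.3f}")
```

Retries help the attacker only because the CAPTCHA server says whether each guess was right; a scanned census page gives no such feedback.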
Just slice them into smaller word chunks, and have humans OCR them a word at a time, using multiple passes at the words to verify them. At the same time, verified words can be used as CAPTCHAs.
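A rough sketch of the "multiple passes to verify" idea, assuming each word chunk is shown to several people and a reading is accepted only when enough of them agree. The function name, the `min_votes` parameter, and the lowercase normalization are assumptions made for illustration, not anything a real service is known to use.

```python
from collections import Counter


def consensus(transcriptions, min_votes=2):
    """Return the agreed reading of one word chunk, or None if there is no clear winner."""
    if not transcriptions:
        return None
    # Ignore case and stray whitespace so "Martha" and "martha " count as the same vote.
    normalized = [t.strip().lower() for t in transcriptions]
    ranked = Counter(normalized).most_common(2)
    reading, votes = ranked[0]
    if votes < min_votes:
        return None  # not enough agreement yet; show the chunk to more people
    if len(ranked) > 1 and ranked[1][1] == votes:
        return None  # tie between two readings; needs another pass
    return reading


print(consensus(["Martha", "martha ", "Martin"]))
print(consensus(["Martha", "Martin"]))
```

Only chunks that return a reading would be safe to reuse as CAPTCHAs, since the expected answer is then known.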
An OCR program can include a bank of fonts, and even when there is some sort of spill/ink blot/whatever on the paper, it has a solid reference. Handwriting isn't so easy, because humans don't always write their "Q"s with the line in the exact same spot and other fluctuations. Even if you gave a computer a point of reference (neatly drawn letters corresponding with their actual alphabetical values), a computer probably couldn't get it for a lot of people with inconsistent handwriting.
Now, with context and improved technology, I don't think that handwriting recognition is impossible. I have a feeling that it will be a technology like speech recognition: never perfect, and it will require training.
For a moment there, I was picturing some new technology that could distinguish between C, Perl and Java written on scratch paper.
Now you take the human translated recognition, and use it to train your genetic algo or neural net against the original images.
meh
I can't even read my own handwriting sometimes. How is a computer supposed to read it, unless it knows what I was thinking at that particular point in time?
I hope they didn't give them the Presidential Book of Secrets, we could all be in trouble then!
Skip ------ See the latest from http://www.anArchyFortWorth.com
OCR = Optical Character Recognition, not optical code recognition.
Use handwritten CAPTCHAs?
Particles, stuff that matters.
I guess we should start making kindergartners write in "Times New Roman" from now on.
If our elected representatives no longer represent us, do we still live in a Democracy?
C in this case means Character, not Code. See one definition.
I have never seen the word Code used in an English definition.
Why's that a surprise... I can't read my own handwriting either... And I certainly can't read the handwriting written by someone else a hundred years ago... Why do you expect a program to be able to do so...
There is an on-line archive of all the people who passed through Ellis Island (http://www.ellisisland.org/search/passSearch.asp). It consists of retyped (OCR-ed?) ship manifests. Manifests are lists of passengers, with names, places of birth and similar information. In the original, they are written by hand, in cursive script (as expected for the late 19th and early 20th century).
The problem is not the script but the lack of appropriate context. Whoever retyped these did not know what to expect in these forms.
My great-grandfather's place of origin was transcribed as "Lipovqani, Slovenia". The pair "lj" was read as "q". To a native English speaker, "lj" side by side does not make much sense. But to anyone of Slavic origin, "q" does not make sense (it appears only in foreign words), while "lj" does, since it is the way to write the "soft l" sound, as in "Richelieu".
OK, maybe that was not an easy part to guess. But "Slovenia" was a serious error. At that time, Slovenia did not exist. The area was part of Austria-Hungary, and it did not exist as a single entity inside it. What was really written was "Slavonia". That's an area in Eastern Croatia, and it *was* an entity inside Austria-Hungary.
Should I mention that I was not able to track down my great-grandmother and my other great-grandfather?
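The missing context could in principle be supplied as a list of place names that were valid when the record was written. Here is a minimal sketch using Python's standard `difflib`; the region list is a tiny illustrative sample, and the function name is hypothetical.

```python
import difflib

# Illustrative sample only: a few regions of Austria-Hungary around 1900.
REGIONS_CIRCA_1900 = ["Slavonia", "Dalmatia", "Galicia", "Bohemia"]


def check_region(transcribed, valid_regions):
    """Return the transcription if it is a known region, else the closest known one (or None)."""
    if transcribed in valid_regions:
        return transcribed
    # get_close_matches ranks candidates by string similarity (default cutoff 0.6).
    matches = difflib.get_close_matches(transcribed, valid_regions, n=1)
    return matches[0] if matches else None


# "Slovenia" is not in the period list, so the nearest valid name is suggested instead.
print(check_region("Slovenia", REGIONS_CIRCA_1900))
```

A suggestion like this should only flag the entry for a human to recheck against the image; string similarity alone cannot prove what was written.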
No sig today.
Can OCR properly trace the lines at least to replicate it? Meaning, it could make a vector replica of the handwriting? Would be neat if it could do that, then try to straighten out the lines, perhaps to simulate the possible path the original writer took to write it. Of course, the software will have to figure out intersections. Maybe a path of logic would be to know what turns a handwriter would NOT take, and then determine individual letters from that.
Combine that with other logic, like finding "dots" would indicate an i or a j, and maybe it will improve.
Get the guys writing the code that breaks captcha.
Simple, honestly. Make it economically worthwhile to write the code to do it. Writing code to break handwriting isn't as lucrative as, say, writing viruses or malware.
Take a look at the results...
disclaimer: I doubt they will EVER break my doc's handwriting.
--Toll_Free
Doesn't USPS's system rely on optical character recognition? I thought it had a really high success rate... I know the software we used when I used to work at the Federal Reserve wasn't all that good. Anything it rejected would be sent to two people on a computer who would type in what they thought the letters were. If they matched, it would go through; if not, it would go to a third person to make the final decision. After seeing that much handwriting I don't think we'll ever have software that's 100%... Especially when you can't even read your own handwriting sometimes *chuckle*
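The two-typists-plus-tiebreaker workflow described here is usually called double keying. A minimal sketch follows, assuming each person is modeled as a function that takes the rejected item and returns what they typed; all names are hypothetical.

```python
def double_key(item, keyer_a, keyer_b, adjudicator):
    """Accept an entry when two independent keyers agree; otherwise ask a third person."""
    first = keyer_a(item)
    second = keyer_b(item)
    if first == second:
        return first
    # The third person sees the item plus both conflicting readings and makes the final call.
    return adjudicator(item, first, second)


# Stand-in "people" for demonstration only.
result = double_key(
    "scan-0001",
    keyer_a=lambda item: "1250.00",
    keyer_b=lambda item: "1280.00",
    adjudicator=lambda item, a, b: a,
)
print(result)
```

The value of the scheme rests on the two keyers working independently, so that they are unlikely to make the same mistake.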
For a moment there, I was picturing some new technology that could distinguish between C, Perl and Java written on scratch paper.
In pseudocode:
// undecipherable
IF LooksLikeC THEN "This must be C code"
IF LooksLikeJava THEN "This must be Java code"
ELSE "Must be Perl code"
Back in high school, I had a job that involved creating a database for a local cemetery's burial records. For 120 years, these records had been kept in a set of handwritten journals with a semi-alphabetical index. Given the time span, there had been many generations of people making these handwritten entries...and the differences in penmanship were outstanding.
Some time around the 1940's or 1950's, the job passed from a fountain-pen user to a fan of the ballpoint. Wow, what a difference. Early ballpoint pens were crap! Lots of lumpy smudges all over the place.
Ink quality aside, the shift to the ballpoint heralded the end of readable writing. Really, everything before it had been a beauty to look upon, and everything after was chicken scratches.
Mind you, this is all greatly anecdotal...it's just the handwriting of a half dozen people over a century. But I really believe that the 'convenience' of the ballpoint led to people taking less care. Fountain pens required more care and skill; ballpoints lowered the bar.
Hmmm...I'll finish with some /. relevant content: I set up the database on an 80's era Macintosh 128K, working in my parents' basement.
OK, so their CAPTCHA has just been broken, and computers cannot read handwriting... why not use handwriting as CAPTCHAs?
-- There are 10 types of people in the world: Those who understand binary, And those who don't.
Apparently all that's necessary for Google (or anyone) to convert all handwritten documents to text is not OCR but human computation. What about using something like the Google Image Labeler? Instead of using random pictures, they could just use fragments of handwritten text. One could easily create software that automatically breaks handwritten text into words or sentences. Google Image Labeler already has built-in systems to validate the quality of the labels. I imagine the same sort of systems could be used to validate the effort of converting handwritten text into files. If Google, or some other company, created a web game or paid (in either money or some sort of virtual credit to be used on the net), I am sure people would be willing to spend their time playing/converting the handwritten text to files. As long as enough people decided to play, converting a huge amount of documents into text files wouldn't take long...
Let's get this straight -- they're transcribing an archaic form of handwriting, from a language they don't know, using characters they don't know, for a guy who's going to pay them minimum wage and isn't going to check their work. Yeah, right.
I piss off bigots.
The 800 lb gorilla in the room that nobody wants to talk about is the extreme lack of progress in language processing. OCR still requires far too much hand-editing of the result to be practical for casual use. Speech recognition is OK, but quite primitive. Speech output now sounds pretty good, but underlying all these should be a "natural language" computing infrastructure. Such a beast doesn't exist. That's why there are no "what you say is what you get" word processing programs or ubiquitous speech-control products. It's also why there are no quality translation tools for written or spoken languages.
MIT had high hopes for their AI lab in the late '70s. The Japanese had a crash program that was supposed to lift so-called "expert systems" by several orders of magnitude in the late '80s. Whatever happened to all the promised innovation? There is still no system capable of taking a piece of paper with handwritten notes and figuring out what information is present on it. Or even of distinguishing between information and random doodling. Or a system that groks music to the point where you can whistle a tune and it tells you the name and who wrote it.
We still have a long way to go.
Instead of using OCR, they can outsource it to India, have someone read the text and use speech to text software
I agree but only if we are stuck with making incremental improvements to current technology. We already have proof that excellent handwritten character recognition is possible since we humans can do it. We use all sorts of cognitive tricks to recognize handwriting, not the least of which is, as you point out, that we usually have a good handle on the historical context surrounding the writing in question. This sort of knowledge requires a lifetime of training and learning. A French person will find it a lot easier to recognize handwritten letters if the words are written in French. Change to a different language and his/her performance will suffer. He/she uses a technique called pattern completion, which is entirely based on learning from previous experience, and not just reading experience. Our future machines will have to do likewise. In my opinion, good recognition in this field will require a breakthrough in our understanding of intelligence. I am optimistic.
your code will put anything not java into perl
Yeah I know you sort of meant it
But even if you successfully recognized C code, you are going to make it perl code.
Oh well...
"The USPS might see a return on investment if their OCR equipment works on 75% of text, routing the hard-to-read 25% to humans. That's a huge reduction in workload, because otherwise every letter would need to be scanned by a human."
USPS got smart there. They have that info encoded as a bar code at the point of origin, where the problem's easier to deal with. Same, really, with the other carriers.
"And here's a somewhat related question: Is there good freeware or GPL'd OCR software usable on windows? I have a few dozen pages, scanned in as high-res PNGs, that I need to convert. Snag: It has some Kanji characters sprinkled throughout."
Well, do as the other poster mentioned and do the poor man's version of what the story's suggesting: find a citizen in another country that'll do the work in exchange for something else.
Shai Schticks:"You don't make peace with friends, you make peace with enemies"
Have you ever taken lecture notes?
The US Post Office has, for years, had fairly reliable automated reading of handwritten digits, which is used to auto-sort and -route mail by zipcode. It can handle some pretty terrible handwriting, crazy arrangement on the envelope, and unlikely variations, so only a relatively small percentage of letters are spit out to be read by human eyes.
Its task is made easier by the fact that they're locating and segmenting fixed-length sequences that are usually at least somewhat separated: they're looking for either a 5-digit zip code or a 5-dash-4-digit zip+4, and handwritten digits usually don't connect in the way that cursive letters do. That and you have only 10 digits to deal with, instead of 36 alphanumeric characters plus punctuation, but that particular difference is just a matter of computing power and memory to scale up to ~4x the charset.
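The fixed-format advantage can be illustrated with a regular expression applied to whatever text the recognizer produces. This is only a sketch of the format constraint described above, not the Postal Service's actual method; the pattern and function name are assumptions.

```python
import re

# Five digits, optionally followed by a dash and four more (ZIP or ZIP+4).
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def find_zip_candidates(recognized_text):
    """Return every substring that has the shape of a ZIP or ZIP+4 code."""
    return ZIP_PATTERN.findall(recognized_text)


print(find_zip_candidates("Springfield IL 62704-1234"))
print(find_zip_candidates("Boston MA 02134"))
```

A recognizer that knows it is looking for exactly this shape can discard any reading that does not fit, which is a luxury that free-form names and places do not offer.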
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
I volunteered this summer transcribing input for a senate campaign. For many documents, people's handwriting was simply unreadable. Even using context, years of experience with parsing human names, the fact that half of the people were already in our database, and the ability to google for contributors' company names, I still had a number of times where I just had to guess at what people meant. Granted, only about 5% of the input is completely illegible, but if I can't parse it, I certainly can't blame a machine for not being able to parse it.
I've abandoned my search for truth; now I'm just looking for some useful delusions.
your code will put anything not java into perl
And Perl will probably run it anyway...
This is old news, why does everyone fret so, see right here http://ars.userfriendly.org/cartoons/?id=20081005&mode=classic. - this comment submitted by CAPTCHA-BOT 2000 Pro!
I don't care if the OCR can read others' handwriting. I just need it to analyze *my* handwriting.
I forget the exact terms from my Machine Learning course, but...
This reminds me of Bayesian spam filtering you can use on email boxes. You can also have training data to help sort new cases.
In this case the sample size is either a bit larger than the typical use (about 50 possibilities if we're talking about common alphanumeric characters), or whole words (if we use dictionary instead). Some combined solution may be more effective: using a dictionary to help your training data collected from simple characters.
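One way the "combined solution" might look: a per-character classifier produces probabilities for each position, and a dictionary picks the most likely whole word. The probabilities, the word list, and the function names below are all invented for illustration; a real system would get the probabilities from a trained model.

```python
import math


def score_word(word, char_probs):
    """Log-probability of `word` given one {char: probability} dict per position."""
    if len(word) != len(char_probs):
        return float("-inf")
    # Characters the classifier never proposed get a tiny floor probability.
    return sum(math.log(probs.get(ch, 1e-6)) for ch, probs in zip(word, char_probs))


def best_dictionary_word(char_probs, dictionary):
    """Pick the dictionary word (dictionary must be non-empty) that best fits the evidence."""
    return max(dictionary, key=lambda word: score_word(word, char_probs))


# Made-up classifier output for a three-letter word.
char_probs = [{"c": 0.4, "e": 0.6}, {"a": 0.55, "o": 0.45}, {"t": 0.8, "l": 0.2}]
greedy = "".join(max(p, key=p.get) for p in char_probs)
print(greedy, "->", best_dictionary_word(char_probs, ["cat", "cot", "dog"]))
```

Taking the top character at each position spells a string that is not in this word list, while the dictionary lookup recovers a valid word from the same evidence.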
Filtering out the writing from the page can already be done to some degree where the scanning method is true bitmap... text is decided as either black or white, with a certain value of grey being a deciding threshold.
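The grey-threshold step can be shown without any imaging library. In this sketch the page is a list of rows of grey values from 0 (black) to 255 (white); the default threshold of 128 is an arbitrary assumption, and real scans usually need a threshold tuned per page.

```python
def binarize(gray_rows, threshold=128):
    """Mark each pixel as ink (1) if darker than the threshold, else paper (0)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_rows]


# A tiny made-up 3x3 "scan": dark stroke down the middle, faint smudge in one corner.
page = [
    [250, 30, 245],
    [240, 25, 250],
    [200, 40, 248],
]
for row in binarize(page):
    print(row)
```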
You can also start with several basic handwriting styles, use that as a base and have the training data adjust to you.
How about an "open" handwriting database where the training data report back to a repository?
Isn't this obvious proof that the CAPTCHAs are poorly designed? Why not just use actual handwriting as CAPTCHAs? Then, when some hackers crack it, they have solved a useful outstanding problem in CS.
I will not lay out the specifics of how it was done, since I am not sure that the guy who designed the process wants it shared. However, the US Census in 2000 processed every piece of paper from that Census using OCR, with some backup QA by humans. The process essentially used a server farm to run each block that contained handwriting through a series of OCR checks. Depending on the OCR confidence level, the box would either be passed as read or put in front of a keyer, who would type what they saw in the box. The process then checked whether the human matched what it had guessed: if so, the box passed on through; if not, it went to another keyer, and the process looked at the match between the two keyers and the OCR guess. It took about 90 days to process every piece of paper sent in. I cannot recall how many pieces there were, but obviously it was millions. It surprises me that no one has improved on that in the last 8 years. I am going to have to see what they plan to do this time around; it was a pretty cool project to be a part of. We had a huge (for the year 2000, anyhow) SAN from EMC, which is now pretty common but was rather rare at that time. I hope they keep it on the cutting edge this time around. I do know they adopted Linux at the processing center I worked at after the Census was complete. I am pretty sure the project will be done without any Windows machines this time.
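Since the commenter deliberately withholds the specifics, the following is only a guess at the general shape of such a pipeline, based on the description above. The 0.95 threshold, the status labels, and the two-of-three rule are assumptions, not details of the actual Census system.

```python
def process_field(ocr_guess, ocr_confidence, keyers, threshold=0.95):
    """Route one handwritten box through OCR, then up to two human keyers.

    `keyers` is a list of two zero-argument functions, each returning what one person typed.
    Returns (value, how_it_was_decided).
    """
    if ocr_confidence >= threshold:
        return ocr_guess, "ocr"  # confident enough to pass as read
    first = keyers[0]()
    if first == ocr_guess:
        return first, "ocr+keyer"  # human confirms the machine's guess
    second = keyers[1]()
    if second == first or second == ocr_guess:
        return second, "two-of-three"  # any two of the three readings agree
    return None, "unresolved"  # all three differ; needs further review


print(process_field("Smyth", 0.60, [lambda: "Smith", lambda: "Smith"]))
```

The economics come from the first branch: every box that clears the confidence threshold never reaches a human at all.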
http://www.irislink.com/
I was shocked to see how well it read my hand writing. This was on a Tablet PC running XP.
The only nuisance was having to turn it off when you didn't need it. Otherwise it kept on thinking I was writing something.
Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration
Actually OCR'ing cursive is probably more a function of being able to accurately scan pen and ink writing than it is a function of "cursive is hard to decode".
That, and a touch screen generates a curve with X and Y as a function of time. This curve contains the order and speed of the strokes made by the stylus at any given moment, valuable information for distinguishing characters. A raster image generated by the scanner is density as a function of X and Y, with no time information.
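The difference between the two representations can be made concrete. In this sketch a stroke is a list of (x, y, t) samples, as a stylus might report, and `rasterize` is a hypothetical helper that throws the time away the same way a scanner does.

```python
def rasterize(strokes, width, height):
    """Flatten timed stylus strokes into a grid of ink (1) and paper (0)."""
    grid = [[0] * width for _ in range(height)]
    for stroke in strokes:
        for x, y, _t in stroke:  # the timestamp is simply discarded
            grid[y][x] = 1
    return grid


# The same two marks drawn in opposite order and direction (made-up coordinates).
bar_then_dot = [[(0, 1, 0.0), (1, 1, 0.1), (2, 1, 0.2)], [(1, 0, 0.5)]]
dot_then_bar = [[(1, 0, 0.0)], [(2, 1, 0.3), (1, 1, 0.4), (0, 1, 0.5)]]

# Different as stroke data, identical once scanned.
print(bar_then_dot != dot_then_bar)
print(rasterize(bar_then_dot, 3, 2) == rasterize(dot_then_bar, 3, 2))
```

A tablet recognizer can use the stroke order and direction as evidence; a recognizer working from scanned paper has to infer them or do without.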
As someone who's spent countless hours combing through Ancestry.com's databases of "transcribed" public records while researching my own family history, I can say with some certainty that it's not just OCR that struggles with handwriting.
I'd say that at least a third, and probably more like half, of the records I've found on Ancestry.com which reference the folks I'm researching, are transcribed incorrectly.
Certainly part of the problem is that the people doing the transcribing aren't familiar with the names they're transcribing (I've had a DuBois written as both "Delrie" and "Dobins"). Another part of the problem is that when you're looking at handwritten records from well over a hundred years ago, often they're just plain hard to read (or even illegible).
Anyway, that second point, IMO, makes using Ancestry's efforts as an example of issues with "handwriting" in general a bit dodgy. The problems they face are more along the lines of dealing with old, faded, often poorly filmed documents where even a human will have a tough time.
=== "Some people see the glass as half-empty. Others see it as half-full. I see the glass as too big." -G. Carlin.
For captchas, you only need some accuracy because you get infinite retries and immediate feedback whether your OCR guessed right. For digitizing text, especially people's names where spell checkers are useless, there's no automatic feedback. OCR needs near-100% accuracy to be of any value, because proofreading takes almost as long as manual transcription. So comparing captcha solvers to traditional OCR is apples to oranges.
... we do work for handwriting recognition at the company I work at currently, using 3rd party packages along with a lot of our own special sauce to improve accuracy.
It ends up being pretty good with high quality scanned documents... the only time we end up with trouble is with low resolution faxes.