Google Pushes Open Source OCR
SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"
I doubt it.
Part of what makes OCR work is that it assumes that the text was written to communicate meaning. It has regular characters in alignment, forming common words and abbreviations in more or less grammatical sentences with close-to-proper punctuation.
A good captcha has a nonsense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR engine uses.
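To make the point about clues concrete, here is a minimal sketch of one such clue: snapping a noisy token to the nearest dictionary word. It is not how OCRopus or tesseract actually work; the tiny vocabulary, the function name and the 0.8 cutoff are all made up for illustration, and only Python's standard difflib module is used.

```python
import difflib

# Hypothetical stand-in vocabulary; a real OCR engine would rely on a far
# larger dictionary or language model.
VOCABULARY = ["character", "recognition", "document", "google", "optical"]


def correct_with_dictionary(raw_token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the raw token if nothing is close.

    The cutoff of 0.8 is an arbitrary illustrative similarity threshold.
    """
    matches = difflib.get_close_matches(
        raw_token.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else raw_token


# A misread of a real word can be repaired from context...
print(correct_with_dictionary("recogniti0n"))
# ...but a nonsense captcha string has no dictionary word to fall back on,
# so the raw (possibly wrong) reading is all you get.
print(correct_with_dictionary("xK7qRm"))
```

With ordinary text the dictionary quietly fixes the "0" that should have been an "o". With a captcha there is nothing to snap to, so every character has to be read correctly on its own.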
Computers are useless. They can only give you answers.
-- Pablo Picasso
Captcha (warped text) will probably remain for a long time. This OCR has more practical uses when applied to text that is meant to be legible.
Under the influence of Post-Cyberpunk Gonzo Journalism
Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.
Um, the whole point of CAPTCHAs is that they are human-readable (albeit often not easily so) but not machine-readable.
If you're sick of image spam, you can do what I did. Add the OpenProtect channel to SpamAssassin and then add these lines to your SpamAssassin config:
required_hits 5
score SARE_GIF_ATTACH 5
I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day.
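Why those two lines work: the rule's score is set equal to the spam threshold, so a hit on that one rule is enough to flag a message by itself. A rough sketch of the arithmetic, assuming the simplified model that SpamAssassin sums the scores of the rules that hit and flags the message once the total reaches the threshold (the function and variable names here are hypothetical, and a real install adds up many other rules, some with negative scores):

```python
# Values taken from the two config lines above.
REQUIRED_HITS = 5.0
RULE_SCORES = {"SARE_GIF_ATTACH": 5.0}


def is_spam(hit_rules, rule_scores=RULE_SCORES, required=REQUIRED_HITS):
    """Simplified model: sum the scores of the rules that hit, compare to threshold."""
    total = sum(rule_scores.get(rule, 0.0) for rule in hit_rules)
    return total >= required


print(is_spam(["SARE_GIF_ATTACH"]))  # this rule alone reaches the threshold
print(is_spam([]))                   # no rules hit, total is 0
```

The flip side is that any legitimate mail that trips the same rule gets flagged too, so lower the score if you start losing real messages.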
"It ain't a war against drugs. It's a war against personal freedom" --Bill Hicks
Here's some other information you might need/want to build this software; note that I am on Ubuntu Feisty.
To build tesseract-ocr you must install autoconf.
If you are smart you will figure out a way to omit ocropus/data-test-pages from your checkout. Do you really want to use their pages to test? No! You want to use your own data.
I built with gcc 4.1.2, YMMV. Some people have reported errors when trying to compile just tesseract.
To build ocropus I ended up installing jam, libaspell-dev and libtiff-dev.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"