Google Pushes Open Source OCR
SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"
Now that they will be able to recognize and tag our images, I wonder if Picasa will finally get increased storage. Google will be able to deliver targeted ads based on our pictures.
An OCR system that runs on Linux. I've been waiting for quite some time for something like this.
Anyone know of an open source utility that can convert scanned image-based PDF files into searchable PDFs?
(Extra points if it somehow re-generates the actual file so it looks nice instead of pixelated.)
Perhaps this library could be used to build such an application if none exists...
English only, I suppose?
Among other things, sure, but it's got to be a high priority for Google. I don't buy either one. I think the goal of the project is to get sued.
Google knows darned well that there are tons of patents around OCR, so they're not going to roll their own internally. Instead, they'll open source the project and make as much noise about enhancing the state of the art through collaboration as possible. Then, when they get sued (and they will), they can bring this case front-and-center in the debate surrounding patent reform, citing this as the textbook example of how the promotion of the sciences and useful arts (as specified by the Constitution) is hobbled by current patent law surrounding software.
I could be wrong, but they'd be stupid to think that high-profile, open source OCR software won't be challenged by those who hold the patents....
You've obviously never fought off a bb spammer. They don't use one or two accounts to spam one or two messages - they inundate the board from a long list of IPs. Even without spamming messages, they create hordes of accounts just for the pagerank provided by the links within their personal account pages. Plus, admin-approval-delays degrade quality for the user. It creates a huge headache all around to handle maintaining banlists and cleaning out garbage.
Captchas are by far the better solution.
The problem is that, long term, they will be cracked. I'm imagining the ultimate solution will be to allow only users with email addresses from "approved" major ISPs.
Look, any illiterate kid in a third world country can play type-in-the-CAPTCHA all day long. I think the solution is to put up an array of 9 or so pictures, and ask the reader to click on the kitten. The other 8 would be something other than a kitten, and all the files would have random names which rotate with every view. You could also change the item being asked for to defeat simple image recognition, and have several pictures of kittens/what-have-yous.
To defeat *this*, you would need someone with a greater command of the English language than simple recognition of characters, or very advanced image recognition software. I wouldn't worry about the software anytime soon if you chose images carefully.
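For anyone who wants to try the click-on-the-kitten idea, here is a rough sketch of the server side in Python. It is only an illustration under some assumptions: the names `build_challenge` and `check_click`, the labeled image pool, and the file names are all made up for this example, and storing the answer in a session is left out. Each view picks one picture of the requested item and eight decoys, shuffles them, and hands every tile a fresh random name so the file names reveal nothing and change every time.

```python
import secrets


def build_challenge(images_by_label, target_label, grid_size=9):
    """Build one picture grid: one tile shows target_label, the rest are decoys.

    images_by_label is a hypothetical pool mapping a label such as "kitten"
    to a list of image files. Returns (tiles, answer), where tiles is a list
    of (public_name, real_file) pairs and answer is the public name of the
    one correct tile. Only the public names should be sent to the browser.
    """
    rng = secrets.SystemRandom()
    targets = images_by_label.get(target_label, [])
    decoys = [
        img
        for label, imgs in images_by_label.items()
        if label != target_label
        for img in imgs
    ]
    if not targets or len(decoys) < grid_size - 1:
        raise ValueError("not enough images to build a grid")

    # One correct picture plus (grid_size - 1) distinct decoys, in random order.
    picks = [(rng.choice(targets), True)]
    picks += [(img, False) for img in rng.sample(decoys, grid_size - 1)]
    rng.shuffle(picks)

    tiles = []
    answer = None
    for real_file, is_target in picks:
        # Fresh random name on every view, so names cannot be memorized.
        public_name = secrets.token_hex(8)
        tiles.append((public_name, real_file))
        if is_target:
            answer = public_name
    return tiles, answer


def check_click(answer, clicked_name):
    """Return True if the visitor clicked the tile holding the requested item."""
    return secrets.compare_digest(answer, clicked_name)
```

Rotating `target_label` between requests ("click on the kitten", "click on the tree") covers the point about changing the item being asked for. With a single grid, a bot that clicks at random still gets through about one time in nine, so you would probably want to chain several rounds or rate-limit failures.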
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.