Google Adds OCR To PDF and Images
Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"
F1r5t p0st? (OCR's by Google)
I can understand OCR software not working if you are scanning a document, due to dirt over the text or what have you...but OCR failing on a PDF with typed text? WTF?
Living With a Nerd
Could google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha. More or less what reCaptcha does, except with something a bit newer.
With all the words deciphered, no bump in the OCR backend?
Domestic spying is now "Benign Information Gathering"
How long before you see an automated system to upload and process Captcha images on google?
I am Slashdot. Are you Slashdot as well?
I know with several services, like Google Voice, they had a link or checkbox to indicate that "this transcription is lousy, I can do better" with an option to do so, which was presumably sent back to improve the service. It really seemed to work, too, the quality of Google Voice's voicemail-to-text transcriptions started off horrible, and has since become awesome. Same goes for the built-in speech-to-text in Android. If Google includes something like that here to tune whatever machine learning algos they're using, I have no doubt it will rapidly progress to a usable state.
Dislike the Electoral College? Lobby your state to join the National Popular Vote Interstate Compact.
They really should hide the text underneath the actual scanned image, though, so that what you're actually looking at is the real page, but searchable. That takes care of the issue with layout, and since you aren't actually trying to read the garbled text, although 10% is still a rather high error rate it won't matter as much because you'll only notice it if you're trying to copy-and-paste or you might search for something and miss a few of the hits because it was incorrectly OCR'd. Not a huge deal.
Alexander Peter Kristopeit bought his basement from his mommy for one dollar.
I don't know for sure what's running behind this, but Google's OCRopus is Apache, as is the actual OCR engine behind it, tesseract.
Mod points: Guaranteed to remove your sense of humor.
Side effects may include gullibility and temporary retardation
About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.
What did you expect? I've been in the legal field for 10 years and have seen OCR progress substantially during that time. However, 10% error rate is still very common with scanned docs and unless you are looking at the original image, all the formatting is lost. This is with the best OCR engines in the industry!
Maybe you should actually know something about the particular field before you judge?
If brevity is the soul of wit, then how does one explain Twitter?
Google's search engine started doing OCR on any scanned documents they found in late 2008. The results were horrible in some cases, but it didn't matter. The searchable OCR results made it possible to find things more easily and you could obviously refer to the original source if the OCR was too garbled.
OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.
Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because there are so many odd things that can happen. Even in perfectly rendered images (no dirt etc.), two characters may be bit-identical but mean something different in different fonts. Ligatures, underlines, unknown characters, etc. also make the problem quite a bit harder.
And even though 1% error would be low for just about any other machine learning or pattern recognition problem, that's a high OCR error and looks quite bad; people are much more sensitive to OCR errors than pattern recognition errors in other contexts. Furthermore, there are a lot of characters to be classified and you only get very little CPU time per character.
We've been developing an OCR system (ocropus.org) for a while now (see http://bit.ly/9Xputj for status info). It's fairly easy to get excellent performance on a closed dataset with a well-defined character set. Getting acceptable performance on arbitrary documents and dealing with all the special cases (ligatures, foreign characters, color images, magazine layouts, unknown languages, Unicode issues, etc.) is tons of work.
Oh, and in case you're wondering, although Google has sponsored OCRopus (thanks!), OCRopus is a separate project from Google's internal OCR efforts.