Google Adds OCR To PDF and Images

Posted by CmdrTaco on Tuesday June 22, 2010 @12:58AM from the typing-is-for-suckers dept.

Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"

Captcha correction? by 0100010001010011 · 2010-06-22 01:05 · Score: 3, Interesting

Could Google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha? More or less what reCaptcha does, except with something a bit newer.
1. Re:Captcha correction? by Anonymous Coward · 2010-06-22 05:29 · Score: 1, Interesting
  
  Check into how the current reCaptcha works. The user is presented with two words. One is known to be correct; the other is suspect. The user is unaware of which is known and which is suspect. The user types both words, and the backend system verifies that the known word was typed correctly, then logs the value the user typed for the suspect word. It shows the suspect word image to a few more users, and if they all respond with the same text along with the correct known word, the system can assume the suspect image contains the text returned. http://stackoverflow.com/questions/1435696/how-does-recaptcha-work
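
The flow described above can be sketched in a few lines of Python. This is only an illustration of the two-word idea, not reCaptcha's actual code: the class name, the image ids, and the `votes_needed` threshold of three are all made up for the example, and "a few more users" is read here as "at least three users who all typed the same thing".

```python
from collections import Counter, defaultdict


class TwoWordCaptcha:
    """Toy model of the two-word scheme; every name here is hypothetical."""

    def __init__(self, known_answers, votes_needed=3):
        self.known = known_answers          # image id -> verified text
        self.votes_needed = votes_needed    # assumed threshold, not a real setting
        self.votes = defaultdict(Counter)   # suspect image id -> typed text counts
        self.resolved = {}                  # suspect image id -> accepted text

    def submit(self, known_id, known_typed, suspect_id, suspect_typed):
        # The user only passes if the known word was typed correctly.
        if known_typed.strip().lower() != self.known[known_id].lower():
            return False

        # Log what this user typed for the suspect word.
        tally = self.votes[suspect_id]
        tally[suspect_typed.strip().lower()] += 1

        # Accept the suspect word once enough users agree unanimously.
        text, count = tally.most_common(1)[0]
        if len(tally) == 1 and count >= self.votes_needed:
            self.resolved[suspect_id] = text
        return True


captcha = TwoWordCaptcha({"k1": "morning"})
for _ in range(3):
    captcha.submit("k1", "morning", "s1", "overlooks")
print(captcha.resolved)
```

Note that an answer to the suspect word is only counted when the known word was right, which is what keeps bots and careless typists from polluting the tally.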
Where did all the ReCAPTCHA go? by AHuxley · 2010-06-22 01:08 · Score: 2, Interesting

With all the words deciphered, no bump in the OCR backend?

--
Domestic spying is now "Benign Information Gathering"
Google Captcha processor here I come!!!! by OzPeter · 2010-06-22 01:19 · Score: 2, Interesting

How long before you see an automated system to upload and process Captcha images on Google?

--
I am Slashdot. Are you Slashdot as well?
1. Re:Google Captcha processor here I come!!!! by BrightSpark · 2010-06-22 03:15 · Score: 2, Interesting
  
  One of the easier ways to restrict how your words and ideas are searched and indexed on the net is to hide them in plain sight. A jpg image of text is very difficult for a search engine to use, yet you and I can read and understand the data quite easily. This ability to scan online has been around, but not mainstream to my knowledge. I'm guessing Google has been checking jpgs for text as a trial for some time. Once this is gone maybe ASCII art text will work for a while. Hiding/protecting data by steganography is detectable by scan now, e.g. http://www.outguess.org/detection.php so the battle continues. Of course one can work offline and send letters to each other and be protected by law :-) I wonder if one day sending stuff by mail will seem shady?
Re:lolwut? by erikdalen · 2010-06-22 01:19 · Score: 3, Interesting

Didn't fail at all on a PDF with typed text for me. Did you actually try it?
I bet they don't actually use OCR on a PDF with typed text, as they can just extract it from the PDF. They probably use OCR on images inside PDFs, though.

--
Erik Dalén
OCR efficiency by Anonymous Coward · 2010-06-22 01:34 · Score: 1, Interesting

> the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.
Well, what is the state of the art of OCR today? I wouldn't call this a bad result either... And OTOH, if people were correctly trained in spelling, we would have made do without spell checkers and have invested in OCR technology instead, right? ;)
Google search had OCR before Google Docs by mike.mondy · 2010-06-22 06:01 · Score: 2, Interesting

Google's search engine started doing OCR on any scanned documents they found in late 2008. The results were horrible in some cases, but it didn't matter. The searchable OCR results made it possible to find things more easily and you could obviously refer to the original source if the OCR was too garbled.
lots of tough problems in OCR by tmbdev · 2010-06-22 13:55 · Score: 2, Interesting

OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.
Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because there are so many odd things that can happen. Even in perfectly rendered images (no dirt etc.), two characters may be bit-identical but mean something different in different fonts. Ligatures, underlines, unknown characters, etc. also make the problem quite a bit harder.
And even though 1% error would be low for just about any other machine learning or pattern recognition problem, that's a high OCR error rate and looks quite bad; people are much more sensitive to OCR errors than to pattern recognition errors in other contexts. Furthermore, there are a lot of characters to be classified and you get very little CPU time per character.
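
A quick back-of-the-envelope calculation shows why a 1% error rate looks bad on a page of text. The numbers below are assumptions for illustration only: errors are treated as independent per character, a word is taken to be 5 characters long, and a page is taken to be 2000 characters.

```python
def expected_char_errors(n_chars, char_error_rate):
    """Expected number of wrong characters in a text of n_chars characters."""
    return n_chars * char_error_rate


def word_error_rate(char_error_rate, word_len=5):
    """Chance that a word contains at least one wrong character,
    assuming independent per-character errors (a simplification)."""
    return 1 - (1 - char_error_rate) ** word_len


# Assumed page size of 2000 characters and 5-character words.
for rate in (0.01, 0.10):
    print(
        f"char error {rate:.0%}: "
        f"{expected_char_errors(2000, rate):.0f} bad characters per page, "
        f"{word_error_rate(rate):.1%} of words affected"
    )
```

Under these assumptions a 1% character error rate means roughly 20 wrong characters on a page and about 5% of words damaged. If the 10% figure from the story were a per-character rate (the story only says "about 10% of the text"), around 41% of words would contain an error.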
We've been developing an OCR system (ocropus.org) for a while now (see http://bit.ly/9Xputj for status info). It's fairly easy to get excellent performance on a closed dataset with a well-defined character set. Getting acceptable performance on arbitrary documents and dealing with all the special cases (ligatures, foreign characters, color images, magazine layouts, unknown languages, Unicode issues, etc.) is tons of work.
Oh, and in case you're wondering, although Google has sponsored OCRopus (thanks!), OCRopus is a separate project from Google's internal OCR efforts.