Google Adds OCR To PDF and Images

Posted by CmdrTaco on Tuesday June 22, 2010 @12:58AM from the typing-is-for-suckers dept.

Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"

15 of 76 comments

F1r5t p0st? by Chrisq · 2010-06-22 01:00 · Score: 4, Funny

F1r5t p0st? (OCR's by Google)
lolwut? by Pojut · 2010-06-22 01:00 · Score: 2, Insightful

I can understand OCR software not working if you are scanning a document, due to dirt over the text or what have you...but OCR failing on a PDF with typed text? WTF?

--
Living With a Nerd
1. Re:lolwut? by erikdalen · 2010-06-22 01:19 · Score: 3, Interesting
  
  Didn't fail at all on a PDF with typed text for me. Did you actually try it?
  I bet they don't actually use OCR on a PDF with typed text, as they can just extract the text from the PDF. They probably only use OCR on images inside PDFs.
  
  --
  Erik Dalén
2. Re:lolwut? by mlk · 2010-06-22 01:34 · Score: 3, Informative
  
  I've just tried with the extract.
  The text extraction seems to have worked well. Unsurprisingly the formatting has been lost and it has got confused with the REwork type bits. PDFs are not designed with extraction to an editable format in mind, so getting any of the formatting is impressive in my book.
  
  --
  Wow, I should not post when knackered.
Captcha correction? by 0100010001010011 · 2010-06-22 01:05 · Score: 3, Interesting

Could Google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha? More or less what reCAPTCHA does, except with something a bit newer.
1. Re:Captcha correction? by pcgc1xn · 2010-06-22 05:37 · Score: 2, Informative
  
  I am pretty sure that with reCAPTCHA only one of the two words you type in is unknown.
  So if I have some text that looks like 'first (known) p0sh bi4ches'.
  Captcha user one will get "first p0sh".
  If they correctly identify "first", then I will accept their reading of "p0sh", say "post".
  User 2 gets "p0sh bi4ches".
  If they correctly identify "p0sh" as "post", then I will accept their reading of "bi4ches".
  Obviously the guys at reCAPTCHA have done a better job than my simplified and poor explanation. You need "some" knowledge of what the text actually is, but only some.
  
  --
  Zapsavings: Simply calculate how much energy efficient bulb
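
The scheme described in the comment above can be sketched roughly as follows. This is only a toy model of the idea (one control word with a known answer, one unknown word), not reCAPTCHA's actual implementation; the class name `TwoWordCaptcha`, the `votes_needed` threshold, and the image id `img-2` are all made up for illustration.

```python
from collections import Counter, defaultdict


class TwoWordCaptcha:
    """Toy model: each challenge pairs a known control word with an unknown word."""

    def __init__(self, votes_needed=2):
        # Hypothetical threshold: how many matching readings we want
        # before trusting a transcription of an unknown word.
        self.votes_needed = votes_needed
        self.votes = defaultdict(Counter)  # unknown image id -> readings seen
        self.resolved = {}                 # unknown image id -> accepted text

    def submit(self, control_answer, control_typed, unknown_id, unknown_typed):
        """Return True if the user passes; record their reading of the unknown word."""
        if control_typed.strip().lower() != control_answer.lower():
            return False  # control word wrong: reject and ignore the other reading
        reading = unknown_typed.strip().lower()
        self.votes[unknown_id][reading] += 1
        if self.votes[unknown_id][reading] >= self.votes_needed:
            self.resolved[unknown_id] = reading
        return True


captcha = TwoWordCaptcha(votes_needed=2)
captcha.submit("first", "first", "img-2", "post")  # passes, one vote for "post"
captcha.submit("first", "frist", "img-2", "posh")  # control word wrong, ignored
captcha.submit("first", "First", "img-2", "post")  # passes, second vote for "post"
print(captcha.resolved)  # {'img-2': 'post'}
```

Once a word is resolved it can in turn serve as the control word for the next unknown one, which is the chaining the comment describes.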
Where did all the ReCAPTCHA go? by AHuxley · 2010-06-22 01:08 · Score: 2, Interesting

With all the words deciphered, no bump in the OCR backend?

--
Domestic spying is now "Benign Information Gathering"
Google Captcha processor here I come!!!! by OzPeter · 2010-06-22 01:19 · Score: 2, Interesting

How long before you see an automated system to upload and process Captcha images on Google?

--
I am Slashdot. Are you Slashdot as well?
1. Re:Google Captcha processor here I come!!!! by BrightSpark · 2010-06-22 03:15 · Score: 2, Interesting
  
  One of the easier ways to restrict how your words and ideas are searched and indexed on the net is to hide them in plain sight. A jpg image of text is very difficult for a search engine to use, yet you and I can read and understand the data quite easily. This ability to scan online has been around, but not mainstream to my knowledge. I'm guessing Google has been checking jpgs for text as a trial for some time. Once this is gone maybe ASCII art text will work for a while. Hiding/protecting data by steganography is detectable by scan now, e.g. http://www.outguess.org/detection.php so the battle continues. Of course one can work offline and send letters to each other and be protected by law :-) I wonder if one day sending stuff by mail will seem shady?
Is there a "this translation is bad" option? by AdmiralXyz · 2010-06-22 01:28 · Score: 4, Informative

I know with several services, like Google Voice, they had a link or checkbox to indicate that "this transcription is lousy, I can do better" with an option to do so, which was presumably sent back to improve the service. It really seemed to work, too: the quality of Google Voice's voicemail-to-text transcriptions started off horrible and has since become awesome. Same goes for the built-in speech-to-text in Android. If Google includes something like that here to tune whatever machine learning algos they're using, I have no doubt it will rapidly progress to a usable state.

--
Dislike the Electoral College? Lobby your state to join the National Popular Vote Interstate Compact.
Doesn't have to be perfect by clone53421 · 2010-06-22 01:36 · Score: 4, Insightful

They really should hide the text underneath the actual scanned image, though, so that what you're looking at is the real page, but searchable. That takes care of the layout issue. And since you aren't trying to read the garbled text, the 10% error rate, while still rather high, won't matter as much: you'll only notice it if you copy-and-paste, or if you search for something and miss a few hits because they were incorrectly OCR'd. Not a huge deal.

--
Alexander Peter Kristopeit bought his basement from his mommy for one dollar.
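
A rough sketch of the "hidden text under the scan" idea from the comment above: keep each OCR'd word together with its position on the page image, search the text, and highlight the matching regions on the original scan. The `Word` type, the pixel coordinates, and the sample page are invented for illustration and do not reflect how Google Docs or any PDF library stores this.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str    # what the OCR engine read (possibly wrong)
    box: tuple   # (left, top, right, bottom) in page-image pixels


def find_hits(words, query):
    """Return the boxes of words containing the query, for highlighting on the scan."""
    q = query.lower()
    return [w.box for w in words if q in w.text.lower()]


# Hypothetical OCR result for one line; "charactor" is a misread word.
page = [
    Word("Optical", (10, 10, 80, 30)),
    Word("charactor", (90, 10, 180, 30)),
    Word("recognition", (190, 10, 300, 30)),
]

print(find_hits(page, "recog"))      # [(190, 10, 300, 30)]
print(find_hits(page, "character"))  # [] -- the misread word is a missed hit
```

The reader only ever sees the page image, so the misread word costs a missed search hit rather than a garbled page.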
Re:Anyone know what they're using for the OCR? by quickOnTheUptake · 2010-06-22 02:47 · Score: 3, Informative

I don't know for sure what's running behind this, but Google's OCRopus is Apache-licensed, as is the actual OCR engine behind it, Tesseract.

--
Mod points: Guaranteed to remove your sense of humor.
Side effects may include gullibility and temporary retardation
OCR Reality by Shadow+Wrought · 2010-06-22 04:19 · Score: 2, Informative

About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.

What did you expect? I've been in the legal field for 10 years and have seen OCR progress substantially during that time. However, a 10% error rate is still very common with scanned docs, and unless you are looking at the original image, all the formatting is lost. This is with the best OCR engines in the industry!

Maybe you should actually know something about the particular field before you judge?

--
If brevity is the soul of wit, then how does one explain Twitter?
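
For anyone wanting to put a number on "10% of the text was wrong", one common way is a character error rate: the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below assumes that definition; neither the story nor the comment says how the 10% figure was actually measured, and the function names are just illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def char_error_rate(reference, ocr_output):
    """Fraction of reference characters the OCR output got wrong."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


print(char_error_rate("first post", "f1rst post"))  # 0.1
```

One wrong character in ten gives 10%, which on a real page means an error in most lines.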
Google search had OCR before Google Docs by mike.mondy · 2010-06-22 06:01 · Score: 2, Interesting

Google's search engine started doing OCR on any scanned documents they found in late 2008. The results were horrible in some cases, but it didn't matter. The searchable OCR results made it possible to find things more easily and you could obviously refer to the original source if the OCR was too garbled.
lots of tough problems in OCR by tmbdev · 2010-06-22 13:55 · Score: 2, Interesting

OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.
Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because there are so many odd things that can happen. Even in perfectly rendered images (no dirt etc.), two characters may be bit-identical but mean something different in different fonts. Ligatures, underlines, unknown characters, etc. also make the problem quite a bit harder.
And even though 1% error would be low for just about any other machine learning or pattern recognition problem, that's a high OCR error and looks quite bad; people are much more sensitive to OCR errors than pattern recognition errors in other contexts. Furthermore, there are a lot of characters to be classified and you only get very little CPU time per character.
We've been developing an OCR system (ocropus.org) for a while now (see http://bit.ly/9Xputj for status info). It's fairly easy to get excellent performance on a closed dataset with a well-defined character set. Getting acceptable performance on arbitrary documents and dealing with all the special cases (ligatures, foreign characters, color images, magazine layouts, unknown languages, Unicode issues, etc.) is tons of work.
Oh, and in case you're wondering, although Google has sponsored OCRopus (thanks!), OCRopus is a separate project from Google's internal OCR efforts.
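
To give a flavour of just one of the steps listed above, grouping characters into lines, here is a deliberately naive sketch. It assumes each character arrives with a bounding box and that lines are horizontal and not skewed; the `tolerance` parameter and the sample boxes are made up, and this is not how OCRopus does layout analysis.

```python
def group_into_lines(boxes, tolerance=0.5):
    """Group (char, left, top, right, bottom) boxes into text lines.

    A box joins the current line if its vertical center is within
    `tolerance` line-heights of that line's center; otherwise it starts
    a new line. Characters within a line are ordered left to right.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: (b[2] + b[4]) / 2):
        _, left, top, right, bottom = box
        center = (top + bottom) / 2
        if lines and abs(center - lines[-1]["center"]) <= tolerance * lines[-1]["height"]:
            lines[-1]["items"].append(box)
        else:
            lines.append({"center": center, "height": bottom - top, "items": [box]})
    return [
        "".join(b[0] for b in sorted(line["items"], key=lambda b: b[1]))
        for line in lines
    ]


# Hypothetical character boxes, given out of reading order.
boxes = [
    ("k", 12, 20, 20, 30),
    ("i", 12, 0, 16, 10),
    ("o", 0, 20, 10, 30),
    ("H", 0, 0, 10, 10),
]
print(group_into_lines(boxes))  # ['Hi', 'ok']
```

Even this toy version breaks on skewed scans, multiple columns, superscripts, and drop caps, which is the point made above about each step being a tough problem on its own.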