Google Adds OCR To PDF and Images

Posted by CmdrTaco on Tuesday June 22, 2010 @12:58AM from the typing-is-for-suckers dept.

Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"

Is there a "this translation is bad" option? by AdmiralXyz · 2010-06-22 01:28 · Score: 4, Informative

I know that several services, like Google Voice, had a link or checkbox to indicate that "this transcription is lousy, I can do better" with an option to do so, which was presumably sent back to improve the service. It really seemed to work, too: the quality of Google Voice's voicemail-to-text transcriptions started off horrible and has since become awesome. Same goes for the built-in speech-to-text in Android. If Google includes something like that here to tune whatever machine learning algos they're using, I have no doubt it will rapidly progress to a usable state.

--
Dislike the Electoral College? Lobby your state to join the National Popular Vote Interstate Compact.
Re:lolwut? by mlk · 2010-06-22 01:34 · Score: 3, Informative

I've just tried with the extract.
The text extraction seems to have worked well. Unsurprisingly the formatting has been lost and it has got confused with the REwork type bits. PDFs are not designed with extraction to an editable format in mind, so getting any of the formatting is impressive in my book.

--
Wow, I should not post when knackered.
Re:Anyone know what they're using for the OCR? by quickOnTheUptake · 2010-06-22 02:47 · Score: 3, Informative

I don't know for sure what's running behind this, but Google's OCRopus is Apache-licensed, as is the actual OCR engine behind it, Tesseract.

--
Mod points: Guaranteed to remove your sense of humor.
Side effects may include gullibility and temporary retardation
OCR Reality by Shadow+Wrought · 2010-06-22 04:19 · Score: 2, Informative

About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.

What did you expect? I've been in the legal field for 10 years and have seen OCR progress substantially during that time. However, a 10% error rate is still very common with scanned docs, and unless you are looking at the original image, all the formatting is lost. This is with the best OCR engines in the industry!

Maybe you should actually know something about the particular field before you judge?
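
For anyone wondering what a figure like "10% error rate" can mean in practice, here is a rough sketch of one common way to measure it: character-level edit distance between a hand-checked reference and the OCR output, divided by the reference length. This is only an illustration. Nobody in this thread has said how the 10% in the summary was counted, and the function names and sample strings below are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete a character from a
                cur[j - 1] + 1,            # insert a character from b
                prev[j - 1] + (ca != cb),  # substitute, free if they match
            ))
        prev = cur
    return prev[-1]


def char_error_rate(reference, ocr_output):
    """Edits needed to fix the OCR output, as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical sample: one wrong character in a ten-character reference.
rate = char_error_rate("0123456789", "012345678X")
print(f"{rate:.0%}")
```

Counting by words instead of characters gives a noticeably higher number for the same page, so it matters which one a vendor is quoting.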

--
If brevity is the soul of wit, then how does one explain Twitter?
Re:Captcha correction? by pcgc1xn · 2010-06-22 05:37 · Score: 2, Informative

I am pretty sure that with reCAPTCHA only one of the two words you type in is unknown.
So say I have some text that looks like 'first (known) p0sh bi4ches'.
Captcha user one will get "first p0sh".
If they correctly identify "first", then I will accept their reading of "p0sh", say "post".
User 2 gets "p0sh bi4ches".
If they correctly identify "p0sh" as "post", then I will accept their reading of "bi4ches".
Obviously the guys at reCAPTCHA have done a better job than my simplified & poor explanation. You need "some" knowledge of what the text actually is, but only some.
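
In code, the scheme described above might look roughly like this. It is a sketch of my simplified explanation, not of how reCAPTCHA is actually implemented: the function names are hypothetical, and the `min_votes` threshold is my own assumption (the description above accepts a single user's reading, while this version waits for two users to agree).

```python
from collections import Counter


def check_pair(known_answer, typed_control, typed_unknown, votes):
    """Count a reading of the unknown word only if the control word was typed correctly."""
    if typed_control.strip().lower() != known_answer.lower():
        return False  # failed the known word, so ignore the other reading too
    votes[typed_unknown.strip().lower()] += 1
    return True


def consensus(votes, min_votes=2):
    """Return the most common reading once it has min_votes, otherwise None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


# Three hypothetical users are shown the known word "first" plus the scan "p0sh".
votes = Counter()
check_pair("first", "first", "post", votes)  # passes the control, vote counted
check_pair("first", "frist", "pooh", votes)  # fails the control, vote ignored
check_pair("first", "First", "post", votes)  # passes, second vote for "post"
print(consensus(votes))
```

Once "post" is accepted, it can serve as the known word when the next pair ("p0sh bi4ches") is shown, which is the chaining step in the example above.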

--
Zapsavings: Simply calculate how much energy efficient bulb