Google Adds OCR To PDF and Images
Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"
F1r5t p0st? (OCR's by Google)
Could google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha. More or less what reCaptcha does, except with something a bit newer.
Didn't fail at all on a PDF with typed text for me. Did you actually try it?
I bet they don't actually use OCR on a PDF with typed text as they can just extract it from the PDF, they probably use that on images inside PDFs though.
Erik Dalén
I know with several services, like Google Voice, they had a link or checkbox to indicate that "this transcription is lousy, I can do better" with an option to do so, which was presumably sent back to improve the service. It really seemed to work, too, the quality of Google Voice's voicemail-to-text transcriptions started off horrible, and has since become awesome. Same goes for the built-in speech-to-text in Android. If Google includes something like that here to tune whatever machine learning algos they're using, I have no doubt it will rapidly progress to a usable state.
Dislike the Electoral College? Lobby your state to join the National Popular Vote Interstate Compact.
I've just tried with the extract.
The text extraction seams to have worked well. Unsurprisingly the formatting has been lost and it has got confused with the REwork type bits. PDFs are not designed with extraction to a editable format in mind, so getting any of the formatting is impressive in my book.
Wow, I should not post when knackered.
They really should hide the text underneath the actual scanned image, though, so that what you're actually looking at is the real page, but searchable. That takes care of the issue with layout, and since you aren't actually trying to read the garbled text, although 10% is still a rather high error rate it won't matter as much because you'll only notice it if you're trying to copy-and-paste or you might search for something and miss a few of the hits because it was incorrectly OCR'd. Not a huge deal.
Alexander Peter Kristopeit bought his basement from his mommy for one dollar.
I don't know for sure what's running behind this, but Google's OCRopus is Apache, as is the actual OCR engine behind it, tesseract.
Mod points: Guaranteed to remove your sense of humor.
Side effects may include gullibility and temporary retardation