Google Adds OCR To PDF and Images
Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"
F1r5t p0st? (OCR's by Google)
I can understand OCR software not working if you are scanning a document, due to dirt over the text or what have you...but OCR failing on a PDF with typed text? WTF?
Could Google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha? More or less what reCAPTCHA does, except with something a bit newer.
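A rough sketch of how such an opt-in service might settle on a word once several people have typed it in. This is not how reCAPTCHA or Google actually does it; the function name `consensus` and the thresholds `min_votes` and `min_share` are made up for illustration.

```python
from collections import Counter

def consensus(transcriptions, min_votes=3, min_share=0.6):
    """Return the agreed-upon word, or None if the answers have not converged.

    min_votes and min_share are hypothetical thresholds, not real service settings.
    """
    # Normalise lightly so "Rework " and "rework" count as the same answer.
    votes = Counter(t.strip().lower() for t in transcriptions if t.strip())
    total = sum(votes.values())
    if total < min_votes:
        return None  # not enough people have seen this word yet
    word, count = votes.most_common(1)[0]
    # Accept the most common answer only if a clear majority typed it.
    return word if count / total >= min_share else None
```

A word that three out of four people transcribe the same way would be accepted; a word with three different answers would stay in the queue.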
With all the words deciphered, no bump in the OCR backend?
Domestic spying is now "Benign Information Gathering"
How long before you see an automated system to upload and process captcha images on Google?
I am Slashdot. Are you Slashdot as well?
For now it sucks, but we know that when Google wants to, it puts out the best product on the market.
Just wondering if this gets so good as to make mass captcha cracking cheap.
Sadly, I had no issues reading this: "This is going to make document scanning a real time saver from now on!"
Obviously, I've spent way too much time correcting bad OCR.
I know that with several services, like Google Voice, they had a link or checkbox to indicate that "this transcription is lousy, I can do better" with an option to do so, which was presumably sent back to improve the service. It really seemed to work, too: the quality of Google Voice's voicemail-to-text transcriptions started off horrible and has since become awesome. Same goes for the built-in speech-to-text in Android. If Google includes something like that here to tune whatever machine learning algos they're using, I have no doubt it will rapidly progress to a usable state.
Dislike the Electoral College? Lobby your state to join the National Popular Vote Interstate Compact.
> the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.
Well, what is the state of the art of OCR today? I wouldn't call this a bad result either... And OTOH, if people were correctly trained in spelling, we would have made do without spell checkers and have invested in OCR technology instead, right? ;)
They really should hide the text underneath the actual scanned image, though, so that what you're looking at is the real page, but searchable. That takes care of the layout issue. And since you aren't actually trying to read the garbled text, a 10% error rate, although still rather high, won't matter as much: you'll only notice it if you copy and paste, or if a search misses a few hits because a word was incorrectly OCR'd. Not a huge deal.
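To make the "hidden text under the image" idea concrete, here is a minimal sketch. It assumes the OCR engine hands back each word with a bounding box on the page image; the `Word` class, `find_hits`, and the sample data are invented for illustration and are not any real PDF or Google Docs API.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str    # OCR output for this word (may contain errors)
    box: tuple   # (x0, y0, x1, y1) in page-image pixels

def find_hits(words, query):
    """Return the boxes of every OCR'd word that matches the query, ignoring case."""
    q = query.lower()
    return [w.box for w in words if w.text.lower() == q]

# Hypothetical OCR output; "rnodern" is a typical misreading of "modern".
page = [
    Word("The", (10, 10, 40, 25)),
    Word("rnodern", (45, 10, 110, 25)),
    Word("office", (115, 10, 160, 25)),
    Word("office", (10, 30, 55, 45)),
]
```

The viewer would draw the original scan and only use the boxes to highlight hits. Searching for "office" finds both occurrences, while searching for "modern" silently misses the garbled word, which is exactly the trade-off described above.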
Alexander Peter Kristopeit bought his basement from his mommy for one dollar.
First: my suggestion is that Google should first put its effort into making Google Docs at least as usable as Zoho Office.
How can a small company like Zoho beat Google on usability?
Second: Gmail still sucks [at search experience], big time, in my opinion. Here's why: I had this happen to me recently...
I knew an email existed but could not remember much about it at all! Yes, sometimes you need a memory trigger, for lack of a better word.
My search term was "details" and Gmail returned 311 messages. I also knew that the attachment on this message was in 'tiff' format. The search returned every message that met my search criterion, but it would have been more useful if Gmail had gone ahead and "auto-magically" categorized the emails by whether they have attachments, then by attachment type, and by other useful categories.
This way, a message with a 'tiff' format attachment (which I could not remember) would have been displayed already sorted for me to see, maybe with some kind of highlight. With a huge inbox like the ones our folks in sales have, uncategorized results are not that useful. By the way, I searched for the string "when" in an inbox that had 142,211 emails and got 11,317 emails back! No categorization at all! Needless to say, these results were not useful.
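Something like the following is what I have in mind, as a quick sketch. It assumes each search result is a plain dict with a "subject" and a list of attachment file names; that shape and the function name `group_by_attachment_type` are my own invention, not anything Gmail exposes.

```python
import os
from collections import defaultdict

def group_by_attachment_type(messages):
    """Bucket search results by the file extension of their attachments."""
    groups = defaultdict(list)
    for msg in messages:
        # Collect each distinct extension once per message, lower-cased.
        exts = {
            os.path.splitext(name)[1].lower().lstrip(".")
            for name in msg["attachments"]
        }
        if not exts:
            groups["(no attachment)"].append(msg["subject"])
        for ext in sorted(exts):
            groups[ext or "(no extension)"].append(msg["subject"])
    return dict(groups)

# Hypothetical results for the search term "details".
results = [
    {"subject": "Trip details", "attachments": ["scan.TIFF"]},
    {"subject": "Details of the offer", "attachments": []},
    {"subject": "More details", "attachments": ["a.pdf", "b.pdf"]},
]
```

With 311 hits, jumping straight to the "tiff" bucket would have left me a handful of messages to look through instead of pages of them.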
The current approach is still wanting, inadequate and can be made better. Yahoo does this, so Google can surely do better.
It'd be cool if it was GPL'd :).
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
OCR (Oxford, Cambridge and RSA Examinations) is an examination board that sets examinations and awards qualifications (including GCSEs and A-levels). It is one of England, Wales and Northern Ireland's five main examination boards.
Organization of Communist Revolutionaries (Marxist-Leninist) was an Iranian Maoist organization. It was formed in opposition to the Shah regime in Iran and was active in the Iranian student movement in exile.
To perform OCR (optical character recognition); Oxford, Cambridge & RSA (examinations (board)); Optical Character Recognition; Office for Civil Rights (US); Office of the Chief Rabbi
People who can't figure things out from context would have a much harder time than you think.
Um...no. The summary is clearly talking about text. It even says "a new option that tells Google to convert the text from PDF and image files to Google Docs documents". There's no way that, from context, you could think OCR meant "Organization of Communist Revolutionaries".
Don't argue for argument's sake.
Snarkiness aside, it explicitly says that it’s converting the text from images into a searchable document... if someone can’t tell from that context that OCR means converting the text from images into a document, they probably have about the IQ of a cinder-block and wouldn’t “get it” from the Google results either.
Hell, if we are really lowering ourselves to that lowest common denominator of intelligence, the person would probably still be confused if we called it Optical Character Recognition. What, like Google has to actually print out the stuff you upload so it can Optically Recognize it...? Oh noes, we’re killing trees. Everybody, don’t use it!
Calling someone a "hater" only means you can not rationally rebut their argument.
> About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.
What did you expect? I've been in the legal field for 10 years and have seen OCR progress substantially during that time. However, a 10% error rate is still very common with scanned docs, and unless you are looking at the original image, all the formatting is lost. This is with the best OCR engines in the industry!
Maybe you should actually know something about the particular field before you judge?
If brevity is the soul of wit, then how does one explain Twitter?
Google's search engine started doing OCR on any scanned documents they found in late 2008. The results were horrible in some cases, but it didn't matter. The searchable OCR results made it possible to find things more easily and you could obviously refer to the original source if the OCR was too garbled.
What are the supported character sets? Is it only Roman characters or what?
OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.
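To give a feel for just one of those steps, here is a deliberately naive sketch of grouping character boxes into text lines. It assumes an unskewed, single-column page and boxes given as (x0, y0, x1, y1); the function name is invented, and this is not how OCRopus or any production system does it.

```python
def group_into_lines(boxes):
    """Group (x0, y0, x1, y1) character boxes into text lines, top to bottom.

    Naive rule: a box joins a line if its vertical centre falls inside the
    vertical extent of the boxes already in that line.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        centre_y = (box[1] + box[3]) / 2
        for line in lines:
            top = min(b[1] for b in line)
            bottom = max(b[3] for b in line)
            if top <= centre_y <= bottom:
                line.append(box)
                break
        else:
            # No existing line contains this box: start a new one.
            lines.append([box])
    # Within each line, order the characters left to right.
    return [sorted(line, key=lambda b: b[0]) for line in lines]
```

Even this toy version falls over on multi-column layouts, skewed scans, superscripts, and captions sitting next to body text, which is the point: each step needs far more care than it first appears.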
Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because there are so many odd things that can happen. Even in perfectly rendered images (no dirt etc.), two characters may be bit-identical but mean something different in different fonts. Ligatures, underlines, unknown characters, etc. also make the problem quite a bit harder.
And even though 1% error would be low for just about any other machine learning or pattern recognition problem, that's a high OCR error and looks quite bad; people are much more sensitive to OCR errors than pattern recognition errors in other contexts. Furthermore, there are a lot of characters to be classified and you only get very little CPU time per character.
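For reference, a common way to put a number on "OCR error" is the character error rate: the edit distance between the true text and the OCR output, divided by the length of the true text. The sketch below uses the standard Levenshtein distance; the function names are illustrative, and this is not necessarily the metric behind the 10% figure in the summary.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def char_error_rate(reference, ocr_output):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

Reading "modern office" as "rnodern office" costs two edits over 13 characters. At a 1% error rate, a page of 2,000 characters (an assumed, round figure) would still contain about 20 wrong characters, which is why it looks bad to a reader.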
We've been developing an OCR system (ocropus.org) for a while now (see http://bit.ly/9Xputj for status info). It's fairly easy to get excellent performance on a closed dataset with a well-defined character set. Getting acceptable performance on arbitrary documents and dealing with all the special cases (ligatures, foreign characters, color images, magazine layouts, unknown languages, Unicode issues, etc.) is tons of work.
Oh, and in case you're wondering, although Google has sponsored OCRopus (thanks!), OCRopus is a separate project from Google's internal OCR efforts.
This Google OCR thing just gives you the text. What about the formatting? Why not use something like WatchOCR from http://www.watchocr.com/? It creates text-searchable PDFs from image-only PDFs, and it's all free and open source. You just drop them into a watched folder and the server spits them out as text-searchable. It runs as a LiveCD, so you don't even have to install anything to try it.