Google Adds OCR To PDF and Images
Kilrah_il writes "Now you have the option to OCR every PDF and image you upload to Google Docs. 'When you upload files to Google Docs, you'll notice a new option that tells Google to convert the text from PDF and image files to Google Docs documents. ... I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.'"
F1r5t p0st? (OCR's by Google)
This is a good thing for all concerned !!
I can understand OCR software not working if you are scanning a document, due to dirt over the text or what have you...but OCR failing on a PDF with typed text? WTF?
Living With a Nerd
th15 i5 $o1zg to nnVke d0(unnenct 5cam1ng a rea| t1me sAver fr0m novv on!
Could Google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha? More or less what reCaptcha does, except with something a bit newer.
With all the words deciphered, no bump in the OCR backend?
Domestic spying is now "Benign Information Gathering"
> I've tried to convert an excerpt from the book Rework and the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.
Oh come on man, this is *FREE* and you're complaining about a 10% margin of error?
Then again, most OCR tech is cheap enough as is, if not completely free.
If anything, this just makes it feel weirder as to exactly how much you're letting google control your documents. Not only are you letting them have your typewritten docs, but now you're scanning in docs for them to archive as they want? Because it's not like you're the owner of said scanned in docs once you put them in there.
If Google's lucky, people will start bombarding them with more and more documents and then a year or two down the line "Google's friendly archives! Millions of scanned documents fully OCR'd and searchable from various sources! Wanna complain about how you didn't want your docs online, searchable, and visible to the entire world? TOUGH COOKIES. CHECK THE TOS."
How long before you see an automated system to upload and process Captcha images on google?
I am Slashdot. Are you Slashdot as well?
For now it sucks, but we know that when Google wants to, it puts out the best product in the market.
Just wondering if this gets so good as to make mass captcha cracking cheap.
I know with several services, like Google Voice, they had a link or checkbox to indicate that "this transcription is lousy, I can do better" with an option to do so, which was presumably sent back to improve the service. It really seemed to work, too, the quality of Google Voice's voicemail-to-text transcriptions started off horrible, and has since become awesome. Same goes for the built-in speech-to-text in Android. If Google includes something like that here to tune whatever machine learning algos they're using, I have no doubt it will rapidly progress to a usable state.
Dislike the Electoral College? Lobby your state to join the National Popular Vote Interstate Compact.
Anyway, it's a beginning. In a few years' time, we'll go to meetings with a notebook and upload our notes into Google Docs later.
> the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.
Well, what is the state of the art of OCR today? I wouldn't call this a bad result either... And OTOH, if people were correctly trained in spelling, we would have made do without spell checkers and have invested in OCR technology instead, right? ;)
Typing "Optical Character Recognition (OCR)" was too much effort?
They really should hide the text underneath the actual scanned image, though, so that what you're actually looking at is the real page, but searchable. That takes care of the issue with layout, and since you aren't actually trying to read the garbled text, although 10% is still a rather high error rate it won't matter as much because you'll only notice it if you're trying to copy-and-paste or you might search for something and miss a few of the hits because it was incorrectly OCR'd. Not a huge deal.
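To make the missed-hits point concrete, here is a minimal Python sketch. It assumes the hidden text layer is just a plain string, and fuzzy_find is a hypothetical helper, not how Google Docs or any PDF viewer actually searches. An exact search misses a word the OCR garbled, while a similarity-based lookup with the standard library's difflib still finds it.

```python
import difflib

def fuzzy_find(query, ocr_text, cutoff=0.8):
    """Return words from the OCR'd text that closely resemble the query.

    cutoff is the minimum difflib similarity ratio (0..1) to count as a hit;
    0.8 is an arbitrary illustrative value, not a tuned threshold.
    """
    words = ocr_text.lower().split()
    return difflib.get_close_matches(query.lower(), words, n=5, cutoff=cutoff)

# Hypothetical hidden text layer in which the OCR read "o" as "0".
ocr_text = "please review the attached d0cument before the meeting"

print("document" in ocr_text)            # exact search: no hit
print(fuzzy_find("document", ocr_text))  # tolerant search: finds the garbled word
```

The trade-off is the usual one: lowering the cutoff recovers more garbled words but also starts returning words that merely look similar.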
Alexander Peter Kristopeit bought his basement from his mommy for one dollar.
First: My suggestion is that Google should put its efforts in making Google Docs at least as usable as Zoho Office first.
How can a small company like Zoho beat Google on usability?
Second: GMail still sucks [at search experience], big time in my opinion. Here's why: I had this happen to me recently...
I knew an email existed but could not remember much about it at all! Yes sometimes, you need a memory trigger for lack of a better word.
My search term was "details" and Gmail returned 311 messages. I also knew that the attachment this message had was in 'tiff' format. This search had GMail return all those messages that met my search criterion but it would have been more useful if Google Mail had gone ahead to "auto-magically" categorize emails with attachments, and further by attachment type and so many other useful categorizations.
This way, a message with a 'tiff' format attachment (which I could not remember) would have been displayed...already sorted for me to see, maybe with some kind of highlight. If I had a huge in-box like our folks in sales, results would not be that useful. By the way, I searched for the string "when" in an in-box that had 142,211 emails and I received 11,317 emails back! No categorization at all! Needless to say, these results were not useful.
The current approach is still wanting, inadequate and can be made better. Yahoo does this, so Google can surely do better.
It'd be cool if it was GPL'd :).
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
OCR (Oxford, Cambridge and RSA Examinations) is an examination board that sets examinations and awards qualifications (including GCSEs and A-levels). It is one of England, Wales and Northern Ireland's five main examination boards.
Organization of Communist Revolutionaries (Marxist-Leninist) was an Iranian Maoist organization. It was formed in opposition to the Shah regime in Iran and was active in the Iranian student movement in exile.
To perform OCR (optical character recognition); Oxford, Cambridge & RSA (examinations (board)); Optical Character Recognition; Office for Civil Rights (US); Office of the Chief Rabbi
People who can't figure things out from context would have a much harder time than you think.
Um...no. The summary is clearly talking about text. It even says "a new option that tells Google to convert the text from PDF and image files to Google Docs documents". There's no way that, from context, you could think OCR meant "Organization of Communist Revolutionaries".
Don't argue for argument's sake.
> Don't argue for argument's sake.
But those are the best kind... it doesn’t even really matter who’s wrong.
Never let a day go by when you can’t say to yourself as you’re falling asleep, “Well, I was wrong on the internet today, but damn, I had fun.” That’s what I say...
Snarkiness aside, it explicitly says that it’s converting the text from images into a searchable document... if someone can’t tell from that context that OCR means converting the text from images into a document, they probably have about the IQ of a cinder-block and wouldn’t “get it” from the Google results either.
Hell, if we are really lowering ourselves to that lowest denominator of intelligence, the person would probably still be confused if we called it Optical Character Recognition. What, like Google has to actually print out the stuff you upload so it can Optically Recognize it...? Oh noes, we’re killing trees. Everybody, don’t use it!
Calling someone a "hater" only means you can not rationally rebut their argument.
> About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.
What did you expect? I've been in the legal field for 10 years and have seen OCR progress substantially during that time. However, 10% error rate is still very common with scanned docs and unless you are looking at the original image, all the formatting is lost. This is with the best OCR engines in the industry!
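For anyone wondering how a figure like "10% error" is usually arrived at: one common measure is the character error rate, the edit distance between the OCR output and a known-good transcription, divided by the length of the transcription. The sketch below is a plain-Python illustration with made-up sample strings. It is not the metric any particular OCR vendor is confirmed to report.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)

# Made-up example: three characters misread in a 19-character line.
reference = "the quick brown fox"
ocr_output = "tbe quick hrown f0x"

print(f"CER: {character_error_rate(reference, ocr_output):.1%}")
```

Measured this way, a handful of misread characters in a short line is enough to land well above 10%.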
Maybe you should actually know something about the particular field before you judge?
If brevity is the soul of wit, then how does one explain Twitter?
Google's search engine started doing OCR on any scanned documents they found in late 2008. The results were horrible in some cases, but it didn't matter. The searchable OCR results made it possible to find things more easily and you could obviously refer to the original source if the OCR was too garbled.
What are the supported character sets? Is it only roman characters or what?
Twinstiq, game news
OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.
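As a rough illustration of just one of those steps, grouping characters into lines, here is a deliberately naive Python sketch. The CharBox type, the coordinates and the line_tolerance value are all assumptions made up for this example. It is not how OCRopus or Google's engine does layout analysis, and it would fall apart on multi-column pages, floats or skewed scans, which is exactly why the step is hard.

```python
from typing import NamedTuple

class CharBox(NamedTuple):
    char: str
    x: float  # left edge of the character
    y: float  # vertical center of the character

def group_into_lines(boxes, line_tolerance=5.0):
    """Group character boxes into text lines by vertical position.

    Boxes are sorted top to bottom; a box joins the current line if its
    center is within line_tolerance of the previous box, otherwise it
    starts a new line. Each line is then read left to right.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b.y):
        if lines and abs(box.y - lines[-1][-1].y) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return ["".join(b.char for b in sorted(line, key=lambda b: b.x))
            for line in lines]

# Hypothetical recognizer output, in no particular order.
boxes = [
    CharBox("K", 8, 29),
    CharBox("H", 0, 10),
    CharBox("O", 0, 30),
    CharBox("i", 8, 11),
]

print(group_into_lines(boxes))
```

Even if every character here was recognized perfectly, a wrong tolerance would merge or split lines and scramble the output.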
Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because there are so many odd things that can happen. Even in perfectly rendered images (no dirt etc.), two characters may be bit-identical but mean something different in different fonts. Ligatures, underlines, unknown characters, etc. also make the problem quite a bit harder.
And even though 1% error would be low for just about any other machine learning or pattern recognition problem, that's a high OCR error and looks quite bad; people are much more sensitive to OCR errors than pattern recognition errors in other contexts. Furthermore, there are a lot of characters to be classified and you only get very little CPU time per character.
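A quick back-of-the-envelope calculation shows why a 1% character error rate still looks bad. The sketch below assumes errors hit each character independently, which is a simplification (real OCR errors tend to cluster), and the page size of 2,000 characters is just an illustrative figure.

```python
def word_accuracy(char_error_rate, word_length):
    """Probability that every character of a word is correct,
    assuming independent per-character errors."""
    return (1.0 - char_error_rate) ** word_length

def expected_errors(char_error_rate, n_chars):
    """Expected number of wrong characters in a text of n_chars."""
    return char_error_rate * n_chars

for rate in (0.01, 0.10):
    print(f"char error {rate:.0%}: "
          f"5-letter word fully correct {word_accuracy(rate, 5):.1%}, "
          f"about {expected_errors(rate, 2000):.0f} errors per 2000-character page")
```

Under these assumptions, 1% per character means roughly one five-letter word in twenty contains a mistake, and at 10% about four in ten do.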
We've been developing an OCR system (ocropus.org) for a while now (see http://bit.ly/9Xputj for status info). It's fairly easy to get excellent performance on a closed dataset with a well-defined character set. Getting acceptable performance on arbitrary documents and dealing with all the special cases (ligatures, foreign characters, color images, magazine layouts, unknown languages, Unicode issues, etc.) is tons of work.
Oh, and in case you're wondering, although Google has sponsored OCRopus (thanks!), OCRopus is a separate project from Google's internal OCR efforts.
This Google OCR thing just gives you the text. What about the formatting? Why not use something like WatchOCR from http://www.watchocr.com/? It creates text-searchable PDFs from image-only PDFs, and it's all free and open source. You just drop them into a watched folder and the server spits them out as text searchable. It runs as a LiveCD, so you don't even have to install anything to try it.