Google Adds OCR To PDF and Images

F1r5t p0st? by Chrisq · 2010-06-22 01:00 · Score: 4, Funny

F1r5t p0st? (OCR's by Google)

lolwut? by Pojut · 2010-06-22 01:00 · Score: 2, Insightful

I can understand OCR software not working if you are scanning a document, due to dirt over the text or what have you...but OCR failing on a PDF with typed text? WTF?

--
Living With a Nerd

Re:lolwut? by Sir_Lewk · 2010-06-22 01:06 · Score: 1

Maybe the font they were using was ShittyLowRezScan-Serifs.

--
"linux is just DOS with a UNIX like syntax" -- Galactic Dominator (944134)
Re:lolwut? by erikdalen · 2010-06-22 01:19 · Score: 3, Interesting

Didn't fail at all on a PDF with typed text for me. Did you actually try it?
I bet they don't actually use OCR on a PDF with typed text as they can just extract it from the PDF, they probably use that on images inside PDFs though.

--
Erik Dalén
Re:lolwut? by AHuxley · 2010-06-22 01:21 · Score: 1

Someone at google made a mistake with the dpi setting? Between Tesseract and reCAPTCHA something should work.

--
Domestic spying is now "Benign Information Gathering"
Re:lolwut? by mlk · 2010-06-22 01:24 · Score: 1

It is likely that the PDF tried above was scanned pages.

--
Wow, I should not post when knackered.
Re:lolwut? by mlk · 2010-06-22 01:34 · Score: 3, Informative

I've just tried with the extract.
The text extraction seems to have worked well. Unsurprisingly, the formatting has been lost and it has got confused with the REwork type bits. PDFs are not designed with extraction to an editable format in mind, so getting any of the formatting is impressive in my book.

--
Wow, I should not post when knackered.
Re:lolwut? by clone53421 · 2010-06-22 02:29 · Score: 1

If you uploaded a PDF with typed text, it probably didn't even do OCR on it. It'd be pointless. You have to convert the pages to images for that to be necessary and I'm guessing you didn't.
Open in Acrobat Reader and use the snapshot tool to capture an entire page. Paste into Word as an image, then re-export to PDF. Upload that and then see how the OCR fares. Of course you'll also get an excellent quality in the snapshot since it's a pure digital copy and it won't have the blemishes that you'd get by printing a physical page and scanning that, so the results might be better.

--
Alexander Peter Kristopeit bought his basement from his mommy for one dollar.
Re:lolwut? by TheRaven64 · 2010-06-23 00:21 · Score: 1

PDFs are not designed with extraction to an editable format in mind
Spoken like someone who has never read the PDF spec. PDFs are, in fact, specifically designed to allow editing. Everything in a PDF is stored as an object inside the document, indexed via an object table. Text runs are single objects containing a stream of commands sent to a PostScript-like VM to control their positioning. You can relatively easily map these to rich text in some other format, and you can trivially replace any object in a PDF by adding a new version and appending a new object table with a new version. PDFs store their object table at the end specifically for this reason - it allows new versions of objects to be added without having to rewrite the entire file; you can just write a new object table that refers to the old one but overrides some objects by providing a higher version number for them.
What you meant to say was 'creating correct layout information from an image of text is hard'.
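
The append-only update scheme described above can be sketched with a toy model. This is not a real PDF parser: `Revision` and `resolve` are hypothetical names invented for illustration, and the sketch only mimics the idea that a newer object table refers back to the older one and overrides some entries (real PDFs chain cross-reference sections through a `/Prev` entry in the trailer).

```python
# Toy model of PDF-style incremental updates (an illustration, not a parser).
# Each appended "revision" carries its own object table, which points back
# to the previous table and overrides only the objects that changed.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Revision:
    """One appended section: new object bodies plus a link to the prior table."""
    objects: Dict[int, str] = field(default_factory=dict)
    prev: Optional["Revision"] = None


def resolve(latest: Revision, obj_id: int) -> Optional[str]:
    """Walk from the newest table back to the oldest; the first hit wins."""
    rev = latest
    while rev is not None:
        if obj_id in rev.objects:
            return rev.objects[obj_id]
        rev = rev.prev
    return None


# Original document: two text objects.
base = Revision(objects={1: "Hello", 2: "World"})

# An edit appends a revision that overrides object 2 only.
# Nothing in `base` is rewritten.
edit = Revision(objects={2: "PDF"}, prev=base)

print(resolve(edit, 1))  # Hello
print(resolve(edit, 2))  # PDF
print(resolve(base, 2))  # World (the old version is still in the file)
```

The point of the design is that an editor only ever appends: the old bytes stay where they are, and a reader that starts from the newest table sees the new object.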

--
I am TheRaven on Soylent News
Re:lolwut? by mlk · 2010-06-23 22:05 · Score: 1

Nope - I'll rephrase it to "most tools do not output PDFs in a form that is easy to extract to a common editable format (such as Word)" if you like.
The spec may allow for easy editing, but converting a PDF (a PDF containing text with formatting, not an image stored in a PDF) is hard. Chunks of unrelated text get bunched together into a single object, while other chunks of text that are related get thrown into some unrelated chunk, so extracting it all logically becomes a royal pain. Sure, this is the "fault" of the creation tool, but given that all the creation tools I've played with generate scariness, I'd question that, and suggest that the specification is not as well designed for editing as a document. It may well be great (even for editing) as a DTP format, I don't know.

--
Wow, I should not post when knackered.

Captcha correction? by 0100010001010011 · 2010-06-22 01:05 · Score: 3, Interesting

Could Google provide some sort of opt-in service where our PDFs (one word at a time) could appear as a captcha? More or less what reCaptcha does, except with something a bit newer.

Re:Captcha correction? by 0100010001010011 · 2010-06-22 01:51 · Score: 1

Well, a captcha service would have corrected "F1r5t p0st". Seems relevant to me.
Re:Captcha correction? by somersault · 2010-06-22 02:22 · Score: 1

I fail to see how it works as a captcha if the "correct" interpretation is unknown..

--
which is totally what she said
Re:Captcha correction? by Anonymous Coward · 2010-06-22 05:29 · Score: 1, Interesting

Check into how the current reCaptcha works. The user is presented with two words. One is known to be correct. The other is suspect. User is unaware of which is known and which is suspect. User types both words, and backend system verifies the known word was typed correctly. Logs suspect word value typed by user. Returns the suspect word image to a few more users, and if they all respond with same text along with correct known word, the system can assume the suspect image contains the text returned. http://stackoverflow.com/questions/1435696/how-does-recaptcha-work
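
A minimal sketch of the two-word scheme described above, under stated assumptions: the class name `TwoWordCaptcha`, the `votes_needed` threshold, and the image ids are all made up for illustration, and a simple count of agreeing answers stands in for whatever rule the real service uses.

```python
# Sketch of a two-word captcha: one control word with a known answer,
# one suspect word whose reading is collected from users who pass.
from collections import Counter, defaultdict


class TwoWordCaptcha:
    def __init__(self, votes_needed: int = 3):
        self.votes_needed = votes_needed      # assumed agreement threshold
        self.votes = defaultdict(Counter)     # suspect id -> Counter of readings
        self.resolved = {}                    # suspect id -> accepted reading

    def submit(self, control_answer: str, control_typed: str,
               suspect_id: str, suspect_typed: str) -> bool:
        """Pass or fail on the control word; log the suspect reading on a pass."""
        if control_typed.strip().lower() != control_answer.lower():
            return False                      # failed users contribute nothing
        reading = suspect_typed.strip().lower()
        self.votes[suspect_id][reading] += 1
        if self.votes[suspect_id][reading] >= self.votes_needed:
            self.resolved[suspect_id] = reading
        return True


captcha = TwoWordCaptcha(votes_needed=2)
print(captcha.submit("first", "first", "img-42", "post"))  # True
print(captcha.submit("first", "frist", "img-42", "posh"))  # False
print(captcha.submit("first", "First", "img-42", "post"))  # True
print(captcha.resolved)                                    # {'img-42': 'post'}
```

The user never learns which word was the control, so guessing at the suspect word gains nothing, while honest readings accumulate until enough of them agree.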
Re:Captcha correction? by pcgc1xn · 2010-06-22 05:37 · Score: 2, Informative

I am pretty sure that with reCAPTCHA only one of the two words you type in is unknown.
So say I have some text that looks like 'first (known) p0sh bi4ches'.
Captcha user one will get "first p0sh".
If they correctly identify "first", then I will accept their reading of "p0sh", say "post".
User 2 gets "p0sh bi4ches".
If they correctly identify "p0sh" as "post", then I will accept their reading of "bi4ches".
Obviously the guys at reCAPTCHA have done a better job than my simplified & poor explanation. You need "some" knowledge of what the text actually is, but only some.

--
Zapsavings: Simply calculate how much energy efficient bulb

Where did all the ReCAPTCHA go? by AHuxley · 2010-06-22 01:08 · Score: 2, Interesting

With all the words deciphered, no bump in the OCR backend?

--
Domestic spying is now "Benign Information Gathering"

Re:Where did all the ReCAPTCHA go? by Loconut1389 · 2010-06-22 01:56 · Score: 1

ReCAPTCHA was to fix bad scans in specific works- I didn't think it was ever designed to further OCR, but I see how it could possibly be useful.
Re:Where did all the ReCAPTCHA go? by AHuxley · 2010-06-22 03:18 · Score: 1

http://en.wikipedia.org/wiki/ReCAPTCHA seems to be in use for some form of OCR?
"The reCAPTCHA software itself is not open source" could be the issue?

--
Domestic spying is now "Benign Information Gathering"
Re:Where did all the ReCAPTCHA go? by slaingod · 2010-06-22 04:04 · Score: 1

I think the point Loconut was making is that ReCaptcha does not 'further' machine OCR (i.e. it doesn't improve the recognition algorithms used by the OCR software), instead using humans to 'OCR' words that otherwise aren't legible.

--
http://blog.slaingod.com
Re:Where did all the ReCAPTCHA go? by AHuxley · 2010-06-22 04:11 · Score: 1

Pity they did not improve the recognition algorithms with all the data flowing in.
Cost vs a tiny % in better recognition vs a free network of humans.
Thanks for the info, I was thinking that a quality private OCR system was getting the ReCAPTCHA inputs and it was learning.

--
Domestic spying is now "Benign Information Gathering"

Google Captcha processor here I come!!!! by OzPeter · 2010-06-22 01:19 · Score: 2, Interesting

How long before you see an automated system to upload and process Captcha images on google?

--
I am Slashdot. Are you Slashdot as well?

Re:Google Captcha processor here I come!!!! by mrops · 2010-06-22 02:33 · Score: 1

A little offtopic.
I have always wondered about this: Google does a whole lot of processing, more so than any other corporation in recent times. Stuff like this OCR, searches, building heuristics for searches, etc. Combined, these are no small tasks. Is there a number on what kind of processing power Google has? Does Google's computing grid qualify to be categorized as a super computing grid? What is its standing when compared to all those other super computers?
Re:Google Captcha processor here I come!!!! by BrightSpark · 2010-06-22 03:15 · Score: 2, Interesting

One of the easier ways to restrict how your words and ideas are searched and indexed on the net is to hide them in plain sight. A jpg image of text is very difficult for a search engine to use, yet you and I can read and understand the data quite easily. This ability to scan online has been around, but not mainstream to my knowledge. I'm guessing Google has been checking jpgs for text as a trial for some time. Once this is gone maybe ASCII art text will work for a while. Hiding/protecting data by steganography is detectable by scan now, e.g. http://www.outguess.org/detection.php so the battle continues. Of course one can work offline and send letters to each other and be protected by law :-) I wonder if one day sending stuff by mail will seem shady?
Re:Google Captcha processor here I come!!!! by gravis777 · 2010-06-22 03:43 · Score: 1

I can't read captchas 60% of the time, and am not always in an area where I can listen to the audio hint. An OCR would be nice. On Boing Boing, I usually mistype the captchas about 3-4 times before finally stumbling on one I can actually read.
Re:Google Captcha processor here I come!!!! by rah1420 · 2010-06-22 04:24 · Score: 1

A super computing grid?
Oolcay itay.

--
Mit der Dummheit kämpfen Götter selbst vergebens.
Re:Google Captcha processor here I come!!!! by Bigjeff5 · 2010-06-22 06:42 · Score: 1

I had to use a captcha for work once, and the captcha itself was incorrect. I have no idea what key combination would have worked, but what the captcha said certainly didn't. It had an audio option, so I tried it, but the audio was so garbled I couldn't pick out a single word, let alone the three necessary to complete the captcha.
I like captcha as a basic form of protection from bots, but when it keeps me from accessing a website it is beyond worthless.

--
Security is mostly a superstition... Avoiding danger is no safer in the long run than outright exposure. - Helen Keller

captcha cracking by aiwarrior · 2010-06-22 01:19 · Score: 1

For now it sucks, but we know that if Google wants to, it can beat the best on the market.
Just wondering if this gets so good as to make mass captcha cracking cheap.

Re:Great! by morgan_greywolf · 2010-06-22 01:23 · Score: 1

Sadly, I had no issues reading this: "This is going to make document scanning a real time saver from now on!"

Obviously, I've spent way too much time correcting bad OCR.

--
My blog

Is there a "this translation is bad" option? by AdmiralXyz · 2010-06-22 01:28 · Score: 4, Informative

I know with several services, like Google Voice, they had a link or checkbox to indicate that "this transcription is lousy, I can do better" with an option to do so, which was presumably sent back to improve the service. It really seemed to work, too, the quality of Google Voice's voicemail-to-text transcriptions started off horrible, and has since become awesome. Same goes for the built-in speech-to-text in Android. If Google includes something like that here to tune whatever machine learning algos they're using, I have no doubt it will rapidly progress to a usable state.

--
Dislike the Electoral College? Lobby your state to join the National Popular Vote Interstate Compact.

Re:Is there a "this translation is bad" option? by rolfwind · 2010-06-22 01:53 · Score: 1

Google Translate between the western languages I've encountered is pretty good, but it needs a lot of work on the Asian languages, imo.
Re:Is there a "this translation is bad" option? by steveg · 2010-06-23 05:51 · Score: 1

Awesome might be pushing it a bit, but I'll agree it's gotten better. It's never quite right, but I can usually get the gist of the message even before I have a chance to listen directly.

--
Ignorance killed the cat. Curiosity was framed.

OCR efficiency by Anonymous Coward · 2010-06-22 01:34 · Score: 1, Interesting

> the result wasn't great. About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.

Well, what is the state of the art of OCR today? I wouldn't call this a bad result either... And OTOH, if people were correctly trained in spelling, we would have made do without spell checkers and have invested in OCR technology instead, right? ;)

Doesn't have to be perfect by clone53421 · 2010-06-22 01:36 · Score: 4, Insightful

They really should hide the text underneath the actual scanned image, though, so that what you're actually looking at is the real page, but searchable. That takes care of the issue with layout, and since you aren't actually trying to read the garbled text, although 10% is still a rather high error rate it won't matter as much because you'll only notice it if you're trying to copy-and-paste or you might search for something and miss a few of the hits because it was incorrectly OCR'd. Not a huge deal.
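
A rough sketch of the hidden-text-layer idea above: the reader only ever sees the scanned image, and the OCR output is used purely to locate search hits on it. `OcrWord`, `find_hits`, and the pixel boxes are invented for illustration and do not describe how Google Docs actually does it.

```python
# Searchable image sketch: OCR words carry bounding boxes on the page image,
# and a search returns boxes to highlight rather than any OCR text to read.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height in page-image pixels


@dataclass
class OcrWord:
    text: str
    box: Box


def find_hits(words: List[OcrWord], query: str) -> List[Box]:
    """Return the boxes of all OCR words that match the query exactly."""
    q = query.lower()
    return [w.box for w in words if w.text.lower() == q]


page = [
    OcrWord("Google", (10, 10, 60, 12)),
    OcrWord("adds", (75, 10, 30, 12)),
    OcrWord("0CR", (110, 10, 28, 12)),   # "OCR" misread with a zero
    OcrWord("to", (143, 10, 14, 12)),
    OcrWord("Google", (162, 10, 60, 12)),
    OcrWord("Docs", (227, 10, 34, 12)),
]

print(find_hits(page, "google"))  # [(10, 10, 60, 12), (162, 10, 60, 12)]
print(find_hits(page, "OCR"))     # [] (the misread word is a missed hit)
```

The misread word costs one missed search hit and nothing else, since the page the reader looks at is still the original scan.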

--
Alexander Peter Kristopeit bought his basement from his mommy for one dollar.

Re:Doesn't have to be perfect by gravis777 · 2010-06-22 03:47 · Score: 1

The funny thing is that their OCR seems to be pretty good for Google Books. Yes, it's photographed pictures, but you can search the text, which means some type of OCR must be going on. So, unless they are using a completely different technology, then this should really only have issues with hand-written text.
Re:Doesn't have to be perfect by gravis777 · 2010-06-22 03:48 · Score: 1

Photographed text. Blah. Should have proofread before I hit submit.
Re:Doesn't have to be perfect by clone53421 · 2010-06-22 04:07 · Score: 1

Copy-and-paste some text from it and see how good the OCR was. You’ll be able to see the mistakes that were previously hidden.
I’m guessing it’s exactly the same engine, but done exactly as I said it should be, correctly.

--
Alexander Peter Kristopeit bought his basement from his mommy for one dollar.

Google should concentrate elsewhere by bogaboga · 2010-06-22 01:54 · Score: 1

First: My suggestion is that Google should put its efforts in making Google Docs at least as usable as Zoho Office first.

How can a small company like Zoho beat Google on usability?

Second: GMail still sucks [at search experience], big time in my opinion. Here's why: I had this happen to me recently...

I knew an email existed but could not remember much about it at all! Yes sometimes, you need a memory trigger for lack of a better word.

My search term was "details" and Gmail returned 311 messages. I also knew that the attachment this message had was in 'tiff' format. This search had GMail return all those messages that met my search criterion but it would have been more useful if Google Mail had gone ahead to "auto-magically" categorize emails with attachments, and further by attachment type and so many other useful categorizations.

This way, a message with a 'tiff' format attachment (which I could not remember) would have been displayed...already sorted for me to see, maybe with some kind of highlight. If I had a huge in-box like our folks in sales, results would not be that useful. By the way, I searched for the string "when" in an in-box that had 142,211 emails and I received 11,317 emails back! No categorization at all! Needless to say, these results were not useful.

The current approach is still wanting, inadequate and can be made better. Yahoo does this, so Google can surely do better.

Re:Google should concentrate elsewhere by MozeeToby · 2010-06-22 02:13 · Score: 1

By the way, I searched for the string "when" in an in-box that had 142,211 emails and I received 11,317 emails back!
You can hardly expect Google to make up for your lack of search skills or memory. I'm not saying you don't have other valid points, but searching for such basic terms as 'when' and 'details' instead of something that is unique to the message is bound to return tons of results. It's not Google's fault that you have 11,317 emails with the word 'when' in it.
Re:Google should concentrate elsewhere by HamburglerJones · 2010-06-22 02:19 · Score: 1

Search:
details has:attachment .tiff

I see your point about it being nice if they'd automatically label this stuff, but you can search for attachments. This might turn up something that has a different kind of attachment and merely mentions ".tiff" in the email, but what you're looking for should turn up.

I have found Gmail search to be vastly superior to Yahoo! and Outlook since I've switched. They have some great tips on how to search.
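
For anyone scripting this, the search suggested above can be put together with a small helper. The function `gmail_query` is made up for illustration. `has:attachment` comes from the post above, while `filename:` is another Gmail search operator added here as an assumption about what the poster would want, since it matches on the attachment rather than on a ".tiff" mention in the body.

```python
# Sketch: compose a Gmail search string from plain terms plus operators.
from typing import Optional


def gmail_query(*terms: str, has_attachment: bool = False,
                filename: Optional[str] = None) -> str:
    """Join free-text terms and optional operators into one query string."""
    parts = list(terms)
    if has_attachment:
        parts.append("has:attachment")
    if filename:
        parts.append(f"filename:{filename}")
    return " ".join(parts)


print(gmail_query("details", has_attachment=True, filename="tiff"))
# details has:attachment filename:tiff
```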
Re:Google should concentrate elsewhere by bogaboga · 2010-06-22 02:41 · Score: 1

You still do not get it, I am afraid! And that's the very reason that companies like Apple and Microsoft at one point in the past made life incredibly easy for computer users. This is why they excelled, of course making users "dumb" in the process.

You can hardly expect Google to make up for your lack of search skills or memory.

This is the very mistake you make...How come Google now categorizes results of search terms at google.com? Tell me why. I just searched for "House Skills" and had categories of videos, discussions, books, news, blogs, updates returned. So according to you, categorizations work for searches at google.com but not GMail? What kind of reasoning is this?

I'm not saying you don't have other valid points, but searching for such basic terms as 'when' and 'details' instead of something that is unique to the message is bound to return tons of results.
OK..thanks...! People do crazy things with their computers and an algorithm is supposed to take care of such business.

It's not Google's fault that you have 11,317 emails with the word 'when' in it.
Who said it's Google's fault? OK...but when a successful company prides itself in being able to 'organize' the world's information, and is pretty good at it, we as users expect the best. Why not?
Re:Google should concentrate elsewhere by quickOnTheUptake · 2010-06-22 02:42 · Score: 1

The thing I've had trouble with in gmail search is that it lacks any sort of lemmatisation. This would be fine if it would match sub-strings within words, but it seems to only match full words that are morphologically identical.
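
To make the complaint above concrete, here is a sketch comparing whole-word matching with matching on reduced word forms. The suffix stripper `crude_stem` is a deliberately naive stand-in for real lemmatisation (it would mangle plenty of English words), and all function names are invented for illustration.

```python
# Whole-word matching versus matching on crudely reduced word forms.
import re
from typing import List

SUFFIXES = ("ing", "ed", "es", "s")  # naive list, for illustration only


def crude_stem(word: str) -> str:
    """Strip one common suffix if at least three letters remain."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(query: str, text: str) -> bool:
    """Match only words that are identical to the query."""
    return query.lower() in tokens(text)


def stemmed_match(query: str, text: str) -> bool:
    """Match words that reduce to the same crude stem as the query."""
    q = crude_stem(query)
    return any(crude_stem(t) == q for t in tokens(text))


mail = "See the details attached"
print(exact_match("detail", mail))    # False
print(stemmed_match("detail", mail))  # True
print(stemmed_match("attach", mail))  # True
```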

--
Mod points: Guaranteed to remove your sense of humor.
Side effects may include gullibility and temporary retardation
Re:Google should concentrate elsewhere by bogaboga · 2010-06-22 02:53 · Score: 1

I see your point. This is another thing Yahoo Mail does well.
Re:Google should concentrate elsewhere by KeNickety · 2010-06-22 03:04 · Score: 1

Possibly because computers and networks can't be expected to infer meaning? Remember, in your search field you've entered no context, no kinds of specifying statements, so you're expecting the computer to be able to read your mind?
Re:Google should concentrate elsewhere by bogaboga · 2010-06-22 03:15 · Score: 1

...so you're expecting the computer to be able to read your mind?
No sir! I expect the computer to categorize, and I know it is possible because I have seen it elsewhere...even in applications by the same vendor.
Re:Google should concentrate elsewhere by Tropaios · 2010-06-22 04:05 · Score: 1

The problem is that you weren't using Google's search engine properly. You failed to give it all the relevant information you DID remember. Next time include "tiff" in your search as well as clicking the box "Has attachment" in search options.
In fact, go do that now, then come back and tell me how many results you get.

Anyone know what they're using for the OCR? by rsilvergun · 2010-06-22 02:14 · Score: 1

It'd be cool if it was GPL'd :).

--
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/

Re:Anyone know what they're using for the OCR? by quickOnTheUptake · 2010-06-22 02:47 · Score: 3, Informative

I don't know for sure what's running behind this, but Google's OCRopus is Apache, as is the actual OCR engine behind it, tesseract.

--
Mod points: Guaranteed to remove your sense of humor.
Side effects may include gullibility and temporary retardation
Re:Anyone know what they're using for the OCR? by tmbdev · 2010-06-22 13:38 · Score: 1

FWIW, I believe a lot of OCRopus hasn't been incorporated at Google yet because OCRopus itself is still under heavy development.
Re:Anyone know what they're using for the OCR? by danhs7 · 2010-06-23 02:11 · Score: 1

Why? Apache is a more liberal license.

Re:OCD??? by selven · 2010-06-22 02:25 · Score: 1

OCR (Oxford, Cambridge and RSA Examinations) is an examination board that sets examinations and awards qualifications (including GCSEs and A-levels). It is one of England, Wales and Northern Ireland's five main examination boards.

Organization of Communist Revolutionaries (marxist-leninist) was an Iranian Maoist organization. It was formed in opposition to the Shah regime in Iran and was active in the Iranian student movement in exile.

To perform OCR (optical character recognition); Oxford, Cambridge & RSA (examinations (board)); Optical Character Recognition; Office for Civil Rights (US); Office of the Chief Rabbi

People who can't figure things out from context would have a much harder time than you think.

Re:OCD??? by SpeZek · 2010-06-22 02:46 · Score: 1

Um...no. The summary is clearly talking about text. It even says "a new option that tells Google to convert the text from PDF and image files to Google Docs documents". There's no way that, from context, you could think OCR meant "Organization of Communist Revolutionaries".

Don't argue for argument's sake.

Re:OCD??? by clone53421 · 2010-06-22 03:06 · Score: 1

Snarkyness aside, it explicitly says that it’s converting the text from images into a searchable document... if someone can’t tell from that context that OCR means converting the text from images into a document, they probably have about the IQ of a cinder-block and wouldn’t “get it” from the Google results either.

Hell, if we are really lowering ourselves to that lowest denominator of intelligence, the person would probably still be confused if we called it Optical Character Recognition. What, like Google has to actually print out the stuff you upload so it can Optically Recognize it...? Oh noes, we’re killing trees. Everybody, don’t use it!

--
Alexander Peter Kristopeit bought his basement from his mommy for one dollar.

Changing ridiculously stupid subject line by mjwx · 2010-06-22 03:56 · Score: 1

I bet they don't actually use OCR on a PDF with typed text as they can just extract it from the PDF, they probably use that on images inside PDFs though.
Have you tried it on a PDF that was an image of text, such as a scanned or photographed text document? That's the real test.

--
Calling someone a "hater" only means you can not rationally rebut their argument.

OCR Reality by Shadow+Wrought · 2010-06-22 04:19 · Score: 2, Informative

About 10% of the text has been incorrectly converted and the formatting hasn't been preserved.

What did you expect? I've been in the legal field for 10 years and have seen OCR progress substantially during that time. However, 10% error rate is still very common with scanned docs and unless you are looking at the original image, all the formatting is lost. This is with the best OCR engines in the industry!

Maybe you should actually know something about the particular field before you judge?

--
If brevity is the soul of wit, then how does one explain Twitter?

Re:OCR Reality by binary+paladin · 2010-06-22 05:58 · Score: 1

Yeah, I was thinking the same thing. This sounds like someone who hasn't actually done OCR prior to these fancy Google docs.
OCR has always been somewhat inaccurate. It's just the nature of the beast.

Google search had OCR before Google Docs by mike.mondy · 2010-06-22 06:01 · Score: 2, Interesting

Google's search engine started doing OCR on any scanned documents they found in late 2008. The results were horrible in some cases, but it didn't matter. The searchable OCR results made it possible to find things more easily and you could obviously refer to the original source if the OCR was too garbled.

What character sets? by HalAtWork · 2010-06-22 06:51 · Score: 1

What are the supported character sets? Is it only roman characters or what?

--
Twinstiq, game news

lots of tough problems in OCR by tmbdev · 2010-06-22 13:55 · Score: 2, Interesting

OCR consists of many steps; recognizing the individual characters is only one of them. You also need to separate text from images, group characters into lines and columns, separate floats, captions, and body text, etc. Many of those are tough problems even if someone hands you a PDF with all the characters. And if any one of them is wrong, the entire output may be wrong.

Recognizing individual characters is also harder than you may think because there is such a wide variety of fonts in use and because there are so many odd things that can happen. Even in perfectly rendered images (no dirt etc.), two characters may be bit-identical but mean something different in different fonts. Ligatures, underlines, unknown characters, etc. also make the problem quite a bit harder.

And even though 1% error would be low for just about any other machine learning or pattern recognition problem, that's a high OCR error and looks quite bad; people are much more sensitive to OCR errors than pattern recognition errors in other contexts. Furthermore, there are a lot of characters to be classified and you only get very little CPU time per character.

We've been developing an OCR system (ocropus.org) for a while now (see http://bit.ly/9Xputj for status info). It's fairly easy to get excellent performance on a closed dataset with a well-defined character set. Getting acceptable performance on arbitrary documents and dealing with all the special cases (ligatures, foreign characters, color images, magazine layouts, unknown languages, Unicode issues, etc.) is tons of work.

Oh, and in case you're wondering, although Google has sponsored OCRopus (thanks!), OCRopus is a separate project from Google's internal OCR efforts.
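
The error rates thrown around in this thread (10%, 1%) can be made concrete as a character error rate: edit distance between the true text and the OCR output, divided by the length of the true text. The sketch below uses plain Levenshtein distance. That is one common way to score OCR, not necessarily the measure anyone in this thread had in mind, and the function names are invented for illustration.

```python
# Character error rate (CER) via Levenshtein edit distance.


def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


print(char_error_rate("First post", "F1r5t p0st"))  # 0.3
```

Three wrong characters out of ten gives 0.3, so a 10% rate on a full page means a mistake roughly every ten characters, which is why it looks so bad when you read the raw output.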

Free OCR server. by rynolangner · 2010-07-02 12:08 · Score: 1

This Google OCR thing just gives you the text. What about the formatting? Why not use something like WatchOCR from http://www.watchocr.com./ It creates text searchable pdfs from image only pdfs and it's all free and open source. You just drop them into a watched folder and the server spits them out as text searchable. It runs as a LiveCD so you don't even have to install anything to try it.
