Google Docs' OCR Quality Tested

/b/ by stonewallred · 2011-04-28 10:13 · Score: 1, Interesting

Since the standard practice on 4chan is to use the word niggers for any word in a recaptcha that has a punctuation mark, I question just how good the OCR is.

Re:/b/ by stonewallred · 2011-04-28 11:16 · Score: 0, Troll

Lmao, flamebait? So speaking the truth is flamebait?
Nice to know that an easily verifiable fact is modded in such a fashion.
Re:/b/ by Anonymous Coward · 2011-04-28 12:31 · Score: 1

You said nigger, you referenced to 4chan and b/. Most mod don't take time to understand what it been told, they only lookup the keyword. Michael Kristopeit is right, slashdot is stagnated... LOL i am better post that as A. Coward.
Re:/b/ by Super+Dave+Osbourne · 2011-04-28 12:52 · Score: 1, Insightful

Slashdot has become formulaic and boring, and did so quite a long time ago. This is verifiable, and not meant as flamebait. If the mods would stop acting like scripts without some AI built in for content, /. would once again be a viable, worthwhile place to contribute on a regular basis, rather than a place for drive-by train-wreck contributions.
Re:/b/ by stonewallred · 2011-04-28 13:03 · Score: 1

I usually keep 5 or 15 points for modding.
And I mod according to what they say the guidelines are.
But then again, I cruise here on raw and uncut simply because I want to see it all, and many good posts are hidden if you use filtering.
Guess I need to become a karma whore.
Re:/b/ by Snaller · 2011-04-28 13:56 · Score: 1

Trump? Is that you?

--
If Google really cared they would fix Android Chrome to reflow text, instead of discriminating
Re:/b/ by zill · 2011-04-28 14:04 · Score: 1

You realize that recaptcha knows exactly which site the captchas come from, right? It would only take a single line of code to filter out all the noise from 4chan.
Re:/b/ by Anonymous Coward · 2011-04-28 19:16 · Score: 0

Your argument is invalid... /b/tards use the same trick on almost every site that throws them a captcha
Re:/b/ by Anonymous Coward · 2011-04-29 00:39 · Score: 0

read teh rulez o' da intarwez!

Google's OCR by machinelou · 2011-04-28 10:19 · Score: 1

I've played around with Google's OCR framework (tesseract) and it is far from perfect. So, this isn't really a surprise.

Re:Google's OCR by icebike · 2011-04-28 10:27 · Score: 1

It's also far from new. Didn't they get that from some long dead Open Source project?

--
Sig Battery depleted. Reverting to safe mode.
Re:Google's OCR by owlstead · 2011-04-28 10:34 · Score: 1

Yeah, I was looking for an android OCR library, and that one was the only one that came up. Although there are a few other Linux options, none of those seemed to be right on the money either. This article is strengthening the already published reports on open source OCR software: basically, it's not performing all that well. I wish it was.
Re:Google's OCR by camperslo · 2011-04-28 10:48 · Score: 2

I guess it'll be a little while before we'll see an app I'd wondered about. I thought it would be useful to be able to take snapshots of things like news reports (streamed on the web, El Gato Eye-TV domestic or satellite t.v., YouTube etc.) and do OCR on them, AND get an English translation of it. With the events so far this year, support for Japanese and Arabic languages would have been a good start.
Re:Google's OCR by somersault · 2011-04-28 11:01 · Score: 1

Definitely, weirdly I was wondering this afternoon if Goggles can already do OCR and translation on full pages of text.. I have a French book that I'd love to read, but I have basically no French!

--
which is totally what she said
Re:Google's OCR by owlstead · 2011-04-28 11:17 · Score: 1

With the top-notch translators that are around today, you may be able to get the gist of the book. But the chance that the translation of the book will be a joy to read is about zero, zip, nada, nothing. You'd better buy a good translation or, if that's not available, try and learn French (with the book itself as source material maybe).
Re:Google's OCR by retchdog · 2011-04-28 11:52 · Score: 1

there was a paper about combining a (crappy) machine translation with low-skilled workers, who natively understand the target language, to patch up the glaring flaws. the idea is that _most_ of the errors made by the machine don't require understanding of the source language to detect. of course you lose out on anything 'deep' or artistic in the source language, and i would be hesitant to trust it for scientific papers or legal documents, but it's an interesting idea.

--
"They were pure niggers." – Noam Chomsky
Re:Google's OCR by somersault · 2011-04-28 13:06 · Score: 1

There are no translations available or I'd buy them. It's a book about Parkour by David Belle.. I'm just interested in basic history and his opinions rather than flowery language or whatever. If there is much discussion of technique it might be really hard to understand though - I auto-translated a French tutorial on rolling before, and it would just read as gibberish to someone who didn't already have a good idea of the technique.

--
which is totally what she said
Re:Google's OCR by RobertM1968 · 2011-04-28 16:26 · Score: 1

> I've played around with Google's OCR framework (tesseract) and it is far from perfect. So, this isn't really a surprise.
>
> It's also far from new. Didn't they get that from some long dead Open Source project?
Answered in the order you mentioned each:
Yes, far from new (project started 26 years ago).
No, not long dead. Just "recently" (roughly 6 years ago, give or take) open sourced and ported/compiled for Linux, OS/2 (and other platforms I am sure).
Yes (open source project), and I think it was called... Tesseract. Kinda like the poster you responded to mentioned. ;-)
To save you the work, it was an HP/UNLV project, started in 1985, that was open-sourced in 2005. It is still available on SourceForge.

--
StarTrekPhase2 - The Five Year Mission Continues!
Re:Google's OCR by ozmanjusri · 2011-04-28 17:51 · Score: 1

> This article is strengthening the already published reports on open source OCR software: basically, it's not performing all that well. I wish it was.
Now that it's getting some exposure, I'd say it'll be performing a lot better soon.
Nothing like being in the public eye for attracting clever people's attention.

--
"I've got more toys than Teruhisa Kitahara."
Re:Google's OCR by ggeens · 2011-04-28 20:03 · Score: 1

> there was a paper about combining a (crappy) machine translation with low-skilled workers, who natively understand the target language, to patch up the glaring flaws.
I'm working on a project where the translations are handled like that. We send all texts to an external company, and a few hours later, they send back the translation. This seems to work relatively well.
The next phase involves immediate translation without human intervention. I'm curious as to how that will work out.

--
WWTTD?
Re:Google's OCR by ozmanjusri · 2011-04-28 20:06 · Score: 1

> there was a paper about combining a (crappy) machine translation with low-skilled workers
Even better, Distributed Proofreaders is Project Gutenberg's version of just that. They've probably passed 20,000 books OCR'd and proofed by now.

--
"I've got more toys than Teruhisa Kitahara."
Re:Google's OCR by lxs · 2011-04-28 20:23 · Score: 1

Or you could learn French if much of the literature in your field of interest is in that language. It isn't that hard if you're not interested in fluency. You'll also have gained a valuable skill.
Hey, it's more useful than either Elvish or Klingon.
Re:Google's OCR by Anonymous Coward · 2011-04-29 02:37 · Score: 0

Can you link to the book, maybe if it's on Amazon.fr or another site? I know high-school level French, so could likely get most of it. :)
Re:Google's OCR by somersault · 2011-04-29 05:18 · Score: 1

Here you go
It appears that there is a Facebook group where people are putting up translations of small parts of it now.

--
which is totally what she said

Better to scan to PDF by icebike · 2011-04-28 10:34 · Score: 3, Interesting

There are a number of scanner apps in the market that do a much better job in the first step of this process, which is taking the picture. They then concentrate their efforts on producing a clean usable PDF of the document. I tested one of these and found that the PDF rendered by it was much better than the PDF produced by Google.
Everything is crisp and readable.

If the first step fails, it's no wonder the second step, the OCR, fails.

--
Sig Battery depleted. Reverting to safe mode.

Re:Better to scan to PDF by X0563511 · 2011-04-28 10:51 · Score: 1

And how do you plan on searching, indexing, or otherwise having a computer operate on the contents of that document?

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:Better to scan to PDF by icebike · 2011-04-28 11:01 · Score: 2

Just how many of such documents do you expect to have to index taken with a cellphone? Seriously, this is a toy. Don't go all corporate archives on me here.

--
Sig Battery depleted. Reverting to safe mode.
Re:Better to scan to PDF by sortius_nod · 2011-04-28 11:18 · Score: 1

Even then, I have yet to work for a company that has a searchable PDF archive. Even when I worked for Fairfax (media company here in AU that publishes national & local newspapers), the PDF archive that came straight out of the publishing app wasn't searchable. Hell, it only had 3 months of the paper on servers, the rest were on archive DVDs.
The whole idea of searchable PDFs died a long time ago; this is why businesses use purpose-built products.
Also, the OP stated that it was the original PDF that was generated better, the next step is to run OCR on the PDF. I have no idea what GP was on about, seems like they just wanted to post on this topic.
Re:Better to scan to PDF by X0563511 · 2011-04-28 11:37 · Score: 1

> Just how many of such documents do you expect to have to index taken with a cellphone? Seriously, this is a toy. Don't go all corporate archives on me here.
Well, that's the whole point to OCR. If you're just scanning, then you're just scanning. OCR'ing lets you do all kinds of text processing, analysis, format shifting etc. A scan is... just a picture of a document. Makes me think of microfiche.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:Better to scan to PDF by Anonymous Coward · 2011-04-28 12:02 · Score: 0

You missed his point (although he muddied it a lot by talking about PDFs specifically.) A low-res picture is hard to OCR. The computer doesn't have as much detail to draw conclusions with. He will of course OCR the high-res scan in order to make it indexable. But it will have a much better chance of actually recognizing text rather than reading "01~`........!!4g"
Re:Better to scan to PDF by icebike · 2011-04-28 12:10 · Score: 2

True, but again, this is a cell phone app. You don't expect document management system level capabilities, especially not in release 1.0.
If you want that level of quality you bring something more than a cell phone to the task. Maybe a flatbed or something.
My point here is this: I've had much better luck going direct to PDF on the phone than via Google Docs.
Try this test if you have a Google Docs account (even a free one):
Upload some PDF, even one created using something on your phone like CamScanner.
Then, once you have a document in Google Docs, select it and from the menu choose Make a Google Docs Copy. It will OCR it for you.
Now if you uploaded a quality PDF (say something scanned to PDF directly from your scanner) the OCR will be close to flawless.
But even those shot with the camera and cleaned up by CamScanner will be better than the ones created directly in Google Docs on Android, probably for some of the reasons mentioned in TFA.

--
Sig Battery depleted. Reverting to safe mode.
Re:Better to scan to PDF by Anonymous Coward · 2011-04-28 12:35 · Score: 0

> The whole idea of searchable PDFs died a long time ago; this is why businesses use purpose-built products.
I think it's hilarious that a business in the business of words can't search their own content. Lemme guess, the PDFs that came out of your publishing app were essentially TIFF images like you'd get from a fax receiver?
Searchable PDFs work just fine, thank you very much. Our mining business can mine our PDFs even easier than the resources in the ground.
Re:Better to scan to PDF by X0563511 · 2011-04-28 14:30 · Score: 0

I don't touch Google anything, except for email. I'd much rather use -real- solutions, with my nice flatbed etc :)
It is odd that your phone does that better than Google...

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:Better to scan to PDF by afidel · 2011-04-28 14:55 · Score: 1

We OCR everything that's scanned into our document management system, search would be basically impossible without it since relying on users to accurately enter metadata is suicidal if you want useful data.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:Better to scan to PDF by pruss · 2011-04-28 16:22 · Score: 1

Searchable pdfs are not dead. For instance, jstor.org's large repository of scholarly journals is searchable pdfs. jstor is very heavily used in my field. Not perfect, but pretty good.
Re:Better to scan to PDF by AmiMoJo · 2011-05-03 21:31 · Score: 1

This is a Google product. They like to release early and do public betas lasting years, so expect rapid improvements.
There seems little point in reviewing a new Google product until it has matured somewhat because the first version is always half done sort-of-works quality code. The first version of Android typed everything entered into the phone into a hidden root shell for crying out loud. About the only area they seem to hold off in is the front page of their search engine.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC

No good free solutions by CajunArson · 2011-04-28 10:34 · Score: 2

The end of the article is pretty telling. Basically any professional OCR software from the mid 1990's and normal consumer grade commercial software from today is lightyears ahead of open source solutions. Which is kind of sad, but the problem is that there really isn't a huge market for OCR in the way that there is for web browsers and other more successful projects, coupled with the inherent difficulty in doing good OCR.

--
AntiFA: An abbreviation for Anti First Amendment.

Re:No good free solutions by Anonymous Coward · 2011-04-28 11:04 · Score: 0

We use multiple free software OCR solutions as part of our spam filtering... works good for us.

Nexus S has no flash? by versificator · 2011-04-28 10:35 · Score: 1

according to the article, it doesn't have a flash. which is completely incorrect. I thought maybe the Docs application doesn't use the flash when taking pictures, but again...this is incorrect.

Re:Nexus S has no flash? by icebike · 2011-04-28 10:46 · Score: 1

Google DOCs will use the flash or not, based on user settings, so, yeah, he just missed that.
But in my tests with Nexus One (not Nexus S), using the flash at the range needed to see the picture just puts a white blob in the center of the shot and is actually worse than using bright room lights.

--
Sig Battery depleted. Reverting to safe mode.
Re:Nexus S has no flash? by Idbar · 2011-04-28 10:47 · Score: 2

What article? The link seems to be pointing to a 403 Error page. At least to me.
Re:Nexus S has no flash? by ThatsMyNick · 2011-04-28 11:38 · Score: 1

Google Cache of TFA
Re:Nexus S has no flash? by N+Monkey · 2011-04-28 18:41 · Score: 1

> What article? The link seems to be pointing to a 403 Error page. At least to me.
Maybe it was just the OCR'ed output of a scan of "Loser roar"
( Ok, I couldn't come up with anything better)

OCR-B character recognition (question) by owlstead · 2011-04-28 10:42 · Score: 1

I'm in the market for a good way of recognizing OCR-B based characters on an android device (mostly uppercase characters and digits). I know the location (on a flat 2D plane in a 3D space) of the characters, but they do not form sentences or even words. Does anyone have a good algorithm to do this kind of low-level character recognition? A library would be even better of course, especially if it is open source. I'm personally thinking of comparing bitmaps or vectors.

As a hint to other devs, many commercial barcode packages contain OCR character recognition, which could be used for purposes where you can specify the conditions (fonts, lighting conditions etc).
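
The bitmap-comparison idea floated above can be sketched roughly as follows. This is only a minimal illustration, assuming the glyphs have already been located, cropped and binarised to a fixed grid; the 3x5 templates are made-up placeholders, not real OCR-B shapes, and `classify` is a hypothetical helper name.

```python
# Hypothetical 3x5 binary templates; real OCR-B glyphs would need
# higher-resolution bitmaps rendered from the actual font.
TEMPLATES = {
    "0": ("111", "101", "101", "101", "111"),
    "1": ("010", "110", "010", "010", "111"),
    "7": ("111", "001", "010", "010", "010"),
}


def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(glyph):
    """Return the template label with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda label: hamming(glyph, TEMPLATES[label]))


if __name__ == "__main__":
    # A "0" with one noisy pixel set in the middle.
    noisy_zero = ("111", "101", "111", "101", "111")
    print(classify(noisy_zero))
```

Because the font and character set are fixed, a nearest-template match like this may be enough once perspective and lighting are normalised; it will not cope with rotation or scaling on its own.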

Re:OCR-B character recognition (question) by coredog64 · 2011-04-28 11:31 · Score: 1

What about OpenCV?
http://blog.damiles.com/?p=292
Re:OCR-B character recognition (question) by owlstead · 2011-04-28 11:58 · Score: 1

Looks promising, many thanks! License plates are not that far off from the intended purpose.

Um... by Shadow+Wrought · 2011-04-28 10:46 · Score: 4, Insightful

He uploaded the 120 dpi image instead of the 300 dpi image and is surprised the OCR sucks. Really? Lossy isn't the concern when you're OCR'ing black text on a white background. Seriously. Think about what the image is actually going to be used for, then make your decision.

And, seriously, how effective of OCR'ing are you really imagining you're going to get off of a camera phone pic, anyway?

--
If brevity is the soul of wit, then how does one explain Twitter?

Re:Um... by ortholattice · 2011-04-28 11:25 · Score: 2

It seems TFA is giving 403 errors, but Google's 300 DPI PDFs that you can download for public domain books often have incredibly poor quality, much poorer than you get with 300 DPI on a cheap home scanner. While they might be marginally acceptable for novels, for the old math books I'm interested in, the Google PDFs are mostly useless. Often you can't disambiguate small blurry subscripts by eye, never mind OCR. On the other hand, I have never had a problem reading 300 DPI subscripts on scans I make at home, and they usually will OCR fine too, unless they are tiny subscripts of subscripts. I wrote about this here.
Re:Um... by sootman · 2011-04-29 03:26 · Score: 1

> And, seriously, how effective of OCR'ing are you really imagining
> you're going to get off of a camera phone pic, anyway?
Camera phones are getting quite good. An iPhone 4 takes 5MP images and there are many others out now that are as good or better.
Specifically, the images are 2592x1936 pixels which equates to 225 dpi at 8.5" x 11". That's plenty to OCR a typical page--say, 8.5x11 with clean 12-point type. I've carefully taken photos of documents with my phone and printed them and they're indistinguishable from a photocopy.
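
For anyone who wants to check the dpi figure, here is a small sketch of the arithmetic. It assumes the page exactly fills the frame, which a handheld shot rarely does, so treat the result as an upper bound; `effective_dpi` is just an illustrative name.

```python
def effective_dpi(pixels_long, pixels_short, page_long_in, page_short_in):
    """Return the limiting (smaller) dots-per-inch when the page fills the frame."""
    dpi_long = pixels_long / page_long_in
    dpi_short = pixels_short / page_short_in
    return min(dpi_long, dpi_short)


if __name__ == "__main__":
    # 5 MP sensor (2592x1936) photographing a US Letter page (11" x 8.5").
    dpi = effective_dpi(2592, 1936, 11.0, 8.5)
    print(f"effective resolution: {dpi:.0f} dpi")
```

That works out to roughly 228 dpi on the short edge and 236 dpi on the long edge, so the 225 dpi quoted above is about right.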

--
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.

What a dumb fuck by binford2k · 2011-04-28 10:51 · Score: 1

I suppose this retard thinks he's clever.

Bad Kitty!

Verily, you may not link directly to images. Link to their containing web page instead.

You tried to access: /blog/

From: http://hurvitz.org/

I have spoken!

CAPTCHA Breakers by MoonBuggy · 2011-04-28 10:55 · Score: 3, Interesting

If the increasing absurdity of the CAPTCHAs I tend to see is anything to go by, there are programs out there that'll read normal printed text from even the crappiest photo without missing a beat. The question is, are the spammers using standard commercial solutions, or have they got some useful tech of their own that we might be able to get our hands on (seize it as part of a settlement and make it public domain, for instance).

Re:CAPTCHA Breakers by jewelises · 2011-04-28 11:04 · Score: 3, Insightful

I don't think that spammers have any amazing tech, they just have different requirements. They can still send spam with a 1% success rate whereas with OCR you'd want a 99% success rate.
Re:CAPTCHA Breakers by Hal_Porter · 2011-04-28 14:50 · Score: 1

Don't tell him this. It's funnier to let him keep PH3AR1NG TEH 3L33T HAXORZ.

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;

They Why by RileyCR · 2011-04-28 10:58 · Score: 3, Informative

Google took the Tesseract OCR engine, one of the first engines, and wrapped document analysis and some high level improvements on it. In the current OCR market landscape there are only 4 commercial engines, and two that make up 98% of the market. Compared to those two OCROpus is not even close because of the legacy engine. So the real reason is it's old technology, very old. Unless Google licenses ABBYY or Nuance they will not get any better. The reality is OCR takes 50 man-years to develop to compete with these top two engines, and it's just not practical for even Google to go out and start from scratch.

Re:They Why by camperslo · 2011-04-28 11:05 · Score: 1

Does that mean it couldn't be a viable candidate for some Summer of Code work then?
Re:They Why by Anonymous Coward · 2011-04-28 11:19 · Score: 0

You're saying Google couldn't have 50 persons working for a year on a critical component of their Google Books efforts that also has relevance to web search and Google Docs?
I can see why it wouldn't be cost effective compared to licensing an existing engine, but it's hardly infeasible.
Re:They Why by Super+Dave+Osbourne · 2011-04-28 12:57 · Score: 1

Until the day you can hold up a document in front of your iBhone camera and have it snap and convert that document with 99%+ accuracy and have spell and grammatical checking solve the other 1% accurately to 99% also, meaning 99.99% conversion is done properly in any language, the technology won't be tolerated by end users. That will take more as you say than Tesseract, as you so well pointed out. Google should stop whoring themselves as OpenSource focused and just do the right thing by purchasing outright and pushing to the open market the tech that exists. Then others will come in and make the move to do better, and the model of improved software continues.
Re:They Why by afidel · 2011-04-28 15:57 · Score: 2

Hmm, of the four engines we use you mentioned two. Abbyy has by far the worst recognition rate (but is most flexible for scan setup so we use it for arbitrary documents rather than the forms based stuff going into our document management system). We also use Nuance through Adlib. The other two we use are Kofax AIP, and DokuStar.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:They Why by __aancvu2993 · 2011-04-28 19:59 · Score: 1

You haven't done the math. 1% is not nearly enough. This reply is 300 chars long and getting 3 wrong is annoying enough. Error rate should go down to .000001% for OCR to be a commodity, and that's with good 600dpi originals. Factor in crappy scans, poor resolution/contrast and you are in for a pretty tough ride.
Re:They Why by Vegemeister · 2011-04-28 22:47 · Score: 1

Just put the OCR'd text in a side channel with the image, as PDF does. Then you get a searchable, copyable document, and preserve the original formatting and avoid the need for extremely low error rate.
Re:They Why by spinkham · 2011-04-29 03:27 · Score: 1

Sure, and with 9 women you can make a baby in a month.
10 experts and 5 years would be more feasible. 5 experts and 10 years even more so.
Scaling is hard.
See also http://en.wikipedia.org/wiki/The_Mythical_Man-Month

--
Blessed are the pessimists, for they have made backups.
Re:They Why by tompaulco · 2011-04-29 03:55 · Score: 1

My shop uses Nuance through two different products, and we are looking into directly interfacing with Abbyy. The results we have seen from Abbyy have been much better than what we have seen through Nuance. I guess mileage varies.

--
If you are not allowed to question your government then the government has answered your question.

403 by Nick+Ives · 2011-04-28 11:00 · Score: 1

Did anyone else mirror this? I'm just getting a 403.

--
Nick

Re:403 by master_kaos · 2011-04-28 11:03 · Score: 1

same...
Re:403 by Anonymous Coward · 2011-04-28 11:10 · Score: 0

Same here, seems like it doesn't even try to connect to the page.

Way to go by rudy_wayne · 2011-04-28 11:04 · Score: 0

Forbidden

You don't have permission to access /blog/2011/04/ocr-quality-of-google-docs on this server.
Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

Nice link, asshole.

Oh WTG by Anonymous Coward · 2011-04-28 11:10 · Score: 1

Self-promote to /. and host on a box that can't handle the limited traffic of a 25-comment popularity story?

GOOD WORK SON

Due to an "intentional decision" by Anonymous Coward · 2011-04-28 11:26 · Score: 0

Much better than an accidental decision, I guess.

99% success rate is crappy ... by perpenso · 2011-04-28 11:32 · Score: 3, Insightful

> I don't think that spammers have any amazing tech, they just have different requirements. They can still send spam with a 1% success rate whereas with OCR you'd want a 99% success rate.

I once worked on an OCR project. The client specified a 99% success rate and we strained to restrain our grins. 99% is about one error every one or two lines of text. We got 99.6% in our first implementation before we even began to work on accuracy. Admittedly we had excellent image quality. This was a custom solution that had its own optics.
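
A quick sketch of why 99% per character is not impressive. The 80 characters per line is an assumed figure for illustration, and both function names are made up here.

```python
def chars_between_errors(accuracy):
    """Average number of characters read per misrecognised character."""
    return 1.0 / (1.0 - accuracy)


def errors_per_line(accuracy, chars_per_line=80):
    """Expected misreads per line; 80 chars per line is an assumption."""
    return (1.0 - accuracy) * chars_per_line


if __name__ == "__main__":
    for acc in (0.99, 0.996):
        print(
            f"{acc:.1%}: one error every {chars_between_errors(acc):.0f} chars, "
            f"{errors_per_line(acc):.2f} errors per 80-char line"
        )
```

At 99% that is one error per 100 characters, which is indeed an error every line or two; 99.6% stretches that to one per 250 characters.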

Re:99% success rate is crappy ... by martin-boundary · 2011-04-28 13:08 · Score: 3, Interesting

Heh, it's always fun to reinterpret requirements to make them easier to implement :)
A 99% success rate could also mean 99 pages with zero errors out of a 100 pages attempted. With 250 words per page that would represent a mandated success rate of 99.995%
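
The page-level reading can be worked out like this, as a rough sketch that assumes word errors are independent of each other (real OCR errors tend to cluster, so this is only an approximation). `per_unit_accuracy` is an illustrative name.

```python
def per_unit_accuracy(page_accuracy, units_per_page):
    """Accuracy needed per word (or character) so that a whole page is
    error-free with probability page_accuracy, assuming independent errors."""
    return page_accuracy ** (1.0 / units_per_page)


if __name__ == "__main__":
    needed = per_unit_accuracy(0.99, 250)
    print(f"required per-word accuracy: {needed:.4%}")
```

With 250 words per page this comes to a little under 99.996% per word, in line with the figure above; counted per character instead of per word, the bar is higher still.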
Re:99% success rate is crappy ... by thegarbz · 2011-04-28 14:02 · Score: 1

QUICK A LAWYER, LET'S GET HIM!
As an aside. Stupid slashdot filter is telling me using caps is like yelling. Well I AM yelling.
Re:99% success rate is crappy ... by perpenso · 2011-04-28 14:50 · Score: 1

> Heh, it's always fun to reinterpret requirements to make them easier to implement :)
> A 99% success rate could also mean 99 pages with zero errors out of a 100 pages attempted. With 250 words per page that would represent a mandated success rate of 99.995%
Thankfully the client specified 99% with respect to character recognition not correct pages. If they were specifying pages we would have been straining to suppress pissing our pants rather than suppressing grins. :-)
Re:99% success rate is crappy ... by tompaulco · 2011-04-29 02:49 · Score: 1

> Heh, it's always fun to reinterpret requirements to make them easier to implement
That's what our customers do to us. We promised 95% accuracy rate on OCR per CHARACTER, but they generate their numbers off of how many fields of data had a wrong character in them.
Of course, we also specified that based on clean images scanned at 300 DPI, and they give us crap images scanned at 200 DPI with fold lines, highlighter and pen scribble, and apparently their mailing machine sprays some kind of serial number on every single page that runs right over what we need to read.

--
If you are not allowed to question your government then the government has answered your question.
Re:99% success rate is crappy ... by SpinningCone · 2011-04-29 02:55 · Score: 1

obligatory XKCD (alt text is relevant)
Re:99% success rate is crappy ... by AmiMoJo · 2011-05-03 21:40 · Score: 1

Google's approach to accuracy appears to be somewhat novel. Most OCR software uses spelling correction and grammar rules to improve accuracy but Google use data derived from the contents of pages they index. They use it for translation too which, when it works, gives their output a more natural quality compared to previous efforts. I find that Chinese to English works particularly well.
Doubling OCR accuracy is exponentially harder. Unlike a human that can easily pick up on what type of document it is (letter, technical manual, novel, newspaper article) and make informed mental corrections based on its expectation of the language used machines have to come at documents more or less blind. Even document structure can be hard to figure out, e.g. the way a story flows over multiple columns in a newspaper.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re:99% success rate is crappy ... by perpenso · 2011-05-03 22:30 · Score: 1

> ... Google use data derived from the contents of pages they index ...
Interesting. I guess that adapts for common usage deviating from proper spelling and grammar.

> ... pick up on what type of document it is (letter, technical manual, novel, newspaper article) and make informed mental corrections ...
Machines will do this to a degree, for example favoring lowercase L when the surrounding characters are alphabetic and favoring the digit one when the surrounding characters are numeric. But yeah, context rules; the preceding works well enough in prose but often fails in source code.
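
A toy version of that neighbour rule, purely as a sketch: real engines use much richer language models, and `disambiguate_l_one` is a made-up name for illustration.

```python
def disambiguate_l_one(text):
    """Resolve look-alike 'l' and '1' glyphs from their immediate neighbours."""
    ambiguous = ("l", "1")
    out = list(text)
    for i, ch in enumerate(text):
        if ch not in ambiguous:
            continue
        # Look at the original neighbours, ignoring ones that are ambiguous too.
        neighbours = [text[j] for j in (i - 1, i + 1) if 0 <= j < len(text)]
        letters = sum(c.isalpha() and c not in ambiguous for c in neighbours)
        digits = sum(c.isdigit() and c not in ambiguous for c in neighbours)
        if letters > digits:
            out[i] = "l"
        elif digits > letters:
            out[i] = "1"
        # On a tie, keep whatever the recogniser guessed.
    return "".join(out)


if __name__ == "__main__":
    for sample in ("he1lo", "20l1", "x1 = 5"):
        print(repr(sample), "->", repr(disambiguate_l_one(sample)))
```

The first two samples come out as "hello" and "2011", while the third turns the identifier `x1` into `xl`, which is exactly the source-code failure mentioned above.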

Use in combination with CamScanner by Clifton+Beach · 2011-04-28 11:35 · Score: 1

You can get better results by using CamScanner to capture the image, then upload the JPG to Google Docs. I found that uploading the JPG works better than uploading the PDF.

--
42 hidden comments

More like Masters/PhD Thesis than Summer of Code by perpenso · 2011-04-28 11:39 · Score: 3, Interesting

> Does that mean it couldn't be a viable candidate for some Summer of Code work then?

More like a bunch of masters/PhD theses to get started.

OCR is an area of AI research under the topic of Computer Vision. It is yet another area that seems simple in concept but turns out to be incredibly difficult in practice.

403 Forbidden by Anonymous Coward · 2011-04-28 11:41 · Score: 0

I got the "403 Forbidden" message !

Re:More like Masters/PhD Thesis than Summer of Cod by Lehk228 · 2011-04-28 11:55 · Score: 2

seems to me that OCR would be an area that would be easy to build a framework for genetic algorithms, using a huge collection of solved OCR pages to evaluate. with each generation being tested on a random subset of pages so they do not learn to cheat instead of learn to solve.

only problem is sometimes GA make a solution that makes no sense and should not work but somehow does http://www.damninteresting.com/on-the-origin-of-circuits

--
Snowden and Manning are heroes.

Re:More like Masters/PhD Thesis than Summer of Cod by perpenso · 2011-04-28 12:59 · Score: 1

> seems to me that OCR would be an area that would be easy to build a framework for genetic algorithms, using a huge collection of solved OCR pages to evaluate. with each generation being tested on a random subset of pages so they do not learn to cheat instead of learn to solve.

Sounds like a great thesis project. :-)

Crippleware? by mr100percent · 2011-04-28 13:02 · Score: 1

Google prides itself on having supposedly the best quality apps and features, which is why they take years to leave Beta. Why would they intentionally release a crippled version of their app? That will be the worst thing since Google Books with the missing pages.

Re:Crippleware? by Anonymous Coward · 2011-04-28 23:32 · Score: 0

Google has only one idea - put stuff into their search engine to make money from advertisers. Everything else is padding.
Re:Crippleware? by Anonymous Coward · 2011-04-29 04:35 · Score: 0

That's not the case at all. Google usually do a passable release first. If it's a core product they will improve on it. If it's not, it's just a place holder to let others pick it up. I've never heard Google claiming they have the best quality app. The reason for lots of beta is more likely for when people complain, they can say it's beta.
Re:Crippleware? by Anonymous Coward · 2011-05-01 07:33 · Score: 0

It's not crippleware, just go RTFA and apply some logic if you want to know why.

Slashdotted by Anonymous Coward · 2011-04-28 13:02 · Score: 0

Slashdotted

In case I'm not the only one. by scumfuker · 2011-04-28 14:13 · Score: 1

Wikipedia says:

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

http://en.wikipedia.org/wiki/Optical_character_recognition

Re:More like Masters/PhD Thesis than Summer of Cod by koxkoxkox · 2011-04-28 15:29 · Score: 1

Genetic algorithms are optimisation algorithms, but what do you want to optimise exactly? What are your individuals here?

The idea of using a large collection of solved problems to check and improve the accuracy of the method looks more like a neural network to me. Indeed, this seems to be a common method for OCR. For example: http://www.codeproject.com/KB/dotnet/simple_ocr.aspx

Re:More like Masters/PhD Thesis than Summer of Cod by Tacvek · 2011-04-28 16:48 · Score: 1

While neural networks are a good solution, genetic algorithms can still be used in conjunction with them.

One possible training method for neural networks happens to be genetic algorithms. The genes being the link strengths, and the fitness function being say the percentage of correct results. (If you reach a sufficiently high level, you might want to change to minimizing uncertainty, with a fitness dropping exponentially if the correct percentage drops too low.)
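A minimal sketch of that first approach, under heavy assumptions: the "network" is a single neuron, the genes are its nine link strengths plus a bias, and the fitness is the fraction of correct results on a toy set of 3x3 glyphs ("I" versus "-", each with at most one flipped pixel). The glyphs and all function names are invented for illustration.

```python
import random

I_GLYPH = [0, 1, 0,
           0, 1, 0,
           0, 1, 0]
DASH_GLYPH = [0, 0, 0,
              1, 1, 1,
              0, 0, 0]

def noisy_variants(glyph):
    """The glyph itself plus every copy with exactly one pixel flipped."""
    variants = [list(glyph)]
    for i in range(len(glyph)):
        copy = list(glyph)
        copy[i] = 1 - copy[i]
        variants.append(copy)
    return variants

# Label 1 means "I", label 0 means "-".
SAMPLES = ([(v, 1) for v in noisy_variants(I_GLYPH)] +
           [(v, 0) for v in noisy_variants(DASH_GLYPH)])

def predict(genes, pixels):
    """Single neuron: genes[:9] are the link strengths, genes[9] is the bias."""
    activation = genes[9] + sum(w * x for w, x in zip(genes[:9], pixels))
    return 1 if activation > 0 else 0

def fitness(genes):
    """Fraction of samples classified correctly."""
    hits = sum(predict(genes, pixels) == label for pixels, label in SAMPLES)
    return hits / len(SAMPLES)

def evolve(rng, generations=60, pop_size=30, elite=5):
    population = [[rng.uniform(-1, 1) for _ in range(10)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[:elite]  # elitism: the best genomes pass on unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            mum, dad = rng.sample(population[: pop_size // 2], 2)
            # Uniform crossover, then a small Gaussian mutation on every gene.
            child = [rng.choice(pair) + rng.gauss(0, 0.2) for pair in zip(mum, dad)]
            children.append(child)
        population = survivors + children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = evolve(random.Random(1))
print(history[0], history[-1])
```

Because the elite are copied over untouched, the best fitness can never drop from one generation to the next. Switching the fitness to an uncertainty measure once accuracy is high enough would be a change to `fitness` only.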

Alternatively, genetic algorithms can be used with other neural network training techniques, with the genetic algorithm selecting the number and arrangement of nodes, and with fitness being related to the quality of the network after training.

A hybrid of both the above can also be used. I believe that is the approach critterding (an artificial life simulator) uses for the neural networks representing the creatures' brains, which are evolved much like the rest of each creature's body.

--
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524

I work on this by KingAlanI · 2011-04-28 19:30 · Score: 1

My job entails working with our office's document management system to manually enter metadata.
In part, I essentially end up parsing the data which users entered in various formats.
However, since the original form is entered electronically to begin with, I figure this could be a lot more automated. (The people in my office definitely have a clue; however, fat chance moving this up through the bureaucracy.)

--
I listen to both RIAA and non-RIAA stuff if I like the music, tangential business/politics notwithstanding.

No good solutions anywhere by dbIII · 2011-04-28 20:23 · Score: 1

People expect OCR to be magic so are always disappointed when they first run the stuff. They do not understand that one or two uncertainties per page is a pretty spectacularly good result until you've been able to train the thing with identically laid out documents on the same paper etc for a long time. Feeding in stuff printed on a dot matrix makes the secretaries cry ten minutes after they greeted the arrival of the OCR software with joy. Of course it works a bit better on later pages after tweaking or training - but it all looks like crap to start with.
With specific jobs the stuff can work out of the box with hardly any errors but you have to be lucky.

Re:No good solutions anywhere by tompaulco · 2011-04-29 02:56 · Score: 1

Well, that is where the commercial software has open source beat. They have already trained their OCR on millions of characters. But then, there is no retraining most of them, other than upgrading to the next version when it comes out. Tesseract you can train, but it starts out pretty crappy. Whether Tesseract is of any use to you depends on what your needs are. If you are going to be OCRing something that has a fairly narrow range of image quality and font, then you can train Tesseract to pick it up very specifically and it will probably outperform commercial vendors. If, on the other hand, you need to pick up OCR off of any old crap that someone ran through a scanner, then you will probably immediately see decent results out of the commercial package, and no amount of training in Tesseract will ever improve it much.

--
If you are not allowed to question your government then the government has answered your question.

Obligatory by katsuo11 · 2011-04-28 20:51 · Score: 1

"You're holding it wrong."

I've just tried it... by Simon+Brooke · 2011-04-28 23:18 · Score: 2

I think the quality is tolerable. I photographed a document lying on my desk, without doing anything special to make it smooth or adjust lighting. This is a good simulation of a real-world situation where you can photograph a piece of text. There were errors in the transcription but it was readable, and with a very little editing would have been perfect. What surprised me was that apparently the whole image was uploaded from my phone to Google Docs, and then downloaded again, which is a little bit inefficient; I think that the OCR process runs server side.

I see this as very useful. This afternoon I'm going in to the local planning office to look at some planning applications; I won't be able to take them away, and I doubt I'll be allowed to use a photocopier, but I will have my phone. That's a real world application. I can think of hundreds more.

--
I'm old enough to remember when discussions on Slashdot were well informed.

Re:I've just tried it... by Anonymous Coward · 2011-04-29 03:33 · Score: 0

What surprised me was that apparently the whole image was uploaded from my phone to Google Docs, and then downloaded again, which is a little bit inefficient; I think that the OCR process runs server side.
Yes. This is how most of the Google services work. The voice search on Android uploads the sound it captures to Google, and then downloads the "translation". Remember, Google's goal is to be endpoint/OS agnostic. Easier to port such a system to different platforms.

Re:More like Masters/PhD Thesis than Summer of Cod by allo · 2011-04-29 00:54 · Score: 0

hm, no genetic programming, train a neural net with input / correct output for single letters and then for whole words and see what you can get there.
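For the single-letter step, a toy sketch of what "input / correct output" training looks like. Assumptions: the smallest possible net (one neuron trained with the classic perceptron rule), two invented 3x3 glyphs for "I" and "-", and noise limited to one flipped pixel. Nothing here comes from a real OCR package.

```python
I_GLYPH = [0, 1, 0,
           0, 1, 0,
           0, 1, 0]
DASH_GLYPH = [0, 0, 0,
              1, 1, 1,
              0, 0, 0]

def one_pixel_variants(glyph):
    """The clean glyph plus every copy with exactly one pixel flipped."""
    variants = [list(glyph)]
    for i in range(len(glyph)):
        copy = list(glyph)
        copy[i] = 1 - copy[i]
        variants.append(copy)
    return variants

# Training pairs: (input pixels, correct output), 1 for "I" and 0 for "-".
SAMPLES = ([(v, 1) for v in one_pixel_variants(I_GLYPH)] +
           [(v, 0) for v in one_pixel_variants(DASH_GLYPH)])

def activate(weights, bias, pixels):
    """Output of the single neuron: 1 if the weighted sum is positive, else 0."""
    return 1 if bias + sum(w * x for w, x in zip(weights, pixels)) > 0 else 0

def train_perceptron(samples, max_epochs=100):
    weights = [0.0] * 9
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for pixels, target in samples:
            error = target - activate(weights, bias, pixels)  # +1, 0 or -1
            if error:
                mistakes += 1
                # Perceptron rule: nudge the weights toward the correct output.
                weights = [w + error * x for w, x in zip(weights, pixels)]
                bias += error
        if mistakes == 0:  # a full pass with no errors: training is done
            break
    return weights, bias

def classify(weights, bias, pixels):
    return "I" if activate(weights, bias, pixels) else "-"

weights, bias = train_perceptron(SAMPLES)
print(classify(weights, bias, I_GLYPH), classify(weights, bias, DASH_GLYPH))
```

This toy set is linearly separable, so the perceptron rule is guaranteed to settle. Real letters need more layers and a lot more data, and whole words are another problem again.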

Typical Google by ChrisMaple · 2011-04-29 06:47 · Score: 1

I've tried a couple of the free applications that Google has made available, and they've been really inferior products. It's no surprise that they've put out yet another amateurish effort.

--
Contribute to civilization: ari.aynrand.org/donate

Google Pushing The Edge by Anonymous Coward · 2011-05-01 10:22 · Score: 0

People jump on Google, apparently the iPhone toadies who need to diss the opposition, but Google has been pushing the edge of what good services they can provide for users in return for their consumer behavior. I know that almost anything I type can be tracked by Google, but their Gmail, their Search, their innovation have provided the very essence of the new model of giving something to the ordinary person in return for their American consumer behavior. What have the TV networks ever given you in return for their insipid, insulting, and intellectually degrading commercial breaks? At least Google gives me good, efficient, speedy, reliable, and stable email and file storing service with ads I sometimes look at, and sometimes ignore. I'm sure the OCR will be refined; nobody else has the *&*() to even try such an advancement by providing a new, useful service in return for one's marketing behavior. Nothing's free, and I have never had a problem with Google's info on me. Hope they kick MS and Oracle and iPhone butt...

99 comments