Google Docs' OCR Quality Tested

Posted by timothy on Thursday April 28, 2011 @10:12AM from the weighed-in-the-balance-and-found-wanting dept.

orenh writes "Google has released a Google Docs application for Android, which includes the ability to create documents by OCR-ing photos. I tested the application's OCR quality and found that it's mediocre under the best conditions and poor under real-world conditions. However, I believe that this poor performance is caused in part by an intentional decision by Google."

Better to scan to PDF by icebike · 2011-04-28 10:34 · Score: 3, Interesting

There are a number of scanner apps in the market that do a much better job in the first step of this process, which is taking the picture. They then concentrate their efforts on producing a clean usable PDF of the document. I tested one of these and found that the PDF rendered by it was much better than the PDF produced by Google.
Everything is crisp and readable.
If the first step fails, it's no wonder the second step, the OCR, fails.

--
Sig Battery depleted. Reverting to safe mode.
1. Re:Better to scan to PDF by icebike · 2011-04-28 11:01 · Score: 2
  
  Just how many of such documents do you expect to have to index taken with a cellphone? Seriously, this is a toy. Don't go all corporate archives on me here.
  
  --
  Sig Battery depleted. Reverting to safe mode.
2. Re:Better to scan to PDF by icebike · 2011-04-28 12:10 · Score: 2
  
  True, but again, this is a cell phone app. You don't expect document management system level capabilities, especially not in release 1.0.
  If you want that level of quality you bring something more than a cell phone to the task. Maybe a flatbed or something.
  My point here is this: I've had much better luck going direct to PDF on the phone than via Google Docs.
  Try this test if you have a Google Docs account (even a free one):
  Upload some PDF, even one created using something on your phone like CamScanner.
  Then, once you have a document in Google Docs, select it and from the menu choose Make a Google Docs Copy. It will OCR it for you.
  Now if you uploaded a quality PDF (say, something scanned to PDF directly from your scanner) the OCR will be close to flawless.
  But even those shot with the camera and cleaned up by CamScanner will be better than the ones created directly in Google Docs on Android, probably for some of the reasons mentioned in TFA.
  
  --
  Sig Battery depleted. Reverting to safe mode.
No good free solutions by CajunArson · 2011-04-28 10:34 · Score: 2

The end of the article is pretty telling. Basically, any professional OCR software from the mid-1990s and normal consumer-grade commercial software from today is light-years ahead of open source solutions. Which is kind of sad, but the problem is that there really isn't a huge market for OCR the way there is for web browsers and other more successful projects, and good OCR is inherently difficult to do.

--
AntiFA: An abbreviation for Anti First Amendment.
Um... by Shadow+Wrought · 2011-04-28 10:46 · Score: 4, Insightful

He uploaded the 120 dpi image instead of the 300 dpi image and is surprised the OCR sucks. Really? Lossy isn't the concern when you're OCR'ing black text on a white background. Seriously. Think about what the image is actually going to be used for, then make your decision.

And, seriously, how effective do you really imagine OCR is going to be on a camera phone pic, anyway?

--
If brevity is the soul of wit, then how does one explain Twitter?
1. Re:Um... by ortholattice · 2011-04-28 11:25 · Score: 2
  
  It seems TFA is giving 403 errors, but Google's 300 DPI PDFs that you can download for public domain books often have incredibly poor quality, much poorer than you get with 300 DPI on a cheap home scanner. While they might be marginally acceptable for novels, for the old math books I'm interested in, the Google PDFs are mostly useless. Often you can't disambiguate small blurry subscripts by eye, never mind OCR. On the other hand, I have never had a problem reading 300 DPI subscripts on scans I make at home, and they usually will OCR fine, too, unless they are tiny subscripts of subscripts. I wrote about this here.
Re:Nexus S has no flash? by Idbar · 2011-04-28 10:47 · Score: 2

What article? The link seems to be pointing to a 403 Error page. At least to me.
Re:Google's OCR by camperslo · 2011-04-28 10:48 · Score: 2

I guess it'll be a little while before we see an app I'd wondered about. I thought it would be useful to be able to take snapshots of things like news reports (streamed on the web, El Gato Eye-TV domestic or satellite TV, YouTube, etc.), do OCR on them, AND get an English translation. With the events so far this year, support for Japanese and Arabic would have been a good start.
CAPTCHA Breakers by MoonBuggy · 2011-04-28 10:55 · Score: 3, Interesting

If the increasing absurdity of the CAPTCHAs I tend to see is anything to go by, there are programs out there that'll read normal printed text from even the crappiest photo without missing a beat. The question is, are the spammers using standard commercial solutions, or have they got some useful tech of their own that we might be able to get our hands on (seize it as part of a settlement and make it public domain, for instance).
1. Re:CAPTCHA Breakers by jewelises · 2011-04-28 11:04 · Score: 3, Insightful
  
  I don't think that spammers have any amazing tech, they just have different requirements. They can still send spam with a 1% success rate whereas with OCR you'd want a 99% success rate.
They Why by RileyCR · 2011-04-28 10:58 · Score: 3, Informative

Google took the Tesseract OCR engine, one of the first engines, and wrapped document analysis and some high-level improvements around it. In the current OCR market landscape there are only 4 commercial engines, and two of them make up 98% of the market. Compared to those two, OCRopus is not even close because of the legacy engine. So the real reason is that it's old technology, very old. Unless Google licenses ABBYY or Nuance, they will not get any better. The reality is that an OCR engine takes 50 man-years to develop to compete with these top two engines, and it's just not practical for even Google to go out and start from scratch.
1. Re:They Why by afidel · 2011-04-28 15:57 · Score: 2
  
  Hmm, of the four engines we use, you mentioned two. ABBYY has by far the worst recognition rate (but it is the most flexible for scan setup, so we use it for arbitrary documents rather than the forms-based stuff going into our document management system). We also use Nuance through Adlib. The other two we use are Kofax AIP and DokuStar.
  
  --
  There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
99% success rate is crappy ... by perpenso · 2011-04-28 11:32 · Score: 3, Insightful

I don't think that spammers have any amazing tech, they just have different requirements. They can still send spam with a 1% success rate whereas with OCR you'd want a 99% success rate.
I once worked on an OCR project. The client specified a 99% success rate and we strained to restrain our grins. 99% is about one error every one or two lines of text. We got 99.6% in our first implementation before we even began to work on accuracy. Admittedly we had excellent image quality. This was a custom solution that had its own optics.
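
To put rough numbers on that, here is a minimal sketch. It assumes "success rate" means per-character accuracy and that a printed line holds about 70 characters; both are illustrative assumptions, not figures from the project described above, and the function names are made up for this example.

```python
def chars_per_error(accuracy: float) -> float:
    """Average number of characters between errors at a given
    per-character accuracy (e.g. 0.99 for 99%)."""
    return 1.0 / (1.0 - accuracy)


def lines_per_error(accuracy: float, chars_per_line: int = 70) -> float:
    """Average number of lines between errors.
    chars_per_line=70 is an assumed typical line length."""
    return chars_per_error(accuracy) / chars_per_line


if __name__ == "__main__":
    for acc in (0.99, 0.996):
        print(f"{acc:.1%}: one error per ~{chars_per_error(acc):.0f} characters, "
              f"~{lines_per_error(acc):.1f} lines")
```

Under those assumptions, 99% works out to an error roughly every 100 characters, or about one every line and a half, while 99.6% stretches that to roughly 250 characters.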
1. Re:99% success rate is crappy ... by martin-boundary · 2011-04-28 13:08 · Score: 3, Interesting
  
  Heh, it's always fun to reinterpret requirements to make them easier to implement :)
  A 99% success rate could also mean 99 pages with zero errors out of 100 pages attempted. With 250 words per page that would represent a mandated success rate of 99.995%.
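
A quick check of that arithmetic, as a sketch under two assumed error models (neither is specified in the comment above): either the one failed page contains exactly one wrong word, or word errors are independent and 99% of pages come out clean. The function names are hypothetical.

```python
def word_accuracy_single_error(pages: int = 100,
                               words_per_page: int = 250,
                               bad_pages: int = 1) -> float:
    """Word-level accuracy if each bad page contains exactly one wrong word
    (assumed model: the most lenient reading of a 'failed' page)."""
    total_words = pages * words_per_page
    return 1.0 - bad_pages / total_words


def word_accuracy_independent(page_success: float = 0.99,
                              words_per_page: int = 250) -> float:
    """Per-word accuracy p such that p ** words_per_page == page_success
    (assumed model: word errors are independent of each other)."""
    return page_success ** (1.0 / words_per_page)


if __name__ == "__main__":
    print(f"single-error model: {word_accuracy_single_error():.4%}")
    print(f"independent model:  {word_accuracy_independent():.4%}")
```

Both models land at about 99.996% per word, which is in line with the roughly 99.995% quoted above.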
More like Masters/PhD Thesis than Summer of Code by perpenso · 2011-04-28 11:39 · Score: 3, Interesting

Does that mean it couldn't be a viable candidate for some Summer of Code work then?
More like a bunch of master's/PhD theses to get started.

OCR is an area of AI research under the topic of Computer Vision. It is yet another area that seems simple in concept but turns out to be incredibly difficult in practice.
Re:More like Masters/PhD Thesis than Summer of Cod by Lehk228 · 2011-04-28 11:55 · Score: 2

Seems to me that OCR would be an area where it would be easy to build a framework for genetic algorithms, using a huge collection of solved OCR pages for evaluation, with each generation tested on a random subset of pages so the candidates do not learn to cheat instead of learning to solve.

The only problem is that a GA sometimes produces a solution that makes no sense and should not work but somehow does: http://www.damninteresting.com/on-the-origin-of-circuits
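
A toy sketch of that idea, not a real OCR system: the "genome" here is a single ink/paper brightness threshold, the "solved pages" are synthetic lists of labelled pixels, and every name and parameter is hypothetical. The only point is the evaluation scheme, where each generation is scored on a fresh random subset of pages.

```python
import random


def make_toy_pages(rng, n_pages=20, pixels_per_page=200):
    """Toy stand-in for 'solved OCR pages': each page is a list of
    (brightness, is_ink) pixels. Ink is darker on average than paper."""
    pages = []
    for _ in range(n_pages):
        page = []
        for _ in range(pixels_per_page):
            is_ink = rng.random() < 0.5
            brightness = rng.gauss(0.3 if is_ink else 0.7, 0.1)
            page.append((brightness, is_ink))
        pages.append(page)
    return pages


def score(threshold, page):
    """Fraction of pixels classified correctly when 'darker than
    threshold' is taken to mean ink."""
    hits = sum((b < threshold) == is_ink for b, is_ink in page)
    return hits / len(page)


def evolve(pages, rng, pop_size=30, generations=40, sample_size=5):
    """Evolve one threshold value. Each generation is judged on a fresh
    random subset of pages, so no candidate can tune itself to a fixed set."""
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        batch = rng.sample(pages, sample_size)
        ranked = sorted(population,
                        key=lambda g: sum(score(g, p) for p in batch),
                        reverse=True)
        parents = ranked[: pop_size // 2]  # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            # crossover = average of two parents, mutation = small Gaussian nudge
            children.append((a + b) / 2 + rng.gauss(0, 0.02))
        population = parents + children
    # final pick uses every page, purely for reporting
    return max(population, key=lambda g: sum(score(g, p) for p in pages))


if __name__ == "__main__":
    rng = random.Random(0)
    pages = make_toy_pages(rng)
    best = evolve(pages, rng)
    mean_acc = sum(score(best, p) for p in pages) / len(pages)
    print(f"best threshold: {best:.3f}, mean accuracy: {mean_acc:.3f}")
```

A real attempt would need a far richer genome (segmentation, features, a classifier) and a fitness measure based on character or word error against the known transcriptions, which is where the hard part lives.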

--
Snowden and Manning are heroes.
I've just tried it... by Simon+Brooke · 2011-04-28 23:18 · Score: 2

I think the quality is tolerable. I photographed a document lying on my desk, without doing anything special to make it smooth or adjust lighting. This is a good simulation of a real-world situation where you can photograph a piece of text. There were errors in the transcription but it was readable, and with a very little editing would have been perfect. What surprised me was that apparently the whole image was uploaded from my phone to Google Docs, and then downloaded again, which is a little bit inefficient; I think that the OCR process runs server-side.
I see this as very useful. This afternoon I'm going in to the local planning office to look at some planning applications; I won't be able to take them away, and I doubt I'll be allowed to use a photocopier, but I will have my phone. That's a real-world application. I can think of hundreds more.

--
I'm old enough to remember when discussions on Slashdot were well informed.