Google Docs' OCR Quality Tested
orenh writes "Google has released a Google Docs application for Android, which includes the ability to create documents by OCR-ing photos. I tested the application's OCR quality and found that it's mediocre under the best conditions and poor under real-world conditions. However, I believe that this poor performance is caused in part by an intentional decision by Google."
Since the standard practice on 4chan is to use the word niggers for any word in a recaptcha that has a punctuation mark, I question just how good the OCR is.
I've played around with Google's OCR framework (tesseract) and it is far from perfect. So, this isn't really a surprise.
There are a number of scanner apps in the market that do a much better job in the first step of this process, which is taking the picture. They then concentrate their efforts on producing a clean usable PDF of the document. I tested one of these and found that the PDF rendered by it was much better than the PDF produced by Google.
Everything is crisp and readable.
If the first step fails, it's no wonder the second step, the OCR, fails too.
Sig Battery depleted. Reverting to safe mode.
The end of the article is pretty telling. Basically any professional OCR software from the mid-1990s and normal consumer-grade commercial software from today is light-years ahead of open source solutions. Which is kind of sad, but the problem is that there really isn't a huge market for OCR in the way that there is for web browsers and other more successful projects, and on top of that, doing good OCR is inherently difficult.
AntiFA: An abbreviation for Anti First Amendment.
According to the article, it doesn't have a flash, which is completely incorrect. I thought maybe the Docs application doesn't use the flash when taking pictures, but again... this is incorrect.
I'm in the market for a good way of recognizing OCR-B based characters on an Android device (mostly uppercase characters and digits). I know the location (on a flat 2D plane in a 3D space) of the characters, but they do not form sentences or even words. Does anyone have a good algorithm to do this kind of low-level character recognition? A library would be even better of course, especially if it is open source. I'm personally thinking of comparing bitmaps or vectors.
As a hint to other devs, many commercial barcode packages contain OCR character recognition, which could be used for purposes where you can specify the conditions (fonts, lighting conditions etc).
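For what it's worth, the "comparing bitmaps" idea above can be sketched as nearest-template matching. This is only a minimal illustration, assuming each character cell has already been located, binarized and scaled to a fixed grid. The 5x3 templates below are made up for the example and are not real OCR-B glyphs, and the names `TEMPLATES`, `hamming` and `classify` are hypothetical.

```python
# Hypothetical 5x3 glyph templates; real OCR-B templates would be rendered
# from the font at the resolution of the captured character cells.
TEMPLATES = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "7": ["111", "001", "010", "010", "010"],
    "L": ["100", "100", "100", "100", "111"],
}


def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(cell, templates=TEMPLATES):
    """Return (best_char, distance) for a binarized character cell."""
    return min(
        ((ch, hamming(cell, tpl)) for ch, tpl in templates.items()),
        key=lambda pair: pair[1],
    )


# An "L" with one stray pixel still lands closest to the "L" template.
noisy_l = ["100", "100", "110", "100", "111"]
best_char, distance = classify(noisy_l)
```

With a fixed font and controlled lighting, the distance also gives you a crude confidence value: reject the cell if the best distance is above some threshold you pick from your own test images.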
He uploaded the 120 dpi image instead of the 300 dpi image and is surprised the OCR sucks. Really? Lossy isn't the concern when you're OCR'ing black text on a white background. Seriously. Think about what the image is actually going to be used for, then make your decision.
And, seriously, how effective do you really imagine OCR is going to be on a camera phone pic, anyway?
If brevity is the soul of wit, then how does one explain Twitter?
I suppose this retard thinks he's clever.
Bad Kitty!
Verily, you may not link directly to images. Link to their containing web page instead.
You tried to access: /blog/
From: http://hurvitz.org/
I have spoken!
If the increasing absurdity of the CAPTCHAs I tend to see is anything to go by, there are programs out there that'll read normal printed text from even the crappiest photo without missing a beat. The question is, are the spammers using standard commercial solutions, or have they got some useful tech of their own that we might be able to get our hands on (seize it as part of a settlement and make it public domain, for instance).
Google took the Tesseract OCR engine, one of the first engines, and wrapped document analysis and some high-level improvements around it. In the current OCR market landscape there are only 4 commercial engines, and two that make up 98% of the market. Compared to those two, OCRopus is not even close because of the legacy engine. So the real reason is that it's old technology, very old. Unless Google licenses ABBYY or Nuance, they will not get any better. The reality is that an OCR engine takes 50 man-years to develop to the point where it can compete with these top two engines, and it's just not practical even for Google to go out and start from scratch.
Did anyone else mirror this? I'm just getting a 403.
Nick
Forbidden
You don't have permission to access /blog/2011/04/ocr-quality-of-google-docs on this server.
Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.
Nice link, asshole.
Self-promote to /. and host on a box that can't handle the limited traffic of a 25-comment popularity story?
GOOD WORK SON
Much better than an accidental decision, I guess.
I don't think that spammers have any amazing tech, they just have different requirements. They can still send spam with a 1% success rate whereas with OCR you'd want a 99% success rate.
I once worked on an OCR project. The client specified a 99% success rate and we strained to restrain our grins. 99% is about one error every one or two lines of text. We got 99.6% in our first implementation before we even began to work on accuracy. Admittedly we had excellent image quality. This was a custom solution that had its own optics.
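The "one error every one or two lines" figure is easy to check with a back-of-the-envelope calculation. The sketch below assumes the accuracy is measured per character and that a line of text holds 60 to 80 characters; both are my assumptions, not something stated above.

```python
def lines_per_error(char_accuracy, chars_per_line):
    """Expected number of text lines between recognition errors."""
    errors_per_line = (1.0 - char_accuracy) * chars_per_line
    return 1.0 / errors_per_line


# Compare the client's 99% target with the 99.6% first implementation.
for accuracy in (0.99, 0.996):
    for width in (60, 80):
        print(accuracy, width, round(lines_per_error(accuracy, width), 2))
```

Under those assumptions, 99% works out to an error roughly every 1.25 to 1.67 lines, and 99.6% to an error roughly every 3 to 4 lines.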
You can get better results by using CamScanner to capture the image, then upload the JPG to Google Docs. I found that uploading the JPG works better than uploading the PDF.
Does that mean it couldn't be a viable candidate for some Summer of Code work then?
More like a bunch of master's/PhD theses to get started.
OCR is an area of AI research under the topic of Computer Vision. It is yet another area that seems simple in concept but turns out to be incredibly difficult in practice.
I got the "403 Forbidden" message!
Seems to me that OCR would be an area where it would be easy to build a framework for genetic algorithms, using a huge collection of solved OCR pages for evaluation, with each generation being tested on a random subset of pages so the candidates do not learn to cheat instead of learning to solve.
The only problem is that sometimes a GA produces a solution that makes no sense and should not work, but somehow does: http://www.damninteresting.com/on-the-origin-of-circuits
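A rough sketch of the evaluation half of that idea, scoring a candidate on a fresh random subset of solved pages each generation. Everything here is hypothetical: `recognizer` stands for whatever candidate the GA produces (a callable from image to text), the "images" in the usage example are placeholder strings, and `difflib`'s similarity ratio is used only as a stand-in for a proper character accuracy measure.

```python
import random
from difflib import SequenceMatcher


def char_accuracy(predicted, truth):
    """Similarity in [0, 1] between recognized text and the known-correct text."""
    return SequenceMatcher(None, predicted, truth).ratio()


def fitness(recognizer, solved_pages, sample_size, rng):
    """Score a candidate on a random subset of (image, truth) pairs."""
    sample = rng.sample(solved_pages, sample_size)
    scores = [char_accuracy(recognizer(image), truth) for image, truth in sample]
    return sum(scores) / sample_size


# Usage with placeholder data: a lookup table plays the "perfect" recognizer.
pages = [("img%d" % i, "text of page %d" % i) for i in range(10)]
answers = dict(pages)
rng = random.Random(0)
perfect_score = fitness(lambda image: answers[image], pages, 4, rng)
blank_score = fitness(lambda image: "", pages, 4, rng)
```

Drawing a new sample every generation means a candidate cannot win by memorizing a fixed test set, at the cost of noisier fitness values from one generation to the next.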
Snowden and Manning are heroes.
Seems to me that OCR would be an area where it would be easy to build a framework for genetic algorithms, using a huge collection of solved OCR pages for evaluation, with each generation being tested on a random subset of pages so the candidates do not learn to cheat instead of learning to solve.
Sounds like a great thesis project. :-)
Google prides itself on having supposedly the best quality apps and features, which is why they take years to leave Beta. Why would they intentionally release a crippled version of their app? That will be the worst thing since Google Books with the missing pages.
Slashdotted
Wikipedia says:
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.
http://en.wikipedia.org/wiki/Optical_character_recognition
Genetic algorithms are an optimisation technique, but what exactly do you want to optimise? What are your individuals here?
The idea of using a large collection of solved problems to check and improve the accuracy of the method looks more like a neural network to me. Indeed, this seems to be a common method for OCR. For example: http://www.codeproject.com/KB/dotnet/simple_ocr.aspx
While neural networks are a good solution, genetic algorithms can still be used in conjunction with them.
One possible training method for neural networks happens to be genetic algorithms, the genes being the link strengths and the fitness function being, say, the percentage of correct results. (If you reach a sufficiently high level, you might want to change to minimizing uncertainty, with the fitness dropping exponentially if the correct percentage drops too low.)
Alternatively, genetic algorithms can be used with other neural network training techniques, with the genetic algorithm selecting the number and arrangement of nodes, and with fitness being related to the quality of the network after training.
A hybrid of both of the above can also be used. I believe that is the approach critterding (an artificial life simulator) uses for the neural networks representing the creatures' brains, which are evolved much like the rest of the creatures' bodies.
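To make the first variant concrete (genes as link strengths, fitness as the percentage of correct results), here is a toy sketch. It is nowhere near real OCR: the "network" is a single neuron, the 4-pixel samples are invented for illustration, and the population size, mutation rate and selection scheme are arbitrary choices of mine.

```python
import random

# Toy data: 4-pixel "images" labelled 0 or 1 (hypothetical, for illustration only).
SAMPLES = [
    ((1, 1, 0, 0), 0), ((1, 0, 0, 0), 0), ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1), ((0, 0, 0, 1), 1), ((0, 0, 1, 0), 1),
]
N_GENES = 5  # four link strengths plus one bias


def predict(genes, pixels):
    """Single-neuron network: weighted sum of the pixels, thresholded at zero."""
    total = genes[-1] + sum(w * p for w, p in zip(genes, pixels))
    return 1 if total > 0 else 0


def fitness(genes):
    """Fraction of samples classified correctly."""
    return sum(predict(genes, x) == y for x, y in SAMPLES) / len(SAMPLES)


def evolve(generations=30, pop_size=20, seed=0):
    """Evolve the link strengths; returns (best genes, best fitness per generation)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(N_GENES)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]   # truncation selection
        children = [pop[0][:]]           # elitism: keep the best unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            for i in range(N_GENES):                          # Gaussian mutation
                if rng.random() < 0.2:
                    child[i] += rng.gauss(0, 0.2)
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    return best, history + [fitness(best)]


best_genes, best_per_generation = evolve()
```

Because the best individual is always copied into the next generation, the best fitness never decreases. The second variant (evolving the number and arrangement of nodes) would replace `fitness` with "train this topology by some other method, then measure it".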
Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
My job entails working with our office's document management system to manually enter metadata.
In part, I essentially end up parsing the data which users entered in various formats.
However, since the original form is entered electronically to begin with, I figure this could be a lot more automated. (The people in my office definitely have a clue; however, fat chance moving this up through the bureaucracy.)
I listen to both RIAA and non-RIAA stuff if I like the music, tangential business/politics notwithstanding.
People expect OCR to be magic so are always disappointed when they first run the stuff. They do not understand that one or two uncertainties per page is a pretty spectacularly good result until you've been able to train the thing with identically laid out documents on the same paper etc for a long time. Feeding in stuff printed on a dot matrix makes the secretaries cry ten minutes after they greeted the arrival of the OCR software with joy. Of course it works a bit better on later pages after tweaking or training - but it all looks like crap to start with.
With specific jobs the stuff can work out of the box with hardly any errors but you have to be lucky.
"You're holding it wrong."
I think the quality is tolerable. I photographed a document lying on my desk, without doing anything special to make it smooth or adjust lighting. This is a good simulation of a real-world situation where you can photograph a piece of text. There were errors in the transcription but it was readable, and with a very little editing would have been perfect. What surprised me was that apparently the whole image was uploaded from my phone to Google Docs, and then downloaded again, which is a little bit inefficient; I think that the OCR process runs server side.
I see this as very useful. This afternoon I'm going in to the local planning office to look at some planning applications; I won't be able to take them away, and I doubt I'll be allowed to use a photocopier, but I will have my phone. That's a real world application. I can think of hundreds more.
I'm old enough to remember when discussions on Slashdot were well informed.
Hm, no genetic programming: train a neural net on input and correct output for single letters, then for whole words, and see what you can get there.
I've tried a couple of the free applications that Google has made available, and they've been really inferior products. It's no surprise that they've put out yet another amateurish effort.
Contribute to civilization: ari.aynrand.org/donate
People jump on Google, apparently the iPhone toadies who need to diss the opposition, but Google has been pushing the edge of what good services they can provide for users in return for their consumer behavior. I know that almost anything I type can be tracked by Google, but their Gmail, their Search, and their innovation have provided the very essence of the new model of giving something to the ordinary person in return for their American consumer behavior. What have the TV networks ever given you in return for their insipid, insulting, and intellectually degrading commercial breaks? At least Google gives me good, efficient, speedy, reliable, and stable email and file storage service with ads I sometimes look at and sometimes ignore. I'm sure the OCR will be refined; nobody else has the *&*() to even try such an advancement by providing a new, useful service in return for one's marketing behavior. Nothing's free, and I have never had a problem with Google's info on me. Hope they kick MS and Oracle and iPhone butt...