Google Pushes Open Source OCR
SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"
Now that they will be able to recognize and tag our images, I wonder if Picasa will finally get increased storage. Google will be able to deliver targeted ads based on our pictures.
from the google-has-taken-all-knowledge-to-be-its-provice dept.
Did you mean: province
Use this line to check out ocropus:
svn co http://ocropus.googlecode.com/svn/trunk/ ocropus
The goal of the project is to stop the damn email image spammers.
among other things, sure, but it's got to be a high priority for google.
Oh great. I, for one, do not welcome the increase in message board spamming.
"Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
... for Captchas? If Google is pushing OCR I could see it eventually becoming good enough to parse at least some types of captchas.
The goal of the project is to ... deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis
So, will it work on documents written in crayon? It would be a tragic loss for Dubya's presidential documents to get lost in the sands of time. On the scale of the library of Alexandria. No, seriously.
The theory of relativity doesn't work right in Arkansas.
An OCR system that runs on Linux. I've been waiting for quite some time for something like this.
So will something like this eventually render captchas used as a security/anti-spam measure obsolete?
Not like something wasn't bound to eventually come out to counter that idea, anyway.
I've been hoping that someone with deep pockets (Google, IBM, Sun) would take on this area for a while.
There is a major need for an OSS OCR package, and right now the field is pretty bare. There's GOCR, and a commercial offering called OCRShop, and at least among the ones I've run across, that's about it. Nothing really on par with OmniPage, or other commercial packages for other platforms.
I think there are some really neat applications for OCR that have never really been investigated, because it's so expensive to build that capability into other products. A free OCR engine that really worked could lead to some very neat book-scanning applications, just for starters. I don't think that there are really any integrated packages around for helping people scan books and manuscripts. (Right now you have to photograph the pages, keep them organized, then OCR them and proofread the text against the images. Bit of a nightmare.) I'd love to see a free application for libraries that let a user batch scan (via a digital camera -- let's not get into what I think of SANE and scanners generally) a book, and then provided a nice interface for proofreading the OCRed text against the original image.
Something like that could have a huge social impact. There are a lot of libraries where I'm sure they'd love to scan some of their out-of-copyright assets and provide them to patrons in a digital form, but it's just too technically complicated. An easy-to-use program that let the proofreading be done by nontechnical users (maybe remotely, as long as we're dreaming) could vastly increase the volume of digital materials available.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
And of course, as a side effect they'll probably wind up with a lovely distributed system for solving captcha. ;)
True, but CAPTCHAs always seemed like a bit of an inelegant hack anyway. First, they're horrible from a disabled-access standpoint, and second they're really not all that effective against a concerted enemy when there's a lot of money on the line. Spammers can just pay a few kids in some Third World country to sit there all day and solve CAPTCHAs if they want to.
Since message boards, which are the major users of CAPTCHAs, are practically by design little fiefdoms, I don't think they're nearly as hard to patrol as a common-carrier network like email. The solution to message-board spam is to either institute a moderator-delay (for small blogs and boards), or simply make enough admins with IP-ban powers so that the second someone starts spamming, they get banned and the spam gets deleted. Lameness filters working on the same principles as email spam-filters are probably helpful, too.
Orcopus:
Level: 15
Race: Fell Marine
HP: 290/290
EP: 200/200
Water elemental
Drops: Tentacle
This'll be a much needed boost for us Linux users who want to help out Project Gutenberg.
Okay, so one thing will lead to another and soon Google will be creating technology to recognize non-symbol shapes... How long before I can log in to my G-Accounts by smiling at my computer?
And Zonk has taken all editing to be his...provice.
All you people who are worried about this breaking captchas seem to be missing something--there have been a number of fairly decent OCR packages out there for a long time. The goal of this Google project is to create an open-sourced one that does a good job deciphering HUMAN-READABLE TEXT. Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.
This guy's the limit!
Anyone know of an open source utility that can convert scanned image-based PDF files into searchable PDFs?
(Extra points if it somehow re-generates the actual file so it looks nice instead of pixelated.)
Perhaps this library could be used to build such an application if none exists...
English only I suppose?
When we can make a computer that can tell the difference between a kitten and an adult cat (or hell even another furred mammal) with any kind of accuracy, I think the LEAST of your problems at that point is coming up with captchas. You should be more worried about how you're going to escape from Skynet.
Could they be prosecuted under the DMCA for this?
And will it be able to recognise and LaTeXify handwritten mathematics? The world and its mother can do OCR, but I've yet to see an honest attempt at making writing mathematics papers easier.
May the Maths Be with you!
yeah that's what they're trying to do you fucking idiot
Ok, I got excited too early. Actually, ballot scanning is a specialized task and general purpose OCR probably doesn't play much of a part in that, but if any part of it does apply, then this is still awesome.
Start Running Better Polls
Hopefully, the Linux community will adopt some of this, as some of it can be utilized for accessibility. After perusing some patents from the 1800's, it's clear that Google has made some headway in this department. There were errors in translation (namely K's and R's/P's and B's), but for several documents, things come across as intended.
I think the potential of new Google-backed OCR software is pretty high but I'm not certain that your average library would have the manpower and technical know-how to manage a book-to-ebook conversion, Google OCR software or not.
If libraries are interested in getting their out-of-copyright assets into digital form, they really only need contact someone with Distributed Proofreaders to get the ball rolling. DPers would take care of the scanning, proofing, formatting, and post-processing of the book on behalf of the library requiring nothing but a temporary loan of the book or manuscript (something the libraries already excel at :)
Will I be able to search my comics strips (downloaded since ever) by keyword?
I would love that!
http://www.dieblinkenlights.com
CAPTCHAs are not restricted to images of letters. For example, you could ask people to answer a regular text question (this would also fix accessibility issues).
If you're sick of image spam, you can do what I did. Add the OpenProtect channel to SpamAssassin and then add these lines to your SpamAssassin config:
required_hits 5
score SARE_GIF_ATTACH 5
I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day.
"It ain't a war against drugs. It's a war against personal freedom" --Bill Hicks
Where have you been lately? Picasaweb.google.com has already increased from a mere 250MB to 1GB+ and counting!
make[3]: Entering directory `/home/rick/tesseract-ocr/wordrec'
../cutil/globals.h:46: error: previous declaration of 'int optind' with 'C++' linkage
../ccutil/getopt.h:23: error: conflicts with new declaration with 'C' linkage
../cutil/globals.h:47: error: previous declaration of 'char* optarg' with 'C++' linkage
../ccutil/getopt.h:24: error: conflicts with new declaration with 'C' linkage
if g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../ccstruct -I../ccutil -I../cutil -I../classify -I../image -I../dict -I../viewer -g -O2 -MT tface.o -MD -MP -MF ".deps/tface.Tpo" -c -o tface.o tface.cpp; \
then mv -f ".deps/tface.Tpo" ".deps/tface.Po"; else rm -f ".deps/tface.Tpo"; exit 1; fi
make[3]: *** [tface.o] Error 1
make[3]: Leaving directory `/home/rick/tesseract-ocr/wordrec'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/rick/tesseract-ocr/wordrec'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/rick/tesseract-ocr'
Actually lots of people do book "scanning" with digital cameras. In fact, you can sometimes get much better results off of a book using a digital camera than you can by pressing it down against the bed of a flatbed scanner (because if the page wasn't typeset with a wide gutter, you'll start to distort some of the letters as you get close to the binding). Plus, it's a lot easier on the books, which is important when you're talking about books that are all going to be at least 75 years old, and some much, much older.
The best way to use a flatbed scanner to scan books is actually to run them through a guillotine first, chop off the binding, and then scan the loose pages; this produces good results but it's not something most libraries are going to be willing to do.
Here's a commercial non-destructive book scanner which uses cameras. Basically, what you do, is you have two cameras, each pointing at one side of the book. You use lights held at an angle to the paper with reflectors and diffusers so that it's evenly lit, and then you just flip the pages and fire the cameras once per page turn. You can build a setup to do this (with manual page turning) for a few hundred bucks plus the cost of the cameras. The auto page-turning is really what drives up the cost.
People have been photographing text with cameras for a lot longer than photocopiers have been around. The standard way of reproducing photographs was by using a copy stand and a fixed camera in order to make an internegative, and prior to the introduction of all-digital typesetting, almost all offset printing was done by photographing a paste-up of the final product with a special camera, which produced the plate used in the press.
So in short, although you're correct that just holding a digital camera over a book and clicking the shutter wouldn't give great results, the issues surrounding lighting, lens distortion, and focus are all solved problems. (And if you really wanted to be slick about things like barrel distortion and dust, you could start each run by photographing a standard grey field and a checkerboard, and use that to remove dust and correct for distortion digitally, rather than mechanically/optically.)
Yeah that's similar to what I was thinking about. Actually, what I was recalling was this thing, which seems to pretty clearly use off-the-shelf DSLR cameras (not sure on the lenses though, they're not visible). It probably costs a fortune because of the robotics and vacuum system necessary for the automatic page turning, but I think you could DIY something similar out of two copy stands for a lot less if you were okay with flipping pages.
The one you linked to seems like it would have more distortion of the pages because the cameras aren't being held constantly perpendicular to the page, but maybe it just corrects for that in software afterwards. (It wouldn't be hard, in fact I think all the code you'd need to do it is part of the Panorama Tools / Hugin package.)
What I think is a bigger problem for most libraries isn't the scanning per se, because that at least is a problem that most non-technical people can understand, but it's the storage and document-management that's the issue. Once you have the book scanned, you have a giant pile of JPEGs or TIFF files...unless you're careful about organization, it could become a real mess in a hurry.
So the missing piece, I think, has to do with getting from raw images to an actual ebook. The hardest problem seems to be in the proofreading step; if you run each image through an OCR program, and then you want to proofread it, you need some way of distributing pages out to proofreaders, and letting each of them have a page of text and the image from that page, side by side. And then managing their edits and checking changes back in, etc. It's nothing really novel -- they're all solved problems in other areas (document management, change management, remote access, web services) -- but I've never seen them combined.
If you had a software package that handled all the document management and proofreading (preferably something that your proofreaders could log into remotely and work, while storing everything centrally), then the hardware required is mostly off-the-shelf. It goes from being a $25,000 grant proposal, to some undergrad's thesis/semester project.
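To make the check-out / check-in idea a bit more concrete, here is a toy sketch of the bookkeeping such a package might do. Everything in it (`Page`, `ProofingQueue`, the method names) is hypothetical; a real system would add storage, logins, and a web front end.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    number: int
    image_path: str          # scan the proofreader looks at
    ocr_text: str            # raw OCR output to be corrected
    assigned_to: Optional[str] = None
    proofed_text: Optional[str] = None


class ProofingQueue:
    """Hands pages out to proofreaders and collects their corrections."""

    def __init__(self, pages):
        self.pages = {p.number: p for p in pages}

    def check_out(self, proofreader):
        # Give out the lowest-numbered page that is neither taken nor finished.
        for page in sorted(self.pages.values(), key=lambda p: p.number):
            if page.assigned_to is None and page.proofed_text is None:
                page.assigned_to = proofreader
                return page
        return None

    def check_in(self, number, proofreader, text):
        page = self.pages[number]
        if page.assigned_to != proofreader:
            raise PermissionError("page is not checked out to this proofreader")
        page.proofed_text = text
        page.assigned_to = None

    def assemble(self):
        # Only produce the book text once every page has been proofed.
        ordered = sorted(self.pages.values(), key=lambda p: p.number)
        if any(p.proofed_text is None for p in ordered):
            return None
        return "\n".join(p.proofed_text for p in ordered)
```

Each proofreader would see `image_path` and `ocr_text` side by side, fix the text, and check the page back in.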
it actually has many issues, and it is lagging behind the Windows version that Nuance produces. My company owns several licenses.
it is, however, the best OCR on Linux right now. I'm looking forward to having an alternative.
Here's some other information you might need/want to build this software; note that I am on Ubuntu Feisty.
To build tesseract-ocr you must install autoconf.
If you are smart you will figure out a way to omit ocropus/data-test-pages from your checkout. Do you really want to use their pages to test? No! You want to use your own data.
I built with gcc 4.1.2, YMMV. Some people have reported errors trying to just compile tesseract.
to build ocropus I ended up installing jam, libaspell-dev and libtiff-dev.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
All the OCR packages available through APT on my Ubuntu 6.10 (Edgy) are worthless (< 50% correct characters), after trying them on real scans (usually faxes) that are perfectly clear to my eye:
clara - Free OCR program for Unix Systems
gocr - A command line OCR
ocrad - Optical Character Recognition program
unpaper - post-processing tool for scanned pages
Will this Google OCR really work, and can I install it with APT?
Meanwhile, why is it all Optical Character Recognition, when the accuracy we expect is really Optical Word Recognition? How come spelling, grammar and phrase frequency (including typos etc) aren't used to error-correct at a symbolic level higher than pixels?
--
make install -not war
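The word-level correction asked about above can be sketched in a few lines. This is only an illustration, assuming a plain word list as the dictionary and using Python's standard `difflib` for fuzzy matching; it says nothing about what any of the listed OCR packages actually do, and `correct_word` / `correct_text` are made-up names.

```python
import difflib


def correct_word(word, vocabulary, cutoff=0.7):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    if word.lower() in vocabulary:
        return word
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocabulary):
    """Apply correct_word to every whitespace-separated token."""
    return " ".join(correct_word(w, vocabulary) for w in text.split())
```

A fuller version would also weigh word and phrase frequency, as the post suggests, instead of just picking the nearest spelling.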
Look, any illiterate kid in a third world country can play type-in-the CAPTCHA all day long. I think the solution is to put up an array of 9 or so pictures, and ask the reader to click on the kitten. The other 8 being something other than a kitten, and all the files having random names which rotate with every view. You could also change the item being asked for to defeat simple image recognition, and have several pictures of kittens/what-have-yous.
To defeat *this*, you would need someone with a greater command of the English language than simple recognition of characters, or very advanced image recognition software. I wouldn't worry about the software anytime soon if you chose images carefully.
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
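A rough sketch of the click-on-the-kitten scheme, assuming the site already has a pile of labelled target pictures and decoys. The function names are hypothetical; the random per-view tokens stand in for the rotating file names described above, so the real file names never reach the browser.

```python
import random
import secrets


def make_challenge(target_images, decoy_images, grid_size=9, rng=random):
    """Pick one target picture and grid_size - 1 decoys, each behind a fresh random token."""
    chosen = [rng.choice(target_images)] + rng.sample(decoy_images, grid_size - 1)
    rng.shuffle(chosen)
    # token -> real file; the server keeps this mapping and serves images by token
    tokens = {secrets.token_hex(8): path for path in chosen}
    answer = next(tok for tok, path in tokens.items() if path in target_images)
    return tokens, answer


def check_answer(clicked_token, answer):
    """True if the visitor clicked the token that hides the target picture."""
    return secrets.compare_digest(clicked_token, answer)
```

Changing the item being asked for is then just a matter of passing a different `target_images` list.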
>IBM has (and I think Google too) lobbied for open source exemption.
IBM has donated certain patents for open source utilization.
They have in no way lobbied for universal open source exemption from patent laws.
They have collected 1b+ last year in IP licensing revenues, including software patent licensing revenues.
>What's more, there are many dozens of such simple patents surrounding OCR.
From this statement you logically imply that there are no complex patents surrounding OCR technology. Thus you state that the area of research is overpatented.
It could more reasonably be argued that OCR patents are numerous because it is a universal, difficult problem and many investors have spent significant resources for people to attempt to gain accuracy improvements related to it.
Read some of the patents in the link you provided. You will find numerous non trivial breakthroughs.
It's fascinating that Google has chosen the Apache license for the release of this product. Given that Eben Moglen has explicitly stated that the Apache License is incompatible with GPLv3, what does this mean for mixing this code into other projects?
Even though v3 no longer has the anti-google Affero provisions, Google still chooses Apache instead of GPLv3 or even v2 with a rider to upgrade to v3. You gotta believe the Google lawyers were thinking about this issue before release...
A sig?!? I don't think so.....
"I've been hoping that someone with deep pockets (Google, IBM, Sun) would take on this area for a while."
It sounds like they're just providing support for a few PhD students (i.e. cheap labor) and want the community to perform a lot of the work. What are deep pockets needed for?
I don't often need to do OCR, but I had passable results with Ocrad recently. Like some of the other respondents, I couldn't get much useful output from GOCR.
If your comment title says 'Re: Foo', I'm not likely to read it.
Is that a Chinese mispronunciation? ;)
One awesome application of this: I teach university courses that require term papers. If I could scan and upload the term papers I receive and Google could OCR them and tell me whether they're plagiarized (and of course Google would know; they know all!), I'd be prepared to pay them a bit of money for this. Or, more accurately, my university would be prepared to pay them a decent sum of money on my behalf. Then, they could keep the data from the term papers for the future, to make sure that nobody turns in that same paper in a later semester. Google not only gets money for this, but a whole lot of data to crawl through. Who knows what they would learn if a curious goog starts cleverly mining that data? If they do this, I would really love to work for them and use my 20% "downtime" to code a sentence structure analyzer that could predict a grade based just on syntactic features of the writing. In order to get more data, Google might even offer the OCR + plagiarism detection for free if the instructor agrees to use a Google grading and feedback system, so that Google could correlate each essay with a grade and an explanation of the grade. After tens of thousands of examples, Google might learn how to assign fairly accurate grades on its own (machine agrees with human to almost the same degree that humans agree with each other about what grade is deserved), and after that, who knows, Google might learn how to write B- term papers without any human input!
BTW, I am aware of plagiarism.org and their plagiarism-detection service which works like the thing that I want Google to do. Of course, if Google enters this market, they will crush all competition immediately, and plausibly, they'll do a better job because their database is just bigger. Also, Google could charge less, because a part of the payment will be access to the data itself. In fact, Google is already looking like it will accept information as payment for many of its services! And why not?
Patents last 20 years in the U.S., IIRC.
This OCR is a refined version of HP's Tesseract, which HP handed over to UNLV some time ago. The original code was developed starting in 1985, so there is a good possibility patents are not valid.
"You might wonder why Google is interested in OCR? In a nutshell, we are all about making information available to users, and when this information is in a paper document, OCR is the process by which we can convert the pages of this document into text that can then be used for indexing."
Charles
Learning HOW to think is more important than learning WHAT to think.
I had never heard of Distributed Proofreaders before; that's very cool. Their system seems to be very close to what I was envisioning (allows distributed proofreading via a web interface, automatically assembles books together and puts them in a central repository for access).
Thanks for the link. The next time I'm talking to any of my librarian friends, I'll have to mention it. I didn't see anything on their FAQ though about accepting books from libraries for digitization, just on starting a project yourself (meaning scan the book and submit the scans and OCRed files for proofreading). But the scanning is really the easy part relative to the proofreading, so it still is a big step forward.
It is interesting that web forms have become a measure of AI strength in the world wide web. As soon as Captchas are largely solved, there will be new and improved human tests. I am guessing the next step will be identifying logos, or some sort of symbol. Eventually that problem will be solved too. So what do we do when we can't tell a human from a machine?
Please send me your registered DNA sequence, a voice recording reading this message, and a picture of you in the current location...
I guess a central database of information (identification and secure communication channel) is going to be the only way to ensure you are who you say you are.
Eventually I guess it won't really matter if you are human.
Actually, GOCR works very well (100%) on the image-based text that some sites use to prevent screen scraping.
1. Download and save the image.
2. If it's a gif, convert it to a jpg.
gif2jpg -a tmp.gif
3. Reduce the colors to 2 (black & white).
djpeg -colors 2 -greyscale -dither none tmp.jpg tmp.pnm
4. If there is a border, crop it off.
pnmcut a b c d tmp.pnm > OCR.pnm
(The dimensions a,b,c,d can be determined by any tool that returns useful info about an image, in general remove 1 or 2 pixels from the edges to get rid of borders.)
5. OCR it.
gocr -n 1 OCR.pnm >> OCR.txt
Of course, this is all automated within the screen scraper, I just broke it out here to explain the steps.
For CAPTCHAs, you have to demorph the severely distorted images after step 4, before you OCR it. I'm still working on the demorpher, but it's about 50% accurate now. Basically, it unstretches long strings of pixels to the average of other strings of pixels in the x and y axis. Works even better if you determine the angle of the pixel string and shrink on that, along with some rotation to the nearest x or y axis.
Alternatively, slap a USB or RS232 or RS485 interface to a cheap microcontroller with built-in ADC (usually 10-bit, usually multiplexed to several possible input pins) and suitable analog circuits for sensing the values required, and log to a computer or to a data storage (eg. smartcard).
Yet another alternative, using your approach, is taking the image, finding out the approximate positions of the centers of the LCD segments, finding the image brightness thresholds for segment on/off, and getting on/off values for each segment of the display you want to watch. Then a simple decoding algorithm that turns the list of segments switched on to the displayed value.
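The segment-sampling approach might look something like this. It's a sketch under a few assumptions: a standard seven-segment digit, an image given as rows of brightness values, and segment centers already located by hand. The names `SEGMENTS_TO_DIGIT` and `read_digit` are made up.

```python
# Segment order: a (top), b (upper right), c (lower right), d (bottom),
# e (lower left), f (upper left), g (middle).
SEGMENTS_TO_DIGIT = {
    (1, 1, 1, 1, 1, 1, 0): 0,
    (0, 1, 1, 0, 0, 0, 0): 1,
    (1, 1, 0, 1, 1, 0, 1): 2,
    (1, 1, 1, 1, 0, 0, 1): 3,
    (0, 1, 1, 0, 0, 1, 1): 4,
    (1, 0, 1, 1, 0, 1, 1): 5,
    (1, 0, 1, 1, 1, 1, 1): 6,
    (1, 1, 1, 0, 0, 0, 0): 7,
    (1, 1, 1, 1, 1, 1, 1): 8,
    (1, 1, 1, 1, 0, 1, 1): 9,
}


def read_digit(image, centers, threshold, dark_is_on=True):
    """Decode one digit.

    image: list of rows of brightness values.
    centers: seven (x, y) points, one per segment, in a..g order.
    dark_is_on: True for LCDs (lit segments are dark), False for LED readouts.
    """
    states = []
    for x, y in centers:
        brightness = image[y][x]
        on = brightness < threshold if dark_is_on else brightness > threshold
        states.append(1 if on else 0)
    # Returns None when the pattern is not a digit (blank, minus sign, glare...).
    return SEGMENTS_TO_DIGIT.get(tuple(states))
```

In practice you would probably average a small patch around each center rather than trust a single pixel.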
ABBYY FineReader Sprint 6.0 for Windows, which is available "free" bundled with a lot of different cheap page scanners, is a simple, but extremely fine product: it does 99.5% or better justice to any kind of printed text in about 25 languages, without training needed, even if the source is low-quality, like 72dpi JPEGs.
All in all, it would be very difficult for Google to make a GNU GPL and patent-free general purpose OCR implementation that comes anywhere near the recognition reliability that ABBYY, Recognita and other top-notch commercial software titles achieved during more than a decade of continuous development.
The problem with producing good CAPTCHAs is that it is hard to find a problem that the computer can easily generate and have the answer for, but cannot solve trivially. Our current CAPTCHAs are a good compromise, but I, at least, have no idea how to create text CAPTCHAs with those properties.
Send email from the afterlife! Write your e-will at Dead Man's Switch.
I'll preface this with "this is just my experience"...
I'm involved in a project to capture a library of technical documents to PDF (we've done 40,000 pages so far). The software being used is Acrobat Capture 3.0 on Windows 2000 running on a 3GHz P4. Once the documents have been through Acrobat Capture, we use Acrobat 5 to retouch them (strangely, later versions of Acrobat give you less control and less ability to fix problems in the documents - we actually downgraded from Acrobat 7 back to Acrobat 5).
Our pages are scanned at 600 DPI, 1 bit per pixel, using a Kodak i65 that automatically deskews the pages (a small amount of skew seems to confuse Acrobat Capture to no end, and if there is a graphic on the page you get aliased lines instead of clean, straight lines).
We've found that the error rate goes up when you drop to 300 DPI. Normal fax resolution is 100 x 200 DPI ("fine" is 200 x 200 DPI), so you can expect to have very poor performance at fax resolutions. Basically, Acrobat Capture acts like OCR'ing a fax image is a torture test for the OCR because it seems like there just aren't enough pixels to give the OCR engine enough hints about what it's looking at. I'd be more interested to find out how the Google OCR does with a clean page of Helvetica text.
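To put rough numbers on "not enough pixels": a point is 1/72 inch, so the height in pixels of a line of type is just point size times DPI over 72. The 12-point size below is an assumption for illustration; the post doesn't say what type size the documents use.

```python
def pixels_per_em(point_size, dpi):
    """Nominal height of the type in pixels (1 point = 1/72 inch)."""
    return dpi * point_size / 72


if __name__ == "__main__":
    for dpi in (600, 300, 200, 100):
        print(dpi, "DPI:", round(pixels_per_em(12, dpi), 1), "pixels for 12-point type")
```

So 12-point type is 100 pixels tall at 600 DPI and 50 at 300, but only about 17 along the 100 DPI axis of a normal fax.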
Most of the problems we are now having come from a very old set of documents (early 1940s) that were created using typewriters that apparently weren't very well taken care of.
The person who is doing the work is using some macro software that has let him automate the process of fixing the text in Acrobat to some degree, but it's still slow going (average seems to be 100-200 pages per day).
Putting moderation advice in your
I've built my own ADC's and stuff, but I question their accuracy and it's a *lot* of work to get one built that works nicely. I've built a couple 12-bit ADC's that output parallel to something like the sparkfun usb interface (16 lines of IO) and that's functional. But it's really nice to have a rugged, precalibrated multimeter that, out of the box, already has its voltage, amperage, and such calibrated and ready to go. What I've ended up doing is buying GPIB-equipped multimeters on ebay, and that works. But I often have situations where I'd like to be running five or six -- measuring efficiency on dual or triple-channel switching power supply chips, for instance -- so I don't have enough RS232's without kludging things onto the computer. GPIB works beautifully, as would USB, given their extensibility.
I like your idea of the point decoding of the LCD. I'll have to think about that and see if I can come up with an easy implementation. That'd be a lot simpler with LED readouts, which a couple of my power supplies have. What a great idea! Thanks.
Nostalgia's not what it used to be.
One would assume that OCR is a heavily patented space, and a patent search seems to agree. Caere could make things difficult for the competition.
I've been looking for a good solution to the image spam problem, but this is not it.
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
For more instruments you can deal via USB, USB-to-serial converters, or eg. a pair of Netmos serial port cards, which will add 8 more RS232 ports to your machine. GPIB is IMHO an overpriced monstrosity.
I have to admit a lot of fondness for GPIB since my dad helped design it and I have a rack of equipment that he subsequently designed around it. In 1982, it beat the hell out of anything else on the market.
What I'd really like, rather than usb or rs232, is ethernet. Our newer tektronix scopes have a network jack on the back and somewhere inside their weird little insides, a webserver, so I can run the scope from anywhere in the building and get data out of it. That's amazingly useful. No drivers, no special cables, no limits on how many instruments I can work with, just pure functionality.
Again, RS232 comes to the rescue here. For some $50, there are eg. the Lantronix XPort adapters available, which are UART/TCP converters. They can either sit and listen for a connection (and then relay the bytes back and forth between UART pins and the socket), or actively open a connection to a defined IP:port. I have some supervisory hardware made this way. The UART/RS232 level converter can be made of two transistors, and all the other stuff you need is 3.3V/100(or so) mA for the module.
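From the computer's side, a bridge in listening mode is just a TCP socket that carries the instrument's serial bytes. A minimal sketch, assuming the instrument speaks a line-oriented ASCII protocol; the function names, the address, and the `READ?` command in the usage note are all hypothetical, not anything specific to the XPort.

```python
import socket


def query_line(conn, command):
    """Send one command line over a connected socket and return the first reply line."""
    conn.sendall(command.encode("ascii") + b"\r\n")
    reply = b""
    while not reply.endswith(b"\n"):
        chunk = conn.recv(256)
        if not chunk:  # the bridge closed the connection
            break
        reply += chunk
    return reply.decode("ascii", errors="replace").strip()


def query_instrument(host, port, command, timeout=2.0):
    """Open a TCP connection to the bridge, run one query, and close it again."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return query_line(conn, command)
```

Usage would be something like `query_instrument("192.168.1.50", 10001, "READ?")`, with the address, port, and command replaced by whatever your bridge and meter actually use.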