Google Pushes Open Source OCR

Posted by Zonk on Tuesday April 10, 2007 @06:02AM from the google-has-taken-all-knowledge-to-be-its-provice dept.

SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"

20 of 212 comments

The goal of the project by user24 · 2007-04-10 06:07 · Score: 4, Insightful

The goal of the project is to stop the damn email image spammers.

among other things, sure, but it's got to be a high priority for google.
1. Re:The goal of the project by slashbob22 · 2007-04-10 07:50 · Score: 4, Insightful
  
  Ok, I'll bite and play DA for a bit.
  
  Why Google wouldn't want this:
  1) Google's own patents on search techniques, distributing advertisement, etc. Yes, I understand that you are talking about the need for certain limits to patent law - but this could as easily hurt Google as help them.
  2) They are already challenging the copyright laws with GooTube, I don't see any sense in tackling both at the same time.
  
  IANIGHQ (In Google's HQ) but I don't see the value of getting sued at this point in time. Besides, if Google is doing this under appropriate conditions there shouldn't be concern of suits - but I suppose their Chinese plagiarism case doesn't support this point.
  
  // End DA
  
  --
  Proof by very large bribes. QED.
the presidential papers by User+956 · 2007-04-10 06:10 · Score: 4, Funny

The goal of the project is to ... deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis

So, will it work on documents written in crayon? It would be a tragic loss for Dubya's presidential documents to get lost in the sands of time. On the scale of the library of Alexandria. No, seriously.

--
The theory of relativity doesn't work right in Arkansas.
Very cool. by Kadin2048 · 2007-04-10 06:17 · Score: 5, Insightful

I've been hoping that someone with deep pockets (Google, IBM, Sun) would take on this area for a while.

There is a major need for an OSS OCR package, and right now the field is pretty bare. There's GOCR and a commercial offering called OCRShop, and as far as I've run across, that's about it. Nothing really on par with Omnipage, or other commercial packages for other platforms.

I think there are some really neat applications for OCR that have never really been investigated, because it's so expensive to build that capability into other products. A free OCR engine that really worked could lead to some very neat book-scanning applications, just for starters. I don't think that there are really any integrated packages around for helping people scan books and manuscripts. (Right now you have to photograph the pages, keep them organized, then OCR them and proofread the text against the images. Bit of a nightmare.) I'd love to see a free application for libraries that let a user batch scan (via a digital camera -- let's not get into what I think of SANE and scanners generally) a book, and then provided a nice interface for proofreading the OCRed text against the original image.

Something like that could have a huge social impact. There are a lot of libraries where I'm sure they'd love to scan some of their out-of-copyright assets and provide them to patrons in a digital form, but it's just too technically complicated. An easy-to-use program that let the proofreading be done by nontechnical users (maybe remotely, as long as we're dreaming) could vastly increase the volume of digital materials available.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Small price if it helps email spam. by Kadin2048 · 2007-04-10 06:22 · Score: 4, Insightful

And of course, as a side effect they'll probably wind up with a lovely distributed system for solving captcha. ;)

True, but CAPTCHAs always seemed like a bit of an inelegant hack anyway. First, they're horrible from a disabled-access standpoint, and second they're really not all that effective against a concerted enemy when there's a lot of money on the line. Spammers can just pay a few kids in some Third World country to sit there all day and solve CAPTCHAs if they want to.

Since message boards, which are the major users of CAPTCHAs, are practically by design little fiefdoms, I don't think they're nearly as hard to patrol as a common-carrier network like email. The solution to message-board spam is to either institute a moderator-delay (for small blogs and boards), or simply make enough admins with IP-ban powers so that the second someone starts spamming, they get banned and the spam gets deleted. Lameness filters working on the same principles as email spam-filters are probably helpful, too.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
1. Re:Small price if it helps email spam. by Pxtl · 2007-04-10 07:39 · Score: 4, Interesting
  
  You've obviously never fought off a bb spammer. They don't use one or two accounts to spam one or two messages - they inundate the board from a long list of IPs. Even without spamming messages, they create hordes of accounts just for the pagerank provided by the links within their personal account pages. Plus, admin-approval-delays degrade quality for the user. It creates a huge headache all around to handle maintaining banlists and cleaning out garbage.
  
  Captchas are by far the better solution.
  
  The problem is that, long term, they will eventually be cracked. I'm imagining the ultimate solution will only be to allow users with email addresses from "approved" major ISPs.
Orcopus? by voice_of_all_reason · 2007-04-10 06:23 · Score: 4, Funny

Orcopus:

Level: 15
Race: Fell Marine
HP: 290/290
EP: 200/200
Water elemental
Drops: Tentacle
Wonderful! by jshriverWVU · 2007-04-10 06:27 · Score: 4, Insightful

This'll be a much-needed boost for us Linux users who want to help out Project Gutenberg.
Re:The beginning of the end? by lawpoop · 2007-04-10 06:38 · Score: 5, Informative

I doubt it.

Part of what makes OCR work is that it assumes that the text was written to communicate meaning. It has regular characters in alignment, forming common words and abbreviations in more or less grammatical sentences with close-to-proper punctuation.

A good captcha has a nonsense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR engine uses.
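The point about linguistic clues can be illustrated with a minimal sketch of dictionary-based post-correction. This is not how OCRopus or any particular engine works; the tiny `VOCABULARY` list, the function names, and the `cutoff` value are all assumptions made up for the illustration. Only `difflib.get_close_matches` is a real standard-library call.

```python
import difflib

# Tiny stand-in vocabulary (hypothetical); a real system would use a full word list.
VOCABULARY = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the closest vocabulary word, or the token unchanged if none is close."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line):
    """Apply word-level correction to every whitespace-separated token."""
    return " ".join(correct_token(token) for token in line.split())
```

Misread ordinary text such as "the qu1ck br0wn fox" can be pulled back toward real words, while a random captcha string like "xK7qZp" has no nearby dictionary entry, so this kind of clue contributes nothing.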

--
Computers are useless. They can only give you answers.
-- Pablo Picasso
captchas by gEvil+(beta) · 2007-04-10 06:40 · Score: 4, Insightful

All you people who are worried about this breaking captchas seem to be missing something--there have been a number of fairly decent OCR packages out there for a long time. The goal of this Google project is to create an open-sourced one that does a good job deciphering HUMAN-READABLE TEXT. Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.

--
This guy's the limit!
searchable pdfs by radarsat1 · 2007-04-10 06:44 · Score: 4, Interesting

Anyone know of an open source utility that can convert scanned image-based PDF files into searchable PDFs?
(Extra points if it somehow re-generates the actual file so it looks nice instead of pixelated.)

Perhaps this library could be used to build such an application if none exists...
Language? by ceeam · 2007-04-10 06:45 · Score: 4, Interesting

English only I suppose?
1. Re:Language? by fireboy1919 · 2007-04-10 07:07 · Score: 5, Funny
  
  Since the official language of the Googleplex is Googlese, and the original project was developed by the US Census bureau - notorious for their use of no languages except Esperanto, it goes without saying (though I'm saying it anyway), that it will read only Klingon.
  
  Remember kids, there are no stupid questions.
  Only people who don't RTFA who ask questions.
  
  --
  Mod me down and I will become more powerful than you can possibly imagine!
Re:The beginning of the end? by user24 · 2007-04-10 06:58 · Score: 4, Insightful

Please, please, please, everybody, stop claiming that "what is 2+2?" is a hard AI question. I could code something in an hour to defeat most of this sort of question, and give me a week and a budget and I'll write something to get past 95% of these types of questions.

If the text is parsable, it takes nothing to google it.
I mean, those two examples you give; just slap it into google and screenscrape it. So you're going to need harder questions than that.

So the next generation of crapchas will ask "what color is the sky".
Go and take a glance at ultraHal or another relatively advanced NLP AI; a large knowledgebase is not hard to construct. When it doesn't know, it guesses. If it gets it right, then the knowledgebase increases by one fact.
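As a rough sketch of the argument so far, and assuming nothing about how ultraHal or any real bot works, a solver for the two question styles mentioned can be very small. The `KNOWLEDGE` dictionary, the `ARITHMETIC` pattern, and the `answer` function are hypothetical names invented for this example.

```python
import operator
import re

# Supported arithmetic operators for simple "what is A op B" questions.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Hypothetical seed facts; the post describes growing such a base one fact at a time.
KNOWLEDGE = {"what color is the sky": "blue"}

# Two integers separated by a single operator, with optional whitespace.
ARITHMETIC = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)")


def answer(question):
    """Answer simple arithmetic or known-fact questions; return None otherwise."""
    match = ARITHMETIC.search(question)
    if match:
        left, op, right = match.groups()
        return str(OPS[op](int(left), int(right)))
    # Normalise to lowercase letters and spaces before the lookup.
    key = re.sub(r"[^a-z ]", "", question.lower()).strip()
    return KNOWLEDGE.get(key)
```

Here `answer("what is 2+2?")` yields "4" and `answer("What color is the sky?")` yields "blue". Anything outside the pattern and the fact table returns None, which is the point where the bot described above would fall back to guessing.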

So then, what, you have to ask "Given that all bleeps are blue, and blank is a bleep, is blank blue?"
Not only is that also easily computationally solved, but also a lot of people aren't going to be able to answer (smartass questions about stopping spam and idiots aside).

So *then* I suppose you have to ask "In the first mathematical antinomy, does Kant conclusively prove both that there can have been no beginning to time and that there must have been a beginning to time?"
and give the user a 255 character textarea to put their answer in.

So... please, text-question-based captchas are DOOMED TO FAIL. Stop thinking that they could work. They can't.
Re:Finally... by Feyr · 2007-04-10 07:00 · Score: 4, Insightful

have you tried gocr? it's nice as a random number generator, but besides that... it's pretty much garbage
Well... by Shawn+is+an+Asshole · 2007-04-10 07:40 · Score: 4, Informative

If you're sick of image spam, you can do what I did. Add the OpenProtect channel to SpamAssassin and then add these lines to your SpamAssassin config:

required_hits 5
score SARE_GIF_ATTACH 5

I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day.

--
"It ain't a war against drugs. It's a war against personal freedom" --Bill Hicks
Re:Sign of times to come? by Instine · 2007-04-10 07:48 · Score: 4, Insightful

What about a free service to upload scanned images to and receive html in return?... Please....

--
Because you can - or because you should?
More build info; Ubuntu Feisty by drinkypoo · 2007-04-10 08:25 · Score: 4, Informative

Here's some other information you might need/want to build this software; note that I am on Ubuntu Feisty.

To build tesseract-ocr you must install autoconf.

If you are smart you will figure out a way to omit ocropus/data-test-pages from your checkout. Do you really want to use their pages to test? No! You want to use your own data.

I built with gcc 4.1.2, YMMV. Some people have reported errors trying to just compile tesseract.

To build ocropus, I ended up installing jam, libaspell-dev and libtiff-dev.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:captcha's by cmacb · 2007-04-10 08:43 · Score: 4, Funny

captcha's are not restricted to images of letters. For example: you could ask people to solve a regular text question (this would also fix accessibility issues)

You mean as in:

Describe what the following expression does in 30 words or less: {"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}

Man, I'll never get into forum postings if they do that!
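For anyone stuck on the challenge above: one reading of the expression is concatenation of two sets of strings, where every string on the left is joined with every string on the right. A minimal sketch under that assumption follows; the function name `concatenate` is made up for the example.

```python
from itertools import product


def concatenate(left, right):
    """Return every string formed by one element of left followed by one of right."""
    return {a + b for a, b in product(left, right)}
```

Calling `concatenate({"ab", "c"}, {"d", "ef"})` produces the four strings on the right-hand side of the quoted expression.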
Re:captcha's by JamesTRexx · 2007-04-10 09:31 · Score: 4, Funny

Describe what the following expression does in 30 words or less:
{"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}

Answer: Makes my head hurt...

*click* Access to MySpace granted, have a nice day.

Which forum were you talking about again? :-)

--
home