Google Releases Tesseract as Open Source

Posted by ryuzaki0 on Monday September 4, 2006 @03:27PM from the bit-rot dept.

An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. After the code had sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at SourceForge."

17 of 251 comments

I take back every bad thing I said about Google by OrangeTide · 2006-09-04 15:30 · Score: 4, Interesting

HOORAY! Good free OCR software is in short supply. I wonder if this will have a positive impact on Project Gutenberg?

--
“Common sense is not so common.” — Voltaire
Re:As much as I like open source software ... by aweinert · 2006-09-04 15:32 · Score: 5, Informative

CAPTCHAs are specifically meant to break OCR... and if you RTFA, it says it does poorly with grayscale and color documents. Basically it's meant for reading typed text... like in a book.
improvements by Anonymous Coward · 2006-09-04 15:33 · Score: 5, Funny

Google cleaned up some of the more outdated portions of the code
i.e., added AdSense to the OCR output.
Re:As much as I like open source software ... by illuminatedwax · 2006-09-04 15:33 · Score: 5, Funny

You're right! Let us never delve into research that could conceivably overturn weak software security! Some things man was never meant to discover! Turn back, before we fly too close to the sun and our wings melt!! O, Prometheus, why hast thou given us this OCR technology??

--
Did you ever notice that *nix doesn't even cover Linux?
From the Project by Gopal.V · 2006-09-04 15:43 · Score: 4, Insightful

> It was open-sourced by HP and UNLV in 2005.

So Google basically did what? Fixed bit-rot? Google has re-released some open source code, essentially forking off the original?

> License: (None Listed)

I'm a fan of the FOSS idea. Basically, it makes sure that the whole work to which I contributed always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see, after I see some code, is a license.

So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.

--
Quidquid latine dictum sit, altum videtur
I'm sorry Dave... by macadamia_harold · 2006-09-04 15:44 · Score: 4, Funny

Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available.

Yeah, but how is it on lip-reading? That's when we really need to worry.

--
Push Button, Receive Bacon
Hosting by truthsearch · 2006-09-04 15:44 · Score: 5, Interesting

Is there any particular reason Google isn't hosting the project themselves?

--
Developers: We can use your help.
1. Re:Hosting by larry+bagina · 2006-09-04 15:46 · Score: 5, Funny
  
  Yes. They need the 99.9999% uptime (6 9s) that only sourceforge can provide.
  
  --
  Do you even lift?
  These aren't the 'roids you're looking for.
2. Re:Hosting by Leto-II · 2006-09-04 18:24 · Score: 4, Funny
  
  I think you need to recalibrate your sarcasm detector.
  
  --
  Do not anger the worm.
NFB owns you by tepples · 2006-09-04 15:48 · Score: 4, Interesting

CAPTCHAs have been very effective in stopping spammers in the past, but if they can now just read them and answer correctly, then they are effectively rendered useless ...

They're already useless if installing one will subject your business to boycotts and/or lawsuits from National Federation of the Blind and other advocates for people with disabilities.
1. Re:NFB owns you by MrNonchalant · 2006-09-04 17:08 · Score: 4, Informative
  
  You can build accessible CAPTCHAs, using images with a sound backup for blind users. My girlfriend is visually impaired and non-accessible CAPTCHAs are a real problem for her; she can't register at some sites without assistance.
Re:As much as I like open source software ... by djtack · 2006-09-04 15:51 · Score: 4, Insightful

Plus, good OCR could help recognize image spam (where they send the text in an image attachment, to avoid filtering, and fill the message body with "bayes poison").
Un-Finishable by Kadin2048 · 2006-09-04 16:09 · Score: 5, Interesting

In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules. They have everything written by humanity before that date to digitize: not just English language books and "classics," but government documents, records, foreign language texts, ancient manuscripts ... everything. That's as close to an un-finishable task as you can set yourself, I think.

Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.

With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
1. Re:Un-Finishable by mrchaotica · 2006-09-04 17:58 · Score: 4, Insightful
  
  In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules.
  
  Your argument makes the fundamentally flawed assumption that the "new rules" will remain constant. The reality is that Copyright will continue getting extended so that new content never comes into Public Domain. (I hope the copyright fuckers are the first against the wall when the revolution comes!)
  
  Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration.
  
  I'm sure they could even OCR them... they just couldn't make them available to the public. Of course, given the community-driven mechanism by which Project Gutenberg works, they couldn't legally distribute them to the volunteers either...
  
  --
  "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Two reasons by patio11 · 2006-09-04 16:49 · Score: 4, Insightful

You've got two constraints. One is that you have to be able to compose an arbitrarily large number of CAPTCHAs algorithmically. For example, the one you just used is human-composed. If it's the only CAPTCHA you have, the following program gets me a job at Google: gawk 'BEGIN{print "b"}'. If you have 100 CAPTCHAs, I only need to add a switch statement and some elbow grease and then I get to break your CAPTCHA a trillion times.
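To make the first constraint concrete, here is a minimal Python sketch of that "switch statement" attack. The pool contents, byte strings, and function names are all made up for illustration; the only point is that once a finite pool has been solved by hand, answering is a plain table lookup.

```python
import hashlib

# Hypothetical fixed pool of CAPTCHAs, each solved once by a human.
# The byte strings stand in for real image files.
SOLVED_BY_HAND = {
    b"captcha-image-0": "b",
    b"captcha-image-1": "tesseract",
    b"captcha-image-2": "42",
}

def fingerprint(image_bytes: bytes) -> str:
    """Identify an image by a hash of its exact bytes."""
    return hashlib.sha256(image_bytes).hexdigest()

def build_lookup(solved: dict) -> dict:
    """The 'elbow grease' step: map each known image to its answer."""
    return {fingerprint(img): answer for img, answer in solved.items()}

def solve(lookup: dict, image_bytes: bytes):
    """Return the stored answer, or None if the image was never seen."""
    return lookup.get(fingerprint(image_bytes))

lookup = build_lookup(SOLVED_BY_HAND)
```

A site that generates fresh challenges algorithmically never serves the same bytes twice, so this lookup would return None every time. That is why the pool has to be arbitrarily large.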

The other constraint is that your problem has to be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwitch" is not generally on the Japanese English curriculum. Similarly, you could pose any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.

By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/

--
Help poke pirates in the eyepatch, arr.
I call bullshit by quigonn · 2006-09-04 17:16 · Score: 4, Interesting

The very first CAPTCHA implementation was broken, but the funny thing about CAPTCHAs is that it takes absolutely no effort to make an image completely unreadable for current OCR software. And even if one certain implementation is broken, just add another layer of distortion. The human brain is capable of coping with it; OCR software usually is not.

And after all, it's not about authentication, it's about making a service accessible only for humans.

BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.

--
A monkey is doing the real work for me.
HP decided to get out of the OCR business? by Frosty+Piss · 2006-09-04 17:18 · Score: 5, Funny

In 1995 it was one of the top 3 performers at the OCR accuracy contest organized by University of Nevada in Las Vegas. However, shortly thereafter, HP decided to get out of the OCR business...

Actually, shortly thereafter, HP decided to get out of the technology innovation business, and into the printer ink business.

--
If you want news from today, you have to come back tomorrow.