Google Releases Tesseract as Open Source

Posted by ryuzaki0 on Monday September 4, 2006 @03:27PM from the bit-rot dept.

An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at HP Labs from 1985 to 1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. After the code had sat on the shelf gathering dust for many years, Google cleaned up some of its more outdated portions and released it for general consumption. You can download Tesseract over at Sourceforge."

Re:As much as I like open source software ... by aweinert · 2006-09-04 15:32 · Score: 5, Informative

CAPTCHAs are specifically meant to break OCR... and if you RTFA, it says it does poorly with grayscale and color documents. Basically it's meant for reading typed text... like in a book.
i hope it can augment the SpamAssassin OCR plugin by sednet · 2006-09-04 16:02 · Score: 2, Informative

It would be great if Tesseract could augment the gocr-based FuzzyOCR and OCR plugins for SpamAssassin.

--
about sean dreilinger
Re:From the Project by kevlarman · 2006-09-04 16:10 · Score: 3, Informative

If you had bothered to browse CVS you would find that it has been released under the Apache License: http://tesseract-ocr.cvs.sourceforge.net/tesseract-ocr/tesseract/COPYING?view=markup

--
A mouse is a device used to point to the xterm you want to type in
License by mapinguari · 2006-09-04 16:11 · Score: 2, Informative

Here's what's in the COPYING file distributed with the source, with some punctuation stripped to placate the lameness filter:
This package contains the Tesseract Open Source OCR Engine. Orignally developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado, the majority of the code in this distribution is now licensed under the Apache License:

** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.

Other Dependencies and Licenses:

The Aspirin/MIGRAINES system in the aspirin directory is separately licensed thus:

# NO WARRANTY
Since the Aspirin/MIGRAINES system is licensed free of charge, Russell Leighton and the MITRE Corporation provide absolutley no warranty. Should the Aspirin/MIGRAINES system prove defective, you must assume the cost of all necessary servicing, repair or correction. In no way will Russell Leighton or the MITRE Corporation be liable to you for damages, including any lost profits, lost monies, or other special, incidental or consequential damages arising out of the use or inability to use the Aspirin/MIGRAINES system.

COPYRIGHT
This software is the copyright of Russell Leighton and the MITRE Corporation. It may be freely used and modified for research and development purposes. We require a brief acknowledgement in any research paper or other publication where this software has made a significant contribution. If you wish to use it for commercial gain you must contact The MITRE Corporation for conditions of use. Russell Leighton and the MITRE Corporation provide absolutely NO WARRANTY for this software.

August, 1992
Russell Leighton
The MITRE Corporation
7525 Colshire Dr.
McLean, Va. 22102-3481

Tesseract can also make use of the libtiff library. (www.libtiff.org) Without libtiff, Tesseract can only read uncompressed and G3 compressed TIFF files.
1. Re:License by lisaparratt · 2006-09-04 19:34 · Score: 2, Informative
  
  It's a neural networking system, so I'd hazard a guess that it's pretty vital to the project :(
Re:NFB owns you by MrNonchalant · 2006-09-04 17:08 · Score: 4, Informative

You can build accessible CAPTCHAs, using images with a sound backup for blind users. My girlfriend is visually impaired and non-accessible CAPTCHAs are a real problem for her; she can't register at some sites without assistance.
Re:As much as I like open source software ... by Phroggy · 2006-09-04 18:31 · Score: 2, Informative

I am currently using the FuzzyOcr plugin to SpamAssassin, and it uses gocr to do the character recognition. To be sure, gocr is improving (the stable released version is practically useless, but the CVS version actually works, mostly), but if Tesseract is better, great!

--
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
Re:Non-English Charsets? by Yvanhoe · 2006-09-04 20:42 · Score: 2, Informative

Google specifically said in the article that it doesn't work for non-English texts. I suppose that means it incorporates an English dictionary too, so other Roman languages wouldn't work either.

--
The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
Re:Isn't fully free / open source by Ed+Avis · 2006-09-04 21:34 · Score: 2, Informative

If you think the software isn't entirely free, contact Sourceforge. Their conditions require that all hosted projects be free software.

--
-- Ed Avis ed@membled.com
Re:I call bullshit by johansalk · 2006-09-04 22:25 · Score: 3, Informative

If CAPTCHAs rely on humans, wasn't there an anti-CAPTCHA trick where spammers had people solve a CAPTCHA to get into some free porn, and then used those answers to get their bots into the legitimate sites they were targeting?
Re:Un-Finishable by gweeks · 2006-09-04 23:03 · Score: 3, Informative

> This is patently false. New stuff comes out of copyright every day.

This is just so untrue. In the United States (the only place that Project Gutenberg worries about) nothing is entering the public domain except unpublished manuscripts where the author died 70 years ago. Nothing else will enter the public domain until 2019. Congress has effectively frozen the public domain.
Re:As much as I like open source software ... by Dan+Ost · 2006-09-05 00:32 · Score: 2, Informative

As someone who has been involved in applying OCR to real world problems, there's nothing
trivial about generating good binary images from images taken in the field (in my case,
images of boxes moving down a conveyor belt or hand imaged by workers).

Even if you disregard such problems as uneven lighting, glare, and distortion due to the
unavoidable vibration inherent to plant settings, most forms that are interesting to
OCR are handwritten and not designed to be OCR friendly. Hopefully this will change as
the people who design such forms become more conscious of the capabilities of OCR, but
even if that were to happen tomorrow, it would take years to complete the transition.

--

*sigh* back to work...
Re:Un-Finishable by fotbr · 2006-09-05 01:39 · Score: 2, Informative

Unless estate holders release it early. Or the author and holder of the copyright declares in his/her will that his/her work be released into the public domain upon his/her death, etc.

Just because it's not common (or likely) doesn't mean it can't happen.
Re:THIS IS ONLY FOR *NIX and not mentioned? by dadman · 2006-09-06 00:23 · Score: 2, Informative

Err... How about Cygwin http://www.cygwin.com/ ?