Google Releases Tesseract as Open Source
An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at Sourceforge."
Can't spammers use this thing to break CAPTCHAs on sites like Slashdot and many other internet forums? CAPTCHAs have been very effective in stopping spammers in the past, but if they can now just read them and answer correctly, then they are effectively rendered useless ...
This signature was left intentionally blank.
HOORAY! Good free OCR software is in short supply. I wonder if this will have a positive impact on Project Gutenberg?
“Common sense is not so common.” — Voltaire
This should be useful for adding anti-image spam capabilities to FOSS anti-spam programs.
The road to tyranny has always been paved with claims of necessity.
Google cleaned up some of the more outdated portions of the code
i.e., added AdSense to the OCR output.
My guess is that they are doing this in the hope the open source community will build on and improve OCR technology. This would be in Google's interest, as it can then index text from images (such as their own Books project) more accurately and efficiently.
www.shortman.com.au - top shorted stocks on the ASX
Now I can finally see how to tell the difference between the 'A'-ness of 'A' and the 'P'-ness of 'P'!
(Credit to S.G.)
> It was open-sourced by HP and UNLV in 2005.
So Google basically did what? Fix bit-rot? Google has re-released some open source code, essentially forking off the original?
> License: (None Listed)
I'm a fan of the FOSS idea. Basically, it makes sure that the whole work to which I contributed always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see is a License after I see some code.
So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.
Quidquid latine dictum sit, altum videtur
Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available.
Yeah, but how is it on lip-reading? That's when we really need to worry.
Push Button, Receive Bacon
Is there any particular reason google isn't hosting the project themselves?
Developers: We can use your help.
I thought Google was opening up their own open source repository: http://www.newsforge.com/article.pl?sid=06/07/27/1833251
It's just an OCR program, you fucking nerds. Nothing to get interested with, despite being released by Google. That's the only reason it was posted on Slashdot isn't it? Cos it's Google.
Posted anon because I wanted to test the OCR on Slashdot's CAPTCHA for Anon Cowards.
They're already useless if installing one will subject your business to boycotts and/or lawsuits from National Federation of the Blind and other advocates for people with disabilities.
Should we praise technology that helps Project Gutenberg run out of pre-1923 books faster? Once all notable pre-1923 books are scanned, OCR'd, and cleaned up, then what does PG do?
it would be great if tesseract could augment the gocr-based FuzzyOCR and OCR plugins for SpamAssassin.
about sean dreilinger
No binaries! Only source code! Good luck getting it to compile on Windows, I gave up after I got several dozen obscure errors I had never seen before from the compiler.*
* If anyone can get VC++2K5 to compile it, please post.
In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules. They have everything written by humanity before that date to digitize: not just English language books and "classics," but government documents, records, foreign language texts, ancient manuscripts ... everything. That's as close to an un-finishable task as you can set yourself, I think.
Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.
With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Couldn't Google have released it on the code hosting service they recently launched?
Yes, go ahead, click the link. It's kosher.
This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, the majority of the code
in this distribution is now licensed under the Apache License:
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
Other Dependencies and Licenses:
The Aspirin/MIGRAINES system in the aspirin directory is separately
licensed thus:
#
NO WARRANTY
Since the Aspirin/MIGRAINES system is licensed free of charge,
Russell Leighton and the MITRE Corporation provide absolutely
no warranty. Should the Aspirin/MIGRAINES system prove defective,
you must assume the cost of all necessary servicing, repair or correction.
In no way will Russell Leighton or the MITRE Corporation be liable to you for
damages, including any lost profits, lost monies, or other
special, incidental or consequential damages arising out of
the use or inability to use the Aspirin/MIGRAINES system.
COPYRIGHT
This software is the copyright of Russell Leighton and the MITRE Corporation.
It may be freely used and modified for research and development
purposes. We require a brief acknowledgement in any research
paper or other publication where this software has made a significant
contribution. If you wish to use it for commercial gain you must contact
The MITRE Corporation for conditions of use. Russell Leighton and
the MITRE Corporation provide absolutely NO WARRANTY for this software.
August, 1992
Russell Leighton
The MITRE Corporation
7525 Colshire Dr.
McLean, Va. 22102-3481
Tesseract can also make use of the libtiff library. (www.libtiff.org)
Without libtiff, Tesseract can only read uncompressed and G3 compressed
TIFF files.
Come on, 34 comments and no mention of A Wrinkle in Time?
It compiles fine on Windows 2003 using MinGW (G++ 4.0.1) and Digital Mars C++ 8.45. It also compiles fine using Watcom C++ 10.6, if you can imagine that.
If I had to field a guess, it's that Visual C++ 2005 isn't a good C++ compiler. Try using higher-quality tools, even if they're a decade old.
"My guess is that they are doing this in the hope the open source community will build on and improve OCR technology. "
Uh, huh. R-E-S-E-A-R-C-H-!
I just checked out tesseract. One thing I have to look at more is the license. It appears to be the Apache license, which seems like a decent free license. But it also includes MITRE's aspirin. I'm not sure how dependent it is on aspirin and what the license restrictions of aspirin are.
The two best free OCR engines out right now are clara and gocr. While they are the best, they are not that great yet. I just ran the same tiff I had run with those two (I also have the document in pbm and other formats). Tesseract did not read it, it bailed with "IMAGE::check_legal_access:Error:Can't seek backwards in a buffered image!"
Clara and GOCR are written in C, Tesseract is written in C++, a language I don't know. Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though.
Hopefully this can be improved, or good parts of it can be borrowed and incorporated into gocr or clara. It couldn't handle my test that both clara and gocr could, but it probably has strengths the other two don't. One day hopefully we'll have a free OCR that handles things as automagically as the commercial ones do. I will see what I can contribute to that as well. Although this is C++ and I don't know that language.
I've tried all the previous open source stuff and it was pretty much unusable. The accuracy was so bad that it was just easier to start typing. I got a few of the Windows programs kind of working under WINE, but then I discovered Vividata and it worked really well and could be called from the command line. This meant I could write my own scripts that used it. I used it quite a bit for Project Gutenberg and was very impressed. It's not cheap, but if you want to do OCR under Linux and can afford it, I recommend it.
I would definitely prefer a Free Software solution so I'm excited about this development. Until this solution is really workable (see the Google limitations, they're pretty serious), give Vividata a try.
I downloaded and tried compiling it in OS X and got some Linux-specific build problems. I'm no code guru so I gave up as well. But then, even Linux doesn't support the `make install` process, as claimed by the `./configure` script's output.
From the license file: "It may be freely used and modified for research and development purposes. We require a brief acknowledgement in any research paper or other publication where this software has made a significant contribution. If you wish to use it for commercial gain you must contact The MITRE Corporation for conditions of use."
The condition would be to solve a puzzle given as text instead of reading an image meant to be as confusing as possible. Some forums have very bad systems for this, and sometimes I have to register multiple times before actually getting a CAPTCHA image that I can read.
Copyright infringement is "piracy" in the same way DRM is "consumer rape"
You've got two constraints. One is that you have to be able to compose an arbitrarily large number of CAPTCHAs algorithmically. For example, that example you just used is human-composed. If it's the only CAPTCHA you have, the following program gets me a job at Google: gawk 'BEGIN{print "b"}'. If you have 100 CAPTCHAs, I only need to add a switch statement and some elbow grease and then I get to break your CAPTCHA a trillion times.
The other constraint is that you have to have your problem be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwitch" is not generally on the Japanese English curriculum. Similarly, you could pose any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.
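To make the first constraint concrete, here is a minimal sketch of why a fixed, human-written pool of puzzles does not hold up: once each answer has been seen, a plain lookup table answers it forever. The questions in `CAPTCHA_POOL` and the helper names are made up for illustration and do not come from any real site.

```python
# Hypothetical fixed pool of human-composed text puzzles (illustration only).
CAPTCHA_POOL = {
    "Which letter comes after a?": "b",
    "What is two plus three?": "5",
}


def learn(question, answer, known_answers):
    """Record an answer a human worked out once (the 'elbow grease')."""
    known_answers[question] = answer


def solve(question, known_answers):
    """Return the memorised answer, or None if the puzzle is new."""
    return known_answers.get(question)


known = {}
for question, answer in CAPTCHA_POOL.items():
    learn(question, answer, known)

# Every puzzle in the finite pool is now answered without any intelligence.
solved = {question: solve(question, known) for question in CAPTCHA_POOL}
```

The attacker's cost grows only with the size of the pool, while the number of times the pool can be replayed is unlimited, which is the asymmetry the comment describes.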
By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/
Help poke pirates in the eyepatch, arr.
how do you use this? It compiled fine, and the readme says to use it something like this:
tesseract file.tif output batch
What are "output" and "batch" supposed to be? When I specify a batch file it segfaults.
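For anyone scripting this, a minimal sketch of calling the program from Python follows. It assumes the second argument is the base name of the output text file (so `output` would produce `output.txt`), which is how later Tesseract releases behave; the role of the trailing "batch" word is not clear from this thread, so it is left out. The helper names are hypothetical.

```python
import shutil
import subprocess


def build_command(image_path, output_base):
    # Assumed form: tesseract <image> <output base name>
    return ["tesseract", image_path, output_base]


def run_ocr(image_path, output_base):
    """Run tesseract on one image and return the expected text file name."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_command(image_path, output_base), check=True)
    # Assumption: the recognised text lands in <output_base>.txt
    return output_base + ".txt"
```

With this sketch, `run_ocr("file.tif", "output")` would run the same command as the README line minus the "batch" argument.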
In Soviet Russia, shady political characters recognize YOU!
The very first CAPTCHA implementation was broken, but the funny thing about CAPTCHAs is that it's absolutely no effort to make an image completely unreadable for current OCR software. And even if one certain implementation is broken, just add another layer of distortion. The human brain is capable of coping with it, OCR software usually is not.
And after all, it's not about authentication, it's about making a service accessible only for humans.
BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.
A monkey is doing the real work for me.
Actually, shortly thereafter, HP decided to get out of the technology innovation business, and into the printer ink business.
If you want news from today, you have to come back tomorrow.
TH18 IS GRLAT NEWf4 FOR TH0Sj OF US U$1NZ BA) O(R RLCOGN1+ION!
THAHKS, G00GLL!1!!!
This story is somewhat timely for me. I am secretary of a club, we have a large quantity of documents collected over the last 20 years or so, some hand written, some typed, forms, invoices, minutes of meetings, letters sent to and from etc etc. There are a LOT of documents.
Lately I've been thinking about computerizing these documents into a web based system, so that any of the club executive can search and pull out a document they need etc, we could also flag documents as "general release" so that people could read interesting stuff from our past. And of course it would also serve as a secure backup of our documents, in case of fire, theft, alien invasion...
I think what is needed is a rough OCR system, that is, an OCR system that's not trying to be perfect, but can at least make about 50% accuracy on both typed and handwritten (without training!) documents, and preferably where it wasn't pretty certain it was correct, it would just skip words. The idea being that I'd run each document (big job, but doesn't matter if it takes a year) through a scanner, OCR it to get some searchable content, then store it as a PDF, or jpeg or something.
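The "rough OCR for search" idea can be sketched as a small inverted index that simply drops words the recogniser was unsure about. This assumes the OCR engine reports a confidence between 0 and 1 for each word, which is an assumption and not something the thread confirms for any particular program; the file names and sample words are made up.

```python
from collections import defaultdict


def index_document(doc_id, words, index, min_confidence=0.5):
    """Add the confidently recognised words of one scanned document.

    `words` is a list of (word, confidence) pairs; uncertain words are skipped.
    """
    for word, confidence in words:
        if confidence >= min_confidence:
            index[word.lower()].add(doc_id)


def search(index, term):
    """Return the documents whose confident words include `term`."""
    return sorted(index.get(term.lower(), set()))


index = defaultdict(set)
index_document(
    "minutes-1994-03.pdf",
    [("Annual", 0.9), ("meetlng", 0.2), ("minutes", 0.8)],
    index,
)
index_document("invoice-0042.pdf", [("Invoice", 0.95), ("annual", 0.7)], index)
```

The scanned image stays the document of record; the index only needs enough correct words per page for a search to land on it.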
Anybody know of such an (open source, or at least free as in beer) OCR system?
NZ Electronics Enthusiasts: Check out my Trade Me Listings
little known fact: the AAA is the largest anti-public transport lobby in the US.
Single-mindedly want all transport monies to go into roading projects.
I guess it's part of their mission, but it is pretty crap for an otherwise populist organization of good.
As there seems to be no documentation on the Sourceforge page about what this can actually do, does it learn or follow rules? If it learns, can it learn to recognize, say, Japanese characters?
I tell ya, it'd be friggin' sweet if someone would work on making a functional Music OCR program. Scanning a score using the piece-of-crap Photoscore into (the not-so-piece-of-crap) Sibelius always ends up taking longer than actually inputting the music manually. I don't know about others who dabble in this software, but I'm sick and tired of a piece of dust being interpreted as a meter change.
The piece in question is a neural network simulator named Aspirin/MIGRAINES, presumably used for training. Pun away.
"Currently it builds under Linux with gcc2.95 and under Windows with VC++6". In other words, it won't compile under Mac OS X... yet ;)
The bits on the bus go on and off... on and off... on and off...
My thoughts exactly...
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
A good idea, and if significant amounts of text are in an image, I'd view the mail as dubious anyway.
If not because of spam, then because of the idiotic format. Images are for illustrations, but using them to transfer major amounts of text is just stupid and inefficient.
C - the footgun of programming languages
and I've never heard of this thing.
Guess I should have got out of my cube more.
ccalam - acoustic versions of new songs.
Screen captured some text from the article, used XV to transform into tif, changing image to monochrome.
Input image: it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code
Output text: ii has been lamed as one of lhe mos! accurale Oplical Characler Recognilion (OCR) programs available. Having sat on lhe shelf galhering dusk for so many years, Google cleaned up some of lhe more ouldaled porlions of lhe code
I have no idea what kind of input is optimal, but for a first shot in the dark, that's not too shabby. I'll go play with it some more. (^,^)
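One way to put a number on "not too shabby" is the character error rate: the edit distance between the expected text and the OCR output, divided by the length of the expected text. The sketch below uses a plain Levenshtein distance on the first few words of the test above; the function names are my own.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "it has been touted as one of the most accurate"
hypothesis = "ii has been lamed as one of lhe mos! accurale"
cer = character_error_rate(reference, hypothesis)
print(f"character error rate: {cer:.3f}")
```

Most of the errors in the sample are the same 't' to 'l' confusion, so a per-character tally of substitutions would probably be more informative than the single rate.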
I don't get it. Isn't everything released on SourceForge supposed to be under a free license? Then how come this is released under no license? Perhaps I'm not looking on the right pages, but I can't seem to find anything besides the "none listed" on the main page of the project.
I suppose "audible captchas" should be feasible. That is, if you can't see the picture, the captcha server also has an audio file with the same information. I'd be surprised if this doesn't exist already in some form.
While it may be nice to have the source of a tesseract, however, those can only be built in a 4-dimensional space. So where do I get the build environment?
The Tao of math: The numbers you can count are not the real numbers.
I always knew Google were powerful. I did not, however, know they had the power to open source the 4-dimensional analog of the (3-dimensional) cube, where motion along the fourth dimension is often a representation for bounded transformations of the cube through time.
"No, no, no, don't tug on that! You never know what it might be attached to."
that F/OSS isn't anti-business. It just works with different business models.
Google's business interest in releasing this as open source is obvious: the greater the value of the materials available to the Internet, the greater the value of its service.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
The name of the system you propose is called challenge/response (CR). CR is not a good idea for the following reasons:
1) It says "My time is more important than yours" to all your correspondents, because you're not willing to look at a few spams getting past your Bayesian filter every day so instead you offload that time burden to people who want to talk to you.
2) Dueling CR systems ("Hey, bob@example.com, I don't recognize you. Please prove you are a human" "Re: Hey, bob -- steve@stupid.com, I don't recognize you. Please prove you are a human"). Even more fun in a potentially infinite loop. Any system you can make to short-circuit this loop can be abused by spam to avoid the CR altogether.
3) Doesn't survive the Chinese Sweatshop Spam Attack, which will be ubiquitous if CR becomes popular. (Take poor Chinese person, teach them 10 words of English, pay them 2 cents an hour to answer CAPTCHAs so you get guaranteed delivery of your Maximize Your Mr. Wiggly offers.)
4) Breaks legitimate bulk mail senders, such as Amazon, Paypal, eBay, mailing lists, etc etc. Mailing lists in particular are going to be very fun, since a lot of CR systems would spam the entire list -- perhaps provoking 100 challenges! Which then leads to combinatorial hilarity!
Help poke pirates in the eyepatch, arr.
As the linked article states, there are commercial OCR programs that are far more accurate.
--Rob
Towards the Singularity.
Does anyone here know how to get it to install and run on SuSE 10.0? The instructions are a little confusing. If you can't use make install, what do you use?
./configure returns "error in line 1329" and "make install has not been implemented yet avoid using."
From INSTALL
"4. Type `make install' to install the programs and any data files and documentation."
Running
README has this to say: "The executable must reside in the same directory as the tessdata directory. The command line is: tesseract image.tif batch"
When I try to run it, a window pops up briefly and then disappears.
davecb5620@gmail.com
Are you insinuating that the 115th Congress won't try to enact a Chastity Bono Copyright Term Extension Act? Given Mexico's life plus 100 copyright term, the next step of "harmonization" for the United States and its trading partners is life plus 100 or, in the case of works made for hire, 125 years after publication.
Who's to say that publishers won't fight back against Gutenberg the way (ObTopic) they did against Google? It's only fair use if you can pay a judge to tell you that it is and if you can pay your lawyer to tell the judge to tell you that it is.
Except that estates of authors of well-known works tend to be stricter than that. I'm willing to bet that there won't be enough books 1. which are notable, 2. whose copyright is abandoned by the author or his estate, and 3. which are not already published electronically by the author, to keep Project Gutenberg and the public-domain part of Google Book Search busy after the Chastity Bono Act comes into effect.
well, that is quite a stretch but just maybe send the info from Mindstorms to host so that the robots can read
Oh man, I thought that Google had finally unleashed a time machine. Just think of the ramifications of that, the concept of time travel is simple enough, you just wind the string of time a different way and jump from one piece of the yarn to its neighboring piece on the ball, right? But the part that needs to be solved is how to do it, and where to get the energy? - Yes, by the way, there is such a thing as a tesseract. A Wrinkle in Time.
One small flaw with your argument. Copyrighted material isn't a step function, timewise. Content is constantly created on one end, and it falls off on the other (and yes I'm using a much broader definition of "published" than you are)
Second, the extension of copyrights. While there is precedent, it's not an infinite function either.
"And although it's technically possible for an author to explicitly release his work himself, it doesn't count because it doesn't solve the problem."
Well neither does P2P, despite the booster crowd here.
...or mothers... but please, this was not Flamebait. It's called humor, and if it's not funny, just don't laugh. It's not like he posted some big GNAA ASCII SNAFU.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
I know for a fact because I saw it myself that HP Research Labs Bristol had a fully working implementation of page layout analysis together with OCR in 1994. It was impressive. It handled all the usual page layout issues such as multiple columns and page skew.
I've no idea whether that's true or not (and after a quick Google, it looks like finding out would be too much like hard work), but it's certainly interesting....
Man, that tool is old.
I know some people who work in the department where it was created and I think the consensus is that no one has thought about that tool in a long time (as there are much better ones now).
I think if there was some pressure from users of Tesseract... it would be quite possible to push through a request to re-release it under the Apache license.
I think the biggest hurdle there would be the paperwork and explaining why it'd be a nice gesture to the people who have to sign the forms (managers, corporate, etc.). We have a pretty hefty PR/licensing process since most of our work is delivered to our sponsors (government).
But it's not like anyone considers it some kinda asset. In fact, if anyone asked for support for it there'd be groaning because no one has touched it in over a DECADE.
But yeah, I encourage users of Tesseract to send snail-mail letters explaining the issue to the Neuroscience folks there at the Washington location.
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
Apparently the OP thinks the entire world lives and breathes *NIX, so much so that he couldn't be bothered to mention the OS platform requirement? Thanks for wasting the time of those readers who may not yet have a Linux system with which to use it.
Now we can see the difference between the 'A'-ness of 'A' and the 'P'-ness of 'P'!