Google Releases Tesseract as Open Source
An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at Sourceforge."
HOORAY! Good free OCR software is in short supply. I wonder if this will have a positive impact on Project Gutenberg?
“Common sense is not so common.” — Voltaire
This should be useful for adding anti-image spam capabilities to FOSS anti-spam programs.
The road to tyranny has always been paved with claims of necessity.
CAPTCHAs are specifically meant to break OCR... and if you RTFA, it says it does poorly with grayscale and color documents. Basically it's meant for reading typed text... like in a book.
Google cleaned up some of the more outdated portions of the code
i.e., added AdSense to the OCR output.
You're right! Let us never delve into research that could conceivably overturn weak software security! Some things man was never meant to discover! Turn back, before we fly too close to the sun and our wings melt!! O, Prometheus, why hast thou given us this OCR technology??
Did you ever notice that *nix doesn't even cover Linux?
My guess is that they are doing this in the hope the open source community will build on and improve OCR technology. This would be in Google's interest, as it can then index text from images (such as their own Books project) more accurately and efficiently.
www.shortman.com.au - top shorted stocks on the ASX
OCR is most effective when the letter boundaries are clear and well-defined, such as fixed-width text, or text that is at least on a straight line. Most CAPTCHAs put the letters on a curved path, as well as distorting the letters so they are no longer within a clearly defined rectangular shape. This makes it very hard to identify which parts of the images are letters and which parts are not, making OCRing CAPTCHAs a non-trivial problem.
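To make the point about letter boundaries concrete, here is a minimal sketch of one classic way to find them: count the ink in each pixel column and cut wherever a column is empty. This is an illustration only, not how Tesseract itself segments text; the tiny bitmaps and the `split_columns` name are made up for the example.

```python
def split_columns(bitmap):
    """Split a binary bitmap (equal-length strings, '#' = ink) into
    (start, end) column ranges separated by completely blank columns."""
    width = len(bitmap[0])
    # Ink count per column: the "vertical projection profile".
    profile = [sum(row[x] == "#" for row in bitmap) for x in range(width)]

    spans, start = [], None
    for x, ink in enumerate(profile):
        if ink and start is None:
            start = x                  # a glyph begins here
        elif not ink and start is not None:
            spans.append((start, x))   # a blank column ends the glyph
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Two upright glyphs with a clean gap between them.
clean = [
    "###..##.",
    "#.#..#..",
    "###..##.",
]

# The same kind of shapes, slanted so that no column is ever empty.
warped = [
    "###.##..",
    ".###.##.",
    "..###.##",
]

print(split_columns(clean))   # two separate glyph ranges
print(split_columns(warped))  # one merged range: the cut points are gone
```

On the upright sample the blank columns give two clean cuts. On the slanted sample every column contains some ink, so this naive approach sees a single blob, which is roughly the effect the curved and distorted CAPTCHA text is going for.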
Now I can finally see how to tell the difference between the 'A'-ness of 'A' and the 'P'-ness of 'P'!
(Credit to S.G.)
> It was open-sourced by HP and UNLV in 2005.
So Google basically did what? Fixed bit-rot? Google has re-released some open source code, essentially forking off the original?
> License: (None Listed)
I'm a fan of the FOSS idea. Basically, it makes sure that the whole work to which I contributed always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see, right after I see some code, is a License.
So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.
Quidquid latine dictum sit, altum videtur
Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available.
Yeah, but how is it on lip-reading? That's when we really need to worry.
Push Button, Receive Bacon
Is there any particular reason google isn't hosting the project themselves?
Developers: We can use your help.
They're already useless if installing one will subject your business to boycotts and/or lawsuits from National Federation of the Blind and other advocates for people with disabilities.
Plus, good OCR could help recognize image spam (where they send the text in an image attachment, to avoid filtering, and fill the message body with "bayes poison").
it would be great if tesseract could augment the gocr-based FuzzyOCR and OCR plugins for SpamAssassin.
careful, statements like that are likely to get you voted governor in some states.
In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules. They have everything written by humanity before that date to digitize: not just English language books and "classics," but government documents, records, foreign language texts, ancient manuscripts ... everything. That's as close to an un-finishable task as you can set yourself, I think.
Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.
With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, the majority of the code
in this distribution is now licensed under the Apache License:
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
Other Dependencies and Licenses:
The Aspirin/MIGRAINES system in the aspirin directory is separately
licensed thus:
#
NO WARRANTY
Since the Aspirin/MIGRAINES system is licensed free of charge,
Russell Leighton and the MITRE Corporation provide absolutely
no warranty. Should the Aspirin/MIGRAINES system prove defective,
you must assume the cost of all necessary servicing, repair or correction.
In no way will Russell Leighton or the MITRE Corporation be liable to you for
damages, including any lost profits, lost monies, or other
special, incidental or consequential damages arising out of
the use or inability to use the Aspirin/MIGRAINES system.
COPYRIGHT
This software is the copyright of Russell Leighton and the MITRE Corporation.
It may be freely used and modified for research and development
purposes. We require a brief acknowledgement in any research
paper or other publication where this software has made a significant
contribution. If you wish to use it for commercial gain you must contact
The MITRE Corporation for conditions of use. Russell Leighton and
the MITRE Corporation provide absolutely NO WARRANTY for this software.
August, 1992
Russell Leighton
The MITRE Corporation
7525 Colshire Dr.
McLean, Va. 22102-3481
Tesseract can also make use of the libtiff library. (www.libtiff.org)
Without libtiff, Tesseract can only read uncompressed and G3 compressed
TIFF files.
Why can't captchas just say "the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"? If someone can make a program that interprets that and gets the answer right after getting it off a captcha with OCR, then Google probably wants to know so they can hire them.
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.
Come on, 34 comments and no mention of A Wrinkle in Time?
I just checked out tesseract. One thing I have to look at more is the license. It appears to be the Apache license, which seems like a decent free license. But it also includes MITRE's aspirin. I'm not sure how dependent it is on aspirin and what the license restrictions of aspirin are.
The two best free OCR engines out right now are clara and gocr. While they are the best, they are not that great yet. I just ran the same tiff I had run with those two (I also have the document in pbm and other formats). Tesseract did not read it, it bailed with "IMAGE::check_legal_access:Error:Can't seek backwards in a buffered image!"
Clara and GOCR are written in C, Tesseract is written in C++, a language I don't know. Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though.
Hopefully this can be improved, or good parts of it can be borrowed and incorporated into gocr or clara. It couldn't handle my test that both clara and gocr could, but it probably has strengths the other two don't. One day hopefully we'll have a free OCR that handles things as automagically as the commercial ones do. I will see what I can contribute to that as well. Although this is C++ and I don't know that language.
I've tried all the previous open source stuff and it was pretty much unusable. The accuracy was so bad that it was just easier to start typing. I got a few of the Windows programs kind of working under WINE, but then I discovered Vividata and it worked really well and could be called from the command line. This meant I could write my own scripts that used it. I used it quite a bit for Project Gutenberg and was very impressed. It's not cheap, but if you want to do OCR under Linux and can afford it, I recommend it.
I would definitely prefer a Free Software solution so I'm excited about this development. Until this solution is really work-able (see the Google limitations, they're pretty serious), give Vividata a try.
You've got two constraints. One is that you have to be able to compose an arbitrarily large number of CAPTCHAs algorithmically. For example, that example you just used is human-composed. If it's the only CAPTCHA you have, the following program gets me a job at Google: gawk 'BEGIN{print "b"}'. If you have 100 CAPTCHAs, I only need to add a switch statement and some elbow grease and then I get to break your CAPTCHA a trillion times.
The other constraint is that your problem has to be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwich" is not generally on the Japanese English curriculum. Similarly, you could pose any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.
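The "switch statement and some elbow grease" attack can be sketched in a few lines: against a fixed pool of hand-written questions, a bot only needs a human to answer each distinct question once, and can replay the answer forever after. The pool below is hypothetical (the first entry paraphrases the question from this thread, the second is invented), as are the `MemoizingBot` and `run_attack` names.

```python
import random

# Hypothetical fixed pool of hand-written text CAPTCHAs (question -> answer).
POOL = {
    "First letter of the reverse of the 3-letter name for an underwater vehicle?": "b",
    "What is two plus three, written as a digit?": "5",
}


class MemoizingBot:
    """Asks a human about each new question once, then replays the answer."""

    def __init__(self, ask_human):
        self.ask_human = ask_human
        self.known = {}
        self.human_calls = 0

    def answer(self, question):
        if question not in self.known:
            self.human_calls += 1          # the "elbow grease"
            self.known[question] = self.ask_human(question)
        return self.known[question]


def run_attack(attempts, seed=0):
    rng = random.Random(seed)
    # POOL.get stands in for a person typing the answer the first time.
    bot = MemoizingBot(ask_human=POOL.get)
    questions = sorted(POOL)
    solved = 0
    for _ in range(attempts):
        q = rng.choice(questions)          # the site serves a random challenge
        solved += bot.answer(q) == POOL[q]
    return solved, bot.human_calls


solved, human_calls = run_attack(1000)
print(solved, "solved with", human_calls, "human answers")
```

The human effort is bounded by the size of the pool, not by the number of attempts, which is why the questions have to be generated rather than written by hand.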
By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/
Help poke pirates in the eyepatch, arr.
In order to pose the question, you have to generate it randomly. If it's not random, you already lost.
In order to generate it, you're going to end up using a grammar.
Running grammars in reverse is merely a matter of patience (to explore the space of problems the test program will pose) and the right tools; it's a fundamental bit of computer science.
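As a rough illustration of running a grammar in reverse, assume a toy generator with a single made-up question template. Once an attacker has seen enough samples to recover the template, a regular expression inverts it. The template, word list, and function names are all invented for this sketch; a real generator would be larger, but the idea is the same.

```python
import random
import re

# Toy "grammar": one template with two slots. Everything here is hypothetical.
ORDINALS = {"first": 0, "second": 1, "third": 2, "last": -1}
WORDS = ["tesseract", "gutenberg", "captcha", "bacon"]


def generate(rng):
    """Site side: produce a random question and its expected answer."""
    ordinal = rng.choice(sorted(ORDINALS))
    word = rng.choice(WORDS)
    question = f"What is the {ordinal} letter of the word '{word}'?"
    return question, word[ORDINALS[ordinal]]


# Attacker side: the same template, written as a pattern and run backwards.
PATTERN = re.compile(r"What is the (\w+) letter of the word '(\w+)'\?")


def solve(question):
    ordinal, word = PATTERN.fullmatch(question).groups()
    return word[ORDINALS[ordinal]]


rng = random.Random(1)
question, expected = generate(rng)
print(question, "->", solve(question), "(expected", expected + ")")
```

Adding more templates makes the attacker's pattern list longer, but it does not change the nature of the problem: each rule the generator can emit is a rule the solver can parse.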
Granted, expecting spammers to be conversant with the fundamental elements of computer science is a pretty high bar, but it only takes one to leap it and the rest to buy the program from him.
The image tests have the advantage that done properly, it takes more than just patience and computer science fundamentals to crack, it would require fundamental advances in the art.
(Note that nowhere in this message do I claim that image tests are perfect; in fact everything I know is vulnerable to the "feed it to a human in another context (viz, 'porn') and let them do the work" attack, and there are also points to be made about how widespread any given grammar/image test becomes; I know a website where the image test actually is a constant and so far it doesn't seem to be a problem because of scale issues. My point is that text tests have an additional disadvantage. It's not an intrinsically bad idea, though.)
Google wouldn't be interested in hiring people who could crack this, merely because they can crack this. Might make a decent interview question, though.
(You might also be tempted to think that you could just use a really complicated grammar, but you are constrained by two things, the human supposedly reading and taking the test, and the complexity of the human language itself. By the time you write some problem generator that could reliably throw off a parser, you'll be reliably confusing the hell out of your human users, too.)
Or write up a quick script to cut the images in half down the middle and save them as a series of other images.
- Give a man a fire and he's warm for a day, but set him on fire and he's warm for the rest of his life.
The very first CAPTCHA implementation was broken, but the funny thing about CAPTCHAs is that it takes absolutely no effort to make an image completely unreadable for current OCR software. And even if one certain implementation is broken, just add another layer of distortion. The human brain is capable of coping with it; OCR software usually is not.
And after all, it's not about authentication, it's about making a service accessible only for humans.
BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.
A monkey is doing the real work for me.
Actually, shortly thereafter, HP decided to get out of the technology innovation business and into the printer ink business.
If you want news from today, you have to come back tomorrow.
TH18 IS GRLAT NEWf4 FOR TH0Sj OF US U$1NZ BA) O(R RLCOGN1+ION!
THAHKS, G00GLL!1!!!
That's no problem! All I really need it to do is allow all of those geeks out there to share those great Playboy articles with me over p2p networks! I'm tired of just getting the filler photography! ;-)
You're a secretary? Do you do anal? If so, I can double your pay.
As there seems to be no documentation on the Sourceforge page about what this can actually do, does it learn or follow rules? If it learns, can it learn to recognize, say, Japanese characters?
The piece in question is a neural network simulator named Aspirin/MIGRAINES, presumably used for training. Pun away.
Also: *AA includes the MAA (Mathematical Association of America), the ADAA (Anxiety Disorders Association of America), the MSAA (Multiple Sclerosis Association of America), and the SCAA (Specialty Coffee Association of America).
The SCAA must be the ones responsible for not letting Java be open sourced.
Did you ever notice that *nix doesn't even cover Linux?
I am currently using the FuzzyOcr plugin to SpamAssassin, and it uses gocr to do the character recognition. To be sure, gocr is improving (the stable released version is practically useless, but the CVS version actually works, mostly), but if Tesseract is better, great!
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
"the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"
My wife would fail this test. My father will fail this test. My step-mother will fail this test. My children will fail this test.
A computer will very easily get this test right one time in 26.
In one word: Useless.
A good idea, and if significant amounts of text are in an image, I'd view the mail as dubious anyway.
If not because of spam, then because of the idiotic format. Images are for illustrations, but using them to transfer major amounts of text is just stupid and inefficient.
C - the footgun of programming languages
You're just not avant-garde enough.
Screen captured some text from the article, used XV to transform into tif, changing image to monochrome.
Input image: it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code
Output text: ii has been lamed as one of lhe mos! accurale Oplical Characler Recognilion (OCR) programs available. Having sat on lhe shelf galhering dusk for so many years, Google cleaned up some of lhe more ouldaled porlions of lhe code
I have no idea what kind of input is optimal, but for a first shot in the dark, that's not too shabby. I'll go play with it some more. (^,^)
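One way to put a number on "not too shabby" is a character error rate: the edit distance between the expected and recognized text, divided by the length of the expected text. The sketch below uses a plain Levenshtein distance and a short excerpt from the sample above; the function names are made up for the example, and this is not necessarily the metric the UNLV tests used.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        prev = cur
    return prev[-1]


def char_error_rate(expected, recognized):
    return levenshtein(expected, recognized) / len(expected)


# Excerpt from the input and output quoted above.
expected = "touted as one of the most accurate"
recognized = "lamed as one of lhe mos! accurale"

print(f"character error rate: {char_error_rate(expected, recognized):.1%}")
```

Most of the errors in the sample are a single confusion (`t` read as `l`), so a per-character rate makes the result look better than a per-word rate would.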
There is a piece of non-free software that runs quite well under Wine and exports nice MusicXML. You will find it linked to from http://www.recordare.com/software.html .
I really should ask google to help buy this technology and set it free.
If you think the software isn't entirely free, contact Sourceforge. Their conditions require that all hosted projects be free software.
-- Ed Avis ed@membled.com
Yes, by using contrasting colors that convert to the same tone in grayscale. A side effect being that most such technologies also shut out colorblind people...
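A small sketch of that trick, assuming the OCR front end converts to grayscale with the common Rec. 601 luma weights (an assumption; other converters use different weights, and the two colors below were picked by hand for this example):

```python
def luma(rgb):
    """Grayscale value of an (r, g, b) color using Rec. 601 weights."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


text_color = (255, 0, 0)   # bright red
background = (0, 130, 0)   # medium green

# Clearly different colors to most people, identical after conversion.
print(luma(text_color), luma(background))
```

After conversion the text and background are the same shade of gray, so the glyphs vanish. Red on green is also a pairing that many colorblind readers have trouble with, which is the side effect mentioned above.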
The name of the system you propose is called challenge/response (CR). CR is not a good idea for the following reasons:
1) It says "My time is more important than yours" to all your correspondents, because you're not willing to look at a few spams getting past your Bayesian filter every day so instead you offload that time burden to people who want to talk to you.
2) Dueling CR systems ("Hey, bob@example.com, I don't recognize you. Please prove you are a human" "Re: Hey, bob -- steve@stupid.com, I don't recognize you. Please prove you are a human"). Even more fun in a potentially infinite loop. Any system you can make to short-circuit this loop can be abused by spam to avoid the CR altogether.
3) Doesn't survive the Chinese Sweatshop Spam Attack, which will be ubiquitous if CR becomes popular. (Take poor Chinese person, teach them 10 words of English, pay them 2 cents an hour to answer CAPTCHAs so you get guaranteed delivery of your Maximize Your Mr. Wiggly offers.)
4) Breaks legitimate bulk mail senders, such as Amazon, Paypal, eBay, mailing lists, etc etc. Mailing lists in particular are going to be very fun, since a lot of CR systems would spam the entire list -- perhaps provoking 100 challenges! Which then leads to combinatorial hilarity!
Help poke pirates in the eyepatch, arr.
As someone who has been involved in applying OCR to real-world problems, I can say there's nothing trivial about generating good binary images from images taken in the field (in my case, images of boxes moving down a conveyor belt or hand-imaged by workers).
Even if you disregard such problems as uneven lighting, glare, and distortion due to the unavoidable vibration inherent to plant settings, most forms that are interesting to OCR are handwritten and not designed to be OCR-friendly. Hopefully this will change as the people who design such forms become more conscious of the capabilities of OCR, but even if that were to happen tomorrow, it would take years to complete the transition.
*sigh* back to work...
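The uneven-lighting problem mentioned above shows up even in a toy case. The sketch below uses a single made-up scanline whose background fades from bright to dim, with two dark ink pixels, and compares a fixed global threshold against a simple local-mean threshold. The pixel values, window size, and offset are invented for illustration; real field images need far more than this.

```python
def global_threshold(pixels, cutoff=128):
    """Indices classified as ink by a single fixed cutoff."""
    return [i for i, p in enumerate(pixels) if p < cutoff]


def local_threshold(pixels, radius=2, offset=25):
    """Indices that are clearly darker than the mean of their neighborhood."""
    ink = []
    for i, p in enumerate(pixels):
        window = pixels[max(0, i - radius): i + radius + 1]
        if p < sum(window) / len(window) - offset:
            ink.append(i)
    return ink


# Background fades from 200 to 90; ink (60 darker) sits at indices 2 and 9.
scanline = [200, 190, 120, 170, 160, 150, 140, 130, 120, 50, 100, 90]

print(global_threshold(scanline))  # also flags the dim background on the right
print(local_threshold(scanline))   # finds only the two ink pixels
```

The fixed cutoff mistakes the poorly lit right-hand background for ink, while the local comparison recovers just the two marks. Glare, vibration blur, and handwriting are each harder than this.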
I gave up on CAPTCHA; the spammers have some really good software that can deal with it. My site used to get about 5-10 bot registrations a day. So I changed tactics, and simply ask "Are you a bot? (don't answer this question!)". If they answer this question, registration is denied, no matter what e-mail address or IP they are using. This alone is 100% effective, but I do have some other questions as a backup, just in case. It's rather interesting how all these registrations seem to follow the same pattern, almost like there is only one decent 'spam package' out there.
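The "don't answer this question" check is easy to sketch on the server side. The form is modeled here as a plain dictionary and the field name `are_you_a_bot` is hypothetical; this is a guess at the general shape, not the poster's actual code.

```python
def accept_registration(form):
    """Return True if the registration should proceed.

    `form` is a dict of submitted fields. The trap field must be left
    empty: people read the label and skip it, form-filling bots tend
    to fill in everything they find.
    """
    trap = form.get("are_you_a_bot", "")  # hypothetical field name
    return trap.strip() == ""


human = {"email": "alice@example.com", "are_you_a_bot": ""}
bot = {"email": "spam@example.com", "are_you_a_bot": "no"}

print(accept_registration(human), accept_registration(bot))
```

Like any fixed trick, this works only as long as the spam software does not special-case it, which fits the observation that the bots all seem to come from one package.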
Are you insinuating that the 115th Congress won't try to enact a Chastity Bono Copyright Term Extension Act? Given Mexico's life plus 100 copyright term, the next step of "harmonization" for the United States and its trading partners is life plus 100 or, in the case of works made for hire, 125 years after publication.
Who's to say that publishers won't fight back against Gutenberg the way (ObTopic) they did against Google? It's only fair use if you can pay a judge to tell you that it is and if you can pay your lawyer to tell the judge to tell you that it is.
Err... How about Cygwin http://www.cygwin.com/ ?