Google Releases Tesseract as Open Source
An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at Sourceforge."
HOORAY! Good free OCR software is in short supply. I wonder if this will have a positive impact on Project Gutenberg?
“Common sense is not so common.” — Voltaire
This should be useful for adding anti-image spam capabilities to FOSS anti-spam programs.
The road to tyranny has always been paved with claims of necessity.
CAPTCHAs are specifically meant to break OCR... and if you RTFA, it says it does poorly with grayscale and color documents. Basically it's meant for reading typed text... like in a book.
Google cleaned up some of the more outdated portions of the code
i.e., added AdSense to the OCR output.
You're right! Let us never delve into research that could conceivably overturn weak software security! Some things man was never meant to discover! Turn back, before we fly too close to the sun and our wings melt!! O, Prometheus, why hast thou given us this OCR technology??
Did you ever notice that *nix doesn't even cover Linux?
My guess is that they are doing this in the hope the open source community will build on and improve OCR technology. This would be in Google's interest, as it can then index text from images (such as their own Books project) more accurately and efficiently.
www.shortman.com.au - top shorted stocks on the ASX
OCR is most effective when the letter boundaries are clear and well-defined, such as fixed-width text, or text that is at least on a straight line. Most CAPTCHAs put the letters on a curved path, as well as distorting the letters so they are no longer within a clearly defined rectangular shape. This makes it very hard to identify which parts of the images are letters and which parts are not, making OCRing CAPTCHAs a non-trivial problem.
Now I can finally see how to tell the difference between the 'A'-ness of 'A' and the 'P'-ness of 'P'!
(Credit to S.G.)
> It was open-sourced by HP and UNLV in 2005.
So Google basically did what? Fix bit-rot? Google has re-released some open source code, essentially forking off the original?
> License: (None Listed)
I'm a fan of the FOSS idea. Basically, that makes sure that the whole work to which I contributed always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see is a License after I see some code.
So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.
Quidquid latine dictum sit, altum videtur
Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available.
Yeah, but how is it on lip-reading? That's when we really need to worry.
Push Button, Receive Bacon
Is there any particular reason google isn't hosting the project themselves?
Developers: We can use your help.
I thought Google was opening up their own open source repository: http://www.newsforge.com/article.pl?sid=06/07/27/1833251
They're already useless if installing one will subject your business to boycotts and/or lawsuits from National Federation of the Blind and other advocates for people with disabilities.
Plus, good OCR could help recognize image spam (where they send the text in an image attachment, to avoid filtering, and fill the message body with "bayes poison").
Should we praise technology that helps Project Gutenberg run out of pre-1923 books faster? Once all notable pre-1923 books are scanned, OCR'd, and cleaned up, then what does PG do?
it would be great if tesseract could augment the gocr-based FuzzyOCR and OCR plugins for SpamAssassin.
No binaries! Only source code! Good luck getting it to compile on Windows, I gave up after I got several dozen obscure errors I had never seen before from the compiler.*
* If anyone can get VC++2K5 to compile it, please post.
While Slashdot has always been a target for trolls and miscreants, I don't ever remember it being a spammer's destination (note 4-digit UID). Even back in those crazy, hazy days when we didn't have to try to interpret some bizarro text -- AKA the vast bulk of Slashdot's existence -- somehow spammers were thwarted in their evil quest. Was Slashdot just feeling a bit left out, and just had to stick a CAPTCHA in there to be just like everyone else ("See!? Spammers like us too!")?
CAPTCHAs should be replaced by forcing answers to submitted homework questions - kids get their homework done for them on a distributed network, and it somewhat proves that there's a human on the other end (no machine could interpret most homework questions).
careful, statements like that are likely to get you voted governor in some states.
In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules. They have everything written by humanity before that date to digitize: not just English language books and "classics," but government documents, records, foreign language texts, ancient manuscripts ... everything. That's as close to an un-finishable task as you can set yourself, I think.
Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.
With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, the majority of the code
in this distribution is now licensed under the Apache License:
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.
Other Dependencies and Licenses:
The Aspirin/MIGRAINES system in the aspirin directory is separately
licensed thus:
#
NO WARRANTY
Since the Aspirin/MIGRAINES system is licensed free of charge,
Russell Leighton and the MITRE Corporation provide absolutely
no warranty. Should the Aspirin/MIGRAINES system prove defective,
you must assume the cost of all necessary servicing, repair or correction.
In no way will Russell Leighton or the MITRE Corporation be liable to you for
damages, including any lost profits, lost monies, or other
special, incidental or consequential damages arising out of
the use or inability to use the Aspirin/MIGRAINES system.
COPYRIGHT
This software is the copyright of Russell Leighton and the MITRE Corporation.
It may be freely used and modified for research and development
purposes. We require a brief acknowledgement in any research
paper or other publication where this software has made a significant
contribution. If you wish to use it for commercial gain you must contact
The MITRE Corporation for conditions of use. Russell Leighton and
the MITRE Corporation provide absolutely NO WARRANTY for this software.
August, 1992
Russell Leighton
The MITRE Corporation
7525 Colshire Dr.
McLean, Va. 22102-3481
Tesseract can also make use of the libtiff library. (www.libtiff.org)
Without libtiff, Tesseract can only read uncompressed and G3 compressed
TIFF files.
Why can't captchas just say "the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"? If someone can make a program that interprets that and gets the answer right after getting it off a captcha with OCR, then Google probably wants to know so they can hire them.
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.
Come on, 34 comments and no mention of A Wrinkle in Time?
Specifically like Google Books, I bet. Unless the book is multi-column, then fuck it and we'll wait for the single column edition.
I just checked out tesseract. One thing I have to look at more is the license. It appears to be the Apache license, which seems like a decent free license. But it also includes MITRE's aspirin. I'm not sure how dependent it is on aspirin and what the license restrictions of aspirin are.
The two best free OCR engines out right now are clara and gocr. While they are the best, they are not that great yet. I just ran the same tiff I had run with those two (I also have the document in pbm and other formats). Tesseract did not read it, it bailed with "IMAGE::check_legal_access:Error:Can't seek backwards in a buffered image!"
Clara and GOCR are written in C, Tesseract is written in C++, a language I don't know. Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though.
Hopefully this can be improved, or good parts of it can be borrowed and incorporated into gocr or clara. It couldn't handle my test that both clara and gocr could, but it probably has strengths the other two don't. One day hopefully we'll have a free OCR that handles things as automagically as the commercial ones do. I will see what I can contribute to that as well. Although this is C++ and I don't know that language.
I've tried all the previous open source stuff and it was pretty much unusable. The accuracy was so bad that it was just easier to start typing. I got a few of the Windows programs kind of working under WINE, but then I discovered Vividata and it worked really well and could be called from the command line. This meant I could write my own scripts that used it. I used it quite a bit for Project Gutenberg and was very impressed. It's not cheap, but if you want to do OCR under Linux and can afford it, I recommend it.
I would definitely prefer a Free Software solution so I'm excited about this development. Until this solution is really work-able (see the Google limitations, they're pretty serious), give Vividata a try.
I downloaded and tried compiling it in OS X and got some Linux-specific build problems. I'm no code guru so I gave up as well. But then, even Linux doesn't support the `make install` process, as claimed by the `./configure` script's output.
From the license file: "It may be freely used and modified for research and development purposes. We require a brief acknowledgement in any research paper or other publication where this software has made a significant contribution. If you wish to use it for commercial gain you must contact The MITRE Corporation for conditions of use."
The condition would be to solve a puzzle given as text instead of reading an image meant to be as confusing as possible. Some forums have very bad systems for this, and sometimes I have to register multiple times before actually getting a CAPTCHA image that I can read.
Copyright infringement is "piracy" in the same way DRM is "consumer rape"
You've got two constraints. One is that you have to be able to compose an arbitrarily large number of CAPTCHAs algorithmically. For example, that example you just used is human-composed. If it's the only CAPTCHA you have, the following program gets me a job at Google: gawk 'BEGIN{print "b"}' . If you have 100 CAPTCHAs, I only need to add a switch statement and some elbow grease and then I get to break your CAPTCHA a trillion times.
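The "switch statement" attack can be sketched in a few lines of Python. This is only an illustration under assumptions: the second question and the names `SUB_QUESTION`, `KNOWN_ANSWERS`, and `solve` are hypothetical, and a real site would word its questions differently.

```python
# Hypothetical pool of hand-written text CAPTCHAs and their answers.
# The attacker solves each one by hand once, then replays the answer forever.
SUB_QUESTION = (
    "the letter at the beginning of the word that is spelled the reverse "
    "of the 3 letter name for an underwater vehicle or sandwich"
)

KNOWN_ANSWERS = {
    SUB_QUESTION: "b",  # "sub" reversed is "bus"
    "the number of legs on a spider, written as a digit": "8",  # made-up example
}


def normalize(question: str) -> str:
    """Collapse case and whitespace so trivial reformatting doesn't matter."""
    return " ".join(question.lower().split())


# Build the lookup table once, keyed on the normalized question text.
LOOKUP = {normalize(q): a for q, a in KNOWN_ANSWERS.items()}


def solve(question: str):
    """Return the stored answer, or None if the question has not been seen."""
    return LOOKUP.get(normalize(question))
```

With a fixed pool, the cost to the attacker is one human answer per question, which is why the pool has to be generated rather than written by hand.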
The other constraint is that you have to have your problem be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwich" is not generally on the Japanese English curriculum. Similarly, you could pose any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.
By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/
Help poke pirates in the eyepatch, arr.
In order to pose the question, you have to generate it randomly. If it's not random, you already lost.
In order to generate it, you're going to end up using a grammar.
Running grammars in reverse is merely a matter of patience (to explore the space of problems the test program will pose) and the right tools; it's a fundamental bit of computer science.
Granted, expecting spammers to be conversant with the fundamental elements of computer science is a pretty high bar, but it only takes one to leap it and the rest to buy the program from him.
The image tests have the advantage that done properly, it takes more than just patience and computer science fundamentals to crack, it would require fundamental advances in the art.
(Note that nowhere in this message do I claim that image tests are perfect; in fact everything I know is vulnerable to the "feed it to a human in another context (viz, 'porn') and let them do the work" attack, and there are also points to be made about how widespread any given grammar/image test becomes; I know a website where the image test actually is a constant and so far it doesn't seem to be a problem because of scale issues. My point is that text tests have an additional disadvantage. It's not an intrinsically bad idea, though.)
Google wouldn't be interested in hiring people who could crack this, merely because they can crack this. Might make a decent interview question, though.
(You might also be tempted to think that you could just use a really complicated grammar, but you are constrained by two things, the human supposedly reading and taking the test, and the complexity of the human language itself. By the time you write some problem generator that could reliably throw off a parser, you'll be reliably confusing the hell out of your human users, too.)
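To illustrate the point about running grammars in reverse, here is a toy Python sketch. The one-production grammar, the word list, and the function names are all made up for the example; a real question generator would be larger, but the solver grows with the number of productions, not with the number of questions.

```python
import random
import re

# Hypothetical toy grammar with a single production:
#   QUESTION -> "what is the " POSITION " letter of the word " WORD
POSITIONS = {"first": 0, "last": -1}
WORDS = ["bus", "tesseract", "gutenberg", "scanner"]


def generate(rng: random.Random):
    """Generate a (question, answer) pair from the toy grammar."""
    position = rng.choice(sorted(POSITIONS))
    word = rng.choice(WORDS)
    question = f"what is the {position} letter of the word {word}"
    return question, word[POSITIONS[position]]


# The grammar "in reverse": one pattern per production is enough to parse
# anything the generator can emit, without knowing the word list.
PATTERN = re.compile(r"what is the (first|last) letter of the word (\w+)")


def solve(question: str):
    """Parse a generated question and compute its answer, or return None."""
    match = PATTERN.fullmatch(question)
    if match is None:
        return None
    position, word = match.groups()
    return word[POSITIONS[position]]


if __name__ == "__main__":
    rng = random.Random()
    question, answer = generate(rng)
    print(question, "->", solve(question), "(expected", answer + ")")
```

The solver never sees `WORDS`, so adding more vocabulary to the generator does not help the defender; only adding new productions does, and each one costs the attacker one more pattern.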
Or write up a quick script to cut the images in half down the middle and save them as a series of other images.
- Give a man a fire and he's warm for a day, but set him on fire and he's warm for the rest of his life.
some states?
Statements like that are likely to get you elected to Congress or the Presidency...
It's the same kind of logic as "We can't find them, thus they must be there..."
The very first CAPTCHA implementation was broken, but the funny thing about CAPTCHAs is that it's absolutely no effort to make an image completely unreadable for current OCR software. And even if one certain implementation is broken, just add another layer of distortion. The human brain is capable of coping with it; OCR software usually is not.
And after all, it's not about authentication, it's about making a service accessible only for humans.
BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.
A monkey is doing the real work for me.
Actually, shortly thereafter, HP decided to get out of the technology innovation business, and into the printer ink business.
If you want news from today, you have to come back tomorrow.
TH18 IS GRLAT NEWf4 FOR TH0Sj OF US U$1NZ BA) O(R RLCOGN1+ION!
THAHKS, G00GLL!1!!!
Naw, more like trollish babbling. OCR doesn't handle curving lines and distorted letters well. If you want to make yourself seem intelligent, at least research your shit first and try to stay on topic. :)
That's no problem! All I really need it to do is allow all of those geeks out there to share those great Playboy articles with me over p2p networks! I'm tired of just getting the filler photography! ;-)
This story is somewhat timely for me. I am secretary of a club, we have a large quantity of documents collected over the last 20 years or so, some hand written, some typed, forms, invoices, minutes of meetings, letters sent to and from etc etc. There are a LOT of documents.
Lately I've been thinking about computerizing these documents into a web based system, so that any of the club executive can search and pull out a document they need etc. We could also flag documents as "general release" so that people could read interesting stuff from our past. And of course it would also serve as a secure backup of our documents, in case of fire, theft, alien invasion...
I think what is needed is a rough OCR system, that is, an OCR system that's not trying to be perfect, but can at least make about 50% accuracy on both typed and handwritten (without training!) documents, and preferably where it wasn't pretty certain it was correct, it would just skip words. The idea being that I'd run each document (big job, but doesn't matter if it takes a year) through a scanner, OCR it to get some searchable content, then store it as a PDF, or jpeg or something.
Anybody know of such an (open source, or at least free as in beer) OCR system?
NZ Electronics Enthusiasts: Check out my Trade Me Listings
Only if the CAPTCHA makers don't test it through tesseract beforehand...
As there seems to be no documentation on the Sourceforge page about what this can actually do, does it learn or follow rules? If it learns, can it learn to recognize, say, Japanese characters?
I tell ya, it'd be friggin' sweet if someone would work on making a functional Music OCR program. Scanning a score using the piece-of-crap Photoscore into (the not-so-piece-of-crap) Sibelius always ends up taking longer than actually inputting the music manually. I don't know about others who dabble in this software, but I'm sick and tired of a piece of dust being interpreted as a meter change.
The piece in question is a neural network simulator named Aspirin/MIGRAINES, presumably used for training. Pun away.
Also: *AA includes the MAA (Mathematical Association of America), the ADAA (Anxiety Disorders Association of America), the MSAA (Multiple Sclerosis Association of America), and the SCAA (Specialty Coffee Association of America).
The SCAA must be the ones responsible for not letting Java be open sourced.
Captchas are designed to be difficult to OCR. Besides there are plenty of OCR apps around already, if you hadn't noticed. I don't think spammers have been holding out for a GPL one.
I am currently using the FuzzyOcr plugin to SpamAssassin, and it uses gocr to do the character recognition. To be sure, gocr is improving (the stable released version is practically useless, but the CVS version actually works, mostly), but if Tesseract is better, great!
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
"the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"
My wife would fail this test. My father will fail this test. My step-mother will fail this test. My children will fail this test.
A computer will very easily get this test right one time in 26.
In one word: Useless.
"Currently it builds under Linux with gcc2.95 and under Windows with VC++6". In other words, it won't compile under Mac OS X... yet ;)
The bits on the bus go on and off... on and off... on and off...
Also: *AA includes the MAA (Mathematical Association of America), the ADAA (Anxiety Disorders Association of America), the MSAA (Multiple Sclerosis Association of America), and the SCAA (Specialty Coffee Association of America).
and also the GNAA (Gay Nigger Association of America)
Don't ask me what's my point in mentionning this because I have no fucking idea :-) have a good day!
You just got troll'd!
A good idea, and if significant amounts of text are in an image, I'd view the mail as dubious anyway.
If not because of spam, then because of the idiotic format. Images are for illustrations, but using them to transfer major amounts of text is just stupid and inefficient.
C - the footgun of programming languages
and I've never heard of this thing.
Guess I should have got out of my cube more.
ccalam - acoustic versions of new songs.
Screen captured some text from the article, used XV to transform into tif, changing image to monochrome.
Input image: it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code
Output text: ii has been lamed as one of lhe mos! accurale Oplical Characler Recognilion (OCR) programs available. Having sat on lhe shelf galhering dusk for so many years, Google cleaned up some of lhe more ouldaled porlions of lhe code
I have no idea what kind of input is optimal, but for a first shot in the dark, that's not too shabby. I'll go play with it some more. (^,^)
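To put a rough number on "not too shabby", the input and output strings above can be compared with Python's standard `difflib`. This is only a sketch: the `similarity` figure is difflib's own ratio, not a standard OCR accuracy metric, and the helper names are mine.

```python
from difflib import SequenceMatcher

# The text that was screen-captured, and what the OCR run returned,
# copied from the post above.
expected = (
    "it has been touted as one of the most accurate Optical Character "
    "Recognition (OCR) programs available. Having sat on the shelf "
    "gathering dust for so many years, Google cleaned up some of the "
    "more outdated portions of the code"
)
actual = (
    "ii has been lamed as one of lhe mos! accurale Oplical Characler "
    "Recognilion (OCR) programs available. Having sat on lhe shelf "
    "galhering dusk for so many years, Google cleaned up some of lhe "
    "more ouldaled porlions of lhe code"
)


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of matching characters, as defined by difflib."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def substitutions(a: str, b: str):
    """List (expected, got) pairs for every replaced run of characters."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]


if __name__ == "__main__":
    print(f"similarity: {similarity(expected, actual):.3f}")
    for want, got in substitutions(expected, actual):
        print(f"{want!r} -> {got!r}")
```

Reading the two strings by eye, most of the errors look like a lowercase "t" being read as "l", which suggests a systematic confusion on this font rather than random noise.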
I don't get it. Isn't everything released on SourceForge supposed to be under a free license? Then how come this is released under no license? Perhaps I'm not looking on the right pages, but I can't seem to find anything besides the "none listed" on the main page of the project.
Usage: tesseract inputfile.tif [path/]outputfilename batch
Without "batch", it tries to bring up an X window, but that just quickly goes away with no debug output.
I found that I needed to use grayscale tif files, for one, and "output" is the output filename, where you'll get:
outputFilename.raw #???
outputFilename.map # seems to be a location map of 0/1's where 1's are valid text and 0's aren't
outputFilename.txt # the text from the OCR event
I also found that the tessdata directory did not get installed into the /usr/local/bin directory on "make install", so I copied that directory from the build directory to get it to work.
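For scripting, the usage line above could be wrapped in Python roughly as follows. This is a sketch based only on what this comment reports for this release: the trailing `batch` argument and the `.txt` output suffix are taken from the post, not from any documentation, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(image: str, output_base: str, exe: str = "tesseract"):
    """Build the argument list for the usage line quoted above."""
    # "batch" is the third argument this comment reports as necessary to
    # avoid the X window; it is an assumption taken from the post.
    return [exe, image, output_base, "batch"]


def ocr(image: str, output_base: str) -> str:
    """Run the engine on one TIFF file and return the recognized text."""
    subprocess.run(build_command(image, output_base), check=True)
    # The post reports that the recognized text lands in <output_base>.txt.
    return Path(output_base + ".txt").read_text(errors="replace")


if __name__ == "__main__":
    # Requires the tesseract binary on PATH and a page.tif in the
    # current directory; both are assumptions of this example.
    print(ocr("page.tif", "page_out"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.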
LoB
"Anyone who stands out in the middle of a road looks like roadkill to me." --Linus
I suppose "audible captchas" should be feasible. That is, if you can't see the picture, the captcha server also has an audio file with the same information. I'd be surprised if this doesn't exist already in some form.
While it may be nice to have the source of a tesseract, however, those can only be built in a 4-dimensional space. So where do I get the build environment?
The Tao of math: The numbers you can count are not the real numbers.
I always knew Google were powerful. I did not, however, know they had the power to open source the 4-dimensional analog of the (3-dimensional) cube, where motion along the fourth dimension is often a representation for bounded transformations of the cube through time.
"No, no, no, don't tug on that! You never know what it might be attached to."
that F/OSS isn't anti-business. It just works with different business models.
Google's business interest in releasing this as open source is obvious: the greater the value of the materials available to the Internet, the greater the value of its service.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Your title and post make you sound like you think this shouldn't be released open source, just in case spammers use it.
Well, then OOo will have to stop releasing their office suite: just think, Base could be used to store e-mail addresses to spam! Or, maybe no open source e-mail clients should be released, because the spammers might use it to send spam!
Don't blame the software for the way it is used; It's the user's fault if (s)he decides to use it malevolently. Most software has the potential for misuse, some more than others, but that doesn't mean that fear of spam should stop tools that have a chance to be misused being released. Just think of the positive uses of programs like this.
Besides, it's more than easy enough for spammers to just make a program to do stuff like break CAPTCHAs (yes, I know they're designed to defeat spammers, but nothing's perfect).
and don't forget the ADA, the Dyslexics Association of America
...and part of a good CAPTCHA is causing these transformations to come up with useless output.
Yes, by using contrasting colors that convert to the same tone in grayscale. A side effect being that most such technologies also shut out colorblind people...
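A quick Python sketch of why this works. It assumes the grayscale conversion uses the common Rec. 601 luma weights; an actual OCR pipeline may convert differently, and the two colours are just illustrative picks.

```python
def luma(rgb):
    """Approximate grayscale value using the common Rec. 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


# Illustrative colours only: pure red text on a dark green background.
# They look very different in colour to most viewers.
text_colour = (255, 0, 0)
background = (0, 130, 0)

# After conversion the two are almost the same shade of grey,
# so a threshold step has nothing to separate.
contrast = abs(luma(text_colour) - luma(background))

if __name__ == "__main__":
    print(f"luma difference: {contrast:.2f} out of 255")
```

Red against green is also one of the pairs that many colorblind users cannot tell apart, which is the side effect mentioned above.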
The name of the system you propose is called challenge/response (CR). CR is not a good idea for the following reasons:
1) It says "My time is more important than yours" to all your correspondents, because you're not willing to look at a few spams getting past your Bayesian filter every day so instead you offload that time burden to people who want to talk to you.
2) Dueling CR systems ("Hey, bob@example.com, I don't recognize you. Please prove you are a human" "Re: Hey, bob -- steve@stupid.com, I don't recognize you. Please prove you are a human"). Even more fun in a potentially infinite loop. Any system you can make to shortcircuit this loop can be abused by spam to avoid the CR altogether.
3) Doesn't survive the Chinese Sweatshop Spam Attack, which will be ubiquitous if CR becomes popular. (Take poor Chinese person, teach them 10 words of English, pay them 2 cents an hour to answer CAPTCHAs so you get guaranteed delivery of your Maximize Your Mr. Wiggly offers.)
4) Breaks legitimate bulk mail senders, such as Amazon, Paypal, eBay, mailing lists, etc etc. Mailing lists in particular are going to be very fun, since a lot of CR systems would spam the entire list -- perhaps provoking 100 challenges! Which then leads to combinatorial hilarity!
Plus, IIRC CAPTCHAs don't really work anyway.
Everything in moderation, including moderation itself
As someone who has been involved in applying OCR to real world problems, there's nothing
trivial about generating good binary images from images taken in the field (in my case,
images of boxes moving down a conveyor belt or hand imaged by workers).
Even if you disregard such problems as uneven lighting, glare, and distortion due to the
unavoidable vibration inherent to plant settings, most forms that are interesting to
OCR are handwritten and not designed to be OCR friendly. Hopefully this will change as
the people who design such forms become more conscious of the capabilities of OCR, but
even if that were to happen tomorrow, it would take years to complete the transition.
*sigh* back to work...
I gave up on CAPTCHA, the spammers have some really good software which can deal with this. My site used to get about 5-10 bot registrations a day. So I changed tactics, and simply ask "Are you a bot? (don't answer this question!)". If they answer this question, registration is denied, no matter what e-mail address or IP they are using. This alone is 100% effective, but I do have some other questions as a backup, just in case. It's rather interesting how all these registrations seem to follow the same pattern, almost like there is only one decent 'spam package' out there.
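A minimal Python sketch of that check, assuming the registration form arrives as a dict of strings. The field name `are_you_a_bot` and the function name are hypothetical.

```python
def is_probably_bot(form: dict, trap_field: str = "are_you_a_bot") -> bool:
    """Flag a submission if the trap question was answered at all."""
    # Humans are told to leave the field empty; a bot that fills in every
    # input it finds will put something here.
    return bool(form.get(trap_field, "").strip())


if __name__ == "__main__":
    human = {"email": "someone@example.com", "are_you_a_bot": ""}
    bot = {"email": "spam@example.com", "are_you_a_bot": "no"}
    print(is_probably_bot(human), is_probably_bot(bot))
```

This only stops bots that blindly fill every field; one written against this particular form would pass, which is presumably why the backup questions are there.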
Those were my thoughts exactly: why release it on Sourceforge? Unless they don't have any faith in their own code repository.
http://www.fanboy.co.nz/adblock/
As the linked article states, there are commercial OCR programs that are far more accurate.
--Rob
Towards the Singularity.
Does anyone here know how to get it to install and run on SuSE 10.0? The instructions are a little confusing. If you can't use make install, what do you use?
..
./configure returns "error in line 1329" and "make install has not been implemented yet avoid using."
From INSTALL
"4. Type `make install' to install the programs and any data files and documentation."
Running
README has this to say "The executable must reside in the same directory as the tessdata directory The command line is: tesseract image.tif batch"
Trying to run it, a window pops up briefly and then disappears.
davecb5620@gmail.com
Are you insinuating that the 115th Congress won't try to enact a Chastity Bono Copyright Term Extension Act? Given Mexico's life plus 100 copyright term, the next step of "harmonization" for the United States and its trading partners is life plus 100 or, in the case of works made for hire, 125 years after publication.
Who's to say that publishers won't fight back against Gutenberg the way (ObTopic) they did against Google? It's only fair use if you can pay a judge to tell you that it is and if you can pay your lawyer to tell the judge to tell you that it is.
Except that estates of authors of well-known works tend to be stricter than that. I'm willing to bet that there won't be enough books 1. which are notable, 2. whose copyright is abandoned by the author or his estate, and 3. which are not already published electronically by the author, to keep Project Gutenberg and the public-domain part of Google Book Search busy after the Chastity Bono Act comes into effect.
well, that is quite a stretch but just maybe send the info from Mindstorms to host so that the robots can read
As with most problems in computer science, this can be solved by a one-line perl program:
perl -e 'print "b\n"'
-- Charles Reindorf
lol, how is that flamebait, is it because I said the word nigger?
For those folks the blogs on www.livejournal.com have an audio version of CAPTCHA.
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
...or mothers... but please, this was not Flamebait. It's called humor, and if it's not funny, just don't laugh. It's not like he posted some big GNAA ASCII SNAFU.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Or it could be since this is a RE-release. the original release was on sourceforge, so they kept it that way.
Yes go ahead click the link. Its kosher
That's never happened before!
I know for a fact because I saw it myself that HP Research Labs Bristol had a fully working implementation of page layout analysis together with OCR in 1994. It was impressive. It handled all the usual page layout issues such as multiple columns and page skew. I've no idea whether that's true or not (and after a quick Google, it looks like finding out would be too much like hard work), but it's certainly interesting....
Man, that tool is old.
I know some people who work in the department where it was created and I think the consensus is that no one has thought about that tool in a long time (as there are much better ones now).
I think if there was some pressure from users of Tesseract, it would be quite possible to push through a request to re-release it under the Apache license.
I think the biggest hurdle there would be the paperwork and explaining why it'd be a nice gesture to the people who have to sign the forms (managers, corporate, etc.). We have a pretty hefty PR/licensing process since most of our work is delivered to our sponsors (government).
But it's not like anyone considers it some kinda asset. In fact, if anyone asked for support for it there'd be groaning because no one has touched it in over a DECADE.
But yeah, I encourage users of Tesseract to send snail-mail letters explaining the issue to the Neuroscience folks there at the Washington location.
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
Apparently the OP thinks the entire world lives and breathes *NIX, so much so that he couldn't be bothered to mention the OS platform requirement? Thanks for wasting the time of those readers who may not yet have a Linux system with which to use it.
Maybe it's harder to renege on a release if it's not hosted on their network?
Program Intellivision!