Google Releases Tesseract as Open Source

I take back every bad thing I said about Google by OrangeTide · 2006-09-04 15:30 · Score: 4, Interesting

HOORAY! Good free OCR software is in short supply. I wonder if this will have a positive impact on Project Gutenberg?

--
“Common sense is not so common.” — Voltaire

Re:I take back every bad thing I said about Google by Commie1 · 2006-09-04 21:42 · Score: 2, Interesting

I've been using Tesseract for a PG project for a few weeks now and, as TFA says, it's not as good
as some commercial ones out there. ABBYY FineReader seems to be the OCR software of choice for
Distributed Proofreaders, at least.
Tesseract just has ASCII support (for now, as they like to add), so it ignores italics, accents etc.
In the case of the book I'm working on, it had a very hard time with the ff ligature and had some
trouble with b and c: "but" became "hut", "he" became "be", and c was often read as o or e.
The words difficult, office and scientific were the standard pitfalls. On some pages it was nearly flawless though.
The biggest advantages to me are clearly that it is free*, it's good enough and I can use it on my preferred OS.

* Mostly Apache License v2.0, a part of it is under a "freely use and modify for research and development purposes" license however.

Anti-spam by Bacon+Bits · 2006-09-04 15:30 · Score: 2, Interesting

This should be useful for adding anti-image spam capabilities to FOSS anti-spam programs.

--
The road to tyranny has always been paved with claims of necessity.

Re:Anti-spam by Phroggy · 2006-09-04 19:08 · Score: 2, Interesting

Following that logic, wouldn't this then also be just as useful a tool to spammers looking to crack those crazy registration verification images?

Yes, absolutely, and spammers are already using image obfuscation techniques: using italic difficult-to-read fonts spaced very close together (difficult to separate the image into individual characters and difficult to identify each character once you do), using colored backgrounds to make the text very low-contrast when converted into a monochrome image the OCR software can use, using animated GIFs (as mentioned previously in another article) so that if you only convert the first or last frame of the animation you won't get anything useful, and finally splitting the image into multiple pieces that are assembled together with HTML. The only solution I see to this last problem is to develop spam filtering software that uses Gecko or KHTML to render the HTML and analyze the rendered page.
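The low-contrast trick is easy to see in a toy example. The sketch below assumes the filter converts each pixel to grayscale with the usual luminance weights and then applies one fixed threshold; real OCR front ends may binarize differently, and the colors and function names here are made up for illustration.

```python
def luminance(rgb):
    """Approximate perceived brightness (0-255) of an (r, g, b) pixel."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def to_monochrome(pixels, threshold=128):
    """Map every pixel to 1 (light) or 0 (dark) using a fixed threshold."""
    return [[1 if luminance(p) >= threshold else 0 for p in row]
            for row in pixels]


TEXT = (200, 60, 60)        # reddish "ink", hypothetical spam colors
BACKGROUND = (60, 140, 60)  # greenish background of similar brightness

# One row of a tiny image: background, text, text, background.
spam_row = [[BACKGROUND, TEXT, TEXT, BACKGROUND]]
plain_row = [[(255, 255, 255), (0, 0, 0), (0, 0, 0), (255, 255, 255)]]

print(to_monochrome(spam_row))   # text and background land on the same side
print(to_monochrome(plain_row))  # black on white stays distinguishable
```

A human sees red on green without effort, but after this kind of conversion there is nothing left for the OCR step to read.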

In the war between spammers and anti-spammers, the spammers are clearly winning, and they will continue to win for the foreseeable future. No technical solution can stop spam, only certain limited types of spam - but the spammers are constantly adapting. I believe if Congress were to earmark funding for the investigation and prosecution of spammers, we could actually make a significant dent in the problem (other governments have already expressed a willingness to cooperate).

It's difficult to legally define spam in such a way that makes spam illegal without infringing the right to freedom of speech and press, and I believe we need to err on the side of protecting liberty at the expense of some spam being legal. This is what CAN-SPAM has done - it's far from perfect, but it's a good start. CAN-SPAM has gotten a lot of criticism for being too easy for spammers to work around, but how much spam do you get that actually complies with the law? Not much... so why aren't we prosecuting violators right and left? Limited resources. Given the choice between tracking down a spammer and tracking down a murderer/rapist/child molester/etc., both of which cost money, most of us recognize that spam is a less severe problem. More resources need to be allocated to the appropriate law enforcement agencies so they can deal with both.

Oh, and if my argument about CAN-SPAM was unconvincing, consider this: nearly all the image-based spam I've been seeing lately has been either for penny stocks or prescription medications (i.e. "male enhancement" products). Both of these are already clearly illegal (the SEC and FDA are the respective government agencies responsible, I believe). It should be possible to prosecute the spammers for stock market manipulation and dispensing controlled drugs without a prescription, even if sending spam weren't against the law.

Some here will call for spammers to be sentenced to life in prison without parole, execution, castration, public hanging, public stoning, or worse. Get over it. Forget about revenge, that's not what our criminal justice system is for. All I want is for the spam to stop, and for the spammer to lose whatever they've gained from it. That should be enough. Let the spammers become productive members of society if they're capable of doing so; lock them up if they cause any further trouble.

Is this the final solution? No, of course not. But let's start with this, and see how it goes for now. If this works, spam won't go away, it will just change into new forms... and that's OK. When that happens, we can find new ways of dealing with it. The hope is that after that happens a few times, it will become much less of a problem. Maybe not, but I'd sure like to find out!

--
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;

Hosting by truthsearch · 2006-09-04 15:44 · Score: 5, Interesting

Is there any particular reason google isn't hosting the project themselves?

--
Developers: We can use your help.

NFB owns you by tepples · 2006-09-04 15:48 · Score: 4, Interesting

CAPTCHAs have been very effective in stopping spammers in the past, but if they can now just read them and answer correctly, then they are effectively rendered useless ...

They're already useless if installing one will subject your business to boycotts and/or lawsuits from National Federation of the Blind and other advocates for people with disabilities.

Un-Finishable by Kadin2048 · 2006-09-04 16:09 · Score: 5, Interesting

In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules. They have everything written by humanity before that date to digitize: not just English language books and "classics," but government documents, records, foreign language texts, ancient manuscripts ... everything. That's as close to an un-finishable task as you can set yourself, I think.

Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.

With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Re:Un-Finishable by HuguesT · 2006-09-04 22:05 · Score: 2, Interesting

This is patently false. New stuff comes out of copyright every day. However, coming out of copyright is not the same thing as becoming available to the public. Clearly this is where Project Gutenberg comes in.

One enormous area I'm personally interested in is sheet music. Some of the music I'm interested in playing came out of copyright decades or even centuries ago. No one is going to reclaim copyright on Mozart's Requiem, for instance. Yet it is by and large not available to the public because translating original manuscript sheet music into something that modern musicians can play without too much trouble is a huge undertaking.

Yet I have no doubt that this will eventually happen. PG already has a section devoted to sheet music. The tools are beginning to appear: LilyPond is a superb Free music engraving software package. I'm personally working on music OCR software, and others are as well I'm sure. Eventually this will work out well I think.

The public is in the process of reclaiming what is theirs, this is pretty much unstoppable right now.

Re:As much as I like open source software ... by Millenniumman · 2006-09-04 16:23 · Score: 2, Interesting

Why can't captchas just say "the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"? If someone can make a program that interprets that and gets the answer right after getting it off a captcha with OCR, then Google probably wants to know so they can hire them.

--
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.

No Wrinkle in Time comments? by reaktor · 2006-09-04 16:35 · Score: 2, Interesting

Come on, 34 comments and no mention of A Wrinkle in Time?

my thoughts by br00tus · 2006-09-04 16:43 · Score: 3, Interesting

I would love to use a free (speech and beer) OCR engine that works as well as a commercial one, or even nearly as good as a commercial one.

I just checked out tesseract. One thing I have to look at more is the license. It appears to be the Apache license, which seems like a decent free license. But it also includes MITRE's aspirin. I'm not sure how dependent it is on aspirin and what the license restrictions of aspirin are.

The two best free OCR engines out right now are clara and gocr. While they are the best, they are not that great yet. I just ran the same tiff I had run with those two (I also have the document in pbm and other formats). Tesseract did not read it, it bailed with "IMAGE::check_legal_access:Error:Can't seek backwards in a buffered image!"

Clara and GOCR are written in C, Tesseract is written in C++, a language I don't know. Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though.

Hopefully this can be improved, or good parts of it can be borrowed and incorporated into gocr or clara. It couldn't handle my test that both clara and gocr could, but it probably has strengths the other two don't. One day hopefully we'll have a free OCR that handles things as automagically as the commercial ones do. I will see what I can contribute to that as well. Although this is C++ and I don't know that language.

Vividata works quite well by GnuPooh · 2006-09-04 16:47 · Score: 2, Interesting

I've tried all the previous open source stuff and it was pretty much unusable. The accuracy was so bad that it was just easier to start typing. I got a few of the Windows programs kind of working under WINE, but then I discovered Vividata and it worked really well and could be called from the command line. This meant I could write my own scripts that used it. I used it quite a bit for Project Gutenberg and was very impressed. It's not cheap, but if you want to do OCR under Linux and can afford it, I recommend it.

I would definitely prefer a Free Software solution so I'm excited about this development. Until this solution is really workable (see the Google limitations, they're pretty serious), give Vividata a try.

I call bullshit by quigonn · 2006-09-04 17:16 · Score: 4, Interesting

The very first CAPTCHA implementation was broken, but the funny thing about CAPTCHAs is that it's absolutely no effort to make an image completely unreadable for current OCR software. And even if one certain implementation is broken, just add another layer of distortion. The human brain is capable of coping with it, OCR software usually is not.

And after all, it's not about authentication, it's about making a service accessible only for humans.

BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.

--
A monkey is doing the real work for me.

Re:License by mrchaotica · 2006-09-04 18:03 · Score: 2, Interesting

The Aspirin/MIGRAINES system in the aspirin directory is separately licensed thus: [proprietary junk license]

Anybody know how important this headache library is to the software, and how easily replaced it is?

--

"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

Non-English Charsets? by TheoMurpse · 2006-09-04 18:13 · Score: 3, Interesting

As there seems to be no documentation on the Sourceforge page about what this can actually do, does it learn or follow rules? If it learns, can it learn to recognize, say, Japanese characters?

License issue: not free software by hellgate · 2006-09-04 18:18 · Score: 2, Interesting

Parts of the Tesseract tar ball are under a "for non-commercial use" only license:

This software is the copyright of Russell Leighton and the MITRE Corporation. It may be freely used and modified for research and development purposes. We require a brief acknowledgement in any research paper or other publication where this software has made a significant contribution. If you wish to use it for commercial gain you must contact The MITRE Corporation for conditions of use.

The piece in question is a neural network simulator named Aspirin/MIGRAINES, presumably used for training. Pun away.

Re:As much as I like open source software ... by Anonymous Coward · 2006-09-04 18:49 · Score: 2, Interesting

"the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"

My wife would fail this test. My father will fail this test. My step-mother will fail this test. My children will fail this test.

A computer will very easily get this test right one time in 26.

In one word: Useless.
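
The one-in-26 figure is simply the chance that a uniform random guess over the alphabet hits a single-letter answer. A small sketch, assuming a bot that guesses uniformly at random and a site that allows unlimited independent retries (both assumptions are for illustration only), shows how fast repeated attempts add up:

```python
import string

ALPHABET = string.ascii_lowercase  # 26 possible single-letter answers


def success_after(attempts: int, choices: int = len(ALPHABET)) -> float:
    """Probability that at least one of `attempts` independent uniform
    guesses hits the one correct answer out of `choices`."""
    return 1 - (1 - 1 / choices) ** attempts


for n in (1, 10, 26, 100):
    print(f"{n:>3} attempts: {success_after(n):.1%}")
```

A bot that can register thousands of times per hour does not need to understand the riddle at all.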

Image spam by Lonewolf666 · 2006-09-04 19:17 · Score: 2, Interesting

A good idea, and if significant amounts of text are in an image, I'd view the mail as dubious anyway.
If not because of spam, then because of the idiotic format. Images are for illustrations, but using them to transfer major amounts of text is just stupid and inefficient.

--
C - the footgun of programming languages

Test example of tesseract. by dannycim · 2006-09-04 19:42 · Score: 2, Interesting

Screen captured some text from the article, used XV to transform into tif, changing image to monochrome.

Input image: it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code

Output text: ii has been lamed as one of lhe mos! accurale Oplical Characler Recognilion (OCR) programs available. Having sat on lhe shelf galhering dusk for so many years, Google cleaned up some of lhe more ouldaled porlions of lhe code

I have no idea what kind of input is optimal, but for a first shot in the dark, that's not too shabby. I'll go play with it some more. (^,^)
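
One rough way to put a number on a sample like the one above is a character error rate: the Levenshtein edit distance between the reference text and the OCR output, divided by the length of the reference. The sketch below is an illustration only. The function names are made up here, and this is not claimed to be how Tesseract or the UNLV tests score accuracy.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# First few words of the sample, copied from the comment above.
reference = "it has been touted as one of the most accurate"
ocr_output = "ii has been lamed as one of lhe mos! accurale"
print(f"character error rate: {char_error_rate(reference, ocr_output):.1%}")
```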

Re:Test example of tesseract. by CXI · 2006-09-05 01:23 · Score: 2, Interesting

A screen shot is typically much lower resolution than what you'd normally scan documents at for OCR. It's not a good test.

Re:Music OCR by lowieken · 2006-09-04 21:02 · Score: 3, Interesting

There is a piece of non-free software that runs quite well under Wine and exports nice MusicXML. You will find it linked to from http://www.recordare.com/software.html .

I really should ask google to help buy this technology and set it free.

Re:I'm sorry Dave... by MichaelSmith · 2006-09-04 23:22 · Score: 2, Interesting

Yeah, but how is it on lip-reading? That's when we really need to worry.

Given that my laptop has a microphone I was a bit worried about the recent article on google sampling sound on peoples computers. But my wife's laptop also has a webcam. Should I tell my wife not to google in bed? If the mic is off will they still catch what she is talking about?

Dave why don't you take a stress pill and lie down. If you are looking for something to read there is always google news.

--
http://michaelsmith.id.au

Re:As much as I like open source software ... by Anonymous Coward · 2006-09-05 00:45 · Score: 2, Interesting

I gave up on CAPTCHA, the spammers have some really good software which can deal with this. My site used to get about 5-10 bot registrations a day. So I changed tactics, and simply ask "Are you a bot? (don't answer this question!)". If they answer this question, registration is denied, no matter what e-mail address or IP they are using. This alone is 100% effective, but I do have some other questions as a backup, just in case. It's rather interesting how all these registrations seem to follow the same pattern, almost like there is only one decent 'spam package' out there.
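
This trick is commonly called a honeypot field. A minimal sketch of the server-side check, assuming the submitted form arrives as a dictionary and the trap field is named `are_you_a_bot` (both are hypothetical, not taken from the poster's site):

```python
HONEYPOT_FIELD = "are_you_a_bot"  # hypothetical name; humans are told to leave it blank


def looks_like_bot(form: dict) -> bool:
    """Return True if the trap field contains anything but whitespace."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())


def register(form: dict) -> str:
    """Deny registration whenever the trap field was filled in."""
    if looks_like_bot(form):
        return "denied"
    return "accepted"


print(register({"email": "human@example.com", HONEYPOT_FIELD: ""}))
print(register({"email": "bot@example.com", HONEYPOT_FIELD: "no"}))
```

It only works as long as bots fill in every field blindly, which fits the observation that the registrations all follow the same pattern.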

Slashdot Mirror