Google Releases Tesseract as Open Source

I take back every bad thing I said about Google by OrangeTide · 2006-09-04 15:30 · Score: 4, Interesting

HOORAY! Good free OCR software is in short supply. I wonder if this will have a positive impact on Project Gutenberg?

--
“Common sense is not so common.” — Voltaire

Re:I take back every bad thing I said about Google by Commie1 · 2006-09-04 21:42 · Score: 2, Interesting

I've been using Tesseract for a PG project for a few weeks now and, as TFA says, it's not as good
as some commercial ones out there. ABBYY FineReader seems to be the OCR software of choice for
Distributed Proofreaders, at least.
Tesseract just has ASCII support (for now, as they like to add), so it ignores italics, accents etc.
In the case of the book I'm working on, it had a very hard time with the ff ligature and had some
trouble with b and c: but became hut, he became be, and c was often read as an o or e.
The words difficult, office and scientific were the standard pitfalls. On some pages it was nearly flawless though.
The biggest advantages to me are clearly that it is free*, it's good enough and I can use it on my preferred OS.

* Mostly Apache License v2.0, a part of it is under a "freely use and modify for research and development purposes" license however.

Anti-spam by Bacon+Bits · 2006-09-04 15:30 · Score: 2, Interesting

This should be useful for adding anti-image spam capabilities to FOSS anti-spam programs.

--
The road to tyranny has always been paved with claims of necessity.

Re:Anti-spam by Phroggy · 2006-09-04 19:08 · Score: 2, Interesting

Following that logic, wouldn't this then also be just as useful a tool to spammers looking to crack those crazy registration verification images?

Yes, absolutely, and spammers are already using image obfuscation techniques: using italic difficult-to-read fonts spaced very close together (difficult to separate the image into individual characters and difficult to identify each character once you do), using colored backgrounds to make the text very low-contrast when converted into a monochrome image the OCR software can use, using animated GIFs (as mentioned previously in another article) so that if you only convert the first or last frame of the animation you won't get anything useful, and finally splitting the image into multiple pieces that are assembled together with HTML. The only solution I see to this last problem is to develop spam filtering software that uses Gecko or KHTML to render the HTML and analyze the rendered page.

In the war between spammers and anti-spammers, the spammers are clearly winning, and they will continue to win for the foreseeable future. No technical solution can stop spam, only certain limited types of spam - but the spammers are constantly adapting. I believe if Congress were to earmark funding for the investigation and prosecution of spammers, we could actually make a significant dent in the problem (other governments have already expressed a willingness to cooperate).

It's difficult to legally define spam in such a way that makes spam illegal without infringing the right to freedom of speech and press, and I believe we need to err on the side of protecting liberty at the expense of some spam being legal. This is what CAN-SPAM has done - it's far from perfect, but it's a good start. CAN-SPAM has gotten a lot of criticism for being too easy for spammers to work around, but how much spam do you get that actually complies with the law? Not much... so why aren't we prosecuting violators right and left? Limited resources. Given the choice between tracking down a spammer and tracking down a murderer/rapist/child molester/etc., both of which cost money, most of us recognize that spam is a less severe problem. More resources need to be allocated to the appropriate law enforcement agencies so they can deal with both.

Oh, and if my argument about CAN-SPAM was unconvincing, consider this: nearly all the image-based spam I've been seeing lately has been either for penny stocks or prescription medications (i.e. "male enhancement" products). Both of these are already clearly illegal (the SEC and FDA are the respective government agencies responsible, I believe). It should be possible to prosecute the spammers for stock market manipulation and dispensing controlled drugs without a prescription, even if sending spam weren't against the law.

Some here will call for spammers to be sentenced to life in prison without parole, execution, castration, public hanging, public stoning, or worse. Get over it. Forget about revenge, that's not what our criminal justice system is for. All I want is for the spam to stop, and for the spammer to lose whatever they've gained from it. That should be enough. Let the spammers become productive members of society if they're capable of doing so; lock them up if they cause any further trouble.

Is this the final solution? No, of course not. But let's start with this, and see how it goes for now. If this works, spam won't go away, it will just change into new forms... and that's OK. When that happens, we can find new ways of dealing with it. The hope is that after that happens a few times, it will become much less of a problem. Maybe not, but I'd sure like to find out!

--
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;

Re:As much as I like open source software ... by aweinert · 2006-09-04 15:32 · Score: 5, Informative

CAPTCHAs are specifically meant to break OCR... and if you RTFA, it says it does poorly with grayscale and color documents. Basically it's meant for reading typed text... like in a book.

improvements by Anonymous Coward · 2006-09-04 15:33 · Score: 5, Funny

Google cleaned up some of the more outdated portions of the code
i.e., added AdSense to the OCR output.

Re:As much as I like open source software ... by illuminatedwax · 2006-09-04 15:33 · Score: 5, Funny

You're right! Let us never delve into research that could conceivably overturn weak software security! Some things man was never meant to discover! Turn back, before we fly too close to the sun and our wings melt!! O, Prometheus, why hast thou given us this OCR technology??

--
Did you ever notice that *nix doesn't even cover Linux?

Hoping OCR will improve? by smileytshirt · 2006-09-04 15:34 · Score: 3, Insightful

My guess is that they are doing this in the hope the open source community will build on and improve OCR technology. This would be in Google's interest, as it can then index text from images (such as their own Books project) more accurately and efficiently.

--
www.shortman.com.au - top shorted stocks on the ASX

Re:As much as I like open source software ... by Carthag · 2006-09-04 15:35 · Score: 2, Insightful

OCR is most effective when the letter boundaries are clear and well-defined, such as fixed-width text, or text that is at least on a straight line. Most CAPTCHAs put the letters on a curved path, as well as distorting the letters so they are no longer within a clearly defined rectangular shape. This makes it very hard to identify which parts of the images are letters and which parts are not, making OCRing CAPTCHAs a non-trivial problem.

Finally! by nihilatron · 2006-09-04 15:40 · Score: 3, Funny

Now I can finally see how to tell the difference between the 'A'-ness of 'A' and the 'P'-ness of 'P'!

(Credit to S.G.)

From the Project by Gopal.V · 2006-09-04 15:43 · Score: 4, Insightful

> It was open-sourced by HP and UNLV in 2005.

So Google basically did what? Fix bit-rot? Google has re-released some open source code, essentially forking off the original?

> License: (None Listed)

I'm a fan of the FOSS idea. Basically that makes sure that the whole work to which I contributed always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see is a License after I see some code.

So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.

--
Quidquid latine dictum sit, altum videtur

Re:From the Project by kevlarman · 2006-09-04 16:10 · Score: 3, Informative

if you had bothered to browse cvs you would find that it has been released under the apache license: http://tesseract-ocr.cvs.sourceforge.net/tesseract-ocr/tesseract/COPYING?view=markup

--
A mouse is a device used to point to the xterm you want to type in

I'm sorry Dave... by macadamia_harold · 2006-09-04 15:44 · Score: 4, Funny

Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available.

Yeah, but how is it on lip-reading? That's when we really need to worry.

--
Push Button, Receive Bacon

Re:I'm sorry Dave... by MichaelSmith · 2006-09-04 23:22 · Score: 2, Interesting

Yeah, but how is it on lip-reading? That's when we really need to worry.

Given that my laptop has a microphone I was a bit worried about the recent article on Google sampling sound on people's computers. But my wife's laptop also has a webcam. Should I tell my wife not to google in bed? If the mic is off will they still catch what she is talking about?

Dave why don't you take a stress pill and lie down. If you are looking for something to read there is always google news.

--
http://michaelsmith.id.au

Hosting by truthsearch · 2006-09-04 15:44 · Score: 5, Interesting

Is there any particular reason google isn't hosting the project themselves?

--
Developers: We can use your help.

Re:Hosting by larry+bagina · 2006-09-04 15:46 · Score: 5, Funny

Yes. They need the 99.9999% uptime (6 9s) that only sourceforge can provide.

--
Do you even lift?
These aren't the 'roids you're looking for.

Re:Hosting by Leto-II · 2006-09-04 18:24 · Score: 4, Funny

I think you need to recalibrate your sarcasm detector.

--
Do not anger the worm.

NFB owns you by tepples · 2006-09-04 15:48 · Score: 4, Interesting

CAPTCHAs have been very effective in stopping spammers in the past, but if they can now just read them and answer correctly, then they are effectively rendered useless ...

They're already useless if installing one will subject your business to boycotts and/or lawsuits from National Federation of the Blind and other advocates for people with disabilities.

Re:NFB owns you by MrNonchalant · 2006-09-04 17:08 · Score: 4, Informative

You can build accessible CAPTCHAs, using images with a sound backup for blind users. My girlfriend is visually impaired and non-accessible CAPTCHAs are a real problem for her; she can't register at some sites without assistance.

Re:NFB owns you by maxwell+demon · 2006-09-04 20:34 · Score: 2, Funny

Of course you can resort to other, harder to calculate questions like: "What is the answer to life, the universe and everything?" Oops, Computers seem to have become much faster since Deep Thought! :-)

--
The Tao of math: The numbers you can count are not the real numbers.

Re:As much as I like open source software ... by djtack · 2006-09-04 15:51 · Score: 4, Insightful

Plus, good OCR could help recognize image spam (where they send the text in an image attachment, to avoid filtering, and fill the message body with "bayes poison").

i hope it can augment the SpamAssassin OCR plugin by sednet · 2006-09-04 16:02 · Score: 2, Informative

it would be great if tesseract could augment the gocr-based FuzzyOCR and OCR plugins for SpamAssassin.

--
about sean dreilinger

Re:As much as I like open source software ... by binarybum · 2006-09-04 16:07 · Score: 3, Funny

careful, statements like that are likely to get you voted governor in some states.

--
ôó

Un-Finishable by Kadin2048 · 2006-09-04 16:09 · Score: 5, Interesting

In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules. They have everything written by humanity before that date to digitize: not just English language books and "classics," but government documents, records, foreign language texts, ancient manuscripts ... everything. That's as close to an un-finishable task as you can set yourself, I think.

Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.

With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Re:Un-Finishable by mrchaotica · 2006-09-04 17:58 · Score: 4, Insightful

In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules.

Your argument makes the fundamentally flawed assumption that the "new rules" will remain constant. The reality is that Copyright will continue getting extended so that new content never comes into Public Domain. (I hope the copyright fuckers are the first against the wall when the revolution comes!)

Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration.

I'm sure they could even OCR them... they just couldn't make them available to the public. Of course, given the community-driven mechanism by which Project Gutenberg works, they couldn't legally distribute them to the volunteers either...

--
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

Re:Un-Finishable by HuguesT · 2006-09-04 22:05 · Score: 2, Interesting

This is patently false. New stuff comes out of copyright every day. However, coming out of copyright is not the same thing as becoming available to the public. Clearly this is where Project Gutenberg comes in.

One enormous area I'm personally interested in is sheet music. Some of the music I'm interested in playing came out of copyright decades or even centuries ago. No one is going to reclaim copyright on Mozart's Requiem, for instance. Yet it is by and large not available to the public because translating original manuscript sheet music into something that modern musicians can play without too much trouble is a huge undertaking.

Yet I have no doubt that this will eventually happen. PG already has a section devoted to sheet music. The tools are beginning to appear: lilypond is a superb Free music engraving software package. I'm personally working on music OCR software, and others are as well I'm sure. Eventually this will work out well I think.

The public is in the process of reclaiming what is theirs, this is pretty much unstoppable right now.

Re:Un-Finishable by gweeks · 2006-09-04 23:03 · Score: 3, Informative

> This is patently false. New stuff comes out of copyright every day.

This is just so untrue. In the United States (the only place that Project Gutenberg worries about) nothing is entering the Public Domain except unpublished manuscripts where the author died 70 years ago. Nothing else will enter the public domain until 2019. Congress has effectively frozen the public domain.

Re:Un-Finishable by fotbr · 2006-09-05 01:39 · Score: 2, Informative

Unless estate holders release it early. Or the author and holder of the copyright declares in his/her will that his/her work be released into the public domain upon his death, etc.

Just because it's not common (or likely) doesn't mean it can't happen.

License by mapinguari · 2006-09-04 16:11 · Score: 2, Informative

Here's what's in the COPYING file distributed with the source, with some punctuation stripped to placate the lameness filter:

This package contains the Tesseract Open Source OCR Engine. Orignally developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado, the majority of the code in this distribution is now licensed under the Apache License:

** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.

Other Dependencies and Licenses:

The Aspirin/MIGRAINES system in the aspirin directory is separately licensed thus:

# NO WARRANTY

Since the Aspirin/MIGRAINES system is licensed free of charge, Russell Leighton and the MITRE Corporation provide absolutley no warranty. Should the Aspirin/MIGRAINES system prove defective, you must assume the cost of all necessary servicing, repair or correction. In no way will Russell Leighton or the MITRE Corporation be liable to you for damages, including any lost profits, lost monies, or other special, incidental or consequential damages arising out of the use or inability to use the Aspirin/MIGRAINES system.

COPYRIGHT

This software is the copyright of Russell Leighton and the MITRE Corporation. It may be freely used and modified for research and development purposes. We require a brief acknowledgement in any research paper or other publication where this software has made a significant contribution. If you wish to use it for commercial gain you must contact The MITRE Corporation for conditions of use. Russell Leighton and the MITRE Corporation provide absolutely NO WARRANTY for this software.

August, 1992
Russell Leighton
The MITRE Corporation
7525 Colshire Dr.
McLean, Va. 22102-3481

Tesseract can also make use of the libtiff library. (www.libtiff.org) Without libtiff, Tesseract can only read uncompressed and G3 compressed TIFF files.

Re:License by mrchaotica · 2006-09-04 18:03 · Score: 2, Interesting

The Aspirin/MIGRAINES system in the aspirin directory is separately licensed thus: [proprietary junk license]

Anybody know how important this headache library is to the software, and how easily replaced it is?

--
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

Re:License by lisaparratt · 2006-09-04 19:34 · Score: 2, Informative

It's a neural networking system, so I'd hazard a guess that it's pretty vital to the project :(

Re:As much as I like open source software ... by Millenniumman · 2006-09-04 16:23 · Score: 2, Interesting

Why can't captchas just say "the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"? If someone can make a program that interprets that and gets the answer right after getting it off a captcha with OCR, then Google probably wants to know so they can hire them.

--
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.

No Wrinkle in Time comments? by reaktor · 2006-09-04 16:35 · Score: 2, Interesting

Come on, 34 comments and no mention of A Wrinkle in Time?

my thoughts by br00tus · 2006-09-04 16:43 · Score: 3, Interesting

I would love to use a free (speech and beer) OCR engine that works as well as a commercial one, or even nearly as good as a commercial one.

I just checked out tesseract. One thing I have to look at more is the license. It appears to be the Apache license, which seems like a decent free license. But it also includes MITRE's aspirin. I'm not sure how dependent it is on aspirin and what the license restrictions of aspirin are.

The two best free OCR engines out right now are clara and gocr. While they are the best, they are not that great yet. I just ran the same tiff I had run with those two (I also have the document in pbm and other formats). Tesseract did not read it, it bailed with "IMAGE::check_legal_access:Error:Can't seek backwards in a buffered image!"

Clara and GOCR are written in C, Tesseract is written in C++, a language I don't know. Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though.

Hopefully this can be improved, or good parts of it can be borrowed and incorporated into gocr or clara. It couldn't handle my test that both clara and gocr could, but it probably has strengths the other two don't. One day hopefully we'll have a free OCR that handles things as automagically as the commercial ones do. I will see what I can contribute to that as well. Although this is C++ and I don't know that language.

Vividata works quite well by GnuPooh · 2006-09-04 16:47 · Score: 2, Interesting

I've tried all the previous open source stuff and it was pretty much unusable. The accuracy was so bad that it was just easier to start typing. I got a few of the Windows programs kind of working under WINE, but then I discovered Vividata and it worked really well and could be called from the command line. This meant I could write my own scripts that used it. I used it quite a bit for Project Gutenberg and was very impressed. It's not cheap, but if you want to do OCR under Linux and can afford it, I recommend it.

I would definitely prefer a Free Software solution so I'm excited about this development. Until this solution is really work-able (see the Google limitations, they're pretty serious), give Vividata a try.

Two reasons by patio11 · 2006-09-04 16:49 · Score: 4, Insightful

You've got two constraints. One is that you have to be able to compose an arbitrarily large number of CAPTCHAs algorithmically. For example, that example you just used is human-composed. If it's the only CAPTCHA you have, the following program gets me a job at Google: gawk 'BEGIN{print "b"}'. If you have 100 CAPTCHAs, I only need to add a switch statement and some elbow grease and then I get to break your CAPTCHA a trillion times.
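To make the "switch statement and some elbow grease" concrete, here is a minimal Python sketch of a lookup-table attack on a fixed pool of text CAPTCHAs. The second question, the table name and the helper names are made up for illustration; nothing here comes from a real CAPTCHA system.

```python
# Hypothetical illustration: a human answers each question in a fixed
# pool once, and a bot replays the stored answers forever after.
KNOWN_ANSWERS = {
    "the letter at the beginning of the word that is spelled the reverse "
    "of the 3 letter name for an underwater vehicle or sandwich": "b",
    "what is two plus three": "5",  # made-up second entry
}


def normalize(question: str) -> str:
    """Lower-case and collapse whitespace so trivial variations still match."""
    return " ".join(question.lower().split())


def solve(question: str):
    """Return the memorized answer, or None for a question not in the pool."""
    return KNOWN_ANSWERS.get(normalize(question))


if __name__ == "__main__":
    print(solve("What is  TWO plus three"))      # matches the second entry
    print(solve("a question nobody has seen"))   # not in the pool
```

The cost to the attacker grows with the number of distinct questions, not with the number of sign-ups, which is why the pool has to be generated rather than hand-written.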

The other constraint is that you have to have your problem be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwich" is not generally on the Japanese English curriculum. Similarly, you could pose any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.

By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/

--
Help poke pirates in the eyepatch, arr.

Re:As much as I like open source software ... by Jerf · 2006-09-04 16:54 · Score: 3, Insightful

In order to pose the question, you have to generate it randomly. If it's not random, you already lost.

In order to generate it, you're going to end up using a grammar.

Running grammars in reverse is merely a matter of patience (to explore the space of problems the test program will pose) and the right tools; it's a fundamental bit of computer science.
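A rough Python sketch of what "running the grammar in reverse" could look like for a toy question generator. The two rules, the word list and every function name are assumptions invented for this example; a real generator would be larger, but the solver would follow the same pattern of one parser per production rule.

```python
import random
import re

# Toy grammar (entirely made up): two question templates.
WORDS = ["tesseract", "google", "scanner"]
ORDINALS = {"first": 0, "second": 1, "third": 2}


def generate(rng):
    """Produce a (question, answer) pair from the two-rule toy grammar."""
    if rng.random() < 0.5:
        word = rng.choice(WORDS)
        ordinal = rng.choice(list(ORDINALS))
        question = f"What is the {ordinal} letter of the word {word}?"
        return question, word[ORDINALS[ordinal]]
    a, b = rng.randint(0, 9), rng.randint(0, 9)
    return f"What is {a} plus {b}?", str(a + b)


# The attacker's side: one regular expression per production rule.
LETTER_RULE = re.compile(r"What is the (\w+) letter of the word (\w+)\?")
SUM_RULE = re.compile(r"What is (\d+) plus (\d+)\?")


def solve(question):
    """Recover the answer by matching the question against each rule."""
    m = LETTER_RULE.fullmatch(question)
    if m:
        return m.group(2)[ORDINALS[m.group(1)]]
    m = SUM_RULE.fullmatch(question)
    if m:
        return str(int(m.group(1)) + int(m.group(2)))
    return None  # a rule the attacker has not observed yet


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        question, answer = generate(rng)
        print(question, "->", solve(question), "(expected", answer + ")")
```

The attacker never needs the generator's source; sampling enough questions to enumerate the templates is the "patience" part.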

Granted, expecting spammers to be conversant with the fundamental elements of computer science is a pretty high bar, but it only takes one to leap it and the rest to buy the program from him.

The image tests have the advantage that done properly, it takes more than just patience and computer science fundamentals to crack, it would require fundamental advances in the art.

(Note that nowhere in this message do I claim that image tests are perfect; in fact everything I know is vulnerable to the "feed it to a human in another context (viz, 'porn') and let them do the work" attack, and there are also points to be made about how widespread any given grammar/image test becomes; I know a website where the image test actually is a constant and so far it doesn't seem to be a problem because of scale issues. My point is that text tests have an additional disadvantage. It's not an intrinsically bad idea, though.)

Google wouldn't be interested in hiring people who could crack this, merely because they can crack this. Might make a decent interview question, though.

(You might also be tempted to think that you could just use a really complicated grammar, but you are constrained by two things, the human supposedly reading and taking the test, and the complexity of the human language itself. By the time you write some problem generator that could reliably throw off a parser, you'll be reliably confusing the hell out of your human users, too.)

Re:As much as I like open source software ... by Otto · 2006-09-04 17:01 · Score: 3, Insightful

Or write up a quick script to cut the images in half down the middle and save them as a series of other images.

--
- Give a man a fire and he's warm for a day, but set him on fire and he's warm for the rest of his life.

I call bullshit by quigonn · 2006-09-04 17:16 · Score: 4, Interesting

The very first CAPTCHA implementation was broken, but the funny thing about CAPTCHAs is that it's absolutely no effort to make an image completely unreadable for current OCR software. And even if one certain implementation is broken, just add another layer of distortion. The human brain is capable of coping with it, OCR software usually is not.

And after all, it's not about authentication, it's about making a service accessible only for humans.

BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.

--
A monkey is doing the real work for me.

Re:I call bullshit by johansalk · 2006-09-04 22:25 · Score: 3, Informative

If CAPTCHAs rely on humans, wasn't there an anti-CAPTCHA trick where spammers had people solve a CAPTCHA to get into some free porn, and then used those answers to get their bots into the legitimate sites the spammers wanted to reach?

HP decided to get out of the OCR business? by Frosty+Piss · 2006-09-04 17:18 · Score: 5, Funny

In 1995 it was one of the top 3 performers at the OCR accuracy contest organized by University of Nevada in Las Vegas. However, shortly thereafter, HP decided to get out of the OCR business...

Actually, shortly thereafter, HP decided to get out of the technology innovation business, and into the printer ink business.

--
If you want news from today, you have to come back tomorrow.

W0W1 by Anonymous Coward · 2006-09-04 17:21 · Score: 3, Funny

TH18 IS GRLAT NEWf4 FOR TH0Sj OF US U$1NZ BA) O(R RLCOGN1+ION!

THAHKS, G00GLL!1!!!

Re:As much as I like open source software ... by ajs · 2006-09-04 17:33 · Score: 2, Funny

That's no problem! All I really need it to do is allow all of those geeks out there to share those great Playboy articles with me over p2p networks! I'm tired of just getting the filler photography! ;-)

Re:What about "rough ocr" by Anonymous Coward · 2006-09-04 18:11 · Score: 3, Insightful

You're a secretary? Do you do anal? If so, I can double your pay.

Non-English Charsets? by TheoMurpse · 2006-09-04 18:13 · Score: 3, Interesting

As there seems to be no documentation on the Sourceforge page about what this can actually do, does it learn or follow rules? If it learns, can it learn to recognize, say, Japanese characters?

Re:Non-English Charsets? by Yvanhoe · 2006-09-04 20:42 · Score: 2, Informative

Google specifically said in the article it doesn't work for non-English texts. I suppose it means it incorporates an English dictionary too, so other Roman languages wouldn't work either.

--
The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.

License issue: not free software by hellgate · 2006-09-04 18:18 · Score: 2, Interesting

Parts of the Tesseract tar ball are under a "for non-commercial use" only license:

This software is the copyright of Russell Leighton and the MITRE Corporation. It may be freely used and modified for research and development purposes. We require a brief acknowledgement in any research paper or other publication where this software has made a significant contribution. If you wish to use it for commercial gain you must contact The MITRE Corporation for conditions of use.

The piece in question is a neural network simulator named Aspirin/MIGRAINES, presumably used for training. Pun away.

Re:Totally OT response to sig. by illuminatedwax · 2006-09-04 18:23 · Score: 2, Insightful

Also: *AA includes the MAA (Mathematical Association of America), the ADAA (Anxiety Disorders Association of America), the MSAA (Multiple Sclerosis Association of America), and the SCAA (Specialty Coffee Association of America).

The SCAA must be the ones responsible for not letting Java be open sourced.

--
Did you ever notice that *nix doesn't even cover Linux?

Re:As much as I like open source software ... by Phroggy · 2006-09-04 18:31 · Score: 2, Informative

I am currently using the FuzzyOcr plugin to SpamAssassin, and it uses gocr to do the character recognition. To be sure, gocr is improving (the stable released version is practically useless, but the CVS version actually works, mostly), but if Tesseract is better, great!

--
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;

Re:As much as I like open source software ... by Anonymous Coward · 2006-09-04 18:49 · Score: 2, Interesting

"the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"

My wife would fail this test. My father will fail this test. My step-mother will fail this test. My children will fail this test.

A computer will very easily get this test right one time in 26.
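For scale, a small Python sketch of how quickly blind guessing adds up. It assumes each attempt gets a fresh question whose one-letter answer is uniformly random and independent of the others; the function name is made up.

```python
def chance_of_success(attempts: int, alphabet_size: int = 26) -> float:
    """Probability that at least one of `attempts` independent,
    uniformly random single-letter guesses is correct."""
    return 1.0 - (1.0 - 1.0 / alphabet_size) ** attempts


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, "attempts:", round(chance_of_success(n), 3))
```

Under those assumptions one guess succeeds with probability 1/26 (a little under 4%), and a bot that simply retries 100 times gets through with roughly 98% probability.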

In one word: Useless.

Image spam by Lonewolf666 · 2006-09-04 19:17 · Score: 2, Interesting

A good idea, and if significant amounts of text are in an image, I'd view the mail as dubious anyway.
If not because of spam, then because of the idiotic format. Images are for illustrations, but using them to transfer major amounts of text is just stupid and inefficient.

--
C - the footgun of programming languages

Re:Image spam by maxwell+demon · 2006-09-04 20:16 · Score: 2, Insightful

Unless it's a scanned page, where you might be interested in more than just the raw text, or simply don't want to risk errors in converting it to text (think official documents).

--
The Tao of math: The numbers you can count are not the real numbers.

Re:Music OCR by Scaba · 2006-09-04 19:40 · Score: 2, Funny

I'm sick and tired of a piece of dust being interpreted as a meter change.

You're just not avant-garde enough.

Test example of tesseract. by dannycim · 2006-09-04 19:42 · Score: 2, Interesting

Screen captured some text from the article, used XV to transform into tif, changing image to monochrome.

Input image: it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code

Output text: ii has been lamed as one of lhe mos! accurale Oplical Characler Recognilion (OCR) programs available. Having sat on lhe shelf galhering dusk for so many years, Google cleaned up some of lhe more ouldaled porlions of lhe code

I have no idea what kind of input is optimal, but for a first shot in the dark, that's not too shabby. I'll go play with it some more. (^,^)
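
To put a number on "not too shabby", one common approach is a character error rate: the edit distance between the original text and the OCR output, divided by the length of the original. The sketch below is only an illustration; the function names are made up here, and the sample strings are a fragment of the input and output quoted above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Fraction of reference characters that had to be edited."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Fragment of the test quoted in the comment above.
reference = "it has been touted as one of the most accurate"
ocr_output = "ii has been lamed as one of lhe mos! accurale"
print(f"character error rate: {char_error_rate(reference, ocr_output):.1%}")
```

Running the same measurement on a proper scan rather than a screen capture would show how much of the error comes from the low-resolution input.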

Re:Test example of tesseract. by CXI · 2006-09-05 01:23 · Score: 2, Interesting

A screen shot is typically much lower resolution than what you'd normally scan documents at for OCR. It's not a good test.

Re:Music OCR by lowieken · 2006-09-04 21:02 · Score: 3, Interesting

There is a piece of non-free software that runs quite well under Wine and exports nice MusicXML. You will find it linked to from http://www.recordare.com/software.html .

I really should ask Google to help buy this technology and set it free.

Re:Isn't fully free / open source by Ed+Avis · 2006-09-04 21:34 · Score: 2, Informative

If you think the software isn't entirely free, contact Sourceforge. Their conditions require that all hosted projects be free software.

--
-- Ed Avis ed@membled.com

Re:As much as I like open source software ... by Arancaytar · 2006-09-04 23:19 · Score: 3, Insightful

Yes, by using contrasting colors that convert to the same tone in grayscale. A side effect being that most such technologies also shut out colorblind people...
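
A rough sketch of why this works, assuming the grayscale conversion uses the common Rec. 601 luma weights (the comment does not say which conversion the OCR software applies, so the weights and the two colors here are illustrative assumptions):

```python
def to_gray(rgb):
    """Convert an (R, G, B) tuple (0-255 each) to a gray level via Rec. 601 luma."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


# Two colors a sighted human tells apart at a glance...
text_color = (255, 0, 0)   # bright red
background = (0, 130, 0)   # medium green

# ...that collapse to the same gray level, so the text vanishes
# once the image is reduced to grayscale.
print(to_gray(text_color), to_gray(background))
```

Red/green pairs are the easiest to build this way, and those are exactly the pairs many colorblind people cannot distinguish either.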

Since you ask, here's why: by patio11 · 2006-09-04 23:56 · Score: 3, Insightful

The system you propose is called challenge/response (CR). CR is not a good idea for the following reasons:

1) It says "My time is more important than yours" to all your correspondents, because you're not willing to look at a few spams getting past your Bayesian filter every day so instead you offload that time burden to people who want to talk to you.
2) Dueling CR systems ("Hey, bob@example.com, I don't recognize you. Please prove you are a human" "Re: Hey, bob -- steve@stupid.com, I don't recognize you. Please prove you are a human"). Even more fun in a potentially infinite loop. Any system you can make to short-circuit this loop can be abused by spam to avoid the CR altogether.
3) Doesn't survive the Chinese Sweatshop Spam Attack, which will be ubiquitous if CR becomes popular. (Take poor Chinese person, teach them 10 words of English, pay them 2 cents an hour to answer CAPTCHAs so you get guaranteed delivery of your Maximize Your Mr. Wiggly offers.)
4) Breaks legitimate bulk mail senders, such as Amazon, Paypal, eBay, mailing lists, etc. Mailing lists in particular are going to be very fun, since a lot of CR systems would spam the entire list -- perhaps provoking 100 challenges! Which then leads to combinatorial hilarity!

--
Help poke pirates in the eyepatch, arr.

Re:As much as I like open source software ... by Dan+Ost · 2006-09-05 00:32 · Score: 2, Informative

As someone who has been involved in applying OCR to real-world problems, I can say there's
nothing trivial about generating good binary images from images taken in the field (in my case,
images of boxes moving down a conveyor belt or hand imaged by workers).

Even if you disregard such problems as uneven lighting, glare, and distortion due to the
unavoidable vibration inherent in plant settings, most forms that are interesting to
OCR are handwritten and not designed to be OCR friendly. Hopefully this will change as
the people who design such forms become more conscious of the capabilities of OCR, but
even if that were to happen tomorrow, it would take years to complete the transition.
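
To illustrate just the uneven-lighting part, here is a toy sketch on a single synthetic scanline whose background gets brighter from left to right. It compares one global threshold with a local-mean threshold. The pixel values, window radius, and offset are invented for the example, and this is not a description of what any particular OCR package does.

```python
def global_threshold(row, threshold):
    """Mark a pixel as ink (1) if it is darker than one fixed threshold."""
    return [1 if value < threshold else 0 for value in row]


def local_mean_threshold(row, radius, offset):
    """Mark a pixel as ink (1) if it is clearly darker than its neighborhood mean."""
    result = []
    for i, value in enumerate(row):
        # Window is clipped at the edges of the scanline.
        window = row[max(0, i - radius): i + radius + 1]
        mean = sum(window) / len(window)
        result.append(1 if value < mean - offset else 0)
    return result


# Background ramps from 60 to 170; ink pixels (40 darker than the
# background) sit at positions 2 and 9.
row = [60, 70, 40, 90, 100, 110, 120, 130, 140, 110, 160, 170]

print(global_threshold(row, threshold=100))             # dark left side floods, ink at 9 is lost
print(local_mean_threshold(row, radius=2, offset=15))   # only positions 2 and 9 are marked
```

Glare, blur from vibration, and handwriting are separate problems that a threshold alone does not address.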

--

*sigh* back to work...

Re:As much as I like open source software ... by Anonymous Coward · 2006-09-05 00:45 · Score: 2, Interesting

I gave up on CAPTCHA; the spammers have some really good software that can deal with it. My site used to get about 5-10 bot registrations a day. So I changed tactics, and simply ask "Are you a bot? (don't answer this question!)". If they answer this question, registration is denied, no matter what e-mail address or IP they are using. This alone is 100% effective, but I do have some other questions as a backup, just in case. It's rather interesting how all these registrations seem to follow the same pattern, almost like there is only one decent 'spam package' out there.
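
A minimal sketch of that kind of trap question, assuming the registration form arrives as a dictionary of field names to strings. The field name `are_you_a_bot` and both function names are hypothetical; the comment does not describe the actual site code.

```python
TRAP_FIELD = "are_you_a_bot"  # hypothetical name for the "don't answer this" question


def is_probable_bot(form: dict) -> bool:
    """Return True if the trap question was answered at all."""
    answer = form.get(TRAP_FIELD) or ""
    return bool(answer.strip())


def handle_registration(form: dict) -> str:
    """Deny registration when the trap field is filled in, whatever else was submitted."""
    if is_probable_bot(form):
        return "denied"
    return "accepted"


print(handle_registration({"username": "alice", "are_you_a_bot": ""}))
print(handle_registration({"username": "bot42", "are_you_a_bot": "no"}))
```

The check works only as long as bots fill in every field they find; a bot written for one specific site could simply skip the question.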

Chastity Bono's next step is life+100 by tepples · 2006-09-05 01:47 · Score: 2, Insightful

I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules.

Are you insinuating that the 115th Congress won't try to enact a Chastity Bono Copyright Term Extension Act? Given Mexico's life plus 100 copyright term, the next step of "harmonization" for the United States and its trading partners is life plus 100 or, in the case of works made for hire, 125 years after publication.

Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can.

Who's to say that publishers won't fight back against Gutenberg the way (ObTopic) they did against Google? It's only fair use if you can pay a judge to tell you that it is and if you can pay your lawyer to tell the judge to tell you that it is.

Re:THIS IS ONLY FOR *NIX and not mentioned? by dadman · 2006-09-06 00:23 · Score: 2, Informative

Err... How about Cygwin http://www.cygwin.com/ ?
