Google Releases Tesseract as Open Source
An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at HP Labs from 1985 to 1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at Sourceforge."
HOORAY! Good free OCR software is in short supply. I wonder if this will have a positive impact on Project Gutenberg?
“Common sense is not so common.” — Voltaire
This should be useful for adding anti-image spam capabilities to FOSS anti-spam programs.
The road to tyranny has always been paved with claims of necessity.
Is there any particular reason Google isn't hosting the project themselves?
Developers: We can use your help.
They're already useless if installing one will subject your business to boycotts and/or lawsuits from National Federation of the Blind and other advocates for people with disabilities.
In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of copyright under the new rules. They have everything written by humanity before that date to digitize: not just English language books and "classics," but government documents, records, foreign language texts, ancient manuscripts ... everything. That's as close to an un-finishable task as you can set yourself, I think.
Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.
With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Why can't captchas just say "the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"? If someone can make a program that interprets that and gets the answer right after getting it off a captcha with OCR, then Google probably wants to know so they can hire them.
Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.
Come on, 34 comments and no mention of A Wrinkle in Time?
I just checked out Tesseract. One thing I have to look at more is the license. It appears to be the Apache license, which seems like a decent free license. But it also includes MITRE's Aspirin. I'm not sure how dependent it is on Aspirin and what the license restrictions of Aspirin are.
The two best free OCR engines out right now are Clara and GOCR. While they are the best, they are not that great yet. I just ran the same tiff I had run with those two (I also have the document in pbm and other formats). Tesseract did not read it; it bailed with "IMAGE::check_legal_access:Error:Can't seek backwards in a buffered image!"
Clara and GOCR are written in C, Tesseract is written in C++, a language I don't know. Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though.
Hopefully this can be improved, or good parts of it can be borrowed and incorporated into GOCR or Clara. It couldn't handle my test that both Clara and GOCR could, but it probably has strengths the other two don't. One day hopefully we'll have a free OCR that handles things as automagically as the commercial ones do. I will see what I can contribute to that as well, although this is C++ and I don't know that language.
I've tried all the previous open source stuff and it was pretty much unusable. The accuracy was so bad that it was just easier to start typing. I got a few of the Windows programs kind of working under WINE, but then I discovered Vividata and it worked really well and could be called from the command line. This meant I could write my own scripts that used it. I used it quite a bit for Project Gutenberg and was very impressed. It's not cheap, but if you want to do OCR under Linux and can afford it, I recommend it.
I would definitely prefer a Free Software solution so I'm excited about this development. Until this solution is really workable (see the Google limitations, they're pretty serious), give Vividata a try.
The very first CAPTCHA implementation was broken, but the funny thing about CAPTCHAs is that it's absolutely no effort to make an image completely unreadable for current OCR software. And even if one certain implementation is broken, just add another layer of distortion. The human brain is capable of coping with it; OCR software usually is not.
And after all, it's not about authentication, it's about making a service accessible only to humans.
BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.
A monkey is doing the real work for me.
Anybody know how important this headache library is to the software, and how easily replaced it is?
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
As there seems to be no documentation on the Sourceforge page about what this can actually do, does it learn or follow rules? If it learns, can it learn to recognize, say, Japanese characters?
The piece in question is a neural network simulator named Aspirin/MIGRAINES, presumably used for training. Pun away.
"the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"
My wife would fail this test. My father will fail this test. My step-mother will fail this test. My children will fail this test.
A computer will very easily get this test right one time in 26.
In one word: Useless.
A good idea, and if significant amounts of text are in an image, I'd view the mail as dubious anyway.
If not because of spam, then because of the idiotic format. Images are for illustrations, but using them to transfer major amounts of text is just stupid and inefficient.
C - the footgun of programming languages
Screen captured some text from the article, used XV to transform into tif, changing image to monochrome.
Input image: it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code
Output text: ii has been lamed as one of lhe mos! accurale Oplical Characler Recognilion (OCR) programs available. Having sat on lhe shelf galhering dusk for so many years, Google cleaned up some of lhe more ouldaled porlions of lhe code
I have no idea what kind of input is optimal, but for a first shot in the dark, that's not too shabby. I'll go play with it some more. (^,^)
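For anyone who wants to put a number on "not too shabby", here is a rough sketch that scores the output above against the input using character-level edit distance. The two strings are copied from this post; the function names are my own, and edit distance divided by reference length is just one common way to measure OCR accuracy, not anything Tesseract itself reports.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Text copied from the post above (input image text vs. OCR output).
reference = (
    "it has been touted as one of the most accurate Optical Character "
    "Recognition (OCR) programs available. Having sat on the shelf gathering "
    "dust for so many years, Google cleaned up some of the more outdated "
    "portions of the code"
)
ocr_output = (
    "ii has been lamed as one of lhe mos! accurale Oplical Characler "
    "Recognilion (OCR) programs available. Having sat on lhe shelf galhering "
    "dusk for so many years, Google cleaned up some of lhe more ouldaled "
    "porlions of lhe code"
)

cer = char_error_rate(reference, ocr_output)
print(f"character error rate: {cer:.1%}")
```

Most of the misses are the same confusion (t read as l), so a single sample like this says more about the font and the monochrome conversion than about the engine in general.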
There is a piece of non-free software that runs quite well under Wine and exports nice MusicXML. You will find it linked to from http://www.recordare.com/software.html .
I really should ask Google to help buy this technology and set it free.
Given that my laptop has a microphone I was a bit worried about the recent article on Google sampling sound on people's computers. But my wife's laptop also has a webcam. Should I tell my wife not to google in bed? If the mic is off will they still catch what she is talking about?
Dave, why don't you take a stress pill and lie down? If you are looking for something to read, there is always Google News.
http://michaelsmith.id.au
I gave up on CAPTCHA; the spammers have some really good software which can deal with this. My site used to get about 5-10 bot registrations a day. So I changed tactics, and now simply ask "Are you a bot? (don't answer this question!)". If they answer this question, registration is denied, no matter what e-mail address or IP they are using. This alone is 100% effective, but I do have some other questions as a backup, just in case. It's rather interesting how all these registrations seem to follow the same pattern, almost like there is only one decent 'spam package' out there.
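In case it helps anyone, the server-side check for that trap question can be sketched in a few lines. This is not the code from my site; the field name `are_you_a_bot` and the function name are made up for illustration, and the form is assumed to arrive as a plain dict of strings.

```python
# Hypothetical name of the form field behind the "Are you a bot?" question.
HONEYPOT_FIELD = "are_you_a_bot"


def should_reject(form: dict) -> bool:
    """Return True when the trap question was answered at all.

    Humans are told to leave the field empty, so any non-blank value
    (even "no") is treated as a bot filling in every input it finds.
    """
    answer = form.get(HONEYPOT_FIELD, "")
    return bool(answer.strip())


# A bot that dutifully answers the question is rejected ...
print(should_reject({"email": "x@example.com", "are_you_a_bot": "no"}))
# ... while a human who leaves it blank gets through.
print(should_reject({"email": "y@example.com", "are_you_a_bot": ""}))
```

The check ignores the e-mail address and IP entirely, which matches the point above: the answer to the trap question alone decides the outcome.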