Google Releases Tesseract as Open Source
An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at Sourceforge.
> It was open-sourced by HP and UNLV in 2005.
So google basically did what ? Fix bit-rot ? Google has re-released some open source code, essentially forking off the orginal ?
> License: (None Listed)
I'm a fan of the FOSS idea. Basically that makes sures that the whole work to which I contributed, always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see is a License after I see some code.
So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.
Quidquid latine dictum sit, altum videtur
Plus, good OCR could help recognize image spam (where they send the text in an image attachment, to avoid filtering, and fill the message body with "bayes poison").
You've got two constraints. One is that you have to be able to compose an arbitrarily large numbers of capchas algorithmically. For example, that example you just used is human-composed. If its the only CAPTCHA you have, the following program gets me a job at Google: gawk 'BEGIN{print "b"}' . If you have 100 CAPTCHAS, I only need to add a switch statement and some elbow grease and then I get to break your CAPTCHA a trillion times.
The other contraint is that you have to have your problem be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwitch" is not generally on the Japanese English curriculum. Similarly, you could any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.
By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/
Help poke pirates in the eyepatch, arr.
Your argument makes the fundamentally flawed assumption that the "new rules" will remain constant. The reality is that Copyright will continue getting extended so that new content never comes into Public Domain. (I hope the copyright fuckers are the first against the wall when the revolution comes!)
I'm sure they could even OCR them... they just couldn't make them available to the public. Of course, given the community-driven mechanism by which Project Gutenberg works, they couldn't legally distribute them to the volunteers either...
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz