Slashdot Mirror


Google Releases Tesseract as Open Source

An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. After it sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at Sourceforge."

251 comments

  1. As much as I like open source software ... by Sonic+McTails · · Score: 0

    Can't spammers use this thing to break CAPTCHAs on sites like Slashdot and many other internet forums? CAPTCHAs have been very effective in stopping spammers in the past, but if they can now just read them and answer correctly, then they are effectively rendered useless ...

    --
    This signature was left intentionally blank.
    1. Re:As much as I like open source software ... by aweinert · · Score: 5, Informative

      CAPTCHAs are specifically meant to break OCR... and if you RTFA, it says it does poorly with grayscale and color documents. Basically it's meant for reading typed text... like in a book.

    2. Re:As much as I like open source software ... by illuminatedwax · · Score: 5, Funny

      You're right! Let us never delve into research that could conceivably overturn weak software security! Some things man was never meant to discover! Turn back, before we fly too close to the sun and our wings melt!! O, Prometheus, why hast thou given us this OCR technology??

      --
      Did you ever notice that *nix doesn't even cover Linux?
    3. Re:As much as I like open source software ... by Carthag · · Score: 2, Insightful

      OCR is most effective when the letter boundaries are clear and well-defined, such as fixed-width text, or text that is at least on a straight line. Most CAPTCHAs put the letters on a curved path, as well as distorting the letters so they are no longer within a clearly defined rectangular shape. This makes it very hard to identify which parts of the images are letters and which parts are not, making OCRing CAPTCHAs a non-trivial problem.

    4. Re:As much as I like open source software ... by djtack · · Score: 4, Insightful

      Plus, good OCR could help recognize image spam (where they send the text in an image attachment, to avoid filtering, and fill the message body with "bayes poison").

    5. Re:As much as I like open source software ... by ergo98 · · Score: 1
      Can't spammers use this thing to break CAPTCHAs on sites like Slashdot and many other internet forums? CAPTCHAs have been very effective in stopping spammers in the past

      While Slashdot has always been a target for trolls and miscreants, I don't ever remember it being a spammer's destination (note 4-digit UID). Even back in those crazy, hazy days when we didn't have to try to interpret some bizarro text -- AKA the vast bulk of Slashdot's existence -- somehow spammers were thwarted in their evil quest. Was Slashdot just feeling a bit left out, and just had to stick a CAPTCHA in there to be just like everyone else ("See!? Spammers like us too!")?

      CAPTCHAs should be replaced by forcing answers to submitted homework questions - kids get their homework done for them on a distributed network, and it somewhat proves that there's a human on the other end (no machine could interpret most homework questions).
    6. Re:As much as I like open source software ... by binarybum · · Score: 3, Funny

      careful, statements like that are likely to get you voted governor in some states.

      --
      ôó
    7. Re:As much as I like open source software ... by Millenniumman · · Score: 2, Interesting

      Why can't captchas just say "the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"? If someone can make a program that interprets that and gets the answer right after getting it off a captcha with OCR, then Google probably wants to know so they can hire them.

      --
      Stupidity is like nuclear power, it can be used for good or evil. And you don't want to get any on you.
    8. Re:As much as I like open source software ... by somethinghollow · · Score: 1

      Specifically like Google Books, I bet. Unless the book is multi-column, then fuck it and we'll wait for the single column edition.

    9. Re:As much as I like open source software ... by debiguana · · Score: 1
      (no machine could interpret most homework questions)
      With the stuff I've seen coming home lately, I'd venture that most humans can't interpret most homework questions, either!
    10. Re:As much as I like open source software ... by Jerf · · Score: 3, Insightful

      In order to pose the question, you have to generate it randomly. If it's not random, you already lost.

      In order to generate it, you're going to end up using a grammar.

      Running grammars in reverse is merely a matter of patience (to explore the space of problems the test program will pose) and the right tools; it's a fundamental bit of computer science.

      Granted, expecting spammers to be conversant with the fundamental elements of computer science is a pretty high bar, but it only takes one to leap it and the rest to buy the program from him.

      The image tests have the advantage that, done properly, they take more than just patience and computer science fundamentals to crack; cracking them would require fundamental advances in the art.

      (Note that nowhere in this message do I claim that image tests are perfect; in fact everything I know is vulnerable to the "feed it to a human in another context (viz, 'porn') and let them do the work" attack, and there are also points to be made about how widespread any given grammar/image test becomes; I know a website where the image test actually is a constant and so far it doesn't seem to be a problem because of scale issues. My point is that text tests have an additional disadvantage. It's not an intrinsically bad idea, though.)

      Google wouldn't be interested in hiring people who could crack this, merely because they can crack this. Might make a decent interview question, though.

      (You might also be tempted to think that you could just use a really complicated grammar, but you are constrained by two things, the human supposedly reading and taking the test, and the complexity of the human language itself. By the time you write some problem generator that could reliably throw off a parser, you'll be reliably confusing the hell out of your human users, too.)
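
      The point about running a grammar in reverse can be sketched in a few lines. This is only an illustration under stated assumptions: the three question templates are hypothetical, and a real text CAPTCHA would have a larger (but still finite) grammar. Each template the generator uses is turned mechanically into a regular expression, so the solver never needs to "understand" the question.

```python
import random
import re

# Hypothetical question templates; a real site would have its own wording.
TEMPLATES = [
    "what is {a} + {b}",
    "what is the sum of {a} and {b}",
    "add {a} to {b}",
]


def generate(rng):
    """Produce a (question, answer) pair from the tiny grammar above."""
    a, b = rng.randint(0, 9), rng.randint(0, 9)
    return rng.choice(TEMPLATES).format(a=a, b=b), a + b


def template_to_regex(template):
    """Reverse a template: escape the literal text, capture the two numbers."""
    escaped = re.escape(template)
    escaped = escaped.replace(re.escape("{a}"), r"(\d+)")
    escaped = escaped.replace(re.escape("{b}"), r"(\d+)")
    return re.compile(escaped)


PATTERNS = [template_to_regex(t) for t in TEMPLATES]


def solve(question):
    """Return the answer if the question matches a known template, else None."""
    for pattern in PATTERNS:
        match = pattern.fullmatch(question)
        if match:
            return int(match.group(1)) + int(match.group(2))
    return None
```

      Once an attacker has collected enough sample questions to recover the templates, every future question from the same generator is solved for free.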

    11. Re:As much as I like open source software ... by Otto · · Score: 3, Insightful

      Or write up a quick script to cut the images in half down the middle and save them as a series of other images.

      --
      - Give a man a fire and he's warm for a day, but set him on fire and he's warm for the rest of his life.
    12. Re:As much as I like open source software ... by deafpluckin · · Score: 1
      and if you RTFA, it says it does poorly with grayscale and color documents.
      ...but there are trivial image transformations that you can do to turn a color image into grayscale, then a grayscale image into a binary image. Any matrix manipulation package is capable of doing these transformations.
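
      A minimal sketch of the two transformations described above, in plain Python so it runs without any imaging library. The luma weights (0.299, 0.587, 0.114) are the common Rec. 601 values, and the fixed threshold of 128 is an assumption for illustration; real scans usually need an adaptive threshold.

```python
def to_gray(pixel):
    """Convert an (R, G, B) pixel with 0-255 channels to one gray level."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def binarize(image, threshold=128):
    """Turn a grid of RGB pixels into a grid of 0 (dark) and 1 (light)."""
    return [
        [1 if to_gray(pixel) >= threshold else 0 for pixel in row]
        for row in image
    ]


# A tiny 2x2 "image": white and light gray become 1, black and dark gray become 0.
sample = [
    [(255, 255, 255), (0, 0, 0)],
    [(200, 200, 200), (30, 30, 30)],
]
binary = binarize(sample)
```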
    13. Re:As much as I like open source software ... by StarkRG · · Score: 1

      some states?

      Statements like that are likely to get you elected to Congress or the Presidency...

      It's the same kind of logic as "We can't find them, thus they must be there..."

    14. Re:As much as I like open source software ... by rm69990 · · Score: 1

      Naw, more like trollish babbling. OCR doesn't handle curving lines and distorted letters well. If you want to make yourself seem intelligent, at least research your shit first and try to stay on topic. :)

    15. Re:As much as I like open source software ... by ajs · · Score: 2, Funny

      That's no problem! All I really need it to do is allow all of those geeks out there to share those great Playboy articles with me over p2p networks! I'm tired of just getting the filler photography! ;-)

    16. Re:As much as I like open source software ... by benplaut · · Score: 1

      Only if the CAPTCHA makers don't test it through tesseract beforehand...

    17. Re:As much as I like open source software ... by 1u3hr · · Score: 1
      Can't spammers use this thing to break CAPTCHAs

      Captchas are designed to be difficult to OCR. Besides there are plenty of OCR apps around already, if you hadn't noticed. I don't think spammers have been holding out for a GPL one.

    18. Re:As much as I like open source software ... by Phroggy · · Score: 2, Informative

      I am currently using the FuzzyOcr plugin to SpamAssassin, and it uses gocr to do the character recognition. To be sure, gocr is improving (the stable released version is practically useless, but the CVS version actually works, mostly), but if Tesseract is better, great!

      --
      $x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
      $x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
    19. Re:As much as I like open source software ... by Anonymous Coward · · Score: 2, Interesting

      "the letter at the beginning of the word that is spelled the reverse of the 3 letter name for an underwater vehicle or sandwich"

      My wife would fail this test. My father will fail this test. My step-mother will fail this test. My children will fail this test.

      A computer will very easily get this test right one time in 26.

      In one word: Useless.

    20. Re:As much as I like open source software ... by Anonymous Coward · · Score: 0

      Spammers break CAPTCHAs by displaying them to humans on another site, where solving them is made a prerequisite to accessing "free" porn. Easy.

    21. Re:As much as I like open source software ... by Sgt.+CoDFish · · Score: 1

      Your title and post make you sound like you think this shouldn't be released open source, just in case spammers use it.

      Well, then OOo will have to stop releasing their office suite: just think, Base could be used to store e-mail addresses to spam! Or, maybe no open source e-mail clients should be released, because the spammers might use it to send spam!

      Don't blame the software for the way it is used; it's the user's fault if (s)he decides to use it malevolently. Most software has the potential for misuse, some more than others, but that doesn't mean that fear of spam should stop tools that have a chance to be misused from being released. Just think of the positive uses of programs like this.

      Besides, it's more than easy enough for spammers to just make a program to do stuff like break CAPTCHAs (yes, I know they're designed to defeat spammers, but nothing's perfect).

    22. Re:As much as I like open source software ... by cduffy · · Score: 1

      ...and part of a good CAPTCHA is causing these transformations to come up with useless output.

    23. Re:As much as I like open source software ... by Arancaytar · · Score: 3, Insightful

      Yes, by using contrasting colors that convert to the same tone in grayscale. A side effect being that most such technologies also shut out colorblind people...
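
      To illustrate how two contrasting colors can collapse to the same gray tone: the sketch below assumes the common Rec. 601 luma weights, and the specific red and green values are chosen for the example, not taken from any real CAPTCHA.

```python
def to_gray(pixel):
    """Convert an (R, G, B) pixel with 0-255 channels to one gray level."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


# Illustrative colors: clearly different to most human eyes,
# yet both map to the same gray level under these weights.
RED_TEXT = (255, 0, 0)
GREEN_BACKGROUND = (0, 130, 0)

# One row of an image: text pixels alternating with background pixels.
row = [RED_TEXT, GREEN_BACKGROUND, RED_TEXT, GREEN_BACKGROUND]
gray_row = [to_gray(pixel) for pixel in row]

# Every pixel ends up at the same level, so no threshold can separate them.
text_survives = len(set(gray_row)) > 1
```

      An attacker can sidestep this by working on individual color channels instead of a weighted gray value, which is part of why the trick only raises the bar.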

    24. Re:As much as I like open source software ... by Anonymous Coward · · Score: 0

      Your reply, with some generalization, could work as a Mission Statement on the wall in any number of companies...

    25. Re:As much as I like open source software ... by Shaper_pmp · · Score: 1

      Plus, IIRC CAPTCHAs don't really work anyway.

      --
      Everything in moderation, including moderation itself
    26. Re:As much as I like open source software ... by Dan+Ost · · Score: 2, Informative

      As someone who has been involved in applying OCR to real world problems, there's nothing
      trivial about generating good binary images from images taken in the field (in my case,
      images of boxes moving down a conveyor belt or hand imaged by workers).

      Even if you disregard such problems as uneven lighting, glare, and distortion due to the
      unavoidable vibration inherent to plant settings, most forms that are interesting to
      OCR are handwritten and not designed to be OCR friendly. Hopefully this will change as
      the people who design such forms become more conscious of the capabilities of OCR, but
      even if that were to happen tomorrow, it would take years to complete the transition.

      --

      *sigh* back to work...
    27. Re:As much as I like open source software ... by Anonymous Coward · · Score: 2, Interesting

      I gave up on CAPTCHA, the spammers have some really good software which can deal with this. My site used to get about 5-10 bot registrations a day. So I changed tactics, and simply ask "Are you a bot? (don't answer this question!)". If they answer this question, registration is denied, no matter what e-mail address or IP they are using. This alone is 100% effective, but I do have some other questions as a backup, just in case. It's rather interesting how all these registrations seem to follow the same pattern, almost like there is only one decent 'spam package' out there.
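
      The "don't answer this question" trick is usually called a honeypot field. A minimal sketch, assuming the registration form arrives as a dictionary of field names to submitted strings; the field name `are_you_a_bot` is hypothetical.

```python
# Hypothetical name of the form field humans are told to leave empty.
HONEYPOT_FIELD = "are_you_a_bot"


def is_probably_bot(form):
    """Return True if the submitter filled in the field they were told to skip.

    `form` is assumed to be a dict mapping field names to submitted strings.
    A missing or whitespace-only value counts as "left blank".
    """
    return bool(form.get(HONEYPOT_FIELD, "").strip())


def handle_registration(form):
    """Deny registration for honeypot hits, regardless of e-mail or IP."""
    if is_probably_bot(form):
        return "denied"
    return "accepted"
```

      This only catches bots that blindly fill every field; a script written for one specific site can simply skip the field.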

    28. Re:As much as I like open source software ... by charlesr · · Score: 1

      As with most problems in computer science, this can be solved by a one-line perl program:

          perl -e 'print "b\n"'

      --
      -- Charles Reindorf
    29. Re:As much as I like open source software ... by Anonymous Coward · · Score: 0

      It was used to stop GNAA crapfloods (bots posting the same thing to flood a story).

      Kind of ironic though, because they also published a story about captchas being defeated written by GNAA member Sam.

    30. Re:As much as I like open source software ... by mapkinase · · Score: 1

      For those folks the blogs on www.livejournal.com have an audio version of CAPTCHA.

      --
      I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
    31. Re:As much as I like open source software ... by Stoenhenge · · Score: 1
      Software written with the intent of doing good, being used to do evil?

      That's never happened before!

  2. I take back every bad thing I said about Google by OrangeTide · · Score: 4, Interesting

    HOORAY! Good free OCR software is in short supply. I wonder if this will have a positive impact on Project Gutenberg?

    --
    “Common sense is not so common.” — Voltaire
    1. Re:I take back every bad thing I said about Google by Commie1 · · Score: 2, Interesting

      I've been using Tesseract for a PG project for a few weeks now and, as TFA says, it's not as good
      as some commercial ones out there. ABBYY FineReader seems to be the OCR software of choice for
      Distributed Proofreaders, at least.
      Tesseract just has ASCII support (for now, as they like to add), so it ignores italics, accents etc.
      In the case of the book I'm working on, it had a very hard time with the ff ligature and had some
      trouble with b and c: "but" became "hut", "he" became "be", and c was often an o or e.
      The words "difficult", "office" and "scientific" were the standard pitfalls. On some pages it was nearly flawless though.
      The biggest advantages to me are clearly that it is free*, it's good enough and I can use it on my preferred OS.

      * Mostly Apache License v2.0, a part of it is under a "freely use and modify for research and development purposes" license however.

  3. Anti-spam by Bacon+Bits · · Score: 2, Interesting

    This should be useful for adding anti-image spam capabilities to FOSS anti-spam programs.

    --
    The road to tyranny has always been paved with claims of necessity.
    1. Re:Anti-spam by ZSpade · · Score: 0, Redundant

      Following that logic, wouldn't this then also be just as useful a tool to spammers looking to crack those crazy registration verification images?

      It's open source, available to all who would use it; be it for good or profit, but hopefully both.

      --
      Go ahead and call me unreliable; reliable is just a synonym for predictable.
    2. Re:Anti-spam by jrockway · · Score: 1

      > Following that logic, wouldn't this then also be just as useful a tool to spammers looking to crack those crazy registration verification images?

      Let me let you in on a little secret. CAPTCHAs were broken a long time ago. They're the equivalent of writing your password on a sticky note and putting it under your keyboard.

      I recommend authenticating people with strong cryptography, which is how people can post to my blog.

      --
      My other car is first.
    3. Re:Anti-spam by Anonymous Coward · · Score: 0

      > I recommend authenticating people with strong cryptography, which is how people can post to my blog

      Mmm. 5 articles on your blog, and zero comments. Indeed, your crypto seems pretty effective...

    4. Re:Anti-spam by Phroggy · · Score: 2, Interesting

      Following that logic, wouldn't this then also be just as useful a tool to spammers looking to crack those crazy registration verification images?

      Yes, absolutely, and spammers are already using image obfuscation techniques: using italic difficult-to-read fonts spaced very close together (difficult to separate the image into individual characters and difficult to identify each character once you do), using colored backgrounds to make the text very low-contrast when converted into a monochrome image the OCR software can use, using animated GIFs (as mentioned previously in another article) so that if you only convert the first or last frame of the animation you won't get anything useful, and finally splitting the image into multiple pieces that are assembled together with HTML. The only solution I see to this last problem is to develop spam filtering software that uses Gecko or KHTML to render the HTML and analyze the rendered page.

      In the war between spammers and anti-spammers, the spammers are clearly winning, and they will continue to win for the foreseeable future. No technical solution can stop spam, only certain limited types of spam - but the spammers are constantly adapting. I believe if Congress were to earmark funding for the investigation and prosecution of spammers, we could actually make a significant dent in the problem (other governments have already expressed a willingness to cooperate).

      It's difficult to legally define spam in such a way that makes spam illegal without infringing the right to freedom of speech and press, and I believe we need to err on the side of protecting liberty at the expense of some spam being legal. This is what CAN-SPAM has done - it's far from perfect, but it's a good start. CAN-SPAM has gotten a lot of criticism for being too easy for spammers to work around, but how much spam do you get that actually complies with the law? Not much... so why aren't we prosecuting violators right and left? Limited resources. Given the choice between tracking down a spammer and tracking down a murderer/rapist/child molester/etc., both of which cost money, most of us recognize that spam is a less severe problem. More resources need to be allocated to the appropriate law enforcement agencies so they can deal with both.

      Oh, and if my argument about CAN-SPAM was unconvincing, consider this: nearly all the image-based spam I've been seeing lately has been either for penny stocks or prescription medications (i.e. "male enhancement" products). Both of these are already clearly illegal (the SEC and FDA are the respective government agencies responsible, I believe). It should be possible to prosecute the spammers for stock market manipulation and dispensing controlled drugs without a prescription, even if sending spam weren't against the law.

      Some here will call for spammers to be sentenced to life in prison without parole, execution, castration, public hanging, public stoning, or worse. Get over it. Forget about revenge, that's not what our criminal justice system is for. All I want is for the spam to stop, and for the spammer to lose whatever they've gained from it. That should be enough. Let the spammers become productive members of society if they're capable of doing so; lock them up if they cause any further trouble.

      Is this the final solution? No, of course not. But let's start with this, and see how it goes for now. If this works, spam won't go away, it will just change into new forms... and that's OK. When that happens, we can find new ways of dealing with it. The hope is that after that happens a few times, it will become much less of a problem. Maybe not, but I'd sure like to find out!

      --
      $x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
      $x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
    5. Re:Anti-spam by tehcyder · · Score: 1
      They're the equivalent of writing your password on a sticky note and putting it under your keyboard.
      How the hell did you know I do that?
      --
      To have a right to do a thing is not at all the same as to be right in doing it
  4. improvements by Anonymous Coward · · Score: 5, Funny

    Google cleaned up some of the more outdated portions of the code
    i.e., added AdSense to the OCR output.

    1. Re:improvements by puddpunk · · Score: 1

      It's open sauce, take it out! :P

    2. Re:improvements by Anonymous Coward · · Score: 1, Funny

      I hope this isn't the same OCR Google Books is using. They managed to mangle one of the most famous chapter titles in literature.

  5. Hoping OCR will improve? by smileytshirt · · Score: 3, Insightful

    My guess is that they are doing this in the hope the open source community will build on and improve OCR technology. This would be in Google's interest, as it can then index text from images (such as their own Books project) more accurately and efficiently.

    --
    www.shortman.com.au - top shorted stocks on the ASX
    1. Re:Hoping OCR will improve? by grammar+fascist · · Score: 1

      My guess is that they are doing this in the hope the open source community will build on and improve OCR technology.

      More likely the computer vision research community, actually. "Many eyes" help a lot with bugs and bugfixes, but, ironically, not so well on nontrivial vision tasks.

      --
      I got my Linux laptop at System76.
  6. Finally! by nihilatron · · Score: 3, Funny

    Now I can finally see how to tell the difference between the 'A'-ness of 'A' and the 'P'-ness of 'P'!

    (Credit to S.G.)

  7. From the Project by Gopal.V · · Score: 4, Insightful

    > It was open-sourced by HP and UNLV in 2005.

    So Google basically did what? Fix bit-rot? Google has re-released some open source code, essentially forking off the original?

    > License: (None Listed)

    I'm a fan of the FOSS idea. Basically that makes sure that the whole work to which I contributed always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see is a License after I see some code.

    So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.

    1. Re:From the Project by Sir_Lewk · · Score: 1

      Apparently it's released "mostly" under the Apache License. The "COPYING" file in the download contains license information.

      --
      "linux is just DOS with a UNIX like syntax" -- Galactic Dominator (944134)
    2. Re:From the Project by kevlarman · · Score: 3, Informative

      if you had bothered to browse cvs you would find that it has been released under the apache license: http://tesseract-ocr.cvs.sourceforge.net/tesseract-ocr/tesseract/COPYING?view=markup

      --
      A mouse is a device used to point to the xterm you want to type in
    3. Re:From the Project by Anonymous Coward · · Score: 0

      So download it, and read the "copying" file, and see that it's mostly under the Apache license.

    4. Re:From the Project by 1+a+bee · · Score: 1
      >And the second thing I need to see is a License after I see some code.

      It's licensed under the Apache license. See the README at http://tesseract-ocr.cvs.sourceforge.net/tesseract-ocr/tesseract/README?revision=1.1&view=markup

  8. I'm sorry Dave... by macadamia_harold · · Score: 4, Funny

    Originally developed at the HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available.

    Yeah, but how is it on lip-reading? That's when we really need to worry.

    1. Re:I'm sorry Dave... by MichaelSmith · · Score: 2, Interesting
      Yeah, but how is it on lip-reading? That's when we really need to worry.

      Given that my laptop has a microphone I was a bit worried about the recent article on Google sampling sound on people's computers. But my wife's laptop also has a webcam. Should I tell my wife not to google in bed? If the mic is off will they still catch what she is talking about?

      Dave why don't you take a stress pill and lie down. If you are looking for something to read there is always google news.

  9. Hosting by truthsearch · · Score: 5, Interesting

    Is there any particular reason google isn't hosting the project themselves?

    1. Re:Hosting by larry+bagina · · Score: 5, Funny

      Yes. They need the 99.9999% uptime (6 9s) that only sourceforge can provide.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

    2. Re:Hosting by boots_the_monkey · · Score: 1

      I thought the same thing when sourceforge came up. Someone might catch some flak over that one.

    3. Re:Hosting by jZnat · · Score: 1

      I'm pretty sure Sourceforge goes through more than 2.6 seconds of downtime per month...

      --
      'Yes, firefox is indeed greater than women. Can women block pops up for you? No. Can Firefox show you naked women? Yes.'
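
      For reference, the downtime budget implied by an uptime percentage is simple arithmetic. The sketch below assumes a 30-day month and a 365-day year; with those assumptions, "six nines" works out to roughly 2.6 seconds per month and about 31.5 seconds per year.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumes a 30-day month
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumes a non-leap year


def allowed_downtime_seconds(uptime_percent, period_seconds):
    """Seconds of downtime permitted in a period at a given uptime percentage."""
    return period_seconds * (1 - uptime_percent / 100)


six_nines_month = allowed_downtime_seconds(99.9999, SECONDS_PER_MONTH)
six_nines_year = allowed_downtime_seconds(99.9999, SECONDS_PER_YEAR)
```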
    4. Re:Hosting by Leto-II · · Score: 4, Funny

      I think you need to recalibrate your sarcasm detector.

      --
      Do not anger the worm.
    5. Re:Hosting by littlem · · Score: 1

      The people who modded this "Funny" obviously have projects on SF themselves, and know exactly what their uptime is like - put politely, nothing to write home about.

    6. Re:Hosting by srussell · · Score: 1
      They need the 99.9999% uptime (6 9s) that only sourceforge can provide.
      You misspelled "downtime". --- SER
    7. Re:Hosting by makomk · · Score: 1

      No idea. They seem to host most/all of their open-source releases over at Sourceforge, though.

  10. Sourceforge? by JackieBrown · · Score: 1

    I thought Google was opening up their own open source repository: http://www.newsforge.com/article.pl?sid=06/07/27/1833251

  11. Umm... by Anonymous Coward · · Score: 0

    It's just an OCR program, you fucking nerds. Nothing to get interested with, despite being released by Google. That's the only reason it was posted on Slashdot isn't it? Cos it's Google.

    Posted anon because I wanted to test the OCR on Slashdot's CAPTCHA for Anon Cowards.

    1. Re:Umm... by Anonymous Coward · · Score: 0

      Gotta see this newfangled CAPTCHA thing.

  12. NFB owns you by tepples · · Score: 4, Interesting
    CAPTCHAs have been very effective in stopping spammers in the past, but if they can now just read them and answer correctly, then they are effectively rendered useless ...

    They're already useless if installing one will subject your business to boycotts and/or lawsuits from National Federation of the Blind and other advocates for people with disabilities.

    1. Re:NFB owns you by MrNonchalant · · Score: 4, Informative

      You can build accessible CAPTCHAs, using images with a sound backup for blind users. My girlfriend is visually impaired and non-accessible CAPTCHAs are a real problem for her; she can't register at some sites without assistance.

    2. Re:NFB owns you by Isotopian · · Score: 1

      Just put up a big banner telling em where to click then.

      Obligatory Helen Keller Joke:

      How did Helen Keller break her arm?
      Reading the road signs!

      --

      It's poetry with a beat behind it! And guns! They're like beatniks with automatic weapons.

    3. Re:NFB owns you by mrchaotica · · Score: 1

      I'm blind and deaf, you insensitive clod!

      (Not really, but someone could be...)

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

    4. Re:NFB owns you by Phroggy · · Score: 1

      How do people who are blind and deaf use the World Wide Web? I'm not saying it couldn't be done, but unless it actually is done, we shouldn't need to worry about it.

      --
      $x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
      $x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
    5. Re:NFB owns you by Anonymous Coward · · Score: 0

      I'm a big fan of asking the user a simple random question, such as "what is 2 + 5". It's much better than some of the horrible captchas (like Slashdot), and accessible too.

      The worst captchas are the ones found on Slashdot. Sometimes they are impossible to read. When I enter the wrong text, Slashdot doesn't generate a new one but displays the same unreadable image!

    6. Re:NFB owns you by Punboy · · Score: 1

      It's called a braille display.

      --
      If you like what I've said here, and want to read more, go to http://www.krillrblog.com
    7. Re:NFB owns you by Anonymous Coward · · Score: 0

      Indeed, braille display

    8. Re:NFB owns you by Anonymous Coward · · Score: 0

      The same way the blind use the net of course. You don't have to hear to browse the net (actually it is an advantage with MySpace).

      A blind and deaf person can read braille just as well as a blind person :-)

    9. Re:NFB owns you by pipatron · · Score: 1

      Just enter "what is 2 + 5" in google, and you'll see why that doesn't help anymore...

      --
      c++; /* this makes c bigger but returns the old value */
    10. Re:NFB owns you by Phroggy · · Score: 1

      So, can we make braille CAPTCHAs?

      --
      $x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
      $x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
    11. Re:NFB owns you by Anonymous Coward · · Score: 0
      How do people who are blind and deaf use the World Wide Web?

      Ah hell, we have a blind, deaf, and dumb person in the white house, so using the web is nothing.

    12. Re:NFB owns you by maxwell+demon · · Score: 2, Funny

      Of course you can resort to other, harder to calculate questions like: "What is the answer to life, the universe and everything?" Oops, Computers seem to have become much faster since Deep Thought! :-)

      --
      The Tao of math: The numbers you can count are not the real numbers.
    13. Re:NFB owns you by stiggle · · Score: 1

      But the NFB website itself is not standards compliant. http://validator.w3.org/check?uri=http%3A%2F%2Fwww.nfb.org%2Fnfb%2FDefault.asp

    14. Re:NFB owns you by indifferent+children · · Score: 1, Funny
      I'm a big fan of asking the user a simple random question, such as "what is 2 + 5".

      I'm tired of all of the anti-Americanism on /. If you want to exclude Americans from your site, go ahead; but don't rub our noses in it.

      --
      Censorship is telling a man he can't have a steak just because a baby can't chew it. --Mark Twain
    15. Re:NFB owns you by tepples · · Score: 1
      So, can we make braille CAPTCHAs?

      That's considered a Hard Problem(tm). Refreshable braille displays take text input, such as (using HTML as an example) element text and valid alt attributes. Spambots can read plain text input. You'll have to make a completely text-based test based on reasoning, and that's still not solved.

    16. Re:NFB owns you by ultranova · · Score: 1

      You'll have to make a completely text-based test based on reasoning, and that's still not solved.

      Yeah. What if you're blind, deaf and stupid ?-)

      Hmm... Maybe you could make the user classify random pieces of text as "Offtopic", "Insightful", "Informative" etc ? Then have someone else declare this classification correct or incorrect. That way you'll tell spambots from humans for sure ;).

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    17. Re:NFB owns you by drinkypoo · · Score: 1

      The captcha module for Drupal, which admittedly produces the ugliest captcha I've ever seen, will also produce text captchas if you value accessibility. They are just simple math problems that could certainly be solved by a computer, but you COULD implement a captcha that would prevent a computer from easily snapping them up.

      For example, you could have a list of animals and their characteristics, and you could pick some animals at random, then figure out which of their characteristics are unique to them out of your random set; put the animals (with alt-tag names of course) next to the simple math problems. Then your form field says "please input the answer to the math problem next to the animal with a long nose" or what have you.

      Of course, this too can be scripted eventually, but if you're always adding new animals then it becomes the same old arms race; just stay ahead.
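A rough sketch of the animal idea above, in Python. The trait table, the function name and the three-animal draw are all made up for illustration; this is not how the Drupal module works. It picks some animals, finds a trait that only one of them has within that draw, and asks for the sum printed next to that animal.

```python
import random

# Hypothetical trait table; the idea only works if new animals keep being added.
ANIMALS = {
    "elephant": {"a long nose", "big ears"},
    "giraffe": {"a long neck", "spots"},
    "zebra": {"stripes", "hooves"},
    "leopard": {"spots", "whiskers"},
    "rabbit": {"big ears", "whiskers"},
    "horse": {"hooves", "a mane"},
}


def make_challenge(rng, count=3):
    """Return (rows, question, answer) for one text-only CAPTCHA."""
    while True:
        chosen = rng.sample(sorted(ANIMALS), count)
        # Keep only traits held by exactly one animal in this particular draw.
        unique = [
            (animal, trait)
            for animal in chosen
            for trait in sorted(ANIMALS[animal])
            if sum(trait in ANIMALS[other] for other in chosen) == 1
        ]
        if unique:
            break  # otherwise redraw until some trait singles out one animal
    # Each row is (animal name, first addend, second addend).
    rows = [(animal, rng.randint(1, 9), rng.randint(1, 9)) for animal in chosen]
    target, trait = rng.choice(unique)
    answer = next(a + b for animal, a, b in rows if animal == target)
    question = f"Enter the answer to the sum next to the animal with {trait}."
    return rows, question, answer
```

Each row would be rendered as the animal (with its name as alt text) beside "a + b", so a screen reader gets the same information as a sighted user. A bot that has scraped the trait table solves this immediately, which is the arms race mentioned above.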

      --
      "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    18. Re:NFB owns you by DragonWriter · · Score: 1
      A blind and deaf person can read braille just as well as a blind person :-)
      Most blind people, IIRC, can't read braille at all: it's rather hard to learn, substantial material in braille is incredibly bulky, and other ways of replacing printed information (e.g., audio) are cheaper and more accessible. Given the lack of competing alternatives, I'd hope blind-and-deaf people (on average) were better at reading braille than blind people, though I have no idea if that's actually the case.
  13. Sonny Bono pwned Gutenberg by tepples · · Score: 1
    I wonder if this will have a positive impact on Project Gutenberg?

    Should we praise technology that helps Project Gutenberg run out of pre-1923 books faster? Once all notable pre-1923 books are scanned, OCR'd, and cleaned up, then what does PG do?

    1. Re:Sonny Bono pwned Gutenberg by Anonymous Coward · · Score: 1, Funny
      Once all notable pre-1923 books are scanned, OCR'd, and cleaned up, then what does PG do?

      Maybe they could ... I dunno ... READ the books?
    2. Re:Sonny Bono pwned Gutenberg by technos · · Score: 1

      Gutenberg already uses OCR. Has for a decade at least.

      --
      .sig: Now legally binding!
    3. Re:Sonny Bono pwned Gutenberg by bersl2 · · Score: 1

      Well, pending another retroactive extension of copyright (I don't even want to start on that...), works will begin to enter the public domain.

    4. Re:Sonny Bono pwned Gutenberg by ma++i+ude · · Score: 1
      Gutenberg already uses OCR. Has for a decade at least.

      Indeed it has. And as their scanning FAQ explains, they recommend you buy an OCR software package. I'm all for having the right tools for the job, even if it means going non-OSS, but if these packages are available for free, it encourages more people to participate. Surely that's a good thing?

      --
      You can't shut us down! The Internet is about the free exchange and sale of other people's ideas!
  14. i hope it can augment the SpamAssassin OCR plugin by sednet · · Score: 2, Informative

    it would be great if tesseract could augment the gocr-based FuzzyOCR and OCR plugins for SpamAssassin.

    --
    about sean dreilinger
  15. Yay! by The+MAZZTer · · Score: 1

    No binaries! Only source code! Good luck getting it to compile on Windows, I gave up after I got several dozen obscure errors I had never seen before from the compiler.*

    * If anyone can get VC++2K5 to compile it, please post.

    1. Re:Yay! by cduffy · · Score: 1

      Yes, the source is crap. Look at the debugging console -- they're *spawning an xterm* for output that would traditionally go to stderr. Don't have a DISPLAY set? Program crashes. Building on MacOS? Lucky you -- they have a bunch of commented-out code for running a separate window to display (what-should-be) stderr on the Mac; consequently, instead of getting output to stderr (which would actually be *useful* for redirection to a file, or direct output to the console, or whatever) it goes off into nowhere because the code to do the display is commented out! (I've written a patch for this -- making the separate window a configure option -- and will submit it at some point... maybe today). It's good that Google is paying for someone to clean this thing up -- it needs it.

      [Might be that the Mac-only code is pre-OSX, btw -- I'm not really sure, as my experience with Mac-targeted development is pretty limited -- but if so, it makes more sense]

      And the dependency on a neural network library that's available for research/personal use only really sucks; my interest for this was in inbound fax categorization in a (very low budget) commercial environment; however, that feature is low-priority enough that getting our business types to even talk to their business types (much less actually spend money) is a fairly dubious thing.

    2. Re:Yay! by kalidasa · · Score: 1

      If the code was written in 1995, it would make sense that the Mac-only code would be pre-OS X, as OS X didn't exist yet; and on previous versions of the Mac OS, there was no easy-access terminal emulator, so I can see why they would be putting it into a window for display.

    3. Re:Yay! by Dishwasha · · Score: 1

      If you have a handy copy of VC6.0, there are 3 things you have to do to get this to compile. First, remove mfcpch.cpp from the project list under the ccutil folder. Second, remove getopt.cpp from the project list under the ccutil folder. Third, go to Project, Settings, C/C++ tab, the "Precompiled Headers" Category, and select Not using precompiled headers. Make sure you do the third step under Release and Debug individually if you want to build each type. Once you've done this you will have a working tesseract executable under Windows. Enjoy!

    4. Re:Yay! by slashkitty · · Score: 1

      yeah, what is w/ the xterm? Anyway, I wouldn't call this code 1.0 .. It doesn't even ocr the sample they provide with it.

      --
      -- these are only opinions and they might not be mine.
  16. Un-Finishable by Kadin2048 · · Score: 5, Interesting

    In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules. They have everything written by humanity before that date to digitize: not just English language books and "classics," but government documents, records, foreign language texts, ancient manuscripts ... everything. That's as close to an un-finishable task as you can set yourself, I think.

    Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.

    With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.

    --
    "Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
    1. Re:Un-Finishable by mrchaotica · · Score: 4, Insightful
      In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules.

      Your argument makes the fundamentally flawed assumption that the "new rules" will remain constant. The reality is that Copyright will continue getting extended so that new content never comes into Public Domain. (I hope the copyright fuckers are the first against the wall when the revolution comes!)

      Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration.

      I'm sure they could even OCR them... they just couldn't make them available to the public. Of course, given the community-driven mechanism by which Project Gutenberg works, they couldn't legally distribute them to the volunteers either...

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

    2. Re:Un-Finishable by HuguesT · · Score: 2, Interesting

      This is patently false. New stuff comes out of copyright every day. However, coming out of copyright is not the same thing as becoming available to the public. Clearly this is where Project Gutenberg comes in.

      One enormous area I'm personally interested in is sheet music. Some of the music I'm interested in playing came out of copyright decades or even centuries ago. No one is going to reclaim copyright on Mozart's Requiem, for instance. Yet it is by and large not available to the public because translating original manuscript sheet music into something that modern musicians can play without too much trouble is a huge undertaking.

      Yet I have no doubt that this will eventually happen. PG already has a section devoted to sheet music. The tools are beginning to appear: LilyPond is a superb Free music engraving software package. I'm personally working on music OCR software, and others are as well, I'm sure. Eventually this will work out well, I think.

      The public is in the process of reclaiming what is theirs, this is pretty much unstoppable right now.

    3. Re:Un-Finishable by WilliamSChips · · Score: 1
      (I hope the copyright fuckers are the first against the wall when the revolution comes!)
      I'm from the future. They were the first against the wall when the revolution came.
      --
      Please, for the good of Humanity, vote Obama.
    4. Re:Un-Finishable by gweeks · · Score: 3, Informative

      > This is patently false. New stuff comes out of copyright every day.

      This is just so untrue. In the United States (the only place that Project Gutenberg worries about) nothing is entering the Public Domain except unpublished manuscripts where the author died 70 years ago. Nothing else will enter the public domain until 2019. Congress has effectively frozen the public domain.

    5. Re:Un-Finishable by Anonymous Coward · · Score: 0

      Yes, damn our government straight to hell for even trying to come up with a good compromise between the rights of creators and the public domain.

      Idiot. You can disagree with the solution as much as you want, but don't try to blame government for the problem. Individuals have rights, and among those rights is the right not to have the stuff you make taken without your permission. Copyright law tries to strike a balance between this unalienable right and the public good. It might not always strike the perfect balance, but it's dumb as rocks to deny that a balance needs to be struck, and that that balance needs to keep up with the times.

      Now let me ask you this. How come people get all up in arms when the government seizes land through eminent domain -- you know, for the good of the public -- but get equally all up in arms when the government DECLINES to seize intellectual property as quickly as they seem to think the government should?

      Could it be because people don't give two shits about principle, and just want to take and take for themselves?

    6. Re:Un-Finishable by fotbr · · Score: 2, Informative

      Unless estate holders release it early. Or the author and holder of the copyright declares in his/her will that his/her work be released into the public domain upon his death, etc.

      Just because it's not common (or likely) doesn't mean it can't happen.

    7. Re:Un-Finishable by FeatureBug · · Score: 1
      Yet I have no doubt that this will eventually happen. PG already has a section devoted to sheet music. The tools are beginning to appear : lilypond [lilypond.org] is a superb Free music engraving software package. I'm personally working on music OCR software, and others are as well I'm sure. Eventually this will work out well I think.
      What you describe already exists in the Mutopia project.
    8. Re:Un-Finishable by mrchaotica · · Score: 1
      This is patently false.

      I really don't like it when people accuse me of making false statements, when I'm not doing so!

      What gweeks (the author of the other reply) said is true: In the United States, NO published work will automatically enter the Public Domain until 2019. This is a fact.

      And although it's technically possible for an author to explicitly release his work himself, it doesn't count because it doesn't solve the problem.

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

    9. Re:Un-Finishable by bcrowell · · Score: 1

      This is patently false. New stuff comes out of copyright every day.
      Er, no. Congress has been regularly extending the term of copyrights, so that nothing ever expires.

    10. Re:Un-Finishable by Chapter80 · · Score: 1
      This is patently false.
      Patents and copyrights aren't the same thing. ;-)
    11. Re:Un-Finishable by elrous0 · · Score: 1
      I hope the copyright fuckers are the first against the wall when the revolution comes!

      Won't do any good to shoot them. Bleach is best when you're trying to kill pond scum and fungi.

      -Eric

      --
      SJW: Someone who has run out of real oppression, and has to fake it.
    12. Re:Un-Finishable by frankie · · Score: 1

      Oh, that's wonderful news. Please give us all a rough time estimate, with 1 significant digit of precision, when any well-known copyrighted work (such as Of Mice and Men, Doc Savage, or the song Happy Birthday) will actually come out of copyright.

      Yep, I thought not.

    13. Re:Un-Finishable by Chandon+Seldon · · Score: 1

      With enough bullets at high enough velocity, the concussive force will do the job.

      --
      -- The act of censorship is always worse than whatever is being censored. Always.
    14. Re:Un-Finishable by AnyoneEB · · Score: 1

      He did not say that copyright is bad. He said that the current copyright laws allow for a ridiculously long copyright period. Or, in other words, he agrees with you that a balance must be struck between the rights of the creators and the public domain, just he believes the balance in current US law gives too much to the creators. I agree.

      There is no "unalienable right" to prevent others from copying your works. The government grants that right to creators in order "To promote the Progress of Science and useful Arts" (US Constitution). In other words, the end of copyright does not signify a gift from the creator to the public domain. Copyright is a sort of "loan" from the public domain to the creator to encourage the creator to publish.

      I agree that copyright is necessary, but currently the majority of the money made off a published work is made very soon after the work is published. Another 90 years of protection will not encourage creators. Obviously, the term needs to be long enough such that the average consumer would not simply wait it out (a year or two), but short enough the he would see it in the public domain in his lifetime.

      --
      Centralization breaks the internet.
    15. Re:Un-Finishable by HuguesT · · Score: 1

      Now now, I'm not accusing anyone of lying. Lying is knowing fully that something is a certain way and pretending it is otherwise. At worst I'm accusing you of ignorance, which is not nearly as bad.

      Now you are qualifying your statement with an "In the United States"; however, even there, copyrighted works enter the public domain regularly, if not every day. Read this very interesting page.

      Now in Australia for instance, and in many other countries, copyright is 50 years after death for old work, and 70 for *new* work only, and there is no Sonny Bono retroactive act nonsense. Stuff that is more than 50 years old *today* enters the public domain *today*. In about 50 years (because of the 70-year period) there will be a 20-year period where very little will enter the PD; hopefully this will then resume.

    16. Re:Un-Finishable by HuguesT · · Score: 1

      If I'm not mistaken, Steinbeck died in 1968. If my reading of the law is correct, all of his work is due to become public domain 70 years later in the US, i.e. in 2038. However it is expected to become PD in 2018 in Australia, for instance (where the famous 20-year extension of copyright was not retroactive, because, wait for it, it was deemed unconstitutional). Only another 12 years to wait.

      If this does not happen then, write to your congressman, you are pretty much being ripped off.

    17. Re:Un-Finishable by frankie · · Score: 1

      Your reading is incorrect. Australia may have gotten it right, but the SCOTUS has explicitly allowed retroactive extension, and Congress has indeed been ripping us off (WRT copyright) since 1976.

    18. Re:Un-Finishable by mrchaotica · · Score: 1
      In about 50 years (because of the 70-year period) there will be a 20-year period where very little will enter the PD; hopefully this will then resume.

      The point of my post is that this kind of gap is occurring right now in the United States, and will end in 2019.

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

  17. Why release it on Sourceforge by MySchizoBuddy · · Score: 0, Redundant

    Couldn't Google have released it on their own code hosting service, which they recently launched?

    --
    Yes go ahead click the link. Its kosher
    1. Re:Why release it on Sourceforge by mdew · · Score: 1

      Those were my thoughts exactly: why release it on SourceForge? Unless they don't have any faith in their own code repository.

      --
      http://www.fanboy.co.nz/adblock/
    2. Re:Why release it on Sourceforge by MySchizoBuddy · · Score: 1

      Or it could be because this is a RE-release: the original release was on SourceForge, so they kept it that way.

      --
      Yes go ahead click the link. Its kosher
    3. Re:Why release it on Sourceforge by Mr+Z · · Score: 1

      Maybe it's harder to renege on a release if it's not hosted on their network?

  18. License by mapinguari · · Score: 2, Informative
    Here's what's in the COPYING file distributed with the source, with some punctuation stripped to placate the lameness filter:

    This package contains the Tesseract Open Source OCR Engine.
    Orignally developed at Hewlett Packard Laboratories Bristol and
    at Hewlett Packard Co, Greeley Colorado, the majority of the code
    in this distribution is now licensed under the Apache License:

    ** Licensed under the Apache License, Version 2.0 (the "License");
    ** you may not use this file except in compliance with the License.
    ** You may obtain a copy of the License at
    ** http://www.apache.org/licenses/LICENSE-2.0
    ** Unless required by applicable law or agreed to in writing, software
    ** distributed under the License is distributed on an "AS IS" BASIS,
    ** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    ** See the License for the specific language governing permissions and
    ** limitations under the License.


    Other Dependencies and Licenses:

    The Aspirin/MIGRAINES system in the aspirin directory is separately
    licensed thus:
    #

    NO WARRANTY

    Since the Aspirin/MIGRAINES system is licensed free of charge,
    Russell Leighton and the MITRE Corporation provide absolutley
    no warranty. Should the Aspirin/MIGRAINES system prove defective,
    you must assume the cost of all necessary servicing, repair or correction.
    In no way will Russell Leighton or the MITRE Corporation be liable to you for
    damages, including any lost profits, lost monies, or other
    special, incidental or consequential damages arising out of
    the use or inability to use the Aspirin/MIGRAINES system.

    COPYRIGHT

    This software is the copyright of Russell Leighton and the MITRE Corporation.
    It may be freely used and modified for research and development
    purposes. We require a brief acknowledgement in any research
    paper or other publication where this software has made a significant
    contribution. If you wish to use it for commercial gain you must contact
    The MITRE Corporation for conditions of use. Russell Leighton and
    the MITRE Corporation provide absolutely NO WARRANTY for this software.

    August, 1992
    Russell Leighton
    The MITRE Corporation
    7525 Colshire Dr.
    McLean, Va. 22102-3481

    Tesseract can also make use of the libtiff library. (www.libtiff.org)
    Without libtiff, Tesseract can only read uncompressed and G3 compressed
    TIFF files.

    1. Re:License by arose · · Score: 1

      So it isn't open source after all.

      --
      Analogies don't equal equalities, they are merely somewhat analogous.
    2. Re:License by mrchaotica · · Score: 2, Interesting
      The Aspirin/MIGRAINES system in the aspirin directory is separately licensed thus: [proprietary junk license]

      Anybody know how important this headache library is to the software, and how easily replaced it is?

      --

      "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

    3. Re:License by lisaparratt · · Score: 2, Informative

      It's a neural networking system, so I'd hazard a guess that it's pretty vital to the project :(

    4. Re:License by Arancaytar · · Score: 1

      Aptly named, then. Will cause quite a headache for any OS programmers working on it.

    5. Re:License by Strolls · · Score: 1
      So it isn't open source after all.
      You can download, read, compile & use the source code for the program. I fail to see how that's anything other than open source. It may not be "Free" as in the GPL or BSD licenses, but it certainly is open source (as is code released under Microsoft's shared-source license).

      Stroller.

    6. Re:License by arose · · Score: 1

      Argue with the Open Source Initiative, they invented and defined the term.

      --
      Analogies don't equal equalities, they are merely somewhat analogous.
  19. No Wrinkle in Time comments? by reaktor · · Score: 2, Interesting

    Come on, 34 comments and no mention of A Wrinkle in Time?

    1. Re:No Wrinkle in Time comments? by pedantic+bore · · Score: 1
      The problem is that whenever someone compares Google to IT (the gigantic evil brain thing), they get modded down as flamebait by the Google-bois. So I won't. Google is nothing like a giant brain that manages all the information on the planet and governs access to that information, enforcing conformity and uniformity of perception. Not in any way. And of course the evil part is just plain backwards.

      (I think it was wonderfully prescient to name the villain after "information technology". Or maybe just lucky.)

      --
      Am I part of the core demographic for Swedish Fish?
  20. Maybe VC++ 2005 is shit? by Anonymous Coward · · Score: 0

    It compiles fine on Windows 2003 using MinGW (G++ 4.0.1) and Digital Mars C++ 8.45. It also compiles fine using Watcom C++ 10.6, if you can imagine that.

    If I had to field a guess, it's that Visual C++ 2005 isn't a good C++ compiler. Try using higher-quality tools, even if they're a decade old.

    1. Re:Maybe VC++ 2005 is shit? by Anonymous Coward · · Score: 0

      It doesn't compile on OS X, and according to a bug report on the project page, you get the same errors under Linux.

    2. Re:Maybe VC++ 2005 is shit? by Anonymous Coward · · Score: 0

      You realize that Linux and Mac OS X are not Windows, correct? We're talking about Windows. And yes, it does compile on Windows using a number of compilers, including G++, Digital Mars, and OpenWatcom.

    3. Re:Maybe VC++ 2005 is shit? by Anonymous Coward · · Score: 0

      You do realize that I wasn't talking about Windows? I was talking about Linux, which it claims to compile on.

  21. Hoping OCR research will improve? by Anonymous Coward · · Score: 0

    "My guess is that they are doing this in the hope the open source community will build on and improve OCR technology. "

    Uh, huh. R-E-S-E-A-R-C-H-!

  22. my thoughts by br00tus · · Score: 3, Interesting
    I would love to use a free (speech and beer) OCR engine that works as well as a commercial one, or even nearly as well as a commercial one.

    I just checked out tesseract. One thing I have to look at more is the license. It appears to be the Apache license, which seems like a decent free license. But it also includes MITRE's aspirin. I'm not sure how dependent it is on aspirin and what the license restrictions of aspirin are.

    The two best free OCR engines out right now are clara and gocr. While they are the best, they are not that great yet. I just ran the same tiff I had run with those two (I also have the document in pbm and other formats). Tesseract did not read it, it bailed with "IMAGE::check_legal_access:Error:Can't seek backwards in a buffered image!"

    Clara and GOCR are written in C, Tesseract is written in C++, a language I don't know. Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though.

    Hopefully this can be improved, or good parts of it can be borrowed and incorporated into gocr or clara. It couldn't handle my test that both clara and gocr could, but it probably has strengths the other two don't. One day hopefully we'll have a free OCR that handles things as automagically as the commercial ones do. I will see what I can contribute to that as well, although this is C++ and I don't know that language.

    1. Re:my thoughts by Phroggy · · Score: 1

      If you've only used the latest released version of gocr, definitely try the development version; it's far superior (i.e. not completely useless).

      --
      $x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
      $x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
    2. Re:my thoughts by Anonymous Coward · · Score: 0
      "Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though. [my emphasis]"
      There is something very odd going on. I know for a fact because I saw it myself that HP Research Labs Bristol had a fully working implementation of page layout analysis together with OCR in 1994. It was impressive. It handled all the usual page layout issues such as multiple columns and page skew. There was a neat Motif-like GUI too. Since HP has open-sourced their OCR engine, it makes no sense for them not to open-source their page-layout analysis and GUI code too. If anybody from Google reads this, please find out what happened to that code and see if you can persuade somebody in HP to resurrect it.
  23. Vividata works quite well by GnuPooh · · Score: 2, Interesting

    I've tried all the previous open source stuff and it was pretty much unusable. The accuracy was so bad that it was just easier to start typing. I got a few of the Windows programs kind of working under WINE, but then I discovered Vividata and it worked really well and could be called from the command line. This meant I could write my own scripts that used it. I used it quite a bit for Project Gutenberg and was very impressed. It's not cheap, but if you want to do OCR under Linux and can afford it, I recommend it.

    I would definitely prefer a Free Software solution so I'm excited about this development. Until this solution is really workable (see the Google limitations, they're pretty serious), give Vividata a try.

  24. No luck for OS X either by lullabud · · Score: 1

    I downloaded and tried compiling it in OS X and got some Linux-specific build problems. I'm no code guru so I gave up as well. But then, even Linux doesn't support the `make install` process, as claimed by the `./configure` script's output.

  25. Isn't fully free / open source by oblique303 · · Score: 1
    Looks like this isn't fully free / open source..

    From the license file: "It may be freely used and modified for research and development purposes. We require a brief acknowledgement in any research paper or other publication where this software has made a significant contribution. If you wish to use it for commercial gain you must contact The MITRE Corporation for conditions of use."

    1. Re:Isn't fully free / open source by Ed+Avis · · Score: 2, Informative

      If you think the software isn't entirely free, contact Sourceforge. Their conditions require that all hosted projects be free software.

      --
      -- Ed Avis ed@membled.com
  26. In the future by Vexorian · · Score: 1

    The condition would be to solve a puzzle given as text, instead of reading an image meant to be as confusing as possible. Some forums have very bad systems for this, and sometimes I have to register multiple times before actually getting a CAPTCHA image that I can read.

    --

    Copyright infringement is "piracy" in the same way DRM is "consumer rape"
  27. Two reasons by patio11 · · Score: 4, Insightful

    You've got two constraints. One is that you have to be able to compose an arbitrarily large number of CAPTCHAs algorithmically. For example, that example you just used is human-composed. If it's the only CAPTCHA you have, the following program gets me a job at Google: gawk 'BEGIN{print "b"}' . If you have 100 CAPTCHAs, I only need to add a switch statement and some elbow grease and then I get to break your CAPTCHA a trillion times.

    The other constraint is that your problem has to be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwich" is not generally on the Japanese English curriculum. Similarly, you could devise any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.

    By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/
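To make the first constraint concrete, here is a small Python sketch of the "switch statement and some elbow grease" attack. The question pool and the function name are invented for illustration. The bot needs a human to solve each distinct question exactly once; every later attempt is a table lookup.

```python
import random

# Hypothetical hand-written pool; stands in for "100 CAPTCHAs".
FIXED_POOL = {
    "What is 2 + 5?": "7",
    "Which letter comes after a?": "b",
    "How many legs does a cat have?": "4",
}
QUESTIONS = sorted(FIXED_POOL)


def run_bot(rng, attempts):
    """Return (manual_solves, passes) for a bot that memorises answers."""
    table = {}
    manual_solves = 0
    passes = 0
    for _ in range(attempts):
        question = rng.choice(QUESTIONS)
        if question not in table:
            # Stand-in for a human solving this question once by hand.
            table[question] = FIXED_POOL[question]
            manual_solves += 1
        if table[question] == FIXED_POOL[question]:
            passes += 1
    return manual_solves, passes
```

The manual work is capped at the size of the pool no matter how many attempts follow, which is why the questions have to be generated rather than written by hand.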

    1. Re:Two reasons by Spazmogazm · · Score: 1

      So should I be pleased that I found the female version of this easy and the male version almost impossible? How many metro-sexual points did I lose?

    2. Re:Two reasons by Linux987 · · Score: 1

      Why not make a CAPTCHA system as follows. You design a CAPTCHA for your inbox, and anyone who sends you an e-mail immediately gets a reply asking for them to verify they are a person by answering your CAPTCHA. If they answer correctly, the message is put into your inbox.

    3. Re:Two reasons by Anonymous Coward · · Score: 0

      Right. Me too.
      I wonder if it really works for women, or whether the designer just threw in a dozen male images and went with his own criteria.

    4. Re:Two reasons by Sgt.+CoDFish · · Score: 1

      There's one huge problem with that type of CAPTCHA; personal opinion gets involved. There are people in the world who find overweight women attractive, so would answer differently to others, whereas others might go for the best smiles, whatever. Then, when a computer tells them they have shit taste in women and accuses them of being a bot when they don't pick the specified 3, they get pissed off with the site and leave:

      "I'm not having that! Who's this bastard computer to tell me that those women aren't attractive?" (General response from a British bloke who obviously isn't tech-savvy.)

      And, because you've obviously tried to cater for everyone in the world who visits your site, you have the problem of the fact that there are people in the world who don't find anything attractive at all. And also people who find everything attractive.

      A much better CAPTCHA would be something for which there is no doubt of the answer, provided you spoke the language the site is in (and if you don't, why are you on it?). For example, the user is presented with 9 pictures: 3 of cats, 6 of dogs, and told to click on the cats. Even then, it is trivial to write a program/script to try all possible combinations, but you're not gonna alienate users, because a cat is a cat, and a dog is a dog.
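
      How cheap "try all possible combinations" is can be worked out directly. This is a minimal sketch that assumes the script knows exactly three of the nine pictures are cats, as in the example; if it did not know the count, it would have to search all 2^9 = 512 subsets instead.

```python
from itertools import combinations
from math import comb

PICTURES = range(9)   # nine pictures shown to the user
CATS_TO_PICK = 3      # exactly three of them are cats

# Every distinct set of three pictures a script could click.
all_guesses = list(combinations(PICTURES, CATS_TO_PICK))

# Only one of those sets is correct, so a blind guess succeeds with
# probability 1 / C(9, 3).
chance_per_guess = 1 / len(all_guesses)

print(len(all_guesses), comb(9, 3))   # 84 84
print(f"{chance_per_guess:.2%}")      # 1.19%
```

      At roughly one success per 84 attempts, a bot that can retry freely gets through, so a scheme like this would need rate limiting or a much larger grid.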

      I know you said there would still be people with problems with the HotOrNot thing, but you cite an example that really doesn't help your argument at all; even the most geeky /. member has met a girl/boy before (they may be gay or bi) and found at least someone attractive. Or has seen porn, which is probably more likely... :) And, as I said, the cat/dog problem solves this one... if you're on a computer, on a website, you could just google for cat/dog to find out what they looked like if you were so sad as to have never seen one.

      If you're catering for the blind? Well, then you start getting fancy and adding sound files and shit... for example, you could make 2 sound files, play one of them randomly, repeat 10 times and ask the user to say how many times they heard a certain word. For example, the sounds could be: "cat", "cat", "dog", "dog", "dog", "cat", "dog", "dog", "dog", "cat", and the user would be asked to input the number of "cat"s (4, obviously). And, to combat scripts, it's random. It can still be broken by a script that keeps on guessing randomly, but all types of security have weaknesses.
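
      A script does not even need to guess uniformly here. Assuming each of the 10 plays is an independent fair coin flip between the two sound files (an assumption, since the post only says "randomly"), the number of "cat"s follows a binomial distribution, and the most likely count can be computed:

```python
from math import comb

PLAYS = 10  # sounds played per challenge


def probability_of_count(k, plays=PLAYS):
    """Chance that exactly k of the plays are "cat", with fair 50/50 picks."""
    return comb(plays, k) / 2 ** plays


# The single answer a bot should always submit to maximise its hit rate.
best_guess = max(range(PLAYS + 1), key=probability_of_count)

print(best_guess, probability_of_count(best_guess))   # 5 0.24609375
```

      Always answering 5 would pass about one challenge in four under these assumptions, so the audio variant is noticeably weaker than it first looks.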

      Please note: This post assumes that we're not trying to prevent people who are hired to spam up sites, but that we are trying to fight bots. I personally believe there's nothing to beat hired spammers except good ol' human moderation, possibly aided by a search program for the usual spam words (viagra, buy, cheap, free trial period, discreet delivery etc., although the standard spammer's English is so bad these words probably won't be found...) which flags up potentially spam posts (but that will, of course return some false positives. Nothing's perfect).

    5. Re:Two reasons by GrumpySimon · · Score: 1

      the first version of this AFAIK was kitten auth

    6. Re:Two reasons by Anonymous Coward · · Score: 0
      So should I be pleased that I found the female version of this easy and the male version almost impossible?

      No, you can still be gay. You can be a gay man who just can't make up his mind. So many men, so little time! :P

      Are you really that insecure about your sexuality?

    7. Re:Two reasons by santiago · · Score: 1

      It's a mashup of some sort that interfaces with HotOrNot. Presumably, it picks out randomly selected images with very high and very low ratings, to get differentiable sets that most people will be able to agree on based on prevailing standards of attractiveness for countries with widespread internet access.

    8. Re:Two reasons by Catharsis · · Score: 1

      CAPTCHAs are, in essence, computer-administered Turing Tests. I think it's really neat that we're writing software that runs on computers that blocks out computers.

      Ultimately, this is an arms race that pushes towards so-called Strong AI. When OCR distorted text is no longer a good enough filter, and when spambots can recognize hot babes, or play 20 Questions... what will we have left to prove our identity?

      --

      "The wise man proportions his belief to the evidence." -- David Hume

  28. Am I really stupid or... by Anonymous Coward · · Score: 0

    how do you use this? It compiled fine, and the readme says to use it something like this:
    tesseract file.tif output batch
    What are "output" and "batch" supposed to be? When I specify a batch file it segfaults.

    1. Re:Am I really stupid or... by Locutus · · Score: 1

      I found that I needed to use grayscale tif files for one and "output" is the output-filename where you'll get:
      outputFilename.raw #???
      outputFilename.map # seems to be a location map of 0/1's where 1's are valid text and 0's aren't
      outputFilename.txt # the text from the OCR event

      I also found that the tessdata directory did not get installed into the /usr/local/bin directory on "make install" and copied that directory from the build directory to get it to work.

      Without "batch", it tries to bring up an X window but that just quickly goes away with no debug output.

      Usage: tesseract inputfile.tif [path/]outputfilename batch
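
      For anyone scripting this, the usage line can be wrapped in a small Python helper. This is a sketch of the invocation described in this comment only; the function names are made up, it assumes the tesseract binary is on PATH, and later releases may take different arguments.

```python
import subprocess


def tesseract_command(input_tif, output_base):
    """Build the argument list for the batch-mode usage line quoted above.

    output_base is the output filename without an extension; per the comment,
    .raw, .map and .txt files are written alongside it.
    """
    return ["tesseract", input_tif, output_base, "batch"]


def run_ocr(input_tif, output_base):
    """Run tesseract and return the recognised text from output_base.txt."""
    subprocess.run(tesseract_command(input_tif, output_base), check=True)
    with open(output_base + ".txt", encoding="ascii", errors="replace") as f:
        return f.read()
```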

      LoB

      --
      "Anyone who stands out in the middle of a road looks like roadkill to me." --Linus
  29. In Soviet Russia by Anonymous Coward · · Score: 0

    In Soviet Russia, shady political characters recognize YOU!

  30. I call bullshit by quigonn · · Score: 4, Interesting

    The very first CAPTCHA implementation was broken, but the funny thing about CAPTCHAs is that it's absolutely no effort to make an image completely unreadable for current OCR software. And even if one certain implementation is broken, just add another layer of distortion. Human brain is capable of coping with it, OCR software usually is not.

    And after all, it's not about authentication, it's about making a service accessible only for humans.

    BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.

    --
    A monkey is doing the real work for me.
    1. Re:I call bullshit by quintesse · · Score: 1

      Call all you want: http://www.cs.sfu.ca/~mori/research/gimpy/

      Of course recognizing these images has little to do with OCR programs.

    2. Re:I call bullshit by AaronLawrence · · Score: 1

      Human brain is capable of coping with it, OCR software usually is not.

      The human brain is NOT capable of coping with an arbitrary level of distortion. Many people have remarked that recent captchas are sometimes difficult to read due to the very heavy distortion.

      This is true at least for letters and numbers. "Pictures of things" might do better, but they require an enormous amount of work compared to a little program spitting out JPGs of text.

      --
      For every expert, there is an equal and opposite expert. - Arthur C. Clarke
    3. Re:I call bullshit by johansalk · · Score: 3, Informative

      If CAPTCHAs rely on humans, wasn't there an anti-CAPTCHA trick spammers were using, where people had to answer some CAPTCHA to get into some free porn, and their answers were then used to get the bots through the legitimate sites the spammers wanted to get into?

    4. Re:I call bullshit by quigonn · · Score: 1

      As I wrote, the first CAPTCHA implementation (which you linked to) was indeed broken, but not the concept per se. Please read my posting before answering.

      --
      A monkey is doing the real work for me.
    5. Re:I call bullshit by quigonn · · Score: 1

      Where did I write "arbitrary level of distortion"?

      To lay this out clearly: human capability of recognition is still much better than that of computer programs, and that's what CAPTCHAs are exploiting: generally, every AI-hard problem can be used for distinguishing between humans and computers, which also means that every time a CAPTCHA building upon an AI-hard problem has been broken, an AI-hard problem has been solved (provided no implementation errors have been used to bypass the need of solving the actual AI-hard problem).

      --
      A monkey is doing the real work for me.
    6. Re:I call bullshit by Ciarang · · Score: 1

      If this obviously broken 'cryptography solution' is so great, why isn't it applied in his wiki? - I quote from the top of the front page: "Alert! Due to spam, you have to log in to edit the wiki or create a problem ticket. The username is foo and the password is bar."

      That's chapters 1 and 2 of "Strange Perversions in Authentication" covered right off the bat.

      The first doesn't stop computers at all - last time I checked they were perfectly capable of digitally signing things - and the 'repeat attack' is laughable. The second only requires human input once, then the computer is off and running, which is very different to a CAPTCHA required for each action.

      Someone seems to be confusing CAPTCHAs (distinguishing between man and machine) with half-baked authentication (trying but failing to verify identity).

      Having the text colour virtually identical to the background on the blog was a nice touch though.

    7. Re:I call bullshit by Tim+C · · Score: 1

      Yes. Sometimes the best technical solution to a problem is to write software to enable a human to work on it, rather than attacking it with software directly. This is one of those cases.

      It would be trivial to set this sort of thing up - simply have a page enabling people to sign up for their free porn (or whatever), and display the captcha image from the target site. When the person submits their answer, use it as part of your response to the target site.

    8. Re:I call bullshit by Anonymous Coward · · Score: 0

      I wouldn't say it's "absolutely no effort." See this page for a whole bunch of broken captchas.

    9. Re:I call bullshit by quintesse · · Score: 1

      And if you would actually read the page I linked to you would see that even the latest improved CAPTCHA has a 33% success rate which is more than sufficient for most purposes.

      And if you would actually look at the CAPTCHAs (even the "broken" ones as you call them) you would see that there are several which are damn hard to read for humans but the program doesn't seem to have any problems with them.

      So I still say that you can call bullshit all you want but humans are not always better at these kinds of things.

      What we _are_ good at is recognizing all kinds of objects and, even better, abstract notions, something that a computer would not easily copy, but if I see examples of this: http://gs264.sp.cs.cmu.edu/cgi-bin/esp-pix it's even damn hard for human beings!

    10. Re:I call bullshit by quintesse · · Score: 1

      "Where did I write "arbitrary level of distortion"?"

      vs

      "And even if one certain implementation is broken, just add another layer of distortion"

      with the knowledge that several of these CAPTCHAs have been "broken" this definitely seems to suggest that we just should add new layers of distortion hence an "arbitrary level of distortion".

  31. HP decided to got out of the OCR business? by Frosty+Piss · · Score: 5, Funny
    In 1995 it was one of the top 3 performers at the OCR accuracy contest organized by University of Nevada in Las Vegas. However, shortly thereafter, HP decided to get out of the OCR business...

    Actually, shortly thereafter, HP decided to get out of the technology innovation business, and into the printer ink business.

    --
    If you want news from today, you have to come back tomorrow.
    1. Re:HP decided to got out of the OCR business? by Plutonite · · Score: 1

      Well that's not nice now, is it? At least they're not hurting anyone with their printer ink business.

      Technological innovation = huge evil corporation doing research = products that will pwn a consumer/rip him off/ruin his shit. Windows was marketed as innovation, now look where that got us.

      You should be old enough to know this.

    2. Re:HP decided to got out of the OCR business? by steve_l · · Score: 1


      As someone who works in HPLabs, in the same building as the tesseract team, I must differ.

      We in HPLabs do still try and do leading edge research. It's just really hard to get your stuff into products where there's more and more emphasis on buying prepackaged stuff from VC-funded startups.

      What we have found is that OSS projects make a great destination for advanced research. A lot of stuff goes into Linux that way, as well as Xen and other existing projects. Then things like SmartFrog and Jena are in-house projects with open codebases. Come try them out!

    3. Re:HP decided to got out of the OCR business? by sharkey · · Score: 1

      At least they're not hurting anyone with their printer ink business.

      Other than the people who would prefer to refill their ink carts or buy refilled/remanufactured ones at a fraction of the HP retail cost, anyway.
      --
      "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
    4. Re:HP decided to got out of the OCR business? by Frosty+Piss · · Score: 1
      We in HPLabs do still try and do leading edge research. It's just really hard to get your stuff into products where there's more and more emphasis on buying prepackaged stuff from VC-funded startups.
      Then you don't disagree with me at all. I said that HP is no longer in the research biz. You said much the same thing, that HP buys its tech from VC start-ups.
      --
      If you want news from today, you have to come back tomorrow.
    5. Re:HP decided to got out of the OCR business? by argent · · Score: 1

      We in HPLabs do still try and do leading edge research. It's just really hard to get your stuff into products where there's more and more emphasis on buying prepackaged stuff from VC-funded startups.

      Sounds like you're saying that HP is in the printer ink business and runs HP Labs (is that what used to be DECWRL?) for the PR, no?

    6. Re:HP decided to got out of the OCR business? by ArtStone · · Score: 1

      The R&D department is too busy finding clever ways to obtain the private home and cell phone telephone records of their board of directors in order to force them to quit - for leaking to the press details of internal disagreements within the board about future company plans:

      http://www.msnbc.msn.com/id/14687677/site/newsweek/

      --
      Final 2006 "Proof of Global Warming" US Hurricane Count -> 0
  32. W0W1 by Anonymous Coward · · Score: 3, Funny

    TH18 IS GRLAT NEWf4 FOR TH0Sj OF US U$1NZ BA) O(R RLCOGN1+ION!

    THAHKS, G00GLL!1!!!

  33. What about "rough ocr" by Bitsy+Boffin · · Score: 1

    This story is somewhat timely for me. I am secretary of a club, we have a large quantity of documents collected over the last 20 years or so, some hand written, some typed, forms, invoices, minutes of meetings, letters sent to and from etc etc. There are a LOT of documents.

    Lately I've been thinking about computerizing these documents into a web based system, so that any of the club executive can search and pull out a document they need etc, we could also flag documents as "general release" so that people could read interesting stuff from our past. And of course it would also serve as a secure backup of our documents, in case of fire, theft, alien invasion...

    I think what is needed is a rough OCR system, that is, an OCR system that's not trying to be perfect, but can at least achieve about 50% accuracy on both typed and handwritten (without training!) documents, and preferably one that just skips any word it isn't fairly certain it got right. The idea being that I'd run each document (big job, but doesn't matter if it takes a year) through a scanner, OCR it to get some searchable content, then store it as a PDF, or JPEG or something.

    Anybody know of such an (open source, or at least free as in beer) OCR system?

    --
    NZ Electronics Enthusiasts: Check out my Trade Me Listings
    1. Re:What about "rough ocr" by Anonymous Coward · · Score: 3, Insightful

      You're a secretary? Do you do anal? If so, I can double your pay.

    2. Re:What about "rough ocr" by Bitsy+Boffin · · Score: 1

      Double of zero isn't that enticing.

      --
      NZ Electronics Enthusiasts: Check out my Trade Me Listings
    3. Re:What about "rough ocr" by Anonymous Coward · · Score: 0

      It's called DJVU.

      In other words, it is an efficient compression format for collections of images of scanned text documents. When and if you have an accurate copy of the text (via OCR or typing) then the DJVU readers can use this to do text searching and highlight the words in the image of the page.

      In my opinion, this is the best way to get started in scanning document collections because you get a usable result immediately without fiddling with OCR and you can add OCR later.

      There is an open-source e-book reader named Evince that supports DJVU and there is the Lizardtech plugin for Windows. Lots more info is available on the net by googling for DJVU.

    4. Re:What about "rough ocr" by robbak · · Score: 1

      Just to remind you, 50% means every second letter is wrong. As in, you've got a 6.25% chance that a four-letter word would be correct. (Bit like my typing skills). I think you need something a bit better than that....
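
      The arithmetic behind that figure, as a short Python sketch. It assumes every character is recognised independently with the same probability, which real OCR errors will not strictly obey:

```python
def word_accuracy(char_accuracy, word_length):
    """Chance that every character of a word is right, assuming independence."""
    return char_accuracy ** word_length


for char_accuracy in (0.50, 0.90, 0.99):
    print(f"{char_accuracy:.0%} per character -> "
          f"{word_accuracy(char_accuracy, 4):.2%} per four-letter word")
# 50% per character -> 6.25% per four-letter word
# 90% per character -> 65.61% per four-letter word
# 99% per character -> 96.06% per four-letter word
```

      Even 90% per character leaves about a third of four-letter words wrong, so searchable text needs character accuracy well above that.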

      --
      Prediction for end of Universe #42: Fencepost error in Quantum_bogosort.cpp
    5. Re:What about "rough ocr" by Anonymous Coward · · Score: 0

      It doesn't seem to be widely known, but Microsoft Office (the full version) includes quite a nice OCR utility. If you have the default install, look for Microsoft Office Document Imaging. It's far, far better than the almost totally useless OCR software we got with our HP All-in-One printer.

    6. Re:What about "rough ocr" by Anonymous Coward · · Score: 0

      Try having a look at the DjVu file format, which was created more or less for that.

  34. Totally OT response to sig. by Anonymous Coward · · Score: 0

    little known fact: the AAA is the largest anti-public transport lobby in the US.

    Single-mindedly want all transport monies to go into roading projects.

    I guess it's part of their mission, but it is pretty crap for an otherwise populist organization of good.

    1. Re:Totally OT response to sig. by illuminatedwax · · Score: 2, Insightful

      Also: *AA includes the MAA (Mathematical Association of America), the ADAA (Anxiety Disorders Association of America), the MSAA (Multiple Sclerosis Association of America), and the SCAA (Specialty Coffee Association of America).

      The SCAA must be the ones responsible for not letting Java be open sourced.

      --
      Did you ever notice that *nix doesn't even cover Linux?
    2. Re:Totally OT response to sig. by 4D6963 · · Score: 1, Flamebait

      Also: *AA includes the MAA (Mathematical Association of America), the ADAA (Anxiety Disorders Association of America), the MSAA (Multiple Sclerosis Association of America), and the SCAA (Specialty Coffee Association of America).

      and also the GNAA (Gay Nigger Association of America)

      Don't ask me what my point is in mentioning this because I have no fucking idea :-) have a good day!

      --
      You just got troll'd!
    3. Re:Totally OT response to sig. by Anonymous Coward · · Score: 1, Funny

      and don't forget the ADA, the Dyslexics Association of America

    4. Re:Totally OT response to sig. by 4D6963 · · Score: 1

      lol, how is that flamebait, is it because I said the word nigger?

      --
      You just got troll'd!
  35. Non-English Charsets? by TheoMurpse · · Score: 3, Interesting

    As there seems to be no documentation on the Sourceforge page about what this can actually do, does it learn or follow rules? If it learns, can it learn to recognize, say, Japanese characters?

    1. Re:Non-English Charsets? by Yvanhoe · · Score: 2, Informative

      Google specifically said in the article it doesn't work for non-English texts. I suppose it means it incorporates an English dictionary too, so other Roman-alphabet languages wouldn't work either.

      --
      The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
    2. Re:Non-English Charsets? by Gneral+Tsao · · Score: 1

      To add to that, it says it's US-ASCII only, so no diacritic marks either.

  36. Music OCR by Crabbyass · · Score: 1

    I tell ya, it'd be friggin' sweet if someone would work on making a functional Music OCR program. Scanning a score using the piece-of-crap Photoscore into (the not-so-piece-of-crap) Sibelius always ends up taking longer than actually inputting the music manually. I don't know about others who dabble in this software, but I'm sick and tired of a piece of dust being interpreted as a meter change.

    1. Re:Music OCR by Anonymous Coward · · Score: 0

      'Sharpeye 2.68' (Visiv)?

    2. Re:Music OCR by Scaba · · Score: 2, Funny
      I'm sick and tired of a piece of dust being interpreted as a meter change.

      You're just not avant-garde enough.

    3. Re:Music OCR by lowieken · · Score: 3, Interesting

      There is a piece of non-free software that runs quite well under Wine and exports nice MusicXML. You will find it linked to from http://www.recordare.com/software.html .

      I really should ask Google to help buy this technology and set it free.

    4. Re:Music OCR by advocate_one · · Score: 1

      which program... there are quite a few OCR programs linked to from there...

      --
      Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
    5. Re:Music OCR by treeves · · Score: 1

      Yeah, I tried the demo of that software and it was terrible!

      --
      ...the future crusty old bastards are already drinking the Kool-Aid.
  37. License issue: not free software by hellgate · · Score: 2, Interesting
    Parts of the Tesseract tarball are under a "for non-commercial use only" license:

    This software is the copyright of Russell Leighton and the MITRE Corporation. It may be freely used and modified for research and development purposes. We require a brief acknowledgement in any research paper or other publication where this software has made a significant contribution. If you wish to use it for commercial gain you must contact The MITRE Corporation for conditions of use.

    The piece in question is a neural network simulator named Aspirin/MIGRAINES, presumably used for training. Pun away.

    1. Re:License issue: not free software by Anonymous Coward · · Score: 0

      Obviously this part needs to be ripped out if it is to be part of a free project (or a commercial one, for that matter).

      I admire Google for releasing the whole thing "as is" so the community can see best how to proceed.

      While neural networks are interesting, a lot has developed since they were first envisioned in the eighties. There has to be another equally good approach.

      As an aside - Mitre may not even be in this segment of the business anymore. The code is pretty old.

  38. If you're wondering about OS compatibility... by 5plicer · · Score: 1

    "Currently it builds under Linux with gcc2.95 and under Windows with VC++6". In other words, it won't compile under Mac OS X... yet ;)

    --
    The bits on the bus go on and off... on and off... on and off...
    1. Re:If you're wondering about OS compatibility... by Anonymous Coward · · Score: 0

      Yes it does. I just compiled it for Mac OS X.
      All you have to do is edit some source files and you're done (just replace linux/limits.h by limits.h and malloc.h by stdlib.h).
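
      The two header swaps could be applied with a short Python script instead of by hand. This is only a sketch of the edit described here: the function name is made up, and which source files actually need the change is not stated.

```python
import re

# Header swaps reported in this comment (Linux-specific -> portable).
HEADER_SWAPS = {
    "linux/limits.h": "limits.h",
    "malloc.h": "stdlib.h",
}

# Matches an #include line and captures the header name between <> or "".
INCLUDE_RE = re.compile(r'^(\s*#\s*include\s*[<"])([^>"]+)([>"])', re.MULTILINE)


def patch_includes(source):
    """Return source text with the listed headers replaced; others untouched."""
    def swap(match):
        header = HEADER_SWAPS.get(match.group(2), match.group(2))
        return match.group(1) + header + match.group(3)
    return INCLUDE_RE.sub(swap, source)
```

      Reading each source file, passing its text through patch_includes and writing it back reproduces the manual edit.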

  39. MOD PARENT UP by Phroggy · · Score: 0, Redundant

    My thoughts exactly...

    --
    $x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
    $x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
  40. Image spam by Lonewolf666 · · Score: 2, Interesting

    A good idea, and if significant amounts of text are in an image, I'd view the mail as dubious anyway.
    If not because of spam, then because of the idiotic format. Images are for illustrations, but using them to transfer major amounts of text is just stupid and inefficient.

    --
    C - the footgun of programming languages
    1. Re:Image spam by maxwell+demon · · Score: 2, Insightful

      Unless it's a scanned page, where you might be interested in more than just the raw text, or simply don't want to risk errors in converting it to text (think official documents).

      --
      The Tao of math: The numbers you can count are not the real numbers.
  41. I worked at HP labs for some of this period by niceone · · Score: 1

    and I've never heard of this thing.

    Guess I should have got out of my cube more.

    1. Re:I worked at HP labs for some of this period by Mr.+Hankey · · Score: 1

      It might have been dangerous. If you actually found yourself in a tesseract, you might have ended up right back in your cube (possibly falling from the ceiling) when walking out the wrong side.

      --
      GPL: Free as in will
    2. Re:I worked at HP labs for some of this period by Anonymous Coward · · Score: 0

      It was HPLB (HP Labs Bristol) and not any of the US labs. I can't remember the name of the group (too long ago, too many reorgs!).

  42. Comment removed by account_deleted · · Score: 3, Insightful

    Comment removed based on user account deletion

  43. Test example of tesseract. by dannycim · · Score: 2, Interesting

    Screen captured some text from the article, used XV to transform into tif, changing image to monochrome.

    Input image: it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code

    Output text: ii has been lamed as one of lhe mos! accurale Oplical Characler Recognilion (OCR) programs available. Having sat on lhe shelf galhering dusk for so many years, Google cleaned up some of lhe more ouldaled porlions of lhe code

    I have no idea what kind of input is optimal, but for a first shot in the dark, that's not too shabby. I'll go play with it some more. (^,^)
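
    To put a rough number on "not too shabby", the first sentence of the two texts above can be compared with Python's standard difflib. Note that SequenceMatcher's ratio is a general string-similarity score, used here as a stand-in, not a standard OCR character error rate.

```python
from difflib import SequenceMatcher

expected = ("it has been touted as one of the most accurate Optical Character "
            "Recognition (OCR) programs available.")
ocr_output = ("ii has been lamed as one of lhe mos! accurale Oplical Characler "
              "Recognilion (OCR) programs available.")


def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def substitutions(a, b):
    """List (expected, recognised) fragments where the two strings differ."""
    ops = SequenceMatcher(None, a, b, autojunk=False).get_opcodes()
    return [(a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


print(f"{similarity(expected, ocr_output):.3f}")
print(substitutions(expected, ocr_output))
```

    The substitution list makes the pattern easy to see: nearly every error is a "t" read as "l", "i" or "!".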

    1. Re:Test example of tesseract. by Random832 · · Score: 1

      I have no idea what kind of input is optimal, but for a first shot in the dark, that's not too shabby. I'll go play with it some more.

      meh. a _screenshot_ contains perfectly regular characters - if it can't ace _that_ then I don't _want_ to see what it does with a scanned page.

      --
      We've secretly replaced Slashdot with new Folgers Crystals - let's see if it notices.
    2. Re:Test example of tesseract. by CXI · · Score: 2, Interesting

      A screen shot is typically much lower resolution than what you'd normally scan documents at for OCR. It's not a good test.

    3. Re:Test example of tesseract. by Random832 · · Score: 1

      But every single instance of "t" is the exact same pixels in a screenshot - how do you justify it being interpreted as "i" in some cases and "l" in other cases?

      --
      We've secretly replaced Slashdot with new Folgers Crystals - let's see if it notices.
    4. Re:Test example of tesseract. by Qzukk · · Score: 1

      Because the software wasn't designed to handle perfectly regular text, since in a real scanning-from-paper situation that almost never comes up. Thus it treats every letter individually.

      --
      If I have been able to see further than others, it is because I bought a pair of binoculars.
    5. Re:Test example of tesseract. by SnarfQuest · · Score: 1

      I'll bet that they aren't the exact same pixels. If you magnify them you'll see that the characters have a "fuzzy" border, and the differing occurrences are "fuzzed" differently. This is to improve readability to us mere humans.

      --
      Who would win this election: Andrew Weiner vs Andrew Weiner's weiner.
    6. Re:Test example of tesseract. by DragonWriter · · Score: 1
      meh. a _screenshot_ contains perfectly regular characters
      Is that really true with modern displays, with font smoothing, and freely resizable proportional fonts with variable kerning, etc.? Mapping from a vector-defined font with those kinds of variables to on-screen pixels, I'd expect something less than perfect regularity, though I'd expect that at commonly used sizes and resolutions, it's not something that would be noticeable to a user. OCR software, on the other hand...
    7. Re:Test example of tesseract. by Mr+Z · · Score: 1

      Adding to what Qzukk said, in this case I imagine the letter bounding boxes probably derive from the perceived positions of the neighboring characters, and there's probably some normalization step involved when extracting the perceived character from the image. Furthermore, depending on the font, you could have characters that appear to bump into each other in some pairings. For example, consider "it" vs. "t" at the start of a word. The bar of the 't' nearly touches the body of the preceding 'i' (at least in the font I'm using at the moment). Is it "ii" with a hunk of noise or "it"?

      A higher resolution source would probably fare much better.

      --Joe
  44. License? by omeg · · Score: 1

    I don't get it. Isn't everything released on SourceForge supposed to be under a free license? Then how come this is released under no license? Perhaps I'm not looking on the right pages, but I can't seem to find anything besides the "none listed" on the main page of the project.

    1. Re:License? by Anonymous Coward · · Score: 0

      being allowed to do whatever the hell you want sounds pretty free to me...

    2. Re:License? by Anonymous Coward · · Score: 0

      Download it, untar, and read the COPYRIGHT file.

      fukkin slashdot dolts

    3. Re:License? by The+Cisco+Kid · · Score: 1

      The README and COPYING files included in the package document that the package is distributed under the Apache License v2.0

    4. Re:License? by robbak · · Score: 1

      Probably because no one selected a licence from the drop-down box on the submission form. And _that_ was probably because parts of the code are under at least 2 licences, one of which is non-free.

      --
      Prediction for end of Universe #42: Fencepost error in Quantum_bogosort.cpp
  45. Audible captchas by sita · · Score: 1

    I suppose "audible captchas" should be feasible. That is, if you can't see the picture, the captcha server also has an audio file with the same information. I'd be surprised if this doesn't exist already in some form.

    1. Re:Audible captchas by Anonymous Coward · · Score: 0

      Well, it doesn't seem that Slashdot, for example, is using them, does it? Out of all the sites I've ever seen, I've never seen this implemented either. So visually impaired people have trouble using Slashdot. Of course, it's no surprise to me as GNOME and KDE are a PITA for disabled ones as well. The F/OSS community doesn't exactly embrace this minority...

    2. Re:Audible captchas by Greedo · · Score: 1
      --
      Tuus crepidae innexilis sunt.
  46. Build environment by maxwell+demon · · Score: 1

    While it may be nice to have the source of a tesseract, those can only be built in a 4-dimensional space. So where do I get the build environment?

    --
    The Tao of math: The numbers you can count are not the real numbers.
    1. Re:Build environment by Mr.+Hankey · · Score: 1

      Just build it in 3 dimensions, and let an earthquake fold it for you.

      --
      GPL: Free as in will
  47. I always knew... by sam991 · · Score: 1

    I always knew Google were powerful. I did not, however, know they had the power to open source the 4-dimensional analog of the (3-dimensional) cube, where motion along the fourth dimension is often a representation for bounded transformations of the cube through time.

    --
    "No, no, no, don't tug on that! You never know what it might be attached to."
  48. An interesting demonstration by hey! · · Score: 1

    that F/OSS isn't anti-business. It just works with different business models.

    Google's business interest in releasing this as open source is obvious: the greater the value of the materials available to the Internet, the greater the value of its service.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  49. Since you ask, here's why: by patio11 · · Score: 3, Insightful

    The name of the system you propose is called challenge/response (CR). CR is not a good idea for the following reasons:

    1) It says "My time is more important than yours" to all your correspondents, because you're not willing to look at a few spams getting past your Bayesian filter every day so instead you offload that time burden to people who want to talk to you.
    2) Dueling CR systems ("Hey, bob@example.com, I don't recognize you. Please prove you are a human" "Re: Hey, bob -- steve@stupid.com, I don't recognize you. Please prove you are a human"). Even more fun in a potentially infinite loop. Any system you can make to short-circuit this loop can be abused by spam to avoid the CR altogether.
    3) Doesn't survive the Chinese Sweatshop Spam Attack, which will be ubiquitous if CR becomes popular. (Take poor Chinese person, teach them 10 words of English, pay them 2 cents an hour to answer CAPTCHAs so you get guaranteed delivery of your Maximize Your Mr. Wiggly offers.)
    4) Breaks legitimate bulk mail senders, such as Amazon, Paypal, eBay, mailing lists, etc etc. Mailing lists in particular are going to be very fun, since a lot of CR systems would spam the entire list -- perhaps provoking 100 challenges! Which then leads to combinatorial hilarity!

    1. Re:Since you ask, here's why: by 14CharUsername · · Score: 1

      It's not really quite that bad. You whitelist everyone in your address book (and the mailing lists you've subscribed to) so their messages go directly to your inbox without the challenge. The challenge is only for senders who have never emailed you before. And the dueling CR problem is solved by automatically whitelisting every address you've sent mail to.

      I don't use CR, but I might consider it if spam gets any worse.
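
      A minimal Python sketch of the whitelist idea, purely as an illustration. The class and method names are made up for this example and do not come from any real mail filter, and nothing here verifies that a sender address is genuine, so a forged From line would still get through.

```python
from dataclasses import dataclass, field


@dataclass
class ChallengeResponseFilter:
    """Toy challenge/response filter with a whitelist (hypothetical design)."""

    whitelist: set = field(default_factory=set)
    pending: dict = field(default_factory=dict)  # sender -> held messages

    def record_outgoing(self, recipient: str) -> None:
        # Anyone we have written to is trusted from then on. This is the
        # step that keeps two CR systems from challenging each other forever.
        self.whitelist.add(recipient.lower())

    def receive(self, sender: str, message: str) -> str:
        sender = sender.lower()
        if sender in self.whitelist:
            return "inbox"
        # Unknown sender: hold the message and ask for proof of humanity.
        self.pending.setdefault(sender, []).append(message)
        return "challenge"

    def challenge_answered(self, sender: str) -> list:
        # A correct answer whitelists the sender and releases held mail.
        sender = sender.lower()
        self.whitelist.add(sender)
        return self.pending.pop(sender, [])


bob = ChallengeResponseFilter()
steve = ChallengeResponseFilter()

# Bob writes to Steve, so Bob's filter whitelists Steve up front.
bob.record_outgoing("steve@stupid.com")
first = steve.receive("bob@example.com", "Hi Steve")
# Steve's challenge comes back to Bob and lands in the inbox, not in a loop.
reply = bob.receive("steve@stupid.com", "Please prove you are a human")
print(first, reply)
```

      The design choice that matters is calling record_outgoing before the message leaves, so the other side's challenge is already trusted by the time it arrives.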

    2. Re:Since you ask, here's why: by Anonymous Coward · · Score: 0

      like faking being someone on your address book is hard.

    3. Re:Since you ask, here's why: by triso · · Score: 1
      like faking being someone on your address book is hard.
      It's pretty hard when you've never seen my address book. It's really hard when your address is on my blacklist.
  50. Error of omission in summary by autophile · · Score: 1
    ...it has been touted as one of the most accurate open source Optical Character Recognition (OCR) programs available.

    As the linked article states, there are commercial OCR programs that are far more accurate.

    --Rob

    --
    Towards the Singularity.
  51. how to get it to run .. by rs232 · · Score: 1

    Does anyone here know how to get it to install and run on SuSE 10.0? The instructions are a little confusing. If you can't use make install, what do you use?

    From INSTALL ..

    "4. Type `make install' to install the programs and any data files and documentation."

    Running ./configure returns "error in line 1329" and "make install has not been implemented yet avoid using."

    README has this to say: "The executable must reside in the same directory as the tessdata directory. The command line is: tesseract image.tif batch"

    When I try to run it, a window pops up briefly and then disappears.
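
    For what it's worth, the README requirement quoted above can be checked before launching anything. This is only a sketch in Python: build_command is a hypothetical helper, not part of Tesseract, and the third argument is copied verbatim from the README line because its meaning isn't explained there.

```python
from pathlib import Path


def build_command(exe, image, third_arg="batch"):
    """Return the argv for the README's `tesseract image.tif batch` line.

    Raises FileNotFoundError if there is no tessdata directory next to
    the executable, which is the layout the README says is required.
    """
    exe = Path(exe)
    tessdata = exe.parent / "tessdata"
    if not tessdata.is_dir():
        raise FileNotFoundError(f"expected a tessdata directory at {tessdata}")
    # third_arg is passed through untouched; "batch" is just what the
    # README quote shows, not something this sketch interprets.
    return [str(exe), str(image), third_arg]


# Example use (paths are placeholders):
#   import subprocess
#   subprocess.run(build_command("./tesseract", "image.tif"), check=True)
```

    Running it from a terminal this way should at least keep any error message on screen instead of in a window that vanishes.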

    --
    davecb5620@gmail.com
    1. Re:how to get it to run .. by wiggling · · Score: 1
      Can't even configure on FC5:

      ./configure: line 1329: tesseract: command not found
      checking build system type... i686-pc-linux-gnu
      checking host system type... i686-pc-linux-gnu
      checking for cl.exe... no
      checking for g++... g++
      checking for C++ compiler default output... configure: error: C++ compiler cannot create executables
      See `config.log' for more details.

      That's B.S.

  52. Chastity Bono's next step is life+100 by tepples · · Score: 2, Insightful
    I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules.

    Are you insinuating that the 115th Congress won't try to enact a Chastity Bono Copyright Term Extension Act? Given Mexico's life plus 100 copyright term, the next step of "harmonization" for the United States and its trading partners is life plus 100 or, in the case of works made for hire, 125 years after publication.

    Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can.

    Who's to say that publishers won't fight back against Gutenberg the way (ObTopic) they did against Google? It's only fair use if you can pay a judge to tell you that it is and if you can pay your lawyer to tell the judge to tell you that it is.

  53. Enough such works? by tepples · · Score: 1
    Unless estate holders release it early. Or the author and holder of the copyright declares in his/her will that his/her work be released into the public domain upon his death, etc.

    Except that estates of authors of well-known works tend to be stricter than that. I'm willing to bet that there won't be enough books 1. which are notable, 2. whose copyright is abandoned by the author or his estate, and 3. which are not already published electronically by the author, to keep Project Gutenberg and the public-domain part of Google Book Search busy after the Chastity Bono Act comes into effect.

    1. Re:Enough such works? by fotbr · · Score: 1

      See the second line of my post. It may not be likely, but it IS POSSIBLE.

  54. port to Mindstorms? by derniers · · Score: 1

    Well, that is quite a stretch, but just maybe you could send the info from the Mindstorms to a host so that the robots can read.

  55. I thought G had unleashed a time machine. by Anonymous Coward · · Score: 0

    Oh man, I thought that Google had finally unleashed a time machine. Just think of the ramifications of that. The concept of time travel is simple enough: you just wind the string of time a different way and jump from one piece of the yarn to its neighboring piece on the ball, right? But the part that needs to be solved is how to do it, and where to get the energy. Yes, by the way, there is such a thing as a tesseract: A Wrinkle in Time.

    1. Re:I thought G had unleashed a time machine. by trupoet · · Score: 0

      A Wrinkle in Time? Man, I haven't read that since elementary school.

      Speaking of time travel, I just watched 12 Monkeys last night.

      <3 that movie.

  56. Un-Endable by Anonymous Coward · · Score: 0

    One small flaw with your argument. Copyrighted material isn't a step function, timewise. Content is constantly created on one end, and it falls off on the other (and yes, I'm using a much broader definition of "published" than you are).

    Second, the extension of copyrights. While there is precedent, it's not an infinite function either.

    "And although it's technically possible for an author to explicitly release his work himself, it doesn't count because it doesn't solve the problem."

    Well neither does P2P, despite the booster crowd here.

  57. I know moderators don't have senses of humor... by drinkypoo · · Score: 1

    ...or mothers... but please, this was not Flamebait. It's called humor, and if it's not funny, just don't laugh. It's not like he posted some big GNAA ASCII SNAFU.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
    1. Re:I know moderators don't have senses of humor... by 4D6963 · · Score: 1

      Moderators are only human ;-)

      Maybe the one who did that thought it was the classic GNAA troll, or maybe they have a problem with "the N-word".

      --
      You just got troll'd!
    2. Re:I know moderators don't have senses of humor... by illuminatedwax · · Score: 1

      What I want to know is who is modding any of this OT shit up?? :)

      --
      Did you ever notice that *nix doesn't even cover Linux?
  58. What else for Gutenberg? by DragonWriter · · Score: 1
    Should we praise technology that helps Project Gutenberg run out of pre-1923 books faster?
    Yes.
    Once all notable pre-1923 books are scanned, OCR'd, and cleaned up, then what does PG do?
    I dunno. Continue (as they have already been) getting some newer material. Provide access to the old stuff. Improve search facilities. Buy out rights to existing copyrighted material. Fund expeditions to find lost pre-1923 manuscripts. Lobby for better copyright laws.
  59. Mod parent up by makomk · · Score: 1

    I know for a fact because I saw it myself that HP Research Labs Bristol had a fully working implementation of page layout analysis together with OCR in 1994. It was impressive. It handled all the usual page layout issues such as multiple columns and page skew.

    I've no idea whether that's true or not (and after a quick Google, it looks like finding out would be too much like hard work), but it's certainly interesting....

  60. Re: Aspirin by Ayanami+Rei · · Score: 1

    Man, that tool is old.
    I know some people who work in the department where it was created and I think the consensus is that no one has thought about that tool in a long time (as there are much better ones now).

    I think if there was some pressure from users of Tesseract ... it would be quite possible to push through a request to re-release it under the Apache license.

    I think the biggest hurdle there would be the paperwork and explaining why it'd be a nice gesture to the people who have to sign the forms (managers, corporate, etc.). We have a pretty hefty PR/licensing process since most of our work is delivered to our sponsors (government).

    But it's not like anyone considers it some kinda asset. In fact, if anyone asked for support for it there'd be groaning because no one has touched it in over a DECADE.

    But yeah, I encourage users of Tesseract to send snail-mail letters explaining the issue to the Neuroscience folks there at the Washington location.

    --
    THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
  61. THIS IS ONLY FOR *NIX and not mentioned? by macraig · · Score: 1

    Apparently the OP thinks the entire world lives and breathes *NIX, so much so that he couldn't be bothered to mention the OS platform requirement? Thanks for wasting the time of those readers who may not yet have a Linux system with which to use it.

    1. Re:THIS IS ONLY FOR *NIX and not mentioned? by dadman · · Score: 2, Informative

      Err... How about Cygwin http://www.cygwin.com/ ?

  62. In spanish by Anonymous Coward · · Score: 0

    Now we can see the difference between the 'A'-ness of 'A' and the 'P'-ness of P!