Slashdot Mirror


Google Releases Tesseract as Open Source

An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at HP Labs from 1985 to 1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. After it sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at SourceForge."

7 of 251 comments

  1. Re:As much as I like open source software ... by aweinert · · Score: 5, Informative

    CAPTCHAs are specifically meant to break OCR... and if you RTFA, it says it does poorly with grayscale and color documents. Basically it's meant for reading typed text... like in a book.

  2. improvements by Anonymous Coward · · Score: 5, Funny

    Google cleaned up some of the more outdated portions of the code
    i.e., added AdSense to the OCR output.

  3. Re:As much as I like open source software ... by illuminatedwax · · Score: 5, Funny

    You're right! Let us never delve into research that could conceivably overturn weak software security! Some things man was never meant to discover! Turn back, before we fly too close to the sun and our wings melt!! O, Prometheus, why hast thou given us this OCR technology??

    --
    Did you ever notice that *nix doesn't even cover Linux?
  4. Hosting by truthsearch · · Score: 5, Interesting

    Is there any particular reason Google isn't hosting the project themselves?

    1. Re:Hosting by larry bagina · · Score: 5, Funny

      Yes. They need the 99.9999% uptime (6 9s) that only SourceForge can provide.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

  5. Un-Finishable by Kadin2048 · · Score: 5, Interesting

    In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of copyright under the new rules. They have everything written by humanity before that date to digitize: not just English language books and "classics," but government documents, records, foreign language texts, ancient manuscripts ... everything. That's as close to an un-finishable task as you can set yourself, I think.

    Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.

    With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.

    --
    "Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
  6. HP decided to get out of the OCR business? by Frosty Piss · · Score: 5, Funny

    In 1995 it was one of the top 3 performers at the OCR accuracy contest organized by University of Nevada in Las Vegas. However, shortly thereafter, HP decided to get out of the OCR business...

    Actually, shortly thereafter, HP decided to get out of the technology innovation business, and into the printer ink business.

    --
    If you want news from today, you have to come back tomorrow.