Slashdot Mirror


Google Pushes Open Source OCR

SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"

1 of 212 comments (clear)

  1. Re:The beginning of the end? by lawpoop · · Score: 5, Informative

    I doubt it.

    Part of what makes OCR work is that it assumes that the text was written to communicate meaning. It has regular characters in alignment, forming common words and abbreviations in more or less grammatical sentences with close-to-proper punctuation.

    A good captcha has a non-sense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR uses.

    --
    Computers are useless. They can only give you answers.
    -- Pablo Picasso