Slashdot Mirror


Open Source OCR That Makes Searchable PDFs

An anonymous reader writes "At my job, all of our multifunction copiers scan to PDF, but many of our users want and expect those PDFs to be text-searchable. I looked around for software that would create text-searchable PDFs, but most of it is very expensive and I couldn't find any that was open source (free). I did find some open source packages like CuneiForm and ExactImage that could in theory do the job, but they were hard to install and difficult to set up and use over a network. Then I stumbled upon WatchOCR. This is a Live CD distro that can easily create a server on your network that provides an OCR service using watched folders. Now all my scanners scan to a watched folder, WatchOCR picks up those files and OCRs them, and then spits them out into another folder. It uses CuneiForm and ExactImage, but it is all configured and ready to deploy. It can even be remotely managed via the Web interface. Hope this proves helpful to someone else in the same situation."
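
For anyone curious what a watched-folder OCR service boils down to, here is a rough sketch in Python. It is not WatchOCR's actual code, which the submission does not show. The names `process_once`, `placeholder_ocr`, and the `.part` suffix are made up for illustration, and the `ocr` argument is a hypothetical hook where a real engine such as CuneiForm plus ExactImage would be called.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_once(inbox: Path, outbox: Path,
                 ocr: Callable[[Path, Path], None]) -> List[str]:
    """Run one pass over the watched folder and return the names handled."""
    outbox.mkdir(parents=True, exist_ok=True)
    handled = []
    for src in sorted(inbox.glob("*.pdf")):
        final = outbox / src.name
        partial = outbox / (src.name + ".part")
        # Hypothetical hook: a real setup would produce a searchable PDF here.
        ocr(src, partial)
        # Rename only after the OCR step finishes, so nobody reading the
        # output folder picks up a half-written file.
        partial.replace(final)
        # Remove the original so the next pass does not process it again.
        src.unlink()
        handled.append(src.name)
    return handled


def placeholder_ocr(src: Path, dst: Path) -> None:
    """Stand-in for the OCR step (assumption): copies the file unchanged."""
    shutil.copyfile(src, dst)
```

A service would call `process_once` in a loop with a short sleep between passes. A real deployment would probably also need to skip files the scanner is still writing, which this sketch does not attempt.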

5 of 133 comments

  1. Thanks! by Fast+Thick+Pants · Score: 5, Insightful

    Wow, it's a "Tell Slashdot" segment! I've been looking for something similar myself, so thanks, I'll give this a spin!

    1. Re:Thanks! by TooMuchToDo · Score: 3, Insightful

      Looks like Slashdot needs a moderation "+1 Thank You!" option.

  2. Run on a VM by ChuckDriver · Score: 3, Insightful

    Ultimately, it would be nice to figure out what script or daemon is running in this and put it on an existing server. In the meantime, I could see just creating a VM for this thing to get started.

  3. Anyone got error rates? by savanik · Score: 3, Insightful

    I was looking for something like this last year - it looks like this just got released last month, so I don't feel too bad about not finding it.

    It looks really interesting, but how accurate is it? I've got some old books that are falling apart that I'd like to scan in and textify, but I'd like to know ahead of time how much time I'm going to have to budget for fixing problems and proofing.

  4. Re:Wait a sec by ushering05401 · Score: 4, Insightful

    Seriously, I'm conflicted. I'm not any sort of web search guru, but it looks like that site just got put up. Is submitter an early adopter (v0.2) or a social engineer?