Open Source OCR That Makes Searchable PDFs
An anonymous reader writes "In my job all of our multifunction copiers scan to PDF, but many of our users want and expect those PDFs to be text searchable. I looked around for software that would create text-searchable PDFs, but most are very expensive and I couldn't find any that were open source (free). I did find some open source packages like CuneiForm and ExactImage that could in theory do the job, but they were hard to install and difficult to set up and use over a network. Then I stumbled upon WatchOCR. This is a Live CD distro that can easily create a server on your network that provides an OCR service using watched folders. Now all my scanners scan to a watched folder, WatchOCR picks up those files and OCRs them, and then spits them out into another folder. It uses CuneiForm and ExactImage, but it is all configured and ready to deploy. It can even be remotely managed via the Web interface. Hope this proves helpful to someone else who has this same situation."
Wow, it's a "Tell Slashdot" segment! I've been looking for something similar myself, so thanks, I'll give this a spin!
There's something wrong with this Slashvertisement--it's for a free product!
Cool! Amazing Toys.
Wow, very cool. I have been looking around for something similar myself.
While we are on the topic, has anyone seen a good solution to scan, OCR, and reconvert existing crappy PDFs to improve them?
HA! I just wasted some of your bandwidth with a frivolous sig!
Ultimately, it would be nice to figure out what script or daemon is running in this and put it on an existing server. In the meantime, I could see just creating a VM for this thing to get started.
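For anyone who wants to try the watched-folder idea on an existing server, here is a minimal polling sketch. It is not WatchOCR's actual script, which I haven't seen. The `ocr` callback, the folder layout, and the function names are all hypothetical; you would plug in whatever OCR tool you actually use.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, List


def process_pending(inbox: Path, outbox: Path, done: Path,
                    ocr: Callable[[Path, Path], None]) -> List[Path]:
    """Run `ocr` once over every PDF currently sitting in `inbox`.

    `ocr(src, dst)` is a hypothetical callback that must write its result
    (ideally a searchable PDF) to `dst`.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    done.mkdir(parents=True, exist_ok=True)
    produced = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        ocr(src, dst)
        # Move the original out of the inbox so it is not processed twice.
        shutil.move(str(src), str(done / src.name))
        produced.append(dst)
    return produced


def watch(inbox: Path, outbox: Path, done: Path,
          ocr: Callable[[Path, Path], None], interval: float = 5.0) -> None:
    """Poll `inbox` forever, sleeping `interval` seconds between passes."""
    while True:
        process_pending(inbox, outbox, done, ocr)
        time.sleep(interval)
```

One thing this sketch ignores: a scanner may still be writing a file when the poll runs, so a real setup would probably want to wait until the file size stops changing before touching it.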
I was looking for something like this last year - it looks like this just got released last month, so I don't feel too bad about not finding it.
It looks really interesting, but how accurate is it? I've got some old books that are falling apart that I'd like to scan in and textify, but I'd like to know ahead of time how much time I'm going to have to budget for fixing problems and proofing.
I did a similar search recently, and your two major choices are ABBYY FineReader (they have Enterprise/Server level editions) or OmniReader (again at the Server/Enterprise level). They're priced pretty closely and have pretty well matched features, plus high accuracy. We're in the process of moving from a solution originally based on Adobe Acrobat's built-in OCR, which is okay but not great. Initial testing with ABBYY showed a demonstrably lower error rate on documents from scanned-in legal files.
Now it just needs to incorporate a reCAPTCHA Lite to improve accuracy.
Maybe something in the web interface, so that when it doesn't recognize a word you can correct it.
[Given the success of Cow Clicker on Facebook, maybe turn it into a Facebook game. Tell people they're only allowed to correct words every 6 hours. If they want to correct more words, they'll have to pay for it. Add friends and correct more words to level up!]
There is! I happen to work for a company (shameless plug) called DocuWare. It's document management software that does all of that. We are not in 24/7; at the corporate level, support (I am the support) is in 8 AM-8 PM Eastern, Monday through Friday. However, we sell through a dealer network that provides support on a contract basis (many Toshiba Business Solutions dealers are resellers for us, and I know they are 24x7). www.docuware.com
have you seen my sig? there are many others like it but none that are the same
Seriously, I'm conflicted. I'm not any sort of web search guru, but it looks like that site just got put up. Is submitter an early adopter (v0.2) or a social engineer?
That isn't a good sign, my friend.
Most, if not ALL, of the documents being scanned into PDF format are generated on computers already, so why go through the whole OCR process instead of getting the actual document from the original source as a PDF that is already text searchable?
THIS is exactly the problem with document management and processing today! We do things the hard way because we can't be bothered to change our processes, even though changing them would save tons of money and be more effective and accurate.
I know people who type a document in Word, print it to the copier/scanner/fax device, go pick up the printout, put it on the document scanner, scan it to email (PDF), and send it that way.
SERIOUSLY???
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
AFAICT the original structure was already gone when the PDF was made; you can only try to reverse engineer it from the drawing objects.
You might want to try converting to PostScript using Ghostscript and then converting to SVG using pstoedit. You still won't have the original structure, but at least you should have the table shape as a vector drawing rather than a bitmap.
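The two-step conversion suggested above might look roughly like this when scripted. Treat it as a sketch under assumptions: `ps2write` is Ghostscript's PostScript output device in current releases (older builds used a different device name), and the pstoedit SVG driver name depends on how pstoedit was built, so `plot-svg` is only a guess at a default. Run `pstoedit -help` to see which formats your build offers. It also assumes a single-page PDF.

```python
import subprocess
from pathlib import Path
from typing import List


def gs_pdf_to_ps_cmd(pdf: Path, ps: Path) -> List[str]:
    """Build the Ghostscript command that rewrites a PDF as PostScript."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=ps2write",
            f"-sOutputFile={ps}", str(pdf)]


def pstoedit_ps_to_svg_cmd(ps: Path, svg: Path,
                           fmt: str = "plot-svg") -> List[str]:
    """Build the pstoedit command that converts PostScript to SVG.

    Assumption: `fmt` names an SVG-capable driver in your pstoedit build.
    """
    return ["pstoedit", "-f", fmt, str(ps), str(svg)]


def pdf_to_svg(pdf: Path, svg: Path) -> None:
    """Run both steps, leaving the intermediate .ps file next to the SVG."""
    ps = svg.with_suffix(".ps")
    subprocess.run(gs_pdf_to_ps_cmd(pdf, ps), check=True)
    subprocess.run(pstoedit_ps_to_svg_cmd(ps, svg), check=True)
```

Calling `pdf_to_svg(Path("table.pdf"), Path("table.svg"))` would need both tools installed and on the PATH; the file names are just placeholders.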
note: I'm known as plugwash most places, but I screwed up registering that here somehow in the past and now can't register
I found tesseract to work very well to do OCR tasks. Doesn't generate PDF though.
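For reference, driving the tesseract command line from a script is short. This sketch relies only on the basic `tesseract <image> <outputbase>` form, which writes `<outputbase>.txt`. The helper names are made up for illustration. The optional `configs` argument is there because, as far as I know, later Tesseract releases accept a trailing `pdf` config that produces a searchable PDF; check your installed version before relying on that.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def tesseract_cmd(image: Path, output_base: Path,
                  configs: Sequence[str] = ()) -> List[str]:
    """Build a tesseract command line.

    tesseract appends the extension itself, so pass the output path
    without one (".txt" is added by default).
    """
    return ["tesseract", str(image), str(output_base), *configs]


def ocr_to_text(image: Path, output_base: Path) -> str:
    """Run tesseract on one image and return the recognized text."""
    subprocess.run(tesseract_cmd(image, output_base), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Something like `ocr_to_text(Path("page.tif"), Path("page"))` would then give you the plain text for one scanned page, assuming tesseract is installed and can read the image format.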
The sole support staff's user name is 'ganjadude'? I am a little wary :)
I'd love to see your script, if you want to make it available.