Open Source OCR That Makes Searchable PDFs
An anonymous reader writes "In my job all of our multifunction copiers scan to PDF, but many of our users want and expect those PDFs to be text searchable. I looked around for software that would create text-searchable PDFs, but most are very expensive and I couldn't find any that were open source (free). I did find some open source packages like CuneiForm and ExactImage that could in theory do the job, but they were hard to install and difficult to set up and use over a network. Then I stumbled upon WatchOCR. This is a Live CD distro that can easily create a server on your network that provides an OCR service using watched folders. Now all my scanners scan to a watched folder, WatchOCR picks up those files and OCRs them, and then spits them out into another folder. It uses CuneiForm and ExactImage but it is all configured and ready to deploy. It can even be remotely managed via the Web interface. Hope this proves helpful to someone else who has this same situation."
Wow, it's a "Tell Slashdot" segment! I've been looking for something similar myself, so thanks, I'll give this a spin!
There's something wrong with this Slashvertisement--it's for a free product!
Cool! Amazing Toys.
Wow, very cool. I have been looking around for something similar myself.
While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?
HA! I just wasted some of your bandwidth with a frivolous sig!
Saw this on facebook. While I don't personally have a need for this, I know that down the line, I'll be glad I knew about it. Good post.
Ultimately, it would be nice to figure out what script or daemon is running in this and put it on an existing server. In the meantime, I could see just creating a VM for this thing to get started.
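For anyone wondering what such a daemon might look like, here is a minimal sketch of the watched-folder pattern in Python. It is not WatchOCR's actual code, which nobody in this thread has seen. The names `process_once`, `watch` and `placeholder_ocr` are made up for illustration, and the OCR step is a stand-in that only copies the file; a real setup would swap in a call to whatever OCR tool it uses.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, List


def placeholder_ocr(src: Path, dst: Path) -> None:
    """Hypothetical stand-in for the real OCR step: it only copies the file."""
    shutil.copyfile(src, dst)


def process_once(
    inbox: Path,
    outbox: Path,
    ocr: Callable[[Path, Path], None] = placeholder_ocr,
) -> List[Path]:
    """Run one pass over the watched folder and return the files produced."""
    produced = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        ocr(src, dst)   # convert (here: copy) into the output folder
        src.unlink()    # remove the input so it is not picked up again
        produced.append(dst)
    return produced


def watch(inbox: Path, outbox: Path, interval: float = 5.0) -> None:
    """Poll the inbox forever. A real service would also need to skip files
    that the scanner is still writing; this sketch does not handle that."""
    while True:
        process_once(inbox, outbox)
        time.sleep(interval)
```

Calling `watch(Path("/srv/scans/in"), Path("/srv/scans/out"))` (both paths are examples only) would keep polling until it is killed. Polling is the simplest approach and works over network shares, which is presumably where the copiers drop their files.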
I agree with above posters, it's amazing to see a useful Slashvertisement. This one, however, has some quality behind it. I had not seen this program and OCR is one area where it's been difficult to find quality OSS solutions. Thanks for the post.
Nice, thanks for sharing. Currently we use Acrobat to OCR scanned documents. It seems to work well but doesn't keep up with our high-speed scanners. Having it automated sounds great. How does the speed/accuracy of WatchOCR compare to commercial products?
I was looking for something like this last year - it looks like this just got released last month, so I don't feel too bad about not finding it.
It looks really interesting, but how accurate is it? I've got some old books that are falling apart I'd like to scan in and textify, but I'd like to know how much time I'm going to have to budget ahead of time fixing problems and proofing.
Is there something similar available commercially that anyone can recommend? We may end up needing to scan large numbers of PDFs to a shared drive somewhere and need the whole thing to be searchable for keywords, but a requirement for that would be a commercial product that has 24x7 support.
I guess that's where "step 3: ????" comes into play.
If this works well, I have a bunch of use for this. Thank you for the heads-up.
Funny, I was just looking for something to do this the other day.
But isn't there some middle ground between (a) making users do complicated setup work, vs. (b) making an entire OS out of it?
How about just making a tarball or Ubuntu/Debian/RPM package that installs and sensibly configures those two tools?
Seriously, I'm conflicted. I'm not any sort of web search guru, but it looks like that site just got put up. Is submitter an early adopter (v0.2) or a social engineer?
I settled on an expensive proprietary solution some months ago at work (I am the IT guy, dishwasher, and business... something) to handle our org's scanning and OCR. Admittedly it's end to end, including the scanner as well. But it cost $15K and does a good job.
I did searches online (a dozen hours) and they all funneled back to "FOSS less good, proprietary for best results."
I am afraid to look at this one, because I made the final decision under pressure from the General Manager.
I dunno what Google actually uses, but their in-house solution (on Google Code) would *not* produce good results. It was No. 1 in the FOSS tests, but something like 6th (by miles) in comparisons with proprietary products.
In post Patriot Act America, the library books scan you.
Most, if not ALL of the documents being scanned into PDF format, are generated on computers already, so why go through the whole OCR process, and not get the actual document from the original source in a PDF version that is already text searchable?
THIS is exactly the problem with document management and processing today! Doing things the hard way because we can't be bothered changing processes that will save tons of money, be more effective, and accurate.
I know people who type a document in Word, print it to the copier/scanner/fax device, go pick up the printout, put it on the document scanner, scan it to email (PDF), and send it that way.
SERIOUSLY???
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
Most normal PDF readers (incl. Okular) only work when the actual text is included in the PDF to begin with. When the source isn't computer-generated but scanned in, there's only image data to work with (no text). Actual OCR is pretty much the only choice in this case...
I have tried twice to download it, and it 'finishes' at about 150 MB both times, while the file size on their web page shows over 600 MB. As a double-check (suspecting a file size reporting error on their page), I ran an MD5 sum, and it fails that as well. Has anyone successfully downloaded it?
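For reference, the size-plus-checksum check described above can be sketched in a few lines of Python. The function names are illustrative, and the expected size and MD5 value would have to come from the project's download page; none are given in this thread.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_looks_complete(path, expected_size, expected_md5):
    """True only if both the byte count and the MD5 digest match."""
    if Path(path).stat().st_size != expected_size:
        return False  # truncated (or oversized) download; skip hashing
    return md5_of_file(path) == expected_md5.strip().lower()
```

Checking the size first is just a cheap shortcut: a download that stops at 150 MB of a 600 MB image fails immediately without hashing anything.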
V for Vendetta: People should not be afraid of their governments. Governments should be afraid of their people.
I understand what you're saying, but installing the distro in a VM isn't much extra resource/work over a tarball. Plug in your preferred virtualization solution, of course, they all support exporting directories.
"Ahh! I see you're in that indeterminate Schrodinger state where - oh, uh
Not everyone wanting to do this does in fact have access to the electronic source. I know I would like to try it for some of my old crumbling books, as someone else mentioned above, that are no longer in print (or otherwise only available in DRM-encumbered ebook formats that I cannot read on Linux or Windows Mobile).
Afaict the original structure was already gone when the PDF was made; you can only try to reverse engineer it from the drawing objects.
You might want to try converting to postscript using ghostscript and then converting to svg using pstoedit. You still won't have the original structure but at least you should have the table shape as a vector drawing rather than a bitmap.
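A rough sketch of that two-step conversion, driven from Python. Treat the specifics as assumptions: `ps2write` is the PostScript output device in reasonably recent Ghostscript builds, and the pstoedit output format name (`plot-svg` here) depends on which drivers your pstoedit was built with, so it is left as a parameter. The helper names are made up for illustration.

```python
import subprocess


def build_commands(pdf_path, ps_path, svg_path, pstoedit_format="plot-svg"):
    """Return the two command lines: PDF -> PostScript, then PostScript -> SVG.

    pstoedit_format is an assumption; use whichever SVG-capable driver
    your pstoedit installation actually provides.
    """
    gs_cmd = [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=ps2write",
        "-sOutputFile=" + ps_path,
        pdf_path,
    ]
    pstoedit_cmd = ["pstoedit", "-f", pstoedit_format, ps_path, svg_path]
    return [gs_cmd, pstoedit_cmd]


def convert(pdf_path, ps_path, svg_path, pstoedit_format="plot-svg"):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_commands(pdf_path, ps_path, svg_path, pstoedit_format):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the execution makes it easy to print the commands and try them by hand first.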
note: I'm known as plugwash most places, but I screwed up registering that here somehow in the past and now can't register
I found tesseract to work very well to do OCR tasks. Doesn't generate PDF though.
I can't seem to be able to download this file, it keeps giving up after a couple of hundred megs... probably slashdotted.
Make sure everyone's vote counts: Verified Voting
Can you go one step further with this and get it read the text (only) out of Microsoft formatted files? Maybe it could even read words out of Word files, Powerpoint, etc.
Like the beaver, it's just Dam one thing after another
Quick search result although I think we have slightly older black and white versions.
I'm out of my mind right now, but feel free to leave a message.....
>> Oh yeah, and nobody knows what exactly it does.
Oh yeah, and nobody knows what exactly it does with access to all your sensitive documents.
Someone had to do it.
I haven't tried it yet, but this looks promising. It isn't free, but it also doesn't seem as pricey as Adobe.
Qoppa Software [ http://www.qoppa.com/index.html ]
An effective "democracy" creates the illusion the people have a say in their government.
I wrote a bash script a few months back which, in a little over 130 lines (it has a few command line options), can convert any old PDF to a text-searchable PDF. I really wonder whether a distro is a bit overkill for this. But it is such an important tool to have that I commend the authors for making it available... I just wish they'd put up the actual script that they used so I could compare it to my own!
I must be missing something. Why would you want OCR on a server and not as part of the program that interfaces with the scanner?
The truth may be out there, but lies are inside your head
Version 0.2 has been out for at least a month by the looks of their forum, and version numbers are a very imprecise way of telling how useful the software is for your needs, or even how stable it is. What's wrong with being an "early adopter" if it's the only working and free solution to your problem?
which is totally what she said
It just does *everything* I need. It takes scans from scanner, it processes it with OCR, it allows me to delete or insert pages.....it's just very simple and does the job well.
For OCR, gscan2pdf works with 4 OCR programs currently:
Ocropus is developed with funding/support from Google. It uses tesseract as a backend to do a lot of the work. In simple terms, Ocropus is awesome. I find it does a stellar job at OCR. It's absolutely open source and great software.