Open Source OCR That Makes Searchable PDFs
An anonymous reader writes "In my job all of our multifunction copiers scan to PDF but many of our users want and expect those PDFs to be text searchable. I looked around for software that would create text searchable PDFs but most are very expensive and I couldn't find any that were open source (free). I did find some open source packages like CuneiForm and ExactImage that could in theory do the job, but they were hard to install and difficult to set up and use over a network. Then I stumbled upon WatchOCR. This is a Live CD distro that can easily create a server on your network that provides an OCR service using watched folders. Now all my scanners scan to a watched folder, WatchOCR picks up those files and OCRs them, and then spits them out into another folder. It uses CuneiForm and ExactImage but it is all configured and ready to deploy. It can even be remotely managed via the Web interface. Hope this proves helpful to someone else who has this same situation."
Wow, it's a "Tell Slashdot" segment! I've been looking for something similar myself, so thanks, I'll give this a spin!
Nothing beats proprietary software like ABBYY FineReader.
There's something wrong with this Slashvertisement--it's for a free product!
Cool! Amazing Toys.
Wow, very cool. I have been looking around for something similar myself.
While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?
HA! I just wasted some of your bandwidth with a frivolous sig!
Saw this on Facebook. While I don't personally have a need for this, I know that down the line, I'll be glad I knew about it. Good post.
Ultimately, it would be nice to figure out what script or daemon is running in this and put it on an existing server. In the meantime, I could see just creating a VM for this thing to get started.
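For anyone wanting to try the watched-folder idea on an existing server, here is a minimal sketch of what such a loop might look like. It is not WatchOCR's actual script, which nobody in this thread has seen. The function names, the folder layout, and the `handler` callback are all made up for illustration; you would plug your own OCR call in as the handler.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, List


def process_watched_folder(inbox: Path, outbox: Path,
                           handler: Callable[[Path, Path], None]) -> List[Path]:
    """Run handler(src, dst) on every PDF in inbox, then delete the original.

    handler is a hypothetical callback that should write the OCRed result
    to dst. How WatchOCR does this internally is not known here.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        handler(src, dst)
        src.unlink()  # remove the input so the next pass does not redo it
        done.append(dst)
    return done


def watch_forever(inbox: Path, outbox: Path,
                  handler: Callable[[Path, Path], None],
                  interval_seconds: float = 5.0) -> None:
    """Poll the inbox at a fixed interval. Never returns."""
    while True:
        process_watched_folder(inbox, outbox, handler)
        time.sleep(interval_seconds)
```

Calling `watch_forever(Path("scans/in"), Path("scans/out"), shutil.copyfile)` would just copy files across, which is enough to test the plumbing before swapping in real OCR. A real deployment would also need to cope with files the scanner is still writing, which this sketch ignores.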
I agree with above posters, it's amazing to see a useful Slashvertisement. This one, however, has some quality behind it. I had not seen this program and OCR is one area where it's been difficult to find quality OSS solutions. Thanks for the post.
The most useful /. post in at least a year.
Nice, thanks for sharing. Currently we use Acrobat to OCR scanned documents; it seems to work well but doesn't keep up with our high-speed scanners. Having it automated sounds great. How does the speed/accuracy of WatchOCR compare to commercial products?
I was looking for something like this last year - it looks like this just got released last month, so I don't feel too bad about not finding it.
It looks really interesting, but how accurate is it? I've got some old books that are falling apart I'd like to scan in and textify, but I'd like to know how much time I'm going to have to budget ahead of time fixing problems and proofing.
Is there something similar available commercially that anyone can recommend? We may end up needing to scan large amounts of PDFs to a shared drive somewhere and need the whole thing to be searchable for keywords, but a requirement for that would be a commercial product that has 24x7 support.
I guess that's where
step 3: ????
Comes into play.
If this works well, I have a bunch of use for this. Thank you for the heads-up.
the no
Your copier providers probably already include this in the package you have. It just hasn't been enabled.
Our direct-to-pdf document scanners include copies of Acrobat Pro (both Windows and OSX), automatically do OCR, and were less than $400 each.
I'm out of my mind right now, but feel free to leave a message.....
Funny, I was just looking for something to do this the other day.
But isn't there some middle ground between (a) making users do complicated setup work, vs. (b) making an entire OS out of it?
How about just making a tarball or Ubuntu/Debian/RPM package that installs and sensibly configures those two tools?
Seriously, I'm conflicted. I'm not any sort of web search guru, but it looks like that site just got put up. Is submitter an early adopter (v0.2) or a social engineer?
Another open source option: the PDFs can be converted to TIFFs using Ghostscript and OCRed using Tesseract. The current version of Tesseract does not have document layout analysis and page segmentation support.
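A rough sketch of that Ghostscript-then-Tesseract pipeline, driven from Python. The helper names, the `page-%04d.tif` naming, and the 300 dpi default are my own choices for illustration. It assumes `gs` and `tesseract` are on the PATH, and it only produces one plain-text file per page, not a searchable PDF.

```python
import subprocess
from pathlib import Path
from typing import List


def ghostscript_cmd(pdf: Path, out_dir: Path, dpi: int = 300) -> List[str]:
    """Build the gs command that renders each PDF page to a bilevel TIFF."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-q",
        "-sDEVICE=tiffg4",            # CCITT Group 4 black-and-white TIFF
        f"-r{dpi}",                   # render resolution
        f"-sOutputFile={out_dir / 'page-%04d.tif'}",
        str(pdf),
    ]


def tesseract_cmd(tiff: Path) -> List[str]:
    """Build the tesseract command; tesseract appends .txt to the output base."""
    return ["tesseract", str(tiff), str(tiff.with_suffix(""))]


def ocr_pdf(pdf: Path, out_dir: Path) -> List[Path]:
    """Render the PDF to TIFFs, OCR each page, return the .txt paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_cmd(pdf, out_dir), check=True)
    texts = []
    for tiff in sorted(out_dir.glob("page-*.tif")):
        subprocess.run(tesseract_cmd(tiff), check=True)
        texts.append(tiff.with_suffix(".txt"))
    return texts
```

Something like `ocr_pdf(Path("scan.pdf"), Path("ocr_out"))` would leave `page-0001.txt`, `page-0002.txt`, and so on next to the TIFFs. Since there is no layout analysis, multi-column pages will likely come out jumbled.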
I settled on an expensive proprietary solution some months ago at work (I am the IT guy, Dishwasher, and Business...something) to do our org's scan and OCR. Admittedly it's end to end, including the scanner as well. But it was $15K and does a good job.
I did searches online (a dozen hours) and they all funneled back to "FOSS less good, proprietary for best results."
I am afraid to look at this one, because I did make the final decision with pressure from the General Manager.
I dunno what Google uses actually, but their in-house solution (on Google Code) would *not* produce good results. No. 1 in the FOSS tests, but like 6th (by miles) on proprietary comparisons.
In post Patriot Act America, the library books scan you.
Okular is a free Linux program that can export PDFs to text as well.
Most, if not ALL of the documents being scanned into PDF format, are generated on computers already, so why go through the whole OCR process, and not get the actual document from the original source in a PDF version that is already text searchable?
THIS is exactly the problem with document management and processing today! Doing things the hard way because we can't be bothered changing processes that will save tons of money, be more effective, and accurate.
I know people who type a document in Word and then print it to the copier/scanner/fax device, go pick up the document, put it on the document scanner, scan it to email (PDF) and send it that way.
SERIOUSLY???
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
While we're on the topic, does anyone know of any PDF to text or HTML (or XML or whatever) converters that will do a good job of preserving the original structure of the information?
Specifically, I have occasion to deal with PDF documents that contain tables of information -- I don't have any need for OCR (at least, not so far), as the PDFs that I deal with are not scanned documents.
Most tools will extract the text correctly, and will create documents that will render quite close to the original document in a web browser, but the markup can become extremely difficult to parse.
Generally, each block of text (table element, say) will be placed inside something like an independently-positioned DIV.
Things get especially screwy when table elements can contain line breaks (which make some rows span multiple lines) and some elements which are empty.
So, the text will all be there, but for some PDFs, it becomes a difficult task to parse out the meaning. I tried out a number of free tools and some paid demos, and have settled on PDFTOHTML.
Does anyone have a better tool that will, at the very least, draw individual lines between table rows? I think what PDFTOHTML does is to create a background image of all the lines on the page. I'd prefer a free/opensource solution, but would be perfectly happy with anything that does the job well.
thanks.
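One way to attack the independently-positioned-DIV problem described above is to ignore the markup and rebuild rows from the coordinates. A minimal sketch, assuming you have already pulled each block out as a `(left, top, text)` tuple. That input shape, the function name, and the 3-unit tolerance are assumptions for illustration, not anything PDFTOHTML provides directly.

```python
from typing import List, Tuple

# (left, top, text): a hypothetical shape for one positioned text block
Fragment = Tuple[float, float, str]


def group_into_rows(fragments: List[Fragment],
                    tolerance: float = 3.0) -> List[List[str]]:
    """Group positioned text fragments into table rows by their top coordinate.

    Fragments whose top is within `tolerance` of the first fragment in the
    current row are treated as belonging to that row.
    """
    rows: List[List[Fragment]] = []
    row_tops: List[float] = []
    for frag in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(frag[1] - row_tops[-1]) <= tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
            row_tops.append(frag[1])
    # Within each row, order the cells left to right.
    return [[text for _, _, text in sorted(row, key=lambda f: f[0])]
            for row in rows]
```

This does nothing clever about the hard cases mentioned above: a cell with a line break will show up as an extra row, and an empty cell simply disappears rather than leaving a gap. Handling those would need column positions as well.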
Just about anyone can read a PDF. If you send a MS Word doc, you have to wonder what version of Word the other person has. And these days, Macs are popular enough that they might not have Word at all! PDF works, and works for everyone. It would be far simpler to print to PDF, but not everyone has a print driver that can do that. ODF is supposed to fix that, but it probably won't.
What are you suggesting? Word 2007 and OpenOffice (for some versions back now) export to PDF, so why not just do that for sending electronically and skip the print/scan steps?
RO
Have also been looking for something like this but was using tesseract to create .txt files that they then searched!
I have tried twice to download it, and it 'finishes' at about 150 MB both times, while the file size on their web page shows over 600 MB. As a double-check (suspecting a file size reporting error on their page), I compared the MD5 sum, and it fails as well. Has anyone successfully downloaded it?
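For anyone else checking a suspect download, the size-plus-MD5 check described above can be scripted. A small sketch, with function names of my own invention. The expected size and checksum are whatever the download page publishes, so none are filled in here.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to save memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_looks_complete(path: Path, expected_size: int,
                            expected_md5: str) -> bool:
    """Cheap size check first, then the checksum comparison."""
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5.strip().lower()
```

A truncated ISO fails the size check immediately, which saves hashing a few hundred megabytes just to learn the same thing.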
V for Vendetta: People should not be afraid of their governments. Governments should be afraid of their people.
I understand what you're saying, but installing the distro in a VM isn't much extra resource/work over a tarball. Plug in your preferred virtualization solution, of course, they all support exporting directories.
"Ahh! I see you're in that indeterminate Schrodinger state where - oh, uh
Not everyone wanting to do this does in fact have access to the electronic source. I know I would like to try it for some of my old crumbling books, as someone else mentioned above, no longer in print (or otherwise only available in DRM-encumbered ebook formats that I cannot read on Linux or Windows Mobile).
RO
I found tesseract to work very well to do OCR tasks. Doesn't generate PDF though.
Although it is an expensive model, I expect that within a few years this will be a standard feature even in the cheaper models.
Look around when you replace your existing copiers.
So let me get this straight - there's a new Service on a Disc, that reportedly delivers something cool, that is brand new and working better than anyone has gotten it to work, it's free, and all that's required is that you have to spin it up on a server on your network - no muss no fuss. Oh yeah, and nobody knows what exactly it does.
Time to update your resume...
I can't seem to download this file; it keeps giving up after a couple of hundred megs... probably slashdotted.
Make sure everyone's vote counts: Verified Voting
Can you go one step further with this and get it read the text (only) out of Microsoft formatted files? Maybe it could even read words out of Word files, Powerpoint, etc.
Like the beaver, it's just Dam one thing after another
>> Oh yeah, and nobody knows what exactly it does.
Oh yeah, and nobody knows what exactly it does with access to all your sensitive documents.
Someone had to do it.
I haven't tried it yet, but this looks promising. It isn't free, but it also doesn't seem as pricey as Adobe.
Qoppa Software [ http://www.qoppa.com/index.html ]
An effective "democracy" creates the illusion the people have a say in their government.
The download is quite slow. I guess it would be nice to have alternative download options to try.
I wrote a bash script a few months back which, in a little over 130 lines (it has a few command line options), can convert any old PDF to a text searchable PDF. I really wonder whether a distro is a bit of overkill for this? But it is such an important tool to have that I commend the authors for making it available... I just wish they'd put up the actual script that they used so I could compare it to my own!
I must be missing something. Why would you want OCR on a server and not as part of the program that interfaces with the scanner?
The truth may be out there, but lies are inside your head
I've been looking for a good OCR system for my business! This Cuneiform sounds like just what I need! Well I can't wait to inst-*looks at web page*... erm..... hm....
Version 0.2 has been out for at least a month by the looks of their forum, and version numbers are a very imprecise way of telling how useful the software is for your needs, or even how stable it is. What's wrong with being an "early adopter" if it's the only working and free solution to your problem?
which is totally what she said
It just does *everything* I need. It takes scans from the scanner, it processes them with OCR, it allows me to delete or insert pages... it's just very simple and does the job well.
For OCR, gscan2pdf works with 4 OCR programs currently:
Ocropus is developed with funding/support from Google. It uses tesseract as a backend to do a lot of the work. In simple terms, Ocropus is awesome. I find it does a stellar job at OCR. It's absolutely open source and great software.