Open Source OCR That Makes Searchable PDFs

Posted by timothy on Thursday July 22, 2010 @07:21AM from the word-of-advice dept.

An anonymous reader writes "In my job, all of our multifunction copiers scan to PDF, but many of our users want and expect those PDFs to be text searchable. I looked around for software that would create text-searchable PDFs, but most packages are very expensive and I couldn't find any that were open source (free). I did find some open source packages like CuneiForm and ExactImage that could in theory do the job, but they were hard to install and difficult to set up and use over a network. Then I stumbled upon WatchOCR. This is a Live CD distro that can easily create a server on your network that provides an OCR service using watched folders. Now all my scanners scan to a watched folder, WatchOCR picks up those files and OCRs them, and then spits them out into another folder. It uses CuneiForm and ExactImage, but it is all configured and ready to deploy. It can even be remotely managed via the Web interface. Hope this proves helpful to someone else who has this same situation."
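
For readers who want a feel for the watched-folder workflow described in the summary, here is a minimal sketch in Python. It is not WatchOCR's actual code, which the summary does not show. The function names, the folder arguments, and the `ocr_cmd` command template are all assumptions made for illustration, and when no OCR command is supplied the sketch simply copies the file so the loop can be tried without any OCR engine installed.

```python
import shutil
import subprocess
import time
from pathlib import Path


def process_once(inbox: Path, outbox: Path, ocr_cmd=None) -> list:
    """Run one pass over the watched folder and return the files produced."""
    outbox.mkdir(parents=True, exist_ok=True)
    produced = []
    for pdf in sorted(inbox.glob("*.pdf")):
        target = outbox / pdf.name
        if ocr_cmd is None:
            # Placeholder step: no OCR engine configured, so just copy the file.
            shutil.copy2(pdf, target)
        else:
            # ocr_cmd is a hypothetical template such as
            # ["some-ocr-tool", "{src}", "{dst}"]; substitute whatever
            # command line your own OCR setup actually expects.
            cmd = [part.format(src=pdf, dst=target) for part in ocr_cmd]
            subprocess.run(cmd, check=True)
        pdf.unlink()  # remove the input so it is not processed twice
        produced.append(target)
    return produced


def watch(inbox: Path, outbox: Path, ocr_cmd=None, interval: float = 5.0) -> None:
    """Poll the watched folder forever, processing new PDFs as they appear."""
    while True:
        process_once(inbox, outbox, ocr_cmd)
        time.sleep(interval)
```

Calling `watch(Path("scans/in"), Path("scans/out"))` would poll every five seconds. A real deployment would also need to avoid picking up a file while the scanner is still writing it, which this sketch does not handle.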

21 of 133 comments

Thanks! by Fast+Thick+Pants · 2010-07-22 07:23 · Score: 5, Insightful

Wow, it's a "Tell Slashdot" segment! I've been looking for something similar myself, so thanks, I'll give this a spin!
1. Re:Thanks! by godrik · 2010-07-22 07:32 · Score: 2, Informative
  
  Same here. Thank you too!
  (I know this post is very redundant and useless. But thanks are always welcome, aren't they?)
2. Re:Thanks! by MikeBabcock · 2010-07-22 08:22 · Score: 3, Interesting
  
  I only wish I could find a source download on their site. Even a "what we're doing" guide. Downloading the ISO and reverse-engineering what they're doing with CuneiForm and ExactImage doesn't seem nearly as productive, especially when I'd rather implement this on an existing server than boot a special piece of hardware with it.
  
  --
  - Michael T. Babcock (Yes, I blog)
3. Re:Thanks! by tsstahl · 2010-07-22 08:41 · Score: 2, Insightful
  
  Virtual machine?
4. Re:Thanks! by TooMuchToDo · 2010-07-22 09:43 · Score: 3, Insightful
  
  Looks like Slashdot needs a moderation "+1 Thank You!" option.
Wait a sec by inKubus · 2010-07-22 07:30 · Score: 5, Funny

There's something wrong with this Slashvertisement--it's for a free product!

--
Cool! Amazing Toys.
Thanks for the info... by TiggertheMad · 2010-07-22 07:30 · Score: 2

Wow, very cool. I have been looking around for something similar myself.

While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?

--

HA! I just wasted some of your bandwidth with a frivolous sig!
1. Re:Thanks for the info... by It's+the+tripnaut! · 2010-07-22 11:18 · Score: 2, Informative
  
  While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?
  I've tried quite a few free and proprietary OCR packages, and the best available right now, imho, is ABBYY FineReader. Other than fonts, it also easily recognizes tables, diagrams and illustrations. But most of all, it can read and render 189 languages (including Chinese and Cyrillic) accurately. A free trial version is available.
Run on a VM by ChuckDriver · 2010-07-22 07:32 · Score: 3, Insightful

Ultimately, it would be nice to figure out what script or daemon is running in this and put it on an existing server. In the meantime, I could see just creating a VM for this thing to get started.
1. Re:Run on a VM by TooMuchToDo · 2010-07-22 09:31 · Score: 2, Interesting
  
  Already on it. Want it as an EC2 AMI? ;)
Anyone got error rates? by savanik · 2010-07-22 07:39 · Score: 3, Insightful

I was looking for something like this last year - it looks like this just got released last month, so I don't feel too bad about not finding it.
It looks really interesting, but how accurate is it? I've got some old books that are falling apart I'd like to scan in and textify, but I'd like to know how much time I'm going to have to budget ahead of time fixing problems and proofing.
Re:commercial? by Anonymous Coward · 2010-07-22 07:44 · Score: 2, Informative

After doing a similar search recently, your two major choices are ABBYY FineReader (they have Enterprise/Server level editions) or OmniReader (again at the Server/Enterprise level). They're priced pretty closely and have pretty well matched features, plus high accuracy. We're in the process of moving from a solution originally based on Adobe Acrobat's built-in OCR, which is okay but not great. Initial testing with ABBYY showed a demonstrably lower error rate on documents from scanned-in legal files.
Re:ocr by 0100010001010011 · 2010-07-22 07:45 · Score: 3, Funny

Now it just needs to incorporate a Recaptcha Lite to improve accuracy.
Maybe something in the web interface so that when it doesn't recognize a word, you can correct it.
[Given the success of the Cow Clicker on Facebook, maybe turn it into a Facebook game. Tell people they're only allowed to correct words every 6 hours. If they want to correct more words, they'll have to pay for it. Add friends and correct more words to level up!]
Re:commercial? by ganjadude · 2010-07-22 07:46 · Score: 3, Informative

There is! I happen to work for a company (shameless plug) called DocuWare. It's document management software that does all of that. We are not in 24/7; at the corporate level, support is in 8 AM-8 PM Eastern, M-F (I am the support). However, we sell through a dealer network that provides support on a contract basis (many Toshiba business solutions are resellers for us, and I know they are 24x7). www.docuware.com

--
have you seen my sig? there are many others like it but none that are the same
Re:Wait a sec by ushering05401 · 2010-07-22 07:51 · Score: 4, Insightful

Seriously, I'm conflicted. I'm not any sort of web search guru, but it looks like that site just got put up. Is submitter an early adopter (v0.2) or a social engineer?
Re:added. by b4dc0d3r · 2010-07-22 07:53 · Score: 3, Funny

Saw this on facebook.
That isn't a good sign, my friend.
Stupid by Archangel+Michael · 2010-07-22 08:04 · Score: 2, Insightful

Most, if not ALL, of the documents being scanned into PDF format are generated on computers already, so why go through the whole OCR process instead of getting the actual document from the original source in a PDF version that is already text searchable?
THIS is exactly the problem with document management and processing today! Doing things the hard way because we can't be bothered changing processes, when the change would save tons of money and be more effective and accurate.
I know people who type a document in WORD and then print it to the copier/scanner/fax device, go pick up the document, put it on the document scanner, scan it to email (PDF) and send it that way.
SERIOUSLY???

--
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
Re:better alternatives to pdftohtml by petermgreen · 2010-07-22 09:15 · Score: 2, Informative

Afaict the original structure was already gone when the PDF was made; you can only try to reverse engineer it from the drawing objects.
You might want to try converting to postscript using ghostscript and then converting to svg using pstoedit. You still won't have the original structure but at least you should have the table shape as a vector drawing rather than a bitmap.

--
note: i'm known as plugwash most places but i screwed up registering that here somehow in the past and now can't register
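
The two-step conversion suggested in the comment above (PDF to PostScript with Ghostscript, then PostScript to SVG with pstoedit) could be scripted roughly as follows. This is a sketch under assumptions, not a tested recipe: the helper names are made up for illustration, the Ghostscript device `ps2write` may differ on older builds, and the pstoedit format name `plot-svg` is only available when pstoedit was built with GNU libplot support, so check `pstoedit -help` on your system before relying on it.

```python
import subprocess
from pathlib import Path


def pdf_to_svg_commands(pdf: Path, svg_format: str = "plot-svg") -> list:
    """Build the two command lines without running them.

    svg_format is an assumption: the SVG driver name depends on how
    pstoedit was built on your machine.
    """
    ps = pdf.with_suffix(".ps")    # intermediate PostScript file
    svg = pdf.with_suffix(".svg")  # final vector drawing
    gs_cmd = [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=ps2write",       # assumed device; older builds may differ
        f"-sOutputFile={ps}",
        str(pdf),
    ]
    pstoedit_cmd = ["pstoedit", "-f", svg_format, str(ps), str(svg)]
    return [gs_cmd, pstoedit_cmd]


def convert(pdf: Path) -> None:
    """Run both steps in order, stopping at the first failure."""
    for cmd in pdf_to_svg_commands(pdf):
        subprocess.run(cmd, check=True)
```

As the comment says, the result is still only the table's shape as vector drawing objects, not its original structure.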
Tesseract OCR by TheSync · 2010-07-22 09:35 · Score: 2, Informative

I found tesseract to work very well to do OCR tasks. Doesn't generate PDF though.
Re:commercial? by FelixNZ · 2010-07-22 10:26 · Score: 3, Funny

Sole support staff's user name is 'ganjadude'? I am a little wary :)
Re:exactimage + cuneiform by kilf · 2010-07-22 23:41 · Score: 2, Insightful

I'd love to see your script, if you want to make it available.