Open Source OCR That Makes Searchable PDFs

Posted by timothy on Thursday July 22, 2010 @07:21AM from the word-of-advice dept.

An anonymous reader writes "In my job all of our multifunction copiers scan to PDF but many of our users want and expect those PDFs to be text searchable. I looked around for software that would create text searchable pdfs but most are very expensive and I couldn't find any that were open source (free). I did find some open source packages like CuneiForm and Exactimage that could in theory do the job, but they were hard to install and difficult to set up and use over a network. Then I stumbled upon WatchOCR. This is a Live CD distro that can easily create a server on your network that provides an OCR service using watched folders. Now all my scanners scan to a watched folder, WatchOCR picks up those files and OCRs them, and then spits them out into another folder. It uses CuneiForm and ExactImage but it is all configured and ready to deploy. It can even be remotely managed via the Web interface. Hope this proves helpful to someone else who has this same situation."
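For readers who want to picture the watched-folder workflow the submitter describes, here is a minimal polling sketch. It is not WatchOCR's actual code, which the summary does not show. The names `process_once`, `watch`, and `copy_only` are made up for illustration, and the OCR step is a callable you supply yourself.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def process_once(inbox: Path, outbox: Path, ocr: Callable[[Path, Path], None]) -> list[Path]:
    """Run `ocr` on every PDF in `inbox` and return the output paths.

    Each source file is deleted after a successful run so it is not
    picked up again on the next pass.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        ocr(src, dst)   # hypothetical OCR step supplied by the caller
        src.unlink()    # remove the input once its output exists
        done.append(dst)
    return done


def watch(inbox: Path, outbox: Path, ocr: Callable[[Path, Path], None],
          interval: float = 5.0) -> None:
    """Poll `inbox` forever, sleeping `interval` seconds between passes."""
    while True:
        process_once(inbox, outbox, ocr)
        time.sleep(interval)


def copy_only(src: Path, dst: Path) -> None:
    """Placeholder OCR step that just copies the file unchanged."""
    shutil.copyfile(src, dst)
```

A real deployment would also need to skip files the copier is still writing (for example by waiting until the file size stops changing), which this sketch leaves out.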

Thanks! by Fast+Thick+Pants · 2010-07-22 07:23 · Score: 5, Insightful

Wow, it's a "Tell Slashdot" segment! I've been looking for something similar myself, so thanks, I'll give this a spin!
1. Re:Thanks! by godrik · 2010-07-22 07:32 · Score: 2, Informative
  
  Same here. Thank you too!
  (I know this post is very redundant and useless. But thanks are always welcome, aren't they ?)
2. Re:Thanks! by sumdumass · 2010-07-22 07:38 · Score: 1
  
  Heh.. 6 months of looking for something and about to settle on a very expensive proprietary deployment.
  I'll have to see if it works as easy as he says, but it's right there for my needs too.
3. Re:Thanks! by MikeBabcock · 2010-07-22 08:22 · Score: 3, Interesting
  
  I only wish I could find a source download on their site. Even a "what we're doing" guide. Downloading the ISO and reverse-engineering what they're doing with cuneiform and exactimage doesn't seem nearly as productive, especially when I'd rather implement this on an existing server than boot a special piece of hardware with it.
  
  --
  - Michael T. Babcock (Yes, I blog)
4. Re:Thanks! by tsstahl · 2010-07-22 08:41 · Score: 2, Insightful
  
  Virtual machine?
5. Re:Thanks! by houstonbofh · 2010-07-22 08:45 · Score: 1
  
  Virtual machine?
  Only a solution to "How do I get this running" and not "What is this thing doing?" The lack of source is a bit offputting to me. I will look at it, but I may wait to roll it out.
6. Re:Thanks! by StuartHankins · 2010-07-22 09:18 · Score: 1
  
  Set up a VM; not only can you monitor / limit its communication but it's a cinch to back up. In my environment this is the easiest way to test something also. I use ntop for monitoring and it works ok; it would probably be a good fit in this case.
7. Re:Thanks! by interval1066 · 2010-07-22 09:26 · Score: 1
  
  "Only a solution to "How do I get this running" and not "What is this thing doing?" The lack of source is a bit offputting to me. I will look at it, but I may wait to roll it out.
  I would tend to agree, only because I'm extremely paranoid when it comes to security; I'd do some site analysis and make sure unexpected connections to foreign hosts aren't going out over the wire. If they were I'd want to do some code analysis to see what exactly is going on. Or if I wanted to add some customization: extremely important. But; in a pinch it sounds like a really worthwhile solution.
  
  --
  Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
8. Re:Thanks! by TooMuchToDo · 2010-07-22 09:43 · Score: 3, Insightful
  
  Looks like Slashdot needs a moderation "+1 Thank You!" option.
9. Re:Thanks! by Peach+Rings · 2010-07-22 13:57 · Score: 1
  
  It's all GPL so there has to be source somewhere. Their site says
  
  The source code of the standard packages on the CD are available from their respective original providers (for example on the FTP servers at Debian). Special components such as the WatchOCR program and scripts are available on the CD.
  so it's probably on the disk.
10. Re:Thanks! by oldhack · 2010-07-22 19:09 · Score: 1
  
  Mustn't disturb the delicate balance of the (slashdot) universe - pair it with -1 (or +1) "Fuck You!" option.
  
  --
  Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
11. Re:Thanks! by anonymous+cupboard · 2010-07-22 23:24 · Score: 1
  
  Many systems are better dedicated to a single problem, i.e., just because you have a server doesn't mean to say that you have to serve everything. VMs are a great solution to this allowing you to partition up your server so that each service that you provide runs in its own little virtual box without having to worry so much about unwanted interactions.
12. Re:Thanks! by alabandit · 2010-07-23 00:27 · Score: 1
  
  +1 Thank You!
  
  --
  "You are still innocent until proven guilty. What's changed is what they do to innocent people." by notnAP (846325)
13. Re:Thanks! by MikeBabcock · 2010-07-23 07:38 · Score: 1
  
  Their reply is inept and as bad as I'd expected.
  They want me to wade through a fully functional CD to find their various customized configuration files and figure out which bits they changed.
  Load Knoppix disc, load OCR disc, do full diff on mounted contents? Not my idea of fun.
  They can post an actual "here's what we did" or I'll just ignore it.
  
  --
  - Michael T. Babcock (Yes, I blog)
14. Re:Thanks! by MikeBabcock · 2010-07-26 07:18 · Score: 1
  
  If I have a working server that already hosts PDF content, why would I want to virtualize this one when I could integrate its features into the existing one?
  That said, the one server per service concept is a mentality I do not subscribe to.
  
  --
  - Michael T. Babcock (Yes, I blog)
15. Re:Thanks! by anonymous+cupboard · 2010-07-28 00:48 · Score: 1
  
  That said, the one server per service concept is a mentality I do not subscribe to.
  This is where Microsoft came apart. Due to their pricing model, there was always pressure to stick as much on one box as possible. This in turn led to interesting side effects.
  Linux always made it easier to have many boxes, which tended to simplify problems. VMs meant you no longer had to worry about physical machines and you can still limit resources - useful if the OCR turns out to be a CPU pig.
16. Re:Thanks! by r3naissance · 2010-07-30 02:27 · Score: 1
  
  Another "thanks"; I have been looking for something along these lines on a personal basis. Looking forward to checking it out.
Wait a sec by inKubus · 2010-07-22 07:30 · Score: 5, Funny

There's something wrong with this Slashvertisement--it's for a free product!

--
Cool! Amazing Toys.
Thanks for the info... by TiggertheMad · 2010-07-22 07:30 · Score: 2

Wow, very cool. I have been looking around for something similar myself.

While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?

--

HA! I just wasted some of your bandwidth with a frivolous sig!
1. Re:Thanks for the info... by It's+the+tripnaut! · 2010-07-22 11:18 · Score: 2, Informative
  
  While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?
  I've tried quite a few free and proprietary OCR's and the best available right now, imho, is ABBYY Finereader. Other than fonts, it also easily recognizes tables, diagrams and illustrations. But most of all, it can read and render 189 languages (including Chinese and Cyrillic) accurately. A free trial version is available.
2. Re:Thanks for the info... by datakid23 · 2010-07-22 14:56 · Score: 1
  
  It's true. I teach Translation Studies and one of the main pieces of software that's needed is OCR. I use OOo, Poedit, Lokalize, Jubler and OmegaT in my class, I teach Creative Common's Licensing, I promote sites like http://pootle.locamotion.org/ and http://www.transifex.net/ I *really* *really* wish I could give my students a best of the bunch free OCR link. But the reality is, ABBYY Finereader is the best that's available. And since it's relatively cheap (compared to some of the translation software like Trados) I don't think it's too onerous. But hot diggity, I wish there was a better FLOSS OCR program.
3. Re:Thanks for the info... by ArundelCastle · 2010-07-22 17:11 · Score: 1
  
  While we are on the topic, anyone seen a good solution to scan, OCR, and reconvert existing crappy pdfs to improve them?
  I think they are called interns. Photoshoop's Content-Aware Fill isn't very good with charts or handwriting.
  ...wait, you actually kept the original document after PDFing? Troglodyte.
added. by jon42689 · 2010-07-22 07:31 · Score: 1

Saw this on facebook. While I don't personally have a need for this, I know that down the line, I'll be glad I knew about it. Good post.
1. Re:added. by b4dc0d3r · 2010-07-22 07:53 · Score: 3, Funny
  
  Saw this on facebook.
  That isn't a good sign, my friend.
2. Re:added. by Fnord666 · 2010-07-23 03:41 · Score: 1
  
  Saw this on facebook.
  
  Please close the door and wash your hands afterwards when partaking in these activities. Thanks.
  
  --
  'The tyrant will always find pretext for his tyranny.' - Aesop's Fables
Run on a VM by ChuckDriver · 2010-07-22 07:32 · Score: 3, Insightful

Ultimately, it would be nice to figure out what script or daemon is running in this and put it on an existing server. In the meantime, I could see just creating a VM for this thing to get started.
1. Re:Run on a VM by xtracto · 2010-07-22 08:01 · Score: 1
  
  Haha, I thought exactly the same. It would be really great if someone could create a VirtualBox, vmware or Qemu "virtual appliance" for this!
  
  --
  Ubuntu is an African word meaning 'I can't configure Debian'
2. Re:Run on a VM by TooMuchToDo · 2010-07-22 09:31 · Score: 2, Interesting
  
  Already on it. Want it as an EC2 AMI? ;)
3. Re:Run on a VM by xtracto · 2010-07-22 18:37 · Score: 1
  
  It would be cool if you really do it, do you expect to publish it somewhere?
  cheers.
  
  --
  Ubuntu is an African word meaning 'I can't configure Debian'
Cool program by Raineer · 2010-07-22 07:34 · Score: 1

I agree with above posters, it's amazing to see a useful Slashvertisement. This one, however, has some quality behind it. I had not seen this program and OCR is one area where it's been difficult to find quality OSS solutions. Thanks for the post.
1. Re:Cool program by inode_buddha · 2010-07-22 08:28 · Score: 1
  
  You would be amazed at how much the people over at Groklaw could use something like this; since most US court documents are recorded as scanned PDF's and TIFF files. I'm saving this link.
  
  --
  C|N>K
ocr by Suicidal+Teapot · 2010-07-22 07:36 · Score: 1

Nice, thanks for sharing. Currently we use Acrobat to OCR scanned documents; it seems to work well but doesn't keep up with our high-speed scanners. Having it automated sounds great. How does the speed/accuracy of WatchOCR compare to commercial products?
1. Re:ocr by IICV · 2010-07-22 07:44 · Score: 1
  
  Who gives a shit? My cheapass "free" workflow for OCR-ing PDF documents on Windows was basically what's described here. With this, all I need is to run a virtual server on my computer! That's significantly better.
2. Re:ocr by 0100010001010011 · 2010-07-22 07:45 · Score: 3, Funny
  
  Now it just needs to incorporate a Recaptcha Lite to improve accuracy.
  Maybe something on the web interface when it doesn't recognize a word you can correct it.
  [Given the success of the Cow Clicker on Facebook, maybe turn it into a facebook game. Tell people they're only allowed to correct words every 6 hours. If they want to correct more words, they'll have to pay for it. Add friends and correct more words to level up!]
3. Re:ocr by wiredlogic · 2010-07-22 13:09 · Score: 1
  
  MODI just leaves you with the text pulled out of context. ExactImage's hocr2pdf can merge the OCR'd text back into the original scanned pages to produce a PDF with searchable text and all the original formatting and images.
  
  --
  I am becoming gerund, destroyer of verbs.
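The CuneiForm-plus-hocr2pdf merge mentioned in this thread can be sketched roughly as below. This is an illustration, not WatchOCR's own script. The flags are the ones I understand the two command-line tools to accept (`cuneiform -f hocr -o ...`, and `hocr2pdf -i ... -o ...` reading hOCR on stdin), so check them against the versions you have installed. The helper names are made up, and which image formats work depends on how your copies were built.

```python
import subprocess
from pathlib import Path


def cuneiform_cmd(image: Path, hocr: Path, lang: str = "eng") -> list[str]:
    """Command that asks CuneiForm for hOCR output (text plus positions)."""
    return ["cuneiform", "-l", lang, "-f", "hocr", "-o", str(hocr), str(image)]


def hocr2pdf_cmd(image: Path, pdf: Path) -> list[str]:
    """Command that overlays hOCR text (read from stdin) on the page image."""
    return ["hocr2pdf", "-i", str(image), "-o", str(pdf)]


def ocr_page(image: Path, pdf: Path) -> None:
    """OCR one page image into a searchable one-page PDF."""
    hocr = pdf.with_suffix(".hocr")
    subprocess.run(cuneiform_cmd(image, hocr), check=True)
    with hocr.open("rb") as fh:
        # hocr2pdf takes the recognized text on stdin and the scan via -i
        subprocess.run(hocr2pdf_cmd(image, pdf), stdin=fh, check=True)
```

A multi-page scanned PDF would first have to be split into page images and the resulting one-page PDFs joined again; those steps are left out here.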
Anyone got error rates? by savanik · 2010-07-22 07:39 · Score: 3, Insightful

I was looking for something like this last year - it looks like this just got released last month, so I don't feel too bad about not finding it.
It looks really interesting, but how accurate is it? I've got some old books that are falling apart I'd like to scan in and textify, but I'd like to know how much time I'm going to have to budget ahead of time fixing problems and proofing.
1. Re:Anyone got error rates? by adavies42 · 2010-07-22 08:17 · Score: 1, Funny
  
  > tesseract-based
  you need 4d software to scan 2d text? trippy....
  
  --
  Media that can be recorded and distributed can be recorded and distributed.
  -kfg
2. Re:Anyone got error rates? by Taxman415a · 2010-07-23 04:06 · Score: 1
  
  This is tesseract without training so the error rates are going to be high. It doesn't say if it is specifically using the development version, but if it's not, there is no layout analysis. That doesn't stop you from doing the scanning, and then do the OCR sometime in the future. Consider Diybookscanner.org for a much faster, cheaper, etc. way to scan your books.
commercial? by Paralizer · 2010-07-22 07:40 · Score: 1

Is there something similar available commercially anyone can recommend? We may end up needing to scan large amounts of pdf's to a shared drive somewhere and need the whole thing to be searchable for keywords, but a requirement for that would be a commercial product that has 24x7 support.
1. Re:commercial? by Anonymous Coward · 2010-07-22 07:44 · Score: 2, Informative
  
  After doing a similar search recently, your two major choices are ABBYY FineReader (they have Enterprise/Server level editions) or OmniReader (again at the Server/Enterprise level). They're priced pretty closely and have pretty well matched features, plus high accuracy. We're in the process of moving from a solution originally based on Adobe Acrobat's built-in OCR, which is okay but not great. Initial testing with ABBYY showed a demonstrably lower error rate on documents from scanned-in legal files.
2. Re:commercial? by ganjadude · 2010-07-22 07:46 · Score: 3, Informative
  
  There is! I happen to work for a company (shameless plug) called DocuWare. It's document management software that does all of that. We are not 24/7: corporate-level support (I am the support) runs 8 AM-8 PM Eastern, M-F. However, we sell through a dealer network that provides support on a contract basis (many Toshiba business solutions dealers are resellers for us, and I know they are 24x7). www.docuware.com
  
  --
  have you seen my sig? there are many others like it but none that are the same
3. Re:commercial? by h4rr4r · 2010-07-22 10:00 · Score: 1
  
  Why?
  You like giving away money?
  I suggest you install this on your own machine, find a quote for a "commercial 24x7" support solution, then tell your boss your company does the same thing for 1/2 the price.
4. Re:commercial? by FelixNZ · 2010-07-22 10:26 · Score: 3, Funny
  
  Sole support staff's user name is 'ganjadude'? I am a little wary :)
5. Re:commercial? by JuliaNZ · 2010-07-22 12:22 · Score: 1
  
  Heh, that'll teach me for reading Slashdot on a new laptop without logging in. The EzeScan comment was mine.
6. Re:commercial? by Taxman415a · 2010-07-23 04:13 · Score: 1
  
  All of the major OCR packages will have a Pro version that will have a drop folder server type setup that will do that. OmniPage and FineReader are the standard options with OmniPage being a little more accurate, but if you want to go budget busting (not kidding even a little) for extreme accuracy go for PrimeOCR. They also seem to have the consulting and support services you may want. But for searchable PDFs you don't actually need extreme accuracy, just moderate depending on the task, so perhaps support options can be found for the other packages that meet your needs.
7. Re:commercial? by ganjadude · 2010-07-29 04:54 · Score: 1
  
  sorry, should have been more clear, I am not the only person in support, we have a very well rounded team
  
  --
  have you seen my sig? there are many others like it but none that are the same
Re:Wait a sec by sumdumass · 2010-07-22 07:42 · Score: 1

I guess that's where
step 3: ????
Comes into play.
Wow. by mj01nir · 2010-07-22 07:43 · Score: 1

If this works well, I have a bunch of use for this. Thank you for the heads-up.

--
the no .sig .sig
Middle ground? by DoofusOfDeath · 2010-07-22 07:48 · Score: 1

Funny, I was just looking for something to do this the other day.
But isn't there some middle ground between (a) making users do complicated setup work, vs. (b) making an entire OS out of it?
How about just making a tarball or Ubuntu/Debian/RPM package that installs and sensibly configures those two tools?
Re:Wait a sec by ushering05401 · 2010-07-22 07:51 · Score: 4, Insightful

Seriously, I'm conflicted. I'm not any sort of web search guru, but it looks like that site just got put up. Is submitter an early adopter (v0.2) or a social engineer?
Error rate. by stimpleton · 2010-07-22 07:56 · Score: 1

I settled on an expensive proprietary solution some months ago at work (I am the IT guy, Dishwasher, and Business...something) to do our org's scan and OCR. Admittedly it's end to end, including the scanner as well. But it was $15K and does a good job.

I did searches online (a dozen hours) and they all funneled back to "FOSS less good, proprietary for best results."

I am afraid to look at this one, because I did make the final decision with pressure from the General Manager.

I dunno what Google uses actually, but their in-house solution (on Google Code) would *not* produce good results. No. 1 in the FOSS tests, but like 6th (by miles) in proprietary comparisons.

--

In post Patriot Act America, the library books scan you.
1. Re:Error rate. by Monkey-Man2000 · 2010-07-22 08:13 · Score: 1
  
  Well, since this apparently was just put online recently you probably made the right decision. But since there seems to be so much interest in this thread, hopefully it may become polished rather quickly for future needs.
  
  --
  This post was generated by a Cadre of Uber Monkeys for Monkey-Man2000 (603495).
2. Re:Error rate. by StuartHankins · 2010-07-22 09:31 · Score: 1
  
  In my role it's always better to be aware and try it out than pretend it doesn't exist. You can't thoroughly research every solution in advance every time. We call it due diligence even after the fact because you might find a better way of doing something and it's always good to have options. If nothing else it may also give you some negotiating room with the proprietary vendor at renewal time.
Stupid by Archangel+Michael · 2010-07-22 08:04 · Score: 2, Insightful

Most, if not ALL of the documents being scanned into PDF format, are generated on computers already, so why go through the whole OCR process, and not get the actual document from the original source in a PDF version that is already text searchable?
THIS is exactly the problem with document management and processing today! Doing things the hard way because we can't be bothered changing processes that will save tons of money, be more effective, and accurate.
I know people who type a document in WORD and then print it to the Copier/scanner/fax device, go pick up the document, put it on the document scanner, scans it to email (PDF) and sends it that way.
SERIOUSLY???

--
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
1. Re:Stupid by Big+Boss · 2010-07-22 08:13 · Score: 1
  
  Just about anyone can read a PDF. If you send a MS Word doc, you have to wonder what version of Word the other person has. And these days, Macs are popular enough that they might not have Word at all! PDF works, and works for everyone. It would be far simpler to print to PDF, but not everyone has a print driver that can do that. ODF is supposed to fix that, but it probably won't.
2. Re:Stupid by ChronoFish · 2010-07-22 08:40 · Score: 1
  
  You clearly missed the point.
  
  -CF
3. Re:Stupid by Archangel+Michael · 2010-07-22 08:56 · Score: 1
  
  Print to PDF, ever heard of that?
  OpenOffice Export to PDF, ever heard of that?
  Acrobat Professional, ever heard of that?
  How about copy/paste into email?
  There are plenty of alternatives to take your WORD (or whatever) doc and get it into a searchable PDF without scanning the damn thing into a TIFF and then OCRing it back to text later.
  
  --
  Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
4. Re:Stupid by Phoenix+Rising · 2010-07-22 09:06 · Score: 1
  
  You obviously live in a utopia somewhere. Most of the documents I've seen scanned in to document management may have had their origins on a computer, but they've had signatures, comments, and other stuff penned in by hand, and you can't always get the originals sent to you.
  The poster is addressing a real need, as evidenced by the number of comments proclaiming the usefulness of the post.
  
  --
  Let us live so that when we come to die, even the undertaker will be sorry -- Mark Twain
5. Re:Stupid by Bert64 · 2010-07-22 09:52 · Score: 1
  
  The "print to pdf" function often creates very poor pdf files, a proper pdf export function in the program is a lot better...
  Get a relatively complex document and compare the output from the native pdf export of openoffice and printing to a pdf file.
  
  --
  http://spamdecoy.net - free throwaway anonymous email - avoid spam!
6. Re:Stupid by Archangel+Michael · 2010-07-22 10:00 · Score: 1
  
  All of the "exceptions" you listed (signatures, comments penned by hand) are NOT OCRed, making it text searchable as needed by the ORIGINAL concept.
  Changing processes would solve the need to OCR documents that already exist as searchable text elsewhere. EVEN if you have need to document signatures and other hand written notes.
  It is a real need (searchable text), I never said that it wasn't. I'm just quibbling over the process to attain the goal.
  
  --
  Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
7. Re:Stupid by salesgeek · 2010-07-22 10:37 · Score: 1
  
  What is true for OpenOffice may not, and probably is not true for all applications. CorelDraw, yes, I'm looking at you. That purple is supposed to be blue.
  
  --
  -- $G
8. Re:Stupid by 200_success · 2010-07-22 13:19 · Score: 1
  
  ... and fax machines still exist because some processes require signatures.
9. Re:Stupid by TheRaven64 · 2010-07-22 22:18 · Score: 1
  
  No matter how bad the print-to-PDF function is, it is not going to be worse than the print-to-paper-then-scan-and-OCR function.
  
  --
  I am TheRaven on Soylent News
10. Re:Stupid by brasselv · 2010-07-30 09:37 · Score: 1
  
  That is 100% correct, in theory.
  In the real world, however, process changes in any large organization tend to be slow, expensive, and messy.
  It's often a NECESSITY to look for incremental optimizations and workarounds.
  I saw a situation very similar to the one you are describing. A large organization with 1000+ points of sale, having to snailmail paper documents (i.e. contracts) to the headquarters - everyday.
  Such contracts were produced/printed with a proprietary software solution, in which Cobol (!) code was a major part, plus a bunch of different databases, etc etc.
  The type of thing: "Don't touch ANYTHING or it falls apart...". Despite this, for their needs, it was working pretty ok - wasn’t changed in maybe 15 years or more.
  Add to this: (1) old-ish /poorly trained staff, used to the very same UI since forever (2) the necessity to file the paper version anyways, since it carried signatures .
  What do you do? Sure, there are valid arguments to change it all, and start from scratch - web UI / cloud / VPN / insert-your-buzzword.
  But at the end of the day, if you take the total cost, this would easily be a multimillion investment. And cause a lot of glitches in the business. And almost 100% piss off the staff ("I liked the old one better! It was nice and green! And what about this new mouse thingy?")
  While you figure out the ramifications of a total overhaul, in the meanwhile you may still want to file the contracts as PDFs, instead of simply throwing them in a warehouse. And, the "meanwhile" could last years...
  
  --
  "Whenever people agree with me I always feel I must be wrong." (Oscar Wilde)
Re:Okular by timothyb89 · 2010-07-22 08:16 · Score: 1

Most normal PDF readers (incl. Okular) only work when the actual text is included in the PDF to begin with. When the source isn't computer-generated but scanned in, there's only image data to work with (no text). Actual OCR is pretty much the only choice in this case...
Unsuccessful download? by Kiralan · 2010-07-22 08:54 · Score: 1

I have tried twice to download it, and it 'finishes' at about 150mb both times, while the file size on their web page shows over 600mb. As a double-check, (suspecting a file size reporting error on their page), it fails MD5 sum as well. Has anyone successfully downloaded it?

--
V for Vendetta: People should not be afraid of their governments. Governments should be afraid of their people.
1. Re:Unsuccessful download? by MathiasRav · 2010-07-22 09:09 · Score: 1
  
  Did you try wget? See what error it reports, or try with wget --continue (shorthand -c).
2. Re:Unsuccessful download? by Kiralan · 2010-07-22 09:11 · Score: 1
  
  I had the 'short' download error using the link on the home page, which leads to an FTP-like directory page. I am trying the link in the forums, and it appears to be working. Thanks for the advice, though!
  
  --
  V for Vendetta: People should not be afraid of their governments. Governments should be afraid of their people.
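For anyone else hitting the short download, the MD5 check mentioned above is easy to script before you burn or boot the ISO. A minimal sketch follows. The helper names are made up, and the expected checksum is whatever the download page publishes, passed in by the caller.

```python
import hashlib
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_ok(path: Path, expected_md5: str) -> bool:
    """True if the file's MD5 matches the published checksum."""
    return md5sum(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat even for a 600 MB image. If the check fails on a truncated file, `wget --continue` as suggested above is the thing to try before starting over.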
VirtualBox as the middle ground by daboochmeister · 2010-07-22 08:58 · Score: 1

I understand what you're saying, but installing the distro in a VM isn't much extra resource/work over a tarball. Plug in your preferred virtualization solution, of course, they all support exporting directories.

--
"Ahh! I see you're in that indeterminate Schrodinger state where - oh, uh ... never mind." Dave Bucci
1. Re:VirtualBox as the middle ground by TheRaven64 · 2010-07-22 22:16 · Score: 1
  
  He's not smoking anything, he's just inhaling the cloud.
  
  --
  I am TheRaven on Soylent News
Re:Stupid ... maybe by Anonymous Coward · 2010-07-22 09:00 · Score: 1, Insightful

Not everyone wanting to do this does in fact have access to the electronic source. I know I would like to try it for some my old crumbling books, as someone else mentioned above, no longer in print (or otherwise only available in DRM-encumbered ebook formats that I cannot read on Linux or Windows Mobile).
RO
Re:better alternatives to pdftohtml by petermgreen · 2010-07-22 09:15 · Score: 2, Informative

Afaict the original structure was already gone when the pdf was made, you can only try to reverse-engineer it from the drawing objects.
You might want to try converting to postscript using ghostscript and then converting to svg using pstoedit. You still won't have the original structure but at least you should have the table shape as a vector drawing rather than a bitmap.

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
Tesseract OCR by TheSync · 2010-07-22 09:35 · Score: 2, Informative

I found tesseract to work very well to do OCR tasks. Doesn't generate PDF though.
1. Re:Tesseract OCR by PGC · 2010-07-22 20:50 · Score: 1
  
  Agreed.
  
  --
  The Dutch will inherit the earth. If not, we'll settle for a bit of ocean. Beta delenda est!
Anyone have a mirror or torrent? by rsborg · 2010-07-22 10:21 · Score: 1

I can't seem to be able to download this file, it keeps giving up after a couple of hundred megs... probably slashdotted.

--
Make sure everyone's vote counts: Verified Voting
Microsoft Files by imscarr · 2010-07-22 11:41 · Score: 1

Can you go one step further with this and get it to read the text (only) out of Microsoft formatted files? Maybe it could even read words out of Word files, Powerpoint, etc.

--
Like the beaver, it's just Dam one thing after another
Re:Ask your vendors by Dynedain · 2010-07-22 12:17 · Score: 1

Quick search result although I think we have slightly older black and white versions.

--
I'm out of my mind right now, but feel free to leave a message.....
Let me fix that for you: by skids · 2010-07-22 12:32 · Score: 1

>> Oh yeah, and nobody knows what exactly it does.
Oh yeah, and nobody knows what exactly it does with access to all your sensitive documents.

--
Someone had to do it.
PDF/OCR software by emaname · 2010-07-22 16:09 · Score: 1

I haven't tried it yet, but this looks promising. It isn't free, but it also doesn't seem as pricey as Adobe.

Qoppa Software [ http://www.qoppa.com/index.html ]

--
An effective "democracy" creates the illusion the people have a say in their government.
exactimage + cuneiform by seyyah · 2010-07-22 18:28 · Score: 1

I wrote a bash script a few months back which, in a little over 130 lines (it has a few command line options), can convert any old PDF to a text searchable PDF. I really wonder whether a distro is a bit overkill for this? But it is such an important tool to have that I commend the authors for making it available... I just wish they'd put up the actual script that they used so I could compare it to my own!
1. Re:exactimage + cuneiform by kilf · 2010-07-22 23:41 · Score: 2, Insightful
  
  I'd love to see your script, if you want to make it available.
2. Re:exactimage + cuneiform by tosszyx · 2010-07-23 04:29 · Score: 1
  
  Is it similar to the one presented in this tutorial ?
3. Re:exactimage + cuneiform by seyyah · 2010-07-23 08:03 · Score: 1
  
  Where shall I send it?
4. Re:exactimage + cuneiform by seyyah · 2010-07-23 08:08 · Score: 1
  
  That's where I first read about exactimage and how it could be used for OCR. But his script has strange dependencies and produces large PDFs as output. Mine produces smaller PDFs and has dependencies suitable for my personal setup. YMMV!
5. Re:exactimage + cuneiform by tosszyx · 2010-07-23 21:42 · Score: 1
  
  Well, it sounds very interesting, as some have commented already, can we take a look at it ?
6. Re:exactimage + cuneiform by Jorophose · 2010-07-24 14:32 · Score: 1
  
  can you put it on pastebin maybe?
7. Re:exactimage + cuneiform by seyyah · 2010-07-25 04:39 · Score: 1
  
  Feel free to make suggestions: pdfocr.
8. Re:exactimage + cuneiform by kilf · 2010-07-25 07:01 · Score: 1
  
  You could attach it to an email and send to kilf@graffiti.net - cheers!
9. Re:exactimage + cuneiform by seyyah · 2010-07-27 23:35 · Score: 1
  
  Note: don't use cuneiform 0.9 or 1.0.
Why on server? by Zumbs · 2010-07-22 21:16 · Score: 1

I must be missing something. Why would you want OCR on a server and not as part of the program that interfaces with the scanner?

--
The truth may be out there, but lies are inside your head
1. Re:Why on server? by joke_dst · 2010-07-22 23:37 · Score: 1
  
  Because you want an internal infrastructure that allows you to replace the scanner easily. Those things break down, or more importantly get replaced with faster new scanners.
  One of my clients has a scanner farm that scans documents, then the images are sent to an OCR server farm. It's way easier to replace any part of it that way (we're actually trying out a new OCR suite right now so I'll test this one).
  Even if you just have one scanner and do OCR on the same machine, breaking it up this way makes it easier for you in the future.
2. Re:Why on server? by Anonymous Coward · 2010-07-23 02:55 · Score: 1, Insightful
  
  If you are running a high speed scanner that scans 100ppm/200ipm, the computer would not be able to OCR the pages fast enough to keep up with the scanner throughput. Since you are paying good money for that scanner (and the operator running it), you want to get every possible image through that scanner per day. The OCR can be done after the fact on a server that only needs to be periodically monitored by IT.
3. Re:Why on server? by stickystyle · 2010-07-23 03:33 · Score: 1
  
  Because not all scanners are directly plugged into user computers with a software interface.
  I have a large office MFP that scans into network shares and with this little server running I can have it watch that share and fix up my PDF's real nice for the users, rather than installing something on everyone's computers.
  
  --
  Pluralitas non est ponenda sine neccesitate
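To put rough numbers on the throughput argument in this thread: a scanner producing 200 images per minute leaves 0.3 seconds per image if OCR has to keep pace. The sketch below estimates how far behind an inline OCR step would fall. The 2-seconds-per-image figure in the usage note is an assumption picked for illustration, not a measurement of any product.

```python
def ocr_backlog(scan_ipm: float, ocr_seconds_per_image: float, minutes: float) -> float:
    """Images still waiting for OCR after `minutes` of continuous scanning."""
    scanned = scan_ipm * minutes
    processed = (60.0 / ocr_seconds_per_image) * minutes
    return max(0.0, scanned - processed)
```

With the assumed 2 seconds per image, `ocr_backlog(200, 2.0, 480)` gives 81600 images queued after an eight-hour shift, which is why pushing the OCR onto a separate server that can grind through the night makes sense.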
Re:Wait a sec by somersault · 2010-07-23 01:34 · Score: 1

Version 0.2 has been out for at least a month by the looks of their forum, and version numbers are a very imprecise way of telling how useful the software is for your needs, or even how stable it is. What's wrong with being an "early adopter" if it's the only working and free solution to your problem?

--
which is totally what she said
Ocropus with gscan2pdf by danhs7 · 2010-07-23 05:07 · Score: 1
I use gscan2pdf for my Linux desktop. I find it's incredibly simple and convenient.
It just does *everything* I need. It takes scans from scanner, it processes it with OCR, it allows me to delete or insert pages.....it's just very simple and does the job well.

For OCR, gscan2pdf works with 4 OCR programs currently:
- GOCR
- Tesseract
- Ocropus
- Cuneiform
Ocropus is developed with funding/support from Google. It uses tesseract as a backend to do a lot of the work. In simple terms, Ocropus is awesome. I find it does a stellar job at OCR. It's absolutely open source and great software.