From Paper To PDF?
"After a week of on/off searching, I did find some good references as well as nearly all the parts necessary for the job, including open source OCR engines, PDF and Postscript tools, search engines, and the like.
Unfortunately, I came up with only two solutions -- neither of them Open Source, and most quite costly (premium beer); Adobe Capture or dedicated "PDF scanners" like this one.
My question to the Slashdot crowd is this:
- Is there a cost-effective way of moving existing dead-tree documents into either HTML, PDF, or other searchable mixed text and graphics format?
We all deal with a mix of electronic and printed documents -- and you're like me you've paid for some of them in both formats.
If you're like me, you buy new documents in electronic, searchable, format when you can. How many of us have O'Reilly's Networking Bookshelf, or some other CD texts ready to search on our notebooks and networks?
Yet, I have a four foot wide stack of technical documents and books that just isn't going to come with me on each plane trip. I'm not going to get rid of them -- they are still valuable -- but I can't figure out how to make them useful more often.
The available tools for capturing paper and converting it into searchable PDFs is costly, and is geared toward corporations that can justify the costs by the number of users. To me, a per-use licence of Adobe's Capture --
-- is just not cost effective.
If the document is already a text document -- even if it's in some word processor I don't use -- generating PDF files is easy and cheap;
Print a document to a Postscript file, or create one. For example a simple text document is trivial;
- enscript file.txt -p file.ps
Convert the resulting Postscript file to PDF;
- ps2pdf file.ps file.pdf
Converting a paper document to PDF is also easy. Just scan the image and use tiff2ps or jpeg2ps to create the Post script file. The only problem is that the resulting PDF is a bitmap image and isn't searchable.
Interestingly enough, TIFF -- a format used extensively for scanned documents -- does support TIFF+Text, but usually as an extention to TIFF and isn't really an optimal format; The Unofficial TIFF Home Page.
So, if you want to search the documents and keep the formatting and diagrams, you're back to paying Adobe for Capture or some other nearly as expensive method. "
You don't need to spend all that money for Adobe Capture 3.0 when you can buy Adobe Acrobat 4.0. This is NOT the adobe reader, but the full version of Adobe Acrobat with all the bells and whistles. A url is: http://www.adobe.com/store/product s /acrobat.html.
In addition, you can also buy the Adobe Acrobat Business Tools, which is a slightly broken but still functional version of Acrobat 4.0. That is available here: http://www.adobe.com/store/pro ducts/acrbustools.html.
In Soviet Russia...michael would be rotting in Siberia!
I just used an off-the-shelf OCR engine and hacked the text together with the images programmatically myself. We would get TIFF images, which most engines could understand.
On really, really big OCR jobs, though, the real problem is the tradeoff between human intervention and quality. See, OCR engines just guess at stuff. The only reason they work at all is that they guess well. But they guess wrong anywhere from 0.1% to 10% of the time, depending on the quality of the input.
Each mistake must be correct by a human being. But humans are expensive. If you have lots of documents to OCR, the technology integration costs and the cost of the OCR engines themselves are amortized. They end up dwarfed by the paychecks of the humans.
The cost of massive amounts of OCR, therefore, is directly related to the amount of human correction of OCR mistakes.
Thus, you can save tons of money by selectively sacrificing OCR quality. Getting every page perfectly formatted requires around 60 seconds a page for a skilled OCR operator. It's all about reducing that time. How? Simple. Don't expect everything to be perfect. There are various levels of quality you can get out of OCR engines-human systems:
Oh, and it really helps if you get the workflow of the OCR down. Allow the operator to move on to the next document automatically, save them the trouble of remembering the name of the document they're working on, etc. etc. This may require a bit of hacking of the OCR engine you're using, but it's worth it.
So when doing something like this, ask yourself: how perfect does it have to be, really? You can save tons of money if you can cut any quality corners.