Slashdot Mirror


From Paper To PDF?

Spoing dropped this bit of info into the bin: "Last week, a friend of mine griped that he didn't know of an easy way -- short of getting Adobe Capture and paying per-use licence fees -- of creating searchable PDFs. I scoffed, and told him I've done it many times, and it was free -- as in beer and speech. Dumbfounded, he pushed me to show him how, and I did; print to a Postscript file, and run ps2pdf on it...done! Since every document could be output as Postscript, his problem was solved. If he wanted to batch process the documents, he could set up a few scripts to simplify the task. While he was impressed, he ended up asking what seemed like an easy question; 'Can you do the same with a scanned image?'" And therein lies the question...

"After a week of on/off searching, I did find some good references as well as nearly all the parts necessary for the job, including open source OCR engines, PDF and Postscript tools, search engines, and the like.

Unfortunately, I came up with only two solutions -- neither of them Open Source, and both quite costly (premium beer): Adobe Capture, or dedicated "PDF scanners".

My question to the Slashdot crowd is this:

  1. Is there a cost-effective way of moving existing dead-tree documents into either HTML, PDF, or other searchable mixed text and graphics format?

We all deal with a mix of electronic and printed documents -- and if you're like me, you've paid for some of them in both formats.

If you're like me, you buy new documents in an electronic, searchable format when you can. How many of us have O'Reilly's Networking Bookshelf, or some other CD texts ready to search on our notebooks and networks?

Yet, I have a four foot wide stack of technical documents and books that just isn't going to come with me on each plane trip. I'm not going to get rid of them -- they are still valuable -- but I can't figure out how to make them useful more often.

The available tools for capturing paper and converting it into searchable PDFs are costly, and are geared toward corporations that can justify the costs by the number of users. To me, a per-use licence of Adobe's Capture --

  1. Adobe Capture - Prices

    Adobe Capture - Features

-- is just not cost effective.

If the document is already a text document -- even if it's in some word processor I don't use -- generating PDF files is easy and cheap:

Print the document to a Postscript file, or create one directly. For a simple text document, for example, this is trivial:

  1. enscript file.txt -p file.ps

Convert the resulting Postscript file to PDF:

  1. ps2pdf file.ps file.pdf
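
For batch processing, the two commands above can be wrapped in a small shell function. This is only a minimal sketch, assuming enscript and ps2pdf (shipped with Ghostscript) are on the PATH; the name txt2pdf_batch is made up for illustration, and the options are given before the file name, which is equivalent to the order used above.

```shell
# Convert each given text file to PDF via an intermediate Postscript file.
# txt2pdf_batch is a hypothetical helper name, not part of any tool.
txt2pdf_batch() {
    for txt in "$@"; do
        base="${txt%.txt}"                          # strip the .txt suffix
        enscript -p "$base.ps" "$txt" || return 1   # text -> Postscript
        ps2pdf "$base.ps" "$base.pdf" || return 1   # Postscript -> PDF
    done
}
```

Called as `txt2pdf_batch *.txt`, it leaves a .ps and a .pdf next to every text file and stops at the first failure.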

Converting a paper document to PDF is also easy. Just scan the image and use tiff2ps or jpeg2ps to create the Postscript file. The only problem is that the resulting PDF is a bitmap image and isn't searchable.
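
The scanned-image route might look like the sketch below. It assumes tiff2ps (from libtiff) and ps2pdf are installed; scan2pdf is a hypothetical helper name. As noted above, the output is still just a picture of the page, with no text to search.

```shell
# Wrap a scanned TIFF in Postscript, then convert it to PDF.
# scan2pdf is a hypothetical helper name. The result is a page image
# with no text layer, so it will not be searchable.
scan2pdf() {
    tif="$1"
    base="${tif%.*}"                          # drop the file extension
    tiff2ps "$tif" > "$base.ps" || return 1   # tiff2ps writes Postscript to stdout
    ps2pdf "$base.ps" "$base.pdf"             # Postscript -> PDF
}
```

For example, `scan2pdf page001.tif` would produce page001.ps and page001.pdf.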

Interestingly enough, TIFF -- a format used extensively for scanned documents -- does support TIFF+Text, but usually only as an extension to TIFF, and it isn't really an optimal format (see The Unofficial TIFF Home Page).

So, if you want to search the documents and keep the formatting and diagrams, you're back to paying Adobe for Capture or some other nearly as expensive method."

2 of 188 comments

  1. Adobe Acrobat 4.0 by cetan · · Score: 5

    You don't need to spend all that money for Adobe Capture 3.0 when you can buy Adobe Acrobat 4.0. This is NOT the Adobe reader, but the full version of Adobe Acrobat with all the bells and whistles. A URL is: http://www.adobe.com/store/products/acrobat.html.

    In addition, you can also buy the Adobe Acrobat Business Tools, which is a slightly broken but still functional version of Acrobat 4.0. That is available here: http://www.adobe.com/store/products/acrbustools.html.

    --
    In Soviet Russia...michael would be rotting in Siberia!
  2. Save money on OCR by sacrificing quality by AnonymousHero · · Score: 5
    Ahh... mass-OCR cost-effectiveness... it takes me back...

    I just used an off-the-shelf OCR engine and hacked the text together with the images programmatically myself. We would get TIFF images, which most engines could understand.

    On really, really big OCR jobs, though, the real problem is the tradeoff between human intervention and quality. See, OCR engines just guess at stuff. The only reason they work at all is that they guess well. But they guess wrong anywhere from 0.1% to 10% of the time, depending on the quality of the input.
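
    To put rough numbers on that range, here is a small sketch. It assumes the error rate applies per character and that a page holds about 2000 characters; both are assumptions for illustration, not figures from this comment, and expected_errors is a made-up helper name.

```shell
# Expected number of wrong characters on a page:
#   errors = characters_per_page * error_rate_percent / 100
# expected_errors is a hypothetical helper; awk handles the fractional math.
expected_errors() {
    awk -v chars="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", chars * rate / 100 }'
}

expected_errors 2000 0.1    # best case quoted above
expected_errors 2000 10     # worst case quoted above
```

    Under those assumptions the two calls print 2 and 200, which is the difference between a quick touch-up and retyping a good part of the page.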

    Each mistake must be corrected by a human being. But humans are expensive. If you have lots of documents to OCR, the technology integration costs and the cost of the OCR engines themselves are amortized. They end up dwarfed by the paychecks of the humans.

    The cost of massive amounts of OCR, therefore, is directly related to the amount of human correction of OCR mistakes.

    Thus, you can save tons of money by selectively sacrificing OCR quality. Getting every page perfectly formatted requires around 60 seconds a page for a skilled OCR operator. It's all about reducing that time. How? Simple. Don't expect everything to be perfect. There are various levels of quality you can get out of a combined OCR engine and human system:

    • no correction: just let 'er run. You can get it fully automated this way, but the quality is crap.
    • zoning only: The OCR engines just suck at text with multiple columns, inserts, and tables. You can get people to correct the engine's zoning at a clip of around 5 seconds a page, 10 seconds if you require them to put in tokens representing the excised images.
    • spelling correction: Typically, most people object to the spelling mistakes OCR introduces. With good quality text an operator can correct them at around 20-30 seconds a page.
    • formatting correction: OCR engines can really mess up indentation and text flow. Unfortunately this is the most time-consuming problem to fix, anywhere from 30 seconds to a couple of minutes per page.
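
    The per-page times above translate into labor cost roughly as sketched here. The helper name ocr_labor_cost, the 1000-page job, and the hourly rate of 15 are assumptions chosen for illustration; only the seconds per page come from this comment.

```shell
# Labor cost of correcting a scanned document:
#   cost = pages * seconds_per_page / 3600 * hourly_rate
# ocr_labor_cost is a hypothetical helper; the hourly rate is an assumed input.
ocr_labor_cost() {
    awk -v pages="$1" -v secs="$2" -v rate="$3" \
        'BEGIN { printf "%.2f\n", pages * secs / 3600 * rate }'
}

# 1000 pages at an assumed rate of 15 per hour
ocr_labor_cost 1000 5 15     # zoning only
ocr_labor_cost 1000 30 15    # spelling correction, upper figure
ocr_labor_cost 1000 60 15    # perfectly formatted page
```

    With those inputs the three calls print 20.83, 125.00 and 250.00, so dropping from perfect formatting to zoning only cuts the labor bill by more than a factor of ten.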

    Oh, and it really helps if you get the workflow of the OCR down. Allow the operator to move on to the next document automatically, save them the trouble of remembering the name of the document they're working on, etc. etc. This may require a bit of hacking of the OCR engine you're using, but it's worth it.
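
    One way to take the bookkeeping off the operator is a tiny queue helper like this sketch. The name next_document and the ".done" marker convention are hypothetical, and the correction tool itself is left out because it depends on the OCR engine in use.

```shell
# Print the next scanned page that has no ".done" marker beside it.
# next_document and the ".done" convention are hypothetical, for illustration.
next_document() {
    for tif in "$1"/*.tif; do
        [ -e "$tif" ] || continue          # directory holds no .tif files
        [ -e "$tif.done" ] && continue     # already corrected, skip it
        echo "$tif"
        return 0
    done
    return 1                               # nothing left to do
}
```

    A wrapper loop could call `next_document inbox`, hand the result to the correction tool, `touch` the matching .done file when the operator finishes, and repeat until the function returns non-zero.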

    So when doing something like this, ask yourself: how perfect does it have to be, really? You can save tons of money if you can cut any quality corners.