Slashdot Mirror


From Paper To PDF?

Spoing dropped this bit of informative info into the bin: "Last week, a friend of mine griped that he didn't know of an easy way -- short of getting Adobe Capture and paying per-use licence fees -- of creating searchable PDFs. I scoffed, and told him I've done it many times, and it was free -- as in beer and speech. Dumbfounded, he pushed me to show him how, and I did; print to a Postscript file, and run ps2pdf on it...done! Since every document could be output as Postscript, his problem was solved. If he wanted to batch process the documents, he could set up a few scripts to simplify the task. While he was impressed, he ended up asking what seemed like an easy question; 'Can you do the same with a scanned image?'" And therein lies the question...

"After a week of on/off searching, I did find some good references as well as nearly all the parts necessary for the job, including open source OCR engines, PDF and Postscript tools, search engines, and the like.

Unfortunately, I came up with only two solutions -- neither of them Open Source, and both quite costly (premium beer); Adobe Capture or dedicated "PDF scanners" like this one.

My question to the Slashdot crowd is this:

  1. Is there a cost-effective way of moving existing dead-tree documents into either HTML, PDF, or other searchable mixed text and graphics format?

We all deal with a mix of electronic and printed documents -- and if you're like me, you've paid for some of them in both formats.

If you're like me, you buy new documents in electronic, searchable, format when you can. How many of us have O'Reilly's Networking Bookshelf, or some other CD texts ready to search on our notebooks and networks?

Yet, I have a four foot wide stack of technical documents and books that just isn't going to come with me on each plane trip. I'm not going to get rid of them -- they are still valuable -- but I can't figure out how to make them useful more often.

The available tools for capturing paper and converting it into searchable PDFs are costly, and are geared toward corporations that can justify the costs by the number of users. To me, a per-use licence of Adobe's Capture --

  1. Adobe Capture - Prices

    Adobe Capture - Features

-- is just not cost effective.

If the document is already a text document -- even if it's in some word processor I don't use -- generating PDF files is easy and cheap;

Print a document to a Postscript file, or create one. For example a simple text document is trivial;

  1. enscript file.txt -p file.ps

Convert the resulting Postscript file to PDF;

  1. ps2pdf file.ps file.pdf
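
The two steps above are easy to wrap in the kind of batch script mentioned earlier. What follows is only a minimal sketch in Python, assuming `enscript` and `ps2pdf` are installed and on the PATH. The function names `text_to_pdf_commands` and `convert_all` are made up for illustration.

```python
import subprocess
from pathlib import Path


def text_to_pdf_commands(txt_path):
    """Return the enscript and ps2pdf command lines for one text file."""
    txt = Path(txt_path)
    ps = txt.with_suffix(".ps")
    pdf = txt.with_suffix(".pdf")
    return [
        ["enscript", "-p", str(ps), str(txt)],  # text -> Postscript
        ["ps2pdf", str(ps), str(pdf)],          # Postscript -> PDF
    ]


def convert_all(directory, dry_run=False):
    """Convert every .txt file in a directory; return the commands used.

    With dry_run=True nothing is executed, which is handy for checking
    what the script would do before letting it loose on a big pile.
    """
    commands = []
    for txt in sorted(Path(directory).glob("*.txt")):
        for cmd in text_to_pdf_commands(txt):
            commands.append(cmd)
            if not dry_run:
                subprocess.run(cmd, check=True)
    return commands
```

Calling `convert_all("docs", dry_run=True)` lists the commands without running them; drop `dry_run` to do the conversion. The intermediate .ps files are left in place, so clean them up afterwards if you don't want them.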

Converting a paper document to PDF is also easy. Just scan the image and use tiff2ps or jpeg2ps to create the Postscript file. The only problem is that the resulting PDF is a bitmap image and isn't searchable.
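
For the scanned case, the same idea can be sketched as below. This assumes the scan is a TIFF file and that libtiff's `tiff2ps` and Ghostscript's `ps2pdf` are installed. The function name `scan_to_pdf_commands` is hypothetical, and the result is still a picture of the page, not searchable text.

```python
import subprocess
from pathlib import Path


def scan_to_pdf_commands(tiff_path):
    """Return the tiff2ps and ps2pdf command lines for one scanned TIFF."""
    tiff = Path(tiff_path)
    ps = tiff.with_suffix(".ps")
    pdf = tiff.with_suffix(".pdf")
    return [
        # -a converts every page in a multi-page TIFF,
        # -O writes to a file instead of standard output.
        ["tiff2ps", "-a", "-O", str(ps), str(tiff)],
        ["ps2pdf", str(ps), str(pdf)],
    ]


def scan_to_pdf(tiff_path):
    """Run both steps; raises CalledProcessError if either tool fails."""
    for cmd in scan_to_pdf_commands(tiff_path):
        subprocess.run(cmd, check=True)
```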

Interestingly enough, TIFF -- a format used extensively for scanned documents -- does support TIFF+Text, but usually as an extension to TIFF, and it isn't really an optimal format; The Unofficial TIFF Home Page.

So, if you want to search the documents and keep the formatting and diagrams, you're back to paying Adobe for Capture or some other nearly as expensive method."

13 of 188 comments

  1. PDF, Ugh. by Juggle · · Score: 3

    I learned my lesson about researching and testing what I offer before selling it to clients thanks to PDF. I knew that PHP was capable of generating PDF's so I went ahead and accepted a job to create a website which would automagically generate PDF resumes for the visitors. What I then found out was that PHP could only generate PDF's if you bought one of two pricy libraries which actually do the PDF work.

    I ended up searching for three days (and submitting an ask /. which was discarded) before I found a set of OS (free as in beer and speech) perl libraries for generating PDF's. But oh what a pain. I ended up designing a sample resume in QuarkXpress then using a pica ruler on the printout to convert it to something I could generate. But after about two weeks of hacking I had a resume generator which spits out very clean professional looking resumes in HTML and PDF for anyone who's willing to register on the site and fill out a few simple forms. Client was happy and I tucked another language into my cap. (Since the libraries I found pretty much required you to know PostScript).

    Moral of story: test the technology before selling to a client. And trying to generate PDF's on the cheap is only for those who have way more time than money!

    --
    --- Juggle juggle@hitesman.com
  2. Re:Is it legal to convert PostScript to PDF? by Azog · · Score: 3

    The patent on gif is not the gif file format per se, but the compression algorithm.



    --
    Torrey Hoffman (Azog)
    "HTML needs a rant tag" - Alan Cox
  3. PDF XML by 1010011010 · · Score: 3

    We've about finished a tool that will do PDF to XML conversions, and back again. It also sports a native API to allow the creation of documents from scratch. It allows embedding of truetype fonts. It runs on Linux and Windows NT.

    It'll be out in the next week or so; check Freshmeat.

    The idea behind it is, create a nice layout template in the tool of your choice -- Illustrator, for example. Save as PDF. Convert to XML. Add your markup to it -- extra text, etc., convert back to PDF. Done!

    Release 1.5 will include a "template" feature, whereby you can use pages from existing PDFs as templates directly; something along these lines (pseudocode):


    p = new pdf();
    t = new pdftemplate("foo.pdf");

    p.newpage("8.5","11");
    p.include_from_template(t.page(1));
    p.drawstring("Hi!");

    p.write("bar.pdf");


    Does this type of tool sound interesting to anyone?

    On a related note, we plan to offer it as both open source and a commercial product. For instance, the ActiveX interface would be commercial. You could negotiate a commercial license. And you can use it under something like the Aladdin license (a la ghostscript, pdflib, etc). Any advice on open source + commercial? I have to justify my department's budget.


    --
    Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
  4. Other ways... by antdude · · Score: 3

    I asked a friend about this and he said, "no, but the answer is yes, there are other ways....use other OCR engines, like Omnipage Pro or TextBridge Pro. Adobe Capture 3.0 is really really really nice, but is expensive. The searchability factor is the only reason OCRing is needed in most instances."

    Some useful sites:
    PDF Research
    Planet PDF
    AcroBuddies
    Codecuts
    PDF Zone
    Adobe
    Deja.com

    --
    Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
  5. I did that in two hours... by Greyfox · · Score: 3
    Easy solution:

    1) Write LaTeX resume style class. Mine's pretty primitive because it only has to deal with my resume.

    2) Create resume using resume style.

    3) pdflatex resume.tex.

    Or...

    3) latex2html resume.tex (Though latex2html doesn't really generate it to look the way I need it, but it is just a simple perl program so you could always hack it.)

    Nice thing about LaTeX is you can also go to XML or DVI or RTF or a number of other fairly widely used formats. Or you could just ship the raw LaTeX if the company you're dealing with is that clueful.
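
For a web site that has to build resumes from form input, the LaTeX route can be scripted. Below is a minimal sketch in Python that fills a bare-bones template. It uses the stock `article` class instead of a custom resume class, and the names `escape_latex` and `render_resume` are invented for illustration.

```python
# Characters LaTeX treats specially, mapped to safe replacements.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape user input one character at a time (no double escaping)."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def render_resume(name, jobs):
    """Return LaTeX source for a very plain resume."""
    lines = [
        r"\documentclass{article}",
        r"\begin{document}",
        r"\section*{" + escape_latex(name) + "}",
    ]
    if jobs:  # an itemize environment with no items is a LaTeX error
        lines.append(r"\begin{itemize}")
        for job in jobs:
            lines.append(r"\item " + escape_latex(job))
        lines.append(r"\end{itemize}")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"
```

Write the returned string to resume.tex and run pdflatex on it as in step 3. Escaping matters here because the text comes from strangers filling out a form.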

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  6. the OCR situation is not good by Jamie+Zawinski · · Score: 4

    Last year, I tried several Linux-based OCR packages, and they basically didn't work at all.

    I ended up using the Windows software that came with my scanner to OCR the documents, and at first glance it appeared to do a good job -- it didn't mess up too often. But then I went in and actually proofread and spell-checked its output to find all the typos it had made, and it turns out that this process was so time-consuming that it was faster for me to just type it all in by hand. Even though the OCR software only made a mistake every few lines, finding those mistakes took enough concentration that typing the whole thing took less time.

    Your mileage may vary, according to how fast you can type.

  7. embed TIFF images in the PDF by jetson123 · · Score: 4
    Many Adobe-converted scanned pages seem to be just a sequence of TIFF images with the OCR'ed text also contained in the PDF file. The OCR'ed text is never displayed, but can be used for searching (in my experience, Adobe's OCR is not very good).

    So, a simple conversion would consist of just putting the scanned TIFF images in sequence into a PDF file.
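
The image-plus-hidden-text idea can be modeled without touching PDF at all. The sketch below is not a PDF writer. It only illustrates the principle that the reader sees the scan while the search runs over the OCR text. `ScannedPage` and `search` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ScannedPage:
    number: int
    image_file: str  # what the reader actually sees
    ocr_text: str    # never displayed, only searched


def search(pages, phrase):
    """Return the numbers of the pages whose OCR text contains the phrase.

    Case and runs of whitespace are ignored, since OCR output tends to
    break lines in odd places.
    """
    needle = " ".join(phrase.lower().split())
    if not needle:
        return []
    hits = []
    for page in pages:
        haystack = " ".join(page.ocr_text.lower().split())
        if needle in haystack:
            hits.append(page.number)
    return hits
```

Since the text is never shown, OCR mistakes only cost you missed search hits, not ugly pages.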

  8. Re:why bother with PDF? by turg · · Score: 4
    I don't know about elsewhere but PDF is essential for dead-tree publishing. The advantage it has over all other formats is not that it displays the same on every screen but that it prints the same on every printer (assuming that the author remembered to embed the @#$! fonts, but that's another story :-)

    With PDF, you can design and lay out your ad and transmit it electronically (or on disk) to the newspaper, knowing that it will print exactly how it did for you. Or you can lay out your brochure and send it off to the printers knowing the same thing. With any other format, the publisher/printer's machine is going to have at least one (oh, if only it were ever just one!) setting different than yours, which will change the layout.

    PDF is the way that print ads are submitted electronically today. It's either PDF or old-fashioned cut-and-paste (no, even more old-fashioned than you're thinking, I mean with actual scissors and glue). The Associated Press runs a "wire service" called AdSend for ad agencies to transmit PDF ads electronically to newspapers and magazines -- and they are transmitting millions of PDF's a year.

    The same thing basically goes for sending anything you want printed to a print shop. In any case, free PDF-making software enables dead-tree publishing the same way that the web enables electronic publishing (though we haven't got any print shops that'll work for free, yet :-)

    ========

    --
    <sig>Guvf vf abg n frperg zrffntr
  9. Missing a step? by sugarman · · Score: 4
    You mentioned OCR software, but didn't go much further with it. Wouldn't this be the solution you need?

    Scan to OCR to PS to PDF

    There are apparently a couple of tools to do this for you. Check out a brief list here

    Seeing as you've looked into Adobe Capture, Windows may be an option. If so, then the other question would be whether you've looked into Textbridge? This looks like it would do exactly what you're asking. No muss, little fuss.

    --
    --sugarman--
  10. The age old question by underwhelm · · Score: 4

    I am asked to do this all the time as a computer services employee of Kinkos.

    The short answer is using OCR to create a text file, proof reading the text file, and then printing to a postscript file.

    The long answer is, you need to find quality OCR software that does not choke on things like forms. You also *MUST* proof read every OCRd document. No OCR is perfect, and drawn elements will almost certainly trip the software into embedding odd characters or pipes into your text. Different font sizes will cause the software to choke. Thin fonts will cause the software to choke.
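
A script cannot replace the proofreading, but it can point the proofreader at the likeliest trouble spots. Here is a rough sketch in Python. The list of suspect characters and the mixed letter/digit rule are guesses to be tuned for your documents, and `flag_suspect_lines` is a made-up name.

```python
import re

# Characters that are rare in ordinary prose but often appear when an
# OCR engine misreads rules, boxes, or drawings (an assumption; tune it).
SUSPECT_CHARS = re.compile(r"[|~^\\{}<>_]")

# Words that mix letters and digits, such as "l0ok" or "c1ear".
MIXED_WORD = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def flag_suspect_lines(text):
    """Return (line_number, line) pairs that deserve a closer look."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if SUSPECT_CHARS.search(line) or MIXED_WORD.search(line):
            flagged.append((number, line))
    return flagged
```

Expect false alarms on technical text (a word like "ps2pdf" trips the mixed-word rule), and expect it to miss errors that happen to be real words. It narrows down where to look first; it does not excuse you from reading the rest.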

    If you are OCRing forms, I recommend Omni Form (it's the only software I know of that recognizes forms, but I have never used it personally).

    Batch processing of OCR pages is likely easy to set up with professional OCR software (Omni Page does it), but it does not excuse you from proofreading the results. After that, the PDF part is a snap, and can be accomplished with any OCR software you choose to use.

    If you are asking which OCR software is best, I can't help you directly. OCR software is a niche software market, and you either get free, disappointing software with your scanner, or you pay big money for something that does a decent job. Just like everything else in life. Have you read any OCR software reviews?

    --

    I don't need large brains to have a good time.

  11. A former intern... by heliocentric · · Score: 4

    Speaking as a former intern under a guy who wanted all these meeting minutes from the early 80s on put on the web, I know what you are asking for. I knew HTML and simple coding then, and was only being asked to translate them to HTML. What I did was OCR a ton of the text, only to reduce the keystrokes (it's much easier to drink coffee while swapping pages in a scanner every few seconds than it is typing all day), then I spell checked them as an initial step and formatted them by hand. Then, when I moved on to the next ton and they were in the scanner bed, I would check the grammar of those I did in the first batch.

    So, I ended up being the cheap labor to get the stuff together, but I incorporated the error checking suggested by the other replies, and I utilized OCR to minimize carpal tunnel damage.

    Yeah, it took a while, and yes I got paid little in comparison to the other people at the location, but I got paid, they got their silly meeting minutes online, and they didn't have to hire 1,000 monkeys with 1,000 type-writers and have redundancy of people or invest in vast warehouses of paper feeders.

    The scale of my work: I worked on a series of bound volumes that took up 3+ feet on a bookshelf and I completed the work on my own in less than 2 weeks (while also fielding tech support questions from the group). If you have 1,000,000 pages to be put online yesterday, maybe you could use a larger staff - but always remember:

    If it takes a farmer 3 days to plow a field, and 3 farmers only a day to plow the same field, and it takes one woman 9 months to have a baby, how many months does it take 9 women to have one baby?

    Often putting more people on a project doesn't equate to faster solutions or better ones and usually not cheaper ones.

    --
    Wheeeee
  12. Adobe Acrobat 4.0 by cetan · · Score: 5

    You don't need to spend all that money for Adobe Capture 3.0 when you can buy Adobe Acrobat 4.0. This is NOT the Adobe reader, but the full version of Adobe Acrobat with all the bells and whistles. A url is: http://www.adobe.com/store/products/acrobat.html.

    In addition, you can also buy the Adobe Acrobat Business Tools, which is a slightly broken but still functional version of Acrobat 4.0. That is available here: http://www.adobe.com/store/products/acrbustools.html.

    --
    In Soviet Russia...michael would be rotting in Siberia!
  13. Save money on OCR by sacrificing quality by AnonymousHero · · Score: 5
    Ahh... mass-OCR cost-effectiveness... it takes me back...

    I just used an off-the-shelf OCR engine and hacked the text together with the images programmatically myself. We would get TIFF images, which most engines could understand.

    On really, really big OCR jobs, though, the real problem is the tradeoff between human intervention and quality. See, OCR engines just guess at stuff. The only reason they work at all is that they guess well. But they guess wrong anywhere from 0.1% to 10% of the time, depending on the quality of the input.
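
To get a feel for what those percentages mean on a page, here is a back-of-the-envelope sketch. It assumes the error rate applies per character and that a page holds about 2,000 characters. Both figures are assumptions for illustration, and `expected_errors_per_page` is a made-up name.

```python
def expected_errors_per_page(chars_per_page, error_rate):
    """Rough count of wrong characters on a page.

    error_rate is a fraction, so 0.001 means 0.1% of guesses are wrong.
    """
    return round(chars_per_page * error_rate)


# Assumed page size: about 2,000 characters of text.
best_case = expected_errors_per_page(2000, 0.001)   # 2 errors per page
worst_case = expected_errors_per_page(2000, 0.10)   # 200 errors per page
```

Under those assumptions even the best case leaves a couple of mistakes on every page, and somebody has to find them.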

    Each mistake must be corrected by a human being. But humans are expensive. If you have lots of documents to OCR, the technology integration costs and the cost of the OCR engines themselves are amortized. They end up dwarfed by the paychecks of the humans.

    The cost of massive amounts of OCR, therefore, is directly related to the amount of human correction of OCR mistakes.

    Thus, you can save tons of money by selectively sacrificing OCR quality. Getting every page perfectly formatted requires around 60 seconds a page for a skilled OCR operator. It's all about reducing that time. How? Simple. Don't expect everything to be perfect. There are various levels of quality you can get out of OCR engines-human systems:

    • no correction: just let 'er run. You can get it fully automated this way, but the quality is crap.
    • zoning only: The OCR engines just suck at text with multiple columns, inserts, and tables. You can get people to correct the engine's zoning at a clip of around 5 seconds a page, 10 seconds if you require them to put in tokens representing the excised images.
    • spelling correction: Typically, most people object to the spelling mistakes OCR introduces. With good quality text an operator can correct them at around 20-30 seconds a page.
    • formatting correction: OCR engines can really mess up indentation and text flow. Unfortunately this is the most time consuming problem to fix, anywhere from 30 seconds to a couple of minutes per page.
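
The per-page times in the list above turn into operator hours quickly. Here is a rough sketch of the arithmetic in Python. "A couple of minutes" is taken as 120 seconds, each level is treated on its own, and the hourly wage is whatever you plug in; those are all assumptions, as are the names `operator_hours` and `labor_cost`.

```python
# (low, high) seconds of operator time per page for each quality level.
SECONDS_PER_PAGE = {
    "none": (0, 0),           # fully automated, no correction
    "zoning": (5, 10),        # 10 if image tokens are required
    "spelling": (20, 30),
    "formatting": (30, 120),  # "a couple of minutes" assumed to be 120 s
}


def operator_hours(pages, level):
    """Return (low, high) operator hours for a job at one quality level."""
    low, high = SECONDS_PER_PAGE[level]
    return (pages * low / 3600, pages * high / 3600)


def labor_cost(pages, level, hourly_wage):
    """Return the (low, high) labor cost for the same job."""
    low, high = operator_hours(pages, level)
    return (low * hourly_wage, high * hourly_wage)
```

For a 3,600-page job, `operator_hours(3600, "zoning")` gives 5 to 10 hours, while `operator_hours(3600, "formatting")` gives 30 to 120 hours. That gap is the money you save by cutting quality corners.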

    Oh, and it really helps if you get the workflow of the OCR down. Allow the operator to move on to the next document automatically, save them the trouble of remembering the name of the document they're working on, etc. etc. This may require a bit of hacking of the OCR engine you're using, but it's worth it.

    So when doing something like this, ask yourself: how perfect does it have to be, really? You can save tons of money if you can cut any quality corners.