Slashdot Mirror


Multi-page PDF To Multi-page TIFF and Archiving?

GeorgeMonroy writes "One of my clients has aperture cards that they have been scanning into multi-page PDF files — but now they want them in multi-page TIFFs instead. One of the reasons they gave for this is that TIFF files require less storage space. While that is true, I wonder if TIFF is the best format going into the future. Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway. Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one? If another file format is better than TIFF, then are there any programs for batch processing that you can recommend?"

23 of 125 comments

  1. Ghostscript by tomtomtom777 · · Score: 2, Informative

    Ghostscript can do the conversion from the console.

    You can write a simple shell script to convert all files.

  2. DjVu by cduffy · · Score: 3, Informative

    The most effective compressors are commercial, but DjVu is a very effective image archival format; see DjVuLibre for the non-commercial tree.

    Moving back towards the question in the article, I don't think there's much worry about either TIFF or PDF in terms of future proofing; they're both very widely used, have multiple implementations and third parties with substantial interest in keeping those implementations maintained, etc. The quality of TIFF implementations varies wildly, but the good ones are only going to get better, and I'd be shocked if libtiff ended up terminally bitrotted without a successor implementing a superset of its functionality inside my lifetime.

  3. Re:Are these things images or documents? by pclminion · · Score: 4, Informative

    If they're images, then you should use TIFF (or perhaps PNG). However, it doesn't make sense for them to be "multi-page." If they're documents, then PDF is appropriate.

    Multi-page TIFF is well supported in the industry. There is nothing "weird" about it. It even supports embedded, searchable text (a Microsoft addition, but something that actually adds value). PDF archival can be difficult to do correctly. At the very least you want to use a product which supports PDF/A, followed up with some serious validation to make sure the results are actually compliant. Otherwise you may get bitten decades down the road. Searchable TIFF, on the other hand, will be around for freaking ever.

  4. Don't do this. But if you insist, here's how. by Cheesey · · Score: 5, Informative

    Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway.

    Don't use TIFF. Stay with PDF. PDF is what all the big digital libraries are using. It's a proper standard, it's readable and writable by lots of free open source software, so even if Adobe disappears in a puff of intellectual property, you'll still be able to read your documents.

    TIFF, on the other hand, is a container format (like AVI, but worse). It isn't fully supported by every program - what sort of TIFF do you want, anyway? Compressed with LZW? With RLE? Not compressed at all? There's free software that will read and write the most common types of TIFF, so you can certainly do it, but why give up the convenience of using PDF?

    Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one?

    Use ghostscript. Use something like the following command line:

    gs -dNOPAUSE -sDEVICE=tiffgray -sOutputFile=output%02d.tiff -dBATCH -r300 input.pdf
    This turns input.pdf into a series of 300 dpi tiff files, one for each page, called output01.tiff, output02.tiff, etc. Change the DEVICE to get a different sort of tiff file, and use gs --help to get a list of options. You can easily wrap this command in a script of almost any sort to make the process fully automatic.
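
    For one multi-page TIFF per PDF, which is what the question asks for, here is a minimal sketch. It assumes Ghostscript's TIFF devices write every page into a single file when the output name has no %d placeholder, and that the scans are pure black and white, so the Group 4 fax device (tiffg4) is a reasonable fit. The helper name pdf_to_multipage_tiff and the DRY_RUN switch are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert one PDF into one multi-page TIFF.
# Assumption: with no %d in -sOutputFile, Ghostscript's TIFF devices
# put all pages into a single file.
# Set DRY_RUN=1 to print the command instead of running it.
pdf_to_multipage_tiff() {
    output="${1%.pdf}.tiff"   # strip only the trailing .pdf
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 -sOutputFile=$output $1"
    else
        gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 -sOutputFile="$output" "$1"
    fi
}

# Usage: pdf_to_multipage_tiff input.pdf   (writes input.tiff)
```

    For greyscale scans, swapping tiffg4 for tiffgray as in the command above should work the same way, at the cost of larger files.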
    --
    >north
    You're an immobile computer, remember?
  5. Comment removed by account_deleted · · Score: 2, Informative

    Comment removed based on user account deletion

  6. Re:Are these things images or documents? by BadMrMojo · · Score: 3, Informative

    ...it actually supports a huge number of subformats (and different compression algorithms) in addition.

    TIFF: Thousands of Image File Formats

    If you do wind up converting to TIFF, then remember to document everything in excruciating detail. With thousands of possible combinations - each of which is a perfectly valid TIFF image - you may encounter some issues if someone's using a less robust reader that assumes the wrong compression algorithm, byte order, data striping or photometric interpretation.

  7. PDFs... by pdboddy · · Score: 2, Informative

    Stick with PDF. Chances are, neither PDF nor TIFF will vanish overnight. I'd say PDF is easier to work with, even with minimalist free tools. Since either one is technically "good" for archiving, why do more work than you really need to? Even with batch processing it'd be a pain.

    Acrobat has batch processing, and can convert pdfs to TIFF, JPEG, PNG and more. That would be my suggestion if you are really going to convert to TIFFs.

    --
    Julie Moult is an idiot.
    1. Re:PDFs... by Eric+Smith · · Score: 2, Informative

      If you have 5 MB PDF files that convert to 1 MB TIFF files, that means that the PDF files were encoded badly. There's no fundamental reason for PDF files to be significantly larger than TIFF files.

  8. pdf2tiff.sh by Anonymous Coward · · Score: 4, Informative

    let's not reinvent the wheel -- I did this about 9 months ago //wolfmann -- and this code is Public domain (done on federal gov't time):

    # cat pdf2tiff.sh
    #!/bin/bash

    for file in */*.pdf #for each pdf
    do
                    filename=$(echo "$file" | cut -d'.' -f1)
                    if [ ! -e "$filename".tiff ]
                    then
                                    # log the same command that is actually run below
                                    echo "gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg3 -sOutputFile=$filename.tiff $file"
                                    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg3 -sOutputFile="$filename".tiff "$file" 2> /dev/null
                    else
                                    echo "$filename.tiff exists! skipping..."
                    fi
    done

    1. Re:pdf2tiff.sh by Anonymous Coward · · Score: 2, Informative

      Better method:
      #!/bin/bash
      for file in `find . -name '*.pdf'`
      do
      filename=${file%.*}
      ...
      This will recurse into all subdirectories, and it will work on files with multiple dots in the name (e.g. applicant.12345.cv.pdf)

    2. Re:pdf2tiff.sh by Anonymous Coward · · Score: 1, Informative

      Well, if we want to correct others, -iname is the better option, but who cares, as long as we get modded informative for the basics...
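
      Pulling the suggestions in this sub-thread together, a rough sketch of the recursive walk could look like the following. It only prints which TIFF each PDF would map to, so it is safe to try before wiring in gs. The function name list_conversions is hypothetical, and it assumes a find that supports -iname (GNU and BSD find both do) and file names without embedded newlines.

```shell
#!/bin/sh
# Hypothetical helper: print "source -> target" for every PDF under a directory.
# -iname matches .pdf and .PDF alike; reading line by line keeps names with
# spaces intact, which a for-loop over the output of find would split.
list_conversions() {
    find "$1" -type f -iname '*.pdf' | sort | while IFS= read -r file
    do
        # ${file%.*} strips only the last extension, so names with
        # several dots keep everything before the final one.
        printf '%s -> %s\n' "$file" "${file%.*}.tiff"
    done
}

# Usage: list_conversions .
```

      Replacing the printf line with the gs call from the parent script would turn the listing into the actual conversion.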

  9. It doesn't matter. Neither is great. by moosesocks · · Score: 3, Informative

    How *much* smaller are these TIFFs anyway? TIFF is actually a container format, and can support all sorts of compression, some of them proprietary, some of them common. Not all of them are lossless either (TIFF-Jpeg is a perfectly valid combination, and was used before the days of Exif to add metadata to jpegs). TIFFs can also include vectorized data. It's not all that much less complicated than a PDF.

    PDFs are also a container format to an extent. You could very well have a TIFF embedded in a PDF. Fortunately for us, the PDF specification is a bit more stringent on what is supported and what isn't, and PDFs tend to work just about everywhere (especially if all that you've got is an image). You can also apply all sorts of compression to PDFs to reduce their file size... these might not be quite as well supported.

    Both formats are extremely common, and it's extremely unlikely that you'll ever have to do any sort of conversion to display them. If I had to place money on it, I'd wager that PDF will be in widespread use for longer than TIFF, though neither format seems to be going anywhere anytime soon. You're more likely to have to worry about the storage devices you're using and the longevity of the media.

    If you just need to store lossless raster images, PNG might be a good bet. It's a "Free" format, and is officially endorsed as an ISO/IEC standard. TIFF is copyrighted by Adobe. PNG also has the advantage of being a complete image format, rather than just a container, which means that any software that can open a PNG image should be able to open *any* PNG. Because of its open-sourceness and widespread adoption, PNG will be around for a long time to come as well. Once again, the storage medium and filesystem that you use to store the images is very likely going to become obsolete before the file format itself.

    Granted, PNG's compression algorithm isn't optimized for photographic data, though the image formats that *are* optimized for this purpose are neither common nor free.

    In summary, there's no reason that a PDF needs to be terribly larger than a TIFF (the overhead should be especially negligible if you've got lots of images at a high resolution). Neither format is going away anytime soon, but both have quirks that can hurt you in the future (multi-page TIFFs are even somewhat of an oddity today). If you really want small files and future-proofing, go with PNG. Otherwise, it's more or less a non-issue.

    --
    -- If you try to fail and succeed, which have you done? - Uli's moose
  10. Re:Are these things images or documents? by commanderfoxtrot · · Score: 2, Informative

    Both PDF and TIFF handle multiple pages, and have done so for years.

    Either would be suitable for this application.

    If you really want to convert these, Imagemagick would be the best tool to use.

    However, it seems a little daft for storage space to be the main reason for changing: you're simply exchanging one compressed image format for another. You may save 10%, e.g. if you move from JPEG in the PDF to PNG/similar in the TIFF, but is that really worth the effort?!

    If they really want shiny TIFFs, it would be easy to have an Imagemagick script to convert single PDFs to TIFF on-demand.

    --
    http://blog.grcm.net/
  11. Re:Some Solutions by ZERO1ZERO · · Score: 2, Informative

    ImageMagick is slow, slow, slow for multipage TIFFs. On Windows, creating and splitting multipage G4 TIFFs is 20 (TWENTY) times faster with tifftools.

  12. TIFF better defined than PDF? Nuts. by Odder · · Score: 0, Informative

    I've seen these multipage TIFFs you are talking about and hate them. They invariably require some kind of non-free viewer that sucks next to any free or non-free PDF viewer. TIFF compression itself has lots of vendor-specific schemes and you would be better off with PDF or PNG as alternatives. If you have not been bitten by a TIFF format, you have not been using TIFF long enough. If you add text to your TIFF, the file will probably grow to the same size as a PDF of the same document, but you trade the scalable vector nature of PostScript for a bitmap.

    I don't understand why anyone would move from tried and true pdf systems for a TIFF system unless they want to lose text search for employees, regulators and the public. Handing people image based pdf instead of text files or normal pdf is a standard practice for the Bush administration that borders on criminal obstruction of justice.

  13. Re:Are these things images or documents? by MBGMorden · · Score: 4, Informative

    Multi-page TIFF is well supported in the industry. Better supported than PDF in some cases. Our records management (in addition to keeping electronic scanned copies) still insists on having a microfilm copy of all of our retained documents. We can send digital copies to a processing company to have them processed, but they don't accept PDF documents - only TIFFs (multi-page is acceptable). Given that our internal document management is all in PDF, I ended up having to find a program to convert all of that information about a year ago (though the name of the program we ended up using escapes me - I wouldn't recommend it anyway, since it crashed for me very frequently).
    --
    "People who think they know everything are very annoying to those of us who do."-Mark Twain
  14. Re:Are these things images or documents? by cduffy · · Score: 3, Informative

    Bah. Use libtiff, document that only readers which also use libtiff (>= the version you're using) are supported, and you're done.

  15. Shell script by debatem1 · · Score: 2, Informative

    Honest to God, what you're talking about is a trivial task. Use ghostscript, or, if you don't have the time or interest, contact me with your requirements and I'll write it for you gratis, provided it remains F/OSS.

  16. Re:Another reason to do go tiff by Wilden2003 · · Score: 2, Informative

    You are so right about Adobe Acrobat.

    For a stand-alone free PDF viewer take a look at Foxit.

    http://www.foxitsoftware.com/pdf/rd_intro.php

    It's fast and small, and does the job.

  17. Re:Are these things images or documents? by mrcaseyj · · Score: 2, Informative

    I agree that unless the files are extremely huge or extremely numerous, storage space probably shouldn't be a concern, because it's cheap compared to your time and getting cheaper. But if storage space is a concern then you might look into the TIFF format used by the patent office. Apparently it uses a form of lossless compression taken from fax machines and gets much better compression than many other common formats on black and white (no greyscale) documents. If it's the patent office's standard archive format then it will probably be supported for a long time. PDF can probably use the same compression as well though. I was going to mod up someone below http://ask.slashdot.org/comments.pl?sid=593693&no_d2=1&cid=23920133 who recommended PDF/A as the only archive format to use.

    I was also going to mod up someone who recommended http://ask.slashdot.org/comments.pl?sid=593693&no_d2=1&cid=23921141 that if you use tiff just make sure it can be read by libtiff and you should have no worries about future readability.

  18. Why bother? by MarkCollette · · Score: 2, Informative

    As someone who writes software to view PDFs, I can tell you this is completely pointless, since anything that saves scanned documents into PDF is really storing it as a TIFF image inside of the PDF anyway. The PDF container adds useful features for metadata, and is well documented, so shouldn't add any future-proof issues. And the overhead is probably a few kilobytes.

  19. Re:Astroturf? by willyhill · · Score: 3, Informative

    Note that "Odder" and "freenix" are the same person.

    --
    The twitter monologues. Click on my homepage and be amazed.
  20. Max file size favors PDF by klossner · · Score: 2, Informative

    TIFF files have a maximum size of 4GB. (The "value offset" field of an IFD entry is a 32-bit value.) You can exceed this with 50 noisy pages. PDF files have a maximum size of 10 to the tenth power bytes. (The byte offset in a cross-reference table entry is a ten-digit decimal number.) That's 2.3 times the maximum TIFF file size.
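
    The 2.3 figure can be checked with plain integer shell arithmetic. This is only a sanity check of the two limits quoted above; the variable names are made up for illustration, and the ratio is scaled by ten because shell arithmetic has no fractions.

```shell
#!/bin/sh
# Limits as stated in the comment above.
tiff_max=$((1 << 32))        # 32-bit offset field: 4294967296 bytes
pdf_max=10000000000          # ten-digit decimal offset: 10^10 bytes

# Ratio scaled by 10 so one decimal place survives integer division.
ratio_x10=$((pdf_max * 10 / tiff_max))
echo "$((ratio_x10 / 10)).$((ratio_x10 % 10))"   # prints 2.3
```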

    I have written software to create both TIFF and PDF files. I would use PDF for archiving. Even today, it's tricky to find a TIFF reader that will run on all the important platforms and handle the variety of compression flavors (e.g., JBIG2.)