Multi-page PDF To Multi-page TIFF and Archiving?
GeorgeMonroy writes "One of my clients has aperture cards that they have been scanning into multi-page PDF files — but now they want them in multi-page TIFFs instead. One of the reasons they gave for this is that TIFF files require less storage space. While that is true, I wonder if TIFF is the best format going into the future. Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway. Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one? If another file format is better than TIFF, then are there any programs for batch processing that you can recommend?"
The most effective compressors are commercial, but DjVu is a very effective image archival format; see DjVuLibre for the non-commercial tree.
Moving back towards the question in the article, I don't think there's much worry about either TIFF or PDF in terms of future proofing; they're both very widely used, have multiple implementations and third parties with substantial interest in keeping those implementations maintained, etc. The quality of TIFF implementations varies wildly, but the good ones are only going to get better, and I'd be shocked if libtiff ended up terminally bitrotted without a successor implementing a superset of its functionality inside my lifetime.
If they're images, then you should use TIFF (or perhaps PNG). However, it doesn't make sense for them to be "multi-page." If they're documents, then PDF is appropriate.
Multi-page TIFF is well supported in the industry. There is nothing "weird" about it. It even supports embedded, searchable text (a Microsoft addition, but something that actually adds value). PDF archival can be difficult to do correctly. At the very least you want to use a product which supports PDF/A, followed up with some serious validation to make sure the results are actually compliant. Otherwise you may get bitten decades down the road. Searchable TIFF, on the other hand, will be around for freaking ever.
Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway.
Don't use TIFF. Stay with PDF. PDF is what all the big digital libraries are using. It's a proper standard, it's readable and writable by lots of free open source software, so even if Adobe disappears in a puff of intellectual property, you'll still be able to read your documents.
TIFF, on the other hand, is a container format (like AVI, but worse). It isn't fully supported by every program - what sort of TIFF do you want, anyway? Compressed with LZW? With RLE? Not compressed at all? There's free software that will read and write the most common types of TIFF, so you can certainly do it, but why give up the convenience of using PDF?
Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one?
Use ghostscript. Use something like the following command line:
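A command along these lines fits that description. Treat it as a sketch rather than the exact line from the post: the tiffg4 device, the helper name pdf_to_tiffs, and its arguments are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: render every page of a PDF to its own 300 dpi TIFF with Ghostscript.
# pdf_to_tiffs is a hypothetical helper name; tiffg4 (bilevel, CCITT Group 4)
# is an assumed default, not necessarily the device the post had in mind.
pdf_to_tiffs() {
    input=$1                # e.g. input.pdf
    prefix=$2               # e.g. output -> output01.tiff, output02.tiff, ...
    device=${3:-tiffg4}     # other Ghostscript devices: tifflzw, tiffpack, tiffgray, tiff24nc
    # %02d in OutputFile makes Ghostscript write one numbered file per page;
    # without it the tiff devices write a single multi-page TIFF.
    gs -q -dNOPAUSE -dBATCH -sDEVICE="$device" -r300 \
       -sOutputFile="${prefix}%02d.tiff" "$input"
}

# Usage:
#   pdf_to_tiffs input.pdf output
#   pdf_to_tiffs input.pdf output tiffgray
```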
This turns input.pdf into a series of 300 dpi TIFF files, one for each page, called output01.tiff, output02.tiff, etc. Change the DEVICE to get a different sort of TIFF file, and use gs --help to get a list of options. You can easily wrap this command in a script of almost any sort to make the process fully automatic.
>north
You're an immobile computer, remember?
TIFF: Thousands of Image File Formats
If you do wind up converting to TIFF, then remember to document everything in excruciating detail. With thousands of possible combinations, each of which is a perfectly valid TIFF image, you may run into trouble if someone is using a less robust reader that assumes the wrong compression algorithm, byte order, data striping or photometric interpretation.
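One way to capture those details is a plain-text sidecar per file. The sketch below assumes libtiff's tiffinfo tool is installed; the function names byte_order and describe_tiffs and the .info.txt suffix are made up for illustration. The byte order is read straight from the first two bytes of the file, which are "II" (little-endian) or "MM" (big-endian) in any TIFF.

```shell
#!/bin/sh
# Sketch: record the properties of each TIFF in a sidecar text file.
# byte_order and describe_tiffs are hypothetical helper names.

# A TIFF starts with "II" (little-endian) or "MM" (big-endian).
byte_order() {
    case $(dd if="$1" bs=2 count=1 2>/dev/null) in
        II) echo little-endian ;;
        MM) echo big-endian ;;
        *)  echo unknown ;;
    esac
}

describe_tiffs() {
    for tiff in "$@"; do
        [ -f "$tiff" ] || continue
        {
            echo "file: $tiff"
            echo "byte order: $(byte_order "$tiff")"
            # tiffinfo reports compression scheme, photometric interpretation,
            # rows per strip and similar tags for each page in the file.
            tiffinfo "$tiff" 2>&1
        } > "$tiff.info.txt"
    done
}

# Usage:
#   describe_tiffs */*.tiff
```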
Let's not reinvent the wheel -- I did this about 9 months ago //wolfmann -- and this code is public domain (done on federal gov't time):
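# cat pdf2tiff.sh
#!/bin/bash
# Convert each PDF one directory down into a multi-page TIFF beside it.
device=tiffg4    # CCITT Group 4; usually smaller than tiffg3 for bilevel scans
for file in */*.pdf    # for each PDF
do
    [ -e "$file" ] || continue    # glob matched nothing
    filename="${file%.pdf}"       # strip only the .pdf suffix
    if [ ! -e "$filename.tiff" ]
    then
        echo "gs -q -dNOPAUSE -dBATCH -sDEVICE=$device -sOutputFile=$filename.tiff $file"
        gs -q -dNOPAUSE -dBATCH -sDEVICE="$device" -sOutputFile="$filename.tiff" "$file" 2>/dev/null
    else
        echo "$filename.tiff exists! skipping..."
    fi
done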
How *much* smaller are these TIFFs anyway? TIFF is actually a container format, and can support all sorts of compression, some of them proprietary, some of them common. Not all of them are lossless either (TIFF-Jpeg is a perfectly valid combination, and was used before the days of Exif to add metadata to jpegs). TIFFs can also include vectorized data. It's not all that much less complicated than a PDF.
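The size question is easy to answer by measuring a sample that has been converted both ways. This is a minimal sketch; total_bytes is a hypothetical helper, and the glob patterns in the usage lines assume the PDFs and TIFFs sit one directory down.

```shell
#!/bin/sh
# Sketch: add up the sizes of a set of files so two formats can be compared.
# total_bytes is a hypothetical helper name.
total_bytes() {
    total=0
    for f in "$@"; do
        [ -f "$f" ] || continue      # skip unmatched globs and directories
        size=$(wc -c < "$f")         # byte count of this file
        total=$((total + size))
    done
    echo "$total"
}

# Usage:
#   echo "PDF:  $(total_bytes */*.pdf) bytes"
#   echo "TIFF: $(total_bytes */*.tiff) bytes"
```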
PDFs are also a container format to an extent. You could very well have a TIFF embedded in a PDF. Fortunately for us, the PDF specification is a bit more stringent on what is supported and what isn't, and PDFs tend to work just about everywhere (especially if all that you've got is an image). You can also apply all sorts of compression to PDFs to reduce their file size... these might not be quite as well supported.
Both formats are extremely common, and it's extremely unlikely that you'll ever have to do any sort of conversion to display them. If I had to place money on it, I'd wager that PDF will be in widespread use for longer than TIFF, though neither format seems to be going anywhere anytime soon. You're more likely to have to worry about the storage devices you're using and the longevity of the media.
If you just need to store lossless raster images, PNG might be a good bet. It's a "Free" format, and is officially endorsed as an ISO/IEC standard. TIFF is copyrighted by Adobe. PNG also has the advantage of being a complete image format, rather than just a container, which means that any software that can open a PNG image should be able to open *any* PNG. Because of its open-sourceness and widespread adoption, PNG will be around for a long time to come as well. Once again, the storage medium and filesystem that you use to store the images is very likely going to become obsolete before the file format itself.
Granted, PNG's compression algorithm isn't optimized for photographic data, though the image formats that *are* optimized for this purpose are neither common nor free.
In summary, there's no reason that a PDF needs to be terribly larger than a TIFF (the overhead should be especially negligible if you've got lots of images at a high resolution). Neither format is going away anytime soon, but both have quirks that can hurt you in the future (multi-page TIFFs are even somewhat of an oddity today). If you really want small files and future-proofing, go with PNG. Otherwise, it's more or less a non-issue.
-- If you try to fail and succeed, which have you done? - Uli's moose
"People who think they know everything are very annoying to those of us who do."-Mark Twain
Bah. Use libtiff, document that only readers which also use libtiff (>= the version you're using) are supported, and you're done.
Note that "Odder" and "freenix" are the same person.
The twitter monologues. Click on my homepage and be amazed.