Multi-page PDF To Multi-page TIFF and Archiving?
GeorgeMonroy writes "One of my clients has aperture cards that they have been scanning into multi-page PDF files — but now they want them in multi-page TIFFs instead. One of the reasons they gave for this is that TIFF files require less storage space. While that is true, I wonder if TIFF is the best format going into the future. Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway. Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one? If another file format is better than TIFF, then are there any programs for batch processing that you can recommend?"
If they're images, then you should use TIFF (or perhaps PNG). However, it doesn't make sense for them to be "multi-page." If they're documents, then PDF is appropriate.
I would suggest that your client doesn't know WTF they want.
But size does not have anything to do with it. TIFF is far simpler in structure than PDF and therefore has better compatibility. TIFF is also well documented. Of course, they would have to use raw TIFF to get the advantages. The storage-space argument is secondary and matters only insofar as larger data sets have a higher risk of corruption.
I dispute the "well documented" claim. The TIFF standard is quite clear. Unfortunately, almost nobody adheres precisely to the standard. I work extensively with TIFF and PDF, and I have to say that the consistency I see in PDF is about 100 times more than what I see in TIFF. Your typical TIFF reader will contain thousands of hacks and workarounds for oddities that are produced by major players in the industry. While there is slightly non-compliant PDF, I have never seen things that even begin approaching the strangeness I see in TIFF on a daily basis. Having said that, I recommend TIFF plus search text metadata for archival, not PDF.
I'd have to agree with that - I keep bumping into nasty combinations of old-style JPEG in TIFF images combined with Wang annotations and highlighting - the choice of viewers that will cope with both of those at once is pretty limited.
I was not recommending Microsoft's TIFF+Text, but rather some easily maintained combination of TIFF+text. If that is Microsoft's extension, then that's what it is. I see no loss of functionality due to the storage format not being vector-based. An archival format is going to be looked at on the screen, searched, and possibly printed. There is no benefit to vector graphics except perhaps a size argument, which is obviated by technologies like JBIG2 (if they'd ever get off their butts and formalize the JBIG2-in-TIFF spec).
The only argument I've ever heard for vector graphics is that "When I zoom way, way, WAY in, it still looks all smooth and pretty." Now what the hell are you trying to look at a document from a distance of 0.0002 inches for? It's a non-complaint.
TIFF, on the other hand, is a simple raster format with enormously wide support. You don't have to worry about how one rasterizer is going to look different from another -- every pixel is precisely defined. The document will appear exactly as it was intended to. Consider the difference in codebase size between a simple TIFF reader and a full-blown vector rendering engine. There is enormous complexity with no benefit.
This code will not work on files with spaces in their names, because the output of the back-ticks is split into separate words at every space, while the shell globbing used in the original code keeps each filename intact as a single word. Since most PDFs I've seen use spaces in their names, this is important. The rest of those modifications will help, though.
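To make the difference concrete, here is a minimal sketch. It is not the code being replied to; it is a stand-in that creates one scratch file with a made-up name ("my report.pdf") and counts how many times each style of loop runs.

```shell
#!/bin/sh
# Demonstration only: one file whose name contains a space,
# created in a throwaway directory (name is hypothetical).
dir=$(mktemp -d)
: > "$dir/my report.pdf"

# Back-ticks: the command output is word-split, so one name becomes two words.
split_count=0
for f in `ls "$dir"`; do
    split_count=$((split_count + 1))
done

# Globbing: each matching filename stays a single word.
glob_count=0
for f in "$dir"/*.pdf; do
    glob_count=$((glob_count + 1))
done

echo "back-ticks: $split_count iteration(s), glob: $glob_count iteration(s)"
rm -r "$dir"
```

With the default IFS this should report two iterations for the back-tick loop ("my" and "report.pdf") and one for the glob loop, which is why the glob version is the safe one.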
Were I coding this script, I'd write two (since xargs can't use a function): one for the inner loop, and one for the outer loop (or, if only running it once, I'd do the outer loop on the command line).
$ cat pdf2tiff.sh
#!/bin/sh
# convert specified files from PDF to TIFF
for file in "$@"; do
target="${file%.*}.tiff"
if [ -r "$file" ] && [ ! -s "$target" ]
then echo "gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile='$target' '$file'"
gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile="$target" "$file" 2>/dev/null
else
echo "'$file' not readable or '$target' exists, skipping"
fi
done
$ cat pdf2tiff-recursive.sh
#!/bin/sh
# convert all PDFs to TIFF format within supplied directory tree (default: $PWD)
find "${1:-.}" -name '*.pdf' -print0 | xargs -0
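One detail worth checking when building the target name: find prints paths with a leading "./", so it matters whether the suffix is stripped at the first dot or the last one. A small sketch of the two expansions, using a made-up path rather than anything from the client's archive:

```shell
#!/bin/sh
# Compare the two suffix-stripping expansions on a path as find prints it.
# The path below is a hypothetical example.
file="./scans/card.0042.pdf"

first_dot="${file%%.*}.tiff"   # %% removes the longest match of ".*"
last_dot="${file%.*}.tiff"     # %  removes the shortest match of ".*"

echo "%%.* gives: $first_dot"
echo "%.*  gives: $last_dot"
```

The `%%` form eats everything from the leading dot onward and leaves just ".tiff", so every file in the tree would map to the same target. The `%` form only drops the final ".pdf" and gives "./scans/card.0042.tiff".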