Multi-page PDF To Multi-page TIFF and Archiving?
GeorgeMonroy writes "One of my clients has aperture cards that they have been scanning into multi-page PDF files — but now they want them in multi-page TIFFs instead. One of the reasons they gave for this is that TIFF files require less storage space. While that is true, I wonder if TIFF is the best format going into the future. Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway. Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one? If another file format is better than TIFF, then are there any programs for batch processing that you can recommend?"
Ghostscript can do the conversion from the console.
You can write a simple shell script to convert all files.
The most effective compressors are commercial, but DjVu is a very effective image archival format; see DjVuLibre for the non-commercial tree.
Moving back towards the question in the article, I don't think there's much worry about either TIFF or PDF in terms of future proofing; they're both very widely used, have multiple implementations and third parties with substantial interest in keeping those implementations maintained, etc. The quality of TIFF implementations varies wildly, but the good ones are only going to get better, and I'd be shocked if libtiff ended up terminally bitrotted without a successor implementing a superset of its functionality inside my lifetime.
If they're images, then you should use TIFF (or perhaps PNG). However, it doesn't make sense for them to be "multi-page." If they're documents, then PDF is appropriate.
Multi-page TIFF is well supported in the industry. There is nothing "weird" about it. It even supports embedded, searchable text (a Microsoft addition, but something that actually adds value). PDF archival can be difficult to do correctly. At the very least you want to use a product which supports PDF/A, followed up with some serious validation to make sure the results are actually compliant. Otherwise you may get bitten decades down the road. Searchable TIFF, on the other hand, will be around for freaking ever.
Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway.
Don't use TIFF. Stay with PDF. PDF is what all the big digital libraries are using. It's a proper standard, it's readable and writable by lots of free open source software, so even if Adobe disappears in a puff of intellectual property, you'll still be able to read your documents.
TIFF, on the other hand, is a container format (like AVI, but worse). It isn't fully supported by every program - what sort of TIFF do you want, anyway? Compressed with LZW? With RLE? Not compressed at all? There's free software that will read and write the most common types of TIFF, so you can certainly do it, but why give up the convenience of using PDF?
Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one?
Use ghostscript. Use something like the following command line:
This turns input.pdf into a series of 300 dpi tiff files, one for each page, called output01.tiff, output02.tiff, etc. Change the DEVICE to get a different sort of tiff file, and use gs --help to get a list of options. You can easily wrap this command in a script of almost any sort to make the process fully automatic.
>north
You're an immobile computer, remember?
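A command of the kind described above might look like the following sketch. The exact flags the poster used are not shown in the thread, so the tiffg4 device, the -r300 resolution flag, and the helper name pdf_to_page_tiffs are assumptions chosen to match the description (300 dpi, one numbered TIFF per page). Setting DRY_RUN prints the command instead of running it.

```shell
# Sketch only: split a PDF into one 300 dpi TIFF per page with Ghostscript.
# pdf_to_page_tiffs is a hypothetical helper name; tiffg4 is an assumed device.
pdf_to_page_tiffs() {
    # Rebuild the positional parameters as the full gs command line.
    # $1 = input PDF, $2 = output prefix; %02d becomes the page number.
    set -- gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 \
        "-sOutputFile=${2}%02d.tiff" "$1"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"    # show the command without running it
    else
        "$@"                  # actually invoke Ghostscript
    fi
}

# Dry run: prints the command that would be executed.
DRY_RUN=1 pdf_to_page_tiffs input.pdf output
```

Dropping the %02d from the output name should give a single multi-page TIFF instead of one file per page, which is closer to what the original question asks for.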
But size does not have anything to do with it. TIFF is far simpler in structure than PDF and therefore has better compatibility. TIFF is also well documented. Of course, they would have to use raw TIFF to get the advantages. The storage-space argument is secondary and matters only insofar as larger data sets have a higher risk of corruption.
I dispute the "well documented" claim. The TIFF standard is quite clear. Unfortunately, almost nobody adheres precisely to the standard. I work extensively with TIFF and PDF, and I have to say that the consistency I see in PDF is about 100 times more than what I see in TIFF. Your typical TIFF reader will contain thousands of hacks and workarounds for oddities that are produced by major players in the industry. While there is slightly non-compliant PDF, I have never seen things that even begin approaching the strangeness I see in TIFF on a daily basis. Having said that, I recommend TIFF plus search text metadata for archival, not PDF.
TIFF: Thousands of Image File Formats
If you do wind up converting to TIFF, then remember to document everything in excruciating detail. With thousands of possible combinations, each of which is a perfectly valid TIFF image, you may run into trouble if someone uses a less robust reader that assumes the wrong compression algorithm, byte order, data striping or photometric interpretation.
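As one small piece of that documentation, the byte order can be read straight from the file header: a TIFF begins with the two bytes "II" for little-endian or "MM" for big-endian. The sketch below checks only that. The function name tiff_byte_order is made up for illustration, and a fuller inventory (compression, strips, photometric interpretation) would need something like libtiff's tiffinfo.

```shell
# Sketch only: report the byte order recorded in a TIFF header.
# tiff_byte_order is a hypothetical helper name, not part of any tool.
tiff_byte_order() {
    # The first two bytes of a TIFF are "II" (Intel, little-endian)
    # or "MM" (Motorola, big-endian).
    case "$(head -c 2 "$1")" in
        II) echo "little-endian" ;;
        MM) echo "big-endian" ;;
        *)  echo "not a TIFF header" ;;
    esac
}

# Usage: tiff_byte_order scan.tiff
```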
Stick with PDF. Chances are, neither PDF nor TIFF will vanish overnight. I'd say PDF is easier to work with, even with minimalist free tools. Since either one is technically "good" for archiving, why do more work than you really need to? Even with batch processing it'd be a pain.
Acrobat has batch processing, and can convert pdfs to TIFF, JPEG, PNG and more. That would be my suggestion if you are really going to convert to TIFFs.
Julie Moult is an idiot.
let's not reinvent the wheel -- I did this about 9 months ago //wolfmann -- and this code is Public domain (done on federal gov't time):
# cat pdf2tiff.sh
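#!/bin/bash
# Convert every PDF one directory down into a multi-page fax-compressed TIFF.
device=tiffg3   # fax Group 3; tiffg4 (Group 4) is the other common fax device
for file in */*.pdf   # for each PDF
do
    [ -e "$file" ] || continue   # skip the literal pattern when nothing matches
    filename="${file%.pdf}"      # strip only the trailing .pdf
    if [ ! -e "$filename.tiff" ]
    then
        echo "gs -q -dNOPAUSE -dBATCH -sDEVICE=$device -sOutputFile=$filename.tiff $file"
        gs -q -dNOPAUSE -dBATCH -sDEVICE="$device" -sOutputFile="$filename.tiff" "$file" 2>/dev/null
    else
        echo "$filename.tiff exists! skipping..."
    fi
done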
How *much* smaller are these TIFFs anyway? TIFF is actually a container format, and can support all sorts of compression, some of them proprietary, some of them common. Not all of them are lossless either (TIFF-Jpeg is a perfectly valid combination, and was used before the days of Exif to add metadata to jpegs). TIFFs can also include vectorized data. It's not all that much less complicated than a PDF.
PDFs are also a container format to an extent. You could very well have a TIFF embedded in a PDF. Fortunately for us, the PDF specification is a bit more stringent on what is supported and what isn't, and PDFs tend to work just about everywhere (especially if all that you've got is an image). You can also apply all sorts of compression to PDFs to reduce their file size... these might not be quite as well supported.
Both formats are extremely common, and it's extremely unlikely that you'll ever have to do any sort of conversion to display them. If I had to place money on it, I'd wager that PDF will be in widespread use for longer than TIFF, though neither format seems to be going anywhere anytime soon. You're more likely to have to worry about the storage devices you're using and the longevity of the media.
If you just need to store lossless raster images, PNG might be a good bet. It's a "Free" format, and is officially endorsed as an ISO/IEC standard. TIFF, by contrast, is copyrighted by Adobe. PNG also has the advantage of being a complete image format, rather than just a container, which means that any software that can open a PNG image should be able to open *any* PNG. Because of its open-sourceness and widespread adoption, PNG will be around for a long time to come as well. Once again, the storage medium and filesystem that you use to store the images is very likely going to become obsolete before the file format itself.
Granted, PNG's compression algorithm isn't optimized for photographic data, though the image formats that *are* optimized for this purpose are neither common nor free.
In summary, there's no reason that a PDF needs to be terribly larger than a TIFF (the overhead should be especially negligible if you've got lots of images at a high resolution). Neither format is going away anytime soon, but both have quirks that can hurt you in the future (multi-page TIFFs are even somewhat of an oddity today). If you really want small files and future-proofing, go with PNG. Otherwise, it's more or less a non-issue.
-- If you try to fail and succeed, which have you done? - Uli's moose
Both PDF and TIFF handle multiple pages, and have done so for years.
Either would be suitable for this application.
If you really want to convert these, Imagemagick would be the best tool to use.
However, it seems a little daft for storage space to be the main reason for changing: you're simply exchanging one compressed image format for another. You might save 10%, for example by moving from JPEG in the PDF to PNG or similar in the TIFF, but is that really worth the effort?
If they really want shiny TIFFs, it would be easy to have an Imagemagick script to convert single PDFs to TIFF on-demand.
http://blog.grcm.net/
ImageMagick is slow, slow, slow for multipage TIFFs. On Windows, creating and splitting multipage G4 TIFFs with tifftools is 20 (TWENTY) times faster.
Although the TIFF format is open and widely used in archiving systems, it is not particularly suited to an archive you are setting up from scratch. The main reason is that many applications that generate TIFF may throw in their own proprietary stuff and lock you into a specific viewer. Also, you cannot do a text search of content in TIFF.
When you discuss archives you think about looong times. Typically 10 to 50 years of retention, with the odd exception where eternity is desired.
Hence "plain" PDF is probably even worse than TIFF. One problem is the included resources (fonts) and references (http links), which are mostly left out in order to save disk space. The other problem is that there are so many "plain" PDF versions to choose from, and none of them will last 10 to 50 years.
However, PDF is a good technology, and therefore the PDF/A standard was developed. It is designed especially to deal with loooong term issues, is currently readable through almost any PDF reader, and will be maintained by most sensible PDF readers for years to come. There is NO vendor lock-in, and you can put text in a PDF/A document and run searches against it. But most importantly, NO proprietary stuff can be shoved in, as it would result in an invalid document (a PDF document maybe, but not a PDF/A document).
With the price of current disk space you should NOT make file size a defining criterion in your archiving policy. Only on z/OS disk space comes at absurd and ridiculous prices. If you can, try aiming for an archiving solution on Unix, Linux or even Windows.
I am in the archiving business. At the moment PDF/A is the only format suitable for archiving.
I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
I would recommend OCRing these documents and storing them in some kind of text-based format (in addition to the graphical format of your choice). If you have particularly voluminous back-catalogues of these documents you'll be very thankful in the future if you have the option to search-enable this textual content.
A graphic image of text is like a wax apple - it looks and tastes like a replica.
"People who think they know everything are very annoying to those of us who do."-Mark Twain
Bah. Use libtiff, document that only readers which also use libtiff (>= the version you're using) are supported, and you're done.
Honest to God, what you're talking about is a trivial task. Use ghostscript, or, if you don't have the time or interest, contact me with your requirements and I'll write it for you gratis, provided it remains F/OSS.
You are so right about Adobe Acrobat.
For a stand-alone free PDF viewer take a look at Foxit.
http://www.foxitsoftware.com/pdf/rd_intro.php
It's fast and small, and does the job.
Imagemagick does its conversion entirely in-memory, so if a document is more than a hundred pages or so you are going to have some problems.
I was not recommending Microsoft's TIFF+Text, but rather some easily maintained combination of TIFF+text. If that is Microsoft's extension, then that's what it is. I see no loss of functionality due to the storage format not being vector-based. An archival format is going to be looked at on the screen, searched, and possibly printed. There is no benefit to vector graphics except perhaps a size argument, which is obviated by technologies like JBIG2 (if they'd ever get off their butts and formalize the JBIG2-in-TIFF spec).
The only argument I've ever heard for vector graphics is that "When I zoom way, way, WAY in, it still looks all smooth and pretty." Now what the hell are you trying to look at a document from a distance of 0.0002 inches for? It's a non-complaint.
TIFF, on the other hand, is a simple raster format with enormously wide support. You don't have to worry about how one rasterizer is going to look different from another -- every pixel is precisely defined. The document will appear exactly as it was intended to. Consider the difference in codebase size between a simple TIFF reader and a full-blown vector rendering engine. There is enormous complexity with no benefit.
Having the format support a thing, and having that thing make sense are two different things. For example, Excel supports being used as a database... but does it make sense?
My point: use image formats for images, and document formats for documents. If the things you're trying to store are images, don't put them in a document format, and if they're documents, then don't put them in an image format.
Also, if TIFF is designed to store both images and documents, then I question whether it is too general to do either of them well. And your mention of "subformats" makes me think my concern is well-founded!
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
I agree that unless the files are extremely large or extremely numerous, storage space probably shouldn't be a concern, because it's cheap compared to your time and getting cheaper. But if storage space is a concern, you might look into the TIFF format used by the patent office. Apparently it uses a form of lossless compression taken from fax machines and gets much better compression than many other common formats on black-and-white (no greyscale) documents. If it's the patent office's standard archive format, it will probably be supported for a long time. PDF can probably use the same compression as well, though. I was going to mod up someone below http://ask.slashdot.org/comments.pl?sid=593693&no_d2=1&cid=23920133 who recommended PDF/A as the only archive format to use.
I was also going to mod up someone who recommended http://ask.slashdot.org/comments.pl?sid=593693&no_d2=1&cid=23921141 that if you use tiff just make sure it can be read by libtiff and you should have no worries about future readability.
I hear a lot of talk about how to convert back and forth, but nobody's mentioning the thing that I would consider the most important:
When you convert from .pdf to .tif, are you losing data?
Most of these convert scripts seem to work by starting Ghostscript and rendering a .tif out of your PDF. This is a *terrible terrible idea*. What you'd really want to do is reach into the PDF itself, and extract the lossless images perfectly. Anything else is like printing the .PDF and scanning the printout - you might lose pixels, you might gain extra pixels, and you almost certainly won't be perfectly aligned with the "pixel grain" of the original image.
Unless you can guarantee that you're pulling out, pixel-by-pixel, the exact original data, I would stick with PDFs.
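One way to reach into the PDF rather than re-render it is Poppler's pdfimages, which writes out the embedded raster images at their stored resolution. Nobody in the thread mentions this tool, so treat the sketch below as an assumption: it presumes pdfimages is installed, and the helper name extract_pdf_images is made up. Setting DRY_RUN prints the command instead of running it.

```shell
# Sketch only: extract the embedded images from a PDF without rasterizing pages.
# Assumes Poppler's pdfimages is available; extract_pdf_images is a hypothetical name.
extract_pdf_images() {
    # $1 = input PDF, $2 = prefix for the numbered output files.
    set -- pdfimages -tiff "$1" "$2"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"    # show the command without running it
    else
        "$@"                  # write one TIFF per embedded image
    fi
}

# Dry run: prints the command that would be executed.
DRY_RUN=1 extract_pdf_images scan.pdf page
```

This yields one file per embedded image, so the results would still need to be joined into a multi-page TIFF afterwards (libtiff's tiffcp is one candidate), and it is worth spot-checking the pixel dimensions against the originals before trusting a whole batch.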
Breaking Into the Industry - A development log about starting a game studio.
As someone who writes software to view PDFs, I can tell you this is completely pointless, since anything that saves scanned documents into PDF is really storing it as a TIFF image inside of the PDF anyway. The PDF container adds useful features for metadata, and is well documented, so shouldn't add any future-proof issues. And the overhead is probably a few kilobytes.
Note that "Odder" and "freenix" are the same person.
The twitter monologues. Click on my homepage and be amazed.
TIFF files have a maximum size of 4GB. (The "value offset" field of an IFD entry is a 32-bit value.) You can exceed this with 50 noisy pages. PDF files have a maximum size of 10 to the tenth power bytes. (The byte offset in a cross-reference table entry is a ten-digit decimal number.) That's 2.3 times the maximum TIFF file size.
I have written software to create both TIFF and PDF files. I would use PDF for archiving. Even today, it's tricky to find a TIFF reader that will run on all the important platforms and handle the variety of compression flavors (e.g., JBIG2.)
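The size limits quoted above are easy to check with shell arithmetic. This is only a sketch of the calculation as stated in the comment (a 32-bit offset versus a ten-digit decimal offset), using integer math scaled by ten so that no external calculator is needed.

```shell
# Sketch only: compare the offset-imposed size limits described above.
tiff_max=$(( 1 << 32 ))      # 32-bit value offset in a TIFF IFD entry
pdf_max=10000000000          # ten decimal digits in a PDF xref entry (10^10)

# Integer ratio scaled by 10, truncated rather than rounded.
ratio_x10=$(( pdf_max * 10 / tiff_max ))

echo "TIFF limit: $tiff_max bytes"
echo "PDF limit:  $pdf_max bytes"
echo "PDF/TIFF ratio: about $(( ratio_x10 / 10 )).$(( ratio_x10 % 10 ))"
```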
> Scanned documents are both images and documents.
Exactly. I don't mean to be rude, but the GP's comment is just silly.
"Documents" can be either graphical or text (or both). There's a reason why word processing formats allow embedding of images.
In fact, "document" has become an almost meaningless term since it can apply to so many types of data. About the only thing all the different "document" formats have in common is that their content is 2D!
"Slow down, Cowboy! It has been 3 years, 7 months and 26 days since you last successfully posted a comment."