Multi-page PDF To Multi-page TIFF and Archiving?
GeorgeMonroy writes "One of my clients has aperture cards that they have been scanning into multi-page PDF files — but now they want them in multi-page TIFFs instead. One of the reasons they gave for this is that TIFF files require less storage space. While that is true, I wonder if TIFF is the best format going into the future. Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway. Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one? If another file format is better than TIFF, then are there any programs for batch processing that you can recommend?"
If they're images, then you should use TIFF (or perhaps PNG). However, it doesn't make sense for them to be "multi-page." If they're documents, then PDF is appropriate.
I would suggest that your client doesn't know WTF they want.
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Apple was kind enough to build this functionality into Mac OS X in the form of Preview and Automator (or AppleScript).
Regards,
Ryan Pritchard
Fun Extends All Basic Life Expectancies
Ghostscript can do the conversion from the console.
You can write a simple shell script to convert all files.
We use a program called ImageSite that handles that. It uses TIF files. Why reinvent the wheel?
One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".
The most effective compressors are commercial, but DjVu is a very effective image archival format; see DjVuLibre for the non-commercial tree.
Moving back towards the question in the article, I don't think there's much worry about either TIFF or PDF in terms of future proofing; they're both very widely used, have multiple implementations and third parties with substantial interest in keeping those implementations maintained, etc. The quality of TIFF implementations varies wildly, but the good ones are only going to get better, and I'd be shocked if libtiff ended up terminally bitrotted without a successor implementing a superset of its functionality inside my lifetime.
You haven't specified what is on the microfilm chip in the card. If it's largely text, I can't see why you'd want to lose the embedded, searchable text in the PDF (TIFF would require OCR at some point..?).
But size does not have anything to do with it. TIFF is far simpler in structure than PDF and therefore has better compatibility. TIFF is also well documented. Of course, they would have to use raw TIFF to get the advantages. The storage-space argument is secondary and matters only insofar as larger data sets have a higher risk of corruption.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway.
Don't use TIFF. Stay with PDF. PDF is what all the big digital libraries are using. It's a proper standard, it's readable and writable by lots of free open source software, so even if Adobe disappears in a puff of intellectual property, you'll still be able to read your documents.
TIFF, on the other hand, is a container format (like AVI, but worse). It isn't fully supported by every program - what sort of TIFF do you want, anyway? Compressed with LZW? With RLE? Not compressed at all? There's free software that will read and write the most common types of TIFF, so you can certainly do it, but why give up the convenience of using PDF?
Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one?
Use ghostscript. Use something like the following command line:
This turns input.pdf into a series of 300 dpi tiff files, one for each page, called output01.tiff, output02.tiff, etc. Change the DEVICE to get a different sort of tiff file, and use gs --help to get a list of options. You can easily wrap this command in a script of almost any sort to make the process fully automatic.
>north
You're an immobile computer, remember?
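A command line matching the description above (300 dpi, one TIFF per page, output01.tiff, output02.tiff, ...) might look like the sketch below. The tiffg4 device is an assumption, since the comment does not name one; the guard around gs is only there so the script does nothing when Ghostscript or input.pdf is missing.

```shell
#!/bin/sh
# Sketch: one 300 dpi TIFF per page (output01.tiff, output02.tiff, ...).
# DEVICE is an assumption: tiffg4 is bilevel with CCITT Group 4 compression.
DEVICE=tiffg4
INPUT=input.pdf
PATTERN=output%02d.tiff   # Ghostscript replaces %02d with the page number

if command -v gs >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    gs -q -dNOPAUSE -dBATCH -sDEVICE="$DEVICE" -r300 \
       -sOutputFile="$PATTERN" "$INPUT"
else
    echo "gs or $INPUT not found; nothing converted" >&2
fi
```

Dropping the %02d from the output name should make the TIFF devices write all pages into a single multi-page file instead, which is what the submitter asked for.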
But it'll save it as multiple TIFF files, I don't think Acrobat has multiple-page TIFF capability.
Julie Moult is an idiot.
In any case, the PDF and TIFF file formats are well-documented, and if they ever do go extinct despite their widespread use (bloody unlikely), it would always be possible to write a program to convert them into the format-du-jour, provided, of course, you are able to read the media...
Or under Windows just use ImageMagick
http://www.imagemagick.org/script/binary-releases.php#windows
Stick with PDF. Chances are, neither PDF nor TIFF will vanish overnight. I'd say PDF is easier to work with, even with minimalist free tools. Since either one is technically "good" for archiving, why do more work than you really need to? Even with batch processing it'd be a pain.
Acrobat has batch processing, and can convert pdfs to TIFF, JPEG, PNG and more. That would be my suggestion if you are really going to convert to TIFFs.
Julie Moult is an idiot.
let's not reinvent the wheel -- I did this about 9 months ago //wolfmann -- and this code is Public domain (done on federal gov't time):
# cat pdf2tiff.sh
#!/bin/bash
shopt -s nullglob  # skip the loop entirely if no PDFs match
device=tiffg4  # Group 4 fax compression; use tiffg3 for Group 3
for file in */*.pdf  # for each PDF one directory down
do
filename="${file%.pdf}"  # strip only the .pdf suffix, keep other dots
if [ ! -e "$filename.tiff" ]
then
echo "gs -q -dNOPAUSE -dBATCH -sDEVICE=$device -sOutputFile=$filename.tiff $file"
gs -q -dNOPAUSE -dBATCH -sDEVICE="$device" -sOutputFile="$filename.tiff" "$file" 2>/dev/null
else
echo "$filename.tiff exists! skipping..."
fi
done
How *much* smaller are these TIFFs anyway? TIFF is actually a container format, and can support all sorts of compression, some of them proprietary, some of them common. Not all of them are lossless either (TIFF-Jpeg is a perfectly valid combination, and was used before the days of Exif to add metadata to jpegs). TIFFs can also include vectorized data. It's not all that much less complicated than a PDF.
PDFs are also a container format to an extent. You could very well have a TIFF embedded in a PDF. Fortunately for us, the PDF specification is a bit more stringent on what is supported and what isn't, and PDFs tend to work just about everywhere (especially if all that you've got is an image). You can also apply all sorts of compression to PDFs to reduce their file size... these might not be quite as well supported.
Both formats are extremely common, and it's extremely unlikely that you'll ever have to do any sort of conversion to display them. If I had to place money on it, I'd wager that PDF will be in widespread use for longer than TIFF, though neither format seems to be going anywhere anytime soon. You're more likely to have to worry about the storage devices you're using and the longevity of the media.
If you just need to store lossless raster images, PNG might be a good bet. It's a "Free" format, and is officially endorsed as an ISO/IEC standard. TIFF is copyrighted by Adobe. PNG also has the advantage of being a complete image format, rather than just a container, which means that any software that can open a PNG image should be able to open *any* PNG. Because of its open-sourceness and widespread adoption, PNG will be around for a long time to come as well. Once again, the storage medium and filesystem that you use to store the images is very likely going to become obsolete before the file format itself.
Granted, PNG's compression algorithm isn't optimized for photographic data, though the image formats that *are* optimized for this purpose are neither common nor free.
In summary, there's no reason that a PDF needs to be terribly larger than a TIFF (the overhead should be especially negligible if you've got lots of images at a high resolution). Neither format is going away anytime soon, but both have quirks that can hurt you in the future (multi-page TIFFs are even somewhat of an oddity today). If you really want small files and future-proofing, go with PNG. Otherwise, it's more or less a non-issue.
-- If you try to fail and succeed, which have you done? - Uli's moose
imagemagick is Slow slow slow for multipage tiffs. On windows, creating and splitting multipage G4 tiffs is 20 (TWENTY) times faster using tifftools.
Although the TIFF format is open and widely used in archiving systems, it is not particularly suited to an archive you are setting up new. The main reason is that many applications that generate TIFF may throw in their own proprietary stuff and lock you into a specific viewer. Also, you cannot do a text search of content in TIFF.
When you discuss archives you think about looong times: typically 10 to 50 years of retention, with the odd exception where eternity is desired.
Hence "plain" PDF is probably even worse than TIFF. One problem is the included resources (fonts) and references (HTTP links), which are mostly left out in order to save disk space. The other problem is that there are so many "plain" PDF versions to choose from, and none of them will last 10 to 50 years.
However, PDF is a good technology, and therefore the PDF/A standard was developed. It is designed especially to deal with loooong-term issues, is currently readable by almost any PDF reader, and will be supported by most sensible PDF readers for years to come. There is NO vendor lock-in, and you can put text in a PDF/A document and run searches against it. But most importantly, NO proprietary stuff can be shoved in, as that would result in an invalid document (a PDF document maybe, but not a PDF/A document).
With the price of current disk space you should NOT make file size a defining criterion in your archiving policy. Only on z/OS disk space comes at absurd and ridiculous prices. If you can, try aiming for an archiving solution on Unix, Linux or even Windows.
I am in the archiving business. At the moment PDF/A is the only format suitable for archiving.
I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
I would recommend OCRing these documents and storing them in some kind of text-based format (in addition to the graphical format of your choice). If you have particularly voluminous back-catalogues of these documents you'll be very thankful in the future if you have the option to search-enable this textual content.
A graphic image of text is like a wax apple - it looks and tastes like a replica.
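For the OCR suggestion above, a minimal sketch using the open-source Tesseract engine, which nobody in the thread mentions and which is only one possible choice. The function name and file layout are hypothetical; the call relies on Tesseract's usual "image, output base" arguments, where the default output is a .txt file.

```shell
#!/bin/sh
# Sketch: write a plain-text sidecar next to each scanned page.
# Set RUN=echo to print the commands instead of running them.
ocr_sidecar() {
    # $1 = page image, e.g. scans/card1.tif -> scans/card1.txt
    ${RUN:-} tesseract "$1" "${1%.*}"
}

# Usage: for f in scans/*.tif; do ocr_sidecar "$f"; done
```

Appending the word pdf to the tesseract call should produce a searchable PDF instead of a text file, which keeps image and text together.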
PDF makes sense for document signing, security, and damage detection. TIFF does not have any of this important security and data integrity protection by itself.
PDF also allows for the same compression on the scanned image that TIFF does, as well as much better compression methods available to it.
TIFF, while well-understood in the archival industry, has rather fledgling support in the free *NIX world--especially multi-page TIFF.
Finally, with PDF, you can preserve both the image and the OCR data all in the same file. That's impossible with TIFF.
And, anyway, it's not 1985 anymore.
Kriston
I have a similar issue, but have chosen PDF because it meets the digital signature requirements of most professional licensure boards (architects and engineers worry about this stuff). It's not a large hurdle, just that the documents can be externally verified against a publicly available key. Adobe lets you do that for free (well, assuming you have their s/w; I can post a key on my website for a 3rd party to install and verify the signature).
This isn't a high-crypto requirement area - you can easily fake a paper document, and the standard isn't too much higher for A/E work, but it has to be sealed and marked for the inability to change, and the certificate must be publicly available and easily verified.
Is it just my observation, or are there way too many stupid people in the world?
I cannot find an analogy for how fundamentally incorrect the submitter's mental model of PDFs and TIFFs is. Think of PDF as a container format: you can compress the images inside the PDF to your heart's content, much smaller than TIFF will, since it can use JPEG or PNG or whatever format you want. In Acrobat: Tools->Print Production->PDF Optimizer. It even has OCR and some scanned image auto cleaning. The easiest thing is just to have them change their scan resolution down to, say, 150 DPI, and B+W instead of color.
Is there anything better than clicking through Microsoft ads on Slashdot?
We have a document management system. In it we had to make the decision for PDF or tiff. We opted for tiff. It had nothing to do with the file size. The deciding factor was because we could find FAST tiff viewers all day and night. It's probably not that PDF as a format is that much more bloated, but the readers, especially acrobat reader take a LOT longer to start up.
We use an ActiveX control called alternatiff to view them in the browser (and yes, it does multipage). The control loads in the browser VERY fast. Acrobat embedded in a page is painfully slow to load, even if you just do a page re-load.
Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
Send them a quotation. If the money looks good, do it and don't bitch about it on slashdot. If it does not look good, decline the job and don't bitch about it on slashdot either. Either way, don't bitch about it on slashdot.
Legal service firms work with all of these PDF and TIFF variants all of the time. They should be able to kick out whatever you need at x cents per page (which will usually be cheaper than your time/money)
The weird TIFF formats are used for various document management products, so it really depends mostly on your workflow.
Business. Numbers. Money. People. Computer World.
Also courts have requirements for electronic document formats and there is nothing non-standard about a "image based pdf" (these also support searchable OCR full text).
Business. Numbers. Money. People. Computer World.
The gold-standard tool for this is PDF2IMG which uses Adobe's own PDF rendering library but it'll set you back a few thousand dollars.
Ghostscript is good but it isn't perfect: it does choke on some PDFs, misrenders some and won't pick up non-embedded TTF fonts, only external PS fonts. It also doesn't do any anti-aliasing so you probably want to render large and sample down and (IIRC) there's a max image size it can render. But by and large it does just work.
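The "render large and sample down" idea can be sketched with Ghostscript alone. This assumes a Ghostscript build that has the tiffscaled8 device and the -dDownScaleFactor option; the 300 dpi target, the factor of 4, and the function name are all illustrative choices, not something from the comment.

```shell
#!/bin/sh
# Sketch: render at 4x the target resolution, then downsample in Ghostscript.
# Set RUN=echo to print the command instead of running it.
TARGET_DPI=300
FACTOR=4
RENDER_DPI=$((TARGET_DPI * FACTOR))   # 1200 dpi internal rendering

render_downscaled() {
    # $1 = input PDF, $2 = output TIFF (no %d, so one multi-page file)
    ${RUN:-} gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffscaled8 \
        -r"$RENDER_DPI" -dDownScaleFactor="$FACTOR" \
        -sOutputFile="$2" "$1"
}

# Usage: render_downscaled input.pdf output.tiff
```

The output is 8-bit grayscale at the target resolution, so edges come out smoothed rather than jagged, at the cost of larger files than a bilevel G4 TIFF.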
Honest to God, what you're talking about is a trivial task. Use ghostscript, or, if you don't have the time or interest, contact me with your requirements and I'll write it for you gratis, provided it remains F/OSS.
Swedish plasma phys. PhD student; MSc EE; knows maths, programming, electronics; finance interest; seeks opportunities
Your paranoia is amusing. I wasn't "slamming" TIFF, merely pointing out that the fact that the spec is open and available is no reason to believe that all files conform to that spec. I was not recommending TIFF in all of its many incarnations, but in specific forms. I suppose I should have been more explicit. Use basic TIFF, with a 0=white photometric, with G4 compression. Stick to that and any viewer will open it.
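One way to police the "basic TIFF, 0=white, G4" rule on an existing archive is to grep the output of libtiff's tiffinfo. This is only a sketch: the two label strings are what tiffinfo is assumed to print, and the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: check that a TIFF uses Group 4 compression and 0=white photometric.
# The two label strings are what libtiff's tiffinfo is assumed to print.
is_basic_g4() {
    # Reads tiffinfo output on stdin; succeeds only if both settings appear.
    info=$(cat)
    printf '%s\n' "$info" | grep -q 'Compression Scheme: CCITT Group 4' &&
        printf '%s\n' "$info" | grep -q 'Photometric Interpretation: min-is-white'
}

# Usage: tiffinfo scan.tiff | is_basic_g4 && echo "scan.tiff looks basic"
```

For a multi-page file this only proves that at least one page matches; a stricter check would have to look at every directory tiffinfo lists.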
If you'd like to see some actual pimping of the product I ACTUALLY work on, see www.swiftview.com.
I was not recommending Microsoft's TIFF+Text, but rather some easily maintained combination of TIFF+text. If that is Microsoft's extension, then that's what it is. I see no loss of functionality due to the storage format not being vector-based. An archival format is going to be looked at on the screen, searched, and possibly printed. There is no benefit to vector graphics except perhaps a size argument, which is obviated by technologies like JBIG2 (if they'd ever get off their butts and formalize the JBIG2-in-TIFF spec).
The only argument I've ever heard for vector graphics is that "When I zoom way, way, WAY in, it still looks all smooth and pretty." Now what the hell are you trying to look at a document from a distance of 0.0002 inches for? It's a non-complaint.
TIFF, on the other hand, is a simple raster format with enormously wide support. You don't have to worry about how one rasterizer is going to look different from another -- every pixel is precisely defined. The document will appear exactly as it was intended to. Consider the difference in codebase size between a simple TIFF reader and a full-blown vector rendering engine. There is enormous complexity with no benefit.
I hear a lot of talk about how to convert back and forth, but nobody's mentioning the thing that I would consider the most important:
When you convert from .pdf to .tif, are you losing data?
Most of these convert scripts seem to work by starting Ghostscript and rendering a .tif out of your PDF. This is a *terrible terrible idea*. What you'd really want to do is reach into the PDF itself, and extract the lossless images perfectly. Anything else is like printing the .PDF and scanning the printout - you might lose pixels, you might gain extra pixels, and you almost certainly won't be perfectly aligned with the "pixel grain" of the original image.
Unless you can guarantee that you're pulling out, pixel-by-pixel, the exact original data, I would stick with PDFs.
Breaking Into the Industry - A development log about starting a game studio.
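Extracting the embedded images instead of re-rendering them could look roughly like this. It assumes poppler's pdfimages and libtiff's tiffcp are installed, neither of which is mentioned in the thread, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: pull the embedded images out of a PDF, then join them into one TIFF.
# Set RUN=echo to print the commands instead of running them.
extract_pages() {
    # $1 = input PDF, $2 = output prefix (prefix-000.tif, prefix-001.tif, ...)
    ${RUN:-} pdfimages -tiff "$1" "$2"
}

join_pages() {
    # $1 = multi-page TIFF to create, remaining args = page TIFFs in order
    out=$1
    shift
    ${RUN:-} tiffcp "$@" "$out"
}

# Usage: extract_pages input.pdf page && join_pages input.tiff page-*.tif
```

Because pdfimages copies the stored pixels rather than rasterizing the page, the result keeps the original scan resolution. Images stored as JPEG inside the PDF are decoded on the way out, so they stay only as good as the JPEG was.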
As someone who writes software to view PDFs, I can tell you this is completely pointless, since anything that saves scanned documents into PDF is really storing it as a TIFF image inside of the PDF anyway. The PDF container adds useful features for metadata, and is well documented, so shouldn't add any future-proof issues. And the overhead is probably a few kilobytes.
Postscript begs to differ - {G}.
An Invisible Entity of Vast Power whose existence must be taken on faith alone: Liberal Media
Note that "Odder" and "freenix" are the same person.
The twitter monologues. Click on my homepage and be amazed.
TIFF files have a maximum size of 4GB. (The "value offset" field of an IFD entry is a 32-bit value.) You can exceed this with 50 noisy pages. PDF files have a maximum size of 10 to the tenth power bytes. (The byte offset in a cross-reference table entry is a ten-digit decimal number.) That's 2.3 times the maximum TIFF file size.
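The 2.3 figure is easy to check, taking the limits exactly as stated above (2^32 bytes for TIFF, 10^10 bytes for PDF):

```shell
#!/bin/sh
# TIFF: 32-bit offsets, so 2^32 bytes. PDF: ten decimal digits, so 10^10 bytes.
ratio=$(LC_ALL=C awk 'BEGIN { printf "%.2f", 1e10 / 2^32 }')
echo "PDF limit / TIFF limit = $ratio"
```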
I have written software to create both TIFF and PDF files. I would use PDF for archiving. Even today, it's tricky to find a TIFF reader that will run on all the important platforms and handle the variety of compression flavors (e.g., JBIG2.)
You understand we're talking about document archiving, right? Postscript is a terrible format for raster images.
Business. Numbers. Money. People. Computer World.
A scan? Do you have some suggestion how to convert a SCAN to a vector format and magically synthesize information that was never there in the first place?
For raster archival, 600 DPI is good enough. Nobody is suggesting archiving rasters at screen resolution.
I had to create a bash script a while ago to convert color PostScript to black-and-white TIFF. I used netpbm in conjunction with ghostscript to do it. Those 2 programs together can do just about anything you want for batch file graphics. Not sure if it will work for .pdf, but .pdf is pretty close to PostScript and probably easy enough to convert to eps or straight ps. Gimp and, as stated above, imagemagick also have a lot of useful batch processing tools, but you have to learn the script language (in the case of gimp) to use them, and both imagemagick and gimp are much slower than netpbm.
Implementations in minutes. Converts to most anything. Not the most efficient though
-- Programming with boost is like building a house with lego. It's a cool but I wouldn't want to live in it
If you OCR the text and determine that there is no image data, you can discard the raster data and compress more heavily, leaving the text as vector data.
That's a form of magically synthesizing vector data from raster scans.
I won't speak to why or whether you should do it, but here are a few options for how.
Doculex has an app called MPTiffIt that will do single to multi or multi to single page tiff conversion. You'd need to convert the PDFs to single page tiff via Save As (or perhaps a Batch Process), then recombine them with MPTiffIt.
Or, you could use a Tiff printer driver along with a batch printing software.
Personally, I'd use L.A.W. (Legal Access Ware, created by Image Capture Engineering and now owned by Lexis Nexis), which is a full production scanning, OCR and e-doc conversion suite. You can import any type of doc that can be printed and output single or multi-page tiffs or PDFs with a variety of database/unitization load files. Similar products by IPro and Doculex exist. Any litigation support vendor should have one of these tools, and would likely charge a relatively nominal fee (per GB) to perform the conversion for you.
Or you could use JBIG2 while preserving the original appearance of the document absolutely. Your typical letter-sized page of 600 DPI information (over 33 million pixels) will compress to anywhere between 30-150 kilobytes, and that's in a lossless mode. The (theoretically) smaller size of a vector representation is not worth the loss of the original data.
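The pixel count works out as follows, assuming US letter (8.5 x 11 inches) and, for the uncompressed size, one bit per pixel, since JBIG2 is a bilevel format:

```shell
#!/bin/sh
# US letter is 8.5 x 11 inches; at 600 DPI that is 5100 x 6600 pixels.
pixels=$((85 * 600 / 10 * 11 * 600))   # 8.5 written as 85/10 to stay in integers
raw_bytes=$((pixels / 8))              # 1 bit per pixel, uncompressed
echo "$pixels pixels, $raw_bytes bytes uncompressed at 1 bpp"
```

That is about 4.2 MB raw per page, so 30-150 kilobytes corresponds to a compression ratio of very roughly 30:1 to 140:1.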
While you're at it, why not decode the data punched on each card and then just store the microfilm image and the decoded data, discarding the image of the rest of the card? That'd make things a lot more efficient.
You'll do even better on file size if you convert scans to Djvu: Scan to pnm (or pgm or pbm), convert using c44 (or cjb2 for pbm), and you can skip the TeX step (use djvm to assemble pages). If the text you're scanning is black and white, you'll be stunned by the quality and file size reduction.
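The DjVuLibre steps named above (cjb2 for bilevel pbm, c44 for everything else, djvm to assemble) might be wired together like this. The 600 dpi setting and the function names are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: encode each scanned page, then bundle the pages into one document.
# Set RUN=echo to print the commands instead of running them.
encode_page() {
    # $1 = scanned page (.pbm bilevel, or .pgm/.ppm/.pnm), $2 = output .djvu
    case $1 in
        *.pbm) ${RUN:-} cjb2 -dpi 600 "$1" "$2" ;;   # bilevel encoder
        *)     ${RUN:-} c44 -dpi 600 "$1" "$2" ;;    # wavelet encoder
    esac
}

bundle_pages() {
    # $1 = output document, remaining args = per-page .djvu files in order
    out=$1
    shift
    ${RUN:-} djvm -c "$out" "$@"
}

# Usage: encode_page p1.pbm p1.djvu; encode_page p2.pbm p2.djvu
#        bundle_pages cards.djvu p1.djvu p2.djvu
```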
OCR doesn't work that well, you will still need the images. So you won't gain anything from doing TIFF+text.