Multi-page PDF To Multi-page TIFF and Archiving?
GeorgeMonroy writes "One of my clients has aperture cards that they have been scanning into multi-page PDF files — but now they want them in multi-page TIFFs instead. One of the reasons they gave for this is that TIFF files require less storage space. While that is true, I wonder if TIFF is the best format going into the future. Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway. Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one? If another file format is better than TIFF, then are there any programs for batch processing that you can recommend?"
SwiftConvert will support PDF in the upcoming 9.0 release.
I have to warn you that the price is steep.
If they're images, then you should use TIFF (or perhaps PNG). However, it doesn't make sense for them to be "multi-page." If they're documents, then PDF is appropriate.
I would suggest that your client doesn't know WTF they want.
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Apple was kind enough to build this functionality into Mac OS X in the form of Preview and Automator (or AppleScript).
Regards,
Ryan Pritchard
Fun Extends All Basic Life Expectancies
Ghostscript can do the conversion from the console.
You can write a simple shell script to convert all files.
We use a program called ImageSite that handles that. It uses TIF files. Why reinvent the wheel?
One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".
Acrobat has a built-in batch processing mode capable of outputting TIFF files. I don't know that I'd do this if these were primarily text documents, though, as the scans may include text representations of the scanned pages, which you'd lose with the conversion.
The most effective compressors are commercial, but DjVu is a very effective image archival format; see DjVuLibre for the non-commercial tree.
Moving back towards the question in the article, I don't think there's much worry about either TIFF or PDF in terms of future proofing; they're both very widely used, have multiple implementations and third parties with substantial interest in keeping those implementations maintained, etc. The quality of TIFF implementations varies wildly, but the good ones are only going to get better, and I'd be shocked if libtiff ended up terminally bitrotted without a successor implementing a superset of its functionality inside my lifetime.
You haven't specified what is on the microfilm chip in the card. If it's largely text, I can't see why you'd want to lose the embedded text (searchable etc. - TIFF would require OCR at some point..?) in PDF.
At my place of employment, I use Cardiff Teleform for this task. But it's expensive and is really made for OCR & Data Entry (which is our main use of it).
But size does not have anything to do with it. TIFF is far simpler in structure than PDF and therefore has better compatibility. TIFF is also well documented. Of course, they would have to use raw tiff to get the advantages. The storage-space argument is secondary and matters only insofar as larger data sets have a higher risk of corruption.
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
> Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway.
Don't use TIFF. Stay with PDF. PDF is what all the big digital libraries are using. It's a proper standard, it's readable and writable by lots of free open source software, so even if Adobe disappears in a puff of intellectual property, you'll still be able to read your documents.
TIFF, on the other hand, is a container format (like AVI, but worse). It isn't fully supported by every program - what sort of TIFF do you want, anyway? Compressed with LZW? With RLE? Not compressed at all? There's free software that will read and write the most common types of TIFF, so you can certainly do it, but why give up the convenience of using PDF?
> Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one?
Use ghostscript. Use something like the following command line:
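The command line itself did not survive in this copy of the post. As a stand-in, here is a minimal sketch consistent with the description that follows. The tiffg4 device is an assumption (the post does not say which device it used), and the wrapper name pdf_to_tiff_pages is made up for illustration:

```shell
# Sketch only: the original command is missing, so the device (tiffg4) and
# the function name are assumptions. -r300 renders at 300 dpi, and the %02d
# in the output name is filled in by Ghostscript with the page number.
pdf_to_tiff_pages() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 \
       -sOutputFile="${2:-output}%02d.tiff" "$1"
}

# Example: pdf_to_tiff_pages input.pdf
```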
This turns input.pdf into a series of 300 dpi tiff files, one for each page, called output01.tiff, output02.tiff, etc. Change the DEVICE to get a different sort of tiff file, and use gs --help to get a list of options. You can easily wrap this command in a script of almost any sort to make the process fully automatic.
>north
You're an immobile computer, remember?
In any case, the PDF and TIFF file formats are well-documented, and if they ever do go extinct despite their widespread use (bloody unlikely), it would always be possible to write a program to convert them into the format-du-jour, provided, of course, you are able to read the media...
Or under windows just use ImageMagick
http://www.imagemagick.org/script/binary-releases.php#windows
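For what it's worth, a sketch of what that might look like with ImageMagick's convert. It is written as a POSIX shell function, but the same options apply on the Windows command line. The function name, the 300 dpi density and the Group 4 compression are all assumptions, and ImageMagick needs Ghostscript installed to read PDFs at all:

```shell
# Hypothetical helper: render every page of a PDF into one multi-page TIFF.
# -density 300 sets the rasterization resolution and must come before the
# input file; -monochrome reduces the pages to black and white, which
# Group 4 compression requires.
pdf_to_multipage_tiff() {
    convert -density 300 "$1" -monochrome -compress Group4 "${1%.pdf}.tiff"
}

# Example: pdf_to_multipage_tiff scans.pdf   (writes scans.tiff)
```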
Stick with PDF. Chances are, neither PDF nor TIFF will vanish overnight. I'd say PDF is easier to work with, even with minimalist free tools. Since either one is technically "good" for archiving, why do more work than you really need to? Even with batch processing it'd be a pain.
Acrobat has batch processing, and can convert pdfs to TIFF, JPEG, PNG and more. That would be my suggestion if you are really going to convert to TIFFs.
Julie Moult is an idiot.
let's not reinvent the wheel -- I did this about 9 months ago //wolfmann -- and this code is Public domain (done on federal gov't time):
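# cat pdf2tiff.sh
#!/bin/bash
# Convert every PDF one directory down into a multi-page TIFF beside it.
device=tiffg4   # Group 4 fax compression; use tiffg3 for Group 3
for file in */*.pdf   # for each pdf
do
    [ -e "$file" ] || continue   # glob matched nothing
    filename="${file%.pdf}"      # strip the extension, even if the name has other dots
    if [ ! -e "$filename.tiff" ]
    then
        echo "gs -q -dNOPAUSE -dBATCH -sDEVICE=$device -sOutputFile=$filename.tiff $file"
        gs -q -dNOPAUSE -dBATCH -sDEVICE="$device" -sOutputFile="$filename.tiff" "$file" 2> /dev/null
    else
        echo "$filename.tiff exists! skipping..."
    fi
done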
How *much* smaller are these TIFFs anyway? TIFF is actually a container format, and can support all sorts of compression, some of them proprietary, some of them common. Not all of them are lossless either (TIFF-Jpeg is a perfectly valid combination, and was used before the days of Exif to add metadata to jpegs). TIFFs can also include vectorized data. It's not all that much less complicated than a PDF.
PDFs are also a container format to an extent. You could very well have a TIFF embedded in a PDF. Fortunately for us, the PDF specification is a bit more stringent on what is supported and what isn't, and PDFs tend to work just about everywhere (especially if all that you've got is an image). You can also apply all sorts of compression to PDFs to reduce their file size... these might not be quite as well supported.
Both formats are extremely common, and it's extremely unlikely that you'll ever have to do any sort of conversion to display them. If I had to place money on it, I'd wager that PDF will be in widespread use for longer than TIFF, though neither format seems to be going anywhere anytime soon. You're more likely to have to worry about the storage devices you're using and the longevity of the media.
If you just need to store lossless raster images, PNG might be a good bet. It's a "Free" format, and is officially endorsed as an ISO/IEC standard. TIFF is copyrighted by Adobe. PNG also has the advantage of being a complete image format, rather than just a container, which means that any software that can open a PNG image should be able to open *any* PNG. Because of its openness and widespread adoption, PNG will be around for a long time to come as well. Once again, the storage medium and filesystem that you use to store the images is very likely going to become obsolete before the file format itself.
Granted, PNG's compression algorithm isn't optimized for photographic data, though the image formats that *are* optimized for this purpose are neither common nor free.
In summary, there's no reason that a PDF needs to be terribly larger than a TIFF (the overhead should be especially negligible if you've got lots of images at a high resolution). Neither format is going away anytime soon, but both have quirks that can hurt you in the future (multi-page TIFFs are even somewhat of an oddity today). If you really want small files and future-proofing, go with PNG. Otherwise, it's more or less a non-issue.
-- If you try to fail and succeed, which have you done? - Uli's moose
> Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof?
PDF should be the most future-proof format as far as I know. In the world of legal discovery, TIFF has been a mainstay for paper scanning. PDF is taking over, and many filings can be done electronically using PDF. Courts are now requesting PDF's for discovery, and the use of the format will only increase, IMHO.
TIFF, on the other hand, can be put together a lot of different ways (believe me, I have seen most of them). The TIFF that you create today might not be readable in a few years. Sure, if you do a typical Group4, you're probably ok, but the software that creates TIFFs sometimes leaves a lot to be desired. TIFF is a good container, don't get me wrong.
But if you asked me what format would be readable in 25 years by most computers, I would still put my money on PDF.
I had the same concerns when I undertook a document scanning project when I worked for a major municipality. Was worried about whether PDF's or TIFFs would be best for the long term, especially since we were scanning in several million documents.
That was ten years ago (1998), and both are still viable. We went with multipage TIFFs, btw.
Thanks for reading!
Eric
imagemagick is Slow slow slow for multipage tiffs. On windows, creating and splitting multipage G4 tiffs is 20 (TWENTY) times faster using tifftools.
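It isn't clear which "tifftools" is meant here. Assuming something like the tiffcp and tiffsplit utilities that ship with libtiff (an assumption, not necessarily the same package), joining and splitting would look roughly like this. The function names are made up for illustration:

```shell
# Join single-page TIFFs into one multi-page file with Group 4 compression.
# Usage: join_tiffs all.tiff page1.tiff page2.tiff ...
join_tiffs() {
    out=$1
    shift
    tiffcp -c g4 "$@" "$out"
}

# Split a multi-page TIFF into single-page files whose names start
# with the given prefix.
# Usage: split_tiff all.tiff page_
split_tiff() {
    tiffsplit "$1" "$2"
}
```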
Although the TIFF format is open and it is widely used in archiving systems, it is not particularly suited for an archive you set up new. The main reason is that many applications that generate TIFF may throw in their own proprietary stuff and lock you into a specific viewer. Also, you cannot do a text search of content in TIFF.
When you discuss archives you think about looong times. Typically 10 to 50 years of retention with the odd exception where eternity is desired.
Hence "plain" PDF is probably even worse than TIFF. One problem here is the included resources (fonts) and references (http links), which are mostly left out in order to save disk space. The other problem is that there are so many "plain" PDF versions to choose from and none of them will last 10 to 50 years.
However, PDF is a good technology and therefore the PDF/A standard was developed. It is designed especially to deal with loooong term issues, is currently readable through almost any PDF reader and will be maintained by most sensible PDF readers for the years to come. There is NO vendor lock-in, you can put text in a PDF/A document and run searches against it. But most importantly, NO proprietary stuff can be shoved in as it would result in an invalid document (a PDF document maybe but not a PDF/A document.)
With the price of current disk space you should NOT make file size a defining criterion in your archiving policy. Only on z/OS disk space comes at absurd and ridiculous prices. If you can, try aiming for an archiving solution on Unix, Linux or even Windows.
I am in the archiving business. At the moment PDF/A is the only format suitable for archiving.
I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
I would recommend OCRing these documents and storing them in some kind of text-based format (in addition to the graphical format of your choice). If you have particularly voluminous back-catalogues of these documents you'll be very thankful in the future if you have the option to search-enable this textual content.
A graphic image of text is like a wax apple - it looks and tastes like a replica.
PDF makes sense for document signing, security, and damage detection. TIFF does not have any of this important security and data integrity protection by itself.
PDF also allows for the same compression on the scanned image that TIFF does, as well as much better compression methods available to it.
TIFF, while well-understood in the archival industry, has rather fledgling support in the free *NIX world--especially multi-page TIFF.
Finally, with PDF, you can preserve both the image and the OCR data all in the same file. That's impossible with TIFF.
And, anyway, it's not 1985 anymore.
Kriston
I have a similar issue, but have chosen PDF because they meet the digital signature requirements of most professional licensure boards (architects and engineers worry about this stuff). It's not a large hurdle, just that the documents can be externally verified against a publicly available key. Adobe lets you do that for free (well, assuming you have their s/w; I can post a key on my website for a 3rd party to install and verify the signature).
This isn't a high-crypto requirement area - you can easily fake a paper document, and the standard isn't too much higher for A/E work, but it has to be sealed and marked for the inability to change, and the certificate must be publicly available and easily verified.
Is it just my observation, or are there way too many stupid people in the world?
I cannot find an analogy for how fundamentally incorrect the submitter's mental model of PDFs and TIFFs is. Think of PDF as a container format: you can compress the images inside the PDF to your heart's content, much smaller than TIFF will do, since it can use JPEG or PNG or whatever format you want. Tools->Print Production->PDF Optimizer. It even has OCR and some scanned image auto cleaning. The easiest thing is just to have them change their scan resolution, down to say, 150 DPI and B+W instead of color.
Is there anything better than clicking through Microsoft ads on Slashdot?
We have a document management system. In it we had to make the decision for PDF or tiff. We opted for tiff. It had nothing to do with the file size. The deciding factor was that we could find FAST tiff viewers all day and night. It's probably not that PDF as a format is that much more bloated, but the readers, especially Acrobat Reader, take a LOT longer to start up.
We use an ActiveX control called alternatiff to view them in the browser (and yes, it does multipage). The control loads in the browser VERY fast. Acrobat embedded in a page is painfully slow to load, even if you just do a page re-load.
Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.
I've seen these multipage TIFFs you are talking about and hate them. They invariably require some kind of non-free viewer that sucks next to any free or non-free pdf viewer. TIFF compression itself has lots of vendor-specific schemes and you would be better off with pdf or png as alternatives. If you have not been bitten by a TIFF format, you have not been using TIFF long enough. If you add text to your TIFF, the size will probably grow to the same size as a pdf of the same document, but you trade the scalable vector nature of PostScript for a bitmap.
I don't understand why anyone would move from tried and true pdf systems for a TIFF system unless they want to lose text search for employees, regulators and the public. Handing people image based pdf instead of text files or normal pdf is a standard practice for the Bush administration that borders on criminal obstruction of justice.
Send them a quotation. If the money looks good, do it and don't bitch about it on slashdot. If it does not look good, decline the job and don't bitch about it on slashdot either. Either way, don't bitch about it on slashdot.
Legal service firms work with all of these PDF and TIFF variants all of the time. They should be able to kick out whatever you need at x cents per page (which will usually be cheaper than your time/money)
The weird TIFF formats are used for various document management products, so it really depends mostly on your workflow.
Business. Numbers. Money. People. Computer World.
If you look down a little more you see the same person slamming TIFF for inconsistency but still recommending it without saying why. This would only be strange if he did not also say, "searchable text (a Microsoft addition, but something that actually adds value)."
Also courts have requirements for electronic document formats and there is nothing non-standard about an "image based pdf" (these also support searchable OCR full text).
Business. Numbers. Money. People. Computer World.
The gold-standard tool for this is PDF2IMG which uses Adobe's own PDF rendering library but it'll set you back a few thousand dollars.
Ghostscript is good but it isn't perfect: it does choke on some PDFs, misrenders some and won't pick up non-embedded TTF fonts, only external PS fonts. It also doesn't do any anti-aliasing so you probably want to render large and sample down and (IIRC) there's a max image size it can render. But by and large it does just work.
Honest to God, what you're talking about is a trivial task. Use ghostscript, or, if you don't have the time or interest, contact me with your requirements and I'll write it for you gratis, provided it remains F/OSS.
Swedish plasma phys. PhD student; MSc EE; knows maths, programming, electronics; finance interest; seeks opportunities
> One of the reasons they gave for this is that TIFF files require less storage space. While that is true, I wonder if TIFF is the best format going into the future. Are TIFFs better than PDFs for future use?
That is not really true. PDFs use different formats internally to store image data -- if you need smaller files, set your settings correctly while generating the pdf, or run them through an optimizer that reduces the embedded object sizes. TIFF is a container format, and as such, to support it well into the future you have to gamble on what the main branch meta-formats will be. PDF is an open standard and will be around for a very long time.
I hear a lot of talk about how to convert back and forth, but nobody's mentioning the thing that I would consider the most important:
When you convert from .pdf to .tif, are you losing data?
Most of these convert scripts seem to work by starting Ghostscript and rendering a .tif out of your PDF. This is a *terrible terrible idea*. What you'd really want to do is reach into the PDF itself, and extract the lossless images perfectly. Anything else is like printing the .PDF and scanning the printout - you might lose pixels, you might gain extra pixels, and you almost certainly won't be perfectly aligned with the "pixel grain" of the original image.
Unless you can guarantee that you'll be pulling out, pixel-by-pixel, the exact original data, I would stick with PDFs.
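One way to do that kind of extraction, as a sketch: the pdfimages tool from Poppler copies the embedded images out of a PDF instead of re-rendering the pages. Nobody in the thread mentions it, so whether it is installed is an assumption, and the function names below are made up:

```shell
# List the embedded images (page, dimensions, colour, encoding) without
# extracting anything; useful for checking what the scanner actually stored.
list_pdf_images() {
    pdfimages -list "$1"
}

# Write each embedded image as a TIFF file named <prefix>-NNN.tif.
# Images stored with lossy encodings such as JPEG are decoded as-is;
# this recovers the stored pixels, not whatever was lost at scan time.
extract_pdf_images() {
    pdfimages -tiff "$1" "$2"
}

# Example: extract_pdf_images scans.pdf page
```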
Breaking Into the Industry - A development log about starting a game studio.
As someone who writes software to view PDFs, I can tell you this is completely pointless, since anything that saves scanned documents into PDF is really storing it as a TIFF image inside of the PDF anyway. The PDF container adds useful features for metadata, and is well documented, so shouldn't add any future-proof issues. And the overhead is probably a few kilobytes.
Postscript begs to differ - {G}.
An Invisible Entity of Vast Power whose existence must be taken on faith alone: Liberal Media
TIFF files have a maximum size of 4GB. (The "value offset" field of an IFD entry is a 32-bit value.) You can exceed this with 50 noisy pages. PDF files have a maximum size of 10 to the tenth power bytes. (The byte offset in a cross-reference table entry is a ten-digit decimal number.) That's 2.3 times the maximum TIFF file size.
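To check that ratio, here is a quick sketch using the two limits exactly as stated above (32-bit offsets for TIFF, ten decimal digits for PDF). The function name is made up, and the shell is assumed to do 64-bit integer arithmetic:

```shell
# Ratio of the PDF size limit to the TIFF size limit.
size_limit_ratio() {
    tiff_max=$((1 << 32))      # 4294967296 bytes, from a 32-bit offset
    pdf_max=10000000000        # 10^10 bytes, from a ten-digit decimal offset
    # Shell arithmetic is integer-only, so scale by 100 to keep two
    # decimal places (truncated, not rounded).
    r=$((pdf_max * 100 / tiff_max))
    printf '%d.%02d\n' $((r / 100)) $((r % 100))
}

size_limit_ratio   # prints 2.32
```

That matches the "2.3 times" figure.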
I have written software to create both TIFF and PDF files. I would use PDF for archiving. Even today, it's tricky to find a TIFF reader that will run on all the important platforms and handle the variety of compression flavors (e.g., JBIG2.)
If they have any legal concerns (like their docs will be used in court), then stay with PDF.
Scalable graphics work regardless of the size of your screen or resolution of your printer. These things may change drastically in the next 10 years. Bit mapped stuff already looks like poop at any but native size. A crummy 75 dpi scan is going to be the size of a postage stamp on a good monitor and it won't blow up gracefully. That might take up less storage space but the people who read it will have eyeballs that bleed and much of the size advantage is lost when you put the text in anyway. Wouldn't it be better to use software that writes high quality pdf in the first place and then just archive it?
For scientific reports, this is a big deal. Detail scales in pdf. A pdf can have eps graphs that can be placed at any size and never lose any of their detail until it is printed. Don't forget about hyperlinks, which most TIFFs won't have. I can share these files as dvi, ps or pdf. Most people prefer the pdf. Converting them to TIFF would essentially ruin them.
You understand we're talking about document archiving, right? Postscript is a terrible format for raster images.
Business. Numbers. Money. People. Computer World.
I had to create a bash script a while ago to convert color postscript to black and white tiff. I used netpbm in conjunction with ghostscript to do it. Those 2 programs together can do just about anything you want for batch file graphics. Not sure if it will work for .pdf, but .pdf is pretty close to postscript and probably easy enough to convert to eps or straight ps. Gimp and, as stated above, imagemagick also have a lot of useful batch processing tools, but you have to learn their script language (in the case of gimp) to use them, and both are much slower than netpbm.
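The script itself isn't posted, so what follows is only a guess at the shape of such a pipeline, with Ghostscript rasterizing and Netpbm doing the colour reduction and TIFF writing. The function name, the 300 dpi resolution and the simple threshold are assumptions. Ghostscript does read PDF input directly, so the same function should accept a .pdf without converting it to PostScript first:

```shell
# Hypothetical helper: colour PostScript (or PDF) -> black-and-white TIFF.
# Assumes a single-page input; some Netpbm filters only process the first
# image in a multi-image stream.
# Usage: ps_to_bw_tiff input.ps output.tiff
ps_to_bw_tiff() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=ppmraw -r300 -sOutputFile=- "$1" |
        ppmtopgm |              # colour -> greyscale
        pgmtopbm -threshold |   # greyscale -> 1-bit with a plain threshold
        pnmtotiff -g4 > "$2"    # 1-bit -> TIFF with Group 4 compression
}
```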
Implementations in minutes. Converts to most anything. Not the most efficient though
-- Programming with boost is like building a house with lego. It's cool but I wouldn't want to live in it
I have done this very thing, converting multi-page TIFFs to PDFs using 'convert' - assuming a Linux platform naturally!
In my shop, we have a customer that sends us hundreds of multi-page PDF images a day to process (they are scans of freight invoices). We decided to convert them to multi-page TIFF files for easier handling (we developed our own data entry application that displays the images without using an external viewer).
We use a batch conversion process using GhostScript. If you're on a Windows workstation, use the console build, gswin32c (I think that's the name), rather than the windowed gswin32. It goes considerably quicker and won't open and close a GS window for each PDF file. On an average PC it can convert about one PDF per second (most are less than 10 pages). The PDFs with 150+ pages usually take 30 seconds or so.
We chose TIFF because we could easily write our own viewer. We're not beholden to Adobe or Fox Software to provide software for viewing and manipulating the documents. The PDF viewer plugin from Fox was several thousand dollars and I wrote a TIFF viewer in C++ in just a few days.
I won't speak to why or whether you should do it, but here are a few options for how.
Doculex has an app called MPTiffIt that will do single to multi or multi to single page tiff conversion. You'd need to convert the PDFs to single page tiff via Save As (or perhaps a Batch Process), then recombine them with MPTiffIt.
Or, you could use a Tiff printer driver along with a batch printing software.
Personally, I'd use L.A.W. (Legal Access Ware, created by Image Capture Engineering and now owned by Lexis Nexis), which is a full production scanning, OCR and e-doc conversion suite. You can import any type of doc that can be printed and output single or multi-page tiffs or PDFs with a variety of database/unitization load files. Similar products by IPro and Doculex exist. Any litigation support vendor should have one of these tools, and would likely charge a relatively nominal fee (per GB) to perform the conversion for you.
While you're at it, why not decode the data punched on each card and then just store the microfilm image and the decoded data, discarding the image of the rest of the card? That'd make things a lot more efficient.
> standard practice for the Bush administration
Did someone mod you "informative" because there's no "-1, Rabble Rouser" option?
I used png files when I was scanning some text books, taking the following steps to make good quality pdfs from the page images:
1) Scan each page at around 300 dpi, producing a large TIFF file (around 8 MB each).
2) Use imagemagick to reduce the number of colours to 16 grayscale with some band-pass filtering and convert to a lossless compressed png file. This helps improve the contrast & quality of the image by reducing the subtle changes in colour caused by the scanned pages not being completely flat, and by any page transparency.
Each file ends up smaller than the original by a factor of around 10.
3) Embed the pages into a pdf file by using pdflatex. The final result is scanned text at good resolution (300 dpi) for a relatively small file size. For example, one scanned chapter went from: 183 MB (tiff) -> 13 MB (png) -> 8 MB (pdf).
Also, looks nice on screen as well as printed out.
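Steps 2 and 3 above could be sketched roughly as follows. The exact ImageMagick options used are not given, so the ones here are assumptions, the band-pass filtering is not reproduced, and both function names are made up:

```shell
# Step 2 (sketch): convert a scanned page to greyscale, cut it down to
# 16 levels, and save it as a PNG next to the original.
page_to_png() {
    convert "$1" -colorspace Gray -colors 16 "${1%.*}.png"
}

# Step 3 (sketch): print a LaTeX file with one full-page image per PNG.
# Usage: make_tex page1.png page2.png > book.tex && pdflatex book.tex
make_tex() {
    printf '%s\n' '\documentclass{article}' '\usepackage{graphicx}' \
        '\usepackage[margin=0pt]{geometry}' '\begin{document}'
    for png in "$@"; do
        printf '\\noindent\\includegraphics[width=\\paperwidth,height=\\paperheight,keepaspectratio]{%s}\\newpage\n' "$png"
    done
    printf '%s\n' '\end{document}'
}
```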
You'll do even better on file size if you convert scans to DjVu: Scan to pnm (or pgm or pbm), convert using c44 (or cjb2 for pbm), and you can skip the TeX step (use djvm to assemble pages). If the text you're scanning is black and white, you'll be stunned by the quality and file size reduction.
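A rough sketch of that DjVuLibre workflow is below. The function names are made up and the 300 dpi value is an assumption carried over from the scanning steps above:

```shell
# Bitonal page (PBM) -> single-page DjVu using the JB2 encoder.
encode_bw_page() {
    cjb2 -dpi 300 "$1" "${1%.pbm}.djvu"
}

# Greyscale or colour page (PGM/PPM) -> single-page DjVu using the
# wavelet encoder.
encode_photo_page() {
    c44 -dpi 300 "$1" "${1%.*}.djvu"
}

# Bundle single-page DjVu files into one multi-page document.
# Usage: bundle_pages book.djvu page1.djvu page2.djvu ...
bundle_pages() {
    out=$1
    shift
    djvm -c "$out" "$@"
}
```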
OCR doesn't work that well, you will still need the images. So you won't gain anything from doing TIFF+text.