Multi-page PDF To Multi-page TIFF and Archiving?

Are these things images or documents? by mrchaotica · 2008-06-24 04:30 · Score: 1, Interesting

If they're images, then you should use TIFF (or perhaps PNG). However, it doesn't make sense for them to be "multi-page." If they're documents, then PDF is appropriate.

I would suggest that your client doesn't know WTF they want.

--

"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

Re:Are these things images or documents? by pclminion · 2008-06-24 04:35 · Score: 4, Informative

If they're images, then you should use TIFF (or perhaps PNG). However, it doesn't make sense for them to be "multi-page." If they're documents, then PDF is appropriate.

Multi-page TIFF is well supported in the industry. There is nothing "weird" about it. It even supports embedded, searchable text (a Microsoft addition, but something that actually adds value). PDF archival can be difficult to do correctly. At the very least you want to use a product which supports PDF/A, followed up with some serious validation to make sure the results are actually compliant. Otherwise you may get bitten decades down the road. Searchable TIFF, on the other hand, will be around for freaking ever.
Re:Are these things images or documents? by cduffy · 2008-06-24 04:35 · Score: 1

Yes, it makes sense for them to be multi-page. TIFF is a multi-page format -- it actually supports a huge number of subformats (and different compression algorithms) in addition.
Re:Are these things images or documents? by algae · 2008-06-24 04:43 · Score: 1

You're wrong - multi-page TIFF is a real image format, as about 2 seconds with Google would have told you if you'd bother to check. We used it extensively at a copy shop I used to work at that served law firms. What the OP wants is to use ghostscript as a converter from PDF to TIFF. If I recall correctly, you can specify multi-page TIFF as an output format.

--
Causation can cause correlation
Re:Are these things images or documents? by BadMrMojo · 2008-06-24 04:51 · Score: 3, Informative

...it actually supports a huge number of subformats (and different compression algorithms) in addition.

TIFF: Thousands of Image File Formats
If you do wind up converting to TIFF, then remember to document everything in excruciating detail. With thousands of possible combinations - each of which is a perfectly valid TIFF image - you may encounter some issues if someone's using a less robust reader and assuming the wrong compression algorithm, byte order, data striping or photometric interpretation.
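One low-effort way to do that documenting, as a rough sketch: save whatever libtiff's tiffinfo reports about each file into a text file next to it. This assumes tiffinfo is installed; the function name document_tiff, the TIFFINFO override and the .info.txt suffix are made up for illustration.

```shell
#!/bin/sh
# Sketch: keep a plain-text record of each TIFF's parameters beside the file.
# Assumes libtiff's tiffinfo is on PATH; TIFFINFO (hypothetical) swaps in another tool.
document_tiff() {
    tiff=$1
    sidecar="$tiff.info.txt"    # hypothetical naming convention
    # tiffinfo prints the tags it finds (compression scheme, photometric
    # interpretation and so on); store that report unchanged.
    "${TIFFINFO:-tiffinfo}" "$tiff" > "$sidecar"
}
```

Calling document_tiff "scan 0001.tiff" would leave "scan 0001.tiff.info.txt" alongside the image, so the record travels with the archive.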
Re:Are these things images or documents? by commanderfoxtrot · 2008-06-24 05:01 · Score: 2, Informative

Both PDF and TIFF handle multiple pages, and have done so for years.
Either would be suitable for this application.
If you really want to convert these, Imagemagick would be the best tool to use.
However, it seems a little daft for storage space to be the main reason for changing: you're simply exchanging one compressed image format for another. You may save 10%, e.g. if you move from JPEG in the PDF to PNG/similar in the TIFF, but is that really worth the effort?!
If they really want shiny TIFFs, it would be easy to have an Imagemagick script to convert single PDFs to TIFF on-demand.
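A minimal sketch of such an on-demand script, assuming ImageMagick's convert is installed (it hands PDF reading to Ghostscript). The function name, the CONVERT override, the 300 dpi density and the LZW compression are illustrative choices, not anything the client specified.

```shell
#!/bin/sh
# Sketch: convert a single PDF to a TIFF with ImageMagick.
# Assumes ImageMagick's convert is on PATH; CONVERT is a hypothetical override.
pdf_to_tiff_im() {
    src=$1
    dst=${2:-${src%.*}.tiff}    # default: same name with a .tiff extension
    # -density before the input sets the rasterising resolution (dpi);
    # -compress after it sets the compression of the TIFF being written.
    "${CONVERT:-convert}" -density 300 "$src" -compress LZW "$dst"
}
```

Usage would be pdf_to_tiff_im drawing.pdf, one file at a time, which also keeps each run small.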

--
http://blog.grcm.net/
Re:Are these things images or documents? by MBGMorden · 2008-06-24 05:32 · Score: 4, Informative

Multi-page TIFF is well supported in the industry. Better supported than PDF in some cases. Our records management (in addition to keeping electronic scanned copies) still insists on having a microfilm copy of all of our retained documents. We can send digital copies to a processing company to have them processed, but they don't accept PDF documents - only TIFFs (multi-page is acceptable). Given that our internal document management is all in PDF, I ended up having to find a program to convert all of that information about a year ago (though the name of the program we ended up using escapes me - I wouldn't recommend it anyway, since it crashed for me very frequently).

--
"People who think they know everything are very annoying to those of us who do."-Mark Twain
Re:Are these things images or documents? by doctor_nation · 2008-06-24 05:44 · Score: 1

I assume that he means that it doesn't make sense for an image to be multi-page. In other words, it should be a single page and let the printer driver/program work out the paging. Of course, if they are scanning into PDF in the first place I would assume that they aren't images.
Re:Are these things images or documents? by doctor_nation · 2008-06-24 05:47 · Score: 1

Oops, I actually followed the link to find out what an aperture card is and I see that I'm wrong- they are basically engineering drawings so they are images (with text annotations). But that doesn't explain why they are multi-page when a PDF page can be any size you want.
Re:Are these things images or documents? by cduffy · 2008-06-24 05:51 · Score: 3, Informative

Bah. Use libtiff, document that only readers which also use libtiff (>= the version you're using) are supported, and you're done.
Re:Are these things images or documents? by alan_dershowitz · 2008-06-24 06:31 · Score: 2, Interesting

Imagemagick does its conversion entirely in-memory, so if a document is more than a hundred pages or so you are going to have some problems.
Re:Are these things images or documents? by Slacksoft · 2008-06-24 06:51 · Score: 1

I used to work at a company in the imaging/scanning/OCR market for *cough-large-printer-companies-cough*. The best format to save the images, IMO, is LZW TIFF. PDF is fine and all for storing images, and provides a compact file size. However, you can do LZW TIFF images, and it is easier to convert TIFF images into other formats than to try the same with PDF. On that same note, TIFF still retains all the image layers just the same. I've been out of the loop on processing techniques, but if you want, try contacting Jetsoft Development. I believe their website is http://www.jetsoftdev.com/ or www.scanhelp.com; check with them for available options.
Re:Are these things images or documents? by mrchaotica · 2008-06-24 07:03 · Score: 3, Insightful

Yes, it makes sense for them to be multi-page. TIFF is a multi-page format...

Having the format support a thing, and having that thing make sense are two different things. For example, Excel supports being used as a database... but does it make sense?

My point: use image formats for images, and document formats for documents. If the things you're trying to store are images, don't put them in a document format, and if they're documents, then don't put them in an image format.

Also, if TIFF is designed to store both images and documents, then I question whether it is too general to do either of them well. And your mention of "subformats" makes me think my concern is well-founded!

--
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Re:Are these things images or documents? by tepples · 2008-06-24 07:07 · Score: 1

they are basically engineering drawings so they are images (with text annotations). But that doesn't explain why they are multi-page when a PDF page can be any size you want.

Different pages might cover different views or different parts of a product.
Re:Are these things images or documents? by Anonymous Coward · 2008-06-24 07:11 · Score: 1, Interesting

... PDF archival can be difficult to do correctly. At the very least you want to use a product which supports PDF/A, followed up with some serious validation to make sure the results are actually compliant...
Daily I convert large numbers of PDFs to multi-page or single-page TIFFs, and on a daily basis I come across PDF errors. Unless you do extensive and exhaustive error checking/trapping, I would archive in TIFF format. The TIFF format is considerably less prone to errors within the document. TIFF is widely used and accepted, especially in the United States Court of Law.
Re:Are these things images or documents? by mrcaseyj · 2008-06-24 07:24 · Score: 2, Informative

I agree that unless the files are extremely huge or extremely numerous then storage space probably shouldn't be a concern, because it's cheap compared to your time and getting cheaper. But if storage space is a concern then you might look into the TIFF format used by the patent office. Apparently it uses a form of lossless compression taken from fax machines and gets much better compression than many other common formats on black and white (no greyscale) documents. If it's the patent office's standard archive format then it will probably be supported for a long time. PDF can probably use the same compression as well though. I was going to mod up someone below http://ask.slashdot.org/comments.pl?sid=593693&no_d2=1&cid=23920133 who recommended PDF/A as the only archive format to use.
I was also going to mod up someone who recommended http://ask.slashdot.org/comments.pl?sid=593693&no_d2=1&cid=23921141 that if you use tiff just make sure it can be read by libtiff and you should have no worries about future readability.
Re:Are these things images or documents? by cduffy · 2008-06-24 07:39 · Score: 1

Sure, it makes sense. Look at faxes -- they're multiple pages of bitmapped images. Does that compose a "document"? Maybe. Is it something TIFF is good for? Absolutely! There are standardized tags used for storing extra information about fax transmissions in TIFF documents, and there's a great deal of software which makes use of that metadata. The same thing is true of archiving documents composed of scanned images -- there are a great many tag types associated with metadata such as scanner model and configuration, and software support for those tags is widespread.
I've also seen good use made of TIFF in the context of medical imaging (where flexibility and storage of multiple pages/layers and keeping arbitrary metadata is important), though there are other, more specialized formats which are also important in that space. If you have multiple images in a file (ie. multiple layers from a scan), does that make it a "document"?
The distinction between an "image" and a "document" is a hazy one at times, and I don't think we have a good enough description of the problem space the parent is working in to make the kind of judgment you're arguing for. Certainly, whether something is composed of more than one page is not the place to draw the line.
Re:Are these things images or documents? by mrchaotica · 2008-06-24 07:55 · Score: 1

Ah, so TIFF is really general. In that case, saying you're storing something "in TIFF" doesn't really say much, does it? It's kind of like saying you're storing it "in XML" -- what does that mean, if you don't specify a schema too?

The distinction between an "image" and a "document" is a hazy one at times, and I don't think we have a good enough description of the problem space the parent is working in to make the kind of judgment you're arguing for. Certainly, whether something is composed of more than one page is not the place to draw the line.

But if something's more than one page, then it must be more than one image too. So really, you're talking about a sequence or collection of images. These, I agree, can come in several categories: is it a sequence of nearly identical images? Then it's a movie, and it ought to be stored in some movie format. Is it a sequence of related but not nearly identical images? Then it's probably a slide show, which is a kind of document, and ought to be stored in a document format like PDF or .odp (OpenDocument Presentation). Is it a collection of images (in which order is not particularly important)? Then it ought to be stored in a folder of individual image files. And so on.

--
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Re:Are these things images or documents? by cduffy · 2008-06-24 09:20 · Score: 1

Is it a sequence of related but not nearly identical images? Then it's probably a slide show, which is a kind of document, and ought to be stored in a document format like PDF or .odp (OpenDocument Presentation).
If I get a multi-page fax, I don't want it delivered as a slideshow -- I want a PDF or a multipage TIFF. Yes, PDF is a sensible choice in this situation -- but what basis do you have for saying that multipage TIFF isn't? It's widely used in that case, and there's software support for the tagging format (so I can use my fax-viewing software to pull up things like the caller ID information or the logs indicating the model of fax machine used by the sender).
Frankly, I think you're arguing at this point for the sake of being contrary.
Re:Are these things images or documents? by DimmO · 2008-06-24 10:50 · Score: 1

I can just imagine the OP's rollercoaster of emotions reading the parent: "ooh awesome someone knows a program to convert pdf to tiff. Exactly what I want! oh wait, he forgets what it's called. dammit. but, it crashed all the time. meh."
Re:Are these things images or documents? by mfnickster · 2008-06-24 13:28 · Score: 2, Insightful

> Scanned documents are both images and documents.
Exactly. I don't mean to be rude, but the GP's comment is just silly.
"Documents" can be either graphical or text (or both). There's a reason why word processing formats allow embedding of images.
In fact, "document" has become an almost meaningless term since it can apply to so many types of data. About the only thing all the different "document" formats have in common is that their content is 2D!

--
"Slow down, Cowboy! It has been 3 years, 7 months and 26 days since you last successfully posted a comment."
Re:Are these things images or documents? by david@ecsd.com · 2008-06-24 23:28 · Score: 1

Couldn't Ghostscript do this? Tiff is an output device option. I don't think it'd take too much to jimmy up a script to process an assload of PDFs. (Unless, of course, that's the program that was crashing frequently...)
Re:Are these things images or documents? by cduffy · 2008-06-25 02:16 · Score: 1

TIFF for faxes was the best of bad choices. There are better options these days.
How are those options better? How is TIFF bad?
TIFF has direct support for the compression and image formats used by fax machines over the wire, so there's no transcoding needed to write a TIFF of an incoming fax (and an outgoing fax can be preprocessed for sending while keeping it in TIFF). Tools that you and I already have installed (ie. tiffinfo) can provide caller ID and other useful metadata. There's a complete toolchain built around TIFF. It works. Tell me what's broken, and then we can start to talk about doing the work to rebuild that toolchain against something else that fixes the problems.

Mac OS X has that functionality in Preview. by t-maxx+cowboy · 2008-06-24 04:31 · Score: 1

Apple was kind enough to build this functionality into Mac OS X in the form of Preview and Automator (or AppleScript).

--
Regards,

Ryan Pritchard
Fun Extends All Basic Life Expectancies

Ghostscript by tomtomtom777 · 2008-06-24 04:33 · Score: 2, Informative

Ghostscript can do the conversion from the console.

You can write a simple shell script to convert all files.

Re:Ghostscript by Tsunayoshi · 2008-06-24 08:40 · Score: 1

Solaris 10, (and I would presume Linux as well) has a package of TIFF utilities.
We didn't do too much pdf2tiff (most stuff came in as TIFFs), but I don't remember any major issues the other way (tiff2pdf).
I imagine a fully featured converter application would do a better job though.

--
"Get a bicycle. You will not regret it, if you live." - Mark Twain, "Taming the Bicycle"

ImageSite by Thelasko · 2008-06-24 04:33 · Score: 1

We use a program called ImageSite that handles that. It uses TIF files. Why reinvent the wheel?

--
One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".

Re:ImageSite by Thelasko · 2008-06-24 05:29 · Score: 1

Let me clarify: ImageSite manages the files. It doesn't, to my knowledge, convert PDF to TIF. I think you can use a program that prints the PDF to a TIF file.

--
One of our competitors trademarked the term "hypothesis". From now on, we will call them "boneheaded ideas".

DjVu by cduffy · 2008-06-24 04:34 · Score: 3, Informative

The most effective compressors are commercial, but DjVu is a very effective image archival format; see DjVuLibre for the non-commercial tree.

Moving back towards the question in the article, I don't think there's much worry about either TIFF or PDF in terms of future proofing; they're both very widely used, have multiple implementations and third parties with substantial interest in keeping those implementations maintained, etc. The quality of TIFF implementations varies wildly, but the good ones are only going to get better, and I'd be shocked if libtiff ended up terminally bitrotted without a successor implementing a superset of its functionality inside my lifetime.

Nature of microfilm content? by mx90 · 2008-06-24 04:34 · Score: 1

You haven't specified what is on the microfilm chip in the card. If it's largely text, I can't see why you'd want to lose the embedded text (searchable etc. - TIFF would require OCR at some point..?) in PDF.

Re:Nature of microfilm content? by BobMcD · 2008-06-24 09:29 · Score: 1

You haven't specified what is on the microfilm chip in the card.

Relax. I'm pretty confident it isn't pr0n...

Tiff is better by gweihir · 2008-06-24 04:35 · Score: 1

But size does not have anything to do with it. TIFF is far simpler in structure than PDF and therefore has better compatibility. TIFF is also well documented. Of course, they would have to use raw TIFF to get the advantages. The storage-space argument is secondary and matters only insofar as larger data sets have a higher risk of corruption.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.

Re:Tiff is better by pclminion · 2008-06-24 04:39 · Score: 4, Interesting

But size does not have anything to do with it. TIFF is far simpler in structure than PDF and therefore has better compatibility. TIFF is also well documented. Of course, they would have to use raw TIFF to get the advantages. The storage-space argument is secondary and matters only insofar as larger data sets have a higher risk of corruption.

I dispute the "well documented" claim. The TIFF standard is quite clear. Unfortunately, almost nobody adheres precisely to the standard. I work extensively with TIFF and PDF, and I have to say that the consistency I see in PDF is about 100 times more than what I see in TIFF. Your typical TIFF reader will contain thousands of hacks and workarounds for oddities that are produced by major players in the industry. While there is slightly non-compliant PDF, I have never seen things that even begin approaching the strangeness I see in TIFF on a daily basis. Having said that, I recommend TIFF plus search text metadata for archival, not PDF.
Re:Tiff is better by Anonymous Coward · 2008-06-24 05:26 · Score: 1, Interesting

I'd have to agree with that - I keep bumping into nasty combinations of old style JPEG in TIFF images combined with Wang annotations and highlighting - the choice of viewers that will cope with both of those at once is pretty limited
Re:Tiff is better by twitchingbug · 2008-06-24 07:01 · Score: 1

> Having said that, I recommend TIFF plus search text metadata for archival, not PDF.
Can I ask why? Your whole post seemed to slam TIFF - that it's too varied from the actual spec. How does that translate to being a good archival format? Or are you just saying, use TIFF if you follow the spec?
Re:Tiff is better by pclminion · 2008-06-24 07:12 · Score: 1

Or are you just saying, use TIFF if you follow the spec?

Yes, I stopped my explanation too soon. If you stick to what's actually written (not implied by the existence of viewers that are already out there) in the Baseline TIFF specification, your file will be viewable everywhere. As far as text metadata, the curse is also the beauty, because TIFF easily lets you place tagged data in a file and it will be ignored by any reader which doesn't understand that tag.

MS's searchable TIFF is your typical MS creation, but at least it's a working example of embedding text in TIFF. If the community would come together and develop a simple, standard format for embedding text, I see no reason why we shouldn't all start using it.

The advantage of a raster format like TIFF is that it guarantees the result down to the pixel level. While this is achievable in PDF/A as well by using raster images as the sole graphical elements, the standard does not enforce this, and allows far too much. IMHO, if you need validation tools just to check if your document is compliant to a certain spec, then that spec is too complex for something as important as long-term archival.

Don't do this. But if you insist, here's how. by Cheesey · 2008-06-24 04:38 · Score: 5, Informative

Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway.

Don't use TIFF. Stay with PDF. PDF is what all the big digital libraries are using. It's a proper standard, it's readable and writable by lots of free open source software, so even if Adobe disappears in a puff of intellectual property, you'll still be able to read your documents.

TIFF, on the other hand, is a container format (like AVI, but worse). It isn't fully supported by every program - what sort of TIFF do you want, anyway? Compressed with LZW? With RLE? Not compressed at all? There's free software that will read and write the most common types of TIFF, so you can certainly do it, but why give up the convenience of using PDF?

Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one?

Use ghostscript. Use something like the following command line:

gs -dNOPAUSE -sDEVICE=tiffgray -sOutputFile=output%02d.tiff -dBATCH -r300 input.pdf

This turns input.pdf into a series of 300 dpi tiff files, one for each page, called output01.tiff, output02.tiff, etc. Change the DEVICE to get a different sort of tiff file, and use gs --help to get a list of options. You can easily wrap this command in a script of almost any sort to make the process fully automatic.
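For the multipage case the question asks about, here is a hedged sketch. As far as I know, Ghostscript's TIFF devices write every page into a single file when OutputFile contains no %d. The function name and the GS and DEVICE overrides are invented for illustration; tiffgray and 300 dpi are simply carried over from the command above.

```shell
#!/bin/sh
# Sketch: one PDF in, one multipage TIFF out.
# Assumes Ghostscript (gs) is installed. GS and DEVICE are hypothetical overrides.
pdf_to_multipage_tiff() {
    src=$1
    dst=${2:-${src%.*}.tiff}    # default: same name with a .tiff extension
    # No %d in OutputFile, so pages are not split into separate files.
    "${GS:-gs}" -q -dNOPAUSE -dBATCH -r300 \
        -sDEVICE="${DEVICE:-tiffgray}" \
        -sOutputFile="$dst" "$src"
}
```

Something like DEVICE=tiffg4 pdf_to_multipage_tiff input.pdf would then ask for a bitonal fax-compressed file instead of greyscale.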

--
>north
You're an immobile computer, remember?

Re:Don't do this. But if you insist, here's how. by mobby_6kl · 2008-06-24 07:47 · Score: 1

Yeah, not to mention that PDF supports several compression formats for images including zip, jpeg, and jpeg2000 at various quality levels if storage is such a concern. Just playing around with the image from the wiki article, file sizes range from 4795kb for uncompressed, to 285kb for maximum quality (but lossy) jpeg2k. A jpeg compressed tiff is about 660kb.
Now, I've never had to deal with aperture cards IRL, so I'm not sure about the following: a lot of space seems to be wasted to preserve a few pieces of the human readable printed data on the cards. If this is unimportant, just thresholding the whole image except for the embedded film can provide huge savings and/or leave more bits for the significant parts.
PS. The default Windows picture viewer doesn't support jpeg and zip compressed tiffs.
Re:Don't do this. But if you insist, here's how. by jabuzz · 2008-06-24 12:11 · Score: 1

Anything a TIFF can do, a PDF can do better. I can take the very same images that you have in your TIFF, compress them with a better algorithm and embed them into a PDF, and end up with something smaller.
All I can conclude is that you were clearly utterly incompetent in the generation of PDFs.
Re:Don't do this. But if you insist, here's how. by twistedcubic · 2008-06-28 22:09 · Score: 1

I don't think 300dpi is high for archival purposes. I think it's around the minimum. I can convert handwritten bitonal 8x11in 450dpi scans into 30-40 kilobyte Djvu files.

Comment removed by account_deleted · 2008-06-24 04:43 · Score: 2, Informative

Comment removed based on user account deletion

Re:Adobe Acrobat Professional will do it by pdboddy · 2008-06-24 04:43 · Score: 1

But it'll save it as multiple TIFF files, I don't think Acrobat has multiple-page TIFF capability.

--
Julie Moult is an idiot.

Media or format? by Pig+Hogger · 2008-06-24 04:47 · Score: 1

You seem to be confusing the media with the data format. Whether it is TIFF or PDF is irrelevant. It's all ones and zeroes in the end, whether it is stored on punch cards, floppies, CDs or Flash RAM.

In any case, the PDF and TIFF file formats are well-documented, and even if they ever do go extinct despite their widespread use (bloody unlikely), it would always be possible to write a program to convert them into the format-du-jour, provided, of course, you are able to read the media...

Re:Some Solutions by alanthenerd · 2008-06-24 04:52 · Score: 1

Or under windows just use ImageMagick
http://www.imagemagick.org/script/binary-releases.php#windows

PDFs... by pdboddy · 2008-06-24 04:54 · Score: 2, Informative

Stick with PDF. Chances are, neither PDF nor TIFF will vanish overnight. I'd say PDF is easier to work with, even with minimalist free tools. Since either one is technically "good" for archiving, why do more work than you really need to? Even with batch processing it'd be a pain.

Acrobat has batch processing, and can convert pdfs to TIFF, JPEG, PNG and more. That would be my suggestion if you are really going to convert to TIFFs.

--
Julie Moult is an idiot.

Re:PDFs... by Gta-Klue · 2008-06-24 05:02 · Score: 1

I disagree, as the storage difference between a 5 MB to 18 gig PDF and a 250 KB to 1 MB TIFF is huge! Where I work we do PDF to TIFF conversion just for that reason, mainly because SharePoint has an 8 MB limit per document you can upload.

--
This is PURE EAU DE TROLLETTE
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
Re:PDFs... by clodney · 2008-06-24 05:45 · Score: 1

Don't forget that TIFF has a 4GB limit, due to all the file offsets being encoded as 32 bit values. If a TIFF tag can not store the data directly in the tag body, it stores the location of the data as an offset from the start of the file.
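The arithmetic, plus a quick way to see which flavour of file you are holding, as a sketch. It relies on the header layout: bytes 1-2 are "II" (little-endian) or "MM" (big-endian), and bytes 3-4 hold version 42 for classic TIFF or 43 for BigTIFF. The function name tiff_flavour is made up; od is the standard POSIX tool.

```shell
#!/bin/sh
# Sketch: report byte order and offset width from a TIFF header.
tiff_flavour() {
    # First four bytes as unsigned decimals, e.g. "73 73 42 0" for "II*\0".
    set -- $(od -An -tu1 -N4 "$1")
    [ "$#" -eq 4 ] || { echo "not a TIFF"; return 1; }
    case "$1 $2" in
        "73 73") order=little-endian; version=$(( $3 + 256 * $4 )) ;;   # "II"
        "77 77") order=big-endian;    version=$(( 256 * $3 + $4 )) ;;   # "MM"
        *) echo "not a TIFF"; return 1 ;;
    esac
    case "$version" in
        # 32-bit offsets can only address 2^32 bytes, hence the 4 GB ceiling.
        42) echo "classic TIFF, $order, 32-bit offsets (max $((1 << 32)) bytes)" ;;
        43) echo "BigTIFF, $order, 64-bit offsets" ;;
        *)  echo "not a TIFF"; return 1 ;;
    esac
}
```

Running tiff_flavour on each archive file is a cheap sanity check before anything gets close to the limit.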
Re:PDFs... by Eric+Smith · 2008-06-24 05:55 · Score: 2, Informative

If you have 5 MB PDF files that convert to 1 MB TIFF files, that means that the PDF files were encoded badly. There's no fundamental reason for PDF files to be significantly larger than TIFF files.
Re:PDFs... by pdboddy · 2008-06-24 07:20 · Score: 1

I could argue that you've got a poorly made PDF. Especially one that's 18gigs in size, I've never seen one that large in ten years of working in the industry.

Or I could argue that storage costs will go up over time, not down, so your argument over file size is moot. A properly made pdf will have a reasonable file size.

--
Julie Moult is an idiot.
Re:PDFs... by pdboddy · 2008-06-24 07:21 · Score: 1

Er, storage capacity will go up over time, not down. *wishes for an edit button*

--
Julie Moult is an idiot.
Re:PDFs... by vbraga · 2008-06-24 15:32 · Score: 1

There's already an extension for bigger TIFF files and a version of libtiff to support it.

--
English is not my first language. Corrections and suggestions are welcome.

pdf2tiff.sh by Anonymous Coward · 2008-06-24 04:58 · Score: 4, Informative

let's not reinvent the wheel -- I did this about 9 months ago //wolfmann -- and this code is Public domain (done on federal gov't time):

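# cat pdf2tiff.sh
#!/bin/bash
# Convert every PDF one directory level down into a fax-compressed TIFF.

device=tiffg3   # Ghostscript output device used for the conversion

for file in */*.pdf   # for each pdf
do
    # name up to the first dot, e.g. dir/report.pdf -> dir/report
    filename=$(echo "$file" | cut -d'.' -f1)
    if [ ! -e "$filename.tiff" ]
    then
        # show the command, then run it
        echo "gs -q -dNOPAUSE -dBATCH -sDEVICE=$device -sOutputFile=$filename.tiff $file"
        gs -q -dNOPAUSE -dBATCH -sDEVICE="$device" -sOutputFile="$filename.tiff" "$file" 2> /dev/null
    else
        echo "$filename.tiff exists! skipping..."
    fi
done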

Re:pdf2tiff.sh by Anonymous Coward · 2008-06-24 06:29 · Score: 2, Informative

Better method:

#!/bin/bash
for file in `find . -name '*.pdf'`
do
    filename=${file%%.*}
    ...

This will recurse into all subdirectories and it will work on files with multiple dots in the name (eg. applicant.12345.cv.pdf)
Re:pdf2tiff.sh by Anonymous Coward · 2008-06-24 07:14 · Score: 1, Informative

Well, if we want to correct others, -iname is the better option, but who cares, as long as we get modded informative for the basics...
Re:pdf2tiff.sh by Khopesh · 2008-06-25 05:57 · Score: 2, Interesting

This code will not work on files with spaces in them, because the back-ticks will expand spaces without escapes while the shell globbing used in the original code will escape them. Since most PDF titles I've seen use spaces in their names, this is important. The rest of those modifications will help, though.

Were I coding this script, I'd write two (since xargs can't use a function). One for the inner loop, and one for the outer loop (or if only running it once, I'd do the outer loop on the command line).

$ cat pdf2tiff.sh
#!/bin/sh
# convert specified files from PDF to TIFF
for file in "$@"; do
    target="${file%%.*}.tiff"
    if [ -r "$file" ] && [ ! -s "$target" ]
    then
        echo "gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg3 -sOutputFile='$target' '$file'"
        gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg3 -sOutputFile="$target" "$file" 2>/dev/null
    else
        echo "'$file' not readable or '$target' exists, skipping"
    fi
done

$ cat pdf2tiff-recursive.sh
#!/bin/sh
# convert all PDFs to TIFF format within supplied directory tree (default: $PWD)
find ${1:-.} -name '*.pdf' -print0 |xargs -0 ./pdf2tiff.sh

--
Use my userscript to add story images to Slashdot. There's no going back.
Re:pdf2tiff.sh by Khopesh · 2008-06-25 11:23 · Score: 1

Oops, there should be only one percent sign in the target="${file%.*}.tiff" line. This enables your use of "User Manual 3.7.pdf" -> "User Manual 3.7.tiff" instead of "User Manual 3.tiff" as the above AC posted (and I copied). Also, the second script should have quotes around the first argument to find, which should use @ instead of 1, which respects multiple arguments and the spaces within them while still defaulting to the current working directory when there are no arguments.
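Those two corrections could look roughly like this. It is only a sketch and not the psmerge2 script mentioned below; the function names tiff_name and find_pdfs are made up.

```shell
#!/bin/sh
# Sketch of the two corrections described above.

# Strip only the last extension: "User Manual 3.7.pdf" -> "User Manual 3.7.tiff".
tiff_name() {
    printf '%s\n' "${1%.*}.tiff"
}

# List PDFs under every directory given, or under . when none is given.
# Quoting "$@" keeps directory names containing spaces intact.
find_pdfs() {
    [ "$#" -gt 0 ] || set -- .
    find "$@" -name '*.pdf' -print
}
```

For feeding xargs -0 as in the scripts above, -print would become -print0; plain -print is used here only so the output is easy to read.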

I've incorporated this into my psmerge replacement script psmerge2 which uses ghostscript to merge and convert documents that it can read. I often use it to convert single documents from one format to another, and now it can also convert PDFs (and PSs) to TIFF.

--
Use my userscript to add story images to Slashdot. There's no going back.

It doesn't matter. Neither is great. by moosesocks · 2008-06-24 04:59 · Score: 3, Informative

How *much* smaller are these TIFFs anyway? TIFF is actually a container format, and can support all sorts of compression, some of them proprietary, some of them common. Not all of them are lossless either (TIFF-Jpeg is a perfectly valid combination, and was used before the days of Exif to add metadata to jpegs). TIFFs can also include vectorized data. It's not all that much less complicated than a PDF.

PDFs are also a container format to an extent. You could very well have a TIFF embedded in a PDF. Fortunately for us, the PDF specification is a bit more stringent on what is supported and what isn't, and PDFs tend to work just about everywhere (especially if all that you've got is an image). You can also apply all sorts of compression to PDFs to reduce their file size... these might not be quite as well supported.

Both formats are extremely common, and it's extremely unlikely that you'll ever have to do any sort of conversion to display them. If I had to place money on it, I'd wager that PDF will be in widespread use for longer than TIFF, though neither format seems to be going anywhere anytime soon. You're more likely to have to worry about the storage devices you're using and the longevity of the media.

If you just need to store lossless raster images, PNG might be a good bet. It's a "Free" format, and is officially endorsed as an ISO/IEC standard. TIFF is copyrighted by Adobe. PNG also has the advantage of being a complete image format, rather than just a container, which means that any software that can open a PNG image should be able to open *any* PNG. Because of its open-sourceness and widespread adoption, PNG will be around for a long time to come as well. Once again, the storage medium and filesystem that you use to store the images is very likely going to become obsolete before the file format itself.

Granted, PNG's compression algorithm isn't optimized for photographic data, though the image formats that *are* optimized for this purpose are neither common nor free.

In summary, there's no reason that a PDF needs to be terribly larger than a TIFF (the overhead should be especially negligible if you've got lots of images at a high-resolution). Neither format is going away anytime soon, but both have quirks that can hurt you in the future (Multi-page TIFFs are even somewhat of an oddity today). If you really want small files and future-proofing, go with PNG. Otherwise, it's more or less a non-issue.

--
-- If you try to fail and succeed, which have you done? - Uli's moose

Re:Some Solutions by ZERO1ZERO · 2008-06-24 05:05 · Score: 2, Informative

imagemagick is Slow slow slow for multipage tiffs. On windows, creating and splitting multipage G4 tiffs is 20 (TWENTY) times faster using tifftools.

PDF/A by SpaghettiPattern · 2008-06-24 05:07 · Score: 4, Insightful

Although the TIFF format is open and widely used in archiving systems, it is not particularly suited for an archive you are setting up from scratch. The main reason is that many applications that generate TIFF may throw in their own proprietary stuff and lock you into a specific viewer. Also, you cannot do a text search of content in TIFF.

When you discuss archives you think about looong times. Typically 10 to 50 years of retention with the odd exception where eternity is desired.

Hence "plain" PDF is probably even worse than TIFF. One problem here is the resources (fonts) and references (http links), which are mostly left out of the file in order to save disk space. The other problem is that there are so many "plain" PDF versions to choose from and none of them will last 10 to 50 years.

However, PDF is a good technology and therefore the PDF/A standard was developed. It is designed especially to deal with loooong term issues, is currently readable through almost any PDF reader and will be maintained by most sensible PDF readers for the years to come. There is NO vendor lock-in, and you can put text in a PDF/A document and run searches against it. But most importantly, NO proprietary stuff can be shoved in, as it would result in an invalid document (a PDF document maybe, but not a PDF/A document).

With the price of current disk space you should NOT make file size a defining criterion in your archiving policy. Only on z/OS does disk space come at absurd and ridiculous prices. If you can, try aiming for an archiving solution on Unix, Linux or even Windows.

I am in the archiving business. At the moment PDF/A is the only format suitable for archiving.

--

I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)

Re:PDF/A by ealsmyr · 2008-06-25 02:03 · Score: 1

I am working on a piece of software that produces PDF documents as its end result: PDF documents that customers need to keep accessible for a long time due to regulatory needs.
Our problem with PDF/A is that different preflight tools give different results and we therefore find it hard to claim compliance with the standard.
Unfortunately, the ISO PDF/A standard was ambiguous in several parts - particularly the XMP metadata requirements. I have heard the PDF/A Competence Center is working on an extensive PDF/A test suite for preflight programs. This will hopefully avoid the differences in preflight results from the various applications.

Speaking as an enterprise search specialist... by spyrochaete · 2008-06-24 05:12 · Score: 2, Insightful

I would recommend OCRing these documents and storing them in some kind of text-based format (in addition to the graphical format of your choice). If you have particularly voluminous back-catalogues of these documents you'll be very thankful in the future if you have the option to search-enable this textual content.

A graphic image of text is like a wax apple - it looks and tastes like a replica.

Re:Speaking as an enterprise search specialist... by nine-times · 2008-06-24 05:28 · Score: 1

You can OCR stuff and store the text in the PDF.
Re:Speaking as an enterprise search specialist... by spyrochaete · 2008-06-24 05:38 · Score: 1

Since the alternative format being discussed is a graphical image I made the assumption that they're currently scanning flat images into multi-page PDFs. If they're OCRing into PDF then that format would indeed be indexable by most enterprise search engines (as well as free desktop search engines).
Re:Speaking as an enterprise search specialist... by Hektor_Troy · 2008-06-25 03:12 · Score: 1

Well ... while I doubt it's doable, it would be really cool with a PDF->LaTeX program. Transform everything in the pdf into LaTeX files (except pictures). Would give you free text search and an easy way to compress the stuff really well.

--
We do not live in the 21st century. We live in the 20 second century.
Re:Speaking as an enterprise search specialist... by spyrochaete · 2008-06-25 03:44 · Score: 1

You doubt that what is doable? OCRing to PDF is a very common procedure, and many enterprise and desktop search products have no trouble reading and indexing PDFs. I've never seen LaTeX used in any corporate environment or by any individual except for malicious university professors.

PDF is the right answer by kriston · 2008-06-24 05:15 · Score: 1

PDF makes sense for document signing, security, and damage detection. TIFF does not have any of this important security and data integrity protection by itself.

PDF also allows for the same compression on the scanned image that TIFF does, as well as much better compression methods available to it.

TIFF, while well-understood in the archival industry, has rather fledgling support in the free *NIX world--especially multi-page TIFF.

Finally, with PDF, you can preserve both the image and the OCR data all in the same file. That's impossible with TIFF.

And, anyway, it's not 1985 anymore.

--

Kriston

Tangential question - can TIFFs be dig. signed by Overzeetop · 2008-06-24 05:18 · Score: 1

I have a similar issue, but have chosen PDF because it meets the digital signature requirements of most professional licensure boards (architects and engineers worry about this stuff). It's not a large hurdle, just that the documents can be externally verified against a publicly available key. Adobe lets you do that for free (well, assuming you have their s/w; I can post a key on my website for a 3rd party to install and verify the signature).

This isn't a high-crypto requirement area - you can easily fake a paper document, and the standard isn't too much higher for A/E work, but it has to be sealed and marked for the inability to change, and the certificate must be publicly available and easily verified.

--
Is it just my observation, or are there way too many stupid people in the world?

Re:Tangential question - can TIFFs be dig. signed by Just+Some+Guy · 2008-06-24 05:41 · Score: 1

Any file can be digitally signed with GPG or PGP. If your customers are used to looking for public keys on your site anyway, you might as well make them PGP pubkeys.

--
Dewey, what part of this looks like authorities should be involved?
Re:Tangential question - can TIFFs be dig. signed by Creepy+Crawler · 2008-06-24 07:18 · Score: 1

Even the pirate groups do that.
They'll have the "goods" and a md5sum.txt with the md5sum of the "goods" and a GPG signature around the md5sum AND the gpg signature of the file. Then they zip it.
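
The checksum half of that scheme is simple to sketch. The snippet below writes lines in the `<hex digest>  <filename>` layout that md5sum prints; the function and file names are hypothetical, and the signing step is left to GPG itself (for example `gpg --detach-sign md5sum.txt`). MD5 is used only because that is what was described above; a stronger hash would be a sensible swap for anything that matters.

```python
import hashlib
from pathlib import Path


def md5_manifest_line(path):
    """Return one 'digest  filename' line in the layout md5sum prints."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large scans do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return f"{digest.hexdigest()}  {Path(path).name}"


def write_manifest(paths, manifest="md5sum.txt"):
    """Write one manifest line per file; sign the result separately with GPG."""
    lines = [md5_manifest_line(p) for p in paths]
    Path(manifest).write_text("\n".join(lines) + "\n")
    return lines
```

Anyone holding the public key can then verify the manifest's signature first and the individual checksums second.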
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37

PDF is the King for a reason. by fsterman · 2008-06-24 05:18 · Score: 1

I cannot find an analogy for how fundamentally incorrect the submitter's mental model of PDFs and TIFFs is. Think of PDF as a container format: you can compress the images inside the PDF to your heart's content, much smaller than TIFF will do, since it can use JPEG or PNG or whatever format you want. In Acrobat, that's Tools->Print Production->PDF Optimizer. It even has OCR and some scanned image auto cleaning. The easiest thing is just to have them change their scan resolution, down to say, 150 DPI and B+W instead of color.

--
Is there anything better than clicking through Microsoft ads on Slashdot?

Re:PDF is the King for a reason. by pdboddy · 2008-06-24 07:36 · Score: 1

What's the point of archiving something if you're going to potentially lose information by dropping the resolution? Especially something like the aperture cards the OP is having archived. The Wikipedia image is horrible, and I can't imagine what that would look like in B&W... if that's a typical aperture card, it's going to look horrible.

And I'd be careful trusting the PDF optimizer for file size reduction. Not always the best results. But I'd still put my money on PDF.

--
Julie Moult is an idiot.

Another reason to do go tiff by alta · 2008-06-24 05:23 · Score: 1

We have a document management system. In it we had to make the decision for PDF or tiff. We opted for tiff. It had nothing to do with the file size. The deciding factor was that we could find FAST tiff viewers all day and night. It's probably not that PDF as a format is that much more bloated, but the readers, especially Acrobat Reader, take a LOT longer to start up.

We use an ActiveX control called alternatiff to view them in the browser (and yes, it does multipage). The control loads in the browser VERY fast. Acrobat embedded in a page is painfully slow to load, even if you just do a page re-load.

--
Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.

Re:Another reason to do go tiff by Wilden2003 · 2008-06-24 06:24 · Score: 2, Informative

You are so right about Adobe Acrobat.
For a stand-alone free PDF viewer take a look at Foxit.
http://www.foxitsoftware.com/pdf/rd_intro.php
It's fast and small, and does the job.
Re:Another reason to do go tiff by Galactic+Dominator · 2008-06-24 09:09 · Score: 1

Complete bullshit

Not free as in gratis or libre. Even that honky free offer is more BS.

--
brandelf -t FreeBSD /brain
Re:Another reason to do go tiff by gothzilla · 2008-06-24 09:50 · Score: 1

Tried Foxit on my network just a couple months ago. Uninstalled it from the test machines after a week. Too many PDFs either wouldn't open or caused Foxit to crash.
Re:Another reason to do go tiff by JoshJ · 2008-06-27 03:46 · Score: 1

How about this one then?
http://blog.kowalczyk.info/software/sumatrapdf/

--
Care about privacy? Read this!

Can I make a suggestion? by johannesg · 2008-06-24 05:37 · Score: 1

Send them a quotation. If the money looks good, do it and don't bitch about it on slashdot. If it does not look good, decline the job and don't bitch about it on slashdot either. Either way, don't bitch about it on slashdot.

Go to a service bureau by IntlHarvester · 2008-06-24 05:45 · Score: 1

Legal service firms work with all of these PDF and TIFF variants all of the time. They should be able to kick out whatever you need at x cents per page (which will usually be cheaper than your time/money)

The weird TIFF formats are used for various document management products, so it really depends mostly on your workflow.

--
Business. Numbers. Money. People. Computer World.

Re:TIFF better defined than PDF? Nuts. by IntlHarvester · 2008-06-24 05:52 · Score: 1

I don't understand why anyone would move from tried and true pdf systems for a TIFF system unless they want to lose text search for employees, regulators and the public. Handing people image based pdf instead of text files or normal pdf is a standard practice for the Bush administration that borders on criminal obstruction of justice.

TIFF is the old established "tried and true" (haw) standard, PDF is the new hotness. So I doubt anyone is moving backwards here.

Also courts have requirements for electronic document formats and there is nothing non-standard about a "image based pdf" (these also support searchable OCR full text).

--
Business. Numbers. Money. People. Computer World.

PDF2IMG by RupW · 2008-06-24 06:00 · Score: 1

The gold-standard tool for this is PDF2IMG which uses Adobe's own PDF rendering library but it'll set you back a few thousand dollars.

Ghostscript is good but it isn't perfect: it does choke on some PDFs, misrenders some and won't pick up non-embedded TTF fonts, only external PS fonts. It also doesn't do any anti-aliasing so you probably want to render large and sample down and (IIRC) there's a max image size it can render. But by and large it does just work.

Re:PDF2IMG by Directrix1 · 2008-06-24 09:24 · Score: 1

It can do antialiasing

--
Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF

Shell script by debatem1 · 2008-06-24 06:01 · Score: 2, Informative

Honest to God, what you're talking about is a trivial task. Use ghostscript, or, if you don't have the time or interest, contact me with your requirements and I'll write it for you gratis, provided it remains F/OSS.

Re:Shell script by debatem1 · 2008-06-24 06:55 · Score: 1

Indeed, there are literally thousands of extant solutions to this problem- but since the asker is apparently too lazy to find any of them via Google, I figured I'd offer.

Batch processing by Bromskloss · 2008-06-24 06:08 · Score: 1

are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one?

Sing with me! That's what loops are for.
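
A minimal sketch of such a loop, assuming Ghostscript is installed as `gs` and that bilevel Group 4 output at 600 DPI is what you want (both are assumptions; the `tiffg4` device and the resolution are easy to change). The function names are hypothetical. The first function only builds the command line, so it can be inspected before anything is run.

```python
import subprocess
from pathlib import Path


def gs_command(pdf, dpi=600):
    """Build the Ghostscript argv that renders one PDF to one multi-page TIFF."""
    pdf = Path(pdf)
    target = pdf.with_suffix(".tiff")  # only the last extension is replaced
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-q",
        "-sDEVICE=tiffg4",            # bilevel CCITT Group 4 output
        f"-r{dpi}",
        f"-sOutputFile={target}",
        str(pdf),
    ]


def convert_tree(root="."):
    """Convert every PDF under root, one Ghostscript run per file."""
    for pdf in sorted(Path(root).rglob("*.pdf")):
        subprocess.run(gs_command(pdf), check=True)
```

Passing the arguments as a list rather than a shell string sidesteps the quoting problems with spaces in file names that came up elsewhere in the thread.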

--
Swedish plasma phys. PhD student; MSc EE; knows maths, programming, electronics; finance interest; seeks opportunities

Re:Astroturf? by pclminion · 2008-06-24 06:10 · Score: 1

Your paranoia is amusing. I wasn't "slamming" TIFF, merely pointing out that the fact that the spec is open and available is no reason to believe that all files conform to that spec. I was not recommending TIFF in all of its many incarnations, but in specific forms. I suppose I should have been more explicit. Use basic TIFF, with a 0=white photometric, with G4 compression. Stick to that and any viewer will open it.
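
For anyone who wants to check an existing archive against that recipe, here is a rough stdlib-only sketch that walks the IFD chain of a classic (non-BigTIFF) file and looks at two tags on every page. The tag numbers and values (Compression 259 = 4 for Group 4, PhotometricInterpretation 262 = 0 for WhiteIsZero) are from the TIFF 6.0 spec; the function names are made up, and the code assumes a well-formed file (no protection against truncated data or looping IFD chains).

```python
import struct

COMPRESSION_TAG = 259    # TIFF 6.0 Compression
PHOTOMETRIC_TAG = 262    # TIFF 6.0 PhotometricInterpretation
CCITT_GROUP4 = 4
WHITE_IS_ZERO = 0
SHORT = 3                # TIFF field type: 16-bit unsigned


def iter_pages(data):
    """Yield a {tag: value} dict of single SHORT tags for each IFD (page)."""
    order = {b"II": "<", b"MM": ">"}[data[:2]]
    magic, offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    while offset:
        (count,) = struct.unpack_from(order + "H", data, offset)
        tags = {}
        for i in range(count):
            entry = offset + 2 + 12 * i
            tag, ftype, n = struct.unpack_from(order + "HHI", data, entry)
            if ftype == SHORT and n == 1:
                # A single SHORT sits in the first two bytes of the value field.
                (tags[tag],) = struct.unpack_from(order + "H", data, entry + 8)
        yield tags
        # The 32-bit offset of the next IFD follows the last entry; 0 ends the chain.
        (offset,) = struct.unpack_from(order + "I", data, offset + 2 + 12 * count)


def is_basic_g4(data):
    """True if every page is Group 4 compressed with a 0=white photometric."""
    pages = list(iter_pages(data))
    return bool(pages) and all(
        page.get(COMPRESSION_TAG) == CCITT_GROUP4
        and page.get(PHOTOMETRIC_TAG) == WHITE_IS_ZERO
        for page in pages
    )
```

Usage would be along the lines of `is_basic_g4(open("scan.tiff", "rb").read())`, which reads the whole file into memory and so only suits a spot check.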

If you'd like to see some actual pimping of the product I ACTUALLY work on, see www.swiftview.com.

Re:Astroturf? by pclminion · 2008-06-24 07:00 · Score: 2, Interesting

I was not recommending Microsoft's TIFF+Text, but rather some easily maintained combination of TIFF+text. If that is Microsoft's extension, then that's what it is. I see no loss of functionality due to the storage format not being vector-based. An archival format is going to be looked at on the screen, searched, and possibly printed. There is no benefit to vector graphics except perhaps a size argument, which is obviated by technologies like JBIG2 (if they'd ever get off their butts and formalize the JBIG2-in-TIFF spec).

The only argument I've ever heard for vector graphics is that "When I zoom way, way, WAY in, it still looks all smooth and pretty." Now what the hell are you trying to look at a document from a distance of 0.0002 inches for? It's a non-complaint.

TIFF, on the other hand, is a simple raster format with enormously wide support. You don't have to worry about how one rasterizer is going to look different from another -- every pixel is precisely defined. The document will appear exactly as it was intended to. Consider the difference in codebase size between a simple TIFF reader and a full-blown vector rendering engine. There is enormous complexity with no benefit.

Converting by ZorbaTHut · 2008-06-24 07:29 · Score: 2, Insightful

I hear a lot of talk about how to convert back and forth, but nobody's mentioning the thing that I would consider the most important:

When you convert from .pdf to .tif, are you losing data?

Most of these convert scripts seem to work by starting Ghostscript and rendering a .tif out of your PDF. This is a *terrible terrible idea*. What you'd really want to do is reach into the PDF itself, and extract the lossless images perfectly. Anything else is like printing the .PDF and scanning the printout - you might lose pixels, you might gain extra pixels, and you almost certainly won't be perfectly aligned with the "pixel grain" of the original image.

Unless you can guarantee that you're pulling out, pixel-by-pixel, the exact original data, I would stick with PDFs.

--
Breaking Into the Industry - A development log about starting a game studio.

Why bother? by MarkCollette · 2008-06-24 07:52 · Score: 2, Informative

As someone who writes software to view PDFs, I can tell you this is completely pointless, since anything that saves scanned documents into PDF is really storing it as a TIFF image inside of the PDF anyway. The PDF container adds useful features for metadata, and is well documented, so shouldn't add any future-proof issues. And the overhead is probably a few kilobytes.

Re:TIFF better defined than PDF? Nuts. by pugugly · 2008-06-24 08:41 · Score: 1

TIFF is the old established "tried and true" (haw) standard, PDF is the new hotness. So I doubt anyone is moving backwards here.

Also courts have requirements for electronic document formats and there is nothing non-standard about a "image based pdf" (these also support searchable OCR full text).

Postscript begs to differ - {G}.

--
An Invisible Entity of Vast Power whose existence must be taken on faith alone: Liberal Media

Re:Astroturf? by willyhill · 2008-06-24 08:44 · Score: 3, Informative

Note that "Odder" and "freenix" are the same person.

--
The twitter monologues. Click on my homepage and be amazed.

Max file size favors PDF by klossner · 2008-06-24 09:08 · Score: 2, Informative

TIFF files have a maximum size of 4GB. (The "value offset" field of an IFD entry is a 32-bit value.) You can exceed this with 50 noisy pages. PDF files have a maximum size of 10 to the tenth power bytes. (The byte offset in a cross-reference table entry is a ten-digit decimal number.) That's 2.3 times the maximum TIFF file size.
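
The arithmetic behind those two limits, as a quick check (the constant names are just for illustration):

```python
TIFF_MAX_BYTES = 2 ** 32    # offsets are 32-bit unsigned values
PDF_MAX_BYTES = 10 ** 10    # offsets are ten decimal digits

ratio = PDF_MAX_BYTES / TIFF_MAX_BYTES
print(f"TIFF limit: {TIFF_MAX_BYTES:,} bytes")
print(f"PDF limit:  {PDF_MAX_BYTES:,} bytes")
print(f"ratio:      {ratio:.2f}")
```

The ratio comes out a little over 2.3, matching the figure above.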

I have written software to create both TIFF and PDF files. I would use PDF for archiving. Even today, it's tricky to find a TIFF reader that will run on all the important platforms and handle the variety of compression flavors (e.g., JBIG2.)

Re:TIFF better defined than PDF? Nuts. by IntlHarvester · 2008-06-24 10:19 · Score: 1

You understand we're talking about document archiving, right? Postscript is a terrible format for raster images.

--
Business. Numbers. Money. People. Computer World.

Re:Vector graphics rule. by pclminion · 2008-06-24 10:36 · Score: 1

A scan? Do you have some suggestion how to convert a SCAN to a vector format and magically synthesize information that was never there in the first place?

For raster archival, 600 DPI is good enough. Nobody is suggesting archiving rasters at screen resolution.

Re:Some Solutions by ya+really · 2008-06-24 11:06 · Score: 1

I had to create a bash script a while ago to convert color PostScript to black and white tiff. I used netpbm in conjunction with ghostscript to do it. Those 2 programs together can do just about anything you want for batch file graphics. Not sure if it will work for .pdf, but .pdf is pretty close to PostScript and probably easy enough to convert to eps or straight ps. Gimp and, as stated above, imagemagick also have a lot of useful batch processing tools, but in the case of gimp you have to learn its script language to use them, and both imagemagick and gimp are much slower than netpbm.

netpbm by Uzik2 · 2008-06-24 11:23 · Score: 1

Implementations in minutes. Converts to most anything. Not the most efficient though

--
-- Programming with boost is like building a house with lego. It's a cool but I wouldn't want to live in it

Re:Vector graphics rule. by im_thatoneguy · 2008-06-24 14:10 · Score: 1

If you OCR the text and determine that there is no image data, you can discard the raster data and compress more heavily, leaving the text as vector data.

That's a form of magically synthesizing vector data from raster scans.

A couple applications... by Remik · 2008-06-24 14:51 · Score: 1

I won't speak to why or whether you should do it, but here are a few options for how.

Doculex has an app called MPTiffIt that will do single to multi or multi to single page tiff conversion. You'd need to convert the PDFs to single page tiff via Save As (or perhaps a Batch Process), then recombine them with MPTiffIt.

Or, you could use a Tiff printer driver along with a batch printing software.

Personally, I'd use L.A.W. (Legal Access Ware, created by Image Capture Engineering and now owned by Lexis Nexis), which is a full production scanning, OCR and e-doc conversion suite. You can import any type of doc that can be printed and output single or multi-page tiffs or PDFs with a variety of database/unitization load files. Similar products by IPro and Doculex exist. Any litigation support vendor should have one of these tools, and would likely charge a relatively nominal fee (per GB) to perform the conversion for you.

Re:Vector graphics rule. by pclminion · 2008-06-24 15:01 · Score: 1

Or you could use JBIG2 while preserving the original appearance of the document absolutely. Your typical letter-sized page of 600 DPI information (over 33 million pixels) will compress to anywhere between 30-150 kilobytes, and that's in a lossless mode. The (theoretically) smaller size of a vector representation is not worth the loss of the original data.
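
To put those numbers in perspective, here is the back-of-the-envelope arithmetic, assuming a US letter page (8.5 x 11 inches) scanned bilevel at 1 bit per pixel. The page size and bit depth are assumptions added for the sake of the calculation; the 30 and 150 kilobyte figures are the ones quoted above.

```python
DPI = 600
WIDTH_IN, HEIGHT_IN = 8.5, 11.0          # US letter (assumed page size)

pixels = int(WIDTH_IN * DPI) * int(HEIGHT_IN * DPI)
raw_bytes = pixels // 8                   # 1 bit per pixel, ignoring row padding

print(f"{pixels:,} pixels, {raw_bytes:,} bytes uncompressed")
for compressed_kb in (30, 150):
    ratio = raw_bytes / (compressed_kb * 1024)
    print(f"{compressed_kb:>3} KB -> about {ratio:.0f}:1")
```

So the uncompressed bitmap is a bit over 4 MB per page, and the quoted range corresponds to a large lossless reduction even at the pessimistic end.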

decode by TheSHAD0W · 2008-06-24 16:16 · Score: 1

While you're at it, why not decode the data punched on each card and then just store the microfilm image and the decoded data, discarding the image of the rest of the card? That'd make things a lot more efficient.

Re:It doesn't matter. Neither is great. by twistedcubic · 2008-06-28 22:20 · Score: 1

You'll do even better on file size if you convert scans to Djvu: Scan to pnm (or pgm or pbm), convert using c44 (or cjb2 for pbm), and you can skip the TeX step (use djvm to assemble pages). If the text you're scanning is black and white, you'll be stunned by the quality and file size reduction.

Don't OCR by emj · 2008-06-29 20:56 · Score: 1

OCR doesn't work that well, you will still need the images. So you won't gain anything from doing TIFF+text.
