Slashdot Mirror


Multi-page PDF To Multi-page TIFF and Archiving?

GeorgeMonroy writes "One of my clients has aperture cards that they have been scanning into multi-page PDF files — but now they want them in multi-page TIFFs instead. One of the reasons they gave for this is that TIFF files require less storage space. While that is true, I wonder if TIFF is the best format going into the future. Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway. Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one? If another file format is better than TIFF, then are there any programs for batch processing that you can recommend?"

5 of 125 comments

  1. PDF/A by SpaghettiPattern · · Score: 4, Insightful

    Although the TIFF format is open and widely used in archiving systems, it is not particularly suited to an archive you are setting up from scratch. The main reason is that many applications that generate TIFF may throw in their own proprietary stuff and lock you into a specific viewer. Also, you cannot run a text search against the content of a TIFF.
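    To see what a given application has thrown into a TIFF, you can list the tag numbers in each page directory. A rough sketch, assuming classic (non-BigTIFF) files; the function names are made up for illustration. TIFF 6.0 reserves tag numbers 32768 and up for private use, but plenty of private tags are widely understood (Exif, copyright), so treat a hit as something to look into, not as proof of lock-in.

```python
import struct
from typing import List

PRIVATE_TAG_START = 32768  # TIFF 6.0 reserves tag numbers from here up for private use


def list_tiff_tags(data: bytes) -> List[List[int]]:
    """Return the tag numbers of every IFD (one IFD per page) in a classic TIFF."""
    if data[:2] == b"II":
        endian = "<"   # little-endian file
    elif data[:2] == b"MM":
        endian = ">"   # big-endian file
    else:
        raise ValueError("missing TIFF byte-order mark")
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled here)")

    pages = []
    seen = set()  # guards against a corrupt file whose IFD chain loops
    while offset != 0 and offset not in seen:
        seen.add(offset)
        (count,) = struct.unpack_from(endian + "H", data, offset)
        tags = []
        for i in range(count):
            # Each IFD entry is 12 bytes; the first 2 bytes are the tag number.
            (tag,) = struct.unpack_from(endian + "H", data, offset + 2 + 12 * i)
            tags.append(tag)
        pages.append(tags)
        # The offset of the next IFD follows the last entry; 0 ends the chain.
        (offset,) = struct.unpack_from(endian + "I", data, offset + 2 + 12 * count)
    return pages


def private_tags(data: bytes) -> List[int]:
    """Sorted, de-duplicated private tag numbers found anywhere in the file."""
    return sorted({t for page in list_tiff_tags(data) for t in page
                   if t >= PRIVATE_TAG_START})
```

    Run private_tags(open(path, "rb").read()) over a sample of files from each scanner or application before committing to it.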

    When you discuss archives you think about looong times. Typically 10 to 50 years of retention, with the odd exception where eternity is desired.

    Hence "plain" PDF is probably even worse than TIFF. One problem here is the resources (fonts) and references (http links), which are mostly left out of the file in order to save disk space. The other problem is that there are so many "plain" PDF versions to choose from and none of them will last 10 to 50 years.

    However, PDF is a good technology and therefore the PDF/A standard was developed. It is designed especially to deal with loooong term issues, is currently readable through almost any PDF reader and will be maintained by most sensible PDF readers for the years to come. There is NO vendor lock-in, you can put text in a PDF/A document and run searches against it. But most importantly, NO proprietary stuff can be shoved in as it would result in an invalid document (a PDF document maybe but not a PDF/A document.)
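    A PDF/A file announces itself in its XMP metadata through the pdfaid:part and pdfaid:conformance properties, so a quick triage of an existing pile of PDFs could look like the sketch below. It assumes the XMP packet is stored uncompressed, and the function name is made up. It only reads what the file claims; it does not validate anything, so a real archive still needs a proper PDF/A validator.

```python
import re
from typing import Optional

# pdfaid:part / pdfaid:conformance may appear as XML elements or as attributes.
_PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def claimed_pdfa_level(pdf_bytes: bytes) -> Optional[str]:
    """Return the level the file claims (e.g. '1b'), or None if it claims none."""
    part = _PART.search(pdf_bytes)
    if part is None:
        return None
    level = part.group(1).decode()
    conf = _CONF.search(pdf_bytes)
    if conf is not None:
        level += conf.group(1).decode().lower()
    return level
```

    Files that return None are "plain" PDFs and would need converting, not just renaming.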

    With the current price of disk space you should NOT make file size a defining criterion in your archiving policy. Only on z/OS does disk space come at absurd and ridiculous prices. If you can, try aiming for an archiving solution on Unix, Linux or even Windows.

    I am in the archiving business. At the moment PDF/A is the only format suitable for archiving.

    --

    I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
  2. Speaking as an enterprise search specialist... by spyrochaete · · Score: 2, Insightful

    I would recommend OCRing these documents and storing them in some kind of text-based format (in addition to the graphical format of your choice). If you have particularly voluminous back-catalogues of these documents you'll be very thankful in the future if you have the option to search-enable this textual content.

    A graphic image of text is like a wax apple - it looks and tastes like a replica.
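
    To make "search-enable" concrete, here is a minimal sketch of a word index over OCR output. It assumes the OCR step has already produced plain text per page; the page ids and function names are invented for illustration, and a real deployment would use a proper search engine.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

_WORD = re.compile(r"[a-z0-9]+")


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercase word to the set of page ids whose OCR text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for word in _WORD.findall(text.lower()):
            index[word].add(page_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the page ids that contain every word of the query."""
    words = _WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

    The index points back at the scanned pages, so the image stays the record of truth and the text is only a finding aid.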

  3. Re:Are these things images or documents? by mrchaotica · · Score: 3, Insightful

    > Yes, it makes sense for them to be multi-page. TIFF is a multi-page format...

    Having the format support a thing, and having that thing make sense are two different things. For example, Excel supports being used as a database... but does it make sense?

    My point: use image formats for images, and document formats for documents. If the things you're trying to store are images, don't put them in a document format, and if they're documents, then don't put them in an image format.

    Also, if TIFF is designed to store both images and documents, then I question whether it is too general to do either of them well. And your mention of "subformats" makes me think my concern is well-founded!

    --

    "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

  4. Converting by ZorbaTHut · · Score: 2, Insightful

    I hear a lot of talk about how to convert back and forth, but nobody's mentioning the thing that I would consider the most important:

    When you convert from .pdf to .tif, are you losing data?

    Most of these convert scripts seem to work by starting Ghostscript and rendering a .tif out of your PDF. This is a *terrible terrible idea*. What you'd really want to do is reach into the PDF itself, and extract the lossless images perfectly. Anything else is like printing the PDF and scanning the printout - you might lose pixels, you might gain extra pixels, and you almost certainly won't be perfectly aligned with the "pixel grain" of the original image.

    Unless you can guarantee that you'll be pulling out, pixel-by-pixel, the exact original data, I would stick with PDFs.
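
    For the batch side, one way to extract instead of re-render is poppler's pdfimages followed by libtiff's tiffcp. The sketch below assumes both tools are installed and that each PDF page holds exactly one scanned image; the directory layout and function names are invented. If the scans were stored as JPEG inside the PDF, the output is the decoded pixels, not the original bytes, so check a sample before trusting it.

```python
import subprocess
from pathlib import Path
from typing import List


def extract_command(pdf: Path, work_dir: Path) -> List[str]:
    # pdfimages writes one file per embedded image: <root>-000.tif, <root>-001.tif, ...
    return ["pdfimages", "-tiff", str(pdf), str(work_dir / pdf.stem)]


def page_number(path: Path) -> int:
    # Sort by the numeric suffix so that page 1000 does not land before page 999.
    return int(path.stem.rsplit("-", 1)[1])


def combine_command(pages: List[Path], out_tiff: Path) -> List[str]:
    # tiffcp concatenates its inputs, in order, into one multi-page TIFF.
    ordered = sorted(pages, key=page_number)
    return ["tiffcp", *[str(p) for p in ordered], str(out_tiff)]


def convert_all(src_dir: Path, out_dir: Path) -> None:
    """Convert every PDF in src_dir to a multi-page TIFF in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(src_dir.glob("*.pdf")):
        work = out_dir / (pdf.stem + "_pages")
        work.mkdir(exist_ok=True)
        subprocess.run(extract_command(pdf, work), check=True)
        pages = list(work.glob("*.tif"))
        if pages:
            target = out_dir / (pdf.stem + ".tif")
            subprocess.run(combine_command(pages, target), check=True)
```

    Calling convert_all(Path("scans"), Path("tiffs")) would walk the whole folder, and check=True stops the run at the first file either tool refuses.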

    --
    Breaking Into the Industry - A development log about starting a game studio.
  5. Re:Are these things images or documents? by mfnickster · · Score: 2, Insightful

    > Scanned documents are both images and documents.

    Exactly. I don't mean to be rude, but the GP's comment is just silly.

    "Documents" can be either graphical or text (or both). There's a reason why word processing formats allow embedding of images.

    In fact, "document" has become an almost meaningless term since it can apply to so many types of data. About the only thing all the different "document" formats have in common is that their content is 2D!

    --
    "Slow down, Cowboy! It has been 3 years, 7 months and 26 days since you last successfully posted a comment."