Automated PDF File Integrity Checking?

Posted by timothy on Thursday May 22, 2008 @07:13AM from the one-at-a-time dept.

WomensHealth writes "I have about 6500 PDFs in my 'My Paperport Documents' folder that I've created over the years. As with all valuable data, I maintain off-site backups. Occasionally, when accessing a very old folder, I'll find one or two corrupted files. I would like to incorporate into my backup routine a way of verifying the integrity of each file, so that I can immediately identify any that become corrupted and replace them with a backed-up version. I'm not talking about verifying the integrity of the backup as a whole; instead, I want to periodically check the integrity of each individual PDF in the collection. Is there any way to do this in an automated fashion? I could use either an XP or OS X solution. I could even boot a Linux distro if required."

How about... by Uncle Focker · 2008-05-22 07:14 · Score: 4, Informative

Maintaining a database of MD5 checksums of the archived versions of the files and periodically checking your live versions against it?
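
A minimal Python sketch of the checksum-database idea, assuming the PDFs live under one root folder and that a path-to-digest dictionary is an acceptable "database". The function names (`md5_of`, `build_manifest`, `compare`) are made up for illustration and are not part of any tool mentioned in this thread. A mismatch only says the bytes changed since the manifest was built; it cannot tell corruption from a deliberate edit.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each PDF's path (relative to root, '/'-separated) to its MD5 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of(p)
        for p in sorted(root.rglob("*"))
        # Match .pdf and .PDF alike, since scanner software may use either.
        if p.is_file() and p.suffix.lower() == ".pdf"
    }


def compare(root, manifest):
    """Return (changed, missing) lists of relative paths versus a stored manifest."""
    root = Path(root)
    changed, missing = [], []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file():
            missing.append(rel)
        elif md5_of(target) != expected:
            changed.append(rel)
    return changed, missing
```

One way to use it: build the manifest once from files known to be good, store it (for example with `json.dump`), and rerun `compare` as part of each backup.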
1. Re:How about... by ZephyrXero · 2008-05-22 07:20 · Score: 4, Informative
  
  It sounds more like what he needs is to take an md5sum of new files when they are added to the archive, and then verify that any later changes were made by a user deliberately overwriting the file rather than by the sort of software/hardware corruption he's apparently experiencing. The md5 part is easy to automate; the second part, however, may require a human eye :/
  
  --
  "A truly wise man realizes he knows nothing."
2. Re:How about... by Ritchie70 · 2008-05-22 07:25 · Score: 3, Informative
  
  For Windows, Microsoft has a free command line tool, "FCIV.EXE", that will do this (MD5 and/or SHA) and save it all in an XML database for you. It will also then validate the files against that database.
  
  It's part of one of the resource kits.
  
  --
  The preferred solution is to not have a problem.
3. Re:How about... by Azarael · 2008-05-22 07:38 · Score: 4, Informative
  
  This is one of the features of the git revision control system:
  File integrity checking is built into the basic lookup mechanism, so that corruption will be detected automatically
  from http://lwn.net/Articles/145194/
4. Re:How about... by TheRaven64 · 2008-05-22 09:11 · Score: 1, Informative
  
  MD5 gives you error detection, but not correction. You'd be better off with par2 for this kind of thing. When you add a file, run the par2 utility to generate the check file. On OS X, do this with a Folder Action whenever a new file is created with a .pdf extension. Then just set up a cron job that runs every month or so and attempts to verify / repair the files. Make sure you check the output of this, since silent data corruption is usually a sign that the drive is on its way out.
  
  --
  I am TheRaven on Soylent News
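
A rough Python sketch of the par2 verify-then-repair routine from the comment above, assuming the par2cmdline tool is installed as `par2` on the PATH, that each PDF gets a recovery file named `<file>.par2` beside it, and that `par2` exits with status 0 on success. The 10% redundancy level and all function names are illustrative choices, not something the comment specifies.

```python
import subprocess


def par2_create_cmd(pdf, redundancy=10):
    """Command that writes <pdf>.par2 recovery data next to the file."""
    pdf = str(pdf)
    # -r<n> sets the redundancy percentage; 10 is an arbitrary example value.
    return ["par2", "create", f"-r{redundancy}", pdf + ".par2", pdf]


def par2_verify_cmd(pdf):
    """Command that checks the file against its recovery data."""
    return ["par2", "verify", str(pdf) + ".par2"]


def par2_repair_cmd(pdf):
    """Command that attempts to rebuild the file from its recovery data."""
    return ["par2", "repair", str(pdf) + ".par2"]


def verify_or_repair(pdf, run=subprocess.run):
    """Verify one file and try a repair if verification fails.

    Returns "ok", "repaired", or "failed". The `run` parameter exists so the
    logic can be exercised without par2 installed.
    """
    if run(par2_verify_cmd(pdf)).returncode == 0:
        return "ok"
    if run(par2_repair_cmd(pdf)).returncode == 0:
        return "repaired"
    return "failed"
```

A monthly cron job could call `verify_or_repair` for each PDF and mail out anything that is not "ok", which covers the advice about actually reading the output.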
use ZFS by larry bagina · 2008-05-22 07:55 · Score: 3, Informative

it has built in integrity checking and stuff.

--
Do you even lift?
These aren't the 'roids you're looking for.
Multivalent by Anonymous Coward · 2008-05-22 08:42 · Score: 2, Informative

I once found this:

http://multivalent.sourceforge.net/

The Multivalent suite of document tools includes a command-line utility that validates PDFs. It can be run across a whole directory of files too, so should do the trick.

Written in Java, so should run anywhere.
md5? WTF? RTFQ, morans by gblues · 2008-05-22 08:47 · Score: 4, Informative

The OP is not asking about preventing future corruption; the OP wants an automated way to sift through 6500 PDFs to find corrupt (or at least, potentially corrupt) PDF files without having to open each one by hand.

MD5 generates a hash of the binary data of the PDF file. An MD5 hash will not tell you if a PDF file is corrupt; it is only useful once the integrity of the PDF has been confirmed. After the integrity is confirmed, then you can make your database of MD5 hashes, to detect future corruption.

To test that a given file is a valid PDF, you could probably use something like pdf2ps; you don't care about the PostScript output per se, but you'd be testing for an error code. If pdf2ps returns an error code, you set the file aside for manual verification. This should, if nothing else, whittle down that 6500 PDF archive into a much smaller subset that you can feasibly test manually using Adobe Acrobat. And if you "refry" those (print them back to the Adobe PDF printer to re-PDF them), that will probably fix them so they pass the pdf2ps test.

I will leave the actual writing of a script to recurse through your directories, feed each PDF file through pdf2ps, and test for error codes, as an exercise to the OP. Now that you have an idea of what to do, actually doing it should be pretty simple.
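
For what it's worth, the exercise could look something like this in Python. It is only a sketch under assumptions: Ghostscript's `pdf2ps` is on the PATH, it exits non-zero for a PDF it cannot convert (the comment above only says it "probably" does), and throwing the PostScript away via `os.devnull` is acceptable. The helper names are hypothetical.

```python
import os
import subprocess


def find_pdfs(root):
    """Yield every file under root whose name ends in .pdf (any case)."""
    for dirpath, dirs, files in os.walk(root):
        dirs.sort()  # visit subfolders in a predictable order
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


def pdf2ps_cmd(pdf):
    """pdf2ps takes an input PDF and an output file; the output is discarded."""
    return ["pdf2ps", pdf, os.devnull]


def suspects(root, make_cmd=pdf2ps_cmd):
    """Return the PDFs for which the checker command exits non-zero.

    Raises FileNotFoundError if the checker program is not installed.
    """
    flagged = []
    for pdf in find_pdfs(root):
        result = subprocess.run(
            make_cmd(pdf),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            flagged.append(pdf)
    return flagged
```

Calling `suspects` on the top-level folder gives the short list to open by hand; passing a different `make_cmd` swaps in another checker without touching the directory walk.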
PDF validation by Peter H.S. · 2008-05-22 09:01 · Score: 5, Informative

Here is a Java command line tool designed to check the validity of thousands of PDF files:

http://multivalent.sourceforge.net/Tools/pdf/Validate.html

There is also a tool for repairing some pdf errors:
http://multivalent.sourceforge.net/Tools/index.html

Never used it myself, just stumbled over it when I was searching for some pdf software.

--
Regards
XPdf by dtrumpet · 2008-05-23 06:36 · Score: 2, Informative

XPdf comes with a 'pdfinfo' command line utility. It returns non-zero if the PDF is corrupt. Should be somewhat efficient and very easy to automate.
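
A small Python sketch of automating this, assuming `pdfinfo` is on the PATH and exits with status 0 only when it can read the file. The thread pool, the 30-second timeout, and the function names are illustrative assumptions. Since pdfinfo presumably reads only the document's metadata and structure, a clean result here may not rule out damage inside page content.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def pdf_ok(path, tool=("pdfinfo",), timeout=30):
    """Return True if the tool exits with status 0 for this file in time."""
    try:
        result = subprocess.run(
            [*tool, str(path)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A file that hangs the tool is treated as suspect.
        return False
    return result.returncode == 0


def check_many(paths, workers=4, **kwargs):
    """Check several files concurrently and return {path: True/False}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: pdf_ok(p, **kwargs), paths)
        return dict(zip(paths, results))
```

Threads are enough here because each check spends its time waiting on an external process, so several files can be examined at once.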