Automated PDF File Integrity Checking?
WomensHealth writes "I have about 6500 pdfs in my 'My Paperport Documents' folder that I've created over the years. As with all valuable data, I maintain off-site backups. Occasionally, when accessing a very old folder, I'll find one or two corrupted files. I would like to incorporate into my backup routine, a way of verifying the integrity of each file, so that I can immediately identify and replace with a backed-up version, any that might become corrupted. I'm not talking about verifying the integrity of the backup as a whole, instead, I want to periodically check the integrity of each individual PDF in the collection. Any way to do this in an automated fashion? I could use either an XP or OS X solution. I could even boot a Linux distro if required."
Maintain a database of MD5 checksums of the archived versions of the files and periodically check your live versions against it?
it has built in integrity checking and stuff.
Do you even lift?
These aren't the 'roids you're looking for.
I once found this:
http://multivalent.sourceforge.net/
The Multivalent suite of document tools includes a command-line utility that validates PDFs. It can be run across a whole directory of files too, so should do the trick.
Written in Java, so should run anywhere.
The OP is not asking about preventing future corruption; the OP wants an automated way to sift through 6500 PDFs to find corrupt (or at least, potentially corrupt) PDF files without having to open each one by hand.
MD5 generates a hash of the binary data of the PDF file. An MD5 hash will not tell you whether a PDF file is corrupt; it is only useful once the integrity of the PDF has been confirmed. After the integrity is confirmed, you can build your database of MD5 hashes to detect future corruption.
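For what it's worth, a minimal Python sketch of that hash database, assuming the files have already been confirmed good. The function names (md5_of, build_manifest, find_changed) are made up for illustration, and the "database" is just a dict keyed by path relative to the archive root.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each PDF's path (relative to root) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                manifest[os.path.relpath(path, root)] = md5_of(path)
    return manifest


def find_changed(root, manifest):
    """Return relative paths that are missing or whose digest no longer matches."""
    changed = []
    for rel_path, expected in sorted(manifest.items()):
        path = os.path.join(root, rel_path)
        if not os.path.isfile(path) or md5_of(path) != expected:
            changed.append(rel_path)
    return changed
```

You would run build_manifest once on the verified collection, store the result somewhere safe (a JSON file would do), and run find_changed as part of each backup. A mismatch only says the bytes changed, not why, so a deliberately edited file will show up too.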
To test that a given file is a valid PDF, you could probably use something like pdf2ps; you don't care about the PostScript output per se, but you'd be testing the exit status. If pdf2ps exits with an error, you set the file aside for manual verification. This should, if nothing else, whittle that 6500-PDF archive down to a much smaller subset that you can feasibly test by hand in Adobe Acrobat. "Refrying" those files (printing them back to the Adobe PDF printer to re-create the PDF) will probably fix them so that they pass the pdf2ps test.
I will leave the actual writing of a script to recurse through your directories, feed each PDF file through pdf2ps, and test for error codes, as an exercise to the OP. Now that you have an idea of what to do, actually doing it should be pretty simple.
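That said, here is a rough Python sketch of the exercise. It assumes pdf2ps (shipped with Ghostscript) is on the PATH and exits with a nonzero status for files it cannot convert; that is not guaranteed for every kind of damage, since Ghostscript may quietly repair some files. The names pdf2ps_ok and find_suspect_pdfs are made up for illustration.

```python
import os
import subprocess
import tempfile


def pdf2ps_ok(pdf_path):
    """Return True if pdf2ps exits with status 0 for this file (assumed to mean 'readable')."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "out.ps")  # PostScript output is thrown away
        result = subprocess.run(
            ["pdf2ps", pdf_path, out_path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    return result.returncode == 0


def find_suspect_pdfs(root, check=pdf2ps_ok):
    """Walk root recursively and return the PDFs for which check() is False."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                if not check(path):
                    suspects.append(path)
    return sorted(suspects)
```

Calling find_suspect_pdfs on the top-level documents folder gives the list to inspect by hand. The check argument is there so a different validator can be swapped in without touching the directory walk.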
Here is a Java command-line tool designed to check the validity of thousands of PDF files:
http://multivalent.sourceforge.net/Tools/pdf/Validate.html
There is also a tool for repairing some PDF errors:
http://multivalent.sourceforge.net/Tools/index.html
Never used it myself, just stumbled over it when I was searching for some PDF software.
Xpdf comes with a 'pdfinfo' command-line utility. It exits with a nonzero status if the PDF is corrupt. It should be reasonably efficient and very easy to automate.
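Something along these lines in Python would automate it, assuming pdfinfo is on the PATH and that a nonzero exit status really does mean the file could not be read (lightly damaged files that pdfinfo manages to recover may still pass). The helper names pdfinfo_ok and failed_files are hypothetical.

```python
import subprocess


def pdfinfo_ok(pdf_path, command=("pdfinfo",)):
    """Return True if the command exits with status 0 when given pdf_path."""
    result = subprocess.run(
        [*command, pdf_path],
        stdout=subprocess.DEVNULL,  # the metadata itself is not needed
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def failed_files(pdf_paths, command=("pdfinfo",)):
    """Return the paths for which the command exited with a nonzero status."""
    return [path for path in pdf_paths if not pdfinfo_ok(path, command)]
```

Feed failed_files the list of PDFs from your backup script and restore whatever it returns from the off-site copy. The command parameter only exists so another checker with the same exit-status convention can be substituted.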