Automated PDF File Integrity Checking?

Posted by timothy on Thursday May 22, 2008 @07:13AM from the one-at-a-time dept.

WomensHealth writes "I have about 6500 pdfs in my 'My Paperport Documents' folder that I've created over the years. As with all valuable data, I maintain off-site backups. Occasionally, when accessing a very old folder, I'll find one or two corrupted files. I would like to incorporate into my backup routine, a way of verifying the integrity of each file, so that I can immediately identify and replace with a backed-up version, any that might become corrupted. I'm not talking about verifying the integrity of the backup as a whole, instead, I want to periodically check the integrity of each individual PDF in the collection. Any way to do this in an automated fashion? I could use either an XP or OS X solution. I could even boot a Linux distro if required."

40 comments

How about... by Uncle+Focker · 2008-05-22 07:14 · Score: 4, Informative

Maintaining a database of MD5 checksums of the archived versions of the files and periodically checking your live versions against it?
1. Re:How about... by ZephyrXero · 2008-05-22 07:20 · Score: 4, Informative
  
  It sounds more like what he needs is to take an md5sum of new files when they are added to the archive and verify any changes to them are made by a user specifically overwriting the file rather than some sort of software/hardware corruption as he's apparently experiencing. The md5 part is easy to automate, however the second part may require a human eye :/
  
  --
  "A truly wise man realizes he knows nothing."
2. Re:How about... by Ritchie70 · 2008-05-22 07:25 · Score: 3, Informative
  
  For Windows, Microsoft has a free command line tool, "FCIV.EXE", that will do this (MD5 and/or SHA) and save it all in an XML database for you. It will also then validate the files against that database.
  
  It's part of one of the resource kits.
  
  --
  The preferred solution is to not have a problem.
3. Re:How about... by Uncle+Focker · 2008-05-22 07:25 · Score: 1
  
  As a starting point, he could compare when each file was last modified with when its last version was archived, to weed out any files that differ due to modification rather than corruption. That should cover most cases; the remaining ones would most likely require manual checking.
4. Re:How about... by Azarael · 2008-05-22 07:38 · Score: 4, Informative
  
  This is one of the features of the git revision control system:
  File integrity checking is built into the basic lookup mechanism, so that corruption will be detected automatically
  from http://lwn.net/Articles/145194/
5. Re:How about... by Last_Available_Usern · 2008-05-22 08:14 · Score: 2, Insightful
  
  A checksum won't help if the user replaces/saves the file with a corrupted version.
6. Re:How about... by TheRaven64 · 2008-05-22 09:11 · Score: 1, Informative
  
  MD5 gives you error detection, but not correction. You'd be better off with par2 for this kind of thing. When you add a file, run the par2 utility to generate the check file. On OS X, do this with a Folder Action whenever a new file is created with a .pdf extension. Then just set up a cron job that runs every month or so and attempts to verify / repair the files. Make sure you check the output of this, since silent data corruption is usually a sign that the drive is on its way out.
  
  --
  I am TheRaven on Soylent News
7. Re:How about... by TheRealMindChild · 2008-05-22 09:30 · Score: 1
  
  I would like to incorporate into my backup routine, a way of verifying the integrity of each file, so that I can immediately identify and replace with a backed-up version, any that might become corrupted.
  
  There is no article to NOT read here, buddy. And PAR(2) could be the worst suggestion for such a situation as I have ever heard. Parity is meant to work over a finite set of data. This guy has variable amounts of PDF's. You just added a layer of complexity (you'd need to somehow define "sets" of PDF's), just to shoehorn in a solution that doesn't even do what he originally wanted.
  
  You sir, fail the class.
  
  --
  
  "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
8. Re:How about... by mrmeval · 2008-05-22 11:37 · Score: 1
  
  md5 is good but computationally intensive. It would be good to make one at first but is there a known way to detect a bad file that takes less time? I don't know if a simple CRC or even modulo-11 would be bad or good.
  
  I'm definitely not a programmer nor math geek. :(
  
  --
  I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
9. Re:How about... by Anonymous Coward · 2008-05-22 15:16 · Score: 0
  
  Completely true, but that's a different problem than the one that the OP specified. The goal is just to detect problem files and replace them from good copies. If there's a worry that the replacement file is corrupted too, you can just as easily use the git checksum to verify it as well.
Duplicity + S3 + log file by Anonymous Coward · 2008-05-22 07:17 · Score: 1, Interesting

Remote backup with a notification of changed files and versions of all previous files for restoration.
quick script by debatem1 · 2008-05-22 07:17 · Score: 2, Interesting

wouldn't be too hard to write an inotify script that stores a backup of the file and an md5sum whenever you drop a file in. wouldn't help you recover an already corrupt document, but it would help you to stop it in the future. a tie-in to the actions menu would make it more usable, but that's a bit more effort, and such solutions probably already exist.
Sure there's a way by b4dc0d3r · 2008-05-22 07:29 · Score: 3, Interesting

There are PDF libraries out there - write a wrapper that loads a file, and when it gets to the end without error emits a 0 "no error" return code, and any errors result in a non-zero code.

Or maybe there are other cmd-line tools which issue a "failed to load" error. That's where I'd look first. Like a tool to strip content out of a PDF - script it so it outputs to /dev/null and check the exit code. I'd be surprised if there were a ready-made solution for this somewhere.
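
A minimal sketch of the wrapper idea in Python, assuming a much cruder test than a real PDF library would perform: it only checks for the `%PDF-` header at the start of the file and an `%%EOF` marker near the end. The function names and the 1024-byte tail window are illustrative assumptions. This catches truncated or zeroed-out files, not subtle damage inside the document, and it is stricter than some readers, which tolerate junk before the header.

```python
from pathlib import Path


def looks_like_pdf(path, tail_bytes=1024):
    """Return True if the file starts with '%PDF-' and has '%%EOF' near its end.

    This only catches gross damage such as truncation; it does not parse
    the document the way a PDF library would.
    """
    p = Path(path)
    size = p.stat().st_size
    with p.open("rb") as fh:
        head = fh.read(5)
        # Read only the last tail_bytes bytes to look for the end marker.
        fh.seek(max(0, size - tail_bytes))
        tail = fh.read()
    return head == b"%PDF-" and b"%%EOF" in tail


def exit_code_for(path):
    """Map the check onto the 0 = no error / non-zero = error convention."""
    try:
        return 0 if looks_like_pdf(path) else 1
    except OSError:
        return 2  # file is missing or unreadable
```

A script could pass `exit_code_for(sys.argv[1])` to `sys.exit()` so that a backup job can branch on the result.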
md5sum by Nozsd · 2008-05-22 07:38 · Score: 2, Insightful

md5sum *.pdf > sums
md5sum -c sums

Not exactly automated, but I wouldn't exactly call typing 2 lines to be manual labor; and once you've got the sums you really just need the second line.

Put something like this in a shell script and you can make it automatically replace files that fail a hash check with a good backup. Use perl, python, or whatever, and you can make it work across Windows, OS X, and *nix.
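
Something along these lines might do as the cross-platform version, sketched in Python. The function names, the JSON manifest format, and the assumption that the backup mirrors the live folder's layout are all illustrative, not anything from the original post. A file is only restored if the backup copy itself still matches the recorded sum.

```python
import hashlib
import json
import shutil
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large PDFs are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Equivalent of 'md5sum *.pdf > sums', but recursive."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): md5_of(p)
            for p in sorted(root.rglob("*.pdf"))}


def save_manifest(manifest, manifest_path):
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))


def load_manifest(manifest_path):
    return json.loads(Path(manifest_path).read_text())


def find_mismatches(root, manifest):
    """Equivalent of 'md5sum -c sums': list files that are missing or changed."""
    root = Path(root)
    return [rel for rel, expected in manifest.items()
            if not (root / rel).is_file() or md5_of(root / rel) != expected]


def restore_from_backup(root, backup_root, manifest, bad):
    """Copy back only those backup files that still match the manifest."""
    restored = []
    for rel in bad:
        source = Path(backup_root) / rel
        if source.is_file() and md5_of(source) == manifest[rel]:
            target = Path(root) / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)
            restored.append(rel)
    return restored
```

Note that `rglob("*.pdf")` is case-sensitive on Linux, so files named `.PDF` would need a second pattern. The manifest has to be rebuilt whenever a file is added or deliberately changed.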

--
When you have finished this cup of coffee your adventure will begin again.
1. Re:md5sum by forkazoo · 2008-05-22 08:30 · Score: 1
  
  md5sum *.pdf > sums
  md5sum -c sums
  
  Not exactly automated, but I wouldn't exactly call typing 2 lines to be manual labor; and once you've got the sums you really just need the second line.
  
  That assumes that all the PDF's start out valid, and will never be validly changed. What you really want is something like just using ghostscript to render each PDF to a temporary image, and then an automated check to make sure the image isn't 100% blank. (Or, just accepting the result if ghostscript doesn't exit with an error, and assume that the PDF has content. That's even easier. Pretty much just a bash one liner. Well, maybe three or four liner if you want it to be readable...)
2. Re:md5sum by xtracto · 2008-05-22 08:55 · Score: 1
  
  That's even easier. Pretty much just a bash one liner. Well, maybe three or four liner if you want it to be readable...)
  
  haha... you should see my R-commandscript-sed-awk-paste-echo-forloop- bash one liners I did to process some R data analysis and make it latex-table-ready and their respective graphics =oD
  
  Yay for Linux... that was teh k1ll3r app that made me not run windows at work
  
  --
  Ubuntu is an African word meaning 'I can't configure Debian'
3. Re:md5sum by SatanicPuppy · 2008-05-22 09:04 · Score: 1
  
  If you were using perl, you could use the PDF::Reuse library or PDF::API2 to do all kinds of crap. If it's not a valid pdf, the libraries throw all kinds of errors when you attempt to open the file. With that, you can even look at things like the number of pages, the content on the pages, etc.
  
  Mind you, if the rendering is fubared, like a font problem or something, so the page looks like crap, it may still be a valid pdf and pass through any sort of check with no problem. A corrupted image will still show up as an image, screwed up text will still be text, probably even the same text though mashed up and illegible outside the code.
  
  I guess you could use imagemagick or something to render it into a tiff and then compare it...But basically it's a nightmare to try and find out if the page "looks right" without having a human-validated copy lying around, and if you're doing that, then this isn't all that useful.
  
  --
  ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
use ZFS by larry+bagina · 2008-05-22 07:55 · Score: 3, Informative

it has built in integrity checking and stuff.

--
Do you even lift?
These aren't the 'roids you're looking for.
1. Re:use ZFS by RiotingPacifist · 2008-05-22 11:45 · Score: 1
  
  Right... so learn BSD or lock yourself into a proprietary operating system, and use an experimental filesystem (granted, BSD's experimental is another man's rock solid, but if you go the Mac route it's not quite as safe).
  
  --
  IranAir Flight 655 never forget!
2. Re:use ZFS by Anonymous Coward · 2008-05-22 12:38 · Score: 0
  
  http://zfs-on-fuse.blogspot.com/
Checksums by gweihir · 2008-05-22 08:28 · Score: 1

Just use the Linux md5sum utility:
Create checksums: md5sum file > file.md5
Test: md5sum -c file.md5

Or use a compressor:
Compress: bzip2 file
Test: bzip2 -tv file.bz2

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
When did the corruption occur? by Anonymous Coward · 2008-05-22 08:41 · Score: 0

Are you sure the corruption didn't happen when the PDFs were being created, as opposed to when they were being backed up (or restored)?

PDFs are not as trouble-free as many people seem to think. I would swear to this in a court of law because I used to have a job that had me handling several hundred PDFs every day that had been generated by companies from around the world, using virtually every PDF creation software package known to man. Trust me: PDFs don't always start out their lives error-free.

So finding out when and how the corruption occurred really matters here, because it will determine at what point(s) you'll need to verify the PDFs are valid.
Multivalent by Anonymous Coward · 2008-05-22 08:42 · Score: 2, Informative

I once found this:

http://multivalent.sourceforge.net/

The Multivalent suite of document tools includes a command-line utility that validates PDFs. It can be run across a whole directory of files too, so should do the trick.

Written in Java, so should run anywhere.
Linus to the rescue: Use Git by stew1 · 2008-05-22 08:44 · Score: 2, Interesting

Use git.

http://git.or.cz/

Check them all into a repository, then periodically run git-fsck. Git hashes all files in a repository with SHA-1 when they're first committed, and git-fsck recalculates the hashes.
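
To illustrate what gets recomputed, here is a small Python sketch of how Git derives the ID of a file's contents (a "blob"): it is the SHA-1 of a short header, `blob <size>` followed by a NUL byte, plus the raw bytes. The function name is made up for the example. It only shows the hashing step and does not replace git-fsck, which also checks the rest of the object store.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """SHA-1 that Git assigns to file contents stored as a blob object."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# The well-known ID of the empty blob:
# git_blob_sha1(b"") -> "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

If even one byte of a stored PDF changes, the recomputed hash no longer matches the name the object is stored under, which is how the corruption gets noticed.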

Jon
md5? WTF? RTFQ, morans by gblues · 2008-05-22 08:47 · Score: 4, Informative

The OP is not asking about preventing future corruption; the OP wants an automated way to sift through 6500 PDFs to find corrupt (or at least, potentially corrupt) PDF files without having to open each one by hand.

MD5 generates a hash of the binary data of the PDF file. An MD5 hash will not tell you if a PDF file is corrupt; it is only useful once the integrity of the PDF has been confirmed. After the integrity is confirmed, then you can make your database of MD5 hashes, to detect future corruption.

To test that a given file is a valid PDF, you could probably use something like pdf2ps; you don't care about the PostScript output per se, but you'd be testing for an error code. If pdf2ps returns an error code, you set the file aside for manual verification. This should, if nothing else, whittle down that 6500 PDF archive into a much smaller subset that you can feasibly test manually using Adobe Acrobat. And "refrying" those (printing them back to the Adobe PDF printer to re-PDF them) will probably fix them so they pass the pdf2ps test.

I will leave the actual writing of a script to recurse through your directories, feed each PDF file through pdf2ps, and test for error codes, as an exercise to the OP. Now that you have an idea of what to do, actually doing it should be pretty simple.
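
For anyone who wants a starting point, a rough Python sketch of that exercise follows. The default command is an assumption: it calls Ghostscript directly with its `nullpage` device rather than going through the pdf2ps wrapper, so no PostScript output is written. Whether a given Ghostscript version returns a non-zero exit code for every kind of damage is not guaranteed, so treat the result as a list of suspects, not a verdict. The `checker` parameter lets you swap in another tool that takes the PDF path as its last argument.

```python
import subprocess
from pathlib import Path

# Assumed default: Ghostscript interpreting the file and discarding the pages.
DEFAULT_CHECKER = ("gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=nullpage")


def check_pdfs(root, checker=DEFAULT_CHECKER, timeout=120):
    """Run the checker on every *.pdf under root; return paths that failed.

    A non-zero exit code or a timeout marks the file as a suspect.
    A missing checker program raises FileNotFoundError instead of
    silently flagging every file.
    """
    suspects = []
    for pdf in sorted(Path(root).rglob("*.pdf")):
        try:
            result = subprocess.run(
                [*checker, str(pdf)],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            failed = result.returncode != 0
        except subprocess.TimeoutExpired:
            failed = True
        if failed:
            suspects.append(pdf)
    return suspects
```

Calling `check_pdfs("My Paperport Documents")` would return the files to set aside for manual verification.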
1. Re:md5? WTF? RTFQ, morans by value_added · 2008-05-22 09:05 · Score: 1
  
  The OP is not asking about preventing future corruption; the OP wants an automated way to sift through 6500 PDFs to find corrupt (or at least, potentially corrupt) PDF files without having to open each one by hand.
  
  If that is indeed the case, and he's repeatedly encountering corrupt files, then I'd suggest he's asked the wrong question.
  
  As for pdf2ps, I'm unfamiliar with what error codes it returns, but if it's useful as you state, then it's worth pointing out that all the utilities he'll need (including md5, etc.) are available in Cygwin.
PDF validation by Peter+H.S. · 2008-05-22 09:01 · Score: 5, Informative

Here is a java command line tool designed to check the validity of 1000's of pdf files:

http://multivalent.sourceforge.net/Tools/pdf/Validate.html

There is also a tool for repairing some pdf errors:
http://multivalent.sourceforge.net/Tools/index.html

Never used it myself, just stumbled over it when I was searching for some pdf software.

--
Regards
Ghostscript by Marillion · 2008-05-22 09:08 · Score: 2, Interesting

Many are commenting on using checksums (MD5, SHA, ....) to validate that the file hasn't changed. This is good. However, none of these can actually tell if the PDF was good to begin with. I would suggest using Ghostscript to verify that the PDF is properly structured. Ghostscript is an open source tool that can convert PDF and PostScript files to several other formats. If Ghostscript can interpret the PDF file without errors, then odds are the file is good too.

--
This is a boring sig
Prevention, first by Anonymous Coward · 2008-05-22 09:26 · Score: 3, Insightful

One of the things that strikes me about the posts thus far is that nobody has asked the first and most important question: *WHY* are the files becoming corrupted? And what is the nature of the corruption?

From a general accessibility perspective, the age of the folders shouldn't matter, nor should the age of the files contained within them: A properly operating file system will maintain the integrity of the files it tracks indefinitely, assuming the underlying media is sound and all related hardware is functioning correctly.

Certainly, for verification of critical data, checksums are a good measure so long as they are done at the time of file creation, after verification that the files are good, but in light of the reported symptoms, I'd investigate the source of the problem first, and correct it. Then I'd make provisions for checksumming, in addition to regular file system health checks, before backing up those files and their checksums.

Proceeding from a "bottom-up point of view": For Windows-based systems, regardless of the file system in use (although I'd hope you'd be using NTFS), regular file system scans via CHKDSK are a must. The same applies to the file systems of other OSes: Run whatever utilities are available to verify the integrity of the file system on each hard drive regularly.

In addition, most hard drive manufacturers have utilities that you can download for free that will non-destructively scan the media for grown defects. These are typically available as ISOs: Make a CD, boot from it, and follow the instructions carefully, preferably after making a full, verified backup. Naturally, you'll have to know the manufacturer(s) of your hard drives.

Once you've identified the cause of the corruption, and corrected it, then you can (and should) make provisions for checksums.

But, there are other things that you can, and should check as well. Make sure that the AC power to your computer is sound from an electrical perspective and that the power available is sufficient for the load being placed upon it. Buy a good UPS if you don't have one already, and if you do have one, test it.

Then, test the power supply in the computer to ensure that it is providing adequate power.

Then test the memory in your computer.

Then test the hard drives, both surface level and file system level.

Hope this helps.
1. Re:Prevention, first by 19thNervousBreakdown · 2008-05-22 09:42 · Score: 1
  
  I was going to post exactly this. Files randomly becoming corrupted? Maybe if I had 5,000 Chinese kids remembering numbers, but data shouldn't just change on a computer, whether it's over the wire, on disk, or in memory. Treat the disease, not the symptom.
  
  --
  <xml><I><am><so><damn>Web 2.0</damn></so></am></I></xml>
2. Re:Prevention, first by CaseyB · 2008-05-22 12:17 · Score: 1
  
  Darn right! This is like asking how to efficiently procure and install pots throughout your house to catch all the water dripping from your ceiling.
filesystem or hardware issue? by bcrowell · 2008-05-22 09:35 · Score: 1

If you've got files on your computer that you only read, never write, and those files are getting corrupted, then it sounds like you have a problem with your filesystem, or a problem with your hardware. You need to find and fix the problem with the filesystem or hardware, not apply band-aids to PDF files if the problem has nothing to do with the PDF format per se.

Another possibility would be that you're using buggy software that is supposed to open PDF files in read-only mode, but actually corrupts them. If so, then you need to identify what the software is that's doing it, so you can remove that software from your computer.

For diagnosis, and also recovery, one thing you could try would be using the Unison file synchronizer to synchronize your files with a hard disk on another computer. If the files aren't changing, it will run extremely fast. If you notice mysterious changes to files right after a blackout or an electrical storm, then you can guess that's why. If you notice mysterious changes right after you use a particular application, ditto. Unison has a -fastcheck option on Windows, which you should read about; you'd probably want to run most of the time with it, and maybe once a week without it.

--
Find free books.
1. Re:filesystem or hardware issue? by Peter+H.S. · 2008-05-22 12:03 · Score: 2, Insightful
  
  Personally I think that the pdf files were dodgy from the beginning, but that the errors just show up when using newer generation pdf-viewing software. That could explain why it only seems to be very old files that are corrupted, a more random system error or a systematic software problem would corrupt newer files too.
  
  Your suggestions are of course valid; it must be considered a high priority to find out whether the system is corrupting the files, or if they were bad from the beginning.
  
  --
  Regards
pdftk by Anonymous Coward · 2008-05-22 11:37 · Score: 0

Get yourself a free (both beer-wise and speech-wise) copy of pdftk.

Works on Winders, Linux, OS X, *BSD, even Solaris. Build yourself an AT every night .BAT file that checks 'em all. It might even be able to repair the corrupted ones.
Might I suggest... by Arceliar · 2008-05-22 12:42 · Score: 1

...bittorrent? It has built in file integrity checking. Simply create a torrent for the files and have the backup source seed. Then periodically check the integrity of the files (many clients can let you force a recheck of file integrity) and it will not only identify corrupt files, but automatically download replacements from the backup. If you have to add files to the backup, it does require you to make a new torrent. Still, if you set things up right it does prove to be a rather elegant solution, I've used it myself for a few things in the past.
Version Control System by diamondmagic · 2008-05-22 13:28 · Score: 1

This could be anywhere from "just works" to very effective: use a distributed version control system, so copies can be kept on multiple systems easily, and one that can check the integrity of files.
http://en.wikipedia.org/wiki/Comparison_of_revision_control_software seems interesting, look for distributed and atomic commits, and signed tags (though this by itself doesn't guarantee it catches file errors right away).

I use and love Git, and though Windows support is there, I have heard it is questionable.

Adding or saving changes is "git add [file]" then "git commit", checking files for any changes is "git status", checking the integrity of the stored data history is "git fsck", pushing your changes to a remote backup location is "git push".

Mercurial and Monotone do similar things (if differently or not as well), but have better Windows support.

If there is any reason not to use a version system, it is the complexity and abundance of features that are not needed for something this simple.

--
Wonder what the public key field is for?
Harvard's Object Validation Environment by Anonymous Coward · 2008-05-23 03:24 · Score: 0

has modules for checking pdf files.
http://hul.harvard.edu/jhove/
two methods in one script by Anonymous Coward · 2008-05-23 05:29 · Score: 0

Linux and possibly OSX should be able to make use of md5sum and pdfinfo to do the following in a script:
1) Upon PDF generation, use pdfinfo to ensure that it is a valid PDF (I have had generation fail, and no reader could open the file when it was needed months later).

2) If generation and validation are successful, generate an md5sum for the file and store the values gen_md5sum, cur_md5sum, filename, unix_path, valid (bool), invalid_time, valid_timestamp in a MySQL database (MyISAM table format).

3) A weekly cron job recomputes the MD5 of each file at its stored path and checks it against the table to see whether the current MD5sum matches the known valid MD5.

If you are not doing MD5sum generation in batch then the process should be quick. This way you can see the date of generation for the file and its MD5, and also when it may have become corrupt.
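
A rough Python sketch of steps 2 and 3, with one swap stated plainly: it uses SQLite from the standard library instead of MySQL, so there is no server to set up. The column names follow the list above, except that `filename` is dropped because it can be derived from `unix_path`. Step 1 (the pdfinfo check) is left to the caller, and the function names are illustrative assumptions.

```python
import hashlib
import sqlite3
import time
from pathlib import Path

SCHEMA = """
CREATE TABLE IF NOT EXISTS pdf_sums (
    unix_path       TEXT PRIMARY KEY,
    gen_md5sum      TEXT NOT NULL,
    cur_md5sum      TEXT NOT NULL,
    valid           INTEGER NOT NULL,
    valid_timestamp REAL NOT NULL,
    invalid_time    REAL
)
"""


def md5_of(path):
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def open_db(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def record_new_pdf(conn, path, now=None):
    """Step 2: store the MD5 of a PDF the caller has already validated."""
    digest = md5_of(path)
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO pdf_sums VALUES (?, ?, ?, 1, ?, NULL)",
        (str(path), digest, digest, now),
    )
    conn.commit()


def weekly_check(conn, now=None):
    """Step 3: recompute each MD5 and flag files that no longer match."""
    now = time.time() if now is None else now
    newly_bad = []
    rows = conn.execute(
        "SELECT unix_path, gen_md5sum FROM pdf_sums WHERE valid = 1"
    ).fetchall()
    for path, gen_md5sum in rows:
        try:
            current = md5_of(path)
        except OSError:
            current = ""  # missing or unreadable counts as a mismatch
        if current != gen_md5sum:
            conn.execute(
                "UPDATE pdf_sums SET cur_md5sum = ?, valid = 0, invalid_time = ? "
                "WHERE unix_path = ?",
                (current, now, path),
            )
            newly_bad.append(path)
    conn.commit()
    return newly_bad
```

The `invalid_time` column then records roughly when the corruption was first noticed, which narrows down which backup is still good.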

-kb
XPdf by dtrumpet · 2008-05-23 06:36 · Score: 2, Informative

XPdf comes with a 'pdfinfo' command line utility. It returns non-zero if the PDF is corrupt. Should be somewhat efficient and very easy to automate.
PAR2 is your new best friend by Anonymous Coward · 2008-05-23 15:23 · Score: 0

PAR2 does checksums of each file and checksums of sections of the contents of each file in an overlapping way. If any section of any file in the PAR2 archive is corrupted/lost, the original contents can be regenerated, given enough time and CPU power ... depending on how much of each file is missing.
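
To give a feel for how lost data can be regenerated at all, here is a toy Python sketch using a single XOR parity block. This is a deliberately simpler technique than what PAR2 actually uses (Reed-Solomon coding, which can recover several damaged blocks); the function names are made up, and the example can rebuild exactly one missing block of known position.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def make_parity(blocks):
    """The parity block is simply the XOR of all data blocks."""
    return xor_blocks(blocks)


def recover_block(blocks, parity, missing_index):
    """Rebuild one lost block from the surviving blocks plus the parity."""
    survivors = [b for i, b in enumerate(blocks) if i != missing_index]
    return xor_blocks(survivors + [parity])
```

XOR-ing the parity with the surviving blocks cancels them out and leaves the missing one, which is why the recovery data can be much smaller than a full second copy.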