Ask Slashdot: What's a Good Tool To Detect Corrupted Files?
Volanin writes "Currently I use a triple boot system on my Macbook, including MacOS Lion, Windows 7, and Ubuntu Precise (on which I spend the great majority of my time). To share files between these systems, I have created a huge HFS+ home partition (the MacOS native format, which can also be read in Linux, and in Windows with Paragon HFS). But last week, while working on Ubuntu, my battery ran out and the computer suddenly powered off. When I powered it on again, the filesystem integrity was OK (after a scandisk by MacOS), but a lot of my files' contents were silently corrupted (and my last backup was from August...). Mostly, these files are JPGs, MP3s, and MPG/MOV videos, with a few PDFs scattered around. I want to get rid of the corrupted files, since they waste space uselessly, but the only way I have to check for corruption is opening them up one by one. Is there a good set of tools to verify the integrity by filetype, so I can detect (and delete) my bad files?"
You seem to be surprisingly OK with the fact that your computer crashed and all your documents and media were corrupted, as was your backup. I would have been beside myself. Hulk smash! Please let us know what different setups you're exploring to avoid this.
is urgency. Corrupted files have the ability to detect urgency and your discovery of them will come in a form compatible with the laws of Murphy.
2000-2001 MAF-Soft http://www.maf-soft.de/
The version I have is v1.0.3.102
It can scan single mp3s and entire folder structures for defects and logs everything if you wish. It will give you a percentage of how good the file is.
Depending on the damage you may be able to fix headers and chop off corrupted tag info with something like a MP3Pro Trim v1.80.exe
Obama's legacy: (N)othing (S)ecure (A)nywhere and (T)error (S)imulation (A)dministration
md5sum, or sha1sum if you prefer. Automate in cron against a list of knowns.
eg:
$ md5sum /home/wilbur/Documents/* > /home/wilbur/Docs.md5
$ md5sum -c /home/wilbur/Docs.md5
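A slightly fuller sketch of the same idea, assuming md5sum and find are installed. The function names make_manifest and verify_manifest are made up for illustration. It records a checksum for every file under a directory (not just the top level), and later prints only the files that are missing or no longer match.

```shell
# make_manifest DIR MANIFEST: record an MD5 checksum for every regular file under DIR.
# Keep MANIFEST outside DIR, otherwise it ends up checksumming itself mid-write.
make_manifest() {
    # -exec ... {} + hands file names straight to md5sum, so spaces are safe.
    find "$1" -type f -exec md5sum {} + > "$2"
}

# verify_manifest MANIFEST: print one line per file that is missing or changed.
verify_manifest() {
    # md5sum -c prints "<file>: OK" for intact files; filter those out.
    # LC_ALL=C keeps the status words untranslated so the filter stays reliable.
    LC_ALL=C md5sum -c "$1" 2>/dev/null | grep -v ': OK$'
}
```

Something like make_manifest /home/wilbur/Documents /media/usb/Docs.md5 once, then verify_manifest /media/usb/Docs.md5 from cron. This only helps for the next incident, since the manifest has to be built while the files are known to be good.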
Join the Slashcott! Feb 10 thru Feb 17!
Have some respect, the man just lost his entire porn stash.
Mod me down, my New Earth Global Warmingist friends!
You can run jpeginfo -c. I have a script that runs against a directory and makes a list for when I do data recovery for all my friends who don't listen when I tell them their 10 year old laptop may be dying soon.
In the land of the blind, the one-eyed man is kinky.
And even though your last backup is from August, this will still constrain the number of files you potentially have to eyeball.
unix "file" is not the answer. For some formats it does as little as look at a couple of header bytes. It's a great tool to guess a format. It's a terrible verifying parser and does nothing to verify content.
An example of what I'm getting at, with some made-up details: unfortunately HTML is not like well-formed XML, and every viewer is different anyway, so the best way to figure out whether an HTML file is corrupt is, unfortunately, to pull it up in Firefox. That only detects corruption in the structure of the file; if the corruption is just a couple of bits, you end up with problems like tQis, where the only way to see that the h got fouled up is to write more or less an IQ 100 artificial intelligence. All "file" is going to test is pretty much whether the file begins with or contains a regex something like less-than html greater-than (getting past the filters).
For content you could F around with, for example, piping an mp3 file thru a decoder and then thru an averaging spectrum analyzer to see if there's anything overly unusual in the spectrum. Also some heuristics, like: if the file is only 1 second long, then it's F'ed up.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Author here:
> Last backup August.
Yes, that was silly of me.
> Thinks there is a way to detect generic file corruption
There is no way to detect generic file corruption. But there is a way to detect specific filetype corruption. For example, I already found mp3val, that is able to scan all my mp3 and check for file integrity, and even fix a few kinds of corruption (such as unmatching bytes in the header and sound chunks). Maybe with the right set of tools, I might also detect (or even fix) my corrupted pictures, movies and books as well.
If I clone myself, can I call it a thread?
If a girl winks to us, can I call it a race condition?
You need a good filesystem, with embedded data checksums and self-healing using redundant copies. For Linux, btrfs is fine. For Mac OS X & Linux, ZFS.
:wq
Tech Tool Pro, over on the Mac side, has a "File Structures" check which looks at a lot of different structured file types to make sure that their internal format is valid.
That seems very strange--the only files that should really be corrupted, unless something extremely rare and catastrophic happened, are the ones that were being written when power went out, or were cached. And even then, a flush usually flushes everything, or at least whole files at once, or areas of disk. Is the partition highly fragmented or something?
I know this doesn't do much for your question, but that kind of failure mode is almost exactly what filesystems do their damnedest to avoid. HFS+, being journaled, should be even more proof against, well, exactly what happened to you. Maybe the Linux driver is poor, but man, if you got silent data corruption on a multitude of files that weren't even being written, that's really bad and the driver should be classified "EXPERIMENTAL" at best, and certainly not compiled into distros' default kernels.
To answer your question, I don't have experience with any tools (I automate my backups, and any archival files go on a RAID volume that does a full integrity scan nightly), but once you find one, you should separate your files into two categories--"must be good", and "can be bad". The "must be good" files (serial #s, source code, etc.), you hand-check, so you know for certain that every one of them is good. It'll also motivate you to replace them now, instead of later when replacements will only get harder to come by. The "can be bad" files (music, pictures, etc.), you do the automated check on and then just delete as you run into ones that the check missed. This has the advantage of concentrating your effort into where it's useful. If you try to check all of your files, you'll just burn out before you finish. You may even want to do more advanced triaging, but you'll have to come up with the categories and criteria there. The main thing is, split this problem up.
<xml><I><am><so><damn>Web 2.0</damn></so></am></I></xml>
Consider the possibility that the backup already contains corrupted files. I once had defective RAM where only one bit flipped occasionally. The machine was quite stable, so the defect went undetected and over a couple of months it silently corrupted hundreds of files. Unless he finds out what caused the crash, he can't be sure that the backup is alright.
You can just run mencoder or ffmpeg on all the mp3 and mov files (with a small shell script, probably involving 'find' or similar) and tell it to write the output to /dev/null. That should go through those files as fast as they can be read from disk and abort with an error on those that are broken.
For the jpgs, you could try something similar with imagemagick's 'convert', to convert them to whatever format to /dev/null, which also needs to read the whole file content and aborts if they're broken (one should hope).
Those converters are really fast, especially ffmpeg, so that should complete in a reasonable time.
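A rough sketch of what that small shell script could look like for the audio and video files, assuming ffmpeg is installed. The function names and the list of extensions are made up for illustration. It treats any error message from a full decode as a sign of damage, because ffmpeg may still exit successfully after skipping over broken data.

```shell
# decodes_cleanly FILE: succeed only if ffmpeg decodes FILE without printing errors.
decodes_cleanly() {
    # -v error: show only errors; -f null -: decode everything, discard the result.
    # stdin comes from /dev/null so ffmpeg cannot swallow the file list in a loop.
    [ -z "$(ffmpeg -v error -i "$1" -f null - 2>&1 </dev/null)" ]
}

# list_broken_media DIR: print audio/video files under DIR that fail the check.
# Assumes file names contain no newlines.
list_broken_media() {
    find "$1" -type f \( -iname '*.mp3' -o -iname '*.mov' -o -iname '*.mpg' \) |
    while IFS= read -r f; do
        decodes_cleanly "$f" || printf '%s\n' "$f"
    done
}
```

Running list_broken_media ~/Media > suspects.txt gives a list to review by hand. Expect some false positives, since files that always played with minor glitches can also produce error output.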
I'd be asking myself why lots of files became corrupted from one dodgy file system event. Assuming HFS works like file systems I'm more familiar with, it will allocate sequential blocks for files wherever it can. This means that a random filesystem splat is really unlikely to corrupt loads and loads of files. You might expect a file system corruption to cause a load of files to go missing (if a directory entry is corrupted) or corrupt a few files, but not put random errors into loads of files.
I'd check to see whether files I was writing now get corrupted too. It might be dodgy disk or RAM in your computer.
The above might be complete paranoia, but I'm a paranoid person when it comes to my data, and silent corruption is the absolute worst form of corruption.
For next time, store MD5SUM files so you can see what gets corrupted and what doesn't (that is what I do for my digital picture and video archive).
Every man for himself, all in favour say "I"
Well...
My first suspicion would be that the filesystem is messed up, not the actual files. Unless s/he had a lot of pending writes to all of these files, there is no reason that something should have actually overwritten or garbled them when the power shut down. Much more likely was an impending or in-progress write to the filesystem's tables, which has affected where it thinks all the files' pieces are stored. And if that is the case, date modified and size may be irrelevant because those are going to be reported by the filesystem.
Aside from trying to read back sector-by-sector data and assembling them, however, I don't know that there's a remedy.
I vote based on politicians' actions, unless contrary to my preconceptions. Often wrong, never uncertain. #iamthe99%
Not the BSOD.
If the OP had used open source "tripwire" on known-good files in each filesystem on his Macbook, and saved the resultant data output to a USB thumbdrive formatted with FAT32, the OP would have had a good chance of determining all corrupted files. In this case, an ounce of prevention would have prevented several pounds of "cure".
Check out http://tripwire.org./
Author here:
At first I thought this idea wouldn't work. As some people have already written here, the 'file' command sometimes just checks for a few bytes. But since it is so easy to implement, why not give it a try? And indeed, for videos it worked quite well. Some of the corrupted MOV files were detected simply as 'data file' or even 'MPEG sequence' and were promptly deleted! Thank you for the idea.
Author here:
Ok, I could deal with the loss of some unique videos and pictures from travels... but now that you mention the porn... *weep*
I'd recommend running a base OS and then run something like VMware workstation so that you run other OSes inside the main OS. One huge benefit is that you can have access to multiple OSes at the same time and you don't need to reboot into them either. With hypervisor technology getting common on desktop, there probably isn't any need to multi-boot unless you have a specific reason not to use virtualization.
These comments are full of 'helpful' suggestions to compare to backup or to md5's generated from the backups.
That makes no sense.
If he has a good set of backups JUST RESTORE THE BACKUPS to get known good files back. Why would you read every backup file and every current file, then compare them, then make a list of ones that don't match just to restore the backups. Restore them all. done.
- For the complete works of Shakespeare: cat
Perhaps, but I agree with the first post. Going through and simply looking at all the JPEGs or MPEGs is probably the only way to tell if a file is corrupted (I wouldn't trust the CPU to do an accurate job). It also gives you a chance to erase a lot of stuff you really don't need anymore. I dumped 300 gig off my drive simply by going through everything... took a while but it was worthwhile to get rid of old shows/movies I'll likely never watch.
My AC stalker: " I personally agree with your posts most of the time, but that won't keep me from modding you troll"
zfs! Works great. Included with FreeBSD 9, amongst other OSs.
You might also enjoy John Siracusa's exhaustive review of filesystems on one of my favorite podcasts.
Spoon not. Fork, or fork not. There is no spoon.
The JSTOR/Harvard Object Validation Environment:
http://hul.harvard.edu/jhove/
It's specifically designed to first probabilistically identify files, then attempt to verify their format.
Disclaimer: I haven't worked on it directly, but I did spend a number of years in the digital preservation space, so I probably know some of the people who have contributed to it.
Let me ask a stupid question since I've never run a battery out on a machine running Ubuntu. Why did this happen? Running OSX or Windows, the machine would have hibernated safely before the battery ran out. Does Ubuntu not do this and it just dies? Or is this something you configured to act this way? If it is default behavior in Ubuntu it is something they ought to fix.
mplayer can detect corrupted movie and audio files:
find . -name '*.mov' -exec mplayer -msglevel all=6 -speed 100.0 -framedrop -nogui -nolirc -cache 8192 -tskeepbroken -ao null -vo null {} \; | grep Warning! > $1.txt
Change the *.mov as appropriate.
Alright now I'm afraid I can't help with your verify problem but I do have one piece of solid advice: get rid of Paragon HFS immediately!
It is a truly shoddy piece of software that as of version 9.0 has a terrible bug that will cause it to destroy HFS+ filesystems. Google "paragon hfs corruption" and you will see many many horror stories from people who just plugged a Mac OS X disk into a Windows machine w/ Paragon HFS and then discovered the entire filesystem was hosed. In my dual-boot win/mac setup I replaced my copy of MacDrive with a trial version of Paragon HFS 9.0 from their website and every single one of the six HFS+ disks I had connected internally were damaged. Disk Utility couldn't do a thing and I had to buy a program called Diskwarrior to even begin to recover data. I ended up losing two disks worth of files anyway.
http://www.mac-help.com/t12137-opened-hfs-drive-win7-paragon-hfs-now-wont-boot.html
http://www.wilderssecurity.com/showthread.php?t=299306
http://hardforum.com/showthread.php?t=1677099
http://www.avforums.com/forums/apple-mac/1509344-hfs-super-block-not-found.html
whew! Anyway the pain I went through after that software very nearly ruined my life was so great, I don't want it to happen to anyone else. According to their own website 9.0 has this awful bug but they fixed it in 9.0.1. Evidently the trial download on the main page is still for version 9.0 and still has the disk destroying bug! Any software company that releases a filesystem driver with this terrible a bug (not to mention the numerous reports of BSODs and other relatively minor problems) clearly has terrible quality assurance and simply can't be trusted.
The bad news is I don't know of any (and I don't think you'll find any) easy, one-shot tool to run across the whole lot that gives you a simple "corrupted yes/no?" answer to lots of different filetypes.
The good news is it'd be reasonably easy to lash together something in bash, kick it off overnight and come back in the morning to a list of probably-corrupted files.
In pseudo-bash (because I haven't the time to write it out and check it works properly), something like this would be a good start:
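function checkJpeg {
    # jpeginfo -c exits non-zero when the JPEG fails its integrity check
    jpeginfo -c "$1" || return 1
    return 0
}
function checkPdf {
    # do something to check a PDF is OK
    return 0
}
# -b drops the leading file name, leaving e.g. "JPEG image data, ..."
FILETYPE=`file -b "$1"`
case $FILETYPE in
    *JPEG* )
        checkJpeg "$1" || echo "$1" ;;
    *PDF* )
        checkPdf "$1" || echo "$1" ;;
esac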
Then run it with the help of find /home -type f -print0 to check every file in /home. This would give you a list of potentially-corrupted files. Up to you how you deal with it - personally I wouldn't run rm against it in case you find files that can be rescued or that your checks aren't as perfect as you'd like.
For extra credit, determine the expected filetype based on file extension and then use file(1) as your first "is it corrupted?" test - that way you'll spot files that are too corrupted for file(1) to work reliably.
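The extra-credit idea could be sketched roughly like this, assuming a file(1) that supports -b and --mime-type. The function names and the extension-to-type table are assumptions made for illustration, and the MIME strings file(1) reports can vary between versions.

```shell
# sniff_type FILE: print the MIME type that file(1) detects from the content.
sniff_type() {
    file -b --mime-type "$1"
}

# expected_type FILE: print the MIME type implied by the extension (assumed table).
expected_type() {
    case $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]') in
        *.jpg|*.jpeg) echo image/jpeg ;;
        *.mp3)        echo audio/mpeg ;;
        *.mpg|*.mpeg) echo video/mpeg ;;
        *.mov)        echo video/quicktime ;;
        *.pdf)        echo application/pdf ;;
        *)            echo unknown ;;
    esac
}

# type_mismatch FILE: print FILE and succeed if content and extension disagree.
type_mismatch() {
    want=$(expected_type "$1")
    if [ "$want" = unknown ]; then
        return 1   # no opinion about extensions we do not know
    fi
    if [ "$(sniff_type "$1")" != "$want" ]; then
        printf '%s\n' "$1"
        return 0
    fi
    return 1
}
```

A file that passes this test can still be damaged further in, so it only works as a cheap first pass before the per-format checkers.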
That is a good thought, and photorec does an excellent job of finding pictures and videos by searching through your sectors - definitely worth a try.
http://www.cgsecurity.org/wiki/PhotoRec_Step_By_Step
The real reason, and it was stated in the summary, is that the file system was HFS+, which is far less tolerant of this behavior than ext4.
<infomercial>its JUST. THAT. EASY folks!</infomercial>
... yes, this is not what you want to hear at this point, but try to have a positive take on this.
Last year during a routine Windows 7 installation, my second hard drive, from which I dual boot my 90%-of-the-time-in-use Linux, was destroyed. Either it was a coincidence that it occurred during the Win7 installation or it was a nefarious plot, but the hard disk, a 1TB Seagate SATA, developed an unrecoverable click of death.
On that hard drive I had my short stories which I had written in college and the intervening years since then, much of my photos, skype history and many other things, seemingly important to me at the time of the "disaster". I was inconsolable for a few days, and felt like I had been bereft of someone very dear to me. Then it hit me -- to hell with the stories, to hell with the photos, to hell with the rest of the digital baggage I had accumulated. I could write my stories again, and do it better, I could take more photos, I could hoard more useless junk. After a month I no longer missed any of the lost stuff.
Learn to view such mishaps more philosophically and learn to shed all the useless garbage you accumulate through the years; realize that almost nothing that you can store on your computer, or up in your attic, has really all that sentimental value you attach to it. Learn what's important, intrinsically important, to you and safeguard that. All the rest, you'll be amazed how little you need it and how even less you'll miss it.
To hell with useless stuff.
Seconding the photorec / testdisk suite, they are incredible. I would rate it up with ddrescue as the top 2 data recovery tools.
I used to do that, but found it to be pointless these days. Organizing the stuff is one thing, but deleting is basically pointless unless you can automate it. 300GB may seem like a job well done, but with 3TB drives for $100 these days, you just saved yourself $10 worth of harddrive space and it probably took you a few hours.
My current setup is to have everything on my server box and simply copy over what I need to my laptop as I need it and NFS/SSHFS the rest of it on the fly when home.
George is your best bet. He's not bright enough for most support tasks, but he can certainly handle this one.
Indeed, I used photorec/testdisk to recover mp4 files after they had (all) been accidentally deleted from an HFS+ partition.
But when I first started it in its default mode, it "found" only rubbish, breaking up the actual mp4s into a mess of .doc, .xml, .jpg, .whatever files, including totally broken .mp4s.
When I restarted it after configuring it to only look for .mov/.mp4, it did a fantastic job, and as far as I know, all files could be recovered. Of course, that was made easier by the fact that I knew that all the files which needed to be recovered were .mp4.
I think this is the root of the problem here, he chose the wrong filesystem to share between the three OSes. Sadly there are not too many choices. FAT32 is the only one natively supported by all three, with its well known limitations. He might have been better with NTFS though, using NTFS-3G on Linux and OS X, but that has some performance hit. There's really no perfect solution for this kind of problem.
.sig: No such file or directory
Well, jpeg files have a structure that will generate detectable errors if it's damaged. So simply opening them with something as simple as djpeg from the IJG and piping the output to /dev/null should give you a pretty good start on damaged images. Something like this perhaps:
find . -name "*jpg" -o -name "*jpeg" -o -name "*JPG" -o -name "*JPEG" | while IFS= read -r filename; do if djpeg "$filename" > /dev/null 2>&1; then :; else echo "$filename is toast"; fi; done
You could probably do something similar with mpg123 and mplayer for .mp3 and movies.
First, let's presume you're running Linux for what follows.
1. You're going to want to be familiar with both file(1) and find(1). File(1) is pretty straightforward, but be aware that its heuristics for file type detection vary in accuracy. If you're not find-literate, then at least get used to this construct:
find /foo/bar -name "*.jpg" -print | sort -u > /tmp/files.jpg
which will recursively search directory /foo/bar for all files suffixed ".jpg" and dump a sorted list of them into /tmp/files.jpg and this one:
find /foo/bar -type f -print | sort -u > /tmp/files.all
which will search the same directory, but will return a list of all (plain) files, that is, things which are not directories, devices, sockets, etc., sorted and dumped into file /tmp/files.all.
(Note that the method by which find traverses filesystem trees won't yield sorted output, hence the need to pipe these through sort.)
2. You now have (a) a list of all jpg files and (b) a list of all files. (I picked jpg arbitrarily to illustrate the process, by the way.) You can now generate a list of all files that are NOT jpg with this:
comm -13 /tmp/files.jpg /tmp/files.all > /tmp/files.all2
The point of this exercise is that you can now repeat steps 1-2 with .gif, .mpg, etc., as you deal with each file type and reduce the remaining list to those awaiting your attention. /tmp/files.all3, /tmp/files.all4, etc. will each be smaller and eventually, if you deal with all files, /tmp/files.allX will be zero-length. Note that not all files have suffixes, of course -- and those without will likely be the ones requiring the most manual effort. If you want to know which suffixes are most numerous, something like
sed -e "s/.*\.//" /tmp/files.all | sort | uniq -c | sort -n
will give you a rough idea.
3. Now then...you'll need some tools for dealing with each file type. The first tool I'd use is stat(1), to check sizes for plausibility. Then things like jpeginfo(1), mp3val(1), tidy(1), will be some help, but of course you'll need to distinguish between "error message emitted because file is corrupt" and "error message emitted because file has minor issues...that it had BEFORE this episode". You may need to check the Ubuntu repository for tools you don't have; you may need to do some searching on the web for "Linux tool to check PDF integrity" and similar.
4. If you have backups of any kind and can restore them, then you could try using sum(1) to compare checksums pre- and post-incident. This is a filetype-invariant method, which is good because it lets you skip the above...but bad because all it will tell you is "different", not "mildly damaged" or "horribly corrupted" or something in between.
5. I would recommend against deleting anything at this point. Instead, move it to secondary storage, like an external drive. I don't have a specific reason for advising this, other than "many years of experience doing partially-manual, partially-automated things like this and a recognition that sometimes errors in the methodology...or fatigue introduced by the tedium of executing it...lead to mistakes".
6. Good luck.
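As one possible take on the size plausibility check in step 3, here is a minimal sketch that uses find's -size test in place of stat(1). The function name list_small_files is made up, and the threshold is an arbitrary guess to tune per file type.

```shell
# list_small_files DIR PATTERN MIN_BYTES: print files under DIR whose names match
# PATTERN (case-insensitively) and whose size is below MIN_BYTES.
list_small_files() {
    # With the c suffix find counts in bytes; a leading - means "less than".
    find "$1" -type f -iname "$2" -size "-${3}c" | sort
}
```

For example, list_small_files /foo/bar '*.jpg' 1024 lists JPEGs under 1 KB, which are unlikely to be intact photos. In keeping with step 5, move whatever it finds aside rather than deleting it.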
The identify program is a member of the ImageMagick(1) suite of tools. It describes the format and characteristics of one or more image files. It also reports if an image is incomplete or corrupt.
That seems not worth it. The thing is, both drive-space and data-volume tends to double every ~18 months or so. You wait first "a couple of years", then on a network drive, then once a decade has passed, they go in the trash.
But a decade ago the cheapest storage was a 40GB drive costing $130 or thereabouts. Today 40GB worth of space is about 1.3% of that shiny new 3TB disk costing $150 or thereabouts.
There's essentially no benefit to deleting old data, because old data is *always* small data, and so copying it to the new disk will use a minuscule portion of the new disk and have essentially no cost. $150/3TB is equivalent to $2 for saving those 40GB.
The only data that's potentially worthwhile to delete is *new* data that you have no need for. There is no such thing as "old but large data".
Avoiding clutter is a different issue, but that's easily solved by copying all the old data to a named folder, then move out of that folder and into the current file-system only those files you actually use.