Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF?
postbigbang writes "Imagine having thousands of images on disparate machines. Many are dupes, even among the disparate machines. It's impossible to delete all the dupes manually and create a singular, accurate photo image base. Is there an app out there that can scan a file system, perhaps a target sub-folder system, and suck in the images-- WITHOUT creating duplicates? Perhaps by reading EXIF info or hashes? I have eleven file systems saved, and the task of eliminating dupes seems impossible."
I've had the same problem, made worse because I stupidly try to make the world a better place by renaming images or putting them in sub-directories.
fdupes will do a bit-wise comparison. -r = recurse. -d = delete.
fdupes would be the fastest way.
ExifTool is probably your best start:
http://www.sno.phy.queensu.ca/~phil/exiftool/
If they are identical then their hashes should be identical.
So write a script that generates hashes for each of them and checks for duplicate hashes.
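A minimal sketch of that idea, assuming GNU find/xargs/sort and sha256sum are installed and that no filename contains a newline or backslash. The function name list_extra_copies is made up for illustration. It only prints the redundant copies and deletes nothing:

```shell
#!/bin/sh
# list_extra_copies DIR...
# Hash every regular file under the given directories and print each file
# whose hash was already seen, i.e. every copy except the first one in
# sorted order.
list_extra_copies() {
    find "$@" -type f -print0 |
    xargs -0 -r sha256sum |
    LC_ALL=C sort |
    # sha256sum prints 64 hex digits, two separator characters, then the
    # path, so the path starts at column 67.
    awk 'seen[substr($0, 1, 64)]++ { print substr($0, 67) }'
}
```

Something like list_extra_copies ~/Pictures /mnt/backup would then list candidates for removal; review the list before feeding it to rm.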
Works excellently for this.
Whatever you decide on, it could probably be done in a hundred lines of perl.
Funny you mention perl.
There's a tool written in perl called "findimagedupes" in Debian. Pretty awesome tool for large image collections, because it could identify duplicates even if they had been resized, or messed with a little (e.g. adding logos, etc). Point it at a directory, and it'll find all the dupes for you.
For the general case (any file), I've used this script:
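#!/bin/sh
# Usage: pass one or more directories to scan.
# Writes rem-duplicates.sh, which lists every file whose MD5 matches another
# file's, one group per paragraph. Every rm line is commented out; uncomment
# the copies you want gone, then run the generated script.
OUTF=rem-duplicates.sh
echo "#! /bin/sh" > "$OUTF"
find "$@" -type f -print0 |
xargs -0 -n1 md5sum |
sort | uniq -w 32 --all-repeated=separate |
sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> "$OUTF"
chmod a+x "$OUTF"; ls -l "$OUTF"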
It should be straightforward to change "md5sum" to some other key -- e.g. EXIF Date + some other EXIF fields.
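As a rough sketch of what swapping the key could look like: the function below takes the key-generating command as its first argument, so md5sum can be replaced by anything that prints a one-line key. The names group_by_key and exif_key are hypothetical, the choice of EXIF fields is just an assumption about what identifies a photo, and filenames containing tabs or newlines are not handled.

```shell
#!/bin/sh
# group_by_key KEYCMD DIR...
# Run KEYCMD on every file (it must print a one-line key) and print the
# files that share a key, one group per paragraph. Files with an empty key
# are skipped. The order of the groups is unspecified.
group_by_key() {
    keycmd=$1; shift
    find "$@" -type f | while IFS= read -r f; do
        printf '%s\t%s\n' "$("$keycmd" "$f" </dev/null)" "$f"
    done |
    LC_ALL=C sort |
    awk -F '\t' '
        $1 != "" { n[$1]++; files[$1] = files[$1] $2 "\n" }
        END { for (k in n) if (n[k] > 1) printf "%s\n", files[k] }'
}

# Hypothetical key: capture time, camera model and image size from exiftool
# (-s3 prints tag values only). Assumes exiftool is installed.
exif_key() {
    exiftool -s3 -DateTimeOriginal -Model -ImageSize "$1" 2>/dev/null | tr '\n' '|'
}
```

With that in place, group_by_key exif_key DIR would group photos whose chosen EXIF fields agree, even if the files differ byte for byte.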
(Also, isn't this really a question for superuser.com or similar?)
I second exiftool. Lots of options to rename files. If you rename files based on create time and perhaps other fields like resolution, you will end up with a unique filename per photo, and then you can filter out the duplicates, which map to the same name.
Here is a quick command which will rename every file in a directory according to CreateDate:
exiftool "-FileName<CreateDate" -d "%Y%m%d_%H%M%S.%%e" DIR
If the files were all captured with the same device it is probably super easy since the exif info will be consistent. If the files are from lots of different sources...good luck.
Yeah, this Ask Slashdot should really be about teaching people how to search for packages in aptitude or whatever your package manager is...
Here are some others:
findimagedupes
Finds visually similar or duplicate images
findimagedupes is a commandline utility which performs a rough "visual diff" to
two images. This allows you to compare two images or a whole tree of images and
determine if any are similar or identical. On common image types,
findimagedupes seems to be around 98% accurate.
Homepage: http://www.jhnc.org/findimaged...
fslint :
kleansweep : ...
File cleaner for KDE
KleanSweep allows you to reclaim disk space by finding unneeded files. It can
search for files based on several criteria; you can look for:
* empty files
* empty directories
* backup files
* broken symbolic links
* broken executables (executables with missing libraries)
* dead menu entries (.desktop files pointing to non-existing executables)
* duplicated files
Homepage: http://linux.bydg.org/~yogin/
komparator :
directories comparator for KDE
Komparator is an application that searches and synchronizes two directories. It
discovers duplicate, newer or missing files and empty folders. It works on
local and network or kioslave protocol folders.
Homepage: http://komparator.sourceforge....
backuppc : (just in case this was related to your intended use case for some reason)
high-performance, enterprise-grade system for backing up PCs
BackupPC is disk based and not tape based. This particularity allows features
not found in any other backup solution:
* Clever pooling scheme minimizes disk storage and disk I/O. Identical files
across multiple backups of the same or different PC are stored only once
resulting in substantial savings in disk storage and disk writes. Also known
as "data deduplication".
I bet if you throw Picasa at your combined images directory, it might have some kind of "similar image" detection too, particularly since it sorts everything by exif timestamp.
That said, I've never had to use any of this stuff, because my habit was to rename my camera image dumps to a timestamped directory (e.g. 20140123_DCIM ) to begin with, and upload it to its final resting place on my main file server immediately, so I know all other copies I encounter on other household machines are redundant.
This will help find exact matches by exif data. It will not find near-matches unless they have the same exif data. If you want that, good luck. Geeqie has a find-similar command, but it's only so good (image search is hard!). Apparently there's also a findimagedupes tool available, see comments above (I wrote this before seeing that and had assumed apt-cache search had already been exhausted).
I would write a script that runs exiftool on each file you want to test. Remove the items that refer to timestamp, file name, path, etc., then take an MD5 of what's left.
Something like this exif_hash.sh (sorry, slashdot eats whitespace so this is not indented):
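#!/bin/sh
# Print "<md5 of the filtered EXIF dump>  - <file>" for each image given.
for image in "$@"; do
# Drop timestamp, File* and Directory lines so only fields that travel with the image are hashed.
echo "$(exiftool "$image" | grep -v -e '20..:..:' -e '19..:..:' -e File -e Directory | md5sum) $image"
done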
And then run:
find [list of paths] -type f -print0 |xargs -0 exif_hash.sh |sort > output
If you have a really large list of images, do not run this through sort. Just pipe it into your output file and sort it later. It's possible that the sort utility can't deal with the size of the list (you can work around this by using grep '^[0-8]' output |sort >output-1 and grep -v '^[0-8]' output |sort >output-2, then cat output-1 output-2 > output.sorted or thereabouts; you may need more than two passes).
There are other things you can do to display these, e.g. awk '{print $1}' output |uniq -c |sort -n to rank them by hash.
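For example, here is a rough sketch that prints only the files whose EXIF hash shows up more than once. It assumes the sorted output file has lines of the form "<md5>  - <path>", which is what md5sum reading from a pipe produces in the script above. The function name show_exif_dupes is made up.

```shell
#!/bin/sh
# show_exif_dupes FILE
# FILE must be sorted by hash. Prints the paths of every hash that occurs
# more than once, with a blank line between groups.
show_exif_dupes() {
    awk '
        # Split each line into its hash and the path that follows "  - ".
        { hash = $1; path = $0; sub(/^[0-9a-f]+ +- /, "", path) }
        # A new hash starts: flush the previous group if it had duplicates.
        hash != prev {
            if (count > 1) printf "%s\n", group
            group = ""; count = 0; prev = hash
        }
        { group = group path "\n"; count++ }
        END { if (count > 1) printf "%s", group }
    ' "$1"
}
```

Running show_exif_dupes output then gives one paragraph per set of images with identical filtered EXIF data.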
On Debian, exiftool is part of the libimage-exiftool-perl package. If you know perl, you can write this with far more precision (I figured this would be an easier explanation for non-coders).
See my earlier contribution: geeqie. It will even scan for image similarity, not just rudimentary hashing. Someone else mentioned gqview and that it was out of date; geeqie is what gqview became.