Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF?
postbigbang writes "Imagine having thousands of images on disparate machines. Many are dupes, even among the disparate machines. It's impossible to delete all the dupes manually and create a singular, accurate photo image base. Is there an app out there that can scan a file system, perhaps a target sub-folder system, and suck in the images-- WITHOUT creating duplicates? Perhaps by reading EXIF info or hashes? I have eleven file systems saved, and the task of eliminating dupes seems impossible."
Exactly what you mean by deduplication is kind of vague, but whatever you decide on, it could probably be done in a hundred lines of perl (using CPAN libraries of course).
On the Mac I use a program called Gemini. It can root out any duplicate files between multiple sources, and give you options on which ones to keep/delete (i.e. manual, oldest, etc).
I'd have put them all on FreeBSD ZFS filesystems and enabled dedup... voila, job completed ... :P
I've had the same problem as I stupidly try to make the world a better place by renaming or putting them in sub-directories.
fdupes will do a bit-wise comparison. -r = recurse. -d = delete.
fdupes would be the fastest way.
Just script something that grabs a list of image files from the filesystem, runs an MD5 hash on all of them, locates any duplicate MD5s, then outputs a list of files to delete later. Now if you're talking about a somewhat more sophisticated duplicate detection (such as, say, detecting images that are the same picture but are not in the same size or format) you're getting into the "someone will pay you money for this" territory.
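A minimal Python sketch of the script described above, assuming the goal is only a report of byte-identical images. The extension list and the names md5_of_file and find_duplicate_images are made up for illustration; nothing here deletes anything.

```python
import hashlib
import os
from collections import defaultdict

# Assumed list of extensions that count as "image files"; adjust to taste.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff"}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large photos are never loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_images(root):
    """Return lists of image paths under root whose contents hash identically."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                path = os.path.join(dirpath, name)
                by_hash[md5_of_file(path)].append(path)
    # Only hashes seen more than once describe duplicates.
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Each returned group holds every copy of one image; keeping the first path and listing the rest gives the "files to delete later" list.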
It is important here to know why you want to remove duplicate images. Is it just so you can have one large photo album without seeing the same picture twice? If that is true then you could sync all images on all machines onto one large drive, sort the files by size and manually delete the duplicates as they would all bunch together.
If you are trying to save disk space, then using a file system like ZFS can automatically remove duplicate data and add compression.
Also consider that if you do not know where your duplicate files are then any duplicates in existence are, effectively, acting as backups for your disorganized collection. Erasing duplicates until you find a way to cleanly backup your data may be a mistake in the long run.
# zfs set dedup=on mypool/photos
Make sure you have enough RAM though (1GB of RAM per TB of unique data) and/or an SSD for L2ARC to make sure it doesn't grind to a halt.
If they are identical then their hashes should be identical.
So write a script that generates hashes for each of them and checks for duplicate hashes.
fslint is a toolkit to find all redundant disk usage (duplicate files
for e.g.). It includes a GUI as well as a command line interface.
http://www.pixelbeat.org/fslin...
There are many duplicate file finders, if the files are binary identical. (Search on Google for "find duplicate files" or "delete duplicate files".) However, if the files have been modified in any way, this becomes much more difficult, because similar files for music or photos have a degree of tolerance for errors and variation. Signed: the author of two of those programs.
I would try running all the files through ssdeep.
You could script it to find a certain % match that you're satisfied with. Only catch to this is that it could be a very time-intensive process to scan a huge number of files. Exif might be a faster option which could be cobbled together in Perl pretty quickly, but that wouldn't catch dupes that had their exif stripped or have slight differences due to post-processing.
It has been reported to work under WINE, but your mileage may vary.
Sorry - don't have any links.
I did just this, but by copying all of the pics from the various devices to a linux fileshare, and then ran: http://www.pixelbeat.org/fslint/ Nice software, did exactly what I wanted.
You could use Unison to merge them two at a time.
Another option is something like FSLint, which can detect duplicates.
I'm pretty sure I wrote something like this in perl/bash in like 20 minutes.
1 - do an md5sum of each file and toss it in a file
2 - sort
3 - perl (or your language of choice) program, basically:
sum = "a"
newsum = next line
if newsum == sum delete file
else sum = newsum
Works excellently for this.
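Step 3 above can be sketched in Python roughly as follows. It assumes the input is the sorted output of md5sum, one "hash, two separator characters, path" entry per line, and it only yields the deletion candidates instead of deleting them. The function name is invented for this example.

```python
def duplicates_from_sorted_sums(lines):
    """Yield paths whose checksum equals the checksum on the previous line.

    `lines` is assumed to be sorted md5sum output: "<hash>  <path>" per line.
    """
    previous_sum = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        checksum, _, rest = line.partition(" ")
        path = rest[1:]  # skip the second separator character (" " or "*")
        if checksum == previous_sum:
            yield path  # same hash as the line before: a deletion candidate
        else:
            previous_sum = checksum
```

The first file of every group is kept because it is the one that sets previous_sum.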
fdupes will work and is faster than writing a homemade script for the job. The big problem is "across multiple machines" which might require use of, say, sshfs to bring all the machines' data remotely onto one temporarily for duplicate scanning. fdupes checks sizes first, and only then starts trying to hash anything, so obvious non-duplicates don't get hashed at all. Significant time savings. Across multiple machines, another option is using md5deep to build recursive hash lists.
The only tool so far that I've used for image duplicate finding that checks CONTENT rather than bitwise 1:1 duplicate checking is GQview on Linux. It works fairly well, though it's a bit dated by now it's still a good viewer program. Add -D_FILE_OFFSET_BITS=64 to the CFLAGS if you compile it yourself on a 32-bit machine today though.
Requires WINE but should work fine on Linux.
http://www.anti-twin.com/
whatever you decide on, it could probably be done in a hundred lines of perl
Funny you mention perl.
There's a tool written in perl called "findimagedupes" in Debian. Pretty awesome tool for large image collections, because it could identify duplicates even if they had been resized, or messed with a little (e.g. adding logos, etc). Point it at a directory, and it'll find all the dupes for you.
Under *buntu
sudo apt-get install fdupes
man fdupes:
fdupes - finds duplicate files in a given set of directories
As a former Shotwell dev I might point out that most photo manager apps can do this.
There's no -1 for "I don't get it."
Are we seriously discussing how to dedupe files based on a hash here?
News for nerds, stuff that matters, questions that belong in a forum where people answer things you couldn't be bothered to Google.
In addition to the other methods (ZFS, fdupes, etc), I personally use git-annex.
Git annex can even run on android, so I keep at least two copies of my photos spread throughout all of my computers and removable devices.
http://www.donarmstrong.com
See http://www.librelogiciel.com/s...
I haven't modified nor used it in years (I don't own a digital camera anymore...) so I ignore if it still works with up to date libraries, but its "--nodupes" option does what you want, and its numerous other command line options (http://www.librelogiciel.com/software/DigicaMerge/commandline) help you solve the main problems of managing directories full of pictures.
It's Free Software, licensed under the GNU GPL of the Free Software Foundation.
Hoping this helps
You can even compile your home-grown photo-deduplicator into your custom kernel if you want to.
Get your co-workers @ the nsa to do their own work
For the general case (any file), I've used this script:
#!/bin/sh
OUTF=rem-duplicates.sh;
echo "#! /bin/sh" > $OUTF;
find "$@" -type f -print0 |
xargs -0 -n1 md5sum |
sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
chmod a+x $OUTF; ls -l $OUTF
It should be straightforward to change "md5sum" to some other key -- e.g. EXIF Date + some other EXIF fields.
(Also, isn't this really a question for superuser.com or similar?)
Hi. I've been through the same problem. In my case, I had to deduplicate 80k photos. The reason why most suggestions in this thread won't work is because generic solutions don't take advantage of the extra information photos contain. In my case, around 90% of the photos had good EXIF information, but in itself, that is not enough.
I used DIM to classify photos into year / month / day structure, and later I used a photo deduplicator on each day's sub folder.
Additionally, there was extra manual work for those photos not resolved in this way, but definitely was way better than comparing 80k with 80k.
rtfm.
If you're not big into scripting there's a program on the Ubuntu Software Center called FSLint that does exactly what you're looking for. You can have it match on filenames, filesize, hashes, etc. It's just a generic file deduplicator, not optimized for images or anything.
What's the problem? Just cp -u the $file to /newhd/by_md5/$(md5sum "$file" | cut -d' ' -f1).${file##*.} ...and store the original file name in exif (or create another hardlink to the md5 filename, or whatever way you prefer to locate your stuff).
Create a git-annex repository on each file system and set them up with at least one common remote. Then add all of the photos on each file system into the git-annex repository (git annex add *.jpg), sync it with the common remote (git annex sync yourremote), and move all content to the remote (git annex move -t yourremote .).
Yeah, this Ask Slashdot should really be about teaching people how to search for packages in aptitude or whatever your package manager is...
Here are some others:
findimagedupes
Finds visually similar or duplicate images
findimagedupes is a commandline utility which performs a rough "visual diff" to
two images. This allows you to compare two images or a whole tree of images and
determine if any are similar or identical. On common image types,
findimagedupes seems to be around 98% accurate.
Homepage: http://www.jhnc.org/findimaged...
fslint :
kleansweep : ...
File cleaner for KDE
KleanSweep allows you to reclaim disk space by finding unneeded files. It can
search for files basing on several criterias; you can seek for:
* empty files
* empty directories
* backup files
* broken symbolic links
* broken executables (executables with missing libraries)
* dead menu entries (.desktop files pointing to non-existing executables)
* duplicated files
Homepage: http://linux.bydg.org/~yogin/
komparator :
directories comparator for KDE
Komparator is an application that searches and synchronizes two directories. It
discovers duplicate, newer or missing files and empty folders. It works on
local and network or kioslave protocol folders.
Homepage: http://komparator.sourceforge....
backuppc : (just in case this was related to your intended use case for some reason)
high-performance, enterprise-grade system for backing up PCs
BackupPC is disk based and not tape based. This particularity allows features
not found in any other backup solution:
* Clever pooling scheme minimizes disk storage and disk I/O. Identical files
across multiple backups of the same or different PC are stored only once
resulting in substantial savings in disk storage and disk writes. Also known
as "data deduplication".
I bet if you throw Picasa at your combined images directory, it might have some kind of "similar image" detection too, particularly since it sorts everything by exif timestamp.
That said, I've never had to use any of this stuff, because my habit was to rename my camera image dumps to a timestamped directory (e.g. 20140123_DCIM ) to begin with, and upload it to its final resting place on my main file server immediately, so I know all other copies I encounter on other household machines are redundant.
You can google forever and not get the correct answer.
This is not a trivial problem, and in my case, I had to test multiple ways to do this before finding the correct tools. Also, most approaches work fine with 100 photos, but the problem becomes different if you are talking about 80k photos.
And if it was 100 photos, very likely he would do it by hand and won't need a tool.
I checked multiple places including slashdot before almost writing my own tools in perl.
This will help find exact matches by exif data. It will not find near-matches unless they have the same exif data. If you want that, good luck. Geeqie has a find-similar command, but it's only so good (image search is hard!). Apparently there's also a findimagedupes tool available, see comments above (I wrote this before seeing that and had assumed apt-cache search had already been exhausted).
I would write a script that runs exiftool on each file you want to test. Remove the items that refer to timestamp, file name, path, etc., then make an md5 of what is left.
Something like this exif_hash.sh (sorry, slashdot eats whitespace so this is not indented):
#!/bin/sh
for image in "$@"; do
echo "$(exiftool "$image" | grep -v -e '20..:..:' -e '19..:..:' -e File -e Directory | md5sum) $image"
done
And then run:
find [list of paths] -type f -print0 |xargs -0 exif_hash.sh |sort > output
If you have a really large list of images, do not run this through sort. Just pipe it into your output file and sort it later. It's possible that the sort utility can't deal with the size of the list (you can work around this by using grep '^[0-8]' output |sort >output-1 and grep -v '^[0-8]' output |sort >output-2, then cat output-1 output-2 > output.sorted or thereabouts; you may need more than two passes).
There are other things you can do to display these, e.g. awk '{print $1}' output |uniq -c |sort -n to rank them by hash.
On Debian, exiftool is part of the libimage-exiftool-perl package. If you know perl, you can write this with far more precision (I figured this would be an easier explanation for non-coders).
AND, most people come with the trivial answer on deduping files. You DON'T want to MD5 or do anything based on hash tags for 80k photos. That doesn't work. Photos are a particular type of file with particular characteristics, which can reduce your workload a lot.
Trivial approach sucks in this case, and carefully picking the correct tools (in my case classifying photos with an exif / date approach) before deduplicating can convert an impractical solution into a working solution.
fslint is the tool you are looking for.
this would be a nice intermediate-level weekend perl project
Why use perl when a bash script will do?
script to output path, filename, checksum to somewhere maybe?
This will find duplicate files in the general sense:
http://packages.debian.org/sid...
Picasa as a local instance, importing from all the other locations... just remember to check the "exclude duplicates" box
fdupes.
Done :)
- http://www.milkme.co.uk
http://en.wikipedia.org/wiki/L...
Be careful with fdupes. It defaults to including zero length files and will hard link those together too, which is generally a really bad idea.
ftwin is a command line tool, when built with libpuzzle, able to generate a signature for each image and detect duplicates (including resized/slightly modified). Link: http://freecode.com/projects/f... Disclaimer: I'm the author and don't maintain it actively :-P
http://en.wikipedia.org/wiki/List_of_duplicate_file_finders
This seems like a rather lot of work just to automate deduping of your porn collection. It might be more enjoyable to do it by hand anyway.
I love how this article was listed twice in the RSS feed. Kudos!
findimagedupes
I always use: fdupes -nrSd *
Reminds me of Windows link repairer, automatically searching for the nearest file size, which was almost always the wrong thing to do, then suggesting grampa accept the new pointer.
(-1: Post disagrees with my already-settled worldview) is not a valid mod option.
If you are going strictly based on hashing (e.g. not trying to match images that may have different EXIF data embedded, thus making the hashes different) fslint works quite well. It will chug through a filesystem and basically wraps python commands to compare by hash and file size (using both md5 and sha256) and will give you a report of wasted space. You can then save a parseable plain text file. It can take a while - it's bandwidth-bound as you might expect - I just did this for a 2tb network share and it took over 12 hours. But it got the job done and all I had to do was sudo apt-get install fslint
Hope it doesn't access your backup drive and wipe out your backups as "duplicates".
The de-duplicator is called a human.
They use a database of hashes of kiddie porn to identify offending material without forcing anyone to look at the stuff. Seems like it would be ready to use Perl to crawl your filesystem and identify dupes.
by Mike Buddha -- Someday the mountain might get him, but the law never will.
I wrote a shell script that looked at the datestamp for each photo and then moved it to a directory called YYYY/MM/DD (so 2000/12/25). I'm going off the assumption that there weren't two photos taken on the same day with the same filenames. So far that seems to be working.
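A rough Python equivalent of that idea, as a sketch only. It assumes the "datestamp" is the file's modification time (an EXIF date would need an extra library), and it refuses to overwrite when a same-day file with the same name already exists, which is exactly the case the post assumes never happens. Both function names are hypothetical.

```python
import os
import shutil
import time


def dated_destination(path, dest_root):
    """Build dest_root/YYYY/MM/DD/<name> from the file's modification time."""
    stamp = time.localtime(os.path.getmtime(path))
    folder = os.path.join(
        dest_root,
        time.strftime("%Y", stamp),
        time.strftime("%m", stamp),
        time.strftime("%d", stamp),
    )
    return os.path.join(folder, os.path.basename(path))


def file_by_date(path, dest_root):
    """Move path into its dated folder; return the new path, or None if the name is taken."""
    target = dated_destination(path, dest_root)
    if os.path.exists(target):
        return None  # same day, same file name: leave it for a manual look
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.move(path, target)
    return target
```

A None result is worth logging, since it is either a true duplicate or two different photos that happen to share a name.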
...you want to trust the EXIF time stamp to determine a duplicate? I had a video cam that was constantly resetting the internal clock to "Jan 1, 2000." It's possible that you could lose some data.
Interestingly, I wrote a program years back, that did the OPPOSITE of this. It read the file name (formatted as a date) and set the date in the EXIF header. I was converting DV-AVI video to still images.
#!/usr/bin/perl
# $Id: findDups.pl 218 2014-01-24 01:04:52Z alan $
#
# Find duplicate files: for files of the same size compares md5 of successive chunks until they differ
#
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex md5_base64);
use Fcntl;
use Cwd qw(realpath);
my $BUFFSIZE = 131072; # compare these many bytes at a time for files of same size
my %fileByName; # all files, name => size
my %fileBySize; # all files, size => [fname1, fname2, ...]
my %fileByHash; # only with duplicates, hash => [fname1, fname2, ...]
if ($#ARGV < 0) {
print "Syntax: findDups.pl <file|dir> [...]\n";
exit;
}
# treat params as files or dirs
foreach my $arg (@ARGV) {
$arg = realpath($arg);
if (-d $arg) {
addDir($arg);
} else {
addFile($arg);
}
}
# get filesize after adding dirs, to avoid more than one stat() per file in case of symlinks, duplicate dirs, etc
foreach my $fname (keys %fileByName) {
$fileByName{$fname} = -s $fname;
}
# build hash of filesize => [ filename1, filename2, ...]
foreach my $fname (keys %fileByName) {
push(@{$fileBySize{$fileByName{$fname}}}, $fname);
}
# for files of the same size: compare md5 of each successive chunk until there is a difference
foreach my $size (keys %fileBySize) {
next if $#{$fileBySize{$size}} < 1; # skip filesizes array with just one file
my %checking;
foreach my $fname (@{$fileBySize{$size}}) {
if (sysopen my $FH, $fname, O_RDONLY) {
$checking{$fname}{fh} = $FH; # file handle
$checking{$fname}{md5} = Digest::MD5->new; # md5 object
} else {
warn "Error opening $fname: $!";
}
}
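my $read=0;
while (($read < $size) && (keys %checking > 0)) {
my $maxRead = 0; # largest chunk read in this round
foreach my $fname (keys %checking) { # read buffer and update md5
my $buffer;
my $r = sysread($checking{$fname}{fh}, $buffer, $BUFFSIZE);
if (! defined($r)) {
warn "Error reading from $fname: $!";
close $checking{$fname}{fh};
delete $checking{$fname};
} else {
$checking{$fname}{md5}->add($buffer);
$maxRead = $r if $r > $maxRead;
}
}
last if $maxRead == 0; # nothing left to read (e.g. a file shrank); avoid an endless loop
$read += $maxRead; # the posted version added the last $r, which is undef after a read error
FILE1: foreach my $fname1 (keys %checking) { # remove files without dups
FILE2: foreach my $fname2 (keys %checking) { # compare to each checking file
next if $fname1 eq $fname2;
if ($checking{$fname1}{md5}->clone->digest eq $checking{$fname2}{md5}->clone->digest) {
next FILE1; # still matches another file, keep checking it
}
}
# The posted script is cut off here; everything below is a reconstruction of the likely remainder.
close $checking{$fname1}{fh}; # no other file matches so far: stop checking this one
delete $checking{$fname1};
}
}
foreach my $fname (keys %checking) { # whatever survived every chunk is a duplicate
push(@{$fileByHash{$checking{$fname}{md5}->hexdigest}}, $fname);
close $checking{$fname}{fh};
}
}
# print each group of duplicates, separated by a blank line
foreach my $hash (keys %fileByHash) {
print join("\n", sort @{$fileByHash{$hash}}), "\n\n";
}
sub addFile { # remember a regular file; its size is filled in later
my ($fname) = @_;
$fileByName{$fname} = 0 if -f $fname;
}
sub addDir { # walk a directory tree, skipping symlinks
my ($dir) = @_;
opendir(my $DH, $dir) or do { warn "Error opening $dir: $!"; return; };
foreach my $entry (readdir $DH) {
next if $entry eq '.' || $entry eq '..';
my $path = "$dir/$entry";
if (-l $path) { next; }
elsif (-d $path) { addDir($path); }
else { addFile($path); }
}
closedir $DH;
}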
What happens on Slashdot is after about 20 posts, some people with experience and a simple idea speak up.
paradxum's outline is getting pretty clear, simple and powerful
The book Unix Powertools by O'Reilly and ? has a number of recipes for solving your problem. Here is a verbal plan if you have access to Linux or a Unix type command line.
ls -l >filelist ; Make a long style listing of every directory.
ls -l >>filelist ; Append all the directories you can to the filelist
sort filelist ; Sort it by the file size column, see man sort for fields. Be careful, when you go over 1000 files, sort takes lots of time.
----------- what I would do, since I like the gnumeric spreadsheet
is open filelist with gnumeric and write a cell comparison script. Even old stuff, since the hard disk they are stored on is old too, I prefer to unplug the disk, leave it inside the old steel case computer.
Manually read the file list, identical image files will have the same file size. By sorting, all the likely exact duplicate files will appear together with the parent file.
For the delete duplicates task see other Slashdot posts.
If it's attached to a live system and is writeable then it's not a backup yet, it's just a copy.
A web hosting business near me went under because they made that mistake and lost all of their hosted data in a single incident.
Copies on instantly available disk are often a lot more convenient than detached disks, tapes or whatever, but if that's all you've got there are plenty of ways to lose the lot.
Why didn't I think of that?
Well, if you are searching for identical files, just use "rdfind". It has options to delete or even hard-link duplicate files.
If you need to find **similar** images, you can use the "geeqie" image viewer. It can compare image sets by similarity level. I'm not aware of a command-line tool doing this, though.
In four different programming languages: http://stromberg.dnsalias.org/...
You could also go back and rename the photos to the time stamp created using exif tool, when you're done. Assuming the poster doesn't mind renaming the files.
DigiKam will do everything you want. It works by creating hashes. You set your level of similarity and digiKam will find the files. It can handle multiple locations, and even "albums" on removable media. If you have a lot of images it can be slow, but if you limit any particular search you can greatly improve performance. It is available for Linux and Windows both.
Google: image duplicate finder
What I did in my deduplicator written in Python was group the files by their size and reject any file with a unique size. Then I'd hash the first few kilobytes of each file with MD5 (it's just a spot check so speed is more valuable than security against intentional collisions) and reject any file with a unique first few kilobytes. Finally I'd hash the whole file with a more secure hash.
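The three passes described above might look something like this. It is a sketch, not the poster's program: the 4096-byte spot check, the choice of SHA-256 as the "more secure hash", and all function names are assumptions.

```python
import hashlib
import os
from collections import defaultdict

HEAD_BYTES = 4096  # assumed size of the "first few kilobytes" spot check


def _groups_with_company(paths, key):
    """Group paths by key(path) and keep only groups with two or more members."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def _head_md5(path):
    """Cheap spot check: MD5 of the first HEAD_BYTES bytes only."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read(HEAD_BYTES)).hexdigest()


def _full_sha256(path):
    """Expensive final check: SHA-256 of the whole file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return groups of identical files, narrowing candidates in three passes."""
    result = []
    for same_size in _groups_with_company(paths, os.path.getsize):
        for same_head in _groups_with_company(same_size, _head_md5):
            result.extend(_groups_with_company(same_head, _full_sha256))
    return result
```

Most files drop out in the first pass without being opened at all, which is where the time saving comes from.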
You may want to try dupeguru. It's available at http://www.hardcoded.net/dupeguru_pe.
What you want, is a first pass which identifies some interesting points in the image.
There is an algorithm for that called SIFT (scale-invariant feature transform), but it's patented and apparently unavailable for licensing in free software.
if you have the exact same picture in three different folders/subdirectories on the same file system, zfs will only allocate storage for one copy of the data, and point the three file entries to that one copy. Similar to how hard links work in ext2 and friends.
I think the idea is to use some utility to query ZFS and find files that ZFS has deduplicated. Similar to how one can count hard links to each inode in ext2 and friends.
However it's fairly easy to do with a unix shell and only standard tools...
Something along the lines of:
find /path/to/pics -type f -print0 | xargs -0 md5sum | sort | while read -r hash rest; do
if [ "$lasthash" = "$hash" ]; then
echo "$rest"
fi
lasthash="$hash"
done | while read -r dupe; do
echo rm -- "$dupe"
done
That would, once the echo is removed, delete all files that are dupes (except one of each).
Typed it right into the /. comment box, though, so it's probably wrong somewhere. Only intended to convey the idea.
I have used http://www.duplicate-finder.com/photo.html (MS Windows only) because I could not find anything on Linux with similar functionality. It does work very well, it can find similar, but not identical images, such as the same picture saved in a different format or with different compression settings. It tends to slow down when working directories with multiple thousands of images.
I really shouldn't have used someone else's email address for this account.
I use a utility called dupe-guru. It does md5 sums of all files/directories specified and looks for duplicates. Lets the user decide what action to take on the duplicates. Works perfectly and is available here: https://launchpad.net/~hsoft/+archive/ppa/+packages
I use DigiKam but when it came to finding duplicates in unmanaged folders I was happy to find out Geeqie has a very powerful File->FindDuplicates tool with many methods for identifying duplicates. Start with the quick ones and move on to the slower methods.
Also: I love Geeqie's view-> pan view mode... check it out!
Whats whats wrong wrong with with dupes dupes? Picky picky.
Table-ized A.I.
fdupes = find dupes.. works for all files, you can specify .jpg .jpeg .JPG .JPEG etc
which can just identify duplicate files by full content
fdupes does it for me.
FDUPES is a program for identifying or deleting duplicate files residing within specified directories.
http://premium.caribe.net/~adrian2/programs/fdupes.html
I used to use a command line program called Similar, which was part of a package called STIC (Simple Tools for Image Collectors). Similar would scan the images and build a database, and then the database could be queried for which images were similar, and I would pipe the output of the query into xargs rm and remove the duplicates.
On Mageia4 x86_64, I get the error message: /usr/bin/perl: symbol lookup error: /usr/lib/perl5/vendor_perl/5.18.1/x86_64-linux-thread-multi/auto/Graphics/Magick/Magick.so: undefined symbol: InitializeMagick
perl-Graphics-Magick is version 1.3.18
I use VisiPics for Windows. It's a free software that actually analyses the content of images to find duplicates. This works very well because images may not have exif data or the same image may be different file sizes or formats.
I don't know if it will work under Wine, but it's worth a try.
Visipics is the only tool I have ever found that will reliably use image matching to dedupe; it is Windows only but I have used it on my own collections & it works very well indeed: http://www.visipics.info/
Now (v1.31) understands .raw as well as all other main image formats & can handle rotated images; brilliant little program!
Nico M, London, GB.
I use fslint. It does more than just find duplicate images.
Your best bet is using something like dupeGuru (http://www.hardcoded.net/dupeguru_pe/). It uses a variant of phash (http://www.phash.org/) to also find similar images. I've used it on an archive of 250,000 photos and it works beautifully.
I had the same problem once. I had my daughter's pictures since birth (6 years). They were saved by the month, but I had multiple copies, modifications (rotation, resizing), etc. Also, I had her videos, and even worse, the system was faulty, and some of the copies were not exact. All in all, I had something well over a hundred thousand images + videos. All spread over three disks. A picture viewer was obviously out of the question. I just checked it right now, and the tree right now is 232GB. When I did this, the final tree was maybe 150GB.
Here is how I solved it: First, I wrote a program which tried to extract as much info from the images/videos as possible. These included the creation date (via exif or mplayer parsing), if the file was faulty (jpeginfo and mplayer are your friends), the orientation, md5, geometry, and the length for videos. Originally, I also collected inodes. I collected all of these attrs for each image/video and associated them with the files as extended fs attributes. This program ran for over a day (or maybe two). Then I wrote another program which inserted these attributes into the filename (or removed them). This way I got some impressively long filenames, but all info was there. The filename looked like this: in-seattle-zoo-01.jpeg__-length-3911023-orientation-portrait-error-no-......jpeg
Once this was done, I wrote a lot of small programs to eliminate files, dirs, whole trees. For example, suppose I had seven directories called month9. I wrote a program which extracted the md5sums from each directory (remember, it was in the filename now), and if two dirs were matches (or one was the subset of another one), I could eliminate one. If it wasn't enough, I went to other attributes.
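The directory comparison in that step can be sketched with plain set operations. Unlike the setup described above, this version recomputes the MD5 of each file instead of reading it back from the filename, looks only at files directly inside each directory, and reads each file whole; the function names are invented for the example.

```python
import hashlib
import os


def content_hashes(directory):
    """Return the set of MD5 digests of every regular file directly inside directory."""
    digests = set()
    for entry in os.scandir(directory):
        if entry.is_file():
            with open(entry.path, "rb") as handle:
                digests.add(hashlib.md5(handle.read()).hexdigest())
    return digests


def redundant_directory(dir_a, dir_b):
    """Return the directory whose contents are fully present in the other, or None."""
    a, b = content_hashes(dir_a), content_hashes(dir_b)
    if a <= b:
        return dir_a  # equal sets also land here: either copy could go
    if b <= a:
        return dir_b
    return None
```

Because only content hashes are compared, renamed copies still count as matches.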
All in all, I spent something like 2-3 weeks deduping the mess I have created.
I decided it was best to put the info into extended attributes (it served as the database) since if a file was deemed dup and removed, the database entry with it automatically went.
Good luck, and try not to tear out your hair. :-)
Vilmos
Boar is a svn-like tool that handles large binaries. It deduplicates identical files out of the box, and there is also a plugin (in the devel version) that enables block-level deduplication (useful for efficient storage of large images with only differing exif info). Try it out at http://www.boarvcs.org
KDE's digikam has multiple dedup features from hash checks to close matches
same (ftp://ftp.bitwizard.nl/same/) replaces the duplicates by hard or symbolic links.
I have had a bad practice in just emptying the SD card in folders, and renaming the folder to something, like backup_christmas. I didn't always format the SD card, so I have had a lot of copies of the same photo.
After a few years this has made it quite a mess.
So I wrote a python script that reads the file, renames it and relocates it to
year \ month \ day \ time_with_seconds.jpg
and if there were two in the same second, I just added one more to the jpg until the name was free.
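The "until the name was free" part might look like this in Python. The post does not say exactly what was appended, so the _1, _2, ... counter is a guess, and free_name is a hypothetical helper.

```python
import os


def free_name(folder, stem, extension=".jpg"):
    """Return a path in folder that does not exist yet, adding _1, _2, ... if needed."""
    candidate = os.path.join(folder, stem + extension)
    counter = 0
    while os.path.exists(candidate):
        counter += 1
        candidate = os.path.join(folder, f"{stem}_{counter}{extension}")
    return candidate
```

Exact duplicates taken in the same second end up side by side under this scheme, which makes a later deduplication pass over each day's folder cheap.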
One thing I found out was that the date and time was off on a lot of pictures, so the ones I spotted, I found the series of photos, e.g. a holiday for a week, and then used the datetime module to calculate the difference in time from some reference picture (e.g. remembering that we were on the beach one day, and guessing on the time) and then modified the exif data and ran the move and rename script again.
Afterwards I made a deduplication script, think it was based on:
http://code.activestate.com/recipes/362459-dupinator-detect-and-delete-duplicate-files/
Afterwards I imported the folder into DigiKam, which I use to organize the pictures and tags them.
Actually DigiKam seems to have a deduplication feature, and can probably rename and move photos also, I don't know. It can also find 'similar' photos. I have a habit of taking multiple shots so I can keep the one that was e.g. the least blurry.
If there isn't, find some time and do it. Today I found some time, my ignored her because of the time I already have.
No need to roll your own. If the redundant files are identical (the
problem as stated lets me assume that), use fdupes.
"Searches the given path for duplicate files. Such files are found by
comparing file sizes and MD5 signatures, followed by a byte-by-byte
comparison."
It's fast, accurate, and generates a list of duplicate files to handle
yourself - or automatically deletes all except the first of duplicate
files found.
I've used it myself with tens of thousands of pictures to exactly do
what the OP wants.
digikam
I'm surprised nobody has mentioned fdupes yet.
It's a terminal program found in most linux distributions that will identify and/or delete file duplicates.
It uses several checks to speed things up: file size, hashes when the size matches, and full file comparisons if the hashes match.
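The last of those checks, the full file comparison, can be sketched with Python's standard filecmp module. This is not fdupes' own code; it assumes the paths passed in have already matched on size and hash, and confirm_identical is a made-up name.

```python
import filecmp


def confirm_identical(paths):
    """Split hash-matched paths into clusters that are equal byte for byte."""
    clusters = []
    for path in paths:
        for cluster in clusters:
            # shallow=False forces a real content comparison, not just a stat() check.
            if filecmp.cmp(cluster[0], path, shallow=False):
                cluster.append(path)
                break
        else:
            clusters.append([path])  # matches nothing seen so far
    return [cluster for cluster in clusters if len(cluster) > 1]
```

The byte comparison only matters in the rare case of a hash collision, but it is cheap insurance before anything gets deleted.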
DupeGuru Picture Edition at http://www.hardcoded.net/dupeguru_pe/
cross platform and easy to use
I know I'm a little late to the conversation, but I wrote a script to tackle this very problem just a month or two ago.
https://github.com/mikegreiling/photosort.py
http://pixelcog.com/blog/2013/recover-corrupted-photo-library/
I had a corrupted iPhoto library after a hard drive went bad, so I needed to combine the photos from my iPhone and several other sources to recompile the library, and the only way to recognize duplicates was with EXIF information.
Using filenames is not useful because the name may change but the contents don't.
I've done this several times using EXIF data of photo time + camera + exposure to make a unique key.
But I've never considered if the file has been edited and the EXIF key stays the same - then also check dimensions and maybe hash of contents.
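A small sketch of such a key, assuming the EXIF tags have already been read into a dict per photo by some other tool. The tag names used (DateTimeOriginal, Model, ExposureTime, ImageWidth, ImageHeight) are standard EXIF/TIFF names, but whether a given reader reports them under exactly these keys is an assumption, as are the function names.

```python
from collections import defaultdict


def exif_key(tags):
    """Build a key from capture time, camera model and exposure, or None if any is missing."""
    fields = ("DateTimeOriginal", "Model", "ExposureTime")
    if any(not tags.get(field) for field in fields):
        return None
    return tuple(str(tags[field]) for field in fields)


def group_by_exif(photos):
    """photos maps path -> tag dict; return (duplicate groups, paths without a usable key)."""
    groups = defaultdict(list)
    unkeyed = []
    for path, tags in photos.items():
        key = exif_key(tags)
        if key is None:
            unkeyed.append(path)
            continue
        # Dimensions join the key so a resized edit is not mistaken for the original.
        size = (tags.get("ImageWidth"), tags.get("ImageHeight"))
        groups[key + size].append(path)
    return [group for group in groups.values() if len(group) > 1], unkeyed
```

Photos that come back in the unkeyed list still need a content hash or a visual comparison, since their EXIF data cannot tell them apart.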
Geeqie (was GQView) does duplicate image searches including options for non-perfect duplicates, or just bit for bit checks. It'll then show you the images side by side with some meta info.
It claims to be able to read exif.
Beyond Compare (http://www.scootersoftware.com/) is a really great solution for this. I use it all the time.
I have found clonespy to be an excellent tool for finding duplicates and not just for photos. I prefer the two folder method. You would have to run it under a windows vm to run it on linux ;)
I am wondering why no one suggested gimp. Gimp has a command-line interface that does almost anything brilliantly and is perfect for working on multiple files. Gimp reads EXIF data and also accepts python scripts internally if you prefer a GUI over CLI.
Freedups - http://www.stearns.org/freedups/ - should also do the trick. It hardlinks identical files, freeing up the space without changing the directory tree. It does caching to reduce disk bandwidth, and does the whole thing with a single pass through the files. GPL'd. (I'm the author).
On mac I use macpaw's Gemini it's not free but for me it has worked well.
There is a command line tool called fdupes.
Or a GUI tool fslint.
Both in ubuntu repo.
so here it is: https://github.com/withorwitho...
enjoy, it doesn't delete or move anything automatically. You can add that if you want, just outputs images that are perceptually similar.
Example usage and output is included on github page. email me if you want it to work a different way or do something different. It's not the most robust phash algorithm, but it's better than straight hashes (in some ways) as it'll detect a similar png and jpg that are similar.
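For readers wondering what a perceptual hash looks like, here is a sketch of one simple kind, a difference hash. It is not the algorithm used by the tool above. It assumes each image has already been decoded and shrunk to a tiny grayscale thumbnail (for example 8 rows of 9 brightness values) by some image library, and the similarity threshold of 5 bits is a guess.

```python
def dhash_bits(gray_rows):
    """Difference hash: one bit per horizontally adjacent pixel pair (1 if left is brighter).

    gray_rows is assumed to be an already downscaled grayscale thumbnail,
    e.g. 8 rows of 9 brightness values.
    """
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bit positions in which two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def looks_similar(rows_a, rows_b, max_distance=5):
    """Treat two thumbnails as near-duplicates when few bits differ (threshold is a guess)."""
    return hamming_distance(dhash_bits(rows_a), dhash_bits(rows_b)) <= max_distance
```

Because only the relative brightness of neighbouring pixels is recorded, a uniformly brightened or recompressed copy tends to produce the same or a nearby hash, while a plain MD5 of the file changes completely.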
I wrote one that I use, works really well because it also hardlinks all the duplicates. https://github.com/wscott/link...
I have done it with Flickr and FlickrDupFinder (https://www.flickr.com/services/apps/72157623582289101/) which has worked very well!
"Beware of he who would deny you access to information, for in his heart, he dreams himself your master."
Have you tried digikam?
It has a photo de-duplicator.
why bother ?
Just put all your images into yfidb (your favorite image database) and sort em out as time goes by
or something like that
If you have 1,000s of images, I assume either (a) you don't really care that much about any one particular image, or (b) you already have a special set of images you care about
I have the exact same problem, so hopefully you will find the tool that I use to your liking:
https://github.com/christophelg/DuplicateFinder
If the images are identical, hash them and compare hashes.