Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF?
postbigbang writes "Imagine having thousands of images on disparate machines. Many are dupes, even among the disparate machines. It's impossible to delete all the dupes manually and create a singular, accurate photo image base. Is there an app out there that can scan a file system, perhaps a target sub-folder system, and suck in the images, WITHOUT creating duplicates? Perhaps by reading EXIF info or hashes? I have eleven file systems saved, and the task of eliminating dupes seems impossible."
Exactly what you mean by deduplication is kind of vague, but whatever you decide on, it could probably be done in a hundred lines of perl (using CPAN libraries, of course).
I've had the same problem, as I stupidly try to make the world a better place by renaming photos or putting them in sub-directories.
fdupes will do a bit-wise comparison. -r = recurse. -d = delete.
fdupes would be the fastest way.
It is important here to know why you want to remove duplicate images. Is it just so you can have one large photo album without seeing the same picture twice? If that is true then you could sync all images on all machines onto one large drive, sort the files by size and manually delete the duplicates as they would all bunch together.
If you are trying to save disk space, then using a file system like ZFS can automatically remove duplicate data and add compression.
Also consider that if you do not know where your duplicate files are then any duplicates in existence are, effectively, acting as backups for your disorganized collection. Erasing duplicates until you find a way to cleanly backup your data may be a mistake in the long run.
If they are identical then their hashes should be identical.
So write a script that generates hashes for each of them and checks for duplicate hashes.
fslint is a toolkit to find all redundant disk usage (duplicate files, for example). It includes a GUI as well as a command line interface.
http://www.pixelbeat.org/fslin...
I would try running all the files through ssdeep.
You could script it to find a certain % match that you're satisfied with. Only catch to this is that it could be a very time-intensive process to scan a huge number of files. Exif might be a faster option which could be cobbled together in Perl pretty quickly, but that wouldn't catch dupes that had their exif stripped or have slight differences due to post-processing.
I did just this, but by copying all of the pics from the various devices to a linux fileshare, and then ran: http://www.pixelbeat.org/fslint/ Nice software, did exactly what I wanted.
This is what I'd do, but I doubt the submitter is a Bourne shell wizard.
Shell scripts ARE still software by the way.
I'm pretty sure I wrote something like this in perl/bash in like 20 minutes.
1 - do an md5sum of each file and toss it in a file
2 - sort
3 - perl (or your language of choice) program, basically:
sum = "a"
newsum = next line
if newsum == sum delete file
else sum = newsum
Works excellently for this.
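The three steps above can be sketched in Python as follows. This is only an illustration of the "compare each line with the previous one" idea, assuming the input is sorted md5sum output (a 32-character hex digest, two separator characters, then the path). The function name is made up for this example, and it reports paths rather than deleting anything.

```python
def duplicates_from_sorted_sums(lines):
    """Yield paths whose checksum equals that of the line before.

    `lines` is assumed to be sorted `md5sum` output: a 32-character hex
    digest, two separator characters, then the file path.
    """
    previous_sum = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        checksum, path = line[:32], line[34:]
        if checksum == previous_sum:
            # Same sum as the previous line: this file is a duplicate.
            yield path
        else:
            # First file seen with this sum: remember it and keep the file.
            previous_sum = checksum
```

With the sums saved to a file, iterating over duplicates_from_sorted_sums(open("sums.txt")) and printing each path gives a list to review before anything is removed.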
fdupes will work and is faster than writing a homemade script for the job. The big problem is "across multiple machines" which might require use of, say, sshfs to bring all the machines' data remotely onto one temporarily for duplicate scanning. fdupes checks sizes first, and only then starts trying to hash anything, so obvious non-duplicates don't get hashed at all. Significant time savings. Across multiple machines, another option is using md5deep to build recursive hash lists.
The only tool so far that I've used for image duplicate finding that checks CONTENT rather than bitwise 1:1 duplicate checking is GQview on Linux. It works fairly well; though it's a bit dated by now, it's still a good viewer program. Add -D_FILE_OFFSET_BITS=64 to the CFLAGS if you compile it yourself on a 32-bit machine today though.
Requires WINE but should work fine on Linux.
http://www.anti-twin.com/
whatever you decide on, it could probably be done in a hundred lines of perl
Funny you mention perl.
There's a tool written in perl called "findimagedupes" in Debian. Pretty awesome tool for large image collections, because it could identify duplicates even if they had been resized, or messed with a little (e.g. adding logos, etc). Point it at a directory, and it'll find all the dupes for you.
Under *buntu
sudo apt-get install fdupes
man fdupes:
fdupes - finds duplicate files in a given set of directories
As a former Shotwell dev I might point out that most photo manager apps can do this.
Have you read the zfs documentation? Setting zfs dedup does not remove duplicate files (per OP request, since there are eleven different file systems), but removes redundant storage for files which are duplicates. In other words, if you have the exact same picture in three different folders/subdirectories on the same file system, zfs will only allocate storage for one copy of the data, and point the three file entries to that one copy. Similar to how hard links work in ext2 and friends.
One can use NTFS and turn on deduplication, then manually fire off the background "optimization" task. It isn't a "presto!", but after a good long while, it will find and merge duplicate files, or duplicate blocks of different files.
Caveat: This is only Windows 8 and newer, or Windows Server 2012 and newer.
Yeah. Thanks. It's a simple question. So far, I've seen scripting suggestions, which might be useful. I'm a nerd, but not wanting to do much code because I'm really rusty at it. Instead, I'm amazed that no one runs into this problem and has built an app that does this. That's all I'm looking for: consolidation.
In addition to the other methods (ZFS, fdupes, etc), I personally use git-annex.
Git annex can even run on android, so I keep at least two copies of my photos spread throughout all of my computers and removable devices.
http://www.donarmstrong.com
See http://www.librelogiciel.com/s...
I haven't modified nor used it in years (I don't own a digital camera anymore...) so I ignore if it still works with up to date libraries, but its "--nodupes" option does what you want, and its numerous other command line options (http://www.librelogiciel.com/software/DigicaMerge/commandline) help you solve the main problems of managing directories full of pictures.
It's Free Software, licensed under the GNU GPL of the Free Software Foundation.
Hoping this helps
For the general case (any file), I've used this script:
#!/bin/sh
OUTF=rem-duplicates.sh;
echo "#! /bin/sh" > $OUTF;
find "$@" -type f -print0 |
xargs -0 -n1 md5sum |
sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
chmod a+x $OUTF; ls -l $OUTF
It should be straightforward to change "md5sum" to some other key -- e.g. EXIF Date + some other EXIF fields.
(Also, isn't this really a question for superuser.com or similar?)
Yeah, this Ask Slashdot should really be about teaching people how to search for packages in aptitude or whatever your package manager is...
Here are some others:
findimagedupes
Finds visually similar or duplicate images
findimagedupes is a commandline utility which performs a rough "visual diff" to
two images. This allows you to compare two images or a whole tree of images and
determine if any are similar or identical. On common image types,
findimagedupes seems to be around 98% accurate.
Homepage: http://www.jhnc.org/findimaged...
fslint :
kleansweep : ...
File cleaner for KDE
KleanSweep allows you to reclaim disk space by finding unneeded files. It can
search for files basing on several criterias; you can seek for:
* empty files
* empty directories
* backup files
* broken symbolic links
* broken executables (executables with missing libraries)
* dead menu entries (.desktop files pointing to non-existing executables)
* duplicated files
Homepage: http://linux.bydg.org/~yogin/
komparator :
directories comparator for KDE
Komparator is an application that searches and synchronizes two directories. It
discovers duplicate, newer or missing files and empty folders. It works on
local and network or kioslave protocol folders.
Homepage: http://komparator.sourceforge....
backuppc : (just in case this was related to your intended use case for some reason)
high-performance, enterprise-grade system for backing up PCs
BackupPC is disk based and not tape based. This particularity allows features
not found in any other backup solution:
* Clever pooling scheme minimizes disk storage and disk I/O. Identical files
across multiple backups of the same or different PC are stored only once
resulting in substantial savings in disk storage and disk writes. Also known
as "data deduplication".
I bet if you throw Picasa at your combined images directory, it might have some kind of "similar image" detection too, particularly since it sorts everything by exif timestamp.
That said, I've never had to use any of this stuff, because my habit was to rename my camera image dumps to a timestamped directory (e.g. 20140123_DCIM ) to begin with, and upload it to its final resting place on my main file server immediately, so I know all other copies I encounter on other household machines are redundant.
You can google forever and not get the correct answer.
This is not a trivial problem, and in my case, I had to test multiple ways to do this before finding the correct tools. Also, most approaches work fine with 100 photos, but the problem becomes different if you are talking about 80k photos.
And if it was 100 photos, very likely he would do it by hand and won't need a tool.
I checked multiple places including slashdot before almost writing my own tools in perl.
This will help find exact matches by exif data. It will not find near-matches unless they have the same exif data. If you want that, good luck. Geeqie has a find-similar command, but it's only so good (image search is hard!). Apparently there's also a findimagedupes tool available, see comments above (I wrote this before seeing that and had assumed apt-cache search had already been exhausted).
I would write a script that runs exiftool on each file you want to test. Remove the items that refer to timestamp, file name, path, etc., then make an md5 of what is left.
Something like this exif_hash.sh (sorry, slashdot eats whitespace so this is not indented):
#!/bin/sh
# Print "<md5 of the filtered exiftool output> <file>" for each image given.
for image in "$@"; do
echo "$(exiftool "$image" |grep -ve 20..:..: -e 19..:..: -e File -e Directory |md5sum) $image"
done
And then run:
find [list of paths] -type f -print0 |xargs -0 exif_hash.sh |sort > output
If you have a really large list of images, do not run this through sort. Just pipe it into your output file and sort it later. It's possible that the sort utility can't deal with the size of the list (you can work around this by using grep '^[0-8]' output |sort >output-1 and grep -v '^[0-8]' output |sort >output-2, then cat output-1 output-2 > output.sorted or thereabouts; you may need more than two passes).
There are other things you can do to display these, e.g. awk '{print $1}' output |uniq -c |sort -n to rank them by hash.
On Debian, exiftool is part of the libimage-exiftool-perl package. If you know perl, you can write this with far more precision (I figured this would be an easier explanation for non-coders).
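For anyone who would rather not do this in shell, here is a rough Python sketch of the same idea: take exiftool's text output, drop the lines that look like timestamps or that mention the file or directory, and md5 the rest. The function names and the exact filter are assumptions made for this example; fingerprint_file needs the exiftool command installed and is not meant as a definitive implementation.

```python
import hashlib
import re
import subprocess

# Mirrors the grep patterns above: lines containing something like
# "2014:01:" or "1999:12:", or mentioning the file or directory, are dropped.
DATE_PATTERN = re.compile(r"(19|20)\d\d:\d\d:")
SKIPPED_WORDS = ("File", "Directory")


def metadata_fingerprint(exiftool_lines):
    """Return an md5 hex digest of the metadata lines that survive filtering."""
    kept = [
        line
        for line in exiftool_lines
        if not DATE_PATTERN.search(line)
        and not any(word in line for word in SKIPPED_WORDS)
    ]
    return hashlib.md5("\n".join(kept).encode("utf-8")).hexdigest()


def fingerprint_file(path):
    """Run exiftool on one file and fingerprint its output (assumes exiftool is on PATH)."""
    result = subprocess.run(
        ["exiftool", path], capture_output=True, text=True, check=True
    )
    return metadata_fingerprint(result.stdout.splitlines())
```

Two files whose fingerprints match have the same remaining metadata, which is the same test the shell version performs.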
AND, most people come with the trivial answer on deduping files. You DON'T want to MD5 or do anything based on hash tags for 80k photos. That doesn't work. Photos are a particular type of file with particular characteristics, which can reduce your workload a lot.
Trivial approach sucks in this case, and carefully picking the correct tools (in my case classifying photos with an exif / date approach) before deduplicating can convert an impractical solution into a working solution.
fslint is the tool you are looking for.
See my earlier contribution: geeqie. It will even scan for image similarity, not just rudimentary hashing. Someone else mentioned gqview & that it was out of date - geeqie is what gqview became.
Why use perl when a bash script will do?
Adjust as needed:
find ./ -type f -iname '*.jpg' -exec md5sum {} \; > image_md5.txt
cat image_md5.txt | cut -d" " -f1 | sort | uniq -d | while read md5; do grep $md5 image_md5.txt; done
This will find duplicate files in the general sense:
http://packages.debian.org/sid...
Picasa as a local instance, importing from all the other locations... just remember to check the "exclude duplicates" box
whalah is not a word.... seriously. wtf people. It's voilÃ.
As for ZFS, sure, I recommend ZFS. But I'm not sure how i feel about ZFS's dedupe. Besides, the multiple files are still there even if it no longer takes up extra space.
You'd want a script that finds dupes by hash, but that will only detect images that are identical copies, not 'similar' ones, say an image that has been cropped, retouched or resized. A program that can find image dupes even with changes, like tineye.com does, would be ideal. Does anything like that exist?
fdupes.
Done :)
- http://www.milkme.co.uk
http://en.wikipedia.org/wiki/L...
Be careful with fdupes. It defaults to including zero length files and will hard link those together too, which is generally a really bad idea.
ftwin is a command line tool that, when built with libpuzzle, is able to generate a signature for each image and detect duplicates (including resized/slightly modified ones). Link: http://freecode.com/projects/f... Disclaimer: I'm the author and don't maintain it actively :-P
When you're talking about duplicate content, you can't limit yourself to "just hashes".
In this case, with pictures, just opening one and saving it again might produce a different hash, through recompression or a change of file format. How do all these "just check the hashes" solutions work for that?
Finding duplicate images is not that easy.
http://en.wikipedia.org/wiki/List_of_duplicate_file_finders
This seems like a rather lot of work just to automate deduping of your porn collection. It might be more enjoyable to do it by hand anyway.
I always use: fdupes -nrSd *
Reminds me of Windows link repairer, automatically searching for the nearest file size, which was almost always the wrong thing to do, then suggesting grampa accept the new pointer.
If you are going strictly based on hashing (e.g. not trying to match images that may have different EXIF data embedded, thus making the hashes different), fslint works quite well. It will chug through a filesystem, basically wrapping python commands to compare by hash and file size (using both md5 and sha256), and will give you a report of wasted space. You can then save a parseable plain text file. It can take a while, since it's bandwidth-bound as you might expect: I just did this for a 2 TB network share and it took over 12 hours. But it got the job done, and all I had to do was sudo apt-get install fslint
Hope it doesn't access your backup drive and wipe out your backups as "duplicates".
They use a database of hashes of kiddie porn to identify offending material without forcing anyone to look at the stuff. Seems like it would be ready to use Perl to crawl your filesystem and identify dupes.
by Mike Buddha -- Someday the mountain might get him, but the law never will.
I wrote a shell script that looked at the datestamp for each photo and then moved it to a directory called YYYY/MM/DD (so 2000/12/25). I'm going off the assumption that there weren't two photos taken on the same day with the same filenames. So far that seems to be working.
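A rough Python equivalent of that kind of script might look like the sketch below. It is an illustration only: the function names are invented for this example, and the file's modification time is used as a stand-in for the photo's datestamp, which is an assumption (the original script may well have read a different date).

```python
from datetime import datetime
from pathlib import Path
import shutil


def dated_destination(root, taken, filename):
    """Build root/YYYY/MM/DD/filename for a photo taken at `taken`."""
    return Path(root) / f"{taken:%Y}" / f"{taken:%m}" / f"{taken:%d}" / filename


def file_into_dated_tree(source, root):
    """Move one photo into the dated tree; return the new path, or None on a name clash."""
    source = Path(source)
    # Assumption: mtime stands in for the photo's datestamp.
    taken = datetime.fromtimestamp(source.stat().st_mtime)
    target = dated_destination(root, taken, source.name)
    if target.exists():
        # Same day and same filename: leave it in place for manual review.
        return None
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target
```

Refusing to overwrite on a clash is a deliberate choice here, since the "same day, same filename" assumption is exactly where this scheme can go wrong.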
...you want to trust the EXIF time stamp to determine a duplicate? I had a video cam that was constantly resetting the internal clock to "Jan 1, 2000." It's possible that you could lose some data.
Interestingly, I wrote a program years back, that did the OPPOSITE of this. It read the file name (formatted as a date) and set the date in the EXIF header. I was converting DV-AVI video to still images.
#!/usr/bin/perl
# $Id: findDups.pl 218 2014-01-24 01:04:52Z alan $
#
# Find duplicate files: for files of the same size compares md5 of successive chunks until they differ
#
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex md5_base64);
use Fcntl;
use Cwd qw(realpath);
my $BUFFSIZE = 131072; # compare these many bytes at a time for files of same size
my %fileByName; # all files, name => size
my %fileBySize; # all files, size => [fname1, fname2, ...]
my %fileByHash; # only with duplicates, hash => [fname1, fname2, ...]
if ($#ARGV < 0) {
print "Syntax: findDups.pl <file|dir> [...]\n";
exit;
}
# treat params as files or dirs
foreach my $arg (@ARGV) {
$arg = realpath($arg);
if (-d $arg) {
addDir($arg);
} else {
addFile($arg);
}
}
# get filesize after adding dirs, to avoid more than one stat() per file in case of symlinks, duplicate dirs, etc
foreach my $fname (keys %fileByName) {
$fileByName{$fname} = -s $fname;
}
# build hash of filesize => [ filename1, filename2, ...]
foreach my $fname (keys %fileByName) {
push(@{$fileBySize{$fileByName{$fname}}}, $fname);
}
# for files of the same size: compare md5 of each successive chunk until there is a difference
foreach my $size (keys %fileBySize) {
next if $#{$fileBySize{$size}} < 1; # skip filesizes array with just one file
my %checking;
foreach my $fname (@{$fileBySize{$size}}) {
if (sysopen my $FH, $fname, O_RDONLY) {
$checking{$fname}{fh} = $FH; # file handle
$checking{$fname}{md5} = Digest::MD5->new; # md5 object
} else {
warn "Error opening $fname: $!";
}
}
my $read=0;
while (($read < $size) && (keys %checking > 0)) {
my $r;
foreach my $fname (keys %checking) { # read buffer and update md5
my $buffer;
$r = sysread($checking{$fname}{fh}, $buffer, $BUFFSIZE);
if (! defined($r)) {
warn "Error reading from $fname: $!";
close $checking{$fname}{fh};
delete $checking{$fname};
} else {
$checking{$fname}{md5}->add($buffer);
}
}
$read += $r;
FILE1: foreach my $fname1 (keys %checking) { # remove files without dups
my $duplicate = 0;
FILE2: foreach my $fname2 (keys %checking) { # compare to each checking file
next if $fname1 eq $fname2;
if ($checking{$fname1}{md5}->clone->digest eq $checking{$fname2}{md5}->clone->digest) {
$duplicate = 1;
next FILE1; # skip to next file
}
}
If it's attached to a live system and is writeable then it's not a backup yet, it's just a copy.
A web hosting business near me went under because they made that mistake and lost all of their hosted data in a single incident.
Copies on instantly available disk are often a lot more convenient than detached disks, tapes or whatever, but if that's all you've got there are plenty of ways to lose the lot.
In four different programming languages: http://stromberg.dnsalias.org/...
DigiKam will do everything you want. It works by creating hashes. You set your level of similarity and digiKam will find the files. It can handle multiple locations, and even "albums" on removable media. If you have a lot of images it can be slow, but if you limit any particular search you can greatly improve performance. It is available for Linux and Windows both.
You can't really depend on hashes to not put different keys into the same bin. Given md5sum, or some such, collisions won't be frequent, but they will happen.
This may not matter. What's the cost of missing an image or two? If it's not large, then the small probability of a collision may be good enough.
Exif is based on metadata, so the probability of an improper collision is probably less than for, say, md5sum. It's also more likely to recognize slightly different images as being the same. This is probably why he was suggesting comparing on Exif. (IIUC, using Exif you can even standardize and only compare on thumbnails of the image, which would standardize the image for different sizes, and allow jpg's to be compared against, say, tiff's...but this is WAY out of my depth, and is based on a superficial reading of some documentation.)
Google: image duplicate finder
What I did in my deduplicator written in Python was group the files by their size and reject any file with a unique size. Then I'd hash the first few kilobytes of each file with MD5 (it's just a spot check so speed is more valuable than security against intentional collisions) and reject any file with a unique first few kilobytes. Finally I'd hash the whole file with a more secure hash.
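That three-stage filter (size, then a hash of the first few kilobytes, then a full hash) can be sketched roughly as below. This is not the poster's program: the function names, the 4096-byte head size, and the choice of SHA-256 for the final pass are all assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict

HEAD_BYTES = 4096  # assumed size of the "first few kilobytes" spot check


def _groups(paths, key):
    """Bucket paths by key(path) and keep only buckets with more than one file."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[key(path)].append(path)
    return [group for group in buckets.values() if len(group) > 1]


def _head_md5(path):
    """MD5 of the first HEAD_BYTES bytes: a cheap spot check, not a security measure."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read(HEAD_BYTES)).digest()


def _full_sha256(path):
    """SHA-256 of the whole file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.digest()


def find_duplicates(paths):
    """Return lists of paths whose size, head hash and full hash all match."""
    result = []
    for same_size in _groups(paths, os.path.getsize):
        for same_head in _groups(same_size, _head_md5):
            result.extend(_groups(same_head, _full_sha256))
    return result
```

Each stage only sees files that survived the previous one, so most files are never read in full.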
What you want, is a first pass which identifies some interesting points in the image.
There is an algorithm for that called SIFT (scale-invariant feature transform), but it's patented and apparently unavailable for licensing in free software.
md5 is a 128bit hash. Assuming you're not trying to create collisions, the odds of you getting a collision in n files is:
p = 1 - (2^128)! / ((2^128 - n)! * (2^128)^n)
This is an expression that starts at 0 and gradually goes to 1 as n goes to infinity.
These numbers are so big, I have no idea how to even solve for n to get something like p = 0.0001%, without using a bignumber package, but I imagine n would have to be *REALLY* big in order to get a p significantly above 0
OK so I wrote a quick little python script (I just remember python has bignumber support) to do it on a smaller numbers.
If we assume md5 was only 64 bits, even with 100 million files, your chances of hitting an md5 collision are 0.03% (i.e. a 0.0003 chance).
When you bump up the md5 to 128 bits 100 million files has a 0.000000 (rounded to 6 decimal places) chance of happening.
Maybe I will let my program run overnight and see how far it gets. It's programmed to count how many files it will take before the probability of a collision is 50%. Who knows, maybe it will take millions of years to finish. We'll see tomorrow morning.
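The exact expression above is awkward to evaluate, but the standard birthday-problem approximation p ≈ 1 - exp(-n(n-1) / (2 * 2^b)) for a b-bit hash avoids the factorials and can be inverted directly, so no overnight loop is needed. The sketch below uses that approximation (not the exact formula), and the function names are made up for this example.

```python
import math


def collision_probability(files, bits):
    """Approximate chance that at least two of `files` random b-bit hashes collide."""
    space = 2 ** bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-files * (files - 1) / (2 * space))


def files_for_probability(p, bits):
    """Approximate number of files needed to reach collision probability p."""
    space = 2 ** bits
    return math.sqrt(2 * space * -math.log1p(-p))
```

Under this approximation, 100 million files with a 64-bit hash gives roughly 0.03%, matching the figure above, and a 50% chance with a 128-bit hash needs on the order of 2 x 10^19 files.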
I should also point out that the convention for UUIDs (universally unique identifiers) is also 128bits. Meaning that the chances of randomly getting the same 128 bit number is so low that experts have determined it's ok to just assume it never happens for purposes of computing.
http://en.wikipedia.org/wiki/Universally_unique_identifier
BTW I am at 500 million files, and the odds of getting a 128bit md5 collision are still 0.000000
I should also point out that when I said: "Assuming you're not trying to create collisions...", I was referring to the fact that md5 has been compromised. My point is that 128 bits is enough bits to ensure you will not get a random collision due to chance if you are using a good hashing algorithm.
if you have the exact same picture in three different folders/subdirectories on the same file system, zfs will only allocate storage for one copy of the data, and point the three file entries to that one copy. Similar to how hard links work in ext2 and friends.
I think the idea is to use some utility to query ZFS and find files that ZFS has deduplicated. Similar to how one can count hard links to each inode in ext2 and friends.
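The hard-link half of that comparison is easy to show. The sketch below only covers the ext2-style case, grouping paths that share an inode; it does not query ZFS's deduplication data in any way, and the function name is invented for this example.

```python
import os
from collections import defaultdict


def hard_link_groups(root):
    """Return sorted lists of paths under `root` that share one inode."""
    by_inode = defaultdict(list)
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(directory, name)
            info = os.lstat(path)
            # st_nlink > 1 means at least one other name points at this inode.
            if info.st_nlink > 1:
                by_inode[(info.st_dev, info.st_ino)].append(path)
    return [sorted(paths) for paths in by_inode.values() if len(paths) > 1]
```

Only links that live under the same root are grouped; a file whose other names are elsewhere on the filesystem is skipped.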
However it's fairly easy to do with a unix shell and only standard tools...
Something along the lines of:
find /path/to/pics -type f -print0 | xargs -0 md5sum | sort | while read hash rest; do
if [ "$lasthash" = "$hash" ]; then
echo "$rest"
fi
lasthash="$hash"
done | while read dupe; do
echo rm -- "$dupe"
done
That would, once the echo is removed, delete all files that are dupes (except one of each).
Typed it right into the /. comment box, though, so it's probably wrong somewhere. Only intended to convey the idea.
I have used http://www.duplicate-finder.com/photo.html (MS Windows only) because I could not find anything on Linux with similar functionality. It does work very well, it can find similar, but not identical images, such as the same picture saved in a different format or with different compression settings. It tends to slow down when working directories with multiple thousands of images.
pff, two commands. amateur...
How about only hashing files with identical file sizes?
I use DigiKam but when it came to finding duplicates in unmanaged folders I was happy to find out Geeqie has a very powerful File->FindDuplicates tool with many methods for identifying duplicates. Start with the quick ones and move on to the slower methods.
Also: I love Geeqie's view-> pan view mode... check it out!
Whats whats wrong wrong with with dupes dupes? Picky picky.
I use VisiPics for Windows. It's free software that actually analyses the content of images to find duplicates. This works very well because images may not have exif data, or the same image may exist in different file sizes or formats.
I don't know if it will work under Wine, but it's worth a try.
Visipics is the only tool I have ever found that will reliably use image matching to dedupe; it is Windows only but I have used it on my own collections & it works very well indeed: http://www.visipics.info/
Now (v1.31) understands .raw as well as all other main image formats & can handle rotated images; brilliant little program!
Nico M, London, GB.
I use fslint. It does more than just find duplicate images.
Your best bet is using something like dupeGuru (http://www.hardcoded.net/dupeguru_pe/). It uses a variant of phash (http://www.phash.org/) to also find similar images. I've used it on an archive of 250,000 photos and it works beautifully.
I don't know what the price of RAM is doing these days, but I did buy a 4GB upgrade for my laptop last September, cost £19 for the module. ...oh here we go: 8GB Integral PC3-12800 desktop is going for £55 at PC World Retail. 32GB bankfiller would hit £220, you could beat that with a little shopping around I'm sure.
Laptop SODIMM: same price.
Seems a bit high to me...
or, in DOS/Win7 CLI: "dir /s /os >filelist" returns the entire tree contents from the current directory sorted in ascending file size order to the text file "filelist". 10,070 files/6359 folders (random tree search on my hard drive) took 16 seconds.
Import tab-delimited list into your favourite spreadsheet.
Do what you need to do.
And the probability of a collision and the files being the same size (the first thing to check when looking for dupes) is even smaller.
And then you could pick a random section from both files and run an md5sum on that, squaring your probability of a collision. Probably. I'm just guessing.
seriously. wtf people. It's voilÃ.
Well, you tried.
Quoted for funniness.
Et voilà! L'UTF, c'est votre ami.
same (ftp://ftp.bitwizard.nl/same/) replaces the duplicates by hard or symbolic links.
No need to roll your own. If the redundant files are identical (the
problem as stated lets me assume that), use fdupes.
"Searches the given path for duplicate files. Such files are found by
comparing file sizes and MD5 signatures, followed by a byte-by-byte
comparison."
It's fast, accurate, and generates a list of duplicate files to handle
yourself - or automatically deletes all except the first of duplicate
files found.
I've used it myself with tens of thousands of pictures to exactly do
what the OP wants.
Zut alors! Mais il n'est pas UTF, maintenant:
Et voilà! L'UTF, c'est votre ami.
It's far from ideal. You do get (most of) the storage benefits, but it doesn't help with organisation.
Filesystem-level deduplication is meant to save space from blocks that several files use (several full image backups will undoubtedly share a large portion of files that belong to the OS and common applications, for instance).
An "easy" way to find a guaranteed collision is to simply create more files than 2^(bits in hash). So a bit over 3.4 x 10^38 files for MD5 and you'll get collisions on all subsequent files.
This should be obvious, but just in case:
If you can ask an oracle for a file with a hash not in a list of hashes, then you can keep adding the new files to the list. An n-bit hash can have 2^n unique values, so after 2^n files created no new value can possibly be added to the list.
I know I'm a little late to the conversation, but I wrote a script to tackle this very problem just a month or two ago.
https://github.com/mikegreiling/photosort.py
http://pixelcog.com/blog/2013/recover-corrupted-photo-library/
I had a corrupted iPhoto library after a hard drive went bad, so I needed to combine the photos from my iPhone and several other sources to recompile the library, and the only way to recognize duplicates was with EXIF information.
There have been interesting responses. Tools that find substantially-similar (read: the same image lossy encoded, resized, and rotated) images, produce hashes that can be compared to find out "How similar" two images are, and so on.
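As a toy illustration of how such comparable hashes work, here is a minimal "average hash" in Python. It is not the algorithm used by any of the tools named in this thread; it assumes the image has already been decoded, converted to grayscale and shrunk to a small grid (e.g. 8x8) by some imaging library, and the function names are made up for this example.

```python
def average_hash(pixels):
    """Hash a small grayscale grid (a list of rows) into an integer bit string.

    Each pixel contributes one bit: 1 if it is brighter than the mean, else 0.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")
```

A small Hamming distance between two hashes suggests similar images; the threshold to use is a judgment call that depends on the collection.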
I wouldn't call creating 2^128 files easy. Also, you are likely to get collisions way before you get close to reaching 2^128 files.
OK so it's morning and my program has calculated the probability of getting a collision with 50 billion files at 0.000000
I actually wrote a program that looks for duplicate files based on md5hash, and I did check "size collisions" before actually computing md5hashes (which are pretty CPU intensive for large files).
But even if all the files were the same size, you need a lot of files before you should expect to run into collisions.
I am wondering why no one suggested gimp. Gimp has a command-line interface that does almost anything brilliantly and is perfect for working on multiple files. Gimp reads EXIF data and also accepts python scripts internally if you prefer a GUI over CLI.
I believe that's the chance of any particular pair being in collision. The chance of some pair being in collision would be appreciably larger. And you left out the number of bins in the hash. Even if the raw md5sum would be different, when you change it into a bin number it will be quite a bit smaller...though this can be handled by chaining, etc.
But, yes, it is critically dependent on the number of files to be examined. If he's managing a large library of images, and they are valuable, then he might want to avoid this approach. If he's managing his own photos, there's probably no problem. However, unless I'm misunderstanding Exif documentation (likely) that would allow him to properly compare images at different resolutions and in different formats, where md5sum wouldn't.
I believe that's the chance of any particular pair being in collision. The chance of some pair being in collision would be appreciably larger.
The chance of a particular pair colliding is 1/(2^128). The formula I provided is the probability for any pair colliding.
And you left out the number of bins in the hash. Even if the raw md5sum would be different, when you change it into a bin number it will be quite a bit smaller...though this can be handled by chaining, etc.
And this wouldn't count as a collision in the sense that you wouldn't mistakenly assume two files were equal when they were actually different. This would presumably only happen if the md5 hashes collided.
But, yes, it is critically dependent on the number of files to be examined. If he's managing a large library of images, and they are valuable, then he might want to avoid this approach.
How large is a large library of files? I don't know how far you followed this thread, but I actually made a small python script to calculate how many files you'd need to cause a certain likelihood of a collision, and it's been running for about 18 hours so far and at 640 billion files, the chances of a collision are still 0.000000
The "easy" bit is not about creating the files, but about finding an upper bound for n to get a collision probability of 1. Sorry for phrasing that poorly.
The upper bound is 2^128. I think I originally said p goes to 1 as n goes to infinity, but actually p reaches 1 once n exceeds 2^128.
so here it is: https://github.com/withorwitho...
Enjoy. It doesn't delete or move anything automatically (you can add that if you want); it just outputs images that are perceptually similar.
Example usage and output are included on the github page. Email me if you want it to work a different way or do something different. It's not the most robust phash algorithm, but it's better than straight hashes (in some ways), as it'll detect a png and a jpg that are similar.
I wrote one that I use, works really well because it also hardlinks all the duplicates. https://github.com/wscott/link...
I have done it with Flickr and FlickrDupFinder (https://www.flickr.com/services/apps/72157623582289101/) which has worked very well!
"Beware of he who would deny you access to information, for in his heart, he dreams himself your master."
If the images are identical, hash them and compare hashes.