Slashdot Mirror


Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF?

postbigbang writes "Imagine having thousands of images on disparate machines. Many are dupes, even among the disparate machines. It's impossible to delete all the dupes manually and create a single, accurate photo image base. Is there an app out there that can scan a file system, perhaps a target sub-folder system, and suck in the images-- WITHOUT creating duplicates? Perhaps by reading EXIF info or hashes? I have eleven file systems saved, and the task of eliminating dupes seems impossible."

42 of 243 comments

  1. write it yourself by retchdog · · Score: 2, Insightful

    exactly what you mean by deduplication is kind of vague, but whatever you decide on, it could probably be done in a hundred lines of perl (using CPAN libraries of course).

    --
    "They were pure niggers." – Noam Chomsky
    1. Re:write it yourself by Anonymous Coward · · Score: 5, Informative

      ExifTool is probably your best start:

      http://www.sno.phy.queensu.ca/~phil/exiftool/

    2. Re:write it yourself by shipofgold · · Score: 4, Informative

      I second exiftool. Lots of options to rename files. If you rename files based on createtime and perhaps other fields like resolution you will end up with unique filenames and then you can filter the duplicates.

      Here is a quick command which will rename every file in a directory according to createDate:

        exiftool "-FileName<CreateDate" -d "%Y%m%d_%H%M%S.%%e" DIR

      If the files were all captured with the same device it is probably super easy since the exif info will be consistent. If the files are from lots of different sources...good luck.

    3. Re:write it yourself by Anonymous Coward · · Score: 3, Informative

      I use VisiPics for Windows. It's a free software that actually analyses the content of images to find duplicates. This works very well because images may not have exif data or the same image may be different file sizes or formats.

      I don't know if it will work under Wine, but it's worth a try.

    4. Re:write it yourself by niftymitch · · Score: 3, Interesting

      ExifTool is probably your best start:

      http://www.sno.phy.queensu.ca/~phil/exiftool/

      find . -type f -print0 | xargs -0 md5sum | sort | uniq -w 32 --all-repeated=separate

      GNU uniq's -w 32 compares only the 32-character md5sum, and --all-repeated=separate prints each group of identical md5sums together.

      For multiple machines, drag the full md5sum list to the next machine and concatenate it with the
      local results....

      Yes, exif helps, but some editors attach exif data from the original...
      The serious might cmp files as well before deleting.

      --
      Truth is stranger than fiction, but it is because Fiction is obliged to stick to possibilities; Truth isn't. Mark Twain.
    5. Re:write it yourself by DedTV · · Score: 2

      I use VisiPics to find similar images. For exact duplicates I use CloneSpy.

      And since we're talking dupes, some other things I use to clean up dupes, depending on need, are AllDup which I mostly use for deduping tagged Audio files but can handle a lot of other things. For video, I've only found 2 options, Video Comparer and Duplicate Video Search. I use DVS because I got it for free legally, but it's not as stable or fast as Video Comparer.

  2. fdupes -rd by Anonymous Coward · · Score: 5, Informative

    I've had the same problem as I stupidly try to make the world a better place by renaming or putting them in sub-directories.

    fdupes will do a bit-wise comparison. -r = recurse. -d = delete (it prompts you for which copy in each set to keep).

    fdupes would be the fastest way.

  3. Write a quick script. by khasim · · Score: 4, Informative

    If they are identical then their hashes should be identical.

    So write a script that generates hashes for each of them and checks for duplicate hashes.
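
    A minimal sketch of such a script, assuming Python 3 and SHA-256 (any hash would do; the function names here are made up for illustration). It only catches byte-for-byte identical files:

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each digest to the sorted files under root that share it (groups of 2+ only)."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so one file is not reported as its own duplicate.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) > 1}
```

    Calling find_duplicates("/path/to/photos") returns one entry per set of identical files; nothing is deleted, so you can review the groups first.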

  4. fslint by innocent_white_lamb · · Score: 3, Informative

    fslint is a toolkit to find all redundant disk usage (duplicate files
    for e.g.). It includes a GUI as well as a command line interface.

    http://www.pixelbeat.org/fslin...

    --
    If you're a zombie and you know it, bite your friend!
  5. Fuzzy Hashing by Oceanplexian · · Score: 2

    I would try running all the files through ssdeep.

    You could script it to find a certain % match that you're satisfied with. Only catch to this is that it could be a very time-intensive process to scan a huge number of files. Exif might be a faster option which could be cobbled together in Perl pretty quickly, but that wouldn't catch dupes that had their exif stripped or have slight differences due to post-processing.

  6. Geeqie by zakkie · · Score: 4, Informative

    Works excellently for this.

    1. Re:Geeqie by subreality · · Score: 2

      +1. The reason: it has a fuzzy-matching dedupe feature. It'll crawl all your images, then show them grouped by similarity and let you choose which ones to delete. It seems to do a pretty good job with recompressed or slightly cropped images.

      Open it up, right click a directory, Find Duplicates Recursive.

      fdupes is also good to weed out the bit-for-bit identical files first.

  7. Don't reinvent the wheel: fdupes, md5deep, gqview by nctritech · · Score: 2

    fdupes will work and is faster than writing a homemade script for the job. The big problem is "across multiple machines" which might require use of, say, sshfs to bring all the machines' data remotely onto one temporarily for duplicate scanning. fdupes checks sizes first, and only then starts trying to hash anything, so obvious non-duplicates don't get hashed at all. Significant time savings. Across multiple machines, another option is using md5deep to build recursive hash lists.

    The only tool so far that I've used for image duplicate finding that checks CONTENT rather than bitwise 1:1 duplicate checking is GQview on Linux. It works fairly well; though it's a bit dated by now, it's still a good viewer program. Add -D_FILE_OFFSET_BITS=64 to the CFLAGS if you compile it yourself on a 32-bit machine today though.
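
    For the "across multiple machines" case, one option is to build a hash list on each machine and merge the lists afterwards. A rough sketch, assuming each list uses the md5sum-style layout of a hex digest, a separator, then the path; the function names and the machine labels are hypothetical:

```python
from collections import defaultdict


def parse_manifest(lines, machine):
    """Yield (digest, machine, path) from md5sum-style lines: '<hex digest>  <path>'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        digest, _, path = line.partition(" ")
        # md5sum puts a second space (text mode) or '*' (binary mode) before the path.
        path = path[1:] if path[:1] in (" ", "*") else path
        yield digest.lower(), machine, path


def cross_machine_duplicates(manifests):
    """manifests: {machine: iterable of lines}. Return digest -> [(machine, path), ...] for repeats."""
    seen = defaultdict(list)
    for machine, lines in manifests.items():
        for digest, host, path in parse_manifest(lines, machine):
            seen[digest].append((host, path))
    return {d: locs for d, locs in seen.items() if len(locs) > 1}
```

    Passing open file objects as the values works, since they iterate line by line. Only the small hash lists have to travel between machines, not the images themselves.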

  8. findimagedupes in Debian by nemesisrocks · · Score: 5, Interesting

    whatever you decide on, it could probably be done in a hundred lines of perl

    Funny you mention perl.

    There's a tool written in perl called "findimagedupes" in Debian. Pretty awesome tool for large image collections, because it could identify duplicates even if they had been resized, or messed with a little (e.g. adding logos, etc). Point it at a directory, and it'll find all the dupes for you.

    1. Re:findimagedupes in Debian by msobkow · · Score: 3, Interesting

      Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...

      From what this user is talking about (multiple drives full of images), they may well have reached the point where it is impossible to sort out the dupes without one hell of a heavy hitting cluster to do the comparisons and sorting.

      --
      I do not fail; I succeed at finding out what does not work.
    2. Re:findimagedupes in Debian by complete+loony · · Score: 2

      What you want is a first pass which identifies some interesting points in the image, similar to Microsoft's Photosynth. Then you can compare this greatly simplified data for similar sets of points, allowing you to ignore the effects of scaling or cropping.

      A straight hash won't identify similarities between images, and would be totally confused by compression artefacts.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    3. Re:findimagedupes in Debian by nemesisrocks · · Score: 2, Informative

      Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...

      It's actually pretty nifty how findimagedupes works. It creates a 16x16 thumbnail of each image (it's a little more complicated than that -- read more on the manpage), and uses this as a fingerprint. Fingerprints are then compared using an algorithm that looks like O(n^2).

      I doubt the difference between O(2^n) and O(n^2) would make a huge impact anyway: the biggest bottleneck is going to be disk read and seek time, not comparing fingerprints. It's akin to running compression on a filesystem: read speed is an order of magnitude slower than the compression.
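
      To make the comparison step concrete, here is a sketch of an all-pairs fingerprint check. It assumes each fingerprint is already available as a 256-bit integer (one bit per cell of a 16x16 grid) and that "similar" means at least 90% of bits agree; both the representation and the threshold are assumptions for illustration, not findimagedupes' actual code:

```python
from itertools import combinations

FINGERPRINT_BITS = 256  # a 16x16 grid at one bit per cell (assumed layout)


def similarity(fp_a, fp_b):
    """Fraction of matching bits between two fingerprints stored as Python ints."""
    differing = bin(fp_a ^ fp_b).count("1")  # XOR leaves a 1 wherever the bits differ
    return 1.0 - differing / FINGERPRINT_BITS


def similar_pairs(fingerprints, threshold=0.90):
    """Compare every pair (n*(n-1)/2 comparisons) and return those at or above the threshold."""
    pairs = []
    for (name_a, fp_a), (name_b, fp_b) in combinations(sorted(fingerprints.items()), 2):
        if similarity(fp_a, fp_b) >= threshold:
            pairs.append((name_a, name_b))
    return pairs
```

      The nested pairing is where the O(n^2) comes from: each comparison is just an XOR and a bit count, but the number of pairs grows with the square of the collection size.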

    4. Re:findimagedupes in Debian by sexconker · · Score: 2

      Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...

      It's actually pretty nifty how findimagedupes works. It creates a 16x16 thumbnail of each image (it's a little more complicated than that -- read more on the manpage), and uses this as a fingerprint. Fingerprints are then compared using an algorithm that looks like O(n^2).

      I doubt the difference between O(2^n) and O(n^2) would make a huge impact anyway: the biggest bottleneck is going to be disk read and seek time, not comparing fingerprints. It's akin to running compression on a filesystem: read speed is an order of magnitude slower than the compression.

      O(n^2) vs O(2^n) is a huge difference even for very small datasets (hundreds of pictures).
      You have to read all the images and generate the hashes, but that's Theta(n).
      Comparing one hash to every other hash is Theta(n^2).

      If the hashes are small enough to all live in memory (or enough of them that you can intelligently juggle your comparisons without having to wait on the disk too much), then you'll be fine for tens of thousands of pictures.
      But photographers can take thousands of pictures per shoot, hundreds of thousands in a year, and have millions of photos to dedupe.
      When you're at that level, comparisons have to be 6 orders of magnitude faster than your disk read to avoid being the bottleneck. With large hard drives shitting out 60-120 MBps (we'll ignore SSDs because they can't hold that many photos, and we'll ignore RAID just because), that's not going to be the case.
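
      A quick back-of-the-envelope check of how fast the all-pairs step grows compared with the single read pass (the collection sizes below are only illustrative):

```python
def pair_count(n):
    """Number of distinct pairs among n fingerprints: n choose 2."""
    return n * (n - 1) // 2


for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} photos -> {pair_count(n):>15,} comparisons")
```

      A thousand photos need about half a million comparisons; a million photos need about 5 x 10^11, while the read pass only grew a thousandfold.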

  9. fdupes by ender8282 · · Score: 2

    Under *buntu
    sudo apt-get install fdupes
    man fdupes:
    fdupes - finds duplicate files in a given set of directories

  10. Photo managers by MrEricSir · · Score: 2

    As a former Shotwell dev I might point out that most photo manager apps can do this.

    --
    There's no -1 for "I don't get it."
  11. Re:ZFS dedup by Anonymous Coward · · Score: 3, Informative

    Have you read the zfs documentation? Setting zfs dedup does not remove duplicate files (per OP request, since there are eleven different file systems), but removes redundant storage for files which are duplicates. In other words, if you have the exact same picture in three different folders/subdirectories on the same file system, zfs will only allocate storage for one copy of the data, and point the three file entries to that one copy. Similar to how hard links work in ext2 and friends.
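
    For anyone unfamiliar with the hard-link comparison, this small sketch shows several directory entries pointing at one copy of the data. It assumes a filesystem that supports hard links; the file names are made up:

```python
import os
import tempfile


def hard_link_demo():
    """Create one file plus two hard links and report (same_inode, link_count)."""
    with tempfile.TemporaryDirectory() as folder:
        original = os.path.join(folder, "photo.jpg")
        with open(original, "wb") as handle:
            handle.write(b"pretend this is image data")
        links = [os.path.join(folder, name) for name in ("copy1.jpg", "copy2.jpg")]
        for link in links:
            os.link(original, link)  # new directory entry, same underlying data
        inodes = {os.stat(path).st_ino for path in [original] + links}
        return len(inodes) == 1, os.stat(original).st_nlink
```

    The analogy only goes so far: writing through one hard link changes what every name sees, whereas files deduplicated by ZFS stay independent of each other.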

  12. Re:Seriously? by postbigbang · · Score: 3, Interesting

    Yeah. Thanks. It's a simple question. So far, I've seen scripting suggestions, which might be useful. I'm a nerd, but not wanting to do much code because I'm really rusty at it. Instead, I'm amazed that no one runs into this problem and has built an app that does this. That's all I'm looking for: consolidation.

    --
    ---- Teach Peace. It's Cheaper Than War.
  13. General case by xaxa · · Score: 5, Informative

    For the general case (any file), I've used this script:


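    #!/bin/sh
    # Writes rem-duplicates.sh, listing every file that has a duplicate as a commented-out "#rm".
    # Review that file, uncomment the copies you want gone, then run it.

    OUTF=rem-duplicates.sh

    echo "#! /bin/sh" > "$OUTF"

    # Hash every file, sort by hash, keep only groups whose first 32 characters (the md5) repeat,
    # then strip the hash, backslash-escape unusual characters and prefix "#rm ".
    find "$@" -type f -print0 |
        xargs -0 -n1 md5sum |
            sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
                sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> "$OUTF"

    chmod a+x "$OUTF"; ls -l "$OUTF"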

    It should be straightforward to change "md5sum" to some other key -- e.g. EXIF Date + some other EXIF fields.

    (Also, isn't this really a question for superuser.com or similar?)

    1. Re:General case by Forever+Wondering · · Score: 3, Informative

      (Also, isn't this really a question for superuser.com or similar?)

      Possibly ;-)
      http://superuser.com/questions...

      --
      Like a good neighbor, fsck is there ...
    2. Re:General case by Yakasha · · Score: 3, Funny

      (Also, isn't this really a question for superuser.com or similar?)

      Possibly ;-) http://superuser.com/questions...

      So adapt the script to de-dupe stories?

      But then if we did that... what would we read on /.?

  14. Re:Don't reinvent the wheel: fdupes, md5deep, gqvi by rwa2 · · Score: 5, Informative

    Yeah, this Ask Slashdot should really be about teaching people how to search for packages in aptitude or whatever your package manager is...
    Here are some others:

    findimagedupes
    Finds visually similar or duplicate images
    findimagedupes is a commandline utility which performs a rough "visual diff" to
    two images. This allows you to compare two images or a whole tree of images and
    determine if any are similar or identical. On common image types,
    findimagedupes seems to be around 98% accurate.
    Homepage: http://www.jhnc.org/findimaged...

    fslint :

    kleansweep :
    File cleaner for KDE
    KleanSweep allows you to reclaim disk space by finding unneeded files. It can
    search for files basing on several criterias; you can seek for:
    * empty files
    * empty directories
    * backup files
    * broken symbolic links
    * broken executables (executables with missing libraries)
    * dead menu entries (.desktop files pointing to non-existing executables)
    * duplicated files ...
    Homepage: http://linux.bydg.org/~yogin/

    komparator :
    directories comparator for KDE
    Komparator is an application that searches and synchronizes two directories. It
    discovers duplicate, newer or missing files and empty folders. It works on
    local and network or kioslave protocol folders.
    Homepage: http://komparator.sourceforge....

    backuppc : (just in case this was related to your intended use case for some reason)
    high-performance, enterprise-grade system for backing up PCs
    BackupPC is disk based and not tape based. This particularity allows features
    not found in any other backup solution:
    * Clever pooling scheme minimizes disk storage and disk I/O. Identical files
        across multiple backups of the same or different PC are stored only once
        resulting in substantial savings in disk storage and disk writes. Also known
        as "data deduplication".

    I bet if you throw Picasa at your combined images directory, it might have some kind of "similar image" detection too, particularly since it sorts everything by exif timestamp.

    That said, I've never had to use any of this stuff, because my habit was to rename my camera image dumps to a timestamped directory (e.g. 20140123_DCIM ) to begin with, and upload it to its final resting place on my main file server immediately, so I know all other copies I encounter on other household machines are redundant.

  15. Quick shell script using exiftool by Khopesh · · Score: 4, Interesting

    This will help find exact matches by exif data. It will not find near-matches unless they have the same exif data. If you want that, good luck. Geeqie has a find-similar command, but it's only so good (image search is hard!). Apparently there's also a findimagedupes tool available, see comments above (I wrote this before seeing that and had assumed apt-cache search had already been exhausted).

    I would write a script that runs exiftool on each file you want to test. Remove the items that refer to timestamp, file name, path, etc., then make an md5 of what's left.

    Something like this exif_hash.sh (sorry, slashdot eats whitespace so this is not indented):

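    #!/bin/sh
    # For each image, print the md5 of its exif dump (minus date, file and directory lines), then the path.
    for image in "$@"; do
    echo "$(exiftool "$image" |grep -ve 20..:..: -e 19..:..: -e File -e Directory |md5sum) $image"
    done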

    And then run:

    find [list of paths] -type f -print0 |xargs -0 exif_hash.sh |sort > output

    If you have a really large list of images, do not run this through sort. Just pipe it into your output file and sort it later. It's possible that the sort utility can't deal with the size of the list (you can work around this by using grep '^[0-8]' output |sort >output-1 and grep -v '^[0-8]' output |sort >output-2, then cat output-1 output-2 > output.sorted or thereabouts; you may need more than two passes).

    There are other things you can do to display these, e.g. awk '{print $1}' output |uniq -c |sort -n to rank them by hash.

    On Debian, exiftool is part of the libimage-exiftool-perl package. If you know perl, you can write this with far more precision (I figured this would be an easier explanation for non-coders).
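
    The same idea can be sketched in Python for anyone who would rather not use shell. This assumes exiftool is installed and on the PATH; the function names are hypothetical, and the filter simply mirrors the grep above:

```python
import hashlib
import re
import subprocess

# Same filter as the grep above: drop any line that looks like a timestamp
# or mentions File/Directory, since those differ between otherwise identical copies.
VOLATILE = re.compile(r"(?:19|20)..:..:|File|Directory")


def metadata_fingerprint(exif_text):
    """MD5 of the exiftool output with the volatile lines removed."""
    kept = [line for line in exif_text.splitlines() if not VOLATILE.search(line)]
    return hashlib.md5("\n".join(kept).encode("utf-8")).hexdigest()


def fingerprint_file(path):
    """Run `exiftool PATH` (must be installed) and fingerprint its output."""
    result = subprocess.run(["exiftool", path], capture_output=True, text=True, check=True)
    return metadata_fingerprint(result.stdout)
```

    Because the date lines are dropped, two different shots taken with otherwise identical settings would get the same fingerprint, so treat matches as candidates to inspect rather than as proof.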

    --
    Use my userscript to add story images to Slashdot. There's no going back.
  16. Re:Seriously? by zakkie · · Score: 5, Informative

    See my earlier contribution: geeqie. It will even scan for image similarity, not just rudimentary hashing. Someone else mentioned gqview & that it was out of date - geeqie is what gqview became.

  17. Re:You don't need software for this by unrtst · · Score: 3, Informative

    Adjust as needed:

    find ./ -type f -iname '*.jpg' -exec md5sum {} \; > image_md5.txt
    cat image_md5.txt | cut -d" " -f1 | sort | uniq -d | while read md5; do grep $md5 image_md5.txt; done

    ...though I think something more sophisticated than an md5sum would be wise (exif data could have been changed but nothing else, and you'd miss that dupe).

  18. Re:I think I wrote one of these. by Cummy · · Score: 3, Insightful

    Why do people on this site believe that everyone who is interested in tech is a programmer? This "just write it" is foolishness of the highest order. For many of us non-programmers "just write it" is like telling someone living in Florida to "just build a plane and fly to that concert in Vienna after work tomorrow". If that seems like a ridiculous ask, then so is asking a person without the skill to write a script for that. So it can be done in 20 minutes; use that 20 minutes to help someone by writing the program and loading it to a repo. All the 20-second tutorials in the world will not get someone to write a program if they just don't have the skill set.
    This is part of the reason Windows is successful: think of a problem, there is likely a program out there that solves it already, and if there isn't one someone will soon write one (Apple users just go and buy one). Linux will not get out of single digit adoption until people with the skills write and edit programs for the non-programmers like myself, because when stuff needs to get done fast Windows will have the program (and yes it is easier to clean out the malware and fight the popups than it is to write the program).

  19. Re:Seriously? by Cley+Faye · · Score: 2

    When you're talking about duplicate content, you can't limit yourself to "just hashes".
    In this case, with pictures, just opening one and saving it again might produce a different hash, just by recompression or changing the file format. How do all these "just check the hashes" solutions work for that?
    Finding duplicate images is not that easy.

  20. http://en.wikipedia.org/wiki/List_of_duplicate_fil by Anonymous Coward · · Score: 2, Informative

    http://en.wikipedia.org/wiki/List_of_duplicate_file_finders

  21. Re:I think I wrote one of these. by VortexCortex · · Score: 2, Informative

    This "just write it" is foolishness of the highest order. For many of us non-programmers "just write it" is like telling someone living in Florida to "just build a plane and fly to that concert in Vienna after work tomorrow".

    Computer literacy used to involve typing a terminal command. All the PC folks in the 80's and 90's did it. I can't be fucked to care if folks are too stupid to learn how to use their computers. If you can't "write it yourself" in this instance, which amounts to running an operation across a set of files, then sorting the result, then you do not know how to use a computer. You know how to use some applications and input devices. It's a big difference.

    This is part of the reason Windows is successful: think of a problem, there is likely a program out there that solves it already, and if there isn't one someone will soon write one

    Which is why it's a nightmare to administer Windows. MS had to create a fucking scripting terminal "powershell" because they ditched DOS and didn't expose OS features to a terminal... Now go press the Towel key to open Windows 8's start screen. Start typing... AT A NEUTERED TERMINAL... ugh. Sometimes, it's better to not have to wait for someone to create something for you, especially when it's something very easy to do. You would FIRE a secretary that could not sort a set of physical files by customer ID and remove duplicates, or add up totals with a calculator, etc. Your standard for computer "operator" is so low it's pitiable.

    If you paid attention to the thread, you'd have noticed that nothing you said about Windows is exclusive to windows. Indeed, a Google search for any OS would have turned up solutions for it. Some would be a few lines of BASH or Perl, Powershell, BATCH scripts, etc. Some would be 'free' programs, some of those would have adware, some would have malware. At least the ones in the FLOSS repositories wouldn't.

    The OS exposes your computer's features to you. If you do not know how to write a simple set of instructions for it to follow, then you do not know how to use a computer.

  22. My solution by alantus · · Score: 2

    #!/usr/bin/perl
    # $Id: findDups.pl 218 2014-01-24 01:04:52Z alan $
    #
    # Find duplicate files: for files of the same size compares md5 of successive chunks until they differ
    #
    use strict;
    use warnings;
    use Digest::MD5 qw(md5 md5_hex md5_base64);
    use Fcntl;
    use Cwd qw(realpath);

    my $BUFFSIZE = 131072; # compare these many bytes at a time for files of same size

    my %fileByName; # all files, name => size
    my %fileBySize; # all files, size => [fname1, fname2, ...]
    my %fileByHash; # only with duplicates, hash => [fname1, fname2, ...]

    if ($#ARGV < 0) {
        print "Syntax: findDups.pl <file|dir> [...]\n";
        exit;
    }

    # treat params as files or dirs
    foreach my $arg (@ARGV) {
        $arg = realpath($arg);
        if (-d $arg) {
            addDir($arg);
        } else {
            addFile($arg);
        }
    }

    # get filesize after adding dirs, to avoid more than one stat() per file in case of symlinks, duplicate dirs, etc
    foreach my $fname (keys %fileByName) {
        $fileByName{$fname} = -s $fname;
    }

    # build hash of filesize => [ filename1, filename2, ...]
    foreach my $fname (keys %fileByName) {
        push(@{$fileBySize{$fileByName{$fname}}}, $fname);
    }

    # for files of the same size: compare md5 of each successive chunk until there is a difference
    foreach my $size (keys %fileBySize) {
        next if $#{$fileBySize{$size}} < 1; # skip filesizes array with just one file
        my %checking;
        foreach my $fname (@{$fileBySize{$size}}) {
            if (sysopen my $FH, $fname, O_RDONLY) {
                $checking{$fname}{fh} = $FH; # file handle
                $checking{$fname}{md5} = Digest::MD5->new; # md5 object
            } else {
                warn "Error opening $fname: $!";
            }
        }
        my $read=0;
        while (($read < $size) && (keys %checking > 0)) {
            my $r;
            foreach my $fname (keys %checking) { # read buffer and update md5
                my $buffer;
                $r = sysread($checking{$fname}{fh}, $buffer, $BUFFSIZE);
                if (! defined($r)) {
                    warn "Error reading from $fname: $!";
                    close $checking{$fname}{fh};
                    delete $checking{$fname};
                } else {
                    $checking{$fname}{md5}->add($buffer);
                }
            }
            $read += $r;
            FILE1: foreach my $fname1 (keys %checking) { # remove files without dups
                my $duplicate = 0;
                FILE2: foreach my $fname2 (keys %checking) { # compare to each checking file
                    next if $fname1 eq $fname2;
                    if ($checking{$fname1}{md5}->clone->digest eq $checking{$fname2}{md5}->clone->digest) {
                        $duplicate = 1;
                        next FILE1; # skip to next file
                    }
                }

  23. digiKam is what you want. by Lurching · · Score: 2

    DigiKam will do everything you want. It works by creating hashes. You set your level of similarity and digiKam will find the files. It can handle multiple locations, and even "albums" on removable media. If you have a lot of images it can be slow, but if you limit any particular search you can greatly improve performance. It is available for Linux and Windows both.

  24. I wrote one myself by tepples · · Score: 3, Insightful

    What I did in my deduplicator written in Python was group the files by their size and reject any file with a unique size. Then I'd hash the first few kilobytes of each file with MD5 (it's just a spot check so speed is more valuable than security against intentional collisions) and reject any file with a unique first few kilobytes. Finally I'd hash the whole file with a more secure hash.
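
    A sketch of that three-stage filter, for illustration only (this is not the original deduplicator). The 4096-byte prefix and the choice of SHA-256 for the final pass are assumptions, as are the function names:

```python
import hashlib
import os
from collections import defaultdict

PREFIX_BYTES = 4096  # assumed size of the "first few kilobytes" spot check


def _groups(paths, key):
    """Bucket paths by key(path) and keep only buckets with more than one member."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[key(path)].append(path)
    return [group for group in buckets.values() if len(group) > 1]


def _prefix_md5(path):
    """Cheap spot check: MD5 of only the first PREFIX_BYTES of the file."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read(PREFIX_BYTES)).digest()


def _full_sha256(path):
    """Expensive final check: SHA-256 of the whole file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.digest()


def duplicate_groups(paths):
    """Apply the size, prefix-hash and full-hash filters in turn; return lists of identical files."""
    survivors = [list(paths)]
    for key in (os.path.getsize, _prefix_md5, _full_sha256):
        survivors = [sub for group in survivors for sub in _groups(group, key)]
    return [sorted(group) for group in survivors]
```

    Each stage only runs on files that survived the previous one, so most files are rejected after a stat() or a single small read and never get hashed in full.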

  25. SIFT is patented by tepples · · Score: 2

    What you want is a first pass which identifies some interesting points in the image.

    There is an algorithm for that called SIFT (scale-invariant feature transform), but it's patented and apparently unavailable for licensing in free software.

  26. Re:Hashes should be relatively easy by TsuruchiBrian · · Score: 3, Informative

    md5 is a 128-bit hash. Assuming you're not trying to create collisions, the odds of you getting a collision in n files are:

    p = 1 - (2^128)! / ((2^128 - n)! * (2^128)^n)

    This is an expression that starts at 0 and gradually goes to 1 as n goes to infinity.

    These numbers are so big, I have no idea how to even solve for n to get something like p = 0.0001%, without using a bignumber package, but I imagine n would have to be *REALLY* big in order to get a p significantly above 0

  27. Re:Hashes should be relatively easy by TsuruchiBrian · · Score: 2

    OK so I wrote a quick little python script (I just remembered python has bignumber support) to do it on smaller numbers.

    If we assume md5 was only 64 bits, even with 100 million files, your chances of hitting an md5 collision are 0.03% (i.e. a 0.0003 chance).

    When you bump up the md5 to 128 bits 100 million files has a 0.000000 (rounded to 6 decimal places) chance of happening.

    maybe I will let my program run overnight and see how far it gets. It's programmed to count how many files it will take before the probability of a collision is 50%. Who knows, maybe it will take millions of years to finish. We'll see tomorrow morning.
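
    Counting up one file at a time is not necessary: the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)), gives the answer directly. A small sketch, assuming hashes behave like uniform random values (the function names are made up):

```python
import math


def collision_probability(files, bits):
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits))."""
    exponent = files * (files - 1) / (2.0 * 2.0 ** bits)
    return -math.expm1(-exponent)  # expm1 keeps precision when p is tiny


def files_for_probability(p, bits):
    """Approximate number of files needed to reach collision probability p."""
    return math.sqrt(2.0 * 2.0 ** bits * math.log(1.0 / (1.0 - p)))
```

    With 64 bits and 100 million files this gives roughly 0.027%, in line with the 0.03% above. With 128 bits the same collection comes out around 10^-23, and files_for_probability(0.5, 128) is about 2.2 x 10^19 files, so a loop that counts up to the 50% point would not finish in any reasonable time.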

  28. Visipics is excellent. by micronicos · · Score: 3, Informative

    I use VisiPics for Windows. It's a free software that actually analyses the content of images to find duplicates. This works very well because images may not have exif data or the same image may be different file sizes or formats.
    I don't know if it will work under Wine, but it's worth a try.

    Visipics is the only tool I have ever found that will reliably use image matching to dedupe; it is Windows only but I have used it on my own collections & it works very well indeed: http://www.visipics.info/

    Now (v1.31) understands .raw as well as all other main image formats & can handle rotated images; brilliant little program!

    --
    Nico M, London, GB.
    1. Re:Visipics is excellent. by DMUTPeregrine · · Score: 2

      I've used Duplicate Photo Finder for a while, but VisiPics looks like it's probably better. That said, I have tested and Duplicate Photo Finder worked for me with WINE.

      --
      Not a sentence!
  29. Re:ZFS filesystem with dedup by Zontar+The+Mindless · · Score: 2

    Et voilà! L'UTF, c'est votre ami.

    --
    Il n'y a pas de Planet B.