Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF?

write it yourself by retchdog · 2014-01-23 10:35 · Score: 2, Insightful

exactly what you mean by deduplication is kind of vague, but whatever you decide on, it could probably be done in a hundred lines of perl (using CPAN libraries of course).


Re:write it yourself by Anonymous Coward · 2014-01-23 10:38 · Score: 5, Informative

ExifTool is probably your best start:
http://www.sno.phy.queensu.ca/~phil/exiftool/
Re:write it yourself by postbigbang · 2014-01-23 10:38 · Score: 1

Imagine tons of iterative backups of photos. Generations of backups. Now they need consolidation. Something that can look at file systems, vacuum the files-- but only one of each photo, even if there are many copies of that photo, as in myphoto(1).jpg, etc.

--
---- Teach Peace. It's Cheaper Than War.
Re:write it yourself by Anonymous Coward · 2014-01-23 10:40 · Score: 0

I actually expect some helpful Gus to post a one-liner....
Re:write it yourself by thorgil · 2014-01-23 10:41 · Score: 1

or python, using 10 lines.

--
Warning: This sig contains a small bug. ==> *
Re:write it yourself by Anonymous Coward · 2014-01-23 10:53 · Score: 0

I don't know if anything exists, but would some kind of image hashing program work? It would just have to create hashes for each photo and then compare the hashes, removing any duplicate hashes.
Re:write it yourself by Anonymous Coward · 2014-01-23 10:54 · Score: 1

Perl to the rescue.
$ sudo cpan App::dupfind
$ dupfind [ --options ] --dir ./path/to/search/
Or you can try your luck with something designed for finding similar image files:
http://www.jhnc.org/findimagedupes/
Re:write it yourself by Anonymous Coward · 2014-01-23 10:55 · Score: 1

No, you would want to remove the duplicate photos. Removing the duplicate hashes doesn't solve the problem.
Re:write it yourself by shipofgold · 2014-01-23 11:01 · Score: 4, Informative

I second exiftool. Lots of options to rename files. If you rename files based on create time and perhaps other fields like resolution, you will end up with unique filenames, and then you can filter the duplicates.
Here is a quick command which will rename every file in a directory according to CreateDate:
exiftool "-FileName<CreateDate" -d "%Y%m%d_%H%M%S.%%e" DIR
If the files were all captured with the same device it is probably super easy since the exif info will be consistent. If the files are from lots of different sources...good luck.
Re:write it yourself by vux984 · 2014-01-23 11:04 · Score: 1

If the files are in fact identical internally, just backups and backups of backups then it should be pretty straightforward.
Simplest would be simply to:
start with an empty destination
Compare each file in the source tree(s) against each file in the destination by file size in bytes; if there is a size match, do a file compare using cmp. Copy the file to the destination if it doesn't match anything there; otherwise move on to the next file. Seems like something that would take 10-20 lines of command-line script, tops. It's a one-time job, so who cares if it's ideally efficient.
A more sophisticated method that generates and compares file hashes would potentially be somewhat faster and cleverer, but that would depend on how much duplication actually exists. cmp terminates at the first mismatched byte, so it will short-circuit out of virtually all comparisons nearly immediately, whereas generating hashes requires processing every file completely, as well as coming up with a system for managing the hash/filename map, etc. It gets cleverer than it needs to be for a one-off job pretty fast.
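For anyone who would rather not write that by hand, here is a minimal Python sketch of the size-then-compare consolidation described above. It is one possible reading of the post, not a tested tool: the function name `consolidate` and the `_1`, `_2` renaming of name clashes are assumptions made for illustration, and `filecmp.cmp(..., shallow=False)` stands in for cmp.

```python
import filecmp
import os
import shutil


def consolidate(sources, dest):
    """Copy one instance of each distinct file from `sources` into `dest`."""
    os.makedirs(dest, exist_ok=True)
    kept_by_size = {}  # file size in bytes -> paths already copied into dest
    for root_dir in sources:
        for folder, _subdirs, names in os.walk(root_dir):
            for name in sorted(names):
                src = os.path.join(folder, name)
                size = os.path.getsize(src)
                candidates = kept_by_size.setdefault(size, [])
                # shallow=False forces a byte-by-byte comparison, like cmp.
                if any(filecmp.cmp(src, kept, shallow=False) for kept in candidates):
                    continue  # identical content is already in dest
                target = os.path.join(dest, name)
                stem, ext = os.path.splitext(name)
                counter = 1
                while os.path.exists(target):  # same name, different content
                    target = os.path.join(dest, f"{stem}_{counter}{ext}")
                    counter += 1
                shutil.copy2(src, target)
                candidates.append(target)
    return sorted(os.listdir(dest))
```

Only files of equal size are ever compared byte for byte, so most files cost a single stat call. The sketch assumes `dest` is not inside any of the source trees.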
Re:write it yourself by Anonymous Coward · 2014-01-23 11:07 · Score: 0

exactly what you mean by deduplication is kind of vague
And yet you pretty much know what he means.

but whatever you decide on, it could probably be done in a hundred lines of perl (using CPAN libraries of course).
Re:write it yourself by Anonymous Coward · 2014-01-23 11:22 · Score: 0

You must get paid by the hour if you think it'll take a hundred lines.
Re:write it yourself by Anonymous Coward · 2014-01-23 13:47 · Score: 3, Informative

I use VisiPics for Windows. It's free software that actually analyses the content of images to find duplicates. This works very well because images may lack EXIF data, or the same image may exist in different file sizes or formats.

I don't know if it will work under Wine, but it's worth a try.
Re:write it yourself by A+nonymous+Coward · 2014-01-23 14:56 · Score: 1

I wrote a file deduplicator. Build a table of file size ---> name. If two files have the same size, run md5sum on them or just use cmp -s. It's a trivial program.
But if you have photos which you consider duplicates but which have different sizes or checksums, then it's a visual gig and lots of boring, tedious work.

--
Infuriate left and right
Re:write it yourself by niftymitch · 2014-01-23 15:31 · Score: 3, Interesting

ExifTool is probably your best start:
http://www.sno.phy.queensu.ca/~phil/exiftool/
find . -print0 | xargs -0 md5sum | sort -flags | uniq -flags
There are flags in uniq to let you see pairs of identical md5sums as a pair.
Multiple machines drag the full file to the next machine and concat the
local files....
Yes exif helps. but some editors attach exif data from the original...
The serious might cmp files as well before deleting.

--
Truth is stranger than fiction, but it is because Fiction is obliged to stick to possibilities; Truth isn't. Mark Twain.
Re:write it yourself by KinkyClown · 2014-01-23 18:05 · Score: 1

or GWBASIC, using 5.000 lines.
Re: write it yourself by EvandroJúnior · 2014-01-23 20:18 · Score: 1

I wrote one myself; I had the same problem. This program does all that and more, with a graphical interface. It uses md5 hashes to deduplicate files from a source folder to a destination folder. https://github.com/evandrojr/F... It is open source, works on Windows, and should work on Linux with minor changes. It uses Mono with C#. Hope it helps. It worked pretty well for me.
Re:write it yourself by Enderek · 2014-01-23 23:21 · Score: 1

I had a problem with 200 GB of pictures. I used "rdfind" to remove the duplicates (about 80 GB). It works fast and smart: it searches for files with the same size, compares first and last bytes, then compares md5sums. It can remove duplicates or replace them with soft/hard links.
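The staged filtering described here (size, then first and last bytes, then a full checksum) can be sketched in Python as below. This is an illustration of the idea only and is not rdfind's actual implementation; `EDGE_BYTES` and all function names are hypothetical.

```python
import hashlib
import os
from collections import defaultdict

EDGE_BYTES = 64  # hypothetical probe size; not taken from rdfind


def _edges(path):
    """Return the first and last EDGE_BYTES bytes of a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        head = fh.read(EDGE_BYTES)
        fh.seek(max(size - EDGE_BYTES, 0))
        tail = fh.read(EDGE_BYTES)
    return head, tail


def _md5(path):
    """Hash the whole file in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _split(groups, key):
    """Re-partition each group by `key`, dropping groups left with one file."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(b for b in buckets.values() if len(b) > 1)
    return result


def find_duplicates(top):
    """Return lists of paths under `top` that pass all three tests together."""
    paths = [os.path.join(folder, name)
             for folder, _dirs, names in os.walk(top) for name in names]
    groups = [paths]
    for key in (os.path.getsize, _edges, _md5):  # cheapest test first
        groups = _split(groups, key)
    return [sorted(group) for group in groups]
```

Each stage only looks at files that survived the previous one, so the expensive full-file hash runs on a small fraction of a typical photo collection.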
Re:write it yourself by DedTV · 2014-01-24 10:03 · Score: 2

I use VisiPics to find similar images. For exact duplicates I use CloneSpy.

And since we're talking dupes, some other things I use to clean up dupes, depending on need, are AllDup which I mostly use for deduping tagged Audio files but can handle a lot of other things. For video, I've only found 2 options, Video Comparer and Duplicate Video Search. I use DVS because I got it for free legally, but it's not as stable or fast as Video Comparer.
Re: write it yourself by leslie.satenstein · 2014-01-25 10:01 · Score: 1

I wrote some Linux code that does 99% of what the requester wants. Point the program at a top directory and it will scan it and every enclosed subdirectory. Written for Unix/Linux. It uses md5 and/or sha1 hashes. It can run as root. Just ask for it.
Re: write it yourself by amiga3D · 2014-01-25 13:58 · Score: 1

Run as root? Why would you.....ah, never mind.
Re: write it yourself by Kremmy · 2014-01-27 17:22 · Score: 1

In this case you might want to run as root so as to avoid weird permissions issues in pulling the image data from the foreign filesystems.
Re:write it yourself by RockDoctor · 2014-01-28 03:00 · Score: 1

or GWBASIC, using 5.000 lines.
You won't be able to do that until Win10 (SP2).

--
Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
Re:write it yourself by alexo · 2014-01-31 12:57 · Score: 1

Visipics is showing its age.
The UI is very limiting (although the algorithm is still the best of all the alternatives I tried)
I wish the author would open-source it.

Hashes should be relatively easy by Anonymous Coward · 2014-01-23 10:36 · Score: 0

On the Mac I use a program called Gemini. It can root out any duplicate files between multiple sources, and gives you options on which ones to keep/delete (i.e. manual, oldest, etc.).

Re:Hashes should be relatively easy by HiThere · 2014-01-23 13:58 · Score: 1

You can't really depend on hashes to not put different keys into the same bin. Given md5sum, or some such, collisions won't be frequent, but they will happen.
This may not matter. What's the cost of missing an image or two? If it's not large, then the small probability of a collision may be good enough.
Exif is based on metadata, so the probability of an improper collision is probably less than for, say, md5sum. It's also more likely to recognize slightly different images as being the same. This is probably why he was suggesting comparing on Exif. (IIUC, using Exif you can even standardize and only compare on thumbnails of the image, which would standardize the image for different sizes, and allow jpg's to be compared against, say, tiff's...but this is WAY out of my depth, and is based on a superficial reading of some documentation.)

--

I think we've pushed this "anyone can grow up to be president" thing too far.
Re:Hashes should be relatively easy by TsuruchiBrian · 2014-01-23 14:25 · Score: 3, Informative

md5 is a 128-bit hash. Assuming you're not trying to create collisions, the odds of getting a collision among n files are:
p = 1 - (2^128)! / ((2^128 - n)! * (2^128)^n)
This is an expression that starts at 0 and gradually goes to 1 as n goes to infinity.
These numbers are so big, I have no idea how to even solve for n to get something like p = 0.0001%, without using a bignumber package, but I imagine n would have to be *REALLY* big in order to get a p significantly above 0
Re:Hashes should be relatively easy by TsuruchiBrian · 2014-01-23 14:44 · Score: 2

OK, so I wrote a quick little python script (I just remembered python has bignumber support) to do it on smaller numbers.
If we assume md5 was only 64 bits, even with 100 million files, your chances of hitting an md5 collision are 0.03% (i.e. a 0.0003 chance).
When you bump up the md5 to 128 bits, 100 million files has a 0.000000 (rounded to 6 decimal places) chance of a collision.
Maybe I will let my program run overnight and see how far it gets. It's programmed to count how many files it will take before the probability of a collision is 50%. Who knows, maybe it will take millions of years to finish. We'll see tomorrow morning.
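There is a shortcut that avoids counting upward. A minimal sketch, assuming the hash behaves like a uniform random function, uses the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) with N = 2^bits in place of the exact factorial formula. The function names are made up for this example.

```python
import math


def collision_probability(n_files, hash_bits):
    """Approximate chance that at least two of n_files share a hash value."""
    buckets = 2.0 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n_files * (n_files - 1) / (2.0 * buckets))


def files_for_probability(p, hash_bits):
    """Approximate number of files needed to reach collision probability p."""
    buckets = 2.0 ** hash_bits
    return math.sqrt(2.0 * buckets * math.log(1.0 / (1.0 - p)))


print(f"{collision_probability(10**8, 64):.6f}")
print(f"{collision_probability(10**8, 128):.3e}")
print(f"{files_for_probability(0.5, 128):.3e}")
```

For a 64-bit hash and 100 million files this gives roughly 0.03%, in line with the figure above. For 128 bits the 50% point comes out on the order of 10^19 files, so a loop that counts up one file at a time is not going to get there overnight.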
Re:Hashes should be relatively easy by TsuruchiBrian · 2014-01-23 14:49 · Score: 1

I should also point out that the convention for UUIDs (universally unique identifiers) is also 128bits. Meaning that the chances of randomly getting the same 128 bit number is so low that experts have determined it's ok to just assume it never happens for purposes of computing.
http://en.wikipedia.org/wiki/Universally_unique_identifier
BTW I am at 500 million files, and the odds of getting a 128bit md5 collision are still 0.000000
Re:Hashes should be relatively easy by TsuruchiBrian · 2014-01-23 14:52 · Score: 1

I should also point out that when I said "Assuming you're not trying to create collisions...", I was referring to the fact that md5 has been compromised. My point is that 128 bits is enough bits to ensure you will not get a random collision due to chance if you are using a good hashing algorithm.
Re:Hashes should be relatively easy by wonkey_monkey · 2014-01-23 20:28 · Score: 1

And the probability of a collision and the files being the same size (the first thing to check when looking for dupes) is even smaller.
And then you could pick a random section from both files and run an md5sum on that, squaring your probability of a collision. Probably. I'm just guessing.

--
systemd is Roko's Basilisk.
Re:Hashes should be relatively easy by DMUTPeregrine · 2014-01-24 02:44 · Score: 1

An "easy" way to find a guaranteed collision is to simply create more files than 2^(bits in hash). So a bit over 3.4 x 10^38 files for MD5 and you'll get collisions on all subsequent files.
This should be obvious, but just in case:
If you can ask an oracle for a file with a hash not in a list of hashes, then you can keep adding the new files to the list. An n-bit hash can have 2^n unique values, so after 2^n files created no new value can possibly be added to the list.

--
Not a sentence!
Re:Hashes should be relatively easy by TsuruchiBrian · 2014-01-24 04:48 · Score: 1

I wouldn't call creating 2^128 files easy. Also, you are likely to get collisions way before you get close to reaching 2^128 files.
Re:Hashes should be relatively easy by TsuruchiBrian · 2014-01-24 04:49 · Score: 1

OK so it's morning and my program has calculated the probability of getting a collision with 50 billion files at 0.000000
Re:Hashes should be relatively easy by TsuruchiBrian · 2014-01-24 04:55 · Score: 1

I actually wrote a program that looks for duplicate files based on md5hash, and I did check "size collisions" before actually computing md5hashes (which are pretty CPU intensive for large files).
But even if all the files were the same size, you need a lot of files before you should expect to run into collisions.
Re:Hashes should be relatively easy by HiThere · 2014-01-24 07:06 · Score: 1

I believe that's the chance of any particular pair being in collision. The chance of some pair being in collision would be appreciably larger. And you left out the number of bins in the hash. Even if the raw md5sum would be different, when you change it into a bin number it will be quite a bit smaller...though this can be handled by chaining, etc.
But, yes, it is critically dependent on the number of files to be examined. If he's managing a large library of images, and they are valuable, then he might want to avoid this approach. If he's managing his own photos, there's probably no problem. However, unless I'm misunderstanding Exif documentation (likely) that would allow him to properly compare images at different resolutions and in different formats, where md5sum wouldn't.

--

I think we've pushed this "anyone can grow up to be president" thing too far.
Re:Hashes should be relatively easy by TsuruchiBrian · 2014-01-24 07:25 · Score: 1

I believe that's the chance of any particular pair being in collision. The chance of some pair being in collision would be appreciably larger.
The chance of a particular pair colliding is 1/(2^128). The formula I provided is the probability for any pair colliding.

And you left out the number of bins in the hash. Even if the raw md5sum would be different, when you change it into a bin number it will be quite a bit smaller...though this can be handled by chaining, etc.
And this wouldn't count as a collision, in the sense that you wouldn't mistakenly assume two files were equal when they were actually different. That would presumably only happen if the md5 hashes collided.

But, yes, it is critically dependent on the number of files to be examined. If he's managing a large library of images, and they are valuable, then he might want to avoid this approach.
How large is a large library of files? I don't know how far you followed this thread, but I actually made a small python script to calculate how many files you'd need to cause a certain likelihood of a collision, and it's been running for about 18 hours so far, and at 640 billion files the chances of a collision are still 0.000000.
Re:Hashes should be relatively easy by DMUTPeregrine · 2014-01-24 07:37 · Score: 1

The "easy" bit is not about creating the files, but about finding an upper bound for n to get a collision probability of 1. Sorry for phrasing that poorly.

--
Not a sentence!
Re:Hashes should be relatively easy by TsuruchiBrian · 2014-01-24 08:36 · Score: 1

The upper bound is 2^128. I think I originally said p goes to 1 as n goes to infinity, but actually p reaches exactly 1 as soon as n exceeds 2^128 (pigeonhole principle).

ZFS filesystem with dedup by Anonymous Coward · 2014-01-23 10:37 · Score: 0

I'd have put them all on FreeBSD ZFS filesystems and enabled dedup........ whalah job completed ... :P

Re:ZFS filesystem with dedup by mlts · 2014-01-23 10:56 · Score: 1

One can use NTFS and turn on deduplication, then manually fire off the background "optimization" task. It isn't a "presto!", but after a good long while, it will find and merge duplicate files, or duplicate blocks of different files.
Caveat: This is only Windows 8 and newer, or Windows Server 2012 and newer.
Re:ZFS filesystem with dedup by DiSKiLLeR · 2014-01-23 11:29 · Score: 1

whalah is not a word.... seriously. wtf people. It's voilÃ.
As for ZFS, sure, I recommend ZFS. But I'm not sure how i feel about ZFS's dedupe. Besides, the multiple files are still there even if it no longer takes up extra space.
You'd want a script that finds dupes by hash, but that will only detect images that are identical copies, not 'similar' ones, say where an image has been cropped, retouched, or resized. A program that can find image dupes even with changes, like tineye.com, would be ideal. Does anything like that exist?

--
You can tell how powerful someone is by the magnitude of the crime they can commit and be able to get away with.
Re:ZFS filesystem with dedup by Anonymous Coward · 2014-01-23 14:16 · Score: 0

stay on topic. kvetching about someone's spelling or grammar when you don't know whether they're working in their first, second or third language, and you understand what they are trying to say despite these difficulties, only makes you look the fool.
if the questioner is working in Linux, a program called findimagedupes may come in handy. you can recurse through directories and set the amount of similarity to use for the comparisons.
most file deduplicators only look for files that are bitwise identical while findimagedupes compares the actual images.
Re:ZFS filesystem with dedup by Anonymous Coward · 2014-01-23 15:39 · Score: 0

Except "whalah" has a 100% chance of coming from an ignorant 'murrican. We shouldn't condone ignorance and stupidity. It should be mocked at full force.
There's too much stupidity already because we "tolerate" everything these days. Faux political correctness is the vehicle that the stupid and ignorant ride on. The feelings of idiots are regarded higher than a good education and high intelligence.
Fuck this bullshit.
Re:ZFS filesystem with dedup by wonkey_monkey · 2014-01-23 20:30 · Score: 1

seriously. wtf people. It's voilÃ.
Well, you tried.
Quoted for funniness.

--
systemd is Roko's Basilisk.
Re:ZFS filesystem with dedup by Anonymous Coward · 2014-01-23 21:46 · Score: 0

Mod this foul-mouthed, literate son of a bitch up.
Re:ZFS filesystem with dedup by Zontar+The+Mindless · 2014-01-23 21:49 · Score: 2

Et voilà! L'UTF, c'est votre ami.

--
Il n'y a pas de Planet B.
Re:ZFS filesystem with dedup by wonkey_monkey · 2014-01-24 00:25 · Score: 1

Zut alors! Mais il n'est pas UTF, maintenant:

Et voilà! L'UTF, c'est votre ami.

--
systemd is Roko's Basilisk.
Re:ZFS filesystem with dedup by ericloewe · 2014-01-24 02:36 · Score: 1

It's far from ideal. You do get (most of) the storage benefits, but it doesn't help with organisation.
Filesystem-level deduplication is meant to save space from blocks that several files use (several full image backups will undoubtedly share a large portion of files that belong to the OS and common applications, for instance).
Re:ZFS filesystem with dedup by Anonymous Coward · 2014-01-24 14:25 · Score: 0

Sacrebleu!

fdupes -rd by Anonymous Coward · 2014-01-23 10:37 · Score: 5, Informative

I've had the same problem as I stupidly try to make the world a better place by renaming or putting them in sub-directories.

fdupes will do a bit-wise comparison. -r = recurse. -d = delete.

fdupes would be the fastest way.

Re:fdupes -rd by Xolotl · 2014-01-23 12:34 · Score: 1

fdupes is excellent and I second that (please mod the parent up!)
The only drawback to fdupes is that the files must be identical, so two identical images but where one has some additional metadata e.g. inside the EXIF won't be deduplicated.
Re:fdupes -rd by Anonymous Coward · 2014-01-23 13:23 · Score: 0

That's not a drawback. That's a feature. (Hint: it's even what OP requested.)
Re:fdupes -rd by Eunuchswear · 2014-01-23 21:55 · Score: 1

It's bloody slow though.
I ended up writing my own; it's pretty easy to do.

--
Watch this Heartland Institute video
Re:fdupes -rd by Kirth · 2014-01-27 02:06 · Score: 1

yeah, so what? As long as the source isn't out there, it's useless (except for you).
By the way, I also wrote something that can be used to weed out duplicates; it's a file indexer (fileindex.pl) which will allow you to index one set of files, and then check another set for duplicates (or add to the index). It's useful if you have a set of files (which you may want to process in the future) and new files are coming in, so you can check the new files against the already existing ones (or, if you indexed the old ones also before you've changed them, against old versions of already existing files).
And if the EXIF info is good, and the filenames are bad, I also wrote a script to rename files according to the EXIF information. Same for EPUB. http://seegras.discordia.ch/Pr...

--
"The more prohibitions there are, The poorer the people will be" -- Lao Tse
Re:fdupes -rd by Eunuchswear · 2014-01-27 03:22 · Score: 1

yeah, so what? As long as the source isn't out there, it's useless (except for you).
You're right, I've been meaning to get around to packaging and releasing it, with a comparison with fdupes.
My bad. Will do that next weekend if I can.

--
Watch this Heartland Institute video

You don't need software for this by Anonymous Coward · 2014-01-23 10:37 · Score: 0

Just script something that grabs a list of image files from the filesystem, runs an MD5 hash on all of them, locates any duplicate MD5s, then outputs a list of files to delete later. Now if you're talking about a somewhat more sophisticated duplicate detection (such as, say, detecting images that are the same picture but are not in the same size or format) you're getting into the "someone will pay you money for this" territory.

Re:You don't need software for this by Anonymous Coward · 2014-01-23 10:42 · Score: 1

This is what I'd do, but I doubt the submitter is a Bourne shell wizard.
Shell scripts ARE still software by the way.
Re:You don't need software for this by Anonymous Coward · 2014-01-23 10:43 · Score: 0

Yes, I had the same issue and did this. Then you just manually compare the output of the list in a web browser. Cheap and easy. Filenames would also be a big clue if they are dups, so maybe sort by MD5 hash, filename, then full path filename.
Re:You don't need software for this by unrtst · 2014-01-23 11:15 · Score: 3, Informative

Adjust as needed:
find ./ -type f -iname '*.jpg' -exec md5sum {} + > image_md5.txt
cut -d" " -f1 image_md5.txt | sort | uniq -d | while read -r md5; do grep "^$md5" image_md5.txt; done
...though I think something more sophisticated than an md5sum would be wise (exif data could have been changed but nothing else, and you'd miss that dupe).
Re:You don't need software for this by fisted · 2014-01-23 15:42 · Score: 1

pff, two commands. amateur...

--
CLI paste? paste.pr0.tips!
Re:You don't need software for this by TCM · 2014-01-23 15:43 · Score: 1

How about only hashing files with identical file sizes?

--
Of course it runs NetBSD. BTC: 1NT7QvbetmANwaMzhpVL6

Perhaps you might ask why first by Anonymous Coward · 2014-01-23 10:38 · Score: 1

It is important here to know why you want to remove duplicate images. Is it just so you can have one large photo album without seeing the same picture twice? If that is true then you could sync all images on all machines onto one large drive, sort the files by size and manually delete the duplicates as they would all bunch together.

If you are trying to save disk space, then using a file system like ZFS can automatically remove duplicate data and add compression.

Also consider that if you do not know where your duplicate files are then any duplicates in existence are, effectively, acting as backups for your disorganized collection. Erasing duplicates until you find a way to cleanly backup your data may be a mistake in the long run.

ZFS dedup by brambus · 2014-01-23 10:38 · Score: 0

# zfs set dedup=on mypool/photos

Make sure you have enough RAM though (1GB of RAM per TB of unique data) and/or an SSD for L2ARC to make sure it doesn't grind to a halt.

Re:ZFS dedup by Anonymous Coward · 2014-01-23 10:53 · Score: 3, Informative

Have you read the zfs documentation? Setting zfs dedup does not remove duplicate files (per OP request, since there are eleven different file systems), but removes redundant storage for files which are duplicates. In other words, if you have the exact same picture in three different folders/subdirectories on the same file system, zfs will only allocate storage for one copy of the data, and point the three file entries to that one copy. Similar to how hard links work in ext2 and friends.
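To make the hard-link comparison concrete: on an ordinary filesystem, you can get a similar space saving by hand by replacing a verified duplicate with a hard link to the copy you keep. The sketch below is an assumption-laden illustration (the function name and the `.dedup-tmp` suffix are invented here), it only works when both paths are on the same filesystem, and it does not check that the two files are identical; the caller has to do that first.

```python
import os


def replace_with_hard_link(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep` (same filesystem only)."""
    temp = duplicate + ".dedup-tmp"  # hypothetical temporary name
    os.link(keep, temp)          # create the new link first
    os.replace(temp, duplicate)  # then swap it into place over the duplicate
    # Both names now refer to the same inode, so the data is stored once.
    return os.stat(keep).st_ino == os.stat(duplicate).st_ino
```

Both directory entries remain, which is the point the post makes: the storage is shared, but the duplicate names are still there to clutter the album.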
Re:ZFS dedup by Anonymous Coward · 2014-01-23 10:54 · Score: 0

# zfs set dedup=on mypool/photos
Make sure you have enough RAM though (1GB of RAM per TB of unique data) and/or an SSD for L2ARC to make sure it doesn't grind to a halt.
I'm really interested in building a 7 or 8TB RAIDZ2 over the next year or so, and I really, really want to use dedup because it's cool and stuff. But the RAM requirements are painful - the data I intend to store will make heavy use of dedup and could easily exceed 20TB. I've looked around and never saw a mention of using an SSD to supplement this - do you have any good links on that?
Re:ZFS dedup by Anonymous Coward · 2014-01-23 11:00 · Score: 0

That requires the files all be exactly identical, bit-for-bit, and thus also requires looking at every single bit of every file. If you have different versions with different changes (one rotated, another resized), it wouldn't find them.
If you just extract the EXIF data, you can process any file in milliseconds no matter how big, and then create a database to easily tell you which files were copies of the same one.
dom
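A rough sketch of the EXIF-database idea in Python, assuming the metadata has already been dumped with something like `exiftool -json -CreateDate -Model -ImageSize -r DIR > meta.json`. The choice of tags in `KEY_TAGS` and the function name are assumptions for illustration; which fields are reliable depends on the cameras and editors involved.

```python
import json
from collections import defaultdict

# Hypothetical choice of tags; pick whichever fields your cameras fill in reliably.
KEY_TAGS = ("CreateDate", "Model", "ImageSize")


def group_by_exif(exiftool_json):
    """Group file paths whose chosen EXIF tags are all present and equal."""
    groups = defaultdict(list)
    for record in json.loads(exiftool_json):
        key = tuple(record.get(tag) for tag in KEY_TAGS)
        if None in key:
            continue  # missing metadata: cannot be matched this way
        groups[key].append(record["SourceFile"])
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Matching metadata only makes two files candidates. Burst shots can share a timestamp and edited copies can inherit the original's EXIF, so it seems wise to confirm with cmp or a checksum before deleting anything.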
Re:ZFS dedup by Anonymous Coward · 2014-01-23 12:16 · Score: 0

For 7-8TB, you might be better off just buying enough RAM. 16GB is ~ $200. Depending on your workload, that might be better overall than using an SSD for L2ARC.
Re:ZFS dedup by ihtoit · 2014-01-23 19:39 · Score: 1

I don't know what the price of RAM is doing these days, but I did buy a 4GB upgrade for my laptop last September, cost £19 for the module. ...oh here we go: 8GB Integral PC3-12800 desktop is going for £55 at PC World Retail. 32GB bankfiller would hit £220, you could beat that with a little shopping around I'm sure.
Laptop SODIMM: same price.
Seems a bit high to me...

--
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
Re:ZFS dedup by Anonymous Coward · 2014-01-24 09:38 · Score: 0

Possibly they were, but I don't think they were suggesting using an SSD to supplement RAM. Being significantly faster than spinning rust an SSD would allow L2ARC to work more quickly (CPU permitting).

Write a quick script. by khasim · 2014-01-23 10:40 · Score: 4, Informative

If they are identical then their hashes should be identical.

So write a script that generates hashes for each of them and checks for duplicate hashes.

Re:Write a quick script. by xpucadad · 2014-01-23 12:14 · Score: 1

I've done this in Perl. It was easy. But the files have to be identical - even if the pictures are the same size and look identical, if the file contents aren't exactly the same they won't match.
Re:Write a quick script. by Anonymous Coward · 2014-01-23 12:38 · Score: 0

I've done this in Perl. It was easy. But the files have to be identical - even if the pictures are the same size and look identical, if the file contents aren't exactly the same they won't match.
That's the goal though. If just one of many "close enough" copies is OK, do a 2:1/4:1/etc. reduce on each and hash the reduced set. This is classic CS: CPU/memory/IO - optimize for what you want and spend resources on the others...
Re:Write a quick script. by Em+Adespoton · 2014-01-23 12:53 · Score: 1

What you really need is something that sorts on exif fields, and also generates a normalized histogram xsum for the image itself.
More intensive, but not too bad these days. Imagemagick combined with exiftool, xargs, sort and sed should get you what you want.
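The reduce-then-hash idea from this subthread can be sketched as a simple "average hash" in plain Python. This is not the histogram checksum mentioned above and not what any particular tool does; it is a swapped-in, simpler technique to show the principle. It assumes the image has already been decoded into a 2-D list of grayscale values at least 8x8 in size; decoding a real JPEG needs an imaging library, which is left out here.

```python
def shrink(pixels, size=8):
    """Box-average a 2-D grid of grayscale values down to size x size cells."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    for cy in range(size):
        y0, y1 = cy * height // size, (cy + 1) * height // size
        for cx in range(size):
            x0, x1 = cx * width // size, (cx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    return cells


def average_hash(pixels, size=8):
    """Return a size*size-bit integer: bit set where a cell is brighter than the mean."""
    cells = shrink(pixels, size)
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (value > mean)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Two images whose hashes differ in only a few bits are likely resized or recompressed versions of the same picture. The threshold is a judgment call, and cropped or rotated copies will generally not match with this scheme.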

fslint by innocent_white_lamb · 2014-01-23 10:40 · Score: 3, Informative

fslint is a toolkit to find all redundant disk usage (duplicate files, for example). It includes a GUI as well as a command line interface.

http://www.pixelbeat.org/fslin...

--
If you're a zombie and you know it, bite your friend!

there are many duplicate file finders by Anonymous Coward · 2014-01-23 10:41 · Score: 0

There are many duplicate file finders, if the files are binary identical. (Search on Google for "find duplicate files" or "delete duplicate files".) However, if the files have been modified in any way, this becomes much more difficult, because similar files for music or photos have a degree of tolerance for errors and variation. Signed: the author of two of those programs.

Fuzzy Hashing by Oceanplexian · 2014-01-23 10:41 · Score: 2

I would try running all the files through ssdeep.

You could script it to find a certain % match that you're satisfied with. Only catch to this is that it could be a very time-intensive process to scan a huge number of files. Exif might be a faster option which could be cobbled together in Perl pretty quickly, but that wouldn't catch dupes that had their exif stripped or have slight differences due to post-processing.

Re:Fuzzy Hashing by stenvar · 2014-01-23 12:37 · Score: 1

That's useless for many kinds of compressed files, like images and audio.

Try "Uniquefiler" under WINE by Anonymous Coward · 2014-01-23 10:41 · Score: 0

It has been reported to work under WINE, but your mileage may vary.
Sorry - don't have any links.

fslint by Anonymous Coward · 2014-01-23 10:41 · Score: 1

I did just this, but by copying all of the pics from the various devices to a linux fileshare, and then ran: http://www.pixelbeat.org/fslint/ Nice software, did exactly what I wanted.

Few option by Anonymous Coward · 2014-01-23 10:41 · Score: 0

You could use Unison to merge them two at a time.

Another option is something like FSLint, which can detect duplicates.

I think I wrote one of these. by paradxum · 2014-01-23 10:42 · Score: 1

I'm pretty sure I wrote something like this in perl/bash in like 20 minutes.
1 - do an md5sum of each file and toss it in a file
2 - sort
3 - perl (or your language of choice) program, basically:
sum = "a"
newsum = next line
if newsum == sum delete file
else sum = newsum
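The three steps above can be sketched in Python roughly as follows. It assumes the input lines come from md5sum in its default text mode ("checksum, two spaces, path"), and it only reports the later copies instead of deleting them; the function name is made up for this example.

```python
def duplicates_from_md5sum(lines):
    """Yield paths whose checksum equals that of an earlier line after sorting."""
    previous_sum = None
    for line in sorted(lines):
        # md5sum text-mode output: "<checksum>  <path>" (two spaces assumed).
        checksum, _, path = line.rstrip("\n").partition("  ")
        if checksum == previous_sum:
            yield path  # candidate for deletion; the first copy is kept
        else:
            previous_sum = checksum
```

Feeding it the saved output of md5sum and reviewing the yielded paths before removing anything keeps the one-off job safe to rerun.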

Re:I think I wrote one of these. by Cummy · 2014-01-23 11:15 · Score: 3, Insightful

Why do people on this site believe that everyone who is interested in tech is a programmer? This "just write it" is foolishness of the highest order. For many of us non-programmers "just write it" is like telling someone living in Florida to "just build a plane and fly to that concert in Vienna after work tomorrow". If that seems like a ridiculous ask, then so is asking a person without the skill to write a script for that. So it can be done in 20 minutes; use that 20 minutes to help someone by writing the program and loading it to a repo. All the 20-second tutorials in the world will not get someone to write a program if they just don't have the skill set.
This is part of the reason Windows is successful: think of a problem, there is likely a program out there that solves it already, and if there isn't one someone will soon write one (Apple users just go and buy one). Linux will not get out of single-digit adoption until people with the skills write and edit programs for non-programmers like myself, because when stuff needs to get done fast Windows will have the program (and yes, it is easier to clean out the malware and fight the popups than it is to write the program).
Re:I think I wrote one of these. by VortexCortex · 2014-01-23 11:53 · Score: 2, Informative

This "just write it" is foolishness of the highest order. For many of us non-programmers, "just write it" is like telling someone living in Florida to "just build a plane and fly to that concert in Vienna after work tomorrow".
Computer literacy used to involve typing a terminal command. All the PC folks in the 80's and 90's did it. I can't be fucked to care if folks are too stupid to learn how to use their computers. If you can't "write it yourself" in this instance, which amounts to running an operation across a set of files, then sorting the result, then you do not know how to use a computer. You know how to use some applications and input devices. It's a big difference.

This is part of the reason Windows is successful: think of a problem, and there is likely a program out there that solves it already, and if there isn't one, someone will soon write one
Which is why it's a nightmare to administer Windows. MS had to create a fucking scripting terminal "powershell" because they ditched DOS and didn't expose OS features to a terminal... Now go press the Towel key to open Windows 8's start screen. Start typing... AT A NEUTERED TERMINAL... ugh. Sometimes, it's better to not have to wait for someone to create something for you, especially when it's something very easy to do. You would FIRE a secretary that could not sort a set of physical files by customer ID and remove duplicates, or add up totals with a calculator, etc. Your standard for computer "operator" is so low it's pitiable.
If you paid attention to the thread, you'd have noticed that nothing you said about Windows is exclusive to windows. Indeed, a Google search for any OS would have turned up solutions for it. Some would be a few lines of BASH or Perl, Powershell, BATCH scripts, etc. Some would be 'free' programs, some of those would have adware, some would have malware. At least the ones in the FLOSS repositories wouldn't.
The OS exposes your computer's features to you. If you do not know how to write a simple set of instructions for it to follow, then you do not know how to use a computer.
Re:I think I wrote one of these. by dowens81625 · 2014-01-23 11:53 · Score: 0

Why do people on this site believe that everyone who is interested in tech is a programmer? This "just write it" is foolishness of the highest order. For many of us non-programmers, "just write it" is like telling someone living in Florida to "just build a plane and fly to that concert in Vienna after work tomorrow". If that seems like a ridiculous ask, then so is asking a person without the skill to write a script for that. So it can be done in 20 minutes? Use that 20 minutes to help someone by writing the program and uploading it to a repo. All the 20-second tutorials in the world will not get someone to write a program if they just don't have the skill set.
This is part of the reason Windows is successful: think of a problem, and there is likely a program out there that solves it already, and if there isn't one, someone will soon write one (Apple users just go and buy one). Linux will not get out of single-digit adoption until people with the skills write and edit programs for non-programmers like myself, because when stuff needs to get done fast, Windows will have the program (and yes, it is easier to clean out the malware and fight the popups than it is to write the program).
To learn is to know one's own value.
To expect something is to not care about yourself or your worth.
Nothing worth doing is ever easy.
Re: I think I wrote one of these. by Anonymous Coward · 2014-01-23 11:57 · Score: 0

we think that because we assume that only someone exactly like us would presume to ask for free help. an idiot windows user would properly pick up the phone and pay one of us to do it for him. at least a cheapskate would download one of our free virus-embedded tools. THIS FUCKER, however, is too lazy even for google.
I got an idea, why don't you go over and personally sort out his thumbnailled porn collection
Re:I think I wrote one of these. by Anonymous Coward · 2014-01-23 13:51 · Score: 0

For that statement, I had to sit at the terminal in fishing waders, damn, the bs is deep. Remember windows was around then, dummy. It may have been called WFW or 3.0 then , but it was there. If not there were other graphical overlays for DOS back then, But, and then, When I was in college in 63, and we used other graphical systems, to program(the) cameras and the environment used for TV productions, using Test layouts from GMI, testing for the navy, later saw some of the same test equipment in the Air Force. and the largest chip was a transistor. Before OS's and MS didn't ditch DOS, they evolved/changed dos to the next step, an OS. Remember just like Linux starts with a command to go step by step thru an operation, a program, just as dos did, and does. I guess you didn't write in assembly, and have to interpret to machine code, have to write coms. But that was where I started having to do the math with a slideruler, modeling for physics class was hell then.
Re:I think I wrote one of these. by jedidiah · 2014-01-23 14:16 · Score: 1

> This is part of the reason Windows is successful: think of a problem, and there is likely a program out there that solves it already
No. Windows is successful because it's the followup to a product that already owned the market: MS-DOS.
Now you want to talk about nasty user hostile shit, MS-DOS has "script it yourself" Unix beat by a wide margin.

--
A Pirate and a Puritan look the same on a balance sheet.
Re:I think I wrote one of these. by kesuki · 2014-01-23 14:44 · Score: 1

"some of those would have adware, some would have malware. At least the ones in the FLOSS repositories wouldn't."
Repositories are a layer of security, yet malware repos are widely promoted on some websites offering so-called help with things like playing back movies, configuring firewalls, etc. Also, trusted repos are in fact compromised sometimes, like http://www.techrepublic.com/blog/linux-and-open-source/linux-repository-hit-by-malware-attack/2989/
I remember one site (no link, as I forgot where I found it) that was a guide to setting up a 'transparent' firewall; it was basically a guide to letting all traffic go in both directions, with no rules to block anything. As a human, I almost spit my soda out my nose at the so-called guide. There are people who are not smart enough to realize how bad that info was.

--
https://www.gnu.org/philosophy/free-sw.html
Re:I think I wrote one of these. by tftp · 2014-01-23 14:56 · Score: 1

Computer literacy used to involve typing a terminal command. All the PC folks in the 80's and 90's did it.
Yes, all the 0.07% of the population. The rest was in fear of the computer, for a good reason. Back then computers were not very useful unless you were a programmer, or your specific need was covered (MS Word, Excel, WP.)
If you can't "write it yourself" in this instance, which amounts to running an operation across a set of files, then sorting the result, then you do not know how to use a computer. You know how to use some applications and input devices. It's a big difference.
So what? Most people today who use computers on a daily basis cannot do any of the above. Does not mean anything. They can use a few applications, and that's all they need. They do not know that the computer can also calculate pi for them. They do not need that. Hell, I've been working with computers for many years, and I can't tell you off the top of my head how to program MS Word to open all .docx files that match a pattern and then replace one string inside with another. I'd have to study up on this scripting and automation thing that I never needed to do before. Does it mean that I don't know how to use computers? You just can't know everything.
Now go press the Towel key to open Windows 8's start screen. Start typing...
And observe how much backlash this decision caused - to the extent that many people refuse to buy Win8 boxes. People are just not that good at typing; but they are pretty good at finding icons on the desktop and clicking on them. Typing requires being able to type fast, and being able to remember what to type. None of that is a certainty.
The OS exposes your computer's features to you. If you do not know how to write a simple set of instructions for it to follow, then you do not know how to use a computer.
Again, it's just a matter of definitions. For one man, "how to use" means "being able to access Gmail in a browser." For another man, "how to use" means ability to program a new OS from scratch, using their own compiler.
Re:I think I wrote one of these. by Anonymous Coward · 2014-01-23 16:36 · Score: 0

To learn is to know one's own value.
To expect something is to not care about yourself or your worth.
Nothing worth doing is ever easy.
Worst. Haiku attempt. Ever
Re:I think I wrote one of these. by BetterThanCaesar · 2014-01-23 19:35 · Score: 1

Step one is to compare file sizes. Since file sizes need to be identical in order for the files to be identical, and file sizes are already calculated and stored as metadata, this will greatly reduce the time needed.
1. List all files with their respective sizes.
2. Sort
3. For each consecutive file in the list with the same size as the previous file, compare the MD5 hashes.
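The size-first idea in the list above can be sketched like this in Python. The function name duplicate_groups is invented for the example, and it reads each candidate file fully into memory for hashing, which is an assumption that only suits photo-sized files.

```python
import hashlib
import os
from collections import defaultdict


def duplicate_groups(root):
    """Group byte-identical files, hashing only files that share a size."""
    # Step 1: sizes come from filesystem metadata, no file content is read.
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        # Steps 2-3: only now compare MD5 hashes within one size class.
        by_hash = defaultdict(list)
        for path in paths:
            with open(path, "rb") as handle:
                by_hash[hashlib.md5(handle.read()).hexdigest()].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Each inner list is one set of identical files. Files with a size nobody else shares are never opened, which is where the time saving comes from.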
--
"Stop failing the Turing test!" -- Dilbert
Re:I think I wrote one of these. by turbidostato · 2014-01-24 00:18 · Score: 1

"Why do people on this site believe that everyone who is interested in tech is a programmer?"
A Bash one-liner or even a 100-line script doesn't make you a programmer.
On the other hand, if asked "how do I move this car from here to a town 100 miles away", the answer is "the cheapest and most efficient way is for you to drive it there", and whining "why do people on this site believe that I should learn to drive" is just that: whining.
Oh, and learning to drive will help you a lot of times, not only on this task, just as learning scripting basics will help you a lot of times, not only on this task.
You don't want to cope with the proper solution? Your problem, not mine.
Re:I think I wrote one of these. by DMUTPeregrine · 2014-01-24 02:50 · Score: 1

Image deduplication is a much harder problem than you (and many of the posters here) seem to think. It's certainly not terrifically hard, but it's not as simple as comparing file size and content hash.
What if the image was resized?
What if a watermark was added?
What if the image was saved in a different format, eg PNG and JPEG?
What if the image had its lighting curves adjusted?
etc.
You may still want to find these duplicates, but size/hash methods will fail.
The findimagedupes tool works well in most of these cases; most of the shell scripts proposed here won't.

--
Not a sentence!
Re:I think I wrote one of these. by Anonymous Coward · 2014-01-24 09:45 · Score: 0

Kudos to you for your staunch defence of linux. However, unless you know how to build a car from bare materials, fully compliant with all design rules and safety standards, you're just one of those pitiful people who know enough to turn the key, turn a wheel and push a couple of pedals: you'll never be a true driver.

Geeqie by zakkie · 2014-01-23 10:42 · Score: 4, Informative

Works excellently for this.

Re:Geeqie by subreality · 2014-01-23 16:48 · Score: 2

+1. The reason: it has a fuzzy-matching dedupe feature. It'll crawl all your images, then show them grouped by similarity and let you choose which ones to delete. It seems to do a pretty good job with recompressed or slightly cropped images.
Open it up, right click a directory, Find Duplicates Recursive.
fdupes is also good to weed out the bit-for-bit identical files first.

Don't reinvent the wheel: fdupes, md5deep, gqview by nctritech · 2014-01-23 10:43 · Score: 2

fdupes will work and is faster than writing a homemade script for the job. The big problem is "across multiple machines" which might require use of, say, sshfs to bring all the machines' data remotely onto one temporarily for duplicate scanning. fdupes checks sizes first, and only then starts trying to hash anything, so obvious non-duplicates don't get hashed at all. Significant time savings. Across multiple machines, another option is using md5deep to build recursive hash lists.

The only tool I've used so far for image duplicate finding that checks CONTENT rather than doing bitwise 1:1 duplicate checking is GQview on Linux. It works fairly well; though it's a bit dated by now, it's still a good viewer program. Add -D_FILE_OFFSET_BITS=64 to the CFLAGS if you compile it yourself on a 32-bit machine today, though.

Anti-Twin by MatthiasF · 2014-01-23 10:45 · Score: 1

Requires WINE but should work fine on Linux.

http://www.anti-twin.com/

findimagedupes in Debian by nemesisrocks · 2014-01-23 10:47 · Score: 5, Interesting

whatever you decide on, it could probably be done in a hundred lines of perl

Funny you mention perl.

There's a tool written in perl called "findimagedupes" in Debian. Pretty awesome tool for large image collections, because it could identify duplicates even if they had been resized, or messed with a little (e.g. adding logos, etc). Point it at a directory, and it'll find all the dupes for you.

Re:findimagedupes in Debian by msobkow · 2014-01-23 11:12 · Score: 3, Interesting

Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...
From what this user is talking about (multiple drives full of images), they may well have reached the point where it is impossible to sort out the dupes without one hell of a heavy hitting cluster to do the comparisons and sorting.

--
I do not fail; I succeed at finding out what does not work.
Re:findimagedupes in Debian by complete+loony · 2014-01-23 11:21 · Score: 2

What you want is a first pass which identifies some interesting points in the image, similar to Microsoft's Photosynth. Then you can compare this greatly simplified data for similar sets of points, allowing you to ignore the effects of scaling or cropping.
A straight hash won't identify similarities between images, and would be totally confused by compression artefacts.

--
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
Re:findimagedupes in Debian by nemesisrocks · 2014-01-23 11:30 · Score: 2, Informative

Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...
It's actually pretty nifty how findimagedupes works. It creates a 16x16 thumbnail of each image (it's a little more complicated than that -- read more on the manpage), and uses this as a fingerprint. Fingerprints are then compared using an algorithm that looks like O(n^2).
I doubt the difference between O(2^n) and O(n^2) would make a huge impact anyway: the biggest bottleneck is going to be disk read and seek time, not comparing fingerprints. It's akin to running compression on a filesystem: read speed is an order of magnitude slower than the compression.
Re:findimagedupes in Debian by Anonymous Coward · 2014-01-23 11:31 · Score: 0

Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...
Exponential time would require quite a bit of creativity. The simplest algorithm, two nested loops and checking for equality, performs only a quadratic number of comparisons. A simple sorting variant would require O(n log n) comparisons to sort the data, then O(n) to filter out the equal ones. Hash functions can be used to reduce the constant factor.
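To make the two options above concrete, here is a small Python sketch of both. The function names are made up, and the items are just strings standing in for hashes. It only covers exact equality; fingerprints that are matched by a similarity threshold do not fall into neat sorted runs like this.

```python
from itertools import groupby


def duplicates_quadratic(items):
    """Two nested loops: about n*(n-1)/2 equality checks."""
    found = set()
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                found.add(items[i])
    return sorted(found)


def duplicates_sorted(items):
    """Sort in O(n log n), then one linear pass over runs of equal keys."""
    return [key for key, run in groupby(sorted(items)) if len(list(run)) > 1]
```

Both return the same list of values that occur more than once; only the number of comparisons differs.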
Re:findimagedupes in Debian by safetyinnumbers · 2014-01-23 11:36 · Score: 1

I've used findimagedupes. IIRC, it rescales each image to a standard size (64x64 or something) then filters and normalizes it down to a 1-bit-depth image.

It then builds a database of these 'hashes'/'signatures' and can output a list of files that have a threshold of bits in common.

That's how it can ignore small changes, it loses most detail and then can ignore a threshold of differences.
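A rough Python sketch of that kind of signature, to show why small changes get ignored. This is not findimagedupes' actual algorithm: the 8x8 grid, the brighter-than-average rule, and the threshold of 6 bits are all assumptions picked for the example, and the input is a plain grid of grayscale values so no imaging library is needed.

```python
def fingerprint(pixels, size=8):
    """Reduce a grayscale image (rows of 0-255 ints) to size*size bits.

    The image is shrunk by averaging blocks, then each cell becomes 1 if
    it is brighter than the mean of all cells, else 0. The image must be
    at least size x size pixels.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(size):
        top, bottom = row * height // size, (row + 1) * height // size
        for col in range(size):
            left, right = col * width // size, (col + 1) * width // size
            block = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if cell > mean else 0 for cell in cells]


def differing_bits(a, b):
    """Count the positions where two signatures disagree."""
    return sum(x != y for x, y in zip(a, b))


def looks_similar(a, b, max_differing_bits=6):
    """Treat two signatures as the same picture if few bits differ."""
    return differing_bits(a, b) <= max_differing_bits
```

Resizing or brightening a picture barely moves the signature, while a different picture flips many bits. A crop or rotation shifts the content between cells, so it would still be missed.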

It would fail if an image was cropped or rotated, for instance. It could handle picture orientation if it was modified to store 4 versions of the signature, I guess.

It won't actually remove images itself (I wrote a script to read its output and delete listed images matching a specific path).

I needed it because Dropbox was 'fixing' orientation when it uploaded images and I wanted to clear out ones I'd backed up directly from the camera. (I usually delete duplicate images based on hash.)
Re:findimagedupes in Debian by sexconker · 2014-01-23 12:27 · Score: 2

Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...
It's actually pretty nifty how findimagedupes works. It creates a 16x16 thumbnail of each image (it's a little more complicated than that -- read more on the manpage), and uses this as a fingerprint. Fingerprints are then compared using an algorithm that looks like O(n^2).
I doubt the difference between O(2^n) and O(n^2) would make a huge impact anyway: the biggest bottleneck is going to be disk read and seek time, not comparing fingerprints. It's akin to running compression on a filesystem: read speed is an order of magnitude slower than the compression.
O(n^2) vs O(2^n) is a huge difference even for very small datasets (hundreds of pictures).
You have to read all the images and generate the hashes, but that's Theta(n).
Comparing one hash to every other hash is Theta(n^2).
If the hashes are small enough to all live in memory (or enough of them that you can intelligently juggle your comparisons without having to wait on the disk too much), then you'll be fine for tens of thousands of pictures.
But photographers can take thousands of pictures per shoot, hundreds of thousands in a year, and have millions of photos to dedupe.
When you're at that level, comparisons have to be 6 orders of magnitude faster than your disk read to avoid being the bottleneck. With large hard drives shitting out 60-120 MBps (we'll ignore SSDs because they can't hold that many photos, and we'll ignore RAID just because), that's not going to be the case.
Re:findimagedupes in Debian by TsuruchiBrian · 2014-01-23 14:11 · Score: 1

O(n^2) vs O(2^n) is a huge difference even for very small datasets (hundreds of pictures).
Hopefully it's actually something like O(p * 2^n) vs O(p * n^2) where n is the thumbnail size and p is the number of images.
Re:findimagedupes in Debian by a_claudiu · 2014-01-23 21:28 · Score: 1

Fingerprints are then compared using an algorithm that looks like O(n^2)
Why O(n^2) when a sorting algorithm can go to O(n log n)?
Re:findimagedupes in Debian by Anonymous Coward · 2014-01-23 22:20 · Score: 0

No, comparing one hash to every other hash is O(n*log(n)) - generate all hashes, sort them, and do a linear pass to sift out duplicates.
Re:findimagedupes in Debian by buchner.johannes · 2014-01-23 23:05 · Score: 1

The real answer is to make a hash over the image content. The ImageHash python package comes with a program to discover duplicate images. It is more powerful than what is needed here: It can find images that look similar (different format, resolution, etc.).
I think the ImageHash package uses a better algorithm than findimagedupes (description here, actually you can choose between several), and is shorter in code.

--
NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.

fdupes by ender8282 · 2014-01-23 10:50 · Score: 2

Under *buntu
sudo apt-get install fdupes
man fdupes:
fdupes - finds duplicate files in a given set of directories

Photo managers by MrEricSir · 2014-01-23 10:53 · Score: 2

As a former Shotwell dev I might point out that most photo manager apps can do this.

--
There's no -1 for "I don't get it."

Re:Photo managers by ihtoit · 2014-01-23 19:55 · Score: 1

oh? Like, say, Irfanview? Not that I've ever had the urge to go looking...

--
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
Re:Photo managers by Anonymous Coward · 2014-01-24 05:25 · Score: 0

oh? Like, say, Irfanview? Not that I've ever had the urge to go looking...
Obviously, if you think IrfanView is a photo manager. And by the way, that's a windoze only app. It works great for what it is, an image viewer and batch processor, but it will not find dupes and is by no means any sort of photo manager, nor does it even resemble anything like a photo manager.
Re:Photo managers by ihtoit · 2014-01-25 13:17 · Score: 1

maybe you'd like to tell Softonic that their software isn't what it says on the tin, and further inform the developer of StudioLine that his photo manager plugin for irfanview isn't as described, either.
Fool.

--
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel

Seriously? by DigitAl56K · 2014-01-23 10:53 · Score: 0

Are we seriously discussing how to dedupe files based on a hash here?

News for nerds, stuff that matters, questions that belong in a forum where people answer things you couldn't be bothered to Google.

Re:Seriously? by postbigbang · 2014-01-23 10:56 · Score: 3, Interesting

Yeah. Thanks. It's a simple question. So far, I've seen scripting suggestions, which might be useful. I'm a nerd, but not wanting to do much code because I'm really rusty at it. Instead, I'm amazed that no one runs into this problem and has built an app that does this. That's all I'm looking for: consolidation.

--
---- Teach Peace. It's Cheaper Than War.
Re:Seriously? by Anonymous Coward · 2014-01-23 10:58 · Score: 0

I for one, would welcome a website that combines moderation features of stackoverflow and slashdot.
Re:Seriously? by zakkie · 2014-01-23 11:14 · Score: 5, Informative

See my earlier contribution: geeqie. It will even scan for image similarity, not just rudimentary hashing. Someone else mentioned gqview & that it was out of date - geeqie is what gqview became.
Re:Seriously? by Cley+Faye · 2014-01-23 11:49 · Score: 2

When you're talking about duplicate content, you can't limit yourself to "just hashes".
In this case, with pictures, just opening one and saving it again might produce a different hash, just by recompression or changing the file format. How do all these "just check the hashes" solutions work for that?
Finding duplicate images is not that easy.
Re:Seriously? by bluefoxlucid · 2014-01-24 03:08 · Score: 1

There have been interesting responses. Tools that find substantially-similar (read: the same image lossy encoded, resized, and rotated) images, produce hashes that can be compared to find out "How similar" two images are, and so on.

--
Support my political activism on Patreon.

Consider git-annex by dondelelcaro · 2014-01-23 10:57 · Score: 1

In addition to the other methods (ZFS, fdupes, etc), I personally use git-annex.

Git annex can even run on android, so I keep at least two copies of my photos spread throughout all of my computers and removable devices.

--
http://www.donarmstrong.com

DigicaMerge by jalet · 2014-01-23 10:58 · Score: 1

See http://www.librelogiciel.com/s...

I haven't modified or used it in years (I don't own a digital camera anymore...) so I don't know whether it still works with up-to-date libraries, but its "--nodupes" option does what you want, and its numerous other command line options (http://www.librelogiciel.com/software/DigicaMerge/commandline) help you solve the main problems of managing directories full of pictures.

It's Free Software, licensed under the GNU GPL of the Free Software Foundation.

Hoping this helps

--
Votez ecolo : Chiez dans l'urne !

If not, you can by Anonymous Coward · 2014-01-23 10:58 · Score: 0

You can even compile your home-grown photo-deduplicator into your custom kernel if you want to.

I would... by Anonymous Coward · 2014-01-23 10:59 · Score: 0

Get your co-workers @ the nsa to do their own work

General case by xaxa · 2014-01-23 11:00 · Score: 5, Informative

For the general case (any file), I've used this script:

#!/bin/sh

OUTF=rem-duplicates.sh;

echo "#! /bin/sh" > $OUTF;

find "$@" -type f -print0 | xargs -0 -n1 md5sum | sort --key=1,32 | uniq -w 32 -d --all-repeated=separate | sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;

chmod a+x $OUTF; ls -l $OUTF

It should be straightforward to change "md5sum" to some other key -- e.g. EXIF Date + some other EXIF fields.

(Also, isn't this really a question for superuser.com or similar?)

Re:General case by Forever+Wondering · 2014-01-23 11:31 · Score: 3, Informative

(Also, isn't this really a question for superuser.com or similar?)
Possibly ;-)
http://superuser.com/questions...

--
Like a good neighbor, fsck is there ...
Re:General case by Anonymous Coward · 2014-01-23 11:45 · Score: 0

also on reddit...
Re:General case by Anonymous Coward · 2014-01-23 11:50 · Score: 0

now the question is, how do we adapt that to de-dup stories? I've seen this question everywhere but phoronix by now.
Re:General case by Anonymous Coward · 2014-01-23 12:37 · Score: 0

Same thing, but faster...
You md5sum every file, but you only need to do it for files with the same size:
#!/bin/bash
find "$@" -type f -not -empty -printf "%-32i%-32s%p\n" \
| sort -n -r \
| uniq -w32 \
| cut -b33- \
| sort -n -r \
| uniq -D -w32 \
| cut -b33- \
| xargs -d"\n" -l1 md5sum \
| sort \
| uniq --all-repeated=separate -w32 \
| cut -b35-
(you are correct)
Re:General case by Yakasha · 2014-01-23 13:21 · Score: 3, Funny

(Also, isn't this really a question for superuser.com or similar?)
Possibly ;-) http://superuser.com/questions...
So adapt the script to de-dupe stories?
But then if we did that... what would we read on /.?
Re:General case by ihtoit · 2014-01-23 19:25 · Score: 1

oh, there'll be plenty of spamvertisements for penis extensions buried in the sea of rejected submissions somewhere...

--
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel
Re: General case by EvandroJúnior · 2014-01-23 20:14 · Score: 1

This program does all that and more with a graphic interface: https://github.com/evandrojr/F... It is open source, works for Windows, and should work for Linux with minor changes. It uses Mono with C#. Hope it helps.

Use DIM and a deduplicator by Anonymous Coward · 2014-01-23 11:00 · Score: 0

Hi. I've been through the same problem. In my case, I had to deduplicate 80k photos. The reason why most suggestions in this thread won't work is because generic solutions don't take advantage of the extra information photos contain. In my case, around 90% of the photos had good EXIF information, but in itself, that is not enough.

I used DIM to classify photos into year / month / day structure, and later I used a photo deduplicator on each day's sub folder.
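The point of the day folders is that duplicates only need to be looked for inside each folder. A small Python sketch of that bucketing, with everything in it hypothetical: the function names are invented, and the default date source is the file's modification time, which is only a stand-in for the EXIF capture date that DIM would use.

```python
import datetime
import os
from collections import defaultdict


def bucket_by_day(paths, taken_on=None):
    """Map (year, month, day) -> paths, like a year/month/day folder tree.

    taken_on is a hypothetical hook returning a datetime.date for a path;
    the default falls back to the file's modification time, which is only
    a stand-in for the EXIF capture date.
    """
    if taken_on is None:
        def taken_on(path):
            return datetime.date.fromtimestamp(os.path.getmtime(path))
    buckets = defaultdict(list)
    for path in paths:
        day = taken_on(path)
        buckets[(day.year, day.month, day.day)].append(path)
    return dict(buckets)


def pairs_to_compare(buckets):
    """Count pairwise comparisons if each bucket is deduplicated alone."""
    return sum(len(p) * (len(p) - 1) // 2 for p in buckets.values())
```

With 80k photos in one pile, a pairwise tool has roughly 3.2 billion pairs to consider; spread over day buckets, pairs_to_compare shows how much smaller the job gets.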

Additionally, there was extra manual work for those photos not resolved in this way, but definitely was way better than comparing 80k with 80k.

rsync, noob. by Narcocide · 2014-01-23 11:01 · Score: 0

rtfm.

FSLint by Anonymous Coward · 2014-01-23 11:02 · Score: 0

If you're not big into scripting there's a program on the Ubuntu Software Center called FSLint that does exactly what you're looking for. You can have it match on filenames, filesize, hashes, etc. It's just a generic file deduplicator, not optimized for images or anything.

What's the problem? by Anonymous Coward · 2014-01-23 11:04 · Score: 0

What's the problem? Just cp -u the "$file" to /newhd/by_md5/$(md5sum "$file" | cut -d" " -f1).${file##*.}
(...and store the original file name in EXIF, create another hardlink to the md5 filename, or use whatever way you prefer to locate your stuff)
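The same idea as a Python sketch, for anyone who would rather not do it in the shell. The name store_by_md5 is invented, lowercasing the extension is my own assumption (so photo.JPG and photo.jpg land on one name), and the whole file is read into memory to hash it.

```python
import hashlib
import os
import shutil


def store_by_md5(source, store_dir):
    """Copy source to store_dir/<md5>.<ext> unless that name already exists.

    Returns the destination path. Identical files map to the same name,
    so the store ends up holding one copy of each.
    """
    with open(source, "rb") as handle:
        digest = hashlib.md5(handle.read()).hexdigest()
    extension = os.path.splitext(source)[1].lower()
    target = os.path.join(store_dir, digest + extension)
    os.makedirs(store_dir, exist_ok=True)
    if not os.path.exists(target):
        shutil.copy2(source, target)  # copy2 keeps the timestamps
    return target
```

Run it over every file on every old drive with the same store_dir and the duplicates collapse by themselves. The original file names are lost unless you record them somewhere, as noted above.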

git-annex by Anonymous Coward · 2014-01-23 11:04 · Score: 0

Create a git-annex repository on each file system and set them up with at least one common remote. Then add all of the photos on each file system into the git-annex repository (git annex add *.jpg), sync it with the common remote (git annex sync yourremote), and move all content to the remote (git annex move -t yourremote .).

Re:Don't reinvent the wheel: fdupes, md5deep, gqvi by rwa2 · 2014-01-23 11:04 · Score: 5, Informative

Yeah, this Ask Slashdot should really be about teaching people how to search for packages in aptitude or whatever your package manager is...
Here are some others:

findimagedupes
Finds visually similar or duplicate images
findimagedupes is a commandline utility which performs a rough "visual diff" to
two images. This allows you to compare two images or a whole tree of images and
determine if any are similar or identical. On common image types,
findimagedupes seems to be around 98% accurate.
Homepage: http://www.jhnc.org/findimaged...

fslint :

kleansweep :
File cleaner for KDE
KleanSweep allows you to reclaim disk space by finding unneeded files. It can
search for files basing on several criterias; you can seek for:
* empty files
* empty directories
* backup files
* broken symbolic links
* broken executables (executables with missing libraries)
* dead menu entries (.desktop files pointing to non-existing executables)
* duplicated files ...
Homepage: http://linux.bydg.org/~yogin/

komparator :
directories comparator for KDE
Komparator is an application that searches and synchronizes two directories. It
discovers duplicate, newer or missing files and empty folders. It works on
local and network or kioslave protocol folders.
Homepage: http://komparator.sourceforge....

backuppc : (just in case this was related to your intended use case for some reason)
high-performance, enterprise-grade system for backing up PCs
BackupPC is disk based and not tape based. This particularity allows features
not found in any other backup solution:
* Clever pooling scheme minimizes disk storage and disk I/O. Identical files
across multiple backups of the same or different PC are stored only once
resulting in substantial savings in disk storage and disk writes. Also known
as "data deduplication".

I bet if you throw Picasa at your combined images directory, it might have some kind of "similar image" detection too, particularly since it sorts everything by exif timestamp.

That said, I've never had to use any of this stuff, because my habit was to rename my camera image dumps to a timestamped directory (e.g. 20140123_DCIM ) to begin with, and upload it to its final resting place on my main file server immediately, so I know all other copies I encounter on other household machines are redundant.

Typical approach doesn't work always by Anonymous Coward · 2014-01-23 11:06 · Score: 1

You can google forever and not get the correct answer.

This is not a trivial problem, and in my case, I had to test multiple ways to do this before finding the correct tools. Also, most approaches work fine with 100 photos, but the problem becomes different if you are talking about 80k photos.

And if it was 100 photos, very likely he would do it by hand and wouldn't need a tool.

I checked multiple places including slashdot before almost writing my own tools in perl.

Quick shell script using exiftool by Khopesh · 2014-01-23 11:07 · Score: 4, Interesting

This will help find exact matches by exif data. It will not find near-matches unless they have the same exif data. If you want that, good luck. Geeqie has a find-similar command, but it's only so good (image search is hard!). Apparently there's also a findimagedupes tool available, see comments above (I wrote this before seeing that and had assumed apt-cache search had already been exhausted).

I would write a script that runs exiftool on each file you want to test. Remove the items that refer to timestamp, file name, path, etc. make a md5.

Something like this exif_hash.sh (sorry, slashdot eats whitespace so this is not indented):

#!/bin/sh
for image in "$@"; do
echo "$(exiftool "$image" |grep -ve 20..:..: -e 19..:..: -e File -e Directory |md5sum) $image"
done

And then run:

find [list of paths] -type f -print0 |xargs -0 exif_hash.sh |sort > output

If you have a really large list of images, do not run this through sort. Just pipe it into your output file and sort it later. It's possible that the sort utility can't deal with the size of the list (you can work around this by using grep '^[0-8]' output |sort >output-1 and grep -v '^[0-8]' output |sort >output-2, then cat output-1 output-2 > output.sorted or thereabouts; you may need more than two passes).

There are other things you can do to display these, e.g. awk '{print $1}' output |uniq -c |sort -n to rank them by hash.

On Debian, exiftool is part of the libimage-exiftool-perl package. If you know perl, you can write this with far more precision (I figured this would be an easier explanation for non-coders).
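
For anyone who would rather do the same thing in Python, here is a minimal sketch of the idea above: take the exiftool output, drop the lines that look like dates or mention File/Directory, and hash what remains. The function names are made up for illustration, and fingerprint_file assumes exiftool is on your PATH.

```python
import hashlib
import re
import subprocess

# Same filters as the grep in the shell version: drop date-like values and
# any line mentioning File or Directory (name, path, size, file timestamps).
VOLATILE = re.compile(r"20..:..:|19..:..:|File|Directory")


def metadata_fingerprint(exiftool_lines):
    """Return the MD5 of the metadata lines that survive the filter."""
    kept = [line for line in exiftool_lines if not VOLATILE.search(line)]
    return hashlib.md5("\n".join(kept).encode("utf-8")).hexdigest()


def fingerprint_file(path):
    """Run exiftool on one file (assumption: exiftool is installed and on PATH)."""
    result = subprocess.run(["exiftool", path], capture_output=True,
                            text=True, check=True)
    return metadata_fingerprint(result.stdout.splitlines())
```

Because every date line is thrown away, two different shots taken with identical camera settings will get the same fingerprint, so treat matches as candidates to check rather than as proof.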

--
Use my userscript to add story images to Slashdot. There's no going back.

Re:Quick shell script using exiftool by inode_buddha · 2014-01-23 14:57 · Score: 1

IIRC you could always pipe it into xargs before the sort. xargs isn't necessarily tied to find, but that's where you usually see it. It would make the script much more readable, but as it is, it's pretty good already.

--
C|N>K
Re:Quick shell script using exiftool by Anonymous Coward · 2014-01-23 15:20 · Score: 0

IIRC you could always pipe it into xargs before the sort. xargs isn't necessarily tied to find, but that's where you usually see it. It would make the script much more readable, but as it is, it's pretty good already.
It is already piped to xargs before the sort! This sorts the hashes, which haven't been generated before the xargs command invokes exif_hash.sh, so (in the event you meant to suggest the opposite) you can't run sort before xargs.
Also, when using find -print0 and xargs -0, they kind of are tied together, or at least anything between them must understand null-terminated lines.

AND, the devil is on the details by Anonymous Coward · 2014-01-23 11:09 · Score: 1

AND, most people come with the trivial answer on deduping files. You DON'T want to MD5 or do anything based on hash tags for 80k photos. That doesn't work. Photos are a particular type of file with particular characteristics, which can reduce your workload a lot.

Trivial approach sucks in this case, and carefully picking the correct tools (in my case classifying photos with an exif / date approach) before deduplicating can convert an impractical solution into a working solution.

Re:AND, the devil is on the details by jedidiah · 2014-01-23 14:21 · Score: 1

When you are destroying data it is far better to err on the side of caution. In this case the solution that "sucks" is more appropriate because it's safer.
Manually manipulating all the data first kind of totally negates the point of trying to automate it.

--
A Pirate and a Puritan look the same on a balance sheet.

Use fslint or fslint-gui by Y2K+is+bogus · 2014-01-23 11:11 · Score: 1

fslint is the tool you are looking for.

perl by Anonymous Coward · 2014-01-23 11:11 · Score: 0

this would be a nice intermediate-level weekend perl project

Why use perl? by Arker · 2014-01-23 11:15 · Score: 1

Why use perl when a bash script will do?

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.

checksum? by Anonymous Coward · 2014-01-23 11:16 · Score: 0

script to output path, filename, checksum to somewhere maybe?

File system lint FSlint by Script+Cat · 2014-01-23 11:18 · Score: 1

This will find duplicate files in the general sense:
http://packages.debian.org/sid...

Going for the obvious unmentioned by Anonymous Coward · 2014-01-23 11:18 · Score: 1

Picasa as a local instance, importing from all the other locations... just remember to check the "exclude duplicates" box

sigh.... by djsmiley · 2014-01-23 11:34 · Score: 1

fdupes.

Done :)

--
- http://www.milkme.co.uk

or you could try to GoogleIt... by Anonymous Coward · 2014-01-23 11:36 · Score: 1

http://en.wikipedia.org/wiki/L...

Re:Don't reinvent the wheel: fdupes, md5deep, gqvi by Anonymous Coward · 2014-01-23 11:38 · Score: 1

Be careful with fdupes. It defaults to including zero length files and will hard link those together too, which is generally a really bad idea.

ftwin by DaJoky · 2014-01-23 11:42 · Score: 1

ftwin is a command line tool that, when built with libpuzzle, can generate a signature for each image and detect duplicates (including resized/slightly modified ones). Link: http://freecode.com/projects/f... Disclaimer: I'm the author and don't maintain it actively :-P

http://en.wikipedia.org/wiki/List_of_duplicate_fil by Anonymous Coward · 2014-01-23 11:49 · Score: 2, Informative

http://en.wikipedia.org/wiki/List_of_duplicate_file_finders

Perhaps a better way exists. by Anonymous Coward · 2014-01-23 11:53 · Score: 1

This seems like a rather lot of work just to automate deduping of your porn collection. It might be more enjoyable to do it by hand anyway.

Duplicated in RSS by Anonymous Coward · 2014-01-23 11:58 · Score: 0

I love how this article was listed twice in the RSS feed. Kudos!

There's a command line tool for that. by Anonymous Coward · 2014-01-23 12:12 · Score: 0

findimagedupes

Re:Don't reinvent the wheel: fdupes, md5deep, gqvi by nctritech · 2014-01-23 12:14 · Score: 1

I always use: fdupes -nrSd *

Re:Don't reinvent the wheel: fdupes, md5deep, gqvi by Impy+the+Impiuos+Imp · 2014-01-23 12:16 · Score: 1

Reminds me of Windows link repairer, automatically searching for the nearest file size, which was almost always the wrong thing to do, then suggesting grampa accept the new pointer.

--
(-1: Post disagrees with my already-settled worldview) is not a valid mod option.

fslint? by PaperGeek · 2014-01-23 12:28 · Score: 1

If you are going strictly based on hashing (e.g. not trying to match images that may have different EXIF data embedded, thus making the hashes different) fslint works quite well. It will chug through a filesystem and basically wraps python commands to compare by hash and file size (using both md5 and sha256) and will give you a report of wasted space. You can then save a parseable plain text file. It can take a while - it's bandwidth-bound as you might expect - I just did this for a 2tb network share and it took over 12 hours. But it got the job done and all I had to do was sudo apt-get install fslint

Hope it doesn't access your backup drive by Anonymous Coward · 2014-01-23 12:31 · Score: 1

Hope it doesn't access your backup drive and wipe out your backups as "duplicates".

Look beyond computers.... by whizbang77045 · 2014-01-23 12:32 · Score: 0

The de-duplicator is called a human.

Re:Look beyond computers.... by Anonymous Coward · 2014-01-23 16:15 · Score: 0

Right, because that's why I bought a computer... to play MP3s while I manually do the work it ought to be doing.
Idiot.

Follow the FBI's lead by Mike+Buddha · 2014-01-23 12:33 · Score: 1

They use a database of hashes of kiddie porn to identify offending material without forcing anyone to look at the stuff. Seems like it would be easy to use Perl to crawl your filesystem and identify dupes.

--
by Mike Buddha -- Someday the mountain might get him, but the law never will.

Re:Follow the FBI's lead by Anonymous Coward · 2014-01-24 01:46 · Score: 0

Really? Seems to be easy to defeat. Just change a single pixel by a very small value, and the hash would be completely different.
But somehow I suspect they use tools which are not that easily defeated.
Re:Follow the FBI's lead by bluefoxlucid · 2014-01-24 03:50 · Score: 1

Wrong. OSI explained to us that a person is "victimized" again every time someone looks at an image of them in child porn, and the hash of images is used so that they don't feel that pang in their stomach when an FBI investigator double-clicks 0FEDCABE1.jpg.

--
Support my political activism on Patreon.

Did this years ago by Enry · 2014-01-23 12:38 · Score: 1

I wrote a shell script that looked at the datestamp for each photo and then moved it to a directory called YYYY/MM/DD (so 2000/12/25). I'm going off the assumption that there weren't two photos taken on the same day with the same filenames. So far that seems to be working.

Are you sure by Anonymous Coward · 2014-01-23 13:06 · Score: 1

...you want to trust the EXIF time stamp to determine a duplicate? I had a video cam that was constantly resetting the internal clock to "Jan 1, 2000." It's possible that you could lose some data.

Interestingly, I wrote a program years back, that did the OPPOSITE of this. It read the file name (formatted as a date) and set the date in the EXIF header. I was converting DV-AVI video to still images.

My solution by alantus · 2014-01-23 13:08 · Score: 2

#!/usr/bin/perl
# $Id: findDups.pl 218 2014-01-24 01:04:52Z alan $
#
# Find duplicate files: for files of the same size compares md5 of successive chunks until they differ
#
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex md5_base64);
use Fcntl;
use Cwd qw(realpath);

my $BUFFSIZE = 131072; # compare these many bytes at a time for files of same size
my %fileByName; # all files, name => size
my %fileBySize; # all files, size => [fname1, fname2, ...]
my %fileByHash; # only with duplicates, hash => [fname1, fname2, ...]

if ($#ARGV < 0) {
    print "Syntax: findDups.pl <file|dir> [...]\n";
    exit;
}

# treat params as files or dirs
foreach my $arg (@ARGV) {
    $arg = realpath($arg);
    if (-d $arg) {
        addDir($arg);
    } else {
        addFile($arg);
    }
}

# get filesize after adding dirs, to avoid more than one stat() per file in case of symlinks, duplicate dirs, etc
foreach my $fname (keys %fileByName) {
    $fileByName{$fname} = -s $fname;
}

# build hash of filesize => [ filename1, filename2, ...]
foreach my $fname (keys %fileByName) {
    push(@{$fileBySize{$fileByName{$fname}}}, $fname);
}

# for files of the same size: compare md5 of each successive chunk until there is a difference
foreach my $size (keys %fileBySize) {
    next if $#{$fileBySize{$size}} < 1; # skip filesizes array with just one file
    my %checking;
    foreach my $fname (@{$fileBySize{$size}}) {
        if (sysopen my $FH, $fname, O_RDONLY) {
            $checking{$fname}{fh} = $FH; # file handle
            $checking{$fname}{md5} = Digest::MD5->new; # md5 object
        } else {
            warn "Error opening $fname: $!";
        }
    }
    my $read=0;
    while (($read < $size) && (keys %checking > 0)) {
        my $r;
        foreach my $fname (keys %checking) { # read buffer and update md5
            my $buffer;
            $r = sysread($checking{$fname}{fh}, $buffer, $BUFFSIZE);
            if (! defined($r)) {
                warn "Error reading from $fname: $!";
                close $checking{$fname}{fh};
                delete $checking{$fname};
            } else {
                $checking{$fname}{md5}->add($buffer);
            }
        }
        $read += $r;
        FILE1: foreach my $fname1 (keys %checking) { # remove files without dups
            my $duplicate = 0;
            FILE2: foreach my $fname2 (keys %checking) { # compare to each checking file
                next if $fname1 eq $fname2;
                if ($checking{$fname1}{md5}->clone->digest eq $checking{$fname2}{md5}->clone->digest) {
                    $duplicate = 1;
                    next FILE1; # skip to next file
                }
            }

Re:My solution by MightyYar · 2014-01-23 13:56 · Score: 1

I'm replying to you because one of my two solutions has the same name :)
https://github.com/caluml/find...
I have another solution, written in Python. It is pretty efficient but very limited. It walks two folders, sorting files into bins according to size. If any bins match between the two folders, it does a hash once on each file in each bin and then compares them. That way, the files are not read repeatedly and hashes are only done if necessary. It could be sped up further by only doing partial file matches, but it worked fine for me. Reply if you want it.
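
Roughly what the two-folder approach described above could look like, as a sketch rather than the actual script (which isn't posted here). The function names and the choice of SHA-256 are assumptions. Files are binned by size using only directory listings, and content is read only for sizes that occur in both folders, with each file hashed at most once.

```python
import hashlib
import os
from collections import defaultdict


def files_by_size(root):
    """Walk root and bin regular files by size (no file contents are read)."""
    bins = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                bins[os.path.getsize(path)].append(path)
    return bins


def file_hash(path, chunk=1 << 20):
    """SHA-256 of a whole file, read in chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def common_files(folder_a, folder_b):
    """Return sorted (path_in_a, path_in_b) pairs whose contents are identical."""
    bins_a, bins_b = files_by_size(folder_a), files_by_size(folder_b)
    pairs = []
    # Only sizes present in both folders can contain matches.
    for size in bins_a.keys() & bins_b.keys():
        hashes_b = defaultdict(list)
        for path in bins_b[size]:
            hashes_b[file_hash(path)].append(path)
        for path in bins_a[size]:
            for match in hashes_b.get(file_hash(path), []):
                pairs.append((path, match))
    return sorted(pairs)
```

Call it as common_files("/mnt/backup", "/home/me/Pictures") (paths made up) and review the pairs before deleting anything.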

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
Re:My solution by alantus · 2014-01-23 14:21 · Score: 1

My script can save some I/O and cpu cycles, but has to keep more files open at a time (could run out of file descriptors in extreme cases).
The script you describe must be shorter and easier to understand, but I would only use it for smaller files, where discarding duplicates before reading the whole file doesn't make a big difference.
The next step is to create some UI that allows deleting duplicates easily.
Re:My solution by fisted · 2014-01-23 15:39 · Score: 1

what a long and convoluted pain.
consider the POSIX shell variant

--
CLI paste? paste.pr0.tips!
Re:My solution by alantus · 2014-01-23 16:49 · Score: 1

It is long and convoluted in the same way that an airplane is long and convoluted compared with a bicycle ;)
Re:My solution by MightyYar · 2014-01-24 00:45 · Score: 1

I was using it where one of the directories was mounted over the network, so I didn't want to read the files unless I had to... a directory listing is a pretty cheap operation. One problem that I ran into was that Macs can add resource forks to some files, so if one of the folders was on a Mac you could have weird file sizes. For photos and pdfs and such, the resource fork is disposable so it was driving me nuts... some "unique" files were not unique at all.

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.

Getting pretty simple, yes? by Anonymous Coward · 2014-01-23 13:12 · Score: 0

What happens on Slashdot is after about 20 posts, some people with experience and a simple idea speak up.
paradxum's outline is getting pretty clear, simple and powerful

The book Unix Powertools by O'Reilly and ? has a number of recipes for solving your problem. Here is a verbal plan if you have access to Linux or a Unix type command line.
ls -l >filelist ; Make a long style listing of every directory.
ls -l >>filelist ; Append all the directories you can to the filelist
sort filelist ; Sort it by the file size column, see man sort for fields. Be careful, when you go over 1000 files, sort takes lots of time.
----------- what I would do, since I like the gnumeric spreadsheet
is open filelist with gnumeric and write a cell comparison script. Even old stuff, since the hard disk they are stored on is old too, I prefer to unplug the disk, leave it inside the old steel case computer.

Manually read the file list, identical image files will have the same file size. By sorting, all the likely exact duplicate files will appear together with the parent file.

For the delete duplicates task see other Slashdot posts.

Re:Getting pretty simple, yes? by ihtoit · 2014-01-23 19:51 · Score: 1

or, in DOS/Win7 CLI: "dir /s /os >filelist" returns the entire tree contents from the current directory sorted in ascending file size order to the text file "filelist". 10,070 files/6359 folders (random tree search on my hard drive) took 16 seconds.
Import tab-delimited list into your favourite spreadsheet.
Do what you need to do.

--
Political debates have me rolling my eyes so much I think I got optical whiplash. I should sue. - Foamy The Squirrel

That's not a backup by dbIII · 2014-01-23 13:13 · Score: 1

If it's attached to a live system and is writeable then it's not a backup yet, it's just a copy.
A web hosting business near me went under because they made that mistake and lost all of their hosted data in a single incident.
Copies on instantly available disk are often a lot more convenient than detached disks, tapes or whatever, but if that's all you've got there are plenty of ways to lose the lot.

Nice by Anonymous Coward · 2014-01-23 13:17 · Score: 0

Why didn't I think of that?

Quite simple... by Anonymous Coward · 2014-01-23 13:18 · Score: 0

Well, if you are searching for identical files, just use "rdfind". It has options to delete or even hard-link duplicate files.

If you need to find **similar** images, you can use the "geeqie" image viewer. It can compare images sets by similarity level. I'm not aware of a command-line tool doing this, though.

Dividing files into like groups by strombrg9575 · 2014-01-23 13:19 · Score: 1

In four different programming languages: http://stromberg.dnsalias.org/...

You could also by Anonymous Coward · 2014-01-23 13:20 · Score: 0

You could also go back and rename the photos to the time stamp created using exif tool, when you're done. Assuming the poster doesn't mind renaming the files.

digiKam is what you want. by Lurching · 2014-01-23 13:54 · Score: 2

DigiKam will do everything you want. It works by creating hashes. You set your level of similarity and digiKam will find the files. It can handle multiple locations, and even "albums" on removable media. If you have a lot of images it can be slow, but if you limit any particular search you can greatly improve performance. It is available for Linux and Windows both.

Re:digiKam is what you want. by Anonymous Coward · 2014-01-24 10:09 · Score: 0

DigiKam will do everything you want. It works by creating hashes. You set your level of similarity and digiKam will find the files. It can handle multiple locations, and even "albums" on removable media. If you have a lot of images it can be slow, but if you limit any particular search you can greatly improve performance.
It is available for Linux and Windows both.
I'm a big fan of Digikam, and often use it to find duplicated images. However, I don't think that it will automatically delete all-but-one of a set of duplicates (at least, not version 3.5.0).

Obligatory by multimediavt · 2014-01-23 14:06 · Score: 1

Google: image duplicate finder

I wrote one myself by tepples · 2014-01-23 14:14 · Score: 3, Insightful

What I did in my deduplicator written in Python was group the files by their size and reject any file with a unique size. Then I'd hash the first few kilobytes of each file with MD5 (it's just a spot check so speed is more valuable than security against intentional collisions) and reject any file with a unique first few kilobytes. Finally I'd hash the whole file with a more secure hash.
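
A minimal sketch of those three stages, not the actual dedupe.py. The 4 KB spot-check length, the choice of SHA-256 for the final pass, and all function names are assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict


def _split(paths, key):
    """Group paths by key(path); keep only groups with more than one member."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def _head_md5(path, nbytes=4096):
    """Cheap spot check: MD5 of the first few kilobytes only."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()


def _full_sha256(path, chunk=1 << 20):
    """Full-file hash, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(paths):
    """Return sorted groups of paths with identical content."""
    result = []
    for same_size in _split(paths, os.path.getsize):        # stage 1: size
        for same_head in _split(same_size, _head_md5):      # stage 2: head MD5
            for same_all in _split(same_head, _full_sha256):  # stage 3: full hash
                result.append(sorted(same_all))
    return sorted(result)
```

Each stage only sees files that survived the previous one, so most files are never read in full.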

One liner by tepples · 2014-01-23 14:18 · Score: 1, Offtopic

Here's Gus, and here's your one liner:

wget http://pineight.com/pc/dedupe.py.zip; unzip dedupe.py.zip; python dedupe.py

Dupeguru by Anonymous Coward · 2014-01-23 14:22 · Score: 0

You may want to try dupeguru. It's available at http://www.hardcoded.net/dupeguru_pe.

SIFT is patented by tepples · 2014-01-23 14:24 · Score: 2

What you want, is a first pass which identifies some interesting points in the image.

There is an algorithm for that called SIFT (scale-invariant feature transform), but it's patented and apparently unavailable for licensing in free software.

Re:SIFT is patented by Ivan+the+Terrible · 2014-01-24 07:40 · Score: 1

http://robwhess.github.io/open...
Is software (algorithms) patentable in Europe? Asia outside of China & Russia? (Effectively, nothing is patentable in China or Russia.)
Re:SIFT is patented by Anonymous Coward · 2014-01-24 08:14 · Score: 0

The algorithm is not patented. Some applications of it are, perhaps, protected by patent -- certainly not all implementations of it with a computer, for any purpose. Having briefly glanced at the claims made in the Lowe patent (which claim the invention of using the algorithm in any computer processor or computer readable medium implementation) I doubt that its claims are obviously valid enough in the post-Bilsky US court for it to be worth it for UBC, a Canadian public university, to pursue a free software infringer. Keep in mind that only four years remain on the patent.
At any rate, VLFeat.org hosts some BSD licensed C libraries, (VLFeat) that implement SIFT.
Re:SIFT is patented by tepples · 2014-01-24 08:16 · Score: 1

Is software (algorithms) patentable in Europe?
Maybe not, but Slashdot is in the United States, where OpenSIFT appears illegal to use until the fourth quarter of 2021.
Re:SIFT is patented by tepples · 2014-01-24 08:20 · Score: 1

Keep in mind that only four years remain on the patent.
Since when? It's 2014 right now, and the Lowe patent was filed on March 8, 1999 and granted on March 23, 2004. Patents last 20 years after filing, but as I understand it, a U.S. patent that takes more than three years to grant may be extended to compensate for the period between three years after filing and the grant date. This appears to set expiry at March 23, 2021.
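
The date arithmetic, spelled out as a quick check. This takes the rule exactly as described in the comment above (20 years from filing, plus the time between three years after filing and the grant date) and makes no claim about how the term is actually computed for this patent.

```python
from datetime import date

filed = date(1999, 3, 8)
granted = date(2004, 3, 23)

# 20 years from the filing date.
base_expiry = filed.replace(year=filed.year + 20)

# Delay beyond three years after filing, per the rule as stated in the comment.
three_years_after_filing = filed.replace(year=filed.year + 3)
extension = granted - three_years_after_filing

adjusted_expiry = base_expiry + extension
print(base_expiry, extension.days, adjusted_expiry)
```

Under those assumptions the extension works out to 746 days, which moves expiry from March 8, 2019 to March 23, 2021.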

Count hard links by tepples · 2014-01-23 14:59 · Score: 1

if you have the exact same picture in three different folders/subdirectories on the same file system, zfs will only allocate storage for one copy of the data, and point the three file entries to that one copy. Similar to how hard links work in ext2 and friends.

I think the idea is to use some utility to query ZFS and find files that ZFS has deduplicated. Similar to how one can count hard links to each inode in ext2 and friends.

Sorry, no app for that by fisted · 2014-01-23 15:37 · Score: 1

However it's fairly easy to do with a unix shell and only standard tools...
Something along the lines of:
find /path/to/pics -type f -print0 | xargs -0 md5sum | sort | while read -r hash rest; do
    if [ "$lasthash" = "$hash" ]; then
        echo "$rest"
    fi
    lasthash="$hash"
done | while read -r dupe; do
    echo rm -- "$dupe"
done
That would, once the echo is removed, delete all files that are dupes (except one of each).
Typed it right into the /. comment box, though, so it's probably wrong somewhere. Only intended to convey the idea.

--
CLI paste? paste.pr0.tips!

If all else fails.... by ewieling · 2014-01-23 15:42 · Score: 1

I have used http://www.duplicate-finder.com/photo.html (MS Windows only) because I could not find anything on Linux with similar functionality. It does work very well, it can find similar, but not identical images, such as the same picture saved in a different format or with different compression settings. It tends to slow down when working on directories with multiple thousands of images.

--
I really shouldn't have used someone else's email address for this account.

Why not dupe-guru?? by Anonymous Coward · 2014-01-23 15:47 · Score: 0

I use a utility called dupe-guru. It does md5 sums of all files/directories specified and looks for duplicates. It lets the user decide what action to take on the duplicates. Works perfectly and is available here: https://launchpad.net/~hsoft/+archive/ppa/+packages

Re:Don't reinvent the wheel: fdupes, md5deep, gqvi by Anonymous Coward · 2014-01-23 15:51 · Score: 1

I use DigiKam but when it came to finding duplicates in unmanaged folders I was happy to find out Geeqie has a very powerful File->FindDuplicates tool with many methods for identifying duplicates. Start with the quick ones and move on to the slower methods.
Also: I love Geeqie's view-> pan view mode... check it out!

Huh? by Tablizer · 2014-01-23 16:01 · Score: 1

Whats whats wrong wrong with with dupes dupes? Picky picky.

--
Table-ized A.I.

fdupes does what you want by Anonymous Coward · 2014-01-23 16:28 · Score: 0

fdupes = find dupes.. works for all files, you can specify .jpg .jpeg .JPG .JPEG etc

There's fdupes by Anonymous Coward · 2014-01-23 16:42 · Score: 0

which can just identify duplicate files by full content

fdupes by Anonymous Coward · 2014-01-23 16:53 · Score: 0

fdupes does it for me.

FDUPES is a program for identifying or deleting duplicate files residing within specified directories.
http://premium.caribe.net/~adrian2/programs/fdupes.html

Similar by Anonymous Coward · 2014-01-23 17:01 · Score: 0

I used to use a command line program called Similar, which was part of a package called STIC (Simple Tools for Image Collectors). Similar would scan the images and build a database, and then the database could be queried for which images were similar, and I would pipe the output of the query into xargs rm and remove the duplicates.

symbol lookup error by Anonymous Coward · 2014-01-23 17:24 · Score: 0

On Mageia4 x86_64, I get the error message: /usr/bin/perl: symbol lookup error: /usr/lib/perl5/vendor_perl/5.18.1/x86_64-linux-thread-multi/auto/Graphics/Magick/Magick.so: undefined symbol: InitializeMagick

perl-Graphics-Magick is version 1.3.18

Re:symbol lookup error by Anonymous Coward · 2014-01-23 23:36 · Score: 0

You're using an RC of a distro that has maybe 30 users. Go figure.

Visipics is excellent. by micronicos · 2014-01-23 17:25 · Score: 3, Informative

I use VisiPics for Windows. It's a free software that actually analyses the content of images to find duplicates. This works very well because images may not have exif data or the same image may be different file sizes or formats.
I don't know if it will work under Wine, but it's worth a try.

Visipics is the only tool I have ever found that will reliably use image matching to dedupe; it is Windows only but I have used it on my own collections & it works very well indeed: http://www.visipics.info/

Now (v1.31) understands .raw as well as all other main image formats & can handle rotated images; brilliant little program!

--
Nico M, London, GB.

Re:Visipics is excellent. by DMUTPeregrine · 2014-01-24 02:36 · Score: 2

I've used Duplicate Photo Finder for a while, but VisiPics looks like it's probably better. That said, I have tested and Duplicate Photo Finder worked for me with WINE.

--
Not a sentence!
Re:Visipics is excellent. by amiga3D · 2014-01-25 14:01 · Score: 1

Actually Visipics is reported to work just fine under Wine on Linux.

Re:Don't reinvent the wheel: fdupes, md5deep, gqvi by rgbe · 2014-01-23 17:33 · Score: 1

I use fslint. It does more than just find duplicate images.

dupeGuru and phash by rhewt · 2014-01-23 18:19 · Score: 1

Your best bet is using something like dupeGuru (http://www.hardcoded.net/dupeguru_pe/). It uses a variant of phash (http://www.phash.org/) to also find similar images. I've used it on an archive of 250,000 photos and it works beautifully.

I had the same problem by Anonymous Coward · 2014-01-23 18:43 · Score: 0

I had the same problem once. I had my daughter's pictures since birth (6 years). They were saved by the month, but I had multiple copies, modifications (rotation, resizing), etc. Also, I had her videos, and even worse, the system was faulty, and some of the copies were not exact. All in all, I had something well over a hundred thousand images + videos. All spread over three disks. A picture viewer was obviously out of the question. I just checked it right now, and the tree right now is 232GB. When I did this, the final tree was maybe 150GB.

Here is how I solved it: First, I wrote a program which tried to extract as much info from the images/videos as possible. These included the creation date (via exif or mplayer parsing), if the file was faulty (jpeginfo and mplayer are your friends), the orientation, md5, geometry, and the length for videos. Originally, I also collected inodes. I collected all of these attrs for each image/video and associated them with the files as extended fs attributes. This program ran for over a day (or maybe two). Then I wrote another program which inserted these attributes into the filename (or removed them). This way I got some impressively long filenames, but all info was there. The filename looked like this: in-seattle-zoo-01.jpeg__-length-3911023-orientation-portrait-error-no-......jpeg

Once this was done, I wrote a lot of small programs to eliminate files, dirs, whole trees. For example, suppose I had seven directories called month9. I wrote a program which extracted the md5sums from each directory (remember, it was in the filename now), and if two dirs were matches (or one was the subset of another one), I could eliminate one. if it wasn't enough, I went to other attributes.
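
The directory-level check described above might look something like this. It is a sketch, not the original program: it recomputes MD5s from the file contents instead of reading them out of the filenames, it only looks at files directly inside each directory, and the function names are invented.

```python
import hashlib
import os


def dir_digests(directory):
    """Set of MD5 digests of the regular files directly inside directory."""
    digests = set()
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            with open(entry.path, "rb") as f:
                # Reads each file whole; fine for photos, not for huge videos.
                digests.add(hashlib.md5(f.read()).hexdigest())
    return digests


def redundant_dirs(directories):
    """Return the dirs whose contents are all present in another listed dir."""
    digests = {d: dir_digests(d) for d in directories}
    redundant = []
    for d in directories:
        for other in directories:
            # Never justify a removal by pointing at a dir already marked.
            if d == other or other in redundant:
                continue
            if digests[d] <= digests[other]:  # subset or exact match
                redundant.append(d)
                break
    return redundant
```

When two directories are exact matches, only the first one listed is reported, so one copy always survives.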

All in all, I spent something like 2-3 weeks deduping the mess I have created.

I decided it was best to put the info into extended attributes (it served as the database) since if a file was deemed dup and removed, the database entry with it automatically went.

Good luck, and try not to tear out your hair. :-)

Vilmos

Boar can do that by Anonymous Coward · 2014-01-23 20:55 · Score: 0

Boar is a svn-like tool that handles large binaries. It deduplicates identical files out of the box, and there is also a plugin (in the devel version) that enables block-level deduplication (useful for efficient storage of large images with only differing exif info). Try it out at http://www.boarvcs.org

KDE's digikam by Anonymous Coward · 2014-01-23 21:16 · Score: 0

KDE's digikam has multiple dedup features from hash checks to close matches

same by geert · 2014-01-23 21:55 · Score: 1

same (ftp://ftp.bitwizard.nl/same/) replaces the duplicates by hard or symbolic links.

Python script for sorting by Anonymous Coward · 2014-01-23 22:20 · Score: 0

I have had a bad practice of just emptying the SD card into folders, and renaming the folder to something like backup_christmas. I didn't always format the SD card, so I have had a lot of copies of the same photo.
After a few years this has made it quite a mess.
So I wrote a python script that reads the file, renames it and relocates it to
year \ month \ day \ time_with_seconds.jpg

and if there were two in the same second, I just added one more to the jpg until the name was free.
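
Something along these lines, as a sketch only. The original script isn't shown, so the _1, _2 suffix scheme is a guess at what "added one more" means, and the capture time is passed in as an argument (assumed to have been read from EXIF elsewhere).

```python
import os
import shutil
from datetime import datetime


def target_path(root, taken, ext=".jpg"):
    """Build root/YYYY/MM/DD/HHMMSS.ext, adding _1, _2, ... until the name is free."""
    folder = os.path.join(root, f"{taken:%Y}", f"{taken:%m}", f"{taken:%d}")
    stem = f"{taken:%H%M%S}"
    candidate = os.path.join(folder, stem + ext)
    n = 1
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{stem}_{n}{ext}")
        n += 1
    return candidate


def file_photo(src, root, taken):
    """Move src into the date tree under root and return its new path."""
    ext = os.path.splitext(src)[1].lower() or ".jpg"
    dest = target_path(root, taken, ext)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.move(src, dest)
    return dest
```

For example, file_photo("DSC_0001.JPG", "sorted", datetime(2013, 12, 25, 9, 30, 5)) would move the file to sorted/2013/12/25/093005.jpg, and a second photo from the same second would land at 093005_1.jpg.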

One thing I found out was that the date and time were off on a lot of pictures, so for the ones I spotted, I found the series of photos, e.g. a holiday for a week, and then used the datetime module to calculate the difference in time from some reference picture (e.g. remembering that we were on the beach one day, and guessing the time) and then modified the exif data and ran the move and rename script again.

Afterwards I made a deduplication script, think it was based on:
http://code.activestate.com/recipes/362459-dupinator-detect-and-delete-duplicate-files/

Afterwards I imported the folder into DigiKam, which I use to organize the pictures and tag them.
Actually DigiKam seems to have a deduplication feature, and can probably rename and move photos also, I don't know. It can also find 'similar' photos. I have a habit of taking multiple shots so I can keep the one that was e.g. the least blurry.

DIY by Anonymous Coward · 2014-01-23 22:26 · Score: 0

If there isn't, find some time and do it. Today I found some time, my ignored her because of the time I already have.

fdupes by the_other_chewey · 2014-01-23 22:47 · Score: 1

No need to roll your own. If the redundant files are identical (the
problem as stated lets me assume that), use fdupes.

"Searches the given path for duplicate files. Such files are found by
comparing file sizes and MD5 signatures, followed by a byte-by-byte
comparison."

It's fast, accurate, and generates a list of duplicate files to handle
yourself - or automatically deletes all except the first of duplicate
files found.

I've used it myself with tens of thousands of pictures to exactly do
what the OP wants.

yes sure by Anonymous Coward · 2014-01-24 00:48 · Score: 0

digikam

fdupes by Anonymous Coward · 2014-01-24 01:33 · Score: 0

I'm surprised nobody has mentioned fdupes yet.

It's a terminal program found in most linux distributions that will identify and/or delete file duplicates.

It uses several checks to speed things up: file size, hashes when the size matches, and full file comparisons if the hashes match.

try this by Anonymous Coward · 2014-01-24 01:56 · Score: 0

DupeGuru Picture Edition at http://www.hardcoded.net/dupeguru_pe/

cross platform and easy to use

Photo De-Duplicator in Python by ninjazombie · 2014-01-24 02:57 · Score: 1

I know I'm a little late to the conversation, but I wrote a script to tackle this very problem just a month or two ago.

https://github.com/mikegreiling/photosort.py
http://pixelcog.com/blog/2013/recover-corrupted-photo-library/

I had a corrupted iPhoto library after a hard drive went bad, so I needed to combine the photos from my iPhone and several other sources to recompile the library, and the only way to recognize duplicates was with EXIF information.

yes, use EXIF by Anonymous Coward · 2014-01-24 03:06 · Score: 0

Using filenames is not useful because the name may change while the contents don't.

I've done this several times using EXIF data of photo time + camera + exposure to make a unique key.
But I've never considered if the file has been edited and the EXIF key stays the same - then also check dimensions and maybe hash of contents.
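
A rough sketch of that composite key, assuming exiftool is installed and using its JSON output (-j). The tag names DateTimeOriginal, Model and ExposureTime are one plausible reading of "photo time + camera + exposure", and the function names are invented.

```python
import json
import subprocess
from collections import defaultdict

# Assumed mapping of "photo time + camera + exposure" onto exiftool tag names.
KEY_TAGS = ("DateTimeOriginal", "Model", "ExposureTime")


def read_tags(directory):
    """Ask exiftool (assumed to be on PATH) for the key tags, recursively, as JSON."""
    cmd = ["exiftool", "-j", "-r"] + ["-" + tag for tag in KEY_TAGS] + [directory]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    return json.loads(out) if out.strip() else []


def group_by_exif_key(records):
    """Map (time, camera, exposure) -> files sharing it, for keys seen more than once."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(tag) for tag in KEY_TAGS)
        if None in key:
            continue  # incomplete EXIF: can't be keyed, leave for another method
        groups[key].append(rec["SourceFile"])
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}
```

Use it as group_by_exif_key(read_tags("/path/to/pics")). As noted above, an edited copy keeps the same key, so compare dimensions or a content hash before treating a group as true duplicates.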

geeqie by Anonymous Coward · 2014-01-24 03:17 · Score: 0

Geeqie (was GQView) does duplicate image searches including options for non-perfect duplicates, or just bit for bit checks. It'll then show you the images side by side with some meta info.

It claims to be able to read EXIF.

Beyond Compare by Anonymous Coward · 2014-01-24 03:49 · Score: 0

Beyond Compare (http://www.scootersoftware.com/) is a really great solution for this. I use it all the time.

Clonespy is awesome but not exactly linux by Anonymous Coward · 2014-01-24 04:56 · Score: 0

I have found clonespy to be an excellent tool for finding duplicates, and not just for photos. I prefer the two-folder method. You would have to run it under a Windows VM to run it on Linux ;)

Use Gimp by araxius · 2014-01-24 05:40 · Score: 1

I am wondering why no one suggested Gimp. Gimp has a command-line interface that does almost anything brilliantly and is perfect for working on multiple files. Gimp reads EXIF data and also accepts Python scripts internally if you prefer a GUI over CLI.

Freedups by Anonymous Coward · 2014-01-24 07:54 · Score: 0

Freedups - http://www.stearns.org/freedups/ - should also do the trick. It hardlinks identical files, freeing up the space without changing the directory tree. It does caching to reduce disk bandwidth, and does the whole thing with a single pass through the files. GPL'd. (I'm the author).
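
To illustrate the hardlinking idea in general terms (this is not Freedups' own code, just a sketch), here is a small Python function. The name hardlink_duplicates is made up, and it assumes the caller has already verified that the files in the group are identical and live on the same filesystem.

```python
import os


def hardlink_duplicates(group):
    """Replace every file after the first with a hard link to the first.

    `group` is a list of paths already known to have identical contents.
    Returns the number of files that were relinked.
    """
    keeper = group[0]
    relinked = 0
    for path in group[1:]:
        if os.path.samefile(keeper, path):
            continue  # already the same inode, nothing to free
        temp = path + ".linktmp"
        # Create the link under a temporary name, then swap it into place,
        # so the original path never disappears if something fails midway.
        os.link(keeper, temp)
        os.replace(temp, path)
        relinked += 1
    return relinked
```

The directory tree looks the same afterwards, but the duplicates share one copy of the data. Keep in mind that editing one hard-linked name changes what all the others see.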

Gemini by Anonymous Coward · 2014-01-24 22:13 · Score: 0

On Mac I use MacPaw's Gemini. It's not free, but it has worked well for me.

fdupes by Anonymous Coward · 2014-01-24 23:54 · Score: 0

There is a command line tool called fdupes.

Or a GUI tool fslint.

Both are in the Ubuntu repos.

I wrote a thing to do this, and my friends hassled by withorwithoutgod · 2014-01-25 11:32 · Score: 1

so here it is: https://github.com/withorwitho...

Enjoy. It doesn't delete or move anything automatically (you can add that if you want); it just outputs images that are perceptually similar.

Example usage and output are included on the github page. Email me if you want it to work a different way or do something different. It's not the most robust phash algorithm, but it's better than straight hashes (in some ways) as it'll detect a png and a jpg that are similar.
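
For readers who haven't seen perceptual hashing before, here is a toy "average hash" in Python. It is not the algorithm from the linked repository, only a common simple variant, and it assumes the image has already been decoded and shrunk to a small grayscale grid (say 8x8) by some imaging library. The names average_hash, hamming_distance and looks_similar are hypothetical.

```python
def average_hash(pixels):
    """Hash a small grayscale grid (list of rows of 0-255 values) to an int.

    Each bit records whether a pixel is brighter than the grid's mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def looks_similar(pixels_a, pixels_b, max_distance=5):
    """Treat two grids as similar if their hashes differ in few bits."""
    return hamming_distance(average_hash(pixels_a), average_hash(pixels_b)) <= max_distance
```

Small changes from recompression shift pixel values slightly but rarely move them across the mean, so the hashes stay close. The max_distance threshold here is an arbitrary choice and would need tuning on real photos.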

I wrote one by wscott · 2014-01-26 01:48 · Score: 1

I wrote one that I use; it works really well because it also hardlinks all the duplicates. https://github.com/wscott/link...

FlickrDupFinder by shokk · 2014-01-26 06:07 · Score: 1

I have done it with Flickr and FlickrDupFinder (https://www.flickr.com/services/apps/72157623582289101/) which has worked very well!

--
"Beware of he who would deny you access to information, for in his heart, he dreams himself your master."

digikam by Anonymous Coward · 2014-01-26 23:55 · Score: 0

Have you tried digikam?
It has a photo de-duplicator.

if a 4 Tb drive is 169 dollars by Anonymous Coward · 2014-01-27 08:12 · Score: 0

Why bother?
Just put all your images into yfidb (your favorite image database) and sort em out as time goes by
or something like that
If you have 1,000s of images, I assume either (a) you don't really care that much about any one particular image, or (b) you already have a special set of images you care about

Tool that I use... by Anonymous Coward · 2014-01-27 19:36 · Score: 0

I have the exact same problem, so hopefully you will find the tool that I use to your liking:
https://github.com/christophelg/DuplicateFinder

Just what you say by Fuzzums · 2014-01-30 07:44 · Score: 1

If the images are identical, hash them and compare hashes.

--
Privacy is terrorism.

Slashdot Mirror

243 comments