Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files?

CRC by Spazmania · 2012-09-02 01:32 · Score: 5, Informative

Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).
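For anyone who wants to try this, here is a minimal Python sketch of those steps, not a tested production tool. The function names (crc32_of, find_dupes) are made up for illustration, and it keeps the list in memory instead of writing it to a file first, which is an assumption that should hold for a few million short records.

```python
import filecmp
import os
import zlib
from itertools import groupby


def crc32_of(path, chunk_size=1 << 20):
    """Return the CRC32 of a file, read in chunks to keep memory flat."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def find_dupes(root):
    """Yield (original, duplicate) path pairs found under root."""
    records = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                records.append((crc32_of(path), path))
    records.sort()  # equal CRCs end up adjacent
    for _crc, group in groupby(records, key=lambda r: r[0]):
        kept = []  # one representative per distinct content
        for _, path in group:
            for original in kept:
                # the CRC only nominates candidates; the full compare decides
                if filecmp.cmp(original, path, shallow=False):
                    yield original, path
                    break
            else:
                kept.append(path)
```

Something like `for original, dupe in find_dupes("/data"): print(dupe, "duplicates", original)` would list the candidates without deleting anything.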

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.

Re:CRC by Anonymous Coward · 2012-09-02 01:36 · Score: 5, Informative

s/CRC32/sha1 or md5, you won't be CPU bound anyway.
Re:CRC by Anonymous Coward · 2012-09-02 01:37 · Score: 0

Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).
Would you be so kind to write a program/script which can do that ?
Re:CRC by wisty · 2012-09-02 01:37 · Score: 1

This is similar to what git and ZFS do (but with a better hash, some kind of sha I think).
Re:CRC by Kral_Blbec · 2012-09-02 01:38 · Score: 5, Informative

Or just by file size first, then do a hash. No need to compute a hash to compare a 1mb file and a 1kb file.
Re:CRC by Pieroxy · 2012-09-02 01:38 · Score: 2

Exactly.
1. Install MySQL,
2. create a table (CRC, directory, filename, filesize)
3. fill it in
4. play with inner joins.
I'd even go down the path of forgetting about the CRC. Before deleting something, do a manual check anyways. CRC has the advantage of making things very straightforward but is a bit more complex to generate.

--
Write boring code, not shiny code!
Re:CRC by Anonymous Coward · 2012-09-02 01:42 · Score: 0

Manual check first? I got the impression that this guy had LOTS of data and presumably also LOTS of dupes. I would hate doing manual checks on tens or hundreds of thousands of files.
Re:CRC by cheesybagel · 2012-09-02 01:44 · Score: 1

md5sum `find /` | sort -k1,1
Or something like that. You probably need xargs. My script-fu is weak.
Re:CRC by the+eric+conspiracy · 2012-09-02 01:47 · Score: 1

Use SHA-1 instead of CRC.
Re:CRC by Spazmania · 2012-09-02 01:47 · Score: 2

I have a script which does this for openstreetmap tiles. Once it identifies the dupes it archives all the tiles into a single file, pointing the dupes at a single copy in the archive. Then I use a Linux fuse filesystem to read the file and present the results to Apache. Saves a truly massive amount of disk space for an openstreetmap server since the files are mostly smaller than a single disk block and never consume enough disk blocks that the space lost to the inode and unused part of the last block is insignificant.

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Re:CRC by SwashbucklingCowboy · 2012-09-02 01:48 · Score: 2

DO NOT do a CRC, do a hash. Too many chances of collision with a CRC.
But that still won't fix his real problem - he's got lots of data to process and only one system to process it with.
Re:CRC by Pieroxy · 2012-09-02 01:50 · Score: 1

You can check a few files in a directory and then easily deduce the whole directory is a dupe. You don't have to do it file by file.
Plus, when the system finds a dupe, you need to tell it which copy it should delete, or else you risk having stuff all around and not knowing where it is. Some file you knew was in directory A/B/C/D is suddenly not there anymore and you have no clue where its "dupe" is located. Unless the dupe finder creates symlinks in place of the deleted file...

--
Write boring code, not shiny code!
Re:CRC by igb · 2012-09-02 01:52 · Score: 5, Insightful

That involves reading every byte. It would be faster to read the bytecount of each file, which doesn't involve reading the files themselves as that metadata is available, and then exclude from further examination all the files which have unique sizes. You could then read the first block of each large file, and discard all the files that have unique first blocks. After that, CRC32 (or MD5 or SHA1 --- you're going to be disk-bound anyway) and look for duplicates that way.
Re:CRC by vlm · 2012-09-02 01:54 · Score: 2, Interesting

4. play with inner joins.
Much like there's 50 ways to do anything in Perl, there's quite a few ways to do this in SQL.
select filename_and_backup_tape_number_and_stuff_like_that, count(*) as number_of_copies
from pile_of_junk_table
group by md5hash
having number_of_copies > 1
There's another strategy where you mush two tables up against each other... one is basically the DISTINCT of the other.
triggers are widely complained about, but you can implement a trigger system (or pseudo-trigger, where you make a wrapper function in your app) where basically a table of "files" is stored with a column called "count of identical md5hash" and then your sql looks like select * from pile where identicalcount>1
There's ways to play with views.
Do you need to run it interactively or batch it or just run it basically once or ... If you're allowed to barf on data input you can even enforce the md5 hash as a UNIQUE INDEX or UNIQUE KEY in the table definition.
You'll learn a lot about how to think about high performance computing. Are you trying to minimize latency or minimize storage or minimize index size or maximize reliability/uptime or minimize processor time or minimize NAS bandwidth or minimize (initial OR maintenance) programming time or ....
The funniest thing is if you've never tried restoring data from backups (hey, it happens), and/or never had a tape failure (hey it happens), you'll THINK you want to eliminate dupes, but trust me, those dupes will save your bacon someday, and tape is cheap compared to cost of programmer and cost of lost data.... 5 TB is not much technically but is obviously worth a lot from a business standpoint...
Also from personal experience you're going to find people gaming the system where DOOM3.EXE and NOTEPAD.EXE happen to have the same md5hash and length and NOTEPAD.EXE was found on a not-totally but pretty much noob's desk. Use some judgement and don't come down too hard on the newest of new learners.

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Re:CRC by Rich0 · 2012-09-02 01:55 · Score: 1

Things get unnecessarily messy when you have to do them all in one line. However, if I were doing this as a one-time operation, I'd start with something like what you suggest, dumping the results into file1.
Then I'd cat the whole thing through awk '{ print $1 }' | uniq -d > file2 to get a list of all the hashes that are not unique (that way you can focus on the duplicates and not have to scan that huge file).
Then I'd grep the original file with grep -f file2 file1 > file3 to get the full output of the original search for each of the duplicated files.
Chances are that you're going to want to semi-manually deal with the duplicates, but if you pare the file down to a list of stuff where you just want to keep the first instance of each duplicate then I'm sure it wouldn't be hard to remove the first one and then pass the rest into an rm command. You'd want to be careful, since there could be numerous duplicates that the system expects to be there.
All the steps above would be very fast, aside from the original md5sum. And yes, I'd probably use xargs so that you don't have 40 bazillion arguments to md5sum. Or you can just use the find option that executes a command against each one. find / | xargs -n 1 md5sum is probably what you're looking for.
Re:CRC by baffled · 2012-09-02 01:57 · Score: 1

Sounds ideal. Wouldn't take long to code, nor execute.
Re:CRC by Anonymous Coward · 2012-09-02 01:58 · Score: 0

The easiest ways are to: 1. Bin by extension 2. Then bin by size 3. Then bin by MD5 / CRC32. You put those files in/look them up as hashes, typically with a directory structure. The top level directory would be the bucket, of course.
Finally, to retain the structure, stick to symlinks/links. Oh, if it's windows, write the list of duplicate files into a text file or something.
Re:CRC by Zocalo · 2012-09-02 01:58 · Score: 4, Informative

No. No. No. Blindly CRCing every file is probably what took so long on the first pass and is a terribly inefficient way of de-duplicating files.

There is absolutely no point in generating CRCs of files unless they match on some other, simpler to compare characteristic like file size. The trick is to break the problem apart into smaller chunks. Start with the very large files; the exact size break to use will depend on the data set, but as the poster mentioned video files, say everything over 1GB to start. Chances are you can fully de-dupe your very large files manually based on nothing more than a visual inspection of names and file sizes in little more time than it takes to find them all in the first place. You can then exclude those files from further checks, and more importantly, from CRC generation.

After that, try and break the problem down into smaller chunks. Whether you are sorting on size, name or CRC, it's quicker to do so when you only have a few hundred thousand files rather than several million. Maybe do another size constrained search; 512MB-1GB, say. Or if you have them, look for duplicated backup files in the form of ZIP files, or whatever archive format(s) you are using, based on their extension - that also saves you having to expand and examine the contents of multiple archive files. Similarly, do a de-dupe of just the video files by extensions as these should again lend themselves to rapid manual sorting without having to generate CRCs for many GB of data. Another grouping to consider might be to at least try and get all of the website data, or as much of it as you can, into one place and de-dupe that, and consider whether you really need multiple archival copies of a site, or whether just the latest/final revision will do.

By the time you've done all that, including moving the stuff that you know is unique out of the way and into a better filing structure as you go, the remainder should be much more manageable for a single final pass. Scan the lot, identify duplicates based on something simple like the file size and, ideally, manually get your de-dupe tool to CRC only those groups of identically sized files that you can't easily tell apart like bunches of identically sized word processor or image files with cryptic file names.

--
UNIX? They're not even circumcised! Savages!
Re:CRC by caluml · 2012-09-02 01:58 · Score: 5, Informative

Exactly. What I do is this:

1. Compare filesizes.
2. When there are multiple files with the same size, start diffing them. I don't read the whole file to compute a checksum - that's inefficient with large files. I simply read the two files byte by byte, and compare - that way, I can quit checking as soon as I hit the first different byte.

Source is at https://github.com/caluml/finddups - it needs some tidying up, but it works pretty well.

git clone, and then mvn clean install.
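
The early-exit compare in step 2 looks roughly like this in Python. This is only an illustrative sketch, not the finddups source; same_content is a made-up name, and it reads in 64 KB chunks rather than literally one byte at a time, which gives the same answer with far fewer reads.

```python
import os


def same_content(path_a, path_b, chunk_size=64 * 1024):
    """Return True if both files hold identical bytes, stopping at the first difference."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # different sizes can never match
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False  # quit as soon as a chunk differs
            if not a:
                return True  # both files ended together
```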

--
Get your own free personal location tracker
Re:CRC by Anonymous Coward · 2012-09-02 01:58 · Score: 0

sha1sum would be a better choice than crc32, just to avoid unnecessary hash collisions.
The most "difficult" part of it all I suppose is proper filetree traversal.
All in all, I'd say coding time is less than 15 minutes if one is familiar with the Win API.
Re:CRC by TubeSteak · 2012-09-02 02:01 · Score: 1

It's possible the free de-dup program was trying to do that.
Best case scenarios would put your hash time at roughly 14~54 hours (100 MB/s to 25 MB/s) for 4.9 TB
But millions of small files are the absolute worst case scenario.
God help you if there's any fragmentation.
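
A quick way to sanity-check read-time estimates like these, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and pure sequential reads with no seek overhead. read_hours is just an illustrative helper name.

```python
def read_hours(total_bytes, bytes_per_second):
    """Hours needed to read total_bytes at a steady sequential rate."""
    return total_bytes / bytes_per_second / 3600


TB = 10**12
MB = 10**6
fast = read_hours(4.9 * TB, 100 * MB)  # best case, sustained 100 MB/s
slow = read_hours(4.9 * TB, 25 * MB)   # slower disk or bus, 25 MB/s
print(f"{fast:.1f} h at 100 MB/s, {slow:.1f} h at 25 MB/s")
```

Real runs over millions of small files will be slower than either figure, since seeks dominate.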

--
[Fuck Beta]
o0t!
Re:CRC by Anonymous Coward · 2012-09-02 02:03 · Score: 0

That is only IFF you haven't already received a duplicate. So the probability of a CRC32 collision will be like 1/2^32 ....
Re:CRC by Joce640k · 2012-09-02 02:03 · Score: 1

Did you read the bit about "doing a full compare on any file with the same CRC"?
The CRC is just for bringing likely files together. It will work fine.

--
No sig today...
Re:CRC by Anonymous Coward · 2012-09-02 02:05 · Score: 5, Informative

If you get a linux image running (say in a livecd or VM) that can access the file system then fdupes is built to do this already. Various output format/recursion options.
From the man page:
DESCRIPTION
Searches the given path for duplicate files. Such files are found by
comparing file sizes and MD5 signatures, followed by a byte-by-byte
comparison.
Re:CRC by Art+Challenor · 2012-09-02 02:05 · Score: 1

The problem is that you need more intelligence.
If you've dup'd a folder, what in your scheme ensures that one complete folder will be removed? You could end up with both folders with half the files in each - an organizational nightmare.
Re:CRC by Joce640k · 2012-09-02 02:07 · Score: 3, Insightful

s/CRC32/sha1 or md5, you won't be CPU bound anyway.
Whatever you use it's going to be SLOW on 5TB of data. You can probably eliminate 90% of the work just by:
a) Looking at file sizes, then
b) Looking at the first few bytes of files with the same size.
After THAT you can start with the checksums.

--
No sig today...
Re:CRC by kanweg · 2012-09-02 02:07 · Score: 2

You're not baffled.
Bert
Re:CRC by WoLpH · 2012-09-02 02:18 · Score: 2

Indeed, I once created a dedup script which basically did that.
1. compare the file sizes
2. compare the first 1MB of the file
3. compare the last 1MB of the file
4. compare the middle 1MB in the file
It's not a 100% foolproof solution but it was more than enough for my use case at that time and much faster than getting checksums.
Re:CRC by Anonymous Coward · 2012-09-02 02:22 · Score: 0

The one issue (with some probability) where it won't work is if file extensions are intentionally changed.
Re:CRC by igb · 2012-09-02 02:22 · Score: 3, Interesting

The problem isn't CRC vs secure hash, the problem is the number of bits available. He's not concerned about an attacker sneaking collisions into his filestore, and he always has the option of either a byte-by-byte comparison or choosing some number of random blocks to confirm the files are in fact the same. But 32 bits isn't enough simply because he's guaranteed to get collisions even if all the files are different, as he has more than 2^32 files. But using two different 32-bit CRC algorithms, for example, wouldn't be "secure" but would be reasonably safe. But as he's going to be disk bound, calculating an SHA-512 would be reasonable, as he can probably do that faster than he can read the data.
I confess, if I had a modern i5 or i7 processor and appropriate software I'd be tempted to in fact calculate some sort of AES-based HMAC, as I would have hardware assist to do that.
Re:CRC by DarkOx · 2012-09-02 02:32 · Score: 1

Right, and these are backups, so it's useful to have not just every unique file but their layout. If they were all in a folder together at one time, it's useful to preserve that fact.
It sounds like the poster is somewhat organized, he was making backups in the first place. What he failed to do was manage versioning and generations. My inclination would be to copy the entire thing into some other file system that does block-level dedupe. Keep all the files, map them onto the same media underneath, where they are similar. Likely he will save more space this way as well. All the other suggestions about using sha-1 or some form of CRC are going to result in keeping full copies of files that are 99% the same. The transaction history file from a personal finance app is a perfect example.
It might have some headers in the first block that get updated and then more data appended to the end, all the stuff in the middle never changes. That might get backed up every week or every day. It's mostly not unique; he would save lots of space deduping at the block layer rather than the file.

--
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
Re:CRC by Anonymous Coward · 2012-09-02 02:35 · Score: 0

md5sum `find /` | sort -k1,1
This will fail unless very few files are found.
Re:CRC by TheGratefulNet · 2012-09-02 02:36 · Score: 2

divide and conquer.
your idea of using file size as first discriminant is good. its fast and throws out a lot of things that don't need to be checked.
another accelerant is to find if the count of the # of files in a folder is the same. and if a few are the same, maybe the rest are. use 'info' like that to make it run faster.
I have this problem and am going to write some code to do this, too.
but I might have some files that are 'close' to the others and so I need smarter code. example: some music files might be the same in content but only vary in tags. or their titles are different. or maybe even their run length is slightly diff but they are still mostly the same file. I'd want to dedupe those, too.
you would have a manual list to verify (the computer thinks these are the same; please verify, mr human).
some files may have errors in them! maybe I made copies of mp3 files and there was a static hit on one disk. finding by dupe filename and even size is not good enough. you found 2 contenders, but which is the CLEAN file? which has no dropouts or buzzsaws? same for photos, too, if you retouch photos you may not know which is the original or the fixed/keeper.
special knowledge helps here. if its audio, if its video, if its text, spreadsheets, o/s runnable files, etc conf files, all can use diff 'tricks' to help accelerate.
this is why this solution is NOT easy unless you just go brute force by disk block. and this is not do-able on anything large unless you have hardware support.

--
"It is now safe to switch off your computer."
Re:CRC by bzipitidoo · 2012-09-02 02:38 · Score: 5, Insightful

Part 2 of your method will quickly bog down if you run into many files that are the same size. Takes (n choose 2) comparisons, for a problem that can be done in n time. If you have 100 files all of one size, you'll have to do 4950 comparisons. Much faster to compute and sort 100 checksums.
Also, you don't have to read the whole file to make use of checksums, CRCs, hashes and the like. Just check a few pieces likely to be different if the files are different, such as the first and last 2000 bytes. Then for those files with matching parts, check the full files.
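
A rough Python sketch of that partial check, assuming MD5 over the first and last 2000 bytes is good enough as a pre-filter. quick_signature is a hypothetical name, and a matching signature only nominates candidates for the full-file check.

```python
import hashlib
import os


def quick_signature(path, sample=2000):
    """Return (size, md5 of head + tail samples) as a cheap pre-filter key."""
    size = os.path.getsize(path)
    h = hashlib.md5()
    with open(path, "rb") as f:
        h.update(f.read(sample))  # head sample
        if size > sample:
            # tail sample, never overlapping the head on short files
            f.seek(max(sample, size - sample))
            h.update(f.read(sample))
    return size, h.hexdigest()
```

Files whose signatures differ are definitely different; files whose signatures match still need the full compare or a full hash.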

--
Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
Re:CRC by Anonymous Coward · 2012-09-02 02:38 · Score: 0

find / | xargs -n 1 md5sum is probably what you're looking for.
This will fail unless you're geeky enough to have absolutely no spaces in any filenames.
Re:CRC by Anonymous Coward · 2012-09-02 02:40 · Score: 0

You can check a few files in a directory and then easily deduce the whole directory is a dupe. You don't have to do it file by file.
How can you be sure a single file wasn't added to one duplicate?
Re:CRC by belg4mit · 2012-09-02 02:40 · Score: 2, Informative

Unique Filer http://www.uniquefiler.com/ implements these short-circuits for you.
It's meant for images but will handle any filetype, and even runs under WINE.

--
Were that I say, pancakes?
Re:CRC by GuldKalle · 2012-09-02 02:41 · Score: 1

the CRC is not just a bit more complex to generate, it forces you to read the entire file. Reading 5 TB data takes quite a lot more time than reading a filesystem with 4M files. So yes, delay the CRC, play with filesizes first.

--
What?
Re:CRC by michael_cain · 2012-09-02 02:52 · Score: 1

...as he has more than 2^32 files.

4.2 million, not billion. About 2^22 files.
Re:CRC by JoeMerchant · 2012-09-02 02:55 · Score: 1

Added benefit, when sorting by filesize you can hit the biggest ones first. Depending on your dataset, most of your redundant data might be in just a few duplicated files.
Re:CRC by JoeMerchant · 2012-09-02 02:56 · Score: 2, Funny

Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).
Would you be so kind to write a program/script which can do that ?
Payment information please, AC?
Re:CRC by K.+S.+Kyosuke · 2012-09-02 03:05 · Score: 3, Insightful

Why not simply do it adaptively? Two or three files of the same size => check by comparing, more files of the same size => check by hashing.
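
In Python the adaptive idea might look something like this. It is a sketch under assumptions: the cutoff of three files (pairwise_limit) and the choice of SHA-1 are arbitrary, and group_identical expects a list of paths already known to have the same size.

```python
import filecmp
import hashlib
from collections import defaultdict


def file_digest(path):
    """SHA-1 of the whole file, read in 1 MB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def group_identical(paths, pairwise_limit=3):
    """Split same-size files into groups of identical content."""
    if len(paths) <= pairwise_limit:
        # few files: direct compares can stop early at the first difference
        groups = []
        for path in paths:
            for g in groups:
                if filecmp.cmp(g[0], path, shallow=False):
                    g.append(path)
                    break
            else:
                groups.append([path])
    else:
        # many files: one hash per file instead of n-choose-2 compares
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        groups = list(by_digest.values())
    return [g for g in groups if len(g) > 1]
```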

--
Ezekiel 23:20
Re:CRC by blueg3 · 2012-09-02 03:06 · Score: 5, Informative

b) Looking at the first few bytes of files with the same size.
Note that there's no reason to only look at the first few bytes. On spinning disks, any read smaller than about 16K will take the same amount of time. Comparing two 16K chunks takes zero time compared to how long it takes to read them from disk.
You could, for that matter, make it a 3-pass system that's pretty fast:
a) get all file sizes; remove all files that have unique sizes
b) compute the MD5 hash of the first 16K of each file; remove all files that have unique (size, header-hash) pairs
c) compute the MD5 hash of the whole file; remove all files that have unique (size, hash) pairs
Now you have a list of duplicates.
Don't forget to eliminate all files of zero length in step (a). They're trivially duplicates but shouldn't be deduplicated.
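A compact Python sketch of the three passes, for illustration only. The helper names (md5_of, keep_shared, duplicate_groups) are invented here, and since every group already shares one size after pass (a), grouping by hash inside a group is the same as grouping by (size, hash) pairs.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, limit=None, chunk_size=1 << 20):
    """MD5 of the first `limit` bytes of a file, or of the whole file if limit is None."""
    h = hashlib.md5()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            n = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = f.read(n)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()


def keep_shared(paths, key):
    """Group paths by key and drop every group with a unique key."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [g for g in buckets.values() if len(g) > 1]


def duplicate_groups(paths, header=16 * 1024):
    """Return lists of paths that agree on size, header hash and full hash."""
    paths = [p for p in paths if os.path.getsize(p) > 0]  # skip zero-length files
    groups = keep_shared(paths, os.path.getsize)          # pass (a): sizes
    # pass (b): hash of the first 16K, then pass (c): hash of the whole file
    for key in (lambda p: md5_of(p, header), md5_of):
        groups = [sub for g in groups for sub in keep_shared(g, key)]
    return groups
```

A byte-for-byte compare of each surviving group before deleting anything is still cheap insurance.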
Re:CRC by blueg3 · 2012-09-02 03:09 · Score: 1

Fortunately, you actually only need about 2^16 files to get collisions on a 32-bit CRC.
Re:CRC by Anonymous Coward · 2012-09-02 03:10 · Score: 1

This is an instance where big-O analysis is misleading. Yes, his method requires N choose 2 in the worst case, but that's only if there are a lot of files with the same size but different data. Most of the time, that won't happen. And when it does, the diff will catch it as soon as there's a difference, so each compare has the possibility of taking a lot less time than computing a checksum.
Re:CRC by Zeroko · 2012-09-02 03:13 · Score: 2

The relevant number when worrying about non-adversarial hash collisions is the square root of the number of outputs (assuming they are close enough to uniformly distributed), due to the birthday paradox. So in the case of CRC32, more than ~2^16 files makes a collision likely (well, 2^16 gives about 39%), & with 2^22, the probability is nearly indistinguishable from 1 (it being over 99.9% for only 2^18 files).
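Those figures can be reproduced with the usual birthday approximation, p ≈ 1 - exp(-n(n-1)/(2N)), where N is the number of possible hash values. A small Python sketch, assuming uniformly distributed hashes (collision_probability is just an illustrative name):

```python
import math


def collision_probability(n_files, hash_bits):
    """Approximate chance that at least two of n_files distinct files share a hash."""
    outputs = 2.0 ** hash_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny
    return -math.expm1(-n_files * (n_files - 1) / (2 * outputs))


for exp in (16, 18, 22):
    p = collision_probability(2 ** exp, 32)
    print(f"2^{exp} files, 32-bit hash: {p:.4f}")
```

The same function with hash_bits=160 shows why accidental collisions are not a practical worry for a 160-bit hash at this file count.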
Re:CRC by b4dc0d3r · 2012-09-02 03:32 · Score: 4, Interesting

This was theorized by one of the RSA guys (Rivest, if I'm not mistaken). I helped support a system that identified files by CRC32, as a lot of tools did back then. As soon as we got to about 65k files (2^16), we had two files with the same CRC32.
Let me say, CRC32 is a very good algorithm. So good, I'll tell you how good. It is 4 bytes long, which means in theory you can change any 4 bytes of a file and get a CRC32 collision, unless the algorithm distributes them randomly, in which case you will get more or less.
I naively tried to reverse engineer a file from a known CRC32. Optimized and recursive, on a 333 MHz computer, it took 10 minutes to generate the first collision. Then every 10 minutes or so. Every 4 bytes (last 4, last 5 with the original last byte, last 6 with original last 2 bytes, etc) there was a collision.
Compare file sizes first, not CRC32. The 2^16 estimate is not only mathematically proven, but also in the big boy world. I tried to move the community towards another hash.
CRC32 *and* filesize are a great combination. File size is not included in the 2^16 estimate. I have yet to find two files in the real world, in the same domain (essentially type of file), with the same size and CRC32.
Be smart, use the right tool for the job. First compare file size (ignoring things like mp3 ID3 tags, or other headers). Then do two hashes of the contents - CRC32 and either MD5 or SHA1 (again ignoring well-known headers if possible). Then out of the results, you can do a byte for byte comparison, or let a human decide.
This is solely to dissuade CRC32 based identification. After all, it was designed for error detection, not identification. For a 4-byte file, my experience says CCITT standard CRC32 will work for identification. For 5 byte files, you can have two bytes swapped and possibly have the same result. The longer the file, the less likely it is to be unique.
Be smart, use size and two or more hashes to identify files. And even then, verify the contents. But don't compute hashes on every file - the operating system tells you file size as you traverse the directories, so start there.
Re:CRC by X0563511 · 2012-09-02 03:34 · Score: 1

THIS THIS THIS THIS.
Someone needs to mod this up.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:CRC by dingen · 2012-09-02 03:38 · Score: 0, Troll

Who the hell posts on Slashdot but can't write a simple script to compare hashes of files?

--
Pretty good is actually pretty bad.
Re:CRC by jones_supa · 2012-09-02 03:40 · Score: 1

Yup. The corrected version is like:
find / -type f -print0 | xargs -0 md5sum
Re:CRC by b4dc0d3r · 2012-09-02 03:42 · Score: 1

This is trivial, given examples of
- Listing directory information
- Initializing and calculating CCITT standard CRC32 values
- Sorting any container
All of these examples are on the web, the only caveat you find is in Visual Studio, where the VECTOR (and a few other) header contains bugs they could not fix due to license agreement. Containers over 64k will get sliced during a sort without dinkumware header patching.
Also, on any Linux system, and Windows with a single utility, these can be simple command-line scripts. The only real problem is symbolic linking, which may return two different results for the same file.
In other words, if you pay someone to do this, make sure they are aware of problems like symbolic links, and have patched their compiler or scripting host.
Re:CRC by propus · 2012-09-02 03:44 · Score: 1

The special cases that you describe (similar files with slight variations) are much more difficult to handle programmatically. If you expect the number of such files to be small, then I would just handle them manually after the rest of the dedupe process is done. However, if you think there would be numerous such files and would require a non-trivial amount of time to classify, then I would consider automating the step using a service such as Mechanical Turk from Amazon. With MTurk some real person is involved in the classification loop (I don't recall what they charge, but it's pennies for each classification request).
Re:CRC by fatp · 2012-09-02 03:44 · Score: 1

You mean joining 2 recordsets with 4M rows in MySQL?
Re:CRC by caluml · 2012-09-02 04:00 · Score: 1

Yes, that's something I thought about. It's a trade off, isn't it? If you have two 700MB files that are both the exact same size, but are different, the way I'm doing it is quickest. If the two 700MB files are the same, then it will probably be about the same time as CRC/MD5ing the files.

If you have many small files, then I guess the IO won't be that much anyway.
My implementation offers a parameter to ignore files smaller than a specific size, which is how I run it: java -jar finddups.jar /path 200000 (for instance).

Another commenter to your comment suggested doing it adaptively, which would be easy.

--
Get your own free personal location tracker
Re:CRC by Anonymous Coward · 2012-09-02 04:01 · Score: 3, Interesting

With 4.2 million files, given the probability of SHA-1 collisions plus the birthday paradox, there will be around 500 SHA-1 collisions which are not duplicates. SHA-512 reduces that number to 1.
Re:CRC by aaaaaaargh! · 2012-09-02 04:07 · Score: 1

Also, you don't have to read the whole file to make use of checksums, CRCs, hashes and the like. Just check a few pieces likely to be different if the files are different, such as the first and last 2000 bytes.
Coincidentally I've just experimented with this method and found a number of files with just 0s both at the beginning and at the end. I believe it's better to take the bytes from the middle of the file, e.g. sha1 8096 bytes around the middle.
Re:CRC by BasilBrush · 2012-09-02 04:14 · Score: 5, Insightful

Someone whose technical expertise is in areas other than writing script files. There are technical jobs other than being a sysop you know.
Re:CRC by Anonymous Coward · 2012-09-02 04:32 · Score: 0

Use hardlinking of your duplicate files in linux, so you can keep your filestructure intact, but remove the extra usage of storage. It's then easy to delete old backup dirs: just delete every file with a hard link count > 1. The remaining files in such old backup dirs seem to be unique and are probably worth looking after.
Re:CRC by Anonymous Coward · 2012-09-02 04:33 · Score: 2, Informative

Actually, this is an instance where lots of random IO will bog you down when comparing a bunch of files. His 4+ TB divided by 4.2M files is roughly 1MB average file size, which really isn't that much content to access per random seek. A naive all-to-all comparison will cause a lot of random IO, so you really need to generate a batch file listing with per-file metadata and then analyze the listings efficiently. Adding checksum info to this batch listing is actually not that costly and allows the entire de-dupe analysis to be performed with no further disk IO. Even if we assume 1kB per file of name, size, and checksum info (it's probably a lot less), the whole listing is around 4GB which can be largely cached in RAM for analysis.
When I had this same problem on Linux, I did two scans of the entire file set using the 'find . -type f -exec cmd {} +' command to automatically run 'stat' and 'md5sum' on batches of files, then I merged these scan results to have one table of information per file. You could do all of this by processing files (e.g. sort and join on Unix) but it is more efficient to just import the data into sqlite or another database and do it there. In my case, I grouped files by size and checksum, also sorting the group members by name length, preferring the shortest name as the "original" file, since the names tended to get longer with each redundant backup copy adding some other top-level directory name to the original file name.
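For illustration, the grouping step could look like this with Python's built-in sqlite3 module. This is a sketch, not the script I actually ran: the table layout, the plan_dedupe name and the (path, size, md5) row format are assumptions, with the rows coming from whatever merged the stat and md5sum scans.

```python
import sqlite3


def plan_dedupe(rows):
    """rows: iterable of (path, size, md5). Returns {duplicate_path: original_path}."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER, md5 TEXT)")
    db.executemany("INSERT INTO files VALUES (?, ?, ?)", rows)
    plan = {}
    original = {}
    # within each (size, md5) group the shortest path sorts first
    query = "SELECT path, size, md5 FROM files ORDER BY size, md5, length(path), path"
    for path, size, md5 in db.execute(query):
        key = (size, md5)
        if key in original:
            plan[path] = original[key]  # later, longer names are the secondary copies
        else:
            original[key] = path
    db.close()
    return plan
```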
The reason I ran two scans was that I was too lazy to implement a hybrid command to efficiently run 'md5sum' and 'stat' as one utility. It would have taken me longer to develop and test the utility enough to trust it than to just run it with the existing utilities. In the end, the scan with md5sum did not take that much longer than the scan with stat, because the overall time is dominated by digging around vast directory hierarchies and randomly accessing file metadata, versus the bulk sequential access pattern used to perform the checksum once each file was found. If you monitor the system while these commands run, there is steady high-bandwidth disk access for the duration of the md5sum scan, while there is steady disk seeking with very little bandwidth for the duration of the stat scan. Neither scan saturates a CPU.
Another question is what to do with the results of analysis. One option is to delete all but one copies of each length/checksum group, and assume you would use the database information in the future if you ever need to reconstitute one of the deleted hierarchies. Or you could turn all secondary references into hard-links to the same file, retaining the original hierarchies as accessible file trees. Or, as I chose to do, you can replace secondary references with symbolic links to the primary copy, which is close enough to preserving the original hierarchy for most programmatic access but is also self-documenting the fact that it is a secondary name for the same file at the other end of the link.
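A hedged Python sketch of the link replacement, assuming a POSIX system and that the two files have already been verified identical. replace_with_link and the ".dedupe-tmp" suffix are made up for this example.

```python
import os


def replace_with_link(original, duplicate, symbolic=True):
    """Swap `duplicate` for a link to `original`; the caller must have verified equality."""
    temp = duplicate + ".dedupe-tmp"  # hypothetical temp name next to the duplicate
    if symbolic:
        os.symlink(os.path.abspath(original), temp)
    else:
        os.link(original, temp)  # hard link: both paths must be on one filesystem
    # rename over the duplicate so the name never disappears part-way through
    os.replace(temp, duplicate)
```

Building the link under a temporary name first means a crash leaves either the old copy or the new link in place, never a missing file.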
Re:CRC by Anonymous Coward · 2012-09-02 04:35 · Score: 0

Rather than spending even 30 minutes writing the extra code, I just let my computer work on md5sums of everything for a few days.
It works perfectly even in case the first and last 1k are the same or the dates are different but the files are the same or...
Yes Virginia, with enough CPU time you can make md5sum hashes fail, but this does not happen in real life except via bad hackers.
Re:CRC by joocemann · 2012-09-02 04:36 · Score: 1

Was there anything wrong with the idea of letting the deduping software complete the task?
It sounds like impatience is to blame. 4.9TB is a lot of data.
Hoarders: Digital Edition.
I bet you over half the data will never be accessed for any other reason than copying/storing.
Re:CRC by mlts · 2012-09-02 04:41 · Score: 1

This is the exact reason why IBM does not use any hashing when it comes to their deduplication algorithms.
Re:CRC by TheGratefulNet · 2012-09-02 04:43 · Score: 2

I usually use:
find . -type f -exec md5sum {} \; > /tmp/files.md5.txt
you can check back with that file:
md5sum -c /tmp/files.md5.txt

--
"It is now safe to switch off your computer."
Re:CRC by Anonymous Coward · 2012-09-02 04:43 · Score: 0

What are you blathering about? Database joins, noob users and minimizing latency... You either decided to ignore the OPs question and just spew garbage about the question you wished he had asked, or you are too lazy to read his question. Good morning. You are that guy.
Re:CRC by Anonymous Coward · 2012-09-02 04:45 · Score: 0

I want to hire you. For everything.
Re:CRC by Goaway · 2012-09-02 04:46 · Score: 2

I don't know where you are finding these numbers, but they are about as wrong as it is possible to get.
There is no known SHA-1 collision yet in the entire world. You're not going to find 500 of them in your dump of old files.
Re:CRC by fm6 · 2012-09-02 04:54 · Score: 1

Small problem: sorting a flat file with 4.2 million records takes a lot of time and space.
Re:CRC by hsmyers · 2012-09-02 04:56 · Score: 1

Not being a Java programmer (at least not knowingly), I was puzzled by this:
C:\>java Main.java C:\
Exception in thread "main" java.lang.NoClassDefFoundError: Main/java
Caused by: java.lang.ClassNotFoundException: Main.java
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
Could not find the main class: Main.java. Program will exit.
Could you explicate?
--hsm
Re:CRC by zedeler · 2012-09-02 04:57 · Score: 1

I have the same issue and decided on this strategy:
Decide if two files are different by first comparing sizes, then comparing the first four bytes, then comparing a hash of the first 4k of the file, if the file is large, then a hash of the first mb, and then finally a hash of the whole file.
If any of the equality tests fails, the files are different.
Comparing files as caluml suggests below is prohibitive when you want to dedup a very large number of files, since the number of compare operations has complexity n^2, which essentially means that each file will be opened for reading an order of n times. The checksum approach will reduce the number of opens to a maximum of 4 per file. The rest can be done working on the hashes.
The data needed to dedup is in the range of 4 bytes per hash (CRC32), 4 bytes for the size and, with some ingenuity, some 4-10 bytes for the path (using a trie to compress the path tree). Thus a total of some 30 bytes per file. Deduping millions of files using in-memory data structures like this shouldn't be a problem.
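The staged strategy above can be sketched roughly as follows. This is an illustration under assumptions, not the poster's code: it uses SHA-1 from hashlib instead of CRC32, runs the 1 MB stage for every file rather than only large ones, and the names STAGES and staged_duplicates are made up for the example. Each stage splits the surviving groups further and drops any file that ends up alone.

```python
import hashlib
import os
from collections import defaultdict

def _prefix_hash(path, limit):
    """Hash the first `limit` bytes of a file (the whole file if limit is None)."""
    h = hashlib.sha1()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            want = 65536 if remaining is None else min(65536, remaining)
            chunk = f.read(want)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.digest()

def _first_bytes(path):
    with open(path, "rb") as f:
        return f.read(4)

# Cheapest test first; each later stage only sees files that still match.
STAGES = [
    os.path.getsize,
    _first_bytes,
    lambda p: _prefix_hash(p, 4096),
    lambda p: _prefix_hash(p, 1 << 20),
    lambda p: _prefix_hash(p, None),
]

def staged_duplicates(paths):
    """Return groups of paths that agree on every stage."""
    groups = [list(paths)]
    for key in STAGES:
        refined = []
        for group in groups:
            buckets = defaultdict(list)
            for p in group:
                buckets[key(p)].append(p)
            refined.extend(g for g in buckets.values() if len(g) > 1)
        groups = refined
    return groups
```

A file that is unique by size is never opened at all, which is where most of the saving comes from.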
Re:CRC by gerf · 2012-09-02 04:57 · Score: 1

doublekiller does it all for you, and it is free. Ignore small files that often have false positives, select which folders to scan, and match hash and/or size and/or file name.
Re:CRC by Anonymous Coward · 2012-09-02 04:58 · Score: 0

I wouldn't use the extension as a discriminator. I also wouldn't recommend letting the program handle the duplicates it finds automatically. Symlinking/hardlinking isn't a horrible idea, and is certainly better than just deleting most of the files. But there are use cases where it doesn't do the right thing.
Say, for example, that you've created two Rails projects. They're going to become very different from each other, but right now there's just a bunch of placeholder code. Because the files happen to match up right now, that doesn't mean linking them is a good idea.
Re:CRC by robogun · 2012-09-02 04:59 · Score: 3, Funny

I looked at this as I, like the subby, have terabytes of porn to sort.
But $19.95 for a beta?
Re:CRC by caluml · 2012-09-02 05:08 · Score: 1

You need to compile the .java files to .class files.

You'll need something called javac (which comes with the Java Development Kit (JDK)).
It's not super easy getting the hang of it at first.

Clone the git repo, run mvn clean install (you'll need to download Maven), and then you should end up with a JAR file. Then run java -jar finddups.jar and things should start to happen.

Or should I just commit a JAR file to Github?

--
Get your own free personal location tracker
Re:CRC by Anonymous Coward · 2012-09-02 05:15 · Score: 0

was gonna say this. Linux utilities FTW
Re:CRC by Anonymous Coward · 2012-09-02 05:16 · Score: 0

>> I need to put up a picture frame?
Exactly.
1. Build a kiln.
2. Forge a hammer.
3. Forge some nails.
4. Hammer the nails into the wall.
5. Play with inner joints.
Re:CRC by iluvcapra · 2012-09-02 05:19 · Score: 5, Insightful
First compare file size (ignoring things like mp3 ID3 tags, or other headers).
I once had to write an audio file de-duplicator; one of the big problems was you would ignore the metadata and the out-of-band data when you did the comparisons, but you always had to take this stuff into account when you were deciding which version of a file to keep -- you didn't want to delete two copies of a file with all the tags filled out and keep the one that was naked.
My de-duper worked like everyone here is saying -- it cracked open wav and aiff (and Sound Designer 2) files, captured their sample count and sample format into a sqlite db, did a couple of big joins and then did some SHA1 hashes of likely suspects. All of this worked great, but once I had the list I had the epiphany that the real problem of these tools is the resolution and how you make sure you're doing exactly what the user wants.
How do you decide which one to keep? You can just do hard links, but...
- The users I was working with were very uncomfortable with hard links, they didn't really understand the concept and were concerned that it made it difficult to know if you were "really" throwing something away when you dragged something to the trash. (It's stupid but it was their box.)
- Our existing backup/archival software wouldn't do the right thing with hard links, so it'd save no space on the tapes.
- Our audio workstation software wouldn't read audio off of files that were hard links on OS X (because hard links on OS X aren't really hard links; I believe our audio workstation vendor has since resolved this).
But let's say you can do hard links, no problem. How do you decide which instance of the file is to be kept, if you've only compared the "real" content of the file and ignored metadata? You could just give the user a big honking list of every set of files that are duplicates -- two here, three here, six here, and then let them go through and elect which one will be kept, but that's a mess and 99% of the time they're going to select a keeper on the basis of which part of the directory tree it's in. So, you need to do a rule system or a preferential ranking of parts of the directory hierarchy that tell the system "keep files you find here." Now, the files will also have metadata, so you also have to preferentially rank the files on the basis of its presence -- you might also rank files higher if your guy did the metadata tagging, because things like audio descriptions are often done with a specialized jargon that can be specific to a particular house.
Also, it'd be very common to delete a file from a directory containing an editor's personal library, and replace it with a hard link to a file in the company's main library -- several people would have copies of the same commercial sound, or an editor would be the recordist of a sound that was subsequently sold to a commercial library, or whatever. Is it a good policy to replace his file with a hardlink to a different one, particularly if they differ in the metadata? Directories on a volume are often controlled by different people with different policies and proprietary interest in the files -- maybe the company "owns" everything, but it still can create a lot of internal disputes if files in a division or individual project's library folder start getting their metadata changed, on account of being replaced with a hard link to a "better" file in the central repository. We can agree not to de-dup these, but it's more rules and exceptions that have to be made.
Once you have the list of duplicates, and maybe the rules, do you just go and delete, or do you give the user a big list to review? And, if upon review, he makes one change to one duplicate instance, it'd be nice to have that change intelligently reflected on the others. The rules have to be applied to the dupe list interactively and changes have to be reflected in the same way, otherwise it becomes a miserable experience for the user to de-dupe 1M files over 7 terabytes. The resolution of duplicates is the hard part, the finding of dupes is relatively easy.
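To make the "preferential ranking" idea concrete, here is a small sketch under assumptions: each copy is a dict with a 'path' and a 'tags' dict, preferred_roots is an ordered list of directory prefixes, and directory preference outranks metadata completeness. All of those choices, and the name choose_keeper, are hypothetical; a real tool would need the interactive review and exceptions described above.

```python
def choose_keeper(copies, preferred_roots):
    """Pick which duplicate to keep.

    copies: list of dicts with 'path' (str) and 'tags' (dict of metadata).
    preferred_roots: directory prefixes, most preferred first.
    """
    def rank(copy):
        # Lower is better: index of the first matching preferred root,
        # or len(preferred_roots) if the copy is outside all of them.
        root_rank = len(preferred_roots)
        for i, root in enumerate(preferred_roots):
            if copy["path"].startswith(root):
                root_rank = i
                break
        # More non-empty metadata fields is better, hence the negation.
        filled = sum(1 for value in copy["tags"].values() if value)
        return (root_rank, -filled, copy["path"])
    return min(copies, key=rank)
```

The path is only used as a final tie-breaker so the result is deterministic.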
--
Don't blame me, I voted for Baltar.
Re:CRC by HornWumpus · 2012-09-02 05:22 · Score: 1

Not the OP. 4.9TB is a lot of data. It is not a lot of porn (OK it's kind of a lot of porn too).

--
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
Re:CRC by scubamage · 2012-09-02 05:33 · Score: 1

Don't use md5: with a huge base of hashes you run into the possibility of a conflicting hash getting generated (it's an astronomically low chance, but it is there). sha-1 is safer.
Re:CRC by Lulu+of+the+Lotus-Ea · 2012-09-02 05:34 · Score: 0

Other commenters have observed your big-O inefficiency here. I had the same initial approach in my script http://gnosis.cx/bin/find-duplicate-contents (that I've mentioned in another post), but changed it when I realized it took hundreds of times as long to run as the correct hash-and-sort approach.
The problem is that if you have many files of the same size, you need to do a pairwise comparison of every pair from among them. This multiplies the operations enormously in what I found to be typical cases on my filesystem. For example, if you have 50 files that all are *exactly* 1GB in size, you need to do "50 choose 2" comparisons, i.e. 1225 of them. If the files wind up differing in the first few bytes (few meaning even first few megabytes), that's still possibly cheaper than hashing the whole files. However, if genuine duplicates exist in many of those cases--which is more typical--you wind up having to read through the entire identical content many times.
In contrast, if when you find same-sized files you commit to making exactly one read of each of them (but indeed, the entirety of them), you can store an MD5 or other hash of the whole thing, and then just compare those short MD5 sums. In actual testing and refinement, I have found this to be a hell of a lot cheaper (as in tens or hundreds of times cheaper in my actual test cases). Of course, the actual answer depends on what files of what sizes you actually have, and it is easy to construct artificial cases in which my approach (I didn't invent it, of course) loses. But not in my real-life experience.
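The arithmetic behind that comparison, as a small sketch. The helper names are made up, and the pairwise figure assumes the naive worst case where every pair is compared and every compared pair is identical, so both files are read to the end.

```python
import math

def pairwise_comparisons(n):
    """Number of distinct pairs among n same-sized files ("n choose 2")."""
    return math.comb(n, 2)

def worst_case_full_reads(n, pairwise):
    """Full-file reads needed when all n files turn out to be identical."""
    if pairwise:
        # Each comparison reads both files completely.
        return 2 * math.comb(n, 2)
    # Hash-and-compare reads every file exactly once.
    return n
```

For the 50-file example this gives 1225 comparisons and 2450 full reads for the naive pairwise approach, against 50 reads for hashing.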

--
Buy Text Processing in Python
Re:CRC by hsmyers · 2012-09-02 05:39 · Score: 1

I'd vote yes to "Or should I just commit a JAR file to Github". I'd add a read.me suitable for both experts and idiots like me :)
--hsm
Re:CRC by Anonymous Coward · 2012-09-02 05:42 · Score: 0

That fails for dupes on different partitions.
Also fails if the filesystem doesn't support hardlinks, such as FAT.
Re:CRC by Anonymous Coward · 2012-09-02 05:49 · Score: 1

you should try "find . -type f -exec md5sum {} \+" as it is quite a bit faster for lots of small files, since it does many fewer fork/exec calls.
Re:CRC by IICV · 2012-09-02 05:56 · Score: 3, Insightful

$19.95 for a beta of something you can whip up in about an hour of shell scripting.
Hell, I wrote exactly what people are talking about here in an afternoon in college - I even did both SHA and MD5, because I ended up finding a SHA collision between one of the Quake 3 files and a Linux system file.
Re:CRC by IICV · 2012-09-02 05:59 · Score: 1

One time I wrote this sort of script, and discovered that there's an MD5 collision between one of the Linux system files and a Quake 3 Arena game file (I checked, the contents were different).
I'm glad I didn't run the deletion part of it without checking the uniqueness output first.
Re:CRC by SWPadnos · 2012-09-02 06:00 · Score: 1

Good ideas here, but your first assumption suffers from over-optimization :)
The Core i7 CPU the OP has should be able to do MD5 or SHA256/512 sums at a rate of at least 100 MBytes/second. (See here and here.) Any reasonably modern storage system should be able to feed data that quickly.
At 100 MBytes/sec, 1GB takes 10 seconds, 1TB takes 10,000 seconds. 10,000 seconds is somewhat under 3 hours (800 seconds less), so let's assume 3 hours per TB. With 5TB to hash, it should take around 15 hours, or one overnight plus a little bit.
Surprisingly, it's not that big a problem for a modern PC.
I would make a series of tables as you suggested, split by size. Maybe have separate tables by order of magnitude, FS<1k, 1K<FS<10K, 10K<FS<100k ... In each table, store the file size, file path, create/modify date (may not be accurate, but could be useful), and the hash for that file.
After the first run, this will also provide a mechanism for determining if files have changed.
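The time estimate above is easy to redo for other numbers. A tiny sketch, assuming decimal units (1 TB = 10^12 bytes) and a sustained 100 MB/s; hash_hours is just an illustrative name.

```python
def hash_hours(total_bytes, bytes_per_second=100e6):
    """Hours needed to read and hash total_bytes at a sustained rate."""
    return total_bytes / bytes_per_second / 3600

one_tb = hash_hours(1e12)   # about 2.8 hours
five_tb = hash_hours(5e12)  # about 13.9 hours, under the rounded 15
```

If the drives only deliver USB2-class throughput, plugging in a lower bytes_per_second shows how quickly the estimate grows.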

--
- The Sigless Wonder
Re:CRC by cpu6502 · 2012-09-02 06:16 · Score: 1

I have a better idea:
DELETE FILES
If you're like me you probably have a lot of old movies or TV shows that you never watch, due to lack of spare time. Last spring I went through my whole drive deleting the things I knew I had no interest in keeping. (And watching some items, discovering they were junk like Transformers2, and trashing them forever.)
Then I sorted through the remaining files dividing them into top folders like Movies, TV shows, Books, Music, etc. Then just by glancing through each directory I could immediately see the duplicates and erase them.
Took two weekends overall. My drive was 1 terabyte, but now it's down to 300 gigs of stuff, which is organized in a fashion that I can actually find the things I need.

--
My AC stalker: " I personally agree with your posts most of the time, but that won't keep me from modding you troll"
Re:CRC by Dr_Barnowl · 2012-09-02 06:28 · Score: 1

There is no known SHA-1 collision yet in the entire world.
There's a guy further up the thread that claims to have found one ... but he doesn't provide adequate detail to reproduce it.
Re:CRC by marcosdumay · 2012-09-02 06:32 · Score: 1

Better this way:
1 - Compare file sizes
2 - When a set of files have the same size you seed that size into a pseudo-random number generator and gather the first 3 numbers it generates that are within the size. You hash those blocks and refine your set.
3 - After step 2 gave you a set of matching files, refine hashing the entire files
4 - The sets that step 3 gave you are duplicated. Now comes the hardest step, dedup them. I can't tell you how, since everybody wants something different here.
I don't know of any tool that does that. Last time I needed it, I used a bunch of perl scripts.
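A rough Python sketch of step 2, under assumptions: the block size of 4096 bytes, SHA-1, and the function names are all choices made for the example, and offsets are drawn directly inside the valid range rather than discarding out-of-range numbers. Because the generator is seeded with the file size, every file of that size is sampled at the same offsets, so the sampled hashes are comparable.

```python
import hashlib
import os
import random

BLOCK = 4096  # assumed block size

def sample_offsets(size, count=3, block=BLOCK):
    """Deterministic block offsets for a given file size."""
    rng = random.Random(size)          # same size -> same offsets
    last_start = max(size - block, 0)  # keep whole blocks inside the file
    return sorted(rng.randint(0, last_start) for _ in range(count))

def sampled_hash(path, count=3, block=BLOCK):
    """Hash a few pseudo-randomly chosen blocks instead of the whole file."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for offset in sample_offsets(size, count, block):
            f.seek(offset)
            h.update(f.read(block))
    return h.hexdigest()
```

Files whose sampled hashes differ are certainly different; files whose sampled hashes match still need the full hash from step 3.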

--
Rethinking email
Re:CRC by Anonymous Coward · 2012-09-02 06:37 · Score: 0

You sir are a first class douche. Slap yourself and stfu
Re:CRC by Anonymous Coward · 2012-09-02 06:51 · Score: 0

Just checksum them all (and save the checksum db). Yes it'll take a long time, but you do have 4TB of data. If you're using your own tool you can at least put in a decent progress meter (% of files done). It's better because you can checksum the whole drive once, then test the best way to de-dupe and organize the data several times w/o having to re-checksum.
Re:CRC by dingen · 2012-09-02 07:01 · Score: 0

I've been trying for quite some time to score a "+5 Flamebait" or "+5 Troll", but with little success so far. This might be my best attempt yet though.

--
Pretty good is actually pretty bad.
Re:CRC by dingen · 2012-09-02 07:03 · Score: 0

Writing scripts has nothing to do with being a system administrator. If you don't use your computer to automate trivial yet repetitive tasks, then what the hell do you have it for in the first place?

--
Pretty good is actually pretty bad.
Re:CRC by xigxag · 2012-09-02 07:10 · Score: 4, Informative

With 4.2 million files, given the probability of SHA-1 collisions plus the birthday paradox and there will be around 500 SHA-1 collisions which are not duplicates.
That's totally, completely wrong. The birthday problem isn't a breakthrough concept, and the probability of random SHA-1 collisions is therefore calculated with it in mind. The number is known to be 1/2^80. This is straightforwardly derived from the total number of SHA-1 values, 2^160, which is then immensely reduced by the birthday paradox to 2^80 expected hashes required for a collision. This means that a hard drive with 2^80 or 1,208,925,819,614,629,174,706,176 files would have on average ONE collision. Note that this is a different number than the number of hashes one has to generate for a targeted cryptographic SHA-1 attack, which with best current theory is on the order of 2^51 for the full 80-round SHA-1, although as Goaway has pointed out, no such collision has yet been found.
Frankly I'm at a loss as to how you arrived at 500 SHA-1 collisions out of 4.2 million files. That's ludicrous. Any crypto hashing function with such a high collision rate would be useless. Much worse than MD5, even.
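For anyone who wants to check the orders of magnitude, here is a back-of-the-envelope sketch. It assumes hash outputs behave like uniformly random values, which is the usual birthday-problem model and says nothing about deliberate attacks; expected_collisions is an illustrative name.

```python
def expected_collisions(n_files, hash_bits):
    """Expected number of colliding pairs among n_files random hashes."""
    pairs = n_files * (n_files - 1) / 2
    return pairs / 2.0 ** hash_bits

crc32_pairs = expected_collisions(4_200_000, 32)    # roughly two thousand
md5_pairs = expected_collisions(4_200_000, 128)     # on the order of 1e-26
sha1_pairs = expected_collisions(4_200_000, 160)    # on the order of 1e-36
```

Under this model a 32-bit checksum does produce accidental collisions at 4.2 million files, while for MD5 and SHA-1 the expected count is effectively zero.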

--
There are two kinds of people: 1) those who start arrays with one and 1) those who start them with zero.
Re:CRC by blueg3 · 2012-09-02 07:13 · Score: 1

For this purpose, there's really no point in using CRC32 at all. The disk-access cost so far outweighs any computational cost (at least on this guy's hardware) that you might as well use SHA-512.
As you mention, CRC32 has much too high a chance of collision. MD5 has essentially no chance of collision unless files are specifically designed that way. But since you have plenty of free computational time, you might as well use SHA-512, since it has the smallest chance of an accidental collision and no known way of creating intentional collisions.
Re:CRC by Captain+Segfault · 2012-09-02 07:19 · Score: 1

[citation needed]
A bug in your script is far more likely than a collision between two files in full 128 bit md5, barring a deliberate attack on md5 to create the collision.
Re:CRC by Anonymous Coward · 2012-09-02 07:52 · Score: 0

If $19.95 is too steep for you, write your own. Unless your time is "free" you'll be spending far more.
Re:CRC by __aaltlg1547 · 2012-09-02 08:19 · Score: 1

I think comparing the 100-byte chunks would be almost as effective and save 99.9% of the file-compare time. One can resort to more exacting methods in the small fraction of the time when they match.
Re:CRC by Surt · 2012-09-02 08:21 · Score: 4, Insightful

$19.95 for a beta of something you can whip up in an hour of shell scripting.
If the poster were you, they wouldn't have had to 'ask slashdot'.

--
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
Re:CRC by Surt · 2012-09-02 08:25 · Score: 1

May as well compare the first 8 bytes, you can do that in a single instruction, and the cost to read 8 should be the same as 4. If you're clever enough you can probably do a 32 byte comparison in a single op.

--
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
Re:CRC by Surt · 2012-09-02 08:27 · Score: 1

Given he's not interested in security but uniqueness, a Rabin Fingerprint would be both faster and better for this purpose than any of the security hashes.

--
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
Re:CRC by Anonymous Coward · 2012-09-02 08:42 · Score: 0

When I saw this story, I thought about it for a few minutes, read some of the comments, wrote some requirements, thought some more, posted a few AC comments - 1 rude - then opened vim and started writing some perl.
I code perl of this complexity about 5 hours a month, so I'm not a professional anymore. The last time I was paid for perl code was in 2000.
After about 45 minutes in a debugger, knocked out the last bug. I was getting hash collisions. My test data was only 2K files. It found lots of dups that made perfect sense and some that surprised me. The script runs nearly instantaneously. On my main server, there's only 6TB of files.
It only does an md5sum on files with the same size and extension. The collision happened for 2 different image files - they were very different. Not even close by human standards. The collision was unexpected. Of course, the debugger helped me find the issue. It was strange that for 2K files, this happened just once.
Anyway, it was more than an hour of scripting for me, but I was relearning perl and watching some TV.
$19.95 seems like a high price for this, but my code actually works. I can see lots of addons - side by side compares for text files, img files, and videos. Leave intermediate working temp-files around ... and adding a GUI ... all should be pretty simple.
Or perhaps I'll simply blog about this script and building it for a week. This is the perfect "itch" for someone wanting to learn scripting to scratch. Pick a language - ruby, python, perl, bash, powershell - this is a hard enough problem to be challenging, but not so hard that it won't be useful later. The feeling of accomplishment ...
Even working through the pseudocode is a learning experience for people who don't want to learn to script.
Re:CRC by caluml · 2012-09-02 09:19 · Score: 1

Ask, and ye shall receive.

Let me know how it goes for you. Especially if you're not on Linux, as I've only tried it on Linux, and I'm not sure how the symlink detection works on other OSes.

--
Get your own free personal location tracker
Re:CRC by caluml · 2012-09-02 09:24 · Score: 1

if you have 50 files that all are *exactly* 1GB in size
Hmm. To the byte?

I very rarely find any similarly sized files that large, and those that I do, there are usually only two of them. Usually, these are videos, or audio files that I've copied/rsynced around in such a way that they ended up in two places.
Of course, everyone's usage will be unique, but I can't imagine finding that scenario being common.

--
Get your own free personal location tracker
Re:CRC by SuricouRaven · 2012-09-02 09:47 · Score: 1

I've faced exactly this problem in my own deduping experiments. I found two techniques really made it more practical. Firstly, I used a modified bloom filter* to eliminate most of the records that I was certain contained no duplicates. Then I used a radix sort. Not the most efficient sort around, but its access pattern is very linear, which made it ideally suited to storing the tables on disk. Something like quicksort would need fewer operations, but would also thrash like crazy.

*You can use a tristate bloom, or two normal blooms chained - they function identically.
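One way to read the "two normal blooms chained" idea, sketched in Python. This is a guess at the technique, not the poster's code: the filter size, the use of SHA-256 to derive bit positions, and the names Bloom and possible_duplicates are all assumptions. The first filter records keys seen at least once, the second records keys seen again, and a key absent from the second filter definitely has no duplicate.

```python
import hashlib

class Bloom:
    """Plain Bloom filter over bytes keys (no deletions)."""

    def __init__(self, bits=1 << 20, hashes=4):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8 + 1)

    def _positions(self, key):
        # Slice one SHA-256 digest into several 32-bit bit positions.
        digest = hashlib.sha256(key).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.array[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        return all(
            self.array[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(key)
        )

def possible_duplicates(keys):
    """Return the keys that may be duplicated; the rest are certainly unique."""
    keys = list(keys)
    seen_once, seen_twice = Bloom(), Bloom()
    for key in keys:
        if key in seen_once:
            seen_twice.add(key)
        else:
            seen_once.add(key)
    return [key for key in keys if key in seen_twice]
```

False positives only mean a unique record survives to the sort stage; a real duplicate is never filtered out.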
Re:CRC by Pieroxy · 2012-09-02 10:32 · Score: 1

You can check a few files in a directory and then easily deduce the whole directory is a dupe. You don't have to do it file by file.
How can you be sure a single file wasn't added to one duplicate?
By the list of files you see as being duplicates in said directory?

--
Write boring code, not shiny code!
Re:CRC by Anonymous Coward · 2012-09-02 10:33 · Score: 0

Agreed. See man fdupes and/or man hardlink.
Re:CRC by Pieroxy · 2012-09-02 10:33 · Score: 1

Yes, why? Not able to do it yourself?
BTW, it's only one big table, it's not two distinct recordsets.

--
Write boring code, not shiny code!
Re:CRC by euxneks · 2012-09-02 10:55 · Score: 1

It'd be easier to write to a database which uses the checksum as the index, you don't have to sort it then, and you can check for (possible) dupes very quickly.

--
in girum imus nocte et consumimur igni
Re:CRC by fm6 · 2012-09-02 11:04 · Score: 1

Did you code the filter yourself? If not, some links would be helpful.
Re:CRC by garglblaster · 2012-09-02 11:04 · Score: 1

Or just by file size first, then do a hash. No need to compute a hash to compare a 1mb file and a 1kb file.
That's exactly right.
Some time ago (years?) I came across a little book called "wicked cool perl scripts" (do a google, sure you'll find it..) In one of the first chapters there is a script that does exactly what you describe (dupfiles.pl if I remember right):
To find duplicate files it goes down the directory hierarchy, sorts the files by size and does a md5sum comparison of files with similar size.
I tried it out once in awhile and it always worked out very well for me.
The scripts themselves can be downloaded from the publisher's website. Have a look if you're interested -
(btw. I do not have any commercial interests in this recommendation..)

--
perl -e 'printf("%x!\n",49153)'
Re:CRC by hsmyers · 2012-09-02 11:05 · Score: 1

Running under Windows Vista I get:
C:\>java -jar finddups.jar / 0 >dups.txt
java.lang.NullPointerException
at finddups.FileFinder.findFiles(FileFinder.java:25)
at finddups.FileFinder.findFiles(FileFinder.java:33)
at finddups.FileFinder.findFiles(FileFinder.java:33)
at finddups.Main.main(Main.java:31)
dups.txt has a copy of the command line and nothing else.
--hsm
Re:CRC by Anonymous Coward · 2012-09-02 11:20 · Score: 0

I do not believe you found a real SHA1 collision in the wild.
Re:CRC by yakatz · 2012-09-02 11:26 · Score: 2

And use a Bloom Filter to easily eliminate many files without doing a major comparison of all 100 checksums.
Re:CRC by ultranova · 2012-09-02 11:41 · Score: 1

Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).

But how much disk space will this really free you? Remember, Moore's law works on hard drives too (I still remember my first 120MB drive, vs. the 4 terabytes I have now), so the space taken up by all of your duplicated data is shrinking towards insignificance at an exponential rate. So, I propose the Ultranova way of managing disk space:
Just let it go.
No matter what you do, you are never going to recover amounts of storage space that would make any kind of real difference nowadays, so why bother about it? Simply accept that you have multiple copies of old, low-resolution digital photos, and understand that they are occupying a percentage of your hard drive that simply makes no difference whatsoever - the space they take up hardly amounts to a rounding error, and will only keep on getting more insignificant over time. So let them. The cost of electricity to run a deduplicating program will likely exceed the cost of hard drive space they're occupying nowadays, even if you're using RAID to get redundancy.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:CRC by Zocalo · 2012-09-02 11:48 · Score: 2

Sure, yet it didn't. Reading between the lines, and seeing phrases like "multiple drives", "attached to PC", "last decade", I think it safe to say that we are most definitely not talking about a reasonably modern storage system that can do 100MB/s, or it wouldn't have taken a week for the first pass. It seems much more likely that the poster has a whole bunch of external backup drives, most probably USB2, hence their first attempt was probably seriously I/O bound. That means doing as much of the de-duplication as possible without reading in the raw data, just the file tables, and starting with the larger files so that you can move whatever is left over for the final pass that will need the files CRC'd (or hashed) onto the fastest available media.

--
UNIX? They're not even circumcised! Savages!
Re:CRC by js33 · 2012-09-02 11:50 · Score: 1

Let me say, CRC32 is a very good algorithm. So good, I'll tell you how good. It is 4 bytes long, which means in theory you can change any 4 bytes of a file and get a CRC32 collision, unless the algorithm distributes them randomly, in which case you will get more or less.
I naively tried to reverse engineer a file from a known CRC32. Optimized and recursive, on a 333 MHz computer, it took 10 minutes to generate the first collision. Then every 10 minutes or so. Every 4 bytes (last 4, last 5 with the original last byte, last 6 with original last 2 bytes, etc) there was a collision.
CRC is only good for what it's designed for: to detect random bit-flipping errors due to noise. It has no cryptographic properties whatsoever. CRC is nothing more than polynomial long division mod 2. It really is nothing more than a straightforward algebra problem to modify the 4 bytes at any given position in any given file to generate any desired CRC32 checksum. Brute force is totally unnecessary.
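The "straightforward algebra" point can be seen directly from Python's zlib.crc32. Because CRC is polynomial division over GF(2), CRC32 is an affine function of the message bits: for equal-length messages, the CRC of a XOR b equals crc(a) XOR crc(b) XOR crc(all zeros). The sketch below only checks that identity; it does not forge a checksum, and the helper names are made up.

```python
import zlib

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def crc_is_affine(a, b):
    """Check crc(a ^ b) == crc(a) ^ crc(b) ^ crc(zeros) for equal lengths."""
    if len(a) != len(b):
        raise ValueError("messages must have the same length")
    zeros = bytes(len(a))
    return zlib.crc32(xor_bytes(a, b)) == (
        zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(zeros)
    )
```

No cryptographic hash has a relation like this, which is why solving for four bytes that yield a chosen CRC32 is linear algebra rather than a search.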
Re:CRC by Anonymous Coward · 2012-09-02 11:59 · Score: 0

A common effect in scientific circles is large file formats that lack compression and so have very regular sizes derived from multiples of some internal block size or some multi-dimensional grid size. Imagine a specialized camera or other instrument that dumps raw images... it always has the same number of pixels and if the header part is a fixed/padded out size, then all files from the same instrument always have the same size.
Re:CRC by BasilBrush · 2012-09-02 12:13 · Score: 0

Writing scripts has nothing to do with being a system administrator.
That's like saying goals are nothing to do with football.

If you don't use your computer to automate trivial yet repetitive tasks, then what the hell do you have it for in the first place?
Most people use applications to achieve that. For example a spreadsheet app automates trivial yet repetitive tasks.
About 0.00001% of computer users write scripts to achieve their ends.
It's amusingly psychopathic of the slashdot community that nearly all contributions to the question discussed how to (waste time and) write a script to solve the problem, with very few considering whether there's an alternative app out there already. The questioner had used an app which choked on the amount of data he has. There's a good chance there are others out there that wouldn't.
I can certainly find de-dupe apps faster than you can create a script that solves his problem.
Re:CRC by Anonymous Coward · 2012-09-02 12:19 · Score: 1

I've sorted 100M record text files using the GNU 'sort' command by setting an appropriate buffer size, fast temporary file location, and the parallel sort option to use multiple CPU cores. But loading the data into postgres or sqlite is another option to get quite robust sorting on a desktop environment.
Also, merge sort works fine here, since it is a streaming algorithm, so it just needs a bit of read-ahead and write-behind buffering to allow good sequential disk accesses. I don't know why so many people fixate on quick sort when merge sort is a much more reliable and scale-free algorithm in practical systems.
Most implementations will not run merge sort all the way down to single-record leaf buffers on disk, but instead will switch to an in-memory sort once the leaf buffers are small enough to do efficiently. Even then, an in-memory merge sort is pretty easy on the cache hierarchy, particularly if it has a streaming optimization, so you again can get pretty good performance just continuing with merges after you transition from temp files to memory buffers.
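A compact sketch of that hybrid, for illustration only: sort fixed-size chunks in memory, spill each sorted run to a temporary file, then stream a k-way merge over the runs with heapq.merge. The chunk size and the name external_sort are assumptions, and input lines are expected to be newline-terminated.

```python
import heapq
import itertools
import tempfile

def external_sort(lines, chunk_size=100_000):
    """Yield newline-terminated lines in sorted order using bounded memory."""
    runs = []
    it = iter(lines)
    while True:
        # In-memory sort of one chunk (the "leaf buffer").
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(chunk)
        run.seek(0)
        runs.append(run)
    try:
        # Streaming merge: each run is read sequentially, one line at a time.
        yield from heapq.merge(*runs)
    finally:
        for run in runs:
            run.close()
```

With "checksum path" lines as input, duplicates come out adjacent, which is all the original CRC-sort-compare recipe needs.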
Re:CRC by WoLpH · 2012-09-02 12:26 · Score: 1

100 byte chunks would work just as well, but with a smart raid card and 2GB raid cache the difference in execution time is fairly low actually. When creating this I did try 1k blocks first and it didn't have any noticeable (i.e. wall clock) slowdowns in the long run and did give me a few false positives.
YMMV though, for my use case this was useful. For you that might be different.
Re:CRC by Rich0 · 2012-09-02 12:56 · Score: 1

You could also use the xargs substitution feature and quote the parameter.
Re:CRC by Anonymous Coward · 2012-09-02 12:56 · Score: 0

You want to be moderately careful about 0B files, if you're scanning your entire filesystem (or even just all your home directory), since some of those will be locks and the like.
Re:CRC by Anonymous Coward · 2012-09-02 13:19 · Score: 1

It says Copyright 1998, so I wouldn't hold my breath waiting for the "final version".
Re:CRC by DarwinSurvivor · 2012-09-02 13:33 · Score: 1

So diff-check the files with matching md5's (you should be doing this anyway just in case). Problem solved.
Re:CRC by petermgreen · 2012-09-02 13:53 · Score: 1

The question you ultimately have to ask yourself is cost VS benefit.
Getting rid of files that are identical is low cost and fairly high benefit, so it's potentially worth doing. Trying to work out which version of a file that you probably won't use again is the "good" one has an extremely high cost versus its benefit, so it's probably not worth doing.
Personally I use a tool I wrote myself called hashbackup to dedupe files, it works but it's kinda rough. http://www.lcore.org/viewvc/hashbackup/trunk/

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
Re:CRC by vocatan · 2012-09-02 14:16 · Score: 2

Be VERY careful about only relying upon the file contents -- my wife spent 3 weeks tagging a large (~8,000 images) collection of family photos -- and the method she used was to put the children's names in the filename. Being the clever geek, I ran an MD5 against all the files, and compared both filesize and MD5 -- and triumphantly purged all the binary duplicates -- only to find that the filename itself was important to retain. Also, note that some applications such as Apple's iPhoto will conveniently retain multiple copies of the same image in various dimensions - as well as the original image before any transformations would apply. Bottom line: doing a filename+filecontents hash (single O(n) to calculate over entire file set), and then comparison of the hash feels _to me_ as the safest approach.
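One possible reading of the filename+filecontents hash, as a rough sketch. The function name, the NUL separator, and the use of SHA-256 instead of MD5 are assumptions of this example, not something the poster specified.

```python
import hashlib
import os


def name_and_content_digest(path, block_size=1 << 20):
    """Digest over the base filename plus the file contents."""
    h = hashlib.sha256()
    # Hash the name first; the NUL byte separates it from the contents.
    h.update(os.path.basename(path).encode("utf-8", "surrogateescape"))
    h.update(b"\0")
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()
```

Two files only share a digest when both the name and the bytes match, so a photo renamed to carry a child's name would never be reported as a duplicate of its untagged copy.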
Re:CRC by blueg3 · 2012-09-02 14:30 · Score: 1

Sorry, I was unclear -- I agree with you. You should *keep* *every single* zero-length file. They take up virtually no space (only space for the file metadata) and are usually important.
Re:CRC by Anonymous Coward · 2012-09-02 14:50 · Score: 0

do you mind posting the name of the quake 3 file and the linux system file? the probability of collision is so small that it would be cool to have a real example of a collision.
Re:CRC by FoolishBluntman · 2012-09-02 16:46 · Score: 1

$19.95 for a beta of something you can whip up in about an hour of shell scripting.
Hell, I wrote exactly what people are talking about here in an afternoon in college - I even did both SHA and MD5, because I ended up finding a SHA collision between one of the Quake 3 files and a Linux system file.
I think you're full of shit if you found an SHA collision.
Please provide an example.
Re:CRC by Anonymous Coward · 2012-09-02 16:46 · Score: 0

In all fairness, Linux and Quake 3 are pretty similar.
Re:CRC by Anonymous Coward · 2012-09-02 17:02 · Score: 0

Now you have a list of duplicates.
Actually, what you now have is a list of probable duplicates. You might still want to do an actual byte by byte comparison depending on how important those files are.
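A small sketch of that final byte-by-byte check, using Python's standard filecmp module. The helper name confirm_duplicates is made up for the example, and it assumes the list it receives already shares one hash.

```python
import filecmp


def confirm_duplicates(paths):
    """Given paths whose hashes match, return those byte-identical to the first."""
    first, rest = paths[0], paths[1:]
    # shallow=False forces a real content comparison instead of trusting os.stat().
    return [p for p in rest if filecmp.cmp(first, p, shallow=False)]
```

Only the paths this returns are confirmed duplicates of the first entry; anything left out merely shared a hash.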
Re:CRC by FoolishBluntman · 2012-09-02 17:15 · Score: 1

It would take about 12 hours to read the 5TB and compute SHA-1 for everything. (assumes 120MB/sec disk)
(5*10^12)/(120*10^6) = 41,666 sec = 11.57 hours.
After that you have 4M*(256 bytes metadata+20bytes/hash)=~1GB of memory.
Everything fits into an in memory hash table, lookups O(1) proceed at 1 lookup/microsecond.
It takes another 20 minutes or so to delete the duplicate files. project done.
And I have a perl script that does this that I wrote in 1999, email me and I'll send it.
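The back-of-envelope numbers above are easy to re-run with other assumptions. A tiny sketch; the function and parameter names are made up, and the defaults are simply the figures quoted in the post (5 TB, 120 MB/sec, 4M files, 256 bytes of metadata plus a 20-byte SHA-1 per file).

```python
def estimate(total_bytes=5e12, disk_bytes_per_sec=120e6, n_files=4_000_000,
             metadata_bytes=256, digest_bytes=20):
    """Return (seconds to read everything once, bytes for the in-memory table)."""
    read_seconds = total_bytes / disk_bytes_per_sec
    table_bytes = n_files * (metadata_bytes + digest_bytes)
    return read_seconds, table_bytes
```

With the defaults this gives about 41,667 seconds (11.57 hours) of reading and roughly 1.1 GB for the table, matching the estimate in the post.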
Re:CRC by FoolishBluntman · 2012-09-02 17:18 · Score: 1

SHA-1 runs at 450MB/sec per core on a current generation i7, this is not a CPU bound problem.
Re:CRC by SuricouRaven · 2012-09-02 18:05 · Score: 1

http://birds-are-nice.me/programming/BLDD.shtml

The code is hideous. It's made just to see if it'll work. I'm a hobbyist, not a developer.
Re:CRC by Solandri · 2012-09-02 18:16 · Score: 1

Whatever you use it's going to be SLOW on 5TB of data. You can probably eliminate 90% of the work just by:
a) Looking at file sizes, then
b) Looking at the first few bytes of files with the same size.
I have a file server with about 3 TB of data. Backing it up to an external RAID 0 storage bay takes about 18 hours. If this guy is waiting over a week for a dedupe to finish, it's not because accessing 5 TB is slow.
Re:CRC by Anonymous Coward · 2012-09-02 18:44 · Score: 0

The problem lies in music and images, that are available multiple times, with changed meta data. It will be exactly the same mp3 or jpg, but with different spelling of the artist or with stripped away location data, so the file sizes will be different and the hashes will be too, but they are nonetheless dups.
Re:CRC by inKubus · 2012-09-02 19:28 · Score: 3, Informative

For the lazy, here are 3 more tools:
fdupes, duff, and rdfind.
Duff claims it's O(n log n), because they:
Only compare files if they're of equal size.
Compare the beginning of files before calculating digests.
Only calculate digests if the beginning matches.
Compare digests instead of file contents.
Only compare contents if explicitly asked.
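The staged filtering in that list can be sketched roughly as follows. This is an illustration of the idea, not duff's actual code; the helper names, the 4096-byte prefix, and the choice of SHA-256 are all assumptions.

```python
import hashlib
import os
from collections import defaultdict


def _group(paths, key):
    """Bucket paths by key(path); keep only buckets with more than one member."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [g for g in buckets.values() if len(g) > 1]


def _prefix(path, n=4096):
    with open(path, "rb") as f:
        return f.read(n)


def _digest(path, block_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.digest()


def candidate_duplicates(paths):
    """Return lists of paths that agree on size, first bytes, and full digest."""
    groups = [list(paths)]
    for key in (os.path.getsize, _prefix, _digest):
        # Each stage only touches files that survived the cheaper stage before it.
        groups = [sub for g in groups for sub in _group(g, key)]
    return groups
```

Note that all zero-length files end up in one group here, so they should be filtered out beforehand, as other posters in this thread point out.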

--
Cool! Amazing Toys.
Re:CRC by LordLimecat · 2012-09-02 19:41 · Score: 1

Maybe he could write the shell script and sell it to OP for $19.94.
Re:CRC by darkonc · 2012-09-02 20:04 · Score: 1

Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).
or try this script:
find . -type f -print0 | xargs -0 md5sum | sort | uniq --all-repeated=separate -w32
Of course, you'll have to do this in Linux (a live CD would do; these are all common tools), but you'll end up with a list of duplicate files. What you do after that is up to you.
Yes, it requires that you read each and every file (once), but you can start the script and go to bed, (redirect the output to a file). That's what computers are for, after all.

--
Sometimes boldness is in fashion. Sometimes only the brave will be bold.
Re:CRC by Dr_Barnowl · 2012-09-02 20:13 · Score: 1

My bad, it was an MD5 collision he claims.
Re:CRC by zakkie · 2012-09-02 20:21 · Score: 1

Why the hell would you bin by extension?
Re:CRC by Anonymous Coward · 2012-09-02 20:35 · Score: 0

I can see how that's a plausible solution for some people, but others actually need to efficiently use their computer: if I tied one of my computers up in checksumming my external storage (approx 3TB) then during that time it would drastically reduce my work efficiency. I'd rather do 30 minutes of coding now than lose hours of productivity over the next few days. Time is money they say, and that might be true, but when it comes to making an income it often comes down to how well you can use that time.
Re:CRC by Roger+Lindsjo · 2012-09-02 20:40 · Score: 1

Did you use SHA-0 or SHA-1? If you really did find a collision I think you should try to reproduce your case as I have so far not seen any success of not only being able to create hash(m1) == hash(m2) but also m1 and m2 are meaningful messages. I did find some papers indicating that it might "not be much harder" than finding collisions. A collision in the wild would probably warrant a whole bunch of publications.
Re:CRC by Anonymous Coward · 2012-09-02 20:59 · Score: 0

... there's an MD5 collision between one of the Linux system files and a Quake 3 Arena game file ...
Or was it an SHA collision?
Re:CRC by Anonymous Coward · 2012-09-02 20:59 · Score: 0

I ended up finding a SHA collision
Which SHA? And could you post the files somewhere?
Re:CRC by Anonymous Coward · 2012-09-02 23:21 · Score: 0

You know, $19.95 is less than my hourly wage and I am probably not that well paid. Here, the script is trivial, not even an hour of work, but for $20 I would think twice about wasting an hour of my life...
Re:CRC by Anonymous Coward · 2012-09-02 23:26 · Score: 0

Yeah, because comparing 100 numbers is so hard for a computer. Egads!
Re:CRC by Anonymous Coward · 2012-09-02 23:32 · Score: 0

I even did both SHA and MD5, because I ended up finding a SHA collision between one of the Quake 3 files and a Linux system file.
Surely you must be mistaken. Nobody in the world knows of any SHA-1 collision yet. The likelihood of finding one by chance is essentially zero. Even MD5 collisions are unlikely enough so that they can only be generated through some malicious rather than accidental process.
Re:CRC by Anonymous Coward · 2012-09-02 23:49 · Score: 0

Reading 1TB is 10,000s, or roughly 3 hours. Reading the filename and size from the file metadata for 4m files, a handful of minutes.
Write a simple python script, use a defaultdict(list) indexed by size to add the filename to a list.
Sort keys by size (reverse). Run through the list from biggest to smallest looking if the list len > 1:
if so, read 20k from each after seeking roughly to the middle (key/2) and run it through a simple hash. If it still matches, then you need to check more.
Decide if you want to do another random sampling, or just do a block for block compare.
Either way you'll finish well before the 3 hours above assuming any sort of reasonable distribution of files.
By the way your 10,000s estimate is wrong, you'll need to do 4.2m seeks at a minimum (one for each file), at 5millisecs a seek, that will add around 21,000s to the time to read unless your files are laid out in the same way as you plan to open them, which would be very lucky!
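A rough sketch of the size-then-sample script quoted above. The function names are invented, SHA-256 stands in for the unspecified "simple hash", and the groups it returns are only suspects that still need a full compare.

```python
import hashlib
import os
from collections import defaultdict


def sample_middle(path, size, sample_bytes=20 * 1024):
    """Hash up to sample_bytes read from roughly the middle of the file."""
    with open(path, "rb") as f:
        f.seek(size // 2)
        return hashlib.sha256(f.read(sample_bytes)).digest()


def suspects_by_size(paths):
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    suspects = []
    # Biggest files first, since they free the most space.
    for size in sorted(by_size, reverse=True):
        names = by_size[size]
        if size == 0 or len(names) < 2:
            continue  # skip empty files and sizes that occur only once
        by_sample = defaultdict(list)
        for p in names:
            by_sample[sample_middle(p, size)].append(p)
        suspects.extend(g for g in by_sample.values() if len(g) > 1)
    return suspects
```

The seek cost raised in the reply still applies: every file that shares its size with another one gets opened and seeked at least once.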
Re:CRC by IAmR007 · 2012-09-03 01:10 · Score: 1

fdupes is in Cygwin. No need for a livecd or vm. Cygwin can handle ntfs hardlinks well, too (I used a python/sh script in cygwin just a few weeks ago to hard link a bunch of photographs into a series of dvd-sized folders).

Also, in the future, I recommend using rsync to back up files (either remotely or to another directory). Rsync can do incremental backups using deltas (a binary version of diffs), which saves a ton of space. Rsync also works well in cygwin, and there are cron daemons available that will install themselves as windows services for automatic backup.
Re:CRC by Anonymous Coward · 2012-09-03 02:12 · Score: 0

Don't forget to eliminate all files of zero length in step (a). They're trivially duplicates but shouldn't be deduplicated.
I would say in general it's also not safe to automatically dedup any small text files. They could be (e.g.) text config files with identical default contents, for different users, and the results of deduping them could be at best very confusing, at worst prevent a user logging in (e.g. users a and b both have identical desktop environment config files).
Re:CRC by GuB-42 · 2012-09-03 02:48 · Score: 1

With a UNIX shell :
find . -type f -exec md5sum '{}' ';' > md5_list
perl -e 'while (<>) { /(\w+)/; push @{$d{$1}}, $_; } for (values %d) { print @{$_} if (@{$_} > 1); }' < md5_list
First command makes a list of all files with their MD5 checksum. With 4.9 TB of data, it will probably take a full day to complete but it is completely unattended and you only have to do it once.
The second command lists all duplicates and is much faster.
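For readers who don't speak Perl, here is roughly what the second command does, as a Python sketch. The function name duplicate_lines is made up; the input is assumed to be md5sum output lines of the form "checksum  path".

```python
from collections import defaultdict


def duplicate_lines(lines):
    """Group md5sum output lines by checksum; return only the repeated ones."""
    groups = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)  # the checksum is the first field
        if parts:
            groups[parts[0]].append(line)
    return [line for g in groups.values() if len(g) > 1 for line in g]
```

Feeding it the lines of md5_list and printing the result gives the same kind of listing as the Perl one-liner: every line whose checksum appears more than once.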
Re:CRC by TheTrueScotsman · 2012-09-03 07:05 · Score: 1

Chance of a collision among random files is the birthday problem, isn't it? For MD5 you'd need on the order of 2^64 files (the square root of 2^128) before a collision becomes likely, so the probability is extremely low for a few TB of data. Anyway, just do a straight compare if the hashes collide.
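The birthday-problem numbers can be checked with the usual approximation p ≈ 1 - exp(-n(n-1)/2^(bits+1)). A minimal sketch, assuming hash outputs behave like uniformly random values (which says nothing about deliberately crafted collisions); the function name is made up.

```python
import math


def collision_probability(n_files, hash_bits):
    """Birthday-bound estimate of at least one accidental collision."""
    pairs = n_files * (n_files - 1) / 2
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs / 2.0 ** hash_bits)
```

For 4.2 million files this gives a probability on the order of 1e-26 with a 128-bit hash, but essentially 1 with a 32-bit CRC, which is why CRC32 matches need a full compare or a second signature.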
Re:CRC by wrfelts · 2012-09-03 08:16 · Score: 1

Do a CRC ...AND... an MD5, and use a database. I built an internal file management system for one company that was (at the time) handling only about 1.75 TB of files. I was unpleasantly surprised at the number of duplicate CRCs I got on non-duplicate files. Using two different types of signatures, plus the file sizes alleviated that problem. Once you have it in a database it is also a bit easier to decide which one you want to keep and/or where to keep it by using some query/column value tricks. Flagging the values as duplicates is also trivial. My product was server based with a web services interface to access the files, so I also threw in a compressing/encrypting feature with access by their original path hitting a keyed server file in the back end. It was amazing to me how much faster server file access was when I did endpoint (both server and client) zipping before transfer. I also used the CRC, MD5, and length to avoid unnecessary downloads to begin with.

Another option (not available for Windows, as far as I've seen) is to use a file system that simply dedups each sector, which buys A LOT more disk space than a simple file-level dedup.
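The CRC-plus-MD5-plus-length signature from the first paragraph can be computed in a single pass over each file. A sketch under that reading; the function name and block size are assumptions, and this is not the poster's actual product code.

```python
import hashlib
import os
import zlib


def signature(path, block_size=1 << 20):
    """Return (size, crc32, md5 hex digest), reading the file only once."""
    crc = 0
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            crc = zlib.crc32(block, crc)  # running CRC over the blocks so far
            md5.update(block)
    return os.path.getsize(path), crc & 0xFFFFFFFF, md5.hexdigest()
```

The resulting tuple can be stored as three database columns, and two rows are treated as probable duplicates only when all three values agree.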
Re:CRC by Anonymous Coward · 2012-09-03 11:56 · Score: 0

http://makguidetosoft.blogspot.com/2012/09/de-duping-huge-amount-of-files.html
Re:CRC by I(rispee_I(reme · 2012-09-03 12:29 · Score: 1

GPL.txt
Re:CRC by Anonymous Coward · 2012-09-03 13:27 · Score: 0

dunno 'bout linux, but win has date created & modified, usually I just compare these.
same name, same size, same date, I'm happy... next files, done
usually I do this manually....
but then I'm anal when it comes to file copy & bkups
data is way hard to replace, & sometimes impossible...
I'm an ex-operator, from back when computers cost way more than now
Re:CRC by Anonymous Coward · 2012-09-03 20:11 · Score: 0

It also fails if the two copies are owned by different people who may not appreciate having the file moved to either a neutral owner with permissions set broadly, or having the permissions changed so that someone else can share it.
Re:CRC by belg4mit · 2012-09-04 02:22 · Score: 1

The v1.4 nagware works fine

--
Were that I say, pancakes?
Re:CRC by darkonc · 2012-09-04 03:01 · Score: 1

btw: not a whole lot of need to compare file sizes if you use md5sum (or any larger hash). With a 128-bit hash and 4 million files, your probability of an accidental hash collision is roughly 1/2^85 (about 1 in 4e25), close enough to zero for my purposes.

--
Sometimes boldness is in fashion. Sometimes only the brave will be bold.
Re:CRC by SiChemist · 2012-09-04 06:21 · Score: 1

I watched Transformers 2 at the movie theater and I have never been so bored in my entire life. That's six and a half hours that I'll never get back. (Well, it felt like six and a half hours.)

--
God is imaginary
Re:CRC by Anonymous Coward · 2012-09-04 07:17 · Score: 0

If you had found a SHA-1 hash collision in college, you should have published it. There are no known SHA-1 collisions in the wild. http://stackoverflow.com/questions/3475648/sha1-collision-demo-example
Re:CRC by Anonymous Coward · 2012-09-04 07:48 · Score: 0

This means that a hard drive with 2^80 or 1,208,925,819,614,629,174,706,176 files would have on average ONE collision.
It might be worth noting that Avogadro's number is approximately 2^79, about half as large. 2^80 is close to the number of stars in the observable universe. These are IMMENSE numbers.
Re:CRC by Anonymous Coward · 2012-09-04 14:53 · Score: 0

i thought no collisions in sha1 had been found in the wild?
Re:CRC by SuricouRaven · 2012-09-04 22:35 · Score: 1

At some point I'm going to get around to rewriting the sort part in a much more efficient way. It could probably go ten times faster with ease. But the mood to code strikes rarely, I'm not a professional. I'm working on a program to zero-out unallocated clusters in an NTFS filesystem at the moment.
Re:CRC by jamiedolan · 2012-09-11 04:18 · Score: 1

The problem with this is that the vast majority of what is on my drive is content that I created (e.g. HD movies, RAW photos, graphics work, etc.). I would say there are only a handful of TV shows and such in iTunes or Amazon Unbox that take up very little space overall.
Re:CRC by Anonymous Coward · 2012-09-15 01:18 · Score: 0

> Sort the file by CRC.
Read it into memory, sort it, then write out to a file.
Re:CRC by Snorbert+Xangox · 2012-09-16 12:37 · Score: 1

Hell, I wrote exactly what people are talking about here in an afternoon in college - I even did both SHA and MD5, because I ended up finding a SHA collision between one of the Quake 3 files and a Linux system file.
It would be interesting to know how long each of these colliding files was... funny how we all *know* that for nontrivial hash inputs there are many many possible colliding inputs, but over time we tend to slide into "let's just compare hashes to find identical data; collisions are so rare - after all, we haven't seen any!"

--
-Snorbert, somewhere in the antipodes

This sounds easy. by MusicOS · 2012-09-02 01:32 · Score: 0

You need to sort by file size. Then compare matches. You might hash files and sort by hashes.

Don't do it all at once by Anonymous Coward · 2012-09-02 01:34 · Score: 0

Dedup in smaller pieces. It may take the same amount of time, but you will see progress.
Dedup the contents of one folder (or a small set of folders), then the next, etc.
Once you are finished, dedup the entire disk.

Is this porn? by Anonymous Coward · 2012-09-02 01:35 · Score: 0

Just wondering....

Re:Is this porn? by Anonymous Coward · 2012-09-02 01:38 · Score: 0

With porn the job is harder because you have to worry about the same piece of porn being saved at different resolutions.
Re:Is this porn? by kenh · 2012-09-02 03:02 · Score: 1

Worry? Multiple different resolutions serve a purpose - different resolution playback devices.

--
Ken
Re:Is this porn? by Anonymous Coward · 2012-09-02 04:41 · Score: 0

Lots of it, too...
4.2M files over 10 years is roughly 1k files per day. Even if we assume that large subsets got duplicated every few years (eg whenever the OP changes his HD or his backup HD), that still sounds like an enormous quantity of files...

Like so: by Arffeh · 2012-09-02 01:36 · Score: 1

Very, VERY carefully.

ZFS by smash · 2012-09-02 01:37 · Score: 2

as per subject.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

Re:ZFS by smash · 2012-09-02 01:39 · Score: 4, Informative

To clarify - no, this will not remove duplicate references to the data. The file system will remain intact. However it will perform block level dedupe of the data which will recover your space. Duplicate references aren't necessarily a bad thing anyway, as if you have any sort of content index (memory, code, etc) that refers to data in a particular location, it will continue to work. However the space will be recovered.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Re:ZFS by Anonymous Coward · 2012-09-02 02:20 · Score: 0

Deduplication can be done with HAMMER on DragonflyBSD, too.
Re:ZFS by Gordonjcp · 2012-09-02 02:24 · Score: 1

Could you then use something clever in ZFS to identify files that reference shared data?
Re:ZFS by aliquis · 2012-09-02 03:04 · Score: 1

Always, or does one have to enable it / start a scan somehow?
Make PC-BSD even cooler. PBI or whatever they are called and ZFS.
Re:ZFS by Daniel_Staal · 2012-09-02 03:37 · Score: 2

You have to enable it, which can be done on a per-filesystem basis. Once it's on, any new data written to that filesystem will be deduplicated. If you then turn it off, new data will not be deduplicated but data already on disk will remain deduplicated. (Unless it gets modified, of course. Then it's new data.)
PC-BSD installs onto ZFS by default if you have over 4GB or so of ram, but won't turn on deduplication automatically. Dedup is costly: it requires a dedup table which has 320 bytes per (variably sized) block, which must be consulted on every write. (A quick estimate based on an average 64K block size for the case above results in a 24 GB dedup table.) So, if you can't fit that table into ram or onto a SSD cache drive, writes are going to be very slow. But for this usage, setting up a fileserver on ZFS and copying all his files to it would fit well, especially as the other advantages of ZFS with large filesystems will come into play.
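The dedup table estimate is simple enough to redo for your own pool. A tiny sketch using the figures from this post (320 bytes per entry, 64K average block); the function name is made up and the 5 TB input is the data size discussed in this thread.

```python
def dedup_table_bytes(data_bytes, avg_block_bytes=64 * 1024, entry_bytes=320):
    """Approximate dedup table size: one fixed-size entry per stored block."""
    blocks = data_bytes / avg_block_bytes
    return blocks * entry_bytes
```

For 5e12 bytes of data this comes out at about 24.4 GB, in line with the 24 GB quoted above; a smaller average block size makes the table proportionally larger.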

--
'Sensible' is a curse word.
Re:ZFS by smash · 2012-09-02 17:10 · Score: 1

Yup, as per the other reply, don't just blindly enable de-dupe, as it uses a lot of RAM. But - if you have hardware capable of it (dedicated NAS box recommended), it will do the job above as described. RAM is cheap these days anyway, 32gb for a desktop is only a few hundred bucks?

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.

Simplify the list by jrq · 2012-09-02 01:38 · Score: 1

Scan all simple file details (name, size, date, path) into a simple database. Sort on size, remove unique sized files. Decide on your criteria for identifying duplicates, whether it's by name or CRC, and then proceed to identify the dupes. Keep logs and stats.
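A minimal sketch of that first pass using SQLite from Python's standard library. The table layout and function names are assumptions for illustration; the CRC or name matching step would be built on top of the sizes this returns.

```python
import os
import sqlite3


def build_index(root, db_path=":memory:"):
    """Walk root and record path, name, size and mtime for every file."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, name TEXT, size INTEGER, mtime REAL)"
    )
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # unreadable or vanished file: skip it
            con.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (full, name, st.st_size, st.st_mtime),
            )
    con.commit()
    return con


def shared_sizes(con):
    """Sizes that occur more than once, largest first; unique sizes drop out."""
    return con.execute(
        "SELECT size, COUNT(*) FROM files WHERE size > 0 "
        "GROUP BY size HAVING COUNT(*) > 1 ORDER BY size DESC"
    ).fetchall()
```

Passing a real filename as db_path keeps the index on disk, so the expensive scan only has to be done once.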

--
My UID is prime!

My suggestion by DoofusOfDeath · 2012-09-02 01:38 · Score: 0

First step, get a big cup of good coffee. Or a fifth of Jack Daniels, if you just want to finish in 10 minutes.

Hardlinks? by airjrdn · 2012-09-02 01:40 · Score: 1

If you can get them on a single filesystem (drive/partition), check out Duplicate and Same Files Searcher ( http://malich.ru/duplicate_searcher.aspx ) which will replace duplicates with hardlinks. I link to that and a few others (some specific to locating similar images) on my freeware site; http://missingbytes.net/ Good luck.

--

My Tech Posts on Twitter

Re:Hardlinks? by Anonymous Coward · 2012-09-02 01:44 · Score: 0

Duplicate Cleaner will also replace with hard links.
Re:Hardlinks? by cyberjock1980 · 2012-09-02 05:07 · Score: 1

I like this idea (I'm running your program as I write this..) but I'm also thinking that this could backfire. I'm rather new to hardlinks, so this may not be a problem at all.
Is there an easy way to "go back" and have something delete the hard link and copy the original files back?
Let's say I have a folder of random mp3s and an organized folder of albums with all of the tracks. Let's assume that your program just replaced my albums with hardlinks to the songs in the random mp3s folder. As soon as I find my album folders I'll realize I want the original files moved/copied to someplace else and I'll want the original files and not the hardlinks. How do I "undo" the hardlinks and restore the original files?
Re:Hardlinks? by amRadioHed · 2012-09-02 08:04 · Score: 1

The hardlinks and the original files are equivalent. You should read up on inodes for more details, but in short all the files in a unix filesystem have pointers to their data. Usually there is one pointer to each file's data, but you can create multiple pointers to the same data. This is what they are talking about, you will have multiple files with different names that all reference the same data and it's impossible to tell the original from the new links because they are the same.

--
We hope your rules and wisdom choke you / Now we are one in everlasting peace
Re:Hardlinks? by cyberjock1980 · 2012-09-02 10:25 · Score: 1

But if I'm trying to organize my stuff and I move the original with the 10 hardlinks, those hardlinks will no longer work. To me hardlinks are a good first step to downsize the amount of data you have. Then you have to organize it.
Re:Hardlinks? by amRadioHed · 2012-09-02 11:13 · Score: 1

No, the hardlinks still work. You are only moving a pointer to the data, and the other pointers to the same data aren't affected.

--
We hope your rules and wisdom choke you / Now we are one in everlasting peace
Re:Hardlinks? by cyberjock1980 · 2012-09-02 13:38 · Score: 1

You are awesome! I just learned something today. I was confused as to the difference between softlinks and hardlinks. Hardlinks are my new best friend! It's like a super-shortcut :)
Google helped a little bit, but you definitely verified what I realized while googling it. I'll admit, I'm a windoze user, but recently started playing with FreeBSD and Linux and I first heard about soft/hard links a few weeks ago.
Re:Hardlinks? by DarwinSurvivor · 2012-09-02 13:44 · Score: 1

*Every* file on your computer is a hardlink. It's an entry in a filesystem table that lists the file's directory, name and inodes (inodes reference the physical sectors). When you create a hardlink, it basically creates a new file, but instead of allocating new inodes to it, it lists the same inodes as the original file. When you "move" a file, it only changes the directory/name parts of the filesystem table entry for that file (this is why you can move a 1TB file as quickly as a 1KB file). In fact, the old "a move is nothing more than a copy+delete" is completely false since a copy+delete copies the actual data, creates new inodes and then deletes the old filesystem entry (which deletes the original inodes ONLY if there are no other hardlinks pointing to it). The kernel is smart enough to only release sectors once ALL hardlinks that point to their inodes have been removed from the filesystem table.
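The behaviour discussed in this subthread is easy to try out in a scratch directory. A small sketch; the file names are placeholders and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def hardlink_demo():
    """Create a file, hard-link it, move the first name, delete it, and report."""
    d = tempfile.mkdtemp()
    original = os.path.join(d, "song.mp3")
    link = os.path.join(d, "album_track.mp3")
    moved = os.path.join(d, "moved.mp3")
    with open(original, "wb") as f:
        f.write(b"data")
    os.link(original, link)  # a second name for the same inode
    same_inode = os.stat(original).st_ino == os.stat(link).st_ino
    links_before = os.stat(link).st_nlink
    os.rename(original, moved)  # "moving" one name leaves the other alone
    with open(link, "rb") as f:
        still_readable = f.read() == b"data"
    os.remove(moved)  # removing one name only drops the link count
    links_after = os.stat(link).st_nlink
    return same_inode, links_before, still_readable, links_after
```

The two names share one inode, the link count goes from 2 back to 1, and the data stays readable through the remaining name the whole time.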
Re:Hardlinks? by amRadioHed · 2012-09-03 17:29 · Score: 1

Cool, glad to help. Yeah, links make sense once you get what's happening, but since Windows has never had an equivalent I can see how it would be confusing at first. They don't work across filesystems though which can be a problem, though how much of a problem depends on how you're using them and how your file systems are set up.

--
We hope your rules and wisdom choke you / Now we are one in everlasting peace
Re:Hardlinks? by airjrdn · 2012-09-04 00:47 · Score: 1

Sorry for the delay in responding. Think of hardlinks as multiple pointers (filenames) to the same physical file (data on disk). The actual file (data) won't be deleted until all hardlinks have been deleted. I don't know if there's an easy way of going back, but I've never had any problems with them. In fact, that's how I manage a lot of my media (same movie in both drama and comedy for example). That allows me to show the same movie in multiple locations without using up that much more storage. You might also want to check out Hardlink Shell Extension, also linked to from my site.

--

My Tech Posts on Twitter

The biggest problem you will have by Anonymous Coward · 2012-09-02 01:41 · Score: 1

is not finding the same file, but when you have duplicate files associated with different applications. For example Program A and Program B both install a fonts directory with thousands of fonts most of which are identical.

Or if you install multiple copies of slightly different versions of the same OS ...

Re:The biggest problem you will have by Anonymous Coward · 2012-09-02 02:40 · Score: 0

Why would you be backing up programs or operating system files?

There are tools for this by Anonymous Coward · 2012-09-02 01:41 · Score: 5, Informative

If you don't mind booting Linux (a live version will do), fdupes has been fast enough for my needs and has various options to help you when multiple collisions occur. For finding similar images with non-identical checksums, findimagedupes will work, although it's obviously much slower than a straight 1-to-1 checksum comparison.

YMMV

Break it up into chunks by Gordonjcp · 2012-09-02 01:41 · Score: 1

Use something like find to generate a rough "map" of where duplications are and then pull out duplicates from that. You can then work your way back up, merging as you go.

I've found that deja-dup works pretty well for this, but since it takes an md5sum of each file it can be slow on extremely large directory trees.

If you're comfortable with Linux try FSlint by Anonymous Coward · 2012-09-02 01:42 · Score: 0

There's a great tool for Linux with a GUI called FSlint. If your data is portable, you could use that method. I've used it on several hundred GB of data, though never on anything as large as what you have. Either a separate box with Linux or a LiveCD should work fine. There might be similar tools for Windows, but I haven't seen them.

Re:If you're comfortable with Linux try FSlint by pater+noster · 2012-09-02 05:03 · Score: 1

It also provides a standalone command line tool (findup) generating text output you can process later on with a little bit of scripting. With this tool I've been able to process about a TB of data in a reasonable amount of time.

Simple dedupe algorithm by Anonymous Coward · 2012-09-02 01:42 · Score: 5, Funny

Delete all files but one. The remaining file is guaranteed unique!

Re:Simple dedupe algorithm by Sulphur · 2012-09-02 03:19 · Score: 1

Delete all files but one. The remaining file is guaranteed unique!
Preparing to delete all files. Press any key to continue.
Re:Simple dedupe algorithm by Anonymous Coward · 2012-09-15 12:34 · Score: 0

Don't forget to leave behind the executable of the program doing the deduping also.

Don't waste your time. by Fuzzums · 2012-09-02 01:43 · Score: 4, Insightful

if you really want, sort, order and index it all, but my suggestion would be different.

If you didn't need the files in the last 5 years, you'll probably never need them at all.
Maybe one or two. Make one volume called OldSh1t, index it, and forget about it again.

Really. Unless you have a very good reason to un-dupe everything, don't.

I have my share of old files and dupes. I know what you're talking about :)
Well, the sun is shining. If you need me, I'm outside.

--
Privacy is terrorism.

Re:Don't waste your time. by equex · 2012-09-02 02:34 · Score: 3, Interesting

I probably have 5-10 gigs of everything i ever did on a computer. all this is wrapped in a perpetual folder structure of older backups within old backups within.... i've tried sorting it and deduping it with various tools, but there's no point. you find this snippet named clever_code_2002.c at 10kb and then the same file somewhere else at 11kb and how do you know which one to keep? are you going to inspect every file? are you going to auto-dedupe it based on size? on date? it won't work out in the end i'm afraid. the closest i have gotten to some structure in the madness is to put all single files of the same type in the same folder, and keep a folder with stuff that needs to be in folders. put a folder named 'unsorted' anywhere you want when you are not sure right away what to do with a file(s). copy all your stuff into the folders. decide if you want to rename dupes to file_that_exists(1).jpg or leave them in their original folders and sort it out later in the file copy/move dialogs that pop up when it detects similar folders/files. i like to just rename them, and then whenever i browse a particular 'ancient' folder, i quickly sort through some files every time. over time, it becomes tidier and tidier. one tool that everyone should use is Locate32. it indexes your preferred locations and stores it in a database when you want to. (it's not a service) you can then search very much like the old Windows search function again, only much much better.

--
Can I light a sig ?
Re:Don't waste your time. by complete+loony · 2012-09-02 11:26 · Score: 1

For source code you could commit them into git repo's in whatever folder structure you have. Create separate repo's for each project or sub folder, or just one big one.
Then just push each project with a unique branch name into a single bare repository. git can automatically find duplicates, compare similar objects looking for ways to store them using delta compression, and then compresses everything. This works across the entire repository, even if the code is stored internally in different branches.

--
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.

Prioritize by file size by jwales · 2012-09-02 01:43 · Score: 5, Insightful

Since the objective is to recover disk space, the smallest couple of million files are unlikely to do very much for you at all. It's the big files that are the issue in most situations.

Compile a list of all your files, sorted by size. The ones that are the same size and the same name are probably the same file. If you're paranoid about duplicate file names and sizes (entirely plausible in some situations), then crc32 or byte-wise comparison can be done for reasonable or absolute certainty. Presumably at that point, to maintain integrity of any links to these files, you'll want to replace the files with hard links (not soft links!) so that you can later manually delete any of the "copies" without hurting all the other "copies". (There won't be separate copies, just hard links to one copy.)
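The hard-link step can be sketched in a few lines of Python. This is only a rough sketch, not anything from the post: the function name and the ".dedupe-tmp" suffix are made up, and it assumes you have already confirmed (by checksum or byte-wise compare) that the two files are identical and that both live on the same filesystem.

```python
import os


def replace_with_hard_link(keep: str, duplicate: str) -> None:
    """Replace `duplicate` with a hard link to `keep`.

    Assumes the caller has already verified the contents are identical
    and that both paths are on the same filesystem.
    """
    # Hypothetical temporary name; the link is created here first so the
    # duplicate path never goes missing if something fails halfway.
    tmp = duplicate + ".dedupe-tmp"
    os.link(keep, tmp)
    # Swap the new link into place over the old copy.
    os.replace(tmp, duplicate)
```

Keep in mind that after this both names point at the same data, so deleting one name is harmless but editing the file through one name changes it under the other as well.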

If you give up after a week, or even a day, at least you will have made progress on the most important stuff.

--
Wikia

Re:Prioritize by file size by TubeSteak · 2012-09-02 02:14 · Score: 1

Remember the good old days when a 10 byte text file would take up a 2KB block on your hard drive?
Well now hard drives use a 4KB block size.
Web site backups = millions of small files = the worst case scenario for space

--
[Fuck Beta]
o0t!
Re:Prioritize by file size by b4dc0d3r · 2012-09-02 03:49 · Score: 3, Informative

ZIP, test, then Par2 the zip. Even at the worst possible compression, where the archive comes out larger than the combined file sizes, you just saved a ton of space.
I got to the point where I rarely copy small files without first zipping on the source drive. It takes so frigging long, when a full zip or tarball takes seconds. Even a flat tar without the gzip step is a vast improvement, since the filesystem doesn't have to be continually updated. But zipping takes so little resource that Windows XP's "zipped folders" actually makes a lot of sense for any computer after maybe 2004, even with the poor implementation.
Re:Prioritize by file size by DeathFromSomewhere · 2012-09-02 04:24 · Score: 1

1 million files of 4KB each is a grand total of 4GB. In an era of multi-terabyte drives it's just not worth caring about.

--
-1 overrated isn't the same thing as "I disagree".
Re:Prioritize by file size by SuricouRaven · 2012-09-02 09:52 · Score: 1

That's the filesystem block size, not the hard drive. Hard drives, excluding the very newest, have been 512 bytes per block since almost forever. The allocation unit size of filesystems varies greatly - the largest I've ever seen was 64K, the smallest 512B.

Linux livecd? by thePowerOfGrayskull · 2012-09-02 01:47 · Score: 3

perhaps you could boot with a livecd and mount your windows drives under a single directory? Then:

find /your/mount/point -type f -exec sha256sum > sums.out
uniq -u -w 64 sums.out

Re:Linux livecd? by thePowerOfGrayskull · 2012-09-02 01:49 · Score: 1

Damn, just remembered that won't include the filename :) I'll reply with a fix once I get back to my pc unless someone else beats me to it.
Re:Linux livecd? by dargaud · 2012-09-02 02:30 · Score: 3, Insightful

Read the other comments: that's highly inefficient. Compare the file sizes, then diff the files until the 1st differing byte. No need to checksum two Tb files if the 1st bytes are different !

--
Non-Linux Penguins ?
Re:Linux livecd? by thePowerOfGrayskull · 2012-09-02 02:34 · Score: 1

Fixed below:
find /exports -type f | xargs -d "\n" sha256sum | sort > sums.out
uniq -d -w 64 sums.out
You could also do another pipe to run it in one line, but this way you have a list of files and checksums if you want them for anything else in the future.
Re:Linux livecd? by Anonymous Coward · 2012-09-02 03:02 · Score: 1

but if you have 1000 files with exactly the same size, are you going to diff every file with all other 999 files?
If the files are actually different, especially the first part of the file, then diff may still be faster.
If many of the files are the same, it's faster to compute CRC's and then sort. (or delete files while doing the diff's, so the number of files quickly is reduced if they are mostly the same)
Re:Linux livecd? by thoriumbr · 2012-09-02 03:06 · Score: 1

I liked the script, but I think sha256 is way too much overkill. CRC32 will suffice.
Re:Linux livecd? by thePowerOfGrayskull · 2012-09-02 03:32 · Score: 1

Read the other comments: that's highly inefficient. Compare the file sizes, then diff the files until the 1st differing byte. No need to checksum two Tb files if the 1st bytes are different !
Who cares? He's doing this once - and since he's posting it to ask slashdot I think we can safely assume there's no urgency. Run it, let it go overnight or two, and move on.
Re:Linux livecd? by TheGratefulNet · 2012-09-02 04:40 · Score: 1

you just gave me an idea.
previously, I would have done a full md5sum of a file and saved its result. the reason is that I can move 10 drives to 10 different systems, start them all doing local md5sum's and collect the data for a merge *later on*.
the idea you gave me is to do 'short sums' first, which would look at perhaps the top X bytes or last X bytes and save those. do all those first, on separate systems and use that as the quick-test sorter.
you are right that you don't have to run a full 1tb worth of read on a 1tb file to know that its not the same file as another 1tb file. then again, if it was a logfile, only the last few entries might be different. the goal is to get this 'accelerator' smart enough to take a small bit of the file and 'know' that its likely that its a good checker. for *.log files, you'd snip some of the end. for binary .exe files, the first part is probably ok. for music, I'd check the ending (where id3 tags tend to be). etc.
the idea is to find a good balance of compute time vs hit ratio. its fine to get collisions; and then you have to dive deeper. but if you can avoid the longer tasks by being smart about it, that's the key, isn't it?
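A rough sketch of the 'short sum' idea in Python, in case it helps. The 64 KB sample size, the use of MD5 and the function name are all my assumptions; the post only says "X bytes". It hashes the file size plus the head and the tail, so two files with different short sums are definitely different, while equal short sums still need a full compare.

```python
import hashlib
import os

SAMPLE_BYTES = 64 * 1024  # assumed value for the "X bytes" in the post


def short_sum(path: str, sample: int = SAMPLE_BYTES) -> str:
    """Cheap fingerprint: file size plus the first and last `sample` bytes."""
    size = os.path.getsize(path)
    h = hashlib.md5()
    h.update(str(size).encode())
    with open(path, "rb") as f:
        h.update(f.read(sample))
        if size > sample:
            # Jump to the tail without re-reading bytes the head covered.
            f.seek(max(size - sample, sample))
            h.update(f.read(sample))
    return h.hexdigest()
```

Since it only depends on the file itself, this can be run independently on each drive and the results merged later, same as with full md5sums.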

--
"It is now safe to switch off your computer."
Re:Linux livecd? by thePowerOfGrayskull · 2012-09-02 05:17 · Score: 1

yep agreed.
Re:Linux livecd? by DarwinSurvivor · 2012-09-02 13:50 · Score: 1

The chances of having 1000 different files with the same size over 1MB is so small it's not even worth considering worrying about. 1000 files under 1MB is going to take up only 1GB, so after the initial read, they'll all be in cache anyways.
Re:Linux livecd? by DarwinSurvivor · 2012-09-02 13:52 · Score: 1

Better yet, just make a list of filesize vs filename, then only compare files with the same filesize. This should eliminate 99% of your large files instantly.
Re:Linux livecd? by petermgreen · 2012-09-02 14:30 · Score: 1

sha256 is overkill but CRC32 is too small and in a system with 4.2 million files is almost certain to produce false positives.
The expected number of false positives with crc32 is around 4200000^2 / 2 / 2^32 ~= 2054
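The same estimate as a small Python sketch, assuming hash values are uniformly distributed and all files are actually distinct (the function name is made up for illustration). It counts file pairs and divides by the number of possible hash values.

```python
def expected_false_matches(n_files: int, hash_bits: int) -> float:
    """Expected number of colliding pairs among n_files distinct files."""
    pairs = n_files * (n_files - 1) / 2
    return pairs / 2 ** hash_bits


print(round(expected_false_matches(4_200_000, 32)))  # CRC32: about 2054
print(expected_false_matches(4_200_000, 128))        # 128-bit hash: effectively zero
```

With a 128-bit hash such as MD5 the expected count of accidental collisions is negligible for this many files.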

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register

don't run the app on a usb EXT disk by Joe_Dragon · 2012-09-02 01:47 · Score: 2

Put the disk on the built-in SATA bus, or use eSATA or even FireWire.

Don't worry about it by MassiveForces · 2012-09-02 01:48 · Score: 1, Insightful

If nuking it isn't an option, it's valuable to you. There are programs that can delete duplicates, but if you want some tolerance to changes in file-name and age, they can get hard to trust. But with the price of drives these days, is it worth your time de-duping them?

First, copy everything to a NAS with new drives in it in RAID5. Store the old drives someplace safe (they may stop working if left off for too long, but it's better to have them if something does go wrong with the NAS, right?).

Then, copy everything current to your new backup drives on your computer, and automate the backup so that it only keeps two or three versions of files so you don't end up with this problem again. Keep track of things you want to archive and archive them separately.

An ounce of prevention is better than a pound of cure. We all get into backup and duplicate problems eventually. I have found keeping my core work in dropbox and making a backup of it occasionally provides enough measure of data backup for me, but the information I generate in the lab doesn't take up so much space.

Re:Don't worry about it by Daniel_Staal · 2012-09-02 05:09 · Score: 1

If he uses FreeNAS on RAIDZ, he can get the dedup in there as well. ;)

--
'Sensible' is a curse word.

Review all deletions! by Anonymous Coward · 2012-09-02 01:48 · Score: 0

Just remember to not delete anything automatically. There may very well be files that are meant to exist in duplicates.

Re:Review all deletions! by fatp · 2012-09-02 03:57 · Score: 1

Then simply make another copy when it is needed (BTW, what files are meant to exist in duplicates?)
Re:Review all deletions! by Anonymous Coward · 2012-09-02 05:50 · Score: 0

Files that are the same in two different versions of the same game, and other places.

It's going to take a long time by fa2k · 2012-09-02 01:49 · Score: 1

Assuming fully sequential access, reading 5 TB of data at 100 MB/s takes 14 hours. With a mean file size of 1 M, you probably have a lot of tiny files and a few big files. The access will be far from sequential, so the access time will be many times greater. Don't expect it to be quick.

I would probably cook some script together with Cygwin, md5sum and find, but if you have duplicated *directories*, you may have to get smarter. With a simple script (I may post one later if nobody else has a better idea), the end result would be a list of files with identical hashes, and you'd have to decide what to do about them. [I would actually use a filesystem with built-in deduplication, like ZFS, and failing that I would write a script to hard-link identical files. But it's kind of limited what you can do on Windows]

Re:It's going to take a long time by fa2k · 2012-09-02 02:03 · Score: 1

As MusicOS said above, sort by file size first, then you don't have to hash every file, just the ones that have equal size. Still going to be slow, though.
Re:It's going to take a long time by b4dc0d3r · 2012-09-02 04:00 · Score: 2

I wrote my own to do exactly this, thinking it would be vastly superior to anything I could have downloaded.
File size collisions are a lot more common than one would realize. Even the following algorithm takes a very long time to complete on any sizeable data source:
- Find all files, storing directory and filename as separate strings to prevent memory allocation issues (the path will be the same for lots of files, so keep it in memory once - a hashtable or binsearch or similar optimized storage makes this negligible overhead)
- Sort the resulting list by filesize
- Iterate over the list. If the next file has a different size, continue the loop
- Otherwise, for each file with the same size, open the first file. Open the second file and do a byte-wise compare. This will fail faster than doing a hash for different files, usually it takes a single cluster read to find differences
- After going through each filesize match, drop the first file of the bunch and repeat. The OS file cache will retain most of the files you just opened, so compares go quickly
100k files can take several hours, even in fully automatic "just choose one to delete" mode. Even if they are small.
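For reference, the steps above come out roughly like this in Python. This is a sketch and not the program described in the post: it groups by size with a dict instead of sorting, uses the standard library's filecmp for the byte-wise compare, and drops every file that matched the first one (not just the first file) before repeating. The function name is made up.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(root: str) -> list[list[str]]:
    """Return groups of byte-identical regular files under `root`."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        remaining = paths
        while len(remaining) > 1:  # a unique size needs no comparison
            first, rest = remaining[0], remaining[1:]
            # shallow=False forces an actual content comparison.
            same = [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if same:
                groups.append([first] + same)
            # Drop the first file and its matches, then repeat on the rest.
            remaining = [p for p in rest if p not in same]
    return groups
```

As the post says, the pairwise compares get expensive when many different files share one size, which is where a hash per file starts to win.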
Re:It's going to take a long time by TheGratefulNet · 2012-09-02 04:30 · Score: 1

one thing that I prefer over doing a file compare; is doing md5sum's of files and saving that to a file. using 'find . -type f' and sending that to md5sum and that gets saved as lines in a file. nice since you can re-run md5sum -c on that very file.
best part: if you have 10 spare systems at home, you can run 10 truly absolutely concurrent non-interfering jobs and get so much of that pre-processing (first pass) work done.
I like to break this up into parallel parts as much as possible.

--
"It is now safe to switch off your computer."
Re:It's going to take a long time by Anonymous Coward · 2012-09-02 06:44 · Score: 0

best part: if you have 10 spare systems at home, you can run 10 truly absolutely concurrent non-interfering jobs and get so much of that pre-processing (first pass) work done.
Unless the process is disk-bound; if the 10 systems are, say, 286es, this might not be an issue. For practical applications, you're a moron.
Re:It's going to take a long time by Anonymous Coward · 2012-09-02 07:05 · Score: 0

- Iterate over the list. If the next file has a different size, continue the loop
- Otherwise, for each file with the same size, open the first file. Open the second file and do a byte-wise compare. This will fail faster than doing a hash for different files, usually it takes a single cluster read to find differences
- After going through each filesize match, drop the first file of the bunch and repeat. The OS file cache will retain most of the files you just opened, so compares go quickly
This is stupid if you have many files of the same size (e.g. multi-part archives of the same split size); blueg3 had it right

Checksum by Anonymous Coward · 2012-09-02 01:50 · Score: 1

cd directory_with_files
md5sum * | sort

I wouldn't recommend using crc32 if you have a substantial amount of files or else you risk a collision (i.e. two different files that produce the exact same crc32).

General advice... by Anonymous Coward · 2012-09-02 01:51 · Score: 0

Your write-up seems to imply you've only attempted dealing with it as a whole. If you're dealing with that much data all at once, the only software that will help you are database tools. Have you tried breaking it into smaller tasks like a year's worth of data or 50 gigs worth, etc?

My other suggestion would be to try a different hammer. There's a possibility that since Linux programs are mixes between desktop/server/database that you might be able to find dupe sorting programs that won't choke when such huge amounts of data are thrown at it.

.

Good free command line tool by Anonymous Coward · 2012-09-02 01:52 · Score: 1

I recently had this problem and solved it with finddupe (http://www.sentex.net/~mwandel/finddupe/). It's a free command line tool. It can create hardlinks, you can tell it which is a master directory to keep and which directories to delete, and it can create a batch file to actually do the deletion if you don't trust it or just want to see what it will do. Highly recommend. In any case, 5 TB is going to take forever but with finddupe you can be sure your time is not wasted, unlike one of the free tools that analyzed my drive for 12 hours and then told me it would only fix ten duplicates.

Re:Good free command line tool by Acy James Stapp · 2012-09-02 01:59 · Score: 3, Interesting

I recently had this problem and solved it with finddupe (http://www.sentex.net/~mwandel/finddupe/). It's a free command line tool. It can create hardlinks, you can tell it which is a master directory to keep and which directories to delete, and it can create a batch file to actually do the deletion if you don't trust it or just want to see what it will do. Highly recommend. In any case, 5 TB is going to take forever but with finddupe you can be sure your time is not wasted, unlike one of the free tools that analyzed my drive for 12 hours and then told me it would only fix ten duplicates.
I tried this vs. Clone Spy, Fast Duplicate File Finder, Easy Duplicate File Finder, and the GPL Duplicate Files Finder (crashy). (Side note: Get some creativity guys). There's no UI but I don't care. It doesn't keep any state between runs so run it a few times on subdirectories to make sure you know what it's doing first then let it rip.

--
-- Too lazy to get a lower UID.

How about this? by Hans Adler · 2012-09-02 01:52 · Score: 1

As it is mostly about space, ignore the smaller files. For large files, the file size is already a pretty close approximation to a unique hash. First of all, create a database with size/path information and some extra fields where you will later add better hash sums and maybe note how far you got in processing.

Process files by decreasing size. If there are only two files of a particular size, compare them directly.

If there are more than two files of a particular size, get a better hash for each. (Choose a fast hashing algorithm that looks only at the first KB or so of the files.) After that, make the obvious comparisons to detect precise copies.

I have some further ideas in case this is still not fast enough, but I am worried that I may have already pissed off enough people by reinventing key parts of their precious patented algorithms without mentioning them.

FDUPE by Anonymous Coward · 2012-09-02 01:52 · Score: 0

http://neaptide.org/projects/fdupe/

Desired outcome by markus_baertschi · 2012-09-02 01:53 · Score: 1

You don't say what your desired outcome is.

If this was my data I would proceed as this:

Data chunks (like web site backups) you want to keep together: weed out / move to their new permanent destination
Create a file database with CRC data (see comment by Spazmania)
Write a script to eliminate duplicate data using the file database. I would go through the files I have in the new system and delete their duplicates elsewhere.
Manually clean up / move to new destination for all remaining files.

There will be a lot of manual cleanup, I think.

File Groupings by Mendy · 2012-09-02 01:55 · Score: 1

The problem with a lot of file duplication tools is that they only consider files individually and not their location or the type of file. Often we have a lot of rules about what we'd like to keep and delete - such as keeping an mp3 in an album folder but deleting the one from the 'random mp3s' folder, or always keeping duplicate DLL files to avoid breaking backups of certain programs.

With a large and varied enough collection of files it would take more time to automate that than you would want to spend. There are a couple of options though:

You could get some software to replace duplicate files with hard links. This will save you space but not make things any neater - DupeMerge looks like it would do it on NTFS but I haven't tried it myself.

Another alternative would be to move your data to a file system that has built in de-duplication such as ZFS and let that handle everything.

Finally when I was looking at this myself what I found was that the problem was not individual duplicate files but that certain trees of files occurred identically in multiple places (adhoc backups of systems were a big culprit here). What you could do with but which I couldn't find and didn't get round to finishing writing was something that would CRC not individual files but entire trees of files/folders and report back the matches. If something does already exist to do that I'd be quite interested myself.
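Something along these lines might do for the tree matching; I haven't run it on a big archive. It is a sketch in Python that swaps the CRC for SHA-1 and computes a digest per directory from the sorted names and digests of its entries, so two directories get the same digest only if their whole subtrees match. The function names are made up, and symlinks are simply skipped.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path: str) -> str:
    """SHA-1 of a file's contents, read in 1 MB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digest(root: str, seen: dict) -> str:
    """Digest of a directory built from entry names and entry digests."""
    h = hashlib.sha1()
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.islink(path):
            continue  # assumption: symlinks are ignored
        if os.path.isdir(path):
            child = "d" + tree_digest(path, seen)
        else:
            child = "f" + file_digest(path)
        h.update(f"{name}\0{child}\n".encode("utf-8", "surrogateescape"))
    digest = h.hexdigest()
    seen[digest].append(root)  # record every directory under its digest
    return digest


def duplicate_trees(root: str) -> list[list[str]]:
    """Groups of directories under `root` whose subtrees are identical."""
    seen = defaultdict(list)
    tree_digest(root, seen)
    return [paths for paths in seen.values() if len(paths) > 1]
```

Two caveats: all empty directories match each other, and when two trees match, their subdirectories are reported as matching too, so you would want to look at the topmost match first.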

use an APP called doublekill by Anonymous Coward · 2012-09-02 01:55 · Score: 0

its free

Re:use an APP called doublekill by Anonymous Coward · 2012-09-02 01:58 · Score: 0

its actually called "doublekiller"

Wait it out by tstrunk · 2012-09-02 01:57 · Score: 1

My crystal ball tells me:
At some point Btrfs will be standard in most linux distributions. Some time later deduplication will be developed to be used for the layman. (Planned features, wikipedia: http://en.wikipedia.org/wiki/Btrfs#Features )

1.) Wait it out until we are there.
2.) Get a NAS box using Btrfs
3.) transfer everything ...
5.) PROFIT (for the people building the NAS).

Re:Wait it out by Anonymous Coward · 2012-09-02 02:09 · Score: 0

Btrfs doesn't do efficient copy on writes like ZFS does. It probably never will.
Re:Wait it out by gbjbaanb · 2012-09-02 03:51 · Score: 1

replace "NAS box with Btrfs" with "NAS box with ZFS" and not only are you going to have a happy time, but the profit will be all yours as FreeNAS is BSD based and free.
Re:Wait it out by mlts · 2012-09-02 05:01 · Score: 3, Insightful

I will go out on a limb, risk my geek card and propose another alternative:
Windows Server 2012 has a deduplication feature which works atop of NTFS (not ReFS). Unlike "real" deduplication on the LVM level which you get with your EMC, the files are written to the filesystem fully "hydrated", and as time passes, a background task [1] sifts through the blocks, finds ones that are the same, then adds reparse points.
The reason I'm suggesting this is that if one already has a Windows file server, it might be good to slap on 2012 when it is available, configure deduplication on a dedicated storage volume, and let it do the dirty work on the block level for you.
Of course, ZFS is the most elegant solution, but it may not be the best in the application.
[1]: Fire up PowerShell and type in:
Start-DedupJob E: -Type Optimization
if you want to do it in the foreground after setting it up, if you did a large copy and want to dedupe it all.
Re:Wait it out by Anonymous Coward · 2012-09-13 12:07 · Score: 0

Mod parent up. It just works, and the guy is clearly not a Unix Beard like the rest of you posting about tarballing and Perl scripts you dreamed up while in the shower. He clearly has issues and wants the magic done for him, so this is the best solution, however expensive.
The next solution is JBMFS (Just buy more fucking space)

Don't bother by Anonymous Coward · 2012-09-02 02:03 · Score: 1

Don't do it. You're on a fool's errand. Old files are so much smaller than new files that you're not wasting very much space. Now as you go through it all manually, you will find some of the duplicates. You can create symbolic links (supported in Win7) among duplicates as you encounter them. File positions in the directory tree are important information. e.g. the same image crookedtree.jpg may be duplicated between trips\2007\June\Smoky Mountains and trees\best\maple. It has meaning in both places. You will encounter whole directories that can simply be deleted because they are old backups, and you can verify this with tools like the simpleminded windiff or whatever you use instead.

You have done an excellent job of gathering it all together, and you should be proud of that. I'll do that "someday". Don't beat yourself up about what may only be a single-digit percentage of waste from duplication. Don't be the geezer who spends his whole retirement sorting his slides only to die and have them all tossed in the landfill.

Create hardlinks with Dupemerge.exe by Anonymous Coward · 2012-09-02 02:04 · Score: 1

I use the free command line tool dupemerge.exe to do file level dedupe on ntfs and I have found it to be pretty fast with lots of options.

See http://schinagl.priv.at/nt/dupemerge/dupemerge.html for full details.
"Introduction
Most hard disks contain quite a lot of completely identical files, which consume a lot of disk space. This waste of space can be drastically reduced by using the NTFS file system hardlink functionality to link the identical files ("dupes") together.
Dupemerge searches for identical files on a logical drive and creates hardlinks among those files, thus saving lots of hard disk space.

Backgrounders
Dupemerge creates a cryptological hashsum for each file found below the given paths and compares those hashes to each other to find the dupes. There is no file date comparison involved in detecting dupes, only the size and content of the files.

To speed up comparison, only files with the same size get compared to each other. Furthermore the hashsums for equal sized files get calculated incrementally, which means that during the first pass only the first 4 kilobyte are hashed and compared, and during the next rounds more and more data are hashed and compared.

Due to long run time on large disks, a file which has already been hashsummed might change before all dupes to that file are found. To prevent false hardlink creation due to intermediate changes, dupemerge saves the file write time of a file when it hashsums the file and checks back if this time changed when it tries to hardlink dupes.

If dupemerge is run once, hardlinks among identical files are created. To save time during a second run on the same locations, dupemerge checks if a file is already a hardlink, and tries to find the other hardlinks by comparing the unique NTFS file-id. This saves a lot of time, because checksums for large files need not be created twice.

Dupemerge has a dupe-find algorithm which is tuned to perform especially well on large server disks, where it has been tested in depth to guarantee data integrity."

linux/cygwin solution by gizmo_mathboy · 2012-09-02 02:11 · Score: 1

I was just looking at this for a much smaller pile of data (around 300GB) and came across this http://ldiracdelta.blogspot.com/2012/01/detect-duplicate-files-in-linux-or.html

I faced similar situation by Anonymous Coward · 2012-09-02 02:12 · Score: 0

and I used "Advanced File Organiser"... i catalog my dvds and external hdds... there is a tool to identify duplicate files/folders..

hope it helps...

Try "SearchMyFiles" by fgrieu · 2012-09-02 02:12 · Score: 1

Recently had this situation.

Nirsoft's free "SearchMyFiles" http://www.nirsoft.net/utils/search_my_files.html has a straightforward Find Duplicates mode which helped a lot. It is easy (the most "complex" is designating the base locations for searches as e.g. K:\;L:\;P:\;Q:\), fast, never crashed on me, and had only cosmetic issues ("del" key not working). I recommend running it with administrative privileges so that it does not miss files.

Clonespy by Anonymous Coward · 2012-09-02 02:17 · Score: 0

I use clonespy. http://clonespy.com

It does a CRC check on all files, and pops up with any duplicates.

fun project by v1 · 2012-09-02 02:19 · Score: 2

I had to do that with an itunes library recently. Nowhere near the number of items you're working with, but same principle - watch your O's. (that's the first time I've had to deal with a 58mb XML file!) After the initial run forecasting 48 hrs and not being highly reliable, I dug in and optimized. A few hours later I had a program that would run in 48 seconds. When you're dealing with data sets of that size, process optimizing really can matter that much. (if it's taking too long, you're almost certainly doing it wrong)

The library I had to work with had an issue with songs being in the library multiple times, under different names, and that ended up meaning there was NOTHING unique about the songs short of the checksums. To make matters WORSE, I was doing this offline. (I did not have access to the music files which were on the customer's hard drives, all seven of them)

It sounds like you are also dealing with differing filenames. I was able to figure out a unique hashing system based on the metadata I had in the library file. If you can't do that, and I suspect you don't have any similar information to work with, you will need to do some thinking. Checksumming all the files is probably unnecessarily wasteful. Files that aren't the same size don't need to be checksummed. You may decide to consider files with the same size AND same creation and/or modification dates to be identical. That will reduce the number of files you need to checksum by several orders. A file key may be "filesize:checksum", where unique filesizes just have a 0 for the checksum.

Write your program in two separate phases. First phase is to gather checksums where needed. Make sure the program is resumable. It may take awhile. It should store a table somehow that can be read by the 2nd program. The table should include full pathname and checksum. For files that did not require checksumming, simply leave it zero.

Phase 2 should load the table, and create a collection from it. Use a language that supports it natively. (realbasic does, and is very fast and mac/win/lin targetable) For each item, do a collection lookup. Collections store a single arbitrary object (pathname) via a key. (checksum) If the collection (key) doesn't exist, it will create a new collection entry with that as its only object. if it already exists, the object is appended to the array for that collection. That's the actual deduping process, and will be done in a few seconds. Dictionaries and collections kick ass for deduping.
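The two phases sketched in Python rather than realbasic, with a plain dict standing in for the collection. Names are made up, MD5 is just an assumed choice of checksum, and this version keeps the table in memory instead of writing a resumable file. The key is "filesize:checksum", with "0" as the checksum for files whose size is unique.

```python
import hashlib
import os
from collections import Counter, defaultdict


def build_table(paths):
    """Phase 1: list of (path, key) rows; only shared sizes get checksummed."""
    sizes = {p: os.path.getsize(p) for p in paths}
    size_counts = Counter(sizes.values())
    table = []
    for path, size in sizes.items():
        checksum = "0"  # unique file size: no checksum needed
        if size_counts[size] > 1:
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            checksum = h.hexdigest()
        table.append((path, f"{size}:{checksum}"))
    return table


def group_duplicates(table):
    """Phase 2: collect paths under their key; keep keys with several paths."""
    collection = defaultdict(list)
    for path, key in table:
        collection[key].append(path)
    return {key: paths for key, paths in collection.items() if len(paths) > 1}
```

Walking the returned dict gives each set of presumed duplicates, which is the point where a dry run listing what would be deleted makes sense.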

From here you'll have to decide what you want to do.... delete, move, whatever. Duplicate songs required consolidation of playlists when removing dups for example. Simply walk the collection, looking for items with more than one object in the collection. Decide what to keep and what to do elsewise with (delete?) I recommend dry-running it and looking at what it's going to do before letting it start blowing things away.

It will take 30-60 min to code probably. The checksum part may take awhile to run. Assuming you don't have a ton of files that are the same size (database chunks, etc) the checksumming shouldn't be too bad. The actual processing afterward will be relatively instantaneous. Use whatever checksumming method you can find that works fastest.

The checksumming part can be further optimized by doing it in two phases, depending on file sizes. If you have a lot of files that are large-ish (>20mb) that will be the same size, try checksumming in two steps. Checksum the first 1mb of the file. If they differ, ok, they're different. If they're the same, ok then checksum the entire file. I don't know what your data set is like so this may or may not speed things up for you.

--
I work for the Department of Redundancy Department.

CRCing & diff-ing do not a consistent deduping by williamyf · 2012-09-02 02:20 · Score: 2

After you have found the "equal files", you need to decide which one to erase and which ones to keep. For example, let's say that a gif file is part of a web site and is also present in a few other places because you backed it up to removable media which later got consolidated. If you chose to erase the copy that is part of the website structure, the website will stop working.

Lucky for you, most filesystem implementations nowadays include the capacity to create links, both hard and soft (in Windows, that would be NTFS symbolic links since Vista and junction points since Win2K; in *nix, the soft and hard links we know and love; and on the Mac, the engineers added hard links to whole directories). So the solution must not only identify which files are the same, but also keep one copy while preserving accessibility; this is what makes apple (r)(c)(tm) work so well. You will need a script that, upon identifying equal files, erases all but one and creates symlinks from all the erased ones to the surviving one.

--
*** Suerte a todos y Feliz dia!

Perhaps an easier way.... by Anonymous Coward · 2012-09-02 02:20 · Score: 0

You might consider moving all your storage to a small home NAS. FreeNAS, for example, can be installed on most consumer-grade computers, it is free, and it comes with the ZFS file system, which automates de-duplication. It might take you a few hours to get it set up the way you want, but then the file system should do the work for you from there.

FreeFileSync by YrWrstNtmr · 2012-09-02 02:24 · Score: 1

I'm going through this same thing. New master PC, and trying to consolidate 8 zillion files and copies of files from the last decade or so.
If you're like me, you copied folders or trees, instead of individual files. FreeFileSync will show you which files are different between two folders.

Grab two folders you think are pretty close. Compare. Then Sync. This copies dissimilar files in both directions. Now you have two identical folders/files. Delete one of the folders. Wash, rinse, repeat.
Time consuming, but it works for me.

FreeFileSync at sourceforge.

Manual work will have to be done by Qbertino · 2012-09-02 02:24 · Score: 4, Informative

Your problem isn't unduping files in your archives, your problem is getting an overview of your data archives. If you had one, you wouldn't have dupes in the first place.

This is a larger personal project, but you should take it on, since it will be a good lesson in data organisation. I've been there and done that.

You should get a rough overview of what you're looking at and where to expect large sets of dupes. Do this by manually parsing your archives in broad strokes. If you want to automate dupe-removal, do so by de-duping smaller chunks of your archive. You will need extra CPU and storage - maybe borrow a box or two from friends and set up a batch of scripts you can run from Linux live CDs with external HDDs attached.

Most likely you will have to do some scripting or programming, and you will have to devise a strategy not only of dupe removal, but of merging the remaining skeletons of dirtrees. That's actually the tough part. Removing dupes takes raw processing power and can be done in a few weeks with brute force and solid storage bandwidth.

Organising the remaining stuff is where the real fun begins. ... You should start thinking about what you are willing to invest and how your backup, versioning and archiving strategy should look in the end, data/backup/archive retrieval included. The latter might even determine how you go about doing your dirtree diffs - maybe you want to use a database for that for later use.

Anyway you put it, just setting up a box in the corner and having a piece of software churn away for a few days, weeks or months won't solve your problem in the end. If you plan well, it will get you started, but that's the most you can expect.

As I say: Been there, done that.
I still have unfinished business in my backup/archiving strategy and setup, but the setup now is 2 1TB external USB3 drives and manual arsync sessions every 10 weeks or so to copy from HDD-1 to HDD-2 to have dual backups/archives. It's quite simple now, but it was a long hard way to clean up the mess of the last 10 years. And I actually was quite conservative about keeping my boxes tidy. I'm still missing external storage in my setup, aka Cloud-Storage, the 2012 buzzword for that, but it will be much easier for me to extend to that, now that I've cleaned up my shit halfway.

Good luck, get started now, work in iterations, and don't be silly and expect this project to be over in less than half a year.

My 2 cents.

--
We suffer more in our imagination than in reality. - Seneca

Asking Slashdot how to "de-dupe"?? by fustakrakich · 2012-09-02 02:26 · Score: 1

Can it more surreal?

--
“He’s not deformed, he’s just drunk!”

Re:Asking Slashdot how to "de-dupe"?? by fustakrakich · 2012-09-02 02:27 · Score: 1

be? As in "Can it more surreal be?

--
“He’s not deformed, he’s just drunk!”

Why hasn't anyone considered... by Anonymous Coward · 2012-09-02 02:27 · Score: 0

Looking at the EXIF information, if none available, compare by filesize, then by hash, although I mostly work with RAW and that results in the same size files all of the time. So that check is useless! cmp -bl works fairly quickly even on very large file sizes. Exit codes will tell you if they match or not, 0 for match, 1 for no match.
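
For anyone scripting this rather than calling cmp, here is a rough Python equivalent of the byte-for-byte check. It is only a sketch; same_content is a made-up name, while filecmp.cmp with shallow=False is the standard library call that actually reads both files.

```python
import filecmp
import os

def same_content(path_a, path_b):
    """True only if both files hold exactly the same bytes."""
    # Cheap rejection first; useless for same-sized RAW files, free otherwise.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # shallow=False forces a real read of both files instead of trusting stat().
    return filecmp.cmp(path_a, path_b, shallow=False)
```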

One-line solution by Anonymous Coward · 2012-09-02 02:29 · Score: 0

http://linux.die.net/man/1/hardlink

Don't solve a problem you don't have. by musmax · 2012-09-02 02:39 · Score: 1

So what if you have many dup's ? Keep all on disk and know that you will have it on hand in the very unlikely event that you'll need something from five years ago. Spend $300 on a few more disks and get on with your life. Perfection is the enemy of the good.

So simple by Anonymous Coward · 2012-09-02 02:41 · Score: 0

The easiest way to get started is with this:

find /here -type f -printf "%f %s\n" | sort -u >f1
find /there -type f -printf "%f %s\n" | sort -u >f2
comm -12 f1 f2

That gives you the duplicate file names, then you can use:
find /there -name whatever -exec rm '{}' ';'
to get rid of the duplicate.

You could also do an md5sum of each file but that would unnecessarily slow things down. In all likelihood the file name will be the same as will the file size. But if in doubt move the matching files rather than delete.
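
The same name-plus-size match can be sketched in Python, which sidesteps file names containing spaces or newlines. The function names are made up, and as with the find/comm version a match is only a likely duplicate, not a verified one.

```python
import os

def name_size_index(root):
    """Map (file name, size in bytes) -> list of full paths under root."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            index.setdefault((name, os.path.getsize(path)), []).append(path)
    return index

def likely_duplicates(here, there):
    """Paths under `there` whose name and size also appear under `here`."""
    known = name_size_index(here)
    matches = []
    for key, paths in name_size_index(there).items():
        if key in known:
            matches.extend(paths)
    return sorted(matches)
```

Two files with the same name and size but different bytes will still be reported, which is why moving the matches aside is safer than deleting them.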

Duplicate Cleaner by Anonymous Coward · 2012-09-02 02:41 · Score: 0

I did the same thing a few weeks ago. ~10TB with ~1 million files. I used "Duplicate Cleaner" for windows and it took only about 3 days.

Use file size to identify duplicates by blake1 · 2012-09-02 02:42 · Score: 1

If it were me, I would use the file size to identify which were likely duplicates. Less reliable than hashing, but much faster. Using PowerShell:

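Get-ChildItem D:\MyData -Recurse | Where-Object { -not $_.PSIsContainer } | Select-Object FullName, Extension, Length | Export-Csv mydata.csv -NoTypeInformation

$objData = Import-Csv mydata.csv
$objData | Sort-Object { [long]$_.Length } | Export-Csv mydata_sorted.csv -NoTypeInformation

$objSortedData = Import-Csv mydata_sorted.csv
$objUniqueSortedData = $objSortedData | Sort-Object { [long]$_.Length } -Unique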

Then loop through comparing both sets of data, comparing file extension for those files of the same size. Do a few test runs until you're confident and then run with Remove-Item -Confirm:$false.
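
For those not on PowerShell, the size-plus-extension grouping could look roughly like this in Python. It is a sketch with made-up names, and it only produces candidates: same size and extension says nothing certain about content.

```python
import os
from collections import defaultdict

def candidates_by_size_and_ext(root):
    """Group files under root by (size, lowercased extension); keep groups of 2+."""
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            ext = os.path.splitext(name)[1].lower()
            groups[(os.path.getsize(path), ext)].append(path)
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
```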

Why? by mwvdlee · 2012-09-02 02:42 · Score: 0, Redundant

Why do you want to detect dupes?

Save disk space?
Reorganization?
Improve performance?
Some other reason?

Different requirements may require different solutions.

--
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?

Re:Why? by Anonymous Coward · 2012-09-02 03:48 · Score: 1

Let me RTFA for you:
How Do I De-Dupe a System With 4.2 Million Files?
I have many old files that have been duplicated multiple times across my drives ...chewing up space.
I do need to keep the data, nuking it is not a viable option
Your solution is?

Simple by Anonymous Coward · 2012-09-02 02:43 · Score: 0

1) Throw all of the data onto a big disk.
2) Walk the filesystem and do md5sums or your favorite hash
3) If the hash already exists, remove the file and hardlink the other one.
4) PROFIT????
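
Steps 2 and 3 can be sketched in a few lines of Python. This is an untested-on-your-data illustration with made-up function names; it trusts the MD5 match without a byte-for-byte compare, and hard links only work within a single filesystem.

```python
import hashlib
import os

def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def hardlink_duplicates(root):
    """Replace later copies with hard links to the first copy seen.

    Returns the number of files replaced.
    """
    first_seen = {}  # md5 hex digest -> path of the first file with that digest
    replaced = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            original = first_seen.setdefault(file_md5(path), path)
            if original == path or os.path.samefile(original, path):
                continue  # first copy, or already linked to it
            tmp = path + ".hardlink-tmp"  # hypothetical temporary name
            os.link(original, tmp)        # make the new link first
            os.replace(tmp, path)         # then swap it over the duplicate
            replaced += 1
    return replaced
```

Keep in mind that hard-linked names share one set of contents, so editing one "copy" afterwards edits all of them.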

Also, why dedup at all?? Disk space is cheap. Also, zfs supports native deduping at the block level if you are lazy.

My advice: Use fdupes.pl by wazoox · 2012-09-02 02:44 · Score: 1

For this purpose I'm using a wonderful perl script, fdupes.pl. I've tested it on many millions of files and many-terabyte filesystems and it works fine. I've found the original on perlmonks.org, but modified it to (1) skip symbolic links (a symlink is obviously identical to its target) and (2) auto-delete dupes (after confirmation). For anyone interested, find the script here: http://pastebin.com/cMFbBjt9

Use the Goldwyn algorithm by djmurdoch · 2012-09-02 02:45 · Score: 1

Delete the dupes, but be sure to make copies first.

Duplic8 by stends · 2012-09-02 02:47 · Score: 1

Windows? Duplic8, it handles 16 million files, does size and then binary comparison and processes this as fast as your medium can handle. It's also got a nice useful delete wizard that helps you select the ones to kill. I recommend it highly. http://www.kewlit.com/duplic8/

Backup all files with BackupPC by taleman · 2012-09-02 02:47 · Score: 1

BackupPC does deduplication. So if you take a backup from all your filesystems with BackupPC, you have identical files stored only once. BackupPC uses hard links to do the deduplication, so another copy of a file only takes a directory entry. You can then discard your current backups, if need be.

http://backuppc.sourceforge.net/

fdupes by Anonymous Coward · 2012-09-02 02:53 · Score: 0

As many commenters mentioned, it will be slow, just due to the data size, but fdupes may do exactly what you want. I have used it for a few years (it is in debian's package pool)
http://code.google.com/p/fdupes/
http://en.wikipedia.org/wiki/Fdupes
It works by reading a list of the files, comparing file sizes from that list, and then checking for duplicates from matches of those. It has options to delete all duplicates other than the first. The first found will be the file kept, so you could specify the 'cleanest' directory first, if searching multiple directories. A big downside for your problem, is it doesn't actually start the deletion until after all the duplicates are found (rather than as it finds them), and also I don't believe there is a Windows port (your post didn't specify Windows only, but didn't mention Linux at all)
fdupes --recurse --noempty --delete --noprompt [dir]

fslint by Anonymous Coward · 2012-09-02 02:54 · Score: 0

There is already software that takes the most salient points given here (compare filesizes first, then start comparing hashes), and wraps it up in a nice GUI. It's called fslint, and it's available in the standard repository of most distros.

If it's all on a POSIX filesystem, then you also have options like turning duplicates into hardlinks to just one copy on-disk.

There's also a pile of other useful things it can do to neaten filesystems and recover disk space.

The answer is simple jamiedolan by Anonymous Coward · 2012-09-02 03:00 · Score: 0

Here is how you de-dupe a system with 4.2m files. Go find a cliff or a bridge somewhere then take your entire fucktarded family. Have all of them jump off to their deaths and after that jump to yours as you are obviously too fucking stupid to even exist let alone use a computer.

Already done it - python script by Terrasque · 2012-09-02 03:01 · Score: 3, Informative

I found a python script online and hacked it a bit to work on a larger scale.

The script originally scanned a directory, found files with same size, and md5'ed them for comparison.

Among other things I added option to ignore files under a certain size, and to cache md5 in a sqlite db. I also think I did some changes to the script to handle large number of files better, and do more effective md5 (also added option to limit number of bytes to md5, but that didn't make much difference in performance for some reason). I also added option to hard link files that are the same.

With inodes in memory, and sqlite db already built, it takes about 1 second to "scan" 6TB of data. First scan will probably take a while, tho.
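
This is not the script itself, just a sketch of how an md5 cache in sqlite might work, assuming a cached hash is trusted as long as the path, size and modification time are unchanged. The table and function names are invented for the example.

```python
import hashlib
import os
import sqlite3

def open_cache(db_path):
    """Open (or create) the hypothetical md5_cache table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS md5_cache ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, md5 TEXT)"
    )
    return conn

def cached_md5(conn, path):
    """Return the file's MD5, hashing only if size or mtime changed."""
    info = os.stat(path)
    row = conn.execute(
        "SELECT md5 FROM md5_cache WHERE path = ? AND size = ? AND mtime_ns = ?",
        (path, info.st_size, info.st_mtime_ns),
    ).fetchone()
    if row:
        return row[0]  # cache hit: no file contents are read
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    conn.execute(
        "INSERT OR REPLACE INTO md5_cache VALUES (?, ?, ?, ?)",
        (path, info.st_size, info.st_mtime_ns, digest.hexdigest()),
    )
    conn.commit()
    return digest.hexdigest()
```

A repeat scan then costs one stat() per file, which fits the "about 1 second" figure once the inodes are in memory.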

Script here - It's only tested on Linux.

Even if it's not perfect, it might be a good starting point :)

--
It's The Golden Rule: "He who has the gold makes the rules."

Hashing is the answer by elabs · 2012-09-02 03:04 · Score: 1

Write a simple script or program to create an md5 hash for each file and put the hash (along with the file path) in a database or flat file. Then, for each entry in the list, check the rest of the list (after that entry) for duplicate hashes. This will take several minutes to crunch through, but not days or weeks.
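
Checking every entry against the rest of the list is quadratic, which gets painful at 4.2 million entries. A dictionary keyed on the hash finds the same groups in one pass. A small sketch, assuming the (hash, path) pairs have already been produced; the function name is made up.

```python
from collections import defaultdict

def duplicate_groups(entries):
    """entries: iterable of (hash, path) pairs. Returns lists of paths sharing a hash."""
    by_hash = defaultdict(list)
    for digest, path in entries:
        by_hash[digest].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```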

If You're Like Me by crackspackle · 2012-09-02 03:08 · Score: 3, Interesting

The problem started with a complete lack of discipline. I had numerous systems over the years and never really thought I needed to bother with any tracking or control system to manage my home data. I kept way too many minor revisions of the same file, often forking them over different systems. As time passed and I rebuilt systems, I could no longer remember where all the critical stuff was so I'd create tar or zip archives over huge swaths of the file system just in case. I eventually decided to clean up like you are now when I had over 11 million files. I am down to less than half a million now. While I know there are still effective duplicates, at least the size is what I consider manageable. For the stuff from my past, I think this is all I can hope for; however, I've now learned the importance of organization, documentation and version control so I don't have this problem again in the future.

Before even starting to de-duplicate, I recommend organizing your files in a consistent folder structure. Download MediaWiki and start a wiki documenting what you're doing with your systems. The more notes you make, the easier it will be to reconstruct work you've done as time passes. Do this for your other day to day work as well. Get git and start using it for all your code and scripts. Let git manage the history and set it up to automatically duplicate changes on at least one other backup system. Use rsync to do likewise on your new directory structure. Force yourself to stop making any change you consider worth keeping outside of these areas. If you take these steps, you'll likely not have this problem again, at least on the same scope. You'll also find it a heck of a lot easier to decommission or rebuild home systems and you won't have to worry about "saving" data if one of them craps out.

Re:If You're Like Me by dolmen.fr · 2012-09-02 20:35 · Score: 2

If you need MediaWiki to manage the documentation about your filesystem structure, you really have a problem.
TiddlyWiki should be more than sufficient for that task.

I use Duplicate Cleaner by Quick+Reply · 2012-09-02 03:08 · Score: 1

It does the job for me, the selection assistant is quite powerful.
http://www.digitalvolcano.co.uk/content/duplicate-cleaner
Fast, but the old version (2.0) was better and freeware if you can still find a copy of it.

I am a dupe duper person by AbRASiON · 2012-09-02 03:18 · Score: 1

I have too many, due to simply being a messy pig and pedantic with files.
The best tool I've found is called Duplicate Cleaner - it's from Digital Volcano.
I do not work for / am not affiliated with these people.

I've used many tools over the years, DFL, Duplic8 and "Duplicate Files Finder" - one of which had a shitty bug which matched non identical files.

Duplicate Cleaner's algorithm is good and the UI, while not perfect, is one of the better ones at presenting the data. Especially identifying entire branches / directories being binarily (word?) identical.

Yes it takes a while, that's what minimising applications is for, do you want a TRUE representation of genuinely identical files, or not?

5TB only why dedupe? by TheLink · 2012-09-02 03:18 · Score: 3, Insightful

It's only 5TB. Why dedupe? Just buy another HDD or two. How much is your time worth anyway?

You say the data is important enough that you don't want to nuke it. Wouldn't it be also true to say that the data that you've taken the trouble to copy more than once is likely to be important? So keep those dupes.

To me not being able to find stuff (including being aware of stuff in the first place) would be a bigger problem :). That would be my priority, not eliminating dupes.

Smart way to de dupe by Anonymous Coward · 2012-09-02 03:19 · Score: 0

I've had to deal with this before:

First make a complete directory listing that includes dates, sizes and individual file names in addition to the complete path

Then take a strong crypto hash of each file. (sha256 or better)
Md5 and crc32 will have more hash collisions that could mess you up

For each file that matches on size and the hash, calculate the levenshtein distance on each name
(this algorithm determines how close the file names are to each other)

If the names are vastly different, treat them as different files. If they are reasonably close, they could be the same file under a slightly changed name. Use the file with the most recent date.

Write a script to do all of this and tell you the files that need to be checked manually, or you can automate it.
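
Here is one way the name comparison might look in Python. The Levenshtein function is the textbook dynamic-programming version; the 0.3 cutoff in names_look_related is an arbitrary assumption to tune against your own file names.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def names_look_related(name_a, name_b, max_ratio=0.3):
    """True if the edit distance is small relative to the longer name."""
    longest = max(len(name_a), len(name_b)) or 1
    return levenshtein(name_a.lower(), name_b.lower()) / longest <= max_ratio
```

With this cutoff, "holiday.jpg" and "holiday(1).jpg" count as related while "holiday.jpg" and "IMG_0042.raw" do not.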

But take a backup first and test your script before letting her rip...

How much space needs to be saved? by Anonymous Coward · 2012-09-02 03:20 · Score: 0

If the goal is to save space, have you thought about compression? It might not get you much on audio & video, but if you have large text or even binary files it might help out.

Next, as others have pointed out, size matters. Removing 100 files of 100 KB each frees only 10 MB. The number of 100 MB files is probably a much more tractable number. Start there.

Also, diff not only tells you if the files are the same but gives you a sed script to convert from one to the other, if memory serves. If you've followed some sort of naming convention you could just keep the diffs. If the audio and video are edits then a diff could be used there as well. Not sure if one exists but it shouldn't be that hard to write one.

Use a hashing tool by naasking · 2012-09-02 03:20 · Score: 1

As many others have stated, use a tool that computes a hash of file contents. Coincidentally, I wrote one last week to do exactly this when I was organizing my music folder. It'll interactively prompt you for which file to keep among the duplicates once it's finished scanning. It churns through about 30 GB of data in roughly 5 minutes. Not sure if it will scale to 4.2 million files, but it's worth a try!

--
Higher Logics: where programming meets science.

Anyway... by Forty+Two+Tenfold · 2012-09-02 03:21 · Score: 2

Anyway...

--
Upward mobility is a slippery slope - the higher you climb the more you show your ass.

Re:Anyway... by Samantha+Wright · 2012-09-02 04:33 · Score: 1

Hey! That's my textbook shelf!

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Anyway... by Forty+Two+Tenfold · 2012-09-02 04:42 · Score: 1

Oh yeah? Then what's the name of the animal in the top left corner?

--
Upward mobility is a slippery slope - the higher you climb the more you show your ass.
Re:Anyway... by Samantha+Wright · 2012-09-02 04:49 · Score: 1, Interesting

Lizards aren't really my area of expertise, but I would guess a stylized green iguana or some ancestor thereof. The size of the dorsal spines doesn't seem very pragmatic.

--
Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Re:Anyway... by Forty+Two+Tenfold · 2012-09-02 04:57 · Score: 1

I asked for its name, not species. OK, OK, just kidding. Anyway, I think I'd have quite similar facial expression had I found myself in its place.
Apologies to everyone for the OT.

--
Upward mobility is a slippery slope - the higher you climb the more you show your ass.
Re:Anyway... by leromarinvit · 2012-09-02 06:02 · Score: 1

Anyway...

Shit, when did you break into my home?

--
Proud member of the Ferengi Socialist Party.
Re:Anyway... by dissy · 2012-09-02 15:41 · Score: 1

Anyway...

And that's just the visual representation of my oldest 1 GB hard drive. You should see my 3 TB drives!

Use DROID 6 by mattpalmer1086 · 2012-09-02 03:21 · Score: 4, Informative

There is a digital preservation tool called DROID (Digital Record Object Identification) which scans all the files you ask it to, identifying their file type. It can also optionally generate an MD5 hash of each file it scans. It's available for download from sourceforge (BSD license, requires Java 6, update 10 or higher).

http://sourceforge.net/projects/droid/

It has a fairly nice GUI (for Java, anyway!), and a command line if you prefer scripting your scan. Once you have scanned all your files (with MD5 hash), export the results into a CSV file. If you like, you can first also define filters to exclude files you're not interested in (e.g. small files could be filtered out). Then import the CSV file into your data analysis app or database of your choice, and look for duplicate MD5 hashes. Alternatively, DROID actually stores its results in an Apache Derby database, so you could just connect directly to that rather than export to CSV, if you have a tool that can work with Derby.
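
If you don't want to load the CSV into a database, a few lines of Python will find the duplicate hashes. This is a sketch: the column names MD5_HASH and FILE_PATH are assumptions about the export, so check the header row of your own file and pass the right names.

```python
import csv
from collections import defaultdict

def duplicates_from_csv(csv_path, hash_column="MD5_HASH", path_column="FILE_PATH"):
    """Map each hash that occurs more than once to the list of paths carrying it."""
    by_hash = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            digest = (row.get(hash_column) or "").strip().lower()
            if digest:  # rows without a hash (e.g. folders) are skipped
                by_hash[digest].append(row[path_column])
    return {digest: paths for digest, paths in by_hash.items() if len(paths) > 1}
```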

One of the nice things about DROID when working over large datasets is you can save the progress at any time, and resume scanning later on. It was built to scan very large government datastores (multiple Tb). It has been tested over several million files (this can take a week or two to process, but as I say, you can pause at any time, save or restore, although only from the GUI, not the command line).

Disclaimer: I was responsible for the DROID 4, 5 and 6 projects while working at the UK National Archives. They are about to release an update to it (6.1 I think), but it's not available just yet.

Why it's taking so long... by MarcQuadra · 2012-09-02 03:22 · Score: 1

So your de-dupe ran for a week before you cut it out? On a modern CPU, the de-dupe is limited not by the CPU speed (since deduplication basically just checksums blocks of storage), but by the speed of the drives.

What you need to do is put all this data onto a single RAID10 array with high IO performance. 5TB of data, plus room to grow on a RAID10 with decent IOPS would probably be something like 6 3TB SATA drives on a new array controller. Set up the array with a large stripe size to prioritize reads (writes are going to be 'fast enough' on a RAID10, trust me). Once you have that hooked-up with your files copied onto it, you want to connect the drive to an OS that can natively deduplicate, like Windows Server 2012. If you must, you can set this box up as a storage server (with a low-end CPU, an old 'Core 2' should be able to keep up with 180MB/sec I/O), and keep your workstation separate. Reading this entire array (when full) through the CPU -should- take about 6-10 hours, deduplication will take slightly longer.

If you don't want to do deduplication at the block level, and you want to actually only have one copy of each duplicated file, you'll need to write scripts that do something like this:

1. Run through the data store and checksum each file (except for those ending in ".mychecksum") with a hash such as MD5 or SHA-1.
2. For each file, create an empty file named .."mychecksum" next to it. This will create the 'index' using the filesystem, which will be MUCH faster than having to read the data from inside each file.
3. Search through the store and concatenate all the ".mychecksum" files into a single CSV.
4. Run sort+uniq on the file to see what will be nixed (i.e. get a report).
5. Create another script that actually takes the output from step 4 and deletes ONE of the duplicate files. You can test by -renaming- ONE of the files to .deleteme and then deleting all those files after you confirm that it worked.
6. Repeat as necessary, possibly with a scheduled job.
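
The rename-then-delete safety net from step 5 could be sketched like this in Python. It assumes you already have a group of paths known to be identical; unlike the step as written it marks every copy except one, and the function names are made up.

```python
import os

def mark_for_deletion(group, suffix=".deleteme"):
    """Keep the first path (alphabetically) and rename the rest with a suffix.

    Returns (kept_path, list_of_renamed_paths). Nothing is deleted here.
    """
    keep, *extras = sorted(group)
    marked = []
    for path in extras:
        os.rename(path, path + suffix)
        marked.append(path + suffix)
    return keep, marked

def purge_marked(marked):
    """Really delete the renamed files, once you are satisfied nothing broke."""
    for path in marked:
        os.remove(path)
```

Running for a while with the files merely renamed makes it easy to undo a mistake: strip the suffix and the file is back.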

--
"Sometimes, I think Trent just needs a cup of hot chocolate and a blankie." -Tori Amos on Nine Inch Nails

Just hash first 4K of each file, avoid 2nd pass by Anonymous Coward · 2012-09-02 03:24 · Score: 2, Insightful

Only hash the first 4K of each file and just do them all. The size check will save a hash only for files with unique sizes, and I think there won't be many with 4.2M media files averaging ~1MB. The second near-full directory scan won't be all that cheap.
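
A sketch of the 4K idea in Python. One liberty taken here: the file size is folded into the key anyway, since it comes from a stat() call in the same pass rather than a second scan. Files that land in the same group are only candidates and still need a full compare.

```python
import hashlib
import os
from collections import defaultdict

def head_hash(path, head_bytes=4096):
    """MD5 of at most the first head_bytes bytes of the file."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read(head_bytes)).hexdigest()

def candidates_by_head(paths, head_bytes=4096):
    """Group paths by (size, hash of first head_bytes); return groups of 2+."""
    groups = defaultdict(list)
    for path in paths:
        key = (os.path.getsize(path), head_hash(path, head_bytes))
        groups[key].append(path)
    return [group for group in groups.values() if len(group) > 1]
```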

A second box by kenh · 2012-09-02 03:25 · Score: 1

At a superficial level, the issue would seem to be quite hard, but with a little planning it shouldn't be *that* hard.

My path would be to go out and build a new file server running either Windows Server or Linux, based on what OS your current file server uses, install the de-dupe tool of your choice from the many listed above, and migrate your entire file structure from your current box to the new box - the de-dupe tools will work their magic as the files trip in over the network connection. Once de-duped, your old file server can be rebuilt with the same de-dupe tool, and the files migrated back to it for use going forward if desired, with the two large drives used as an online backup.

The temporary de-dupe box can be fairly simple with nothing more than a fairly robust CPU, two 2 or 3 TB drives and a gigabit NIC. You won't even need to buy an OS license if you are running Windows, as you can just use a trial copy of Windows Server.

--
Ken

Here are a couple of ways, but... by kandresen · 2012-09-02 03:27 · Score: 1

This gives an sha256sum list of all files assuming you are in linux and writing it to list.sha256 in the base of your home folder:

find /<folder_containing_data> -type f -print0 | xargs -0 sha256sum > ~/list.sha256

You may replace sha256sum with another checksum routine if you want, such as sha512sum, md5sum, sha1sum, or whatever you prefer.

now sort the file:

sort ~/list.sha256 > ~/list.sha256.sorted

(notice, this creates a list sorted by sha256 value but with the path to each file as well. If you want to manually check some lines, this might be helpful, but if you only want the machine to check, there is really no need to include the file and path data in the output, which gives a much smaller duplicate list file.)
without paths the command could be something like

cat ~/list.sha256 | awk '{print $1}' | sort > ~/list.sha256.chksum.sorted

You could now find duplicates by doing one of the following:

uniq -c ~/list.sha256.chksum.sorted | while read count chksum; do if [ $count != 1 ]; then grep ^$chksum ~/list.sha256 >> ~/list.duplicates; fi ; done

or in the first case

awk '{print $1}' ~/list.sha256.sorted | uniq -c | while read count chksum; do if [ "$count" != 1 ]; then grep "^$chksum" ~/list.sha256 >> ~/list.duplicates; fi; done

Now with the list of duplicates come the important question... Does meta data of the files such as in which path it is, date and time, file permissions etc matter to you?

Regardless I would usually recommend doing a binary comparison of the files as well to fully ensure the files are the same, before merging...

The quick and dirty removal of duplicates would be

oldchecksum=''; cat ~/list.duplicates | while read -r checksum currpath; do if [ "$oldchecksum" = "$checksum" ]; then rm "$currpath"; else oldchecksum=$checksum; fi; done

If wanting to preserve meta data, then the best way might be to use hard links to the original maintaining setting the hardlink to date and time of duplicate.

oldchecksum=''; oldpath=''; cat ~/list.duplicates | while read -r checksum currpath; do if [ "$oldchecksum" = "$checksum" ]; then mv "$currpath" "$currpath".dup; ln "$oldpath" "$currpath"; touch --reference="$currpath".dup "$currpath"; chmod --reference="$currpath".dup "$currpath"; chown --reference="$currpath".dup "$currpath"; rm "$currpath".dup; else oldchecksum=$checksum; oldpath=$currpath; fi; done

Do note that I did not test any of these commands and I might have missed something that make these commands eat important data too... Check on something unimportant before trying!

Re:Here are a couple of ways, but... by kandresen · 2012-09-02 03:53 · Score: 1

I did not pay attention to the files being on Windows, which most likely means NTFS. Everything should still be possible using a Linux livecd except for the last command to make hardlinks... I do not believe NTFS has anything like that; it is a feature of Linux file systems such as ext2/ext3/ext4.
Re:Here are a couple of ways, but... by Anonymous Coward · 2012-09-02 04:00 · Score: 0

NTFS has had hard links for 20 years! It's had directory symlinks for over 12, and file symlinks for 5.
dom

this is just like /. back in the day... by Anonymous Coward · 2012-09-02 03:32 · Score: 0

A great technical discussion that hasn't completely and totally degenerated into flame and troll bait. I miss the old /. Well, except for Jon Katz ;-)

Auslogics Duplicate File Finder by TheInsaneSicilian · 2012-09-02 03:33 · Score: 1

I recently ran into the same problem you are having, just on a lesser scale. The program I had the best success with was Auslogics Duplicate File Finder.

It includes two options that I absolutely needed: Ignore File Names and Ignore File Dates. I'm pretty sure those were off by default, so check that if you try this. Considering I knew that I had done the same thing you are describing, and even renamed some files in the process, I really needed those options otherwise it would not have found the dupes.

I was paranoid there would be false positives, so I did a quick test on a select few folders. It worked perfectly, so I let it run... then blindly allowed it to do its thing in the last step and delete duplicates. I found about 600GB of dupes out of 3TB, and it took less than a few hours to run.

You need too keep it but you dunno what it is? by Anonymous Coward · 2012-09-02 03:33 · Score: 0

You have 4.9TB of data that you can't identify, but you "need" to keep it? That sounds like packrat behavior. The best advice is, buy some new storage and move the stuff you can identify onto that. Place your old disks in the attic and forget about them. You will still have them, and since you don't have a clue what is on them, it's better off "off-line" (no pun intended).

Re:You need too keep it but you dunno what it is? by GrantRobertson · 2012-09-02 05:08 · Score: 1

Given all the caveats and assumed programming skills in all the other messages, I agree that this is the fastest, simplest method. I mean, really, how much attic, shelf, or drawer space do three 2-terabyte drives take up? Just copy off what you know you need to a new drive. Set the old ones aside. Then, when you need something that you can't find on the new drive, just fire up those old drives and do a search.

"The enemy of having a life is perfect hard drive de-duplication"
Me (just now)

backuppc by lkcl · 2012-09-02 03:33 · Score: 1

there's a program called backuppc which does the job very very effectively, even across multiple systems. [note: do not imagine for one second that the god you call windows has all the answers]. run yourself a 2nd system, even if it's a virtual machine, install debian gnu/linux in it and then run and configure backuppc.

backuppc uses MD5/SHA checksums to identify files, such that it stores only *one* copy of any given file. this occurs entirely automatically. given the size of the task you can expect it to take some considerable time, however even if it is interrupted the backup process can be restarted and it will happily chunder on from where it left off.

if you want to, backuppc can create "snapshots" for you. however given the sheer number of files i would not recommend you enable that feature unless a) absolutely necessary b) you've at least made one complete backup of the files!

realistically, you should have been running backuppc or something like it for some considerable number of years, now. backuppc and systems like it can do "incremental" backups very very efficiently... but the first time you ever run it is absolute hell.... well, that's going to be the case regardless of what system you use: you'll just have to bite the bullet.

Simple solution by ikarys · 2012-09-02 03:33 · Score: 0

Buy more hdds. It's cheap, and way more cost effective and reliable than anything else mentioned in these comments. Who cares if you have dupes of your 1997 Pamela Anderson gifs - what's the worst that could happen?

why keep all thar junk? by Anonymous Coward · 2012-09-02 03:35 · Score: 0

You're trying to do the equivalent of my wife sorting through boxes of 20 year old clothing that we (I) have to keep packing and hauling every time we move. If any of it were worth keeping it would already be in a place where you could access it, such as on your current website. There's a reason that file formats come and go and HDD magnetic domains fail and optical discs quit working in a few years. It's God's plan to help us overcome our pasts. Let it go.

De-Duping files on BTRFS. by Anonymous Coward · 2012-09-02 03:36 · Score: 0

http://dummdida.blogspot.de/2011/12/de-duping-files-on-btrfs.html

diruse by Anonymous Coward · 2012-09-02 03:40 · Score: 0

Start with identifying the largest directories, the easiest way is probably to:
download diruse.exe from microsoft

Execute as diruse /S /M /, /L x:\ . This will: include sub-directories, display output in Megabytes, include the thousand separator, and log to .\diruse.log. Optionally, only look at directories over a certain size by specifying /Q:1024 (in this example, it would be a 1gig folder) and only log the directories exceeding your quota with /D.

Hash all of the files, compare hashes of the files in the largest directories (or directories exceeding your quota) against all of the files, if you have same filename, modified date, size, you probably have a dupe and can add it to the short list to do a full diff on.
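
If diruse isn't available, the "find the biggest directories first" step can be approximated in Python. A sketch with made-up names; sizes are in bytes and include everything beneath each directory, and min_bytes plays the role of the quota.

```python
import os

def directory_sizes(root):
    """Total bytes under each directory, subdirectories included."""
    totals = {}
    # Bottom-up walk: children are totalled before their parents need them.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
        for name in dirnames:
            total += totals.get(os.path.join(dirpath, name), 0)
        totals[dirpath] = total
    return totals

def largest_directories(root, min_bytes=0):
    """(size, directory) pairs at or above min_bytes, largest first."""
    sizes = directory_sizes(root)
    return sorted(((size, d) for d, size in sizes.items() if size >= min_bytes),
                  reverse=True)
```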

Most of this has been suggested already, I just suggest adding the step of identifying the largest directories first, it'll probably make the overall process a little quicker.

Figured I'd share by sco08y · 2012-09-02 03:42 · Score: 1

So, ran into a similar problem ages ago, and I wrote a python script to handle it. If you can't follow some rather dense python, this won't be for you.

https://github.com/scooby/fdb

It's mostly the 'fdb' script, there's some other cruft in there.

My approach stores the filesystem data in a sqlite database. It's not fast, but it is reasonably recoverable, which wound up being the most important aspect. The traditional Unix convoluted pipeline approach simply doesn't scale much past 100,000 files, in my experience.

It does actually understand inodes, in fact, it is pretty much a relational model of an inode based file system. The usage model is basically: read a portion of a file system in to the database. Update unhashed inodes. Hard link identical inodes.

The catch is that I also wanted it to work over time, so I wanted a permanent volume identifier for devices, users, etc, which makes it a bit OS X centric. I don't think there's any reason it wouldn't port relatively easily to Linux: you just need to use the Linux way of looking up system information. Basically, POSIX doesn't guarantee much about device ids, uids or gids beyond "it's not going to change while the process is running," and there's no standard way to obtain a UUID.

Also, if you *do* have multiple devices, it will try to hash them on separate threads. This won't work so well if the multiple devices are simply separate partitions :-(
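To illustrate the database approach described above, here is a minimal sketch in Python using the standard sqlite3 module. It is not the fdb schema: the table name `files`, its columns, and the three function names are assumptions made up for this example. It follows the same usage model (read a tree into the database, hash whatever is still unhashed, then ask for identical content), and because every step is committed, an interrupted run can pick up where it stopped.

```python
import hashlib
import os
import sqlite3
import stat


def index_tree(db, root):
    """Record path, device, inode and size for every regular file under root."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, dev TEXT, inode TEXT, "
        "size INTEGER, md5 TEXT)"
    )
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # vanished or unreadable entry
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, devices
            db.execute(
                "INSERT OR IGNORE INTO files (path, dev, inode, size) "
                "VALUES (?, ?, ?, ?)",
                (path, str(st.st_dev), str(st.st_ino), st.st_size),
            )
    db.commit()


def hash_pending(db):
    """Hash only rows without a digest, committing after each file."""
    rows = db.execute("SELECT path FROM files WHERE md5 IS NULL").fetchall()
    for (path,) in rows:
        digest = hashlib.md5()
        try:
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    digest.update(block)
        except OSError:
            continue  # leave the row unhashed; a later run can retry
        db.execute(
            "UPDATE files SET md5 = ? WHERE path = ?",
            (digest.hexdigest(), path),
        )
        db.commit()


def duplicate_groups(db):
    """Return lists of paths with equal size and digest on more than one inode."""
    keys = db.execute(
        "SELECT size, md5 FROM files WHERE md5 IS NOT NULL "
        "GROUP BY size, md5 "
        "HAVING COUNT(DISTINCT dev || ':' || inode) > 1"
    ).fetchall()
    groups = []
    for size, md5 in keys:
        rows = db.execute(
            "SELECT path FROM files WHERE size = ? AND md5 = ? ORDER BY path",
            (size, md5),
        )
        groups.append([path for (path,) in rows])
    return groups
```

Paths that are already hard links to one inode do not count as a duplicate group on their own, which is why the query counts distinct device and inode pairs rather than rows.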

By the time you have sorted this out... by 3seas · 2012-09-02 03:44 · Score: 3, Insightful

...it will have cost you far more than simply buying another drive(s) if all you are really concerned about is space...

I have to disagree... by mattpalmer1086 · 2012-09-02 03:49 · Score: 1

There is no reason to use a crypto-strength hash. This will simply be slower. MD5 should be perfectly fine - it outputs a 128 bit hash, which is more than enough to avoid accidental collisions, and it's fast. You could match on the size as well as the hash, if you really really think you might have a hash match on different content, but it's probably not necessary.

It is true that if you're trying to avoid *intentionally malicious* collisions, you should never use MD5 as it's badly broken for that use - but not for detecting duplicate content. You're correct to avoid using CRC - but that's not a hash algorithm, it's a checksum algorithm. Accidental collisions with that algorithm will be very frequent.

The names of files should never be used to distinguish them. Files are often renamed by applications or during normal work by users. In any case, if you already have a hash match, then why do you care if the names are different? The content is already overwhelmingly likely to be identical. If you're really paranoid, then do a byte comparison of those files.
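The difference between a 32-bit checksum and a 128-bit hash can be put into rough numbers with the birthday bound. The sketch below assumes every file has distinct content and that outputs are uniformly distributed, which is an idealisation for both CRC32 and MD5; the function name is made up for the example.

```python
def expected_colliding_pairs(n_files, hash_bits):
    """Birthday-bound estimate: n*(n-1)/2 pairs, each colliding with chance 2**-bits."""
    pairs = n_files * (n_files - 1) / 2
    return pairs / 2 ** hash_bits


n = 4_200_000
print(expected_colliding_pairs(n, 32))   # 32-bit checksum such as CRC32
print(expected_colliding_pairs(n, 128))  # 128-bit hash such as MD5
```

Under those assumptions, 4.2 million distinct files give roughly two thousand accidental matching pairs with a 32-bit value, and a figure on the order of 1e-26 with a 128-bit one.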

Simple Script-Time to break out perl, python, .... by Anonymous Coward · 2012-09-02 03:54 · Score: 0

If you know any scripting language, this is a simple script. The hard part is to do the correct level of data design.

What data is needed?
What order do the columns need to be in?
How to ensure that sorting is useful? Leading zeros, columns?
Do you use text or SQL files?
Do you specify a directory/mount order for priority to keep?
Do you trust it to remove files automatically?

Based on the other posts, it seems that these fields are needed:
* File Size
* CRC
* Filename
* Date
* Directory

With this information, a machine should be able to make an intelligent guess as to which files are retained and which files are deleted.

I'm well organized today and have been the last 10 yrs or so, but I haven't always been. I have files from the 80s, 90s, and early 2000s that are probably duplicated.

Having a good system going forward is critical to limit the problems when this same issue happens again, as it will.

Time to break out the perl, python, ruby, awk, .... bash, powershell. I think I could do it in about an hour (really think 15 mins), but I've learned that stuff like this is usually longer.

You have more than one problem by Anonymous Coward · 2012-09-02 03:58 · Score: 1

#1 You must develop a naming and storage system that fits your data needs. This means amount of redundancy in case a hard drive dies or burns in a fire, directory name hierarchy, file and directory naming conventions.
#2 You must resolve to spend the time to sort and name new things correctly from now on.
#3 You must decide how much time to spend on your current data.

I assume that what you have is all on external USB drives, but the same issues arise with internal drives or firewire.

Even if you had a magic program that would instantly allow you to delete all dups, you would be left with mismatched directories. Let's say that you have 3 copies of an install directory for a free game. Randomly deleting 1 file from here and 2 from there will leave different directories that are all incomplete. Even smart programs fail at this when the difference is just a few files like might happen between two versions of the same game.

Are the files / directories at least named something reasonable? If not, then you must do one of three things:
- Go through each directory by hand, examine each file to determine what it actually is, and name files and directories reasonably. While you are at it create a hierarchy of directory names to organise the mess. Millions of files will take 10 years (full time) to sort unless there are significant time savings like a game directory with 10 games in it, each game having 10,000 correctly named files.
- Leave things as they are. You don't even know what you have when you have that many files. You will never be able to use much of what you have because you don't know what you have, where it is, or what it is named.
- Give up and start over, doing things correctly. That could mean to get the biggest bang for the buck with the data that you have by extracting the important items that you know about and anything that is easy to understand. Leave the rest in a "junk" directory.

Of course you can mix and match...

There exists an external USB drive chip set that gives single bit errors about once per 100 gigabytes of read access (no errors on write). The one I have plugs into a bare drive and makes it talk like a USB drive. If you copied files from one USB drive to another, the second copy may have errors. I have this problem. I finally tracked down the source and fixed it, but I was left with multiple copies of files with single bit errors. I wrote some scripts to help me do md5sum (checksums) on everything. I then went to the oldest copy (you have to examine all of the file timestamps, not just the usual one) for file types that could not tell me they were corrupted (e.g. a zip file can tell you it is corrupted).

The manufacturer of this chip had a buggy chip and a reference design that everyone used. Find a big file on a drive and run continuous checksums on the one file (WITH A DISK CACHE FLUSH EACH TIME). If you have the same checksum for 4 hours, you don't have this problem. Here is the Linux USB ID of the bad controller.
Bus 002 Device 003: ID 152d:2338 JMicron Technology Corp. / JMicron USA Technology Corp. JM20337 Hi-Speed USB to SATA & PATA Combo Bridge

It is easiest if you have all drives attached at the same time, but this may not be possible for you depending on how many drives you have and the number of USB ports you have free. Consider buying a USB hub so you can plug them all in at once to run fslint.

For your case, I would do things in this order:
- decide how to divide the data up and how to name it
- start organizing it in a "bang for the buck" fashion. You will probably never get finished, but there will come a time... Do this first so the checksumming step will print names that you understand.
- run some sort of checksumming utility on all the files. It will take a long time. Let the computer do the work (checksumming) rather than you doing the work to write programs that look first at file size, and time stamp, then the first and last 1k of the file, etc. I have used fslint under

Poor problem statement by Anonymous Coward · 2012-09-02 04:00 · Score: 0

You haven't defined what you want to accomplish.

Are you trying to reduce disk usage but need to retain a link from each occurrence in their respective trees?
Do you just want to keep the single occurrences with the most recent date?
Do certain subtrees take precedence when deciding where to file the one you keep?

To compare directories that should be identical... by Anonymous Coward · 2012-09-02 04:06 · Score: 0

Beyond Compare is a windows program, but I got my old version at least running fine under Wine.

I have an old paid version, but there is a shareware version...

I am not affiliated with them at all. We had a site license where I used to work.

Solved problem - fdupes by Anonymous Coward · 2012-09-02 04:09 · Score: 0

http://en.wikipedia.org/wiki/Fdupes
Works well with lots of options. The wikipedia page also has a big list of alternate programs to do the same job.

LessFS by paranoidd · 2012-09-02 04:12 · Score: 1

There's a FUSE-based file system called LessFS capable of performing block-level deduplication. The project is actively maintained and looks worth a shot. For more information, check its webpage at http://www.lessfs.com

Forget it by yurikhan · 2012-09-02 04:15 · Score: 1

All the methods suggested so far assume that identical files are bitwise identical. That’s a false assumption.

Consider an mp3 file. Add ID3 tags. Add ID3v2 tags. Re-encode to ogg. Now you have four files that have (almost) the same content but are bitwise different.

Consider a raw photo with EXIF tags. Convert it to jpeg, preserving the tags. Strip the tags. Resize to a web-friendly resolution. Now, you have 4 files which are bitwise different, but contain roughly the same image. (JPEG is a bit lossy and the downscaled version is *quite* lossy, but still.)

Consider a C++ program in source form. Build it, producing a binary and a bunch of intermediate files.

If you wanted to perfectly deduplicate this collection, you’d have to invent software that can detect all this non-bitwise duplicity.

Re:Forget it by mattpalmer1086 · 2012-09-02 06:19 · Score: 1

It's very true that there are often a lot of files with near duplicate content. Detecting near duplicates is much, much harder and will be probably orders of magnitudes slower to do, even if you can figure out how to do it in the first place.
However, there are also often a lot of files with exactly duplicate content. A government agency I worked with figured out they had over 30% identical duplication of files across their file stores. This was a significant cost for them.
So, while your initial observation has some truth, your conclusion to "forget it" is false. I'm reminded of my old boss, who always used to say "Don't let the perfect be the enemy of the good".
Re:Forget it by qubezz · 2012-09-03 01:52 · Score: 1

Consider an mp3 file. Add ID3 tags. Add ID3v2 tags. Re-encode to ogg. ...If you wanted to perfectly deduplicate this collection, you’d have to invent software that can detect all this non-bitwise duplicity.
For audio files there are audio fingerprint de-duplication tools, such as Bolide Audio Comparer. This particular program is pretty amazing when you need to rearrange and de-duplicate your audio files, set it loose on all your drives, it can scan flacs and mp4 too, so you truly can select just the most desirable copy of the audio rip. It is very good, with false positives only starting to occur when you reduce the detection threshold and it starts to find the non-explicit version, no-vocal versions, or remixes of the same song. It would be wise to organize and intelligently de-dupe media before setting less intelligent checksum de-duplication loose.

DoubleKiller by Anonymous Coward · 2012-09-02 04:15 · Score: 0

DoubleKiller is the best.

http://www.bigbangenterprises.de/en/doublekiller/

It is very fast, but obviously you should not start with CRC match. Start with Filesize and you should be able to weed out over 99% of your dups in half a day.

Re:Simple Script-Time to break out perl, python, . by Anonymous Coward · 2012-09-02 04:20 · Score: 0

Nobody mentioned this, but different file systems will have slightly different file sizes. The differences are small, a few bytes, but that is enough to screw over a program looking for exact matches.

BTW, the script is almost done now. It doesn't delete anything, but it does suggest which file should be retained by date.

Can you create something yourself by Zomalaja · 2012-09-02 04:33 · Score: 1

I don't know why people are recommending you use "find", "awk", "grep" etc. when you clearly stated that this is a Windows 7 environment. In any case a quick VB Net program I just created processed 43,800 files in 10 seconds. It would be faster but you must catch "Access Denied" errors for Folders like "System Volume Information" - extrapolating tells me 4 million files at 4000/sec means 15+ minutes to create a file with "Path" [TAB] "Name" [TAB] "Size". Adding a hash would add significantly to the processing time but it could be done easily. Question is if you have the tool(s) and ability to create something. Once the file is created you still have to parse it, sort it and flag the dupes.
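The VB Net program itself was not posted, so the following is only a sketch of the same listing step in Python, which also runs on Windows. The function name `write_listing` and the return value are assumptions for the example. Unreadable folders and files are collected instead of aborting the scan, and the output is one "Path" [TAB] "Name" [TAB] "Size" line per file.

```python
import os


def write_listing(root, out_path):
    """Write one 'Path<TAB>Name<TAB>Size' line per file; return (written, skipped)."""
    skipped = []

    def note_error(err):
        # os.walk calls this instead of raising when a directory cannot be read
        skipped.append(err.filename)

    written = 0
    # surrogateescape lets undecodable file names round-trip into the listing
    with open(out_path, "w", encoding="utf-8", errors="surrogateescape") as out:
        for dirpath, _dirs, names in os.walk(root, onerror=note_error):
            for name in names:
                full = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(full)
                except OSError:
                    skipped.append(full)  # e.g. access denied on a single file
                    continue
                out.write(f"{dirpath}\t{name}\t{size}\n")
                written += 1
    return written, skipped
```

Keep the output file outside the tree being scanned, otherwise the listing ends up listing itself. Sorting the result on the size column is then enough to find the candidates worth hashing.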

Re:Sorry, if you can't write a simple script, then by Anonymous Coward · 2012-09-02 04:40 · Score: 0

fuck Im an EE and my top 3 tools are batch files, lua, and C#

Tool I use by cdxta · 2012-09-02 04:42 · Score: 1

I use this program: http://www.foldermatch.com/ . Its built-in duplicate finder does exactly what you want: http://www.foldermatch.com/images/duplicate-file-finder.jpg . Of course you could always write your own tool as well. Folder match does it pretty efficiently though.

same by geert · 2012-09-02 04:44 · Score: 2

ftp://ftp.bitwizard.nl/same/

I used this to keep all versions of the Linux kernel source tree on my computer, with identical files hardlinked together to reduce storage space.
Both diff (blazing fast "diff -purN ") and patch handle hard links, so this was very workable.
It can be slow and take quite some memory (only 128 MiB-1 GiB in those days), but guess 16 GiB of RAM should handle 4 million files fine, as this is about the same order of magnitude as the few hundred kernel source trees I had lying around.

After git arrived, it was faster to just use git.

Re:same by rew · 2012-09-02 05:18 · Score: 2

As the author of "same", I was going to post the above suggestion.
Last time I used "same", 4.2 million files was peanuts. Of course, running through 4.8Tb of data is going to take some time.
People above are doing suggestions like doing CRCs of the files. Checking filesizes. Etc etc. Same does all of this:
First a list is compiled of the files to be handled. Then each file is stat-ed to determine its size. Then only same-size files are considered candidates for being the same. Next if the filesizes are the same, the CRCs are compared. The CRCs are calculated on an "as needed" basis. This means that most big media files will never need to be read entirely unless a duplicate is going to be found. Anyway. When the CRCs are the same, the files are compared bit-for-bit and if THAT comes out good, the files are hardlinked together.
The hardlinking means that you can further process the results. You can use find to eliminate say all duplicate files in a directory called "backup", provided that they ARE duplicates. Now you'll be left just with the unique files in that directory.
I'm not sure if all of this will easily run on windows: It's a Unix program. On the other hand, it uses simple calls and should easily be ported using the cygwin suite.
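The final step described above (full compare, then hardlink) can be sketched in a few lines of Python. This is not the source of "same": the function name and the temporary file suffix are invented for the example, and the CRC stage is left out so that only the size check and the byte-for-byte compare decide.

```python
import filecmp
import os


def hardlink_if_same(keep, dup):
    """Replace dup with a hard link to keep if their bytes match; True if linked."""
    if os.path.samefile(keep, dup):
        return False  # already the same file on disk
    if os.path.getsize(keep) != os.path.getsize(dup):
        return False  # different sizes can never be identical
    if not filecmp.cmp(keep, dup, shallow=False):
        return False  # byte-for-byte comparison failed
    tmp = dup + ".dedupe-tmp"  # hypothetical temporary name
    os.link(keep, tmp)         # create the new link first
    os.replace(tmp, dup)       # then swap it over the duplicate
    return True
```

Creating the link under a temporary name before replacing the duplicate means there is no moment at which the duplicate's path is missing. `os.link` raises `OSError` when the two paths are on different filesystems, since hard links cannot cross them.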
Re:same by Gunstick · 2012-09-02 23:36 · Score: 1

have you run findup (from fslint package) against same. Who wins?
Note: fslint is a shell script, so porting to windows could be a problem. Or simple by installing cygwin.

--
Atari rules... ermm... ruled.

Huh? by fm6 · 2012-09-02 04:49 · Score: 1

If you have 4.2 million files, duplication would seem to be the least of your problems. How do you find the specific one of the 4.2 million you need? Are there sets of files you know you'll never need to access?

And forgive me for playing the shrink, but how much of your problem is just compulsive hoarding?

Here's what I use by Anonymous Coward · 2012-09-02 04:49 · Score: 0

fdupes

http://premium.caribe.net/~adrian2/fdupes.html

Boot your system from a live Linux distro and install fdupes.

It's fast, simple and quick.

I know no one will read it but here it is. by Anonymous Coward · 2012-09-02 04:53 · Score: 0

http://dosnlinux.wordpress.com/2007/02/18/fdupes-tutorial/

done this before by Anonymous Coward · 2012-09-02 04:55 · Score: 0

Do it in chunks.
De-Dup all the PDFs, then do all the JPGs, then all the BMPs, then all the MP3s,
then all the movies, then all the databases, then documents, etc. etc.

When you've gone through the 30 or so file extensions most common to your files,
then do a *.* search. If that still bogs down, do that in chunks:
All files from 0K to 1 Meg, then from 1Meg to 10Meg, then do all from 10Meg to 50Meg, then from 50Meg....

I use this program, it has handled some real hairballs for me:
http://archive.org/details/tucows_373411_Duplicate_File_Finder
dup-setup.exe

Re:Sorry, if you can't write a simple script, then by wisdom_brewing · 2012-09-02 05:01 · Score: 2

How about intelligent people just looking out for truly insightful comments amongst the various posts? It would be interesting to see a true, accurate demographic of slashdot folk, I guess the people that post are actually a fairly small subset and the number in computer related industries equally small...

--
I am very sucseptible to "let's have another drink"

Not first few bytes, somewhere in the middle... by SuperKendall · 2012-09-02 05:02 · Score: 1

a) Looking at file sizes, then
b) Looking at the first few bytes of files with the same size.

I would say instead you should seek to some value near the middle of the file calculated by the file size.

The reason I say that is I have around 100k uncompressed tiff files (yes my own), mostly about three or four distinct sizes across the set. If you just look at the first few characters it's going to be the same TIFF header for every TIFF file of the same size, leading to a huge amount of checksum work that could be eliminated by shifting the check.

Perhaps even a few quick random samples, with one at the start, middle and end.
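A sketch of that sampling idea in Python follows. The function name `sample_key` and the 4096-byte sample size are assumptions for the example. Two files with different keys are certainly different; two files with equal keys still need a full hash or compare.

```python
import os


def sample_key(path, sample=4096):
    """Return (size, start bytes, middle bytes, end bytes) as a cheap pre-filter key."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(sample)
        f.seek(max(0, size // 2 - sample // 2))  # centre the sample on the midpoint
        middle = f.read(sample)
        f.seek(max(0, size - sample))            # last `sample` bytes of the file
        tail = f.read(sample)
    return size, head, middle, tail
```

For a pile of same-size TIFFs with identical headers, the middle sample is what separates them, at the cost of two extra seeks per file.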

--
"There is more worth loving than we have strength to love." - Brian Jay Stanley

Patiently. by dave_leigh · 2012-09-02 05:02 · Score: 1

Compare names and sizes, then CRC. Let the thing run.

Re:Patiently. by jedwidz · 2012-09-02 12:08 · Score: 1

Or just 'Let the thing run'.
There's nothing here to indicate that the de-duper referenced in TFA actually failed, and a week or more doesn't sound unreasonable for that amount of data.
I'm currently going through the same process on my NAS. The first step was to gather a table of filenames and checksums, as suggested in first post. This can take a long, long time.
The next step, which I'm still tackling, is to do something with the duplicates. If you're actually deleting files (as opposed to replacing duplicates with hardlinks to the same file), this pretty much has to be manually-guided or you risk majorly screwing up your data (hint: some files are duplicates for a good reason).

whatpix by Anonymous Coward · 2012-09-02 05:05 · Score: 0

There is a Perl script called whatpix (whatpix.sourceforge.net) you could use. But for large datasets like in your case the database used storing all the crc's is not big enough. I solved that by including DB_File and it works with several 100k pictures. Just change the script to include "use DB_File;". See below.

use strict;
use Getopt::Long;
use Digest::MD5;
use Digest::SHA1;
use File::Copy qw(move);
use DB_File;

You start the script via "perl whatpix.pl -e -dir pictures". -e for removing duplicates and -r for renaming duplicates. If you work under windows, you could install Strawberry Perl.
This should work.

Ran for a week? by PPH · 2012-09-02 05:05 · Score: 1

Is this one of these apps that restarts itself every time the file system changes? Like when a background process appends to a log or something like that.

You might have to start your system in a maintenance mode and skip starting all background processes. Or mount this drive under another system as a data drive.

--
Have gnu, will travel.

Only if you have 100 unique files by HiggsBison · 2012-09-02 05:08 · Score: 4, Informative

If you have 100 files all of one size, you'll have to do 4950 comparisons.

You only have to do 4950 comparisons if you have 100 unique files.

What I do is pop the first file from the list, to use as a standard, and compare all the files with it, block by block. If a block fails to match, I give up on that file matching the standard. The files that don't match generally don't go very far, and don't take much time. For the ones that match, I would have taken all that time if I was using a hash method anyway. As for reading the standard file multiple times: It goes fast because it's in cache.

The ones that match get taken from the list. Obviously I don't compare the one which match with each other. That would be stupid.

Then I go back to the list and rinse/repeat until there are less than 2 files.

I have done this many times with a set of 3 million files which take up about 600GB.
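The loop described above might look like this in Python. It is a sketch, not the poster's program: the function names and the 1 MB block size are assumptions. The input is a list of files already known to have the same size.

```python
def same_content(path_a, path_b, block=1 << 20):
    """Compare two files block by block, stopping at the first difference."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            x = a.read(block)
            y = b.read(block)
            if x != y:
                return False
            if not x:  # both reads hit end of file together
                return True


def group_identical(paths):
    """Partition same-size files into groups with identical content."""
    remaining = list(paths)
    groups = []
    while len(remaining) >= 2:
        standard = remaining.pop(0)   # first file becomes the standard
        group, rest = [standard], []
        for other in remaining:
            if same_content(standard, other):
                group.append(other)   # matched: taken off the list
            else:
                rest.append(other)    # gave up early: stays for a later round
        if len(group) > 1:
            groups.append(group)
        remaining = rest
    return groups
```

With 100 same-size files that are all copies of one another this does 99 comparisons in a single round. The 4950 figure only appears when every file turns out to be unique.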

--
My other car is a 1984 Nark Avenger.

done this before by Egyptoid · 2012-09-02 05:11 · Score: 1

Do it in parcels. What are your most common files? databases? PDFs? JPGs? spreadsheets? Whatever they are, de-dup all the common file extensions one at a time. Do the pictures, then the documents, then the PDFs, then the MP3s, ... If this is a porn collection, you need to log out and never come back to slashdot ever again. Anyways, then start working on *.*, but break that up into chunks. De-dup from 0K size to 1Meg, then from 1Meg to 10Meg, from 10Meg to 50Meg, etc. This is the tool I use: http://archive.org/details/tucows_373411_Duplicate_File_Finder I have thrown some real hairballs at it and it works fine.

--
== I question your beliefs, makes me a Troll. You insult my beliefs, you are progressive and mainstream. Okay. Got

Use a file system that does de-dupe for you by Jawnn · 2012-09-02 05:17 · Score: 1

If your aim is to clean up your sloppy directory organization and the almost inevitable dupes that will ensue over the years, good luck to you. Several respondents have made good suggestions. If, however, your aim is to just save space, use a storage platform that will de-dupe for you, at the block level. Nexenta comes to mind, but there are others, of course. I wouldn't do this on a file system that saw a lot of interactive use, but you have indicated that this is an archive. Perfect fit.

My own script (feel free to change) by Lulu+of+the+Lotus-Ea · 2012-09-02 05:18 · Score: 2

My home-rolled solution to exactly this problem is: http://gnosis.cx/bin/find-duplicate-contents.

This script is efficient algorithmically and has a variety of options to work incrementally and to optimize common cases. It's not excessively user-friendly, possibly, but the --help screen gives reasonable guidance. And the whole thing is short and readable Python code (which doesn't matter for speed, since the expensive steps like MD5 are callouts to fast C code in the standard library).

--
Buy Text Processing in Python

One Piece At A Time by AmberBlackCat · 2012-09-02 05:19 · Score: 1

I think you should only de-duplicate one type of file at a time. Maybe start with all png files. Then all mp3 files. Then all txt. Then all jpg. The problem will get smaller and smaller and you won't have to do the whole thing at one time, which results in nothing getting de-duplicated in the first place. And as the number of files gets smaller, eventually you will get to a point that you can de-duplicate the whole pile of remaining files at once. And it might not hurt to delete *.tmp or whatever your operating system's equivalent of "all temporary files" is, before you start de-duplicating. And if possible, it probably wouldn't hurt to delete all files that are zero bytes in size before starting de-duplication. If 4 million of your 4.2 million files all happen to be the same file type then never mind.

4.2 million? by jon3k · 2012-09-02 05:21 · Score: 1

That's a lot of porn, good luck!

Re:The answer is simple jamiedolan by Anonymous Coward · 2012-09-02 05:35 · Score: 1

It's off to anger management classes for you!

FSLINT by Anonymous Coward · 2012-09-02 05:39 · Score: 0

use "fslint" powerful, fast, already written and in use for quite a while.

Re:Sorry, if you can't write a simple script, then by Anonymous Coward · 2012-09-02 05:42 · Score: 0

Perhaps we have different definitions of "technical?"

Is a cell phone salesguy "technical?" I'd say no.
Is a medical doctor "technical?" No unless they can write a script.
Is a car technician using computer diagnostics "technical?"
Is a "tech sergeant" in the military "technical?"

Hummm. This is tougher.

Each of these folks have skills that I most definitely do not. Perhaps I need to rethink my definition of "technical" or amend it to be within a specified field?

Someone "technical" in the computing field, should be able to write a tiny script. "Power users" don't count, unless they can script. In that case, they are "users" and not "technical."

Getting started on this problem is trivial - 1 line script that lists file information recursively. If you can't do that and claim to be "technical", I weep for your skills. Windows can do this too, but in UNIX/Linux/OSX something like
$ find . -type f -P -ls > /tmp/outputfile
is a good start. A little use of awk, perl, python, ruby ... to get only the desired columns, and print them in the order desired ... and the script is almost done. With just the output from that "find" command, the resulting file can be opened in most spreadsheets and columns sorted as needed.

"Technical" is about solving a problem, not "liking computers."

Why??? by barfy · 2012-09-02 05:47 · Score: 1

Ok, first, why do you need to do this? Space is pretty darn cheap, and this seems like a tremendous waste of time and energy to save tens of dollars. But more importantly, I find I need TONS less space now that I just depend on the Internet to keep all of my porn and to stream it.

File size then interleaved secure hash by Terje+Mathisen · 2012-09-02 06:01 · Score: 2

This is a very fun programming task!

Since it will be totally limited by disk IO, the language you choose doesn't really matter, as long as you make sure that you never read each file more than once:

1) Recursive scan of all disks/directories, saving just file name and size plus a pointer to the directory you found it in.
If you have multiple physical disks you can run this in parallel, one task/thread for each disk.

2) Sort the list by file size.

3) For each file size with multiple entries do:

3a) How many matches are there and how large are they?

3a1) Just two files: Read them both in parallel, using a block size of 1MB or more in order to avoid extra disk seeks, and compare directly. Exit on first difference of course!
3a2) 3 or more files: Read them all interleaved, still using a 1MB+ block size. For each block calculate a CRC32 or secure hash, compare these at the end of each block iteration. When a single file differs from the rest, it is unique.
When two or more are equal but still different from the majority of the group, recurse into a new copy of the scanning function that checks the smallest group, then upon return go on with the rest.

It should be obvious that your scanning function needs to accept an array of open file handles/descriptor plus an offset to start the scanning process at, thus making it easy to call it recursively to check the tails of a sub-array!

(A possible problem can occur if you have _very_ many files of the same size, in that the operating system could run out of file handles for simultaneously open files! In that case I'd fall back on passing in file paths instead of open handles and take the hit of re-opening each file for each block to be read. I would also increase the block size significantly, into the 10-100 MB range, so that everything except big ISOs and similar would be read in a single access. The same process is probably optimal for file sizes less than the minimum block size.)

This algorithm should be able to do what you need in significantly less time than you'd need to just read everything once. I'd estimate about 50 MB/s effective reading speed, so if everything is on a single disk (4.9 TB? Not very likely!) and every single file size has multiple entries that only differ in the last byte, you would need 100 K seconds, or a little more than a day. My guess is you should easily finish overnight!
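A compact sketch of steps 2 and 3 in Python, under some simplifying assumptions: it uses a loop with a work list instead of recursion, it does not special-case groups of exactly two files, and it keeps every file of a size group open at once, so the file handle limit mentioned above is not handled. The function names are made up, and a SHA-256 digest per block stands in for "CRC32 or secure hash".

```python
import hashlib
import os
from collections import defaultdict


def group_by_size(paths):
    """Return lists of paths that share a file size (step 2)."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    return [g for g in by_size.values() if len(g) > 1]


def split_identical(paths, block=1 << 20):
    """Read same-size files in lockstep, splitting the group when blocks differ."""
    handles = {p: open(p, "rb") for p in paths}
    duplicates = []
    try:
        pending = [list(paths)]  # groups whose members agree so far
        while pending:
            group = pending.pop()
            buckets = defaultdict(list)
            at_eof = False
            for p in group:
                data = handles[p].read(block)  # each handle keeps its own offset
                at_eof = not data              # same size, so all end together
                buckets[hashlib.sha256(data).digest()].append(p)
            for members in buckets.values():
                if len(members) < 2:
                    continue                   # differs from all others: unique
                if at_eof:
                    duplicates.append(members)
                else:
                    pending.append(members)    # keep reading this subgroup
    finally:
        for f in handles.values():
            f.close()
    return duplicates
```

Each file is read at most once, and a file that differs from the rest of its group early on is dropped after a single block. Feeding every list from `group_by_size` into `split_identical` covers the whole collection.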

Terje

--
"almost all programming can be viewed as an exercise in caching"

dupseek perl script by Anonymous Coward · 2012-09-02 06:05 · Score: 0

I found this on google... appears to use all the shortcuts mentioned in previous posts. Requires linux or osx. GPL licence too.
http://www.beautylabs.net/software/dupseek.html

A helpful tool by Anonymous Coward · 2012-09-02 06:06 · Score: 0

Look at Beyond Compare from Scooter Software. It doesn't do exactly everything you need but does enough of it that it may help speed up the task. It is available for Windows and (many) *nix systems. It is very inexpensive to buy but is available for free trial and the company is very liberal with what constitutes a trial. It can be scripted, too. I use it all the time for similar, but not quite as big, sorting and categorizing tasks.

There's an easy program for that: by Anonymous Coward · 2012-09-02 06:08 · Score: 0

I had this exact problem a few years ago. Eventually I came across this program:
http://www.sentex.net/~mwandel/finddupe/

It searches all your files for duplicates and replaces identical files with hard links. Just run it like this:
finddupe -hardlink c:\photos

It's also very fast, using a few heuristics before doing a byte-for-byte comparison. It chewed through a couple hundred gigs of files in a reasonable amount of time (an hour or so). I use this program all the time and it has worked well for my purposes.

fslint by Anonymous Coward · 2012-09-02 06:10 · Score: 0

...and the result is sorted by wasted space with the biggest bang for the buck at the top of the list. Stop deleting the dups when the savings is no longer worth it.

Bikeshed by Wonko+the+Sane · 2012-09-02 06:12 · Score: 1

Step 1: Build a bikeshed
Step 2: Ask a bunch of geeks what color to paint it
Step 3: ???
Step 4: Profit!

Trailer Park Method by SuperCharlie · 2012-09-02 06:14 · Score: 1

Grab what's important and let the format tornado take care of the rest.

simpler method by Anonymous Coward · 2012-09-02 06:16 · Score: 0

I am not certain if this option is viable, but if you have a separate machine available, you could
try using DragonflyBSD's HAMMER file system, which has block level de-duplication.

if you think of it, chances are it already exists by crispytwo · 2012-09-02 06:25 · Score: 1

Perhaps this is something you're looking for:
https://github.com/SoftwareMaven/DeDuper

google: github deduper

find by Anonymous Coward · 2012-09-02 06:33 · Score: 0

Enjoy:

find / -type f -size +1 -print0 | xargs -0 cksum | sort -nk1,2 > /tmp/allcksums
awk '$0=$1" "$2' /tmp/allcksums | uniq -c | awk '$1 != 1{print $2" "$3}' > /tmp/dups
grep -Ff /tmp/dups /tmp/allcksums

Note - it'll also find hardlinks.

A Unix-y method for Mac OS X by Smurf · 2012-09-02 06:41 · Score: 1

I can tell you how I have done similar stuff on Mac OS X, using only built-in tools and features and very simple bash scripts. Of course you are using Windows, so you will have to change some of the steps to use the matching Windows tools (like using .bat files instead of bash, etc) and may even need to install some stuff. Even if you don't use it, it may be of interest for other Mac users.

Here it goes:

First, save this very crude bash script into a file (sorry, I'm not a bash programmer):
#!/bin/bash

function navigate_directory {
    cd "$1"
    for anItem in *
    do
        if [ -d "$anItem" ]
        then
            echo $level$anItem
            export level=$level"."
            navigate_directory "$anItem"
            export level=${level:1:`expr ${#level} - 1`}
        elif [ `mdls -name md5cs -raw "$anItem"` = "(null)" ]
        then
            #echo \ \ $anItem
            md5cs=`md5 -q "$anItem"`
            #echo \ \ \ \ $md5cs
            xattr -w com.apple.metadata:md5cs $md5cs "$anItem"
        fi
    done
    cd ..
}

crawlDirs=$@;

export level="."
# "$@" keeps each directory argument separate ("$*" would join them into one string)
for anItem in "$@"
do
    echo $anItem
    navigate_directory "$anItem"
done

All that script does is crawl through all the directories in the input, and for each file it calculates the MD5 checksum (hint: md5cs=`md5 -q "$anItem"` ). Then it uses xattr to save the MD5 checksum as an extended attribute that can be searched using Spotlight (you would need to use the equivalent search feature in Windows 7).

Because you want it to be searchable through Spotlight the "legal" way to do this is by creating your own little application that "registers" the attribute in the system. But that is waaaaaay too much work for something that you don't plan to use a week from now, so just cheat and register it as an Apple metadata attribute: xattr -w com.apple.metadata:md5cs $md5cs "$anItem"
(if this makes you uncomfortable you can later delete the attributes using a similar function)

To index everything, run the script from the base directory of your filesystem (not sure how to do that in Windows, you may have to run it on every drive), or just run on the directories that have your files (it's pointless to index the system files). The time it will take depends on the number and size of the files you have. Given your 4.2 million files in 4.9 TB it should take a day or so given your fast hardware.

At this point if you do a Spotlight search for the MD5 checksum of a file you will almost immediately get a list of all its dupes. (If you don't, you may need to rebuild the Spotlight indexes by running mdutil -i on and then off on every drive. I don't think it's necessary but YMMV).

Now copy this other bash script. Note how it is very similar to the above one.
#!/bin/bash

function get_md5_for_file

I have half-solved this problem... by sootman · 2012-09-02 06:45 · Score: 1

... but the other half is a bitch.

Using various tools, I got a listing of all files and a checksum for each. (Checksumming obviously takes some time.) Then, sort by checksum. Any time you have two matching rows, you probably have dupes. If the filesize is the same (down to the byte) then they are almost certainly dupes. (Further things to compare: date modified and filename. If all four match, you can be pretty sure, unless you have Google's amount of data, that you have a dupe.) If you want, write a script to delete all but the first instance of each file.

BUT--the problem is logic. Deleting files only gains you space. It does NOTHING to help you organize things. In fact, it'll probably make things worse. For one thing, you'll wind up with lots of empty folders. For another, there are many scenarios you'll run into.

Folder A has File1 and File2, and Folder B has File1, File2, and File3. A dumb system might leave behind FolderA/File1, FolderA/File2, and FolderB/File3.

Or, maybe you have FolderA/File1, FolderA/File2, and FolderA/File3, along with FolderB/File2, FolderB/File3, and FolderB/File4. Ideally, you'd want to end up with Folder/File1, Folder/File2, Folder/File3, and Folder/File4. Again, that's beyond the scope of a typical dumb tool.
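
The two scenarios above can at least be detected automatically, even if deciding what to do about them stays a human job. Here is a minimal Python sketch, assuming files are matched purely by content (SHA-1 here); the function names and the verdict strings are invented for illustration:

```python
import hashlib
import os


def content_hashes(folder):
    """Return {sha1 hex digest: relative path} for every file under folder."""
    found = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            digest = hashlib.sha1()
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            found.setdefault(digest.hexdigest(), os.path.relpath(path, folder))
    return found


def compare_folders(folder_a, folder_b):
    """Describe how two folders overlap by content, ignoring file names."""
    a, b = content_hashes(folder_a), content_hashes(folder_b)
    only_a = sorted(a[h] for h in a.keys() - b.keys())
    only_b = sorted(b[h] for h in b.keys() - a.keys())
    if not only_a and not only_b:
        verdict = "identical"
    elif not only_a:
        verdict = "A is contained in B"
    elif not only_b:
        verdict = "B is contained in A"
    else:
        verdict = "partial overlap"
    return verdict, only_a, only_b
```

In the first scenario this would report that Folder A is contained in Folder B, so all of A can go; in the second it would list File1 and File4 as the files that have to be carried over into the merged folder.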

Even tools that search for dupes and replace all but one with a link will still leave you with, at best, twice as many folders as you need. You might save some space, but you still have a big mess too.

So, all you can do is decide what's most important: your time, your money, or your sense of neatness. Probably the best solution is to search for big files (ISO, MPG, etc.) and delete any obvious dupes. Then, get another big disk and start migrating one type of file at a time. Get to the point where you can say "Every single ISO I have exists one time on this disk over here. Any others I find can be deleted." You want to aim for the low-hanging fruit. You can spend 2 minutes and delete 5 movies and reclaim a few GB, or you can spend hours pruning little web files and get back just a few MB.

You almost certainly have enough files that cleaning them would literally take weeks or months. Try to do a little at a time. Don't think you can lock yourself in a room and emerge 2 days later with a perfectly clean filesystem. Trying to reach the theoretical perfection of "There are no dupes anywhere among all my disks" will take a lifetime, drive you mad, or both.

I've been meaning to clean up about 5 TB of disk myself for about 4 years, judging by folders with names like "Master_2008_All_Organized_No_Dupes". The most effective method I've found for dealing with it is accepting the fact that I never will. :-) Disks just keep getting cheaper. Just keep buying them. Every so often, take some time, do a nice migration, and clean up what you can, but if you're employed in a technical field, you can buy another 1 TB drive for just a few hours' pay. No reason to spend 40 hours of your life trying to save that.

--
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.

WinDirStat... by blahplusplus · 2012-09-02 06:49 · Score: 1

... it tabulates the size of a given directory and gives you graphical representations of where big files are that you can click on immediately. Then use something like Beyond Compare to compare directories.

Try samefile by cpghost · 2012-09-02 07:47 · Score: 1

On Unix systems, a small utility named samefile does wonders to de-dup after the fact. It should be portable enough to run on Windows as well...

--
cpghost at Cordula's Web.

Which free de-dup program? by Lorens · 2012-09-02 07:47 · Score: 1

OP wrote:

I tried running a free de-dup program, but it ran for a week straight and was still 'processing' when I finally gave up on it.

Maybe you're not naming the free de-dup program in question out of politeness, but I'd like to know... Or leave a message with the author of said program?

How to reduce the complexity of the problem by satch89450 · 2012-09-02 07:48 · Score: 1

I've tried to read through the comments, but got a little lost here and there. So I thought I'd share a way I did the job on a fairly large corpus of data to identify all the duplicates.

1. Build a file from the corpus with three fields: size (with leading zeros), hash of the first bytes (I used CRC-16 over the first 32 kilobytes, using a really fast implementation taken from a comm program), and the file name.
2. Sort the resulting file.
3. Filter out the entries that have unique size/CRC pairs; any sets of files that remain are potential duplicates.
4. Based on the first filtered file, build a second file with three fields: size (with leading zeros), hash of the entire file (I used MD5), and the file name.
5. Sort the resulting file.
6. Filter out the entries that have unique size/MD5 pairs; any sets of files that remain are still potential duplicates.
7. Compare the remaining sets of potentially duplicate files byte by byte.

Got really large files in your corpus? Then consider an intermediate step where you hash a larger and different portion of the file. For something different, you could hash the last bytes of the file so you don't end up duplicating work. Say a megabyte. In my case, I didn't need the extra pass because of the data involved. My corpus was on magnetic tape, so I couldn't just compare files byte by byte, because I would have had to load them somewhere first to do the compare. So I had to identify the potential duplicates *first*.
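
The staged filtering described above might look roughly like this in Python. It is a sketch under stated assumptions, not the original tape-era tooling: it groups in memory instead of sorting text files, it swaps in zlib's CRC-32 for the CRC-16 mentioned in the post, and all function names are invented.

```python
import filecmp
import hashlib
import os
import zlib
from collections import defaultdict

PREFIX_BYTES = 32 * 1024  # the post hashed only the first 32 kilobytes


def prefix_crc(path):
    """CRC-32 of the first 32 KB (the post used CRC-16; this is a stand-in)."""
    with open(path, "rb") as fh:
        return zlib.crc32(fh.read(PREFIX_BYTES))


def full_md5(path):
    """MD5 of the whole file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refine(groups, key):
    """Split each candidate group by key(path) and drop groups of one."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        out.extend(g for g in buckets.values() if len(g) > 1)
    return out


def find_duplicates(paths):
    """Run the cheap filters first, then confirm byte by byte."""
    groups = [list(paths)]
    for key in (os.path.getsize, prefix_crc, full_md5):
        groups = refine(groups, key)
    confirmed = []
    for group in groups:
        while len(group) > 1:
            first, rest = group[0], group[1:]
            same = [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if same:
                confirmed.append([first] + same)
            group = [p for p in rest if p not in same]
    return confirmed
```

Each stage only touches files that survived the previous one, which is the whole point: most files are eliminated by size alone and never get opened.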

You don't. by Anonymous Coward · 2012-09-02 07:55 · Score: 0

Dedupe is a great buzzword, and sure, it's the new hotness.

But let's face it... 5TB of data? That's so cheap and easy for you to keep that it's not much of an issue. You're not trying to cram hundreds of TB (or even in the PB range) onto a SAN, you're not trying to get a few hundred thousand IOPS, and you're not trying to cram all of that data onto SSDs for performance reasons.

Let the drives be. Move on.

Instead of Programming This and Reinventing the Wh by Anonymous Coward · 2012-09-02 08:07 · Score: 0

Get a de-dupe storage device such as NexentaStor CE, which is free up to 18 TB.

Question by jodido · 2012-09-02 08:08 · Score: 1

Pardon me for asking, but if all this data is so important to you that you can't bear losing a single file, why didn't you keep it sorted in the first place?

easy duplicate finder by Anonymous Coward · 2012-09-02 08:15 · Score: 0

Simple... Use Easy Duplicate Finder easyduplicatefinder.com. I use it myself; inexpensive and just works.

No need to reinvent steel just because you want a hammer.

r

C# code to get all the duplicates on windows 7 by RekkanoRyuji · 2012-09-02 08:19 · Score: 1

Reading this prompted me to write a bit of code to do the de-dupe comparisons. Here is the code. You will have to mark the project to allow unsafe code :) (in project properties). Compiled with Visual Studio 2010.

The program reads the first 4MB of each file and computes a hash. A thread is run for each drive you are searching.

If you want all drives, comment out the section where it says to do so; otherwise just add the drives you want to the list of DrivesToSearch.
If you scan your C drive, I suggest adding some of the folders like I have below to the Ignore Directories. The "ToLower()" is there just to make sure that it is lower case, else the hash match won't work.

Please forgive the code, as this was very quick-n-dirty

Code runs *far* faster than a week....
C:\ = 185,000 files.
F:\ = 29,690 files
G:\ = 20,765 files
H:\ = 60,851 files
i:\ = 52,442 files
D:\ 196 files (DVD ROM)

Total: 348,944 files on 6 drives with 3.2TB of used space took about 50 minutes 52 seconds

Speed can be improved by lowering the 4MB check to something smaller. Many of the files on F and G are over 4MB in size, so those drives took the longest to complete, even though they had fewer files in total.
Code Below. (mutters about slashdot and their inability to allow code)

http://pastie.org/4652387

Re:C# code to get all the duplicates on windows 7 by RekkanoRyuji · 2012-09-02 08:26 · Score: 1

Oh, I forgot. The output will be put in the C:\Temp Folder. It will look something like this.

C0EgbnbmRRjDX47IPZ3TxaNSUYTDifvRXMnq0YjlGIA=
F:\Music\jpop\Saki_Nijino-Over_the_Rainbow\(Nijino_S-Rainbow)-01-Tokimeki_Arigatou.mp3
F:\Music\Anime\Tokimeki Memorial\tokimeki memorial - over the rainbow\over the rainbow_track01_tokimeki arigatou.mp3

Pr3lS9OFNHLjWCQ8OW3/fh+KOGL5J9lJVZzPUMqRptI=
I:\Pictures\old s\Picture 208.jpg
H:\Desktop Backup\old s\Picture 208.jpg

SDFS by WhiteDragon · 2012-09-02 08:51 · Score: 1

SDFS is a cross-platform dedupe system that works on Linux and Windows.

At work, my company uses EMC Avamar, so if you are interested in a commercial product, that's a standalone storage/dedupe system. However, it's pretty expensive.

--
Did you mount a military-grade, variable-focus MASER on an unlicensed artificial intelligence?

Tools for the Job... by Eyeballs · 2012-09-02 08:52 · Score: 2

First: Get a copy of Windows Server 2012 and use the new deduplication system (which deduplicates at the 'file chunk' level across an entire disk): https://www.usenix.org/conference/usenixfederatedconferencesweek/primary-data-deduplication%E2%80%94large-scale-study-and-system

Now that you've taken care of the data duplication, let's talk about the tools for sifting through large sets of files:

1. Get 'Everything' (http://www.voidtools.com/): This tool allows 'instant' searching for any file throughout _all_ your files; I've used it on 4 million files myself. Just start typing part of the file name and it will show you a list of where those files are located on your system. Also, the list is 'live': you can right-click on any icon in the file list, and it will act the same as if you had right-clicked on the file itself in Explorer.

2. Get 'SpaceMonger' (http://www.sixty-five.cc/sm/): This tool shows what's taking up the space on your computer, it's similar to 'WinDirStat' but more flexible, customizable, and detailed.

3. Get 'ZTreeWin' (http://www.ztree.com/): This tool is the Swiss-Army knife program for working on files (finding, searching, viewing). If you remember 'XTree', it's a clone of that which can work on 4 million(+) files.

4. Get 'Beyond Compare' (http://www.scootersoftware.com/): This tool allows for easy comparison/synchronization of folders (and files). Compare two of your old backup folders and merge them.

Beta since 2001! Must moonlight at Google by Anonymous Coward · 2012-09-02 09:08 · Score: 0

Beta since 2001! They must moonlight at Google!

Google. by Anonymous Coward · 2012-09-02 09:22 · Score: 0

Google.

And you could also google for dedup and Linux or Windows.

Compress it by bugs2squash · 2012-09-02 09:46 · Score: 1

Even if each file is "uncompressible", a good compression system should almost eat the dupes and won't break anything that relies on the dupe actually being where it is in the file system plus it is a more "standard" solution and if your processor outpaces your disk it may even make things run faster.

--
Nullius in verba

file size is good by Rexel99 · 2012-09-02 10:00 · Score: 1

In my smaller efforts, I do a standard file search in the Windows folder/browser in detail view.. say *.mov or *.mp3 and sort them by file size and it's pretty quick.
Add the folder/view column and you can see their location and identify all the duplicates. This may not work so well for .jpg or .raw where the file sizes are closer, but if the file names are also duplicated this will be quite obvious. Right-click and open destination for more info (what else is in that folder), or simply select all but one of the files shown and delete them, there and then. Done.
Or is this too simple a solution?

Deduplication by thetrom · 2012-09-02 10:41 · Score: 2

Check out dedup in Windows Server 2012 - http://blogs.technet.com/b/filecab/archive/2012/05/21/introduction-to-data-deduplication-in-windows-server-2012.aspx

Re:Sorry, if you can't write a simple script, then by Anonymous Coward · 2012-09-02 10:49 · Score: 0

See, comments like this are why I read slashdot. Here's some dufus who defines a person as "technical" if and only if he can write a script, then proceeds to "solve" the problem of finding duplicates within a list of 4 million files by proposing a "script" that dumps ls -lR to a text file, that "can be opened in most spreadsheets and columns sorted as needed." Thanks for the afternoon laugh, dude.

People and their busy work by sunking2 · 2012-09-02 10:49 · Score: 1

Do the dupes really matter? Out of 4.5T how much could be duplicates? In the overall scheme of things it's probably less than 1%, so who cares. If you stumble on them, clean it up. If you don't, who cares.

Use zfs on bsd or nexentastor by Anonymous Coward · 2012-09-02 11:06 · Score: 0

You can do it on ZFS. Check out NexentaStor; I believe it is free up to 18 TB, but there are other options. Make sure you have plenty of RAM and a small SSD for the dedup data and you should be fine. In fact, you can even compress as well.
Or you can wait until btrfs does it...

Maybe this will help by 192_kbps · 2012-09-02 11:23 · Score: 1

Use the -d switch if you want to automatically select the file to delete or use no switch if you want a list of commands to copy to remove duplicates.

For readability, s/;/;\n/g. From an error message it seems Slashdot is hostile to small lines in posts. The original is 73 lines.

http://pastebin.com/sUfZkVaQ

#!/usr/bin/env perl use strict; use Digest::SHA; use Cwd; use File::Util; my $topDir=cwd(); my($f) = File::Util->new(); my(@files) = $f->list_dir($topDir,'--recurse'); my %hash; my $deleteFlag=$ARGV[0]; #print $deleteFlag,"\n"; foreach my $file(@files) { if(-d $file) {next;} my $size=$f->size($file); push @{$hash{$size}},$file; } my ($filectr,$setctr)=(0,0); foreach my $key (sort { $a $b } keys %hash) {#loop through sizes my $value=$hash{$key}; my @arr=@{$value}; my $numFiles = @arr; if ($numFiles $b } keys %shahash) { #loop through files of same hash value my $shavalue=$shahash{$shakey}; my @shaarr=@{$shavalue}; my $numFilesSha = @shaarr; if($numFilesSha new($alg); $sha->addfile($filename); my $digest = $sha->hexdigest(); return $digest; } sub unixFilename { my ($filename) = @_; $filename =~ s/\)/\\\)/g; $filename =~ s/\(/\\\(/g; $filename =~ s/\ /\\ /g; $filename =~ s/\;/\\\;/g; $filename =~ s/\'/\\\'/g; $filename =~ s/\"/\\\"/g; $filename =~ s/\&/\\\&/g; $filename =~ s/\!/\\\!/g; return $filename; }

I wrote something that solved this for me by Anonymous Coward · 2012-09-02 11:37 · Score: 0

I have about 12 TB of data, across 3 computers, 5 sata drives and 11 USB drives.

I set up one system with SQL Server and wrote an app that saves the attributes of every file to SQL:
Computername, DriveVolumeID, Path, Name, CreateDate, ModifiedDate, Size, MD5 (I can turn this off during some scans).
I can then write some really great queries that show me matching files by size or by MD5, and prioritize them by size.
I then added a table that lets me add certain paths (with wildcards) of file names, or MD5s, to a 'classification'. This lets me mark files that I know about. For instance, one drive is just my rationalized movies: no dupes there, just movies that work and are organized. So I classify all files on that DriveVolume as 'Rational movies' and can filter them out in queries that I use to locate data I haven't addressed yet.
I then wrote a UI that takes these queries and allows me to select file attributes and attach classifications to those attributes. In the UI I can also click on a file record and it will pull up the file (if on the same machine that it was inventoried on). It works awesome and was definitely worth the time and effort: data cleanups are now fast and easy, and new data that is randomly sprawled about can be identified and organized quickly whenever I find time to do it.

Wrote it today in Perl by Anonymous Coward · 2012-09-02 12:08 · Score: 0

Wrote one today in perl. Ignoring directories in the script, but have them as reference in the list-o-dups file. Directories help a human decide which files to keep, but usually the less useful filename is the deciding factor for me, not the directory.

Columns
* file size (right justified to handle 99GB files or less; use uint)
* md5sum (placeholder for most files)
* filename (not including directory)
* create date (just for reference)
* directory (just for reference)

Processing:
* create the complete file list for processing ... this allows filtering by extension (only images, only video files, .... )
* sort that list on file size - getting the columns lined up is critical
* if files have the same size, first verify they have the same extension - no use looking further if a.jpg and a.txt happen to be the same size. Case insensitive check.
* if they are same extension and size, then I perform an md5sum and store that into a data hash for this file.
* if the md5 hash for this file and the last file match, they are dups - spit out a line saying that with both files listed to stdout
* Save the current file data hash for comparisons next pass

This process runs very fast for small files. 38500 image files are processed in about 3:04 minutes on a 1st-gen Core i5 with an external array. Spent 10 minutes cleaning those files up. I should make the files-to-be-deleted easier to get at. .git/ folders suck for this - I'm just sayin'. For example:

DUPs Found : 274420 : 92956761e3e2db1f1f93a2241613c77e
./2008/03-Costa_Rica/3-29-Sat_Volcano_Irazu/640/DSC02839-Volcano_Irazu.jpg
./2008/03-Costa_Rica/3-29-Sat_Volcano_Irazu/640/FILE0048.jpg

A single key can place an 'x' in column 1 next to the file to be removed. Highly efficient in vim/vi. Just pipe those lines into xargs and rm.

It took longer on video files ... had too many identical titles on a DVD rip. Basically, 16 of 18 titles matched, so the md5sums took a while.

I couldn't imagine doing this under cygwin. That subsystem is too slow. Find the win32 ports of UNIX utils to get most of those tools in natively compiled versions - FAST. PowerShell is pretty fast once it is loaded - but the load time is java-fast.

I wrote mine in cross-platform perl, but did cheat with the first `find` call to build the file list. The File::Find perl interface is clunky in comparison. Perl is damn fast besides that, as everyone knows.

Other optimizations are possible - mainly to avoid running a full md5sum, perhaps comparing the first 4kb and last 4kb of each file would be faster to avoid the md5sum as much as possible, but then you still need to do the full md5sum if those match anyway. By avoiding doing the md5sum for 98% of the files, the process comes down to reading the directory and pulling the "stats()" for each file out. That is extremely fast compared to actually opening each file.
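
That first-4kb/last-4kb idea could be sketched like this in Python (the script described above is Perl; Python is used here only for illustration). The names edge_fingerprint and worth_full_md5 are hypothetical, and a match still only means "go and run the full md5sum":

```python
import hashlib
import os

EDGE = 4096  # bytes read from each end of the file


def edge_fingerprint(path):
    """Cheap fingerprint: file size plus a hash of the first and last 4 KB."""
    size = os.path.getsize(path)
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        digest.update(fh.read(EDGE))
        if size > EDGE:
            # never re-read bytes already covered by the head
            fh.seek(max(size - EDGE, EDGE))
            digest.update(fh.read(EDGE))
    return size, digest.hexdigest()


def worth_full_md5(path_a, path_b):
    """True when two files agree on size, head and tail."""
    return edge_fingerprint(path_a) == edge_fingerprint(path_b)
```

This reads at most 8 KB per file, so it helps most on large same-sized files that differ near the start or end, such as videos with different headers or trailers. Files that differ only in the middle will still need the full checksum.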

Now you have me curious.

cloud service by Anonymous Coward · 2012-09-02 13:25 · Score: 0

sounds to me like you work for a cloud service and you want to delete duplicate copies of everybody's mp3 library.
didn't Apple or someone else offer that a while back, a free upgrade to a higher bit rate for songs?

Python by Anonymous Coward · 2012-09-02 13:34 · Score: 0

Write your own filter in Python.
Seriously. It will take less time to read the google Python tutorials, write filtering software and run generated copy scripts than it would to configure and run any off-the-shelf solution.

You know where your shit is buried. A sniffer dog can only guess.

http://code.google.com/edu/languages/google-python-class/

have you tried Microsoft's Dupfinder? by Anonymous Coward · 2012-09-02 13:42 · Score: 0

You'll have to do some clever stuff to get it from the XP SP2 tools distribution, but I bet it will work great for you.

http://www.techrepublic.com/article/remove-clutter-with-windows-xp-sp2s-duplicate-finder-tool/6160661

Script it by Lord+Kano · 2012-09-02 13:57 · Score: 1

Install cygwin.
Get an md5sum of every file and store the file paths/md5sums in a text file.
Sort the file.
Use a script (Perl or your scripting language of choice) to spit out the paths of every file that's duplicated.

No, I'm not writing the code to do it for you.

LK

--
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano

Folderscope by tomazos · 2012-09-02 14:42 · Score: 1

I wrote a freeware Windows utility called Folderscope and deduping large folders is one of the main use cases:

Folderscope

Enjoy, Andrew.

Doublekiller by Anonymous Coward · 2012-09-02 14:45 · Score: 0

I've used http://www.bigbangenterprises.de/en/ 's DoubleKiller on a similar sized amount of data...

Worked a treat :-)

Cheers
Jules

Squashfs by Anonymous Coward · 2012-09-02 14:52 · Score: 0

Squashfs. Stores the data from dupes only once, and compresses as well. You end up with one file with a compressed read-only filesystem inside it.

http://squashfs.sourceforge.net/

Cloud by Anonymous Coward · 2012-09-02 14:58 · Score: 0

Have the Cloud do it.

Sincerely,
AC, MBA

sourceforge.net/projects/fdups/ by Anonymous Coward · 2012-09-02 15:03 · Score: 0

sourceforge.net/projects/fdups/

Answer: by reitton · 2012-09-02 15:04 · Score: 1

By hand

Use WinDirStat. by zenlessyank · 2012-09-02 15:20 · Score: 1

WinDirStat is a useful utility that might be able to help you break up your task into smaller parts. http://windirstat.en.softonic.com/

The Dear Hunter by Anonymous Coward · 2012-09-02 15:39 · Score: 0

The 'post' I will posit is not about technology but about psychology.

In that case and context, Kill it and be rid of the incriminating information of your habits like ... Porn ... Bestiality ... Perversions ... thoughts of killing the President of Russia ... thoughts about killing the Pope.

Use fdisk.

Do the dirty ... lose your beloved Porn, Bestiality, Perversions and such nonsense.

I saw through you! :)

My setup for a very similar project by ZeroPly · 2012-09-02 16:17 · Score: 1

My project was complicated by the fact that the files were scattered across several 2TB external hard drives and a few internal ones. Your situation should be easier since you have consolidated everything. Here is what I did:

Set up a MySQL database with filename, full path, and MD5 checksum as fields - as well as a few other EXIF fields since I was working with a huge photo archive. Also the appropriate index.
Put together a quick Python script that would walk the directories, take an MD5 checksum of each file, and plug it into the database.
Finally I just did a query on the database to print out duplicates based on MD5 which took surprisingly little time to run even with several million records.

Caveat emptor: All this was done in Linux. I started putting this together on a fast Win7 machine and quickly realized that it was just too slow to get this done in a week.
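
A minimal sketch of the setup above, for anyone who wants a starting point. Assumptions are loud here: it uses Python's built-in sqlite3 module as a stand-in for MySQL, leaves out the EXIF fields, and the table, column, and function names are all invented.

```python
import hashlib
import os
import sqlite3


def md5_of(path):
    """MD5 of a file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(conn, top):
    """Walk `top` and record name, full path, size and MD5 of every file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "filename TEXT, fullpath TEXT PRIMARY KEY, size INTEGER, md5 TEXT)"
    )
    # the index is what keeps the duplicate query fast on millions of rows
    conn.execute("CREATE INDEX IF NOT EXISTS idx_files_md5 ON files (md5)")
    for root, _dirs, names in os.walk(top):
        for name in names:
            path = os.path.join(root, name)
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (name, path, os.path.getsize(path), md5_of(path)),
            )
    conn.commit()


def duplicate_sets(conn):
    """Return {md5: [paths]} for every checksum seen more than once."""
    rows = conn.execute(
        "SELECT md5, fullpath FROM files WHERE md5 IN "
        "(SELECT md5 FROM files GROUP BY md5 HAVING COUNT(*) > 1) "
        "ORDER BY md5, fullpath"
    ).fetchall()
    sets = {}
    for md5, path in rows:
        sets.setdefault(md5, []).append(path)
    return sets
```

Pointing sqlite3.connect() at a file on disk instead of ":memory:" keeps the inventory around between runs, so drives can be scanned one at a time and queried together later.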

--
Support microSD: in a post 9/11 world, it is unwise to carry your data on media that you cannot comfortably swallow.

Get a good NAS and some drives by Anonymous Coward · 2012-09-02 16:38 · Score: 0

Buy a Synology NAS (DS 1512+ would be perfect) and three or more 3TB (or larger) drives, and forget about it.

Use 'bup', it will work by Anonymous Coward · 2012-09-02 17:06 · Score: 0

The backup tool 'bup' will have no problem with this. And the de-duping is stellar.

If you have enough dupes, you can probably manage to make enough space for compressed backups of all your data - or a new disk.

https://github.com/apenwarr/bup

Merge by fulldecent · 2012-09-02 17:36 · Score: 2

Best tool. http://hungrycats.org/~zblaxell/dupemerge/faster-dupemerge worked great for me in the past 10 years. Scales.

--

-- I was raised on the command line, bitch

Multiple users' data? by weazzle · 2012-09-02 17:42 · Score: 1

There are a lot of good recommendations for how to locate duplicates. If you really plan to attempt deduplication rather than purchasing more space, there are a number of things to consider. First, don't use a tool to perform the deduplication, only to locate the duplicates. You are bound to run into a scenario you didn't anticipate. Multiple users may each maintain their own copy of identical files. If one is removed, one user no longer has access. If they are simply hard linked to the same file, modifications are applied to both. With multiple copies of the same repository from a distributed SCM (Git, Mercurial, etc.), you are going to run into a vast number of false positives. There are other situations where use/ownership, and not simply structure, must be taken into consideration.

I can't see this suggested anywhere.... by Anonymous Coward · 2012-09-02 18:39 · Score: 0

Try http://doubles.sourceforge.net/

It found 40GB of duplicates in about 15 minutes, over a total of about 3TB of data. It works by sorting all files by size, and when the size is the same it performs a hash to determine if there is a match. Plus, it runs on Windows 7.

You're all missing the point by Anonymous Coward · 2012-09-02 18:54 · Score: 0

I have almost exactly the same problem and have been procrastinating writing an AskSlashdot post about this for a while. So thanks, OP.

I think everyone is missing the point. This is not about deduping, this is about finding what IS NOT duplicated and adding into the re-organised file structure. In my case, I now have photos organised by year/date in a simple folder structure, but I am scared to delete my disorganised archives/backups in case something has not been copied over to my new archive structure correctly. I have not yet found a program which will scan a folder hierarchy and report all the files NOT present in my 'official' archive.
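
A report like that is not much code if files are matched by content rather than by name. Below is a rough Python sketch, assuming SHA-256 is an acceptable notion of "the same file"; the function names are made up, and archive_root and messy_root stand for the 'official' archive and the disorganised backups respectively.

```python
import hashlib
import os


def walk_files(top):
    """Yield the full path of every file under top."""
    for root, _dirs, names in os.walk(top):
        for name in names:
            yield os.path.join(root, name)


def sha256_of(path):
    """SHA-256 of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_from_archive(archive_root, messy_root):
    """List files under messy_root whose content is NOT in archive_root."""
    archive_by_size = {}
    for path in walk_files(archive_root):
        archive_by_size.setdefault(os.path.getsize(path), []).append(path)
    hashed = {}  # size -> set of archive digests, filled only when needed
    missing = []
    for path in walk_files(messy_root):
        size = os.path.getsize(path)
        if size not in archive_by_size:
            missing.append(path)  # no archive file even has this size
            continue
        if size not in hashed:
            hashed[size] = {sha256_of(p) for p in archive_by_size[size]}
        if sha256_of(path) not in hashed[size]:
            missing.append(path)
    return sorted(missing)
```

Anything the function returns has no byte-identical copy in the archive and should be looked at before the old backups are deleted; anything it does not return is already safe in the new structure, whatever it happens to be called there.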

The most amazing thing is, I am sure there are more than two people (me and the OP) in the world with the same problem. It is scary to think how many people will lose a generation of photos/videos because they are in a mess on a hard drive, and if no one knows where they are they can't be backed up.

Re:You're all missing the point by Anonymous Coward · 2012-09-04 21:34 · Score: 0

I am sure there are more than two people (me and the OP) in the world with the same problem.
Guess what, I have EXACTLY the same problem.
Tell me when you get a solution.

Fdupe by Kirth · 2012-09-02 20:20 · Score: 1

http://freecode.com/projects/fdupe -- perl. Only finds exact duplicates, and I haven't used it against more than 200'000 files and 2TB.

--
"The more prohibitions there are, The poorer the people will be" -- Lao Tse

Beyond Compare by DigiShaman · 2012-09-02 20:39 · Score: 1

A co-worker and I used a program called Beyond Compare to match data stored on tape with a live Archive directory on a file server. Actually, it was his idea and it worked out pretty well. Check it out, it may be of some use.

http://www.scootersoftware.com/

--
Life is not for the lazy.

Re:Sorry, if you can't write a simple script, then by Anonymous Coward · 2012-09-02 20:51 · Score: 0

Perhaps you should redefine your definition of "technical" to something a bit closer to the definition that the rest of the world uses, or would that be too much of an insult?

technical/teknikl/
Adjective:
1. Of or relating to a particular subject, art, or craft, or its techniques: "technical terms"; "a test of an artist's technical skill".
2. (esp. of a book or article) Requiring special knowledge to be understood: "a technical report".

So, in other words an artist that focuses on developing techniques, like say water colours or lithography, would be a technical artist. Likewise not every job involving technology is technical, and not every technical job involves any knowledge of programming. The people designing keyboards for instance have little need to ever know how to script a sorting file, yet they have a technical job. You don't have to know anything about scripting or programming for most simple electronics, even. I'd say the people assembling phones in a factory are unlikely to know much about how to program one.

In short, I don't think your definition of technical needs to be adjusted or amended, it needs to be over-written. It's blatantly false, and it just exposes your general view of the world as being very focused on just your own experiences. Out here in the rest of the world, you're just one small subset of technical people in the computer field, and you happen to be in the small subset that can program and script. Hell, I am as well, although I'm far from the computer field. Now open your eyes and realize that the world is larger than your classifications.

DP by Anonymous Coward · 2012-09-02 20:58 · Score: 0

Double penetration FTW!

the OP's real problem? by smhsmh · 2012-09-02 21:27 · Score: 1

The strategies to compute file lengths, then CRCs, are generally wise. But they may miss the real problem.

In addition to detecting duplicate backup files, the OP ought to think about how duplicates should be handled. The goal, one presumes, is to create a single tree of backed-up files where each file is represented only once, but which preserves something about the original organization of the original directory hierarchy.

I similarly have duplicate backups distributed between flash drives, burned DVDs and CD-Rs (some stored off site), external disks, and old-but-still-working computers and cell phones. Many of these repositories are ancient and were managed by software with odd naming conventions. But destroying that history may lose information, such as "With which camera did I take this picture?"

The easy problem is detecting duplicates. The hard problem is figuring out how to organize the resulting files into a meaningful new single tree.

False assumption! by Peter (Professor) Fo · 2012-09-02 21:47 · Score: 1

The objective may be to make sure you don't unwittingly fork.

Triage by Twylite · 2012-09-02 22:15 · Score: 1

I've dealt with a similar problem on a smaller scale (500K files, 120 GB). I started by generating hashes over all my current properly-organised files using hashdeep, and parsed the output into a database (columns filesize, hash, path, filename, mtime) using a custom script. Then I wrote another script to walk through the archives finding and deleting files that matched those already in the database; the script also used the database to keep track of its walk so it could be stopped and restarted. This halved the size of the archive material before I had to start trying to understand what was there.

From there I identified pivotal directories in the archive - ones I could reasonably assume to be recent or more complete (for example, based on backup date) - and added them to the hash database, then walked the rest of the archives culling duplicates again. Lather, rinse, repeat and you rapidly reach a point where you have a small number of directories with a lot of de-duplicated data, and a large number of directories with small amounts of possibly-duplicated data that can be handled by a free dedup tool.
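
A rough Python sketch of the hash-database idea described above. It is not the poster's script: hashlib's SHA-256 stands in for hashdeep, SQLite stands in for whatever database was used, and the function names (open_db, index_tree, find_known_duplicates) are made up for illustration. Only the column names come from the post. It reports matches instead of deleting them and does not track its walk for restarting.

```python
import hashlib
import os
import sqlite3


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_db(db_path):
    """Open (or create) the hash database; the schema is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "filesize INTEGER, hash TEXT, path TEXT, filename TEXT, mtime REAL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_size_hash ON files (filesize, hash)"
    )
    return conn


def regular_files(root):
    """Yield the full path of every regular, non-symlink file under root."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full) and os.path.isfile(full):
                yield full


def index_tree(conn, root):
    """Record every file under root (the organised, trusted tree)."""
    for full in regular_files(root):
        info = os.stat(full)
        conn.execute(
            "INSERT INTO files VALUES (?, ?, ?, ?, ?)",
            (info.st_size, file_hash(full), os.path.dirname(full),
             os.path.basename(full), info.st_mtime),
        )
    conn.commit()


def find_known_duplicates(conn, archive_root):
    """Yield archive files whose size and hash are already in the database."""
    for full in regular_files(archive_root):
        size = os.path.getsize(full)
        # Cheap size lookup first; hash only when a same-size file is known.
        known_size = conn.execute(
            "SELECT 1 FROM files WHERE filesize = ? LIMIT 1", (size,)
        ).fetchone()
        if known_size is None:
            continue
        match = conn.execute(
            "SELECT 1 FROM files WHERE filesize = ? AND hash = ? LIMIT 1",
            (size, file_hash(full)),
        ).fetchone()
        if match is not None:
            yield full
```

To repeat the "pivotal directory" step, one would call index_tree again on the directory being promoted and then rerun find_known_duplicates over the remaining archives.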

--
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net

Pareto's principle by ferespo · 2012-09-02 23:15 · Score: 1

If you can live with less than perfect results (wasted space) you could apply Pareto's principle: start working with a list of file sizes in descending order and dedup manually until you have recovered enough space. Chances are that 20% of the files make up 80% of the space.
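
A small Python sketch of that idea, in case it helps: list files largest first and see how few of them account for most of the bytes. The function names and the 0.8 default are just illustrative choices, not something from a real tool.

```python
import os


def largest_files(root):
    """Return (size, path) pairs for regular files under root, largest first."""
    sizes = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            sizes.append((os.path.getsize(full), full))
    # Largest first; ties broken by path so the order is stable.
    sizes.sort(key=lambda pair: (-pair[0], pair[1]))
    return sizes


def files_covering(sizes, fraction=0.8):
    """Return the shortest largest-first prefix holding `fraction` of all bytes."""
    total = sum(size for size, _path in sizes)
    picked, running = [], 0
    for size, path in sizes:
        if running >= fraction * total:
            break
        picked.append((size, path))
        running += size
    return picked
```

Calling files_covering(largest_files(root)) gives the short list worth checking by hand first.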

More info http://en.wikipedia.org/wiki/Pareto_principle

Getting rid of stupid files? Dupes? by Anonymous Coward · 2012-09-02 23:21 · Score: 0

Shouldn't you be de-dup'ing? De-dupeing is getting rid of the stupid files, which isn't easily achieved by comparing file sizes, a hash of the first block followed by further blockwise checking if still a match.
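
For what it's worth, the staged check mentioned here (file size, then a hash of the first block, then block-by-block comparison) might look roughly like this in Python. The 64 KiB block size, the use of SHA-256, and the function names are assumptions made for the sketch.

```python
import hashlib
import os

BLOCK_SIZE = 64 * 1024  # arbitrary block size chosen for this sketch


def candidate_key(path):
    """Cheap fingerprint: (file size, SHA-256 of the first block)."""
    with open(path, "rb") as handle:
        first_block = handle.read(BLOCK_SIZE)
    return os.path.getsize(path), hashlib.sha256(first_block).hexdigest()


def same_content(path_a, path_b):
    """Confirm a match block by block, stopping at the first difference."""
    if candidate_key(path_a) != candidate_key(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            block_a = file_a.read(BLOCK_SIZE)
            block_b = file_b.read(BLOCK_SIZE)
            if block_a != block_b:
                return False
            if not block_a:  # both files ended together
                return True
```

The fingerprint only pays off when many files are grouped by candidate_key first, so that most of them are never read past the first block.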

fslint's findup deduplicator by Gunstick · 2012-09-02 23:29 · Score: 1

Well yes, this is a Linux tool, but I was still quite pleased with its results for 800k files. It took some time, but it did finish.
It's basically a shell script doing what others have suggested: sort by size, then checksum the files that share a size.
/usr/share/fslint/fslint/findup
find dUPlicate files.
Usage: findup [[[-t [-m|-d]] | [--summary]] [-r] [-f] paths(s) ...]
If no path(s) specified then the current directory is assumed.
When -m is specified any found duplicates will be merged (using hardlinks).
When -d is specified any found duplicates will be deleted (leaving just 1).
When -t is specified, only report what -m or -d would do.

When --summary is specified change output format to include file sizes.
You can also pipe this summary format to /usr/share/fslint/fslint/fstool/dupwaste
to get a total of the wastage due to duplicates.

As it's a single command line with dozens of pipes, it should use all cores if needed.
some text from the source:

Description

will show duplicate files in the specified directories
(and their subdirectories), in the format:

file1
file2

file3
file4
file5

or if the --summary option is specified:

2 * 2048 file1 file2
3 * 1024 file3 file4 file5

Where the number is the disk usage in bytes of each of the
duplicate files on that line, and all duplicate files are
shown on the same line.
Output is ordered by largest disk usage first and
then by the number of duplicate files.
Caveats/Notes:
I compared this to any equivalent utils I could find (as of Nov 2000)
and it's (by far) the fastest, has the most functionality (thanks to
find) and has no (known) bugs. In my opinion fdupes is the next best but
is slower (even though written in C), and has a bug where hard links
in different directories are reported as duplicates sometimes.

This script requires uniq > V2.0.21 (part of GNU textutils|coreutils)
dir/file names containing \n are ignored
undefined operation for dir/file names containing \1
sparse files are not treated differently.
Don't specify params to find that affect output etc. (e.g -printf etc.)
zero length files are ignored.
symbolic links are ignored.
path1 & path2 can be files &/or directories

and the code has optimizations like this one
sort -k2,2n -k3,3n | #NB sort inodes so md5sum does less seeking all over disk
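
For anyone not on Linux, here is a minimal Python sketch of the same general approach (group by size, checksum only the files that share a size, read them in inode order). It is not a port of findup: the function names are invented, SHA-256 replaces md5sum, and hard links are not treated specially, so two names for the same inode will show up as duplicates.

```python
import hashlib
import os
from collections import defaultdict


def checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return sorted groups of paths under root with equal size and checksum."""
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # symbolic links are ignored
            info = os.stat(full)
            if info.st_size == 0:
                continue  # zero-length files are ignored
            by_size[info.st_size].append((info.st_ino, full))

    # Only files that share a size with another file need a checksum.
    candidates = [
        (inode, size, path)
        for size, group in by_size.items()
        if len(group) > 1
        for inode, path in group
    ]
    # Sort by inode number, in the spirit of the findup comment about seeking.
    candidates.sort()

    by_sum = defaultdict(list)
    for _inode, size, path in candidates:
        by_sum[(size, checksum(path))].append(path)
    return sorted(sorted(paths) for paths in by_sum.values() if len(paths) > 1)
```

Each inner list is one set of identical files; what to do with them (delete, hard-link, just report) is left to the caller.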

--
Atari rules... ermm... ruled.

Divide and Conquer by krsmav · 2012-09-03 01:18 · Score: 1

Break the 4.9 TB into convenient-size files (say, 500 MB) and de-dup them one at a time. I'd dedicate a spare computer to do this, so you can leave it running over nights, weekends, etc. Then merge the now-smaller files into 500 MB chunks and work through iteratively.

solutions by Anonymous Coward · 2012-09-03 03:39 · Score: 0

You could get all of the names of the files, then perform a disk-based sort. After that you could delete the entries that are next to each other in the list if they have the same name. This could be done with an external merge sort.
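
A toy Python sketch of that, assuming the sorted runs fit in memory here (a real external merge sort would spill each run to disk and stream it back). The function names are made up. Note that a matching name says nothing about matching content, so the groups are only candidates.

```python
import heapq
import itertools
import os


def sorted_runs(paths, run_size):
    """Split paths into sorted runs of (filename, path) pairs."""
    runs = []
    for start in range(0, len(paths), run_size):
        chunk = [(os.path.basename(p), p) for p in paths[start:start + run_size]]
        runs.append(sorted(chunk))
    return runs


def same_name_groups(runs):
    """Merge the sorted runs and yield (filename, paths) for repeated names."""
    merged = heapq.merge(*runs)  # lazy k-way merge of already-sorted runs
    for name, group in itertools.groupby(merged, key=lambda pair: pair[0]):
        members = [path for _name, path in group]
        if len(members) > 1:
            yield name, members
```

Because the merge is lazy, equal names come out next to each other without the whole list ever being sorted in one piece.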

How I did it... by Anonymous Coward · 2012-09-03 03:43 · Score: 0

I found I had a similar issue. However, being a programmer AND a cheapskate, I simply rolled my own "De-Duplicator". It was VERY basic, but for personal use, it was worth it. I downloaded Visual Studio 2010 Express and simply wrote a wrapper (using a nested Parallel.ForEach) around the following bit of code.

I didn't care if filenames were the same, only if content was the same. As I went through, I moved duplicates off of the drive as soon as I found them so that I could stop and restart the code as necessary yet still make progress. It was "ghetto", but it de-duped my drive within 24 hours.

// Requires: using System.IO;
// The method name is only illustrative; the original post showed just the body.
static bool FilesAreEqual(string file1, string file2)
{
    using (var fs1 = new FileStream(file1, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var fs2 = new FileStream(file2, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        // Files of different lengths cannot be identical.
        if (fs1.Length != fs2.Length)
        {
            return false;
        }

        // Read and compare a byte from each file until either a
        // non-matching pair of bytes is found or the end of file1 is reached.
        int file1byte;
        int file2byte;
        do
        {
            file1byte = fs1.ReadByte();
            file2byte = fs2.ReadByte();
        }
        while ((file1byte == file2byte) && (file1byte != -1));

        // "file1byte" is equal to "file2byte" at this point only if
        // the files are the same. The using blocks close both files.
        return file1byte == file2byte;
    }
}

Re:How I did it... by Anonymous Coward · 2012-09-03 03:46 · Score: 0

Oh, and I should also point out: I didn't have as many files as you do, but it is something you could set to run in the background for a while and slowly find your free space increasing.

Thank you for the suggestions by Igarden2 · 2012-09-03 04:11 · Score: 1

After reading only a few posts I was finally motivated to dedupe my SkyDrive. Using FastDuplicateFileFinder I found many dupes and sorted through them. My number of files was a mere fraction of the OP's files, but this worked for me. I was surprised to find some I no longer needed and some I barely remembered from back in 1993.

--
Normally I ascribe all life to intelligent design, but in your case I'll make an exception.

Windows 2012 by defected · 2012-09-03 04:14 · Score: 0

Windows Server 2012 has built-in post-process deduplication, so just upgrade your system: http://blogs.technet.com/b/filecab/archive/2012/05/21/introduction-to-data-deduplication-in-windows-server-2012.aspx

clonespy by datadefender · 2012-09-03 04:45 · Score: 1

Use http://www.clonespy.com/ and let it run for some days/weeks.
I did this exercise just last week for 120,000 files - it ran for one night on an old P4.

Leave the dups by peetm · 2012-09-03 04:52 · Score: 1

Rationale:

You don't need to free up space, heck, space is cheap so there's no real reason to recover it.

Also, given that the worth of something is inversely proportional to its availability, it actually makes sense to have duplicates hanging around: once you lose your only copy of a file you'll be *very* happy to find its duplicate.

--
@peetm

No one mentioned freedups.pl ? by ultracosm · 2012-09-03 05:16 · Score: 1

If you are on NTFS you can use http://freedup.org/ or freedups.pl (http://www.stearns.org/freedups/). It makes hard links among duplicate files. On NTFS, it poops out after 1024 links, but at least you have 1023 fewer copies of the file on your hard drive.

This makes sense if you are used to a particular file structure. The file structure stays the same, so you can find the one copy of the file by whatever name/path you happen to remember first.
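
To illustrate the hard-link idea (this is not how freedup or freedups.pl is implemented, just a minimal Python sketch with an invented function name and temp-file suffix): compare the two files byte for byte, then swap the duplicate for a hard link to the copy being kept. Both paths have to be on the same volume, and the link-count limit mentioned above still applies.

```python
import filecmp
import os


def replace_with_hardlink(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep` if the contents match.

    Returns True if a link was made, False if nothing was changed.
    """
    if os.path.samefile(keep, duplicate):
        return False  # already two names for the same file
    if not filecmp.cmp(keep, duplicate, shallow=False):
        return False  # contents differ, leave both alone
    temp = duplicate + ".dedup-tmp"  # hypothetical temporary name
    os.link(keep, temp)              # create the new link first
    os.replace(temp, duplicate)      # then swap it into place
    return True
```

Creating the link under a temporary name first means the duplicate's path never disappears, even if the script is interrupted between the two steps.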

I've used a few free deduplicators, and haven't had a huge problem with them. I'd work in smaller chunks (directory trees) to start with, if you don't want the computer chugging away for long periods without knowing what it is doing. The first one I tried (DupeLocator, no longer at its original location but possibly around in freeware collections) seemed relatively efficient, finding equal size files and doing some sort of compare. It eliminated the files that were not dupes pretty quickly. Took longer to confirm that the rest were really dupes, but not excessively long. It had the added advantage of "locating" the dupes, and letting you do with them what you please (I drag them into the Recycle Bin most of the time, all except the one I want to keep). It also keeps updating a status so you know how far it has gotten. I suppose if there are thousands of 1GB files, it might take a while.

Surely someone creating a de-dupe utility would make it at least moderately efficient, if not user friendly. I'd guess a program using a less-efficient algorithm with updated status would seem faster than one that just sits there doing a highly optimized algorithm without letting you know what's happening.

GQView? by Anonymous Coward · 2012-09-03 05:50 · Score: 0

How does findimagedupes compare to GQView's built-in image duplicate checker? That's what I use for all my photos & images.

Here: by Anonymous Coward · 2012-09-03 08:23 · Score: 0

Time to wrest control over your hoarding urges and delete shit.

Manually pick out a VERY SMALL subset of data you know you really want. Copy to somewhere.

Then format the drives.

What you need by rkinch · 2012-09-03 09:14 · Score: 1

What you need is a computer.

Nexentastor with NFS and dedup and compression by Anonymous Coward · 2012-09-03 09:51 · Score: 0

Sort the files by file size, then by MD5. Then build a file server with NexentaStor, access the NexentaStor GUI through a browser, and set it up to use iSCSI, dedup, and the highest compression level. Copy the files over using a script, in the order produced by your sort (file size, then MD5).

Set up NexentaStor to scrub the file system every night at 3 AM. It is almost as good as an expensive SAN, with more uses than you will ever think of, and it is easily accessible from all systems; you can also use the NFS, CIFS and Samba services. Set it up as a virtual appliance; in my opinion it is better by far than Openfiler. For home use, 14 TB is free. Just copy data to it across a gigabit network. It is slow to start, but it will be very useful once done. Use a RAID-Z1 or RAID-Z2 file system with tiered storage (the fastest tier on RAID 1 Z1 with flash drives, a slower tier on regular SATA disks in RAID 10 Z2, and a third tier on a backup storage device where you store snapshots, then disconnect it and put it away for safekeeping). Mirror across the WAN over a VPN to another trusted site, such as a trusted relative's house or a computer in the Amazon cloud. Apply AES or Triple DES encryption to the file storage. Have lots of fun in the process! Gain lots of security at the same time! Make an ADMIN career in the process. Make lots of money!

sudo apt-get install fdupes by KingBenny · 2012-09-03 18:44 · Score: 1

I never tried it on millions of files though ... about 65k at most.
Oh right, Windows ... dunno, maybe it has an equivalent on SourceForge somewhere that you can compile yourself.

--
Free speech was meant to be free for all... how can anyone grow up in a nanny state ?

Crap Cleaner! by DarthVain · 2012-09-04 02:09 · Score: 1

http://en.wikipedia.org/wiki/CCleaner

http://www.piriform.com/ccleaner

I have used it on multi-TB machines, in both home and work settings. I have used it for special projects targeting file repositories.

It is flexible enough that you can configure it pretty much any way you wish. With a little imagination you should be able to do whatever you need to do with this. I have used it in conjunction with SyncToy to back up, move, etc... using Contribute (which can generate duplicates).

It is fast, or at least I never had any problems. Some of the larger searches I did took a while, but that is to be expected. It also has pretty flexible output: you can delete, move, just about anything you want.

A very useful utility.

Drugs by Anonymous Coward · 2012-09-04 02:45 · Score: 0

After the auto sort is done and you need to dig in for the manual stuff, be sure to have 1/2gram of coke on hand. You'll go through it in a night. Your files will be sorted out, too. ;-)

Clonespy works great and is free by Anonymous Coward · 2012-09-06 07:50 · Score: 0

I have had great luck with clonespy which is a freeware program. I kept using it so much that I donated money to him. The nicest feature is the ability to compare 2 pools of files against each other. I found that really helped me clean up backups I had made.

440 comments