Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files?


Posted by samzenpus on Sunday September 2, 2012 @01:30AM from the copies-of-the-copies dept.

First time accepted submitter jamiedolan writes "I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. Lots of files of all types, from digital photos & scans to HD video files (also website backups mixed in, which are the cause of such a high number of files). In more recent times I've organized files in a reasonable folder system and have an active / automated backup system. The problem is that I know that I have many old files that have been duplicated multiple times across my drives (many from doing quick backups of important data to an external drive that later got consolidated onto a single larger drive), chewing up space. I tried running a free de-dup program, but it ran for a week straight and was still 'processing' when I finally gave up on it. I have a fast system, i7 2.8GHz with 16GB of RAM, but currently have 4.9TB of data with a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old sloppy filing (folder) system. I do need to keep the data; nuking it is not a viable option.

CRC by Spazmania · 2012-09-02 01:32 · Score: 5, Informative

Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
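A minimal Python sketch of the checksum-sort-scan idea described above, assuming Python 3.8+ and that every file under the root is readable. The function names are made up for illustration. It sorts (CRC, path) records in memory rather than writing them to a file, and treats a CRC match only as a hint that still needs a full compare.

```python
import filecmp
import os
import zlib
from itertools import groupby


def crc32_of(path, chunk_size=1 << 20):
    """CRC32 of a file, read in chunks so large files don't fill RAM."""
    crc = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def find_dupes_by_crc(root):
    """Return lists of paths under root whose contents are byte-identical."""
    records = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            records.append((crc32_of(path), path))
    records.sort()  # equal CRCs end up adjacent

    dupes = []
    for _crc, run in groupby(records, key=lambda r: r[0]):
        paths = [p for _, p in run]
        # A CRC match is only a hint; confirm with a full compare.
        while len(paths) > 1:
            first, rest = paths[0], paths[1:]
            same = [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if same:
                dupes.append([first] + same)
            paths = [p for p in rest if p not in same]
    return dupes
```

As the replies point out, this reads every byte of every file, so on 4.9TB it is likely to be disk-bound; the size-first variants discussed further down avoid most of that work.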
1. Re:CRC by Anonymous Coward · 2012-09-02 01:36 · Score: 5, Informative
  
  s/CRC32/sha1 or md5, you won't be CPU bound anyway.
2. Re:CRC by Kral_Blbec · 2012-09-02 01:38 · Score: 5, Informative
  
  Or just by file size first, then do a hash. No need to compute a hash to compare a 1mb file and a 1kb file.
3. Re:CRC by Pieroxy · 2012-09-02 01:38 · Score: 2
  
  Exactly.
  1. Install MySQL,
  2. create a table (CRC, directory, filename, filesize)
  3. fill it in
  4. play with inner joins.
  I'd even go down the path of forgetting about the CRC. Before deleting something, do a manual check anyways. CRC has the advantage of making things very straightforward but is a bit more complex to generate.
  
  --
  Write boring code, not shiny code!
4. Re:CRC by Spazmania · 2012-09-02 01:47 · Score: 2
  
  I have a script which does this for openstreetmap tiles. Once it identifies the dupes it archives all the tiles into a single file, pointing the dupes at a single copy in the archive. Then I use a Linux fuse filesystem to read the file and present the results to Apache. Saves a truly massive amount of disk space for an openstreetmap server since the files are mostly smaller than a single disk block and never consume enough disk blocks that the space lost to the inode and unused part of the last block is insignificant.
  
  --
  Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
5. Re:CRC by SwashbucklingCowboy · 2012-09-02 01:48 · Score: 2
  
  DO NOT do a CRC, do a hash. Too many chances of collision with a CRC.
  But that still won't fix his real problem - he's got lots of data to process and only one system to process it with.
6. Re:CRC by igb · 2012-09-02 01:52 · Score: 5, Insightful
  
  That involves reading every byte. It would be faster to read the bytecount of each file, which doesn't involve reading the files themselves as that metadata is available, and then exclude from further examination all the files which have unique sizes. You could then read the first block of each large file, and discard all the files that have unique first blocks. After that, CRC32 (or MD5 or SHA1 --- you're going to be disk-bound anyway) and look for duplicates that way.
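The metadata-only first pass might look something like this in Python. It is a sketch, not a tested tool: `candidates_by_size` is a hypothetical name, and it assumes plain `os.stat` sizes are good enough (no special handling of links or sparse files).

```python
import os
from collections import defaultdict


def candidates_by_size(root):
    """Map size -> paths, keeping only sizes shared by two or more files.

    Uses directory metadata only; no file contents are read.
    """
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                size = os.stat(path).st_size
            except OSError:
                continue  # unreadable or vanished file: skip it
            by_size[size].append(path)
    return {s: p for s, p in by_size.items() if len(p) > 1}
```

Everything that drops out here never has to be opened at all, which is the whole point of doing sizes first.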
7. Re:CRC by vlm · 2012-09-02 01:54 · Score: 2, Interesting
  
  4. play with inner joins.
  Much like there's 50 ways to do anything in Perl, there's quite a few ways to do this in SQL.
  select filename_and_backup_tape_number_and_stuff_like_that, count(*) as number_of_copies
  from pile_of_junk_table
  group by md5hash
  having number_of_copies > 1
  There's another strategy where you mush two tables up against each other... one is basically the DISTINCT of the other.
  triggers are widely complained about, but you can implement a trigger system (or pseudo-trigger, where you make a wrapper function in your app) where basically a table of "files" is stored with a column called "count of identical md5hash" and then your sql looks like select * from pile where identicalcount>1
  There's ways to play with views.
  Do you need to run it interactively or batch it or just run it basically once or ... If you're allowed to barf on data input you can even enforce the md5 hash as a UNIQUE INDEX or UNIQUE KEY in the table definition.
  You'll learn a lot about how to think about high performance computing. Are you trying to minimize latency or minimize storage or minimize index size or maximize reliability/uptime or minimize processor time or minimize NAS bandwidth or minimize (initial OR maintenance) programming time or ....
  The funniest thing is if you've never tried restoring data from backups (hey, it happens), and/or never had a tape failure (hey it happens), you'll THINK you want to eliminate dupes, but trust me, those dupes will save your bacon someday, and tape is cheap compared to cost of programmer and cost of lost data.... 5 TB is not much technically but is obviously worth a lot from a business standpoint...
  Also from personal experience you're going to find people gaming the system where DOOM3.EXE and NOTEPAD.EXE happen to have the same md5hash and length and NOTEPAD.EXE was found on a not-totally but pretty much noob's desk. Use some judgement and don't come down too hard on the newest of new learners.
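For anyone who wants to try the GROUP BY / HAVING query above without installing a database server, here is a rough equivalent using Python's built-in sqlite3 module. SQLite is swapped in for MySQL purely for convenience; the table name `pile`, its columns, and the function name are assumptions for illustration.

```python
import sqlite3


def duplicate_hashes(rows):
    """rows: iterable of (path, size, md5hash). Returns [(md5hash, copies)]."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE pile (path TEXT, size INTEGER, md5hash TEXT)")
    con.executemany("INSERT INTO pile VALUES (?, ?, ?)", rows)
    cur = con.execute(
        "SELECT md5hash, COUNT(*) AS number_of_copies "
        "FROM pile GROUP BY md5hash "
        "HAVING COUNT(*) > 1 ORDER BY md5hash"
    )
    result = cur.fetchall()
    con.close()
    return result
```

The query selects only the grouped column and the count, which keeps it valid in databases that reject non-aggregated columns outside the GROUP BY.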
  
  --
  "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
8. Re:CRC by Zocalo · 2012-09-02 01:58 · Score: 4, Informative
  
  No. No. No. Blindly CRCing every file is probably what took so long on the first pass and is a terribly inefficient way of de-duplicating files.
  
  There is absolutely no point in generating CRCs of files unless they match on some other, simpler to compare characteristic like file size. The trick is to break the problem apart into smaller chunks. Start with the very large files; the exact size break to use will depend on the data set, but as the poster mentioned video files, say everything over 1GB to start. Chances are you can fully de-dupe your very large files manually based on nothing more than a visual inspection of names and file sizes in little more time than it takes to find them all in the first place. You can then exclude those files from further checks, and more importantly, from CRC generation.
  
  After that, try and break the problem down into smaller chunks. Whether you are sorting on size, name or CRC, it's quicker to do so when you only have a few hundred thousand files rather than several million. Maybe do another size constrained search; 512MB-1GB, say. Or if you have them, look for duplicated backup files in the form of ZIP files, or whatever archive format(s) you are using, based on their extension - that also saves you having to expand and examine the contents of multiple archive files. Similarly, do a de-dupe of just the video files by extensions as these should again lend themselves to rapid manual sorting without having to generate CRCs for many GB of data. Another grouping to consider might be to at least try and get all of the website data, or as much of it as you can, into one place and de-dupe that, and consider whether you really need multiple archival copies of a site, or whether just the latest/final revision will do.
  
  By the time you've done all that, including moving the stuff that you know is unique out of the way and into a better filing structure as you go, the remainder should be much more manageable for a single final pass. Scan the lot, identify duplicates based on something simple like the file size and, ideally, manually get your de-dupe tool to CRC only those groups of identically sized files that you can't easily tell apart like bunches of identically sized word processor or image files with cryptic file names.
  
  --
  UNIX? They're not even circumcised! Savages!
9. Re:CRC by caluml · 2012-09-02 01:58 · Score: 5, Informative
  
  Exactly. What I do is this:
  
  1. Compare filesizes.
  2. When there are multiple files with the same size, start diffing them. I don't read the whole file to compute a checksum - that's inefficient with large files. I simply read the two files byte by byte, and compare - that way, I can quit checking as soon as I hit the first different byte.
  
  Source is at https://github.com/caluml/finddups - it needs some tidying up, but it works pretty well.
  
  git clone, and then mvn clean install.
  
  --
  Get your own free personal location tracker
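The early-exit comparison described above is easy to sketch in Python. This is not the code from the finddups repository, just an illustration of the idea; `same_contents` is a hypothetical name, and it reads in 64K chunks rather than literally one byte at a time.

```python
import os


def same_contents(path_a, path_b, chunk_size=64 * 1024):
    """True if both files hold identical bytes; stops at the first mismatch."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False
            if not a:  # both files exhausted at the same point
                return True
```

Two files that differ near the start cost one small read each, no matter how large they are.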
10. Re:CRC by Anonymous Coward · 2012-09-02 02:05 · Score: 5, Informative
  
  If you get a linux image running (say in a livecd or VM) that can access the file system then fdupes is built to do this already. Various output format/recursion options.
  From the man page:
  DESCRIPTION
  Searches the given path for duplicate files. Such files are found by
  comparing file sizes and MD5 signatures, followed by a byte-by-byte
  comparison.
11. Re:CRC by Joce640k · 2012-09-02 02:07 · Score: 3, Insightful
  
  s/CRC32/sha1 or md5, you won't be CPU bound anyway.
  Whatever you use it's going to be SLOW on 5TB of data. You can probably eliminate 90% of the work just by:
  a) Looking at file sizes, then
  b) Looking at the first few bytes of files with the same size.
  After THAT you can start with the checksums.
  
  --
  No sig today...
12. Re:CRC by kanweg · 2012-09-02 02:07 · Score: 2
  
  You're not baffled.
  Bert
13. Re:CRC by WoLpH · 2012-09-02 02:18 · Score: 2
  
  Indeed, I once created a dedup script which basically did that.
  1. compare the file sizes
  2. compare the first 1MB of the file
  3. compare the last 1MB of the file
  4. compare the middle 1MB in the file
  It's not a 100% foolproof solution but it was more than enough for my use case at that time and much faster than getting checksums.
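Roughly what that sampling check could look like in Python, as a sketch only. The function names are invented here, and as the comment above says, it can report a match for files that differ somewhere outside the three sampled regions.

```python
import os

SAMPLE = 1 << 20  # 1 MiB, matching the 1MB samples described above


def sample_signature(path, sample=SAMPLE):
    """Return (size, first, last, middle) samples used as a cheap fingerprint."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        first = fh.read(sample)
        fh.seek(max(size - sample, 0))
        last = fh.read(sample)
        fh.seek(max(size // 2 - sample // 2, 0))
        middle = fh.read(sample)
    return size, first, last, middle


def probably_same(path_a, path_b, sample=SAMPLE):
    """Heuristic equality: may report True for files that differ elsewhere."""
    return sample_signature(path_a, sample) == sample_signature(path_b, sample)
```

For files smaller than the sample size the three reads overlap and cover the whole file, so the check becomes exact for those.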
14. Re:CRC by igb · 2012-09-02 02:22 · Score: 3, Interesting
  
  The problem isn't CRC vs secure hash, the problem is the number of bits available. He's not concerned about an attacker sneaking collisions into his filestore, and he always has the option of either a byte-by-byte comparison or choosing some number of random blocks to confirm the files are in fact the same. But 32 bits isn't enough simply because he's all but guaranteed to get collisions even if all the files are different: 4.2 million files is about 2^22, and by the birthday paradox collisions in a 32-bit value become likely at around 2^16 files. But using two different 32-bit CRC algorithms, for example, wouldn't be "secure" but would be reasonably safe. But as he's going to be disk bound, calculating an SHA-512 would be reasonable, as he can probably do that faster than he can read the data.
  I confess, if I had a modern i5 or i7 processor and appropriate software I'd be tempted to in fact calculate some sort of AES-based HMAC, as I would have hardware assist to do that.
15. Re:CRC by TheGratefulNet · 2012-09-02 02:36 · Score: 2
  
  divide and conquer.
  your idea of using file size as first discriminant is good. it's fast and throws out a lot of things that don't need to be checked.
  another accelerant is to find if the count of the # of files in a folder is the same. and if a few are the same, maybe the rest are. use 'info' like that to make it run faster.
  I have this problem and am going to write some code to do this, too.
  but I might have some files that are 'close' to the others and so I need smarter code. example: some music files might be the same in content but only vary in tags. or their titles are different. or maybe even their run length is slightly diff but they are still mostly the same file. I'd want to dedupe those, too.
  you would have a manual list to verify (the computer thinks these are the same; please verify, mr human).
  some files may have errors in them! maybe I made copies of mp3 files and there was a static hit on one disk. finding by dupe filename and even size is not good enough. you found 2 contenders, but which is the CLEAN file? which has no dropouts or buzzsaws? same for photos, too, if you retouch photos you may not know which is the original or the fixed/keeper.
  special knowledge helps here. if it's audio, if it's video, if it's text, spreadsheets, o/s runnable files, etc conf files, all can use diff 'tricks' to help accelerate.
  this is why this solution is NOT easy unless you just go brute force by disk block. and this is not do-able on anything large unless you have hardware support.
  
  --
  "It is now safe to switch off your computer."
16. Re:CRC by bzipitidoo · 2012-09-02 02:38 · Score: 5, Insightful
  
  Part 2 of your method will quickly bog down if you run into many files that are the same size. Takes (n choose 2) comparisons, for a problem that can be done in n time. If you have 100 files all of one size, you'll have to do 4950 comparisons. Much faster to compute and sort 100 checksums.
  Also, you don't have to read the whole file to make use of checksums, CRCs, hashes and the like. Just check a few pieces likely to be different if the files are different, such as the first and last 2000 bytes. Then for those files with matching parts, check the full files.
  
  --
  Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
17. Re:CRC by belg4mit · 2012-09-02 02:40 · Score: 2, Informative
  
  Unique Filer http://www.uniquefiler.com/ implements these short-circuits for you.
  It's meant for images but will handle any filetype, and even runs under WINE.
  
  --
  Were that I say, pancakes?
18. Re:CRC by JoeMerchant · 2012-09-02 02:56 · Score: 2, Funny
  
  Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).
  Would you be so kind to write a program/script which can do that ?
  Payment information please, AC?
19. Re:CRC by K.+S.+Kyosuke · 2012-09-02 03:05 · Score: 3, Insightful
  
  Why not simply do it adaptively? Two or three files of the same size => check by comparing, more files of the same size => check by hashing.
  
  --
  Ezekiel 23:20
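A sketch of that adaptive idea in Python, for one group of same-sized files. The cut-over of 3 is taken from the "two or three files" wording above and is otherwise arbitrary; the function names are hypothetical, and the hashing branch trusts SHA-1 without a follow-up byte compare.

```python
import filecmp
import hashlib
from collections import defaultdict

HASH_THRESHOLD = 3  # assumed cut-over, per "two or three files" above


def sha1_of(path, chunk_size=1 << 20):
    """SHA-1 hex digest of a whole file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def dupes_in_size_group(paths):
    """Group same-sized files into sets of identical files."""
    if len(paths) <= HASH_THRESHOLD:
        # Few files: a handful of direct compares, each with early exit.
        groups = []
        for path in paths:
            for group in groups:
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                groups.append([path])
    else:
        # Many files: one read per file, then bucket by digest.
        buckets = defaultdict(list)
        for path in paths:
            buckets[sha1_of(path)].append(path)
        groups = list(buckets.values())
    return [g for g in groups if len(g) > 1]
```

The trade-off is the one bzipitidoo raised: pairwise comparison of 100 same-sized files can take up to 4950 compares, while hashing needs 100 reads.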
20. Re:CRC by blueg3 · 2012-09-02 03:06 · Score: 5, Informative
  
  b) Looking at the first few bytes of files with the same size.
  Note that there's no reason to only look at the first few bytes. On spinning disks, any read smaller than about 16K will take the same amount of time. Comparing two 16K chunks takes zero time compared to how long it takes to read them from disk.
  You could, for that matter, make it a 3-pass system that's pretty fast:
  a) get all file sizes; remove all files that have unique sizes
  b) compute the MD5 hash of the first 16K of each file; remove all files that have unique (size, header-hash) pairs
  c) compute the MD5 hash of the whole file; remove all files that have unique (size, hash) pairs
  Now you have a list of duplicates.
  Don't forget to eliminate all files of zero length in step (a). They're trivially duplicates but shouldn't be deduplicated.
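The three passes above could be sketched in Python along these lines. This is an illustration under assumptions, not a finished tool: the helper names are invented, the 16K header size comes from the comment, and there is no error handling for unreadable files.

```python
import hashlib
import os
from collections import defaultdict

HEADER_BYTES = 16 * 1024  # the 16K read size suggested above


def md5_of(path, limit=None, chunk_size=1 << 20):
    """MD5 of the first `limit` bytes of a file, or of the whole file."""
    h = hashlib.md5()
    remaining = limit
    with open(path, "rb") as fh:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = fh.read(want)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()


def regroup(groups, key):
    """Split each group by key(path) and drop groups left with one member."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        out.extend(g for g in buckets.values() if len(g) > 1)
    return out


def find_duplicates(root):
    """Return groups of paths under root that share size and full-file MD5."""
    paths = [os.path.join(d, n) for d, _, names in os.walk(root) for n in names]
    # Pass (a): sizes from metadata; zero-length files are left alone.
    groups = regroup([[p for p in paths if os.path.getsize(p) > 0]], os.path.getsize)
    # Pass (b): hash only the first 16K of each remaining file.
    groups = regroup(groups, lambda p: md5_of(p, HEADER_BYTES))
    # Pass (c): hash whole files that still look alike.
    return regroup(groups, md5_of)
```

Because each pass only sees groups that survived the previous one, the expensive full-file hash runs on a small fraction of the data when most files are unique.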
21. Re:CRC by Zeroko · 2012-09-02 03:13 · Score: 2
  
  The relevant number when worrying about non-adversarial hash collisions is the square root of the number of outputs (assuming they are close enough to uniformly distributed), due to the birthday paradox. So in the case of CRC32, more than ~2^16 files makes a collision likely (well, 2^16 gives about 39%), & with 2^22, the probability is nearly indistinguishable from 1 (it being over 99.9% for only 2^18 files).
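Those figures can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1)/(2N)) for n files and N possible hash values. A small Python sketch, assuming uniformly distributed hash outputs (the function name is made up):

```python
import math


def collision_probability(n_files, hash_bits):
    """Approximate chance that at least two of n_files distinct files share
    a hash, assuming uniformly distributed hash_bits-bit outputs."""
    outputs = 2.0 ** hash_bits
    # -expm1(-x) computes 1 - exp(-x) without losing tiny probabilities.
    return -math.expm1(-n_files * (n_files - 1) / (2.0 * outputs))


for exponent in (16, 18, 22):
    print(f"2^{exponent} files, CRC32: {collision_probability(2 ** exponent, 32):.4f}")
```

The same function with `hash_bits=160` and 4.2 million files gives a probability many orders of magnitude below anything worth worrying about for accidental SHA-1 collisions.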
22. Re:CRC by b4dc0d3r · 2012-09-02 03:32 · Score: 4, Interesting
  
  This was theorized by one of the RSA guys (Rivest, if I'm not mistaken). I helped support a system that identified files by CRC32, as a lot of tools did back then. As soon as we got to about 65k files (2^16), we had two files with the same CRC32.
  Let me say, CRC32 is a very good algorithm. So good, I'll tell you how good. It is 4 bytes long, which means in theory you can change any 4 bytes of a file and get a CRC32 collision, unless the algorithm distributes them randomly, in which case you will get more or less.
  I naively tried to reverse engineer a file from a known CRC32. Optimized and recursive, on a 333 MHz computer, it took 10 minutes to generate the first collision. Then every 10 minutes or so. Every 4 bytes (last 4, last 5 with the original last byte, last 6 with original last 2 bytes, etc) there was a collision.
  Compare file sizes first, not CRC32. The 2^16 estimate is not only mathematically proven, but also holds up in the big boy world. I tried to move the community towards another hash.
  CRC32 *and* filesize are a great combination. File size is not included in the 2^16 estimate. I have yet to find two files in the real world, in the same domain (essentially type of file), with the same size and CRC32.
  Be smart, use the right tool for the job. First compare file size (ignoring things like mp3 ID3 tags, or other headers). Then do two hashes of the contents - CRC32 and either MD5 or SHA1 (again ignoring well-known headers if possible). Then out of the results, you can do a byte for byte comparison, or let a human decide.
  This is solely to dissuade CRC32 based identification. After all, it was designed for error detection, not identification. For a 4-byte file, my experience says CCITT standard CRC32 will work for identification. For 5 byte files, you can have two bytes swapped and possibly have the same result. The longer the file, the less likely it is to be unique.
  Be smart, use size and two or more hashes to identify files. And even then, verify the contents. But don't compute hashes on every file - the operating system tells you file size as you traverse the directories, so start there.
23. Re:CRC by Anonymous Coward · 2012-09-02 04:01 · Score: 3, Interesting
  
  With 4.2 million files, given the probability of SHA-1 collisions plus the birthday paradox and there will be around 500 SHA-1 collisions which are not duplicates. SHA-512 reduces that number to 1.
24. Re:CRC by BasilBrush · 2012-09-02 04:14 · Score: 5, Insightful
  
  Someone whose technical expertise is in areas other than writing script files. There are technical jobs other than being a sysop you know.
25. Re:CRC by Anonymous Coward · 2012-09-02 04:33 · Score: 2, Informative
  
  Actually, this is an instance where lots of random IO will bog you down when comparing a bunch of files. His 4+ TB divided by 4.2M files is roughly 1MB average file size, which really isn't that much content to access per random seek. A naive all-to-all comparison will cause a lot of random IO, so you really need to generate a batch file listing with per-file metadata and then analyze the listings efficiently. Adding checksum info to this batch listing is actually not that costly and allows the entire de-dupe analysis to be performed with no further disk IO. Even if we assume 1kB per file of name, size, and checksum info (it's probably a lot less), the whole listing is around 4GB which can be largely cached in RAM for analysis.
  When I had this same problem on Linux, I did two scans of the entire file set using the 'find . -type f -exec cmd {} +' command to automatically run 'stat' and 'md5sum' on batches of files, then I merged these scan results to have one table of information per file. You could do all of this by processing files (e.g. sort and join on Unix) but it is more efficient to just import the data into sqlite or another database and do it there. In my case, I grouped files by size and checksum, also sorting the group members by name length, preferring the shortest name as the "original" file, since the names tended to get longer with each redundant backup copy adding some other top-level directory name to the original file name.
  The reason I ran two scans was that I was too lazy to implement a hybrid command to efficiently run 'md5sum' and 'stat' as one utility. It would have taken me longer to develop and test the utility enough to trust it than to just run it with the existing utilities. In the end, the scan with md5sum did not take that much longer than the scan with stat, because the overall time is dominated by digging around vast directory hierarchies and randomly accessing file metadata, versus the bulk sequential access pattern used to perform the checksum once each file was found. If you monitor the system while these commands run, there is steady high-bandwidth disk access for the duration of the md5sum scan, while there is steady disk seeking with very little bandwidth for the duration of the stat scan. Neither scan saturates a CPU.
  Another question is what to do with the results of analysis. One option is to delete all but one copies of each length/checksum group, and assume you would use the database information in the future if you ever need to reconstitute one of the deleted hierarchies. Or you could turn all secondary references into hard-links to the same file, retaining the original hierarchies as accessible file trees. Or, as I chose to do, you can replace secondary references with symbolic links to the primary copy, which is close enough to preserving the original hierarchy for most programmatic access but is also self-documenting the fact that it is a secondary name for the same file at the other end of the link.
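The symlink option, with the shortest name treated as the "original", might be sketched like this in Python. It assumes the group has already been verified as byte-identical, and `link_to_shortest` is a hypothetical name. It defaults to a dry run because the real thing deletes files; on Windows, creating symlinks may also need elevated privileges.

```python
import os


def link_to_shortest(group, dry_run=True):
    """Keep the path with the shortest name; symlink the others to it.

    Returns the (duplicate, keeper) pairs. With dry_run=True nothing on
    disk is changed.
    """
    keeper = min(group, key=lambda p: (len(p), p))
    plan = [(p, keeper) for p in group if p != keeper]
    if not dry_run:
        for dup, target in plan:
            os.remove(dup)
            os.symlink(os.path.abspath(target), dup)
    return plan
```

Reviewing the returned plan before running with `dry_run=False` is cheap insurance, given that the duplicate is removed before the link is created.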
26. Re:CRC by TheGratefulNet · 2012-09-02 04:43 · Score: 2
  
  I usually use:
  find . -type f -exec md5sum {} \; > /tmp/files.md5.txt
  you can check back with that file:
  md5sum -c /tmp/files.md5.txt
  
  --
  "It is now safe to switch off your computer."
27. Re:CRC by Goaway · 2012-09-02 04:46 · Score: 2
  
  I don't know where you are finding these numbers, but they are about as wrong as it is possible to get.
  There is no known SHA-1 collision yet in the entire world. You're not going to find 500 of them in your dump of old files.
28. Re:CRC by robogun · 2012-09-02 04:59 · Score: 3, Funny
  
  I looked at this as I, like the subby, have terabytes of porn to sort.
  But $19.95 for a beta?
29. Re:CRC by iluvcapra · 2012-09-02 05:19 · Score: 5, Insightful
  First compare file size (ignoring things like mp3 ID3 tags, or other headers).
  I once had to write an audio file de-duplicator; one of the big problems was you would ignore the metadata and the out-of-band data when you did the comparisons, but you always had to take this stuff into account when you were deciding which version of a file to keep -- you didn't want to delete two copies of a file with all the tags filled out and keep the one that was naked.
  My de-duper worked like everyone here is saying -- it cracked open wav and aiff (and Sound Designer 2) files, captured their sample count and sample format into a sqlite db, did a couple of big joins and then did some SHA1 hashes of likely suspects. All of this worked great, but once I had the list I had the epiphany that the real problem of these tools is the resolution and how you make sure you're doing exactly what the user wants.
  How do you decide which one to keep? You can just do hard links, but...
  
  The users I was working with were very uncomfortable with hard links, they didn't really understand the concept and were concerned that it made it difficult to know if you were "really" throwing something away when you dragged something to the trash. (It's stupid but it was their box.)
  Our existing backup/archival software wouldn't do the right thing with hard links, so it'd save no space on the tapes.
  Our audio workstation software wouldn't read audio off of files that were hard links on OS X (because hard links on OSX aren't really hard links, I believe our audio workstation vendor have since resolved this).
  But let's say you can do hard links, no problem. How do you decide which instance of the file is to be kept, if you've only compared the "real" content of the file and ignored metadata? You could just give the user a big honking list of every set of files that are duplicates -- two here, three here, six here, and then let them go through and elect which one will be kept, but that's a mess and 99% of the time they're going to select a keeper on the basis of which part of the directory tree it's in. So, you need to do a rule system or a preferential ranking of parts of the directory hierarchy that tell the system "keep files you find here." Now, the files will also have metadata, so you also have to preferentially rank the files on the basis of its presence -- you might also rank files higher if your guy did the metadata tagging, because things like audio descriptions are often done with a specialized jargon that can be specific to a particular house.
  Also, it'd be very common to delete a file from a directory containing an editor's personal library, and replace it with a hard link to a file in the company's main library -- several people would have copies of the same commercial sound, or an editor would be the recordist of a sound that was subsequently sold to a commercial library, or whatever. Is it a good policy to replace his file with a hardlink to a different one, particularly if they differ in the metadata? Directories on a volume are often controlled by different people with different policies and proprietary interest to the files -- maybe the company "owns" everything, but it still can create a lot of internal disputes if files in a division or individual project's library folder start getting their metadata changed, on account of being replaced with a hard link to a "better" file in the central repository. We can agree not to de-dup these, but it's more rules and exceptions that have to be made.
  Once you have the list of duplicates, and maybe the rules, do you just go and delete, or do you give the user a big list to review? And, if upon review, he makes one change to one duplicate instance, it'd be nice to have that change intelligently reflected on the others. The rules have to be applied to the dupe list interactively and changes have to be reflected in the same way, otherwise it becomes a miserable experience for the user to de-dupe 1M files over 7 terabytes. The resolution of duplicates is the hard part, the finding of dupes is relatively easy.
  --
  Don't blame me, I voted for Baltar.
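One way to picture the "preferential ranking" idea is a sort key: preferred location first, then whether the file carries metadata. The Python below is only a sketch of that; the directory names, the `has_metadata` flag, and the function names are all invented for illustration and are not from the tool described above.

```python
import os

# Hypothetical policy: earlier entries win. These directory names are
# illustrative only and do not come from the thread.
PREFERRED_ROOTS = ["/library/main", "/library/editors"]


def rank(path, has_metadata, preferred_roots=PREFERRED_ROOTS):
    """Lower tuples sort first: preferred location, then tagged files."""
    norm = os.path.normpath(path)
    for position, root in enumerate(preferred_roots):
        root = os.path.normpath(root)
        if norm == root or norm.startswith(root + os.sep):
            break
    else:
        position = len(preferred_roots)  # not under any preferred root
    return (position, 0 if has_metadata else 1, len(norm), norm)


def choose_keeper(candidates, preferred_roots=PREFERRED_ROOTS):
    """candidates: list of (path, has_metadata) for one duplicate set."""
    return min(candidates, key=lambda c: rank(c[0], c[1], preferred_roots))[0]
```

Here location beats metadata, which is itself a policy decision; swapping the first two tuple elements would make a tagged copy win regardless of where it lives.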
30. Re:CRC by IICV · 2012-09-02 05:56 · Score: 3, Insightful
  
  $19.95 for a beta of something you can whip up in about an hour of shell scripting.
  Hell, I wrote exactly what people are talking about here in an afternoon in college - I even did both SHA and MD5, because I ended up finding a SHA collision between one of the Quake 3 files and a Linux system file.
31. Re:CRC by xigxag · 2012-09-02 07:10 · Score: 4, Informative
  
  With 4.2 million files, given the probability of SHA-1 collisions plus the birthday paradox and there will be around 500 SHA-1 collisions which are not duplicates.
  That's totally, completely wrong. The birthday problem isn't a breakthrough concept, and the probability of random SHA-1 collisions is therefore calculated with it in mind. The number is known to be 1/2^80. This is straightforwardly derived from the total number of SHA-1 values, 2^160, which is then immensely reduced by the birthday paradox to 2^80 expected hashes required for a collision. This means that a hard drive with 2^80 or 1,208,925,819,614,629,174,706,176 files would have on average ONE collision. Note that this is a different number than the number of hashes one has to generate for a targeted cryptographic SHA-1 attack, which with best current theory is on the order of 2^51 for the full 80-round SHA-1, although as Goaway has pointed out, no such collision has yet been found.
  Frankly I'm at a loss as to how you arrived at 500 SHA-1 collisions out of 4.2 million files. That's ludicrous. Any crypto hashing function with such a high collision rate would be useless. Much worse than MD5, even.
  
  --
  There are two kinds of people: 1) those who start arrays with one and 1) those who start them with zero.
32. Re:CRC by Surt · 2012-09-02 08:21 · Score: 4, Insightful
  
  $19.95 for a beta of something you can whip up in an hour of shell scripting.
  If the poster were you, they wouldn't have had to 'ask slashdot'.
  
  --
  "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
33. Re:CRC by yakatz · 2012-09-02 11:26 · Score: 2
  
  And use a Bloom Filter to easily eliminate many files without doing a major comparison of all 100 checksums.
34. Re:CRC by Zocalo · 2012-09-02 11:48 · Score: 2
  
  Sure, yet it didn't. Reading between the lines, and seeing phrases like "multiple drives", "attached to PC", "last decade", I think it's safe to say that we are most definitely not talking about a reasonably modern storage system that can do 100MB/s, or it wouldn't have taken a week for the first pass. It seems much more likely that the poster has a whole bunch of external backup drives, most probably USB2, hence their first attempt was probably seriously I/O bound. That means doing as much of the de-duplication as possible without reading in the raw data, just the file tables, and starting with the larger files so that you can move whatever is left over for the final pass that will need the files CRC'd (or hashed) onto the fastest available media.
  
  --
  UNIX? They're not even circumcised! Savages!
35. Re:CRC by vocatan · 2012-09-02 14:16 · Score: 2
  
  Be VERY careful about only relying upon the file contents -- my wife spent 3 weeks tagging a large (~8,000 images) collection of family photos -- and the method she used was to put the children's names in the filename. Being the clever geek, I ran a MD5 against all the files, and compared both filesize and MD5 -- and triumphantly purged all the binary duplicates -- only to find that the filename itself was important to retain. Also, note that some applications such as Apple's iPhoto will conveniently retain multiple copies of the same image in various dimensions - as well as the original image before any transformations would apply. Bottom line: doing a filename+filecontents hash (single O(n) to calculate over entire file set), and then comparison of the hash feels _to me_ as the safest approach.
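A filename+contents key along those lines could look like this in Python. It is a sketch under assumptions: "filename" is taken to mean the base name without its directory, MD5 is used because the comment mentions it, and the function names are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def name_and_content_key(path, chunk_size=1 << 20):
    """Key that matches only when both the filename and the bytes match."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            h.update(chunk)
    return os.path.basename(path), h.hexdigest()


def safe_duplicates(paths):
    """Group paths that share both base name and content digest."""
    groups = defaultdict(list)
    for path in paths:
        groups[name_and_content_key(path)].append(path)
    return [g for g in groups.values() if len(g) > 1]
```

With this key, a renamed copy carrying the children's names would not be grouped with the untagged original, so it would never be offered for deletion.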
36. Re:CRC by inKubus · 2012-09-02 19:28 · Score: 3, Informative
  
  For the lazy, here are 3 more tools:
  fdupes, duff, and rdfind.
  Duff claims it's O(n log n), because they:
  Only compare files if they're of equal size.
  Compare the beginning of files before calculating digests.
  Only calculate digests if the beginning matches.
  Compare digests instead of file contents.
  Only compare contents if explicitly asked.
  
  --
  Cool! Amazing Toys.
ZFS by smash · 2012-09-02 01:37 · Score: 2

as per subject.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
1. Re:ZFS by smash · 2012-09-02 01:39 · Score: 4, Informative
  
  To clarify - no, this will not remove duplicate references to the data. The file system will remain intact. However it will perform block level dedupe of the data which will recover your space. Duplicate references aren't necessarily a bad thing anyway, as if you have any sort of content index (memory, code, etc) that refers to data in a particular location, it will continue to work. However the space will be recovered.
  
  --
  I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
2. Re:ZFS by Daniel_Staal · 2012-09-02 03:37 · Score: 2
  
  You have to enable it, which can be done on a per-filesystem basis. Once it's on, any new data written to that filesystem will be deduplicated. If you then turn it off, new data will not be deduplicated but data already on disk will remain deduplicated. (Unless it gets modified, of course. Then it's new data.)
  PC-BSD installs onto ZFS by default if you have over 4GB or so of ram, but won't turn on deduplication automatically. Dedup is costly: it requires a dedup table which has 320 bytes per (variably sized) block, which must be consulted on every write. (A quick estimate based on an average 64K block size for the case above results in a 24 GB dedup table.) So, if you can't fit that table into ram or onto a SSD cache drive, writes are going to be very slow. But for this usage, setting up a fileserver on ZFS and copying all his files to it would fit well, especially as the other advantages of ZFS with large filesystems will come into play.
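The 24 GB estimate can be reproduced with a few lines of arithmetic. The three inputs are the figures quoted in this thread (4.9 TB of data, 64K average block, 320 bytes per table entry); treating 4.9 TB as 4.9e12 bytes is an assumption.

```python
DATA_BYTES = 4.9e12           # 4.9 TB, from the original question
AVG_BLOCK_BYTES = 64 * 1024   # 64K average block size (the estimate above)
ENTRY_BYTES = 320             # dedup table cost per block (figure quoted above)

blocks = DATA_BYTES / AVG_BLOCK_BYTES
table_gb = blocks * ENTRY_BYTES / 1e9
print(f"{blocks / 1e6:.1f} million blocks, about {table_gb:.0f} GB of dedup table")
```

That is more than the 16GB of RAM in the submitter's machine, which is why the table would have to spill to an SSD cache or slow writes down.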
  
  --
  'Sensible' is a curse word.
There are tools for this by Anonymous Coward · 2012-09-02 01:41 · Score: 5, Informative

If you don't mind booting Linux (a live version will do), fdupes has been fast enough for my needs and has various options to help you when multiple collisions occur. For finding similar images with non-identical checksums, findimagedupes will work, although it's obviously much slower than a straight 1-to-1 checksum comparison.
YMMV
Simple dedupe algorithm by Anonymous Coward · 2012-09-02 01:42 · Score: 5, Funny

Delete all files but one. The remaining file is guaranteed unique!
Don't waste your time. by Fuzzums · 2012-09-02 01:43 · Score: 4, Insightful

if you really want, sort, order and index it all, but my suggestion would be different.
If you didn't need the files in the last 5 years, you'll probably never need them at all.
Maybe one or two. Make one volume called OldSh1t, index it, and forget about it again.
Really. Unless you have a very good reason to un-dupe everything, don't.
I have my share of old files and dupes. I know what you're talking about :)
Well, the sun is shining. If you need me, I'm outside.

--
Privacy is terrorism.
1. Re:Don't waste your time. by equex · 2012-09-02 02:34 · Score: 3, Interesting
  
  I probably have 5-10 gigs of everything i ever did on a computer. all this is wrapped in a perpetual folder structure of older backups within old backups within.... i've tried sorting it and deduping it with various tools, but theres no point. you find this snippet named clever_code_2002.c at 10kb and then the same file somewhere else at 11kb and how do you know which one to keep? are you going to inspect every file ? are you going to auto-dedupe it based on size? on date? it wont work out in the end im afraid. the closest i have gotten to some structure in the madness is to put all single files of the same type in the same folder, and keep a folder with stuff that needs to be in folders. put a folder named 'unsorted' anywhere you want when you are not sure right away what to do with a file(s). copy all your stuff into the folders. decide if you want to rename dupes to file_that_exists(1).jpg or leave them in their original folders and sort it out later in the file copy/move dialogs that pops up when it detects similar folders/files. i like to just rename them, and then whenever i browse a particular 'ancient' folder, i quickly sort trough some files every time. over time, it becomes tidier and tidier. one tool that everyone should use is Locate32. it indexes your preferred locations and stores it in a database when you want to. (its not a service) you can then search very much like the old Windows search function again, only much much better.
  
  --
  Can I light a sig ?
Prioritize by file size by jwales · 2012-09-02 01:43 · Score: 5, Insightful

Since the objective is to recover disk space, the smallest couple of million files are unlikely to do very much for you at all. It's the big files that are the issue in most situations.
Compile a list of all your files, sorted by size. The ones that are the same size and the same name are probably the same file. If you're paranoid about duplicate file names and sizes (entirely plausible in some situations), then crc32 or byte-wise comparison can be done for reasonable or absolute certainty. Presumably at that point, to maintain integrity of any links to these files, you'll want to replace the files with hard links (not soft links!) so that you can later manually delete any of the "copies" without hurting all the other "copies". (There won't be separate copies, just hard links to one copy.)
If you give up after a week, or even a day, at least you will have made progress on the most important stuff.
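A rough sketch of the "biggest files first" listing, assuming Python 3. The function name is made up, and comparing names case-insensitively is an assumption that suits Windows. It only proposes candidates by size and name; a CRC or byte-wise check would still be needed before deleting or linking anything.

```python
import os
from collections import defaultdict


def candidates_by_wasted_space(root):
    """Group files under root by (size, name); biggest potential savings first."""
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            groups[(size, name.lower())].append(path)
    ranked = []
    for (size, _name), paths in groups.items():
        if len(paths) > 1 and size > 0:
            # Space that would be freed if all but one copy went away.
            ranked.append((size * (len(paths) - 1), sorted(paths)))
    ranked.sort(reverse=True)
    return ranked
```

Working down this list from the top means that stopping early still recovers the largest chunks of space.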

--
Wikia
1. Re:Prioritize by file size by b4dc0d3r · 2012-09-02 03:49 · Score: 3, Informative
  
  ZIP, test, then Par2 the zip. Even at the worst possible compression level, greater than 100% file sizes, you just saved a ton of space.
  I got to the point where I rarely copy small files without first zipping on the source drive. It takes so frigging long, when a full zip or tarball takes seconds. Even a flat tar without the gzip step is a vast improvement, since the filesystem doesn't have to be continually updated. But zipping takes so little resource that Windows XP's "zipped folders" actually makes a lot of sense for any computer after maybe 2004, even with the poor implementation.
Linux livecd? by thePowerOfGrayskull · 2012-09-02 01:47 · Score: 3

perhaps you could boot with a livecd and mount your windows drives under a single directory? Then:
find /your/mount/point -type f -exec sha256sum {} + | sort > sums.out
uniq -w 64 -D sums.out
1. Re:Linux livecd? by dargaud · 2012-09-02 02:30 · Score: 3, Insightful
  
  Read the other comments: that's highly inefficient. Compare the file sizes, then diff the files until the 1st differing byte. No need to checksum two Tb files if the 1st bytes are different !
  
  --
  Non-Linux Penguins ?
don't run the app on a usb EXT disk by Joe_Dragon · 2012-09-02 01:47 · Score: 2

Put the disk on the built-in SATA bus, or use eSATA or even FireWire.
Re:Good free command line tool by Acy+James+Stapp · 2012-09-02 01:59 · Score: 3, Interesting

I recently had this problem and solved it with finddupe (http://www.sentex.net/~mwandel/finddupe/). It's a free command line tool. It can create hardlinks, you can tell it which is a master directory to keep and which directories to delete, and it can create a batch file to actually do the deletion if you don't trust it or just want to see what it will do. Highly recommend. In any case, 5 TB is going to take forever but with finddupe you can be sure your time is not wasted, unlike one of the free tools that analyzed my drive for 12 hours and then told me it would only fix ten duplicates.
I tried this vs. Clone Spy, Fast Duplicate File Finder, Easy Duplicate File Finder, and the GPL Duplicate Files Finder (crashy). (Side note: Get some creativity guys). There's no UI but I don't care. It doesn't keep any state between runs so run it a few times on subdirectories to make sure you know what it's doing first then let it rip.

--
-- Too lazy to get a lower UID.
fun project by v1 · 2012-09-02 02:19 · Score: 2

I had to do that with an itunes library recently. Nowhere near the number of items you're working with, but same principle - watch your O's. (that's the first time I've had to deal with a 58mb XML file!) After the initial run forecasting 48 hrs and not being highly reliable, I dug in and optimized. A few hours later I had a program that would run in 48 seconds. When you're dealing with data sets of that size, process optimizing really can matter that much. (if it's taking too long, you're almost certainly doing it wrong)
The library I had to work with had an issue with songs being in the library multiple times, under different names, and that ended up meaning there was NOTHING unique about the songs short of the checksums. To make matters WORSE, I was doing this offline. (I did not have access to the music files which were on the customer's hard drives, all seven of them)
It sounds like you are also dealing with differing filenames. I was able to figure out a unique hashing system based on the metadata I had in the library file. If you can't do that, and I suspect you don't have any similar information to work with, you will need to do some thinking. Checksumming all the files is probably unnecessarily wasteful. Files that aren't the same size don't need to be checksummed. You may decide to consider files with the same size AND same creation and/or modification dates to be identical. That will reduce the number of files you need to checksum by several orders. A file key may be "filesize:checksum", where unique filesizes just have a 0 for the checksum.
Write your program in two separate phases. First phase is to gather checksums where needed. Make sure the program is resumable. It may take awhile. It should store a table somehow that can be read by the 2nd program. The table should include full pathname and checksum. For files that did not require checksumming, simply leave it zero.
Phase 2 should load the table, and create a collection from it. Use a language that supports it natively. (realbasic does, and is very fast and mac/win/lin targetable) For each item, do a collection lookup. Collections store a single arbitrary object (pathname) via a key. (checksum) If the collection (key) doesn't exist, it will create a new collection entry with that as its only object. if it already exists, the object is appended to the array for that collection. That's the actual deduping process, and will be done in a few seconds. Dictionaries and collections kick ass for deduping.
From here you'll have to decide what you want to do.... delete, move, whatever. Duplicate songs required consolidation of playlists when removing dups for example. Simply walk the collection, looking for items with more than one object in the collection. Decide what to keep and what to do elsewise with (delete?) I recommend dry-running it and looking at what it's going to do before letting it start blowing things away.
It will take 30-60 min to code probably. The checksum part may take awhile to run. Assuming you don't have a ton of files that are the same size (database chunks, etc) the checksumming shouldn't be too bad. The actual processing afterward will be relatively instantaneous. Use whatever checksumming method you can find that works fastest.
The checksumming part can be further optimized by doing it in two phases, depending on file sizes. If you have a lot of files that are large-ish (>20mb) that will be the same size, try checksumming in two steps. Checksum the first 1mb of the file. If they differ, ok, they're different. If they're the same, ok then checksum the entire file. I don't know what your data set is like so this may or may not speed things up for you.
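The two phases described above might look something like this in Python, where a plain dict plays the role of the collection. The function names and the use of MD5 are assumptions; persisting the table to disk (for resumability) and the 1mb partial checksum are left out to keep the sketch short.

```python
import hashlib
import os
from collections import Counter, defaultdict


def checksum(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_table(paths):
    """Phase 1: map each path to a 'filesize:checksum' key."""
    sizes = {p: os.path.getsize(p) for p in paths}
    counts = Counter(sizes.values())
    table = {}
    for p, size in sizes.items():
        # A size seen only once cannot have a duplicate, so skip the read.
        digest = checksum(p) if counts[size] > 1 else "0"
        table[p] = f"{size}:{digest}"
    return table


def collect(table):
    """Phase 2: invert the table; keys with 2+ paths are duplicate sets."""
    by_key = defaultdict(list)
    for path, key in table.items():
        by_key[key].append(path)
    return {k: v for k, v in by_key.items() if len(v) > 1}
```

Printing the result of `collect()` before acting on it is the dry run recommended above.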

--
I work for the Department of Redundancy Department.
CRCing & diff-ing do not a consistent deduping by williamyf · 2012-09-02 02:20 · Score: 2

After you have found the "equal files", you need to decide which one to erase and which ones to keep. For example, let's say that a gif file is part of a web site and is also present in a few other places because you backed it up to removable media which later got consolidated. If you choose to erase the copy that is part of the website structure, the website will stop working.
Lucky for you, most filesystem implementations nowadays include the capacity to create links, both hard and soft (in Windows, that would be NTFS symbolic links since Vista, and junction points since Win2K; in *nix, it is the soft and hard links we know and love; and on the Mac, the engineers added hard links to whole directories). So the solution must not only identify which files are the same, but also keep one copy while preserving accessibility; this is what makes apple (r)(c)(tm) work so well. You will need a script that, upon identifying equal files, erases all but one, and creates symlinks for all the erased ones pointing to the surviving one.
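A hedged sketch of the "erase all but one and link the rest" step, assuming Python 3. It uses a hard link (`os.link`) rather than a symlink, which is a swap on my part: hard links only work within one volume, but every remaining path stays a full-fledged name for the data. The function name and the `.dedupe-tmp` suffix are hypothetical.

```python
import filecmp
import os


def replace_with_hardlink(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep` if contents are identical."""
    if os.path.samefile(keep, duplicate):
        return False  # already the same file on disk
    if not filecmp.cmp(keep, duplicate, shallow=False):
        raise ValueError("files differ; refusing to link")
    temp = duplicate + ".dedupe-tmp"  # hypothetical temporary name
    os.link(keep, temp)          # create the new link first...
    os.replace(temp, duplicate)  # ...then swap it over the duplicate
    return True
```

Creating the link before removing anything means a crash in the middle leaves a stray temp file rather than a missing one. For a symlink, `os.symlink(keep, temp)` would take the place of `os.link`.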

--
*** Suerte a todos y Feliz dia!
Manual work will have to be done by Qbertino · 2012-09-02 02:24 · Score: 4, Informative

Your problem isn't unduping files in your archives, your problem is getting an overview of your data archives. If you'd have it, you wouldn't have dupes in the first place.
This is a larger personal project, but you should take it on, since it will be a good lesson in data organisation. I've been there and done that.
You should get a rough overview of what you're looking at and where to expect large sets of dupes. Do this by manually parsing your archives in broad strokes. If you want to automate dupe-removal, do so by de-duping smaller chunks of your archive. You will need extra CPU and storage - maybe borrow a box or two from friends and set up a batch of scripts you can run from Linux live CDs with external HDDs attached.
Most likely you will have to do some scripting or programming, and you will have to devise a strategy not only of dupe removal, but of merging the remaining skeletons of dirtrees. That's actually the tough part. Removing dupes takes raw processing power and can be done in a few weeks with brute force and solid storage bandwidth.
Organising the remaining stuff is where the real fun begins. ... You should start thinking about what you are willing to invest and how your backup, versioning and archiving strategy should look in the end, data/backup/archive retrieval included. The latter might even determine how you go about doing your dirtree diffs - maybe you want to use a database for that for later use.
Anyway you put it, just setting up a box in the corner and having a piece of software churn away for a few days, weeks or months won't solve your problem in the end. If you plan well, it will get you started, but that's the most you can expect.
As I say: Been there, done that.
I still have unfinished business in my backup/archiving strategy and setup, but the setup now is 2 1TB external USB3 drives and manual arsync sessions every 10 weeks or so to copy from HDD-1 to HDD-2 to have dual backups/archives. It's quite simple now, but it was a long hard way to clean up the mess of the last 10 years. And I actually was quite conservative about keeping my boxes tidy. I'm still missing external storage in my setup, aka Cloud-Storage, the 2012 buzzword for that, but it will be much easier for me to extend to that, now that I've cleaned up my shit halfway.
Good luck, get started now, work in iterations, and don't be silly and expect this project to be over in less than half a year.
My 2 cents.

--
We suffer more in our imagination than in reality. - Seneca
Already done it - python script by Terrasque · 2012-09-02 03:01 · Score: 3, Informative

I found a python script online and hacked it a bit to work on a larger scale.
The script originally scanned a directory, found files with same size, and md5'ed them for comparison.
Among other things I added option to ignore files under a certain size, and to cache md5 in a sqlite db. I also think I did some changes to the script to handle large number of files better, and do more effective md5 (also added option to limit number of bytes to md5, but that didn't make much difference in performance for some reason). I also added option to hard link files that are the same.
With inodes in memory, and sqlite db already built, it takes about 1 second to "scan" 6TB of data. First scan will probably take a while, tho.
Script here - It's only tested on Linux.
Even if it's not perfect, it might be a good starting point :)
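For illustration, here is one way md5 caching in sqlite could work. This is not the linked script: the table name `md5cache`, the column layout, and the function names are all assumptions. A file is only re-read when its size or modification time no longer matches the cached row.

```python
import hashlib
import os
import sqlite3


def open_cache(db_path):
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS md5cache ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime REAL, md5 TEXT)"
    )
    return db


def cached_md5(db, path):
    """Return the file's MD5, re-reading it only if size or mtime changed."""
    st = os.stat(path)
    row = db.execute(
        "SELECT md5 FROM md5cache WHERE path = ? AND size = ? AND mtime = ?",
        (path, st.st_size, st.st_mtime),
    ).fetchone()
    if row:
        return row[0]  # cache hit: no file read needed
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    db.execute(
        "INSERT OR REPLACE INTO md5cache VALUES (?, ?, ?, ?)",
        (path, st.st_size, st.st_mtime, h.hexdigest()),
    )
    db.commit()
    return h.hexdigest()
```

On a second run over unchanged files, every lookup is a database hit, which fits the "first scan is slow, later scans are quick" behaviour described above.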

--
It's The Golden Rule: "He who has the gold makes the rules."
If You're Like Me by crackspackle · 2012-09-02 03:08 · Score: 3, Interesting

The problem started with a complete lack of discipline. I had numerous systems over the years and never really thought I needed to bother with any tracking or control system to manage my home data. I kept way too many minor revisions of the same file, often forking them over different systems. As time passed and I rebuilt systems, I could no longer remember where all the critical stuff was so I'd create tar or zip archives over huge swaths of the file system just in case. I eventually decided to clean up like you are now when I had over 11 million files. I am down to less than half a million now. While I know there are still effective duplicates, at least the size is what I consider manageable. For the stuff from my past, I think this is all I can hope for; however, I've now learned the importance of organization, documentation and version control so I don't have this problem again in the future.
Before even starting to de-duplicate, I recommend organizing your files in a consistent folder structure. Download MediaWiki and start a wiki documenting what you're doing with your systems. The more notes you make, the easier it will be to reconstruct work you've done as time passes. Do this for your other day to day work as well. Get git and start using it for all your code and scripts. Let git manage the history and set it up to automatically duplicate changes on at least one other backup system. Use rsync to do likewise on your new directory structure. Force yourself to stop making any change you consider worth keeping outside of these areas. If you take these steps, you'll likely not have this problem again, at least on the same scope. You'll also find it a heck of a lot easier to decommission or rebuild home systems and you won't have to worry about "saving" data if one of them craps out.
1. Re:If You're Like Me by dolmen.fr · 2012-09-02 20:35 · Score: 2
  
  If you need MediaWiki to manage the documentation about your filesystem structure, you really have a problem.
  TiddlyWiki should be more than sufficient for that task.
5TB only why dedupe? by TheLink · 2012-09-02 03:18 · Score: 3, Insightful

It's only 5TB. Why dedupe? Just buy another HDD or two. How much is your time worth anyway?

You say the data is important enough that you don't want to nuke it. Wouldn't it be also true to say that the data that you've taken the trouble to copy more than once is likely to be important? So keep those dupes.

To me not being able to find stuff (including being aware of stuff in the first place) would be a bigger problem :). That would be my priority, not eliminating dupes.
--
- Too many replies beneath your current threshold
Anyway... by Forty+Two+Tenfold · 2012-09-02 03:21 · Score: 2

Anyway...

--
Upward mobility is a slippery slope - the higher you climb the more you show your ass.
Use DROID 6 by mattpalmer1086 · 2012-09-02 03:21 · Score: 4, Informative

There is a digital preservation tool called DROID (Digital Record Object Identification) which scans all the files you ask it to, identifying their file type. It can also optionally generate an MD5 hash of each file it scans. It's available for download from sourceforge (BSD license, requires Java 6, update 10 or higher).
http://sourceforge.net/projects/droid/
It has a fairly nice GUI (for Java, anyway!), and a command line if you prefer scripting your scan. Once you have scanned all your files (with MD5 hash), export the results into a CSV file. If you like, you can first also define filters to exclude files you're not interested in (e.g. small files could be filtered out). Then import the CSV file into the data analysis app or database of your choice, and look for duplicate MD5 hashes. Alternatively, DROID actually stores its results in an Apache Derby database, so you could just connect directly to that rather than export to CSV, if you have a tool that can work with Derby.
One of the nice things about DROID when working over large datasets is you can save the progress at any time, and resume scanning later on. It was built to scan very large government datastores (multiple Tb). It has been tested over several million files (this can take a week or two to process, but as I say, you can pause at any time, save or restore, although only from the GUI, not the command line).
Disclaimer: I was responsible for the DROID 4, 5 and 6 projects while working at the UK National Archives. They are about to release an update to it (6.1 I think), but it's not available just yet.
Just hash first 4K of each file, avoid 2nd pass by Anonymous Coward · 2012-09-02 03:24 · Score: 2, Insightful

Only hash the first 4K of each file and just do them all. The size check will save a hash only for files with unique sizes, and I think there won't be many with 4.2M media files averaging ~1MB. The second near-full directory scan won't be all that cheap.
By the time you have sorted this out... by 3seas · 2012-09-02 03:44 · Score: 3, Insightful

...it will have cost you far more than simply buying another drive(s) if all you are really concerned about is space...
Re:It's going to take a long time by b4dc0d3r · 2012-09-02 04:00 · Score: 2

I wrote my own to do exactly this, thinking it would be vastly superior to anything I could have downloaded.
File size collisions are a lot more common than one would realize. Even the following algorithm takes a very long time to complete on any sizeable data source:
- Find all files, storing directory and filename as separate strings to prevent memory allocation issues (the path will be the same for lots of files, so keep it in memory once - a hashtable or binsearch or similar optimized storage makes this negligible overhead)
- Sort the resulting list by filesize
- Iterate over the list. If the next file has a different size, continue the loop
- Otherwise, for each file with the same size, open the first file. Open the second file and do a byte-wise compare. This will fail faster than doing a hash for different files, usually it takes a single cluster read to find differences
- After going through each filesize match, drop the first file of the bunch and repeat. The OS file cache will retain most of the files you just opened, so compares go quickly
100k files can take several hours, even in fully automatic "just choose one to delete" mode. Even if they are small.
same by geert · 2012-09-02 04:44 · Score: 2

ftp://ftp.bitwizard.nl/same/
I used this to keep all versions of the Linux kernel source tree on my computer, with identical files hardlinked together to reduce storage space.
Both diff (blazing fast "diff -purN ") and patch handle hard links, so this was very workable.
It can be slow and take quite some memory (only 128 MiB-1 GiB in those days), but guess 16 GiB of RAM should handle 4 million files fine, as this is about the same order of magnitude as the few hundred kernel source trees I had lying around.
After git arrived, it was faster to just use git.
1. Re:same by rew · 2012-09-02 05:18 · Score: 2
  
  As the author of "same", I was going to post the above suggestion.
  Last time I used "same", 4.2 million files was peanuts. Of course, running through 4.8Tb of data is going to take some time.
  People above are making suggestions like doing CRCs of the files. Checking filesizes. Etc etc. Same does all of this:
  First a list is compiled of the files to be handled. Then each file is stat-ed to determine its size. Then only same-size files are considered candidates for being the same. Next if the filesizes are the same, the CRCs are compared. The CRCs are calculated on an "as needed" basis. This means that most big media files will never need to be read entirely unless a duplicate is going to be found. Anyway. When the CRCs are the same, the files are compared bit-for-bit and if THAT comes out good, the files are hardlinked together.
  The hardlinking means that you can further process the results. You can use find to eliminate say all duplicate files in a directory called "backup", provided that they ARE duplicates. Now you'll be left just with the unique files in that directory.
  I'm not sure if all of this will easily run on windows: It's a Unix program. On the other hand, it uses simple calls and should easily be ported using the cygwin suite.
Re:Sorry, if you can't write a simple script, then by wisdom_brewing · 2012-09-02 05:01 · Score: 2

How about intelligent people just looking out for truly insightful comments amongst the various posts? It would be interesting to see a true, accurate demographic of slashdot folk, I guess the people that post are actually a fairly small subset and the number in computer related industries equally small...

--
I am very sucseptible to "let's have another drink"
Re:Wait it out by mlts · 2012-09-02 05:01 · Score: 3, Insightful

I will go out on a limb, risk my geek card and propose another alternative:
Windows Server 2012 has a deduplication feature which works atop of NTFS (not ReFS). Unlike "real" deduplication on the LVM level which you get with your EMC, the files are written to the filesystem fully "hydrated", and as time passes, a background task [1] sifts through the blocks, finds ones that are the same, then adds reparse points.
The reason I'm suggesting this is that if one already has a Windows file server, it might be good to slap on 2012 when it is available, configure deduplication on a dedicated storage volume, and let it do the dirty work on the block level for you.
Of course, ZFS is the most elegant solution, but it may not be the best in the application.
[1]: Fire up PowerShell and type in:
Start-DedupJob E: -Type Optimization
if you want to do it in the foreground after setting it up, if you did a large copy and want to dedupe it all.
Only if you have 100 unique files by HiggsBison · 2012-09-02 05:08 · Score: 4, Informative

If you have 100 files all of one size, you'll have to do 4950 comparisons.
You only have to do 4950 comparisons if you have 100 unique files.
What I do is pop the first file from the list, to use as a standard, and compare all the files with it, block by block. If a block fails to match, I give up on that file matching the standard. The files that don't match generally don't go very far, and don't take much time. For the ones that match, I would have taken all that time if I was using a hash method anyway. As for reading the standard file multiple times: It goes fast because it's in cache.
The ones that match get taken from the list. Obviously I don't compare the one which match with each other. That would be stupid.
Then I go back to the list and rinse/repeat until there are less than 2 files.
I have done this many times with a set of 3 million files which take up about 600GB.
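The pop-a-standard loop can be sketched as follows, assuming Python 3 and a list of paths that already share one file size. The function names and the 64K block size are illustrative choices, not the poster's.

```python
def same_bytes(a, b, block=1 << 16):
    """Compare two files block by block, giving up at the first mismatch."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            x, y = fa.read(block), fb.read(block)
            if x != y:
                return False
            if not x:
                return True  # both files ended without a difference


def group_identical(paths):
    """Repeatedly take one file as the standard and pull out its matches."""
    remaining = list(paths)
    groups = []
    while len(remaining) >= 2:
        standard = remaining.pop(0)
        matches, rest = [standard], []
        for p in remaining:
            (matches if same_bytes(standard, p) else rest).append(p)
        if len(matches) > 1:
            groups.append(matches)
        remaining = rest  # matched files are never compared again
    return groups
```

With 100 identical files this does 99 comparisons in a single pass; the 4950 figure only shows up when all 100 are different.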

--
My other car is a 1984 Nark Avenger.
My own script (feel free to change) by Lulu+of+the+Lotus-Ea · 2012-09-02 05:18 · Score: 2

My home-rolled solution to exactly this problem is: http://gnosis.cx/bin/find-duplicate-contents.
This script is efficient algorithmically and has a variety of options to work incrementally and to optimize common cases. It's not excessively user-friendly, possibly, but the --help screen gives reasonable guidance. And the whole thing is short and readable Python code (which doesn't matter for speed, since the expensive steps like MD5 are callouts to fast C code in the standard library).

--
Buy Text Processing in Python
File size then interleaved secure hash by Terje+Mathisen · 2012-09-02 06:01 · Score: 2

This is a very fun programming task!
Since it will be totally limited by disk IO, the language you choose doesn't really matter, as long as you make sure that you never read each file more than once:
1) Recursive scan of all disks/directories, saving just file name and size plus a pointer to the directory you found it in.
If you have multiple physical disks you can run this in parallel, one task/thread for each disk.
2) Sort the list by file size.
3) For each file size with multiple entries do:
3a) How many matches are there and how large are they?
3a1) Just two files: Read them both in parallel, using a block size of 1MB or more in order to avoid extra disk seeks, and compare directly. Exit on first difference of course!
3a2) 3 or more files: Read them all interleaved, still using a 1MB+ block size. For each block calculate a CRC32 or secure hash, compare these at the end of each block iteration. When a single file differs from the rest, it is unique.
When two or more are equal but still different from the majority of the group, recurse into a new copy of the scanning function that checks the smallest group, then upon return go on with the rest.
It should be obvious that your scanning function needs to accept an array of open file handles/descriptor plus an offset to start the scanning process at, thus making it easy to call it recursively to check the tails of a sub-array!
(A possible problem can occur if you have _very_ many files of the same size, in that the operating system could run out of file handles for simultaneously open files! In that case I'd fall back on passing in file paths instead of open handles and take the hit of re-opening each file for each block to be read. I would also increase the block size significantly, into the 10-100 MB range, so that everything except big ISOs and similar would be read in a single access. The same process is probably optimal for file sizes less than the minimum block size.)
This algorithm should be able to do what you need in significantly less time than you'd need to just read everything once. I'd estimate about 50 MB/s effective reading speed, so if everything is on a single disk (4.9 TB? Not very likely!) and every single file size has multiple entries that only differ in the last byte, you would need 100 K seconds, or a little more than a day. My guess is you should easily finish overnight!
Terje
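A simplified sketch of the interleaved scan in step 3a2, assuming Python 3 and a list of paths that already share one file size. It uses a work list instead of recursion, hashes each block with SHA-1, and keeps every file open at once, so the file-handle fallback mentioned above is not covered. The function name and defaults are assumptions.

```python
import hashlib


def identical_sets(paths, block_size=1 << 20):
    """Split same-size files into sets with identical contents.

    Every file is read at most once, one block at a time, interleaved.
    """
    handles = {p: open(p, "rb") for p in paths}
    try:
        pending, finished = [list(paths)], []
        while pending:
            group = pending.pop()
            buckets = {}
            for p in group:
                block = handles[p].read(block_size)
                key = hashlib.sha1(block).digest() if block else None
                buckets.setdefault(key, []).append(p)
            for key, members in buckets.items():
                if len(members) < 2:
                    continue  # differs from everyone else: unique
                if key is None:
                    finished.append(members)  # reached end of file together
                else:
                    pending.append(members)  # still equal: keep reading
        return finished
    finally:
        for f in handles.values():
            f.close()
```

A file drops out of the scan at the first block where it stands alone, so only true duplicates are read to the end.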

--
"almost all programming can be viewed as an exercise in caching"
Tools for the Job... by Eyeballs · 2012-09-02 08:52 · Score: 2

First: Get a copy of Windows Server 2012 and use the new deduplication system (which uses 'file chunk' deduplication level across an entire disk): https://www.usenix.org/conference/usenixfederatedconferencesweek/primary-data-deduplication%E2%80%94large-scale-study-and-system
Now, that you've taken care of the data duplication, let's talk about the tools for sifting through large sets of files:
1. Get 'Everything' (http://www.voidtools.com/): This tool allows for the 'instant' searching for any file throughout _all_ your files, I've used it on 4 million files myself. Just start typing part of the file name and it will show you a list of where those files are located on your system. Also, the list is 'live', you can right click on any icon in the file list, and it will act the same as you right clicked on the file itself in Explorer.
2. Get 'SpaceMonger' (http://www.sixty-five.cc/sm/): This tool shows what's taking up the space on your computer, it's similar to 'WinDirStat' but more flexible, customizable, and detailed.
3. Get 'ZTreeWin' (http://www.ztree.com/): This tool is the Swiss-Army knife program for working on files (finding, searching, viewing). If you remember 'XTree', it's a clone of that which can work on 4 million(+) files.
4. Get 'Beyond Compare' (http://www.scootersoftware.com/): This tool allows for easy comparison/synchronization of folders (and files). Compare two of your old backup folders and merge them.
Deduplication by thetrom · 2012-09-02 10:41 · Score: 2

Check out dedup in Windows Server 2012 - http://blogs.technet.com/b/filecab/archive/2012/05/21/introduction-to-data-deduplication-in-windows-server-2012.aspx
Merge by fulldecent · 2012-09-02 17:36 · Score: 2

Best tool. http://hungrycats.org/~zblaxell/dupemerge/faster-dupemerge worked great for me in the past 10 years. Scales.

--
-- I was raised on the command line, bitch