Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files?

← Back to Stories (view on slashdot.org)

Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files?

Posted by samzenpus on Sunday September 2, 2012 @01:30AM from the copies-of-the-copies dept.

First time accepted submitter jamiedolan writes "I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. Lots of files of all types from digital photos & scans to HD video files (also web site backup's mixed in which are the cause of such a high number of files). In more recent times I've organized files in a reasonable folder system and have an active / automated backup system. The problem is that I know that I have many old files that have been duplicated multiple times across my drives (many from doing quick backups of important data to an external drive that later got consolidate onto a single larger drive), chewing up space. I tried running a free de-dup program, but it ran for a week straight and was still 'processing' when I finally gave up on it. I have a fast system, i7 2.8Ghz with 16GB of ram, but currently have 4.9TB of data with a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old sloppy filing (folder) system. I do need to keep the data, nuking it is not a viable option.

18 of 440 comments (clear)

Min score:

Reason:

Sort:

CRC by Spazmania · 2012-09-02 01:32 · Score: 5, Informative

Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).

--
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
1. Re:CRC by Anonymous Coward · 2012-09-02 01:36 · Score: 5, Informative
  
  s/CRC32/sha1 or md5, you won't be CPU bound anyway.
2. Re:CRC by Kral_Blbec · 2012-09-02 01:38 · Score: 5, Informative
  
  Or just by file size first, then do a hash. No need to compute a hash to compare a 1mb file and a 1kb file.
3. Re:CRC by Zocalo · 2012-09-02 01:58 · Score: 4, Informative
  
  No. No. No. Blindly CRCing every file is probably what took so long on the first pass and is a terribly inefficient way of de-duplicating files.
  
  There is absolutely no point in generating CRCs of files unless they match on some other, simpler to compare characteristic like file size. The trick is to break the problem apart into smaller chunks. Start with the very large files, they exact size break to use it'll depend on the data set, but as the poster mentioned video file say everything over 1GB to start. Chances are you can fully de-dupe your very large files manually based on nothing more than a visual inspection of names and file sizes in little more time than it takes to find them all in the first place. You can then exclude those files from further checks, and more importantly, from CRC generation.
  
  After that, try and break the problem down into smaller chunks. Whether you are sorting on size, name or CRC, it's quicker to do so when you only have a few hundred thousand files rather than several million. Maybe do another size constrained search; 512MB-1GB, say. Or if you have them, look for duplicated backups files in the form of ZIP files, or whatever archive format(s), you are using based on their extension - that also saves you having to expand and examine the contents of multiple archive files. Similarly, do a de-dupe of just the video files by extensions as these should again lend themselves to rapid manual sorting without having to generate CRCs for many GB of data. Another grouping to consider might be to at least try and get all of the website data, or as much of is as you can, into one place and de-dupe that, and consider whether you really need multiple archival copies of a site, or whether just the latest/final revision will do.
  
  By the time you've done all that, including moving the stuff that you know is unique out of the way and into a better filing structure as you go, the remainder should be much more manageable for a single final pass. Scan the lot, identify duplicates based on something simple like the file size and, ideally, manually get your de-dupe tool to CRC only those groups of identically sized files that you can't easily tell apart like bunches of identically sized word processor or image files with cryptic file names.
  
  --
  UNIX? They're not even circumcised! Savages!
4. Re:CRC by caluml · 2012-09-02 01:58 · Score: 5, Informative
  
  Exactly. What I do is this:
  
  1. Compare filesizes.
  2. When there are multiple files with the same size, start diffing them. I don't read the whole file to compute a checksum - that's inefficient with large files. I simply read the two files byte by byte, and compare - that way, I can quit checking as soon as I hit the first different byte.
  
  Source is at https://github.com/caluml/finddups - it needs some tidying up, but it works pretty well.
  
  git clone, and then mvn clean install.
  
  --
  Get your own free personal location tracker
5. Re:CRC by Anonymous Coward · 2012-09-02 02:05 · Score: 5, Informative
  
  If you get a linux image running (say in a livecd or VM) that can access the file system then fdupes is built to do this already. Various output format/recursion options.
  From the man page:
  DESCRIPTION
  Searches the given path for duplicate files. Such files are found by
  comparing file sizes and MD5 signatures, followed by a byte-by-byte
  comparison.
6. Re:CRC by belg4mit · 2012-09-02 02:40 · Score: 2, Informative
  
  Unique Filer http://www.uniquefiler.com/ implements these short-circuits for you.
  It's meant for images but will handle any filetype, and even runs under WINE.
  
  --
  Were that I say, pancakes?
7. Re:CRC by blueg3 · 2012-09-02 03:06 · Score: 5, Informative
  
  b) Looking at the first few bytes of files with the same size.
  Note that there's no reason to only look at the first few bytes. On spinning disks, any read smaller than about 16K will take the same amount of time. Comparing two 16K chunks takes zero time compared to how long it takes to read them from disk.
  You could, for that matter, make it a 3-pass system that's pretty fast:
  a) get all file sizes; remove all files that have unique sizes
  b) compute the MD5 hash of the first 16K of each file; remove all files that have unique (size, header-hash) pairs
  c) compute the MD5 hash of the whole file; remove all files that have unique (size, hash) pairs
  Now you have a list of duplicates.
  Don't forget to eliminate all files of zero length in step (a). They're trivially duplicates but shouldn't be deduplicated.
8. Re:CRC by Anonymous Coward · 2012-09-02 04:33 · Score: 2, Informative
  
  Actually, this is an instance where lots of random IO will bog you down when comparing a bunch of files. His 4+ TB divided by 4.2M files is roughly 1MB average file size, which really isn't that much content to access per random seek. A naive all-to-all comparison will cause a lot of random IO, so you really need to generate a batch file listing with per-file metadata and then analyze the listings efficiently. Adding checksum info to this batch listing is actually not that costly and allows the entire de-dupe analysis to be performed with no further disk IO. Even if we assume 1kB per file of name, size, and checksum info (it's probably a lot less), the whole listing is around 4GB which can be largely cached in RAM for analysis.
  When I had this same problem on Linux, I did two scans of the entire file set using the 'find . -type f -exec cmd {} +' command to automatically run 'stat' and 'md5sum' on batches of files, then I merged these scan results to have one table of information per file. You could do all of this by processing files (e.g. sort and join on Unix) but it is more efficient to just import the data into sqlite or another database and do it there. In my case, I grouped files by size and checksum, also sorting the group members by name length, preferring the shortest name as the "original" file, since the names tended to get longer with each redundant backup copy adding some other top-level directory name to the original file name.
  The reason I ran two scans was that I was too lazy to implement a hybrid command to efficiently run 'md5sum' and 'stat' as one utility. It would have taken me longer to develop and test the utility enough to trust it than to just run it with the existing utilities. In the end, the scan with md5sum did not take that much longer than the scan with stat, because the overall time is dominated by digging around vast directory hierarchies and randomly accessing file metadata, versus the bulk sequential access pattern used to perform the checksum once each file was found. If you monitor the system while these commands run, there is steady high-bandwidth disk access for the duration of the md5sum scan, while there is steady disk seeking with very little bandwidth for the duration of the stat scan. Neither scan saturates a CPU.
  Another question is what to do with the results of analysis. One option is to delete all but one copies of each length/checksum group, and assume you would use the database information in the future if you ever need to reconstitute one of the deleted hierarchies. Or you could turn all secondary references into hard-links to the same file, retaining the original hierarchies as accessible file trees. Or, as I chose to do, you can replace secondary references with symbolic links to the primary copy, which is close enough to preserving the original hierarchy for most programmatic access but is also self-documenting the fact that it is a secondary name for the same file at the other end of the link.
9. Re:CRC by xigxag · 2012-09-02 07:10 · Score: 4, Informative
  
  With 4.2 million files, given the probability of SHA-1 collisions plus the birthday paradox and there will be around 500 SHA-1 collisions which are not duplicates.
  That's totally, completely wrong. The birthday problem isn't a breakthrough concept, and the probability of random SHA-1 collisions is therefore calculated with it in mind. The number is known to be 1/2^80. This is straightforwardly derived from the total number of SHA-1 values, 2^160, which is then immensely reduced by the birthday paradox to 2^80 expected hashes required for a collision. This means that a hard drive with 2^80 or 1,208,925,819,614,629,174,706,176 files would have on average ONE collision. Note that this is a different number than the number of hashes one has to generate for a targeted cryptographic SHA-1 attack, which with best current theory is on the order of 2^51 for the full 80-round SHA-1, although as Goaway has pointed out, no such collision has yet been found.
  Frankly I'm at a loss as to how you arrived at 500 SHA-1 collisions out of 4.2 million files. That's ludicrous. Any crypto hashing function with such a high collision rate would be useless. Much worse than MD5, even.
  
  --
  There are two kinds of people: 1) those who start arrays with one and 1) those who start them with zero.
10. Re:CRC by inKubus · 2012-09-02 19:28 · Score: 3, Informative
  
  For the lazy, here are 3 more tools:
  fdupes, duff, and rdfind.
  Duff claims it's O(n log n), because they:
  Only compare files if they're of equal size.
  Compare the beginning of files before calculating digests.
  Only calculate digests if the beginning matches.
  Compare digests instead of file contents.
  Only compare contents if explicitly asked.
  
  --
  Cool! Amazing Toys.
Re:ZFS by smash · 2012-09-02 01:39 · Score: 4, Informative

To clarify - no this will not remove duplicate references to the data. The files ystem will remain in tact. However it will perform block level dedupe of the data which will recover your space. Duplicate references aren't necessarily a bad thing anyway, as if you have any sort of content index (memory, code, etc) that refers to data in a particular location, it will continue to work. However the space will be recovered.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
There are tools for this by Anonymous Coward · 2012-09-02 01:41 · Score: 5, Informative

If you don't mind booting Linux (a live version will do), fdupes has been fast enough for my needs and has various options to help you when multiple collisions occur. For finding similar images with non-identical checksums, findimagedupes will work, although it's obviously much slower than a straight 1-to-1 checksum comparison.
YMMV
Manual work will have to be done by Qbertino · 2012-09-02 02:24 · Score: 4, Informative

Your problem isn't unduping files in your archives, your problem is getting an overview of your data archives. If you'd have it, you wouldn't have dupes in the first place.
This is a larger personal project, but you should take it on, since it will be a good lesson in data organisation. I've been there and done that.
You should get a rough overview of what you're looking at and where to expect large sets of dupes. Do this by manually parsing your archives in broad strokes. If you want to automate dupe-removal, do so by de-duping smaller chunks of your archive. You will need extra CPU and storage - maybe borrow a box or two from friends and set up a batch of scripts you can run from Linux live CDs with external HDDs attached.
Most likely you will have to do some scripting or programming, and you will have to devise a strategy not only of dupe removal, but of merging the remaining skeletons of dirtrees. That's actually the tough part. Removing dupes takes raw processing power and can be done in a few weeks and brute force and a solid storage bandwidth.
Organising the remaining stuff is where the real fun begins. ... You should start thinking about what you are willing to invest and how your backup, versioning and archiving strategy should look in the end, data/backup/archive retrival included. The latter might even determine how you go about doing your dirtree diffs - maybe you want to use a database for that for later use.
Anyway you put it, just setting up a box in the corner and having a piece of software churn away for a few days, weeks or months won't solve your problem in the end. If you plan well, it will get you started, but that's the most you can expect.
As I say: Been there, done that.
I still have unfinished business in my backup/archiving strategy and setup, but the setup now is 2 1TB external USB3 drives and manual arsync sessions every 10 weeks or so to copy from HDD-1 to HDD-2 to have dual backups/archives. It's quite simple now, but it was a long hard way to clean up the mess of the last 10 years. And I actually was quite conservative about keeping my boxed tidy. I'm still missing external storage in my setup, aka Cloud-Storage, the 2012 buzzword for that, but it will be much easyer for me to extend to that, now that I've cleaned up my shit halfway.
Good luck, get started now, work in iterations, and don't be silly and expect this project to be over in less than half a year.
My 2 cents.

--
We suffer more in our imagination than in reality. - Seneca
Already done it - python script by Terrasque · 2012-09-02 03:01 · Score: 3, Informative

I found a python script online and hacked it a bit to work on a larger scale.
The script originally scanned a directory, found files with same size, and md5'ed them for comparison.
Among other things I added option to ignore files under a certain size, and to cache md5 in a sqlite db. I also think I did some changes to the script to handle large number of files better, and do more effective md5 (also added option to limit number of bytes to md5, but that didn't make much difference in performance for some reason). I also added option to hard link files that are the same.
With inodes in memory, and sqlite db already built, it takes about 1 second to "scan" 6TB of data. First scan will probably take a while, tho.
Script here - It's only tested on Linux.
Even if it's not perfect, it might be a good starting point :)

--
It's The Golden Rule: "He who has the gold makes the rules."
Use DROID 6 by mattpalmer1086 · 2012-09-02 03:21 · Score: 4, Informative

There is a digital preservation tool called DROID (Digital Record Object Identification) which scans all the files you ask it to, identifying their file type. It can also optionally generate an MD5 hash of each file it scans. It's available for download from sourceforge (BSD license, requires Java 6, update 10 or higher).
http://sourceforge.net/projects/droid/
It has a fairly nice GUI (for Java, anyway!), and a command line if you prefer scripting your scan. Once you have scanned all your files (with MD5 hash), export the results into a CSV file. If you like, you can first also define filters to exclude files you're not interested in (e.g. small files could be filtered out). Then import the CSV file into your data anlaysis app or database of your choice, and look for duplicate MD5 hashes. Alternetively, DROID actually stores its results in an Apache Derby database, so you could just connect directly to that rather than export to CSV, if you have a tool that an work with Derby.
One of the nice things about DROID when working over large datasets is you can save the progress at any time, and resume scanning later on. It was built to scan very large government datastores (multiple Tb). It has been tested over several million files (this can take a week or two to process, but as I say, you can pause at any time, save or restore, although only from the GUI, not the command line).
Disclaimer: I was responsible for the DROID 4, 5 and 6 projects while working at the UK National Archives. They are about to release an update to it (6.1 I think), but it's not available just yet.
Re:Prioritize by file size by b4dc0d3r · 2012-09-02 03:49 · Score: 3, Informative

ZIP, test, then Par2 the zip. Even at the worst possible compression level, greater than 100% filezises, you just saved a ton of space.
I got to the point where I rarely copy small files without first zipping on the source drive. It takes so frigging long, when a full zip or tarball takes seconds. Even a flat tar without the gzip step is a vast improvement, since the filesystem doesn't have to be continually updated. But zipping takes so little resource that Windows XP's "zipped folders" actually makes a lot of sense for any computer after maybe 2004, even with the poor implementation.
Only if you have 100 unique files by HiggsBison · 2012-09-02 05:08 · Score: 4, Informative

If you have 100 files all of one size, you'll have to do 4950 comparisons.
You only have to do 4950 comparisons if you have 100 unique files.
What I do is pop the first file from the list, to use as a standard, and compare all the files with it, block by block. If a block fails to match, I give up on that file matching the standard. The files that don't match generally don't go very far, and don't take much time. For the ones that match, I would have taken all that time if I was using a hash method anyway. As for reading the standard file multiple times: It goes fast because it's in cache.
The ones that match get taken from the list. Obviously I don't compare the one which match with each other. That would be stupid.
Then I go back to the list and rinse/repeat until there are less than 2 files.
I have done this many times with a set of 3 million files which take up about 600GB.

--
My other car is a 1984 Nark Avenger.