Slashdot Mirror


Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files?

First time accepted submitter jamiedolan writes "I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. Lots of files of all types from digital photos & scans to HD video files (also web site backup's mixed in which are the cause of such a high number of files). In more recent times I've organized files in a reasonable folder system and have an active / automated backup system. The problem is that I know that I have many old files that have been duplicated multiple times across my drives (many from doing quick backups of important data to an external drive that later got consolidate onto a single larger drive), chewing up space. I tried running a free de-dup program, but it ran for a week straight and was still 'processing' when I finally gave up on it. I have a fast system, i7 2.8Ghz with 16GB of ram, but currently have 4.9TB of data with a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old sloppy filing (folder) system. I do need to keep the data, nuking it is not a viable option.

13 of 440 comments (clear)

  1. CRC by Spazmania · · Score: 5, Informative

    Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).

    --
    Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
    1. Re:CRC by Anonymous Coward · · Score: 5, Informative

      s/CRC32/sha1 or md5, you won't be CPU bound anyway.

    2. Re:CRC by Kral_Blbec · · Score: 5, Informative

      Or just by file size first, then do a hash. No need to compute a hash to compare a 1mb file and a 1kb file.

    3. Re:CRC by igb · · Score: 5, Insightful

      That involves reading every byte. It would be faster to read the bytecount of each file, which doesn't involve reading the files themselves as that metadata is available, and then exclude from further examination all the files which have unique sizes. You could then read the first block of each large file, and discard all the files that have unique first blocks. After that, CRC32 (or MD5 or SHA1 --- you're going to be disk-bound anyway) and look for duplicates that way.

    4. Re:CRC by caluml · · Score: 5, Informative

      Exactly. What I do is this:

      1. Compare filesizes.
      2. When there are multiple files with the same size, start diffing them. I don't read the whole file to compute a checksum - that's inefficient with large files. I simply read the two files byte by byte, and compare - that way, I can quit checking as soon as I hit the first different byte.

      Source is at https://github.com/caluml/finddups - it needs some tidying up, but it works pretty well.

      git clone, and then mvn clean install.

    5. Re:CRC by Anonymous Coward · · Score: 5, Informative

      If you get a linux image running (say in a livecd or VM) that can access the file system then fdupes is built to do this already. Various output format/recursion options.

      From the man page:
      DESCRIPTION
                    Searches the given path for duplicate files. Such files are found by
                    comparing file sizes and MD5 signatures, followed by a byte-by-byte
                    comparison.

    6. Re:CRC by bzipitidoo · · Score: 5, Insightful

      Part 2 of your method will quickly bog down if you run into many files that are the same size. Takes (n choose 2) comparisons, for a problem that can be done in n time. If you have 100 files all of one size, you'll have to do 4950 comparisons. Much faster to compute and sort 100 checksums.

      Also, you don't have to read the whole file to make use of checksums, CRCs, hashes and the like. Just check a few pieces likely to be different if the files are different, such as the first and last 2000 bytes. Then for those files with matching parts, check the full files.

      --
      Intellectual Property is a monopolistic, selfish, and defective concept. It is "tyranny over the mind of man"
    7. Re:CRC by blueg3 · · Score: 5, Informative

      b) Looking at the first few bytes of files with the same size.

      Note that there's no reason to only look at the first few bytes. On spinning disks, any read smaller than about 16K will take the same amount of time. Comparing two 16K chunks takes zero time compared to how long it takes to read them from disk.

      You could, for that matter, make it a 3-pass system that's pretty fast:
      a) get all file sizes; remove all files that have unique sizes
      b) compute the MD5 hash of the first 16K of each file; remove all files that have unique (size, header-hash) pairs
      c) compute the MD5 hash of the whole file; remove all files that have unique (size, hash) pairs

      Now you have a list of duplicates.

      Don't forget to eliminate all files of zero length in step (a). They're trivially duplicates but shouldn't be deduplicated.

    8. Re:CRC by BasilBrush · · Score: 5, Insightful

      Someone who's technical expertise is in areas other than writing script files. There are technical jobs other than being a sysop you know.

    9. Re:CRC by iluvcapra · · Score: 5, Insightful

      First compare file size (ignoring things like mp3 ID3 tags, or other headers).

      I once had to write an audio file de-deuplicator; one of the big problems was you would ignore the metadata and the out-of-band data when you did the comparisons, but you always had to take this stuff into account when you were deciding which version of a file to keep -- you didn't want to delete two copies f a file with all the tags filled out and keep the one that was naked.

      My de-duper worked like everyone here is saying -- it cracked open wav and aiff (and Sound Designer 2) files, captured their sample count and sample format into a sqlite db, did a couple of big joins and then did some SHA1 hashes of likely suspects. All of this worked great, but once I had the list I had the epiphany that the real problem of these tools is the resolution and how you make sure you're doing exactly what the user wants.

      How do you decide which one to keep? You can just do hard links, but...

      • The users I was working with were very uncomfortable with hard links, they didn't really understand the concept and were concerned that it made it difficult to know if you were "really" throwing something away when you dragged something to the trash. (It's stupid but it was their box.)
      • Our existing backup/archival software wouldn't do the right thing with hard links, so it'd save no space on the tapes.
      • Our audio workstation software wouldn't read audio off of files that were hard links on OS X (because hard links on OSX aren't really hard links, I believe our audio workstation vendor have since resolved this).

      But let's say you can do hard links, no problem. How do you decide which instance of the file is to be kept, if you've only compared the "real" content of the file and ignored metadata? You could just give the user a big honking list of every set of files that are duplicates -- two here, three here, six here, and then let them go through and elect which one will be kept, but that's a mess and 99% of the time they're going to select a keeper on the basis of which part of the directory tree it's in. So, you need to do a rule system or a preferential ranking of parts of the directory hierarchy that tell the system "keep files you find here." Now, the files will also have metadata, so you also have to preferentially rank the files on the basis of its presence -- you might also rank files higher if your guy did the metadata tagging, because things like audio descriptions are often done with a specialized jargon that can be specific to a particular house.

      Also, it'd be very common to delete a file from a directory containing an editor's personal library, and replacing it with a hard link to a file in the company's main library -- several people would have copies of the same commercial sound, or an editor would be the recordist of a sound that was subsequently sold to a commercial library, or whatever. Is it a good policy to replace his file with a hardlink to a different one, particularly if they differ in the metadata? Directories on a volume are often controlled by different people with different policies and proprietary interest to the files -- maybe the company "owns" everything, but it still can create a lot of internal disputes if files in a division or individual project's library folder starting getting their metadata changed, on account of being replaced with a hard link to a "better" file in the central repository. We can agree not to de-dup these, but it's more rules and exceptions that have to be made.

      Once you have to list of duplicates, and maybe the rules, do you just go and delete, or do you give the user a big list to review? And, if upon review, he makes one change to one duplicate instance, it'd be nice to have that change intelligently reflected on the others. The rules have to be applied to the dupe list interactively and changes have to be reflected in the same way, otherwise it becomes a miserable experience for the user to de-dupe 1M files over 7 terabytes. The resolution of duplicates is the hard part, the finding of dupes is relatively easy.

      --
      Don't blame me, I voted for Baltar.
  2. There are tools for this by Anonymous Coward · · Score: 5, Informative

    If you don't mind booting Linux (a live version will do), fdupes has been fast enough for my needs and has various options to help you when multiple collisions occur. For finding similar images with non-identical checksums, findimagedupes will work, although it's obviously much slower than a straight 1-to-1 checksum comparison.

    YMMV

  3. Simple dedupe algorithm by Anonymous Coward · · Score: 5, Funny

    Delete all files but one. The remaining file is guaranteed unique!

  4. Prioritize by file size by jwales · · Score: 5, Insightful

    Since the objective is to recover disk space, the smallest couple of million files are unlikely to do very much for you at all. It's the big files that are the issue in most situations.

    Compile a list of all your files, sorted by size. The ones that are the same size and the same name are probably the same file. If you're paranoid about duplicate file names and sizes (entirely plausible in some situations), then crc32 or byte-wise comparison can be done for reasonable or absolute certainty. Presumably at that point, to maintain integrity of any links to these files, you'll want to replace the files with hard links (not soft links!) so that you can later manually delete any of the "copies" without hurting all the other "copies". (There won't be separate copies, just hard links to one copy.)

    If you give up after a week, or even a day, at least you will have made progress on the most important stuff.

    --
    Wikia