Slashdot Mirror


Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files?

First time accepted submitter jamiedolan writes "I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. There are lots of files of all types, from digital photos and scans to HD video files (plus web site backups mixed in, which are the cause of such a high number of files). More recently I've organized files in a reasonable folder system and set up an active, automated backup system. The problem is that I know I have many old files that have been duplicated multiple times across my drives (many from doing quick backups of important data to an external drive that later got consolidated onto a single larger drive), chewing up space. I tried running a free de-dupe program, but it ran for a week straight and was still 'processing' when I finally gave up on it. I have a fast system, an i7 at 2.8GHz with 16GB of RAM, but currently have 4.9TB of data with a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old sloppy filing (folder) system. I do need to keep the data; nuking it is not a viable option."

7 of 440 comments

  1. CRC by Spazmania · · Score: 5, Informative

    Compute a CRC32 of each file. Write the results to an index file, one line per file, in this order: CRC, directory, filename. Sort the index by CRC. Then read it linearly, doing a full compare on any files with the same CRC (these will be adjacent after sorting).
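
    A minimal Python sketch of this idea, under some assumptions: the index is kept in memory as a list of tuples rather than written to a file and sorted on disk, and every file under the root is readable. The names crc32_of and find_duplicates_by_crc are made up for illustration.

```python
import filecmp
import os
import zlib
from itertools import groupby


def crc32_of(path, chunk_size=1 << 20):
    """Return the CRC32 of a file, reading it in chunks to bound memory use."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def find_duplicates_by_crc(root):
    """Index every file under root by CRC32, sort, then fully compare neighbours."""
    records = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Same field order as the comment: CRC, directory, filename.
            records.append((crc32_of(path), dirpath, name))
    records.sort()  # records with equal CRCs are now adjacent

    groups = []
    for _crc, same_crc in groupby(records, key=lambda r: r[0]):
        paths = [os.path.join(d, n) for _c, d, n in same_crc]
        # A CRC match is only a hint; confirm with a full byte comparison.
        while len(paths) > 1:
            first, rest = paths[0], paths[1:]
            dupes = [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if dupes:
                groups.append([first] + dupes)
            paths = [p for p in rest if p not in dupes]
    return groups
```

    Each returned group is a list of paths with identical content. For millions of files, writing the records to disk and using an external sort, as described, keeps memory use flat.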

    1. Re:CRC by Anonymous Coward · · Score: 5, Informative

      s/CRC32/sha1 or md5/. You won't be CPU-bound anyway.

    2. Re:CRC by Kral_Blbec · · Score: 5, Informative

      Or just compare by file size first, then hash. There's no need to compute a hash to compare a 1 MB file with a 1 KB file.

    3. Re:CRC by caluml · · Score: 5, Informative

      Exactly. What I do is this:

      1. Compare file sizes.
      2. When there are multiple files with the same size, start diffing them. I don't read the whole file to compute a checksum - that's inefficient with large files. I simply read the two files byte by byte and compare them - that way, I can stop checking as soon as I hit the first differing byte.

      Source is at https://github.com/caluml/finddups - it needs some tidying up, but it works pretty well.

      git clone, and then mvn clean install.
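
      The early-exit comparison in step 2 can be sketched in a few lines of Python. This is not the finddups code, just an illustration of the technique; it compares in 64 KB chunks rather than literally one byte at a time, and the name same_content and the chunk size are assumptions.

```python
import os


def same_content(path_a, path_b, chunk_size=64 * 1024):
    """Return True if both files hold identical bytes, stopping at the first mismatch."""
    # Step 1: files of different sizes cannot be duplicates.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # Step 2: read both files in lockstep and bail out on the first difference.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing chunk: no need to read further
            if not chunk_a:
                return True   # both files ended together with no difference
```

      The trade-off against hashing: a pairwise compare never reads more than it must for two files, but a group of many same-sized files needs many pairwise passes, where one hash per file would do.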

    4. Re:CRC by Anonymous Coward · · Score: 5, Informative

      If you get a Linux image running (say, from a live CD or in a VM) that can access the file system, then fdupes is built to do this already. It has various output format and recursion options.

      From the man page:
      DESCRIPTION
                    Searches the given path for duplicate files. Such files are found by
                    comparing file sizes and MD5 signatures, followed by a byte-by-byte
                    comparison.

    5. Re:CRC by blueg3 · · Score: 5, Informative

      b) Looking at the first few bytes of files with the same size.

      Note that there's no reason to only look at the first few bytes. On spinning disks, any read smaller than about 16K will take the same amount of time. Comparing two 16K chunks takes zero time compared to how long it takes to read them from disk.

      You could, for that matter, make it a 3-pass system that's pretty fast:
      a) get all file sizes; remove all files that have unique sizes
      b) compute the MD5 hash of the first 16K of each file; remove all files that have unique (size, header-hash) pairs
      c) compute the MD5 hash of the whole file; remove all files that have unique (size, hash) pairs

      Now you have a list of duplicates.

      Don't forget to eliminate all files of zero length in step (a). They're trivially duplicates but shouldn't be deduplicated.
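
      The three passes above, including the zero-length rule, can be sketched in Python as follows. The helper names (md5_of, drop_unique, regroup, find_duplicates) are invented for this example, and it assumes every file under the root is a readable regular file.

```python
import hashlib
import os
from collections import defaultdict

HEAD_BYTES = 16 * 1024  # the 16K figure from the comment above


def md5_of(path, limit=None, chunk_size=1 << 20):
    """MD5 of the first `limit` bytes of a file, or of the whole file if limit is None."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = f.read(want)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def drop_unique(groups):
    """Keep only groups that still contain more than one candidate."""
    return [g for g in groups if len(g) > 1]


def regroup(groups, key):
    """Split every group into subgroups whose members share key(path)."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(buckets.values())
    return drop_unique(result)


def find_duplicates(root):
    # Pass (a): group by size, skipping zero-length files.
    by_size = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > 0:
                by_size[size].append(path)
    groups = drop_unique(list(by_size.values()))
    # Pass (b): within each size group, split by MD5 of the first 16K.
    groups = regroup(groups, lambda p: md5_of(p, limit=HEAD_BYTES))
    # Pass (c): within each surviving group, split by MD5 of the whole file.
    groups = regroup(groups, md5_of)
    return groups
```

      Because each pass only regroups within the groups left by the previous one, the (size, hash) pairing is implicit. A final byte-by-byte check of each group would guard against hash collisions if that matters for the data.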

  2. There are tools for this by Anonymous Coward · · Score: 5, Informative

    If you don't mind booting Linux (a live version will do), fdupes has been fast enough for my needs and has various options to help you when multiple collisions occur. For finding similar images with non-identical checksums, findimagedupes will work, although it's obviously much slower than a straight 1-to-1 checksum comparison.

    YMMV