Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files?
First time accepted submitter jamiedolan writes "I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. Lots of files of all types from digital photos & scans to HD video files (also web site backup's mixed in which are the cause of such a high number of files). In more recent times I've organized files in a reasonable folder system and have an active / automated backup system. The problem is that I know that I have many old files that have been duplicated multiple times across my drives (many from doing quick backups of important data to an external drive that later got consolidate onto a single larger drive), chewing up space. I tried running a free de-dup program, but it ran for a week straight and was still 'processing' when I finally gave up on it. I have a fast system, i7 2.8Ghz with 16GB of ram, but currently have 4.9TB of data with a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old sloppy filing (folder) system. I do need to keep the data, nuking it is not a viable option.
Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
To clarify - no this will not remove duplicate references to the data. The files ystem will remain in tact. However it will perform block level dedupe of the data which will recover your space. Duplicate references aren't necessarily a bad thing anyway, as if you have any sort of content index (memory, code, etc) that refers to data in a particular location, it will continue to work. However the space will be recovered.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
If you don't mind booting Linux (a live version will do), fdupes has been fast enough for my needs and has various options to help you when multiple collisions occur. For finding similar images with non-identical checksums, findimagedupes will work, although it's obviously much slower than a straight 1-to-1 checksum comparison.
YMMV
Delete all files but one. The remaining file is guaranteed unique!
if you really want, sort, order and index it all, but my suggestion would be different.
If you didn't need the files in the last 5 years, you'll probably never need them at all.
Maybe one or two. Make one volume called OldSh1t, index it, and forget about it again.
Really. Unless you have a very good reason to un-dupe everything, don't.
I have my share of old files and dupes. I know what you're talking about :)
Well, the sun is shining. If you need me, I'm outside.
Privacy is terrorism.
Since the objective is to recover disk space, the smallest couple of million files are unlikely to do very much for you at all. It's the big files that are the issue in most situations.
Compile a list of all your files, sorted by size. The ones that are the same size and the same name are probably the same file. If you're paranoid about duplicate file names and sizes (entirely plausible in some situations), then crc32 or byte-wise comparison can be done for reasonable or absolute certainty. Presumably at that point, to maintain integrity of any links to these files, you'll want to replace the files with hard links (not soft links!) so that you can later manually delete any of the "copies" without hurting all the other "copies". (There won't be separate copies, just hard links to one copy.)
If you give up after a week, or even a day, at least you will have made progress on the most important stuff.
Wikia
perhaps you could boot with a livecd and mount your windows drives under a single directory? Then:
find /your/mount/point -type f -exec sha256sum > sums.out
uniq -u -w 64 sums.out
I recently had this problem and solved it with finddupe (http://www.sentex.net/~mwandel/finddupe/). It's a free command line tool. It can create hardlinks, you can tell it which is a master directory to keep and which directories to delete, and it can create a batch file to do actually do the deletion if you don't trust it or just want to see what it will do. Highly recommend. In any case, 5 TB is going to take forever but with finddupe you can be sure your time is not wasted, unlike one of the free tools that analyzed my drive for 12 hours and then told me it would only fix ten duplicates.
I tried this vs. Clone Spy, Fast Duplicate File Finder, Easy Duplicate File Finder, and the GPL Duplicate Files Finder (crashy). (Side note: Get some creativity guys). There's no UI but I don't care. It doesn't keep any state between runs so run it a few times on subdirectories to make sure you know what it's doing first then let it rip.
-- Too lazy to get a lower UID.
Your problem isn't unduping files in your archives, your problem is getting an overview of your data archives. If you'd have it, you wouldn't have dupes in the first place.
This is a larger personal project, but you should take it on, since it will be a good lesson in data organisation. I've been there and done that.
You should get a rough overview of what you're looking at and where to expect large sets of dupes. Do this by manually parsing your archives in broad strokes. If you want to automate dupe-removal, do so by de-duping smaller chunks of your archive. You will need extra CPU and storage - maybe borrow a box or two from friends and set up a batch of scripts you can run from Linux live CDs with external HDDs attached.
Most likely you will have to do some scripting or programming, and you will have to devise a strategy not only of dupe removal, but of merging the remaining skeletons of dirtrees. That's actually the tough part. Removing dupes takes raw processing power and can be done in a few weeks and brute force and a solid storage bandwidth.
Organising the remaining stuff is where the real fun begins. ... You should start thinking about what you are willing to invest and how your backup, versioning and archiving strategy should look in the end, data/backup/archive retrival included. The latter might even determine how you go about doing your dirtree diffs - maybe you want to use a database for that for later use.
Anyway you put it, just setting up a box in the corner and having a piece of software churn away for a few days, weeks or months won't solve your problem in the end. If you plan well, it will get you started, but that's the most you can expect.
As I say: Been there, done that.
I still have unfinished business in my backup/archiving strategy and setup, but the setup now is 2 1TB external USB3 drives and manual arsync sessions every 10 weeks or so to copy from HDD-1 to HDD-2 to have dual backups/archives. It's quite simple now, but it was a long hard way to clean up the mess of the last 10 years. And I actually was quite conservative about keeping my boxed tidy. I'm still missing external storage in my setup, aka Cloud-Storage, the 2012 buzzword for that, but it will be much easyer for me to extend to that, now that I've cleaned up my shit halfway.
Good luck, get started now, work in iterations, and don't be silly and expect this project to be over in less than half a year.
My 2 cents.
We suffer more in our imagination than in reality. - Seneca
I found a python script online and hacked it a bit to work on a larger scale.
The script originally scanned a directory, found files with same size, and md5'ed them for comparison.
Among other things I added option to ignore files under a certain size, and to cache md5 in a sqlite db. I also think I did some changes to the script to handle large number of files better, and do more effective md5 (also added option to limit number of bytes to md5, but that didn't make much difference in performance for some reason). I also added option to hard link files that are the same.
With inodes in memory, and sqlite db already built, it takes about 1 second to "scan" 6TB of data. First scan will probably take a while, tho.
Script here - It's only tested on Linux.
Even if it's not perfect, it might be a good starting point :)
It's The Golden Rule: "He who has the gold makes the rules."
The problem started with a complete lack of discipline. I had numerous systems over the years and never really thought I needed to bother with any tracking or control system to manage my home data. I kept way to many minor revisions of the same file, often forking them over different systems. As time past and rebuilt systems, I could no longer remember where all the critical stuff was so I'd create tar or zip archives over huge swaths of the file system just in case. I eventually decided to clean up like you are now when I had over 11 million files. I am down to less than half a million now. While I know there are still effective duplicates, at least the size is what I consider manageable. For the stuff from my past, I think this is all I can hope for; however, I've now learned the importance of organization, documentation and version control so I don't have this problem again in the future.
Before even starting to de-duplicate, I recommend organizing your files in a consistent folder structure. Download wikimedia and start a wiki documenting what you're doing with your systems. The more notes you make, the easier it will be to reconstruct work you've done as time passes. Do this for your other day to day work as well. Get git and start using it for all your code and scripts. Let git manage the history and set it up to automatically duplicate changes on at least one other backup system. Use rsync to do likewise on your new directory structure. Force yourself to stop making any change you consider worth keeping outside of these areas. If you take these steps, you'll likely not have this problem again, at least on the same scope. You'll also find it a heck of a lot easier to decommission or rebuild home systems and you won't have to worry about "saving" data if one of them craps out.
It's only 5TB. Why dedupe? Just buy another HDD or two. How much is your time worth anyway?
:). That would be my priority, not eliminating dupes.
You say the data is important enough that you don't want to nuke it. Wouldn't it be also true to say that the data that you've taken the trouble to copy more than once is likely to be important? So keep those dupes.
To me not being able to find stuff (including being aware of stuff in the first place) would be a bigger problem
There is a digital preservation tool called DROID (Digital Record Object Identification) which scans all the files you ask it to, identifying their file type. It can also optionally generate an MD5 hash of each file it scans. It's available for download from sourceforge (BSD license, requires Java 6, update 10 or higher).
http://sourceforge.net/projects/droid/
It has a fairly nice GUI (for Java, anyway!), and a command line if you prefer scripting your scan. Once you have scanned all your files (with MD5 hash), export the results into a CSV file. If you like, you can first also define filters to exclude files you're not interested in (e.g. small files could be filtered out). Then import the CSV file into your data anlaysis app or database of your choice, and look for duplicate MD5 hashes. Alternetively, DROID actually stores its results in an Apache Derby database, so you could just connect directly to that rather than export to CSV, if you have a tool that an work with Derby.
One of the nice things about DROID when working over large datasets is you can save the progress at any time, and resume scanning later on. It was built to scan very large government datastores (multiple Tb). It has been tested over several million files (this can take a week or two to process, but as I say, you can pause at any time, save or restore, although only from the GUI, not the command line).
Disclaimer: I was responsible for the DROID 4, 5 and 6 projects while working at the UK National Archives. They are about to release an update to it (6.1 I think), but it's not available just yet.
...it will have cost you far more than simply buying another drive(s) if all you are really concerned about is space...
I will go out on a limb, risk my geek card and propose another alternative:
Windows Server 2012 has a deduplication feature which works atop of NTFS (not ReFS). Unlike "real" deduplication on the LVM level which you get with your EMC, the files are written to the filesystem fully "hydrated", and as time passes, a background task [1] sifts through the blocks, finds ones that are the same, then adds reparse points.
The reason I'm suggesting this is that if one already has a Windows file server, it might be good to slap on 2012 when it is available, configure deduplication on a dedicated storage volume, and let it do the dirty work on the block level for you.
Of course, ZFS is the most elegant solution, but it may not be the best in the application.
[1]: Fire up PowerShell and type in:
Start-DedupJob E: â"Type Optimization
if you want to do it in the foreground after setting it up, if you did a large copy and want to dedupe it all.
If you have 100 files all of one size, you'll have to do 4950 comparisons.
You only have to do 4950 comparisons if you have 100 unique files.
What I do is pop the first file from the list, to use as a standard, and compare all the files with it, block by block. If a block fails to match, I give up on that file matching the standard. The files that don't match generally don't go very far, and don't take much time. For the ones that match, I would have taken all that time if I was using a hash method anyway. As for reading the standard file multiple times: It goes fast because it's in cache.
The ones that match get taken from the list. Obviously I don't compare the one which match with each other. That would be stupid.
Then I go back to the list and rinse/repeat until there are less than 2 files.
I have done this many times with a set of 3 million files which take up about 600GB.
My other car is a 1984 Nark Avenger.