Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files?
First time accepted submitter jamiedolan writes "I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. Lots of files of all types, from digital photos & scans to HD video files (also web site backups mixed in, which are the cause of such a high number of files). In more recent times I've organized files in a reasonable folder system and have an active / automated backup system. The problem is that I know that I have many old files that have been duplicated multiple times across my drives (many from doing quick backups of important data to an external drive that later got consolidated onto a single larger drive), chewing up space. I tried running a free de-dup program, but it ran for a week straight and was still 'processing' when I finally gave up on it. I have a fast system, i7 2.8GHz with 16GB of RAM, but currently have 4.9TB of data with a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old sloppy filing (folder) system. I do need to keep the data; nuking it is not a viable option."
Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).
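A minimal Python sketch of the CRC-then-sort idea above, assuming everything lives under one root directory and that an in-memory list stands in for the sorted index file. The function names (crc32_of, find_duplicates) are made up for illustration.

```python
import filecmp
import os
import zlib
from itertools import groupby


def crc32_of(path, chunk_size=1 << 20):
    """Return the CRC32 of a file, reading in chunks to bound memory use."""
    crc = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical."""
    index = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index.append((crc32_of(path), path))
    index.sort()  # entries with the same CRC are now adjacent

    groups = []
    for _crc, entries in groupby(index, key=lambda e: e[0]):
        clusters = []  # each cluster holds files confirmed identical
        for _c, path in entries:
            for cluster in clusters:
                # Full byte-for-byte compare, since a CRC match is only a hint.
                if filecmp.cmp(cluster[0], path, shallow=False):
                    cluster.append(path)
                    break
            else:
                clusters.append([path])
        groups.extend(c for c in clusters if len(c) > 1)
    return groups
```

Note this still reads every file once for the CRC, so on 4.9TB the run time will be dominated by disk throughput rather than by the sort.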
Moderating "-1, Disagree" is simple censorship. Have the guts to post your opinion.
Very, VERY carefully.
as per subject.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Scan all simple file details (name, size, date, path) into a simple database. Sort on size, remove unique sized files. Decide on your criteria for identifying duplicates, whether it's by name or CRC, and then proceed to identify the dupes. Keep logs and stats.
My UID is prime!
If you can get them on a single filesystem (drive/partition), check out Duplicate and Same Files Searcher ( http://malich.ru/duplicate_searcher.aspx ) which will replace duplicates with hardlinks. I link to that and a few others (some specific to locating similar images) on my freeware site; http://missingbytes.net/ Good luck.
is not finding the same file, but when you have duplicate files associated with different applications. For example Program A and Program B both install a fonts directory with thousands of fonts most of which are identical.
Or if you install multiple copies of slightly different versions of the same OS ...
If you don't mind booting Linux (a live version will do), fdupes has been fast enough for my needs and has various options to help you when multiple collisions occur. For finding similar images with non-identical checksums, findimagedupes will work, although it's obviously much slower than a straight 1-to-1 checksum comparison.
YMMV
Use something like find to generate a rough "map" of where duplications are and then pull out duplicates from that. You can then work your way back up, merging as you go.
I've found that deja-dup works pretty well for this, but since it takes an md5sum of each file it can be slow on extremely large directory trees.
Delete all files but one. The remaining file is guaranteed unique!
if you really want, sort, order and index it all, but my suggestion would be different.
If you didn't need the files in the last 5 years, you'll probably never need them at all.
Maybe one or two. Make one volume called OldSh1t, index it, and forget about it again.
Really. Unless you have a very good reason to un-dupe everything, don't.
I have my share of old files and dupes. I know what you're talking about :)
Well, the sun is shining. If you need me, I'm outside.
Privacy is terrorism.
Since the objective is to recover disk space, the smallest couple of million files are unlikely to do very much for you at all. It's the big files that are the issue in most situations.
Compile a list of all your files, sorted by size. The ones that are the same size and the same name are probably the same file. If you're paranoid about duplicate file names and sizes (entirely plausible in some situations), then crc32 or byte-wise comparison can be done for reasonable or absolute certainty. Presumably at that point, to maintain integrity of any links to these files, you'll want to replace the files with hard links (not soft links!) so that you can later manually delete any of the "copies" without hurting all the other "copies". (There won't be separate copies, just hard links to one copy.)
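For the hard link step, a hedged Python sketch: it assumes both paths are on the same volume (hard links cannot cross volumes) and that the filesystem supports them, as NTFS does. The helper name and the ".linktmp" suffix are hypothetical. It defaults to a dry run so nothing is touched until you say so.

```python
import filecmp
import os


def replace_with_hardlink(keep, duplicate, dry_run=True):
    """Replace `duplicate` with a hard link to `keep` if the contents match.

    Returns True only when a link was actually created.
    """
    if os.path.samefile(keep, duplicate):
        return False  # already the same file, e.g. linked on an earlier run
    if not filecmp.cmp(keep, duplicate, shallow=False):
        raise ValueError("contents differ; refusing to link")
    if dry_run:
        print(f"would link {duplicate} -> {keep}")
        return False
    tmp = duplicate + ".linktmp"  # hypothetical temporary name
    os.link(keep, tmp)            # raises OSError across volumes
    os.replace(tmp, duplicate)    # swap the link into place over the copy
    return True
```

One caveat: after linking, all the names point at the same data, so deleting one name is harmless but editing the file through any name changes it for all of them.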
If you give up after a week, or even a day, at least you will have made progress on the most important stuff.
perhaps you could boot with a livecd and mount your windows drives under a single directory? Then:
find /your/mount/point -type f -exec sha256sum {} + > sums.out
sort sums.out | uniq -D -w 64
Put the disk on the built-in SATA bus, or use eSATA or even FireWire.
If nuking it isn't an option, it's valuable to you. There are programs that can delete duplicates, but if you want some tolerance to changes in file-name and age, they can get hard to trust. But with the price of drives these days, is it worth your time de-duping them?
First, copy everything to a NAS with new drives in it in RAID5. Store the old drives someplace safe (they may stop working if left off for too long, but it's better to have them if something does go wrong with the NAS, right?).
Then, copy everything current to your new backup drives on your computer, and automate the backup so that it only keeps two or three versions of files so you don't end up with this problem again. Keep track of things you want to archive and archive them separately.
An ounce of prevention is better than a pound of cure. We all get into backup and duplicate problems eventually. I have found keeping my core work in dropbox and making a backup of it occasionally provides enough measure of data backup for me, but the information I generate in the lab doesn't take up so much space.
Assuming fully sequential access, reading 5 TB of data at 100 MB/s takes 14 hours. With a mean file size of about 1 MB, you probably have a lot of tiny files and a few big files. The access will be far from sequential, so the access time will be many times greater. Don't expect it to be quick.
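The back-of-envelope numbers can be checked with a few lines of Python. The 4.9 TB and 4.2 million figures come from the submission and the 100 MB/s from the estimate above. The 10 ms per-file overhead is purely an assumed figure for seek plus open on a spinning disk.

```python
TOTAL_BYTES = 4.9e12    # 4.9 TB, from the submission
FILE_COUNT = 4.2e6      # 4.2 million files, from the submission
THROUGHPUT = 100e6      # 100 MB/s sequential read, as assumed above
SEEK_SECONDS = 0.010    # hypothetical 10 ms per file for seek + open

# Time to stream every byte once at full sequential speed.
sequential_hours = TOTAL_BYTES / THROUGHPUT / 3600

# Average file size in megabytes.
mean_file_mb = TOTAL_BYTES / FILE_COUNT / 1e6

# Extra time spent just getting to each file, under the assumed overhead.
seek_hours = FILE_COUNT * SEEK_SECONDS / 3600

if __name__ == "__main__":
    print(f"sequential read: {sequential_hours:.1f} h")
    print(f"mean file size:  {mean_file_mb:.2f} MB")
    print(f"per-file overhead: {seek_hours:.1f} h")
```

Under those assumptions the per-file overhead alone comes to nearly as much as the sequential read itself, which is why a full-content scan will not finish in an afternoon.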
I would probably cook some script together with Cygwin, md5sum and find, but if you have duplicated *directories*, you may have to get smarter. With a simple script (I may post one later if nobody else has a better idea), the end result would be a list of files with identical hashes, and you'd have to decide what to do about them. [I would actually use a filesystem with built-in deduplication, like ZFS, and failing that I would write a script to hard-link identical files. But it's kind of limited what you can do on Windows]
cd directory_with_files
md5sum * | sort
I wouldn't recommend using crc32 if you have a substantial amount of files or else you risk a collision (i.e. two different files that produce the exact same crc32).
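To put a rough number on that risk, here is a birthday-style estimate in Python. It assumes all files are distinct and that checksum values are uniformly distributed, which real data only approximates. The function name is made up.

```python
def expected_colliding_pairs(file_count, checksum_bits):
    """Expected number of pairs of *different* files sharing a checksum."""
    pairs = file_count * (file_count - 1) / 2
    return pairs / 2 ** checksum_bits


if __name__ == "__main__":
    n = 4_200_000  # file count from the submission
    print("CRC32 (32 bits): ", expected_colliding_pairs(n, 32))
    print("MD5  (128 bits): ", expected_colliding_pairs(n, 128))
```

With 4.2 million files that works out to roughly two thousand accidental CRC32 matches, against effectively zero for a 128-bit hash. If every CRC match is followed by a full compare, the collisions cost extra reads rather than wrong deletions.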
I recently had this problem and solved it with finddupe (http://www.sentex.net/~mwandel/finddupe/). It's a free command line tool. It can create hardlinks, you can tell it which is a master directory to keep and which directories to delete, and it can create a batch file to actually do the deletion if you don't trust it or just want to see what it will do. Highly recommend. In any case, 5 TB is going to take forever but with finddupe you can be sure your time is not wasted, unlike one of the free tools that analyzed my drive for 12 hours and then told me it would only fix ten duplicates.
As it is mostly about space, ignore the smaller files. For large files, the file size is already a pretty close approximation to a unique hash. First of all, create a database with size/path information and some extra fields where you will later add better hash sums and maybe note how far you got in processing.
Process files by decreasing size. If there are only two files of a particular size, compare them directly.
If there are more than two files of a particular size, get a better hash for each. (Choose a fast hashing algorithm that looks only at the first KB or so of the files.) After that, make the obvious comparisons to detect precise copies.
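A rough Python sketch of that size-first approach, with a plain dict standing in for the database. The function names, the min_size cutoff and the choice of SHA-256 over the first KB are all assumptions for illustration, not anybody's patented algorithm.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def group_by_size(root, min_size=1):
    """Map file size -> list of paths, ignoring files smaller than min_size."""
    sizes = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable entry or broken link: skip it
            if size >= min_size:
                sizes[size].append(path)
    return sizes


def head_hash(path, nbytes=1024):
    """Cheap hash that only looks at the first nbytes of the file."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read(nbytes)).hexdigest()


def split_identical(paths):
    """Split candidates into clusters of byte-identical files."""
    clusters = []
    for path in paths:
        for cluster in clusters:
            if filecmp.cmp(cluster[0], path, shallow=False):
                cluster.append(path)
                break
        else:
            clusters.append([path])
    return [c for c in clusters if len(c) > 1]


def duplicates_largest_first(root, min_size=1):
    """Return (size, [paths]) for each duplicate set, biggest files first."""
    sizes = group_by_size(root, min_size)
    result = []
    for size in sorted(sizes, reverse=True):
        paths = sizes[size]
        if len(paths) < 2:
            continue  # unique size, cannot be a duplicate
        if len(paths) == 2:
            buckets = [paths]  # only two candidates: compare them directly
        else:
            by_head = defaultdict(list)
            for path in paths:
                by_head[head_hash(path)].append(path)
            buckets = [b for b in by_head.values() if len(b) > 1]
        for bucket in buckets:
            for group in split_identical(bucket):
                result.append((size, group))
    return result
```

Because it works from the largest sizes down, you can stop it at any point and still have found the duplicates that waste the most space.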
I have some further ideas in case this is still not fast enough, but I am worried that I may have already pissed off enough people by reinventing key parts of their precious patented algorithms without mentioning them.
You don't say what your desired outcome is.
If this was my data I would proceed as this:
There will be a lot of manual cleanup, I think.
The problem with a lot of file duplication tools is that they only consider files individually and not their location or the type of file. Often we have a lot of rules about what we'd like to keep and delete - such as keeping an mp3 in an album folder but deleting the one from the 'random mp3s' folder, or always keeping duplicate DLL files to avoid breaking backups of certain programs.
With a large and varied enough collection of files it would take more time to automate that than you would want to spend. There are a couple of options though:
You could get some software to replace duplicate files with hard links. This will save you space but not make things any neater - DupeMerge looks like it would do it on NTFS but I haven't tried it myself.
Another alternative would be to move your data to a file system that has built in de-duplication such as ZFS and let that handle everything.
Finally when I was looking at this myself what I found was that the problem was not individual duplicate files but that certain trees of files occurred identically in multiple places (adhoc backups of systems were a big culprit here). What you could do with but which I couldn't find and didn't get round to finishing writing was something that would CRC not individual files but entire trees of files/folders and report back the matches. If something does already exist to do that I'd be quite interested myself.
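One way the tree-level check might look, sketched in Python. Each directory gets a digest built from the names and digests of everything below it, so two directories with the same digest hold identical trees. SHA-256 is swapped in for CRC here, and the function names are hypothetical. Symlinked directories are not followed, and empty directories will all match each other.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map every directory under root to a digest of its whole subtree."""
    digests = {}
    # Bottom-up walk, so child directories are digested before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        h = hashlib.sha256()
        for name in sorted(filenames):
            content = file_digest(os.path.join(dirpath, name))
            h.update(b"F\0" + os.fsencode(name) + b"\0" + content.encode())
        for name in sorted(dirnames):
            child = os.path.join(dirpath, name)
            # Directories that were not walked (symlinks) get an empty digest.
            h.update(b"D\0" + os.fsencode(name) + b"\0"
                     + digests.get(child, "").encode())
        digests[dirpath] = h.hexdigest()
    return digests


def identical_trees(root):
    """Return groups of directories whose subtrees are identical."""
    by_digest = defaultdict(list)
    for path, digest in tree_digests(root).items():
        by_digest[digest].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

When two big trees match, their subdirectories are reported as matching too, so it makes sense to look at the shallowest matches first.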
My crystal ball tells me:
At some point Btrfs will be standard in most linux distributions. Some time later deduplication will be developed to be used for the layman. (Planned features, wikipedia: http://en.wikipedia.org/wiki/Btrfs#Features )
1.) Wait it out until we are there. ...
2.) Get a NAS box using Btrfs
3.) transfer everything
5.) PROFIT (for the people building the NAS).
Don't do it. You're on a fool's errand. Old files are so much smaller than new files that you're not wasting very much space. Now as you go through it all manually, you will find some of the duplicates. You can create symbolic links (supported in Win7) among duplicates as you encounter them. File positions in the directory tree are important information. e.g. the same image crookedtree.jpg may be duplicated between trips\2007\June\Smoky Mountains and trees\best\maple. It has meaning in both places. You will encounter whole directories that can simply be deleted because they are old backups, and you can verify this with tools like the simpleminded windiff or whatever you use instead.
You have done an excellent job of gathering it all together, and you should be proud of that. I'll do that "someday". Don't beat yourself up about what may only be a single-digit percentage of waste from duplication. Don't be the geezer who spends his whole retirement sorting his slides only to die and have them all tossed in the landfill.
I use the free command line tool dupemerge.exe to do file level dedupe on ntfs and I have found it to be pretty fast with lots of options.
See http://schinagl.priv.at/nt/dupemerge/dupemerge.html for full details.
"Introduction
Most hard disks contain quite a lot of completely identical files, which consume a lot of disk space. This waste of space can be drastically reduced by using the NTFS file system hardlink functionality to link the identical files ("dupes") together.
Dupemerge searches for identical files on a logical drive and creates hardlinks among those files, thus saving lots of hard disk space.
Backgrounders
Dupemerge creates a cryptological hashsum for each file found below the given paths and compares those hashes to each other to find the dupes. There is no file date comparison involved in detecting dupes, only the size and content of the files.
To speed up comparison, only files with the same size get compared to each other. Furthermore the hashsums for equal sized files get calculated incrementally, which means that during the first pass only the first 4 kilobyte are hashed and compared, and during the next rounds more and more data are hashed and compared.
Due to long run time on large disks, a file which has already been hashsummed might change before all dupes to that file are found. To prevent false hardlink creation due to intermediate changes, dupemerge saves the file write time of a file when it hashsums the file and checks back if this time changed when it tries to hardlink dupes.
If dupemerge is run once, hardlinks among identical files are created. To save time during a second run on the same locations, dupemerge checks if a file is already a hardlink, and tries to find the other hardlinks by comparing the unique NTFS file-id. This saves a lot of time, because checksums for large files need not be created twice.
Dupemerge has a dupe-find algorithm which is tuned to perform especially well on large server disks, where it has been tested in depth to guarantee data integrity."
I was just looking at this for a much smaller pile of data (around 300GB) and came across this http://ldiracdelta.blogspot.com/2012/01/detect-duplicate-files-in-linux-or.html
Recently had this situation.
Nirsoft's free "SearchMyFiles" http://www.nirsoft.net/utils/search_my_files.html has a straightforward Find Duplicates mode which helped a lot. It is easy (the most "complex" is designating the base locations for searches as e.g. K:\;L:\;P:\;Q:\), fast, never crashed on me, and had only cosmetic issues ("del" key not working). I recommend running it with administrative privileges so that it does not miss files.
I had to do that with an itunes library recently. Nowhere near the number of items you're working with, but same principle - watch your O's. (that's the first time I've had to deal with a 58mb XML file!) After the initial run forecasting 48 hrs and not being highly reliable, I dug in and optimized. A few hours later I had a program that would run in 48 seconds. When you're dealing with data sets of that size, process optimizing really can matter that much. (if it's taking too long, you're almost certainly doing it wrong)
The library I had to work with had an issue with songs being in the library multiple times, under different names, and that ended up meaning there was NOTHING unique about the songs short of the checksums. To make matters WORSE, I was doing this offline. (I did not have access to the music files which were on the customer's hard drives, all seven of them)
It sounds like you are also dealing with differing filenames. I was able to figure out a unique hashing system based on the metadata I had in the library file. If you can't do that, and I suspect you don't have any similar information to work with, you will need to do some thinking. Checksumming all the files is probably unnecessarily wasteful. Files that aren't the same size don't need to be checksummed. You may decide to consider files with the same size AND same creation and/or modification dates to be identical. That will reduce the number of files you need to checksum by several orders of magnitude. A file key may be "filesize:checksum", where unique filesizes just have a 0 for the checksum.
Write your program in two separate phases. First phase is to gather checksums where needed. Make sure the program is resumable. It may take awhile. It should store a table somehow that can be read by the 2nd program. The table should include full pathname and checksum. For files that did not require checksumming, simply leave it zero.
Phase 2 should load the table, and create a collection from it. Use a language that supports it natively. (realbasic does, and is very fast and mac/win/lin targetable) For each item, do a collection lookup. Collections store a single arbitrary object (pathname) via a key. (checksum) If the collection (key) doesn't exist, it will create a new collection entry with that as its only object. if it already exists, the object is appended to the array for that collection. That's the actual deduping process, and will be done in a few seconds. Dictionaries and collections kick ass for deduping.
From here you'll have to decide what you want to do.... delete, move, whatever. Duplicate songs required consolidation of playlists when removing dups for example. Simply walk the collection, looking for items with more than one object in the collection. Decide what to keep and what to do elsewise with (delete?) I recommend dry-running it and looking at what it's going to do before letting it start blowing things away.
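Phase 2 is only a few lines in any language with a built-in dictionary. A Python sketch, assuming phase 1 produced rows of (key, pathname) with keys in the "filesize:checksum" form described above. Keeping the first path in each set is an arbitrary choice made here for illustration.

```python
from collections import defaultdict


def build_collection(table):
    """Group pathnames by key; table is an iterable of (key, pathname) rows."""
    collection = defaultdict(list)
    for key, pathname in table:
        collection[key].append(pathname)
    return collection


def dry_run(collection):
    """Yield (action, pathname) pairs without touching the filesystem."""
    for _key, paths in collection.items():
        if len(paths) > 1:
            yield ("keep", paths[0])  # arbitrary: the first one seen wins
            for path in paths[1:]:
                yield ("remove", path)


if __name__ == "__main__":
    rows = [
        ("1048576:9e107d9d", "D:/music/a.mp3"),
        ("1048576:9e107d9d", "E:/old/a copy.mp3"),
        ("2048:0", "D:/notes.txt"),  # unique size, never checksummed
    ]
    for action, path in dry_run(build_collection(rows)):
        print(action, path)
```

Each lookup is constant time on average, so even millions of rows group in seconds once the checksums exist.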
It will take 30-60 min to code probably. The checksum part may take awhile to run. Assuming you don't have a ton of files that are the same size (database chunks, etc) the checksumming shouldn't be too bad. The actual processing afterward will be relatively instantaneous. Use whatever checksumming method you can find that works fastest.
The checksumming part can be further optimized by doing it in two phases, depending on file sizes. If you have a lot of files that are large-ish (>20mb) that will be the same size, try checksumming in two steps. Checksum the first 1mb of the file. If they differ, ok, they're different. If they're the same, ok then checksum the entire file. I don't know what your data set is like so this may or may not speed things up for you.
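A sketch of the two-step checksum in Python, assuming the input is a list of paths already known to be the same size. The function names and the use of SHA-256 are illustrative. The prefix length is a parameter and defaults to the 1mb suggested above.

```python
import hashlib
from collections import defaultdict


def digest(path, limit=None, chunk_size=1 << 16):
    """Hash the first `limit` bytes of path, or the whole file if limit is None."""
    h = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as fh:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = fh.read(want)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()


def two_step_groups(same_size_paths, prefix_bytes=1 << 20):
    """Group same-size files: cheap prefix hash first, full hash only on ties."""
    by_prefix = defaultdict(list)
    for path in same_size_paths:
        by_prefix[digest(path, prefix_bytes)].append(path)

    groups = []
    for candidates in by_prefix.values():
        if len(candidates) < 2:
            continue  # prefix is unique, so the file cannot be a duplicate
        by_full = defaultdict(list)
        for path in candidates:
            by_full[digest(path)].append(path)
        groups.extend(g for g in by_full.values() if len(g) > 1)
    return groups
```

The saving depends on the data: files that share a long common header (some video and database formats) will fall through to the full hash anyway.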
I work for the Department of Redundancy Department.
After you have found the "equal files", you need to decide which one to erase and which ones to keep. For example, let's say that a gif file is part of a web site and is also present in a few other places because you backed it up to removable media which later got consolidated. If you chose to erase the copy that is part of the website structure, the website will stop working.
Lucky for you, most filesystem implementations nowadays can create links, both hard and soft (in Windows, that would be NTFS symbolic links since Vista and junction points since Win2K; in *nix, the soft and hard links we know and love; and on the Mac, the engineers added hard links to whole directories). So the solution must not only identify which files are the same, but also keep one copy while preserving accessibility; this is what makes apple (r)(c)(tm) work so well. You will need a script that, upon identifying equal files, erases all but one and creates symlinks from all the erased ones to the surviving one.
*** Suerte a todos y Feliz dia!
I'm going through this same thing. New master PC, and trying to consolidate 8 zillion files and copies of files from the last decade or so.
If you're like me, you copied folders or trees, instead of individual files. FreeFileSync will show you which files are different between two folders.
Grab two folders you think are pretty close. Compare. Then Sync. This copies dissimilar files in both directions. Now you have two identical folders/files. Delete one of the folders. Wash, rinse, repeat.
Time consuming, but it works for me.
FreeFileSync at sourceforge.
Your problem isn't unduping files in your archives, your problem is getting an overview of your data archives. If you'd have it, you wouldn't have dupes in the first place.
This is a larger personal project, but you should take it on, since it will be a good lesson in data organisation. I've been there and done that.
You should get a rough overview of what you're looking at and where to expect large sets of dupes. Do this by manually parsing your archives in broad strokes. If you want to automate dupe-removal, do so by de-duping smaller chunks of your archive. You will need extra CPU and storage - maybe borrow a box or two from friends and set up a batch of scripts you can run from Linux live CDs with external HDDs attached.
Most likely you will have to do some scripting or programming, and you will have to devise a strategy not only of dupe removal, but of merging the remaining skeletons of dirtrees. That's actually the tough part. Removing dupes takes raw processing power and can be done in a few weeks with brute force and solid storage bandwidth.
Organising the remaining stuff is where the real fun begins. ... You should start thinking about what you are willing to invest and how your backup, versioning and archiving strategy should look in the end, data/backup/archive retrieval included. The latter might even determine how you go about doing your dirtree diffs - maybe you want to use a database for that for later use.
Any way you put it, just setting up a box in the corner and having a piece of software churn away for a few days, weeks or months won't solve your problem in the end. If you plan well, it will get you started, but that's the most you can expect.
As I say: Been there, done that.
I still have unfinished business in my backup/archiving strategy and setup, but the setup now is 2 1TB external USB3 drives and manual arsync sessions every 10 weeks or so to copy from HDD-1 to HDD-2 to have dual backups/archives. It's quite simple now, but it was a long hard way to clean up the mess of the last 10 years. And I actually was quite conservative about keeping my boxes tidy. I'm still missing external storage in my setup, aka Cloud-Storage, the 2012 buzzword for that, but it will be much easier for me to extend to that, now that I've cleaned up my shit halfway.
Good luck, get started now, work in iterations, and don't be silly and expect this project to be over in less than half a year.
My 2 cents.
We suffer more in our imagination than in reality. - Seneca
Can it more surreal?
“He’s not deformed, he’s just drunk!”
So what if you have many dup's ? Keep all on disk and know that you will have it on hand in the very unlikely event that you'll need something from five years ago. Spend $300 on a few more disks and get on with your life. Perfection is the enemy of the good.
If it were me, I would use the file size to identify which were likely duplicates. Less reliable than hashing, but much faster. Using PowerShell:
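Get-ChildItem D:\MyData -Recurse | Where-Object { -not $_.PSIsContainer } | Select-Object FullName, Length, Extension | Export-Csv mydata.csv -NoTypeInformation
$objData = Import-Csv mydata.csv
$objData | Sort-Object { [int64]$_.Length } | Export-Csv mydata_sorted.csv -NoTypeInformation
$objSortedData = Import-Csv mydata_sorted.csv
$objUniqueSortedData = $objSortedData | Sort-Object { [int64]$_.Length } -Unique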
Then loop through comparing both sets of data, comparing file extension for those files of the same size. Do a few test runs until you're confident and then run with Remove-Item -Confirm:$false.
For this purpose I'm using a wonderful perl script, fdupes.pl. I've tested it on many millions files, many terabytes filesystems and it works fine. I've found the original on perlmonks.org, but modified it to 1 skip symbolic links (a symlink is obviously identical to its target) 2 auto-delete dupes (after confirmation). For anyone interested, find the script here: http://pastebin.com/cMFbBjt9
Delete the dupes, but be sure to make copies first.
Windows? Duplic8, it handles 16 million files, does size and then binary comparison and processes this as fast as your medium can handle. It's also got a nice useful delete wizard that helps you select the ones to kill. I recommend it highly. http://www.kewlit.com/duplic8/
BackupPC does deduplication. So if you take a backup from all your filesystems with BackupPC, you have identical files stored only once. BackupPC uses hard links to do the deduplication, so another copy of a file only takes a directory entry. You can then discard your current backups, if need be.
http://backuppc.sourceforge.net/
I found a python script online and hacked it a bit to work on a larger scale.
The script originally scanned a directory, found files with same size, and md5'ed them for comparison.
Among other things I added option to ignore files under a certain size, and to cache md5 in a sqlite db. I also think I did some changes to the script to handle large number of files better, and do more effective md5 (also added option to limit number of bytes to md5, but that didn't make much difference in performance for some reason). I also added option to hard link files that are the same.
With inodes in memory, and sqlite db already built, it takes about 1 second to "scan" 6TB of data. First scan will probably take a while, tho.
Script here - It's only tested on Linux.
Even if it's not perfect, it might be a good starting point :)
It's The Golden Rule: "He who has the gold makes the rules."
Worry? Multiple different resolutions serve a purpose - different resolution playback devices.
Ken
Write a simple script or program to create an md5 hash for each file and put the hash, along with the file path, in a database or flat file. Then, for each entry in the list, check the rest of the list (after that entry) for duplicate hashes. This will take several minutes to crunch through, but not days or weeks.
The problem started with a complete lack of discipline. I had numerous systems over the years and never really thought I needed to bother with any tracking or control system to manage my home data. I kept way too many minor revisions of the same file, often forking them over different systems. As time passed and I rebuilt systems, I could no longer remember where all the critical stuff was so I'd create tar or zip archives over huge swaths of the file system just in case. I eventually decided to clean up like you are now when I had over 11 million files. I am down to less than half a million now. While I know there are still effective duplicates, at least the size is what I consider manageable. For the stuff from my past, I think this is all I can hope for; however, I've now learned the importance of organization, documentation and version control so I don't have this problem again in the future.
Before even starting to de-duplicate, I recommend organizing your files in a consistent folder structure. Download MediaWiki and start a wiki documenting what you're doing with your systems. The more notes you make, the easier it will be to reconstruct work you've done as time passes. Do this for your other day to day work as well. Get git and start using it for all your code and scripts. Let git manage the history and set it up to automatically duplicate changes on at least one other backup system. Use rsync to do likewise on your new directory structure. Force yourself to stop making any change you consider worth keeping outside of these areas. If you take these steps, you'll likely not have this problem again, at least on the same scope. You'll also find it a heck of a lot easier to decommission or rebuild home systems and you won't have to worry about "saving" data if one of them craps out.
It does the job for me, the selection assistant is quite powerful.
http://www.digitalvolcano.co.uk/content/duplicate-cleaner
Fast, but the old version (2.0) was better and freeware if you can still find a copy of it.
I have too many, due to simply being a messy pig and pedantic with files.
The best tool I've found is called Duplicate Cleaner - it's from Digital Volcano.
I do not work for / am not affiliated with these people.
I've used many tools over the years, DFL, Duplic8 and "Duplicate Files Finder" - one of which had a shitty bug which matched non identical files.
Duplicate Cleaner's algorithm is good and the UI, while not perfect, is one of the better ones at presenting the data. Especially identifying entire branches / directories being binarily (word?) identical.
Yes it takes a while, that's what minimising applications is for, do you want a TRUE representation of genuinely identical files, or not?
It's only 5TB. Why dedupe? Just buy another HDD or two. How much is your time worth anyway?
:). That would be my priority, not eliminating dupes.
You say the data is important enough that you don't want to nuke it. Wouldn't it be also true to say that the data that you've taken the trouble to copy more than once is likely to be important? So keep those dupes.
To me not being able to find stuff (including being aware of stuff in the first place) would be a bigger problem
As many others have stated, use a tool that computes a hash of file contents. Coincidentally, I wrote one last week to do exactly this when I was organizing my music folder. It'll interactively prompt you for which file to keep among the duplicates once it's finished scanning. It churns through about 30 GB of data in roughly 5 minutes. Not sure if it will scale to 4.2 million files, but it's worth a try!
Higher Logics: where programming meets science.
Anyway...
Upward mobility is a slippery slope - the higher you climb the more you show your ass.
There is a digital preservation tool called DROID (Digital Record Object Identification) which scans all the files you ask it to, identifying their file type. It can also optionally generate an MD5 hash of each file it scans. It's available for download from sourceforge (BSD license, requires Java 6, update 10 or higher).
http://sourceforge.net/projects/droid/
It has a fairly nice GUI (for Java, anyway!), and a command line if you prefer scripting your scan. Once you have scanned all your files (with MD5 hash), export the results into a CSV file. If you like, you can first also define filters to exclude files you're not interested in (e.g. small files could be filtered out). Then import the CSV file into your data analysis app or database of your choice, and look for duplicate MD5 hashes. Alternatively, DROID actually stores its results in an Apache Derby database, so you could just connect directly to that rather than export to CSV, if you have a tool that can work with Derby.
One of the nice things about DROID when working over large datasets is you can save the progress at any time, and resume scanning later on. It was built to scan very large government datastores (multiple TB). It has been tested over several million files (this can take a week or two to process, but as I say, you can pause at any time, save or restore, although only from the GUI, not the command line).
Disclaimer: I was responsible for the DROID 4, 5 and 6 projects while working at the UK National Archives. They are about to release an update to it (6.1 I think), but it's not available just yet.
So your de-dupe ran for a week before you cut it out? On a modern CPU, the de-dupe is limited not by the CPU speed (since deduplication basically just checksums blocks of storage), but by the speed of the drives.
What you need to do is put all this data onto a single RAID10 array with high IO performance. 5TB of data, plus room to grow on a RAID10 with decent IOPS would probably be something like 6 3TB SATA drives on a new array controller. Set up the array with a large stripe size to prioritize reads (writes are going to be 'fast enough' on a RAID10, trust me). Once you have that hooked-up with your files copied onto it, you want to connect the drive to an OS that can natively deduplicate, like Windows Server 2012. If you must, you can set this box up as a storage server (with a low-end CPU, an old 'Core 2' should be able to keep up with 180MB/sec I/O), and keep your workstation separate. Reading this entire array (when full) through the CPU -should- take about 6-10 hours, deduplication will take slightly longer.
If you don't want to do deduplication at the block level, and you want to actually only have one copy of each duplicated file, you'll need to write scripts that do something like this:
1. Run through the data store and checksum each file (except for those ending in ".mychecksum") with AES128.
2. For each file, create an empty file named "….mychecksum" next to it. This will create the 'index' using the filesystem, which will be MUCH faster than having to read the data from inside each file.
3. Search through the store and concatenate all the ".mychecksum" files into a single CSV.
4. Run sed+unique on the file to see what will be nixed (i.e. Get a report)
5. Create another script that actually takes the output from step 4 and deletes ONE of the duplicate files. You can test by -renaming- ONE of the files to "….deleteme" and then deleting all those files after you confirm that it worked.
6. Repeat as necessary, possibly with a scheduled job.
"Sometimes, I think Trent just needs a cup of hot chocolate and a blankie." -Tori Amos on Nine Inch Nails
Only hash the first 4K of each file and just do them all. The size check will save a hash only for files with unique sizes, and I think there won't be many with 4.2M media files averaging ~1MB. The second near-full directory scan won't be all that cheap.
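For anyone who wants to try the "first 4K" idea, here is a minimal Python sketch. The function names and the choice of MD5 are my own assumptions, not something from the post above. It also puts the file size into the key, which costs nothing extra because the size comes from the same directory walk.

```python
import hashlib
import os
from collections import defaultdict

def head_hash(path, nbytes=4096):
    """Return the MD5 hex digest of the first nbytes of a file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()

def candidate_groups(root, nbytes=4096):
    """Group files under root by (size, hash of the first nbytes).

    Only groups with more than one member are returned. These are
    candidates, not confirmed duplicates.
    """
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                key = (os.path.getsize(path), head_hash(path, nbytes))
            except OSError:
                continue  # unreadable file: skip it
            groups[key].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Files that share a size and the same first 4K can still differ later on, so every group this returns needs a full hash or a byte-for-byte comparison before anything is deleted.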
At a superficial level, the issue would seem to be quite hard, but with a little planning it shouldn't be *that* hard.
My path would be to go out and build a new file server running either Windows Server or Linux, based on what OS your current file server uses, install the de-dupe tool of your choice from the many listed above, and migrate your entire file structure from your current box to the new box - the de-dupe tools will work their magic as the files trip in over the network connection. Once de-duped, your old file server can be rebuilt with the same de-dupe tool, and the files migrated back to it for use going forward if desired, with the two large drives used as an online backup.
The temporary de-dupe box can be fairly simple with nothing more than a fairly robust CPU, two 2 or 3 TB drives and a gigabit NIC. You won't even need to buy an OS license if you are running Windows, as you can just use a trial copy of Windows Server.
Ken
This gives an sha256sum list of all files assuming you are in linux and writing it to list.sha256 in the base of your home folder:
You may replace sha256sum with another checksum routine if you want, such as sha512sum, md5sum, sha1sum, or other preference.
now sort the file:
(Notice that this creates a sorted list according to the sha256 value but with the path to the file as well. Assuming you would want to manually check some lines, this might be helpful, but if you only want the machine to check, there is really no need to include the file and path data in the output, which gives a much smaller duplicate list file.)
without paths the command could be something like
You could now find duplicates by doing one of the following:
or in the first case
Now with the list of duplicates come the important question... Does meta data of the files such as in which path it is, date and time, file permissions etc matter to you?
Regardless I would usually recommend doing a binary comparison of the files as well to fully ensure the files are the same, before merging...
The quick and dirty removal of duplicates would be
If wanting to preserve meta data, then the best way might be to use hard links to the original maintaining setting the hardlink to date and time of duplicate.
Do note that I did not test any of these commands and I might have missed something that make these commands eat important data too... Check on something unimportant before trying!
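The same sequence (checksum every file, sort by checksum, report checksums that occur more than once) can be sketched in Python. This is not the shell pipeline from the post, just a rough equivalent; the function names are made up for illustration and nothing here deletes anything.

```python
import hashlib
import os
from itertools import groupby

def sha256_of(path, chunk=1 << 20):
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def checksum_list(root):
    """Return a list of (digest, path) rows for every file under root, sorted by digest."""
    rows = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                rows.append((sha256_of(path), path))
            except OSError:
                pass  # unreadable file: leave it out of the list
    rows.sort()
    return rows

def duplicate_sets(rows):
    """Given sorted (digest, path) rows, return lists of paths sharing a digest."""
    out = []
    for _digest, grp in groupby(rows, key=lambda r: r[0]):
        paths = [p for _d, p in grp]
        if len(paths) > 1:
            out.append(paths)
    return out
```

Swapping hashlib.sha256 for hashlib.md5 or hashlib.sha512 changes the checksum routine without touching the rest.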
I recently ran into the same problem you are having, just on a lesser scale. The program I had the best success with was Auslogics Duplicate File Finder.
It includes two options that I absolutely needed: Ignore File Names and Ignore File Dates. I'm pretty sure those were off by default, so check that if you try this. Considering I knew that I had done the same thing you are describing, and even renamed some files in the process, I really needed those options otherwise it would not have found the dupes.
I was paranoid there would be false positives, so I did a quick test on a select few folders. It worked perfectly, so I let it run... then blindly allowed it to do its thing in the last step and delete duplicates. I found about 600GB of dupes out of 3TB, and it took less than a few hours to run.
there's a program called backuppc which does the job very very effectively, even across multiple systems. [note: do not imagine for one second that the god you call windows has all the answers]. run yourself a 2nd system, even if it's a virtual machine, install debian gnu/linux in it and then run and configure backuppc.
backuppc uses MD5/SHA checksums to identify files, such that it stores only *one* copy of any given file. this occurs entirely automatically. given the size of the task you can expect it to take some considerable time, however even if it is interrupted the backup process can be restarted and it will happily chunder on from where it left off.
if you want to, backuppc can create "snapshots" for you. however given the sheer number of files i would not recommend you enable that feature unless a) absolutely necessary b) you've at least made one complete backup of the files!
realistically, you should have been running backuppc or something like it for some considerable number of years, now. backuppc and systems like it can do "incremental" backups very very efficiently... but the first time you ever run it is absolute hell.... well, that's going to be the case regardless of what system you use: you'll just have to bite the bullet.
So, ran into a similar problem ages ago, and I wrote a python script to handle it. If you can't follow some rather dense python, this won't be for you.
https://github.com/scooby/fdb
It's mostly the 'fdb' script, there's some other cruft in there.
My approach stores the filesystem data in a sqlite database. It's not fast, but it is reasonably recoverable, which wound up being the most important aspect. The traditional Unix convoluted pipeline approach simply doesn't scale much past 100,000 files, in my experience.
It does actually understand inodes, in fact, it is pretty much a relational model of an inode based file system. The usage model is basically: read a portion of a file system in to the database. Update unhashed inodes. Hard link identical inodes.
The catch is that I also wanted it to work over time, so I wanted a permanent volume identifier for devices, users, etc, which makes it a bit OS X centric. I don't think there's any reason it wouldn't port relatively easily to Linux: you just need to use the Linux way of looking up system information. Basically, POSIX doesn't guarantee much about device ids, uids or gids beyond "it's not going to change while the process is running," and there's no standard way to obtain a UUID.
Also, if you *do* have multiple devices, it will try to hash them on separate threads. This won't work so well if the multiple devices are simply separate partitions :-(
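To illustrate why a database makes the job recoverable, here is a minimal sketch using Python's sqlite3 module. To be clear, this is not the fdb code: the table layout and function names are assumptions made up for this example, and it tracks paths rather than inodes. The point is only that hashing can stop at any time and pick up where it left off.

```python
import hashlib
import os
import sqlite3

def open_db(db_path):
    """Open (or create) the database. The schema here is hypothetical."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size INTEGER, digest TEXT)"
    )
    return db

def scan(db, root):
    """Record every file under root; rows that already exist are left alone."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue
            db.execute(
                "INSERT OR IGNORE INTO files (path, size) VALUES (?, ?)",
                (path, size),
            )
    db.commit()

def hash_pending(db, limit=1000):
    """Hash up to limit files that have no digest yet; return how many rows were handled."""
    rows = db.execute(
        "SELECT path FROM files WHERE digest IS NULL LIMIT ?", (limit,)
    ).fetchall()
    for (path,) in rows:
        h = hashlib.md5()
        try:
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
        except OSError:
            # File vanished or is unreadable: drop it from the table.
            db.execute("DELETE FROM files WHERE path = ?", (path,))
            continue
        db.execute(
            "UPDATE files SET digest = ? WHERE path = ?", (h.hexdigest(), path)
        )
    db.commit()
    return len(rows)

def duplicates(db):
    """Return (digest, count) for every digest that occurs more than once."""
    return db.execute(
        "SELECT digest, COUNT(*) FROM files WHERE digest IS NOT NULL "
        "GROUP BY digest HAVING COUNT(*) > 1"
    ).fetchall()
```

Calling hash_pending in a loop until it returns 0 finishes the job; killing the process in between loses at most one batch of work.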
...it will have cost you far more than simply buying another drive(s) if all you are really concerned about is space...
Let me RTFA for you:
How Do I De-Dupe a System With 4.2 Million Files? ...chewing up space.
I have many old files that have been duplicated multiple times across my drives
I do need to keep the data, nuking it is not a viable option
Your solution is?
There is no reason to use a crypto-strength hash. This will simply be slower. MD5 should be perfectly fine - it outputs a 128 bit hash, which is more than enough to avoid accidental collisions, and it's fast. You could match on the size as well as the hash, if you really really think you might have a hash match on different content, but it's probably not necessary.
It is true that if you're trying to avoid *intentionally malicious* collisions, you should never use MD5 as it's badly broken for that use - but not for detecting duplicate content. You're correct to avoid using CRC - but that's not a hash algorithm, it's a checksum algorithm. Accidental collisions with that algorithm will be very frequent.
The names of files should never be used to distinguish them. Files are often renamed by applications or during normal work by users. In any case, if you already have a hash match, then why do you care if the names are different? The content is already overwhelmingly likely to be identical. If you're really paranoid, then do a byte comparison of those files.
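A small Python sketch of that check order (size, then MD5, then a byte comparison for the paranoid), ignoring file names entirely. The function names are mine; filecmp.cmp with shallow=False is the standard-library call that compares actual contents.

```python
import filecmp
import hashlib
import os

def md5_of(path, chunk=1 << 20):
    """MD5 of a whole file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def same_content(a, b, paranoid=True):
    """True if a and b hold the same bytes. File names play no part."""
    if os.path.getsize(a) != os.path.getsize(b):
        return False  # cheapest test first
    if md5_of(a) != md5_of(b):
        return False
    if paranoid:
        # shallow=False forces a real content comparison
        return filecmp.cmp(a, b, shallow=False)
    return True
```

With paranoid=False the function trusts the size and hash match alone, which is the "probably not necessary" position argued above.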
Then simply make another copy when it is needed (BTW, what files are meant to exist in duplicates?)
#1 You must develop a naming and storage system that fits your data needs. This means amount of redundancy in case a hard drive dies or burns in a fire, directory name hierarchy, file and directory naming conventions.
#2 You must resolve to spend the time to sort and name new things correctly from now on.
#3 You must decide how much time to spend on your current data.
I assume that what you have is all on external USB drives, but the same issues arise with internal drives or firewire.
Even if you had a magic program that would instantly allow you to delete all dups, you would be left with mismatched directories. Let's say that you have 3 copies of an install directory for a free game. Randomly deleting 1 file from here and 2 from there will leave different directories that are all incomplete. Even smart programs fail at this when the difference is just a few files like might happen between two versions of the same game.
Are the files / directories at least named something reasonable? If not, then you must do one of three things:
- Go through each directory by hand, examine each file to determine what it actually is, and name files and directories reasonably. While you are at it create a hierarchy of directory names to organise the mess. Millions of files will take 10 years (full time) to sort unless there are significant time savings like a game directory with 10 games in it, each game having 10,000 correctly named files.
- Leave things as they are. You don't even know what you have when you have that many files. You will never be able to use much of what you have because you don't know what you have, where it is, or what it is named.
- Give up and start over, doing things correctly. That could mean to get the biggest bang for the buck with the data that you have by extracting the important items that you know about and anything that is easy to understand. Leave the rest in a "junk" directory.
Of course you can mix and match...
There exists an external USB drive chip set that gives single bit errors about once per 100 gigabytes of read access (no errors on write). The one I have plugs into a bare drive and makes it talk like a USB drive. If you copied files from one USB drive to another, the second copy may have errors. I have this problem. I finally tracked down the source and fixed it, but I was left with multiple copies of files with single bit errors. I wrote some scripts to help me do md5sum (checksums) on everything. I then went to the oldest copy (you have to examine all of the file timestamps, not just the usual one) for file types that could not tell me they were corrupted (e.g. a zip file can tell you it is corrupted).
The manufacturer of this chip had a buggy chip and a reference design that everyone used. Find a big file on a drive and run continuous checksums on the one file (WITH A DISK CACHE FLUSH EACH TIME). If you have the same checksum for 4 hours, you don't have this problem. Here is the Linux USB ID of the bad controller.
Bus 002 Device 003: ID 152d:2338 JMicron Technology Corp. / JMicron USA Technology Corp. JM20337 Hi-Speed USB to SATA & PATA Combo Bridge
It is easiest if you have all drives attached at the same time, but this may not be possible for you depending on how many drives you have and the number of USB ports you have free, consider buying a USB hub so you can plug them all in at once to run fslint.
For your case, I would do things in this order:
- decide how to divide the data up and how to name it
- start organizing it in a "bang for the buck" fashion. You will probably never get finished, but there will come a time... Do this first so the checksumming step will print names that you understand.
- run some sort of checksumming utility on all the files. It will take a long time. Let the computer do the work (checksumming) rather than you doing the work to write programs that look first at file size, and time stamp, then the first and last 1k of the file, etc. I have used fslint under
There's a FUSE-based file system called LessFS capable of performing block-level deduplication. The project is actively maintained and looks like worth a shot. For more information, check its webpage at http://www.lessfs.com
All the methods suggested so far assume that identical files are bitwise identical. That’s a false assumption.
Consider an mp3 file. Add ID3 tags. Add ID3v2 tags. Re-encode to ogg. Now you have four files that have (almost) the same content but are bitwise different.
Consider a raw photo with EXIF tags. Convert it to jpeg, preserving the tags. Strip the tags. Resize to a web-friendly resolution. Now, you have 4 files which are bitwise different, but contain roughly the same image. (JPEG is a bit lossy and the downscaled version is *quite* lossy, but still.)
Consider a C++ program in source form. Build it, producing a binary and a bunch of intermediate files.
If you wanted to perfectly deduplicate this collection, you’d have to invent software that can detect all this non-bitwise duplicity.
I don't know why people are recommending you use "find", "awk", "grep" etc when you clearly stated that this is a Windows 7 environment. In any case a quick VB Net program I just created processed 43,800 files in 10 seconds. It would be faster but you must catch "Access Denied" errors for Folders like "System Volume Information" - extrapolating tells me 4 million files at 4000/sec means 15+ minutes to create a file with "Path" [TAB] "Name" [TAB] "Size". Adding a hash would add significantly to the processing time but it could be done easily. Question is if you have the tool(s) and ability to create something. Once the file is created you still have to parse it, sort it and flag the dupes.
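For those without VB.NET handy, the same kind of listing can be sketched in Python, which also runs on Windows 7. This is not the VB program mentioned above; the function name is made up, and the timing figures quoted there say nothing about how fast this version runs. Directories that cannot be opened (the "Access Denied" case) are collected rather than aborting the walk.

```python
import csv
import os

def write_listing(root, out_path):
    """Write one line per file: Path <TAB> Name <TAB> Size.

    Returns (number of files listed, list of errors that were skipped).
    Keep out_path outside root, or the listing will include itself.
    """
    skipped = []
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        # onerror receives the OSError for any directory that cannot be read
        for dirpath, _dirs, files in os.walk(root, onerror=skipped.append):
            for name in files:
                try:
                    size = os.path.getsize(os.path.join(dirpath, name))
                except OSError as err:
                    skipped.append(err)
                    continue
                writer.writerow([dirpath, name, size])
                count += 1
    return count, skipped
```

The resulting file can be sorted on the size column to find the sizes that occur more than once, which are the only files worth hashing.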
I use this program: http://www.foldermatch.com/ . Its built-in duplicate finder does exactly what you want: http://www.foldermatch.com/images/duplicate-file-finder.jpg . Of course you could always write your own tool as well. Folder match does it pretty efficiently though.
ftp://ftp.bitwizard.nl/same/
I used this to keep all versions of the Linux kernel source tree on my computer, with identical files hardlinked together to reduce storage space.
Both diff (blazing fast "diff -purN ") and patch handle hard links, so this was very workable.
It can be slow and take quite some memory (only 128 MiB-1 GiB in those days), but guess 16 GiB of RAM should handle 4 million files fine, as this is about the same order of magnitude as the few hundred kernel source trees I had lying around.
After git arrived, it was faster to just use git.
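Hard-linking identical files together, as described above, can be sketched in a few lines of Python. This is not how the "same" tool is implemented; the function name and the ".linktmp" suffix are assumptions for the example. Both files must live on the same filesystem, and since the two names end up pointing at one inode, the duplicate's own timestamps and permissions are gone afterwards.

```python
import filecmp
import os

def replace_with_hardlink(original, duplicate):
    """Replace duplicate with a hard link to original, after verifying the bytes match."""
    if not filecmp.cmp(original, duplicate, shallow=False):
        raise ValueError("files differ, refusing to link")
    tmp = duplicate + ".linktmp"  # hypothetical temporary name
    os.link(original, tmp)        # create a second name for original
    os.replace(tmp, duplicate)    # swap it over the duplicate in one step
```

Linking under a temporary name first means the duplicate is never missing from disk, even if the process dies halfway through.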
If you have 4.2 million files, duplication would seem to be the least of your problems. How do you find the specific one of the 4.2 million you need? Are there sets of files you know you'll never need to access?
And forgive me for playing the shrink, but how much of your problem is just compulsive hoarding?
How about intelligent people just looking out for truly insightful comments amongst the various posts? It would be interesting to see a true, accurate demographic of slashdot folk, I guess the people that post are actually a fairly small subset and the number in computer related industries equally small...
I am very susceptible to "let's have another drink"
a) Looking at file sizes, then
b) Looking at the first few bytes of files with the same size.
I would say instead you should seek to some value near the middle of the file calculated by the file size.
The reason I say that is I have around 100k uncompressed tiff files (yes my own), mostly about three or four distinct sizes across the set. If you just look at the first few characters it's going to be the same TIFF header for every TIFF file of the same size, leading to a huge amount of checksum work that could be eliminated by shifting the check.
Perhaps even a few quick random samples, with one at the start, middle and end.
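A rough Python sketch of that sampling idea: hash one small block from the start, one from the middle and one from the end, with the middle and end offsets calculated from the file size. The 4096-byte sample size and the function name are arbitrary choices for the example.

```python
import hashlib
import os

def sample_signature(path, sample=4096):
    """Return (size, MD5 of samples taken at the start, middle and end of the file)."""
    size = os.path.getsize(path)
    # A set removes repeated offsets, which happens for files smaller than the sample
    offsets = {0, max(0, size // 2 - sample // 2), max(0, size - sample)}
    h = hashlib.md5()
    with open(path, "rb") as f:
        for offset in sorted(offsets):
            f.seek(offset)
            h.update(f.read(sample))
    return size, h.hexdigest()
```

Two TIFFs of the same size with identical headers get different signatures as soon as their pixel data differs at the middle or end sample, so most of them never reach the full-checksum stage. A matching signature is still only a hint, not proof.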
"There is more worth loving than we have strength to love." - Brian Jay Stanley
Compare names and sizes, then CRC. Let the thing run.
It also provides a standalone command line tool (findup) generating text output you can process later on with a little bit of scripting. With this tool I've been able to process about a TB of data in a reasonable amount of time.
Is this one of these apps that restarts itself every time the file system changes? Like when a background process appends to a log or something like that.
You might have to start your system in a maintenance mode and skip starting all background processes. Or mount this drive under another system as a data drive.
Have gnu, will travel.
Given all the caveats and assumed programming skills in all the other messages, I agree that this is the fastest, simplest method. I mean, really, how much attic, shelf, or drawer space do three 2-terabyte drives take up? Just copy off what you know you need to a new drive. Set the old ones aside. Then, when you need something that you can't find on the new drive, just fire up those old drives and do a search.
"The enemy of having a life is perfect hard drive de-duplication"
Me (just now)
If you have 100 files all of one size, you'll have to do 4950 comparisons.
You only have to do 4950 comparisons if you have 100 unique files.
What I do is pop the first file from the list, to use as a standard, and compare all the files with it, block by block. If a block fails to match, I give up on that file matching the standard. The files that don't match generally don't go very far, and don't take much time. For the ones that match, I would have taken all that time if I was using a hash method anyway. As for reading the standard file multiple times: It goes fast because it's in cache.
The ones that match get taken from the list. Obviously I don't compare the ones which match with each other. That would be stupid.
Then I go back to the list and rinse/repeat until there are less than 2 files.
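The loop described above can be sketched like this in Python. It is my reading of the description, not the poster's actual code; the function names and the 1 MiB block size are assumptions.

```python
def same_bytes(a, b, block=1 << 20):
    """Compare two files block by block, giving up at the first mismatch."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ba, bb = fa.read(block), fb.read(block)
            if ba != bb:
                return False
            if not ba:  # both files ended at the same point
                return True

def partition_same_size(paths):
    """Split a list of same-size files into groups of identical files."""
    remaining = list(paths)
    groups = []
    while len(remaining) >= 2:
        standard = remaining.pop(0)  # first file is the standard
        matches, rest = [standard], []
        for other in remaining:
            (matches if same_bytes(standard, other) else rest).append(other)
        if len(matches) > 1:
            groups.append(matches)
        remaining = rest  # matched files are never compared again
    return groups
```

With 100 identical files this does 99 comparisons, because everything matches the first standard and the list is empty after one pass. The 4950 figure only shows up when all 100 files are different.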
I have done this many times with a set of 3 million files which take up about 600GB.
My other car is a 1984 Nark Avenger.
Do it in parcels. What are your most common files? databases? PDFs? JPGs? spreadsheets? Whatever they are, de-dup all the common file extensions one at a time. Do the pictures, then the documents, then the PDFs, then the MP3s, ...
If this is a porn collection, you need to log out and never come back to slashdot ever again.
Anyways, then start working on *.*, but break that up into chunks.
De-dup from 0K size to 1Meg, then from 1Meg to 10Meg, from 10Meg to 50Meg, etc.
This is the tool I use:
http://archive.org/details/tucows_373411_Duplicate_File_Finder
I have thrown some real hairballs at it and it works fine.
== I question your beliefs, makes me a Troll. You insult my beliefs, you are progressive and mainstream. Okay. Got
If your aim is to clean up your sloppy directory organization and the almost inevitable dupes that will ensue over the years, good luck to you. Several respondents have made good suggestions. If, however, your aim is to just save space, use a storage platform that will de-dupe for you, at the block level. Nexenta comes to mind, but there are others, of course. I wouldn't do this on a file system that saw a lot of interactive use, but you have indicated that this is an archive. Perfect fit.
My home-rolled solution to exactly this problem is: http://gnosis.cx/bin/find-duplicate-contents.
This script is efficient algorithmically and has a variety of options to work incrementally and to optimize common cases. It's not excessively user-friendly, possibly, but the --help screen gives reasonable guidance. And the whole thing is short and readable Python code (which doesn't matter for speed, since the expensive steps like MD5 are callouts to fast C code in the standard library).
Buy Text Processing in Python
I think you should only de-duplicate one type of file at a time. Maybe start with all png files. Then all mp3 files. Then all txt. Then all jpg. The problem will get smaller and smaller and you won't have to do the whole thing at one time, which results in nothing getting de-duplicated in the first place. And as the number of files gets smaller, eventually you will get to a point that you can de-duplicate the whole pile of remaining files at once. And it might not hurt to delete *.tmp or whatever your operating system's equivalent of "all temporary files" is, before you start de-duplicating. And if possible, it probably wouldn't hurt to delete all files that are zero bytes in size before starting de-duplication. If 4 million of your 4.2 million files all happen to be the same file type then never mind.
That's a lot of porn, good luck!
It's off to anger management classes for you!
Ok, first, why do you need to do this? Space is pretty darn cheap, and this seems like a tremendous waste of time and energy to save tens of dollars. But more importantly, I find I need TONS less space now that I just depend on the Internet to keep all of my porn and to stream it.
This is a very fun programming task!
Since it will be totally limited by disk IO, the language you choose doesn't really matter, as long as you make sure that you never read each file more than once:
1) Recursive scan of all disks/directories, saving just file name and size plus a pointer to the directory you found it in.
If you have multiple physical disks you can run this in parallel, one task/thread for each disk.
2) Sort the list by file size.
3) For each file size with multiple entries do:
3a) How many matches are there and how large are they?
3a1) Just two files: Read them both in parallel, using a block size of 1MB or more in order to avoid extra disk seeks, and compare directly. Exit on first difference of course!
3a2) 3 or more files: Read them all interleaved, still using a 1MB+ block size. For each block calculate a CRC32 or secure hash, compare these at the end of each block iteration. When a single file differs from the rest, it is unique.
When two or more are equal but still different from the majority of the group, recurse into a new copy of the scanning function that checks the smallest group, then upon return go on with the rest.
It should be obvious that your scanning function needs to accept an array of open file handles/descriptor plus an offset to start the scanning process at, thus making it easy to call it recursively to check the tails of a sub-array!
(A possible problem can occur if you have _very_ many files of the same size, in that the operating system could run out of file handles for simultaneously open files! In that case I'd fall back on passing in file paths instead of open handles and take the hit of re-opening each file for each block to be read. I would also increase the block size significantly, into the 10-100 MB range, so that everything except big ISOs and similar would be read in a single access. The same process is probably optimal for file sizes less than the minimum block size.)
This algorithm should be able to do what you need in significantly less time than you'd need to just read everything once. I'd estimate about 50 MB/s effective reading speed, so if everything is on a single disk (4.9 TB? Not very likely!) and every single file size has multiple entries that only differ in the last byte, you would need 100 K seconds, or a little more than a day. My guess is you should easily finish overnight!
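A minimal Python sketch of step 3a2, assuming the files passed in already share a size. Instead of recursing it keeps a work list of sub-groups, which comes to the same thing: every file is read once, block by block, and a group is split as soon as its members disagree on a block. SHA-256 per block is my choice here; the function name is also made up.

```python
import hashlib
from collections import defaultdict

def split_by_content(paths, block=1 << 20):
    """Partition same-size files into groups of identical files, reading each file once."""
    handles = {p: open(p, "rb") for p in paths}  # may hit the OS handle limit for huge groups
    try:
        pending = [list(paths)]  # groups whose members agree on every block read so far
        done = []
        while pending:
            group = pending.pop()
            buckets = defaultdict(list)
            for p in group:
                data = handles[p].read(block)
                buckets[(len(data), hashlib.sha256(data).digest())].append(p)
            for (length, _digest), members in buckets.items():
                if len(members) < 2:
                    continue              # differs from everyone else: unique
                if length < block:
                    done.append(members)  # short read means end of file: duplicates
                else:
                    pending.append(members)
        return done
    finally:
        for h in handles.values():
            h.close()
```

Members of a group always sit at the same file offset, because they have been read in lockstep, so no seeking is needed when a group splits. For groups too large to keep open at once, the fallback described above (pass paths and reopen per block) would replace the handles dictionary.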
Terje
"almost all programming can be viewed as an exercise in caching"
Step 1: Build a bikeshed
Step 2: Ask a bunch of geeks what color to paint it
Step 3: ???
Step 4: Profit!
Grab what's important and let the format tornado take care of the rest.
Perhaps this is something you're looking for:
https://github.com/SoftwareMaven/DeDuper
google: github deduper
I can tell you how I have done similar stuff on Mac OS X, using only built-in tools and features and very simple bash scripts. Of course you are using Windows, so you will have to change some of the steps to use the matching Windows tools (like using .bat files instead of bash, etc) and may even need to install some stuff. Even if you don't use it, it may be of interest for other Mac users.
Here it goes:
First, save this very crude bash script into a file (sorry, I'm not a bash programmer):
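#!/bin/bash
# Walk the directory given as $1 and tag every untagged file with its MD5.
function navigate_directory {
  cd "$1" || return
  for anItem in *
  do
    if [ -d "$anItem" ]
    then
      echo "$level$anItem"
      export level=$level"."
      navigate_directory "$anItem"
      export level=${level:1:`expr ${#level} - 1`}
    elif [ "$(mdls -name md5cs -raw "$anItem")" = "(null)" ]
    then
      # No md5cs attribute yet: compute the checksum and store it.
      #echo \ \ $anItem
      md5cs=`md5 -q "$anItem"`
      #echo \ \ \ \ $md5cs
      xattr -w com.apple.metadata:md5cs $md5cs "$anItem"
    fi
  done
  # Return to the parent directory before the caller continues its loop.
  cd ..
}
crawlDirs=$@;
export level="."
# "$@" keeps each directory argument separate, even if it contains spaces.
for anItem in "$@"
do
  echo "$anItem"
  navigate_directory "$anItem"
done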
All that script does is crawl through all the directories in the input, and for each file it calculates the MD5 checksum (hint: md5cs=`md5 -q "$anItem"` ). Then it uses xattr to save the MD5 checksum as an extended attribute that can be searched using Spotlight (you would need to use the equivalent search feature in Windows 7).
Because you want it to be searchable through Spotlight the "legal" way to do this is by creating your own little application that "registers" the attribute in the system. But that is waaaaaay too much work for something that you don't plan to use a week from now, so just cheat and register it as an Apple metadata attribute: xattr -w com.apple.metadata:md5cs $md5cs "$anItem"
(if this makes you uncomfortable you can later delete the attributes using a similar function)
To index everything, run the script from the base directory of your filesystem (not sure how to do that in Windows, you may have to run it on every drive), or just run on the directories that have your files (it's pointless to index the system files). The time it will take depends on the number and size of the files you have. Given your 4.2 million files in 4.9 TB it should take a day or so given your fast hardware.
At this point if you do a Spotlight search for the MD5 checksum of a file you will almost immediately get a list of all its dupes. (If you don't, you may need to rebuild the Spotlight indexes by running mdutil -i on and then off on every drive. I don't think it's necessary but YMMV).
Now copy this other bash script. Note how it is very similar to the above one.
#!/bin/bash
function get_md5_for_file
... but the other half is a bitch.
Using various tools, I got a listing of all files and a checksum for each. (Checksumming obviously takes some time.) Then, sort by checksum. Any time you have two matching rows, you probably have dupes. If the filesize is the same (down to the byte) then they are almost certainly dupes. (Further things to compare: date modified and filename. If all four match, you can be pretty sure, unless you have Google's amount of data, that you have a dupe.) If you want, write a script to delete all but the first instance of each file.
BUT--the problem is logic. Deleting files only gains you space. It does NOTHING to help you organize things. In fact, it'll probably make things worse. For one thing, you'll wind up with lots of empty folders. For another, there are many scenarios you'll run into.
Folder A has File1 and File2, and Folder B has File1, File2, and File3. A dumb system might leave behind FolderA/File1, FolderA/File2, and FolderB/File3.
Or, maybe you have FolderA/File1, FolderA/File2, and FolderA/File3, along with FolderB/File2, FolderB/File3, and FolderB/File4. Ideally, you'd want to end up with Folder/File1, Folder/File2, Folder/File3, and Folder/File4. Again, that's beyond the scope of a typical dumb tool.
Even tools that search for dupes and replace all but one with a link will still leave you with, at best, twice as many folders as you need. You might have some space but you still have a big mess too.
So, all you can do is decide what's most important: your time, your money, or your sense of neatness. Probably the best solution is to search for big files (ISO, MPG, etc.) and delete any obvious dupes. Then, get another big disk and start migrating one type of file at a time. Get to the point where you can say "Every single ISO I have exists one time on this disk over here. Any others I find can be deleted." You want to aim for the low-hanging fruit. You can spend 2 minutes and delete 5 movies and reclaim a few GB, or you can spend hours pruning little web files and get back just a few MB.
You almost certainly have enough files that cleaning them would literally take weeks or months. Try to do a little at a time. Don't think you can lock yourself in a room and emerge 2 days later with a perfectly clean filesystem. Trying to reach the theoretical perfection of "There are no dupes anywhere among all my disks" will take a lifetime, drive you mad, or both.
I've been meaning to clean up about 5 TB of disk myself for about 4 years, judging by folders with names like "Master_2008_All_Organized_No_Dupes". The most effective method I've found for dealing with it is accepting the fact that I never will. :-) Disks just keep getting cheaper. Just keep buying them. Every so often, take some time, do a nice migration, and clean up what you can, but if you're employed in a technical field, you can buy another 1 TB drive for just a few hours' pay. No reason to spend 40 hours of your life trying to save that.
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
... it tabulates the size of a given directory and gives you graphical representations of where big files are that you can click on immediately. Then use something like beyond compare to compare directories.
On Unix systems, a small utility named samefile does wonders to de-dup after the fact. It should be portable enough to run on Windows as well...
cpghost at Cordula's Web.
OP wrote:
I tried running a free de-dup program, but it ran for a week straight and was still 'processing' when I finally gave up on it.
Maybe you're not naming the free de-dup program in question out of politeness, but I'd like to know... Or leave a message with the author of said program?
Got really large files in your corpus? Then consider an intermediate step where you hash a larger and different portion of the file. For something different, you could hash the last bytes of the file so you don't end up duplicating work. Say a megabyte. In my case, I didn't need the extra pass because of the data involved. My corpus was on magnetic tape, so I couldn't just compare files byte by byte, because I would have had to load them somewhere first to do the compare. So I had to identify the potential duplicates *first*.
Pardon me for asking, but if all this data is so important to you that you can't bear losing a single file, why didn't you keep it sorted in the first place?
Reading this prompted me to write a bit of code to do the de-dupe comparisons. Here is the code. You will have to mark the project to run unsafe code :) (in project properties) Compiled with Visual Studio 2010.
Program reads the first 4MB of each file and computes a hash. A thread is run for each drive you are looking for.
If you want all drives, comment out the section it says to do so, else just add the drives you want to the list of DrivesToSearch
I suggest if you use your C Drive, add some of the folders like I have below to the Ignore Directories. The "ToLower()" is there just to make sure that it is lower case, else the hash match won't work.
Please forgive the code, as this was very quick-n-dirty
Code runs *far* faster than a week....
C:\ = 185,000 files.
F:\ = 29,690 files
G:\ = 20,765 files
H:\ = 60,851 files
i:\ = 52,442 files
D:\ 196 files (DVD ROM)
Total: 348,944 files on 6 drives with 3.2TB of used space took about 50 minutes 52 seconds
Speed can be improved by lowering the 4 meg check to something lower. Many of the files on F,G are over 4MB in size and took the longest to complete, even though they had less total files.
Code Below. (mutters about slashdot and their inability to allow code)
http://pastie.org/4652387
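For readers who don't want to open the paste, here is a rough Python sketch of the same idea: hash only the first few megabytes of each file, one worker thread per drive. It is not a port of the linked C# program. All names here (`scan_root`, `find_candidates`, `ignore_dirs`) are made up for illustration, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

PREFIX_BYTES = 4 * 1024 * 1024  # the post's 4MB; lower it to go faster


def prefix_hash(path, nbytes=PREFIX_BYTES):
    """Hash only the first nbytes of a file."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read(nbytes)).hexdigest()


def scan_root(root, ignore_dirs=()):
    """Walk one drive or folder and return a list of (prefix hash, path)."""
    # normcase lowercases on Windows, so ignore entries match in any case
    ignore = {os.path.normcase(os.path.abspath(d)) for d in ignore_dirs}
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        # prune ignored directories in place so os.walk skips them
        dirnames[:] = [
            d for d in dirnames
            if os.path.normcase(os.path.abspath(os.path.join(dirpath, d))) not in ignore
        ]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                results.append((prefix_hash(path), path))
            except OSError:
                continue  # unreadable file: skip it
    return results


def find_candidates(roots, ignore_dirs=()):
    """Scan each root in its own thread; return {hash: [paths]} for repeats."""
    by_hash = {}
    with ThreadPoolExecutor(max_workers=max(1, len(roots))) as pool:
        for results in pool.map(lambda r: scan_root(r, ignore_dirs), roots):
            for digest, path in results:
                by_hash.setdefault(digest, []).append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Something like `find_candidates(["F:\\", "G:\\"], ignore_dirs=["C:\\Windows"])` would be the equivalent call. Two files that agree on their first 4MB are only candidates, so anything larger than the prefix still needs a full comparison before deletion.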
SDFS is a cross-platform dedupe system that works on Linux and Windows.
At work, my company uses EMC Avamar, so if you are interested in a commercial product, that's a standalone storage/dedupe system. However, it's pretty expensive.
Did you mount a military-grade, variable-focus MASER on an unlicensed artificial intelligence?
First: Get a copy of Windows Server 2012 and use the new deduplication system (which deduplicates at the 'file chunk' level across an entire disk): https://www.usenix.org/conference/usenixfederatedconferencesweek/primary-data-deduplication%E2%80%94large-scale-study-and-system
Now that you've taken care of the data duplication, let's talk about the tools for sifting through large sets of files:
1. Get 'Everything' (http://www.voidtools.com/): This tool allows for the 'instant' searching for any file throughout _all_ your files; I've used it on 4 million files myself. Just start typing part of the file name and it will show you a list of where those files are located on your system. Also, the list is 'live': you can right-click on any icon in the file list, and it will act the same as if you had right-clicked on the file itself in Explorer.
2. Get 'SpaceMonger' (http://www.sixty-five.cc/sm/): This tool shows what's taking up the space on your computer, it's similar to 'WinDirStat' but more flexible, customizable, and detailed.
3. Get 'ZTreeWin' (http://www.ztree.com/): This tool is the Swiss-Army knife program for working on files (finding, searching, viewing). If you remember 'XTree', it's a clone of that which can work on 4 million(+) files.
4. Get 'Beyond Compare' (http://www.scootersoftware.com/): This tool allows for easy comparison/synchronization of folders (and files). Compare two of your old backup folders and merge them.
Even if each file is "uncompressible", a good compression system should all but eat the dupes, and it won't break anything that relies on the dupe actually being where it is in the file system. It is also a more "standard" solution, and if your processor outpaces your disk it may even make things run faster.
Nullius in verba
In my smaller efforts, I do a standard file search in the Windows folder browser in detail view, say for *.mov or *.mp3, and sort the results by file size. It's pretty quick. With .jpg or .raw the file sizes are closer together, but if the file names are also duplicated this will be quite obvious. Right-click and open the destination for more info (what else is in that folder), or simply select all but one of the files shown and delete them there and then. Done.
Add the folder/view column and you can see their location and identify all the duplicates. This may not work so well for
Or is this too simple a solution?
Check out dedup in Windows Server 2012 - http://blogs.technet.com/b/filecab/archive/2012/05/21/introduction-to-data-deduplication-in-windows-server-2012.aspx
Do the dupes really matter? Out of 4.9TB how much could be duplicates? In the overall scheme of things it's probably less than 1%, so who cares. If you stumble on them, clean it up. If you don't, who cares.
For readability, s/;/;\n/g. From an error message it seems Slashdot is hostile to small lines in posts. The original is 73 lines.
http://pastebin.com/sUfZkVaQ
#!/usr/bin/env perl
use strict;
use Digest::SHA;
use Cwd;
use File::Util;
my $topDir=cwd();
my($f) = File::Util->new();
my(@files) = $f->list_dir($topDir,'--recurse');
my %hash;
my $deleteFlag=$ARGV[0];
#print $deleteFlag,"\n";
foreach my $file(@files) {
  if(-d $file) {next;}
  my $size=$f->size($file);
  push @{$hash{$size}},$file;
}
my ($filectr,$setctr)=(0,0);
foreach my $key (sort { $a <=> $b } keys %hash) { #loop through sizes
  my $value=$hash{$key};
  my @arr=@{$value};
  my $numFiles = @arr;
  if ($numFiles [...] $b } keys %shahash) { #loop through files of same hash value
  my $shavalue=$shahash{$shakey};
  my @shaarr=@{$shavalue};
  my $numFilesSha = @shaarr;
  if($numFilesSha [...] new($alg);
  $sha->addfile($filename);
  my $digest = $sha->hexdigest();
  return $digest;
}
sub unixFilename {
  my ($filename) = @_;
  $filename =~ s/\)/\\\)/g;
  $filename =~ s/\(/\\\(/g;
  $filename =~ s/\ /\\ /g;
  $filename =~ s/\;/\\\;/g;
  $filename =~ s/\'/\\\'/g;
  $filename =~ s/\"/\\\"/g;
  $filename =~ s/\&/\\\&/g;
  $filename =~ s/\!/\\\!/g;
  return $filename;
}
Install cygwin.
Get an md5sum of every file and store the file paths/md5sums in a text file.
Sort the file.
Use a script (Perl or your scripting language of choice) to spit out the paths of every file that's duplicated.
No, I'm not writing the code to do it for you.
LK
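For anyone following the md5sum recipe above, the last step might look like this minimal Python sketch. It assumes the usual md5sum line layout (hex digest, a space, a mode character, then the path) and input that is already sorted so equal digests sit next to each other; the function name is made up for illustration.

```python
import itertools


def duplicate_groups(lines):
    """Yield lists of paths that share a checksum.

    Expects md5sum-style lines, '<hex digest>  <path>', already sorted
    so that identical digests are adjacent.
    """
    def parse(line):
        digest, _, rest = line.rstrip("\n").partition(" ")
        # rest starts with the mode character (' ' or '*'); drop it
        return digest, rest[1:]

    parsed = (parse(line) for line in lines if line.strip())
    for _digest, group in itertools.groupby(parsed, key=lambda item: item[0]):
        paths = [path for _, path in group]
        if len(paths) > 1:
            yield paths
```

With a list produced by something like `find . -type f -exec md5sum {} + | sort > sums.txt`, iterating over `duplicate_groups(open("sums.txt"))` gives one list of paths per set of matching checksums.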
"Hi. This is my friend, Jack Shit, and you don't know him." - Lord Kano
I wrote a freeware Windows utility called Folderscope and deduping large folders is one of the main use cases:
Folderscope
Enjoy, Andrew.
By hand
WinDirStat is a useful utility that might be able to help you break up your task into smaller parts. http://windirstat.en.softonic.com/
My project was complicated by the fact that the files were scattered across several 2TB external hard drives and a few internal ones. Your situation should be easier since you have consolidated everything. Here is what I did:
Set up a MySQL database with filename, full path, and MD5 checksum as fields - as well as a few other EXIF fields since I was working with a huge photo archive. Also the appropriate index.
Put together a quick Python script that would walk the directories, take an MD5 checksum of each file, and plug it into the database.
Finally I just did a query on the database to print out duplicates based on MD5 which took surprisingly little time to run even with several million records.
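A rough sketch of those three steps, with SQLite from the Python standard library swapped in for MySQL so it runs without a server. The table layout and function names are assumptions for illustration, and the EXIF fields are left out.

```python
import hashlib
import os
import sqlite3


def md5_of(path, chunk=1024 * 1024):
    """MD5 of a whole file, read in chunks to keep memory use flat."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def index_tree(conn, top):
    """Walk top and record filename, full path and MD5 for every file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, filename TEXT, md5 TEXT)"
    )
    # the index is what keeps the duplicate query quick on millions of rows
    conn.execute("CREATE INDEX IF NOT EXISTS files_md5 ON files (md5)")
    for dirpath, _dirs, names in os.walk(top):
        for name in names:
            path = os.path.join(dirpath, name)
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                (path, name, md5_of(path)),
            )
    conn.commit()


def duplicates(conn):
    """Return (md5, path) rows for every checksum that occurs more than once."""
    query = """
        SELECT md5, path FROM files
        WHERE md5 IN (SELECT md5 FROM files GROUP BY md5 HAVING COUNT(*) > 1)
        ORDER BY md5, path
    """
    return conn.execute(query).fetchall()
```

Calling `index_tree` once per drive against the same database file lets the final query find duplicates across drives that are never mounted at the same time.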
Caveat emptor: All this was done in Linux. I started putting this together on a fast Win7 machine and quickly realized that it was just too slow to get this done in a week.
Support microSD: in a post 9/11 world, it is unwise to carry your data on media that you cannot comfortably swallow.
Best tool. http://hungrycats.org/~zblaxell/dupemerge/faster-dupemerge worked great for me in the past 10 years. Scales.
-- I was raised on the command line, bitch
There are a lot of good recommendations for how to locate duplicates. If you really plan to attempt deduplication rather than purchasing more space, there are a number of things to consider. First, don't use a tool to perform the deduplication, only to locate the duplicates. You are bound to run into a scenario you didn't anticipate. Multiple users may each maintain their own copy of identical files. If one is removed, one user no longer has access. If they are simply hard linked to the same file, modifications are applied to both. With multiple copies of the same repository from a distributed SCM (Git, Mercurial, etc.), you are going to run into a vast number of false positives. There are other situations where use/ownership, and not simply structure, must be taken into consideration.
http://freecode.com/projects/fdupe -- perl. Only finds exact duplicates, and I haven't used it against more than 200'000 files and 2TB.
"The more prohibitions there are, The poorer the people will be" -- Lao Tse
A co-worker and I used a program called Beyond Compare to match data stored on tape with a live Archive directory on a file server. Actually, it was his idea and it worked out pretty well. Check it out, it may be of some use.
http://www.scootersoftware.com/
Life is not for the lazy.
The strategies to compute file lengths, then crcs, are generally wise. But they may miss the real problem.
In addition to detecting duplicate backup files, the OP ought to think about how duplicates should be handled. The goal, one presumes, is to create a single tree of backed-up files where each file is represented only once, but which preserves something about the original organization of the original directory hierarchy.
I similarly have duplicate backups distributed between flash drives, burned dvd's and cdr's (some stored off site), external disks, and old-but-still-working computers and cell phones. Many of these repositories are ancient and were managed by software with odd naming conventions. But destroying that history may lose information, such as "With which camera did I take this picture?"
The easy problem is detecting duplicates. The hard problem is figuring out how to organize the resulting files into a meaningful new single tree.
The objective may be to make sure you don't unwittingly fork.
I've dealt with a similar problem on a smaller scale (500K files, 120GB). I started by generating hashes over all my current properly-organised files using hashdeep, and parsed the output into a database (columns filesize, hash, path, filename, mtime) using a custom script. Then I wrote another script to walk through the archives finding and deleting files that matched those already in the database; the script also used the database to keep track of its walk so it could be stopped and restarted. This halved the size of the archive material before I had to start trying to understand what was there.
From there I identified pivotal directories in the archive - ones I could reasonably assume to be recent or more complete (for example, based on backup date) - and added them to the hash database, then walked the rest of the archives culling duplicates again. Lather, rinse, repeat and you rapidly reach a point where you have a small number of directories with a lot of de-duplicated data, and a large number of directories with small amounts of possibly-duplicated data that can be handled by a free dedup tool.
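The loop described above can be sketched roughly as follows. This is not the poster's script: it keeps the reference hashes in an in-memory set instead of a database, has no stop-and-restart tracking, computes the hashes itself rather than parsing hashdeep output, and only reports matches unless told to delete. All names are illustrative.

```python
import hashlib
import os


def file_key(path, chunk=1024 * 1024):
    """Return (size, SHA-256 hex digest) for a file."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return os.path.getsize(path), h.hexdigest()


def build_reference(top):
    """Collect the keys of every file under an already-organised tree."""
    return {
        file_key(os.path.join(d, n))
        for d, _sub, names in os.walk(top)
        for n in names
    }


def cull(archive_top, reference, delete=False):
    """Return archive files whose content is already in the reference set.

    With delete=False (the default) nothing is removed and the returned
    list is just a report of what would go.
    """
    matched = []
    for d, _sub, names in os.walk(archive_top):
        for n in names:
            path = os.path.join(d, n)
            if file_key(path) in reference:
                matched.append(path)
                if delete:
                    os.remove(path)
    return matched
```

Each "lather, rinse, repeat" round then amounts to `reference |= build_reference(pivotal_dir)` followed by another `cull` over what remains.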
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
If you can live with less than perfect results (wasted space) you could apply the Pareto principle: start working with a list of file sizes in descending order and dedup manually until you have recovered enough space. Chances are that 20% of the files account for 80% of the space.
More info http://en.wikipedia.org/wiki/Pareto_principle
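To see whether the 80/20 guess holds for a given drive, something like the following Python sketch would do. The function names and the 0.8 default are illustrative assumptions; it lists files largest first and then reports how few of them cover a chosen fraction of the total bytes.

```python
import os


def largest_files(top):
    """Return [(size, path)] for every file under top, largest first."""
    sizes = []
    for d, _sub, names in os.walk(top):
        for n in names:
            path = os.path.join(d, n)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # vanished or unreadable: skip
    sizes.sort(reverse=True)
    return sizes


def files_covering(sizes, fraction=0.8):
    """Return the shortest largest-first prefix holding `fraction` of all bytes."""
    total = sum(size for size, _ in sizes)
    running, picked = 0, []
    for size, path in sizes:
        if running >= fraction * total:
            break
        picked.append((size, path))
        running += size
    return picked
```

If `len(files_covering(sizes))` comes out as a small share of `len(sizes)`, going through that short list by hand is realistic.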
Well yes, this is a Linux tool, but still I was quite pleased with its results for 800k files. It took some time but it had an end.
/usr/share/fslint/fslint/findup
It's basically a shell script doing what others have suggested: sort by size, then checksum the files of the same size.
find dUPlicate files.
Usage: findup [[[-t [-m|-d]] | [--summary]] [-r] [-f] paths(s) ...]
If no path(s) specified then the current directory is assumed.
When -m is specified any found duplicates will be merged (using hardlinks).
When -d is specified any found duplicates will be deleted (leaving just 1).
When -t is specified, only report what -m or -d would do.
When --summary is specified change output format to include file sizes.
You can also pipe this summary format to /usr/share/fslint/fslint/fstool/dupwaste
to get a total of the wastage due to duplicates.
As it's a single command line with dozens of pipes, it should use all cores if needed.
some text from the source:
Description
will show duplicate files in the specified directories
(and their subdirectories), in the format:
file1
file2

file3
file4
file5
or if the --summary option is specified:
2 * 2048 file1 file2
3 * 1024 file3 file4 file5
Where the number is the disk usage in bytes of each of the
duplicate files on that line, and all duplicate files are
shown on the same line.
Output is ordered by largest disk usage first and
then by the number of duplicate files.
Caveats/Notes:
I compared this to any equivalent utils I could find (as of Nov 2000)
and it's (by far) the fastest, has the most functionality (thanks to
find) and has no (known) bugs. In my opinion fdupes is the next best but
is slower (even though written in C), and has a bug where hard links
in different directories are reported as duplicates sometimes.
This script requires uniq > V2.0.21 (part of GNU textutils|coreutils)
dir/file names containing \n are ignored
undefined operation for dir/file names containing \1
sparse files are not treated differently.
Don't specify params to find that affect output etc. (e.g -printf etc.)
zero length files are ignored.
symbolic links are ignored.
path1 & path2 can be files &/or directories
and the code has optimizations like this one
sort -k2,2n -k3,3n | #NB sort inodes so md5sum does less seeking all over disk
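The same trick can be tried outside a shell pipeline. This Python sketch hashes a list of files in (device, inode) order; that those are what the two sort keys in the quoted line hold is an assumption, as is the idea that inode order roughly follows on-disk layout on the filesystem in use. The function name is made up.

```python
import hashlib
import os


def hash_in_inode_order(paths):
    """Hash files sorted by (device, inode); return {path: hex digest}."""
    def disk_order(path):
        st = os.stat(path)
        return st.st_dev, st.st_ino

    digests = {}
    for path in sorted(paths, key=disk_order):
        h = hashlib.md5()
        with open(path, "rb") as fh:
            # read in 1MB blocks so large files do not fill memory
            for block in iter(lambda: fh.read(1 << 20), b""):
                h.update(block)
        digests[path] = h.hexdigest()
    return digests
```

Any gain should only be expected on spinning disks; on an SSD the ordering makes little difference.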
Atari rules... ermm... ruled.
Break the 4.9 TB into conveniently sized chunks (say, 500 MB) and de-dup them one at a time. I'd dedicate a spare computer to do this, so you can leave it running over nights, weekends, etc. Then merge the now-smaller results into 500 MB chunks again and work through iteratively.
After reading only a few posts I was finally motivated to dedupe my SkyDrive. Using FastDuplicateFileFinder I found many dupes and sorted through them. My number of files was a mere fraction of the OP's files, but this worked for me. I was surprised to find some I no longer needed and some I barely remembered from back in 1993.
Normally I ascribe all life to intelligent design, but in your case I'll make an exception.
use http://www.clonespy.com/ and let it run some days/weeks
I have done this exercise just last week for 120,000 files - ran one night on an old P4
Rationale:
You don't need to free up space, heck, space is cheap so there's no real reason to recover it.
Also, given that the worth of something is inversely proportional to its availability, it actually makes sense to have duplicates hanging around: once you lose your only copy of a file you'll be *very* happy to find its duplicate.
@peetm
If you are on NTFS you can use http://freedup.org/ or freedups.pl (http://www.stearns.org/freedups/). It makes hard links among duplicate files. On NTFS, it poops out after 1024 links, but at least you have 1023 fewer copies of the file on your hard drive.
This makes sense if you are used to a particular file structure. The file structure stays the same, so you can find the one copy of the file by whatever name/path you happen to remember first.
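What such a tool does for one pair of files can be sketched in Python as below. This is not how freedup or freedups.pl are implemented; the function name and the `.linktmp` suffix are made up, and it simply gives up if the filesystem refuses the link (for instance when a per-file link limit is hit).

```python
import filecmp
import os


def replace_with_hardlink(keep, dup):
    """Replace dup with a hard link to keep if their contents are identical.

    Returns True when dup shares storage with keep afterwards. Both paths
    must be on the same filesystem. On any refusal dup is left untouched.
    """
    if os.path.samefile(keep, dup):
        return True  # already the same file under two names
    if not filecmp.cmp(keep, dup, shallow=False):
        return False  # contents differ, so not a duplicate
    tmp = dup + ".linktmp"  # hypothetical temporary name
    try:
        os.link(keep, tmp)
    except OSError:
        return False  # link limit reached, different volume, etc.
    os.replace(tmp, dup)  # swap the link into place over the copy
    return True
```

Once two names are linked this way, editing the file through either name changes what both names show, so this only suits files that are not expected to diverge.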
I've used a few free deduplicators, and haven't had a huge problem with them. I'd work in smaller chunks (directory trees) to start with, if you don't want the computer chugging away for long periods without knowing what it is doing. The first one I tried (DupeLocator, no longer at its original location but possibly around in freeware collections) seemed relatively efficient, finding equal size files and doing some sort of compare. It eliminated the files that were not dupes pretty quickly. Took longer to confirm that the rest were really dupes, but not excessively long. It had the added advantage of "locating" the dupes, and letting you do with them what you please (I drag them into the Recycle Bin most of the time, all except the one I want to keep). It also keeps updating a status so you know how far it has gotten. I suppose if there are thousands of 1GB files, it might take a while.
Surely someone creating a de-dupe utility would make it at least moderately efficient, if not user friendly. I'd guess a program using a less-efficient algorithm with updated status would seem faster than one that just sits there doing a highly optimized algorithm without letting you know what's happening.
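A guess at the general shape of such a tool, as a Python sketch: bucket by size, confirm with a byte-for-byte compare, and report progress as it goes. How DupeLocator actually works is not known here; the function name and the `progress` callback are assumptions for illustration.

```python
import filecmp
import os
from collections import defaultdict


def confirmed_duplicates(paths, progress=None):
    """Group a list of paths into sets of byte-identical files.

    Files are bucketed by size first, so only same-size files are ever
    compared byte for byte. progress, if given, is called as
    progress(done, total) after each file is placed.
    """
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    done, total, result = 0, len(paths), []
    for same_size in by_size.values():
        clusters = []  # each cluster holds files known to be identical
        for p in same_size:
            for cluster in clusters:
                if filecmp.cmp(cluster[0], p, shallow=False):
                    cluster.append(p)
                    break
            else:
                clusters.append([p])  # matches nothing seen so far
            done += 1
            if progress:
                progress(done, total)
        result.extend(c for c in clusters if len(c) > 1)
    return result
```

Passing something like `lambda done, total: print(f"{done}/{total}", end="\r")` as `progress` gives the running status that makes a long job feel less like a hang.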
What you need is a computer.
i never tried it on millions of files though ... about 65k at most ... dunno maybe it has its equivalent at sourceforge somewhere you can compile yourself
o right windows
Free speech was meant to be free for all... how can anyone grow up in a nanny state ?
http://en.wikipedia.org/wiki/CCleaner
http://www.piriform.com/ccleaner
I have used it on multiple TB machines, in both home and work settings. I have used it for special projects targeting file repositories.
It is flexible enough that you can configure it pretty much any way you wish. With a little imagination you should be able to do whatever you need to do with this. I have used it in conjunction with SyncToy to backup, move, etc... using contribute (which can generate duplicates).
It is fast, or at least I never had any problems. Some of the larger searches I did took a while, but that is to be expected. It also has pretty flexible output: you can delete, move, just about anything you want.
A very useful utility.