I would do this:
- Scan the directories of interest, looking at each file's size, and store (size, path) pairs in a `multimap` keyed by file size (see the sketch after this list).
- Then scan the `multimap` for buckets that contain only one item per key, i.e. files whose size is unique; those certainly cannot be duplicates, so discard them.
- Hash the contents of the remaining files and do the same as before (a `multimap` with hashes as keys and paths as values).
- Finally, perform a real (byte-by-byte) comparison only between files that share the same hash.
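A minimal sketch of the first three steps, assuming C++17's `<filesystem>`; the function names (`hash_file`, `candidate_duplicates`) are chosen here for illustration, and the hash is a simple FNV-1a placeholder rather than any particular recommended function:

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <map>

namespace fs = std::filesystem;

// Placeholder content hash (FNV-1a); any fast non-cryptographic hash would do.
std::uint64_t hash_file(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    std::uint64_t h = 14695981039346656037ull;
    char buf[4096];
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        for (std::streamsize i = 0; i < in.gcount(); ++i) {
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 1099511628211ull;
        }
    }
    return h;
}

// Group candidates: first by size, then hash only the files whose size
// is shared with at least one other file.
std::multimap<std::uint64_t, fs::path> candidate_duplicates(const fs::path& root) {
    std::multimap<std::uintmax_t, fs::path> by_size;
    for (const auto& entry : fs::recursive_directory_iterator(root))
        if (entry.is_regular_file())
            by_size.emplace(entry.file_size(), entry.path());

    std::multimap<std::uint64_t, fs::path> by_hash;
    for (auto it = by_size.begin(); it != by_size.end(); ) {
        auto range = by_size.equal_range(it->first);
        if (std::distance(range.first, range.second) > 1) {
            // Size bucket has more than one file: these are still candidates.
            for (auto f = range.first; f != range.second; ++f)
                by_hash.emplace(hash_file(f->second), f->second);
        }
        it = range.second;   // jump to the next size bucket
    }
    return by_hash;
}
```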
This process should be much faster than blindly hashing every file, since most files have different sizes and can be ruled out just by looking at that; and checking the file size is much cheaper than hashing, as it is only a file system attribute lookup rather than a read of the file's entire contents.
The last step is necessary because different files can, in principle, have the same hash; but with a good hash function most of the work is already done, since hash collisions between unrelated files should be really rare.
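The final byte-by-byte check could look roughly like this (a sketch; `identical_contents` is a hypothetical helper name, and the files passed in are assumed to have equal sizes already):

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// True only if the two files contain exactly the same bytes.
// Streams both files in parallel and compares them chunk by chunk.
bool identical_contents(const fs::path& a, const fs::path& b) {
    std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
    char ba[4096], bb[4096];
    while (fa && fb) {
        fa.read(ba, sizeof ba);
        fb.read(bb, sizeof bb);
        if (fa.gcount() != fb.gcount() ||
            !std::equal(ba, ba + fa.gcount(), bb))
            return false;
    }
    return fa.eof() && fb.eof();
}
```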
Note that your hash function does not need to be cryptographically secure, only reasonably fast (I suspect IO will dominate this process anyway).
Also, since you do not actually need a sorted container, you can use `unordered_multimap` instead of `multimap`: it should have faster lookups, and once you know how many files you are dealing with you can call `reserve` with that maximum element count up front, avoiding rehashing during insertion.
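A small sketch of that variant, assuming the file paths were already collected into a vector called `files` (a name chosen here purely for illustration):

```cpp
#include <cstdint>
#include <filesystem>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

// Same size-based grouping, but with an unordered_multimap.
// reserve() pre-allocates buckets for the known number of files,
// so the table never has to rehash while we insert.
std::unordered_multimap<std::uintmax_t, fs::path>
group_by_size(const std::vector<fs::path>& files) {
    std::unordered_multimap<std::uintmax_t, fs::path> by_size;
    by_size.reserve(files.size());     // at most one entry per file
    for (const auto& p : files)
        by_size.emplace(fs::file_size(p), p);
    return by_size;
}
```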