What is the best way to find duplicate files in C++?

I want to find duplicate files in a file system using C++. Is there an algorithm to do this as quickly as possible? And do I need to write a multi-threaded application, or can I just use a single thread for this?

+4
3 answers

I agree with Kerrek SB that there are more efficient tools for this than C++; however, assuming you really need to do it in C++, here are some tips and things to consider when implementing it:

  • use boost::filesystem to traverse the file system

  • hashing the contents of every file is perfectly reasonable, but it can be more efficient to build a multimap first with the file size as the key, and only hash the files whose sizes collide (a sketch of these first two points follows after this list)

  • decide how you want to handle empty files and symbolic links / shortcuts

  • decide how you want to handle special files, e.g. on Unix you have FIFOs, sockets, etc.

  • take into account that files or the directory structure may change, disappear or move while your algorithm is running

  • consider that some files or directories may be inaccessible or broken (for example, recursive directory links)

  • make the number of threads configurable, because the amount of concurrency that makes sense depends on the hardware and disk configuration; it will be different on a simple hard drive than on expensive high-end storage. But don't make assumptions; test it. For example, Linux is very good at caching files, so many of your reads will come from memory and thus won't block on I/O.
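As a rough illustration of the first two points, here is a minimal sketch that walks a tree with boost::filesystem and builds a size-keyed multimap. It assumes Boost.Filesystem is available; the name collect_by_size is made up for the example, error handling is kept to a bare minimum, and symbolic links and special files are simply skipped:

    #include <boost/cstdint.hpp>
    #include <boost/filesystem.hpp>
    #include <iostream>
    #include <map>
    #include <utility>

    namespace fs = boost::filesystem;

    // Walk a directory tree and group regular files by size; a size that occurs
    // only once cannot belong to a duplicate. Symbolic links and special files
    // (fifos, sockets, ...) are simply skipped -- adjust that policy as needed.
    std::multimap<boost::uintmax_t, fs::path> collect_by_size(const fs::path& root)
    {
        std::multimap<boost::uintmax_t, fs::path> by_size;
        fs::recursive_directory_iterator it(root), end;
        boost::system::error_code ec;
        while (it != end) {
            fs::file_status st = fs::symlink_status(it->path(), ec); // do not follow links
            if (!ec && fs::is_regular_file(st)) {
                boost::uintmax_t size = fs::file_size(it->path(), ec);
                if (!ec)
                    by_size.insert(std::make_pair(size, it->path()));
            }
            it.increment(ec);  // on a traversal error, stop instead of throwing
            if (ec)
                break;
        }
        return by_size;
    }

    int main(int argc, char* argv[])
    {
        if (argc < 2) {
            std::cerr << "usage: dupsizes <directory>\n";
            return 1;
        }
        std::multimap<boost::uintmax_t, fs::path> by_size = collect_by_size(argv[1]);
        // Only the size groups with more than one entry need to be hashed later.
        for (auto i = by_size.begin(); i != by_size.end(); ++i)
            if (by_size.count(i->first) > 1)
                std::cout << i->first << '\t' << i->second.string() << '\n';
        return 0;
    }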

+10

1) Do not use C++. All the necessary tools already exist.

2) Hash each file (for example with md5sum) and build an index of file names, file sizes and hash values.*

3) Sort by hash value and look for duplicate pairs of hash value and size (for example with sort).

4) Run an ordinary diff on the candidate duplicates.

You can parallelize step 2) with a bit of work, but you will be limited by the I/O speed of your storage. You can parallelize step 3) by splitting your large index file into chunks, sorting them separately and then merging them (sort -m).

*) As @frankc points out, it is better not to hash every file, but only those whose sizes are not unique. Start with a size-based index. You will have to hash a lot of small files, but only very few large ones.

+8

I would do this:

  • scan the directories you are interested in, looking at each file's size; store the size/path pairs in a multimap, with the file size as the key;
  • then scan the multimap for buckets with only one element per key, i.e. files whose size is unique; those certainly cannot be duplicates;
  • hash the contents of the remaining files and do the same as before (a multimap with the hashes as keys and the paths as values);
  • then perform a real (byte-by-byte) comparison only of the files that share the same hash (a sketch of these hashing and comparison steps follows below).
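A hedged sketch of the hashing and comparison steps, assuming the candidate paths (the files whose size was not unique) have already been collected by the first two steps. The names hash_file, same_content and report_duplicates are illustrative, and 64-bit FNV-1a stands in for whatever fast non-cryptographic hash you prefer:

    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Cheap 64-bit FNV-1a over the file contents -- non-cryptographic, which is
    // fine here because colliding files are re-checked byte by byte afterwards.
    std::uint64_t hash_file(const std::string& path)
    {
        std::ifstream in(path, std::ios::binary);
        std::uint64_t h = 14695981039346656037ULL;
        char buf[4096];
        while (in.read(buf, sizeof buf) || in.gcount() > 0) {
            for (std::streamsize i = 0; i < in.gcount(); ++i) {
                h ^= static_cast<unsigned char>(buf[i]);
                h *= 1099511628211ULL;
            }
        }
        return h;
    }

    // Exact comparison, used only for files whose hashes (and sizes) match.
    bool same_content(const std::string& a, const std::string& b)
    {
        std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
        char ba[4096], bb[4096];
        for (;;) {
            fa.read(ba, sizeof ba);
            fb.read(bb, sizeof bb);
            if (fa.gcount() != fb.gcount()) return false; // shouldn't happen for equal sizes
            if (fa.gcount() == 0) return true;            // both at EOF, contents equal
            if (!std::equal(ba, ba + fa.gcount(), bb)) return false;
        }
    }

    // 'candidates' holds only the files whose size was not unique.
    void report_duplicates(const std::vector<std::string>& candidates)
    {
        std::unordered_multimap<std::uint64_t, std::string> by_hash;
        by_hash.reserve(candidates.size());          // avoid rehashing while inserting
        for (const std::string& p : candidates)
            by_hash.emplace(hash_file(p), p);

        // Elements with equal keys are adjacent, so walk the map group by group.
        for (auto it = by_hash.begin(); it != by_hash.end(); ) {
            auto range = by_hash.equal_range(it->first);
            for (auto a = range.first; a != range.second; ++a)
                for (auto b = std::next(a); b != range.second; ++b)
                    if (same_content(a->second, b->second))
                        std::cout << a->second << " == " << b->second << '\n';
            it = range.second;
        }
    }

Feeding it is just a matter of copying every size bucket that holds more than one path into candidates.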

This process should be much faster than blindly hashing every file, since most files have different sizes and can be ruled out just by looking at that; and checking the file size is much cheaper than hashing the file, since it is just a file-system attribute lookup rather than a read of the entire file contents.

The last step is necessary because different files can have the same hash; but with a good hash function most of the work has already been done, since hash collisions between unrelated files should be really rare.

Note that your hash function does not need to be cryptographically secure, nor even particularly fast (I expect that I/O will dominate this process anyway).

In addition, since you do not actually need a sorted container, you can use an unordered_multimap instead of a multimap, since it should have faster lookups; and once you know how many files you are dealing with, you can call reserve with that maximum number of elements to avoid rehashing.

+4
