Duplicate Search Algorithm

Are there any known algorithms for finding duplicates efficiently?

For example, suppose I have thousands of photos, each with a unique name. There is a chance that duplicates exist in different subfolders. Is using std::map or some other hash map a good idea?

2 answers

If you are working with files, first compare the file sizes, and then compute a hash only for files that have the same size.

Then just compare the file hashes. If they are the same, you have a duplicate file.

There is a trade-off between speed and accuracy: it is possible, although unlikely, for different files to have the same hash. So you can refine the approach: compute a simple, fast hash to find candidate duplicates. When the fast hashes differ, the files are different. When they are equal, compute a second, stronger hash. If the second hashes differ, you just had a false positive. If they are equal again, you most likely have a real duplicate.

In other words:

1. Record the size of every file and check whether any files share the same size.
2. If some do, compute a fast hash for those files and compare the hashes. If they differ, ignore the pair.
3. If they are equal, compute a second hash and compare again. If the second hashes differ, ignore the pair; if they are equal, you have two identical files.

Hashing every file up front would take too long and would be wasted effort if most of your files are different.
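Here is a minimal C++17 sketch of that pipeline. It assumes a folder called "photos", uses std::hash over the first 4 KB of each file as the cheap first-pass hash, and substitutes a byte-by-byte comparison for the "second hash", which rules out false positives entirely:

    #include <algorithm>
    #include <cstdint>
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <unordered_map>
    #include <vector>

    namespace fs = std::filesystem;

    // Cheap first-pass hash: std::hash over the first 4 KB of the file.
    // (A placeholder for a real fast hash.)
    static std::size_t fast_hash(const fs::path& p) {
        std::ifstream in(p, std::ios::binary);
        std::string buf(4096, '\0');
        in.read(&buf[0], static_cast<std::streamsize>(buf.size()));
        buf.resize(static_cast<std::size_t>(in.gcount()));
        return std::hash<std::string>{}(buf);
    }

    // Exact check: byte-by-byte comparison of two files.
    static bool same_content(const fs::path& a, const fs::path& b) {
        std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
        return std::equal(std::istreambuf_iterator<char>(fa), std::istreambuf_iterator<char>(),
                          std::istreambuf_iterator<char>(fb), std::istreambuf_iterator<char>());
    }

    int main() {
        // Step 1: group files by size; a file with a unique size cannot have a duplicate.
        std::unordered_map<std::uintmax_t, std::vector<fs::path>> by_size;
        for (const auto& entry : fs::recursive_directory_iterator("photos"))
            if (entry.is_regular_file())
                by_size[entry.file_size()].push_back(entry.path());

        // Step 2: inside each size group, group by the fast hash,
        // then confirm remaining candidates with an exact comparison.
        for (const auto& [size, files] : by_size) {
            if (files.size() < 2) continue;
            std::unordered_map<std::size_t, std::vector<fs::path>> by_hash;
            for (const auto& f : files)
                by_hash[fast_hash(f)].push_back(f);
            for (const auto& [h, group] : by_hash)
                for (std::size_t i = 0; i + 1 < group.size(); ++i)
                    for (std::size_t j = i + 1; j < group.size(); ++j)
                        if (same_content(group[i], group[j]))
                            std::cout << group[i] << " == " << group[j] << '\n';
        }
    }

In a real program you would plug a proper digest (MD5, SHA-1, etc.) in place of std::hash, but the structure stays the same: size groups first, content checks only inside a group.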


Perhaps you want to hash each object and store the hashes in some kind of table? To check for duplicates, you then only need a quick lookup in that table.
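A small sketch of that idea, again assuming a "photos" folder and using std::hash over the whole file content as a stand-in for a real digest such as MD5 (unlike the answer above, this hashes every file regardless of size):

    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <unordered_map>

    namespace fs = std::filesystem;

    // Placeholder digest: std::hash over the whole file content.
    // A real program would use something like MD5 here.
    static std::size_t content_hash(const fs::path& p) {
        std::ifstream in(p, std::ios::binary);
        std::string data((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
        return std::hash<std::string>{}(data);
    }

    int main() {
        std::unordered_map<std::size_t, fs::path> seen;  // hash -> first file seen with that hash
        for (const auto& entry : fs::recursive_directory_iterator("photos")) {
            if (!entry.is_regular_file()) continue;
            auto [pos, inserted] = seen.emplace(content_hash(entry.path()), entry.path());
            if (!inserted)  // the hash is already in the table: probable duplicate
                std::cout << entry.path() << " looks like a duplicate of " << pos->second << '\n';
        }
    }

std::unordered_map gives average O(1) lookups, which is why a hash table is a better fit here than an ordered std::map.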


Regarding the “known algorithm” for this task, see MD5.

