Difficulty finding duplicate files in bash

I had to write a Bash script today to remove duplicate files, using their MD5 hashes. I stored the hashes as files in a temporary directory:

 for i in * ; do
     hash=$(md5sum "$i" | cut -d " " -f1)
     if [ -f "/tmp/hashes/$hash" ] ; then
         echo "Deleted $i"
         mv "$i" /tmp/deleted
     else
         touch "/tmp/hashes/$hash"
     fi
 done

This worked fine, but it made me wonder: is this a time-efficient way to do it? At first I thought about storing the MD5 hashes in a single file, but then I thought "no", because checking whether a given MD5 is in that file would require re-reading the whole file every time. Now I'm wondering: isn't the same true for the "create files in a directory" method? Does the Bash [ -f ] check have linear or near-constant complexity when there are many files in the same directory?

If it depends on the file system, what is the complexity in tmpfs?

+6
4 answers

I'll try to answer qualitatively how fast file-existence tests are on tmpfs, and then suggest how you can speed up your whole program.

First, tmpfs directory lookups rely (in the kernel) on hash-table lookups in the directory entry cache, which are not very sensitive to the number of files in your directory. They are affected, but only sublinearly, because a well-implemented hash-table lookup takes roughly constant time, O(1), regardless of the number of entries in the table.

To explain, we can look at the work performed by test -f or [ -f X ] from coreutils ( gitweb ):

 case 'e':
   unary_advance ();
   return stat (argv[pos - 1], &stat_buf) == 0;
 ...
 case 'f':                    /* File is a file? */
   unary_advance ();
   /* Under POSIX, -f is true if the given file exists and is a
      regular file. */
   return (stat (argv[pos - 1], &stat_buf) == 0
           && S_ISREG (stat_buf.st_mode));

So it calls stat() on the file name directly. No directory listing is done explicitly by test, but the runtime of stat may depend on the number of files in the directory. How long the stat call takes to complete depends on the implementation of the underlying file system.

For every file system, stat splits the path into directory components and walks it down. For example, for the path /tmp/hashes/the_md5: first /, gets its inode, then looks up tmp inside it and gets that inode (/tmp is a new mount point), then gets the hashes inode, and finally the tested file name and its inode. You can expect the inodes up to /tmp/hashes/ to be cached, because they are repeated at every iteration, so those lookups are fast and most likely require no disk access. Each lookup depends on the file system the parent directory lives on; everything after /tmp/ is looked up on tmpfs (which is all in memory, unless you have run out of memory and need to use swap).
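If you want to watch these lookups happen yourself, one rough way is to trace the external test binary (this assumes strace is installed; on modern glibc the syscall shown is typically newfstatat rather than stat, and the hash value below is just an illustration):

 strace /usr/bin/test -f /tmp/hashes/d41d8cd98f00b204e9800998ecf8427e 2>&1 | grep -i stat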

tmpfs on Linux relies on simple_lookup to get the inode of a file in a directory. tmpfs lives under its old name in the Linux tree, mm/shmem.c. tmpfs, like ramfs, does not seem to implement its own data structures to keep track of directory contents; it simply relies on the VFS directory entry cache (the dentry cache).

So I suspect that looking up a file's inode in a directory is as simple as a hash-table lookup. I would say that as long as all your temporary files fit in memory and you use tmpfs/ramfs, it does not matter how many files there are: each lookup is O(1).

Other file systems, such as ext2/ext3, will incur a penalty linear in the number of files present in the directory.
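If you want to sanity-check this, a minimal (and unscientific) timing sketch along these lines could be used; it assumes /tmp is mounted on tmpfs on your machine, uses GNU seq/xargs, and the file counts are arbitrary:

 mkdir -p /tmp/hashes
 for n in 1000 10000 100000; do
     seq -f "/tmp/hashes/dummy_%g" 1 "$n" | xargs touch    # grow the directory
     echo "with about $n entries:"
     time for i in {1..1000}; do
         [ -f /tmp/hashes/no_such_hash ]                   # 1000 negative lookups
     done
 done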

storing them in memory

As others have suggested, you can also keep the MD5s in memory by storing them in bash variables, avoiding the file-system penalties (and the system calls that go with them). Keeping them on the file system has the advantage that you can resume from where you left off if you have to interrupt your loop (your md5 could be a symbolic link to the file whose digest it matches, which you could rely on in subsequent runs; a sketch of this appears after the code below), but it is slower.

 MD5=d41d8cd98f00b204e9800998ecf8427e
 let SEEN_${MD5}=1

 ...

 digest=$(md5hash_of <filename>)    # md5hash_of stands in for your md5sum | cut pipeline
 let exists=SEEN_$digest
 if [[ "$exists" == 1 ]]; then
     :   # already seen this file
 fi
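The symlink idea mentioned above could look something like this sketch, reusing the directory names from the question (adjust to taste):

 mkdir -p /tmp/hashes /tmp/deleted
 for i in *; do
     hash=$(md5sum "$i" | cut -d " " -f1)
     if [ -e "/tmp/hashes/$hash" ]; then
         echo "Deleted $i"
         mv "$i" /tmp/deleted
     else
         ln -s "$PWD/$i" "/tmp/hashes/$hash"    # records which file owns this digest
     fi
 done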

faster tests

You can also use [[ -f my_file ]] instead of [ -f my_file ]. [[ is a bash keyword (and in bash, [ is itself a builtin), so both are much faster than spawning a new process (/usr/bin/[) for each comparison; the big win is over the external binary. This can make an even bigger difference.
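A quick, unscientific comparison (exact numbers will vary; the last run is noticeably slower because of the fork/exec per test):

 time for i in {1..10000}; do [ -f /etc/passwd ]; done           # bash builtin [
 time for i in {1..10000}; do [[ -f /etc/passwd ]]; done         # bash keyword [[
 time for i in {1..10000}; do /usr/bin/[ -f /etc/passwd ]; done  # external binary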

what is /usr/bin/[

/usr/bin/test and /usr/bin/[ are two different programs, but the source code for [ (lbracket.c) is the same as test.c (again in coreutils):

 #define LBRACKET 1
 #include "test.c"

therefore, they are interchangeable.

+1

I am a fan of using the right tool for the job. In this case, you only want to find duplicate files. I have tested this against several thousand files at my disposal, and re-reading the file did not seem to be a problem. Plus, I noticed that I had hundreds of duplicate files. When I stored hashes in separate files and then processed that large number of files, my system slowed to a crawl at around 10,000 hash files in one directory. Having all the hashes in a single file sped this up considerably.

 # This uses md5deep. An alternative is presented below.
 md5deep -r some_folder > hashes.txt

 # If you do not have md5deep
 find . -type f -exec md5sum {} \; > hashes.txt

It gives you hashes of everything.

 cut -b -32 hashes.txt | sort | uniq -d > dupe_hashes.txt 

This uses cut to get the hash for each file, sorts the hashes, and then finds any duplicated hashes. Those are written to dupe_hashes.txt without the attached file names. Now we need to map the hashes back to the files.

 (for hash in $(cat dupe_hashes.txt); do
     grep "^$hash" hashes.txt | tail -n +2 | cut -b 35-
 done) > dupe_files.txt

This does not run slowly for me. The Linux kernel does a really good job of keeping files like this in memory instead of reading them from disk over and over. If you want to force this to be in memory, you could simply use /dev/shm/hashes.txt instead of hashes.txt. I found that it was unnecessary in my tests.

This gives you every duplicate file. So far, so good. You'll probably want to review this list. If you want to list the original as well, remove the tail -n +2 | bit from the command.

When you're comfortable that every listed file can be deleted, you can pipe things into xargs. This will delete the files in groups of 50.

 xargs -L 50 rm < dupe_files.txt 
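If any of the duplicate paths contain spaces, a delimiter-aware variant (GNU xargs) is a bit safer:

 xargs -d '\n' -n 50 rm -- < dupe_files.txt    # newline-delimited; still breaks on newlines in names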
+2

The choice between reading the contents of a file containing hashes and looking up a hash in a directory of file names that are hashes basically comes down to whether the kernel is quicker at reading a directory or your program is quicker at reading a file. Both involve a linear search per hash, so you end up with much the same behavior. You can probably argue that the kernel should be a little quicker, but the margin won't be large. Note that most often the linear search will be exhaustive, because the hash will not exist (unless you have lots of duplicate files). So if you're processing a few thousand files, the searches will process a few million entries in total, which is quadratic behavior.

If you have many hundreds or thousands of files, you might do better with a two-level hierarchy: for example, a directory containing two-character subdirectories 00..ff, with the rest of the name (or the full name) stored in the subdirectory. A minor variation of this technique is used, for example, in the terminfo directories. The advantage is that the kernel only has to read relatively small directories to find whether a file is present or not.
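A hypothetical sketch of that layout, adapted to the loop from the question (names are illustrative):

 hash=$(md5sum "$i" | cut -d " " -f1)
 prefix=${hash:0:2}                              # first two hex characters, e.g. "d4"
 mkdir -p "/tmp/hashes/$prefix"
 if [ -f "/tmp/hashes/$prefix/${hash:2}" ]; then
     echo "Deleted $i"
     mv "$i" /tmp/deleted
 else
     touch "/tmp/hashes/$prefix/${hash:2}"
 fi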

0

I have not "hashed" this out, but I would try storing your md5sums in a bash hash.

See How to define hash tables in Bash?

Store the md5sum as the key and, if you want, the file name as the value. For each file, just check whether the key already exists in the hash table. If it does, you don't care about the value, but you could use it to print the original duplicate file's name. Then delete the current file (the one with the duplicate key). Not being a bash expert, that's where I would start looking.
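For example, a minimal sketch with a bash 4+ associative array might look like this (paths and messages are illustrative):

 declare -A seen
 for i in *; do
     [ -f "$i" ] || continue                     # skip non-regular files
     hash=$(md5sum "$i" | cut -d " " -f1)
     if [[ -n "${seen[$hash]}" ]]; then
         echo "Duplicate of ${seen[$hash]}: $i"
         mv "$i" /tmp/deleted
     else
         seen[$hash]=$i                          # remember the first file with this digest
     fi
 done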

0
