I will try to give a qualitative answer on how fast file-existence tests are on tmpfs, and then suggest how you can make your whole program run faster.
First, a lookup in a tmpfs directory relies (in the kernel) on hash table lookups in the directory entry cache, which are not very sensitive to the number of files in your directory. They are affected, but sublinearly. This is because a properly implemented hash table lookup takes some constant time, O(1), regardless of the number of elements in the table.
To explain, we can look at the work done by test -f, or [ -f X ], from coreutils (gitweb):
    case 'e':
      unary_advance ();
      return stat (argv[pos - 1], &stat_buf) == 0;
    ...
    case 'f':
      unary_advance ();
      return (stat (argv[pos - 1], &stat_buf) == 0
              && S_ISREG (stat_buf.st_mode));
So it uses stat() on the filename directly. No directory listing is done explicitly by test, but the runtime of stat may be affected by the number of files in the directory. The completion time for the stat call will depend on the underlying filesystem implementation.
For every filesystem, stat will split the path into its directory components and walk it down. For instance, for the path /tmp/hashes/the_md5: first /, gets its inode, then looks up tmp inside it and gets that inode (this is a new mountpoint), then gets the hashes inode, and finally the test filename and its inode. You can expect the inodes up to /tmp/hashes/ to be cached because they are repeated at each iteration, so those lookups are fast and most likely do not require disk access. Each lookup will depend on the filesystem the parent directory lives on. After the /tmp/ portion, lookups happen on tmpfs (which is all in memory, except if you ever ran out of memory and need to use swap).
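If you want to see this for yourself, strace makes that single stat-family call visible. A minimal sketch, assuming strace is installed and using an example path of my own choosing (on recent glibc the call may show up as newfstatat() or statx() rather than stat()):

    # Trace the file-related system calls made by the external test binary.
    # The path below is only an illustrative example.
    strace -e trace=file /usr/bin/test -f /tmp/hashes/d41d8cd98f00b204e9800998ecf8427e

    # The kernel walks /, tmp, hashes and the final component internally;
    # from user space you only see one stat-family call on the full path.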
tmpfs on Linux relies on simple_lookup to get the inode of a file in a directory. tmpfs lives under its old name in the Linux source tree, mm/shmem.c. tmpfs, much like ramfs, does not seem to implement data structures of its own to keep track of virtual data; it simply relies on the VFS directory entry cache (the dentry cache).
Therefore, I suspect the lookup of a file's inode in a directory is as simple as a hash table lookup. I would say that as long as all your temporary files fit in memory, and you use tmpfs/ramfs, it doesn't matter how many files are there: each lookup is O(1).
Other filesystems, such as ext2/3, will however incur a penalty that grows linearly with the number of files present in the directory.
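If you would rather measure this than take my word for it, a rough benchmark could look like the sketch below. The directory names and file counts are arbitrary choices of mine; it assumes /tmp is mounted as tmpfs, and the commented-out last call assumes /var/tmp sits on a disk-backed filesystem on your machine:

    #!/usr/bin/env bash
    # Rough sketch: time 10000 existence tests in a directory holding n files.
    bench() {
      local dir=$1 n=$2
      mkdir -p "$dir"
      # Populate the directory with n dummy entries.
      for ((i = 0; i < n; i++)); do : > "$dir/f$i"; done
      # Time repeated lookups of one existing entry.
      time for ((i = 0; i < 10000; i++)); do [[ -f $dir/f0 ]]; done
    }

    bench /tmp/hashes_bench      100      # tmpfs, nearly empty directory
    bench /tmp/hashes_bench_big  100000   # tmpfs, much fuller directory
    # Compare against a directory on a disk-backed filesystem, e.g.:
    # bench /var/tmp/hashes_bench_big 100000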
storing them in memory
As others have suggested, you may also keep the MD5s in memory, by storing them in bash variables, and avoid the filesystem (and associated syscall) penalties. Storing them on the filesystem has the advantage that you can resume from where you left off if you have to interrupt your loop (your md5 could be a symlink to the file whose digest matches, which you could rely on in subsequent runs), but it is slower.
    # mark a digest as seen
    MD5=d41d8cd98f00b204e9800998ecf8427e
    let SEEN_${MD5}=1
    ...
    # later: check whether a digest has already been seen
    digest=$(md5hash_of <filename>)   # md5hash_of is a placeholder
    let exists=SEEN_$digest
    if [[ "$exists" == 1 ]]; then
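If your bash is recent enough (4.0+), the same idea is cleaner with an associative array. This is a sketch of an alternative, not a drop-in for the snippet above: it uses md5sum and awk instead of the md5hash_of placeholder, and takes the files to check as command-line arguments:

    #!/usr/bin/env bash
    declare -A seen            # associative array keyed by digest (bash 4+)

    for f in "$@"; do
      digest=$(md5sum -- "$f" | awk '{print $1}')
      if [[ -n ${seen[$digest]} ]]; then
        echo "duplicate: $f"
        continue
      fi
      seen[$digest]=$f         # remember the first file carrying this digest
    done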
faster tests
Additionally, you can use [[ -f my_file ]] instead of [ -f my_file ]. The [[ construct is a bash built-in and is much faster than spawning a new process (/usr/bin/[) for every comparison. This will make an even bigger difference.
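To get a feel for the difference on your machine, you could time both forms. In the sketch below I deliberately call the external binary by its full path to force a fork/exec per iteration; /etc/passwd and the 1000-iteration count are arbitrary choices:

    # Built-in test: no new process per iteration.
    time for ((i = 0; i < 1000; i++)); do [[ -f /etc/passwd ]]; done

    # External binary: one fork/exec per iteration.
    time for ((i = 0; i < 1000; i++)); do /usr/bin/[ -f /etc/passwd ]; done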
what is /usr/bin/[
/usr/bin/test and /usr/bin/[ are two different programs, but the source code for [ (lbracket.c) is the same as test.c (again in coreutils):
    #define LBRACKET 1
    #include "test.c"
therefore, they are interchangeable.
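You can check what your own system ships; the sketch below assumes the binaries live in /usr/bin (on some distributions they are in /bin), and /etc/passwd is just an arbitrary existing file:

    type -a [ test                    # bash lists its built-ins first, then the binaries in PATH
    ls -i /usr/bin/[ /usr/bin/test    # two distinct files, built from the same test.c
    /usr/bin/[ -f /etc/passwd ] && echo "the external [ behaves like test -f"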