Determine whether files in a directory have been added, deleted, or changed

I am trying to write a Python script that gets the md5sum of all the files in a directory (on Linux), which I believe I have done in the code below.

I want to be able to run this to make sure that no files in the directory have changed and that no files have been added or deleted.

The problem is that if I change a file in the directory and then change it back, I get a different result from the function below, despite the fact that I reverted the modified file.

Can someone explain this? And let me know if you can think of a workaround.

    import hashlib
    import os
    import tarfile

    def get_dir_md5(dir_path):
        """Build a tar file of the directory and return its md5 sum"""
        temp_tar_path = 'tests.tar'
        t = tarfile.TarFile(temp_tar_path, mode='w')
        t.add(dir_path)
        t.close()
        m = hashlib.md5()
        m.update(open(temp_tar_path, 'rb').read())
        ret_str = m.hexdigest()
        # delete tar file
        os.remove(temp_tar_path)
        return ret_str

Edit: As these lovely people have replied, it looks like tar includes header information such as the modification date. Would using zip, or some other format, work any differently?

Any other ideas for a workaround?

+7
4 answers

As mentioned in the other answers, two tar files can differ even if their contents are the same, either because of changed tar metadata or because of changes in file order. You should run the checksum on the file data directly, sorting the directory listings to make sure they are always in the same order. If you want to include some metadata in the checksum, include it manually.

Untested example using os.walk:

    import hashlib
    import os
    import os.path

    def get_dir_md5(dir_root):
        """Walk the directory tree and return the md5 of all file contents."""
        md5 = hashlib.md5()
        for dirpath, dirnames, filenames in os.walk(dir_root, topdown=True):
            dirnames.sort(key=os.path.normcase)
            filenames.sort(key=os.path.normcase)
            for filename in filenames:
                filepath = os.path.join(dirpath, filename)
                # If some metadata is required, add it to the checksum:
                # 1) filename (good idea):
                #      md5.update(os.path.normcase(
                #          os.path.relpath(filepath, dir_root)).encode())
                # 2) mtime (possibly a bad idea; needs "import struct"):
                #      st = os.stat(filepath)
                #      md5.update(struct.pack('d', st.st_mtime))
                # 3) size (good idea, perhaps):
                #      md5.update(struct.pack('q', st.st_size))
                with open(filepath, 'rb') as f:
                    for chunk in iter(lambda: f.read(65536), b''):
                        md5.update(chunk)
        return md5.hexdigest()
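A minimal usage sketch (the path is a placeholder, not from the question): compute the digest once as a baseline, then compare against it on later runs.

    baseline = get_dir_md5('/path/to/dir')
    # ... later, after files may have been added, removed or edited ...
    if get_dir_md5('/path/to/dir') != baseline:
        print('directory contents changed')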
+8

TAR file headers include a field for the file's modification time; changing a file, even if that change is later reverted, means the TAR headers will differ, which leads to different hashes.
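A small sketch to demonstrate this (all file and archive names here are mine, not from the question): touching a file's mtime without changing its contents is enough to change the archive's hash.

    import hashlib
    import os
    import tarfile

    def tar_md5(dir_path, tar_path):
        # Archive the directory, hash the archive, then clean up.
        with tarfile.open(tar_path, mode='w') as t:
            t.add(dir_path)
        with open(tar_path, 'rb') as f:
            digest = hashlib.md5(f.read()).hexdigest()
        os.remove(tar_path)
        return digest

    os.makedirs('demo', exist_ok=True)
    with open('demo/a.txt', 'w') as f:
        f.write('hello')

    first = tar_md5('demo', 'demo1.tar')
    os.utime('demo/a.txt', (0, 0))  # contents unchanged, only the timestamps move
    second = tar_md5('demo', 'demo2.tar')
    print(first == second)  # False: the mtime is baked into each tar member header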

+7

You do not need to make a tar file to do what you propose.

Here is a workaround algorithm:

  • Go through the directory tree;
  • Take the md5 signature of each file;
  • Sort signatures;
  • Take the md5 of the concatenated text of all the individual file signatures.

The single resulting signature is what you are looking for.
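A rough Python sketch of those four steps (untested, and the function name is mine). Note that it hashes contents only, so a pure rename goes undetected unless you also fold the file paths into the hash.

    import hashlib
    import os

    def dir_signature(dir_root):
        # 1) walk the tree, 2) md5 each file,
        # 3) sort the signatures, 4) md5 their concatenation
        sigs = []
        for dirpath, dirnames, filenames in os.walk(dir_root):
            for filename in filenames:
                with open(os.path.join(dirpath, filename), 'rb') as f:
                    # reads whole files at once; chunk the reads for large files
                    sigs.append(hashlib.md5(f.read()).hexdigest())
        sigs.sort()
        return hashlib.md5(''.join(sigs).encode('ascii')).hexdigest()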

Hell, you don't even need Python. You can do it with:

    find /path/to/dir/ -type f -name '*.py' -exec md5sum {} + \
        | awk '{print $1}' | sort | md5sum
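Note that the one-liner only covers *.py files (drop the -name filter to hash everything), and since awk keeps only the checksum column, renaming a file without changing its contents will not affect the result; sort the full md5sum output instead if renames should count as changes.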
+3

tar files contain metadata beyond the actual file contents, such as file access time, modification time, etc. Even if the file contents do not change, the tar file will in fact be different.
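If you want to see exactly which metadata ends up in the archive, tarfile exposes it per member; a quick sketch ('archive.tar' is a placeholder path):

    import tarfile

    with tarfile.open('archive.tar') as t:
        for member in t.getmembers():
            # name, mode, uid/gid and mtime are all part of each member's header
            print(member.name, member.mode, member.uid, member.gid, member.mtime)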

+1
