I have a backup hard drive which I know contains duplicate files, and I decided it would be a fun project to write a small Python script to find and delete them. I wrote the following code just to walk the disk, calculate the MD5 sum of each file, and compare it against what I'll call my "first encounter" list. If the MD5 is not yet in the list, add it. If the hash is already in the list, delete the current file.
import sys
import os
import hashlib

def checkFile(fileHashMap, file):
    fReader = open(file)
    fileData = fReader.read()
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData
    if fileHash in fileHashMap:
The script runs until it hits a 10 GB file and then throws a MemoryError on the line fileData = fReader.read(). I thought that since I close fReader and mark fileData for deletion after computing the MD5 sum, I wouldn't run into this. How can I calculate the MD5 sums without hitting this memory error?
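My guess is that the fix is something along these lines, feeding the file to hashlib.md5() in chunks via update() instead of reading it all at once (the fileMd5 name and the 4 KiB chunk size are just placeholders of mine), but I'd like confirmation that this actually avoids the problem:

import hashlib

def fileMd5(path, chunkSize=4096):
    # Hash the file in fixed-size chunks so that only one chunk
    # is ever held in memory, instead of the whole file.
    md5 = hashlib.md5()
    with open(path, "rb") as fReader:
        while True:
            chunk = fReader.read(chunkSize)
            if not chunk:
                break
            md5.update(chunk)
    return md5.hexdigest()

The idea is that only one chunk lives in memory at a time, so the size of the file shouldn't matter.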
Edit: I was asked to remove the dictionary and look at memory usage to see if there is a leak in hashlib. Here is the code I ran.
import sys
import os
import hashlib

def checkFile(file):
    fReader = open(file)
    fileData = fReader.read()
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

def main(argv):
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            print("------: " + str(curDir + file))
            checkFile(curDir + file)

if __name__ == "__main__":
    main(sys.argv)
and I'm still getting the MemoryError.