Python memory error while processing files

I have a backup hard drive that I know contains duplicate files, and I decided it would be a fun project to write a small Python script to find and delete them. I wrote the following code just to walk the drive, calculate the md5 sum of each file, and compare it against what I'm going to call my "first collision" list. If the md5 does not exist yet, add it to the list. If the sum already exists, delete the current file.

    import sys
    import os
    import hashlib

    def checkFile(fileHashMap, file):
        fReader = open(file)
        fileData = fReader.read()
        fReader.close()
        fileHash = hashlib.md5(fileData).hexdigest()
        del fileData

        if fileHash in fileHashMap:
            ### Duplicate file.
            fileHashMap[fileHash].append(file)
            return True
        else:
            fileHashMap[fileHash] = [file]
            return False

    def main(argv):
        fileHashMap = {}
        fileCount = 0
        for curDir, subDirs, files in os.walk(argv[1]):
            print(curDir)
            for file in files:
                fileCount += 1
                print("------------: " + str(fileCount))
                print(curDir + file)
                checkFile(fileHashMap, curDir + file)

    if __name__ == "__main__":
        main(sys.argv)

The script processes about 10 GB of files and then throws a MemoryError on the line fileData = fReader.read(). I thought that since I close fReader and del fileData after calculating the md5 sum, I would not run into this. How can I calculate the md5 sums without running into this memory error?

Edit: I was asked to remove the dictionary and look at memory usage to see if there is a leak in hashlib. Here is the code I ran.

    import sys
    import os
    import hashlib

    def checkFile(file):
        fReader = open(file)
        fileData = fReader.read()
        fReader.close()
        fileHash = hashlib.md5(fileData).hexdigest()
        del fileData

    def main(argv):
        for curDir, subDirs, files in os.walk(argv[1]):
            print(curDir)
            for file in files:
                print("------: " + str(curDir + file))
                checkFile(curDir + file)

    if __name__ == "__main__":
        main(sys.argv)

and I'm still getting the MemoryError.

2 answers

Your problem is that you read each entire file into memory at once; the files are too large for your system to hold, so read() raises the error.

As you can see in the official Python documentation for MemoryError:

Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C's malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.

For your purpose you can still use hashlib.md5(), but build the hash incrementally.

Read the file in 4096-byte chunks and pass each chunk to the md5 object:

    def md5(fname):
        hash = hashlib.md5()
        # Open in binary mode so md5 receives bytes; iter() stops at the empty bytes sentinel.
        with open(fname, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash.update(chunk)
        return hash.hexdigest()
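For illustration, here is a minimal sketch of how this chunked hashing could replace the read-everything approach in the question's checkFile(). The function and variable names are taken from the question; the md5sum() helper name and the 4096-byte chunk size are just this sketch's choices:

    import hashlib
    import os
    import sys

    def md5sum(path, chunk_size=4096):
        """Hash a file incrementally so only one chunk is in memory at a time."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def checkFile(fileHashMap, path):
        fileHash = md5sum(path)
        if fileHash in fileHashMap:
            # Duplicate file: record it alongside the first path seen with this hash.
            fileHashMap[fileHash].append(path)
            return True
        fileHashMap[fileHash] = [path]
        return False

    def main(argv):
        fileHashMap = {}
        for curDir, subDirs, files in os.walk(argv[1]):
            for name in files:
                # os.path.join inserts the path separator that curDir + file would drop.
                checkFile(fileHashMap, os.path.join(curDir, name))

    if __name__ == "__main__":
        main(sys.argv)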

Not a solution to the memory problem, but an optimization that may avoid it:

  • small files: calculate the md5 sum, delete duplicates

  • large files: remember only the size and path

  • at the end, calculate md5 sums only for files that share a size, and only when more than one file has that size

Python's collections.defaultdict may be useful for this.
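This is not code from the answer, just a rough sketch of the idea under a couple of assumptions: an arbitrary 1 MB cutoff between "small" and "large" files, and a chunked md5sum() helper like the one in the other answer:

    import hashlib
    import os
    import sys
    from collections import defaultdict

    SMALL_FILE_LIMIT = 1024 * 1024  # 1 MB: arbitrary cutoff for "small" files

    def md5sum(path, chunk_size=4096):
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def find_duplicates(root):
        seen_hashes = {}                   # md5 -> first path seen with that hash
        large_by_size = defaultdict(list)  # file size -> list of paths
        duplicates = []

        for curDir, subDirs, files in os.walk(root):
            for name in files:
                path = os.path.join(curDir, name)
                size = os.path.getsize(path)
                if size <= SMALL_FILE_LIMIT:
                    # Small file: hash it right away.
                    h = md5sum(path)
                    if h in seen_hashes:
                        duplicates.append(path)
                    else:
                        seen_hashes[h] = path
                else:
                    # Large file: postpone hashing, just remember its size.
                    large_by_size[size].append(path)

        # Only hash large files whose size occurs more than once.
        for size, paths in large_by_size.items():
            if len(paths) < 2:
                continue
            for path in paths:
                h = md5sum(path)
                if h in seen_hashes:
                    duplicates.append(path)
                else:
                    seen_hashes[h] = path

        return duplicates

    if __name__ == "__main__":
        for dup in find_duplicates(sys.argv[1]):
            print(dup)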

