I would like to understand the difference in RAM usage between these two methods of reading a large file in Python.
Version 1, found here on Stack Overflow:
    def read_in_chunks(file_object, chunk_size=1024):
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    f = open(file, 'rb')
    for piece in read_in_chunks(f):
        process_data(piece)
    f.close()
Version 2 (this is what I used before I found the code above):
    f = open(file, 'rb')
    while True:
        piece = f.read(1024)
        process_data(piece)
    f.close()
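To see whether there is actually a difference, I thought about measuring the peak memory of each version with tracemalloc, roughly like this (the file name 'big.bin' and the empty process_data are just placeholders for my test, and I am not sure this is even the right way to measure RAM usage):

    import tracemalloc

    def process_data(piece):
        pass  # placeholder: I only care about the memory behaviour here

    tracemalloc.start()
    with open('big.bin', 'rb') as f:
        for piece in read_in_chunks(f):   # or the while-loop from Version 2
            process_data(piece)
    current, peak = tracemalloc.get_traced_memory()
    print('peak traced memory:', peak, 'bytes')
    tracemalloc.stop()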
The file is read piece by piece in both versions, and the current piece can be processed. In the second example, piece gets new content on every iteration of the loop, so I thought that would do the job of not loading the whole file into memory. Is that right?
But I really don't understand what yield does, and I'm sure I'm getting something wrong here. Can anyone explain this to me?
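To get a feeling for it, I played with a tiny generator on its own (count_up_to is just my own toy example, not part of the code above):

    def count_up_to(n):
        # yield hands back one value at a time and then pauses the
        # function until the next value is requested
        i = 1
        while i <= n:
            yield i
            i += 1

    for number in count_up_to(3):
        print(number)  # prints 1, then 2, then 3

It seems to produce the values one at a time instead of building them all first, but I still don't see how that makes read_in_chunks use less (or more) RAM than my plain while loop.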
There is something else that puzzles me besides the method used:
The content of each piece I read is determined by the chunk size, 1 KB in the examples above. But ... what if I need to search for a particular string within the file? Something like "ThisIsTheStringILikeToFind"?
Depending on where the string occurs in the file, one chunk may end up containing "ThisIsTheStr" while the next chunk contains "ingILikeToFind". With this method it is impossible to detect the whole string within any single chunk.
Is there a way to read a file in chunks, but still somehow take care of strings that span chunk boundaries?
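The only idea I have come up with so far is to keep the tail end of the previous chunk and glue it to the next one before searching, roughly like this (find_in_file and the overlap handling are just my own sketch, not something I found anywhere):

    def find_in_file(path, needle, chunk_size=1024):
        # carry over the last len(needle) - 1 bytes (at least 1) so a
        # match that straddles a chunk boundary is not missed
        overlap = max(len(needle) - 1, 1)
        tail = b''
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    return False
                if needle in tail + chunk:
                    return True
                tail = (tail + chunk)[-overlap:]

    found = find_in_file('big.bin', b'ThisIsTheStringILikeToFind')

But I have no idea whether this is how it is normally done, or whether there is a cleaner, standard way.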
Any help or ideas would be appreciated,
greetings!