Reading a file in chunks - RAM usage, searching for strings in binary files

I would like to understand the difference in RAM usage between these methods when reading a large file in Python.

Version 1 found here on stackoverflow:

    def read_in_chunks(file_object, chunk_size=1024):
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    f = open(file, 'rb')
    for piece in read_in_chunks(f):
        process_data(piece)
    f.close()

Version 2, I used this before I found the code above:

    f = open(file, 'rb')
    while True:
        piece = f.read(1024)
        process_data(piece)
    f.close()

In both versions, the file is read one piece at a time, and the current piece can be processed. In the second example, piece gets new content on each loop iteration, so I thought that would be enough to avoid loading the full file into memory.

But I really don't understand what yield does, and I'm sure I'm missing something here. Can anyone explain this to me?


There is something else that puzzles me besides the method used:

The content of each part I read is determined by the chunk size, 1 KB in the examples above. But what if I need to search for a string in the file? Something like "ThisIsTheStringILikeToFind"?

Depending on where the string occurs in the file, one chunk may contain "ThisIsTheStr" and the next chunk "ingILikeToFind". With this method, the whole string can never be detected in any single chunk.

Is there a way to read a file in chunks but still handle strings that span chunk boundaries?

Any help or ideas are appreciated!


1 answer

yield is a keyword in Python used to write generator functions. It means that the next time the generator is iterated, execution resumes from the point where it stopped the last time. The two versions read the file the same way; the only real difference is that the first uses a little more space on the call stack for the generator's state. However, the first version is much more reusable, so from a program-design point of view it is the better choice.
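To make the "resumes where it stopped" behavior concrete, here is a minimal generator (my own toy example, not from the question's code) counting over numbers instead of file chunks:

```python
def count_up_to(n):
    """Generator: yields 1..n, pausing at each yield."""
    i = 1
    while i <= n:
        yield i  # execution pauses here; it resumes on the next iteration
        i += 1

gen = count_up_to(3)
print(next(gen))   # 1
print(next(gen))   # 2 -- resumed after the yield, not from the top
print(list(gen))   # [3] -- whatever values remain
```

read_in_chunks works exactly the same way: each time the for loop asks for the next piece, execution resumes just after the yield, reads one more chunk, and pauses again.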

UPDATE: There is another difference: the first version will stop reading once all the data has been consumed, as it should, but the second will stop only when f.read() or process_data() throws an exception (f.read() simply returns an empty bytes object at end of file, so the loop never exits on its own). For the second version to work properly, you need to change it as follows:

    f = open(file, 'rb')
    while True:
        piece = f.read(1024)
        if not piece:
            break
        process_data(piece)
    f.close()
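As for the second part of the question (a string split across two chunks): one common technique, sketched below under the assumption that the search target is shorter than the chunk size, is to keep the last len(needle) - 1 bytes of each chunk and prepend them to the next one, so a match straddling a chunk boundary is still seen. chunked_find is a hypothetical helper name, not part of any library:

```python
import io

def chunked_find(file_object, needle, chunk_size=1024):
    """Return True if `needle` occurs in the file, reading it in chunks.

    Carries an overlap of len(needle) - 1 bytes between chunks so a
    match split across a chunk boundary is not missed.
    """
    tail = len(needle) - 1
    overlap = b''
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            return False
        if needle in overlap + chunk:
            return True
        # Keep only the bytes that could start a boundary-straddling match
        overlap = (overlap + chunk)[-tail:] if tail else b''

# The needle below starts at offset 1000, so it is split across the
# first two 1024-byte chunks -- exactly the case from the question.
data = b'A' * 1000 + b'ThisIsTheStringILikeToFind' + b'B' * 1000
print(chunked_find(io.BytesIO(data), b'ThisIsTheStringILikeToFind'))  # True
```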
