I have an application that generates several log files, each larger than 500 MB.
I have written several utilities in Python that let me quickly scan a log file and find the data of interest. But now I am getting data sets where the file is too large to load entirely into memory.
Thus, I want to scan the file once, build an index, and then load into memory only the part of the file I want to look at at any given time.
This works for me when I open the file, read it one line at a time, and save the offset with file.tell(). I can then return to that section of the file later with file.seek(offset, 0).
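Roughly, what I am doing looks like this (a minimal sketch; the function names and the binary open mode are just for illustration, so that tell()/seek() are plain byte offsets):

    def build_index(path):
        """Scan the file once, recording the byte offset of each line."""
        offsets = []
        with open(path, 'rb') as f:
            while True:
                offsets.append(f.tell())   # byte offset where the next line starts
                if not f.readline():
                    offsets.pop()          # EOF: drop the trailing offset
                    break
        return offsets

    def read_line_at(path, offset):
        """Jump back to a previously saved offset and re-read that line."""
        with open(path, 'rb') as f:
            f.seek(offset, 0)
            return f.readline()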
My problem is that the log files can contain UTF-8, so I need to open them with the codecs module (codecs.open(<filename>, 'r', 'utf-8')). On the resulting object I can call seek and tell, but their values do not match up.
I assume that codecs must be doing some buffering, or perhaps tell returns character counts instead of byte offsets?
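A minimal repro of the mismatch I am seeing (assuming a UTF-8 log containing some multi-byte characters; the filename is illustrative):

    import codecs

    with codecs.open('app.log', 'r', 'utf-8') as f:
        first = f.readline()
        offset = f.tell()        # expected: start of the second line
        second = f.readline()

    with codecs.open('app.log', 'r', 'utf-8') as f:
        f.seek(offset, 0)
        print(f.readline() == second)   # I expect True, but I get False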
Is there any way around this?