Can seek and tell be made to work with UTF-8 encoded documents in Python?

I have an application that generates several large log files, each > 500 MB.

I have written several utilities in Python that let me quickly browse a log file and find the data of interest. But I now get some data sets where the file is too large to load it all into memory.

I thus want to scan the document once, build an index, and then only load the section of the document I want to look at into memory at any one time.

This works for me when I open the file as a plain file, read it one line at a time, and save the offset with file.tell(). I can then come back to that section of the file later with file.seek(offset, 0).
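Roughly, the version that works looks like this (the file name and is_interesting are placeholders for what I actually use):

def is_interesting(line):
    return b'ERROR' in line            # stand-in for my real filtering

offsets = []
with open('app.log', 'rb') as f:       # binary mode, so tell() is a byte offset
    offset = f.tell()
    line = f.readline()
    while line:
        if is_interesting(line):
            offsets.append(offset)     # remember where this line starts
        offset = f.tell()              # position of the next line
        line = f.readline()

# later: jump straight back to any saved line
with open('app.log', 'rb') as f:
    f.seek(offsets[0], 0)              # assumes at least one hit was recorded
    print(f.readline())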

My problem is that the log files may contain UTF-8, so I need to open them with the codecs module (codecs.open(<filename>, 'r', 'utf-8')). With the resulting object I can call seek and tell, but they do not match up.

I assume that codecs must be doing some buffering, or perhaps tell() returns character counts instead of byte offsets?
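A minimal reproduction of what I am seeing (the file name is a placeholder):

import codecs

with open('utf8.log', 'wb') as out:
    out.write(u'first l\u00edne\nsecond line\n'.encode('utf-8'))

f = codecs.open('utf8.log', 'r', 'utf-8')
f.readline()                  # u'first l\xedne\n'
pos = f.tell()                # the underlying file has read ahead in chunks,
                              # so this is not the offset of the second line
f.seek(pos, 0)
print(repr(f.readline()))     # not u'second line\n'
f.close()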

Is there any way around this?

+5
4 answers

If true, this sounds like a bug or limitation of the codecs module, since it is probably confusing byte offsets with character offsets.

I would use the regular open() function to open the file; then seek()/tell() give you byte offsets that are always consistent. Whenever you want to read, use f.readline().decode('utf-8').

Beware, though, that using f.read() can land you in the middle of a multi-byte character and produce a UTF-8 decode error. readline() will always work.
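To see the hazard on a throwaway byte string:

raw = u'na\u00efve\n'.encode('utf-8')   # the 'ï' becomes the two bytes \xc3\xab

raw.decode('utf-8')                      # whole lines always decode cleanly
try:
    raw[:3].decode('utf-8')              # cut inside 'ï', as f.read(3) would be
except UnicodeDecodeError as err:
    print(err)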

This does not transparently handle the byte order mark for you, but chances are your log files do not have BOMs anyway.

+2

For UTF-8 you don't actually need codecs.open. Instead, it is reliable to read the file as a byte string first and decode an individual section only when you need it (by calling .decode on that string). Splitting the file at line boundaries is safe; the only unsafe place to split would be the middle of a multi-byte character (which you can recognize from its byte values being 128 or higher).
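A quick demonstration on made-up data:

raw = u'first\nse\u00e7ond\nthird\n'.encode('utf-8')    # 'ç' is two bytes here

start = raw.index(b'\n') + 1          # b'\n' is ASCII, so it can never occur
end = raw.index(b'\n', start)         # inside a multi-byte character
print(raw[start:end].decode('utf-8'))                   # -> seçond

# every byte of a multi-byte sequence has its high bit set:
print([hex(b) for b in bytearray(raw) if b >= 128])     # just the two bytes of 'ç'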

+1

Much of what goes on with UTF-8 in Python makes sense if you look at how it was done in Python 3. In your case, it will make quite a bit more sense if you read the "Files" chapter of "Dive Into Python 3": http://diveintopython3.org/files.html

The short of it, though, is that file.seek and file.tell work with byte positions, whereas Unicode characters can take up multiple bytes. Thus, if you do:

f.seek(10)
f.read(1)
f.tell()

you can easily get something other than 11 back from f.tell() (17, say), depending on how many bytes the one character you read occupies and on what the reader has buffered ahead.
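For example, these characters encode to one, two, and three bytes respectively, so reading "one character" can advance the underlying byte position by any of those amounts:

for ch in [u'a', u'\u00e9', u'\u20ac']:           # 'a', 'é', '€'
    print(repr(ch), len(ch.encode('utf-8')))      # 1, 2 and 3 bytes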

+1

To get around the seek/tell problem, don't use codecs.open(). Open the file normally and convert the data to unicode only after you have read it.

tell and seek then work on plain byte offsets, which always round-trip. While scanning, record the byte offset at which each interesting section starts. Later, seek back to the saved offset and read from there.

Keep in mind that a character in UTF-8 can occupy a variable number of bytes (one, two, three, or four).

So build your index in terms of byte positions in the raw file and defer the decoding until you actually read a section; that way the offsets stay consistent.
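A sketch of how that can be packaged (LogIndex is just an illustrative name, not something from the question):

class LogIndex(object):
    """Index a log file by byte offset; decode lines only on access."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        with open(path, 'rb') as f:            # raw bytes, so offsets stay valid
            offset = f.tell()
            line = f.readline()
            while line:
                self.offsets.append(offset)
                offset = f.tell()
                line = f.readline()

    def line(self, n):
        with open(self.path, 'rb') as f:
            f.seek(self.offsets[n], 0)
            return f.readline().decode('utf-8')   # decoding deferred to here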

0
