file.tell() inconsistency

Does anyone know why when you iterate over a file this way:

Input:

    f = open('test.txt', 'r')
    for line in f:
        print "f.tell(): ", f.tell()

Output:

    f.tell(): 8192
    f.tell(): 8192
    f.tell(): 8192
    f.tell(): 8192

I consistently get the wrong file position from tell()? However, if I use readline(), I get the correct position from tell():

Input:

    f = open('test.txt', 'r')
    while True:
        line = f.readline()
        if (line == ''):
            break
        print "f.tell(): ", f.tell()

Output:

    f.tell(): 103
    f.tell(): 107
    f.tell(): 115
    f.tell(): 124

I am running Python 2.7.1, by the way.

3 answers

Iterating over an open file uses a read-ahead buffer to increase efficiency. As a result, the file pointer advances through the file in large steps as you loop over the lines.

From the File Objects documentation:

In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.

If you need to rely on .tell(), do not use the file object as an iterator. Instead, you can turn .readline() into an iterator (at the cost of some performance loss):

    for line in iter(f.readline, ''):
        print f.tell()

This uses the sentinel argument of iter() to turn any callable into an iterator.
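For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same idea. It makes a few assumptions that are not in the answer above: the file is opened in binary mode (so the sentinel is b'' and the code behaves the same on Python 2 and Python 3), the sample data is made up, and the helper name line_end_offsets is hypothetical.

```python
import os
import tempfile


def line_end_offsets(path):
    """Return the value of tell() after each line read via readline()."""
    offsets = []
    with open(path, 'rb') as f:
        # iter(callable, sentinel) keeps calling f.readline() until it
        # returns b'', which is what readline() gives at end of file.
        for line in iter(f.readline, b''):
            offsets.append(f.tell())
    return offsets


# Made-up sample data, written to a temporary file for the demo.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as tmp:
        tmp.write(b'abc\nde\nfghi\n')
    offsets = line_end_offsets(path)
finally:
    os.remove(path)
```

With this sample data, offsets holds the position just past each line rather than a multiple of the buffer size.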


The answer is the following part of the Python 2.7 source code (fileobject.c):

    #define READAHEAD_BUFSIZE 8192

    static PyObject *
    file_iternext(PyFileObject *f)
    {
        PyStringObject* l;

        if (f->f_fp == NULL)
            return err_closed();
        if (!f->readable)
            return err_mode("reading");

        l = readahead_get_line_skip(f, 0, READAHEAD_BUFSIZE);
        if (l == NULL || PyString_GET_SIZE(l) == 0) {
            Py_XDECREF(l);
            return NULL;
        }
        return (PyObject *)l;
    }

As you can see, the file's iterator interface reads the file in blocks of 8 KB. This explains why f.tell() behaves the way it does.

The documentation says this is done for performance reasons (and does not guarantee any particular read-ahead buffer size).
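Since the buffer size is not something you can rely on, one possible alternative is to keep the fast for loop and compute the positions yourself instead of calling tell(). The following is only a sketch under stated assumptions: the file must be opened in binary mode (so len(line) equals the number of bytes consumed), iteration must start at offset 0, and the helper name iter_lines_with_offsets and the sample data are hypothetical.

```python
import os
import tempfile


def iter_lines_with_offsets(f):
    """Yield (start_offset, line) pairs for a file opened in binary mode.

    Assumes iteration starts at the beginning of the file.
    """
    offset = 0
    for line in f:
        yield offset, line
        # In binary mode the line length is exactly the bytes consumed,
        # so the running total is the start of the next line.
        offset += len(line)


# Made-up sample data, written to a temporary file for the demo.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as tmp:
        tmp.write(b'abc\nde\nfghi\n')
    with open(path, 'rb') as f:
        pairs = list(iter_lines_with_offsets(f))
finally:
    os.remove(path)
```

This never consults tell(), so the read-ahead buffer does not matter. The trade-off is that it would give wrong offsets in text mode wherever newline translation or character decoding changes the line length.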


I ran into the same read-ahead buffer problem and solved it using Martin's suggestion.

I have since generalized my solution for anyone who wants to do this kind of thing:

https://github.com/loisaidasam/csv-position-reader

Happy CSV parsing!

