How can I build a list of the starting positions of each line using a for loop and tell()?

All I want to do is create a list of the starting positions of each line so that I can jump to them quickly. I get the error message "telling position disabled by next() call". How can I get around this?

    >>> in_file = open("data_10000.txt")
    >>> in_file.tell()
    0
    >>> line_numbers = [in_file.tell() for line in in_file]
    Traceback (most recent call last):
      File "<pyshell#9>", line 1, in <module>
        line_numbers = [in_file.tell() for line in in_file]
      File "<pyshell#9>", line 1, in <listcomp>
        line_numbers = [in_file.tell() for line in in_file]
    OSError: telling position disabled by next() call

Note: in this context, the index maps each line number to its seek position.

1 answer

A simple generator can solve your problem:

    def line_ind(fileobj):
        # Yield the starting offset of each line, then advance by its length.
        i = 0
        for line in fileobj:
            yield i
            i += len(line)

It yields (generates) the starting index of each line, one by one. A regular function returns a value and stops; a generator yields a value and then continues from where it left off until it is exhausted. So what I did here is yield 0, add the length of the first line to it, yield that, add the length of the second line, and so on. This produces the indices you want.
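To make the behaviour concrete, here is a minimal sketch; the three-line sample and the io.StringIO stand-in are illustrative assumptions, not your actual data:

    import io

    def line_ind(fileobj):
        # Same generator as above: yield the start offset of each line.
        i = 0
        for line in fileobj:
            yield i
            i += len(line)

    # Hypothetical three-line "file" standing in for data_10000.txt.
    sample = io.StringIO("first\nsecond\nthird\n")
    for start in line_ind(sample):
        print(start)    # prints 0, then 6, then 13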

To put the resulting values into a list, you can use list(line_ind(f)), the same way you can use list(range(10)). When you open the file, it is better to do it with a with statement, as shown below. Not just because you will often forget to close the file object (you will), but because it closes the file automatically if an exception occurs. So, with the code below, I build two lists of starting-position indices:

 with open("test.dat", encoding="utf-8") as f: u_ind = list(line_ind(f)) f.seek(0) u = f.read() with open("test.dat", "rb") as f: b_ind = list(line_ind(f)) f.seek(0) b = f.read() 

Note that the indices for Unicode strings can differ from the indices for bytes: an accented character, for example, may occupy two bytes. The first list contains Unicode character indices; you will use it when dealing with the regular (decoded) string representation of your file. The second contains byte offsets. The example below shows how the index values differ between the two cases for the test file:

    >>> u_ind[-10:]
    [24283, 24291, 24300, 24309, 24322, 24331, 24341, 24349, 24359, 24368]
    >>> b_ind[-10:]
    [27297, 27306, 27316, 27326, 27342, 27352, 27363, 27372, 27383, 27393]
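To see why the two sets of numbers diverge, here is a small self-contained sketch; the string is an illustrative assumption, not taken from the test file:

    s = "érték=9,59"
    b = s.encode("utf-8")

    print(len(s))   # 10 -> character count
    print(len(b))   # 12 -> byte count; each 'é' takes 2 bytes in UTF-8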

Now I want the contents of the last line:

    >>> u[24368:]
    'S-érték=9,59'
    >>> b[27393:]
    b'S-\xc3\xa9rt\xc3\xa9k=9,59'

If you want to use seek() before read(), you must stick to the byte indices:

 >>> with open("test.dat", encoding="utf-8") as f: ... f.seek(27393) ... f.read() ... 27393 'S-Γ©rtΓ©k=9,59' >>> with open("test.dat", "rb") as f: ... f.seek(27393) ... f.read() ... 27393 b'S-\xc3\xa9rt\xc3\xa9k=9,59' 

Using 24368 (the character index) in the first case would be a serious mistake here, because seek() offsets refer to positions in the encoded file, not to character counts.
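Coming back to the original goal of jumping straight to a given line, here is a hedged sketch of how the byte-offset list could be used; read_line_at is a hypothetical helper name, and "test.dat" / b_ind are the names from the code above:

    def read_line_at(path, offsets, n):
        # Return line n (0-based) by seeking to its stored byte offset.
        with open(path, "rb") as f:
            f.seek(offsets[n])
            return f.readline().decode("utf-8")

    # Example usage with the byte-offset list built earlier:
    # print(read_line_at("test.dat", b_ind, 42))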

Note that when you read() the contents of a file into a string or bytestring object and want to deal with individual lines after that, it is more convenient to use .splitlines().
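For example, a sketch with a stand-in string, assuming the same kind of content as the file above:

    text = "first line\nsecond line\nS-érték=9,59\n"   # stand-in for u = f.read()
    lines = text.splitlines()   # ['first line', 'second line', 'S-érték=9,59']
    print(lines[-1])            # S-érték=9,59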

Hope this helps!

