A simple generator can solve your problem:
    def line_ind(fileobj):
        i = 0
        for line in fileobj:
            yield i
            i += len(line)
It generates the starting index of each line, one at a time. A regular function returns a value and stops. A generator, by contrast, pauses after yielding a value and resumes where it left off when the next value is requested, until it is exhausted. So what I did here is yield 0, then add the length of the first line and yield that, then add the length of the second line, and so on. This produces the indices you want.
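To see the pause-and-resume behaviour concretely, here is a minimal sketch that runs the same generator over an in-memory file. The `io.StringIO` object and the sample text are stand-ins I made up for illustration, not your data:

```python
import io

def line_ind(fileobj):
    """Yield the starting index of each line in fileobj."""
    i = 0
    for line in fileobj:
        yield i
        i += len(line)

sample = io.StringIO("ab\ncde\nf\n")  # made-up content for illustration
gen = line_ind(sample)

first = next(gen)    # runs the body up to the first yield
second = next(gen)   # resumes, adds len("ab\n"), yields again
rest = list(gen)     # drains whatever is left

print(first, second, rest)
```

Nothing is computed until a value is requested, which is why this also works on files too large to hold in memory.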
To put the resulting values into a list, you can use list(generator()), the same way you can use list(range(10)). When you open the file, it is better to do so using with, as shown below. Not just because it is easy to forget to close the file object (you will), but because with closes it automatically even if an exception occurs. So, with the code below, I get two lists of starting position indices:
    with open("test.dat", encoding="utf-8") as f:
        u_ind = list(line_ind(f))
        f.seek(0)
        u = f.read()

    with open("test.dat", "rb") as f:
        b_ind = list(line_ind(f))
        f.seek(0)
        b = f.read()
Note that the indices for Unicode strings may differ from those for bytes. For example, an accented character may occupy two bytes. The first list contains Unicode character indices; use it when working with the regular string representation of your file. The second list contains byte offsets. The example below shows how the index values differ between the two cases for the test file:
    >>> u_ind[-10:]
    [24283, 24291, 24300, 24309, 24322, 24331, 24341, 24349, 24359, 24368]
    >>> b_ind[-10:]
    [27297, 27306, 27316, 27326, 27342, 27352, 27363, 27372, 27383, 27393]
Now I want the contents of the last line:
    >>> u[24368:]
    'S-érték=9,59'
    >>> b[27393:]
    b'S-\xc3\xa9rt\xc3\xa9k=9,59'
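The same effect can be reproduced without a file on disk. This is only a sketch: the sample text is made up, and `io.StringIO` / `io.BytesIO` stand in for the text-mode and binary-mode file objects:

```python
import io

def line_ind(fileobj):
    """Yield the starting index of each line in fileobj."""
    i = 0
    for line in fileobj:
        yield i
        i += len(line)

text = "érték\nabc\n"          # made-up sample; each "é" takes 2 bytes in UTF-8
data = text.encode("utf-8")

u_ind = list(line_ind(io.StringIO(text)))   # character offsets
b_ind = list(line_ind(io.BytesIO(data)))    # byte offsets

print(u_ind, repr(text[u_ind[-1]:]))
print(b_ind, data[b_ind[-1]:])
```

The two accented characters in the first line push the byte offset of the second line two positions further than its character offset.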
If you want to use seek() before read() , you must adhere to byte indices:
    >>> with open("test.dat", encoding="utf-8") as f:
    ...     f.seek(27393)
    ...     f.read()
    ...
    27393
    'S-érték=9,59'
    >>> with open("test.dat", "rb") as f:
    ...     f.seek(27393)
    ...     f.read()
    ...
    27393
    b'S-\xc3\xa9rt\xc3\xa9k=9,59'
Using 24368 in the first case would be a serious mistake: seek() would jump to byte 24368, which is well before the start of the last line.
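Here is a small sketch of that pitfall on a throwaway file. The file name and contents are hypothetical, chosen so the offsets are easy to check by hand. One caveat: the Python documentation only guarantees text-mode seek() for offsets that came from tell() (or zero), so treat the text-mode part as something that works in practice with UTF-8 rather than as a guarantee:

```python
import os
import tempfile

data = "érték\nabc\n".encode("utf-8")   # made-up sample content

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.dat")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(data)

    # The second line starts at byte 8, but at character 6.
    with open(path, encoding="utf-8") as f:
        f.seek(8)            # byte offset: lands on the second line
        right = f.read()
    with open(path, encoding="utf-8") as f:
        f.seek(6)            # character offset misused as a byte offset
        wrong = f.read()     # starts inside the first line

print(repr(right))
print(repr(wrong))
```

With a less lucky offset, the seek can even land in the middle of a multi-byte character, and decoding may then fail outright.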
Note that when you read() the contents of a file into a string / bytestring object and want to work with individual lines afterwards, it is more convenient to use .splitlines().
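A quick sketch of what that looks like, using made-up sample strings rather than your file:

```python
text = "first\nsecond\r\nthird"      # made-up sample with mixed line endings

lines = text.splitlines()               # line endings stripped
kept = text.splitlines(keepends=True)   # line endings preserved

byte_lines = b"a\nb\n".splitlines()     # works on bytes objects as well

print(lines)
print(kept)
print(byte_lines)
```

If you still need the starting offsets, keepends=True is the variant to use, since the line lengths then include the line endings.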
Hope this helps!