Iterate over the lines of a string

Question


I have a multi-line string defined as follows:

    foo = """
    this is
    a multi-line string.
    """

We use this string as test input for a parser I am writing. The parser function receives a file object as input and iterates over it. It also calls the next() method directly to skip lines, so I really need an iterator as input, not just an iterable. I need an iterator that iterates over the individual lines of that string, as a file object would over the lines of a text file. I could, of course, do it like this:

 lineiterator = iter(foo.splitlines())

Is there a more direct way of doing this? In this scenario the string has to be traversed once for the splitting, and then again by the parser. It doesn't matter in my test case, since the string is very short there; I am just asking out of curiosity. Python has so many useful and efficient built-ins for such things, but I could find nothing that suits this need.
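To make the iterator-versus-iterable requirement concrete, here is a minimal Python 3 sketch. The function skip_first_and_collect is a hypothetical stand-in for the parser, not the asker's actual code; it only shows why a plain list is not enough while iter() over that list is.

```python
foo = """
this is
a multi-line string.
"""

def skip_first_and_collect(lines):
    """Hypothetical parser stand-in: skip one line with next(), then read the rest."""
    next(lines)  # requires an iterator; a plain list raises TypeError here
    return [line for line in lines]

# iter() turns the list returned by splitlines() into an iterator
line_iterator = iter(foo.splitlines())
result = skip_first_and_collect(line_iterator)

# Passing the list itself fails, because a list is iterable but not an iterator
try:
    skip_first_and_collect(foo.splitlines())
    list_worked = True
except TypeError:
    list_worked = False
```

The first element produced by splitlines() here is the empty line that follows the opening triple quote, so that is the line the sketch skips.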

+96

python iterator string

Björn Pollex Jun 16 '10 at 15:13


5 answers

I'm not sure what you mean by "then again by the parser". After the splitting has been done, there is no further traversal of the string, only a traversal of the list of split strings. This will probably be the fastest way to accomplish this, as long as your string isn't absolutely huge. The fact that Python uses immutable strings means that you must always create a new string, so this has to be done at some point anyway.

If your string is very large, the drawback is memory usage: you will have the original string and a list of split strings in memory at the same time, doubling the memory required. An iterator approach can save you this by building each string as needed, although it still pays the "splitting" penalty. However, if your string is that large, you generally want to avoid having even the unsplit string in memory. It would be better just to read the string from a file, which already allows you to iterate over it line by line.

However, if you already have a huge string in memory, one approach would be to use StringIO, which presents a file-like interface to a string, including iteration by line (internally using .find to locate the next newline). You then get:

    # Python 2; in Python 3 the equivalent class is io.StringIO
    import StringIO
    s = StringIO.StringIO(myString)
    for line in s:
        do_something_with(line)
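For readers on Python 3, a rough equivalent of the StringIO approach uses io.StringIO. The sample string below is made up for illustration. The sketch shows that the object is its own iterator, so next() can be called on it directly, and that each line keeps its trailing newline unless it is stripped.

```python
import io

# Hypothetical sample input, not taken from the question
my_string = "first line\nsecond line\nthird line\n"

stream = io.StringIO(my_string)

# File-like objects are their own iterators, so next() works directly
header = next(stream)

# Lines keep their trailing newline, so strip it if it is not wanted
remaining = [line.rstrip("\n") for line in stream]
```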

+44

Brian Jun 16 '10 at 15:46


If I read Modules/cStringIO.c correctly, this should be quite efficient (although somewhat verbose):

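    from cStringIO import StringIO  # Python 2 only

    def iterbuf(buf):
        stri = StringIO(buf)
        while True:
            nl = stri.readline()
            if nl != '':
                yield nl.strip()
            else:
                # Python 2 idiom; under PEP 479 (Python 3.7+) a generator
                # must use a plain `return` here instead
                raise StopIteration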

+3

Jacob Oscarson Jun 16


Regex-based searches are sometimes faster than a generator approach:

    import re

    RRR = re.compile(r'(.*)\n')

    def f4(arg):
        return (i.group(1) for i in RRR.finditer(arg))
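One thing to check with a pattern of this shape: because it requires a newline after each line, a final line without a trailing newline is not matched. The Python 3 sketch below shows that behaviour and one possible adjustment. The names regex_lines, regex_lines_keep_tail, and TAIL_SAFE_RE are hypothetical, and the adjusted pattern is an assumption of this sketch, not something proposed in the answer.

```python
import re

LINE_RE = re.compile(r'(.*)\n')
# Variant: a line ending in a newline, or a non-empty tail without one
TAIL_SAFE_RE = re.compile(r'.*\n|.+')

def regex_lines(text):
    """Same idea as the answer: yield the text before each newline."""
    return (m.group(1) for m in LINE_RE.finditer(text))

def regex_lines_keep_tail(text):
    """Also yield a final line that has no trailing newline."""
    return (m.group(0).rstrip('\n') for m in TAIL_SAFE_RE.finditer(text))

with_newline = list(regex_lines("a\nb\n"))
without_newline = list(regex_lines("a\nb"))          # the tail "b" is dropped
tail_kept = list(regex_lines_keep_tail("a\nb"))
```

For the multi-line string in the question, which ends with a newline, both patterns give the same lines.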

+3

socketpair Jun 26 '17 at 13:39


I suppose you could roll your own:

    def parse(string):
        retval = ''
        for char in string:
            retval += char if not char == '\n' else ''
            if char == '\n':
                yield retval
                retval = ''
        if retval:
            yield retval

I'm not sure how efficient this implementation is, but it will only iterate over your string once.

Mmm, generators.

Edit:

Of course, you'll also want to add whatever parsing actions you need, but that's pretty simple.

+1

Wayne Werner Jun 16 '10 at 15:23


Alex Martelli · Accepted Answer · 2010-06-16 15:38

Here are three options:

    foo = """
    this is
    a multi-line string.
    """

    def f1(foo=foo):
        return iter(foo.splitlines())

    def f2(foo=foo):
        retval = ''
        for char in foo:
            retval += char if not char == '\n' else ''
            if char == '\n':
                yield retval
                retval = ''
        if retval:
            yield retval

    def f3(foo=foo):
        prevnl = -1
        while True:
            nextnl = foo.find('\n', prevnl + 1)
            if nextnl < 0:
                break
            yield foo[prevnl + 1:nextnl]
            prevnl = nextnl

    if __name__ == '__main__':
        for f in f1, f2, f3:
            print list(f())

Running this as the main script confirms that the three functions are equivalent. With timeit (and a * 100 for foo to get substantial strings for more precise measurement):

    $ python -mtimeit -s'import asp' 'list(asp.f3())'
    1000 loops, best of 3: 370 usec per loop
    $ python -mtimeit -s'import asp' 'list(asp.f2())'
    1000 loops, best of 3: 1.36 msec per loop
    $ python -mtimeit -s'import asp' 'list(asp.f1())'
    10000 loops, best of 3: 61.5 usec per loop

Note that we need the list() call to ensure the iterators are traversed, not just built.

IOW, the naive implementation is so much faster that it isn't even funny: 6 times faster than my attempt with find calls, which in turn is 4 times faster than the lower-level approach.

Lessons to retain: measurement is always a good thing (but it must be accurate); string methods such as splitlines are implemented in very fast ways; putting strings together by programming at a very low level (especially with loops of += on very small pieces) can be quite slow.
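The same kind of measurement can be repeated from inside a script with the timeit module. The sketch below is a Python 3 adaptation that compares only f1 and f3, on the question's string repeated 100 times. The call count of 1000 is an arbitrary choice, and the numbers it prints depend on the machine and interpreter, so they should not be expected to match the figures above.

```python
import timeit

foo = """
this is
a multi-line string.
""" * 100

def f1(text=foo):
    return iter(text.splitlines())

def f3(text=foo):
    prevnl = -1
    while True:
        nextnl = text.find('\n', prevnl + 1)
        if nextnl < 0:
            break
        yield text[prevnl + 1:nextnl]
        prevnl = nextnl

# Check equivalence on this input before comparing speed
results_match = list(f1()) == list(f3())

calls = 1000  # arbitrary number of repetitions for this sketch
for func in (f1, f3):
    # list() forces the iterator to be traversed, not just built
    seconds = timeit.timeit(lambda: list(func()), number=calls)
    print(f"{func.__name__}: {seconds / calls * 1e6:.1f} usec per call")
```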

Edit: added @Jacob's proposal, slightly modified to give the same results as the others (trailing blanks on a line are kept), i.e.:

    from cStringIO import StringIO

    def f4(foo=foo):
        stri = StringIO(foo)
        while True:
            nl = stri.readline()
            if nl != '':
                yield nl.strip('\n')
            else:
                raise StopIteration

Measurement gives:

    $ python -mtimeit -s'import asp' 'list(asp.f4())'
    1000 loops, best of 3: 406 usec per loop

This is not quite as good as the .find-based approach. Still, it is worth keeping in mind because it might be less prone to small off-by-one bugs (any loop where you see occurrences of +1 and -1, like my f3 above, should automatically trigger off-by-one suspicions, and so should many loops that lack such tweaks and should have them, though I believe my code is also right since I was able to check its output against the other functions).

But the split-based approach still rules.
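The off-by-one concern can be made concrete with an edge case that the benchmark input does not exercise: a string whose last line has no trailing newline. A find-based loop shaped like f3 stops before that final fragment, while splitlines() still returns it. The Python 3 sketch below shows the difference and one possible adjustment. The names find_lines_f3_style and find_lines_keep_tail are hypothetical, and the adjusted version is an illustration rather than part of the original answer.

```python
def find_lines_f3_style(text):
    """Same loop shape as f3: yields only text that is followed by a newline."""
    prevnl = -1
    while True:
        nextnl = text.find('\n', prevnl + 1)
        if nextnl < 0:
            break
        yield text[prevnl + 1:nextnl]
        prevnl = nextnl

def find_lines_keep_tail(text):
    """Variant that also yields a final line lacking a trailing newline."""
    start = 0
    while True:
        end = text.find('\n', start)
        if end < 0:
            break
        yield text[start:end]
        start = end + 1
    if start < len(text):
        yield text[start:]  # leftover text after the last newline

sample = "a\nb"  # no newline after the last line
f3_style = list(find_lines_f3_style(sample))
keep_tail = list(find_lines_keep_tail(sample))
reference = sample.splitlines()
```

For input that ends with a newline, such as the string in the question, both loops produce the same lines.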

An aside: possibly a better style for f4 would be:

    from cStringIO import StringIO

    def f4(foo=foo):
        stri = StringIO(foo)
        while True:
            nl = stri.readline()
            if nl == '':
                break
            yield nl.strip('\n')

At least it's a bit less verbose. The need to strip trailing \n characters unfortunately prohibits the clearer and faster replacement of the while loop with return iter(stri) (the iter part of which is redundant in modern versions of Python, I believe since 2.3 or 2.4, but it's also innocuous). It may also be worth trying:

  return itertools.imap(lambda s: s.strip('\n'), stri)

or variations thereof, but I'm stopping here since this is pretty much a theoretical exercise compared with the split-based approach, the simplest and fastest one.
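For Python 3, where itertools.imap no longer exists and the built-in map is already lazy, the same idea might be sketched as follows. The name f4_py3 is hypothetical, and rstrip('\n') is used here in place of strip('\n') since only the trailing newline needs to go. No timing claim is made for this version.

```python
import io

foo = """
this is
a multi-line string.
"""

def f4_py3(text=foo):
    """Lazily yield the lines of text without their trailing newline."""
    stream = io.StringIO(text)
    # map() returns an iterator in Python 3, so next() can be called on it
    return map(lambda s: s.rstrip('\n'), stream)

lines = f4_py3()
first = next(lines)   # skipping a line works as it would on a file object
rest = list(lines)
```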
