Length of generator output

I have these two implementations for computing the length of a generator's output while also saving the data for further processing:

    import itertools

    def count_generator1(generator):
        '''- build a list with the generator data
        - get the length of the data
        - return both the length and the original data (in a list)
        WARNING: the memory use is unbounded, and infinite generators will block this'''
        l = list(generator)
        return len(l), l

    def count_generator2(generator):
        '''- get two generators from the original generator
        - get the length of the data from one of them
        - return both the length and the original data, as returned by tee
        WARNING: tee can use up an unbounded amount of memory, and infinite generators will block this'''
        for_length, saved = itertools.tee(generator, 2)
        return sum(1 for _ in for_length), saved
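
For reference, a minimal usage sketch (the example generator is my own, not from the question), assuming a small finite input:

    def squares():
        # tiny illustrative generator
        for i in range(5):
            yield i * i

    # count_generator1 returns the length plus the data as a plain list
    length, data = count_generator1(squares())
    print(length, data)         # 5 [0, 1, 4, 9, 16]

    # count_generator2 returns the length plus a tee'd iterator over the same data
    length, saved = count_generator2(squares())
    print(length, list(saved))  # 5 [0, 1, 4, 9, 16]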

Both have flaws, both do the job. Can someone comment on them or even suggest a better alternative?

+6
python generator
Aug 2 '13 at 10:14
2 answers

If you need to do this, the first way is much better: since you consume all the values anyway, itertools.tee() would have to store all of them internally, which means a list will be more efficient.

To quote the docs:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
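
To make the docs' point concrete, here is a minimal sketch (with an illustrative generator, not from the answer) of why list() wins when one iterator is drained before the other starts: tee() has to buffer every item internally anyway, so a plain list holds the same data with less overhead.

    import itertools

    gen = (x * x for x in range(1000))

    # Draining one tee'd iterator first forces tee() to buffer all
    # 1000 items for the other iterator anyway...
    a, b = itertools.tee(gen)
    length = sum(1 for _ in a)   # consumes a, fills b's internal buffer

    # ...so materializing a list up front is simpler and faster here
    gen = (x * x for x in range(1000))
    data = list(gen)
    length = len(data)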

+11
Aug 02 '13 at 10:18

I ran timeit on 64-bit Windows, Python 3.4.3, against the approaches I could think of:

    >>> from timeit import timeit
    >>> from textwrap import dedent as d
    >>> timeit(
    ...     d("""
    ...     count = -1
    ...     for _ in s:
    ...         count += 1
    ...     count += 1
    ...     """),
    ...     "s = range(1000)",
    ... )
    50.70772041983173
    >>> timeit(
    ...     d("""
    ...     count = -1
    ...     for count, _ in enumerate(s):
    ...         pass
    ...     count += 1
    ...     """),
    ...     "s = range(1000)",
    ... )
    42.636973504498656
    >>> timeit(
    ...     d("""
    ...     count, _ = reduce(f, enumerate(range(1000)), (-1, -1))
    ...     count += 1
    ...     """),
    ...     d("""
    ...     from functools import reduce
    ...     def f(_, count):
    ...         return count
    ...     s = range(1000)
    ...     """),
    ... )
    121.15513102540672
    >>> timeit("count = sum(1 for _ in s)", "s = range(1000)")
    58.179126025925825
    >>> timeit("count = len(tuple(s))", "s = range(1000)")
    19.777029680237774
    >>> timeit("count = len(list(s))", "s = range(1000)")
    18.145157531932
    >>> timeit("count = len(list(1 for _ in s))", "s = range(1000)")
    57.41422175998332

Horrifyingly, the fastest approach was to use list (not even tuple) to consume the iterator and get the length from there:

 >>> timeit("count = len(list(s))", "s = range(1000)") 18.145157531932 

Of course, this comes with memory trade-offs. The best low-memory alternative was to use enumerate with a no-op for loop:

    >>> timeit(
    ...     d("""
    ...     count = -1
    ...     for count, _ in enumerate(s):
    ...         pass
    ...     count += 1
    ...     """),
    ...     "s = range(1000)",
    ... )
    42.636973504498656

Hurrah!
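
Putting the two results together, a rough sketch (the function names are my own) of how you might choose: the enumerate trick when you only need the count, and len(list(...)) when you also want to keep the data.

    def count_only(iterable):
        # low-memory count: consumes the iterable without storing it
        count = -1
        for count, _ in enumerate(iterable):
            pass
        return count + 1

    def count_and_save(iterable):
        # fastest in the benchmarks above, but stores everything in memory
        data = list(iterable)
        return len(data), data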

+2
Jul 10 '15 at 21:17


