Efficient use of Python generators in a tight double for loop over multiple numpy arrays

The bottleneck in my code is a tight double for loop over elements from two arrays, x and y. The standard HPC performance trick is to loop through chunks to avoid cache misses. I'm trying to use Python generators to do the chunking, but having to constantly recreate the spent generator in the outer loop kills my runtime.

Question:

Is there a smarter way to construct a generator for this kind of double-loop iteration?

Concrete illustration:

I will create two dummy arrays, x and y. I keep them short here for illustration, but in practice these are numpy arrays with ~1e6 elements each.

    import numpy as np

    x = np.array(['a', 'b', 'b', 'c', 'c', 'd'])
    y = np.array(['e', 'f', 'f', 'g'])

A naive double loop would be:

    for xletter in x:
        for yletter in y:
            # algebraic manipulations on x & y

Now let me use generators to run this loop in chunks:

    chunk_size = 3
    xchunk_gen = (x[i: i+chunk_size] for i in range(0, len(x), chunk_size))
    for xchunk in xchunk_gen:
        ychunk_gen = (y[i: i+chunk_size] for i in range(0, len(y), chunk_size))
        for ychunk in ychunk_gen:
            for xletter in xchunk:
                for yletter in ychunk:
                    # algebraic manipulations on x & y

Note that in order to implement the generator solution for this problem, I have to recreate ychunk_gen on every pass of the outer loop. Since y is a large array, this kills my runtime (for ~1e6 elements, creating this generator takes 20 ms on my laptop).

Is there a smarter way to build my generators that circumvents this problem? Or will I just have to abandon the generator approach altogether?

(Note: in practice I use Cython to execute this tight loop, but everything above applies regardless.)

3 answers

In my opinion, creating the generator expression is what kills your runtime, because it is not optimized by Cython.

The better solution, which takes care of all the caching optimization issues, is to use numexpr. Since your manipulation of x and y is algebraic, it should fit your constraints well (numexpr can do a bit more).
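A minimal sketch of what that could look like, assuming numeric arrays and a hypothetical algebraic kernel xc**2 + 3*yc (the kernel and names are illustrative, not from the question). numexpr supports NumPy-style broadcasting, so the pairwise combination can be expressed with reshaped views; note that for ~1e6 elements the full pairwise result would not fit in memory, so in practice you would fold a reduction into the expression or block the computation:

    import numpy as np
    import numexpr as ne

    x = np.random.rand(2000)
    y = np.random.rand(2000)

    # numexpr compiles the expression once and evaluates it in cache-sized
    # blocks across multiple threads, so no manual chunking is needed.
    result = ne.evaluate("xc**2 + 3*yc",
                         local_dict={"xc": x[:, None], "yc": y[None, :]})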


You define ychunk_gen again on every pass of the xchunk loop. Perhaps the following will be faster:

    chunk_size = 3
    xchunk_gen = (x[i: i+chunk_size] for i in range(0, len(x), chunk_size))

    def ychunk_gen(some_dependency_on_outer_loop):
        # use some_dependency_on_outer_loop
        for i in range(0, len(y), chunk_size):
            yield y[i: i+chunk_size]

    for xchunk in xchunk_gen:
        for ychunk in ychunk_gen(chunk_or_something_else):
            for xletter in xchunk:
                for yletter in ychunk:
                    # algebraic manipulations on x & y
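If the y chunks do not actually depend on the outer loop, a simpler variant (my addition, not part of this answer) is to materialize the chunk views once. NumPy slices are views, so the list costs little memory and can be reused on every outer pass:

    # Sketch: build the chunk views once, outside the outer loop.
    ychunks = [y[i: i+chunk_size] for i in range(0, len(y), chunk_size)]

    for xchunk in xchunk_gen:
        for ychunk in ychunks:      # no regeneration per outer pass
            for xletter in xchunk:
                for yletter in ychunk:
                    pass            # algebraic manipulations on x & y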

But maybe there is an even better way:

I assume x and y are numpy arrays, so you can reshape the arrays and then iterate over the rows:

    for xchunk in x.reshape((len(x)//chunk_size, chunk_size)):
        for ychunk in y.reshape((len(y)//chunk_size, chunk_size)):
            # the letter loops

In http://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html I read that if you want to be sure the data is not copied by reshape, you should assign to the shape attribute directly:

    x.shape = len(x)//chunk_size, chunk_size
    y.shape = len(y)//chunk_size, chunk_size
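One caveat (my addition, not part of this answer): both variants assume the array lengths are exact multiples of chunk_size; with len(y) == 4 and chunk_size == 3, as in the question, the reshape fails. A sketch that splits off the remainder as a final short chunk:

    import numpy as np

    chunk_size = 3
    y = np.array(['e', 'f', 'f', 'g'])           # len 4: not a multiple of 3

    n_full = (len(y) // chunk_size) * chunk_size
    y_rows = y[:n_full].reshape(-1, chunk_size)  # a view: no data is copied
    y_tail = y[n_full:]                          # leftover short chunk

    chunks = list(y_rows)
    if len(y_tail):
        chunks.append(y_tail)

    for ychunk in chunks:
        ...                                      # the letter loop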

itertools.tee can provide modest time savings:

    import itertools
    import numpy as np

    y = np.arange(1000)

    def foo1(y):
        # recreate ygen on each pass of the outer loop
        # (py3, so range is itself lazy)
        for j in range(100):
            ygen = (y[i: i+10] for i in range(0, 1000, 10))
            r = [x.sum() for x in ygen]
        return r

    def foo3(y):
        # use tee to replicate the generator
        ygen = (y[i: i+10] for i in range(0, 1000, 10))
        ygens = itertools.tee(ygen, 100)
        for g in ygens:
            r = [x.sum() for x in g]
        return r

    In [1123]: timeit foo3(y)
    10 loops, best of 3: 108 ms per loop
    In [1125]: timeit foo1(y)
    10 loops, best of 3: 144 ms per loop
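Worth noting (my observation, not the answerer's): because the teed iterators are consumed one after another, tee ends up buffering every chunk internally anyway, so simply materializing the slice views in a list once should be at least as fast and is simpler:

    def foo2(y):
        # build the chunk views once; numpy slices are views, so this is cheap
        ychunks = [y[i: i+10] for i in range(0, 1000, 10)]
        for _ in range(100):
            r = [x.sum() for x in ychunks]
        return r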

But based on http://docs.cython.org/0.15/src/userguide/limitations.html#generators-and-generator-expressions :

Since Cython 0.13, some generator expressions are supported when they can be transformed into inlined loops in combination with builtins, e.g. sum(x*2 for x in seq). As of 0.14, the supported builtins are list(), set(), dict(), sum(), any(), all(), sorted().

I wonder what Cython does with your chunked generator expressions.


Repeatedly iterating over the rows of a reshaped array does not help much with the timing either:

    def foo4(y):
        y2d = y.reshape(100, 10)
        for _ in range(100):
            r = [x.sum() for x in y2d]
        return r

It is slightly slower than the teed generator. Of course, relative timings like this can vary with the size of the array.

