The bottleneck in my code is a tight double for-loop over the elements of two arrays, x and y. The standard HPC trick is to loop over the arrays in chunks to reduce cache misses. I'm trying to use Python generators to do the chunking, but having to recreate the spent inner generator on every pass of the outer loop kills my runtime.
Question:
Is there a smarter way to construct a generator for performing a double loop in chunks?
Concrete illustration:
I will create two dummy arrays, x and y. I keep them short here for illustration, but in practice these are numpy arrays with ~1e6 elements.
import numpy as np

x = np.array(['a', 'b', 'b', 'c', 'c', 'd'])
y = np.array(['e', 'f', 'f', 'g'])
A naive double loop would be:
for xletter in x:
    for yletter in y:
        pass  # algebraic manipulations on x & y
Now let me use generators to perform this loop in chunks:
chunk_size = 3

xchunk_gen = (x[i: i+chunk_size] for i in range(0, len(x), chunk_size))
for xchunk in xchunk_gen:
    ychunk_gen = (y[i: i+chunk_size] for i in range(0, len(y), chunk_size))
    for ychunk in ychunk_gen:
        for xletter in xchunk:
            for yletter in ychunk:
                pass  # algebraic manipulations on x & y
Note that in order to implement the generator solution for this problem, I have to recreate ychunk_gen on every pass of the outer loop. Since y is a large array, this kills my runtime (for ~1e6 elements, recreating this generator takes about 20 ms on my laptop).
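For comparison, here is a minimal sketch of the obvious fallback, precomputing the y chunks once into a list and reusing that list in every outer iteration (assuming the list of array views fits in memory):

import numpy as np

x = np.array(['a', 'b', 'b', 'c', 'c', 'd'])
y = np.array(['e', 'f', 'f', 'g'])
chunk_size = 3

# Precompute the y chunks once; numpy basic slicing returns views,
# so this list is cheap in memory even for large y.
ychunks = [y[i: i+chunk_size] for i in range(0, len(y), chunk_size)]

xchunk_gen = (x[i: i+chunk_size] for i in range(0, len(x), chunk_size))
for xchunk in xchunk_gen:
    for ychunk in ychunks:  # reuse the precomputed chunks instead of a fresh generator
        for xletter in xchunk:
            for yletter in ychunk:
                pass  # algebraic manipulations on x & y

This is only meant to make concrete what "not recreating the generator" would look like; it trades the lazy generator for an eagerly built list of views.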
Is there a way to be smarter in how I build my generators that circumvents this problem? Or will I simply have to abandon the generator approach?
(Note: in practice I use Cython to execute this tight loop, but everything above applies regardless.)