Code Optimization - Number of function calls in Python

I would like to know how I could restructure this problem to reduce the overhead of repeatedly calling np.sum() in my code.

I have an input matrix, for example of shape=(1000, 36). Each row represents a node in a graph. The operation I perform iterates over each row and does an element-wise addition with a variable number of other rows. These "other" rows are defined in the nodes_nbrs dictionary, which records, for each row, the list of rows that should be summed together. An example is this:

 nodes_nbrs = {0: [0, 1], 1: [1, 0, 2], 2: [2, 1], ...} 

Here, node 0 will be replaced by the sum of nodes 0 and 1; node 1 will be replaced by the sum of nodes 1, 0, and 2; and so on for the rest of the nodes.
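
To make that concrete, here is a tiny worked example with a made-up 3x2 input (not my real data):

    import numpy as np

    input = np.array([[1., 2.],
                      [3., 4.],
                      [5., 6.]])
    nodes_nbrs = {0: [0, 1], 1: [1, 0, 2], 2: [2, 1]}

    # output row 0 = input[0] + input[1]            -> [4.,  6.]
    # output row 1 = input[1] + input[0] + input[2] -> [9., 12.]
    # output row 2 = input[2] + input[1]            -> [8., 10.]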

The current (and naive) way I have implemented this is as follows. First, I create a zero array of the final shape that I want, and then I iterate over each key-value pair in the nodes_nbrs dictionary:

    output = np.zeros(shape=input.shape)
    for k, v in nodes_nbrs.items():
        output[k] = np.sum(input[v], axis=0)

In small tests (shape=(1000, 36)) this code runs nice and fast, but in larger tests (shape=(~1e5-1e6, 36)) it takes ~2-3 seconds. I have to do this operation thousands of times, so I'm trying to see if there is a more optimized way to do it.

After doing line profiling, I noticed that the key killer here is calling np.sum over and over again, which takes about 50% of the total time. Is there a way to eliminate this overhead? Or is there another way to optimize this?
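
For reference, the overhead also shows up under a plain cProfile run (a minimal sketch with a synthetic neighbors dictionary, not my actual profiling session):

    import cProfile
    import numpy as np

    input = np.random.rand(100000, 36)
    # synthetic neighbor lists (each node plus its successor), just for profiling
    nodes_nbrs = {i: [i, (i + 1) % input.shape[0]] for i in range(input.shape[0])}

    def naive(input, nodes_nbrs):
        output = np.zeros(shape=input.shape)
        for k, v in nodes_nbrs.items():
            output[k] = np.sum(input[v], axis=0)
        return output

    cProfile.run('naive(input, nodes_nbrs)', sort='cumtime')
    # the entries for np.sum / add.reduce dominate the cumulative time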


In addition, here is a list of the things I did, and (very briefly) their results:

  • A Cython version: this eliminates the type-checking overhead of the for loop, reducing the time by 30%. With the Cython version, np.sum takes about 80% of the total wall clock time instead of 50%.
  • Binding np.sum to a local variable npsum first, and then calling npsum inside the for loop: no difference from the original.
  • Replacing np.sum with np.add.reduce, assigning that to the npsum variable, and then calling npsum inside the for loop: ~10% reduction in wall clock time, but it then becomes incompatible with autograd (explained below in the bullet about sparse matrices).
  • numba JIT-ing: I did nothing more than add a decorator. No improvement, but I did not try very hard.
  • Converting the nodes_nbrs dictionary into a dense numpy binary array (of 1s and 0s), and then doing a single np.dot operation. Fine in theory, but bad in practice, because it requires a square matrix of shape=(10^n, 10^n), which is quadratic in memory usage (a sketch of this idea is shown right after this list).
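
For reference, the dense-matrix idea from the last bullet looks roughly like this (a minimal sketch, not my exact code; it gives the right answer, but the (N, N) array is exactly the quadratic-memory problem):

    import numpy as np

    def dense_matmul_solution(input, nodes_nbrs):
        n = input.shape[0]
        # one row per node, with 1s marking the rows that get summed into it;
        # this (n, n) array is what makes the memory cost quadratic
        adj = np.zeros((n, n))
        for k, v in nodes_nbrs.items():
            adj[k, v] = 1
        return np.dot(adj, input)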

Things that I have not tried, but am hesitant to do:

scipy sparse matrices: I use autograd, which does not support automatic differentiation of the dot operation on scipy sparse matrices.

For those who are curious, this is essentially a convolution operation on graph-structured data. It has been fun working this out for grad school, but also somewhat frustrating, being at the forefront of knowledge.

2 answers

If scipy.sparse is not an option, one way you could approach it would be to massage your data so that you can use vectorized functions to do everything at a compiled level. If you change the neighbors dictionary into a two-dimensional array with the appropriate flags for missing values, you can use np.take to retrieve the desired data, and then make a single sum() call.

Here is an example of what I mean:

    import numpy as np

    def make_data(N=100):
        X = np.random.randint(1, 20, (N, 36))
        connections = np.random.randint(2, 5, N)
        nbrs = {i: list(np.random.choice(N, c)) for i, c in enumerate(connections)}
        return X, nbrs

    def original_solution(X, nbrs):
        output = np.zeros(shape=X.shape)
        for k, v in nbrs.items():
            output[k] = np.sum(X[v], axis=0)
        return output

    def vectorized_solution(X, nbrs):
        # Make the neighbor lists all the same length, filling with -1
        new_nbrs = np.full((X.shape[0], max(map(len, nbrs.values()))), -1, dtype=int)
        for i, v in nbrs.items():
            new_nbrs[i, :len(v)] = v
        # add a row of zeros to X, so the -1 fill values index that zero row
        new_X = np.vstack([X, 0 * X[0]])
        # compute the sums
        return new_X.take(new_nbrs, 0).sum(1)

Now we can confirm that the results match:

    >>> X, nbrs = make_data(100)
    >>> np.allclose(original_solution(X, nbrs), vectorized_solution(X, nbrs))
    True

And we can time it to see the speedup:

    X, nbrs = make_data(1000)
    %timeit original_solution(X, nbrs)
    %timeit vectorized_solution(X, nbrs)
    # 100 loops, best of 3: 13.7 ms per loop
    # 100 loops, best of 3: 1.89 ms per loop

Moving to larger sizes:

    X, nbrs = make_data(100000)
    %timeit original_solution(X, nbrs)
    %timeit vectorized_solution(X, nbrs)
    # 1 loop, best of 3: 1.42 s per loop
    # 1 loop, best of 3: 249 ms per loop

It is about 5-10 times faster, which may be good enough for your purposes (although it will depend heavily on the exact characteristics of your nbrs dictionary).


Edit: Just for fun, I tried a couple of other approaches, one using numpy.add.reduceat, one using pandas.groupby, and one using scipy.sparse. It seems that the vectorized approach proposed above is probably the best option. Here they are for reference:

    from itertools import chain

    def reduceat_solution(X, nbrs):
        ind, j = np.transpose([[i, len(v)] for i, v in nbrs.items()])
        i = list(chain(*(nbrs[i] for i in ind)))
        j = np.concatenate([[0], np.cumsum(j)[:-1]])
        return np.add.reduceat(X[i], j)[ind]

    np.allclose(original_solution(X, nbrs), reduceat_solution(X, nbrs))
    # True

-

    import pandas as pd

    def groupby_solution(X, nbrs):
        i, j = np.transpose([[k, vi] for k, v in nbrs.items() for vi in v])
        return pd.DataFrame(X[j]).groupby(i).sum().values

    np.allclose(original_solution(X, nbrs), groupby_solution(X, nbrs))
    # True

-

    from scipy.sparse import csr_matrix
    from itertools import chain

    def sparse_solution(X, nbrs):
        items = (([i]*len(col), col, [1]*len(col)) for i, col in nbrs.items())
        rows, cols, data = (np.array(list(chain(*a))) for a in zip(*items))
        M = csr_matrix((data, (rows, cols)))
        return M.dot(X)

    np.allclose(original_solution(X, nbrs), sparse_solution(X, nbrs))
    # True

And all the timings together:

    X, nbrs = make_data(100000)
    %timeit original_solution(X, nbrs)
    %timeit vectorized_solution(X, nbrs)
    %timeit reduceat_solution(X, nbrs)
    %timeit groupby_solution(X, nbrs)
    %timeit sparse_solution(X, nbrs)
    # 1 loop, best of 3: 1.46 s per loop
    # 1 loop, best of 3: 268 ms per loop
    # 1 loop, best of 3: 416 ms per loop
    # 1 loop, best of 3: 657 ms per loop
    # 1 loop, best of 3: 282 ms per loop

Based on work on recent sparse-matrix questions, for example: Extremely slow row sum operation in sparse LIL matrix in Python

Here's how your problem can be solved with sparse matrices. The method also applies to dense ones. The idea is that a sparse row sum can be implemented as a matrix product with a row (or column) of 1s. Indexing into sparse matrices is slow, but the matrix product is well-optimized C code.

In this case, I'm going to build a multiplier matrix that has 1s marking the rows I want to sum, with a different set of 1s for each entry in the dictionary.
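
To see the basic idea in isolation first: summing the rows of a matrix is the same as left-multiplying it by a row of 1s (a small standalone check, separate from the example below):

    In [1]: import numpy as np
    In [2]: B = np.arange(12).reshape(4, 3)
    In [3]: np.allclose(np.ones((1, 4)).dot(B), B.sum(axis=0))
    Out[3]: True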

Matrix example:

    In [300]: import numpy as np
    In [301]: from scipy import sparse

    In [302]: A = np.arange(8*3).reshape(8, 3)

    In [303]: M = sparse.csr_matrix(A)
The selection dictionary:

    In [304]: nbrs = {0: [0, 1], 1: [1, 0, 2], 2: [2, 1], 3: [3, 4, 7]}

Now build a sparse matrix from this dictionary. This may not be the most efficient way to construct such a matrix, but it is enough to demonstrate the idea.

    In [305]: r, c, d = [], [], []

    In [306]: for i, col in nbrs.items():
         ...:     c.extend(col)
         ...:     r.extend([i] * len(col))
         ...:     d.extend([1] * len(col))

    In [307]: r, c, d
    Out[307]:
    ([0, 0, 1, 1, 1, 2, 2, 3, 3, 3],
     [0, 1, 1, 0, 2, 2, 1, 3, 4, 7],
     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

    In [308]: idx = sparse.csr_matrix((d, (r, c)), shape=(len(nbrs), M.shape[0]))

Run the sum and look at the result (as a dense array):

    In [310]: (idx * M).A
    Out[310]:
    array([[ 3,  5,  7],
           [ 9, 12, 15],
           [ 9, 11, 13],
           [42, 45, 48]], dtype=int32)

Here is the original for comparison.

    In [312]: M.A
    Out[312]:
    array([[ 0,  1,  2],
           [ 3,  4,  5],
           [ 6,  7,  8],
           [ 9, 10, 11],
           [12, 13, 14],
           [15, 16, 17],
           [18, 19, 20],
           [21, 22, 23]], dtype=int32)
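
For completeness, a quick check (my addition, using the same toy data as above) that the sparse product reproduces the naive per-row loop:

    In [313]: out = np.zeros((len(nbrs), A.shape[1]))

    In [314]: for k, v in nbrs.items():
         ...:     out[k] = A[v].sum(axis=0)

    In [315]: np.allclose(out, (idx * M).A)
    Out[315]: True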
