Write a double (triple) sum as a scalar product?

Since my np.dot is accelerated by OpenBLAS and OpenMPI, I wonder if it is possible to write the double sum

    for i in range(N):
        for j in range(N):
            B[k,l] += A[i,j,k,l] * X[i,j]

as a scalar product. I'm currently using

    B = np.einsum("ijkl,ij->kl", A, X)

but unfortunately it is rather slow and uses only one processor. Any ideas?
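For reference, a minimal sketch (with small sizes, since the loop version is slow) confirming that the naive loop and the einsum expression compute the same thing:

```python
import numpy as np

N, K = 4, 3  # small sizes chosen only for illustration
A = np.random.random((N, N, K, K))
X = np.random.random((N, N))

# naive double loop over the summed indices i, j
B_loop = np.zeros((K, K))
for i in range(N):
    for j in range(N):
        B_loop += A[i, j] * X[i, j]   # A[i, j] is a (K, K) slice

B_einsum = np.einsum("ijkl,ij->kl", A, X)
assert np.allclose(B_loop, B_einsum)
```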

Edit: I compared the answers so far with a simple example; it looks like they are all within the same order of magnitude:

    A = np.random.random([200, 200, 100, 100])
    X = np.random.random([200, 200])

    def B1():
        return np.einsum("ijkl,ij->kl", A, X)

    def B2():
        return np.tensordot(A, X, [[0, 1], [0, 1]])

    def B3():
        shp = A.shape
        return np.dot(X.ravel(), A.reshape(shp[0]*shp[1], -1)).reshape(shp[2], shp[3])

    %timeit B1()
    %timeit B2()
    %timeit B3()

    1 loops, best of 3: 300 ms per loop
    10 loops, best of 3: 149 ms per loop
    10 loops, best of 3: 150 ms per loop

In conclusion, I would choose np.einsum from these results, since its syntax is the most readable and the other two are only about 2x faster. I guess the next step is to move the code out to C or Fortran.

performance python arrays numpy linear-algebra
3 answers

You can use np.tensordot() :

    np.tensordot(A, X, [[0, 1], [0, 1]])

which uses multiple cores.


EDIT: here is how np.einsum and np.tensordot scale as the input arrays grow:

    In [18]: for n in range(1, 31):
       ....:     A = np.random.rand(n, n+1, n+2, n+3)
       ....:     X = np.random.rand(n, n+1)
       ....:     print(n)
       ....:     %timeit np.einsum('ijkl,ij->kl', A, X)
       ....:     %timeit np.tensordot(A, X, [[0, 1], [0, 1]])
       ....:
    1
    1000000 loops, best of 3: 1.55 µs per loop
    100000 loops, best of 3: 8.36 µs per loop
    ...
    11
    100000 loops, best of 3: 15.9 µs per loop
    100000 loops, best of 3: 17.2 µs per loop
    12
    10000 loops, best of 3: 23.6 µs per loop
    100000 loops, best of 3: 18.9 µs per loop
    ...
    21
    10000 loops, best of 3: 153 µs per loop
    10000 loops, best of 3: 44.4 µs per loop

and the advantage of using tensordot for large arrays becomes clear.


You can use np.dot like this:

    shp = A.shape
    B_dot = np.dot(X.ravel(), A.reshape(shp[0]*shp[1], -1)).reshape(shp[2], shp[3])
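A brief sketch of why this works (with small shapes chosen for illustration): reshaping merges the summed axes i and j into a single axis, which turns the contraction into a plain vector-matrix product.

```python
import numpy as np

A = np.random.random((6, 5, 3, 4))
X = np.random.random((6, 5))

shp = A.shape                        # (6, 5, 3, 4)
A2 = A.reshape(shp[0] * shp[1], -1)  # (30, 12): rows index (i, j), columns index (k, l)
x = X.ravel()                        # (30,): same (i, j) ordering as A2's rows
B = np.dot(x, A2).reshape(shp[2], shp[3])  # (3, 4)

# agrees with the einsum formulation
assert np.allclose(B, np.einsum("ijkl,ij->kl", A, X))
```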

I find that tensordot outperforms einsum by a lot for some operations. I am using Anaconda's accelerated numpy, with the Intel Math Kernel Library (MKL) installed. Consider what happens when the second array has an additional dimension that is not summed over:

    In [39]: A = np.random.random([200, 200, 100, 100])
    In [40]: X = np.random.random([200, 200])
    In [41]: Y = np.random.random([200, 200, 100])

A: (200, 200, 100, 100)

X: (200, 200)

Y: (200, 200, 100)

The operation I am doing here is as follows:

AX ---> (100, 100)

AY ---> (100, 100, 100)

The AY operation basically has to perform 100 AX operations and store each result. Here is how tensordot performs in this setup:

    In [42]: %timeit tensordot(A, X, [(0, 1), (0, 1)])
    1 loops, best of 3: 477 ms per loop

    In [43]: %timeit tensordot(A, Y, [(0, 1), (0, 1)])
    1 loops, best of 3: 1.35 s per loop

Stop and think about that for a second. In line [43] we just did 100 times more operations, and it only took 3 times longer. I know that MKL makes clever use of the CPU cache to avoid round-trips to RAM; I suspect it is reusing blocks of A across the extra 100 slices of Y.

Using einsum gives something closer to what you would expect, given that we are doing 100 times as many operations:

    In [44]: %timeit einsum('ijkl,ij->kl', A, X)
    1 loops, best of 3: 962 ms per loop

    In [45]: %timeit einsum('ijkl,ijm->klm', A, Y)
    1 loops, best of 3: 1min 45s per loop

einsum seems to work very well when all of the dimensions of one argument array are summed over. tensordot gives a huge performance boost when some dimensions are left uncontracted (analogous to np.outer, but with multidimensional arrays).
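For the partially-contracted case, the tensordot call that matches the einsum expression above looks like this (a sketch with reduced sizes; the uncontracted axes of both arguments are concatenated in the result):

```python
import numpy as np

A = np.random.random((8, 8, 5, 5))
Y = np.random.random((8, 8, 6))

# sum over axes (0, 1) of both arguments; the remaining axes of A (k, l)
# and of Y (m) make up the output, giving shape (5, 5, 6)
out_td = np.tensordot(A, Y, axes=[(0, 1), (0, 1)])
out_es = np.einsum("ijkl,ijm->klm", A, Y)

assert np.allclose(out_td, out_es)
```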

Here is another example:


For an array operation:

50x1000x1000 X 50x1000x1000 → 50x50

Using tensordot gives me 6 GFLOPS compared to 0.2 GFLOPS with einsum.

I think the important point is that modern machines should be able to reach the 5-50 GFLOPS range on large arrays. If you count the operations and get less than that, check which library you are using.
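A rough way to do that count (an illustrative sketch with scaled-down stand-in arrays; one multiply plus one add per summed element counts as two floating-point operations):

```python
import time
import numpy as np

A = np.random.random((10, 100, 100))   # scaled-down stand-ins for 50x1000x1000
B = np.random.random((10, 100, 100))

start = time.perf_counter()
C = np.tensordot(A, B, axes=[(1, 2), (1, 2)])   # contract over the two large axes -> (10, 10)
elapsed = time.perf_counter() - start

# one multiply and one add per element of the summed axes, for every output entry
flops = 2 * A.shape[0] * B.shape[0] * A.shape[1] * A.shape[2]
print(f"{flops / elapsed / 1e9:.2f} GFLOPS")
```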

