I find that tensordot outperforms einsum considerably for some operations. I am using Anaconda's accelerated numpy, with the Intel Math Kernel Library (MKL) installed. Consider what happens when the second operand has an additional dimension that is not summed over:
In [39]: A = np.random.random([200, 200, 100, 100])

In [40]: X = np.random.random([200, 200])

In [41]: Y = np.random.random([200, 200, 100])
A: (200, 200, 100, 100)
X: (200, 200)
Y: (200, 200, 100)
The operations I am doing here are:
AX ---> (100, 100)
AY ---> (100, 100, 100)
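For reference, both contractions can be written with either tensordot or einsum. Here is a minimal sketch (on smaller arrays so it runs quickly) that checks the two routines give the same result:

import numpy as np

# Smaller shapes than in the timings above, just to make the check fast.
A = np.random.random([20, 20, 10, 10])
X = np.random.random([20, 20])
Y = np.random.random([20, 20, 10])

# A.X: contract the first two axes of A against X -> (10, 10)
ax_td = np.tensordot(A, X, axes=[(0, 1), (0, 1)])
ax_es = np.einsum('ijkl,ij->kl', A, X)

# A.Y: same contraction, but Y carries an extra uncontracted axis m -> (10, 10, 10)
ay_td = np.tensordot(A, Y, axes=[(0, 1), (0, 1)])
ay_es = np.einsum('ijkl,ijm->klm', A, Y)

print(np.allclose(ax_td, ax_es))  # True
print(np.allclose(ay_td, ay_es))  # True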
The AY operation essentially performs 100 AX operations and stores each result. Here is how tensordot performs in this setup:
In [42]: %timeit tensordot(A, X, [(0,1), (0,1)])
1 loops, best of 3: 477 ms per loop

In [43]: %timeit tensordot(A, Y, [(0,1), (0,1)])
1 loops, best of 3: 1.35 s per loop
Stop and think about that for a second. In [43] we just did 100 times more operations, yet it only took about 3 times longer. I know that MKL makes clever use of the CPU cache to avoid repeatedly moving data from RAM, and I suspect it is reusing cached blocks of A across the extra 100 slices of Y.
Using einsum leads to something closer to what you would expect, given that there are 100 times more operations:
In [44]: %timeit einsum('ijkl,ij->kl', A, X)
1 loops, best of 3: 962 ms per loop

In [45]: %timeit einsum('ijkl,ijm->klm', A, Y)
1 loops, best of 3: 1min 45s per loop
einsum seems to work very well when all the dimensions of one of the argument arrays are summed over. tensordot gives a huge performance boost when some dimensions are left uncontracted, behaving like np.outer but with multidimensional arrays.
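To make the "outer" analogy concrete, here is a small sketch: an index that appears in the output but in only one operand is not summed over, which in 1-D is exactly what np.outer does. The index m in 'ijkl,ijm->klm' above plays the same role in the multidimensional case.

import numpy as np

a = np.random.random(5)
b = np.random.random(7)

# 1-D case: the indices i and j each appear in only one operand and in the
# output, so nothing is summed -- this is exactly np.outer.
print(np.allclose(np.einsum('i,j->ij', a, b), np.outer(a, b)))  # True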
Here is another example:

For an array operation:
50x1000x1000 X 50x1000x1000 → 50x50
Using tensordot gives me 6 GFLOPS compared to 0.2 GFLOPS with einsum.
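Assuming the contraction here is over the two trailing 1000-length axes of each operand (my reading of the shapes above), here is a rough sketch of how to measure the throughput yourself. Note the two arrays take about 800 MB of RAM at these shapes.

import time
import numpy as np

A = np.random.random((50, 1000, 1000))
B = np.random.random((50, 1000, 1000))

# Contract the two trailing axes of each operand: result is (50, 50).
# Each output element sums 1000*1000 products, so roughly
# 2 * 50 * 50 * 1000 * 1000 floating-point operations in total.
flops = 2 * 50 * 50 * 1000 * 1000

t0 = time.perf_counter()
C = np.tensordot(A, B, axes=[(1, 2), (1, 2)])
dt = time.perf_counter() - t0
print('tensordot: %.2f GFLOPS' % (flops / dt / 1e9))

t0 = time.perf_counter()
D = np.einsum('ikl,jkl->ij', A, B)
dt = time.perf_counter() - t0
print('einsum:    %.2f GFLOPS' % (flops / dt / 1e9))

print(np.allclose(C, D))  # True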
I think the important point is that modern machines should be able to reach the 5-50 GFLOPS range for large arrays. If you count the operations and get less than that, check which library is being used.