There are several ways to do this. The first thing that comes to mind is np.einsum:

```python
import numpy as np

# some fake data
gen = np.random.RandomState(0)
ni, nj, nk = 10, 20, 100
U = gen.randn(ni, nj, nk)
V = gen.randn(nj, nk)

# naive loop over k, multiplying each (ni, nj) slice by the matching column of V
res1 = np.zeros((ni, nk))
for k in range(nk):
    res1[:, k] = U[:, :, k].dot(V[:, k])

# the same contraction in one call
res2 = np.einsum('ijk,jk->ik', U, V)

print(np.allclose(res1, res2))  # True
```
np.einsum uses Einstein summation notation to express tensor contractions. In the expression 'ijk,jk->ik' above, i, j and k are labels corresponding to the different dimensions of U and V. Each comma-separated group of labels corresponds to one of the operands passed to np.einsum (here, U has dimensions ijk and V has dimensions jk). The '->ik' part specifies the shape of the output array. Any dimensions whose labels do not appear in the output are summed over.
np.einsum is incredibly useful for performing complex tensor contractions, but it can take a while to fully wrap your head around how it works. You should take a look at the examples in the documentation (see above).
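To get a feel for the notation, here are a few small contractions (these examples are my own illustrations, not taken from the docs; A, B and M are throwaway arrays):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
M = np.arange(9).reshape(3, 3)

# 'j' is absent from the output, so it is summed over: a matrix product
np.einsum('ij,jk->ik', A, B)   # same as A @ B

# a repeated label walks the diagonal; an empty output sums it
np.einsum('ii->', M)           # same as np.trace(M)

# every label survives to the output, just reordered: a transpose, no summation
np.einsum('ij->ji', A)         # same as A.T
```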
Some other options:
Elementwise multiplication with broadcasting, followed by summation:
```python
res3 = (U * V[None, ...]).sum(1)
```
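The broadcasting step is easier to follow if you print the intermediate shapes (continuing with the arrays defined above):

```python
# V[None, ...] has shape (1, nj, nk), which broadcasts against U's (ni, nj, nk);
# summing the elementwise product over axis 1 (the j axis) leaves (ni, nk)
print(V[None, ...].shape)         # (1, 20, 100)
print((U * V[None, ...]).shape)   # (10, 20, 100)
print(np.allclose(res1, res3))    # True
```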
inner1d with a load of transposing:

```python
from numpy.core.umath_tests import inner1d

res4 = inner1d(U.transpose(0, 2, 1), V.T)
```
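Note that numpy.core.umath_tests emits a DeprecationWarning on recent NumPy releases, so the import may warn or fail depending on your version. A rough equivalent using np.matmul, treating k as a batch dimension, is sketched below (my own workaround, not one of the original options):

```python
# move k to the front so each U[:, :, k] becomes one (ni, nj) matrix in a batch,
# pair it with the matching column V[:, k] as an (nj, 1) vector, then undo the reshuffle
res5 = (U.transpose(2, 0, 1) @ V.T[:, :, None])[:, :, 0].T
print(np.allclose(res1, res5))  # True
```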
Some benchmarks:
```python
In [1]: ni, nj, nk = 100, 200, 1000

In [2]: %%timeit U = gen.randn(ni, nj, nk); V = gen.randn(nj, nk)
   ....: np.einsum('ijk,jk->ik', U, V)
   ....:
10 loops, best of 3: 23.4 ms per loop

In [3]: %%timeit U = gen.randn(ni, nj, nk); V = gen.randn(nj, nk)
   ....: (U * V[None, ...]).sum(1)
   ....:
10 loops, best of 3: 59.7 ms per loop

In [4]: %%timeit U = gen.randn(ni, nj, nk); V = gen.randn(nj, nk)
   ....: inner1d(U.transpose(0, 2, 1), V.T)
   ....:
10 loops, best of 3: 45.9 ms per loop
```

At this problem size, np.einsum is the fastest of the three, roughly twice as fast as broadcasting-and-summing.
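If you want to reproduce these timings outside of IPython, a minimal timeit harness along these lines should work (a sketch; the numbers will of course vary with your machine and NumPy version):

```python
import timeit

setup = """
import numpy as np
gen = np.random.RandomState(0)
ni, nj, nk = 100, 200, 1000
U = gen.randn(ni, nj, nk)
V = gen.randn(nj, nk)
"""

for label, stmt in [
    ("einsum", "np.einsum('ijk,jk->ik', U, V)"),
    ("broadcast+sum", "(U * V[None, ...]).sum(1)"),
]:
    # best of 3 repeats of 10 runs each, reported per run, as %timeit does
    best = min(timeit.repeat(stmt, setup=setup, number=10, repeat=3)) / 10
    print(f"{label}: {best * 1e3:.1f} ms per loop")
```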