No, you can't. As your interesting example shows, numpy.sum can be suboptimal, and explicit for loops that arrange the operations better can be faster.
Let me show you another example:
>>> import timeit
>>> import numpy as np
>>> N, M = 10**4, 10**4
>>> v = np.random.randn(N,M)
>>> r = np.empty(M)
>>> timeit.timeit('v.sum(axis=0, out=r)', 'from __main__ import v,r', number=1)
1.2837879657745361
>>> r = np.empty(N)
>>> timeit.timeit('v.sum(axis=1, out=r)', 'from __main__ import v,r', number=1)
0.09213519096374512
Here you can clearly see that numpy.sum is optimal when summing over the fast-varying index (v is C-contiguous, so that is axis 1) and suboptimal when summing over the slow-varying axis. Interestingly, the opposite is true for for loops:
>>> r = np.zeros(M)
>>> timeit.timeit('for row in v[:]: r += row', 'from __main__ import v,r', number=1)
0.11945700645446777
>>> r = np.zeros(N)
>>> timeit.timeit('for row in v.T[:]: r += row', 'from __main__ import v,r', number=1)
1.2647287845611572
I did not have time to check the numpy code, but I suspect that what makes the difference is contiguous memory access versus strided access.
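To see what contiguous versus strided access means for an array like v, you can inspect its flags and strides. This is only a small sketch of my own (using a smaller array than in the timings, and assuming the default float64 dtype); it does not show what numpy.sum does internally.

```python
import numpy as np

# Smaller than the arrays timed above; the shape is an arbitrary choice.
v = np.random.randn(1000, 500)

# A freshly created array is C-contiguous: each row is one block of memory.
print(v.flags['C_CONTIGUOUS'])    # True
# Strides are in bytes: moving one row skips 500 * 8 bytes, one column 8 bytes.
print(v.strides)                  # (4000, 8)

# The transpose is a view on the same memory with the strides swapped,
# so each "row" of v.T is read with a 4000-byte step between elements.
print(v.T.flags['C_CONTIGUOUS'])  # False
print(v.T.strides)                # (8, 4000)
```

Iterating over the rows of v touches contiguous blocks, while iterating over the rows of v.T jumps through memory with a large stride.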
As these examples show, when implementing a numerical algorithm, the memory layout and access order are of great importance. Vectorized code does not necessarily solve every problem.
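If you want to try this on your own machine, something along these lines should work. It is a rough sketch: column_sums_loop is a helper name I made up for illustration, the array is smaller than in my timings, and the numbers you get will depend on your NumPy version and hardware, so the gap may look quite different from what I measured.

```python
import timeit
import numpy as np

def column_sums_loop(a):
    """Sum over axis 0 by accumulating rows one at a time (hypothetical helper)."""
    r = np.zeros(a.shape[1])
    for row in a:          # each row of a C-contiguous array is contiguous
        r += row
    return r

v = np.random.randn(2000, 2000)

# Both approaches must give the same result before timing means anything.
assert np.allclose(v.sum(axis=0), column_sums_loop(v))

t_sum = timeit.timeit(lambda: v.sum(axis=0), number=5)
t_loop = timeit.timeit(lambda: column_sums_loop(v), number=5)
print("v.sum(axis=0): %.4f s" % t_sum)
print("row loop:      %.4f s" % t_loop)
```

Always measure both variants on the layout you actually have rather than assuming that the vectorized call wins.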
Stefano m