It seems you are mostly interested in the difference between your function 3 and the pure NumPy version (function 1) and the pure Python version (function 2). The answer is quite simple (especially if you look at function 4):

- NumPy functions have a "huge" constant overhead per call.

Usually you need several thousand elements before the runtime of np.sum actually depends on the number of elements in the array. Using IPython and matplotlib (the plots are at the end of the answer), you can easily check this runtime dependency:
```python
import numpy as np

n = []
timing_sum1 = []
timing_sum2 = []
for i in range(1, 25):
    num = 2**i
    arr = np.arange(num)
    print(num)
    time1 = %timeit -o arr.sum()
```
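Outside of IPython the `%timeit` magic is not available; a rough equivalent can be sketched with the standard-library `timeit` module. The sizes and repeat counts below are my own arbitrary choices, not from the original benchmark:

```python
import timeit

import numpy as np

# rough stdlib equivalent of the %timeit loop above
for i in range(1, 8):
    num = 2 ** i
    arr = np.arange(num)
    # best of 3 repeats, 1000 calls each, reported per call
    best = min(timeit.repeat(lambda: arr.sum(), number=1000, repeat=3)) / 1000
    print(f"{num:4d} elements: {best * 1e6:8.2f} µs per call")
```

Taking the minimum over repeats is the usual way to reduce noise from other processes on the machine.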
The results for np.sum (shortened) are quite interesting:
```
       4    22.6 µs ± 297 ns per loop  (mean ± std. dev. of 7 runs, 10000 loops each)
      16    25.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
      64    25.3 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
     256    24.1 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    1024    24.6 µs ± 221 ns per loop  (mean ± std. dev. of 7 runs, 10000 loops each)
    4096    27.6 µs ± 147 ns per loop  (mean ± std. dev. of 7 runs, 10000 loops each)
   16384    40.6 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
   65536    91.2 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  262144     394 µs ± 8.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
 1048576    1.24 ms ± 4.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
 4194304    4.71 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16777216    18.6 ms ± 280 µs per loop  (mean ± std. dev. of 7 runs, 100 loops each)
```
It looks like the constant overhead is roughly 20 µs on my computer, and it takes an array of about 16384 elements before that time doubles. So the runtime of functions 3 and 4 is mostly made up of many multiples of this constant overhead.
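Both the overhead and the per-element cost can be read off the timings quoted above. A small sketch (timings converted to µs): small arrays sit on a flat plateau, which is the call overhead, while for the largest array the overhead is negligible:

```python
import numpy as np

# timings quoted above (array size -> mean time in µs)
sizes = np.array([4, 16, 64, 256, 1024, 4096, 16384, 65536,
                  262144, 1048576, 4194304, 16777216])
times = np.array([22.6, 25.1, 25.3, 24.1, 24.6, 27.6, 40.6, 91.2,
                  394, 1240, 4710, 18600])

# for small arrays the runtime is flat -> the plateau is the call overhead
overhead = times[sizes <= 1024].mean()
# for the largest array the overhead is negligible -> per-element cost
per_element = times[-1] / sizes[-1]
print(f"constant overhead ≈ {overhead:.1f} µs")          # ~24 µs
print(f"per-element cost  ≈ {per_element * 1e3:.2f} ns")  # ~1.1 ns
```

At ~1.1 ns per element, it indeed takes on the order of 20000 elements before the per-element work matches the ~20 µs overhead, which agrees with the doubling around 16384 elements.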
In function 3, you pay that constant overhead twice per iteration: once for np.sum and once for np.arange. In this case arange is fairly cheap because each array has the same size, so NumPy, Python, and your OS are probably reusing the memory of the previous iteration's array. But even that takes time (roughly 2 µs for very small arrays on my computer).
More generally: to identify bottlenecks, you should always profile functions!
Below are the results obtained with line_profiler. I changed the functions slightly so that each line performs only one operation:
```python
import numpy as np

def func1():
    x = np.arange(1000)
    x = x*2
    return np.sum(x)

def func2():
    sum_ = 0
    for i in range(1000):
        tmp = i*2
        sum_ += tmp
    return sum_

def func3():
    sum_ = 0
    for i in range(0, 1000, 4):
        x = np.arange(i, i + 4, 1)
        x = x * 2
        tmp = np.sum(x)
        sum_ += tmp
    return sum_

def func4():
    sum_ = 0
    x = np.arange(1000)
    for i in range(0, 1000, 4):
        y = x[i:i + 4]
        y = y * 2
        tmp = np.sum(y)
        sum_ += tmp
    return sum_
```
Results:
```
%load_ext line_profiler

%lprun -f func1 func1()

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           def func1():
     5         1           62     62.0     23.8      x = np.arange(1000)
     6         1           65     65.0     24.9      x = x*2
     7         1          134    134.0     51.3      return np.sum(x)

%lprun -f func2 func2()

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           def func2():
    10         1            7      7.0      0.1      sum_ = 0
    11      1001         2523      2.5     30.9      for i in range(1000):
    12      1000         2819      2.8     34.5          tmp = i*2
    13      1000         2819      2.8     34.5          sum_ += tmp
    14         1            3      3.0      0.0      return sum_

%lprun -f func3 func3()

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    16                                           def func3():
    17         1            7      7.0      0.0      sum_ = 0
    18       251          909      3.6      2.9      for i in range(0, 1000, 4):
    19       250         6527     26.1     21.2          x = np.arange(i, i + 4, 1)
    20       250         5615     22.5     18.2          x = x * 2
    21       250        16053     64.2     52.1          tmp = np.sum(x)
    22       250         1720      6.9      5.6          sum_ += tmp
    23         1            3      3.0      0.0      return sum_

%lprun -f func4 func4()

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           def func4():
    26         1            7      7.0      0.0      sum_ = 0
    27         1           49     49.0      0.2      x = np.arange(1000)
    28       251          892      3.6      3.4      for i in range(0, 1000, 4):
    29       250         2177      8.7      8.3          y = x[i:i + 4]
    30       250         5431     21.7     20.7          y = y * 2
    31       250        15990     64.0     60.9          tmp = np.sum(y)
    32       250         1686      6.7      6.4          sum_ += tmp
    33         1            3      3.0      0.0      return sum_
```
I will not go into the details of the results, but as you can see, np.sum is definitely the bottleneck in func3 and func4. I guessed that np.sum was the bottleneck before writing this answer, but these line profiles really confirm it.
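As an aside: if line_profiler is not installed, the standard-library cProfile at least gives function-level timings, which is often enough to see that np.sum and np.arange dominate. A minimal sketch:

```python
import cProfile
import io
import pstats

import numpy as np

def func3():
    sum_ = 0
    for i in range(0, 1000, 4):
        x = np.arange(i, i + 4, 1)
        x = x * 2
        tmp = np.sum(x)
        sum_ += tmp
    return sum_

# profile one call; runcall returns the function's result
pr = cProfile.Profile()
result = pr.runcall(func3)

# print the ten most expensive calls by total time
stream = io.StringIO()
pstats.Stats(pr, stream=stream).sort_stats("tottime").print_stats(10)
print(stream.getvalue())
```

This cannot attribute time to individual lines, but the per-call counts (250 calls each to np.sum and np.arange here) already point at the hot spots.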
This leads to a very important fact when using NumPy:
- Know when to use it! NumPy is usually not worth it for small arrays.
- Know NumPy's functions and just use them. They are already compiled (where available) with compiler optimization flags that unroll the loops.
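To illustrate the second point with the functions from this question: the whole computation collapses into a single vectorized expression, which pays the constant overhead only once instead of 250 times:

```python
import numpy as np

n = 1000
# one array, one multiply, one sum: constant overhead paid once
vectorized = int(np.sum(np.arange(n) * 2))

# the pure-Python version pays interpreter overhead per element instead
looped = sum(i * 2 for i in range(n))

assert vectorized == looped == n * (n - 1)
print(vectorized)
```

This is essentially func1 from the profiles above, and it is the fastest variant precisely because it contains no Python-level loop over chunks.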
If a part really is too slow, you can use:

- the NumPy C API and process the array in C (which can be quite simple with Cython, but you can also do it by hand)
- Numba (based on LLVM).
But usually you probably can't beat NumPy for moderately sized arrays (several thousand records or more).
Timing visualization:
```python
%matplotlib notebook
import matplotlib.pyplot as plt
```
The graphs are log-log, I think this was the best way to visualize the data, given that it extends several orders of magnitude (I just hope that it is still understandable).
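The full plotting code is not shown above; a rough, self-contained reconstruction of the first log-log plot might look like this (the timings are regenerated on the fly, so the exact numbers will differ from those quoted earlier, and the Agg backend stands in for %matplotlib notebook in a plain script):

```python
import timeit

import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts
import matplotlib.pyplot as plt

# measure np.sum over a range of array sizes
sizes = [2 ** i for i in range(1, 21)]
timings = []
for num in sizes:
    arr = np.arange(num)
    t = min(timeit.repeat(lambda: arr.sum(), number=100, repeat=3)) / 100
    timings.append(t)

# log-log plot: the flat region on the left is the constant overhead
fig, ax = plt.subplots()
ax.plot(sizes, timings, marker="o")
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("number of elements")
ax.set_ylabel("time per np.sum call [s]")
fig.savefig("np_sum_timing.png")
```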
The first graph shows how long it takes to execute sum :

The second graph shows the average time per element: the total time for sum divided by the number of elements in the array. This is just another way of looking at the data:
