Different vectorization performance in numpy

I wanted to test the performance of code vectorization in python:

    import timeit
    import numpy as np

    def func1():
        x = np.arange(1000)
        sum = np.sum(x*2)
        return sum

    def func2():
        sum = 0
        for i in xrange(1000):
            sum += i*2
        return sum

    def func3():
        sum = 0
        for i in xrange(0, 1000, 4):
            x = np.arange(i, i+4, 1)
            sum += np.sum(x*2)
        return sum

    print timeit.timeit(func1, number = 1000)
    print timeit.timeit(func2, number = 1000)
    print timeit.timeit(func3, number = 1000)

The code gives the following result:

    0.0105729103088
    0.069864988327
    0.983253955841

The performance difference between the first and second functions is not surprising. But I was surprised that the third function is so much slower than the others.

I am much more familiar with code vectorization in C than in Python, and the third function is the most C-like: its loop processes four numbers per iteration. To my understanding, numpy calls a C function, which then vectorizes the work in C. If so, my code also hands numpy four numbers at a time, so it should not matter much whether I pass more numbers in a single call. So why is it so much slower? Is it due to the overhead of calling the numpy function?
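For reference, here is a Python 3 port of the three functions (range instead of xrange, print as a function). All three agree on the result, 2 * (0 + 1 + ... + 999) = 999000, so only the speed differs:

```python
import numpy as np

def func1():
    # fully vectorized: one array, one multiply, one sum
    x = np.arange(1000)
    return np.sum(x * 2)

def func2():
    # pure Python loop
    total = 0
    for i in range(1000):
        total += i * 2
    return total

def func3():
    # C-style "4 elements per iteration" chunking
    total = 0
    for i in range(0, 1000, 4):
        x = np.arange(i, i + 4, 1)
        total += np.sum(x * 2)
    return total

print(func1(), func2(), func3())
```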

Also, the reason I even came up with the third function is that I worry about the cost of allocating the large array x in func1 .

Is my concern valid? Why and how can I improve it or why not?

Thanks in advance.

Edit:

Out of curiosity, although it defeats my original goal for the third version, I tried Rogogios' suggestion with the following edit.

    def func3():
        sum = 0
        x = np.arange(0, 1000)
        for i in xrange(0, 1000, 4):
            sum += np.sum(x[i:i+4]*2)
        return sum

Output:

    0.0104308128357
    0.0630609989166
    0.748773813248

There is an improvement, but still a big gap compared to the other functions.

Is this because x[i:i+4] is still creating a new array?
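One way to check this (a Python 3 sketch, using ndarray.base to see which array owns the memory): the slice x[i:i+4] itself is only a view, but the multiplication still allocates a fresh array on every iteration:

```python
import numpy as np

x = np.arange(1000)

view = x[4:8]
# Basic slicing returns a view backed by x's memory, not a copy
print(view.base is x)   # the view's base is x itself

prod = view * 2
# The multiplication allocates a brand-new array each time
print(prod.base)        # a fresh array has no base
```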

Edit 2:

I changed the code again according to Daniel's suggestion.

    def func1():
        x = np.arange(1000)
        x *= 2
        return x.sum()

    def func3():
        sum = 0
        x = np.arange(0, 1000)
        for i in xrange(0, 1000, 4):
            x[i:i+4] *= 2
            sum += x[i:i+4].sum()
        return sum

Output:

    0.00824999809265
    0.0660569667816
    0.598328828812

There is another speedup, so creating numpy arrays is definitely part of the problem. Now func3 should perform only one array allocation, yet it is still slower. Is it due to the overhead of calling numpy functions?

3 answers

It seems you are mostly interested in the difference between your function 3 and the pure NumPy version (function 1) and the pure Python version (function 2). The answer is quite simple (especially if you look at function 4):

  • NumPy functions have a "huge" constant overhead per call.

Usually you need several thousand elements before the runtime of np.sum actually depends on the number of elements in the array. Using IPython and matplotlib (the plots are at the end of the answer), you can easily check the runtime dependency:

    import numpy as np

    n = []
    timing_sum1 = []
    timing_sum2 = []
    for i in range(1, 25):
        num = 2**i
        arr = np.arange(num)
        print(num)
        time1 = %timeit -o arr.sum()    # calling the method
        time2 = %timeit -o np.sum(arr)  # calling the function
        n.append(num)
        timing_sum1.append(time1)
        timing_sum2.append(time2)

The results for np.sum (shortened) are quite interesting:

    4           22.6 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    16          25.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    64          25.3 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    256         24.1 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    1024        24.6 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    4096        27.6 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    16384       40.6 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    65536       91.2 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    262144      394 µs ± 8.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    1048576     1.24 ms ± 4.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    4194304     4.71 ms ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    16777216    18.6 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It looks like the constant overhead is approximately 20 µs on my computer, and it takes an array with 16384 elements before that time even doubles. So the timings for functions 3 and 4 are dominated by repeatedly paying this constant overhead.
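You can see the constant overhead without matplotlib at all: a minimal sketch, timing np.sum on a 4-element and a 4096-element array (the exact numbers depend on your machine, but the ratio will be nowhere near the 1024x difference in element count):

```python
import timeit
import numpy as np

small = np.arange(4)
big = np.arange(4096)

# Time np.sum on arrays that differ in size by a factor of 1024
t_small = timeit.timeit(lambda: np.sum(small), number=10000)
t_big = timeit.timeit(lambda: np.sum(big), number=10000)

# Per-call overhead dominates: the big array is only slightly slower
print(t_big / t_small)
```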

In function 3 you incur this constant overhead twice per iteration, once for np.sum and once for np.arange . In this case arange is pretty cheap, because each array is the same size, so NumPy, Python and your OS are probably reusing the memory of the previous iteration's array. However, even that takes time (approximately 2 µs per call for very small arrays on my computer).

More generally: to identify bottlenecks, you should always profile the functions!

Here are the results with line-profiler . I changed the functions a bit so that each line performs only one operation:

    import numpy as np

    def func1():
        x = np.arange(1000)
        x = x*2
        return np.sum(x)

    def func2():
        sum_ = 0
        for i in range(1000):
            tmp = i*2
            sum_ += tmp
        return sum_

    def func3():
        sum_ = 0
        for i in range(0, 1000, 4):  # I'm using python3, so "range" is like "xrange"!
            x = np.arange(i, i + 4, 1)
            x = x * 2
            tmp = np.sum(x)
            sum_ += tmp
        return sum_

    def func4():
        sum_ = 0
        x = np.arange(1000)
        for i in range(0, 1000, 4):
            y = x[i:i + 4]
            y = y * 2
            tmp = np.sum(y)
            sum_ += tmp
        return sum_

Results:

    %load_ext line_profiler

    %lprun -f func1 func1()
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
         4                                           def func1():
         5         1           62     62.0     23.8      x = np.arange(1000)
         6         1           65     65.0     24.9      x = x*2
         7         1          134    134.0     51.3      return np.sum(x)

    %lprun -f func2 func2()
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
         9                                           def func2():
        10         1            7      7.0      0.1      sum_ = 0
        11      1001         2523      2.5     30.9      for i in range(1000):
        12      1000         2819      2.8     34.5          tmp = i*2
        13      1000         2819      2.8     34.5          sum_ += tmp
        14         1            3      3.0      0.0      return sum_

    %lprun -f func3 func3()
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        16                                           def func3():
        17         1            7      7.0      0.0      sum_ = 0
        18       251          909      3.6      2.9      for i in range(0, 1000, 4):
        19       250         6527     26.1     21.2          x = np.arange(i, i + 4, 1)
        20       250         5615     22.5     18.2          x = x * 2
        21       250        16053     64.2     52.1          tmp = np.sum(x)
        22       250         1720      6.9      5.6          sum_ += tmp
        23         1            3      3.0      0.0      return sum_

    %lprun -f func4 func4()
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        25                                           def func4():
        26         1            7      7.0      0.0      sum_ = 0
        27         1           49     49.0      0.2      x = np.arange(1000)
        28       251          892      3.6      3.4      for i in range(0, 1000, 4):
        29       250         2177      8.7      8.3          y = x[i:i + 4]
        30       250         5431     21.7     20.7          y = y * 2
        31       250        15990     64.0     60.9          tmp = np.sum(y)
        32       250         1686      6.7      6.4          sum_ += tmp
        33         1            3      3.0      0.0      return sum_

I will not go into the details of the results, but as you can see, np.sum is definitely the bottleneck in func3 and func4 . I had already guessed that np.sum was the bottleneck before writing this answer, but these line profiles confirm it.

This leads to a very important fact when using NumPy:

  • Know when to use it! Small arrays are (mostly) not worth it.
  • Know the NumPy functions and just use them. They already use compiler optimization flags (where available) to unroll loops.

If you really think some part is too slow, you can use:

  • the NumPy C API and process the array in C (this can be quite simple with Cython, but you can also do it manually)
  • Numba (based on LLVM).

But usually you probably can't beat NumPy for moderately sized arrays (several thousand entries or more).


Timing visualization:

    %matplotlib notebook

    import matplotlib.pyplot as plt

    # Average time per sum-call
    fig = plt.figure(1)
    ax = plt.subplot(111)
    ax.plot(n, [time.average for time in timing_sum1], label='arr.sum()', c='red')
    ax.plot(n, [time.average for time in timing_sum2], label='np.sum(arr)', c='blue')
    ax.set_xscale('log')
    ax.set_yscale('log')
    ax.set_xlabel('elements')
    ax.set_ylabel('time it takes to sum them [seconds]')
    ax.grid(which='both')
    ax.legend()

    # Average time per element
    fig = plt.figure(1)
    ax = plt.subplot(111)
    ax.plot(n, [time.average / num for num, time in zip(n, timing_sum1)], label='arr.sum()', c='red')
    ax.plot(n, [time.average / num for num, time in zip(n, timing_sum2)], label='np.sum(arr)', c='blue')
    ax.set_xscale('log')
    ax.set_yscale('log')
    ax.set_xlabel('elements')
    ax.set_ylabel('time per element [seconds / element]')
    ax.grid(which='both')
    ax.legend()

The graphs are log-log; I think that was the best way to visualize the data, given that it spans several ordersers of magnitude (I just hope it is still understandable).

The first graph shows how long it takes to execute the sum :

[log-log plot: total summation time vs. number of elements]

The second graph shows the average time to compute the sum , divided by the number of elements in the array. It is just another way to interpret the data:

[log-log plot: summation time per element vs. number of elements]


Based on the tests (shown below), it seems you are being hurt by function-call overhead. Along with the vectorized capability of NumPy functions, we need to give them enough data to crunch. With func3 we give np.sum only 4 elements per call.

Let's look at the overhead of each call to np.sum . Here is np.sum , starting with an empty array, then a single element, and scaling up -

    In [90]: a = np.array([])

    In [91]: %timeit np.sum(a)
    1000000 loops, best of 3: 1.6 µs per loop

    In [61]: a = np.array([0])

    In [62]: %timeit np.sum(a)
    1000000 loops, best of 3: 1.66 µs per loop

    In [63]: a = np.random.randint(0,9,(100))

    In [64]: %timeit np.sum(a)
    100000 loops, best of 3: 1.79 µs per loop

    In [65]: a = np.random.randint(0,9,(1000))

    In [66]: %timeit np.sum(a)
    100000 loops, best of 3: 2.25 µs per loop

    In [67]: a = np.random.randint(0,9,(10000))

    In [68]: %timeit np.sum(a)
    100000 loops, best of 3: 7.27 µs per loop

etc.

Thus, we incur about 1.6 µs of overhead per np.sum call on the system used for these tests.

Let's see how fast scalar addition is with the addition operator -

    In [98]: def add_nums(a,b):
        ...:     return a+b
        ...:

    In [99]: %timeit add_nums(2,3)
    10000000 loops, best of 3: 71.5 ns per loop

This is approximately 25x faster than the per-call overhead of np.sum .
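The same comparison as a self-contained sketch (iteration counts are illustrative; the exact ratio will vary by machine, but np.sum on a tiny array reliably costs much more than a scalar addition):

```python
import timeit
import numpy as np

def add_nums(a, b):
    # plain scalar addition
    return a + b

arr = np.array([2, 3])

# Compare one scalar addition against one np.sum call on a tiny array
t_add = timeit.timeit(lambda: add_nums(2, 3), number=100000)
t_sum = timeit.timeit(lambda: np.sum(arr), number=100000)

print(t_sum / t_add)  # per-call overhead of np.sum vs. a scalar add
```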

The obvious next step is to check how func3 performs when we give np.sum larger amounts of data to crunch per call.

Here is func3 (the slicing version) modified to take a variable chunk size for each iteration's summation:

    def func3(scale_factor = 4):
        sum1 = 0
        x = np.arange(0,1000)
        for i in xrange(0,1000,scale_factor):
            sum1 += np.sum(x[i:i+scale_factor]*2)
        return sum1

Starting with scale_factor = 4 , as used originally -

    In [83]: %timeit func1()
    100000 loops, best of 3: 5.39 µs per loop

    In [84]: %timeit func2()
    10000 loops, best of 3: 39.8 µs per loop

    In [85]: %timeit func3(scale_factor = 4)
    1000 loops, best of 3: 741 µs per loop

Yes, func3 is slow.

Now let's feed more data to each np.sum call, i.e. increase scale_factor -

    In [86]: %timeit func3(scale_factor = 8)
    1000 loops, best of 3: 376 µs per loop

    In [87]: %timeit func3(scale_factor = 20)
    10000 loops, best of 3: 152 µs per loop

    In [88]: %timeit func3(scale_factor = 100)
    10000 loops, best of 3: 33.5 µs per loop

and so on, until we pass all the data to a single np.sum call, for maximum performance and minimum call overhead.
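As a sanity check (a Python 3 sketch of the modified func3), the result is identical for every chunk size; only the timing changes:

```python
import numpy as np

def func3(scale_factor=4):
    # sum x*2 in chunks of scale_factor elements per np.sum call
    sum1 = 0
    x = np.arange(0, 1000)
    for i in range(0, 1000, scale_factor):
        sum1 += np.sum(x[i:i + scale_factor] * 2)
    return sum1

# Every chunk size yields 2 * (0 + 1 + ... + 999) = 999000
results = [func3(s) for s in (4, 8, 20, 100, 1000)]
print(results)
```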


First of all, no one would write the third version in C, because the compiler makes the necessary optimizations for you.

So take the first one: you have two numpy array creations ( arange and *2 ) and one summation. Creating complex objects such as numpy arrays takes some time, but each vector operation is written in C code and is very fast.

The second one uses only primitive python operations (about 3000 of them: iteration, multiplication and summation), which are written in C and very fast.

In the third version you create 2 * 250 numpy arrays (a relatively slow operation), which leads to roughly 100x slower execution compared to creating only 2 numpy arrays.

If you are worried about memory usage, you should use in-place operations, which create only one array:

    x = np.arange(1000)
    x *= 2
    return x.sum()

If that still uses too much memory, split your operations into chunks that are as large as possible.
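A sketch of that chunked idea (the helper name and chunk size are illustrative, Python 3): process fixed-size chunks in place, so peak memory is bounded by the chunk size while each np.sum call still gets a reasonable amount of data:

```python
import numpy as np

def chunked_double_sum(n=1000, chunk=256):
    # Sum 2*i for i in [0, n) without materializing the full array
    total = 0
    for start in range(0, n, chunk):
        x = np.arange(start, min(start + chunk, n))
        x *= 2            # in-place: no second temporary array
        total += x.sum()
    return total

print(chunked_double_sum())
```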

