Performance loss after vectorizing in numpy

I am writing a computationally expensive program. To reduce the running time, I have tried my best to use numpy.dot instead of for loops.

However, I found that the vectorized program performs much worse than the for-loop version:

    import numpy as np
    import datetime

    kpt_list = np.zeros((10000, 20), dtype='float')
    rpt_list = np.zeros((1000, 20), dtype='float')
    h_r = np.zeros((20, 20, 1000), dtype='complex')
    r_ndegen = np.zeros(1000, dtype='float')
    r_ndegen.fill(1)
    # setup completed

    # this is the vectorized version
    r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    start = datetime.datetime.now()
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    kpt_data_1 = h_r.dot(phase)
    end = datetime.datetime.now()
    print((end - start).total_seconds())
    # the result is 19.302483

    # this is the for-loop version
    kpt_data_2 = np.zeros((20, 20, 10000), dtype='complex')
    start = datetime.datetime.now()
    for i in range(10000):
        kpt = kpt_list[i, :]
        phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
        kpt_data_2[:, :, i] = h_r.dot(phase)
    end = datetime.datetime.now()
    print((end - start).total_seconds())
    # the result is 7.74583

What's going on here?

2 answers

The first thing I'd suggest is breaking your script up into separate functions, to make profiling and debugging easier:

    def setup(n1=10000, n2=1000, n3=20, seed=None):
        gen = np.random.RandomState(seed)
        kpt_list = gen.randn(n1, n3).astype(float)
        rpt_list = gen.randn(n2, n3).astype(float)
        h_r = (gen.randn(n3, n3, n2)
               + 1j * gen.randn(n3, n3, n2)).astype(complex)
        r_ndegen = gen.randn(n2).astype(float)
        return kpt_list, rpt_list, h_r, r_ndegen

    def original_vec(*args, **kwargs):
        kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
        r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
        phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
        kpt_data = h_r.dot(phase)
        return kpt_data

    def original_loop(*args, **kwargs):
        kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
        kpt_data = np.zeros((20, 20, 10000), dtype='complex')
        for i in range(10000):
            kpt = kpt_list[i, :]
            phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
            kpt_data[:, :, i] = h_r.dot(phase)
        return kpt_data

I'd also highly recommend using random data rather than all-zero or all-one arrays, unless that's what your actual data looks like (!). This makes it much easier to verify the correctness of your code - for example, if your last step is to multiply by a matrix of zeros, then your output will always be all zeros, regardless of whether there was a mistake earlier on in your code.


Then I ran these functions through line_profiler to see where they spend most of their time. In particular, for original_vec :

    In [1]: %lprun -f original_vec original_vec()
    Timer unit: 1e-06 s

    Total time: 23.7598 s
    File: <ipython-input-24-c57463f84aad>
    Function: original_vec at line 12

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        12                                           def original_vec(*args, **kwargs):
        13
        14         1        86498  86498.0      0.4      kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
        15
        16         1        69700  69700.0      0.3      r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
        17         1      1331947 1331947.0      5.6      phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
        18         1     22271637 22271637.0    93.7     kpt_data = h_r.dot(phase)
        19
        20         1            4      4.0      0.0      return kpt_data
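(If you want to reproduce this: the %lprun magic isn't built into IPython - it comes from the line_profiler package, which needs to be installed and its extension loaded first.)

    # assumes line_profiler is installed, e.g. via: pip install line_profiler
    %load_ext line_profiler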

You can see that it spends 93% of its time computing the dot product between h_r and phase. Here, h_r is an array of shape (20, 20, 1000), and phase is (1000, 10000). We're computing a sum-product over the last dimension of h_r and the first dimension of phase (in einsum notation you could write this as ijk,kl->ijl).
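As a quick sanity check on that description of the operation, the einsum form should match h_r.dot(phase) exactly. A small sketch with toy-sized stand-in arrays (the shapes are scaled down so it runs instantly):

    import numpy as np

    gen = np.random.RandomState(0)
    h_r = gen.randn(4, 4, 6) + 1j * gen.randn(4, 4, 6)  # stand-in for (20, 20, 1000)
    phase = gen.randn(6, 8) + 1j * gen.randn(6, 8)      # stand-in for (1000, 10000)

    # dot sums over the last axis of h_r and the first axis of phase
    assert np.allclose(h_r.dot(phase), np.einsum('ijk,kl->ijl', h_r, phase))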


The first two dimensions of h_r don't actually matter here - we could just as easily reshape h_r into a (20*20, 1000) array before taking the dot product. It turns out that this reshaping operation by itself gives a huge performance improvement:

    In [2]: %timeit h_r.dot(phase)
    1 loop, best of 3: 22.6 s per loop

    In [3]: %timeit h_r.reshape(-1, 1000).dot(phase)
    1 loop, best of 3: 1.04 s per loop
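It's worth convincing yourself that the reshaped version really computes the same numbers - reshaping the result back recovers the original output. A quick check, assuming the h_r and phase arrays from above:

    out_3d = h_r.dot(phase)                    # shape (20, 20, 10000)
    out_2d = h_r.reshape(-1, 1000).dot(phase)  # shape (400, 10000)
    assert np.allclose(out_3d, out_2d.reshape(20, 20, -1))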

I'm not quite sure why this should be the case - I would have hoped that numpy's dot function would be smart enough to apply this simple optimization automatically. On my laptop the second case seems to use multiple threads, whereas the first one doesn't, suggesting that it may not be calling multithreaded BLAS routines.


Here's a vectorized version that incorporates the reshaping operation:

    def new_vec(*args, **kwargs):
        kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
        phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]
        kpt_data = h_r.reshape(-1, phase.shape[0]).dot(phase)
        return kpt_data.reshape(h_r.shape[:2] + (-1,))

The -1 indices tell numpy to infer the size of those dimensions from the other dimensions and the total number of elements in the array. I've also used broadcasting to divide by r_ndegen, which eliminates the need for np.tile.
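As a minimal illustration of that broadcasting step (toy arrays, not the ones from the question): dividing by a column vector of shape (n, 1) broadcasts it across the columns, giving the same result as dividing by the tiled array but without ever materializing the copy.

    a = np.arange(12, dtype=float).reshape(3, 4)
    d = np.array([1.0, 2.0, 4.0])

    tiled = np.tile(d.reshape(3, 1), 4)            # explicit (3, 4) copy
    assert np.allclose(a / tiled, a / d[:, None])  # broadcast, no copy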

Using the same random input, we can verify that the new version gives the same result as the original:

    In [4]: ans1 = original_loop(seed=0)

    In [5]: ans2 = new_vec(seed=0)

    In [6]: np.allclose(ans1, ans2)
    Out[6]: True

Some performance benchmarks:

    In [7]: %timeit original_loop()
    1 loop, best of 3: 13.5 s per loop

    In [8]: %timeit original_vec()
    1 loop, best of 3: 24.1 s per loop

    In [5]: %timeit new_vec()
    1 loop, best of 3: 2.49 s per loop

Update:

I was curious why np.dot was so much slower for the original (20, 20, 1000) h_r array, so I dug through the numpy source code. The logic implemented in multiarraymodule.c turns out to be surprisingly simple:

    #if defined(HAVE_CBLAS)
        if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
                (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
                 NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
            return cblas_matrixproduct(typenum, ap1, ap2, out);
        }
    #endif

In other words, numpy just checks whether either of the input arrays has more than 2 dimensions, and if so it immediately falls back on a non-BLAS implementation of matrix-matrix multiplication. It seems like it shouldn't be too difficult to check whether the inner dimensions of the two arrays are compatible, and if so treat them as 2D and perform *gemm matrix-matrix multiplication on them. In fact, there's an open feature request for this dating back to 2012, if any numpy devs are reading...

In the meantime, it's a nice performance trick to be aware of when multiplying tensors.
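Here's a minimal sketch of the trick wrapped up as a reusable helper (dot_2d_trick is a hypothetical name, not part of numpy; it assumes the second argument is 2D):

    def dot_2d_trick(a, b):
        """Collapse a's leading dimensions so BLAS sees a 2D x 2D
        product, then restore the original leading dimensions."""
        out = a.reshape(-1, a.shape[-1]).dot(b)
        return out.reshape(a.shape[:-1] + b.shape[1:])

With the arrays above, dot_2d_trick(h_r, phase) should match h_r.dot(phase) while running at the reshaped speed.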


Update 2:

I forgot about np.tensordot. Since it calls the same underlying BLAS routines as np.dot does on a 2D array, it can achieve the same speed-up, but without all those ugly reshape operations:

    In [6]: %timeit np.tensordot(h_r, phase, axes=1)
    1 loop, best of 3: 1.05 s per loop
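Here axes=1 means "sum over the last axis of the first argument and the first axis of the second", which is exactly what the original dot product was doing. A quick check, assuming the arrays from above:

    assert np.allclose(np.tensordot(h_r, phase, axes=1), h_r.dot(phase))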

I suspect that the first operation is hitting a resource limit. You might benefit from these two questions: Efficient dot products of large memory-mapped arrays and Dot product of huge arrays in numpy.

