What is the most platform and Python independent way to make a quick loop for use in Python?

I am writing a scientific application in Python with a very processor intensive loop. I would like to optimize this as much as possible, with minimal inconvenience to end users, who are likely to use it as an unrelated collection of Python scripts and use Windows, Mac and (mostly Ubuntu) Linux.

It is currently written in Python with a NumPy dash, and I have included the code below.

  • Is there a solution that would be fast enough that would not require compilation? This would seem to be the easiest way to maintain platform independence.
  • If you use something like Pyrex that requires compilation, is there an easy way to combine many modules and choose Python between them depending on the OS detected and the version of Python? Is there an easy way to put together a collection of modules without having to access each system with every version of Python?
  • Does one method provide especially multiprocessor optimization?

(If you're interested, the loop should calculate the magnetic field at a given point inside the crystal by combining the contributions of a large number of neighboring magnetic ions, considered to be tiny magnets on the rod. Basically, a massive sum of these .)

# calculate_dipole # ------------------------- # calculate_dipole works out the dipole field at a given point within the crystal unit cell # --- # INPUT # mu = position at which to calculate the dipole field # r_i = array of atomic positions # mom_i = corresponding array of magnetic moments # --- # OUTPUT # B = the B-field at this point def calculate_dipole(mu, r_i, mom_i): relative = mu - r_i r_unit = unit_vectors(relative) #4pi / mu0 (at the front of the dipole eqn) A = 1e-7 #initalise dipole field B = zeros(3,float) for i in range(len(relative)): #work out the dipole field and add it to the estimate so far B += A*(3*dot(mom_i[i],r_unit[i])*r_unit[i] - mom_i[i]) / sqrt(dot(relative[i],relative[i]))**3 return B 
+6
optimization python
source share
5 answers

You can make it work much, much faster if you eliminate the loop and use vectorized Numpy operations. Put your data in numpy (3, N) arrays and try the following:

 import numpy as np N = 20000 mu = np.random.random((3,1)) r_i = np.random.random((3,N)) mom_i = np.random.random((3,N)) def unit_vectors(r): return r / np.sqrt((r*r).sum(0)) def calculate_dipole(mu, r_i, mom_i): relative = mu - r_i r_unit = unit_vectors(relative) A = 1e-7 num = A*(3*np.sum(mom_i*r_unit, 0)*r_unit - mom_i) den = np.sqrt(np.sum(relative*relative, 0))**3 B = np.sum(num/den, 1) return B 

For me, this works about 50 times faster than using a for loop.

+10
source share

Numpy uses some native optimization to process the array. You can use a Numpy array with Cython to get some speedups.

+4
source share

Perhaps your python code can be slightly faster by replacing your loop with a generator expression and removing all queries mom_i [i], relative [i] and r_unit [i], iterating over all three sequences in parallel using itertools.izip.

i.e. replace

 B = zeros(3,float) for i in range(len(relative)): #work out the dipole field and add it to the estimate so far B += A*(3*dot(mom_i[i],r_unit[i])*r_unit[i] - mom_i[i]) / sqrt(dot(relative[i],relative[i]))**3 return B 

from:

 from itertools import izip ... return sum((A*(3*dot(mom,ru)*ru - mom) / sqrt(dot(rel,rel))**3 for mom, ru, rel in izip(mom_i, r_unit, relative)), zeros(3,float)) 

This is also more readable IMHO, as the basic equation is not cluttered [i] everywhere.

I suspect, however, that this will only give you marginal gains compared to performing the entire function in a compiled language such as Cython.

+3
source share

One simple but significant acceleration is multiplying by A outside of your sum. You can just smash B with it when you return it:

 for i in range(len(relative)): #work out the dipole field and add it to the estimate so far B += (3*dot(mom_i[i],r_unit[i])*r_unit[i] - mom_i[i]) / sqrt(dot(relative[i],relative[i]))**3 return A*B 

This gave an approximately 8% acceleration using 20,000 random dipoles.

Besides this easy acceleration, I would recommend using Cython (which is usually recommended using Pyrex) or Scave's Weave. Take a look at Performance Python for some examples and comparisons of various ways to speed up Numpy / Scipy.

If you want to try this parallel, I would recommend to get started with Scipy Parallel Programming .

Good to see another physicist at SO. Not much here.

Edit:

I decided to take this as a challenge to develop some of Cython's skills and received a 10-fold improvement over the optimized version of Psyco. Let me know if you want to see my code.

Edit2:

Ok, came back and found what was slowing down in my version of Cython. Now the acceleration is much more than 100 times. If you want or need another 2x or so ratio according to Ray sped-up Numpy, let me know and I will send my code.

Download source package cython:

Here is the Cython code that I ran:

 import numpy as np cimport numpy as np cimport cython cdef extern from "math.h": double sqrt(double theta) ctypedef np.float64_t dtype_t @cython.boundscheck(False) @cython.wraparound(False) def calculate_dipole_cython(np.ndarray[dtype_t,ndim=2,mode="c"] mu, np.ndarray[dtype_t,ndim=2,mode="c"] r_i, np.ndarray[dtype_t,ndim=2,mode="c"] mom_i): cdef Py_ssize_t i cdef np.ndarray[dtype_t,ndim=1,mode="c"] tmp = np.empty(3,np.float64) cdef np.ndarray[dtype_t,ndim=1,mode="c"] relative = np.empty(3,np.float64) cdef double A = 1e-7 cdef double C, D, F cdef np.ndarray[dtype_t,ndim=1,mode="c"] B = np.zeros(3,np.float64) for i in xrange(r_i.shape[0]): relative[0] = mu[0,0] - r_i[i,0] relative[1] = mu[0,1] - r_i[i,1] relative[2] = mu[0,2] - r_i[i,2] C = relative[0]*relative[0] + relative[1]*relative[1] + relative[2]*relative[2] C = 1.0/sqrt(C) D = C**3 tmp[0] = relative[0]*C F = mom_i[i,0]*tmp[0] tmp[1] = relative[1]*C F += mom_i[i,1]*tmp[1] tmp[2] = relative[2]*C F += mom_i[i,2]*tmp[2] F *= 3 B[0] += (F*tmp[0] - mom_i[i,0])*D B[1] += (F*tmp[1] - mom_i[i,1])*D B[2] += (F*tmp[2] - mom_i[i,2])*D return A*B 

I optimized it a bit, but there may be a bit more that you can get away from it. You can still replace np.zeros and np.empty with direct calls with the Numpy C API, but that shouldn't matter much. Be that as it may, this code provides a 2–3x improvement over the optimized Numpy code. However, you need to pass the numbers correctly. Arrays must be in C format (which is the default for Numpy arrays, but in Numpy, transposing a formatted C array is a Fortran formatted array).

For example, to run the code from your other question , you need to replace np.random.random((3,N)) with np.random.random((N,3)) . In addition, `

 r_test_fast = reshape_vector(r_test) 

need to change to

 r_test_fast = np.array(np.matrix(r_test)) 

This last line can be made simpler / faster, but it will be premature optimization, in my opinion.

If you have not used Cython before and do not know how to do this, let me know and I will be happy to help.

Finally, I would recommend looking at this article . I used it as a guide for my optimizations. The next step would be to try using the BLAS functions that use the SSE2 instruction set, trying to use the SSE API or trying to use more of the Numpy C API that interacts with SSE2. In addition, you can view parallelization.

+2
source share

Python is not intended for high performance computing. Write a kernel loop in C and call it from Python.

+1
source share

All Articles