One simple but significant speedup is to move the multiplication by A outside of your sum (the loop below); you can just apply it to B once when you return it:
for i in range(len(relative)):
This gave roughly an 8% speedup in a test with 20,000 random dipoles.
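A minimal sketch of what that looks like, assuming a pure-Python loop shaped like the one in your question (the function name calculate_dipole and the exact body are my reconstruction, not your original code):

import numpy as np

def calculate_dipole(mu, r_i, mom_i):
    """Sum the dipole fields at point mu; apply A only once, at the end."""
    A = 1e-7                     # mu_0 / (4*pi)
    B = np.zeros(3)
    relative = mu - r_i          # (N, 3) displacements from each dipole to mu
    for i in range(len(relative)):
        d = np.sqrt(np.dot(relative[i], relative[i]))   # |r|
        r_hat = relative[i] / d                         # unit vector
        m = mom_i[i]
        B += (3.0 * np.dot(m, r_hat) * r_hat - m) / d**3
    return A * B                 # multiplied by A once here, not inside the loop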
Besides this easy speedup, I would recommend using Cython (which is now generally recommended over Pyrex) or SciPy's Weave. Take a look at Performance Python for some examples and comparisons of the various ways to speed up NumPy/SciPy.
If you want to try going parallel, I would recommend SciPy's Parallel Programming page as a place to get started.
Good to see another physicist on SO. There aren't many of us here.
Edit:
I decided to take this on as a challenge to develop my Cython skills, and got a 10x speedup over a Psyco-optimized version. Let me know if you want to see my code.
Edit2:
Ok, I went back and found what was slowing things down in my Cython version. Now the speedup is well over 100x. If you want or need another factor of 2x or so over Ray's sped-up NumPy code, let me know and I'll post mine.
Here is the Cython code that I ran:
import numpy as np
cimport numpy as np
cimport cython

cdef extern from "math.h":
    double sqrt(double theta)

ctypedef np.float64_t dtype_t

@cython.boundscheck(False)
@cython.wraparound(False)
def calculate_dipole_cython(np.ndarray[dtype_t, ndim=2, mode="c"] mu,
                            np.ndarray[dtype_t, ndim=2, mode="c"] r_i,
                            np.ndarray[dtype_t, ndim=2, mode="c"] mom_i):
    cdef Py_ssize_t i
    cdef np.ndarray[dtype_t, ndim=1, mode="c"] tmp = np.empty(3, np.float64)
    cdef np.ndarray[dtype_t, ndim=1, mode="c"] relative = np.empty(3, np.float64)
    cdef double A = 1e-7   # mu_0 / (4*pi)
    cdef double C, D, F
    cdef np.ndarray[dtype_t, ndim=1, mode="c"] B = np.zeros(3, np.float64)
    for i in range(r_i.shape[0]):
        # displacement from the i-th dipole to the field point
        relative[0] = mu[0, 0] - r_i[i, 0]
        relative[1] = mu[0, 1] - r_i[i, 1]
        relative[2] = mu[0, 2] - r_i[i, 2]
        C = relative[0]*relative[0] + relative[1]*relative[1] + relative[2]*relative[2]
        C = 1.0 / sqrt(C)   # 1/|r|
        D = C ** 3          # 1/|r|^3
        # tmp becomes the unit vector r_hat; F accumulates m . r_hat
        tmp[0] = relative[0] * C
        F = mom_i[i, 0] * tmp[0]
        tmp[1] = relative[1] * C
        F += mom_i[i, 1] * tmp[1]
        tmp[2] = relative[2] * C
        F += mom_i[i, 2] * tmp[2]
        F *= 3
        # B += (3 (m . r_hat) r_hat - m) / |r|^3
        B[0] += (F * tmp[0] - mom_i[i, 0]) * D
        B[1] += (F * tmp[1] - mom_i[i, 1]) * D
        B[2] += (F * tmp[2] - mom_i[i, 2]) * D
    return A * B
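Just to show the expected shapes, a hypothetical call (the module name dipole_cython is my choice; see the build script further down):

import numpy as np
from dipole_cython import calculate_dipole_cython  # hypothetical module name

N = 20000
mu = np.random.random((1, 3))     # field point, as a (1, 3) C-ordered array
r_i = np.random.random((N, 3))    # dipole positions
mom_i = np.random.random((N, 3))  # dipole moments
B = calculate_dipole_cython(mu, r_i, mom_i)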
I've optimized it a bit, but there may be a little more you can squeeze out of it. You could still replace np.zeros and np.empty with direct calls to the NumPy C API, but that shouldn't make much of a difference. As it stands, this code gives a 2-3x improvement over the optimized NumPy code. However, you need to pass the arrays in correctly: they must be C-contiguous (which is the default for NumPy arrays, but note that in NumPy the transpose of a C-contiguous array is Fortran-ordered).
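A quick way to check the layout before calling in; np.ascontiguousarray will make a C-ordered copy only if one is needed:

import numpy as np

r = np.random.random((3, 5)).T    # transposing makes it Fortran-ordered
print(r.flags['C_CONTIGUOUS'])    # False
r_c = np.ascontiguousarray(r)     # C-ordered copy, safe to pass to the Cython code
print(r_c.flags['C_CONTIGUOUS'])  # True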
For example, to run the code from your other question, you need to replace np.random.random((3,N)) with np.random.random((N,3)). In addition,
r_test_fast = reshape_vector(r_test)
needs to change to
r_test_fast = np.array(np.matrix(r_test))
This last line could be made simpler/faster, but that would be premature optimization, in my opinion.
If you haven't used Cython before and don't know how to compile this, let me know and I'd be glad to help.
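For reference, a minimal build script, assuming the code above is saved as dipole_cython.pyx (the file and module names are my choice):

# setup.py -- build in place with: python setup.py build_ext --inplace
from distutils.core import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("dipole_cython.pyx"),
    include_dirs=[np.get_include()],  # headers needed for "cimport numpy"
)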
Finally, I would recommend looking at this article; I used it as a guide for my optimizations. The next step would be to try BLAS functions that make use of the SSE2 instruction set, to use SSE2 intrinsics directly, or to use more of the NumPy C API that interfaces with SSE2. You could also look into parallelization; a rough sketch of that idea follows.
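This is only my illustration of the parallelization idea, not part of the optimizations above: split the dipoles across processes with the standard multiprocessing module and sum the partial fields (calculate_dipole is the loop version sketched earlier).

import numpy as np
from multiprocessing import Pool

def _partial_field(args):
    # each worker sums the field contribution of its own chunk of dipoles
    mu, r_chunk, mom_chunk = args
    return calculate_dipole(mu, r_chunk, mom_chunk)

def calculate_dipole_parallel(mu, r_i, mom_i, nproc=4):
    chunks = [(mu, r, m) for r, m in zip(np.array_split(r_i, nproc),
                                         np.array_split(mom_i, nproc))]
    with Pool(nproc) as pool:
        return sum(pool.map(_partial_field, chunks))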