In my project, I need to calculate the Euclidean distance between every pair of points stored in an array. The input array is a 2D numpy array with three columns, which are the coordinates (x, y, z), and each row is a new point.
I usually work with 5000-6000 points in tests.
My first algorithm uses Cython and my second uses numpy. I found that my numpy algorithm is faster than my Cython one.
edit: with 6000 points:
numpy 1.76 s / cython 4.36 s
Here is my cython code:
    cimport cython
    from libc.math cimport sqrt

    @cython.boundscheck(False)
    @cython.wraparound(False)
    cdef void calcul1(double[::1] M, double[::1] R):
        cdef int i = 0
        cdef int n = M.shape[0]
        cdef int x, y
        cdef int start = 3
        for x in range(0, n, 3):
            for y in range(start, n, 3):  # only the pairs after point x
                R[i] = sqrt((M[y] - M[x])**2 + (M[y+1] - M[x+1])**2 + (M[y+2] - M[x+2])**2)
                i += 1
            start += 3  # step by 3 to skip to the next point's x coordinate
M is a memoryview of the initial array of points, flattened with numpy's flatten() before calling calcul1(); R is a 1D memoryview of the output array that stores all the results.
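For reference, here is a plain-Python sketch of what I understand calcul1() to compute (my own tiny 3-point example, not from the original code), showing how the flattened input maps to the condensed 1D output:

```python
import numpy as np

def calcul1_py(M, R):
    """Pure-Python equivalent of the Cython loop: M holds the flattened
    (x, y, z) coordinates, R receives the n*(n-1)/2 pairwise distances."""
    i = 0
    n = M.shape[0]
    start = 3
    for x in range(0, n, 3):
        for y in range(start, n, 3):
            R[i] = ((M[y] - M[x])**2 + (M[y+1] - M[x+1])**2
                    + (M[y+2] - M[x+2])**2) ** 0.5
            i += 1
        start += 3

pts = np.array([[0.0, 0.0, 0.0],
                [3.0, 4.0, 0.0],
                [0.0, 0.0, 5.0]])
out = np.empty(3)            # 3 points -> 3 pairs
calcul1_py(pts.flatten(), out)
print(out)                   # distances 0-1, 0-2, 1-2
```

R ends up holding the pairs in row order: (0,1), (0,2), ..., (1,2), ...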
Here is my numpy code:
    def calcul2(M):
        return np.sqrt(((M[:,:,np.newaxis] - M[:,np.newaxis,:])**2).sum(axis=0))
Here M is the initial array of points, transposed with numpy's transpose() before the call, so that the coordinates (x, y, z) are the rows and the points are the columns.
Also, this numpy function is pretty handy because the array it returns is well organized: it is an n × n array, with n the number of points, and each point gets a row and a column. So, for example, the distance AB is stored at the intersection of row A and column B.
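As a quick sanity check of that layout (my own 3-point example, not from the original post), the broadcasting version returns a matrix whose entry [A, B] is the distance AB:

```python
import numpy as np

def calcul2(M):
    # M has coordinates as rows: shape (3, n) for n points
    return np.sqrt(((M[:, :, np.newaxis] - M[:, np.newaxis, :]) ** 2).sum(axis=0))

pts = np.array([[0.0, 0.0, 0.0],
                [3.0, 4.0, 0.0],
                [0.0, 0.0, 5.0]])
D = calcul2(pts.transpose())   # (3, 3) matrix, D[A, B] = distance AB
print(D[0, 1])                 # 5.0 (the 3-4-5 triangle)
```

The matrix is symmetric with zeros on the diagonal, so every distance is stored twice.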
This is how I call them (test is a Cython function):
    cpdef test():
        cdef double[::1] Mf
        cdef double[::1] out = np.empty(17997000, dtype=np.float64)  # (6000**2 - 6000) / 2
        M = np.arange(6000*3, dtype=np.float64).reshape(6000, 3)  # example array with 6000 points
        Mf = M.flatten()    # because my Cython algorithm needs a 1D array
        Mt = M.transpose()  # because my numpy algorithm needs coordinates as rows
        calcul2(Mt)
        calcul1(Mf, out)
Am I doing something wrong here? Neither is fast enough for my project.
1: Is there a way to improve my cython code to outperform numpy speed?
2: Is there a way to improve my numpy code to calculate even faster?
3: Or is there any other solution, as long as it stays in Python/Cython (e.g. parallel computing)?
Thanks.