Fastest pairwise distance metric in python

I have a 1D array of numbers and want to calculate all pairwise Euclidean distances. I have a way (thanks to SO) of doing this with broadcasting, but it is inefficient because it calculates each distance twice, and it does not scale well.

Here is an example that gives me what I want with an array of 1000 numbers.

    import numpy as np
    import random

    r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
    dists = np.abs(r - r[:, None])

What is the fastest implementation in scipy / numpy / scikit-learn that I can use for this, given that it has to scale to situations where the 1D array has more than 10k values?

Note: the matrix is symmetric, so I'm guessing that it's possible to get at least a 2x speedup by exploiting that; I just don't know how.
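To put a number on "does not scale well": the full result for n values is an n x n matrix, so memory grows quadratically. A quick sizing sketch, assuming 8-byte values:

    import numpy as np

    n = 10_000
    # Full dense distance matrix: n * n entries at 8 bytes each.
    gb = n * n * np.dtype(np.float64).itemsize / 1e9
    print(gb)  # 0.8 GB at n = 10k; 10x more values means 100x more memory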

+8
python arrays numpy scipy scikit-learn
3 answers

Neither of the answers fully answered the question: one was in Cython, the other was slower. But both provided very useful hints. Following up on them, it turns out that scipy.spatial.distance.pdist is the way to go.

Here is the code:

    import numpy as np
    import random
    import sklearn.metrics.pairwise
    import scipy.spatial.distance

    r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
    c = r[:, None]  # pdist and sklearn expect a 2D array of observations

    def option1(r):
        return np.abs(r - r[:, None])  # full (n, n) matrix via broadcasting

    def option2(r):
        return scipy.spatial.distance.pdist(r, 'cityblock')  # condensed 1D output

    def option3(r):
        return sklearn.metrics.pairwise.manhattan_distances(r)

Timing with IPython:

    In [36]: timeit option1(r)
    100 loops, best of 3: 5.31 ms per loop

    In [37]: timeit option2(c)
    1000 loops, best of 3: 1.84 ms per loop

    In [38]: timeit option3(c)
    100 loops, best of 3: 11.5 ms per loop

I did not try the Cython implementation (I cannot use it for this project), but comparing my results with the other answer, it looks like scipy.spatial.distance.pdist is roughly a third slower than the Cython implementation (accounting for the different machines by benchmarking against the np.abs solution).
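One caveat when comparing the options: pdist returns a condensed 1D array of the n*(n-1)/2 unique distances rather than the full matrix. If the square form is needed afterwards, scipy.spatial.distance.squareform expands it; a minimal sketch:

    import numpy as np
    import scipy.spatial.distance

    r = np.array([1.0, 4.0, 9.0])
    # pdist wants a 2D array of observations, hence the reshape to (n, 1).
    condensed = scipy.spatial.distance.pdist(r[:, None], 'cityblock')  # n*(n-1)/2 values
    full = scipy.spatial.distance.squareform(condensed)  # (n, n) symmetric matrix
    print(full)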

+15

Here is a Cython implementation that gives more than a 3x speedup for this example on my machine. The timing should be revisited for larger arrays, since the BLAS routines can probably scale much better than this rather naive code.

I know that you asked for something inside scipy / numpy / scikit-learn, but maybe this will open up new possibilities for you:

my_cython.pyx file:

    import numpy as np
    cimport numpy as np
    import cython

    cdef extern from "math.h":
        double fabs(double t)  # C's abs() is integer-only; fabs is the double version

    @cython.wraparound(False)
    @cython.boundscheck(False)
    def pairwise_distance(np.ndarray[np.double_t, ndim=1] r):
        cdef int i, j, c, size
        cdef np.ndarray[np.double_t, ndim=1] ans
        size = sum(range(1, r.shape[0] + 1))  # n*(n+1)/2 entries, diagonal included
        ans = np.empty(size, dtype=r.dtype)
        c = -1
        for i in range(r.shape[0]):
            for j in range(i, r.shape[0]):
                c += 1
                ans[c] = fabs(r[i] - r[j])
        return ans

The result is a 1D array containing all of the non-repeated evaluations.
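For reference, that flat layout can be indexed directly: row i starts after i*n - i*(i-1)/2 entries, and unlike pdist it also stores the zero self-distances, since the inner loop starts at j = i. A small sketch of the mapping, with index_of as a hypothetical helper name:

    def index_of(i, j, n):
        # Hypothetical helper: position of the pair (i, j), with i <= j,
        # in the flat array written by the double loop above (diagonal included).
        assert 0 <= i <= j < n
        return i * n - i * (i - 1) // 2 + (j - i)

    # Example: for n = 4, the pair (1, 3) lands at index 4 + 2 = 6.
    print(index_of(1, 3, 4))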

To import into Python:

    import numpy as np
    import random
    import pyximport; pyximport.install()
    from my_cython import pairwise_distance

    r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)], dtype=float)

    def solOP(r):
        return np.abs(r - r[:, None])

Timing with IPython:

    In [2]: timeit solOP(r)
    100 loops, best of 3: 7.38 ms per loop

    In [3]: timeit pairwise_distance(r)
    1000 loops, best of 3: 1.77 ms per loop
+5

Using half the memory, but 6 times slower than np.abs(r - r[:, None]):

    triu = np.triu_indices(r.shape[0], 1)
    dists2 = abs(r[triu[1]] - r[triu[0]])
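If the full square matrix is needed later, the same index pair can scatter the condensed values back; a small sketch reusing the triu / dists2 names from the snippet above on a toy array:

    import numpy as np

    r = np.array([1.0, 4.0, 9.0])
    triu = np.triu_indices(r.shape[0], 1)
    dists2 = abs(r[triu[1]] - r[triu[0]])

    full = np.zeros((r.shape[0], r.shape[0]), dtype=dists2.dtype)
    full[triu] = dists2               # fill the upper triangle
    full[triu[1], triu[0]] = dists2   # mirror into the lower triangle
    print(full)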
+3
