The fastest way to calculate the Euclidean distance between two sets of vectors using numpy or scipy

I recently discovered that scipy.spatial.distance.cdist is very fast at computing the COMPLETE distance matrix between two arrays of source and destination vectors (see How can I calculate the Euclidean distance using numpy?). I wanted to duplicate that performance when computing the distances between two arrays of equal size. The distance between two SINGLE vectors is straightforward to calculate, as shown in the previous link. We can take vectors:

  import numpy as np
  A = np.random.normal(size=(3))
  B = np.random.normal(size=(3))

and then use numpy.linalg.norm:

  np.linalg.norm(A - B)

or, equivalently:

  temp = A - B
  np.sqrt(temp[0]**2 + temp[1]**2 + temp[2]**2)

which works great. When I want the distance between two sets of vectors, where my_distance = distance_between(A[i], B[i]) for all i, the second solution also works fine. In that case, as expected:

  A = np.random.normal(size=(3, 42))
  B = np.random.normal(size=(3, 42))
  temp = A - B
  np.sqrt(temp[0]**2 + temp[1]**2 + temp[2]**2)

gives me a set of 42 distances between the ith element of A and the ith element of B. The norm function, by contrast, computes the norm of the whole matrix, producing a single value, which is not what I'm looking for. The 42-distance behavior is what I want to keep, hopefully at nearly the speed I get from cdist for complete matrices. So the question is: what is the most efficient way to use Python and numpy/scipy to calculate these distances for data of shape (n, i)?
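To make the desired behavior concrete, here is a minimal self-contained sketch (the seed and shapes are illustrative choices, not from the question) showing that the per-column expression produces 42 distances matching the diagonal of the full cdist matrix:

```python
import numpy as np
from scipy.spatial.distance import cdist

np.random.seed(0)
A = np.random.normal(size=(3, 42))
B = np.random.normal(size=(3, 42))

# element-wise distances: one distance per column
pairwise = np.sqrt(np.sum((A - B)**2, axis=0))

# cdist expects rows of shape (n_points, n_dims), so transpose;
# the diagonal of the resulting 42x42 matrix holds the i-th-to-i-th
# distances we want
full = cdist(A.T, B.T)
```

The full matrix computes 42x42 = 1764 distances when only the 42 on the diagonal are needed, which is why a direct per-pair computation is worth pursuing.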

Thanks Sloan

2 answers

I think you have already figured out most of it. However, instead of your last line, I would use:

  np.sqrt(np.sum(temp**2, axis=0))
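A quick self-contained check (my own sketch, with arbitrary seed and shapes) that this expression agrees with np.linalg.norm applied along the first axis:

```python
import numpy as np

np.random.seed(1)
A = np.random.normal(size=(3, 42))
B = np.random.normal(size=(3, 42))
temp = A - B

# sum the squared components down each column, then take the root
answer = np.sqrt(np.sum(temp**2, axis=0))
```

Both give one distance per column of the (3, 42) arrays; the explicit form avoids dispatching through the more general norm machinery.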

Below are timing comparisons for the two methods I find most suitable:

 import timeit

 In [19]: timeit.timeit(stmt='np.linalg.norm(x - y, axis=0)',
                        setup='import numpy as np; x, y = np.random.normal(size=(10, 100)), np.random.normal(size=(10, 100))',
                        number=1000000)
 Out[19]: 15.132534857024439

 In [20]: timeit.timeit(stmt='np.sqrt(np.sum((x - y)**2, axis=0))',
                        setup='import numpy as np; x, y = np.random.normal(size=(10, 100)), np.random.normal(size=(10, 100))',
                        number=1000000)
 Out[20]: 9.417887529009022

I am not surprised that the explicit sum-and-sqrt expression is faster than np.linalg.norm. I expect that, as numpy and Python improve, many of these built-in functions will get faster as well.
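A further variant worth benchmarking (my own suggestion, not part of the answer above) is np.einsum, which sums the squared components without materializing an intermediate temp**2 array:

```python
import numpy as np

np.random.seed(2)
x = np.random.normal(size=(10, 100))
y = np.random.normal(size=(10, 100))
d = x - y

# 'ij,ij->j' multiplies d by itself elementwise and sums over axis 0,
# yielding the squared distance for each of the 100 columns
dist = np.sqrt(np.einsum('ij,ij->j', d, d))
```

Whether einsum wins in practice depends on array shapes and the numpy version, so it is worth timing on your own data rather than assuming.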

Tests were performed with Anaconda Python 3.5.2.

