I am trying to implement a clustering method on a test dataset based on its similarity to a training dataset, using Euclidean distance. The test dataset has 500 points, each an N-dimensional vector (N = 1024). The training dataset has about 10,000 points, each also a 1024-dimensional vector. The goal is to compute the L2 distance between every test point and all training points in order to find the closest training point for each test point (without using any Python distance functions). Since the test array and the training array have different sizes, I tried to use broadcasting:
    import numpy as np

    dist = np.sqrt(np.sum((test[:, np.newaxis] - train)**2, axis=2))
where test is an array of shape (500, 1024) and train is an array of shape (10000, 1024). I get a MemoryError, presumably because the broadcasted intermediate test[:, np.newaxis] - train has shape (500, 10000, 1024), which is roughly 40 GB as float64. However, the same code works for smaller arrays. For instance:
    test = np.array([[1, 2], [3, 4]])
    train = np.array([[1, 0], [0, 1], [1, 1]])
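For reference, this is the complete small example I ran (the values in the comment are what I get, rounded):

    import numpy as np

    test = np.array([[1, 2], [3, 4]])
    train = np.array([[1, 0], [0, 1], [1, 1]])

    # same broadcasting approach as above; fine at this size
    dist = np.sqrt(np.sum((test[:, np.newaxis] - train)**2, axis=2))
    print(dist.shape)  # (2, 3)
    print(dist)
    # [[2.         1.41421356 1.        ]
    #  [4.47213595 4.24264069 3.60555128]]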
Is there a more memory-efficient way to do this calculation without loops? Based on posts I found online, the squared L2 distance can be expanded into matrix products: ||x - y||^2 = x·x - 2 x·y + y·y. So I tried the following:
    x2 = np.dot(test, test.T)
    y2 = np.dot(train, train.T)
    xy = 2 * np.dot(test, train.T)
    dist = np.sqrt(x2 - xy + y2)
Since the matrices have different shapes, the terms do not broadcast together (x2 is (500, 500), y2 is (10000, 10000), and xy is (500, 10000)), and I'm not sure this is the right way to expand the formula (I don't have much experience with NumPy broadcasting); my current guess at a fix is sketched below. What is the correct way to implement the L2 distance calculation as matrix multiplication in NumPy when the matrices have different shapes? The resulting distance matrix should satisfy dist[i, j] = Euclidean distance between test point i and train point j.
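To make the shapes line up, my guess is that x2 and y2 should be row-wise sums of squares (kept as a column and a row so they broadcast against the cross term) rather than full Gram matrices. Something along these lines (the np.maximum clamp is just my attempt to guard against tiny negative values from rounding); is this the right idea?

    import numpy as np

    # row-wise sums of squares, kept 2-D so they broadcast against the cross term
    x2 = np.sum(test**2, axis=1)[:, np.newaxis]   # shape (500, 1)
    y2 = np.sum(train**2, axis=1)[np.newaxis, :]  # shape (1, 10000)
    xy = np.dot(test, train.T)                    # shape (500, 10000)

    # squared distances via ||x - y||^2 = x.x - 2 x.y + y.y
    sq = np.maximum(x2 - 2 * xy + y2, 0)          # clamp tiny negatives from rounding
    dist = np.sqrt(sq)                            # dist[i, j] = distance from test i to train j
    closest = np.argmin(dist, axis=1)             # index of the nearest training point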
thanks