I am trying to implement a clustering method on a test dataset based on its similarity to a training dataset, using Euclidean distance. The test dataset has 500 points, each an N-dimensional vector (N = 1024). The training dataset has about 10,000 points, each also a 1024-dimensional vector. The goal is to compute the L2 distance between every test point and all training points in order to find the closest training point for each test point (without using any Python distance functions). Since the test array and the training array have different sizes, I tried to use broadcasting:
    import numpy as np

    dist = np.sqrt(np.sum((test[:, np.newaxis] - train)**2, axis=2))
where test is an array of shape (500, 1024) and train is an array of shape (10000, 1024). I get a MemoryError, presumably because the broadcasted intermediate test[:, np.newaxis] - train has shape (500, 10000, 1024), which is roughly 40 GB as float64. However, the same code works for smaller arrays. For instance:
    test = np.array([[1, 2], [3, 4]])
    train = np.array([[1, 0], [0, 1], [1, 1]])
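For reference, this is the complete small example I ran (the values in the comment are what I get, rounded):

    import numpy as np

    test = np.array([[1, 2], [3, 4]])
    train = np.array([[1, 0], [0, 1], [1, 1]])

    # same broadcasting approach as above; fine at this size
    dist = np.sqrt(np.sum((test[:, np.newaxis] - train)**2, axis=2))
    print(dist.shape)  # (2, 3)
    print(dist)
    # [[2.         1.41421356 1.        ]
    #  [4.47213595 4.24264069 3.60555128]]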
Is there a more memory-efficient way to do this calculation without loops? Based on posts I found online, the squared L2 distance can be expanded into matrix products: ||x - y||^2 = x·x - 2 x·y + y·y. So I tried the following:
    x2 = np.dot(test, test.T)
    y2 = np.dot(train, train.T)
    xy = 2 * np.dot(test, train.T)
    dist = np.sqrt(x2 - xy + y2)
Since the matrices have different shapes, the terms do not broadcast together (x2 is (500, 500), y2 is (10000, 10000), and xy is (500, 10000)), and I'm not sure this is the right way to expand the formula (I don't have much experience with NumPy broadcasting); my current guess at a fix is sketched below. What is the correct way to implement the L2 distance calculation as matrix multiplication in NumPy when the matrices have different shapes? The resulting distance matrix should satisfy dist[i, j] = Euclidean distance between test point i and train point j.
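To make the shapes line up, my guess is that x2 and y2 should be row-wise sums of squares (kept as a column and a row so they broadcast against the cross term) rather than full Gram matrices. Something along these lines (the np.maximum clamp is just my attempt to guard against tiny negative values from rounding); is this the right idea?

    import numpy as np

    # row-wise sums of squares, kept 2-D so they broadcast against the cross term
    x2 = np.sum(test**2, axis=1)[:, np.newaxis]   # shape (500, 1)
    y2 = np.sum(train**2, axis=1)[np.newaxis, :]  # shape (1, 10000)
    xy = np.dot(test, train.T)                    # shape (500, 10000)

    # squared distances via ||x - y||^2 = x.x - 2 x.y + y.y
    sq = np.maximum(x2 - 2 * xy + y2, 0)          # clamp tiny negatives from rounding
    dist = np.sqrt(sq)                            # dist[i, j] = distance from test i to train j
    closest = np.argmin(dist, axis=1)             # index of the nearest training point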
thanks