Euclidean distance of two (non-traditional) vectors in Python

I have two non-traditional vectors, and I would like to calculate the Euclidean distance between them. The vectors are configured as follows:

line1 = '2:20 3:20 5:10 6:10 10:20'
line2 = '1:18 2:20 4:10 6:10 8:20 9:10 10:10'

For each element, the first number is the position in the vector, and the second is the value (for example, 2:20 means that element 2 of the vector has the value 20). Thus, the vector for line 1 is (0, 20, 20, 0, 10, 10, 0, 0, 0, 20), and the vector for line 2 is (18, 20, 0, 10, 0, 10, 0, 20, 10, 10).

I wrote the following program, which works fine on small inputs. The problem is that I have HUGE vectors, and I want to compare them with thousands of other vectors. My computer starts giving me memory errors when I try to run it this way. Is there a way to calculate the Euclidean distance between two vectors in this format without building the long dense vectors (with many 0 elements)?

import numpy as np

def vec_line(line):
    # dense vector of length 10, filled in from the 'index:value' pairs
    vector = [0]*10
    datapoints = line.split(' ')
    for datapoint in datapoints:
        element, value = datapoint.split(':')
        # positions in the input are 1-based
        vector[int(element)-1] = float(value)

    return np.array(vector)

vector1 = vec_line(line1)
vector2 = vec_line(line2)

dist = np.linalg.norm(vector1-vector2)
print(dist)
--> [39.0384425919]
1 answer

Your "unconventional" vectors are usually called "sparse vectors" (or, more generally, "sparse matrices"). SciPy has a package, scipy.sparse, for creating them and performing algebraic operations on them.
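To make the idea concrete, here is a minimal sketch of what a sparse representation stores, using the first line from the question. The dimension of 10 and the conversion to 0-based column positions are assumptions made for this illustration.

```python
from scipy.sparse import csr_matrix

# line1 from the question, '2:20 3:20 5:10 6:10 10:20', written as
# 0-based column positions and their values (assumed dimension: 10).
cols = [1, 2, 4, 5, 9]
vals = [20.0, 20.0, 10.0, 10.0, 20.0]
rows = [0] * len(cols)  # everything lives in a single row

v = csr_matrix((vals, (rows, cols)), shape=(1, 10))

print(v.shape)  # logical size of the vector
print(v.nnz)    # number of values actually stored
```

Only the listed entries are kept in memory, so the cost grows with the number of nonzero values rather than with the length of the vector.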

Here is more or less what you want:

import numpy as np
from scipy.sparse import csr_matrix


def parse_sparse_vector(line, size):
    # size is the full length of the vector; passing it explicitly gives
    # every parsed vector the same shape, so they can be subtracted
    tokens = line.split()
    indexes = []
    values = []
    for token in tokens:
        index, value = token.split(':')
        indexes.append(int(index) - 1)  # 1-based position -> 0-based column
        values.append(float(value))
    # a single row, with the values placed at the given columns
    return csr_matrix((values, ([0] * len(indexes), indexes)), shape=(1, size))

v = parse_sparse_vector(line1, 10)
w = parse_sparse_vector(line2, 10)
diff = v - w
# avoiding a cast to dense matrix:
np.sqrt(diff.dot(diff.T).sum())
## result is 39.038442591886273
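If SciPy is not an option, the same distance can be computed with plain dictionaries, since only positions that are nonzero in at least one vector contribute to the sum. This is a minimal sketch rather than a tuned implementation; the helper names `parse_sparse` and `sparse_euclidean` are made up for this example.

```python
import math


def parse_sparse(line):
    """Parse 'index:value' tokens into a {index: value} dict."""
    vec = {}
    for token in line.split():
        index, value = token.split(':')
        vec[int(index)] = float(value)
    return vec


def sparse_euclidean(a, b):
    """Euclidean distance between two {index: value} dicts."""
    total = 0.0
    # visit only positions that are nonzero in at least one vector
    for index in set(a) | set(b):
        diff = a.get(index, 0.0) - b.get(index, 0.0)
        total += diff * diff
    return math.sqrt(total)


line1 = '2:20 3:20 5:10 6:10 10:20'
line2 = '1:18 2:20 4:10 6:10 8:20 9:10 10:10'
d = sparse_euclidean(parse_sparse(line1), parse_sparse(line2))
print(d)  # approximately 39.0384, i.e. sqrt(1524)
```

Memory use here depends on the number of stored entries, not on the full vector length, and the full length never needs to be known.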