Sparse distance computation implementations in Python / scikit-learn

I have a large (100K by 30K) and (very) sparse dataset in svmlight format, which I load as follows:

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_svmlight_file

X,Y = load_svmlight_file("somefile_svm.txt")

which returns X as a SciPy sparse matrix.

I just need to calculate the pairwise distances between all training points, as in

D = pdist(X)

Unfortunately, the distance implementations in scipy.spatial.distance only work on dense matrices. Given the size of the dataset, it is not feasible to densify it first and call pdist as, say

D = pdist(X.todense())

Any pointers to sparse implementations of pairwise distance computations, or workarounds for this issue, are welcome.

Many thanks

1 answer

scikit-learn provides sklearn.metrics.euclidean_distances, which works on both sparse matrices and dense NumPy arrays.
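As a rough sketch of how sklearn.metrics.euclidean_distances could be applied here: the full 100K by 100K result would take about 80 GB as dense float64, so the example below computes the distances one block of rows at a time. The small random matrix, the helper name iter_distance_blocks, and the block size are assumptions made for illustration only; they are not part of the original question or answer.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import euclidean_distances

# Small random CSR matrix standing in for the svmlight data
# (hypothetical sizes; the real X would come from load_svmlight_file).
X = sparse.random(200, 500, density=0.01, format="csr", random_state=0)


def iter_distance_blocks(X, block_size=50):
    """Yield (start, block), where block holds the Euclidean distances
    from rows start:start+block_size of X to every row of X.

    Hypothetical helper: X stays sparse, only each block is dense.
    """
    for start in range(0, X.shape[0], block_size):
        yield start, euclidean_distances(X[start:start + block_size], X)


# Stacking all blocks is only reasonable for this toy example; with the
# real data each block would be processed or written to disk instead.
D = np.vstack([block for _, block in iter_distance_blocks(X)])
```

Here D is the square distance matrix, i.e. what squareform(pdist(...)) would return for the dense data, rather than the condensed vector that pdist produces on its own.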
