I have a large (100K by 30K) and (very) sparse dataset in svmlight format, which I load as follows:
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_svmlight_file
X,Y = load_svmlight_file("somefile_svm.txt")
which returns X as a SciPy sparse matrix.
I just need to compute the pairwise distances between all training points, as in
D = pdist(X)
Unfortunately, the distance functions in scipy.spatial.distance only work on dense arrays. Given the size of the dataset, converting it to a dense matrix first is not feasible, e.g.
D = pdist(X.todense())
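To put rough numbers on why this is not feasible, here is a back-of-the-envelope sketch. It assumes float64 storage (8 bytes per value) and the approximate 100K by 30K shape mentioned above; the variable names are only illustrative.

```python
# Rough memory estimate; assumes float64 (8 bytes per value)
# and the approximate shape of the dataset (100K rows, 30K columns).
n_samples, n_features = 100_000, 30_000
bytes_per_value = 8

# Memory needed to hold X as a dense array.
dense_bytes = n_samples * n_features * bytes_per_value

# pdist returns a condensed vector with one entry per unordered pair of rows.
n_pairs = n_samples * (n_samples - 1) // 2
condensed_bytes = n_pairs * bytes_per_value

print(f"dense X:         {dense_bytes / 1e9:.1f} GB")
print(f"condensed pdist: {condensed_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense copy alone takes about 24 GB, and the condensed distance vector would take about 40 GB regardless of how sparse X is.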
Any pointers to pairwise distance implementations that work on sparse matrices, or to workarounds for this issue, are welcome.
Many thanks