I am trying to run a nearest-neighbor clustering on a sparse SciPy matrix returned by scikit-learn's DictVectorizer. However, when I try to compute the Euclidean distance matrix with scikit-learn, using pairwise.euclidean_distances and pairwise.pairwise_distances, I get an error message. I was under the impression that scikit-learn could compute these distance matrices.
My matrix is very sparse with the form <364402x223209 sparse matrix of type <class 'numpy.float64'>
with 728804 stored elements in Compressed Sparse Row format>.
I also tried methods such as pdist and KDTree in SciPy, but got other errors that seem to come from their inability to process the sparse matrix.
Can someone point me to a solution that would let me compute the distance matrix and/or the nearest-neighbor results?
Code example:
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial
file = 'FileLocation'
data = []
# Each input line is "user,item,value"; build one feature dict per line.
with open(file, 'r') as f:
    for line in f:
        templine = line.strip().split(',')
        data.append({'user': str(int(templine[0])), str(int(templine[1])): int(templine[2])})
vec = DictVectorizer()
X = vec.fit_transform(data)
result = scipy.spatial.KDTree(X)
Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/kdtree.py", line 227, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack
Similarly, if I run:
scipy.spatial.distance.pdist(X,'euclidean')
I get the following:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 1169, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.
Finally, running NearestNeighbors in scikit-learn results in a memory error with:
nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute')
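For context on the memory error: a full dense 364402 x 364402 float64 distance matrix would need roughly 1 TB, so any approach that materializes it at once is likely to fail. One possible direction, sketched here under assumptions and not tested on the real data, is to keep the matrix sparse and query the brute-force NearestNeighbors in row batches, so that only a (batch_size x n_samples) block of distances exists at a time. The helper name batched_kneighbors, the batch_size parameter, and the random demo matrix are hypothetical and only for illustration; recent scikit-learn versions may already chunk this computation internally.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors


def batched_kneighbors(X, n_neighbors=10, batch_size=1000):
    """Hypothetical helper: query neighbors in row batches so that only a
    (batch_size x n_samples) distance block is held in memory at once."""
    # Brute force accepts sparse CSR input with the default Euclidean metric.
    nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute').fit(X)
    dist_parts, idx_parts = [], []
    for start in range(0, X.shape[0], batch_size):
        # Row slicing a CSR matrix stays sparse and is cheap.
        dist, idx = nbrs.kneighbors(X[start:start + batch_size])
        dist_parts.append(dist)
        idx_parts.append(idx)
    return np.vstack(dist_parts), np.vstack(idx_parts)


# Small random CSR matrix standing in for the DictVectorizer output (assumption).
X_demo = sp.random(200, 50, density=0.05, format='csr', random_state=0)
distances, indices = batched_kneighbors(X_demo, n_neighbors=5, batch_size=64)
```

With the real matrix, X from vec.fit_transform(data) would replace X_demo, and batch_size would be tuned to the available memory; the output is only n_samples x n_neighbors, which is far smaller than the full distance matrix.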