Scipy Sparse - distance matrix (Scikit or Scipy)

I am trying to compute nearest-neighbor clusters on a sparse SciPy matrix returned by scikit-learn's DictVectorizer. However, when I try to compute the distance matrix with scikit-learn, I get an error using the Euclidean distance through both pairwise.euclidean_distances and pairwise.pairwise_distances. I was under the impression that scikit-learn could compute these distance matrices.

My matrix is very sparse with the form <364402x223209 sparse matrix of type <class 'numpy.float64'> with 728804 stored elements in Compressed Sparse Row format>.

I also tried pdist and KDTree in SciPy, but got other errors about not being able to process the result.

Can someone point me to a solution that would let me compute the distance matrix and/or the nearest-neighbor result?

Code example:

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial

path = 'FileLocation'
data = []
# Each line is "user,item,value"; build one feature dict per line.
with open(path, 'r') as f:
    for line in f:
        templine = line.strip().split(',')
        data.append({'user': str(int(templine[0])), str(int(templine[1])): int(templine[2])})

vec = DictVectorizer()
X = vec.fit_transform(data)

result = scipy.spatial.KDTree(X)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/kdtree.py", line 227, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack

Similarly, if I run:

scipy.spatial.distance.pdist(X,'euclidean')

I get the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 1169, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.

Finally, running NearestNeighbors in scikit-learn results in a memory error with:

nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute')
2 answers

First, you cannot use KDTree or pdist with a sparse matrix; you have to convert it to a dense one (whether that is an option for you is your call):

>>> X
<2x3 sparse matrix of type '<type 'numpy.float64'>'
        with 4 stored elements in Compressed Sparse Row format>

>>> scipy.spatial.KDTree(X.todense())
<scipy.spatial.kdtree.KDTree object at 0x34d1e10>
>>> scipy.spatial.distance.pdist(X.todense(),'euclidean')
array([ 6.55743852])
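
Whether densifying is an option depends mostly on memory. As a rough back-of-the-envelope sketch, using only the shape and stored-element count reported in the question and assuming float64 values and 4-byte CSR indices (the index width is an assumption), the sizes can be estimated like this:

```python
# Rough memory estimate for the matrix described in the question.
n_rows, n_cols = 364402, 223209   # shape reported in the question
nnz = 728804                      # stored elements reported in the question
bytes_per_value = 8               # numpy.float64

# Dense array: every cell is stored.
dense_bytes = n_rows * n_cols * bytes_per_value

# CSR: one value and one column index per stored element, plus one row
# pointer per row (+1). Assumes 4-byte (int32) indices.
csr_bytes = nnz * (bytes_per_value + 4) + (n_rows + 1) * 4

# pdist returns a condensed vector of n * (n - 1) / 2 float64 distances.
pdist_bytes = n_rows * (n_rows - 1) // 2 * bytes_per_value

print(f"dense matrix : {dense_bytes / 1e9:,.0f} GB")
print(f"CSR matrix   : {csr_bytes / 1e6:,.1f} MB")
print(f"pdist output : {pdist_bytes / 1e9:,.0f} GB")
```

Under these assumptions both the dense matrix and the full pdist output come out at hundreds of gigabytes, so X.todense() is realistic only for a much smaller sample of the rows.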

Secondly, from the docs:

Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible.

So you might want to try the "ball_tree" algorithm, though I am not sure it will cope with your data.
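
If the brute-force search is kept, one possible workaround is to query the neighbors in row batches, so that only a (batch_size x n_samples) block of distances is needed at a time. The sketch below assumes the memory error comes from the size of the distance block computed at query time, which the question does not confirm. The helper name kneighbors_in_batches and the random test matrix are made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.neighbors import NearestNeighbors

# Small random CSR matrix standing in for the DictVectorizer output.
X = sparse.random(200, 50, density=0.05, format="csr", random_state=0)

# The brute-force backend accepts sparse input directly.
nbrs = NearestNeighbors(n_neighbors=5, algorithm="brute", metric="euclidean")
nbrs.fit(X)

def kneighbors_in_batches(model, X, batch_size=64):
    """Hypothetical helper: query neighbors for row blocks of X so that
    each call only handles batch_size query rows."""
    dist_parts, idx_parts = [], []
    for start in range(0, X.shape[0], batch_size):
        dist, idx = model.kneighbors(X[start:start + batch_size])
        dist_parts.append(dist)
        idx_parts.append(idx)
    return np.vstack(dist_parts), np.vstack(idx_parts)

distances, indices = kneighbors_in_batches(nbrs, X)
print(distances.shape, indices.shape)
```

Because the query rows are also the fitted rows, each row's own index normally appears among its neighbors at distance zero; asking for one extra neighbor and dropping the self-match is a common way to handle that.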


For the nearest neighbor itself, sklearn.metrics.pairwise_distances_argmin_min, or working from the sparse product X * X.T, may be worth a look.
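
The X * X.T idea can be sketched with the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, which needs only sparse products plus the row norms. What follows is a minimal illustration on random data, not the answer's own code: the function name nearest_neighbor_blocks and the block size are assumptions, and `@` is used for the matrix product because `*` means different things for SciPy sparse matrices and sparse arrays.

```python
import numpy as np
from scipy import sparse

def nearest_neighbor_blocks(X, block_size=256):
    """For each row of X, return the index of and Euclidean distance to
    its nearest other row, processing block_size rows at a time."""
    X = sparse.csr_matrix(X)
    n = X.shape[0]
    # Squared row norms, computed without densifying X.
    sq_norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()
    best_idx = np.empty(n, dtype=np.intp)
    best_dist = np.empty(n)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # (block x n) slice of the Gram matrix; it is densified here,
        # so block_size has to stay modest for large n.
        gram = (X[start:stop] @ X.T).toarray()
        d2 = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * gram
        np.maximum(d2, 0.0, out=d2)  # clip small negative rounding noise
        # Exclude each row's distance to itself.
        d2[np.arange(stop - start), np.arange(start, stop)] = np.inf
        best_idx[start:stop] = d2.argmin(axis=1)
        best_dist[start:stop] = np.sqrt(d2.min(axis=1))
    return best_idx, best_dist

X = sparse.random(300, 40, density=0.1, format="csr", random_state=0)
idx, dist = nearest_neighbor_blocks(X, block_size=64)
print(idx[:5], dist[:5])
```

Each block still costs block_size x n_samples floats once densified, so for a matrix the size of the one in the question this trades memory for a long running time rather than making the problem cheap.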
