Effective Nearest Neighbor Search for Sparse Matrices

I have a large array of data (text) that I converted to a term-document sparse matrix (I use scipy.sparse.csr.csr_matrix to store the sparse matrix). I want to find for each document the best matches of the nearest neighbors. I was hoping the NearestNeighbor procedure in the Python scikit-learn library ( sklearn.neighbors.NearestNeighbor , to be exact) would solve my problem, but efficient algorithms that use space partitioning data structures like KD trees or Ball trees do not work with sparse matrices, Only the brute force algorithm works with sparse matrices (which is impossible in my case, since I deal with a large case).

Is there an effective implementation of finding the nearest neighbor for sparse matrices (in Python or in any other language)?

Thanks.

+8
python scipy scikit-learn nearest-neighbor
source share
2 answers

Late answer: see Location Sensitivity

Support for scikit-learn has been offered here and here .

+4
source share

You can try to convert your high-dimensional sparse data into low-dimensional dense data using TruncatedSVD, then make a spherical tree.

+3
source share

All Articles