Fastest way to get the top N nearest vectors by cosine similarity

I have a huge list of vectors (~100k), representing words and computed using random indexing, and given one input word I have to find its top N nearest vectors. Right now I do a complete sort by distance and then take the top N results, but this takes too long to be usable, since I have to compute 100k distances. Is there a more efficient way to do this? The vectors are already normalized, so I only need to compute the dot product to get the distance.

The vectors are stored in a Java HashMap<String, Vector>, where Vector is the la4j class for sparse vectors.

+4
2 answers

You can put your vectors in a container with spatial support, such as an R-tree, a k-d tree, or a PK-tree.

That way you can find points without iterating over your entire data set, just by looking at a few neighboring cells. Do not forget that you need to search not only the cell that contains the query point but also the neighboring cells, and in a high-dimensional space there are many neighbors.
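A short derivation of why such containers can serve a cosine query at all, assuming (this is an assumption, not something stated above) that the tree is built around ordinary Euclidean distance. For unit-length vectors a and b:

$$\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,a \cdot b = 2 - 2\,a \cdot b \qquad \text{when } \|a\| = \|b\| = 1$$

So the vector with the smallest Euclidean distance to the query is also the one with the largest dot product, and a nearest-neighbor search in the tree returns the same ranking as sorting by cosine similarity.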

Update: You still need to compute the distances yourself. However, you will not need to iterate over all the vectors.
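A minimal sketch of that manual distance step, under a few assumptions that are not from the original post. Sparse vectors are represented as plain Map<Integer, Double> (index to value) instead of la4j's Vector, so that no la4j calls have to be guessed. The names TopNCosine, Scored, dot, and topN are hypothetical. The candidates map stands for whatever subset the spatial container returned. It also keeps only the best N in a bounded min-heap rather than sorting everything, which is a swapped-in technique and not what the question currently does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper: scores candidates against a query and keeps the best n.
class TopNCosine {

    /** A word paired with its similarity to the query. */
    static final class Scored {
        final String word;
        final double similarity;

        Scored(String word, double similarity) {
            this.word = word;
            this.similarity = similarity;
        }
    }

    /** Dot product of two sparse vectors stored as index -> value maps. */
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        // Walk the smaller map so the cost depends on the sparser vector.
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    /**
     * Returns the n candidates with the largest dot product against the query,
     * best first. Assumes all vectors are already normalized, so the dot
     * product equals the cosine similarity.
     */
    static List<Scored> topN(Map<Integer, Double> query,
                             Map<String, Map<Integer, Double>> candidates,
                             int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap on similarity: the root is the weakest of the current best n.
        PriorityQueue<Scored> heap =
                new PriorityQueue<>((x, y) -> Double.compare(x.similarity, y.similarity));
        for (Map.Entry<String, Map<Integer, Double>> e : candidates.entrySet()) {
            double s = dot(query, e.getValue());
            if (heap.size() < n) {
                heap.add(new Scored(e.getKey(), s));
            } else if (s > heap.peek().similarity) {
                heap.poll();                       // drop the current weakest
                heap.add(new Scored(e.getKey(), s));
            }
        }
        List<Scored> result = new ArrayList<>(heap);
        result.sort((x, y) -> Double.compare(y.similarity, x.similarity));
        return result;
    }
}
```

Calling TopNCosine.topN(queryVector, candidates, 10) returns the ten best matches in descending order of similarity. If the candidates include the query word itself, it will come back first with similarity 1 and has to be skipped by the caller. The heap never holds more than n entries, so each candidate costs one dot product plus at most O(log n) heap work.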

- , , , N.

( ) - . , , vX, N . vX N- ( ) vX , , N. , , . - , ( PK-, ).

( , ) - , . node, vX, N vX , N- , - , node. , , . ( , vX ), - 100k .

+3

, N- , .

, 20 HashMap<List<Integer>, List<Vector>>, , - , .

0
