Index structure for top-k queries for bitstron

The given array of bit strings (all of the same length) and the query string Q find the top-k most similar strings in Q, where the similarity between strings A and B is defined as the number 1 in B and (bitwise operation is applied).

I think there should be a classic result for this problem.

k is few, in hundreds, and the number of vectors in hundreds of millions and the length of the vectors is 512 or 1024

+7
algorithm indexing tree bitset
source share
2 answers

One way to solve this problem is to build a K-Nearest Neighbor Graph (K-NNG) (digraph) using the Russell-Rao similarity function.

Please note that the effective K-NNG design is still an open problem, and none of the known solutions to this problem is general, efficient and scalable [citation from Efficient K-Nearest Neighbor Graph Construction for general measures of similarity - Dong, Charikar, Li 2011] .

Your distance function is often called the Russell-Rao affinity (see, for example, Review of binary affinity and distance measurements - Choi, Cha, Tappert 2010 ). Please note that the Russell-Rao similarity is not metric (see Properties Binary Vector Differences in Differences - Zhang, Shrihari, 2003): the "if" part of "d (x, y) = 0 iff x == y" is incorrect.

In the Fast search algorithm for k-nearest neighbors with non-metric dissimilarity - Zhang, Srihari 2002 , the authors propose a quick hierarchical search algorithm for finding k-NN using non-metric measures in a binary vector space. They use the parametric binary vector distance function D (β). For β = 0, this function reduces to the Russell-Rao distance function. I would not call this a “classic result”, but this is the only document I could find considering this issue.

You can check out these two polls: On the problems of finding odd similarities in complex domains - Skopal, Bustos 2011 and An overview of the nearest methods for finding neighbors - Reza, Gahremani, Naderi 2014 . You might find something I missed.

+3
source share

This problem can be solved by writing a simple task "Map" and "Zoom out". I am not saying that this is the best solution, and I am not saying that this is the only solution.

In addition, you revealed in the comments that k is in hundreds, there are millions of bit strings and the size of each of them is 512 or 1024.


Pseudocode:

  • Given Q;
  • For each bit string b, calculate the similarity = b and Q
  • Emit (similarity, b)

Now the combiner can consolidate the list of all bits of the Strings from each mapping that have the same affinity.

Gearbox Pseudo Code:

  • Consumption (affinity, listOfBitStringsWithThisSimilarity);
  • Print them in descending order for the similarity value.

You can extract vertex k-bits from the gearbox output.

So, the MapReduce paradigm is probably the classic solution you're looking for.

+1
source share

All Articles