Storing and indexing binary strings in a database

A binary string, as defined here, is a fixed size "bit" array. I call them strings, since there is no order on them (sorting / indexing them, because numbers do not make sense), each bit is independent of the others. Each such string has a length of N bits, and N - in hundreds.

I need to save these strings and set a new binary string query for the closest neighbor, using the Hamming distance as a distance metric.
There are specialized data structures (metric trees) for searching by metric (VP-trees, cover trees, M-trees), but I need to use a regular database (MongoDB in my case).

Is there any indexing function that can be applied to binary strings that can help the database access only a subset of records before performing a one-way Hamming match? Alternatively, how can such a search be based on Hamming on a standard database?

+4
source share
2 answers

The hamming distance is a metric; therefore, it satisfies the triangle inequality. For each bit string in your database, you can save its distance for hamming to some predetermined constant bit string. Then you can use triangle inequality to filter bitstrings in the database.

So let's say

C <- some constant bitstring S <- bitstring you're trying to find the best match for B <- a bitstring in the database distS <- hamming_dist(S,C) distB <- hamming_dist(B,C) 

So, for each B you save it with the corresponding distB .

The lower bound for hamming(B,S) will then be abs(distB-distS) . The upper bound will be distB+distS .

You can drop all B so that the lower bound is higher than the lower upper bound.

I am not 100% sure about the best way to choose C I think you would like it to be a bit string close to the "center" of your metric space bitstron.

+3
source

You can use an approach similar to levenshtein automata applied to bistro strings:

  • Calculate the first (lexicographically smallest) bit string that matches your criteria from a distance.
  • Retrieve the first result from the database that is greater than or equal to this value
  • If the value matches, print it and enter the following result. Otherwise, calculate the next value greater than the match and goto 2.

This is equivalent to doing a merge across your database and a list of possible matches, without having to generate all possible matches. This will reduce the search space, but it will probably require a significant number of queries.

+2
source

All Articles