Which data structure to store binary strings and query with Hamming distance

I am looking for a data structure to process billions of binary strings of 512 bits each.

My goal is to send queries to the structure and get back a result set containing all entries within a given Hamming distance of the query.

My first idea was to use a k-d tree, but those trees are very slow in high dimensions. My second idea is an LSH approach (MinHash / SuperBit LSH), but for that I still need a structure that supports efficient search.

Any ideas on how to handle this much data?

**Update** Some notes:

  • for the Hamming distance there is only an upper limit, which might be 128, but I don't know the upper limit ahead of time
  • insert or delete would be nice, but I can also rebuild the structure (the database is only updated about weekly)
  • the result set should contain all matching entries (I'm not looking for k-NN)
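To pin down the query being asked about, here is a minimal brute-force reference in Python (all names are illustrative; strings are represented as Python ints):

```python
def hamming(a: int, b: int) -> int:
    # XOR leaves a 1 bit wherever the two strings differ; count those bits.
    return bin(a ^ b).count("1")

def range_query(items, query: int, limit: int):
    # All stored items within Hamming distance `limit` of `query`.
    # O(n) per query, which is exactly what the question wants to avoid.
    return [x for x in items if hamming(x, query) <= limit]
```

This is the baseline any indexed structure has to beat.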
2 answers

Without knowing your intended search parameters, it's hard to optimize too much. However, I think a good approach would be to build a B- or T-tree and then optimize that structure for the binary nature of the data.

In particular, you have 64 bytes of data in each 512-bit string. Your estimate is that you will have "billions" of records. That is on the order of 2^32 values, so 1/16th of the space will be filled? (Does this agree with your expectations?)

Anyway, try breaking the data into bytes and let each byte be one level of the key. You can likely compress the level entries if the bit probabilities are uniform. (If not, say if 1 bits are more likely at the start of the key, then you may want to simply allocate 256 next-level pointers and accept some nulls. It's not always worth it.)

All your levels will be the same: each represents another 8 bits of the string. So precompute a table that, for each byte value, lists all byte values at distance S from it, for 0 <= S <= 8. Also precompute a table that maps any two bytes to the distance between them, hamming(a,b).
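A sketch of those two precomputed tables (the names are illustrative, not from the answer):

```python
# BYTE_DIST[a][b]: Hamming distance between byte values a and b.
BYTE_DIST = [[bin(a ^ b).count("1") for b in range(256)] for a in range(256)]

# AT_DIST[a][s]: all byte values at exactly distance s from a, 0 <= s <= 8.
AT_DIST = [[[b for b in range(256) if BYTE_DIST[a][b] == s] for s in range(9)]
           for a in range(256)]
```

Both tables together are small (the pairwise table is 64 KB of distances) and are built once, up front.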

To traverse the tree, let your search distance be SD. Set D = SD. Read the top-level block. Find all 8-bit values in the block at distance at most min(8, D) from the corresponding byte of your query. For each such value, compute the exact distance hamming(query_byte, value) and descend into the block below with D = D - hamming(query_byte, value) for that subtree.
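A minimal sketch of that traversal, assuming the tree is stored as nested dicts keyed by byte value, with one level per byte and record lists at the leaves (all names here are illustrative):

```python
def byte_dist(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def search(node, query_bytes: bytes, d: int, level: int = 0, out=None):
    if out is None:
        out = []
    if level == len(query_bytes):
        # Leaf: every record stored here is within the remaining budget d.
        out.extend(node)
        return out
    q = query_bytes[level]
    for value, child in node.items():
        dist = byte_dist(q, value)
        if dist <= min(8, d):
            # Spend `dist` of the budget and descend into this subtree.
            search(child, query_bytes, d - dist, level + 1, out)
    return out
```

In a real implementation the children would be disk blocks rather than dicts, but the pruning logic is the same.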


The biggest design problem I see here is the completeness requirement: we need to return all elements within distance N of a given vector, for an arbitrary N. The data is sparse: "billions" is on the order of 2^33, but we have 512 bits of information, so there is only 1 record per 2^(512-33) possibilities. For randomly distributed keys, the expected distance between any two nodes is 256; the expected nearest-neighbor distance is about 180.

This makes me expect that your search will rely on non-random clusters in the data, and that recognizing this clustering will help your search. Identifying the clusters will be a somewhat painful preprocessing step on the source data, but it should pay off.

My general approach is to first identify those clusters in some reasonably fast way. Start with a hash function that yields a very coarse distance metric. For example, for any vector, compute its distances to each of several orthogonal support vectors. For 16 bits, you could take the following set (given in hexadecimal): 0000, 00FF, 0F0F, 3333, 5555, a series of alternating-bit patterns. Return this hash as a simple tuple of 4-bit distances, only 20 bits in total (the savings are real for long vectors, since the number of support vectors grows only logarithmically with the vector length).
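A sketch of that coarse hash for the 16-bit case (the support vectors are the ones given above; the function name is illustrative):

```python
SUPPORTS = [0x0000, 0x00FF, 0x0F0F, 0x3333, 0x5555]

def coarse_hash(v: int) -> tuple:
    # One small distance per support vector; nearby inputs yield nearby tuples.
    return tuple(bin(v ^ s).count("1") for s in SUPPORTS)
```

For example, flipping a single bit of the input changes each component of the tuple by at most 1, which is what makes the hash usable as a rough distance estimate.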

This hash tuple gives you a rough estimate of the Hamming distance, so you can group vectors easily: similar vectors should have similar hash values.

Within each cluster, find the central element, then characterize each cluster member by its distance from that center. For speed, give each node a list of its nearest neighbors with distances, all of them inside the cluster. This gives you a graph within each cluster.

Similarly, connect all the cluster centers, adding direct edges to the nearest other centers. If your data is reasonably searchable, then we can guarantee that for any two nodes A, B with cluster centers Ac and Bc we will have d(A,Ac) + d(B,Bc) < d(A,B). Each cluster is a topological neighborhood.


The query process is then reasonably fast. For a target vector V, compute its hash value. Find the cluster centers close enough to that hash value that something in their neighborhood might match ([center distance] <= [query range] + [cluster radius]). This immediately eliminates whole clusters, and may hand you entire clusters of "hits" at once. For each boundary cluster (where some, but not all, nodes qualify) you'll need something close to brute force to find a first matching node (start near the middle of the viable range of distances from the cluster center), then do a breadth-first search through each matching node's neighbors.

I expect this to give you performance comparable to optimal. It also handles additions and deletions decently, as long as they are not frequent enough to change cluster membership for other nodes.


Generating the set of support vectors is simple. Write out the bit patterns for the 16-bit case:

 0000 0000 0000 0000    16 0s
 0000 0000 1111 1111    8 0s, 8 1s
 0000 1111 0000 1111    4 0s, 4 1s, repeated
 0011 0011 0011 0011    2 0s, 2 1s, repeated
 0101 0101 0101 0101    1 0, 1 1, repeated
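That pattern extends mechanically to any power-of-two width, including the asker's 512 bits. A sketch of a generator (illustrative, not from the answer):

```python
def support_vectors(width: int):
    vs = [0]                          # all-zeros vector
    run = width // 2
    while run >= 1:
        block = (1 << run) - 1        # `run` consecutive 1 bits
        v, shift = 0, 0
        while shift < width:
            v |= block << shift       # place 1-runs every 2*run bits
            shift += 2 * run
        vs.append(v)
        run //= 2
    return vs
```

For width 16 this reproduces the table above; for 512 bits it yields 10 support vectors, so the whole hash tuple stays tiny compared to the 512-bit keys.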
