Finding the nearest neighbor of a bit string

I have hundreds of thousands of sparse bit strings 32 bits long.

I would like to do a nearest neighbor search on them, and search performance is very important. The algorithms I have read about seem to target text strings rather than binary strings. I think either locality-sensitive hashing or spectral hashing could be good candidates, or I could consider compression. Will any of them work well for my bit string problem? Any guidance or references would be greatly appreciated.

+3
compression hash nearest-neighbor hamming-distance
Mar 31 '12 at 21:13
2 answers

Here is a quick and simple method, plus a variant that trades extra memory for better accuracy.

Given: an array Uint X[], e.g. 1M 32-bit words
Wanted: a function near( Uint q ) → j with small hammingdist( q, X[j] )
Method: binary-search for q in sorted X, then linear-search a block around it.
Pseudocode:

 def near( q, X, Blocksize=100 ):
     preprocess: sort X
     Uint* p = binsearch( q, X )   # match q in leading bits
     linear-search Blocksize words around p
     return the hamming-nearest of these

It's fast: a binary search of 1M words plus the Hamming-nearest scan of a 100-word block takes < 10 µs on my Mac PPC. (This is highly cache-dependent; your mileage will vary.)
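The pseudocode above can be fleshed out as a runnable Python sketch (`near` and `hamming` are my own names here, not a library API; Blocksize=100 as in the pseudocode):

```python
import bisect

def hamming(a, b):
    # number of differing bits between two 32-bit words
    return bin(a ^ b).count("1")

def near(q, X_sorted, blocksize=100):
    # binary-search q in the sorted array, then linear-search a block of
    # `blocksize` entries around that position; return the index of the
    # Hamming-nearest word found in the block
    p = bisect.bisect_left(X_sorted, q)
    lo = max(0, p - blocksize // 2)
    hi = min(len(X_sorted), p + blocksize // 2)
    return min(range(lo, hi), key=lambda j: hamming(q, X_sorted[j]))
```

Preprocessing is just one `X.sort()`; each query then costs one binary search plus `blocksize` popcounts.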

How close does this come to the true nearest X[j]? I can only experiment, I cannot do the math:
for 1M random queries in 1M random words, the nearest match found is on average 4-5 bits away, vs. 3 for the true nearest (found by a linear scan of all 1M):

 near32  N 1048576  Nquery 1048576  Blocksize 100
     binary search, then nearest +- 50:  7 usec
     distance distribution: 0 4481 38137 185212 443211 337321 39979 235 0

 near32  N 1048576  Nquery 100  Blocksize 1048576
     linear scan all 1048576:  38701 usec
     distance distribution: 0 0 7 58 35 0

Run it on your data with block sizes of 50 and 100 to see how the match distances decrease.




To get even closer, at the cost of double the memory,
make a copy Xswap of X with the upper and lower halves of each word swapped, sort it too, and take the better of
 near( q, X, Blocksize )
 near( swap(q), Xswap, Blocksize )

With lots of memory you can go further, using many more shuffled copies of X, e.g. all 32 bit rotations.
I don't know how performance depends on Nshuffle and Blocksize; that is a question for the LSH theorists.
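As a minimal sketch of the doubled-memory variant, assuming 32-bit words split into 16-bit halves (`swap_halves` and `near2` are my names, and `near`/`hamming` are restated from the sketch above): the swapped copy catches matches that agree in the low half but differ in the high half, which a plain sort would scatter far apart.

```python
import bisect

def hamming(a, b):
    return bin(a ^ b).count("1")

def near(q, X_sorted, blocksize=100):
    p = bisect.bisect_left(X_sorted, q)
    lo = max(0, p - blocksize // 2)
    hi = min(len(X_sorted), p + blocksize // 2)
    return min(range(lo, hi), key=lambda j: hamming(q, X_sorted[j]))

def swap_halves(w):
    # swap the upper and lower 16 bits of a 32-bit word
    return ((w >> 16) | (w << 16)) & 0xFFFFFFFF

def near2(q, X_sorted, Xswap_sorted, blocksize=100):
    # probe both sorted orderings, return the smaller Hamming distance
    # (to recover the matching record, store (key, original) pairs instead)
    d1 = hamming(q, X_sorted[near(q, X_sorted, blocksize)])
    qs = swap_halves(q)
    d2 = hamming(qs, Xswap_sorted[near(qs, Xswap_sorted, blocksize)])
    return min(d1, d2)
```

Preprocessing: `Xswap = sorted(swap_halves(w) for w in X)` alongside the sorted X.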




(Added): for longer bit strings, say 320 bits = 10 words, make 10 arrays of pointers, sorted on word 0, word 1 ..., and binary-search blocks in each, as above:
 nearest( query word 0, Sortedarray0, 100 ) -> min hammingdist e.g. 42 of 320
 nearest( query word 1, Sortedarray1, 100 ) -> min hammingdist 37
 nearest( query word 2, Sortedarray2, 100 ) -> min hammingdist 50
 ...
 -> overall nearest, e.g. the 37.

This will of course miss near matches where no single word is close, but it is very simple, and sorting and binary search are blindingly fast. The pointer arrays take exactly as much space as the data bits. 100 words, i.e. 3200 bits, would work in exactly the same way.
But: this works only if there are roughly equal numbers of 0 bits and 1 bits, not 99 % 0 bits.
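A sketch of this multi-word scheme, assuming records stored as equal-length tuples of 32-bit words and one sorted index per word position (all function names here are my own):

```python
import bisect

def hamming_multi(A, B):
    # Hamming distance between two equal-length tuples of 32-bit words
    return sum(bin(a ^ b).count("1") for a, b in zip(A, B))

def build_indexes(data):
    # for each word position k, the record indices sorted by word k
    nwords = len(data[0])
    return [sorted(range(len(data)), key=lambda i: data[i][k])
            for k in range(nwords)]

def near_multi(q, data, indexes, blocksize=100):
    # probe each per-word sorted index, scan a block around the
    # binary-search position, and return the overall nearest record index
    best_j, best_d = None, None
    for k, order in enumerate(indexes):
        keys = [data[i][k] for i in order]
        p = bisect.bisect_left(keys, q[k])
        lo = max(0, p - blocksize // 2)
        hi = min(len(order), p + blocksize // 2)
        for i in order[lo:hi]:
            d = hamming_multi(q, data[i])
            if best_d is None or d < best_d:
                best_j, best_d = i, d
    return best_j
```

(In production the `keys` lists would be precomputed per index rather than rebuilt per query; they are inlined here for brevity.)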

+3
Apr 15

I just stumbled upon a paper that solves this problem.

Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering (Ravichandran et al., 2005)

The basic idea is similar to Denis's answer (sort lexicographically under different permutations of the bits), but it adds a number of further ideas and references to papers on the topic.
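Very roughly, that permutation idea can be sketched as follows (a simplification in my own names, not the paper's implementation: pick several random permutations of the bit positions, sort the permuted signatures under each, and scan a small window around the query in each sorted order):

```python
import bisect
import random

BITS = 32

def hamming(a, b):
    return bin(a ^ b).count("1")

def permute_bits(w, perm):
    # output bit `dst` takes input bit perm[dst]
    out = 0
    for dst, src in enumerate(perm):
        out |= ((w >> src) & 1) << dst
    return out

def build_lsh(X, nperms=8, seed=0):
    # one sorted table of (permuted key, original word) per permutation
    rng = random.Random(seed)
    tables = []
    for _ in range(nperms):
        perm = list(range(BITS))
        rng.shuffle(perm)
        keyed = sorted((permute_bits(w, perm), w) for w in X)
        tables.append((perm, keyed))
    return tables

def near_lsh(q, tables, window=50):
    # scan a window around q's position in each sorted order,
    # keeping the overall Hamming-nearest word
    best = None
    for perm, keyed in tables:
        p = bisect.bisect_left(keyed, (permute_bits(q, perm), 0))
        lo, hi = max(0, p - window), min(len(keyed), p + window)
        for _, w in keyed[lo:hi]:
            d = hamming(q, w)
            if best is None or d < best[0]:
                best = (d, w)
    return best  # (distance, word)
```

Each permutation brings a different subset of bits into the leading (sort-deciding) positions, so near neighbors that a single sort order misses get a fresh chance under another order.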

It is actually implemented at https://github.com/soundcloud/cosine-lsh-join-spark , which is where I found it.

+1
Feb 19 '16 at 11:45


