Here is a quick and simple method, the option with better performance due to the larger amount of memory.
Q: Uint X [] array, for example. 1M 32-bit words
Wanted: function near( Uint q ) β j with a small hammingdist( q, X[j] )
Method: binary search q in sorted X , then linear search for a block around it.
Pseudocode:
def near( q, X, Blocksize=100 ): preprocess: sort X Uint* p = binsearch( q, X )
It's fast - A binary search of 1M words + the nearest hammingdist in a block of size 100 takes <10 us on my Mac ppc. (This is highly cache dependent; your mileage will vary.)
How close is this to find the true nearest X [j]? I can only experiment, I can not do the math:
for 1M random queries in 1M random words, the nearest match is on average at a distance of 4-5 bits, vs. 3 for true nearest (linear scan of just 1M):
near32 N 1048576 Nquery 1048576 Blocksize 100 binary search, then nearest +- 50 7 usec distance distribution: 0 4481 38137 185212 443211 337321 39979 235 0 near32 N 1048576 Nquery 100 Blocksize 1048576 linear scan all 1048576 38701 usec distance distribution: 0 0 7 58 35 0
Run your data using blocks of sizes 50 and 100 to see how the distances on the back decrease.
To get even closer, at the cost of dual memory,
make a copy of
Xswap of
X with the replacement of the upper / lower halves, and back to better
near( q, X, Blocksize ) near( swap q, Xswap, Blocksize )
With lots of memory, you can still use many more shuffled copies of X , such as 32 rotations.
I don't know how performance depends on Nshuffle and Blocksize - a question for LSP theorists.
(Added): for close-to-bit bit strings, say 320 bits, 10 words, make 10 arrays of pointers sorted by word 0, word 1 ... and search blocks with binzis, as described above:
nearest( query word 0, Sortedarray0, 100 ) -> min Hammingdist eg 42 of 320 nearest( query word 1, Sortedarray1, 100 ) -> min Hammingdist 37 nearest( query word 2, Sortedarray2, 100 ) -> min Hammingdist 50 ... -> eg the 37.
This, of course, will miss the near matches where not a single word is close, but it is very simple, and sorting and binzisk are incredibly fast. Arrays of pointers take up as much space as data bits. 100 words, 3200 bits will work exactly the same.
But: this only works if there is an approximately equal number of 0 bits and 1 bit, not 99% 0 bits.