Problem:
I have N (~100 million) strings, each D (e.g. 100) characters long, over a small alphabet (e.g. 4 possible characters). I would like to find the k nearest neighbors of each of these N points (k ~ 0.1D). Closeness between strings is defined by Hamming distance. The solution does not have to be optimal, but the closer to optimal the better.
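To make the setup concrete, here is a minimal brute-force sketch of exact k-nearest neighbors under Hamming distance. It is only a baseline for small inputs: it costs O(N^2 * D), which is far too slow for N around 100 million. The function name `hamming_knn_bruteforce` and the use of NumPy are my own assumptions for illustration.

```python
import numpy as np

def hamming_knn_bruteforce(strings, k):
    """Exact k-NN under Hamming distance for equal-length ASCII strings.

    Returns an (N, k) array of neighbor indices. O(N^2 * D), small N only.
    """
    # Encode the strings as an (N, D) array of byte codes.
    data = np.array([list(s.encode("ascii")) for s in strings], dtype=np.uint8)
    n, d = data.shape
    neighbors = np.empty((n, k), dtype=np.int64)
    for i in range(n):
        # Hamming distance = number of positions where the characters differ.
        dist = np.count_nonzero(data != data[i], axis=1)
        dist[i] = d + 1  # push the point itself behind every real candidate
        # Stable sort so that ties are broken by the lower index.
        neighbors[i] = np.argsort(dist, kind="stable")[:k]
    return neighbors

example = ["AAAA", "AAAC", "AACC", "GGGG", "GGGT"]
nearest = hamming_knn_bruteforce(example, k=1)
```

Any approximate method can be checked against this baseline on a small random subset of the data.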
Thoughts about the problem:
I have a bad feeling that this is a non-trivial problem. I have read many papers and algorithms, but most of them give poor results in high dimensions and only work well when the dimension is less than 5. For example, this paper proposes an efficient algorithm, but its constant depends exponentially on the dimension.
I am currently investigating how to reduce the dimension in such a way that the Hamming distance is preserved or can still be computed.
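One simple form of dimension reduction for Hamming distance, sketched here as an illustration only, is to keep a random subset of m of the D positions and rescale the mismatch count by D/m. This gives an unbiased estimate of the full distance rather than an exact value, and the error grows as m shrinks. The helper names `sample_positions` and `estimated_hamming` are hypothetical.

```python
import numpy as np

def sample_positions(d, m, seed=0):
    """Pick m distinct positions out of d, uniformly at random."""
    rng = np.random.default_rng(seed)
    return sorted(rng.choice(d, size=m, replace=False).tolist())

def hamming(a, b):
    """Exact Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def estimated_hamming(a, b, positions, d):
    """Estimate the full Hamming distance from the sampled positions only."""
    mismatches = sum(a[p] != b[p] for p in positions)
    # Scale the sampled mismatch count back up to the full length d.
    return mismatches * d / len(positions)

d = 100
a = "ACGT" * 25
b = "ACGA" * 25          # differs from a in every fourth position
positions = sample_positions(d, m=20)
approx = estimated_hamming(a, b, positions, d)
exact = hamming(a, b)
```

With m = D the estimate coincides with the exact distance, and smaller m trades accuracy for less work per comparison.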
Another option is locality-sensitive hashing (LSH), where points that are close to each other under the chosen metric are mapped to the same bucket with high probability. Any help? Which option would you prefer?
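For Hamming distance, a standard LSH family is bit sampling: each hash table uses the characters at a few randomly chosen positions as the bucket key, so strings that differ in few positions are likely to share a key in at least one table. Below is a minimal sketch of that idea. The function names and the parameters `num_tables` and `sample_size` are assumptions chosen for illustration, and the values would need tuning for real data.

```python
import random
from collections import defaultdict

def hamming(a, b):
    """Exact Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def build_lsh_tables(strings, num_tables, sample_size, seed=0):
    """Build bit-sampling LSH tables: one random position subset per table."""
    rng = random.Random(seed)
    d = len(strings[0])
    tables = []
    for _ in range(num_tables):
        positions = sorted(rng.sample(range(d), sample_size))
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            # The bucket key is the string restricted to the sampled positions.
            key = "".join(s[p] for p in positions)
            buckets[key].append(idx)
        tables.append((positions, buckets))
    return tables

def query_knn(strings, tables, query_idx, k):
    """Approximate k-NN: rank only the candidates sharing a bucket with the query."""
    q = strings[query_idx]
    candidates = set()
    for positions, buckets in tables:
        key = "".join(q[p] for p in positions)
        candidates.update(buckets.get(key, ()))
    candidates.discard(query_idx)
    # Re-rank the candidates by exact Hamming distance, ties broken by index.
    ranked = sorted(candidates, key=lambda j: (hamming(q, strings[j]), j))
    return ranked[:k]

data = ["ACGTACGT", "ACGTACGT", "CATGCATG"]
tables = build_lsh_tables(data, num_tables=4, sample_size=3)
result = query_knn(data, tables, query_idx=0, k=1)
```

The query can return fewer than k neighbors when too few candidates collide. Using more tables or a smaller `sample_size` raises recall at the cost of larger candidate sets.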