High Dimension Nearest Neighbor Search and Localization Sensitivity Hashing

Question

High Dimension Nearest Neighbor Search and Localization Sensitivity Hashing

Here is the main problem. I have a very large database (25,000 or so) of 48-dimensional vectors, each of which has values from 0 to 255. Specificity is not so important, but I believe that this could help give context.

I don’t need a nearest neighbor, so an approximate search for neighbors that are within a certain accuracy is acceptable. I practiced with hashing the sensitivity of the place , but I lost it very much.

I wrote a hash function as described in the article in the Stable Distributions section as far as I can. Here is the code.

def lsh(vector, mean, stdev, r = 1.0, a = None, b = None): if not a: a = [normalvariate(mean, stdev) for i in range(48)] if not b: b = uniform(0, r) hashVal = (sum([a[i]*vectorA[i] for i in range(48)]) + b)/r return hashVal

The hash function is at least some “working”. If I order a list of points by hash value and calculate the average distance between a point and its neighbor in the list, the average distance is about 400, compared to the average distance of about 530 for any two randomly selected points.

My biggest questions are this.

A: Any suggestions on where I can learn more about this. My search did not bring many results.

B: The method assumes that it prints an integer value (which I do not have). And then you should try to find matches for this integer value, and match means the closest nearest neighbor. I understand that I have to calculate some set of hash tables for all my points, and then check the specified tables for hash matches, but the values I return seem to be insufficient and I get matches in general. For my part, more testing is required.

C: Instructions for building hash functions based on other hashing methods?

+7

python math

Piper merriam Jul 16 '10 at 7:06

source share

2 answers

Here are two answers:

B : The Wikipedia page states that math.floor() should be used in hashVal : this is how you get integers.

C. If you want to use the Hamming method, you can implement it quite simply: each Hamming hash function is simply determined by the coordinate (between 0 and 47) and the number of bits (between 0 and 7). You can get the value of an integer with the given bit b with:

 bool(i & 2**b)

+2

Eol Jul 16 '10 at 7:51

source share

Piotr duda · Accepted Answer · 2010-07-16T07:17:05+0000

Maby, this is a little off topic, but you can try using the PCA http://en.wikipedia.org/wiki/Principal_component_analysis to reduce the dimension of the data set. There should be many PCA modules designed for numPy (for example: http://folk.uio.no/henninri/pca_module/ ). This method is quite simple and will be equipped with ready-to-use modules.

Basically, this reduces the number of measurements (you must specify the desired number), maximizing the variance in a given number of measurements.

High Dimension Nearest Neighbor Search and Localization Sensitivity Hashing

More articles: