I was reading this survey about LSH, in particular the last paragraph of Section 2.2.1:
To improve the recall, L hash tables are constructed, and the items lying in the L (L', L' < L) hash buckets h_1(q), ..., h_L(q) are retrieved as near items of q for the randomized R-near neighbor search (or the randomized c-approximate R-near neighbor search). To guarantee the precision, each of the L hash codes, y_i, needs to be a long code, which means that the total number of buckets is too large to index directly. Thus, **only the nonempty buckets are retained, by resorting to conventional hashing of the hash codes h_l(x)**.
I have 3 questions:
- The bold sentence is not clear to me: what does "resorting to conventional hashing of the hash codes h_l(x)" mean?
- Still about the bold sentence, I want to check that I understand the problem it solves: h_l(x) can be a long code, so the number of possible buckets can be huge. For example, if h_l(x) is binary and has length `length`, then there are L * 2^length possible buckets in total (since we use L hash tables)... is this correct?
- Last question: once we have found the bucket the query vector q belongs to, should we use the original vector q and the original distance metric to find the nearest neighbor? Suppose, for example, that q is 128-dimensional, q = [1, 0, 12, ..., 14.3]^T, and that our application uses the Euclidean distance. Suppose now that the hash function used in LSH (take L = 1 for simplicity) maps this vector to a 20-dimensional binary code y = [0100...11]^T in order to decide which bucket q is assigned to. Say y corresponds to bucket index B, which already contains 100 vectors. Then, to find the nearest neighbor, we must compare q with all 100 of those 128-dimensional vectors using the Euclidean distance. Is this correct?
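To make my understanding of the first two questions concrete, here is a minimal sketch of what I *think* "conventional hashing of the hash codes" means (the random-hyperplane hash family, the constants, and all names below are my own toy assumptions, not from the paper): instead of allocating an array with 2^length slots per table, the binary code h_l(x) is used as a key in an ordinary hash table (a Python dict), so only nonempty buckets consume memory.

```python
import random
from collections import defaultdict

random.seed(0)

DIM = 8        # toy dimensionality of the data points
CODE_LEN = 20  # length of each binary hash code h_l(x)
L = 4          # number of hash tables

# Toy binary LSH family: random-hyperplane (sign) hashes.
# This illustrates *a* binary LSH scheme, not the paper's construction.
hyperplanes = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(CODE_LEN)]
    for _ in range(L)
]

def h(l, x):
    """Binary code h_l(x), returned as a string of '0'/'1' characters."""
    return "".join(
        "1" if sum(w_i * x_i for w_i, x_i in zip(w, x)) >= 0 else "0"
        for w in hyperplanes[l]
    )

# "Conventional hashing of the hash codes": each table is an ordinary
# dict keyed by the code string, so only nonempty buckets are stored,
# never the full 2**CODE_LEN address space.
tables = [defaultdict(list) for _ in range(L)]

points = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]
for x in points:
    for l in range(L):
        tables[l][h(l, x)].append(x)

total_possible = L * 2 ** CODE_LEN          # 4 * 2^20 addressable buckets
total_stored = sum(len(t) for t in tables)  # at most L * len(points)
print("possible buckets:", total_possible)
print("nonempty buckets actually stored:", total_stored)
```

With 1000 points, each table can contain at most 1000 nonempty buckets, so the dicts hold at most L * 1000 entries instead of the L * 2^20 addressable buckets.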
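And for the last question, this is the candidate re-ranking step as I currently understand it (again my own sketch, with made-up data standing in for the "100 vectors in 128 dimensions"): the binary code only selects the bucket, while the final comparison uses the original vectors and the original Euclidean distance.

```python
import math

# Hypothetical contents of bucket B: the bucket stores the original
# (not hashed) vectors; 3 points in 4 dimensions as a toy stand-in.
bucket_B = [
    [1.0, 0.0, 2.0, 3.0],
    [0.5, 0.1, 2.2, 2.9],
    [9.0, 9.0, 9.0, 9.0],
]
q = [0.6, 0.0, 2.1, 3.0]  # query in the original space

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Linear scan over the bucket only (not the whole data set), using the
# original metric on the original vectors.
nearest = min(bucket_B, key=lambda x: euclidean(q, x))
print(nearest)  # -> [0.5, 0.1, 2.2, 2.9], the closest point to q
```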
locality-sensitive-hash
justHelloWorld