Non-empty buckets in LSH

I am reading this survey about LSH; in particular, I am referring to the last paragraph of section 2.2.1:

To improve recall, L hash tables are constructed, and the items lying in the L (L', L' < L) hash buckets h_1(q), ..., h_L(q) are retrieved as near items of q for randomized R-near neighbor search (or randomized c-approximate R-near neighbor search). To guarantee precision, each of the L hash codes, y_i, needs to be a long code, which means that the total number of buckets is too large to index directly. **Thus, only the non-empty buckets are retained, by resorting to conventional hashing of the hash codes h_l(x).**
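Here is how I tentatively picture that last sentence (a minimal sketch of my own, not code from the survey; build_tables and hash_funcs are illustrative names): each of the L tables is an ordinary hash map keyed by the LSH code, so memory is only spent on buckets that actually receive at least one point, never on all 2^length addressable codes.

```python
from collections import defaultdict

def build_tables(X, hash_funcs):
    """X: iterable of data vectors; hash_funcs: list of L functions, each
    mapping a vector to a hashable code (e.g. a tuple of bits)."""
    tables = [defaultdict(list) for _ in hash_funcs]
    for i, x in enumerate(X):
        for table, h in zip(tables, hash_funcs):
            table[h(x)].append(i)  # a bucket exists only once something lands in it
    return tables
```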

I have 3 questions:

  • The bold sentence is not clear to me: what does "resorting to conventional hashing of the hash codes h_l(x)" mean?
  • Still about the bold sentence: I want to check that I understood the problem. I see that h_l(x) can be a long code, so the number of possible buckets can be huge. For example, if h_l(x) is a binary code of a given length, then we have a total of L*2^length possible buckets (since we use L hash tables)... is this correct?
  • Last question: once we have found which bucket the query vector q belongs to, should we use the original vector q and the original distance metric to find the nearest neighbor? Suppose, for example, that the original vector q has 128 dimensions, q = [1, 0, 12, ..., 14.3]^T, and that our application uses the Euclidean distance. Suppose now that the hash function used in LSH (take L = 1 for simplicity) maps this vector to a 20-dimensional binary code y = [0100...11]^T, which decides which bucket q is assigned to. Say y gives some bucket index B, and that bucket already contains 100 vectors. Now, to find the nearest neighbor, we must compare q with all those 100 vectors, each of 128 dimensions, using the Euclidean distance. Is that correct? (A sketch of this lookup follows the list below.)
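To make questions 2 and 3 concrete (continuing my sketch above, so tables, hash_funcs and X are the same illustrative names): for question 2, binary codes of length 20 across L tables give L * 2^20, i.e. roughly L million, addressable buckets, which matches the L*2^length count. For question 3, this is what I imagine the lookup looks like:

```python
import numpy as np

def query(q, X, tables, hash_funcs, k=1):
    """Gather candidates from the bucket h_l(q) of each table, then re-rank
    them with the exact Euclidean distance on the original vectors."""
    candidates = set()
    for table, h in zip(tables, hash_funcs):
        candidates.update(table.get(h(q), []))
    if not candidates:
        return []
    cand = list(candidates)
    dists = np.linalg.norm(np.asarray(X)[cand] - q, axis=1)  # original 128-d space
    order = np.argsort(dists)[:k]
    return [(cand[i], dists[i]) for i in order]
```

With L = 1 and 100 vectors in the matched bucket, this computes exactly 100 Euclidean distances between 128-dimensional vectors, which is the comparison described in question 3.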
1 answer

The approach they use to improve recall is to build more hash tables, which essentially stores several copies of the identifier of each reference item, so the space cost is higher [4]. If there are many empty buckets, which would increase the lookup cost, a double hash scheme or a fast search algorithm over the Hamming space can be used to quickly retrieve the hash codes. I think that in this case they use a second, conventional hash function to retrieve only the non-empty buckets.
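My guess at what such a double hash scheme could look like (a sketch of my own, not code from [4]; M, insert, and lookup are illustrative names): the long LSH code is itself run through an ordinary hash function to pick a slot in a fixed-size table, and the full code is stored with each entry so that colliding codes can still be told apart.

```python
from collections import defaultdict

M = 1 << 16  # size of the second-level table (illustrative choice)
second_level = defaultdict(list)  # slot -> list of (code, item_ids) chains

def insert(code, item_id):
    """Insert an item id under its long LSH code, using an ordinary hash
    of the code to pick a slot; the full code is kept to resolve collisions."""
    chain = second_level[hash(code) % M]
    for stored_code, ids in chain:
        if stored_code == code:
            ids.append(item_id)
            return
    chain.append((code, [item_id]))

def lookup(code):
    """Return the item ids stored under exactly this code, if any."""
    for stored_code, ids in second_level.get(hash(code) % M, []):
        if stored_code == code:
            return ids
    return []
```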

Number of stored codes / memory cost [1][2][3] → O(nL)
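As a rough illustration of that O(nL) memory cost (again a toy sketch of my own; sign random projections are just one standard LSH family, for angular rather than Euclidean similarity, used here only to have concrete hash functions):

```python
from collections import defaultdict
import numpy as np

def make_srp_hashes(d, code_len, L, seed=0):
    """L sign-random-projection hash functions: each maps a d-dim vector
    to a code_len-bit tuple via the signs of random hyperplane projections."""
    rng = np.random.default_rng(seed)
    planes = [rng.standard_normal((code_len, d)) for _ in range(L)]
    # bind each P via a default argument so every closure keeps its own planes
    return [lambda x, P=P: tuple((P @ x > 0).astype(int)) for P in planes]

n, d, code_len, L = 10_000, 128, 20, 4
X = np.random.default_rng(1).standard_normal((n, d))
hash_funcs = make_srp_hashes(d, code_len, L)

tables = [defaultdict(list) for _ in range(L)]
for i, x in enumerate(X):
    for table, h in zip(tables, hash_funcs):
        table[h(x)].append(i)

print(sum(len(ids) for t in tables for ids in t.values()))  # n * L = 40000 ids
print(sum(len(t) for t in tables))  # non-empty buckets only, far below L * 2**20
```

Each of the n items contributes one identifier per table, which is the O(nL) space cost; the number of non-empty buckets per table is at most n, far below the 2^20 addressable codes per table.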

References:

[1] http://simsearch.yury.name/russir/03nncourse-hand.pdf

[2] http://joyceho.imtqy.com/cs584_s16/slides/lsh-12.pdf

[3] https://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf

[4] http://research.microsoft.com/en-us/um/people/jingdw/Pubs%5CLTHSurvey.pdf

