Search in locality-sensitive hashing

I am trying to understand Section 5 of this paper about LSH, in particular how to search over the generated hashes. Quoting the linked paper:

Given bit vectors consisting of d bits each, we choose N = O(n^(1/(1+ε))) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following: for each permutation σ, we perform a binary search on O_σ to locate the two bit vectors closest to q (in the lexicographic order obtained by the bits permuted by σ). We now search in each of the sorted orders O_σ, examining elements above and below the position returned by the binary search, in order of the length of the longest prefix that matches q. This can be done by maintaining two pointers for each sorted order O_σ (one moves up and the other moves down). At each step, we move one of the pointers up or down, corresponding to the element with the longest matching prefix. (Here, the length of the longest matching prefix in O_σ is computed relative to q with its bits permuted by σ.) We examine 2N = O(n^(1/(1+ε))) bit vectors in this way. Of all the bit vectors examined, we return the one that has the smallest Hamming distance to q.

I am confused by this algorithm, and I don't think I understand how it works.

I already found this question on the topic, but I did not understand the answer in the comments. The same algorithm is also described in point 2 of this question, but again, I do not understand how it works.

Could you explain how it works step by step, as simply as possible?

I even tried to make a list of the things I do not understand, but the description is so poorly written that I do not understand most of its sentences!

EDIT after gsamaras' answer:

I basically understood the answer, but I still have some doubts:

  • Is it right to say that the total cost of performing the N permutations is O(N n log n), since we must sort each of them?

  • Is the permutation + sorting process described above performed only once during preprocessing, or for each query q? O(N n log n) already seems quite expensive even for preprocessing; if we had to do this at query time, it would be a disaster :D

  • At the last point, where we compare v0 and v4 with q, do we compare their permuted versions or the original ones (before the permutation)?

algorithm nearest-neighbor computational-geometry locality-sensitive-hash approximate-nn-searching
1 answer

This question is somewhat broad, so I'll just give a minimal (abstract) example here:

We have 6 (= n) vectors in our dataset, with d bits each. Suppose we do 2 (= N) random permutations.

Let the first random permutation begin! Remember that we permute the bits, not the order of the vectors. After permuting the bits, the vectors are kept in sorted (lexicographic) order, for example:

 v1 v5 v0 v3 v2 v4 

Now the query q arrives, but it is very unlikely to be identical to a vector in our dataset (after permutation), so the binary search will not find an exact match.

However, the search will end up between two vectors. So now we can imagine the scenario to be like this (for example, q lies between v0 and v3):

 v1
 v5
 v0 <-- up pointer
    <-- q lies here
 v3 <-- down pointer
 v2
 v4

Now we move either the up pointer or the down pointer, looking for the vector vi that shares the longest prefix of bits with q. Let's say that it is v0.

Similarly, we do the second permutation and find a vector vi, say v4. Now we compare v0 from the first permutation and v4 from the second to see which one is closest to q, i.e. which one has the most bits equal to q (the smallest Hamming distance).
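To make the steps concrete, here is a minimal Python sketch of the procedure as I read it. Everything in it is an assumption for illustration: the helper names (`permute`, `common_prefix_len`, `build_index`, `query`) are hypothetical, bit vectors are plain tuples of 0/1, and the budget of 2N examined vectors is treated as a global count across all sorted orders, which is one possible reading of the quoted paragraph. It is not the paper's reference implementation.

```python
import random
from bisect import bisect_left

def permute(v, sigma):
    """Rearrange the bits of v according to the permutation sigma."""
    return tuple(v[i] for i in sigma)

def common_prefix_len(a, b):
    """Number of leading positions in which a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def hamming(a, b):
    """Number of positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def build_index(vectors, num_perms, rng):
    """Preprocessing, done once: one sorted order per random permutation."""
    d = len(vectors[0])
    index = []
    for _ in range(num_perms):
        sigma = list(range(d))
        rng.shuffle(sigma)
        # Entries are (permuted vector, id of the original vector).
        order = sorted((permute(v, sigma), i) for i, v in enumerate(vectors))
        index.append((sigma, order))
    return index

def query(vectors, index, q):
    """Examine up to 2N vectors and return the id closest to q in Hamming distance."""
    states = []
    for sigma, order in index:
        pq = permute(q, sigma)
        # q would be inserted between order[pos - 1] and order[pos].
        pos = bisect_left(order, (pq, -1))
        states.append({"pq": pq, "order": order, "up": pos - 1, "down": pos})

    candidates = set()
    for _ in range(2 * len(index)):
        best = None  # (prefix length, state, which pointer)
        for st in states:
            for ptr in ("up", "down"):
                k = st[ptr]
                if 0 <= k < len(st["order"]):
                    plen = common_prefix_len(st["order"][k][0], st["pq"])
                    if best is None or plen > best[0]:
                        best = (plen, st, ptr)
        if best is None:  # every pointer has run off its list
            break
        _, st, ptr = best
        candidates.add(st["order"][st[ptr]][1])
        st[ptr] += -1 if ptr == "up" else 1  # advance the chosen pointer

    # Final comparison uses the original (unpermuted) vectors.
    return min(candidates, key=lambda i: hamming(vectors[i], q))

rng = random.Random(0)
data = [tuple(rng.randint(0, 1) for _ in range(16)) for _ in range(6)]
index = build_index(data, num_perms=2, rng=rng)
q = data[3]
best = query(data, index, q)
print(best, hamming(data[best], q))
```

Note that `build_index` is called once, while `query` only does a binary search plus a handful of pointer moves per request.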


Edit:

Is it right to say that the total cost of performing the N permutations is O(N n log n), since we must sort each of them?

If they really sort every permutation from scratch, then yes, but it is not clear to me how they do it.

Is the permutation + sorting process described above performed only once during preprocessing, or for each query q?

ONCE.

At the last point, where we compare v0 and v4 with q, do we compare their permuted versions or the original ones (before the permutation)?

I think they do this with the permuted versions (see the parenthetical before 2N in the paper). But it does not matter, since they permute q too, with the same permutation (σ).
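A quick way to convince yourself that it does not matter: the Hamming distance only counts positions where the two vectors differ, and applying the same permutation to both vectors just moves those positions around. Here is a small Python check, with hypothetical helper names and random data chosen only for illustration:

```python
import random

def hamming(a, b):
    """Number of positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def permute(v, sigma):
    """Rearrange the bits of v according to the permutation sigma."""
    return [v[i] for i in sigma]

rng = random.Random(1)
d = 12
v = [rng.randint(0, 1) for _ in range(d)]
q = [rng.randint(0, 1) for _ in range(d)]
sigma = list(range(d))
rng.shuffle(sigma)

# Same distance before and after permuting both vectors with sigma.
assert hamming(v, q) == hamming(permute(v, sigma), permute(q, sigma))
```

The prefix comparison, on the other hand, does depend on the permutation, which is why it has to be done on the permuted vectors.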


This Quora answer may shed some light too.
