Distributed LSH (location-sensitive hashing)

Question

Distributed LSH (location-sensitive hashing)

I want to build a large, scalable database with millions of high-dimensional vectors using LSH. Since I have to store all the data in ram for a quick request, the data should be distributed on several servers to store all the objects.

The naive approach is to distribute all objects to different servers and send one request to each server. The correct server with the best answer has the correct object.

I am sure that there should be some better solution when the request should not be sent to all server nodes, and similar objects are grouped on one server.

What would be a good approach for distributed LSH tables? Maybe there are even some projects?

Thanks for any hint.

+4

database hash distributed

Marlon Oct 23 '10 at 17:58

source share

1 answer

aSteve · Answer 1 · 2011-09-25T13:38:56+0000

First you want to consider the keys for which data should be available. It is these keys that you want to use in the hash - and if you know the exact keys that you want to access, you can hash them to determine which server to request - eliminating the need to request each server.

It gets harder and harder if you don’t know the exact keys (as I suspect this is your situation) - LSH generates a general order for your records - where such records most likely (but not guaranteed) have the same hash, I think about it, for example, on comparing hyperplanes with the length of their normal vector from the origin ... therefore, for example, when searching for a similar (but non-identical) hyperplane accurate to 4-5 units from the origin, a good place to start the search is among other hyperplanes from 4 up to 5 units from ist chnika. Therefore, if this "distance from the source" is your location-sensitive hash function, you can shard use your data, and at the same time, you can reduce the load (with increasing delays in the worst case) by searching only for a fragment with the corresponding "distance from the source "LCH. With this particular LCH, where the similarity linearly correlates with the hash, one could get the final result when accessing only a subset of the distributed servers. This does not apply to all LSH features.

IMHO, it all depends on your LSH function - and this choice depends on the specifics of your application.

Distributed LSH (location-sensitive hashing)

More articles: