Choosing between SimHash and MinHash for a production system

I am familiar with the LSH (locality-sensitive hashing) methods SimHash and MinHash. SimHash approximates cosine similarity on real-valued data, while MinHash estimates the Jaccard similarity of sets (binary vectors). But I can't decide which one would be better to use.

I am building a backend system for a website to find near-duplicates of semi-structured text data. For example, each entry will have a name, a location and a short text description (under 500 words).

Setting aside any specific language implementation, which algorithm is the better fit for a new production system?

+11
simhash minhash
2 answers

Simhash is faster (very fast) and typically requires less storage, but it imposes a strict limit on how different two documents can be while still being detected as duplicates. If you use a 64-bit simhash (a common choice), then depending on how many permuted lookup tables you can afford to store, you may be limited to a Hamming distance of 3, or perhaps 6 or 7, and those are small Hamming distances! You will be limited to detecting documents that are mostly identical, and even then you may need to tune carefully which features you feed into the simhash and what weights you give them.
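To make that concrete, here is a minimal sketch of a 64-bit simhash and the Hamming-distance check in Python. The tokenization, the MD5-based feature hashing and the optional weights dictionary are illustrative assumptions on my part, not details from the answer above.

```python
# Minimal 64-bit simhash sketch (illustrative only; feature extraction
# and weighting here are assumptions, not part of the original answer).
import hashlib
import re

def simhash64(text, weights=None):
    """Build a 64-bit simhash from whitespace-separated tokens."""
    counts = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        weight = (weights or {}).get(token, 1)
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            counts[bit] += weight if (h >> bit) & 1 else -weight
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two simhashes."""
    return bin(a ^ b).count("1")

# Two mostly identical descriptions should land within a few bits of each other.
d1 = simhash64("cozy two bedroom apartment near the old town square")
d2 = simhash64("cozy two bedroom apartment close to the old town square")
print(hamming_distance(d1, d2))
```

Anything that falls outside the small Hamming-distance budget described above simply will not be found, no matter how similar the documents look to a human.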

Simhash generation is patented by Google, although in practice it appears they allow at least non-commercial use.

Minhash uses more memory, since you typically store 50-400 hashes per document, and it is not as CPU-efficient as simhash, but it lets you find much more distant similarities, e.g. estimated similarities as low as 5% if you want. It is also somewhat easier to understand than simhash, particularly in terms of how the lookup tables work. It is quite simple to implement, typically using shingling, and does not need much tuning to get good results. It is not (as far as I know) patented.
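For comparison, a minhash signature built over character shingles might look roughly like this. The shingle length, the number of hashes and the seeded-MD5 "permutations" are assumptions chosen for brevity, not recommendations from this answer.

```python
# Minimal MinHash sketch over character shingles (illustrative only).
import hashlib

def shingles(text, k=5):
    """Set of overlapping k-character shingles."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, num_hashes=100):
    """One minimum hash value per seeded 'permutation'."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig1 = minhash_signature("cozy two bedroom apartment near the old town square")
sig2 = minhash_signature("cozy two bedroom apartment close to the old town square")
print(estimated_jaccard(sig1, sig2))
```

The fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets, which is why even fairly distant similarities remain detectable with enough hashes.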

If you are dealing with big data, the most resource-intensive part of the minhash approach is likely to come after you have generated the minhashes for a document, when you scan your lookup table to find other documents that share some of its hashes. There may be tens or hundreds of thousands of documents that share at least one hash with it, and you have to sift through all of them to find the few that share, say, at least half of its hashes. Simhash is much quicker here.
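That lookup step is commonly organized by banding the signature, so only documents that collide on a whole band become candidates. The sketch below is one hypothetical way to index it; the band size and function names are my own, not from the answer.

```python
# Rough sketch of a banded minhash index: documents that agree on all
# rows of at least one band become candidates for exact comparison.
from collections import defaultdict

def band_keys(signature, rows_per_band=4):
    """Split a signature into bands; each band becomes a bucket key."""
    return [
        (i, tuple(signature[i:i + rows_per_band]))
        for i in range(0, len(signature), rows_per_band)
    ]

index = defaultdict(set)  # bucket key -> set of document ids

def add_document(doc_id, signature):
    for key in band_keys(signature):
        index[key].add(doc_id)

def candidate_duplicates(signature):
    """Documents sharing at least one band; these still need exact checking."""
    candidates = set()
    for key in band_keys(signature):
        candidates |= index[key]
    return candidates

# Tiny self-contained usage with a synthetic signature:
add_document("doc-1", list(range(100)))
print(candidate_duplicates(list(range(100))))  # {'doc-1'}: every band collides
```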

As Otmar notes in his comment below, there are minhash optimizations that let you achieve the same estimation accuracy with fewer hashes per document. This can substantially reduce the amount of sifting you have to do.

Edit:

I have now tried superminhash. It is fairly fast, although my minhash implementation using one hash function plus bit transformations to derive all the other hashes was faster for my purposes. It gives more accurate Jaccard estimates, about 15% better in some of the situations I tested (though almost no difference in others). This should mean you need about a third fewer hashes to achieve the same accuracy. Storing fewer hashes in your table means less sifting is needed to identify near-duplicates, which gives a significant speedup. I am not aware of any patent on superminhash. Thanks, Otmar!

+10

This paper may give you some ideas about the two algorithms.

http://jmlr.org/proceedings/papers/v33/shrivastava14.pdf

+6
