Hash function to index similar text

I am looking for some kind of hash function to index similar text. So, for example, if we have two very long texts called "A" and "B", where A and B do not differ so much, then the hash function (called H) applied to A and B should return the same number .

So H (A) = H (B), where A and B are similar text.

I tried "DoubleMetaphone" (I use text in Italian), but I saw that it is very dependent on line prefixes. For instance:

A = "This is a very long text that I want to hash" B = "This is very"

==> doubleMetaPhone (A) = doubleMetaPhone (B)

And this is not very good for me, beacause strings with the same prefix can be compared as similar, and I do not want this.

Can someone suggest me another way?

+4
source share
2 answers

A problem (close to) is unsolvable for many line spacing functions.

Most distance functions (for example, editing distance ) allow you to convert a string to another string using a sequence of conversions from 1 distance:

"AAAA" -> "AAAB" -> "AAABC" 

according to your requirements, the first and second lines should have the same hash value. But so it should be the second, and the third, and so on. Thus, all lines must have the same hash, if we allow a pair with distance = 1 to have the same hash value.

Even if we impose a higher threshold on the distance (possibly in relation to the length of the string), we will get a messy result.

The best (IMO) approach is to find the relation of equivalence to a set of strings, so that each row in each equivalence class has the same hash. The possibility is to determine the classes by their distance to a predefined string (for example, the editing distance from "AAAAA"), and the distance itself will be the hash value. This approach would probably not be the best in your case, but perhaps with some additional information about the problem we can find a better equivalence relation.

+2
source

Source: https://habr.com/ru/post/1315674/


All Articles