Algorithm for determining the similarity between text messages

I am looking for an algorithm that can compare two text messages (say, posts on a forum) and express their similarity as a percentage.

What would be the most effective solution for this purpose?

The idea is to use this algorithm to identify forum users who have two or more aliases and pretend to be different people.

I am going to write a program that reads all my messages and compares each message from the first account with the messages of the second account, to determine whether they really are two different people or just two registrations of one user.

+6
2 answers

The first thing that came to my mind was Levenshtein distance, but it is better suited to measuring the similarity of individual words than of whole messages.
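To show how an edit distance can be turned into a percentage, here is a minimal sketch in Python. The function names and the normalization by the longer string's length are my own assumptions, not something prescribed by the algorithm itself.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalization: 100% for identical strings,
    0% when every character has to change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

The running time grows with the product of the two lengths, so comparing every message of one account against every message of another can get slow for long posts.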

You can use tf-idf, but it will probably work better if your corpus contains more than two documents.
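A rough sketch of the tf-idf idea using only the standard library might look like this. The tokenizer, the smoothed idf formula, and the use of cosine similarity to compare the weighted vectors are all assumptions on my part; other weighting variants are just as valid.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    # Smoothed idf (an assumed variant) so no term gets a zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors, in [0, 1] for tf-idf weights."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Multiplying the cosine value by 100 gives a percentage. The idf part only becomes informative once the corpus holds many messages, which is the caveat mentioned above.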

An alternative could be to represent the documents (messages) using a vector space model, for example:

(w_0, w_1, ..., w_k) 

Where

  • k is the total number of terms (words) in your document.
  • w_i is the i-th term.

and then calculate the Hamming distance, which compares two vectors (arrays) of equal length and counts the positions at which they differ. You can first discard stop words (e.g. common words such as prepositions).
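One way to make the two vectors comparable is sketched below. It assumes binary term-presence vectors over the combined vocabulary of both messages (so both vectors have the same length), and the stop-word list is a tiny illustrative placeholder; neither choice comes from a specific library.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "of", "to", "and", "is"}


def terms(text):
    """Set of lowercase terms in text, with stop words discarded."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def hamming_similarity_percent(a, b):
    terms_a, terms_b = terms(a), terms(b)
    vocabulary = sorted(terms_a | terms_b)
    if not vocabulary:
        return 100.0  # nothing left to compare after stop-word removal
    # One position per vocabulary term: 1 if the message contains it.
    vec_a = [1 if t in terms_a else 0 for t in vocabulary]
    vec_b = [1 if t in terms_b else 0 for t in vocabulary]
    # Hamming distance: number of positions where the vectors differ.
    distance = sum(x != y for x, y in zip(vec_a, vec_b))
    return 100.0 * (1 - distance / len(vocabulary))
```

With this encoding, two messages that share every non-stop word score 100%, and two with no words in common score 0%.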

Keep in mind that the user can change some words, use synonyms, etc. There are many models for representing documents and computing the similarity between them. Some of them take the dependencies between words into account, which brings more semantics into the process, while others do not.

+1

google-diff-match-patch would be a good choice for you. You can try its demo to test it.
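To get a feel for a diff-based similarity score without installing anything, the sketch below uses Python's standard difflib module as a stand-in. This is not the diff-match-patch library and it uses a different matching algorithm; the function name is hypothetical.

```python
from difflib import SequenceMatcher


def diff_similarity_percent(a: str, b: str) -> float:
    """Share of characters that fall inside matching blocks, as a percentage.

    autojunk=False turns off difflib's heuristic that treats very frequent
    characters as junk in sequences of 200 or more items, which would
    otherwise skew the score for long messages.
    """
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()
```

A call such as diff_similarity_percent(message_one, message_two) returns a value between 0 and 100, which can then be compared against whatever threshold you settle on.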

0