Algorithm for calculating the similarity of texts

I’m trying to measure the similarity between messages from social networks, but I haven’t found a good algorithm for this. Any thoughts?

I have tried Levenshtein, Jaro-Winkler and others, but they compare texts character by character, without regard to sentiment. For example, one message may say “I really love dogs” and another “I really hate dogs”; we need to classify these two as completely different.
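To illustrate the problem, here is a minimal sketch of a plain Levenshtein comparison on the two example messages. The class name `EditDistanceDemo` and the length-normalized `similarity` helper are my own choices for illustration, not part of any library.

```java
final class EditDistanceDemo {
    // Classic dynamic-programming Levenshtein distance:
    // insertions, deletions and substitutions all cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    public static void main(String[] args) {
        String a = "I really love dogs";
        String b = "I really hate dogs";
        System.out.println("distance   = " + levenshtein(a, b));
        System.out.println("similarity = " + similarity(a, b));
    }
}
```

Only three characters differ between the two messages, so an edit-distance measure rates them as highly similar even though the sentiment is opposite.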

thanks

+7
java text artificial-intelligence nlp
3 answers

Perhaps you should take a look at opinion mining and sentiment analysis to get an idea of the complexity of the task.

Short answer: there are no "good algorithms" for this, only mediocre ones. And this is a very difficult problem. Good luck.

+1

Ah ... but "I really love dogs" and "I really hate dogs" are completely similar ;), since both express feelings about dogs. It seems that you are missing a step:

  • Run your algorithm to get broad topic groups (e.g. "feelings about dogs").
  • Run your algorithm again, this time on each previously discovered group, and let it further classify the messages into subgroups (e.g. "I hate dogs" / "I love dogs").

If your algorithm adapts based on experience (i.e. some learning is involved), then make sure that you run one instance of the algorithm for the first classification and a separate, new instance for each subgroup. If you do not, you may find that once a few groups have been discovered, running the algorithm again on those same groups gives almost identical results, or changes nothing at all.
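The two passes above can be sketched roughly as follows. This is only a toy illustration: the word lists `STOP`, `POSITIVE` and `NEGATIVE` are hypothetical and tiny, and a simple lexicon lookup stands in for whatever clustering or classification algorithm you actually use in each pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class TwoPassGrouping {
    // Hypothetical word lists, for illustration only.
    static final Set<String> STOP = Set.of("i", "really", "the", "a", "my");
    static final Set<String> POSITIVE = Set.of("love", "like", "adore");
    static final Set<String> NEGATIVE = Set.of("hate", "dislike", "despise");

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Pass 1: the topic is whatever remains after dropping stop words and sentiment words.
    static String topicKey(String text) {
        TreeSet<String> topic = new TreeSet<>();
        for (String t : tokens(text)) {
            if (!STOP.contains(t) && !POSITIVE.contains(t) && !NEGATIVE.contains(t)) {
                topic.add(t);
            }
        }
        return String.join(" ", topic);
    }

    // Pass 2: within a topic, split messages by the sign of a lexicon score.
    static String polarity(String text) {
        int score = 0;
        for (String t : tokens(text)) {
            if (POSITIVE.contains(t)) score++;
            if (NEGATIVE.contains(t)) score--;
        }
        return score > 0 ? "positive" : score < 0 ? "negative" : "neutral";
    }

    // topic -> polarity -> messages
    static Map<String, Map<String, List<String>>> group(List<String> messages) {
        Map<String, Map<String, List<String>>> groups = new TreeMap<>();
        for (String m : messages) {
            groups.computeIfAbsent(topicKey(m), k -> new TreeMap<>())
                  .computeIfAbsent(polarity(m), k -> new ArrayList<>())
                  .add(m);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(group(List.of("I really love dogs", "I really hate dogs", "I like cats")));
    }
}
```

With this structure, "I really love dogs" and "I really hate dogs" land in the same topic group but in different subgroups, which is the distinction the question asks for.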

Update

Apache Mahout provides many useful algorithms and examples: clustering, classification, genetic programming, decision forests and recommendations. There are also some examples of text classification from Mahout.

I'm not sure which one works best for your problem, but maybe if you look at them, you will understand which one is most suitable for your specific application.

+4

My research deals with sentiment analysis, and I agree with Pierre: this is a difficult problem and, given its subjective nature, there is no general algorithm. One of the approaches I tried first was to map sentences into an emotional space and decide the sentiment of each sentence based on its distance to the sentiment centroids. You can look at it at:

http://dtminredis.housing.salle.url.edu:8080/EmoLib/
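As a rough idea of what "distance to the centroids" means, here is a minimal nearest-centroid sketch. It is not how EmoLib is implemented; the two-dimensional (valence, arousal) space, the word scores in `WORD_SCORES` and the centroid positions in `CENTROIDS` are all made-up values for illustration.

```java
import java.util.Locale;
import java.util.Map;

final class EmotionCentroids {
    // Hypothetical (valence, arousal) scores in [-1, 1]; a real system would use an affective lexicon.
    static final Map<String, double[]> WORD_SCORES = Map.of(
        "love", new double[] {0.9, 0.6},
        "hate", new double[] {-0.9, 0.7},
        "calm", new double[] {0.4, -0.6},
        "bored", new double[] {-0.4, -0.7});

    // Hypothetical centroids, one per sentiment class.
    static final Map<String, double[]> CENTROIDS = Map.of(
        "positive", new double[] {0.7, 0.3},
        "negative", new double[] {-0.7, 0.3},
        "neutral", new double[] {0.0, 0.0});

    // Represent a sentence as the average of the vectors of its known words.
    // A sentence with no known words maps to the origin.
    static double[] embed(String sentence) {
        double[] sum = new double[2];
        int known = 0;
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            double[] v = WORD_SCORES.get(token);
            if (v != null) {
                sum[0] += v[0];
                sum[1] += v[1];
                known++;
            }
        }
        if (known > 0) {
            sum[0] /= known;
            sum[1] /= known;
        }
        return sum;
    }

    static double distance(double[] p, double[] q) {
        return Math.hypot(p[0] - q[0], p[1] - q[1]);
    }

    // Assign the label of the closest centroid (Euclidean distance).
    static String classify(String sentence) {
        double[] point = embed(sentence);
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : CENTROIDS.entrySet()) {
            double d = distance(point, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(classify("I really love dogs"));
        System.out.println(classify("I really hate dogs"));
    }
}
```

Two sentences can then be compared by the centroid they fall closest to, or directly by the distance between their points in the emotional space, instead of by their characters.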

The above suggestions work well;)

+2