Measuring similarity between sets of documents

For illustration, assume this is a forum service. I need to calculate the "similarity" among each user's posts, so the result would look something like this:

among posts by user A: similarity 60%
among posts by user B: similarity 20%
...

I deal with multibyte strings, so I guess I'm stuck with search engines. We already use Solr and have already implemented MoreLikeThis, but I'm not quite sure how to build the request. Any help is appreciated!

+7
3 answers

Perhaps Carrot2 will interest you (and this blog related to it).

+1

This is a strange question in two ways: 1. Why do you have to use Solr? 2. The right kind of similarity depends on your goal. Your question sounds too general to me. Semantic similarity is an active research area. There is also the edit-distance algorithm, but it is probably not what you want.

So, ask the question more precisely and you will get better answers.

0

There are several similarity measures. A simple and effective one is cosine similarity. There are also more complex ones, such as Smith-Waterman.

Take a look at http://sourceforge.net/projects/simmetrics/
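
To make the cosine idea concrete for the per-user case in the question, here is a minimal sketch in plain Python. It does not use Solr or SimMetrics. Two things in it are my own assumptions, not something stated in the question: posts are turned into vectors of character bigrams (chosen because that needs no tokenizer for multibyte text), and a user's score is the mean cosine similarity over all pairs of that user's posts. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (assumption: bigrams by default)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty or too-short post has no direction to compare
    return dot / (norm_a * norm_b)


def user_similarity(posts):
    """Mean pairwise cosine similarity over one user's posts (assumed definition)."""
    vectors = [char_ngrams(p) for p in posts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two posts: nothing to compare
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical sample data, only to show the call shape.
    posts_by_user = {
        "A": ["how do I tune Solr", "how do I tune Solr caches"],
        "B": ["hello everyone", "benchmark results attached"],
    }
    for user, posts in posts_by_user.items():
        print(f"among posts by user {user}: similarity {user_similarity(posts):.0%}")
```

Comparing every pair is quadratic in the number of posts per user, so for prolific users you would probably sample pairs or compare each post against the user's summed vector instead.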

0
