Measuring similarity between sets of documents

Question

For illustration purposes, let me assume that this is a forum service. I need to calculate the "similarity" among each user's posts, so the result will be something like this:

among posts by user A, similarity 60%
among posts by user B, similarity 20%
...

I deal with multibyte strings, so I guess I'm stuck with search engines. We already use Solr and have already implemented MoreLikeThis, but I'm not quite sure how to build the request. Any help appreciated!

lucene solr morelikethis

jodeci May 20, '11 at 9:25

3 answers

Omnaest · Answer 1 · 2011-09-15T19:09:15+0000

Perhaps Carrot2 will interest you (and this blog related to it)

D_K · Answer 2 · 2011-07-27T20:30:00+0000

This is a strange question in two ways: 1. Why do you have to use Solr? 2. The right kind of similarity depends on your goal. Your question sounds too general to me. Semantic similarity is an active research area. There is also the edit-distance algorithm, which is probably not the one you want.

So, ask the question more precisely and you will get better answers.
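
For reference, here is a minimal sketch of edit distance (Levenshtein), assuming plain Python strings. The function name `edit_distance` is made up for illustration. It counts character-level insertions, deletions and substitutions, which is why it tends to suit near-duplicate detection more than comparing what two posts are about.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]
```

Python strings are sequences of code points, so multibyte text is compared character by character rather than byte by byte.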

Mikos · Answer 3 · 2011-12-09T05:18:41+0000

There are several similarity measures. A simple and effective one is cosine similarity. There are also more complex ones, such as Smith-Waterman.

Take a look at http://sourceforge.net/projects/simmetrics/
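
As a rough sketch of how cosine similarity could give one score per user, as the question asks: this assumes character bigrams as tokens (the posts are multibyte and may have no word boundaries) and assumes the per-user score is the mean over all pairs of that user's posts. The names `char_ngrams`, `cosine` and `user_similarity` are made up for illustration and are not part of simmetrics or Solr.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def char_ngrams(text, n=2):
    """Count overlapping character n-grams, ignoring whitespace."""
    text = "".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty post is treated as similar to nothing
    return dot / (norm_a * norm_b)


def user_similarity(posts, n=2):
    """Mean pairwise cosine similarity among one user's posts."""
    vectors = [char_ngrams(post, n) for post in posts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two posts: nothing to compare
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

`user_similarity(posts)` returns a value between 0 and 1; multiply by 100 for a percentage like the ones in the question. Comparing every pair is quadratic in the number of posts, so for very active users you would probably want to sample posts or let the search engine do the term weighting.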
