N-gram resemblance to cosine similarity measurement

Question

N-gram resemblance to cosine similarity measurement

I am working on a project on the similarity of sentences. I know that he was asked many times at SO, but I just want to know if my problem can be achieved using the method that I use the way I do it, or should I change my approach to the problem. Roughly speaking, the system should split all the sentences of the article and find similar sentences among other articles that are submitted to the system.

I use cosine similarities with tf-idf weights, and that's exactly how I did it.

1 First, I divided all the articles into sentences, then I generated trigrams for each sentence and sorted them (should I?).

2- I calculate the tf-idf of the weight of the trigrams and create vectors for all sentences.

3- I calculate the point product and the magnitude of the original proposal and the compared proposal. Then calculate the cosine similarity.

However, the system does not work as I expected. Here I have some questions in my mind.

As far as I read about tf-idf weights, I think they are more useful for finding similar “documents”. Since I am working on sentences, I slightly modified the algorithm by changing some variables of the definition formula tf and idf (instead of the document, I tried to find a definition based on the sentence).

tf = number of trigrams in the sentence / number of all trigrams in the sentence

idf = number of all sentences in all articles / number of sentences where the trigram is displayed

Do you consider it appropriate to use such a definition for this problem?

, , . , , ( ). - x, - x + 1, , x + 1 0. , ? , ?

, , ( n- )?

.

+5

similarity cosine n-gram

Ahmet Keskin 27 . '10 19:59

1

srean · Accepted Answer · 2010-10-27T20:27:52+0000

, . , , , . , . , . N, N. - , . , , .

, , . k k , k-. , , . k = 3 k-, , . , , 1. .

, , , , 0 . - , . . , . . LSI ( ), , .

x y, x y. 2- x . - . , (tf).

, .

N-gram resemblance to cosine similarity measurement

More articles: