TF / IDF similarity

I am using TF / IDF to calculate the similarity between documents. For example, suppose I have the following two documents:

 Doc A => cat dog
 Doc B => dog sparrow

I would expect the similarity to be about 50%, but when I calculate TF / IDF I get the values below.

TF values for Doc A

 dog tf = 0.5
 cat tf = 0.5

TF values for Doc B

 dog tf = 0.5
 sparrow tf = 0.5

IDF values for Doc A

 dog idf = -0.4055
 cat idf = 0

IDF values for Doc B

 dog idf = -0.4055 (without the +1 formula: 0.6931)
 sparrow idf = 0

TF / IDF value for Doc A

 0.5 x -0.4055 + 0.5 x 0 = -0.20275

TF / IDF value for Doc B

 0.5 x -0.4055 + 0.5 x 0 = -0.20275

So it looks like the similarity is -0.20275. Is that right, or am I missing something? Is there a further step after this? Please tell me so that I can calculate it too.

I used the TF / IDF formula given on Wikipedia.

3 answers

Let me check that I understand the question: you want to calculate the TF / IDF similarity between two documents:

 Doc A: cat dog 

and

 Doc B: dog sparrow 

I understand that this is your whole corpus. Therefore |D| = 2, and the TFs are indeed 0.5 for all words. To compute the IDF of "dog", take log(|D| / |{d : dog in d}|) = log(2/2) = 0. Similarly, the IDFs of "cat" and "sparrow" are log(2/1) = log(2) = 1 (I use base-2 logs to make this easier).

Therefore, the TF / IDF value for "dog" is 0.5 * 0 = 0, and the TF / IDF values for "cat" and "sparrow" are 0.5 * 1 = 0.5.
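The arithmetic above can be checked with a short sketch. It assumes the plain IDF form log2(|D| / document frequency) with no smoothing, as used in this answer; the helper names `tf`, `idf` and `tfidf` are made up for illustration and do not come from any library.

```python
import math

# The two-document corpus from the question.
docs = {
    "A": ["cat", "dog"],
    "B": ["dog", "sparrow"],
}

def tf(term, tokens):
    # Relative frequency of the term within one document.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # log2(|D| / number of documents containing the term), no +1 smoothing.
    # Assumes the term occurs in at least one document of the corpus.
    df = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log2(len(corpus) / df)

def tfidf(tokens, corpus):
    # One TF / IDF weight per distinct term in the document.
    return {term: tf(term, tokens) * idf(term, corpus) for term in set(tokens)}

weights = {name: tfidf(tokens, docs) for name, tokens in docs.items()}
print(weights["A"])
print(weights["B"])
```

In both documents "dog" ends up with weight 0 because it appears in every document of the corpus, so it carries no discriminating information.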

To measure the similarity between the two documents, compute the cosine between their vectors in the space (cat, sparrow, dog): (0.5, 0, 0) and (0, 0.5, 0). The result is 0, because the only shared term, "dog", has weight 0 in both documents.
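For reference, here is a minimal sketch of that cosine computation on the two vectors. The function name `cosine` is just an illustrative choice, and returning 0.0 for an all-zero vector is a convention assumed here, not something stated in the answer.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Axis order: (cat, sparrow, dog)
doc_a = (0.5, 0.0, 0.0)
doc_b = (0.0, 0.5, 0.0)

similarity = cosine(doc_a, doc_b)
print(similarity)
```

The vectors are orthogonal, so the similarity is 0 even though the documents share a word.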

Summarizing:

  • You have an error in IDF calculations.
  • This error creates invalid TF / IDF values.
  • The Wikipedia article does not sufficiently explain the use of TF / IDF for similarities. I like the explanation of Manning, Raghavan and Schütze much better.
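To see where the negative IDF in the question probably comes from: its numbers are consistent with a natural-log IDF that adds 1 to the document frequency in the denominator (the asker mentions a "+1 formula"). That reading is an inference from the values, and the function names below are invented for this sketch.

```python
import math

n_docs = 2
# Number of documents each term appears in, for the question's corpus.
doc_freq = {"cat": 1, "dog": 2, "sparrow": 1}

def idf_plain(df):
    # ln(|D| / df): never negative, because df cannot exceed |D|.
    return math.log(n_docs / df)

def idf_plus_one(df):
    # ln(|D| / (1 + df)): goes negative when a term is in every document.
    return math.log(n_docs / (1 + df))

for term, df in doc_freq.items():
    print(term, round(idf_plain(df), 4), round(idf_plus_one(df), 4))
```

With only two documents, the +1 variant pushes "dog" below zero and flattens "cat" and "sparrow" to zero, which matches the values listed in the question.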

I think you need to take ln instead of log.
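Whichever base is chosen, it is worth noting that changing the base multiplies every IDF by the same constant, so the cosine similarity between two documents should not change. The sketch below checks this on a hypothetical three-document corpus (the third document is invented for illustration); the helper names are assumptions as well.

```python
import math

# Hypothetical corpus: the question's two documents plus one invented document.
corpus = [["cat", "dog"], ["dog", "sparrow"], ["cat", "cat", "fish"]]
vocab = sorted({term for doc in corpus for term in doc})

def tfidf_vector(doc, log_fn):
    # One weight per vocabulary term, using the given logarithm for the IDF.
    vec = []
    for term in vocab:
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in corpus if term in d)
        vec.append(tf * log_fn(len(corpus) / df))
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

sim_ln = cosine(tfidf_vector(corpus[0], math.log),
                tfidf_vector(corpus[1], math.log))
sim_log2 = cosine(tfidf_vector(corpus[0], math.log2),
                  tfidf_vector(corpus[1], math.log2))
print(math.isclose(sim_ln, sim_log2))
```

The raw TF / IDF weights differ between the two bases, but the similarity between the first two documents is the same up to floating-point error.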

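    from math import log10, sqrt

    # getidf(token) is not defined in this answer; it is expected to return
    # the IDF of a token and has to be supplied by the surrounding code.

    def calctfidfvec(tfvec, withidf):
        # tfvec maps token -> term frequency (must be > 0 for log10).
        # Returns a length-normalized weight vector as a dict.
        tfidfvec = {}
        veclen = 0.0
        for token in tfvec:
            if withidf:
                tfidf = (1 + log10(tfvec[token])) * getidf(token)
            else:
                tfidf = 1 + log10(tfvec[token])
            tfidfvec[token] = tfidf
            veclen += pow(tfidf, 2)
        if veclen > 0:
            for token in tfvec:
                tfidfvec[token] /= sqrt(veclen)
        return tfidfvec

    def cosinesim(vec1, vec2):
        # Dot product over shared tokens; this equals the cosine
        # when both vectors are already length-normalized.
        commonterms = set(vec1).intersection(vec2)
        sim = 0.0
        for token in commonterms:
            sim += vec1[token] * vec2[token]
        return sim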