Tf idf similarity

Question

I am using TF-IDF to calculate the similarity between documents. For example, suppose I have the following two documents:

 Doc A => cat dog
 Doc B => dog sparrow

I would expect the similarity to be 50%, but when I calculate TF-IDF I get the following:

Tf Values for Doc A

 dog tf = 0.5
 cat tf = 0.5

Tf Values for Doc B

 dog tf = 0.5
 sparrow tf = 0.5

IDF Values for Doc A

 dog idf = -0.4055
 cat idf = 0

IDF Values for Doc B

 dog idf = -0.4055 (without +1 formula 0.6931)
 sparrow idf = 0

TF-IDF value for Doc A

 0.5 x (-0.4055) + 0.5 x 0 = -0.20275

TF-IDF value for Doc B

 0.5 x (-0.4055) + 0.5 x 0 = -0.20275

So it looks like the similarity is -0.20275. Is that right? Or am I missing something, or is there a further step? Please tell me so that I can calculate that too.

I used the TF-IDF formula given on Wikipedia.
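The numbers above are consistent with a smoothed IDF of the form ln(|D| / (1 + df)), where df is the number of documents containing the term. That reading is an assumption inferred from the values, not something the question states. A minimal Python sketch under that assumption (the names `docs`, `tf` and `idf_smoothed` are illustrative only):

```python
import math

# The two documents from the question, tokenized on whitespace.
docs = {"A": "cat dog".split(), "B": "dog sparrow".split()}

def tf(term, tokens):
    # Term frequency: share of the document's tokens equal to `term`.
    return tokens.count(term) / len(tokens)

def idf_smoothed(term, docs):
    # Assumed variant: ln(|D| / (1 + df)), with df = number of documents containing `term`.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / (1 + df))

for term in ("dog", "cat", "sparrow"):
    print(term, round(idf_smoothed(term, docs), 4))
```

With only two documents, the +1 in the denominator pushes the IDF of a term that occurs in every document below zero, which would explain the -0.4055 for "dog".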

+4

java text similarity tf-idf

user238384 Dec 31 '09 at 20:13

3 answers

Yuval F · Answer 1 · 2009-12-31T20:46:32+0000

Let's see if I understand the question: you want to calculate the TF-IDF similarity between two documents:

 Doc A: cat dog

and

 Doc B: dog sparrow

I take it that this is your whole corpus, so |D| = 2. The TFs are indeed 0.5 for all words. To compute the IDF of "dog", take log(|D| / |{d : dog in d}|) = log(2/2) = 0. Similarly, the IDFs of "cat" and "sparrow" are log(2/1) = log(2) = 1 (I use 2 as the base of the logs to make this easier).
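The TF and IDF values in this answer can be checked with a short Python sketch. It assumes the unsmoothed definition log2(|D| / df) used above; the names `corpus`, `tf` and `idf` are illustrative, not taken from any library:

```python
import math

# The whole corpus, as assumed in the answer: two tokenized documents.
corpus = {"A": ["cat", "dog"], "B": ["dog", "sparrow"]}

def tf(term, tokens):
    # Term frequency: share of the document's tokens equal to `term`.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # log2(|D| / df); assumes `term` occurs in at least one document.
    df = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log2(len(corpus) / df)

idf_values = {t: idf(t, corpus) for t in ("cat", "dog", "sparrow")}
print(idf_values)  # {'cat': 1.0, 'dog': 0.0, 'sparrow': 1.0}
```

A term that appears in every document gets an IDF of 0 under this definition, so it contributes nothing to the document vectors.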

Therefore, the TF-IDF value for "dog" is 0.5 * 0 = 0, and the TF-IDF values for "cat" and "sparrow" are 0.5 * 1 = 0.5.

To measure the similarity between the two documents, compute the cosine between their vectors in the space (cat, sparrow, dog): (0.5, 0, 0) for Doc A and (0, 0.5, 0) for Doc B. The result is 0.
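The cosine step can be sketched in a few lines of Python. The function name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for this example:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# TF-IDF vectors in the space (cat, sparrow, dog), as given in the answer.
doc_a = (0.5, 0.0, 0.0)
doc_b = (0.0, 0.5, 0.0)
print(cosine(doc_a, doc_b))  # 0.0
```

The two vectors share no non-zero coordinate, so the dot product, and with it the cosine, is 0.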

Summarizing:

1. You have an error in your IDF calculation.
2. This error produces incorrect TF-IDF values.
3. The Wikipedia article does not sufficiently explain how to use TF-IDF for similarity. I like the explanation by Manning, Raghavan and Schütze much better.

Toqir · Answer 2 · 2010-01-03T16:10:39+0000

I think you need to take ln instead of log.
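For comparison, switching the log base rescales every IDF by the same constant, since log2(x) = ln(x) / ln(2). Cosine similarity does not change when a vector is multiplied by a positive constant, so the base should not affect the similarity as long as one base is used throughout. A small check in Python, where the corpus size and document frequencies are hypothetical values chosen only for illustration:

```python
import math

# Hypothetical corpus size and document frequencies, for illustration only.
n_docs = 10
doc_freqs = {"cat": 1, "dog": 5, "sparrow": 2}

idf_ln = {t: math.log(n_docs / df) for t, df in doc_freqs.items()}
idf_log2 = {t: math.log2(n_docs / df) for t, df in doc_freqs.items()}

# Each base-2 value is the natural-log value divided by ln(2), so the ratio is constant.
for t in doc_freqs:
    print(t, round(idf_ln[t], 4), round(idf_log2[t], 4), round(idf_log2[t] / idf_ln[t], 4))
```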

user7113676 · Answer 3 · 2016-11-04T07:10:48+0000

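    from math import log10, sqrt

    # getidf(token) is not defined in this answer; it is assumed to return the IDF of a token.

    def calctfidfvec(tfvec, withidf):
        # tfvec maps each token to its term frequency (must be > 0 for log10)
        tfidfvec = {}
        veclen = 0.0
        for token in tfvec:
            if withidf:
                tfidf = (1 + log10(tfvec[token])) * getidf(token)
            else:
                tfidf = 1 + log10(tfvec[token])
            tfidfvec[token] = tfidf
            veclen += pow(tfidf, 2)
        # normalize to unit length so that the dot product below is the cosine
        if veclen > 0:
            for token in tfvec:
                tfidfvec[token] /= sqrt(veclen)
        return tfidfvec

    def cosinesim(vec1, vec2):
        # dot product over the tokens both vectors share
        commonterms = set(vec1).intersection(vec2)
        sim = 0.0
        for token in commonterms:
            sim += vec1[token] * vec2[token]
        return sim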
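A usage sketch for these two functions on the documents from the question. The functions are restated compactly so that the block runs on its own. The `IDF` table is a hypothetical stand-in for `getidf`, filled with base-2 IDF values for the two-document corpus, and the term frequencies are taken to be raw counts:

```python
from math import log10, sqrt

# Hypothetical IDF table standing in for getidf (log base 2, two-document corpus).
IDF = {"cat": 1.0, "dog": 0.0, "sparrow": 1.0}

def calctfidfvec(tfvec, withidf):
    # Log-scaled term frequency, optionally weighted by IDF, normalized to unit length.
    tfidfvec = {}
    veclen = 0.0
    for token, count in tfvec.items():
        weight = 1 + log10(count)
        if withidf:
            weight *= IDF[token]
        tfidfvec[token] = weight
        veclen += weight ** 2
    if veclen > 0:
        for token in tfidfvec:
            tfidfvec[token] /= sqrt(veclen)
    return tfidfvec

def cosinesim(vec1, vec2):
    # Dot product over shared tokens; the cosine, since both vectors have unit length.
    return sum(vec1[t] * vec2[t] for t in set(vec1) & set(vec2))

# Raw term counts for the two documents in the question.
doc_a = {"cat": 1, "dog": 1}
doc_b = {"dog": 1, "sparrow": 1}

sim_tf_only = cosinesim(calctfidfvec(doc_a, False), calctfidfvec(doc_b, False))
sim_tfidf = cosinesim(calctfidfvec(doc_a, True), calctfidfvec(doc_b, True))
print(sim_tf_only, sim_tfidf)
```

Without IDF weighting the similarity comes out at roughly 0.5, matching the 50% intuition in the question. With IDF weighting the only shared term, "dog", has zero weight, so the similarity is 0.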
