There are several questions on SO and around the Internet that describe how to compute the cosine similarity between two strings, and even between two strings with TF-IDF weights. But the output of a function like scikit's linear_kernel confuses me a bit.
Consider the following code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']
df = pd.DataFrame(data={'a':a, 'b':b})
df['ab'] = df.apply(lambda x : x['a'] + ' ' + x['b'], axis=1)
print(df.head())
                    a                 b                                    ab
0         hello world        my name is                hello world my name is
1          my name is       hello world                my name is hello world
2  what is your name?  my name is what?  what is your name? my name is what?
Question: I would like to have a column that is the cosine similarity between the rows of a and the corresponding rows of b.
What I tried :
I fitted the TF-IDF vectorizer on ab so that its vocabulary includes all the words:
clf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
clf.fit(df['ab'])
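To see what the vectorizer actually learned, the fitted vocabulary can be inspected; the comment below is only my expectation, assuming the built-in English stop-word list removes 'is', 'my', 'what' and 'your':
print(clf.get_feature_names())  # I would expect something like ['hello', 'name', 'world']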
Then I obtained the sparse TF-IDF matrices for columns a and b:
tfidf_a = clf.transform(df['a'])
tfidf_b = clf.transform(df['b'])
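As a quick sanity check on my side (not part of the recommended approach), both matrices should have one row per dataframe row and one column per vocabulary term:
print(tfidf_a.shape, tfidf_b.shape)  # (3, n_features) for both, if I understand transform correctly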
Now, if I use scikit's linear_kernel as recommended by others, I get back the Gram matrix of shape (n_samples_X, n_samples_Y), as stated in the docs.
from sklearn.metrics.pairwise import linear_kernel
linear_kernel(tfidf_a,tfidf_b)
array([[ 0.,  1.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
Instead, what I want is a single column cosine_sim between a and b, where each entry is the similarity of the corresponding rows, i.e. cos_sim(a[0], b[0]), cos_sim(a[1], b[1]), and so on.
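My guess is that, since TfidfVectorizer normalizes rows with norm='l2' by default, the values I want are just the diagonal of the pairwise matrix above; the snippet below is only a sketch of what I think the desired result would look like, not a solution I am confident in:
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

sim = linear_kernel(tfidf_a, tfidf_b)  # full pairwise matrix, shape (n_rows, n_rows)
df['cos_sim'] = np.diag(sim)           # cos_sim(a[i], b[i]) for each row i, if my reading is right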
I am using Python 3 and scikit-learn 0.17.