Python: How to calculate the cosine similarity of two word lists?

I want to calculate the cosine similarity of two lists, for example:

A = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']

B = [u'home (private)', u'school', u'bank', u'shopping mall']

I know that the cosine similarity of A and B should be

3/(sqrt(7)*sqrt(4)).

I tried to convert the lists into a sentence-like form such as "home bank bank building factory", but some elements (like home (private)) contain whitespace themselves and some contain parentheses, so it is hard for me to count word occurrences.

Do you know how to count word occurrences in a list like this, so that for list B the counts can be represented as

{'home (private)': 1, 'school': 1, 'bank': 1, 'shopping mall': 1}?

Or do you know how to calculate the cosine similarity of these two lists?

Many thanks

from collections import Counter

# word-lists to compare
a = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']
b = [u'home (private)', u'school', u'bank', u'shopping mall']

# count word occurrences
a_vals = Counter(a)
b_vals = Counter(b)

# convert to word-vectors
words  = sorted(a_vals.keys() | b_vals.keys())          # sorted so the order is reproducible
a_vect = [a_vals.get(word, 0) for word in words]        # [2, 1, 1, 1, 0, 0]
b_vect = [b_vals.get(word, 0) for word in words]        # [1, 0, 0, 1, 1, 1]

# find cosine
len_a  = sum(av*av for av in a_vect) ** 0.5             # sqrt(7)
len_b  = sum(bv*bv for bv in b_vect) ** 0.5             # sqrt(4)
dot    = sum(av*bv for av,bv in zip(a_vect, b_vect))    # 3
cosine = dot / (len_a * len_b)                          # 0.5669467
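
If you need this more than once, the same idea can be wrapped in a function. This is only a sketch: the name cosine_similarity and the choice to return 0.0 when one list is empty are my assumptions, not part of the answer above. It skips the explicit word vectors, because only words that occur in both lists contribute to the dot product.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(list_a, list_b):
    """Cosine similarity of two lists of hashable items, based on item counts."""
    counts_a = Counter(list_a)
    counts_b = Counter(list_b)

    # only items present in both lists contribute to the dot product
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[item] * counts_b[item] for item in shared)

    # vector lengths come straight from the counts
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))

    # assumption: an empty list gives similarity 0.0 instead of a ZeroDivisionError
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


a = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)', 'factory']
b = [u'home (private)', u'school', u'bank', u'shopping mall']
print(cosine_similarity(a, b))   # 3 / (sqrt(7) * sqrt(4)), about 0.567
```

Because each whole list element is used as a dictionary key, the whitespace and parentheses inside elements like 'home (private)' do not matter.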

Alternatively, first build a vocabulary (a mapping from each distinct word to an index).

vocab = {}
i = 0

# loop through each list, find distinct words and map them to a
# unique number starting at zero

for word in A:
    if word not in vocab:
        vocab[word] = i
        i += 1


for word in B:
    if word not in vocab:
        vocab[word] = i
        i += 1

vocab is now a dictionary that maps each distinct word to a unique index.

Next, build a count vector for each list. numpy is used here.

import numpy as np

# create a numpy array (vector) for each input, filled with zeros
a = np.zeros(len(vocab))
b = np.zeros(len(vocab))

# loop through each input and create a corresponding vector for it
# this vector counts occurrences of each word in the dictionary

for word in A:
    index = vocab[word] # get index from dictionary
    a[index] += 1 # increment count for that index

for word in B:
    index = vocab[word]
    b[index] += 1

Finally, compute the similarity.

# use numpy dot product to calculate the cosine similarity
sim = np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))

sim now holds the cosine similarity of the two lists, about 0.567 for the example above.
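
For reuse, the three steps above can be collected into one function and checked against the value expected in the question. Treat this as a sketch under a few assumptions: the function name cosine_similarity_np is made up for illustration, np.linalg.norm is used in place of the sqrt-of-dot-products expression (it computes the same vector length), and returning 0.0 for an empty list is my choice.

```python
import numpy as np


def cosine_similarity_np(list_a, list_b):
    # map each distinct word to an index, in order of first appearance
    vocab = {}
    for word in list(list_a) + list(list_b):
        vocab.setdefault(word, len(vocab))

    # one count vector per input list
    vec_a = np.zeros(len(vocab))
    vec_b = np.zeros(len(vocab))
    for word in list_a:
        vec_a[vocab[word]] += 1
    for word in list_b:
        vec_b[vocab[word]] += 1

    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    # assumption: return 0.0 when either list is empty, to avoid dividing by zero
    if denom == 0:
        return 0.0
    return float(np.dot(vec_a, vec_b) / denom)


A = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)', 'factory']
B = [u'home (private)', u'school', u'bank', u'shopping mall']

expected = 3 / (np.sqrt(7) * np.sqrt(4))
print(np.isclose(cosine_similarity_np(A, B), expected))
```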

