Gensim: a custom similarity measure

Using gensim, I want to calculate the similarity between the documents in a list. The library handles the amount of data I have well. My documents boil down to timestamps, and I have a time_similarity function to compare them. gensim, however, uses cosine similarity.

I am wondering if someone has done this before or has a different solution.

1 answer

This can be done by inheriting from the SimilarityABC interface. I did not find any documentation for this, but it looks like it has been done before to implement Word Mover's Distance similarity. Here is a general way to do it. You could probably make it more efficient by specializing it to the similarity measure you care about.

 import numpy
 from gensim import interfaces

 class CustomSimilarity(interfaces.SimilarityABC):

     def __init__(self, corpus, custom_similarity, num_best=None, chunksize=256):
         self.corpus = corpus
         self.custom_similarity = custom_similarity
         self.num_best = num_best
         self.chunksize = chunksize
         self.normalize = False

     def get_similarities(self, query):
         """Do not use this function directly; use the self[query] syntax instead."""
         if isinstance(query, numpy.ndarray):
             # Convert document indexes to actual documents.
             query = [self.corpus[i] for i in query]
         if not isinstance(query[0], list):
             query = [query]
         n_queries = len(query)
         result = []
         for qidx in range(n_queries):
             qresult = [self.custom_similarity(document, query[qidx])
                        for document in self.corpus]
             qresult = numpy.array(qresult)
             result.append(qresult)
         if len(result) == 1:
             # Only one query.
             result = result[0]
         else:
             result = numpy.array(result)
         return result

To use it with a custom similarity:

 def overlap_sim(doc1, doc2):
     # similarity defined by the number of common words
     return len(set(doc1) & set(doc2))

 corpus = [['cat', 'dog'], ['cat', 'bird'], ['dog']]
 cs = CustomSimilarity(corpus, overlap_sim, num_best=2)
 print(cs[['bird', 'cat', 'frog']])

This outputs [(1, 2.0), (0, 1.0)].
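For the timestamp comparison the question asks about, a plug-in similarity could look like the sketch below. The name time_similarity matches the question, but the asker's actual function is not shown, so the exponential-decay form and the scale parameter are my assumptions. Each "document" is a one-element list holding a timestamp, so it fits the list-of-lists corpus shape the class above expects.

```python
import math

def time_similarity(doc1, doc2, scale=3600.0):
    """Similarity of two timestamp 'documents' (one-element lists).

    Identical timestamps give 1.0; similarity decays exponentially as the
    gap grows, with `scale` (in seconds) controlling how fast it falls off.
    """
    return math.exp(-abs(doc1[0] - doc2[0]) / scale)
```

You would then build the index as, e.g., CustomSimilarity([[t] for t in timestamps], time_similarity, num_best=2) and query it with cs[[query_timestamp]], so documents closest in time rank highest.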

