How to quickly calculate cosine similarity for a large number of vectors in Python?

I have a set of 100,000 vectors, and for each one I need to retrieve the top 25 closest vectors based on cosine similarity.

Scipy and Sklearn have implementations for computing the cosine distance/similarity of two vectors, but I need to compute the cosine similarity for a 100k x 100k matrix and then take the top 25 per vector. Is there a fast way to do this in Python?

As suggested by @Silmathoron, this is what I am doing:

```python
import heapq

import numpy

# vectors is a list of size 100K x 400, i.e. 100K vectors, each of dimension 400
vectors = numpy.array(vectors)
similarity = numpy.dot(vectors, vectors.T)

# squared magnitude of preference vectors (number of occurrences)
square_mag = numpy.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if it doesn't occur, set its inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

k = 26  # 25 neighbours plus the vector itself
box_plot_file = open("box_data.csv", "w")  # Python 3: open() instead of file()
for sim, query in zip(cosine, queries):    # Python 3: zip() instead of itertools.izip()
    k_largest = heapq.nlargest(k, sim)
    k_largest = map(str, k_largest)
    result = query + "," + ",".join(k_largest) + "\n"
    box_plot_file.write(result)
box_plot_file.close()
```
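Note that materializing the full 100k x 100k similarity matrix takes roughly 100,000 x 100,000 x 8 bytes (about 80 GB) as float64, which usually does not fit in memory. A sketch of a chunked variant (the function name `top_k_cosine` and the chunk size are my own, not from the question) that keeps only one slab of the matrix at a time and uses `numpy.argpartition` to pull out the top-k indices:

```python
import numpy as np

def top_k_cosine(vectors, k=25, chunk=1024):
    """For each row, return the indices of its k most cosine-similar rows.

    Chunking keeps peak memory at O(chunk * n) instead of O(n^2).
    """
    v = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # avoid division by zero for all-zero rows
    unit = v / norms                   # unit-normalize once, up front
    n = unit.shape[0]
    top = np.empty((n, k), dtype=np.intp)
    for start in range(0, n, chunk):
        # cosine similarities of this chunk against all rows
        sims = unit[start:start + chunk] @ unit.T
        # argpartition finds the k largest per row in O(n); sort only those k
        part = np.argpartition(sims, -k, axis=1)[:, -k:]
        order = np.take_along_axis(sims, part, axis=1).argsort(axis=1)[:, ::-1]
        top[start:start + chunk] = np.take_along_axis(part, order, axis=1)
    return top
```

`heapq.nlargest` per row works too, but `argpartition` does the selection in C and avoids sorting the whole row.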
1 answer

First, I would try smarter algorithms rather than speeding up the brute force (computing all pairs of vectors). KD-trees can work via scipy.spatial.KDTree if your vectors are low-dimensional. If they are high-dimensional, you may need a random projection first: http://scikit-learn.org/stable/modules/random_projection.html
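For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b), so the Euclidean nearest neighbours a KD-tree returns are exactly the most cosine-similar vectors. A minimal sketch of that idea, assuming normalization up front (the function name `top_k_cosine_kdtree` is mine, not from the answer):

```python
import numpy as np
from scipy.spatial import cKDTree

def top_k_cosine_kdtree(vectors, k=25):
    """Nearest neighbours by cosine similarity via a KD-tree on unit vectors.

    Works because ||a - b||^2 = 2 - 2*cos(a, b) when a and b have unit length,
    so smallest Euclidean distance == largest cosine similarity.
    """
    v = np.asarray(vectors, dtype=np.float64)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    tree = cKDTree(unit)
    # query k + 1 neighbours because each point's nearest neighbour is itself
    dist, idx = tree.query(unit, k=k + 1)
    return idx[:, 1:]  # drop the self-match in column 0
```

As the answer notes, KD-trees degrade toward brute force as dimensionality grows, so for 400-dimensional vectors a random projection to a lower dimension first may be needed for this to pay off.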

