Cosine similarity is usually defined as x^T y / (||x|| * ||y||), and equals 1 if the vectors are identical and -1 if they are completely opposite. This definition is technically not a metric, so you cannot use accelerating structures such as ball trees and KD-trees with it. If you force scikit-learn to use the brute force approach, you should be able to use it as a distance by passing it your own distance metric object. There are methods for transforming cosine similarity into a valid distance metric if you want to use ball trees (you can find one in the JSAT library).
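As a minimal sketch of the brute-force route (the dataset and parameters here are made up for illustration): scikit-learn's neighbor search accepts the cosine distance directly when the algorithm is brute force, while the tree-based algorithms reject it.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Brute force search works with the cosine distance (or a custom callable);
# ball trees and KD-trees would refuse it because it is not a true metric.
clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute', metric='cosine')
clf.fit(X, y)
print(clf.predict(X[:3]))
```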
We note, however, that x^T y / (||x|| * ||y||) = (x / ||x||)^T (y / ||y||). The Euclidean distance can equivalently be written as sqrt(x^T x + y^T y - 2 x^T y). If we normalize every data point before passing it to the KNeighborsClassifier, then x^T x = 1 for all x. Thus the Euclidean distance degrades to sqrt(2 - 2 x^T y). For exactly the same inputs we would get sqrt(2 - 2*1) = 0, and for complete opposites sqrt(2 - 2*(-1)) = 2. This is clearly a monotone transform of the cosine similarity, so you can get the same ordering as the cosine distance by normalizing your data and then using the Euclidean distance. As long as you use the uniform weights option, the results will be identical to having used the correct cosine distance.
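A rough sketch of that equivalence (again with illustrative data and neighbor count): normalized data with plain Euclidean distance, which lets a ball tree work, should return the same neighbor ordering as brute-force cosine distance, barring ties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

X, _ = make_classification(n_samples=200, n_features=20, random_state=0)
Xn = normalize(X)  # every row now has unit L2 norm, so x^T x = 1

# Euclidean distance on the normalized data: tree structures are usable again.
euc = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(Xn)

# Cosine distance requires brute force.
cos = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(X)

# sqrt(2 - 2 x^T y) is a monotone transform of the cosine similarity,
# so the neighbor indices come out in the same order (barring ties).
print(np.array_equal(euc.kneighbors(Xn)[1], cos.kneighbors(X)[1]))
```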
Raff.edward