Cosine similarity is usually defined as x^T y / (||x|| * ||y||), and equals 1 if the vectors are identical and -1 if they are completely opposite. This definition is technically not a metric, so you cannot use accelerating structures such as ball trees and KD-trees with it. If you force scikit-learn to use the brute force approach, you should be able to use it as a distance by passing it your own distance metric object. There are methods for transforming cosine similarity into a valid distance metric if you want to use ball trees (you can find one in the JSAT library).
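As a minimal sketch of the brute-force route (the dataset and parameters here are made up for illustration): scikit-learn's neighbor search accepts the cosine distance directly when the algorithm is brute force, while the tree-based algorithms reject it.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Brute force search works with the cosine distance (or a custom callable);
# ball trees and KD-trees would refuse it because it is not a true metric.
clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute', metric='cosine')
clf.fit(X, y)
print(clf.predict(X[:3]))
```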
We note, however, that x^T y / (||x|| * ||y||) = (x / ||x||)^T (y / ||y||). The Euclidean distance can equivalently be written as sqrt(x^T x + y^T y - 2 x^T y). If we normalize every data point before passing it to the KNeighborsClassifier, then x^T x = 1 for all x. Thus the Euclidean distance degrades to sqrt(2 - 2 x^T y). For exactly the same inputs we would get sqrt(2 - 2*1) = 0, and for complete opposites sqrt(2 - 2*(-1)) = 2. This is clearly a monotone transform of the cosine similarity, so you can get the same ordering as the cosine distance by normalizing your data and then using the Euclidean distance. As long as you use the uniform weights option, the results will be identical to having used the correct cosine distance.
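A rough sketch of that equivalence (again with illustrative data and neighbor count): normalized data with plain Euclidean distance, which lets a ball tree work, should return the same neighbor ordering as brute-force cosine distance, barring ties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

X, _ = make_classification(n_samples=200, n_features=20, random_state=0)
Xn = normalize(X)  # every row now has unit L2 norm, so x^T x = 1

# Euclidean distance on the normalized data: tree structures are usable again.
euc = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(Xn)

# Cosine distance requires brute force.
cos = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine').fit(X)

# sqrt(2 - 2 x^T y) is a monotone transform of the cosine similarity,
# so the neighbor indices come out in the same order (barring ties).
print(np.array_equal(euc.kneighbors(Xn)[1], cos.kneighbors(X)[1]))
```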
Raff.edward