Having to choose "k" up front is one of the biggest disadvantages of k-means. However, if you use the search function here, you will find a number of questions covering well-known heuristics for choosing k. Essentially, they amount to running the algorithm several times with different values of k and comparing the results.
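One such heuristic is the "elbow" method: plot the total within-cluster variance against k and look for the point where further increases in k stop paying off. A minimal sketch, assuming scikit-learn is available and using synthetic data invented here for illustration:

```python
# Elbow-method sketch: run k-means for several k and compare inertia
# (the total within-cluster sum of squared deviations).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs, so the "elbow" should sit near k = 3.
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                  for c in (0.0, 5.0, 10.0)])

# Inertia for each candidate k; it always decreases as k grows,
# but the drop flattens out once k matches the true structure.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0)
                  .fit(data).inertia_
            for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The blob centers and the range of k are arbitrary choices for the demo; in practice you would plot these values and pick k by eye or with a formal criterion.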
As for the "nearest": k-means does not really use distances. Some people believe it uses the Euclidean distance, others say the squared Euclidean distance. Technically, what k-means cares about is variance. It minimizes the overall variance by assigning each object to a cluster so that the total variance is minimized. Coincidentally, the sum of squared deviations — one object's contribution to the total variance — summed across all dimensions is exactly the definition of the squared Euclidean distance. And since the square root is monotonic, you can just as well use the Euclidean distance.
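That identity is easy to verify numerically. A small NumPy sketch (not a full k-means; the points and the assignment are fixed, made-up values) showing that the per-dimension sum of squared deviations from the cluster means equals the sum of squared Euclidean distances to the assigned means:

```python
import numpy as np

# Four made-up 2-D points and a fixed cluster assignment, for illustration.
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1])
means = np.array([points[labels == c].mean(axis=0) for c in (0, 1)])

# Sum over dimensions of squared deviations from each cluster mean ...
per_dim = sum(((points[labels == c] - means[c]) ** 2).sum()
              for c in (0, 1))
# ... equals the sum of squared Euclidean distances to the assigned mean.
sq_dists = (np.linalg.norm(points - means[labels], axis=1) ** 2).sum()

assert np.isclose(per_dim, sq_dists)
print(per_dim)  # 1.0: each point lies 0.5 from its cluster mean
```

This is why "k-means minimizes variance" and "k-means uses squared Euclidean distance" describe the same objective.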
In any case, if you want to use k-means on words, you first need to represent the words as vectors in a space where the squared Euclidean distance makes sense. I don't think that will be easy, and it may well be impossible.
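To make the point concrete: one common (if crude) way to get such vectors is character n-gram counts. A hedged sketch, assuming scikit-learn is available; the word list is invented, and whether the resulting distances are actually meaningful for your task is exactly the open question above:

```python
# Embed words as character-bigram count vectors, then cluster them.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

words = ["night", "knight", "light", "apple", "apply", "maple"]

# Each word becomes a vector of character-bigram counts, so squared
# Euclidean distance is at least well defined on these vectors.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
vectors = vectorizer.fit_transform(words).toarray()

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for word, label in zip(words, km.labels_):
    print(word, label)
```

Here the "-ight" words and the "-ple/-ply" words share many bigrams, so k-means tends to separate the two groups — but this captures spelling overlap, not meaning, which is the limitation the paragraph above warns about.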