Clustering words into groups

This is a homework problem. I have a huge document full of words. My task is to classify these words into different groups/clusters that adequately represent them. My strategy is to use the k-means algorithm, which, as you know, performs the following steps:

  • Pick k random means for the whole data set
  • Form k clusters by assigning each word to its closest mean
  • Compute the centroid of each cluster, which becomes the new mean
  • Repeat steps 2 and 3 until some stopping criterion / convergence is reached (restated as pseudocode below).
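
To make sure I am reading those steps correctly, here they are as rough Python pseudocode; the dist and mean_of helpers are pure placeholders, because how to define them for words is exactly what I do not understand:

```python
import random

def k_means(items, k, dist, mean_of, iterations=100):
    """Sketch of the steps above; `dist` and `mean_of` are placeholders."""
    means = random.sample(items, k)                    # step 1: k random means
    for _ in range(iterations):                        # step 4: repeat 2 and 3
        clusters = [[] for _ in range(k)]
        for item in items:                             # step 2: nearest mean
            nearest = min(range(k), key=lambda i: dist(item, means[i]))
            clusters[nearest].append(item)
        means = [mean_of(c) if c else m                # step 3: new centroids
                 for c, m in zip(clusters, means)]
    return clusters
```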
I understand this in theory, but not completely. I have a question about each of these steps:
  • How do I choose the k random means? Technically I could just say 5, but that is not necessarily a good number. So is k a purely arbitrary choice, or is it driven by heuristics such as the size of the data set, the number of words involved, etc.?

  • How do I associate each word with the closest mean? In theory I understand that each word is assigned by its distance to the nearest mean, so with 3 means a word belongs to the cluster whose mean it is closest to. But how is that distance actually calculated? Say I have the two words "group" and "text word" and the mean is the word "pencil"; how do I create a similarity matrix?

  • How do you calculate a centroid?

  • When you repeat steps 2 and 3, do you take each previous cluster as a new dataset?

Lots of questions, and I obviously do not know the answers. If there are any resources I could read, that would be great. Wikipedia did not quite do it :(

+7
4 answers

Since you do not know the number of clusters in advance, I would suggest using some kind of hierarchical clustering:

  • Imagine that all your words are points in a non-Euclidean space. Use the Levenshtein distance to measure the distance between words (it works great if you want to find clusters of lexicographically similar words)
  • Build a minimum spanning tree that contains all your words
  • Remove the edges that are longer than some threshold
  • The groups of words that remain connected are your clusters of similar words (a short code sketch of this procedure follows right after this list).
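
A minimal sketch of this procedure in Python, assuming a plain dynamic-programming edit distance and Prim's algorithm for the spanning tree; the sample word list and the threshold of 2 are made-up values for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (classic dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[-1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def mst_edges(words):
    """Prim's algorithm on the complete graph of pairwise edit distances."""
    n = len(words)
    edges = []
    # best known connection (distance, tree vertex) for each vertex not yet in the tree
    best = {i: (levenshtein(words[0], words[i]), 0) for i in range(1, n)}
    while best:
        v = min(best, key=lambda i: best[i][0])
        dist, parent = best.pop(v)
        edges.append((dist, parent, v))
        for u in best:
            d = levenshtein(words[v], words[u])
            if d < best[u][0]:
                best[u] = (d, v)
    return edges


def cluster(words, threshold):
    """Cut MST edges longer than `threshold`; connected groups are the clusters."""
    parent = list(range(len(words)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for dist, a, b in mst_edges(words):
        if dist <= threshold:            # keep only short edges
            parent[find(a)] = find(b)

    groups = {}
    for i, word in enumerate(words):
        groups.setdefault(find(i), []).append(word)
    return list(groups.values())


print(cluster(["table", "tables", "tablet", "chair", "chairs", "pencil", "pencils"],
              threshold=2))
# e.g. [['table', 'tables', 'tablet'], ['chair', 'chairs'], ['pencil', 'pencils']]
```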


P.S. You can find many papers online that describe clustering based on building a minimum spanning tree.

P.P.S. If you want to detect clusters of semantically similar words, you will need algorithms for automatic thesaurus construction.

+11

Having to choose k up front is one of the biggest drawbacks of k-means. However, if you use the search function here, you will find a number of questions covering the well-known heuristic approaches to choosing k, which essentially amount to running the algorithm several times and comparing the results.

As for "nearest": k-means does not really use distances. Some people believe it uses the Euclidean distance, others say it is the squared Euclidean distance. Technically, what k-means cares about is variance: it minimizes the overall variance by assigning each object to the cluster where it adds the least variance. Coincidentally, the sum of squared deviations (one object's contribution to the total variance), taken over all dimensions, is exactly the definition of squared Euclidean distance; and since the square root is monotone, you can equally well use the Euclidean distance.
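
A tiny numeric check of that point, using made-up toy vectors: one object's sum of squared deviations from a cluster mean, taken over all dimensions, is exactly its squared Euclidean distance to that mean.

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0])   # one object, already represented as a vector
c = np.array([2.0, 1.0, 2.0])   # the mean of some candidate cluster

sum_sq_dev = np.sum((x - c) ** 2)         # variance contribution of x in this cluster
sq_euclid = np.linalg.norm(x - c) ** 2    # squared Euclidean distance

print(sum_sq_dev, sq_euclid)   # both are 10.0 (up to floating-point rounding)
```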

In any case, if you want to use k-means on words, you first need to represent the words as vectors for which the squared Euclidean distance is meaningful. I do not think that will be easy, and it may not even be possible.

0

About the distance: the Levenshtein (edit) distance does satisfy the triangle inequality. It also satisfies the other properties required of a metric (not every distance function is a metric). You can therefore implement your clustering algorithm using this metric, and it is the function you could use to compute your similarity matrix S:

S_{i,j} = d(x_i, x_j) = d(x_j, x_i) = S_{j,i}
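
As an illustration, a short Python sketch that fills such a matrix with edit distances; the levenshtein helper and the three example words (borrowed from the question) are just stand-ins for whatever implementation and vocabulary you actually use:

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[-1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

words = ["group", "text word", "pencil"]
n = len(words)
S = [[levenshtein(words[i], words[j]) for j in range(n)] for i in range(n)]

# The matrix is symmetric: S[i][j] == S[j][i], with zeros on the diagonal.
for row in S:
    print(row)
```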

It is worth noting that the Damerau-Levenshtein distance does not satisfy the triangle inequality, so be careful with this.

About the k-means algorithm: yes, in the basic version you have to set the parameter k manually. The rest of the algorithm stays the same when used with this metric.

0
