The two most popular document clustering approaches are hierarchical clustering and k-means. k-means is faster because it is linear in the number of documents, whereas hierarchical clustering is quadratic, but hierarchical clustering is generally considered to give better results. Each document in the data set is usually represented as an n-dimensional vector (n is the number of distinct words), with the dimension corresponding to each word equal to its tf-idf score: term frequency times inverse document frequency. The tf-idf weighting reduces the influence of high-frequency words in similarity calculations. Cosine similarity is the most commonly used similarity measure.
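A minimal sketch of this pipeline, assuming scikit-learn is available (the toy corpus and k = 2 are illustrative assumptions, not from the original answer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; any list of strings works.
docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

# Each document becomes an n-dimensional tf-idf vector (n = vocabulary size).
# TfidfVectorizer L2-normalizes rows by default, so Euclidean k-means on
# these vectors behaves much like clustering by cosine similarity.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Cosine similarity between the first two documents.
print(cosine_similarity(X[0], X[1]))

# k-means: one pass over the documents per iteration, i.e. linear in corpus size.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X))
```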
A paper comparing experimental results between hierarchical clustering and bisecting k-means, a variant of k-means, can be found here.
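For reference, recent versions of scikit-learn (1.1+) ship a BisectingKMeans estimator; the toy corpus below is again just an illustrative assumption:

```python
from sklearn.cluster import BisectingKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "dogs make good pets",
        "stock markets fell", "investors worry about volatility"]
X = TfidfVectorizer().fit_transform(docs)

# Bisecting k-means starts from a single cluster and repeatedly splits a
# chosen cluster with plain 2-means until n_clusters clusters remain.
bkm = BisectingKMeans(n_clusters=2, random_state=0)
print(bkm.fit_predict(X))
```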
The simplest approaches to reducing dimensionality for document clustering are: a) pruning: throw away all very rare and very frequent words (for example, those appearing in less than 1% or more than 60% of documents; these thresholds are somewhat arbitrary, so try different ranges on each data set to see the effect on results), b) stop word removal: drop all words that appear on a stop list of common English words (such lists can be found on the Internet), and c) stemming: strip suffixes to leave only word roots; the most common stemmer is the one developed by Martin Porter, and implementations in many languages can be found here. A sketch of all three steps follows. Typically they reduce the number of unique words in the data set to a few hundred or a few thousand, and further dimensionality reduction may not be necessary. Otherwise, you can use techniques such as PCA.
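A hedged sketch of the three steps, assuming scikit-learn and NLTK's Porter stemmer; the 1%/60% thresholds and the number of SVD components are illustrative choices:

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import TruncatedSVD

stemmer = PorterStemmer()

def tokenize(text):
    # Crude whitespace tokenization; a real pipeline would use a proper tokenizer.
    # (b) drop stop words, then (c) reduce each remaining word to its stem.
    return [stemmer.stem(t) for t in text.lower().split()
            if t not in ENGLISH_STOP_WORDS]

docs = [
    "the cats sat on the mats",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

vectorizer = TfidfVectorizer(
    tokenizer=tokenize,
    min_df=0.01,   # (a) drop words appearing in fewer than 1% of documents...
    max_df=0.60,   #     ...or in more than 60% of them
)
X = vectorizer.fit_transform(docs)
print(len(vectorizer.vocabulary_))  # unique words that survived pruning

# If the vocabulary is still large, project onto fewer dimensions.
# TruncatedSVD plays the role of PCA for sparse tf-idf matrices
# (a few hundred components is typical on real corpora; 2 fits this toy set).
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)
```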