Data clustering algorithm

What is the most popular text clustering algorithm that handles large, high-dimensional datasets quickly? I am confused after reading so many papers describing so many approaches. I just want to know which one is most commonly used, so I have a good starting point for writing a document clustering application.

+4

5 answers

To deal with the curse of dimensionality, you can try to identify the blind sources (i.e. topics) that generated your dataset. You can use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indexes.

PCA is what is used in Latent Semantic Indexing, since SVD can be shown to be essentially PCA :)

Remember that you can lose interpretability when you take the principal components of your dataset, or its factors, so you might want to go the Non-negative Matrix Factorization (NMF) route instead. (And here's the punchline: K-Means is a special case of NMF!) In NMF, a dataset can be explained simply by its additive, non-negative components.
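A minimal sketch of both routes, assuming scikit-learn (my library choice; the answer names no implementation): TruncatedSVD applied to a tf-idf matrix is the usual way to do LSI, and NMF yields the additive non-negative components described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF

docs = ["the cat sat on the mat", "dogs and cats",
        "stock markets fell", "the market rallied on earnings"]  # toy corpus

tfidf = TfidfVectorizer().fit_transform(docs)  # documents x terms matrix

# LSI: SVD on the tf-idf matrix, keeping 2 latent dimensions
lsi = TruncatedSVD(n_components=2).fit_transform(tfidf)

# NMF: the same reduction, but with non-negative additive "topics"
topics = NMF(n_components=2, init="nndsvd").fit_transform(tfidf)

print(lsi.shape, topics.shape)  # each document reduced to 2 dimensions
```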

+2

No single approach fits every problem. Hierarchical clustering is always an option. If you want distinct, disjoint groups formed from the data, you can go with K-means clustering (it is also reputedly less computationally intensive).
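For the hierarchical option, a minimal sketch assuming SciPy (my choice of library, not the answerer's):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(10, 5)                          # toy feature vectors
Z = linkage(pdist(X, "cosine"), method="average")  # build the dendrogram bottom-up
labels = fcluster(Z, t=3, criterion="maxclust")    # cut it into 3 flat clusters
print(labels)
```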

+1

The two most popular document clustering approaches are hierarchical clustering and k-means. k-means is faster, because it is linear in the number of documents, unlike hierarchical clustering, which is quadratic, but hierarchical clustering is generally thought to produce better-quality clusters. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the dimension corresponding to each word equal to its term frequency-inverse document frequency (tf-idf) score. The tf-idf score reduces the importance of high-frequency words in similarity calculations. Cosine similarity is often used as the similarity measure.
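A hedged sketch of that pipeline with scikit-learn (my library choice). Note that k-means with Euclidean distance on L2-normalised tf-idf vectors effectively clusters by cosine similarity, since tf-idf vectors from scikit-learn are unit-normalised by default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat", "cats and dogs", "markets fell", "stocks rallied"]

X = TfidfVectorizer().fit_transform(docs)      # tf-idf document vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sim = cosine_similarity(X)                     # pairwise cosine similarities
print(labels)
print(sim.round(2))
```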

A paper comparing experimental results between hierarchical clustering and bisecting k-means, a cousin algorithm of k-means, can be found here.

The simplest approaches to dimensionality reduction in document clustering are: a) throw out all rare and highly frequent words (say, those appearing in less than 1% and more than 60% of documents: this is somewhat arbitrary, and you need to try different ranges on each dataset to see the effect on results), b) stopping: throw out all words appearing on a stop list of common English words (such lists can be found on the Internet), and c) stemming, or removing suffixes to leave only word roots. The most common stemmer is the one designed by Martin Porter. Implementations in many languages can be found here. Usually this reduces the number of unique words in a dataset to a few hundred or a few thousand, and further dimensionality reduction may not be necessary. Otherwise, you can use methods like PCA.
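A sketch of steps (a)-(c), assuming scikit-learn and NLTK are available (the answer names Porter's stemmer but no particular library):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def tokenize(text):
    # crude whitespace tokenisation; (b) drop stop words, (c) stem the rest
    return [stemmer.stem(t) for t in text.lower().split()
            if t not in ENGLISH_STOP_WORDS]

vec = TfidfVectorizer(
    tokenizer=tokenize,
    min_df=0.01,   # (a) drop words in fewer than 1% of documents...
    max_df=0.60,   # ...and in more than 60% of them (tune per dataset)
)

docs = ["the markets rallied strongly", "clustering documents is fun",
        "market rallies lifted stocks"]
X = vec.fit_transform(docs)
print(sorted(vec.vocabulary_))  # surviving stemmed vocabulary
```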

+1

I would stick with k-medoids, since you can compute the distance from any point to any other point at the start of the algorithm. You only need to do this once, which saves time, especially when there are many dimensions. The algorithm chooses as each cluster center the point that is closest to the cluster's other members, rather than a centroid computed from the mean of the points belonging to the cluster. Therefore, all possible distance calculations are already done for you in this algorithm.
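A rough PAM-style sketch of that idea (`k_medoids` is a hypothetical helper, not a named library API): the distance matrix D is computed exactly once up front, and everything afterwards is table lookups.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def k_medoids(D, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)   # nearest-medoid lookup
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                # the new medoid is the member closest to all the others
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels

pts = np.random.rand(20, 3)          # toy data
D = squareform(pdist(pts))           # all pairwise distances, computed once
medoids, labels = k_medoids(D, k=3)
```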

-1

In case you are not looking for semantic clustering of the text (I can't tell whether this is a requirement from your original question), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster, and afterwards validate your clustering using silhouette coefficients. Unfortunately, Levenshtein can be quite slow, but there are ways to speed it up using thresholds and other methods.
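A sketch of that route, with a plain dynamic-programming Levenshtein (dedicated libraries would be faster) and scikit-learn's silhouette score on the precomputed distance matrix; the labels here stand in for output of a k-medoids run such as the sketch above.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def levenshtein(a, b):
    # classic DP: prev[j] holds the edit distance between a[:i] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

strings = ["kitten", "sitting", "mitten", "stock", "stick"]
D = np.array([[levenshtein(a, b) for b in strings] for a in strings], float)

labels = np.array([0, 0, 0, 1, 1])  # e.g. labels returned by k-medoids
print(silhouette_score(D, labels, metric="precomputed"))
```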

Another way to deal with the curse of dimensionality is to find "contrast sets", conjunctions of attribute-value pairs that are markedly more prominent in one group than in the rest. You can then use these contrast sets as dimensions, either in place of the original attributes or with a restricted number of attributes.
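A heavily simplified stand-in for that idea (real contrast-set miners such as STUCCO test conjunctions of attribute-value pairs with statistical significance checks; this just scores single terms, and `contrast_terms` is a hypothetical helper, not a library function):

```python
import numpy as np

def contrast_terms(X, labels, group, top=5):
    # X: binary documents-x-terms matrix; keep terms over-represented in `group`
    in_group = X[labels == group].mean(axis=0)    # support within the group
    elsewhere = X[labels != group].mean(axis=0)   # support everywhere else
    return np.argsort(in_group - elsewhere)[::-1][:top]

X = (np.random.rand(30, 50) > 0.7).astype(int)    # toy binary term matrix
labels = np.repeat([0, 1, 2], 10)
dims = contrast_terms(X, labels, group=0)         # candidate contrast dimensions
```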

-1
