Although this is the simplest of these algorithms and treats every feature as independent, it works fine for classical text classification. It is the algorithm I would try first.
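To make the "everything is considered independent" idea concrete, here is a minimal sketch, assuming the algorithm described above is multinomial Naive Bayes. The class name `NaiveBayesTextClassifier`, the whitespace `tokenize` helper, and the tiny spam/ham data set are all hypothetical and only for illustration; a real project would more likely use a library implementation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_doc_counts = Counter()         # documents seen per class
        self.token_counts = defaultdict(Counter)  # token frequencies per class
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_doc_counts[label] += 1
            for token in tokenize(text):
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, text):
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_doc_counts.items():
            # Log prior of the class.
            score = math.log(doc_count / total_docs)
            total_tokens = sum(self.token_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # ignore tokens never seen in training
                count = self.token_counts[label][token]
                # Independence assumption: each token adds its own
                # log-likelihood, regardless of the other tokens.
                score += math.log(
                    (count + self.alpha)
                    / (total_tokens + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy data, only to show the call pattern.
train_texts = [
    "cheap pills buy now",
    "win money now",
    "meeting agenda for monday",
    "project status report attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = model.predict("buy cheap money")
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a token that never occurred in a class from zeroing out the whole score.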
KNN (k-nearest neighbors) is actually a classification method; it is k-means that does clustering. The two are easy to mix up, so make sure you keep the concepts of clustering and classification apart.
SVM comes in two forms: SVC for classification and SVR for regression (forecasting). It used to be a popular choice, but in my experience it does not work well for text classification, because it depends heavily on a good tokenizer (filters), and there are always dirty tokens in the data set's dictionary. The accuracy I got was very poor.
- Random forest (decision tree)
I have never tried this method for text classification. It seems to me that a decision tree needs a few key nodes to split on, and it is hard to find “several key tokens” that classify a text. A random forest also does not work well with high-dimensional sparse data.
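To see why text ends up high-dimensional and sparse, here is a small sketch that builds bag-of-words count vectors and measures the fraction of zero entries. The helper names `bag_of_words` and `sparsity` and the three sample sentences are hypothetical, chosen only for illustration.

```python
def bag_of_words(texts):
    # One column per distinct token, one row per document.
    vocabulary = sorted({tok for t in texts for tok in t.lower().split()})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    rows = []
    for t in texts:
        row = [0] * len(vocabulary)
        for tok in t.lower().split():
            row[index[tok]] += 1
        rows.append(row)
    return vocabulary, rows


def sparsity(rows):
    # Fraction of matrix entries that are zero.
    total = sum(len(r) for r in rows)
    zeros = sum(1 for r in rows for v in r if v == 0)
    return zeros / total


# Made-up documents that share no tokens with each other.
docs = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "new phone released this week",
]
vocabulary, rows = bag_of_words(docs)
zero_fraction = sparsity(rows)
```

Even with three short documents most entries are zero, and the number of columns grows with every new distinct token, so a single tree split on one token separates very few documents.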
FYI:
All of this comes from my own experience. For your case there is no better way to decide which method to use than to try each algorithm and see which one fits your model.
Apache Mahout is a great tool for machine learning. It covers three areas: recommendation, clustering, and classification. You could try this library, but you will need some basic knowledge of Hadoop first.
For machine learning experiments there is also Weka, a software toolkit that integrates many algorithms.
Freya ren