Text classification methods? SVM and decision tree

I have a training set and I want to use a classification method to classify other documents according to it. My documents are news articles, and the categories are sports, politics, economics, etc.

I understand Naive Bayes and KNN completely, but SVM and decision trees are fuzzy to me, and I don’t know whether I can implement these methods myself or whether there are existing applications for using them.

What is the best method I can use to classify documents this way?

thanks!

+8
classification svm
3 answers
  • Naive Bayes

Although this is the simplest algorithm and it treats all features (words) as independent of each other, it works fine for classical text classification. I would try this algorithm first.

  • KNN

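KNN (k-nearest neighbors) is a supervised classification method. It should not be confused with k-means, which is a clustering algorithm. Make sure the difference between clustering and classification is clear before choosing a method.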

  • SVM

SVM comes in two variants: SVC for classification and SVR for regression (forecasting). It can work well, but in my experience it did not work well for text classification, because it places high demands on good tokenizers (filters), and there are always dirty tokens in the data set's vocabulary. The accuracy I got was very poor.

  • Random forest (decision tree)

I have never tried this method for text classification. It seems to me that a decision tree needs a few key nodes, while it is hard to find a “few key tokens” that classify a text, and a random forest does not work well on high-dimensional sparse data.
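As a rough sketch of the Naive Bayes approach suggested first in the list above, here is a minimal multinomial Naive Bayes classifier in plain Python. The function names, the whitespace tokenizer, and the toy news snippets are all assumptions made for illustration; a real system would use a proper tokenizer and far more training data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect per-class word counts and class frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_naive_bayes(model, text):
    """Return the class with the highest log-posterior for `text`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids log(0) for unseen class/word pairs
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this example
docs = [
    "the team won the match",
    "the striker scored a late goal",
    "parliament passed the new law",
    "the minister won the election",
    "the central bank raised interest rates",
    "inflation and markets fell",
]
labels = ["sports", "sports", "politics", "politics", "economics", "economics"]

model = train_naive_bayes(docs, labels)
prediction = predict_naive_bayes(model, "the striker scored in the match")
print(prediction)
```

The independence assumption shows up in the inner loop: each word contributes its own log-probability term, with no interaction between words.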

FYI

All of this is from my own experience. For your case there is no better way to decide which method to use than to try each algorithm on your own data.

Apache Mahout is a great tool for machine learning algorithms. It covers three areas: recommendation, clustering and classification. You can try this library, but you will need some basic knowledge of Hadoop first.

For machine learning experiments, Weka is a software toolkit that integrates many algorithms.

+11

Linear SVMs are among the best algorithms for text classification tasks (along with logistic regression). Decision trees suffer badly in such high-dimensional spaces.

The Pegasos algorithm is one of the simplest linear SVM algorithms and is incredibly efficient.
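To give an idea of how little code Pegasos needs, here is a minimal sketch of its single-example stochastic update for a binary linear SVM without a bias term. The helper names (`pegasos_train`, `featurize`, `predict`), the hyperparameter values, and the four-word vocabulary are assumptions for illustration. The published algorithm also describes an optional projection step and mini-batches, which this sketch leaves out.

```python
import random


def pegasos_train(X, y, lam=0.01, n_iters=2000, seed=0):
    """Train a linear SVM (no bias term) with Pegasos-style SGD.

    X: list of dense feature vectors, y: labels in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(1, n_iters + 1):
        i = rng.randrange(len(X))      # pick one training example at random
        eta = 1.0 / (lam * t)          # decaying step size
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        shrink = 1.0 - 1.0 / t         # equals 1 - eta * lam
        w = [shrink * wj for wj in w]  # step from the L2 regularizer
        if margin < 1.0:               # hinge loss is active for this example
            w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def featurize(text, vocab):
    """Bag-of-words counts over a fixed vocabulary (toy featurizer)."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocab]


def predict(w, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Toy data, invented for this example: +1 = sports, -1 = politics
vocab = ["goal", "match", "vote", "law"]
train_texts = ["goal match", "goal", "match", "vote law", "vote", "law"]
y = [1, 1, 1, -1, -1, -1]
X = [featurize(text, vocab) for text in train_texts]

w = pegasos_train(X, y)
print(predict(w, featurize("late goal in the match", vocab)))
print(predict(w, featurize("vote on the law", vocab)))
```

For more than two categories (sports, politics, economics, ...), one common option is to train one such classifier per category against all the others and pick the category with the highest score.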

EDIT: Multinomial naive Bayes also works well with text data, although usually not as well as linear SVMs. kNN can work okay, but it is an already slow algorithm and never tops the accuracy charts on text problems.

+5

If you are familiar with Python, you can consider NLTK and scikit-learn. The former is for NLP, and the latter is a more comprehensive machine learning package (but it also has a large set of text processing modules). Both are open source and have excellent community support on SO.
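As an illustration of what this looks like in scikit-learn, the sketch below fits three of the classifiers discussed in this thread (multinomial naive Bayes, a linear SVM, and logistic regression) on TF-IDF features. The training snippets and labels are invented for the example, and it assumes scikit-learn is installed. With real data you would compare the models on held-out documents (for instance with cross-validation) rather than on a single test sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for this example
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "parliament passed the new law",
    "the minister won the election vote",
    "the central bank raised interest rates",
    "inflation rose and stock markets fell",
]
train_labels = [
    "sports", "sports",
    "politics", "politics",
    "economics", "economics",
]

candidates = {
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

predictions = {}
for name, classifier in candidates.items():
    # Each pipeline turns raw text into TF-IDF vectors, then classifies them
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(["striker scored goal in football match"])[0]

for name, label in predictions.items():
    print(name, "->", label)
```

Swapping in another classifier only requires adding it to `candidates`, which makes it easy to try each algorithm on the same data.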

+2