Text classification algorithm


I have millions of short documents (up to 30 words each) that I need to sort into several well-known categories. A document may belong to several categories (rarely, but it happens), and it may also match none of them (also rare). I also have millions of documents that have already been classified. Speed is not important; I need the classification to be as accurate as possible.
Which algorithm should I use? Is there an implementation in C#?
Thanks for the help!

+4
5 answers

Take a look at term frequency / inverse document frequency (tf-idf) and cosine similarity to find the important words, build profiles for the categories, and assign documents to categories by similarity.

EDIT:

Found an example here
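
As an illustration (this is just a minimal sketch, not the linked example), tf-idf weighting plus cosine similarity in C# looks roughly like this; the toy corpus and the naive whitespace tokenization are assumptions for the demo:

```csharp
// Minimal tf-idf + cosine similarity sketch. Tokenization is naive
// whitespace splitting; real use would add stemming, stop words, etc.
using System;
using System.Collections.Generic;
using System.Linq;

class TfIdfDemo
{
    // Weight each term in a document by tf * idf.
    static Dictionary<string, double> TfIdfVector(string[] tokens, Dictionary<string, double> idf)
    {
        var tf = tokens.GroupBy(t => t)
                       .ToDictionary(g => g.Key, g => (double)g.Count() / tokens.Length);
        return tf.ToDictionary(
            kv => kv.Key,
            kv => kv.Value * (idf.TryGetValue(kv.Key, out var w) ? w : 0.0));
    }

    // Cosine similarity between two sparse vectors.
    static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = a.Where(kv => b.ContainsKey(kv.Key)).Sum(kv => kv.Value * b[kv.Key]);
        double na = Math.Sqrt(a.Values.Sum(v => v * v));
        double nb = Math.Sqrt(b.Values.Sum(v => v * v));
        return na == 0 || nb == 0 ? 0 : dot / (na * nb);
    }

    static void Main()
    {
        string[][] corpus =
        {
            "the cat sat on the mat".Split(' '),
            "dogs and cats are pets".Split(' '),
            "stock prices fell sharply today".Split(' ')
        };

        // idf(t) = log(N / df(t)), computed over the already-classified corpus.
        var idf = corpus.SelectMany(d => d.Distinct())
                        .GroupBy(t => t)
                        .ToDictionary(g => g.Key, g => Math.Log((double)corpus.Length / g.Count()));

        var query = TfIdfVector("a cat on a mat".Split(' '), idf);
        foreach (var doc in corpus)
            Console.WriteLine($"{string.Join(" ", doc)} -> {Cosine(query, TfIdfVector(doc, idf)):F3}");
    }
}
```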

+7

IMHO the main problem here is the length of the documents. I would call this sentence classification, and there has been work on this for Twitter. You could enrich each document by running a web search on its 30 words and analyzing the top hits; there is a paper about this, but I can't find it right now. Then I would try a vector-space approach (tf-idf, as in Jimmy's answer) and a multiclass SVM for the classification.
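
To handle the asker's multi-label and no-label cases, the multiclass part can be done one-vs-rest: one binary classifier per category, keeping every category whose decision value clears a threshold. This is only a sketch; the scorer delegates stand in for trained SVM decision functions and the threshold value is an assumption:

```csharp
// One-vs-rest sketch: one binary scorer per category; a document gets every
// category whose score exceeds the threshold (possibly several, possibly none).
using System;
using System.Collections.Generic;
using System.Linq;

class OneVsRestDemo
{
    static List<string> Classify(double[] features,
                                 Dictionary<string, Func<double[], double>> scorers,
                                 double threshold = 0.0)
    {
        return scorers.Where(kv => kv.Value(features) > threshold)
                      .Select(kv => kv.Key)
                      .ToList();
    }

    static void Main()
    {
        // Stand-in decision functions over toy 2-dimensional feature vectors;
        // in practice each would be a binary SVM trained on the labelled data.
        var scorers = new Dictionary<string, Func<double[], double>>
        {
            ["sports"]  = f => f[0] - 0.5,
            ["finance"] = f => f[1] - 0.5
        };

        Console.WriteLine(string.Join(", ", Classify(new[] { 0.9, 0.1 }, scorers))); // sports
        Console.WriteLine(string.Join(", ", Classify(new[] { 0.8, 0.7 }, scorers))); // sports, finance
        Console.WriteLine(string.Join(", ", Classify(new[] { 0.1, 0.2 }, scorers))); // (nothing)
    }
}
```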

+1

Perhaps a decision tree combined with NN?

0

You can use the SVM algorithm to classify text in C# with the libsvm.net library.
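
I won't vouch for the wrapper's exact API here, but any libsvm-based tool (libsvm.net included) can train on libsvm's standard sparse text format, `label index:value ...`. Below is a small sketch of exporting labelled documents to that format; the vocabulary indexing and the use of raw term counts are assumptions for the example:

```csharp
// Sketch: write labelled documents in libsvm's sparse training format
// ("label index1:value1 index2:value2 ..."). Values here are raw term
// counts for brevity; tf-idf weights would be the usual choice.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class LibSvmExport
{
    static void Main()
    {
        var labelled = new (int Label, string Text)[]
        {
            (1, "match ended in a draw"),
            (2, "shares rose after earnings")
        };

        // Vocabulary: term -> 1-based feature index (libsvm indices start at 1).
        var vocab = labelled.SelectMany(d => d.Text.Split(' '))
                            .Distinct()
                            .Select((term, i) => (term, i))
                            .ToDictionary(p => p.term, p => p.i + 1);

        using var writer = new StreamWriter("train.libsvm");
        foreach (var (label, text) in labelled)
        {
            var features = text.Split(' ')
                               .GroupBy(t => vocab[t])
                               .OrderBy(g => g.Key)            // indices must be ascending
                               .Select(g => $"{g.Key}:{g.Count()}");
            writer.WriteLine($"{label} {string.Join(" ", features)}");
        }
    }
}
```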

0
