Text classification algorithm


I have millions of short documents (up to 30 words each) that I need to sort into several well-known categories. A document may belong to several categories (rarely, but it happens), and it may also match none of them (also rare). I also have millions of documents that have already been classified. Speed is not important; I need the classification to be as accurate as possible.
Which algorithm should I use? Is there an implementation in C#?
Thanks for the help!

+4
5 answers

Take a look at term frequency / inverse document frequency (tf-idf) and cosine similarity to find the important words, build profiles for the categories, and assign documents to categories by similarity.

EDIT:

Found an example here
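
As an illustration (this is just a minimal sketch, not the linked example), tf-idf weighting plus cosine similarity in C# looks roughly like this; the toy corpus and the naive whitespace tokenization are assumptions for the demo:

```csharp
// Minimal tf-idf + cosine similarity sketch. Tokenization is naive
// whitespace splitting; real use would add stemming, stop words, etc.
using System;
using System.Collections.Generic;
using System.Linq;

class TfIdfDemo
{
    // Weight each term in a document by tf * idf.
    static Dictionary<string, double> TfIdfVector(string[] tokens, Dictionary<string, double> idf)
    {
        var tf = tokens.GroupBy(t => t)
                       .ToDictionary(g => g.Key, g => (double)g.Count() / tokens.Length);
        return tf.ToDictionary(
            kv => kv.Key,
            kv => kv.Value * (idf.TryGetValue(kv.Key, out var w) ? w : 0.0));
    }

    // Cosine similarity between two sparse vectors.
    static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = a.Where(kv => b.ContainsKey(kv.Key)).Sum(kv => kv.Value * b[kv.Key]);
        double na = Math.Sqrt(a.Values.Sum(v => v * v));
        double nb = Math.Sqrt(b.Values.Sum(v => v * v));
        return na == 0 || nb == 0 ? 0 : dot / (na * nb);
    }

    static void Main()
    {
        string[][] corpus =
        {
            "the cat sat on the mat".Split(' '),
            "dogs and cats are pets".Split(' '),
            "stock prices fell sharply today".Split(' ')
        };

        // idf(t) = log(N / df(t)), computed over the already-classified corpus.
        var idf = corpus.SelectMany(d => d.Distinct())
                        .GroupBy(t => t)
                        .ToDictionary(g => g.Key, g => Math.Log((double)corpus.Length / g.Count()));

        var query = TfIdfVector("a cat on a mat".Split(' '), idf);
        foreach (var doc in corpus)
            Console.WriteLine($"{string.Join(" ", doc)} -> {Cosine(query, TfIdfVector(doc, idf)):F3}");
    }
}
```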

+7

IMHO the main problem here is the length of the documents. I would call this sentence classification, and there has been work on this for Twitter. You could enrich each document by running a web search on its 30 words and analyzing the top hits; there is a paper about this, but I can't find it right now. Then I would try a vector-space approach (tf-idf, as in Jimmy's answer) and a multiclass SVM for the classification.
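
To handle the asker's multi-label and no-label cases, the multiclass part can be done one-vs-rest: one binary classifier per category, keeping every category whose decision value clears a threshold. This is only a sketch; the scorer delegates stand in for trained SVM decision functions and the threshold value is an assumption:

```csharp
// One-vs-rest sketch: one binary scorer per category; a document gets every
// category whose score exceeds the threshold (possibly several, possibly none).
using System;
using System.Collections.Generic;
using System.Linq;

class OneVsRestDemo
{
    static List<string> Classify(double[] features,
                                 Dictionary<string, Func<double[], double>> scorers,
                                 double threshold = 0.0)
    {
        return scorers.Where(kv => kv.Value(features) > threshold)
                      .Select(kv => kv.Key)
                      .ToList();
    }

    static void Main()
    {
        // Stand-in decision functions over toy 2-dimensional feature vectors;
        // in practice each would be a binary SVM trained on the labelled data.
        var scorers = new Dictionary<string, Func<double[], double>>
        {
            ["sports"]  = f => f[0] - 0.5,
            ["finance"] = f => f[1] - 0.5
        };

        Console.WriteLine(string.Join(", ", Classify(new[] { 0.9, 0.1 }, scorers))); // sports
        Console.WriteLine(string.Join(", ", Classify(new[] { 0.8, 0.7 }, scorers))); // sports, finance
        Console.WriteLine(string.Join(", ", Classify(new[] { 0.1, 0.2 }, scorers))); // (nothing)
    }
}
```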

+1

Perhaps a decision tree combined with NN?

0

You can use the SVM algorithm to classify text in C# with the libsvm.net library.
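
I won't vouch for the wrapper's exact API here, but any libsvm-based tool (libsvm.net included) can train on libsvm's standard sparse text format, `label index:value ...`. Below is a small sketch of exporting labelled documents to that format; the vocabulary indexing and the use of raw term counts are assumptions for the example:

```csharp
// Sketch: write labelled documents in libsvm's sparse training format
// ("label index1:value1 index2:value2 ..."). Values here are raw term
// counts for brevity; tf-idf weights would be the usual choice.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class LibSvmExport
{
    static void Main()
    {
        var labelled = new (int Label, string Text)[]
        {
            (1, "match ended in a draw"),
            (2, "shares rose after earnings")
        };

        // Vocabulary: term -> 1-based feature index (libsvm indices start at 1).
        var vocab = labelled.SelectMany(d => d.Text.Split(' '))
                            .Distinct()
                            .Select((term, i) => (term, i))
                            .ToDictionary(p => p.term, p => p.i + 1);

        using var writer = new StreamWriter("train.libsvm");
        foreach (var (label, text) in labelled)
        {
            var features = text.Split(' ')
                               .GroupBy(t => vocab[t])
                               .OrderBy(g => g.Key)            // indices must be ascending
                               .Select(g => $"{g.Key}:{g.Count()}");
            writer.WriteLine($"{label} {string.Join(" ", features)}");
        }
    }
}
```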

0
