What classification algorithm can be used to categorize a document?

Question

What classification algorithm can be used to categorize a document?

Hey here is my problem

Given a set of documents, I need to assign a predefined category to each document.

I was going to use the n-gram approach to represent the textual content of each document, and then train the SVM classifier for the training data that I have.
Correct me if I miss something clear.

Now the problem is that the categories must be dynamic. Meaning, my classifier should process new training data with a new category.

For example, if I prepared a classifier to classify this document as category A, category B or category C, and then I was given new training data with category D. I should be able to gradually train my classifier by providing him with new training data for " category D ".

To summarize, I do not want to combine the old training data (with 3 categories) and the new training data (with a new / invisible category) and train my classifier again. I want to train my classifier on the fly

Can this be implemented using SVM? if not, can you recommend me some classification algorithms? or any book / document that can help me.

Thanks at Advance.

+7

algorithm machine-learning classification document-classification

Tefa Aug 20 '12 at 1:54

source share

3 answers

This blog post by Edwin Chen describes endless clustering mix models . I think this method supports automatic number of clusters, but I'm still trying to circle everything around it.

+1

jergason Aug 20 '12 at 2:02

source share

The class of algorithms that match your criteria is called "Incremental Algorithms". Incremental versions of almost any method exist. The easiest to implement are naive bays.

0

Elkamina Aug 20 '12 at 5:28

source share

amit · Accepted Answer · 2012-08-20T06:22:01+0000

Naive Bayes is a relatively fast incremental salsification algorithm.
KNN is also phased in nature and even easier to implement and understand.

Both algorithms are implemented in the Weka open source project as NaiveBayes and IBk for KNN.

However, from personal experience - they are both vulnerable to a large number of uninformative functions (as a rule, this refers to the classification of text), and thus, some choice of function is usually used to compress the best performance from these algorithms, which can be problematic to implement as incremental.

What classification algorithm can be used to categorize a document?

More articles: