Training an SVM on a heterogeneous feature space

I am experimenting with a document classification task, and so far SVMs with TF*IDF feature vectors have worked well. I want to add some new features that are not frequency-based (for example, document length) and see whether these new features improve classification performance. I have the following questions:

  • Can I simply concatenate the new features with the old frequency-based features and train an SVM on this heterogeneous feature space?
  • If not, is multiple kernel learning the way to do this, i.e. training a kernel on each auxiliary feature space and combining them by linear interpolation? (We don't have MKL implemented in scikit-learn yet, right?)
  • Or should I turn to alternative learners that handle heterogeneous features well, such as MaxEnt and decision trees?

Thank you in advance for your advice!

+4
2 answers

1) Can I just concatenate the new features with the old frequency-based features and train an SVM on this heterogeneous feature space?

Since you tagged this scikit-learn: yes, you can, and you can use FeatureUnion to do this for you.
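A minimal sketch of this approach: the `DocLength` transformer below is a hypothetical helper (not part of scikit-learn) that emits one extra column, and FeatureUnion stacks it next to the tf-idf matrix; the toy documents and labels are made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

class DocLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column holding each document's token count."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(doc.split())] for doc in X], dtype=float)

docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "a rather long document about cats and dogs and other animals",
    "short text",
]
labels = [0, 1, 1, 0]

model = Pipeline([
    # FeatureUnion horizontally stacks the tf-idf matrix and the length column
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer()),
        ("length", DocLength()),
    ])),
    ("clf", LinearSVC()),
])
model.fit(docs, labels)
print(model.predict(docs))
```

Because the tf-idf output is sparse, FeatureUnion produces a sparse stacked matrix, which LinearSVC accepts directly.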

2) If not, is multiple kernel learning the way to do this, i.e. training a kernel on each auxiliary feature space and combining them by linear interpolation? (We still don't have MKL implemented in scikit-learn, right?)

Linear SVMs are the standard model for this task. Kernel methods are too slow for real-world text classification (with the possible exception of online training algorithms such as LaSVM, but that is not implemented in scikit-learn).

3) Or should I turn to alternative learners that handle heterogeneous features well, such as MaxEnt and decision trees?

SVMs handle heterogeneous features just as well as MaxEnt / logistic regression does. In both cases you really should feed in scaled data, e.g. with MinMaxScaler. Note that scikit-learn's TfidfTransformer produces normalized vectors by default, so you don't need to scale its output, only the other features.
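One way to scale only the extra features while leaving the already-normalized tf-idf rows untouched is to put the scaler inside that branch of the FeatureUnion. A sketch, again assuming a hypothetical `DocLength` helper:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler

class DocLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column holding each document's token count."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(doc.split())] for doc in X], dtype=float)

docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "a rather long document about cats and dogs and other animals",
]

features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),       # rows are L2-normalized by default
    ("length", Pipeline([
        ("extract", DocLength()),
        ("scale", MinMaxScaler()),      # only the extra feature is rescaled to [0, 1]
    ])),
])
X = features.fit_transform(docs)
lengths = X[:, -1].toarray().ravel()    # last column is the scaled length
print(lengths)
```

The raw lengths 6, 3, and 11 end up in [0, 1], while the tf-idf columns pass through unchanged.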

+2

You can use arbitrary features and combinations of features with an SVM. Keep in mind that you must standardize your features, meaning they must all be on the same scale. Otherwise, some feature subspaces end up with unintentionally higher weight.

If this does not give acceptable results, you can look at convolution kernels, which provide a framework for combining kernels on different feature spaces into a single kernel. However, I would be surprised if that were necessary.
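A hand-rolled sketch of the kernel-combination idea (not MKL proper, and the weight `alpha` is a made-up hyperparameter): since a convex combination of positive semi-definite kernels is itself a valid kernel, you can precompute one Gram matrix per feature space, interpolate, and pass the result to SVC. The random matrices stand in for tf-idf rows and auxiliary features.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_text = rng.rand(20, 50)   # stand-in for tf-idf feature rows
X_meta = rng.rand(20, 2)    # stand-in for auxiliary features (e.g. length)
y = np.array([0, 1] * 10)

alpha = 0.7  # hypothetical interpolation weight between the two kernels
# Weighted sum of two PSD Gram matrices is again a valid kernel matrix
K = alpha * linear_kernel(X_text) + (1 - alpha) * rbf_kernel(X_meta)

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K).shape)
```

At prediction time you would need the rectangular kernel between test and training points, built the same way.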

+3
