How can I use non-integer string labels with SVM from scikit-learn? python

Question

Scikit-learn has pretty user-friendly Python modules for machine learning.

I am trying to train an SVM tagger for Natural Language Processing (NLP), where my labels and input data are words and annotations, e.g. for part-of-speech tagging. Rather than using double/integer data as input tuples [[1,2], [2,0]], my tuples will look like this: [['word','NOUN'], ['young', 'adjective']]

Can someone give an example of how I can use the SVM with string tuples? The tutorial/documentation given here is for integer/double inputs: http://scikit-learn.org/stable/modules/svm.html

+6

python scikit-learn nlp svm pos-tagging

alvas Oct 18 '12 at 2:53

2 answers

This is not so much a question about scikit-learn or Python as a more general issue with SVMs.

Data instances in an SVM must be represented as vectors of scalars, typically real numbers. Categorical attributes must therefore first be mapped to numeric values before they can be fed to the SVM.

Some categorical attributes lend themselves more naturally/logically to being mapped onto some scale (some loose "metric"). For example, a (1, 2, 3, 5) mapping may make sense for a Priority field with the values ("no rush", "standard delivery", "Urgent" and "Most urgent"). Another example would be colors, which can be mapped to 3 dimensions, one each for their Red, Green and Blue components.
Other attributes have no semantics that allow even an approximate logical mapping onto a scale; the different values of these attributes must then be assigned arbitrary numeric values on one (or possibly several) dimension(s) of the SVM. Understandably, if an SVM has many of these arbitrary "non-metric" dimensions, it may be less effective at classifying items correctly, since the distance computations and clustering logic implicit in how SVMs work become less semantically meaningful.
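
To make the three cases above concrete, here is a minimal sketch in plain Python. The names `PRIORITY_SCALE`, `rgb_features`, `one_hot` and `POS_TAGS` are made up for illustration, and using one indicator dimension per value is just one common way to spread an unordered attribute over "several dimensions"; it is an assumption on my part, not something prescribed above.

```python
# Ordered attribute: the values have a loose metric, so a single
# numeric dimension can carry them (the (1, 2, 3, 5) mapping above).
PRIORITY_SCALE = {
    "no rush": 1,
    "standard delivery": 2,
    "Urgent": 3,
    "Most urgent": 5,
}


def rgb_features(hex_color):
    """Map a color such as '#ff8000' to three dimensions (R, G, B)."""
    digits = hex_color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def one_hot(value, vocabulary):
    """Unordered attribute: one indicator dimension per possible value."""
    return [1.0 if value == known else 0.0 for known in vocabulary]


# Hypothetical tag set, only for the demonstration.
POS_TAGS = ["NOUN", "VERB", "ADJ"]

priority_value = PRIORITY_SCALE["Urgent"]
color_vector = rgb_features("#ff8000")
tag_vector = one_hot("VERB", POS_TAGS)
```

With indicator dimensions, any two distinct values end up equally far apart, which avoids implying an order that the attribute does not have.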

This observation does not mean that SVMs cannot be used at all when items include non-numeric or non-"metric" dimensions, but it is certainly a reminder that feature selection and feature mapping are very sensitive parameters of classifiers in general and of SVMs in particular.

In the specific case of POS tagging ... I'm afraid I'm stumped at the moment as to which attributes of the tagged corpus to use and how to map them to numeric values. I know that SVMTool can produce very effective POS taggers using SVMs, and several scholarly papers also describe SVM-based taggers. However, I am more familiar with other approaches to tagging (e.g. using HMMs or Maximum Entropy).

+4

mjv Oct 18 '12 at 3:03

ogrisel · Accepted Answer · 2012-10-18T08:46:52+0000

Most machine learning algorithms process input samples that are vectors of floats, such that a small (often Euclidean) distance between two samples means that the two samples are similar in a way that is relevant to the problem at hand.

It is the responsibility of the machine learning practitioner to find a good set of float features to encode the data. This encoding is domain-specific, so there is no general way to build this representation from raw data that will work across all application domains (various NLP tasks, computer vision, transaction log analysis ...). This part of the machine learning modeling work is called feature extraction. When it involves a lot of manual work, it is often called feature engineering.

Now for your specific problem, the POS tags of a window of words around a word of interest in the sentence (for example, for sequence tagging such as named entity detection) can be encoded appropriately using the DictVectorizer feature extraction helper class of scikit-learn.
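
As a rough sketch of what this could look like for the POS tagging task in the question: the toy sentences, the `window_features` helper and the choice of features (the word, its neighbours, a two-letter suffix) are all assumptions made up for illustration, not a recommended feature set. `DictVectorizer` turns each string-valued feature into indicator columns, and the string tags can be passed directly as `y`, since scikit-learn classifiers accept string class labels.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

# Toy tagged sentences (hypothetical data, for illustration only).
tagged_sentences = [
    [("the", "DET"), ("young", "ADJ"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("cat", "NOUN"), ("runs", "VERB")],
]


def window_features(words, i):
    """Describe the word at position i by itself and its direct neighbours."""
    return {
        "word": words[i],
        "prev_word": words[i - 1] if i > 0 else "<START>",
        "next_word": words[i + 1] if i < len(words) - 1 else "<END>",
        "suffix2": words[i][-2:],
    }


# Build one feature dict and one string label per token.
X_dicts, y = [], []
for sentence in tagged_sentences:
    words = [word for word, _ in sentence]
    for i, (_, tag) in enumerate(sentence):
        X_dicts.append(window_features(words, i))
        y.append(tag)

# String-valued features become one indicator column per (feature, value) pair.
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(X_dicts)  # sparse float matrix

clf = SVC(kernel="linear")
clf.fit(X, y)  # y is a plain list of strings

# Tag a new sentence; feature values never seen in training are ignored.
test_words = ["the", "young", "cat", "sleeps"]
X_test = vectorizer.transform(
    [window_features(test_words, i) for i in range(len(test_words))]
)
predicted = clf.predict(X_test)  # array of string tags
```

With so little training data the predictions mean nothing; the point is only the shape of the pipeline: dicts of strings in, float vectors through `DictVectorizer`, string tags out.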
