This is not so much a scikit or Python question as a general issue with SVMs.
Data instances in an SVM must be represented as vectors of scalars, usually real numbers. Categorical attributes must therefore first be mapped to numeric values before they can be included in the SVM.
Some categorical attributes lend themselves naturally to being mapped onto a scale (a loose "metric"). For example, a (1, 2, 3, 5) mapping might make sense for a Priority field with the values ("no rush", "standard delivery", "Urgent" and "Most urgent"). Another example is colors, which can be mapped onto 3 dimensions, one each for their Red, Green and Blue components.
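As a minimal sketch of the two mappings just described, the snippet below turns a priority label and a color name into one numeric vector. The names `PRIORITY_SCALE`, `COLOR_RGB` and `encode` are made up for illustration, and the color table and the choice to scale RGB to the range 0 to 1 are assumptions, not something the question prescribes.

```python
# Hypothetical ordinal scale for the Priority field; the numbers are the
# (1, 2, 3, 5) mapping mentioned above.
PRIORITY_SCALE = {
    "no rush": 1.0,
    "standard delivery": 2.0,
    "Urgent": 3.0,
    "Most urgent": 5.0,
}

# Hypothetical color table: each named color becomes three dimensions (R, G, B).
COLOR_RGB = {
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
}


def encode(priority, color):
    """Return a 4-dimensional vector [priority, r, g, b], with RGB scaled to 0..1."""
    r, g, b = COLOR_RGB[color]
    return [PRIORITY_SCALE[priority], r / 255.0, g / 255.0, b / 255.0]


print(encode("Urgent", "blue"))  # [3.0, 0.0, 0.0, 1.0]
```

Vectors built this way can be passed to an SVM as they are, since every dimension carries at least a rough notion of distance.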
Other attributes have no semantics that allow even an approximate logical mapping onto a scale; the various values of these attributes must then be assigned an arbitrary numeric value on one (or possibly several) dimension(s) of the SVM. Understandably, if an SVM has many of these arbitrary "non-metric" dimensions, it may be less effective at classifying items correctly, since the distance calculations and clustering logic implicit in how SVMs work become less semantically meaningful.
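To make the distance problem concrete, here is a small sketch comparing two ways of encoding an attribute with no natural order: an arbitrary integer code on one dimension, and one dimension per value (often called one-hot encoding, a name the text above does not use). The country values and the helper names are hypothetical.

```python
import math

# Hypothetical attribute with no natural ordering.
countries = ["France", "Japan", "Brazil", "Japan"]
categories = sorted(set(countries))  # ['Brazil', 'France', 'Japan']


def integer_code(value):
    """Arbitrary numeric value on a single dimension."""
    return [float(categories.index(value))]


def one_hot(value):
    """One dimension per distinct value: 1.0 for the match, 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]


# With integer codes, Brazil ends up "closer" to France than to Japan,
# purely as an accident of the numbering.
print(math.dist(integer_code("Brazil"), integer_code("France")))  # 1.0
print(math.dist(integer_code("Brazil"), integer_code("Japan")))   # 2.0

# With one dimension per value, every pair of distinct values is equally far apart.
d_bf = math.dist(one_hot("Brazil"), one_hot("France"))
d_bj = math.dist(one_hot("Brazil"), one_hot("Japan"))
print(d_bf == d_bj)  # True
```

Spreading the values over several dimensions avoids the spurious ordering at the cost of more dimensions. scikit-learn ships `OneHotEncoder` and `DictVectorizer` for this kind of encoding; the hand-rolled version here is only meant to show the idea.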
This observation does not mean that SVMs cannot be used at all when items include non-numeric or non-metric dimensions, but it is certainly a reminder that feature selection and feature mapping are very sensitive parameters of classifiers in general and of SVMs in particular.
In the specific case of POS tagging... I'm afraid I'm stumped at the moment on which attributes of the labelled corpus to use and on how to map them to numeric values. I know that SVMTool can produce very effective POS taggers using SVMs, and several scholarly papers also describe SVM-based taggers. However, I am more familiar with other approaches to tagging (e.g. with HMMs or Maximum Entropy).