Import categorical data from CSV into scikit-learn

I would like to import data from a CSV file for use in scikit-learn. It contains a mix of categorical and numerical data, for example:

    someValue,color,someOtherValue
    1.2,red,55.6
    1.9,blue,20.5
    3.2,red,16.5

I need to convert this representation to a purely numerical one, where each categorical column is expanded into multiple binary columns, for example:

    someValue,colorIsRed,colorIsBlue,someOtherValue
    1.2,1,0,55.6
    1.9,0,1,20.5
    3.2,1,0,16.5

Is there any utility that does this for me, or an easy way to iterate over the data and get this view?

2 answers

As far as I know, scikit-learn does not provide a general CSV loading function; it expects NumPy arrays as input. NumPy's loadtxt function, together with its converters parameter, can load your CSV and specify how each column is parsed. It will not binarize your second column, however.
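The binarization step that loadtxt leaves out can be done by iterating over the rows. The following is a minimal sketch using only the standard library csv module, not a scikit-learn or NumPy utility. The function name one_hot_rows and the "colorIsRed" naming scheme are hypothetical, chosen to match the output asked for in the question, and the sample data is inlined for illustration.

```python
import csv
import io

# Sample data copied from the question; in practice, open the CSV file instead.
CSV_TEXT = """someValue,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
"""


def one_hot_rows(lines, categorical):
    """Expand each categorical column into one 0/1 column per distinct value.

    Hypothetical helper: `categorical` is the set of column names to expand;
    every other column is parsed as a float.
    """
    rows = list(csv.DictReader(lines))
    if not rows:
        return [], []

    # Distinct values of each categorical column, in order of first appearance.
    levels = {col: [] for col in categorical}
    for row in rows:
        for col in categorical:
            if row[col] not in levels[col]:
                levels[col].append(row[col])

    # Build the new header, e.g. color -> colorIsRed, colorIsBlue.
    header = []
    for col in rows[0].keys():
        if col in categorical:
            header.extend(f"{col}Is{v.capitalize()}" for v in levels[col])
        else:
            header.append(col)

    # Build the purely numerical rows.
    data = []
    for row in rows:
        out = []
        for col, value in row.items():
            if col in categorical:
                out.extend(1.0 if value == v else 0.0 for v in levels[col])
            else:
                out.append(float(value))
        data.append(out)
    return header, data


header, data = one_hot_rows(io.StringIO(CSV_TEXT), categorical={"color"})
```

The resulting list of rows can be passed to numpy.array to get the array that scikit-learn expects.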


In this answer, I assume that you are trying to convert a CSV to a file that LibSVM, LIBLINEAR, or scikit-learn can load.

You can use csv2libsvm, which is provided as part of the vector_embed Ruby gem:

    $ gem install vector_embed
    Successfully installed vector_embed-0.1.0
    1 gem installed

You need Ruby 1.9 or later:

    $ ruby -v
    ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]

If you do not have Ruby 1.9, it is easy to install using rvm, which does not require root (and recommends against using it):

    $ curl -#L https://get.rvm.io | bash -s stable
    $ rvm install 1.9.3

After running gem install vector_embed, make sure the first column of your CSV is called "label":

    $ cat example.csv
    label,color,someOtherValue
    1.2,red,55.6
    1.9,blue,20.5
    3.2,red,16.5
    $ csv2libsvm example.csv > example.libsvm
    $ cat example.libsvm
    1.2 1139043:55.6 1997960:1
    1.9 1089740:1 1139043:20.5
    3.2 1139043:16.5 1997960:1

Note that it handles both categorical and continuous data and uses MurmurHash version 3 to turn feature names into feature indices ("colorIsBlue" corresponds to 1089740, "colorIsRed" to 1997960, although the Ruby code actually hashes something like "color\0red").
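To illustrate the idea of hashing feature names into indices, here is a rough Python sketch. It is not the vector_embed implementation: it uses zlib.crc32 as a stand-in for MurmurHash3, so the indices will not match the csv2libsvm output above. The bucket count N_BUCKETS, the helper names, and the exact "column\0value" key are assumptions made for illustration.

```python
import zlib

N_BUCKETS = 2 ** 21  # hypothetical index range, not taken from vector_embed


def feature_index(name):
    # Stand-in hash: CRC32 rather than MurmurHash3, so these indices
    # differ from the ones csv2libsvm produces.
    return zlib.crc32(name.encode("utf-8")) % N_BUCKETS + 1


def to_libsvm_line(label, row):
    """Format one example as 'label index:value ...'.

    `row` maps column name -> value. String values are treated as
    categorical and become a binary feature keyed on column and value;
    numeric values keep their value under the column's own index.
    """
    features = {}
    for col, value in row.items():
        if isinstance(value, str):
            features[feature_index(f"{col}\0{value}")] = 1
        else:
            features[feature_index(col)] = value
    # The LibSVM format expects indices in ascending order.
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"


line = to_libsvm_line(1.2, {"color": "red", "someOtherValue": 55.6})
```

Because the index depends only on the feature name, no dictionary of categories has to be built or stored; the trade-off is that two different names can occasionally hash to the same index.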

If you are using an SVM, be sure to scale your data as recommended in the SVM Classification Guide.
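As a minimal sketch of what scaling means here, the following rescales each column linearly to the range [0, 1] using plain Python. The helper name scale_columns is hypothetical, and the sample rows are the two continuous columns from the question; scikit-learn's own preprocessing tools would normally be used instead.

```python
def scale_columns(rows):
    """Linearly rescale each column of a list of numeric rows to [0, 1].

    Returns the scaled rows together with the per-column minima and maxima,
    so that the same transformation can be reused on test data.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled, lows, highs


train = [[1.2, 55.6], [1.9, 20.5], [3.2, 16.5]]
scaled, lows, highs = scale_columns(train)
```

The minima and maxima computed on the training set should be applied unchanged to any test data, rather than recomputed from it.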

Finally, load the result with scikit-learn's svmlight/libsvm loader:

    >>> from sklearn.datasets import load_svmlight_file
    >>> X_train, y_train = load_svmlight_file("/path/to/example.libsvm")