In this answer, I assume that you are trying to convert a CSV to a file that LibSVM, LIBLINEAR, or scikit-learn can load.
You can use csv2libsvm, which ships with the Ruby gem vector_embed:
$ gem install vector_embed
Successfully installed vector_embed-0.1.0
1 gem installed
You need Ruby 1.9 or later:
$ ruby -v
ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]
If you do not have Ruby 1.9, it is easy to install with rvm, which does not require root (and recommends against using it):
$ curl -
After running gem install vector_embed, make sure the first column of your CSV is called "label":
$ cat example.csv
label,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
$ csv2libsvm example.csv > example.libsvm
$ cat example.libsvm
1.2 1139043:55.6 1997960:1
1.9 1089740:1 1139043:20.5
3.2 1139043:16.5 1997960:1
Note that it handles both categorical and continuous data, and that it uses MurmurHash version 3 to generate feature names ("colorIsBlue" corresponds to 1089740, "colorIsRed" to 1997960, although the Ruby code actually hashes something like "color\0red").
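To make the hashing idea concrete, here is a rough Python sketch of the same scheme. It is not the vector_embed implementation: CRC32 stands in for MurmurHash3, and the bucket count and the names `hashed_index` and `encode_row` are assumptions of mine, so the indices it prints will not match the ones above.

```python
import zlib

def hashed_index(name, n_buckets=2**21):
    """Map a feature name to a positive integer index.

    CRC32 is a stand-in here; vector_embed uses MurmurHash3, and
    n_buckets is an assumed value, not taken from the gem.
    """
    return zlib.crc32(name.encode("utf-8")) % n_buckets + 1

def encode_row(row, label_key="label"):
    """Turn one CSV row (a dict of strings) into a LibSVM-format line."""
    features = {}
    for column, raw in row.items():
        if column == label_key:
            continue
        try:
            # Continuous column: hash the column name, keep the number.
            features[hashed_index(column)] = float(raw)
        except ValueError:
            # Categorical column: hash "column\0value", emit an indicator 1.
            features[hashed_index(column + "\0" + raw)] = 1
    # LibSVM expects indices in ascending order within a line.
    pairs = " ".join(f"{i}:{v:g}" for i, v in sorted(features.items()))
    return f"{row[label_key]} {pairs}"

line = encode_row({"label": "1.2", "color": "red", "someOtherValue": "55.6"})
print(line)
```

Because every (column, value) pair hashes to its own index, no dictionary of categories has to be built up front. The price is that two features can occasionally collide on the same index.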
If you are using an SVM, be sure to scale your data, as recommended in A Practical Guide to Support Vector Classification.
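As an illustration of what scaling means for sparse rows like these, here is a minimal sketch that divides each feature by its largest absolute value in the training set, which maps every value into [-1, 1] and leaves missing (zero) entries untouched. The function names and the dict-per-row representation are my own choices for the example; in practice you would more likely use the scaling tool that comes with your library.

```python
def fit_max_abs(rows):
    """Record the largest absolute value seen for each feature index."""
    max_abs = {}
    for row in rows:
        for index, value in row.items():
            max_abs[index] = max(max_abs.get(index, 0.0), abs(value))
    return max_abs

def scale_row(row, max_abs):
    """Divide each value by its feature's training-set maximum.

    Features never seen during fitting (or that were always zero)
    are passed through unchanged.
    """
    return {
        index: value / max_abs[index] if max_abs.get(index) else value
        for index, value in row.items()
    }

# The three example rows, written as {feature_index: value} dicts.
train = [
    {1139043: 55.6, 1997960: 1.0},
    {1089740: 1.0, 1139043: 20.5},
    {1139043: 16.5, 1997960: 1.0},
]
factors = fit_max_abs(train)
scaled = [scale_row(row, factors) for row in train]
print(scaled[1])
```

Whatever method you pick, compute the scaling factors on the training data only and reuse them for the test data.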
Finally, say you are using scikit-learn's svmlight/libsvm loader:
>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/example.libsvm")
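If you want to sanity-check the generated file before handing it to a library, the format is simple enough to read by hand: each line is a label followed by index:value pairs. The sketch below is a hypothetical helper of mine, not part of scikit-learn, and it ignores the optional comments and qid fields that the svmlight format allows.

```python
def parse_libsvm_line(line):
    """Split 'label index:value ...' into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features

# Contents of example.libsvm from above, inlined so the sketch is self-contained.
sample = """1.2 1139043:55.6 1997960:1
1.9 1089740:1 1139043:20.5
3.2 1139043:16.5 1997960:1"""

rows = [parse_libsvm_line(line) for line in sample.splitlines()]
for label, features in rows:
    print(label, features)
```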