How to build a Spark decision tree from a categorical feature set using Scala?

I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int, Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data. However, LabeledPoint requires (double, vector), where the vector requires doubles.

 val LP = featureSet.map(x => LabeledPoint(classMap(x(0)), Vectors.dense(x.tail)))

 // Run training algorithm to build the model
 val maxDepth: Int = 3
 val isMulticlassWithCategoricalFeatures: Boolean = true
 val numClassesForClassification: Int = countPossibilities(labelCol)
 val model = DecisionTree.train(LP, Classification, Gini, isMulticlassWithCategoricalFeatures,
   maxDepth, numClassesForClassification, categoricalFeaturesInfo)

The error I get is:

 scala> val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
 <console>:32: error: overloaded method value dense with alternatives:
   (values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
   (firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
  cannot be applied to (Array[String])
        val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))

My resources so far: config tree, decision tree, labeledpoint

+7
scala tree apache-spark apache-spark-mllib categorical-data
3 answers

You can convert the categories to numbers first, and then load the data as if all features were numerical.

When you build a decision tree model in Spark, you just need to tell Spark which features are categorical, as well as each feature's arity (the number of distinct categories of that feature), by giving it a map Map[Int, Int]() from the feature indices to their arities.

For example, if you have data like:

 1,a,add
 2,b,more
 1,c,thinking
 3,a,to
 1,c,me

First you can convert the data to a numeric format:

 1,0,0
 2,1,1
 1,2,2
 3,0,3
 1,2,4
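
For instance, here is a minimal sketch of that conversion, assuming an existing SparkContext sc; the file name data.csv and the names rawData and indexColumn are illustrative, not from the question:

 import org.apache.spark.rdd.RDD

 // Assumed input: one record per line, e.g. "1,a,add" -> Array("1", "a", "add")
 val rawData: RDD[Array[String]] = sc.textFile("data.csv").map(_.split(","))

 // Build a value -> numeric index map for one column (helper name is illustrative).
 // Note: distinct() gives no ordering guarantee, so the indices may differ
 // from the hand-converted example above.
 def indexColumn(data: RDD[Array[String]], col: Int): Map[String, Double] =
   data.map(_(col)).distinct().collect().zipWithIndex
     .map { case (v, i) => (v, i.toDouble) }.toMap

 val col1Map = indexColumn(rawData, 1) // e.g. Map(a -> 0.0, b -> 1.0, c -> 2.0)
 val col2Map = indexColumn(rawData, 2)

 val numeric = rawData.map { row =>
   Array(row(0).toDouble, col1Map(row(1)), col2Map(row(2)))
 }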

In this format you can load the data into Spark. Then, if you want to tell Spark that the second and third columns are categorical, you should create a map:

 val categoricalFeaturesInfo = Map[Int, Int]((1, 3), (2, 5))

The map tells Spark that the feature with index 1 has arity 3 and the feature with index 2 has arity 5. They will be treated as categorical when we build the decision tree model and pass this map as a parameter of the training function:

 val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins) 
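
Putting it together, here is a sketch of the full training call, reusing the numeric RDD from the sketch above; the parameter values such as maxDepth = 5 and maxBins = 32 are illustrative defaults, not prescriptions:

 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.DecisionTree

 // First column is the label; the features at indices 1 and 2 are categorical.
 val trainingData = numeric.map(row => LabeledPoint(row(0), Vectors.dense(row.tail)))

 val numClasses = 4                          // labels here are 1, 2, 3, so max label + 1
 val categoricalFeaturesInfo = Map[Int, Int]((1, 3), (2, 5))
 val impurity = "gini"
 val maxDepth = 5
 val maxBins = 32                            // must be at least the largest arity (5 here)

 val model = DecisionTree.trainClassifier(trainingData, numClasses,
   categoricalFeaturesInfo, impurity, maxDepth, maxBins)
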
+16

Strings are not supported by LabeledPoint. One way to put them into a LabeledPoint is to split your data into multiple columns, given that your strings are categorical.

So, for example, if you have the following data set:

 id,String,Intvalue
 1,"a",123
 2,"b",456
 3,"c",789
 4,"a",887

Then you can split your string data, turning each distinct string value into a new column:

 a -> 1,0,0
 b -> 0,1,0
 c -> 0,0,1

Since you have 3 distinct string values, you convert the string column into 3 new columns, and each value is represented by an indicator in these new columns.

Now your dataset will be

 id,a,b,c,Intvalue
 1,1,0,0,123
 2,0,1,0,456
 3,0,0,1,789
 4,1,0,0,887

Now you can convert everything to double values and use it in your LabeledPoint.
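
A minimal sketch of that one-hot conversion, assuming the records above were parsed into an RDD[Array[String]] called rows (an illustrative name):

 // rows: RDD[Array[String]] holding the parsed id,String,Intvalue records (assumed).
 val categories = rows.map(_(1)).distinct().collect().sorted   // Array("a", "b", "c")
 val catIndex = categories.zipWithIndex.toMap

 // Replace column 1 with indicator columns ("a" -> 1,0,0 and so on),
 // keeping id in front and the numeric Intvalue at the end.
 val encoded = rows.map { r =>
   val oneHot = Array.fill(categories.length)(0.0)
   oneHot(catIndex(r(1))) = 1.0
   (Array(r(0).toDouble) ++ oneHot) :+ r(2).toDouble
 }

Each element of encoded is an Array[Double], which can go into Vectors.dense and then a LabeledPoint once you decide which column is the label.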

Another way to convert strings for a LabeledPoint is to create a separate list of distinct values for each column and convert each string value to its index in that list. This is not recommended, because if you did, this hypothetical dataset would become

 a = 0
 b = 1
 c = 2

But in this case the algorithm would consider a to be closer to b than to c, an ordering that the string values do not actually have.
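
For comparison, a sketch of that index encoding, reusing the rows RDD from the sketch above:

 // Index encoding: each string becomes its position in the sorted list of values.
 val values = rows.map(_(1)).distinct().collect().sorted       // Array("a", "b", "c")
 val strIndex = values.zipWithIndex.map { case (v, i) => (v, i.toDouble) }.toMap
 val indexed = rows.map(r => Array(r(0).toDouble, strIndex(r(1)), r(2).toDouble))

Note that the first answer avoids the ordering problem this introduces: if such an indexed column is declared in categoricalFeaturesInfo, the decision tree treats the indices as unordered categories rather than as numbers.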

+2

You need to check the type of your x array. The error log says the elements of the x array are Strings, which are not supported by Spark here. Currently, Spark vectors can only be filled with Doubles.
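
So a likely minimal fix for the snippet in the question is to parse each string into a Double before building the vector; this assumes every feature value is numeric text, otherwise convert the categories first as described in the other answers:

 val LP = featureSet.map(x =>
   LabeledPoint(classMap(x(0)), Vectors.dense(x.tail.map(_.toDouble))))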

0
