When I run the Spark RandomForest algorithm, I seem to get different splits in the trees on different runs, even when using the same seed. Can someone explain whether I'm doing something wrong (probably) or whether the implementation is wrong (which I consider unlikely)? Here is an outline of my run:
import org.apache.spark.mllib.tree.RandomForest

// read data into rdd
// convert string rdd to LabeledPoint rdd
// train_LP_RDD is RDD of LabeledPoint

// call random forest
val seed = 123417
val numTrees = 10
val numClasses = 2
val categoricalFeaturesInfo: Map[Int, Int] = Map()
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 8
val maxBins = 10

val rfmodel = RandomForest.trainClassifier(train_LP_RDD, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed)
println(rfmodel.toDebugString)
The output of this fragment differs between two runs. For example, a diff of the two results shows the following:
sdiff -bBWs run1.debug run2.debug
If (feature 2 <= 15.96)     | If (feature 2 <= 16.0)
Else (feature 2 > 15.96)    | Else (feature 2 > 16.0)
If (feature 2 <= 15.96)     | If (feature 2 <= 16.0)
Else (feature 2 > 15.96)    | Else (feature 2 > 16.0)
If (feature 2 <= 33.68)     | If (feature 2 <= 34.66)
Else (feature 2 > 33.68)    | Else (feature 2 > 34.66)
If (feature 1 <= 17.0)      | If (feature 1 <= 16.0)
Else (feature 1 > 17.0)     | Else (feature 1 > 16.0)
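To rule out anything that changes between separate launches, the same comparison could be done inside a single job: train twice with the same seed and compare the two `toDebugString` outputs line by line. The following is only a rough sketch of the comparison step in plain Scala. `DebugStringDiff` and `diffLines` are hypothetical names introduced here, not part of Spark, and the sample strings in `main` are shortened stand-ins rather than real model output.

```scala
object DebugStringDiff {
  // Hypothetical helper: returns (1-based line number, left line, right line)
  // for every line position at which the two texts differ after trimming.
  def diffLines(left: String, right: String): Seq[(Int, String, String)] = {
    val l = left.split("\n").map(_.trim).toSeq
    val r = right.split("\n").map(_.trim).toSeq
    // zipAll pads the shorter text with empty lines so extra lines also show up.
    l.zipAll(r, "", "").zipWithIndex.collect {
      case ((a, b), i) if a != b => (i + 1, a, b)
    }
  }

  def main(args: Array[String]): Unit = {
    // Stand-in inputs; in the real job these would be the toDebugString
    // values of two models trained with the same seed.
    val run1 = "If (feature 2 <= 15.96)\nElse (feature 2 > 15.96)"
    val run2 = "If (feature 2 <= 16.0)\nElse (feature 2 > 16.0)"
    diffLines(run1, run2).foreach { case (n, a, b) => println(s"$n: $a | $b") }
  }
}
```

If the two models trained in the same job already differ, the difference is unlikely to come from how the data is loaded between launches.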