I tried to run an example in Spark and Scala with an adult dataset .
Using Scala 2.11.8 and Spark 1.6.1.
The problem (for now) is the number of categorical functions in this dataset that must be encoded into numbers before the Spark ML algorithm can do its job.
So far I have this:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object Adult {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Adult example").setMaster("local[*]")
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)
val data = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("src/main/resources/adult.data")
val categoricals = data.dtypes filter (_._2 == "StringType")
val encoders = categoricals map (cat => new OneHotEncoder().setInputCol(cat._1).setOutputCol(cat._1 + "_encoded"))
val features = data.dtypes filterNot (_._1 == "label") map (tuple => if(tuple._2 == "StringType") tuple._1 + "_encoded" else tuple._1)
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(encoders ++ Array(lr))
val model = pipeline.fit(training)
}
}
However, this does not work. The call pipeline.fitstill contains the original string functions and, therefore, throws an exception. How can I remove these "StringType"in the pipeline? Or maybe I'm doing it completely wrong, so if anyone has a different suggestion, I am happy with all the comments :).
, , , Python Pandas, Scala Spark.