How to vectorize DataFrame columns for ML algorithms?

I have a DataFrame with some categorical string columns (for example, uuid | url | browser).

I would like to convert them to doubles in order to run an ML algorithm that accepts a matrix of doubles.

As a conversion method, I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined this function:

    import org.apache.spark.ml.feature.StringIndexer
    import org.apache.spark.sql.DataFrame

    def str(arg: String, df: DataFrame): DataFrame = {
      val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
      indexer.fit(df).transform(df)
    }
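For example, on a made-up sample it works as expected (the rows below are purely illustrative; sqlContext is the usual Spark 1.4 SQLContext):

    // Illustrative data only; my real rows come from logs
    val sample = sqlContext.createDataFrame(Seq(
      ("a1", "http://example.com", "Chrome"),
      ("a2", "http://example.org", "Firefox"),
      ("a3", "http://example.com", "Chrome")
    )).toDF("uuid", "url", "browser")

    // Adds a Double column "browser_index" (e.g. Chrome -> 0.0, Firefox -> 1.0)
    val indexed = str("browser", sample)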

Now the problem is how to iterate over each column of df, call this function, and add the indexed double column alongside the original string column, so the result would look like this:

Starting df:

 [String: uuid | String: url | String: browser] 

Final df:

 [String: uuid | Double: uuid_index | String: url | Double: url_index | String: browser | Double: browser_index] 

Thank you in advance

1 answer

You can simply foldLeft over the array of columns:

 val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df)) 
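For the three columns from the question, this fold just nests the calls, so it is equivalent to writing:

    // Hand-written expansion of the fold for [uuid, url, browser]
    val transformed = str("browser", str("url", str("uuid", df)))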

However, I would argue that this is not a very good approach. Since str discards the StringIndexerModel, it cannot be reused when you get new data. Because of this, I would recommend using a Pipeline:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.StringIndexer

    val transformers: Array[org.apache.spark.ml.PipelineStage] = df.columns.map(
      cname => new StringIndexer()
        .setInputCol(cname)
        .setOutputCol(s"${cname}_index")
    )

    // Add the rest of your pipeline, like VectorAssembler and the algorithm
    val stages: Array[org.apache.spark.ml.PipelineStage] = transformers ++ ???

    val pipeline = new Pipeline().setStages(stages)
    val model = pipeline.fit(df)
    model.transform(df)
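The payoff is that the fitted PipelineModel retains every StringIndexerModel, so the same string-to-index mappings can be applied to data that arrives later. A minimal sketch, where newDF stands for a hypothetical new batch with the same schema:

    // Reuse the mappings learned from df on unseen rows
    val indexedNew = model.transform(newDF)

(Be aware that a StringIndexerModel throws on labels it did not see during fitting; later Spark versions add a handleInvalid parameter to control this.)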

A VectorAssembler can be included as follows:

    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(df.columns.map(cname => s"${cname}_index"))
      .setOutputCol("features")

    val stages = transformers :+ assembler
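From there the stage array can be finished with an estimator. A sketch, assuming logistic regression is the target algorithm and that df carries a numeric label column which is kept out of the indexed feature columns above:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression

    // Hypothetical final stage; assumes a Double "label" column in df
    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")

    val fullPipeline = new Pipeline().setStages(stages :+ lr)
    val fittedModel = fullPipeline.fit(df)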

You can also use RFormula, which is less customizable but much more concise:

    import org.apache.spark.ml.feature.RFormula

    val rf = new RFormula().setFormula(" ~ uuid + url + browser - 1")
    val rfModel = rf.fit(dataset)
    rfModel.transform(dataset)
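Under the hood, RFormula indexes and encodes the string columns itself and emits a single features vector; if you put a response on the left-hand side of the formula, it also produces a label column. A sketch, assuming your data has a column named label (note that RFormula only arrived in Spark 1.5, so it is not available on 1.4):

    // Hypothetical formula with a response column named "label"
    val rfWithLabel = new RFormula().setFormula("label ~ uuid + url + browser")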
