How to vectorize DataFrame columns for ML algorithms?

I have a DataFrame with some categorical string columns (for example, uuid | url | browser).

I would like to convert them to doubles in order to run an ML algorithm that accepts a matrix of doubles.

As a conversion method, I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined this function:

    import org.apache.spark.ml.feature.StringIndexer
    import org.apache.spark.sql.DataFrame

    def str(arg: String, df: DataFrame): DataFrame = {
      val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
      indexer.fit(df).transform(df)
    }
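For example, on a made-up sample it works as expected (the rows below are purely illustrative; sqlContext is the usual Spark 1.4 SQLContext):

    // Illustrative data only; my real rows come from logs
    val sample = sqlContext.createDataFrame(Seq(
      ("a1", "http://example.com", "Chrome"),
      ("a2", "http://example.org", "Firefox"),
      ("a3", "http://example.com", "Chrome")
    )).toDF("uuid", "url", "browser")

    // Adds a Double column "browser_index" (e.g. Chrome -> 0.0, Firefox -> 1.0)
    val indexed = str("browser", sample)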

Now the problem is how to iterate over each column of df, call this function, and add the indexed double column alongside the original string column, so the result would look like this:

Starting df:

 [String: uuid | String: url | String: browser] 

Final df:

 [String: uuid | Double: uuid_index | String: url | Double: url_index | String: browser | Double: browser_index] 

Thank you in advance

1 answer

You can simply foldLeft over the array of columns:

 val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df)) 
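For the three columns from the question, this fold just nests the calls, so it is equivalent to writing:

    // Hand-written expansion of the fold for [uuid, url, browser]
    val transformed = str("browser", str("url", str("uuid", df)))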

However, I would argue that this is not a very good approach. Since str discards the StringIndexerModel, it cannot be reused when you get new data. Because of this, I would recommend using a Pipeline:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.StringIndexer

    val transformers: Array[org.apache.spark.ml.PipelineStage] = df.columns.map(
      cname => new StringIndexer()
        .setInputCol(cname)
        .setOutputCol(s"${cname}_index")
    )

    // Add the rest of your pipeline, like VectorAssembler and the algorithm
    val stages: Array[org.apache.spark.ml.PipelineStage] = transformers ++ ???

    val pipeline = new Pipeline().setStages(stages)
    val model = pipeline.fit(df)
    model.transform(df)
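The payoff is that the fitted PipelineModel retains every StringIndexerModel, so the same string-to-index mappings can be applied to data that arrives later. A minimal sketch, where newDF stands for a hypothetical new batch with the same schema:

    // Reuse the mappings learned from df on unseen rows
    val indexedNew = model.transform(newDF)

(Be aware that a StringIndexerModel throws on labels it did not see during fitting; later Spark versions add a handleInvalid parameter to control this.)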

A VectorAssembler can be included as follows:

    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(df.columns.map(cname => s"${cname}_index"))
      .setOutputCol("features")

    val stages = transformers :+ assembler
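From there the stage array can be finished with an estimator. A sketch, assuming logistic regression is the target algorithm and that df carries a numeric label column which is kept out of the indexed feature columns above:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression

    // Hypothetical final stage; assumes a Double "label" column in df
    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")

    val fullPipeline = new Pipeline().setStages(stages :+ lr)
    val fittedModel = fullPipeline.fit(df)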

You can also use RFormula, which is less customizable but much more concise:

    import org.apache.spark.ml.feature.RFormula

    val rf = new RFormula().setFormula(" ~ uuid + url + browser - 1")
    val rfModel = rf.fit(dataset)
    rfModel.transform(dataset)
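Under the hood, RFormula indexes and encodes the string columns itself and emits a single features vector; if you put a response on the left-hand side of the formula, it also produces a label column. A sketch, assuming your data has a column named label (note that RFormula only arrived in Spark 1.5, so it is not available on 1.4):

    // Hypothetical formula with a response column named "label"
    val rfWithLabel = new RFormula().setFormula("label ~ uuid + url + browser")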
