Spark >= 3.0:
In Spark 3.0, OneHotEncoderEstimator has been renamed to OneHotEncoder:
    import org.apache.spark.ml.feature.{OneHotEncoder, OneHotEncoderModel}

    val encoder = new OneHotEncoder()
      .setInputCols(indexColumns)
      .setOutputCols(indexColumns map (name => s"${name}_vec"))
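For context, here is a minimal end-to-end sketch of how the encoder above might be used on Spark 3.0 or later. The sample data, the column names (c1, c2), the "_idx" suffix, and the local SparkSession are assumptions made for illustration; they are not part of the original question.

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

// Local session for experimentation only (assumption).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ohe-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical input with two categorical string columns.
val df = Seq(("a", "x"), ("b", "y"), ("a", "z")).toDF("c1", "c2")
val columns = Array("c1", "c2")
val indexColumns = columns.map(name => s"${name}_idx")

// Index the string columns first; the encoder works on numeric indices.
// Multi-column StringIndexer (setInputCols) requires Spark 3.0+.
val indexer = new StringIndexer()
  .setInputCols(columns)
  .setOutputCols(indexColumns)
val df_indexed = indexer.fit(df).transform(df)

// The encoder is an Estimator, so it has to be fitted before transforming.
val encoder = new OneHotEncoder()
  .setInputCols(indexColumns)
  .setOutputCols(indexColumns.map(name => s"${name}_vec"))
val encoded = encoder.fit(df_indexed).transform(df_indexed)
```

The result keeps the original and indexed columns and adds one sparse vector column per indexed column (here c1_idx_vec and c2_idx_vec).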
Spark >= 2.3:
Spark 2.3 introduced the new classes OneHotEncoderEstimator and OneHotEncoderModel, which require fitting even when used outside of a Pipeline, and which operate on multiple columns at the same time.
    import org.apache.spark.ml.feature.{OneHotEncoderEstimator, OneHotEncoderModel}

    val encoder = new OneHotEncoderEstimator()
      .setInputCols(indexColumns)
      .setOutputCols(indexColumns map (name => s"${name}_vec"))

    encoder.fit(df_indexed).transform(df_indexed)
Spark < 2.3:
Even if the transformers you use don't require fitting, you have to call the fit method to create a PipelineModel, which can then be used to transform the data.
    one_hot_pipeline.fit(df_indexed).transform(df_indexed)
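The one_hot_pipeline used above is not defined in this answer. A minimal sketch of how it might be assembled, assuming the single-column OneHotEncoder API (setInputCol / setOutputCol) and hypothetical indexed column names:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.OneHotEncoder

// Hypothetical names of columns that have already been indexed.
val indexColumns = Array("c1_idx", "c2_idx")

// One encoder per indexed column. The explicit type is needed because
// Pipeline.setStages expects an array of PipelineStage.
val one_hot_encoders: Array[PipelineStage] = indexColumns.map { name =>
  new OneHotEncoder()
    .setInputCol(name)
    .setOutputCol(s"${name}_vec")
}

// Wrap the encoders in a Pipeline so they can be fitted and applied together.
val one_hot_pipeline = new Pipeline().setStages(one_hot_encoders)
```

Calling fit on this Pipeline returns a PipelineModel, which is the object that actually transforms df_indexed.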
In addition, you can combine indexing and encoding into a single Pipeline:
    val pipeline = new Pipeline()
      .setStages(index_transformers ++ one_hot_encoders)

    val model = pipeline.fit(df)
    model.transform(df)
Edit:
The error you see means that one of your columns contains an empty String. The indexer accepts it, but it cannot be used for encoding. Depending on your requirements, you can drop such rows or replace the empty value with a placeholder. Unfortunately, you cannot use NULLs until SPARK-11569 is resolved.
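Both ways of dealing with empty strings can be sketched as follows. The sample data, the column names, and the placeholder value "__EMPTY__" are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Local session for experimentation only (assumption).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("empty-strings-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data with empty strings in both columns.
val df = Seq(("a", "x"), ("", "y"), ("b", "")).toDF("c1", "c2")
val stringColumns = Seq("c1", "c2")

// Option 1: drop every row that has an empty string in any of the columns.
val dropped = stringColumns.foldLeft(df) { (acc, name) =>
  acc.filter(col(name) =!= "")
}

// Option 2: replace empty strings with a placeholder category.
val placeholder = "__EMPTY__" // hypothetical value; pick one absent from the data
val replaced = stringColumns.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, when(col(name) === "", placeholder).otherwise(col(name)))
}
```

Either result can then be passed to the indexer in place of df. Note that the filter in the first option also removes rows where the column is NULL, because the comparison evaluates to NULL for them.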