Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both the mean and the median strategy).
Scala:

```scala
import org.apache.spark.ml.feature.Imputer

val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("mean")

imputer.fit(df).transform(df)
```
Python:

```python
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
```
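As a quick illustration, here is a minimal, self-contained sketch of the Imputer in use. The DataFrame, its column names (a, b), and the choice of the median strategy are assumptions made up for this example; it also assumes an existing SparkSession named spark, and the input columns have to be numeric (double/float):

```scala
import org.apache.spark.ml.feature.Imputer

import spark.implicits._  // assumes an existing SparkSession named `spark`

// Hypothetical numeric DataFrame with nulls in both columns
val df = Seq(
  (Some(1.0), Some(4.0)),
  (None,      Some(6.0)),
  (Some(3.0), None)
).toDF("a", "b")

val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("median")  // "mean" works the same way

// fit computes one surrogate value per input column;
// transform appends the *_imputed columns with missing entries replaced
imputer.fit(df).transform(df).show()
```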
Spark < 2.2
Here you are:
```scala
import org.apache.spark.sql.functions.mean

df.na.fill(df.columns.zip(
  df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
```
Where
```scala
df.columns.map(mean(_)): Array[Column]
```
calculates the average value for each column,
```scala
df.select(_: _*).first.toSeq: Seq[Any]
```
collects the aggregated values and converts the resulting Row to Seq[Any] (I know this is suboptimal, but this is the API we have to work with),
```scala
df.columns.zip(_).toMap: Map[String, Any]
```
creates a Map[String, Any] that maps each column name to its average value, and finally
```scala
df.na.fill(_): DataFrame
```
fills in the missing values using:
```scala
fill: Map[String, Any] => DataFrame
```
from DataFrameNaFunctions.
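To make the chain concrete, here is a hedged end-to-end sketch on a tiny made-up DataFrame; the column names x and y, the values, and the SparkSession named spark are all assumptions for illustration:

```scala
import org.apache.spark.sql.functions.mean

import spark.implicits._  // assumes an existing SparkSession named `spark`

// Hypothetical input with one null per column
val df = Seq(
  (Some(1.0), Some(10.0)),
  (None,      Some(30.0)),
  (Some(3.0), None)
).toDF("x", "y")

// Step 1: one averaged Row, converted to Seq[Any]
val means: Seq[Any] = df.select(df.columns.map(mean(_)): _*).first.toSeq
// here: Seq(2.0, 20.0)

// Step 2: column name -> average value
val fillMap: Map[String, Any] = df.columns.zip(means).toMap
// here: Map("x" -> 2.0, "y" -> 20.0)

// Step 3: fill the nulls with the per-column averages
df.na.fill(fillMap).show()
```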
To ignore NaN entries you can replace:
```scala
df.select(df.columns.map(mean(_)): _*).first.toSeq
```
with:
```scala
import org.apache.spark.sql.functions.{col, isnan, when}

df.select(df.columns.map(
  c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
```
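As a short, hedged sketch of why the masking matters: Spark's mean ignores nulls but propagates NaN, so without the when/isnan guard a column containing NaN would yield a NaN average (and the fill value itself would then be NaN). The single-column DataFrame below is hypothetical:

```scala
import org.apache.spark.sql.functions.{col, isnan, mean, when}

import spark.implicits._  // assumes an existing SparkSession named `spark`

// Hypothetical column containing a NaN entry
val df = Seq(1.0, Double.NaN, 3.0).toDF("x")

// Plain mean propagates NaN: the result is NaN
df.select(df.columns.map(mean(_)): _*).show()

// Masking NaN entries first gives the mean of the remaining values (2.0)
val means = df.select(df.columns.map(
  c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq

// na.fill replaces both nulls and NaNs with the computed averages
df.na.fill(df.columns.zip(means).toMap).show()
```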