Spark Scala - How do I repeat rows in a dataframe and add computed values as new columns of the dataframe

I have a dataframe with two columns "date" and "value". How do I add 2 new columns "value_mean" and "value_sd" to the dataframe, where "value_mean" is the average of "value" over the last 10 days (including the current day, specified in "date") and "value_sd" is the standard deviation of "value" over the same 10 days?

scala apache-spark apache-spark-sql spark-dataframe
1 answer

Spark SQL provides various DataFrame aggregate functions such as avg, mean, sum, etc.

You just need to apply them to a DataFrame column as a SQL Column expression.

 import org.apache.spark.sql.types._
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.Column

Create a private method for the standard deviation. It implements the population formula sqrt(E[X²] − E[X]²) using avg:

 // population standard deviation: sqrt(E[X^2] - E[X]^2)
 private def stddev(col: Column): Column = sqrt(avg(col * col) - avg(col) * avg(col))

Now you can create SQL Column expressions for the mean and the standard deviation:

 val value_sd: Column = stddev(df.col("value")).as("value_sd")
 val value_mean: Column = avg(df.col("value")).as("value_mean")

Filter your dataframe to the last 10 days, or whatever range you want.

 val filterDF = df.filter("") // put your filter condition here
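For example, assuming the "date" column is of DateType and the current day is the reference point (both assumptions; adjust the column name and reference date to your data), the condition could look like:

 // keep rows from the last 10 days, current day included (assumed interpretation)
 val filterDF = df.filter(col("date") >= date_sub(current_date(), 9) && col("date") <= current_date())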

You can now apply the aggregate functions on your filterDF:

 filterDF.agg(value_sd, value_mean).show
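Putting it together, e.g. in spark-shell where `spark` is already available, a minimal end-to-end sketch might look like this (the sample data and the interpretation of "last 10 days" relative to the latest date in the data are assumptions for illustration):

 import org.apache.spark.sql.Column
 import org.apache.spark.sql.functions._
 import spark.implicits._

 // hypothetical sample data; replace with your own dataframe
 val df = Seq(
   ("2017-01-01", 10.0),
   ("2017-01-05", 20.0),
   ("2017-01-09", 30.0)
 ).toDF("date", "value")

 // population standard deviation: sqrt(E[X^2] - E[X]^2)
 def stddev(c: Column): Column = sqrt(avg(c * c) - avg(c) * avg(c))

 val value_sd: Column = stddev(df.col("value")).as("value_sd")
 val value_mean: Column = avg(df.col("value")).as("value_mean")

 // assumed filter: keep rows whose date falls within 10 days of the latest date in the data
 val maxDate = df.agg(max(to_date($"date"))).head().getDate(0)
 val filterDF = df.filter(to_date($"date") >= date_sub(lit(maxDate), 9))

 filterDF.agg(value_mean, value_sd).show()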
