Spark Scala - How do I repeat rows in a dataframe and add computed values as new columns of the dataframe

I have a dataframe with two columns "date" and "value". How do I add 2 new columns "value_mean" and "value_sd" to the dataframe, where "value_mean" is the average of "value" over the last 10 days (including the current day, specified in "date") and "value_sd" is the standard deviation of "value" over the same 10 days?

scala apache-spark apache-spark-sql spark-dataframe
1 answer

Spark SQL provides various DataFrame aggregate functions such as avg, mean, sum, etc.

You just need to apply them to a DataFrame column as a SQL Column expression.

 import org.apache.spark.sql.types._
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.Column

Create a private method for the standard deviation. It implements the population formula sqrt(E[X²] − E[X]²) using avg:

 // population standard deviation: sqrt(E[X^2] - E[X]^2)
 private def stddev(col: Column): Column = sqrt(avg(col * col) - avg(col) * avg(col))

Now you can create SQL Column expressions for the mean and the standard deviation:

 val value_sd: Column = stddev(df.col("value")).as("value_sd")
 val value_mean: Column = avg(df.col("value")).as("value_mean")

Filter your dataframe to the last 10 days, or whatever range you want.

 val filterDF = df.filter("") // put your filter condition here
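For example, assuming the "date" column is of DateType and the current day is the reference point (both assumptions; adjust the column name and reference date to your data), the condition could look like:

 // keep rows from the last 10 days, current day included (assumed interpretation)
 val filterDF = df.filter(col("date") >= date_sub(current_date(), 9) && col("date") <= current_date())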

You can now apply the aggregate functions on your filterDF:

 filterDF.agg(value_sd, value_mean).show
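Putting it together, e.g. in spark-shell where `spark` is already available, a minimal end-to-end sketch might look like this (the sample data and the interpretation of "last 10 days" relative to the latest date in the data are assumptions for illustration):

 import org.apache.spark.sql.Column
 import org.apache.spark.sql.functions._
 import spark.implicits._

 // hypothetical sample data; replace with your own dataframe
 val df = Seq(
   ("2017-01-01", 10.0),
   ("2017-01-05", 20.0),
   ("2017-01-09", 30.0)
 ).toDF("date", "value")

 // population standard deviation: sqrt(E[X^2] - E[X]^2)
 def stddev(c: Column): Column = sqrt(avg(c * c) - avg(c) * avg(c))

 val value_sd: Column = stddev(df.col("value")).as("value_sd")
 val value_mean: Column = avg(df.col("value")).as("value_mean")

 // assumed filter: keep rows whose date falls within 10 days of the latest date in the data
 val maxDate = df.agg(max(to_date($"date"))).head().getDate(0)
 val filterDF = df.filter(to_date($"date") >= date_sub(lit(maxDate), 9))

 filterDF.agg(value_mean, value_sd).show()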
