I have a dataset consisting of a timestamp column and a dollar column. I would like to find the average number of dollars per week ending at the timestamp of each line. At first I looked at the pyspark.sql.functions.window function, but this forces the data for the week.
Here is an example:
%pyspark
import datetime
from pyspark.sql import functions as F
df1 = sc.parallelize([(17,"2017-03-11T15:27:18+00:00"), (13,"2017-03-11T12:27:18+00:00"), (21,"2017-03-17T11:27:18+00:00")]).toDF(["dollars", "datestring"])
df2 = df1.withColumn('timestampGMT', df1.datestring.cast('timestamp'))
w = df2.groupBy(F.window("timestampGMT", "7 days")).agg(F.avg("dollars").alias('avg'))
w.select(w.window.start.cast("string").alias("start"), w.window.end.cast("string").alias("end"), "avg").collect()
This results in two entries:
| start | end | avg |
|---------------------|----------------------|-----|
|'2017-03-16 00:00:00'| '2017-03-23 00:00:00'| 21.0|
|---------------------|----------------------|-----|
|'2017-03-09 00:00:00'| '2017-03-16 00:00:00'| 15.0|
|---------------------|----------------------|-----|
A window function binds time series data, rather than performing a moving average.
Is there a way to execute a moving average when I get back to the weekly average for each row with a period of time ending in timestampGMT row?
EDIT:
Zhang's answer below is close to what I want, but not quite what I would like to see.
Here is the best example to show what I'm trying to get to:
%pyspark
from pyspark.sql import functions as F
df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
(13, "2017-03-15T12:27:18+00:00"),
(25, "2017-03-18T11:27:18+00:00")],
["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df = df.withColumn('rolling_average', F.avg("dollars").over(Window.partitionBy(F.window("timestampGMT", "7 days"))))
This results in the following file frame:
dollars timestampGMT rolling_average
25 2017-03-18 11:27:18.0 25
17 2017-03-10 15:27:18.0 15
13 2017-03-15 12:27:18.0 15
, , timestampGMT, :
dollars timestampGMT rolling_average
17 2017-03-10 15:27:18.0 17
13 2017-03-15 12:27:18.0 15
25 2017-03-18 11:27:18.0 19
roll_average 2017-03-10 17, . roll_average 2017-03-15 15, 13 2017-03-15 17 2017-03-10, 7- . 2017-03-18 19, 25 2017-03-18 13 2017-03-10, 7- , 17 2017 -03-10, 7- .
, binning, ?