Since 1.5.0 Spark provides a number of functions, such as dayofmonth, hour, month or year, which can work with dates and timestamps. Therefore, if timestamp is a TimestampType, all you need is the correct expression. For example:
from pyspark.sql.functions import hour, mean

(df
    .groupBy(hour("timestamp").alias("hour"))
    .agg(mean("value").alias("mean"))
    .show())

## +----+------------------+
## |hour|              mean|
## +----+------------------+
## |   0|508.05999999999995|
## |   1| 449.8666666666666|
## |   2| 524.9499999999999|
## |   3|264.59999999999997|
## +----+------------------+
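The same approach extends to the other date functions. A minimal sketch (assuming the same df with timestamp and value columns) that groups by year and month together:

from pyspark.sql.functions import year, month, mean

# Group by calendar year and month extracted from the timestamp column
(df
    .groupBy(year("timestamp").alias("year"), month("timestamp").alias("month"))
    .agg(mean("value").alias("mean"))
    .show())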
Before 1.5.0, your best option is to use HiveContext and Hive UDFs with selectExpr:
df.selectExpr("year(timestamp) AS year", "value").groupBy("year").sum()
or raw SQL:
df.registerTempTable("df") sqlContext.sql(""" SELECT MONTH(timestamp) AS month, SUM(value) AS values_sum FROM df GROUP BY MONTH(timestamp)""")
Just remember that the aggregation is performed by Spark and is not pushed down to the external source. Usually this is the desired behavior, but there are situations where you may prefer to perform the aggregation as a subquery to limit data transfer.
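For example, when the data lives in a relational database, you can pass a subquery as the dbtable option of the JDBC reader so the aggregation runs in the database. A minimal sketch, where the JDBC URL, the events table, and the column names are hypothetical:

# Hypothetical JDBC source; the aggregation runs inside the database,
# so only the already-aggregated rows are transferred to Spark.
aggregated = (sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost/mydb")
    .option("dbtable", """(
        SELECT EXTRACT(MONTH FROM timestamp) AS month, SUM(value) AS values_sum
        FROM events
        GROUP BY EXTRACT(MONTH FROM timestamp)) AS tmp""")
    .load())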
zero323