How to subtract a days column from a date column in PySpark?

Given the following PySpark DataFrame

df = sqlContext.createDataFrame([('2015-01-15', 10), ('2015-02-15', 5)], ('date_col', 'days_col')) 

How can the days column be subtracted from the date column? In this example, the resulting column should be ['2015-01-05', '2015-02-10'].

I looked at pyspark.sql.functions.date_sub(), but it requires a date column and a single integer literal for the number of days, e.g. date_sub(df['date_col'], 10). Ideally, I would prefer to write date_sub(df['date_col'], df['days_col']).
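Note: on newer Spark releases (the 3.x line), date_sub also accepts a column for the number of days, so the ideal call above should work as-is; a minimal sketch, assuming a 3.x runtime:

 from pyspark.sql.functions import date_sub

 # Works on Spark 3.x, where the days argument may be a column;
 # older versions only accept an integer literal here
 df.withColumn('subtracted_dates', date_sub(df['date_col'], df['days_col']))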

I also tried creating a UDF:

 from datetime import timedelta

 from pyspark.sql.functions import udf
 from pyspark.sql.types import DateType

 # Note: start_date must arrive as a datetime.date, so date_col should be
 # a DateType column (cast it first if it is a string)
 def subtract_date(start_date, days_to_subtract):
     return start_date - timedelta(days=days_to_subtract)

 subtract_date_udf = udf(subtract_date, DateType())

 df.withColumn('subtracted_dates', subtract_date_udf(df['date_col'], df['days_col']))

This technically works, but I have read that moving data between Spark and Python this way can cause performance problems on large datasets. I can stick with this solution for now (no need to optimize prematurely), but my gut says there must be a way to do something this simple without a Python UDF.

4 answers

I was able to solve this with selectExpr:

 df.selectExpr('date_sub(date_col, days_col) as subtracted_dates') 

If you want to keep the original columns as well, just add * to the expression:

 df.selectExpr('*', 'date_sub(date_col, days_col) as subtracted_dates') 
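As a quick sanity check, here is the approach end to end on the sample data from the question (using the SparkSession entry point rather than sqlContext; the expected output is shown as comments):

 from pyspark.sql import SparkSession

 spark = SparkSession.builder.getOrCreate()

 df = spark.createDataFrame([('2015-01-15', 10), ('2015-02-15', 5)],
                            ('date_col', 'days_col'))

 # Inside a SQL expression, date_sub accepts a column for the day count
 df.selectExpr('*', 'date_sub(date_col, days_col) as subtracted_dates').show()
 # +----------+--------+----------------+
 # |  date_col|days_col|subtracted_dates|
 # +----------+--------+----------------+
 # |2015-01-15|      10|      2015-01-05|
 # |2015-02-15|       5|      2015-02-10|
 # +----------+--------+----------------+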

Use the expr function (if the number of days to subtract comes from another column):

 >>> from pyspark.sql.functions import expr
 >>> df.withColumn('subtracted_dates', expr("date_sub(date_col, days_col)"))

Use the date_sub function directly (if the number of days is an integer literal):

 >>> from pyspark.sql.functions import date_sub
 >>> df.withColumn('subtracted_dates', date_sub('date_col', <int_literal_value>))

Not the most elegant solution ever, but if you don't want to hack SQL expressions in Scala (not that it should be hard, but these are private to sql), something like this should do the trick:

 from pyspark.sql import Column

 # Subtract c2 days from c1 by going through epoch seconds
 def date_sub_(c1: Column, c2: Column) -> Column:
     return ((c1.cast("timestamp").cast("long") - 60 * 60 * 24 * c2)
             .cast("timestamp").cast("date"))

For Python 2.x, just drop the type annotations.
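A hypothetical call on the question's DataFrame would then look like this (note that the arithmetic goes through epoch seconds, so dates near DST transitions can be off by one in a non-UTC session time zone):

 from pyspark.sql.functions import col

 df.withColumn('subtracted_dates', date_sub_(col('date_col'), col('days_col')))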


A slightly different format, but it also works:

 df.registerTempTable("dfTbl")

 newdf = spark.sql("""
     SELECT *, date_sub(d.date_col, d.days_col) AS DateSub
     FROM dfTbl d
 """)
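Note that registerTempTable was deprecated in Spark 2.0 in favor of createOrReplaceTempView, so on newer versions you would register the view like this (same query otherwise):

 df.createOrReplaceTempView("dfTbl")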
