Given the following PySpark DataFrame:
df = sqlContext.createDataFrame([('2015-01-15', 10), ('2015-02-15', 5)], ('date_col', 'days_col'))
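Note that with this constructor the date column is inferred as a plain string rather than a date, which matters for anything date-typed downstream:

df.printSchema()
# root
#  |-- date_col: string (nullable = true)
#  |-- days_col: long (nullable = true)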
How can the days column be subtracted from the date column? In this example, the resulting column should be ['2015-01-05', '2015-02-10'].
I looked at pyspark.sql.functions.date_sub(), but it takes a date column and a single integer, e.g. date_sub(df['date_col'], 10). Ideally, I would like to call date_sub(df['date_col'], df['days_col']).
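To make the difference concrete, here is a minimal sketch; the commented-out line is the call I wish were supported (as far as I can tell, the second argument has to be an integer literal, not a column):

from pyspark.sql.functions import date_sub

# Works: subtracts a fixed 10 days from every row
df.withColumn('minus_ten', date_sub(df['date_col'], 10))

# What I actually want: per-row subtraction driven by days_col
# df.withColumn('subtracted_dates', date_sub(df['date_col'], df['days_col']))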
I also tried creating a UDF:
from datetime import timedelta

from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

def subtract_date(start_date, days_to_subtract):
    # start_date must already be a datetime.date; if date_col is a
    # string (as in the example above), cast it to a date first
    return start_date - timedelta(days=days_to_subtract)

subtract_date_udf = udf(subtract_date, DateType())
df.withColumn('subtracted_dates', subtract_date_udf(df['date_col'], df['days_col']))
This technically works, but I have read that the round trip between Spark and Python can cause performance problems for large datasets. I can stick with this solution for now (there is no need to prematurely optimize), but my gut says there must be a way to do something this simple without resorting to a Python UDF.
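For reference, the closest non-UDF route I have found so far is to drop down to a SQL expression string, on the assumption that the SQL parser (unlike the Python date_sub function) accepts a column name for the number of days; I have not verified this across Spark versions:

from pyspark.sql.functions import expr

# Push the subtraction into a SQL expression, where date_sub's
# second argument can (I believe) be another column
df.withColumn('subtracted_dates', expr('date_sub(date_col, days_col)'))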