How to subtract a days column from a date column in PySpark?

Given the following PySpark DataFrame

df = sqlContext.createDataFrame([('2015-01-15', 10), ('2015-02-15', 5)], ('date_col', 'days_col')) 

How can the days column be subtracted from the date column? In this example, the resulting column should be ['2015-01-05', '2015-02-10'].

I looked at pyspark.sql.functions.date_sub(), but it requires a date column and a single integer literal for the number of days, e.g. date_sub(df['date_col'], 10). Ideally, I would prefer to write date_sub(df['date_col'], df['days_col']).
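Note: on newer Spark releases (the 3.x line), date_sub also accepts a column for the number of days, so the ideal call above should work as-is; a minimal sketch, assuming a 3.x runtime:

 from pyspark.sql.functions import date_sub

 # Works on Spark 3.x, where the days argument may be a column;
 # older versions only accept an integer literal here
 df.withColumn('subtracted_dates', date_sub(df['date_col'], df['days_col']))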

I also tried creating a UDF:

 from datetime import timedelta

 from pyspark.sql.functions import udf
 from pyspark.sql.types import DateType

 # Note: start_date must arrive as a datetime.date, so date_col should be
 # a DateType column (cast it first if it is a string)
 def subtract_date(start_date, days_to_subtract):
     return start_date - timedelta(days=days_to_subtract)

 subtract_date_udf = udf(subtract_date, DateType())

 df.withColumn('subtracted_dates', subtract_date_udf(df['date_col'], df['days_col']))

This technically works, but I have read that moving data between Spark and Python this way can cause performance problems on large datasets. I can stick with this solution for now (no need to optimize prematurely), but my gut says there must be a way to do something this simple without a Python UDF.

4 answers

I was able to solve this with selectExpr:

 df.selectExpr('date_sub(date_col, days_col) as subtracted_dates') 

If you want to keep the original columns as well, just add * to the expression:

 df.selectExpr('*', 'date_sub(date_col, days_col) as subtracted_dates') 
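As a quick sanity check, here is the approach end to end on the sample data from the question (using the SparkSession entry point rather than sqlContext; the expected output is shown as comments):

 from pyspark.sql import SparkSession

 spark = SparkSession.builder.getOrCreate()

 df = spark.createDataFrame([('2015-01-15', 10), ('2015-02-15', 5)],
                            ('date_col', 'days_col'))

 # Inside a SQL expression, date_sub accepts a column for the day count
 df.selectExpr('*', 'date_sub(date_col, days_col) as subtracted_dates').show()
 # +----------+--------+----------------+
 # |  date_col|days_col|subtracted_dates|
 # +----------+--------+----------------+
 # |2015-01-15|      10|      2015-01-05|
 # |2015-02-15|       5|      2015-02-10|
 # +----------+--------+----------------+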

Use the expr function (if the number of days to subtract comes from another column):

 >>> from pyspark.sql.functions import expr
 >>> df.withColumn('subtracted_dates', expr("date_sub(date_col, days_col)"))

Use the date_sub function directly (if the number of days is an integer literal):

 >>> from pyspark.sql.functions import date_sub
 >>> df.withColumn('subtracted_dates', date_sub('date_col', <int_literal_value>))

Not the most elegant solution ever, but if you don't want to hack SQL expressions in Scala (not that it should be hard, but these are private to sql), something like this should do the trick:

 from pyspark.sql import Column

 # Subtract c2 days from c1 by going through epoch seconds
 def date_sub_(c1: Column, c2: Column) -> Column:
     return ((c1.cast("timestamp").cast("long") - 60 * 60 * 24 * c2)
             .cast("timestamp").cast("date"))

For Python 2.x, just drop the type annotations.
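A hypothetical call on the question's DataFrame would then look like this (note that the arithmetic goes through epoch seconds, so dates near DST transitions can be off by one in a non-UTC session time zone):

 from pyspark.sql.functions import col

 df.withColumn('subtracted_dates', date_sub_(col('date_col'), col('days_col')))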


A slightly different format, but it also works:

 df.registerTempTable("dfTbl")

 newdf = spark.sql("""
     SELECT *, date_sub(d.date_col, d.days_col) AS DateSub
     FROM dfTbl d
 """)
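Note that registerTempTable was deprecated in Spark 2.0 in favor of createOrReplaceTempView, so on newer versions you would register the view like this (same query otherwise):

 df.createOrReplaceTempView("dfTbl")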
