Spark SQL row_number() partitionBy sort desc

I successfully created a row_number() partitionBy window in Spark, but I would like to sort it in descending order rather than the default ascending. Here is my working code:

    from pyspark.sql import HiveContext
    from pyspark.sql.types import *
    from pyspark.sql import Row, functions as F
    from pyspark.sql.window import Window

    data_cooccur.select("driver", "also_item", "unit_count",
                        F.rowNumber().over(Window.partitionBy("driver").orderBy("unit_count")).alias("rowNum")).show()

This gives me this result:

    +------+---------+----------+------+
    |driver|also_item|unit_count|rowNum|
    +------+---------+----------+------+
    |   s10|      s11|         1|     1|
    |   s10|      s13|         1|     2|
    |   s10|      s17|         1|     3|

And here I add desc() to sort in descending order:

    data_cooccur.select("driver", "also_item", "unit_count",
                        F.rowNumber().over(Window.partitionBy("driver").orderBy("unit_count").desc()).alias("rowNum")).show()

And get this error:

AttributeError: 'WindowSpec' object has no attribute 'desc'

What am I doing wrong here?

3 answers

desc should be applied to the column, not to the window definition. You can either use the method on the column:

    from pyspark.sql.functions import col

    F.rowNumber().over(Window.partitionBy("driver").orderBy(col("unit_count").desc()))

or the standalone function:

    from pyspark.sql.functions import desc

    F.rowNumber().over(Window.partitionBy("driver").orderBy(desc("unit_count")))
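Note that in Spark 1.6 and later rowNumber() was renamed to row_number() (the old name was later removed). A minimal self-contained sketch of the same query against the newer API, with made-up sample rows standing in for the question's data_cooccur DataFrame:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.master("local[*]").appName("row-number-desc").getOrCreate()

    # Illustrative stand-in for the question's data_cooccur DataFrame
    data_cooccur = spark.createDataFrame(
        [("s10", "s11", 1), ("s10", "s13", 1), ("s10", "s17", 1)],
        ["driver", "also_item", "unit_count"])

    # Descending order comes from desc() on the column, not on the WindowSpec
    w = Window.partitionBy("driver").orderBy(F.col("unit_count").desc())
    data_cooccur.select("driver", "also_item", "unit_count",
                        F.row_number().over(w).alias("rowNum")).show()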

Or you can use plain SQL via Spark SQL:

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .master('local[*]') \
        .appName('Test') \
        .getOrCreate()

    # Register the DataFrame as a temp view so Spark SQL can query it by name
    data_cooccur.createOrReplaceTempView("data_cooccur")

    spark.sql("""
        select driver
              ,also_item
              ,unit_count
              ,ROW_NUMBER() OVER (PARTITION BY driver ORDER BY unit_count DESC) AS rowNum
        from data_cooccur
    """).show()

Update: I tried to verify this, and it does not seem to work (in fact, it raises an error). The reason it appeared to work is that I had this code after a call to display() in Databricks, and the code after the display() call never ran. It looks like orderBy() on a DataFrame and orderBy() on a Window are not really the same. I will keep this answer up as a negative confirmation.

Starting with PySpark 2.4 (and possibly earlier), just adding the ascending=False keyword to the orderBy call works for me.

For example:

    personal_recos.withColumn("row_number", F.row_number().over(
        Window.partitionBy("COLLECTOR_NUMBER").orderBy("count", ascending=False)))

and

    personal_recos.withColumn("row_number", F.row_number().over(
        Window.partitionBy("COLLECTOR_NUMBER").orderBy(F.col("count").desc())))

seem to give me the same behavior.
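For what it's worth, the negative confirmation in the update above is consistent with the API: WindowSpec.orderBy() takes only column arguments (*cols) and no ascending keyword, unlike DataFrame.orderBy(), so the first variant should fail before producing any rows. A quick sketch under that assumption, reusing the hypothetical column names from the snippets above:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Window.orderBy accepts only *cols, so the keyword should raise
    # TypeError: orderBy() got an unexpected keyword argument 'ascending'
    try:
        w = Window.partitionBy("COLLECTOR_NUMBER").orderBy("count", ascending=False)
    except TypeError as e:
        print(e)

    # The column-level desc() variant builds a valid descending window spec
    w = Window.partitionBy("COLLECTOR_NUMBER").orderBy(F.col("count").desc())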

