Use the following to get an index column of monotonically increasing, unique, and consecutive integers, which is not how monotonically_increasing_id() works. The indexes will increase in the same order as colName of your DataFrame.
import pyspark.sql.functions as F
from pyspark.sql.window import Window as W

# Running window over all rows from the start up to the current row, ordered by colName
window = W.orderBy('colName').rowsBetween(W.unboundedPreceding, W.currentRow)

# Summing a literal 1 over the running window yields a consecutive 1-based index
df = df\
    .withColumn('int', F.lit(1))\
    .withColumn('index', F.sum('int').over(window))\
    .drop('int')
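As a quick sanity check, here is a minimal, self-contained run of the same window trick on a tiny hypothetical DataFrame (the data and the spark session setup are assumptions, not part of the original answer):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window as W

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('c',), ('a',), ('b',)], ['colName'])

window = W.orderBy('colName').rowsBetween(W.unboundedPreceding, W.currentRow)
df = (df.withColumn('int', F.lit(1))
        .withColumn('index', F.sum('int').over(window))
        .drop('int'))

df.show()
# +-------+-----+
# |colName|index|
# +-------+-----+
# |      a|    1|
# |      b|    2|
# |      c|    3|
# +-------+-----+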
Use the following code to look at the tail, i.e. the last rownums rows of the DataFrame.
rownums = 10

df.where(F.col('index') > df.count() - rownums).show()
Use the following code to view the rows from start_row to end_row of the DataFrame.
start_row = 20
end_row = start_row + 10

df.where((F.col('index') > start_row) & (F.col('index') < end_row)).show()
zipWithIndex() is an RDD method that also returns monotonically increasing, unique, and consecutive integers, but it appears to be much slower to implement in a way that gets you back to your original DataFrame amended with the id column.
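For comparison, a minimal sketch of what the zipWithIndex() route could look like, assuming an active SparkSession, an existing DataFrame df, and 'index' as an arbitrary name for the new column (this is an illustration, not the original answer's code):

# zipWithIndex() pairs each row with its position in the RDD (0, 1, 2, ...).
# Rebuilding the DataFrame requires a round trip through the RDD API,
# which is what tends to make this approach slower in practice.
df_with_index = (
    df.rdd
      .zipWithIndex()
      .map(lambda pair: pair[0] + (pair[1],))   # append the index to each row tuple
      .toDF(df.columns + ['index'])
)

df_with_index.show()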