Suppose I have the following data frame:
dummy_data = [('a',1),('b',25),('c',3),('d',8),('e',1)]
df = sc.parallelize(dummy_data).toDF(['letter','number'])
And I want to create the following data frame:
[('a',0),('b',2),('c',1),('d',3),('e',0)]
What I do is take the distinct numbers, convert them to an RDD, apply the zipWithIndex function, and then join the result back to the original data frame:
convertDF = (df.select('number')
             .distinct()
             .rdd
             .zipWithIndex()
             .map(lambda x: (x[0].number, x[1]))
             .toDF(['old','new']))

finalDF = (df
           .join(convertDF, df.number == convertDF.old)
           .select(df.letter, convertDF.new))
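To make the intended mapping concrete, here is a rough plain-Python sketch (no Spark involved) of the same distinct / index / join logic. The names `distinct_numbers`, `old_to_new` and `final_rows` are only illustrative, and the first-seen ordering is an assumption of this sketch; Spark's `distinct()` does not guarantee any particular order.

```python
# Plain-Python illustration of the logic above; not Spark code.
dummy_data = [('a', 1), ('b', 25), ('c', 3), ('d', 8), ('e', 1)]

# distinct(): keep each number once (first-seen order in this sketch only)
distinct_numbers = list(dict.fromkeys(number for _, number in dummy_data))

# zipWithIndex(): pair each distinct number with a consecutive index from 0
old_to_new = {number: index for index, number in enumerate(distinct_numbers)}

# join + select: replace each number with the index assigned to it
final_rows = [(letter, old_to_new[number]) for letter, number in dummy_data]

print(final_rows)  # [('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 0)]
```

The exact indices differ from my example output because the order of the distinct values is arbitrary. All that matters is that equal numbers get the same index and the indices are consecutive.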
Is there something like the zipWithIndex function in dataframes? Is there an even more efficient way to accomplish this task?