If you have a DataFrame and you want to remove all duplicates, with respect to duplicate values in a specific column (called "colName"):
Count before de-dupe:
df.count()
Do the de-dupe (convert the column you are de-duping on to string type):
from pyspark.sql.functions import col

# Cast the key column to string, drop rows with duplicate values in it,
# and reassign so df actually holds the de-duped result
df = df.withColumn('colName', col('colName').cast('string'))
df = df.drop_duplicates(subset=['colName'])
df.count()
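For reference, here is a minimal self-contained sketch of the whole flow; it assumes a local SparkSession, and the toy data and id column are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy DataFrame with a duplicated value in colName (hypothetical data)
df = spark.createDataFrame([(1, "a"), (2, "a"), (3, "b")], ["id", "colName"])

df.count()  # 3 rows before de-dupe

# Cast the key column to string, then keep one row per distinct colName value
df = df.withColumn("colName", col("colName").cast("string"))
df = df.drop_duplicates(subset=["colName"])

df.count()  # 2 rows after de-dupe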
You can use a sorted group to check whether the duplicates were removed:
df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
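Another quick check is to filter the grouped counts to only the values that still occur more than once; an empty result means the de-dupe worked:

df.groupBy('colName').count().filter('count > 1').show()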