Remove duplicates from data frame in pyspark

I'm working with DataFrames in PySpark 1.4 locally, and I'm having trouble getting the dropDuplicates method to work. It keeps returning the error AttributeError: 'list' object has no attribute 'dropDuplicates'. It's not entirely clear why, since I seem to be following the syntax in the latest documentation. It feels like I'm missing an import for this function or something.

 # loading the CSV file into an RDD in order to start working with the data
 rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

 # loading the RDD object into a DataFrame and assigning column names
 df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

 # dropping duplicates from the DataFrame
 df1.dropDuplicates().show()
2 answers

This is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While the result of sqlContext.createDataFrame(rdd1, ...) is a pyspark.sql.dataframe.DataFrame, once you apply .collect() to it you have a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:

 df1 = (sqlContext
        .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
        .dropDuplicates())

 df1.collect()
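
A quick way to see the type mismatch described above (a minimal sketch, reusing the rdd1 from the question, built without the stray .collect()):

 df = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
 print(type(df))    # <class 'pyspark.sql.dataframe.DataFrame'> -- has dropDuplicates

 rows = df.collect()
 print(type(rows))  # <class 'list'> -- a plain Python list, no dropDuplicates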

If you have a data frame and want to remove all duplicates, with reference to duplicates in a specific column (called "colName"):

Count before de-dupe:

 df.count() 

Do the de-dupe (converting the column you are de-duping on to string type):

 from pyspark.sql.functions import col

 df = df.withColumn('colName', col('colName').cast('string'))
 df.drop_duplicates(subset=['colName']).count()

You can use a sorted groupBy to check that the duplicates were removed:

 df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False) 
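
For illustration, here is a minimal end-to-end sketch of the same approach on hypothetical toy data (the column names and values are made up):

 from pyspark.sql.functions import col

 # hypothetical toy data: three rows, two distinct values in colName
 df = sqlContext.createDataFrame([(1, 'a'), (2, 'a'), (3, 'b')], ['id', 'colName'])

 df.count()                                 # 3 -- count before de-dupe
 df = df.withColumn('colName', col('colName').cast('string'))
 deduped = df.drop_duplicates(subset=['colName'])
 deduped.count()                            # 2 -- one row kept per distinct colName value
 deduped.groupBy('colName').count().show()  # every count should now be 1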
