Remove duplicates from data frame in pyspark

I'm working with DataFrames in PySpark 1.4 locally, and I'm having trouble getting the dropDuplicates method to work. It keeps returning the error AttributeError: 'list' object has no attribute 'dropDuplicates'. It's not entirely clear why, since I seem to be following the syntax in the latest documentation. It feels like I'm missing an import for this function or something.

 # loading the CSV file into an RDD in order to start working with the data
 rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

 # loading the RDD object into a DataFrame and assigning column names
 df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

 # dropping duplicates from the DataFrame
 df1.dropDuplicates().show()
2 answers

This is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While the result of sqlContext.createDataFrame(rdd1, ...) is a pyspark.sql.dataframe.DataFrame, once you apply .collect() to it you have a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:

 df1 = (sqlContext
        .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
        .dropDuplicates())

 df1.collect()
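
A quick way to see the type mismatch described above (a minimal sketch, reusing the rdd1 from the question, built without the stray .collect()):

 df = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
 print(type(df))    # <class 'pyspark.sql.dataframe.DataFrame'> -- has dropDuplicates

 rows = df.collect()
 print(type(rows))  # <class 'list'> -- a plain Python list, no dropDuplicates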

If you have a data frame and want to remove all duplicates, with reference to duplicates in a specific column (called "colName"):

Count before de-dupe:

 df.count() 

Do the de-dupe (converting the column you are de-duping on to string type):

 from pyspark.sql.functions import col

 df = df.withColumn('colName', col('colName').cast('string'))
 df.drop_duplicates(subset=['colName']).count()

You can use a sorted groupBy to check that the duplicates were removed:

 df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False) 
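
For illustration, here is a minimal end-to-end sketch of the same approach on hypothetical toy data (the column names and values are made up):

 from pyspark.sql.functions import col

 # hypothetical toy data: three rows, two distinct values in colName
 df = sqlContext.createDataFrame([(1, 'a'), (2, 'a'), (3, 'b')], ['id', 'colName'])

 df.count()                                 # 3 -- count before de-dupe
 df = df.withColumn('colName', col('colName').cast('string'))
 deduped = df.drop_duplicates(subset=['colName'])
 deduped.count()                            # 2 -- one row kept per distinct colName value
 deduped.groupBy('colName').count().show()  # every count should now be 1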
