Spark: deduplicate a column in a DataFrame based on a column in another DataFrame

I am trying to deduplicate the values in a Spark DataFrame column based on the values in a column of another DataFrame. It seems that withColumn() only works within a single DataFrame, and subqueries won't be fully available until version 2.0. I suppose I could try joining the tables, but that seems a bit messy. Here is the general idea:

df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]

df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]

In principle, I just need to take the values from df, remove anything that also appears in df_re, and then return the whole DataFrame with the rows containing those duplicates dropped. I'm sure I could iterate over everything, but I wonder if there is a better way. Any ideas?
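
For example, the brute-force version I have in mind would collect the IDs and filter against them, which is only workable if df_re is small enough to bring to the driver (assuming the key column is UNIQUE_ID, as shown above):

# Collect the UNIQUE_IDs to exclude (only reasonable if df_re is small),
# then keep the rows of df whose UNIQUE_ID is not in that list.
ids_to_drop = [row.UNIQUE_ID for row in df_re.select("UNIQUE_ID").distinct().collect()]
deduped = df.filter(~df.UNIQUE_ID.isin(ids_to_drop))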

1 answer

The way to do this is a left_outer join, then filter out the rows where the right side of the join is null. Something like:

import org.apache.spark.sql.functions.col
import spark.implicits._  // spark is the SparkSession; provides toDF and the $"..." column syntax

val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")

// Left outer join on uniq_id, then keep only the rows that found no match on the right.
df1.as("df1").join(
  df2.as("df2"),
  col("df1.uniq_id") === col("df2.uniq_id"),
  "left_outer"
).filter($"df2.uniq_id".isNull)
