I am trying to deduplicate the values in one Spark DataFrame column based on the values in a column of another DataFrame. withColumn() seems to work only within a single DataFrame, and subqueries won't be available until Spark 2.0. I suppose I could try joining the tables, but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT_ID=233347, ROI_NAME='my_roi', ROI_VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
In essence, I want to take the rows of df, drop every row whose UNIQUE_ID also appears in df_re, and return the whole DataFrame with those duplicate rows removed. I'm sure I could iterate over everything, but I wonder if there is a better way than something like the join sketched below. Any ideas?
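For reference, this is roughly the join approach I had in mind (just a sketch: a left_anti join needs Spark 2.0+, so on 1.x I assume a left outer join plus a null filter is the equivalent):

from pyspark.sql import functions as F

# Spark 2.0+: keep only the rows of df whose UNIQUE_ID does not appear in df_re
deduped = df.join(df_re, on="UNIQUE_ID", how="left_anti")

# Spark 1.x fallback: left outer join, then keep the rows with no match in df_re
deduped_1x = (
    df.join(df_re.withColumnRenamed("UNIQUE_ID", "RE_ID"),
            df["UNIQUE_ID"] == F.col("RE_ID"),
            "left_outer")
    .filter(F.col("RE_ID").isNull())
    .drop("RE_ID")
)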