I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (a single column and 10,000 rows).
I would like to filter out all rows from largeDataFrame whose some_identifier column matches one of the rows in smallDataFrame.
Here is an example:
largeDataFrame
some_identifier,first_name
111,bob
123,phil
222,mary
456,sue
smallDataFrame
some_identifier
123
456
desiredOutput
111,bob
222,mary
Here is my ugly solution.
import org.apache.spark.sql.functions.{broadcast, lit}

// Tag every row of the small table, left-join it onto the large table,
// then keep only the rows that found no match (the tag column stays null).
val smallDataFrame2 = smallDataFrame.withColumn("is_bad", lit("bad_row"))
val desiredOutput = largeDataFrame
  .join(broadcast(smallDataFrame2), Seq("some_identifier"), "left")
  .filter($"is_bad".isNull)
  .drop("is_bad")
Is there a cleaner solution?
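For comparison, here is a sketch of how the same intent might be expressed with a left anti join, which (assuming Spark 2.0+, where the "left_anti" join type is available) keeps only the rows of the left table that have no match on the right:

```scala
import org.apache.spark.sql.functions.broadcast

// Anti-join sketch: one join call replaces the tag/filter/drop dance.
// Rows of largeDataFrame whose some_identifier appears in smallDataFrame
// are dropped; no extra column is ever materialized.
val desiredOutput = largeDataFrame.join(
  broadcast(smallDataFrame),
  Seq("some_identifier"),
  "left_anti"
)
```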
dataframe apache-spark apache-spark-sql
Powers