Spark DataFrame filter based on another DataFrame that defines blacklist criteria

I have largeDataFrame (multiple columns and billions of rows) and smallDataFrame (single column and 10,000 rows).

I would like to filter out all rows from largeDataFrame whose some_identifier column matches one of the rows in smallDataFrame.

Here is an example:

largeDataFrame

 some_identifier,first_name
 111,bob
 123,phil
 222,mary
 456,sue

smallDataFrame

 some_identifier
 123
 456

desiredOutput

 111,bob
 222,mary

Here is my ugly solution.

 val smallDataFrame2 = smallDataFrame.withColumn("is_bad", lit("bad_row"))
 val desiredOutput = largeDataFrame
   .join(broadcast(smallDataFrame2), Seq("some_identifier"), "left")
   .filter($"is_bad".isNull)
   .drop("is_bad")

Is there a cleaner solution?

dataframe apache-spark apache-spark-sql
1 answer

In this case, you should use a left anti join (join type "leftanti").

A left anti join is the complement of a left semi join.

It removes from the left table every row whose key matches a row in the right table:

 largeDataFrame
   .join(smallDataFrame, Seq("some_identifier"), "leftanti")
   .show
 // +---------------+----------+
 // |some_identifier|first_name|
 // +---------------+----------+
 // |            222|      mary|
 // |            111|       bob|
 // +---------------+----------+
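Since smallDataFrame has only about 10,000 rows, it can also help to hint Spark to broadcast it, so the anti join runs as a broadcast join instead of shuffling the billions of rows in largeDataFrame. A sketch, using the standard broadcast hint from org.apache.spark.sql.functions:

```scala
import org.apache.spark.sql.functions.broadcast

// Broadcast the small blacklist to every executor; the anti join
// then filters largeDataFrame locally without a shuffle.
val desiredOutput = largeDataFrame
  .join(broadcast(smallDataFrame), Seq("some_identifier"), "leftanti")
```

The optimizer often chooses a broadcast join on its own when the right side is below spark.sql.autoBroadcastJoinThreshold, but the explicit hint makes the intent clear.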
