I have two RDDs:
**rdd1**

    id1 val1
    id2 val2

**rdd2**

    id1 v1
    id2 v2
    id1 v3
    id8 v7
    id1 v4
    id3 v5
    id6 v6
I want to filter rdd2 so that it contains only the records whose keys are present in rdd1. The output would therefore be:
**output**

    id1 v1
    id2 v2
    id1 v3
    id1 v4
This has been asked on Stack Overflow before, but for smaller datasets, where people broadcast one RDD and then used it to filter the other. My problem is the size: rdd1 has more than 500 million records and rdd2 has more than 10 billion.
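To make the intended result concrete, here is a minimal sketch in plain Python, not Spark. It shows the semantics being asked for as a hash-partitioned semi-join: both sides are split by key hash so that no single partition needs the full key set of rdd1. The function name `semi_join_by_key`, the `num_partitions` parameter, and the use of lists in place of RDDs are assumptions made only for illustration.

```python
def semi_join_by_key(left, right, num_partitions=4):
    """Keep the (key, value) pairs of `right` whose key occurs in `left`."""
    # Hash-partition the distinct keys of `left` (stands in for rdd1).
    left_keys = [set() for _ in range(num_partitions)]
    for key, _ in left:
        left_keys[hash(key) % num_partitions].add(key)

    # Route every record of `right` (stands in for rdd2) to the partition
    # that owns its key, using the same hash function.
    right_parts = [[] for _ in range(num_partitions)]
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))

    # Filter each partition locally: only that partition's keys are needed.
    result = []
    for keys, records in zip(left_keys, right_parts):
        result.extend((k, v) for k, v in records if k in keys)
    return result


rdd1 = [("id1", "val1"), ("id2", "val2")]
rdd2 = [("id1", "v1"), ("id2", "v2"), ("id1", "v3"), ("id8", "v7"),
        ("id1", "v4"), ("id3", "v5"), ("id6", "v6")]

output = semi_join_by_key(rdd1, rdd2)
# Order across partitions is not guaranteed, so sort before printing.
print(sorted(output))
```

In Spark, the same shape would presumably correspond to a shuffle-based `join` of rdd2 against the distinct keys of rdd1 instead of a broadcast. Whether that performs acceptably at this scale is exactly what is being asked, so treat it as an assumption.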
Please help.