Spark/Scala: filter records out of an RDD when they exist in another RDD

I have an RDD that has the following structure:

 (user_id, item_id, rating) 

Let's call this RDD train.

Then there is another RDD with the same structure:

 (user_id, item_id, rating) 

Let's call this RDD test.

I want to make sure that, for each user, no (user, item) pair in the test data also appears in the training data. So let's say

 train = { u1: item2, item4, item3 }  test = { u1: item2, item5 } 

I want item2 removed from u1's training data.

So I started by grouping the RDDs by (user_id, item_id):

  val groupedTrainData = trainData.groupBy(x => (x._1, x._2)) 

But I feel this is not the right approach.

1 answer

You need PairRDDFunctions.subtractByKey:

 def cleanTrain(train: RDD[((String, String), Double)],
                test: RDD[((String, String), Double)]): RDD[((String, String), Double)] =
   train.subtractByKey(test)
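Since your RDDs are flat (user_id, item_id, rating) triples rather than pair RDDs, key both by (user, item) with keyBy before calling subtractByKey. Here is a minimal self-contained sketch in local mode; the sample data and the local[*] master are assumptions for illustration only:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CleanTrainExample {
  def main(args: Array[String]): Unit = {
    // Local context for illustration; a real job would reuse its existing SparkContext.
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("clean-train"))

    // Hypothetical sample data in the question's (user_id, item_id, rating) shape.
    val train = sc.parallelize(Seq(("u1", "item2", 4.0), ("u1", "item4", 3.0), ("u1", "item3", 5.0)))
    val test  = sc.parallelize(Seq(("u1", "item2", 4.0), ("u1", "item5", 2.0)))

    // Key both RDDs by (user, item), drop training pairs whose key appears in test,
    // then unwrap back to the original triple shape.
    val cleaned = train.keyBy(r => (r._1, r._2))
      .subtractByKey(test.keyBy(r => (r._1, r._2)))
      .values

    // u1's item2 record is gone; item4 and item3 remain.
    cleaned.collect().foreach(println)
    sc.stop()
  }
}
```

Note that subtractByKey matches on the key only, so ("u1", "item2") is removed from train even if the ratings differ between the two RDDs.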
