Spark/Scala: filter records out of an RDD when they exist in another RDD

I have an RDD that has the following structure:

 (user_id, item_id, rating) 

Let's call this RDD train.

Then there is another RDD with the same structure:

 (user_id, item_id, rating) 

Let's call this RDD test.

I want to make sure that, for each user, no (user, item) pair in the test data also appears in the training data. So let's say

 train = { u1: item2, item4, item3 }  test = { u1: item2, item5 } 

I want item2 removed from u1's training data.

So I started by grouping the RDDs by (user_id, item_id):

  val groupedTrainData = trainData.groupBy(x => (x._1, x._2)) 

But I feel this is not the right approach.

1 answer

You need PairRDDFunctions.subtractByKey:

 def cleanTrain(train: RDD[((String, String), Double)],
                test: RDD[((String, String), Double)]): RDD[((String, String), Double)] =
   train.subtractByKey(test)
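Since your RDDs are flat (user_id, item_id, rating) triples rather than pair RDDs, key both by (user, item) with keyBy before calling subtractByKey. Here is a minimal self-contained sketch in local mode; the sample data and the local[*] master are assumptions for illustration only:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CleanTrainExample {
  def main(args: Array[String]): Unit = {
    // Local context for illustration; a real job would reuse its existing SparkContext.
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("clean-train"))

    // Hypothetical sample data in the question's (user_id, item_id, rating) shape.
    val train = sc.parallelize(Seq(("u1", "item2", 4.0), ("u1", "item4", 3.0), ("u1", "item3", 5.0)))
    val test  = sc.parallelize(Seq(("u1", "item2", 4.0), ("u1", "item5", 2.0)))

    // Key both RDDs by (user, item), drop training pairs whose key appears in test,
    // then unwrap back to the original triple shape.
    val cleaned = train.keyBy(r => (r._1, r._2))
      .subtractByKey(test.keyBy(r => (r._1, r._2)))
      .values

    // u1's item2 record is gone; item4 and item3 remain.
    cleaned.collect().foreach(println)
    sc.stop()
  }
}
```

Note that subtractByKey matches on the key only, so ("u1", "item2") is removed from train even if the ratings differ between the two RDDs.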
