
Spark - Connecting Two Elements of PairRDD

Hi, I have a JavaPairRDD with two elements:

("TypeA", List<jsonTypeA>), ("TypeB", List<jsonTypeB>) 

I need to combine 2 pairs into 1 pair of types:

 ("TypeA_B", List<jsonCombinedAPlusB>) 

That is, I need to combine the two lists into one, joining each pair of JSONs (one of type A and one of type B) on a common field they share.

Note that the list of type-A records is much smaller than the other, and the join must be an inner join, so the result list should be no larger than the type-A list.

What is the most efficient way to do this?

hadoop bigdata apache-spark
1 answer

rdd.join(otherRdd) gives you an inner join between the two RDDs. To use it, convert both RDDs to pair RDDs whose key is the common attribute you want to join on. Something like this (example, untested):

 // Re-key each RDD by the shared join attribute extracted from the JSON value
 val rddAKeyed = rddA.map { case (k, json) => (key(json), json) }
 val rddBKeyed = rddB.map { case (k, json) => (key(json), json) }

 // Inner join on the shared key, then merge each matched pair of JSONs
 val joined = rddAKeyed.join(rddBKeyed)
   .map { case (k, (json1, json2)) => (newK, merge(json1, json2)) }

where merge(j1, j2) is your specific business logic for combining the two JSON objects.
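To make the shape of that join-and-merge concrete, here is a minimal, self-contained sketch of the same inner-join idea using plain Scala collections instead of Spark. The `id` field name and the `merge` behavior (simply unioning the fields) are assumptions for illustration, not part of the original answer:

```scala
object JoinSketch {
  // Model each JSON record as a flat Map for illustration purposes.
  type Json = Map[String, String]

  // Hypothetical merge: union the fields of a type-A and a type-B record.
  def merge(a: Json, b: Json): Json = a ++ b

  // Inner join: only type-A records with a matching type-B record survive,
  // so the result is at most as large as listA.
  def innerJoin(listA: List[Json], listB: List[Json]): List[Json] = {
    // Index the larger list by the join key once.
    val byId: Map[String, Json] = listB.map(j => j("id") -> j).toMap
    listA.flatMap(a => byId.get(a("id")).map(b => merge(a, b)))
  }

  def main(args: Array[String]): Unit = {
    val listA = List(Map("id" -> "1", "a" -> "x"))
    val listB = List(Map("id" -> "1", "b" -> "y"), Map("id" -> "2", "b" -> "z"))
    println(innerJoin(listA, listB))
  }
}
```

Because the type-A list is the smaller side, indexing the type-B side and probing it per A-record mirrors what a broadcast-style hash join would do; in Spark the same inner-join semantics come from `join` as shown above.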

