I need to do a non-equijoin in Pig. The first thing I want to try is CROSS followed by a filter:
together = CROSS A, B;
filtered = FILTER together BY (JOIN PREDICATE);
However, one of the relations is certainly small enough to fit in memory. This makes me wonder how CROSS is actually implemented in Pig. Can a CROSS be "replicated"?
If not, I could do something like this:
small = FOREACH small GENERATE *, 1 AS key:int;
large = FOREACH large GENERATE *, 1 AS key:int;
together = JOIN large BY key, small BY key USING 'replicated';
filtered = FILTER together BY (JOIN PREDICATE);
Would the second approach give better performance?
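To make the second approach concrete, here is a minimal sketch assuming a hypothetical range predicate. The file names, schemas, and the aliases small_k and large_k are made up for illustration; only the constant-key trick and USING 'replicated' come from the snippet above.

```pig
-- Hypothetical inputs: 'small' holds ranges, 'large' holds values to match against them.
small = LOAD 'small.tsv' AS (lo:int, hi:int, label:chararray);
large = LOAD 'large.tsv' AS (id:int, value:int);

-- Add the same constant key to every row so that every pair of rows matches on it.
small_k = FOREACH small GENERATE *, 1 AS key:int;
large_k = FOREACH large GENERATE *, 1 AS key:int;

-- In a replicated join the large relation is listed first and the small one last;
-- the small relation is the one loaded into memory.
together = JOIN large_k BY key, small_k BY key USING 'replicated';

-- The actual non-equi predicate (assumed here: value falls in [lo, hi)).
filtered = FILTER together BY (large_k::value >= small_k::lo) AND (large_k::value < small_k::hi);
```

After the join, fields are referenced with the alias::field form to disambiguate the two sides.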