Pig CROSS versus replicated JOIN

I need to do a non-equijoin in Pig. The first thing I want to try is CROSS followed by a FILTER:

    together = CROSS A, B;
    filtered = FILTER together BY (JOIN PREDICATE);

However, one of the relations is certainly small enough to fit into memory. This makes me wonder how CROSS is actually implemented in Pig. Is there a "replicated" CROSS?

If not, I could do something like this:

    small = FOREACH small GENERATE *, 1 AS key:int;
    large = FOREACH large GENERATE *, 1 AS key:int;
    together = JOIN large BY key, small BY key USING 'replicated';
    filtered = FILTER together BY (JOIN PREDICATE);
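
To make the second approach concrete, here is a minimal sketch with a range predicate. The file names, field names (`id`, `val`, `lo`, `hi`), and the predicate itself are hypothetical and only illustrate the pattern. With `USING 'replicated'`, the relation loaded into memory must be listed last.

```pig
-- Hypothetical inputs: file paths and schemas are assumptions for illustration.
large = LOAD 'large.tsv' AS (id:int, val:int);
small = LOAD 'small.tsv' AS (lo:int, hi:int);

-- Add the same constant key to both sides so every row matches every row.
large_k = FOREACH large GENERATE 1 AS key:int, id, val;
small_k = FOREACH small GENERATE 1 AS key:int, lo, hi;

-- Replicated (map-side) join: the small relation goes last and is held in memory.
together = JOIN large_k BY key, small_k BY key USING 'replicated';

-- Apply the actual non-equi predicate (here: val falls within [lo, hi]).
filtered = FILTER together BY large_k::val >= small_k::lo
                          AND large_k::val <= small_k::hi;

-- Drop the helper key and the relation prefixes from the field names.
result = FOREACH filtered GENERATE large_k::id AS id, large_k::val AS val,
                                   small_k::lo AS lo, small_k::hi AS hi;
```

After the join, the fields of each input need the `relation::field` prefix because both sides carry a field named `key`.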

Will the second approach improve performance?

+4
2 answers

For a large relation with 2M records and a small relation with 500K records, the replicated join was much faster.

, UDF, .

+2

CROSS is implemented using GFCross and COGROUP.

( 100%) , CROSS JOIN , .

+1
