PIG CROSS versus Replicated JOIN

Question


I need to do a non-equi join in Pig. The first thing I want to try is CROSS followed by a FILTER:

    together = CROSS A, B;
    filtered = FILTER together BY (JOIN PREDICATE);

However, one of the relations is certainly small enough to fit into memory. This makes me wonder how CROSS is actually implemented in Pig. Can it do a "replicated" CROSS?

If not, I could do something like this:

    small = FOREACH small GENERATE *, 1 AS key:int;
    large = FOREACH large GENERATE *, 1 AS key:int;
    together = JOIN large BY key, small BY key USING 'replicated';
    filtered = FILTER together BY (JOIN PREDICATE);

Will the second approach give better performance?

+4

hadoop apache-pig

user3909850 Aug 12 '14 at 10:21


2 answers

user3909850 · Answer 1 · 2014-08-13T20:54:13+0000

So, for a large relation with 2M records and a small relation with 500K records, the replicated join was much faster.
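For readers who want to try the same comparison, here is a minimal sketch of the constant-key replicated join with a concrete non-equi predicate. The input paths, schemas, aliases, and the range predicate are all hypothetical and are not the ones used in the test above.

```pig
-- Hypothetical inputs: paths and schemas are assumptions for illustration.
large = LOAD 'events' AS (id:long, ts:long);
small = LOAD 'windows' AS (name:chararray, start_ts:long, end_ts:long);

-- Add a constant key so that every row of large matches every row of small.
large_k = FOREACH large GENERATE *, 1 AS key:int;
small_k = FOREACH small GENERATE *, 1 AS key:int;

-- In a replicated join, the relation listed last is the one held in memory,
-- so the small relation goes on the right.
together = JOIN large_k BY key, small_k BY key USING 'replicated';

-- Apply the non-equi predicate (here, a hypothetical range check).
-- The :: prefix names which side of the join a field came from.
filtered = FILTER together BY large_k::ts >= small_k::start_ts
                          AND large_k::ts <  small_k::end_ts;

-- Drop the helper key columns from the result.
result = FOREACH filtered GENERATE large_k::id AS id, small_k::name AS name;
```

The join still produces every pair of rows before the filter runs, so this only changes how the pairs are generated, not how many there are.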


Gaurav Phapale · Answer 2 · 2014-08-12T16:34:55+0000


CROSS is implemented using GFCross and COGROUP.
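One way to check how your own Pig version plans a CROSS is to run EXPLAIN on the alias and read the logical, physical, and MapReduce plans it prints. A small sketch, assuming two hypothetical single-column inputs:

```pig
-- Hypothetical inputs: paths and schemas are assumptions for illustration.
A = LOAD 'a' AS (x:int);
B = LOAD 'b' AS (y:int);

together = CROSS A, B;

-- Prints the plans Pig would use to compute the alias.
-- The exact output differs between Pig versions and execution engines.
EXPLAIN together;
```

Running the same EXPLAIN on the constant-key replicated join shows how the two plans differ.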

