Renaming fields after JOIN takes time?

Question


In the following code, how much does renaming the fields after the join hurt the script's computation time? Does Pig optimize it away, or does it really go through each record?

    -- tables A: (f1, f2, id) and B: (g1, g2, id) to be joined by id
    C = JOIN A BY id, B BY id;
    C = FOREACH C GENERATE A::f1 AS f1, A::f2 AS f2, B::id AS id, B::g1 AS g1, B::g2 AS g2;

Does the FOREACH statement iterate over every record in C? If so, is there a way to optimize it?

Thanks.

+4

apache-pig

Navneet Aug 6 '12 at 18:15


1 answer

cyang · Accepted Answer · 2012-08-07T16:05:23+0000

Don't worry about optimizing this. Renaming the fields may add a small overhead, but it will not add another MapReduce job. The field projection happens in the reducer, right after the JOIN.

Consider the two code snippets below and the MapReduce plans that EXPLAIN produces for them.

Without renaming

    A = load 'first' using PigStorage() as (f1, f2, id);
    B = load 'second' using PigStorage() as (g1, g2, id);
    C = join A by id, B by id;
    store C into 'output';

    #--------------------------------------------------
    # Map Reduce Plan
    #--------------------------------------------------
    MapReduce node scope-30
    Map Plan
    Union[tuple] - scope-31
    |
    |---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
    |   |   |
    |   |   Project[bytearray][2] - scope-21
    |   |
    |   |---A: New For Each(false,false,false)[bag] - scope-7
    |       |   |
    |       |   Project[bytearray][0] - scope-1
    |       |   |
    |       |   Project[bytearray][1] - scope-3
    |       |   |
    |       |   Project[bytearray][2] - scope-5
    |       |
    |       |---A: Load(hdfs://location/first:PigStorage) - scope-0
    |
    |---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
        |   |
        |   Project[bytearray][2] - scope-23
        |
        |---B: New For Each(false,false,false)[bag] - scope-15
            |   |
            |   Project[bytearray][0] - scope-9
            |   |
            |   Project[bytearray][1] - scope-11
            |   |
            |   Project[bytearray][2] - scope-13
            |
            |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
    Reduce Plan
    C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
    |
    |---POJoinPackage(true,true)[tuple] - scope-32--------
    Global sort: false
    ----------------

With renaming

    A = load 'first' using PigStorage() as (f1, f2, id);
    B = load 'second' using PigStorage() as (g1, g2, id);
    C = join A by id, B by id;
    C = foreach C generate A::f1 as f1,   -- This
                           A::f2 as f2,   -- section
                           B::id as id,   -- is
                           B::g1 as g1,   -- different
                           B::g2 as g2;   --
    store C into 'output';

    #--------------------------------------------------
    # Map Reduce Plan
    #--------------------------------------------------
    MapReduce node scope-41
    Map Plan
    Union[tuple] - scope-42
    |
    |---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
    |   |   |
    |   |   Project[bytearray][2] - scope-21
    |   |
    |   |---A: New For Each(false,false,false)[bag] - scope-7
    |       |   |
    |       |   Project[bytearray][0] - scope-1
    |       |   |
    |       |   Project[bytearray][1] - scope-3
    |       |   |
    |       |   Project[bytearray][2] - scope-5
    |       |
    |       |---A: Load(hdfs://location/first:PigStorage) - scope-0
    |
    |---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
        |   |
        |   Project[bytearray][2] - scope-23
        |
        |---B: New For Each(false,false,false)[bag] - scope-15
            |   |
            |   Project[bytearray][0] - scope-9
            |   |
            |   Project[bytearray][1] - scope-11
            |   |
            |   Project[bytearray][2] - scope-13
            |
            |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
    Reduce Plan
    C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
    |
    |---C: New For Each(false,false,false,false,false)[bag] - scope-37
        |   |
        |   Project[bytearray][0] - scope-27
        |   |
        |   Project[bytearray][1] - scope-29
        |   |
        |   Project[bytearray][5] - scope-31
        |   |
        |   Project[bytearray][3] - scope-33
        |   |
        |   Project[bytearray][4] - scope-35
        |
        |---POJoinPackage(true,true)[tuple] - scope-43--------
    Global sort: false
    ----------------

The only difference is in the reduce plans. Without renaming:

    Reduce Plan
    C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
    |
    |---POJoinPackage(true,true)[tuple] - scope-32--------
    Global sort: false

versus with renaming:

    Reduce Plan
    C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
    |
    |---C: New For Each(false,false,false,false,false)[bag] - scope-37
        |   |
        |   Project[bytearray][0] - scope-27
        |   |
        |   Project[bytearray][1] - scope-29
        |   |
        |   Project[bytearray][5] - scope-31
        |   |
        |   Project[bytearray][3] - scope-33
        |   |
        |   Project[bytearray][4] - scope-35
        |
        |---POJoinPackage(true,true)[tuple] - scope-43--------
    Global sort: false

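With renaming, the reduce plan gains a single New For Each operator with five Project expressions, one per generated field (Project[5] is B::id, the sixth column of the joined tuple). No additional MapReduce node appears.

In short, there are other things in your script worth optimizing before you worry about renaming. Since every record is already processed because of the JOIN, the renaming is just a cheap extra step.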
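To check this on your own script, you can ask Pig for the plans yourself. The following is a minimal sketch that reuses the relations from the snippets above; the paths 'first' and 'second' are placeholders, and the alias D is introduced here only to keep the joined and renamed relations apart. It assumes the MapReduce execution engine, so the plan layout may look different on other engines or Pig versions.

```pig
-- Placeholder inputs with the same schemas as in the answer.
A = LOAD 'first' USING PigStorage() AS (f1, f2, id);
B = LOAD 'second' USING PigStorage() AS (g1, g2, id);

C = JOIN A BY id, B BY id;

-- D is a hypothetical alias for the renamed relation.
D = FOREACH C GENERATE A::f1 AS f1, A::f2 AS f2, B::id AS id,
                       B::g1 AS g1, B::g2 AS g2;

-- Show the schema after renaming.
DESCRIBE D;

-- Print the execution plans for D; look at the Reduce Plan section
-- to see whether the FOREACH sits in the same job as the JOIN.
EXPLAIN D;
```

If the renaming FOREACH appears inside the Reduce Plan of the same MapReduce node as the join, as in the plans above, it is not costing you an extra job.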
