Don't worry about optimizing this. Renaming the fields may add a small overhead, but it will not add another MapReduce job. The field projection happens in the reducer, right after the JOIN.
Consider the two scripts below and the MapReduce plans that explain produces for them.
Without renaming:
    A = load 'first' using PigStorage() as (f1, f2, id);
    B = load 'second' using PigStorage() as (g1, g2, id);
    C = join A by id, B by id;
    store C into 'output';

    #--------------------------------------------------
    # Map Reduce Plan
    #--------------------------------------------------
    MapReduce node scope-30
    Map Plan
    Union[tuple] - scope-31
    |
    |---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
    |   |   |
    |   |   Project[bytearray][2] - scope-21
    |   |
    |   |---A: New For Each(false,false,false)[bag] - scope-7
    |       |   |
    |       |   Project[bytearray][0] - scope-1
    |       |   |
    |       |   Project[bytearray][1] - scope-3
    |       |   |
    |       |   Project[bytearray][2] - scope-5
    |       |
    |       |---A: Load(hdfs://location/first:PigStorage) - scope-0
    |
    |---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
        |   |
        |   Project[bytearray][2] - scope-23
        |
        |---B: New For Each(false,false,false)[bag] - scope-15
            |   |
            |   Project[bytearray][0] - scope-9
            |   |
            |   Project[bytearray][1] - scope-11
            |   |
            |   Project[bytearray][2] - scope-13
            |
            |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
    Reduce Plan
    C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
    |
    |---POJoinPackage(true,true)[tuple] - scope-32--------
    Global sort: false
    ----------------
With renaming:
    A = load 'first' using PigStorage() as (f1, f2, id);
    B = load 'second' using PigStorage() as (g1, g2, id);
    C = join A by id, B by id;
    C = foreach C generate A::f1 as f1, -- This
                           A::f2 as f2, -- section
                           B::id as id, -- is
                           B::g1 as g1, -- different
                           B::g2 as g2; --
    store C into 'output';

    #--------------------------------------------------
    # Map Reduce Plan
    #--------------------------------------------------
    MapReduce node scope-41
    Map Plan
    Union[tuple] - scope-42
    |
    |---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
    |   |   |
    |   |   Project[bytearray][2] - scope-21
    |   |
    |   |---A: New For Each(false,false,false)[bag] - scope-7
    |       |   |
    |       |   Project[bytearray][0] - scope-1
    |       |   |
    |       |   Project[bytearray][1] - scope-3
    |       |   |
    |       |   Project[bytearray][2] - scope-5
    |       |
    |       |---A: Load(hdfs://location/first:PigStorage) - scope-0
    |
    |---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
        |   |
        |   Project[bytearray][2] - scope-23
        |
        |---B: New For Each(false,false,false)[bag] - scope-15
            |   |
            |   Project[bytearray][0] - scope-9
            |   |
            |   Project[bytearray][1] - scope-11
            |   |
            |   Project[bytearray][2] - scope-13
            |
            |---B: Load(hdfs://location/second:PigStorage) - scope-8--------
    Reduce Plan
    C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
    |
    |---C: New For Each(false,false,false,false,false)[bag] - scope-37
        |   |
        |   Project[bytearray][0] - scope-27
        |   |
        |   Project[bytearray][1] - scope-29
        |   |
        |   Project[bytearray][5] - scope-31
        |   |
        |   Project[bytearray][3] - scope-33
        |   |
        |   Project[bytearray][4] - scope-35
        |
        |---POJoinPackage(true,true)[tuple] - scope-43--------
    Global sort: false
    ----------------
The map plans are identical; the only difference is in the reduce plans. Without renaming:
    Reduce Plan
    C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
    |
    |---POJoinPackage(true,true)[tuple] - scope-32--------
    Global sort: false
versus with renaming:
    Reduce Plan
    C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
    |
    |---C: New For Each(false,false,false,false,false)[bag] - scope-37
        |   |
        |   Project[bytearray][0] - scope-27
        |   |
        |   Project[bytearray][1] - scope-29
        |   |
        |   Project[bytearray][5] - scope-31
        |   |
        |   Project[bytearray][3] - scope-33
        |   |
        |   Project[bytearray][4] - scope-35
        |
        |---POJoinPackage(true,true)[tuple] - scope-43--------
    Global sort: false
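To make the extra New For Each step concrete, here is a rough Python model of what the five Project operators do to each joined tuple. It is only a sketch and not Pig's actual implementation: the function name `project`, the constant `PROJECTION`, and the sample rows are all made up for illustration. The only things taken from the plan above are the positions 0, 1, 5, 3, 4 and the assumption that the joined tuple is laid out as A's three fields followed by B's three fields.

```python
from typing import Iterable, Iterator, Sequence, Tuple

# Positions read off the Project[bytearray][n] operators in the reduce plan.
# Assumed layout of a joined tuple: (A::f1, A::f2, A::id, B::g1, B::g2, B::id).
PROJECTION: Tuple[int, ...] = (0, 1, 5, 3, 4)


def project(joined_rows: Iterable[Sequence],
            positions: Sequence[int] = PROJECTION) -> Iterator[tuple]:
    """Pick the listed positions from each joined row, one row at a time."""
    for row in joined_rows:
        yield tuple(row[i] for i in positions)


# Hypothetical rows as they might come out of the join (ids match by definition).
joined = [
    ("a1", "a2", 7, "b1", "b2", 7),
    ("c1", "c2", 9, "d1", "d2", 9),
]

renamed = list(project(joined))  # rows now read as (f1, f2, id, g1, g2)
```

The point of the model is that the projection is a single pass over tuples the reducer is already handling, with a fixed number of index lookups per tuple, which is why it shows up as one more operator in the same reduce plan rather than as a new job.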
In short, there are other things in your script worth optimizing before you worry about renaming. Since every record has to pass through the reducer anyway because of the JOIN, the renaming is just a cheap extra step.