Generating all fields from one alias after a JOIN in Pig

I would like to do the equivalent of "keep all a in A where a.field == b.field for some b in B" in Apache Pig. I currently implement it like this:

 AB_joined = JOIN A by field, B by field;
 A2 = FOREACH AB_joined GENERATE A::field as field, A::field2 as field2, A::field3 as field3;

Enumerating all the fields of A is pretty tedious, and I would rather do something like

 A2 = FOREACH AB_joined GENERATE flatten(A); 

However, this does not work. Is there another way to do something equivalent without listing the fields of A?

+7
4 answers

This should work:

 A2 = FOREACH AB_joined GENERATE $0..;
+5

You can use COGROUP so that the columns of A are kept separate from the columns of B. This is especially useful when the schema is dynamic and you do not want your code to break when the schema changes.

 AB = COGROUP A BY field, B BY field;
 -- schema of AB will be:
 -- {group, A:{all fields of A}, B:{all fields of B}}
 A2 = FOREACH AB GENERATE FLATTEN(A);
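
One caveat, offered as a sketch rather than a definitive fix: COGROUP also produces a group for every key that appears only in A, so flattening A directly keeps the unmatched rows as well. Assuming the goal is only the rows of A that have at least one match in B, filtering on the built-in IsEmpty first should give that. The alias AB_matched is only an illustrative name, and A and B are assumed to be loaded already with a shared column called field.

 AB = COGROUP A BY field, B BY field;
 -- keep only the groups that have at least one row from B
 AB_matched = FILTER AB BY NOT IsEmpty(B);
 -- emit the rows of A from the remaining groups
 A2 = FOREACH AB_matched GENERATE FLATTEN(A);

Unlike the JOIN version, each row of A appears only once here, even when several rows of B share its key.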

Hope this helps.

+3

Since at least Pig 0.9.1, you can use either Star expressions or Project-Range expressions to select multiple fields from a tuple. For more information, see the "Expressions" chapter of the Pig Latin 0.15.0 documentation.

Here is an example to illustrate.

 -- A: {id: long, f1: int, f2: int, f3: int, f4: int}
 -- B: {id: long, f5: int}

Join A and B and select only the fields of A:

 AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$4;
 --AB: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int}

or

 BA = FOREACH (JOIN B BY id, A by id) GENERATE $2..;
 --BA: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int}

Select all fields using the Star expression:

 AB = FOREACH (JOIN A BY id, B by id) GENERATE *;
 --AB: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int, B::id: long, B::f5: int}

Select all distinct fields (without the duplicate B::id field) using a Project-Range expression:

 AB = FOREACH (JOIN A BY id, B by id) GENERATE $0..$4, f5;
 --AB: {A::id: long, A::f1: int, A::f2: int, A::f3: int, A::f4: int, B::f5: int}

This is really useful when you have dozens of fields in one relation and only a couple in the other.

+2

With Pig 0.12 and above, use PluckTuple: https://pig.apache.org/docs/r0.12.0/func.html#plucktuple

 AB_joined = JOIN A by field, B by field;
 DEFINE pluck PluckTuple('A::');
 A2 = FOREACH AB_joined generate FLATTEN(pluck(*));
+1
