Combining two datasets in Pig

I have a Pig script in which I load a dataset, split it into two separate datasets, do some calculations on each, and finally add another calculated field to each. Now I want to combine these two datasets.

 A = LOAD '/user/hdfs/file1' AS (a:int, b:int);
 A1 = FILTER A BY a > 100;
 A2 = FILTER A BY a <= 100 AND b > 100;
 -- Now I do some calculation on A1 and A2

So, essentially, after the calculations, here is the schema for both:

 {A1: {a:int, b:int, type:chararray}}
 {A2: {a:int, b:int, type:chararray}}

Now, before I write the result back to HDFS, I want to merge the two datasets again, something like UNION ALL in SQL. How can I do this?

2 answers

UNION should work for you, but your original schema does not match the result shown (b loads as chararray and later becomes int) - I assume this is a typo.

If the two relations have the same fields but in a different order, you can add the ONSCHEMA keyword to UNION so that fields are matched by name rather than by position:

 A_MERGED = UNION ONSCHEMA A1, A2; 
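
To make the whole flow concrete, here is a minimal sketch of the load, filter, calculate, merge and store steps. The question does not show the actual calculation, so the 'high' and 'low' labels, the A1_TYPED and A2_TYPED aliases, and the output path are hypothetical placeholders. Pig's UNION keeps duplicate tuples, which is what makes it comparable to UNION ALL in SQL, and ONSCHEMA requires that both inputs have a schema.

```pig
-- Sketch only: labels, *_TYPED aliases and the output path are assumptions.
A  = LOAD '/user/hdfs/file1' AS (a:int, b:int);
A1 = FILTER A BY a > 100;
A2 = FILTER A BY a <= 100 AND b > 100;

-- Stand-in for the real calculation: add a chararray field named type.
A1_TYPED = FOREACH A1 GENERATE a, b, 'high' AS type;
A2_TYPED = FOREACH A2 GENERATE a, b, 'low' AS type;

-- Merge by field name; duplicate tuples are kept and order is not preserved.
A_MERGED = UNION ONSCHEMA A1_TYPED, A2_TYPED;

-- Write the merged relation back to HDFS (hypothetical path).
STORE A_MERGED INTO '/user/hdfs/merged_output';
```

Running DESCRIBE A_MERGED; before the STORE is a quick way to check that the merged schema is the expected one.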

EDIT: See the Pig Latin documentation for UNION.


You can use SPLIT to create the two subsets and then UNION them:

 SPLIT A INTO A1 IF a > 100, A2 IF a <= 100 AND b > 100;
 A = UNION A1, A2;