Is there any specific reason why you are not using PigStorage? It could make life a lot easier for you :).
a = load '/file1' USING PigStorage('_') AS (num:int, char:chararray);
b = load '/file2' USING PigStorage('\t') AS (num:int, char:chararray);
c = join a by num, b by num;
dump c;
Also note that in file1 you used the underscore as a delimiter, but you specify "-" as an argument to STRSPLIT.
edit: I spent a little more time on the scripts you provided. Scripts 1 and 2 indeed do not work, and script 3 also works like this (without the additional foreach):
a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
c = join a1 by (int)($0), b by num;
dump c;
As for the source of the problem, my guess is that it is related to the following (as stated in the Pig documentation), combined with Pig's run-cycle optimizations:
If you FLATTEN a bag with empty inner schema, the schema for the resulting relation is null.
In your case, I believe that the STRSPLIT result schema is unknown before execution.
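One way to check this guess is to ask Pig which schema it infers at each step. This is a minimal sketch, assuming file1 holds one string per line (for example 1_a) as in the question; the alias names are only illustrative.

```pig
-- Load each line as a single chararray, as in the question's scripts.
a = load 'file1' as (str:chararray);
-- Split on the underscore into at most two parts and flatten the resulting tuple.
a1 = foreach a generate flatten(STRSPLIT(str, '_', 2));
-- Print the schema Pig has inferred for each alias; nothing is executed yet.
describe a;
describe a1;
```

If the schema reported for a1 lists no typed fields, that would suggest that any types declared on it afterwards are only your assertion about the data, not something Pig has checked.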
edit2: Ok, here is my theory explained:
This is the full explain output for script 2, and this is for script 3. I'll only include the interesting snippets here.
|---a2: (Name: LOForEach Schema: num#288:int,ch#289:chararray)
|   |   |
|   |   (Name: LOGenerate[false,false] Schema: num#288:int,ch#289:chararray)ColumnPrune:InputUids=[288, 289]ColumnPrune:OutputUids=[288, 289]
|   |   |   |
|   |   |   (Name: Cast Type: int Uid: 288)
|   |   |   |
|   |   |   |---num:(Name: Project Type: int Uid: 288 Input: 0 Column: (*))
The section above is for script 2; see the last line. Pig assumes that the output of flatten(STRSPLIT) has an integer as its first element (because you declared the schema that way). In fact, STRSPLIT has a null output schema, so its fields are treated as bytearray, and the output of flatten(STRSPLIT) is actually (n:bytearray, c:bytearray). Because you declared the schema, Pig tries to do a Java cast on the num field of a1's output. That fails, since num is actually a Java String represented as a bytearray. And because this Java cast fails, Pig never even attempts the explicit cast shown on the line above it.
Let's look at the situation for script 3:
|---a2: (Name: LOForEach Schema: num#85:int,ch#87:bytearray) | | | | | (Name: LOGenerate[false,false] Schema: num
See the last line: here the output of a1 is correctly treated as bytearray, so there is no problem. Now look at the second-to-last line: Pig tries (and succeeds) to make an explicit cast from bytearray to int.
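The pattern this points to can be sketched as follows: leave the flattened fields untyped and cast them explicitly in a separate step. This is only an illustration under the same assumptions about file1 as in the question; the alias a2 and the field names num and ch are chosen here to mirror the explain output, not taken from your script.

```pig
a = load 'file1' as (str:chararray);
-- No types are declared on the flattened fields at this point.
a1 = foreach a generate flatten(STRSPLIT(str, '_', 2));
-- Cast the first field explicitly; the second field is passed through unchanged.
a2 = foreach a1 generate (int)$0 as num, $1 as ch;
-- Show the schema Pig derives for the casted relation.
describe a2;
```

Keeping the cast in its own foreach (or in the join key, as in the edit above) makes the bytearray-to-int conversion explicit rather than relying on a declared schema.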