The schema of the flatten operator in Pig Latin

I recently ran into this problem while working with Pig. Here is a simple example that reproduces it.

Two files:
=== file1 ===
1_a
2_b
4_d

=== file2 (tab separated) ===
1 a
2 b
3 c

Pig script 1:

    a = load 'file1' as (str:chararray);
    b = load 'file2' as (num:int, ch:chararray);
    a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
    c = join a1 by num, b by num;
    dump c; -- exception: java.lang.String cannot be cast to java.lang.Integer

Pig script 2:

    a = load 'file1' as (str:chararray);
    b = load 'file2' as (num:int, ch:chararray);
    a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
    a2 = foreach a1 generate (int)num as num, ch as ch;
    c = join a2 by num, b by num;
    dump c; -- exception: java.lang.String cannot be cast to java.lang.Integer

Pig script 3:

    a = load 'file1' as (str:chararray);
    b = load 'file2' as (num:int, ch:chararray);
    a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
    a2 = foreach a1 generate (int)$0 as num, $1 as ch;
    c = join a2 by num, b by num;
    dump c; -- works

I don't understand why scripts 1 and 2 fail while script 3 works. I would also like to know whether there is a more concise way to get relation c. Thanks.

java apache-pig
1 answer

Is there any specific reason why you are not using PigStorage? Because it can make life a lot easier for you :).

    a = load '/file1' USING PigStorage('_') AS (num:int, char:chararray);
    b = load '/file2' USING PigStorage('\t') AS (num:int, char:chararray);
    c = join a by num, b by num;
    dump c;

Also note that in file1 you used the underscore as a delimiter, but you specify "-" as an argument to STRSPLIT.

Edit: I spent a little more time on the scripts you provided. Scripts 1 and 2 indeed do not work, and script 3 also works like this (without the additional foreach):

    a = load 'file1' as (str:chararray);
    b = load 'file2' as (num:int, ch:chararray);
    a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
    c = join a1 by (int)($0), b by num;
    dump c;

As for the source of the problem, my guess is that it is related to the following (as stated in the Pig documentation for flatten), combined with Pig's run cycle optimizations:

If you flatten a bag with empty inner schema, the schema for the resulting relation is null.

In your case, I believe the schema of the STRSPLIT result is unknown until runtime.

Edit 2: OK, here is my theory explained:

I generated the full explain output for script 2 and for script 3. I'll just paste the interesting snippets here.

    |---a2: (Name: LOForEach Schema: num#288:int,ch#289:chararray)
    |   |   |
    |   |   (Name: LOGenerate[false,false] Schema: num#288:int,ch#289:chararray)ColumnPrune:InputUids=[288, 289]ColumnPrune:OutputUids=[288, 289]
    |   |   |   |
    |   |   |   (Name: Cast Type: int Uid: 288)
    |   |   |   |
    |   |   |   |---num:(Name: Project Type: int Uid: 288 Input: 0 Column: (*))

The section above is for script 2; see the last line. Pig assumes that the first element of the flatten(STRSPLIT) output is an integer, because you declared the schema that way. In fact, STRSPLIT has a null output schema, whose fields are treated as bytearray, so the output of flatten(STRSPLIT) is actually (n:bytearray, c:bytearray). Because you provided a schema, Pig tries to do a plain Java cast of the num field (on the output of a1). That fails because num is actually a Java String represented as a bytearray. Since this Java cast fails, Pig never gets as far as the explicit cast on the line above it.
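The difference between a plain Java cast and an explicit conversion can be sketched in a few lines of Java. This only illustrates the error message from the question; it is not Pig's actual code, and the names `CastSketch`, `plainCast` and `parseValue` are made up for the example.

```java
final class CastSketch {
    // A plain Java cast only checks the runtime type; it never converts the value.
    static Integer plainCast(Object field) {
        return (Integer) field; // throws ClassCastException when field is a String
    }

    // An explicit conversion reads the text and builds a new Integer from it.
    static Integer parseValue(Object field) {
        return Integer.valueOf(field.toString().trim());
    }

    public static void main(String[] args) {
        Object num = "1"; // stand-in for the split result: a String, not an Integer
        try {
            plainCast(num);
        } catch (ClassCastException e) {
            System.out.println("plain cast failed: " + e.getMessage());
        }
        System.out.println("explicit conversion: " + parseValue(num));
    }
}
```

Declaring the field as int in the schema corresponds roughly to `plainCast` here: the declaration changes what the planner expects, not what the value actually is.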

Let's look at the situation for script 3:

    |---a2: (Name: LOForEach Schema: num#85:int,ch#87:bytearray)
    |   |   |
    |   |   (Name: LOGenerate[false,false] Schema: num#85:int,ch#87:bytearray)ColumnPrune:InputUids=[]ColumnPrune:OutputUids=[85, 87]
    |   |   |   |
    |   |   |   (Name: Cast Type: int Uid: 85)
    |   |   |   |
    |   |   |   |---(Name: Project Type: bytearray Uid: 85 Input: 0 Column: (*))

See the last line: here the output of a1 is correctly treated as bytearray, so there is no problem. Now look at the second-to-last line: Pig attempts (and succeeds at) an explicit cast from bytearray to int.
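For contrast, an explicit cast from bytearray to int has to interpret the raw bytes as text and parse them. Here is a rough Java sketch of that idea, assuming UTF-8 text and a null result for input that cannot be parsed. The helper name `bytesToInt` is hypothetical and this is not Pig's implementation.

```java
import java.nio.charset.StandardCharsets;

final class BytearrayCastSketch {
    // Decode the bytes as UTF-8 text, then parse the text as an integer.
    // Returns null for null or unparseable input (an assumption of this sketch).
    static Integer bytesToInt(byte[] raw) {
        if (raw == null) {
            return null;
        }
        String text = new String(raw, StandardCharsets.UTF_8).trim();
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] first = "2".getBytes(StandardCharsets.UTF_8);  // like $0 from "2_b"
        byte[] second = "b".getBytes(StandardCharsets.UTF_8); // like $1 from "2_b"
        System.out.println(bytesToInt(first));
        System.out.println(bytesToInt(second));
    }
}
```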
