Does throwing an exception in a Pig EvalFunc UDF skip that line or stop the job completely?

I have a user-defined function (UDF) written in Java that parses lines in a log file and returns the information to Pig, so Pig can do all the processing.

It looks something like this:

    public abstract class Foo extends EvalFunc<Tuple> {
        public Foo() {
            super();
        }

        public Tuple exec(Tuple input) throws IOException {
            try {
                // do stuff with input
            } catch (Exception e) {
                throw WrappedIOException.wrap("Error with line", e);
            }
        }
    }

My question is: if it throws an IOException, will the job stop completely, or will it still return results for the rest of the lines that don't throw an exception?

Example: I run this in Pig:

    REGISTER myjar.jar
    DEFINE Extractor com.namespace.Extractor();

    logs = LOAD '$IN' USING TextLoader AS (line: chararray);
    events = FOREACH logs GENERATE FLATTEN(Extractor(line));

With this input:

    1.5 7 "Valid Line"
    1.3 gghyhtt Inv"alid line"" I throw an exceptioN!!
    1.8 10 "Valid Line 2"

Will it process the two valid lines, so that "logs" has 2 tuples, or will it just die in a fire?

hadoop apache-pig
1 answer

If the UDF throws an exception, the task fails and is retried.

It is retried three more times (4 attempts by default), and after that the whole job is marked FAILED.

If you want to log the error without stopping the job, you can return null:

    public Tuple exec(Tuple input) throws IOException {
        try {
            // do stuff with input
        } catch (Exception e) {
            System.err.println("Error with ...");
            return null;
        }
    }

And filter the nulls out later in Pig:

    events_all = FOREACH logs GENERATE Extractor(line) AS line;
    events_valid = FILTER events_all by line IS NOT null;
    events = FOREACH events_valid GENERATE FLATTEN(line);

In your example, the output will contain only the two valid lines (but be careful with this behavior, since the error shows up only in the logs and will not fail your job!).
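To see the return-null-then-filter pattern on the sample input without a cluster, here is a minimal plain-Java sketch. It does not use the Pig API: `NullOnErrorDemo`, `parse`, and `parseAll` are hypothetical names, and the line format (a decimal, an integer, a quoted string) is an assumption inferred from the three example lines, not something stated in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NullOnErrorDemo {
    // Assumed line format (not given in the question): <decimal> <integer> "<text>"
    private static final Pattern LINE =
            Pattern.compile("^(\\d+\\.\\d+) (\\d+) \"([^\"]*)\"$");

    /** Stand-in for exec(): returns the parsed fields, or null if the line is invalid. */
    static Object[] parse(String line) {
        try {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                throw new IllegalArgumentException("Unparseable line: " + line);
            }
            return new Object[] {
                Double.parseDouble(m.group(1)),
                Integer.parseInt(m.group(2)),
                m.group(3)
            };
        } catch (Exception e) {
            // Log the error and signal "bad record" with null instead of rethrowing.
            System.err.println("Error with line: " + e.getMessage());
            return null;
        }
    }

    /** Stand-in for the FILTER ... IS NOT null step: keeps only non-null results. */
    static List<Object[]> parseAll(List<String> lines) {
        List<Object[]> valid = new ArrayList<>();
        for (String line : lines) {
            Object[] fields = parse(line);
            if (fields != null) {
                valid.add(fields);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "1.5 7 \"Valid Line\"",
                "1.3 gghyhtt Inv\"alid line\"\" I throw an exceptioN!!",
                "1.8 10 \"Valid Line 2\"");
        for (Object[] fields : parseAll(input)) {
            System.out.println(Arrays.toString(fields));
        }
    }
}
```

Under these assumptions, the invalid middle line is reported on stderr and dropped, and only the two valid records reach stdout, which mirrors what the filtered Pig relation would contain.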

Reply to comment #1:

Actually, the entire resulting tuple will be null (so there are no fields inside it).

For example, if your schema has 3 fields:

  events_all = FOREACH logs GENERATE Extractor(line) AS line:tuple(a:int,b:int,c:int); 

and some lines are invalid, you will get:

    ()
    ((1,2,3))
    ((1,2,3))
    ()
    ((1,2,3))

And if you do not filter out the null tuples and then try to access a field, you will get a java.lang.NullPointerException:

 events = FOREACH events_all GENERATE line.a; 