How to trim header line from files processed by Hadoop Pig?

I am trying to analyze delimited data files created by our services, using Pig on Amazon Elastic MapReduce. Everything is going well, except that all of our data files contain a header line that describes each column. Obviously, the header strings cannot be parsed as the numeric values of the data columns, so I get warnings from Pig like the one below:

2011-03-17 22:49:55,378 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.PigStorage: Unable to interpret value [<snip>] in field being converted to double, caught NumberFormatException <For input string: "headerName"> field discarded

I have a filter after the load statement that tries to guarantee I never operate on header lines (by filtering out the header terms), but I would like to get rid of the warning noise so that it does not mask real problems (e.g. actual data fields that fail to parse).

Is it possible?

3 answers

You can strip the header before submitting the Pig job (if possible), or try writing a UDF that emits null when certain conditions are met, so that you can filter those rows out afterwards.
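For the first option, a minimal sketch of a preprocessing step in Python, assuming the files can be rewritten locally before they are uploaded. The function name `strip_header` and its parameters are made up for illustration and are not part of Pig or EMR:

```python
import shutil


def strip_header(src_path, dst_path, header_lines=1):
    """Copy src_path to dst_path, skipping the first header_lines lines.

    Hypothetical helper: assumes the header occupies the first
    header_lines lines of the file and nothing else needs to change.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        for _ in range(header_lines):
            src.readline()  # read and discard one header line
        # Stream the rest of the file without loading it all into memory.
        shutil.copyfileobj(src, dst)
```

Calling `strip_header("myFile", "myFile.noheader")` on each file and uploading the result would let the load statement declare numeric types directly, so the conversion warning never fires for header rows.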


Another option, if you are not comfortable writing a UDF, might be something like this:

Sample data:

MyIntVal
123
456

Script:

A = load 's3://blah/myFile' USING PigStorage() as (myintval: chararray);

B = filter A by myintval neq 'MyIntVal';

C = foreach B generate (int)$0;

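The column is loaded as a chararray so that the header row parses without a warning, the header row is filtered out, and the remaining values are then cast to int.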


This should give you the result you want, using the RANK operator (available in Pig 0.11 and later):

input_file = load 'input' using PigStorage(',') as (row1:chararray, row2:chararray);
-- rank prepends a sequential row number, starting at 1
ranked = rank input_file;
/* ranked:{rank_input_file:long, row1:chararray, row2:chararray} */
-- drop the row ranked 1 (assumes the header is the first row loaded)
NoHeader = filter ranked by (rank_input_file > 1);
New_input_file = foreach NoHeader generate row1, row2;