How to trim header line from files processed by Hadoop Pig?

Question

How to trim header line from files processed by Hadoop Pig?

I am trying to analyze data-delimited files created by our services using Amazon Elastic Map Reduce through the Pig program. Everything is going well, except that all of our data files contain a header line that defines the purpose of each column. Obviously, the headers (string) cannot be entered in the numerical values of the data, so I get warnings from Pig, as shown below:

2011-03-17 22:49:55,378 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.PigStorage: Unable to interpret value [<snip>] in field being converted to double, caught NumberFormatException <For input string: "headerName"> field discarded

I have a filter after the download statement that tries to guarantee that I will not work in any header lines (by filtering the header terms), but I would like to get rid of the warning noise in order to avoid masking any potential problems (e.g. actual fields data that is not displayed properly).

Is it possible?

+5

hadoop apache-pig

Chris phillips Mar 17 '11 at 23:02

source share

3 answers

Another option, if you are not comfortable writing UDF, might be something like this:

Sample data:

MyIntVal
123
456

Script:

A = load 's3://blah/myFile' USING PigStorage() as (myintval: chararray);

B = filter A by myintval neq 'MyIntVal';

C = foreach B generate (int)$0;

, int.

, , , , .

+3

Dan 25 . '13 18:50

This will help you get the result: -

input_file = load 'input' using PigStorage(',') as (row1:chararay, row2:chararray);
ranked = rank input_file;
/* ranked:{rank_input_file:long, row1:chararay, row2:chararay} */
NoHeader = filter ranked by (rank_input_file > 1);
New_input_file = foreach NoHeader generate row1, row2;

0

Manish agrawal Jul 02 '15 at 17:35

source share

wlk · Accepted Answer · 2011-04-07T11:37:21+0000

You can do this before submitting the Pig job (if possible) or try writing a UDF that will emit null values if certain conditions are met, so you can filter it out later.

How to trim header line from files processed by Hadoop Pig?

More articles: