Max / Min for entire recordsets in PIG

Question

I have a set of records that I load from a file, and the first thing I need to do is get the max and min of a column. In SQL, I would do this with subqueries like this:

 select c.state,
        c.population,
        (select max(c.population) from state_info c) as max_pop,
        (select min(c.population) from state_info c) as min_pop
 from state_info c

I assume there is an easy way to do this in Pig, but I am having trouble finding it. Pig has MAX and MIN functions, but when I tried the following, it did not work:

 records = LOAD '/Users/Winter/School/st_incm.txt' AS (state:chararray, population:int);
 with_max = FOREACH records GENERATE state, population, MAX(population);

I did have some luck adding an extra column with the same value for every row, grouping on that column, and then taking the max over the resulting group. That seems like a convoluted way to get what I want, so I thought I would ask if anyone knows an easier way.

Thanks in advance for your help.

+8

hadoop apache-pig

Winter Mar 07 '11 at 18:17

1 answer

Romain · Accepted Answer · 2011-03-08T19:44:39+0000

As you said, you need to group all the data together, but the extra column is not required if you use GROUP ALL.

Pig

 records = LOAD 'states.txt' AS (state:chararray, population:int);
 records_group = GROUP records ALL;
 with_max = FOREACH records_group GENERATE FLATTEN(records.(state, population)), MAX(records.population);

Input

 CA 10
 VA 5
 WI 2

Output

 (CA,10,10)
 (VA,5,10)
 (WI,2,10)
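
The question also asks for the minimum, and the same GROUP ALL pattern can carry both aggregates. MAX and MIN operate on a bag, which is why the records have to be grouped first. The following is only a minimal sketch: it assumes states.txt is tab-delimited (the default for LOAD), borrows the aliases max_pop and min_pop from the SQL in the question, and uses with_max_min as a made-up relation name.

```pig
-- Load the sample file; the default loader splits fields on tabs.
records = LOAD 'states.txt' AS (state:chararray, population:int);

-- Collect every record into one group so the whole data set is a single bag.
records_group = GROUP records ALL;

-- Re-expand the bag into rows and attach both aggregates to each row.
with_max_min = FOREACH records_group GENERATE
    FLATTEN(records.(state, population)),
    MAX(records.population) AS max_pop,
    MIN(records.population) AS min_pop;

DUMP with_max_min;
```

With the sample input above, every row should carry 10 as max_pop and 2 as min_pop. The order of the rows is not guaranteed.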
