How to select the top 10% of a table's rows in Pig?

I need to select the top x% of the rows of a table in Pig. Can someone tell me how to do this without writing a UDF?

Thanks!

0
3 answers

As mentioned earlier, you first need to count the number of rows in the table. Once you have the count, you can do:

    A = load 'X' as (row);
    B = group A all;
    C = foreach B generate COUNT(A) as count;
    D = LIMIT A C.count/10; -- you might need a cast to integer here

The catch is that support for a dynamic (non-constant) argument to the LIMIT operator was only introduced in Pig 0.10. If you are working with an earlier version, an approach that uses the TOP function is suggested here.
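Note that LIMIT applied to an unordered relation gives no guarantee about which tuples come back, so to get the top rows rather than just any 10% of them, it probably helps to sort first. A minimal sketch, assuming Pig 0.10 or later, a single integer column (called `val` here purely for illustration), and that "top" means the largest values:

```pig
-- Assumption: 'X' holds one integer per row; the field name val is hypothetical.
A = LOAD 'X' AS (val:int);

-- Collapse everything into one group so the rows can be counted.
B = GROUP A ALL;
C = FOREACH B GENERATE COUNT(A) AS cnt;

-- Sort so that LIMIT keeps the largest values instead of arbitrary rows.
S = ORDER A BY val DESC;

-- C.cnt is used as a scalar; this form of LIMIT requires Pig 0.10+.
D = LIMIT S C.cnt / 10;

DUMP D;
```

The division is integer division, so with fewer than 10 rows the limit works out to 0 and nothing is returned.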

+3

I'm not sure how you would compute the percentage, but if you know that your table has 100 rows, you can use the LIMIT operator to get the top 10%, for example:

    A = load 'myfile' as (t, u, v);
    B = order A by t;
    C = limit B 10;

(The above is an example from http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+LIMIT+Operator )

As for a dynamic limit of 10%, I'm not sure you can do this without knowing how "big" the table is, and I'm fairly sure you could not do it in a UDF either. You would need to run one job to count the number of rows, then another job to execute the LIMIT query.
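One way to wire up that second job is Pig's parameter substitution: the row count from the first job is turned into a number outside of Pig and passed in on the command line. A rough sketch of the second script, assuming it is saved under the hypothetical name `top_rows.pig`, that `t` is an integer, and that "top" means the largest `t`:

```pig
-- Hypothetical invocation: pig -param n=10 top_rows.pig
-- n is the number of rows to keep (10% of the count from the first job).
-- The default below is only used when no -param is given.
%default n 10

-- Assumption: t is an integer; u and v are left untyped as in the example above.
A = LOAD 'myfile' AS (t:int, u, v);
B = ORDER A BY t DESC;

-- $n is replaced with a constant before the script runs,
-- so this does not depend on dynamic LIMIT support.
C = LIMIT B $n;

STORE C INTO 'top_rows';
```

Because the limit is a plain constant by the time Pig parses the script, this should also work on versions before 0.10.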

0

I won't write the Pig code, since it would take time to write and test, but this is how I would do it (assuming you need an exact solution; if not, there are simpler methods):

  • Take a sample of your input, say a few thousand data points.
  • Sort the sample and find n quantiles, where n should be on the order of the number of reducers you have, or more.
  • Go over the full data and count the data points that fall into each quantile interval.
  • At this point, the lower bound of the top 10% falls into one of these intervals. Find that interval (this is easy, since the counts tell you exactly where it is). Then, using the sum of the counts of the higher quantiles together with the data in that interval, find the 10% cutoff point within it.
  • Go over the data again and filter out everything except the points larger than the cutoff you just found.

Parts of this may require a UDF.
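For comparison, here is one guess at what a simpler, approximate variant might look like without a UDF. It is not the exact procedure described above: it estimates the cutoff from the sample alone, so the result is only roughly 10% of the rows. The sketch assumes Pig 0.10 or later, a single numeric column (named `v` for illustration), and a 1% sample that is large enough to be representative:

```pig
-- Assumption: 'X' holds one numeric value per row; the field name v is hypothetical.
data = LOAD 'X' AS (v:double);

-- Take roughly 1% of the rows as a sample.
smp = SAMPLE data 0.01;

-- Count the sampled rows.
smp_all = GROUP smp ALL;
smp_cnt = FOREACH smp_all GENERATE COUNT(smp) AS n;

-- Keep the largest 10% of the sample (dynamic LIMIT, Pig 0.10+).
smp_sorted = ORDER smp BY v DESC;
smp_top = LIMIT smp_sorted smp_cnt.n / 10;

-- The smallest value among those is the estimated cutoff.
top_all = GROUP smp_top ALL;
cutoff = FOREACH top_all GENERATE MIN(smp_top.v) AS t;

-- Second pass over the full data: keep everything at or above the cutoff.
result = FILTER data BY v >= cutoff.t;

STORE result INTO 'top_10_percent_approx';
```

If the sample ends up with fewer than 10 rows, the limit is 0, no cutoff is produced, and the final filter returns nothing, so the sampling rate needs to suit the input size.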

0
