Hive: is there a better way to assign percentile to a column?

Currently, to compute the percentile rank of a column in Hive, I am using something like the following. I am trying to rank the items in a column by the percentile they fall into, assigning each element a value from 0 to 1. The code below assigns a value from 0 to 9, essentially saying that an element with a char_percentile_rank of 0 is in the bottom 10% of the elements, and one with a value of 9 is in the top 10%. Is there a better way to do this?

    select
        item
      , characteristic
      , case
            when characteristic <= char_perc[0] then 0
            when characteristic <= char_perc[1] then 1
            when characteristic <= char_perc[2] then 2
            when characteristic <= char_perc[3] then 3
            when characteristic <= char_perc[4] then 4
            when characteristic <= char_perc[5] then 5
            when characteristic <= char_perc[6] then 6
            when characteristic <= char_perc[7] then 7
            when characteristic <= char_perc[8] then 8
            else 9
        end as char_percentile_rank
    from (
        select
            split(item_id,'-')[0] as item
          , split(item_id,'-')[1] as characteristic
          , char_perc
        from (
            select
                collect_set(concat_ws('-',item,characteristic)) as item_set
              , PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as char_perc
            from (
                select
                    item
                  , sum(characteristic) as characteristic
                from table
                group by item
            ) t1
        ) t2
        lateral view explode(item_set) explodetable as item_id
    ) t3

Note: I needed the collect_set to avoid a self-join, since the percentile function implicitly performs a group by.

I realized that the percentile function is terribly slow (at least in this use). Perhaps it would be better to manually calculate percentiles?

1 answer

Try removing one of the derived tables (subqueries):

    select
        item
      , characteristic
      , case
            when characteristic <= char_perc[0] then 0
            when characteristic <= char_perc[1] then 1
            when characteristic <= char_perc[2] then 2
            when characteristic <= char_perc[3] then 3
            when characteristic <= char_perc[4] then 4
            when characteristic <= char_perc[5] then 5
            when characteristic <= char_perc[6] then 6
            when characteristic <= char_perc[7] then 7
            when characteristic <= char_perc[8] then 8
            else 9
        end as char_percentile_rank
    from (
        select
            item
          , characteristic
          , PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) over () as char_perc
        from (
            select
                item
              , sum(characteristic) as characteristic
            from table
            group by item
        ) t1
    ) t2
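
As a possible alternative to the CASE ladder, Hive's windowing functions include percent_rank(), which returns each row's relative rank as a value from 0 to 1. The sketch below is not from the original posts and is untested here: it reuses the placeholder names from the question (table, item, characteristic), and the column name char_pct_rank is made up for illustration. Bucket boundaries may differ slightly from the PERCENTILE-based cutoffs, and whether it is faster would need to be measured.

```sql
-- Sketch only, assuming Hive's percent_rank() window function;
-- `table` is the placeholder table name used in the question.
select
    item
  , characteristic
    -- relative rank in [0, 1]: (rank - 1) / (total rows - 1)
  , percent_rank() over (order by characteristic) as char_pct_rank
    -- scale to buckets 0..9; least() keeps the top row (rank 1.0) in bucket 9
  , least(cast(floor(percent_rank() over (order by characteristic) * 10) as int), 9)
        as char_percentile_rank
from (
    select
        item
      , sum(characteristic) as characteristic
    from table
    group by item
) t1
```

Here char_pct_rank gives the 0-to-1 value asked for in the question, while char_percentile_rank mirrors the 0-to-9 buckets of the queries above.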