performance - Hive: Is there a better way to percentile rank a column?
Currently, I percentile rank a column in Hive using the following. I am trying to rank the items in a column by the percentile they fall under, assigning a value from 0 to 1 to each item. The code below assigns a value from 0 to 9, essentially saying that an item with a char_percentile_rank of 0 is in the bottom 10% of items, and an item with a value of 9 is in the top 10% of items. Is there a better way of doing this?
select
    item
  , characteristic
  , case
      when characteristic <= char_perc[0] then 0
      when characteristic <= char_perc[1] then 1
      when characteristic <= char_perc[2] then 2
      when characteristic <= char_perc[3] then 3
      when characteristic <= char_perc[4] then 4
      when characteristic <= char_perc[5] then 5
      when characteristic <= char_perc[6] then 6
      when characteristic <= char_perc[7] then 7
      when characteristic <= char_perc[8] then 8
      else 9
    end as char_percentile_rank
from (
    select
        split(item_id,'-')[0] as item
      , split(item_id,'-')[1] as characteristic
      , char_perc
    from (
        select
            collect_set(concat_ws('-',item,characteristic)) as item_set
          , percentile(bigint(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as char_perc
        from (
            select
                item
              , sum(characteristic) as characteristic
            from table
            group by item
        ) t1
    ) t2
    lateral view explode(item_set) explodetable as item_id
) t3
Note: I had to do a collect_set in order to avoid a self join, as the percentile function implicitly performs a group by.
I've gathered that the percentile function is horribly slow (at least in this usage). Perhaps it would be better to calculate the percentile manually?
Try removing one of your derived tables by computing the percentile as a window function:
select
    item
  , characteristic
  , case
      when characteristic <= char_perc[0] then 0
      when characteristic <= char_perc[1] then 1
      when characteristic <= char_perc[2] then 2
      when characteristic <= char_perc[3] then 3
      when characteristic <= char_perc[4] then 4
      when characteristic <= char_perc[5] then 5
      when characteristic <= char_perc[6] then 6
      when characteristic <= char_perc[7] then 7
      when characteristic <= char_perc[8] then 8
      else 9
    end as char_percentile_rank
from (
    select
        item
      , characteristic
      , percentile(bigint(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) over () as char_perc
    from (
        select
            item
          , sum(characteristic) as characteristic
        from table
        group by item
    ) t1
) t2
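Another option that may be worth trying, assuming your Hive version has the windowing and analytics functions (added in 0.11, as far as I know), is to let ntile do the bucketing directly. This is only a sketch against the placeholder names from the question (table, item, characteristic), not a drop-in replacement: ntile splits the rows into ten buckets of roughly equal row count, so tied values can land in different buckets and the results can differ from the percentile thresholds near the boundaries.

```sql
-- Sketch only: `table`, item and characteristic are the placeholder names
-- used in the question; substitute the real table and column names.
select
    item
  , characteristic
    -- ntile(10) numbers its buckets 1..10, so subtract 1 to get 0..9
  , ntile(10) over (order by characteristic) - 1 as char_percentile_rank
from (
    select
        item
      , sum(characteristic) as characteristic
    from table
    group by item
) t1
```

This drops both the percentile call and the case expression. Whether it is actually faster will depend on the data, since a window with an order by and no partition by has to sort every row in one place, so test it on your own table before relying on it.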