Calculating a running total using Hive UDF functions

I am new to Hive, so please pardon my ignorance in advance. I have a table as follows:

 SELECT a.storeid, a.smonth, a.sales FROM table a;

 1001  1  35000.0
 1002  2  35000.0
 1001  2  25000.0
 1002  3  110000.0
 1001  3  40000.0
 1002  1  40000.0

My objective result is as follows:

 1001  1  35000.0   35000.0
 1001  2  25000.0   60000.0
 1001  3  40000.0   100000.0
 1002  1  40000.0   40000.0
 1002  2  35000.0   75000.0
 1002  3  110000.0  185000.0

I wrote a simple Hive UDF running-sum class (rsum) to achieve the above and used SORT BY storeid, smonth in the query:

 SELECT a.storeid, a.smonth, a.sales, rsum(sales)
 FROM (SELECT * FROM table SORT BY storeid, smonth) a;

Obviously, this does not produce the desired output: there is only one reducer, the same UDF instance is called for every row, and so the running sum is computed over the entire result set. My goal is to reset the running-sum instance variable in the UDF class for each store, so that the evaluate function returns the result above. I considered the following:

1. Pass the store id as a second argument, rsum(sales, storeid), and handle the reset inside the UDF class.
2. Use 2 reducers, as in the following query:

 set mapred.reduce.tasks=2;
 SELECT a.storeid, a.smonth, a.sales, rsum(sales)
 FROM (SELECT * FROM table DISTRIBUTE BY storeid SORT BY storeid, smonth) a;

 1002  1  40000.0   40000.0
 1002  2  35000.0   75000.0
 1002  3  110000.0  185000.0
 1001  1  35000.0   35000.0
 1001  2  25000.0   60000.0
 1001  3  40000.0   100000.0

Why does 1002 always appear first? I would also welcome suggestions for other ways to achieve the same result (for example, subqueries or joins), besides the methods above. And what would be the time complexity of your proposed methods?
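The two-reducer behavior above can be simulated in plain Python (an illustration of the mechanism, not Hive's actual implementation, and Hive's real hash function may bucket stores differently): DISTRIBUTE BY routes each row to a reducer by a hash of storeid, each reducer sorts and sums its own rows, and the reducer outputs are simply concatenated, so the relative order of the stores depends only on which bucket each store hashes into:

```python
# Simulation of DISTRIBUTE BY storeid SORT BY storeid, smonth with 2 reducers.
rows = [
    (1001, 1, 35000.0), (1002, 2, 35000.0), (1001, 2, 25000.0),
    (1002, 3, 110000.0), (1001, 3, 40000.0), (1002, 1, 40000.0),
]

NUM_REDUCERS = 2
buckets = [[] for _ in range(NUM_REDUCERS)]
for storeid, smonth, sales in rows:
    # DISTRIBUTE BY storeid: all rows for one store land in one bucket.
    buckets[storeid % NUM_REDUCERS].append((storeid, smonth, sales))

output = []
for bucket in buckets:
    bucket.sort()                      # SORT BY storeid, smonth (per reducer)
    running = {}
    for storeid, smonth, sales in bucket:
        running[storeid] = running.get(storeid, 0.0) + sales
        output.append((storeid, smonth, sales, running[storeid]))

for row in output:
    print(row)
```

Here 1002 hashes to bucket 0 and 1001 to bucket 1, so the concatenated output always lists 1002 first; the per-store running sums themselves are correct.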

+5
source share
4 answers

Alternatively, you can take a look at this Hive ticket, which contains several windowing-function extensions.
Among other things, there is a running-sum implementation ( GenericUDFSum ).

This function (called "rsum") takes two arguments: a hash of the identifier (by which records are distributed among the reducers) and the corresponding values that need to be summed:

 select t.storeid, t.smonth, t.sales, rsum(hash(t.storeid), t.sales) as sales_sum
 from (select storeid, smonth, sales from sm
       distribute by hash(storeid) sort by storeid, smonth) t;

 1001  1  35000.0   35000.0
 1001  2  25000.0   60000.0
 1001  3  40000.0   100000.0
 1002  1  40000.0   40000.0
 1002  2  35000.0   75000.0
 1002  3  110000.0  185000.0
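The core idea of such a UDF can be sketched in plain Python (a hypothetical illustration, not the actual GenericUDFSum source): the instance keeps a running total and resets it whenever the key argument changes, which is why the rows must arrive sorted by store:

```python
# Hypothetical sketch of a stateful running-sum UDF that resets whenever
# the key argument changes, mirroring rsum(hash(storeid), sales).
class RunningSum:
    def __init__(self):
        self.prev_key = None
        self.total = 0.0

    def evaluate(self, key, value):
        if key != self.prev_key:       # new store: reset the accumulator
            self.prev_key = key
            self.total = 0.0
        self.total += value
        return self.total

rsum = RunningSum()
sorted_rows = [                        # already sorted by storeid, smonth
    (1001, 1, 35000.0), (1001, 2, 25000.0), (1001, 3, 40000.0),
    (1002, 1, 40000.0), (1002, 2, 35000.0), (1002, 3, 110000.0),
]
result = [rsum.evaluate(hash(storeid), sales)
          for storeid, smonth, sales in sorted_rows]
print(result)  # [35000.0, 60000.0, 100000.0, 40000.0, 75000.0, 185000.0]
```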
+4
source

Hive provides a better way to do this in a single query -
follow the process below to achieve your goal.

Create a Hive table that holds your data set -

 1001  1  35000.0
 1002  2  35000.0
 1001  2  25000.0
 1002  3  110000.0
 1001  3  40000.0
 1002  1  40000.0

Now just run the following query in Hive -

 SELECT storeid, smonth, sales,
        SUM(sales) OVER (PARTITION BY storeid ORDER BY smonth)
 FROM table_name;

The output will look like this:

 1001  1  35000.0   35000.0
 1001  2  25000.0   60000.0
 1001  3  40000.0   100000.0
 1002  1  40000.0   40000.0
 1002  2  35000.0   75000.0
 1002  3  110000.0  185000.0

I hope this helps you get the target result.
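As a quick sanity check, the semantics of SUM(sales) OVER (PARTITION BY storeid ORDER BY smonth) can be reproduced in plain Python (not Hive) with a sort followed by per-group cumulative sums:

```python
from itertools import accumulate, groupby

rows = [
    (1001, 1, 35000.0), (1002, 2, 35000.0), (1001, 2, 25000.0),
    (1002, 3, 110000.0), (1001, 3, 40000.0), (1002, 1, 40000.0),
]

# PARTITION BY storeid, ORDER BY smonth: sort, then group per store.
rows.sort(key=lambda r: (r[0], r[1]))
out = []
for storeid, grp in groupby(rows, key=lambda r: r[0]):
    grp = list(grp)
    cums = accumulate(r[2] for r in grp)   # running sum within the partition
    for (sid, smonth, sales), cum in zip(grp, cums):
        out.append((sid, smonth, sales, cum))

for row in out:
    print(row)
```

This produces the same six rows as the windowed query above, with the running total restarting at each new storeid.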

+9
source

 SELECT storeid, smonth, sales,
        SUM(sales) OVER (PARTITION BY storeid ORDER BY smonth) AS rsum
 FROM table;

0
source

This should do the trick:

 SELECT a.storeid, a.smonth, a.sales,
        SUM(a.sales) OVER (PARTITION BY a.storeid
                           ORDER BY a.smonth ASC
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
 FROM table a;

source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

0
source
