I am new to Hive, so please pardon my ignorance in advance for anything below. I have a table as follows:
SELECT a.storeid, a.smonth, a.sales FROM table a;

1001  1  35000.0
1002  2  35000.0
1001  2  25000.0
1002  3  110000.0
1001  3  40000.0
1002  1  40000.0
My desired result is as follows:
1001  1  35000.0   35000.0
1001  2  25000.0   60000.0
1001  3  40000.0   100000.0
1002  1  40000.0   40000.0
1002  2  35000.0   75000.0
1002  3  110000.0  185000.0
I wrote a simple Hive UDF running-sum class to achieve the above and used SORT BY storeid, smonth in the query:
SELECT a.storeid, a.smonth, a.sales, rsum(sales) FROM (SELECT * FROM table SORT BY storeid, smonth) a;
Obviously, this does not produce the desired output, since there is only one mapper and the same UDF instance is called for every row, which generates a running sum over the entire set. My goal is to reset the runSum instance variable in the UDF class for each storeid so that the evaluate function returns the above result. I have tried the following:
1. Passing the storeid as a second argument, rsum(sales, storeid), so that the UDF class can handle the reset correctly.
2. Using 2 reducers, as in the following query:
set mapred.reduce.tasks=2;
SELECT a.storeid, a.smonth, a.sales, rsum(sales) FROM (SELECT * FROM table DISTRIBUTE BY storeid SORT BY storeid, smonth) a;

1002  1  40000.0   40000.0
1002  2  35000.0   75000.0
1002  3  110000.0  185000.0
1001  1  35000.0   35000.0
1001  2  25000.0   60000.0
1001  3  40000.0   100000.0
Why does 1002 always appear first? I would also like to hear your suggestions for other methods (for example, subqueries or joins) by which I can achieve the same result, besides the ones above. Also, what would be the time complexity of your proposed methods?
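To make approach 1 concrete, here is a minimal sketch of the reset logic behind rsum(sales, storeid). It is an illustration under assumptions, not the actual UDF from this question: the class name RunningSum and the field currentStore are hypothetical, and the Hive dependency is left out so the snippet compiles on its own. In Hive the class would presumably be public and extend org.apache.hadoop.hive.ql.exec.UDF.

```java
// Hypothetical sketch of a running-sum UDF body that resets per store.
// Assumption: in Hive this would be a public class extending
// org.apache.hadoop.hive.ql.exec.UDF; that is omitted here so the
// example is self-contained.
class RunningSum {
    private Integer currentStore = null; // storeid of the previous row, null before the first row
    private double runSum = 0.0;         // running total for currentStore

    // Called once per row. Assumes rows arrive grouped by storeid and
    // ordered by smonth within each store.
    public double evaluate(double sales, int storeid) {
        if (currentStore == null || currentStore.intValue() != storeid) {
            // A new store has started: reset the running total.
            currentStore = storeid;
            runSum = 0.0;
        }
        runSum += sales;
        return runSum;
    }

    // Small driver that feeds the sample rows in sorted order.
    public static void main(String[] args) {
        RunningSum rsum = new RunningSum();
        int[] stores = {1001, 1001, 1001, 1002, 1002, 1002};
        double[] sales = {35000.0, 25000.0, 40000.0, 40000.0, 35000.0, 110000.0};
        for (int i = 0; i < stores.length; i++) {
            double total = rsum.evaluate(sales[i], stores[i]);
            System.out.println(stores[i] + " " + sales[i] + " " + total);
        }
    }
}
```

This only works if all rows for a given storeid reach the same UDF instance, consecutively and in smonth order, which is what DISTRIBUTE BY storeid SORT BY storeid, smonth is meant to arrange. If the rows for one store were interleaved with another's, the total would reset in the wrong places.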