Summing values of Hive array types

Hive has this rather nice array type, which is very useful in theory, but in practice I have found very little information on how to do any kind of operation with it. We store a series of numbers in an array-type column and need to sum them in a query, preferably from the nth to the mth element. Is this possible with standard HiveQL, or does it require a UDF or a custom mapper/reducer?

Note: we are using Hive 0.8.1 on EMR.

2 answers

I would write a simple UDF for this purpose. You need hive-exec on your build path.
For example, with Maven:

 <dependency>
   <groupId>org.apache.hive</groupId>
   <artifactId>hive-exec</artifactId>
   <version>0.8.1</version>
 </dependency>

A simple raw implementation would look like this:

 package com.myexample;

 import java.util.ArrayList;
 import java.util.List;

 import org.apache.hadoop.hive.ql.exec.UDF;
 import org.apache.hadoop.io.IntWritable;

 public class SubArraySum extends UDF {

     // Sums list[from, to): 'from' is inclusive, 'to' is exclusive (0-based).
     public IntWritable evaluate(ArrayList<Integer> list, IntWritable from, IntWritable to) {
         // -1 is returned when the array is NULL or empty
         IntWritable result = new IntWritable(-1);
         if (list == null || list.size() < 1) {
             return result;
         }

         int m = from.get(); // inclusive
         int n = to.get();   // exclusive
         // Note: subList throws IndexOutOfBoundsException if the range is invalid
         List<Integer> subList = list.subList(m, n);

         int sum = 0;
         for (Integer i : subList) {
             sum += i;
         }
         result.set(sum);
         return result;
     }
 }

Then create a jar and load it into the Hive shell:

 hive> add jar /home/user/jar/myjar.jar;
 hive> create temporary function subarraysum as 'com.myexample.SubArraySum';

Now you can use it to calculate the sum of your array.

For example:

Suppose you have an input file with columns separated by tabs:

 1	0,1,2,3,4
 2	5,6,7,8,9

Load it into a table:

 hive> create external table mytable (
         id int,
         nums array<int>
       )
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY '\t'
       COLLECTION ITEMS TERMINATED BY ','
       STORED AS TEXTFILE
       LOCATION '/user/hadoopuser/hive/input';

Run the following queries:

 hive> select * from mytable;
 1	[0,1,2,3,4]
 2	[5,6,7,8,9]

Sum the elements in the range from m (inclusive) to n (exclusive), where m = 1 and n = 3:

 hive> select subarraysum(nums, 1, 3) from mytable;
 3
 13

or

 hive> select sum(subarraysum(nums, 1, 3)) from mytable;
 16
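
The range semantics used in this answer (0-based, m inclusive, n exclusive) can be checked outside Hive with plain Java. The following is only a minimal sketch: the class `SubArrayRange` and the method `sumRange` are hypothetical names, not part of Hive or of the UDF above, and clamping out-of-range bounds (and returning 0 for an empty array) is an assumption of this sketch rather than the behaviour of `SubArraySum`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of Hive.
final class SubArrayRange {

    // Sums list[from, to): 'from' is inclusive, 'to' is exclusive (0-based).
    // Out-of-range bounds are clamped instead of throwing. This is an
    // assumption of this sketch, not what the original UDF does.
    static int sumRange(List<Integer> list, int from, int to) {
        if (list == null || list.isEmpty()) {
            return 0;
        }
        int start = Math.max(from, 0);
        int end = Math.min(to, list.size());
        int sum = 0;
        for (int i = start; i < end; i++) {
            Integer value = list.get(i);
            if (value != null) { // skip NULL elements
                sum += value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumRange(Arrays.asList(0, 1, 2, 3, 4), 1, 3));  // prints 3  (1 + 2)
        System.out.println(sumRange(Arrays.asList(5, 6, 7, 8, 9), 1, 3));  // prints 13 (6 + 7)
        System.out.println(sumRange(Arrays.asList(0, 1, 2, 3, 4), 3, 10)); // prints 7  (clamped to 3 + 4)
    }
}
```

The first two calls reproduce the per-row results of the query on the sample table; the same clamping could be moved into the UDF's `evaluate` method if throwing on a bad range is not desired.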

The answer above explains this reasonably well. I am posting a very simple UDF implementation.

 package com.ak.hive.udf.test;

 import java.util.ArrayList;

 import org.apache.hadoop.hive.ql.exec.UDF;

 public final class ArraySumUDF extends UDF {

     // Sums elements startIndex..endIndex (1-based, both inclusive).
     public int evaluate(ArrayList<Integer> arrayOfIntegers, int startIndex, int endIndex) {
         // TODO: add code to handle invalid indexes
         int sum = 0;
         for (int count = startIndex - 1; count < endIndex; count++) {
             sum += arrayOfIntegers.get(count);
         }
         return sum;
     }
 }
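
The code above leaves index handling open. One possible way to fill that in, shown as a standalone sketch that runs without Hive: the class `OneBasedArraySum` and its `sum` method are hypothetical names, and returning `null` for an invalid range is an assumption of this sketch (a Hive UDF that returns `null` would typically yield NULL for that row), not something the answer specifies.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not the answer author's code.
final class OneBasedArraySum {

    // Sums elements startIndex..endIndex, both 1-based and inclusive,
    // matching the loop in ArraySumUDF. Returns null when the list is
    // null or the range does not fit inside it (an assumption of this sketch).
    static Integer sum(List<Integer> values, int startIndex, int endIndex) {
        if (values == null
                || startIndex < 1
                || endIndex > values.size()
                || startIndex > endIndex) {
            return null;
        }
        int sum = 0;
        for (int i = startIndex - 1; i < endIndex; i++) {
            Integer value = values.get(i);
            if (value != null) { // skip NULL elements
                sum += value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> sample = Arrays.asList(3, 5, 8, 5, 7, 9); // illustrative values
        System.out.println(sum(sample, 1, 3)); // prints 16 (3 + 5 + 8)
        System.out.println(sum(sample, 0, 3)); // prints null (start below 1)
        System.out.println(sum(sample, 2, 7)); // prints null (end past the last element)
    }
}
```

Note that this answer's indexes are 1-based and inclusive on both ends, unlike the 0-based, end-exclusive range in the first answer, so the same arguments select different elements.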

Here are the table creation statement and the other queries as well.

 create table table1 (col1 int, col2 array<int>)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 COLLECTION ITEMS TERMINATED BY '~'
 STORED AS TEXTFILE;

 load data local inpath '/home/ak/Desktop/hivedata' into table table1;

My input file looks like this (fields separated by ',' and array items by '~'):

 1,3~5~8~5~7~9
 2,93~5~8~5~7~29
 3,3~95~8~5~27~9
 4,3~5~58~15~7~9
 5,3~25~8~55~7~49
 6,3~25~8~15~7~19
 7,3~55~78~5~7~9

I created a jar of my UDF and added it to Hive using the following command:

 add jar file:///home/ak/Desktop/array.jar; 

Then I created a temporary function as shown:

 create temporary function getSum as 'com.ak.hive.udf.test.ArraySumUDF'; 

Run the sample query as shown below.

 select col1,getSum(col2,1,3) from table1; 

This should cover the most basic need. If this is not what the question is asking for, reply so that I can help further.


Source: https://habr.com/ru/post/925184/

