Summing values of Hive array types

Question


Hive has a rather nice array type that is very useful in theory, but in practice I have found very little information on how to do anything with it. We store a series of numbers in an array-type column and need to sum them in a query, preferably from the nth to the mth element. Is this possible with standard HiveQL, or does it require a UDF or a custom mapper/reducer?

Note: we use Hive 0.8.1 in the EMR environment.

+6

arrays aggregation aggregate hadoop hive

Alex N. 12 sept '12 at 3:51


2 answers

The accepted answer is reasonably well explained. I am posting a very simple UDF implementation.

    package com.ak.hive.udf.test;

    import java.util.ArrayList;

    import org.apache.hadoop.hive.ql.exec.UDF;

    public final class ArraySumUDF extends UDF {

        // startIndex and endIndex are 1-based and both inclusive.
        public int evaluate(ArrayList<Integer> arrayOfIntegers, int startIndex, int endIndex) {
            // TODO: add code to handle out-of-range indexes
            int sum = 0;
            int count = startIndex - 1;
            for (; count < endIndex; count++) {
                sum += arrayOfIntegers.get(count);
            }
            return sum;
        }
    }
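The comment in the UDF above leaves index handling open. A minimal sketch of one way to fill that gap is shown below in plain Java, without the Hive dependency so it can be run on its own. The class name `SafeRangeSum`, the method `sumRange`, and the policy choices (clamp the range to the array bounds, treat a missing or empty array as 0, skip NULL elements) are assumptions for illustration, not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hive: sums elements startIndex..endIndex
// (1-based, both inclusive), clamping the range to the list bounds.
final class SafeRangeSum {

    static int sumRange(List<Integer> values, int startIndex, int endIndex) {
        if (values == null || values.isEmpty()) {
            return 0; // assumption: a missing or empty array sums to 0
        }
        // Clamp the 1-based inclusive range to [1, size].
        int first = Math.max(startIndex, 1);
        int last = Math.min(endIndex, values.size());
        int sum = 0;
        for (int i = first; i <= last; i++) {
            Integer v = values.get(i - 1); // convert to a 0-based index
            if (v != null) {               // assumption: NULL elements are skipped
                sum += v;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> row = Arrays.asList(3, 5, 8, 5, 7, 9); // illustrative data
        System.out.println(sumRange(row, 1, 3));  // 3 + 5 + 8 = 16
        System.out.println(sumRange(row, 5, 10)); // clamped to 7 + 9 = 16
    }
}
```

The same loop body could be dropped into `evaluate`. Whether clamping or returning NULL is the better behaviour for an out-of-range request depends on how the queries are used.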

The table creation statement and the other queries follow.

    create table table1 (col1 int, col2 array<int>)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '~'
    STORED AS TEXTFILE;

    load data local inpath '/home/ak/Desktop/hivedata' into table table1;

My input file looks like this (fields separated by ',', array items by '~'):

    1,3~5~8~5~7~9
    2,93~5~8~5~7~29
    3,3~95~8~5~27~9
    4,3~5~58~15~7~9
    5,3~25~8~55~7~49
    6,3~25~8~15~7~19
    7,3~55~78~5~7~9

I built a jar of my UDF and added it to Hive using the following command:

 add jar file:///home/ak/Desktop/array.jar;

Then I created a temporary function:

 create temporary function getSum as 'com.ak.hive.udf.test.ArraySumUDF';

Run the sample query as shown below.

 select col1,getSum(col2,1,3) from table1;

This should cover the most basic need. If this is not what the question is asking for, reply and I will try to help further.

+1

Arun ak 18 sept. '12 at 6:41


Lorand bendig · Accepted Answer · 2012-09-17T13:30:12+0000

I would write a simple UDF for this purpose. You need hive-exec on your build path. For example, with Maven:

    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>0.8.1</version>
    </dependency>

A simple raw implementation would look like this:

    package com.myexample;

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.IntWritable;

    public class SubArraySum extends UDF {

        public IntWritable evaluate(ArrayList<Integer> list, IntWritable from, IntWritable to) {
            IntWritable result = new IntWritable(-1);
            if (list == null || list.size() < 1) {
                return result;
            }

            int m = from.get();
            int n = to.get();

            // m: inclusive, n: exclusive
            List<Integer> subList = list.subList(m, n);

            int sum = 0;
            for (Integer i : subList) {
                sum += i;
            }
            result.set(sum);
            return result;
        }
    }

Then create a jar and load it into the Hive shell:

    hive> add jar /home/user/jar/myjar.jar;
    hive> create temporary function subarraysum as 'com.myexample.SubArraySum';

Now you can use it to calculate the sum of the array that you have.

For example, suppose you have an input file with tab-separated columns:

    1	0,1,2,3,4
    2	5,6,7,8,9

Load it into a table:

    hive> create external table mytable (
            id int,
            nums array<int>
          )
          ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t'
          COLLECTION ITEMS TERMINATED BY ','
          STORED AS TEXTFILE
          LOCATION '/user/hadoopuser/hive/input';

Run the following queries:

    hive> select * from mytable;
    1    [0,1,2,3,4]
    2    [5,6,7,8,9]

Sum the elements in the range from m (inclusive, zero-based) to n (exclusive), where m = 1 and n = 3:

    hive> select subarraysum(nums, 1, 3) from mytable;
    3
    13

or

    hive> select sum(subarraysum(nums, 1, 3)) from mytable;
    16
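To make the range semantics concrete, the sketch below reproduces the two query results in plain Java, outside Hive. It assumes the same convention as the UDF (zero-based `from` inclusive, `to` exclusive, as in `List.subList`). The class name `HalfOpenRangeDemo` and the method `subArraySum` are illustrative names only.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-alone demo; not a Hive UDF.
final class HalfOpenRangeDemo {

    // Sums the elements with 0-based indices from (inclusive) up to to (exclusive).
    static int subArraySum(List<Integer> nums, int from, int to) {
        int sum = 0;
        for (int value : nums.subList(from, to)) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> row1 = Arrays.asList(0, 1, 2, 3, 4);
        List<Integer> row2 = Arrays.asList(5, 6, 7, 8, 9);
        int s1 = subArraySum(row1, 1, 3); // indices 1 and 2: 1 + 2 = 3
        int s2 = subArraySum(row2, 1, 3); // indices 1 and 2: 6 + 7 = 13
        System.out.println(s1 + " " + s2 + " " + (s1 + s2)); // prints "3 13 16"
    }
}
```

Note that the other answer's `getSum(col2, 1, 3)` uses a different convention (1-based, both ends inclusive), so the same arguments select different elements in the two UDFs.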
