"reduce" the rowset in the "Hive" to another rowset

I use Hive to batch process my spatial database. The trace table looks something like this:

object | lat | long | timestamp 1 | X11 | X12 | T11 1 | X21 | X22 | T12 2 | X11 | X12 | T21 1 | X31 | X22 | T13 2 | X21 | X22 | T22 

I want to compare each lat length of each object with a number (for example, think about matching maps), but to get the result, the algorithm must take into account a number of adjacent data points. For example, I need all 3 data points of object 1 to map each of these three data points to a number. It is impossible to process them one by one.

I'm going to use map-reduce with a hive using transform, but I'm not sure how to do this. Can someone please help me?

0
source share
1 answer

You can use the custom map reduction feature in Hive.

With the following:

 add file /some/path/identity.pl; add file /some/path/collect.pl; from ( from trace_input MAP id, lat, lon, ts USING './identity.pl' as id, lat, lon, ts CLUSTER BY id) map_output REDUCE id, lat, lon, ts USING './collect.pl' as id, list 

trace_input contains your trace data as described above:

 create table trace_input(id string, lat string, lon string, ts string) row format delimited fields terminated by '\t' stored as textfile ; 

identity.pl is a simple script to unload each line (there may also be a script to select only lat, long fields):

 #!/usr/bin/perl while (<STDIN>) { print; } 

collect.pl (example here ) is a simple script that collects consecutive lines with the same object identifier, saves the remainder of each line and unloads the line with the identifier and a comma-separated list (tab delimiter).

The cluster proposal ensures that the reducers get the correctly sorted input needed to build the script.

Custom script output is divided into STRING column tabs.

Running the query will result in the following output:

 1 X11,X12,T11,X21,X22,T12,X31,X22,T13 2 X11,X12,T21,X21,X22,T22 

You can change the script map to restrict the columns and / or change the script shortening to add results or separate lat, lon from ts, etc.

If this form is sufficient, you can insert it directly into the results table by adding to it before insertion:

 from ( from trace_input MAP id, lat, lon, ts USING './identity.pl' as id, lat, lon, ts CLUSTER BY id) map_output INSERT overwrite table trace_res REDUCE id, lat, lon, ts USING './collect.pl'; 

Fields will be converted from string fields according to the trace_res schema as needed.

If you use collection types like me, you can also do something like:

 create table trace_res as select sq.id, split(sq.list,",") from ( from ( from trace_input MAP id, lat, lon, ts USING './identity.pl' as id, lat, lon, ts CLUSTER BY id) map_output REDUCE id, lat, lon, ts USING './collect.pl' as (id int, list string) ) sq; 

This second field in the created table will be a list of all lat, lon, ts; but will probably have a more complex table.

+2
source

All Articles