I have a Hive table that contains data about customer calls. For simplicity, consider that it has two columns: the first contains the customer identifier and the second contains the call timestamp (a Unix timestamp).
I can query this table to find all the calls for each customer:
SELECT * FROM mytable SORT BY customer_id, call_time;
Result:
Customer1 timestamp11
Customer1 timestamp12
Customer1 timestamp13
Customer2 timestamp21
Customer3 timestamp31
Customer3 timestamp32
...
Is it possible to create a Hive query that returns, for each customer, starting from the second call, the time interval between two successive calls? In the above example, the query should return:
Customer1 timestamp12-timestamp11
Customer1 timestamp13-timestamp12
Customer3 timestamp32-timestamp31
...
I tried to adapt solutions from SQL, but I ran into the limitations of Hive: it accepts subqueries only in FROM, and joins must contain only equality conditions.
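(For reference, a typical SQL-only approach is a self-join whose join condition contains an inequality, which is exactly what Hive rejects. A hypothetical sketch of the kind of query I could not adapt:

SELECT t2.customer_id, MIN(t2.call_time - t1.call_time) AS delta
FROM mytable t1
JOIN mytable t2
  ON t1.customer_id = t2.customer_id   -- equality: accepted by Hive
 AND t1.call_time < t2.call_time       -- inequality: not allowed in Hive joins
GROUP BY t2.customer_id, t2.call_time;

Here MIN picks, for each call, the gap to the nearest earlier call of the same customer; first calls are dropped by the inner join.)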
Thanks.
EDIT1:
I tried writing a Hive UDF:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;

public class DeltaComputerUDF extends UDF {
    private String previousCustomerId;
    private long previousCallTime;

    public String evaluate(String customerId, LongWritable callTime) {
        long callTimeValue = callTime.get();
        String timeDifference = null;
        // Emit a delta only if this row has the same customer as the previous row
        if (customerId.equals(previousCustomerId)) {
            timeDifference = Long.toString(callTimeValue - previousCallTime);
        }
        previousCustomerId = customerId;
        previousCallTime = callTimeValue;
        return timeDifference;
    }
}
and registered it under the name "delta".
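(For completeness, the registration looks roughly like this, assuming the class is in the default package; the jar path is just a placeholder:

ADD JAR /path/to/delta-udf.jar;
CREATE TEMPORARY FUNCTION delta AS 'DeltaComputerUDF';
)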
But it seems (from the logs and the results) that it runs during the MAP phase. Two problems follow from this:
First: the table data must be sorted by customer ID and timestamp before the function is applied. The query:
SELECT customer_id, call_time, delta(customer_id, call_time) FROM mytable DISTRIBUTE BY customer_id SORT BY customer_id, call_time;
does not work, because the sorting happens during the REDUCE phase, long after my function has run.
I can sort the table data before using the function, but I am not happy with this, because it is overhead I hope to avoid.
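(The overhead would look like a separate pre-sorting pass, something along these lines, with mytable_sorted as a hypothetical staging table:

CREATE TABLE mytable_sorted AS
SELECT customer_id, call_time FROM mytable
DISTRIBUTE BY customer_id SORT BY customer_id, call_time;

which writes the whole table a second time just to establish the order.)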
Second: in a distributed Hadoop configuration, the data is split among the available task trackers. So there will be several instances of this function, one per mapper, and the same customer's records can be split between two mappers. In that case I would lose customer call intervals, which is unacceptable.
I do not know how to solve this problem. I know that DISTRIBUTE BY ensures that all records with a given value are sent to the same reducer (so that SORT BY works as expected); does anyone know if there is something similar for the mapper?
Next, I plan to follow libjack's suggestion to use a reduce script. This computation is needed in between some other Hive queries, so I want to try everything Hive offers before moving to another tool, as Balaswamy Vaddeman suggested.
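(As I understand it, the streaming approach would look roughly like this, where delta_reducer.py is a hypothetical script that reads sorted (customer_id, call_time) rows from stdin and emits the per-customer deltas:

ADD FILE delta_reducer.py;
SELECT TRANSFORM (customer_id, call_time)
USING 'python delta_reducer.py'
AS customer_id, delta
FROM (SELECT customer_id, call_time FROM mytable
      DISTRIBUTE BY customer_id SORT BY customer_id, call_time) t;
)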
EDIT2:
I started looking into the custom-scripts solution. But on the first page of Chapter 14 of the Programming Hive book (the chapter that presents custom scripts), I found the following paragraph:
Streaming is usually less efficient than coding the comparable UDFs or InputFormat objects. Serializing and deserializing data to pass it in and out of the pipe is relatively inefficient. It is also harder to debug the whole program in a unified manner. However, it is useful for fast prototyping and for leveraging existing code that is not written in Java. For Hive users who don't want to write Java code, it can be a very effective approach.
So it became clear that custom scripts are not the best solution in terms of efficiency.
But how do I keep my UDF and still make sure it works as expected in a distributed Hadoop configuration? I found the answer in the UDF Internals section of the Hive wiki page. If I write my query as:
SELECT customer_id, call_time, delta(customer_id, call_time)
FROM (SELECT customer_id, call_time FROM mytable
      DISTRIBUTE BY customer_id SORT BY customer_id, call_time) t;
then the UDF's evaluate() method runs during the REDUCE phase, and the DISTRIBUTE BY and SORT BY constructs guarantee that all records for the same customer are processed by the same reducer, in call order.
So the UDF above, together with this query construct, solves my problem.
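(A side note for future readers: if your Hive version supports windowing functions, which I believe were added in Hive 0.11, LAG can express this directly, without a custom UDF:

SELECT customer_id, call_time,
       call_time - LAG(call_time) OVER (PARTITION BY customer_id ORDER BY call_time) AS delta
FROM mytable;

The first call of each customer gets a NULL delta, matching the "starting from the second call" requirement.)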
(Sorry to not add links, but I am not allowed to do this because I do not have enough reputation points)