Is it possible to execute Hive queries in parallel by writing a separate MapReduce program?

Question

I asked some questions about improving the performance of Hive queries. Some of the answers concerned the number of mappers and reducers. I tried using several mappers and reducers, but I did not see any difference in performance. I don't know why; maybe I did not do it correctly, or I missed something else.

I would like to know whether it is possible to execute Hive queries in parallel. What I mean is that queries are normally executed one after another in a queue. For example:

Query1

Query2

Query3

...

It takes too much time to complete, and I want to shorten the execution time.

I need to know: if we use MapReduce in the Hive JDBC program, is it possible to execute the queries in parallel? I don't know whether this will work, but that is what I am trying to achieve.

I restate my questions below:

1) If it is possible to run multiple Hive queries in parallel, does that require multiple Hive Thrift servers?

2) Is it possible to open multiple Hive Thrift servers?

3) I assume it is not possible to open multiple Hive Thrift servers on the same port. Is that correct?

4) Can I open multiple Hive Thrift servers on different ports?

Please suggest a solution for this. If you have another alternative, I will try that too.

mapreduce hive

Bhavesh shah May 11 '12 at 11:58

1 answer

Mark Grover · Accepted Answer · 2012-05-12T14:59:04+0000

As you already know, Hive is an SQL-like interface for Hadoop and MapReduce. Any non-trivial Hive query is compiled into MapReduce jobs and run on Hadoop. MapReduce is a parallel processing framework, so each of your Hive queries already runs and processes data in parallel. By default, Hive uses Hadoop's FIFO scheduler to schedule jobs, so only one Hive query can execute at a given time, and the next query starts when the previous one has finished. In most cases, I suggest that people optimize individual Hive queries instead of parallelizing multiple Hive queries. If you are inclined to parallelize Hive queries, this may indicate that your cluster is not being used efficiently. To further analyze the performance and resource usage of your Hive queries, you can install a distributed monitoring system such as Ganglia to monitor your cluster usage (Amazon EMR also supports it).
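To make the client side of this concrete, here is a minimal sketch of submitting several queries at once from one program. `run_query` is a hypothetical stand-in for whatever executes a statement over its own Hive connection (for example a JDBC or Thrift client call); it is not a real Hive API. Whether the resulting jobs actually overlap on the cluster still depends on the Hadoop scheduler, so under a FIFO scheduler they may simply queue up as described above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(sql):
    """Hypothetical placeholder: replace with code that opens its own
    Hive connection, executes `sql`, and returns the result."""
    return "done: " + sql


def run_all(queries, runner=run_query, max_workers=4):
    """Submit all queries to a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `queries`,
        # regardless of which query finishes first.
        return list(pool.map(runner, queries))


if __name__ == "__main__":
    for line in run_all(["Query1", "Query2", "Query3"]):
        print(line)
```

Using one connection per worker is an assumption made here to keep the threads independent; measuring cluster usage first, as suggested above, would show whether concurrent submission helps at all.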

In short, you do not need to write a MapReduce program; that is what you are using Hive for in the first place. However, if you know something about your data that Hive does not, your Hive queries may perform suboptimally. For example, your data may be sorted by some columns, and Hive may not be aware of this. In such cases, if you cannot supply this additional metadata to Hive, it might make sense to write a MapReduce job that takes the extra information into account and potentially gives better performance. In most cases, I have found Hive's performance to be on par with that of a MapReduce job equivalent to the Hive query.
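As an illustration of why knowing the sort order can matter, the sketch below joins two inputs that are already sorted by key in a single pass, with no re-sorting or hashing. This is plain Python rather than a MapReduce job, the function name `merge_join` is made up for the example, and it assumes keys are unique within each input; it shows one way sort order can be exploited, not something the answer prescribes.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key.

    Assumes keys are unique within each input. Runs in one linear pass
    because the existing sort order makes re-sorting unnecessary.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, left_val = left[i]
        right_key, right_val = right[j]
        if left_key == right_key:
            out.append((left_key, left_val, right_val))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1  # left side is behind; advance it
        else:
            j += 1  # right side is behind; advance it
    return out


if __name__ == "__main__":
    a = [(1, "a"), (3, "b"), (5, "c")]
    b = [(3, "x"), (4, "y"), (5, "z")]
    print(merge_join(a, b))
```

A general-purpose engine that does not know the inputs are sorted would typically have to sort or shuffle them first, which is the extra work a hand-written job could avoid.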
