How does Hive decide when to use MapReduce and when not to?

As a simple example,

select * from tablename;

DOES NOT launch a MapReduce job, but

select count(*) from tablename;

DOES. What is the general principle Hive uses to decide when to run MapReduce?

+5
4 answers

In general, any kind of aggregation, such as min, max, or count, requires a MapReduce job. That rule of thumb will not cover every case, though.

As in many RDBMSs, there is an EXPLAIN keyword that describes how your query is translated into MapReduce jobs. Try running EXPLAIN on both of your sample queries to see what Hive is doing behind the scenes.
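A minimal sketch of that comparison, using the table name from the question. The exact plan text depends on the Hive version and execution engine, so treat the comments as what one would typically look for rather than guaranteed output.

```sql
-- Plan for the plain scan: look for a plan that consists only of a
-- fetch stage (a "Fetch Operator") with no MapReduce stage.
EXPLAIN SELECT * FROM tablename;

-- Plan for the aggregation: look for an additional stage that
-- performs the map and reduce work before the final fetch.
EXPLAIN SELECT count(*) FROM tablename;
```

Comparing the stage lists of the two plans shows directly whether Hive intends to launch a job for a given query.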

+6

select * from tablename;

This only reads the raw data from the table's files in HDFS, so it does not need MapReduce.

+1

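For a query like select * from tablename, Hive reads the data without doing any aggregation (min/max/count ..). It uses a FetchTask rather than a mapreduce task.

This is an optimization in Hive. It is governed by the hive.fetch.task.conversion property (FETCH).

It is similar to reading a file directly with hadoop: hadoop fs -cat _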

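With select colNames from tablename, on the other hand, Hive has to parse each row to extract the requested "column" values, so a MapReduce job may be needed.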

+1

This is an optimization: the hive.fetch.task.conversion property lets Hive use a FETCH task and so avoid the latency overhead of MapReduce.

For simple queries that involve only SELECT, FILTER, and LIMIT, this property makes Hive skip MapReduce and use a FETCH task instead.

This property can have 3 values: none, minimal, and more. The default is minimal in Hive 0.10 through 0.13 and more from Hive 0.14 onward.
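As a rough illustration, the property can be changed per session with SET. The table name comes from the question; col1 is a hypothetical column name, and whether a given query is actually converted to a fetch also depends on the Hive version and other settings.

```sql
-- Print the current value of the property.
SET hive.fetch.task.conversion;

-- Disable the conversion: even a plain SELECT * is then run as a job.
SET hive.fetch.task.conversion=none;
SELECT * FROM tablename;

-- Allow the conversion for simple column selections, filters, and limits.
-- col1 is a hypothetical column name.
SET hive.fetch.task.conversion=more;
SELECT col1 FROM tablename LIMIT 10;
```

Running EXPLAIN on the same query under each setting is an easy way to confirm which path Hive picks.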

-1
