Hive breaks ORC files into small parts

create table n_data (
  MARKET   string,
  CATEGORY string,
  D        map<string,string>,
  monthid  int,
  value    double
)
STORED AS ORC;

I loaded data into it (more than 45,000,000 rows) and then looked at the table's directory in the Hive warehouse:

(screenshot: file listing of the table's directory in the warehouse)

The resulting table consists of 5 files of 10–20 MB each, while dfs.block.size is set to 128 MB. Storing lots of small files like this is suboptimal for HDFS: each file adds NameNode overhead, and jobs reading the table get many small splits.

How can I get Hive to merge its output into fewer, larger files, closer to the 128 MB HDFS block size?

EDIT: the insert query:

insert into n_data
select tmp.market,
       tmp.category,
       d,
       adTable.monthid,
       tmp.factperiod[adTable.monthid] as fact
from (select market, category, d, factperiod, map_keys(factperiod) as month_arr
      from n_src
      where market is not null) as tmp
LATERAL VIEW explode(month_arr) adTable AS monthid
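For reference, a minimal sketch of what the LATERAL VIEW does here (the map contents are hypothetical, invented for illustration): one source row whose factperiod map has N keys is expanded into N output rows, one per monthid.

```sql
-- Illustration only, hypothetical values: explode a map's keys into rows.
-- Suppose one n_src row has factperiod = map(201801, 10.5, 201802, 11.0).
SELECT adTable.monthid,
       t.factperiod[adTable.monthid] AS fact
FROM (SELECT map(201801, 10.5, 201802, 11.0) AS factperiod) t
LATERAL VIEW explode(map_keys(t.factperiod)) adTable AS monthid;
-- should yield two rows from the single source row:
--   (201801, 10.5) and (201802, 11.0)
```

So a 45M-row source with multi-key maps multiplies into an even larger output, which the reducers then write out as several separate files.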
1 answer

You need to set the following Hive configuration parameters:

hive.merge.mapfiles = true
hive.merge.mapredfiles = true
hive.merge.tezfiles = true
hive.merge.smallfiles.avgsize = 16000000

I had the same problem until I found this source. You can try setting these parameters manually in a Hive session with the "set" command:

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=16000000;
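If the table has already been written with small files, Hive can also merge the existing ORC files in place via ALTER TABLE ... CONCATENATE (supported for ORC tables; table name taken from the question):

```sql
-- Merge the table's existing small ORC files into larger ones in place.
-- For a partitioned table you would run this per partition
-- (ALTER TABLE ... PARTITION (...) CONCATENATE).
ALTER TABLE n_data CONCATENATE;
```

The target size of merged files produced by the hive.merge.* settings is controlled by hive.merge.size.per.task, so you can raise or lower that if the merged files still don't match your block size.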

You can check the current values with "set;". To make the change permanent, add these parameters to hive-site.xml, or set them through Ambari if you use the Hortonworks distribution. Good luck!

