Compressing files in Hive partitions using INSERT OVERWRITE

I am trying to solve the small-files problem by compacting the files under a Hive partition with the INSERT OVERWRITE ... PARTITION command in Hadoop.

Query:

SET hive.exec.compress.output=true;
SET mapred.max.split.size=256000000;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=256000000;


INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1, col2, col3 FROM tbl1
WHERE year=2016 AND month=03 AND day=11;

Input Files:

For testing purposes, I have three files of 40 MB each under the Hive partition (2016/03/11) in HDFS.

2016/03/11/file1.csv

2016/03/11/file2.csv

2016/03/11/file3.csv
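
The sizes can be checked from the Hive CLI with a `dfs` command, a sketch (the warehouse path here is hypothetical; it depends on the table's actual location and partition layout):

dfs -du -h /user/hive/warehouse/tbl1/year=2016/month=3/day=11;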

My HDFS block size is 128 MB, so I would expect a single output file, but I get 3 separate compressed files instead.

Please help me find the Hive configuration that limits the output files. If I do not use compression, I get a single file.

Hive version: 1.1

Answer:

You get 3 compressed files because the number of output files is decided by the number of tasks that write the final output, not by the HDFS block size: your three 40 MB files are read by three mappers, and a map-only job writes one file per mapper.

Whether you can reduce that number depends on your SQL. If the query has a reduce phase (a GROUP BY, JOIN, or similar), the number of output files equals the number of reducers, so forcing a single reducer gives you a single file:

set mapred.reduce.tasks = 1;

A plain SELECT like yours has no reduce phase, so this setting does nothing on its own: the job stays map-only and produces one file per mapper. You can, however, force a shuffle, as in the sketch below.
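
A minimal sketch, assuming your original table and query: DISTRIBUTE BY is a standard HiveQL clause that forces a reduce phase, so combined with a single reducer the whole partition is rewritten as one file.

set mapred.reduce.tasks = 1;

-- DISTRIBUTE BY routes every row through the reduce phase; with one reducer,
-- the partition is rewritten as a single file.
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1, col2, col3 FROM tbl1
WHERE year=2016 AND month=03 AND day=11
DISTRIBUTE BY col1;

Keep in mind that a single reducer serializes the write, so this suits partitions of modest size.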

If you want to stay map-only, the alternative is fewer mappers, which means combining several small input files into one split. For compressed input this combining is off by default (false), so it has to be enabled:

set hive.hadoop.supports.splittable.combineinputformat = true;
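
Putting the pieces together, a sketch assembled from the settings already in your script (the per-node and per-rack minimums are standard Hadoop split parameters added here; the values assume three 40 MB inputs and a 256 MB target):

set hive.hadoop.supports.splittable.combineinputformat = true;
-- allow up to 256 MB of input per split, so 3 x 40 MB fits in one mapper
set mapred.max.split.size = 256000000;
set mapred.min.split.size.per.node = 256000000;
set mapred.min.split.size.per.rack = 256000000;
-- merge small output files of map-only and map-reduce jobs
set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true;
set hive.merge.size.per.task = 256000000;
set hive.merge.smallfiles.avgsize = 256000000;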

Separately, there is a threshold for what Hive treats as a table small enough to load into memory for a map-side join:

set hive.mapjoin.smalltable.filesize = 25000000;

Finally, if you can store the table as ORC instead of plain text, you pick the codec that ORC uses for its output:

set hive.exec.orc.default.compress = SNAPPY; -- or ZLIB, NONE, etc.
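
As a sketch of the per-table alternative (tbl1_orc and the string column types are hypothetical, modeled on the three columns in your query), the codec can also be fixed in the table definition:

CREATE TABLE tbl1_orc (col1 string, col2 string, col3 string)
PARTITIONED BY (year int, month int, day int)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

Unlike Snappy-compressed text, ORC stays splittable when compressed, so its files combine and merge cleanly.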
