I am trying to solve a small-files problem by compacting the files under a Hive partition with the INSERT OVERWRITE ... PARTITION command on Hadoop.
Query:
SET hive.exec.compress.output=true;              -- compress the final output
SET mapred.max.split.size=256000000;             -- max input split size (256 MB)
SET mapred.output.compression.type=BLOCK;        -- block-level compression
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.merge.mapredfiles=true;                 -- merge small files after the map-reduce job
SET hive.merge.size.per.task=256000000;          -- target size of merged files (256 MB)
SET hive.merge.smallfiles.avgsize=256000000;     -- merge when avg output file size is below this

INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1, col2, col3 FROM tbl1
WHERE year=2016 AND month=03 AND day=11;
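I know that forcing the insert through a single reducer would also give one file; here is a sketch of that variant (DISTRIBUTE BY col1 is an arbitrary choice, any expression that forces a reduce phase would do), although I would prefer a merge-based solution:

SET mapred.reduce.tasks=1;   -- one reducer => one output file per partition
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1, col2, col3 FROM tbl1
WHERE year=2016 AND month=03 AND day=11
DISTRIBUTE BY col1;          -- forces a reduce stage instead of a map-only insert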
Input Files:
For testing purposes, I have three files, 40 MB each, under the Hive partition (year=2016/month=03/day=11) in HDFS:
2016/03/11/file1.csv
2016/03/11/file2.csv
2016/03/11/file3.csv
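This is how I check the partition contents from the Hive CLI (the warehouse path below is an assumption about where my table lives):

dfs -ls /user/hive/warehouse/tbl1/year=2016/month=03/day=11;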
My HDFS block size is 128 MB, so I expected a single output file, but instead I get three separate compressed files. If I do not use compression, I do get a single file.
Please help me find the Hive configuration that controls the size of the merged output so the partition is written as one file.
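For completeness, these are all the merge-related settings I have come across so far (a sketch; whether the Tez option applies depends on the execution engine):

SET hive.merge.mapfiles=true;               -- merge files from map-only jobs (default true)
SET hive.merge.mapredfiles=true;            -- merge files from map-reduce jobs
SET hive.merge.tezfiles=true;               -- merge files from Tez jobs (Tez engine only)
SET hive.merge.size.per.task=256000000;     -- target size of merged files
SET hive.merge.smallfiles.avgsize=256000000; -- avg-size threshold that triggers the merge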
Hive version: 1.1