I am trying to solve a small-files problem by compacting the files under a Hive partition with the INSERT OVERWRITE ... PARTITION command on Hadoop.
Query:
SET hive.exec.compress.output=true;              -- compress the final output
SET mapred.max.split.size=256000000;             -- max input split size (256 MB)
SET mapred.output.compression.type=BLOCK;        -- block-level compression
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.merge.mapredfiles=true;                 -- merge small files after the map-reduce job
SET hive.merge.size.per.task=256000000;          -- target size of merged files (256 MB)
SET hive.merge.smallfiles.avgsize=256000000;     -- merge when avg output file size is below this

INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1, col2, col3 FROM tbl1
WHERE year=2016 AND month=03 AND day=11;
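I know that forcing the insert through a single reducer would also give one file; here is a sketch of that variant (DISTRIBUTE BY col1 is an arbitrary choice, any expression that forces a reduce phase would do), although I would prefer a merge-based solution:

SET mapred.reduce.tasks=1;   -- one reducer => one output file per partition
INSERT OVERWRITE TABLE tbl1 PARTITION (year=2016, month=03, day=11)
SELECT col1, col2, col3 FROM tbl1
WHERE year=2016 AND month=03 AND day=11
DISTRIBUTE BY col1;          -- forces a reduce stage instead of a map-only insert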
Input Files:
For testing purposes, I have three files, 40 MB each, under the Hive partition (year=2016/month=03/day=11) in HDFS:
2016/03/11/file1.csv
2016/03/11/file2.csv
2016/03/11/file3.csv
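This is how I check the partition contents from the Hive CLI (the warehouse path below is an assumption about where my table lives):

dfs -ls /user/hive/warehouse/tbl1/year=2016/month=03/day=11;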
My HDFS block size is 128 MB, so I expected a single output file, but instead I get three separate compressed files. If I do not use compression, I do get a single file.
Please help me find the Hive configuration that controls the size of the merged output so the partition is written as one file.
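For completeness, these are all the merge-related settings I have come across so far (a sketch; whether the Tez option applies depends on the execution engine):

SET hive.merge.mapfiles=true;               -- merge files from map-only jobs (default true)
SET hive.merge.mapredfiles=true;            -- merge files from map-reduce jobs
SET hive.merge.tezfiles=true;               -- merge files from Tez jobs (Tez engine only)
SET hive.merge.size.per.task=256000000;     -- target size of merged files
SET hive.merge.smallfiles.avgsize=256000000; -- avg-size threshold that triggers the merge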
Hive version: 1.1