Setting a compression codec for an INSERT OVERWRITE SELECT in Hive

I have a Hive table like:

CREATE TABLE beacons (
  foo string,
  bar string,
  foonotbar string
)
COMMENT "Digest of daily beacons, by day"
PARTITIONED BY ( day string COMMENT "In YYYY-MM-DD format" );

To populate it, I am doing something like:

  SET hive.exec.compress.output=True;
  SET io.seqfile.compression.type=BLOCK;

  INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" )
  SELECT
    someFunc(query, "foo") as foo,
    someFunc(query, "bar") as bar,
    otherFunc(query, "foo||bar") as foonotbar
  FROM raw_logs
  WHERE day = "2011-01-26";

This creates the new partition with the output files compressed using deflate, but ideally they would go through the LZO compression codec instead.

Unfortunately, I'm not quite sure how to do this, but I assume it is either one of the many runtime parameters or perhaps just an extra line in the CREATE TABLE DDL.

+6
compression hadoop configuration hive
1 answer

Before your INSERT OVERWRITE, set the following runtime configuration values:

 SET hive.exec.compress.output=true;
 SET io.seqfile.compression.type=BLOCK;
 SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
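
Put together with the query from the question, a minimal sketch might look like the following (it assumes the hadoop-lzo package that provides com.hadoop.compression.lzo.LzopCodec is installed on the cluster):

 -- Sketch: write the partition with LZO-compressed output files.
 -- Assumes the hadoop-lzo jars/native libraries are available on the cluster.
 SET hive.exec.compress.output=true;
 SET io.seqfile.compression.type=BLOCK;
 SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

 INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" )
 SELECT
   someFunc(query, "foo") as foo,
   someFunc(query, "bar") as bar,
   otherFunc(query, "foo||bar") as foonotbar
 FROM raw_logs
 WHERE day = "2011-01-26";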

Also make sure the required compression codec is available by checking that it is listed in:

 io.compression.codecs 
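
For example, from the Hive CLI you can print the property's current value; com.hadoop.compression.lzo.LzopCodec should appear in the returned list:

 -- Print the current value of io.compression.codecs from the Hive CLI;
 -- the LZO codec class should be among the listed codecs.
 SET io.compression.codecs;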

Further information on io.seqfile.compression.type can be found here http://wiki.apache.org/hadoop/Hive/CompressedStorage

I may be mistaken, but it seems that the BLOCK type compresses records in large blocks, which yields a higher compression ratio than a smaller set of individually compressed records.
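
For reference, the compression type is set as below; to my knowledge the valid values for SequenceFile output are NONE, RECORD and BLOCK:

 -- NONE: no compression; RECORD: each record compressed on its own;
 -- BLOCK: batches of records compressed together (usually the best ratio).
 SET io.seqfile.compression.type=BLOCK;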

+13
