Setting a compression codec for an INSERT OVERWRITE SELECT in Hive

I have a Hive table like:

CREATE TABLE beacons (
  foo string,
  bar string,
  foonotbar string
)
COMMENT "Digest of daily beacons, by day"
PARTITIONED BY ( day string COMMENT "In YYYY-MM-DD format" );

To populate it, I am doing something like:

  SET hive.exec.compress.output=True;
  SET io.seqfile.compression.type=BLOCK;

  INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" )
  SELECT
    someFunc(query, "foo") as foo,
    someFunc(query, "bar") as bar,
    otherFunc(query, "foo||bar") as foonotbar
  FROM raw_logs
  WHERE day = "2011-01-26";

This creates the new partition with the output files compressed using deflate, but ideally they would go through the LZO compression codec instead.

Unfortunately, I'm not quite sure how to do this, but I assume it is either one of the many runtime parameters or perhaps just an extra line in the CREATE TABLE DDL.

+6
compression hadoop configuration hive
1 answer

Before your INSERT OVERWRITE, set the following runtime configuration values:

 SET hive.exec.compress.output=true;
 SET io.seqfile.compression.type=BLOCK;
 SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
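
Put together with the query from the question, a minimal sketch might look like the following (it assumes the hadoop-lzo package that provides com.hadoop.compression.lzo.LzopCodec is installed on the cluster):

 -- Sketch: write the partition with LZO-compressed output files.
 -- Assumes the hadoop-lzo jars/native libraries are available on the cluster.
 SET hive.exec.compress.output=true;
 SET io.seqfile.compression.type=BLOCK;
 SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

 INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" )
 SELECT
   someFunc(query, "foo") as foo,
   someFunc(query, "bar") as bar,
   otherFunc(query, "foo||bar") as foonotbar
 FROM raw_logs
 WHERE day = "2011-01-26";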

Also make sure the required compression codec is available by checking that it is listed in:

 io.compression.codecs 
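
For example, from the Hive CLI you can print the property's current value; com.hadoop.compression.lzo.LzopCodec should appear in the returned list:

 -- Print the current value of io.compression.codecs from the Hive CLI;
 -- the LZO codec class should be among the listed codecs.
 SET io.compression.codecs;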

Further information on io.seqfile.compression.type can be found here http://wiki.apache.org/hadoop/Hive/CompressedStorage

I may be mistaken, but it seems that the BLOCK type compresses records in large blocks, which yields a higher compression ratio than a smaller set of individually compressed records.
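
For reference, the compression type is set as below; to my knowledge the valid values for SequenceFile output are NONE, RECORD and BLOCK:

 -- NONE: no compression; RECORD: each record compressed on its own;
 -- BLOCK: batches of records compressed together (usually the best ratio).
 SET io.seqfile.compression.type=BLOCK;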

+13
