I recently set up LZO compression on a cluster. Other posts contain useful links, but the actual code you want for LZO compression is here: https://github.com/kevinweil/hadoop-lzo .
Out of the box, you can use GZIP compression, BZIP2 compression, and Unix compress. Just upload a file in one of these formats. When the file is used as input to a job, Hadoop detects the compression from the file extension and picks the matching codec automatically. To compress a job's output, you set the compression properties yourself; here is an example for LZO:
-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
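For context, those flags go on the Hadoop Streaming command line. A sketch of a full invocation (the jar path, HDFS paths, and mapper/reducer commands are placeholder assumptions, not from the answer):

```shell
# Hypothetical paths -- adjust to your installation.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input /user/me/input \
    -output /user/me/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc \
    -jobconf mapred.output.compress=true \
    -jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
```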
Why would you want to enable LZO compression? A Cloudera article goes into the details. LZO is a splittable compression format (unlike GZIP, for example). That means a single file can be split into chunks that are handed to different mappers. With a non-splittable compressed file, a single mapper receives the entire file, which can leave you with too few mappers and move too much data across your network.
BZIP2 is also splittable, and it compresses better than LZO; however, it is very slow. LZO has a worse compression ratio than GZIP, but it is optimized to be extremely fast. In fact, it can even increase your throughput by reducing disk I/O.
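LZO isn't in the Python standard library, but the ratio-versus-speed trade-off described above can be illustrated for the GZIP/BZIP2 side of the comparison with the stdlib gzip and bz2 modules (the sample data and timing harness are mine, not from the answer):

```python
import bz2
import gzip
import time

# Repetitive sample data, roughly like a log file.
data = b"2023-01-01 INFO request handled in 12ms\n" * 50_000

# Compare compressed size and wall-clock time for each codec.
for name, compress in (("gzip", gzip.compress), ("bz2", bz2.compress)):
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(data)} -> {len(out)} bytes in {elapsed:.3f}s")
```

On data like this you should see bzip2 produce smaller output while taking noticeably longer, which is the same trade-off the answer describes between BZIP2 and LZO.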
It takes a bit of work to set up, and it's a bit of a pain to use, but it's worth it (transparent compression would be awesome). Once again, the steps are:
- Install LZO and LZOP (command line utility)
- Install hadoop-lzo
- Upload a file compressed with LZOP.
- Index the file as described in the hadoop-lzo wiki (the index makes the file splittable).
- Run your job (with the appropriate mapred.output.compress and mapred.output.compression.codec parameters).
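The steps above can be sketched as a shell session (the package names, jar path, and file names are illustrative assumptions; the LzoIndexer class is the one documented in the hadoop-lzo README):

```shell
# 1. Install LZO and the lzop command-line utility (Debian/Ubuntu shown).
sudo apt-get install liblzo2-dev lzop

# 2. Build and install hadoop-lzo per https://github.com/kevinweil/hadoop-lzo,
#    placing the jar and native libraries on Hadoop's classpath.

# 3. Compress a file and upload it to HDFS.
lzop big_file.txt                       # produces big_file.txt.lzo
hadoop fs -put big_file.txt.lzo /data/

# 4. Index the file so it can be split across mappers.
hadoop jar /path/to/hadoop-lzo.jar \
    com.hadoop.compression.lzo.LzoIndexer /data/big_file.txt.lzo

# 5. Run the job with the mapred.output.compress and
#    mapred.output.compression.codec parameters shown earlier.
```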
schmmd