I recently set up LZO compression on a cluster. Other posts contain useful links, but the actual code you want for LZO compression is here: https://github.com/kevinweil/hadoop-lzo .
Out of the box, you can use GZIP compression, BZIP2 compression, and Unix compress. Just upload a file in one of these formats. When the file is used as input to a job, Hadoop detects the compression from the file extension and picks the matching codec automatically. To compress a job's output, you set the compression properties yourself; here is an example for LZO:
-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
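For context, those flags go on the Hadoop Streaming command line. A sketch of a full invocation (the jar path, HDFS paths, and mapper/reducer commands are placeholder assumptions, not from the answer):

```shell
# Hypothetical paths -- adjust to your installation.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input /user/me/input \
    -output /user/me/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc \
    -jobconf mapred.output.compress=true \
    -jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
```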
Why would you want to enable LZO compression? A Cloudera article goes into the details. LZO is a splittable compression format (unlike GZIP, for example). That means a single file can be split into chunks that are handed to different mappers. With a non-splittable compressed file, a single mapper receives the entire file, which can leave you with too few mappers and move too much data across your network.
BZIP2 is also splittable, and it compresses better than LZO; however, it is very slow. LZO has a worse compression ratio than GZIP, but it is optimized to be extremely fast. In fact, it can even increase your throughput by reducing disk I/O.
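LZO isn't in the Python standard library, but the ratio-versus-speed trade-off described above can be illustrated for the GZIP/BZIP2 side of the comparison with the stdlib gzip and bz2 modules (the sample data and timing harness are mine, not from the answer):

```python
import bz2
import gzip
import time

# Repetitive sample data, roughly like a log file.
data = b"2023-01-01 INFO request handled in 12ms\n" * 50_000

# Compare compressed size and wall-clock time for each codec.
for name, compress in (("gzip", gzip.compress), ("bz2", bz2.compress)):
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(data)} -> {len(out)} bytes in {elapsed:.3f}s")
```

On data like this you should see bzip2 produce smaller output while taking noticeably longer, which is the same trade-off the answer describes between BZIP2 and LZO.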
It takes a bit of work to set up, and it's a bit of a pain to use, but it's worth it (transparent compression would be awesome). Once again, the steps are:
- Install LZO and LZOP (command line utility)
- Install hadoop-lzo
- Upload a file compressed with LZOP.
- Index the file as described in the hadoop-lzo wiki (the index makes the file splittable).
- Run your job (with the appropriate mapred.output.compress and mapred.output.compression.codec parameters).
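The steps above can be sketched as a shell session (the package names, jar path, and file names are illustrative assumptions; the LzoIndexer class is the one documented in the hadoop-lzo README):

```shell
# 1. Install LZO and the lzop command-line utility (Debian/Ubuntu shown).
sudo apt-get install liblzo2-dev lzop

# 2. Build and install hadoop-lzo per https://github.com/kevinweil/hadoop-lzo,
#    placing the jar and native libraries on Hadoop's classpath.

# 3. Compress a file and upload it to HDFS.
lzop big_file.txt                       # produces big_file.txt.lzo
hadoop fs -put big_file.txt.lzo /data/

# 4. Index the file so it can be split across mappers.
hadoop jar /path/to/hadoop-lzo.jar \
    com.hadoop.compression.lzo.LzoIndexer /data/big_file.txt.lzo

# 5. Run the job with the mapred.output.compress and
#    mapred.output.compression.codec parameters shown earlier.
```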
schmmd