1) Compressing input files
If the input file is compressed, fewer bytes are read from HDFS, which means less time is spent reading data. This time saving benefits the job.
If the input files are compressed, they are decompressed automatically as they are read by MapReduce, which uses the file name extension to determine which codec to use. For example, a file ending in .gz is identified as a gzip-compressed file and is therefore read with GzipCodec.
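As an illustration, the extension-to-codec lookup that MapReduce performs for input files can also be done explicitly with Hadoop's CompressionCodecFactory. A minimal sketch, assuming a standard Hadoop client classpath (the input path is hypothetical):

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadCompressedFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path(args[0]);  // e.g. a .gz file in HDFS (hypothetical path)

        // Pick the codec from the file name extension, as MapReduce does for input files.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(input);

        InputStream in = (codec == null)
                ? fs.open(input)                           // no known extension: read as-is
                : codec.createInputStream(fs.open(input)); // compressed: decompress on the fly
        IOUtils.copyBytes(in, System.out, 4096, true);
    }
}
```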
2) Compressing output files
We often need to keep job output as history files. If the daily output is large and we regularly need to keep historical results for later use, the accumulated results will occupy a lot of HDFS space. Since these history files are accessed only rarely, storing them uncompressed wastes HDFS space, so the output should be compressed before it is written to HDFS.
3) Compressing map output
Even if your MapReduce application reads and writes uncompressed data, it can benefit from compressing the intermediate output of the map phase. Because the map output is written to disk and transferred across the network to the reducers, compressing it with a fast compressor such as LZO or Snappy can yield a performance boost simply because the volume of data to transfer is reduced.
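A sketch of turning on map-output compression with Snappy; the property names are the standard Hadoop 2.x keys, while the job itself is hypothetical (Snappy also requires the native library to be available on the cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompressionJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress the intermediate map output that is spilled to disk
        // and shuffled across the network to the reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "map-output-compression");  // hypothetical job name
        // ... configure mapper/reducer and input/output paths, then submit
    }
}
```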
2. Common compression formats
gzip: gzip is natively supported by Hadoop. It is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding.
bzip2: bzip2 is a freely available, patent-free, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), while being around twice as fast at compression and six times faster at decompression.
LZO: The LZO compression format is composed of many smaller (~256 KB) blocks of compressed data, which allows jobs to be split along block boundaries. Moreover, it was designed with speed in mind: it decompresses about twice as fast as gzip, meaning it is fast enough to keep up with hard-drive read speeds. It does not compress quite as well as gzip: expect files on the order of 50% larger than their gzipped equivalents. But that is still only 20-50% of the size of the uncompressed files, which means IO-bound jobs complete the map phase about four times faster.
Snappy: Snappy is a compression/decompression library. It does not aim for maximum compression or for compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For example, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are 20% to 100% larger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/s or more and decompresses at about 500 MB/s or more. Snappy is widely used inside Google, in everything from BigTable and MapReduce to its internal RPC systems.
Some trade-offs: All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. The tools listed in the table above typically give some control over this trade-off at compression time by offering nine different levels: -1 optimizes for speed and -9 optimizes for space.
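The same trade-off can be observed with the JDK's own DEFLATE implementation (the algorithm underlying gzip). A small illustrative sketch, with sample data chosen purely for demonstration:

```java
import java.util.zip.Deflater;

public class CompressionLevelDemo {
    // Compress a byte[] at the given level and return the compressed size in bytes.
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();

        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);  // only the size matters here, the output is discarded
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Redundant sample data so the difference between levels is visible.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("the quick brown fox jumps over the lazy dog ");
        }
        byte[] data = sb.toString().getBytes();

        System.out.println("level 1 (speed): " + compressedSize(data, Deflater.BEST_SPEED));
        System.out.println("level 9 (space): " + compressedSize(data, Deflater.BEST_COMPRESSION));
    }
}
```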
Different tools have very different compression characteristics. gzip is a general-purpose compressor and sits in the middle of the space/time trade-off. bzip2 compresses more effectively than gzip but is slower. bzip2's decompression is faster than its compression, but it is still slower than the other formats. LZO and Snappy, on the other hand, are optimized for speed and are around an order of magnitude faster than gzip, but they compress less effectively. Snappy is also significantly faster than LZO at decompression.

3. Compression and input splits
When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input will create 16 input splits, each processed independently as input to a separate map task.
Imagine now that the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS will store the file as 16 blocks. However, creating a split for each block will not work, because it is impossible to start reading at an arbitrary point in the gzip stream, and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not marked in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block and thereby synchronize itself with the stream. For this reason, gzip does not support splitting.
In this case, MapReduce will do the right thing and will not try to split the gzip file, since it knows that the input is gzip-compressed (by looking at its filename extension) and that gzip does not support splitting. This will work, but at the expense of locality: a single map task will process all 16 HDFS blocks, most of which will not be local to it. Also, with fewer map tasks, the job is less granular and so may take longer to run.
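For reference, a small sketch of the splittability check that Hadoop's text input format effectively performs: a codec is treated as splittable only if it implements SplittableCompressionCodec, which BZip2Codec does and GzipCodec does not (the file names are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());

        for (String name : new String[] {"logs.txt", "logs.gz", "logs.bz2"}) {  // hypothetical files
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = (codec == null)                        // uncompressed: always splittable
                    || (codec instanceof SplittableCompressionCodec);   // true for bzip2, false for gzip
            System.out.println(name + " -> splittable: " + splittable);
        }
    }
}
```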
If the file in our hypothetical example were an LZO file, we would have the same problem, because the underlying compression format does not give a reader any way to synchronize itself with the stream. However, it is possible to preprocess LZO files with an indexer tool that comes with the Hadoop LZO libraries. The tool builds an index of split points, effectively making the files splittable when they are used with the appropriate MapReduce input format.
The bzip2 format, on the other hand, provides a synchronization marker between blocks (a 48-bit approximation of pi), so it does support splitting.

4. IO-bound and CPU-bound jobs
Storing compressed data in HDFS lets your hardware go further, since compressed data is often around 25% of the size of the original data. Moreover, since MapReduce jobs are nearly always IO-bound, storing data compressed means there is less IO to perform, which means jobs complete faster. However, there are two caveats: some compression formats cannot be split for parallel processing, and others are slow enough at decompression that jobs become CPU-bound, eliminating the IO gains.
The gzip compression format illustrates the first caveat. Imagine you have a 1.1 GB gzip file and your cluster has a block size of 128 MB. The file will be split into 9 chunks of approximately 128 MB each. To process them in parallel in a MapReduce job, a separate mapper would be responsible for each chunk. But this means the second mapper would start at an arbitrary byte about 128 MB into the file. The contextual dictionary that gzip uses to decompress the input would be empty at that point, which means the gzip decompressor would not be able to interpret the bytes correctly. The upshot is that large gzip files in Hadoop need to be processed by a single mapper, which defeats the purpose of parallelism.
The bzip2 compression format illustrates the second caveat, in which jobs become CPU-bound. bzip2 files compress well and are even splittable, but the decompression algorithm is slow and cannot keep up with the streaming disk reads that are common in Hadoop jobs. While bzip2 compression has some upside because it conserves storage space, running jobs now spend their time waiting for the CPU to finish decompressing data, which slows them down and offsets the other gains.

5. Summary
Reasons to compress:
a) The data is mostly stored and not frequently processed. This is the usual DWH scenario. In this case the space savings can be significantly greater than the processing overhead.
b) The compression ratio is very high, so we save a lot of IO.
c) Decompression is very fast (for example Snappy), so we get some gain at a small cost.
d) The data already arrives compressed.
Reasons not to compress:
a) The compressed data is not splittable. Note that many modern formats are built with block-level compression to enable splitting and other partial processing of files.
b) The data is created in the cluster and compression takes significant time. Note that compression is usually much more CPU-intensive than decompression.
c) The data has little redundancy and compression gives little gain.