How does the ability to compress a stream affect the compression algorithm?

I recently backed up my university home directory by sending it as a tar stream and compressing it at my end:

    ssh user@host "tar cf - my_dir/" | bzip2 > uni_backup.tar.bz2

This made me wonder: I only know a little about how compression works, but I would imagine that this ability to compress a stream of data would lead to poorer compression, since the algorithm has to finish handling a block of data at some point, write it to the output stream, and move on to the next block.

Is this the case? Or do these programs simply read a lot of data into memory, compress it, write it out, and then repeat? Or are there clever tricks used in these stream compressors? I see that both the bzip2 and xz man pages talk about memory usage, and man bzip2 also hints that little is lost by chopping the data to be compressed into blocks:

Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. It is also important to appreciate that the decompression memory requirement is set at compression time by the choice of block size.

I would still like to hear whether other tricks are used, or where I can learn more about this.

1 answer

This question relates more to buffer handling than to compression algorithms, although a little can be said about those too.

Some compression algorithms are intrinsically block-based, which means they absolutely need to work with blocks of a specific size. This is the situation with bzip2, whose block size is selected via the level switch, from 100 kB to 900 kB. So if you stream data into it, it will wait for a block to fill up and start compressing that block once it is full (alternatively, for the last block, it will work with whatever size it receives).
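To make that concrete, here is a minimal sketch of my own (Python's standard bz2 module, purely illustrative, not something from bzip2 itself) of what an incremental bzip2 compressor looks like from the outside: input is buffered internally, and output only really appears once a whole block has been filled, or when the stream is flushed at the end.

    import bz2

    # Incremental bzip2 compressor. compresslevel=9 means 900 kB blocks,
    # compresslevel=1 means 100 kB blocks (same as the bzip2 -1 .. -9 switches).
    comp = bz2.BZ2Compressor(9)

    out = bytearray()
    chunk = b"x" * (64 * 1024)           # feed 64 kB pieces, as a pipe might deliver them
    for i in range(20):                  # 20 * 64 kB = 1.25 MB in total
        produced = comp.compress(chunk)
        # Little or nothing comes back until a whole 900 kB block has been filled.
        print(f"fed chunk {i:2d}: got {len(produced)} bytes back")
        out.extend(produced)

    out.extend(comp.flush())             # the final, partial block is compressed as-is
    print("total compressed size:", len(out), "bytes")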

Some other compression algorithms can handle streaming, which means they can continuously compress new data using older data kept in a memory buffer. Algorithms based on a sliding window can do this, and zlib typically achieves it.
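As an illustration (again just a sketch of mine, using Python's zlib bindings), a deflate stream can be flushed at arbitrary points: everything compressed so far is pushed out immediately, while the 32 kB sliding window is kept, so later data can still reference matches found in earlier data.

    import zlib

    comp = zlib.compressobj(6)        # ordinary zlib/deflate stream, level 6

    out = bytearray()
    for piece in (b"first part of the stream ",
                  b"second part, with similar text ",
                  b"third part"):
        out.extend(comp.compress(piece))
        # Z_SYNC_FLUSH forces all pending output out now, without resetting
        # the sliding window, so compression continues across the pieces.
        out.extend(comp.flush(zlib.Z_SYNC_FLUSH))
        print("compressed so far:", len(out), "bytes")

    out.extend(comp.flush())          # Z_FINISH: terminate the stream
    print(zlib.decompress(bytes(out)))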

That said, even sliding-window compressors may nonetheless choose to cut the input into blocks, either for easier buffer management or to enable multi-threading, as pigz does.
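A very rough sketch of that idea (my own simplification, not how pigz really builds its output: actual pigz reuses the last 32 kB of each block as a dictionary for the next one and stitches the resulting deflate streams together) is to cut the input into fixed-size chunks and compress them on a thread pool:

    import zlib
    from concurrent.futures import ThreadPoolExecutor

    CHUNK = 128 * 1024                 # pigz compresses 128 kB input blocks by default

    def compress_chunk(chunk: bytes) -> bytes:
        # Each chunk is compressed independently here, which loses the
        # cross-chunk matches a single sliding window would have found.
        return zlib.compress(chunk, 6)

    def chunks(data: bytes, size: int):
        for i in range(0, len(data), size):
            yield data[i:i + size]

    data = b"some reasonably long and repetitive input " * 100_000

    with ThreadPoolExecutor(max_workers=4) as pool:
        blocks = list(pool.map(compress_chunk, chunks(data, CHUNK)))

    print(sum(len(b) for b in blocks), "bytes across", len(blocks), "independent blocks")

As far as I know, CPython's zlib releases the GIL while compressing, so a thread pool does give real parallelism here; the price of chunking is a slightly worse ratio than a single continuous stream.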

