Well, I would have guessed that the size would eventually top out, since the bit patterns would start to repeat, but I just tried it:
    touch file
    gzip file -c > file.1
    ...
    gzip file.9 -c > file.10
And got:
      0 bytes: file
     25 bytes: file.1
     45 bytes: file.2
     73 bytes: file.3
    103 bytes: file.4
    122 bytes: file.5
    152 bytes: file.6
    175 bytes: file.7
    205 bytes: file.8
    232 bytes: file.9
    262 bytes: file.10
Here are 24,380 files plotted (this genuinely amazed me, actually):
(Plot of compressed file size versus iteration: http://research.engineering.wustl.edu/~schultzm/images/filesize.png)
I did not expect that kind of growth; I would have expected roughly linear growth, since each pass should just encapsulate the existing data in a header along with a dictionary of patterns. I intended to run through 1,000,000 files, but my system ran out of disk space before then.
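The per-pass overhead can be accounted for, at least approximately: per RFC 1952, every gzip member carries a 10-byte fixed header, the stored file name plus a NUL terminator, and an 8-byte CRC-32/length trailer, and an empty input deflates to a 2-byte block. That predicts 10 + 5 ("file" plus NUL) + 2 + 8 = 25 bytes for file.1, which matches the listing above. A quick sanity check of just the fixed overhead (assuming GNU gzip; -n suppresses the stored name and timestamp):

    # 10-byte header + 2-byte empty deflate block + 8-byte trailer = 20
    printf '' | gzip -n | wc -c    # prints 20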
If you want to reproduce this, here is a bash script to generate the files:
    #!/bin/bash
    touch file.0
    for ((i=0; i < 20000; i++)); do
        gzip file.$i -c > file.$(($i+1))
    done
    wc -c file.* | awk '{print $2 "\t" $1}' | sed 's/file.//' | sort -n > filesizes.txt
The resulting filesizes.txt is a sorted, tab-delimited file for your favorite plotting utility. (You will have to delete the "total" line by hand, or script it away as shown below.)
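One way to script both steps (my own variation, not part of the pipeline above): the awk condition drops the "total" line that wc emits when given several files, and the plotting one-liner assumes gnuplot is installed:

    # Drop the "total" line before sorting, then write the cleaned file:
    wc -c file.* | awk '$2 != "total" {print $2 "\t" $1}' | sed 's/file.//' | sort -n > filesizes.txt
    # Quick look at the curve (iteration on x, bytes on y):
    gnuplot -e "set xlabel 'iteration'; set ylabel 'bytes'; plot 'filesizes.txt' with lines; pause -1"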