zlib is a thin wrapper around data compressed with DEFLATE and is defined in RFC 1950:
A zlib stream has the following structure:

      0   1
    +---+---+
    |CMF|FLG|   (more-->)
    +---+---+

    (if FLG.FDICT set)

      0   1   2   3
    +---+---+---+---+
    |     DICTID    |   (more-->)
    +---+---+---+---+

    +=====================+---+---+---+---+
    |...compressed data...|    ADLER32    |
    +=====================+---+---+---+---+
Thus, it adds at least two and possibly six bytes before, and four bytes containing the ADLER32 checksum after, the raw DEFLATE-compressed data.
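To make the layout concrete, here is a minimal Python 3 sketch that slices a stream produced by zlib.compress() into the three parts of the diagram. It assumes that FDICT is not set, so the header is exactly two bytes. The helper name split_zlib_stream is made up for this illustration.

```python
import struct
import zlib


def split_zlib_stream(stream):
    """Split a complete zlib stream into (header, raw_deflate, adler32).

    Assumes FDICT is not set, i.e. the header is exactly two bytes.
    """
    header = stream[:2]
    raw_deflate = stream[2:-4]
    # The trailer is the Adler-32 of the *uncompressed* data, big-endian.
    (adler,) = struct.unpack(">I", stream[-4:])
    return header, raw_deflate, adler


msg = b"foo"
header, raw_deflate, adler = split_zlib_stream(zlib.compress(msg))

# A negative wbits value tells zlib to expect raw DEFLATE data,
# without the zlib header and trailer.
assert zlib.decompress(raw_deflate, -15) == msg
assert adler == zlib.adler32(msg)
```

The two assertions check that the middle part is plain DEFLATE data and that the last four bytes are the checksum of the original message.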
The first byte contains CMF (compression method and flags), which is divided into CM (compression method) in bits 0 to 3, the low four bits, and CINFO (compression information) in bits 4 to 7, the high four bits.
From this it is clear that, unfortunately, the first two bytes of a zlib stream can vary considerably depending on the compression method and settings used.
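As a rough illustration of how these fields are packed, the following Python 3 sketch decodes the two header bytes. The FLG fields (FCHECK, FDICT, FLEVEL) are not described above; their layout is taken from RFC 1950. The function name parse_zlib_header is hypothetical.

```python
def parse_zlib_header(cmf, flg):
    """Decode the two zlib header bytes (given as ints) per RFC 1950."""
    cm = cmf & 0x0F     # bits 0-3: compression method (8 = DEFLATE)
    cinfo = cmf >> 4    # bits 4-7: log2(window size) - 8 when CM = 8
    return {
        "cm": cm,
        "cinfo": cinfo,
        "window_size": 1 << (cinfo + 8) if cm == 8 else None,
        "fcheck": flg & 0x1F,       # bits 0-4: check bits
        "fdict": bool(flg & 0x20),  # bit 5: preset dictionary follows
        "flevel": flg >> 6,         # bits 6-7: compression level hint
        # RFC 1950: CMF*256 + FLG must be a multiple of 31.
        "valid_check": (cmf * 256 + flg) % 31 == 0,
    }


fields = parse_zlib_header(0x78, 0x9C)
```

For the byte pair 78 9c this yields CM = 8 (DEFLATE), CINFO = 7 (a 32 KiB window), FDICT unset, FLEVEL = 2, and a valid header check.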
Fortunately, I came across a message from Mark Adler, the author of the ADLER32 algorithm, which lists the most common and less common combinations of these two start bytes.
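I don't reproduce that list here, but the set of possible start bytes can also be derived from the header check rule in RFC 1950 (CMF*256 + FLG must be a multiple of 31). The Python 3 sketch below does that for DEFLATE streams without a preset dictionary. It is derived from the RFC, not copied from the message, and candidate_markers is a made-up name.

```python
import zlib


def candidate_markers():
    """Return every two-byte header RFC 1950 allows for DEFLATE without FDICT."""
    markers = []
    for cinfo in range(8):          # window sizes from 256 bytes to 32 KiB
        cmf = (cinfo << 4) | 8      # CM = 8 means DEFLATE
        for flevel in range(4):
            flg = flevel << 6       # FDICT (bit 5) is left unset
            # Choose FCHECK so that CMF*256 + FLG is a multiple of 31.
            flg += (31 - (cmf * 256 + flg) % 31) % 31
            markers.append(bytes([cmf, flg]))
    return markers


ZLIB_MARKERS = candidate_markers()

# Headers actually produced by the local zlib at each compression level.
seen = sorted({zlib.compress(b"foo", level)[:2] for level in range(10)})
assert all(marker in ZLIB_MARKERS for marker in seen)
```

The longer the marker list, the more likely it is that two bytes of unrelated data match by chance, so a hit is best confirmed by actually decompressing from that offset.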
With that in mind, let's see how we can use Python to explore zlib:
    >>> import zlib
    >>> msg = b'foo'
    >>> [hex(b) for b in bytearray(zlib.compress(msg))]
    ['0x78', '0x9c', '0x4b', '0xcb', '0xcf', '0x7', '0x0', '0x2', '0x82', '0x1', '0x45']
So the zlib data created by Python's zlib module (using the default options) starts with 78 9c. We will use this to create a script that writes a custom file format consisting of a header, some zlib-compressed data and a footer.
Then we write a second script that scans the file for this two-byte pattern, starts decompressing from there, and works out where the stream ends and the footer starts.
create.py
    import zlib

    msg = b'foo'
    filename = 'foo.compressed'

    # Compress the message and wrap it in a fixed header and footer.
    compressed_msg = zlib.compress(msg)
    data = b'HEADER' + compressed_msg + b'FOOTER'

    with open(filename, 'wb') as outfile:
        outfile.write(data)
Here we take msg, compress it with zlib, and surround it with a header and a footer before writing it to a file.
Headers and footers have a fixed length in this example, but they can, of course, have arbitrary, unknown lengths.
Now for a script that tries to find the zlib stream in such a file. Because in this example we know exactly which marker to expect, I use only one, but the ZLIB_MARKERS list can obviously be filled with all the markers from the message mentioned above.
ident.py
    import zlib

    ZLIB_MARKERS = [b'\x78\x9c']
    filename = 'foo.compressed'

    # Read in binary mode so the bytes are not altered on any platform.
    with open(filename, 'rb') as infile:
        data = infile.read()

    pos = 0
    found = False

    # Slide a two-byte window over the data until a marker matches.
    while not found and pos < len(data) - 1:
        window = data[pos:pos+2]
        for marker in ZLIB_MARKERS:
            if window == marker:
                found = True
                start = pos
                print("Start of zlib stream found at byte %s" % pos)
                break
        pos += 1

    if found:
        header = data[:start]
        rest_of_data = data[start:]

        # A decompression object stops at the end of the stream and
        # keeps whatever follows in unused_data.
        decomp_obj = zlib.decompressobj()
        uncompressed_msg = decomp_obj.decompress(rest_of_data)
        footer = decomp_obj.unused_data

        # The parts are bytes; decode them for display only.
        print("Header: %s" % header.decode('latin-1'))
        print("Message: %s" % uncompressed_msg.decode('latin-1'))
        print("Footer: %s" % footer.decode('latin-1'))
    else:
        print("Sorry, no zlib streams starting with any of the markers found.")
The idea is this:
Start at the beginning of the file and create a two-byte search window.
Move the search window forward in increments of one byte.
For each window, check whether it matches any of the defined two-byte markers.
If a match is found, note the starting position, stop the search, and try to decompress everything that follows.
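The steps above can be sketched as a reusable Python 3 function. This variant goes one step further than the script: if decompression fails at a matching offset, it treats the match as a coincidence and keeps scanning. The function name find_zlib_stream is made up. The four markers are the 32 KiB-window headers permitted by RFC 1950, which is an assumption on my part and not necessarily the list from the message mentioned earlier.

```python
import zlib

ZLIB_MARKERS = [b"\x78\x01", b"\x78\x5e", b"\x78\x9c", b"\x78\xda"]


def find_zlib_stream(data, markers=ZLIB_MARKERS):
    """Return (start, message, end) for the first decodable zlib stream, or None.

    start is the offset of the header, end the offset just past the checksum.
    """
    for pos in range(len(data) - 1):
        if data[pos:pos + 2] not in markers:
            continue
        decomp_obj = zlib.decompressobj()
        try:
            message = decomp_obj.decompress(data[pos:])
        except zlib.error:
            continue  # the marker bytes occurred by chance; keep scanning
        if not decomp_obj.eof:
            continue  # the data ran out before the stream was complete
        end = len(data) - len(decomp_obj.unused_data)
        return pos, message, end
    return None


# The extra 78 9c after HEADER is a deliberate false positive.
blob = b"HEADER\x78\x9c" + zlib.compress(b"foo") + b"FOOTER"
start, message, end = find_zlib_stream(blob)
header, footer = blob[:start], blob[end:]
```

Slicing data[pos:] copies the remaining data on every candidate, which is fine for small files but worth revisiting for large ones.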
Finding the end of the stream is not as trivial as finding a two-byte marker. zlib streams are not terminated by a fixed byte sequence, and their length is not given in any header field. Instead, the compressed data marks its own final block and is followed by the four-byte ADLER32 checksum, which must match the decompressed data up to that point.
The way this works is that the internal C function inflate() keeps decompressing the stream as it reads it, and once it has reached the end of the compressed data and found a matching checksum, it signals this to the caller, indicating that the rest of the data is no longer part of the zlib stream.
In Python, this behavior is exposed by decompression objects, as opposed to simply calling zlib.decompress(). Calling decompress(string) on a decompression object decompresses the zlib stream contained in string and returns the decompressed data that was part of the stream. Everything that follows the stream is stored in unused_data and can be retrieved from there afterwards.
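The same mechanism also works when the data arrives piece by piece. The Python 3 sketch below feeds a decompression object small chunks, stops as soon as its eof attribute reports the end of the stream, and uses unused_data to work out the exact offset where the footer begins. The function name and the chunk size of four bytes are arbitrary choices for this illustration.

```python
import zlib


def read_stream_in_chunks(data, start, chunk_size=4):
    """Feed data[start:] to a decompression object chunk by chunk.

    Returns (message, end), where end is the offset of the first byte
    after the zlib stream. Corrupt input raises zlib.error.
    """
    decomp_obj = zlib.decompressobj()
    parts = []
    pos = start
    while pos < len(data) and not decomp_obj.eof:
        chunk = data[pos:pos + chunk_size]
        parts.append(decomp_obj.decompress(chunk))
        pos += len(chunk)
    if not decomp_obj.eof:
        raise ValueError("data ended before the zlib stream was complete")
    # unused_data holds the bytes of the last chunk that lie past the stream.
    end = pos - len(decomp_obj.unused_data)
    return b"".join(parts), end


data = b"HEADER" + zlib.compress(b"foo") + b"FOOTER"
message, end = read_stream_in_chunks(data, 6)
footer = data[end:]
```

Stopping at eof avoids feeding footer bytes to the decompressor at all, so the footer can be read straight from the original data.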
This should produce the following output when run on a file created by the first script:
    Start of zlib stream found at byte 6
    Header: HEADER
    Message: foo
    Footer: FOOTER
The example can easily be modified to write the uncompressed message to a file instead of printing it. You can then go on to analyze the formerly zlib-compressed data and try to identify known fields in the metadata of the header and footer that you separated out.
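One possible shape for that modification, as a Python 3 sketch: the three parts are written to separate files next to the input. The function name extract_parts and the .header, .message and .footer suffixes are invented for this example, and the start offset is assumed to be known already, for instance from a marker scan.

```python
import os
import tempfile
import zlib


def extract_parts(path, start):
    """Write header, message and footer of `path` to sibling files.

    `start` is the offset of the zlib stream within the file.
    Returns the list of files written.
    """
    with open(path, "rb") as infile:
        data = infile.read()
    decomp_obj = zlib.decompressobj()
    message = decomp_obj.decompress(data[start:])
    parts = {
        ".header": data[:start],
        ".message": message,
        ".footer": decomp_obj.unused_data,
    }
    for suffix, content in parts.items():
        with open(path + suffix, "wb") as outfile:
            outfile.write(content)
    return sorted(path + suffix for suffix in parts)


# Demo on a throwaway file shaped like the one create.py writes.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "foo.compressed")
with open(path, "wb") as outfile:
    outfile.write(b"HEADER" + zlib.compress(b"foo") + b"FOOTER")
written = extract_parts(path, 6)
```

Keeping the header and footer as raw bytes in their own files makes it easier to inspect them with a hex viewer afterwards.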