Extract zlib compressed data from binary in python

Question

My company uses an outdated file format for electromyography data that is no longer in production. However, there is some interest in maintaining backward compatibility, so I am exploring the possibility of writing a reader for this file format.

From analyzing the very convoluted old source code, written in Delphi, I know that the file reader/writer uses ZLIB. In a hex editor the file appears to start with a header in plain ASCII (with fields such as "Player" and "Analyzer" readable), followed by a compressed string containing the raw data.

My question is: how should I proceed in order to determine:

Whether it is a compressed stream at all;
Where the compressed stream begins and where it ends.

From Wikipedia:

zlib compressed data is usually written with a gzip or a zlib wrapper. The wrapper encapsulates the raw DEFLATE data by adding a header and a trailer. This provides stream identification and error detection.

Is this relevant?

I will be happy to post additional information, but I do not know what would be most relevant.

Thanks for any hint.

EDIT: I have the working application and can use it to record real data of any duration, so if necessary I can produce files even smaller than 1 KB.

Some sample files:

Just created, without data stream: https://dl.dropbox.com/u/4849855/Mio_File/HeltonEmpty.mio

Same as above, after a very short (1 second?) data stream was saved: https://dl.dropbox.com/u/4849855/Mio_File/HeltonFilled.mio

Another one, from a patient called "manco" instead of "Helton", with an even shorter stream (ideal for viewing in a hex editor): https://dl.dropbox.com/u/4849855/Mio_File/manco_short.mio

Notes: each file is the file of one patient (a person). One or more exams are stored inside each file, and each exam consists of one or more time series. The files provided contain only one exam with one data series.

language-agnostic binaryfiles zlib decompression

heltonbiker Aug 27 '12 at 18:29

2 answers

zlib is a thin wrapper around data compressed with the DEFLATE algorithm, and is defined in RFC 1950:

    A zlib stream has the following structure:

          0   1
        +---+---+
        |CMF|FLG|   (more-->)
        +---+---+

    (if FLG.FDICT set)

          0   1   2   3
        +---+---+---+---+
        |     DICTID    |   (more-->)
        +---+---+---+---+

        +=====================+---+---+---+---+
        |...compressed data...|    ADLER32    |
        +=====================+---+---+---+---+

So the wrapper adds at least two and at most six bytes before the raw DEFLATE data, and a four-byte ADLER32 checksum after it.

The first byte is CMF (Compression Method and Flags), which is split into CM (compression method, the low four bits, bits 0 to 3) and CINFO (compression info, the high four bits, bits 4 to 7). For the DEFLATE method, CM is 8.
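To make the bit layout concrete, here is a minimal sketch that splits the two header bytes into the fields named in RFC 1950. The function name `parse_zlib_header` and the dictionary keys are made up for this illustration; the second byte (FLG) layout and the "multiple of 31" rule are taken from the RFC, not from the text above.

```python
def parse_zlib_header(cmf, flg):
    """Split the two zlib header bytes into their RFC 1950 fields.

    Hypothetical helper for illustration; cmf and flg are ints (0..255).
    """
    cinfo = cmf >> 4
    return {
        "cm": cmf & 0x0F,                  # compression method, 8 = DEFLATE
        "cinfo": cinfo,                    # for CM=8: log2(window size) - 8
        "window_size": 1 << (cinfo + 8),   # only meaningful when cm == 8
        "fcheck": flg & 0x1F,              # check bits (bits 0 to 4 of FLG)
        "fdict": bool(flg & 0x20),         # True if a DICTID follows
        "flevel": flg >> 6,                # compression level hint (0 to 3)
        # CMF*256 + FLG must be a multiple of 31 for a valid header.
        "check_ok": ((cmf << 8) | flg) % 31 == 0,
    }


fields = parse_zlib_header(0x78, 0x9C)
print(fields)
```

For the byte pair 78 9c this reports DEFLATE with a 32 KiB window, no preset dictionary, and a valid check value.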

It follows that, unfortunately, the first two bytes of a zlib stream can vary considerably depending on the compression method and settings used.

Fortunately, I came across a post by Mark Adler, the author of the ADLER32 algorithm, in which he lists the most common and less common combinations of these two starting bytes.
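That list is not reproduced here, but candidate byte pairs can also be derived from the rules in RFC 1950. The sketch below is an assumption-laden illustration, not Mark Adler's list: it assumes CM=8 (DEFLATE), no preset dictionary, and uses the hypothetical name `candidate_markers`.

```python
def candidate_markers(cinfo_values=range(8)):
    """Enumerate two-byte zlib headers that satisfy RFC 1950.

    Assumptions: CM=8 (DEFLATE), FDICT not set, CINFO in 0..7.
    """
    markers = []
    for cinfo in cinfo_values:
        cmf = (cinfo << 4) | 8
        for flg in range(256):
            if flg & 0x20:
                continue  # skip headers that announce a preset dictionary
            if ((cmf << 8) | flg) % 31 == 0:
                markers.append(bytes([cmf, flg]))
    return markers


# Headers for the largest (32 KiB) window, one per compression level hint.
common = candidate_markers(cinfo_values=[7])
print([m.hex() for m in common])
```

Restricting the search to CINFO=7 keeps the marker list short; widening `cinfo_values` catches streams written with smaller windows at the cost of more false positives.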

With that in mind, let's see how we can use Python to explore zlib:

    >>> import zlib
    >>> msg = b'foo'
    >>> [hex(b) for b in zlib.compress(msg)]
    ['0x78', '0x9c', '0x4b', '0xcb', '0xcf', '0x7', '0x0', '0x2', '0x82', '0x1', '0x45']

So the zlib data created by Python's zlib module (using the default options) starts with 78 9c. We will use this to write a script that produces a custom file format consisting of a preamble, some zlib compressed data and a footer.

Then we write a second script that scans the file for this two-byte pattern, starts decompressing everything from that position on, and works out where the zlib stream ends and the footer starts.

create.py

    import zlib

    msg = b'foo'
    filename = 'foo.compressed'

    compressed_msg = zlib.compress(msg)
    data = b'HEADER' + compressed_msg + b'FOOTER'

    # Binary mode, so the bytes are written exactly as they are.
    with open(filename, 'wb') as outfile:
        outfile.write(data)

Here we take msg, compress it with zlib and surround it with a header and a footer before writing it to a file.

The header and footer have a fixed length in this example, but they could of course have arbitrary, unknown lengths.

Now for the script that tries to find the zlib stream in such a file. Because in this example we know exactly which marker to expect, I use only one, but the ZLIB_MARKERS list can obviously be filled with all the markers from the post mentioned above.

ident.py

    import zlib

    # Two-byte markers to look for; extend with other CMF/FLG combinations.
    ZLIB_MARKERS = [b'\x78\x9c']
    filename = 'foo.compressed'

    # Read in binary mode so that no byte is translated or decoded.
    with open(filename, 'rb') as infile:
        data = infile.read()

    pos = 0
    found = False

    while not found:
        window = data[pos:pos+2]
        for marker in ZLIB_MARKERS:
            if window == marker:
                found = True
                start = pos
                print("Start of zlib stream found at byte %s" % pos)
                break
        if pos == len(data):
            break
        pos += 1

    if found:
        header = data[:start]
        rest_of_data = data[start:]
        decomp_obj = zlib.decompressobj()
        uncompressed_msg = decomp_obj.decompress(rest_of_data)
        # Whatever follows the end of the zlib stream ends up here.
        footer = decomp_obj.unused_data

        # Decode for display only; latin-1 maps every byte to one character.
        print("Header: %s" % header.decode('latin-1'))
        print("Message: %s" % uncompressed_msg.decode('latin-1'))
        print("Footer: %s" % footer.decode('latin-1'))

    if not found:
        print("Sorry, no zlib streams starting with any of the markers found.")

The idea is this:

Start at the beginning of the file and create a two-byte search window.
Move the search window forward in one-byte increments.
For each window, check whether it matches any of the two-byte markers we defined.
If a match is found, record the starting position, stop searching and try to decompress everything that follows.

Finding the end of the stream is not as trivial as looking for a two-byte marker. zlib streams are neither terminated by a fixed byte sequence, nor is their length recorded in any of the header fields. Instead, the stream is followed by a four-byte ADLER32 checksum, which must match the data up to that point.

The way this works is that the internal C function inflate() keeps decompressing the stream as it reads it; once it reaches the end of the compressed data and finds a matching checksum, it signals this to the caller, indicating that the rest of the data is no longer part of the zlib stream.
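As a small check of the trailer layout, the following sketch compares the last four bytes of a freshly compressed stream with the checksum computed by `zlib.adler32`. It assumes, per RFC 1950, that the checksum is stored most significant byte first; the variable names are arbitrary.

```python
import struct
import zlib

payload = b"foo"
stream = zlib.compress(payload)

# The last four bytes of a zlib stream hold the Adler-32 checksum of the
# *uncompressed* data, stored big-endian (most significant byte first).
(stored,) = struct.unpack(">I", stream[-4:])
computed = zlib.adler32(payload) & 0xFFFFFFFF

print(hex(stored), hex(computed))
```

The two values agree, and for b"foo" they correspond to the trailing bytes 02 82 01 45 seen in the interpreter session earlier in this answer.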

In Python, this behavior is exposed when you use decompression objects instead of simply calling zlib.decompress(). Calling decompress(string) on a decompression object decompresses the zlib stream contained in string and returns the decompressed data that was part of the stream. Everything that follows the stream is stored in unused_data and can be retrieved from there afterwards.
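A minimal sketch of that API on an in-memory blob follows. It additionally checks the `eof` attribute of the decompression object (available since Python 3.3), which the answer itself does not use; that check is an addition here, meant to tell a complete stream apart from a truncated one.

```python
import zlib

# In-memory stand-in for a file: header + zlib stream + footer.
blob = b"HEADER" + zlib.compress(b"foo") + b"FOOTER"
start = blob.find(b"\x78\x9c")

d = zlib.decompressobj()
message = d.decompress(blob[start:])

# Bytes consumed by the stream = bytes handed in minus bytes left over.
stream_len = len(blob) - start - len(d.unused_data)

print(message, d.unused_data, d.eof, stream_len)
```

If `d.eof` is False after the call, the data ran out before the end of the stream was reached, so `unused_data` being empty does not by itself mean the stream was complete.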

This should produce the following output for a file created by the first script:

    Start of zlib stream found at byte 6
    Header: HEADER
    Message: foo
    Footer: FOOTER

The example can easily be modified to write the uncompressed message to a file instead of printing it. You can then go on to analyze the formerly zlib compressed data, and try to identify known fields in the metadata in the header and footer that you separated out.
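One way that modification could look is sketched below. The function name `extract_payload`, the `.raw` output suffix and the single default marker are choices made for this illustration only; the demo builds its own sample file in a temporary directory rather than touching real data.

```python
import os
import tempfile
import zlib


def extract_payload(path, marker=b"\x78\x9c"):
    """Write the first zlib payload found in `path` to `path + '.raw'`.

    Hypothetical helper: returns (output path, header bytes, footer bytes).
    """
    with open(path, "rb") as fh:
        data = fh.read()
    start = data.find(marker)
    if start < 0:
        raise ValueError("no zlib marker found in %s" % path)
    d = zlib.decompressobj()
    payload = d.decompress(data[start:])
    out_path = path + ".raw"
    with open(out_path, "wb") as out:
        out.write(payload)
    return out_path, data[:start], d.unused_data


# Demo on a throwaway file with the same layout as foo.compressed.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "foo.compressed")
    with open(sample, "wb") as fh:
        fh.write(b"HEADER" + zlib.compress(b"foo") + b"FOOTER")
    out_path, header, footer = extract_payload(sample)
    with open(out_path, "rb") as fh:
        written = fh.read()

print(written, header, footer)
```

Keeping the header and footer as return values leaves them available for the field-by-field analysis mentioned above.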

Lukas Graf Aug 27 '12 at 22:45

cgohlke · Accepted Answer · 2012-08-28T00:35:03+0000

To get started, why not scan the files for all valid zip streams (this is good enough for small files and for figuring out the format):

    import zlib
    from glob import glob

    def zipstreams(filename):
        """Return all zip streams and their positions in file."""
        with open(filename, 'rb') as fh:
            data = fh.read()
        i = 0
        while i < len(data):
            try:
                zo = zlib.decompressobj()
                yield i, zo.decompress(data[i:])
                # Skip past the bytes consumed by this stream.
                i += len(data[i:]) - len(zo.unused_data)
            except zlib.error:
                # No valid stream starts here; try the next byte.
                i += 1

    for filename in glob('*.mio'):
        print(filename)
        for i, data in zipstreams(filename):
            print(i, len(data))

It looks like the data streams contain double precision floating point data:

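    import numpy
    from glob import glob
    from matplotlib import pyplot

    # Uses zipstreams() as defined in the previous snippet.
    for filename in glob('*.mio'):
        for i, data in zipstreams(filename):
            if data:
                # '<f8' means little-endian 64-bit floats; frombuffer
                # replaces the deprecated binary mode of fromstring.
                a = numpy.frombuffer(data, '<f8')
                pyplot.plot(a[1:])
                pyplot.title(filename + ' - %i' % i)
                pyplot.show()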
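If NumPy is not at hand, the same interpretation can be tried with the standard library alone. This is only a sketch under the assumption stated above (little-endian 64-bit floats); the helper name `as_doubles` is made up, and the sample values are synthetic rather than taken from a .mio file.

```python
import struct
import zlib


def as_doubles(raw):
    """Interpret raw bytes as little-endian float64 values.

    Hypothetical helper; trailing bytes that do not fill a whole
    8-byte value are ignored.
    """
    count = len(raw) // 8
    return list(struct.unpack("<%dd" % count, raw[:count * 8]))


# Synthetic round trip standing in for one decompressed stream.
samples = [0.0, 1.5, -2.25]
raw = zlib.decompress(zlib.compress(struct.pack("<3d", *samples)))
decoded = as_doubles(raw)

print(decoded)
```

If the decoded values from a real file look like noise or contain many huge exponents, the element type or byte order guess is probably wrong and another format string is worth trying.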
