Introduction
I wrote code to compress and decompress VBA code in Microsoft Office files. Microsoft has published a specification of the algorithm used, including pseudo-code (see here). My implementation is in C#, and I have posted it on GitHub.
The decompression algorithm seems to be working correctly, but I am having problems with compression. I believe I followed the pseudo-code to the letter, yet my output differs from the examples given in the specification, as well as from samples that I extracted from actual Office documents.
The compression algorithm is a form of run-length encoding in which copy tokens represent sequences of bytes that occurred earlier in the data. The problem I am facing is that, as you get further into the data, there is often more than one equivalent location that could serve as the copy source. That is, the same sequence of bytes occurs in several places, so there are several equivalent copy tokens you could emit.
According to the published algorithm, you start at the current encoding position and work backward, looking for the longest sequence that can be encoded at that position. It seems logical to me that if several candidates have equal length, you would take either the first one you encounter (if you do not replace your best match when you find an equivalent one) or the last one you encounter (if you replace it every time). However, the sample provided by Microsoft seems to choose one of the alternatives in the middle, and I cannot understand why.
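The two tie-breaking policies described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from my repository: the names MatchingSketch, FindBestMatch and Sample are hypothetical, the chunk is assumed to start at index 0, and the clamp to the maximum copy length from the specification is left out for brevity.

```csharp
using System;
using System.Text;

public static class MatchingSketch
{
    // The sample input used in the unit test discussed in this question (56 bytes).
    public static byte[] Sample() =>
        Encoding.ASCII.GetBytes("#aaabcdefaaaaghijaaaaaklaaamnopq" + new string('a', 12) + "rstuvwxyzaaa");

    // Scans backward from `current` and returns the (offset, length) of the longest match.
    // replaceOnTie = false keeps the first (nearest) candidate among equal lengths;
    // replaceOnTie = true keeps the last (farthest) candidate among equal lengths.
    public static (int Offset, int Length) FindBestMatch(byte[] data, int current, bool replaceOnTie)
    {
        int bestLength = 0;
        int bestCandidate = -1;

        for (int candidate = current - 1; candidate >= 0; candidate--)
        {
            // Count how many bytes at `candidate` equal the bytes at `current`.
            int c = candidate, d = current, length = 0;
            while (d < data.Length && data[d] == data[c])
            {
                length++;
                c++;
                d++;
            }

            bool strictlyLonger = length > bestLength;
            bool tie = replaceOnTie && length > 0 && length == bestLength;
            if (strictlyLonger || tie)
            {
                bestLength = length;
                bestCandidate = candidate;
            }
        }

        // Offset 0 means that no match was found at all.
        return bestLength == 0 ? (0, 0) : (current - bestCandidate, bestLength);
    }
}
```

For the sample input at position 53 (the final aaa), the nearest-match policy gives offset 12 and the farthest-match policy gives offset 52, both with length 3. Neither is the offset 16 that appears in the expected output.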
Example - unit test failure
To illustrate, consider this unit test:
// Test entry point: compress the reference input and compare it with the reference output.
public void CompressionProducesExpectedOutput()
{
    CompressionTestHelper.LowLevelCompressionComparison(_expectedDecompressedBytes, _expectedCompressedBytes);
}

// Compares, in order, the copy tokens of the reference stream with those produced by the code under test.
public static void LowLevelCompressionComparison(byte[] decompressedBytes, byte[] expectedCompressedBytes)
{
    // Parse the reference compressed bytes, then compress the raw input with the code under test.
    var refCompressed = new CompressedContainer(expectedCompressedBytes);
    var decompressed = new DecompressedBuffer(decompressedBytes);
    var sutCompressed = new CompressedContainer(decompressed);

    // Literal tokens are ignored; only copy tokens are compared.
    var refTokens = GetTokensFromCompressedContainer(refCompressed).OfType<CopyToken>().ToList();
    var sutTokens = GetTokensFromCompressedContainer(sutCompressed).OfType<CopyToken>().ToList();

    for (var i = 0; i < refTokens.Count; i++)
    {
        var expected = refTokens[i];
        var actual = sutTokens[i];
        Assert.Equal(expected, actual);
    }
}

// Flattens container -> chunks -> token sequences -> tokens into a single sequence.
private static IEnumerable<IToken> GetTokensFromCompressedContainer(CompressedContainer refCompressed)
{
    var refTokens = from c in refCompressed.CompressedChunks
                    from s in ((CompressedChunkData)c.ChunkData).TokenSequences
                    from t in s.Tokens
                    select t;
    return refTokens;
}
, . , . . GitHub, .
The unit test compresses the string #aaabcdefaaaaghijaaaaaklaaamnopqaaaaaaaaaaaarstuvwxyzaaa, which contains several runs of the character a.
The expected compressed output is:
private const string ExpectedCompressedOutput =
    "01 2F B0 00 23 61 61 61 62 63 64 65 82 66 00 70 61 67 68 69 6A 01 38 08 61 6B 6C 00 30 6D 6E 6F " +
    "70 06 71 02 70 04 10 72 73 74 75 76 10 77 78 79 7A 00 3C";
The mismatch is in the final copy token (00 3C). Read as a 16-bit little-endian value, it is:
00 3C: 0011 1100 0000 0000
Unpacking it (see section 2.4.1.3.19.2, Unpack CopyToken):
difference = 53
bitcount = 6
length mask = 0000 0011 1111 1111
offset mask = 1111 1100 0000 0000
encoded = 0011 1100 0000 0000
length = 0000 0000 0000 0000 + 3 = 3
offset = (0011 1100 0000 0000 >> 10) + 1
= 0000 0000 0000 1111 + 1
= 16
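The arithmetic above can be checked with a small sketch of the copy token packing and unpacking, as I understand sections 2.4.1.3.19.1 to 2.4.1.3.19.3 of the specification. The names CopyTokenSketch, BitCount, Unpack and Pack are hypothetical and exist only for this illustration; this is not the code from my repository.

```csharp
using System;

public static class CopyTokenSketch
{
    // Smallest bit count b >= 4 such that 2^b >= difference,
    // where difference = current position - chunk start.
    public static int BitCount(int difference)
    {
        int bitCount = 4;
        while ((1 << bitCount) < difference)
        {
            bitCount++;
        }
        return bitCount;
    }

    // Splits a 16-bit token into (offset, length) for the given difference.
    public static (int Offset, int Length) Unpack(int token, int difference)
    {
        int bitCount = BitCount(difference);
        int lengthMask = 0xFFFF >> bitCount;      // low bits hold length - 3
        int offsetMask = ~lengthMask & 0xFFFF;    // high bits hold offset - 1
        int length = (token & lengthMask) + 3;
        int offset = ((token & offsetMask) >> (16 - bitCount)) + 1;
        return (offset, length);
    }

    // Inverse of Unpack: builds the 16-bit token from (offset, length).
    public static int Pack(int offset, int length, int difference)
    {
        int bitCount = BitCount(difference);
        return ((offset - 1) << (16 - bitCount)) | (length - 3);
    }
}
```

With difference = 53, Unpack(0x3C00, 53) gives offset 16 and length 3, matching the hand calculation. For comparison, Pack(12, 3, 53) gives 0x2C00, which would be written as the bytes 00 2C instead of 00 3C.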
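So the token copies 3 bytes starting 16 bytes back, which lands in the middle of the long run of a. A backward scan finds its first match at offset 12.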
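To see where every copy token in the expected output points, not just the last one, the whole stream can be walked as in the sketch below. It assumes a single compressed chunk whose decompressed data starts at position 0, skips the signature byte and the 2-byte chunk header without validating them, and uses hypothetical names (TokenWalkSketch, SampleHex, CopyTokens) that are not part of my implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TokenWalkSketch
{
    // The expected compressed output from the unit test in this question.
    public const string SampleHex =
        "01 2F B0 00 23 61 61 61 62 63 64 65 82 66 00 70 61 67 68 69 6A 01 38 08 61 6B 6C 00 30 6D 6E 6F " +
        "70 06 71 02 70 04 10 72 73 74 75 76 10 77 78 79 7A 00 3C";

    // Returns (position, offset, length) for each copy token, where position is the
    // index in the decompressed data at which the copy token is applied.
    public static List<(int Position, int Offset, int Length)> CopyTokens(string hex)
    {
        byte[] bytes = hex.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                          .Select(h => Convert.ToByte(h, 16))
                          .ToArray();
        var result = new List<(int Position, int Offset, int Length)>();

        int i = 3;          // skip the signature byte and the 2-byte chunk header
        int position = 0;   // current position in the decompressed data

        while (i < bytes.Length)
        {
            byte flags = bytes[i++];   // one flag bit per token, least significant bit first
            for (int bit = 0; bit < 8 && i < bytes.Length; bit++)
            {
                if ((flags & (1 << bit)) == 0)
                {
                    i++;               // literal token: one byte
                    position++;
                    continue;
                }

                // Copy token: 16 bits, little-endian.
                int token = bytes[i] | (bytes[i + 1] << 8);
                i += 2;

                int bitCount = 4;
                while ((1 << bitCount) < position) bitCount++;

                int lengthMask = 0xFFFF >> bitCount;
                int length = (token & lengthMask) + 3;
                int offset = ((token & ~lengthMask & 0xFFFF) >> (16 - bitCount)) + 1;

                result.Add((position, offset, length));
                position += length;
            }
        }
        return result;
    }
}
```

Calling TokenWalkSketch.CopyTokens(TokenWalkSketch.SampleHex) yields six copy tokens. The last one is applied at position 53 with offset 16 and length 3, which is the token discussed above.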
So why does the MS sample use offset 16 here rather than 12?
@JeroenMostert, , . , , Microsoft. , , , , .
, , , , a , . , , , MS Software.