Background: I have an implementation of the common LZSS backend in C ++ (available here . The corresponding algorithm that I use in this version is extremely simple, because it was originally intended to compress relatively small files (no more than 64 KB) for relatively old hardware (in particular, Mega Drive / Sega Genesis, where 64kB is the fullness of the main RAM).
However, some files take too long to compress my implementation, in the order of minutes. The reason is twofold: the naive matching algorithm takes most of the time, but this happens precisely because I built a compression graph from a file to achieve optimal compression. Looking at the profiler, most of the time is spent searching for matches, eclipsing even the quadratic size of the resulting graph.
For some time I have been exploring several potential substitutions; one of which attracted my attention is dictionary-character-flexible parsing using multilayer suffix trees . The layered part is important because one of the LZSS options that interests me uses variable-size encodings for (position, length).
My current implementation allows matches in a sliding window to overlap the wait buffer, so this input:
aaaaaaaaaaaaaaaa
can be directly encoded as
(0,'a')(1,0,15)
instead
(0,'a')(1,0,1)(1,0,2)(1,0,4)(1,0,8)
Here (0, 'a') means the character encoding 'a' as a literal, while (1, n, m) means "copy m characters from position n".
Question: Having said all this, here is my problem: every resource I found on the suffix trees seems to imply that they cannot handle the overlapping case, but instead only allow matching matches to be found. When suffix trees, research papers, books, and even some implementations were involved, they gave examples of compression without overlapping, as if they were perfect compression (I would contact some of them, but my reputation does not allow it). Some of them even mentioned that overlaps can be useful in describing basic compression schemes, but were strangely silent on this subject when discussing suffix trees.
Since the suffix tree should be supplemented to store offset information anyway, this is similar to a property that can be checked when looking for a match β you should filter out any matches that start with the wait buffer. And the way the tree will be built / updated will mean that if the edge leads you to a node that matches a match starting with the appearance, you will return the previous node, and the rest of the descendants will also have wait buffers.
Is my approach wrong or wrong? Is there an implementation or discussion of LZ77 / LZSS with suffix trees that mention matches overriding the wait buffer?