Same as overlapping look at LZ77 / LZSS with suffix trees

Background: I have an implementation of the common LZSS backend in C ++ (available here . The corresponding algorithm that I use in this version is extremely simple, because it was originally intended to compress relatively small files (no more than 64 KB) for relatively old hardware (in particular, Mega Drive / Sega Genesis, where 64kB is the fullness of the main RAM).

However, some files take too long to compress my implementation, in the order of minutes. The reason is twofold: the naive matching algorithm takes most of the time, but this happens precisely because I built a compression graph from a file to achieve optimal compression. Looking at the profiler, most of the time is spent searching for matches, eclipsing even the quadratic size of the resulting graph.

For some time I have been exploring several potential substitutions; one of which attracted my attention is dictionary-character-flexible parsing using multilayer suffix trees . The layered part is important because one of the LZSS options that interests me uses variable-size encodings for (position, length).

My current implementation allows matches in a sliding window to overlap the wait buffer, so this input:

aaaaaaaaaaaaaaaa 

can be directly encoded as

 (0,'a')(1,0,15) 

instead

 (0,'a')(1,0,1)(1,0,2)(1,0,4)(1,0,8) 

Here (0, 'a') means the character encoding 'a' as a literal, while (1, n, m) means "copy m characters from position n".

Question: Having said all this, here is my problem: every resource I found on the suffix trees seems to imply that they cannot handle the overlapping case, but instead only allow matching matches to be found. When suffix trees, research papers, books, and even some implementations were involved, they gave examples of compression without overlapping, as if they were perfect compression (I would contact some of them, but my reputation does not allow it). Some of them even mentioned that overlaps can be useful in describing basic compression schemes, but were strangely silent on this subject when discussing suffix trees.

Since the suffix tree should be supplemented to store offset information anyway, this is similar to a property that can be checked when looking for a match β€” you should filter out any matches that start with the wait buffer. And the way the tree will be built / updated will mean that if the edge leads you to a node that matches a match starting with the appearance, you will return the previous node, and the rest of the descendants will also have wait buffers.

Is my approach wrong or wrong? Is there an implementation or discussion of LZ77 / LZSS with suffix trees that mention matches overriding the wait buffer?

+7
c ++ algorithm suffix-tree lossless-compression lz77
source share
1 answer

As I understand it, given the tree of suffixes, we are interested in (roughly) calculating for each suffix S, whose longer suffix has the longest common prefix with S.

Add a link from each node tree to the descendant sheet with the longest suffix (linear time with DFS). From each sheet, go to the side, exploring new links, stopping if a longer suffix is ​​found. The execution time of the last step is linear, since each edge of the tree is checked exactly once.

Life with a limited window is unfortunately more difficult. Instead of distributing one link, we distribute several. To compute a set of suffixes that reference a node, we combine them in a sorted order along the length. Then whenever we have suffixes of lengths x> y> z, if x - z <8192 (for example), then we can drop y, because all three are equally good for all suffixes with which the current node is the most common the ancestor of the sheet, and if y is in the window, then either x or z. Since the window is a large part of the shared file, each node will contain no more than a few links.

If you want to look back at the shortest possible distance, then there is an O (n log ^ 2 n) -time algorithm (probably improved for O (n log n) with various hard-to-implement magicians). In the course of the algorithm, we build for each node a binary tree for the search for descendant suffixes in length, and then perform the next search. To build a node tree from its children, start with the largest child tree and insert elements from others. By argument, a hard path each length is inserted O (log n) times.

+3
source share

All Articles