Optimizing byte pair encoding

Having noticed that byte pair encoding (BPE) is conspicuously absent from the Large Text Compression Benchmark, I quickly made a trivial, literal implementation of it.

The compression ratio is surprisingly good, considering there is no further processing, e.g. no Huffman or arithmetic coding.

However, the runtime of my trivial implementation was less than stellar.

How can this be optimized? Can this be done in one go?

+7
optimization algorithm compression
8 answers

This is a summary of my progress:

Googling turned up this little report that links to the original source code and cites the source:

Philip Gage, "A New Algorithm for Data Compression", which appeared in The C Users Journal, February 1994.

The code links on the Dr. Dobbs site are broken, but that web page mirrors them.

That code uses a hash table to track the digraphs in use and their counts as each pass runs over the buffer, so as to avoid recounting everything on every pass.

My test data is enwik8 from the Hutter Prize.

| Implementation | Time (min:sec) | Notes |
|----------------|----------------|-------|
| bpev2 | 1:24 | The current version in the Large Text Compression Benchmark |
| bpe_c | 1:07 | The original version by Gage, using a hash table |
| bpev3 | 0:25 | Uses a list, custom sort, less memcpy |

bpev3 creates a list of all digraphs; the blocks are 10 KB in size, and typically around 200 digraphs are above the threshold (4, the smallest count at which a replacement can still save bytes); this list is sorted and the first substitution is made.
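Not the actual bpev3 source, just a minimal C sketch of the list-building step described above: count every adjacent byte pair in a block, keep only the digraphs at or above the threshold, and sort them by count (all names here are illustrative):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        unsigned char a, b;    /* the two bytes of the digraph */
        unsigned int  count;   /* occurrences within the block */
    } digraph_t;

    static int by_count_desc(const void *x, const void *y)
    {
        const digraph_t *p = x, *q = y;
        return (int)q->count - (int)p->count;   /* most frequent first */
    }

    static size_t build_digraph_list(const unsigned char *buf, size_t len,
                                     digraph_t *list)
    {
        static unsigned int counts[65536];
        size_t i, n = 0;

        memset(counts, 0, sizeof counts);
        for (i = 0; i + 1 < len; i++)
            counts[(buf[i] << 8) | buf[i + 1]]++;

        for (i = 0; i < 65536; i++) {
            if (counts[i] >= 4) {               /* threshold: fewer occurrences cannot pay for a dictionary entry */
                list[n].a = (unsigned char)(i >> 8);
                list[n].b = (unsigned char)(i & 0xFFu);
                list[n].count = counts[i];
                n++;
            }
        }
        qsort(list, n, sizeof *list, by_count_desc);
        return n;                               /* typically ~200 entries for a 10 KB block */
    }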

As substitutions are made, the digraph statistics are updated; typically only about 10 or 20 digraphs change in each pass; these are marked and sorted, then merged back into the digraph list; this is substantially faster than re-sorting the whole digraph list every pass, since the list is nearly sorted.

The original code shuttles the data between the buffers 'tmp' and 'buf'; bpev3 just swaps the buffer pointers, which alone is worth about 10 seconds.
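A minimal sketch of that pointer swap, assuming the two pass buffers are called buf and tmp as in the original code:

    /* Instead of memcpy'ing the pass output back into the input buffer,
       just exchange the two pointers between passes. */
    static void swap_buffers(unsigned char **buf, unsigned char **tmp)
    {
        unsigned char *t = *buf;
        *buf = *tmp;
        *tmp = t;
    }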

Given that fixing the buffer copying in bpev2 would bring the exhaustive-search version roughly into line with the hash table version, I think the hash table is of arguable value, and that a list is the better structure for this problem.

However, it is tuned with thresholds, and is therefore not a general-purpose algorithm.

If you look at the Large Text Compression Benchmark, the original bpe has since been added. Because of its larger block sizes, it performs better than my bpe on enwik9. Also, the performance gap between the hash table and my lists is much closer there - I put that down to the march=PentiumPro that the LTCB build uses.

Of course there are occasions where it is suitable and used; Symbian uses it to compress pages in ROM images. I speculate that the 16-bit nature of Thumb binaries makes this a simple and rewarding approach; compression is done on a PC and decompression is done on the device.

+5

I worked on optimizing the implementation of LZF compression, and some of the same principles that I used to improve performance can be used here.

To speed up byte pair encoding:

  • Limit the block size to less than 65 KB (probably 8-16 KB will be optimal). This guarantees that not all byte values are used within a block, and it lets you hold all the intermediate processing data in RAM.
  • Use a hash table or a simple lookup table indexed by short integer (more RAM, but faster) to hold the counts for byte pairs. There are 65,536 possible 2-byte pairs, and with a block size of at most 64 KB each count fits in a short; that gives you a count table of 128 KB.
  • Allocate and reuse the data structures capable of holding a full compression block, the replacement table, the byte-pair counts, and the output bytes in memory. This sounds wasteful of RAM, but when you consider that your block size is small, it is worth it. Your data should be able to sit entirely in the L2 or (worst case) L3 cache. This gives a BIG speed boost.
  • Do one fast pass over the data to collect the counts, THEN worry about creating the replacement table.
  • Pack byte pairs into integers or short ints whenever possible (applicable mainly to C/C++). A single entry in the count table can be represented by one integer (a 16-bit count plus the byte pair); see the sketch after this list.
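A minimal C sketch of the lookup-table idea under the assumptions above (names and sizes are illustrative, not from any particular implementation): one 16-bit counter per possible byte pair, indexed directly by the pair packed into a short.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* 65,536 pairs * 2 bytes = 128 KB; small enough to sit in cache,
       and with blocks of at most 64 KB no counter can overflow. */
    static uint16_t pair_count[65536];

    static void count_pairs(const uint8_t *block, size_t len)
    {
        memset(pair_count, 0, sizeof pair_count);
        for (size_t i = 0; i + 1 < len; i++) {
            /* pack the byte pair into one 16-bit index */
            uint16_t pair = (uint16_t)((block[i] << 8) | block[i + 1]);
            pair_count[pair]++;
        }
    }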
+3

The code, in JustBasic, can be found here, complete with its text input file.

BASIC Files archive only - forum post

EBPE by TomC 02/2014 - Enhanced Byte Pair Encoding

EBPE adds two post processes to byte pair encoding:


1. Compressing the dictionary (believed to be novel)

A dictionary entry is 3 bytes:

 AA – the two characters to be replaced (the byte pair)
 1  – the single token that replaces them (tokens are unused symbols)

So "AA1" tells us when it decodes that every time we see "1" in the data file, replace it with "AA" .

Long runs of sequential tokens are possible, so let's look at these 8 tokens:

 AA1BB3CC4DD5EE6FF7GG8HH9 

Length 24 bytes (8 * 3)

Token 2 is not in the file, indicating that it was not an open token to use; or, put another way, 2 was present in the original data.

We can see that the last 7 tokens 3,4,5,6,7,8,9 are sequential, so whenever we see a sequential run of 4 tokens or more we can rewrite the dictionary as:

 AA1BB3<255>CCDDEEFFGGHH<255> 

where <255> tells us that the tokens for these byte pairs are implied, each one greater than the last token we saw (3). We keep incrementing by one until we see the next <255>, which marks the end of the run.

  • The original dictionary was 24 bytes,
  • The enhanced dictionary is 20 bytes.

I saved 175 bytes using this enhancement on a text file where the tokens 128 to 254 came out sequential, along with other runs in general created by the lower-case pre-processing.
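A hedged C sketch of how a decoder might parse that compacted dictionary (the names and the dict_pair table are illustrative, not from the JustBasic source): plain entries are 3 bytes, and inside a <255>...<255> run only the byte pairs are stored, each implied token being one greater than the last token seen.

    #include <stddef.h>

    static unsigned char dict_pair[256][2];   /* token -> the byte pair it replaces */

    static void decode_dictionary(const unsigned char *d, size_t n)
    {
        size_t i = 0;
        int in_run = 0;
        unsigned char token = 0;

        while (i < n) {
            if (d[i] == 255) {                /* start or end of a sequential run */
                in_run = !in_run;
                i++;
                continue;
            }
            unsigned char a = d[i], b = d[i + 1];
            if (in_run) {
                token++;                      /* implied: one more than the last token */
                i += 2;
            } else {
                token = d[i + 2];             /* explicit 3-byte entry: pair + token */
                i += 3;
            }
            dict_pair[token][0] = a;
            dict_pair[token][1] = b;
        }
    }

Running this over the example AA1BB3<255>CCDDEEFFGGHH<255> recovers the same mapping as the uncompressed 24-byte dictionary.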

2. Compressing the data file

The reuse of rarely used characters as tokens is not new.

After all the available characters have been used up as tokens (except <255>), we scan the file and find a single "j". We let this character do double duty:

  • "<255>j" means it's literal "j"
  • "j" now used as a token for re-compression,

If j occurred only once in the data file, we would need to add one <255> plus a 3-byte dictionary entry, so the replacement has to save more than 4 bytes in the BPE for it to be worth it.

If j occurred 6 times, we would need six <255> bytes plus a 3-byte dictionary entry, so the replacement has to save more than 9 bytes in the BPE for it to be worth it.

Depending on whether further compression is possible and how many byte pairs remain in the file, this post process saved more than 100 bytes on test runs.

Note: when decompressing, make sure you do not expand every "j". You need to check the previous character to make sure it is not <255> before expanding. Finally, after all decompression, go back and remove the <255> escapes to recreate the original file.
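A hedged C sketch of that decompression rule (one level of expansion only; dict_pair and token_used are assumed to have been rebuilt from the dictionary, and all names are illustrative):

    #include <stddef.h>

    static unsigned char dict_pair[256][2];   /* token -> byte pair */
    static unsigned char token_used[256];     /* nonzero if that byte is a token */

    static size_t expand_block(const unsigned char *in, size_t n, unsigned char *out)
    {
        size_t i = 0, o = 0;
        while (i < n) {
            if (in[i] == 255 && i + 1 < n) {
                out[o++] = in[i + 1];           /* escaped literal: drop the <255> */
                i += 2;
            } else if (token_used[in[i]]) {
                out[o++] = dict_pair[in[i]][0]; /* token: expand to its byte pair */
                out[o++] = dict_pair[in[i]][1];
                i++;
            } else {
                out[o++] = in[i++];             /* ordinary literal byte */
            }
        }
        return o;
    }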

3. What is next for EBPE?

Unknown at present

+2

I do not think this can be done in one pass unless you find a way to predict, given a byte-pair replacement, whether the new byte pair created by the replacement will also be a good candidate for replacement.

Here are my first thoughts. Perhaps you have already considered some or all of this.

I would try the following.

Two adjustable parameters:

  • The number of occurrences of a byte pair in a chunk of data before considering a replacement. (So that the dictionary does not grow faster than the chunk shrinks.)
  • The number of replacements per pass below which it is probably no longer worth continuing. (So the algorithm stops wasting time when there is only 1 or 2% left to gain.)

I would keep making passes as long as it is still worth compressing one more level (according to parameter 2). During each pass, I would keep a running count of byte pairs as I go.

I would play around with the two parameters and see how they affect the compression ratio and speed. They should probably change dynamically, depending on the length of the chunk to be compressed (and perhaps one or two other things).

Another thing to consider is the data structure used to hold the count of each byte pair during a pass. There is most likely a way to write a custom one that would be faster than generic data structures.

Keep us informed if you try something and get interesting results!

+1

Yes, keep us posted.

guarantee?

BobMcGee gives good advice. However, I suspect that "Limit the block size to less than 65 KB ... This guarantees that not all byte values are used" is not always true. I can generate a (highly artificial) binary file less than 1 KB long that has a byte pair repeated 10 times, but that cannot be compressed at all with BPE because it uses all 256 byte values - there are no free bytes that BPE can use to represent the frequent byte pair.

If we restrict ourselves to 7-bit ASCII text, we have over 127 free byte values, so any file that repeats a byte pair enough times can be compressed at least a little by BPE. However, even then I can (artificially) generate a file that uses only the ASCII isgraph() characters, is less than 30 KB long, and eventually hits BPE's "no free bytes" limit even though a byte pair with more than 4 repetitions still remains.

single pass

It seems this algorithm could be slightly tweaked to do it in one pass. Assuming 7-bit ASCII plaintext: scan over the input text, remembering every byte pair we have seen in some internal data structure, somehow counting the number of unique byte pairs seen so far, and copying each byte to the output (with the high bit zero). Whenever we encounter a repeat, emit a special byte that represents the byte pair (with the high bit one, so we do not confuse literal bytes with byte pairs). Include this special byte in the internal list of "pair" bytes, so that the compressor can later emit some other special byte that represents this special byte plus a literal byte - the net effect of that other special byte is to represent a triple. As Fkahler pointed out, this starts to sound practically the same as LZW.
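A hedged C sketch of the decoder side of such a scheme (pair[] and expand() are illustrative, not an existing API): bytes with the high bit clear are 7-bit literals, bytes with the high bit set are tokens, and each token expands to two symbols, either of which may itself be a token - which is how a later token ends up standing for a triple.

    #include <stdio.h>

    static unsigned char pair[128][2];   /* token t (128..255) -> pair[t - 128] */

    static void expand(unsigned char sym, FILE *out)
    {
        if (sym < 128) {
            fputc(sym, out);                  /* literal 7-bit character */
        } else {
            expand(pair[sym - 128][0], out);  /* may itself be a token */
            expand(pair[sym - 128][1], out);
        }
    }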

EDIT: Apparently the "no free bytes" restriction I mentioned above is not, after all, an inherent limitation of all byte pair compressors, since there is at least one byte pair compressor without that restriction.

Have you seen SCZ - Simple Compression Utilities and Library? SCZ appears to be a kind of byte pair encoder. SCZ apparently gives better compression than the other byte pair compressors I have seen, because SCZ does not have the "no free bytes" restriction mentioned above.

If any byte pair BP repeats enough times in the plain text (or, after a few rounds of iteration, in the partially compressed text), SCZ can do byte pair compression even when the text already contains all 256 byte values.

(SCZ uses a special escape byte E in the compressed text, which indicates that the following byte is to be taken literally rather than expanded as a byte pair. This lets some byte M in the compressed text do double duty: the two bytes EM in the compressed text represent M in the plain text; the byte M on its own (without a preceding escape byte) in the compressed text represents the byte pair BP in the plain text. If some byte pair BP occurs many more times than M does in the plain text, then the space saved by representing each BP byte pair as the single byte M in the compressed data more than makes up for the space "lost" by representing each M as the two bytes EM.)

+1

You can also optimize the dictionary so that:

AA1BB2CC3DD4EE5FF6GG7HH8 - a sequential run of 8 tokens.

Rewrite that as:

AA1<255>BBCCDDEEFFGGHH<255>, where <255> tells the program that each of the following byte pairs (up to the next <255>) is sequential, with the token incremented by one each time. This works well for text files, and for anything with at least 4 sequential tokens.

This saved 175 bytes in the latest test.

+1

Here is a new BPE (http://encode.ru/threads/1874-Alba). Example compile: gcc -O1 alba.c -o alba.exe. It is faster than the default.

+1

The simplest efficient structure is a 2-dimensional array, for example byte_pair(255,255). Put the counts in there and modify them as the file is compressed.
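A hedged C sketch of keeping such a 2-D count array up to date while the file is being compressed, instead of recounting the whole block after each substitution (all names are illustrative): when the pair (a,b) at position i has just been replaced by token t, only the counts around that position change.

    #include <stddef.h>

    static unsigned short byte_pair[256][256];

    /* buf/len describe the block after the replacement has been applied,
       with buf[i] == t where the pair (a,b) used to start. */
    static void update_counts(const unsigned char *buf, size_t len, size_t i,
                              unsigned char a, unsigned char b, unsigned char t)
    {
        byte_pair[a][b]--;                       /* the replaced pair disappears */
        if (i > 0) {
            byte_pair[buf[i - 1]][a]--;          /* old left neighbour pair (x,a) */
            byte_pair[buf[i - 1]][t]++;          /* new left neighbour pair (x,t) */
        }
        if (i + 1 < len) {
            byte_pair[b][buf[i + 1]]--;          /* old right neighbour pair (b,y) */
            byte_pair[t][buf[i + 1]]++;          /* new right neighbour pair (t,y) */
        }
    }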

0
