Good compression scheme for continued fraction terms?

So, I am implementing a continued fraction library to handle a subset of quadratic integers and rational numbers. The terms of the continued fractions are represented by unsigned integers. When working with continued fractions, I noticed the following general patterns:

  • Most of the terms are small single digits, with the most common being 1.
  • Some terms can be huge; the maximum possible for my application is 366 bits, but this is very rare.
  • A large term means that the convergent just before it is an exceptionally good approximation, so continued fractions containing a large term usually have fewer terms overall.
  • The worst-case continued fraction is the golden ratio, and a rational approximation of it with 366 bits of precision corresponds to roughly 525 terms, all equal to 1.
  • Random rational numbers usually do not have long runs of the same term, but runs of two to four of the same term can occur, with 1 again being the most common.

Thus, I have a bound both on the number of terms and on the size of a term, and the number of terms is roughly inversely proportional to their size. Storing these terms in machine words or even bytes is therefore usually very wasteful of space, and I still need to handle multiple-precision arithmetic in the worst case. Given the roughly inverse relationship between term size and the number of terms (both of which also depend on the size of the fraction's numerator and denominator), I am trying to find or devise a good compression scheme so that I do not waste so much space on the integer terms.

I looked at Huffman encoding, since its encoding and decoding speed is good, but I am not sure how to derive the probabilities for the code values. I have a vague intuition that Stern-Brocot trees might offer a hint, since they are binary trees with a direct connection to continued fractions.

Does anyone know of a good compression approach for handling a large number of small numbers interspersed with occasional huge numbers, where runs of the same number are usually short (but can be long in rare cases)? In particular, I need to be able to encode and decode fairly quickly (for example, O(n*lg(n)) is the absolute worst speed I can tolerate, with O(n) or better preferred) and to be able to find the position of individual terms, so that I know which term I am working with (the fourth, the fifth, etc.).

In addition, I am not interested in using any third-party real-number or continued fraction libraries. I have looked at a few, and they are either insufficient or overkill for my needs, and I would like the experience of coding this myself for my own edification.

Update

I also found out about the Gauss-Kuzmin distribution, which gives the probability distribution of a particular continued fraction term k for a random number uniformly distributed between 0 and 1. In Python pseudocode this is

P(k) = -lg(1.0 - 1.0/((k + 1.0)**2)), where lg is the base-2 logarithm. This is a limiting distribution, so my case is somewhat more restricted (although 2**366 is still huge). The entropy of continued fractions by Linas Vepstas gives the (information) entropy of the Gauss-Kuzmin distribution as about 3.43 bits. My maximum denominator is large enough that the information entropy of my continued fractions is probably close to this limit, although the graph in the related article shows that the limit is approached rather slowly, so it differs somewhat for relatively small denominators.
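
As a quick sanity check, here is a small Python sketch (truncating the infinite sum at an arbitrary cutoff, which is my own choice for illustration) that tabulates the distribution from the formula above and estimates its entropy:

    from math import log2

    def gauss_kuzmin(k):
        """Probability that a continued fraction term equals k (Gauss-Kuzmin)."""
        return -log2(1.0 - 1.0 / (k + 1.0) ** 2)

    # Probabilities of the first few term values, and the entropy of the
    # truncated distribution; it should come out near the 3.43 bits quoted above.
    probs = [gauss_kuzmin(k) for k in range(1, 200000)]
    print(probs[:3])                           # roughly [0.415, 0.170, 0.093]
    print(-sum(p * log2(p) for p in probs))    # roughly 3.43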

+4
4 answers

One possibility is a simple prefix code: write the number in binary as 1x, encode each bit of x as 0 → 10 and 1 → 11, and then append the terminator 0.

Table:

 1 -> 0
 2 -> 100
 3 -> 110
 4 -> 10100
 5 -> 10110
 6 -> 11100
 7 -> 11110
 8 -> 1010100

The corresponding Huffman code probabilities here for n are something like Theta(1/n^2) (hand-waving a little because the alphabet is infinite).
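
A rough Python sketch of the encoder and decoder (the names are mine, for illustration):

    def encode_term(n):
        """Encode a positive integer: write n in binary as 1x, map each bit of x
        as 0 -> 10 and 1 -> 11, then append the terminating 0."""
        assert n >= 1
        x = bin(n)[3:]                    # the bits of n after the leading 1
        return "".join("1" + b for b in x) + "0"

    def decode_terms(bits):
        """Decode a concatenated bit string back into the list of terms."""
        terms, i = [], 0
        while i < len(bits):
            digits = "1"                  # the implicit leading 1
            while bits[i] == "1":         # a 1 means another payload bit follows
                digits += bits[i + 1]
                i += 2
            i += 1                        # skip the terminating 0
            terms.append(int(digits, 2))
        return terms

    # encode_term(8) == "1010100"; decode_terms("0100110") == [1, 2, 3]

Since each code is self-delimiting, you can decode sequentially while keeping a count, so locating the k-th term is a single linear scan.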

+2

Your distribution seems amenable to Rice coding. You can tune the encoding parameter, call it k, to your data. The encoding takes the value of the bits above the low k bits of a number and transmits that many 1 bits, followed by a 0. Then you transmit the low k bits directly.
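
A rough Python sketch (continued fraction terms start at 1, so you might store n - 1; the choice of k and the names below are mine):

    def rice_encode(n, k):
        """Rice-code a non-negative integer n: send the value of the bits above
        the low k bits in unary (that many 1s, then a 0), then the low k bits."""
        high, low = n >> k, n & ((1 << k) - 1)
        return "1" * high + "0" + (format(low, "b").zfill(k) if k else "")

    def rice_decode(bits, k):
        """Decode a concatenated stream of Rice codes with parameter k."""
        values, i = [], 0
        while i < len(bits):
            high = 0
            while bits[i] == "1":          # unary part: count the 1s
                high += 1
                i += 1
            i += 1                         # skip the terminating 0
            low = int(bits[i:i + k], 2) if k else 0
            i += k
            values.append((high << k) | low)
        return values

    # rice_encode(5, 2) == "1001"; rice_decode("1001" + "001", 2) == [5, 1]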

+2

You could use arithmetic coding, treating each positive integer as a symbol in the source alphabet. It does not matter that there are infinitely many of them, since the relative probability of larger and larger integers drops to zero.

In fact, given the uniform distribution on [0,1), you can set the conditional probability of each new integer a_n in the continued fraction expansion (that is, of each new symbol from the entropy source) to P(a_n = k) = 1/k - 1/(k+1). Consider the very first integer to see why this conditional probability makes sense: half of the numbers in [0,1) have a_1 = 1, one sixth of them have a_1 = 2, and so on.

In addition, you may need to flip the direction of the arithmetic coding with each new symbol for the decoding to be unique. Unfortunately, arithmetic encoding/decoding is not very fast.
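
A Python sketch of just the interval-narrowing step (not a full coder, and using exact rationals instead of the fixed-precision renormalization a real implementation needs); the model above gives the cumulative distribution F(k) = k/(k+1):

    from fractions import Fraction

    def cdf(k):
        """Cumulative probability P(a_n <= k) under P(a_n = k) = 1/k - 1/(k+1)."""
        return Fraction(k, k + 1)

    def narrow(lo, hi, k):
        """One arithmetic-coding step: shrink the current interval [lo, hi)
        to the sub-interval assigned to the next term k."""
        width = hi - lo
        return lo + width * cdf(k - 1), lo + width * cdf(k)

    # Encoding the terms [2, 1, 3] narrows [0, 1) step by step; any number
    # inside the final interval identifies that exact sequence of terms.
    lo, hi = Fraction(0), Fraction(1)
    for term in [2, 1, 3]:
        lo, hi = narrow(lo, hi, term)
    print(lo, hi)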

+1

You could consider dropping continued fractions and instead implementing their mutant cousin: continued logarithms. See the bottom of this article for a discussion and an implementation of the basic arithmetic.

There is even a hardware implementation for massively parallel arithmetic on continued logarithms, with the same asymptotic complexity as the FFT, but it is much simpler and therefore has much lower constants.

If you are looking for something more exotic than what can be built from just the reversible operators {+, -, *, /}, see my recent question about a new floor operator.

As you have seen, continued fractions tend to blow up in term bit size for very large integers or very small fractions. This seesawing between giant terms and small terms is exactly what you would need to exploit in any compression scheme.

Continued logarithms, on the other hand, break large terms up into a kind of "recursive scientific notation," as Bill Gosper describes in that article. Each term's co-routine emits or consumes only very small messages, which lend themselves naturally to run-length encoding and describe the base-2 order of magnitude of the quantity remaining to be described.
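
A minimal sketch of the expansion itself, assuming Gosper's definition x = 2^n0 * (1 + 1/(2^n1 * (1 + ...))) for a rational x >= 1 (the co-routine arithmetic in the linked article is much more involved):

    from fractions import Fraction

    def continued_log(x):
        """Terms of the binary continued logarithm of a rational x >= 1: each
        term is the base-2 order of magnitude of what remains, so a huge value
        contributes a term of size about lg(x) instead of about x."""
        x = Fraction(x)
        terms = []
        while True:
            n = (x.numerator // x.denominator).bit_length() - 1   # floor(lg x)
            terms.append(n)
            x /= 2 ** n
            if x == 1:
                return terms
            x = 1 / (x - 1)

    print(continued_log(Fraction(10)))      # [3, 2]
    print(continued_log(Fraction(2**100)))  # [100]: one tiny term for a huge value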

An unfortunate side effect of switching to continued logarithms is that the Hurwitz numbers, which are very regular as continued fractions, have no analogue there. However, most, but not all, quadratic surds remain periodic, which can itself be regarded as a compression scheme.

Once the terms are in a compact run-length-encoded format, you can apply traditional compression methods to the runs of small numbers, such as the LZW mentioned in Mark Ransom's comments.

0