Search for minimum RLE length

The classic RLE algorithm compresses data using numbers to represent how many times the character following the number appears in the text at that position. For example:

AAABBAAABBCECE => 3A2B3A2B1C1E1C1E

However, in the above example, this method leads to even more space used by compressed text. It would be best to use numbers to represent how many times the substring after the number appears in this text. For example:

AAABBAAABBCECE => 2AAABB2CE ("AAABB" twice, then "CE" twice).

Now, my question is: how can I implement an efficient algorithm that detects the minimum number of characters in an optimal RLE using this method? There are brute force methods, but I need something faster (no more than O (length 2 ) ). Perhaps we can use dynamic programming?

+6
algorithm dynamic-programming run-length-encoding
source share
4 answers

This can be done in quadratic quadratic quadratic time using dynamic programming.

Here is the Python code:

import sys import numpy as np bignum = 10000 S = sys.argv[1] #'AAABBAAABBCECE' N = len(S) # length of longest substring match bet s[i:] and s[j:] maxmatch = np.zeros( (N+1,N+1), dtype=int) for i in xrange(N-1,-1,-1): for j in xrange(i+1,N): if S[i] == S[j]: maxmatch[i,j] = maxmatch[i+1,j+1]+1 # P[n,k] = cost of encoding first n characters given that last k are a block P = np.zeros( (N+1,N+1),dtype=int ) + bignum # Q[n] = cost of encoding first n characters Q = np.zeros(N+1, dtype=int) + bignum # base case: no cost for empty string P[0,0]=0 Q[0]=0 for n in xrange(1,N+1): for k in xrange(1,n+1): if n-2*k >= 0: # s1, s2 = S[nk:n], S[n-2*k:nk] # if s1 == s2: if maxmatch[n-2*k,nk] >=k: # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k P[n,k] = min(P[n,k], P[nk,k]) print 'P[%d,%d] = %d' % (n,k,P[n,k]) # Here we are starting a new block: 1 x_1...x_k P[n,k] = min(P[n,k], Q[nk] + 1 + k) print 'P[%d,%d] = %d' % (n,k,P[n,k]) for k in xrange(1,n+1): Q[n] = min(Q[n], P[n,k]) print print Q[N] 

You can restore the actual encoding, remembering your options along the way.

I left a slight wrinkle, which means that we need an extra byte to hold C + 1 if C is big. If you use 32-bit ints, this will not occur in any context where this algorithm execution time is possible. If you sometimes use shorter ints to save space, you will have to think about it and maybe add another dimension to your table depending on the size of the last C. Theoretically, this can add the log (N) factor, but I don't think that will be obvious in practice.

Edit: in the interest of @Moron, here is the same code with a lot of print statements, so you can more easily see what the algorithm thinks:

 import sys import numpy as np bignum = 10000 S = sys.argv[1] #'AAABBAAABBCECE' N = len(S) # length of longest substring match bet s[i:] and s[j:] maxmatch = np.zeros( (N+1,N+1), dtype=int) for i in xrange(N-1,-1,-1): for j in xrange(i+1,N): if S[i] == S[j]: maxmatch[i,j] = maxmatch[i+1,j+1]+1 # P[n,k] = cost of encoding first n characters given that last k are a block P = np.zeros( (N+1,N+1),dtype=int ) + bignum # Q[n] = cost of encoding first n characters Q = np.zeros(N+1, dtype=int) + bignum # base case: no cost for empty string P[0,0]=0 Q[0]=0 for n in xrange(1,N+1): for k in xrange(1,n+1): if n-2*k >= 0: # s1, s2 = S[nk:n], S[n-2*k:nk] # if s1 == s2: if maxmatch[n-2*k,nk] >=k: # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k P[n,k] = min(P[n,k], P[nk,k]) print "P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters if I use my solution for P[%d,%d] with %s count incremented" % (n\ ,k,P[n,k],n,P[nk,k],nk,k,S[nk:n]) # Here we are starting a new block: 1 x_1...x_k P[n,k] = min(P[n,k], Q[nk] + 1 + k) print 'P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters if I use my solution for Q[%d] with a new block 1%s' % (n,k,P[n,k],n,Q[\ nk]+1+k,nk,S[nk:n]) for k in xrange(1,n+1): Q[n] = min(Q[n], P[n,k]) print print 'Q[%d] = %d\t I can encode first %d characters of S in only %d characters!' % (n,Q[n],n,Q[n]) print print Q[N] 

The last few lines of his release on ABCDABCDABCDBCD look like this:

 Q[13] = 7 I can encode first 13 characters of S in only 7 characters! P[14,1] = 9 I can encode first 14 characters of S in only 9 characters if I use my solution for Q[13] with a new block 1C P[14,2] = 8 I can encode first 14 characters of S in only 8 characters if I use my solution for Q[12] with a new block 1BC P[14,3] = 13 I can encode first 14 characters of S in only 13 characters if I use my solution for Q[11] with a new block 1DBC P[14,4] = 13 I can encode first 14 characters of S in only 13 characters if I use my solution for Q[10] with a new block 1CDBC P[14,5] = 13 I can encode first 14 characters of S in only 13 characters if I use my solution for Q[9] with a new block 1BCDBC P[14,6] = 12 I can encode first 14 characters of S in only 12 characters if I use my solution for Q[8] with a new block 1ABCDBC P[14,7] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[7] with a new block 1DABCDBC P[14,8] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[6] with a new block 1CDABCDBC P[14,9] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[5] with a new block 1BCDABCDBC P[14,10] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[4] with a new block 1ABCDABCDBC P[14,11] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[3] with a new block 1DABCDABCDBC P[14,12] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBC P[14,13] = 16 I can encode first 14 characters of S in only 16 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBC P[14,14] = 15 I can encode first 14 characters of S in only 15 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBC Q[14] = 8 I can encode first 14 characters of S in only 8 characters! P[15,1] = 10 I can encode first 15 characters of S in only 10 characters if I use my solution for Q[14] with a new block 1D P[15,2] = 10 I can encode first 15 characters of S in only 10 characters if I use my solution for Q[13] with a new block 1CD P[15,3] = 11 I can encode first 15 characters of S in only 11 characters if I use my solution for P[12,3] with BCD count incremented P[15,3] = 9 I can encode first 15 characters of S in only 9 characters if I use my solution for Q[12] with a new block 1BCD P[15,4] = 14 I can encode first 15 characters of S in only 14 characters if I use my solution for Q[11] with a new block 1DBCD P[15,5] = 14 I can encode first 15 characters of S in only 14 characters if I use my solution for Q[10] with a new block 1CDBCD P[15,6] = 14 I can encode first 15 characters of S in only 14 characters if I use my solution for Q[9] with a new block 1BCDBCD P[15,7] = 13 I can encode first 15 characters of S in only 13 characters if I use my solution for Q[8] with a new block 1ABCDBCD P[15,8] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[7] with a new block 1DABCDBCD P[15,9] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[6] with a new block 1CDABCDBCD P[15,10] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[5] with a new block 1BCDABCDBCD P[15,11] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[4] with a new block 1ABCDABCDBCD P[15,12] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[3] with a new block 1DABCDABCDBCD P[15,13] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBCD P[15,14] = 17 I can encode first 15 characters of S in only 17 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBCD P[15,15] = 16 I can encode first 15 characters of S in only 16 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBCD Q[15] = 9 I can encode first 15 characters of S in only 9 characters! 
+5
source share

I don’t think dynamic programming will work here, since you can have substrings about half the length of a full line in a solution. Sounds like you need to use brute force. For help, refer to the Lempel-Ziv-Welch Algorithm . This is an efficient algorithm that finds minimal encoding using substrings.

+1
source share

A very common way to encode RLE compressed data is to assign a special byte as “DLE” (sorry, I don’t remember what the term means), which means “next is the count followed by the byte”.

Thus, it is only necessary to encode repeating sequences. Typically, the DLE symbol is chosen so as to minimize the likelihood of its natural distribution in uncompressed data.

For your initial example, let me set the full stop (or point) to DLE, this will encode your example as follows:

 AAABBAAABBCECE => 3A2B3A2B1C1E1C1E <-- your encoding AAABBAAABBCECE => .3ABB.3ABBCECE <-- my encoding 

You would only encode a sequence if it actually ends up as a space saver. If you limit the length of the sequences to 255 so that the counter matches a byte, the sequence thus accepts three bytes, DLE, a counter, and a byte to repeat. You probably would not encode 3-byte sequences, because decoding them carries a bit more overhead than an uncoded sequence.

In your trivial example, saving is nonexistent, but if you try to compress a bitmap containing a screenshot from a large part of a white program, such as Notepad or a browser, you will see real space savings.

If you encounter a DLE character, of course, just count the number 0, since we know that we will never encode a sequence of length 0, DLE followed by a 0-byte means that you decode it as one byte of DLE.

+1
source share

Very smart ways to find matching substrings can lead to the consideration of suffixes and suffix arrays. Thinking of suffix and compression arrays, you can go to http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform . This may be the most elegant way to encode mileage.

0
source share

All Articles