How can I efficiently encode/decode a compressed position description?

I am writing a tablebase for a variant of Japanese chess. To index the tablebase, I encode each position as a single number. In one of the coding steps, I encode where the pieces are on the board. Since the actual method is somewhat complicated, let me explain the problem in a simplified form.

Coding

In the endgame I am interested in, there are (say) six distinct pieces that I want to distribute over a board with 9 squares. I can naively represent their positions as a six-tuple ⟨a, b, c, d, e, f⟩, where each of the variables a through f is a number in the range 0 to 8 inclusive, indicating the square on which the corresponding piece stands.

However, this representation is not optimal: no two pieces can occupy the same square, but the encoding above happily allows that. We can encode the same position as a six-tuple ⟨a, b′, c′, d′, e′, f′⟩, where a is the same as before and b′ is a number from 0 to 7 inclusive indicating the square on which the second piece stands. This works by numbering the squares from 0 to 7, skipping the square occupied by the first piece. For example, if the first piece is on square 3, the square numbers for the second piece are:

    1st piece: 0 1 2 3 4 5 6 7 8
    2nd piece: 0 1 2 - 3 4 5 6 7

The other pieces are encoded in the same way: c′ as a number from 0 to 6, d′ as a number from 0 to 5, and so on. For example, the naive encoding (5, 2, 3, 0, 7, 4) yields the compact encoding (5, 2, 2, 0, 3, 1):

    1st: 0 1 2 3 4 5 6 7 8  --> 5
    2nd: 0 1 2 3 4 - 5 6 7  --> 2
    3rd: 0 1 - 2 3 - 4 5 6  --> 2
    4th: 0 1 - - 2 - 3 4 5  --> 0
    5th: - 0 - - 1 - 2 3 4  --> 3
    6th: - 0 - - 1 - 2 - 3  --> 1

In my actual encoding, the number of pieces to encode is not fixed; the number of squares on the board is, however.

Question

How can I efficiently convert between the naive and the compact representation, in both directions? The program is written in standard C99. In the context of this question, I am not interested in answers that use non-standard constructs, inline assembly, or compiler intrinsics.

Clarification of the issue

There seems to be some confusion about the issue:

  • The question asks for a practical, efficient way to implement the transformation between the naive and the compact representation of positions.
  • Both representations are n-tuples of integers in specified ranges. The question is not about how to encode these representations into something else.
  • In one concrete case, the number of squares is 25 and the number of pieces is up to 12. However, I am interested in an implementation that works for a reasonable parameter space (say, up to 64 squares and up to 32 pieces).
  • I am not interested in alternative representations or encodings, especially representations or encodings that are not optimal.
  • I am not interested in comments saying that the compact representation is not worth the effort.
  • I am not interested in answers that use inline assembly, compiler intrinsics, or any other non-standard constructs (with the possible exception of facilities described in POSIX).
6 answers

I found a rather elegant solution for up to 16 positions, using 64-bit integers with a single loop for both encoding and decoding:

    #include <stdio.h>
    #include <stdlib.h>

    void encode16(int dest[], int src[], int n) {
        unsigned long long state = 0xfedcba9876543210ULL;
        for (int i = 0; i < n; i++) {
            int p4 = src[i] * 4;
            dest[i] = (state >> p4) & 15;
            state -= 0x1111111111111110ULL << p4;
        }
    }

    void decode16(int dest[], int src[], int n) {
        unsigned long long state = 0xfedcba9876543210ULL;
        for (int i = 0; i < n; i++) {
            int p4 = src[i] * 4;
            dest[i] = (state >> p4) & 15;
            unsigned long long mask = ((unsigned long long)1 << p4) - 1;
            state = (state & mask) | ((state >> 4) & ~mask);
        }
    }

    int main(int argc, char *argv[]) {
        int naive[argc], compact[argc];
        int n = argc - 1;
        for (int i = 0; i < n; i++) {
            naive[i] = atoi(argv[i + 1]);
        }
        encode16(compact, naive, n);
        for (int i = 0; i < n; i++) {
            printf("%d ", compact[i]);
        }
        printf("\n");
        decode16(naive, compact, n);
        for (int i = 0; i < n; i++) {
            printf("%d ", naive[i]);
        }
        printf("\n");
        return 0;
    }

The code uses unsigned 64-bit integers to store an array of 16 values in the range 0..15. Such an array can be updated in parallel in a single step; retrieving a value is trivial, and deleting a value is a little more cumbersome, but still takes only a few steps.

You can extend this method to 25 positions using non-portable 128-bit integers (the __int128 type supported by both gcc and clang), encoding each position in 5 bits and using the fact that 5 * 25 < 128, but the magic constants become more cumbersome to write down.


A naive solution to the problem: create an array whose values are initially equal to their indices. When a square is used, take its value from the array and decrement all values to its right. The running time of this solution is O(n*p), where n is the number of squares on the board and p is the number of pieces on the board.

    int codes[25];

    void initCodes( void ) {
        for ( int i = 0; i < 25; i++ )
            codes[i] = i;
    }

    int getCodeForLocation( int location ) {
        for ( int i = location + 1; i < 25; i++ )
            codes[i]--;
        return codes[location];
    }

You can try to improve the performance of this code with binning. Consider the board as 5 bins of 5 squares each. Each bin has an offset, and each square within a bin has a value. When a value is taken from bin y at location x, the offsets of all bins after y are decremented, as are all values to the right of x within bin y.

    int codes[5][5];
    int offset[5];

    void initCodes( void ) {
        int code = 0;
        for ( int row = 0; row < 5; row++ ) {
            for ( int col = 0; col < 5; col++ )
                codes[row][col] = code++;
            offset[row] = 0;
        }
    }

    int getCodeForLocation( int location ) {
        int startRow = location / 5;
        int startCol = location % 5;
        for ( int col = startCol+1; col < 5; col++ )
            codes[startRow][col]--;
        for ( int row = startRow+1; row < 5; row++ )
            offset[row]--;
        return codes[startRow][startCol] + offset[startRow];
    }

The running time of this solution is O(sqrt(n) * p). However, on a board with 25 squares you will not see an improvement. To see why, consider the actual operations performed by the naive solution versus the binned solution. Worst case, the naive solution updates 24 entries. Worst case, the binned solution updates 4 entries in the offset array and 4 entries in the codes array. So that looks like a 3:1 speedup. However, the binned code contains a nasty division/modulo and is more complex overall, so you might get a 2:1 speedup if you're lucky.

If the board were huge, e.g. 256x256, binning would be a clear win. The worst case for the naive solution would be 65,535 updates, whereas binning would update at most 255 + 255 = 510 array entries. That would easily pay for the nasty division and the added code complexity.

And that is the futility of trying to optimize small problem sizes. Going from O(n) to O(sqrt(n)) or O(log(n)) saves little when n=25, sqrt(n)=5, log(n)=5. You get a theoretical speedup, but it is almost always a false saving once you account for all the constant factors that big-O so blithely ignores.


For completeness, here is a driver that can be used with either snippet above:

    #include <stdio.h>

    int main( void ) {
        int locations[6] = { 5,2,3,0,7,4 };
        initCodes();
        for ( int i = 0; i < 6; i++ )
            printf( "%d ", getCodeForLocation(locations[i]) );
        printf( "\n" );
    }

Output: 5 2 2 0 3 1


Your encoding has the property that the value of each element of the output tuple depends on the value of the corresponding element and of all preceding elements of the input tuple. I don't see a way to accumulate partial results while computing one encoded element that could be reused in computing another, and without that, the encoding computation cannot scale better (in time) than O(n²) in the number of elements to encode. Therefore, for the problem sizes you describe, I don't think you can do much better than this:

    typedef <your choice> element_t;

    void encode(element_t in[], element_t out[], int num_elements) {
        for (int p = 0; p < num_elements; p++) {
            element_t temp = in[p];
            for (int i = 0; i < p; i++) {
                temp -= (in[i] < in[p]);
            }
            out[p] = temp;
        }
    }

The corresponding decoding can be performed like this:

    void decode(element_t in[], element_t out[], int num_elements) {
        for (int p = 0; p < num_elements; p++) {
            element_t temp = in[p];
            for (int i = p - 1; i >= 0; i--) {
                temp += (in[i] <= temp);
            }
            out[p] = temp;
        }
    }

There are approaches that scale better, some of which are discussed in the comments and in other answers, but I believe your problem size is not large enough for their better scaling to overcome their increased overhead.

Obviously, these transformations by themselves do not change the size of the representation. However, the encoded representation is easier to validate, since each position in the tuple can be checked independently of the others. For the same reason, the whole space of valid tuples can be enumerated much more efficiently in encoded form than in decoded form.

I maintain that the decoded form can be stored almost as efficiently as the encoded one, especially if you want to be able to access individual position descriptions. If your goal for the encoded form is to support bulk enumeration, then you might consider enumerating tuples in encoded form but storing and subsequently using them in decoded form. The small amount of extra space may well be worth not having to decode after every read, especially if you plan to read many of them.


Update:

In response to your comment: the elephant in the room is the question of how you convert the encoded form into a single index, such as you describe, with as few unused indices as possible. I suspect that disconnect generated much of the discussion you consider off-topic, and I believe some unstated assumptions lie behind your claim of a 24-fold space saving.

The encoded form is more easily converted to a compact index. For example, you can interpret the position as a little-endian number with the board size as its base:

    #define BOARD_SIZE 25

    typedef <big enough> index_t;

    index_t to_index(element_t in[], int num_elements) {
        // The leading digit must not be zero
        index_t result = in[num_elements - 1] + 1;
        for (int i = num_elements - 1; i--; ) {
            result = result * BOARD_SIZE + in[i];
        }
        return result;
    }

Of course, there are still gaps, but I believe they constitute a fairly small fraction of the overall range of index values used (and that is precisely the reason for choosing the little-endian interpretation). I leave the inverse transformation as an exercise :).


To convert from the naive to the compact representation, you can iterate over the n-tuple and perform the following steps for each position p:

  • optionally check that position p is free
  • mark position p as occupied
  • subtract from p the number of occupied lower positions
  • store the result in the destination n-tuple

You can do this by maintaining an array of n bits for the occupancy state:

  • steps 1, 2 and 4 take constant time
  • step 3 can be computed efficiently if the array is small, i.e. fits in 64 bits.

Here is the implementation:

    #include <stdio.h>
    #include <stdlib.h>

    /* version for up to 9 positions */
    #define BC9(n)  ((((n)>>0)&1) + (((n)>>1)&1) + (((n)>>2)&1) + \
                     (((n)>>3)&1) + (((n)>>4)&1) + (((n)>>5)&1) + \
                     (((n)>>6)&1) + (((n)>>7)&1) + (((n)>>8)&1))
    #define x4(m,n)    m(n), m((n)+1), m((n)+2), m((n)+3)
    #define x16(m,n)   x4(m,n), x4(m,(n)+4), x4(m,(n)+8), x4(m,(n)+12)
    #define x64(m,n)   x16(m,n), x16(m,(n)+16), x16(m,(n)+32), x16(m,(n)+48)
    #define x256(m,n)  x64(m,n), x64(m,(n)+64), x64(m,(n)+128), x64(m,(n)+192)

    static int const bc512[1 << 9] = {
        x256(BC9, 0),
        x256(BC9, 256),
    };

    int encode9(int dest[], int src[], int n) {
        unsigned int busy = 0;
        for (int i = 0; i < n; i++) {
            int p = src[i];
            unsigned int bit = 1 << p;
            //if (busy & bit) return 1;  // optional validity check
            busy |= bit;
            dest[i] = p - bc512[busy & (bit - 1)];
        }
        return 0;
    }

    /* version for up to 64 positions */
    static inline int bitcount64(unsigned long long m) {
        m = m - ((m >> 1) & 0x5555555555555555);
        m = (m & 0x3333333333333333) + ((m >> 2) & 0x3333333333333333);
        m = (m + (m >> 4)) & 0x0f0f0f0f0f0f0f0f;
        m = m + (m >> 8);
        m = m + (m >> 16);
        m = m + (m >> 16 >> 16);
        return m & 0x3f;
    }

    int encode64(int dest[], int src[], int n) {
        unsigned long long busy = 0;
        for (int i = 0; i < n; i++) {
            int p = src[i];
            unsigned long long bit = 1ULL << p;
            //if (busy & bit) return 1;  // optional validity check
            busy |= bit;
            dest[i] = p - bitcount64(busy & (bit - 1));
        }
        return 0;
    }

    int main(int argc, char *argv[]) {
        int src[argc], dest[argc];
        int cur, max = 0, n = argc - 1;
        for (int i = 0; i < n; i++) {
            src[i] = cur = atoi(argv[i + 1]);
            if (max < cur) max = cur;
        }
        if (max < 9) {
            encode9(dest, src, n);
        } else {
            encode64(dest, src, n);
        }
        for (int i = 0; i < n; i++) {
            printf("%d ", dest[i]);
        }
        printf("\n");
        return 0;
    }

The main optimization lies in the implementation of bitcount(), which you can adapt to your needs by specializing it for the actual number of positions. I posted efficient versions above for small numbers up to 9 and large numbers up to 64, but you can craft a more efficient solution for exactly 12 or 32 positions.

In terms of time complexity, the general case is still O(n²), but for small values of n it actually runs in O(n·log(n)) or better, since the parallel implementation of bitcount() takes log(n) steps or fewer for n up to 64.

You can look at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive for inspiration and amazement.

Unfortunately, I am still looking for a way to use this or a similar trick for decoding...


To go from (5, 2, 3, 0, 7, 4) to (5, 2, 2, 0, 3, 1), you just need to:

  • start with (5, 2, 3, 0, 7, 4); push 5 to the result: (5)
  • take 2, count the number of previous values less than 2 → 0, then push 2-0: (5, 2)
  • take 3, count the number of previous values less than 3 → 1, then push 3-1: (5, 2, 2)
  • take 0, count the number of previous values less than 0 → 0, then push 0-0: (5, 2, 2, 0)
  • take 7, count → 4, then push 7-4: (5, 2, 2, 0, 3)
  • take 4, count → 3, then push 4-3: (5, 2, 2, 0, 3, 1)

In this answer, I want to present some of my own ideas for implementing the conversions, along with some benchmark results.

You can find the code on Github . These are the results on my main machine:

    algorithm   ------ total time ------   ---------- per call -----------
                decoding encoding total    decoding   encoding   total
    baseline     0.0391s  0.0312s 0.0703s    3.9062ns   3.1250ns   7.0312ns
    count        1.5312s  1.4453s 2.9766s  153.1250ns 144.5312ns 297.6562ns
    bitcount     1.5078s  0.0703s 1.5781s  150.7812ns   7.0312ns 157.8125ns
    decrement    2.1875s  1.7969s 3.9844s  218.7500ns 179.6875ns 398.4375ns
    bin4         2.1562s  1.7734s 3.9297s  215.6250ns 177.3438ns 392.9688ns
    bin5         2.0703s  1.8281s 3.8984s  207.0312ns 182.8125ns 389.8438ns
    bin8         2.0547s  1.8672s 3.9219s  205.4688ns 186.7188ns 392.1875ns
    vector       0.3594s  0.2891s 0.6484s   35.9375ns  28.9062ns  64.8438ns
    shuffle      0.1328s  0.3438s 0.4766s   13.2812ns  34.3750ns  47.6562ns
    tree         2.0781s  1.7734s 3.8516s  207.8125ns 177.3438ns 385.1562ns
    treeasm      1.4297s  0.7422s 2.1719s  142.9688ns  74.2188ns 217.1875ns
    bmi2         0.0938s  0.0703s 0.1641s    9.3750ns   7.0312ns  16.4062ns

Implementations

  • baseline is an implementation that does nothing but read the input. It is intended to measure the overhead of function calls and memory accesses.
  • count is the "naive" implementation that keeps an occupancy map indicating which squares have pieces on them.
  • bitcount is the same, but with the occupancy map stored as a bitmap. __builtin_popcount is used for encoding, which speeds it up considerably. If a handwritten popcount is used instead, bitcount is still the fastest portable encoding implementation.
  • decrement is the second naive implementation. It stores the encoding for each square of the board; after placing a piece, all square numbers to its right are decremented.
  • bin4, bin5 and bin8 use binning with bins of 4, 5 and 8 entries, as suggested by user3386109.
  • shuffle computes a slightly different encoding based on the Fisher-Yates shuffle. It works by reconstructing the random values that would have gone into a shuffle generating the permutation prefix we want to encode. The code is branchless and fast, in particular when decoding.
  • vector uses the vector of five-bit numbers proposed by chqrlie.
  • tree uses a difference tree, a data structure I came up with. This is a complete binary tree of depth ⌈log₂ n⌉ whose leaves represent the squares, and where the inner nodes on the path to each leaf sum up to the code of that square (only the nodes where you go right are added). The square numbers themselves are not stored, resulting in n − 1 words of extra memory.

    With this data structure, we can compute the code for each square in ⌈log₂ n⌉ − 1 steps and mark a square as occupied in the same number of steps. The inner loop is very simple, comprising a branch and two actions depending on whether you go left or right. On ARM, this branch compiles to a handful of conditional instructions, leading to a very fast implementation. On x86, neither gcc nor clang is smart enough to get rid of the branch.

  • treeasm is a variant of tree that uses inline assembly to implement the inner tree loop without branches, by carefully manipulating the carry flag.
  • bmi2 uses the pdep and pext instructions from the BMI2 instruction set to implement the algorithm quickly.

For my actual project, I will probably use the shuffle implementation, as it is the fastest one that does not depend on any unportable extensions (such as the Intel intrinsics) or implementation details (such as the availability of 128-bit integers).

