Bitwise permutation of several 64-bit values in parallel / combined

This question is NOT about how to do a bitwise permutation; we already know how to do that. What we are looking for is a faster way with fewer CPU instructions, inspired by the bitslice implementation of the S-boxes in DES.

To speed up some encryption code, we want to reduce the number of permutation calls. The main encryption functions perform several bitwise permutations based on lookup arrays. Since the permutation operations only move bits around, it should be possible to handle several inputs at once.

Our main idea is to take several input values that need the same permutation and shift them in parallel. For example, if input bit 1 has to be moved to output bit 6, we would move bit 1 of all the inputs to output bit 6 in one operation.

Is there any way to do this? We have no sample code yet, because we simply have no idea how to do this in an efficient way.

The maximum register size we have available is 128 bits, and the longest input value is 64 bits. So the code has to be faster than simply performing the whole permutation 128 times, once per input value.

EDIT

Here is a simple 8-bit permutation example

 +---+---+---+---+---+---+---+---+
 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |  <= Bits
 +---+---+---+---+---+---+---+---+

 +---+---+---+---+---+---+---+---+
 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |  <= Input
 +---+---+---+---+---+---+---+---+
 | 3 | 8 | 6 | 2 | 5 | 1 | 4 | 7 |  <= Output
 +---+---+---+---+---+---+---+---+
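Written as plain C data, reading the table as "output bit i is taken from input bit out_from[i]" (which matches the "input bit 1 moves to output bit 6" example above; the array name is ours):

 /* 1-based, exactly as in the drawing: output bit i <= input bit out_from[i-1] */
 static const unsigned char out_from[8] = { 3, 8, 6, 2, 5, 1, 4, 7 };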

The cipher makes use of several input keys. This is a block cipher, so the same pattern should apply to all 64-bit input blocks.

Since the permutation is the same for every input block, we want to process several input blocks in a single step and combine the operations for several input sequences: instead of moving one bit per call, 128 times, we want to move 128 bits at once, one time.

EDIT2

We cannot use threads, because the code has to run on embedded systems without thread support. For the same reason we have no access to external libraries, so it has to stay plain C.

Solution

After testing and playing with these answers, we did it as follows:

  • We put the individual bits of 128 64-bit values into a uint128_t[64] array, so that element i holds bit i of all 128 values.
  • For the permutation itself, we only have to copy pointers around.
  • When everything is done, we reverse the first step and get the 128 permuted values back.

Yes, it really is that simple. We had tested this approach at an early stage of the project, but it seemed too slow; it turns out we had a bug in the test code.
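For reference, a minimal sketch of this scheme in plain C. It works on batches of 64 blocks with uint64_t slices instead of 128 blocks with uint128_t (uint128_t is a compiler extension, not standard C), and the function names are ours, but the steps are exactly the three above:

 #include <stdint.h>
 #include <string.h>

 /* Step 1: transpose 64 blocks into slices.
    slice[b] collects bit b of every block: bit j of slice[b] = bit b of in[j]. */
 static void to_slices(const uint64_t in[64], uint64_t slice[64])
 {
     memset(slice, 0, 64 * sizeof(uint64_t));
     for (int j = 0; j < 64; j++)
         for (int b = 0; b < 64; b++)
             slice[b] |= ((in[j] >> b) & 1u) << j;
 }

 /* Step 2: the permutation is now just a reordering of whole slices (or of
    pointers to them).  perm[b] = 0-based output position of input bit b. */
 static void permute_slices(const uint64_t slice[64], uint64_t out[64],
                            const unsigned char perm[64])
 {
     for (int b = 0; b < 64; b++)
         out[perm[b]] = slice[b];   /* one move permutes bit b of all 64 blocks */
 }

 /* Step 3: transpose back to recover the 64 permuted blocks. */
 static void from_slices(const uint64_t slice[64], uint64_t out[64])
 {
     memset(out, 0, 64 * sizeof(uint64_t));
     for (int b = 0; b < 64; b++)
         for (int j = 0; j < 64; j++)
             out[j] |= ((slice[b] >> j) & 1u) << b;
 }

The two transposes are paid once per batch; the permutation in the middle is 64 word moves (or pointer copies) instead of 64 * 64 single-bit moves.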

Thanks to everyone for the tips and patience.

+7
4 answers

You can make the straightforward bit-by-bit code faster by using eight lookup tables that map bytes to 64-bit words. To process a 64-bit word of input, split it into eight bytes, look each byte up in its own table, and OR the results together. On my computer this is 10 times faster than the bit-by-bit approach for 32-bit permutations. Obviously, if your embedded system has a small cache, the 16 kB of lookup tables (8 tables x 256 entries x 8 bytes) may be a problem. But if you process 4 bits at a time, you only need 16 lookup tables of 16 * 8 = 128 bytes each, i.e. 2 kB of lookup tables.

EDIT: The inner loop might look something like this:

 void permute(uint64_t* input, uint64_t* output, size_t n, uint64_t map[8][256])
 {
     for (size_t i = 0; i < n; ++i) {
         uint8_t* p = (uint8_t*)(input + i);
         output[i] = map[0][p[0]] | map[1][p[1]] | map[2][p[2]] | map[3][p[3]]
                   | map[4][p[4]] | map[5][p[5]] | map[6][p[6]] | map[7][p[7]];
     }
 }
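The answer does not show how the eight tables are filled; one possible way (our sketch, assuming a little-endian host so that p[k] above holds input bits 8*k .. 8*k+7, and with perm[b] = 0-based output position of input bit b):

 #include <stdint.h>
 #include <string.h>

 /* map[k][v] = 64-bit word in which the set bits of byte value v, interpreted as
    input bits 8*k .. 8*k+7, have already been moved to their output positions. */
 void build_tables(uint64_t map[8][256], const unsigned char perm[64])
 {
     memset(map, 0, 8 * 256 * sizeof(uint64_t));
     for (int k = 0; k < 8; k++)
         for (int v = 0; v < 256; v++)
             for (int b = 0; b < 8; b++)
                 if (v & (1 << b))
                     map[k][v] |= (uint64_t)1 << perm[8 * k + b];
 }

With the tables built this way, the permute() loop above replaces 64 single-bit moves per word with eight table loads and seven ORs.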
+4

I think you are looking for a bitslice implementation. This is how the fastest DES crackers work. (Or at least it was, before all the SSE instructions existed.)

The idea is to write your function in a "bitwise" fashion, representing each output bit as a Boolean expression over the input bits. Since each output bit depends only on the input bits, any function can be represented this way, even things like addition, multiplication, or S-box lookups.

The trick is to use the bits of a single actual register to represent one bit position of several different input words.

I will illustrate a simple four-bit function.

Suppose, for example, that you want to take four-bit inputs of the form:

 x3 x2 x1 x0 

... and for each input, compute the four-bit output:

 x2 x3 x2^x3 x1^x2 

And you want to do this, say, for eight inputs. (OK, for four bits a lookup table would be fastest, but this is just to illustrate the principle.)

Suppose your eight inputs are:

 A = a3 a2 a1 a0
 B = b3 b2 b1 b0
 ...
 H = h3 h2 h1 h0

Here a3 a2 a1 a0 represent the four bits of input A, etc.

First, encode all eight inputs into four bytes, where each byte contains one bit from each of the eight inputs:

 X3 = a3 b3 c3 d3 e3 f3 g3 h3
 X2 = a2 b2 c2 d2 e2 f2 g2 h2
 X1 = a1 b1 c1 d1 e1 f1 g1 h1
 X0 = a0 b0 c0 d0 e0 f0 g0 h0

Here a3 b3 c3 ... h3 are the eight bits of X3; it consists of the highest bits of all eight inputs. X2 holds the next bit of all eight inputs, and so on.

Now, to compute a function eight times in parallel, you simply do:

 Y3 = X2;
 Y2 = X3;
 Y1 = X2 ^ X3;
 Y0 = X1 ^ X2;

Now Y3 contains the high bits of all eight outputs, Y2 holds the next bit of all eight outputs, and so on. We just computed this function on eight different inputs using only four machine instructions!
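In plain C (just a sketch, with made-up names), the packing step and the four-operation evaluation for this example could look like:

 #include <stdint.h>

 /* in[j] is one 4-bit input; X[b] collects bit b of all eight inputs,
    one input per bit position (bit j of X[b] = bit b of in[j]). */
 static void encode8(const uint8_t in[8], uint8_t X[4])
 {
     for (int b = 0; b < 4; b++) {
         X[b] = 0;
         for (int j = 0; j < 8; j++)
             X[b] |= (uint8_t)(((in[j] >> b) & 1) << j);
     }
 }

 /* The four-operation bit-sliced evaluation from above, on all eight inputs at once. */
 static void f8(const uint8_t X[4], uint8_t Y[4])
 {
     Y[3] = X[2];
     Y[2] = X[3];
     Y[1] = X[2] ^ X[3];
     Y[0] = X[1] ^ X[2];
 }

Decoding Y[3..0] back into eight separate 4-bit outputs is just the reverse of encode8.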

Even better, if our CPU is 32-bit (or 64-bit), we could compute this function on 32 (or 64) inputs, still using only four instructions.

Encoding the inputs into and decoding the outputs out of the bit-sliced representation takes time, of course. But for the right function, this approach offers massive bit-level parallelism and thus a massive speedup.

The basic assumption is that you have many inputs (e.g. 32 or 64) on which you want to compute the same function, and that the function is neither too complicated nor too trivial to represent as Boolean operations. (Too complicated and the computation itself becomes too slow; too trivial and the bit-slice encoding/decoding dominates the run time.) For cryptography in particular, where (a) the data has to go through many "rounds" of processing, (b) the algorithm is often already expressed in terms of bit twiddling, and (c) you are, for example, trying many keys against the same data, this approach often works very well.

+2

It seems hard to do the permutation in just one "call". Even a special case of your problem, reversing the bits of an integer, requires more than one operation (and what do you mean by a "call", anyway?). For examples of this, see Sean Anderson's Bit Twiddling Hacks.

If your mapping pattern is not complicated, maybe you can find a fast way to compute the answer :) However, I don't know whether you like this straightforward approach:

 #include <stdio.h>

 unsigned char mask[8];

 //map bit to position
 //0 -> 2
 //1 -> 7
 //2 -> 5
 //...
 //7 -> 6
 unsigned char map[8] = { 2, 7, 5, 1, 4, 0, 3, 6 };

 int main()
 {
     int i;

     //input:
     //--------------------
     //bit  7 6 5 4 3 2 1 0
     //--------------------
     //val  0 0 1 0 0 1 1 0
     //--------------------
     unsigned char input = 0x26;

     //so the output should be 0xA1:
     //     1 0 1 0 0 0 0 1
     unsigned char output;

     //initialize mask once
     for (i = 0; i < 8; i++) {
         mask[i] = 1 << i;
     }

     //do permutation
     output = 0;
     for (i = 0; i < 8; i++) {
         output |= (input & mask[i]) ? mask[map[i]] : 0;
     }

     printf("output=%x\n", output);
     return 0;
 }
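Since the question is about 64-bit values, the same idea scales directly; a possible generalization (our sketch, with map[i] = 0-based output position of input bit i):

 #include <stdint.h>

 uint64_t permute64(uint64_t input, const unsigned char map[64])
 {
     uint64_t output = 0;
     for (int i = 0; i < 64; i++)
         if (input & ((uint64_t)1 << i))
             output |= (uint64_t)1 << map[i];
     return output;
 }

Note that this is still one test and one shift per bit, i.e. exactly the per-bit cost the question wants to avoid, but it is a natural baseline to measure the table-driven and bit-sliced approaches against.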
+1

It would be best to look at some kind of streaming scheme. Either you use a message-passing system in which you hand each block to a fixed set of worker threads, or you set up a pipeline of non-blocking single-producer/single-consumer queues whose stages perform the different shifts "synchronously". I say "synchronously" because a pipeline on a general-purpose CPU will not really run in lock-step the way a fixed-function hardware pipeline does; but essentially, during a given "slice" of time, each thread works on one stage of the multi-stage task, and you stream the source data into and out of the pipeline.

0
