I think you could look into a bit-sliced implementation. This is how the fastest DES-cracking programs work. (Or how they worked before SSE instructions existed, anyway.)
The idea is to write your function in a bit-wise fashion, representing each output bit as a Boolean expression over the input bits. Since each output bit depends only on input bits, any function can be represented this way, even things like addition, multiplication, or an S-box lookup.
The trick is to use the individual bits of a single register to hold one bit from each of several input words.
I will illustrate with a simple four-bit function.
Suppose, for example, that you want to take four-bit inputs of the form:
x3 x2 x1 x0
... and for each input, compute the four-bit output:
x2 x3 x2^x3 x1^x2
And suppose you want to do this for, say, eight inputs. (OK, for four bits a lookup table would be fastest; this is just to illustrate the principle.)
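Written out one input at a time, the function above might look like this in C. (A sketch of my own; the name `f_scalar` is not from anything above.)

```c
/* Per-input (non-sliced) version of the example function.
   Input bits x3..x0; outputs: y3 = x2, y2 = x3, y1 = x2^x3, y0 = x1^x2. */
unsigned f_scalar(unsigned x)
{
    unsigned x1 = (x >> 1) & 1;
    unsigned x2 = (x >> 2) & 1;
    unsigned x3 = (x >> 3) & 1;
    return (x2 << 3) | (x3 << 2) | ((x2 ^ x3) << 1) | (x1 ^ x2);
}
```

Calling this in a loop evaluates one input per call; bit slicing, below, evaluates many inputs per instruction instead.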
Suppose your eight inputs are:
A = a3 a2 a1 a0 B = b3 b2 b1 b0 ... H = h3 h2 h1 h0
Here a3 a2 a1 a0 represent the four bits of input A , etc.
First, encode all eight inputs into four bytes, where each byte contains one bit from each of the eight inputs:
X3 = a3 b3 c3 d3 e3 f3 g3 h3 X2 = a2 b2 c2 d2 e2 f2 g2 h2 X1 = a1 b1 c1 d1 e1 f1 g1 h1 X0 = a0 b0 c0 d0 e0 f0 g0 h0
Here a3 b3 c3 ... h3 are the eight bits of X3 . It consists of the highest bit of each of the eight inputs. X2 holds the next bit of each input. And so on.
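The encoding step is just a bit transpose. A minimal C sketch, assuming 8-bit bytes and the layout above (the function name and the choice of which input lands in which bit position are mine; any ordering works as long as encoding and decoding agree):

```c
/* Encode eight 4-bit inputs into four bytes:
   bit i of X[k] holds bit k of the i-th input. */
void encode(const unsigned in[8], unsigned char X[4])
{
    for (int k = 0; k < 4; k++) {       /* one byte per bit position */
        X[k] = 0;
        for (int i = 0; i < 8; i++)     /* one bit per input word    */
            X[k] |= (unsigned char)(((in[i] >> k) & 1) << i);
    }
}
```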
Now, to compute the function on all eight inputs in parallel, you simply do:
Y3 = X2; Y2 = X3; Y1 = X2 ^ X3; Y0 = X1 ^ X2;
Now Y3 contains the high bit of all eight outputs, Y2 holds the next bit of all eight outputs, etc. We just evaluated this function on eight different inputs using only four machine instructions!
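Putting the whole pipeline together, here is a minimal end-to-end sketch (all names are hypothetical, chosen for this illustration): encode eight 4-bit inputs, evaluate the function bit-sliced in four copy/XOR operations, then decode and compare against a per-input reference.

```c
/* Reference: y3 = x2, y2 = x3, y1 = x2^x3, y0 = x1^x2, one input at a time. */
static unsigned f_ref(unsigned x)
{
    unsigned x1 = (x >> 1) & 1, x2 = (x >> 2) & 1, x3 = (x >> 3) & 1;
    return (x2 << 3) | (x3 << 2) | ((x2 ^ x3) << 1) | (x1 ^ x2);
}

/* Encode: bit i of X[k] holds bit k of input i. */
static void encode(const unsigned in[8], unsigned char X[4])
{
    for (int k = 0; k < 4; k++) {
        X[k] = 0;
        for (int i = 0; i < 8; i++)
            X[k] |= (unsigned char)(((in[i] >> k) & 1) << i);
    }
}

/* The whole point: four operations evaluate the function on all 8 inputs. */
static void f_sliced(const unsigned char X[4], unsigned char Y[4])
{
    Y[3] = X[2];
    Y[2] = X[3];
    Y[1] = X[2] ^ X[3];
    Y[0] = X[1] ^ X[2];
}

/* Decode: the inverse transpose of encode. */
static void decode(const unsigned char Y[4], unsigned out[8])
{
    for (int i = 0; i < 8; i++) {
        out[i] = 0;
        for (int k = 0; k < 4; k++)
            out[i] |= ((Y[k] >> i) & 1u) << k;
    }
}
```

Swapping `unsigned char` for a 32- or 64-bit type (or an SSE register) widens the parallelism with no change to `f_sliced` itself.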
Even better, if our CPU is 32-bit (or 64-bit), we could compute this function on 32 (or 64) inputs, still using only four instructions.
Encoding the inputs into and decoding the outputs out of the bit-sliced representation takes some time, of course. But for the right function, this approach offers massive bit-level parallelism and thus a massive speedup.
The basic assumption is that you have many inputs (for example, 32 or 64) on which you want to evaluate the same function, and that this function is neither too complicated nor too simple to represent as Boolean operations. (Too complicated and the Boolean evaluation itself becomes too slow; too simple and the run time is dominated by the bit-slice encoding/decoding.) For cryptography in particular, where (a) the data must go through many "rounds" of processing, (b) the algorithm is often specified in terms of bit twiddling anyway, and (c) you are, for example, trying many keys against the same data... this approach often works very well.