I wrote a ternary logic simulator not so long ago, and this question was vital to me, since it directly affects the speed of my interpreter; I needed to simulate tons and tons of ternary logic gates as fast as possible.
In a binary coded ternary (BCT) system, one trit is packed into two bits. The most significant bit means negative and the least significant means positive. The case "11" should not occur, but it must be handled appropriately and treated as 0.
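For illustration, a minimal encoder matching this packing could look like the sketch below (the bct_encoder name and body are my own, not part of the simulator's code):

```cpp
// Hypothetical encoder for the packing described above:
// -1 -> 0b10 (negative/most significant bit set),
//  0 -> 0b00,
// +1 -> 0b01 (positive/least significant bit set).
inline unsigned bct_encoder( int trit ) {
    return ( trit > 0 ) ? 1u : ( trit < 0 ) ? 2u : 0u;
}
```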
Consider the function inline int bct_decoder( unsigned bctData ), which should return our packed trit as a regular integer -1, 0, or 1. As I found, there are 4 approaches; I called them "cond", "mod", "math" and "lut". Let's explore them.
The first is based on jz/jnz and jl/jb conditional jumps, thus "cond". Its performance is not good, because it relies on the branch predictor. And even worse, it varies, because it is not known a priori whether there will be one branch or two. Here is an example:
inline int bct_decoder_cond( unsigned bctData ) { unsigned lsB = bctData & 1; unsigned msB = bctData >> 1; return ( lsB == msB ) ? 0 : ( lsB > msB ) ? 1 : -1; }
This is the slowest version: in the worst case it involves 2 branches, and this is where branching really hurts. On my 3770k it produces about 200 MIPS on average on random data. (Here and below, each test is an average of 1000 attempts on a randomly filled 2 MB dataset.)
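For context, a minimal sketch of the kind of timing harness meant here (my exact benchmark code differed; only the randomly filled buffer and the MIPS-style metric follow the description above):

```cpp
#include <chrono>
#include <cstdlib>
#include <vector>

inline int bct_decoder_cond( unsigned bctData ) {
    unsigned lsB = bctData & 1;
    unsigned msB = bctData >> 1;
    return ( lsB == msB ) ? 0 : ( lsB > msB ) ? 1 : -1;
}

// Rough MIPS estimate: decode a randomly filled buffer and time it.
double benchmark_mips( size_t count ) {
    std::vector<unsigned> data( count );
    for ( auto &v : data ) v = std::rand() & 0x3;
    volatile int sink = 0; // keeps the loop from being optimized away
    auto t0 = std::chrono::steady_clock::now();
    for ( unsigned v : data ) sink = sink + bct_decoder_cond( v );
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>( t1 - t0 ).count();
    return ( count / seconds ) / 1e6; // decodes per second, in millions
}
```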
The second one relies on the modulo operator, thus "mod". Its speed is somewhere between the first and the third, but it is definitely faster than cond, at 600 MIPS:
inline int bct_decoder_mod( unsigned bctData ) { return ( int )( ( bctData + 1 ) % 3 ) - 1; }
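To see why this formula matches the spec, here is the mapping worked out for all four 2-bit inputs (my own check, restating the function above):

```cpp
// (bctData + 1) % 3 - 1, evaluated for every possible input:
//   0b00: (0+1)%3 - 1 =  0
//   0b01: (1+1)%3 - 1 = +1
//   0b10: (2+1)%3 - 1 = -1
//   0b11: (3+1)%3 - 1 =  0   (the "should not occur" pattern decodes to 0)
inline int bct_decoder_mod( unsigned bctData ) {
    return ( int )( ( bctData + 1 ) % 3 ) - 1;
}
```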
The third is a branchless approach that involves only math, thus "math"; it does not involve jumps at all:
inline int bct_decoder_math( unsigned bctData ) { return ( int )( bctData & 1 ) - ( int )( bctData >> 1 ); }
It does what it should and performs really well. For comparison, it rates at 1000 MIPS, 5 times faster than the branched version. The branched version is probably slowed down by the lack of native 2-bit signed int support. But in my application it is a pretty good version in itself.
If that is not enough, we can go further with something special. The next approach is the lookup table, thus "lut":
inline int bct_decoder_lut( unsigned bctData ) { static const int decoderLUT[] = { 0, 1, -1, 0 }; return decoderLUT[ bctData & 0x3 ]; }
In my case one trit occupied only 2 bits, so the lut table had just 4 entries of 4 bytes each, 16 bytes in total, and it was worth a try. It fits in the cache and works quickly at 1400-1600 MIPS; here my measurement accuracy drops. And that is a 1.5x speedup over the fast math approach. This is because the result is precomputed, leaving just one AND instruction and a load. Unfortunately caches are small, and if your index is larger than a few bits, you simply cannot use this approach.
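Whichever variant you pick, it is worth asserting that all four decoders agree on every possible input; a minimal cross-check sketch (my own, with the cond variant spelled out in full):

```cpp
inline int bct_decoder_cond( unsigned d ) {
    unsigned lsB = d & 1, msB = d >> 1;
    return ( lsB == msB ) ? 0 : ( lsB > msB ) ? 1 : -1;
}
inline int bct_decoder_mod( unsigned d )  { return ( int )( ( d + 1 ) % 3 ) - 1; }
inline int bct_decoder_math( unsigned d ) { return ( int )( d & 1 ) - ( int )( d >> 1 ); }
inline int bct_decoder_lut( unsigned d )  { static const int lut[] = { 0, 1, -1, 0 }; return lut[ d & 0x3 ]; }

// All four decoders must agree on every 2-bit input, including the
// invalid pattern 0b11, which is defined to decode to 0.
inline bool decoders_agree() {
    static const int expected[] = { 0, 1, -1, 0 };
    for ( unsigned d = 0; d < 4; ++d ) {
        if ( bct_decoder_cond( d ) != expected[ d ] ) return false;
        if ( bct_decoder_mod( d )  != expected[ d ] ) return false;
        if ( bct_decoder_math( d ) != expected[ d ] ) return false;
        if ( bct_decoder_lut( d )  != expected[ d ] ) return false;
    }
    return true;
}
```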
So, I think I have answered your question as to how much branched vs. branchless code can matter. The answer: much better, with detailed examples, a real application, and real performance measurement results.