Quick search and replace of some nibbles in an int [C; micro-optimization]

This is a variant of Quickly search for some nibbles in two ints with same offset (C, micro-optimization), with a different task:

The challenge is to find a predefined nibble in an int32 and replace it with another nibble. For example, the nibble to search for is 0x5 and the replacement nibble is 0xe:

    input:  0x3d542753
                ^   ^
    output: 0x3dE427E3

There may be other pairs of nibbles to search for and to substitute (they are known at compile time).

I profiled my program: this part is one of the hottest spots (gprof shows 75% of the time spent in this function), and it is called very, very many times (confirmed by gcov). In fact, it sits in the 3rd or 4th level of nested loops, with an estimated iteration count of about (n^3) * (2^n) for n = 18..24.

My current code is slow (I have rewritten it as a function here, but it is really code from inside a loop):

    static inline __attribute__((always_inline)) uint32_t nibble_replace(uint32_t A)
    {
        int i;
        uint32_t mask    = 0xf;
        uint32_t search  = 0x5;
        uint32_t replace = 0xe;
        for (i = 0; i < 8; i++) {
            if ((A & mask) == search)
                A = (A & ~mask)   /* clear the i-th nibble */
                  | replace;      /* and replace it */
            mask    <<= 4;
            search  <<= 4;
            replace <<= 4;
        }
        return A;
    }
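For a quick sanity check, a minimal driver along these lines (a sketch; it assumes the nibble_replace function above is in scope) reproduces the example:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Both 0x5 nibbles of the sample input should become 0xe. */
        printf("0x%08x\n", nibble_replace(0x3d542753u));  /* expected: 0x3de427e3 */
        return 0;
    }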

Is it possible to rewrite this function or macro to work on all nibbles in parallel, using some bit-logic magic, something like (t - 0x11111111) & ~t & 0x88888888, and possibly with SSE*? Check the accepted answer of the related question to learn about the necessary magic.
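To illustrate the kind of magic I mean, here is a sketch of the classic carry-propagation trick (my paraphrase, not necessarily the exact code from the linked answer):

    #include <stdint.h>

    /* Nonzero iff at least one nibble of t is zero: a zero nibble minus 1
     * borrows into its top bit, and & ~t keeps only nibbles whose top bit
     * was clear to begin with.  Per-nibble positions are approximate,
     * since a borrow can propagate into the next nibble up.  To search
     * for 0x5 instead of 0x0, XOR first: the zero nibbles of
     * t ^ 0x55555555 are exactly the nibbles of t equal to 0x5. */
    static uint32_t has_zero_nibble(uint32_t t)
    {
        return (t - 0x11111111u) & ~t & 0x88888888u;
    }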

My compiler is gcc 4.5.2, and the CPU is an Intel Core 2 Solo, in 32-bit mode (x86) or (in the near future) in 64-bit mode (x86-64).

1 answer

This seemed like a fun question, so I wrote a solution without looking at the other answers. On my system it runs about 4.9x as fast as the original; it is also slightly faster (by about 25%) than DigitalRoss's solution.

    static inline uint32_t nibble_replace_2(uint32_t x)
    {
        uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
        uint32_t y = (~(ONES * SEARCH)) ^ x;
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;
        y *= 15; /* This is faster than y |= y << 1; y |= y << 2; */
        return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
    }

I would explain how it works, but... I think explaining it would spoil the fun.
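If you would rather verify it than derive it, here is a self-contained harness (mine, not part of the original post) that checks the trick against the naive loop from the question:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static uint32_t nibble_replace_ref(uint32_t A)  /* the naive loop */
    {
        uint32_t mask = 0xf, search = 0x5, replace = 0xe;
        for (int i = 0; i < 8; i++) {
            if ((A & mask) == search)
                A = (A & ~mask) | replace;
            mask <<= 4; search <<= 4; replace <<= 4;
        }
        return A;
    }

    static uint32_t nibble_replace_2(uint32_t x)    /* the bit-trick version */
    {
        uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
        uint32_t y = (~(ONES * SEARCH)) ^ x;
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;
        y *= 15;
        return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
    }

    int main(void)
    {
        if (nibble_replace_2(0x3d542753u) != 0x3de427e3u) {
            puts("example from the question FAILED");
            return 1;
        }
        srand(1);
        for (long i = 0; i < 10000000; i++) {
            /* rand() may yield as few as 15 random bits, so combine three calls. */
            uint32_t x = ((uint32_t)rand() << 17)
                       ^ ((uint32_t)rand() << 9)
                       ^ (uint32_t)rand();
            if (nibble_replace_ref(x) != nibble_replace_2(x)) {
                printf("mismatch at 0x%08x\n", x);
                return 1;
            }
        }
        puts("all OK");
        return 0;
    }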

A note on SIMD: this kind of code is very, very easy to vectorize. You do not even need to know how to use SSE or MMX. Here is how I vectorized it:

    static void nibble_replace_n(uint32_t *restrict p, uint32_t n)
    {
        uint32_t i;
        for (i = 0; i < n; ++i) {
            uint32_t x = p[i];
            uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
            uint32_t y = (~(ONES * SEARCH)) ^ x;
            y &= y >> 2;
            y &= y >> 1;
            y &= ONES;
            y *= 15;
            p[i] = x ^ (((SEARCH ^ REPLACE) * ONES) & y);
        }
    }

With GCC, this function is automatically vectorized into SSE code at -O3, assuming a suitable -march flag is used. You can pass -ftree-vectorizer-verbose=2 to ask GCC to report which loops were vectorized, e.g.:

    $ gcc -std=gnu99 -march=native -O3 -Wall -Wextra -o opt opt.c
    opt.c:66: note: LOOP VECTORIZED.

Automatic vectorization gave me an additional speed increase of about 64%, and I didn't even have to open a processor manual.

Edit: I noticed another 48% speedup from changing the types in the auto-vectorized version from uint32_t to uint16_t. This brings the total speedup to about 12x over the original. Switching to uint8_t makes vectorization fail. I suspect there is significant extra speed to be found in hand-written assembly, if it matters that much.
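For concreteness, the 16-bit variant would look something like this (a reconstruction from the description above; the original post does not show it), with the constants shrunk to four nibbles per element:

    #include <stdint.h>

    static void nibble_replace_n16(uint16_t *restrict p, uint32_t n)
    {
        uint32_t i;
        for (i = 0; i < n; ++i) {
            uint16_t x = p[i];
            uint16_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x1111;
            uint16_t y = (uint16_t)(~(ONES * SEARCH)) ^ x;
            y &= y >> 2;
            y &= y >> 1;
            y &= ONES;
            y *= 15;
            p[i] = x ^ (((SEARCH ^ REPLACE) * ONES) & y);
        }
    }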

Edit 2: Changed *= 7 to *= 15; unfortunately, this invalidates the speed tests above.

Edit 3: Here is a change that is obvious in retrospect:

    static inline uint32_t nibble_replace_2(uint32_t x)
    {
        uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
        uint32_t y = (~(ONES * SEARCH)) ^ x;
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;
        return x ^ (y * (SEARCH ^ REPLACE));
    }
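To see why the single multiply suffices: each nibble of y is 0 or 1, so multiplying by SEARCH ^ REPLACE both widens the flag and applies the XOR pattern at once. A quick exhaustive check of that identity (an illustrative sketch, not from the original answer):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        /* Try every pattern of per-nibble match flags (0x0 or 0x1 in
         * each of the 8 nibbles) and confirm that one multiply equals
         * the earlier expand-then-mask pair. */
        for (uint32_t bits = 0; bits < 256; bits++) {
            uint32_t y = 0;
            for (int i = 0; i < 8; i++)
                if (bits & (1u << i))
                    y |= 0x1u << (4 * i);
            assert(y * 0xBu == ((0xBu * 0x11111111u) & (y * 15u)));
        }
        return 0;
    }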