This seemed like a fun question, so I wrote a solution without looking at the other answers. On my system it is about 4.9x faster than the original, and also slightly faster than DigitalRoss's solution (about 25% faster).
    static inline uint32_t nibble_replace_2(uint32_t x)
    {
        uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
        uint32_t y = (~(ONES * SEARCH)) ^ x;
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;
        y *= 15;
        return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
    }
I would explain how this works, but... I think explaining it would spoil the fun.
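Without spoiling the trick, one way to convince yourself the function is right is to compare it against a plain nibble-by-nibble loop. The following is only a sketch: `nibble_replace_ref` and `count_mismatches` are names made up for this check (they are not part of the answer), and the pseudo-random inputs come from a simple linear congruential generator chosen for convenience.

```c
#include <stdint.h>

/* Hypothetical reference: replace each 0x5 nibble with 0xE, one nibble at a time. */
uint32_t nibble_replace_ref(uint32_t x)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 4) {
        uint32_t nib = (x >> shift) & 0xFu;
        if (nib == 0x5u)
            nib = 0xEu;
        out |= nib << shift;
    }
    return out;
}

/* The branch-free version from the answer, copied so this sketch stands alone. */
uint32_t nibble_replace_2(uint32_t x)
{
    uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
    uint32_t y = (~(ONES * SEARCH)) ^ x;
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES;
    y *= 15;
    return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}

/* Returns how many sampled inputs make the two versions disagree. */
unsigned count_mismatches(void)
{
    unsigned bad = 0;
    uint32_t x = 0x12345678u;
    for (int i = 0; i < 100000; ++i) {
        if (nibble_replace_ref(x) != nibble_replace_2(x))
            ++bad;
        x = x * 1664525u + 1013904223u; /* simple LCG step, assumed good enough here */
    }
    return bad;
}
```

A return value of 0 from `count_mismatches()` only shows agreement on the sampled inputs; an exhaustive loop over all 2^32 values is also feasible if you want certainty.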
Note on SIMD: This kind of code is very, very easy to vectorize. You do not even need to know how to use SSE or MMX. Here's how I vectorized it:
    static void nibble_replace_n(uint32_t *restrict p, uint32_t n)
    {
        uint32_t i;
        for (i = 0; i < n; ++i) {
            uint32_t x = p[i];
            uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
            uint32_t y = (~(ONES * SEARCH)) ^ x;
            y &= y >> 2;
            y &= y >> 1;
            y &= ONES;
            y *= 15;
            p[i] = x ^ (((SEARCH ^ REPLACE) * ONES) & y);
        }
    }
Using GCC, this function is automatically compiled to SSE code at -O3, assuming a suitable -march flag is used. You can pass -ftree-vectorizer-verbose=2 to GCC to ask it to print which loops are vectorized, for example:
    $ gcc -std=gnu99 -march=native -O3 -Wall -Wextra -o opt opt.c
    opt.c:66: note: LOOP VECTORIZED.
Automatic vectorization gave me an additional speedup of about 64%, and I didn't even have to reach for a processor manual.
Edit: I got another 48% speedup by changing the types in the auto-vectorized version from uint32_t to uint16_t. This brings the total speedup to about 12x compared with the original. Switching to uint8_t causes vectorization to fail. I suspect there is significant extra speed to be found with hand-written assembly, if that matters.
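For illustration, a 16-bit variant of the loop might look like the sketch below. This is a guess at the shape of the change, not the exact code that was benchmarked: the name `nibble_replace_n16`, the `size_t` length, and the explicit casts are all assumptions. The casts are there because `uint16_t` operands are promoted to `int` before arithmetic, so results need to be narrowed back. Since no nibble straddles a byte boundary, processing the same buffer in 16-bit chunks changes the same nibbles as the 32-bit version.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit variant: replaces every 0x5 nibble with 0xE in place. */
void nibble_replace_n16(uint16_t *restrict p, size_t n)
{
    const uint16_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x1111;
    for (size_t i = 0; i < n; ++i) {
        uint16_t x = p[i];
        /* Operands are promoted to int; the cast drops the upper bits again. */
        uint16_t y = (uint16_t)(~(ONES * SEARCH) ^ x);
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;      /* low bit of each nibble is set where the nibble matched */
        y *= 15;        /* widen each flag to a full 0xF nibble; at most 0xFFFF */
        p[i] = (uint16_t)(x ^ (((SEARCH ^ REPLACE) * ONES) & y));
    }
}
```

Whether this vectorizes, and how much faster it runs, depends on the compiler version and target, so it is worth checking the vectorizer output rather than assuming it.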
Edit 2: Changed *= 7 to *= 15. This invalidates the speed tests above.
Edit 3: Here is a change that is obvious in retrospect:
    static inline uint32_t nibble_replace_2(uint32_t x)
    {
        uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
        uint32_t y = (~(ONES * SEARCH)) ^ x;
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;
        return x ^ (y * (SEARCH ^ REPLACE));
    }
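The simplification is safe because, after `y &= ONES`, every nibble of `y` is either 0 or 1, so multiplying by `SEARCH ^ REPLACE` (0xB) yields 0 or 0xB in each nibble and never carries into its neighbour. The sketch below puts the two forms side by side so they can be compared; the helper names `match_mask`, `replace_via_mask` and `replace_via_multiply` are invented for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper: 1 in the low bit of every nibble equal to 0x5, else 0. */
uint32_t match_mask(uint32_t x)
{
    const uint32_t SEARCH = 0x5, ONES = 0x11111111;
    uint32_t y = (~(ONES * SEARCH)) ^ x;
    y &= y >> 2;
    y &= y >> 1;
    return y & ONES;
}

/* Earlier form: widen each flag to 0xF, then AND with the repeated XOR pattern. */
uint32_t replace_via_mask(uint32_t x)
{
    const uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
    uint32_t y = match_mask(x) * 15;
    return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}

/* Edit 3 form: multiply the 0/1 flags directly by the XOR pattern. */
uint32_t replace_via_multiply(uint32_t x)
{
    const uint32_t SEARCH = 0x5, REPLACE = 0xE;
    return x ^ (match_mask(x) * (SEARCH ^ REPLACE));
}
```

For example, with the input 0x5A5A0055, `match_mask` gives 0x10100011 and both forms return 0xEAEA00EE.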