I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32 MB of random data, and found the averages.
The first version uses a C loop to convert each pixel separately, using the OSSwapInt32 function (which compiles down to a bswap instruction with -O3).
void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
    unsigned x;
    for(x = 0; x < imageSize; x++) {
        *((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
    }
}
The second method performs the same operation, but uses an inline assembly loop instead of the C loop.
void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
    asm
    (
        "0:\n\t"
        "movl (%1),%%eax\n\t"
        "bswapl %%eax\n\t"
        "movl %%eax,(%0)\n\t"
        "addl $4,%1\n\t"
        "addl $3,%0\n\t"
        "decl %2\n\t"
        "jnz 0b"
        :: "D" (dest), "S" (orig), "c" (imageSize)
        : "flags", "eax"
    );
}
The third version is a modified version of just a poseur's answer. I converted the intrinsics to their GCC equivalents and used the lddqu builtin so that the input argument does not need to be aligned.
typedef uint8_t v16qi __attribute__ ((vector_size (16)));
void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    v16qi mask = __builtin_ia32_lddqu((const char[]){3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF});
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        __builtin_ia32_storedqu(dest,__builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig),mask));
    }
}
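If the raw __builtin_ia32_* calls are a problem (they are GCC-specific), roughly the same loop can be written with the SSSE3 intrinsics from <tmmintrin.h>. This is only an equivalent sketch for reference, not one of the versions that was benchmarked:

#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3 intrinsics (_mm_shuffle_epi8) */

void swap3_intrin(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    /* Same byte-selection mask as swap3; the -1 (0xFF) lanes are zeroed by pshufb. */
    const __m128i mask = _mm_setr_epi8(3,2,1, 7,6,5, 11,10,9, 15,14,13, -1,-1,-1,-1);
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        __m128i px = _mm_lddqu_si128((const __m128i *)orig);   /* unaligned load */
        _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(px, mask));
    }
}

Note that, just like swap3, each store writes a full 16 bytes while the destination only advances by 12, so the destination buffer needs a few bytes of slack after the last pixel.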
Finally, the fourth version is the inline-assembly equivalent of the third.
void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};//{0xFF, 0xFF, 0xFF, 0xFF, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3}
    asm
    (
        "lddqu (%3),%%xmm1\n\t"
        "0:\n\t"
        "lddqu (%1),%%xmm0\n\t"
        "pshufb %%xmm1,%%xmm0\n\t"
        "movdqu %%xmm0,(%0)\n\t"
        "add $16,%1\n\t"
        "add $12,%0\n\t"
        "sub $4,%2\n\t"
        "jnz 0b"
        :: "r" (dest), "r" (orig), "r" (imagesize), "r" (mask)
        : "flags", "xmm0", "xmm1"
    );
}
On my 2010 MacBook Pro, 2.4 GHz Core i5, 4 GB of RAM, these were the average times for each version:
Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3: 9.3163 milliseconds
Version 4: 9.3584 milliseconds
As you can see, the compiler is good enough at optimization that you don't need to write assembly. Also, the vector functions were only 1.5 milliseconds faster on 32 MB of data, so it won't cause much harm if you want to support the earliest Intel Macs, which don't support SSSE3.
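If you do want the SSSE3 fast path with a plain-C fallback for those older machines, a runtime check is enough. Here is a sketch of such a dispatch; it is my addition rather than part of the benchmarked code, and it assumes GCC's <cpuid.h> helper (which ships with newer GCC than the 4.2.1 used above) and that the SSSE3 routine lives in a file compiled with -mssse3:

#include <stdint.h>
#include <cpuid.h>   /* __get_cpuid; available in later GCC releases */

/* Nonzero if the CPU reports SSSE3 (CPUID leaf 1, ECX bit 9). */
static int have_ssse3(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 9) & 1;
}

void swap_dispatch(ARGB *orig, BGR *dest, unsigned imageSize) {
    if (have_ssse3())
        swap3((uint8_t *)orig, (uint8_t *)dest, imageSize);  /* pshufb path */
    else
        swap1(orig, dest, imageSize);                        /* plain C fallback */
}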
Edit: liori asked for standard deviation information. Unfortunately, I had not saved the data points, so I ran another test with 25 iterations.
                Average        | Standard Deviation
Brute force:    18.01956 ms    | 1.22980 ms (6.8%)
Version 1:      11.13120 ms    | 0.81076 ms (7.3%)
Version 2:      11.27092 ms    | 0.66209 ms (5.9%)
Version 3:       9.29184 ms    | 0.27851 ms (3.0%)
Version 4:       9.40948 ms    | 0.32702 ms (3.5%)
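The averages and deviations above can be recomputed from the raw per-run timings below. A small sketch of that calculation (my reconstruction, not code from the original test; it uses the divide-by-n form of the standard deviation, which matches the brute-force figure above):

#include <math.h>
#include <stdio.h>

/* Mean and standard deviation (divide-by-n form) of n timings in microseconds. */
static void stats(const double *us, int n, double *mean, double *sd) {
    double sum = 0.0, sumsq = 0.0;
    int i;
    for (i = 0; i < n; i++)
        sum += us[i];
    *mean = sum / n;
    for (i = 0; i < n; i++)
        sumsq += (us[i] - *mean) * (us[i] - *mean);
    *sd = sqrt(sumsq / n);
}

/* Example: stats(brute_force_times, 25, &m, &s);
   printf("%.5f ms | %.5f ms (%.1f%%)\n", m/1000, s/1000, 100*s/m); */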
In addition, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32 MB data set was randomly generated and run through the four functions. The runtime of each function call in microseconds is listed below. (A sketch of a harness that does this follows the raw data.)
Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1: 10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2: 10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3: 9036 9619 9341 8970 9453 9758 9043 10114 9243 9027 9163 9176 9168 9122 9514 9049 9161 9086 9064 9604 9178 9233 9301 9717 9156
Version 4: 9339 10119 9846 9217 9526 9182 9145 10286 9051 9614 9249 9653 9799 9270 9173 9103 9132 9550 9147 9157 9199 9113 9699 9354 9314
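For reference, a harness along these lines reproduces the procedure described above: one fresh 32 MB buffer of random data per iteration, each function timed in microseconds. This is only a sketch with swap3 as the example, assuming the functions above are compiled in the same file; the original timing code isn't shown here, so gettimeofday, rand() and the exact buffer handling are my assumptions:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

#define PIXELS (32u * 1024 * 1024 / 4)   /* 32 MB of 4-byte ARGB pixels */

int main(void) {
    /* A few bytes of slack at the end of dst: the 16-byte stores in
       swap3/swap2_2 and the 4-byte store in swap1 run slightly past
       the final 3-byte destination pixel. */
    uint8_t *src = malloc((size_t)PIXELS * 4);
    uint8_t *dst = malloc((size_t)PIXELS * 3 + 4);
    struct timeval t0, t1;
    size_t i;
    long us;

    for (i = 0; i < (size_t)PIXELS * 4; i++)
        src[i] = (uint8_t)rand();        /* fresh random data for this iteration */

    gettimeofday(&t0, NULL);
    swap3(src, dst, PIXELS);             /* time the other versions the same way */
    gettimeofday(&t1, NULL);
    us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    printf("swap3: %ld us\n", us);

    free(src);
    free(dst);
    return 0;
}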