Fast method of copying memory with translation - ARGB to BGR

Overview

I have an image buffer that I need to convert to another format. The source buffer is four channels, 8 bits per channel: alpha, red, green, and blue. The destination buffer is three channels, 8 bits per channel: blue, green, and red.

So the brute force method:

 // Assume a 32 x 32 pixel image
 #define IMAGESIZE (32*32)

 typedef struct
 {
     UInt8 Alpha;
     UInt8 Red;
     UInt8 Green;
     UInt8 Blue;
 } ARGB;

 typedef struct
 {
     UInt8 Blue;
     UInt8 Green;
     UInt8 Red;
 } BGR;

 ARGB orig[IMAGESIZE];
 BGR  dest[IMAGESIZE];

 for(x = 0; x < IMAGESIZE; x++)
 {
     dest[x].Red   = orig[x].Red;
     dest[x].Green = orig[x].Green;
     dest[x].Blue  = orig[x].Blue;
 }

However, I need more speed than a loop of three-byte copies provides. I'm hoping there are some tricks I can use to reduce the number of memory reads and writes, given that I'm running on a 32-bit machine.

Additional Information

Each image's size is a multiple of at least 4 pixels, so we could address 16 ARGB bytes and move them into 12 BGR bytes per loop iteration. Perhaps that fact can be used to speed things up, especially as it falls nicely on 32-bit boundaries.
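For illustration, here is roughly the shape I have in mind (an untested sketch of my own, assuming a little-endian CPU and that reading an ARGB struct as a uint32_t is acceptable):

 #include <stddef.h>
 #include <stdint.h>

 // Untested sketch: consume 4 ARGB pixels (16 bytes) per iteration and
 // emit 12 BGR bytes as three 32-bit stores. npixels must be a multiple
 // of 4; src and dst must be 4-byte aligned.
 void argb_to_bgr_4px(const uint32_t *src, uint32_t *dst, size_t npixels)
 {
     size_t i;
     for (i = 0; i < npixels; i += 4, src += 4, dst += 3)
     {
         uint32_t p0 = src[0], p1 = src[1], p2 = src[2], p3 = src[3];
         // output byte stream: B0 G0 R0 B1 | G1 R1 B2 G2 | R2 B3 G3 R3
         dst[0] = (p0 >> 24) | ((p0 >> 8) & 0x0000FF00u)
                | ((p0 << 8) & 0x00FF0000u) | (p1 & 0xFF000000u);
         dst[1] = ((p1 >> 16) & 0x000000FFu) | (p1 & 0x0000FF00u)
                | ((p2 >> 8) & 0x00FF0000u) | ((p2 << 8) & 0xFF000000u);
         dst[2] = ((p2 >> 8) & 0x000000FFu) | ((p3 >> 16) & 0x0000FF00u)
                | (p3 & 0x00FF0000u) | ((p3 << 16) & 0xFF000000u);
     }
 }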

I have access to OpenCL - and while that requires moving the entire buffer into GPU memory and then moving the result back out, the fact that OpenCL can work on many parts of the image simultaneously, and that large block moves are actually quite efficient, may make this worth investigating.

While I've given the example of small buffers above, I'm really moving HD video (1920x1080) and sometimes larger, mostly smaller, buffers, so while a 32x32 case may be trivial, copying 8.3 MB of image data byte by byte is really, really bad.

This runs on Intel processors (Core 2 and above), so I know there are streaming and data-processing instructions available that I'm not familiar with - pointers to where to look for specialized data-handling instructions would be good.

This is happening in an OS X application, and I'm using Xcode 4. If assembly is painless and the obvious way to go, I'm fine with that, but not having done it in this environment before makes me wary of sinking too much time into it.

Pseudocode is fine - I'm not looking for a complete solution, just the algorithm and an explanation of any gotchas that might not be immediately obvious.

+64
c algorithm
Jul 24 '11
11 answers

I wrote 4 different versions that work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32 MB of random data, and found the averages.

The first version uses a C loop to convert each pixel separately, using the OSSwapInt32 function (which compiles to a bswap instruction with -O3).

 void swap1(ARGB *orig, BGR *dest, unsigned imageSize)
 {
     unsigned x;
     for(x = 0; x < imageSize; x++)
     {
         *((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
     }
 }

The second method performs the same operation, but uses an inline assembly loop instead of the C loop.

 void swap2(ARGB *orig, BGR *dest, unsigned imageSize)
 {
     asm
     (
         "0:\n\t"
         "movl (%1),%%eax\n\t"
         "bswapl %%eax\n\t"
         "movl %%eax,(%0)\n\t"
         "addl $4,%1\n\t"
         "addl $3,%0\n\t"
         "decl %2\n\t"
         "jnz 0b"
         :: "D" (dest), "S" (orig), "c" (imageSize)
         : "flags", "eax"
     );
 }

The third version is a modified version of poseur's answer. I converted the intrinsics to their GCC equivalents and used the lddqu built-in, so the input argument does not need to be aligned.

 typedef uint8_t v16qi __attribute__ ((vector_size (16)));

 void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize)
 {
     v16qi mask = __builtin_ia32_lddqu((const char[]){3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF});
     uint8_t *end = orig + imagesize * 4;
     for (; orig != end; orig += 16, dest += 12)
     {
         __builtin_ia32_storedqu(dest, __builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig), mask));
     }
 }

Finally, the fourth version is the inline-assembly equivalent of the third.

 void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize)
 {
     int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
     //{0xFF, 0xFF, 0xFF, 0xFF, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3};
     asm
     (
         "lddqu (%3),%%xmm1\n\t"
         "0:\n\t"
         "lddqu (%1),%%xmm0\n\t"
         "pshufb %%xmm1,%%xmm0\n\t"
         "movdqu %%xmm0,(%0)\n\t"
         "add $16,%1\n\t"
         "add $12,%0\n\t"
         "sub $4,%2\n\t"
         "jnz 0b"
         :: "r" (dest), "r" (orig), "r" (imagesize), "r" (mask)
         : "flags", "xmm0", "xmm1"
     );
 }

On my 2010 MacBook Pro (2.4 GHz Core i5, 4 GB RAM), these were the average times for each:

 Version 1: 10.8630 milliseconds
 Version 2: 11.3254 milliseconds
 Version 3: 9.3163 milliseconds
 Version 4: 9.3584 milliseconds

As you can see, the compiler is good enough at optimizing that you don't need to write assembly. Also, the vector functions were only 1.5 milliseconds faster over 32 MB of data, so it won't cause much harm if you want to support the earliest Intel Macs, which do not support SSSE3.

Edit: liori asked for standard deviation information. Unfortunately, I hadn't saved the data points, so I ran another test with 25 iterations.

                Average     |  Standard deviation
 Brute force: 18.01956 ms   |  1.22980 ms (6.8%)
 Version 1:   11.13120 ms   |  0.81076 ms (7.3%)
 Version 2:   11.27092 ms   |  0.66209 ms (5.9%)
 Version 3:    9.29184 ms   |  0.27851 ms (3.0%)
 Version 4:    9.40948 ms   |  0.32702 ms (3.5%)

Also, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32 MB data set was randomly generated and run through the four functions. The runtime of each function in microseconds is listed below.

 Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
 Version 1: 10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
 Version 2: 10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
 Version 3: 9036 9619 9341 8970 9453 9758 9043 10114 9243 9027 9163 9176 9168 9122 9514 9049 9161 9086 9064 9604 9178 9233 9301 9717 9156
 Version 4: 9339 10119 9846 9217 9526 9182 9145 10286 9051 9614 9249 9653 9799 9270 9173 9103 9132 9550 9147 9157 9199 9113 9699 9354 9314
+54
Jul 24 '11 at 6:10

The obvious solution, using pshufb.

 #include <assert.h>
 #include <inttypes.h>
 #include <tmmintrin.h>

 // needs:
 // orig is 16-byte aligned
 // imagesize is a multiple of 4
 // dest has 4 trailing scratch bytes
 void convert(uint8_t *orig, size_t imagesize, uint8_t *dest)
 {
     assert((uintptr_t)orig % 16 == 0);
     assert(imagesize % 4 == 0);
     __m128i mask = _mm_set_epi8(-128, -128, -128, -128, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3);
     uint8_t *end = orig + imagesize * 4;
     for (; orig != end; orig += 16, dest += 12)
     {
         _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), mask));
     }
 }
+25
Jul 24 '11 at 1:22

Combining poseur's and Jitamaro's answers: if you assume that the inputs and outputs are 16-byte aligned, and if you process pixels 4 at a time, you can use a combination of shuffles, masks, ands, and ors to write out using aligned stores. The basic idea is to generate four intermediate data sets, then or them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile this or try to run it at all.

EDIT2: More detail on the underlying code structure:

With SSE2, you get better performance with 16-byte aligned reads and writes of 16 bytes. Since your 3-byte pixel output is only 16-byte aligned once every 16 pixels, we batch up 16 pixels at a time, using a combination of shuffles, masks, and ors of 16 input pixels at a time.

From LSB to MSB, inputs look like this, ignoring specific components:

 s[0]: 0000 0000 0000 0000
 s[1]: 1111 1111 1111 1111
 s[2]: 2222 2222 2222 2222
 s[3]: 3333 3333 3333 3333

and the outputs are as follows:

 d[0]: 000 000 000 000 111 1
 d[1]: 11 111 111 222 222 22
 d[2]: 2 222 333 333 333 333

So, to generate those outputs, you need to do the following (I will specify the actual transformations later):

 d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
 d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
 d[2]= combine_2(f_2_low(s[2]), f_2_high(s[3]))

Now, what should combine_<x> look like? If we assume that d is just s compacted together, we can combine two s values with a mask and an or:

 combine_x(left, right)= (left & mask(x)) | (right & ~mask(x)) 

where (1 meaning select the left pixel, 0 meaning select the right pixel):

 mask(0)= 111 111 111 111 000 0
 mask(1)= 11 111 111 000 000 00
 mask(2)= 1 111 000 000 000 000

But the actual transformations ( f_<x>_low , f_<x>_high ) are actually not that simple. Since we are reversing and removing bytes from the source pixel, the actual transformation is (for the first destination, for brevity):

 d[0]=
     s[0][0].Blue s[0][0].Green s[0][0].Red
     s[0][1].Blue s[0][1].Green s[0][1].Red
     s[0][2].Blue s[0][2].Green s[0][2].Red
     s[0][3].Blue s[0][3].Green s[0][3].Red
     s[1][0].Blue s[1][0].Green s[1][0].Red
     s[1][1].Blue

If you translate the above into byte offsets from source to dest, you get:

 d[0]=
     &s[0]+3  &s[0]+2  &s[0]+1
     &s[0]+7  &s[0]+6  &s[0]+5
     &s[0]+11 &s[0]+10 &s[0]+9
     &s[0]+15 &s[0]+14 &s[0]+13
     &s[1]+3  &s[1]+2  &s[1]+1
     &s[1]+7

(If you look at all the s[0] offsets, they match the pshufb shuffle mask above, in reverse order.)

Now we can create a shuffle mask to map each source byte to the destination byte ( X means we don't care what that value is):

 f_0_low=  3 2 1 7 6 5 11 10 9 15 14 13 X X X X
 f_0_high= X X X X X X X X X X X X 3 2 1 7
 f_1_low=  6 5 11 10 9 15 14 13 X X X X X X X X
 f_1_high= X X X X X X X X 3 2 1 7 6 5 11 10
 f_2_low=  9 15 14 13 X X X X X X X X X X X X
 f_2_high= X X X X 3 2 1 7 6 5 11 10 9 15 14 13

We can optimize this further by looking at the masks we use for each source pixel. If you look at the two shuffle masks that we use for s[1]:

 f_0_high= X X X X X X X X X X X X 3 2 1 7
 f_1_low=  6 5 11 10 9 15 14 13 X X X X X X X X

Since the two shuffle masks don't overlap, we can combine them and simply mask off the irrelevant pixels in combine_, which we already did! The following code performs all of these optimizations (plus it assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.

EDIT: changed the store to _mm_stream_si128, since you are likely doing a lot of writes and we don't want to necessarily flush the cache. Plus it needs to be aligned anyway, so you get the streaming perf for free!

 #include <assert.h>
 #include <inttypes.h>
 #include <tmmintrin.h>

 // needs:
 // orig is 16-byte aligned
 // imagesize is a multiple of 16
 // dest is 16-byte aligned
 void convert(uint8_t *orig, size_t imagesize, uint8_t *dest)
 {
     assert((uintptr_t)orig % 16 == 0);
     assert(imagesize % 16 == 0);

     __m128i shuf0 = _mm_set_epi8(
         -128, -128, -128, -128, // top 4 bytes are not used
         13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first pixel

     __m128i shuf1 = _mm_set_epi8(
         7, 1, 2, 3, // top 4 bytes go to the first pixel
         -128, -128, -128, -128, // unused
         13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to second pixel

     __m128i shuf2 = _mm_set_epi8(
         10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to second pixel
         -128, -128, -128, -128, // unused
         13, 14, 15, 9); // bottom 4 go to third pixel

     __m128i shuf3 = _mm_set_epi8(
         13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to third pixel
         -128, -128, -128, -128); // unused

     __m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
     __m128i mask1 = _mm_set_epi32(0, 0, -1, -1);
     __m128i mask2 = _mm_set_epi32(0, 0, 0, -1);

     uint8_t *end = orig + imagesize * 4;
     for (; orig != end; orig += 64, dest += 48)
     {
         __m128i a = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
         __m128i b = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
         __m128i c = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
         __m128i d = _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);

         _mm_stream_si128((__m128i *)dest,
             _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(mask0, b)));
         _mm_stream_si128((__m128i *)dest + 1,
             _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(mask1, c)));
         _mm_stream_si128((__m128i *)dest + 2,
             _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(mask2, d)));
     }
 }
+15
Jul 24 '11 at 4:19

I'm a little late to the party, and it seems the community has already settled on poseur's pshufb answer, but offering 2000 reputation is so generous that I have to try.

Here is my version without platform-specific intrinsics or machine-specific asm. I have included some cross-platform timing code showing a 4x speedup if you both do the bit-twiddling like me AND enable compiler optimization (register optimization, loop unrolling):

 #include "stdlib.h" #include "stdio.h" #include "time.h" #define UInt8 unsigned char #define IMAGESIZE (1920*1080) int main() { time_t t0, t1; int frames; int frame; typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB; typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR; ARGB* orig = malloc(IMAGESIZE*sizeof(ARGB)); if(!orig) {printf("nomem1");} BGR* dest = malloc(IMAGESIZE*sizeof(BGR)); if(!dest) {printf("nomem2");} printf("to start original hit a key\n"); getch(); t0 = time(0); frames = 1200; for(frame = 0; frame<frames; frame++) { int x; for(x = 0; x < IMAGESIZE; x++) { dest[x].Red = orig[x].Red; dest[x].Green = orig[x].Green; dest[x].Blue = orig[x].Blue; x++; } } t1 = time(0); printf("finished original of %u frames in %u seconds\n", frames, t1-t0); // on my core 2 subnotebook the original took 16 sec // (8 sec with compiler optimization -O3) so at 60 FPS // (instead of the 1200) this would be faster than realtime // (if you disregard any other rendering you have to do). // However if you either want to do other/more processing // OR want faster than realtime processing for eg a video-conversion // program then this would have to be a lot faster still. printf("to start alternative hit a key\n"); getch(); t0 = time(0); frames = 1200; unsigned int* reader; unsigned int* end = reader+IMAGESIZE; unsigned int cur; // your question guarantees 32 bit cpu unsigned int next; unsigned int temp; unsigned int* writer; for(frame = 0; frame<frames; frame++) { reader = (void*)orig; writer = (void*)dest; next = *reader; reader++; while(reader<end) { cur = next; next = *reader; // in the following the numbers are of course the bitmasks for // 0-7 bits, 8-15 bits and 16-23 bits out of the 32 temp = (cur&255)<<24 | (cur&65280)<<16|(cur&16711680)<<8|(next&255); *writer = temp; reader++; writer++; cur = next; next = *reader; temp = (cur&65280)<<24|(cur&16711680)<<16|(next&255)<<8|(next&65280); *writer = temp; reader++; writer++; cur = next; next = *reader; temp = (cur&16711680)<<24|(next&255)<<16|(next&65280)<<8|(next&16711680); *writer = temp; reader++; writer++; } } t1 = time(0); printf("finished alternative of %u frames in %u seconds\n", frames, t1-t0); // on my core 2 subnotebook this alternative took 10 sec // (4 sec with compiler optimization -O3) } 

The results of these runs (on my Core 2 subnotebook):

 F:\>gcc b.c -o b.exe

 F:\>b
 to start original hit a key
 finished original of 1200 frames in 16 seconds
 to start alternative hit a key
 finished alternative of 1200 frames in 10 seconds

 F:\>gcc b.c -O3 -o b.exe

 F:\>b
 to start original hit a key
 finished original of 1200 frames in 8 seconds
 to start alternative hit a key
 finished alternative of 1200 frames in 4 seconds
+11
Jul 24 '11 at 8:50

You want to use Duff's device: http://en.wikipedia.org/wiki/Duff%27s_device . It also works in JavaScript. This post, however, is a fun read: http://lkml.indiana.edu/hypermail/linux/kernel/0008.2/0171.html . Imagine a Duff device with 512 kilobytes of moves.
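As an illustration (my own sketch, not from the linked posts), a 4x-unrolled Duff-style loop for this conversion could look like this, assuming the ARGB/BGR structs from the question and count > 0:

 typedef unsigned char UInt8;
 typedef struct { UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
 typedef struct { UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

 // Duff's device sketch: the switch jumps into the middle of the unrolled
 // do-while to handle counts that are not a multiple of 4.
 static void duff_convert(const ARGB *src, BGR *dst, unsigned count)
 {
     unsigned n = (count + 3) / 4;
     switch (count % 4)
     {
     case 0: do { dst->Blue = src->Blue; dst->Green = src->Green; dst->Red = src->Red; dst++; src++;
     case 3:      dst->Blue = src->Blue; dst->Green = src->Green; dst->Red = src->Red; dst++; src++;
     case 2:      dst->Blue = src->Blue; dst->Green = src->Green; dst->Red = src->Red; dst++; src++;
     case 1:      dst->Blue = src->Blue; dst->Green = src->Green; dst->Red = src->Red; dst++; src++;
             } while (--n > 0);
     }
 }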

+6
Jul 24 '11

This assembly function should do it; however, I don't know whether you want to keep the old data or not - this function overwrites it.

The code is for MinGW GCC with Intel-syntax assembly; you will have to modify it to suit your compiler/assembler.

 extern "C" { int convertARGBtoBGR(uint buffer, uint size); __asm( ".globl _convertARGBtoBGR\n" "_convertARGBtoBGR:\n" " push ebp\n" " mov ebp, esp\n" " sub esp, 4\n" " mov esi, [ebp + 8]\n" " mov edi, esi\n" " mov ecx, [ebp + 12]\n" " cld\n" " convertARGBtoBGR_loop:\n" " lodsd ; load value from [esi] (4byte) to eax, increment esi by 4\n" " bswap eax ; swap eax ( ARGB ) to ( BGRA )\n" " stosd ; store 4 bytes to [edi], increment edi by 4\n" " sub edi, 1; move edi 1 back down, next time we will write over A byte\n" " loop convertARGBtoBGR_loop\n" " leave\n" " ret\n" ); } 

You should call it like this:

 convertARGBtoBGR( &buffer, IMAGESIZE ); 

This function accesses memory only twice per pixel/packet (1 read, 1 write) compared to your brute force method, which had (at least / assuming it was compiled to registers) 3 read and 3 write operations. The method is the same, but the implementation makes it more efficient.

+6
Jul 24 '11 at 3:21

In combination with one of the fast conversion functions here, given access to Core 2s it would be wise to split the translation into threads that each work on their own, say, quarter of the data, as in this pseudocode:

 void bulk_bgrFromArgb(byte[] dest, byte[] src, int n)
 {
     thread threads[] =
     {
         create_thread(bgrFromArgb, dest, src, n/4),
         create_thread(bgrFromArgb, dest+n/4, src+n/4, n/4),
         create_thread(bgrFromArgb, dest+n/2, src+n/2, n/4),
         create_thread(bgrFromArgb, dest+3*n/4, src+3*n/4, n/4),
     }
     join_threads(threads);
 }
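Since the question targets OS X, a POSIX-threads rendering of that pseudocode might look like the following sketch, where bgrFromArgb is assumed to be whichever single-threaded conversion function you picked, and n is a pixel count divisible by 4:

 #include <pthread.h>
 #include <stdint.h>

 // assumed: one of the single-threaded conversion routines above
 extern void bgrFromArgb(uint8_t *dest, const uint8_t *src, int npixels);

 typedef struct { uint8_t *dest; const uint8_t *src; int npixels; } job_t;

 static void *worker(void *arg)
 {
     job_t *j = (job_t *)arg;
     bgrFromArgb(j->dest, j->src, j->npixels);
     return NULL;
 }

 void bulk_bgrFromArgb(uint8_t *dest, const uint8_t *src, int n)
 {
     pthread_t t[4];
     job_t job[4];
     int i, quarter = n / 4;
     for (i = 0; i < 4; i++)
     {
         // note the different byte strides: 3 output and 4 input bytes per pixel
         job[i].dest = dest + (size_t)i * quarter * 3;
         job[i].src  = src  + (size_t)i * quarter * 4;
         job[i].npixels = quarter;
         pthread_create(&t[i], NULL, worker, &job[i]);
     }
     for (i = 0; i < 4; i++)
         pthread_join(t[i], NULL);
 }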
+6
Jul 24 '11 at 5:15

You can do it in chunks of 4 pixels, moving 32 bits at a time with unsigned long pointers. Just think that with 4 32-bit pixels you can construct, by shifting and OR/AND, 3 words representing 4 24-bit pixels, like this:

 //col0 col1 col2 col3
 //ARGB ARGB ARGB ARGB   32 bits reading (4 pixels)
 //BGRB GRBG RBGR        32 bits writing (4 pixels)

Shift operations are always done in a single instruction cycle on all modern 32/64-bit CPUs (barrel shifter technique), so it's the fastest way of building those 3 words for writing; bitwise AND and OR are also blazing fast.

Like this:

 //assuming we have 4 pixels ARGB1 ... ARGB4 and 3 32-bit words W1, W2 and W3 to write,
 //and *dest is an unsigned long pointer to the destination
 //(exact shifts depend on your byte order; this shows the shape of the operation)
 W1 = ((ARGB1 & 0x000000FF) << 24) | ((ARGB1 & 0x0000FF00) << 8) | ((ARGB1 & 0x00FF0000) >> 8) | (ARGB2 & 0x000000FF);
 *dest++ = W1;

etc ... with the following pixels in the loop.

You will need adjustments for images that are not a multiple of 4, but I am sure this is the fastest approach of all, short of using assembler.

And by the way, forget about using structs and indexed access; those are the SLOWEST ways to move data. Just take a look at a disassembly listing of a compiled C++ program and you will agree with me.

+4
Jul 28 '11 at 3:17
 typedef struct
 {
     UInt8 Alpha;
     UInt8 Red;
     UInt8 Green;
     UInt8 Blue;
 } ARGB;

 typedef struct
 {
     UInt8 Blue;
     UInt8 Green;
     UInt8 Red;
 } BGR;

Aside from compiler intrinsics, I'd try the following, verifying the end behavior very carefully, since some of this (where unions are concerned) is likely to be compiler-implementation dependent:

 union uARGB
 {
     ARGB argb;
     UInt32 x;
 };

 union uBGRA
 {
     struct
     {
         BGR bgr;
         UInt8 Alpha;
     } bgra;
     UInt32 x;
 };

and then, for your code kernel, with whatever loop unrolling is appropriate:

 inline void argb2bgr(BGR* pbgr, ARGB* pargb)
 {
     uARGB* puargb = (uARGB*)pargb;
     uBGRA ubgra;
     ubgra.x = __byte_reverse_32(puargb->x);
     *pbgr = ubgra.bgra.bgr;
 }

where __byte_reverse_32() assumes the existence of a compiler built-in that reverses the bytes of a 32-bit word (a concrete sketch follows the summary below).

To summarize the basic approach:

  • view the ARGB structure as a 32-bit integer
  • byte-reverse the 32-bit integer
  • view the byte-reversed 32-bit integer as a (BGR)A structure
  • let the compiler copy the (BGR) part of the (BGR)A structure
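As a concrete sketch of that byte-reversal built-in: __builtin_bswap32 exists in GCC 4.3+ and Clang (on older Apple GCC, OSSwapInt32 from libkern/OSByteOrder.h is an equivalent). Reusing the ARGB/BGR typedefs quoted above:

 #include <stdint.h>
 #include <string.h>

 static inline void argb2bgr_bswap(BGR *pbgr, const ARGB *pargb)
 {
     union { BGR bgr; uint32_t x; } u;   // 4th byte of x is scratch
     uint32_t v;
     memcpy(&v, pargb, sizeof v);        // sidesteps strict-aliasing issues
     u.x = __builtin_bswap32(v);         // ARGB byte order becomes BGRA
     *pbgr = u.bgr;                      // copy only the three BGR bytes
 }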
+3
Jul 24 '11

Although you can use some CPU-based tricks, this kind of operation can be done faster on the GPU.

It seems that you are using C/C++, so there are several alternatives available to you for programming the GPU.

In short: use the GPU for this kind of array operation to get faster computations; GPUs are designed for exactly this.
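Since the question mentions having OpenCL available, a minimal kernel for this conversion could look something like the following sketch (untested; one work-item per pixel, buffer names are illustrative):

 // src holds 4 ARGB bytes per pixel, dst receives 3 tightly packed BGR bytes.
 __kernel void argb_to_bgr(__global const uchar *src, __global uchar *dst)
 {
     size_t i = get_global_id(0);
     dst[i * 3 + 0] = src[i * 4 + 3];  // Blue
     dst[i * 3 + 1] = src[i * 4 + 2];  // Green
     dst[i * 3 + 2] = src[i * 4 + 1];  // Red
 }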

+3
Jul 28 '11 at 11:15

I have not seen anyone show an example of how to do this on the GPU.

Some time ago I wrote something similar to your problem. I received data from a video4linux2 camera in YUV format and wanted to draw it as gray levels on the screen (just the Y component). I also wanted to display areas that are too dark in blue and oversaturated areas in red.

I started from the smooth_opengl3.c example that comes with freeglut.

The data is copied as YUV into a texture, and then the following GLSL shader programs are applied for drawing. I am sure GLSL code runs on all Macs by now, and it will be significantly faster than all the CPU approaches.

Note that I have no experience with how you get the data back. In theory, glReadPixels should read the data back, but I never measured its performance.

OpenCL might be the easier approach, but I will only start developing for it once I have a notebook that supports it.

 (defparameter *vertex-shader*
   "void main(){
      gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
      gl_FrontColor = gl_Color;
      gl_TexCoord[0] = gl_MultiTexCoord0;
    }")

 (defparameter *fragment-shader*
   "uniform sampler2D textureImage;
    void main()
    {
      vec4 q = texture2D(textureImage, gl_TexCoord[0].st);
      float v = q.z;
      if (int(gl_FragCoord.x) % 2 == 0)
        v = q.x;
      float x = 0.0; // 1./255.;
      v -= .278431;
      v *= 1.7;
      if (v >= (1.0 - x))
        gl_FragColor = vec4(255, 0, 0, 255);
      else if (v <= x)
        gl_FragColor = vec4(0, 0, 255, 255);
      else
        gl_FragColor = vec4(v, v, v, 255);
    }")
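For the readback path mentioned above, a minimal sketch (width, height, and dest are placeholders; GL_BGR requires OpenGL 1.2 or the EXT variant):

 // Read the rendered frame back as tightly packed BGR bytes.
 glPixelStorei(GL_PACK_ALIGNMENT, 1);  // no per-row padding
 glReadPixels(0, 0, width, height, GL_BGR, GL_UNSIGNED_BYTE, dest);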


+3
Aug 03 '11


