Upgrading SSE (SSSE3) YUV to RGB Code

I want to optimize some SSE code that I wrote to convert YUV to RGB (both planar and packed YUV functions).

I am using SSSE3 at the moment, but I am happy to use features from later SSE versions if they help.

I am mainly interested in how to deal with processor stalls and similar issues.

Does anyone know of any tools that do static analysis of SSE code?

;
; Copyright (C) 2009-2010 David McPaul
;
; All rights reserved. Distributed under the terms of the MIT License.
;
; A rather unoptimised set of ssse3 yuv to rgb converters
; does 8 pixels per loop
;
; inputer:
; reads 128 bits of yuv 8 bit data and puts
; the y values converted to 16 bit in xmm0
; the u values converted to 16 bit and duplicated into xmm1
; the v values converted to 16 bit and duplicated into xmm2
;
; conversion:
; does the yuv to rgb conversion using 16 bit integer and the
; results are placed into the following registers as 8 bit clamped values
; r values in xmm3
; g values in xmm4
; b values in xmm5
;
; outputer:
; writes out the rgba pixels as 8 bit values with 0 for alpha
;
; xmm6 used for scratch
; xmm7 used for scratch

%macro cglobal 1
global _%1
%define %1 _%1
align 16
%1:
%endmacro

; conversion code
%macro yuv2rgbsse2 0
; u = u - 128
; v = v - 128
; r = y + v + v >> 2 + v >> 3 + v >> 5
; g = y - (u >> 2 + u >> 4 + u >> 5) - (v >> 1 + v >> 3 + v >> 4 + v >> 5)
; b = y + u + u >> 1 + u >> 2 + u >> 6

; subtract 16 from y
movdqa xmm7, [Const16]      ; loads a constant using data cache (slower on first fetch but then cached)
psubsw xmm0, xmm7           ; y = y - 16
; subtract 128 from u and v
movdqa xmm7, [Const128]     ; loads a constant using data cache (slower on first fetch but then cached)
psubsw xmm1, xmm7           ; u = u - 128
psubsw xmm2, xmm7           ; v = v - 128
; load r,b with y
movdqa xmm3, xmm0           ; r = y
pshufd xmm5, xmm0, 0xE4     ; b = y

; r = y + v + v >> 2 + v >> 3 + v >> 5
paddsw xmm3, xmm2           ; add v to r
movdqa xmm7, xmm1           ; move u to scratch
pshufd xmm6, xmm2, 0xE4     ; move v to scratch

psraw xmm6, 2               ; divide v by 4
paddsw xmm3, xmm6           ; and add to r
psraw xmm6, 1               ; divide v by 2
paddsw xmm3, xmm6           ; and add to r
psraw xmm6, 2               ; divide v by 4
paddsw xmm3, xmm6           ; and add to r

; b = y + u + u >> 1 + u >> 2 + u >> 6
paddsw xmm5, xmm1           ; add u to b
psraw xmm7, 1               ; divide u by 2
paddsw xmm5, xmm7           ; and add to b
psraw xmm7, 1               ; divide u by 2
paddsw xmm5, xmm7           ; and add to b
psraw xmm7, 4               ; divide u by 32
paddsw xmm5, xmm7           ; and add to b

; g = y - u >> 2 - u >> 4 - u >> 5 - v >> 1 - v >> 3 - v >> 4 - v >> 5
movdqa xmm7, xmm2           ; move v to scratch
pshufd xmm6, xmm1, 0xE4     ; move u to scratch
movdqa xmm4, xmm0           ; g = y

psraw xmm6, 2               ; divide u by 4
psubsw xmm4, xmm6           ; subtract from g
psraw xmm6, 2               ; divide u by 4
psubsw xmm4, xmm6           ; subtract from g
psraw xmm6, 1               ; divide u by 2
psubsw xmm4, xmm6           ; subtract from g

psraw xmm7, 1               ; divide v by 2
psubsw xmm4, xmm7           ; subtract from g
psraw xmm7, 2               ; divide v by 4
psubsw xmm4, xmm7           ; subtract from g
psraw xmm7, 1               ; divide v by 2
psubsw xmm4, xmm7           ; subtract from g
psraw xmm7, 1               ; divide v by 2
psubsw xmm4, xmm7           ; subtract from g
%endmacro

; outputer
%macro rgba32sse2output 0
; clamp values
pxor xmm7, xmm7
packuswb xmm3, xmm7         ; clamp to 0,255 and pack R to 8 bit per pixel
packuswb xmm4, xmm7         ; clamp to 0,255 and pack G to 8 bit per pixel
packuswb xmm5, xmm7         ; clamp to 0,255 and pack B to 8 bit per pixel
; convert to bgra32 packed
punpcklbw xmm5, xmm4        ; bgbgbgbgbgbgbgbg
movdqa xmm0, xmm5           ; save bg values
punpcklbw xmm3, xmm7        ; r0r0r0r0r0r0r0r0
punpcklwd xmm5, xmm3        ; lower half bgr0bgr0bgr0bgr0
punpckhwd xmm0, xmm3        ; upper half bgr0bgr0bgr0bgr0
; write to output ptr
movntdq [edi], xmm5         ; output first 4 pixels bypassing cache
movntdq [edi+16], xmm0      ; output second 4 pixels bypassing cache
%endmacro

SECTION .data align=16

Const16  dw 16, 16, 16, 16, 16, 16, 16, 16
Const128 dw 128, 128, 128, 128, 128, 128, 128, 128

UMask db 0x01, 0x80, 0x01, 0x80, 0x05, 0x80, 0x05, 0x80
      db 0x09, 0x80, 0x09, 0x80, 0x0d, 0x80, 0x0d, 0x80
VMask db 0x03, 0x80, 0x03, 0x80, 0x07, 0x80, 0x07, 0x80
      db 0x0b, 0x80, 0x0b, 0x80, 0x0f, 0x80, 0x0f, 0x80
YMask db 0x00, 0x80, 0x02, 0x80, 0x04, 0x80, 0x06, 0x80
      db 0x08, 0x80, 0x0a, 0x80, 0x0c, 0x80, 0x0e, 0x80

; void Convert_YUV422_RGBA32_SSSE3(void *fromPtr, void *toPtr, int width)
width   equ ebp+16
toPtr   equ ebp+12
fromPtr equ ebp+8

; void Convert_YUV420P_RGBA32_SSSE3(void *fromYPtr, void *fromUPtr, void *fromVPtr, void *toPtr, int width)
width1   equ ebp+24
toPtr1   equ ebp+20
fromVPtr equ ebp+16
fromUPtr equ ebp+12
fromYPtr equ ebp+8

SECTION .text align=16

cglobal Convert_YUV422_RGBA32_SSSE3
; reserve variables
push ebp
mov ebp, esp
push edi
push esi
push ecx

mov esi, [fromPtr]
mov edi, [toPtr]
mov ecx, [width]
; loop width / 8 times
shr ecx, 3
test ecx, ecx
jng ENDLOOP
REPEATLOOP:                 ; loop over width / 8
; YUV422 packed inputer
movdqa xmm0, [esi]          ; should have yuyv yuyv yuyv yuyv
pshufd xmm1, xmm0, 0xE4     ; copy to xmm1
movdqa xmm2, xmm0           ; copy to xmm2
; extract both y giving y0y0
pshufb xmm0, [YMask]
; extract u and duplicate so each u in yuyv becomes u0u0
pshufb xmm1, [UMask]
; extract v and duplicate so each v in yuyv becomes v0v0
pshufb xmm2, [VMask]

yuv2rgbsse2

rgba32sse2output

; endloop
add edi, 32
add esi, 16
sub ecx, 1                  ; apparently sub is better than dec
jnz REPEATLOOP
ENDLOOP:
; Cleanup
pop ecx
pop esi
pop edi
mov esp, ebp
pop ebp
ret

cglobal Convert_YUV420P_RGBA32_SSSE3
; reserve variables
push ebp
mov ebp, esp
push edi
push esi
push ecx
push eax
push ebx

mov esi, [fromYPtr]
mov eax, [fromUPtr]
mov ebx, [fromVPtr]
mov edi, [toPtr1]
mov ecx, [width1]
; loop width / 8 times
shr ecx, 3
test ecx, ecx
jng ENDLOOP1
REPEATLOOP1:                ; loop over width / 8
; YUV420 planar inputer
movq xmm0, [esi]            ; fetch 8 y values (8 bit) yyyyyyyy00000000
movd xmm1, [eax]            ; fetch 4 u values (8 bit) uuuu000000000000
movd xmm2, [ebx]            ; fetch 4 v values (8 bit) vvvv000000000000

; extract y
pxor xmm7, xmm7             ; 00000000000000000000000000000000
punpcklbw xmm0, xmm7        ; interleave xmm7 into xmm0 y0y0y0y0y0y0y0y0
; extract u and duplicate so each becomes 0u0u
punpcklbw xmm1, xmm7        ; interleave xmm7 into xmm1 u0u0u0u000000000
punpcklwd xmm1, xmm7        ; interleave again u000u000u000u000
pshuflw xmm1, xmm1, 0xA0    ; copy u values
pshufhw xmm1, xmm1, 0xA0    ; to get u0u0
; extract v
punpcklbw xmm2, xmm7        ; interleave xmm7 into xmm2 v0v0v0v000000000
punpcklwd xmm2, xmm7        ; interleave again v000v000v000v000
pshuflw xmm2, xmm2, 0xA0    ; copy v values
pshufhw xmm2, xmm2, 0xA0    ; to get v0v0

yuv2rgbsse2

rgba32sse2output

; endloop
add edi, 32
add esi, 8
add eax, 4
add ebx, 4
sub ecx, 1                  ; apparently sub is better than dec
jnz REPEATLOOP1
ENDLOOP1:
; Cleanup
pop ebx
pop eax
pop ecx
pop esi
pop edi
mov esp, ebp
pop ebp
ret

SECTION .note.GNU-stack noalloc noexec nowrite progbits
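For reference, the shift-and-add sequences in yuv2rgbsse2 approximate the BT.601 coefficients (about 1.406*V for red, 0.344*U + 0.719*V for green, 1.766*U for blue). Below is a scalar C sketch of the same math, useful for checking the SIMD output pixel by pixel; the helper names are mine, and it assumes signed right shift is arithmetic, matching psraw:

    #include <stdint.h>

    static uint8_t clamp255(int x) { return x < 0 ? 0 : x > 255 ? 255 : (uint8_t)x; }

    /* Same fixed-point approximation as the yuv2rgbsse2 macro, one pixel at a time. */
    static void yuv_to_rgb_ref(int y8, int u8, int v8,
                               uint8_t *r, uint8_t *g, uint8_t *b)
    {
        int y = y8 - 16;
        int u = u8 - 128;
        int v = v8 - 128;

        *r = clamp255(y + v + (v >> 2) + (v >> 3) + (v >> 5));
        *g = clamp255(y - ((u >> 2) + (u >> 4) + (u >> 5))
                        - ((v >> 1) + (v >> 3) + (v >> 4) + (v >> 5)));
        *b = clamp255(y + u + (u >> 1) + (u >> 2) + (u >> 6));
    }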
3 answers

If you keep u and v interleaved in the same register and use pmaddwd with pre-calculated constants instead of your shift-and-add approach, you can compress the conversion code to about a third of its size and get rid of most of the stalls at the same time:

 ; xmm0 = yyyyyyyy
 ; xmm3 = uvuvuvuv
 psubsw xmm3, [Const128]
 psubsw xmm0, [Const16]
 movdqa xmm4, xmm3
 movdqa xmm5, xmm3
 pmaddwd xmm3, [const_1]
 pmaddwd xmm4, [const_2]
 pmaddwd xmm5, [const_3]
 psrad xmm3, 14
 psrad xmm4, 14
 psrad xmm5, 14
 pshufb xmm3, [const_4]     ; or pshuflw & pshufhw
 pshufb xmm4, [const_4]
 pshufb xmm5, [const_4]
 paddsw xmm3, xmm0
 paddsw xmm4, xmm0
 paddsw xmm5, xmm0
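A minimal sketch of what the pre-calculated constants and the pmaddwd step could look like with intrinsics, assuming 2.14 fixed point to match the psrad by 14. The coefficient values below are my guesses at BT.601-style constants, not taken from the answer; const_4 would then be a pshufb mask that copies each 32-bit result to the two pixels sharing that u/v pair:

    #include <emmintrin.h>   /* SSE2: _mm_madd_epi16, _mm_srai_epi32 */
    #include <stdint.h>

    /* Each 16-bit pair is (coefficient for u, coefficient for v), so pmaddwd
     * computes u*cu + v*cv for every u/v pair in the register. */
    #define FIX14(x) ((int16_t)((x) * 16384))

    static const int16_t const_1[8] = {            /* red:   0*u + 1.402*v      */
        0, FIX14(1.402), 0, FIX14(1.402), 0, FIX14(1.402), 0, FIX14(1.402) };
    static const int16_t const_2[8] = {            /* green: -0.344*u - 0.714*v */
        FIX14(-0.344), FIX14(-0.714), FIX14(-0.344), FIX14(-0.714),
        FIX14(-0.344), FIX14(-0.714), FIX14(-0.344), FIX14(-0.714) };
    static const int16_t const_3[8] = {            /* blue:  1.772*u + 0*v      */
        FIX14(1.772), 0, FIX14(1.772), 0, FIX14(1.772), 0, FIX14(1.772), 0 };

    /* uv = u0 v0 u1 v1 u2 v2 u3 v3 as signed 16-bit words, already biased by -128. */
    static inline __m128i chroma_offsets_red(__m128i uv)
    {
        __m128i acc = _mm_madd_epi16(uv, _mm_loadu_si128((const __m128i *)const_1));
        return _mm_srai_epi32(acc, 14);   /* four 32-bit red offsets, one per u/v pair */
    }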

If you want it to go even faster, playing with PMADDUBSW should let you process 16 pixels at a time for a slight increase in complexity.

Most processors (particularly the non-Intel ones, which apparently don't have a useful hardware prefetcher, but to a lesser extent Intel ones too) will benefit from a prefetch of [esi+256] thrown inside the loop.
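In the NASM loops from the question that would be a single prefetchnta (or prefetcht0) instruction near the top of the loop body. With intrinsics it could look like the sketch below; the 256-byte distance is just the value suggested above and would normally be tuned per CPU:

    #include <xmmintrin.h>   /* _mm_prefetch */
    #include <stddef.h>

    /* Illustrative loop skeleton: prefetch source data ~256 bytes ahead of the
     * bytes being processed in this iteration. */
    static void process_with_prefetch(const unsigned char *src, size_t nbytes)
    {
        for (size_t i = 0; i < nbytes; i += 16) {
            _mm_prefetch((const char *)(src + i + 256), _MM_HINT_NTA);
            /* ... convert the 16 source bytes at src + i here ... */
        }
    }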

EDIT: code that uses PMADDUBSW might look like this (correctness is not guaranteed):

 const_a: times 4 db 1, 3
          times 4 db 5, 7
 const_b: times 4 db 9, 11
          times 4 db 13, 15
 const_c: times 8 dw 0x00ff
 const_d: times 4 dd 0x00ffffff
 const_uv_to_rgb_mul: ...
 const_uv_to_rgb_add: ...

 movdqa xmm4, [esi]
 movdqa xmm0, xmm4
 movdqa xmm1, xmm4
 pshufb xmm0, [const_a]
 pshufb xmm1, [const_b]
 pand xmm4, [const_c]
 ; xmm0: uv0 uv0 uv0 uv0 uv2 uv2 uv2 uv2
 ; xmm1: uv4 uv4 uv4 uv4 ...
 ; xmm4: y0 0 y1 0 y2 0 y3 0 y4 0 y5 0 y6 0 y7 0
 pmaddubsw xmm0, [const_uv_to_rgb_mul]
 pmaddubsw xmm1, [const_uv_to_rgb_mul]
 paddsw xmm0, [const_uv_to_rgb_add]
 paddsw xmm1, [const_uv_to_rgb_add]
 psraw xmm0, 6
 psraw xmm1, 6
 ; xmm0: r01 g01 b01 0 r23 g23 b23 0
 pshufd xmm2, xmm0, 2+3*4+2*16+3*64
 pshufd xmm0, xmm0, 0+1*4+0*16+1*64
 pshufd xmm3, xmm1, 2+3*4+2*16+3*64
 pshufd xmm1, xmm1, 0+1*4+0*16+1*64
 ; xmm0: r01 g01 b01 0 r01 g01 b01 0
 ; xmm2: r23 g23 b23 0 r23 g23 b23 0
 ; xmm1: r45 g45 b45 0 r45 g45 b45 0
 paddsw xmm0, xmm4          ; add y
 paddsw xmm1, xmm4
 paddsw xmm2, xmm4
 paddsw xmm3, xmm4
 packuswb xmm0, xmm2        ; pack with saturation into 0-255 range
 packuswb xmm1, xmm3
 pand xmm0, [const_d]       ; zero out the alpha byte
 pand xmm1, [const_d]
 movntdq [edi], xmm0
 movntdq [edi+16], xmm1
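For readability: the pshufd immediates written as sums are just the four 2-bit dword selectors, so 2 + 3*4 + 2*16 + 3*64 = 0xEE selects dwords 2,3,2,3 (duplicates the high 64 bits), and 0 + 1*4 + 0*16 + 1*64 = 0x44 selects dwords 0,1,0,1 (duplicates the low 64 bits).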

Lookup tables work if you use saturating additions, but they limit you to one pixel at a time, and memory lookups are slow when they miss the cache. Three pmaddubsw work fine, but the instruction is slow on Core 2 and not available on older processors, so four pmul may work better.


Given that the source data is only 8 bits per component, have you tried something simple, for example:

 uint32_t YtoRGBlookupTable[256] = { /* precomputed table (cut & pasted from a spreadsheet or something) */ };
 uint32_t UtoRGBlookupTable[256] = { /* precomputed table (cut & pasted from a spreadsheet or something) */ };
 uint32_t VtoRGBlookupTable[256] = { /* precomputed table (cut & pasted from a spreadsheet or something) */ };

 while (i < something) {
     UVtemp = UtoRGBlookupTable[src->u0] + VtoRGBlookupTable[src->v0];
     dest[i]   = YtoRGBlookupTable[src->y0] + UVtemp;
     dest[i+1] = YtoRGBlookupTable[src->y1] + UVtemp;
     UVtemp = UtoRGBlookupTable[src->u1] + VtoRGBlookupTable[src->v1];
     dest[i+2] = YtoRGBlookupTable[src->y2] + UVtemp;
     dest[i+3] = YtoRGBlookupTable[src->y3] + UVtemp;
     i += 4;
     src++;
 }

D'oh - sorry. This will not work, because you cannot prevent green from overflowing into red; you would need to handle green separately.
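To make that failure concrete, assume each table entry packs its per-channel contribution as 0x00RRGGBB (the values below are made up for illustration):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical Y and U+V contributions packed as 0x00RRGGBB. */
        uint32_t y_part  = 0x001080F0;
        uint32_t uv_part = 0x000090F0;
        uint32_t pixel   = y_part + uv_part;   /* 0x001111E0 */
        /* Blue (0xF0 + 0xF0) carries into green, and green (0x80 + 0x90 plus
         * the carry) carries into red, so all three channels come out wrong
         * instead of saturating at 255. */
        printf("0x%08X\n", pixel);
        return 0;
    }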

