I am trying to optimize the procedure used in VLC that converts an NV12 frame to a YV12 frame.
For reference, NV12 is identical to YV12, except that the U and V color planes are interleaved. So converting one format into the other is just a matter of deinterleaving the channels: UVUVUVUVUVUVUV becomes UUUUUUU VVVVVVV.
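In plain C, the per-row split is simply this (a scalar reference, the same thing the tail loop of my SSE version below does):

#include <stdint.h>

/* Scalar reference: split one interleaved UV row (NV12) into separate
   U and V rows (YV12). width = number of U/V sample pairs in the row. */
static void SplitUV_C(uint8_t *dstu, uint8_t *dstv,
                      const uint8_t *src, unsigned width)
{
    for (unsigned x = 0; x < width; x++) {
        dstu[x] = src[2 * x + 0];   /* even bytes -> U plane */
        dstv[x] = src[2 * x + 1];   /* odd bytes  -> V plane */
    }
}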
The procedure I'm trying to improve is this: http://git.videolan.org/?p=vlc.git;a=blob;f=modules/video_chroma/copy.c;h=d29843c037e494170f0d6bc976bea8439dd6115b;hb=HEAD#l286
Now, the main problem with this routine is that it requires a 16-byte-aligned memory buffer as intermediate storage. The routine first deinterleaves the data into that cache (4 KiB max), then copies the result from the cache into the destination frame.
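Schematically, that scheme looks like this (my simplified sketch of the idea described above, not the actual VLC code; the 4 KiB cache split into two 2048-byte halves is an assumption):

#include <stdint.h>
#include <string.h>

/* Sketch of the two-pass approach: deinterleave into a small aligned
   cache, then copy the cached planes to their destinations. */
static void SplitUV_cached(uint8_t *dstu, uint8_t *dstv,
                           const uint8_t *src, unsigned width,
                           uint8_t *cache /* 16-byte aligned, >= 4 KiB */)
{
    unsigned done = 0;
    while (done < width) {
        unsigned chunk = width - done;
        if (chunk > 2048)                  /* 2 x 2048 bytes = 4 KiB cache */
            chunk = 2048;

        /* pass 1: deinterleave src into the cache (U half, then V half) */
        for (unsigned x = 0; x < chunk; x++) {
            cache[x]        = src[2 * (done + x) + 0];
            cache[2048 + x] = src[2 * (done + x) + 1];
        }
        /* pass 2: copy the cached planes back out to the destination */
        memcpy(dstu + done, cache,        chunk);
        memcpy(dstv + done, cache + 2048, chunk);

        done += chunk;
    }
}

Either way, every chroma byte is read twice and written twice.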
I rewrote this function so that it no longer requires a cache, using SSE2/SSSE3 instructions that work on unaligned memory when needed, and instructions that require aligned memory when possible.
The code is as follows:
static void SSE_SplitPlanes(uint8_t *dstu, size_t dstu_pitch,
                            uint8_t *dstv, size_t dstv_pitch,
                            const uint8_t *src, size_t src_pitch,
                            uint8_t *cache, size_t cache_size,
                            unsigned width, unsigned height, unsigned cpu)
{
    VLC_UNUSED(cache);
    VLC_UNUSED(cache_size);

    const uint8_t shuffle[] = { 0, 2, 4, 6, 8, 10, 12, 14,
                                1, 3, 5, 7, 9, 11, 13, 15 };
    const uint8_t mask[] = { 0xff, 0x00, 0xff, 0x00, 0xff, 0x00, 0xff, 0x00,
                             0xff, 0x00, 0xff, 0x00, 0xff, 0x00, 0xff, 0x00 };
    const bool aligned = ((uintptr_t)src & 0xf) == 0;

    asm volatile ("mfence");

#define LOAD64A \
    "movdqa  0(%[src]), %%xmm0\n" \
    "movdqa 16(%[src]), %%xmm1\n" \
    "movdqa 32(%[src]), %%xmm2\n" \
    "movdqa 48(%[src]), %%xmm3\n"

#define LOAD64U \
    "movdqu  0(%[src]), %%xmm0\n" \
    "movdqu 16(%[src]), %%xmm1\n" \
    "movdqu 32(%[src]), %%xmm2\n" \
    "movdqu 48(%[src]), %%xmm3\n"

#define STORE2X32 \
    "movq   %%xmm0,  0(%[dst1])\n" \
    "movq   %%xmm1,  8(%[dst1])\n" \
    "movhpd %%xmm0,  0(%[dst2])\n" \
    "movhpd %%xmm1,  8(%[dst2])\n" \
    "movq   %%xmm2, 16(%[dst1])\n" \
    "movq   %%xmm3, 24(%[dst1])\n" \
    "movhpd %%xmm2, 16(%[dst2])\n" \
    "movhpd %%xmm3, 24(%[dst2])\n"

    if (aligned) {
        for (unsigned y = 0; y < height; y++) {
            unsigned x = 0;

#ifdef CAN_COMPILE_SSSE3
            if (vlc_CPU_SSSE3()) {
                for (x = 0; x < (width & ~31); x += 32) {
                    asm volatile (
                        "movdqu (%[shuffle]), %%xmm7\n"
                        LOAD64A
                        "pshufb %%xmm7, %%xmm0\n"
                        "pshufb %%xmm7, %%xmm1\n"
                        "pshufb %%xmm7, %%xmm2\n"
                        "pshufb %%xmm7, %%xmm3\n"
                        STORE2X32
                        : : [dst1]"r"(&dstu[x]), [dst2]"r"(&dstv[x]),
                            [src]"r"(&src[2*x]), [shuffle]"r"(shuffle)
                        : "memory", "xmm0", "xmm1", "xmm2", "xmm3", "xmm7");
                }
            } else
#endif
            {
                for (x = 0; x < (width & ~31); x += 32) {
                    asm volatile (
                        "movdqu (%[mask]), %%xmm7\n"
                        LOAD64A
                        "movdqa %%xmm0, %%xmm4\n"
                        "movdqa %%xmm1, %%xmm5\n"
                        "movdqa %%xmm2, %%xmm6\n"
                        "psrlw  $8, %%xmm0\n"
                        "psrlw  $8, %%xmm1\n"
                        "pand   %%xmm7, %%xmm4\n"
                        "pand   %%xmm7, %%xmm5\n"
                        "pand   %%xmm7, %%xmm6\n"
                        "packuswb %%xmm4, %%xmm0\n"
                        "packuswb %%xmm5, %%xmm1\n"
                        "pand   %%xmm3, %%xmm7\n"
                        "psrlw  $8, %%xmm2\n"
                        "psrlw  $8, %%xmm3\n"
                        "packuswb %%xmm6, %%xmm2\n"
                        "packuswb %%xmm7, %%xmm3\n"
                        STORE2X32
                        : : [dst2]"r"(&dstu[x]), [dst1]"r"(&dstv[x]),
                            [src]"r"(&src[2*x]), [mask]"r"(mask)
                        : "memory", "xmm0", "xmm1", "xmm2", "xmm3",
                          "xmm4", "xmm5", "xmm6", "xmm7");
                }
            }
            for (; x < width; x++) {
                dstu[x] = src[2*x+0];
                dstv[x] = src[2*x+1];
            }
            src  += src_pitch;
            dstu += dstu_pitch;
            dstv += dstv_pitch;
        }
    } else {
        for (unsigned y = 0; y < height; y++) {
            unsigned x = 0;

#ifdef CAN_COMPILE_SSSE3
            if (vlc_CPU_SSSE3()) {
                for (x = 0; x < (width & ~31); x += 32) {
                    asm volatile (
                        "movdqu (%[shuffle]), %%xmm7\n"
                        LOAD64U
                        "pshufb %%xmm7, %%xmm0\n"
                        "pshufb %%xmm7, %%xmm1\n"
                        "pshufb %%xmm7, %%xmm2\n"
                        "pshufb %%xmm7, %%xmm3\n"
                        STORE2X32
                        : : [dst1]"r"(&dstu[x]), [dst2]"r"(&dstv[x]),
                            [src]"r"(&src[2*x]), [shuffle]"r"(shuffle)
                        : "memory", "xmm0", "xmm1", "xmm2", "xmm3", "xmm7");
                }
            } else
#endif
            {
                for (x = 0; x < (width & ~31); x += 32) {
                    asm volatile (
                        "movdqu (%[mask]), %%xmm7\n"
                        LOAD64U
                        "movdqu %%xmm0, %%xmm4\n"
                        "movdqu %%xmm1, %%xmm5\n"
                        "movdqu %%xmm2, %%xmm6\n"
                        "psrlw  $8, %%xmm0\n"
                        "psrlw  $8, %%xmm1\n"
                        "pand   %%xmm7, %%xmm4\n"
                        "pand   %%xmm7, %%xmm5\n"
                        "pand   %%xmm7, %%xmm6\n"
                        "packuswb %%xmm4, %%xmm0\n"
                        "packuswb %%xmm5, %%xmm1\n"
                        "pand   %%xmm3, %%xmm7\n"
                        "psrlw  $8, %%xmm2\n"
                        "psrlw  $8, %%xmm3\n"
                        "packuswb %%xmm6, %%xmm2\n"
                        "packuswb %%xmm7, %%xmm3\n"
                        STORE2X32
                        : : [dst2]"r"(&dstu[x]), [dst1]"r"(&dstv[x]),
                            [src]"r"(&src[2*x]), [mask]"r"(mask)
                        : "memory", "xmm0", "xmm1", "xmm2", "xmm3",
                          "xmm4", "xmm5", "xmm6", "xmm7");
                }
            }
            for (; x < width; x++) {
                dstu[x] = src[2*x+0];
                dstv[x] = src[2*x+1];
            }
            src  += src_pitch;
            dstu += dstu_pitch;
            dstv += dstv_pitch;
        }
    }
#undef STORE2X32
#undef LOAD64U
#undef LOAD64A
}
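To make the asm easier to follow, here is the SSSE3 inner iteration expressed with intrinsics (illustration only, equivalent in spirit to the asm block above; the patch itself uses the inline asm):

#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3, build with -mssse3 */

/* Split 64 bytes of interleaved UV (32 U/V pairs) into 32 U and 32 V bytes. */
static void split32_ssse3(uint8_t *dstu, uint8_t *dstv, const uint8_t *src)
{
    /* even source bytes -> low 8 result bytes (U),
       odd  source bytes -> high 8 result bytes (V) */
    const __m128i shuf = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                       1, 3, 5, 7, 9, 11, 13, 15);
    for (int i = 0; i < 4; i++) {
        __m128i uv = _mm_loadu_si128((const __m128i *)(src + 16 * i));
        __m128i s  = _mm_shuffle_epi8(uv, shuf);          /* UUUUUUUUVVVVVVVV */
        _mm_storel_epi64((__m128i *)(dstu + 8 * i), s);   /* low half  -> U   */
        _mm_storel_epi64((__m128i *)(dstv + 8 * i),
                         _mm_unpackhi_epi64(s, s));       /* high half -> V   */
    }
}

The SSE2 fallback does the same thing with a word mask, shifts and packuswb instead of pshufb.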
Now, benchmarking this function on its own, it runs about 26% faster on an i7-2600 (Sandy Bridge, 3.4 GHz), and slightly better on an i7-4650U (Haswell, 1.7 GHz) with around a 30% speed increase over the original function.
Which is what I expected, since we go from 2 reads + 2 writes down to 1 read + 1 write per byte.
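By "benchmarking this function on its own" I mean timing it in a small standalone harness, roughly like this (frame dimensions, iteration count and timing code are illustrative; it has to be built in the same translation unit as SSE_SplitPlanes above, since that function is static):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* 1920x1080 NV12: the interleaved UV plane is 960 U/V pairs wide,
   540 rows tall, with a pitch of 1920 bytes. */
#define W     960
#define H     540
#define PITCH (2 * W)
#define ITERS 1000

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
    uint8_t *src  = malloc(PITCH * H);   /* interleaved UV source */
    uint8_t *dstu = malloc(W * H);       /* U plane               */
    uint8_t *dstv = malloc(W * H);       /* V plane               */
    memset(src, 0x5a, PITCH * H);

    double t0 = now_ms();
    for (int i = 0; i < ITERS; i++)
        SSE_SplitPlanes(dstu, W, dstv, W, src, PITCH,
                        NULL, 0, W, H, 0 /* cache and cpu unused */);
    double t1 = now_ms();

    printf("%.3f ms per frame\n", (t1 - t0) / ITERS);
    free(src); free(dstu); free(dstv);
    return 0;
}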
However, when used within VLC (the function is called to display every frame decoded through the Intel VAAPI interface), CPU usage for the same video jumps from 20% to 32-34%.
So I'm puzzled as to why that happens and how it could be solved. I expected the opposite result: both routines use SSE2/SSSE3, yet the one that benchmarks faster causes higher CPU usage.
Thanks.