Why doesn't the MSVC auto-vectorizer use AVX2?

I am trying to use auto-vectorization with my compiler (Microsoft Visual Studio 2013). One problem I am facing is that it does not use AVX2. To explore this, I built the following example, which computes the element-wise sum of two arrays of 16 numbers, each 16 bits wide.

    int16_t input1[16] = {0};
    int16_t input2[16] = {0};
    ... // fill the arrays with some data

    // Calculate the sum using a loop
    int16_t output1[16] = {0};
    for (int x = 0; x < 16; x++) {
        output1[x] = input1[x] + input2[x];
    }

The compiler vectorizes this code, but only with 128-bit, SSE-width (xmm) instructions:

    vmovdqu  xmm1, xmmword ptr [rbp+rax]
    lea      rax, [rax+10h]
    vpaddw   xmm1, xmm1, xmmword ptr [rbp+rax+10h]
    vmovdqu  xmmword ptr [rbp+rax+30h], xmm1
    dec      rcx
    jne      main+0b0h

To make sure that the compiler has the ability to generate AVX2 code, I wrote the same calculation as follows:

    // Calculate the sum using one AVX2 instruction
    int16_t output2[16] = {0};
    __m256i in1 = _mm256_loadu_si256((__m256i*)input1);
    __m256i in2 = _mm256_loadu_si256((__m256i*)input2);
    __m256i out2 = _mm256_add_epi16(in1, in2);
    _mm256_storeu_si256((__m256i*)output2, out2);

I checked that the two parts of the code are equivalent (i.e., output1 is equal to output2 after they are executed).

And the compiler does emit AVX2 instructions for the second part of the code:

    vmovdqu  ymm1, ymmword ptr [input2]
    vpaddw   ymm1, ymm1, ymmword ptr [rbp]
    vmovdqu  ymmword ptr [output2], ymm1

However, I do not want to rewrite my code with intrinsics: written as a loop it is much more natural, it stays compatible with older (SSE-only) processors, and it has other advantages.

So, how can I change my example so that the compiler vectorizes it with AVX2?

c++ c vectorization visual-studio-2013 avx2
1 answer

Visual Studio readily generates AVX2 code for floating-point arithmetic. I suppose that is enough to advertise that "VS2013 supports AVX2."

However, no matter what I did, VS2013 did not generate AVX2 code for integer calculations (neither int16_t nor int32_t worked), so I assume this is not supported at all. By comparison, gcc 4.8.2 produces AVX2 for my code; I am not sure about earlier versions.
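For anyone who wants to repeat the comparison, the loop from the question can be isolated in a function, roughly as in the sketch below. The function name `add_int16` and the command lines in the comments are my own assumptions, not something from the question; `/arch:AVX2` and `/Qvec-report:2` are the MSVC switches for enabling AVX2 code generation and printing the vectorizer report, and `-O3 -mavx2` is the usual gcc counterpart. Whether a particular compiler version really emits 256-bit integer instructions still has to be checked in the generated assembly.

```cpp
#include <cstdint>
#include <cstddef>

// Element-wise sum of two int16_t arrays, written as a plain loop so that
// the compiler's auto-vectorizer chooses the instruction set.
// Possible builds (assumed command lines, adjust to your setup):
//   MSVC: cl /O2 /arch:AVX2 /Qvec-report:2 /FA add.cpp
//   gcc:  g++ -O3 -mavx2 -S add.cpp
void add_int16(const std::int16_t* a, const std::int16_t* b,
               std::int16_t* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        // The sum is computed as int and narrowed back to 16 bits,
        // the same arithmetic as the loop in the question.
        out[i] = static_cast<std::int16_t>(a[i] + b[i]);
    }
}
```

Inspecting the assembly listing for `ymm` registers (as opposed to `xmm`) shows which width the vectorizer picked.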

If I had to do calculations on int32_t, I could consider converting the values to float and back. However, since I use int16_t, this does not help.
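A minimal sketch of what I mean by the float round trip for int32_t is shown below. The function name `add_int32_via_float` is hypothetical, and I have not verified which instructions any particular compiler emits for it. Note the important restriction: float has a 24-bit significand, so the result is exact only while every input and every sum has magnitude at most 2^24.

```cpp
#include <cstdint>
#include <cstddef>

// Adds two int32_t arrays by going through float, so that the loop body is
// floating-point arithmetic. Exact only while every input and every sum has
// magnitude at most 2^24 (16777216), because float has a 24-bit significand.
void add_int32_via_float(const std::int32_t* a, const std::int32_t* b,
                         std::int32_t* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        float fa = static_cast<float>(a[i]);   // int32 -> float
        float fb = static_cast<float>(b[i]);
        out[i] = static_cast<std::int32_t>(fa + fb);  // float -> int32
    }
}
```

Outside that range the conversion silently rounds, so this is only an option when the value range is known in advance.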
