Helping GCC with auto-vectorization

I have a shader that I need to optimize (with a lot of vector operations), and I'm experimenting with SSE instructions to better understand the problem.

I have a very simple code example. With the USE_SSE define it uses explicit SSE intrinsics; without it, I'm hoping GCC will do the work for me. Auto-vectorization feels a bit finicky, but I'm hoping it can save me some hair.

Compiler and platform: gcc 4.7.1 (tdm64), target x86_64-w64-mingw32 and Windows 7 on Ivy Bridge.

Here's the test code:

    /* Include all the SIMD intrinsics. */
    #ifdef USE_SSE
    #include <x86intrin.h>
    #endif

    #include <cstdio>

    #if defined(__GNUG__) || defined(__clang__)
        /* GCC & CLANG */
        #define SSVEC_FINLINE __attribute__((always_inline))
    #elif defined(_WIN32) && defined(_MSC_VER)
        /* MSVC. */
        #define SSVEC_FINLINE __forceinline
    #else
        #error Unsupported platform.
    #endif

    #ifdef USE_SSE
        typedef __m128 vec4f;

        inline void addvec4f(vec4f &a, vec4f const &b)
        {
            a = _mm_add_ps(a, b);
        }
    #else
        typedef float vec4f[4];

        inline void addvec4f(vec4f &a, vec4f const &b)
        {
            a[0] = a[0] + b[0];
            a[1] = a[1] + b[1];
            a[2] = a[2] + b[2];
            a[3] = a[3] + b[3];
        }
    #endif

    int main(int argc, char *argv[])
    {
        int const count = 1e7;

    #ifdef USE_SSE
        printf("Using SSE.\n");
    #else
        printf("Not using SSE.\n");
    #endif

        vec4f data = {1.0f, 1.0f, 1.0f, 1.0f};

        for (int i = 0; i < count; ++i)
        {
            vec4f val = {0.1f, 0.1f, 0.1f, 0.1f};
            addvec4f(data, val);
        }

        float result[4] = {0};
    #ifdef USE_SSE
        _mm_store_ps(result, data);
    #else
        result[0] = data[0];
        result[1] = data[1];
        result[2] = data[2];
        result[3] = data[3];
    #endif

        printf("Result: %f %f %f %f\n", result[0], result[1], result[2], result[3]);

        return 0;
    }

This is compiled with:

    g++ -O3 ssetest.cpp -o nossetest.exe
    g++ -O3 -DUSE_SSE ssetest.cpp -o ssetest.exe

Apart from the explicit SSE version being slightly faster, there is no difference in the output.

Here's the assembly for the loop, first the explicit SSE version:

    .L3:
        subl    $1, %eax
        addps   %xmm1, %xmm0
        jne     .L3

It inlined the call. Nice: the loop is more or less just the straight _mm_add_ps.

Array version:

    .L3:
        subl    $1, %eax
        addss   %xmm0, %xmm1
        addss   %xmm0, %xmm2
        addss   %xmm0, %xmm3
        addss   %xmm0, %xmm4
        jne     .L3

It uses SSE math all right, but on each array element individually (scalar addss instead of packed addps). Not really desirable.

My question is: how can I help GCC optimize the array version of vec4f better?

Any Linux-specific tips are helpful too, since that is where the real code will run.

+4
2 answers

The Lockless article "Auto-vectorization with gcc 4.7" is the best one I have seen on this, and I spent some time looking for good articles on similar topics. They also have many other articles on low-level software development that you may find very useful.

+6

Here are some tips, based on your code, for getting GCC to auto-vectorize it:

  • Make the loop upper bound a constant . To vectorize, GCC needs to split the loop into groups of 4 iterations so that each group fits in an SSE XMM register, which is 128 bits wide. A constant upper bound helps GCC confirm that the loop has enough iterations and that vectorizing it is profitable.
  • Remove the inline keyword . If the code is marked inline, GCC cannot know whether the start of the array is aligned without interprocedural analysis, which -O3 does not turn on.

    So, to get your code vectorized, your addvec4f function should be changed as follows:

        void addvec4f(vec4f &a, vec4f const &b)
        {
            int i = 0;
            for (; i < 4; i++)
                a[i] = a[i] + b[i];
        }
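
As a rough sketch of the two tips above, assuming the plain float[4] version of vec4f from the question: the constant kCount and the helper accumulate are hypothetical names introduced here only for illustration. Whether GCC actually vectorizes this depends on the compiler version and flags, so it is worth checking the generated assembly rather than trusting the sketch.

```cpp
typedef float vec4f[4];

// Hypothetical constant, not from the question: a compile-time loop bound.
const int kCount = 8;

// Loop form of the add: a counted loop with a constant upper bound of 4,
// and no inline keyword, as the tips above suggest.
void addvec4f(vec4f &a, vec4f const &b)
{
    for (int i = 0; i < 4; ++i)
        a[i] = a[i] + b[i];
}

// Hypothetical helper: applies the add kCount times to data.
void accumulate(vec4f &data, vec4f const &val)
{
    for (int n = 0; n < kCount; ++n)
        addvec4f(data, val);
}
```

Calling accumulate on {1, 1, 1, 1} with {0.5, 0.5, 0.5, 0.5} leaves 5 in every element, since every intermediate value is exactly representable as a float.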

BTW:

  • GCC also has a flag that reports whether a loop was vectorized: -ftree-vectorizer-verbose=2 . A larger number produces more output; currently the value may be 0 , 1 , or 2 . The GCC documentation describes this flag and other related flags.
  • Be careful with alignment . The address of the array must be aligned, and the compiler cannot know whether an address is aligned without running the code. An aligned SSE load or store on misaligned data will usually fault at run time (a bus error or segmentation fault).
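
One way to address the alignment point, sketched under assumptions: the type Vec4 and the functions is_aligned16 and add_aligned are hypothetical names, not from the question. The sketch relies on two GCC extensions, the aligned attribute and __builtin_assume_aligned (available since GCC 4.7), to request 16-byte alignment and to pass that promise on to the optimizer. It does not guarantee vectorization.

```cpp
#include <stdint.h>

// Hypothetical wrapper type: four floats with 16-byte alignment requested
// through the GCC/Clang aligned attribute.
struct Vec4
{
    float v[4] __attribute__((aligned(16)));
};

// Returns true when p sits on a 16-byte boundary.
bool is_aligned16(const void *p)
{
    return reinterpret_cast<uintptr_t>(p) % 16 == 0;
}

// Adds b into a element by element. The caller must pass 16-byte aligned
// pointers; __builtin_assume_aligned only tells the compiler about that
// promise, it does not check it.
void add_aligned(float *a, const float *b)
{
    float *pa = static_cast<float *>(__builtin_assume_aligned(a, 16));
    const float *pb = static_cast<const float *>(__builtin_assume_aligned(b, 16));
    for (int i = 0; i < 4; ++i)
        pa[i] += pb[i];
}
```

Building this with -O3 -ftree-vectorizer-verbose=2 on gcc 4.7 should print the vectorizer's report for the loop in add_aligned, which is a quicker check than reading the assembly.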
+5
