I have a shader that I need to optimize (with a lot of vector operations), and I'm experimenting with SSE instructions to better understand the problem.
I have a very simple test case. With the USE_SSE define it uses explicit SSE intrinsics; without it, I hope GCC will do the work for me. Auto-vectorization feels a bit fragile, but I hope it saves me from pulling my hair out.
Compiler and platform: gcc 4.7.1 (tdm64), target x86_64-w64-mingw32 and Windows 7 on Ivy Bridge.
Here's the test code:
#ifdef USE_SSE
#include <x86intrin.h>
#endif

#include <cstdio>

#if defined(__GNUG__) || defined(__clang__)
/* GCC & Clang. */
#define SSVEC_FINLINE __attribute__((always_inline))
#elif defined(_WIN32) && defined(_MSC_VER)
/* MSVC. */
#define SSVEC_FINLINE __forceinline
#else
#error Unsupported platform.
#endif

#ifdef USE_SSE
typedef __m128 vec4f;

inline void addvec4f(vec4f &a, vec4f const &b)
{
    a = _mm_add_ps(a, b);
}
#else
typedef float vec4f[4];

inline void addvec4f(vec4f &a, vec4f const &b)
{
    a[0] = a[0] + b[0];
    a[1] = a[1] + b[1];
    a[2] = a[2] + b[2];
    a[3] = a[3] + b[3];
}
#endif

int main(int argc, char *argv[])
{
    int const count = 1e7;

#ifdef USE_SSE
    printf("Using SSE.\n");
#else
    printf("Not using SSE.\n");
#endif

    vec4f data = {1.0f, 1.0f, 1.0f, 1.0f};

    for (int i = 0; i < count; ++i)
    {
        vec4f val = {0.1f, 0.1f, 0.1f, 0.1f};
        addvec4f(data, val);
    }

    float result[4] = {0};
#ifdef USE_SSE
    _mm_store_ps(result, data);
#else
    result[0] = data[0];
    result[1] = data[1];
    result[2] = data[2];
    result[3] = data[3];
#endif

    printf("Result: %f %f %f %f\n", result[0], result[1], result[2], result[3]);

    return 0;
}
This is compiled with:
g++ -O3 ssetest.cpp -o nossetest.exe
g++ -O3 -DUSE_SSE ssetest.cpp -o ssetest.exe
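As an aside, these are the extra flags I'm considering to see what the vectorizer is doing (flag names taken from the GCC documentation for the 4.7 series; I haven't verified them on this exact setup):

```shell
# Ask GCC's tree vectorizer to report what it vectorized and why/why not
# (gcc 4.7 syntax; later releases replaced this with -fopt-info-vec).
g++ -O3 -ftree-vectorizer-verbose=2 ssetest.cpp -o nossetest.exe

# Let GCC use all instruction set extensions of the host CPU (Ivy Bridge here).
g++ -O3 -march=native ssetest.cpp -o nossetest.exe
```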
Apart from the explicit SSE version being slightly faster, there is no difference in the output.
Here's the generated assembly for the loop, first the explicit SSE version:
.L3:
    subl    $1, %eax
    addps   %xmm1, %xmm0
    jne     .L3
Straightforward enough. Nice, pretty much just the _mm_add_ps.
The array version:
.L3:
    subl    $1, %eax
    addss   %xmm0, %xmm1
    addss   %xmm0, %xmm2
    addss   %xmm0, %xmm3
    addss   %xmm0, %xmm4
    jne     .L3
It uses SSE math all right, but with a scalar addss for each array element. Not really desirable.
My question is: how can I help GCC optimize the float[4] version of vec4f better?
Linux-specific tips are also welcome, since that is where the real code will run.