SIMD alignment issue with PPL Combinable

I am trying to sum the elements of an array in parallel with SIMD. To avoid locking, I accumulate into a thread-local concurrency::combinable<__m128i>. Its local value is not always 16-byte aligned, and because of this _mm_add_epi32 throws an exception.

    concurrency::combinable<__m128i> sum_combine;

    int length = 40; // multiple of 8
    concurrency::parallel_for(0, length, 8, [&](int it)
    {
        // Each iteration handles eight 32-bit values: two vectors of four lanes.
        __m128i v1 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it));
        __m128i v2 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it + 4));
        auto temp = _mm_add_epi32(v1, v2);

        auto &sum = sum_combine.local(); // here is the problem

        TRACE(L"%d\n", it);
        TRACE(L"add %p\n", &sum);

        ASSERT((reinterpret_cast<uintptr_t>(&sum) & 15) == 0);

        sum = _mm_add_epi32(temp, sum);
    });

Here is the definition of combinable in ppl.h:

    template<typename _Ty>
    class combinable
    {
    private:

    // Disable warning C4324: structure was padded due to __declspec(align())
    // This padding is expected and necessary.
    #pragma warning(push)
    #pragma warning(disable: 4324)
        __declspec(align(64))
        struct _Node
        {
            unsigned long _M_key;
            _Ty _M_value; // this might not be aligned on 16 bytes
            _Node* _M_chain;

            _Node(unsigned long _Key, _Ty _InitialValue)
                : _M_key(_Key), _M_value(_InitialValue), _M_chain(NULL)
            {
            }
        };

Sometimes the alignment happens to be correct and the code works, but most of the time it fails.

I tried the following, but it does not compile:

    union combine
    {
        unsigned short x[sizeof(__m128i) / sizeof(unsigned int)];
        __m128i y;
    };

    concurrency::combinable<combine> sum_combine;

then

    auto &sum = sum_combine.local().y;

Any suggestions for fixing the alignment problem while still using combinable?

On x64 it works fine, because the default alignment there is 16 bytes. On x86 the alignment problem shows up intermittently.
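One way to read the symptoms above (an interpretation, not something the post or ppl.h confirms): inside the node the compiler already pads the value to a 16-byte offset, so whether the value ends up aligned depends only on where the node itself is placed. The sketch below uses a hypothetical stand-in struct, NodeLike, and two illustrative helpers, is_aligned16 and value_would_be_aligned. None of these names come from PPL.

```cpp
#include <emmintrin.h>  // SSE2: __m128i
#include <cassert>
#include <cstddef>      // offsetof
#include <cstdint>      // std::uintptr_t

// Hypothetical stand-in for the _Node quoted above (not the real ppl.h type).
struct NodeLike {
    unsigned long key;
    __m128i       value;
    NodeLike*     chain;
};

// Within the struct, padding after `key` puts `value` on a 16-byte offset.
static_assert(alignof(__m128i) == 16, "__m128i requires 16-byte alignment");
static_assert(offsetof(NodeLike, value) % 16 == 0, "value sits at a 16-byte offset");

// True when p lies on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Would `value` be 16-byte aligned if a node were placed at `node_base`?
// Only address arithmetic is done here; no node is ever constructed.
inline bool value_would_be_aligned(const unsigned char* node_base) {
    assert(node_base != nullptr);
    return is_aligned16(node_base + offsetof(NodeLike, value));
}
```

With a 16-byte-aligned base the value is aligned; with a base that is only 8-byte aligned it is not. That would match the x64/x86 difference described above, assuming the allocation that backs the node is only 8-byte aligned on 32-bit builds.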

c++ simd ppl
2 answers

I just loaded sum using an unaligned load:

    auto &sum = sum_combine.local();

    #if !defined(_M_X64)
        if (((unsigned long)&sum & 15) != 0)
        {
            // Breakpoint target only: reaching this line means sum is unaligned.
            int a = 5;
        }
        auto sum_temp = _mm_loadu_si128(&sum);
        // Store unaligned as well: a plain assignment may compile to an aligned store.
        _mm_storeu_si128(&sum, _mm_add_epi32(temp, sum_temp));
    #else
        sum = _mm_add_epi32(temp, sum);
    #endif

Since the sum variable used with _mm_add_epi32 is not aligned, you need to load and store sum explicitly using unaligned loads and stores (_mm_loadu_si128 / _mm_storeu_si128). Change:

 sum = _mm_add_epi32(temp, sum); 

to:

    // acc is a fresh name; v2 is already declared earlier in the lambda
    __m128i acc = _mm_loadu_si128((__m128i *)&sum);
    acc = _mm_add_epi32(acc, temp);
    _mm_storeu_si128((__m128i *)&sum, acc);
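The load / add / store pattern from this answer can be tried in isolation, without PPL, on a deliberately misaligned address. This is a minimal sketch: accumulate_unaligned and horizontal_sum are illustrative helper names, not part of any library, and the slot pointer stands in for the address returned by sum_combine.local().

```cpp
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_add_epi32, _mm_storeu_si128
#include <cassert>
#include <cstdint>
#include <cstring>      // std::memcpy

// Adds the four 32-bit lanes of `lanes` into the 16 bytes at `slot`.
// `slot` may sit at any address; no 16-byte alignment is assumed.
inline void accumulate_unaligned(void* slot, __m128i lanes) {
    assert(slot != nullptr);
    __m128i acc = _mm_loadu_si128(static_cast<const __m128i*>(slot));
    acc = _mm_add_epi32(acc, lanes);
    _mm_storeu_si128(static_cast<__m128i*>(slot), acc);
}

// Sums the four 32-bit lanes stored at `slot` into one scalar.
// memcpy keeps the read valid for an unaligned address.
inline std::uint32_t horizontal_sum(const void* slot) {
    std::uint32_t lanes[4];
    std::memcpy(lanes, slot, sizeof lanes);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Pointing slot four bytes into a zeroed, 16-byte-aligned buffer and calling accumulate_unaligned on it exercises the misaligned case that fails with the plain assignment in the question. The unaligned intrinsics also accept aligned addresses, so the same code path can serve both x86 and x64 builds.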