I am trying to sum the elements of an array in parallel with SIMD. To avoid locking, I use a thread-local combinable, but its local value is not always 16-byte aligned, and because of this _mm_add_epi32 throws an exception.
    concurrency::combinable<__m128i> sum_combine;
    int length = 40; // multiple of 8
    concurrency::parallel_for(0, length, 8, [&](int it)
    {
        __m128i v1 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it));
        __m128i v2 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it + sizeof(uint32_t)));
        auto temp = _mm_add_epi32(v1, v2);
        auto &sum = sum_combine.local(); // here is the problem
        TRACE(L"%d\n", it);
        TRACE(L"add %x\n", &sum);
        ASSERT(((unsigned long)&sum & 15) == 0);
        sum = _mm_add_epi32(temp, sum);
    });
Here is the definition of combinable from ppl.h:
    template<typename _Ty>
    class combinable
    {
    private:
Sometimes the alignment is OK and the code works fine, but most of the time it does not.
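The alignment test in the ASSERT above can be written with a pointer-sized integer so that it behaves the same on x86 and x64. This is only a minimal sketch; `is_aligned16` is a hypothetical helper name, not something from ppl.h or the original code.

```cpp
#include <cstdint>

// Hypothetical helper: returns true when p lies on a 16-byte boundary,
// which is what aligned SSE loads and stores of __m128i require.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}
```

With such a helper the check would read `ASSERT(is_aligned16(&sum));`.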
I tried the following, but it does not compile:
    union combine
    {
        unsigned short x[sizeof(__m128i) / sizeof(unsigned int)];
        __m128i y;
    };
    concurrency::combinable<combine> sum_combine;
and then:
    auto &sum = sum_combine.local().y;
Any suggestions for fixing the alignment problem while still using combinable?
On x64 it works fine, because the default alignment is 16 bytes. On x86, alignment problems occur intermittently.
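One possible direction is sketched below. It rests on an assumption: the per-thread value is allocated on the heap, and 32-bit MSVC heap allocations are only guaranteed 8-byte alignment. Under that assumption the accumulator can be stored as four plain 32-bit lanes and only touched through unaligned loads and stores. The names `Lanes`, `accumulate`, `process_chunk`, `horizontal_sum` and `sum_u32` are hypothetical. `sum_u32` is a serial stand-in for the parallel_for loop so that the sketch compiles without ppl.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical accumulator: four 32-bit lanes that need only 4-byte
// alignment, so it does not matter where the per-thread copy is placed.
struct Lanes {
    std::uint32_t v[4] = {0, 0, 0, 0};
};

// Add a SIMD register into the accumulator using unaligned load/store.
inline void accumulate(Lanes& acc, __m128i x) {
    __m128i cur = _mm_loadu_si128(reinterpret_cast<const __m128i*>(acc.v));
    cur = _mm_add_epi32(cur, x);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(acc.v), cur);
}

// Work for one chunk of 8 elements starting at input + it
// (the body of the loop in the question).
inline void process_chunk(Lanes& acc, const std::uint32_t* input, std::size_t it) {
    __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(input + it));
    __m128i v2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(input + it + 4));
    accumulate(acc, _mm_add_epi32(v1, v2));
}

// Reduce the four lanes to a single value.
inline std::uint32_t horizontal_sum(const Lanes& acc) {
    return acc.v[0] + acc.v[1] + acc.v[2] + acc.v[3];
}

// Serial stand-in for the parallel loop; length must be a multiple of 8.
inline std::uint32_t sum_u32(const std::uint32_t* input, std::size_t length) {
    assert(length % 8 == 0);
    Lanes acc;
    for (std::size_t it = 0; it < length; it += 8) {
        process_chunk(acc, input, it);
    }
    return horizontal_sum(acc);
}
```

In the original loop this would presumably become `concurrency::combinable<Lanes> sum_combine;` with `process_chunk(sum_combine.local(), input_arr, it);` inside the lambda, and the per-thread results merged afterwards via `sum_combine.combine(...)` using a functor that adds two `Lanes` lane by lane. Unaligned loads and stores trade a little speed for not depending on where combinable places its storage.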
c++ simd ppl
vito