Aligned and unaligned loading and storing of SSE vectors - how to reduce code duplication?

I often have to write two implementations of a function that uses SSE instructions, because the input and output buffers can have either aligned or unaligned addresses:

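    #include <cstddef>     // size_t
    #include <xmmintrin.h> // SSE intrinsics

    void some_function_aligned(const float * src, size_t size, float * dst)
    {
        for(size_t i = 0; i < size; i += 4)
        {
            __m128 a = _mm_load_ps(src + i);  // requires a 16-byte aligned address
            // do something...
            _mm_store_ps(dst + i, a);
        }
    }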

and

    void some_function_unaligned(const float * src, size_t size, float * dst)
    {
        for(size_t i = 0; i < size; i += 4)
        {
            __m128 a = _mm_loadu_ps(src + i);  // works with any address
            // do something...
            _mm_storeu_ps(dst + i, a);
        }
    }

So the question is: how can I reduce the code duplication, given that these functions are almost identical?

1 answer

One solution to this problem is widely used in this library: http://simd.sourceforge.net/. It is based on explicitly specializing function templates for loading and storing SSE vectors:

    template <bool align> __m128 load(const float * p);

    template <> inline __m128 load<false>(const float * p)
    {
        return _mm_loadu_ps(p);
    }

    template <> inline __m128 load<true>(const float * p)
    {
        return _mm_load_ps(p);
    }

    template <bool align> void store(float * p, __m128 a);

    template <> inline void store<false>(float * p, __m128 a)
    {
        _mm_storeu_ps(p, a);
    }

    template <> inline void store<true>(float * p, __m128 a)
    {
        _mm_store_ps(p, a);
    }

Now a single implementation of the function, written as a template, is enough:

    template <bool align> void some_function(const float * src, size_t size, float * dst)
    {
        for(size_t i = 0; i < size; i += 4)
        {
            __m128 a = load<align>(src + i);
            // do something...
            store<align>(dst + i, a);
        }
    }
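The caller still has to decide which instantiation to use. The following is a minimal sketch of one way to do that at run time, not something taken from the library above. The names `is_aligned16`, `scale_by_two_impl` and `scale_by_two` are hypothetical, the "do something" step is replaced by a multiplication by two purely for illustration, and `size` is assumed to be a multiple of 4, as in the loops above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h> // SSE intrinsics

template <bool align> __m128 load(const float * p);
template <> inline __m128 load<false>(const float * p) { return _mm_loadu_ps(p); }
template <> inline __m128 load<true>(const float * p) { return _mm_load_ps(p); }

template <bool align> void store(float * p, __m128 a);
template <> inline void store<false>(float * p, __m128 a) { _mm_storeu_ps(p, a); }
template <> inline void store<true>(float * p, __m128 a) { _mm_store_ps(p, a); }

// Hypothetical helper: true if the address is a multiple of 16 bytes.
inline bool is_aligned16(const void * p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Illustrative kernel: multiplies every element by two.
// Assumption: size is a multiple of 4 (no scalar tail handling).
template <bool align>
void scale_by_two_impl(const float * src, std::size_t size, float * dst)
{
    assert(size % 4 == 0);
    const __m128 two = _mm_set1_ps(2.0f);
    for (std::size_t i = 0; i < size; i += 4)
    {
        __m128 a = load<align>(src + i);
        store<align>(dst + i, _mm_mul_ps(a, two));
    }
}

// Hypothetical dispatcher: takes the aligned path only when both
// buffers are 16-byte aligned, otherwise falls back to the unaligned one.
inline void scale_by_two(const float * src, std::size_t size, float * dst)
{
    if (is_aligned16(src) && is_aligned16(dst))
        scale_by_two_impl<true>(src, size, dst);
    else
        scale_by_two_impl<false>(src, size, dst);
}
```

Because `align` is a compile-time constant, each instantiation contains only the matching load and store intrinsics, so the alignment check happens once per call rather than once per loop iteration.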