I often have to write two implementations of a function that uses SSE instructions, because the input and output buffers may have either aligned or unaligned addresses:
void some_function_aligned(const float * src, size_t size, float * dst)
{
    for(size_t i = 0; i < size; i += 4)
    {
        __m128 a = _mm_load_ps(src + i);  // aligned load: src + i must be 16-byte aligned
        // ... (rest of the loop body is not shown)
    }
}
and
void some_function_unaligned(const float * src, size_t size, float * dst)
{
    for(size_t i = 0; i < size; i += 4)
    {
        __m128 a = _mm_loadu_ps(src + i);  // unaligned load: works for any address
        // ... (rest of the loop body is not shown)
    }
}
This raises the question: how can I reduce the code duplication, given that the two functions are almost identical?
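To make the duplication concrete, here is a minimal sketch of one possible direction: a single template body parameterized on a compile-time `bool`, with the load and store intrinsics selected by specialization. The names `load`, `store`, `some_function_impl`, `is_aligned16`, and the dispatching `some_function` are hypothetical, and the doubling operation is only a placeholder for the part of the loop body that is not shown above. As in the original loops, `size` is assumed to be a multiple of 4.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_loadu_ps, ...
#include <cstddef>
#include <cstdint>

// Hypothetical helpers: choose the load intrinsic at compile time.
template <bool aligned> inline __m128 load(const float* p);
template <> inline __m128 load<true>(const float* p)  { return _mm_load_ps(p); }
template <> inline __m128 load<false>(const float* p) { return _mm_loadu_ps(p); }

// Same idea for stores.
template <bool aligned> inline void store(float* p, __m128 v);
template <> inline void store<true>(float* p, __m128 v)  { _mm_store_ps(p, v); }
template <> inline void store<false>(float* p, __m128 v) { _mm_storeu_ps(p, v); }

// The loop body is written once; the compiler instantiates both variants.
template <bool aligned>
void some_function_impl(const float* src, size_t size, float* dst)
{
    for (size_t i = 0; i < size; i += 4)  // assumes size is a multiple of 4
    {
        __m128 a = load<aligned>(src + i);
        // Placeholder operation (an assumption): double each element.
        store<aligned>(dst + i, _mm_add_ps(a, a));
    }
}

// Returns true if the pointer is aligned to a 16-byte boundary.
inline bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Single entry point: use the aligned variant only when both buffers allow it.
void some_function(const float* src, size_t size, float* dst)
{
    if (is_aligned16(src) && is_aligned16(dst))
        some_function_impl<true>(src, size, dst);
    else
        some_function_impl<false>(src, size, dst);
}
```

With this layout the alignment check happens once per call rather than once per iteration, and the two instantiations differ only in which intrinsics they use.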
user4792273