I tried to optimize the following code (sum of squared differences for two arrays):
inline float Square(float value)
{
return value*value;
}
float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
float sum = 0;
for(size_t i = 0; i < size; ++i)
sum += Square(a[i] - b[i]);
return sum;
}
So, I performed the optimization using the SSE CPU instructions:
inline void SquaredDifferenceSum(const float * a, const float * b, size_t i, __m128 & sum)
{
__m128 _a = _mm_loadu_ps(a + i);
__m128 _b = _mm_loadu_ps(b + i);
__m128 _d = _mm_sub_ps(_a, _b);
sum = _mm_add_ps(sum, _mm_mul_ps(_d, _d));
}
inline float ExtractSum(__m128 a)
{
float _a[4];
_mm_storeu_ps(_a, a);
return _a[0] + _a[1] + _a[2] + _a[3];
}
float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
size_t i = 0, alignedSize = size/4*4;
__m128 sums = _mm_setzero_ps();
for(; i < alignedSize; i += 4)
SquaredDifferenceSum(a, b, i, sums);
float sum = ExtractSum(sums);
for(; i < size; ++i)
sum += Square(a[i] - b[i]);
return sum;
}
This code works fine if the size of the arrays is not too large. But if the size is large enough, then there is a large computational error between the results given by the base function and its optimized version. And so I have a question: where is the error in the optimized SSE code, which leads to a calculation error.
user5243211
source
share