Optimizing a position-swapping loop

I have a loop that swaps the positions of elements in an array. I simplified and reduced the problem to the following:

for (int x=0;x<w/2;++x) {
    int il =     x;
    int ir = w-1-x;
    type_copy l = data[il];
    type_copy r = data[ir];
    data[il] = r;
    data[ir] = l;
}

This code does the job, but it is rather slow. First, it cannot be auto-vectorized, since the array accesses are non-contiguous. Second, the accesses on the right side run backwards, which is far from ideal for cache prefetching. Finally, there is probably some stalling, because the loads for the next iteration cannot start before the data from the previous one has been committed, since the compiler probably cannot tell that the self-aliased pointer never hits itself.

In my case, sizeof(type_copy) is either 4*sizeof(uint8_t) = 4 or 4*sizeof(float) = 4*4 = 16. Therefore, note that reversing the data byte by byte is not acceptable.

My question is: how can this code be optimized, if it can be at all?

+4
4 answers

Assuming your data types look like this:

struct float_data
{
    float f1;
    float f2;
    float f3;
    float f4;
};

struct uint8_t_data
{
    uint8_t f1;
    uint8_t f2;
    uint8_t f3;
    uint8_t f4;
};

You can try SSE intrinsics. For uint8_t_data a good speedup is observed:

typedef uint8_t_data type_copy;

for (int x = 0; x < w / 2; x += 4) 
{
    int il = x;
    int ir = w - 1 - x - 3;   // leftmost index of the 4 rightmost elements

    __m128i dl = _mm_loadu_si128((const __m128i*)&data[il]);
    __m128i dr = _mm_loadu_si128((const __m128i*)&data[ir]);
    // _MM_SHUFFLE(0, 1, 2, 3) reverses the four 32-bit elements within the vector
    _mm_storeu_si128((__m128i*)&data[ir], _mm_shuffle_epi32(dl, _MM_SHUFFLE(0, 1, 2, 3)));
    _mm_storeu_si128((__m128i*)&data[il], _mm_shuffle_epi32(dr, _MM_SHUFFLE(0, 1, 2, 3)));
}

Output:

g++ -O3 non vectorized: 16ms
g++ -O3 vectorized: 5ms

However, for float_data there is not much speedup:

typedef float_data type_copy;

for (int x = 0; x < w / 2; x += 2) {
    int il = x;
    int ir = w - 1 - x - 1;   // leftmost index of the 2 rightmost elements

    __m256 dl = _mm256_loadu_ps((const float*)&data[il]);
    __m256 dr = _mm256_loadu_ps((const float*)&data[ir]);

    // swapping the two 128-bit lanes reverses the two 16-byte elements
    _mm256_storeu_ps((float*)&data[ir], _mm256_permute2f128_ps(dl, dl, 1));
    _mm256_storeu_ps((float*)&data[il], _mm256_permute2f128_ps(dr, dr, 1));
}

Output:

g++ -O3 -mavx non vectorized: 27ms
g++ -O3 -msse4.2 non vectorized: 25ms
g++ -O3 -mavx vectorized: 24ms
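
Note that both vectorized loops assume w / 2 is a multiple of the step (4 for uint8_t_data, 2 for float_data); otherwise the last vector iteration crosses the middle of the array. A minimal sketch of how a general size could be handled (step here is a hypothetical constant standing for 4 or 2, and type_copy is the same typedef as above):

int x = 0;
for (; x + step <= w / 2; x += step)
{
    // ... vectorized body from above goes here ...
}
// finish the remaining middle pairs with plain scalar swaps
for (; x < w / 2; ++x)
{
    type_copy tmp = data[x];
    data[x] = data[w - 1 - x];
    data[w - 1 - x] = tmp;
}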
+1

The reason your code cannot be parallelized properly is that there is a dependency between the data:

for (int x=0;x<w/2;++x) {
   int il =     x;
   int ir = w-1-x;
   type_copy l = data[il];
   type_copy r = data[ir];
   data[il] = r;
   data[ir] = l;
}

There are 3 steps: compute the l/r indexes, read from the array, write to the array. Each of them depends on the result of the previous one, so they cannot be executed in parallel. Note that I am grouping the left and right operations under the same category.

The way around this is to break the dependency chain, so that the CPU can overlap independent operations.

Note that in the version below, iteration N stores the data that was loaded during iteration N-1; this way the loads for the next pair no longer have to wait for the stores of the current one.

int il =   0;
int ir = w-1;
type_copy l = data[il];
type_copy r = data[ir];

for (int x=0;x<w/2;++x) {
   data[il] = r;
   data[ir] = l;
   il = x + 1;        // indices of the next pair
   ir = w - 2 - x;
   l = data[il];
   r = data[ir];       
}

You can take this one step further and also prepare the indices of the next pair in advance:

int il_0 =   0;
int ir_0 = w-1;
int il_1 =   1;
int ir_1 = w-2;
type_copy l = data[il_0];
type_copy r = data[ir_0];

for (int x=0;x<w/2;++x) {
   data[il_0] = r;
   data[ir_0] = l;       
   l = data[il_1];
   r = data[ir_1];
   il_0 = il_1;
   ir_0 = ir_1;       
   ++il_1; 
   --ir_1;
}

Whether this actually helps depends on the hardware; you can also try processing 2, 4, ... pairs per iteration, and so on, to expose more instruction-level parallelism.
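
As an illustration, a minimal sketch of a version that processes two independent pairs per iteration (assuming w / 2 is even; otherwise one pair is left over for a scalar tail):

// the two pairs carry no data dependency on each other,
// so their loads and stores are free to overlap
for (int x = 0; x + 2 <= w / 2; x += 2)
{
    type_copy l0 = data[x];
    type_copy r0 = data[w - 1 - x];
    type_copy l1 = data[x + 1];
    type_copy r1 = data[w - 2 - x];

    data[x]         = r0;
    data[w - 1 - x] = l0;
    data[x + 1]     = r1;
    data[w - 2 - x] = l1;
}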

+1

Personally, I would simply write it as:

for (int x=0;x<w/2;++x) {
    std::swap(data[x], data[w-1-x]);
}

Compared to your version, the swap boils down to just 3 statements per pair, equivalent to:

for (int x=0;x<w/2;++x) {
    type_copy l = data[x];
    data[x] = data[w-1-x];
    data[w-1-x] = l;
}

0

The code can certainly be optimized, but the solution may be platform dependent. For example, AMD64 inherits a number of useful instructions from x86 SSE, including PSHUFD (or VPPERM), which can arbitrarily reorder the 32-bit words of a vector, and MASKMOVDQU, which performs a byte-masked (partial) store.
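
For instance, PSHUFD is available through the SSE2 intrinsic _mm_shuffle_epi32 and can reverse the four 32-bit words of a vector in a single instruction; a minimal sketch, independent of the code above:

#include <emmintrin.h>  // SSE2

// reverse the order of the four 32-bit words in a vector with PSHUFD;
// _MM_SHUFFLE(0, 1, 2, 3) selects the source dwords in reverse order
static inline __m128i reverse_dwords(__m128i v)
{
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
}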

0
