Addressing a non-integer address with SSE

I am trying to speed up my code using SSE, and the following code works well. Basically, a __m128 variable should point to 4 floats in a row, in order to do 4 operations at once.

This code is equivalent to computing c[i]=a[i]+b[i] with i from 0 to 3.

    float *data1, *data2, *data3;
    // ... code ... allocating data1-2-3 which are very long.
    __m128* a = (__m128*) (data1);
    __m128* b = (__m128*) (data2);
    __m128* c = (__m128*) (data3);
    *c = _mm_add_ps(*a, *b);

However, when I shift the data I use by one element (see below) in order to compute c[i]=a[i+1]+b[i] with i from 0 to 3, it crashes at runtime.

    __m128* a = (__m128*) (data1+1); // <-- +1
    __m128* b = (__m128*) (data2);
    __m128* c = (__m128*) (data3);
    *c = _mm_add_ps(*a, *b);

I suppose this is because __m128 is 128 bits wide while my float data are 32 bits each. So it may be impossible for a pointer to a 128-bit type to point at an address that is not a multiple of 128 bits.

In any case, do you know what the problem is and how I can get around this?

0
c++ c pointers sse
3 answers

Instead of using implicit aligned loads/stores like this:

    __m128* a = (__m128*) (data1+1); // <-- +1
    __m128* b = (__m128*) (data2);
    __m128* c = (__m128*) (data3);
    *c = _mm_add_ps(*a, *b);

use explicit aligned/unaligned loads/stores as appropriate, e.g.:

    __m128 va = _mm_loadu_ps(data1+1); // <-- +1 (NB: use unaligned load)
    __m128 vb = _mm_load_ps(data2);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_store_ps(data3, vc);

It is the same amount of code (i.e. the same number of instructions), but it won't crash, and you have explicit control over which loads/stores are aligned and which are unaligned.

Note that recent CPUs have relatively small penalties for unaligned loads, but on older CPUs there can be a 2x or greater hit.
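For a whole array rather than a single group of 4 floats, the same idea could be wrapped in a loop. The following is only a minimal sketch: the function name add_shifted and its parameters are made up for illustration, and it uses unaligned loads/stores everywhere so that it makes no assumption about how the buffers were allocated.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper (not from the question):
   computes c[i] = a[i + 1] + b[i] for i in [0, n).
   a must hold at least n + 1 floats; b and c at least n floats. */
void add_shifted(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;

    /* Vector part: 4 floats per iteration, unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i + 1);  /* a + i + 1 is generally not 16-byte aligned */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        c[i] = a[i + 1] + b[i];
}
```

If data2 and data3 are known to be 16-byte aligned, the loads from b and the stores to c could be switched back to _mm_load_ps / _mm_store_ps, as in the snippet above.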

+5

Your problem is that a ends up pointing to something that is not a __m128; it points to something that contains the last 96 bits of one __m128 plus 32 bits from outside it, which can be anything. It may be the first 32 bits of the next __m128, but eventually, when you get to the last __m128 in the same memory block, it will be something else. It may be reserved memory that you cannot access, hence the crash.
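One way to see that the shifted pointer no longer lines up with a __m128 boundary is to test the address directly. This is just an illustrative sketch: the helper name is_aligned_16 is invented here, and it relies on the fact that dereferencing a __m128* (like _mm_load_ps) expects a 16-byte-aligned address.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if p is a multiple of 16 bytes
   (the alignment an aligned SSE load/store expects), 0 otherwise. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

With a 16-byte-aligned data1, is_aligned_16(data1) holds, but data1+1 is only 4 bytes further along, so is_aligned_16(data1+1) does not; data1+4 is aligned again.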

+1

I'm not very familiar with SSE, but I think you could make a local copy (or some other copy) of the data that is properly aligned to 128 bits and contains the 4 floats starting at data1 + 1.
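A rough sketch of that idea, assuming a C11 compiler (for _Alignas); the helper name load_via_aligned_copy is made up for illustration:

```c
#include <string.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps */

/* Hypothetical helper: reads 4 floats from a possibly unaligned address
   by copying them into a 16-byte-aligned local buffer first. */
__m128 load_via_aligned_copy(const float *src)
{
    _Alignas(16) float tmp[4];      /* local copy, aligned to 16 bytes */
    memcpy(tmp, src, sizeof tmp);   /* copy the 4 floats starting at src */
    return _mm_load_ps(tmp);        /* aligned load is safe on tmp */
}
```

It would be called as load_via_aligned_copy(data1 + 1). The extra copy costs something, so this is mostly useful where an unaligned load is not an option.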

Hope this helps, Razvan.

0
