Why is my data not aligned?

I am trying to figure out the best way to precompute some sine and cosine values, store them in aligned blocks, and use them later in SSE calculations.

At the beginning of my program, I create an object with a member:

static __m128 *m_sincos; 

then I initialize this member in the constructor:

    m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
    for (int t=0; t<Bins; t++)
        m_sincos[t] = _mm_set_ps(cos(t), sin(t), sin(t), cos(t));



When I go to use m_sincos, I encounter three problems:
- The data does not seem to be aligned

    movaps xmm0, m_sincos[t]   // crashes
    movups xmm0, m_sincos[t]   // does not crash

- Variables do not seem to be correct

    movaps result, xmm0   // returns values that are not what is in m_sincos[t]
                          // although putting a watch on m_sincos[t] displays the correct values

- What really bothers me is that the following makes everything work (but too slowly):

    __m128 _sincos = m_sincos[t];
    movaps xmm0, _sincos
    movaps result, xmm0
2 answers

m_sincos[t] is a C expression. Inside an assembly instruction (__asm?), however, it is interpreted as an x86 addressing mode, with a completely different result. For example, VS2008 SP1 compiles:

 movaps xmm0, m_sincos[t] 

into (see the Disassembly window when the application crashes in debug mode):

 movaps xmm0, xmmword ptr [t] 

This interpretation attempts to copy the 128-bit value stored at the address of the variable t into xmm0. t, however, is a 32-bit value that is unlikely to sit at a 16-byte-aligned address. Executing the instruction is therefore likely to raise an alignment fault, and in the odd case where the address of t does happen to be aligned, it silently produces incorrect results.
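The alignment requirement can be checked from C++ without any assembly. The following is only a minimal sketch: is_aligned16 and table_element_is_aligned are hypothetical helpers that do not appear in the question, and _mm_malloc/_mm_free are swapped in for _aligned_malloc on the assumption that the toolchain exposes them through <xmmintrin.h> (GCC, Clang, and MSVC do).

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // __m128, _mm_malloc, _mm_free

// Hypothetical helper: true if p lies on a 16-byte boundary,
// which is what movaps requires of its memory operand.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: allocates a table the way the question does
// (16-byte aligned) and reports whether element t is a legal movaps operand.
inline bool table_element_is_aligned(int bins, int t) {
    assert(bins > 0 && t >= 0 && t < bins);
    __m128* table = static_cast<__m128*>(_mm_malloc(bins * sizeof(__m128), 16));
    if (!table) return false;
    const bool ok = is_aligned16(&table[t]);  // base is aligned, stride is 16 bytes
    _mm_free(table);
    return ok;
}
```

Every element of the table passes the check because the base is 16-byte aligned and each element is 16 bytes long. An int such as t is only guaranteed 4-byte alignment, so its address usually fails it.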

You can fix this using the appropriate x86 addressing mode. Here's a slow but understandable version:

    __asm mov eax, m_sincos                   ; eax <- m_sincos
    __asm mov ebx, dword ptr t
    __asm shl ebx, 4                          ; ebx <- t * 16 ; each array element is 16 bytes (128 bits) long
    __asm movaps xmm0, xmmword ptr [eax+ebx]  ; xmm0 <- m_sincos[t]
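For comparison, the same address arithmetic can be written with an intrinsic instead of inline assembly. This is a sketch rather than the code from the question: load_element is a hypothetical name, and the explicit byte arithmetic is spelled out only to mirror the shl/add in the assembly (plain base[t] would do the same thing).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // __m128, _mm_load_ps

// Hypothetical helper mirroring the asm: address = base + (t << 4),
// followed by an aligned 16-byte load.
inline __m128 load_element(const __m128* base, int t) {
    const char* bytes = reinterpret_cast<const char*>(base);
    const char* addr = bytes + (static_cast<std::size_t>(t) << 4);  // t * 16
    // _mm_load_ps has the same 16-byte alignment requirement as movaps.
    assert((reinterpret_cast<std::uintptr_t>(addr) & 15u) == 0);
    return _mm_load_ps(reinterpret_cast<const float*>(addr));
}
```

With this form the compiler chooses the registers and the addressing mode itself, so the C expression and the generated load cannot disagree about what is being read.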

Sidenote:

When I put this into a complete program, something strange comes up:

    #include <math.h>
    #include <tchar.h>
    #include <xmmintrin.h>

    int main()
    {
        static __m128 *m_sincos;
        int Bins = 4;

        m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
        for (int t=0; t<Bins; t++) {
            m_sincos[t] = _mm_set_ps(cos((float) t), sin((float) t), sin((float) t), cos((float) t));
            __asm movaps xmm0, m_sincos[t];
            __asm mov eax, m_sincos
            __asm mov ebx, t
            __asm shl ebx, 4
            __asm movaps xmm0, [eax+ebx];
        }

        return 0;
    }

When you run this and watch the Registers window, you may notice something odd: although the results are correct, xmm0 already holds the correct value before the movaps executes. How does this happen?

A look at the generated assembly code shows that _mm_set_ps() loads the sin/cos results into xmm0 and then stores them to the memory address of m_sincos[t]. But the value stays in xmm0 as well. _mm_set_ps is an intrinsic, not a function call; it does not restore the registers it uses once it is done.

If there is a lesson to be learned from this, it may be: if you use the SSE intrinsics, use them everywhere, so the compiler can optimize things for you. Otherwise, if you use inline assembly, use it everywhere too.
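As an illustration of the intrinsics-everywhere route, here is a minimal sketch of the table from the question with no inline assembly at all. The names make_sincos_table and scale_by_entry are hypothetical, the lane-wise multiply stands in for whatever the real calculation is, and _mm_malloc/_mm_free are used in place of _aligned_malloc on the assumption that <xmmintrin.h> provides them.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>  // __m128, _mm_set_ps, _mm_mul_ps, _mm_malloc, _mm_free

// Hypothetical helper: builds the table from the question.
// The caller releases it with _mm_free.
inline __m128* make_sincos_table(int bins) {
    assert(bins > 0);
    __m128* table = static_cast<__m128*>(_mm_malloc(bins * sizeof(__m128), 16));
    if (!table) return nullptr;
    for (int t = 0; t < bins; ++t) {
        const float s = std::sin(static_cast<float>(t));
        const float c = std::cos(static_cast<float>(t));
        // _mm_set_ps takes the highest lane first, so the lanes
        // from low to high are c, s, s, c.
        table[t] = _mm_set_ps(c, s, s, c);
    }
    return table;
}

// Hypothetical consumer: multiplies v lane-wise by table entry t.
// The compiler is free to keep the entry in a register or fold the load.
inline __m128 scale_by_entry(const __m128* table, int t, __m128 v) {
    return _mm_mul_ps(table[t], v);
}
```

Because the load of table[t] is an ordinary C++ expression here, there is no extra copy into a local and no hand-chosen register for the compiler to work around.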


You should always use intrinsics, or even just turn SSE on in the compiler and leave it at that, rather than coding it explicitly. The reason is that __asm is not portable to 64-bit code.


Source: https://habr.com/ru/post/1311875/

