m_sincos[t] is a C expression. Inside an assembly instruction (__asm), however, it is interpreted as an x86 addressing mode, with a completely different result. For example, VS2008 SP1 compiles:
movaps xmm0, m_sincos[t]
into: (see the disassembly window when the application crashes in debug mode)
movaps xmm0, xmmword ptr [t]
This interpretation tries to copy the 128-bit value stored at the address of the variable t into xmm0. t, however, is a 32-bit value whose address is most likely not 16-byte aligned. Executing the instruction will typically cause an alignment fault, and in the odd case where the address of t happens to be aligned, it silently gives you incorrect results.
You can fix this using the appropriate x86 addressing mode. Here's a slow but understandable version:
__asm mov eax, m_sincos                  ; eax <- m_sincos
__asm mov ebx, dword ptr t
__asm shl ebx, 4                         ; ebx <- t * 16 ; each array element is 16-bytes (128 bit) long
__asm movaps xmm0, xmmword ptr [eax+ebx] ; xmm0 <- m_sincos[t]
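To make the address arithmetic in that snippet explicit, here is a minimal C sketch of the same computation. The helper name element_address is made up for illustration and is not part of the original code; it only shows that base + (t << 4) is the address C computes for m_sincos[t], and that the result has to be 16-byte aligned for movaps.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical helper (not from the original code): computes the address of
   element t the same way the asm does, base + (t << 4), because
   sizeof(__m128) is 16 bytes. */
static const __m128 *element_address(const __m128 *base, int t)
{
    uintptr_t addr = (uintptr_t)base + ((uintptr_t)t << 4);

    /* movaps needs a 16-byte aligned memory operand. */
    assert((addr & 15u) == 0);

    return (const __m128 *)addr;
}
```

For an array allocated with 16-byte alignment, element_address(m_sincos, t) should compare equal to &m_sincos[t] for every valid t, which is exactly what [eax+ebx] addresses in the asm version.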
Sidenote:
When I put this into a complete program, something strange comes up:
#include <math.h>
#include <tchar.h>
#include <xmmintrin.h>

int main()
{
    static __m128 *m_sincos;
    int Bins = 4;

    m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
    for (int t=0; t<Bins; t++) {
        m_sincos[t] = _mm_set_ps(cos((float) t), sin((float) t), sin((float) t), cos((float) t));
        __asm movaps xmm0, m_sincos[t];
        __asm mov eax, m_sincos
        __asm mov ebx, t
        __asm shl ebx, 4
        __asm movaps xmm0, [eax+ebx];
    }

    return 0;
}
If you run this and watch the registers window, you may notice that although the results are correct, xmm0 already holds the correct value before the movaps executes. How does this happen?
A look at the generated assembly code shows that _mm_set_ps() loads the sin / cos results into xmm0 and then stores them to the memory address of m_sincos[t]. But the value remains in xmm0 as well. _mm_set_ps is an "intrinsic", not a function call; it does not try to restore the values of the registers it uses once it is done.
If there is a lesson to be learned from this, it might be that if you use the SSE intrinsics, you should use them throughout, so the compiler can optimize things for you. Otherwise, if you use inline assembly, use that throughout as well.
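As a rough illustration of the intrinsics-only route, here is a minimal sketch. The helpers store_entry and load_entry are hypothetical names, not part of the original program, and _mm_load_ps is assumed as the replacement for the hand-written movaps. The compiler is then free to choose the register and the addressing mode itself.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical helper: stores (c, s, s, c) into table[t], using the same
   argument order as the _mm_set_ps call in the program above. */
static void store_entry(__m128 *table, int t, float s, float c)
{
    table[t] = _mm_set_ps(c, s, s, c);
}

/* Hypothetical helper: aligned load of table[t], then copies the four lanes
   into out. Lane 0 corresponds to the LAST argument of _mm_set_ps. */
static void load_entry(const __m128 *table, int t, float out[4])
{
    /* _mm_load_ps requires a 16-byte aligned address, just like movaps. */
    assert(((uintptr_t)&table[t] & 15u) == 0);

    __m128 v = _mm_load_ps((const float *)&table[t]);
    _mm_storeu_ps(out, v); /* unaligned store, so out needs no special alignment */
}
```

In the loop above you would call store_entry(m_sincos, t, sin((float) t), cos((float) t)) and read the value back with load_entry; no __asm block is needed, so there is no addressing-mode syntax to get wrong.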