Using the XMM0 register with memory operands (C++ code) is twice as fast as ASM using only XMM registers - why?

I am trying to write some inline assembler (in Visual Studio 2012 C++ code) to take advantage of SSE. I want to add 7 numbers at a time, so I load them from RAM into the xmm0 to xmm6 processor registers. This is what I do with inline assembly in Visual Studio 2012:

C++ code:

for(int i=0;i<count;i++) resVal+=val1+val2+val3+val4+val5+val6+val7; 

my ASM code:

    int count = 1000000000;
    double resVal = 0.0;

    // Place the values in registers
    __asm {
        movsd xmm0, val1    ; place val1 in xmm0
        movsd xmm1, val2
        movsd xmm2, val3
        movsd xmm3, val4
        movsd xmm4, val5
        movsd xmm5, val6
        movsd xmm6, val7
        pxor  xmm7, xmm7    ; zero xmm7
    }

    for (int i = 0; i < count; i++)
    {
        __asm {
            addsd xmm7, xmm0    ; += val1
            addsd xmm7, xmm1    ; += val2
            addsd xmm7, xmm2
            addsd xmm7, xmm3
            addsd xmm7, xmm4
            addsd xmm7, xmm5
            addsd xmm7, xmm6    ; += val7
        }
    }

    __asm {
        movsd resVal, xmm7    ; store xmm7 into resVal
    }

and this is the code the C++ compiler generates for 'resVal += val1 + val2 + val3 + val4 + val5 + val6 + val7':

    movsd xmm0, mmword ptr [val1]
    addsd xmm0, mmword ptr [val2]
    addsd xmm0, mmword ptr [val3]
    addsd xmm0, mmword ptr [val4]
    addsd xmm0, mmword ptr [val5]
    addsd xmm0, mmword ptr [val6]
    addsd xmm0, mmword ptr [val7]
    addsd xmm0, mmword ptr [resVal]
    movsd mmword ptr [resVal], xmm0

As you can see, the compiler uses only one register, xmm0, and otherwise fetches the values from RAM.

Both versions (my ASM code and the C++ code) give the same answer, but the C++ code takes about half the time of my ASM code to execute!

I have read that working with CPU registers is much faster than working with memory, so I did not expect this ratio. Why does the ASM version perform worse than the C++ code?

2 answers
  • Once the data is in the cache (which will happen after the first iteration, if it is not there already), it makes little difference whether an operand comes from memory or from a register.
  • A floating-point add takes a little more than one cycle to complete, so back-to-back dependent adds cannot complete every cycle.
  • The final store to resVal "unties" the xmm0 register, so the register can be freely "renamed", which allows several iterations of the loop to run in parallel.

This is a typical case of "unless you are absolutely sure, leave the writing of the code to the compiler."

The last point above explains why the compiler's code is faster than your code, in which each step of the loop depends on the previously calculated result.

In the compiler-generated code, the loop can execute the equivalent of:

    movsd xmm0, mmword ptr [val1]
    addsd xmm0, mmword ptr [val2]
    addsd xmm0, mmword ptr [val3]
    addsd xmm0, mmword ptr [val4]
    addsd xmm0, mmword ptr [val5]
    addsd xmm0, mmword ptr [val6]
    addsd xmm0, mmword ptr [val7]
    addsd xmm0, mmword ptr [resVal]
    movsd mmword ptr [resVal], xmm0

    movsd xmm1, mmword ptr [val1]
    addsd xmm1, mmword ptr [val2]
    addsd xmm1, mmword ptr [val3]
    addsd xmm1, mmword ptr [val4]
    addsd xmm1, mmword ptr [val5]
    addsd xmm1, mmword ptr [val6]
    addsd xmm1, mmword ptr [val7]
    addsd xmm1, mmword ptr [resVal]
    movsd mmword ptr [resVal], xmm1

Now, as you can see, we could "interleave" these two "streams":

    movsd xmm0, mmword ptr [val1]
    movsd xmm1, mmword ptr [val1]
    addsd xmm0, mmword ptr [val2]
    addsd xmm1, mmword ptr [val2]
    addsd xmm0, mmword ptr [val3]
    addsd xmm1, mmword ptr [val3]
    addsd xmm0, mmword ptr [val4]
    addsd xmm1, mmword ptr [val4]
    addsd xmm0, mmword ptr [val5]
    addsd xmm1, mmword ptr [val5]
    addsd xmm0, mmword ptr [val6]
    addsd xmm1, mmword ptr [val6]
    addsd xmm0, mmword ptr [val7]
    addsd xmm1, mmword ptr [val7]
    addsd xmm0, mmword ptr [resVal]
    movsd mmword ptr [resVal], xmm0
    ; Here we have to wait for resVal to be updated!
    addsd xmm1, mmword ptr [resVal]
    movsd mmword ptr [resVal], xmm1

I am not claiming the processor literally does it this way, but you can certainly see how such a loop could execute faster than yours. You might have achieved the same thing in your assembler code if you had had a spare register [in x86-64 you have 8 more XMM registers, although you cannot use the inline assembler in x86-64 builds of Visual C++...].

(Note that register renaming is not the same as my "interleaved" loop, which uses two different registers, but the effect is roughly the same: the loop can continue past the store to resVal without waiting for the result of the update.)


It may be useful for you not to use __asm, but instead intrinsic functions and intrinsic types like __m128i and __m128d, which map to SSE registers. See immintrin.h for the types and the many SSE functions. You can find a good description and specification for them here: http://software.intel.com/sites/landingpage/IntrinsicsGuide/

