- Once the data is in the cache (which will be the case after the first iteration of the loop, if it is not there already), it makes little difference whether you use memory or a register.
- A floating-point add takes a bit more than a single clock cycle in the first place.
- The final store to resVal "unties" the xmm0 register, so the register can be freely "renamed", which allows more of the loop to run in parallel.
This is a typical case of "unless you are absolutely sure, leave writing the code to the compiler."
The last point above explains why the compiler's code is faster than code in which each step of the loop depends on a previously calculated result.
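To make the dependency argument concrete, here is a minimal C++ sketch of the two loop shapes. The function names, the `vals` array and the iteration count are hypothetical and only stand in for val1..val7 and resVal; the first function is my assumption of what a loop with a long dependency chain looks like, not the exact code from the question.

```cpp
#include <cstddef>

// Long chain (assumed shape of the slower loop): every add needs the
// running total produced by the add before it, so nothing can overlap.
double sum_long_chain(const double (&vals)[7], std::size_t iterations) {
    double res = 0.0;
    for (std::size_t i = 0; i < iterations; ++i) {
        for (double v : vals) {
            res += v;  // depends on the previous add
        }
    }
    return res;
}

// Short chain (shape of the compiler-generated loop): the seven values
// are summed into a temporary that does not depend on res at all.
double sum_short_chain(const double (&vals)[7], std::size_t iterations) {
    double res = 0.0;
    for (std::size_t i = 0; i < iterations; ++i) {
        double t = vals[0];
        for (std::size_t k = 1; k < 7; ++k) {
            t += vals[k];  // independent of res
        }
        res += t;  // the only add on the loop-carried chain
    }
    return res;
}
```

Both functions perform the same number of additions; what differs is how many of them sit on the chain carried from one iteration to the next. An optimizing compiler is free to transform either function, so treat this as an illustration of the dependency shape rather than as a benchmark.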
In the compiler-generated code, the loop can do the equivalent of:

    movsd xmm0,mmword ptr [val1]
    addsd xmm0,mmword ptr [val2]
    addsd xmm0,mmword ptr [val3]
    addsd xmm0,mmword ptr [val4]
    addsd xmm0,mmword ptr [val5]
    addsd xmm0,mmword ptr [val6]
    addsd xmm0,mmword ptr [val7]
    addsd xmm0,mmword ptr [resVal]
    movsd mmword ptr [resVal],xmm0

    movsd xmm1,mmword ptr [val1]
    addsd xmm1,mmword ptr [val2]
    addsd xmm1,mmword ptr [val3]
    addsd xmm1,mmword ptr [val4]
    addsd xmm1,mmword ptr [val5]
    addsd xmm1,mmword ptr [val6]
    addsd xmm1,mmword ptr [val7]
    addsd xmm1,mmword ptr [resVal]
    movsd mmword ptr [resVal],xmm1

Now, as you can see, we could “mix” these two “streams”:

    movsd xmm0,mmword ptr [val1]
    movsd xmm1,mmword ptr [val1]
    addsd xmm0,mmword ptr [val2]
    addsd xmm1,mmword ptr [val2]
    addsd xmm0,mmword ptr [val3]
    addsd xmm1,mmword ptr [val3]
    addsd xmm0,mmword ptr [val4]
    addsd xmm1,mmword ptr [val4]
    addsd xmm0,mmword ptr [val5]
    addsd xmm1,mmword ptr [val5]
    addsd xmm0,mmword ptr [val6]
    addsd xmm1,mmword ptr [val6]
    addsd xmm0,mmword ptr [val7]
    addsd xmm1,mmword ptr [val7]
    addsd xmm0,mmword ptr [resVal]
    movsd mmword ptr [resVal],xmm0
    // Here we have to wait for resVal to be updated!
    addsd xmm1,mmword ptr [resVal]
    movsd mmword ptr [resVal],xmm1

I am not suggesting that the processor reorders the instructions exactly like this, but I can certainly see how the compiler's loop can execute faster than your loop. You could probably achieve the same thing in your assembler code if you had a spare register [in x86_64 you have 8 more registers, although you cannot use the inline assembler in x86_64 ...]
(Note that register renaming is not the same thing as my interleaved "streams" loop, which uses two different registers, but the effect is roughly the same: the loop can carry on after it reaches the "resVal" update without having to wait for the result to be written.)
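For readers who prefer source code to assembler, the two interleaved "streams" can be sketched in C++ roughly as follows. This is only an illustration under my own assumptions: `sum_two_streams`, `vals` and `iterations` are hypothetical names, and `t0` and `t1` play the roles of xmm0 and xmm1.

```cpp
#include <cstddef>

// Two loop iterations are handled per pass. t0 and t1 never depend on
// each other, so their adds can overlap; only the two updates of res
// have to happen one after the other.
double sum_two_streams(const double (&vals)[7], std::size_t iterations) {
    double res = 0.0;
    std::size_t i = 0;
    for (; i + 1 < iterations; i += 2) {
        double t0 = vals[0];  // stream 0 (xmm0 in the listing)
        double t1 = vals[0];  // stream 1 (xmm1 in the listing)
        for (std::size_t k = 1; k < 7; ++k) {
            t0 += vals[k];
            t1 += vals[k];
        }
        res += t0;
        res += t1;  // must wait for the previous update of res
    }
    if (i < iterations) {  // leftover pass when the count is odd
        double t = vals[0];
        for (std::size_t k = 1; k < 7; ++k) {
            t += vals[k];
        }
        res += t;
    }
    return res;
}
```

With register renaming the processor can get a similar overlap out of the plain one-register loop by itself, so writing the loop this way by hand is not necessarily a gain.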