I have a simple 32-bit code that calculates the product of an array of 32-bit integers. The inner loop is as follows:
@@loop:
    mov   esi, [ebx]
    mov   [esp], esi
    imul  eax, [esp]
    add   ebx, 4
    dec   edx
    jnz   @@loop
What I'm trying to understand is why the code above runs about 6% faster than the following two versions, which skip the redundant store to and reload from memory:
@@loop:
    mov   esi, [ebx]
    imul  eax, esi
    add   ebx, 4
    dec   edx
    jnz   @@loop
and
@@loop:
    imul  eax, [ebx]
    add   ebx, 4
    dec   edx
    jnz   @@loop
The last two versions run at almost the same speed and, as mentioned, about 6% slower than the first one (165 ms versus 155 ms for 200 million elements).
I tried to manually align the jump target to a border of 16 bytes, but that doesn't make any difference.
I am running this on an Intel i7-4770K, Windows 10 x64.
Note: I know the code could be improved with all kinds of optimizations, but I'm only interested in the performance difference between the snippets above.
Tags: performance, assembly, x86
Asbjørn