Why is the round trip through memory faster than not making the round trip?

I have a simple piece of 32-bit code that calculates the product of an array of 32-bit integers. The inner loop is:

@@loop:
  mov esi,[ebx]
  mov [esp],esi
  imul eax,[esp]
  add ebx, 4
  dec edx
  jnz @@loop

What I'm trying to understand is why the code above runs about 6% faster than either of these two versions, which don't make the redundant round trip through memory:

@@loop:
  mov esi,[ebx]
  imul eax,esi
  add ebx, 4
  dec edx
  jnz @@loop

and

@@loop:
  imul eax,[ebx]
  add ebx, 4
  dec edx
  jnz @@loop

The last two snippets run in virtually the same time and, as mentioned, about 6% slower than the first one (165 ms versus 155 ms for 200 million elements).
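For scale, a back-of-the-envelope conversion (assuming the CPU runs near its 3.5 GHz base clock, no turbo): 155 ms over 200 million iterations is about 0.78 ns, i.e. roughly 2.7 cycles per iteration; 165 ms works out to about 0.83 ns, or roughly 2.9 cycles per iteration.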

I tried manually aligning the jump target to a 16-byte boundary, but that makes no difference.

I am running this on an Intel i7 4770k, Windows 10 x64.

Note: I know the code can be improved with all kinds of optimizations; here I'm only interested in the performance difference between the snippets above.
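For completeness, the measurement boils down to something like the following C sketch. The fill data and the timing call are placeholders; the real benchmark is the hand-written assembly loop shown above inside a 32-bit program.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Placeholder harness: multiply 200 million 32-bit integers into one
   32-bit accumulator and time it. The compiler's generated loop will
   not match the hand-written assembly above; this only shows what is
   being computed and roughly how it is timed. */
int main(void)
{
    const size_t n = 200u * 1000 * 1000;     /* 200 million elements */
    uint32_t *a = malloc(n * sizeof *a);
    if (!a) return 1;

    for (size_t i = 0; i < n; i++)
        a[i] = (uint32_t)i | 1u;             /* arbitrary fill data */

    clock_t t0 = clock();

    uint32_t prod = 1;                       /* plays the role of eax */
    for (size_t i = 0; i < n; i++)           /* the inner loop under test */
        prod *= a[i];

    clock_t t1 = clock();

    printf("product = %u, time = %.0f ms\n",
           (unsigned)prod, 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}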

performance assembly x86
1 answer

I suspect, but can't be sure, that you are avoiding a stall on a data dependency:

The code is as follows:

@@loop:
  mov esi,[ebx]      # (1)  load the memory location into the esi register
 (mov [esp],esi)     # (1)  optionally store the value on the stack
  imul eax,[esp]     # (3)  perform the multiplication
  add ebx, 4         # (1)  advance the pointer by 4
  dec edx            # (1)  decrement the counter
  jnz @@loop         # (0**) loop

The numbers in parentheses are the latencies of the instructions ... and the jump counts as 0 if the branch predictor guesses correctly (which it will, since the branch is taken almost every time).

So: while the multiplication is still in flight (3 cycles), we come back around the loop after 2 and try to issue the next load, and we have to stall. Alternatively, we could do a store ... which can happen at the same time as our multiplication, and then we don't stall at all.

What about the dummy store, you ask? Why does that work? Notice that you are storing the critical value, the one we multiply by, to memory. So the processor can use the value as it is being stored to memory (store-to-load forwarding, in effect) rather than going back through the register.

So why can't the processor just do this on its own? The processor cannot make more memory accesses than you asked for, or it could interfere with multi-processor programs (imagine the cache line you write to is shared and you had to invalidate it on every other processor on each iteration just by writing to it ... ouch!).

All of this is pure speculation, but it seems to fit the available evidence (your code and what I know of the Intel architecture ... and of x86 assembly). Hopefully someone can point out where I've got it wrong.

