Unexpected slowdown from inserting nop in a loop and reading from a movnti store

I can't understand why the first loop runs at about 1 cycle per iteration while the second takes 2 cycles per iteration. I measured with Agner Fog's test tool and with perf. IACA and my own theoretical calculation both predict 1 cycle per iteration.

This loop takes 1 cycle per iteration:

; array is array defined in section data
%define n 1000000
xor rcx, rcx   

.begin:
    movnti [array], eax
    add rcx, 1 
    cmp rcx, n
    jle .begin

And this one takes 2 cycles per iteration, but why?

; array is array defined in section data
%define n 1000000
xor rcx, rcx   

.begin:
    movnti [array], eax
    nop
    add rcx, 1 
    cmp rcx, n
    jle .begin

This final version takes about 27 cycles per iteration. But why? After all, there is no dependency chain.

.begin:
    movnti [array], eax
    mov rbx, [array+16]
    add rcx, 1 
    cmp rcx, n
    jle .begin

My processor is IvyBridge.

1 answer

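movnti is 2 uops on IvyBridge, according to Agner Fog's tables.

So the first loop is 4 fused-domain uops, and it can issue at one iteration per cycle.

The nop is a fifth fused-domain uop (even though it needs no execution port, so it is zero unfused-domain uops). That means the front end can issue the loop only once every 2 cycles.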
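
The uop counting above can be sketched as a small calculation. This is a simplified model, not a simulator: it assumes a 4-wide issue stage, assumes cmp and jle macro-fuse into one uop, and assumes an issue group never spans two iterations of the loop. The names `UOPS`, `ISSUE_WIDTH`, and `cycles_per_iteration` are made up for this illustration.

```python
import math

# Assumed issue width: fused-domain uops the front end can issue per cycle.
ISSUE_WIDTH = 4

# Fused-domain uop counts per instruction. movnti = 2 is the figure quoted
# from Agner Fog's tables; cmp+jle is assumed to macro-fuse into one uop.
UOPS = {"movnti": 2, "nop": 1, "add": 1, "cmp+jle": 1}

def cycles_per_iteration(instructions):
    """Front-end bound only: whole issue groups needed per loop iteration."""
    total = sum(UOPS[name] for name in instructions)
    return math.ceil(total / ISSUE_WIDTH)

loop1 = ["movnti", "add", "cmp+jle"]          # 4 fused-domain uops
loop2 = ["movnti", "nop", "add", "cmp+jle"]   # 5 fused-domain uops

print(cycles_per_iteration(loop1))  # 1
print(cycles_per_iteration(loop2))  # 2
```

The model covers only the front-end limit, so it says nothing about the third loop, where the bottleneck is memory rather than issue bandwidth.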

See the tag wiki for more links about how CPUs work.


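The third loop is probably slow because mov rbx, [array+16] is probably loading from the same cache line that movnti is storing to. A non-temporal store bypasses the cache and evicts the line if it is present, so each load likely has to wait for the line to be fetched again. (The load does not overlap the 4 bytes written by movnti, so it cannot be satisfied by store forwarding either.)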
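
Whether the load and the movnti store really touch the same cache line depends on where the assembler and linker place array. A minimal check, assuming 64-byte cache lines and using a hypothetical address for array (the real address is not given in the question):

```python
CACHE_LINE = 64  # bytes; assumed line size

def same_cache_line(store_addr, load_addr, line=CACHE_LINE):
    """True if both byte addresses fall inside the same cache line."""
    return store_addr // line == load_addr // line

array = 0x601000  # hypothetical 64-byte-aligned address of `array`

print(same_cache_line(array, array + 16))  # True: load shares the stored line
print(same_cache_line(array, array + 64))  # False: load is in the next line
```

If array happened to start more than 47 bytes into a line, array+16 would land in the next line instead. Changing the load offset to 64 or more in the real loop is one way to test whether the shared line explains the slowdown.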
