Unexpected slowdown from inserting nop in a loop and reading from a movnti store

Question

I can't understand why the first loop takes about 1 cycle per iteration while the second takes 2 cycles per iteration. I measured with Agner Fog's test tool and with perf. According to IACA, and by my own theoretical calculation, it should take 1 cycle.

This takes 1 cycle per iteration:

; array is array defined in section data
%define n 1000000
xor rcx, rcx   

.begin:
    movnti [array], eax
    add rcx, 1 
    cmp rcx, n
    jle .begin

And this takes 2 cycles per iteration, but why?

; array is array defined in section data
%define n 1000000
xor rcx, rcx   

.begin:
    movnti [array], eax
    nop
    add rcx, 1 
    cmp rcx, n
    jle .begin

This final version takes ~ 27 cycles per iteration. But why? After all, there is no dependency chain.

.begin:
    movnti [array], eax
    mov rbx, [array+16]
    add rcx, 1 
    cmp rcx, n
    jle .begin

My processor is IvyBridge.

performance optimization x86 micro-optimization

user6262188 May 08 '16 at 15:40

1 answer

Peter Cordes · Accepted Answer · 2016-05-08T18:22:39+0000

movnti is 2 uops, according to Agner Fog's instruction tables for IvyBridge.

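So the first loop is 4 fused-domain uops (2 for movnti, 1 for add, and 1 for the macro-fused cmp/jle), and it can issue at one iteration per clock.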

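The nop is a fifth fused-domain uop (even though it needs no execution port, so it counts as 0 unfused-domain uops). With 5 uops, the front end can only issue the loop at one iteration per 2 clocks.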
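The uop counting behind the 1-cycle and 2-cycle results can be sketched as a small calculation. This is only a rough model, not a simulator: it assumes an issue width of 4 fused-domain uops per clock, assumes that one issue group never mixes uops from two iterations of such a small loop on this CPU, and takes the per-instruction uop counts from the reasoning in this answer. The names `ISSUE_WIDTH`, `FIRST_LOOP`, `SECOND_LOOP`, and `cycles_per_iteration` are made up for the illustration.

```python
import math

# Assumption: the front end issues at most 4 fused-domain uops per clock.
ISSUE_WIDTH = 4

# Fused-domain uop counts per instruction, as argued in the answer.
FIRST_LOOP = {
    "movnti [array], eax": 2,
    "add rcx, 1": 1,
    "cmp rcx, n / jle .begin (macro-fused)": 1,
}

# Same loop plus the nop, which still costs one fused-domain uop.
SECOND_LOOP = dict(FIRST_LOOP, nop=1)


def cycles_per_iteration(fused_uops, issue_width=ISSUE_WIDTH):
    """Front-end bound, assuming an issue group never spans two iterations."""
    return math.ceil(fused_uops / issue_width)


for name, loop in (("first", FIRST_LOOP), ("second", SECOND_LOOP)):
    uops = sum(loop.values())
    print(name, uops, "uops ->", cycles_per_iteration(uops), "cycle(s) per iteration")
```

Under these assumptions the first loop comes out at 4 uops and 1 cycle, and the second at 5 uops and 2 cycles, which matches the measurements in the question. The model says nothing about the third loop, whose cost does not come from the front end.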

See also the x86 tag wiki, which has links to more material on how CPUs work.

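As for the third loop, it is probably slow because mov rbx, [array+16] loads from the same cache line that movnti writes to, and movnti evicts that line from cache. This probably happens each time the fill buffer that the store goes into is flushed. (Not after every movnti, though; apparently it can rewrite some bytes in the same fill buffer.)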
