- I can't understand why the first code takes ~1 cycle per iteration while the second takes 2 cycles per iteration. I measured with Agner Fog's tool and with perf. According to IACA, and by my own theoretical calculation, it should also take 1 cycle.
This one takes 1 cycle per iteration:
; array is an array defined in the data section
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
add rcx, 1
cmp rcx, n
jle .begin
And this one takes 2 cycles per iteration, but why?
; array is an array defined in the data section
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
nop
add rcx, 1
cmp rcx, n
jle .begin
This final version takes ~27 cycles per iteration. But why? After all, there is no dependency chain.
.begin:
movnti [array], eax
mov rbx, [array+16]
add rcx, 1
cmp rcx, n
jle .begin
My processor is Ivy Bridge.
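For reference, the kind of theoretical estimate mentioned above can be sketched roughly as follows. This is only a minimal model, not the exact calculation behind the numbers in this post. It assumes a 4-wide issue stage, one store per cycle and one taken branch per cycle, and it assumes that `movnti` counts as one fused-domain uop and that `cmp`/`jle` macro-fuse into one. The function name `estimate_cycles` and the constants are hypothetical, introduced just for illustration.

```python
import math

# Assumed per-cycle limits for the core (assumptions, not measured values).
ISSUE_WIDTH = 4                 # fused-domain uops issued per cycle
STORES_PER_CYCLE = 1            # one store can be executed per cycle
TAKEN_BRANCHES_PER_CYCLE = 1    # one taken branch per cycle


def estimate_cycles(fused_uops, stores, taken_branches=1):
    """Lower bound on cycles per iteration from simple throughput limits."""
    return max(
        math.ceil(fused_uops / ISSUE_WIDTH),
        stores / STORES_PER_CYCLE,
        taken_branches / TAKEN_BRANCHES_PER_CYCLE,
    )


# Assumed uop counts per iteration:
#   movnti = 1 (micro-fused store), add = 1, cmp+jle = 1 (macro-fused), nop = 1
loop_without_nop = estimate_cycles(fused_uops=3, stores=1)
loop_with_nop = estimate_cycles(fused_uops=4, stores=1)

print(loop_without_nop, loop_with_nop)
```

Under these assumptions the model gives the same bound for the loop with and without the `nop`, so it does not account for the 2 cycles that were measured.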
user6262188