Why is MOVNTI not slower in a loop storing repeatedly to the same address?

section .text

%define n    100000

_start:
xor rcx, rcx

jmp .cond
.begin:
    movnti [array], eax

.cond:
    add rcx, 1 
    cmp rcx, n
    jl .begin


section .data
array times 81920 db "A"

According to perf, it runs at 1.82 instructions per cycle. I don't understand why it is so fast. In the end, the data has to be stored to memory (RAM), so it should be slow.

P.S. Is there any loop-carried dependency?

EDIT

section .text

%define n    100000

_start:
xor rcx, rcx

jmp .cond
.begin:
    movnti [array+rcx], eax

.cond:
    add rcx, 1 
    cmp rcx, n
    jl .begin


section .data
array times n dq 0

Now the loop takes 5 cycles per iteration. Why? After all, there is still no loop-carried dependency.

+3
2 answers

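movnti can apparently sustain a throughput of one store per clock cycle when writing repeatedly to the same address.

I think movnti keeps writing into the same fill buffer, which is not flushed very often because no other loads or stores are happening. (The same buffers come into play when copying from WC video memory with SSE4.1 NT loads, as well as when storing to normal memory with NT stores.)

So the NT write-combining fill buffer acts like a cache for multiple overlapping NT stores to the same address, and the writes hit in the fill buffer instead of going out to DRAM each time.

DDR DRAM only supports burst-transfer commands. If every movnti produced a 4B write that was actually visible to the memory chips, it could not possibly run this fast. The memory controller would have to either read/modify/write or do an interrupted burst transfer, since there is no non-burst write command.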



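The reason your experiment does not show the loop running at 4 instructions per clock (one cycle per iteration) is that you used such a tiny repeat count. 100k iterations barely outweigh the startup overhead (which perf's timing includes).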

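For example, on a Core2 E6600 (Merom/Conroe) with DDR2 533MHz, the total time, including all process startup and exit overhead, is 0.113846 ms. That is only 266,007 cycles.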
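
As a rough sanity check on the two figures quoted above, the following worked arithmetic assumes that the elapsed time is in milliseconds and that both numbers come from the same run of the 100000-iteration loop. Neither assumption is stated explicitly in the thread. The second line further assumes the best case of one cycle per iteration.

```latex
\begin{align*}
f_{\text{eff}} &= \frac{266\,007\ \text{cycles}}{0.113846\ \text{ms}} \approx 2.34\ \text{GHz} \\
\frac{\text{loop cycles (best case)}}{\text{total cycles}} &= \frac{100\,000}{266\,007} \approx 0.38
\end{align*}
```

The implied clock is close to the E6600's nominal 2.4 GHz, so the units are at least consistent. Under these assumptions, well over half of the measured cycles would be spent outside the loop.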

A more reasonable microbenchmark shows one iteration (one movnti) per cycle:

global _start
_start:
    xor ecx,ecx
.begin:
    movnti  [array], eax
    dec     ecx
    jnz     .begin         ; 2^32 iterations

    mov eax, 60     ; __NR_exit
    xor edi,edi
    syscall         ; exit(0)

section .bss
array resb 81920

(asm-link is a script I wrote.)

$ asm-link movnti-same-address.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 movnti-same-address.asm
+ ld -o movnti-same-address movnti-same-address.o
$ perf stat -e task-clock,cycles,instructions ./movnti-same-address 

 Performance counter stats for './movnti-same-address':

       1835.056710      task-clock (msec)         #    0.995 CPUs utilized          
     4,398,731,563      cycles                    #    2.397 GHz                    
    12,891,491,495      instructions              #    2.93  insns per cycle        
       1.843642514 seconds time elapsed
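
The counters above can be turned into per-iteration figures. This is a short worked calculation, assuming the loop runs exactly 2^32 iterations (ecx starts at 0 and wraps around) and ignoring the handful of instructions outside the loop:

```latex
\begin{align*}
\text{cycles per iteration} &= \frac{4\,398\,731\,563}{2^{32}} \approx 1.02 \\
\text{instructions per iteration} &= \frac{12\,891\,491\,495}{2^{32}} \approx 3.00 \\
\text{IPC} &= \frac{12\,891\,491\,495}{4\,398\,731\,563} \approx 2.93
\end{align*}
```

The three instructions per iteration match the loop body (movnti, dec, jnz), and the IPC agrees with the value perf reports.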

Running two copies in parallel shows that they barely slow each other down:

$ time ./movnti-same-address; time ./movnti-same-address & time ./movnti-same-address &

real    0m1.844s / user    0m1.828s    # running alone
[1] 12523
[2] 12524
peter@tesla:~/src/SO$ 
real    0m1.855s / user    0m1.824s    # running together
real    0m1.984s / user    0m1.808s
# output compacted by hand to save space

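I expect perfect SMP scaling (except with hyperthreading) up to any number of cores. For example, on a 10-core Xeon, 10 copies of this test could run at the same time (on separate physical cores), and each would finish in the same time as if it were running alone. (Single-core versus multi-core turbo will also be a factor, though, if you measure wall-clock time instead of cycle counts.)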


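zx485's uop count nicely explains why the loop is not bottlenecked on the front end or on unfused-domain execution resources.

However, this disproves his theory that the ratio of CPU clock to memory clock has anything to do with it. It is an interesting coincidence, though, that the OP chose a repeat count that made the overall IPC work out that way.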


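P.S. Is there any loop-carried dependency?

Yes, the loop counter (1 cycle). By the way, you could have saved an insn by counting down towards zero with dec / jg instead of counting up and needing a cmp.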

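
The write-after-write memory dependency is not a "true" dependency in the usual sense, but it is something the CPU has to keep track of. The CPU does not "notice" that the same value is written repeatedly, so it has to make sure the last write is the one that "counts".

This is a write-after-write (WAW) hazard. The term is usually applied to registers, but I think it fits memory as well.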

+2

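The result is plausible. Your loop consists of the following instructions. According to Agner Fog's instruction tables, they have the following timings: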

Instruction     regs    fused  unfused  ports    Latency  Reciprocal throughput
--------------------------------------------------------------------------------
MOVNTI          m,r     2      2        p23 p4   ~400     1
ADD             r,r/i   1      1        p0156    1        0.25
CMP             r,r/i   1      1        p0156    1        0.25
Jcc             short   1      1        p6       1        1-2 (if the jump is predicted taken)
Fused CMP+Jcc   short   1      1        p6       1        1-2 (if the jump is predicted taken)

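This means:

  • MOVNTI consumes 2 uops: one on port 2 or 3 and one on port 4
  • ADD consumes 1 uop on port 0, 1, 5 or 6
  • CMP and Jcc macro-fuse into the last line of the table, consuming 1 uop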

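Because neither ADD nor CMP+Jcc depends on the result of MOVNTI, they can be executed (nearly) in parallel on recent architectures, for example on ports 1, 2, 4 and 6. The worst case would be a latency of 1 cycle between ADD and CMP+Jcc.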
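
The uop counts from the table can be combined into a rough per-iteration bound. This is only a sketch: the issue width of 4 fused-domain uops per clock is an assumption about the asker's CPU, not something the table states, and the branch figure is simply the table's own range.

```latex
\begin{align*}
\text{fused-domain uops per iteration} &= \underbrace{2}_{\text{MOVNTI}} + \underbrace{1}_{\text{ADD}} + \underbrace{1}_{\text{CMP+Jcc}} = 4 \\
\text{issue bound} &= \frac{4\ \text{uops}}{4\ \text{uops/cycle}} = 1\ \text{cycle per iteration} \\
\text{store bound (p4)} &= 1\ \text{cycle per iteration} \\
\text{taken-branch bound (p6)} &= 1\text{--}2\ \text{cycles per iteration}
\end{align*}
```

Under these assumptions none of the execution resources forces more than about one to two cycles per iteration, so the ~400-cycle latency listed for MOVNTI does not limit throughput.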

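This is most likely a design error in your code: you are essentially writing to the same address [array] 100000 times, because you never adjust the address.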

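The repeated writes can even go to the L1 cache under the condition that

    "The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region."

but that does not appear to be the case here, and it would not make much difference anyway, because even when writing to memory, memory speed is the limiting factor.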

For example, with a 3GHz CPU and DDR3-1600 RAM, this gives 3/1.6 = 1.875 CPU cycles per memory cycle. That seems plausible.

+1
