How can adding extra instructions speed up execution?

When I run the following function, I get some unexpected results.

On my machine, the code below consistently takes about 6 seconds to run. However, if I uncomment the line ";dec [variable + 24]", so that more code is executed, it takes only about 4.5 seconds. Why?

    .DATA
    variable dq 0 dup(4)

    .CODE

    runAssemblyCode PROC
        mov rax, 2330 * 1000 * 1000
    start:
        dec [variable]
        dec [variable + 8]
        dec [variable + 16]
        ;dec [variable + 24]
        dec rax
        jnz start
        ret
    runAssemblyCode ENDP
    END

I noticed that there are already similar questions on Stack Overflow, but their code samples are not as simple as this one, and I could not find a succinct answer to this question.

I tried padding the code with nop instructions to see whether this is an alignment problem, and I also set the affinity to a single processor. Nothing changed.

+4
3 answers

The simple answer is that modern processors are extremely complex. A lot happens under the hood that looks unpredictable or random to an observer.

Inserting the additional instruction may cause the processor to schedule instructions differently, which can matter in such a tight loop. But this is just a guess.

As far as I can tell, it touches the same cache line as the previous instruction, so prefetching does not seem to be the explanation. I can't think of a logical explanation, but again, the processor uses a lot of undocumented heuristics and guesses to execute code as quickly as possible, and sometimes that means strange corner cases where they fail and the code becomes slower than you expected.

Have you tested this on different CPU models? It would be interesting to see whether this happens only on your particular processor, or whether other x86 processors show the same thing.

+3

bob.s

    .data
    variable:
        .word 0,0,0,0
        .word 0,0,0,0
        .word 0,0,0,0
        .word 0,0,0,0
        .word 0,0,0,0
        .word 0,0,0,0

    .text
    .globl runAssemblyCode
    runAssemblyCode:
        mov $0xFFFFFFFF,%eax
    start_loop:
        decl variable+0
        decl variable+8
        decl variable+16
        # On x86, GNU as treats ';' as a statement separator, not a comment,
        # so '#' is used here to actually disable the extra instruction.
        # decl variable+24
        dec %eax
        jne start_loop
        retq

ted.c

    #include <stdio.h>
    #include <time.h>

    void runAssemblyCode ( void );

    int main ( void )
    {
        volatile unsigned int ra,rb;

        ra=(unsigned int)time(NULL);
        runAssemblyCode();
        rb=(unsigned int)time(NULL);
        printf("%u\n",rb-ra);
        return(0);
    }

gcc -O2 ted.c bob.s -o ted
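
Because time(NULL) only has one-second resolution, a difference of a few hundred milliseconds between the two variants can be rounded away. Below is a minimal sketch of a finer-grained measurement, assuming a POSIX system where clock_gettime and CLOCK_MONOTONIC are available. The helper names elapsed_seconds and time_call are made up for illustration and are not part of the original ted.c.

```c
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime on strict compilers */
#include <assert.h>
#include <time.h>

/* Difference between two timestamps, in seconds. */
double elapsed_seconds(struct timespec begin, struct timespec end)
{
    return (double)(end.tv_sec - begin.tv_sec)
         + (double)(end.tv_nsec - begin.tv_nsec) / 1e9;
}

/* Hypothetical helper: time a single call of fn with the monotonic clock. */
double time_call(void (*fn)(void))
{
    struct timespec begin, end;
    int rc;

    assert(fn != NULL);
    rc = clock_gettime(CLOCK_MONOTONIC, &begin);
    assert(rc == 0);
    fn();
    rc = clock_gettime(CLOCK_MONOTONIC, &end);
    assert(rc == 0);
    (void)rc;  /* avoid an unused-variable warning when NDEBUG is defined */
    return elapsed_seconds(begin, end);
}
```

In ted.c, a call such as printf("%.3f\n", time_call(runAssemblyCode)) could then stand in for the two time(NULL) calls.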

This was with the additional instruction:

    00000000004005d4 <runAssemblyCode>:
      4005d4: b8 ff ff ff ff          mov    $0xffffffff,%eax

    00000000004005d9 <start_loop>:
      4005d9: ff 0c 25 28 10 60 00    decl   0x601028
      4005e0: ff 0c 25 30 10 60 00    decl   0x601030
      4005e7: ff 0c 25 38 10 60 00    decl   0x601038
      4005ee: ff 0c 25 40 10 60 00    decl   0x601040
      4005f5: ff c8                   dec    %eax
      4005f7: 75 e0                   jne    4005d9 <start_loop>
      4005f9: c3                      retq
      4005fa: 90                      nop

I don't see a difference. Maybe you can correct my code, or others can try it on their systems to see what they get.

Decrementing a memory location is an extremely painful instruction. On top of that, if you are doing anything other than a byte-sized memory decrement, the access is unaligned and will be painful for the memory system. This routine should therefore be sensitive to cache lines, the number of cores, and so on.
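
To check which accesses share a cache line, the addresses from the disassembly can be mapped to line indices. This is only a sketch: the 64-byte line size is an assumption (common on x86, but not guaranteed for every CPU listed here), and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Adjust for the CPU under test. */
#define CACHE_LINE_BYTES 64u

/* Index of the cache line that contains addr. */
uint64_t cache_line_index(uint64_t addr)
{
    return addr / CACHE_LINE_BYTES;
}

/* Nonzero if an access of `size` bytes starting at addr crosses a
   cache-line boundary. */
int straddles_cache_line(uint64_t addr, unsigned size)
{
    assert(size > 0);
    return cache_line_index(addr) != cache_line_index(addr + size - 1);
}
```

Under that assumption, 0x601028, 0x601030 and 0x601038 from the disassembly fall in the same line, while 0x601040 starts the next one. None of the four 4-byte accesses crosses a line boundary.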

On an AMD Phenom 9950 quad-core processor, it took about 13 seconds with or without the additional instruction.

On an Intel (R) Core (TM) 2 CPU 6300, it took about 9-10 seconds with or without the additional instruction.

Two processors: Intel (R) Xeon (TM) CPU

It took about 13 seconds with or without the additional instruction.

This one: Intel (R) Core (TM) 2 Duo CPU T7500

8 seconds with or without.

All of these run 64-bit Ubuntu 10.04 or 10.10, with maybe an 11.04 in there.

A few more machines, all 64-bit Ubuntu:

Intel (R) Xeon (R) CPU X5450 (8 cores)

6 seconds with or without the additional instruction.

Intel (R) Xeon (R) CPU E5405 (8 cores)

9 seconds with or without.

What is the speed of the DDR/DRAM in your system? Which processor are you using (cat /proc/cpuinfo if on Linux)?

Intel (R) Xeon (R) CPU E5440 (8 cores)

6 seconds with or without

Ahh, found a single-core Xeon: Intel (R) Xeon (TM) CPU

15 seconds with or without the additional instruction.

+1

It's not that bad. On average, one loop iteration takes 2.6 ns in the first case and 1.9 ns in the other. Assuming a 2 GHz CPU, which has a period of 0.5 ns, the difference is about (2.6 - 1.9) / 0.5 ≈ 1.4, or roughly one clock cycle per loop iteration. Nothing surprising.
The time difference only becomes so noticeable because of the number of iterations requested: 0.5 ns * 2330000000 ≈ 1.2 seconds, which is about the difference you observed.
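
The same estimate can be written out as a small calculation. This is a sketch only: the 6 s and 4.5 s run times and the 2,330,000,000 iterations come from the question, the 2 GHz clock is the assumption made in this answer, and the function names are illustrative.

```c
#include <assert.h>

/* Average time per loop iteration, in nanoseconds. */
double ns_per_iteration(double total_seconds, double iterations)
{
    assert(iterations > 0.0);
    return total_seconds * 1e9 / iterations;
}

/* Convert a duration in nanoseconds to clock cycles.
   At 1 GHz, one cycle lasts exactly 1 ns. */
double ns_to_cycles(double ns, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return ns * clock_ghz;
}
```

With the question's figures this gives roughly 2.58 ns and 1.93 ns per iteration, a gap of about 1.3 cycles at the assumed 2 GHz.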

0