Why does using mod with the int64_t operand make this function 150% slower?

The max_rem function computes the maximum remainder that (a+1)^n + (a-1)^n leaves when divided by a^2, for n = 1, 2, 3, ... main calls max_rem on each a from 3 to 999. Full code:

    #include <inttypes.h>
    #include <stdio.h>

    int max_rem(int a) {
        int max_r = 0;
        int m = a * a;          // <-------- offending line
        int r1 = a + 1, r2 = a - 1;
        for (int n = 1; n <= a * a; n++) {
            r1 = (r1 * (a + 1)) % m;
            r2 = (r2 * (a - 1)) % m;
            int r = (r1 + r2) % m;
            if (max_r < r)
                max_r = r;
        }
        return max_r;
    }

    int main() {
        int64_t sum = 0;
        for (int a = 3; a < 1000; a++)
            sum += max_rem(a);
        printf("%" PRId64 "\n", sum);
    }

If I change line 6 from:

 int m = a * a; 

to

 int64_t m = a * a; 

the whole computation becomes about 150% slower. I tried this with both gcc 5.3 and clang 3.6.

With int:

    $ gcc -std=c99 -O3 -Wall -o 120 120.c
    $ time(./120)

    real    0m3.823s
    user    0m3.816s
    sys     0m0.000s

With int64_t:

    $ time(./120)

    real    0m9.861s
    user    0m9.836s
    sys     0m0.000s

and yes, I am on a 64-bit system. Why is this happening?

I always assumed that using int64_t is safer and more portable, a “modern way to write C”®, and that it would not hurt performance for numeric code on 64-bit systems. Is this assumption erroneous?

EDIT: just to be clear, the slowdown persists even if you change every variable to int64_t. So this is not a problem with mixing int and int64_t.

+4
4 answers

I always assumed that using int64_t is safer and more portable, a “modern way to write C”®, and that it would not hurt performance for numeric code on 64-bit systems. Is this assumption erroneous?

I think so. You can find instruction timings in the Intel Software Optimization Reference Manual (Appendix C, Table C-17, "General Purpose Instructions", page 645):
  • IDIV r64: throughput 85-100 cycles per instruction
  • IDIV r32: throughput 20-26 cycles per instruction
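
A rough back-of-the-envelope check of why this dominates the run time: the inner loop runs a^2 iterations for each a and performs 3 remainder operations per iteration, so the program executes about 3 * (sum of a^2 for a = 3...999) ≈ 3 * 3.3e8 ≈ 1e9 integer divisions. With that many divisions and little else competing for the divider, a difference of tens of cycles per IDIV maps almost directly onto the roughly 2.5x difference in wall-clock time.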
+6

TL;DR: you see different performance when you change the type because you are measuring different computations: one on all 32-bit data, the other on partly or entirely 64-bit data.

I always assumed that using int64_t is safer and more portable and a “modern way to write C”®

int64_t is the safest and most portable way (among conforming C99 and C11 compilers) to refer to a 64-bit signed integer type with no padding bits and a two's complement representation, if the implementation in fact provides such a type. Whether using this type actually makes your code more portable depends on whether the code relies on any of those specific characteristics of the integer representation, and on whether you are concerned with portability to environments that do not provide such a type.

and it will not hurt performance on 64-bit systems for numerical code. Is this assumption erroneous?

int64_t is specified to be a typedef. On any given system, using int64_t is semantically identical to directly using the type that underlies the typedef on that system. You will see no performance difference between those alternatives.

However, your line of reasoning and question seem to reflect an assumption that either the type underlying int64_t on the system where you ran your tests is int, or that 64-bit arithmetic performs identically to 32-bit arithmetic on that system. Neither of those assumptions is justified. It is by no means guaranteed that C implementations for 64-bit systems make int a 64-bit type, and in particular neither GCC nor Clang for x86_64 does so. Moreover, C has nothing whatever to say about the relative performance of arithmetic on different types, and, as others have pointed out, native x86_64 integer division instructions are in fact slower for 64-bit operands than for 32-bit operands. Other platforms might exhibit other differences.
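
A quick way to check that on a given system is to print the sizes directly. A minimal check, assuming x86_64 with GCC or Clang (where this is expected to print 4 and 8):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void) {
        // On x86_64 with gcc or clang, int stays 32-bit even though the
        // system is 64-bit, so int64_t maps to a wider underlying type
        // (typically long). Expected output there: "4 8".
        printf("%zu %zu\n", sizeof(int), sizeof(int64_t));
        return 0;
    }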

+2

Integer division/modulo is extremely slow compared to any other operation. (And its cost depends on the operand size, unlike most operations on modern hardware; see the end of this answer.)

For repeated use of the same modulus, you will get much better performance from finding a multiplicative inverse for your integer divisor. Compilers do this for you for compile-time constants, but it is moderately expensive in time and code size to do it at runtime, so with current compilers you have to decide for yourself when it is worth it.

It costs a few cycles up front, but the cost is amortized over the 3 divisions per loop iteration.

The canonical reference for this idea is the Granlund and Montgomery 1994 paper, from back when division was only about 4 times as expensive as multiplication on P5 Pentium hardware. The paper describes implementing the idea in gcc 2.6, as well as the mathematical proof that it works.

Compiler output shows the kind of code that division by a small constant turns into:

    ## clang 3.8 -O3 -mtune=haswell for x86-64 SysV ABI: first arg in rdi
    int mod13 (int a) { return a % 13; }
        movsxd  rax, edi              # sign-extend 32-bit a into 64-bit rax
        imul    rcx, rax, 1321528399  # gcc uses one-operand 32-bit imul (32x32 => 64b),
                                      # which is faster on Atom but slower on almost
                                      # everything else; clang output shown because it's simpler
        mov     rdx, rcx
        shr     rdx, 63               # 0 or 1: extract the sign bit with a logical right shift
        sar     rcx, 34               # only use the high half of the 32x32 => 64b multiply
        add     ecx, edx              # ecx = a/13; adding the sign bit accounts for the
                                      # rounding semantics of C integer division with
                                      # negative numbers
        imul    ecx, ecx, 13          # do the remainder as a - (a/13)*13
        sub     eax, ecx
        ret

And yes, all of this is cheaper than a div instruction, in both throughput and latency.

I tried googling for simpler descriptions or calculators and found stuff like this page.
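
To make the runtime version of the idea concrete, here is a minimal sketch of a related trick for unsigned operands (sometimes called "fastmod"); it is not the Granlund-Montgomery sequence the compiler emits above, and it assumes GCC/Clang's __uint128_t and a divisor of at least 2:

    #include <stdint.h>

    // Precompute once per divisor d (d >= 2): c = ceil(2^64 / d).
    uint64_t fastmod_init(uint32_t d) {
        return UINT64_MAX / d + 1;
    }

    // Computes n % d with two multiplies instead of a hardware division.
    // Valid for 32-bit unsigned n and d, with c from fastmod_init(d).
    uint32_t fastmod(uint32_t n, uint64_t c, uint32_t d) {
        uint64_t lowbits = c * n;   // intentionally wraps modulo 2^64
        return (uint32_t)(((__uint128_t)lowbits * d) >> 64);
    }

In the loop from the question, m = a*a stays fixed for the whole inner loop, so the precomputation would only need to run once per call to max_rem.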


On modern Intel CPUs, 32-bit and 64-bit multiplies have one-per-cycle throughput and 3-cycle latency (i.e. they are fully pipelined).

Division is only partially pipelined (the div unit cannot accept one input every cycle), and, unlike most instructions, it has data-dependent performance:

From Agner Fog's instruction tables (see also the x86 tag wiki):

  • Intel Core2: idiv r32: one per 12-36c throughput (18-42c latency, 4 uops).
    idiv r64: one per 28-40c throughput (39-72c latency, 56 uops). (unsigned div is significantly faster: 32 uops, 18-37c throughput.)
  • Intel Haswell: div/idiv r32: one per 8-11c throughput (22-29c latency, 9 uops).
    idiv r64: one per 24-81c throughput (39-103c latency, 59 uops). (unsigned div: one per 21-74c throughput, 36 uops.)
  • Skylake: div/idiv r32: one per 6c throughput (26c latency, 10 uops).
    idiv r64: one per 24-90c throughput (42-95c latency, 57 uops). (unsigned div: one per 21-83c throughput, 36 uops.)

So on Intel hardware, unsigned division is cheaper for 64-bit operands, and the same as signed for 32-bit operands.

The throughput difference between 32-bit and 64-bit idiv can easily account for the 150% slowdown. Your code is completely throughput-bound, since it has plenty of independent operations, especially between loop iterations; the only loop-carried dependency is the cmov for the max operation.
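
If 64-bit operands are genuinely needed, the tables above also point at a cheaper option: every value in the question's loop is non-negative, so unsigned 64-bit arithmetic computes the same thing while letting the compiler use unsigned div r64 instead of idiv r64. A sketch of that variant (untested here, same algorithm as the question's max_rem):

    #include <stdint.h>

    // Same computation as max_rem from the question, but with unsigned
    // 64-bit types. All intermediate values are non-negative, so the result
    // is unchanged; the % operations compile to unsigned `div` rather than
    // signed `idiv`, which the tables above list as cheaper for r64.
    uint64_t max_rem_u64(uint64_t a) {
        uint64_t max_r = 0;
        uint64_t m = a * a;
        uint64_t r1 = a + 1, r2 = a - 1;
        for (uint64_t n = 1; n <= a * a; n++) {
            r1 = (r1 * (a + 1)) % m;
            r2 = (r2 * (a - 1)) % m;
            uint64_t r = (r1 + r2) % m;
            if (max_r < r)
                max_r = r;
        }
        return max_r;
    }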

+1
source

The answer to this question can only come from looking at the assembly. I would run it on my box out of curiosity, but it is 3000 miles away :( so I will have to guess, and you can look and post your findings here. Just add -S to your compiler command line.

I believe that with int64 the compilers do something different than they do with int32, i.e. they cannot use some optimization that is available to them with int32.

Maybe gcc replaces the division with a multiplication only for int32? There would have to be an 'if (x < 0)' branch; maybe gcc can eliminate it for int32?

For some reason I don't think the performance could be that different if they both emit a plain idiv.
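
A minimal pair one could compile with -S to check that guess (assuming x86-64 gcc or clang; the exact output depends on compiler version and flags):

    #include <stdint.h>

    // Compile with: gcc -O3 -S minpair.c   (minpair.c is just a placeholder name)
    // With a runtime divisor, mod32 is expected to use a 32-bit idiv and
    // mod64 a 64-bit idiv; a multiply-based sequence only appears when the
    // divisor is a compile-time constant.
    int     mod32(int x, int m)         { return x % m; }
    int64_t mod64(int64_t x, int64_t m) { return x % m; }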

0
