Integer division / modulo is extremely slow compared to any other operation. (Its cost also depends on the operand size, unlike most operations on modern hardware; see the end of this answer.)
For repeated use of the same modulus, you will get much better performance from finding a multiplicative inverse for your integer divisor. Compilers do this for you for compile-time constants, but it is moderately expensive in time and code size to do it at run time, so with current compilers you have to decide for yourself when it is worth it.
It takes a few CPU cycles up front, but they are amortized over 3 divisions per iteration.
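To make the run-time idea concrete, here is a minimal sketch for unsigned 32-bit operands. It is not the scheme from the paper cited below; it assumes the simpler "ceil(2^64 / d)" constant, which is exact for every 32-bit dividend when d > 1. The names `compute_magic`, `fast_div` and `fast_mod` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One-time setup: magic = ceil(2^64 / d). Costs one real 64-bit division.
   d == 1 is excluded because the result would not fit in 64 bits. */
static uint64_t compute_magic(uint32_t d)
{
    assert(d > 1);
    return UINT64_MAX / d + 1;
}

/* n / d as the high 64 bits of the 96-bit product magic * n,
   assembled from two 64-bit multiplies so no 128-bit type is needed. */
static uint32_t fast_div(uint32_t n, uint64_t magic)
{
    uint64_t hi = (magic >> 32) * n;          /* weight 2^32 */
    uint64_t lo = (magic & 0xFFFFFFFFu) * n;  /* weight 2^0  */
    return (uint32_t)((hi + (lo >> 32)) >> 32);
}

/* n % d recovered from the quotient with one more multiply and a subtract. */
static uint32_t fast_mod(uint32_t n, uint64_t magic, uint32_t d)
{
    return n - fast_div(n, magic) * d;
}
```

You would call `compute_magic(d)` once outside the loop and use `fast_div` / `fast_mod` inside it. Whether that wins depends on how many divisions share the same divisor.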
The reference paper for this idea is Granlund and Montgomery's 1994 paper, written back when division was only about 4 times as expensive as multiplication on P5 Pentium hardware. The paper discusses implementing the idea in gcc 2.6 and gives a mathematical proof that it works.
Compiler output shows the kind of code that division by a small constant turns into:
#
And yes, all this is cheaper than a div instruction, for both throughput and latency.
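For a feel of what the multiply-and-shift sequence computes, here is a hand-written C equivalent for unsigned division by 13. This is my own illustration, not a compiler listing: the divisor 13 and the names `div13` / `mod13` are chosen for the example, and the constant is ceil(2^34 / 13) = 1321528399.

```c
#include <stdint.h>

/* n / 13 without a div instruction: widen to 64 bits, multiply by
   ceil(2^34 / 13), and shift right by 34. The rounding error of the
   constant is small enough that the result is exact for all 32-bit n. */
static uint32_t div13(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 1321528399u) >> 34);
}

/* n % 13 from the quotient: one extra multiply and a subtract. */
static uint32_t mod13(uint32_t n)
{
    return n - div13(n) * 13u;
}
```

Signed division needs an extra correction step for negative inputs, which is why compiler output for `int` usually has a couple more instructions than this.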
I tried to google for simpler descriptions or calculators, and found stuff like this page.
On modern Intel CPUs, 32- and 64-bit multiply has one-per-cycle throughput and 3-cycle latency (i.e. it is fully pipelined).
Division is only partially pipelined (the div unit cannot accept one input per cycle), and unlike most instructions, it has data-dependent performance:
From Agner Fog's insn tables (see also the x86 tag wiki):
- Intel Core2:
  - idiv r32: one per 12-36c throughput (18-42c latency, 4 uops).
  - idiv r64: one per 28-40c throughput (39-72c latency, 56 uops). (unsigned div is significantly faster: 32 uops, one per 18-37c throughput)
- Intel Haswell:
  - div/idiv r32: one per 8-11c throughput (22-29c latency, 9 uops).
  - idiv r64: one per 24-81c throughput (39-103c latency, 59 uops). (unsigned div: one per 21-74c throughput, 36 uops)
- Skylake:
  - div/idiv r32: one per 6c throughput (26c latency, 10 uops).
  - 64b: one per 24-90c throughput (42-95c latency, 57 uops). (unsigned div: one per 21-83c throughput, 36 uops)
So, on Intel hardware, unsigned division is cheaper for 64-bit operands and costs the same for 32-bit operands.
The throughput difference between 32b and 64b idiv can easily account for a 150% performance difference. Your code is completely throughput-bound, since you have plenty of independent operations, especially across loop iterations. The only loop-carried dependency is the cmov for the max operation.
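To show what "throughput-bound with one loop-carried dependency" looks like, here is a hypothetical loop of the same shape. It is not your code: the function name `max_remainder` and its parameters are invented for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest remainder of vals[i] % d over the array (0 for an empty array). */
static uint32_t max_remainder(const uint32_t *vals, size_t count, uint32_t d)
{
    uint32_t best = 0;
    for (size_t i = 0; i < count; i++) {
        /* Each remainder depends only on vals[i] and d, so divisions from
           different iterations can overlap in the divider. */
        uint32_t r = vals[i] % d;
        /* The running maximum is the only value carried between iterations;
           compilers often turn this into cmov, though that is not guaranteed. */
        best = (r > best) ? r : best;
    }
    return best;
}
```

Because the operands here are `uint32_t`, a compiler targeting x86-64 can use the 32-bit form of div. With 64-bit types in the same loop it would have to use the much slower 64-bit form, which is where the numbers above start to matter.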