C ++ signed and unsigned int vs long long speed

Today I noticed that the speed of several simple bitwise and arithmetic operations differs significantly between int , unsigned , long long and unsigned long long on my 64-bit computer.

In particular, the next loop is about two times faster for unsigned than for long long , which I did not expect.

 int k = 15; int N = 30; int mask = (1 << k) - 1; while (!(mask & 1 << N)) { int lo = mask & ~(mask - 1); int lz = (mask + lo) & ~mask; mask |= lz; mask &= ~(lz - 1); mask |= (lz / lo / 2) - 1; } 

(full code here )

Below are the timings (in seconds) (for g++ -O , -O2 and -O3 ):

 1.834207723 (int) 3.054731598 (long long) 1.584846237 (unsigned) 2.201142018 (unsigned long long) 

These times are very consistent (i.e. 1% margin). Without the -O flag, each of them is about one second slower, but the relative speeds are the same.

Is there a clear reason for this? Vectorization can be for 32-bit types, but I do not see where the huge difference between long long and unsigned long long comes from. Are some operations much slower on some types than others, or is it just common that 64-bit types are slower (even on 64-bit architectures)?

For those who are interested in this cycle, loops over all subsets {1,2,...,30} contain exactly 15 elements. This is done by cyclic (in order) over all integers less than 1<<30 , with an accuracy of 15 bits. For the current case, that is 155117520 iterations. I no longer know the source of this fragment, but this one explains a few more.

Edit

It seems from the assembly code that division can be done faster when the type is unsigned. I think this makes sense, because we do not need to consider the sign bit.

In addition, 32-bit operations use movl and other xxxl , while 64-bit operations use movq and xxxq .

Edit 2

After reading the post I linked, I decided to use the above formula:

 T k = 15; TN = 30; T mask = (1 << k) - 1; while (!(mask & 1 << N)) { T t = mask | (mask - 1); mask = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(mask) + 1)); } 

This runs in about a third of the code described above and uses the same time for all four types.

+6
source share
1 answer

The slowest operation in your code

 mask |= (lz / lo / 2) - 1 

32-bit division is much faster than 64-bit division. For example, on Ivy Bridge, a 32-bit IDIV takes 19-26 hours, while a 64-bit IDIV takes 28-103 hours of waiting.

The unsigned version is also faster than the signed one, because dividing by 2 is a simple bit offset in the unsigned case, and there are no requests for size expansion (CDQ, CQO).

in the unsigned case this is a simple bit shift, and in the signed

[1] http://www.agner.org/optimize/instruction_tables.pdf

+8
source

All Articles