My question is about conditional test in trial division. There seems to be some discussion about which conditional tests to use. Let's look at the code for this from RosettaCode .
int is_prime(unsigned int n) { unsigned int p; if (!(n & 1) || n < 2 ) return n == 2; for (p = 3; p <= n/p; p += 2) if (!(n % p)) return 0; return 1; }
Wheel factorization or using a given list of primes will not change the essence of my question.
There are three cases that I can think of to do a conditional test:
- p <= p / p
- p * p <= n
- int cut = sqrt (n); for (p = 3; p <= cut; p + = 2)
Case 1: it works for all n, but it must do an additional split for each iteration ( edit: in fact it does not require additional division, but it is still slower). I do not know why. See assembly output below) . I found that it is twice as slow as in case 2 with large values โโof n, which are the main ones (on my Sandy Bridge system).
Case 2: significantly faster than case 1, but has the problem that it overflows for large n and goes into the infinitive loop. The maximum value that can handle is
(sqrt(n) + c)^2 = INT_MAX //solve n = INT_MAX -2*c*sqrt(INT_MAX) + c^2 //INT_MAX = 2^32 -> n = 2^32 - c*s^17 + c^2; in our case c = 2
For example, for uint64_t, case 2 will go into an infinite loop for x = -1L-58 (x ^ 64-59), which is simple.
Case 3: no division or multiplication should be performed at each iteration, and it does not overflow, as in case 2. This is slightly faster than case 2. The only question is that sqrt (n) is accurate enough .
Can someone explain to me why case 2 is much faster than case 1? Case 1 does NOT use additional separation, although I, although in spite of this, am still much slower.
Here are the times for a simple 2 ^ 56-5;
case 1 9.0s case 2 4.6s case 3 4.5s
Here is the code I used to test this http://coliru.stacked-crooked.com/a/69497863a97d8953 . I also added features to the end of this question.
Here is the assembly form of GCC 4.8 with -O3 for case 1 and case 2. They both have only one division. Case 2 also has multiplication, so I assume Case 2 will be slower, but it is about twice as fast on both GCC and MSVC. I do not know why.
Case 1:
.L5: testl %edx, %edx je .L8 .L4: addl $2, %ecx xorl %edx, %edx movl %edi, %eax divl %ecx cmpl %ecx, %eax jae .L5
Case 2:
.L20: xorl %edx, %edx movl %edi, %eax divl %ecx testl %edx, %edx je .L23 .L19: addl $2, %ecx movl %ecx, %eax imull %ecx, %eax cmpl %eax, %edi jae .L20
Here are the functions that I use to check the time:
int is_prime(uint64_t n) { uint64_t p; if (!(n & 1) || n < 2 ) return n == 2; for (p = 3; p <= n/p; p += 2) if (!(n % p)) return 0; return 1; } int is_prime2(uint64_t n) { uint64_t p; if (!(n & 1) || n < 2 ) return n == 2; for (p = 3; p*p <= n; p += 2) if (!(n % p)) return 0; return 1; } int is_prime3(uint64_t n) { uint64_t p; if (!(n & 1) || n < 2 ) return n == 2; uint32_t cut = sqrt(n); for (p = 3; p <= cut; p += 2) if (!(n % p)) return 0; return 1; }
Added content after the award.
Aean found that in the case where the storage of both the private and the remainder is as fast (or slightly faster) than case 2. Let me call this case 4. The following code is twice as fast as in case 1.
int is_prime4(uint64_t n) { uint64_t p, q, r; if (!(n & 1) || n < 2 ) return n == 2; for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p) if (!r) return 0; return 1; }
I'm not sure why this helps. In any case, there is no need to use case 2. For case 3, most versions of the sqrt function in the hardware or software receive ideal squares, so they are safe for use in general. Case 3 is the only case that will work with OpenMP.