Simplifying the conditional test in trial division

My question is about the conditional test in trial division. There seems to be some debate about which conditional test to use. Let's look at the code for this from RosettaCode.

 int is_prime(unsigned int n) {
     unsigned int p;
     if (!(n & 1) || n < 2) return n == 2;

     /* comparing p*p <= n can overflow */
     for (p = 3; p <= n/p; p += 2)
         if (!(n % p)) return 0;
     return 1;
 }

Wheel factorization or using a given list of primes will not change the essence of my question.

There are three cases I can think of for the conditional test:

  • p <= n / p
  • p * p <= n
  • int cut = sqrt(n); for (p = 3; p <= cut; p += 2)

Case 1: it works for all n, but it appears to do an additional division each iteration (edit: in fact it does not require an additional division, but it is still slower; I do not know why — see the assembly output below). I found that it is about twice as slow as case 2 for large values of n which are prime (on my Sandy Bridge system).

Case 2: significantly faster than case 1, but has the problem that it overflows for large n and goes into an infinite loop. The maximum value it can handle is

 (sqrt(n) + c)^2 = UINT_MAX                 // solve for n
 n = UINT_MAX - 2*c*sqrt(UINT_MAX) + c^2
 // UINT_MAX ~ 2^32  ->  n ~ 2^32 - c*2^17 + c^2;  in our case c = 2

For example, with uint64_t, case 2 goes into an infinite loop for n = -1UL - 58 (i.e. 2^64 - 59), which is prime.

Case 3: no division or multiplication needs to be performed at each iteration, and it does not overflow the way case 2 does. It is slightly faster than case 2. The only question is whether sqrt(n) is accurate enough.

Can someone explain to me why case 2 is so much faster than case 1? Case 1 does NOT use an additional division, yet it is still much slower.

Here are the timings for the prime 2^56 - 5:

 case 1    9.0 s
 case 2    4.6 s
 case 3    4.5 s

Here is the code I used to test this: http://coliru.stacked-crooked.com/a/69497863a97d8953 . I have also added the functions at the end of this question.

Here is the assembly from GCC 4.8 with -O3 for case 1 and case 2. Both have only one division. Case 2 also has a multiplication, so I would expect case 2 to be slower, yet it is about twice as fast with both GCC and MSVC. I do not know why.

Case 1:

 .L5:
     testl   %edx, %edx
     je      .L8
 .L4:
     addl    $2, %ecx
     xorl    %edx, %edx
     movl    %edi, %eax
     divl    %ecx
     cmpl    %ecx, %eax
     jae     .L5

Case 2:

 .L20:
     xorl    %edx, %edx
     movl    %edi, %eax
     divl    %ecx
     testl   %edx, %edx
     je      .L23
 .L19:
     addl    $2, %ecx
     movl    %ecx, %eax
     imull   %ecx, %eax
     cmpl    %eax, %edi
     jae     .L20

Here are the functions I used to check the timings:

 int is_prime(uint64_t n) {
     uint64_t p;
     if (!(n & 1) || n < 2) return n == 2;

     /* comparing p*p <= n can overflow */
     for (p = 3; p <= n/p; p += 2)
         if (!(n % p)) return 0;
     return 1;
 }

 int is_prime2(uint64_t n) {
     uint64_t p;
     if (!(n & 1) || n < 2) return n == 2;

     /* comparing p*p <= n can overflow */
     for (p = 3; p*p <= n; p += 2)
         if (!(n % p)) return 0;
     return 1;
 }

 int is_prime3(uint64_t n) {
     uint64_t p;
     if (!(n & 1) || n < 2) return n == 2;

     uint32_t cut = sqrt(n);
     for (p = 3; p <= cut; p += 2)
         if (!(n % p)) return 0;
     return 1;
 }

Edit (content added after the bounty was awarded):

Aean found that storing both the quotient and the remainder is as fast as (or slightly faster than) case 2. Let me call this case 4. The following code is twice as fast as case 1.

 int is_prime4(uint64_t n) {
     uint64_t p, q, r;
     if (!(n & 1) || n < 2) return n == 2;

     for (p = 3, q = n/p, r = n%p; p <= q; p += 2, q = n/p, r = n%p)
         if (!r) return 0;
     return 1;
 }

I'm not sure why this helps. In any case, there is no need to use case 2. For case 3, most hardware and software sqrt implementations get perfect squares right, so they are safe to use in general. Case 3 is also the only case that works with OpenMP.
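
To make the OpenMP remark concrete, here is a minimal sketch (the name is_prime3_omp and the reduction flag are my own additions, not code from above): the precomputed cut gives the loop the fixed bound an omp parallel for needs, and a reduction flag replaces the early return, which is not allowed inside a parallel loop.

 #include <math.h>
 #include <stdint.h>

 /* Sketch only: case 3 with OpenMP (compile with -fopenmp).
    The loop bound is a plain precomputed value, so OpenMP can split the
    iterations; the price is that threads cannot return early on the first
    factor found, hence the reduction flag. */
 int is_prime3_omp(uint64_t n) {
     if (!(n & 1) || n < 2) return n == 2;

     uint32_t cut = sqrt(n);
     int composite = 0;
     #pragma omp parallel for reduction(|:composite)
     for (uint64_t p = 3; p <= cut; p += 2)
         if (!(n % p)) composite = 1;
     return !composite;
 }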

+8
c math floating-point primes
Mar 21 '14 at 10:47
4 answers

UPD: This is a compiler optimization issue. While MinGW used only one division per loop iteration, GCC on Linux and MSVC failed to reuse the quotient from the previous iteration.

I think the best we can do is to explicitly define quo and rem and compute them in the same basic block, to show the compiler that we want both the quotient and the remainder.

 int is_prime(uint64_t n) {
     uint64_t p = 3, quo, rem;
     if (!(n & 1) || n < 2) return n == 2;

     quo = n / p;
     for (; p <= quo; p += 2) {
         quo = n / p;
         rem = n % p;
         if (!(rem)) return 0;
     }
     return 1;
 }



I tried your code from http://coliru.stacked-crooked.com/a/69497863a97d8953 with the MinGW-w64 compiler, and case 1 was faster than case 2.


So I guess you are compiling for a 32-bit architecture while using the uint64_t type. Your assembly shows that it does not use 64-bit registers.

If that is indeed the case, here is the reason.

On a 32-bit architecture, a 64-bit number is held in two 32-bit registers, and your compiler has to combine them for every operation. 64-bit addition, subtraction and multiplication are easy to do inline. But 64-bit modulo and division are done through small library calls: ___umoddi3 and ___udivdi3 in GCC, aullrem and aulldiv in MSVC.

So each iteration needs one ___umoddi3 call and one ___udivdi3 call in case 1, but only one ___umoddi3 call plus an inlined 64-bit multiplication in case 2. That is why case 1 appears roughly twice as slow as case 2 in your test.

What you actually get in case 1:

 L5:
     addl    $2, %esi
     adcl    $0, %edi
     movl    %esi, 8(%esp)
     movl    %edi, 12(%esp)
     movl    %ebx, (%esp)
     movl    %ebp, 4(%esp)
     call    ___udivdi3        // A call for the division
     cmpl    %edi, %edx
     ja      L6
     jae     L21
 L6:
     movl    %esi, 8(%esp)
     movl    %edi, 12(%esp)
     movl    %ebx, (%esp)
     movl    %ebp, 4(%esp)
     call    ___umoddi3        // A call for the modulo
     orl     %eax, %edx
     jne     L5

What you actually get in case 2:

 L26:
     addl    $2, %esi
     adcl    $0, %edi
     movl    %esi, %eax
     movl    %edi, %ecx
     imull   %esi, %ecx
     mull    %esi
     addl    %ecx, %ecx
     addl    %ecx, %edx
     cmpl    %edx, %ebx
     ja      L27
     jae     L41
 L27:
     movl    %esi, 8(%esp)
     movl    %edi, 12(%esp)
     movl    %ebp, (%esp)
     movl    %ebx, 4(%esp)
     call    ___umoddi3        // Just one call, for the modulo
     orl     %eax, %edx
     jne     L26

MSVC could not reuse the result of the div; the optimization is defeated by the early return. Try this code:

 __declspec(noinline) int is_prime_A(unsigned int n) {
     unsigned int p;
     int ret = -1;
     if (!(n & 1) || n < 2) return n == 2;

     /* comparing p*p <= n can overflow */
     p = 1;
     do {
         p += 2;
         if (p >= n / p) ret = 1;   /* Let the return happen later, outside the loop. */
         if (!(n % p)) ret = 0;
     } while (ret < 0);
     return ret;
 }

 __declspec(noinline) int is_prime_B(unsigned int n) {
     unsigned int p;
     if (!(n & 1) || n < 2) return n == 2;

     /* comparing p*p <= n can overflow */
     p = 1;
     do {
         p += 2;
         if (p > n / p) return 1;   /* The common routine. */
         if (!(n % p)) return 0;
     } while (1);
 }

is_prime_B is twice as slow as is_prime_A with MSVC / ICC on Windows.

+4
Mar 25 '14 at 8:28

sqrt(n) is accurate enough as long as your sqrt is monotonic, it gets perfect squares right, and every unsigned int can be represented exactly as a double. All three hold on every platform I know of.
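
If you want to convince yourself of the perfect-square condition on a given platform, a throwaway check like the following (my own sketch, assuming a 32-bit unsigned int) exhausts every square that fits:

 #include <math.h>
 #include <stdio.h>

 /* Sketch: verify that sqrt() returns every representable perfect square
    exactly, i.e. that (unsigned)sqrt(r*r) == r for all 16-bit roots r. */
 int main(void) {
     unsigned int r;
     for (r = 0; r <= 65535u; ++r) {
         unsigned int s = (unsigned int)sqrt((double)(r * r));
         if (s != r) {
             printf("sqrt(%u^2) truncated to %u\n", r, s);
             return 1;
         }
     }
     puts("sqrt() is exact on all unsigned int perfect squares");
     return 0;
 }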

You can work around these problems (if you consider them to be problems) by implementing a function unsigned int sqrti(unsigned int n) that returns the integer square root of an unsigned int using Newton's method. (This is an interesting exercise if you have never done it before!)
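
For instance, a minimal sketch of such a sqrti via Newton's method might look like this (the name and details are illustrative, not a reference implementation):

 /* Integer square root by Newton's method: returns floor(sqrt(n)).
    Starts from x = n and iterates x <- (x + n/x)/2 until it stops decreasing. */
 unsigned int sqrti(unsigned int n) {
     if (n < 2) return n;
     unsigned int x = n;
     unsigned int y = (x + n / x) / 2;
     while (y < x) {
         x = y;
         y = (x + n / x) / 2;
     }
     return x;
 }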

+2
Mar 21 '14 at 17:25

Replying to one small part of this post.

Correcting case 2 to handle overflow:

 #include <limits.h>

 int is_prime(unsigned n) {
     unsigned p;
     if (!(n & 1) || n < 2) return n == 2;

     #define UINT_MAX_SQRT (UINT_MAX >> (sizeof(unsigned)*CHAR_BIT/2))
     unsigned limit = n;
     if (n >= UINT_MAX_SQRT * UINT_MAX_SQRT)
         limit = UINT_MAX_SQRT * UINT_MAX_SQRT - 1;

     for (p = 3; p * p < limit; p += 2)
         if (!(n % p)) return 0;
     if (n != limit)
         if (!(n % p)) return 0;
     return 1;
 }

The limit calculation fails if both sizeof(unsigned) and CHAR_BIT are odd - a rare situation.

+2
Mar 21 '14 at 21:00

About your first question: why is (2) faster than (1)?
Well, it may depend on the compiler. In general, however, division can be expected to be a more expensive operation than multiplication.

About your second question: is sqrt() an exact function?

Generally speaking, yes, it is.
The only case that can cause problems is when sqrt(n) is mathematically an integer.
For example, if n == 9 and sqrt(n) == 2.9999999999999 on your system, then you have a problem, because the integer part is 2 while the exact value is 3.
However, these rare cases can easily be handled by adding a small (but not too small) double constant, say 0.1.
So you can write:

 double stop = sqrt(n) + 0.1;
 for (unsigned int d = 2; d <= stop; ++d)
     if (n % d == 0)
         break;    /* not prime!! */

The added term 0.1 can add one iteration to your algorithm, which is not a big problem at all.

Finally, the obvious choice for your algorithm is (3), i.e. the sqrt() approach, because there is no extra calculation (multiplication or division) in the loop condition, and the stop value is computed only once.

Another improvement you can get is the following:

  • Note that every prime p >= 5 has the form 6n - 1 or 6n + 1.

So, you can alternate the increments of the variable d as 2, 4, 2, 4 and so on.
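
A rough sketch of that stepping (my own illustration, with the 2 and 3 cases handled up front and the usual sqrt bound) could look like this:

 #include <math.h>
 #include <stdint.h>

 /* Sketch: trial division over 6n - 1 and 6n + 1 candidates only,
    i.e. d = 5, 7, 11, 13, 17, 19, ... via alternating increments of 2 and 4. */
 int is_prime_wheel(uint64_t n) {
     if (n < 2) return 0;
     if (n < 4) return 1;                       /* 2 and 3 */
     if (n % 2 == 0 || n % 3 == 0) return 0;

     uint64_t cut = (uint64_t)sqrt((double)n);  /* see the sqrt-accuracy discussion above */
     uint64_t d = 5, step = 2;
     for (; d <= cut; d += step, step = 6 - step)
         if (n % d == 0) return 0;
     return 1;
 }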

+1
Mar 25 '14 at 23:55


