The fastest way to calculate a 128-bit integer modulo a 64-bit integer

I have a 128-bit unsigned integer A and a 64-bit unsigned integer B. What is the fastest way to calculate A % B, the (64-bit) remainder of dividing A by B?

I want to do this in C or assembler, but I need to target the 32-bit x86 platform. This, unfortunately, means that I cannot use compiler support for 128-bit integers, or the x64 architecture's ability to perform the required operation in a single instruction.

Edit:

Thanks for the answers. However, the proposed algorithms seem rather slow. Wouldn't the fastest way to perform a 128-bit by 64-bit division be to use the processor's built-in support for 64-bit by 32-bit division? Does anyone know if there is a way to perform the larger division in terms of several smaller divisions?

Re: How often does B change?

First of all, I am interested in the general solution - what calculation would you perform if A and B can be different each time?

However, the second possible situation is that B does not change as often as A - perhaps there are as many as 200 divisions by each B. How would your answer differ in this case?

+53
c assembly x86 algorithm modulo
Apr 2 '10 at 9:54
13 answers

You can use the division version of Russian peasant multiplication.

To find the remainder, execute (in pseudo-code):

    X = B;
    while (X <= A/2) {
        X <<= 1;
    }
    while (A >= B) {
        if (A >= X)
            A -= X;
        X >>= 1;
    }

The remainder is left in A.

You will need to implement the shifts, comparisons and subtractions to operate on values consisting of a pair of 64-bit numbers, but that is pretty trivial (you should probably implement the left shift by 1 as X + X).

This will loop at most 255 times (with a 128-bit A). Of course you also need a pre-check for a zero divisor.
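In C, with only 64-bit integers available, the pseudo-code above could be sketched roughly like this (the `u128` helper type and all function names here are ours, not from the answer; the test target is a 64-bit compiler, but the same operations decompose onto 32-bit registers the same way):

```c
#include <stdint.h>

/* Hypothetical two-word representation of the 128-bit dividend. */
typedef struct { uint64_t hi, lo; } u128;

static int u128_ge(u128 a, u128 b) {              /* a >= b */
    return a.hi != b.hi ? a.hi > b.hi : a.lo >= b.lo;
}
static u128 u128_add(u128 a, u128 b) {            /* a + b; X+X is the left shift by 1 */
    u128 r = { a.hi + b.hi + (a.lo + b.lo < a.lo), a.lo + b.lo };
    return r;
}
static u128 u128_sub(u128 a, u128 b) {            /* a - b, assuming a >= b */
    u128 r = { a.hi - b.hi - (a.lo < b.lo), a.lo - b.lo };
    return r;
}
static u128 u128_shr1(u128 a) {                   /* a >> 1 */
    u128 r = { a.hi >> 1, (a.lo >> 1) | (a.hi << 63) };
    return r;
}

uint64_t mod128by64(u128 A, uint64_t B) {         /* assumes B != 0 */
    u128 X = { 0, B };
    u128 half = u128_shr1(A);                     /* A / 2 */
    while (u128_ge(half, X))                      /* while (X <= A/2) X <<= 1; */
        X = u128_add(X, X);
    u128 Bw = { 0, B };
    while (u128_ge(A, Bw)) {                      /* while (A >= B) ... */
        if (u128_ge(A, X))
            A = u128_sub(A, X);
        X = u128_shr1(X);
    }
    return A.lo;                                  /* remainder now fits in 64 bits */
}
```

Each pass of the second loop retires one bit of the quotient, which is where the "at most 255 iterations" bound comes from.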

+31
Apr 02 '10 at 12:15

You may be looking for a ready-made program, but the basic algorithms for multi-precision arithmetic can be found in Knuth's The Art of Computer Programming, Volume 2. You can find the division algorithm described online here. The algorithms deal with arbitrary multi-precision arithmetic, so they are more general than you need, but you should be able to simplify them for 128-bit arithmetic done on 64- or 32-bit digits. Be prepared for a reasonable amount of work (a) understanding the algorithm and (b) converting it to C or assembler.

You can also check out Hacker's Delight, which is full of very clever assembler and other low-level hackery, including multi-precision arithmetic.

+13
Apr 6 '10

Given A = AH*2^64 + AL :

    A % B == (((AH % B) * (2^64 % B)) + (AL % B)) % B
          == (((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B

If your compiler supports 64-bit integers, then this is probably the easiest way. MSVC's implementation of 64-bit modulo on 32-bit x86 is hairy loop-filled assembly (VC\crt\src\intel\llrem.asm for the brave), so personally I would go with that.
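As a sanity check of that identity, here it is in C. Note that this uses a compiler-provided 128-bit type for the intermediate product, which the asker does not have; the point is only to illustrate the algebra, and the function name is ours:

```c
#include <stdint.h>

/* (2^64 - B) % B equals 2^64 % B, and 2^64 - B is just 0 - B in uint64_t
   arithmetic, so getting 2^64 % B itself needs no 128-bit value. */
uint64_t mod_via_split(uint64_t AH, uint64_t AL, uint64_t B) {
    uint64_t two64_mod_B = (0 - B) % B;
    /* The product below can exceed 64 bits, which is why applying the
       identity with pure 64-bit arithmetic still takes extra care. */
    unsigned __int128 t = (unsigned __int128)(AH % B) * two64_mod_B + AL % B;
    return (uint64_t)(t % B);
}
```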

+11
Apr 5 '10

This is an almost untested, partially speed-modified Mod128by64 "Russian peasant" version of the algorithm. Unfortunately I am a Delphi user, so this function is written in Delphi. :) But the assembler is almost the same in any case...

    function Mod128by64(Dividend: PUInt128; Divisor: PUInt64): UInt64;
    //In : eax = @Dividend
    //   : edx = @Divisor
    //Out: eax:edx as Remainder
    asm
    //Registers inside routine
    //Divisor = edx:ebp
    //Dividend = bh:ebx:edx //We need 64 bits + 1 bit in bh
    //Result = esi:edi
    //ecx = Loop counter and Dividend index
      push ebx            //Store registers to stack
      push esi
      push edi
      push ebp
      mov ebp, [edx]      //Divisor = edx:ebp
      mov edx, [edx + 4]
      mov ecx, ebp        //Div by 0 test
      or ecx, edx
      jz @DivByZero
      xor edi, edi        //Clear result
      xor esi, esi
    //Start of 64 bit division Loop
      mov ecx, 15         //Load byte loop shift counter and Dividend index
    @SkipShift8Bits:      //Small Dividend numbers shift optimisation
      cmp [eax + ecx], ch //Zero test
      jnz @EndSkipShiftDividend
      loop @SkipShift8Bits //Skip 8 bit loop
    @EndSkipShiftDividend:
      test edx, $FF000000 //Huge Divisor Numbers Shift Optimisation
      jz @Shift8Bits      //This Divisor is > $00FFFFFF:FFFFFFFF
      mov ecx, 8          //Load byte shift counter
      mov esi, [eax + 12] //Do fast 56 bit (7 bytes) shift...
      shr esi, cl         //esi = $00XXXXXX
      mov edi, [eax + 9]  //Load for one byte right shifted 32 bit value
    @Shift8Bits:
      mov bl, [eax + ecx] //Load 8 bits of Dividend
    //Here we can unroll the partial 8 bit division loop to increase execution speed...
      mov ch, 8           //Set partial byte counter value
    @Do65BitsShift:
      shl bl, 1           //Shift dividend left for one bit
      rcl edi, 1
      rcl esi, 1
      setc bh             //Save 65th bit
      sub edi, ebp        //Compare dividend and divisor
      sbb esi, edx        //Subtract the divisor
      sbb bh, 0           //Use 65th bit in bh
      jnc @NoCarryAtCmp   //Test...
      add edi, ebp        //Restore previous dividend state
      adc esi, edx
    @NoCarryAtCmp:
      dec ch              //Decrement counter
      jnz @Do65BitsShift
    //End of 8 bit (byte) partial division loop
      dec cl              //Decrement byte loop shift counter
      jns @Shift8Bits     //Last jump at cl = 0!!!
    //End of 64 bit division loop
      mov eax, edi        //Load result to eax:edx
      mov edx, esi
    @RestoreRegisters:
      pop ebp             //Restore Registers
      pop edi
      pop esi
      pop ebx
      ret
    @DivByZero:
      xor eax, eax        //Here you can raise Div by 0 exception, now function only returns 0.
      xor edx, edx
      jmp @RestoreRegisters
    end;

Perhaps there is one more speed optimization! After the "Huge Divisor Numbers Shift Optimisation" we can test the high bit of the divisor: if it is 0 we do not need to use the extra bh register as the 65th bit. So the unrolled part of the loop could look like this:

      shl bl, 1           //Shift dividend left for one bit
      rcl edi, 1
      rcl esi, 1
      sub edi, ebp        //Compare dividend and divisor
      sbb esi, edx        //Subtract the divisor
      jnc @NoCarryAtCmpX
      add edi, ebp        //Restore previous dividend state
      adc esi, edx
    @NoCarryAtCmpX:
+8
Apr 6 '10 at 4:37

The best solution depends on what exactly you are trying to solve.

E.g. if you are doing arithmetic in a ring modulo a 64-bit integer, then using Montgomery reduction is very efficient. Of course, this assumes that you use the same modulus many times and that it pays off to convert the elements of the ring into a special representation.
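For readers unfamiliar with it, here is a rough sketch of Montgomery reduction for an odd 64-bit modulus m. It uses the compiler's 128-bit type for the double-width products (on the asker's 32-bit target those would themselves have to be synthesized from 32x32 multiplies), and all the names are ours, not from any particular library:

```c
#include <stdint.h>

static uint64_t mont_mprime(uint64_t m) {        /* -m^-1 mod 2^64, m must be odd */
    uint64_t x = m;                              /* Newton iteration doubles the   */
    for (int i = 0; i < 5; i++)                  /* number of correct low bits     */
        x *= 2 - m * x;                          /* each pass: x -> m^-1 mod 2^64  */
    return (uint64_t)0 - x;
}

/* REDC: returns T * 2^-64 mod m, for T < m * 2^64 */
static uint64_t redc(unsigned __int128 T, uint64_t m, uint64_t mprime) {
    uint64_t u = (uint64_t)T * mprime;           /* make the low 64 bits cancel */
    unsigned __int128 t = (T + (unsigned __int128)u * m) >> 64;
    return t >= m ? (uint64_t)(t - m) : (uint64_t)t;
}

/* a*b mod m with a, b already in Montgomery form (aR mod m, bR mod m, R = 2^64) */
uint64_t mont_mul(uint64_t aR, uint64_t bR, uint64_t m, uint64_t mprime) {
    return redc((unsigned __int128)aR * bR, m, mprime);
}
```

Each modular multiplication then costs a few wide multiplies and shifts and no division at all, which is why the conversion into Montgomery form pays off when the same modulus is reused.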




To give a very rough estimate of the speed of this Montgomery reduction: I have an old benchmark that performs a modular exponentiation with a 64-bit modulus and exponent in 1600 ns on a 2.4 GHz Core 2. This exponentiation does about 96 modular multiplications (and modular reductions) and hence needs about 40 cycles per modular multiplication.

+4
Apr 6 '10 at 21:40

I made both versions of the Mod128by64 "Russian peasant" function: classic and speed-optimized. The speed-optimized one can do more than 1,000,000 random calculations per second on my 3 GHz PC and is more than three times faster than the classic function. If we compare the execution time of computing 128-by-64 against 64-by-64-bit modulo, this function is only about 50% slower.

Classic Russian peasant:

    function Mod128by64Clasic(Dividend: PUInt128; Divisor: PUInt64): UInt64;
    //In : eax = @Dividend
    //   : edx = @Divisor
    //Out: eax:edx as Remainder
    asm
    //Registers inside routine
    //edx:ebp = Divisor
    //ecx = Loop counter
    //Result = esi:edi
      push ebx            //Store registers to stack
      push esi
      push edi
      push ebp
      mov ebp, [edx]      //Load divisor to edx:ebp
      mov edx, [edx + 4]
      mov ecx, ebp        //Div by 0 test
      or ecx, edx
      jz @DivByZero
      push [eax]          //Store Dividend to the stack
      push [eax + 4]
      push [eax + 8]
      push [eax + 12]
      xor edi, edi        //Clear result
      xor esi, esi
      mov ecx, 128        //Load shift counter
    @Do128BitsShift:
      shl [esp + 12], 1   //Shift dividend from stack left for one bit
      rcl [esp + 8], 1
      rcl [esp + 4], 1
      rcl [esp], 1
      rcl edi, 1
      rcl esi, 1
      setc bh             //Save 65th bit
      sub edi, ebp        //Compare dividend and divisor
      sbb esi, edx        //Subtract the divisor
      sbb bh, 0           //Use 65th bit in bh
      jnc @NoCarryAtCmp   //Test...
      add edi, ebp        //Restore previous dividend state
      adc esi, edx
    @NoCarryAtCmp:
      loop @Do128BitsShift
    //End of 128 bit division loop
      mov eax, edi        //Load result to eax:edx
      mov edx, esi
    @RestoreRegisters:
      lea esp, [esp + 16] //Release dividend's space on stack
      pop ebp             //Restore Registers
      pop edi
      pop esi
      pop ebx
      ret
    @DivByZero:
      xor eax, eax        //Here you can raise Div by 0 exception, now function only returns 0.
      xor edx, edx
      jmp @RestoreRegisters
    end;

Speed-optimized Russian peasant:

    function Mod128by64Oprimized(Dividend: PUInt128; Divisor: PUInt64): UInt64;
    //In : eax = @Dividend
    //   : edx = @Divisor
    //Out: eax:edx as Remainder
    asm
    //Registers inside routine
    //Divisor = edx:ebp
    //Dividend = ebx:edx //We need 64 bits
    //Result = esi:edi
    //ecx = Loop counter and Dividend index
      push ebx            //Store registers to stack
      push esi
      push edi
      push ebp
      mov ebp, [edx]      //Divisor = edx:ebp
      mov edx, [edx + 4]
      mov ecx, ebp        //Div by 0 test
      or ecx, edx
      jz @DivByZero
      xor edi, edi        //Clear result
      xor esi, esi
    //Start of 64 bit division Loop
      mov ecx, 15         //Load byte loop shift counter and Dividend index
    @SkipShift8Bits:      //Small Dividend numbers shift optimisation
      cmp [eax + ecx], ch //Zero test
      jnz @EndSkipShiftDividend
      loop @SkipShift8Bits //Skip 8 bit loop
    @EndSkipShiftDividend:
      test edx, $FF000000 //Huge Divisor Numbers Shift Optimisation
      jz @Shift8Bits      //This Divisor is > $00FFFFFF:FFFFFFFF
      mov ecx, 8          //Load byte shift counter
      mov esi, [eax + 12] //Do fast 56 bit (7 bytes) shift...
      shr esi, cl         //esi = $00XXXXXX
      mov edi, [eax + 9]  //Load for one byte right shifted 32 bit value
    @Shift8Bits:
      mov bl, [eax + ecx] //Load 8 bit part of Dividend
    //Compute 8 Bits unrolled loop
      shl bl, 1           //Shift dividend left for one bit
      rcl edi, 1
      rcl esi, 1
      jc @DividentAbove0  //dividend hi bit set?
      cmp esi, edx        //dividend hi part larger?
      jb @DividentBelow0
      ja @DividentAbove0
      cmp edi, ebp        //dividend lo part larger?
      jb @DividentBelow0
    @DividentAbove0:
      sub edi, ebp        //Subtract the divisor
      sbb esi, edx
    @DividentBelow0:
      shl bl, 1           //Shift dividend left for one bit
      rcl edi, 1
      rcl esi, 1
      jc @DividentAbove1  //dividend hi bit set?
      cmp esi, edx        //dividend hi part larger?
      jb @DividentBelow1
      ja @DividentAbove1
      cmp edi, ebp        //dividend lo part larger?
      jb @DividentBelow1
    @DividentAbove1:
      sub edi, ebp        //Subtract the divisor
      sbb esi, edx
    @DividentBelow1:
      shl bl, 1           //Shift dividend left for one bit
      rcl edi, 1
      rcl esi, 1
      jc @DividentAbove2  //dividend hi bit set?
      cmp esi, edx        //dividend hi part larger?
      jb @DividentBelow2
      ja @DividentAbove2
      cmp edi, ebp        //dividend lo part larger?
      jb @DividentBelow2
    @DividentAbove2:
      sub edi, ebp        //Subtract the divisor
      sbb esi, edx
    @DividentBelow2:
      shl bl, 1           //Shift dividend left for one bit
      rcl edi, 1
      rcl esi, 1
      jc @DividentAbove3  //dividend hi bit set?
      cmp esi, edx        //dividend hi part larger?
      jb @DividentBelow3
      ja @DividentAbove3
      cmp edi, ebp        //dividend lo part larger?
      jb @DividentBelow3
    @DividentAbove3:
      sub edi, ebp        //Subtract the divisor
      sbb esi, edx
    @DividentBelow3:
      shl bl, 1           //Shift dividend left for one bit
      rcl edi, 1
      rcl esi, 1
      jc @DividentAbove4  //dividend hi bit set?
      cmp esi, edx        //dividend hi part larger?
      jb @DividentBelow4
      ja @DividentAbove4
      cmp edi, ebp        //dividend lo part larger?
      jb @DividentBelow4
    @DividentAbove4:
      sub edi, ebp        //Subtract the divisor
      sbb esi, edx
    @DividentBelow4:
      shl bl, 1           //Shift dividend left for one bit
      rcl edi, 1
      rcl esi, 1
      jc @DividentAbove5  //dividend hi bit set?
      cmp esi, edx        //dividend hi part larger?
      jb @DividentBelow5
      ja @DividentAbove5
      cmp edi, ebp        //dividend lo part larger?
      jb @DividentBelow5
    @DividentAbove5:
      sub edi, ebp        //Subtract the divisor
      sbb esi, edx
    @DividentBelow5:
      shl bl, 1           //Shift dividend left for one bit
      rcl edi, 1
      rcl esi, 1
      jc @DividentAbove6  //dividend hi bit set?
      cmp esi, edx        //dividend hi part larger?
      jb @DividentBelow6
      ja @DividentAbove6
      cmp edi, ebp        //dividend lo part larger?
      jb @DividentBelow6
    @DividentAbove6:
      sub edi, ebp        //Subtract the divisor
      sbb esi, edx
    @DividentBelow6:
      shl bl, 1           //Shift dividend left for one bit
      rcl edi, 1
      rcl esi, 1
      jc @DividentAbove7  //dividend hi bit set?
      cmp esi, edx        //dividend hi part larger?
      jb @DividentBelow7
      ja @DividentAbove7
      cmp edi, ebp        //dividend lo part larger?
      jb @DividentBelow7
    @DividentAbove7:
      sub edi, ebp        //Subtract the divisor
      sbb esi, edx
    @DividentBelow7:
    //End of Compute 8 Bits (unrolled loop)
      dec cl              //Decrement byte loop shift counter
      jns @Shift8Bits     //Last jump at cl = 0!!!
    //End of division loop
      mov eax, edi        //Load result to eax:edx
      mov edx, esi
    @RestoreRegisters:
      pop ebp             //Restore Registers
      pop edi
      pop esi
      pop ebx
      ret
    @DivByZero:
      xor eax, eax        //Here you can raise Div by 0 exception, now function only returns 0.
      xor edx, edx
      jmp @RestoreRegisters
    end;
+4
Apr 10 '10

I would like to share a few thoughts.

It's not as easy as MSN suggests, I'm afraid.

In the expression:

 (((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B 

both the multiplication and the addition can overflow. I think one could take this into account and still use the general idea with some modifications, but something tells me it is going to get really scary.

I was curious how the 64-bit modulo operation was implemented in MSVC, so I tried to find out. I don't really know assembly, and all I had was the Express Edition, without the source of VC\crt\src\intel\llrem.asm, but I think I managed to get some idea of what is going on after playing a bit with the debugger and the disassembly output. I tried to figure out how the remainder is calculated in the case of positive integers and a divisor >= 2^32. There is of course some code that deals with negative numbers, but I did not dig into that.

Here is how I see it:

If the divisor is >= 2^32, then both the dividend and the divisor are shifted right as much as necessary to fit the divisor into 32 bits. In other words: if it takes n digits to write the divisor down in binary and n > 32, the n-32 least significant bits of both the divisor and the dividend are discarded. After that, the division is performed using the hardware support for dividing 64-bit integers by 32-bit ones. The result may be incorrect, but I think it can be proved that it can be off by at most 1. After the division, the divisor (the original one) is multiplied by the result and the product is subtracted from the dividend. Then it is corrected by adding or subtracting the divisor if necessary (if the division result was off by one).
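A C model of the procedure just described, for a 64-bit dividend and a divisor >= 2^32. The function name is ours, and the 128-bit type is used only to model the multiply-subtract-correct step (in the real asm that product would be tracked in register pairs):

```c
#include <stdint.h>

uint64_t mod64_by_truncation(uint64_t A, uint64_t B) { /* requires B >= 2^32 */
    int n = 64 - __builtin_clzll(B);    /* number of binary digits of B */
    int s = n - 32;                     /* discard the s low bits of both */
    uint32_t b = (uint32_t)(B >> s);    /* divisor now fits in 32 bits */
    uint64_t q = (A >> s) / b;          /* models the hardware 64b/32b division */
    /* truncating both operands can only make q too large, never too small,
       so multiply back and correct downward */
    unsigned __int128 p = (unsigned __int128)q * B;
    while (p > A) { q--; p -= B; }
    return (uint64_t)(A - p);
}
```

Since b keeps the divisor's leading bit, the estimated quotient is at most a tiny amount too large (the claim above is: off by at most 1), so the correction loop runs at most once or twice.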

It is easy to divide a 128-bit integer by a 32-bit one using the hardware support for 64-bit by 32-bit division. In case the divisor is < 2^32, you can calculate the remainder performing only 4 divisions as follows:

Suppose a dividend is stored in:

 DWORD dividend[4] = ... 

the remainder will end up in:

    DWORD remainder;

    1) Divide dividend[3] by divisor. Store the remainder in remainder.
    2) Divide QWORD (remainder:dividend[2]) by divisor. Store the remainder in remainder.
    3) Divide QWORD (remainder:dividend[1]) by divisor. Store the remainder in remainder.
    4) Divide QWORD (remainder:dividend[0]) by divisor. Store the remainder in remainder.

After these 4 steps the variable remainder will hold what you are looking for. (Please don't kill me if I got the endianness wrong; I am not even a programmer.)
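The four steps could look like this in C (the function name is ours; the 64-bit % below stands in for the hardware 64b/32b div, and since the running remainder is always below the divisor, each step's quotient fits in 32 bits, so the hardware instruction would never fault):

```c
#include <stdint.h>

typedef uint32_t DWORD;

DWORD mod128by32(const DWORD dividend[4], DWORD divisor) {
    /* dividend[3] holds the most significant 32 bits */
    DWORD rem = dividend[3] % divisor;
    for (int i = 2; i >= 0; i--) {
        /* QWORD (remainder:dividend[i]) */
        uint64_t qword = ((uint64_t)rem << 32) | dividend[i];
        rem = (DWORD)(qword % divisor);
    }
    return rem;
}
```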

In case the divisor is greater than 2^32-1, I don't have good news. I don't have a full proof that the result after the shift is off by no more than 1 in the procedure described above, which I believe MSVC uses. I think however that it has something to do with the fact that the part that is discarded is at least 2^31 times smaller than the divisor, the dividend is less than 2^64, and the divisor is greater than 2^32-1, so the result is less than 2^32.

If the dividend has 128 bits, the trick of discarding bits won't work. So in the general case the best solution is probably the one proposed by GJ or caf. (Well, it would probably be the best even if discarding bits worked. Division, multiply-subtract and correction on a 128-bit integer might be slower.)

I was also thinking about using the floating point hardware. The x87 floating point unit uses an 80-bit precision format with a 64-bit fraction. I think one can get the exact result of a 64-bit by 64-bit division from it. (Not the remainder directly, but the remainder via multiplication and subtraction as in the "MSVC procedure".) If the dividend is >= 2^64 and < 2^128, storing it in the floating point format seems similar to discarding the least significant bits in the "MSVC procedure". Maybe someone can prove that the error in that case is bounded and finds it useful. I have no idea whether it has a chance to be faster than GJ's solution, but maybe it's worth a try.

+4
Apr 10 '10 at 23:00

@caf's accepted answer was really nice and highly appreciated, yet it contained a bug that went unnoticed for years.

To test this and other solutions, I am posting a test harness and making this answer community wiki.

    #include <assert.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    unsigned cafMod(unsigned A, unsigned B) {
        assert(B);
        unsigned X = B;
        // while (X < A / 2) {  // Original code used <
        while (X <= A / 2) {
            X <<= 1;
        }
        while (A >= B) {
            if (A >= X) A -= X;
            X >>= 1;
        }
        return A;
    }

    void cafMod_test(unsigned num, unsigned den) {
        if (den == 0) return;
        unsigned y0 = num % den;
        unsigned y1 = cafMod(num, den);
        if (y0 != y1) {
            printf("FAIL num:%x den:%x %x %x\n", num, den, y0, y1);
            fflush(stdout);
            exit(-1);
        }
    }

    unsigned rand_unsigned() {
        unsigned x = (unsigned) rand();
        return x * 2 ^ (unsigned) rand();
    }

    void cafMod_tests(void) {
        const unsigned i[] = { 0, 1, 2, 3, 0x7FFFFFFF, 0x80000000,
                UINT_MAX - 3, UINT_MAX - 2, UINT_MAX - 1, UINT_MAX };
        for (unsigned den = 0; den < sizeof i / sizeof i[0]; den++) {
            if (i[den] == 0) continue;
            for (unsigned num = 0; num < sizeof i / sizeof i[0]; num++) {
                cafMod_test(i[num], i[den]);
            }
        }
        cafMod_test(0x8711dd11, 0x4388ee88);
        cafMod_test(0xf64835a1, 0xf64835a);

        time_t t;
        time(&t);
        srand((unsigned) t);
        printf("%u\n", (unsigned) t);
        fflush(stdout);
        for (long long n = 10000LL * 1000LL * 1000LL; n > 0; n--) {
            cafMod_test(rand_unsigned(), rand_unsigned());
        }
        puts("Done");
    }

    int main(void) {
        cafMod_tests();
        return 0;
    }
+4
Aug 31 '16 at 19:00

I know the question asks for 32-bit code, but the answer for 64-bit may be useful or interesting to others.

And yes, 64b/32b => 32b division does make a useful building block for 128b % 64b => 64b. libgcc's __umoddi3 (source linked below) gives an idea of how to do that sort of thing, but it only implements 2N % 2N => 2N on top of a 2N / N => N division, not 4N % 2N => 2N.

Even wider multi-precision libraries are available, for example https://gmplib.org/manual/Integer-Division.html#Integer-Division.




GNU C on 64-bit machines does provide the __int128 type and libgcc functions for multiplication and division as efficiently as possible on the target architecture.

x86-64's div r/m64 instruction does a 128b / 64b => 64b division (also producing the remainder as a second output), but it faults if the quotient overflows. So you can't directly use it if A/B > 2^64-1, but you can get gcc to use it for you (or even inline the same code that libgcc uses).




This compiles (see it on the Godbolt compiler explorer) to one or two div instructions, which happen inside the libgcc function call. If there were a faster way, libgcc would probably use that instead. (Godbolt's binary mode, which links and disassembles, apparently links a shared libgcc, because we get a call to __umodti3 that goes through the PLT; most systems link a static libgcc.a, so __umodti3 is part of the binary.)

    #include <stdint.h>

    uint64_t AmodB(unsigned __int128 A, uint64_t B) {
        return A % B;
    }

It calls the __umodti3 function, which computes a full 128b % 128b modulo, but the implementation of that function does check for the special case where the high half of the divisor is 0, as you can see in the libgcc source. (libgcc builds the si/di/ti versions of the function from that code, as appropriate for the target architecture. udiv_qrnnd is an inline asm macro that does unsigned 2N / N => N division for the target architecture.)

For x86-64 (and other architectures with a hardware division instruction), the fast path (when high_half(A) < B, guaranteeing that div doesn't fault) is just a couple of branches, some fluff for out-of-order CPUs to chew through, and a single div r64 instruction, which takes about 50-100 cycles on modern x86 CPUs according to Agner Fog's insn tables. Some other work can happen in parallel with the div, but the integer division unit is not very pipelined, and div decodes to a lot of uops (unlike FP division).

So one div is enough when the quotient is known to fit in 64 bits. When high_half(A) >= B it isn't, and the high half has to be reduced modulo B first (one extra division), after which the (AH%B):AL by B division no longer overflows.

That is the same structure you can see in libgcc's __umodti3 / __udivmoddi4 code linked above.




Dividing by a constant B

If B is a compile-time constant, the compiler turns the division into a multiplication by a fixed-point inverse, avoiding div entirely. gcc does this for 64-bit types, but with a 128-bit dividend it just calls the library function instead of applying the same trick.

    uint64_t modulo_by_constant64(uint64_t A) { return A % 0x12345678ABULL; }

        movabs rdx, -2233785418547900415
        mov rax, rdi
        mul rdx
        mov rax, rdx        # wasted instruction, could have kept using RDX.
        movabs rdx, 78187493547
        shr rax, 36         # division result
        imul rax, rdx       # multiply and subtract to get the modulo
        sub rdi, rax
        mov rax, rdi
        ret

x86's mul r64 does a 64b * 64b => 128b (rdx:rax) multiply, and a 128b * 128b => 256b multiply can be built out of several of them. Only the high half of the 256b result would actually be needed here.

Intel's mul is fast (a few cycles of latency, fully pipelined), so multiplying by an inverse beats div by a wide margin. The inverse could even be computed at run time, e.g. by a JIT compiler, once B becomes known.

Computing a fixed-point inverse at run time only pays off if the same B is reused enough times; with the ~200 divisions per B mentioned in the edited question it probably would, but IDK of an existing 128-bit / 64-bit implementation of the technique.

libdivide implements exactly this run-time-precomputed-inverse idea for 32-bit and 64-bit divisors. It doesn't handle 128-bit dividends, but it is a good starting point.
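For reference, the gcc output above translates back into C roughly as follows. The magic constant is the one from the asm (2^64 - 2233785418547900415, gcc's precomputed fixed-point inverse for 78187493547); everything else mirrors the mul / shr / imul / sub sequence:

```c
#include <stdint.h>

uint64_t modulo_by_constant64(uint64_t A) {
    /* -2233785418547900415 reinterpreted as unsigned, i.e. ~ 2^100 / 78187493547 */
    const uint64_t magic = 16212958655161651201ULL;
    /* high half of the 64x64->128 product, then shift: q = A / 78187493547 */
    uint64_t q = (uint64_t)(((unsigned __int128)A * magic) >> 64) >> 36;
    return A - q * 78187493547ULL;   /* multiply and subtract to get the modulo */
}
```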

+4
01 '16 at 8:22

As a rule of thumb, division is slow and multiplication is fast, so the usual trick is to avoid dividing altogether and multiply by the reciprocal instead.

So, to compute A % B, you first need an approximation of 1/B, obtained with an iterative method such as Newton-Raphson.

Such division algorithms are described at http://en.wikipedia.org/wiki/Division_(digital)

Then the quotient is Q = A * 1/B,

and the remainder is R = A - Q * B.

The reciprocal has to be carried to enough precision for the operand sizes involved (32-, 64- and 128-bit values are all in play here), and the estimated quotient may need a final correction step.

If B is constant, the reciprocal can be precomputed once, and each subsequent operation reduces to a multiply and a subtract.
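The Q = A * 1/B, R = A - Q * B idea can be made concrete for plain 64-bit operands like this. This is a sketch under our own assumptions (B nonzero and below about 2^62, so the signed error bookkeeping below cannot overflow; the function name is ours):

```c
#include <stdint.h>

uint64_t mod_via_reciprocal(uint64_t A, uint64_t B) {
    if (B == 1) return 0;                 /* avoid Q estimate reaching 2^64 */
    double inv = 1.0 / (double)B;         /* ~53-bit approximation of 1/B */
    uint64_t Q = (uint64_t)((double)A * inv);  /* quotient estimate, off by a few */
    /* A - Q*B wraps mod 2^64, but the true signed error is small,
       so reinterpreting it as int64_t recovers it exactly */
    int64_t R = (int64_t)(A - Q * B);
    while (R < 0)           { R += (int64_t)B; }   /* correct the estimate */
    while (R >= (int64_t)B) { R -= (int64_t)B; }
    return (uint64_t)R;
}
```

The correction loops normally run only a handful of times, since the absolute error of the estimate is on the order of A * 2^-51 plus one quotient step.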

Hope this helps.

+2
08 '10 at 11:03

Since you are on x86, SSE2 and its 128-bit registers might look tempting for holding the operands, but there is no 128-bit integer division in SSE2, so the work would still have to be decomposed into smaller x86 operations.

+1
02 '10 at 20:08

If you can accept a divisor of at most 63 bits, here is a plain C solution; the 63-bit limit keeps all intermediate sums below 2^64.

It is based on the same split as MSN's identity, but reduces modulo div at every step (doubling a and conditionally subtracting div), so nothing can overflow.

It needs only 64-bit arithmetic; the 64-bit % in the last line is the only real division, and it maps onto a single div.

    #include <stdint.h>

    /* remainder of (upper * 2^64 + lower) % div, for div < 2^63 */
    uint64_t mod128(uint64_t upper, uint64_t lower, uint64_t div) {
        uint64_t result = 0;
        uint64_t a = (~0ULL % div) + 1;   /* a = 2^64 % div */
        if (a >= div) a -= div;           /* handles div being a power of two */
        upper %= div;

        /* the resulting bit-length determines the number of cycles required */
        /* first we work out the modular multiplication (2^64 * upper) % div */
        while (upper != 0) {
            if ((upper & 1) == 1) {
                result += a;
                if (result >= div) result -= div;
            }
            a <<= 1;
            if (a >= div) a -= div;
            upper >>= 1;
        }

        /* add up the 2 results and return the modulus */
        if (lower > div) lower -= div;
        return (lower + result) % div;
    }

Each loop iteration handles 1 bit of the 64-bit upper half (at most 64 iterations of the shift-and-add loop), so no 128-bit operations are needed anywhere.

It is certainly not the fastest approach here, but it is simple and portable.

+1
03 '16 at 10:49

Since C gives you no 128-bit type here, keep A in two halves and treat B (the 64-bit value) as an unsigned long long int; then repeatedly subtract B from A until A becomes smaller than B, at which point A is the remainder.

You could precompute multiples Bx2, Bx3, Bx4, ... and each time subtract the largest multiple of B that still fits in A, so the number of (A-B) subtractions drops; doubling gives you the powers of 2.

Or am I missing something?

-2
02 '10 at 11:04


