A64 Neon SIMD - 256-bit comparison

I would like to compare two little-endian 256-bit values efficiently with A64 Neon (asm) instructions.
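The UInt256 type used below is not spelled out in this post; for the snippets to make sense, assume a layout of four 64-bit limbs, least significant first (this definition is just an assumption for illustration):

  #include <stdint.h>

  // Assumed layout: four 64-bit limbs, value[0] is the least significant.
  typedef struct {
      uint64_t value[4];
  } UInt256;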


Equality (=)

For equality, I already got a solution:

bool eq256(const UInt256 *lhs, const UInt256 *rhs) {
  bool result;

First, load the two values into the SIMD registers.

  __asm__("ld1.2d { v0, v1 }, %1 \n\t" "ld1.2d { v2, v3 }, %2 \n\t" 

Compare each 64-bit limb of the two values with each other. This yields -1 (all bits set) for limbs that are equal, and 0 (all bits cleared) for limbs that differ.

  "cmeq.2d v0, v0, v2 \n\t" "cmeq.2d v1, v1, v3 \n\t" 

Reduce the result from 2 vectors to 1 vector, keeping a "0 (all bits cleared)" element if there is one.

  "uminp.16b v0, v0, v1 \n\t" 

Reduce the result from 1 vector to 1 byte, keeping a zero byte if there is one.

  "uminv.16b b0, v0 \n\t" 

Move it to an ARM general-purpose register, then compare against 0xFF. This yields the result.

  "umov %w0, v0.b[0] \n\t" "cmp %w0, 0xFF \n\t" "cset %w0, eq " : "=r" (result) : "m" (*lhs->value), "m" (*rhs->value) : "v0", "v1", "v2", "v3", "cc"); return result; } 

Questions

  • Is this more efficient than doing 4 comparisons with regular old ARM registers?

    • e.g. is there a source that quotes timings for the different operations? I'm doing this on the iPhone 5s.
  • Is there any way to optimize this even further? I think I spend many cycles just reducing the whole vector to one scalar Boolean. (One alternative reduction is sketched after this list.)
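One alternative reduction I can think of (untested sketch, intrinsics from <arm_neon.h> as above): XOR the operands, OR the two halves together, and test the two 64-bit lanes in general-purpose registers instead of doing a cross-lane min:

  // Sketch only: equality via XOR/OR, avoiding the cross-lane min reduction.
  static bool eq256_xor(const UInt256 *lhs, const UInt256 *rhs) {
      uint64x2_t d01 = veorq_u64(vld1q_u64(&lhs->value[0]), vld1q_u64(&rhs->value[0]));
      uint64x2_t d23 = veorq_u64(vld1q_u64(&lhs->value[2]), vld1q_u64(&rhs->value[2]));
      uint64x2_t d   = vorrq_u64(d01, d23);        // nonzero lane => some limb differs
      return (vgetq_lane_u64(d, 0) | vgetq_lane_u64(d, 1)) == 0;
  }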


Less than comparison (<)

We denote the two ints as tuples of 64-bit limbs (little-endian):

  • lhs = (l0, l1, l2, l3)
  • rhs = (r0, r1, r2, r3)

Then lhs < rhs if this expression is true:

   (l3 < r3) & 1         & 1         & 1
 | (l3 = r3) & (l2 < r2) & 1         & 1
 | (l3 = r3) & (l2 = r2) & (l1 < r1) & 1
 | (l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
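As a plain-C reference (untested sketch, assuming the UInt256 layout above), the expression transcribes directly into branch-free code:

  // Sketch only: direct transcription of the formula above.
  static bool lt256_reference(const UInt256 *lhs, const UInt256 *rhs) {
      const uint64_t *l = lhs->value, *r = rhs->value;
      return  (l[3] <  r[3])
           | ((l[3] == r[3]) & (l[2] <  r[2]))
           | ((l[3] == r[3]) & (l[2] == r[2]) & (l[1] <  r[1]))
           | ((l[3] == r[3]) & (l[2] == r[2]) & (l[1] == r[1]) & (l[0] < r[0]));
  }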

Now SIMD instructions can be used to calculate several of these operands at once. Assuming (l0, l1), (l2, l3), (r0, r1), (r2, r3) is how the two 256-bit numbers are stored, we can easily get all the necessary values (an intrinsics sketch follows the list):

  • cmlo.2d => (l0 < r0), (l1 < r1)
  • cmlo.2d => (l2 < r2), (l3 < r3)
  • cmeq.2d => (l0 = r0), (l1 = r1)
  • cmeq.2d => (l2 = r2), (l3 = r3)
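For illustration, the same four comparisons expressed with <arm_neon.h> intrinsics (untested sketch; vcltq_u64 corresponds to cmlo, i.e. cmhi with swapped operands; the helper name lt256_masks is made up for this example):

  // Sketch only: compute the four comparison vectors from the list above.
  static void lt256_masks(const UInt256 *lhs, const UInt256 *rhs,
                          uint64x2_t out[4]) {
      uint64x2_t l01 = vld1q_u64(&lhs->value[0]);   // (l0, l1)
      uint64x2_t l23 = vld1q_u64(&lhs->value[2]);   // (l2, l3)
      uint64x2_t r01 = vld1q_u64(&rhs->value[0]);   // (r0, r1)
      uint64x2_t r23 = vld1q_u64(&rhs->value[2]);   // (r2, r3)
      out[0] = vcltq_u64(l01, r01);                 // (l0 < r0), (l1 < r1)
      out[1] = vcltq_u64(l23, r23);                 // (l2 < r2), (l3 < r3)
      out[2] = vceqq_u64(l01, r01);                 // (l0 = r0), (l1 = r1)
      out[3] = vceqq_u64(l23, r23);                 // (l2 = r2), (l3 = r3)
  }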

Questions

  • With these values in four SIMD registers, I now wonder what the best strategy is to apply the & and | operators and then reduce the result to one Boolean.

Update

I just hacked together a working implementation for less-than.

Basically, I replaced the 1s above with a duplicated condition, because A & A == A & 1.

Then I lay out the three 2x2 squares in my matrix and bitwise AND them. Then I reduce with bitwise ORs: first from two vectors to one vector, then to one byte, then copy to an ARM register and compare against 0xFF. The same pattern as for equality above.

The questions above remain valid. I'm not sure the code is optimal, and I wonder whether I have missed some general SIMD patterns to do this kind of thing more efficiently. Also: is NEON worth it at all for cases like this, where the input operands come from memory?

 bool lt256(const UInt256 *lhs, const UInt256 *rhs) {
   bool result;
   __asm__(// (l3 < r3) & (l3 < r3) |
           // (l3 = r3) & (l2 < r2) |
           // (l3 = r3) & (l2 = r2) & (l1 < r1) & (l1 < r1) |
           // (l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
           "ld1.2d { v0, v1 }, %1 \n\t"
           "ld1.2d { v2, v3 }, %2 \n\t"
           // v0: [ l3 = r3 ] [ l2 = r2 ]
           // v1: [ l0 < r0 ] [ l1 < r1 ]
           // v2: [ l0 = r0 ] [ l1 = r1 ]
           // v3: [ l2 < r2 ] [ l3 < r3 ]
           // v4: [ l2 = r2 ] [ l3 = r3 ]
           "cmeq.2d v4, v1, v3 \n\t"
           "cmlo.2d v3, v1, v3 \n\t"
           "cmlo.2d v1, v0, v2 \n\t"
           "cmeq.2d v2, v0, v2 \n\t"
           "ext.16b v0, v4, v4, 8 \n\t"
           // v2: [ l1 < r1 ] [ l1 = r1 ]
           // v1: [ l1 < r1 ] [ l0 < r0 ]
           "trn2.2d v2, v1, v2 \n\t"
           "ext.16b v1, v1, v1, 8 \n\t"
           // v1: [ l1 < r1 & l1 < r1 ] [ l1 = r1 & l0 < r0 ]
           "and.16b v1, v2, v1 \n\t"
           // v2: [ l3 < r3 ] [ l3 = r3 ]
           // v3: [ l3 < r3 ] [ l2 < r2 ]
           "ext.16b v2, v3, v0, 8 \n\t"
           "ext.16b v3, v3, v3, 8 \n\t"
           // v3: [ l3 < r3 & l3 < r3 ] [ l3 = r3 & l2 < r2 ]
           "and.16b v3, v2, v3 \n\t"
           // v2: [ l3 = r3 ] [ l3 = r3 ]
           // v4: [ l2 = r2 ] [ l2 = r2 ]
           "ext.16b v2, v4, v0, 8 \n\t"
           "ext.16b v4, v0, v4, 8 \n\t"
           // v2: [ l3 = r3 & l2 = r2 ] [ l3 = r3 & l2 = r2 ]
           "and.16b v2, v2, v4 \n\t"
           // v1: [ l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
           //     [ l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
           "and.16b v1, v2, v1 \n\t"
           // v1: [ l3 < r3 & l3 < r3 |
           //       l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
           //     [ l3 = r3 & l2 < r2 |
           //       l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
           "orr.16b v1, v3, v1 \n\t"
           // b1: [ l3 < r3 & l3 < r3 |
           //       l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 |
           //       l3 = r3 & l2 < r2 |
           //       l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
           "umaxv.16b b1, v1 \n\t"
           "umov %w0, v1.b[0] \n\t"
           "cmp %w0, 0xFF \n\t"
           "cset %w0, eq"
           : "=r" (result)
           : "m" (*lhs->value), "m" (*rhs->value)
           : "v0", "v1", "v2", "v3", "v4", "cc");
   return result;
 }
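For a scalar baseline (untested sketch, helper name made up, assuming the UInt256 layout above), the less-than can also be written as a borrow chain in plain C; this in principle maps onto the A64 carry flag (cmp/sbcs), though I have not checked what the compiler actually emits:

  // Sketch only: lhs < rhs iff subtracting rhs from lhs borrows out of the top limb.
  static bool lt256_borrow(const UInt256 *lhs, const UInt256 *rhs) {
      int borrow = 0;
      for (int i = 0; i < 4; ++i) {
          uint64_t l = lhs->value[i], r = rhs->value[i];
          // Limb i borrows iff l < r + borrow (evaluated without overflow).
          borrow = (l < r) || (l == r && borrow);
      }
      return borrow != 0;   // borrow out of limb 3 means lhs < rhs
  }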
1 answer

Benchmarked with XCTest measureMetrics using a Swift test harness. Two 256-bit Ints are allocated. Then the operation is performed 100 million times on the same two Ints, the measurement is stopped, and each limb of the two Ints is assigned a new random value using arc4random. A second run is performed with Instruments attached, and the CPU time distribution is noted for each instruction as a comment next to it.


Equality (==)

For equality, SIMD seems to lose out when the result is transferred from the SIMD registers back to an ARM general-purpose register. SIMD is probably only worth it when the result is used in further SIMD calculations, or when integers longer than 256 bits are involved (ld1 seems to be faster than ldp, though).

  • SIMD

     bool result; __asm__("ld1.2d { v0, v1 }, %1 \n\t" // 5.1% "ld1.2d { v2, v3 }, %2 \n\t" // 26.4% "cmeq.2d v0, v0, v2 \n\t" "cmeq.2d v1, v1, v3 \n\t" "uminp.16b v0, v0, v1 \n\t" // 4.0% "uminv.16b b0, v0 \n\t" // 26.7% "umov %w0, v0.b[0] \n\t" // 32.9% "cmp %w0, 0xFF \n\t" // 0.0% "cset %w0, eq " : "=r" (result) : "m" (*lhs->value), "m" (*rhs->value) : "v0", "v1", "v2", "v3", "cc"); return result; // 4.9% ("ret") 

    measured [Mean time, seconds]: 11.558, relative standard deviation: 0.065%, values: [11.572626, 11.560558, 11.549322, 11.568718, 11.558530, 11.550490, 11.557086, 11.551803, 11.557529, 11.549782]

  • Standard

    This is the winner. The ccmp instruction really shines here :-) It is clear, though, that the problem is memory bound.

     bool result; __asm__("ldp x8, x9, %1 \n\t" // 33.4% "ldp x10, x11, %2 \n\t" "cmp x8, x10 \n\t" "ccmp x9, x11, 0, eq \n\t" "ldp x8, x9, %1, 16 \n\t" // 34.1% "ldp x10, x11, %2, 16 \n\t" "ccmp x8, x10, 0, eq \n\t" // 32.6% "ccmp x9, x11, 0, eq \n\t" "cset %w0, eq \n\t" : "=r" (result) : "m" (*lhs->value), "m" (*rhs->value) : "x8", "x9", "x10", "x11", "cc"); return result; 

    measured [Mean time, seconds]: 11.146, relative standard deviation: 0.034%, values: [11.149754, 11.142854, 11.146840, 11.149392, 11.141254, 11.148708, 11.142293, 11.150491, 11.139593, 11.145873]

  • C

    LLVM does not detect that ccmp is a good instruction to use here, so the result is slower than the asm version above.

     return lhs->value[0] == rhs->value[0]
          & lhs->value[1] == rhs->value[1]
          & lhs->value[2] == rhs->value[2]
          & lhs->value[3] == rhs->value[3];

    Compiles to:

     ldp  x8, x9, [x0]          // 24.1%
     ldp  x10, x11, [x1]        //  0.1%
     cmp  x8, x10               //  0.4%
     cset w8, eq                //  1.0%
     cmp  x9, x11               // 23.7%
     cset w9, eq
     and  w8, w8, w9            //  0.1%
     ldp  x9, x10, [x0, #16]
     ldp  x11, x12, [x1, #16]   // 24.8%
     cmp  x9, x11
     cset w9, eq                //  0.2%
     and  w8, w8, w9
     cmp  x10, x12              //  0.3%
     cset w9, eq                // 25.2%
     and  w0, w8, w9
     ret                        //  0.1%

    measured [Mean time, seconds]: 11.531, relative standard deviation: 0.040%, values: [11.525511, 11.529820, 11.541940, 11.531776, 11.533287, 11.526628, 11.531392, 11.526037, 11.531784, 11.533786]


Less than (<)

(to be determined later; will be updated)
