I would like to efficiently compare two little-endian 256-bit values with A64 NEON (asm) instructions.
Equality (=)
For equality, I already have a solution:
```c
bool eq256(const UInt256 *lhs, const UInt256 *rhs) {
    bool result;
```
First, load the two values into SIMD registers.
```c
    __asm__("ld1.2d { v0, v1 }, %1 \n\t"
            "ld1.2d { v2, v3 }, %2 \n\t"
```
Compare each 64-bit limb of the values with each other. This yields -1 (all bits set) for limbs that are equal, and 0 (all bits cleared) for limbs that differ.
```c
            "cmeq.2d v0, v0, v2 \n\t"
            "cmeq.2d v1, v1, v3 \n\t"
```
Reduce the result from 2 vectors to 1 vector, keeping any lane that contains 0 (all bits cleared), if there is one.
```c
            "uminp.16b v0, v0, v1 \n\t"
```
Reduce the result from 1 vector to 1 byte, keeping a zero byte if there is one.
```c
            "uminv.16b b0, v0 \n\t"
```
Move to an ARM general-purpose register, then compare against 0xFF. This is the result.
```c
            "umov %w0, v0.b[0] \n\t"
            "cmp %w0, 0xFF \n\t"
            "cset %w0, eq "
            : "=r" (result)
            : "m" (*lhs->value), "m" (*rhs->value)
            : "v0", "v1", "v2", "v3", "cc");
    return result;
}
```
Questions
Is this more efficient than doing 4 comparisons with plain old ARM registers?

- E.g., is there a source that quotes timings for the different operations? I'm doing this on an iPhone 5s.

Is there any way to optimize this even further? I think I spend many cycles just reducing the whole vector to a single scalar boolean.
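For reference, the plain-registers alternative the first question asks about can be sketched in portable C. This is an assumption-laden sketch: the `UInt256` layout (a `value` array of four little-endian `uint64_t` limbs) is inferred from the asm constraints, and `eq256_scalar` is my own name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, inferred from the asm: four little-endian 64-bit limbs. */
typedef struct { uint64_t value[4]; } UInt256;

/* Equality via four 64-bit compares in general-purpose registers:
 * OR the XORed limbs together; the result is 0 iff every limb matches. */
static bool eq256_scalar(const UInt256 *lhs, const UInt256 *rhs) {
    return ((lhs->value[0] ^ rhs->value[0]) |
            (lhs->value[1] ^ rhs->value[1]) |
            (lhs->value[2] ^ rhs->value[2]) |
            (lhs->value[3] ^ rhs->value[3])) == 0;
}
```

A compiler turns this into four pairs of loads, four EORs and three ORRs plus a compare, all on the integer pipes, which is the baseline the NEON version has to beat.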
Less than comparison (<)
We denote the two ints as tuples of 64-bit limbs (little-endian):
- lhs = (l0, l1, l2, l3)
- rhs = (r0, r1, r2, r3)
Then lhs < rhs holds if this expression is true:

    (l3 < r3) &     1     &     1     &     1     |
    (l3 = r3) & (l2 < r2) &     1     &     1     |
    (l3 = r3) & (l2 = r2) & (l1 < r1) &     1     |
    (l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
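A literal transcription of this expression into portable C may help check the logic (a reference sketch only; `lt256_ref` is a hypothetical name, and limb 0 is the least significant as above):

```c
#include <stdbool.h>
#include <stdint.h>

/* Direct transcription of the expression above: compare limbs from the
 * most significant downward, falling through on equality. */
static bool lt256_ref(const uint64_t l[4], const uint64_t r[4]) {
    return (l[3] < r[3])
         | ((l[3] == r[3]) & (l[2] < r[2]))
         | ((l[3] == r[3]) & (l[2] == r[2]) & (l[1] < r[1]))
         | ((l[3] == r[3]) & (l[2] == r[2]) & (l[1] == r[1]) & (l[0] < r[0]));
}
```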
Now SIMD instructions can be used to compute several of these operands at once. Assuming the two 256-bit numbers are stored as (l0, l1), (l2, l3) and (r0, r1), (r2, r3), we can easily obtain all the required values:

- cmlo.2d => (l0 < r0), (l1 < r1)
- cmlo.2d => (l2 < r2), (l3 < r3)
- cmeq.2d => (l0 = r0), (l1 = r1)
- cmeq.2d => (l2 = r2), (l3 = r3)
Questions
- With these values in four SIMD registers, I now wonder what the best strategy is for applying the & and | operators and then reducing the result to a single boolean.
Update
I just arrived at a working implementation for less-than.

Basically, I replaced the 1s above with a duplicated condition, because A & A == A & 1.

Then I lay out three 2x2 squares of my matrix and bitwise-AND them. Now I reduce with bitwise ORs: first from two vectors to one vector, then to one byte, then copy it to an ARM register and compare against 0xFF. The same pattern as for equality above.

The questions above remain valid. I'm not sure if the code is optimal, and I wonder whether I've missed some general SIMD patterns that would make this kind of thing more efficient. Also: is NEON even worth it for cases like this where the input operands come from memory?
```c
bool lt256(const UInt256 *lhs, const UInt256 *rhs) {
    bool result;
    __asm__(// (l3 < r3) & (l3 < r3) |
            // (l3 = r3) & (l2 < r2) |
            // (l3 = r3) & (l2 = r2) & (l1 < r1) & (l1 < r1) |
            // (l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
            "ld1.2d { v0, v1 }, %1 \n\t"
            "ld1.2d { v2, v3 }, %2 \n\t"
            // v0: [ l3 = r3 ] [ l2 = r2 ]
            // v1: [ l0 < r0 ] [ l1 < r1 ]
            // v2: [ l0 = r0 ] [ l1 = r1 ]
            // v3: [ l2 < r2 ] [ l3 < r3 ]
            // v4: [ l2 = r2 ] [ l3 = r3 ]
            "cmeq.2d v4, v1, v3 \n\t"
            "cmlo.2d v3, v1, v3 \n\t"
            "cmlo.2d v1, v0, v2 \n\t"
            "cmeq.2d v2, v0, v2 \n\t"
            "ext.16b v0, v4, v4, 8 \n\t"
            // v2: [ l1 < r1 ] [ l1 = r1 ]
            // v1: [ l1 < r1 ] [ l0 < r0 ]
            "trn2.2d v2, v1, v2 \n\t"
            "ext.16b v1, v1, v1, 8 \n\t"
            // v1: [ l1 < r1 & l1 < r1 ] [ l1 = r1 & l0 < r0 ]
            "and.16b v1, v2, v1 \n\t"
            // v2: [ l3 < r3 ] [ l3 = r3 ]
            // v3: [ l3 < r3 ] [ l2 < r2 ]
            "ext.16b v2, v3, v0, 8 \n\t"
            "ext.16b v3, v3, v3, 8 \n\t"
            // v3: [ l3 < r3 & l3 < r3 ] [ l3 = r3 & l2 < r2 ]
            "and.16b v3, v2, v3 \n\t"
            // v2: [ l3 = r3 ] [ l3 = r3 ]
            // v4: [ l2 = r2 ] [ l2 = r2 ]
            "ext.16b v2, v4, v0, 8 \n\t"
            "ext.16b v4, v0, v4, 8 \n\t"
            // v2: [ l3 = r3 & l2 = r2 ] [ l3 = r3 & l2 = r2 ]
            "and.16b v2, v2, v4 \n\t"
            // v1: [ l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
            //     [ l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
            "and.16b v1, v2, v1 \n\t"
            // v1: [ l3 < r3 & l3 < r3 |
            //       l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
            //     [ l3 = r3 & l2 < r2 |
            //       l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
            "orr.16b v1, v3, v1 \n\t"
            // b1: [ l3 < r3 & l3 < r3 |
            //       l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 |
            //       l3 = r3 & l2 < r2 |
            //       l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
            "umaxv.16b b1, v1 \n\t"
            "umov %w0, v1.b[0] \n\t"
            "cmp %w0, 0xFF \n\t"
            "cset %w0, eq"
            : "=r" (result)
            : "m" (*lhs->value), "m" (*rhs->value)
            : "v0", "v1", "v2", "v3", "v4", "cc");
    return result;
}
```
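As a sanity check for the logic (though not the timing), the mask-and-reduce pattern the NEON code implements can be modeled in portable C, with 0/~0 64-bit masks standing in for the cmlo/cmeq lane results and a final OR standing in for umaxv. The names here are my own, not part of the original code:

```c
#include <stdbool.h>
#include <stdint.h>

/* All-ones / all-zeros lane mask, as cmlo.2d / cmeq.2d would produce. */
#define M(c) ((c) ? ~UINT64_C(0) : UINT64_C(0))

static bool lt256_masks(const uint64_t l[4], const uint64_t r[4]) {
    uint64_t lt0 = M(l[0] < r[0]), lt1 = M(l[1] < r[1]);
    uint64_t lt2 = M(l[2] < r[2]), lt3 = M(l[3] < r[3]);
    uint64_t eq1 = M(l[1] == r[1]);
    uint64_t eq2 = M(l[2] == r[2]), eq3 = M(l[3] == r[3]);

    /* The four ANDed rows of the matrix, then an OR reduction
     * (the scalar counterpart of umaxv over 0x00/0xFF bytes). */
    uint64_t t3 = lt3;
    uint64_t t2 = eq3 & lt2;
    uint64_t t1 = eq3 & eq2 & lt1;
    uint64_t t0 = eq3 & eq2 & eq1 & lt0;
    return (t3 | t2 | t1 | t0) != 0;
}
```

Running this next to the asm version over random inputs would catch any mistake in the lane shuffling without having to reason through the ext/trn2 choreography by hand.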