Data Type Compatibility with NEON intrinsics

I am working on optimizing ARM using NEON intrinsics, from C ++ code. I understand and deal with most typing issues, but I'm stuck with this:

The vzip_u8 instruction returns the value uint8x8x2_t (actually an array of two uint8x8_t ). I want to assign the return value to a simple uint16x8_t . I do not see a suitable vreinterpretq to achieve this, and simple drops are rejected.

+4
source share
5 answers

Some definitions for a clear answer ...

NEON has 32 registers, 64-bit wide (double representation as 16 registers, 128 bits).

The NEON module can view the same register bank as:

  • sixteen 128-bit quad registers, Q0-Q15
  • thirty-two 64-bit double word registers, D0-D31.

uint16x8_t is a type that requires 128-bit storage, so it must be in the quadword register.

ARM NEON Intrinsics has a definition of vector array data type in ARM® C Language Extensions :

... for use in load and store operations, in table lookups, and as a result of type operations that return a pair of vectors.

vzip instruction

... interleaves the elements of two vectors.

vzip Dd, Dm

and has intrinsic for example

 uint8x8x2_t vzip_u8 (uint8x8_t, uint8x8_t) 

from them we can conclude that uint8x8x2_t is actually a list of two random numeric double-word registers, because vzip instructions have no requirement for the order of the input registers.

Now the answer is ...

uint8x8x2_t can contain irregular two registers with two words, and uint16x8_t is a data structure consisting of two consecutive double-word registers that first have an even index (D0-D31 → Q0-Q15).

Because of this, you cannot distinguish a vector array data type with two double-word registers in a quad-word register ... easily.

The compiler may be smart enough to help you, or you can just force the conversion, but I would check the resulting assembly for correctness as well as performance.

+5
source

You can build a 128-bit vector from two 64-bit vectors using vcombine_ * intrinsics. That way you can achieve what you want.

 #include <arm_neon.h> uint8x16_t f(uint8x8_t a, uint8x8_t b) { uint8x8x2_t tmp = vzip_u8(a,b); uint8x16_t result; result = vcombine_u8(tmp.val[0], tmp.val[1]); return result; } 
+4
source

I found a workaround: given that the val element of type uint8x8x2_t is an array, so it is treated as a pointer. Casting and deferment of the pointer is in progress! [When accepting the data address, the warning “temporary address” is issued.]

 uint16x8_t Value= *(uint16x8_t*)vzip_u8(arg0, arg1).val; 

It turns out that this compiles and runs as it should (at least if I tried). I did not look at the assembly code, so I can not provide it correctly implemented (I mean just save the value in the register instead of writing / reading to / from memory.)

+1
source

I had the same problem, so I entered a flexible data type .

Now I can determine the following:

 typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc. typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc. 
0
source

His bug in the GCC (now fixed) in the 4.5 and 4.6 series.

Bugzilla Link http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48252

Please fix this error and apply it to the gcc source and restore it.

-1
source

Source: https://habr.com/ru/post/1215783/


All Articles