Using a union (encapsulated in a struct) to bypass casts between NEON data types

My first attempt at vectorization used SSE, where there is basically only one data type, __m128i. After switching to NEON, I found the data types and function prototypes much more specific, e.g. uint8x16_t (a vector of 16 unsigned char), uint8x8x2_t (2 vectors of 8 unsigned char each), uint32x4_t (a vector of 4 uint32_t), etc.

At first I was delighted (it is much easier to find the exact function that works on the desired data type), but then I saw what a mess it becomes when you want to process the same data in different ways. Using the specific casting operators would take me forever. The problem is also addressed here. So I came up with the idea of a union encapsulated in a struct, plus some conversion and assignment operators.

    struct uint_128bit_t {
        union {
            uint8x16_t  uint8x16;
            uint16x8_t  uint16x8;
            uint32x4_t  uint32x4;
            uint8x8x2_t uint8x8x2;
            uint8_t  uint8_array[16] __attribute__ ((aligned (16)));
            uint16_t uint16_array[8] __attribute__ ((aligned (16)));
            uint32_t uint32_array[4] __attribute__ ((aligned (16)));
        };
        operator uint8x16_t& ()  { return uint8x16; }
        operator uint16x8_t& ()  { return uint16x8; }
        operator uint32x4_t& ()  { return uint32x4; }
        operator uint8x8x2_t& () { return uint8x8x2; }
        uint8x16_t&  operator =(const uint8x16_t& in)  { uint8x16 = in; return uint8x16; }
        uint8x8x2_t& operator =(const uint8x8x2_t& in) { uint8x8x2 = in; return uint8x8x2; }
    };

This approach works for me: I can use a variable of type uint_128bit_t as an argument to, and as the output of, various NEON functions, e.g. vshlq_n_u32, vuzp_u8, vget_low_u8 (in this last case, only as input). And I can extend it with more data types if needed. Note: the arrays are there to make it easy to print the contents of a variable.
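For comparison, the explicit-cast route that the union is meant to avoid can be sketched roughly as follows. This is only a sketch: the function name shift_words_left_by_1 is made up for illustration, and the scalar branch is an assumption added so the snippet also builds on machines without NEON. The vreinterpretq_* intrinsics only relabel the register and normally generate no instructions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: treats 16 bytes as four 32-bit words and shifts
// each word left by one bit.
void shift_words_left_by_1(const std::uint8_t* in, std::uint8_t* out) {
    assert(in != nullptr && out != nullptr);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16_t bytes = vld1q_u8(in);                 // load 16 bytes
    uint32x4_t words = vreinterpretq_u32_u8(bytes);  // same bits, viewed as 4 x u32
    words = vshlq_n_u32(words, 1);                   // shift every 32-bit lane
    vst1q_u8(out, vreinterpretq_u8_u32(words));      // view as bytes again and store
#else
    // Scalar stand-in (assumption, not part of the original question).
    std::uint32_t words[4];
    std::memcpy(words, in, sizeof words);
    for (int i = 0; i < 4; ++i) words[i] <<= 1;
    std::memcpy(out, words, sizeof words);
#endif
}
```

Each change of view costs one named intrinsic call, which is the verbosity the union tries to hide.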

Is this the right way?
Is there a hidden flaw?
Did I reinvent the wheel?
(Is the aligned attribute needed?)

+2
3 answers

Since the originally proposed method has undefined behavior in C++, I implemented something like this:

    #include <cstring>                  // std::memcpy
    #include <boost/static_assert.hpp>  // BOOST_STATIC_ASSERT_MSG

    template <typename T>
    struct NeonVectorType {
    private:
        T data;

    public:
        // Convert to any type of the same size by copying the bytes.
        template <typename U>
        operator U () {
            BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T), "Trying to convert to data type of different size");
            U u;
            std::memcpy( &u, &data, sizeof u );
            return u;
        }

        // Assign from any type of the same size by copying the bytes.
        template <typename U>
        NeonVectorType<T>& operator =(const U& in) {
            BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T), "Trying to copy from data type of different size");
            std::memcpy( &data, &in, sizeof data );
            return *this;
        }
    };

Then:

    typedef NeonVectorType<uint8x16_t> uint_128bit_t; // suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
    typedef NeonVectorType<uint8x8_t>  uint_64bit_t;  // suitable for uint8x8_t, uint32x2_t, etc.

Using memcpy is discussed here (and here), and it avoids violating the strict aliasing rule. Note that the call is generally optimized away.
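A minimal, portable sketch of the memcpy technique on its own, without Boost or NEON. The names bit_cast_copy, Bytes16 and Words4 are made up for illustration, and the std::array types are only stand-ins for uint8x16_t and uint32x4_t so the snippet builds on any machine. It assumes both types are trivially copyable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: builds a To from the object representation of `from`.
// Copying bytes with memcpy is well defined, unlike reading an inactive
// union member or dereferencing a cast pointer.
template <typename To, typename From>
To bit_cast_copy(const From& from) {
    static_assert(sizeof(To) == sizeof(From), "types must have the same size");
    To to;
    std::memcpy(&to, &from, sizeof to);
    return to;
}

// Stand-ins (assumption) for uint8x16_t and uint32x4_t.
using Bytes16 = std::array<std::uint8_t, 16>;
using Words4  = std::array<std::uint32_t, 4>;

// Usage sketch: bytes -> words -> bytes must give back the original bytes.
inline void round_trip_demo() {
    Bytes16 bytes{};
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        bytes[i] = static_cast<std::uint8_t>(i);
    }
    const Words4 words = bit_cast_copy<Words4>(bytes);
    assert(bit_cast_copy<Bytes16>(words) == bytes);
}
```

The size check happens at compile time, so a mismatched conversion such as 16 bytes to 8 bytes fails to build instead of silently truncating.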

If you look at the edit history, I previously implemented a custom version with overloaded operators for arrays of vectors (e.g. uint8x8x2_t). The problem was mentioned here. However, since these data types are declared as arrays (see guide, section 12.2.2) and are therefore located in consecutive memory locations, the compiler should handle the memcpy correctly.

Finally, to print the contents of a variable, you can use a function like this one.
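The linked function is not reproduced here, but a printing helper in the same spirit might look like the sketch below. The name to_hex_string is made up for illustration; it copies the raw bytes out with memcpy, so it should work for any trivially copyable type of known size, and a plain byte array stands in for a NEON vector in the usage part.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: formats the raw bytes of `v` as space-separated hex,
// in memory order.
template <typename T>
std::string to_hex_string(const T& v) {
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &v, sizeof(T));
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(bytes[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// Usage sketch with a plain byte array standing in for a NEON vector.
inline void print_demo() {
    const unsigned char lanes[4] = {1, 2, 255, 16};
    assert(to_hex_string(lanes) == "01 02 ff 10");
    std::puts(to_hex_string(lanes).c_str());
}
```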

+1

According to the C++ standard, this data type is nearly useless (and certainly so for this purpose). This is because reading from an inactive member of a union is undefined behavior.

It is possible, however, that your compiler promises to make this work. But you did not ask about any particular compiler, so it is impossible to comment on that.

+3

If you try to avoid casting in the sensible way by hacking around with data structures, you will end up shuffling memory/words around, which will kill any performance you hoped to get from NEON.

You can probably cast quad registers to double registers easily, but the other way around might not be possible.

It all comes down to this. Each instruction has only a few bits for indexing registers. If an instruction expects quad registers, it counts the double registers in pairs, D(2n) and D(2n+1), and encodes only n in the instruction; the (2n+1) is implicit to the core. If at any point in your code you try to cast two doubles into a quad, you may be in a position where they are not consecutive, forcing the compiler to shuffle registers onto the stack and back to get a consecutive layout.
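The pairing rule can be written down as simple index arithmetic. This is only an illustration of the ARMv7 / AArch32 register aliasing, where quad register Qn overlays double registers D(2n) and D(2n+1); the names DPair, d_registers_of_q and forms_q_register are made up, and real register allocation is done by the compiler, not by user code.

```cpp
// Hypothetical illustration of AArch32 NEON register aliasing.
struct DPair {
    int low;   // index of the D register holding the low 64 bits
    int high;  // index of the D register holding the high 64 bits
};

// Qn overlays D(2n) and D(2n+1); only n is encoded in the instruction.
constexpr DPair d_registers_of_q(int n) {
    return DPair{2 * n, 2 * n + 1};
}

// Two D registers can be used as one Q register without extra moves only
// if they form such an even-aligned, consecutive pair.
constexpr bool forms_q_register(int d_low, int d_high) {
    return d_low % 2 == 0 && d_high == d_low + 1;
}

static_assert(d_registers_of_q(3).low == 6 && d_registers_of_q(3).high == 7,
              "Q3 overlays D6 and D7");
static_assert(forms_q_register(6, 7), "D6/D7 can be addressed as Q3");
static_assert(!forms_q_register(3, 4), "D3/D4 straddle Q1 and Q2");
```

In the last case the two halves would first have to be moved into an aligned pair, which is the shuffling described above.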

I think this is the same answer in different words: fooobar.com/questions/1215783 / ...

NEON instructions are meant for streaming: you load large chunks from memory, process them, and then store what you want. The mechanics should be very simple; otherwise you lose the extra performance NEON offers, which will make people ask why you are using NEON in the first place and making life harder for yourself.
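The load / process / store pattern might look like the following sketch. The kernel add_saturate_u8 is a made-up example, not something from this thread, and the scalar loop is an assumption added so the snippet also builds without NEON; it doubles as the tail loop for the last few bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical kernel: adds `delta` to every byte, saturating at 255.
void add_saturate_u8(const std::uint8_t* src, std::uint8_t* dst,
                     std::size_t count, std::uint8_t delta) {
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x16_t vdelta = vdupq_n_u8(delta);  // broadcast delta to 16 lanes
    for (; i + 16 <= count; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);         // load 16 bytes
        v = vqaddq_u8(v, vdelta);                 // process: saturating add
        vst1q_u8(dst + i, v);                     // store 16 bytes
    }
#endif
    // Scalar tail (and the whole loop when NEON is unavailable).
    for (; i < count; ++i) {
        const unsigned sum = static_cast<unsigned>(src[i]) + delta;
        dst[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}
```

The vector stays in one type from load to store, so no reinterpretation or register shuffling is needed inside the loop.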

Think of NEON as immutable value types and operations.

0

Source: https://habr.com/ru/post/1215751/

