Data loading for GCC vector extensions

GCC vector extensions offer a nice, reasonably portable way to access some SIMD instructions on different hardware architectures without resorting to hardware-specific intrinsics (or auto-vectorization).

A real use case is calculating a simple additive checksum. The one thing that isn't clear is how to safely load data into the vector.

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef char v16qi __attribute__ ((vector_size(16)));

    static uint8_t checksum(uint8_t *buf, size_t size)
    {
        assert(size % 16 == 0);
        uint8_t sum = 0;
        v16qi vec = {0};

        for (size_t i = 0; i < (size / 16); i++) {
            // XXX: Yuck! Is there a better way?
            // Apply the byte offset before the cast so each step advances 16 bytes.
            vec += *((v16qi*) (buf + i * 16));
        }

        // Sum up the vector
        sum = vec[0] + vec[1] + vec[2] + vec[3] + vec[4] + vec[5] + vec[6] + vec[7]
            + vec[8] + vec[9] + vec[10] + vec[11] + vec[12] + vec[13] + vec[14] + vec[15];

        return sum;
    }

Casting a pointer to the vector type appears to work, but I'm worried that it could explode in a horrible way if the SIMD hardware expects the vector types to be correctly aligned.

The only other option I've thought of is to use a temporary vector and explicitly load the values (via memcpy or element-by-element assignment), but in testing this negated most of the speedup gained from using SIMD instructions. Ideally I'd imagine there would be something like a generic __builtin_load() function, but none seems to exist.
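For reference, the memcpy variant described above might look like the following minimal sketch. The names v16u8 and checksum_memcpy are made up for illustration, and unsigned lanes are used so that wraparound is well defined. Whether the compiler folds the memcpy into a single unaligned vector load depends on the compiler version and optimization level, so the generated code is worth checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type name: 16 unsigned bytes per vector. */
typedef uint8_t v16u8 __attribute__((vector_size(16)));

/* Hypothetical helper: additive checksum that loads each chunk via memcpy. */
static uint8_t checksum_memcpy(const uint8_t *buf, size_t size)
{
    assert(size % 16 == 0);

    v16u8 acc = {0};
    for (size_t i = 0; i < size; i += 16) {
        v16u8 chunk;
        /* memcpy makes no alignment or aliasing assumptions about buf. */
        memcpy(&chunk, buf + i, sizeof chunk);
        acc += chunk; /* lane-wise add, each lane wraps modulo 256 */
    }

    /* Horizontal sum of the 16 lanes. */
    uint8_t sum = 0;
    for (int lane = 0; lane < 16; lane++)
        sum += acc[lane];
    return sum;
}
```

Because the load goes through memcpy, the function can be called on a pointer at any byte offset, for example checksum_memcpy(buf + 1, 32).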

What is a safer way to load data into a vector without risking alignment problems?

+8
gcc vectorization simd checksum
2 answers

You can use an initializer to load values, i.e. do

 const v16qi e = { buf[0], buf[1], ... , buf[15] };

and hope that GCC turns this into an SSE load instruction. I would check this with a disassembler, though ;-). Also, for better performance, try to make buf 16-byte aligned and tell the compiler so with the aligned attribute. If you can't guarantee that the input buffer will be aligned, process it bytewise until you reach a 16-byte boundary.
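The "process bytewise until a 16-byte boundary" idea could be sketched as follows. This is only an illustration under assumptions: the names v16u8a and checksum_peeled are hypothetical, and the may_alias attribute is added here so that reading a uint8_t buffer through the vector type does not depend on strict-aliasing behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type name: 16-byte aligned vector that may alias other types. */
typedef uint8_t v16u8a __attribute__((vector_size(16), aligned(16), may_alias));

/* Hypothetical helper: scalar prologue, aligned vector body, scalar tail. */
static uint8_t checksum_peeled(const uint8_t *buf, size_t size)
{
    uint8_t sum = 0;
    size_t i = 0;

    /* Prologue: consume single bytes until buf + i is 16-byte aligned. */
    while (i < size && ((uintptr_t)(buf + i) % 16) != 0)
        sum += buf[i++];
    assert(i == size || ((uintptr_t)(buf + i) % 16) == 0);

    /* Body: aligned 16-byte loads, accumulated lane-wise. */
    v16u8a acc = {0};
    for (; i + 16 <= size; i += 16)
        acc += *(const v16u8a *)(buf + i);

    for (int lane = 0; lane < 16; lane++)
        sum += acc[lane];

    /* Tail: whatever is left after the last full vector. */
    for (; i < size; i++)
        sum += buf[i];
    return sum;
}
```

Unlike the checksum in the question, this version does not require size to be a multiple of 16, since the prologue and tail absorb the leftover bytes.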

0

Edit (thanks Peter Cordes): you can cast pointers:

 typedef char v16qi __attribute__ ((vector_size (16), aligned (16)));

 v16qi vec = *(v16qi*)&buf[i];   // load
 *(v16qi*)(buf + i) = vec;       // store whole vector

This compiles to vmovdqa for the load and vmovups for the store. If the data is not known to be aligned, set aligned (1) to generate vmovdqu. (godbolt)
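As a rough sketch of the aligned (1) variant applied to the checksum from the question: the names v16u8u and checksum_unaligned are made up for this example, and may_alias is an addition that is not part of the answer above. The exact instructions emitted will vary with the target and compiler flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type name: vector type with its alignment requirement
 * lowered to 1 byte, so dereferencing it makes no alignment promise. */
typedef uint8_t v16u8u __attribute__((vector_size(16), aligned(1), may_alias));

/* Hypothetical helper: additive checksum using unaligned vector loads. */
static uint8_t checksum_unaligned(const uint8_t *buf, size_t size)
{
    assert(size % 16 == 0);

    v16u8u acc = {0};
    for (size_t i = 0; i < size; i += 16)
        acc += *(const v16u8u *)(buf + i); /* load from any byte address */

    /* Horizontal sum of the 16 lanes, wrapping modulo 256. */
    uint8_t sum = 0;
    for (int lane = 0; lane < 16; lane++)
        sum += acc[lane];
    return sum;
}
```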

Note that there are also dedicated intrinsics for loading and storing these registers (Edit 2):

 v16qi vec = _mm_loadu_si128((__m128i*)&buf[i]);  // _mm_load_si128 for aligned
 _mm_storeu_si128((__m128i*)&buf[i], vec);        // _mm_store_si128 for aligned

It seems to be necessary to use -flax-vector-conversions to go from chars to v16qi with these functions.
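One possible alternative to the flag is an explicit cast between the two 16-byte vector types, which GCC accepts for vector types of equal size. The sketch below assumes an x86 target with SSE2 and falls back to memcpy elsewhere; v16u8, load16 and store16 are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_storeu_si128 */
#endif

/* Hypothetical type name: 16 unsigned bytes per vector. */
typedef uint8_t v16u8 __attribute__((vector_size(16)));

/* Hypothetical helper: unaligned 16-byte load. */
static v16u8 load16(const uint8_t *p)
{
    assert(p != NULL);
#if defined(__SSE2__)
    /* Explicit cast reinterprets __m128i as v16u8 (same size, same bits). */
    return (v16u8)_mm_loadu_si128((const __m128i *)p);
#else
    v16u8 v;
    memcpy(&v, p, sizeof v); /* portable fallback on non-SSE2 targets */
    return v;
#endif
}

/* Hypothetical helper: unaligned 16-byte store. */
static void store16(uint8_t *p, v16u8 v)
{
    assert(p != NULL);
#if defined(__SSE2__)
    _mm_storeu_si128((__m128i *)p, (__m128i)v);
#else
    memcpy(p, &v, sizeof v);
#endif
}
```

Wrapping the intrinsics in small helpers like these keeps the architecture-specific part in one place, so the rest of the code can stay on the generic vector type.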

See also: C - How to access elements of a vector using GCC SSE vector extensions
See also: Loading SSE in __m128

(Tip: the best phrase to google is something like "gcc load __m128i".)

+1
