Circuit Vectoring

I use a compact structure of two unsigned shorts, indicating the start and end positions.
I need to be able to quickly determine if there are any Range objects with a length (difference from start to end) per threshold value.

I will have a huge number of objects, each with its own Range array, so it is not possible to track which Range objects are above the threshold in the list or something like that. This code will also run very often (many times per second for each array), so it should be efficient.

 struct Range { unsigned short start; unsigned short end; } 

I will always have a Range array of size 2 ^ n. Although I would like to interrupt as soon as I find something above the threshold, I am sure it would be easier to just OR all together and check at the end ... assuming I can vectorize the loop. Although if I could make an if statement on a piece of results for each vector, that would be great.

 size_t rangecount = 1 << resolution; Range* ranges = new Range[rangecount]; ... bool result = false; for (size_t i = 0; i < ranges; ++i) { result |= (range[i].end - range[i].start) > 4; } 

It is not surprising that the autointervalizer gives error 1202 because my data type is not 32 or 64 bits. I really don't want to double my data size and make each field an unsigned int. Therefore, I assume that the auto-invasion method is used for this.

Are there vector instructions that can process 16-bit variables? If there is, how can I use them in C ++ to vectorize my loop?

+6
source share
1 answer

Are you curious if any value is greater than 4?

Yes, there are SIMD instructions for this. It is sad that auto-vectorization is not able to cope with this scenario. Here's a vectorized algorithm:

 diff_v = end_v - start_v; // _mm_hsub_epi16 floor_v = max(4_v, diff_v); // _mm_max_epi16 if (floor_v != 4_v) return true; // wide scalar comparison 

Use _mm_sub_epi16 with an array structure or _mm_hsub_epi16 with an array of structures.

In fact, since start stored first in memory, you will work with start_v - end_v , so use _mm_min_epi16 and the vector -4 .

Each SSE3 instruction will perform 8 comparisons at a time. It will still return faster sooner than the cycle. However, by expanding the loop a little more, you can buy extra speed (pass the first set of results to the packed min / max function to combine them).

So you will end with (approximately):

 most_negative = threshold = _mm_set_epi64(0xFCFCFCFCFCFCFCFC); // vectorized -4 loop: a = load from range; b = load from range; diff = _mm_hsub_epi16(a, b); most_negative = _mm_min_epi16(most_negative, diff); // unroll by repeating the above four instructions 4 times or so if (most_negative != threshold) return true; repeat loop 
+1
source

All Articles