Vectorized extraction of a specific pattern of shorts from an array, plus insertion into a new array

Question


I have an array of shorts from which I want to extract half of the values and put them in a new array that is half the size. I want to pick specific values in this kind of pattern, where each block is 128 bits (8 shorts). This is the only pattern I will use; it does not have to work for “any general pattern”!

Values in white are discarded. My array sizes will always be a power of 2. Here is a rough idea of it, unvectorized:

    unsigned short size = 1 << 8;
    unsigned short* data = new unsigned short[size];
    ...
    unsigned short* newdata = new unsigned short[size >>= 1];
    unsigned int* uintdata = (unsigned int*) data;
    unsigned int* uintnewdata = (unsigned int*) newdata;
    for (unsigned short uintsize = size >> 1, i = 0; i < uintsize; ++i)
    {
        uintnewdata[i] = (uintdata[i * 2] & 0xFFFF0000) | (uintdata[(i * 2) + 1] & 0x0000FFFF);
    }

I started with something like this:

    static const __m128i startmask128 = _mm_setr_epi32(0xFFFF0000, 0x00000000, 0xFFFF0000, 0x00000000);
    static const __m128i endmask128 = _mm_setr_epi32(0x00000000, 0x0000FFFF, 0x00000000, 0x0000FFFF);
    __m128i* data128 = (__m128i*) data;
    __m128i* newdata128 = (__m128i*) newdata;

and I can iteratively apply _mm_and_si128 with the masks to get the values I'm looking for, combine them with _mm_or_si128, and put the results in newdata128[i]. However, I do not know how to “squeeze” things together and remove the values in white. And it seems that if I could do that, I would not need the masks at all.

How can I do that?

Eventually, I also want to do the opposite of this operation: create a new array twice the size and spread the current values out into it.

I will also have new values to insert into the white blocks, which I would have to compute from each pair of shorts in the source data, iteratively. This computation would not be vectorized, but the insertion of the resulting values should be. How could I “expand” my current values into a new array, and what is the best way to insert my computed values? Should I compute them all for each 128-bit iteration and put them in a temporary block of their own (64 bit? 128 bit?), and then do something to insert them in bulk? Or should they be placed directly into my target __m128i, since the cost should presumably be about the same as writing them into a temporary? If so, how can this be done without clobbering my other values?

I would prefer to use at most SSE2 operations for this.


c++ algorithm vectorization visual-c++ sse2

user173342 Jan 7 '13 at 16:27


1 answer

Guy Sirton · Accepted Answer · 2013-01-07T19:20:35+0000

Here is a scheme you can try:

Use the interleave instructions ( _mm_unpackhi/lo_epi16 ) with a register containing zero to spread out your 16-bit values. You will now have two registers that look like B_R_B_R_ .
Shift right to create _B_R_B_R
AND the R values out of the first version, leaving B___B___
AND the B values out of the second version, leaving ___R___R
OR the two together to get B__RB__R
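
The steps above can be sketched roughly as follows. This is only an illustration under some assumptions that are not in the answer: the pattern B__RB__R is read in memory order (lowest address first) on little-endian x86, the gaps are left as zero, and the function name `expand_pattern` and its parameters are made up for the example. Note that "shift right" in the diagram moves each value to the next higher 16-bit slot, which is a left bit-shift within each 32-bit lane.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: spreads n packed shorts from src over 2 * n shorts in dst.
// Every output block of 8 shorts has the layout  v _ _ v v _ _ v  (gaps are zero).
// Assumes n is a multiple of 8; unaligned loads/stores are used for simplicity.
void expand_pattern(const std::uint16_t* src, std::uint16_t* dst, std::size_t n)
{
    assert(n % 8 == 0);
    const __m128i zero  = _mm_setzero_si128();
    const __m128i keepB = _mm_setr_epi16(-1, 0, 0, 0, -1, 0, 0, 0);  // slots 0 and 4
    const __m128i keepR = _mm_setr_epi16(0, 0, 0, -1, 0, 0, 0, -1);  // slots 3 and 7

    for (std::size_t i = 0; i < n; i += 8) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));

        // Interleave with zero: B_R_B_R_ (one register per half of the input)
        const __m128i lo = _mm_unpacklo_epi16(v, zero);
        const __m128i hi = _mm_unpackhi_epi16(v, zero);

        // Move every value up by one 16-bit slot: _B_R_B_R
        const __m128i loShifted = _mm_slli_epi32(lo, 16);
        const __m128i hiShifted = _mm_slli_epi32(hi, 16);

        // B___B___ | ___R___R  ->  B__RB__R
        const __m128i outLo = _mm_or_si128(_mm_and_si128(lo, keepB),
                                           _mm_and_si128(loShifted, keepR));
        const __m128i outHi = _mm_or_si128(_mm_and_si128(hi, keepB),
                                           _mm_and_si128(hiShifted, keepR));

        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 2 * i), outLo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 2 * i + 8), outHi);
    }
}
```

Per 128 bits of input this uses two unpacks, two shifts, four ANDs and two ORs. Separately computed values could presumably be merged into the zeroed gap slots with one more _mm_or_si128, provided they are first arranged in a register that is zero everywhere else.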

In the other direction, use _mm_packs_epi32 at the end, after setting things up with shift/AND/OR.
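
A possible sketch of that compacting direction is shown below. It assumes the kept values sit in slots 0, 3, 4 and 7 of each 8-short block (memory order, little-endian x86); the names `pick_outer` and `compact_pattern` are hypothetical. One detail is my own addition rather than part of the answer: _mm_packs_epi32 saturates to the signed 16-bit range, so each selected value is sign-extended into its 32-bit lane first, which keeps values of 0x8000 and above intact.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: from shorts a0..a7, builds the 32-bit lanes
// [a0, a3, a4, a7], each sign-extended so that packs will not saturate.
static inline __m128i pick_outer(__m128i x)
{
    const __m128i evenLanes = _mm_setr_epi32(-1, 0, -1, 0);
    // Low short of every 32-bit lane (slots 0, 2, 4, 6), sign-extended.
    const __m128i lowHalf  = _mm_srai_epi32(_mm_slli_epi32(x, 16), 16);
    // High short of every 32-bit lane (slots 1, 3, 5, 7), sign-extended.
    const __m128i highHalf = _mm_srai_epi32(x, 16);
    // Lanes 0 and 2 come from lowHalf, lanes 1 and 3 from highHalf.
    return _mm_or_si128(_mm_and_si128(evenLanes, lowHalf),
                        _mm_andnot_si128(evenLanes, highHalf));
}

// Hypothetical helper: compacts n shorts from src into n / 2 shorts in dst.
// Assumes n is a multiple of 16; unaligned loads/stores are used for simplicity.
void compact_pattern(const std::uint16_t* src, std::uint16_t* dst, std::size_t n)
{
    assert(n % 16 == 0);
    for (std::size_t i = 0; i < n; i += 16) {
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 8));
        // Narrow the two sets of four 32-bit lanes back into eight shorts.
        const __m128i packed = _mm_packs_epi32(pick_outer(a), pick_outer(b));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i / 2), packed);
    }
}
```

With the sign-extension included, this version needs 13 instructions per output register (three shifts, two ANDs and one OR per input block, plus the pack). If the values are known to fit in 15 bits, a plain AND and logical shift should be enough before the pack.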

Each direction should take 10 SSE instructions (not counting the setup of constants, i.e. the zero register and the AND masks, or the loads and stores).
