Vectorized extraction of a specific pattern of shorts from an array, and insertion into a new array

I have an array of shorts from which I want to grab half of the values and put them in a new array that is half the size. I want to grab particular values in this sort of pattern, where each block is 128 bits (8 shorts). This is the only pattern I will use; it does not need to be "any generic pattern"!

Values in white are discarded. My array sizes will always be a power of 2. Here is a vague, informal idea of it:

    unsigned short size = 1 << 8;
    unsigned short* data = new unsigned short[size];
    ...
    unsigned short* newdata = new unsigned short[size >>= 1];
    unsigned int* uintdata = (unsigned int*) data;
    unsigned int* uintnewdata = (unsigned int*) newdata;
    for (unsigned short uintsize = size >> 1, i = 0; i < uintsize; ++i)
    {
        uintnewdata[i] = (uintdata[i * 2] & 0xFFFF0000) | (uintdata[(i * 2) + 1] & 0x0000FFFF);
    }

I started with something like this:

    static const __m128i startmask128 = _mm_setr_epi32(0xFFFF0000, 0x00000000, 0xFFFF0000, 0x00000000);
    static const __m128i endmask128 = _mm_setr_epi32(0x00000000, 0x0000FFFF, 0x00000000, 0x0000FFFF);
    __m128i* data128 = (__m128i*) data;
    __m128i* newdata128 = (__m128i*) newdata;

and I can iterate, applying _mm_and_si128 with the masks to pick out the values I'm looking for, combining them with _mm_or_si128, and storing the results in newdata128[i]. However, I don't know how to "compress" things together and remove the values in white. And it seems that if I could do that, I wouldn't need the masks at all.

How can I do that?

In any case, I eventually also want to do the opposite of this operation: create a new array twice the size and spread the current values out in it.

I will also have new values to insert into the white blocks, which I would have to compute from each pair of shorts in the source data, iteratively. This computation would not be vectorized, but the insertion of the resulting values should be. How could I "spread" my current values out into a new array, and what is the best way to insert my computed values? Should I compute them all for each 128-bit iteration and put them in a temporary block of their own (64 bits? 128 bits?), and then do something to insert them in bulk? Or should they be placed directly into my target __m128i, since the cost should apparently be equivalent to writing them into a temporary? If so, how can this be done without clobbering my other values?

I would prefer to use at most SSE2 operations for this.

1 answer

Here is a scheme you can try:

  • Use the interleave instructions ( _mm_unpackhi_epi16 / _mm_unpacklo_epi16 ) with a register containing zero to spread out your 16-bit values. You will now have two registers that look like B_R_B_R_ .
  • Shift right to create _B_R_B_R
  • AND the R out of the first version, leaving B___B___
  • AND the B out of the second version, leaving ___R___R
  • OR them together to get B__RB__R
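The steps above could be sketched roughly as follows. This is only an illustration under several assumptions that are not in the original post: the function name expand_pairs is hypothetical, the diagrams are read left to right as 16-bit lanes 0..7 in memory order (so "shift right" in the diagram becomes a left bit shift within each 32-bit lane), the gaps are filled with zeros, and unaligned loads and stores are used so that the pointers need not be 16-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: spreads packed (B, R) pairs into the layout B _ _ R,
// writing zeros into the gaps.
// src holds n shorts (n must be a multiple of 8); dst needs room for 2 * n shorts.
void expand_pairs(const std::uint16_t* src, std::uint16_t* dst, std::size_t n) {
    assert(n % 8 == 0);
    const __m128i zero = _mm_setzero_si128();
    // Masks selecting 16-bit lanes 0 and 4 (B slots) and lanes 3 and 7 (R slots).
    const __m128i keepB = _mm_setr_epi16(-1, 0, 0, 0, -1, 0, 0, 0);
    const __m128i keepR = _mm_setr_epi16(0, 0, 0, -1, 0, 0, 0, -1);

    for (std::size_t i = 0; i < n; i += 8) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));

        // Interleave with zero: B0 _ R0 _ B1 _ R1 _ (low half) and likewise for the high half.
        const __m128i lo = _mm_unpacklo_epi16(v, zero);
        const __m128i hi = _mm_unpackhi_epi16(v, zero);

        // Move every value up one 16-bit lane: _ B0 _ R0 _ B1 _ R1.
        const __m128i loUp = _mm_slli_epi32(lo, 16);
        const __m128i hiUp = _mm_slli_epi32(hi, 16);

        // Keep B from the unshifted copy and R from the shifted copy, then merge.
        const __m128i outLo = _mm_or_si128(_mm_and_si128(lo, keepB),
                                           _mm_and_si128(loUp, keepR));
        const __m128i outHi = _mm_or_si128(_mm_and_si128(hi, keepB),
                                           _mm_and_si128(hiUp, keepR));

        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 2 * i), outLo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 2 * i + 8), outHi);
    }
}
```

Because the gaps come out as zeros in this sketch, separately computed values that have been placed in lanes 1, 2, 5 and 6 of another register (with zeros elsewhere) could presumably be merged in with one more _mm_or_si128 before the store.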

In the other direction, use _mm_packs_epi32 at the end, after setting things up with shift/AND/OR.
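A possible sketch of the compressing direction follows, with the same caveats: the names compress_pairs and gather_pairs are hypothetical, and the kept values are assumed to sit in 16-bit lanes 0, 3, 4 and 7 of each block. One detail here is my own addition rather than part of the answer: _mm_packs_epi32 saturates to the signed 16-bit range, so the sketch sign-extends each value with _mm_srai_epi32 before packing so that unsigned shorts above 0x7FFF survive unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: from lanes B0 x x R0 B1 x x R1, build four 32-bit lanes
// holding the sign-extended values B0, R0, B1, R1.
static inline __m128i gather_pairs(__m128i v, __m128i keepB, __m128i keepR) {
    // B values move from the low half to the high half of their 32-bit lane.
    const __m128i b = _mm_slli_epi32(_mm_and_si128(v, keepB), 16);  // _ B0 _ _ _ B1 _ _
    // R values already sit in the high half of their 32-bit lane.
    const __m128i r = _mm_and_si128(v, keepR);                      // _ _ _ R0 _ _ _ R1
    const __m128i both = _mm_or_si128(b, r);                        // _ B0 _ R0 _ B1 _ R1
    // The arithmetic shift sign-extends, so the signed saturation in
    // _mm_packs_epi32 leaves the low 16 bits intact.
    return _mm_srai_epi32(both, 16);
}

// src holds n shorts (n must be a multiple of 16); dst needs room for n / 2 shorts.
void compress_pairs(const std::uint16_t* src, std::uint16_t* dst, std::size_t n) {
    assert(n % 16 == 0);
    const __m128i keepB = _mm_setr_epi16(-1, 0, 0, 0, -1, 0, 0, 0);
    const __m128i keepR = _mm_setr_epi16(0, 0, 0, -1, 0, 0, 0, -1);

    for (std::size_t i = 0; i < n; i += 16) {
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 8));
        // Two input blocks of 8 shorts collapse into one output block of 8 shorts.
        const __m128i packed = _mm_packs_epi32(gather_pairs(a, keepB, keepR),
                                               gather_pairs(b, keepB, keepR));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i / 2), packed);
    }
}
```

If the data are known to stay at or below 0x7FFF, the values could instead be left zero-extended in the low half of each 32-bit lane, which would change the shifts but not the overall shape of the approach.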

Each direction should take 10 SSE instructions (not counting the setup of the constants, i.e. the zero register and the AND masks, or the loads and stores).
