I have an array of shorts from which I want to keep half the values and put them in a new array that is half the size. The values I want follow this pattern, where each block is 128 bits (8 shorts). This is the only pattern I will use; it does not have to work for “any general template”!

Values in white are discarded. My array sizes will always be a power of 2. Here's a rough idea of it, non-vectorized:
    unsigned short size = 1 << 8;
    unsigned short* data = new unsigned short[size];
    ...
    unsigned short* newdata = new unsigned short[size >>= 1];

    unsigned int* uintdata = (unsigned int*) data;
    unsigned int* uintnewdata = (unsigned int*) newdata;

    for (unsigned short uintsize = size >> 1, i = 0; i < uintsize; ++i)
    {
        uintnewdata[i] = (uintdata[i * 2] & 0xFFFF0000) | (uintdata[(i * 2) + 1] & 0x0000FFFF);
    }

I started with something like this:
    static const __m128i startmask128 = _mm_setr_epi32(0xFFFF0000, 0x00000000, 0xFFFF0000, 0x00000000);
    static const __m128i endmask128 = _mm_setr_epi32(0x00000000, 0x0000FFFF, 0x00000000, 0x0000FFFF);

    __m128i* data128 = (__m128i*) data;
    __m128i* newdata128 = (__m128i*) newdata;

With these, I can loop over the blocks, apply _mm_and_si128 with the masks to pick out the values I'm looking for, combine them with _mm_or_si128, and store the results in newdata128[i]. However, I do not know how to “squeeze” things together and remove the values in white. And it seems that if I could do that, I would not need the masks at all.

How can I do that?
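
For reference, one possible SSE2-only way to do the squeeze might look like the sketch below. It assumes, based on the masks above, that the kept values are shorts 1, 2, 5 and 6 of each block of 8, and that they should land in the output in their original order. The function name `squeeze_middle_pairs` and the requirement that the input length is a multiple of 16 are assumptions made for this illustration. The idea is to byte-shift each block by one short so the kept pairs line up with 32-bit lanes 0 and 2, gather those lanes into the low half, and then join the low halves of two blocks.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: keeps shorts 1, 2, 5, 6 of every block of 8
// and drops shorts 0, 3, 4, 7. n is the number of input shorts and is
// assumed to be a multiple of 16; dst must hold n / 2 shorts.
void squeeze_middle_pairs(const std::uint16_t* src, std::uint16_t* dst, std::size_t n)
{
    assert(n % 16 == 0);
    for (std::size_t i = 0; i < n; i += 16) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 8));

        // Shift right by one short: s1 s2 | s3 s4 | s5 s6 | s7 0.
        // The kept pairs now sit in 32-bit lanes 0 and 2.
        a = _mm_srli_si128(a, 2);
        b = _mm_srli_si128(b, 2);

        // Reorder lanes to 0, 2, 1, 3 so the low 64 bits hold s1 s2 s5 s6.
        a = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 1, 2, 0));
        b = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 1, 2, 0));

        // Join the low 64 bits of both blocks into one output vector.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i / 2),
                         _mm_unpacklo_epi64(a, b));
    }
}
```

No masks are needed here because the unwanted shorts end up in the high halves, which `_mm_unpacklo_epi64` ignores. Unaligned loads and stores are used only to keep the sketch safe for arbitrary pointers; with 16-byte aligned arrays the aligned variants should work as well.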
Eventually, I also want to do the reverse of this operation: create a new array twice the size and expand the current values into it.

I will also have new values to insert into the white slots, which I would have to calculate from each pair of shorts in the source data, iteratively. This calculation would not be vectorized, but the insertion of the resulting values should be. How could I “expand” my current values into a new array, and what is the best way to insert my calculated values? Should I compute them all for each 128-bit iteration, put them in a temporary block of my own (64 bits? 128 bits?), and then insert them in bulk? Or should they be placed directly in my target __m128i, since the cost is apparently about the same as inserting into a temporary? If so, how can this be done without ruining my other values?
I would prefer to stick to SSE2 operations as much as possible for this.
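
To make the “temporary block” option concrete, here is a rough SSE2 sketch of the expansion. It assumes the same layout as before (data in shorts 1, 2, 5, 6 of each output block, new values in shorts 0, 3, 4, 7) and that the scalar code has already written the new values, in output order, into a separate array. The function name `expand_middle_pairs`, the `fill` array, and the multiple-of-8 length requirement are all assumptions for this illustration, not a claim about the best approach.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: expands n packed shorts into 2 * n output shorts.
// fill holds one precomputed value per "white" slot, in output order,
// so it also has n entries. n is assumed to be a multiple of 8.
void expand_middle_pairs(const std::uint16_t* packed, const std::uint16_t* fill,
                         std::uint16_t* dst, std::size_t n)
{
    assert(n % 8 == 0);
    const __m128i zero = _mm_setzero_si128();
    // Selects output shorts 0, 3, 4, 7 (the slots that receive new values).
    const __m128i fill_slots = _mm_setr_epi16(-1, 0, 0, -1, -1, 0, 0, -1);

    for (std::size_t i = 0; i < n; i += 8) {
        __m128i p = _mm_loadu_si128(reinterpret_cast<const __m128i*>(packed + i));
        __m128i f = _mm_loadu_si128(reinterpret_cast<const __m128i*>(fill + i));

        // Spread the data pairs: p0 p1 0 0 p2 p3 0 0, then shift left by one
        // short to get 0 p0 p1 0 0 p2 p3 0 (only a zero is shifted out).
        __m128i d_lo = _mm_slli_si128(_mm_unpacklo_epi32(p, zero), 2);
        __m128i d_hi = _mm_slli_si128(_mm_unpackhi_epi32(p, zero), 2);

        // Duplicate each fill value (f0 f0 f1 f1 ...) and keep only the copies
        // that fall on slots 0, 3, 4, 7: f0 0 0 f1 f2 0 0 f3.
        __m128i f_lo = _mm_and_si128(_mm_unpacklo_epi16(f, f), fill_slots);
        __m128i f_hi = _mm_and_si128(_mm_unpackhi_epi16(f, f), fill_slots);

        // The data and fill vectors never overlap, so OR merges them safely.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 2 * i), _mm_or_si128(d_lo, f_lo));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 2 * i + 8), _mm_or_si128(d_hi, f_hi));
    }
}
```

Placing values directly into the target register is also possible in SSE2 with `_mm_insert_epi16`, which overwrites a single 16-bit slot and leaves the other seven untouched. Which of the two variants is cheaper would need to be measured for the actual workload.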