Retrieving SSE shuffled 32-bit value with SSE2 only

I am trying to extract 4 bytes from a 128-bit register in an efficient way. The problem is that each value sits in its own 32-bit lane: {120,0,0,0,55,0,0,0,42,0,0,0,120,0,0,0}. I want to convert the 128 bits to 32 bits of the form {120,55,42,120}.

The "raw" code is as follows:

 __m128i byte_result_vec={120,0,0,0,55,0,0,0,42,0,0,0,120,0,0,0};
 unsigned char * byte_result_array=(unsigned char*)&byte_result_vec;
 result_array[x]=byte_result_array[0];
 result_array[x+1]=byte_result_array[4];
 result_array[x+2]=byte_result_array[8];
 result_array[x+3]=byte_result_array[12];

My SSSE3 Code:

 unsigned int * byte_result_array=...;
 __m128i byte_result_vec={120,0,0,0,55,0,0,0,42,0,0,0,120,0,0,0};
 const __m128i eight_bit_shuffle_mask=_mm_set_epi8(1,1,1,1,1,1,1,1,1,1,1,1,0,4,8,12);
 byte_result_vec=_mm_shuffle_epi8(byte_result_vec,eight_bit_shuffle_mask);
 unsigned int * byte_result_array=(unsigned int*)&byte_result_vec;
 result_array[x]=byte_result_array[0];

How can I do this efficiently using only SSE2? Is there a better version with SSSE3 or SSE4?

1 answer

You can look at a previous answer of mine for some solutions to this problem and to the reverse operation.

In particular, in SSE2 you can do this by first packing the 32-bit integers into signed 16-bit integers with saturation:

 byte_result_vec = _mm_packs_epi32(byte_result_vec, byte_result_vec); 

Then we pack these 16-bit values into unsigned 8-bit values using unsigned saturation:

 byte_result_vec = _mm_packus_epi16(byte_result_vec, byte_result_vec); 

Then we can finally take our values from the lower 32 bits of the register:

 int int_result = _mm_cvtsi128_si32(byte_result_vec);
 unsigned char* byte_result_array = (unsigned char*)&int_result;
 result_array[x] = byte_result_array[0];
 result_array[x+1] = byte_result_array[1];
 result_array[x+2] = byte_result_array[2];
 result_array[x+3] = byte_result_array[3];
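Put together, the three steps above could be wrapped in a small helper along these lines. This is only a sketch: the name `pack_low_bytes` is made up for illustration, and it uses `memcpy` rather than the pointer cast to sidestep aliasing concerns. It assumes every 32-bit lane already holds a value in the range 0..255.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>     /* memcpy */

/* Hypothetical helper (name chosen for illustration only):
   writes the value of each 32-bit lane of v to out[0..3], lane 0 first.
   Assumes every lane is already in the range 0..255. */
static void pack_low_bytes(__m128i v, unsigned char *out)
{
    int packed;

    v = _mm_packs_epi32(v, v);       /* 32-bit -> 16-bit, signed saturation   */
    v = _mm_packus_epi16(v, v);      /* 16-bit -> 8-bit, unsigned saturation  */
    packed = _mm_cvtsi128_si32(v);   /* low 32 bits now hold the four bytes   */
    memcpy(out, &packed, sizeof packed);
}
```

Called as `pack_low_bytes(byte_result_vec, &result_array[x]);`, this would fill `result_array[x]` through `result_array[x+3]`, assuming `result_array` is an `unsigned char` array.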

EDIT: The above assumes that the 8-bit values initially reside in the low bytes of their respective 32-bit words, with the remaining bytes filled with 0s, as otherwise they would get clamped during the saturating pack operations. The operations thus proceed as follows:

                    byte  15                               0
                          0 0 0 D  0 0 0 C  0 0 0 B  0 0 0 A

 _mm_packs_epi32     ->   0 D 0 C  0 B 0 A  0 D 0 C  0 B 0 A

 _mm_packus_epi16    ->   D C B A  D C B A  D C B A  D C B A
                                                     ^^^^^^^

 _mm_cvtsi128_si32   ->   int DCBA, laid out in x86 memory as bytes A B C D

                     ->   reinterpreted as unsigned char array { A, B, C, D }

If the unused bytes are not already filled with 0s, you need to mask them off beforehand:

 byte_result_vec = _mm_and_si128(byte_result_vec, _mm_set1_epi32(0x000000FF)); 

Or, if the bytes of interest are initially in the high bytes, you must first shift them down into the low bytes:

 byte_result_vec = _mm_srli_epi32(byte_result_vec, 24); 
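As a rough sketch of how these two pre-processing variants slot in front of the packing steps, something like the following should work. The function names (`pack_lanes`, `extract_low_bytes`, `extract_high_bytes`) are hypothetical and exist only for this illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>     /* memcpy */

/* Hypothetical helper: packs four lanes that are already in 0..255
   into four consecutive bytes, lane 0 first. */
static void pack_lanes(__m128i v, unsigned char *out)
{
    int packed;

    v = _mm_packs_epi32(v, v);       /* 32-bit -> 16-bit, signed saturation  */
    v = _mm_packus_epi16(v, v);      /* 16-bit -> 8-bit, unsigned saturation */
    packed = _mm_cvtsi128_si32(v);
    memcpy(out, &packed, sizeof packed);
}

/* Low byte of each lane; the upper three bytes may contain anything. */
static void extract_low_bytes(__m128i v, unsigned char *out)
{
    pack_lanes(_mm_and_si128(v, _mm_set1_epi32(0x000000FF)), out);
}

/* High byte of each lane; the logical right shift also zeroes the rest. */
static void extract_high_bytes(__m128i v, unsigned char *out)
{
    pack_lanes(_mm_srli_epi32(v, 24), out);
}
```

Either pre-processing step leaves every lane in the range 0..255, so neither saturating pack alters the values.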

Or, if you actually want { D, C, B, A } (it is not entirely clear to me from your question), that simply means reversing the array indices in the assignments (or, alternatively, performing a 32-bit shuffle ( _mm_shuffle_epi32 ) on the initial SSE register beforehand).
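For the shuffle variant, a minimal sketch might look like this. The helper name `pack_lanes_reversed` is an assumption made for the example, and it again assumes each lane is already in 0..255. `_MM_SHUFFLE(0, 1, 2, 3)` reverses the order of the four 32-bit lanes before packing.

```c
#include <xmmintrin.h>  /* _MM_SHUFFLE macro */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>     /* memcpy */

/* Hypothetical helper: like the packing steps above, but writes the
   lanes in reverse order (lane 3 first). Assumes each lane is in 0..255. */
static void pack_lanes_reversed(__m128i v, unsigned char *out)
{
    int packed;

    /* New lane 0 = old lane 3, lane 1 = old lane 2, and so on. */
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    v = _mm_packs_epi32(v, v);       /* 32-bit -> 16-bit, signed saturation  */
    v = _mm_packus_epi16(v, v);      /* 16-bit -> 8-bit, unsigned saturation */
    packed = _mm_cvtsi128_si32(v);
    memcpy(out, &packed, sizeof packed);
}
```

Reversing the indices in the scalar assignments avoids the extra shuffle instruction, so this form is mainly of interest when the packed 32-bit value itself is stored or used directly.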
