How to load a sliding diagonal vector from data stored in columns using SSE

Question

How to load a sliding diagonal vector from data stored in columns using SSE

A sliding diagonal vector contains 16 elements, each of which is an 8-bit unsigned integer.

Without SSE and a bit simplified, it would look like this in C:

int width=1000000; // a big number uint8_t matrix[width][16]; fill_matrix_with_interesting_values(&matrix); for (int i=0; i < width - 16; ++i) { uint8_t diagonal_vector[16]; for (int j=0; j<16; ++j) { diagonal_vector[j] = matrix[i+j][j]; } do_something(&diagonal_vector); }

but in my case, I can load the column (vertically) from the matrix using the _mm_load_si128 intrinsics function. The sliding diagonal vector moves horizontally, so I need to preload 16 column vectors and use one element from each of these column vectors to create the diagonal vector.

Is it possible to perform a fast implementation with low memory for this using SSE?

Update November 14, 2016: Providing some additional information. In my case, I am reading single-letter codes from a text file in FASTA format. Each letter represents a specific amino acid. Each amino acid has a specific column vector associated with it. This column vector is viewed from the constant table ( BLOSUM ). In C code, it will look like

 while (uint8_t c = read_next_letter_from_file()) { column_vector = lookup_from_const_table(c) uint8_t diagonal_vector[16]; ... rearrange the values from the latest column vectors into the diagonal_vector ... do_something(&diagonal_vector) }

+4

performance c vectorization sse simd

Erik sjölund Mar 04 '13 at 9:12

source share

1 answer

Erik sjölund · Accepted Answer · 2013-03-04T09:12:42+0000

The implementation I represent requires only one column load per iteration. First we initialize some variables

 const __m128i mask1=_mm_set_epi8(0,0,0,0,0,0,0,0,255,255,255,255,255,255,255,255); const __m128i mask2=_mm_set_epi8(0,0,0,0,255,255,255,255,0,0,0,0,255,255,255,255); const __m128i mask3=_mm_set_epi8(0,0,255,255,0,0,255,255,0,0,255,255,0,0,255,255); const __m128i mask4=_mm_set_epi8(0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255); __m128i v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15;

Then, for each step, the v_column_load variable v_column_load loaded with the next column.

 v15 = v_column_load; v7 = _mm_blendv_epi8(v7,v15,mask1); v3 = _mm_blendv_epi8(v3,v7,mask2); v1 = _mm_blendv_epi8(v1,v3,mask3); v0 = _mm_blendv_epi8(v0,v1,mask4); v_diagonal = v0;

In the next step, the number of variable names in v0 , v1 , v3 , v7 , v15 increased by 1 and adjusted in the range from 0 to 15. In other words: newnumber = (oldnumber + 1) modulo 16.

 v0 = v_column_load; v8 = _mm_blendv_epi8(v8,v0,mask1); v4 = _mm_blendv_epi8(v4,v8,mask2); v2 = _mm_blendv_epi8(v2,v4,mask3); v1 = _mm_blendv_epi8(v1,v2,mask4); v_diagonal = v1;

After 16 iterations, v_diagonal will begin to contain the correct diagonal values.

Considering mask1 , mask2 , mask3 , mask4 , we see a template that can be used to generalize this algorithm to other vector lengths (2 ^ n).

For example, for the length of vector 8, we need only 3 masks, and the iteration steps will look like this:

 v7 = aaaaaaaa v6 = v5 = v4 = v3 = aaaa v2 = v1 = aa v0 = a v0 = bbbbbbbb v7 = aaaaaaaa v6 = v5 = v4 = bbbb v3 = aaaa v2 = bb v1 = ab v1 = cccccccc v0 = bbbbbbbb v7 = aaaaaaaa v6 = v5 = cccc v4 = bbbb v3 = aacc v2 = abc v2 = dddddddd v1 = cccccccc v0 = bbbbbbbb v7 = aaaaaaaa v6 = dddd v5 = cccc v4 = bbdd v3 = aacd v3 = eeeeeeee v2 = dddddddd v1 = cccccccc v0 = bbbbbbbb v7 = aaaaeeee v6 = dddd v5 = aaccee v4 = abbda v4 = ffffffff v3 = eeeeeeee v2 = dddddddd v1 = cccccccc v0 = bbbbffff v7 = aaaaeeee v6 = bbddff v5 = abcdef v5 = gggggggg v4 = ffffffff v3 = eeeeeeee v2 = dddddddd v1 = ccccgggg v0 = bbbbffff v7 = aacceegg v6 = abcdefg v6 = hhhhhhhh v5 = gggggggg v4 = ffffffff v3 = eeeeeeee v2 = ddddhhhh v1 = ccccgggg v0 = bbddffhh v7 = abcdefgh <-- this vector now contains the diagonal v7 = iiiiiiii v6 = hhhhhhhh v5 = gggggggg v4 = ffffffff v3 = eeeeiiii v2 = ddddhhhh v1 = cceeggii v0 = bcdefghi <-- this vector now contains the diagonal v0 = jjjjjjjj v7 = iiiiiiii v6 = hhhhhhhh v5 = gggggggg v4 = ffffjjjj v3 = eeeeiiii v2 = ddffhhjj v1 = cdefghij <-- this vector now contains the diagonal

Sidenote: I discovered this way of loading a diagonal vector when I was working on the Smith-Waterman algorithm. More information can be found on the old SourceForge web page .

How to load a sliding diagonal vector from data stored in columns using SSE

More articles: