The implementation I represent requires only one column load per iteration. First we initialize some variables
const __m128i mask1=_mm_set_epi8(0,0,0,0,0,0,0,0,255,255,255,255,255,255,255,255); const __m128i mask2=_mm_set_epi8(0,0,0,0,255,255,255,255,0,0,0,0,255,255,255,255); const __m128i mask3=_mm_set_epi8(0,0,255,255,0,0,255,255,0,0,255,255,0,0,255,255); const __m128i mask4=_mm_set_epi8(0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255); __m128i v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15;
Then, for each step, the v_column_load variable v_column_load loaded with the next column.
v15 = v_column_load; v7 = _mm_blendv_epi8(v7,v15,mask1); v3 = _mm_blendv_epi8(v3,v7,mask2); v1 = _mm_blendv_epi8(v1,v3,mask3); v0 = _mm_blendv_epi8(v0,v1,mask4); v_diagonal = v0;
In the next step, the number of variable names in v0 , v1 , v3 , v7 , v15 increased by 1 and adjusted in the range from 0 to 15. In other words: newnumber = (oldnumber + 1) modulo 16.
v0 = v_column_load; v8 = _mm_blendv_epi8(v8,v0,mask1); v4 = _mm_blendv_epi8(v4,v8,mask2); v2 = _mm_blendv_epi8(v2,v4,mask3); v1 = _mm_blendv_epi8(v1,v2,mask4); v_diagonal = v1;
After 16 iterations, v_diagonal will begin to contain the correct diagonal values.
Considering mask1 , mask2 , mask3 , mask4 , we see a template that can be used to generalize this algorithm to other vector lengths (2 ^ n).
For example, for the length of vector 8, we need only 3 masks, and the iteration steps will look like this:
v7 = aaaaaaaa v6 = v5 = v4 = v3 = aaaa v2 = v1 = aa v0 = a v0 = bbbbbbbb v7 = aaaaaaaa v6 = v5 = v4 = bbbb v3 = aaaa v2 = bb v1 = ab v1 = cccccccc v0 = bbbbbbbb v7 = aaaaaaaa v6 = v5 = cccc v4 = bbbb v3 = aacc v2 = abc v2 = dddddddd v1 = cccccccc v0 = bbbbbbbb v7 = aaaaaaaa v6 = dddd v5 = cccc v4 = bbdd v3 = aacd v3 = eeeeeeee v2 = dddddddd v1 = cccccccc v0 = bbbbbbbb v7 = aaaaeeee v6 = dddd v5 = aaccee v4 = abbda v4 = ffffffff v3 = eeeeeeee v2 = dddddddd v1 = cccccccc v0 = bbbbffff v7 = aaaaeeee v6 = bbddff v5 = abcdef v5 = gggggggg v4 = ffffffff v3 = eeeeeeee v2 = dddddddd v1 = ccccgggg v0 = bbbbffff v7 = aacceegg v6 = abcdefg v6 = hhhhhhhh v5 = gggggggg v4 = ffffffff v3 = eeeeeeee v2 = ddddhhhh v1 = ccccgggg v0 = bbddffhh v7 = abcdefgh <-- this vector now contains the diagonal v7 = iiiiiiii v6 = hhhhhhhh v5 = gggggggg v4 = ffffffff v3 = eeeeiiii v2 = ddddhhhh v1 = cceeggii v0 = bcdefghi <-- this vector now contains the diagonal v0 = jjjjjjjj v7 = iiiiiiii v6 = hhhhhhhh v5 = gggggggg v4 = ffffjjjj v3 = eeeeiiii v2 = ddffhhjj v1 = cdefghij <-- this vector now contains the diagonal
Sidenote: I discovered this way of loading a diagonal vector when I was working on the Smith-Waterman algorithm. More information can be found on the old SourceForge web page .