How to perform an 8 x 8 matrix operation using SSE?

My initial attempt looked like this (assuming we want to multiply):

__m128 mat[n]; /* rows */
__m128 vec[n] = {1,1,1,1};
float outvector[n];
for (int row=0;row<n;row++) {
    for(int k =3; k < 8; k = k+ 4) {
        __m128 mrow = mat[k];
        __m128 v = vec[row];
        __m128 sum = _mm_mul_ps(mrow,v);
        sum= _mm_hadd_ps(sum,sum); /* adds adjacent-two floats */
    }
    _mm_store_ss(&outvector[row],_mm_hadd_ps(sum,sum));
}

But this clearly does not work. How do I approach this?

I have to load 4 at a time ....

Another question: if my array is very large (let's say n = 1000), how can I do it with 16-byte alignment? Is it possible?

2 answers

OK ... I use the row-major matrix convention. Each row of [m] requires two __m128 elements to hold 8 floats. The 8x1 vector v is a column vector, also stored as two __m128 elements. Since you are using the haddps instruction, I assume SSE3 is available. To compute r = [m] * v:

#include <pmmintrin.h> /* SSE3: _mm_hadd_ps */

void mul (__m128 r[2], const __m128 m[8][2], const __m128 v[2])
{
    __m128 t0, t1, t2, t3, r0, r1, r2, r3;

    /* rows 0..3, columns 0..3 */
    t0 = _mm_mul_ps(m[0][0], v[0]);
    t1 = _mm_mul_ps(m[1][0], v[0]);
    t2 = _mm_mul_ps(m[2][0], v[0]);
    t3 = _mm_mul_ps(m[3][0], v[0]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r0 = _mm_hadd_ps(t0, t2); /* one partial dot product per lane */

    /* rows 0..3, columns 4..7 */
    t0 = _mm_mul_ps(m[0][1], v[1]);
    t1 = _mm_mul_ps(m[1][1], v[1]);
    t2 = _mm_mul_ps(m[2][1], v[1]);
    t3 = _mm_mul_ps(m[3][1], v[1]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r1 = _mm_hadd_ps(t0, t2);

    /* rows 4..7, columns 0..3 */
    t0 = _mm_mul_ps(m[4][0], v[0]);
    t1 = _mm_mul_ps(m[5][0], v[0]);
    t2 = _mm_mul_ps(m[6][0], v[0]);
    t3 = _mm_mul_ps(m[7][0], v[0]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r2 = _mm_hadd_ps(t0, t2);

    /* rows 4..7, columns 4..7 */
    t0 = _mm_mul_ps(m[4][1], v[1]);
    t1 = _mm_mul_ps(m[5][1], v[1]);
    t2 = _mm_mul_ps(m[6][1], v[1]);
    t3 = _mm_mul_ps(m[7][1], v[1]);
    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r3 = _mm_hadd_ps(t0, t2);

    /* combine the two column halves */
    r[0] = _mm_add_ps(r0, r1); /* result rows 0..3 */
    r[1] = _mm_add_ps(r2, r3); /* result rows 4..7 */
}
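The routine above depends on haddps, so it needs SSE3. If only SSE is available, the same reduction can be done with a 4x4 transpose instead of horizontal adds. The following is a minimal sketch of that variant, not a tuned implementation. The names dot4rows and mul8x8_sse are hypothetical, and it assumes the matrix is a plain row-major float[64] read with unaligned loads, which is a choice made here for convenience rather than something the answer specifies.

```c
#include <xmmintrin.h> /* SSE: _mm_loadu_ps, _mm_mul_ps, _MM_TRANSPOSE4_PS */

/* Hypothetical helper: dot products of four consecutive 8-float rows with v.
   v0 holds v[0..3], v1 holds v[4..7]. Lane i of the result is row i . v. */
static __m128 dot4rows(const float *rows, __m128 v0, __m128 v1)
{
    /* element-wise products, with the two column halves already added */
    __m128 p0 = _mm_add_ps(_mm_mul_ps(_mm_loadu_ps(rows +  0), v0),
                           _mm_mul_ps(_mm_loadu_ps(rows +  4), v1));
    __m128 p1 = _mm_add_ps(_mm_mul_ps(_mm_loadu_ps(rows +  8), v0),
                           _mm_mul_ps(_mm_loadu_ps(rows + 12), v1));
    __m128 p2 = _mm_add_ps(_mm_mul_ps(_mm_loadu_ps(rows + 16), v0),
                           _mm_mul_ps(_mm_loadu_ps(rows + 20), v1));
    __m128 p3 = _mm_add_ps(_mm_mul_ps(_mm_loadu_ps(rows + 24), v0),
                           _mm_mul_ps(_mm_loadu_ps(rows + 28), v1));

    /* after the transpose, lane i of each register belongs to row i,
       so three vertical adds finish all four dot products at once */
    _MM_TRANSPOSE4_PS(p0, p1, p2, p3);
    return _mm_add_ps(_mm_add_ps(p0, p1), _mm_add_ps(p2, p3));
}

/* r = m * v for a row-major 8x8 matrix m (assumed layout). */
void mul8x8_sse(float r[8], const float m[64], const float v[8])
{
    __m128 v0 = _mm_loadu_ps(v);
    __m128 v1 = _mm_loadu_ps(v + 4);
    _mm_storeu_ps(r,     dot4rows(m,      v0, v1)); /* rows 0..3 */
    _mm_storeu_ps(r + 4, dot4rows(m + 32, v0, v1)); /* rows 4..7 */
}
```

If the data is known to be 16-byte aligned, the unaligned loads and stores can be swapped for _mm_load_ps and _mm_store_ps.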

As for alignment, a variable of type __m128 is automatically aligned to 16 bytes on the stack. With dynamic memory, this is not a safe assumption: some malloc / new implementations may only guarantee 8-byte alignment.

The intrinsics header provides _mm_malloc and _mm_free. The alignment argument should be 16 in this case.
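For the large case in the question (n = 1000), one possible approach is sketched below. It allocates everything with _mm_malloc and pads each row to a multiple of 4 floats, so that every row starts on a 16-byte boundary and _mm_load_ps is safe. The helper names (padded_stride, alloc_floats16, matvec_aligned) and the zero-padding scheme are assumptions for illustration, not part of the answer. The horizontal sum uses SSE shuffles rather than haddps.

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>
#include <xmmintrin.h> /* SSE intrinsics; also provides _mm_malloc / _mm_free */

/* Hypothetical helper: round n up to a multiple of 4 floats (16 bytes). */
size_t padded_stride(size_t n) { return (n + 3) & ~(size_t)3; }

/* Hypothetical helper: 16-byte-aligned, zero-filled array of count floats.
   Release the result with _mm_free, not free. */
float *alloc_floats16(size_t count)
{
    float *p = (float *)_mm_malloc(count * sizeof(float), 16);
    if (p)
        for (size_t i = 0; i < count; i++) p[i] = 0.0f;
    return p;
}

/* r = m * v for an n x n row-major matrix whose rows are `stride` floats
   apart. m and v must be 16-byte aligned and zero-padded up to stride. */
void matvec_aligned(float *r, const float *m, const float *v,
                    size_t n, size_t stride)
{
    assert(((uintptr_t)m % 16) == 0 && ((uintptr_t)v % 16) == 0);
    assert(stride % 4 == 0);

    for (size_t row = 0; row < n; row++) {
        const float *mrow = m + row * stride;
        __m128 acc = _mm_setzero_ps();

        /* four multiply-adds per iteration, aligned loads only */
        for (size_t k = 0; k < stride; k += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(mrow + k),
                                             _mm_load_ps(v + k)));

        /* horizontal sum of the four lanes into lane 0 */
        __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
        __m128 sums = _mm_add_ps(acc, shuf);
        shuf = _mm_movehl_ps(shuf, sums);
        sums = _mm_add_ss(sums, shuf);
        _mm_store_ss(&r[row], sums);
    }
}
```

With n = 1000 the stride is already 1000, since 1000 is a multiple of 4. For something like n = 1001 the stride becomes 1004, and the three padding floats per row stay zero so they do not affect the result.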


Intel has developed the Small Matrix Library for matrices from 1 × 1 to 6 × 6. Application Note AP-930, "Streaming SIMD Extensions - Matrix Multiplication", describes in detail the algorithm for multiplying two 6 × 6 matrices. It should be adaptable to other sizes with some effort.
