This is not an answer, but may be of interest to others trying to vectorize matrix multiplications using GCC.
Below, I assume that c is a 4 × 4 matrix in row-major order, a is a 4-row, n-column matrix in column-major order (i.e. transposed), b is an n-row, 4-column matrix in row-major order, and the operation to compute is c = a × b + c, where × denotes matrix multiplication.
The naive function to achieve this is
void slow_4(double *c, const double *a, const double *b, size_t n)
{
    size_t row, col, i;

    for (row = 0; row < 4; row++)
        for (col = 0; col < 4; col++)
            for (i = 0; i < n; i++)
                c[4*row+col] += a[4*i+row] * b[4*i+col];
}
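To make the layout concrete: element a(row, i) is stored at a[4*i + row] and b(i, col) at b[4*i + col]. As a small illustrative sketch of my own (the helper name pack_a is hypothetical, not part of the code above), converting an ordinary row-major 4 × n matrix into the transposed layout expected for a could look like the following; b needs no conversion, since an n × 4 row-major matrix already matches its layout.

#include <stddef.h>

/* Hypothetical helper: copy a row-major 4 x n matrix src into the
   column-major (transposed) layout that slow_4()/fast_4() expect for a,
   so that dst[4*i + row] == src[n*row + i]. */
static void pack_a(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t row = 0; row < 4; row++)
            dst[4*i + row] = src[n*row + i];
}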
GCC generates pretty good code for SSE2 / SSE3 using
#if defined(__SSE2__) || defined(__SSE3__)
typedef double vec2d __attribute__((vector_size (2 * sizeof (double))));

void fast_4(vec2d *c, const vec2d *a, const vec2d *b, size_t n)
{
    const vec2d *const b_end = b + 2L * n;

    vec2d s00 = c[0];
    vec2d s02 = c[1];
    vec2d s10 = c[2];
    vec2d s12 = c[3];
    vec2d s20 = c[4];
    vec2d s22 = c[5];
    vec2d s30 = c[6];
    vec2d s32 = c[7];

    while (b < b_end) {
        const vec2d b0 = b[0];
        const vec2d b2 = b[1];
        const vec2d a0 = { a[0][0], a[0][0] };
        const vec2d a1 = { a[0][1], a[0][1] };
        const vec2d a2 = { a[1][0], a[1][0] };
        const vec2d a3 = { a[1][1], a[1][1] };

        s00 += a0 * b0;
        s10 += a1 * b0;
        s20 += a2 * b0;
        s30 += a3 * b0;

        s02 += a0 * b2;
        s12 += a1 * b2;
        s22 += a2 * b2;
        s32 += a3 * b2;

        b += 2;
        a += 2;
    }

    c[0] = s00;
    c[1] = s02;
    c[2] = s10;
    c[3] = s12;
    c[4] = s20;
    c[5] = s22;
    c[6] = s30;
    c[7] = s32;
}
#endif
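Note that this version takes vec2d pointers, so c, a and b must be 16-byte aligned. The snippet below is only a usage sketch of my own (it assumes C11 aligned_alloc is available and that casting suitably aligned double buffers to vec2d * is acceptable, as the signature implies); error handling is minimal.

#include <stdlib.h>
#include <string.h>

/* Sketch: allocate 16-byte-aligned buffers and call the SSE2/SSE3 fast_4().
   All sizes are multiples of 16 bytes, as aligned_alloc() requires. */
int example_sse(size_t n)
{
    double *c = aligned_alloc(16, 16  * sizeof (double));   /* 4 x 4 result */
    double *a = aligned_alloc(16, 4*n * sizeof (double));   /* 4 x n, transposed */
    double *b = aligned_alloc(16, 4*n * sizeof (double));   /* n x 4, row-major */
    if (!c || !a || !b)
        return -1;

    memset(c, 0, 16 * sizeof (double));
    /* ... fill a and b in the layouts described above ... */

    fast_4((vec2d *)c, (const vec2d *)a, (const vec2d *)b, n);

    free(b);
    free(a);
    free(c);
    return 0;
}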
For AVX, GCC can do even better with
#if defined(__AVX__) || defined(__AVX2__)
typedef double vec4d __attribute__((vector_size (4 * sizeof (double))));

void fast_4(vec4d *c, const vec4d *a, const vec4d *b, size_t n)
{
    const vec4d *const b_end = b + n;

    vec4d s0 = c[0];
    vec4d s1 = c[1];
    vec4d s2 = c[2];
    vec4d s3 = c[3];

    while (b < b_end) {
        const vec4d bc = *(b++);
        const vec4d ac = *(a++);
        const vec4d a0 = { ac[0], ac[0], ac[0], ac[0] };
        const vec4d a1 = { ac[1], ac[1], ac[1], ac[1] };
        const vec4d a2 = { ac[2], ac[2], ac[2], ac[2] };
        const vec4d a3 = { ac[3], ac[3], ac[3], ac[3] };

        s0 += a0 * bc;
        s1 += a1 * bc;
        s2 += a2 * bc;
        s3 += a3 * bc;
    }

    c[0] = s0;
    c[1] = s1;
    c[2] = s2;
    c[3] = s3;
}
#endif
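Since both vectorized variants are easy to get subtly wrong, here is a hedged correctness-check sketch of my own (not from the original code): fill a and b with arbitrary values, run slow_4() and the matching fast_4() on separate copies of c, and compare the 16 results. It assumes the AVX variant and 32-byte alignment; 16-byte alignment and vec2d casts would be enough for the SSE case.

#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of a consistency check between slow_4() and the AVX fast_4().
   Buffer sizes and the 32-byte alignment requirement come from the
   vec4d loads; the tolerance is arbitrary. */
int check_fast_4(size_t n)
{
    double *a  = aligned_alloc(32, 4*n * sizeof (double));
    double *b  = aligned_alloc(32, 4*n * sizeof (double));
    double *c0 = aligned_alloc(32, 16  * sizeof (double));
    double *c1 = aligned_alloc(32, 16  * sizeof (double));
    if (!a || !b || !c0 || !c1)
        return -1;

    /* Arbitrary, deterministic test data. */
    for (size_t i = 0; i < 4*n; i++) {
        a[i] = (double)(i % 17) - 8.0;
        b[i] = (double)(i % 13) - 6.0;
    }
    memset(c0, 0, 16 * sizeof (double));
    memset(c1, 0, 16 * sizeof (double));

    slow_4(c0, a, b, n);
    fast_4((vec4d *)c1, (const vec4d *)a, (const vec4d *)b, n);

    int mismatches = 0;
    for (size_t i = 0; i < 16; i++)
        if (fabs(c0[i] - c1[i]) > 1e-9 * fabs(c0[i]) + 1e-12)
            mismatches++;

    free(a); free(b); free(c0); free(c1);
    return mismatches;   /* 0 means the two versions agree */
}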
The SSE3 version of the generated assembly (gcc-4.8.4, -O2 -march=x86-64 -mtune=generic -msse3) is essentially
fast_4:
        salq    $5, %rcx
        movapd  (%rdi), %xmm13
        addq    %rdx, %rcx
        cmpq    %rcx, %rdx
        movapd  16(%rdi), %xmm12
        movapd  32(%rdi), %xmm11
        movapd  48(%rdi), %xmm10
        movapd  64(%rdi), %xmm9
        movapd  80(%rdi), %xmm8
        movapd  96(%rdi), %xmm7
        movapd  112(%rdi), %xmm6
        jnb     .L2
.L3:
        movddup (%rsi), %xmm5
        addq    $32, %rdx
        movapd  -32(%rdx), %xmm1
        addq    $32, %rsi
        movddup -24(%rsi), %xmm4
        movapd  %xmm5, %xmm14
        movddup -16(%rsi), %xmm3
        movddup -8(%rsi), %xmm2
        mulpd   %xmm1, %xmm14
        movapd  -16(%rdx), %xmm0
        cmpq    %rdx, %rcx
        mulpd   %xmm0, %xmm5
        addpd   %xmm14, %xmm13
        movapd  %xmm4, %xmm14
        mulpd   %xmm0, %xmm4
        addpd   %xmm5, %xmm12
        mulpd   %xmm1, %xmm14
        addpd   %xmm4, %xmm10
        addpd   %xmm14, %xmm11
        movapd  %xmm3, %xmm14
        mulpd   %xmm0, %xmm3
        mulpd   %xmm1, %xmm14
        mulpd   %xmm2, %xmm0
        addpd   %xmm3, %xmm8
        mulpd   %xmm2, %xmm1
        addpd   %xmm14, %xmm9
        addpd   %xmm0, %xmm6
        addpd   %xmm1, %xmm7
        ja      .L3
.L2:
        movapd  %xmm13, (%rdi)
        movapd  %xmm12, 16(%rdi)
        movapd  %xmm11, 32(%rdi)
        movapd  %xmm10, 48(%rdi)
        movapd  %xmm9, 64(%rdi)
        movapd  %xmm8, 80(%rdi)
        movapd  %xmm7, 96(%rdi)
        movapd  %xmm6, 112(%rdi)
        ret
The AVX version of the generated assembly (-O2 -march=x86-64 -mtune=generic -mavx) is essentially
fast_4:
        salq    $5, %rcx
        vmovapd (%rdi), %ymm5
        addq    %rdx, %rcx
        vmovapd 32(%rdi), %ymm4
        cmpq    %rcx, %rdx
        vmovapd 64(%rdi), %ymm3
        vmovapd 96(%rdi), %ymm2
        jnb     .L2
.L3:
        addq    $32, %rsi
        vmovapd -32(%rsi), %ymm1
        addq    $32, %rdx
        vmovapd -32(%rdx), %ymm0
        cmpq    %rdx, %rcx
        vpermilpd       $0, %ymm1, %ymm6
        vperm2f128      $0, %ymm6, %ymm6, %ymm6
        vmulpd  %ymm0, %ymm6, %ymm6
        vaddpd  %ymm6, %ymm5, %ymm5
        vpermilpd       $15, %ymm1, %ymm6
        vperm2f128      $0, %ymm6, %ymm6, %ymm6
        vmulpd  %ymm0, %ymm6, %ymm6
        vaddpd  %ymm6, %ymm4, %ymm4
        vpermilpd       $0, %ymm1, %ymm6
        vpermilpd       $15, %ymm1, %ymm1
        vperm2f128      $17, %ymm6, %ymm6, %ymm6
        vperm2f128      $17, %ymm1, %ymm1, %ymm1
        vmulpd  %ymm0, %ymm6, %ymm6
        vmulpd  %ymm0, %ymm1, %ymm0
        vaddpd  %ymm6, %ymm3, %ymm3
        vaddpd  %ymm0, %ymm2, %ymm2
        ja      .L3
.L2:
        vmovapd %ymm5, (%rdi)
        vmovapd %ymm4, 32(%rdi)
        vmovapd %ymm3, 64(%rdi)
        vmovapd %ymm2, 96(%rdi)
        vzeroupper
        ret
The register allocation is not optimal, but it does not look terrible either. I am personally happy enough with the above not to bother hand-optimizing it at this point.
On a Core i5-4200U processor (AVX2-capable), the fast versions of the above functions compute the product of two 4 × 256 matrices in 1843 CPU cycles (median) for SSE3 and 1248 cycles for AVX2. That comes to 1.8 and 1.22 cycles per matrix entry (1843 and 1248 cycles divided by the 4 × 256 = 1024 entries). For comparison, the unvectorized slow version takes about 11 cycles per matrix entry.
(The numbers are medians, i.e. half the tests were faster. I only did rough benchmarking with ~100,000 repetitions or so, so take these numbers with a grain of salt.)
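For reference, a rough cycle-count measurement along these lines can be done with the time-stamp counter. The sketch below is my own and only illustrative (it uses GCC's __rdtsc from x86intrin.h on the AVX build, ignores frequency scaling and out-of-order effects, and simply takes the median of many repetitions); it is not the harness used for the numbers above.

#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>

/* Comparison callback for qsort(), used to take the median below. */
static int cmp_u64(const void *p1, const void *p2)
{
    const uint64_t u1 = *(const uint64_t *)p1;
    const uint64_t u2 = *(const uint64_t *)p2;
    return (u1 < u2) ? -1 : (u1 > u2);
}

/* Rough median cycle count for one fast_4() call on 4 x n matrices.
   c, a, b must already be allocated and aligned as described earlier. */
uint64_t median_cycles(vec4d *c, const vec4d *a, const vec4d *b,
                       size_t n, size_t repeats)
{
    uint64_t *samples = malloc(repeats * sizeof samples[0]);
    if (!samples)
        return 0;

    for (size_t r = 0; r < repeats; r++) {
        const uint64_t start = __rdtsc();
        fast_4(c, a, b, n);
        samples[r] = __rdtsc() - start;
    }

    qsort(samples, repeats, sizeof samples[0], cmp_u64);
    const uint64_t median = samples[repeats / 2];
    free(samples);
    return median;
}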
On this CPU, the cache effects are such that the AVX2 version still runs at 1.2 cycles per entry for a 4 × 512 matrix, but slows to 1.4 at 4 × 1024, 1.5 at 4 × 4096, 1.8 at 4 × 8192, and 2.2 cycles per entry at 4 × 65536. The SSE3 version stays at 1.8 cycles per entry up to 4 × 3072, after which it starts to slow down; at 4 × 65536 it too is at about 2.2 cycles per entry. I do believe this (laptop!) processor is limited by cache bandwidth here.