How to ask GCC to completely unroll this loop (i.e., peel this loop)?

Is there a way to instruct GCC (I'm using 4.8.4) to unroll the while loop in the function below completely, that is, to peel this loop? The loop's iteration count is known at compile time: 58.

Let me first explain what I tried.

Checking the generated assembly (gas) output:

 gcc -fpic -O2 -S GEPDOT.c 

12 registers, XMM0 - XMM11, are used. If I pass the -funroll-loops flag to GCC:

 gcc -fpic -O2 -funroll-loops -S GEPDOT.c 

the loop is unrolled only twice. I checked the GCC optimization options. The documentation says that -funroll-loops also turns on -frename-registers, so when GCC unrolls a loop, its first choice for register allocation is to use the "leftover" registers. But there are only 4 left over, XMM12 - XMM15, so GCC can unroll at most 2 times. If 48 XMM registers were available instead of 16, GCC would unroll the while loop 4 times without trouble.
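
(As an aside, not something from the question itself: on a given GCC installation you can inspect which unrolling-related options end up enabled — for instance to check whether -funroll-loops pulls in -frename-registers — with something like the line below; the exact report may vary by GCC version.)

 gcc -O2 -funroll-loops -Q --help=optimizers | grep -E 'unroll|rename-registers' 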

I then did another experiment. First I manually unrolled the while loop twice, obtaining a function GEPDOT_2. Then there is no difference whatsoever between

 gcc -fpic -O2 -S GEPDOT_2.c 

and

 gcc -fpic -O2 -funroll-loops -S GEPDOT_2.c 

Since GEPDOT_2 has already used up all the registers, no unrolling is performed.

GCC does register renaming to avoid potential false dependencies. But I know for sure that there is no such potential in my GEPDOT; even if there were, it would not matter. I tried unrolling the loop myself: unrolling 4 times is faster than unrolling 2 times, which is in turn faster than not unrolling at all. Of course I can unroll more times by hand, but that is tedious. Can GCC do this for me? Thank you.

    // C file "GEPDOT.c"
    #include <emmintrin.h>

    void GEPDOT (double *A, double *B, double *C)
    {
        __m128d A1_vec = _mm_load_pd(A); A += 2;
        __m128d B_vec = _mm_load1_pd(B); B++;
        __m128d C1_vec = A1_vec * B_vec;
        __m128d A2_vec = _mm_load_pd(A); A += 2;
        __m128d C2_vec = A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
        __m128d C3_vec = A1_vec * B_vec;
        __m128d C4_vec = A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
        __m128d C5_vec = A1_vec * B_vec;
        __m128d C6_vec = A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
        __m128d C7_vec = A1_vec * B_vec;
        A1_vec = _mm_load_pd(A); A += 2;
        __m128d C8_vec = A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;

        int k = 58;
        /* can compiler unroll the loop completely (ie, peel this loop)? */
        while (k--) {
            C1_vec += A1_vec * B_vec;
            A2_vec = _mm_load_pd(A); A += 2;
            C2_vec += A2_vec * B_vec;
            B_vec = _mm_load1_pd(B); B++;
            C3_vec += A1_vec * B_vec;
            C4_vec += A2_vec * B_vec;
            B_vec = _mm_load1_pd(B); B++;
            C5_vec += A1_vec * B_vec;
            C6_vec += A2_vec * B_vec;
            B_vec = _mm_load1_pd(B); B++;
            C7_vec += A1_vec * B_vec;
            A1_vec = _mm_load_pd(A); A += 2;
            C8_vec += A2_vec * B_vec;
            B_vec = _mm_load1_pd(B); B++;
        }

        C1_vec += A1_vec * B_vec;
        A2_vec = _mm_load_pd(A);
        C2_vec += A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
        C3_vec += A1_vec * B_vec;
        C4_vec += A2_vec * B_vec;
        B_vec = _mm_load1_pd(B); B++;
        C5_vec += A1_vec * B_vec;
        C6_vec += A2_vec * B_vec;
        B_vec = _mm_load1_pd(B);
        C7_vec += A1_vec * B_vec;
        C8_vec += A2_vec * B_vec;

        /* [write-back] */
        A1_vec = _mm_load_pd(C);      C1_vec = A1_vec - C1_vec;
        A2_vec = _mm_load_pd(C + 2);  C2_vec = A2_vec - C2_vec;
        A1_vec = _mm_load_pd(C + 4);  C3_vec = A1_vec - C3_vec;
        A2_vec = _mm_load_pd(C + 6);  C4_vec = A2_vec - C4_vec;
        A1_vec = _mm_load_pd(C + 8);  C5_vec = A1_vec - C5_vec;
        A2_vec = _mm_load_pd(C + 10); C6_vec = A2_vec - C6_vec;
        A1_vec = _mm_load_pd(C + 12); C7_vec = A1_vec - C7_vec;
        A2_vec = _mm_load_pd(C + 14); C8_vec = A2_vec - C8_vec;

        _mm_store_pd(C, C1_vec);
        _mm_store_pd(C + 2, C2_vec);
        _mm_store_pd(C + 4, C3_vec);
        _mm_store_pd(C + 6, C4_vec);
        _mm_store_pd(C + 8, C5_vec);
        _mm_store_pd(C + 10, C6_vec);
        _mm_store_pd(C + 12, C7_vec);
        _mm_store_pd(C + 14, C8_vec);
    }

Update 1

Thanks to @user3386109's comment, I would like to expand this question a bit. @user3386109 raises a very good point. In fact, I do have some doubts about the compiler's ability to allocate registers optimally when there are so many parallel instructions to schedule.

I personally believe that a reliable way is to first code the loop body (which is the key for HPC) in inline asm assembly, and then duplicate it as many times as I want. Earlier this year I had an unpopular post: inline assembly. The code there was a little different, because the loop iteration count, j, is a function argument and therefore unknown at compile time. In that case I could not fully unroll the loop, so I only duplicated the assembly twice and converted the loop into a label and a jump. It turned out that my hand-written assembly performs about 5% better than the compiler-generated assembly, which may suggest that the compiler does not allocate registers in the optimal way I expect.

I was (and still am) a baby in assembly coding, so this was a good exercise for me to learn a little x86 assembly. But in the longer run I am not inclined to code GEPDOT with a large proportion of assembly. There are basically three reasons:

  1. Inline asm assembly has been criticized for not being portable, although I don't quite understand why. Perhaps because different machines have different registers clobbered?
  2. Compilers are getting better, too. So I would still prefer algorithmic optimization and better C coding habits to help the compiler generate good output;
  3. The last reason is more important. The iteration count will not always be 58. I am developing a high-performance matrix factorization routine. For a cache blocking factor nb, the iteration count will be nb-2. I am not going to pass nb as a function argument, as I did in the earlier post; it is a machine-specific parameter that will be defined as a macro. So the iteration count is known at compile time, but may vary from machine to machine. Guess how much tedious work I would have to do if I had to unroll the loop by hand for every possible nb. So if there is a way to simply instruct the compiler to peel the loop, that would be great. (A minimal sketch of the pattern is shown right after this list.)
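
To make the pattern concrete, here is a minimal self-contained sketch of my own (not the real routine: NB is a hypothetical macro name standing in for the machine-specific blocking factor, and the kernel body is reduced to a single multiply-add so that the file compiles on its own):

    /* sketch.c -- trip count fixed at compile time by a per-machine macro */
    #ifndef NB
    #define NB 60                 /* assumed machine-specific blocking factor */
    #endif

    void kernel(const double *a, const double *b, double *c)
    {
        int k = NB - 2;           /* known at compile time, but varies from machine to machine */
        while (k--)               /* the loop I would like the compiler to peel completely */
            *c += (*a++) * (*b++);
    }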

I would also be very grateful if you could share your experience in building high-performance yet portable libraries.

c gcc x86 loops hpc loop-unrolling
2 answers

Try setting optimizer options:

 gcc -O3 -funroll-loops --param max-completely-peeled-insns=1000 --param max-completely-peel-times=100 

That should do the trick.
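
As a side note of my own (not part of the answer above): if you prefer not to build the whole file with -funroll-loops, GCC also lets you enable it per function through the optimize attribute, while the --param limits above still have to be given on the command line. The function below is just a hypothetical stand-in with a fixed trip count:

    /* Sketch: unrolling enabled for this function only; the --param limits
       from the command line still bound how far complete peeling may go. */
    __attribute__((optimize("unroll-loops")))
    void scale58(double *x)
    {
        int k;
        for (k = 0; k < 58; k++)   /* fixed trip count, a candidate for complete peeling */
            x[k] *= 2.0;
    }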


This is not an answer, but may be of interest to others trying to vectorize matrix multiplications using GCC.

Below I assume that c is a 4×4 matrix in row-major order, a is a 4-row, n-column matrix in column-major order (i.e. transposed), b is a 4-column, n-row matrix in row-major order, and the operation to compute is c = a × b + c, where × denotes matrix multiplication.

The naive function to achieve this is

    void slow_4(double *c, const double *a, const double *b, size_t n)
    {
        size_t row, col, i;

        for (row = 0; row < 4; row++)
            for (col = 0; col < 4; col++)
                for (i = 0; i < n; i++)
                    c[4*row+col] += a[4*i+row] * b[4*i+col];
    }

GCC generates pretty good code for SSE2 / SSE3 using

    #if defined(__SSE2__) || defined(__SSE3__)

    typedef double vec2d __attribute__((vector_size (2 * sizeof (double))));

    void fast_4(vec2d *c, const vec2d *a, const vec2d *b, size_t n)
    {
        const vec2d *const b_end = b + 2L * n;

        vec2d s00 = c[0];
        vec2d s02 = c[1];
        vec2d s10 = c[2];
        vec2d s12 = c[3];
        vec2d s20 = c[4];
        vec2d s22 = c[5];
        vec2d s30 = c[6];
        vec2d s32 = c[7];

        while (b < b_end) {
            const vec2d b0 = b[0];
            const vec2d b2 = b[1];
            const vec2d a0 = { a[0][0], a[0][0] };
            const vec2d a1 = { a[0][1], a[0][1] };
            const vec2d a2 = { a[1][0], a[1][0] };
            const vec2d a3 = { a[1][1], a[1][1] };
            s00 += a0 * b0;
            s10 += a1 * b0;
            s20 += a2 * b0;
            s30 += a3 * b0;
            s02 += a0 * b2;
            s12 += a1 * b2;
            s22 += a2 * b2;
            s32 += a3 * b2;
            b += 2;
            a += 2;
        }

        c[0] = s00;
        c[1] = s02;
        c[2] = s10;
        c[3] = s12;
        c[4] = s20;
        c[5] = s22;
        c[6] = s30;
        c[7] = s32;
    }

    #endif

For AVX GCC can do even better with

    #if defined(__AVX__) || defined(__AVX2__)

    typedef double vec4d __attribute__((vector_size (4 * sizeof (double))));

    void fast_4(vec4d *c, const vec4d *a, const vec4d *b, size_t n)
    {
        const vec4d *const b_end = b + n;

        vec4d s0 = c[0];
        vec4d s1 = c[1];
        vec4d s2 = c[2];
        vec4d s3 = c[3];

        while (b < b_end) {
            const vec4d bc = *(b++);
            const vec4d ac = *(a++);
            const vec4d a0 = { ac[0], ac[0], ac[0], ac[0] };
            const vec4d a1 = { ac[1], ac[1], ac[1], ac[1] };
            const vec4d a2 = { ac[2], ac[2], ac[2], ac[2] };
            const vec4d a3 = { ac[3], ac[3], ac[3], ac[3] };
            s0 += a0 * bc;
            s1 += a1 * bc;
            s2 += a2 * bc;
            s3 += a3 * bc;
        }

        c[0] = s0;
        c[1] = s1;
        c[2] = s2;
        c[3] = s3;
    }

    #endif
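
(A minimal calling sketch of my own, not from the answer itself. It assumes only the AVX block above is compiled into the translation unit — note that with -mavx both #if guards are true, so the two fast_4 definitions are alternatives, not meant to coexist in one file. Static arrays of the vector type get the required 32-byte alignment automatically.)

    #include <stdio.h>

    int main(void)
    {
        /* c is the 4x4 accumulator; a holds n columns of a (transposed), b holds n rows of b */
        static vec4d a[4], b[4], c[4];
        size_t i, j;

        for (i = 0; i < 4; i++)
            for (j = 0; j < 4; j++) {
                a[i][j] = (double)(i + 1);   /* column i of a */
                b[i][j] = (double)(j + 1);   /* row i of b */
                c[i][j] = 0.0;
            }

        fast_4(c, a, b, 4);                  /* c = a x b + c, with n = 4 */

        for (i = 0; i < 4; i++)
            printf("%g %g %g %g\n", c[i][0], c[i][1], c[i][2], c[i][3]);
        return 0;
    }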

The SSE3 version of the generated assembly using gcc-4.8.4 ( -O2 -march=x86-64 -mtune=generic -msse3 ) is essentially

    fast_4:
            salq    $5, %rcx
            movapd  (%rdi), %xmm13
            addq    %rdx, %rcx
            cmpq    %rcx, %rdx
            movapd  16(%rdi), %xmm12
            movapd  32(%rdi), %xmm11
            movapd  48(%rdi), %xmm10
            movapd  64(%rdi), %xmm9
            movapd  80(%rdi), %xmm8
            movapd  96(%rdi), %xmm7
            movapd  112(%rdi), %xmm6
            jnb     .L2
    .L3:
            movddup (%rsi), %xmm5
            addq    $32, %rdx
            movapd  -32(%rdx), %xmm1
            addq    $32, %rsi
            movddup -24(%rsi), %xmm4
            movapd  %xmm5, %xmm14
            movddup -16(%rsi), %xmm3
            movddup -8(%rsi), %xmm2
            mulpd   %xmm1, %xmm14
            movapd  -16(%rdx), %xmm0
            cmpq    %rdx, %rcx
            mulpd   %xmm0, %xmm5
            addpd   %xmm14, %xmm13
            movapd  %xmm4, %xmm14
            mulpd   %xmm0, %xmm4
            addpd   %xmm5, %xmm12
            mulpd   %xmm1, %xmm14
            addpd   %xmm4, %xmm10
            addpd   %xmm14, %xmm11
            movapd  %xmm3, %xmm14
            mulpd   %xmm0, %xmm3
            mulpd   %xmm1, %xmm14
            mulpd   %xmm2, %xmm0
            addpd   %xmm3, %xmm8
            mulpd   %xmm2, %xmm1
            addpd   %xmm14, %xmm9
            addpd   %xmm0, %xmm6
            addpd   %xmm1, %xmm7
            ja      .L3
    .L2:
            movapd  %xmm13, (%rdi)
            movapd  %xmm12, 16(%rdi)
            movapd  %xmm11, 32(%rdi)
            movapd  %xmm10, 48(%rdi)
            movapd  %xmm9, 64(%rdi)
            movapd  %xmm8, 80(%rdi)
            movapd  %xmm7, 96(%rdi)
            movapd  %xmm6, 112(%rdi)
            ret

The AVX version of the generated assembly ( -O2 -march=x86-64 -mtune=generic -mavx ) is essentially

    fast_4:
            salq    $5, %rcx
            vmovapd (%rdi), %ymm5
            addq    %rdx, %rcx
            vmovapd 32(%rdi), %ymm4
            cmpq    %rcx, %rdx
            vmovapd 64(%rdi), %ymm3
            vmovapd 96(%rdi), %ymm2
            jnb     .L2
    .L3:
            addq    $32, %rsi
            vmovapd -32(%rsi), %ymm1
            addq    $32, %rdx
            vmovapd -32(%rdx), %ymm0
            cmpq    %rdx, %rcx
            vpermilpd $0, %ymm1, %ymm6
            vperm2f128 $0, %ymm6, %ymm6, %ymm6
            vmulpd  %ymm0, %ymm6, %ymm6
            vaddpd  %ymm6, %ymm5, %ymm5
            vpermilpd $15, %ymm1, %ymm6
            vperm2f128 $0, %ymm6, %ymm6, %ymm6
            vmulpd  %ymm0, %ymm6, %ymm6
            vaddpd  %ymm6, %ymm4, %ymm4
            vpermilpd $0, %ymm1, %ymm6
            vpermilpd $15, %ymm1, %ymm1
            vperm2f128 $17, %ymm6, %ymm6, %ymm6
            vperm2f128 $17, %ymm1, %ymm1, %ymm1
            vmulpd  %ymm0, %ymm6, %ymm6
            vmulpd  %ymm0, %ymm1, %ymm0
            vaddpd  %ymm6, %ymm3, %ymm3
            vaddpd  %ymm0, %ymm2, %ymm2
            ja      .L3
    .L2:
            vmovapd %ymm5, (%rdi)
            vmovapd %ymm4, 32(%rdi)
            vmovapd %ymm3, 64(%rdi)
            vmovapd %ymm2, 96(%rdi)
            vzeroupper
            ret

The register scheduling is not optimal, but it does not look terrible either. I am personally happy with the above and would not try to hand-optimize it at this point.

On a Core i5-4200U CPU (which supports AVX2), the fast versions of the above functions compute the product of two 4×256 matrices in 1843 CPU cycles (median) for SSE3 and 1248 cycles for AVX2. That comes down to 1.8 and 1.22 cycles per matrix entry (1843/1024 and 1248/1024, since each input matrix has 4×256 = 1024 entries). For comparison, the unvectorized slow version takes about 11 cycles per matrix entry.

(The numbers are medians, i.e. half of the test runs were faster. I only did rough benchmarking with ~100,000 repetitions or so, so take these numbers with a grain of salt.)
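
(Purely for reference, a rough sketch of the kind of timing loop I mean — my own reconstruction, not the harness actually used. It times repeated calls of a no-argument wrapper with the __rdtsc() intrinsic and reports the median cycle count per call.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>                      /* __rdtsc() */

    #define REPS 100000

    static int cmp_ull(const void *p, const void *q)
    {
        unsigned long long a = *(const unsigned long long *)p;
        unsigned long long b = *(const unsigned long long *)q;
        return (a > b) - (a < b);
    }

    /* Time REPS calls of fn and return the median cycle count per call. */
    unsigned long long median_cycles(void (*fn)(void))
    {
        static unsigned long long t[REPS];
        size_t i;

        for (i = 0; i < REPS; i++) {
            unsigned long long start = __rdtsc();
            fn();
            t[i] = __rdtsc() - start;
        }
        qsort(t, REPS, sizeof t[0], cmp_ull);
        return t[REPS / 2];
    }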

On this CPU the cache effects are such that the AVX2 version stays at 1.2 cycles per entry at a 4×512 matrix size, but at 4×1024 it drops to 1.4, at 4×4096 to 1.5, at 4×8192 to 1.8, and at 4×65536 to 2.2 cycles per entry. The SSE3 version stays at 1.8 cycles per entry up to 4×3072, after which it starts to slow down; at 4×65536 it too takes about 2.2 cycles per entry. I do believe this (laptop!) processor is limited by cache bandwidth.

