Assembly code / AVX instructions for multiplying complex numbers (GCC inline assembly)

We have a scientific program and would like to take advantage of AVX. The whole program (written in Fortran + C) will be vectorized, and at the moment I'm trying to implement complex multiplication in GCC inline assembly.

The assembly code takes 4 complex numbers and performs two complex multiplications at once:

    v2complex cmult(v2complex *a, v2complex *b)
    {
        v2complex ret;
        asm (
            "vmovupd %2,%%ymm1;"
            "vmovupd %2, %%ymm2;"
            "vmovddup %%ymm2, %%ymm2;"
            "vshufpd $15,%%ymm1,%%ymm1,%%ymm1;"
            "vmulpd %1, %%ymm2, %%ymm2;"
            "vmulpd %1, %%ymm1, %%ymm1;"
            "vshufpd $5,%%ymm1,%%ymm1, %%ymm1;"
            "vaddsubpd %%ymm1, %%ymm2,%%ymm1;"
            "vmovupd %%ymm1, %0;"
            : "=m"(ret)
            : "m" (*a), "m" (*b)
        );
        return ret;
    }

where a and b are 256-bit double precision:

    typedef union v2complex {
        __m256d v;
        complex c[2];
    } v2complex;
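For comparison, the result the vectorized routine is expected to reproduce can be sketched in scalar C. This is only an illustration: `cmult_ref` is a hypothetical name, not part of the original program, and it works on plain `double complex` arrays laid out like the `c[2]` member above.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical scalar reference: multiplies two pairs of complex numbers
   element by element, i.e. out[i] = a[i] * b[i] for i = 0, 1. */
void cmult_ref(const double complex a[2], const double complex b[2],
               double complex out[2])
{
    assert(a && b && out);            /* precondition: no null pointers */
    for (int i = 0; i < 2; ++i)
        out[i] = a[i] * b[i];         /* (ar*br - ai*bi) + (ar*bi + ai*br) i */
}
```

Running both versions on the same inputs and comparing the four doubles is one way to catch the occasional wrong result.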

The problem is that the assembly code usually gives the correct result, but sometimes it fails.

I am very new to assembly, but I tried to figure it out myself. The C program (optimized with -O3) seems to interact with the ymm registers used in the assembly code. For example, if I print one of the values (say, a) before doing the multiplication, the program never gives wrong results.

My question is how to tell GCC not to use the ymm registers that my assembly code touches. I was not able to put the ymm registers in the clobber list.

2 answers

As you may have guessed, the problem is that you did not tell GCC which registers you are clobbering. I would be surprised if it does not yet support placing YMM registers in the clobber list; what version of GCC are you using?

In any case, it is almost certainly sufficient to place the corresponding XMM registers in the clobber list:

 : "=m" (ret) : "m" (*a), "m" (*b) : "%xmm1", "%xmm2"); 

Some other notes:

  • You load both inputs twice, which is inefficient. There is no reason for this.
  • I would use "r" (a), "r" (b) as constraints and write the loads as vmovupd (%2), %%ymm1 . There is probably no difference in the generated code, but it looks more idiomatic.
  • Remember to issue vzeroupper after the AVX code and before any SSE code is executed, to avoid (large) stalls.
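Putting the clobber list and the first note together, the routine might be restructured roughly as in the sketch below. It assumes an AVX-capable CPU and GCC-style extended asm; the name `cmult_asm` and the pointer-based signature are illustrative, not the asker's. It keeps "m" constraints so that GCC can see exactly which memory is read and written, and it leaves out vzeroupper, because that instruction zeroes the upper half of every YMM register and its placement depends on the surrounding code.

```c
#include <assert.h>

/* Hypothetical variant of cmult: a, b, and out each point to two complex
   doubles stored as (re0, im0, re1, im1). Needs an AVX-capable CPU. */
void cmult_asm(const double *a, const double *b, double *out)
{
    assert(a && b && out);            /* precondition: no null pointers */
    __asm__ (
        "vmovupd   %1, %%ymm0\n\t"                   /* ymm0 = a, loaded once    */
        "vmovupd   %2, %%ymm1\n\t"                   /* ymm1 = b, loaded once    */
        "vmovddup  %%ymm1, %%ymm2\n\t"               /* ymm2 = b.re, duplicated  */
        "vshufpd   $15, %%ymm1, %%ymm1, %%ymm1\n\t"  /* ymm1 = b.im, duplicated  */
        "vmulpd    %%ymm0, %%ymm2, %%ymm2\n\t"       /* a.re*b.re, a.im*b.re     */
        "vmulpd    %%ymm0, %%ymm1, %%ymm1\n\t"       /* a.re*b.im, a.im*b.im     */
        "vshufpd   $5, %%ymm1, %%ymm1, %%ymm1\n\t"   /* swap within each pair    */
        "vaddsubpd %%ymm1, %%ymm2, %%ymm1\n\t"       /* subtract even, add odd   */
        "vmovupd   %%ymm1, %0"                       /* store both products      */
        : "=m" (*(double (*)[4]) out)
        : "m" (*(const double (*)[4]) a), "m" (*(const double (*)[4]) b)
        : "xmm0", "xmm1", "xmm2");                   /* registers the asm overwrites */
}
```

With the clobber list in place, the compiler will not keep live values in xmm0 through xmm2 across the asm statement, which is the kind of interference described in the question.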

I'll add two comments without directly answering your question:

  • I highly recommend using compiler intrinsics instead of direct assembly. That way, the compiler takes care of register allocation and can do a better job of optimizing your code (inlining, reordering instructions, etc.).
  • Agner Fog has a C++ vector class library with optimized vectorized operations, including operations on complex numbers. Even if you cannot use his library directly in your C code, his optimized code may be a good starting point; see src/special/complexvec.h in zipped source code .
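As a rough illustration of the intrinsics route, the same two multiplications could be written with the AVX intrinsics from `<immintrin.h>`. The function name `cmult_intrin` and the pointer-based signature are assumptions made for this sketch. The `target("avx")` attribute is a GCC/Clang feature that enables AVX for this one function; compiling the whole file with -mavx would be an alternative.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical intrinsics version: a, b, and out each point to two complex
   doubles stored as (re0, im0, re1, im1). Needs an AVX-capable CPU. */
__attribute__((target("avx")))
void cmult_intrin(const double *a, const double *b, double *out)
{
    assert(a && b && out);                           /* no null pointers        */
    __m256d va   = _mm256_loadu_pd(a);               /* a.re0 a.im0 a.re1 a.im1 */
    __m256d vb   = _mm256_loadu_pd(b);               /* b.re0 b.im0 b.re1 b.im1 */
    __m256d b_re = _mm256_movedup_pd(vb);            /* b.re0 b.re0 b.re1 b.re1 */
    __m256d b_im = _mm256_shuffle_pd(vb, vb, 0xF);   /* b.im0 b.im0 b.im1 b.im1 */
    __m256d t1   = _mm256_mul_pd(va, b_re);          /* a.re*b.re, a.im*b.re    */
    __m256d t2   = _mm256_mul_pd(va, b_im);          /* a.re*b.im, a.im*b.im    */
    t2 = _mm256_shuffle_pd(t2, t2, 0x5);             /* swap within each pair   */
    /* Even lanes: t1 - t2 (real parts); odd lanes: t1 + t2 (imaginary parts). */
    _mm256_storeu_pd(out, _mm256_addsub_pd(t1, t2));
}
```

Here the compiler chooses the registers itself, so no clobber list is needed.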