GCC compilation is very slow (large file)

I am trying to compile a large C file (specifically for MATLAB mex). The C file is about 20 MB (available from the gcc bug tracker if you want to play with it).

Below is the command I run, followed by its output. It runs for several hours, and as you can see, optimization is already disabled (-O0). Why is it so slow? Is there any way to make it faster?

(for reference: Ubuntu 12.04 64 bit, gcc 4.7.3)

    /usr/bin/gcc -c -DMX_COMPAT_32 -D_GNU_SOURCE -DMATLAB_MEX_FILE -I"/usr/local/MATLAB/R2015a/extern/include" -I"/usr/local/MATLAB/R2015a/simulink/include" -ansi -fexceptions -fPIC -fno-omit-frame-pointer -pthread -O0 -DNDEBUG path/to/test4.c -o /tmp/mex_198714460457975_3922/test4.o -v
    Using built-in specs.
    COLLECT_GCC=/usr/bin/gcc
    Target: x86_64-linux-gnu
    Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.7.3-2ubuntu1~12.04' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --with-system-zlib --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
    Thread model: posix
    gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04)
    COLLECT_GCC_OPTIONS='-c' '-D' 'MX_COMPAT_32' '-D' '_GNU_SOURCE' '-D' 'MATLAB_MEX_FILE' '-I' '/usr/local/MATLAB/R2015a/extern/include' '-I' '/usr/local/MATLAB/R2015a/simulink/include' '-ansi' '-fexceptions' '-fPIC' '-fno-omit-frame-pointer' '-pthread' '-O0' '-D' 'NDEBUG' '-o' '/tmp/mex_198714460457975_3922/test4.o' '-v' '-mtune=generic' '-march=x86-64'
     /usr/lib/gcc/x86_64-linux-gnu/4.7/cc1 -quiet -v -I /usr/local/MATLAB/R2015a/extern/include -I /usr/local/MATLAB/R2015a/simulink/include -imultilib . -imultiarch x86_64-linux-gnu -D_REENTRANT -D MX_COMPAT_32 -D _GNU_SOURCE -D MATLAB_MEX_FILE -D NDEBUG path/to/test4.c -quiet -dumpbase test4.c -mtune=generic -march=x86-64 -auxbase-strip /tmp/mex_198714460457975_3922/test4.o -O0 -ansi -version -fexceptions -fPIC -fno-omit-frame-pointer -fstack-protector -o /tmp/ccxDOA5f.s
    GNU C (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) version 4.7.3 (x86_64-linux-gnu)
        compiled by GNU C version 4.7.3, GMP version 5.0.2, MPFR version 3.1.0-p3, MPC version 0.9
    GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
    ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
    ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../x86_64-linux-gnu/include"
    #include "..." search starts here:
    #include <...> search starts here:
     /usr/local/MATLAB/R2015a/extern/include
     /usr/local/MATLAB/R2015a/simulink/include
     /usr/lib/gcc/x86_64-linux-gnu/4.7/include
     /usr/local/include
     /usr/lib/gcc/x86_64-linux-gnu/4.7/include-fixed
     /usr/include/x86_64-linux-gnu
     /usr/include
    End of search list.
    GNU C (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) version 4.7.3 (x86_64-linux-gnu)
        compiled by GNU C version 4.7.3, GMP version 5.0.2, MPFR version 3.1.0-p3, MPC version 0.9
    GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
    Compiler executable checksum: c119948b394d79ea05b6b3986ab084cf

EDIT (continued): I followed chqrlie's advice, and tcc compiled my function in under 5 seconds (I only had to remove the -ansi flag and change "gcc" to "tcc"), which is pretty remarkable. I can only imagine the complexity of gcc.

When I then try to mex this, however, there is a second command that mex normally runs. That second command is usually:

 /usr/bin/gcc -pthread -Wl,--no-undefined -Wl,-rpath-link,/usr/local/MATLAB/R2015a/bin/glnxa64 -shared -O -Wl,--version-script,"/usr/local/MATLAB/R2015a/extern/lib/glnxa64/mexFunction.map" /tmp/mex_61853296369424_4031/test4.o -L"/usr/local/MATLAB/R2015a/bin/glnxa64" -lmx -lmex -lmat -lm -lstdc++ -o test4.mexa64 

I cannot run this with tcc, as some of these flags are incompatible. If I try to run this second stage of compilation using gcc, I get:

    /usr/bin/ld: test4.o: relocation R_X86_64_PC32 against undefined symbol `mxGetPr' can not be used when making a shared object; recompile with -fPIC
    /usr/bin/ld: final link failed: Bad value
    collect2: error: ld returned 1 exit status

EDIT: The solution seems to be clang. tcc can compile the file, but the flags used in the second (linking) step of mexing are incompatible with tcc. Clang is very fast and produces a nice, small, optimized file.

3 answers

After testing, I found that the Clang compiler seems to have fewer problems compiling large files. Although Clang consumed nearly a gigabyte of memory while compiling, it successfully turned the OP's source code into a 70-kilobyte object file. This works for all optimization levels that I tested.

gcc was also able to compile this file quickly, without consuming too much memory, when optimization was enabled. This bug in gcc comes from the large expression in the OP's code, which places a huge burden on the register allocator. When optimization is turned on, the compiler performs an optimization called common subexpression elimination, which removes a lot of the redundancy in the OP's code and brings both the compilation time and the object file size down to manageable values.

Here are some timings using the test case from the bug report mentioned above:

    $ time gcc5 -O3 -c -o testcase.gcc5-O3.o testcase.c
    real    0m39,30s
    user    0m37,85s
    sys     0m1,42s
    $ time gcc5 -O0 -c -o testcase.gcc5-O0.o testcase.c
    real    23m33,34s
    user    23m27,07s
    sys     0m5,92s
    $ time tcc -c -o testcase.tcc.o testcase.c
    real    0m2,60s
    user    0m2,42s
    sys     0m0,17s
    $ time clang -O3 -c -o testcase.clang-O3.o testcase.c
    real    0m13,71s
    user    0m12,55s
    sys     0m1,16s
    $ time clang -O0 -c -o testcase.clang-O0.o testcase.c
    real    0m17,63s
    user    0m16,14s
    sys     0m1,49s
    $ time clang -Os -c -o testcase.clang-Os.o testcase.c
    real    0m14,88s
    user    0m13,73s
    sys     0m1,11s
    $ time clang -Oz -c -o testcase.clang-Oz.o testcase.c
    real    0m13,56s
    user    0m12,45s
    sys     0m1,09

These are the resulting object file sizes:

        text     data  bss       dec      hex  filename
    39101286        0    0  39101286  254a366  testcase.clang-O0.o
       72161        0    0     72161    119e1  testcase.clang-O3.o
       72087        0    0     72087    11997  testcase.clang-Os.o
       72087        0    0     72087    11997  testcase.clang-Oz.o
    38683240        0    0  38683240  24e4268  testcase.gcc5-O0.o
       87500        0    0     87500    155cc  testcase.gcc5-O3.o
       78239        0    0     78239    1319f  testcase.gcc5-Os.o
    69210504  3170616    0  72381120  45072c0  testcase.tcc.o

Almost the entire file is a single expression, the assignment double f[24] = ... This produces a gigantic abstract syntax tree. I would be surprised if anything other than a specialized compiler could handle it efficiently.

A 20 megabyte file by itself may be fine, but the one gigantic expression may be what causes the problem. As a first step, try splitting the line into double f[24] = {0}; followed by 24 assignments f[0] = ...; f[1] = ...; and so on, and see what happens. In the worst case, you can split the 24 assignments into 24 functions, each in its own .c file, and compile them separately. This will not reduce the size of the AST, it only reorganizes it, but GCC is probably better optimized for processing many statements that together make up a lot of code than for one huge expression.
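The split described above can be sketched on a toy scale. This is only an illustration: the three short expressions, the array size of 3, and all function names are made up here, standing in for the 24 very long generated expressions in the real file.

```c
#include <assert.h>

/* Original shape: one declaration whose initializer holds every expression.
   (Hypothetical stand-in; the real file has 24 very long expressions.) */
void fill_single_initializer(double s4, double s5, double c4, double out[3])
{
    double f[3] = { s4 * s5 + c4, c4 * s5 - s4, s4 * s4 + s5 * s5 };
    int i;
    for (i = 0; i < 3; i++)
        out[i] = f[i];
}

/* Suggested shape: zero-initialize, then one assignment statement per element. */
void fill_split_statements(double s4, double s5, double c4, double out[3])
{
    double f[3] = {0};
    int i;
    f[0] = s4 * s5 + c4;
    f[1] = c4 * s5 - s4;
    f[2] = s4 * s4 + s5 * s5;
    for (i = 0; i < 3; i++)
        out[i] = f[i];
}

/* Sanity check: both shapes must produce the same values. The inputs are
   chosen so that every intermediate result is exactly representable. */
void check_shapes_agree(void)
{
    double a[3], b[3];
    int i;
    fill_single_initializer(0.5, 0.25, 2.0, a);
    fill_split_statements(0.5, 0.25, 2.0, b);
    for (i = 0; i < 3; i++)
        assert(a[i] == b[i]);
}
```

The computed values are the same either way; the only change is that the compiler sees several independent statements instead of one enormous initializer. Whether that actually helps gcc at -O0 on the real file is something to measure, not something this sketch demonstrates.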

The ultimate approach would be to generate the code in a more optimized way. For example, if I search for s4*s5*s6, I get 77,783 hits. The variables s4, s5 and s6 never change, so you should create a temporary variable double _tmp1 = s4*s5*s6; and use it instead of repeating the expression. That alone removes 311,132 nodes from your abstract syntax tree (assuming s4*s5*s6 is 5 nodes and _tmp1 is one node), which is that much less for GCC to process. It should also produce faster code (you won't repeat the same multiplication 77,783 times).

If you do this in a smart, recursive way (e.g. s4*s5*s6 → _tmp1, (c4*c6+s4*s5*s6) → (c4*c6+_tmp1) → _tmp2, c5*s6*(c4*c6+s4*s5*s6) → c5*s6*_tmp2 → _tmp3, etc.), you can probably eliminate most of the generated code.
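A minimal sketch of that recursive substitution, using the three subexpressions named above. How they are combined into a single sum is invented for the example, as are the function names; the real generated expression is far larger.

```c
#include <assert.h>

/* Repetitive shape, as a code generator might emit it (hypothetical and
   heavily shortened): s4*s5*s6 is spelled out in every term. */
double expr_repeated(double c4, double c5, double c6,
                     double s4, double s5, double s6)
{
    return s4*s5*s6
         + (c4*c6 + s4*s5*s6)
         + c5*s6*(c4*c6 + s4*s5*s6);
}

/* Same value, with each repeated subexpression computed once and reused. */
double expr_hoisted(double c4, double c5, double c6,
                    double s4, double s5, double s6)
{
    double _tmp1 = s4*s5*s6;        /* s4*s5*s6 */
    double _tmp2 = c4*c6 + _tmp1;   /* (c4*c6+s4*s5*s6) */
    double _tmp3 = c5*s6*_tmp2;     /* c5*s6*(c4*c6+s4*s5*s6) */
    return _tmp1 + _tmp2 + _tmp3;
}

/* Sanity check with inputs whose intermediate results are all exactly
   representable, so the two forms must agree exactly. */
void check_hoisting(void)
{
    double a = expr_repeated(2.0, 3.0, 0.5, 0.5, 4.0, 0.25);
    double b = expr_hoisted(2.0, 3.0, 0.5, 0.5, 4.0, 0.25);
    assert(a == b);
}
```

Each temporary keeps the original order of operations, so the hoisted form computes the same floating-point values as the repeated one. This is essentially what common subexpression elimination does inside an optimizing compiler, done ahead of time in the generator so that the compiler never has to parse the redundant copies.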


Try Fabrice Bellard's Tiny C Compiler, tcc, from http://tinycc.org :

    chqrlie$ time tcc -c test4.c
    real    0m1.336s
    user    0m1.248s
    sys     0m0.084s
    chqrlie$ size test4.o
        text     data  bss       dec      hex  filename
    38953877  3170632    0  42124509  282c4dd  test4.o

Yes, it's 1.336 seconds on a fairly basic PC!

Of course, I cannot test the resulting executable, but the object file should be linkable with the rest of your program and libraries.

For this test, I used a dummy version of the mex.h file:

    typedef struct mxArray mxArray;
    double *mxGetPr(const mxArray*);
    enum { mxREAL = 0 };
    mxArray *mxCreateDoubleMatrix(int nx, int ny, int type);

gcc still has not finished compiling ...

EDIT: gcc managed to bog down my Linux machine so badly that I can no longer connect to it :(

