Why is this no-op loop not optimized?

Question

The following code copies from one array of zeros, interpreted as floats, to another, and prints how long the operation takes. Since I have seen many cases where no-op loops are simply optimized away by compilers, including gcc, I expected that at some point while changing my copy program it would stop doing the copy.

    #include <iostream>
    #include <cstring>
    #include <sys/time.h>

    static inline long double currentTime()
    {
        timespec ts;
        clock_gettime(CLOCK_MONOTONIC,&ts);
        return ts.tv_sec+(long double)(ts.tv_nsec)*1e-9;
    }

    int main()
    {
        size_t W=20000,H=10000;

        float* data1=new float[W*H];
        float* data2=new float[W*H];
        memset(data1,0,W*H*sizeof(float));
        memset(data2,0,W*H*sizeof(float));

        long double time1=currentTime();
        for(int q=0;q<16;++q) // take more time
            for(int k=0;k<W*H;++k)
                data2[k]=data1[k];
        long double time2=currentTime();

        std::cout << (time2-time1)*1e+3 << " ms\n";

        delete[] data1;
        delete[] data2;
    }

I compiled this with g++ 4.8.1 using the command g++ main.cpp -o test -std=c++0x -O3 -lrt. The program prints 6952.17 ms for me. (I had to set ulimit -s 2000000 so that it wouldn't crash.)

I also tried replacing the arrays created with new by automatic VLAs and removing the memset calls, but this does not change g++'s behavior (apart from changing the timings by a factor of several).

It seems the compiler could prove that this code does nothing useful, so why didn't it optimize the loop away?

+7

c++ optimization gcc

Ruslan Feb 24 '14 at 9:31

4 answers

The code in this question has changed quite a bit, invalidating answers that were correct. This answer applied to the 5th version: since the code at that point tried to read uninitialized memory, an optimizer may reasonably assume that unexpected things are happening.

Many optimization steps follow a similar scheme: a pattern of instructions is matched against the current state of the compilation. If the pattern matches at some point, the matched instructions are replaced (parametrically) by a more efficient version. A very simple example of such a pattern is the definition of a variable that is never used afterwards; the replacement in this case is simply a deletion.

These patterns are designed for correct code. On incorrect code, the patterns may simply fail to match, or they may match in entirely unintended ways. The first case leads to no optimization; the second can lead to totally unpredictable results (certainly if the modified code is then optimized further).
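The match-and-replace idea described above can be sketched on a toy example. This is only an illustration under stated assumptions: the `Instr` struct and `removeDeadDefinitions` function are hypothetical names invented here, the "program" is straight-line code with no branches, and real compilers such as gcc work on far richer intermediate representations.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy instruction "dest = lhs op rhs"; not a real compiler IR.
struct Instr {
    std::string dest;
    std::string lhs;
    std::string rhs;
};

// Pattern: a definition whose destination is not a program output and is
// not read by any later instruction. Replacement: delete the definition.
// The scan restarts after each deletion, because removing one definition
// can make an earlier one unused as well.
std::vector<Instr> removeDeadDefinitions(std::vector<Instr> code,
                                         const std::vector<std::string>& outputs) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < code.size(); ++i) {
            bool used = false;
            for (const std::string& out : outputs)
                if (out == code[i].dest) used = true;
            for (std::size_t j = i + 1; j < code.size() && !used; ++j)
                if (code[j].lhs == code[i].dest || code[j].rhs == code[i].dest)
                    used = true;
            if (!used) {
                code.erase(code.begin() + i);
                changed = true;
                break;
            }
        }
    }
    return code;
}
```

For the sequence t1 = a op b, t2 = t1 op c, t3 = a op c with t3 as the only output, the first scan deletes t2 and the second deletes t1, which only became unused after the first deletion. The check is deliberately conservative: a name that is redefined later still counts as used.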

+3

Msalters Feb 24 '14 at 12:22

Why do you expect the compiler to optimize this? It's generally hard to prove that writes to arbitrary memory addresses are a "no-op". In your case it would be possible, but it would require the compiler to track the heap memory addresses through new (which is again hard, since these addresses are generated at run time), and there really is no incentive for doing this.

After all, you explicitly tell the compiler that you want to allocate memory and write to it. How is the poor compiler to know that you lied to it?

In particular, the problem is that heap memory could be aliased by lots of other things. It happens to be private to your process, but, as I said above, proving this is a lot of work for the compiler, unlike for function-local memory.
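The aliasing concern can be made concrete with a small sketch. The function names below are hypothetical and chosen for this illustration; `forwardCopy` has the same shape as the inner loop in the question. With two distinct buffers holding equal values the copy changes nothing, but when the destination overlaps the source the very same loop changes the data, so the compiler has to rule out overlap before it may drop the loop.

```cpp
#include <cstddef>
#include <vector>

// Same shape as the loop in the question: element-wise forward copy.
void forwardCopy(float* dst, const float* src, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k)
        dst[k] = src[k];
}

// Two distinct buffers that already hold equal values:
// the copy leaves the destination unchanged.
std::vector<float> copyBetweenDistinctBuffers() {
    std::vector<float> a(4, 0.0f), b(4, 0.0f);
    forwardCopy(b.data(), a.data(), b.size());
    return b;
}

// Overlapping ranges inside one buffer: each store feeds the next load,
// so the first element is propagated through the whole array.
std::vector<float> copyBetweenOverlappingRanges() {
    std::vector<float> a = {1.0f, 2.0f, 3.0f, 4.0f};
    forwardCopy(a.data() + 1, a.data(), a.size() - 1);
    return a;
}
```

In the second function the array {1, 2, 3, 4} ends up as {1, 1, 1, 1}. Inside `forwardCopy` alone, the compiler cannot tell which of the two situations it is in.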

0

Konrad Rudolph Feb 24 '14 at 9:54

The only way the compiler could know that this is a no-op is if it knew what memset does. For that to happen, the function must either be defined in a header (and it typically isn't), or it must be treated by the compiler as a special intrinsic. Barring those tricks, the compiler just sees a call to an unknown function that could have side effects and could do something different on each of the two calls.
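The dependence on memset can be illustrated with a run-time check of the fact the compiler would have to establish at compile time. This is a sketch with hypothetical helper names, not something a compiler does literally: the copy is a no-op only if both buffers already hold identical bytes, and that in turn depends on what the two memset calls wrote.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Returns true if copying src over dst would change dst, i.e. the copy
// is NOT a no-op. Compares raw bytes, which is what the copy transfers.
bool copyWouldChangeDestination(const float* src, const float* dst, std::size_t n) {
    return std::memcmp(src, dst, n * sizeof(float)) != 0;
}

// Fills two fresh buffers with the given byte values via memset, then
// asks whether the copy loop from the question would have any effect.
bool copyMattersAfterFill(int fillSrc, int fillDst, std::size_t n) {
    std::vector<float> src(n), dst(n);
    std::memset(src.data(), fillSrc, n * sizeof(float));
    std::memset(dst.data(), fillDst, n * sizeof(float));
    return copyWouldChangeDestination(src.data(), dst.data(), n);
}
```

Here `copyMattersAfterFill(0, 0, n)` is false, while `copyMattersAfterFill(0, 1, n)` is true. A compiler that treats memset as an opaque call cannot distinguish the two cases.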

0

jalf Feb 24 '14 at 9:56

manlio · Accepted Answer · 2014-02-24T10:24:34+0000

Anyway, it isn't impossible (clang++ version 3.3):

 clang++ main.cpp -o test -std=c++0x -O3 -lrt

The program prints 0.000367 ms for me... and looking at the assembly:

    ...
        callq   clock_gettime
        movq    56(%rsp), %r14
        movq    64(%rsp), %rbx
        leaq    56(%rsp), %rsi
        movl    $1, %edi
        callq   clock_gettime
    ...

and for g ++:

    ...
        call    clock_gettime
        fildq   32(%rsp)
        movl    $16, %eax
        fildq   40(%rsp)
        fmull   .LC0(%rip)
        faddp   %st, %st(1)
        .p2align 4,,10
        .p2align 3
    .L2:
        movl    $1, %ecx
        xorl    %edx, %edx
        jmp     .L5
        .p2align 4,,10
        .p2align 3
    .L3:
        movq    %rcx, %rdx
        movq    %rsi, %rcx
    .L5:
        leaq    1(%rcx), %rsi
        movss   0(%rbp,%rdx,4), %xmm0
        movss   %xmm0, (%rbx,%rdx,4)
        cmpq    $200000001, %rsi
        jne     .L3
        subl    $1, %eax
        jne     .L2
        fstpt   16(%rsp)
        leaq    32(%rsp), %rsi
        movl    $1, %edi
        call    clock_gettime
    ...

EDIT (g++ v4.8.2 / clang++ v3.3)

SOURCE CODE - ORIGINAL VERSION (1)

    ...
    size_t W=20000,H=10000;

    float* data1=new float[W*H];
    float* data2=new float[W*H];
    ...

SOURCE CODE - MODIFIED VERSION (2)

    ...
    const size_t W=20000;
    const size_t H=10000;

    float data1[W*H];
    float data2[W*H];
    ...

Now the case that is not optimized is (1) + g++.
