In any case, this optimization is possible (clang++ version 3.3):
clang++ main.cpp -o test -std=c++0x -O3 -lrt
The program prints 0.000367 ms for me, and the assembly shows why: nothing is left between the two clock_gettime calls:
    ...
    callq   clock_gettime
    movq    56(%rsp), %r14
    movq    64(%rsp), %rbx
    leaq    56(%rsp), %rsi
    movl    $1, %edi
    callq   clock_gettime
    ...
and for g++, where the copy loop is still present between the two calls:
    ...
        call    clock_gettime
        fildq   32(%rsp)
        movl    $16, %eax
        fildq   40(%rsp)
        fmull   .LC0(%rip)
        faddp   %st, %st(1)
        .p2align 4,,10
        .p2align 3
    .L2:
        movl    $1, %ecx
        xorl    %edx, %edx
        jmp     .L5
        .p2align 4,,10
        .p2align 3
    .L3:
        movq    %rcx, %rdx
        movq    %rsi, %rcx
    .L5:
        leaq    1(%rcx), %rsi
        movss   0(%rbp,%rdx,4), %xmm0
        movss   %xmm0, (%rbx,%rdx,4)
        cmpq    $200000001, %rsi
        jne     .L3
        subl    $1, %eax
        jne     .L2
        fstpt   16(%rsp)
        leaq    32(%rsp), %rsi
        movl    $1, %edi
        call    clock_gettime
    ...
EDIT (g++ v4.8.2 / clang++ v3.3)
SOURCE CODE - ORIGINAL VERSION (1)
    ...
    size_t W=20000,H=10000;
    float* data1=new float[W*H];
    float* data2=new float[W*H];
    ...
SOURCE CODE - MODIFIED VERSION (2)
    ...
    const size_t W=20000;
    const size_t H=10000;
    float data1[W*H];
    float data2[W*H];
    ...
Now the case that is not optimized is (1) + g++.
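
For reference, here is a minimal sketch of the kind of timed copy loop the assembly above seems to correspond to. It is a reconstruction, not the original main.cpp: the names elapsed_ms and time_copy are hypothetical, and std::vector is used instead of the raw arrays from (1) and (2) so the sizes can be varied. The clock id $1 passed in %edi matches CLOCK_MONOTONIC on Linux, the 16 in %eax looks like the repeat count, and 200000000 is W*H.

```cpp
#include <cstddef>
#include <time.h>
#include <vector>

// Elapsed time between two timespec values, in milliseconds.
double elapsed_ms(const timespec& start, const timespec& stop) {
    return (stop.tv_sec - start.tv_sec) * 1e3 +
           (stop.tv_nsec - start.tv_nsec) / 1e6;
}

// Copies src into dst `repeats` times and returns the elapsed time in ms.
// Assumption: dst is at least as large as src.
double time_copy(const std::vector<float>& src, std::vector<float>& dst,
                 int repeats) {
    timespec start{}, stop{};
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int r = 0; r < repeats; ++r) {
        for (std::size_t i = 0; i < src.size(); ++i) {
            dst[i] = src[i];
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &stop);
    return elapsed_ms(start, stop);
}
```

Calling time_copy(src, dst, 16) with two vectors of 20000*10000 floats would approximate the measurement above (about 800 MB per vector). Note that here dst is visible to the caller, so the compiler generally has to keep the stores; whether a loop like this is removed depends on whether the compiler can prove nobody reads the destination afterwards, which is presumably what differs between (1) and (2).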