This affects only the 32-bit compiler; x86-64 builds are not affected, regardless of optimization settings. However, you see the problem in 32-bit builds whether you optimize for speed (/O2) or for size (/O1). As you mentioned, it works as expected in debug builds with optimization disabled.
Wimmel's suggestion to change the packing is accurate, although it does not change the behavior. (The code below assumes the packing is correctly set to 1 for WMatrix.)
I cannot reproduce it in VS 2010, but I can in VS 2013 and 2015. I do not have 2012 installed. However, this is good enough to let us analyze the difference between the object code generated by the two compilers.
Here is the code for mul1 from VS 2010 ("working" code):
(Actually, in many cases the compiler inlined the code from this function at the call site. But the compiler still outputs disassembly listings containing the code it generated for the individual functions before inlining. That is what we are looking at here, because it is less cluttered. The behavior of the code is completely equivalent whether it was inlined or not.)
    PUBLIC mul1
    _TEXT SEGMENT
    _m$ = 8                             ; size = 64
    _f$ = 72                            ; size = 4
    mul1 PROC
    ___$ReturnUdt$ = eax
        push     esi
        push     edi

    ; WMatrix out = m;
        mov      ecx, 16                ; 00000010H
        lea      esi, DWORD PTR _m$[esp+4]
        mov      edi, eax
        rep movsd

    ; for (unsigned int i = 0; i < 4; i++)
    ; {
    ;    for (unsigned int j = 0; j < 4; j++)
    ;    {
    ;       unsigned int idx = i * 4 + j; // critical code
    ;       *(&out._11 + idx) *= f;       // critical code
        movss    xmm0, DWORD PTR [eax]
        cvtps2pd xmm1, xmm0
        movss    xmm0, DWORD PTR _f$[esp+4]
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax], xmm1

        movss    xmm1, DWORD PTR [eax+4]
        cvtps2pd xmm1, xmm1
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+4], xmm1

        movss    xmm1, DWORD PTR [eax+8]
        cvtps2pd xmm1, xmm1
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+8], xmm1

        movss    xmm1, DWORD PTR [eax+12]
        cvtps2pd xmm1, xmm1
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+12], xmm1

        movss    xmm2, DWORD PTR [eax+16]
        cvtps2pd xmm2, xmm2
        cvtps2pd xmm1, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+16], xmm1

        movss    xmm1, DWORD PTR [eax+20]
        cvtps2pd xmm1, xmm1
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+20], xmm1

        movss    xmm1, DWORD PTR [eax+24]
        cvtps2pd xmm1, xmm1
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+24], xmm1

        movss    xmm1, DWORD PTR [eax+28]
        cvtps2pd xmm1, xmm1
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+28], xmm1

        movss    xmm1, DWORD PTR [eax+32]
        cvtps2pd xmm1, xmm1
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+32], xmm1

        movss    xmm1, DWORD PTR [eax+36]
        cvtps2pd xmm1, xmm1
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+36], xmm1

        movss    xmm2, DWORD PTR [eax+40]
        cvtps2pd xmm2, xmm2
        cvtps2pd xmm1, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+40], xmm1

        movss    xmm1, DWORD PTR [eax+44]
        cvtps2pd xmm1, xmm1
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+44], xmm1

        movss    xmm2, DWORD PTR [eax+48]
        cvtps2pd xmm1, xmm0
        cvtps2pd xmm2, xmm2
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+48], xmm1

        movss    xmm1, DWORD PTR [eax+52]
        cvtps2pd xmm1, xmm1
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        movss    DWORD PTR [eax+52], xmm1

        movss    xmm1, DWORD PTR [eax+56]
        cvtps2pd xmm1, xmm1
        cvtps2pd xmm2, xmm0
        mulsd    xmm1, xmm2
        cvtpd2ps xmm1, xmm1
        cvtps2pd xmm0, xmm0
        movss    DWORD PTR [eax+56], xmm1

        movss    xmm1, DWORD PTR [eax+60]
        cvtps2pd xmm1, xmm1
        mulsd    xmm1, xmm0
        pop      edi
        cvtpd2ps xmm0, xmm1
        movss    DWORD PTR [eax+60], xmm0
        pop      esi

    ; return out;
        ret      0
    mul1 ENDP
Compare this with the code for mul1 generated by VS 2015:
    mul1 PROC
    _m$ = 8                             ; size = 64
    ; ___$ReturnUdt$ = ecx
    ; _f$ = xmm2s

    ; WMatrix out = m;
        movups   xmm0, XMMWORD PTR _m$[esp-4]

    ; for (unsigned int i = 0; i < 4; i++)
        xor      eax, eax
        movaps   xmm1, xmm2
        movups   XMMWORD PTR [ecx], xmm0
        movups   xmm0, XMMWORD PTR _m$[esp+12]
        shufps   xmm1, xmm1, 0
        movups   XMMWORD PTR [ecx+16], xmm0
        movups   xmm0, XMMWORD PTR _m$[esp+28]
        movups   XMMWORD PTR [ecx+32], xmm0
        movups   xmm0, XMMWORD PTR _m$[esp+44]
        movups   XMMWORD PTR [ecx+48], xmm0
        npad     4
    $LL4@mul1:

    ;    for (unsigned int j = 0; j < 4; j++)
    ;    {
    ;       unsigned int idx = i * 4 + j; // critical code
    ;       *(&out._11 + idx) *= f;       // critical code
        movups   xmm0, XMMWORD PTR [ecx+eax*4]
        mulps    xmm0, xmm1
        movups   XMMWORD PTR [ecx+eax*4], xmm0
        inc      eax
        cmp      eax, 4
        jb       SHORT $LL4@mul1

    ; return out;
        mov      eax, ecx
        ret      0
    ?mul1@@YA?AUWMatrix@@U1@M@Z ENDP    ; mul1
    _TEXT ENDS
You can see right away how much shorter the code is. Apparently, the optimizer became a lot smarter between VS 2010 and VS 2015. Unfortunately, sometimes the source of the optimizer's "smarts" is the exploitation of bugs in your code.
Looking at the code that corresponds to the loops, you will see that VS 2010 unrolled them completely. All calculations are done inline, so there are no branches. This is what you would expect for loops whose upper and lower bounds are known at compile time and, as in this case, are quite small.
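As a rough illustration, the unrolled VS 2010 listing amounts to the following per-element operation, repeated for byte offsets 0 through 60. This is only a sketch: the function name `scale16_reference` and the flat `float*` parameter are assumptions made for the example, not part of the original code.

```cpp
#include <cstddef>

// Hypothetical scalar model of the unrolled VS 2010 code: every one of the
// 16 floats is widened to double (cvtps2pd), multiplied (mulsd), and
// narrowed back to float (cvtpd2ps) before being stored.
void scale16_reference(float* a, float f) {
    for (std::size_t k = 0; k < 16; ++k) {
        a[k] = static_cast<float>(static_cast<double>(a[k]) *
                                  static_cast<double>(f));
    }
}
```

The point is simply that all 16 elements are visited exactly once.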
What happened in VS 2015? Well, it did not unroll anything. There are 5 lines of code, and then a conditional jump (JB) back to the top of the loop. That alone does not tell you much. What does look very suspicious is that it only loops 4 times (see the cmp eax, 4 instruction, which sets the flags right before JB executes, effectively continuing the loop as long as the counter is less than 4). Well, that might be fine if it had merged the two loops into one. Let's see what it does inside the loop:
    $LL4@mul1:
        movups   xmm0, XMMWORD PTR [ecx+eax*4]   ; load a packed unaligned value into XMM0
        mulps    xmm0, xmm1                      ; do a packed multiplication of XMM0 by XMM1,
                                                 ;  storing the result in XMM0
        movups   XMMWORD PTR [ecx+eax*4], xmm0   ; store the result of the previous multiplication
                                                 ;  back into the memory location that we
                                                 ;  initially loaded from
        inc      eax                             ; one iteration done, increment loop counter
        cmp      eax, 4                          ; see how many loops we've done
        jb       $LL4@mul1                       ; keep looping if < 4 iterations
The code reads a value from memory (an XMM-sized value from the location given by ecx + eax * 4) into XMM0, multiplies it by the value in XMM1 (which was set up outside the loop, based on f), and then stores the result back to the original memory location.
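To make the effect of that addressing concrete, here is a scalar model of the loop exactly as listed. It is a sketch under assumptions: the name `model_vs2015_loop` and the flat `float*` view of the matrix are hypothetical. Since ecx + eax * 4 is a byte address and EAX advances by 1, each 16-byte window starts only one float after the previous one.

```cpp
#include <cstddef>

// Hypothetical scalar model of the VS 2015 loop as listed above.
// Each iteration multiplies 4 consecutive floats (one XMM register),
// but the window only moves forward by ONE float per iteration.
void model_vs2015_loop(float* a, float f) {
    for (std::size_t i = 0; i < 4; ++i) {            // eax = 0, 1, 2, 3
        for (std::size_t lane = 0; lane < 4; ++lane) {
            a[i + lane] *= f;                        // [ecx + eax*4], 4 lanes
        }
    }
}
```

Under this model the windows overlap: elements 0 through 6 are multiplied between one and four times, and elements 7 through 15 are never touched.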
Compare this with the code for the corresponding loop in mul2 :
    $LL4@mul2:
        lea      eax, DWORD PTR [eax+16]
        movups   xmm0, XMMWORD PTR [eax-24]
        mulps    xmm0, xmm2
        movups   XMMWORD PTR [eax-24], xmm0
        sub      ecx, 1
        jne      $LL4@mul2
Aside from the different loop control sequence (this one sets ECX to 4 outside the loop, subtracts 1 each time through, and keeps looping as long as ECX != 0), the big difference here is the actual XMM values that it manipulates in memory. Instead of loading from [ecx+eax*4], it loads from [eax-24] (after having added 16 to EAX).
What is different about mul2? You added code to track a separate index in idx2, incrementing it each time through the loop. That alone would not be enough. If you comment out the assignment to the bool variable b, mul1 and mul2 result in identical object code. Clearly, without the comparison of idx to idx2, the compiler can deduce that idx2 is completely unused and therefore eliminate it, turning mul2 into mul1. But with that comparison, the compiler apparently cannot eliminate idx2, and its presence slightly changes which optimizations are deemed possible for the function, resulting in the difference in output.
Now the question is why this is happening. Is this an optimizer bug, as you first suspected? Well, no, and as some of the commenters have already pointed out, blaming the compiler or optimizer should never be your first instinct. Always assume that the bug is in your code unless you can prove otherwise. That proof would always involve examining the disassembly and, preferably, citing the relevant parts of the language standard if you really want to be taken seriously.
In this case, Mystic has nailed the problem. Your code exhibits undefined behavior when it does *(&out._11 + idx). This makes certain assumptions about the layout of the WMatrix structure in memory, which you cannot legally make, even after explicitly setting the packing.
This is why undefined behavior is evil: it results in code that works sometimes and not at other times. It is very sensitive to compiler flags, especially optimizations, and also to the target platform (as we saw at the top of this answer). mul2 only works by accident. Both mul1 and mul2 are wrong. Sorry, but the bug is in your code. Worse, the compiler did not issue a warning that might have alerted you to your use of undefined behavior.
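For contrast, here is a minimal sketch of one way to get the same indexing without stepping outside an object. It assumes you are free to change the type: `WMatrixArr` and `mul_defined` are hypothetical names, not taken from the original code, and the row-major layout is likewise an assumption.

```cpp
#include <cstddef>

// Hypothetical replacement for WMatrix: all 16 elements live in a single
// array member, so indexing across the whole matrix stays inside one object.
struct WMatrixArr {
    float m[16];  // assumed row-major: element (row, col) is m[row * 4 + col]
};

WMatrixArr mul_defined(const WMatrixArr& in, float f) {
    WMatrixArr out = in;
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            out.m[i * 4 + j] *= f;  // plain array indexing, in bounds
        }
    }
    return out;
}
```

Because the index never leaves the bounds of `m`, the optimizer has no license to assume that only part of the matrix is reachable.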