Bug in the VC++ 14.0 (2015) compiler?

Question

I am running into a problem that occurs only in Release x86 mode, not in Release x64 or in any Debug mode. I was able to reproduce it with the following code:

#include <stdio.h>
#include <iostream>
using namespace std;

struct WMatrix {
    float _11, _12, _13, _14;
    float _21, _22, _23, _24;
    float _31, _32, _33, _34;
    float _41, _42, _43, _44;

    WMatrix(float f11, float f12, float f13, float f14,
            float f21, float f22, float f23, float f24,
            float f31, float f32, float f33, float f34,
            float f41, float f42, float f43, float f44) :
        _11(f11), _12(f12), _13(f13), _14(f14),
        _21(f21), _22(f22), _23(f23), _24(f24),
        _31(f31), _32(f32), _33(f33), _34(f34),
        _41(f41), _42(f42), _43(f43), _44(f44)
    {
    }
};

void printmtx(WMatrix m1) {
    char str[256];
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._11, m1._12, m1._13, m1._14);
    cout << str << "\n";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._21, m1._22, m1._23, m1._24);
    cout << str << "\n";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._31, m1._32, m1._33, m1._34);
    cout << str << "\n";
    sprintf_s(str, 256, "%.3f, %.3f, %.3f, %.3f", m1._41, m1._42, m1._43, m1._44);
    cout << str << "\n";
}

WMatrix mul1(WMatrix m, float f) {
    WMatrix out = m;
    for (unsigned int i = 0; i < 4; i++) {
        for (unsigned int j = 0; j < 4; j++) {
            unsigned int idx = i * 4 + j; // critical code
            *(&out._11 + idx) *= f;       // critical code
        }
    }
    return out;
}

WMatrix mul2(WMatrix m, float f) {
    WMatrix out = m;
    unsigned int idx2 = 0;
    for (unsigned int i = 0; i < 4; i++) {
        for (unsigned int j = 0; j < 4; j++) {
            unsigned int idx = i * 4 + j; // critical code
            bool b = idx == idx2;         // critical code
            *(&out._11 + idx) *= f;       // critical code
            idx2++;
        }
    }
    return out;
}

int main() {
    WMatrix m1(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
    WMatrix m2 = mul1(m1, 0.5f);
    WMatrix m3 = mul2(m1, 0.5f);

    printmtx(m1);
    cout << "\n";
    printmtx(m2);
    cout << "\n";
    printmtx(m3);

    int x;
    cin >> x;
}

In the code above, mul2 works but mul1 does not. Both mul1 and mul2 simply try to iterate over the floats in WMatrix and multiply each of them by f, but the way mul1 computes its index (i * 4 + j) somehow produces incorrect results. The only thing mul2 does differently is check the index before using it, and then it works (there are many other ways of tinkering with the index to make it work). Note that if you remove the line "bool b = idx == idx2", mul2 breaks as well.

Here is the result:

1.000, 2.000, 3.000, 4.000
5.000, 6.000, 7.000, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000

0.500, 0.500, 0.375, 0.250
0.625, 1.500, 3.500, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000

0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000

The correct output should be:

1.000, 2.000, 3.000, 4.000
5.000, 6.000, 7.000, 8.000
9.000, 10.000, 11.000, 12.000
13.000, 14.000, 15.000, 16.000

0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000

0.500, 1.000, 1.500, 2.000
2.500, 3.000, 3.500, 4.000
4.500, 5.000, 5.500, 6.000
6.500, 7.000, 7.500, 8.000

Am I missing something? Or is this really a compiler bug?

c++ visual-studio

Hasan al-jawahiri Aug 29 '16 at 15:11

2 answers

Cody Gray · Answer 1 · 2016-08-29T16:32:33+0000

This affects only the 32-bit compiler; x86-64 builds are not affected, regardless of optimization settings. In 32-bit builds, however, the problem appears whether you optimize for speed (/O2) or for size (/O1). As you mentioned, it works as expected in debug builds with optimization disabled.

Wimmel's suggestion to change the packing is accurate, although it does not change the behavior. (The code below assumes that the packing is correctly set to 1 for WMatrix.)

I cannot reproduce it in VS 2010, but I can in VS 2013 and 2015. I do not have 2012 installed. However, that is good enough to let us analyze the difference between the object code generated by the two compilers.

Here is the code for mul1 from VS 2010 ("working" code):
(Actually, in many cases the compiler inlined the code from this function at the call site. But the compiler still outputs disassembly listings containing the code that it generated for the individual functions before inlining. That is what we are looking at here, because it is less cluttered. The behavior of the code is exactly the same whether or not it was inlined.)

PUBLIC mul1
_TEXT SEGMENT
_m$ = 8                                     ; size = 64
_f$ = 72                                    ; size = 4
mul1 PROC
___$ReturnUdt$ = eax
   push esi
   push edi

; WMatrix out = m;
   mov ecx, 16                              ; 00000010H
   lea esi, DWORD PTR _m$[esp+4]
   mov edi, eax
   rep movsd

; for (unsigned int i = 0; i < 4; i++)
; {
;    for (unsigned int j = 0; j < 4; j++)
;    {
;       unsigned int idx = i * 4 + j; // critical code
;       *(&out._11 + idx) *= f; // critical code
   movss xmm0, DWORD PTR [eax]
   cvtps2pd xmm1, xmm0
   movss xmm0, DWORD PTR _f$[esp+4]
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax], xmm1
   movss xmm1, DWORD PTR [eax+4]
   cvtps2pd xmm1, xmm1
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+4], xmm1
   movss xmm1, DWORD PTR [eax+8]
   cvtps2pd xmm1, xmm1
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+8], xmm1
   movss xmm1, DWORD PTR [eax+12]
   cvtps2pd xmm1, xmm1
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+12], xmm1
   movss xmm2, DWORD PTR [eax+16]
   cvtps2pd xmm2, xmm2
   cvtps2pd xmm1, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+16], xmm1
   movss xmm1, DWORD PTR [eax+20]
   cvtps2pd xmm1, xmm1
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+20], xmm1
   movss xmm1, DWORD PTR [eax+24]
   cvtps2pd xmm1, xmm1
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+24], xmm1
   movss xmm1, DWORD PTR [eax+28]
   cvtps2pd xmm1, xmm1
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+28], xmm1
   movss xmm1, DWORD PTR [eax+32]
   cvtps2pd xmm1, xmm1
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+32], xmm1
   movss xmm1, DWORD PTR [eax+36]
   cvtps2pd xmm1, xmm1
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+36], xmm1
   movss xmm2, DWORD PTR [eax+40]
   cvtps2pd xmm2, xmm2
   cvtps2pd xmm1, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+40], xmm1
   movss xmm1, DWORD PTR [eax+44]
   cvtps2pd xmm1, xmm1
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+44], xmm1
   movss xmm2, DWORD PTR [eax+48]
   cvtps2pd xmm1, xmm0
   cvtps2pd xmm2, xmm2
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+48], xmm1
   movss xmm1, DWORD PTR [eax+52]
   cvtps2pd xmm1, xmm1
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   movss DWORD PTR [eax+52], xmm1
   movss xmm1, DWORD PTR [eax+56]
   cvtps2pd xmm1, xmm1
   cvtps2pd xmm2, xmm0
   mulsd xmm1, xmm2
   cvtpd2ps xmm1, xmm1
   cvtps2pd xmm0, xmm0
   movss DWORD PTR [eax+56], xmm1
   movss xmm1, DWORD PTR [eax+60]
   cvtps2pd xmm1, xmm1
   mulsd xmm1, xmm0
   pop edi
   cvtpd2ps xmm0, xmm1
   movss DWORD PTR [eax+60], xmm0
   pop esi

; return out;
   ret 0
mul1 ENDP

Compare this with the code for mul1 generated by VS 2015:

mul1 PROC
_m$ = 8                                     ; size = 64
; ___$ReturnUdt$ = ecx
; _f$ = xmm2s

; WMatrix out = m;
   movups xmm0, XMMWORD PTR _m$[esp-4]

; for (unsigned int i = 0; i < 4; i++)
   xor eax, eax
   movaps xmm1, xmm2
   movups XMMWORD PTR [ecx], xmm0
   movups xmm0, XMMWORD PTR _m$[esp+12]
   shufps xmm1, xmm1, 0
   movups XMMWORD PTR [ecx+16], xmm0
   movups xmm0, XMMWORD PTR _m$[esp+28]
   movups XMMWORD PTR [ecx+32], xmm0
   movups xmm0, XMMWORD PTR _m$[esp+44]
   movups XMMWORD PTR [ecx+48], xmm0
   npad 4
$LL4@mul1:

; for (unsigned int j = 0; j < 4; j++)
; {
;    unsigned int idx = i * 4 + j; // critical code
;    *(&out._11 + idx) *= f; // critical code
   movups xmm0, XMMWORD PTR [ecx+eax*4]
   mulps xmm0, xmm1
   movups XMMWORD PTR [ecx+eax*4], xmm0
   inc eax
   cmp eax, 4
   jb SHORT $LL4@mul1

; return out;
   mov eax, ecx
   ret 0
?mul1@@YA?AUWMatrix@@U1@M@Z ENDP            ; mul1
_TEXT ENDS

You can see right away how much shorter the code is. Apparently, the optimizer got a lot smarter between VS 2010 and VS 2015. Unfortunately, sometimes the source of the optimizer's "smarts" is the exploitation of bugs in your code.

Looking at the code that corresponds to the loops, you can see that VS 2010 unrolls them. All of the calculations are done inline, so there are no branches. This is what you would expect for loops whose upper and lower bounds are known at compile time and, as in this case, are reasonably small.

What happened in VS 2015? Well, it did not unroll anything. There are 5 lines of code, and then a conditional jump (JB) back to the top of the loop. That alone does not tell you much. What does look very suspicious is that it loops only 4 times (see the cmp eax, 4 instruction, which sets the flags right before the JB, effectively continuing the loop as long as the counter is less than 4). Well, that might be fine if it had merged the two loops into one. Let's see what it does inside the loop:

$LL4@mul1:
   movups xmm0, XMMWORD PTR [ecx+eax*4]   ; load a packed unaligned value into XMM0
   mulps xmm0, xmm1                       ; do a packed multiplication of XMM0 by XMM1,
                                          ;  storing the result in XMM0
   movups XMMWORD PTR [ecx+eax*4], xmm0   ; store the result of the previous multiplication
                                          ;  back into the memory location that we
                                          ;  initially loaded from
   inc eax                                ; one iteration done, increment loop counter
   cmp eax, 4                             ; see how many loops we've done
   jb $LL4@mul1                           ; keep looping if < 4 iterations

The code reads a value from memory (an XMM-sized value from the location determined by ecx + eax * 4) into XMM0, multiplies it by the value in XMM1 (which was set outside the loop, based on f), and then stores the result back into the original memory location.

Compare this with the code for the corresponding loop in mul2 :

$LL4@mul2:
   lea eax, DWORD PTR [eax+16]
   movups xmm0, XMMWORD PTR [eax-24]
   mulps xmm0, xmm2
   movups XMMWORD PTR [eax-24], xmm0
   sub ecx, 1
   jne $LL4@mul2

Aside from the different loop control sequence (this one sets ECX to 4 outside the loop, subtracts 1 each time through, and keeps looping as long as ECX != 0), the big difference here is the actual XMM values that it manipulates in memory. Instead of loading from [ecx+eax*4], it loads from [eax-24] (after adding 16 to EAX).
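The practical effect of the two addressing patterns can be modeled in plain C++. The following is only a rough sketch, not compiler output: the names Floats, scale_windows, model_mul1 and model_mul2 are hypothetical, the matrix is represented as a flat array of 16 floats, and the mul2 model assumes that its first four-float window starts at out._11 (the initial value of EAX is not shown in the listing above).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Floats = std::array<float, 16>;

// Hypothetical model (not compiler output): run four iterations, each
// scaling a window of four consecutive floats, as movups/mulps/movups do.
// `stride` is how many floats the window start advances per iteration.
Floats scale_windows(Floats m, float f, std::size_t stride)
{
    for (std::size_t iter = 0; iter < 4; ++iter) {
        const std::size_t start = iter * stride;
        assert(start + 4 <= m.size());   // the window must stay inside the array
        for (std::size_t lane = 0; lane < 4; ++lane) {
            m[start + lane] *= f;        // one lane of the packed multiply
        }
    }
    return m;
}

// Stride of 1 float: models [ecx+eax*4] with eax = 0..3 (the mul1 loop).
Floats model_mul1(const Floats& m, float f) { return scale_windows(m, f, 1); }

// Stride of 4 floats: models adding 16 bytes to EAX per iteration (the mul2
// loop), assuming its first window starts at out._11.
Floats model_mul2(const Floats& m, float f) { return scale_windows(m, f, 4); }
```

With the values 1 through 16 and f = 0.5, model_mul1 yields 0.5, 0.5, 0.375, 0.25, 0.625, 1.5, 3.5, 8 followed by 9 through 16 unchanged, which matches the faulty output in the question, while model_mul2 halves every element exactly once.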

What is different about mul2? You added code to track a separate index in idx2, incrementing it each time through the loop. Now, this alone would not be enough. If you comment out the assignment to the bool variable b, mul1 and mul2 result in identical object code. Clearly, without the comparison of idx with idx2, the compiler can deduce that idx2 is completely unused and therefore eliminate it, turning mul2 into mul1. But with that comparison, the compiler apparently cannot eliminate idx2, and its presence ever so slightly changes which optimizations are deemed possible for the function, resulting in the difference in output.

Now the question becomes: why is this happening? Is it an optimizer bug, as you first suspected? Well, no, and as some of the commenters have already pointed out, blaming the compiler/optimizer should never be your first instinct. Always assume that the bug is in your code unless you can prove otherwise. That proof would always involve looking at the disassembly and, preferably, referencing the relevant portions of the language standard if you really want to be taken seriously.

In this case, Mysticial has already nailed the problem. Your code exhibits undefined behavior when it does *(&out._11 + idx). This makes certain assumptions about the layout of the WMatrix struct in memory that you cannot legally make, even after explicitly setting the packing.

This is why undefined behavior is evil: it results in code that sometimes works and sometimes does not. It is very sensitive to compiler flags, especially optimizations, and also to target platforms (as we saw at the top of this answer). mul2 works only by accident. Both mul1 and mul2 are wrong. Unfortunately, the bug is in your code. Worse, the compiler did not issue a warning that might have alerted you to your use of undefined behavior.
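One possible way to keep the named members and still visit them by index, without doing pointer arithmetic across separate members, is a table of pointers-to-member. This is only a minimal sketch under some assumptions: WMatrix is reduced to a plain aggregate here (the constructor is omitted for brevity), and the names kElems and mul_defined are hypothetical, not part of the original code. Storing the elements in a float array member and indexing that array would be another option.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for the WMatrix from the question (constructor omitted).
struct WMatrix {
    float _11, _12, _13, _14;
    float _21, _22, _23, _24;
    float _31, _32, _33, _34;
    float _41, _42, _43, _44;
};

// Hypothetical lookup table: entry idx names the member at row idx / 4,
// column idx % 4. No assumption about memory layout is needed.
static float WMatrix::* const kElems[16] = {
    &WMatrix::_11, &WMatrix::_12, &WMatrix::_13, &WMatrix::_14,
    &WMatrix::_21, &WMatrix::_22, &WMatrix::_23, &WMatrix::_24,
    &WMatrix::_31, &WMatrix::_32, &WMatrix::_33, &WMatrix::_34,
    &WMatrix::_41, &WMatrix::_42, &WMatrix::_43, &WMatrix::_44,
};

// Multiplies every element by f exactly once, using only member access.
WMatrix mul_defined(WMatrix m, float f)
{
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            const std::size_t idx = i * 4 + j;
            assert(idx < 16);            // stays inside the table
            m.*kElems[idx] *= f;         // well-defined access to one member
        }
    }
    return m;
}
```

Because each access goes through a pointer-to-member applied to the object itself, the optimizer is not entitled to assume anything about out-of-bounds arithmetic, so the result should not depend on optimization level or target platform.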

Jerry Coffin · Answer 2 · 2016-08-29T16:34:30+0000

If we look at the generated code, the problem is fairly clear. Ignoring a few bits and pieces that are unrelated to the problem at hand, mul1 produces code like this:

    movss   xmm1, DWORD PTR _f$[esp-4]          ; load xmm1 from f
    ; ...
    shufps  xmm1, xmm1, 0                       ; duplicate f across floats of xmm1
    ; ...
    for ecx = 0 to 3 {
        movups  xmm0, XMMWORD PTR [dest+ecx*4]  ; load 4 floats from dest
        mulps   xmm0, xmm1                      ; multiply each by f
        movups  XMMWORD PTR [dest+ecx*4], xmm0  ; store result back to dest
    }

So, instead of multiplying each element of the matrix by f exactly once, the loop advances by only one float (4 bytes) per iteration while processing four floats at a time. The four-float windows therefore overlap: the leading elements are multiplied by f more than once, while the last two rows are never touched.

Although it is not possible to confirm exactly how this happened (without looking at the source code of the compiler), it is certainly consistent with @Mysticial's guess about how the problem arose.
