TL:DR: use lrint(x) or (int)rint(x) to convert from float to int with rounding to nearest, instead of truncation. Unfortunately, not all compilers efficiently inline the same math functions. See round() for float in C++
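A minimal sketch of the difference (the input value here is just an arbitrary illustration of something a hair below an integer):

```c
#include <math.h>   // lrint; may need -lm when linking
#include <stdio.h>

int main(void) {
    double x = 241.99999999999997;          // just below 242.0
    printf("(int)x   = %d\n", (int)x);      // 241: C truncates towards zero
    printf("lrint(x) = %ld\n", lrint(x));   // 242: rounds to nearest (default FP rounding mode)
    return 0;
}
```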
gcc -mno-sse2 has to use x87 for double, even in 64-bit code. The x87 registers have 80 bits of internal precision, but SSE2 uses the IEEE binary64 (aka double) format natively in XMM registers, so all the temporaries are rounded to 64-bit double at each step.
The problem isn't anything as interesting as the double-rounding problem (80-bit -> 64-bit, then to integer). It's also not from gcc -O0 (the default: no extra optimizations) rounding when storing temporaries to memory, because you did the whole thing in one C expression, so it just uses x87 registers for the entire expression.
It's simply that 80-bit precision produces a result just below 242.0, which truncates to 241 with C's float->int semantics, while SSE2 produces a result just above 242.0, which truncates to 242. With x87, rounding down to the next lower integer happens consistently, not just for 242, for any input from 1 to 65535. (I made a version of your program using atoi(argv[1]) so I could test other values, and with -O3.)
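Your original source isn't reproduced here, so the following is only a sketch of what that test harness looks like, assuming the expression has the shape implied by the rec709_luma_coeff array discussed below:

```c
// Build e.g. with: gcc -O3 -mno-sse2 luma.c   (x87)
//              vs: gcc -O3 luma.c             (SSE2 on x86-64)
#include <stdio.h>
#include <stdlib.h>

double rec709_luma_coeff[3] = {0.2126, 0.7152, 0.0722};

int main(int argc, char **argv) {
    unsigned n = (argc > 1) ? (unsigned)atoi(argv[1]) : 242;
    // One C expression: with x87 the temporaries stay in 80-bit registers.
    unsigned luma = n * rec709_luma_coeff[0]
                  + n * rec709_luma_coeff[1]
                  + n * rec709_luma_coeff[2];
    printf("%u -> %u\n", n, luma);
    return 0;
}
```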
Remember that int foo = 123.99999 is 123, because C uses the truncation rounding mode (towards zero). For non-negative numbers, this is the same as floor (which rounds towards -Infinity). https://en.wikipedia.org/wiki/Floating-point_arithmetic#Rounding_modes
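A few lines to see that for yourself (the negative case shows where truncation and floor part ways):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    printf("%d\n", (int)123.99999);                 // 123: truncation (towards zero)
    printf("%.1f\n", floor(123.99999));             // 123.0: same as truncation for non-negative x
    printf("%d vs %.1f\n", (int)-1.5, floor(-1.5)); // -1 vs -2.0: they differ for negatives
    return 0;
}
```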
double can't represent the coefficients exactly: I printed the values with gdb and got: {0.21260000000000001, 0.71519999999999995, 0.0722} . Those decimal representations are probably not exact representations of the base-2 floating-point values, but they're close enough to see that the coefficients sum to 0.99999999999999996 (checked with an arbitrary-precision calculator).
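You can get the same information without gdb by printing the stored double values with enough significant digits to round-trip (the exact digit strings may differ slightly from gdb's formatting):

```c
#include <stdio.h>

int main(void) {
    double c[3] = {0.2126, 0.7152, 0.0722};
    printf("{%.17g, %.17g, %.17g}\n", c[0], c[1], c[2]);
    // Note: the exact (arbitrary-precision) sum of the three base-2 values
    // is 0.99999999999999996, just below 1.0. The double sum printed below
    // is itself rounded, so it doesn't necessarily show the shortfall.
    printf("sum = %.17g\n", c[0] + c[1] + c[2]);
    return 0;
}
```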
We get consistent rounding down because the x87 internal precision is higher than the precision of the coefficients, so the rounding errors in n * rec709_luma_coeff[0] etc., and in summing up the results, are about 2^11 times smaller than the gap between the sum of the coefficients and 1.0 (a 64-bit significand vs. the 53 bits of double: 2^(64-53) = 2^11).
The real question is how the SSE2 version managed to work! Presumably round-to-nearest-even in the temporaries happens to round upward often enough, at least for 242. It happens to produce the original input in more cases than not, but it produces input-1 for 5, 7, 10, 13, 14, 20... (252 of the first 1000 numbers from 1..1000 get "munged" by the SSE2 version, so it doesn't always work either.)
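A sketch of that 1..1000 sweep, reusing the harness expression from above (compile once with and once without -mno-sse2 and compare the counts):

```c
#include <stdio.h>

double rec709_luma_coeff[3] = {0.2126, 0.7152, 0.0722};

int main(void) {
    unsigned munged = 0;
    for (unsigned n = 1; n <= 1000; n++) {
        unsigned luma = n * rec709_luma_coeff[0]
                      + n * rec709_luma_coeff[1]
                      + n * rec709_luma_coeff[2];
        if (luma != n)      // counts inputs that come out as input-1
            munged++;
    }
    printf("%u of 1000 inputs munged\n", munged);
    return 0;
}
```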
With -O3 for your original source, the calculation is done at compile time with extended precision and produces the exact result, i.e. it compiles the same as printf("%u\n", n);
And BTW, you should use static const for your constants so gcc can optimize better. static is much better than a plain global, since the compiler can see that nothing in the compilation unit writes the values or passes their address anywhere, so it can treat them as if they were const.
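i.e. declare them like this:

```c
// static const: gcc can see the whole life of the array in this translation
// unit, prove nothing ever writes it, and fold the loads into constants.
static const double rec709_luma_coeff[3] = {0.2126, 0.7152, 0.0722};
```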