Is there a document describing how Clang handles excess floating-point precision?

It is almost impossible (*) to provide strict IEEE 754 semantics at reasonable cost when the only floating-point instructions one is allowed to use are the 387 ones. It is particularly hard when one wishes to keep the FPU working on the full 64-bit significand so that the long double type is available for extended precision. The usual "solution" is to do intermediate computations at the only available precision and to convert to the lower precision at more or less well-defined occasions.

Recent versions of GCC handle excess precision in intermediate computations in the sense defined by Joseph S. Myers in a 2008 post to the GCC mailing list. That description makes a program compiled with gcc -std=c99 -mno-sse2 -mfpmath=387 completely predictable, to the last bit, as I understand it. And if by chance it isn't, it is a bug and will be fixed: Joseph S. Myers' stated intention in his post is to make it predictable.

Is it documented how Clang handles excess precision (say, when the -mno-sse2 option is used), and where?

(*) EDIT: this is an exaggeration. It is slightly annoying, but not that difficult, to emulate binary64 when one is allowed to configure the x87 FPU to use a 53-bit significand.
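For illustration, a minimal sketch of what that configuration looks like, assuming GCC/Clang-style inline assembly on x86 (the helper name is illustrative):

    /* Switch the x87 FPU to a 53-bit significand, so that intermediate
       results are rounded as IEEE-754 binary64. Bits 8-9 of the control
       word form the precision-control field; 10b selects 53 bits. Note
       that the exponent range still exceeds binary64, so double rounding
       can still occur near the subnormal range. */
    static void x87_set_53bit_significand(void)
    {
        unsigned short cw;
        __asm__ volatile ("fnstcw %0" : "=m" (cw));      /* read control word  */
        cw = (unsigned short)((cw & ~0x0300) | 0x0200);  /* PC field = double  */
        __asm__ volatile ("fldcw %0" : : "m" (cw));      /* write control word */
    }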




Following R..'s comment below, here is a log of my short interaction with the most recent version of Clang I have:

    Hexa:~ $ clang -v
    Apple clang version 4.1 (tags/Apple/clang-421.11.66) (based on LLVM 3.1svn)
    Target: x86_64-apple-darwin12.4.0
    Thread model: posix
    Hexa:~ $ cat fem.c
    #include <stdio.h>
    #include <math.h>
    #include <float.h>
    #include <fenv.h>

    double x;
    double y = 2.0;
    double z = 1.0;

    int main(){
      x = y + z;
      printf("%d\n", (int) FLT_EVAL_METHOD);
    }
    Hexa:~ $ clang -std=c99 -mno-sse2 fem.c
    Hexa:~ $ ./a.out
    0
    Hexa:~ $ clang -std=c99 -mno-sse2 -S fem.c
    Hexa:~ $ cat fem.s
    …
            movl    $0, %esi
            fldl    _y(%rip)
            fldl    _z(%rip)
            faddp   %st(1)
            movq    _x@GOTPCREL(%rip), %rax
            fstpl   (%rax)
    …
+52
c floating-point c99 clang
Jul 15 '13 at 20:56
2 answers

This does not answer the original question, but if you are a programmer working on similar problems, this answer may help you.

I honestly do not see where the perceived difficulty lies. Providing strict IEEE-754 binary64 semantics while limiting yourself to 80387 floating-point math, and retaining 80-bit long double computations, seems to be a matter of following well-defined C99 casting rules with both GCC-4.6.3 and clang-3.0 (based on LLVM 3.0).

Edited to add: However, Pascal Cuoq is correct: neither gcc-4.6.3 nor clang-llvm-3.0 actually applies those rules correctly for '387 floating-point math. Given the proper compiler options, the rules are applied correctly to expressions evaluated at compile time, but not to expressions evaluated at run time. Workarounds are listed after the break below.

I work on molecular dynamics simulation code and am very familiar with the repeatability/predictability requirements, as well as with the desire to retain maximum precision where possible, so I do claim to know what I am talking about here. This answer should show that the tools exist and are simple to use; the problems arise from not being aware of, or not using, those tools.

(My preferred example is the Kahan summation algorithm. With C99 and correct casting (adding casts, for example, to the Wikipedia example code), no tricks or extra temporary variables are needed at all. The implementation then works regardless of compiler optimization level, including -O3 and -Ofast.)
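For illustration, a sketch of that Kahan loop with the casts in place (the function name is illustrative; the (double) casts on the intermediates are the point):

    #include <stddef.h>

    /* Kahan compensated summation. Each intermediate result is cast to
       double, which per C99 discards any excess x87 range and precision,
       so the compensation term c stays meaningful. */
    double kahan_sum(const double *x, size_t n)
    {
        double sum = 0.0;
        double c = 0.0;   /* compensation for lost low-order bits */
        for (size_t i = 0; i < n; i++) {
            const double y = (double)(x[i] - c);
            const double t = (double)(sum + y);
            c = (double)((double)(t - sum) - y);
            sum = t;
        }
        return sum;
    }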

C99 explicitly states (e.g. in 5.2.4.2.2) that casts and assignments remove all extra range and precision. This means you can use long double arithmetic by defining the temporary variables used during computation as long double, also casting your input variables to that type; whenever an IEEE-754 binary64 result is needed, you simply cast to double.
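For example, a minimal sketch of this pattern, assuming a dot product as the computation (the names are illustrative): keep the temporaries in long double, and cast once at the end:

    #include <stddef.h>

    /* Accumulate at extended precision; the final cast is the point at
       which C99 guarantees the excess range and precision is dropped. */
    double dot(const double *a, const double *b, size_t n)
    {
        long double sum = 0.0L;
        for (size_t i = 0; i < n; i++)
            sum += (long double)(a[i]) * (long double)(b[i]);
        return (double)(sum);   /* round to IEEE-754 binary64 here */
    }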

On '387, the cast generates a store and a load with both of the above compilers; this correctly rounds the 80-bit value to IEEE-754 binary64. This cost is, in my opinion, very reasonable. The exact time taken depends on the architecture and the surrounding code; usually it can be interleaved with other code to bring the cost down to negligible levels. When MMX, SSE, or AVX are available, their registers are separate from the 80-bit 80387 registers, and the cast is usually done by moving the value to an MMX/SSE/AVX register.

(I would prefer production code to use a specific floating-point type, say tempdouble or such, for temporary variables, so that it can be defined as either double or long double depending on the architecture and the speed/precision tradeoffs desired.)

In a nutshell:

Do not assume (expression) is of double precision just because all the variables and literal constants in it are. Write it as (double)(expression) if you want the result at double precision.

This applies to compound expressions, too, and can sometimes lead to unwieldy expressions with many levels of casts.

If you have expr1 and expr2 that you wish to compute at 80-bit precision, but also need the product of each rounded down to 64 bits first, use

    long double expr1;
    long double expr2;
    double product = (double)(expr1) * (double)(expr2);

Note that product is computed as a product of two 64-bit values; it is not computed at 80-bit precision and then rounded down. Computing the product at 80-bit precision and then rounding down would be

 double other = expr1 * expr2; 

or, adding descriptive casts that tell you exactly what is happening,

 double other = (double)((long double)(expr1) * (long double)(expr2)); 

Obviously, product and other often differ.

The C99 casting rules are just another tool you must learn to wield if you work with mixed 32-bit/64-bit/80-bit/128-bit floating-point values. Indeed, you run into exactly the same issues if you mix binary32 and binary64 floats (float and double on most architectures)!

Perhaps rewriting Pascal Cuoq's exploration code to correctly apply the casting rules makes this clearer?

    #include <stdio.h>

    #define TEST(eq) printf("%-56s%s\n", "" # eq ":", (eq) ? "true" : "false")

    int main(void)
    {
        double d = 1.0 / 10.0;
        long double ld = 1.0L / 10.0L;

        printf("sizeof (double) = %d\n", (int)sizeof (double));
        printf("sizeof (long double) == %d\n", (int)sizeof (long double));

        printf("\nExpect true:\n");
        TEST(d == (double)(0.1));
        TEST(ld == (long double)(0.1L));
        TEST(d == (double)(1.0 / 10.0));
        TEST(ld == (long double)(1.0L / 10.0L));
        TEST(d == (double)(ld));
        TEST((double)(1.0L/10.0L) == (double)(0.1));
        TEST((long double)(1.0L/10.0L) == (long double)(0.1L));

        printf("\nExpect false:\n");
        TEST(d == ld);
        TEST((long double)(d) == ld);
        TEST(d == 0.1L);
        TEST(ld == 0.1);
        TEST(d == (long double)(1.0L / 10.0L));
        TEST(ld == (double)(1.0L / 10.0));

        return 0;
    }

The output, with both GCC and clang, is

    sizeof (double) = 8
    sizeof (long double) == 12

    Expect true:
    d == (double)(0.1):                                     true
    ld == (long double)(0.1L):                              true
    d == (double)(1.0 / 10.0):                              true
    ld == (long double)(1.0L / 10.0L):                      true
    d == (double)(ld):                                      true
    (double)(1.0L/10.0L) == (double)(0.1):                  true
    (long double)(1.0L/10.0L) == (long double)(0.1L):       true

    Expect false:
    d == ld:                                                false
    (long double)(d) == ld:                                 false
    d == 0.1L:                                              false
    ld == 0.1:                                              false
    d == (long double)(1.0L / 10.0L):                       false
    ld == (double)(1.0L / 10.0):                            false

except that recent versions of GCC promote the right-hand side of ld == 0.1 to long double first (i.e. to ld == 0.1L), yielding true, and that with SSE/AVX, long double is 128-bit.

For the pure '387 tests, I used

    gcc -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test
    clang -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test

with various optimization flag combinations in place of the ..., including -fomit-frame-pointer, -O0, -O1, -O2, -O3 and -Os.

Using any other flags, or any C99 compiler, should yield the same results, except for the long double size (and ld == 0.1 for current GCC versions). If you encounter any differences, I would be very grateful to hear of them; I may need to warn my users of such compilers (or compiler versions). Note that Microsoft does not support C99, so they are of no interest to me.




Pascal Cuoq brings up an interesting problem in the comment chain below, one I did not immediately recognize.

When evaluating an expression, both GCC and clang with -mfpmath=387 specify that all expressions are evaluated at 80-bit precision. This leads to, for example,

    7491907632491941888 = 0x1.9fe2693112e14p+62
                        = 110011111111000100110100100110001000100101110000101000000000000
    5698883734965350400 = 0x1.3c5a02407b71cp+62
                        = 100111100010110100000001001000000011110110111000111000000000000

    7491907632491941888 * 5698883734965350400
      = 42695510550671093541385598890357555200
      = 100000000111101101101100110001101000010100100001011110111111111111110011000111000001011101010101100011000000000000000000000000

yielding an incorrect result, because that run of ones in the middle of the binary result falls exactly at the difference between the 53- and 64-bit mantissas (of 64- and 80-bit floating-point numbers, respectively). So, while the expected result is

    42695510550671088819251326462451515392 = 0x1.00f6d98d0a42fp+125
      = 100000000111101101101100110001101000010100100001011110000000000000000000000000000000000000000000000000000000000000000000000000

the result obtained with just -std=c99 -m32 -mno-sse -mfpmath=387 is

    42695510550671098263984292201741942784 = 0x1.00f6d98d0a43p+125
      = 100000000111101101101100110001101000010100100001100000000000000000000000000000000000000000000000000000000000000000000000000000

In theory, you should be able to tell both gcc and clang to enforce the correct C99 rounding rules by using the options

 -std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard 

However, this only affects expressions the compiler optimizes, and does not seem to fix the '387 handling at all. If you use, for example, clang -O1 -std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard test.c -o test && ./test with test.c being Pascal Cuoq's example program, you will get the correct result per IEEE-754 rules, but only because the compiler optimizes away the expression, never using the '387 at all.

Simply put, instead of calculating

 (double)d1 * (double)d2 

both gcc and clang actually tell the '387 to compute

 (double)((long double)d1 * (long double)d2) 

I do believe this is a compiler bug affecting both gcc-4.6.3 and clang-llvm-3.0, and an easily reproduced one. (Pascal Cuoq points out that FLT_EVAL_METHOD=2 means operations on double-precision arguments are always done at extended precision, but I can see no sane reason, aside from having to rewrite parts of libm for '387, to do that in C99, considering that the IEEE-754 rules are achievable by the hardware! After all, the correct operation is easily achievable by the compiler, by modifying the '387 control word to match the precision of the expression. And the compiler options that should force this behavior, -std=c99 -ffloat-store -fexcess-precision=standard, make no sense if FLT_EVAL_METHOD=2 behavior is actually desired; there are no backward-compatibility issues, either.) It is important to note that, given the proper compiler flags, expressions evaluated at compile time are evaluated correctly, and that only expressions evaluated at run time get incorrect results.

The simplest workaround, and a portable one, is to use fesetround(FE_TOWARDZERO) (from fenv.h) to round all results toward zero.

In some cases, rounding toward zero can help with predictability and pathological cases. In particular, for intervals of the form x = [0,1), rounding toward zero means the upper limit is never reached through rounding; this matters if you evaluate, for example, piecewise splines.
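A minimal sketch of this workaround, using only standard C99 fenv.h (the FENV_ACCESS pragma is the standard way to announce access to the floating-point environment, though not all compilers honor it):

    #include <stdio.h>
    #include <fenv.h>

    #pragma STDC FENV_ACCESS ON

    int main(void)
    {
        /* volatile keeps the compiler from folding the division at
           compile time, so it happens under the selected mode. */
        volatile double a = 1.0, b = 10.0;

        /* Round all subsequent floating-point results toward zero. */
        if (fesetround(FE_TOWARDZERO) != 0) {
            fprintf(stderr, "FE_TOWARDZERO is not supported.\n");
            return 1;
        }

        printf("1.0/10.0 toward zero: %.20f\n", a / b);
        return 0;
    }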

For the other rounding modes, you need to control the '387 hardware directly.

You can use either __FPU_SETCW() from #include <fpu_control.h>, or open-code it. For example, precision.c:

    #include <stdlib.h>
    #include <stdio.h>
    #include <limits.h>

    #define FP387_NEAREST   0x0000
    #define FP387_ZERO      0x0C00
    #define FP387_UP        0x0800
    #define FP387_DOWN      0x0400

    #define FP387_SINGLE    0x0000
    #define FP387_DOUBLE    0x0200
    #define FP387_EXTENDED  0x0300

    static inline void fp387(const unsigned short control)
    {
        unsigned short cw = (control & 0x0F00) | 0x007f;
        __asm__ volatile ("fldcw %0" : : "m" (*&cw));
    }

    const char *bits(const double value)
    {
        const unsigned char *const data = (const unsigned char *)&value;
        static char buffer[CHAR_BIT * sizeof value + 1];
        char *p = buffer;
        size_t i = CHAR_BIT * sizeof value;
        while (i-->0)
            *(p++) = '0' + !!(data[i / CHAR_BIT] & (1U << (i % CHAR_BIT)));
        *p = '\0';
        return (const char *)buffer;
    }

    int main(int argc, char *argv[])
    {
        double d1, d2;
        char dummy;

        if (argc != 3) {
            fprintf(stderr, "\nUsage: %s 7491907632491941888 5698883734965350400\n\n", argv[0]);
            return EXIT_FAILURE;
        }

        if (sscanf(argv[1], " %lf %c", &d1, &dummy) != 1) {
            fprintf(stderr, "%s: Not a number.\n", argv[1]);
            return EXIT_FAILURE;
        }

        if (sscanf(argv[2], " %lf %c", &d2, &dummy) != 1) {
            fprintf(stderr, "%s: Not a number.\n", argv[2]);
            return EXIT_FAILURE;
        }

        printf("%s:\td1 = %.0f\n\t %s in binary\n", argv[1], d1, bits(d1));
        printf("%s:\td2 = %.0f\n\t %s in binary\n", argv[2], d2, bits(d2));

        printf("\nDefaults:\n");
        printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

        printf("\nExtended precision, rounding to nearest integer:\n");
        fp387(FP387_EXTENDED | FP387_NEAREST);
        printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

        printf("\nDouble precision, rounding to nearest integer:\n");
        fp387(FP387_DOUBLE | FP387_NEAREST);
        printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

        printf("\nExtended precision, rounding to zero:\n");
        fp387(FP387_EXTENDED | FP387_ZERO);
        printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

        printf("\nDouble precision, rounding to zero:\n");
        fp387(FP387_DOUBLE | FP387_ZERO);
        printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

        return 0;
    }

Using clang-llvm-3.0 to compile and run, I get the correct results,

    clang -std=c99 -m32 -mno-sse -mfpmath=387 -O3 -W -Wall precision.c -o precision
    ./precision 7491907632491941888 5698883734965350400
    7491907632491941888:    d1 = 7491907632491941888
             0100001111011001111111100010011010010011000100010010111000010100 in binary
    5698883734965350400:    d2 = 5698883734965350400
             0100001111010011110001011010000000100100000001111011011100011100 in binary

    Defaults:
    Product = 42695510550671098263984292201741942784
             0100011111000000000011110110110110011000110100001010010000110000 in binary

    Extended precision, rounding to nearest integer:
    Product = 42695510550671098263984292201741942784
             0100011111000000000011110110110110011000110100001010010000110000 in binary

    Double precision, rounding to nearest integer:
    Product = 42695510550671088819251326462451515392
             0100011111000000000011110110110110011000110100001010010000101111 in binary

    Extended precision, rounding to zero:
    Product = 42695510550671088819251326462451515392
             0100011111000000000011110110110110011000110100001010010000101111 in binary

    Double precision, rounding to zero:
    Product = 42695510550671088819251326462451515392
             0100011111000000000011110110110110011000110100001010010000101111 in binary

In other words, you can get around compiler issues by using fp387() to set the precision and rounding mode.

The downside is that some math libraries (libm.a, libm.so) may be written with the assumption that intermediate results are always computed at 80-bit precision. At least the GNU C library's fpu_control.h on x86_64 contains the comment "libm requires extended precision". Fortunately, you can take the '387 implementations from, for example, the GNU C library and implement them in a header file, or write a known-to-work libm, if you need the math.h functionality; in fact, I think I might be able to help there.
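One cautious approach, sketched here assuming glibc's fpu_control.h macros (_FPU_GETCW, _FPU_SETCW, _FPU_DEFAULT; the wrapper name is illustrative), is to restore the default extended-precision control word around libm calls:

    #include <math.h>
    #include <fpu_control.h>   /* glibc-specific control-word macros */

    /* Call exp() with the control word temporarily restored to the
       default extended precision that libm expects, then reinstate
       whatever precision/rounding mode the caller had set. */
    double exp_with_default_cw(const double x)
    {
        fpu_control_t saved, dflt = _FPU_DEFAULT;
        double result;
        _FPU_GETCW(saved);   /* remember the caller's control word */
        _FPU_SETCW(dflt);    /* extended precision for libm        */
        result = exp(x);
        _FPU_SETCW(saved);   /* restore the caller's settings      */
        return result;
    }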

+14
Sep 24 '13 at 23:25

For the record, below is what I found through experimentation. The following program shows the different behaviors one can get when compiling with Clang:

    #include <stdio.h>

    int r1, r2, r3, r4, r5, r6, r7;
    double ten = 10.0;

    int main(int c, char **v)
    {
        r1 = 0.1 == (1.0 / ten);
        r2 = 0.1 == (1.0 / 10.0);
        r3 = 0.1 == (double) (1.0 / ten);
        r4 = 0.1 == (double) (1.0 / 10.0);
        ten = 10.0;
        r5 = 0.1 == (1.0 / ten);
        r6 = 0.1 == (double) (1.0 / ten);
        r7 = ((double) 0.1) == (1.0 / 10.0);
        printf("r1=%d r2=%d r3=%d r4=%d r5=%d r6=%d r7=%d\n", r1, r2, r3, r4, r5, r6, r7);
    }

Results vary depending on the level of optimization:

    $ clang -v
    Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn)
    $ clang -mno-sse2 -std=c99 t.c && ./a.out
    r1=0 r2=1 r3=0 r4=1 r5=1 r6=0 r7=1
    $ clang -mno-sse2 -std=c99 -O2 t.c && ./a.out
    r1=0 r2=1 r3=0 r4=1 r5=1 r6=1 r7=1

The cast to (double), which makes a difference between r5 and r6 at -O2, makes no difference at -O0 for variables r3 and r4. The result of r1 differs from r5 at all optimization levels, while r6 differs from r3 only at -O2.

+4
Sep 24 '13 at 18:41