Floating point div/mul > 30 times slower than add/sub?

I recently read this post: Floating point and integer computing on modern hardware, and I was curious how my own processor would do on this kind of quasi-benchmark, so I put together two versions of the code, one in C# and one in C++ (Visual Studio 2010 Express), and compiled both with optimizations enabled to see what happens. The output of my C# version is fairly reasonable:

    int add/sub:    350ms
    int div/mul:    3469ms
    float add/sub:  1007ms
    float div/mul:  67493ms
    double add/sub: 1914ms
    double div/mul: 2766ms

When I compiled and ran the C++ version, the results turned out completely different:

    int add/sub:    210.653ms
    int div/mul:    2946.58ms
    float add/sub:  3022.58ms
    float div/mul:  172931ms
    double add/sub: 1007.63ms
    double div/mul: 74171.9ms

I expected some performance differences, but nothing this large! I don't understand why division/multiplication in C++ is so much slower than addition/subtraction, while the managed C# version is much closer to what I expected. The C++ test function looks like this:

    template <typename T>
    void GenericTest(const char *typestring)
    {
        T v = 0;
        T v0 = (T)((rand() % 256) / 16) + 1;
        T v1 = (T)((rand() % 256) / 16) + 1;
        T v2 = (T)((rand() % 256) / 16) + 1;
        T v3 = (T)((rand() % 256) / 16) + 1;
        T v4 = (T)((rand() % 256) / 16) + 1;
        T v5 = (T)((rand() % 256) / 16) + 1;
        T v6 = (T)((rand() % 256) / 16) + 1;
        T v7 = (T)((rand() % 256) / 16) + 1;
        T v8 = (T)((rand() % 256) / 16) + 1;
        T v9 = (T)((rand() % 256) / 16) + 1;

        HTimer tmr = HTimer();
        tmr.Start();
        for (int i = 0; i < 100000000; ++i)
        {
            v += v0; v -= v1; v += v2; v -= v3; v += v4;
            v -= v5; v += v6; v -= v7; v += v8; v -= v9;
        }
        tmr.Stop();
        // I removed the bracketed values from the table above; they just make the
        // compiler assume I am using the value for something so it doesn't
        // optimize it out.
        cout << typestring << " add/sub: " << tmr.Elapsed() * 1000 << "ms ["
             << (int)v << "]" << endl;

        tmr.Start();
        for (int i = 0; i < 100000000; ++i)
        {
            v /= v0; v *= v1; v /= v2; v *= v3; v /= v4;
            v *= v5; v /= v6; v *= v7; v /= v8; v *= v9;
        }
        tmr.Stop();
        cout << typestring << " div/mul: " << tmr.Elapsed() * 1000 << "ms ["
             << (int)v << "]" << endl;
    }

The C# test code is not generic and is implemented like this:

    static double DoubleTest()
    {
        Random rnd = new Random();
        Stopwatch sw = new Stopwatch();
        double v = 0;
        double v0 = (double)rnd.Next(1, int.MaxValue);
        double v1 = (double)rnd.Next(1, int.MaxValue);
        double v2 = (double)rnd.Next(1, int.MaxValue);
        double v3 = (double)rnd.Next(1, int.MaxValue);
        double v4 = (double)rnd.Next(1, int.MaxValue);
        double v5 = (double)rnd.Next(1, int.MaxValue);
        double v6 = (double)rnd.Next(1, int.MaxValue);
        double v7 = (double)rnd.Next(1, int.MaxValue);
        double v8 = (double)rnd.Next(1, int.MaxValue);
        double v9 = (double)rnd.Next(1, int.MaxValue);

        sw.Start();
        for (int i = 0; i < 100000000; i++)
        {
            v += v0; v -= v1; v += v2; v -= v3; v += v4;
            v -= v5; v += v6; v -= v7; v += v8; v -= v9;
        }
        sw.Stop();
        Console.WriteLine("double add/sub: {0}", sw.ElapsedMilliseconds);
        sw.Reset();

        sw.Start();
        for (int i = 0; i < 100000000; i++)
        {
            v /= v0; v *= v1; v /= v2; v *= v3; v /= v4;
            v *= v5; v /= v6; v *= v7; v /= v8; v *= v9;
        }
        sw.Stop();
        Console.WriteLine("double div/mul: {0}", sw.ElapsedMilliseconds);
        sw.Reset();

        return v;
    }

Any ideas here?

+4
5 answers

For the float div/mul test you are probably generating denormalized values, which are processed much more slowly than normal floating point values. This is not an issue for the int tests, and for the double tests it kicks in much later.

You should add this at the beginning of the C++ program to flush denormals to zero:

    _controlfp(_DN_FLUSH, _MCW_DN);   // declared in <float.h> in the MSVC CRT

I am not sure how to do this in C#, though (or whether it is possible at all).
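If the hot loops are compiled to SSE instructions rather than the x87 FPU, a common alternative is to set the flush-to-zero and denormals-are-zero bits in the MXCSR register. A minimal sketch, assuming the standard MSVC/Intel intrinsic headers are available (these macros only affect SSE/SSE2 arithmetic, not legacy x87 code, so they complement the _controlfp call above):

    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (SSE)
    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

    int main()
    {
        // Flush denormal results to zero (FTZ) and treat denormal inputs as zero (DAZ).
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

        // ... run the benchmarks here ...
        return 0;
    }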

More info here: Runtime of floating point math

+3

It is possible that C# optimized the division by vx into multiplication by 1/vx, since it knows those values do not change during the loop and can therefore compute the reciprocals just once up front.

You can apply this optimization yourself and time it in C++.
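A sketch of that manual rewrite for the floating point instantiations of the question's GenericTest (illustrative only; multiplying by a precomputed reciprocal is not bit-for-bit identical to dividing, which is exactly why a conforming C++ compiler will not do this on its own without a fast-math style option):

    // Hoist the reciprocals of the divisors out of the loop once.
    // Only meaningful for float/double; for int, (T)1 / v0 would be 0.
    T r0 = (T)1 / v0, r2 = (T)1 / v2, r4 = (T)1 / v4,
      r6 = (T)1 / v6, r8 = (T)1 / v8;

    for (int i = 0; i < 100000000; ++i)
    {
        v *= r0; v *= v1;   // was: v /= v0; v *= v1;
        v *= r2; v *= v3;   // was: v /= v2; v *= v3;
        v *= r4; v *= v5;
        v *= r6; v *= v7;
        v *= r8; v *= v9;
    }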

+3

If you are interested in floating point speed and possible optimizations, read this book: http://www.agner.org/optimize/optimizing_cpp.pdf

You can also check this: http://msdn.microsoft.com/en-us/library/aa289157%28VS.71%29.aspx

Your results can depend on things like the JIT and the compilation flags (debug/release, which floating point optimizations are performed, which instruction sets are enabled).

Try setting these flags for maximum optimization, and change your program so that it definitely does not produce overflows or NaNs, since those affect the speed of the calculations. (Even something like "v += v1; v += v2; v -= v1; v -= v2;" is fine, because it will not be reduced away in "strict" or "precise" floating point mode.) Also try not to use more variables than you have FP registers.
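For instance, one way to keep the div/mul loop from drifting toward infinity, zero, or the denormal range is to undo every factor within the same iteration. A minimal sketch reusing the variables from the question's code (whether the compiler leaves the paired operations alone should still be verified in the generated code; under precise/strict floating point settings it must):

    for (int i = 0; i < 100000000; ++i)
    {
        // Every factor that is applied is removed again in the same iteration,
        // so v stays near its starting value instead of overflowing or
        // decaying into denormals.
        v *= v0; v /= v0;
        v *= v1; v /= v1;
        v *= v2; v /= v2;
        v *= v3; v /= v3;
        v *= v4; v /= v4;
    }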

+2

Multiplication is not that bad. I think it is only a few cycles slower than addition, but yes, division is very slow compared to the others. It takes significantly longer and, unlike the other three operations, it is not pipelined.
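One way to see that for yourself (a rough sketch, not taken from the question's code): time a single dependent chain of operations against several independent chains. Pipelined operations such as addition and multiplication overlap, so the independent version runs several times faster; division barely improves, because the divider handles only one operation at a time. Use a divisor close to 1.0 so the values stay in the normal range.

    // Single dependent chain: each divide must wait for the previous result.
    double chain(double x, double d, int n)
    {
        for (int i = 0; i < n; ++i)
            x /= d;
        return x;
    }

    // Four independent chains: pipelined operations overlap here, division hardly does.
    double chains4(double x, double d, int n)
    {
        double a = x, b = x, c = x, e = x;
        for (int i = 0; i < n; i += 4)
        {
            a /= d;
            b /= d;
            c /= d;
            e /= d;
        }
        return a + b + c + e;
    }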

+1

I also thought your C++ numbers looked incredibly slow, so I ran the test myself. It turns out you are actually completely wrong. Screenshot: http://img59.imageshack.us/img59/3597/loltimer.jpg

I replaced your timer (I have no idea which timer you used, and I did not have it handy) with the Windows high-performance timer, which can resolve nanoseconds or better. Guess what? Visual Studio optimizes the whole thing away. I did not even set it up for maximum optimization; VS can see right through this kind of micro-benchmark and elides all the loops. That is why you should never, ever use this sort of "profiling". Get a professional profiler and come back. Unless 2010 Express is drastically different from 2010 Professional, which I doubt; they differ mostly in IDE features, not in raw code generation and optimization.
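If you do insist on this style of timing, a minimal way to keep the optimizer from deleting the loop is to store the result somewhere the compiler must treat as observable. A sketch (not the answerer's code, and reusing the tmr timer from the question; it is the same idea as the bracketed values the question prints):

    volatile double sink;   // the compiler may not discard writes to a volatile

    tmr.Start();
    for (int i = 0; i < 100000000; ++i)
    {
        v /= v0; v *= v1; v /= v2; v *= v3; v /= v4;
        v *= v5; v /= v6; v *= v7; v /= v8; v *= v9;
    }
    tmr.Stop();
    sink = v;   // forces v, and therefore the loop, to actually be computed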

I'm not even going to run your C#.

Edit: this is a DEBUG x64 build (the previous screenshot was x86, but I figured I would do x64 since I am on x64), and I also fixed a minor bug that caused the time to come out negative instead of positive. So unless you want to tell me that your 32-bit release floating point is hundreds of times slower than my 64-bit debug floating point, I think you messed something up. Screenshot: http://img693.imageshack.us/img693/1866/loltimerdebug.jpg

One curious thing: in the x86 debug build it was always the second floating point test that went wrong; that is, if you ran float first and then double, it was the double div/mul that failed, and if you ran double first and then float, the float div/mul failed. It must be some compiler glitch.

0

Source: https://habr.com/ru/post/1316326/

