How to properly benchmark a [templated] C++ program

<Background>

I am at the point where I really need to optimize C++ code. I am writing a library for molecular modeling, and I need to add a new feature. I already tried to add this feature in the past, but back then I used virtual functions called in nested loops. I had a bad feeling about this, and the first implementation proved it was a bad idea. However, it was good for testing the concept.

</Background>

Now I need this feature to be as fast as possible (well, without assembly code or GPU computation; it should still be C++, and more readable rather than less). I am now somewhat versed in class templates and policy classes (from Alexandrescu's excellent book), and I think that moving the work to compile time might be the solution.

However, I need to test the design before doing the tremendous work of introducing it into the library. The question is how best to test the efficiency of this new design.

Obviously, I need to enable optimization, because without it, g++ (and probably other compilers) will leave unnecessary operations in the object code. I also need to exercise the new feature heavily in the test, because a delta of 1-3 seconds can make the difference between a good and a bad design (this function will be called millions of times in the real program).

The problem is that g++ is sometimes "too smart" when optimizing and can remove a whole loop once it sees that the result of the computation is never used. I have already seen this when looking at the assembly output.

If I add some printing to stdout, the compiler will be forced to do the computation in the loop, but then I will mostly be benchmarking the iostream implementation.

So, how can I properly benchmark a small function extracted from the library? A related question: is it even the right approach to conduct this kind of in vitro test on a small unit, or do I need the whole context?

Thanks for the tips!




It seems that there are several strategies, ranging from compiler-specific options that allow fine-tuning, to more general solutions that should work with every compiler, such as volatile or extern.

I think I'll try them all. Thanks so much for all your answers!

+7
c++ optimization benchmarking
Jan 12 '09 at 14:46
11 answers

If you want to prevent the compiler from discarding a result, write the result to a volatile object. That operation cannot be optimized away, by definition.

 template <typename T>
 void sink(T const& t) { volatile T sinkhole = t; }

There is no iostream overhead, just a copy that has to remain in the generated code. Now, if you are collecting the results of a large number of operations, it is best not to sink them one by one; those copies could still add some overhead. Instead, combine all the results into a single non-volatile object (so that every individual result is needed), and then sink that result object into a volatile. E.g. if your individual operations produce strings, you can force their evaluation by adding all the char values together modulo 2^32. This costs practically no overhead; the strings will likely be in cache. The sum is subsequently sunk into a volatile, so each char in each string actually has to be computed; shortcuts are not allowed.

+5
Jan 13 '09 at 15:09

If you have a really aggressive compiler (it can happen), I would suggest computing a checksum (just adding all the results together) and printing the checksum.

In addition, you may want to look at the generated assembly code before running any timings, so you can verify visually that the loops are actually executed.

+1
Jan 12 '09 at 14:51

Compilers are only allowed to eliminate code paths that can never be taken. As long as the compiler cannot rule out that a branch will be executed, it will not eliminate it. As long as there is some data dependency, the code will be there and will run. Compilers are not overly smart at figuring out which parts of a program will never run, and they don't try too hard, because it is an NP problem and hardly computable in general. They do have some simple checks, such as for if (0), but that is about it.

My humble opinion is that you were bitten by some other problem earlier, for example how C/C++ evaluates Boolean expressions (short-circuiting).

But in any case, since this is about a speed test, you can verify for yourself that things are actually called: run it once normally, and then again with a check on the return values. Or add them up into a static variable and print the accumulated number at the end of the test. The results should come out equal.

To answer the question about in vitro testing: yes, do it. If your application is that performance-critical, do it. On the other hand, your description suggests another problem: if your deltas are in the range of 1-3 seconds, it sounds like a computational-complexity problem, since the method apparently needs to be called very, very often (over several runs, 1-3 seconds would be negligible).

The problem domain you are modeling sounds VERY complex, and the data sets are probably huge. Such things are always interesting. First of all, make sure you have the right data structures and algorithms, and only after that optimize whatever you want. So, I would say: first look at the whole context. ;-)

Out of curiosity, what is the problem you are calculating?

+1
Jan 12 '09 at 15:10

You have a lot of control over the optimizations your compiler performs. -O1, -O2, etc. are just aliases for groups of switches.

From the man pages:

  -O2 turns on all optimization flags specified by -O. It also turns on the following optimization flags:
  -fthread-jumps -falign-functions -falign-jumps -falign-loops -falign-labels -fcaller-saves -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fdelete-null-pointer-checks -fexpensive-optimizations -fgcse -fgcse-lm -foptimize-sibling-calls -fpeephole2 -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop -fsched-interblock -fsched-spec -fschedule-insns -fschedule-insns2 -fstrict-aliasing -fstrict-overflow -ftree-pre -ftree-vrp

You can experiment with individual flags, and the following commands can help you narrow down which options to explore.

  Alternatively you can discover which binary optimizations are enabled by -O3 by using:

  gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
  gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
  diff /tmp/O2-opts /tmp/O3-opts | grep enabled

Once you find the optimization responsible, you won't need the cout.

+1
Jan 12 '09 at 15:10

If it is possible for you, you could try splitting your code into:

  • the library you want to test, compiled with optimizations enabled
  • a test program that dynamically links against the library, compiled with optimizations disabled.

Otherwise, you can specify a different optimization level (it looks like you are using gcc...) for just the test function, using the optimize attribute (see http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes ).

+1
Jan 12 '09 at 15:11

You could create a dummy function in a separate cpp file that does nothing, but takes as an argument whatever type your calculation produces. Then call this function with the results of your calculation, forcing gcc to generate the intermediate code; the only penalty is the cost of the function call (which should not skew your results unless you call it a lot!).

+1
Jan 12 '09 at 15:35
 #include <iostream>

 // Mark coords as extern.
 // The compiler is now NOT allowed to optimise away coords,
 // thus it cannot remove the loop where you initialise it.
 // This is because the code could be used by another compilation unit.
 extern double coords[500][3];
 double coords[500][3];

 int main()
 {
     // perform a simple initialization of all coordinates:
     for (int i = 0; i < 500; ++i)
     {
         coords[i][0] = 3.23;
         coords[i][1] = 1.345;
         coords[i][2] = 123.998;
     }

     std::cout << "hello world !" << std::endl;
     return 0;
 }
+1
Jan 12 '09 at 17:01

edit: the simplest thing you can do is simply use the data somewhere after the function runs, outside of your timings. Like,

 StartBenchmarking(); // ie, read a performance counter

 for (int i = 0; i < 500; ++i)
 {
     coords[i][0] = 3.23;
     coords[i][1] = 1.345;
     coords[i][2] = 123.998;
 }

 StopBenchmarking(); // what comes after this won't go into the timer

 // this is just to force the compiler to use coords
 double foo = 0;
 for (int j = 0; j < 500; ++j)
 {
     foo += coords[j][0] + coords[j][1] + coords[j][2];
 }
 cout << foo;



In some cases, I have to hide the in vitro test inside a function and pass the benchmark data in through volatile pointers. This tells the compiler that it must not elide writes through those pointers (since they could be, for example, memory-mapped I/O). So,

 void test1(volatile double *coords)
 {
     // perform a simple initialization of all coordinates:
     for (int i = 0; i < 1500; i += 3)
     {
         coords[i+0] = 3.23;
         coords[i+1] = 1.345;
         coords[i+2] = 123.998;
     }
 }

For some reason I still don't understand, this doesn't always work in MSVC, but it often does — look at the assembly to be sure. Also remember that volatile will inhibit some compiler optimizations (it forbids the compiler from keeping the pointed-to contents in a register and forces the writes to occur in program order), so it is only trustworthy if you use it for the final write-out of your data.

In general, in vitro testing like this is very useful as long as you remember it is not the whole story. I usually test my new math routines in isolation, so that I can iterate quickly, looking only at the cache and pipeline characteristics of my algorithm on consistent data.

The difference between profiling in a test tube like this and running it in the "real world" is that in the real world you get wildly varying input data (sometimes the best case, sometimes the worst, sometimes pathological), the cache will be in an unknown state when you enter the function, and you may have other threads contending for the bus; so you should also run some benchmarks on this function in vivo when you are done.

+1
Jan 12 '09 at 21:10

I don't know if GCC has a similar feature, but with VC++ you can use:

 #pragma optimize 

to selectively enable/disable optimizations. If GCC has similar capabilities, you could build with full optimization and just disable it where necessary, to make sure your code actually gets called.

0
Jan 12 '09 at 15:16

A small example of unwanted optimization:

 #include <vector>
 #include <iostream>

 using namespace std;

 int main()
 {
     double coords[500][3];

     // perform a simple initialization of all coordinates:
     for (int i = 0; i < 500; ++i)
     {
         coords[i][0] = 3.23;
         coords[i][1] = 1.345;
         coords[i][2] = 123.998;
     }

     cout << "hello world !" << endl;
     return 0;
 }

If you comment out the code from the "double coords[500][3]" declaration through the end of the for loop, the compiler generates exactly the same assembly code (just tried with g++ 4.3.2). I know this example is too simple, and I could not reproduce the behavior with a std::vector of a simple "Coordinates" struct.

However, I think this example still shows that some optimizations can introduce errors into a benchmark, and I wanted to avoid surprises of this kind when introducing new code into the library. It is easy to imagine that the new context might inhibit some optimizations and lead to a very inefficient library.

The same should apply to virtual functions (but I won't prove it here). Used in a context where static linkage would do the job, I am pretty sure a decent compiler will eliminate the extra indirection of the virtual call. I could time such a call in a loop and conclude that calling a virtual function is no big deal — and then call it a hundred thousand times in a context where the compiler cannot figure out the exact pointed-to type, and watch execution time grow by 20%...

0
Jan 12 '09 at 16:18

At startup, read from a file. In your code, write something like: if (input == "x") cout << result_of_benchmark;

The compiler will not be able to eliminate the calculation, and if you make sure the input is never "x", you won't be benchmarking iostream either.

0
Jan 12 '09 at 19:21