C / C ++ Optimization: Quickly Reduce Negation

I need to quickly roll up a very large number of doubles. If the generator_bit generates 0, the sign shall be changed. If the generator bit generates 1, then nothing happens. The cycle is executed many times, and the generator bit is very fast. On my platform, case 2 is noticeably faster than in case 1. It looks like my processor does not like branching. Is there a faster and more portable way to do this? What do you think of case 3?

// generates 0 and 1 int bit_generator(); // big vector (C++) vector<double> v; // case 1 for (size_t i=0; i<v.size(); ++i) if (bit_generator()==0) v[i] = -v[i]; // case 2 const int sign[] = {-1, 1}; for (size_t i=0; i<v.size(); ++i) v[i] *= sign[bit_generator()]; // case 3 const double sign[] = {-1, 1}; for (size_t i=0; i<v.size(); ++i) v[i] *= sign[bit_generator()]; // case 4 uses C-array double a[N]; double number_generator(); // generates doubles double z[2]; // used as buffer for (size_t i=0; i<N; ++i) { z[0] = number_generator(); z[1] = -z[0]; a[i] = z[bit_generator()]; } 

EDIT: Added case 4 and the C tag because the vector can be a simple array. Since I can control how the double digits are generated, I reworked the code, as shown in case 4. It avoids additional multiplication and branching at the same level. I guess it should be pretty fast on all platforms.

+4
source share
7 answers

If you do not want to resize the vector in the loop, raise v.size () from the for expression, i.e.

 const unsigned SZ=v.size(); for (size_t i=0; i<SZ; ++i) if (bit_generator()==0) v[i] = -v[i]; 

If the compiler cannot see what is happening in bit_generator (), it can be very difficult for the compiler to prove that v.size () is not changing, which makes it impossible to loop or vectorize.

UPDATE: I did some tests, and on my hardware method 2 seems to be the fastest. However, it seems faster to use a template that I call a "group action" :-). Basically, you group several solutions into one value and switch it:

 const size_t SZ=v.size(); for (size_t i=0; i<SZ; i+=2) // manual loop unrolling { int val=2*bit_generator()+bit_generator(); switch(val) // only one conditional { case 0: break; // nothing happes case 1: v[i+1]=-v[i+1]; break; case 2: v[i]=-v[i]; break; case 3: v[i]=-v[i]; v[i+1]=-v[i+1]; } } // not shown: wrap up the loop if SZ%2==1 
+8
source

if you can assume that the character is represented by one specific bit, for example, in x86 implementations, you can simply do:

 v[i] ^= !bit_generator() << SIGN_BIT_POSITION; // negate the output of // bit_generator because 0 means // negate and one means leave // unchanged. 

In x86, the sign bit is MSB, so to double bit 63:

 #define SIGN_BIT_POSITION 63 

will do the trick.

Edit:

Based on the comments, I should add that you may need to do some extra work for this to compile, since v is a double array and bit_generator() returns an int . You can do it as follows:

 union int_double { double d; // assumption: double is 64 bits wide long long int i; // assumption: long long is 64 bits wide }; 

(The syntax may be slightly different for C, because you may need a typedef.)

Then define v as an int_double vector and use:

 v[i].i ^= bit_generator() << SIGN_BIT_POSITION; 
+5
source

As a rule, if there is an if() in the loop, this loop cannot be vectorized or expanded, and the code should be executed once per pass, maximizing the loop overhead. Case 3 should do very well, especially if the compiler can use SSE instructions.

For fun, if you use GCC, use the -S -o foo.S -c foo.c flags instead of the usual -o foo.o -c foo.c flags. This will give you the build code and you will see what happens compiled for your three cases.

+3
source

You do not need a lookup table, a simple formula is enough:

 const size_t size = v.size(); for (size_t i=0; i<size; ++i) v[i] *= 2*bit_generator() - 1; 
+2
source

assuming that actual negation is fast (a good guess on a modern compiler and processor), you can use conditional assignment, which is also fast on modern CPUs, to choose one of two options:

 v[i] = bit_generator() ? v[i] : -v[i]; 

This avoids branches and allows the compiler to vectorize the loop and speed up its execution.

+1
source

Can you rewrite bit_generator so that it returns 1 and -1 instead? This eliminates indirection from the equation at the possible cost of some clarity.

0
source

Premature optimization is the root of unresolved SO questions

On my machine running 5333.24 BogoMIPS, the timings for 1'000 iterations in an array of 1'000'000 double, give the following expressions for expression:

 p->d = -p->d 7.33 ns p->MSB(d) ^= 0x80 6.94 ns 

Where MSB (d) is the pseudo-code to capture the most significant byte of d . This means that naive d = -d takes 5.32% longer than with a confusing approach. For a billion such denials, this means the difference between 7.3 and 6.9 seconds.

Someone must have a terribly large bunch of doubles to take care of this optimization.

By the way, I had to print the contents of the array at the end, or my compiler optimized the entire test for zero op codes.

-1
source

All Articles