Optimize this loop (in C++)

I have performance-critical code in which a function containing a loop is executed many times. Every optimization inside this loop brings a noticeable performance gain. Question: how would you optimize this loop (there isn't much left to optimize...)?

    void theloop(int64_t in[], int64_t out[], size_t N)
    {
        for (uint32_t i = 0; i < N; i++) {
            int64_t v = in[i];
            max += v;
            if (v > max) max = v;
            out[i] = max;
        }
    }

I have tried several things. For example, I replaced the array indexing with pointers incremented on each iteration, but (surprisingly) I lost some performance instead of gaining...

Edit:

  • the name of one variable was changed ( itsMaximums , my error)
  • function is a class method
  • in and out are int64_t , so they can be negative and positive
  • (v > max) can evaluate to true: consider the case when the current max is negative
  • the code runs as a 32-bit build (development) and a 64-bit build (production)
  • N unknown at compile time
  • I tried some SIMD, but couldn't improve performance... (the overhead of moving the variables into __m128i , executing, and storing back was higher than the speedup from SSE. But I am no SSE expert, so maybe my code was just bad)

Results:

I added some loop unrolling and the nice hack from Alex's post. Below are a few results:

  • Original: 14.0s
  • unrolled loop (4 iterations): 10.44s
  • Alex's trick: 10.89s
  • 2) and 3) at once: 11.71s

Strangely, 4) is not faster than 2) and 3). Below is the code for 4):

    for (size_t i = 1; i < N; i += CHUNK) {
        int64_t t_in0 = in[i+0];
        int64_t t_in1 = in[i+1];
        int64_t t_in2 = in[i+2];
        int64_t t_in3 = in[i+3];
        max &= -max >> 63; max += t_in0; out[i+0] = max;
        max &= -max >> 63; max += t_in1; out[i+1] = max;
        max &= -max >> 63; max += t_in2; out[i+2] = max;
        max &= -max >> 63; max += t_in3; out[i+3] = max;
    }
+7
8 answers


# Update: see chat

Hi Jakub, what would you say if I told you I found a version using a heuristic optimization that, for uniformly distributed random data, gets a ~3.2x speedup for int64_t (effectively 10.56x using float s)?

I still need to find time to update the post, but the explanation and code can be found in the chat.
I used the same test-bench code (below) to verify that the results are correct and exactly match the original implementation from your OP. Edit: ironically... that test bench had a fatal flaw, which made its results invalid: the heuristic version was actually skipping parts of the work, but because the existing output was not cleared, it appeared to produce the correct result... (still editing...)


Well, I have now published a benchmark based on your version of the code, as well as my proposed use of partial_sum .

Find all the code here https://gist.github.com/1368992#file_test.cpp

Features

For default configuration

    #define MAGNITUDE     20
    #define ITERATIONS    1024
    #define VERIFICATION  1
    #define VERBOSE       0
    #define LIMITED_RANGE 0  // hide difference in output due to absence of overflows
    #define USE_FLOATS    0

It will (see an output fragment here):

  • run 100 x 1024 iterations (i.e. 100 different random seeds)
  • for a data length of 1048576 (2^20)
  • with random input data uniformly distributed over the full range of the element data type ( int64_t )
  • verify the output by computing a hash digest of the output array and comparing it to the reference implementation from the OP

Results

There are a number of surprising (or not-so-surprising) results:

  • There is no significant performance difference between any of the algorithms (for integer data), provided you compile with optimizations enabled. (See the Makefile ; my arch is 64-bit, Intel Core Q9550 with gcc-4.6.1)

  • The algorithms are not equivalent (you will see that the hash sums differ): in particular, the bit-twiddling trick proposed by Alex does not handle integer overflow in quite the same way (this can be hidden by defining

     #define LIMITED_RANGE 1 

    which limits the input so that overflow will not happen; note that the partial_sum_incorrect version uses equivalent C++ non-bitwise arithmetic that yields the same results:

     return max<0 ? v : max + v; 

    Perhaps this is fine for your purposes?)

  • Surprisingly, it is not more expensive to compute both definitions of the max algorithm at once. You can see this done inside partial_sum_correct : it computes both max "formulations" in a single loop; it is really no more than a tie here, since neither of the two methods is significantly faster...

  • Even more surprisingly, a big performance boost can be had when you are able to use float instead of int64_t . A quick and dirty hack can be applied to the benchmark

     #define USE_FLOATS 1 

    showing that the STL-based algorithm ( partial_sum_incorrect ) runs roughly 2.5x faster when using float instead of int64_t (!!!).
    Note:

    • the partial_sum_incorrect naming refers only to integer overflow, which does not apply to floats; this is apparent from the fact that the hashes match, so in fact it is _partial_sum_float_correct_ :)
    • the current partial_sum_correct implementation does double the work, which makes it perform badly in floating-point mode. See bullet 3.
  • (And there was an off-by-one error in the unrolled-loop version from the OP mentioned earlier)

Partial sum

For your interest, the partial-sum application looks like this in C++11:

    std::partial_sum(data.begin(), data.end(), output.begin(),
                     [](int64_t max, int64_t v) -> int64_t {
                         max += v;
                         if (v > max) max = v;
                         return max;
                     });
+7

First, you need to look at the generated assembly. Otherwise you have no way of knowing what actually happens when this loop is executed.

Now: is this code running on a 64-bit machine? If not, those 64-bit additions might hurt a bit.

This loop seems like an obvious candidate for using SIMD instructions. SSE2 supports a number of SIMD instructions for integer arithmetic, including some that work with two 64-bit values.

In addition, see whether the compiler unrolls the loop properly; if not, do it yourself. Unroll a couple of iterations of the loop, and then reorder it. Put all the memory loads at the top of the loop so they can be started as early as possible.

For the if line, check that the compiler generates a conditional move rather than a branch.

Finally, see whether your compiler supports something like the restrict / __restrict keyword. It is not standard in C++, but it is very useful for telling the compiler that in and out do not point to the same addresses.

Is the size ( N ) known at compile time? If so, make it a template parameter (and then try passing in and out as references to properly sized arrays, as this may also help the compiler with aliasing analysis)

Just some thoughts off the top of my head. But again, study the disassembly. You need to know what the compiler does for you, and especially what it does not do for you.

Edit

with your edit:

    max &= -max >> 63;
    max += t_in0;
    out[i+0] = max;

It strikes me that you have added a huge dependency chain. Before the result can be computed, max must be negated, the result must be shifted, the result of that must be AND'ed with its original value, and the result of that must be added to another variable.

In other words, all of these operations must be serialized. You cannot start one of them before the previous one has finished. That is not necessarily a speedup. Modern pipelined out-of-order CPUs like to do lots of things in parallel. Tying them up in one long chain of dependent instructions is one of the most crippling things you can do. (Of course, if it can be interleaved with other iterations, it could work out better. But my gut feeling is that a simple conditional-move instruction would be preferable.)

+15

Sometimes you need to step back and look at the problem again. The first question is obviously: do you need this at all? Could there be an alternative algorithm that would perform better?

That said, and assuming for the purposes of this question that you have already settled on this algorithm, we can try to reason about what we actually have.

Disclaimer: the approach I am describing is inspired by the successful approach Tim Peters used to improve upon the traditional implementation of introsort, which led to TimSort. So please bear with me ;)

1. Extract properties

The main problem I see is the dependency between iterations, which will prevent many of the possible optimizations and thwart many attempts at parallelization.

    int64_t v = in[i];
    max += v;
    if (v > max) max = v;
    out[i] = max;

Let's rework this code functionally:

    max = calc(in[i], max);
    out[i] = max;

Where:

    int64_t calc(int64_t const in, int64_t const max) {
        int64_t const bumped = max + in;
        return in > bumped ? in : bumped;
    }

Or rather, a simplified version (setting aside overflow, which is undefined):

    int64_t calc(int64_t const in, int64_t const max) {
        return 0 > max ? in : max + in;
    }

Do you notice the tipping point? The behavior changes depending on whether the ill-named (*) max is positive or negative.

This tipping point makes it interesting to look at the values in in more closely, especially according to the effect they might have on max :

  • max < 0 and in[i] < 0 , then out[i] = in[i] < 0
  • max < 0 and in[i] > 0 , then out[i] = in[i] > 0
  • max > 0 and in[i] < 0 , then out[i] = (max + in[i]) ?? 0 (the sign depends on the magnitudes)
  • max > 0 and in[i] > 0 , then out[i] = (max + in[i]) > 0

(*) ill-named because it is also an accumulator, which the name hides. I have no better suggestion, though.

2. Optimization of operations

This allows us to discover interesting cases:

  • if we have a slice [i, j) of the array containing only negative values (which we call a negative slice), then we could do std::copy(in + i, in + j, out + i) and max = out[j-1]
  • if we have a slice [i, j) of the array containing only positive values, then this is pure accumulation code (which can easily be unrolled)
  • max becomes positive as soon as in[i] is positive

Therefore, it could be interesting (but maybe not, I make no promise) to profile the input before actually working on it. Note that the profile could be computed chunk by chunk for large inputs, for example tuning the chunk size based on the cache-line size.

For reference, the 3 routines:

    void copy(int64_t const in[], int64_t out[],
              size_t const begin, size_t const end)
    {
        std::copy(in + begin, in + end, out + begin);
    } // copy

    void accumulate(int64_t const in[], int64_t out[],
                    size_t const begin, size_t const end)
    {
        assert(begin != 0);
        int64_t max = out[begin - 1];
        for (size_t i = begin; i != end; ++i) {
            max += in[i];
            out[i] = max;
        }
    } // accumulate

    void regular(int64_t const in[], int64_t out[],
                 size_t const begin, size_t const end)
    {
        assert(begin != 0);
        int64_t max = out[begin - 1];
        for (size_t i = begin; i != end; ++i) {
            max = 0 > max ? in[i] : max + in[i];
            out[i] = max;
        }
    } // regular

Now, suppose we can somehow characterize the input using a simple structure:

    struct Slice {
        enum class Type { Negative, Neutral, Positive };
        Type type;
        size_t begin;
        size_t end;
    };

    typedef void (*Func)(int64_t const[], int64_t[], size_t, size_t);

    Func select(Slice::Type t) {
        switch (t) {
        case Slice::Type::Negative: return &copy;
        case Slice::Type::Neutral:  return &regular;
        case Slice::Type::Positive: return &accumulate;
        }
    }

    void theLoop(std::vector<Slice> const& slices,
                 int64_t const in[], int64_t out[])
    {
        for (Slice const& slice: slices) {
            Func const f = select(slice.type);
            (*f)(in, out, slice.begin, slice.end);
        }
    }

Now, since the loop body itself is so minimal, computing the characterization may be too expensive as it is... however, it lends itself nicely to parallelization.

3. Simple parallelization

Note that the characterization is a pure function of the input. Therefore, assuming you work chunk by chunk, it would be possible to parallelize it with:

  • Slice Producer: a characterizing thread, which computes the Slice::Type values
  • Slice Consumer: a worker thread that actually executes the code

Even if the input is essentially random, if the chunks are small enough (for example, an L1 CPU cache line), there may be chunks for which this works. Synchronization between the two threads can be done with a simple thread-safe queue of Slices (producer/consumer), adding a bool last attribute to stop consumption, or by creating the Slices in a vector with an Unknown type and having the consumer block until the type is known (using atomics).

Note: since the characterization is pure, it is embarrassingly parallel.

4. More Parallelization: speculative work

Remember that innocent remark: max becomes positive as soon as in[i] is positive.

Suppose we can guess (reliably) that Slice[j-1] will produce a negative max value; then the computation on Slice[j] is independent of whatever preceded it, and we can start working on it right now!

Of course, it is a guess, so we might be wrong... but once we have fully characterized all the slices, we have idle cores, so we might as well use them for speculative work! And if we are wrong? Well, the consumer thread will simply quietly erase our mistake and replace it with the correct value.

The heuristic for speculatively computing a Slice should be simple, and it will need tuning. It could be adaptive, too... but that may be harder!

Conclusion

Analyze your dataset and try to find whether it is possible to break the dependencies. If it is, you can probably take advantage of it, without even having to go multi-threaded.

+5

If the values of max and in[] are far from the 64-bit min/max (say, they are always between -2^61 and +2^61), you can try a loop without the conditional branch, which may bring some performance improvement:

    for (uint32_t i = 1; i < N; i++) {
        max &= -max >> 63;  // assuming >> does an arithmetic shift with sign extension
        max += in[i];
        out[i] = max;
    }

In theory, the compiler could do a similar trick itself, but without looking at the disassembly it is hard to say whether it does.

+4

The code already looks pretty fast. Depending on the nature of the input array, you could try special-casing: for example, if you happen to know that for a particular call all the input numbers are positive, out[i] will simply be the running sum, with no need for an if branch.

+1

Making sure the method is not virtual , plus inline , __attribute__((always_inline)) and -funroll-loops , all seem like good options to explore.

Only by benchmarking them can you determine whether they are worthwhile optimizations in your larger program.

+1

The only thing that comes to mind that might help a bit is to use pointers rather than array indices within your loop, something like

    void theloop(int64_t in[], int64_t out[], size_t N)
    {
        int64_t max = in[0];
        out[0] = max;
        int64_t *ip = in + 1, *op = out + 1;
        for (uint32_t i = 1; i < N; i++) {
            int64_t v = *ip;
            ip++;
            max += v;
            if (v > max) max = v;
            *op = max;
            op++;
        }
    }

The idea here is that indexing into the array may compile to taking the base address of the array, multiplying the element size by the index, and adding the result to get the element's address. Keeping running pointers avoids this. I suspect a good optimizing compiler will do this already, so you would need to study the current assembler output.

-1
    int64_t max = 0, i;
    for (i = N-1; i > 0; --i)  /* Comparing with 0 is faster */
    {
        max = in[i] > 0 ? max + in[i] : in[i];
        out[i] = max;
        --i;  /* Will reduce the i >= 0 checks by N/2 times */
        max = in[i] > 0 ? max + in[i] : in[i];  /* Reduces the v = in[i], max += v operations by N times */
        out[i] = max;
    }
    if (0 == i)  /* When N is odd */
    {
        max = in[i] > 0 ? max + in[i] : in[i];
        out[i] = max;
    }
-3
