Sometimes you need to step back and look at the problem again. The first question is obvious: do you really need this loop at all? Could an alternative algorithm do a better job?
That said, assuming that for your problem you have already settled on this algorithm, let's try to improve what we have.
Disclaimer: the method I am describing is inspired by the successful approach Tim Peters used to improve the traditional merge sort implementation, which led to TimSort. So please bear with me ;)
1. Extract properties
The main problem I see is the dependency between iterations, which prevents most of the possible optimizations and defeats many attempts at parallelization.
```cpp
int64_t v = in[i];
max += v;
if (v > max) max = v;
out[i] = max;
```
Let's rework this code functionally:
```cpp
max = calc(in[i], max);
out[i] = max;
```
Where:
```cpp
int64_t calc(int64_t const in, int64_t const max) {
    int64_t const bumped = max + in;
    return in > bumped ? in : bumped;
}
```
Or rather, a simplified version (treating signed overflow as undefined behavior):
```cpp
int64_t calc(int64_t const in, int64_t const max) {
    return 0 > max ? in : max + in;
}
```
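The simplification works because in > max + in is equivalent to 0 > max. As a quick sanity check, the three formulations can be compared directly (the names original and calc2 are mine):

```cpp
#include <cstdint>

// The original update, extracted as a function for comparison.
int64_t original(int64_t const in, int64_t max) {
    max += in;
    if (in > max) max = in;
    return max;
}

// The functional version.
int64_t calc(int64_t const in, int64_t const max) {
    int64_t const bumped = max + in;
    return in > bumped ? in : bumped;
}

// The simplified version: in > max + in is equivalent to 0 > max.
int64_t calc2(int64_t const in, int64_t const max) {
    return 0 > max ? in : max + in;
}
```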
Do you notice the tipping point? The behavior changes depending on whether the (poorly named (*)) max is positive or negative.

This tipping point makes it interesting to look at the values in in more closely, especially at the effect they can have on max:
- max < 0 and in[i] < 0, then out[i] = in[i] < 0
- max < 0 and in[i] > 0, then out[i] = in[i] > 0
- max > 0 and in[i] < 0, then out[i] = (max + in[i]) ?? 0 (the sign is unknown)
- max > 0 and in[i] > 0, then out[i] = (max + in[i]) > 0
(*) poorly named, because it is also an accumulator, which the name hides. I have no better suggestion, though.
2. Optimization of operations
This allows us to discover interesting cases:
- if we have a slice [i, j) of the array containing only negative values (let's call it a negative slice), then we can simply do std::copy(in + i, in + j, out + i) and set max = out[j-1]
- if we have a slice [i, j) of the array containing only positive values, then this is pure accumulation code (which can easily be unrolled)
- max turns positive as soon as in[i] is positive
Therefore, it may be interesting (though perhaps not, I make no promises) to establish a profile of the input before actually working on it. Note that the profile can be computed chunk by chunk for large inputs, for example tuning the chunk size based on the L1 cache size.
For reference, the 3 routines:
```cpp
void copy(int64_t const in[], int64_t out[], size_t const begin, size_t const end) {
    std::copy(in + begin, in + end, out + begin);
}
```
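The other two routines (regular and accumulate, as named by the dispatch code) did not survive in this copy of the text. Here is a sketch of what they might look like, under the assumption that each routine reads the previous running max back from out[begin - 1]; the helper prev_max is my own invention:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the running max before this slice, read back from out.
// "No predecessor" is modeled as an arbitrarily negative max.
static int64_t prev_max(int64_t const out[], size_t const begin) {
    return begin == 0 ? INT64_MIN : out[begin - 1];
}

// Neutral slices: the general per-element update (the simplified calc).
void regular(int64_t const in[], int64_t out[], size_t const begin, size_t const end) {
    int64_t max = prev_max(out, begin);
    for (size_t i = begin; i != end; ++i) {
        max = 0 > max ? in[i] : max + in[i];
        out[i] = max;
    }
}

// Positive slices: after the first element, the update degenerates into a
// plain prefix sum, which is easy to unroll or vectorize. Assumes begin < end.
void accumulate(int64_t const in[], int64_t out[], size_t const begin, size_t const end) {
    int64_t const max = prev_max(out, begin);
    out[begin] = 0 > max ? in[begin] : max + in[begin];
    for (size_t i = begin + 1; i != end; ++i) {
        out[i] = out[i - 1] + in[i];
    }
}
```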
Now, suppose we can somehow characterize the input using a simple structure:
```cpp
struct Slice {
    enum class Type { Negative, Neutral, Positive };
    Type type;
    size_t begin;
    size_t end;
};

typedef void (*Func)(int64_t const[], int64_t[], size_t, size_t);

Func select(Slice::Type const t) {
    switch (t) {
    case Slice::Type::Negative: return &copy;
    case Slice::Type::Neutral:  return &regular;
    case Slice::Type::Positive: return &accumulate;
    }
}

void theLoop(std::vector<Slice> const& slices, int64_t const in[], int64_t out[]) {
    for (Slice const& slice: slices) {
        Func const f = select(slice.type);
        (*f)(in, out, slice.begin, slice.end);
    }
}
```
Now, since the loop body is minimal, computing the characterization up front might be too expensive as is... however, it parallelizes nicely.
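As a sketch, the characterization itself could look like the following. Zeros are classified as Neutral here, since neither the copy shortcut nor the accumulate shortcut is valid for them; that choice, and the function name characterize, are my own:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Slice {
    enum class Type { Negative, Neutral, Positive };
    Type type;
    size_t begin;
    size_t end;
};

// Split [0, n) into maximal runs of same-typed values.
std::vector<Slice> characterize(int64_t const in[], size_t const n) {
    auto const type_of = [](int64_t const v) {
        return v > 0 ? Slice::Type::Positive
             : v < 0 ? Slice::Type::Negative
                     : Slice::Type::Neutral;  // zeros get the general code
    };
    std::vector<Slice> slices;
    size_t begin = 0;
    for (size_t i = 1; i <= n; ++i) {
        if (i == n || type_of(in[i]) != type_of(in[begin])) {
            slices.push_back({type_of(in[begin]), begin, i});
            begin = i;
        }
    }
    return slices;
}
```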
3. Simple parallelization
Note that the characterization is a pure function of the input. Therefore, assuming you work chunk by chunk, it is possible to run, in parallel:

- a Slice Producer: a characterization thread that computes the Slice::Type values
- a Slice Consumer: a worker thread that actually executes the code
Even if the input is essentially random, provided the chunks are small enough (for example, fitting the L1 CPU cache), there may be chunks for which the shortcuts apply. Synchronization between the two threads can be done either with a simple thread-safe queue of Slice (producer/consumer) plus a bool last attribute to stop consumption, or by creating the Slice entries in a vector with an Unknown type and having the consumer block until the type is known (using atomics).
Note: since the characterization is pure, it is embarrassingly parallel.
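The producer/consumer wiring can be sketched as below. This is a minimal illustration, not a tuned implementation: the queue takes a lock per slice, the Slice::Type computation is elided, and the consumer simply runs the generic per-element update on every slice:

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

struct Slice { size_t begin, end; };  // characterization type omitted for brevity

// Minimal single-producer / single-consumer queue guarded by a mutex.
// close() plays the role of the "bool last" attribute from the text.
class SliceQueue {
public:
    void push(Slice s) { { std::lock_guard<std::mutex> lk(m_); q_.push(s); } cv_.notify_one(); }
    void close()       { { std::lock_guard<std::mutex> lk(m_); closed_ = true; } cv_.notify_one(); }
    std::optional<Slice> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;  // closed and drained: stop consuming
        Slice const s = q_.front(); q_.pop(); return s;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Slice> q_;
    bool closed_ = false;
};

// Producer cuts the input into chunks; consumer runs the kernel slice by slice.
std::vector<int64_t> pipeline(std::vector<int64_t> const& in, size_t const chunk) {
    std::vector<int64_t> out(in.size());
    SliceQueue queue;
    std::thread producer([&] {
        for (size_t b = 0; b < in.size(); b += chunk)
            queue.push({b, std::min(b + chunk, in.size())});
        queue.close();
    });
    std::thread consumer([&] {
        int64_t max = 0;  // assumed "no predecessor" state, as in the original loop
        while (auto const s = queue.pop())
            for (size_t i = s->begin; i != s->end; ++i)
                out[i] = max = 0 > max ? in[i] : max + in[i];
    });
    producer.join();
    consumer.join();
    return out;
}
```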
4. More Parallelization: speculative work
Remember that innocent remark: max turns positive as soon as in[i] is positive.
Suppose we can (reliably) guess that Slice[j-1] will produce a negative max value; then the computation on Slice[j] does not depend on anything that precedes it, and we can start working on it right now!
Of course, it is a guess, so we may be wrong... but once we have fully characterized all the slices, we have idle cores, so we might as well use them for speculative work! And if we are wrong? Well, the consumer thread will simply erase our mistake and overwrite it with the correct value.
The heuristic deciding which Slice to compute speculatively should be simple, and it will need tuning. It could even be adaptive... but that may be harder!
Conclusion
Analyze your data set and try to find ways to break the dependencies. If you can, you may well be able to take advantage of them without even going multi-threaded.