According to the multiplicative formula from Wikipedia, C(n, k) = ∏_{i=1..k} (n − k + i) / i, the fastest way is to split the range i = 1..k into as many segments as there are threads, give each thread one segment of the range, and have each thread update the final result under a lock. The "academic" way is to break the range down into tasks, where each task computes one factor (n − k + i) / i; then, no matter how many threads you have, they all loop, asking for the next task. The first is faster, the second is more academic.

EDIT: further explanation. In both cases we have a certain number of threads. Typically the number of threads equals the number of processor cores, because there is no benefit in adding more. The difference between the two methods is what these threads do.
In the first case, each thread is given N, K, I1 and I2, where [I1, I2] is a segment of the range 1..K. Each thread then has all the data it needs, so it computes its part of the result and, when it finishes, updates the final result.
In the second case, each thread is given N, K and access to a synchronized counter that counts from 1 to K. Each thread takes one value from this shared counter, computes one factor of the result, updates the final result, and loops until the counter tells it there are no more elements. If some processor cores happen to be faster than others, this second method maximizes the use of all cores. Its disadvantage is the extra synchronization, which effectively keeps, say, 20% of the threads blocked at any given time.
Dialecticus