It seems to me that you are starting a new thread for every prime check. That is not a good idea IMHO, because starting and joining a thread, plus the synchronization, adds overhead to every single check. Thread creation can be quite slow.
I would recommend starting the 4 threads once, outside the main for loop, and having each thread process a quarter of the range. This may require additional synchronization, though, because to test a candidate N, your code apparently needs the primes up to sqrt(N) to be available first.
From my point of view, it would be easier to use the Sieve of Eratosthenes, which is much easier to parallelize without any locking (although a problem known as false sharing can still occur).
EDIT
Here is a version I quickly put together using the Sieve of Eratosthenes:
```cpp
void processSieve(const unsigned long long &l,
                  const unsigned long long &start,
                  const unsigned long long &end,
                  const unsigned long long &step,
                  vector<char> &is_prime) {
    for (unsigned long long i = start; i <= end; i += step)
        if (is_prime[i])
            for (unsigned long long j = i + i; j <= l; j += i)
                is_prime[j] = 0;
}

void runSieve(const unsigned long long &l) {
    vector<char> is_prime(l + 1, 1);
    unsigned long long end = sqrt(l);
    processSieve(l, 2, end, 1, is_prime);

    primeContainer.clear();
    for (unsigned long long i = 2; i <= l; ++i)   // start at 2: 1 is not prime
        if (is_prime[i])
            primeContainer.insert(i);
}

void runSieveThreads(const unsigned long long &l) {
    vector<char> is_prime(l + 1, 1);
    unsigned long long end = sqrt(l);

    vector<thread> threads;
    threads.reserve(cpuCount);
    for (unsigned long long i = 0; i < cpuCount; ++i)
        threads.emplace_back(processSieve, l, 2 + i, end, cpuCount, ref(is_prime));
    for (unsigned long long i = 0; i < cpuCount; ++i)
        threads[i].join();

    primeContainer.clear();
    for (unsigned long long i = 2; i <= l; ++i)   // start at 2: 1 is not prime
        if (is_prime[i])
            primeContainer.insert(i);
}
```
Measurement results for primes up to 1,000,000 (MSVC 2013, Release build):
```
runLoop:                    204.02 ms
runLoopThread:            43947.4  ms
runSieve:                    30.003 ms
runSieveThreads (8 cores):   24.0024 ms
```
Up to 10,000,000:
```
runLoop:                  4387.44 ms
```
The times include the final processing of the vector and inserting the results into the prime container.
As you can see, the sieve version is much faster than your approach, even single-threaded. (For your version I had to replace the shared lock with a regular mutex lock, since MSVC 2013 does not have `shared_lock`, so the numbers I measured for it are probably worse than what you would see.)
But you can also see that the multithreaded sieve is not as fast as one would hope: with 8 cores, i.e. 8 threads, linear scaling would be 8x over a single thread, even though there is no locking at all. (Some multiples may be crossed off redundantly when a composite has not yet been marked by another thread, but the result is still correct: only zeros are ever written, so it does not matter if several threads write the same cell at once.) The reason the speedup is not linear is most likely the "false sharing" I mentioned earlier: threads writing zeros keep invalidating cache lines used by the other threads.