C ++ streaming application is slower than non-threaded

Question

C ++ streaming application is slower than non-threaded

I am currently writing a prime number generator in C ++. First I made a single-threaded version, and later - a multi-threaded version.

I found out that if my program generates values less than 100'000 , the single-threaded version is faster than the multi-threaded one. Obviously I'm doing something wrong.

My code is below:

 #include <iostream> #include <fstream> #include <set> #include <string> #include <thread> #include <mutex> #include <shared_mutex> using namespace std; set<unsigned long long> primeContainer; shared_mutex m; void checkPrime(const unsigned long long p) { if (p % 3 == 0) return; bool isPrime = true; for (set<unsigned long long>::const_iterator it = primeContainer.cbegin(); it != primeContainer.cend(); ++it) { if (p % *it == 0) { isPrime = false; break; } if (*it * *it > p) // check only up to square root break; } if (isPrime) primeContainer.insert(p); } void checkPrimeLock(const unsigned long long p) { if (p % 3 == 0) return; bool isPrime = true; try { shared_lock<shared_mutex> l(m); for (set<unsigned long long>::const_iterator it = primeContainer.cbegin(); it != primeContainer.cend(); ++it) { if (p % *it == 0) { isPrime = false; break; } if (*it * *it > p) break; } } catch (exception& e) { cout << e.what() << endl; system("pause"); } if (isPrime) { try { unique_lock<shared_mutex> l(m); primeContainer.insert(p); } catch (exception& e) { cout << e.what() << endl; system("pause"); } } } void runLoopThread(const unsigned long long& l) { for (unsigned long long i = 10; i < l; i += 10) { thread t1(checkPrimeLock, i + 1); thread t2(checkPrimeLock, i + 3); thread t3(checkPrimeLock, i + 7); thread t4(checkPrimeLock, i + 9); t1.join(); t2.join(); t3.join(); t4.join(); } } void runLoop(const unsigned long long& l) { for (unsigned long long i = 10; i < l; i += 10) { checkPrime(i + 1); checkPrime(i + 3); checkPrime(i + 7); checkPrime(i + 9); } } void printPrimes(const unsigned long long& l) { if (1U <= l) cout << "1 "; if (2U <= l) cout << "2 "; if (3U <= l) cout << "3 "; if (5U <= l) cout << "5 "; for (auto it = primeContainer.cbegin(); it != primeContainer.cend(); ++it) { if (*it <= l) cout << *it << " "; } cout << endl; } void writeToFile(const unsigned long long& l) { string name = "primes_" + to_string(l) + ".txt"; ofstream f(name); if (f.is_open()) { if (1U <= l) f << "1 "; if (2U <= l) f << "2 "; if (3U <= l) f << "3 "; if (5U <= l) f << "5 "; for (auto it = primeContainer.cbegin(); it != primeContainer.cend(); ++it) { if (*it <= l) f << *it << " "; } } else { cout << "Error opening file." << endl; system("pause"); } } int main() { unsigned int n = thread::hardware_concurrency(); std::cout << n << " concurrent threads are supported." << endl; unsigned long long limit; cout << "Please enter the limit of prime generation: "; cin >> limit; primeContainer.insert(7); if (10 < limit) { //runLoop(limit); //single-threaded runLoopThread(limit); //multi-threaded } printPrimes(limit); //writeToFile(limit); system("pause"); return 0; }

In the main function, you will find comments about which function is single and multi-threaded.

The main difference between them is the use of locks common to container iterations and unique to insertion. If that matters, my processor has 4 cores.

Why is a single thread version faster?

+6

c ++ multithreading c ++ 11

STALKER May 15, '16 at 20:36

source share

4 answers

You have a few issues.

Firstly, you are constantly creating and destroying threads. Spend each flow cycle until there is more work.

Secondly, your locks are too thin, and as a result you acquire them too often. Each thread captures a block of 100 numbers for testing, and not one at a time, and let them insert the found primes from each block at a time.

+11

David schwartz May 15, '16 at 20:59

source share

Since the comments section is a bit crowded, and the OP showed interest in a contactless solution, I gave an example of this approach below (in a semi-pseudo-code):

 vector<uint64_t> primes_thread1; vector<uint64_t> primes_thread2; ... // check all numbers in [start, end) void check_primes(uint64_t start, uint64_t end, vector<uint64_t> & out) { for (auto i = start; i < end; ++i) { if (is_prime(i)) { // simply loop through all odds from 3 to sqrt(i) out.push_back(i); } } } auto f1 = async(check_primes, 1, 1000'000, ref(primes_thread1)); auto f2 = async(check_primes, 1000'000, 2000'000, ref(primes_thread2)); ... f1.wait(); f2.wait(); ... primes_thread1.insert( primes_thread1.begin(), primes_thread2.cbegin(), primes_thread2.cend() ); primes_thread1.insert( primes_thread1.begin(), primes_thread3.cbegin(), primes_thread3.cend() ); ... // primes_thread1 contains all primes found in all threads

Obviously, this can be well reorganized by parameterizing the number of threads and the size of each range. I was verbose (I hope) to more clearly illustrate the concept of avoiding blocking, without sharing a single state in the first place.

+2

denniskb May 16 '16 at 10:35

source share

Your first test may have a different problem. You never check against 7 as a divisor.

In addition, your test assumes that firstContainer already contains all primes from 10 to the sqare root of the number being tested. This may not be the case if you use streams to fill the container.

If you fill a container with ever-increasing numbers (and your algorithm relies on this), you can use std :: vector instead of std :: set for better performance.

+2

CAF May 16 '16 at 10:45

source share

axalis · Accepted Answer · 2016-05-15T21:00:03+0000

It seems to me that you are starting a new thread for every prime check. This is not good IMHO, because the start / end of the stream plus synchronization adds to the calculation of each stroke. Starting a thread can be quite slow.

I would recommend starting these 4 threads outside the main for loop and processing 1/4 of the range in each thread. But this may require additional synchronization, because in order to verify simple code, the above code, apparently, must first have prime numbers up to sqrt N.

From my point of view, it would be easier to use the Sieve of Erastothenes algorithm , which can be much easier to parallelize without any blocking (however, a problem known as fake exchange may still occur).

EDIT

Here I quickly created a version using an Erastothenes sieve:

 void processSieve(const unsigned long long& l, const unsigned long long& start, const unsigned long long& end, const unsigned long long& step, vector<char> &is_prime) { for (unsigned long long i = start; i <= end; i += step) if (is_prime[i]) for (unsigned long long j = i + i; j <= l; j += i) is_prime[j] = 0; } void runSieve(const unsigned long long& l) { vector<char> is_prime(l + 1, 1); unsigned long long end = sqrt(l); processSieve(l, 2, end, 1, is_prime); primeContainer.clear(); for (unsigned long long i = 1; i <= l; ++i) if (is_prime[i]) primeContainer.insert(i); } void runSieveThreads(const unsigned long long& l) { vector<char> is_prime(l + 1, 1); unsigned long long end = sqrt(l); vector<thread> threads; threads.reserve(cpuCount); for (unsigned long long i = 0; i < cpuCount; ++i) threads.emplace_back(processSieve, l, 2 + i, end, cpuCount, ref(is_prime)); for (unsigned long long i = 0; i < cpuCount; ++i) threads[i].join(); primeContainer.clear(); for (unsigned long long i = 1; i <= l; ++i) if (is_prime[i]) primeContainer.insert(i); }

Measurement results, strokes up to 1,000,000 (MSVC 2013, issue):

 runLoop: 204.02 ms runLoopThread: 43947.4 ms runSieve: 30.003 ms runSieveThreads (8 cores): 24.0024 ms

Up to 10,000,000:

 runLoop: 4387.44 ms // runLoopThread disabled, taking too long runSieve: 350.035 ms runSieveThreads (8 cores): 285.029 ms

Times include final processing of the vector and clicking the results on the main number.

As you can see, the Sieve version is much faster than your version, even in the single-threaded version (for your version of the mutex, I had to change the lock to regular mutex locks, since MSVC 2013 does not have shared_lock, so the results are probably much worse than yours).

But you can see that the multi-threaded version of the sieve still does not work as fast as expected (8 cores, i.e. 8 threads, linear acceleration will be 8 times faster than one thread), although there is no blockage (some numbers can be executed unnecessarily if they have not yet been marked as “no prime numbers” by other threads, but in general the results should be stable, since only 0 is set each time, it does not matter if multiple threads are set at the same time). The reason why the acceleration is not linear is most likely caused by " false communication ", as I mentioned earlier, threads that write zeros are invalid on other cache lines.

C ++ streaming application is slower than non-threaded

More articles: