I am writing a program that does some lengthy computation, which I can break down into as many tasks as I want. For discussion, suppose I write an algorithm to determine if a number p is prime, trying to divide it into all numbers between 2 and p-1. Obviously, this task can be divided into many threads.
In fact, I wrote an example application that does just that. As a parameter, I specify the number I want to check and the number of threads to use (each stream is assigned a range of equal number sizes to try to divide p by one, they cover the entire range).
My car has 8 cores. I started to run the program with a lot, which, as I know, is simple (2971215073) and with 1, 2, 3 threads, etc. Until it reaches 8 threads - every time the program runs faster than the previous one, this was what I expected. However, when I tried numbers greater than 8, the calculation time actually decreased (at least a little)!
There is no I / O or anything similar in my threads, just pure cpu calculations. I expected the runtime to deteriorate when I transfer 8 threads, because there will be more context switching and the number of parallel threads will remain at level 8. It is difficult to say where the peak is, since the differences are very small and vary from one run to another, however it is clear that 50 threads somehow work faster than 8 (for ~ 300 ms) ...
I assume that since I have so many threads, I get more uptime since I have most of the pool of system threads, so my threads are selected more. However, it seems that it makes no sense that the more threads I create, the faster the program runs (otherwise why not create 1000 threads?).
Can anyone suggest an explanation, and perhaps the best way is to use the number of threads to create relative to the number of cores on the machine?
Thanks.
My code for those interested (compiled on Windows, VS2012):
#include <Windows.h> #include <conio.h> #include <iostream> #include <thread> #include <vector> using namespace std; typedef struct { unsigned int primeCandidate; unsigned int rangeStart; unsigned int rangeEnd; } param_t; DWORD WINAPI isDivisible(LPVOID p) { param_t* param = reinterpret_cast<param_t*>(p); for (unsigned int d = param->rangeStart; d < param->rangeEnd; ++d) { if (param->primeCandidate % d == 0) { cout << param->primeCandidate << " is divisible by " << d << endl; return 1; } } return 0; } bool isPrime(unsigned int primeCandidate, unsigned int numOfCores) { vector<HANDLE> handles(numOfCores); vector<param_t> params(numOfCores); for (unsigned int i = 0; i < numOfCores; ++i) { params[i].primeCandidate = primeCandidate; params[i].rangeStart = (primeCandidate - 2) * (static_cast<double>(i) / numOfCores) + 2; params[i].rangeEnd = (primeCandidate - 2) * (static_cast<double>(i+1) / numOfCores) + 2; HANDLE h = CreateThread(nullptr, 0, reinterpret_cast<LPTHREAD_START_ROUTINE>(isDivisible), ¶ms[i], 0, 0); if (NULL == h) { cout << "ERROR creating thread: " << GetLastError() << endl; throw exception(); } handles[i] = h; } DWORD ret = WaitForMultipleObjects(numOfCores, &handles[0], TRUE, INFINITE); if (ret >= WAIT_OBJECT_0 && ret <= WAIT_OBJECT_0 + numOfCores - 1) { for (unsigned int i = 0; i < numOfCores; ++i) { DWORD exitCode = -1; if (0 == GetExitCodeThread(handles[i], &exitCode)) { cout << "Failed to get thread exit code: " << GetLastError() << endl; throw exception(); } if (1 == exitCode) { return false; } } return true; } else { cout << "ERROR waiting on threads: " << ret << endl; throw exception(); } } int main() { unsigned int primeCandidate = 1; unsigned int numOfCores = 1; cout << "Enter prime candidate: "; cin >> primeCandidate; cout << "Enter # of cores (0 means all): "; cin >> numOfCores; while (primeCandidate > 0) { if (0 == numOfCores) numOfCores = thread::hardware_concurrency(); DWORD start = GetTickCount(); bool res = isPrime(primeCandidate, numOfCores); DWORD end = GetTickCount(); cout << "Time: " << end-start << endl; cout << primeCandidate << " is " << (res ? "" : "not ") << "prime!" << endl; cout << "Enter prime candidate: "; cin >> primeCandidate; cout << "Enter # of cores (0 means all): "; cin >> numOfCores; } return 0; }