When performing a calculation, how many threads should I open?

I am writing a program that does some lengthy computation, which I can break down into as many tasks as I want. For the sake of discussion, suppose I write an algorithm that determines whether a number p is prime by trying to divide it by every number between 2 and p-1. Obviously, this task can be divided across many threads.

In fact, I wrote a sample application that does just that. As parameters, I specify the number I want to check and the number of threads to use (each thread is assigned a range of equal size, and together the ranges cover all the numbers to try dividing p by).
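The equal-range split described here can be sketched as a standalone helper (the `rangeFor` name and the pure-integer arithmetic are my own, not taken from the program below):

```cpp
#include <utility>

// Split the divisor range [2, p) into `numThreads` near-equal
// sub-ranges; thread i tests divisors in [first, second).
std::pair<unsigned int, unsigned int>
rangeFor(unsigned int p, unsigned int numThreads, unsigned int i) {
    unsigned long long span = static_cast<unsigned long long>(p) - 2;
    unsigned int start = static_cast<unsigned int>(2 + span * i / numThreads);
    unsigned int end   = static_cast<unsigned int>(2 + span * (i + 1) / numThreads);
    return {start, end};
}
```

Using 64-bit intermediates avoids the rounding of the double-based arithmetic in the program below while producing contiguous ranges.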

My machine has 8 cores. I started running the program with a number I know to be prime (2971215073), using 1 thread, 2 threads, 3 threads, and so on. Up to 8 threads, each run was faster than the previous one, which is what I expected. However, when I tried thread counts greater than 8, the computation time actually kept decreasing (if only a little)!

There is no I/O or anything similar in my threads, just pure CPU calculation. I expected the runtime to get worse once I went past 8 threads, since there would be more context switching while the number of threads actually running in parallel would stay at 8. It is hard to say where the peak is, since the differences are very small and vary from run to run; however, it is clear that, for example, 50 threads somehow run faster than 8 (by ~300 ms)...

I assume that since I have so many threads, I get more CPU time, because I make up a larger share of the system's pool of runnable threads, so my threads get scheduled more often. However, it seems to make no sense that the more threads I create, the faster the program runs (otherwise, why not create 1000 threads?).

Can anyone suggest an explanation, and perhaps a good rule of thumb for how many threads to create relative to the number of cores on the machine?

Thanks.


My code for those interested (compiled on Windows, VS2012):

#include <Windows.h>
#include <conio.h>
#include <iostream>
#include <thread>
#include <vector>
using namespace std;

typedef struct
{
    unsigned int primeCandidate;
    unsigned int rangeStart;
    unsigned int rangeEnd;
} param_t;

DWORD WINAPI isDivisible(LPVOID p)
{
    param_t* param = reinterpret_cast<param_t*>(p);
    for (unsigned int d = param->rangeStart; d < param->rangeEnd; ++d)
    {
        if (param->primeCandidate % d == 0)
        {
            cout << param->primeCandidate << " is divisible by " << d << endl;
            return 1;
        }
    }
    return 0;
}

bool isPrime(unsigned int primeCandidate, unsigned int numOfCores)
{
    vector<HANDLE> handles(numOfCores);
    vector<param_t> params(numOfCores);
    for (unsigned int i = 0; i < numOfCores; ++i)
    {
        params[i].primeCandidate = primeCandidate;
        params[i].rangeStart = (primeCandidate - 2) * (static_cast<double>(i) / numOfCores) + 2;
        params[i].rangeEnd = (primeCandidate - 2) * (static_cast<double>(i+1) / numOfCores) + 2;
        HANDLE h = CreateThread(nullptr, 0, reinterpret_cast<LPTHREAD_START_ROUTINE>(isDivisible), &params[i], 0, 0);
        if (NULL == h)
        {
            cout << "ERROR creating thread: " << GetLastError() << endl;
            throw exception();
        }
        handles[i] = h;
    }

    DWORD ret = WaitForMultipleObjects(numOfCores, &handles[0], TRUE, INFINITE);
    if (ret >= WAIT_OBJECT_0 && ret <= WAIT_OBJECT_0 + numOfCores - 1)
    {
        for (unsigned int i = 0; i < numOfCores; ++i)
        {
            DWORD exitCode = -1;
            if (0 == GetExitCodeThread(handles[i], &exitCode))
            {
                cout << "Failed to get thread exit code: " << GetLastError() << endl;
                throw exception();
            }
            if (1 == exitCode)
            {
                return false;
            }
        }
        return true;
    }
    else
    {
        cout << "ERROR waiting on threads: " << ret << endl;
        throw exception();
    }
}

int main()
{
    unsigned int primeCandidate = 1;
    unsigned int numOfCores = 1;

    cout << "Enter prime candidate: ";
    cin >> primeCandidate;
    cout << "Enter # of cores (0 means all): ";
    cin >> numOfCores;
    while (primeCandidate > 0)
    {
        if (0 == numOfCores) numOfCores = thread::hardware_concurrency();
        DWORD start = GetTickCount();
        bool res = isPrime(primeCandidate, numOfCores);
        DWORD end = GetTickCount();
        cout << "Time: " << end-start << endl;
        cout << primeCandidate << " is " << (res ? "" : "not ") << "prime!" << endl;
        cout << "Enter prime candidate: ";
        cin >> primeCandidate;
        cout << "Enter # of cores (0 means all): ";
        cin >> numOfCores;
    }
    return 0;
}
2 answers

Yes. Here is a small excerpt from some tests I did on my i7 / Vista 64 box (4 "real" cores + hyperthreading):

 8 tests, 400 tasks, counting to 10000000, using 8 threads:
 Ticks: 2199
 Ticks: 2184
 Ticks: 2215
 Ticks: 2153
 Ticks: 2200
 Ticks: 2215
 Ticks: 2200
 Ticks: 2230
 Average: 2199 ms

 8 tests, 400 tasks, counting to 10000000, using 32 threads:
 Ticks: 2137
 Ticks: 2121
 Ticks: 2153
 Ticks: 2138
 Ticks: 2137
 Ticks: 2121
 Ticks: 2153
 Ticks: 2137
 Average: 2137 ms

.. which shows that, as in your tests, "over-subscribing" threads yields a slight 2-3% improvement in total execution time. My tests ran simple, CPU-heavy "count up to an integer" tasks on a thread pool with varying numbers of threads.
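For reference, that kind of benchmark can be sketched portably (pure register counting, no I/O; the function names and the way tasks are spread over threads are my assumptions, using std::thread rather than the Win32 API):

```cpp
#include <chrono>
#include <thread>
#include <vector>

// One benchmark task: pure register counting, no memory traffic or I/O.
unsigned long long countTo(unsigned long long limit) {
    volatile unsigned long long n = 0;  // volatile keeps the loop alive under -O2
    while (n < limit) n = n + 1;
    return n;
}

// Wall-clock time (ms) to run `tasks` copies of the task spread
// evenly over `numThreads` threads.
double benchmark(unsigned int numThreads, unsigned int tasks,
                 unsigned long long limit) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned int t = 0; t < numThreads; ++t)
        pool.emplace_back([=] {
            // thread t takes tasks t, t+numThreads, t+2*numThreads, ...
            for (unsigned int j = t; j < tasks; j += numThreads)
                countTo(limit);
        });
    for (auto& th : pool) th.join();
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}
```

Comparing, say, `benchmark(8, 400, 10000000)` against `benchmark(32, 400, 10000000)` reproduces the kind of measurement shown above.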

My conclusion at the time was that the slight improvement came from the larger number of threads taking a larger share of the "base load" on my box: the 1-4% load from the 1000-odd, nearly-always-idle threads in Firefox, uTorrent, Word, the taskbar, etc., which happened to run a little during the tests.

It seems that in my test the overhead of "context switching", e.g. from using 64 threads instead of 8, is negligible and can be ignored.

This only applies when the data used by the tasks is very small. I later repeated a similar batch of tests where the tasks used an 8K array (around the size of the L1 cache). In this "worst case" scenario, using more threads than cores caused a very noticeable slowdown: at 16 threads and above, performance dropped by 40% as the threads swapped the whole cache in and out. Above about 20 threads the slowdown got no worse, because no matter how many threads ran the tasks, the cache was still evicted from every core at the same rate.
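A rough sketch of what such a task looks like (8 KB is 2048 unsigned ints; the `touchWorkingSet` helper is hypothetical, not the code actually used in those tests):

```cpp
#include <array>

// Each task repeatedly walks an 8 KB working set, roughly L1-sized.
// With more runnable threads than cores, these per-thread working
// sets evict one another from L1 on every context switch.
unsigned int touchWorkingSet(std::array<unsigned int, 2048>& data, int passes) {
    unsigned int sum = 0;
    for (int p = 0; p < passes; ++p)
        for (unsigned int v : data)  // sequential walk over the 8 KB block
            sum += v;
    return sum;
}
```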

Note also that I had plenty of RAM and so very few page faults.


You are making the assumption that each thread has an equal amount of work to perform, which may not actually be the case. What you should look at is the exit time of each of your threads. If one or more of them exit significantly earlier than the rest, it makes sense that adding more threads would speed things up. That is, if one stops early, a core is left unused; having more threads breaks up the load more fairly.
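One way to check for that imbalance is to record each worker's elapsed time. This is a hypothetical, portable sketch with std::thread and std::chrono (not the questioner's Win32 code), using a stand-in counting workload:

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Run one worker per entry and record how long each takes (ms); a
// worker that finishes far sooner than the rest signals uneven load.
std::vector<double> timeWorkers(const std::vector<unsigned long long>& workAmounts) {
    std::vector<double> ms(workAmounts.size());
    std::vector<std::thread> threads;
    for (size_t i = 0; i < workAmounts.size(); ++i)
        threads.emplace_back([&, i] {
            auto start = std::chrono::steady_clock::now();
            volatile unsigned long long n = 0;  // stand-in for the real workload
            while (n < workAmounts[i]) n = n + 1;
            ms[i] = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - start).count();
        });
    for (auto& t : threads) t.join();
    return ms;
}
```

If the returned times differ wildly, the fixed range split is leaving cores idle before the slowest thread finishes.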

There are several reasons why each thread may have a different execution time. I do not know the underlying instruction timings of your code, but they may be variable. It is also likely that each thread sees a different set of CPU optimizations, such as branch prediction. A thread may simply lose its time slice to the OS, or be momentarily stalled on its tiny amount of memory. Suffice it to say there are many factors that can make one slower than another.

What the optimal count is, is hard to say. In general you want to keep the CPUs loaded, so you are usually right about N threads for N cores. However, be aware of things like hyperthreading, where you do not actually have extra cores; if there is a lot of memory use, hyperthreading may get in the way. On AMD's newer chips there are half as many FPUs, so your integer instructions are fine, but floating point may stall.

If you want to keep every CPU loaded, the only way to really do it is with a job-based framework. Break your calculation into smaller units (as you have done), but still have only one thread per core. As a thread finishes its current job, it takes the next available job. This way it does not matter whether some jobs are longer or shorter; the freed-up CPUs simply move on to the next job.
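A minimal sketch of that job-based approach applied to the trial-division problem (the chunk size, function names, and use of std::atomic/std::thread are my assumptions, not the answerer's code):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Test divisors of `p` in fixed-size chunks handed out from a shared
// atomic counter, so a thread that finishes a cheap chunk immediately
// takes the next one instead of idling.
bool isPrimeJobQueue(unsigned int p, unsigned int numThreads,
                     unsigned int chunk = 4096) {
    if (p < 2) return false;
    std::atomic<unsigned long long> nextDivisor{2};
    std::atomic<bool> composite{false};
    std::vector<std::thread> workers;
    for (unsigned int t = 0; t < numThreads; ++t)
        workers.emplace_back([&] {
            unsigned long long d;
            // Grab the next chunk of divisors until the range is exhausted
            // or another thread has already found a factor.
            while (!composite.load() &&
                   (d = nextDivisor.fetch_add(chunk)) < p) {
                unsigned long long end = d + chunk < p ? d + chunk : p;
                for (; d < end; ++d)
                    if (p % d == 0) { composite.store(true); break; }
            }
        });
    for (auto& w : workers) w.join();
    return !composite.load();
}
```

The shared counter also gives early termination for free: once any thread finds a divisor, the others stop taking new chunks.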

This, of course, only makes sense if the calculation is long. If the total time is only a few seconds, the overhead may cause a slight slowdown. But even starting at around 4-5 seconds you should begin to see gains. Also, make sure you turn off CPU frequency scaling when doing short timing tests; otherwise the ramp-up/ramp-down times on each CPU will give you essentially random results.

