How to set the number of threads in C++

I wrote the following multithreaded sorting program using std::sort. In my program, grainsize is a parameter, because the grainsize, or the number of threads that may be spawned, is a system-dependent quantity. Therefore, I cannot figure out what the optimal value is that I should set the grainsize to. I am working on Linux.

 int compare(const char*, const char*) {
     // some complex user defined logic
 }

 void multThreadedSort(vector<unsigned>::iterator data, int len, int grainsize) {
     if (len < grainsize) {
         std::sort(data, data + len, compare);
     } else {
         auto future = std::async(multThreadedSort, data, len/2, grainsize);
         multThreadedSort(data + len/2, len/2, grainsize);
         // No need to spawn another thread just to block the calling thread,
         // which would do nothing.
         future.wait();
         std::inplace_merge(data, data + len/2, data + len, compare);
     }
 }

 int main(int argc, char** argv) {
     vector<unsigned> items;
     int grainSize = 10;
     multThreadedSort(items.begin(), items.size(), grainSize);
     std::sort(items.begin(), items.end(), CompareSorter(compare));
     return 0;
 }

I need to do multithreaded sorting so that, for sorting large vectors, I can use the several cores present in today's processors. If someone knows of an efficient algorithm, please share.

I don't know why the data returned by multThreadedSort() is not sorted. If you see some kind of logical error in it, please let me know.

+5
3 answers

This gives you the optimal number of threads (i.e., the number of cores):

 unsigned int nThreads = std::thread::hardware_concurrency(); 

As you wrote it, your effective number of threads is not equal to grainSize : it will depend on the list size and can potentially be much larger than grainSize.

Just replace the grainsize with:

 unsigned int grainSize = std::max<std::size_t>(items.size()/nThreads, 40); 

40 is arbitrary, but it is there to avoid spawning threads to sort only a few elements, which would be suboptimal (the start-up time of a thread is longer than sorting a few elements). It can be tuned by trial and error, and might well end up larger than 40.

You also have at least one bug:

 multThreadedSort(data + len/2, len/2, grainsize); 

If len is odd (e.g. 9), you do not include the last element in the sort. Replace it with:

 multThreadedSort(data + len/2, len-(len/2), grainsize); 
+8

If you are not using a compiler with a completely broken implementation (broken is the wrong word here; a better fit would be... poor), a few calls to std::async should already do the job for you, without you having to worry about it.

Note that std::async is something that conceptually runs tasks asynchronously, that is, it may spawn another thread for concurrent execution. May, not must, mind you. It is thus perfectly "legal" for an implementation to spawn one thread per future, and it is equally legal to never spawn any threads at all and simply run the task inside wait().
In practice, reasonable implementations avoid spawning a thread per call and instead use a thread pool whose number of workers is tuned to something sensible for the system the code runs on.

Note that trying to fine-tune the thread count with std::thread::hardware_concurrency() will not really help you, because the wording of that function is too loose to be useful. It is perfectly permissible for an implementation to return zero, or a more or less arbitrary "best guess", and there is no mechanism for you to detect whether the returned value is genuine or rubbish.
There is also no way of discriminating hyperthreaded cores, nor any such thing as NUMA awareness, or anything like that. Thus, even if you assume the number is correct, it is still not very meaningful.

On a more general note

The problem of "what is the correct number of threads" is hard to solve, if there is a good universal answer at all (I believe there is not). A few things to consider:

  • A grainsize of 10 is of course way too small. Spawning a thread is a very expensive thing (yes, contrary to popular belief, on Linux too), and switching or synchronizing threads is expensive as well. Try something on the order of ten thousand elements, not ten.
  • Hyperthreaded cores only execute when another core in the same group stalls, most often on memory I/O (or, when spinning, by explicitly executing an instruction such as REP NOP on Intel). If you do not have a significant number of memory stalls, extra threads running on hyperthreaded cores will only add context switches but will not run any faster. For something like sorting (which hammers memory!), you are probably fine, though.
  • Memory bandwidth is usually saturated by one, sometimes two cores, rarely more (it depends on the actual hardware). Throwing 8 or 12 threads at the problem will usually not increase memory bandwidth, but will increase the pressure on the shared cache levels (e.g. L3, if present, and often L2 too) and on the system's page manager. For the particular case of sorting (very incoherent access, lots of stalls), the opposite may be true. May, but need not.
  • Related to the above, for the general case, "number of physical cores" or "number of physical cores + 1" is often a much better recommendation.
  • Touching huge amounts of data with poor locality, as your approach does (single-threaded or multithreaded), causes a lot of cache/TLB misses and possibly even page faults. That can not only nullify any gains from thread parallelism, it can actually run 4-5 orders of magnitude slower. Just think of paging: during a single page fault, you could have sorted a million elements.
  • Contrary to the "physical cores plus 1" rule of thumb, for tasks bound by network or disk I/O, which can block for long periods, even "twice as many cores" can be a better match. So... there really is no single right rule.

What conclusion to draw from these somewhat contradictory points? After you have implemented it, measure whether it really runs faster, because this is by no means guaranteed. And unfortunately, there is no way of knowing with certainty what is best without having measured.

On another note, consider that sorting is by no means trivial to parallelize. You are already using std::inplace_merge, so you seem to be aware that it is not simply "split the subranges and sort them".

But consider what exactly your approach really does. You subdivide (recursively descending) down to a certain depth, then sort the subranges concurrently and merge them, which means overwriting. Then you merge (recursively ascending) ever larger ranges until the entire range is sorted. Classic fork-join.
That means you touch some part of memory to sort it (in a pattern that is not cache-friendly), then touch it again to merge it. Then you touch it again to merge the next larger range, and again, and again. With any "luck", different threads will access the same memory locations at different times, so you will also have false sharing.
Also, if your understanding of "big data" matches mine, this means overwriting every single memory location some 20 or 30 times, possibly more often. That is a lot of traffic.

That much memory being read and rewritten, over and over, with memory bandwidth as the main bottleneck. See where I'm going? Fork-join looks like a brilliant thing, and in academia it probably is... but it is far from certain that it runs any faster on a real machine (it might very well be many times slower).

+1

Ideally, you should not run more than n * 2 threads on your system, where n is the number of processor cores.

Modern CPUs use the concept of Hyperthreading, so a single core can run 2 threads at a time.

As mentioned in another answer, in C++11 you can get the optimal number of threads using std::thread::hardware_concurrency()

0
