Why does a multi-threaded for-loop not perform as well as a single-threaded one?

I thought that simple, heavy work (for example, matrix calculations) would be handled better multi-threaded than single-threaded, so I tested the following code:

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
        constexpr int N = 100000;
        std::random_device rd;
        std::mt19937 mt(rd());
        std::uniform_real_distribution<double> ini(0.0, 10.0);

        // single-thread
        {
            std::vector<int> vec(N);
            for (int i = 0; i < N; ++i) {
                vec[i] = ini(mt);
            }

            auto start = std::chrono::system_clock::now();
            for (int i = 0; i < N; ++i) {
                vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
            }
            auto end = std::chrono::system_clock::now();

            auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            std::cout << "single : " << dur << " ms." << std::endl;
        }

        // multi-threading (Th is the number of threads)
        for (int Th : {1, 2, 4, 8, 16}) {
            std::vector<int> vec(N);
            for (int i = 0; i < N; ++i) {
                vec[i] = ini(mt);
            }

            auto start = std::chrono::system_clock::now();
            std::vector<std::future<void>> fut(Th);
            for (int t = 0; t < Th; ++t) {
                fut[t] = std::async(std::launch::async, [t, &vec, &N, &Th] {
                    for (int i = t * N / Th; i < (t + 1) * N / Th; ++i) {
                        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
                    }
                });
            }
            for (int t = 0; t < Th; ++t) {
                fut[t].get();
            }
            auto end = std::chrono::system_clock::now();

            auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            std::cout << "Th = " << Th << " : " << dur << " ms." << std::endl;
        }

        return 0;
    }

Test environment:

    OS : Windows 10, 64-bit
    Build system : Visual Studio Community 2015
    CPU : Core i5 4210U

When I built this program in Debug mode, the result was as I expected:

    single  : 146 ms.
    Th = 1  : 140 ms.
    Th = 2  : 71 ms.
    Th = 4  : 64 ms.
    Th = 8  : 61 ms.
    Th = 16 : 68 ms.

This suggests that the std::async version with one thread (Th = 1) performs about the same as the single-threaded code that does not use std::async, and that with 4 or 8 threads I can get excellent performance.

However, in Release mode I got a different result (with N increased from 100000 to 100000000):

    single  : 54 ms.
    Th = 1  : 443 ms.
    Th = 2  : 285 ms.
    Th = 4  : 205 ms.
    Th = 8  : 206 ms.
    Th = 16 : 221 ms.

This result puzzles me. Looking only at the std::async versions, more threads do give better performance. But the fastest of all is the single-threaded code that does not use std::async at all. I know that optimization and the overhead of multithreading have a big impact on performance. Nonetheless:

  • The work is just a simple vector calculation, so what can the compiler optimize in the single-threaded code that it cannot optimize in the multi-threaded code?
  • This program does not contain any mutexes or atomics, and no data races can occur, so I would expect the overhead of multithreading to be relatively small.
  • CPU usage in the code that does not use std::async is lower than in the multi-threaded code. Isn't it more efficient to use most of the processor?

Update: I investigated the vectorization issue. I turned on the /Qvec-report:1 option and found the following:

    // vectorized (when N is large)
    for (int i = 0; i < N; ++i) {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }

    // not vectorized
    auto lambda = [&vec, &N] {
        for (int i = 0; i < N; ++i) {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
    };
    lambda();

    // not vectorized
    std::vector<std::future<void>> fut(Th);
    for (int t = 0; t < Th; ++t) {
        fut[t] = std::async(std::launch::async, [t, &vec, &N, Th] {
            for (int i = t * N / Th; i < (t + 1) * N / Th; ++i) {
                vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
            }
        });
    }

and the elapsed times:

    single (with vectorization)    : 47 ms.
    single (without vectorization) : 70 ms.

So the for-loop is indeed not vectorized in the multi-threaded version. However, the missing vectorization alone does not explain the gap, so that version must be slow for other reasons as well.


Update 2: I rewrote the for-loop inside the lambda (replacing type A with type B):

    // Type A (the previous one)
    fut[t] = std::async(std::launch::async, [t, &vec, &N, Th] {
        for (int i = t * N / Th; i < (t + 1) * N / Th; ++i) {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
    });

    // Type B (the new one)
    fut[t] = std::async(std::launch::async, [t, &vec, &N, Th] {
        int nb = t * N / Th;
        int ne = (t + 1) * N / Th;
        for (int i = nb; i < ne; ++i) {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
    });

Type B worked well. Result:

    single (vectorized)   : 44 ms.
    single (unvectorized) : 77 ms.
    --
    Th = 1 (Type A) : 435 ms.
    Th = 2 (Type A) : 278 ms.
    Th = 4 (Type A) : 219 ms.
    Th = 8 (Type A) : 212 ms.
    --
    Th = 1 (Type B) : 112 ms.
    Th = 2 (Type B) : 74 ms.
    Th = 4 (Type B) : 60 ms.
    Th = 8 (Type B) : 61 ms.

The result for type B is understandable (the multi-threaded code is faster than the unvectorized single-threaded code, though not as fast as the vectorized one). On the other hand, type A looks equivalent to type B (it differs only in the use of temporary variables), yet they show very different performance. The two types must generate different assembly code.


Update 3: I found the factor that slows down the multi-threaded for-loop: it is the division in the loop condition. These are single-threaded tests:

    // ver 1 (ordinary)
    fut[t] = std::async(std::launch::async, [&vec, &N] {
        for (int i = 0; i < N; ++i) {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
    });

    // ver 2 (introducing a futile variable Q)
    int Q = 1;
    fut[t] = std::async(std::launch::async, [&vec, &N, Q] {
        for (int i = 0; i < N / Q; ++i) {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
    });

    // ver 3 (using a temporary variable)
    int Q = 1;
    fut[t] = std::async(std::launch::async, [&vec, &N, Q] {
        int end = N / Q;
        for (int i = 0; i < end; ++i) {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
    });

    // ver 4 (using a raw value)
    fut[t] = std::async(std::launch::async, [&vec] {
        for (int i = 0; i < 100000000; ++i) {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
    });

And the elapsed times:

    ver 1 : 132 ms.
    ver 2 : 391 ms.
    ver 3 : 47 ms.
    ver 4 : 43 ms.

Versions 3 and 4 were well optimized, and version 1 much less so; I think the compiler could not treat N as immutable inside the lambda because it is captured by reference, even though N is constexpr. I think version 2 was very slow for the same reason: the compiler does not know that N and Q will not change during the loop, so the condition i < N / Q has to be re-evaluated, division included, on every iteration, which slows the for-loop down considerably.
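If this explanation is correct, capturing N and Th by value instead of by reference should give the compiler the same information as the temporary variables in Type B. I have not benchmarked this variant separately, so this is only a sketch of what I mean:

    // Sketch: capturing the values (not references) makes the bounds
    // loop-invariant from the optimizer's point of view, so the division
    // could be hoisted out of the loop condition.
    fut[t] = std::async(std::launch::async, [t, &vec, N, Th] {
        for (int i = t * N / Th; i < (t + 1) * N / Th; ++i) {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
    });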

1 answer

When you run single-threaded, your only thread has vec in the caches, since you have just filled it from mt, and it streams nicely through the caches because it is the only user of all cache levels.
I don't think vectorization buys you much here, or you would see shorter times. Maybe I'm wrong, because memory bandwidth matters a lot in this loop. Have you looked at the asm?

  • Any other thread will have to fetch the data from RAM. That in itself is not a big problem in your case, since it is a single-socket CPU, so L3 is shared, and the data set is larger than L3 anyway.
    BUT several threads fighting over L3 is bad. I think this is the main factor here.

  • You are starting too many threads. You should run as many threads as you have cores, so that you pay less for context switching and cache pollution (a small sketch of picking the thread count follows this list).
    HT (hyper-threading) is useful when the two hardware threads of a core leave enough "holes" in the pipelines (not the case here), in branch prediction (not the case here), and in cache usage (very much the case here → see point 1).
    I am really surprised that more than 2 threads did not degrade things much more; current CPUs are awesome!

  • Starting a thread takes time, and that time is less predictable than you might expect. If you want more predictability, keep the threads running permanently and use some cheap signalling to hand them work and to let them tell you they are done (see the second sketch after this list).
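Two small sketches of what I mean. First, picking the thread count from the hardware instead of hard-coding it; hardware_concurrency() may return 0, so keep a fallback:

    #include <iostream>
    #include <thread>

    int main() {
        // hardware_concurrency() may return 0 if the value is not computable.
        unsigned hw = std::thread::hardware_concurrency();
        unsigned num_threads = (hw != 0) ? hw : 2;  // fall back to a guess
        std::cout << "launching " << num_threads << " worker threads\n";
    }

Second, a rough sketch of the "keep the threads alive" idea, using a condition variable as the cheap signalling. The Worker class and its interface are only illustrative, not a drop-in replacement for the code in the question:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <thread>

    // One long-lived worker thread that waits for work and signals completion,
    // so repeated batches do not pay the thread-creation cost each time.
    class Worker {
    public:
        Worker() : th_([this] { loop(); }) {}

        ~Worker() {
            {
                std::lock_guard<std::mutex> lk(m_);
                quit_ = true;
            }
            cv_.notify_all();
            th_.join();
        }

        // Hand a job to the worker; call wait() before calling run() again.
        void run(std::function<void()> job) {
            {
                std::lock_guard<std::mutex> lk(m_);
                job_ = std::move(job);
                done_ = false;
                go_ = true;
            }
            cv_.notify_all();
        }

        // Block until the current job has finished.
        void wait() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return done_; });
        }

    private:
        void loop() {
            std::unique_lock<std::mutex> lk(m_);
            for (;;) {
                cv_.wait(lk, [this] { return go_ || quit_; });
                if (quit_) return;
                go_ = false;
                lk.unlock();
                job_();            // run the work without holding the lock
                lk.lock();
                done_ = true;
                cv_.notify_all();
            }
        }

        std::mutex m_;
        std::condition_variable cv_;
        std::function<void()> job_;
        bool go_ = false;
        bool done_ = true;
        bool quit_ = false;
        std::thread th_;          // declared last so it starts after the rest
    };

Usage would be along the lines of: create one Worker per core once, then for each batch call run() with a slice of the loop on each worker and wait() on all of them, instead of creating fresh threads with std::async every time.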

EDIT: answers to specific questions

The work is just a simple vector calculation, so what can the compiler optimize in the single-threaded code that it cannot optimize in the multi-threaded code?

There is not much code to optimize here... You could split the long loop to enable loop unrolling:

    const int C = 16;  // try other values of C?
    for (int i = nb; i < ne; i += C) {
        for (int j = 0; j < C; j++)
            vec[i + j] = ...;   // the whole expression is equivalent to vec[i + j] <<= 2;
    }
    // still need to handle the remainder when (ne - nb) is not a multiple of C...

You can vectorize manually if the compiler did not. First look at the assembly.
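Since 2 * v + 3 * v - v is just v * 4, a hand-vectorized SSE2 version of the loop could look roughly like this (a sketch only: the function name is made up, it assumes 32-bit ints, and the tail is handled with a scalar loop):

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <vector>

    // Multiply every element by 4 (i.e. shift left by 2), four ints at a time.
    void times_four_sse2(std::vector<int>& vec) {
        int* p = vec.data();
        const int n = static_cast<int>(vec.size());
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
            v = _mm_slli_epi32(v, 2);                                  // v << 2
            _mm_storeu_si128(reinterpret_cast<__m128i*>(p + i), v);
        }
        for (; i < n; ++i) {     // scalar tail
            p[i] = 2 * p[i] + 3 * p[i] - p[i];
        }
    }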

This program does not contain any mutexes or atomics, and no data races can occur, so I would expect the overhead of multithreading to be relatively small.

True. Except that getting the threads started takes its own time, especially on Windows, and especially when there are many of them.
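If you want to see how much of your wall time goes into just starting and joining the threads, time a batch of empty std::async tasks; a rough sketch (the numbers jitter a lot between runs):

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <vector>

    int main() {
        constexpr int tasks = 8;
        auto start = std::chrono::steady_clock::now();

        // The tasks do nothing, so this measures almost pure creation,
        // scheduling and join overhead.
        std::vector<std::future<void>> fut(tasks);
        for (int t = 0; t < tasks; ++t) {
            fut[t] = std::async(std::launch::async, [] {});
        }
        for (int t = 0; t < tasks; ++t) {
            fut[t].get();
        }

        auto end = std::chrono::steady_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "launch + join of " << tasks << " empty tasks: " << us << " us\n";
    }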

CPU usage in the code that does not use std::async is lower than in the multi-threaded code. Isn't it more efficient to use most of the processor?

You generally want to use a higher CPU percentage for a shorter time. I'm not sure what you are seeing there, since there is no I/O here.
