I thought that simple but heavy work (for example, matrix calculations) would run faster multi-threaded than single-threaded, so I tested the following code:
```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <random>
#include <vector>

int main() {
    constexpr int N = 100000;
    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<double> ini(0.0, 10.0);

    // single-threaded
    {
        std::vector<int> vec(N);
        for (int i = 0; i < N; ++i) {
            vec[i] = ini(mt); // the double is truncated to int here
        }
        auto start = std::chrono::system_clock::now();
        for (int i = 0; i < N; ++i) {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
        auto end = std::chrono::system_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "single : " << dur << " ms." << std::endl;
    }

    // multi-threaded (Th is the number of threads)
    for (int Th : {1, 2, 4, 8, 16}) {
        std::vector<int> vec(N);
        for (int i = 0; i < N; ++i) {
            vec[i] = ini(mt);
        }
        auto start = std::chrono::system_clock::now();
        std::vector<std::future<void>> fut(Th);
        for (int t = 0; t < Th; ++t) {
            // each task handles the slice [t*N/Th, (t+1)*N/Th)
            fut[t] = std::async(std::launch::async, [t, &vec, &N, &Th] {
                for (int i = t * N / Th; i < (t + 1) * N / Th; ++i) {
                    vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
                }
            });
        }
        for (int t = 0; t < Th; ++t) {
            fut[t].get();
        }
        auto end = std::chrono::system_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "Th = " << Th << " : " << dur << " ms." << std::endl;
    }
    return 0;
}
```
Environment:

- OS: Windows 10 64-bit
- Build system: Visual Studio Community 2015
- CPU: Core i5-4210U
When building this program in Debug mode, the result was as I expected:
```
single : 146 ms.
Th = 1 : 140 ms.
Th = 2 : 71 ms.
Th = 4 : 64 ms.
Th = 8 : 61 ms.
Th = 16 : 68 ms.
```
This suggests that the code without std::async performs the same as the std::async version with one thread, and that with 4 or 8 threads I get excellent performance.
However, in Release mode I got a different result (with N increased from 100000 to 100000000):
```
single : 54 ms.
Th = 1 : 443 ms.
Th = 2 : 285 ms.
Th = 4 : 205 ms.
Th = 8 : 206 ms.
Th = 16 : 221 ms.
```
This result puzzles me. Looking only at the second half of the code, more threads do give better performance. But the fastest of all is the first half, which does not use std::async at all. I know that optimization and the overhead of multithreading have a big impact on performance. Nonetheless:
- The work is just a vector calculation, so what can be optimized in the single-threaded code but not in the multi-threaded code?
- This program contains no mutexes, atomics, or the like, and no data races can occur, so I would expect the multithreading overhead to be relatively small (a sketch that measures this follows the list).
- CPU usage in the code that does not use std::async is lower than in the multi-threaded code. Shouldn't using most of the processor be more efficient?
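Since the second point is only my guess, a minimal sketch like the one below could put a number on the bare std::async launch-and-join cost by running empty tasks (the thread count of 8 is an arbitrary choice for illustration):

```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <vector>

int main() {
    constexpr int Th = 8; // arbitrary thread count, for illustration only
    // steady_clock is preferable to system_clock for measuring intervals
    auto start = std::chrono::steady_clock::now();
    std::vector<std::future<void>> fut(Th);
    for (int t = 0; t < Th; ++t) {
        fut[t] = std::async(std::launch::async, [] { /* empty task */ });
    }
    for (int t = 0; t < Th; ++t) {
        fut[t].get(); // wait for every task to finish
    }
    auto end = std::chrono::steady_clock::now();
    std::cout << "launch + join of " << Th << " empty tasks: "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " us" << std::endl;
    return 0;
}
```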
Update: I investigated vectorization. I turned on the /Qvec-report:1 option and found the following:
```cpp
// vectorized (when N is large)
for (int i = 0; i < N; ++i) {
    vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}

// not vectorized
auto lambda = [&vec, &N] {
    for (int i = 0; i < N; ++i) {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
};
lambda();

// not vectorized
std::vector<std::future<void>> fut(Th);
for (int t = 0; t < Th; ++t) {
    fut[t] = std::async(std::launch::async, [t, &vec, &N, Th] {
        for (int i = t * N / Th; i < (t + 1) * N / Th; ++i) {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
    });
}
```
and the elapsed times:

```
single (with vectorization)    : 47 ms.
single (without vectorization) : 70 ms.
```
So it is certain that the for-loop is not vectorized in the multi-threaded version, but that version loses time for other reasons as well.
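One way to double-check the vectorization gap without changing build settings would be MSVC's loop pragma, which suppresses auto-vectorization for the single loop that follows it; a small sketch:

```cpp
// same computation, but with auto-vectorization suppressed for this loop
// (#pragma loop(no_vector) is an MSVC-specific hint)
#pragma loop(no_vector)
for (int i = 0; i < N; ++i) {
    vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
```

Timing this against the plain loop should reproduce the vectorized/unvectorized difference within a single build.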
Update 2: I rewrote the for-loop in the lambda, replacing type A with type B:
```cpp
// Type A (the previous one)
fut[t] = std::async(std::launch::async, [t, &vec, &N, Th] {
    for (int i = t * N / Th; i < (t + 1) * N / Th; ++i) {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

// Type B (the new one, with the loop bounds hoisted into locals)
fut[t] = std::async(std::launch::async, [t, &vec, &N, Th] {
    int nb = t * N / Th;
    int ne = (t + 1) * N / Th;
    for (int i = nb; i < ne; ++i) {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});
```
Type B worked well. Result:
```
single (vectorized)   : 44 ms.
single (unvectorized) : 77 ms.
```
The result for type B is understandable: the multi-threaded code runs faster than the unvectorized single-threaded code, though not as fast as the vectorized one. On the other hand, type A looks equivalent to type B (it merely skips the temporary variables), yet they perform differently. The two versions must generate different assembly code.
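One way to check that guess is to ask the compiler for an assembly listing. With MSVC, building with the /FAs flag writes a .asm file with the source interleaved, so the loop bodies of the two types can be compared directly (main.cpp is a placeholder name):

```
cl /O2 /FAs main.cpp
```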
Update 3: I found a factor that slows down the multi-threaded for-loop: the division in the loop condition. These are single-threaded tests:
```cpp
// ver 1 (ordinary)
fut[t] = std::async(std::launch::async, [&vec, &N] {
    for (int i = 0; i < N; ++i) {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

// ver 2 (introducing a pointless variable Q)
int Q = 1;
fut[t] = std::async(std::launch::async, [&vec, &N, Q] {
    for (int i = 0; i < N / Q; ++i) {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

// ver 3 (hoisting N / Q into a temporary variable)
int Q = 1;
fut[t] = std::async(std::launch::async, [&vec, &N, Q] {
    int end = N / Q;
    for (int i = 0; i < end; ++i) {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

// ver 4 (using a literal value)
fut[t] = std::async(std::launch::async, [&vec] {
    for (int i = 0; i < 100000000; ++i) {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});
```
And the elapsed times:

```
ver 1 : 132 ms.
ver 2 : 391 ms.
ver 3 : 47 ms.
ver 4 : 43 ms.
```
ver 3 and ver 4 were optimized well, ver 1 not so much. I think the compiler could not treat N as immutable, even though N is constexpr: because it is captured by reference, every write to vec[i] (also an int) might, as far as the optimizer can prove, alias N, so N has to be reloaded each iteration. I think ver 2 was very slow for the same reason: the compiler could not see that N and Q never change, so the condition i < N / Q requires heavy assembly code on every iteration, which slows the loop down considerably.
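If that reading is right, ver 2 behaves roughly as if it were written like this (my sketch of the effective loop, not actual compiler output):

```cpp
// what ver 2 plausibly degenerates into when the optimizer cannot
// prove N and Q constant: the bound is recomputed on every pass
for (int i = 0; ; ++i) {
    int bound = N / Q; // reload N, then an integer division, each iteration
    if (!(i < bound)) break;
    vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
```

Hoisting the division out, as ver 3 does, removes both the reload and the division from the loop body, which matches the measured times.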