Need thoughts on profiling multithreading in C on Linux

My application scenario is as follows: I want to evaluate the performance gain achievable on a quad-core machine when processing the same amount of data. I have the following two configurations:

i) 1-process: a program without any threads that processes data from 1M to 1G; the system is assumed to use only one of its 4 cores.

ii) 4-thread process: a program with 4 threads (all performing the same operation), each processing 25% of the input.

To create the 4 threads I used the default pthread options (i.e., no specific pthread_attr_t). I believe the performance gain of the 4-thread configuration over the 1-process configuration should be close to 400% (or somewhere between 350% and 400%).

I have profiled the time spent creating threads, as shown below:

    timer_start(&threadCreationTimer);
    pthread_create(&thread0, NULL, fun0, NULL);
    pthread_create(&thread1, NULL, fun1, NULL);
    pthread_create(&thread2, NULL, fun2, NULL);
    pthread_create(&thread3, NULL, fun3, NULL);
    threadCreationTime = timer_stop(&threadCreationTimer);

    /* pthread_join takes a pthread_t by value, not a pointer */
    pthread_join(thread0, NULL);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    pthread_join(thread3, NULL);
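
For reference, timer_start / timer_stop are simple wall-clock helpers along these lines (a sketch of one possible implementation using clock_gettime(CLOCK_MONOTONIC); the timer type and the seconds-based return value are assumptions, the actual code may differ):

    #include <time.h>

    typedef struct { struct timespec t0; } my_timer_t;  /* assumed timer type */

    static void timer_start(my_timer_t *t)
    {
        /* CLOCK_MONOTONIC is unaffected by wall-clock adjustments */
        clock_gettime(CLOCK_MONOTONIC, &t->t0);
    }

    /* returns the elapsed seconds since the matching timer_start() */
    static double timer_stop(const my_timer_t *t)
    {
        struct timespec t1;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t->t0.tv_sec) + (t1.tv_nsec - t->t0.tv_nsec) / 1e9;
    }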

Since increasing the input size also increases the memory consumption of each thread, loading all the data in advance is definitely not a workable option. Therefore, to keep the memory requirement of each thread low, each thread reads the data in small fragments, processes them, reads the next fragment, and so on. So the structure of the functions executed by my threads is as follows:

    timer_start(&threadTimer[i]);
    while (!dataFinished[i]) {
        threadTime[i] += timer_stop(&threadTimer[i]);  /* pause the timer while fetching data */
        data_source();                                 /* read the next fragment */
        timer_start(&threadTimer[i]);                  /* resume timing for processing */
        process();
    }
    threadTime[i] += timer_stop(&threadTimer[i]);

The variable dataFinished[i] is set to true by process() once it has received and processed all the required data; process() knows when that happens :-)

In the main function, I calculate the time spent on a 4-threaded configuration, as shown below:

    execTime4Thread = max(threadTime[0], threadTime[1], threadTime[2], threadTime[3]) + threadCreationTime

And the performance gain is calculated simply as:

    gain = (execTime1Process / execTime4Thread) * 100
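
(For illustration with made-up numbers: execTime1Process = 100 s and execTime4Thread = 28 s gives gain = 100 / 28 * 100 ≈ 357%, i.e., within the expected 350% to 400% range.)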

Question: with small data sizes (1M to 4M), the performance gain is usually good (350% to 400%). However, the gain decreases exponentially as the input size grows: it keeps shrinking until the data size reaches about 50M or so, then levels off at about 200%. Once it reaches that point, it stays almost constant even for 1GB of data.

My question is: can anyone suggest a fundamental reason for this behavior (i.e., the initial drop in performance followed by the later plateau)?

And how can I fix it?

For your information, I also examined the behavior of threadCreationTime and threadTime for each thread to see what happens. For 1M of data the values of these variables are small, but as the data size increases both variables grow exponentially (even though threadCreationTime should stay roughly constant regardless of data size, and threadTime should grow at a rate corresponding to the amount of data processed). After about 50M, threadCreationTime becomes stable, and threadTime (like the performance drop) stabilizes as well, continuing to grow at a constant rate corresponding to the increase in processed data (which is understandable).

Do you think increasing the stack size of each thread, raising the process priority, or setting custom scheduler parameters (using pthread_attr_init) might help?
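
For reference, this is the kind of attribute setup I mean (a sketch using the standard pthread attribute API; the 8 MB stack size and the helper name are arbitrary example choices):

    #include <pthread.h>

    /* create one worker thread with an explicit 8 MB stack instead of the default */
    static int create_with_attrs(pthread_t *t, void *(*fn)(void *))
    {
        pthread_attr_t attr;
        int rc;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);
        rc = pthread_create(t, &attr, fn, NULL);
        pthread_attr_destroy(&attr);
        return rc;
    }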

PS: The results were obtained by running both programs in Linux failsafe mode as root (i.e., a minimal OS running without GUI and networking services).

+7
2 answers

Since increasing the input size also increases the memory requirement of each thread, loading all the data in advance is definitely not a workable option. Therefore, to keep the memory requirement of each thread low, each thread reads data in small fragments, processes them, reads the next fragment, and so on.

This alone can lead to a sharp decrease in speed.

If there is enough memory, reading one large chunk of input will always be faster than reading data in small pieces, especially when each thread does it separately. Any I/O benefits (read-ahead and caching effects) disappear when you break the reads into pieces. Even allocating one large block of memory once is much cheaper than allocating small blocks many times.
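
To make the contrast concrete, a minimal sketch (assuming the input is an ordinary file of known size; CHUNK and the function names are made up for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK (64 * 1024)   /* made-up fragment size */

    /* chunked version: many small reads into one reused buffer */
    static void process_chunked(FILE *f)
    {
        char buf[CHUNK];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
            /* process(buf, n); */
        }
    }

    /* bulk version: one allocation, one read, then process entirely in memory */
    static char *read_all(FILE *f, size_t size)
    {
        char *buf = malloc(size);          /* single large allocation */
        if (buf && fread(buf, 1, size, f) != size) {
            free(buf);                     /* short read or error */
            buf = NULL;
        }
        return buf;
    }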

As a sanity check, you can run htop to make sure that all of your cores are in fact maxed out during the run. If not, your bottleneck may be outside your multi-threaded code.

Other possible causes:

  • thread context switching can cause a suboptimal speedup when there are many threads (one mitigation is sketched after this list)
  • as mentioned by others, a cold cache caused by the large amount of memory being read can cause slowdowns
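
On the context-switch point: for compute-bound threads, one common mitigation (not part of the original answer) is pinning each thread to its own core. A sketch using the glibc-specific pthread_setaffinity_np (requires _GNU_SOURCE; pin_to_core is a made-up helper name):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* pin an already-created thread to one core (0..3 on a quad-core) */
    static int pin_to_core(pthread_t t, int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(t, sizeof(cpu_set_t), &set);
    }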

But re-reading your OP, I suspect the slowdown has something to do with your data input / memory allocation. Where exactly are you reading your data from? Some kind of socket? Are you sure you need to allocate memory more than once in your threads?

It is also likely that some algorithm in your worker threads is suboptimal or expensive.

+2

Do your threads start running as soon as they are created? If so, the following happens:

while your parent thread is still creating the remaining threads, the threads already created start working. By the time you call timer_stop(&threadCreationTimer), all four have already been running for some time, so threadCreationTime overlaps with threadTime[i].

As it stands, you do not know exactly what you are measuring. Fixing this will not solve your problem by itself (you clearly have one, since threadTime does not grow linearly), but at least you will not be adding up overlapping times.
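
One way to keep the two measurements separate (a sketch, not the asker's actual code): use a barrier so every worker waits until the creation timer has been stopped before doing any work.

    #include <pthread.h>

    static pthread_barrier_t start_barrier;

    static void *fun0(void *arg)
    {
        /* block until main has stopped the creation timer */
        pthread_barrier_wait(&start_barrier);
        /* ... start threadTimer[0] and run the work loop here ... */
        return arg;
    }

    /*
     * In main(), before the pthread_create() calls:
     *     pthread_barrier_init(&start_barrier, NULL, 5);   // 4 workers + main
     * After the four pthread_create() calls:
     *     threadCreationTime = timer_stop(&threadCreationTimer);
     *     pthread_barrier_wait(&start_barrier);            // release the workers
     */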

For more insight, you can use the perf tool, if it is available on your distribution, e.g.:

 perf stat -e cache-misses <your_prog> 

and see what happens with the two-thread version, the three-thread version, etc.
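
If cache misses alone do not explain it, other standard perf software events worth comparing across thread counts include context switches and CPU migrations:

    perf stat -e cache-misses,context-switches,cpu-migrations <your_prog>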

0
