My application scenario is as follows: I want to evaluate the performance gain that can be achieved on a quad-core machine when processing the same amount of data. I have the following two configurations:
i) 1-Process: a program without any threads that processes data from 1M .. 1G, while the system is assumed to run only one of its 4 cores.
ii) 4-threads-Process: a program with 4 threads (all threads perform the same operation), where each thread processes 25% of the input.
To create the 4 threads, I used the default pthread options (i.e. without any specific pthread_attr_t). I believe the performance gain of the 4-thread configuration compared to the 1-Process configuration should be close to 400% (or somewhere between 350% and 400%).
I have profiled the time spent creating threads, as shown below:
    timer_start(&threadCreationTimer);
    pthread_create(&thread0, NULL, fun0, NULL);
    pthread_create(&thread1, NULL, fun1, NULL);
    pthread_create(&thread2, NULL, fun2, NULL);
    pthread_create(&thread3, NULL, fun3, NULL);
    threadCreationTime = timer_stop(&threadCreationTimer);

    /* wait for all workers to finish */
    pthread_join(thread0, NULL);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    pthread_join(thread3, NULL);
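For completeness, timer_start() and timer_stop() are my own small helpers; the sketch below shows roughly how the elapsed seconds are measured, assuming a clock_gettime(CLOCK_MONOTONIC) based implementation (the real timer type and code in my program may differ):

    #include <time.h>

    /* Simplified versions of my timer helpers, shown only for illustration. */
    static void timer_start(struct timespec *t)
    {
        clock_gettime(CLOCK_MONOTONIC, t);
    }

    /* Returns the seconds elapsed since the matching timer_start(). */
    static double timer_stop(const struct timespec *t)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - t->tv_sec) + (now.tv_nsec - t->tv_nsec) / 1e9;
    }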
Since increasing the size of the input data also increases the memory consumption of each thread, loading all the data in advance is definitely not a workable option. Therefore, in order not to increase the memory requirement of each thread, each thread reads the data in small chunks, processes a chunk, reads the next chunk, and so on. The structure of the functions executed by my threads is therefore as follows:
    timer_start(&threadTimer[i]);
    while (!dataFinished[i]) {
        threadTime[i] += timer_stop(&threadTimer[i]);
        data_source();
        timer_start(&threadTimer[i]);
        process();
    }
    threadTime[i] += timer_stop(&threadTimer[i]);
The variable dataFinished[i] is set to true by process() once it has received and processed all the required data. process() knows when to do this :-)
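Putting the pieces together, each of fun0 .. fun3 has roughly the following shape (a simplified sketch: data_source() and process() are my own routines, and in the real code each function works on its own 25% of the input with its own index):

    /* Sketch of one worker, e.g. fun0 with i == 0. */
    void *fun0(void *arg)
    {
        const int i = 0;

        timer_start(&threadTimer[i]);
        while (!dataFinished[i]) {
            /* Time spent fetching the next chunk is excluded from threadTime. */
            threadTime[i] += timer_stop(&threadTimer[i]);
            data_source();                 /* read the next small chunk        */
            timer_start(&threadTimer[i]);
            process();                     /* sets dataFinished[i] when done   */
        }
        threadTime[i] += timer_stop(&threadTimer[i]);
        return NULL;
    }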
In the main function, I calculate the time spent on a 4-threaded configuration, as shown below:
    execTime4Thread = max(threadTime[0], threadTime[1], threadTime[2], threadTime[3]) + threadCreationTime
And the performance gain is calculated simply as:

    gain = execTime1process / execTime4Thread * 100
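In code, the calculation in main() looks roughly like this (a rough sketch: max4 is just a helper I wrote here for illustration, and all time variables are assumed to be plain doubles holding seconds):

    /* Illustrative helper: maximum of the four per-thread times. */
    static double max4(double a, double b, double c, double d)
    {
        double m = a;
        if (b > m) m = b;
        if (c > m) m = c;
        if (d > m) m = d;
        return m;
    }

    /* In main(), after joining the four threads: */
    execTime4Thread = max4(threadTime[0], threadTime[1],
                           threadTime[2], threadTime[3]) + threadCreationTime;
    gain = execTime1process / execTime4Thread * 100.0;    /* in percent */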
Problem: With small data sizes from 1M to 4M, the performance gain is usually good (350% to 400%). However, the gain decreases sharply as the input size grows. It keeps shrinking up to a data size of about 50M, and then stabilizes at around 200%. Once it has reached that point, it remains almost constant even for 1G of data.
My question is: can anyone suggest a reasonable explanation for this behavior (i.e., the initial drop in performance gain and its later stabilization)? And, if possible, how can it be fixed?
For your information, I also examined the behavior of threadCreationTime and threadTime for each thread to see what is going on. For 1M of data the values of these variables are small, but as the data size increases both of them grow rapidly (even though threadCreationTime should remain almost the same regardless of the data size, and threadTime should only grow at a rate corresponding to the amount of data being processed). After the data size reaches about 50M, threadCreationTime becomes stable, just as the drop in performance gain levels off, while threadTime keeps increasing at a constant rate corresponding to the growth of the data to be processed (which is understandable).
Do you think increasing the stack size of each thread, the process priority, or custom values of other scheduler parameters (using pthread_attr_init and the related pthread_attr_set* functions) might help?
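For example, the kind of attribute setup I have in mind would look roughly like this (the values are only placeholders, and I have not yet measured whether any of this actually helps):

    #include <pthread.h>

    /* In main(), before creating the worker threads: */
    pthread_attr_t attr;
    pthread_attr_init(&attr);

    /* Larger per-thread stack (placeholder value). */
    pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);

    /* Explicit scheduling policy and priority instead of inheriting them
       (placeholder policy/priority; SCHED_FIFO needs root privileges). */
    struct sched_param sp = { .sched_priority = 10 };
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);

    pthread_create(&thread0, &attr, fun0, NULL);
    /* ... same for thread1..thread3 ... */
    pthread_attr_destroy(&attr);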
PS: The results are obtained when running the programs in Linux fail-safe mode as root (i.e., a minimal OS running without GUI and networking stuff).