Matrix multiplication with threads: why is this not faster?

So, I have been playing with pthreads, in particular trying to compute the product of two matrices. My code is very messy because it was only meant to be a small side project for me, but the threading approach I used was very similar to this:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define M 3
    #define K 2
    #define N 3
    #define NUM_THREADS 10

    int A[M][K] = { {1,4}, {2,5}, {3,6} };
    int B[K][N] = { {8,7,6}, {5,4,3} };
    int C[M][N];

    struct v {
        int i; /* row */
        int j; /* column */
    };

    void *runner(void *param); /* the thread */

    int main(int argc, char *argv[])
    {
        int i, j, count = 0;

        for (i = 0; i < M; i++) {
            for (j = 0; j < N; j++) {
                // Assign a row and column for each thread
                struct v *data = (struct v *) malloc(sizeof(struct v));
                data->i = i;
                data->j = j;

                /* Now create the thread passing it data as a parameter */
                pthread_t tid;       // Thread ID
                pthread_attr_t attr; // Set of thread attributes

                // Get the default attributes
                pthread_attr_init(&attr);
                // Create the thread
                pthread_create(&tid, &attr, runner, data);
                // Make sure the parent waits for the thread to complete
                pthread_join(tid, NULL);
                count++;
            }
        }

        // Print out the resulting matrix
        for (i = 0; i < M; i++) {
            for (j = 0; j < N; j++) {
                printf("%d ", C[i][j]);
            }
            printf("\n");
        }
    }

    // The thread will begin control in this function
    void *runner(void *param)
    {
        struct v *data = param; // the structure that holds our data
        int n, sum = 0;         // the counter and sum

        // Row multiplied by column
        for (n = 0; n < K; n++) {
            sum += A[data->i][n] * B[n][data->j];
        }

        // Assign the sum to its coordinate
        C[data->i][data->j] = sum;

        // Exit the thread
        pthread_exit(0);
    }

source: http://macboypro.com/blog/2009/06/29/matrix-multiplication-in-c-using-pthreads-on-linux/

For the non-threaded version, I used the same setup (three 2-dimensional matrices, dynamically allocated structs for storing the row/column) and added a timer. The first trials showed that the non-threaded version was faster. My first thought was that the sizes were too small to show a difference and that it was taking longer just to create the threads. So I increased the size to 50x50, filled the matrices randomly, and ran it again, and I still do not see any performance improvement with the threaded version.
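
The timing was along these lines (a simplified sketch rather than my exact code; multiply() just stands in for whichever version is being measured):

    #include <stdio.h>
    #include <time.h>

    static void multiply(void)
    {
        /* plug in the serial or threaded multiplication here */
    }

    int main(void)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        multiply();
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ms = (end.tv_sec  - start.tv_sec)  * 1000.0
                  + (end.tv_nsec - start.tv_nsec) / 1e6;
        printf("took %.3f ms\n", ms);
        return 0;
    }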

What am I missing here?

+7
c multithreading
5 answers

If you are not working with very large matrices (many thousands of rows/columns), then you are unlikely to see much improvement from this approach. Setting up a thread on a modern CPU/OS is actually quite expensive in relative terms of CPU time, much more than a few multiply operations.

In addition, it is usually not worth creating more threads than you have processor cores. If you have, say, only two cores and you spawn 2500 threads (for 50x50 matrices), the OS will spend all its time managing and switching between those 2500 threads rather than doing your calculations.

If instead you created two threads up front (still assuming a dual-core CPU), kept them alive for the whole run waiting for work, and fed them the 2500 dot products you need to compute through some kind of synchronized work queue, then you might start to see an improvement. Even then, though, with two cores it can be at best only about twice as fast as just using one core.
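
A minimal sketch of that idea (illustrative, not drop-in code for the program above; the sizes and names are made up for the example): two long-lived worker threads pull cell indices from a shared counter protected by a mutex, which stands in for the synchronized work queue.

    #include <pthread.h>

    #define M 50
    #define K 50
    #define N 50
    #define WORKERS 2            /* match the number of cores */

    int A[M][K], B[K][N], C[M][N];

    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_cell = 0;    /* next element of C to compute, 0 .. M*N-1 */

    static void *worker(void *arg)
    {
        (void) arg;
        for (;;) {
            pthread_mutex_lock(&queue_lock);
            int cell = next_cell++;
            pthread_mutex_unlock(&queue_lock);
            if (cell >= M * N)
                break;                        /* no work left */

            int i = cell / N, j = cell % N, sum = 0;
            for (int n = 0; n < K; n++)
                sum += A[i][n] * B[n][j];
            C[i][j] = sum;
        }
        return NULL;
    }

    void multiply_with_pool(void)
    {
        pthread_t tid[WORKERS];
        for (int t = 0; t < WORKERS; t++)
            pthread_create(&tid[t], NULL, worker, NULL);
        for (int t = 0; t < WORKERS; t++)     /* join only after all are started */
            pthread_join(tid[t], NULL);
    }

Fill A and B, then call multiply_with_pool() once: the two workers chew through all M*N cells between them, and the threads are created once rather than once per cell.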

+11

I'm not quite sure I understand the source code, but here's what it looks like: you have a loop that runs M * N times. Each time through the loop, you create a thread that fills in one number in the result matrix. But immediately after starting the thread, you wait for it to complete. I don't think you are ever actually running more than one thread at a time.

Even if you were running more than one thread, each thread does only a trivial amount of work. Even if K were large (you mention 50), 50 multiplications is not much compared to the cost of starting the thread in the first place. The program should create fewer threads (certainly no more than the number of processors) and assign more work to each, for instance like this:
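
(A rough sketch that reuses the A, B, C, M, K and N definitions from the question.) Giving each thread a whole row of C instead of a single cell already shifts the work-per-thread balance a lot; capping the thread count at the number of cores, as the other answers describe, goes further.

    #include <pthread.h>
    #include <stdint.h>

    static void *row_runner(void *param)
    {
        int i = (int)(intptr_t) param;        /* row index, passed by value */
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int n = 0; n < K; n++)
                sum += A[i][n] * B[n][j];
            C[i][j] = sum;
        }
        return NULL;
    }

    void multiply_by_rows(void)
    {
        pthread_t tid[M];
        for (int i = 0; i < M; i++)
            pthread_create(&tid[i], NULL, row_runner, (void *)(intptr_t) i);
        for (int i = 0; i < M; i++)           /* join only after all are created */
            pthread_join(tid[i], NULL);
    }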

+7

You are not allowing much parallel execution: you wait for each thread immediately after creating it, so it is almost impossible for your program to use additional processors (i.e. it will never use a third processor/core). Try letting more threads run at once (probably up to the number of cores you have). For example:
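
The smallest change in that direction is to create all the threads first and join them afterwards. This is a sketch of how the loop in the question's main() could look (it still makes one thread per cell, which the other answers point out is a problem of its own):

    pthread_t tids[M][N];

    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            struct v *data = (struct v *) malloc(sizeof(struct v));
            data->i = i;
            data->j = j;
            // Create the thread, but do not wait for it yet
            pthread_create(&tids[i][j], NULL, runner, data);
        }
    }

    // Only now wait for all of them, so they can actually run concurrently
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            pthread_join(tids[i][j], NULL);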

+1

If you have a processor with two cores, you should just split the work to be done into two halves and give each thread one half. The same principle applies if you have 3, 4, or 5 cores. Optimal performance usually means matching the number of threads to the number of available cores (by available I mean cores that are not already heavily used by other processes).

The other thing you have to consider is that each thread's data should be contiguous and independent of the data used by the other threads. Otherwise, cache misses will slow down processing significantly.
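
A sketch of that split for a dual-core machine, reusing the question's A, B, C, M, K and N: each thread gets one contiguous half of the rows, so its reads of A and writes to C stay in its own contiguous region of memory (C is stored row by row).

    #include <pthread.h>

    struct half { int first_row; int last_row; };   /* rows [first_row, last_row) */

    static void *half_runner(void *param)
    {
        struct half *h = param;
        for (int i = h->first_row; i < h->last_row; i++)
            for (int j = 0; j < N; j++) {
                int sum = 0;
                for (int n = 0; n < K; n++)
                    sum += A[i][n] * B[n][j];
                C[i][j] = sum;
            }
        return NULL;
    }

    void multiply_in_halves(void)
    {
        pthread_t t0, t1;
        struct half lo = { 0, M / 2 }, hi = { M / 2, M };

        pthread_create(&t0, NULL, half_runner, &lo);
        pthread_create(&t1, NULL, half_runner, &hi);
        pthread_join(t0, NULL);                 /* lo and hi stay valid until here */
        pthread_join(t1, NULL);
    }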

To better understand these issues, I would recommend the book "Patterns for Parallel Programming": http://astore.amazon.com/amazon-books-20/detail/0321228111

Although its code examples lean more towards OpenMP and MPI while you are using pthreads, the first half of the book is very rich in fundamental concepts and in the inner workings of multithreaded environments, and it is very useful for avoiding most of the performance bottlenecks you will run into.

+1

If the code is parallelized correctly (I haven't checked it), a performance gain is only likely if the parallelism is realized in hardware, that is, if the threads really run in parallel (multi-core processors, multiprocessors, or other such technologies), and not just apparently in parallel (time-sliced, "multitasking" style). Just an idea, I'm not sure that is the case here.

0
