
I am new to thread programming and I have a conceptual problem. I am doing matrix multiplication as a project for my class. I do it three ways: first without using threads, then using threads to compute the scalar product for each cell of the result matrix, and then by splitting the first matrix into row ranges so that each thread has an equal portion to compute. My problem is that the scalar-product implementation finishes very quickly, which I expected, but the third implementation is not much faster than the non-threaded implementation. For example, with 2 threads I would expect it to finish in roughly half the time, since it can work on both halves of the matrix simultaneously, but that is not the case at all. I feel there is a problem in the third implementation; I don't think it is actually running in parallel. The code is below. Can anyone set me straight on this? Not all of the code is relevant to the question, but I included it in case the problem is not local. Thanks,

The main program:

    #include <iostream>
    #include <cstdio>
    #include <cstdlib>
    #include <cmath>
    #include <fstream>
    #include <string>
    #include <sstream>
    #include <matrix.h>
    #include <timer.h>
    #include <random_generator2.h>

    const float averager = 2.0; // used to find the average of the time taken to multiply the matrices

    // Precondition: The matrix has been manipulated in some way and is ready to output the statistics.
    // Outputs the size of the matrix along with the user elapsed time.
    // Postcondition: The stats are written to the specified file, tagged with the number of threads used.
    // File name example: "Nonparallel2.dat"
    void output(string file, int numThreads, long double time, int n);

    // argv[1] = the size of the matrix
    // argv[2] = the number of threads to be used
    // argv[3] =
    int main(int argc, char* argv[])
    {
        random_generator rg;
        timer t, nonparallel, scalar, variant;
        int n, total = 0, numThreads = 0;
        long double totalNonP = 0, totalScalar = 0, totalVar = 0;
        n = 100;

        /*
         * check arguments
         */
        n = atoi(argv[1]);
        n = (n < 1) ? 1 : n;
        numThreads = atoi(argv[2]);

        /*
         * allocated and generate random strings
         */
        int** C;
        int** A;
        int** B;

        cout << "**NOW STARTING ANALYSIS FOR " << n << " X " << n
             << " MATRICES WITH " << numThreads << "!**" << endl;

        for (int timesThrough = 0; timesThrough < averager; timesThrough++)
        {
            cout << "Creating the matrices." << endl;
            t.start();
            C = create_matrix(n);
            A = create_random_matrix(n, rg);
            B = create_random_matrix(n, rg);
            t.stop();
            cout << "Timer (generate): " << t << endl;
            //------------------------------- Ends non parallel -------------------------------

            /*
             * run algorithms
             */
            cout << "Running non-parallel matrix multiplication: " << endl;
            nonparallel.start();
            multiply(C, A, B, n);
            nonparallel.stop();
            //------------------------------- Ends non parallel -------------------------------
            //cout << "The correct matrix" << endl;
            //output_matrix(C, n);
            cout << "Timer (multiplication): " << nonparallel << endl;
            totalNonP += nonparallel.user();

            // D is the transpose of B so that the p_scalarproduct function does not have to be rewritten
            int** D = create_matrix(n);
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    D[i][j] = B[j][i];

            //------------------------------- Start Threaded Scalar Product -------------------------------
            cout << "Running scalar product in parallel" << endl;
            scalar.start();
            // Does the scalar product in parallel to multiply the two matrices.
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    C[i][j] = 0;
                    C[i][j] = p_scalarproduct(A[i], D[j], n, numThreads);
                } // ends the for loop with j
            scalar.stop();
            cout << "Timer (scalar product in parallel): " << scalar << endl;
            totalScalar += scalar.user();
            //------------------------------- Ends Threaded Scalar Product -------------------------------

            //------------------------------- Starts Threaded Variant For Loop -------------------------------
            cout << "Running the variation on the for loop." << endl;
            boost::thread** thrds;
            // create threads and bind to p_variantforloop_t
            thrds = new boost::thread*[numThreads];
            variant.start();
            for (int i = 1; i <= numThreads; i++)
                thrds[i-1] = new boost::thread(boost::bind(&p_variantforloop_t,
                                 C, A, B,
                                 ((i)*n - n)/numThreads, (i * n)/numThreads,
                                 numThreads, n));
            cout << "before join" << endl;
            // join threads
            for (int i = 0; i < numThreads; i++)
                thrds[i]->join();
            variant.stop();
            // cleanup
            for (int i = 0; i < numThreads; i++)
                delete thrds[i];
            delete[] thrds;
            cout << "Timer (variation of for loop): " << variant << endl;
            totalVar += variant.user();
            //------------------------------- Ends Threaded Variant For Loop -------------------------------

            // output_matrix(A, n);
            // output_matrix(B, n);
            // output_matrix(E, n);

            /*
             * free allocated storage
             */
            cout << "Deleting Storage" << endl;
            delete_matrix(A, n);
            delete_matrix(B, n);
            delete_matrix(C, n);
            delete_matrix(D, n);
            // avoids dangling pointers
            A = NULL;
            B = NULL;
            C = NULL;
            D = NULL;
        } // ends the timesThrough for loop

        // output the results to .dat files
        output("Nonparallel", numThreads, (totalNonP / averager), n);
        output("Scalar", numThreads, (totalScalar / averager), n);
        output("Variant", numThreads, (totalVar / averager), n);

        cout << "Nonparallel = " << (totalNonP / averager) << endl;
        cout << "Scalar = " << (totalScalar / averager) << endl;
        cout << "Variant = " << (totalVar / averager) << endl;

        return 0;
    }

    void output(string file, int numThreads, long double time, int n)
    {
        ofstream dataFile;
        stringstream ss;
        ss << numThreads;
        file += ss.str();
        file += ".dat";
        dataFile.open(file.c_str(), ios::app);
        if (dataFile.fail())
        {
            cout << "The output file didn't open." << endl;
            exit(1);
        } // ends the if statement
        dataFile << n << " " << time << endl;
        dataFile.close();
    } // ends output function
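
For what it's worth, the row split handed to each thread appears to partition the rows evenly. Here is a quick standalone sketch (not part of the original project; the values n = 100 and numThreads = 2 are just for illustration) that prints the ranges produced by the formula used when launching the variant-loop threads:

    #include <iostream>

    // Standalone check (not from the original project) of the row-range formula
    // used when creating the variant-loop threads in main() above.
    int main()
    {
        int n = 100, numThreads = 2;              // example values for illustration
        for (int i = 1; i <= numThreads; i++)
        {
            int s = (i * n - n) / numThreads;     // first row handled by thread i
            int e = (i * n) / numThreads;         // one past the last row
            std::cout << "thread " << i << ": rows [" << s << ", " << e << ")\n";
        }
        return 0;
    }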

Matrix file:

    #include <matrix.h>
    #include <stdlib.h>

    using namespace std;

    int** create_matrix(int n)
    {
        int** matrix;
        if (n < 1)
            return 0;
        matrix = new int*[n];
        for (int i = 0; i < n; i++)
            matrix[i] = new int[n];
        return matrix;
    }

    int** create_random_matrix(int n, random_generator& rg)
    {
        int** matrix;
        if (n < 1)
            return 0;
        matrix = new int*[n];
        for (int i = 0; i < n; i++)
        {
            matrix[i] = new int[n];
            for (int j = 0; j < n; j++)
                //rg >> matrix[i][j];
                matrix[i][j] = rand() % 100;
        }
        return matrix;
    }

    void delete_matrix(int** matrix, int n)
    {
        for (int i = 0; i < n; i++)
            delete[] matrix[i];
        delete[] matrix;
        // avoids dangling pointers
        matrix = NULL;
    }

    /*
     * non-parallel matrix multiplication
     */
    void multiply(int** C, int** A, int** B, int n)
    {
        if ((C == A) || (C == B))
        {
            cout << "ERROR: C equals A or B!" << endl;
            return;
        }
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
            {
                C[i][j] = 0;
                for (int k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }

    void p_scalarproduct_t(int* c, int* a, int* b, int s, int e, boost::mutex* lock)
    {
        int tmp;
        tmp = 0;
        for (int k = s; k < e; k++) {
            tmp += a[k] * b[k];
            //cout << "a[k]= " << a[k] << "b[k]= " << b[k] << " " << k << endl;
        }
        lock->lock();
        *c = *c + tmp;
        lock->unlock();
    }

    int p_scalarproduct(int* a, int* b, int n, int m)
    {
        int c;
        boost::mutex lock;
        boost::thread** thrds;
        c = 0;

        /* create threads and bind to p_scalarproduct_t */
        thrds = new boost::thread*[m];
        for (int i = 0; i < m; i++)
            thrds[i] = new boost::thread(boost::bind(&p_scalarproduct_t,
                            &c, a, b, i*n/m, (i+1)*n/m, &lock));
        /* join threads */
        for (int i = 0; i < m; i++)
            thrds[i]->join();
        /* cleanup */
        for (int i = 0; i < m; i++)
            delete thrds[i];
        delete[] thrds;
        return c;
    }

    void output_matrix(int** matrix, int n)
    {
        cout << "[";
        for (int i = 0; i < n; i++)
        {
            cout << "[ ";
            for (int j = 0; j < n; j++)
                cout << matrix[i][j] << " ";
            cout << "]" << endl;
        }
        cout << "]" << endl;
    }

    void p_variantforloop_t(int** C, int** A, int** B, int s, int e, int numThreads, int n)
    {
        //cout << "s= " << s << endl << "e= " << e << endl;
        for (int i = s; i < e; i++)
            for (int j = 0; j < n; j++) {
                C[i][j] = 0;
                //cout << "i " << i << " j " << j << endl;
                for (int k = 0; k < n; k++) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
    } // ends the function
2 answers

I suspect you are running into false sharing. Try using a local variable in p_variantforloop_t :

    void p_variantforloop_t(int** C, int** A, int** B, int s, int e, int numThreads, int n)
    {
        for (int i = s; i < e; i++)
            for (int j = 0; j < n; j++) {
                int accu = 0;
                for (int k = 0; k < n; k++)
                    accu += A[i][k] * B[k][j];
                C[i][j] = accu;
            }
    }
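
As a side note, here is a minimal sketch (my own, assuming Boost.Thread; not taken from the answer) of the access pattern the fix avoids: threads repeatedly writing to adjacent elements of a shared array versus accumulating in a private local first. An optimizing compiler may hoist the shared write and hide the difference, so treat it as an illustration of the idea rather than a benchmark.

    #include <boost/thread.hpp>

    // Two adjacent ints almost certainly share a cache line.
    int results[2];

    // Every iteration writes to shared memory, so the cache line can ping-pong
    // between the cores running the two workers (false sharing).
    void worker_shared(int idx, int iters)
    {
        for (int k = 0; k < iters; k++)
            results[idx] += k;
    }

    // Accumulate into a local variable (likely a register) and write the shared
    // array only once at the end, as in the p_variantforloop_t fix above.
    void worker_local(int idx, int iters)
    {
        int local = 0;
        for (int k = 0; k < iters; k++)
            local += k;
        results[idx] = local;
    }

    int main()
    {
        boost::thread t0(worker_shared, 0, 100000000);
        boost::thread t1(worker_shared, 1, 100000000);
        t0.join();
        t1.join();
        return 0;
    }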

Based on your responses in the comments: theoretically, since you only have a single core (i.e., one CPU), all of the threaded versions should take as long as the single-threaded version, or longer, because of the thread-management overhead. You should not see any speedup at all, since the time spent solving one part of the matrix is a time slice stolen from the other parallel task. With a single CPU you are only time-sharing the CPU's resources; no work is truly happening in parallel within any given slice of time. I would suspect your second implementation is faster because you are doing less pointer dereferencing and fewer memory accesses in the inner loop. For example, in the main operation C[i][j] += A[i][k] * B[k][j]; from multiply and p_variantforloop_t , you are looking at a lot of operations at the assembly level, many of which are memory-related. In "assembly pseudo-code" it looks something like this:

1) Move the pointer value from the address pointed to by A on the stack into register R1
2) Increment the address in register R1 by the value off the stack referenced by the variable i , j , or k
3) Move the pointer address value from the address pointed to by R1 into R1
4) Increment the address in R1 by the value off the stack referenced by the variable i , j , or k
5) Move the value from the address pointed to by R1 into R1 (so R1 now holds the value of A[i][k] )
6) Do steps 1-5 for the address pointed to by B on the stack, into register R2 (so R2 now holds the value of B[k][j] )
7) Do steps 1-4 for the address pointed to by C on the stack, into register R3
8) Move the value from the address pointed to by R3 into R4 (i.e., R4 holds the current value at C[i][j] )
9) Multiply registers R1 and R2 and store in register R5
10) Add registers R4 and R5 and store in R4
11) Move the final value from R4 back to the memory address pointed to by R3 (now C[i][j] holds the final result)

And that is assuming we have five general-purpose registers to play with and that the compiler properly optimized your C code to use them. I left the loop index variables i , j , and k on the stack, so accessing them takes even more time than if they were in registers... it really depends on how many registers your compiler has to work with on your platform. In addition, if you compiled without any optimizations, you could be doing many more memory accesses to the stack, where some of these temporary values are stored on the stack rather than in registers and then pulled back off the stack again, which takes far longer than moving values between registers. Either way, the code above is much harder to optimize. It works, but if you are on a 32-bit x86 platform you do not have that many general-purpose registers to play with (you should have at least 6, though). x86_64 has more registers to work with, but there are still all the memory accesses to contend with.

On the other hand, an operation like tmp += a[k] * b[k] from p_scalarproduct_t , sitting in a tight inner loop, will move MUCH faster... here is the above operation in assembly pseudo-code:

There would be a small initialization step at the start of the loop:

1) Make tmp register R1 rather than a stack variable, and initialize it to 0
2) Move the address value referred to by a on the stack into R2
3) Add the value of s from the stack to R2 and store the resulting address in R2
4) Move the address value referred to by b on the stack into R3
5) Add the value of s from the stack to R3 and store the resulting address in R3
6) Set up a counter in R6 , initialized to e - s

After that one-time initialization, the actual inner loop begins:

7) Move the value from the address pointed to by R2 into R4
8) Move the value from the address pointed to by R3 into R5
9) Multiply R4 and R5 and store the result in R5
10) Add R5 to R1 and store the result in R1
11) Increment R2 and R3
12) Decrement the counter in R6 until it reaches zero, at which point we end the loop

I can't guarantee that this is exactly how your compiler would lay out this loop, but you can see that, in general, your scalar example has fewer steps in the inner loop and, more importantly, far fewer memory accesses. Therefore more of the work is done with operations that only touch registers, rather than operations that involve memory locations and require a memory fetch, which is much slower than a register-only operation. So in general it is going to run much faster, and that has nothing to do with threads.
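
To make that concrete, here is a C++ sketch (mine, not from the original code) of an inner-product loop written so that it maps almost one-to-one onto the register-level steps above: two walking pointers, a counter, and a running sum, all of which can stay in registers.

    // Dot-product inner loop mirroring the register pseudo-code above.
    int scalarproduct_sketch(const int* a, const int* b, int s, int e)
    {
        const int* pa = a + s;                        // steps 2-3: base address plus offset s
        const int* pb = b + s;                        // steps 4-5: same for the second array
        int tmp = 0;                                  // step 1: accumulator, ideally a register
        for (int count = e - s; count > 0; --count)   // steps 6 and 12: loop counter
            tmp += *pa++ * *pb++;                     // steps 7-11: load, multiply, accumulate, advance
        return tmp;
    }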

Finally, I notice there are only two nested loops around the scalar product, so its complexity is O(N^2), whereas your other two methods use three nested loops, for O(N^3) complexity. That will make a difference as well.

