According to your answers in the comments, theoretically, since you only have one thread (i.e. CPU), all the threaded versions should be at the same time as the single threaded version or longer due to the overhead of thread management You should not see any acceleration at all, since the time spent on solving one part of the matrix is ββa fragment of time that was stolen from another parallel task. Thanks to a single processor, you only share the resources of the processor - at present, there is no parallel operation for one piece of time. I would suspect then why your second implementation is faster because you are doing less dereferencing of pointers and memory access in your inner loop. For example, in the main operation C[i][j] += A[i][k] * B[k][j]; from multiply and p_variantforloop_t you look at a lot of assembly-level operations, many of which are memory related. In the "assembly pseudo-code" it looks something like this:
1) Move the pointer value from the address indicated by A on the stack to register R1
2) Increase the address in register R1 by a value that is not related to the stack referenced by the variable i , j or k
3) Move the pointer address value from the address pointed to by R1 to R1
4) Increase the address in R1 by the value from the stack referenced by the variable i , j or k
5) Move the value from the address pointed to by R1 to R1 (therefore, R1 now contains the value A[i][k] )
6) Take steps 1-5 for the address referred to by B on the stack, in the register R2 (therefore, R2 now has the value B[k][j] )
7) Take steps 1-4 for the address referred to by C in the stack, in register R3
8) Move the value from the address pointed to by R3 to R4 (ie R4 has the actual value in C[i][j] )
9) Multiply the registers R1 and R2 and save in the register R5
10) Add registers R4 and R5 and save in R4
11) Return the final value from R4 back to the memory address pointed to by R3 (now C[i][j] has the final result)
And if we assume that we have 5 general-purpose registers, and the compiler correctly optimized your C-code to use them. I left the loop index variables i , j and k on the stack, so accessing them will take even longer than if they were in registers ... it really depends on how many registers your compiler has on your platform . In addition, if you compiled without any optimizations, you can do much more memory access from the stack, where some of these temp values ββare stored on the stack, not in registers, and then popped off the stack again, which takes a lot more time than moving values ββbetween registers. In any case, the code above is much more difficult to optimize. It works, but if you are on an x86 32-bit platform, then you will not have as many general-purpose registries to play (you must have at least 6). x86_64 has more registers for playback, but still there are all memory accesses that can be dealt with.
On the other hand, an operation like tmp += a[k] * b[k] from p_scalarproduct_t in a narrow inner loop will move MUCH faster ... here is the above operation in the assembly pseudo-code:
There would be a small initialization step for the loop
1) Make tmp register R1 instead of the stack variable and initialize it with the value 0 2) Move the value of the address that A refers to on the stack to R2
3) Add the s value from the stack to R2 and save the resulting address in R2
4) Move the value of the address that B refers to on the stack to R3
5) Add the s value from the stack to R3 and store the resulting address in R3
6) Set the counter to R6 , initialized to e - s
After a single initialization, we will begin the actual inner loop
7) Move the value from the address that R2 points to R4
8) Move the value from the address pointed to by R3 to R5
9) Multiply R4 and R5 and save the results in R5
10) Add R5 to R1 and save the results to R1
11) Increment of R2 and R3
12) Decompression counter in R6 until it reaches zero, where we end the cycle
I canβt guarantee that this is how your compiler will install this loop, but you can see that overall with your scalar example there are fewer steps in the inner loop and, more importantly, less memory access. Therefore, more can be done with operations that use registers rather than operations that include memory cells and require memory fetching, which is much slower than register operations. So in general, it will move much faster, and this has nothing to do with threads.
Finally, I noticed that there are only two nested loops for a scalar product, so O (N ^ 2) complexity, where - like for your two other methods, you have three nested loops for O (N ^ 3) complexity. This will also matter.