Unless you are working with very large matrices (many thousands of rows / columns), you are unlikely to see a significant improvement from this approach. Setting up a thread on a modern processor / OS is actually quite expensive in terms of processor time: it costs far more than a few multiplication operations.
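To get a rough feel for that cost, here is a minimal sketch that times thread start-up against a single 50-element dot product. The function names `time_thread_start` and `time_dot_product` are made up for this illustration, and the numbers will vary by machine, OS and interpreter, so treat the result as an indication only.

```python
import threading
import time


def time_thread_start(repeats=200):
    """Average seconds to create, start and join a thread that does nothing."""
    start = time.perf_counter()
    for _ in range(repeats):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return (time.perf_counter() - start) / repeats


def time_dot_product(size=50, repeats=200):
    """Average seconds to compute one dot product of two `size`-element vectors."""
    x = [1.0] * size
    y = [2.0] * size
    start = time.perf_counter()
    for _ in range(repeats):
        sum(a * b for a, b in zip(x, y))
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    print("thread start/join:", time_thread_start())
    print("one dot product:  ", time_dot_product())
```

Comparing the two printed values on your own machine shows how many dot products you would need to do per thread before creating that thread could pay off.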
In addition, it is usually not worth creating more than one thread per processor core that you have. If you have, say, only two cores and you create 2500 threads (one per element of a 50x50 result matrix), then the OS will spend most of its time managing and switching between those 2500 threads rather than doing your calculations.
If you instead set up two threads in advance (still assuming a dual-core processor), keep those threads alive the whole time waiting for work, and feed them the 2500 dot products you need to calculate through some sort of synchronized work queue, then you may start to see an improvement. However, it will still not be more than 50% better than using just one core.
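A minimal sketch of that arrangement, assuming plain Python lists as matrices: a fixed set of long-lived worker threads pulls one dot product at a time from a synchronized queue. The class name `DotProductPool` and its methods are hypothetical, invented for this example. Note also that in CPython the global interpreter lock prevents pure-Python arithmetic from running on two cores at once, so this shows the structure of the approach, not the speedup.

```python
import queue
import threading


class DotProductPool:
    """Fixed pool of worker threads fed through a synchronized work queue."""

    def __init__(self, num_workers=2):
        # Assumption: two workers, matching the dual-core example.
        self._tasks = queue.Queue()
        self._threads = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]
        for t in self._threads:
            t.start()

    def _run(self):
        # Each worker stays alive and waits for work until it sees None.
        while True:
            task = self._tasks.get()
            if task is None:
                self._tasks.task_done()
                break
            a, b, result, i, j = task
            # One unit of work: row i of a times column j of b.
            result[i][j] = sum(a[i][k] * b[k][j] for k in range(len(b)))
            self._tasks.task_done()

    def multiply(self, a, b):
        """Return a * b, queueing one dot product per result element."""
        n, p = len(a), len(b[0])
        result = [[0] * p for _ in range(n)]
        for i in range(n):
            for j in range(p):
                self._tasks.put((a, b, result, i, j))
        self._tasks.join()  # block until every queued dot product is done
        return result

    def close(self):
        """Send one stop marker per worker and wait for them to exit."""
        for _ in self._threads:
            self._tasks.put(None)
        for t in self._threads:
            t.join()


if __name__ == "__main__":
    pool = DotProductPool(num_workers=2)
    print(pool.multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    pool.close()
```

The pool is created once and reused for every multiplication, so the thread start-up cost is paid only at the beginning rather than once per dot product.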
Greg Hewgill