Vendor-provided LAPACK / BLAS libraries (Intel IPP / MKL, but also AMD ACML, and those of other processor manufacturers such as IBM for Power or Oracle for SPARC) are also often optimized for specific processor capabilities, which significantly increases performance on large data sets.
Often, however, you have very specific small data to work with (say, 4x4 matrices or 4D dot products, i.e. the operations used for 3D geometry processing), and for such things BLAS / LAPACK are overkill because of the initial checks these routines perform to select code paths depending on the properties of the data set. In such situations, simple C / C++ source code, possibly using SSE2...4 intrinsics and/or compiler-generated vectorization, can outperform BLAS / LAPACK.
That is why, for example, Intel has two libraries: MKL for large linear algebra data sets, and IPP for small (graphics vector) data sets.
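To make the small-data case concrete, here is a rough sketch of a 4-float dot product written directly with SSE intrinsics; the function name dot4 is illustrative only, and it assumes both pointers are valid for four floats:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Sketch: dot product of two 4-element float vectors.
       Unaligned loads are used, so src1/src2 need no particular alignment. */
    static float dot4(const float *src1, const float *src2)
    {
        __m128 a = _mm_loadu_ps(src1);
        __m128 b = _mm_loadu_ps(src2);
        __m128 m = _mm_mul_ps(a, b);    /* element-wise products */
        /* horizontal sum of the four products */
        __m128 shuf = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1));
        __m128 sums = _mm_add_ps(m, shuf);
        shuf = _mm_movehl_ps(shuf, sums);
        sums = _mm_add_ss(sums, shuf);
        float result;
        _mm_store_ss(&result, sums);
        return result;
    }

There is no dispatch overhead here: the code does exactly one thing, which is the point of the small-data argument above.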
With that in mind:
- What is your data set?
- What are the sizes of your matrices / vectors?
- What are the linear algebra operations?
Also, regarding the "simple for loops": give the compiler the chance to vectorize for you. That is, something like:
    /* Manually unrolled by 4; assumes DIM_OF_MY_VECTOR is a multiple of 4. */
    for (i = 0; i < DIM_OF_MY_VECTOR; i += 4) {
        vecmul[i]   = src1[i]   * src2[i];
        vecmul[i+1] = src1[i+1] * src2[i+1];
        vecmul[i+2] = src1[i+2] * src2[i+2];
        vecmul[i+3] = src1[i+3] * src2[i+3];
    }
    for (i = 0; i < DIM_OF_MY_VECTOR; i += 4)
        dotprod += vecmul[i] + vecmul[i+1] + vecmul[i+2] + vecmul[i+3];
may be a better hint to a vectorizing compiler than the plain
    for (i = 0; i < DIM_OF_MY_VECTOR; i++) dotprod += src1[i] * src2[i];
expression. In that sense, what exactly you mean by "calculations in for loops" has a significant impact.
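Whether either form actually gets vectorized also depends on what you tell the compiler. A minimal sketch, assuming GCC or Clang (the function name dotprod_plain is just for illustration):

    /* Sketch: "restrict" promises the compiler that the arrays do not alias,
       which makes auto-vectorization of the simple loop much more likely.
       Typical GCC/Clang flags: -O3 -march=native, plus -ffast-math if you
       allow the floating-point reduction to be reordered. */
    double dotprod_plain(const double * restrict src1,
                         const double * restrict src2, int n)
    {
        double dotprod = 0.0;
        for (int i = 0; i < n; i++)
            dotprod += src1[i] * src2[i];
        return dotprod;
    }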
If your vectors are large enough, the BLAS version,
    dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);
will be cleaner code and most likely faster.
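For completeness, a minimal compilable sketch of that call through the standard CBLAS interface (link against whatever BLAS you use, e.g. -lopenblas or MKL; the wrapper name dotprod_blas is just for illustration):

    #include <cblas.h>

    /* Both strides are 1 because the vectors are stored contiguously. */
    double dotprod_blas(const double *src1, const double *src2, int n)
    {
        return cblas_ddot(n, src1, 1, src2, 1);
    }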
As for links, these may be of interest: