LAPACK / BLAS vs. simple for loops

I want to port a piece of code that involves many vector and matrix calculations to C or C++, with the goal of speeding it up as much as possible.

Are linear algebra calculations written with plain for loops in C as fast as LAPACK / BLAS calls, or do these libraries offer some inherent speed advantage?

In other words, can simple C code (using for loops and the like) perform linear algebra calculations as fast as code that uses LAPACK / BLAS?

+7
4 answers

Vendor-provided LAPACK / BLAS libraries (Intel IPP / MKL, but also AMD ACML, and those from other processor manufacturers such as IBM for Power or Oracle for SPARC) are often tuned for specific processor capabilities, which can significantly increase performance on large data sets.
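To make the comparison concrete, here is a minimal sketch (my own illustration, not part of the original answer; the function names are made up) of the two approaches to a square matrix product: a naive triple loop versus the equivalent cblas_dgemm call. On large matrices, the BLAS call into a hardware-tuned library will usually win by a wide margin:

 #include <cblas.h>   /* any CBLAS implementation: MKL, OpenBLAS, ACML, ... */

 /* C = A * B for square n x n matrices stored row-major. */
 void matmul_loops(int n, const double *A, const double *B, double *C)
 {
     for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++) {
             double sum = 0.0;
             for (int k = 0; k < n; k++)
                 sum += A[i*n + k] * B[k*n + j];
             C[i*n + j] = sum;
         }
 }

 void matmul_blas(int n, const double *A, const double *B, double *C)
 {
     /* C = 1.0 * A * B + 0.0 * C */
     cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                 n, n, n, 1.0, A, n, B, n, 0.0, C, n);
 }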

Often, however, you work with very specific small data (say, 4x4 matrices or 4-element dot products, i.e. the operations used in 3D geometry processing), and for such things BLAS / LAPACK are overkill because of the up-front checks these routines run to choose a code path based on the properties of the data set. In those situations, plain C / C++ source code, possibly using SSE2...SSE4 intrinsics and/or compiler-generated vectorization, can outperform BLAS / LAPACK.
That is why, for example, Intel ships two libraries: MKL for large linear algebra data sets and IPP for small (graphics vector) data sets.
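As a hedged illustration of that point (mine, not from the answer): a fixed-size 4-element double-precision dot product written directly with SSE2 intrinsics has no dispatch or parameter-checking overhead at all, which is exactly where it can beat a general-purpose BLAS call:

 #include <emmintrin.h>   /* SSE2 intrinsics */

 /* 4-element dot product, all sizes known at compile time. */
 static double dot4(const double a[4], const double b[4])
 {
     __m128d lo = _mm_mul_pd(_mm_loadu_pd(&a[0]), _mm_loadu_pd(&b[0]));
     __m128d hi = _mm_mul_pd(_mm_loadu_pd(&a[2]), _mm_loadu_pd(&b[2]));
     __m128d s  = _mm_add_pd(lo, hi);          /* {a0*b0 + a2*b2, a1*b1 + a3*b3} */
     __m128d sh = _mm_unpackhi_pd(s, s);       /* move the high lane down        */
     return _mm_cvtsd_f64(_mm_add_sd(s, sh));  /* horizontal add of both lanes   */
 }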

So the relevant questions are:

  • What does your data set look like?
  • What are the sizes of your matrices / vectors?
  • Which linear algebra operations do you need?

Also, regarding the "simple for loops": give the compiler a chance to vectorize for you. That is, something like:

 for (i = 0; i < DIM_OF_MY_VECTOR; i += 4) {
     vecmul[i]   = src1[i]   * src2[i];
     vecmul[i+1] = src1[i+1] * src2[i+1];
     vecmul[i+2] = src1[i+2] * src2[i+2];
     vecmul[i+3] = src1[i+3] * src2[i+3];
 }
 for (i = 0; i < DIM_OF_MY_VECTOR; i += 4)
     dotprod += vecmul[i] + vecmul[i+1] + vecmul[i+2] + vecmul[i+3];

may be a better hint to a vectorizing compiler than the plain

 for (i = 0; i < DIM_OF_MY_VECTOR; i++)
     dotprod += src1[i] * src2[i];

expression. In that sense, what exactly you mean by "calculations with for loops" makes a significant difference.
If your vector is large enough, the BLAS version,

 dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);

will be cleaner code and most likely faster.
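For completeness, a minimal sketch of how that call fits into a whole program (my example: the vector length and test data are made up, and you link against whichever CBLAS implementation you have, e.g. -lopenblas, -lcblas, or MKL):

 #include <stdio.h>
 #include <cblas.h>

 #define DIM_OF_MY_VECTOR 1024

 int main(void)
 {
     double src1[DIM_OF_MY_VECTOR], src2[DIM_OF_MY_VECTOR];

     for (int i = 0; i < DIM_OF_MY_VECTOR; i++) {  /* arbitrary test data */
         src1[i] = 1.0;
         src2[i] = 2.0;
     }

     double dotprod = cblas_ddot(DIM_OF_MY_VECTOR, src1, 1, src2, 1);
     printf("dot product = %f\n", dotprod);        /* prints 2048.000000 */
     return 0;
 }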


+16

Probably not. People put quite a bit of work into making sure that the LAPACK / BLAS routines are both optimized and numerically stable. The code is often on the hard-to-read side, but that is usually for a reason.

Depending on your target platform, you may also want to look at the Intel libraries.

+7

Numerical analysis is hard. At the very least, you need to be aware of the limitations of floating-point arithmetic and know how to order operations so that you balance speed with numerical stability. This is not trivial.

You really need to have an idea of the balance between speed and stability that you actually need. In general software development, premature optimization is the root of all evil; in numerical analysis, it is the name of the game. If you do not get the balance right the first time, you will end up rewriting more or less all of it.

And it gets harder when you try to turn linear algebra derivations into algorithms. You need to understand the algebra well enough to rearrange it into a stable (or reasonably stable) algorithm.
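A small, hypothetical example of what that means in practice (not from the answer; the function names are mine): the textbook formula for the Euclidean norm overflows for large entries, while the rescaled formulation used by LAPACK / BLAS-style routines such as dnrm2 does not, at the cost of a few extra operations:

 #include <math.h>
 #include <stdio.h>

 /* Textbook formula: sqrt(sum of squares). x[i]*x[i] can overflow to inf. */
 double norm_naive(const double *x, int n)
 {
     double s = 0.0;
     for (int i = 0; i < n; i++)
         s += x[i] * x[i];
     return sqrt(s);
 }

 /* Rescaled sum of squares: track the largest magnitude seen so far and
  * accumulate (x[i]/scale)^2, so nothing overflows. */
 double norm_scaled(const double *x, int n)
 {
     double scale = 0.0, ssq = 1.0;
     for (int i = 0; i < n; i++) {
         double ax = fabs(x[i]);
         if (ax == 0.0)
             continue;
         if (scale < ax) {
             ssq = 1.0 + ssq * (scale / ax) * (scale / ax);
             scale = ax;
         } else {
             ssq += (ax / scale) * (ax / scale);
         }
     }
     return scale * sqrt(ssq);
 }

 int main(void)
 {
     double x[2] = { 1e200, 1e200 };
     printf("naive:  %g\n", norm_naive(x, 2));   /* inf (overflowed)   */
     printf("scaled: %g\n", norm_scaled(x, 2));  /* about 1.41421e+200 */
     return 0;
 }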

If I were you, I would code against the LAPACK / BLAS API and pick a library implementation that works well for your data set.

You have many options: LAPACK / BLAS, GSL and other self-optimizing libraries, and vendor libraries.
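For example, a dot product through GSL's BLAS wrapper would look roughly like this (a sketch with made-up test data; link with -lgsl -lgslcblas):

 #include <stdio.h>
 #include <gsl/gsl_vector.h>
 #include <gsl/gsl_blas.h>

 int main(void)
 {
     double a[4] = { 1.0, 2.0, 3.0, 4.0 };
     double b[4] = { 4.0, 3.0, 2.0, 1.0 };
     double result;

     /* Wrap the plain C arrays as GSL vectors without copying. */
     gsl_vector_view va = gsl_vector_view_array(a, 4);
     gsl_vector_view vb = gsl_vector_view_array(b, 4);

     gsl_blas_ddot(&va.vector, &vb.vector, &result);
     printf("dot product = %g\n", result);   /* prints 20 */
     return 0;
 }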

+4

I am not very experienced with these libraries, but you should keep in mind that library routines usually run a few checks on their parameters, have an error-reporting mechanism, and may even allocate new variables when a function is called... If the calculations are trivial, maybe you can try doing it yourself, adapted to your needs...

0
