A good starting point is the excellent book The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. They provide a free download version.
BLAS is divided into three levels:
Level 1 defines a set of linear algebra functions that operate only on vectors. These functions benefit from vectorization (e.g., using SSE).
Level 2 functions are matrix-vector operations, e.g. some matrix-vector product. These functions could be implemented in terms of Level 1 functions. However, you can boost their performance if you provide a dedicated implementation that makes use of a multiprocessor architecture with shared memory.
Level 3 functions are operations like the matrix-matrix product. Again, you could implement them in terms of Level 2 functions. But Level 3 functions perform O(N^3) operations on O(N^2) data. So if your platform has a cache hierarchy, you can boost performance by providing a dedicated, cache-optimized / cache-friendly implementation. This is nicely described in the book. The main performance boost of Level 3 functions comes from cache optimization. This boost significantly exceeds the second boost from parallelism and other hardware optimizations.
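To make the three levels concrete, here is a small sketch using the standard CBLAS interface, one representative routine per level (link against any BLAS that ships cblas.h, e.g. OpenBLAS; the matrices here are just small column-major examples):

    #include <cblas.h>

    int main(void)
    {
        enum { N = 4 };
        double x[N] = {1, 2, 3, 4};
        double y[N] = {4, 3, 2, 1};
        double A[N*N] = {0}, B[N*N] = {0}, C[N*N] = {0};
        for (int i = 0; i < N; ++i) {        /* A = I, B = 2*I, column-major */
            A[i + i*N] = 1.0;
            B[i + i*N] = 2.0;
        }

        /* Level 1: vector-vector, O(N) work on O(N) data:   y := 2*x + y */
        cblas_daxpy(N, 2.0, x, 1, y, 1);

        /* Level 2: matrix-vector, O(N^2) work on O(N^2) data:  y := A*x */
        cblas_dgemv(CblasColMajor, CblasNoTrans, N, N, 1.0, A, N, x, 1, 0.0, y, 1);

        /* Level 3: matrix-matrix, O(N^3) work on O(N^2) data:  C := A*B */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, N, N, N,
                    1.0, A, N, B, N, 0.0, C, N);
        return 0;
    }

The O(work)/O(data) ratio is the whole point: only at Level 3 do you get enough reuse per memory access for cache blocking to pay off.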
By the way, most (or even all) high-performance BLAS implementations are NOT implemented in Fortran. ATLAS is implemented in C. GotoBLAS/OpenBLAS is implemented in C, with its performance-critical parts in assembler. Only the reference implementation of BLAS is written in Fortran. However, all of these BLAS implementations provide a Fortran interface so that they can be linked against LAPACK (LAPACK gains all its performance from BLAS).
Optimizing compilers play a minor role in this respect (and for GotoBLAS/OpenBLAS the compiler does not matter at all).
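As an illustration of that Fortran interface, calling the Fortran routine dgemm directly from C typically looks like the sketch below. The trailing underscore in the symbol name and the handling of the hidden character-length arguments are compiler/ABI conventions (common with gfortran and friends), so treat this as a sketch rather than a portable recipe:

    /* Fortran BLAS routines take every argument by reference; with most
       Fortran compilers the external symbol gets a trailing underscore. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *A, const int *lda,
                       const double *B, const int *ldb,
                       const double *beta, double *C, const int *ldc);

    /* C := A*B for n x n matrices stored column-major,
       which is exactly the layout Fortran expects. */
    void call_fortran_dgemm(int n, const double *A, const double *B, double *C)
    {
        const char   no   = 'N';
        const double one  = 1.0, zero = 0.0;
        dgemm_(&no, &no, &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);
    }

LAPACK calls BLAS through exactly this kind of Fortran interface, which is why any optimized BLAS can be dropped in underneath it.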
IMHO, no BLAS implementation uses algorithms like the Coppersmith-Winograd algorithm or the Strassen algorithm. I am not exactly sure about the reason, but this is my guess:
- Maybe it is not possible to provide a cache-optimized implementation of these algorithms (i.e. you would lose more than you would win).
- These algorithms are numerically not stable. As BLAS is the computational kernel of LAPACK, this is a no-go.
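For context, what these algorithms offer is a purely asymptotic advantage. Strassen replaces the 8 recursive half-size multiplications of the classical block product by 7, which gives the recurrence

    T(n) \;=\; 7\,T\!\left(\tfrac{n}{2}\right) + \mathcal{O}(n^2)
    \quad\Longrightarrow\quad
    T(n) \;=\; \mathcal{O}\!\left(n^{\log_2 7}\right) \;\approx\; \mathcal{O}\!\left(n^{2.81}\right)

versus O(n^3) for the classical product. The larger constants, the extra memory traffic, and the weaker error bounds are exactly what the two points above are about.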
Edit / Update:
The new and groundbreaking papers for this topic are the BLIS papers. They are exceptionally well written. For my lecture "Software Basics for High Performance Computing" I implemented the matrix-matrix product following their paper. In fact, I implemented several variants of the matrix-matrix product. The simplest variant is entirely written in plain C and has less than 450 lines of code. All the other variants merely optimize the following loops:
    /* zero the MR x NR micro-block AB */
    for (l=0; l<MR*NR; ++l) {
        AB[l] = 0;
    }
    /* accumulate kc rank-1 updates: AB += A(:,l) * B(l,:),
       where A and B point into packed panels and advance by one
       MR- resp. NR-wide slice per iteration */
    for (l=0; l<kc; ++l) {
        for (j=0; j<NR; ++j) {
            for (i=0; i<MR; ++i) {
                AB[i+j*MR] += A[i]*B[j];
            }
        }
        A += MR;
        B += NR;
    }
The overall performance of the matrix-matrix product depends only on these loops. Around 99.9% of the time is spent here. In the other variants I used intrinsics and assembler code to improve the performance. You can see the tutorial going through all the variants here:
ulmBLAS: GEMM tutorial (matrix-matrix product)
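To make the loop structure around such a micro-kernel concrete, here is a minimal sketch of my own, not the actual ulmBLAS code: it simply tiles C into MR x NR blocks and lets a plain-C micro-kernel do all the flops. The block sizes are illustrative and the packing step is omitted; the real ulmBLAS/BLIS code first packs panels of A and B into contiguous buffers and adds further blocking for the cache levels:

    #include <stddef.h>

    enum { MR = 4, NR = 4 };   /* micro-block sizes (illustrative values) */

    /* micro-kernel: C(MR x NR) += A(MR x k) * B(k x NR), all column-major */
    static void micro_kernel(size_t k,
                             const double *A, size_t lda,
                             const double *B, size_t ldb,
                             double *C, size_t ldc)
    {
        double AB[MR*NR] = {0};
        for (size_t l = 0; l < k; ++l)
            for (size_t j = 0; j < NR; ++j)
                for (size_t i = 0; i < MR; ++i)
                    AB[i + j*MR] += A[i + l*lda] * B[l + j*ldb];
        for (size_t j = 0; j < NR; ++j)
            for (size_t i = 0; i < MR; ++i)
                C[i + j*ldc] += AB[i + j*MR];
    }

    /* macro loops: tile C into MR x NR blocks, micro-kernel does the work;
       for brevity this assumes m % MR == 0 and n % NR == 0 */
    void gemm_simple(size_t m, size_t n, size_t k,
                     const double *A, size_t lda,
                     const double *B, size_t ldb,
                     double *C, size_t ldc)
    {
        for (size_t j = 0; j < n; j += NR)
            for (size_t i = 0; i < m; i += MR)
                micro_kernel(k, &A[i], lda, &B[j*ldb], ldb, &C[i + j*ldc], ldc);
    }

The point of the packing step I left out is precisely that the micro-kernel then streams through contiguous memory, independent of how the user stored A and B.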
Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can achieve this kind of performance. And why it does not matter whether you use row-major or column-major storage (the panels of the operands get packed into contiguous buffers anyway)!
The final benchmarks are here (we called our project ulmBLAS):
Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen
Another Edit / Update:
I also wrote several guides on how BLAS is used for problems of numerical linear algebra, such as solving a system of linear equations:
High performance LU factorization
(This LU factorization is, for example, used by Matlab for solving a system of linear equations.)
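For example, solving A x = b with LAPACK boils down to exactly this LU factorization plus two triangular solves, and LAPACK spends essentially all of its time inside BLAS. A minimal sketch using the LAPACKE C interface (assuming a LAPACK installation that ships lapacke.h):

    #include <stdio.h>
    #include <lapacke.h>

    int main(void)
    {
        /* Solve A*x = b for a small 3x3 system (row-major for readability). */
        double A[3*3] = { 2, 1, 1,
                          4, 3, 3,
                          8, 7, 9 };
        double b[3]   = { 1, 2, 5 };
        lapack_int ipiv[3];

        /* dgesv = LU factorization with partial pivoting + forward/back
           substitution; internally it is blocked so that most of the work
           lands in BLAS Level 3 (dgemm). */
        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
        if (info != 0) {
            printf("dgesv failed: info = %d\n", (int)info);
            return 1;
        }
        printf("x = [% .3f % .3f % .3f]\n", b[0], b[1], b[2]);  /* 0.5 -0.5 0.5 */
        return 0;
    }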
I hope to find the time to extend the tutorial to describe and demonstrate how to realize a highly scalable parallel implementation of the LU factorization, as for example in PLASMA.
So here you go: Coding a cache-optimized parallel LU factorization
P.S.: I also did some experiments on improving the performance of uBLAS. It actually is pretty simple to boost (yeah, play on words :)) the performance of uBLAS:
Experiments on uBLAS.
And here is a similar project with BLAZE:
Experiments on BLAZE.