How does BLAS get such extreme performance?

Out of curiosity, I decided to benchmark my own matrix multiplication function against the BLAS implementation ... and I was, to say the least, surprised by the result:

Custom implementation, 10 trials of 1000x1000 matrix multiplication:

Took: 15.76542 seconds. 

BLAS implementation, 10 trials of 1000x1000 matrix multiplication:

 Took: 1.32432 seconds. 

This is using single precision floating point numbers.

My implementation:

 template<class ValT>
 void mmult(const ValT* A, int ADim1, int ADim2,
            const ValT* B, int BDim1, int BDim2, ValT* C)
 {
     if ( ADim2!=BDim1 )
         throw std::runtime_error("Error sizes off");

     memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);

     int cc2,cc1,cr1;
     for ( cc2=0 ; cc2<BDim2 ; ++cc2 )
         for ( cc1=0 ; cc1<ADim2 ; ++cc1 )
             for ( cr1=0 ; cr1<ADim1 ; ++cr1 )
                 C[cc2*ADim2+cr1] += A[cc1*ADim1+cr1]*B[cc2*BDim1+cc1];
 }

I have two questions:

  • Given that a matrix-matrix multiplication of an n×m by an m×n matrix requires n * n * m multiplications, in the case above that is 1000^3 or 1e9 operations. How can BLAS do 10 * 1e9 operations in 1.32 seconds on my 2.6 GHz processor? Even if multiplication were the only operation and nothing else were done, it should take about 4 seconds.
  • Why is my implementation so much slower?
+85
c++ fortran
Aug 19 '09 at 23:30
10 answers

A good starting point is the excellent book The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. They provide a free download version.

BLAS is divided into three levels:

  • Level 1 defines a set of linear algebra functions that operate on vectors only. These functions benefit from vectorization (e.g. from using SSE).

  • Level 2 functions are matrix-vector operations, e.g. some matrix-vector product. These functions could be implemented in terms of Level 1 functions. However, you can boost their performance if you provide a dedicated implementation that makes use of some multiprocessor architecture with shared memory.

  • Level 3 functions are operations like the matrix-matrix product. Again you could implement them in terms of Level 2 functions. But Level 3 functions perform O(N^3) operations on O(N^2) data. So if your platform has a cache hierarchy, you can boost performance by providing a dedicated implementation that is cache optimized / cache friendly. This is nicely described in the book. The main boost of Level 3 functions comes from cache optimization, and this boost significantly exceeds the second boost from parallelism and other hardware optimizations. (A hedged sketch of what the three levels look like at the call site follows this list.)
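
As a concrete (and hedged) illustration, this is roughly what the three levels look like at the call site, assuming the standard CBLAS interface as shipped by e.g. ATLAS or OpenBLAS; the single-precision routines saxpy, sgemv and sgemm are representatives of Level 1, 2 and 3 respectively:

    // Sketch only: assumes a CBLAS header (e.g. from ATLAS or OpenBLAS) and linking against it.
    #include <cblas.h>
    #include <vector>

    int main() {
        const int n = 1000;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);

        // Level 1: y = 2*x + y   (vector-vector, O(n) work on O(n) data)
        cblas_saxpy(n, 2.0f, x.data(), 1, y.data(), 1);

        // Level 2: y = A*x       (matrix-vector, O(n^2) work on O(n^2) data)
        cblas_sgemv(CblasColMajor, CblasNoTrans, n, n, 1.0f,
                    A.data(), n, x.data(), 1, 0.0f, y.data(), 1);

        // Level 3: C = A*B       (matrix-matrix, O(n^3) work on O(n^2) data)
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0f, A.data(), n, B.data(), n, 0.0f, C.data(), n);
        return 0;
    }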

By the way, most (or even all) of the high-performance BLAS implementations are NOT implemented in Fortran. ATLAS is implemented in C. GotoBLAS/OpenBLAS is implemented in C and its performance-critical parts in assembler. Only the reference implementation of BLAS is implemented in Fortran. However, all these BLAS implementations provide a Fortran interface so that they can be linked against LAPACK (LAPACK gets all its performance from BLAS).

Optimizing compilers play a minor role in this respect (and for GotoBLAS/OpenBLAS the compiler does not matter at all).

IMHO, no BLAS implementation uses algorithms like the Coppersmith-Winograd algorithm or the Strassen algorithm. I am not exactly sure about the reason, but this is my guess:

  • Maybe it is not possible to provide a cache-optimized implementation of these algorithms (i.e. you would lose more than you would win).
  • These algorithms are numerically not stable. As BLAS is the computational kernel of LAPACK, this is a no-go.

Edit / Update:

The new and ground-breaking papers for this topic are the BLIS papers. They are exceptionally well written. For my lecture "Software Basics for High Performance Computing" I implemented the matrix-matrix product following their papers. Actually I implemented several variants of the matrix-matrix product. The simplest variant is entirely written in plain C and has less than 450 lines of code. All the other variants merely optimize the loops:

     for (l=0; l<MR*NR; ++l) {
         AB[l] = 0;
     }
     for (l=0; l<kc; ++l) {
         for (j=0; j<NR; ++j) {
             for (i=0; i<MR; ++i) {
                 AB[i+j*MR] += A[i]*B[j];
             }
         }
         A += MR;
         B += NR;
     }

The overall performance of the matrix-matrix product depends only on these loops. About 99.9% of the time is spent here. In the other variants I used intrinsics and assembler code to improve the performance. You can see the tutorial going through all the variants here:

ulmBLAS: GEMM (matrix-matrix product) tutorial

Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can achieve such performance, and why it does not matter whether you use row-major or column-major storage!

The final benchmarks are here (we called our project ulmBLAS):

Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen

Further Edit/Update:

I also wrote some tutorials on how BLAS gets used for numerical linear algebra problems like solving a system of linear equations:

High performance LU factorization

(This LU factorization is, for example, used by Matlab for solving a system of linear equations.)

I hope to find the time to extend the tutorial to describe and demonstrate how to realize a highly scalable parallel implementation of the LU factorization, like in PLASMA.

So here you go: Coding a Cache Optimized Parallel LU Factorization

P.S.: I also did some experiments on improving the performance of uBLAS. It is actually pretty simple to boost (yes, play on words :) ) the performance of uBLAS:

uBLAS experiments.

Here is a similar project with BLAZE:

BLAZE experiments.

+113
Jul 10 '12 at 20:23

So, first of all, BLAS is just an interface of about 50 functions. There are many competing implementations of the interface.

First, I will mention things that are largely unrelated:

  • Fortran vs C: it makes no difference
  • Advanced matrix algorithms such as Strassen: implementations do not use them, because they do not help in practice

Most implementations break each operation into small-dimension matrix or vector operations in a more or less obvious way. For example, a large 1000x1000 matrix multiplication may be broken into a sequence of 50x50 matrix multiplications.
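
A minimal sketch of that decomposition, assuming row-major square matrices whose size is a multiple of the (arbitrarily chosen) block size 50 and a zero-initialized C; the helper multiply_block is hypothetical, and a real library would replace it with a hand-tuned kernel:

    constexpr int BS = 50;   // block size picked only to mirror the 50x50 example above

    // Hypothetical tile kernel: C_tile += A_tile * B_tile for one BS x BS tile.
    void multiply_block(const float* A, const float* B, float* C, int n,
                        int i0, int j0, int k0) {
        for (int i = i0; i < i0 + BS; ++i)
            for (int k = k0; k < k0 + BS; ++k)
                for (int j = j0; j < j0 + BS; ++j)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    // The big product is just a sequence of small tile products.
    void multiply_blocked(const float* A, const float* B, float* C, int n) {
        for (int i0 = 0; i0 < n; i0 += BS)
            for (int j0 = 0; j0 < n; j0 += BS)
                for (int k0 = 0; k0 < n; k0 += BS)
                    multiply_block(A, B, C, n, i0, j0, k0);
    }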

These fixed-size small-dimension operations (called kernels) are hardcoded in CPU-specific assembly code that exploits several features of its target CPU:

  • SIMD style instructions
  • Instruction-level parallelism
  • Cache awareness

Furthermore, these kernels can be executed in parallel with respect to each other using multiple threads (CPU cores), in the typical map-reduce design pattern.
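
As a hedged sketch of that thread level, building on the hypothetical multiply_block/BS names from the previous snippet and using OpenMP (only one of many possible threading back-ends; real libraries usually ship their own thread pools): each (i0, j0) output tile is independent of the others, so those two loops can be handed to different cores while the k0 accumulation stays sequential within a tile.

    // Sketch only: distribute independent output tiles over CPU cores with OpenMP.
    void multiply_blocked_parallel(const float* A, const float* B, float* C, int n) {
        #pragma omp parallel for collapse(2)
        for (int i0 = 0; i0 < n; i0 += BS)
            for (int j0 = 0; j0 < n; j0 += BS)
                for (int k0 = 0; k0 < n; k0 += BS)     // sequential: every k0 writes the same C tile
                    multiply_block(A, B, C, n, i0, j0, k0);
    }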

Take a look at ATLAS, which is the most commonly used open-source BLAS implementation. It has many different competing kernels, and during the ATLAS library build process it runs a competition among them (some are even parameterized, so the same kernel can have different settings). It tries different configurations and then selects the best for the particular target system.

(Tip: that is why, if you are using ATLAS, you are better off building and tuning the library by hand for your particular machine rather than using a prebuilt one.)

+21
Jun 26 '13 at 14:10

First of all, there are more efficient algorithms for matrix multiplication than the one you are using.

Second, your CPU can do much more than one instruction at a time.

Your CPU executes 3-4 instructions per cycle, and if the SIMD units are used, each instruction processes 4 floats or 2 doubles. (Of course this figure is not accurate either, as the CPU can typically only process one SIMD instruction per cycle.)
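
To make the arithmetic concrete, here is a back-of-the-envelope sketch; the 4-wide SIMD and one-multiply-plus-one-add-per-cycle figures are assumptions about a typical SSE-era core, not measurements of the asker's machine:

    #include <cstdio>

    int main() {
        const double ghz        = 2.6;  // clock rate from the question
        const double simd_width = 4;    // floats per SSE register (assumption)
        const double mul_add    = 2;    // one multiply and one add issued per cycle (assumption)
        const double peak       = ghz * 1e9 * simd_width * mul_add;   // ~2.08e10 flops/s

        const double work = 10 * 2e9;   // 10 runs, each ~1e9 multiplies + 1e9 adds
        std::printf("lower bound at peak: %.2f s\n", work / peak);    // ~0.96 s on one core
        return 0;
    }

Under those assumptions the measured 1.32 s corresponds to roughly 70% of single-core peak, which is the regime a tuned BLAS typically operates in.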

Third, your code is far from optimal:

  • You are using raw pointers, which means that the compiler has to assume they may alias. There are compiler-specific keywords or flags you can specify to tell the compiler that they don't alias (see the sketch after this list). Alternatively, you could use types other than raw pointers that take care of the problem for you.
  • You are thrashing the cache by naively traversing each row/column of the input matrices. You can use blocking to do as much work as possible on a smaller block of the matrix that fits in the CPU cache before moving on to the next block.
  • For purely numerical tasks, Fortran is pretty much unbeatable, and C++ takes a lot of tuning to get to a similar speed. It can be done, and there are a few libraries demonstrating it (typically using expression templates), but it is not trivial and it does not just happen.
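
A hedged sketch of the aliasing point from the first bullet: the __restrict__ spelling below is a GCC/Clang extension (MSVC spells it __restrict, C99 has restrict), and the loop body is the asker's original, only the pointer qualifiers change.

    #include <cstring>
    #include <stdexcept>

    // Promising the compiler that A, B and C never overlap lets it keep running
    // sums in registers instead of reloading C after every store.
    template<class ValT>
    void mmult_noalias(const ValT* __restrict__ A, int ADim1, int ADim2,
                       const ValT* __restrict__ B, int BDim1, int BDim2,
                       ValT* __restrict__ C)
    {
        if ( ADim2!=BDim1 )
            throw std::runtime_error("Error sizes off");
        std::memset(C, 0, sizeof(ValT)*ADim1*BDim2);
        for (int cc2=0; cc2<BDim2; ++cc2)
            for (int cc1=0; cc1<ADim2; ++cc1)
                for (int cr1=0; cr1<ADim1; ++cr1)
                    C[cc2*ADim2+cr1] += A[cc1*ADim1+cr1]*B[cc2*BDim1+cc1];
    }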
+14
Aug 20 '09 at 12:12

I don't know the specifics of the BLAS implementation, but there are more efficient algorithms for matrix multiplication with better complexity than O(n^3). A well-known one is the Strassen algorithm.
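
For reference, a minimal sketch of the Strassen idea on a single 2x2 matrix: 7 multiplications instead of 8. Applied recursively to 2x2 blocks of a larger matrix it yields roughly O(n^2.81) instead of O(n^3); the scalar version below is only meant to show where the saving comes from.

    #include <array>

    // Strassen's 7 products for a 2x2 multiply; row-major storage {a,b,c,d}.
    std::array<double, 4> strassen2x2(const std::array<double, 4>& A,
                                      const std::array<double, 4>& B) {
        const double a = A[0], b = A[1], c = A[2], d = A[3];
        const double e = B[0], f = B[1], g = B[2], h = B[3];

        const double m1 = (a + d) * (e + h);
        const double m2 = (c + d) * e;
        const double m3 = a * (f - h);
        const double m4 = d * (g - e);
        const double m5 = (a + b) * h;
        const double m6 = (c - a) * (e + f);
        const double m7 = (b - d) * (g + h);

        return { m1 + m4 - m5 + m7,    // top-left
                 m3 + m5,              // top-right
                 m2 + m4,              // bottom-left
                 m1 - m2 + m3 + m6 };  // bottom-right
    }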

+11
Aug 20 '09 at 0:15

Most of the arguments to the second question (assembler, blocking, etc., but not the sub-N^3 algorithms, which are really overkill here) play a role. But the low speed of your algorithm is essentially caused by the matrix size and the unfortunate arrangement of the three nested loops. Your matrices are so large that they do not fit at once in cache memory. You can rearrange the loops so that as much work as possible is done on a row that sits in cache, which dramatically reduces cache refreshes (splitting into small blocks has an analogous effect, BTW, and works best if the loops over the blocks are arranged similarly). A model implementation for square matrices follows. On my computer its time consumption was about 1:10 compared to the standard implementation (like yours). In other words: never program a matrix multiplication along the "row times column" scheme that we learned at school. After rearranging the loops, more improvements can be obtained by unrolling loops, assembler code, etc.

   void vector(int m, double ** a, double ** b, double ** c)
   {
       int i, j, k;
       for (i=0; i<m; i++) {
           double * ci = c[i];
           for (k=0; k<m; k++)
               ci[k] = 0.;
           for (j=0; j<m; j++) {
               double aij = a[i][j];
               double * bj = b[j];
               for (k=0; k<m; k++)
                   ci[k] += aij*bj[k];   /* whole contiguous row of c updated per row of b */
           }
       }
   }

One more remark: this implementation is even better on my computer than replacing everything by the BLAS routine cblas_dgemm (try it on your computer!). But calling dgemm_ of the Fortran library directly is much faster (1:4). I think this routine is in fact not Fortran but assembler code (I do not know what is in the library, I don't have the sources). It is totally unclear to me why cblas_dgemm is not as fast, since to my knowledge it is merely a wrapper for dgemm_.
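
For the curious, this is roughly what "calling dgemm_ directly" looks like from C++. The trailing underscore and the pass-everything-by-pointer convention are the usual Fortran name-mangling assumptions and can differ between compilers, so treat this as a sketch:

    // Fortran BLAS takes every argument by pointer and expects column-major storage.
    extern "C" void dgemm_(const char* transa, const char* transb,
                           const int* m, const int* n, const int* k,
                           const double* alpha, const double* a, const int* lda,
                           const double* b, const int* ldb,
                           const double* beta, double* c, const int* ldc);

    void multiply_with_dgemm(const double* A, const double* B, double* C, int n) {
        const char no = 'N';
        const double one = 1.0, zero = 0.0;
        dgemm_(&no, &no, &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n);   // C = A*B
    }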

+4
Nov 30 '13 at 20:11

This is a realistic speedup. For an example of what can be done with SIMD assembler over C++ code, see some example iPhone matrix functions; these were over 8x faster than the C version, and are not even "optimized" assembly: there is no pipelining yet, and there are unnecessary stack operations.

Also, your code is not "restrict correct": how does the compiler know that when it modifies C, it is not also modifying A and B?

+3
Aug 20 '09 at 0:10

With respect to the original code in the MM multiply, memory references for most of the operations are the primary cause of bad performance. Main memory runs 100-1000x slower than cache.

Most of the speedup comes from employing loop optimization techniques for this triple-loop function in the MM multiply. Two main loop optimization techniques are used: unrolling and blocking. With respect to unrolling, we unroll the outermost two loops and block them for data reuse in cache. Outer loop unrolling helps optimize data access temporally by reducing the number of memory references to the same data at different times during the entire operation. Blocking the loop index at a specific number helps retain the data in cache. You can choose to optimize for the L2 cache or the L3 cache.
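
A hedged sketch of the two techniques combined; the tile size and the unroll factor below are placeholders (real values are tuned to the L2/L3 cache of the target machine), and row-major storage with a zero-initialized C and n divisible by the tile size are assumed:

    constexpr int TILE = 64;   // placeholder block size, tuned to cache size in practice

    // Blocked i/k/j loop nest with the innermost j loop unrolled by 4.
    void mm_block_unroll(const double* A, const double* B, double* C, int n) {
        for (int ii = 0; ii < n; ii += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int jj = 0; jj < n; jj += TILE)
                    for (int i = ii; i < ii + TILE; ++i)
                        for (int k = kk; k < kk + TILE; ++k) {
                            const double aik = A[i * n + k];   // reused across the unrolled body
                            for (int j = jj; j < jj + TILE; j += 4) {
                                C[i * n + j]     += aik * B[k * n + j];
                                C[i * n + j + 1] += aik * B[k * n + j + 1];
                                C[i * n + j + 2] += aik * B[k * n + j + 2];
                                C[i * n + j + 3] += aik * B[k * n + j + 3];
                            }
                        }
    }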

https://en.wikipedia.org/wiki/Loop_nest_optimization

+2
May 02 '16 at 12:07 a.m.

We are going to launch a MOOC on this topic on edX, hopefully in early June 2019. We are still pulling the materials together, but if you are impatient you can already see the notes (which are updated several times a day): ulaff.net (LAFF-On Programming for High Performance).

These materials expose the techniques underlying the implementation of matrix-matrix multiplication in the BLIS framework (which is a refactoring of Goto's algorithm).

Please do not contact us about the materials yet (wait until the course goes live; then you can post questions on the forum). (Exceptions will be considered for billionaires who want to fund the effort ...)

If you would like to be informed about these and our other efforts, you can join the ULAFF-On group mentioned on that page.

Enjoy it!

0
May 02 '19 at 15:02

Since my last answer was removed, let me try again.

There are two important issues that need to be addressed when you try to implement a high-performance matrix-matrix multiplication routine (and the related operations that make up Level 3 BLAS): you need to use vector instructions to exploit the vector registers and other hardware features that give a modern architecture its performance, and you need to use the memory hierarchy carefully. Modern compilers cannot do this without intervention from the programmer.

To make the resulting code portable across architectures, the computation is cast in terms of a "micro-kernel" that updates a small block of C (kept in vector registers) with a sequence of rank-1 updates (outer products). Around this, the matrix-matrix multiplication is partitioned with blocking so that data is staged in all three levels of cache (L1 to L3). Importantly, the data needs to be "packed" into contiguous memory so that the computation accesses it with "unit stride" (contiguously).
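
A hedged sketch of the packing step described above; MR is the register-block (micro-panel) height, and MC/KC stand in for the cache-derived blocking parameters from the papers, so the names are placeholders rather than any library's actual code:

    constexpr int MR = 4;   // placeholder micro-panel height (rows held in vector registers)

    // Copy an MC x KC block of column-major A (leading dimension lda) into a
    // contiguous buffer, reordered into MR-row micro-panels so that the
    // micro-kernel can stream through it with unit stride. Assumes MC % MR == 0.
    void pack_A(int MC, int KC, const double* A, int lda, double* buffer) {
        for (int i0 = 0; i0 < MC; i0 += MR)        // one micro-panel at a time
            for (int k = 0; k < KC; ++k)           // sweep across its columns
                for (int i = 0; i < MR; ++i)       // MR contiguous rows per column
                    *buffer++ = A[(i0 + i) + k * lda];
    }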

While this is described in the papers I was involved in (as well as in many of the discussions above and in the papers mentioned in earlier answers):

  • Anatomy of High-Performance Matrix Multiplication. K. Goto, R. A. van de Geijn. ACM Transactions on Mathematical Software (TOMS) 34(3), 12. (Mentioned in earlier answers and often used in classes.)

  • BLIS: A Framework for Rapidly Instantiating BLAS Functionality. F. G. Van Zee, R. A. van de Geijn. ACM Transactions on Mathematical Software 41(3), Article 14.

we have gone beyond that and created a 4-week course, "Programming for High Performance", with carefully chosen exercises that guide both the novice and the expert through these techniques.

We hope that the moderators of Qaru will appreciate that, since this course will be offered free of charge to anyone who wants to take it and the notes are freely available, a pointer to where these notes can be found is warranted: http://ulaff.net . A course based on these notes will launch on edX in early June.

This post is clearly relevant to this discussion.

0
May 02 '19 at 17:13

For many reasons.

First, Fortran compilers are highly optimizing, and the language allows them to be. C and C++ are very loose in terms of array handling (e.g. the case of pointers referring to the same memory area). This means that the compiler cannot know in advance what to do and is forced to generate generic code. In Fortran, your cases are more constrained, and the compiler has better control over what happens, allowing it to optimize more (e.g. by using registers).

Another thing is that Fortran stores data column-wise, while C stores data row-wise. I haven't checked your code, but be careful about how you perform the product. In C you should scan row-wise: this way you scan your array along contiguous memory, reducing cache misses. Cache misses are the first source of inefficiency.

Third, it depends on the BLAS implementation you are using. Some implementations may be written in assembler and optimized for the specific processor you are using. The netlib version is written in Fortran 77.

Also, you are doing a lot of operations, most of them repeated and redundant. All those multiplications to obtain the index hurt performance. I don't really know how this is done in BLAS, but there are a lot of tricks to avoid expensive operations.

For example, you could rework your code this way:

  template<class ValT>
  void mmult(const ValT* A, int ADim1, int ADim2,
             const ValT* B, int BDim1, int BDim2, ValT* C)
  {
      if ( ADim2!=BDim1 )
          throw std::runtime_error("Error sizes off");

      memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);

      int cc2,cc1,cr1, a1,a2,a3;
      for ( cc2=0 ; cc2<BDim2 ; ++cc2 ) {
          a1 = cc2*ADim2;            // column offset into C, computed once per column
          a3 = cc2*BDim1;            // column offset into B
          for ( cc1=0 ; cc1<ADim2 ; ++cc1 ) {
              a2 = cc1*ADim1;        // column offset into A
              ValT b = B[a3+cc1];    // load the B element once for the whole inner loop
              for ( cr1=0 ; cr1<ADim1 ; ++cr1 ) {
                  C[a1+cr1] += A[a2+cr1]*b;
              }
          }
      }
  }

Try it out, I am sure you will save something.

On question #1, the reason is that matrix multiplication scales as O(n^3) if you use the trivial algorithm. There are algorithms that scale much better.

-24
Aug 19 '09 at 23:36


