How to speed up matrix multiplications in Python?

I am developing a small neural network whose parameters need a lot of optimization, and therefore a lot of processing time. I profiled my script with cProfile: 80% of the CPU time is spent in NumPy's dot function, and the rest is matrix inversion via the numpy.linalg.solve function. My current version of numpy uses BLAS, or so it seems, since numpy.core._dotblas.dot shows up as the function taking 80% of the total processing time.
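For reference, here is roughly how the profile was gathered (train_network and the loop body are simplified placeholders, not my actual script):

```python
import cProfile
import pstats

import numpy as np

def train_network():
    # Placeholder for the real training loop described above.
    a = np.random.rand(500, 500)
    b = np.random.rand(500, 500)
    for _ in range(10):
        c = np.dot(a, b)           # the call that dominates the profile
        x = np.linalg.solve(a, b)  # the matrix-inversion part

cProfile.run("train_network()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)
```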

This is the core of my neural network, and since I have to run it many times, any small speed gain would save a lot of time across the many repeated parameter optimizations.

To be more precise: the multiplications are on matrices ranging from at least 100 * 100 up to 500 * 500. I have a 12-core machine and already use the cores to optimize different neural-network parameters in parallel, but perhaps the matrix multiplication itself could also run in parallel?

Thank you for your time!

EDIT (test results):

I spent several days testing, installing and uninstalling libraries... Here are the results of what I tested. By default, on my version of Ubuntu (12.04) with the repository version of Numpy, the BLAS libraries are the ATLAS libraries. I ran several tests, and the improvements are specific to the computations I care about, so these results should not be taken as the final answer for every workload. The computations are a matrix multiplication (dot product) in a loop of 55,000 iterations, with 500 * 500 and 1000 * 1000 matrices. I use an HP Z800 workstation with a Xeon X5675 @ 3.07 GHz with 12 cores. All results (in percent) are a comparison between the described setup and the reference, which is the packaged ATLAS library.
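The timing loop was essentially of the following shape (a simplified sketch, not the exact benchmark script):

```python
import time
import numpy as np

n = 500  # the same loop was run with n = 1000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.time()
for _ in range(55000):  # 55,000 iterations as described above
    c = np.dot(a, b)
print("%d * %d: %.1f s for 55,000 products" % (n, n, time.time() - start))
```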

  • Scipy.sparse module: I don't know whether I installed it correctly, but with 10% sparsity this module only becomes useful starting at 1500 * 1500 matrices, compared to OpenBLAS and MKL. If you have a suggestion on how to use it correctly, I'm interested!
  • With OpenBLAS I get a speed increase of 33% for 500 * 500 matrices, and 160% for 1000 * 1000. With OpenBLAS, though, the scipy.sparse module does not perform better; it is actually worse.
  • The big winner here is the MKL library. The speed-up reaches 230% with 1000 * 1000 matrices compared to the original ATLAS libraries! For 500 * 500 matrices the speed-up is more modest (100%) but still very good. Moreover, when compiled with OpenMP, matrix multiplication can run on my 12 cores, and there it is twice as fast as on a single core with the MKL libraries. But this is a waste of computing power: it is much more efficient to use the multiprocessing module to run scripts / matrix multiplications in parallel (see the sketch after this list).
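A minimal sketch of that multiprocessing approach (evaluate is a placeholder for one optimization run; pinning the BLAS to one thread per worker is an assumption to avoid oversubscribing the cores):

```python
import os
# Pin each worker's BLAS to a single thread so that 12 workers do not
# oversubscribe the 12 cores (must be set before numpy is imported).
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

from multiprocessing import Pool
import numpy as np

def evaluate(seed):
    """Placeholder for one parameter set's optimization run."""
    rng = np.random.RandomState(seed)
    a = rng.rand(500, 500)
    b = rng.rand(500, 500)
    return np.dot(a, b).sum()

if __name__ == "__main__":
    pool = Pool(processes=12)  # one worker per core
    results = pool.map(evaluate, range(24))
    pool.close()
    pool.join()
    print(results)
```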
+8
optimization python numpy parallel-processing blas
2 answers

If you haven't already done so, you could try linking numpy against a highly optimized BLAS library such as Intel MKL (which is free-as-in-beer for non-commercial use, or discounted for academic use that apparently does not count as non-commercial; Intel provides instructions for using it with numpy) or OpenBLAS (free-as-in-speech). There is also the Enthought Python Distribution, which comes pre-linked with MKL and is free-as-in-beer for academics. Either can parallelize your matrix multiplications automatically and can be much faster than the typical BLAS/ATLAS installation on most Linux distributions, or whatever you are using.
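To check which BLAS your numpy build is actually linked against, you can inspect its build configuration:

```python
import numpy as np

# Lists the BLAS/LAPACK libraries this numpy build was linked against;
# after linking to MKL or OpenBLAS they appear here instead of ATLAS.
np.show_config()
```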

Otherwise, the only other thing I can think of would be mathematical tricks to avoid computing so many multiplications/solves in the first place. Without knowing exactly what you are doing, it is hard to give specific suggestions.
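One example of such a trick, assuming you repeatedly solve against the same matrix: factor it once with scipy.linalg and reuse the factorization, instead of calling numpy.linalg.solve on every iteration:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

a = np.random.rand(500, 500)
rhs = [np.random.rand(500) for _ in range(1000)]

# Factor A once (O(n^3))...
lu, piv = lu_factor(a)
# ...then every subsequent solve is only O(n^2).
xs = [lu_solve((lu, piv), b) for b in rhs]
```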

I assume your matrices are dense, as they usually are in neural networks, but if you are doing something unusual, scipy.sparse might help too.
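A minimal sketch of how scipy.sparse would be used, assuming roughly the 10% density mentioned in the question's edit:

```python
import numpy as np
import scipy.sparse as sp

# Matrices with ~10% non-zero entries in CSR format, the regime where
# sparse storage started to pay off in the tests above (>= 1500 * 1500).
a = sp.rand(1500, 1500, density=0.10, format="csr")
b = sp.rand(1500, 1500, density=0.10, format="csr")

c = a.dot(b)                       # sparse * sparse product
d = a.dot(np.random.rand(1500))    # sparse * dense vector product
```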

+7

Numpy already uses very fast internal algorithms and representations based on third-party libraries (such as BLAS, as you mentioned) that use SSE optimizations. Since the reference BLAS is a bit slow (it aims to be a reference implementation, focusing on precision rather than performance), you may want to use a different, performance-oriented flavor such as OpenBLAS. To use OpenBLAS, you either need to find a pre-built OpenBLAS-enabled Numpy package or recompile a version linked against OpenBLAS. Once you are using an efficient BLAS implementation, you won't find a better speed-up option in pure Python, unless you write a library in C and spend a lot of time optimizing it.

On the other hand, you can check whether your Numpy and BLAS library were compiled as efficiently as possible for your architecture. For instance, if you can activate the OpenMP library when compiling Numpy, it will let multiple cores work on your problem using data-level parallelism. This can be a significant source of speed-up if you have several cores on your machine and your computations are CPU-bound. If your problem allows it, you could even use a task-based parallel programming library (SCOOP [disclaimer: I wrote it], Celery, etc.) to distribute the work across multiple computers.
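For example, with an OpenMP-enabled BLAS the thread count can usually be controlled through the standard OpenMP environment variable (whether it takes effect depends on how your BLAS was compiled):

```python
import os

# Must be set before numpy (and hence its BLAS) is first imported;
# has no effect if the BLAS was not built with threading support.
os.environ["OMP_NUM_THREADS"] = "12"

import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
c = np.dot(a, b)  # runs multi-threaded if the BLAS is OpenMP-enabled
```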

As a last resort, another possibility is to buy new hardware. This makes the software potentially faster without changing a single line of code.

+4
