Boost uBLAS matrix product extremely slow

Is there a way to improve the performance of a uBLAS product?

I have two matrices A and B that I want to multiply / add / subtract / ...

In MATLAB vs. C++ I get the following times [s] for operations on 2000x2000 matrices:

OPERATION | MATLAB | C++ (MSVC10)
A + B     | 0.04   | 0.04
A - B     | 0.04   | 0.04
AB        | 1.0    | 62.66
A'B'      | 1.0    | 54.35

Why is there such a huge performance loss?

The matrices contain only real doubles. But I also need positive definite, symmetric, and rectangular products.

EDIT: Code is trivial

    matrix<double> A( 2000 , 2000 );
    // Fill Matrix A
    matrix<double> B = A;
    C = A + B;
    D = A - B;
    E = prod(A,B);
    F = prod(trans(A),trans(B));

EDIT 2: The results are averages over 10 runs. The standard deviation was less than 0.005.

I would expect a factor of 2-3, but not 50 (!)

EDIT 3: Everything was benchmarked in Release mode (NDEBUG / MOVE_SEMANTICS / ..).

EDIT 4: Pre-allocated matrices for product results did not affect runtime.

+4
4 answers

Post your C++ code for advice on any possible optimization.

However, you should know that Matlab is highly specialized for this task, and you are unlikely to be able to match it with Boost. On the other hand, Boost is free, while Matlab decidedly is not.

I believe that better Boost performance can be achieved by binding the uBLAS code to an underlying LAPACK implementation.

+4

You should use noalias on the left-hand side of the matrix multiplication to get rid of unnecessary copies.

Instead of E = prod(A,B); use noalias(E) = prod(A,B);

From the documentation:

If it is known that the left-hand expression and the right-hand expression have no common storage, then the assignment has no aliasing. A more efficient assignment can be specified in this case: noalias(C) = prod(A, B); This avoids the creation of a temporary matrix that is required in a normal assignment. 'noalias' assignment requires that the left and right hand sides be size conformant.

+2

There are many efficient BLAS implementations, such as ATLAS, GotoBLAS, or MKL. Use one of them instead.

I have not looked into the code, but I guess that ublas::prod(A, B) uses three loops, with no blocking, and is not cache friendly. If that is true, prod(A, trans(B)) will be much faster than the others.

If CBLAS is available, use cblas_dgemm for the computation. If not, you can simply rearrange the data, that is, use prod(A, trans(B)).
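
To illustrate the guess above, here is a minimal sketch in plain C++ (no uBLAS) of why rearranging the data can help. It assumes row-major storage; the type Mat and the functions multiply_naive, transpose and multiply_transposed are hypothetical names made up for this example and are not how uBLAS implements prod. In the naive version the inner loop strides through memory; in the second version B is transposed once so both operands are read contiguously.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal row-major matrix type, used only for this sketch.
struct Mat {
    std::size_t rows, cols;
    std::vector<double> data;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Textbook triple loop: the inner loop walks down a column of b,
// so consecutive reads are b.cols doubles apart in memory.
Mat multiply_naive(const Mat& a, const Mat& b) {
    assert(a.cols == b.rows);
    Mat c(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < b.cols; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < a.cols; ++k)
                sum += a(i, k) * b(k, j);
            c(i, j) = sum;
        }
    return c;
}

// Returns a new matrix holding the transpose of m.
Mat transpose(const Mat& m) {
    Mat t(m.cols, m.rows);
    for (std::size_t i = 0; i < m.rows; ++i)
        for (std::size_t j = 0; j < m.cols; ++j)
            t(j, i) = m(i, j);
    return t;
}

// Same product, but b is transposed once up front so that the inner
// loop reads a row of a and a row of bt, both contiguous in memory.
Mat multiply_transposed(const Mat& a, const Mat& b) {
    assert(a.cols == b.rows);
    const Mat bt = transpose(b);
    Mat c(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < b.cols; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < a.cols; ++k)
                sum += a(i, k) * bt(j, k);
            c(i, j) = sum;
        }
    return c;
}
```

Both functions compute the same result; only the memory access pattern differs. How much this helps depends on matrix size and cache sizes, so measure before relying on it. A tuned BLAS will usually do better than either version.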

+1

You do not know how much of a role memory management is playing here. prod has to allocate a 32 MB matrix, and so does trans, twice, and then you do all of that 10 times. Take a few stackshots and see what it is actually doing. My naive guess is that if you pre-allocate the matrices, you will get a better result.

Other ways matrix multiplication can be sped up are:

  • pre-transposing the left matrix, to be cache-friendly, and

  • skipping zeros. Only if A(i, k) and B(k, j) are both nonzero is any value added.

Whether this is done in uBLAS is anyone's guess.
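
As a rough illustration of the zero-skipping idea, here is a minimal sketch in plain C++, assuming row-major matrices stored in std::vector<double>. The function name multiply_skip_zeros is hypothetical and this is not what uBLAS does internally. Instead of pre-transposing, the sketch uses an i-k-j loop order, which is a different way to get contiguous access in row-major storage, and it only tests A(i, k) for zero, so a zero there skips the whole inner loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: multiplies a row-major n x m matrix a by a
// row-major m x p matrix b and returns the row-major n x p result.
std::vector<double> multiply_skip_zeros(const std::vector<double>& a,
                                        const std::vector<double>& b,
                                        std::size_t n, std::size_t m,
                                        std::size_t p) {
    assert(a.size() == n * m);
    assert(b.size() == m * p);
    std::vector<double> c(n * p, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < m; ++k) {
            const double aik = a[i * m + k];
            if (aik == 0.0) continue;  // a zero in A contributes nothing to row i
            // Row k of b and row i of c are both contiguous in memory.
            for (std::size_t j = 0; j < p; ++j)
                c[i * p + j] += aik * b[k * p + j];
        }
    }
    return c;
}
```

The check only pays off when A contains many exact zeros; for dense data it is an extra branch per element of A. Note also that skipping changes the result if B contains infinities or NaNs, since 0 * inf would otherwise produce NaN.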

0
