Matlab + CUDA is slow when solving the matrix-vector equation A * x = B

I am solving the equation A * x = B, where A is a matrix, B is a vector, and x is the unknown vector I am solving for.

Hardware Specifications: Intel i7 3630QM (4 cores), nVidia GeForce GT 640M (384 CUDA cores)

Here is an example:

>> A = rand(5000);
>> B = rand(5000,1);
>> Agpu = gpuArray(A);
>> Bgpu = gpuArray(B);
>> tic; A\B; toc;
Elapsed time is 1.382281 seconds.
>> tic; Agpu\Bgpu; toc;
Elapsed time is 4.775395 seconds.

Somehow the GPU is much slower... why? It is also slower for FFT, INV and LU, which should be operations closely related to matrix division.

However, for matrix multiplication (the same data), the GPU is much faster:

>> tic; A*B; toc;
Elapsed time is 0.014700 seconds.
>> tic; Agpu*Bgpu; toc;
Elapsed time is 0.000505 seconds.

The main question is: why is A\B on the GPU (mldivide) so slow compared to the CPU?

UPDATED

Here are some more results, where A, B (on the CPU) and AA, BB (on the GPU) are rand(5000):

>> tic; fft(A); toc;
Elapsed time is *0.117189* seconds.
>> tic; fft(AA); toc;
Elapsed time is 1.062969 seconds.
>> tic; fft(AA); toc;
Elapsed time is 0.542242 seconds.
>> tic; fft(AA); toc;
Elapsed time is *0.229773* seconds.
>> tic; fft(AA); toc;

The times marked with asterisks are the stable ones. Even so, the GPU is almost twice as slow. By the way, why is the GPU even slower on the first two attempts? Is something being compiled first?

Moreover:

>> tic; sin(A); toc;
Elapsed time is *0.121008* seconds.
>> tic; sin(AA); toc;
Elapsed time is 0.020448 seconds.
>> tic; sin(AA); toc;
Elapsed time is 0.157209 seconds.
>> tic; sin(AA); toc;
Elapsed time is *0.000419* seconds.

After two runs, the GPU computes sin incredibly fast.
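For reference, here is how the same measurements could be repeated after a warm-up run. This is only a sketch, assuming a Matlab version that has timeit and gputimeit; as I understand it, those synchronize with the GPU before reporting a time, unlike bare tic/toc:

    A  = rand(5000);
    AA = gpuArray(A);
    fft(AA);                        % warm-up run: the first GPU call pays a one-time setup cost
    wait(gpuDevice);                % wait until all queued GPU work has finished

    tCpu = timeit(@() fft(A));      % timeit/gputimeit run the function several times
    tGpu = gputimeit(@() fft(AA));  % and wait for the GPU, so the numbers are stable
    fprintf('CPU: %f s, GPU: %f s\n', tCpu, tGpu);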

So, why is the GPU so slow for matrix division, fft and similar calculations, even though it is so fast for matrix multiplication and trigonometric functions? It really shouldn't be this way... the GPU should be faster in all of these calculations, because Matlab provides overloaded GPU versions of these functions (mldivide, fft).

Can someone help me solve these problems please? :)

2 answers

Read up on how Matlab computes such solutions; it will help you understand why the GPU is slower.

I will try to explain it in a few words.

A * x = B becomes L * (U * x) = B, where L * U = A; writing y = U * x gives L * y = B.

  • First, Matlab factors A into L * U (as far as I know, this process cannot be fully parallelized; only some of its steps can be, due to its nature).
  • Then Matlab solves L * y = B for y by forward substitution. (This cannot be done in parallel, since each step needs the results of the previous ones.)
  • Then Matlab solves U * x = y for x by back substitution. (Again, this cannot be done in parallel, since each step needs the results of the previous ones.)
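In Matlab terms, a minimal sketch of those three steps using the built-in lu (only an illustration: the real mldivide inspects the matrix and picks a solver, so it is not necessarily doing exactly this):

    [L, U, P] = lu(A);    % factorization: P*A = L*U  (the expensive, hard-to-parallelize part)
    y = L \ (P * B);      % forward substitution: solve L*y = P*B
    x = U \ y;            % back substitution:    solve U*x = y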

So: the GPU's clock is slower than the CPU's, and since these steps cannot run in parallel, the CPU is faster. And no, unless you come up with a better method (good luck!), the GPU will always be slower here, except in some very specific cases.


The first part of the explanation is in the answer from user 2230360, but your question is twofold, so I'll add a little about multiplication.

As already noted, LU factorization is not easily parallelized, even though some of its steps can be. Matrix multiplication, however, is highly parallelizable. If you work with these things you should be able to multiply matrices by hand, and then you will know that the elements of C in A * B = C can be computed independently and in any order, which is exactly what parallel hardware is good at. That is probably why your multiplication is so fast while solving the linear system is slow: you cannot parallelize the one "as much as the other". See the sketch below.
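Just to illustrate that independence (a deliberately naive sketch; the real BLAS/GPU kernels are far more sophisticated): each C(i,j) below depends only on row i of A and column j of B, so in principle all of them could be computed at the same time.

    C = zeros(size(A,1), size(B,2));
    for i = 1:size(A,1)
        for j = 1:size(B,2)
            % Each element uses only row i of A and column j of B,
            % so every (i,j) pair could be handled by a different worker.
            C(i,j) = A(i,:) * B(:,j);
        end
    end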

