How good is the OpenCV GPU library for matrix operations?

I use OpenCV for an application in computer vision. I would like to speed up some operations with matrices (matrices are quite large) on the GPU and, if possible, avoid coding directly in CUDA C. OpenCV 2.4.1 has several accelerated GPU functions. How well do they work in your experience? Is it better for me to use another library (like Thrust)?
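For reference, the general pattern with the OpenCV GPU module is to upload data into a cv::gpu::GpuMat, call the gpu:: counterpart of the function, and download the result. Below is a minimal sketch of a matrix multiply along those lines (assuming an OpenCV 2.4 build with CUDA enabled; gpu::gemm additionally requires CUBLAS, so treat that call as an assumption about the build):

    #include <opencv2/core/core.hpp>
    #include <opencv2/gpu/gpu.hpp>

    int main()
    {
        // Host matrices (CV_32F; double support depends on the GPU's compute capability)
        cv::Mat A = cv::Mat::ones(1024, 1024, CV_32F);
        cv::Mat B = cv::Mat::ones(1024, 1024, CV_32F);

        // Upload to device memory
        cv::gpu::GpuMat dA, dB;
        dA.upload(A);
        dB.upload(B);

        // Zero matrix for the "beta * src3" term of gemm, and the output
        cv::gpu::GpuMat dZero(1024, 1024, CV_32F);
        dZero.setTo(cv::Scalar::all(0));
        cv::gpu::GpuMat dC;

        // dC = 1.0 * dA * dB + 0.0 * dZero
        cv::gpu::gemm(dA, dB, 1.0, dZero, 0.0, dC);

        // Download the result back to the host
        cv::Mat C;
        dC.download(C);
        return 0;
    }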

EDIT Application example: compute the squared Euclidean distance matrix on the GPU. Currently, my accelerated (and vectorized) GPU implementation in Matlab using the Parallel Computing Toolbox (PCT) is about 5-10 times faster than my C++ implementation with OpenCV.

Matlab implementation:

    function K = sqEuclideanDist(P_cpu, Q_cpu)
    % Vectorized method to compute pairwise squared Euclidean distance on GPU
    % Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
    P_gpu = gpuArray(P_cpu);
    Q_gpu = gpuArray(Q_cpu);
    [nP, d] = size(P_gpu);
    [nQ, d] = size(Q_gpu);
    pmag = sum(P_gpu .* P_gpu, 2);
    qmag = sum(Q_gpu .* Q_gpu, 2);
    % note that K is on the GPU
    K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P_gpu*Q_gpu';
    end
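For reference, an equivalent CPU-only version with plain cv::Mat (a minimal sketch, not my actual OpenCV implementation; it assumes CV_64F inputs of size nP x d and nQ x d) would be:

    #include <opencv2/opencv.hpp>

    // K(i,j) = squared Euclidean distance between row i of P and row j of Q
    static cv::Mat sqEuclideanDist(const cv::Mat& P, const cv::Mat& Q)
    {
        cv::Mat Psq = P.mul(P), Qsq = Q.mul(Q);      // element-wise squares
        cv::Mat pmag, qmag;
        cv::reduce(Psq, pmag, 1, CV_REDUCE_SUM);     // row sums -> nP x 1
        cv::reduce(Qsq, qmag, 1, CV_REDUCE_SUM);     // row sums -> nQ x 1
        cv::Mat qmagT = qmag.t();                    // 1 x nQ

        // K = pmag*ones(1,nQ) + ones(nP,1)*qmag' - 2*P*Q'
        cv::Mat K = cv::repeat(pmag, 1, Q.rows) + cv::repeat(qmagT, P.rows, 1) - 2.0 * P * Q.t();
        return K;
    }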

UPDATE Below is another Matlab implementation that accomplishes the same thing (thanks to https://stackoverflow.com/a/167478/ ). But it runs only on the CPU, because bsxfun is not supported by the PCT. Still, I am looking for a C++ alternative.

    function K = sqEuclideanDist(P_cpu, Q_cpu)
    % Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
    % Runs on the CPU only.
    K = bsxfun(@plus, sum(P_cpu.^2,2), sum(Q_cpu.^2,2)') - 2*(P_cpu*Q_cpu');
    end
+8
c++ opencv gpu cuda thrust
2 answers

I find ArrayFire much faster and have started using it instead of the GPU kernels in OpenCV for image processing. There are published benchmarks comparing ArrayFire (it used to be in a different interface called LibJacket) to OpenCV, and it has been true in my own benchmarking too that ArrayFire is 2-4X faster than the GPU functions in OpenCV. From what I hear, NVIDIA did not write the GPU kernels in OpenCV but contracted them out to someone, which is probably why they are so slow. Since I am only using one GPU, I can use ArrayFire for free.

Update, given the new MATLAB code posted by @Alex: I ran the benchmark of this code on my system. I find that the Parallel Computing Toolbox gpuArray is slower than the CPU, but Jacket and ArrayFire are faster. Hardware specifications:

    Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
    NVIDIA Tesla M2090

CPU vs. GPU results using the Parallel Computing Toolbox gpuArray (fully warmed up). The CPU is faster than PCT gpuArray:

    >> tic; sqEuclideanDist(gpuArray(rand(1581,3)),gpuArray(rand(189,3))); toc;
    Elapsed time is 0.006859 seconds.
    >> tic; sqEuclideanDist(rand(1581,3),rand(189,3)); toc;
    Elapsed time is 0.005712 seconds.

CPU vs. GPU results using Jacket (fully warmed up). Jacket beats PCT gpuArray by 3.7X and beats the CPU by 3X:

    >> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
    Elapsed time is 0.001876 seconds.

Here is a modified code that makes it easy to run all of this:

    function K = sqEuclideanDist(P, Q)
    % Vectorized method to compute pairwise squared Euclidean distance on GPU
    % Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
    [nP, d] = size(P);
    [nQ, d] = size(Q);
    pmag = sum(P .* P, 2);
    qmag = sum(Q .* Q, 2);
    K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q';
    end

Jacket supports BSXFUN on the GPU, and it improves the speed somewhat:

    >> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
    Elapsed time is 0.001420 seconds.

Note that the sizes used here are quite small, so most CUDA code that tries to run on such small sizes will probably perform poorly. That is why I like to use AccelerEyes' stuff: those guys have optimized the heck out of the GPU, unlike PCT gpuArray, Thrust, and OpenCV, each of which I have tried in the past.

Here are the ArrayFire (free version) C++ results:

    Time: 0.0003577 seconds
    Speedups: 19.2X faster than PCT gpuArray
              16X faster than the CPU
              5.2X faster than Jacket in MATLAB original version
              4X faster than Jacket in MATLAB using BSXFUN

Here is the ArrayFire code I wrote for this:

    static array SqEuclideanDist(array P, array Q)
    {
        // 0 based indexing
        array pmag = sum(P * P, 1);
        array qmag = sum(Q * Q, 1);

        int np = P.dims(0);
        int nq = Q.dims(0);

        array K = tile(qmag.T(), np, 1) + tile(pmag, 1, nq) - 2 * matmul(P, Q.T());
        return K;
    }

    int main(int argc, char **argv)
    {
        double *P_cpu = new double[1581 * 3];
        double *Q_cpu = new double[189 * 3];
        // (fill P_cpu and Q_cpu with your data here)

        array P = array(1581, 3, P_cpu);
        array Q = array(189,  3, Q_cpu);
        af::sync();

        int iter = 1000;
        timer::tic();
        for (int i = 0; i < iter; i++) {
            array K = SqEuclideanDist(P, Q);
            af::eval(K);
        }
        af::sync();
        printf("Time taken: %2.4lfms\n", (1000 * timer::toc()) / iter);

        delete[] P_cpu;
        delete[] Q_cpu;
    }
+3

They were contributed by NVidia, so they perform well on CUDA-compatible cards. The actual performance depends on the card itself and the function you use.

In my experience, only cvRotate and cvResize had better performance than a regular Intel CPU. (Note: I was only interested in the image-related functions.)
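If you want to check this on your own card, a minimal timing sketch for the resize case could look like the following (assuming an OpenCV 2.4 build with the gpu module; the image size and scale factors are just placeholders, and the first GPU call is done once as a warm-up so initialization is not counted):

    #include <cstdio>
    #include <opencv2/opencv.hpp>
    #include <opencv2/gpu/gpu.hpp>

    int main()
    {
        cv::Mat src(2048, 2048, CV_8UC1, cv::Scalar(128));
        cv::Mat dstCpu;
        cv::gpu::GpuMat dSrc(src), dDst;

        // Warm-up call so GPU/context initialization is not counted in the timing
        cv::gpu::resize(dSrc, dDst, cv::Size(), 0.5, 0.5);

        int64 t0 = cv::getTickCount();
        cv::resize(src, dstCpu, cv::Size(), 0.5, 0.5);
        double cpuMs = (cv::getTickCount() - t0) * 1000.0 / cv::getTickFrequency();

        t0 = cv::getTickCount();
        cv::gpu::resize(dSrc, dDst, cv::Size(), 0.5, 0.5);
        double gpuMs = (cv::getTickCount() - t0) * 1000.0 / cv::getTickFrequency();

        printf("CPU resize: %.3f ms, GPU resize: %.3f ms\n", cpuMs, gpuMs);
        return 0;
    }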

+1
