I find ArrayFire to be much faster and have started using it instead of the GPU kernels in OpenCV for image processing. I found some benchmarks comparing ArrayFire (which used to be available through a different interface called LibJacket) with OpenCV, and it has been true in my own benchmarking too that ArrayFire is 2-4X faster than the GPU functions in OpenCV. From what I hear, NVIDIA did not write the GPU kernels in OpenCV but contracted them out to someone, which may be why they are so slow. Since I am only using 1 GPU, I can use ArrayFire for free.
Update, given the new MATLAB code posted by @Alex: I ran the benchmark of this code on my system. I find that the Parallel Computing Toolbox gpuArray is slower than the CPU, but Jacket and ArrayFire are both faster. Hardware specs:
    Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
    NVIDIA Tesla M2090
CPU vs. GPU results using the Parallel Computing Toolbox gpuArray (fully warmed up). The CPU is faster than PCT gpuArray:

    >> tic; sqEuclideanDist(gpuArray(rand(1581,3)),gpuArray(rand(189,3))); toc;
    Elapsed time is 0.006859 seconds.
    >> tic; sqEuclideanDist(rand(1581,3),rand(189,3)); toc;
    Elapsed time is 0.005712 seconds.
CPU vs. GPU results using Jacket (fully warmed up). Jacket outperforms PCT gpuArray by 3.7X and outperforms the CPU by 3X:

    >> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
    Elapsed time is 0.001876 seconds.
Here is the modified code that makes all of this easy to run:

    function K = sqEuclideanDist(P,Q)
    % Vectorized method to compute pairwise squared Euclidean distance on GPU
    % Returns K(i,j) = (P(i,:) - Q(j,:))*(P(i,:) - Q(j,:))'
    [nP, d] = size(P);
    [nQ, d] = size(Q);
    pmag = sum(P .* P, 2);   % squared norm of each row of P (nP-by-1)
    qmag = sum(Q .* Q, 2);   % squared norm of each row of Q (nQ-by-1)
    K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q';
    end
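For reference, the vectorized expression follows from expanding the squared norm. The notation below is mine, not the original poster's: p_i and q_j denote the i-th row of P and the j-th row of Q (as row vectors), and 1_n is a column vector of n ones.

```latex
\begin{align*}
K_{ij} &= \lVert p_i - q_j \rVert^2
        = (p_i - q_j)(p_i - q_j)^{\top} \\
       &= p_i p_i^{\top} + q_j q_j^{\top} - 2\, p_i q_j^{\top}, \\
K      &= \mathrm{pmag}\,\mathbf{1}_{n_Q}^{\top}
        + \mathbf{1}_{n_P}\,\mathrm{qmag}^{\top}
        - 2\, P Q^{\top}.
\end{align*}
```

The two outer products with the ones vectors only replicate pmag across columns and qmag across rows, so the whole computation reduces to two row-wise reductions and one matrix multiply.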
Jacket supports BSXFUN on the GPU, and using it improves the speed a bit:

    >> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
    Elapsed time is 0.001420 seconds.
Please note that the sizes used here are quite small, and most CUDA code tends to perform poorly at such small sizes. That's why I like to use AccelerEyes' products: those guys have heavily optimized for the GPU, unlike PCT gpuArray, Thrust, and OpenCV, each of which I have tried in the past.
Here are the results with ArrayFire Free C++:

    Time: 0.0003577 seconds
    Speedups:
      19.2X faster than PCT gpuArray
      16X faster than the CPU
      5.2X faster than Jacket in MATLAB (original version)
      4X faster than Jacket in MATLAB using BSXFUN
Here is the ArrayFire code I wrote for this:
    #include <stdio.h>
    #include <arrayfire.h>
    using namespace af;

    static array SqEuclideanDist(array P, array Q)
    {
        // 0 based indexing: dimension 1 runs across the columns
        array pmag = sum(P * P, 1);   // element-wise square, summed per row
        array qmag = sum(Q * Q, 1);
        int np = P.dims(0);
        int nq = Q.dims(0);
        // The two magnitude terms are added, as in the MATLAB version
        array K = tile(qmag.T(), np, 1) + tile(pmag, 1, nq) - 2 * matmul(P, Q.T());
        return K;
    }

    int main(int argc, char **argv)
    {
        // Host buffers are not filled in; only the timing matters here
        double *P_cpu = new double[1581 * 3];
        double *Q_cpu = new double[189 * 3];

        array P = array(1581, 3, P_cpu);
        array Q = array(189 , 3, Q_cpu);
        af::sync();

        int iter = 1000;
        timer::tic();
        for (int i = 0; i < iter; i++) {
            array K = SqEuclideanDist(P, Q);
            af::eval(K);   // force evaluation so the work is actually timed
        }
        af::sync();

        printf("Time taken: %2.4lfms\n", (1000 * timer::toc()) / iter);

        delete[] P_cpu;
        delete[] Q_cpu;
    }