CUDA Thrust slow when working with large vectors on my machine

I have just started with CUDA and am reading some tutorials. I wrote a simple (and terribly organized) piece of code to try to figure out the speedup from Thrust (is this idea even correct?). I try to add two vectors (each with 10000000 ints) into a third vector, once with plain arrays on the CPU and once with device_vectors on the GPU.

Here is the code:

#include <iostream>
#include <ctime>      // clock(), CLOCKS_PER_SEC
#include <cstdio>     // getchar()
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#define N 10000000

int main(void)
{
    float time_cpu;
    float time_gpu;
    int *a = new int[N];
    int *b = new int[N];
    int *c = new int[N];
    for(int i=0;i<N;i++)
    {
        a[i]=i;
        b[i]=i*i;
    }

    // CPU version, timed with clock()
    clock_t start_cpu,stop_cpu;
    start_cpu=clock();
    for(int i=0;i<N;i++)
    {
        c[i]=a[i]+b[i];
    }
    stop_cpu=clock();
    time_cpu=(double)(stop_cpu-start_cpu)/CLOCKS_PER_SEC*1000;
    std::cout<<"Time to generate (CPU):"<<time_cpu<<std::endl;

    // GPU (Thrust) version, timed with cudaEvents
    thrust::device_vector<int> X(N);
    thrust::device_vector<int> Y(N);
    thrust::device_vector<int> Z(N);
    for(int i=0;i<N;i++)
    {
        X[i]=i;
        Y[i]=i*i;
    }
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);
    thrust::transform(X.begin(), X.end(), Y.begin(), Z.begin(), thrust::plus<int>());
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime,start,stop);
    std::cout<<"Time to generate (thrust):"<<elapsedTime<<std::endl;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    getchar();
    return 0;
}

The CPU result looks really fast, but the GPU runs REALLY slowly on my machine (i5-2320, 4 GB, GTX 560 Ti): the CPU time is about 26, the GPU time is about 30! Did I just make some stupid mistake in my code, or is there a deeper reason?

As a newbie in C++, I checked my code over and over and still got the slower time on the GPU with Thrust, so I did some experiments to show the difference in computing vectorAdd with five different approaches. I use the Windows API QueryPerformanceFrequency() / QueryPerformanceCounter() as the single, uniform time measurement method.

Each experiment is timed like this (shown here for the CPU array case):

LARGE_INTEGER large_interger;
double f, c1, c2;
QueryPerformanceFrequency(&large_interger);   // counter frequency (ticks per second)
f = large_interger.QuadPart;
QueryPerformanceCounter(&large_interger);
c1 = large_interger.QuadPart;
for(int j=0;j<10;j++)
{
    for(int i=0;i<N;i++)   // CPU array adding
    {
        c[i]=a[i]+b[i];
    }
}
QueryPerformanceCounter(&large_interger);
c2 = large_interger.QuadPart;
printf("Time to generate (CPU array adding) %lf ms\n", (c2 - c1) * 1000 / f);

and here is my simple __global__ kernel function for the GPU array adding:

__global__ void add(int *a, int *b, int *c)
{
    // grid-stride loop: each thread handles several elements if the grid is smaller than N
    int tid=threadIdx.x+blockIdx.x*blockDim.x;
    while(tid<N)
    {
        c[tid]=a[tid]+b[tid];
        tid+=blockDim.x*gridDim.x;
    }
}

and the function is called as:

for(int j=0;j<10;j++)
{
    add<<<(N+127)/128,128>>>(dev_a,dev_b,dev_c);   // GPU array adding
}
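The allocation and copy code for dev_a, dev_b and dev_c is not shown above; it is just the usual cudaMalloc/cudaMemcpy pattern, roughly like this (a sketch only, error checking omitted):

int *dev_a, *dev_b, *dev_c;
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_c, N * sizeof(int));
cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
// ... the timed kernel launches shown above go here ...
cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);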

I add vector a[N] and b[N] into vector c[N], looping 10 times, with each of the following approaches:

  • array adding on the CPU
  • std::vector adding on the CPU
  • thrust::host_vector adding on the CPU
  • thrust::device_vector adding on the GPU
  • raw array adding on the GPU

with N = 10000000

and I get the results:

  • CPU array adding: 268.992968 ms
  • CPU std::vector adding: 1908.013595 ms
  • CPU thrust::host_vector adding: 10776.456803 ms
  • GPU thrust::device_vector adding: 297.156610 ms
  • GPU array adding: 5.210573 ms

This confuses me. I am not familiar with the implementation of these template libraries. Does performance really differ this much between containers and raw data structures?

3 answers

Most of the execution time is being spent in your loop that initializes X[i] and Y[i]. While this is legal, it is a very slow way to initialize large device vectors. It would be better to create host vectors, initialize them, and then copy those to the device. As a test, modify your code like this (right after the loop where you initialize the device vectors X[i] and Y[i]):

}                                            // this is your line of code
std::cout<< "Starting GPU run" <<std::endl;  // add this line
cudaEvent_t start, stop;                     // this is your line of code

You will then see that the GPU timing result appears almost immediately after that added line is printed. So all of the time you are waiting through is being spent initializing those device vectors directly from host code.

When I run this on my laptop I get a CPU time of about 40 and a GPU time of about 5, so the GPU is running about 8 times faster than the CPU for the sections of code you are actually timing.

If you create X and Y as host vectors, and then create corresponding device vectors d_X and d_Y, the overall execution time will be shorter as well, like so:

thrust::host_vector<int> X(N);
thrust::host_vector<int> Y(N);
thrust::device_vector<int> Z(N);
for(int i=0;i<N;i++)
{
    X[i]=i;
    Y[i]=i*i;
}
thrust::device_vector<int> d_X = X;
thrust::device_vector<int> d_Y = Y;

and change the transform call to:

 thrust::transform(d_X.begin(), d_X.end(), d_Y.begin(), Z.begin(), thrust::plus<int>()); 

OK, so now you have shown that the CPU measurement is larger than the GPU measurement. Sorry, I jumped to conclusions. My laptop is an HP with a 2.6 GHz core i7 and a Quadro 1000M GPU, running CentOS 6.2 Linux. A few comments: if you are running any heavy display tasks on the GPU, that can degrade performance. Also, when benchmarking it is common practice to use the same mechanism for both measurements; you can use cudaEvents for both if you like, it can time CPU code the same way it times GPU code. It is also common practice with CUDA to do an untimed warm-up run and then repeat the test for the measurement, and likewise to run the test 10 times or more in a loop and divide to get an average.

In my case, I can tell that the clock() measurement is pretty coarse, because successive runs give me 30, 40 or 50, while on the GPU measurement I get something like 5.18256. Some of this may help, but I cannot say exactly why your results and mine differ so much (on the GPU side).
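To make that concrete, here is a rough sketch (not the exact code I ran) of timing both sides with cudaEvents, with an untimed warm-up and an averaged loop, using the variable names from the code above:

// warm-up: run the transform once without timing it, so first-use overhead is excluded
thrust::transform(d_X.begin(), d_X.end(), d_Y.begin(), Z.begin(), thrust::plus<int>());

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// time the CPU loop with the same mechanism (the stream is idle, so the events
// complete roughly when they are recorded)
cudaEventRecord(start,0);
for(int i=0;i<N;i++) c[i]=a[i]+b[i];
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
float cpu_ms;
cudaEventElapsedTime(&cpu_ms, start, stop);

// time 10 GPU runs and divide to get an average per run
cudaEventRecord(start,0);
for(int r=0;r<10;r++)
    thrust::transform(d_X.begin(), d_X.end(), d_Y.begin(), Z.begin(), thrust::plus<int>());
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
float gpu_ms;
cudaEventElapsedTime(&gpu_ms, start, stop);
gpu_ms /= 10.0f;   // cudaEventElapsedTime reports milliseconds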

OK, I did another experiment. The compiler makes a big difference on the CPU side. I compiled with the -O3 switch and the CPU time dropped to 0. Then I converted the CPU time measurement from the clock() method to cudaEvents, and I got a measured CPU time of 12.4 (with -O3 optimization) and still about 5.1 on the GPU side.

Your mileage will vary depending on the timing method and which compiler you are using on the CPU side.


First, Y[i] = i * i; does not fit into an int for 10M elements: a 32-bit int holds values only up to roughly 2e9, while your code needs values up to about 1e14.
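For example (a quick sketch with one concrete index, just to illustrate the overflow):

int i = 9999999;                  // the largest index when N = 10000000
int bad = i * i;                  // needs ~1e14, but a 32-bit int overflows above ~2.1e9
long long ok = (long long)i * i;  // 99999980000001 fits in 64 bits
float fv = (float)i * (float)i;   // or switch to floating point, as in the code below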

Second, it looks like the time for your transform is correct and it should be faster than the CPU, regardless of which library you use. Robert's suggestion to initialize the vectors on the CPU and then transfer them to the GPU is the right one for this case.

Third, since the integer version cannot run (the values overflow), here is some simpler CUDA library code (using ArrayFire, which I work on) that does the same thing with floats, for your benchmarking:

int n = 10e6;
array x = array(seq(n));
array y = x * x;
timer t = timer::tic();
array z = x + y;
af::eval(z);
af::sync();
printf("elapsed seconds: %g\n", timer::toc(t));
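If you also want a float version in Thrust to compare against, a rough equivalent (my own sketch, not part of ArrayFire) would look something like this; it additionally needs <thrust/sequence.h>, <thrust/transform.h> and <thrust/functional.h>:

int n = 10e6;
thrust::device_vector<float> x(n), y(n), z(n);
thrust::sequence(x.begin(), x.end());                         // x = 0, 1, 2, ...
thrust::transform(x.begin(), x.end(), x.begin(),
                  y.begin(), thrust::multiplies<float>());    // y = x * x
thrust::transform(x.begin(), x.end(), y.begin(),
                  z.begin(), thrust::plus<float>());          // z = x + y
cudaDeviceSynchronize();   // wait for the GPU before stopping any timer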

Good luck


I recently ran a similar test using CUDA Thrust on my Quadro 1000M. I used thrust::sort_by_key as a benchmark to test its performance, and the result was too good to convince my boss: it takes 100+ ms to sort 512 MB of pairs.

For your problem, I am confused by two things.

(1) Why do you multiply time_cpu by 1000? Without the 1000 it is already in seconds.

 time_cpu=(double)(stop_cpu-start_cpu)/CLOCKS_PER_SEC*1000; 

(2) And, when you mention 26, 30 and 40, do you mean seconds or ms? cudaEvent reports the elapsed time in 'ms', not 's'.
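For reference, this is where the units in the two measurements come from (just restating the lines from your code with the units spelled out):

// clock() ticks divided by CLOCKS_PER_SEC give seconds; the * 1000 converts that to ms
time_cpu = (double)(stop_cpu - start_cpu) / CLOCKS_PER_SEC * 1000;
// cudaEventElapsedTime already reports milliseconds
cudaEventElapsedTime(&elapsedTime, start, stop);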


Source: https://habr.com/ru/post/926413/

