Most of the execution time is spent in your loop that initializes X[i] and Y[i]. Although this is legal, it is a very slow way to initialize large device vectors: each per-element assignment from host code triggers a separate host-to-device transfer. It would be better to create host vectors, initialize them, and then copy them to the device in one shot. As a test, change your code as follows (immediately after the loop in which you initialize the device vectors X[i] and Y[i]):
    } // this is your line of code
    std::cout << "Starting GPU run" << std::endl; // add this line
    cudaEvent_t start, stop; // this is your line of code
You will then see that the GPU timing results appear almost immediately after the added line prints. So all of the time you are waiting through is spent initializing those device vectors directly from host code.
When I run this on my laptop, I get a CPU time of about 40 and a GPU time of about 5, so the GPU is about 8 times faster than the CPU for the sections of code you are actually timing.
If you create X and Y as host vectors, and then create corresponding device vectors d_X and d_Y, the overall execution time will be shorter, for example:
    thrust::host_vector<int> X(N);
    thrust::host_vector<int> Y(N);
    thrust::device_vector<int> Z(N);
    for (int i = 0; i < N; i++) {
        X[i] = i;
        Y[i] = i * i;
    }
    thrust::device_vector<int> d_X = X; // one bulk host-to-device copy
    thrust::device_vector<int> d_Y = Y;
and change the transform call to:
    thrust::transform(d_X.begin(), d_X.end(), d_Y.begin(), Z.begin(), thrust::plus<int>());
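For reference, the same element-wise operation can be expressed on the host with std::transform; this is a plain C++ sketch (no CUDA required) of what thrust::plus<int> computes per element. The helper name add_vectors is mine, not part of your code:

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Host-side analogue of the thrust::transform call above:
// computes Z[i] = X[i] + Y[i] for every element.
std::vector<int> add_vectors(const std::vector<int>& X,
                             const std::vector<int>& Y) {
    std::vector<int> Z(X.size());
    std::transform(X.begin(), X.end(), Y.begin(), Z.begin(),
                   std::plus<int>());
    return Z;
}
```

With X[i] = i and Y[i] = i*i as in the loop above, this yields Z[i] = i + i*i, which is what the device version should leave in Z.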
OK, so now you've indicated that your CPU timing is higher than your GPU timing. Sorry, I jumped to conclusions. My laptop is an HP with a 2.6GHz core i7 and a Quadro 1000M GPU, running CentOS 6.2 Linux. A few comments: if you are running any heavy display tasks on the GPU, that can degrade performance. Also, when benchmarking things like this, it's common practice to use the same timing mechanism for both sides; you can use cudaEvents for both if you want, since they can time CPU code the same way they time GPU code. It's also common practice with thrust to do an untimed warm-up run first and then repeat the test for the actual measurement, and likewise common to run the test 10 or more times in a loop and then divide to get an average. In my case, I can tell the clock() measurement is pretty coarse, because successive runs give me 30, 40, or 50. On the GPU measurement I get something like 5.18256. Some of these things may help, but I can't say exactly why your results and mine differ so much (on the GPU side).
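The warm-up-then-average pattern mentioned above can be sketched in plain C++ with std::chrono (rather than clock() or cudaEvents); the helper name average_ms is my invention for illustration:

```cpp
#include <chrono>
#include <numeric>
#include <vector>

// Run `work` once untimed (warm-up), then time `reps` repetitions
// and return the mean wall-clock duration in milliseconds.
template <typename F>
double average_ms(F work, int reps = 10) {
    work();  // warm-up run, excluded from the measurement
    std::vector<double> samples;
    for (int r = 0; r < reps; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return std::accumulate(samples.begin(), samples.end(), 0.0) / reps;
}
```

Averaging over 10 or more repetitions smooths out the coarse granularity that makes successive clock() readings jump between values like 30, 40, and 50.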
OK, I did another experiment. The compiler makes a big difference on the CPU side. I compiled with the -O3 switch, and the CPU time dropped to 0. Then I converted the CPU timing from the clock() method to cudaEvents, and I got a measured CPU time of 12.4 (with -O3 optimization) and still around 5.1 on the GPU side.
Your mileage will vary depending on the timing method and which compiler you are using on the CPU side.